1: \documentstyle[11pt]{article}
2: \setlength{\oddsidemargin}{-.15in}
3: \setlength{\evensidemargin}{0pt}
4: \setlength{\headsep}{0pt}
5: \setlength{\topmargin}{-30pt}
6: \setlength{\textheight}{8.5in}
7: \setlength{\textwidth}{6.0in}
8:
9: \renewcommand{\baselinestretch}{1.2}
10:
11: \def\mod {\rm \ mod\ }
12: \def\oh {\cal O}
13: \def\End{\rm End}
14:
15: \newcommand{\qed}{\ \ \ \rule{7pt}{8pt}\medskip}
16: \newcommand{\partialqed}{\ \ \ \raisebox{2pt}{\framebox[7pt]{\ }}\medskip}
17: \newtheorem{defin}{Definition}
18: \newcommand{\bracket}[1]{\langle #1 \rangle}
19: \newcommand{\barr}{\overline}
20: \newcommand{\floor}[1]{\lfloor #1 \rfloor}
21: %General Forms
22: \newcommand{\proof}{{\bf Proof. \enspace}}
23: %\newcommand{\proof}{{\sc Proof \enspace}}
24: \newcommand{\comment}[1]{}
25: \newcommand{\hs}{\enspace}
26: \newcommand{\hhs}{\thinspace}
27:
28: \newcommand{\diverges}{\uparrow}
29: \newcommand{\converges}{\downarrow}
30:
31: \newcommand{\inter}{\bigcap}
32: \newcommand{\real}{\ifmmode {\rm R} \else ${\rm R}$ \fi}
33: \newcommand{\nat}{\ifmmode {\rm N} \else ${\rm N}$ \fi}
34: \newcommand{\tot}{\ifmmode {\cal T} \else ${\cal T}$ \fi}
35: \newcommand{\sigstar}{\ifmmode \Sigma^{\ast} \else $\Sigma^{\ast}$ \fi}
36:
37: \newtheorem{theorem}{Theorem}
38: \newtheorem{lemma}[theorem]{Lemma}
39: \newtheorem{corollary}[theorem]{Corollary}
40: \newtheorem{definition}{Definition}
41: \newtheorem{claim}[theorem]{Claim}
42: \newtheorem{conjecture}[theorem]{Conjecture}
43: \newlength{\thislabel}
44: \newcommand{\labsize}[1]{\settowidth{\thislabel}{#1}}
45: \def\lablimer2stlabel#1{\rm #1\hfil}
46: \def\pp{\par\noindent}
47:
48: \title{On The Closest String and Substring Problems
49: %\title{Approximating The Hamming Center
50: \footnote{Some of the results
51: in this paper have been
52: presented in {\em Proc.\ 31st ACM Symp.\ Theory of Computing}, May,
53: 1999 \cite{LMW99},
54: and in {\em Proc.\ 11th Symp. Combinatorial Pattern Matching}, June,
55: 2000, \cite{M00}.}}
56: \author{
57: Ming Li \\
58: Department of Computer Science\\
59: University of Waterloo\\
60: Waterloo, Ont. N2L 3G1, Canada\\
61: E-mail: mli@math.uwaterloo.ca
62: \and
63: Bin Ma\\
64: Department of Computer Science \\
65: University of Waterloo\\
66: Waterloo, Ont. N2L 3G1, Canada\\
67: E-mail: b3ma@wh.math.uwaterloo.ca
68: \and
69: Lusheng Wang\\
70: Department of Computer Science \\
71: City University of Hong Kong \\
72: Kowloon, Hong Kong \\
73: E-mail: lwang@cs.cityu.edu.hk
74: }
75: \date{}
76:
77: \begin{document}
78: \maketitle
79:
80: \begin{abstract}
The problem of finding a center string that is `close' to every
given string arises in computational molecular
biology and coding theory, where it has many applications.
84:
85: This problem has two versions: the Closest String
86: problem and the Closest Substring problem.
Assume that we are given a set
${\cal S}=\{s_1, s_2, \ldots, s_n\}$ of strings, say, each of length $m$.
89: The Closest String problem~\cite{BLPR97,BGHMS97,FL97,GJL99,LL+99}
90: asks for the smallest $d$ and a string $s$ of length $m$ which is within
91: Hamming distance $d$ to each $s_i\in {\cal S}$.
92: This problem comes from coding theory when we are looking for a code
93: not too far away from a given set of codes \cite{FL97}.
94: The problem is NP-hard~\cite{FL97,LL+99}. Berman {\em et al}
95: \cite{BGHMS97} give a polynomial time
96: algorithm for constant $d$. For super-logarithmic $d$,
97: Ben-Dor {\em et al} \cite{BLPR97} give an efficient approximation algorithm
using a linear programming relaxation technique.
99: The best polynomial time approximation has ratio $\frac{4}{3}$
100: for all $d$, given by \cite{LL+99} and \cite{GJL99}.
The Closest Substring problem looks for a string $t$ which is
within Hamming distance $d$ of a substring of each $s_i$.
Previously, only a $2- \frac{2}{2|\Sigma|+1}$ approximation
algorithm was known for this problem \cite{LL+99}.
The problem is much more elusive than the Closest String problem, but
106: it has many applications in finding
107: conserved regions, genetic drug target identification, and genetic probes
108: in molecular biology~\cite{HS94,LR90,LB+91,PBPR89,PH96,
S90,SH91,W86,WAG84,WP84,LL+99}. Whether efficient
approximation algorithms exist for these two problems has been
a major open question in this area.
112:
We present two polynomial time approximation algorithms with
114: approximation ratio $1+ \epsilon$ for any small $\epsilon$
115: to settle both questions.
116:
117: \end{abstract}
118:
119: \section{Introduction}
120: \label{sec-intro}
121: Many problems in molecular biology involve finding similar
122: regions common to each sequence in a given set
123: of DNA, RNA, or protein sequences. These problems find applications in
124: locating binding sites and finding conserved
125: regions in unaligned sequences~\cite{SH91,LR90,HS94,S90},
126: genetic drug target identification~\cite{LL+99}, designing genetic
127: probes \cite{LL+99}, universal PCR primer design~\cite{LB+91,DR+93,PH96,LL+99},
128: and, outside computational biology, in coding theory~\cite{FL97,GJL99}.
129: Such problems may be considered to be various generalizations of the common
130: substring problem, allowing errors.
Many objective functions have been proposed
for finding such regions common to every given string.
A popular and most fundamental measure is the Hamming distance. Other
measures, like the relative entropy measure used
by Stormo and his coauthors \cite{HS94}, may be
considered as generalizations of Hamming distance; they require
different techniques and are considered in \cite{LMW99-j}.
138:
139: Let $s$ and $s'$ be finite strings.
140: Let $d(s,s')$ denote the Hamming distance between $s$ and $s'$.
141: $|s|$ is the length of $s$. $s[i]$ is the $i$-th character
142: of $s$. Thus, $s=s[1]s[2] \ldots s[|s|]$.
143: The following are the problems we study in this paper:
144:
145: \vspace{1ex}
146: \noindent
147: {\sc Closest String:} Given a set ${\cal S}=\{s_1,s_2,\ldots,s_n\}$
148: of strings each of length $m$, find a
149: center string $s$ of length $m$ minimizing $d$
150: such that for every string $s_{i}\in {\cal S}$, $d(s, s_{i})\leq d$.
151:
152: \vspace{1ex}
153: \noindent
154: {\sc Closest Substring:} Given a set ${\cal S}=\{s_1,s_2,\ldots,s_n\}$
155: of strings, and an integer $L$, find a center string $s$ of length $L$
156: minimizing $d$ such that for each $s_i\in {\cal S}$ there is
157: a length $L$ substring $t_i$ of $s_i$ with $d(s, t_{i})\leq d$.
158: \vspace{1ex}
159: %wangl -- change the definition of the problem a little bit
160:
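As a small illustration (the toy instances below are ours and serve only
to fix ideas), let $\Sigma=\{a,b\}$. For {\sc Closest String}, take
${\cal S}=\{aab,\, aba,\, baa\}$. The string $s=aaa$ satisfies
$d(s,s_i)=1$ for every $i$, and since the three strings are pairwise
distinct no center achieves $d=0$, so $s$ is an optimal center with $d=1$.
For {\sc Closest Substring}, take ${\cal S}=\{aaab,\, baba,\, abbb\}$ and $L=3$.
The length $3$ substrings are $aaa, aab$ for $s_1$, $bab, aba$ for $s_2$, and
$abb, bbb$ for $s_3$, so no substring is common to all three strings and
$d \geq 1$. The center $s=aab$ with $t_1=aab$, $t_2=bab$, $t_3=abb$ achieves
$d=1$, because
$$
d(aab,aab)=0, \;\;\; d(aab,bab)=1, \;\;\; d(aab,abb)=1.
$$
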
161: {\sc Closest String} has been widely and independently studied
162: in different contexts. In the context of coding theory
163: it was shown to be NP-hard~\cite{FL97}. In DNA sequence related topics,
164: \cite{BGHMS97} gave an exact algorithm when the distance $d$ is a constant.
\cite{BLPR97,GJL99} gave near-optimal
approximation algorithms only for large $d$ (super-logarithmic in the number of
sequences); however, the straightforward linear programming relaxation technique
does not work when $d$ is small because the randomized
rounding procedure introduces large errors.
This is exactly the reason why \cite{GJL99,LL+99}
analyzed more involved approximation algorithms and
obtained approximation algorithms with ratio $\frac{4}{3}$.
Note that small $d$ is key in applications such as
174: %wangl -- do we need "the key" instead of "key"??
175: genetic drug target search where we look for similar regions to which
176: a complementary drug sequence would bind. It is a major open
177: problem~\cite{FL97,BGHMS97,BLPR97,GJL99,LL+99} to achieve the best
178: approximation ratio for this problem. (Justifications for using Hamming
179: distance can also be found in these references, especially \cite{LL+99}.)
We present a polynomial time approximation scheme (PTAS), settling the problem.
181:
182: {\sc Closest Substring} is a more general version of the
183: {\sc Closest String} problem. Obviously, it is also NP-hard.
184: In applications such as drug target identification and
genetic probe design, the radius $d$ is usually small.
186: Moreover, when the radius $d$ is small,
187: the center strings can also be used
188: as {\it motifs} in {\it repeated-motif} methods for
189: multiple sequence alignment problems
190: ~\cite{Danbook, PBPR89,SAL91,W86,WAG84,WP84},
191: that repeatedly find motifs and recursively decompose the sequences
192: into shorter sequences.
193: A trivial ratio-$2$ approximation was given in~\cite{LL+99}.
194: We presented the first nontrivial
195: algorithm with approximation ratio $2- \frac{2}{2|\Sigma | +1}$,
196: in \cite{LMW99}. This is a key open problem in search of a
197: potential genetic drug sequence which is ``close''
198: to some sequences (of harmful germs) and ``far'' from some other sequences
199: (of humans). The problem appears to be much more elusive than
200: {\sc Closest String}.
We extend the techniques developed for {\sc Closest String} here
to design a PTAS for {\sc Closest Substring} when
$d$ is small, i.e., $d\leq O(\log N)$, where $N$
is the input size of the instance.
205: Using a {\em random sampling} technique, and combining our
206: methods for {\sc Closest String}, we then design a PTAS
207: for {\sc Closest Substring}, for all $d$.
208:
209: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
210: \section{Approximating {\sc Closest String}}
211: \label{sec-closeststring}
212: In this section, we give a PTAS for {\sc Closest String}.
213: %wangl
214: %improve the $4/3$ approximation
215: %in \cite{LL+99} for {\sc Closest String} to a PTAS.
216: We note that a direct application of LP relaxation
217: in \cite{BLPR97} does not work when the optimal solution is small.
Rather, we extend an idea in \cite{LL+99} and apply LP relaxation
to only a fraction of the bits.
220: Let ${\cal S}=\{s_1, s_2, \ldots, s_n\}$ be a set of $n$ strings each of
221: length $m$.
222:
223: The idea is as follows. Let $r$ be a constant.
Choose a subset
of $r$ strings from ${\cal S}$ and consider the bits on which they all agree.
Intuitively, we can replace the corresponding bits
227: in the optimal solution by these bits of the $r$ strings,
228: and this will only slightly worsen the solution.
229: Lemma~\ref{KEY} shows that this is true for at least one subset
230: of $r$ strings. Then all we
231: need to do is to optimize on the positions (bits) where they do not agree,
232: by LP relaxation and randomized rounding.
233:
We first introduce some notation. Let
$P=\{j_1,j_2,\ldots,j_k\}$ be a set (multiset) with
$1\leq j_1 \leq j_2 \leq \cdots \leq j_k\leq m$.
Then $P$ is called a {\it position set} ({\it multiset}).
238: Let $s$ be a string of length $m$,
239: then $s|_P$ is the string $s[j_1]\,s[j_2]\,\cdots \,s[j_k]$.
240:
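For instance (an example of ours, given only to illustrate the notation),
if $s=abcde$ and $P=\{1,3,3,5\}$ is a position multiset, then
$$
s|_P = s[1]\,s[3]\,s[3]\,s[5] = acce.
$$
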
241: For any $k \geq 2$, let $1 \leq i_1,i_2,\ldots,i_k \leq n$ be
242: $k$ distinct numbers.
243: Let $Q_{i_1,i_2,\ldots,i_k}$ be the set of positions
244: where $s_{i_1},s_{i_2},\ldots,s_{i_k}$ agree.
Obviously $|Q_{i_1,i_2,\ldots,i_k}| \geq m- kd_{opt}$, where
$s$ denotes an optimal center string and
$d_{opt}=\max_{1\leq i\leq n} d(s_i,s)$ is the optimal cost.
Let $\rho_0 = \max _{1\leq i,j\leq n} {d(s_i,s_j)}/{d_{opt}}$.
The following lemma is the key to our approximation algorithm.
248: \begin{lemma}\label{KEY}
249: If $\rho_0> 1+\frac{1}{2r-1}$, then
250: for any constant $r$, there are indices
251: $1 \leq i_1,i_2,\ldots,i_r \leq n$ such that for any $1 \leq l \leq n$,
252: %mab: n is the number of sequences. m is the length of string.
253: %$$
254: %| \{j \in Q_{i_1,i_2,\ldots,i_r} \,|\,
255: % s_{i_1}[j] \neq s_l[j] \mbox{ and } s_{i_1}[j]
256: %\neq s[j]\} |
257: %\leq \frac{1}{2r-1} \,d_{opt}.
258: %$$
259: $$
260: d(s_l|_{Q_{i_1,i_2,\ldots,i_r}}, s_{i_1}|_{Q_{i_1,i_2,\ldots,i_r}})
261: - d(s_l|_{Q_{i_1,i_2,\ldots,i_r}}, s|_{Q_{i_1,i_2,\ldots,i_r}})
262: \leq \frac 1{2r-1} d_{opt}.
263: $$
264: \end{lemma}
265: \begin{proof}
266: Let $p_{i_1,i_2,\ldots,i_k}$ be the number of mismatches between
267: $s_{i_1}$ and $s$ at the positions in $Q_{i_1,i_2,\ldots,i_k}$. Let
268: $
269: \rho _k = \min _{1\leq i_1,i_2,\ldots,i_k\leq n}
270: {p_{i_1,i_2,\ldots,i_k}} /{d_{opt}}.
271: $
272: First, we prove the following claim.
273: \begin{claim}\label{Fact1}
For any $k$ such that $2 \leq k \leq r$, where $r$ is the constant
in the algorithm closestString, there are indices
$1 \leq i_1,i_2,\ldots,i_r \leq n$ such that for any $1 \leq l \leq n$,
277: %wangl n ==>m
278: $$%\begin{eqnarray*}
279: | \{j \in Q_{i_1,i_2,\ldots,i_r} \,|\,
280: s_{i_1}[j] \neq s_l[j] \mbox{ and } s_{i_1}[j]
281: \neq s[j]\} |
282: \leq (\rho _k - \rho _{k+1}) \,d_{opt}
283: $$%\end{eqnarray*}
284: \end{claim}
285: \begin{proof}
Consider indices $1\leq i_1,i_2,\ldots,i_k\leq n$ such that
287: %wangl n==>m
288: ${p_{i_1,i_2,\ldots, i_k}}= \rho _k d_{opt}$.
Then for any $1\leq i_{k+1},i_{k+2},\ldots,i_{r} \leq n$
290: %wangl n==>m
291: and $1\leq l\leq n$, we have
292: \begin{eqnarray}
293: && | \{j \in Q_{i_1,i_2,\ldots,i_r} \,|\,
294: s_{i_1}[j] \neq s_l[j] \mbox{ and } s_{i_1}[j] \neq s[j]\}
295: |
296: \nonumber\\
297: &\leq & | \{j \in Q_{i_1,i_2,\ldots,i_k} \,|\,
298: s_{i_1}[j] \neq s_l[j] \mbox{ and } s_{i_1}[j] \neq s[j]\} | \label{eq-tmp11}\\
299: &=&
300: | \{ j \in Q_{i_1,i_2,\ldots,i_k} \,|\, s_{i_1}[j] \neq s[j] \}
301: %\nonumber \\
302: %&&
303: - \{ j \in Q_{i_1,i_2,\ldots,i_k} \,|\,
304: s_{i_1}[j]=s_l[j] \mbox{ and } s_{i_1}[j] \neq s[j] \} |
305: \nonumber\\
306: &=&
307: | \{ j \in Q_{i_1,i_2,\ldots,i_k} \,|\, s_{i_1}[j] \neq s[j] \}
308: - \{ j \in Q_{i_1,i_2,\ldots,i_k,l} \,|\, s_{i_1}[j] \neq s[j] \}|
309: \nonumber\\
310: &=&
311: p_{i_1,i_2,\ldots,i_k} - p_{i_1,i_2,\ldots,i_k,l}
312: \label{eq-tmp12}\\
313: &\leq& (\rho _k - \rho _{k+1}) \,d_{opt},
314: \nonumber
315: \end{eqnarray}
316: where Inequality (\ref{eq-tmp11}) is from the fact that
317: $Q_{i_1,i_2,\ldots,i_r} \subseteq Q_{i_1,i_2,\ldots,i_k}$
318: and Equality (\ref{eq-tmp12}) is from the fact that
319: $Q_{i_1,i_2,\ldots,i_k,l} \subseteq Q_{i_1,i_2,\ldots,i_k}$.
320: $\Box$
321: \end{proof}
322:
323: \begin{claim}\label{Fact2}
324: $\min \{\rho_0 -1, \rho _2 - \rho_3, \rho _3 - \rho_4,
325: \ldots , \rho _r - \rho_{r+1} \} \leq \frac {1}{2 r-1}.$
326: \end{claim}
327: \begin{proof}
328: Consider $1\leq i,j \leq n$ such that
329: $d(s_i,s_j) = \rho_0 d_{opt}$. Then among the
330: positions where $s_i$ mismatches $s_j$, for at least one of the
331: two strings, say, $s_i$, the number of
332: mismatches between $s_i$ and $s$ is at least
333: $\rho_0 d_{opt}/2$. Thus, among the positions where
334: $s_i$ matches $s_j$, the number of mismatches between
335: $s_i$ and $s$ is at most $(1-\frac{\rho_0}{2}){d_{opt}}$.
336: Therefore, $\rho_2 \leq 1-\frac {\rho_0}{2}$. So,
337: $$
338: \frac {{\frac 12} (\rho_0 -1) + ( \rho _2 - \rho_3) + ( \rho _3 - \rho_4)
339: + \cdots + (\rho_r - \rho_{r+1}) }{\frac 12 +r-1}
340: \leq \frac {{\frac 12} \rho _0 + \rho _2 - \frac 12 }{r-\frac 12}
341: \leq \frac {1}{2 r-1}
342: $$
343: Thus, at least one of $\rho_0-1$, $\rho _2 - \rho_3$, $\rho _3 - \rho_4$,
344: $\ldots$, $\rho _r - \rho_{r+1}$ is
345: less than or equal to
346: %wangl revise the order
347: $\frac {1}{2 r-1}$.
348: $\Box$
349: \end{proof}
350:
351: %Now we finish the proof. If $\rho_0 -1 \leq \frac {1}{2 r -1}$,
352: %then by the definition of $\rho _0$, it is easy to see that
353: %the algorithm finds a solution with cost at most
354: %$\rho_0 d_{opt} \leq (1+ \frac {1}{2r-1}) d_{opt}$ in step 2.
355: If $\rho_0 > 1+\frac {1}{2 r -1}$,
then from Claim \ref{Fact2},
357: there must be a $2 \leq k \leq r$ such that
358: $\rho _k - \rho _{k+1} \leq \frac {1}{2 r -1}$.
359: From Claim \ref{Fact1},
360: $$%\begin{eqnarray*}
361: | \{j \in Q_{i_1,i_2,\ldots,i_r} \,|\,
362: s_{i_1}[j] \neq s_l[j] \mbox{ and } s_{i_1}[j]
363: \neq s[j]\} |
364: \leq \frac{1}{2r-1} \,d_{opt} \ .
365: $$%\end{eqnarray*}
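Hence, there are at most $\frac 1{2r-1} \, d_{opt}$ bits
in $Q_{i_1,i_2,\ldots,i_r}$
where $s_l$ differs from $s_{i_1}$ while agreeing with
$s$. These are the only positions in $Q_{i_1,i_2,\ldots,i_r}$ that are
counted in $d(s_l|_{Q_{i_1,i_2,\ldots,i_r}}, s_{i_1}|_{Q_{i_1,i_2,\ldots,i_r}})$
but not in $d(s_l|_{Q_{i_1,i_2,\ldots,i_r}}, s|_{Q_{i_1,i_2,\ldots,i_r}})$,
so the lemma is proved.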
370: $\qed$
371: \end{proof}
372:
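To see what Lemma~\ref{KEY} asserts on a concrete case, consider the
following toy instance (ours, for illustration only): $\Sigma=\{a,b\}$,
${\cal S}=\{s_1,s_2,s_3\}=\{aab,\, aba,\, baa\}$, and $r=2$.
The optimal center is $s=aaa$ with $d_{opt}=1$, and every pair of input
strings is at distance $2$, so
$$
\rho_0 = \frac{2}{1} = 2 > 1+\frac{1}{2r-1} = \frac 43 ,
$$
and the lemma applies. Take $i_1=1$ and $i_2=2$. The strings $aab$ and $aba$
agree only at the first position, hence $Q_{1,2}=\{1\}$ and
$s_1|_{Q_{1,2}}=s|_{Q_{1,2}}=a$. Consequently, for every $l$,
$$
d(s_l|_{Q_{1,2}}, s_1|_{Q_{1,2}}) - d(s_l|_{Q_{1,2}}, s|_{Q_{1,2}}) = 0
\leq \frac{1}{2r-1}\, d_{opt} = \frac 13 .
$$
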
Lemma \ref{KEY} suggests that we select $r$ strings
$s_{i_1}, s_{i_2}, \ldots, s_{i_r}$ from ${\cal S}$
375: at a time and use the unique letters at the positions in
376: $Q_{i_1,i_2,\ldots, i_r}$ as an approximation of
377: the optimal center string $s$.
For the positions in $P_{i_1,i_2,\ldots, i_r}=\{1,2,\ldots, m\}-Q_{i_1,i_2,\ldots, i_r}$,
379: we use ideas in \cite{LL+99}, i.e., the following two strategies:
380: %wangl --change above sentence.
(1) if $|P_{i_1,i_2,\ldots, i_r}|$ is small, i.e., $|P_{i_1,i_2,\ldots, i_r}| = O(\log n)$,
we can enumerate all $|\Sigma| ^{|P_{i_1,i_2,\ldots, i_r}|}$ possibilities to approximate $s$;
(2) if $|P_{i_1,i_2,\ldots, i_r}|$ is larger, we use
LP relaxation to approximate $s$. The details are
found in the proof of Lemma~\ref{lem-rest}.
386: %wangl add subscripts in the above discussion
387: Before
388: %going on
389: %wangl delete the two words
390: presenting our main result, we need the following
two lemmas, where Lemma~\ref{lem-chernoff} is commonly known
as the Chernoff bounds~(\cite{MR95}, Theorems~4.2 and 4.3):
393: \begin{lemma}
394: {\rm \cite{MR95}~}
395: Let $X_1,X_2,\ldots,X_n$ be $n$ independent random 0-1 variables,
396: where $X_i$ takes $1$ with probability $p_i$, $0<p_i<1$.
397: Let $X=\sum _{i=1}^n X_i$, and $\mu=E[X]$.
398: Then for any $\delta>0$,
399: \begin{enumerate}
400: \item[(1)]
401: ${\bf Pr}(X>(1+\delta) \mu ) < \left[\frac {{\bf e}^{\delta}} {(1+\delta)^{(1+ \delta)}}\right]^{\mu}$,
402: \item[(2)]
403: ${\bf Pr}(X<(1-\delta) \mu ) \leq \exp \left( -\frac 12 \mu \delta ^2 \right)$.
404: \vspace{-6pt}
405: \end{enumerate}
406: \label{lem-chernoff}
407: \end{lemma}
408:
409: From Lemma~\ref{lem-chernoff}, we can prove the following lemma:
410: \begin{lemma}
411: Let $X_i$, $X$ and $\mu$ be defined as in Lemma~\ref{lem-chernoff}.
412: Then for any $0<\epsilon\leq 1$,
413: \begin{enumerate}
414: \item[(1)]
415: ${\bf Pr}(X>\mu+\epsilon\, n ) < \exp \left(-\frac 13 n \epsilon ^2 \right)$,
416: \item[(2)]
417: ${\bf Pr}(X<\mu-\epsilon\,n ) \leq \exp \left( -\frac 12 n \epsilon ^2 \right)$.
418: \vspace{-6pt}
419: \end{enumerate}
420: \label{lem-chernoff1}
421: \end{lemma}
422: \begin{proof}
423: (1) Let $\delta = \frac {\epsilon n} {\mu}$. By Lemma~\ref{lem-chernoff},
424: $$%\begin{eqnarray*}
425: {\bf Pr}(X>\mu +\epsilon n )
426: <\left[\frac {{\bf e}^{\frac{\epsilon n}{\mu}}}
427: {(1+\frac{\epsilon n}{\mu})^{(1+\frac{\epsilon n}{\mu})}}\right]^{\mu}
428: =\left[\frac {\bf e}{(1+\frac{\epsilon n}{\mu})^{(1+\frac{\mu}{\epsilon n})}}\right]^{\epsilon n}
429: \leq \left[\frac {\bf e} {(1+\epsilon)^{1+\frac 1{\epsilon}}} \right]^{\epsilon n},
430: $$%\end{eqnarray*}
431: where the last inequality is because $\mu \leq n$ and
432: that $(1+x)^{(1+\frac 1x)}$ is increasing for $x \geq 0$.
433: It is easy to verify that for $0< \epsilon \leq 1$,
434: $\frac {\bf e} {(1+\epsilon)^{1+\frac 1{\epsilon}}}
435: \leq \exp \left(-\frac \epsilon 3 \right).$
436: Therefore, (1) is proved.
437:
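(2) Let $\delta =\frac {\epsilon n} {\mu}$. By Lemma~\ref{lem-chernoff}~(2),
$$
{\bf Pr}(X<\mu-\epsilon n )
\leq \exp \left( -\frac 12 \mu \delta ^2 \right)
= \exp \left( -\frac {\epsilon^2 n^2}{2 \mu} \right)
\leq \exp \left( -\frac 12 n \epsilon ^2 \right),
$$
where the last inequality is because $\mu \leq n$.
Therefore, (2) is proved.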
440: $\qed$
441: \end{proof}
442:
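To get a feeling for the strength of Lemma~\ref{lem-chernoff1}, here is a
numerical illustration (the values $n=300$ and $\epsilon=0.2$ are ours).
Whatever the probabilities $p_i$ are, the sum $X$ deviates from its mean
by more than $\epsilon n = 60$ only rarely:
$$
{\bf Pr}(X>\mu+60) < \exp \left(-\frac 13 \cdot 300 \cdot 0.04 \right)
= {\bf e}^{-4} \approx 0.018,
\;\;\;\;
{\bf Pr}(X<\mu-60) \leq \exp \left(-\frac 12 \cdot 300 \cdot 0.04 \right)
= {\bf e}^{-6} \approx 0.0025 .
$$
As a sanity check of the inequality used in the proof of part (1), at the
endpoint $\epsilon=1$ we have
$\frac {\bf e}{(1+\epsilon)^{1+\frac 1\epsilon}} = \frac {\bf e}{4} \approx 0.680
\leq \exp \left(-\frac 13 \right) \approx 0.717$.
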
443: Now, we come back to the approximation of $s$ at the positions
444: in $P_{i_1,i_2,\ldots,i_r}$.
445:
446: \begin{lemma}
447: \label{lem-rest}
448: Let ${\cal S} = \{s_1, s_2, \ldots s_n\}$, where $|s_i| = m$ for all $i$.
449: Assume that $s$ is the optimal solution of {\sc Closest String}
450: and $\max_{1 \leq i \leq n} d(s_i,s) =d_{opt}$.
451: Given a string $s'$ and a position set $Q$ of size $m-O(d_{opt})$
452: such that for any $i=1, \ldots , n$
453: \begin{equation}
454: \label{eq-rest01}
455: d(s_i|_Q,s'|_Q)-d(s_i|_Q,s|_Q) \leq \rho \, d_{opt},
456: \end{equation}
457: where $0 \leq \rho \leq 1$,
458: one can obtain a solution with cost at most
459: $(1+\rho + \epsilon)d_{opt}$ in polynomial time
for any fixed $\epsilon > 0$.
461: \end{lemma}
462:
463: \begin{proof}
464: Let $P=\{1,2,\ldots,m\} - Q$. Then, for any two
465: strings $x$ and $x'$ of length $m$, we have
466: $d(x|_P,x'|_P)+d(x|_Q,x'|_Q)=d(x,x')$.
467: Thus for any $i=1,2, \ldots ,n$,
468: $$%\begin{eqnarray*}
469: d(s_i|_P,s|_P)=d(s_i,s)-d(s_i|_Q,s|_Q)
470: \leq (1+ \rho)\,d_{opt} - d(s_i|_Q,s'|_Q).
471: $$%\end{eqnarray*}
472: Therefore, the following optimization problem
473: \begin{equation}
474: \label{lps1}
475: \left\{
476: \begin{array}{l}
477: \min \;\; d;\\
478: d(s_i|_P,x) \leq d - d(s_i|_Q,s'|_Q), \;\;
479: i=1, \cdots , n; |x|=|P |,
480: \end{array}
481: \right.
482: \end{equation}
483: has a solution with cost
484: $d \leq (1+ \rho) d_{opt}$.
485: %wangl move half of the sentence down
486: Suppose that the optimization problem has an optimal solution $x$ such that
487: $d=d_0$. Then
488: \begin{equation}
489: \label{eq-d0}
490: d_0\leq (1+\rho) d_{opt}.
491: \end{equation}
492: Now we solve (\ref{lps1}) approximately.
493: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%sep 26
494: Similar to \cite{BLPR97,LL+99}, we use a 0-1 variable
495: $x_{j,a}$ to indicate whether $x[j]=a$. Denote
496: $\chi (s_i[j],a)=0$ if $s_i[j]=a$ and $1$ if $s_i[j] \neq a$.
497: Then (\ref{lps1}) can be rewritten as a 0-1 optimization problem
498: as follows:
499: \begin{equation}
500: \label{lps2}
501: \left\{
502: \begin{array}{l}
503: \min \;\; d;\\
504: \sum _{a \in \Sigma} x_{j,a}=1, \;\; j=1,2,\ldots,|P|,\\
505: \sum_{1\leq j\leq |P|} \sum _{a \in \Sigma}
506: \chi (s_i[j],a) \, x_{j,a}
507: \leq d - d(s_i|_Q,s'|_Q), \;\; i=1,2, \ldots , n.
508: \end{array}
509: \right.
510: \end{equation}
Solve the linear programming relaxation of (\ref{lps2}), in which each
$x_{j,a}$ may take any value in $[0,1]$, to
get a fractional solution ${\bar x}_{j,a}$ with cost
${\bar d}$. Clearly ${\bar d} \leq d_0$.
Independently for each $1 \leq j \leq |P|$,
pick a letter $a \in \Sigma$ with probability ${\bar x}_{j,a}$, and
set $x_{j,a}=1$ and $x_{j,a'}=0$ for any $a' \neq a$.
517: Then we get a solution $x_{j,a}$ for the 0-1 optimization
518: problem, hence a solution $x$ for (\ref{lps1}).
519: It is easy to see that
520: $\sum _{a \in \Sigma} \chi (s_i[j],a) \, x_{j,a} $
521: takes $1$ or $0$ randomly and independently for
522: different $j$'s. Thus
523: $d(s_i|_P,x)= \sum_{1\leq j\leq |P|} \sum _{a \in \Sigma}
524: \chi (s_i[j],a) \, x_{j,a}$
525: is a sum of $|P|$ independent 0-1 random variables, and
526: \begin{eqnarray}
527: E[d(s_i|_P,x)]&=&\sum_{1\leq j\leq |P|} \sum _{a \in \Sigma}
528: \chi (s_i[j],a) \, E[x_{j,a}]
529: \nonumber\\
530: &=&\sum_{1\leq j\leq |P|} \sum _{a \in \Sigma}
531: \chi (s_i[j],a) \, {\bar x}_{j,a}
532: \nonumber\\
533: &\leq& {\bar d}-d(s_i|_Q,s'|_Q)\leq d_0-d(s_i|_Q,s'|_Q).
534: \label{eq-hoeff01}
535: \end{eqnarray}
536: Therefore, for any fixed $\epsilon '>0$,
537: by Lemma~\ref{lem-chernoff1},
538: $$
539: {\bf Pr}\left(
540: d(s_i|_P,x) \geq d_0+\epsilon' |P|
541: -d(s_i|_Q,s'|_Q)
542: \right)
543: \leq
544: \exp \left(- \frac 13 {{\epsilon '}^2} |P|\right).
545: $$
546: Considering all sequences, we have
547: \[
548: {\bf Pr}\left(d(s_i|_P,x) \geq d_0 +\epsilon ' |P|-
549: d(s_i|_Q,s'|_Q)
550: ~ {\rm for~at~least~one~}i \right)
551: \leq n\times \exp \left(-\frac 13 {\epsilon'} ^2 |P|\right).
552: \]
553: If $|P| \geq (4 \ln n )/ {\epsilon'} ^2$,
554: then,
555: $n\times
556: \exp\left(-\frac 13 {\epsilon'} ^2 |P|\right) \leq n^{-\frac 13}$.
557: Thus
558: we obtain a randomized algorithm to find a solution
559: for (\ref{lps1}) with cost at most
560: $d_0 +\epsilon' |P|$ with probability at least $1-n^{-\frac
561: 13}$.
The above randomized algorithm can be derandomized
by the standard method of conditional probabilities~\cite{MR95}.
564:
If $|P| < ( 4 \ln n )/ {\epsilon'} ^2$, then
$|\Sigma| ^{|P|} < n ^{(4 \ln |\Sigma|)/{\epsilon'} ^2}$
is polynomial in $n$.
568: So, we can enumerate all strings in $\Sigma ^{|P|}$ to find
569: an optimal solution for (\ref{lps1}).
570: Thus, in both cases, we can obtain a solution $x$ for the optimization
571: problem (\ref{lps1}) with cost at most
572: $d_0+\epsilon' |P|$ in polynomial time.
573: Since $|P|=O(d_{opt})$,
574: $|P| \leq c\times d_{opt}$ for a {\em constant} $c$.
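Let $\epsilon'= \frac {\epsilon}{c}$
and let
$s^*=R(s',x,P)$ be the string obtained from $s'$ by replacing its letters
at the positions in $P$ with $x$, i.e., $s^*|_Q=s'|_Q$ and $s^*|_P=x$.
From Formula (\ref{lps1}),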
578: \begin{eqnarray}
579: d(s_i,s^*)
580: &=&d(s_i|_P,s^*|_P)+d(s_i|_Q,s^*|_Q)
581: \nonumber\\
582: &=&d(s_i|_P,x)+d(s_i|_Q,s'|_Q)
583: \nonumber\\
584: &\leq& d_0+ \epsilon' |P|
585: \leq (1+ \rho) d_{opt} + \epsilon d_{opt},
586: %\label{eq-tmp01}
587: \nonumber
588: \end{eqnarray}
where the last inequality is from Formula~(\ref{eq-d0}) and
$\epsilon' |P| \leq \frac{\epsilon}{c} \times c\, d_{opt}$.
590: This proves the lemma.
591: $\Box$
592: \end{proof}
593:
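The following toy instance (ours, for illustration only) shows what
the optimization problem (\ref{lps1}) looks like. Let $\Sigma=\{a,b\}$ and
${\cal S}=\{s_1,s_2,s_3\}=\{aab,\, aba,\, baa\}$, whose optimal center is $s=aaa$
with $d_{opt}=1$. Take $s'=s_1$ and $Q=\{1\}$, so that $P=\{2,3\}$ and
Inequality (\ref{eq-rest01}) holds with $\rho=0$. Then
$s_1|_P=ab$, $s_2|_P=ba$, $s_3|_P=aa$, and
$d(s_i|_Q,s'|_Q)$ equals $0$, $0$, $1$ for $i=1,2,3$, so (\ref{lps1}) reads
$$
\left\{
\begin{array}{l}
\min \;\; d;\\
d(ab,x) \leq d, \;\; d(ba,x) \leq d, \;\; d(aa,x) \leq d-1, \;\; |x|=2.
\end{array}
\right.
$$
The third constraint forces $d \geq 1$, and $x=aa$ is feasible with $d=1$.
Hence $d_0=1 \leq (1+\rho)\, d_{opt}$, and combining $s'|_Q=a$ with $x=aa$
gives the center $aaa$ of cost $1$.
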
594: Now we describe the complete algorithm in Figure~\ref{stringAlg}.
595:
596: \begin{figure}[ht]
597: \begin{center}
598: \begin{tabular}{|l|}
599: \hline
600: \multicolumn{1}{|c|}{\bf Algorithm ~closestString}
601: \\
602: \makebox[.45in][l]{{Input}} \parbox[t]{4.55in}
603: {$s_1, s_2, \ldots , s_n \in \Sigma^m$.}
604: \\
605: \makebox[.45in][l]{{Output}} \parbox[t]{4.55in}
606: {a center string $s \in \Sigma^m$.}
607: \\
608: \makebox[.2in][l]{1.} \parbox[t]{4.8in}
609: {{\bf for} each $r$-element subset $\{ s_{i_1}$, $s_{i_2}$, $\ldots$,
610: $s_{i_r} \}$ of the $n$ input strings {\bf do}}
611: \\
612: \makebox[.5in][r]{(a)} \parbox[t]{4.5in}
613: {$Q=\{1\leq j \leq m \,|\, s_{i_1}[j]=s_{i_2}[j]=\ldots = s_{i_r}[j] \}$,
614: $P=\{1,2,\ldots, m\} - Q$.}
615: \\
616: \makebox[.5in][r]{(b)} \parbox[t]{4.5in}
617: {Solve the optimization problem defined by Formula (\ref{lps1})
618: as described in the proof of Lemma~\ref{lem-rest} to get an approximate
619: solution $x$ of length $|P|$.}
620: \\
621: \makebox[.5in][r]{(c)} \parbox[t]{4.5in}
622: {Let $s'$ be a string such that $s'|_Q=s_{i_1}|_Q$ and $s'|_P=x$.
623: Calculate the cost of $s'$ as the center
624: string.}
625: \\
626: \makebox[.2in][l]{2.} \parbox[t]{4.8in}
627: {{\bf for} $i=1, 2, \ldots , n$ {\bf do}}
628: \\
629: \makebox[.4in][l]{}\parbox[t]{4.6in}
630: {calculate the cost of $s_i$ as the center string.}
631: \\
632: \makebox[.2in][l]{3.} \parbox[t]{4.8in}
633: {Output the best solution of the above two steps.}
634: \\
635: \hline
636: \end{tabular}
637: \caption{Algorithm for {\sc Closest String}}
638: \label{stringAlg}
639: \end{center}
640: \end{figure}
641:
642: \begin{theorem}
643: \label{th-uniform}
644: The algorithm closestString is a PTAS for {\sc Closest String}.
645: \end{theorem}
646: %%%Shall we declare the ratio clearly here? -bin
647: \begin{proof}
648: Given an instance of {\sc Closest String},
649: suppose $s$ is an optimal solution and the optimal
650: cost is $d_{opt}$, i.e. $d(s,s_i) \leq d_{opt}$ for all $i$.
651: Let $P$ be defined as step 1(a) of Algorithm~closestString.
Since for every position in $P$, at least one of the $r$
strings $s_{i_1},s_{i_2},\ldots,s_{i_r}$ conflicts with
the optimal center string $s$, we have
$|P|\leq r\times d_{opt}$. As long as $r$ is a constant,
step 1(b) can be done in polynomial time by Lemma~\ref{lem-rest}.
Obviously the other steps of
Algorithm~closestString run in polynomial time,
with $r$ as a constant.
660:
661: If $\rho_0 -1 \leq \frac {1}{2 r -1}$,
662: then by the definition of $\rho _0$, it is easy to see that
663: the algorithm finds a solution with cost at most
664: $\rho_0 d_{opt} \leq (1+ \frac {1}{2r-1}) d_{opt}$ in step 2.
665:
666: If $\rho_0 > 1+\frac {1}{2 r -1}$,
then from Lemma \ref{KEY} and Lemma \ref{lem-rest},
668: the algorithm finds a solution with cost at most
669: $(1+\frac 1{2r-1} + \epsilon)d_{opt}$. This proves the theorem.
670: $\Box$
671: \end{proof}
672:
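As an illustration of how the parameters may be chosen (the target value
below is ours), suppose we want a solution of cost at most
$(1+\delta)\, d_{opt}$. By the proof of Theorem~\ref{th-uniform}, it suffices
to pick a constant $r$ and an $\epsilon>0$ with
$\frac{1}{2r-1}+\epsilon \leq \delta$. For $\delta = \frac 12$, one
possible choice is $r=3$ and $\epsilon=\frac{3}{10}$, since
$$
\frac{1}{2\cdot 3-1}+\frac{3}{10} = \frac 15 + \frac{3}{10} = \frac 12 .
$$
With this choice, step 1 of Algorithm~closestString iterates over the
${n \choose 3}$ three-element subsets of the input strings.
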
673: \section{Approximating {\sc Closest Substring} when $d$ is small}
In some applications such as drug target identification and
genetic probe design, the radius $d$ is often small.
As a direct application of Lemma \ref{KEY},
we now present a PTAS for {\sc Closest Substring}
when the radius $d$ is small, i.e., $d\leq O(\log N)$, where $N$
stands for the input size of the instance.
680: Again, we focus on the construction of the center string.
681: The basic idea is to choose $r$ substrings
682: $t_{i_1}$, $t_{i_2}$, $\ldots$, $t_{i_r}$
683: of length $L$
684: from the strings in ${\cal S}$,
685: keep the letters at the positions where
686: $t_{i_1}$, $t_{i_2}$, $\ldots$, $t_{i_r}$ all agree, and
687: try all possibilities for the rest of the positions.
688: % Bin: describe more of the proof idea perhaps?
689: The complete algorithm is described in Figure~\ref{fig-Algsmall}:
690:
691: \begin{figure}[h]
692: \begin{center}
693: \begin{tabular}{|l|}
694: \hline
695: \multicolumn{1}{|c|}{\bf Algorithm ~smallSubstring}
696: \\
697: \makebox[.45in][l]{{Input}} \parbox[t]{4.55in}
698: {$s_1, s_2, \ldots , s_n \in \Sigma^m$.}
699: \\
700: \makebox[.45in][l]{{Output}} \parbox[t]{4.55in}
701: {a center string $s \in \Sigma^L$.}
702: \\
703: \makebox[.2in][l]{1.} \parbox[t]{4.8in}
704: {{\bf for} each $r$-element subset $\{t_{i_1}$, $t_{i_2}$, $\ldots$,
705: $t_{i_r} \}$, where $t_{i_j}$ is a substring of length $L$ from
706: $s_{i_j}$ {\bf do}}
707: \\
708: \makebox[.5in][r]{(a)} \parbox[t]{4.5in}
{$Q=\{1\leq j \leq L \,|\, t_{i_1}[j]=t_{i_2}[j]=\ldots = t_{i_r}[j] \}$,
$P=\{1,2,\ldots, L\} - Q$.}
711: \\
712:
713: \makebox[.5in][r]{(b)} \parbox[t]{4.5in}
714: {{\bf for} every $x \in \Sigma ^{|P|}$ {\bf do}}
715: \\
716: \makebox[.7in][r]{} \parbox[t]{4.3in}
{let $t=S(t_{i_1},x,P)$, i.e., the string with $t|_Q=t_{i_1}|_Q$ and
$t|_P=x$; compute the cost of the solution $t$.}
718: \\
719: \makebox[.2in][l]{2.} \parbox[t]{4.8in}
720: {{\bf for} every length $L$ substring $t_k$ from any given sequence {\bf do}}
721: \\
722: \makebox[.4in][r]{} \parbox[t]{4.6in}
723: {compute the cost of the solution with $t_k$ as the center string}\\
724: \makebox[.2in][l]{3.} \parbox[t]{4.8in}
{output the center string that gives the best result among those
considered in Step 1 and Step 2.}
728: \\
729: \hline
730: \end{tabular}
731: \caption{Algorithm for {\sc Closest Substring} when $d$ is
732: small}
733: \label{fig-Algsmall}
734: \end{center}
735: \end{figure}
736:
737: \begin{theorem}
738: Algorithm smallSubstring is a PTAS for {\sc Closest Substring}
739: when the radius $d$ is small, i.e., $d\leq O(\log N)$, where $N$
740: is the input size.
741: \end{theorem}
742: \begin{proof}
743: Obviously, the size of $P$ in Step 1 is at most $O(r\times \log N)$.
Step 1 takes $O((mn)^{r}\times|\Sigma| ^{O(r \times \log N)}\times mnL)
=O(N^{r+1} \times N^{O(r \times \log |\Sigma|)})
=O(N^{O(r \times \log |\Sigma|)})$
time.
The other steps take less time.
Thus, the total time required is
$O(N^{O(r\times \log |\Sigma|)})$,
which is polynomial in the
input size for any constant $r$.
753:
754: From Lemma \ref{KEY}, the performance ratio
755: of the algorithm is $1+\frac{1}{2r-1}$. $\qed$
756: \end{proof}
757:
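To make the counting in the proof concrete, consider the following
numbers (chosen by us for illustration): $n=10$, $m=100$, $L=20$,
$|\Sigma|=4$, and $r=2$. Each string has $m-L+1=81$ substrings of length $L$,
so there are $810$ candidate substrings and at most
$$
{810 \choose 2} = \frac{810 \times 809}{2} = 327645 \leq (mn)^r = 10^6
$$
choices in Step 1. For a choice with $|P|=6$, Step 1(b) enumerates
$|\Sigma|^{|P|}=4^6=4096$ strings $x$, and the cost of each resulting
candidate $t$ can be computed by comparing it with all $n(m-L+1)=810$
substrings, i.e., with $810 \times 20 = 16200 \leq mnL$ character comparisons.
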
758: \section{A PTAS For {\sc Closest Substring}}
759: In this section, we further extend the algorithms
760: for {\sc Closest String} to a PTAS for {\sc Closest Substring},
761: making use of a {\em random sampling} strategy.
Note that Algorithm~smallSubstring runs in exponential time
for general radius $d$, and Algorithm~closestString does not
work for {\sc Closest Substring} since we do not
know how to construct an optimization problem similar to~(\ref{lps1}):
the construction of~(\ref{lps1}) requires us to know all the $n$
strings (substrings)
in an optimal solution of {\sc Closest String} ({\sc Closest Substring}).
769: It is easy to see that the choice of a ``good'' substring
770: from every string $s_i$ is the only obstacle on the way to the solution.
771: We use random sampling to handle this.
772:
773: Now let us outline the main ideas.
774: Let $\langle {\cal S}=\{s_1,s_2,\ldots,s_n\},L \rangle$ be
775: an instance of
776: {\sc Closest Substring}, where $s_i$ is of length $m$.
777: Suppose that $s$ is its optimal center string and
778: $t_i$ is a length $L$ substring of $s_i$ which is
779: the closest to $s$ ($i=1,2,\ldots,n$).
780: Let $d_{opt}=\max _{i=1}^n d(s,t_i)$.
781: By trying all possibilities, we can assume that
782: $t_{i_1},t_{i_2},\ldots,t_{i_r}$ are the $r$ substrings $t_{i_j}$
783: that satisfy Lemma~\ref{KEY} by replacing $s_i$ by $t_i$ and $s_{i_j}$ by $t_{i_j}$.
784: Let $Q$ be the set of positions where $t_{i_1},t_{i_2},\ldots,t_{i_r}$
785: agree and $P=\{1,2,\ldots,L\}-Q$.
786: By Lemma~\ref{KEY}, $t_{i_1}|_Q$ is a good approximation to $s|_Q$.
We want to approximate $s|_P$ by the solution $x$ of the following
optimization problem~(\ref{opt2}), where $t'_i$ is a length-$L$ substring of $s_i$
that we are free to choose.
790: \begin{equation}
791: \label{opt2}
792: \left\{
793: \begin{array}{l}
794: \min \;\; d;\\
795: d(t'_i|_P,x) \leq d - d(t'_i|_Q,t_{i_1}|_Q), \;\;
796: i=1, \cdots , n; |x|=|P|.
797: \end{array}
798: \right.
799: \end{equation}
800:
801: The ideal choice is
802: $t'_i=t_i$, {\em i.e.}, $t'_i$ is the closest to $s$ among
803: all substrings of $s_i$.
804: However, we only approximately know $s$ in $Q$ and
805: know nothing about $s$ in $P$ so far.
806: So, we randomly pick $O(\log (mn))$ positions from $P$.
807: Suppose the multiset of these random positions is $R$.
808: By trying all possibilities, we can assume that
809: we know $s$ at these $|R|$ positions.
We then find the length-$L$ substring $t'_i$ of $s_i$ such that
$d(s|_R,t'_i|_{R})\times \frac {|P|}{|R|}+d(t_{i_1}|_Q,t'_i|_Q)$
is minimized. Then $t'_i$ is likely to be among the substrings of $s_i$
that are closest to $s$.
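
As a purely illustrative calculation (the numbers are hypothetical and
are not taken from any instance considered in this paper), suppose
$|P|=1000$ and $|R|=100$, and suppose that a candidate substring $t'$ of $s_i$
disagrees with $s$ at $7$ of the sampled positions (counted with
multiplicity) and with $t_{i_1}$ at $12$ positions of $Q$. The quantity
being minimized is then
\[
d(s|_R,t'|_R)\times \frac{|P|}{|R|}+d(t_{i_1}|_Q,t'|_Q)
= 7\times\frac{1000}{100}+12 = 82 ,
\]
so the scaled sample count $70$ plays the role of an estimate of the
unknown distance $d(s|_P,t'|_P)$.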
814:
Then we solve (\ref{opt2}) approximately by the method provided in
the proof of Lemma~\ref{lem-rest} and
combine the solution $x$ at $P$ with $t_{i_1}$ at $Q$. The
resulting string should be a good approximation to $s$.
819: The detailed algorithm (Algorithm closestSubstring)
820: is given in Figure~\ref{fig-alg}.
821: We prove Theorem~\ref{th-ptas} in the rest of the section.
822:
823: \begin{figure}[ht]
824: \begin{center}
825: {\normalfont\normalsize
826: \begin{tabular}{|l|}
827: \hline
828: \multicolumn{1}{|c|}{\bf Algorithm ~closestSubstring}
829: \\
830: \makebox[.55in][l]{{Input}} \parbox[t]{4.45in}
831: {$n$ sequences $\{s_1, s_2,\ldots,s_n\} \subseteq \Sigma^m$, integer $L$.}
832: \\
833: \makebox[.55in][l]{{Output}} \parbox[t]{4.45in}
834: {the center string $s$.}
835: \\
836: \makebox[.2in][l]{1.} \parbox[t]{4.8in}
837: {{\bf for} every $r$ length-$L$ substrings
838: $t_{i_1}, t_{i_2},\ldots, t_{i_r}$ (allowing repeats, but if $t_{i_j}$ and
839: $t_{i_k}$ are both chosen from the same $s_i$ then $t_{i_j}=t_{i_k}$)
840: of $s_1,\ldots ,s_n$
841: {\bf do}}
842: \\
843: \makebox[.5in][r]{(a)} \parbox[t]{4.5in}
844: {$Q=\{1\leq j \leq L \,|\, t_{i_1}[j]=t_{i_2}[j]=\ldots = t_{i_r}[j] \}$,
845: $P=\{1,2,\ldots, L\} - Q$.}
846: \\
847: \makebox[.5in][r]{(b)} \parbox[t]{4.5in}
848: {Let $R$ be a multiset containing
849: $\lceil \frac {4}{\epsilon ^2} \log(nm)\rceil$
850: uniformly random positions from $P$.
851: }
852: \\
853: \makebox[.5in][r]{(c)} \parbox[t]{4.5in}
854: {{\bf for} every string $y$ of length $|R|$ {\bf do}}
855: \\
856: \makebox[.7in][r]{(i)} \parbox[t]{4.3in}
857: {{\bf for} $i$ from $1$ to $n$ {\bf do}}
858: \\
859: \makebox[.8in][r]{} \parbox[t]{4.2in}
860: {Let $t'_i$ be a length $L$ substring of $s_i$
861: minimizing $d(y,t'_i|_{R})\times \frac {|P|}{|R|}+d(t_{i_1}|_Q,t'_i|_Q)$.
862: }
863: \\
864: \makebox[.7in][r]{(ii)} \parbox[t]{4.3in}
865: {Using the method provided in the proof of
866: Lemma~\ref{lem-rest}, solve the optimization
867: problem defined by Formula~(\ref{opt2}) approximately.
868: Let $x$ be the approximate solution within error $\epsilon\, |P|$.}
869: \\
870: \makebox[.7in][r]{(iii)} \parbox[t]{4.3in}
871: {Let $s'$ be the string such that $s'|_P=x$ and $s'|_Q=t_{i_1}|_Q$.
872: Let $c=\max^n_{i=1} \min_{\{t_i{\rm ~is~a~substring~of~}s_i\}} d(s',t_i)$.}
873: \\
874: \makebox[.2in][l]{2.} \parbox[t]{4.8in}
875: {{\bf for} every length-$L$ substring $s'$ of $s_1$ {\bf do}}
876: \\
877: \makebox[.5in][r]{} \parbox[t]{4.5in}
878: {Let $c=\max^n_{i=1} \min_{\{t_i{\rm ~is~a~substring~of~}s_i\}} d(s',t_i)$.}
879: \\
880: \makebox[.2in][l]{3.} \parbox[t]{4.8in}
881: {Output the $s'$ with minimum $c$ in step 1(c)(iii) and step 2.}
882: \\
883: \hline
884: \end{tabular}
885: \caption{The PTAS for the closest substring problem.}
886: \label{fig-alg}
887: }
888: \end{center}
889: \end{figure}
890:
891: \begin{theorem}
892: \label{th-ptas}
893: Algorithm closestSubstring is a PTAS for the closest substring problem.
894: \end{theorem}
895: \begin{proof}
896: Let $s$ be an optimal center string and $t_i$ be the
897: length-$L$ substring of $s_i$ that is the closest to $s$. Let
$d_{opt}=\max_{1\leq i\leq n} d(s,t_i)$. Let $\epsilon$ be any small
positive number and $r \geq 2$ be any fixed integer.
Let $\rho_0 = \max _{1\leq i,j\leq n} {d(t_i,t_j)}/{d_{opt}}$.
If $\rho _0 \leq 1+\frac{1}{2r-1}$, then step 2 finds a solution
$s'$ within ratio $\rho _0$, since the choice $s'=t_1$ gives
$d(s',t_i)\leq \rho_0\, d_{opt}$ for every $i$.
So, we assume that $\rho _0 > 1+\frac{1}{2r-1}$ from now on.
904:
905: By Lemma~\ref{KEY}, Algorithm~closestSubstring picks
906: a group of $t_{i_1},t_{i_2},\ldots,t_{i_r}$ in step 1
907: at some point such that
908:
909: \vspace{1ex}
910: \noindent
911: {\bf Fact 1~}
912: For any $1 \leq l \leq n$,
913: $
914: |\{j \in Q\,|\,
915: t_{i_1}[j] \neq t_l[j] \mbox{ and } t_{i_1}[j]
916: \neq s[j]\} |
917: \leq \frac{1}{2r-1} \,d_{opt}.
918: $
919: \vspace{1ex}
920:
Obviously, the algorithm takes $y$ as $s|_R$ at some point
922: in step 1(c). Let $y=s|_R$ and $t_{i_1},t_{i_2},\ldots,t_{i_r}$
923: satisfy Fact~1. Let $t'_i$ be defined as in step 1(c)(i).
924: Let $s^*$ be a string such that $s^*|_P=s|_P$ and
925: $s^*|_Q=t_{i_1}|_Q$. Then we claim:
926:
927: \vspace{1ex}
928: \noindent
929: {\bf Fact 2~}
930: With high probability,
931: $d(s^*,t'_i)\leq d(s^*,t_i)+ 2 \epsilon |P|$
932: for all $1\leq i\leq n$.\\
933:
934: \begin{proof}
935: For convenience, for any position multiset $T$,
936: we denote $d^T(t_1,t_2)=d(t_1|_T,t_2|_T)$ for any two
937: strings $t_1$ and $t_2$. Let $\rho=\frac{|P|}{|R|}$.
938: %wangl add one step
939: Consider any length $L$ substring $t'$ of $s_i$
940: satisfying
941: \begin{equation}
942: d(s^*, t')\geq d(s^*,t_i)+2 \epsilon |P|.
943: \label{eq-close150}
944: \end{equation}
Since $s^*|_Q=t_{i_1}|_Q$, we have $d^Q(t_{i_1},u)=d^Q(s^*,u)$
for every string $u$ of length $L$.
Together with Formula~(\ref{eq-close150}), it follows that
$
\rho\, d^R(s^*,t') + d^Q(t_{i_1},t')
\leq \rho\, d^R(s^*,t_i)+ d^Q(t_{i_1},t_i)$ implies
either
$\rho \, d^R(s^*,t') + d^Q(s^*,t') \leq d(s^*,t') -\epsilon |P|$
or
$\rho \, d^R(s^*,t_i)+ d^Q(s^*,t_i) \geq d(s^*,t_i) + \epsilon |P|$.
Thus, we have the following inequality:
954: \begin{eqnarray}
955: &&{\bf Pr} \left( \rho\, d^R(s^*,t') + d^Q(t_{i_1},t')
956: \leq \rho\, d^R(s^*,t_i)+ d^Q(t_{i_1},t_i) \right)
957: \nonumber\\
958: &\leq& {\bf Pr}\left(\rho \, d^R(s^*,t') + d^Q(s^*,t')
959: \leq d(s^*,t') -\epsilon |P|\right)+
960: \nonumber\\
961: &&{\bf Pr}\left(\rho \, d^R(s^*,t_i)+ d^Q(s^*,t_i)
962: \geq d(s^*,t_i) + \epsilon |P|\right).
963: \label{eq-close200}
964: \end{eqnarray}
965:
Note that $d^R(s^*,t')$ is the sum of $|R|$ independent
random 0-1 variables $ \sum _{j=1} ^{|R|} X_j$, where
$X_j=1$ indicates a mismatch between $s^*$ and $t'$ at
the $j$-th position in $R$.
970: %wangl change X_i .... to sum .... etc
971: Let $\mu = E[d^R(s^*,t')]$.
972: Obviously, $\mu=d^P(s^*,t') / \rho$.
973: %wangl -delete Then
974: Therefore, by Lemma~\ref{lem-chernoff1} (2),
975: \begin{eqnarray}
976: &&{\bf Pr}\left(\rho \, d^R(s^*,t') + d^Q(s^*,t') \leq d(s^*,t') -\epsilon |P|\right)
977: \nonumber\\
978: &=&{\bf Pr}\left(d^R(s^*,t') \leq (d(s^*,t')- d^Q(s^*,t'))/\rho -\epsilon |R|\right)
979: \nonumber\\
980: &=&{\bf Pr}\left(d^R(s^*,t') \leq d^P(s^*,t')/\rho -\epsilon |R|\right)
981: \nonumber\\
982: &=&{\bf Pr}\left(d^R(s^*,t') \leq \mu -\epsilon |R|\right)
983: \leq \exp\left(-\frac 12 \epsilon ^2 |R| \right) \leq (nm)^{-2},
984: \label{eq-close300}
985: \end{eqnarray}
986: %wangl --change the P in the last line to R.
987: where the last inequality is
988: due to the setting
989: %because of
990: $|R|=\lceil \frac {4}{\epsilon ^2}\log (nm)\rceil$ in
991: step 1(b) of the algorithm.
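
This last step can be checked directly. The calculation below assumes
that $\log$ in step 1(b) denotes the natural logarithm; the text does
not state the base, so this is an assumption. Since
$|R|\geq \frac{4}{\epsilon^2}\log (nm)$,
\[
\exp\left(-\frac 12\, \epsilon^2 |R|\right)
\leq \exp\left(-\frac 12\, \epsilon^2\cdot\frac{4}{\epsilon^2}\log (nm)\right)
= \exp\left(-2\log (nm)\right)
= (nm)^{-2}.
\]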
992: %wangl -- I do not like "because of"
993: Similarly, using Lemma~\ref{lem-chernoff1} (1) we have
994: \begin{equation}
995: \label{eq-close400}
996: {\bf Pr}\left(\rho \, d^R(s^*,t_i)+ d^Q(s^*,t_i)
997: \geq d(s^*,t_i) + \epsilon |P|\right)
998: \leq (nm)^{-\frac 43}.
999: \end{equation}
Combining Formulas~(\ref{eq-close200}), (\ref{eq-close300}), and (\ref{eq-close400}),
we know that for any $t'$ that satisfies Formula~(\ref{eq-close150}),
1002: \begin{equation}
1003: {\bf Pr} \left( \rho\, d^R(s^*,t') + d^Q(t_{i_1},t')
1004: \leq \rho\, d^R(s^*,t_i)+ d^Q(t_{i_1},t_i) \right)
1005: \leq 2\, (nm)^{-\frac 43}.
1006: \label{eq-close500}
1007: \end{equation}
For any fixed $1\leq i\leq n$, there are fewer than $m$ substrings $t'$
that satisfy Formula~(\ref{eq-close150}). Thus,
from Formula~(\ref{eq-close500}) and the definition of $t'_i$,
1011: \begin{equation}
1012: {\bf Pr}\left( d(s^*,t'_i) \geq d(s^*,t_i)+ 2 \epsilon |P| \right)
1013: \leq 2\,n^{-\frac {4}{3}}m^{-\frac 13}.
1014: \end{equation}
Summing over all $i\in [1,n]$, we know that with probability
1016: at least $1-2\,(nm)^{-\frac 13}$,
1017: $d(s^*,t'_i)\leq d(s^*,t_i)+ 2 \epsilon |P|$ for all $i$.
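
For reference, the two union bounds used in this argument amount to the
following arithmetic: first over the fewer than $m$ candidate
substrings of a fixed $s_i$, and then over the $n$ strings,
\[
m\cdot 2\,(nm)^{-\frac 43} = 2\,n^{-\frac 43}m^{-\frac 13},
\qquad
n\cdot 2\,n^{-\frac 43}m^{-\frac 13} = 2\,(nm)^{-\frac 13}.
\]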
1018: $\qed$
1019: \end{proof}
1020:
1021: From Fact 1,
1022: $d(s^*,t_i)=d^P(s,t_i)+d^Q(t_{i_1},t_i)\leq d(s,t_i)+\frac 1{2r-1} \, d_{opt}.$
1023: Combining with Fact~2 and $|P|\leq r\, d_{opt}$,
1024: we get
1025: \begin{equation}
1026: d(s^*,t'_i)\leq (1+ \frac 1{2r-1} + 2 \epsilon \, r) d_{opt}.
1027: \label{eq-close600}
1028: \end{equation}
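Written out step by step, the chain behind
Formula~(\ref{eq-close600}) is, for each $1\leq i\leq n$,
\begin{eqnarray*}
d(s^*,t'_i)
&\leq& d(s^*,t_i)+2\epsilon |P|
\\&\leq& d(s,t_i)+\frac 1{2r-1}\, d_{opt}+2\epsilon\, r\, d_{opt}
\\&\leq& \left(1+\frac 1{2r-1}+2\epsilon\, r\right) d_{opt},
\end{eqnarray*}
where the first step is Fact~2 (and so holds with high probability),
the second step uses the bound derived from Fact~1 together with
$|P|\leq r\, d_{opt}$, and the third step uses $d(s,t_i)\leq d_{opt}$.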
1029: By the definition of $s^*$, the optimization
1030: problem defined by Formula~(\ref{opt2}) has a solution $s|_P$
1031: such that $d\leq (1+ \frac 1{2r-1} + 2 \epsilon \, r) d_{opt}$.
1032: We can solve the optimization problem within error $\epsilon |P|$ by the method in
1033: the proof of Lemma~\ref{lem-rest}.
1034: Let $x$ be the solution of the optimization problem.
1035: %wangl -breake the sentence
1036: Then by Formula~(\ref{opt2}), for any $1\leq i\leq n$,
1037: \begin{equation}
1038: d(t'_i|_P,x)\leq
1039: (1+ \frac 1{2r-1} + 2 \epsilon \, r) d_{opt}
1040: -d(t'_i|_Q,t_{i_1}|_Q)+\epsilon |P|.
1041: \label{close-800}
1042: \end{equation}
Let $s'$ be as defined in step 1(c)(iii). Then, by Formula~(\ref{close-800}),
1044: \begin{eqnarray*}
1045: d(s',t'_i)&=&d(x,t'_i|_P)+ d(t_{i_1}|_Q,t'_i|_Q)
1046: \\&\leq& (1+ \frac 1{2r-1} + 2 \epsilon r) d_{opt} + \epsilon |P|
1047: \\&\leq& (1+\frac 1{2r-1} + 3\epsilon r) d_{opt}.
1048: \end{eqnarray*}
1049:
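The algorithm runs in polynomial time for
any fixed positive $r$ and $\epsilon$: step 1 enumerates $O((nm)^r)$
choices of substrings, step 1(c) enumerates
$|\Sigma|^{|R|}=(nm)^{O(\log |\Sigma| /\epsilon^2)}$ strings $y$,
and each remaining step takes polynomial time. For any $\delta>0$, by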
1052: properly setting $r$ and $\epsilon$ such that
1053: $\frac 1{2r-1} + 3\epsilon r \leq \delta$, with high probability,
1054: the algorithm outputs in polynomial time a solution $s'$
1055: such that $d(t'_i,s')$ is no more than $(1+\delta)d_{opt}$
1056: for every $1\leq i\leq n$, where $t'_i$ is a substring of $s_i$.
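
One concrete way to meet the condition
$\frac 1{2r-1} + 3\epsilon r \leq \delta$ is sketched below. It is
given only as an illustrative choice of parameters and is not
necessarily the setting intended by the analysis above:
\[
r=\max\left\{2,\ \left\lceil \frac 1\delta+\frac 12 \right\rceil\right\},
\qquad
\epsilon=\frac{\delta}{6r}.
\]
With this choice, $2r-1\geq \frac 2\delta$, so
$\frac 1{2r-1}\leq \frac \delta 2$, and $3\epsilon r=\frac \delta 2$;
the two terms therefore sum to at most $\delta$.
For example, $\delta=0.1$ gives $r=11$ and $\epsilon=\frac 1{660}$, for
which $\frac 1{21}+\frac 1{20} < 0.1$.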
1057: %wangl change "and" to "where", add "for"
1058: The algorithm can be derandomized by standard methods~\cite{MR95}.
1059: $\qed$
1060: \end{proof}
1061:
1062: \section*{Acknowledgements}
1063: We would like to thank Tao Jiang, Kevin Lanctot,
1064: Joe Wang, and Louxin Zhang for discussions and suggestions on related
1065: topics.
1066:
Ming Li is supported in part by the
NSERC Research Grant OGP0046506, a CGAT grant, and
the E.W.R. Steacie Fellowship. Bin Ma is supported in part by
1070: the NSERC Research Grant OGP0046506. Bin Ma and Lusheng Wang are
1071: supported in part by HK RGC Grants 9040297, 9040352, 9040444 and CityU
1072: Strategic Grant 7000693.
1073: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1074: \begin{thebibliography}{99}
1075:
1076: %\bibitem{B94}
1077: %V. Bafna, E. Lawler and P. Pevzner,
1078: %Approximation algorithms for multiple sequence alignment,
1079: %{\em Proc.\ 8th Ann. Combinatorial Pattern
1080: %Matching Conf.\ (CPM'94)}, pp. 43-53, 1994.
1081:
1082: \bibitem{BLPR97}
1083: A. Ben-Dor, G. Lancia, J. Perone, and R. Ravi,
1084: Banishing bias from consensus sequences,
1085: {\em Proc.\ 8th Ann.\ Combinatorial Pattern
1086: Matching Conf.\ (CPM'97)}, pp. 247-261, 1997.
1087:
1088: \bibitem{BGHMS97}
1089: P. Berman, D. Gumucio, R. Hardison, W. Miller, and N. Stojanovic,
1090: A linear-time algorithm for the 1-mismatch problem,
1091: {\em WADS'97}, 1997.
1092:
1093: %\bibitem{CHS95}
1094: %Q. Chan, G. Hertz, G. Stormo,
1095: %Matrix search 1.0: a computer program that
1096: %scans DNA sequences for transcriptional elements using a database of weight
1097: %matrices, {\em CABIOS} (1995) 563-566.
1098: %{Computer Applications in the Biosciences}, pp. 563-566, 1995.
1099:
1100: \bibitem{DR+93}
1101: J. Dopazo, A. Rodr\'{\i}guez, J. C. S\'{a}iz, and F. Sobrino,
1102: Design of primers for PCR amplification of highly variable genomes,
1103: {\it CABIOS}, 9(1993), 123-125.
1104:
1105: %\bibitem{FM95}
1106: %Y. M. Fraenkel, Y Mandel, D. Friedberg and H. Margalit,
1107: %Identification of common motifs in unaligned DNA sequences: application to
1108: %Escherichia coli Lrp regulon,
1109: %{\em CABIOS}, (1995) 379-387.
1110: %{Computer Applications in the Biosciences}, pp. 379-387, 1995.
1111:
1112: \bibitem{FL97}
M. Frances and A. Litman, On covering problems of codes,
1114: {\em Theor.\ Comput.\ Syst.} 30(1997) 113-119.
1115:
1116: %\bibitem{GJ79}
1117: %M. Garey and D. Johnson,
1118: %{\em Computers and Intractability, a guild to the theory of
1119: %NP-completeness}, Freeman, 1979.
1120:
1121: \bibitem{GJL99}
1122: L. G\c{a}sieniec, J. Jansson, and A. Lingas,
1123: Efficient approximation algorithms for the Hamming center problem,
1124: {\it Proc.\ 10th ACM-SIAM Symp. on Discrete Algorithms}, pp. S905-S906, 1999.
1125:
1126: \bibitem{G93}
1127: D. Gusfield,
1128: Efficient methods for multiple sequence alignment with guaranteed
1129: error bounds,
1130: {\it Bull. Math. Biol.}, vol. 30, pp. 141-154, 1993.
1131:
1132: \bibitem{Danbook}
1133: D. Gusfield,
1134: {\em Algorithms on Strings, Trees, and Sequences},
1135: Cambridge Univ. Press, 1997.
1136:
1137: \bibitem{HS94}
1138: G. Hertz and G. Stormo,
1139: Identification of consensus patterns in
1140: unaligned DNA and protein sequences: a large-deviation statistical
1141: basis for penalizing gaps.
1142: In: {\em Proc.\ 3rd Int'l
1143: Conf. Bioinformatics and Genome Research} (Lim and Cantor, eds.) World
1144: Scientific, 1995, pp. 201-216.
1145:
1146: %\bibitem{Hoe63}
1147: %W. Hoeffding,
1148: %Probability inequalities for sums of bound random variables.
1149: %{\it J. Amer. Statist. Assoc.}, 58(1963), 13-30.
1150:
1151: %\bibitem{K72}
1152: %R. Karp,
1153: %Reducibility among combinatorial problems,
1154: %in R.E. Miller and J.W. Thatcher (eds), {\em Complexity
1155: %of Computer Computations}, Plenum Press, pp. 85-103, 1972.
1156:
1157: %\bibitem{KKK95}
1158: %Y. V. Kondrakhin, A.E. Kel, N.A. Kolchanov, A.G. Romashchenko,
1159: %and L. Milanesi,
1160: %Eukaryotic promoter recognition by binding sites for transcription factors,
1161: %{\em CABIOS}, pp. 477-488, 1995.
1162: %{Computer Applications in the Biosciences}, pp. 477-488, 1995.
1163:
1164: \bibitem{LR90}
1165: C. Lawrence and A. Reilly,
1166: An expectation maximization (EM) algorithm for the identification
1167: and characterization of common sites in unaligned biopolymer
1168: sequences, {\em Proteins} 7(1990) 41-51.
1169:
1170: \bibitem{LB+91}
1171: K. Lucas, M. Busch, S. M\"{o}ssinger and J.A. Thompson,
1172: An improved microcomputer program for finding gene- or gene family-specific
1173: oligonucleotides suitable as primers for polymerase chain reactions or
1174: as probes, {\it CABIOS}, 7(1991), 525-529.
1175:
1176: \bibitem{LL+99}
1177: K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang,
1178: Distinguishing string selection problems,
1179: {\it Proc.\ 10th ACM-SIAM Symp. on Discrete Algorithms}, pp. 633-642, 1999.
1180: Also to appear in {\em Information and Computation}.
1181:
1182: \bibitem{LMW99}
1183: M. Li, B. Ma, and L. Wang, Finding Similar Regions in Many Strings,
1184: {\it Proceedings of the Thirty-first Annual ACM Symposium on Theory of
1185: Computing}, pp. 473-482, Atlanta, 1999.
1186:
1187: \bibitem{LMW99-j}
1188: M. Li, B. Ma, and L. Wang, Finding Similar Regions in Many Sequences,
1189: submitted to {\em J.\ Comput.\ Syst.\ Sci.} special issue for
1190: {\it Thirty-first Annual ACM Symposium on Theory of Computing}, 1999.
1191:
1192: \bibitem{M00}
1193: B. Ma, A polynomial time approximation scheme for the closest substring
1194: problem, to appear in
1195: {\it Proc.\ 11th Annual Symposium on Combinatorial Pattern Matching},
1196: Montreal, June 21-23, 2000.
1197:
1198: %\bibitem{LV97}
1199: %M. Li, P. Vit\'anyi,
1200: %{\em An Introduction to Kolmogorov Complexity and Its Applications},
1201: %Springer, 1993.
1202:
1203: \bibitem{MR95}
1204: R. Motwani and P. Raghavan,
1205: {\em Randomized Algorithms},
1206: Cambridge Univ. Press, 1995.
1207:
1208: %\bibitem{P92}
1209: %P. Pevzner,
1210: %Multiple alignment, communication cost, and graph matching,
1211: %{\it SIAM J. Applied Math.}, 52(1992), 1763-1779.
1212:
1213: %\bibitem{P96}
1214: %D. S. Prestridge,
1215: %SIGNAL SCAN 4.0: additional databases and sequence formats,
1216: %{\em CABIOS} (1996) 157-160.
1217: %{Computer Applications in the Biosciences}, pp. 157-160, 1996.
1218: %%%%???Lusheng, I convert all {Computer Applications in the
1219: %%%%Biosciences} to CABIOS to shorten to make space ... I hope this is
1220: %%%%correct:-)
1221:
1222: \bibitem{PBPR89}
1223: J. Posfai, A. Bhagwat, G. Posfai, and R. Roberts,
1224: Predictive motifs derived from cytosine methyltransferases,
1225: {\em Nucl. Acids Res.}, 17(1989), 2421-2435.
1226:
1227: \bibitem{PH96}
V. Proutski and E. C. Holmes,
1229: Primer Master: a new program for the design and analysis of PCR
1230: primers, {\it CABIOS}, 12(1996), 253-255.
1231:
1232: %\bibitem{R92}
1233: %M.A. Roytberg,
1234: %A search for common patterns in many sequences,
1235: %{\em CABIOS} (1992) 57-64.
1236: %{Computer Applications in the Biosciences}, pp. 57-64, 1992.
1237:
1238: \bibitem{SAL91}
1239: G. D. Schuler, S. F. Altschul, and D. J. Lipman,
1240: A workbench for multiple alignment construction and analysis,
1241: {\em Proteins: Structure, Function and Genetics}, 9(1991) 180-190.
1242:
1243: \bibitem{S90}
1244: G. Stormo,
1245: Consensus patterns in DNA,
1246: in R.F. Doolittle (ed.), {\em Molecular evolution: computer
1247: analysis of protein and nucleic acid sequences},
1248: {\em Methods in Enzymology}, 183, pp. 211-221, 1990.
1249:
1250: \bibitem{SH91}
1251: G. Stormo and G.W. Hartzell III,
1252: Identifying protein-binding sites from
1253: unaligned DNA fragments. {\em Proc.\ Natl.\ Acad.\ Sci.\ USA},
1254: 88(1991), 5699-5703.
1255:
1256: %\bibitem{WFHW96}
1257: %F. Wolfertstetter, K. Frech, G. Herrmann and T. Werner,
1258: %{\em CABIOS} (1996) 71-80.
1259: %{Computer Applications in the Biosciences}, pp. 71-80, 1996.
1260:
1261: \bibitem{W86}
1262: M. Waterman,
1263: Multiple sequence alignment by consensus,
1264: {\em Nucl. Acids Res.}, 14(1986), 9095-9102.
1265:
1266: \bibitem{WAG84}
1267: M. Waterman, R. Arratia and D. Galas,
1268: Pattern recognition in several sequences: consensus and alignment,
1269: {\em Bull. Math. Biol.}, 46(1984), 515-527.
1270:
1271: \bibitem{WP84}
1272: M. Waterman and M. Perlwitz,
1273: Line geometries for sequence comparisons,
1274: {\em Bull. Math. Biol.}, 46(1984), 567-577.
1275:
1276: \end{thebibliography}
1277:
1278: \end{document}
1279: