
1: \documentstyle[11pt]{article}

2: \setlength{\oddsidemargin}{-.15in}

3: \setlength{\evensidemargin}{0pt}

4: \setlength{\headsep}{0pt}

5: \setlength{\topmargin}{-30pt}

6: \setlength{\textheight}{8.5in}

7: \setlength{\textwidth}{6.0in}

8:

9: \renewcommand{\baselinestretch}{1.2}

10:

11: \def\mod {\rm \ mod\ }

12: \def\oh {\cal O}

13: \def\End{\rm End}

14:

15: \newcommand{\qed}{\ \ \ \rule{7pt}{8pt}\medskip}

16: \newcommand{\partialqed}{\ \ \ \raisebox{2pt}{\framebox[7pt]{\ }}\medskip}

17: \newtheorem{defin}{Definition}

18: \newcommand{\bracket}[1]{\langle #1 \rangle}

19: \newcommand{\barr}{\overline}

20: \newcommand{\floor}[1]{\lfloor #1 \rfloor}

21: %General Forms

22: \newcommand{\proof}{{\bf Proof. \enspace}}

23: %\newcommand{\proof}{{\sc Proof \enspace}}

24: \newcommand{\comment}[1]{}

25: \newcommand{\hs}{\enspace}

26: \newcommand{\hhs}{\thinspace}

27:

28: \newcommand{\diverges}{\uparrow}

29: \newcommand{\converges}{\downarrow}

30:

31: \newcommand{\inter}{\bigcap}

32: \newcommand{\real}{\ifmmode {\rm R} \else ${\rm R}$ \fi}

33: \newcommand{\nat}{\ifmmode {\rm N} \else ${\rm N}$  \fi}

34: \newcommand{\tot}{\ifmmode {\cal T} \else ${\cal T}$ \fi}

35: \newcommand{\sigstar}{\ifmmode \Sigma^{\ast} \else $\Sigma^{\ast}$ \fi}

36:

37: \newtheorem{theorem}{Theorem}

38: \newtheorem{lemma}[theorem]{Lemma}

39: \newtheorem{corollary}[theorem]{Corollary}

40: \newtheorem{definition}{Definition}

41: \newtheorem{claim}[theorem]{Claim}

42: \newtheorem{conjecture}[theorem]{Conjecture}

43: \newlength{\thislabel}

44: \newcommand{\labsize}[1]{\settowidth{\thislabel}{#1}}

45: \def\lablimer2stlabel#1{\rm #1\hfil}

46: \def\pp{\par\noindent}

47:

48: \title{On The Closest String and Substring Problems

49: %\title{Approximating The Hamming Center

50: \footnote{Some of the results

51: in this paper have been

52: presented in {\em Proc.\ 31st ACM Symp.\ Theory of Computing}, May,

53: 1999 \cite{LMW99},

54: and in {\em Proc.\ 11th Symp. Combinatorial Pattern Matching}, June,

55: 2000, \cite{M00}.}}

56: \author{

57: Ming Li \\

58: Department of Computer Science\\

59: University of Waterloo\\

60: Waterloo, Ont. N2L 3G1, Canada\\

61:  E-mail: mli@math.uwaterloo.ca

62: \and

63: Bin Ma\\

64: Department of Computer Science \\

65: University of Waterloo\\

66: Waterloo, Ont. N2L 3G1, Canada\\

67: E-mail: b3ma@wh.math.uwaterloo.ca

68: \and

69: Lusheng Wang\\

70: Department of Computer Science \\

71: City University of Hong Kong \\

72: Kowloon, Hong Kong \\

73: E-mail: lwang@cs.cityu.edu.hk

74: }

75: \date{}

76:

77: \begin{document}

78: \maketitle

79:

80: \begin{abstract}

81: The problem of finding a center string that is `close' to every

82: given string arises in many applications in computational molecular

83: biology and coding theory.

84:

85: This problem has two versions: the Closest String

86: problem and the Closest Substring problem.

87: Assume that we are given a set

88: ${\cal S}=\{s_1, s_2, \ldots, s_n\}$ of strings, each of length $m$.

89: The Closest String problem~\cite{BLPR97,BGHMS97,FL97,GJL99,LL+99}

90: asks for the smallest $d$ and a string $s$  of length $m$ which is within

91: Hamming distance $d$ to each $s_i\in {\cal S}$.

92: This problem comes from coding theory when we are looking for a code

93: not too far away from a given set of codes \cite{FL97}.

94: The problem is NP-hard~\cite{FL97,LL+99}. Berman {\em et al.}

95: \cite{BGHMS97} give a polynomial time

96: algorithm for constant $d$. For super-logarithmic $d$,

97: Ben-Dor {\em et al.}\ \cite{BLPR97} give an efficient approximation algorithm

98: using a linear programming relaxation technique.

99: The best polynomial time approximation has ratio $\frac{4}{3}$

100: for all $d$, given by \cite{LL+99} and \cite{GJL99}.

101: The Closest Substring problem looks for a string $t$ which is

102: within Hamming distance $d$ of a substring of each $s_i$.

103: Previously, only a $2- \frac{2}{2|\Sigma|+1}$ approximation

104: algorithm was known for this problem \cite{LL+99}. The problem

105: is much more elusive than the Closest String problem, but

106: it has many applications in finding

107: conserved regions, genetic drug target identification, and genetic probes

108: in molecular biology~\cite{HS94,LR90,LB+91,PBPR89,PH96,

109: S90,SH91,W86,WAG84,WP84,LL+99}. Whether there is an efficient

110: approximation algorithm for each of these problems has been

111: a major open question in this area.

112:

113: We present two polynomial time approximation algorithms with

114: approximation ratio $1+ \epsilon$ for any small $\epsilon$

115: to settle both questions.

116:

117: \end{abstract}

118:

119: \section{Introduction}

120: \label{sec-intro}

121: Many problems in molecular biology involve finding similar

122: regions common to each sequence in a given set

123: of DNA, RNA, or protein sequences. These problems find applications in

124: locating binding sites and finding conserved

125: regions in unaligned sequences~\cite{SH91,LR90,HS94,S90},

126: genetic drug target identification~\cite{LL+99}, designing genetic

127: probes \cite{LL+99}, universal PCR primer design~\cite{LB+91,DR+93,PH96,LL+99},

128: and, outside computational biology, in coding theory~\cite{FL97,GJL99}.

129: Such problems may be considered to be various generalizations of the common

130: substring problem, allowing errors.

131: Many objective functions have been proposed

132: for finding such regions common to all the given strings.

133: A popular and most fundamental measure is the Hamming distance. Other

134: measures, like the relative entropy measure used

135: by Stormo and his coauthors \cite{HS94}, may be

136: considered generalizations of the Hamming distance; they require

137: different techniques and are considered in \cite{LMW99-j}.

138:

139: Let $s$ and $s'$ be finite strings.

140: Let $d(s,s')$ denote the Hamming distance between $s$ and $s'$.

141: $|s|$ is the length of $s$. $s[i]$ is the $i$-th character

142: of $s$. Thus, $s=s[1]s[2] \ldots s[|s|]$.

143: The following are the problems we study in this paper:

144:

145: \vspace{1ex}

146: \noindent

147: {\sc Closest String:} Given a set ${\cal S}=\{s_1,s_2,\ldots,s_n\}$

148: of strings each of length $m$, find a

149: center string $s$ of length $m$ minimizing $d$

150: such that for every string $s_{i}\in {\cal S}$, $d(s, s_{i})\leq d$.

151:

152: \vspace{1ex}

153: \noindent

154: {\sc Closest Substring:} Given a set ${\cal S}=\{s_1,s_2,\ldots,s_n\}$

155: of strings, and an integer $L$, find a center string $s$ of length $L$

156: minimizing $d$ such that for each $s_i\in {\cal S}$ there is

157: a length $L$ substring $t_i$ of $s_i$ with  $d(s, t_{i})\leq d$.

158: \vspace{1ex}

159: %wangl -- change the definition of the problem  a little bit

160:
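\noindent As a concrete reading of the two definitions above, the following Python sketch computes the Hamming distance and the cost (radius) of a candidate center under each problem. The function names ({\tt hamming}, {\tt string\_radius}, {\tt substring\_radius}, {\tt brute\_force\_closest\_string}) are illustrative assumptions rather than notation from this paper, and the exhaustive search is only a reference for tiny inputs, not one of the algorithms developed in later sections.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Hamming distance d(a, b) between two strings of equal length."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def string_radius(center: str, strings) -> int:
    """Closest String cost of a center: max_i d(center, s_i)."""
    return max(hamming(center, s) for s in strings)


def substring_radius(center: str, strings) -> int:
    """Closest Substring cost of a center of length L:
    max over s_i of the min over length-L windows t_i of d(center, t_i).
    Assumes every s_i has length at least L."""
    L = len(center)
    return max(
        min(hamming(center, s[k:k + L]) for k in range(len(s) - L + 1))
        for s in strings
    )


def brute_force_closest_string(strings, alphabet):
    """Exhaustive search over all |alphabet|^m centers (tiny inputs only)."""
    m = len(strings[0])
    candidates = ("".join(c) for c in product(alphabet, repeat=m))
    best = min(candidates, key=lambda c: string_radius(c, strings))
    return best, string_radius(best, strings)
```

On the hypothetical input {\tt AAAA}, {\tt AABB}, {\tt ABAB} over the alphabet $\{{\tt A},{\tt B}\}$, the optimal cost is $1$, attained for instance by the center {\tt AAAB}.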

161: {\sc Closest String} has been widely and independently studied

162: in different contexts. In the context of coding theory

163: it was shown to be NP-hard~\cite{FL97}. In DNA sequence related topics,

164: \cite{BGHMS97} gave an exact algorithm when the distance $d$ is a constant.

165: \cite{BLPR97,GJL99} gave near-optimal

166: approximation algorithms only for large $d$ (super-logarithmic in the number of

167: sequences); however the straightforward linear programming relaxation technique

168: does not work when $d$ is small because the randomized

169: rounding procedure introduces large errors.

170: This is exactly the reason why \cite{GJL99,LL+99}

171: analyzed more involved approximation algorithms, and

172: obtained ratio-$\frac{4}{3}$ approximation algorithms.

173: Note that small $d$ is key in applications such as

174: %wangl -- do we need "the key" instead of "key"??

175: genetic drug target search where we look for similar regions to which

176: a complementary drug sequence would bind. It is a major open

177: problem~\cite{FL97,BGHMS97,BLPR97,GJL99,LL+99} to achieve the best

178: approximation ratio for this problem. (Justifications for using Hamming

179: distance can also be found in these references, especially \cite{LL+99}.)

180: We present a polynomial time approximation scheme (PTAS), settling the problem.

181:

182: {\sc Closest Substring} is a more general version of the

183: {\sc Closest String} problem. Obviously, it is also NP-hard.

184: In applications such as drug target identification and

185: genetic probe design, the radius $d$ is usually small.

186: Moreover, when the radius $d$ is small,

187: the center strings can also be used

188: as {\it motifs} in {\it repeated-motif} methods  for

189: multiple sequence alignment problems

190: ~\cite{Danbook, PBPR89,SAL91,W86,WAG84,WP84},

191: that repeatedly find motifs and recursively decompose the sequences

192: into shorter sequences.

193: A trivial ratio-$2$ approximation was given in~\cite{LL+99}.

194: We presented the first nontrivial

195: algorithm with approximation ratio $2- \frac{2}{2|\Sigma | +1}$,

196: in \cite{LMW99}. This is a key open problem in search of a

197: potential genetic drug sequence which is ``close''

198: to some sequences (of harmful germs) and ``far'' from some other sequences

199: (of humans). The problem appears to be much more elusive than

200: {\sc Closest String}.

201: We extend the techniques developed for {\sc Closest String} here

202: to design a PTAS for {\sc Closest Substring} when

203: $d$ is small, i.e., $d\leq O(\log N)$, where $N$

204: is the input size of the instance.

205: Using a {\em random sampling} technique, and combining our

206: methods for {\sc Closest String}, we then design a PTAS

207: for {\sc Closest Substring}, for all $d$.

208:

209: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

210: \section{Approximating {\sc Closest String}}

211: \label{sec-closeststring}

212: In this section, we give a PTAS for {\sc Closest String}.

213: %wangl

214: %improve the $4/3$ approximation

215: %in \cite{LL+99} for {\sc Closest String} to a PTAS.

216: We note that a direct application of LP relaxation

217: in \cite{BLPR97} does not work when the optimal solution is small.

218: Rather we extend an idea in \cite{LL+99} to do LP relaxation

219: only to a fraction of the bits.

220: Let ${\cal S}=\{s_1, s_2, \ldots, s_n\}$ be a set of  $n$ strings each of

221: length $m$.

222:

223: The idea is as follows. Let $r$ be a constant.

224: Choose a subset

225: of $r$ strings from ${\cal S}$ and consider the bits on which they all agree.

226: Intuitively, we can replace the corresponding bits

227: in the optimal solution by these bits of the $r$ strings,

228: and this will only slightly worsen the solution.

229: Lemma~\ref{KEY} shows that this is true for at least one subset

230: of $r$ strings. Then all we

231: need to do is to optimize on the positions (bits) where they do not agree,

232: by LP relaxation and randomized rounding.

233:

234: We first introduce some notation.  Let

235: $P=\{j_1,j_2,\ldots,j_k\}$ be a set (multiset) and

236: $1\leq j_1 \leq j_2 \leq \cdots \leq j_k\leq m$.

237: $P$ is called a {\it position set} ({\it multiset}).

238: Let $s$ be a string of length $m$,

239: then $s|_P$ is the string $s[j_1]\,s[j_2]\,\cdots \,s[j_k]$.

240:
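\noindent As a small illustration with a hypothetical string: if $s={\tt ACGTA}$ (so $m=5$) and $P=\{1,3,3,5\}$ is a position multiset, then
$$
s|_P = s[1]\,s[3]\,s[3]\,s[5] = {\tt AGGA}.
$$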

241: For any  $k \geq 2$, let $1 \leq i_1,i_2,\ldots,i_k \leq n$ be

242: $k$ distinct numbers.

243: Let $Q_{i_1,i_2,\ldots,i_k}$ be the set of positions

244: where $s_{i_1},s_{i_2},\ldots,s_{i_k}$ agree.

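245: Let $s$ denote an optimal center string and $d_{opt}=\max_{1\leq i\leq n} d(s,s_i)$ the optimal cost. Since each $s_i$ differs from $s$ in at most $d_{opt}$ positions, $|Q_{i_1,i_2,\ldots,i_k}| \geq m- kd_{opt}$.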

246: Let $\rho_0 = \max _{1\leq i,j\leq n} {d(s_i,s_j)}/{d_{opt}}$.

247: The following lemma is the key of our approximation algorithm.

248: \begin{lemma}\label{KEY}

249: For any constant $r \geq 2$, if $\rho_0>  1+\frac{1}{2r-1}$, then

250: there are indices

251: $1 \leq i_1,i_2,\ldots,i_r \leq n$ such that for any $1 \leq l \leq n$,

252: %mab: n is the number of sequences. m is the length of string.

253: %$$

254: %| \{j \in Q_{i_1,i_2,\ldots,i_r} \,|\,

255: % s_{i_1}[j] \neq s_l[j] \mbox{ and } s_{i_1}[j]

256: %\neq s[j]\} |

257: %\leq   \frac{1}{2r-1} \,d_{opt}.

258: %$$

259: $$

260: d(s_l|_{Q_{i_1,i_2,\ldots,i_r}}, s_{i_1}|_{Q_{i_1,i_2,\ldots,i_r}})

261: - d(s_l|_{Q_{i_1,i_2,\ldots,i_r}}, s|_{Q_{i_1,i_2,\ldots,i_r}})

262: \leq \frac 1{2r-1} d_{opt}.

263: $$

264: \end{lemma}

265: \begin{proof}

266: Let $p_{i_1,i_2,\ldots,i_k}$ be the number of mismatches between

267: $s_{i_1}$ and $s$ at the positions in $Q_{i_1,i_2,\ldots,i_k}$.  Let

268: $

269: \rho _k = \min _{1\leq i_1,i_2,\ldots,i_k\leq n}

270:  {p_{i_1,i_2,\ldots,i_k}} /{d_{opt}}.

271: $

272: First, we prove the following claim.

273: \begin{claim}\label{Fact1}

274: For any $k$ such that $2 \leq k \leq r$, where $r$ is the constant

275: in the algorithm closestString, there are indices

276: $1 \leq i_1,i_2,\ldots,i_r \leq n$ such that for any $1 \leq l \leq n$,

277: %wangl n ==>m

278: $$%\begin{eqnarray*}

279:  | \{j \in Q_{i_1,i_2,\ldots,i_r} \,|\,

280:  s_{i_1}[j] \neq s_l[j] \mbox{ and } s_{i_1}[j]

281: \neq s[j]\} |

282: \leq  (\rho _k - \rho _{k+1}) \,d_{opt}

283: $$%\end{eqnarray*}

284: \end{claim}

285: \begin{proof}

286: Consider indices $1\leq i_1,i_2,\ldots,i_k\leq n$ such that

287: %wangl n==>m

288: ${p_{i_1,i_2,\ldots, i_k}}= \rho _k d_{opt}$.

289: Then for any $1\leq i_{k+1},i_{k+2},\ldots,i_{r} \leq n$

290: %wangl n==>m

291: and $1\leq l\leq n$, we have

292: \begin{eqnarray}

293: && | \{j \in Q_{i_1,i_2,\ldots,i_r} \,|\,

294:  s_{i_1}[j] \neq s_l[j] \mbox{ and } s_{i_1}[j] \neq s[j]\}

295: |

296: \nonumber\\

297: &\leq & | \{j \in Q_{i_1,i_2,\ldots,i_k} \,|\,

298:  s_{i_1}[j] \neq s_l[j] \mbox{ and }  s_{i_1}[j] \neq s[j]\} | \label{eq-tmp11}\\

299: &=&

300: | \{ j \in Q_{i_1,i_2,\ldots,i_k} \,|\, s_{i_1}[j] \neq s[j] \}

301: %\nonumber \\

302: %&&

303:  - \{ j \in Q_{i_1,i_2,\ldots,i_k} \,|\,

304: s_{i_1}[j]=s_l[j] \mbox{ and } s_{i_1}[j] \neq s[j] \} |

305: \nonumber\\

306: &=&

307: | \{ j \in Q_{i_1,i_2,\ldots,i_k} \,|\, s_{i_1}[j] \neq s[j] \}

308:  - \{ j \in Q_{i_1,i_2,\ldots,i_k,l} \,|\, s_{i_1}[j] \neq s[j] \}|

309: \nonumber\\

310: &=&

311: p_{i_1,i_2,\ldots,i_k} - p_{i_1,i_2,\ldots,i_k,l}

312: \label{eq-tmp12}\\

313: &\leq& (\rho _k - \rho _{k+1}) \,d_{opt},

314: \nonumber

315: \end{eqnarray}

316: where Inequality (\ref{eq-tmp11}) is  from the fact that

317: $Q_{i_1,i_2,\ldots,i_r} \subseteq Q_{i_1,i_2,\ldots,i_k}$

318: and Equality (\ref{eq-tmp12}) is  from  the fact that

319: $Q_{i_1,i_2,\ldots,i_k,l} \subseteq Q_{i_1,i_2,\ldots,i_k}$.

320: $\Box$

321: \end{proof}

322:

323: \begin{claim}\label{Fact2}

324: $\min \{\rho_0 -1, \rho _2 - \rho_3, \rho _3 - \rho_4,

325: \ldots , \rho _r - \rho_{r+1} \} \leq \frac {1}{2 r-1}.$

326: \end{claim}

327: \begin{proof}

328: Consider $1\leq i,j \leq n$ such that

329: $d(s_i,s_j) = \rho_0 d_{opt}$.  Then among the

330: positions where $s_i$ mismatches $s_j$, for at least one of the

331: two strings, say, $s_i$, the number of

332: mismatches between $s_i$ and $s$ is at least

333: $\rho_0 d_{opt}/2$.  Thus, among the positions where

334: $s_i$ matches $s_j$, the number of mismatches between

335: $s_i$ and $s$ is at most $(1-\frac{\rho_0}{2}){d_{opt}}$.

336: Therefore, $\rho_2 \leq 1-\frac {\rho_0}{2}$. So,

337: $$

338: \frac {{\frac 12} (\rho_0 -1) + ( \rho _2 - \rho_3) + ( \rho _3 - \rho_4)

339:    + \cdots + (\rho_r - \rho_{r+1}) }{\frac 12 +r-1}

340: \leq \frac {{\frac 12} \rho _0 + \rho _2 - \frac 12 }{r-\frac 12}

341: \leq  \frac {1}{2 r-1}

342: $$

343: Thus, at least one of $\rho_0-1$, $\rho _2 - \rho_3$, $\rho _3 - \rho_4$,

344: $\ldots$, $\rho _r - \rho_{r+1}$ is

345: less than or equal to

346: %wangl revise the order

347: $\frac {1}{2 r-1}$.

348: $\Box$

349: \end{proof}

350:
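\noindent For completeness, the two inequalities in the display in the proof of Claim~\ref{Fact2} can be unpacked as follows; this is a routine expansion that uses only the definitions above. The left-hand side is a weighted average of $\rho_0-1$ (weight $\frac 12$) and the $r-1$ differences $\rho_k-\rho_{k+1}$ (weight $1$ each), so the minimum of these $r$ quantities is at most this average. The differences telescope, and $\rho_{r+1}\geq 0$ because it counts mismatches:
$$
\sum_{k=2}^{r}(\rho_k-\rho_{k+1}) = \rho_2-\rho_{r+1} \leq \rho_2 ,
$$
which gives the first inequality. Substituting $\rho_2\leq 1-\frac{\rho_0}{2}$ into the numerator gives
$$
\frac 12\rho_0+\rho_2-\frac 12 \leq \frac 12\rho_0+1-\frac{\rho_0}{2}-\frac 12=\frac 12,
\qquad \mbox{and} \qquad
\frac{\frac 12}{r-\frac 12}=\frac 1{2r-1}.
$$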

351: %Now we finish the proof.  If $\rho_0 -1 \leq \frac {1}{2 r -1}$,

352: %then by the definition of $\rho _0$, it is easy to see that

353: %the algorithm finds a solution with cost at most

354: %$\rho_0 d_{opt} \leq (1+ \frac {1}{2r-1}) d_{opt}$ in step 2.

355: If $\rho_0 > 1+\frac {1}{2 r -1}$,

356: then from Claim    \ref{Fact2},

357:  there must be a $2 \leq k \leq r$ such that

358: $\rho _k - \rho _{k+1} \leq \frac {1}{2 r -1}$.

359:  From Claim \ref{Fact1},

360: $$%\begin{eqnarray*}

361:  | \{j \in Q_{i_1,i_2,\ldots,i_r} \,|\,

362:  s_{i_1}[j] \neq s_l[j] \mbox{ and } s_{i_1}[j]

363: \neq s[j]\} |

364: \leq  \frac{1}{2r-1} \,d_{opt} \ .

365: $$%\end{eqnarray*}

366: Hence, there are at most $\frac 1{2r-1} \, d_{opt}$ bits

367: in $Q_{i_1,i_2,\ldots,i_r}$

368: where $s_l$ differs from $s_{i_1}$ while agreeing with

369: $s$.  The lemma is proved.

370: $\qed$

371: \end{proof}

372:
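\noindent The position sets $Q_{i_1,i_2,\ldots,i_r}$ and their complements, which the lemma is stated in terms of, can be computed directly from the chosen strings. The Python sketch below is illustrative only: the names {\tt agree\_positions} and {\tt restrict} are assumptions, and positions are 0-based in the code whereas the text uses 1-based positions.

```python
def agree_positions(strings):
    """Return (Q, P) for the given strings of equal length:
    Q = 0-based positions where all strings carry the same letter,
    P = the remaining positions."""
    m = len(strings[0])
    first = strings[0]
    Q = [j for j in range(m) if all(s[j] == first[j] for s in strings)]
    q_set = set(Q)
    P = [j for j in range(m) if j not in q_set]
    return Q, P


def restrict(s, positions):
    """s|_P: the letters of s at the given positions, in the given order."""
    return "".join(s[j] for j in positions)
```

For the hypothetical strings {\tt ACGTAC}, {\tt ACGTTC}, {\tt AGGTAC}, the strings agree at the 0-based positions $0,2,3,5$ and disagree at positions $1$ and $4$.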

373: Lemma \ref{KEY} suggests that we select $r$ strings

374: $s_{i_1}, s_{i_2}, \ldots, s_{i_r}$  from $\cal {S}$

375: at a time and  use the unique letters at the positions in

376:  $Q_{i_1,i_2,\ldots, i_r}$  as an approximation of

377: the optimal center string $s$.

378: For the  positions in $P_{i_1,i_2,\ldots, i_r}=\{1,2,\ldots, m\}-Q_{i_1,i_2,\ldots, i_r}$,

379: we use ideas in \cite{LL+99}, i.e., the following   two strategies:

380: %wangl --change above sentence.

381: (1) if $|P_{i_1,i_2,\ldots, i_r}|$ is small, i.e., $d\leq O(\log L)$,

382: we can enumerate $|\Sigma| ^{|P_{i_1,i_2,\ldots, i_r}|}$ possibilities  to approximate $s$;

383: (2) if $|P_{i_1,i_2,\ldots, i_r}|$ is large, i.e., $d>O(\log L)$, we use the

384: LP relaxation to approximate $s$.  The details are

385: found in Lemma~\ref{lem-rest}.

386: %wangl add subscripts in the above discussion

387: Before

388: %going on

389: %wangl delete the two words

390: presenting our main result, we need the following

391: two lemmas, where Lemma~\ref{lem-chernoff} is commonly known

392: as Chernoff's bounds~(\cite{MR95}, Theorem~4.2 and 4.3):

393: \begin{lemma}

394: {\rm \cite{MR95}~}

395: Let $X_1,X_2,\ldots,X_n$ be $n$ independent random 0-1 variables,

396: where $X_i$ takes $1$ with probability $p_i$, $0<p_i<1$.

397: Let $X=\sum _{i=1}^n X_i$, and $\mu=E[X]$.

398: Then for any $\delta>0$,

399: \begin{enumerate}

400: \item[(1)]

401: ${\bf Pr}(X>(1+\delta) \mu ) < \left[\frac {{\bf e}^{\delta}} {(1+\delta)^{(1+ \delta)}}\right]^{\mu}$,

402: \item[(2)]

403: ${\bf Pr}(X<(1-\delta) \mu ) \leq \exp \left( -\frac 12 \mu \delta ^2 \right)$.

404: \vspace{-6pt}

405: \end{enumerate}

406: \label{lem-chernoff}

407: \end{lemma}

408:

409:  From Lemma~\ref{lem-chernoff}, we can prove the following lemma:

410: \begin{lemma}

411: Let $X_i$, $X$ and $\mu$ be defined as in Lemma~\ref{lem-chernoff}.

412: Then for any $0<\epsilon\leq 1$,

413: \begin{enumerate}

414: \item[(1)]

415: ${\bf Pr}(X>\mu+\epsilon\, n ) < \exp \left(-\frac 13 n \epsilon ^2 \right)$,

416: \item[(2)]

417: ${\bf Pr}(X<\mu-\epsilon\,n ) \leq \exp \left( -\frac 12 n \epsilon ^2 \right)$.

418: \vspace{-6pt}

419: \end{enumerate}

420: \label{lem-chernoff1}

421: \end{lemma}

422: \begin{proof}

423: (1) Let $\delta = \frac {\epsilon n} {\mu}$.  By Lemma~\ref{lem-chernoff},

424: $$%\begin{eqnarray*}

425: {\bf Pr}(X>\mu +\epsilon n )

426: <\left[\frac {{\bf e}^{\frac{\epsilon n}{\mu}}}

427: {(1+\frac{\epsilon n}{\mu})^{(1+\frac{\epsilon n}{\mu})}}\right]^{\mu}

428: =\left[\frac {\bf e}{(1+\frac{\epsilon n}{\mu})^{(1+\frac{\mu}{\epsilon n})}}\right]^{\epsilon n}

429: \leq \left[\frac {\bf e} {(1+\epsilon)^{1+\frac 1{\epsilon}}} \right]^{\epsilon n},

430: $$%\end{eqnarray*}

431: where the last inequality holds because $\mu \leq n$ and

432: $(1+x)^{(1+\frac 1x)}$ is increasing for $x > 0$.

433: It is easy to verify that for $0< \epsilon \leq 1$,

434: $\frac {\bf e} {(1+\epsilon)^{1+\frac 1{\epsilon}}}

435: \leq \exp \left(-\frac \epsilon 3 \right).$

436: Therefore, (1) is proved.

437:

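438: (2)  Let $\delta =\frac {\epsilon n} {\mu}$.  By Lemma~\ref{lem-chernoff} and $\mu \leq n$,

439: ${\bf Pr}(X<\mu-\epsilon\,n ) \leq \exp \left( -\frac {\epsilon^2 n^2}{2\mu} \right) \leq \exp \left( -\frac 12 n \epsilon ^2 \right)$, so (2) is proved.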

440: $\qed$

441: \end{proof}

442:
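\noindent The step described above as ``easy to verify'', namely $\frac {\bf e} {(1+\epsilon)^{1+\frac 1{\epsilon}}} \leq \exp \left(-\frac \epsilon 3 \right)$ for $0<\epsilon\leq 1$, can be sanity-checked numerically. The Python sketch below evaluates both sides on a grid; it is a floating-point check under an assumed grid resolution, not a proof, and the names {\tt lhs}, {\tt rhs} and {\tt violations} are illustrative.

```python
import math


def lhs(eps: float) -> float:
    """e / (1 + eps)^(1 + 1/eps), the base of the bound in part (1)."""
    return math.e / (1.0 + eps) ** (1.0 + 1.0 / eps)


def rhs(eps: float) -> float:
    """exp(-eps / 3), the claimed upper bound on lhs(eps)."""
    return math.exp(-eps / 3.0)


# Evaluate on the grid 0.001, 0.002, ..., 1.000 (an arbitrary resolution).
grid = [k / 1000.0 for k in range(1, 1001)]
violations = [eps for eps in grid if lhs(eps) > rhs(eps)]
```

Raising both sides to the power $\epsilon n$ then gives the bound $\exp\left(-\frac 13 n \epsilon^2\right)$ of part (1).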

443: Now, we come back to the approximation of $s$ at the positions

444: in  $P_{i_1,i_2,\ldots,i_r}$.

445:

446: \begin{lemma}

447: \label{lem-rest}

448: Let  ${\cal S} = \{s_1, s_2, \ldots s_n\}$, where $|s_i| = m$ for all $i$.

449: Assume that $s$ is the optimal solution of {\sc Closest String}

450: and $\max_{1 \leq i \leq n} d(s_i,s) =d_{opt}$.

451: Given  a  string $s'$ and a  position set $Q$ of size $m-O(d_{opt})$

452: such that for any $i=1, \ldots , n$

453: \begin{equation}

454: \label{eq-rest01}

455: d(s_i|_Q,s'|_Q)-d(s_i|_Q,s|_Q) \leq \rho \, d_{opt},

456: \end{equation}

457: where  $0 \leq \rho \leq 1$,

458: one can obtain a solution with cost at most

459: $(1+\rho + \epsilon)d_{opt}$ in polynomial time

460:  for any fixed $\epsilon > 0$.

461: \end{lemma}

462:

463: \begin{proof}

464: Let $P=\{1,2,\ldots,m\} - Q$.  Then, for any two

465: strings $x$ and  $x'$ of length $m$, we have

466: $d(x|_P,x'|_P)+d(x|_Q,x'|_Q)=d(x,x')$.

467: Thus for any $i=1,2, \ldots ,n$,

468: $$%\begin{eqnarray*}

469: d(s_i|_P,s|_P)=d(s_i,s)-d(s_i|_Q,s|_Q)

470: \leq (1+ \rho)\,d_{opt} - d(s_i|_Q,s'|_Q).

471: $$%\end{eqnarray*}

472: Therefore,  the  following optimization problem

473: \begin{equation}

474: \label{lps1}

475: \left\{

476:  \begin{array}{l}

477:  \min \;\; d;\\

478:  d(s_i|_P,x) \leq d - d(s_i|_Q,s'|_Q), \;\;

479:   i=1, \cdots , n; |x|=|P |,

480:  \end{array}

481: \right.

482: \end{equation}

483:  has a solution with cost

484: $d \leq (1+ \rho) d_{opt}$.

485: %wangl move half of the sentence down

486: Suppose that the optimization problem  has an optimal solution $x$ such that

487: $d=d_0$.  Then

488: \begin{equation}

489: \label{eq-d0}

490: d_0\leq (1+\rho) d_{opt}.

491: \end{equation}

492: Now we solve (\ref{lps1}) approximately.

493: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%sep 26

494: Similar to \cite{BLPR97,LL+99}, we use a 0-1 variable

495: $x_{j,a}$ to indicate whether $x[j]=a$.  Denote

496: $\chi (s_i[j],a)=0$ if $s_i[j]=a$ and $1$ if $s_i[j] \neq a$.

497: Then (\ref{lps1}) can be rewritten as a 0-1 optimization problem

498: as follows:

499: \begin{equation}

500: \label{lps2}

501: \left\{

502:  \begin{array}{l}

503:  \min \;\; d;\\

504:  \sum _{a \in \Sigma} x_{j,a}=1, \;\; j=1,2,\ldots,|P|,\\

505:  \sum_{1\leq j\leq |P|} \sum _{a \in \Sigma}

506:    \chi (s_i[j],a) \, x_{j,a}

507:    \leq d - d(s_i|_Q,s'|_Q), \;\; i=1,2, \ldots , n.

508:  \end{array}

509: \right.

510: \end{equation}

511: Solve (\ref{lps2}) by linear programming to

512: get a fractional solution ${\bar x}_{j,a}$ with cost

513: ${\bar d}$.  Clearly ${\bar d} \leq d_0$.

514: Independently for each $1 \leq j \leq |P|$,

515: with probability ${\bar x}_{j,a}$,

516: set $x_{j,a}=1$ and $x_{j,a'}=0$ for any $a' \neq a$.

517: Then we get a solution $x_{j,a}$ for the 0-1 optimization

518: problem, hence a solution $x$ for (\ref{lps1}).

519: It is easy to see that

520: $\sum _{a \in \Sigma} \chi (s_i[j],a) \, x_{j,a} $

521: takes $1$ or $0$ randomly and independently for

522: different $j$'s.  Thus

523: $d(s_i|_P,x)= \sum_{1\leq j\leq |P|} \sum _{a \in \Sigma}

524:    \chi (s_i[j],a) \, x_{j,a}$

525: is a sum of $|P|$ independent 0-1 random variables, and

526: \begin{eqnarray}

527: E[d(s_i|_P,x)]&=&\sum_{1\leq j\leq |P|} \sum _{a \in \Sigma}

528:    \chi (s_i[j],a) \, E[x_{j,a}]

529: \nonumber\\

530: &=&\sum_{1\leq j\leq |P|} \sum _{a \in \Sigma}

531:    \chi (s_i[j],a) \, {\bar x}_{j,a}

532: \nonumber\\

533: &\leq& {\bar d}-d(s_i|_Q,s'|_Q)\leq d_0-d(s_i|_Q,s'|_Q).

534: \label{eq-hoeff01}

535: \end{eqnarray}

536: Therefore, for any fixed $\epsilon '>0$,

537: by Lemma~\ref{lem-chernoff1},

538: $$

539: {\bf Pr}\left(

540: d(s_i|_P,x) \geq d_0+\epsilon' |P|

541: -d(s_i|_Q,s'|_Q)

542:         \right)

543: \leq

544: \exp \left(- \frac 13 {{\epsilon '}^2} |P|\right).

545: $$

546: Considering all sequences, we have

547: \[

548: {\bf Pr}\left(d(s_i|_P,x) \geq d_0 +\epsilon ' |P|-

549: d(s_i|_Q,s'|_Q)

550:      ~ {\rm for~at~least~one~}i       \right)

551: \leq n\times \exp \left(-\frac 13 {\epsilon'} ^2 |P|\right).

552: \]

553: If $|P| \geq (4 \ln n )/ {\epsilon'} ^2$,

554: then,

555: $n\times

556:  \exp\left(-\frac 13 {\epsilon'} ^2 |P|\right) \leq n^{-\frac 13}$.

557: Thus

558: we obtain a randomized algorithm to find a solution

559: for (\ref{lps1}) with cost at most

560: $d_0 +\epsilon' |P|$ with probability at least $1-n^{-\frac

561: 13}$.

562: The above randomized algorithm can be derandomized

563: by the standard method of conditional probabilities~\cite{MR95}.

564:
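\noindent The failure bound $n^{-\frac 13}$ used above follows by direct substitution: if $|P| \geq (4 \ln n )/ {\epsilon'} ^2$, then
$$
n\times \exp\left(-\frac 13 {\epsilon'} ^2 |P|\right)
\leq n\times \exp\left(-\frac 43 \ln n\right)
= n \cdot n^{-\frac 43} = n^{-\frac 13}.
$$
As a purely illustrative instance (the numbers are hypothetical), for $n=1000$ and $\epsilon'=0.1$ the condition reads $|P| \geq 4\ln(1000)/0.01 \approx 2763.1$, and the failure probability is then at most $1000^{-\frac 13}=0.1$.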

565: If $|P| < ( 4 \ln n )/ {\epsilon'} ^2$, then

566: $|\Sigma| ^{|P|} < n ^{(4 \ln |\Sigma|)/{\epsilon'} ^2}$

567: is polynomial in $n$.

568: So, we can enumerate all strings in $\Sigma ^{|P|}$ to find

569: an optimal solution for (\ref{lps1}).

570: Thus, in both cases, we can obtain a solution $x$ for the optimization

571: problem (\ref{lps1}) with cost at most

572: $d_0+\epsilon' |P|$ in polynomial time.

573: Since $|P|=O(d_{opt})$,

574: $|P| \leq c\times d_{opt}$ for a {\em constant} $c$.

575: Let $\epsilon'= \frac {\epsilon}{c}$

576: and

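577: $s^*=R(s',x,P)$, the string obtained from $s'$ by replacing its letters at the positions in $P$ with $x$, so that $s^*|_P=x$ and $s^*|_Q=s'|_Q$. From Formula (\ref{lps1}),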

578: \begin{eqnarray}

579: d(s_i,s^*)

580: &=&d(s_i|_P,s^*|_P)+d(s_i|_Q,s^*|_Q)

581: \nonumber\\

582: &=&d(s_i|_P,x)+d(s_i|_Q,s'|_Q)

583: \nonumber\\

584: &\leq& d_0+ \epsilon' |P|

585: \leq (1+ \rho) d_{opt} + \epsilon d_{opt},

586: %\label{eq-tmp01}

587: \nonumber

588: \end{eqnarray}

589: where the last inequality is from Formula~(\ref{eq-d0}).

590: This proves the lemma.

591: $\Box$

592: \end{proof}

593:
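\noindent The randomized rounding step in the proof above can be sketched in Python as follows. The sketch assumes that a fractional solution ${\bar x}_{j,a}$ of (\ref{lps2}) is already available (the LP solver is omitted) and is given as a list with one dictionary per position of $P$, mapping each letter $a$ to ${\bar x}_{j,a}$; this representation and the names {\tt round\_fractional} and {\tt expected\_distance} are assumptions made for illustration.

```python
import random


def round_fractional(xbar, rng):
    """Independently for each position j, pick letter a with probability
    xbar[j][a]; the values in each xbar[j] are assumed to sum to 1."""
    letters = []
    for dist in xbar:
        symbols = list(dist)
        weights = [dist[a] for a in symbols]
        letters.append(rng.choices(symbols, weights=weights, k=1)[0])
    return "".join(letters)


def expected_distance(target, xbar):
    """E[d(target, x)] = sum_j sum_a chi(target[j], a) * xbar[j][a],
    which equals sum_j (1 - xbar[j][target[j]])."""
    return sum(1.0 - dist.get(c, 0.0) for c, dist in zip(target, xbar))
```

A possible usage, with a hypothetical fractional solution: for {\tt xbar = [\{"A": 0.5, "C": 0.5\}, \{"A": 1.0\}, \{"G": 0.25, "T": 0.75\}]}, calling {\tt round\_fractional(xbar, random.Random(0))} returns a string of length $3$ whose second letter is always {\tt A}, and {\tt expected\_distance("AAT", xbar)} evaluates to $0.5+0+0.25=0.75$.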

594: Now we describe the complete algorithm in Figure~\ref{stringAlg}.

595:

596: \begin{figure}[ht]

597: \begin{center}

598: \begin{tabular}{|l|}

599: \hline

600: \multicolumn{1}{|c|}{\bf Algorithm ~closestString}

601: \\

602: \makebox[.45in][l]{{Input}} \parbox[t]{4.55in}

603: {$s_1, s_2, \ldots , s_n \in \Sigma^m$.}

604: \\

605: \makebox[.45in][l]{{Output}} \parbox[t]{4.55in}

606: {a center string $s \in \Sigma^m$.}

607: \\

608: \makebox[.2in][l]{1.} \parbox[t]{4.8in}

609: {{\bf for} each $r$-element subset $\{ s_{i_1}$, $s_{i_2}$, $\ldots$,

610: $s_{i_r} \}$ of the $n$ input strings {\bf do}}

611: \\

612: \makebox[.5in][r]{(a)} \parbox[t]{4.5in}

613: {$Q=\{1\leq j \leq m \,|\, s_{i_1}[j]=s_{i_2}[j]=\ldots = s_{i_r}[j] \}$,

614: $P=\{1,2,\ldots, m\} - Q$.}

615: \\

616: \makebox[.5in][r]{(b)} \parbox[t]{4.5in}

617: {Solve the optimization problem defined by Formula (\ref{lps1})

618: as described in the proof of Lemma~\ref{lem-rest} to get an approximate

619: solution $x$ of length $|P|$.}

620: \\

621: \makebox[.5in][r]{(c)} \parbox[t]{4.5in}

622: {Let $s'$ be a string such that $s'|_Q=s_{i_1}|_Q$ and $s'|_P=x$.

623: Calculate the cost of $s'$ as the center

624: string.}

625: \\

626: \makebox[.2in][l]{2.} \parbox[t]{4.8in}

627: {{\bf for}   $i=1, 2, \ldots , n$ {\bf do}}

628: \\

629: \makebox[.4in][l]{}\parbox[t]{4.6in}

630: {calculate the cost of $s_i$ as the center string.}

631: \\

632: \makebox[.2in][l]{3.} \parbox[t]{4.8in}

633: {Output the best solution of the above two steps.}

634: \\

635: \hline

636: \end{tabular}

637: \caption{Algorithm for {\sc Closest String}}

638: \label{stringAlg}

639: \end{center}

640: \end{figure}

641:
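\noindent The control flow of Figure~\ref{stringAlg} can be sketched in Python as follows. This is a simplified illustration under a stated assumption, not the algorithm as analyzed: step 1(b) is replaced by exhaustive enumeration over $\Sigma^{|P|}$, which is exponential in $|P|$, instead of the LP relaxation and rounding of Lemma~\ref{lem-rest}. The function names and the default $r=2$ are likewise assumptions.

```python
from itertools import combinations, product


def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def cost(center, strings):
    """Cost of a center string: its maximum distance to the input strings."""
    return max(hamming(center, s) for s in strings)


def closest_string_sketch(strings, alphabet, r=2):
    m = len(strings[0])
    # Step 2: every input string is itself a candidate center.
    best = min(strings, key=lambda c: cost(c, strings))
    best_cost = cost(best, strings)
    # Step 1: try every r-element subset of the input strings.
    for subset in combinations(strings, r):
        first = subset[0]
        # (a) P = positions where the r strings do not all agree (0-based);
        #     Q is the complement and keeps the letters of the first string.
        P = [j for j in range(m) if any(s[j] != first[j] for s in subset)]
        # (b) Stand-in for the LP step: enumerate every filling x of P.
        for x in product(alphabet, repeat=len(P)):
            cand = list(first)          # (c) s'|_Q = s_{i_1}|_Q
            for j, a in zip(P, x):
                cand[j] = a             # (c) s'|_P = x
            cand = "".join(cand)
            c = cost(cand, strings)
            if c < best_cost:
                best, best_cost = cand, c
    # Step 3: output the best solution found in the two steps.
    return best, best_cost
```

On the hypothetical input {\tt AAAA}, {\tt AABB}, {\tt ABAB} over $\{{\tt A},{\tt B}\}$, every input string has cost $2$ as a center, while step 1 finds a center of cost $1$.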

642: \begin{theorem}

643: \label{th-uniform}

644: The algorithm closestString is a PTAS for {\sc Closest String}.

645: \end{theorem}

646: %%%Shall we declare the ratio clearly here?  -bin

647: \begin{proof}

648: Given an instance of {\sc Closest String},

649: suppose $s$ is an optimal solution and the optimal

650: cost is $d_{opt}$, i.e., $d(s,s_i) \leq d_{opt}$ for all $i$.

651: Let $P$ be defined as in step 1(a) of Algorithm~closestString.

652: Since at every position in $P$ at least one of the $r$

653: strings $s_{i_1},s_{i_2},\ldots,s_{i_r}$ disagrees with

654: the optimal center string $s$, we have

655: $|P|\leq r\times d_{opt}$.  As long as $r$ is a constant,

656: step 1(b) can be done in polynomial time by Lemma~\ref{lem-rest}.

657: Obviously the other steps of

658: Algorithm~closestString run in polynomial time,

659: with $r$ as a constant.

660:

661: If $\rho_0 -1 \leq \frac {1}{2 r -1}$,

662: then by the definition of $\rho _0$, it is easy to see that

663: the algorithm finds a solution with cost at most

664: $\rho_0 d_{opt} \leq (1+ \frac {1}{2r-1}) d_{opt}$ in step 2.

665:

666: If $\rho_0 > 1+\frac {1}{2 r -1}$,

667: then by Lemma \ref{KEY} and Lemma \ref{lem-rest},

668: the algorithm finds a solution with cost at most

669: $(1+\frac 1{2r-1} + \epsilon)d_{opt}$. This proves the theorem.

670: $\Box$

671: \end{proof}

672:
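\noindent As a worked instance of the ratio $1+\frac 1{2r-1}+\epsilon$ (the target value below is a hypothetical choice): to guarantee ratio at most $1+\delta$, it suffices to take $\epsilon=\delta/2$ in Lemma~\ref{lem-rest} and any integer $r\geq \frac 1\delta+\frac 12$, since then $2r-1 \geq \frac 2\delta$ and
$$
\frac 1{2r-1}+\epsilon \leq \frac \delta 2+\frac \delta 2=\delta .
$$
For $\delta=0.1$ this gives $r=11$ and $\frac 1{21}+0.05\approx 0.0976\leq 0.1$. Step 1 of Algorithm~closestString then ranges over ${n \choose 11}$ subsets, so the degree of the polynomial running time grows as $\delta$ shrinks.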

673: \section{Approximating {\sc Closest Substring}  when $d$ is small}

674: In some applications, such as drug target identification and

675: genetic probe design, the radius $d$ is often small.

676: As a direct application of Lemma \ref{KEY},

677: we now present a PTAS for {\sc Closest Substring}

678: when the radius $d$ is small, i.e., $d\leq O(\log N)$, where $N$

679: stands for the input size of the instance.

680: Again, we focus on the construction of the center string.

681: The basic idea is  to choose $r$ substrings

682: $t_{i_1}$, $t_{i_2}$, $\ldots$, $t_{i_r}$

683: of length $L$

684: from the strings in ${\cal S}$,

685: keep the letters at the positions where

686: $t_{i_1}$, $t_{i_2}$, $\ldots$, $t_{i_r}$ all agree, and

687: try all possibilities for the rest of the positions.

688: % Bin: describe more of the proof idea perhaps?

689: The complete algorithm is described in Figure~\ref{fig-Algsmall}:

690:

691: \begin{figure}[h]

692: \begin{center}

693: \begin{tabular}{|l|}

694: \hline

695: \multicolumn{1}{|c|}{\bf Algorithm ~smallSubstring}

696: \\

697: \makebox[.45in][l]{{Input}} \parbox[t]{4.55in}

698: {$s_1, s_2, \ldots , s_n \in \Sigma^m$.}

699: \\

700: \makebox[.45in][l]{{Output}} \parbox[t]{4.55in}

701: {a center string $s \in \Sigma^L$.}

702: \\

703: \makebox[.2in][l]{1.} \parbox[t]{4.8in}

704: {{\bf for} each $r$-element subset $\{t_{i_1}$, $t_{i_2}$, $\ldots$,

705: $t_{i_r} \}$, where $t_{i_j}$ is a substring of length $L$  from

706: $s_{i_j}$ {\bf do}}

707: \\

708: \makebox[.5in][r]{(a)} \parbox[t]{4.5in}

709: {$Q=\{1\leq j \leq L \,|\, t_{i_1}[j]=t_{i_2}[j]=\ldots = t_{i_r}[j] \}$,

710: $P=\{1,2,\ldots, L\} - Q$.}

711: \\

712:

713: \makebox[.5in][r]{(b)} \parbox[t]{4.5in}

714: {{\bf for} every $x \in \Sigma ^{|P|}$ {\bf do}}

715: \\

716: \makebox[.7in][r]{} \parbox[t]{4.3in}

717: {let $t=S(t_{i_1},x,P)$, i.e., $t_{i_1}$ with its letters at the positions in $P$ replaced by $x$; compute the cost of the solution $t$.}

718: \\

719: \makebox[.2in][l]{2.} \parbox[t]{4.8in}

720: {{\bf for} every length $L$ substring $t_k$ from any given sequence {\bf do}}

721: \\

722: \makebox[.4in][r]{} \parbox[t]{4.6in}

723: {compute the cost of the solution with $t_k$ as the center string}\\

724: \makebox[.2in][l]{3.} \parbox[t]{4.8in}

725: {output the center string with the smallest cost found in Step 1 and

726: Step 2.}

728: \\

729: \hline

730: \end{tabular}

731: \caption{Algorithm for {\sc Closest Substring} when $d$ is

732: small}

733: \label{fig-Algsmall}

734: \end{center}

735: \end{figure}

736:
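\noindent The steps of Figure~\ref{fig-Algsmall} can be sketched in Python as follows. The sketch is illustrative only: it reads ``$r$-element subset'' as one length-$L$ window from each of $r$ distinct input strings, which is an assumption about the intended enumeration, and the function names and the default $r=2$ are hypothetical. Positions are 0-based in the code.

```python
from itertools import combinations, product


def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def windows(s, L):
    """All length-L substrings of s, from left to right."""
    return [s[k:k + L] for k in range(len(s) - L + 1)]


def sub_cost(center, strings):
    """Closest Substring cost: for each s_i take its closest window."""
    L = len(center)
    return max(min(hamming(center, w) for w in windows(s, L)) for s in strings)


def small_substring_sketch(strings, L, alphabet, r=2):
    # Step 2: every length-L window of any input string is a candidate.
    candidates = [w for s in strings for w in windows(s, L)]
    best = min(candidates, key=lambda c: sub_cost(c, strings))
    best_cost = sub_cost(best, strings)
    # Step 1: choose r distinct input strings and one window from each.
    for idx in combinations(range(len(strings)), r):
        for ts in product(*(windows(strings[i], L) for i in idx)):
            # (a) P = positions where the chosen windows do not all agree.
            P = [j for j in range(L) if any(t[j] != ts[0][j] for t in ts)]
            # (b) Try every x in Sigma^{|P|} at the positions in P.
            for x in product(alphabet, repeat=len(P)):
                cand = list(ts[0])
                for j, a in zip(P, x):
                    cand[j] = a
                cand = "".join(cand)
                c = sub_cost(cand, strings)
                if c < best_cost:
                    best, best_cost = cand, c
    # Step 3: output the best center found in Steps 1 and 2.
    return best, best_cost
```

For the hypothetical input {\tt TTACG}, {\tt ACGTT}, {\tt GACGA} with $L=3$, the common substring {\tt ACG} is already found in Step 2 with cost $0$.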

737: \begin{theorem}

738: Algorithm smallSubstring is a PTAS for {\sc Closest Substring}

739: when the radius $d$ is small, i.e., $d\leq O(\log N)$, where $N$

740: is the input size.

741: \end{theorem}

742: \begin{proof}

743: Obviously, the size of $P$ in Step 1 is at most $O(r\times \log N)$.

744: Step 1 takes $O((mn)^{r}\times|\Sigma| ^{O(r \times \log N)}\times mnL)

745: =O(N^{r+1} \times N^{O(r \times \log |\Sigma|)})

746: =O(N^{O(r \times \log |\Sigma|)})$

747: time.

748: Other steps take less than that time.

749: Thus, the total  time  required is

750: $O(N^{O(r\times \log |\Sigma|)})$,

751: which is polynomial in the

752: input size for any constant $r$.

753:

754:  From Lemma \ref{KEY}, the performance ratio

755: of the algorithm is $1+\frac{1}{2r-1}$. $\qed$

756: \end{proof}

757:

758: \section{A PTAS For {\sc Closest Substring}}

759: In this section, we further extend the algorithms

760: for {\sc Closest String} to a PTAS for {\sc Closest Substring},

761: making use of a {\em random sampling} strategy.

762: Note that Algorithm~smallSubstring runs in exponential time

763: for general radius $d$.  Moreover, Algorithm~closestString does not

764: work for {\sc Closest Substring}, since we do not

765: know how to construct an optimization problem similar to~(\ref{lps1}):

766: the construction of~(\ref{lps1}) requires us to know all the $n$

767: strings (substrings)

768: in an optimal solution of {\sc Closest String} ({\sc Closest Substring}).

769: It is easy to see that the choice of a ``good'' substring

770: from every string $s_i$ is the only obstacle on the way to the solution.

771: We use random sampling to handle this.

772:

773: Now let us outline the main ideas.

774: Let $\langle {\cal S}=\{s_1,s_2,\ldots,s_n\},L \rangle$ be

775: an instance of

776: {\sc Closest Substring}, where  $s_i$ is of  length $m$.

777: Suppose that $s$ is its optimal center string and

778: $t_i$ is  a length $L$  substring of $s_i$ which is

779: the closest to $s$ ($i=1,2,\ldots,n$).

780: Let $d_{opt}=\max _{i=1}^n d(s,t_i)$.

781: By trying all possibilities, we can assume that

782: $t_{i_1},t_{i_2},\ldots,t_{i_r}$ are the $r$ substrings $t_{i_j}$

783: that satisfy Lemma~\ref{KEY} by replacing $s_i$ by $t_i$ and $s_{i_j}$ by $t_{i_j}$.

784: Let $Q$ be the set of positions where $t_{i_1},t_{i_2},\ldots,t_{i_r}$

785: agree and $P=\{1,2,\ldots,L\}-Q$.

786: By Lemma~\ref{KEY}, $t_{i_1}|_Q$ is a good approximation to $s|_Q$.

787: We want to approximate $s|_P$ by the solution $x$ of the following

788: optimization problem~(\ref{opt2}), where $t'_i$ is a substring of $s_i$ and

789: is up to us to choose.

790: \begin{equation}

791: \label{opt2}

792: \left\{

793:  \begin{array}{l}

794:  \min \;\; d;\\

795:  d(t'_i|_P,x) \leq d - d(t'_i|_Q,t_{i_1}|_Q), \;\;

796:   i=1, \cdots , n; |x|=|P|.

797:  \end{array}

798: \right.

799: \end{equation}

800:

801: The ideal choice is

802: $t'_i=t_i$, {\em i.e.}, $t'_i$ is the closest to $s$ among

803: all substrings of $s_i$.

804: However, we only approximately know $s$ in $Q$ and

805: know nothing about $s$ in $P$ so far.

806: So, we randomly pick $O(\log (mn))$ positions from $P$.

807: Suppose the multiset of these random positions is $R$.

808: By trying all possibilities, we can assume that

809: we know $s$ at these $|R|$ positions.

810: We then find the length-$L$ substring $t'_i$ of $s_i$ such that

811: $d(s|_R,t'_i|_{R})\times \frac {|P|}{|R|}+d(t_{i_1}|_Q,t'_i|_Q)$

812: is minimized.  Then $t'_i$ is likely to be among the substrings of $s_i$

813: that are the closest to $s$.
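
As a small illustration of this score (the numbers are hypothetical and chosen
only for this example), suppose $|P|=100$ and $|R|=20$.
If a candidate substring $t'$ of $s_i$ disagrees with $s$ at $3$ of the sampled
positions, counted with multiplicity, and with $t_{i_1}$ at $4$ positions of $Q$, then
\[
d(s|_R,t'|_R)\times \frac{|P|}{|R|} + d(t_{i_1}|_Q,t'|_Q)
= 3 \times \frac{100}{20} + 4 = 19 .
\]
The first term rescales the mismatches seen in the sample to the whole of $P$,
so the score acts as an estimate of the distance between $t'$ and the string
that equals $s$ on $P$ and $t_{i_1}$ on $Q$.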

814:

815: Then we solve (\ref{opt2}) approximately by the method provided in

816: the proof of Lemma~\ref{lem-rest} and

817: combine the solution $x$ at $P$ with $t_{i_1}$ at $Q$.  The

818: resulting string should be a good approximation to $s$.

819: The detailed algorithm (Algorithm closestSubstring)

820: is given in Figure~\ref{fig-alg}.

821: We prove Theorem~\ref{th-ptas} in the rest of the section.

822:

823: \begin{figure}[ht]

824: \begin{center}

825: {\normalfont\normalsize

826: \begin{tabular}{|l|}

827: \hline

828: \multicolumn{1}{|c|}{\bf Algorithm ~closestSubstring}

829: \\

830: \makebox[.55in][l]{{Input}} \parbox[t]{4.45in}

831: {$n$ sequences $\{s_1, s_2,\ldots,s_n\} \subseteq \Sigma^m$, integer $L$.}

832: \\

833: \makebox[.55in][l]{{Output}} \parbox[t]{4.45in}

834: {the center string $s$.}

835: \\

836: \makebox[.2in][l]{1.} \parbox[t]{4.8in}

837: {{\bf for} every $r$ length-$L$ substrings

838: $t_{i_1}, t_{i_2},\ldots, t_{i_r}$ (allowing repeats, but if $t_{i_j}$ and

839: $t_{i_k}$ are both chosen from the same $s_i$ then $t_{i_j}=t_{i_k}$)

840: of $s_1,\ldots ,s_n$

841: {\bf do}}

842: \\

843: \makebox[.5in][r]{(a)} \parbox[t]{4.5in}

844: {$Q=\{1\leq j \leq L \,|\, t_{i_1}[j]=t_{i_2}[j]=\ldots = t_{i_r}[j] \}$,

845: $P=\{1,2,\ldots, L\} - Q$.}

846: \\

847: \makebox[.5in][r]{(b)} \parbox[t]{4.5in}

848: {Let $R$ be a multiset containing

849: $\lceil \frac {4}{\epsilon ^2} \log(nm)\rceil$

850: uniformly random positions from $P$.

851: }

852: \\

853: \makebox[.5in][r]{(c)} \parbox[t]{4.5in}

854: {{\bf for} every string $y$ of length $|R|$ {\bf do}}

855: \\

856: \makebox[.7in][r]{(i)} \parbox[t]{4.3in}

857: {{\bf for} $i$ from $1$ to $n$ {\bf do}}

858: \\

859: \makebox[.8in][r]{} \parbox[t]{4.2in}

860: {Let $t'_i$ be a length $L$ substring of $s_i$

861: minimizing $d(y,t'_i|_{R})\times \frac {|P|}{|R|}+d(t_{i_1}|_Q,t'_i|_Q)$.

862: }

863: \\

864: \makebox[.7in][r]{(ii)} \parbox[t]{4.3in}

865: {Using the method provided in the proof of

866: Lemma~\ref{lem-rest}, solve the optimization

867: problem defined by Formula~(\ref{opt2}) approximately.

868: Let $x$ be the approximate solution within error $\epsilon\, |P|$.}

869: \\

870: \makebox[.7in][r]{(iii)} \parbox[t]{4.3in}

871: {Let $s'$ be the string such that $s'|_P=x$ and $s'|_Q=t_{i_1}|_Q$.

872: Let $c=\max^n_{i=1} \min_{\{t_i{\rm ~is~a~substring~of~}s_i\}} d(s',t_i)$.}

873: \\

874: \makebox[.2in][l]{2.} \parbox[t]{4.8in}

875: {{\bf for} every length-$L$ substring $s'$ of $s_1$ {\bf do}}

876: \\

877: \makebox[.5in][r]{} \parbox[t]{4.5in}

878: {Let $c=\max^n_{i=1} \min_{\{t_i{\rm ~is~a~substring~of~}s_i\}} d(s',t_i)$.}

879: \\

880: \makebox[.2in][l]{3.} \parbox[t]{4.8in}

881: {Output the $s'$ with minimum $c$ in step 1(c)(iii) and step 2.}

882: \\

883: \hline

884: \end{tabular}

885: \caption{The PTAS for the closest substring problem.}

886: \label{fig-alg}

887: }

888: \end{center}

889: \end{figure}

890:

891: \begin{theorem}

892: \label{th-ptas}

893: Algorithm closestSubstring is a PTAS for the closest substring problem.

894: \end{theorem}

895: \begin{proof}

896: Let $s$ be an optimal center string and $t_i$ be the

897: length-$L$ substring of $s_i$ that is the closest to $s$.  Let

898: $d_{opt}=\max_{1\leq i\leq n} d(s,t_i)$.  Let $\epsilon$ be any small

899: positive number and $r \geq 2$ be any fixed integer.

900: Let $\rho_0 = \max _{1\leq i,j\leq n} {d(t_i,t_j)}/{d_{opt}}$.

901: If $\rho _0 \leq 1+\frac{1}{2r-1}$, then clearly we can find a solution

902: $s'$ within ratio $\rho _0$ in step 2.

903: So, we assume that $\rho _0 > 1+\frac{1}{2r-1}$ from now on.

904:

905: By Lemma~\ref{KEY}, Algorithm~closestSubstring picks

906: a group of $t_{i_1},t_{i_2},\ldots,t_{i_r}$ in step 1

907: at some point such that

908:

909: \vspace{1ex}

910: \noindent

911: {\bf Fact 1~}

912: For any $1 \leq l \leq n$,

913: $

914: |\{j \in Q\,|\,

915:  t_{i_1}[j] \neq t_l[j] \mbox{ and } t_{i_1}[j]

916: \neq s[j]\} |

917: \leq  \frac{1}{2r-1} \,d_{opt}.

918: $

919: \vspace{1ex}

920:

921: Obviously, the algorithm takes $y$ to be $s|_R$ at some point

922: in step 1(c).  Let $y=s|_R$ and $t_{i_1},t_{i_2},\ldots,t_{i_r}$

923: satisfy Fact~1.  Let $t'_i$ be defined as in step 1(c)(i).

924: Let $s^*$ be a string such that $s^*|_P=s|_P$ and

925: $s^*|_Q=t_{i_1}|_Q$.  Then we claim:

926:

927: \vspace{1ex}

928: \noindent

929: {\bf Fact 2~}

930: With high probability,

931: $d(s^*,t'_i)\leq d(s^*,t_i)+ 2 \epsilon |P|$

932: for all $1\leq i\leq n$.\\

933:

934: \begin{proof}

935: For convenience, for any position multiset $T$,

936: we denote $d^T(t_1,t_2)=d(t_1|_T,t_2|_T)$ for any two

937: strings $t_1$ and $t_2$.  Let $\rho=\frac{|P|}{|R|}$.

938: %wangl add one step

939: Consider  any length $L$ substring $t'$ of $s_i$

940: satisfying

941: \begin{equation}

942: d(s^*, t')\geq d(s^*,t_i)+2 \epsilon |P|.

943: \label{eq-close150}

944: \end{equation}

945: It is easy to see that

946: $

947:  \rho\, d^R(s^*,t') + d^Q(t_{i_1},t')

948: \leq \rho\, d^R(s^*,t_i)+ d^Q(t_{i_1},t_i)$ implies

949: either

950: $\rho \, d^R(s^*,t') + d^Q(s^*,t') \leq d(s^*,t') -\epsilon |P|$

951: or

952: $\rho \, d^R(s^*,t_i)+ d^Q(s^*,t_i) \geq d(s^*,t_i) + \epsilon |P|$.
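
One way to verify this implication is by contraposition; the following is a
sketch that uses only $s^*|_Q=t_{i_1}|_Q$ and Formula~(\ref{eq-close150}).
Suppose that neither of the two alternatives holds.  Then
\begin{eqnarray*}
\rho\, d^R(s^*,t') + d^Q(s^*,t') &>& d(s^*,t') - \epsilon |P| \\
&\geq& d(s^*,t_i) + \epsilon |P| \\
&>& \rho\, d^R(s^*,t_i) + d^Q(s^*,t_i),
\end{eqnarray*}
where the middle step is Formula~(\ref{eq-close150}).
Since $s^*|_Q=t_{i_1}|_Q$, we have $d^Q(s^*,t')=d^Q(t_{i_1},t')$ and
$d^Q(s^*,t_i)=d^Q(t_{i_1},t_i)$, so the hypothesis of the implication fails.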

953: Thus, we have the following inequality:

954: \begin{eqnarray}

955: &&{\bf Pr} \left( \rho\, d^R(s^*,t') + d^Q(t_{i_1},t')

956: \leq \rho\, d^R(s^*,t_i)+ d^Q(t_{i_1},t_i) \right)

957: \nonumber\\

958: &\leq& {\bf Pr}\left(\rho \, d^R(s^*,t') + d^Q(s^*,t')

959: \leq d(s^*,t') -\epsilon |P|\right)+

960: \nonumber\\

961: &&{\bf Pr}\left(\rho \, d^R(s^*,t_i)+ d^Q(s^*,t_i)

962: \geq d(s^*,t_i) + \epsilon |P|\right).

963: \label{eq-close200}

964: \end{eqnarray}

965:

966: It is easy to see that $d^R(s^*,t')$ is the  sum of $|R|$ independent

967: random 0-1 variables $ \sum _{i=1} ^{|R|} X_i$,  where

968: $X_i=1$ indicates  a mismatch between  $s^*$ and  $t'$ at

969: the $i$-th position in $R$.

970: %wangl change X_i .... to sum .... etc

971: Let $\mu = E[d^R(s^*,t')]$.

972: Obviously, $\mu=d^P(s^*,t') / \rho$.

973: %wangl -delete Then

974: Therefore, by Lemma~\ref{lem-chernoff1} (2),

975: \begin{eqnarray}

976: &&{\bf Pr}\left(\rho \, d^R(s^*,t') + d^Q(s^*,t') \leq d(s^*,t') -\epsilon |P|\right)

977: \nonumber\\

978: &=&{\bf Pr}\left(d^R(s^*,t') \leq (d(s^*,t')- d^Q(s^*,t'))/\rho -\epsilon |R|\right)

979: \nonumber\\

980: &=&{\bf Pr}\left(d^R(s^*,t') \leq d^P(s^*,t')/\rho -\epsilon |R|\right)

981: \nonumber\\

982: &=&{\bf Pr}\left(d^R(s^*,t') \leq \mu -\epsilon |R|\right)

983: \leq \exp\left(-\frac 12 \epsilon ^2 |R| \right) \leq (nm)^{-2},

984: \label{eq-close300}

985: \end{eqnarray}

986: %wangl --change the P in the last line to R.

987: where the last inequality is

988: due to the setting

989: %because of

990: $|R|=\lceil \frac {4}{\epsilon ^2}\log (nm)\rceil$ in

991: step 1(b) of the algorithm.

992: %wangl -- I do not like "because of"

993: Similarly,  using  Lemma~\ref{lem-chernoff1} (1)  we have

994: \begin{equation}

995: \label{eq-close400}

996: {\bf Pr}\left(\rho \, d^R(s^*,t_i)+ d^Q(s^*,t_i)

997: \geq d(s^*,t_i) + \epsilon |P|\right)

998: \leq (nm)^{-\frac 43}.

999: \end{equation}
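
The two exponents can be checked directly from
$|R| \geq \frac{4}{\epsilon^2}\log(nm)$.
The sketch below assumes that $\log$ is the natural logarithm and that part (1) of
Lemma~\ref{lem-chernoff1} yields a bound of the form
$\exp(-\frac 13 \epsilon^2 |R|)$; the latter is inferred here from the stated
result rather than quoted from the lemma.
\[
\exp\left(-\frac 12 \epsilon^2 |R|\right) \leq \exp\left(-2\log(nm)\right) = (nm)^{-2},
\qquad
\exp\left(-\frac 13 \epsilon^2 |R|\right) \leq \exp\left(-\frac 43 \log(nm)\right) = (nm)^{-\frac 43}.
\]
Since $(nm)^{-2}\leq (nm)^{-\frac 43}$, the sum of the two probabilities is at
most $2\,(nm)^{-\frac 43}$.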

1000: Combining Formulas~(\ref{eq-close200}), (\ref{eq-close300}), and (\ref{eq-close400}),

1001: we know that for any $t'$ that satisfies Formula~(\ref{eq-close150}),

1002: \begin{equation}

1003: {\bf Pr} \left( \rho\, d^R(s^*,t') + d^Q(t_{i_1},t')

1004: \leq \rho\, d^R(s^*,t_i)+ d^Q(t_{i_1},t_i) \right)

1005: \leq 2\, (nm)^{-\frac 43}.

1006: \label{eq-close500}

1007: \end{equation}

1008: For any fixed $1\leq i\leq n$, there are fewer than $m$ substrings $t'$

1009: that satisfy Formula~(\ref{eq-close150}).  Thus,

1010: from Formula~(\ref{eq-close500}) and the definition of $t'_i$,

1011: \begin{equation}

1012: {\bf Pr}\left( d(s^*,t'_i) \geq d(s^*,t_i)+ 2 \epsilon |P| \right)

1013: \leq 2\,n^{-\frac {4}{3}}m^{-\frac 13}.

1014: \end{equation}

1015: Summing over all $i\in [1,n]$, we know that with probability

1016: at least $1-2\,(nm)^{-\frac 13}$,

1017: $d(s^*,t'_i)\leq d(s^*,t_i)+ 2 \epsilon |P|$ for all $i$.
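
The two union bounds used above amount to the following arithmetic, written
out here as a sketch:
\[
m \cdot 2\,(nm)^{-\frac 43} = 2\, n^{-\frac 43} m^{-\frac 13},
\qquad
n \cdot 2\, n^{-\frac 43} m^{-\frac 13} = 2\,(nm)^{-\frac 13}.
\]
The first product bounds, for a fixed $i$, the probability that some $t'$
satisfying Formula~(\ref{eq-close150}) scores no worse than $t_i$; the second
sums this failure probability over the $n$ input strings.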

1018: $\qed$

1019: \end{proof}

1020:

1021:  From Fact 1,

1022: $d(s^*,t_i)=d^P(s,t_i)+d^Q(t_{i_1},t_i)\leq d(s,t_i)+\frac 1{2r-1} \, d_{opt}.$

1023: Combining with Fact~2 and $|P|\leq r\, d_{opt}$,

1024: we get

1025: \begin{equation}

1026: d(s^*,t'_i)\leq (1+ \frac 1{2r-1} + 2 \epsilon \, r) d_{opt}.

1027: \label{eq-close600}

1028: \end{equation}

1029: By the definition of $s^*$, the optimization

1030: problem defined by Formula~(\ref{opt2}) has a solution $s|_P$

1031: such that $d\leq (1+ \frac 1{2r-1} + 2 \epsilon \, r) d_{opt}$.

1032: We can solve  the optimization problem  within error $\epsilon |P|$ by the method in

1033: the proof of Lemma~\ref{lem-rest}.

1034: Let  $x$ be  the solution of the optimization problem.

1035: %wangl -breake the sentence

1036: Then by Formula~(\ref{opt2}), for any $1\leq i\leq n$,

1037: \begin{equation}

1038: d(t'_i|_P,x)\leq

1039: (1+ \frac 1{2r-1} + 2 \epsilon \, r) d_{opt}

1040: -d(t'_i|_Q,t_{i_1}|_Q)+\epsilon |P|.

1041: \label{close-800}

1042: \end{equation}

1043: Let $s'$ be as defined in step 1(c)(iii).  Then, by Formula~(\ref{close-800}),

1044: \begin{eqnarray*}

1045: d(s',t'_i)&=&d(x,t'_i|_P)+ d(t_{i_1}|_Q,t'_i|_Q)

1046: \\&\leq& (1+ \frac 1{2r-1} + 2 \epsilon r) d_{opt} + \epsilon |P|

1047: \\&\leq& (1+\frac 1{2r-1} + 3\epsilon r) d_{opt}.

1048: \end{eqnarray*}
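
The last inequality uses $|P|\leq r\,d_{opt}$.  A short justification, added
here as a sketch: every position in $P$ is one where
$t_{i_1},t_{i_2},\ldots,t_{i_r}$ do not all agree, so at least one $t_{i_j}$
differs from $s$ there, and each $t_{i_j}$ differs from $s$ in at most
$d_{opt}$ positions.  Hence
\[
|P| \leq \sum_{j=1}^{r} d(s,t_{i_j}) \leq r\, d_{opt},
\]
so that $\epsilon |P| \leq \epsilon\, r\, d_{opt}$.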

1049:

1050: It is easy to see that the algorithm runs in polynomial time for

1051: any fixed positive $r$ and $\epsilon$.  For any $\delta>0$, by

1052: properly setting $r$ and $\epsilon$ such that

1053: $\frac 1{2r-1} + 3\epsilon r \leq \delta$, with high probability,

1054: the algorithm outputs in polynomial time a solution $s'$

1055: such that $d(t'_i,s')$ is no more than $(1+\delta)d_{opt}$

1056: for every $1\leq i\leq n$, where $t'_i$ is a substring of $s_i$.
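
One concrete choice of parameters, given as an illustration rather than as a
choice made in the analysis above, is $r=\lceil 1/\delta \rceil + 1$ and
$\epsilon = \delta/(6r)$.  Then $2r-1 \geq 2/\delta$, so
\[
\frac 1{2r-1} + 3\epsilon r \leq \frac{\delta}{2} + \frac{\delta}{2} = \delta .
\]
For the running time, step 1 examines at most $(nm)^r$ choices of the $r$
substrings and, for each of them, $|\Sigma|^{|R|}$ strings $y$.
Taking logarithms to base $2$ (an assumption of this sketch),
\[
|\Sigma|^{|R|} = 2^{|R| \log |\Sigma|} = (nm)^{O(\log |\Sigma| / \epsilon^2)},
\]
which is polynomial in $nm$ for fixed $\epsilon$ and constant alphabet size.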

1057: %wangl change "and" to "where",  add "for"

1058: The algorithm can be derandomized by standard methods~\cite{MR95}.

1059: $\qed$

1060: \end{proof}

1061:

1062: \section*{Acknowledgements}

1063: We would like to thank Tao Jiang, Kevin Lanctot,

1064: Joe Wang, and Louxin Zhang for discussions and suggestions on related

1065: topics.

1066:

1067: Ming Li is supported in part by the

1068: NSERC Research Grant OGP0046506, a CGAT grant,

1069: and the E.W.R. Steacie Fellowship. Bin Ma is supported in part by

1070: the NSERC Research Grant OGP0046506. Bin Ma and Lusheng Wang are

1071: supported in part by HK RGC Grants 9040297, 9040352, 9040444 and CityU

1072: Strategic Grant 7000693.

1073: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1074: \begin{thebibliography}{99}

1075:

1076: %\bibitem{B94}

1077: %V. Bafna, E. Lawler and P. Pevzner,

1078: %Approximation algorithms for multiple sequence alignment,

1079: %{\em Proc.\ 8th Ann. Combinatorial Pattern

1080: %Matching Conf.\ (CPM'94)}, pp. 43-53,  1994.

1081:

1082: \bibitem{BLPR97}

1083: A. Ben-Dor, G. Lancia, J. Perone, and R. Ravi,

1084: Banishing bias from consensus sequences,

1085: {\em Proc.\ 8th Ann.\ Combinatorial Pattern

1086: Matching Conf.\ (CPM'97)}, pp. 247-261, 1997.

1087:

1088: \bibitem{BGHMS97}

1089: P. Berman, D. Gumucio, R. Hardison, W. Miller, and N. Stojanovic,

1090: A linear-time algorithm for the 1-mismatch problem,

1091: {\em WADS'97}, 1997.

1092:

1093: %\bibitem{CHS95}

1094: %Q. Chan, G. Hertz, G. Stormo,

1095: %Matrix search 1.0: a computer program that

1096: %scans DNA sequences for transcriptional elements using a database of weight

1097: %matrices, {\em CABIOS} (1995) 563-566.

1098: %{Computer Applications in the Biosciences}, pp. 563-566,  1995.

1099:

1100: \bibitem{DR+93}

1101: J. Dopazo, A. Rodr\'{\i}guez, J. C. S\'{a}iz, and F. Sobrino,

1102: Design of primers for PCR amplification of highly variable genomes,

1103: {\it CABIOS}, 9(1993), 123-125.

1104:

1105: %\bibitem{FM95}

1106: %Y. M. Fraenkel, Y Mandel, D. Friedberg and H. Margalit,

1107: %Identification of common motifs in unaligned DNA sequences: application to

1108: %Escherichia coli Lrp regulon,

1109: %{\em CABIOS}, (1995) 379-387.

1110: %{Computer Applications in the Biosciences}, pp. 379-387, 1995.

1111:

1112: \bibitem{FL97}

1113: M. Frances, A. Litman, On covering problems of codes,

1114: {\em Theory Comput.\ Syst.} 30(1997) 113-119.

1115:

1116: %\bibitem{GJ79}

1117: %M. Garey and D. Johnson,

1118: %{\em Computers and Intractability, a guild to the theory of

1119: %NP-completeness}, Freeman, 1979.

1120:

1121: \bibitem{GJL99}

1122: L. G\c{a}sieniec, J. Jansson, and A. Lingas,

1123: Efficient approximation algorithms for the Hamming center problem,

1124: {\it Proc.\ 10th ACM-SIAM Symp. on Discrete Algorithms}, pp. S905-S906, 1999.

1125:

1126: \bibitem{G93}

1127: D. Gusfield,

1128: Efficient methods for multiple sequence alignment with guaranteed

1129: error bounds,

1130: {\it Bull. Math. Biol.}, vol. 55, pp. 141-154, 1993.

1131:

1132: \bibitem{Danbook}

1133: D. Gusfield,

1134: {\em Algorithms on Strings, Trees, and Sequences},

1135: Cambridge Univ. Press, 1997.

1136:

1137: \bibitem{HS94}

1138: G. Hertz and G. Stormo,

1139: Identification of consensus patterns in

1140: unaligned DNA and protein sequences: a large-deviation statistical

1141: basis for penalizing gaps.

1142: In: {\em Proc.\ 3rd Int'l

1143: Conf. Bioinformatics and Genome Research} (Lim and Cantor, eds.) World

1144: Scientific, 1995, pp. 201-216.

1145:

1146: %\bibitem{Hoe63}

1147: %W. Hoeffding,

1148: %Probability inequalities for sums of bound random variables.

1149: %{\it J. Amer. Statist. Assoc.}, 58(1963), 13-30.

1150:

1151: %\bibitem{K72}

1152: %R. Karp,

1153: %Reducibility among combinatorial problems,

1154: %in R.E. Miller and J.W. Thatcher (eds), {\em Complexity

1155: %of Computer Computations}, Plenum Press, pp. 85-103, 1972.

1156:

1157: %\bibitem{KKK95}

1158: %Y. V. Kondrakhin, A.E. Kel, N.A. Kolchanov,  A.G. Romashchenko,

1159: %and L. Milanesi,

1160: %Eukaryotic promoter recognition by binding sites for transcription factors,

1161: %{\em CABIOS}, pp. 477-488, 1995.

1162: %{Computer Applications in the Biosciences}, pp. 477-488, 1995.

1163:

1164: \bibitem{LR90}

1165: C. Lawrence and A. Reilly,

1166: An expectation maximization (EM) algorithm for the identification

1167: and characterization of common sites in unaligned biopolymer

1168: sequences, {\em Proteins} 7(1990) 41-51.

1169:

1170: \bibitem{LB+91}

1171: K. Lucas, M. Busch, S. M\"{o}ssinger and J.A. Thompson,

1172: An improved microcomputer program for finding gene- or gene family-specific

1173: oligonucleotides suitable as primers for polymerase chain reactions or

1174: as probes, {\it CABIOS}, 7(1991),  525-529.

1175:

1176: \bibitem{LL+99}

1177: K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang,

1178: Distinguishing string selection problems,

1179: {\it Proc.\ 10th ACM-SIAM Symp. on Discrete Algorithms}, pp. 633-642, 1999.

1180: Also to appear in {\em Information and Computation}.

1181:

1182: \bibitem{LMW99}

1183: M. Li, B. Ma, and L. Wang, Finding Similar Regions in Many Strings,

1184: {\it Proceedings of the Thirty-first Annual ACM Symposium on Theory of

1185: Computing}, pp. 473-482, Atlanta, 1999.

1186:

1187: \bibitem{LMW99-j}

1188: M. Li, B. Ma, and L. Wang, Finding Similar Regions in Many Sequences,

1189: submitted to {\em J.\ Comput.\ Syst.\ Sci.} special issue for

1190: {\it Thirty-first Annual ACM Symposium on Theory of Computing}, 1999.

1191:

1192: \bibitem{M00}

1193: B. Ma, A polynomial time approximation scheme for the closest substring

1194: problem, to appear in

1195: {\it Proc.\ 11th Annual Symposium on Combinatorial Pattern Matching},

1196: Montreal, June 21-23, 2000.

1197:

1198: %\bibitem{LV97}

1199: %M. Li, P. Vit\'anyi,

1200: %{\em An Introduction to Kolmogorov Complexity and Its Applications},

1201: %Springer, 1993.

1202:

1203: \bibitem{MR95}

1204: R. Motwani and P. Raghavan,

1205: {\em Randomized Algorithms},

1206: Cambridge Univ. Press, 1995.

1207:

1208: %\bibitem{P92}

1209: %P. Pevzner,

1210: %Multiple alignment, communication cost, and graph matching,

1211: %{\it SIAM J. Applied Math.}, 52(1992), 1763-1779.

1212:

1213: %\bibitem{P96}

1214: %D. S. Prestridge,

1215: %SIGNAL SCAN 4.0: additional databases and sequence formats,

1216: %{\em CABIOS} (1996) 157-160.

1217: %{Computer Applications in the Biosciences}, pp. 157-160, 1996.

1218: %%%%???Lusheng, I convert all {Computer Applications in the

1219: %%%%Biosciences} to CABIOS to shorten to make space ... I hope this is

1220: %%%%correct:-)

1221:

1222: \bibitem{PBPR89}

1223: J. Posfai, A. Bhagwat, G. Posfai, and R. Roberts,

1224: Predictive motifs derived from cytosine methyltransferases,

1225: {\em Nucl. Acids Res.}, 17(1989), 2421-2435.

1226:

1227: \bibitem{PH96}

1228: V. Proutski and E. C. Holmes,

1229: Primer Master: a new program for the design and analysis of PCR

1230: primers, {\it CABIOS}, 12(1996), 253-255.

1231:

1232: %\bibitem{R92}

1233: %M.A. Roytberg,

1234: %A search for common patterns in many sequences,

1235: %{\em CABIOS} (1992) 57-64.

1236: %{Computer Applications in the Biosciences}, pp. 57-64, 1992.

1237:

1238: \bibitem{SAL91}

1239: G. D. Schuler, S. F. Altschul, and D. J. Lipman,

1240: A workbench for multiple alignment construction and analysis,

1241: {\em Proteins: Structure, Function and Genetics}, 9(1991) 180-190.

1242:

1243: \bibitem{S90}

1244: G. Stormo,

1245: Consensus patterns in DNA,

1246: in R.F. Doolittle (ed.), {\em Molecular evolution: computer

1247: analysis of protein and nucleic acid sequences},

1248: {\em Methods in Enzymology}, 183, pp. 211-221, 1990.

1249:

1250: \bibitem{SH91}

1251: G. Stormo and G.W. Hartzell III,

1252: Identifying protein-binding sites from

1253: unaligned DNA fragments. {\em Proc.\ Natl.\ Acad.\ Sci.\ USA},

1254: 88(1991), 5699-5703.

1255:

1256: %\bibitem{WFHW96}

1257: %F. Wolfertstetter, K. Frech, G. Herrmann and T. Werner,

1258: %{\em CABIOS} (1996) 71-80.

1259: %{Computer Applications in the Biosciences}, pp. 71-80, 1996.

1260:

1261: \bibitem{W86}

1262: M. Waterman,

1263: Multiple sequence alignment by consensus,

1264: {\em Nucl. Acids Res.}, 14(1986), 9095-9102.

1265:

1266: \bibitem{WAG84}

1267: M. Waterman, R. Arratia and D. Galas,

1268: Pattern recognition in several sequences: consensus and alignment,

1269: {\em Bull. Math. Biol.}, 46(1984), 515-527.

1270:

1271: \bibitem{WP84}

1272: M. Waterman and  M. Perlwitz,

1273: Line geometries for sequence comparisons,

1274: {\em Bull. Math. Biol.}, 46(1984), 567-577.

1275:

1276: \end{thebibliography}

1277:

1278: \end{document}

1279: