
1: \input 11layout

2: \input macro

3: \input epsf

4:

5: \def\tld{\twoldots}

6:

7: \centerline{\biggbf Finding Approximate Palindromes in Strings}

8:

9: \bigskip\bigskip

10: \centerline{\it Alexandre H. L. Porto}

11: \centerline{\it Valmir C. Barbosa}

12:

13: \bigskip

14: \centerline{Programa de Engenharia de Sistemas e Computa\c c\~ao, COPPE}

15: \centerline{Universidade Federal do Rio de Janeiro}

16: \centerline{Caixa Postal 68511}

17: \centerline{21945-970 Rio de Janeiro - RJ, Brazil}

18:

19: \medskip

20: \centerline{\tt xandao@cos.ufrj.br, valmir@cos.ufrj.br}

21:

22: \bigskip\bigskip

23: \centerline{\bf Abstract}

24:

25: \medskip

26: \noindent

27: We introduce a novel definition of approximate palindromes in strings, and

28: provide an algorithm to find all maximal approximate palindromes in a string

29: with up to $k$ errors. Our definition is based on the usual edit operations of

30: approximate pattern matching, and the algorithm we give, for a string of size

31: $n$ on a fixed alphabet, runs in $O(k^2n)$ time. We also discuss two

32: implementation-related improvements to the algorithm, and demonstrate their

33: efficacy in practice by means of both experiments and an average-case analysis.

34:

35: \bigskip\bigskip

36: \noindent

37: {\bf Keywords:} Approximate palindromes, string editing, approximate pattern

38: matching.

39:

40: \vfill\eject

41: \bigbeginsection 1. Introduction

42:

43: Let $S$ be a string of $n$ characters from a fixed alphabet $\Sigma$. For

44: $1\le i\le j\le n$, let $S[i]$ denote the $i$th character in $S$ and

45: $S[i\tld j]$ denote the substring of $S$ whose first and last characters are

46: $S[i]$ and $S[j]$, respectively. We let $S^R$ denote the string whose $i$th

47: character is $S[n-i+1]$, that is, $S$ and $S^R$ are essentially the same string

48: when read in opposing directions. We say that $S$ and $S^R$ are the

49: {\it reverse\/} of each other.

50:

51: In this paper, we are concerned with {\it palindromes\/} occurring in $S$,

52: which are substrings $S[i\tld j]$ such that $S[i\tld j]=S[i\tld j]^R$. If

53: $S[i\tld j]$ is a palindrome, then it is an {\it even palindrome\/} if it

54: contains an even number of characters, otherwise it is an {\it odd palindrome}.

55: The {\it center\/} of this palindrome is $S[c]$, where

56: $c=i-1+\bigl\lceil(j-i+1)/2\bigr\rceil$. It follows that

57: $S[c+1\tld j]=S[i\tld c]^R$ for an even palindrome, while

58: $S[c+1\tld j]=S[i\tld c-1]^R$ for an odd palindrome. A palindrome $S[i\tld j]$

59: is an {\it initial palindrome\/} if $i=1$ and a {\it final palindrome\/} if

60: $j=n$. It is said to be a {\it maximal palindrome\/} if it is an initial

61: palindrome, a final palindrome, or if $S[i-1\tld j+1]$ is not a palindrome for

62: $1<i\le j<n$.
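\medskip
As a brief illustration of these definitions (a hand-worked example on the string that is also used for Figure 1), let $S={\it bbaabac}$, so that $n=7$. The substring $S[2\tld 5]={\it baab}$ is an even palindrome with
$$c=2-1+\bigl\lceil(5-2+1)/2\bigr\rceil=3,$$
and indeed $S[4\tld 5]={\it ab}=S[2\tld 3]^R$. The substring $S[4\tld 6]={\it aba}$ is an odd palindrome with
$$c=4-1+\bigl\lceil(6-4+1)/2\bigr\rceil=5,$$
and $S[6\tld 6]={\it a}=S[4\tld 4]^R$. Both are maximal, since neither $S[1\tld 6]={\it bbaaba}$ nor $S[3\tld 7]={\it aabac}$ is a palindrome.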

63:

64: Several algorithms have appeared for the detection of palindromes in strings.

65: These include sequential algorithms for detecting all initial palindromes or

66: maximal palindromes centered at all positions [1, 2], as well as parallel

67: algorithms [3--6].

68:

69: Palindromes appear in several domains, chiefly in computational molecular

70: biology [7, 8], where $\Sigma$ is for example the set of bases that link

71: together to form strands of nucleic acid. In this domain of application,

72: palindromes are often required to be {\it complementary}, in the following

73: sense. While for a palindrome centered at $c$ we have either $S[c-r+1]=S[c+r]$

74: or $S[c-r]=S[c+r]$, depending respectively on whether the palindrome is even or

75: odd and for $r$ representing distance from the center, if the palindrome is

76: complementary then the two characters are no longer equal but rather constitute

77: a complementary pair. In this case, we no longer have $S[i\tld j]=S[i\tld j]^R$,

78: but rather complementarity between the two strings.

79:

80: Although in the sequel we deal exclusively with palindromes for which equality

81: is used to compare them with their reverses (everything carries over trivially

82: to the case of complementarity), we dwell a little longer on the application of

83: palindromes to computational molecular biology because in that area the

84: definition we have given for palindromes is ``too exact,'' being therefore of

85: little use. Palindromes that matter in that domain are ``approximate,'' in the

86: sense that the symmetry around a palindrome's center need not be perfect, but

87: may instead contain a certain number of mismatches in the form of gaps and

88: defects of various other natures [7, 8]. This paper is about finding

89: approximate palindromes in $S$.

90:

91: The remainder of this paper is organized as follows. We start in

92: Section 2 by providing a precise definition of what is to be understood as

93: an approximate palindrome and a maximal approximate palindrome. This definition

94: is novel, as previous definitions appear to have been too restricted [7]. Then

95: in Section 3 we give an algorithm to find all maximal approximate palindromes

96: in a string allowing for $k$ errors. The algorithm runs in $O(k^2n)$ time.

97: Section 4 contains two improvements on the basic algorithm. These improvements

98: do not lead to an improved complexity, but do in practice make a difference,

99: as we demonstrate by means of some experimental results. Section 5 contains

100: an average-case analysis of the two improvements, which indicates that the

101: conclusions drawn in the previous section can be expected to hold on average.

102: Concluding remarks follow in Section 6.

103:

104: \bigbeginsection 2. Approximate palindromes

105:

106: Our definition of an approximate palindrome centered at position $c$ in string

107: $S$ is given for an integer $k\ge 0$ indicating the maximum number of errors to

108: be tolerated, and is based on the notion of {\it string editing}, that is,

109: the transformation of one string into another [7]. In order to define

110: approximate palindromes, we consider, for $1\le c\le n$, the editing of string

111: $S_\ell^c=S[1\tld u]^R$ to obtain string $S_r^c=S[c+1\tld n]$, where $u=c$ for

112: even palindromes or $u=c-1$ for odd palindromes (here, and henceforth,

113: $S[i\tld j]$ is to be understood as the empty string if $i>j$).

114:

115: The string editing that we consider is the same that has been used for other

116: problems on strings, and employs the operations presented next. These operations

117: act on cursors $p$ and $q$, which are used to point to specific positions in

118: $S_\ell^c$ and $S_r^c$, respectively. These cursors satisfy $0\le p\le u$

119: and $0\le q\le n-c$, the value $0$ being used only to initialize the cursors as

120: still not pointing to characters in the string. Note that $p$ and $q$, when

121: nonzero, do not indicate positions in $S$, but in two of its substrings taken as

122: independent entities.

123:

124: The operations we consider are the following, of which (ii)--(iv) are

125: called {\it edit operations}.

126:

127: \medskip

128: \itemitem{(i)} {\it Matching\/}: If $S_\ell^c[p+1]=S_r^c[q+1]$, then

129: increment both $p$ and $q$.

130:

131: \medskip

132: \itemitem{(ii)} {\it Substitution\/}: If $S_\ell^c[p+1]\neq S_r^c[q+1]$,

133: then increment both $p$ and $q$.

134:

135: \medskip

136: \itemitem{(iii)} {\it Insertion\/}: Increment $q$.

137:

138: \medskip

139: \itemitem{(iv)} {\it Deletion\/}: Increment $p$.

140:

141: \medskip

142: While a matching can only be applied if it will make both cursors

143: point to equal characters, the edit operations characterize the

144: possible sources of mismatch between the two strings. What they do is to allow

145: for a character from $S_r^c$ to substitute for a character in $S_\ell^c$ (this

146: is a substitution), for a character from $S_r^c$ to be inserted into

147: $S_\ell^c$ as an additional character (an insertion), or for a character to be

148: deleted from $S_\ell^c$ (a deletion). When grouped into a sequence,

149: matching and edit operations provide a script

150: (the {\it edit script\/}) for a prefix of $S_\ell^c$ to be turned into a

151: replica of a prefix of $S_r^c$ (note that, if an unlimited number of edit

152: operations is allowed, then such an edit script is guaranteed to exist).

153: The edit script to convert one prefix into the other having the smallest

154: number of edit operations is said to be {\it optimal\/} for the two prefixes,

155: and its number of edit operations is called the {\it edit distance\/} between

156: them [9, 10].

157:

158: For $0\le p^*\le u$ and $0\le q^*\le n-c$, we say that $S[u-p^*+1\tld c+q^*]$

159: is an {\it approximate palindrome\/} in $S$ centered at $c$ if, of

160: all the edit scripts that can be used to turn $S_\ell^c[1\tld p^*]$ into

161: $S_r^c[1\tld q^*]$, the one that is optimal comprises no more than $k$ edit

162: operations (in other words, the edit distance between $S_\ell^c[1\tld p^*]$ and

163: $S_r^c[1\tld q^*]$ is at most $k$). In this case, $p^*$ and $q^*$ are the values

164: of $p$ and $q$, respectively, after any of those edit scripts is played out.

165: The palindrome is even or odd according to how $S_\ell^c$ is originally set.

166: Its {\it size\/} is either $p^*+q^*$ or $p^*+q^*+1$, respectively if it is even

167: or odd. This definition is more general than the single other definition that

168: appears to have been given for approximate palindromes [7], which only allows

169: for matchings and substitutions.

170:

171: It is curious to observe that, unlike exact palindromes, the size of an even

172: approximate palindrome does not have to be even, nor does the size of an odd

173: approximate palindrome have to be odd. These would hold, however, for the exact

174: palindrome that would be obtained if the transformation of one prefix into the

175: other were indeed performed.
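\medskip
The following hand-worked example, which is ours and assumes only the definitions above, may help fix ideas. Let $S={\it bbaabac}$, $k=3$, and consider the even case with $c=2$, so that $u=2$, $S_\ell^c=S[1\tld 2]^R={\it bb}$, and $S_r^c=S[3\tld 7]={\it aabac}$. For $p^*=2$ and $q^*=4$, one edit script turning ${\it bb}$ into ${\it aaba}$ consists of two insertions, one matching, and one substitution, hence of $3$ edit operations. No script does better: the lengths differ by $2$, so at least two insertions are needed, and two insertions alone would require ${\it bb}$ to be a subsequence of ${\it aaba}$, which contains a single ${\it b}$. The edit distance is therefore $3\le k$, and
$$S[u-p^*+1\tld c+q^*]=S[1\tld 6]={\it bbaaba}$$
is an even approximate palindrome of size $p^*+q^*=6$. For $p^*=2$ and $q^*=3$, one insertion, one substitution, and one matching turn ${\it bb}$ into ${\it aab}$ with $2$ edit operations (one insertion alone cannot suffice, as ${\it bb}$ is not a subsequence of ${\it aab}$), so $S[1\tld 5]={\it bbaab}$ is an even approximate palindrome whose size, $5$, is odd.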

176:

177: We provide in Figure 1 an illustration of this concept of an approximate

178: palindrome. The figure contains four approximate palindromes in the string

179: ${\it bbaabac}$ for $k=3$, two even in part (a) and two odd in part (b).

180: The palindromes are depicted in a way that evidences their two parts and also

181: their centers, in the odd case. Blank cells appearing in the left part

182: correspond to insertions, those on the right to deletions (for ease of

183: representation, in the figure we let deletions from the left part be

184: represented as insertions into the right part). Substitutions are indicated

185: by shaded cells placed symmetrically in the two parts.

186:

187: Having defined approximate palindromes for $c$ and $k$ fixed, the notion that

188: remains to be introduced before we move on to discuss their detection is that

189: of maximality. Note, first, that the simple definition of a maximal palindrome

190: in the exact case does not carry over simply to the approximate case. In the

191: case of exact palindromes, maximality is related to the inability to extend a

192: palindrome into another substring of $S$ that is also a palindrome. In the

193: approximate case, however, the potential existence of several acceptable edit

194: scripts makes it inappropriate to adopt such a straightforward definition.

195:

196: While there does appear to exist more than one possibility for defining the

197: maximality of approximate palindromes, what we do in this paper is to say

198: that an approximate palindrome is {\it maximal\/} if no other

199: approximate palindrome for the same $c$ and $k$ exists having strictly greater

200: size or the same size but strictly fewer errors (edit distance between its two

201: parts). Unlike the case of exact palindromes, this definition clearly does not

202: guarantee the uniqueness of an approximate palindrome that is maximal.

203:

204: \bigbeginsection 3. An algorithm

205:

206: In this section, and for $k$ fixed, we introduce an algorithm for detecting

207: maximal approximate palindromes, one even and one odd for each $c$ such that

208: $1\le c\le n$.

209:

210: Our algorithm is based on an acyclic directed graph $D$, whose definition

211: relies on two generic strings $X$ and $Y$ on the same alphabet. We let

212: $x=\vert X\vert$ and $y=\vert Y\vert$. Graph $D$ has $(x+1)(y+1)$ nodes, one

213: for each $(i,j)$ pair such that $0\le i\le x$ and $0\le j\le y$. A directed

214: edge exists in $D$ from node $(i,j)$ to node $(i',j')$ if either $i'=i$

215: and $j'=j+1$, or $i'=i+1$ and $j'=j$, or yet $i'=i+1$ and $j'=j+1$. If we

216: position the nodes of $D$ on the vertices of a two-dimensional grid having $x+1$

217: rows and $y+1$ columns so that a node's first coordinate grows from top to

218: bottom and the second from left to right, then clearly directed edges exist

219: between nearest neighbors in the same row (a {\it horizontal edge\/}),

220: the same column (a {\it vertical edge\/}), and the same diagonal (a

221: {\it diagonal edge\/}). Because $j'-i'=j-i$ when a diagonal edge exists from

222: $(i,j)$ to $(i',j')$, we use such differences to label the various diagonals on

223: the grid. Diagonal labels are then in the range of $-x$ through $y$.
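\medskip
As a quick count under these definitions (the values of $x$ and $y$ below are chosen by us for illustration), $D$ has $(x+1)y$ horizontal edges, $x(y+1)$ vertical edges, and $xy$ diagonal edges, that is,
$$(x+1)y+x(y+1)+xy=3xy+x+y$$
edges in total, and its diagonals, labeled $-x$ through $y$, number $x+y+1$. For $x=2$ and $y=5$, this gives $18$ nodes, $37$ edges, and $8$ diagonals.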

224:

225: Now consider a directed path in $D$ leading from node $(i,j)$ to node $(i',j')$.

226: By definition of the directed edges, clearly $i'\ge i$ and $j'\ge j$. The

227: importance of graph $D$ in our present context is that this directed path, if

228: it contains at least one edge, can be interpreted as an edit script for string

229: $X[i+1\tld i']$ to be transformed into string $Y[j+1\tld j']$. Along the script,

230: the cursors $p$ and $q$ of operations (i)--(iv) are used on $X$ and $Y$,

231: respectively, such that $i\le p\le i'$ and $j\le q\le j'$. On such a path, a

232: diagonal edge corresponds to either a matching or a substitution, a horizontal

233: edge to an insertion, and a vertical edge to a deletion. The number of edges on

234: the path that do not correspond to matchings is the number of edit operations in

235: the script. The optimal edit script for strings $X[i+1\tld i']$ and

236: $Y[j+1\tld j']$ is represented in $D$ by a directed path from $(i,j)$ to

237: $(i',j')$ whose number of edges corresponding to edit operations is minimum

238: among all directed paths between the two nodes. Such a path is said to be

239: {\it shortest\/} among all those paths according to the metric that assigns,

240: say, length $0$ to edges corresponding to matchings and length $1$ to all other

241: edges.

242:

243: We give an illustration in Figure 2, where graph $D$ is shown for

244: $X={\it bb}$ and $Y={\it aabac}$. The directed path shown in solid lines

245: contains two insertions, followed by one matching and one substitution,

246: and is a shortest path between its two end vertices. It corresponds,

247: therefore, to an optimal edit script to transform $X[1\tld 2]$ into

248: $Y[1\tld 4]$. All other edges are shown as dotted lines, directions omitted

249: for clarity.

250:

251: For $0\le e\le k$, let a directed path in $D$ be called an {\it $e$-path\/} if

252: it contains $e$ edges related to edit operations. One crucial problem to be

253: solved on $D$ is the problem of determining, for each diagonal and each $e$, the

254: $e$-path, if one exists, that starts in row $0$, ends on that diagonal at the

255: farthest possible node (greatest row number), and is in addition shortest among

256: all paths that start and end at the same nodes. This problem can be solved by

257: the following dynamic-programming approach.

258:

259: \medskip

260: \itemitem{1.} For $d=0,\ldots,y$, find the largest common prefix of strings

261: $X$ and $Y[d+1\tld y]$. Such prefixes will correspond to the $0$-paths that

262: start in row $0$ and end at the farthest possible nodes. Each of them will

263: be entirely confined to a diagonal $d$ and, like all $0$-paths, will be

264: shortest among all paths joining its end nodes.

265:

266: \medskip

267: \itemitem{2.} For $e=1,\ldots,k$, and for $d=-\min\{e,x\},\ldots,y$, do:

268:

269: \smallskip

270: \itemitemitem{2.1.} Consider the $(e-1)$-paths, if any,  determined in the

271: previous step on those of diagonals $d-1$, $d$, and $d+1$ that exist. Each of

272: these paths corresponds to an optimal edit script for transforming a prefix of

273: $X$ into a substring of $Y$. If possible, extend these scripts, respectively

274: by adding an insertion, a substitution, and a deletion, thereby creating

275: $e$-paths that end on diagonal $d$.

276:

277: \smallskip

278: \itemitemitem{2.2.} Of the $e$-paths created in step 2.1, if any, pick the

279: one that ends farthest down diagonal $d$ and extend it further by computing the

280: largest common prefix of what remains of $X$ and what remains of $Y$. The result

281: will be an $e$-path that starts in row $0$ and ends on diagonal $d$ at the

282: farthest possible node, being in addition shortest among all paths starting and

283: ending at the same nodes.

284:

285: \medskip

286: The lower bound on $d$ in step 2 reflects the fact that diagonal $-e$ can only

287: be reached from row $0$ by an $e'$-path, where $1\le e\le e'\le k$. What the

288: entire procedure computes is a set of directed paths departing from row $0$ at

289: several columns. Each of these directed paths is an $e$-path, for some $e$ such

290: that $0\le e\le k$, that ends as far down in the graph as possible, and is also

291: shortest among all paths that start and end at the same nodes. So an $e$-path

292: computed by the algorithm departing from node $(0,d)$ for $0\le d\le y$

293: represents an optimal edit script for turning a prefix of $X$ into a prefix of

294: $Y[d+1\tld y]$ with $e$ edit operations.

295:

296: The basic procedure comprising steps 1 and 2 was introduced to solve the problem

297: of approximate pattern matching [7, 11--13], which requires all approximate

298: occurrences of $X$ in $Y$ having edit distance from $X$ of at most $k\le y$ to

299: be determined [14]. The solution works by selecting, after the

300: execution of steps 1 and 2, the $e$-paths that end on row $x$ for $0\le e\le k$.

301: Several other solutions to this problem exist [7, 15--20].

302:

303: Note that the largest common prefixes asked for in steps 1 and 2.2 can be

304: obtained easily if we have a means of computing the largest common prefix of any

305: two suffixes of $X\$_1Y$, where $\$_1$ is any character that does not occur in

306: $X$ or $Y$, and $X\$_1Y$ is the string obtained by appending $\$_1$ to $X$, then

307: $Y$ to $X\$_1$. Such a means, of course, is provided by the well-known suffix

308: tree for string $X\$_1Y\$_2$, where $\$_2$ is any character not occurring in

309: $X\$_1Y$, ultimately needed to ensure that the tree does indeed exist. After

310: the suffix tree is built and preprocessed, which can be achieved in $O(x+y)$

311: time, any of those largest common prefixes can be found in constant time [7].

312: For approximate pattern matching, $x\le y$, so $O(y)$ is the time it takes to

313: handle the suffix tree initially. After that, the complexity of steps 1 and 2

314: is dominated by step 2, which comprises $O(ky)$ repetitions of 2.1 and 2.2,

315: each requiring $O(1)$ time per repetition. The overall time is then $O(ky)$.

316:

317: The same basic procedure can also be used to solve another problem involving

318: strings $X$ and $Y$, known as the $k$-differences problem. Assuming

319: $k\le\max\{x,y\}$ to avoid trivial cases, this problem asks for the edit script

320: to transform $X$ into $Y$ with the fewest possible edit operations, but no more

321: than $k$ [13, 21]. Clearly, no solution exists if $\vert x-y\vert>k$. A solution

322: may exist otherwise, and will correspond to the $e$-path from $(0,0)$ to $(x,y)$

323: for which $e$ is minimum (that is, a shortest path between the two nodes), if

324: one exists with $e\le k$. Adapting steps 1 and 2 to find such a path is a simple

325: matter, as follows. In step 1, let $d=0$ only. In step 2, let the range for $d$

326: be from $-\min\{e,x\}$ through $\min\{e,y\}$, again reflecting the fact that it

327: takes at least $e$ errors to reach diagonal $-e$ or $e$ from $(0,0)$. Finally,

328: abort the iterations whenever node $(x,y)$ is reached.

329:

330: This solution to the $k$-differences problem requires a number of matching

331: extensions (steps 1 and 2.2) given by $1+\sum_{e=1}^kO(e)=O(k^2)$, each one

332: requiring $O(1)$ time after the initial construction and preprocessing of the

333: suffix tree for $X\$_1Y\$_2$ in $O(x+y)$ time. The overall time is then

334: $O(k^2+x+y)$.
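\medskip
The count of matching extensions can be made explicit. Step 1 contributes one extension, and for each $e$ step 2 visits at most the $2e+1$ diagonals from $-e$ through $e$, so the number of extensions is at most
$$1+\sum_{e=1}^k(2e+1)=1+k(k+1)+k=(k+1)^2.$$
For $k=3$, for instance, this is at most $16$ extensions.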

335:

336: This algorithm for the $k$-differences problem can be used directly to solve

337: our problem of determining all maximal approximate palindromes in $S$, because

338: what is required of an approximate palindrome is precisely that one of its parts

339: be transformable into the other by means of an optimal edit script comprising

340: no more than $k$ edit operations. For $1\le c\le n$, we simply let $X=S_\ell^c$

341: and $Y=S_r^c$ (then $x+y=n$ or $x+y=n-1$, respectively for even and odd

342: palindromes). Whenever a new path is determined in step 2.2, we check if it is

343: better than the ones found previously in terms of approximate-palindrome

344: maximality; it will be better if it leads to a node whose coordinates add to a

345: larger integer than the current best path (checking for the same integer but

346: fewer errors is needless, as the algorithm generates shortest $e$-paths in

347: nondecreasing order of $e$).
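\medskip
The following trace, worked out by hand as an illustration and assuming the adaptation of steps 1 and 2 described above, shows the procedure for $S={\it bbaabac}$, $k=3$, and the even case with $c=2$, so that $X={\it bb}$, $Y={\it aabac}$, $x=2$, and $y=5$. For each $e$ we list the nodes at which the farthest-reaching $e$-paths end, in increasing order of diagonal label:
$$\eqalign{
e=0:\quad&(0,0)\hbox{ on diagonal $0$;}\cr
e=1:\quad&(1,0),\ (1,1),\ (0,1)\hbox{ on diagonals $-1$, $0$, $1$;}\cr
e=2:\quad&(2,0),\ (2,1),\ (2,2),\ (2,3),\ (1,3)\hbox{ on diagonals $-2,\ldots,2$;}\cr
e=3:\quad&(2,4),\ (1,4)\hbox{ on diagonals $2$, $3$.}\cr}$$
For $e=3$, diagonals $-2$ through $1$ are not listed because the paths on them already end on row $x=2$ for $e=2$ and cannot reach farther down. The largest coordinate sum found grows from $0$ to $2$, then $5$, then $6$, the last one at node $(2,4)$. This corresponds to $p^*=2$ and $q^*=4$, that is, to the approximate palindrome $S[1\tld 6]={\it bbaaba}$ with $3$ errors.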

348:

349: Apart from the time needed to establish the suffix tree for

350: $S_\ell^c\$_1S_r^c\$_2$, this algorithm requires $O(k^2)$ time to determine a

351: maximal approximate palindrome in $S$ for fixed $c$. Determining all even and

352: odd maximal approximate palindromes in $S$ then requires $O(k^2n)$ time beyond what is

353: needed to establish the suffix trees. If such a tree had indeed to be

354: constructed and preprocessed for each $c$, then an additional $O(n^2)$ time

355: would be required. However, note that every suffix of $S_\ell^c$ is also a

356: suffix of $S^R$, and that every suffix of $S_r^c$ is also a suffix of $S$. So

357: all that is required by steps 1 and 2.2, regardless of the value of $c$, is that

358: a preprocessed suffix tree be available for $S^R\$_1S\$_2$. This tree needs to

359: be established only once, which can be done in $O(n)$ time, and therefore the

360: overall complexity of determining all even and odd maximal approximate

361: palindromes in $S$ remains $O(k^2n)$.
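\medskip
To make the reuse of a single suffix tree concrete, the following index correspondence (our own restatement, derived from the definitions of $S^R$, $S_\ell^c$, and $S_r^c$) may help. For $1\le p\le u$ and $1\le q\le n-c$,
$$S_\ell^c[p]=S[u-p+1]=S^R[n-u+p]\qquad\hbox{and}\qquad S_r^c[q]=S[c+q],$$
so the largest common prefix of what remains of $S_\ell^c$ after position $p$ and what remains of $S_r^c$ after position $q$ is that of the suffixes $S^R[n-u+p+1\tld n]$ and $S[c+q+1\tld n]$. For instance, with $S={\it bbaabac}$ we have $S^R={\it cabaabb}$, and for the even case with $c=2$ the string $S_\ell^c={\it bb}$ is the suffix $S^R[6\tld 7]$ while $S_r^c={\it aabac}$ is the suffix $S[3\tld 7]$.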

362:

363: Assessing the space required by the algorithm depends on whether edit scripts

364: are also needed or simply the palindromes with the corresponding edit distances.

365: In the former case, the space required for each of the $O(k)$ diagonals is

366: $O(k)$; it is constant in the latter case. This, combined with the $O(n)$ space

367: needed for preprocessing, yields a space requirement of $O(k^2+n)$ or $O(k+n)$,

368: respectively.

369:

370: \bigbeginsection 4. Practical improvements

371:

372: Of the $2n$ maximal approximate palindromes determined by the algorithm of

373: Section 3 for $c=1,\ldots,n$, three are trivial and can be skipped in a

374: practical implementation. These are the odd palindrome for $c=1$, and the

375: even and odd palindromes for $c=n$. For these palindromes, the optimal edit

376: scripts contain $k$ insertions for $c=1$ and $k$ deletions for $c=n$.

377: Henceforth in the paper, we then assume that the algorithm is run for

378: $c=1,\ldots,n-1$ in the even case, $c=2,\ldots,n-1$ in the odd case.

379:

380: In addition to this simplification, the algorithm introduced in Section 3 for

381: the computation of all maximal approximate palindromes in $S$ can be improved

382: by selecting the range for $d$ in step 2 more carefully. We discuss two such

383: improvements in this section. Although they do not lead to a better execution

384: time in the asymptotic, worst-case sense, they do bring about a reduction in

385: execution times in practice, as we demonstrate.

386:

387: The first improvement consists of skipping the diagonals on which $e$-paths,

388: for suitable $e$, have been determined that end on row $x$ or column $y$.

389: The reason why this is safe to do is that no further path can be found ending

390: on those diagonals farther down from row $x$ or column $y$. The way to

391: efficiently handle this improved selection of diagonals in step 2 is to keep

392: all diagonals that are going to be processed in a doubly-linked list. This

393: allows new diagonals to be added at the list's extremes in constant time, and

394: diagonals that will no longer be processed can be deleted equally efficiently

395: as soon as it is detected that the corresponding $e$-paths have reached the

396: farthest row or column.

397:

398: The second improvement that we describe is itself an improvement over the

399: first one. The rationale is that, if a diagonal $d$ is dropped from further

400: consideration because the $e$-path that ends on it farthest down the grid

401: touches, say, row $x$, then all other diagonals $d'$ such that $d'<d$ may be

402: dropped as well. Similarly if that $e$-path on $d$ touches column $y$, in which

403: case not only diagonal $d$ but also all diagonals $d'$ such that $d'>d$ may be

404: dropped from further processing. What results from this improvement is that the

405: algorithm's operation is always confined to within a strip of contiguous

406: diagonals. This strip is limited on the left by either the leftmost diagonal on

407: which the farthest-reaching $e$-path does not touch row $x$ or diagonal

408: $-\min\{e,x\}$; on the right, it is limited by either the rightmost diagonal on

409: which the farthest-reaching $e$-path does not touch column $y$ or diagonal

410: $\min\{e,y\}$. As with the first improvement, it is a simple matter to implement

411: the second one efficiently by using the same doubly-linked list.

412:

413: We show in Figures 3 and 4 the gain that the first improvement elicits for a

414: number of strings, in Figures 5 and 6 the gain due to the second improvement,

415: and in Figures 7 and 8 the gain caused by the second improvement over the first.

416: For Figures 3 through 6, if $t_o$ is the number of iterations performed by

417: the original algorithm and $t_i$ the number of iterations performed by the

418: improved algorithm, then gain is defined as $(t_o-t_i)/t_o$. An iteration is

419: either the initial execution of step 1 or each combined execution of steps

420: 2.1 and 2.2. Because every iteration can be carried out in constant time, our

421: assessment of gain in terms of numbers of iterations as opposed to elapsed time

422: provides a platform-independent evaluation of the improved algorithms. Gain is

423: defined similarly for Figures 7 and 8.
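\medskip
As a small illustration of this measure (a sketch worked out by hand under our reading of the first improvement, and not one of the experiments reported in the figures), consider the even case with $c=2$ and $k=3$ for $S={\it bbaabac}$, so that $X={\it bb}$, $Y={\it aabac}$, $x=2$, and $y=5$. The original algorithm performs one iteration in step 1 and then processes $3$, $5$, and $6$ diagonals for $e=1,2,3$, whence $t_o=15$. For $e=2$ the farthest-reaching paths on diagonals $-2$ through $1$ end at nodes $(2,0)$, $(2,1)$, $(2,2)$, and $(2,3)$, all on row $x$, so for $e=3$ only diagonals $2$ and $3$ remain to be processed. Then $t_i=1+3+5+2=11$ and the gain is
$$(t_o-t_i)/t_o=(15-11)/15=4/15\approx 0.27.$$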

424:

425: The strings we have used are the ones given next. Of these, some are

426: {\it periodic}, meaning that there exists a string $P$ of size $s\le n$ such

427: that the periodic string is a prefix of the string formed by concatenating

428: $\lceil n/s\rceil$ copies of $P$. The {\it period\/} of the periodic string

429: is the value of $s$.

430:

431: \medskip

432: \itemitem{$\bullet$} ${\it dna}$: A DNA sequence with $n$ bases.

433:

434: \medskip

435: \itemitem{$\bullet$} ${\it dnap1}$: A periodic DNA sequence with $n$

436: bases and period $\lfloor 0.05n\rfloor$.

437:

438: \medskip

439: \itemitem{$\bullet$} ${\it dnap2}$: A periodic DNA sequence with $n$

440: bases and period $\lfloor 0.25n\rfloor$.

441:

442: \medskip

443: \itemitem{$\bullet$} ${\it dnap3}$: A periodic DNA sequence with $n$

444: bases and period $\lfloor 0.5n\rfloor$.

445:

446: \medskip

447: \itemitem{$\bullet$} ${\it txt}$: A sequence of $n$ ASCII characters.

448:

449: \medskip

450: \itemitem{$\bullet$} ${\it txtp1}$: A periodic sequence of $n$ ASCII

451: characters with period $\lfloor 0.05n\rfloor$.

452:

453: \medskip

454: \itemitem{$\bullet$} ${\it txtp2}$: A periodic sequence of $n$ ASCII

455: characters with period $\lfloor 0.25n\rfloor$.

456:

457: \medskip

458: \itemitem{$\bullet$} ${\it txtp3}$: A periodic sequence of $n$ ASCII

459: characters with period $\lfloor 0.5n\rfloor$.

460:

461: \medskip

462: \itemitem{$\bullet$} ${\it cnst}$: An $n$-fold repetition of the same

463: character.

464:

465: \medskip

466: \itemitem{$\bullet$} ${\it diff}$: A string comprising $n$ different

467: characters. This string does not fit the fixed-alphabet assumption we have made

468: from the start, so the complexity figures given in Section 3 do not apply to

469: executions of the algorithm on it.

470:

471: \medskip

472: Figures 3, 5, and 7 are given for $n=50$, while Figures 4, 6, and 8 are for

473: $n=2500$. In each figure, the values of $k$, the maximum number of errors to be

474: allowed in the approximate palindromes, are in

475: $\bigl\{\lceil 0.01n\rceil,

476: \lceil 0.05n\rceil, \lceil 0.1n\rceil, \lceil 0.2n\rceil,

477: \lceil 0.4n\rceil, \lceil 0.8n\rceil\bigr\}$. Note, in Figures 3, 5, and 7,

478: that gains are identical for ${\it dnap1}$ and ${\it txtp1}$, owing to the

479: fact that, for $n=50$, these two strings are essentially the same, being

480: periodic with period $2$, therefore having only $2$ different characters

481: throughout.
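\medskip
For reference, the arithmetic behind these parameters is as follows. For $n=50$, the values of $k$ are
$$\lceil 0.5\rceil=1,\quad\lceil 2.5\rceil=3,\quad 5,\quad 10,\quad 20,\quad 40,$$
and the periods $\lfloor 0.05n\rfloor$, $\lfloor 0.25n\rfloor$, and $\lfloor 0.5n\rfloor$ are $2$, $12$, and $25$, respectively. For $n=2500$, the values of $k$ are $25$, $125$, $250$, $500$, $1000$, and $2000$, and the periods are $125$, $625$, and $1250$.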

482:

483: With the single exception of Figure 7, we see in all cases that gains are

484: largest for ${\it cnst}$, which can be easily accounted for by the fact that,

485: for such a string, every diagonal is dropped from further consideration by either

486: of the two improvements right after having been processed for the first time.

487: The exception of Figure 7 can be explained by the fact that this figure gives

488: gains of one improvement over the other, and together with Figure 8 gives

489: a measure for how efficiently the second improvement creates the diagonal

490: strips. What Figure 7 indicates is that such an efficiency is higher for

491: ${\it dnap1}$ and ${\it txtp1}$ than it is for ${\it cnst}$ if $n=50$ and

492: $k\ge 3$.

493:

494: By contrast, gains are always smallest for ${\it diff}$, because matchings

495: never occur in the edit scripts and therefore

496: more iterations are needed before paths reach the grid's borders. For

497: ${\it diff}$ strings, it also happens that the second improvement provides no

498: gain over the first.

499:

500: Gains tend to be larger for sequences on smaller alphabets, because in

501: these cases there tend to be more matchings. This is what happens for DNA

502: sequences, which have an alphabet of size $4$, therefore smaller than the

503: alphabets of nearly all the other strings under consideration. Gains also

504: tend to be larger as $k$ gets larger, which is probably related to the fact

505: that the overall number of iterations also increases with $k$. Finally,

506: we note that gains tend to decrease as $n$ is increased, which can be

507: seen by comparing Figures 3, 5, and 7 to Figures 4, 6, and 8, respectively.

508: This is due to the fact that, as $n$ gets larger, so does the number of

509: matchings needed for paths to reach the grid's bottommost or

510: rightmost border.

511:

512: \bigbeginsection 5. Expected gains

513:

514: As we know from the results in Section 4, one of the key factors affecting the

515: performance of practical implementations of our algorithm is the size of the

516: alphabet $\Sigma$, which henceforth we let be such that

517: $\sigma=\vert\Sigma\vert$. We discuss, in this section, a stochastic model that

518: can be used to assess, to a limited extent, the expected gain provided by the

519: two improvements of Section 4 for a given value of $\sigma$. The model is

520: rich in detail [22], and may require considerable computational effort to be

521: solved even for strings comprising as small a number of characters as a few

522: tens. It is therefore not practical, but we present an outline of it nonetheless

523: because, already for a modest range of parameters, it is capable of conveying

524: useful information.

525:

526: For fixed $c$, the model views an execution of the algorithm (as introduced

527: originally or in one of its two improved variants) as a discrete-time,

528: discrete-state stochastic process. Each time step in this stochastic process

529: corresponds to one of the iterations on $e$ in step 2 of the algorithm. Each

530: state is a $(2k+1)$-tuple with one entry for each of the possible diagonals

531: from $-k$ through $k$. An entry contains the number of the row to which the

532: corresponding diagonal has been stretched so far by the algorithm. The initial

533: state has $0$ in all entries.

534:
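As a minimal illustration of the state just described, the following Python sketch stores a state as a tuple with one entry per diagonal from $-k$ through $k$. The helper names ({\tt initial\_state}, {\tt row\_of}, {\tt stretch}) are hypothetical and serve only to show the $(2k+1)$-entry layout; they are not taken from the algorithm's original description.

```python
def initial_state(k):
    """Return the initial state: one entry per diagonal -k..k, all 0."""
    return (0,) * (2 * k + 1)


def row_of(state, d):
    """Row to which diagonal d (in -k..k) has been stretched so far."""
    k = (len(state) - 1) // 2
    return state[d + k]


def stretch(state, d, row):
    """Return a new state in which diagonal d reaches `row` (illustrative helper)."""
    k = (len(state) - 1) // 2
    return state[:d + k] + (row,) + state[d + k + 1:]


s0 = initial_state(2)     # k = 2 gives diagonals -2, -1, 0, 1, 2
s1 = stretch(s0, -1, 3)   # diagonal -1 now reaches row 3
```

Tuples are immutable and hashable, so under this assumed representation each state can serve directly as a node of the transition graph discussed next.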

535: This stochastic process never returns to a previously visited state, and can

536: as such be represented by an acyclic directed graph whose nodes stand for

537: states and whose directed edges represent the possible transitions among states.

538: In this graph, every state is on some directed path from the initial state.

539: Associated with an edge $(P,Q)$ are two quantities, namely the probability

540: $p(P,Q)$ of moving from state $P$ to state $Q$, and the number of diagonals

541: to be processed when that transition is undertaken. This number of diagonals

542: is, for each value of $e$, the number of iterations used in Section 4 to

543: evaluate the two improvements as a platform-independent measure of time. It

544: depends on whether the original algorithm or one of its two improved variants

545: is in use, and is denoted by $t(P,Q)$.

546:

547: The crucial issue in setting up this stochastic model is of course the

548: determination of $p(P,Q)$ for all edges $(P,Q)$. As it turns out, such

549: probabilities depend on how state $P$ is reached from the initial state,

550: and therefore it is best to assess them dynamically as the graph is processed

551: during the computation of one of the stochastic process's characteristics.

552: The characteristic that interests us in this section is the expected (average)

553: number of iterations, denoted by $\bar t$, for the algorithm to be completed on

554: a string, $n$ and $\sigma$ being fixed in addition to $c$.

555:

556: The following recursive procedure is a variation of straightforward depth-first

557: traversal, and can be used to compute $\bar t$. It is started by executing step

558: 1 on the initial state; upon termination, its output is assigned to $\bar t$.

559:

560: \medskip

561: \itemitem{1.} Let $P$ be the current state. If $P$ has no outgoing edges in the

562: graph, then return $0$. Otherwise, for $z>0$, let $Q_1,\ldots,Q_z$ be the states

563: to which $P$ has an outgoing edge. Do:

564:

565: \smallskip

566: \itemitemitem{1.1.} For $i=1,\ldots,z$, recursively execute step 1 on state

567: $Q_i$, and let $t_i$ be its output.

568:

569: \smallskip

570: \itemitemitem{1.2.} For $i=1,\ldots,z$, assess the transition probability

571: $p(P,Q_i)$ as a function of the directed path through which $P$ was reached.

572:

573: \smallskip

574: \itemitemitem{1.3.} Return $\sum_{i=1}^z\bigl(t(P,Q_i)+t_i\bigr)p(P,Q_i)$.

575:
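The procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: the callables {\tt successors}, {\tt cost}, and {\tt prob} are hypothetical stand-ins for the graph's edges, for $t(P,Q)$, and for the path-dependent $p(P,Q)$, and the four-state graph at the end is a made-up toy example, not data from the experiments.

```python
from fractions import Fraction


def expected_iterations(path, successors, cost, prob):
    """Expected number of iterations from the last state of `path` onward.

    path       -- tuple of states from the initial state to the current state P
    successors -- successors(P) lists the states Q_1, ..., Q_z (assumed callable)
    cost       -- cost(P, Q) plays the role of t(P, Q) (assumed callable)
    prob       -- prob(path, Q) plays the role of p(P, Q); it receives the
                  whole path because the probability depends on how P was reached
    """
    P = path[-1]
    nxt = successors(P)
    if not nxt:                       # step 1: no outgoing edges, return 0
        return Fraction(0)
    total = Fraction(0)
    for Q in nxt:
        t_Q = expected_iterations(path + (Q,), successors, cost, prob)  # step 1.1
        total += (cost(P, Q) + t_Q) * prob(path, Q)                     # steps 1.2, 1.3
    return total


# Toy acyclic graph (hypothetical numbers, for illustration only).
EDGES = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
COST = {("A", "B"): 1, ("A", "C"): 2, ("B", "D"): 3}
PROB = {("A", "B"): Fraction(1, 2), ("A", "C"): Fraction(1, 2),
        ("B", "D"): Fraction(1)}

t_bar = expected_iterations(
    ("A",),
    lambda P: EDGES[P],
    lambda P, Q: COST[(P, Q)],
    lambda path, Q: PROB[(path[-1], Q)],  # the toy ignores earlier history
)
```

Because the probabilities depend on the path, the sketch recurses over paths rather than memoizing per state, so a state reachable along several paths is processed once per path.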

576: \medskip

577: What remains to be presented before we discuss some results is, naturally, how

578: to perform step 1.2 of this procedure. First we set up two matrices, called

579: $M(P)$ and $M(Q_i)$, representing respectively the relationships (equality,

580: inequality, or none) that must exist between characters of two generic strings

581: $S_\ell^c$ and $S_r^c$ for state $P$ to be reached from the initial state on the

582: specific path that is being considered, and for $Q_i$ to be reached on that same

583: path elongated by the edge $(P,Q_i)$. If $a$ stands for the number of different

584: strings $S$ satisfying the constraints of $M(P)$ and $b$ the number of those

585: that satisfy the constraints of $M(Q_i)$, then $p(P,Q_i)=b/a$.

586:

587: The computation of $a$ and $b$ can be reduced to a complex combinatorial

588: problem, of which we give an outline next, and constitutes the computationally

589: hardest part of the entire procedure to compute $\bar t$. Suppose we wish to

590: compute $a$ from $M(P)$. The way to proceed is to start by setting up an

591: undirected graph, call it $G$, whose nodes represent groups of positions in the

592: two parts of $S$ that by $M(P)$ must contain the same character. Two nodes are

593: connected by an edge if the corresponding groups of positions must, again by

594: $M(P)$, contain characters that differ from one group to the other. The value

595: of $a$ is then the number of distinct ways in which we can assign characters

596: from $\Sigma$ to the nodes of $G$ in such a way that nodes that are

597: connected by an edge receive different characters. This is a graph coloring

598: problem, that is, a problem related to assigning objects (colors) to the nodes

599: of a graph so that nodes that are connected by an edge receive different

600: objects. The problem is to count the ways in which $G$'s

601: nodes can be colored with colors drawn from a set of $\sigma$. This number is

602: given by the so-called chromatic polynomial of $G$ evaluated at $\sigma$ [23].

603:
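To make the counting step concrete, the sketch below evaluates the chromatic polynomial of a small graph at $\sigma$ by the standard deletion-contraction recurrence, and then forms a transition probability as the ratio $b/a$. Deletion-contraction is used here only because it is short to state; the text does not say which method was actually employed, and the recurrence takes exponential time in general. The function name and the two-node example graphs are assumptions made for illustration.

```python
from fractions import Fraction


def count_colorings(nodes, edges, sigma):
    """Number of proper colorings of a simple graph with sigma available colors.

    Uses deletion-contraction: P(G) = P(G - e) - P(G / e), with
    P(edgeless graph on m nodes) = sigma ** m.
    """
    nodes = frozenset(nodes)
    edges = frozenset(frozenset(e) for e in edges)
    if not edges:
        return sigma ** len(nodes)
    e = next(iter(edges))
    u, v = tuple(e)
    deleted = edges - {e}                      # G - e
    contracted = set()                         # G / e: merge v into u
    for f in deleted:
        g = frozenset(u if x == v else x for x in f)
        if len(g) == 2:                        # guard against loops
            contracted.add(g)                  # a set removes parallel edges
    return (count_colorings(nodes, deleted, sigma)
            - count_colorings(nodes - {v}, contracted, sigma))


sigma = 4
# Hypothetical M(P): two unconstrained groups of positions.
a = count_colorings({1, 2}, [], sigma)
# Hypothetical M(Q_i): the two groups are now required to differ.
b_differ = count_colorings({1, 2}, [(1, 2)], sigma)
# Alternative successor: the two groups are required to be equal (one node).
b_equal = count_colorings({1}, [], sigma)

p_differ = Fraction(b_differ, a)
p_equal = Fraction(b_equal, a)
```

In this toy case the two probabilities are $(\sigma-1)/\sigma$ and $1/\sigma$, which sum to $1$, as expected for a position pair that must either differ or match.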

604: Finally, we present some numerical results. These are illustrated in Figures 9

605: through 12. Of these, Figures 9 and 11 are for $k=1$ ($n$ and $\sigma$ vary),

606: while Figures 10 and 12 are for $n=10$ ($k$ and $\sigma$ vary). What the

607: figures depict are the gains, averaged over $c$ for both even and odd

608: palindromes, that correspond to the expected times $\bar t$ assessed for the

609: original algorithm and its two variants. Figures 9 and 10 give gains for the

610: first improvement, and Figures 11 and 12 for the second. These figures tend to

611: support the conclusions we drew from specific examples in Section 4. These are

612: that gains are expected to be larger for larger $k$ and smaller $\sigma$, and

613: smaller for larger $n$.

614:

615: \bigbeginsection 6. Concluding remarks

616:

617: We have in this paper introduced a novel definition of approximate palindromes

618: in strings, and have given an algorithm for finding all approximate palindromes

619: in a string of $n$ characters to within at most $k$ errors. For a fixed

620: alphabet, the algorithm runs in time $O(k^2n)$. We have also indicated how to

621: perform implementation-related improvements in the algorithm, and demonstrated,

622: over a variety of strings and also based on an average-case analysis, that such

623: improvements do indeed often lead to reduced running times in practice.

624:

625: \beginsection  Acknowledgments

626:

627: The authors have received partial support from the Brazilian agencies CNPq and

628: CAPES, the PRONEX initiative of Brazil's MCT under contract 41.96.0857.00, and

629: a FAPERJ BBP grant.

630:

631: \bigbeginsection References

632:

633: {\frenchspacing

634:

635: \medskip

636: \item{1.} D. E. Knuth, J. H. Morris, and V. R. Pratt,

637: ``Fast pattern matching in strings,''

638: {\it SIAM J. on Computing\/} {\bf 6} (1977), 323--350.

639:

640: \medskip

641: \item{2.} G. Manacher,

642: ``A new linear-time on-line algorithm for finding the smallest initial

643: palindrome of a string,''

644: {\it J. of the ACM\/} {\bf 22} (1975), 346--351.

645:

646: \medskip

647: \item{3.} A. Apostolico, D. Breslauer, and Z. Galil,

648: ``Optimal parallel algorithms for periods, palindromes and squares,''

649: {\it Proc. of the Int. Colloq. on Automata, Languages, and Programming},

650: 296--307, 1992.

651:

652: \medskip

653: \item{4.} A. Apostolico, D. Breslauer, and Z. Galil,

654: ``Parallel detection of all palindromes in a string,''

655: {\it Theoretical Computer Science\/} {\bf 141} (1995), 163--173.

656:

657: \medskip

658: \item{5.} D. Breslauer and Z. Galil,

659: ``Finding all periods and initial palindromes of a string in parallel,''

660: {\it Algorithmica\/} {\bf 14} (1995), 355--366.

661:

662: \medskip

663: \item{6.} Z. Galil,

664: ``Optimal parallel algorithms for string matching,''

665: {\it Information and Control\/} {\bf 67} (1985), 144--157.

666:

667: \medskip

668: \item{7.} D. Gusfield,

669: {\it Algorithms on Strings, Trees, and Sequences: Computer Science and

670: Computational Biology},

671: Cambridge University Press, New York, NY, 1997.

672:

673: \medskip

674: \item{8.} J. Jurka,

675: ``Origin and evolution of Alu repetitive elements,''

676: in R. J. Maraia (Ed.),

677: {\it The Impact of Short Interspersed Elements (SINEs) on the Host Genome},

678: R. G. Landes, New York, NY, 25--41, 1995.

679:

680: \medskip

681: \item{9.} V. I. Levenshtein,

682: ``Binary codes capable of correcting deletions, insertions and reversals,''

683: {\it Soviet Physics Doklady\/} {\bf 10} (1966), 707--710.

684:

685: \medskip

686: \item{10.} D. Sankoff and J. Kruskal (Eds.),

687: {\it Time Warps, String Edits, and Macromolecules: The Theory and Practice of

688: Sequence Comparison},

689: Addison-Wesley, Reading, MA, 1983.

690:

691: \medskip

692: \item{11.} G. M. Landau and U. Vishkin,

693: ``Introducing efficient parallelism into approximate string matching and a new

694: serial algorithm,''

695: {\it Proc. of the Annual ACM Symp. on Theory of Computing}, 220--230, 1986.

696:

697: \medskip

698: \item{12.} G. M. Landau and U. Vishkin,

699: ``Fast parallel and serial approximate string matching,''

700: {\it J. of Algorithms\/} {\bf 10} (1989), 157--169.

701:

702: \medskip

703: \item{13.} E. W. Myers,

704: ``An $O(nd)$ difference algorithm and its variations,''

705: {\it Algorithmica\/} {\bf 1} (1986), 251--266.

706:

707: \medskip

708: \item{14.} E. Ukkonen,

709: ``Algorithms for approximate string matching,''

710: {\it Information and Control\/} {\bf 64} (1985), 100--118.

711:

712: \medskip

713: \item{15.} R. Baeza-Yates and G. Navarro,

714: ``Faster approximate string matching,''

715: {\it Algorithmica\/} {\bf 23} (1999), 127--158.

716:

717: \medskip

718: \item{16.} W. I. Chang and J. Lampe,

719: ``Theoretical and empirical comparisons of approximate string matching

720: algorithms,''

721: {\it Proc. of the Symp. on Combinatorial Pattern Matching},

722: 175--184, 1992.

723:

724: \medskip

725: \item{17.} R. Cole and R. Hariharan,

726: ``Approximate string matching: a simpler faster algorithm,''

727: {\it Proc. of the Annual ACM-SIAM Symp. on Discrete Algorithms},

728: 463--472, 1998.

729:

730: \medskip

731: \item{18.} G. A. Stephen,

732: {\it String Searching Algorithms},

733: World Scientific, Singapore, 1994.

734:

735: \medskip

736: \item{19.} E. Ukkonen,

737: ``Finding approximate patterns in strings,''

738: {\it J. of Algorithms\/} {\bf 6} (1985), 132--137.

739:

740: \medskip

741: \item{20.} S. Wu and U. Manber,

742: ``Fast text searching allowing errors,''

743: {\it Comm. of the ACM\/} {\bf 35} (1992), 83--91.

744:

745: \medskip

746: \item{21.} G. M. Landau and U. Vishkin,

747: ``Efficient string matching with $k$ mismatches,''

748: {\it Theoretical Computer Science\/} {\bf 43} (1986), 239--249.

749:

750: \medskip

751: \item{22.} A. H. L. Porto,

752: {\it Detecting Approximate Palindromes in Strings},

753: M.Sc. Thesis, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil,

754: 1999 (in Portuguese).

755:

756: \medskip

757: \item{23.} J. A. Bondy and U. S. R. Murty,

758: {\it Graph Theory with Applications},

759: North-Holland, New York, NY, 1976.

760:

761: }

762:

763: \beginsection Authors' biographical data

764:

765: \vskip-\smallskipamount\medskip\noindent

766: {\bf Alexandre H. L. Porto} is a doctoral student at the Systems Engineering

767: and Computer Science Program of the Federal University of Rio de Janeiro. He

768: is interested in sequential and parallel algorithms for problems in

769: computational biology.

770:

771: \medskip

772: \noindent

773: {\bf Valmir C. Barbosa} is a professor at the Systems Engineering and Computer

774: Science Program of the Federal University of Rio de Janeiro, and is interested

775: in the various aspects of distributed and parallel computing, as well as of

776: the so-called complex systems, like neural networks and related models. He

777: received his Ph.D. from the University of California, Los Angeles, in 1986,

778: and has held visiting positions at the IBM Rio Scientific Center in Brazil,

779: the International Computer Science Institute in Berkeley, and the Computer

780: Science Division of the University of California, Berkeley. He has authored

781: the books {\it Massively Parallel Models of Computation\/} (Ellis Horwood,

782: Chichester, UK, 1993), {\it An Introduction to Distributed Algorithms\/} (The

783: MIT Press, Cambridge, MA, 1996), and {\it An Atlas of Edge-Reversal Dynamics\/}

784: (Chapman \& Hall/CRC Press, London, UK, 2000).

785:

786: \vfill\eject

787:

788: \topinsert

789: \centerline{\epsfbox{fig1.ps}}

790: \bigskip

791: \centerline{{\bf Figure 1.} Even (a) and odd (b) approximate palindromes for

792: $k=3$ in the string ${\it bbaabac}$}

793: \bigskip\bigskip\bigskip

794: \endinsert

795:

796: \topinsert

797: \centerline{\epsfbox{fig2.ps}}

798: \bigskip

799: \centerline{{\bf Figure 2.} An edit script as a directed path in $D$}

800: \bigskip\bigskip\bigskip

801: \endinsert

802:

803: \topinsert

804: \centerline{\epsfbox{fig3.ps}}

805: \bigskip

806: \centerline{{\bf Figure 3.} Gains due to the first improvement for $n=50$}

807: \bigskip\bigskip\bigskip

808: \endinsert

809:

810: \topinsert

811: \centerline{\epsfbox{fig4.ps}}

812: \bigskip

813: \centerline{{\bf Figure 4.} Gains due to the first improvement for $n=2500$}

814: \bigskip\bigskip\bigskip

815: \endinsert

816:

817: \topinsert

818: \centerline{\epsfbox{fig5.ps}}

819: \bigskip

820: \centerline{{\bf Figure 5.} Gains due to the second improvement for $n=50$}

821: \bigskip\bigskip\bigskip

822: \endinsert

823:

824: \topinsert

825: \centerline{\epsfbox{fig6.ps}}

826: \bigskip

827: \centerline{{\bf Figure 6.} Gains due to the second improvement for $n=2500$}

828: \bigskip\bigskip\bigskip

829: \endinsert

830:

831: \topinsert

832: \centerline{\epsfbox{fig7.ps}}

833: \bigskip

834: \centerline{{\bf Figure 7.} Gains of the second improvement over the first

835: for $n=50$}

836: \bigskip\bigskip\bigskip

837: \endinsert

838:

839: \topinsert

840: \centerline{\epsfbox{fig8.ps}}

841: \bigskip

842: \centerline{{\bf Figure 8.} Gains of the second improvement over the first

843: for $n=2500$}

844: \bigskip\bigskip\bigskip

845: \endinsert

846:

847: \topinsert

848: \centerline{\epsfbox{fig9.ps}}

849: \bigskip

850: \centerline{{\bf Figure 9.} Average gains due to the first improvement for

851: $k=1$}

852: \bigskip\bigskip\bigskip

853: \endinsert

854:

855: \topinsert

856: \centerline{\epsfbox{fig10.ps}}

857: \bigskip

858: \centerline{{\bf Figure 10.} Average gains due to the first improvement for

859: $n=10$}

860: \bigskip\bigskip\bigskip

861: \endinsert

862:

863: \topinsert

864: \centerline{\epsfbox{fig11.ps}}

865: \bigskip

866: \centerline{{\bf Figure 11.} Average gains due to the second improvement for

867: $k=1$}

868: \bigskip\bigskip\bigskip

869: \endinsert

870:

871: \topinsert

872: \centerline{\epsfbox{fig12.ps}}

873: \bigskip

874: \centerline{{\bf Figure 12.} Average gains due to the second improvement for

875: $n=10$}

876: \bigskip\bigskip\bigskip

877: \endinsert

878:

879: \bye

880:

881: