1: \input 11layout
2: \input macro
3: \input epsf
4: 
5: \def\tld{\twoldots}
6: 
7: \centerline{\biggbf Finding Approximate Palindromes in Strings}
8: 
9: \bigskip\bigskip
10: \centerline{\it Alexandre H. L. Porto}
11: \centerline{\it Valmir C. Barbosa}
12: 
13: \bigskip
14: \centerline{Programa de Engenharia de Sistemas e Computa\c c\~ao, COPPE}
15: \centerline{Universidade Federal do Rio de Janeiro}
16: \centerline{Caixa Postal 68511}
17: \centerline{21945-970 Rio de Janeiro - RJ, Brazil}
18: 
19: \medskip
20: \centerline{\tt xandao@cos.ufrj.br, valmir@cos.ufrj.br}
21: 
22: \bigskip\bigskip
23: \centerline{\bf Abstract}
24: 
25: \medskip
26: \noindent
27: We introduce a novel definition of approximate palindromes in strings, and
28: provide an algorithm to find all maximal approximate palindromes in a string
29: with up to $k$ errors. Our definition is based on the usual edit operations of
30: approximate pattern matching, and the algorithm we give, for a string of size
31: $n$ on a fixed alphabet, runs in $O(k^2n)$ time. We also discuss two
32: implementation-related improvements to the algorithm, and demonstrate their
33: efficacy in practice by means of both experiments and an average-case analysis.
34: 
35: \bigskip\bigskip
36: \noindent
37: {\bf Keywords:} Approximate palindromes, string editing, approximate pattern
38: matching.
39: 
40: \vfill\eject
41: \bigbeginsection 1. Introduction
42: 
43: Let $S$ be a string of $n$ characters from a fixed alphabet $\Sigma$. For
44: $1\le i\le j\le n$, let $S[i]$ denote the $i$th character in $S$ and
45: $S[i\tld j]$ denote the substring of $S$ whose first and last characters are
46: $S[i]$ and $S[j]$, respectively. We let $S^R$ denote the string whose $i$th
47: character is $S[n-i+1]$, that is, $S$ and $S^R$ are essentially the same string
48: when read in opposing directions. We say that $S$ and $S^R$ are the
49: {\it reverse\/} of each other.
50: 
51: In this paper, we are concerned with {\it palindromes\/} occurring in $S$,
52: which are substrings $S[i\tld j]$ such that $S[i\tld j]=S[i\tld j]^R$. If
53: $S[i\tld j]$ is a palindrome, then it is an {\it even palindrome\/} if it
54: contains an even number of characters, otherwise it is an {\it odd palindrome}.
55: The {\it center\/} of this palindrome is $S[c]$, where
56: $c=i-1+\bigl\lceil(j-i+1)/2\bigr\rceil$. It follows that
57: $S[c+1\tld j]=S[i\tld c]^R$ for an even palindrome, while
58: $S[c+1\tld j]=S[i\tld c-1]^R$ for an odd palindrome. A palindrome $S[i\tld j]$
59: is an {\it initial palindrome\/} if $i=1$ and a {\it final palindrome\/} if
60: $j=n$. It is said to be a {\it maximal palindrome\/} if it is an initial
61: palindrome, a final palindrome, or if $S[i-1\tld j+1]$ is not a palindrome for
62: $1<i\le j<n$.
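
As a small illustration of these definitions, the following Python sketch checks exact palindromes and computes centers using the 1-based, inclusive indexing adopted above. The function names are hypothetical, and the direct character comparisons are meant only to make the definitions concrete; they do not stand for the algorithms of [1, 2].

```python
def substring(S, i, j):
    """S[i..j] with 1-based, inclusive indices; empty if i > j."""
    return S[i - 1:j] if i <= j else ""

def is_palindrome(S, i, j):
    """True if S[i..j] equals its own reverse."""
    w = substring(S, i, j)
    return w == w[::-1]

def center(i, j):
    """c = i - 1 + ceil((j - i + 1) / 2), computed in integer arithmetic."""
    return i - 1 + (j - i + 2) // 2

def is_maximal_palindrome(S, i, j):
    """Initial, final, or not extendable by one character on both sides."""
    if not is_palindrome(S, i, j):
        return False
    return i == 1 or j == len(S) or not is_palindrome(S, i - 1, j + 1)

S = "bbaabac"
print(is_palindrome(S, 2, 5), center(2, 5))   # S[2..5] is "baab"
print(is_maximal_palindrome(S, 3, 4))         # "aa" can be extended to "baab"
```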
63: 
64: Several algorithms have appeared for the detection of palindromes in strings.
65: These include sequential algorithms for detecting all initial palindromes or
66: maximal palindromes centered at all positions [1, 2], as well as parallel
67: algorithms [3--6].
68: 
69: Palindromes appear in several domains, chiefly in computational molecular
70: biology [7, 8], where $\Sigma$ is for example the set of bases that link
71: together to form strands of nucleic acid. In this domain of application,
72: palindromes are often required to be {\it complementary}, in the following
73: sense. While for a palindrome centered at $c$ we have either $S[c-r+1]=S[c+r]$
74: or $S[c-r]=S[c+r]$, depending respectively on whether the palindrome is even or
75: odd and for $r$ representing distance from the center, if the palindrome is
76: complementary then the two characters are no longer equal but rather constitute
77: a complementary pair. In this case, we no longer have $S[i\tld j]=S[i\tld j]^R$,
78: but rather complementarity between the two strings.
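
A minimal sketch of the complementarity test follows, assuming the DNA alphabet with the standard base pairs A--T and C--G (the text above does not fix a particular alphabet, so this choice and the function name are illustrative only). Characters at distance $r\ge 1$ on either side of the center must form a complementary pair; the center of an odd palindrome is left unconstrained.

```python
# Assumed complementary pairs for DNA; other alphabets need their own table.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_complementary_palindrome(S, i, j):
    """Check S[i..j] (1-based, inclusive) for complementary symmetry."""
    w = S[i - 1:j]
    half = len(w) // 2          # the middle character of an odd w is skipped
    return all(COMPLEMENT.get(w[t]) == w[-1 - t] for t in range(half))

print(is_complementary_palindrome("GAATTC", 1, 6))
print(is_complementary_palindrome("AAA", 1, 3))
```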
79: 
80: Although in the sequel we deal exclusively with palindromes for which equality
81: is used to compare them with their reverses (everything carries over trivially
82: to the case of complementarity), we dwell a little longer on the application of
83: palindromes to computational molecular biology because in that area the
84: definition we have given for palindromes is ``too exact,'' being therefore of
85: little use. Palindromes that matter in that domain are ``approximate,'' in the
86: sense that the symmetry around a palindrome's center need not be perfect, but
87: may instead contain a certain number of mismatches in the form of gaps and
88: defects of various other natures [7, 8]. This paper is about finding
89: approximate palindromes in $S$.
90: 
91: The remainder of this paper is organized as follows. We start in
92: Section 2 by providing a precise definition of what is to be understood as
93: an approximate palindrome and a maximal approximate palindrome. This definition
94: is novel, as previous definitions appear to have been too restricted [7]. Then
95: in Section 3 we give an algorithm to find all maximal approximate palindromes
96: in a string allowing for up to $k$ errors. The algorithm runs in $O(k^2n)$ time.
97: Section 4 contains two improvements on the basic algorithm. These improvements
98: do not lead to an improved complexity, but do in practice make a difference,
99: as we demonstrate by means of some experimental results. Section 5 contains
100: an average-case analysis of the two improvements, which indicates that the
101: conclusions drawn in the previous section can be expected to hold on average.
102: Concluding remarks follow in Section 6.
103: 
104: \bigbeginsection 2. Approximate palindromes
105: 
106: Our definition of an approximate palindrome centered at position $c$ in string
107: $S$ is given for an integer $k\ge 0$ indicating the maximum number of errors to
108: be tolerated, and is based on the notion of {\it string editing}, that is,
109: the transformation of one string into another [7]. In order to define
110: approximate palindromes, we consider, for $1\le c\le n$, the editing of string
111: $S_\ell^c=S[1\tld u]^R$ to obtain string $S_r^c=S[c+1\tld n]$, where $u=c$ for
112: even palindromes or $u=c-1$ for odd palindromes (here, and henceforth,
113: $S[i\tld j]$ is to be understood as the empty string if $i>j$).
114: 
115: The string editing that we consider is the same that has been used for other
116: problems on strings, and employs the operations presented next. These operations
117: act on cursors $p$ and $q$, which are used to point to specific positions in
118: $S_\ell^c$ and $S_r^c$, respectively. Such cursors are such that $0\le p\le u$
119: and $0\le q\le n-c$, the value $0$ being used only to initialize the cursors as
120: still not pointing to characters in the string. Note that $p$ and $q$, when
121: nonzero, do not indicate positions in $S$, but in two of its substrings taken as
122: independent entities.
123: 
124: The operations we consider are the following, of which (ii)--(iv) are
125: called {\it edit operations}.
126: 
127: \medskip
128: \itemitem{(i)} {\it Matching\/}: If $S_\ell^c[p+1]=S_r^c[q+1]$, then
129: increment both $p$ and $q$.
130: 
131: \medskip
132: \itemitem{(ii)} {\it Substitution\/}: If $S_\ell^c[p+1]\neq S_r^c[q+1]$,
133: then increment both $p$ and $q$.
134: 
135: \medskip
136: \itemitem{(iii)} {\it Insertion\/}: Increment $q$.
137: 
138: \medskip
139: \itemitem{(iv)} {\it Deletion\/}: Increment $p$.
140: 
141: \medskip
142: While a matching can only be applied if it will make both cursors
143: point to equal characters, the edit operations characterize the
144: possible sources of mismatch between the two strings. What they do is to allow
145: for a character from $S_r^c$ to substitute for a character in $S_\ell^c$ (this
146: is a substitution), for a character from $S_r^c$ to be inserted into
147: $S_\ell^c$ as an additional character (an insertion), or for a character to be
148: deleted from $S_\ell^c$ (a deletion). When grouped into a sequence, what
149: matching and edit operations can be regarded as doing is providing a script
150: (the {\it edit script\/}) for a prefix of $S_\ell^c$ to be turned into a
151: replica of a prefix of $S_r^c$ (note that, if an unlimited number of edit
152: operations is allowed, then such an edit script is guaranteed to exist).
153: The edit script to convert one prefix into the other having the smallest
154: number of edit operations is said to be {\it optimal\/} for the two prefixes,
155: and its number of edit operations is called the {\it edit distance\/} between
156: them [9, 10].
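
For reference, the edit distance just defined can be computed by the textbook quadratic dynamic program sketched below. This is not the algorithm developed later in the paper; it is included only as a baseline against which faster methods can be checked, and the function name is hypothetical.

```python
def edit_distance(X, Y):
    """Fewest substitutions, insertions, and deletions turning X into Y."""
    x, y = len(X), len(Y)
    prev = list(range(y + 1))            # empty prefix of X: j insertions
    for i in range(1, x + 1):
        cur = [i] + [0] * y              # empty prefix of Y: i deletions
        for j in range(1, y + 1):
            diagonal = prev[j - 1] + (X[i - 1] != Y[j - 1])  # match or substitution
            cur[j] = min(diagonal, cur[j - 1] + 1, prev[j] + 1)
        prev = cur
    return prev[y]

print(edit_distance("bb", "aaba"))
```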
157: 
158: For $0\le p^*\le u$ and $0\le q^*\le n-c$, we say that $S[u-p^*+1\tld c+q^*]$
159: is an {\it approximate palindrome\/} in $S$ centered at $c$ if, of
160: all the edit scripts that can be used to turn $S_\ell^c[1\tld p^*]$ into
161: $S_r^c[1\tld q^*]$, the one that is optimal comprises no more than $k$ edit
162: operations (in other words, the edit distance between $S_\ell^c[1\tld p^*]$ and
163: $S_r^c[1\tld q^*]$ is at most $k$). In this case, $p^*$ and $q^*$ are the values
164: of $p$ and $q$, respectively, after any of those edit scripts is played out.
165: The palindrome is even or odd according to how $S_\ell^c$ is originally set.
166: Its {\it size\/} is either $p^*+q^*$ or $p^*+q^*+1$, respectively if it is even
167: or odd. This definition is more general than the single other definition that
168: appears to have been given for approximate palindromes [7], which only allows
169: for matchings and substitutions.
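
The definition can be checked directly, if inefficiently, as in the following sketch. It builds $S_\ell^c$ and $S_r^c$, computes the edit distance between the chosen prefixes with the textbook quadratic recurrence, and returns the substring $S[u-p^*+1\tld c+q^*]$ together with its size. All function names are hypothetical.

```python
def edit_distance(X, Y):
    """Textbook quadratic edit distance between X and Y."""
    prev = list(range(len(Y) + 1))
    for i in range(1, len(X) + 1):
        cur = [i] + [0] * len(Y)
        for j in range(1, len(Y) + 1):
            cur[j] = min(prev[j - 1] + (X[i - 1] != Y[j - 1]),
                         cur[j - 1] + 1, prev[j] + 1)
        prev = cur
    return prev[-1]

def parts(S, c, odd):
    """Return (S_l^c, S_r^c) for a 1-based center c; u = c - 1 if odd, else c."""
    u = c - 1 if odd else c
    return S[:u][::-1], S[c:]

def approximate_palindrome(S, c, odd, p_star, q_star, k):
    """Return (substring, size) if the prefixes are within k edits, else None."""
    left, right = parts(S, c, odd)
    if p_star > len(left) or q_star > len(right):
        raise ValueError("p_star or q_star exceeds the available characters")
    if edit_distance(left[:p_star], right[:q_star]) > k:
        return None
    u = len(left)
    size = p_star + q_star + (1 if odd else 0)
    return S[u - p_star:c + q_star], size

print(approximate_palindrome("bbaabac", 2, False, 2, 4, 3))
```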
170: 
171: It is curious to observe that, unlike exact palindromes, the size of an even
172: approximate palindrome does not have to be even, nor does the size of an odd
173: approximate palindrome have to be odd. These would hold, however, for the exact
174: palindrome that would be obtained if the transformation of one prefix into the
175: other were indeed performed.
176: 
177: We provide in Figure 1 an illustration of this concept of an approximate
178: palindrome. The figure contains four approximate palindromes in the string
179: ${\it bbaabac}$ for $k=3$, two even in part (a) and two odd in part (b).
180: The palindromes are depicted in a way that evidences their two parts and also
181: their centers, in the odd case. Blank cells appearing in the left part
182: correspond to insertions, those on the right to deletions (for ease of
183: representation, in the figure we let deletions from the left part be
184: represented as insertions into the right part). Substitutions are indicated
185: by shaded cells placed symmetrically in the two parts.
186: 
187: Having defined approximate palindromes for $c$ and $k$ fixed, the notion that
188: remains to be introduced before we move on to discuss their detection is that
189: of maximality. Note, first, that the simple definition of a maximal palindrome
190: in the exact case does not carry over simply to the approximate case. In the
191: case of exact palindromes, maximality is related to the inability to extend a
192: palindrome into another substring of $S$ that is also a palindrome. In the
193: approximate case, however, the potential existence of several acceptable edit
194: scripts makes it inappropriate to adopt such a straightforward definition.
195: 
196: While there does appear to exist more than one possibility for defining the
197: maximality of approximate palindromes, what we do in this paper is to say
198: that an approximate palindrome is {\it maximal\/} if no other
199: approximate palindrome for the same $c$ and $k$ exists having strictly greater
200: size or the same size but strictly fewer errors (edit distance between its two
201: parts). Unlike the case of exact palindromes, this definition clearly does not
202: guarantee the uniqueness of an approximate palindrome that is maximal.
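
This notion of maximality can be made concrete by exhaustive search, as in the sketch below: for fixed $c$ and $k$ it tabulates the edit distance between every pair of prefixes of $S_\ell^c$ and $S_r^c$, keeps the pairs $(p^*,q^*)$ within $k$ errors, and reports those of greatest size and, among them, fewest errors. It is a brute-force check of the definition with hypothetical names, not the algorithm of Section 3.

```python
def prefix_distances(X, Y):
    """D[p][q] = edit distance between X[:p] and Y[:q]."""
    D = [[0] * (len(Y) + 1) for _ in range(len(X) + 1)]
    for p in range(len(X) + 1):
        for q in range(len(Y) + 1):
            if p == 0 or q == 0:
                D[p][q] = p + q
            else:
                D[p][q] = min(D[p - 1][q - 1] + (X[p - 1] != Y[q - 1]),
                              D[p][q - 1] + 1, D[p - 1][q] + 1)
    return D

def maximal_approximate_palindromes(S, c, odd, k):
    """Return (size, errors, [(p*, q*), ...]) for the maximal palindromes at c."""
    u = c - 1 if odd else c
    left, right = S[:u][::-1], S[c:]
    D = prefix_distances(left, right)
    ok = [(p + q, -D[p][q], p, q)
          for p in range(len(left) + 1)
          for q in range(len(right) + 1) if D[p][q] <= k]
    total, neg_errors = max(ok)[:2]      # largest size first, then fewest errors
    winners = [(p, q) for t, ne, p, q in ok if (t, ne) == (total, neg_errors)]
    return total + (1 if odd else 0), -neg_errors, winners

print(maximal_approximate_palindromes("bbaabac", 2, False, 3))
```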
203: 
204: \bigbeginsection 3. An algorithm
205: 
206: In this section, and for $k$ fixed, we introduce an algorithm for detecting
207: maximal approximate palindromes, one even and one odd for each $c$ such that
208: $1\le c\le n$.
209: 
210: Our algorithm is based on an acyclic directed graph $D$, whose definition
211: relies on two generic strings $X$ and $Y$ on the same alphabet. We let
212: $x=\vert X\vert$ and $y=\vert Y\vert$. Graph $D$ has $(x+1)(y+1)$ nodes, one
213: for each $(i,j)$ pair such that $0\le i\le x$ and $0\le j\le y$. A directed
214: edge exists in $D$ from node $(i,j)$ to node $(i',j')$ if either $i'=i$
215: and $j'=j+1$, or $i'=i+1$ and $j'=j$, or yet $i'=i+1$ and $j'=j+1$. If we
216: position the nodes of $D$ on the vertices of a two-dimensional grid having $x+1$
217: rows and $y+1$ columns so that a node's first coordinate grows from top to
218: bottom and the second from left to right, then clearly directed edges exist
219: between nearest neighbors in the same row (a {\it horizontal edge\/}),
220: the same column (a {\it vertical edge\/}), and the same diagonal (a
221: {\it diagonal edge\/}). Because $j'-i'=j-i$ when a diagonal edge exists from
222: $(i,j)$ to $(i',j')$, we use such differences to label the various diagonals on
223: the grid. Diagonal labels are then in the range of $-x$ through $y$.
224: 
225: Now consider a directed path in $D$ leading from node $(i,j)$ to node $(i',j')$.
226: By definition of the directed edges, clearly $i'\ge i$ and $j'\ge j$. The
227: importance of graph $D$ in our present context is that this directed path, if
228: it contains at least one edge, can be interpreted as an edit script for string
229: $X[i+1\tld i']$ to be transformed into string $Y[j+1\tld j']$. Along the script,
230: the cursors $p$ and $q$ of operations (i)--(iv) are used on $X$ and $Y$,
231: respectively, such that $i\le p\le i'$ and $j\le q\le j'$. On such a path, a
232: diagonal edge corresponds to either a matching or a substitution, a horizontal
233: edge to an insertion, and a vertical edge to a deletion. The number of edges on
234: the path that do not correspond to matchings is the number of edit operations in
235: the script. The optimal edit script for strings $X[i+1\tld i']$ and
236: $Y[j+1\tld j']$ is represented in $D$ by a directed path from $(i,j)$ to
237: $(i',j')$ whose number of edges corresponding to edit operations is minimum
238: among all directed paths between the two nodes. Such a path is said to be
239: {\it shortest\/} among all those paths according to the metric that assigns,
240: say, length $0$ to edges corresponding to matchings and length $1$ to all other
241: edges.
242: 
243: We give an illustration in Figure 2, where graph $D$ is shown for
244: $X={\it bb}$ and $Y={\it aabac}$. The directed path shown in solid lines
245: contains two insertions, followed by one matching and one substitution,
246: and is a shortest path between its two end vertices. It corresponds,
247: therefore, to an optimal edit script to transform $X[1\tld 2]$ into
248: $Y[1\tld 4]$. All other edges are shown as dotted lines, directions omitted
249: for clarity.
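
The correspondence between paths and edit scripts can be replayed mechanically. In the sketch below, a script is a string over the hypothetical letters {\tt M}, {\tt S}, {\tt I}, and {\tt D} (matching, substitution, insertion, deletion); the function advances the cursors $p$ and $q$ as prescribed by operations (i)--(iv) and counts the edit operations, rejecting a script whose matchings or substitutions are not applicable.

```python
def play_script(X, Y, script):
    """Replay a script of M/S/I/D operations; return (p, q, number of edits)."""
    p = q = edits = 0
    for op in script:
        if op in ("M", "S"):
            if p >= len(X) or q >= len(Y):
                raise ValueError("cursor would move past the end of a string")
            # A matching needs equal characters, a substitution unequal ones.
            if (X[p] == Y[q]) != (op == "M"):
                raise ValueError(f"operation {op} not applicable at p={p}, q={q}")
            p, q = p + 1, q + 1
        elif op == "I":
            if q >= len(Y):
                raise ValueError("no character left to insert")
            q += 1
        elif op == "D":
            if p >= len(X):
                raise ValueError("no character left to delete")
            p += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
        if op != "M":
            edits += 1
    return p, q, edits

# Two insertions, one matching, one substitution, as on the path of Figure 2.
print(play_script("bb", "aabac", "IIMS"))
```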
250: 
251: For $0\le e\le k$, let a directed path in $D$ be called an {\it $e$-path\/} if
252: it contains $e$ edges related to edit operations. One crucial problem to be
253: solved on $D$ is the problem of determining, for each diagonal and each $e$, the
254: $e$-path, if one exists, that starts in row $0$, ends on that diagonal at the
255: farthest possible node (greatest row number), and is in addition shortest among
256: all paths that start and end at the same nodes. This problem can be solved by
257: the following dynamic-programming approach.
258: 
259: \medskip
260: \itemitem{1.} For $d=0,\ldots,y$, find the largest common prefix of strings
261: $X$ and $Y[d+1\tld y]$. Such prefixes will correspond to the $0$-paths that
262: start in row $0$ and end at the farthest possible nodes. Each of them will
263: be entirely confined to a diagonal $d$ and, like all $0$-paths, will be
264: shortest among all paths joining its end nodes.
265: 
266: \medskip
267: \itemitem{2.} For $e=1,\ldots,k$, and for $d=-\min\{e,x\},\ldots,y$, do:
268: 
269: \smallskip
270: \itemitemitem{2.1.} Consider the $(e-1)$-paths, if any,  determined in the
271: previous step on those of diagonals $d-1$, $d$, and $d+1$ that exist. Each of
272: these paths corresponds to an optimal edit script for transforming a prefix of
273: $X$ into a substring of $Y$. If possible, extend these scripts, respectively
274: by adding an insertion, a substitution, and a deletion, thereby creating
275: $e$-paths that end on diagonal $d$.
276: 
277: \smallskip
278: \itemitemitem{2.2.} Of the $e$-paths created in step 2.1, if any, pick the
279: one that ends farthest down diagonal $d$ and extend it further by computing the
280: largest common prefix of what remains of $X$ and what remains of $Y$. The result
281: will be an $e$-path that starts in row $0$ and ends on diagonal $d$ at the
282: farthest possible node, being in addition shortest among all paths starting and
283: ending at the same nodes.
284: 
285: \medskip
286: The lower bound on $d$ in step 2 reflects the fact that diagonal $-e$ can only
287: be reached from row $0$ by an $e'$-path, where $1\le e\le e'\le k$. What the
288: entire procedure computes is a set of directed paths departing from row $0$ at
289: several columns. Each of these directed paths is an $e$-path, for some $e$ such
290: that $0\le e\le k$, that ends as far down in the graph as possible, and is also
291: shortest among all paths that start and end at the same nodes. So an $e$-path
292: computed by the algorithm departing from node $(0,d)$ for $0\le d\le y$
293: represents an optimal edit script for turning a prefix of $X$ into a prefix of
294: $Y[d+1\tld y]$ with $e$ edit operations.
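
Steps 1 and 2 can be sketched as follows. Two simplifications are assumed here and are not taken from the text: largest common prefixes are found by a naive character scan rather than in constant time, and each table entry records the farthest row reachable on a diagonal with at most $e$ edit operations, clamped to the last row that exists on that diagonal. The names {\tt lcp} and {\tt farthest\_rows} are hypothetical.

```python
def lcp(A, a, B, b):
    """Length of the largest common prefix of A[a:] and B[b:] (naive scan)."""
    n = 0
    while a + n < len(A) and b + n < len(B) and A[a + n] == B[b + n]:
        n += 1
    return n

def farthest_rows(X, Y, k):
    """reach[e][d]: greatest row on diagonal d reachable from row 0
    using at most e edit operations."""
    x, y = len(X), len(Y)
    # Step 1: 0-paths are largest common prefixes of X and Y[d+1..y].
    reach = [{d: lcp(X, 0, Y, d) for d in range(y + 1)}]
    for e in range(1, k + 1):                     # Step 2
        prev, cur = reach[-1], {}
        for d in range(-min(e, x), y + 1):
            # Step 2.1: extend the (e-1)-paths on diagonals d-1, d, d+1.
            rows = []
            if d - 1 in prev:
                rows.append(prev[d - 1])          # insertion: same row
            if d in prev:
                rows.append(prev[d] + 1)          # substitution: next row
            if d + 1 in prev:
                rows.append(prev[d + 1] + 1)      # deletion: next row
            # Clamp to the last row that exists on diagonal d.
            i = min(max(rows), min(x, y - d))
            # Step 2.2: follow matchings as far as possible.
            cur[d] = i + lcp(X, i, Y, i + d)
        reach.append(cur)
    return reach

print(farthest_rows("bb", "aabac", 1)[1])
```

For $X={\it bb}$ and $Y={\it aabac}$, the entry for $e=1$ on diagonal $1$ is row $2$: a single deletion followed by one matching turns $X$ into the substring $Y[3\tld 3]$.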
295: 
296: The basic procedure comprising steps 1 and 2 was introduced to solve the problem
297: of approximate pattern matching [7, 11--13], which requires all approximate
298: occurrences of $X$ in $Y$ having edit distance from $X$ of at most $k\le y$ to
299: be determined [14]. The solution works by selecting, after the
300: execution of steps 1 and 2, the $e$-paths that end on row $x$ for $0\le e\le k$.
301: Several other solutions to this problem exist [7, 15--20].
302: 
303: Note that the largest common prefixes asked for in steps 1 and 2.2 can be
304: obtained easily if we have a means of computing the largest common prefix of any
305: two suffixes of $X\$_1Y$, where $\$_1$ is any character that does not occur in
306: $X$ or $Y$, and $X\$_1Y$ is the string obtained by appending $\$_1$ to $X$, then
307: $Y$ to $X\$_1$. Such a means, of course, is provided by the well-known suffix
308: tree for string $X\$_1Y\$_2$, where $\$_2$ is any character not occurring in
309: $X\$_1Y$, ultimately needed to ensure that the tree does indeed exist. After
310: the suffix tree is built and preprocessed, which can be achieved in $O(x+y)$
311: time, any of those largest common prefixes can be found in constant time [7].
312: For approximate pattern matching, $x\le y$, so $O(y)$ is the time it takes to
313: handle the suffix tree initially. After that, the complexity of steps 1 and 2
314: is dominated by step 2, which comprises $O(ky)$ repetitions of 2.1 and 2.2,
315: each requiring $O(1)$ time. The overall time is then $O(ky)$.
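
Where a suffix tree is not at hand, constant-time queries can also be obtained from a precomputed table, at the price of quadratic time and space. The sketch below is such a stand-in and is not the suffix-tree construction of [7]; it indexes $X$ and $Y$ separately, so no separator character is needed. The name {\tt lcp\_table} is hypothetical.

```python
def lcp_table(X, Y):
    """T[a][b] = length of the largest common prefix of X[a:] and Y[b:]."""
    x, y = len(X), len(Y)
    T = [[0] * (y + 1) for _ in range(x + 1)]   # last row and column stay 0
    for a in range(x - 1, -1, -1):
        for b in range(y - 1, -1, -1):
            if X[a] == Y[b]:
                T[a][b] = T[a + 1][b + 1] + 1
    return T

T = lcp_table("bb", "aabac")
print(T[0][2])   # common prefix of "bb" and "bac"
```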
316: 
317: The same basic procedure can also be used to solve another problem involving
318: strings $X$ and $Y$, known as the $k$-differences problem. Assuming
319: $k\le\max\{x,y\}$ to avoid trivial cases, this problem asks for the edit script
320: to transform $X$ into $Y$ with the fewest possible edit operations, but no more
321: than $k$ [13, 21]. Clearly, no solution exists if $\vert x-y\vert>k$. A solution
322: may exist otherwise, and will correspond to the $e$-path from $(0,0)$ to $(x,y)$
323: for which $e$ is minimum (that is, a shortest path between the two nodes), if
324: one exists with $e\le k$. Adapting steps 1 and 2 to find such a path is a simple
325: matter, as follows. In step 1, let $d=0$ only. In step 2, let the range for $d$
326: be from $-\min\{e,x\}$ through $\min\{e,y\}$, again reflecting the fact that it
327: takes at least $e$ errors to reach diagonal $-e$ or $e$ from $(0,0)$. Finally,
328: abort the iterations whenever node $(x,y)$ is reached.
329: 
330: This solution to the $k$-differences problem requires a number of matching
331: extensions (steps 1 and 2.2) given by $1+\sum_{e=1}^kO(e)=O(k^2)$, each one
332: requiring $O(1)$ time after the initial construction and preprocessing of the
333: suffix tree for $X\$_1Y\$_2$ in $O(x+y)$ time. The overall time is then
334: $O(k^2+x+y)$.
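
The adaptation to the $k$-differences problem can be sketched as follows, under the same assumptions as before: a naive scan stands in for constant-time largest-common-prefix queries, entries record the farthest row reachable with at most $e$ edit operations, and rows are clamped at the grid border. The function returns the smallest $e\le k$ for which node $(x,y)$ is reached, or {\tt None} if there is none; recovering the script itself is omitted. Names are hypothetical.

```python
def lcp(A, a, B, b):
    """Length of the largest common prefix of A[a:] and B[b:] (naive scan)."""
    n = 0
    while a + n < len(A) and b + n < len(B) and A[a + n] == B[b + n]:
        n += 1
    return n

def k_differences(X, Y, k):
    """Edit distance between X and Y if it is at most k, else None."""
    x, y = len(X), len(Y)
    if abs(x - y) > k:
        return None                       # node (x, y) lies on an unreachable diagonal
    target = y - x                        # diagonal containing node (x, y)
    reach = {0: lcp(X, 0, Y, 0)}          # step 1 with d = 0 only
    if target == 0 and reach[0] == x:
        return 0
    for e in range(1, k + 1):
        cur = {}
        for d in range(-min(e, x), min(e, y) + 1):
            rows = [reach[t] + step
                    for t, step in ((d - 1, 0), (d, 1), (d + 1, 1)) if t in reach]
            i = min(max(rows), min(x, y - d))     # clamp to the grid
            cur[d] = i + lcp(X, i, Y, i + d)
        reach = cur
        if reach.get(target) == x:        # node (x, y) reached: abort
            return e
    return None

print(k_differences("bb", "aaba", 3), k_differences("bb", "aaba", 2))
```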
335: 
336: This algorithm for the $k$-differences problem can be used directly to solve
337: our problem of determining all maximal approximate palindromes in $S$, because
338: what is required of an approximate palindrome is precisely that one of its parts
339: be transformable into the other by means of an optimal edit script comprising
340: no more than $k$ edit operations. For $1\le c\le n$, we simply let $X=S_\ell^c$
341: and $Y=S_r^c$ (then $x+y=n$ or $x+y=n-1$, respectively for even and odd
342: palindromes). Whenever a new path is determined in step 2.2, we check if it is
343: better than the ones found previously in terms of approximate-palindrome
344: maximality; it will be better if it leads to a node whose coordinates add to a
345: larger integer than the current best path (checking for the same integer but
346: fewer errors is needless, as the algorithm generates shortest $e$-paths in
347: nondecreasing order of $e$).
348: 
349: Apart from the time needed to establish the suffix tree for
350: $S_\ell^c\$_1S_r^c\$_2$, this algorithm requires $O(k^2)$ time to determine a
351: maximal approximate palindrome in $S$ for fixed $c$. Determining all even and
352: odd maximal approximate palindromes in $S$ then requires $O(k^2n)$ time beyond what is
353: needed to establish the suffix trees. If such a tree had indeed to be
354: constructed and preprocessed for each $c$, then an additional $O(n^2)$ time
355: would be required. However, note that every suffix of $S_\ell^c$ is also a
356: suffix of $S^R$, and that every suffix of $S_r^c$ is also a suffix of $S$. So
357: all that is required by steps 1 and 2.2, regardless of the value of $c$, is that
358: a preprocessed suffix tree be available for $S^R\$_1S\$_2$. This tree needs to
359: be established only once, which can be done in $O(n)$ time, and therefore the
360: overall complexity of determining all even and odd maximal approximate
361: palindromes in $S$ remains $O(k^2n)$.
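
Putting the pieces together, the sketch below reports one maximal approximate palindrome per center and parity. It reads $S_\ell^c$ and $S_r^c$ as suffixes of $S^R$ and $S$, respectively, so that no per-center strings are built. As assumptions of this illustration only, a naive scan replaces the preprocessed suffix tree (so the $O(k^2n)$ bound does not apply to it), rows are clamped at the grid border, and all names are hypothetical.

```python
def lcp(A, a, B, b):
    """Length of the largest common prefix of A[a:] and B[b:] (naive scan)."""
    n = 0
    while a + n < len(A) and b + n < len(B) and A[a + n] == B[b + n]:
        n += 1
    return n

def maximal_for_center(S, SR, c, odd, k):
    """Return (size, errors, p*, q*) of a maximal approximate palindrome at c."""
    n = len(S)
    u = c - 1 if odd else c
    x, y = u, n - c                       # lengths of S_l^c and S_r^c

    def extend(i, d):
        # S_l^c[i+1..] is the suffix of SR at offset n - u + i;
        # S_r^c[i+d+1..] is the suffix of S at offset c + i + d.
        return i + lcp(SR, n - u + i, S, c + i + d)

    reach = {0: extend(0, 0)}
    best = (2 * reach[0], 0, reach[0], reach[0])   # (p + q, errors, p, q)
    for e in range(1, k + 1):
        cur = {}
        for d in range(-min(e, x), min(e, y) + 1):
            rows = [reach[t] + step
                    for t, step in ((d - 1, 0), (d, 1), (d + 1, 1)) if t in reach]
            i = extend(min(max(rows), min(x, y - d)), d)
            cur[d] = i
            if 2 * i + d > best[0]:       # strictly larger p + q only, so ties
                best = (2 * i + d, e, i, i + d)   # keep the fewer errors
        reach = cur
    total, errors, p, q = best
    return total + (1 if odd else 0), errors, p, q

def all_maximal(S, k):
    """One maximal approximate palindrome per (center, parity)."""
    SR = S[::-1]
    return {(c, odd): maximal_for_center(S, SR, c, odd, k)
            for c in range(1, len(S) + 1) for odd in (False, True)}

print(all_maximal("bbaabac", 3)[(2, False)])
```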
362: 
363: Assessing the space required by the algorithm depends on whether edit scripts
364: are also needed or simply the palindromes with the corresponding edit distances.
365: In the former case, the space required for each of the $O(k)$ diagonals is
366: $O(k)$; it is constant in the latter case. This, combined with the $O(n)$ space
367: needed for preprocessing, yields a space requirement of $O(k^2+n)$ or $O(k+n)$,
368: respectively.
369: 
370: \bigbeginsection 4. Practical improvements
371: 
372: Of the $2n$ maximal approximate palindromes determined by the algorithm of
373: Section 3 for $c=1,\ldots,n$, three are trivial and can be skipped in a
374: practical implementation. These are the odd palindrome for $c=1$, and the
375: even and odd palindromes for $c=n$. For these palindromes, the optimal edit
376: scripts contain $k$ insertions for $c=1$ and $k$ deletions for $c=n$.
377: Henceforth in the paper, we then assume that the algorithm is run for
378: $c=1,\ldots,n-1$ in the even case, $c=2,\ldots,n-1$ in the odd case.
379: 
380: In addition to this simplification, the algorithm introduced in Section 3 for
381: the computation of all maximal approximate palindromes in $S$ can be improved
382: by selecting the range for $d$ in step 2 more carefully. We discuss two such
383: improvements in this section. Although they do not lead to a better execution
384: time in the asymptotic, worst-case sense, they do bring about a reduction in
385: execution times in practice, as we demonstrate.
386: 
387: The first improvement consists of skipping the diagonals on which $e$-paths,
388: for suitable $e$, have been determined that end on row $x$ or column $y$.
389: The reason why this is safe to do is that no further path can be found ending
390: on those diagonals farther down from row $x$ or column $y$. The way to
391: efficiently handle this improved selection of diagonals in step 2 is to keep
392: all diagonals that are going to be processed in a doubly-linked list. This
393: allows new diagonals to be added at the list's extremes in constant time, and
394: diagonals that will no longer be processed can be deleted equally efficiently
395: as soon as it is detected that the corresponding $e$-paths have reached the
396: farthest row or column.
397: 
398: The second improvement that we describe is itself an improvement over the
399: first one. The rationale is that, if a diagonal $d$ is dropped from further
400: consideration because the $e$-path that ends on it farthest down the grid
401: touches, say, row $x$, then all other diagonals $d'$ such that $d'<d$ may be
402: dropped as well. Similarly if that $e$-path on $d$ touches column $y$, in which
403: case not only diagonal $d$ but also all diagonals $d'$ such that $d'>d$ may be
404: dropped from further processing. What results from this improvement is that the
405: algorithm's operation is always confined to within a strip of contiguous
406: diagonals. This strip is limited on the left by either the leftmost diagonal on
407: which the farthest-reaching $e$-path does not touch row $x$ or diagonal
408: $-\min\{e,x\}$; on the right, it is limited by either the rightmost diagonal on
409: which the farthest-reaching $e$-path does not touch column $y$ or diagonal
410: $\min\{e,y\}$. As with the first improvement, it is a simple matter to implement
411: the second one efficiently by using the same doubly-linked list.
412: 
413: We show in Figures 3 and 4 the gain that the first improvement elicits for a
414: number of strings, in Figures 5 and 6 the gain due to the second improvement,
415: and in Figures 7 and 8 the gain caused by the second improvement over the first.
416: For Figures 3 through 6, if $t_o$ is the number of iterations performed by
417: the original algorithm and $t_i$ the number of iterations performed by the
418: improved algorithm, then gain is defined as $(t_o-t_i)/t_o$. An iteration is
419: either the initial execution of step 1 or each combined execution of steps
420: 2.1 and 2.2. Because every iteration can be carried out in constant time, our
421: assessment of gain in terms of numbers of iterations as opposed to elapsed time
422: provides a platform-independent evaluation of the improved algorithms. Gain is
423: defined similarly for Figures 7 and 8.
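
In code, the gain is a one-line computation. The iteration counts in the example call are made up for illustration and are not measurements from the experiments reported here; the function name is hypothetical.

```python
def gain(t_original, t_improved):
    """Relative reduction in iterations, (t_o - t_i) / t_o."""
    if t_original <= 0:
        raise ValueError("the original algorithm performs at least one iteration")
    return (t_original - t_improved) / t_original

print(gain(200, 150))   # hypothetical counts
```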
424: 
425: The strings we have used are the ones given next. Of these, some are
426: {\it periodic}, meaning that there exists a string $P$ of size $s\le n$ such
427: that the periodic string is a prefix of the string formed by concatenating
428: $\lceil n/s\rceil$ copies of $P$. The {\it period\/} of the periodic string
429: is the value of $s$.
430: 
431: \medskip
432: \itemitem{$\bullet$} ${\it dna}$: A DNA sequence with $n$ bases.
433: 
434: \medskip
435: \itemitem{$\bullet$} ${\it dnap1}$: A periodic DNA sequence with $n$
436: bases and period $\lfloor 0.05n\rfloor$.
437: 
438: \medskip
439: \itemitem{$\bullet$} ${\it dnap2}$: A periodic DNA sequence with $n$
440: bases and period $\lfloor 0.25n\rfloor$.
441: 
442: \medskip
443: \itemitem{$\bullet$} ${\it dnap3}$: A periodic DNA sequence with $n$
444: bases and period $\lfloor 0.5n\rfloor$.
445: 
446: \medskip
447: \itemitem{$\bullet$} ${\it txt}$: A sequence of $n$ ASCII characters.
448: 
449: \medskip
450: \itemitem{$\bullet$} ${\it txtp1}$: A periodic sequence of $n$ ASCII
451: characters with period $\lfloor 0.05n\rfloor$.
452: 
453: \medskip
454: \itemitem{$\bullet$} ${\it txtp2}$: A periodic sequence of $n$ ASCII
455: characters with period $\lfloor 0.25n\rfloor$.
456: 
457: \medskip
458: \itemitem{$\bullet$} ${\it txtp3}$: A periodic sequence of $n$ ASCII
459: characters with period $\lfloor 0.5n\rfloor$.
460: 
461: \medskip
462: \itemitem{$\bullet$} ${\it cnst}$: An $n$-fold repetition of the same
463: character.
464: 
465: \medskip
466: \itemitem{$\bullet$} ${\it diff}$: A string comprising $n$ different
467: characters. This string does not fit the fixed-alphabet assumption we have made
468: from the start, so the complexity figures given in Section 3 do not apply to
469: executions of the algorithm on it.
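
A periodic string as defined above can be generated as in the following sketch. The base string $P$ used in the example is arbitrary and is not one of the sequences used in the experiments; periods are computed with integer arithmetic from a percentage of $n$ to avoid floating-point rounding. Function names are hypothetical.

```python
def periodic_string(P, n):
    """Prefix of length n of ceil(n / s) concatenated copies of P, s = len(P)."""
    s = len(P)
    if s == 0 or s > n:
        raise ValueError("the period must satisfy 1 <= s <= n")
    copies = -(-n // s)                   # ceil(n / s)
    return (P * copies)[:n]

def period(n, percent):
    """floor(percent / 100 * n), e.g. percent = 5 for floor(0.05 n)."""
    return percent * n // 100

print(periodic_string("ac", 5), period(50, 5))
```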
470: 
471: \medskip
472: Figures 3, 5, and 7 are given for $n=50$, while Figures 4, 6, and 8 are for
473: $n=2500$. In each figure, the values of $k$, the maximum number of errors to be
474: allowed in the approximate palindromes, are in
475: $\bigl\{\lceil 0.01n\rceil,
476: \lceil 0.05n\rceil, \lceil 0.1n\rceil, \lceil 0.2n\rceil,
477: \lceil 0.4n\rceil, \lceil 0.8n\rceil\bigr\}$. Note, in Figures 3, 5, and 7,
478: that gains are identical for ${\it dnap1}$ and ${\it txtp1}$, owing to the
479: fact that, for $n=50$, these two strings are essentially the same, being
480: periodic with period $2$, therefore having only $2$ different characters
481: throughout.
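
The values of $k$ for the two string sizes follow directly from the set above, as the short computation below shows. Percentages are used in place of the fractions so that the ceilings are computed exactly in integer arithmetic; the function name is hypothetical.

```python
def error_bounds(n, percents=(1, 5, 10, 20, 40, 80)):
    """ceil(percent / 100 * n) for each percentage, in integer arithmetic."""
    return [-(-pct * n // 100) for pct in percents]

print(error_bounds(50))
print(error_bounds(2500))
```

For $n=50$ this gives $k\in\{1,3,5,10,20,40\}$, and the period $\lfloor 0.05n\rfloor$ evaluates to $2$.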
482: 
483: With the single exception of Figure 7, we see in all cases that gains are
484: largest for ${\it cnst}$, which can be easily accounted for by the fact that,
485: for such a string, every diagonal is dropped from further consideration by either
486: of the two improvements right after having been processed for the first time.
487: The exception of Figure 7 can be explained by the fact that this figure gives
488: gains of one improvement over the other, and together with Figure 8 gives
489: a measure for how efficiently the second improvement creates the diagonal
490: strips. What Figure 7 indicates is that such an efficiency is higher for
491: ${\it dnap1}$ and ${\it txtp1}$ than it is for ${\it cnst}$ if $n=50$ and
492: $k\ge 3$.
493: 
494: By contrast, gains are always smallest for ${\it diff}$, because matchings
495: never occur in the edit scripts and therefore
496: more iterations are needed before paths reach the grid's borders. For
497: ${\it diff}$ strings, it also happens that the second improvement provides no
498: gain over the first.
499: 
500: Gains tend to be larger for sequences on smaller alphabets, because in
501: these cases there tend to be more matchings. This is what happens for DNA
502: sequences, which have an alphabet of size $4$, therefore smaller than the
503: alphabets of nearly all the other strings under consideration. Gains also
504: tend to be larger as $k$ gets larger, which is probably related to the fact
505: that the overall number of iterations also increases with $k$. Finally,
506: we note that gains tend to decrease as $n$ is increased, which can be
507: seen by comparing Figures 3, 5, and 7 to Figures 4, 6, and 8, respectively.
508: This is due to the fact that, as $n$ gets larger, so does the number of
509: matchings needed for paths to reach the grid's bottommost or
510: rightmost border.
511: 
512: \bigbeginsection 5. Expected gains
513: 
514: As we know from the results in Section 4, one of the key factors affecting the
515: performance of practical implementations of our algorithm is the size of the
516: alphabet $\Sigma$, which henceforth we let be such that
517: $\sigma=\vert\Sigma\vert$. We discuss, in this section, a stochastic model that
518: can be used to assess, to a limited extent, the expected gain provided by the
two improvements of Section 4 for a given value of $\sigma$. The model is rich
in detail [22], and solving it may require considerable computational effort
even for strings of only a few tens of characters. It is therefore not
practical, but we outline it nonetheless because, already for a modest range
of parameters, it conveys useful information.
525: 
For fixed $c$, the model views an execution of the algorithm (either as
originally introduced or in one of its two improved variants) as a discrete-time,
528: discrete-state stochastic process. Each time step in this stochastic process
529: corresponds to one of the iterations on $e$ in step 2 of the algorithm. Each
530: state is a $(2k+1)$-tuple with one entry for each of the possible diagonals
531: from $-k$ through $k$. An entry contains the number of the row to which the
532: corresponding diagonal has been stretched so far by the algorithm. The initial
533: state has $0$ in all entries.
534: 
535: This stochastic process never returns to a previously visited state, and can
536: as such be represented by an acyclic directed graph whose nodes stand for
537: states and whose directed edges represent the possible transitions among states.
538: In this graph, every state is on some directed path from the initial state.
539: Associated with an edge $(P,Q)$ are two quantities, namely the probability
540: $p(P,Q)$ of moving from state $P$ to state $Q$, and the number of diagonals
541: to be processed when that transition is undertaken. This number of diagonals
542: is, for each value of $e$, the number of iterations used in Section 4 to
543: evaluate the two improvements as a platform-independent measure of time. It
544: depends on whether the original algorithm or one of its two improved variants
545: is in use, and is denoted by $t(P,Q)$.
546: 
547: The crucial issue in setting up this stochastic model is of course the
548: determination of $p(P,Q)$ for all edges $(P,Q)$. As it turns out, such
549: probabilities depend on how state $P$ is reached from the initial state,
550: and therefore it is best to assess them dynamically as the graph is processed
551: during the computation of one of the stochastic process' characteristics.
552: The characteristic that interests us in this section is the expected (average)
553: number of iterations, denoted by $\bar t$, for the algorithm to be completed on
554: a string, $n$ and $\sigma$ being fixed in addition to $c$.
555: 
556: The following recursive procedure is a variation of straightforward depth-first
557: traversal, and can be used to compute $\bar t$. It is started by executing step
558: 1 on the initial state; upon termination, its output is assigned to $\bar t$.
559: 
560: \medskip
\itemitem{1.} Let $P$ be the current state. If no edges leave $P$ in the
graph, then return $0$. Otherwise, for $z>0$, let $Q_1,\ldots,Q_z$ be the states
to which a directed edge leads from $P$. Do:
564: 
565: \smallskip
566: \itemitemitem{1.1.} For $i=1,\ldots,z$, recursively execute step 1 on state
567: $Q_i$, and let $t_i$ be its output.
568: 
569: \smallskip
570: \itemitemitem{1.2.} For $i=1,\ldots,z$, assess the transition probability
571: $p(P,Q_i)$ as a function of the directed path through which $P$ was reached.
572: 
573: \smallskip
574: \itemitemitem{1.3.} Return $\sum_{i=1}^z\bigl(t(P,Q_i)+t_i\bigr)p(P,Q_i)$.
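\medskip\noindent
Read as a recurrence, the procedure above can be summarized as follows. This is
only a restatement: the notation $\bar t(P)$ for the output of step 1 on state
$P$, reached along the directed path under consideration, is ours and is
introduced for illustration, with $\bar t$ being its value at the initial state.
$$\bar t(P)=\cases{0,&if no edge leaves $P$;\cr
\displaystyle\sum_{i=1}^z p(P,Q_i)\bigl(t(P,Q_i)+\bar t(Q_i)\bigr),&otherwise.\cr}$$
As a toy instance with hypothetical values (not taken from our experiments),
suppose that $P$ has $z=2$ successors, neither of which has outgoing edges, with
$p(P,Q_1)=1/4$, $p(P,Q_2)=3/4$, $t(P,Q_1)=3$, and $t(P,Q_2)=1$. Then
$\bar t(Q_1)=\bar t(Q_2)=0$ and
$$\bar t(P)={1\over 4}\,(3+0)+{3\over 4}\,(1+0)={3\over 2}.$$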
575: 
576: \medskip
577: What remains to be presented before we discuss some results is, naturally, how
578: to perform step 1.2 of this procedure. First we set up two matrices, called
579: $M(P)$ and $M(Q_i)$, representing respectively the relationships (equality,
580: inequality, or none) that must exist between characters of two generic strings
581: $S_\ell^c$ and $S_r^c$ for state $P$ to be reached from the initial state on the
582: specific path that is being considered, and for $Q_i$ to be reached on that same
583: path elongated by the edge $(P,Q_i)$. If $a$ stands for the number of different
584: strings $S$ satisfying the constraints of $M(P)$ and $b$ the number of those
585: that satisfy the constraints of $M(Q_i)$, then $p(P,Q_i)=b/a$.
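\medskip\noindent
As a small illustration of this ratio, under assumptions that are ours and
chosen only for simplicity, suppose that $M(P)$ imposes no relationship at all,
so that every one of the $\sigma^n$ strings of $n$ characters satisfies it, and
that $M(Q_i)$ adds the single requirement that two given positions hold the same
character. Then $a=\sigma^n$ and $b=\sigma^{n-1}$, so that
$$p(P,Q_i)={b\over a}={\sigma^{n-1}\over\sigma^n}={1\over\sigma}.$$
If $M(Q_i)$ requires instead that the two positions hold different characters,
then $b=(\sigma-1)\sigma^{n-1}$ and $p(P,Q_i)=1-1/\sigma$. For $\sigma=4$, as in
DNA sequences, these two probabilities are $1/4$ and $3/4$, respectively.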
586: 
587: The computation of $a$ and $b$ can be reduced to a complex combinatorial
588: problem, of which we give an outline next, and constitutes the computationally
589: hardest part of the entire procedure to compute $\bar t$. Suppose we wish to
590: compute $a$ from $M(P)$. The way to proceed is to start by setting up an
591: undirected graph, call it $G$, whose nodes represent groups of positions in the
592: two parts of $S$ that by $M(P)$ must contain the same character. Two nodes are
593: connected by an edge if the corresponding groups of positions must, again by
594: $M(P)$, contain characters that differ from one group to the other. The value
595: of $a$ is then the number of distinct ways in which we can assign characters
596: from $\Sigma$ to the nodes of $G$ in such a way that nodes that are
connected by an edge receive different characters. This is a graph coloring
problem, with characters playing the role of colors: we seek the number of
ways in which $G$'s nodes can be colored from a palette of $\sigma$ distinct
colors so that nodes connected by an edge receive different colors. This
number is given by the so-called chromatic polynomial of $G$ evaluated at
$\sigma$ [23].
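\medskip\noindent
To make the counting concrete, consider two small graphs that we have chosen
purely for illustration; they are not taken from an actual run of the
procedure. We write $\pi(G,\sigma)$ for the chromatic polynomial of $G$
evaluated at $\sigma$, a notation that is ours. If $G$ is a path on three nodes
$u$, $v$, and $w$, with edges $uv$ and $vw$, then $u$ can receive any of the
$\sigma$ characters, $v$ any character other than the one given to $u$, and $w$
any character other than the one given to $v$, whence
$$\pi(G,\sigma)=\sigma(\sigma-1)^2.$$
If the edge $uw$ is added, so that the resulting graph $G'$ is a triangle, then
$w$ must avoid the characters of both $u$ and $v$, and
$$\pi(G',\sigma)=\sigma(\sigma-1)(\sigma-2).$$
For $\sigma=4$ these counts are $36$ and $24$. Were $G$ and $G'$ the graphs
obtained from $M(P)$ and $M(Q_i)$, respectively, we would have
$$p(P,Q_i)={b\over a}={\sigma(\sigma-1)(\sigma-2)\over\sigma(\sigma-1)^2}
={\sigma-2\over\sigma-1},$$
that is, $2/3$ for $\sigma=4$. A node with no incident edges simply contributes
a factor of $\sigma$ to the count.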
603: 
604: Finally, we present some numerical results. These are illustrated in Figures 9
605: through 12. Of these, Figures 9 and 11 are for $k=1$ ($n$ and $\sigma$ vary),
606: while Figures 10 and 12 are for $n=10$ ($k$ and $\sigma$ vary). What the
607: figures depict are the gains, averaged over $c$ for both even and odd
608: palindromes, that correspond to the expected times $\bar t$ assessed for the
609: original algorithm and its two variants. Figures 9 and 10 give gains for the
first improvement, and Figures 11 and 12 for the second. These figures tend to
support the conclusions we drew from specific examples in Section 4, namely
that gains are expected to be larger for larger $k$ and smaller $\sigma$, and
smaller for larger $n$.
614: 
615: \bigbeginsection 6. Concluding remarks
616: 
617: We have in this paper introduced a novel definition of approximate palindromes
618: in strings, and have given an algorithm for finding all approximate palindromes
619: in a string of $n$ characters to within at most $k$ errors. For a fixed
620: alphabet, the algorithm runs in time $O(k^2n)$. We have also indicated how to
621: perform implementation-related improvements in the algorithm, and demonstrated,
622: over a variety of strings and also based on an average-case analysis, that such
623: improvements do indeed often lead to reduced running times in practice.
624: 
625: \beginsection  Acknowledgments
626: 
627: The authors have received partial support from the Brazilian agencies CNPq and
628: CAPES, the PRONEX initiative of Brazil's MCT under contract 41.96.0857.00, and
629: a FAPERJ BBP grant.
630: 
631: \bigbeginsection References
632: 
633: {\frenchspacing
634: 
635: \medskip
636: \item{1.} D. E. Knuth, J. H. Morris, and V. R. Pratt,
637: ``Fast pattern matching in strings,''
{\it SIAM J. on Computing\/} {\bf 6} (1977), 323--350.
639: 
640: \medskip
641: \item{2.} G. Manacher,
642: ``A new linear-time on-line algorithm for finding the smallest initial
643: palindrome of a string,''
644: {\it J. of the ACM\/} {\bf 22} (1975), 346--351.
645: 
646: \medskip
647: \item{3.} A. Apostolico, D. Breslauer, and Z. Galil,
648: ``Optimal parallel algorithms for periods, palindromes and squares,''
649: {\it Proc. of the Int. Colloq. on Automata, Languages, and Programming},
650: 296--307, 1992.
651: 
652: \medskip
653: \item{4.} A. Apostolico, D. Breslauer, and Z. Galil,
654: ``Parallel detection of all palindromes in a string,''
655: {\it Theoretical Computer Science\/} {\bf 141} (1995), 163--173.
656: 
657: \medskip
658: \item{5.} D. Breslauer and Z. Galil,
659: ``Finding all periods and initial palindromes of a string in parallel,''
660: {\it Algorithmica\/} {\bf 14} (1995), 355--366.
661: 
662: \medskip
663: \item{6.} Z. Galil,
664: ``Optimal parallel algorithms for string matching,''
665: {\it Information and Control\/} {\bf 67} (1985), 144--157.
666: 
667: \medskip
668: \item{7.} D. Gusfield,
669: {\it Algorithms on Strings, Trees, and Sequences: Computer Science and
670: Computational Biology},
671: Cambridge University Press, New York, NY, 1997.
672: 
673: \medskip
674: \item{8.} J. Jurka,
675: ``Origin and evolution of Alu repetitive elements,''
676: in R. J. Maraia (Ed.),
677: {\it The Impact of Short Interspersed Elements (SINEs) on the Host Genome},
678: R. G. Landes, New York, NY, 25--41, 1995.
679: 
680: \medskip
\item{9.} V. I. Levenshtein,
``Binary codes capable of correcting deletions, insertions, and reversals,''
683: {\it Soviet Physics Doklady\/} {\bf 10} (1966), 707--710.
684: 
685: \medskip
686: \item{10.} D. Sankoff and J. Kruskal (Eds.),
687: {\it Time Warps, String Edits, and Macromolecules: The Theory and Practice of
688: Sequence Comparison},
689: Addison-Wesley, Reading, MA, 1983.
690: 
691: \medskip
692: \item{11.} G. M. Landau and U. Vishkin,
693: ``Introducing efficient parallelism into approximate string matching and a new
694: serial algorithm,''
695: {\it Proc. of the Annual ACM Symp. on Theory of Computing}, 220--230, 1986.
696: 
697: \medskip
698: \item{12.} G. M. Landau and U. Vishkin,
699: ``Fast parallel and serial approximate string matching,''
700: {\it J. of Algorithms\/} {\bf 10} (1989), 157--169.
701: 
702: \medskip
703: \item{13.} E. W. Myers,
704: ``An $O(nd)$ difference algorithm and its variations,''
705: {\it Algorithmica\/} {\bf 1} (1986), 251--266.
706: 
707: \medskip
708: \item{14.} E. Ukkonen,
709: ``Algorithms for approximate string matching,''
710: {\it Information and Control\/} {\bf 64} (1985), 100--118.
711: 
712: \medskip
713: \item{15.} R. Baeza-Yates and G. Navarro,
714: ``Faster approximate string matching,''
715: {\it Algorithmica\/} {\bf 23} (1999), 127--158.
716: 
717: \medskip
718: \item{16.} W. I. Chang and J. Lampe,
719: ``Theoretical and empirical comparisons of approximate string matching
720: algorithms,''
721: {\it Proc. of the Symp. on Combinatorial Pattern Matching},
722: 175--184, 1992.
723: 
724: \medskip
725: \item{17.} R. Cole and R. Hariharan,
726: ``Approximate string matching: a simpler faster algorithm,''
727: {\it Proc. of the Annual ACM-SIAM Symp. on Discrete Algorithms},
728: 463--472, 1998.
729: 
730: \medskip
731: \item{18.} G. A. Stephen,
732: {\it String Searching Algorithms},
733: World Scientific, Singapore, 1994.
734: 
735: \medskip
736: \item{19.} E. Ukkonen,
737: ``Finding approximate patterns in strings,''
738: {\it J. of Algorithms\/} {\bf 6} (1985), 132--137.
739: 
740: \medskip
741: \item{20.} S. Wu and U. Manber,
742: ``Fast text searching allowing errors,''
743: {\it Comm. of the ACM\/} {\bf 35} (1992), 83--91.
744: 
745: \medskip
746: \item{21.} G. M. Landau and U. Vishkin,
747: ``Efficient string matching with $k$ mismatches,''
748: {\it Theoretical Computer Science\/} {\bf 43} (1986), 239--249.
749: 
750: \medskip
751: \item{22.} A. H. L. Porto,
752: {\it Detecting Approximate Palindromes in Strings},
753: M.Sc. Thesis, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil,
754: 1999 (in Portuguese).
755: 
756: \medskip
757: \item{23.} J. A. Bondy and U. S. R. Murty,
758: {\it Graph Theory with Applications},
759: North-Holland, New York, NY, 1976.
760: 
761: }
762: 
763: \beginsection Authors' biographical data
764: 
765: \vskip-\smallskipamount\medskip\noindent
766: {\bf Alexandre H. L. Porto} is a doctoral student at the Systems Engineering
767: and Computer Science Program of the Federal University of Rio de Janeiro. He
768: is interested in sequential and parallel algorithms for problems in
769: computational biology.
770: 
771: \medskip
772: \noindent
{\bf Valmir C. Barbosa} is a professor at the Systems Engineering and Computer
774: Science Program of the Federal University of Rio de Janeiro, and is interested
775: in the various aspects of distributed and parallel computing, as well as of
776: the so-called complex systems, like neural networks and related models. He
777: received his Ph.D. from the University of California, Los Angeles, in 1986,
778: and has held visiting positions at the IBM Rio Scientific Center in Brazil,
779: the International Computer Science Institute in Berkeley, and the Computer
780: Science Division of the University of California, Berkeley. He has authored
781: the books {\it Massively Parallel Models of Computation\/} (Ellis Horwood,
782: Chichester, UK, 1993), {\it An Introduction to Distributed Algorithms\/} (The
783: MIT Press, Cambridge, MA, 1996), and {\it An Atlas of Edge-Reversal Dynamics\/}
784: (Chapman \& Hall/CRC Press, London, UK, 2000).
785: 
786: \vfill\eject
787: 
788: \topinsert
789: \centerline{\epsfbox{fig1.ps}}
790: \bigskip
791: \centerline{{\bf Figure 1.} Even (a) and odd (b) approximate palindromes for
792: $k=3$ in the string ${\it bbaabac}$}
793: \bigskip\bigskip\bigskip
794: \endinsert
795: 
796: \topinsert
797: \centerline{\epsfbox{fig2.ps}}
798: \bigskip
799: \centerline{{\bf Figure 2.} An edit script as a directed path in $D$}
800: \bigskip\bigskip\bigskip
801: \endinsert
802: 
803: \topinsert
804: \centerline{\epsfbox{fig3.ps}}
805: \bigskip
806: \centerline{{\bf Figure 3.} Gains due to the first improvement for $n=50$}
807: \bigskip\bigskip\bigskip
808: \endinsert
809: 
810: \topinsert
811: \centerline{\epsfbox{fig4.ps}}
812: \bigskip
813: \centerline{{\bf Figure 4.} Gains due to the first improvement for $n=2500$}
814: \bigskip\bigskip\bigskip
815: \endinsert
816: 
817: \topinsert
818: \centerline{\epsfbox{fig5.ps}}
819: \bigskip
820: \centerline{{\bf Figure 5.} Gains due to the second improvement for $n=50$}
821: \bigskip\bigskip\bigskip
822: \endinsert
823: 
824: \topinsert
825: \centerline{\epsfbox{fig6.ps}}
826: \bigskip
827: \centerline{{\bf Figure 6.} Gains due to the second improvement for $n=2500$}
828: \bigskip\bigskip\bigskip
829: \endinsert
830: 
831: \topinsert
832: \centerline{\epsfbox{fig7.ps}}
833: \bigskip
834: \centerline{{\bf Figure 7.} Gains of the second improvement over the first
835: for $n=50$}
836: \bigskip\bigskip\bigskip
837: \endinsert
838: 
839: \topinsert
840: \centerline{\epsfbox{fig8.ps}}
841: \bigskip
842: \centerline{{\bf Figure 8.} Gains of the second improvement over the first
843: for $n=2500$}
844: \bigskip\bigskip\bigskip
845: \endinsert
846: 
847: \topinsert
848: \centerline{\epsfbox{fig9.ps}}
849: \bigskip
850: \centerline{{\bf Figure 9.} Average gains due to the first improvement for
851: $k=1$}
852: \bigskip\bigskip\bigskip
853: \endinsert
854: 
855: \topinsert
856: \centerline{\epsfbox{fig10.ps}}
857: \bigskip
858: \centerline{{\bf Figure 10.} Average gains due to the first improvement for
859: $n=10$}
860: \bigskip\bigskip\bigskip
861: \endinsert
862: 
863: \topinsert
864: \centerline{\epsfbox{fig11.ps}}
865: \bigskip
866: \centerline{{\bf Figure 11.} Average gains due to the second improvement for
867: $k=1$}
868: \bigskip\bigskip\bigskip
869: \endinsert
870: 
871: \topinsert
872: \centerline{\epsfbox{fig12.ps}}
873: \bigskip
874: \centerline{{\bf Figure 12.} Average gains due to the second improvement for
875: $n=10$}
876: \bigskip\bigskip\bigskip
877: \endinsert
878: 
879: \bye
880: 
881: