q-bio0501010/letter.tex
1: % Philipp Messer
2: % Institut für Theoretische Physik
3: % Universität zu Köln
4: % Zülpicher Strasse 77
5: % D-50937 Köln
6: % GERMANY
7: %
8: % Physical Review Letters
9: %
10: % 
11: % A Solvable Sequence Evolution Model and Genomic Correlations
12: %
13: %
14: 
15: \documentclass[prl,twocolumn,showpacs,amsmath,amssymb]{revtex4}
16: 
17: \usepackage{graphicx}
18: \usepackage{dcolumn}
19: \usepackage{bm}
20: 
21: \setlength{\parskip}{1.5 ex}
22: \newcommand{\siml}{\raisebox{-.6ex}{$\stackrel{<}{\displaystyle{\sim}}$}}
23: 
24: \begin{document}
25: 
26: \title{A Solvable Sequence Evolution Model and Genomic Correlations}
27: 
28: \author{Philipp W. Messer$^1$, Peter F. Arndt$^2$, and Michael L\"assig$^1$}
29: \affiliation{$^{1}$Institute for Theoretical Physics, University of Cologne, Z\"ulpicher Str.~77, 50937 K\"oln, Germany}
30: \affiliation{$^{2}$Max Planck Institute for Molecular Genetics, Ihnestr.~73, 14195 Berlin, Germany}
31: 
32: \date{\today}
33: 
34: \begin{abstract}
35:   We study a minimal model for genome evolution whose elementary
36:   processes are single site mutation, duplication and deletion of
37:   sequence regions and insertion of random segments. These processes
38:   are found to generate long-range correlations in the composition of
39:   letters as long as the sequence length is growing, i.e., the
40:   combined rates of duplications and insertions are higher than the
41:   deletion rate. For constant sequence length, on the other hand, all
42:   initial correlations decay exponentially. These results are obtained
43:   analytically and by simulations. They are compared
44:   with the long-range correlations observed in genomic DNA, and the
45:   implications for genome evolution are discussed.
46: \end{abstract}
47: 
48: \pacs{87.23.Kg, 87.15.Cc, 05.40.-a}
49: 
50: \maketitle
51: 
52: Over a decade ago, long-range correlations in the sequence composition
53: of DNA were discovered~\cite{Peng92,Voss92,Li92}. With the
54: rapidly growing availability of whole-genome sequence data, the
55: composition of genomic DNA can now be studied systematically over a
56: wide range of scales and organisms.  The statistical analysis is quite
57: intricate since genomic DNA is a rather ``patchy'' statistical
58: environment~\cite{Karlin93}: it consists of genes, noncoding regions,
59: repetitive elements etc., and all of these substructures have a
60: systematic influence on the local sequence composition. Variations in
61: composition along the genome have been studied extensively by a number
62: of different
63: methods~\cite{Li97,Vieira99,Bernaola02,Arneodo95,Peng94,Stanley99,Holste03,Ouyang04}, and it is now well established that
64: long-range correlations in base composition appear in the genomes of
65: many species.  These can be measured, for example, by the
66: autocorrelation function $C(r)$ of the GC-content, which measures the
67: likelihood of finding G-C Watson-Crick pairs at a distance of $r$
68: bases along the backbone of the DNA molecule.  However, the form of
69: these correlations is much more complex than simple power laws. Within
70: one chromosome, there is often a variety of different scaling regimes
71: and effective exponents, and sometimes no clear scaling at all.
72: Moreover, the effective exponents of comparable scaling regimes vary
73: considerably between different species, and even between different
74: chromosomes of the same species~\cite{Bernaola02,Holste03,next}.
75: 
76: Despite the ubiquity of genomic correlations, little is known about
77: their evolutionary origin. In this Letter, we address the question
78: whether the observed correlations can be explained quantitatively by a
79: biologically realistic ``minimal'' model of sequence evolution. We
80: take into account four well known elementary evolutionary modes:
81: single site mutations, duplications and deletions of existing segments of the 
82: sequence, and insertions of random segments. The duplication
83: processes are believed to be a crucial mechanism of genome
84: growth~\cite{Goffeau04,Eichler02a,Hsieh03}; the length of the
85: duplicated segments ranges from single letters to thousands of letters
86: as in the case of gene duplications. The model is minimal in the sense
87: that all four elementary modes are {\em local} stochastic processes
88: compatible with {\em neutral evolution}, i.e., they do not require any
89: assumption of natural selection.  An alternative explanation of
90: the observed correlations would be {\em long-range interactions},
91: most likely caused by natural selection for a specific local GC-content. An
92: example of such a selective process is the clustering of genes in some
93: regions of a chromosome~\cite{Lercher03}, but no plausible
94: mechanism producing long-range interactions has been
95: proposed so far.
96: 
97: Li's original work showed that even a simple stochastic process
98: consisting of duplications and mutations of single letters leads to
99: generic power law correlations in the sequence
100: composition~\cite{Li91}. Here we analyze in detail the generalized
101: sequence evolution model introduced above. In particular, we calculate
102: the stationary two-point correlation function $C(r)$.  It is of power
103: law form, $C(r) \sim r^{-\alpha}$, with a decay exponent $\alpha$
104: depending on only two effective parameters, which are simple functions
105: of the rates of the elementary processes.  These long-range
106: correlations are generic as long as the rates of the processes result
107: in a growing sequence. At constant sequence length, however, the
108: stationary correlations in sequence composition vanish, and initial
109: correlations from a previous growth phase decay. Our analytic results
110: (which differ from Li's approximate expressions~\cite{Li91} and the
111: results of~\cite{Mansilla00}) are in excellent agreement with our
112: numerical simulations.  We use these results to infer from measured
113: values of $\alpha$ a lower bound on the growth rate of the genome,
114: which can be compared with independent estimates. The implications of
115: our findings for the evolution of mammalian genomes are discussed at
116: the end of this Letter.  
117: %\medskip
118: 
119: {\em Sequence evolution model.}---The stochastic evolution model
120: generates sequences $(s_1, \dots, s_N)$ of variable length $N$. For
121: simplicity, their letters are taken from a binary alphabet; $s_k =
122: \pm1$. (In the application to genomic systems, $s_k = +1$ denotes a
123: GC-pair and $s_k = -1$ an AT-pair at backbone position $k$.) The
124: elementary evolutionary steps are mutations, duplications, insertions,
125: and deletions of single letters (the generalization to segments will
126: be discussed below). They are Markov processes with rates $\mu$,
127: $\delta$, $\gamma^+$, and $\gamma^-$ acting on the sequences as
128: \begin{equation}
129: \label{rep_mut_processes}
130: (\cdots,s,s',\cdots)\to\left\{\begin{array}{l@{\;\;:\;\;}l} 
131: (\cdots,-s,s',\cdots)&\mbox{rate }\mu \\
132: (\cdots,s,s,s',\cdots)&\mbox{rate }\delta \\
133: (\cdots,s,x,s',\cdots)&\mbox{rate }\gamma^+\\
134: (\cdots,s',\cdots)&\mbox{rate }\gamma^-,
135: \end{array}\right.
136: \end{equation}
137: where $x = \pm 1$ denotes a uniformly distributed random letter.
138: Duplication and insertion events introduce a new letter next to an
139: existing one and shift all subsequent letters one position to the
140: right, thereby increasing the sequence length by~1. Conversely,
141: deletions shorten the length by~1. This type of Markov evolution model
142: is widely used in computational biology, forming the statistical basis
143: of sequence alignment algorithms~\cite{Durbin98}. Running all four
144: processes over a time $t$ produces a statistical ensemble of
145: sequences; the corresponding averages are denoted by $\langle \dots
146: \rangle (t)$. This ensemble is characterized by the rates $\delta$,
147: $\mu$, $\gamma^+$, $\gamma^-$, and by the initial sequence. Here we
148: use sequences of length $1$ with a fixed letter, $(s_1)=1$, or a
149: random letter, $(s_1) = x$.
150: 
151: After a time $t$, the sequences have an average length $\langle N
152: \rangle(t) = \exp(\lambda t)$ with the effective growth rate
153: \begin{equation}
154: \lambda = \delta+\gamma^+-\gamma^-.
155: \label{lambda}
156: \end{equation}
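The exponential form follows from a one-line rate balance, sketched
here under the assumption that every site independently undergoes
duplication, insertion, and deletion at the rates of
Eq.~(\ref{rep_mut_processes}) and that the initial length is
$\langle N\rangle(0)=1$:
\begin{equation}
\frac{d}{dt}\langle N\rangle(t)=(\delta+\gamma^+-\gamma^-)\,\langle N\rangle(t)
\quad\Longrightarrow\quad
\langle N\rangle(t)=e^{\lambda t}.\nonumber
\end{equation}
Mutations do not enter because they leave the length unchanged.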
157: We are interested in two dynamical regimes, sequence growth from a
158: single-letter initial state (i.e., $\lambda > 0$) and the evolution of
159: sequences at stationary length $\langle N\rangle \gg 1$ (i.e.,
160: $\lambda = 0$), to which we now turn in order.  
161: %\medskip
162: 
163: {\em Growth dynamics and stationary correlations.}---The composition
164: bias of the sequences at position $k$ is measured by the expectation
165: value $\langle s_k\rangle(t)$. It is easy to show that any initial
166: composition bias decays due to mutations and random insertions. We
167: note that each insertion can be regarded as a duplication with a
168: subsequent mutation in half of the cases, resulting in an effective
169: mutation rate
170: \begin{equation}
171: \mu_{\rm eff}=\mu+\gamma^+/2.
172: \label{mu_eff}
173: \end{equation}
174: We obtain $\langle s_k\rangle (t)\propto\exp(-2\mu_{\rm eff}t)$ for
175: fixed initial condition, while $\langle s_k\rangle (t)=0$ for random
176: initial conditions. 
177: The composition correlation $C(r) \equiv \langle s_k s_{k+r} \rangle
178: (t)$ between two sequence positions at distance $r$ is affected by all
179: four processes and is independent of the initial condition. Its
180: evolution equation can be derived by writing it as $C(r,t)=P_{\rm
181:   eq}(r,t)-P_{\rm op}(r,t)$, where $P_{\rm eq}(r,t)$ and $P_{\rm
182:   op}(r,t)$ denote the joint probabilities of finding two symbols of
183: equal and opposite signs, respectively, at a distance $r$. The Master
184: equation for $P_{\rm eq}(r,t)$ takes the form
185: \begin{eqnarray} 
186: \frac{\partial}{\partial t} P_{\rm eq}(r,t)&=&
187: 2\mu_{\rm eff}\:[-P_{\rm eq}(r,t)+P_{\rm op}(r,t)]
188: \nonumber\\ 
189: &&+[r\delta+(r-1)\gamma^+]\:[P_{\rm eq}(r-1,t)-P_{\rm eq}(r,t)]
190: \nonumber\\ 
191: &&+r\gamma^-\:[P_{\rm eq}(r+1,t)-P_{\rm eq}(r,t)]. 
192: \end{eqnarray} 
193: The first term on the r.h.s.~describes the change in $P_{\rm eq}(r,t)$
194: due to mutations and random insertions, while the second term
195: specifies the probability current due to duplication of a site in the
196: interval $(k,k+r-1)$ or insertion of a new site in the interval
197: $(k,k+r-2)$. The third term gives the corresponding current due to
198: deletions. By exchanging $P_{\rm eq}$ and $P_{\rm op}$, we obtain a
199: similar equation for $P_{\rm op}(r,t)$. Hence we have
200: \begin{eqnarray}
201: \label{master_equation_C}
202: \frac{\partial}{\partial t} C(r,t)&=&-4\mu_{\rm eff}\:C(r,t)\nonumber\\
203: &&+[r\delta+(r-1)\gamma^+]\:[C(r-1,t)-C(r,t)]\nonumber\\
204: &&+r\gamma^-\:[C(r+1,t)-C(r,t)]. 
205: \end{eqnarray} 
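The step from $P_{\rm eq}$ and $P_{\rm op}$ to $C$ is a direct
subtraction. Using $C=P_{\rm eq}-P_{\rm op}$, the mutation terms of
the two Master equations combine as
\begin{equation}
2\mu_{\rm eff}\,[-P_{\rm eq}+P_{\rm op}]-2\mu_{\rm eff}\,[-P_{\rm op}+P_{\rm eq}]
=-4\mu_{\rm eff}\,C,\nonumber
\end{equation}
while the duplication, insertion, and deletion terms are linear in the
probabilities and therefore carry over to $C$ with unchanged
prefactors.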
206: For the special case with only single-letter duplications and
207: mutations ($\delta,\mu > 0$, $\gamma^+=\gamma^-=0$), which is
208: equivalent to Li's original model~\cite{Li91}, we find a simple
209: analytical form for the stationary $C(r)$ by solving the recursion
210: \begin{equation} 
211: \label{recursion_C}
212: C(r)=\frac{r}{\alpha+r}\:C(r-1)\quad\mbox{with}
213: \quad\alpha=\frac{4\mu}{\delta}
214: \end{equation} 
215: and the initial value $C(0)=1$. This gives
216: \begin{equation} 
217: \label{C_l_result}
218: C(r)=\frac{\Gamma(r+1)\Gamma(1+\alpha)}{\Gamma(r+1+\alpha)}=\alpha\:B(r+1,\alpha),
219: \end{equation} 
220: where $\Gamma(x)$ is the gamma function and $B(x,y)$ the beta
221: function.  Evaluating its asymptotic behavior for $x\gg1$,
222: \begin{equation}
223: B(x,y)\propto\Gamma(y)\:x^{-y}
224: \left[1-\frac{y(y-1)}{2x}\left(1+\mbox{O}
225: \left(\frac{1}{x}\right)\right)\right]\nonumber,
226: \end{equation} 
227: then produces the algebraic decay $C(r)\propto r^{-\alpha}$. For the
228: general case including insertions and deletions, the asymptotic decay
229: can still be obtained exactly in the continuum limit. For $r \gg 1$
230: and $\delta > 0$, the difference equation~(\ref{master_equation_C})
231: becomes the differential equation
232: \begin{equation}
233: \label{C_l_indel_dgl}
234: \frac{\partial}{\partial t} C(r,t)=-4\mu_{\rm eff}C(r,t)-r\lambda
235: \frac{\partial}{\partial r} C(r,t)
236: \end{equation}
237: with the effective rates $\lambda$ and $\mu_{\rm eff}$ defined by 
238: (\ref{lambda}) and (\ref{mu_eff}). This
239: has the stationary solution 
240: \begin{equation}
241: \label{C_l_indel_asymptotics}
242: C(r)\propto r^{-\alpha}\quad\mbox{with}
243: \quad\alpha=\frac{4\mu_{\rm eff}}{\lambda}. 
244: \end{equation} 
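Both results can be verified in a few lines; the following is a sketch
of the intermediate steps rather than a full derivation. Setting
$\partial C/\partial t=0$ and $\gamma^+=\gamma^-=0$ in
Eq.~(\ref{master_equation_C}) gives
$(4\mu+r\delta)\,C(r)=r\delta\,C(r-1)$, which is the
recursion~(\ref{recursion_C}). Iterating from $C(0)=1$,
\begin{equation}
C(r)=\prod_{k=1}^{r}\frac{k}{k+\alpha},\qquad
C(1)=\frac{1}{1+\alpha},\quad
C(2)=\frac{2}{(1+\alpha)(2+\alpha)},\nonumber
\end{equation}
in agreement with the gamma-function form in Eq.~(\ref{C_l_result}).
In the continuum limit, approximating
$C(r\mp1)-C(r)\approx\mp\,\partial C/\partial r$ and $r-1\approx r$
for $r\gg1$ collects the three transport terms into
$-r\lambda\,\partial C/\partial r$. The stationary condition of
Eq.~(\ref{C_l_indel_dgl}) then reads
\begin{equation}
r\lambda\,\frac{d C}{d r}=-4\mu_{\rm eff}\,C
\quad\Longrightarrow\quad
\frac{d\ln C}{d\ln r}=-\frac{4\mu_{\rm eff}}{\lambda}=-\alpha .\nonumber
\end{equation}
For example, $\mu=1$ and $\delta=8$ with all other rates zero give
$\alpha=4\mu/\delta=0.5$.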
245: %\medskip
246: 
247: Eq.~(\ref{C_l_indel_dgl}) clearly shows the mechanism generating
248: long-range correlations in this type of sequence evolution model.
249: Correlations are continuously produced at small scales by duplications
250: and transported to larger distances by the net exponential expansion
251: of the sequence (resulting from duplications and
252: insertions/deletions). On the other hand, correlations decay
253: exponentially due to processes randomizing the sequence (i.e.,
254: mutations and random insertions). The competition between expansion
255: and randomization produces the algebraic decay $C(r)\propto
256: r^{-\alpha}$, which is highly universal.  Microscopic details of the
257: evolution processes are irrelevant; the exponent $\alpha$ is
258: determined by a simple balance between the growth rate $\lambda$ and
259: the effective mutation rate $\mu_{\rm eff}$. Hence, an extended model
260: containing duplications, deletions and random insertions of sequence
261: {\em segments} of finite length $\ell= 1,2,...,\ell_{\rm max}$ with
262: respective rates $\delta_{\ell}$, $\gamma_{\ell}^-$, and $\gamma_{\ell}^+$
263: still has the same asymptotics~(\ref{C_l_indel_asymptotics}) for
264: $N(t)\gg\ell_{\rm max}$ and $r\gg\ell_{\rm max}$. The effective rates
265: (\ref{lambda}), (\ref{mu_eff}) are now given by
266: \begin{equation}
267: \label{full_effective_rates}
268: \lambda=\sum_{\ell}\ell\,\left[\delta_{\ell}+\gamma^+_{\ell}-\gamma^-_{\ell}\right],\quad\mu_{\rm eff}=\mu+\frac{1}{2}\sum_{\ell}\ell\gamma_{\ell}^+.
269: \end{equation} 
270: This asymptotics can again be proved from an exact Master equation
271: similar to (\ref{master_equation_C})~\cite{next}.  The extended model
272: is important for genomic evolution since strong long-range
273: correlations (i.e., small values of $\alpha$) can be the combined
274: result of segment duplications with different values of $\ell$. Their
275: individual rates $\delta_\ell$ might be small and difficult to assess
276: but the cumulative rate $\lambda$ can still be estimated.
277: %\medskip
278: 
279: \emph{Stationary-length dynamics and time-dependent
280:   correlations.}---It is obvious from Eq.~(\ref{C_l_indel_dgl}) that
281: stationary long-range correlations only exist as long as the sequence
282: grows, i.e., for $\lambda > 0$. Consider now the following evolutionary
283: scenario: sequence growth with rate $\lambda_1 > 0$ up to a length
284: $N_0 = N(t_0)$, followed by a second phase with $\lambda_2 = 0$ and
285: $\langle N \rangle(t) = N_0$ for $t > t_0$. The time-dependent
286: solution of Eq.~(\ref{C_l_indel_dgl}) for the asymptotics of $C(r,t)$ is
287: then 
288: \begin{equation}
289: \label{C_l_theta_deletion}
290: C(r,t) = C(r,t_0)\:e^{-4\mu_{\rm eff}\Delta t} 
291: \propto r^{-4 \mu_{\rm eff}/\lambda_1} e^{-4\mu_{\rm eff}\Delta
292: t}
293: \end{equation}
294: with $\Delta t=t-t_0>0$. In the second phase, the long-range tails of
295: $C(r,t)$ are preserved but their amplitude decays with a
296: characteristic time scale $\tau=(4\mu_{\rm eff})^{-1}$.
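This follows because, for $\lambda_2=0$, the transport term in
Eq.~(\ref{C_l_indel_dgl}) drops out. As a short sketch, assuming
$\mu_{\rm eff}$ denotes the effective mutation rate during the second
phase:
\begin{equation}
\frac{\partial}{\partial t}C(r,t)=-4\mu_{\rm eff}\,C(r,t)
\quad\Longrightarrow\quad
C(r,t)=C(r,t_0)\,e^{-4\mu_{\rm eff}(t-t_0)},\nonumber
\end{equation}
so that every distance $r$ decays at the same rate and the shape of
the tail inherited from the growth phase is unchanged. The amplitude
drops to $1/e$ of its initial value after $\Delta t=\tau$ and to one
half after $\Delta t=\tau\ln 2\approx0.69\,\tau$.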
297: %\medskip
298: 
299: \emph{Numerical results.}---We have performed extensive Monte Carlo
300: simulations of our model. During each time step $\Delta
301: t=[(\mu+\sum\nolimits_{\ell}[\delta_{\ell}+\gamma_{\ell}^++\gamma_{\ell}^-])N(t)]^{-1}$
302: we choose a random site and apply one of the elementary processes with
303: its relative weight.
304: \begin{figure} [t!]
305: \centering
306: \includegraphics[width=0.92\linewidth]{fig1a}
307: \includegraphics[width=0.92\linewidth]{fig1b}
308: \caption{Stationary $C(r)$ at different rates of the
309:   elementary processes. (a) Single-letter duplication-mutation model:
310:   Numerical results (circles) and the analytical
311:   form~(\ref{C_l_result}) (lines) for $\mu=1$, $\delta$ varying. (b)
312:   Full model: Numerical results (circles) with the analytic
313:   asymptotics~(\ref{C_l_indel_asymptotics})
314:   and~(\ref{full_effective_rates}) (lines) for $\mu=1$ and varying
315:   rates of the other processes (rates not specified in the plot are
316:   zero). The dynamics of the sequences was simulated until they
317:   reached a length of $N=2^{27}\approx10^8$; $C(r)$ was averaged over
318:   the sequence and over 100 runs.
319: \label{cor_eps}}
320: \end{figure}
321: For a single realization of this dynamics, the correlation function
322: $C(r)$ is well approximated by the sequence average $(N-r)^{-1}
323: \sum_{k=1}^{N-r} s_k s_{k+r}$. Further averaging over 100 realizations
324: produces very accurate measurements of $C(r)$.
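One way to read the update rule, stated here as an assumption about
the meaning of ``relative weight'' rather than as a description of the
actual implementation: with the total per-site rate
$R=\mu+\sum_{\ell}[\delta_{\ell}+\gamma_{\ell}^{+}+\gamma_{\ell}^{-}]$,
the time step is $\Delta t=[R\,N(t)]^{-1}$ and the process applied at
the chosen site is drawn with probabilities
\begin{equation}
p_{\mu}=\frac{\mu}{R},\quad
p_{\delta_{\ell}}=\frac{\delta_{\ell}}{R},\quad
p_{\gamma^{\pm}_{\ell}}=\frac{\gamma^{\pm}_{\ell}}{R},\nonumber
\end{equation}
which sum to one. For the single-letter case with $\mu=1$,
$\delta_1=8$ and all other rates zero, this gives $R=9$, so that a
mutation is attempted with probability $1/9$ and a duplication with
probability $8/9$ per step.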
325: 
326: Fig.~\ref{cor_eps}(a) shows the numerical $C(r)$ for the single-letter
327: duplication-mutation dynamics with various rates, which is in
328: excellent agreement with the analytic expression~(\ref{C_l_result}).
329: The same is shown in Fig.~\ref{cor_eps}(b) for the general case with
330: all types of processes present, verifying the asymptotic
331: behavior~(\ref{C_l_indel_asymptotics})
332: and~(\ref{full_effective_rates}). For completeness, we have also
333: obtained power spectra and the mutual information function, as defined
334: in~\cite{Holste03}, which have the expected decay exponents $1
335: -\alpha$ and $2\alpha$, respectively.
336: 
337: \begin{figure} [t!]
338: \centering
339: \includegraphics[width=0.92\linewidth]{fig2a}
340: \includegraphics[width=0.92\linewidth]{fig2b}
341: \caption{Time-dependent correlations $C(r,t)$. (a) Build-up of long-range
342:   correlations by stationary growth. Measured $C(r,t)$ at various
343:   intermediate lengths $N(t)=10^2,10^4,10^6$ (symbols) together with
344:   the stationary form~(\ref{C_l_result}) (line) for $\mu=1$,
345:   $\delta_1=\delta=8$, all other parameters are zero. (b) Decay of
346:   correlations during sequence evolution at stationary length
347:   $N_0=10^6$. Measured $C(r,t)$ at various times $\Delta t$ (symbols)
348:   together with the analytic decay of the long-range tail given by
349:   Eq.~(\ref{C_l_theta_deletion}) (lines). Note that there are still
350:   correlations remaining on short length scales.
351: \label{growth_decay_eps}}
352: \end{figure}
353: 
354: The dynamical build-up of these correlations for growing sequences is
355: seen in Fig.~\ref{growth_decay_eps}(a), which shows $C(r,t)$ at
356: various intermediate times of the growth process. The correlation
357: rapidly converges to the stationary form for all distances
358: $r\:\siml\:N(t)$.  This should be compared with the time-dependence of
359: $C(r,t)$ at constant length in Fig.~\ref{growth_decay_eps}(b), which
360: shows an algebraic tail with an exponentially decreasing amplitude as
361: predicted by Eq.~(\ref{C_l_theta_deletion}).
362: %\medskip
363: 
364: {\em Genomic evolution.}---As pointed out above, the processes
365: discussed here constitute a minimal model for dynamically generated
366: long-range correlations along a sequence. But can this model explain
367: the observed correlations in genomic DNA?  The correlation function
368: $C(r)$ along human chromosomes shows a rather slow algebraic decay on
369: distance scales $10^3<r<10^6$ with typical effective exponents
370: $\alpha\approx0.1$~\cite{Bernaola02,Holste03}. We have confirmed these
371: measurements and found them to be consistent with sequence data from
372: other mammals~\cite{next}. A lower bound on the effective mutation
373: rate in mammals is $\mu_{\rm eff}\approx 2\cdot10^{-9}\rm a^{-1}$ per
374: site~\cite{Arndt04}.  Assuming stationary growth, we can use these
375: values of $\alpha$ and $\mu_{\rm eff}$ to derive a lower bound on the
376: genomic growth rate $\lambda$, resulting in a minimum value $\lambda
377: \approx 10^{-7}\rm a^{-1}$ per site according to
378: Eq.~(\ref{C_l_indel_asymptotics}).  However, this rate is much too
379: high: the genome would have expanded much faster than is observed.
380: The current human genome contains $N\approx3\cdot 10^9$ base
381: pairs and, assuming the above rate of genome expansion, would have
382: contained only about $4\cdot 10^5$ base pairs at the time of the mammalian
383: radiation about 90 million years ago. This can clearly be rejected
384: since approximately 40\% of the human genome can be aligned to the
385: mouse genome, representing most of the orthologous sequences that
386: remain in both lineages from the common ancestor~\cite{Mouse02}.
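The numbers entering this argument can be reproduced directly from the
values quoted above, taking $\alpha\approx0.1$,
$\mu_{\rm eff}\approx2\cdot10^{-9}\,\rm a^{-1}$, and a time span
$T\approx9\cdot10^{7}\,\rm a$, and assuming a constant growth rate over
that span:
\begin{equation}
\lambda=\frac{4\mu_{\rm eff}}{\alpha}\approx8\cdot10^{-8}\,{\rm a^{-1}}
\approx10^{-7}\,{\rm a^{-1}},\qquad
N\,e^{-\lambda T}\approx3\cdot10^{9}\,e^{-9}\approx4\cdot10^{5}.\nonumber
\end{equation}
Here the rounded value $\lambda\approx10^{-7}\,\rm a^{-1}$ is used in
the exponent, so that $\lambda T\approx9$.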
387: 
388: Over longer evolutionary periods, genomic expansion phases with rates
389: $\lambda \sim 10^{-7}\rm a^{-1}$ cannot be ruled out if we assume the
390: history of the genome has been a {\em punctuated} process, with such
391: expansion phases followed by periods of approximately constant length.
392: In the human genome, there is by now ample evidence for growth by
393: segmental duplications with various segment
394: lengths~\cite{Thomas04,Eichler02b}. In a punctuated growth process,
395: correlations are produced and transported during the expansion phases.
396: During the stationary phases, the previously established correlations
397: decay as given by Eq.~(\ref{C_l_theta_deletion}). In mammals, the last
398: period of rapid expansion has been the mammalian radiation, and the
399: characteristic time scale of the decay is $\tau\approx100$~Myr.
400: Correlations present or generated at the time of the mammalian
401: radiation would hence still persist. The succession of several
402: distinct growth phases with different values of $\lambda$ and
403: $\mu_{\rm eff}$ could even explain correlations $C(r)$ with several
404: scaling regimes as found in human chromosomes~\cite{Bernaola02}. We
405: conclude that the correlations observed in mammals are compatible with
406: a punctuated expansion-randomization process. Of course, this does not
407: rule out other causes. Indeed, the rather diverse functional forms
408: found in different species may point towards more than one generating
409: mechanism. If genomic expansion proves to be a significant
410: contribution, composition correlations could be the ``background
411: radiation'' of genomics, allowing us to trace the history of genomes
412: far back in evolutionary time.
413: 
414: \begin{references}
415: 
416: \bibitem{Li92}
417: W. Li and K. Kaneko,
418: {\it Europhys. Lett. }{\bf 17}, 655 (1992).
419: 
420: \bibitem{Peng92}
421: C.-K. Peng {\it et al.},
422: {\it Nature }(London) {\bf 356}, 168 (1992).
423: 
424: \bibitem{Voss92}
425: R. F. Voss,
426: {\it Phys. Rev. Lett. }{\bf 68}, 3805 (1992).
427: 
428: \bibitem{Karlin93}
429: S. Karlin and V. Brendel,
430: {\it Science }{\bf 259}, 677 (1993).
431: 
432: \bibitem{Peng94}
433: C.-K. Peng {\it et al.},
434: {\it Phys. Rev. E }{\bf 49}, 1685 (1994).
435: 
436: \bibitem{Arneodo95}
437: A. Arneodo, E. Bacry, P. V. Graves, and J. F. Muzy,
438: {\it Phys. Rev. Lett. }{\bf 74}, 3293 (1995). 
439: 
440: \bibitem{Li97}
441: W. Li,
442: {\it Comput. Chem. }{\bf 21}, 257 (1997).
443: 
444: \bibitem{Vieira99}
445: M. de Sousa Vieira,
446: {\it Phys. Rev. E }{\bf 60}, 5932 (1999).
447: 
448: \bibitem{Stanley99}
449: H. E. Stanley {\it et al.},
450: {\it Physica A }{\bf 273}, 1 (1999).
451: 
452: \bibitem{Bernaola02}
453: P. Bernaola-Galvan, P. Carpena, R. Roman-Roldan, and J. L. Oliver,
454: {\it Gene }{\bf 300}, 105 (2002). 
455: 
456: \bibitem{Holste03}
457: D. Holste {\it et al.},
458: {\it Phys. Rev. E }{\bf 67}, 061913 (2003).
459: 
460: %\bibitem{Isohata03}
461: %Y. Isohata and M. Hayashi,
462: %{\it J. Phys. Soc. Japan }{\bf 72}, 735 (2003).
463: 
464: \bibitem{Ouyang04}
465: Z. Ouyang, C. Wang, and Z. S. She,
466: {\it Phys. Rev. Lett. }{\bf 93}, 078103 (2004). 
467: 
468: \bibitem{next} 
469: P. Messer, M. L\"assig, and P. Arndt, to be published. 
470: 
471: \bibitem{Eichler02a}
472: R. V. Samonte and E. E. Eichler,
473: {\it Nat. Rev. Genet. }{\bf 3}, 65 (2002).
474: 
475: \bibitem{Hsieh03}
476: L.-C. Hsieh, L. Luo, F. Ji, and H. C. Lee,
477: {\it Phys. Rev. Lett. }{\bf 90}, 018101 (2003).
478: 
479: \bibitem{Goffeau04}
480: A. Goffeau,
481: {\it Nature }(London) {\bf 430}, 25 (2004).
482: 
483: \bibitem{Lercher03}
484: M. J. Lercher, A. O. Urrutia, A. Pavlicek, and L. D. Hurst,
485: {\it Human Mol. Genetics }{\bf 12}, 2411 (2003).
486: 
487: \bibitem{Li91}
488: W. Li,
489: {\it Phys. Rev. A }{\bf 43}, 5240 (1991).
490: 
491: \bibitem{Mansilla00}
492: R. Mansilla and G. Cocho,
493: {\it Complex Systems }{\bf 12}, 207 (2000).
494: 
495: \bibitem{Durbin98}
496: R. Durbin, S. Eddy, A. Krogh, and G. Mitchison,
497: {\it Biological Sequence Analysis }
498: (Cambridge University Press, Cambridge, England, 1998).
499: 
500: \bibitem{Arndt04}
501: P. F. Arndt and T. Hwa,
502: {\it Bioinformatics }{\bf 20}, 1482 (2004).
503: 
504: \bibitem{Mouse02}
505: Mouse Genome Sequencing Consortium,
506: {\it Nature }(London) {\bf 420}, 520 (2002).
507: 
508: \bibitem{Eichler02b}
509: J. A. Bailey {\it et al.},
510: {\it Science }{\bf 297}, 1003 (2002).
511: 
512: \bibitem{Thomas04}
513: E. E. Thomas {\it et al.},
514: {\it PNAS }{\bf 101}, 10349 (2004).
515: 
516: \end{references}
517: 
518: \end{document}
519: 
520: 
521: 
522: