
1: % Philipp Messer

2: % Institut für Theoretische Physik

3: % Universität zu Köln

4: % Zülpicher Strasse 77

5: % D-50937 Köln

6: % GERMANY

7: %

8: % Physical Review Letters

9: %

10: %

11: % A Solvable Sequence Evolution Model and Genomic Correlations

12: %

13: %

14:

15: \documentclass[prl,twocolumn,showpacs,amsmath,amssymb]{revtex4}

16:

17: \usepackage{graphicx}

18: \usepackage{dcolumn}

19: \usepackage{bm}

20:

21: \setlength{\parskip}{1.5 ex}

22: \newcommand{\siml}{\raisebox{-.6ex}{$\stackrel{<}{\displaystyle{\sim}}$}}

23:

24: \begin{document}

25:

26: \title{A Solvable Sequence Evolution Model and Genomic Correlations}

27:

28: \author{Philipp W. Messer$^1$, Peter F. Arndt$^2$, and Michael L\"assig$^1$}

29: \affiliation{$^{1}$Institute for Theoretical Physics, University of Cologne, Z\"ulpicher Str.~77, 50937 K\"oln, Germany}

30: \affiliation{$^{2}$Max Planck Institute for Molecular Genetics, Ihnestr.~73, 14195 Berlin, Germany}

31:

32: \date{\today}

33:

34: \begin{abstract}

35:   We study a minimal model for genome evolution whose elementary

36:   processes are single-site mutation, duplication and deletion of

37:   sequence regions and insertion of random segments. These processes

38:   are found to generate long-range correlations in the composition of

39:   letters as long as the sequence length is growing, i.e., the

40:   combined rates of duplications and insertions are higher than the

41:   deletion rate. For constant sequence length, on the other hand, all

42:   initial correlations decay exponentially. These results are obtained

43:   analytically and by simulations. They are compared

44:   with the long-range correlations observed in genomic DNA, and the

45:   implications for genome evolution are discussed.

46: \end{abstract}

47:

48: \pacs{87.23.Kg, 87.15.Cc, 05.40.-a}

49:

50: \maketitle

51:

52: Over a decade ago, long-range correlations in the sequence composition

53: of DNA were discovered~\cite{Peng92,Voss92,Li92}. With the

54: rapidly growing availability of whole-genome sequence data, the

55: composition of genomic DNA can now be studied systematically over a

56: wide range of scales and organisms.  The statistical analysis is quite

57: intricate since genomic DNA is a rather ``patchy'' statistical

58: environment~\cite{Karlin93}: it consists of genes, noncoding regions,

59: repetitive elements etc., and all of these substructures have a

60: systematic influence on the local sequence composition. Variations in

61: composition along the genome have been studied extensively by a number

62: of different

63: methods~\cite{Li97,Vieira99,Bernaola02,Arneodo95,Peng94,Stanley99,Holste03,Ouyang04}, and it is now well established that

64: long-range correlations in base composition appear in the genomes of

65: many species.  These can be measured, for example, by the

66: autocorrelation function $C(r)$ of the GC-content, which measures the

67: likelihood of finding G-C Watson-Crick pairs at a distance of $r$

68: bases along the backbone of the DNA molecule.  However, the form of

69: these correlations is much more complex than simple power laws. Within

70: one chromosome, there is often a variety of different scaling regimes

71: and effective exponents, and sometimes no clear scaling at all.

72: Moreover, the effective exponents of comparable scaling regimes vary

73: considerably between different species, and even between different

74: chromosomes of the same species~\cite{Bernaola02,Holste03,next}.

75:

76: Despite the ubiquity of genomic correlations, little is known about

77: their evolutionary origin. In this Letter, we address the question

78: whether the observed correlations can be explained quantitatively by a

79: biologically realistic ``minimal'' model of sequence evolution. We

80: take into account four well-known elementary evolutionary modes:

81: single-site mutations, duplications and deletions of existing segments of the

82: sequence, and insertions of random segments. The duplication

83: processes are believed to be a crucial mechanism of genome

84: growth~\cite{Goffeau04,Eichler02a,Hsieh03}; the length of the

85: duplicated segments ranges from single letters to thousands of letters

86: as in the case of gene duplications. The model is minimal in the sense

87: that all four elementary modes are {\em local} stochastic processes

88: compatible with {\em neutral evolution}, i.e., they do not require any

89: assumption of natural selection.  An alternative possible reason for

90: the observed correlations may be {\em long-range interactions} likely

91: to be caused by natural selection for a specific local GC-content. An

92: example of such a selective process is the clustering of genes in some

93: regions of a chromosome~\cite{Lercher03}, but no plausible

94: mechanism producing  long-range interactions has been

95: proposed so far.

96:

97: Li's original work showed that even a simple stochastic process

98: consisting of duplications and mutations of single letters leads to

99: generic power law correlations in the sequence

100: composition~\cite{Li91}. Here we analyze in detail the generalized

101: sequence evolution model introduced above. In particular, we calculate

102: the stationary two-point correlation function $C(r)$.  It is of power

103: law form, $C(r) \sim r^{-\alpha}$, with a decay exponent $\alpha$

104: depending on only two effective parameters, which are simple functions

105: of the rates of the elementary processes.  These long-range

106: correlations are generic as long as the rates of the processes result

107: in a growing sequence. At constant sequence length, however, the

108: stationary correlations in sequence composition vanish, and initial

109: correlations from a previous growth phase decay. Our analytic results

110: (which differ from Li's approximate expressions~\cite{Li91} and the

111: results of~\cite{Mansilla00}) are in excellent agreement with our

112: numerical simulations.  We use these results to infer from measured

113: values of $\alpha$ a lower bound on the growth rate of the genome,

114: which can be compared with independent estimates. The implications of

115: our findings on the evolution of mammalian genomes are discussed at

116: the end of this Letter.

117: %\medskip

118:

119: {\em Sequence evolution model.}---The stochastic evolution model

120: generates sequences $(s_1, \dots, s_N)$ of variable length $N$. For

121: simplicity, their letters are taken from a binary alphabet; $s_k =

122: \pm1$. (In the application to genomic systems, $s_k = +1$ denotes a

123: GC-pair and $s_k = -1$ an AT-pair at backbone position $k$.) The

124: elementary evolutionary steps are mutations, duplications, insertions,

125: and deletions of single letters (the generalization to segments will

126: be discussed below). They are Markov processes with rates $\mu$,

127: $\delta$, $\gamma^+$, and $\gamma^-$ acting on the sequences as

128: \begin{equation}

129: \label{rep_mut_processes}

130: (\cdots,s,s',\cdots)\to\left\{\begin{array}{l@{\;\;:\;\;}l}

131: (\cdots,-s,s',\cdots)&\mbox{rate }\mu \\

132: (\cdots,s,s,s',\cdots)&\mbox{rate }\delta \\

133: (\cdots,s,x,s',\cdots)&\mbox{rate }\gamma^+\\

134: (\cdots,s',\cdots)&\mbox{rate }\gamma^-,

135: \end{array}\right.

136: \end{equation}

137: where $x = \pm 1$ denotes a uniformly distributed random letter.

138: Duplication and insertion events introduce a new letter next to an

139: existing one and shift all subsequent letters one position to the

140: right, thereby increasing the sequence length by~1. Conversely,

141: deletions shorten the length by~1. This type of Markov evolution model

142: is widely used in computational biology, forming the statistical basis

143: of sequence alignment algorithms~\cite{Durbin98}. Running all four

144: processes over a time $t$ produces a statistical ensemble of

145: sequences; the corresponding averages are denoted by $\langle \dots

146: \rangle (t)$. This ensemble is characterized by the rates $\delta$,

147: $\mu$, $\gamma^+$, $\gamma^-$, and by the initial sequence. Here we

148: use sequences of length $1$ with a fixed letter, $(s_1)=1$, or a

149: random letter, $(s_1) = x$.

150:

151: After a time $t$, the sequences have an average length $\langle N

152: \rangle(t) = \exp(\lambda t)$ with the effective growth rate

153: \begin{equation}

154: \lambda = \delta+\gamma^+-\gamma^-.

155: \label{lambda}

156: \end{equation}

157: We are interested in two dynamical regimes, sequence growth from a

158: single-letter initial state (i.e., $\lambda > 0$) and the evolution of

159: sequences at stationary length $\langle N\rangle \gg 1$ (i.e.,

160: $\lambda = 0$), to which we now turn in order.
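
The exponential form of $\langle N\rangle(t)$ follows from a one-line
rate argument, sketched here under the assumption that each of the $N$
sites is independently subject to the single-letter processes of
Eq.~(\ref{rep_mut_processes}). Duplications and insertions each add one
letter, deletions remove one, and mutations leave $N$ unchanged, so that
\begin{equation}
\frac{d}{dt}\langle N\rangle(t)
=(\delta+\gamma^+-\gamma^-)\,\langle N\rangle(t)
=\lambda\,\langle N\rangle(t),
\qquad \langle N\rangle(0)=1,\nonumber
\end{equation}
which integrates to $\langle N\rangle(t)=\exp(\lambda t)$.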

161: %\medskip

162:

163: {\em Growth dynamics and stationary correlations.}---The composition

164: bias of the sequences at position $k$ is measured by the expectation

165: value $\langle s_k\rangle(t)$. It is easy to show that any initial

166: composition bias decays due to mutations and random insertions. We

167: note that each insertion can be regarded as a duplication with a

168: subsequent mutation in half of the cases, resulting in an effective

169: mutation rate

170: \begin{equation}

171: \mu_{\rm eff}=\mu+\gamma^+/2.

172: \label{mu_eff}

173: \end{equation}

174: We obtain $\langle s_k\rangle (t)\propto\exp(-2\mu_{\rm eff}t)$ for

175: fixed initial condition, while $\langle s_k\rangle (t)=0$ for random

176: initial conditions.
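
A heuristic way to see this decay rate, sketched here at the level of
mean values, considers the total composition $M=\sum_k s_k$ (a symbol
introduced only for this illustration). A mutation at site $k$ changes
$M$ by $-2s_k$, a duplication by $+s_k$, a deletion by $-s_k$, and a
random insertion by zero on average. With $\lambda$ and
$\mu_{\rm eff}$ as defined above, this gives
\begin{eqnarray}
\frac{d}{dt}\langle M\rangle(t)&=&
(\delta-\gamma^--2\mu)\,\langle M\rangle(t),\nonumber\\
\frac{\langle M\rangle(t)}{\langle N\rangle(t)}&\propto&
\exp[(\delta-\gamma^--2\mu-\lambda)\,t]
=\exp(-2\mu_{\rm eff}\,t).\nonumber
\end{eqnarray}
The ratio of means is used here in place of the mean of the ratio,
which is an approximation.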

177: The composition correlation $C(r) \equiv \langle s_k s_{k+r} \rangle

178: (t)$ between two sequence positions at distance $r$ is affected by all

179: four processes and is independent of the initial condition. Its

180: evolution equation can be derived by writing it as $C(r,t)=P_{\rm

181:   eq}(r,t)-P_{\rm op}(r,t)$, where $P_{\rm eq}(r,t)$ and $P_{\rm

182:   op}(r,t)$ denote the joint probabilities of finding two symbols of

183: equal and opposite signs, respectively, at a distance $r$. The Master

184: equation for $P_{\rm eq}(r,t)$ takes the form

185: \begin{eqnarray}

186: \frac{\partial}{\partial t} P_{\rm eq}(r,t)&=&

187: 2\mu_{\rm eff}\:[-P_{\rm eq}(r,t)+P_{\rm op}(r,t)]

188: \nonumber\\

189: &&+[r\delta+(r-1)\gamma^+]\:[P_{\rm eq}(r-1,t)-P_{\rm eq}(r,t)]

190: \nonumber\\

191: &&+r\gamma^-\:[P_{\rm eq}(r+1,t)-P_{\rm eq}(r,t)].

192: \end{eqnarray}

193: The first term on the r.h.s.~describes the change in $P_{\rm eq}(r,t)$

194: due to mutations and random insertions, while the second term

195: specifies the probability current due to duplication of a site in the

196: interval $(k,k+r-1)$ or insertion of a new site in the interval

197: $(k,k+r-2)$. The third term gives the corresponding current due to

198: deletions. By exchanging $P_{\rm eq}$ and $P_{\rm op}$, we obtain a

199: similar equation for $P_{\rm op}(r,t)$. Hence we have

200: \begin{eqnarray}

201: \label{master_equation_C}

202: \frac{\partial}{\partial t} C(r,t)&=&-4\mu_{\rm eff}\:C(r,t)\nonumber\\

203: &&+[r\delta+(r-1)\gamma^+]\:[C(r-1,t)-C(r,t)]\nonumber\\

204: &&+r\gamma^-\:[C(r+1,t)-C(r,t)].

205: \end{eqnarray}

206: For the special case with only single-letter duplications and

207: mutations ($\delta,\mu > 0$, $\gamma^+=\gamma^-=0$), which is

208: equivalent to Li's original model~\cite{Li91}, we find a simple

209: analytical form for the stationary $C(r)$ by solving the recursion

210: \begin{equation}

211: \label{recursion_C}

212: C(r)=\frac{r}{\alpha+r}\:C(r-1)\quad\mbox{with}

213: \quad\alpha=\frac{4\mu}{\delta}

214: \end{equation}

215: and the initial value $C(0)=1$. This gives

216: \begin{equation}

217: \label{C_l_result}

218: C(r)=\frac{\Gamma(r+1)\Gamma(1+\alpha)}{\Gamma(r+1+\alpha)}=\alpha\:B(r+1,\alpha),

219: \end{equation}

220: where $\Gamma(x)$ is the gamma function and $B(x,y)$ the beta

221: function.  Evaluating its asymptotic behavior for $x\gg1$,

222: \begin{equation}

223: B(x,y)\propto\Gamma(y)\:x^{-y}

224: \left[1-\frac{y(y-1)}{2x}\left(1+\mbox{O}

225: \left(\frac{1}{x}\right)\right)\right]\nonumber,

226: \end{equation}

227: then produces the algebraic decay $C(r)\propto r^{-\alpha}$. For the

228: general case including insertions and deletions, the asymptotic decay

229: can still be obtained exactly in the continuum limit. For $r \gg 1$

230: and $\delta > 0$, the difference equation~(\ref{master_equation_C})

231: becomes the differential equation

232: \begin{equation}

233: \label{C_l_indel_dgl}

234: \frac{\partial}{\partial t} C(r,t)=-4\mu_{\rm eff}C(r,t)-r\lambda

235: \frac{\partial}{\partial r} C(r,t)

236: \end{equation}

237: with the effective rates $\lambda$ and $\mu_{\rm eff}$ defined by

238: (\ref{lambda}) and (\ref{mu_eff}). This

239: has the stationary solution

240: \begin{equation}

241: \label{C_l_indel_asymptotics}

242: C(r)\propto r^{-\alpha}\quad\mbox{with}

243: \quad\alpha=\frac{4\mu_{\rm eff}}{\lambda}.

244: \end{equation}
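
The intermediate steps behind these two results can be retraced in a
few lines; the following is a sketch under the limits stated above.
Setting $\partial C/\partial t=0$ and $\gamma^+=\gamma^-=0$ in
Eq.~(\ref{master_equation_C}) gives
$(4\mu+r\delta)\,C(r)=r\delta\,C(r-1)$, which is the
recursion~(\ref{recursion_C}). Iterating from $C(0)=1$ yields
\begin{equation}
C(r)=\prod_{j=1}^{r}\frac{j}{j+\alpha}
=\frac{\Gamma(r+1)\Gamma(1+\alpha)}{\Gamma(r+1+\alpha)}.\nonumber
\end{equation}
For the general case and $r\gg1$, expanding
$C(r\pm1,t)\simeq C(r,t)\pm\partial C/\partial r$ turns the
duplication and insertion term of Eq.~(\ref{master_equation_C}) into
$-r(\delta+\gamma^+)\,\partial C/\partial r$ and the deletion term into
$+r\gamma^-\,\partial C/\partial r$, whose sum is
$-r\lambda\,\partial C/\partial r$. Inserting the ansatz
$C(r)=A\,r^{-\alpha}$, with an arbitrary amplitude $A$, into the
stationary form of Eq.~(\ref{C_l_indel_dgl}) then gives
\begin{equation}
0=-4\mu_{\rm eff}\,A\,r^{-\alpha}+\lambda\,\alpha\,A\,r^{-\alpha}
\quad\Longrightarrow\quad
\alpha=\frac{4\mu_{\rm eff}}{\lambda}.\nonumber
\end{equation}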

245: %\medskip

246:

247: Eq.~(\ref{C_l_indel_dgl}) clearly shows the mechanism generating

248: long-range correlations in this type of sequence evolution model.

249: Correlations are continuously produced at small scales by duplications

250: and transported to larger distances by the net exponential expansion

251: of the sequence (resulting from duplications and

252: insertions/deletions). On the other hand, correlations decay

253: exponentially due to processes randomizing the sequence (i.e.,

254: mutations and random insertions). The competition between expansion

255: and randomization produces the algebraic decay $C(r)\propto

256: r^{-\alpha}$, which is highly universal.  Microscopic details of the

257: evolution processes are irrelevant; the exponent $\alpha$ is

258: determined by a simple balance between the growth rate $\lambda$ and

259: the effective mutation rate $\mu_{\rm eff}$. Hence, an extended model

260: containing duplications, deletions and random insertions of sequence

261: {\em segments} of finite length $\ell= 1,2,\dots,\ell_{\rm max}$ with

262: respective rates $\delta_{\ell}$, $\gamma_{\ell}^-$, and $\gamma_{\ell}^+$

263: still has the same asymptotics~(\ref{C_l_indel_asymptotics}) for

264: $N(t)\gg\ell_{\rm max}$ and $r\gg\ell_{\rm max}$. The effective rates

265: (\ref{lambda}), (\ref{mu_eff}) are now given by

266: \begin{equation}

267: \label{full_effective_rates}

268: \lambda=\sum_{\ell}\ell\,\left[\delta_{\ell}+\gamma^+_{\ell}-\gamma^-_{\ell}\right],\quad\mu_{\rm eff}=\mu+\frac{1}{2}\sum_{\ell}\ell\gamma_{\ell}^+.

269: \end{equation}

270: This asymptotics can again be proved from an exact Master equation

271: similar to (\ref{master_equation_C})~\cite{next}.  The extended model

272: is important for genomic evolution since strong long-range

273: correlations (i.e., small values of $\alpha$) can be the combined

274: result of segment duplications with different values of $\ell$. Their

275: individual rates $\delta_\ell$ might be small and difficult to assess

276: but the cumulative rate $\lambda$ can still be estimated.

277: %\medskip

278:

279: \emph{Stationary-length dynamics and time-dependent

280:   correlations.}---It is obvious from Eq.~(\ref{C_l_indel_dgl}) that

281: stationary long-range correlations only exist as long as the sequence

282: grows, i.e., for $\lambda > 0$. Consider now the following evolutionary

283: scenario: sequence growth with rate $\lambda_1 > 0$ up to a length

284: $N_0 = N(t_0)$, followed by a second phase with $\lambda_2 = 0$ and

285: $\langle N \rangle(t) = N_0$ for $t > t_0$. The time-dependent

286: solution of Eq.~(\ref{C_l_indel_dgl}) for the asymptotics of $C(r,t)$ is

287: then

288: \begin{equation}

289: \label{C_l_theta_deletion}

290: C(r,t) = C(r,t_0)\:e^{-4\mu_{\rm eff}\Delta t}

291: \propto r^{-4 \mu_{\rm eff}/\lambda_1} e^{-4\mu_{\rm eff}\Delta

292: t}

293: \end{equation}

294: with $\Delta t=t-t_0>0$. In the second phase, the long-range tails of

295: $C(r,t)$ are preserved but their amplitude decays with a

296: characteristic time scale $\tau=(4\mu_{\rm eff})^{-1}$.
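
Equation~(\ref{C_l_theta_deletion}) can be viewed as the
$\lambda\to0$ case of a more general solution, sketched here for
illustration (the initial profile $C_0$ is a symbol introduced for this
purpose). For constant rates, Eq.~(\ref{C_l_indel_dgl}) is solved by
\begin{equation}
C(r,t_0+\Delta t)=e^{-4\mu_{\rm eff}\Delta t}\:
C_0\!\left(r\,e^{-\lambda\Delta t}\right),
\qquad C_0(r)\equiv C(r,t_0),\nonumber
\end{equation}
as can be checked by direct substitution. For $\lambda=0$ the argument
of $C_0$ is unchanged and only the exponential prefactor remains. For
$\lambda>0$ and $C_0(r)\propto r^{-\alpha}$ with
$\alpha=4\mu_{\rm eff}/\lambda$, the two factors cancel,
$e^{-4\mu_{\rm eff}\Delta t}\,(r\,e^{-\lambda\Delta t})^{-\alpha}
=r^{-\alpha}$, so that the power law is stationary during growth.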

297: %\medskip

298:

299: \emph{Numerical results.}---We have performed extensive Monte Carlo

300: simulations of our model. During each time step $\Delta

301: t=[(\mu+\sum\nolimits_{\ell}[\delta_{\ell}+\gamma_{\ell}^++\gamma_{\ell}^-])N(t)]^{-1}$

302: we choose a random site and apply one of the elementary processes with

303: its relative weight.
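
One way to read this update rule, given here as an illustrative sketch
rather than a specification of the simulation code (the symbols $R$ and
$p$ are introduced only for this purpose), is the following. With the
total rate per site
$R=\mu+\sum\nolimits_{\ell}[\delta_{\ell}+\gamma_{\ell}^++\gamma_{\ell}^-]$,
the time step is $\Delta t=[R\,N(t)]^{-1}$, and at the randomly chosen
site one process is applied with probability
\begin{equation}
p_{\rm mut}=\frac{\mu}{R},\quad
p_{{\rm dup},\ell}=\frac{\delta_{\ell}}{R},\quad
p_{{\rm ins},\ell}=\frac{\gamma_{\ell}^+}{R},\quad
p_{{\rm del},\ell}=\frac{\gamma_{\ell}^-}{R}.\nonumber
\end{equation}
These probabilities sum to one, and on average each site experiences
each process at its prescribed rate.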

304: \begin{figure} [t!]

305: \centering

306: \includegraphics[width=0.92\linewidth]{fig1a}

307: \includegraphics[width=0.92\linewidth]{fig1b}

308: \caption{Stationary $C(r)$ at different rates of the

309:   elementary processes. (a) Single-letter duplication-mutation model:

310:   Numerical results (circles) and the analytical

311:   form~(\ref{C_l_result}) (lines) for $\mu=1$, $\delta$ varying. (b)

312:   Full model: Numerical results (circles) with the analytic

313:   asymptotics~(\ref{C_l_indel_asymptotics})

314:   and~(\ref{full_effective_rates}) (lines) for $\mu=1$ and varying

315:   rates of the other processes (rates not specified in the plot are

316:   zero). The dynamics of the sequences was simulated until they

317:   reached a length of $N=2^{27}\approx10^8$; $C(r)$ was averaged over

318:   the sequence and over 100 runs.

319: \label{cor_eps}}

320: \end{figure}

321: For a single realization of this dynamics, the correlation function

322: $C(r)$ is well approximated by the sequence average $(N-r)^{-1}

323: \sum_{k=1}^{N-r} s_k s_{k+r}$. Further averaging over 100 realizations

324: produces very accurate measurements of $C(r)$.

325:

326: Fig.~\ref{cor_eps}(a) shows the numerical $C(r)$ for the single-letter

327: duplication-mutation dynamics with various rates, which is in

328: excellent agreement with the analytic expression~(\ref{C_l_result}).

329: The same is shown in Fig.~\ref{cor_eps}(b) for the general case with

330: all types of processes present, verifying the asymptotic

331: behavior~(\ref{C_l_indel_asymptotics})

332: and~(\ref{full_effective_rates}). For completeness, we have also

333: obtained power spectra and the mutual information function, as defined

334: in~\cite{Holste03}, which have the expected decay exponents $1

335: -\alpha$ and $2\alpha$, respectively.
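
These exponents can be motivated as follows; this is a sketch assuming
$0<\alpha<1$, unbiased composition $\langle s_k\rangle=0$, and the
standard definitions of both quantities. The power spectrum is the
Fourier transform of the correlation function, so that for small
frequencies $f$
\begin{equation}
S(f)=\sum_{r}C(r)\,e^{-2\pi i f r}\propto f^{-(1-\alpha)}.\nonumber
\end{equation}
For the mutual information, the joint probability of two letters $s$,
$s'$ at distance $r$ is $P(s,s')=[1+ss'C(r)]/4$, which gives
\begin{eqnarray}
I(r)&=&\frac{1}{2}\left[(1+C)\log_2(1+C)+(1-C)\log_2(1-C)\right]
\nonumber\\
&\simeq&\frac{C(r)^2}{2\ln 2}\propto r^{-2\alpha}
\qquad\mbox{for }|C(r)|\ll1,\nonumber
\end{eqnarray}
where $C$ is shorthand for $C(r)$.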

336:

337: \begin{figure} [t!]

338: \centering

339: \includegraphics[width=0.92\linewidth]{fig2a}

340: \includegraphics[width=0.92\linewidth]{fig2b}

341: \caption{Time-dependent correlations $C(r,t)$. (a) Build-up of long-range

342:   correlations by stationary growth. Measured $C(r,t)$ at various

343:   intermediate lengths $N(t)=10^2,10^4,10^6$ (symbols) together with

344:   the stationary form~(\ref{C_l_result}) (line) for $\mu=1$,

345:   $\delta_1=\delta=8$, all other parameters are zero. (b) Decay of

346:   correlations during sequence evolution at stationary length

347:   $N_0=10^6$. Measured $C(r,t)$ at various times $\Delta t$ (symbols)

348:   together with the analytic decay of the long-range tail given by

349:   Eq.~(\ref{C_l_theta_deletion}) (lines). Note that there are still

350:   correlations remaining on short length scales.

351: \label{growth_decay_eps}}

352: \end{figure}

353:

354: The dynamical build-up of these correlations for growing sequences is

355: seen in Fig.~\ref{growth_decay_eps}(a), which shows $C(r,t)$ at

356: various intermediate times of the growth process. The correlation

357: rapidly converges to the stationary form for all distances

358: $r\:\siml\:N(t)$.  This should be compared with the time-dependence of

359: $C(r,t)$ at constant length in Fig.~\ref{growth_decay_eps}(b), which

360: shows an algebraic tail with an exponentially decreasing amplitude as

361: predicted by Eq.~(\ref{C_l_theta_deletion}).

362: %\medskip

363:

364: {\em Genomic evolution.}---As pointed out above, the processes

365: discussed here constitute a minimal model for dynamically generated

366: long-range correlations along a sequence. But can this model explain

367: the observed correlations in genomic DNA?  The correlation function

368: $C(r)$ along human chromosomes shows a rather slow algebraic decay on

369: distance scales $10^3<r<10^6$ with typical effective exponents

370: $\alpha\approx0.1$~\cite{Bernaola02,Holste03}. We have confirmed these

371: measurements and found them to be consistent with sequence data from

372: other mammals~\cite{next}. A lower bound of the effective mutation

373: rate in mammals is $\mu_{\rm eff}\approx 2\cdot10^{-9}\rm a^{-1}$ per

374: site~\cite{Arndt04}.  Assuming stationary growth, we can use these

375: values of $\alpha$ and $\mu_{\rm eff}$ to derive a lower bound on the

376: genomic growth rate $\lambda$, resulting in a minimum value $\lambda

377: \approx 10^{-7}\rm a^{-1}$ per site according to

378: Eq.~(\ref{C_l_indel_asymptotics}).  However, this rate is much too

379: high: the genome would have expanded much faster than observed. The

380: current human genome contains $N\approx3\cdot 10^9$ base pairs;

381: extrapolating backwards at the above expansion rate, it would have

382: contained only about $4\cdot 10^5$ base pairs at the time of the mammalian

383: radiation about 90 million years ago. This can clearly be rejected

384: since approximately 40\% of the human genome can be aligned to the

385: mouse genome, representing most of the orthologous sequences that

386: remain in both lineages from the common ancestor~\cite{Mouse02}.
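
The numbers entering this estimate can be retraced explicitly. The
following is a back-of-the-envelope sketch using only the values quoted
above, with $T$ denoting the 90 million years since the mammalian
radiation and the growth rate rounded to $10^{-7}\rm a^{-1}$ in the
second line:
\begin{eqnarray}
\lambda&=&\frac{4\mu_{\rm eff}}{\alpha}
\approx\frac{4\cdot2\cdot10^{-9}\rm a^{-1}}{0.1}
=8\cdot10^{-8}\rm a^{-1}\approx10^{-7}\rm a^{-1},\nonumber\\
N\,e^{-\lambda T}&\approx&3\cdot10^{9}\cdot e^{-9}
\approx4\cdot10^{5},
\qquad\lambda T\approx10^{-7}\cdot9\cdot10^{7}=9.\nonumber
\end{eqnarray}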

387:

388: Over longer evolutionary periods, genomic expansion phases with rates

389: $\lambda \sim 10^{-7}\rm a^{-1}$ cannot be ruled out if we assume the

390: history of the genome has been a {\em punctuated} process, with such

391: expansion phases followed by periods of approximately constant length.

392: In the human genome, there is by now ample evidence for growth by

393: segmental duplications with various segment

394: lengths~\cite{Thomas04,Eichler02b}. In a punctuated growth process,

395: correlations are produced and transported during the expansion phases.

396: During the stationary phases, the previously established correlations

397: decay as given by Eq.~(\ref{C_l_theta_deletion}). In mammals, the last

398: period of rapid expansion has been the mammalian radiation, and the

399: characteristic time scale of the decay is $\tau\approx100$~Myr.
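
As a rough check using the mutation rate quoted above, and with
$T\approx90$~Myr denoting the time since the mammalian radiation,
\begin{equation}
\tau=\frac{1}{4\mu_{\rm eff}}
\approx\frac{1}{8\cdot10^{-9}\rm a^{-1}}
=1.25\cdot10^{8}\rm a,
\qquad e^{-T/\tau}\approx e^{-0.72}\approx0.5,\nonumber
\end{equation}
so that the amplitude of the long-range tail would have dropped by
only about a factor of two since then.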

400: Correlations present or generated at the time of the mammalian

401: radiation would hence still persist. The succession of several

402: distinct growth phases with different values of $\lambda$ and

403: $\mu_{\rm eff}$ could even explain correlations $C(r)$ with several

404: scaling regimes as found in human chromosomes~\cite{Bernaola02}. We

405: conclude that the correlations observed in mammals are compatible with

406: a punctuated expansion-randomization process. Of course, this does not

407: rule out other causes. Indeed, the rather diverse functional forms

408: found in different species may point towards more than one generating

409: mechanism. If genomic expansion proves to be a significant

410: contribution, composition correlations could be the ``background

411: radiation'' of genomics, allowing us to trace the history of genomes

412: far back in evolutionary time.

413:

414: \begin{references}

415:

416: \bibitem{Li92}

417: W. Li and K. Kaneko,

418: {\it Europhys. Lett. }{\bf 17}, 655 (1992).

419:

420: \bibitem{Peng92}

421: C.-K. Peng {\it et al.},

422: {\it Nature }(London) {\bf 356}, 168 (1992).

423:

424: \bibitem{Voss92}

425: R. F. Voss,

426: {\it Phys. Rev. Lett. }{\bf 68}, 3805 (1992).

427:

428: \bibitem{Karlin93}

429: S. Karlin and V. Brendel,

430: {\it Science }{\bf 259}, 677 (1993).

431:

432: \bibitem{Peng94}

433: C.-K. Peng {\it et al.},

434: {\it Phys. Rev. E }{\bf 49}, 1685 (1994).

435:

436: \bibitem{Arneodo95}

437: A. Arneodo, E. Bacry, P. V. Graves, and J. F. Muzy,

438: {\it Phys. Rev. Lett. }{\bf 74}, 3293 (1995).

439:

440: \bibitem{Li97}

441: W. Li,

442: {\it Comput. Chem. }{\bf 21}, 257 (1997).

443:

444: \bibitem{Vieira99}

445: M. de Sousa Vieira,

446: {\it Phys. Rev. E }{\bf 60}, 5932 (1999).

447:

448: \bibitem{Stanley99}

449: H. E. Stanley {\it et al.},

450: {\it Physica A }{\bf 273}, 1 (1999).

451:

452: \bibitem{Bernaola02}

453: P. Bernaola-Galvan, P. Carpena, R. Roman-Roldan, and J. L. Oliver,

454: {\it Gene }{\bf 300}, 105 (2002).

455:

456: \bibitem{Holste03}

457: D. Holste {\it et al.},

458: {\it Phys. Rev. E }{\bf 67}, 061913 (2003).

459:

460: %\bibitem{Isohata03}

461: %Y. Isohata and M. Hayashi,

462: %{\it J. Phys. Soc. Japan }{\bf 72}, 735 (2003).

463:

464: \bibitem{Ouyang04}

465: Z. Ouyang, C. Wang, and Z. S. She,

466: {\it Phys. Rev. Lett. }{\bf 93}, 078103 (2004).

467:

468: \bibitem{next}

469: P. Messer, M. L\"assig, and P. Arndt, to be published.

470:

471: \bibitem{Eichler02a}

472: R. V. Samonte and E. E. Eichler,

473: {\it Nat. Rev. Genet. }{\bf 3}, 65 (2002).

474:

475: \bibitem{Hsieh03}

476: L.-C. Hsieh, L. Luo, F. Ji, and H. C. Lee,

477: {\it Phys. Rev. Lett. }{\bf 90}, 018101 (2003).

478:

479: \bibitem{Goffeau04}

480: A. Goffeau,

481: {\it Nature }(London) {\bf 430}, 25 (2004).

482:

483: \bibitem{Lercher03}

484: M. J. Lercher, A. O. Urrutia, A. Pavlicek, and L. D. Hurst,

485: {\it Human Mol. Genetics }{\bf 12}, 2411 (2003).

486:

487: \bibitem{Li91}

488: W. Li,

489: {\it Phys. Rev. A }{\bf 43}, 5240 (1991).

490:

491: \bibitem{Mansilla00}

492: R. Mansilla and G. Cocho,

493: {\it Complex Systems }{\bf 12}, 207 (2000).

494:

495: \bibitem{Durbin98}

496: R. Durbin, S. Eddy, A. Krogh, and G. Mitchison,

497: {\it Biological Sequence Analysis }

498: (Cambridge University Press, Cambridge, England, 1998).

499:

500: \bibitem{Arndt04}

501: P. F. Arndt and T. Hwa,

502: {\it Bioinformatics }{\bf 20}, 1482 (2004).

503:

504: \bibitem{Mouse02}

505: Mouse Genome Sequencing Consortium,

506: {\it Nature }(London) {\bf 420}, 520 (2002).

507:

508: \bibitem{Eichler02b}

509: J. A. Bailey {\it et al.},

510: {\it Science }{\bf 297}, 1003 (2002).

511:

512: \bibitem{Thomas04}

513: E. E. Thomas {\it et al.},

514: {\it PNAS }{\bf 101}, 10349 (2004).

515:

516: \end{references}

517:

518: \end{document}

519:

520:

521:

522: