1: %\documentstyle[12pt,aps,prd,preprint]{revtex}
2:
3: \documentstyle[prl,floats,aps,twocolumn,epsf,graphicx]{revtex}
4: \begin{document}
5: \twocolumn[\hsize\textwidth\columnwidth\hsize\csname
6: @twocolumnfalse\endcsname
7:
8: %\begin{document}
9: \title{Phase-Transition in Binary Sequences with Long-Range Correlations}
10: \author{Shahar Hod$^{1,2}$ and Uri Keshet$^2$}
11: \address{$^1$The Racah Institute of Physics, The Hebrew University, Jerusalem 91904, Israel}
12: \address{}
13: \address{$^2$Department of Condensed Matter Physics, Weizmann Institute, Rehovot
14: 76100, Israel}
15: \date{\today}
16: \maketitle
17:
18: \begin{abstract}
19:
20: \ \ \ Motivated by novel results
21: in the theory of correlated sequences, we analyze the dynamics of random walks with long-term
22: memory (binary chains with long-range correlations).
23: In our model, the probability for a unit bit in a binary string depends on the
24: {\it fraction} of unities preceding it.
25: We show that the system undergoes a dynamical phase-transition from normal
diffusion, in which the variance $D(L)$ scales as the string's length $L$,
into a super-diffusion phase ($D(L) \sim L^{\alpha}$ with $\alpha>1$), when the correlation strength
28: exceeds a critical value.
29: We demonstrate the generality of our results with respect to alternative models, and discuss
30: their applicability to various data, such as
31: coarse-grained DNA sequences, written texts, and financial data.
32: \end{abstract}
33: \bigskip
34:
35: ]
36:
37: Dynamical systems with long-range spatial (and/or temporal) correlations
38: are attracting considerable interest across many disciplines.
39: They are identified in physical, biological, social, and economic sciences
(see, e.g., \cite{Man,Kan,Sta,Pro,Usa,Yan} and references therein).
41: Of particular interest are situations
42: in which the system can be mapped onto a mathematical object, such as
43: a correlated sequence of symbols, preserving the essential statistical properties
44: of the original system.
45:
46: One of the methods most frequently used to obtain insight into the nature of correlations in
47: a dynamical system consists of mapping the space of states onto two symbols \cite{Usa}.
48: Thus, the problem is reduced to the exploration of the statistical properties of correlated
49: binary chains.
50: This can also be viewed as the analysis of a history-dependent random walk.
The random walk is one of the most ubiquitous concepts of statistical physics, with
applications in numerous scientific fields (see, e.g., \cite{BaNi,Kam,FeFrSo,Wei,AvHa,DiDa,Hod} and
references therein).
54:
55: It is well established that the statistical properties of
56: coarse-grained DNA strings and written texts significantly deviate from those of purely
57: random sequences \cite{Kan,Sch}. Financial data (such as stock market quotes) are similarly
far from being purely diffusive. Moreover, these systems exhibit
59: ``super-diffusive'' behavior in the sense that the variance $D(L)$ grows asymptotically {\it faster} than $L$
60: (where $L$ is the length of the considered text). Specifically,
61: $D \sim L^{\alpha}$, with $\alpha > 1$ \cite{Usa}.
62: Such a remarkable (and essentially universal) phenomenon can be attributed
63: to long-range positive correlations.
64: Systems with such correlations may be anticipated to exhibit a dynamical phase transition
(from normal to super-diffusive behavior) at some critical correlation strength.
66:
Thus, the problem of a random walk in which the jump probabilities are history-dependent is
68: of great interest for understanding the behavior of systems with long-range correlations, such
69: as DNA strings, written texts, and financial data.
70: The aim of the present Letter is to analyze this problem, and to provide a simple yet generic
71: {\it analytical} description of the statistical properties of these systems.
72:
73: We begin by solving a simple model which incorporates long-range correlations into an otherwise random
sequence. We consider a discrete binary string of symbols, $a_i \in \{0,1\}$, in which
75: the conditional probability of a given symbol (say, a unit bit) occurring at the position $L+1$
76: is {\it history-dependent}, and given by
77:
78: \begin{equation}\label{Eq1}
79: p(k,L)={1 \over 2}\Big(1-\mu {{L-2k} \over {L+L_0}}\Big)\ ,
80: \end{equation}
81: where $k$ is the number of such symbols (unities) appearing in the preceding $L$ bits.
82: The correlation parameter $\mu$, where $-1< \mu < 1$,
83: determines the strength of correlations in the system.
84: The persistence condition $\mu>0$ implies that a given symbol
85: in the preceding sequence promotes the birth of a new identical symbol.
86: On the other hand, in the anti-persistence region $\mu < 0$, each
87: symbol inhibits the appearance of a new identical symbol.
88: The parameter $L_0>0$ is a constant transient time. For $L \ll L_0$ the sequence is approximately
89: random (uncorrelated), whereas for $L \gg L_0$ the effect of correlations takes over \cite{Note1}.
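
As an illustration of Eq. (\ref{Eq1}) (the numerical values below are arbitrary and chosen only for this example), it is convenient to write the rule in terms of the excess of unities over zeroes, $2k-L$:

$$
p(k,L)={1 \over 2}\Big(1+\mu {{2k-L} \over {L+L_0}}\Big)\ .
$$
For $\mu=0.8$, $L_0=100$, $L=900$, and $k=600$ one has $2k-L=300$, so that
$p={1 \over 2}(1+0.8 \times 0.3)=0.62$: an excess of unities in the history raises the
probability of a further unity. Since $|2k-L| \leq L < L+L_0$ and $|\mu|<1$, this expression
always satisfies $0<p<1$.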
90:
91: In this model, the conditional probability $p(k,L;\mu,L_0)$ depends on the {\it fraction}
92: of unities (or zeroes) in the preceding bits, and is independent of their arrangement.
93: This allows one to obtain an {\it analytical} description of the system's dynamical behavior.
94: As we shall demonstrate below, this simple model provides a good quantitative
95: description of the observed statistical properties of various natural systems, such as
96: coarse-grained DNA strings, written texts, and financial data.
97:
98: The probability $P(k,L+1)$ of finding $k$ identical symbols (say, unities)
99: in a sequence of length $L+1$ follows the evolution equation
100:
101: \begin{eqnarray}\label{Eq2}
102: P(k,L+1) & = &[1-p(k,L)]P(k,L) \nonumber \\
103: && +p(k-1,L)P(k-1,L)\ .
104: \end{eqnarray}
105: Crossing to the continuous limit, one obtains the
106: Fokker-Planck diffusion equation for the correlated process
107:
108: \begin{equation}\label{Eq3}
109: {{\partial P} \over {\partial L}}={1 \over 2} {{{\partial^2 P} \over {\partial x}^2}}
110: -{{\mu} \over {L+L_0}}{{\partial(xP)} \over \partial x}\ ,
111: \end{equation}
where $x \equiv 2k-L$. The evolution equation (\ref{Eq3}), along with the
initial condition $P(x,L=0)=\delta(x)$, has a solution in the
114: form of a Gaussian distribution
115:
116: \begin{equation}\label{Eq4}
117: P(x,L)={1 \over {\sqrt{{2\pi D(L)}}}} \exp\Big[-{{x^2} \over {2D(L)}}\Big]\ ,
118: \end{equation}
119: where the variance $D(L)$ is given by
120:
121: \begin{equation}\label{Eq5}
122: D(L;\mu,L_0)={{L+L_0} \over {1-2\mu}} \Big[ 1 -{\Big({{L_0} \over {L+L_0}}\Big)}^{1-2\mu}\Big]\ .
123: \end{equation}
Equation (\ref{Eq5})
is singular in the special case $\mu= {1 \over 2}$, for which the variance is given by
126:
127: \begin{equation}\label{Eq6}
128: D(L;\mu_c,L_0)=(L+L_0)\ln\Big({{L+L_0} \over {L_0}}\Big)\ .
129: \end{equation}
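
The variance can be recovered from Eq. (\ref{Eq3}) by a short moment calculation, sketched here
under the assumption that $P$ and its derivatives vanish sufficiently fast as $|x| \to \infty$.
Multiplying Eq. (\ref{Eq3}) by $x^2$ and integrating by parts over $x$ gives, for
$D(L)=\langle x^2 \rangle$ (the mean vanishes by symmetry),

$$
{{dD} \over {dL}}=1+{{2\mu} \over {L+L_0}}D\ ,\ \ \ D(0)=0\ .
$$
The integrating factor $(L+L_0)^{-2\mu}$ turns this into
$d[(L+L_0)^{-2\mu}D]/dL=(L+L_0)^{-2\mu}$, whose integral from $0$ to $L$ reproduces
Eq. (\ref{Eq5}). Two checks follow directly. For $\mu=0$, Eq. (\ref{Eq5}) reduces to the
uncorrelated result $D=L$. Writing $\epsilon \equiv 1-2\mu$ (a shorthand used only here), one has

$$
\lim_{\epsilon \to 0}{1 \over \epsilon}\Big[1-{\Big({{L_0} \over {L+L_0}}\Big)}^{\epsilon}\Big]
=\ln\Big({{L+L_0} \over {L_0}}\Big)\ ,
$$
which yields Eq. (\ref{Eq6}).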
130:
131: Remarkably, one finds that the correlated system undergoes a dynamical phase transition
132: at the critical correlation strength $\mu_c \equiv {1 \over 2}$.
133: The variance $D(L)$ of the correlated sequence has three qualitatively different
134: asymptotic behaviors (in the $L \gg L_0$ limit)
135:
136: \begin{equation}\label{Eq7}
137: D(L) \simeq \cases{
138: (1-2\mu)^{-1}L & $\mu<\mu_c$\ ; \cr
139: L \ln (L/L_0) & $\mu=\mu_c$\ ; \cr
140: (2\mu-1)^{-1}{L_0}^{1-2\mu}L^{2\mu} & $\mu>\mu_c$\ . \cr }
141: \end{equation}
142: Thus, for $\mu < \mu_c$ the asymptotic variance scales linearly with the string length,
143: whereas for a history-dependent chain with strong positive correlations ($\mu > \mu_c$) the system is
144: characterized by a super-diffusion phase, in which case $D(L)$ grows asymptotically
145: faster than $L$ \cite{Note2}.
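
The regimes of Eq. (\ref{Eq7}) follow from Eq. (\ref{Eq5}) according to the sign of the
exponent $1-2\mu$; the following is a brief sketch of the limiting procedure.
For $\mu<\mu_c$ the term $[L_0/(L+L_0)]^{1-2\mu}$ vanishes as $L/L_0 \to \infty$, leaving
$D \simeq (1-2\mu)^{-1}L$. For $\mu>\mu_c$ the same term grows without bound and dominates.
Rewriting Eq. (\ref{Eq5}) as

$$
D(L)={{L+L_0} \over {2\mu-1}} \Big[{\Big({{L+L_0} \over {L_0}}\Big)}^{2\mu-1}-1\Big]
$$
makes the leading behavior $(2\mu-1)^{-1}{L_0}^{1-2\mu}L^{2\mu}$ explicit. Since $\mu<1$,
the exponent $2\mu$ remains below the ballistic value $2$.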
146:
147: The analytical model can readily be extended to encompass situations in which the binary sequence
148: is {\it biased}. Let
149:
150: \begin{equation}\label{Eq8}
151: p(k,L)={1 \over 2}\Big(1+q-\mu {{L-2k} \over {L+L_0}}\Big)\ ,
152: \end{equation}
153: with $-1<q<1$. The distribution $P(x,L)$ corresponding to this conditional
154: probability is given by a Gaussian function, centered about the position
155:
156: \begin{equation}\label{Eq9}
157: x_c(L)={{q} \over {1-\mu({{L} \over {L+L_0}})}}L\ .
158: \end{equation}
Thus, the drift velocity approaches the constant asymptotic value ${{q} \over {1-\mu}}$.
The variance $D(L)$ is unaltered by the bias, and is given by Eqs. (\ref{Eq5}) and (\ref{Eq6}).
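
These statements can be checked with a moment calculation, sketched here under the assumption
that the bias simply adds a constant $q$ to the drift term of Eq. (\ref{Eq3}) while the
diffusion term is left unchanged. The first two moments then obey

$$
{{d\langle x \rangle} \over {dL}}=q+{{\mu} \over {L+L_0}}\langle x \rangle\ ,\ \ \
{{d\langle x^2 \rangle} \over {dL}}=1+2q\langle x \rangle+{{2\mu} \over {L+L_0}}\langle x^2 \rangle\ .
$$
Inserting the large-$L$ ansatz $\langle x \rangle \simeq vL$ into the first equation gives
$v=q+\mu v$, i.e., $v=q/(1-\mu)$. Subtracting $d\langle x \rangle^2/dL$ from the second
equation cancels every term containing $q$, so that
$D=\langle x^2 \rangle-\langle x \rangle^2$ obeys
$dD/dL=1+2\mu D/(L+L_0)$, exactly as in the unbiased case.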
161:
162: In order to confirm the analytical results, we perform numerical simulations of (discrete)
163: binary sequences. Figure \ref{Fig1} displays the resulting scaled variance $L^{-1}D(L)$ of
correlated strings with various values of the correlation parameter $\mu$.
We find excellent agreement between the analytically predicted results [see Eqs. (\ref{Eq5}) and
166: (\ref{Eq6})] and the numerical ones.
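
For concreteness, one possible way to estimate the variance from simulated strings (an
assumption made here for illustration; $N$, $k_n$, and $x_n$ are symbols introduced only for this
sketch) is to generate $N$ independent realizations, drawing each new bit with the
probability of Eq. (\ref{Eq1}), to record $x_n(L)=2k_n(L)-L$ for the $n$-th realization, and to form

$$
D(L) \approx {1 \over N}\sum_{n=1}^{N}x_n^2(L)-\Big[{1 \over N}\sum_{n=1}^{N}x_n(L)\Big]^2\ .
$$
For a Gaussian distribution the relative statistical error of such an estimate is
approximately $\sqrt{2/N}$, so that an accuracy at the percent level requires
$N$ of order $10^4$ or more.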
167:
168: \begin{figure}[tbh]
169: \centerline{\epsfxsize=9cm \epsfbox{perfig1.eps}}
170: \caption{The scaled variance $L^{-1}D(L)$ as a function of the string length $L$.
171: We present results for $\mu=-0.8, -0.4, 0, 0.2, 0.5, 0.8$, and $0.9$ (from bottom to top),
172: with $L_0=100$.
173: The numerically computed asymptotic slopes agree with the analytical predictions [see Eqs.
174: (\ref{Eq5}) and (\ref{Eq6})] to within less than $1\%$.}
175: \label{Fig1}
176: \end{figure}
177:
178: {\it Robustness of the linear model.--}
179: In order to show the generality of the model discussed above, we consider situations in which
180: the (history-dependent) jump probability is an arbitrary odd function \cite{Note3} of
181: the fraction $\xi \equiv {x \over {L+L_0}}$ of unities (zeroes) that appeared in the previous $L$ symbols
182:
183: \begin{equation}\label{Eq10}
184: p(x,L)={1 \over 2}[1+\mu F(\xi)]\ .
185: \end{equation}
186: For asymptotically large $L$, one always finds $\xi\to 0$ for non-ballistic diffusion,
187: justifying a power-law expansion of $F(\xi)$.
188: As long as this expansion includes a linear term, the original differential equation (\ref{Eq3})
189: is recovered for large $L$. We therefore expect the previous analytical results
190: [Eqs. (\ref{Eq5}) and (\ref{Eq6})] to hold true for generic ({\it non}-linear) models as well.
The generality of the model is illustrated in Fig. \ref{Fig2}, in which we depict
192: results for various choices of the probability function $F(\xi)$. As predicted, the results
193: are found to agree with the linear model.
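
As an illustration of this argument (a sketch; the coefficient $c_3$ is a symbol introduced only
here), consider an expansion whose linear coefficient is normalized to unity,
$F(\xi)=\xi+c_3\xi^3+O(\xi^5)$. For instance,

$$
{2 \over \pi}\sin\Big({\pi \over 2}\xi\Big)=\xi-{{\pi^2} \over {24}}\xi^3+\ldots\ ,\ \ \
\tanh(\xi)=\xi-{1 \over 3}\xi^3+\ldots\ .
$$
The typical magnitude of $\xi$ is $\sqrt{D(L)}/(L+L_0)$, which by Eq. (\ref{Eq7}) scales as
$L^{-1/2}$ for $\mu<\mu_c$ and as $L^{\mu-1}$ for $\mu>\mu_c$. Both vanish as $L \to \infty$
because $\mu<1$, so the cubic term is suppressed relative to the linear one by a factor of
order $\xi^2 \to 0$.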
194:
195: \begin{figure}[tbh]
196: \centerline{\epsfxsize=9cm \epsfbox{perfig2.eps}}
197: \caption{The scaled variance $L^{-1}D(L)$ for three different forms of the
198: function $F(\xi)$: $\xi$, ${2 \over \pi}\sin({\pi \over 2}\xi)$, and $\tanh(\xi)$.
We present results for $\mu=-0.8$ and $\mu=0.8$, with $L_0=100$. The different curves are
200: almost indistinguishable.}
201: \label{Fig2}
202: \end{figure}
203:
204: {\it Applications.--}
205: The robustness of the linear model (see Fig. \ref{Fig2}) suggests
206: that it may capture the essence of the
207: underlying correlations in a diversity of systems in nature.
208: We therefore examine the use of the results derived in the present work as an analytical explanation for the
209: observed statistical properties of natural systems, such as
210: DNA strings, written texts, and financial data.
211:
212: As mentioned, it is well established that these systems often exhibit a significant
213: deviation from random sequences \cite{Kan,Sch}, and are characterized by a
214: ``super-diffusive'' behavior in which $D \sim L^{\alpha}$, with $\alpha > 1$ \cite{Usa}. In such
215: systems, super-diffusion may be attributed to long-range (positive) correlations. In fact,
216: the analytical model allows one to determine the correlation strength of these chains.
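
In the super-diffusive regime this determination takes a particularly simple asymptotic form.
As a sketch based on Eq. (\ref{Eq7}), for $\mu>\mu_c$ and $L \gg L_0$ one has

$$
\ln[L^{-1}D(L)] \simeq (2\mu-1)\ln L+{\rm const.}\ ,
$$
so that the asymptotic exponent is $\alpha=2\mu$, and the slope of the scaled variance on a
log-log plot is $2\mu-1$. For example, a measured exponent $\alpha=1.4$ (an illustrative value,
not taken from the data) would correspond to $\mu=0.7$. At finite $L$ the full expression
(\ref{Eq5}), with $L_0$ as a second parameter, supplies the corrections to this limit.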
217:
218: Figure \ref{Fig3} depicts the scaled variance $L^{-1}D(L)$ calculated from DNA sequences of
219: various organisms, as a function of the string length $L$.
It is of considerable interest to examine with this method the statistical
properties characterizing the DNA of organisms at various evolutionary levels:
222: Bacillus subtilis ({\it Bacteria}), Methanosarcina acetivorans ({\it Archaea}),
223: and Drosophila melanogaster ({\it Eukarya}) \cite{Usa,DNAs}.
224: The theoretical model provides a good description of the
225: empirical data \cite{Note4}, attributing different correlation strengths $\mu$ to different organisms, as
226: summarized in Table \ref{Tab1}.
227:
The super-diffusive behavior, shown in Fig. \ref{Fig3} to persist across very long sequences, is highly suggestive
of {\it long}-range correlations extending over {\it more} than one
gene (e.g., $\sim 5 \times 10^4$ base-pairs in Drosophila).
231:
232: Next, we have applied the results of the analytical model to various coarse-grained written
233: texts \cite{Kan,Sch,Usa}. It has long been recognized that the corresponding binary strings are highly
234: self-correlated. The present analytical model enables one to determine quantitatively the strength of these
235: inner correlations; see Table \ref{Tab1}.
236:
237: \begin{figure}[tbh]
238: \centerline{\epsfxsize=9cm \epsfbox{perfig3.eps}}
239: \caption{The scaled variance $L^{-1}D(L)$ as a function of the string length $L$,
240: for coarse-grained DNA sequences of various organisms.
The mapping and parameters used are given in Table \ref{Tab1}.
242: Theoretical results [see Eq. (\ref{Eq5})] are represented by curves.}
243: \label{Fig3}
244: \end{figure}
245:
246: \begin{table}
247: \caption{The correlation strength parameter $\mu$ for various binary strings.
We use the following mappings: $\{A,G\} \to 0$, $\{C,T\} \to 1$ for DNA sequences \cite{Usa,DNAs};
(a to m) $\to 0$, (n to z) $\to 1$ for written texts \cite{Usa}; and daily fall $\to 0$, daily rise
$\to 1$ for stock market quotes \cite{Djia}.}
251: \label{Tab1}
252: \begin{tabular}{llc}
253: Data Type & String Source & $\mu$ \\
254: \tableline
255: DNA sequences
256: & Drosophila melanogaster & $0.57$ \\
257: & Methanosarcina acetivorans & $0.70$ \\
258: & Bacillus subtilis & $0.86$\\
259: Written texts
& Alice's Adventures in Wonderland & $0.58$ \\
261: & The Holy Bible in English & $0.84$ \\
262: & Works on computer science & $0.88$ \\
263: Stock markets
& NASDAQ & $0.39$ \\
& DJIA & $0.76$ \\
266: \end{tabular}
267: \end{table}
268:
269: In Figure \ref{Fig4} we show the scaled variance of coarse-grained
270: financial data (daily quotes of the Dow Jones Industrial Average, and the NASDAQ \cite{Djia}).
271: We note that the linear model underestimates the
272: empirical variance at {\it short} time scales. This fact can be traced back to
short-term correlations in the markets. (It is interesting to note that the DJIA maintains
an approximately normal diffusive behavior for a period of about one month.)
275: However, this short-term memory is washed out at longer time
276: scales, in which case the analytical model provides a good description of the
277: empirical results, as evident from Fig. \ref{Fig4}. The corresponding values of the
278: correlation parameter $\mu$ are summarized in Table \ref{Tab1}.
279:
280: \begin{figure}[tbh]
281: \centerline{\epsfxsize=9cm \epsfbox{perfig4.eps}}
282: \caption{The scaled variance $L^{-1}D(L)$ as a function of the sequence length $L$,
for coarse-grained financial data: DJIA and NASDAQ daily quotes \cite{Djia}.
The mapping and parameters used are given in Table \ref{Tab1}. Theoretical results
285: [see Eq. (\ref{Eq5})] are represented by curves.}
286: \label{Fig4}
287: \end{figure}
288:
289: In summary, in this Letter we have analyzed the dynamics of random walks with
290: {\it history-dependent} jump probabilities.
291: Our work was motivated not only by the intrinsic interest in such dynamical
292: processes, but also by the flurry of activity in the field of long-range
293: correlated systems, and by some universal statistical features observed in many
294: different natural systems.
295:
296: We have broadened the study of binary strings to include long-range
297: correlations, extending throughout the length of the chain.
298: Using a simple and exactly solvable model, we identify a dynamical phase
299: transition, from normal diffusion [$D(L) \sim L$] to super-diffusive
300: behavior [$D(L) \sim L^{2 \mu}$], taking place as the correlation parameter $\mu$
301: exceeds its critical value.
302: We show that in spite of the simplicity of the model, it is robust, and can
303: easily be extended to describe various features (such as a biased history-dependent random
304: walk or sub-diffusion).
305:
306: Next, we have applied the analytical results of the model to various binary strings, extracted
307: from very different natural systems, such as
308: coarse-grained DNA sequences, written texts, and financial data.
309: We find that the model adequately describes the long-term behavior of these systems.
310: Furthermore, the model provides a straightforward method to measure the
311: correlation strength of these systems.
312: Our results can be applied to various natural systems, and may shed light on the
313: underlying rules governing their dynamics.
314: For example, the super-diffusive behavior of DNA sequences (see Fig. \ref{Fig3}) suggests
315: long-range correlations extending across more than one gene. The model attributes
316: different correlation strengths to different organisms.
317:
318: \bigskip
319: \noindent
320: {\bf ACKNOWLEDGMENTS}
321: \bigskip
322:
SH acknowledges support by the Dr. Robert G. Picard fund in physics.
We would like to thank Oded Agam, Yitzhak Pilpel, Eli Keshet, Ilana Keshet,
Clovis Hopman, Eros Mariani, Assaf Pe'er, Oded Hod, and Ehud Nakar for helpful discussions.
We thank O. V. Usatenko and V. A. Yampol'skii for providing us with their data.
327: This research was supported by grant 159/99-3 from the Israel Science Foundation.
328:
329: \begin{thebibliography}{99}
330:
331: \bibitem{Man} R. N. Mantegna and H. E. Stanley, Nature (London) {\bf 376}, 46 (1995).
332:
333: \bibitem{Kan} I. Kanter and D. F. Kessler, Phys. Rev. Lett. {\bf 74}, 4559 (1995).
334:
\bibitem{Sta} H. E. Stanley {\it et al.}, Physica (Amsterdam) {\bf 224A}, 302 (1996).
336:
337: \bibitem{Pro} A. Provata and Y. Almirantis, Physica (Amsterdam) {\bf 247A}, 482 (1997).
338:
\bibitem{Usa} O. V. Usatenko and V. A. Yampol'skii, Phys. Rev. Lett. {\bf 90}, 110601 (2003).
340:
341: \bibitem{Yan} A. C. C. Yang, S. S. Hseu, H. W. Yien, A. L. Goldberger, and C. K. Peng,
342: Phys. Rev. Lett. {\bf 90}, 108103 (2003).
343:
344: \bibitem{BaNi} M. N. Barber and B. W. Ninham, {\it Random and Restricted Walks} (Gordon
345: and Breach, New York, 1970).
346:
347: \bibitem{Kam} N. G. van Kampen, {\it Stochastic Processes in Physics and
348: Chemistry} (North-Holland, Amsterdam, 1992).
349:
\bibitem{FeFrSo} R. Fern\'andez, J. Fr\"ohlich, and A. D. Sokal, {\it Random Walks, Critical
351: Phenomena, and Triviality in Quantum Field Theory} (Springer Verlag, Berlin, 1992).
352:
353: \bibitem{Wei} G. H. Weiss, {\it Aspects and Applications of the Random Walk} (North
354: Holland, Amsterdam, 1994).
355:
356: \bibitem{AvHa} D. ben-Avraham and S. Havlin, {\it Diffusion and Reactions in Fractals and
357: Disordered Systems} (Cambridge University Press, Cambridge, 2000).
358:
\bibitem{DiDa} R. Dickman and D. ben-Avraham, Phys. Rev. E {\bf 64}, 020102(R) (2001).
360:
361: \bibitem{Hod} S. Hod, Phys. Rev. Lett. {\bf 90}, 128701 (2003).
362:
363: \bibitem{Sch} A. Schenkel, J. Zhang, and Y. C. Zhang, Fractals {\bf 1}, 47 (1993).
364:
365: \bibitem{Note1} The introduction of the parameter $L_0$ is mainly motivated by the observed
366: behavior of the variance of DNA sequences, written texts, and financial data.
367: These systems are characterized by normal diffusion [$D(L) \sim L$] for small $L$
368: values, and by a super-diffusive behavior [$D(L) \sim L^{\alpha}$, with $\alpha>1$]
369: for large $L$ values.
370:
371: \bibitem{Note2} The model may be broadened to describe sub-diffusive behavior as well, by considering
372: the conditional probability $p(k,L)=f\{{1 \over 2}[1-\mu{{L-2k} \over
373: {(L+L_0)^{1-m}}}]\}$, where $f(u) \equiv u \Theta(u) -(u-1) \Theta(u-1)$ and $\Theta(u)$
374: is the Heaviside step-function.
This yields, for $L \gg L_0$, $m>0$, and $\mu<0$, a Gaussian distribution of
variance $D(L) \sim L^{1-m}$.
377:
378: \bibitem{Note3} For the probability distribution $P(x,L)$ to be an even function of $x$ (and thus $\langle x \rangle =0$),
379: the function $F(\xi)$ should be an odd function of its argument.
380:
381: \bibitem{DNAs} DNA sequences of various organisms were obtained from ftp://ftp.ncbi.nih.gov/genomes.
382:
383: \bibitem{Note4} We have verified that for the DNA mapping used ($\{A,G\} \to 0$, $\{C,T\} \to 1$), the
distribution $P(x,L={\rm const.})$ is well approximated by a Gaussian. The alternative mappings yield
385: a broader distribution ($\{T,G\} \to 0$) or a large asymmetry ($\{C,G\} \to 0$).
386:
387: \bibitem{Djia} Financial data for the DJIA and NASDAQ stock markets are quoted from http://finance.yahoo.com.
388:
389: \end{thebibliography}
390:
391: \end{document}