1: %\documentstyle[12pt,aps,prd,preprint]{revtex}
2: 
3: \documentstyle[prl,floats,aps,twocolumn,epsf,graphicx]{revtex}
4: \begin{document}
5: \twocolumn[\hsize\textwidth\columnwidth\hsize\csname
6: @twocolumnfalse\endcsname
7: 
8: %\begin{document}
9: \title{Phase-Transition in Binary Sequences with Long-Range Correlations}
10: \author{Shahar Hod$^{1,2}$ and Uri Keshet$^2$}
11: \address{$^1$The Racah Institute of Physics, The Hebrew University, Jerusalem 91904, Israel}
12: \address{}
13: \address{$^2$Department of Condensed Matter Physics, Weizmann Institute, Rehovot
14:  76100, Israel}
15: \date{\today}
16: \maketitle
17: 
18: \begin{abstract}
19: 
20: \ \ \ Motivated by novel results 
21: in the theory of correlated sequences, we analyze the dynamics of random walks with long-term 
22: memory (binary chains with long-range correlations). 
23: In our model, the probability for a unit bit in a binary string depends on the 
24: {\it fraction} of unities preceding it. 
25: We show that the system undergoes a dynamical phase-transition from normal 
26: diffusion, in which the variance $D_L$ scales as the string's length $L$, 
27: into a super-diffusion phase ($D_L \sim L^{1+|\alpha|}$), when the correlation strength 
28: exceeds a critical value.
29: We demonstrate the generality of our results with respect to alternative models, and discuss 
30: their applicability to various data, such as 
31: coarse-grained DNA sequences, written texts, and financial data. 
32: \end{abstract}
33: \bigskip
34: 
35: ]
36: 
37: Dynamical systems with long-range spatial (and/or temporal) correlations 
38: are attracting considerable interest across many disciplines. 
39: They are identified in physical, biological, social, and economic sciences 
40: (see, e.g., \cite{Man,Kan,Sta,Pro,Usa,Yan} and references therein). 
41: Of particular interest are situations 
42: in which the system can be mapped onto a mathematical object, such as 
43: a correlated sequence of symbols, preserving the essential statistical properties 
44: of the original system. 
45: 
46: One of the methods most frequently used to obtain insight into the nature of correlations in 
47: a dynamical system consists of mapping the space of states onto two symbols \cite{Usa}. 
48: Thus, the problem is reduced to the exploration of the statistical properties of correlated 
49: binary chains. 
50: This can also be viewed as the analysis of a history-dependent random walk. 
51: The random walk is one of the most ubiquitous concepts in statistical physics, with 
52: applications in numerous scientific fields (see, e.g., \cite{BaNi,Kam,FeFrSo,Wei,AvHa,DiDa,Hod} and 
53: references therein). 
54: 
55: It is well established that the statistical properties of 
56: coarse-grained DNA strings and written texts significantly deviate from those of purely 
57: random sequences \cite{Kan,Sch}. Financial data (such as stock market quotes) are similarly 
58: far from being purely diffusive. Moreover, these systems exhibit 
59: ``super-diffusive'' behavior in the sense that the variance $D(L)$ grows asymptotically {\it faster} than $L$ 
60: (where $L$ is the length of the considered text). Specifically, 
61: $D \sim L^{\alpha}$, with $\alpha > 1$ \cite{Usa}. 
62: Such a remarkable (and essentially universal) phenomenon can be attributed 
63: to long-range positive correlations. 
64: Systems with such correlations may be anticipated to exhibit a dynamical phase transition 
65: (from normal to super-diffusive behavior) at some critical correlation strength. 
66: 
67: Thus, the problem of a random walk whose jump probabilities are history-dependent is 
68: of great interest for understanding the behavior of systems with long-range correlations, such 
69: as DNA strings, written texts, and financial data. 
70: The aim of the present Letter is to analyze this problem, and to provide a simple yet generic 
71: {\it analytical} description of the statistical properties of these systems.
72: 
73: We begin by solving a simple model which incorporates long-range correlations into an otherwise random 
74: sequence. We consider a discrete binary string of symbols, $a_i \in \{0,1\}$, in which 
75: the conditional probability of a given symbol (say, a unit bit) occurring at the position $L+1$ 
76: is {\it history-dependent}, and given by 
77: 
78: \begin{equation}\label{Eq1}
79: p(k,L)={1 \over 2}\Big(1-\mu {{L-2k} \over {L+L_0}}\Big)\  ,
80: \end{equation}
81: where $k$ is the number of such symbols (unities) appearing in the preceding $L$ bits. 
82: The correlation parameter $\mu$, where $-1< \mu < 1$, 
83: determines the strength of correlations in the system. 
84: The persistence condition $\mu>0$ implies that a given symbol 
85: in the preceding sequence promotes the birth of a new identical symbol. 
86: On the other hand, in the anti-persistence region $\mu < 0$, each 
87: symbol inhibits the appearance of a new identical symbol. 
88: The parameter $L_0>0$ is a constant transient time. For $L \ll L_0$ the sequence is approximately 
89: random (uncorrelated), whereas for $L \gg L_0$ the effect of correlations takes over \cite{Note1}. 
90: 
91: In this model, the conditional probability $p(k,L;\mu,L_0)$ depends on the {\it fraction} 
92: of unities (or zeroes) in the preceding bits, and is independent of their arrangement. 
93: This allows one to obtain an {\it analytical} description of the system's dynamical behavior. 
94: As we shall demonstrate below, this simple model provides a good quantitative 
95: description of the observed statistical properties of various natural systems, such as 
96: coarse-grained DNA strings, written texts, and financial data.
97: 
98: The probability $P(k,L+1)$ of finding $k$ identical symbols (say, unities) 
99: in a sequence of length $L+1$ follows the evolution equation
100: 
101: \begin{eqnarray}\label{Eq2}
102: P(k,L+1) & = &[1-p(k,L)]P(k,L) \nonumber \\
103: && +p(k-1,L)P(k-1,L)\  .
104: \end{eqnarray} 
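The continuum limit taken next rests on the first two moments of a single step. A brief sketch, written in terms of the displacement variable $x \equiv 2k-L$ that appears in Eq. (\ref{Eq3}), is as follows. Appending one bit changes $x$ by $+1$ with probability $p$ and by $-1$ with probability $1-p$, where Eq. (\ref{Eq1}) reads $p=\frac{1}{2}[1+\mu x/(L+L_0)]$. Hence

\begin{eqnarray*}
\langle \Delta x \rangle & = & 2p-1 = \frac{\mu x}{L+L_0}\  , \\
\langle (\Delta x)^2 \rangle & = & 1\  .
\end{eqnarray*}
These two moments play the role of the drift and diffusion coefficients of a Fokker-Planck description, under the assumption that $P$ varies slowly on the scale of a single step.
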
105: Crossing to the continuous limit, one obtains the 
106: Fokker-Planck diffusion equation for the correlated process
107: 
108: \begin{equation}\label{Eq3}
109: {{\partial P} \over {\partial L}}={1 \over 2} {{{\partial^2 P} \over {\partial x}^2}}
110: -{{\mu} \over {L+L_0}}{{\partial(xP)} \over \partial x}\  ,
111: \end{equation}
112: where $x \equiv 2k-L$. The evolution equation (\ref{Eq3}), along with the 
113: initial condition $P(x,L=0)=\delta(x)$, has a solution in the 
114: form of a Gaussian distribution
115: 
116: \begin{equation}\label{Eq4}
117: P(x,L)={1 \over {\sqrt{{2\pi D(L)}}}} \exp\Big[-{{x^2} \over {2D(L)}}\Big]\  ,
118: \end{equation}
119: where the variance $D(L)$ is given by
120: 
121: \begin{equation}\label{Eq5}
122: D(L;\mu,L_0)={{L+L_0} \over {1-2\mu}} \Big[ 1 -{\Big({{L_0} \over {L+L_0}}\Big)}^{1-2\mu}\Big]\  .
123: \end{equation} 
124: Equation (\ref{Eq5}) 
125: is singular in the special case $\mu= {1 \over 2}$; taking the limit $\mu \to {1 \over 2}$, the variance is given by 
126: 
127: \begin{equation}\label{Eq6}
128: D(L;\mu_c,L_0)=(L+L_0)\ln\Big({{L+L_0} \over {L_0}}\Big)\  .
129: \end{equation}
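
The variance formulas (\ref{Eq5}) and (\ref{Eq6}) can be checked directly from Eq. (\ref{Eq3}). The following is a sketch, assuming that $P$ decays fast enough at $|x| \to \infty$ for all boundary terms to vanish. Multiplying Eq. (\ref{Eq3}) by $x^2$ and integrating by parts over $x$ gives a closed equation for $D(L)=\langle x^2 \rangle$,

\[
\frac{dD}{dL} = 1 + \frac{2\mu}{L+L_0}\, D\  , \qquad D(0)=0\  .
\]
With the integrating factor $(L+L_0)^{-2\mu}$ this becomes

\[
\frac{d}{dL}\Big[(L+L_0)^{-2\mu} D\Big] = (L+L_0)^{-2\mu}\  ,
\]
and integrating from $0$ to $L$ reproduces Eq. (\ref{Eq5}) for $\mu \neq \frac{1}{2}$. For $\mu=\frac{1}{2}$ the right-hand side integrates to $\ln[(L+L_0)/L_0]$, which gives Eq. (\ref{Eq6}).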
130: 
131: Remarkably, one finds that the correlated system undergoes a dynamical phase transition 
132: at the critical correlation strength $\mu_c \equiv {1 \over 2}$. 
133: The variance $D(L)$ of the correlated sequence has three qualitatively different 
134: asymptotic behaviors (in the $L \gg L_0$ limit)
135: 
136: \begin{equation}\label{Eq7}
137: D(L) \simeq \cases{ 
138: (1-2\mu)^{-1}L & $\mu<\mu_c$\  ; \cr
139: L \ln (L/L_0) & $\mu=\mu_c$\  ; \cr
140: (2\mu-1)^{-1}{L_0}^{1-2\mu}L^{2\mu} & $\mu>\mu_c$\  . \cr }
141: \end{equation}
142: Thus, for $\mu < \mu_c$ the asymptotic variance scales linearly with the string length, 
143: whereas for a history-dependent chain with strong positive correlations ($\mu > \mu_c$) the system is 
144: characterized by a super-diffusion phase, in which case $D(L)$ grows asymptotically 
145: faster than $L$ \cite{Note2}.
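
The regimes of Eq. (\ref{Eq7}) follow from Eq. (\ref{Eq5}) by tracking the factor $[L_0/(L+L_0)]^{1-2\mu}$ for $L \gg L_0$; a brief sketch is given here. For $\mu<\frac{1}{2}$ the exponent is positive, the factor vanishes, and $D \simeq L/(1-2\mu)$. For $\mu>\frac{1}{2}$ the exponent is negative, so the factor grows as $(L/L_0)^{2\mu-1}$ and dominates the bracket:

\[
D(L) \simeq \frac{L}{2\mu-1}\Big(\frac{L}{L_0}\Big)^{2\mu-1}
= \frac{L_0^{1-2\mu}L^{2\mu}}{2\mu-1}\  .
\]
Comparing with the empirical form $D \sim L^{\alpha}$ then identifies $\alpha=2\mu$ in the super-diffusive phase; for example, $\mu=0.8$ corresponds to $\alpha=1.6$.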
146: 
147: The analytical model can readily be extended to encompass situations in which the binary sequence 
148: is {\it biased}. Let
149: 
150: \begin{equation}\label{Eq8}
151: p(k,L)={1 \over 2}\Big(1+q-\mu {{L-2k} \over {L+L_0}}\Big)\  ,
152: \end{equation}
153: with $-1<q<1$. The distribution $P(x,L)$ corresponding to this conditional 
154: probability is given by a Gaussian function, centered about the position 
155: 
156: \begin{equation}\label{Eq9}
157: x_c(L)={{q} \over {1-\mu({{L} \over {L+L_0}})}}L\  .
158: \end{equation}
159: Thus, the drift velocity asymptotically approaches the constant value ${{q} \over {1-\mu}}$. 
160: The variance $D(L)$ is unaltered by the bias, and is given by Eqs. (\ref{Eq5}) and (\ref{Eq6}).
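
The asymptotic drift velocity can be recovered by a short self-consistency argument (a sketch, assuming $L \gg L_0$ and a ballistically moving center, $x_c \simeq vL$, where $v$ is a symbol introduced here for illustration). From Eq. (\ref{Eq8}) the mean step is $\langle \Delta x \rangle = 2p-1 = q+\mu x/(L+L_0)$, so that

\[
v = q + \mu v \quad \Longrightarrow \quad v = \frac{q}{1-\mu}\  ,
\]
in agreement with the large-$L$ limit of Eq. (\ref{Eq9}).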
161: 
162: In order to confirm the analytical results, we perform numerical simulations of (discrete) 
163: binary sequences. Figure \ref{Fig1} displays the resulting scaled variance $L^{-1}D(L)$ of 
164: correlated strings with various values of the correlation parameter $\mu$. 
165: We find excellent agreement between the analytically predicted results [see Eqs. (\ref{Eq5}) and 
166: (\ref{Eq6})] and the numerical ones.
167: 
168: \begin{figure}[tbh]
169: \centerline{\epsfxsize=9cm \epsfbox{perfig1.eps}} 
170: \caption{The scaled variance $L^{-1}D(L)$ as a function of the string length $L$. 
171: We present results for $\mu=-0.8, -0.4, 0, 0.2, 0.5, 0.8$, and $0.9$ (from bottom to top), 
172: with $L_0=100$. 
173: The numerically computed asymptotic slopes agree with the analytical predictions [see Eqs. 
174: (\ref{Eq5}) and (\ref{Eq6})] to within less than $1\%$.}
175: \label{Fig1}
176: \end{figure}
177: 
178: {\it Robustness of the linear model.--} 
179: In order to show the generality of the model discussed above, we consider situations in which 
180: the (history-dependent) jump probability is an arbitrary odd function \cite{Note3} of 
181: the fraction $\xi \equiv {x \over {L+L_0}}$ of unities (zeroes) that appeared in the previous $L$ symbols
182: 
183: \begin{equation}\label{Eq10}
184: p(x,L)={1 \over 2}[1+\mu F(\xi)]\  .
185: \end{equation}
186: For asymptotically large $L$, one always finds $\xi\to 0$ for non-ballistic diffusion, 
187: justifying a power-series expansion of $F(\xi)$. 
188: As long as this expansion includes a linear term, the original differential equation (\ref{Eq3}) 
189: is recovered for large $L$. We therefore expect the previous analytical results 
190: [Eqs. (\ref{Eq5}) and (\ref{Eq6})] to hold true for generic ({\it non}-linear) models as well. 
191: The generality of the model is illustrated in Fig. \ref{Fig2}, in which we depict 
192: results for various choices of the probability function $F(\xi)$. As predicted, the results 
193: are found to agree with the linear model.
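
The argument can be made explicit for the functions used in Fig. \ref{Fig2}. This is a sketch, assuming that $F$ is smooth at the origin; the symbol $\mu_{\rm eff}$ is introduced here for illustration only. Since $F$ is odd, $F(\xi)=F'(0)\xi+O(\xi^3)$, so that Eq. (\ref{Eq10}) reduces for small $\xi$ to

\[
p(x,L) \simeq \frac{1}{2}\Big[1+\mu_{\rm eff}\,\frac{x}{L+L_0}\Big]\  , \qquad
\mu_{\rm eff} \equiv \mu F'(0)\  .
\]
The three choices $\xi$, $\frac{2}{\pi}\sin(\frac{\pi}{2}\xi)$, and $\tanh(\xi)$ all satisfy $F'(0)=1$, and hence $\mu_{\rm eff}=\mu$. The typical fraction is $|\xi| \sim \sqrt{D}/L$, which by Eq. (\ref{Eq7}) decays as $L^{-1/2}$ for $\mu<\frac{1}{2}$ and as $L^{\mu-1}$ for $\frac{1}{2}<\mu<1$, so the cubic corrections become negligible at large $L$.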
194: 
195: \begin{figure}[tbh]
196: \centerline{\epsfxsize=9cm \epsfbox{perfig2.eps}} 
197: \caption{The scaled variance $L^{-1}D(L)$ for three different forms of the 
198: function $F(\xi)$: $\xi$, ${2 \over \pi}\sin({\pi \over 2}\xi)$, and $\tanh(\xi)$. 
199: We present results for $\mu=-0.8$ and $\mu=0.8$, with $L_0=100$. The different curves are 
200: almost indistinguishable.} 
201: \label{Fig2}
202: \end{figure}
203: 
204: {\it Applications.--} 
205: The robustness of the linear model (see Fig. \ref{Fig2}) suggests 
206: that it may capture the essence of the 
207: underlying correlations in a diversity of systems in nature. 
208: We therefore examine the use of the results derived in the present work as an analytical explanation for the 
209: observed statistical properties of natural systems, such as 
210: DNA strings, written texts, and financial data.
211: 
212: As mentioned, it is well established that these systems often exhibit a significant 
213: deviation from random sequences \cite{Kan,Sch}, and are characterized by a 
214: ``super-diffusive'' behavior in which $D \sim L^{\alpha}$, with $\alpha > 1$ \cite{Usa}. In such 
215: systems, super-diffusion may be attributed to long-range (positive) correlations. In fact, 
216: the analytical model allows one to determine the correlation strength of these chains. 
217: 
218: Figure \ref{Fig3} depicts the scaled variance $L^{-1}D(L)$ calculated from DNA sequences of 
219: various organisms, as a function of the string length $L$. 
220: It is of considerable interest to examine in this manner the statistical 
221: properties characterizing the DNA of organisms at various evolutionary levels: 
222: Bacillus subtilis ({\it Bacteria}), Methanosarcina acetivorans ({\it Archaea}), 
223: and Drosophila melanogaster ({\it Eukarya}) \cite{Usa,DNAs}. 
224: The theoretical model provides a good description of the 
225: empirical data \cite{Note4}, attributing different correlation strengths $\mu$ to different organisms, as 
226: summarized in Table \ref{Tab1}. 
227: 
228: The super-diffusive behavior, shown in Fig. \ref{Fig3} to persist across very long sequences, is highly suggestive 
229: of {\it long}-range correlations extending over {\it more} than one 
230: gene (e.g., $\sim 5 \times 10^4$ base-pairs in Drosophila).
231: 
232: Next, we have applied the results of the analytical model to various coarse-grained written
233: texts \cite{Kan,Sch,Usa}. It has long been recognized that the corresponding binary strings are highly 
234: self-correlated. The present analytical model enables one to determine quantitatively the strength of these 
235: inner correlations; see Table \ref{Tab1}.
236: 
237: \begin{figure}[tbh]
238: \centerline{\epsfxsize=9cm \epsfbox{perfig3.eps}} 
239: \caption{The scaled variance $L^{-1}D(L)$ as a function of the string length $L$, 
240: for coarse-grained DNA sequences of various organisms.
241: The mapping and parameters used are given in Table \ref{Tab1}. 
242: Theoretical results [see Eq. (\ref{Eq5})] are represented by curves.}
243: \label{Fig3}
244: \end{figure}
245: 
246: \begin{table}
247: \caption{The correlation strength parameter $\mu$ for various binary strings. 
248: We use the following mappings: $\{A,G\} \to 0$, $\{C,T\} \to 1$ for DNA sequences \cite{Usa,DNAs}; 
249: (a to m) $\to 0$, (n to z) $\to 1$ for written texts \cite{Usa}; and daily fall $\to 0$, daily rise 
250: $\to 1$ for stock market quotes \cite{Djia}.}
251: \label{Tab1}
252: \begin{tabular}{llc}
253: Data Type & String Source & $\mu$ \\
254: \tableline
255: DNA sequences 
256: & Drosophila melanogaster & $0.57$ \\
257: & Methanosarcina acetivorans & $0.70$ \\
258: & Bacillus subtilis & $0.86$\\
259: Written texts 
260: & Alice's Adventures in Wonderland & $0.58$ \\
261: & The Holy Bible in English & $0.84$ \\
262: & Works on computer science & $0.88$ \\
263: Stock markets
264: & NASDAQ & $0.39$ \\
265: & DJIA & $0.76$ \\
266: \end{tabular}
267: \end{table}
268: 
269: In Figure \ref{Fig4} we show the scaled variance of coarse-grained 
270: financial data (daily quotes of the Dow Jones Industrial Average, and the NASDAQ \cite{Djia}). 
271: We note that the linear model underestimates the 
272: empirical variance at {\it short} time scales. This fact can be traced back to 
273: short-term correlations in the markets. (It is interesting to note that the DJIA maintains 
274: an approximately normal diffusive behavior for a period of about one month). 
275: However, this short-term memory is washed out at longer time 
276: scales, in which case the analytical model provides a good description of the 
277: empirical results, as evident from Fig. \ref{Fig4}. The corresponding values of the 
278: correlation parameter $\mu$ are summarized in Table \ref{Tab1}.
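
As a rough guide to how the fitted values of $\mu$ in Table \ref{Tab1} translate into the shape of the scaled-variance curves, the following sketch uses only the asymptotic forms of Eq. (\ref{Eq7}), and is therefore valid only for $L \gg L_0$. For $\mu<\frac{1}{2}$ the model predicts that the scaled variance saturates,

\[
L^{-1}D(L) \to \frac{1}{1-2\mu}\  ,
\]
which is approximately $4.5$ for $\mu=0.39$. For $\mu>\frac{1}{2}$ it keeps growing as $L^{2\mu-1}$, that is, as $L^{0.52}$ for $\mu=0.76$.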
279: 
280: \begin{figure}[tbh]
281: \centerline{\epsfxsize=9cm \epsfbox{perfig4.eps}} 
282: \caption{The scaled variance $L^{-1}D(L)$ as a function of the sequence length $L$, 
283: for coarse-grained financial data: DJIA and NASDAQ daily quotes \cite{Djia}. 
284: The mapping and parameters used are given in Table \ref{Tab1}. Theoretical results 
285: [see Eq. (\ref{Eq5})] are represented by curves.}
286: \label{Fig4}
287: \end{figure}
288: 
289: In summary, in this Letter we have analyzed the dynamics of random walks with
290: {\it history-dependent} jump probabilities. 
291: Our work was motivated not only by the intrinsic interest in such dynamical
292: processes, but also by the flurry of activity in the field of long-range
293: correlated systems, and by some universal statistical features observed in many 
294: different natural systems.
295: 
296: We have broadened the study of binary strings to include long-range
297: correlations, extending throughout the length of the chain.
298: Using a simple and exactly solvable model, we identify a dynamical phase
299: transition, from normal diffusion [$D(L) \sim L$] to super-diffusive
300: behavior [$D(L) \sim L^{2 \mu}$], taking place as the correlation parameter $\mu$
301: exceeds its critical value. 
302: We show that in spite of the simplicity of the model, it is robust, and can
303: easily be extended to describe various features (such as a biased history-dependent random 
304: walk or sub-diffusion).
305: 
306: Next, we have applied the analytical results of the model to various binary strings, extracted 
307: from very different natural systems, such as 
308: coarse-grained DNA sequences, written texts, and financial data. 
309: We find that the model adequately describes the long-term behavior of these systems. 
310: Furthermore, the model provides a straightforward method to measure the
311: correlation strength of these systems. 
312: Our results can be applied to various natural systems, and may shed light on the
313: underlying rules governing their dynamics. 
314: For example, the super-diffusive behavior of DNA sequences (see Fig. \ref{Fig3}) suggests 
315: long-range correlations extending across more than one gene. The model attributes 
316: different correlation strengths to different organisms.
317:  
318: \bigskip
319: \noindent
320: {\bf ACKNOWLEDGMENTS}
321: \bigskip
322: 
323: SH acknowledges support from the Dr. Robert G. Picard fund in physics. 
324: We would like to thank Oded Agam, Yitzhak Pilpel, Eli Keshet, Ilana Keshet, 
325: Clovis Hopman, Eros Mariani, Assaf Pe'er, Oded Hod, and Ehud Nakar for helpful discussions. 
326: We thank O. V. Usatenko and V. A. Yampol'skii for providing us with their data. 
327: This research was supported by grant 159/99-3 from the Israel Science Foundation.
328: 
329: \begin{thebibliography}{99}
330: 
331: \bibitem{Man} R. N. Mantegna and H. E. Stanley, Nature (London) {\bf 376}, 46 (1995).
332: 
333: \bibitem{Kan} I. Kanter and D. F. Kessler, Phys. Rev. Lett. {\bf 74}, 4559 (1995).
334: 
335: \bibitem{Sta} H. E. Stanley {\it et al.}, Physica (Amsterdam) {\bf 224A}, 302 (1996).
336: 
337: \bibitem{Pro} A. Provata and Y. Almirantis,  Physica (Amsterdam) {\bf 247A}, 482 (1997).
338: 
339: \bibitem{Usa} O. V. Usatenko and V. A. Yampol'skii, Phys. Rev. Lett. {\bf 90}, 110601 (2003).
340: 
341: \bibitem{Yan} A. C. C. Yang, S. S. Hseu, H. W. Yien, A. L. Goldberger, and C. K. Peng, 
342: Phys. Rev. Lett. {\bf 90}, 108103 (2003).
343: 
344: \bibitem{BaNi} M. N. Barber and B. W. Ninham, {\it Random and Restricted Walks} (Gordon 
345: and Breach, New York, 1970).
346: 
347: \bibitem{Kam} N. G. van Kampen, {\it Stochastic Processes in Physics and 
348: Chemistry} (North-Holland, Amsterdam, 1992).
349: 
350: \bibitem{FeFrSo} R. Fern\'andez, J. Fr\"ohlich, and A. D. Sokal, {\it Random Walks, Critical 
351: Phenomena, and Triviality in Quantum Field Theory} (Springer Verlag, Berlin, 1992).
352: 
353: \bibitem{Wei} G. H. Weiss, {\it  Aspects and Applications of the Random Walk} (North 
354: Holland, Amsterdam, 1994).
355: 
356: \bibitem{AvHa} D. ben-Avraham and S. Havlin, {\it Diffusion and Reactions in Fractals and 
357: Disordered Systems} (Cambridge University Press, Cambridge, 2000).
358: 
359: \bibitem{DiDa} R. Dickman and D. ben-Avraham, Phys. Rev. E {\bf 64}, 020102(R) (2001).
360: 
361: \bibitem{Hod} S. Hod, Phys. Rev. Lett. {\bf 90}, 128701 (2003).
362: 
363: \bibitem{Sch} A. Schenkel, J. Zhang, and Y. C. Zhang, Fractals {\bf 1}, 47 (1993).
364: 
365: \bibitem{Note1} The introduction of the parameter $L_0$ is mainly motivated by the observed 
366: behavior of the variance of DNA sequences, written texts, and financial data. 
367: These systems are characterized by normal diffusion [$D(L) \sim L$] for small $L$ 
368: values, and by a super-diffusive behavior [$D(L) \sim L^{\alpha}$, with $\alpha>1$] 
369: for large $L$ values.
370: 
371: \bibitem{Note2} The model may be broadened to describe sub-diffusive behavior as well, by considering 
372: the conditional probability $p(k,L)=f\{{1 \over 2}[1-\mu{{L-2k} \over 
373: {(L+L_0)^{1-m}}}]\}$, where $f(u) \equiv u \Theta(u) -(u-1) \Theta(u-1)$ and $\Theta(u)$ 
374: is the Heaviside step-function. 
375: This yields, for $L \gg L_0$, $m>0$, and $\mu<0$, a Gaussian distribution of 
376: variance $D(L) \sim L^{1-m}$.
377: 
378: \bibitem{Note3} For the probability distribution $P(x,L)$ to be an even function of $x$ (and thus $\langle x \rangle =0$), 
379: the function $F(\xi)$ should be an odd function of its argument.
380: 
381: \bibitem{DNAs} DNA sequences of various organisms were obtained from ftp://ftp.ncbi.nih.gov/genomes.
382: 
383: \bibitem{Note4} We have verified that for the DNA mapping used ($\{A,G\} \to 0$, $\{C,T\} \to 1$), the 
384: distribution $P(x,L=const.)$ is well approximated by a Gaussian. The alternative mappings yield 
385: a broader distribution ($\{T,G\} \to 0$) or a large asymmetry ($\{C,G\} \to 0$).
386: 
387: \bibitem{Djia} Financial data for the DJIA and NASDAQ stock markets are quoted from http://finance.yahoo.com.
388: 
389: \end{thebibliography}
390: 
391: \end{document}