1: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2: % typeset in RevTex.
3: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
4: %\documentstyle[pre,aps,psfig,twocolumn]{revtex}
5: %\documentstyle[preprint,aps,psfig]{revtex}
6: \documentstyle[pre,aps,psfig]{revtex}
7: %\pagestyle{empty}
8: \begin{document}
9:
10: %\input psfig
11:
12: %\twocolumn[
13:
14: %\hsize\textwidth\columnwidth\hsize\csname @twocolumnfalse\endcsname
15:
16: % \draft command makes pacs numbers print
17: \draft
18:
19: \title{Long range correlations in DNA sequences}
20: %\author{A. V. S. S. Narayana Rao and A. K. Mohanty$^*$}
21: \author{A. K. Mohanty and A. V. S. S. Narayana Rao$^*$}
22: \address{Nuclear Physics Division, Bhabha Atomic Research Centre, Mumbai-400085}
\address{$^*$Molecular Biology and Agriculture Division, Bhabha Atomic Research Centre, Mumbai-400085}
24:
25:
26:
27:
28:
29:
30: \maketitle
31:
32: \begin{abstract}
The so-called long range correlation properties of DNA
sequences are studied using a variance analysis of the density
distribution of a single nucleotide or a group of nucleotides in a model-independent way.
This new method, which was suggested earlier, has been applied to extract
slope parameters that characterize the correlation properties of several
intron-containing and intronless DNA sequences. An important aspect of all the DNA
sequences is the property of complementarity, by virtue
of which any two complementary
distributions ($GA$ is complementary to $TC$, and $G$ is complementary to $ATC$)
have identical fluctuations at all scales although their distribution functions
need not be identical. Owing to this complementarity, the well-known DNA walk
representation, whose statistical interpretation is still unresolved, is shown
to be a special case of the present formalism with a density distribution
corresponding to a purine or a pyrimidine group. Another interesting aspect
of most of the DNA sequences is that the factorial moments
as a function of length exceed unity around the region where the variance
versus length in a log-log plot shows a bending. This is a purely phenomenological
observation,
which holds for several DNA sequences with a few exceptions. Therefore, this
length scale has been used as an approximate measure to exclude the bending regions
from the slope analyses. The asymmetries in the nucleotide contents, or
the patchy structure, as a possible origin of the long range correlations have
also been investigated.
56:
57:
58: \end{abstract}
59:
\pacs{PACS number(s): 87.14.Gg, 87.16.Ac, 05.10.-a}
61: %]
62:
63: \section{INTRODUCTION}
Recently, there has been considerable interest in the finding of long range
correlations in genomic DNA sequences \cite{LI1}. A DNA sequence is a chain
of sites, each occupied by either a purine (Adenine or Guanine) or a
pyrimidine (Cytosine or Thymine). For mathematical modeling, the DNA
sequence may be considered as a string of symbols (G, A, T and C) whose
correlation structure can be characterized completely by all possible base-base
correlation functions or their corresponding power spectra. Different techniques,
including mutual information functions and power spectra analyses
\cite{LI1,LI2,LI3,LI4,VOSS,BUL1,BOR,LU,VIE}, autocorrelation \cite{AZB,HER,LUO}, DNA
walk representation \cite{PENG1,MAD,NEE,CHA,PRA,KAR,STA,BUL2}, wavelet
analysis \cite{ARN1,ARN2} and Zipf analysis \cite{MAN}, have been used for statistical
analyses of DNA sequences. Despite this effort, it is still an
open question whether
the long range correlation properties are different for protein
coding (exonic) and noncoding (intronic, intergenic) sequences \cite{BUL3}.
On more fundamental grounds, there is a continuing debate as to whether
the reported long range correlations really mean a lack of independence at long
distances or simply reflect the patchiness (bias in nucleotide composition) of
DNA sequences. There have been attempts to eliminate local patchiness using
methods such as min-max \cite{PENG1}, detrended fluctuation analysis (DFA) \cite{BUL3,PENG2}
and wavelet analysis \cite{ARN1}. Although the DNA walk has been successful in modeling the long
range correlations observed in DNA sequences, as indicated by the
power law increase
in the variance and the inverse power law spectrum \cite{VOSS,VIE}, the problem of its correct
statistical interpretation is still unresolved and is attracting
the attention of an increasing number of investigators. Since approaches
based on different models predict different correlation structures, there is
no unique measure of the degree of correlation in DNA sequences.
Therefore, it is important
to investigate the correlations and extract the power law exponent $\alpha$
in a model-independent way, so that the interpretation of the data, including the
theoretical analysis, becomes more meaningful.
Another source of
confusion in this field is the absence of a clear definition of the
term ``long range''. Clearly, what is considered to be long is relative to what
is considered to be short. To overcome some of these problems, we have recently
suggested a new method \cite{AKM1} to measure the degree of correlation
using the variance analysis of the density distribution of a single nucleotide or a group
of nucleotides. We have also suggested a way to find an approximate length
scale above which all DNA sequences show strong long range correlations irrespective
of their intron contents, while below it the correlation is relatively weak.
Further, the density distribution, which is nearly Gaussian at short distances,
shows significant deviations from Gaussian statistics at large distances.
In this paper, we present the details of
the analyses and also
extract the correlation parameter $\alpha$ for several
intron-containing and intronless sequences.
111:
\section {Density distribution and Factorial moments}
113: In the present method, we build the frequency spectrum of a
114: single or a group of nucleotides by dividing the DNA sequence into many
115: equal intervals of length $l$. For example, to build a purine spectrum,
116: we compute
117: \equation n=\sum_{i=l_0}^{l_0+l} u_i \endequation
where $u_i=1$ if site $i$ is occupied by a G or an A and $u_i=0$ otherwise.
Ideally, one can divide the entire DNA sequence of length $L$ into $m$
equal intervals of size $l$ $(l=L/m)$. The purine or GA spectrum can be built
by computing $n$ from all the intervals. Alternatively, $n$ can be computed
in any segment between $l_0$ and $l_0+l$, and the spectrum (the $n$ distribution
or $P_n$) is built by varying
the starting position $l_0$ over 1, 2, 3, etc., up to $L-l$ so as to cover the whole
sequence
\footnote{At short distances, $n$ can be zero
because a given nucleotide does not occur. In such cases, the density
spectrum can be built either including or excluding the zero$^{th}$ channel. In this
analysis, we include the zero$^{th}$ channel so that complementarity is
satisfied, which is not the case when the zero$^{th}$ channel is excluded.
See appendix B for details}.
We adopt this second procedure for better statistics. Finally, the
standard deviation (SD) $\sigma$ of this $P_n$ distribution can be obtained from
$\sigma^2=\langle n^2\rangle-n_0^2$, where $n_0=\langle n\rangle$ is the average count; in general, $\sigma$ will depend on the interval or
window size $l$.
136:
In addition to the standard deviation $\sigma$, we also
compute the factorial moments $F_q$ of $P_n$.
The normalized factorial moments of order $q$ are written as
\equation F_q=\frac{f_q}{f_1^q} \endequation
where
\equation f_q=\sum_{n=q}^{\infty} P_n\, n(n-1)\cdots(n-q+1)
=\sum_{n=q} ^{\infty} \frac{n!}{(n-q)!} P_n \endequation
As will be shown later, the factorial moments have a distinct advantage over
the ordinary moments in distinguishing a genomic sequence from a random one.
It may be mentioned here that for a Poisson distribution, the factorial
moments are unity for all $q$, i.e., for
\equation P_n=\frac{a^n e^{-a}}{n!} \endequation
the above expression for $f_q$ becomes
150: \equation
151: f_q=\sum_{n=q}^\infty \frac{n!}{(n-q)!} \frac{a^n e^{-a}}{n!}
152: =\sum_{n=q}^\infty \frac{a^n e^{-a}}{(n-q)!}
153: =\sum_{m=0}^\infty \frac{a^{m+q} e^{-a}}{m!}
154: =a^q\sum_{m=0}^\infty \frac{a^m e^{-a}}{m!}
155: =a^q
156: \endequation
which, since $f_1=a$, gives $F_q=1$.
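
For comparison, the same calculation can be sketched for an idealized random sequence. This is only an illustration under an assumption that is not part of the analysis above: every site is taken to be occupied independently by the chosen nucleotide (or group) with a fixed probability $p$, so that the count $n$ in a window of $w$ sites is binomially distributed ($w$ is a symbol introduced here for the number of sites in the window). Then
\[
f_q=\sum_{n=q}^{w} \frac{n!}{(n-q)!} {w \choose n} p^n (1-p)^{w-n}
=\frac{w!}{(w-q)!}\, p^q ,
\qquad
F_q=\frac{f_q}{f_1^q}=\frac{w(w-1)\cdots(w-q+1)}{w^q} \le 1 .
\]
Under this assumption $F_q$ is independent of $p$, stays below unity for every $q \ge 2$ at any finite window size, and approaches the Poisson value of unity only as $w \rightarrow \infty$.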
158:
159:
160:
161: In this work, we have applied the above factorial moment
162: analysis (generally used to study the fluctuations during a phase transition
163: \cite{AKM2}) to study the dynamical fluctuations present in the DNA sequences.
164:
165:
\section {Principle of complementarity}
167:
A general property noticed for all the genomic sequences (of statistically
significant length), with a few exceptions, is that the distribution of any
single nucleotide or group of nucleotides with a probability of occurrence $p$ has
the same standard deviation $\sigma$ as that of its complementary group, which has the
probability of occurrence $(1-p)$, although the two have different distribution
functions. This implies that even a single nucleotide distribution,
say the $G$ distribution, will have the same $\sigma$ as the $ATC$ distribution, and
a $GA$ distribution will have the same $\sigma$ as the $TC$ distribution.
Figure \ref{sd1} shows $\sigma$ versus $l$ plots for $G$ and $GA$ distributions
(solid curves) for two typical sequences, $DROMHC$ ({\it Drosophila melanogaster},
MHC, 22663 bps, $20.5 \%$ $G$, $30.3 \%$ $A$, $25.4 \% $ $T$, $23.8 \%$ $C$) and
$SC\_MIT$
(yeast mitochondrial DNA, $9.1 \%$ $G$, $42.2 \%$ $A$, $40.7 \%$ $T$,
$8.0 \%$ $C$).
As can be seen from the figure, the $G$ and $GA$ distributions have the same $\sigma$
at all scales as the $ATC$ and $TC$ distributions (filled circles), although
the distribution functions of the two complementary groups need not be identical.
The above agreement is exact for most of the DNA sequences
as well as for
random sequences, with a few exceptions. For example, the $\sigma$ values for the
$G$ and $ATC$ distributions of $SC\_MIT$ and {\it E.~coli}::$TN10$ ({\it E.~coli} with a
$TN10$ mobile transposon (9147 bps) at location 22000 bps) show $2\%$ to $3\%$
deviations at all scales, depending on the total
length of the sequences, whereas for other
DNA sequences as well as for random sequences this
agreement is exact.
(This difference is not visible in figure \ref{sd1}
in the case of $SC\_MIT$, as the deviation is insignificant over a large distance.)
197:
198:
199:
200: \begin{figure}
201: \centerline{\hbox{
202: \psfig{figure=sd1.eps,width=3.0in,height=3.2in}}}
\caption{ The standard deviation $\sigma$ versus $l$ for $G$ and $GA$ distributions
(solid curves). The top panel is
for $DROMHC$ ({\it Drosophila melanogaster}, MHC) and the bottom panel is for
$SC\_MIT$ (yeast mitochondrial DNA). The filled circles are for
the complementary $ATC$ and $TC$ distributions.
The curve $RW$ (dotted curve) corresponds to the
case of the random walk (see text for details). The curves are scaled up appropriately for
better clarity.}
211: \label{sd1}
212: \end{figure}
213:
Within the present formalism, we can also reproduce the result of the random walk
$(RW)$ model (see the appendix for more details) by assigning
$u_i=1$ for the purine group ($G$ and $A$)
and $u_i=-1$ for the pyrimidine group ($T$ and $C$). However, unlike the random
walk model, which interprets $+1$ and $-1$ as a step up and a
step down, here $P_n$ is simply the frequency distribution of $n$,
which gives the excess or deficit of purines over pyrimidines. The $\sigma$
versus $l$ obtained from this assignment is also shown in
figure \ref{sd1} (see the dotted curves labeled $RW$) for comparison. It is
interesting to note that the $RW$ curves show a parallel shift with respect
to the $GA$ or $TC$ curves, indicating that the $GA$ or $TC$ distributions and the $RW$
model have similar fluctuations at all scales. This is an interesting
observation, as we can now use the $GA$ or $TC$ distributions as alternatives
to the DNA walk representation to study the correlations. The advantage is that, since
$n$ represents a sum of counts, unlike in the DNA walk model, the entire spectrum lies
on the positive side of the axis, which is essential for computing
higher moments such as the $F_q$ of the distributions.
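
The parallel shift can be understood from a short calculation, given here as a sketch under the assumption that each of the $w$ sites in a window is either a purine or a pyrimidine (the symbols $w$, $y$, $n_{GA}$, $n_{TC}$, $\sigma_{RW}$ and $\sigma_{GA}$ are introduced for this illustration). With $u_i=\pm 1$ the sum counts the excess of purines over pyrimidines,
\[
y=n_{GA}-n_{TC}=2n_{GA}-w, \qquad
\sigma_{RW}=2\,\sigma_{GA}, \qquad
\log \sigma_{RW}=\log \sigma_{GA}+\log 2 ,
\]
so that in a log-log plot the $RW$ curve is the $GA$ curve displaced vertically by a constant, with the same slope at every $l$.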
231:
232:
It is also important to note that although the complementary distributions
have the same $\sigma$ at all scales, the distribution functions need not be
exactly identical.
Figure \ref{sd2} shows typical normalized density distribution functions $P_n$
of the two complementary distributions $G$ and $ATC$
for the above two sequences ($SC\_MIT$ and $DROMHC$)
as a function of $n-n_0$ (where $n_0$ is the average
count) at a typical length scale of
$l=150$ (left panels). The right panels
show the $P_n$ distributions (the $x$-axis is shifted by 100 for clarity)
corresponding to
two purely random sequences having the same lengths and nucleotide
contents as the $DROMHC$ and $SC\_MIT$ sequences.
It is interesting to note that although the $\sigma$ versus $l$ plots are (nearly)
identical, i.e., both distributions have the same fluctuations at all scales,
the distribution functions are not identical.
This is an important characteristic of
a DNA sequence which is not found in the case of a random one.
251:
252:
253:
254: \begin{figure}
255: \centerline{\hbox{
256: \psfig{figure=sd2.eps,width=4.0in,height=4.2in}}}
\caption{ The complementary $G$ and $ATC$ density distributions at
a typical distance of $l=150$
for the above two sequences. The curves on the right
(shifted by $100$ units) show the corresponding
distributions for a purely random sequence with the appropriate $G$, $A$, $T$
and $C$ contents.}
263: \label{sd2}
264: \end{figure}
265:
266: \section {Extraction of slope parameter}
The long range correlations are generally studied from the relation
$\sigma \sim l^\alpha$, where the parameter $\alpha$ is extracted from the
$\sigma$ versus $l$ plot on a log-log scale. For a completely
random sequence, $\alpha \sim 0.5$. A deviation of $\alpha$ from $0.5$
indicates the presence of long range correlations. We have estimated $\sigma$
for the $G$, $A$, $T$, $C$ and $GA$ distributions of several DNA sequences and
found that the $\sigma$ versus $l$ plot on a log-log scale is not linear over
the entire length \footnote{We consider only the $G$, $A$, $T$ and $C$
distributions to extract the correlation parameters for the individual nucleotides
and the $GA$ distribution to simulate the results of the random walk model}.
Figure \ref{ec1} shows the $\sigma$ versus $l$ plot (bottom panel)
for a typical {\it E.~coli} sequence of length $L=1.2$ Mbps (solid curves)
and $L=30$ Kbps (dotted curves). The top panel shows the factorial
moments for $q=2$, 3, 4 and 6 for a typical $A$ distribution, although
similar plots can be obtained for the other nucleotide distributions as well.
A general feature of the factorial moments of a DNA sequence, with a few
exceptions, is that $F_q < 1.0$ for all $q$ at short distances and that $F_q$ exceeds
unity at some point, say at $l_q$. This behavior is not found for a purely
random sequence, where $F_q$ is always $\le 1.0$. Further, not all $F_q$
cross unity at exactly the same point, $l_q$ being larger for higher $q$ values.
However, this variation is insignificant on a very large scale if we
restrict ourselves to the lower moments, say up to $q=6$.
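
As a worked illustration of the slope extraction (a sketch; the two window sizes $l_a<l_b$ are hypothetical and only need to lie within one linear segment of the plot), the power law $\sigma \sim l^\alpha$ becomes a straight line in logarithmic variables,
\[
\log \sigma(l)=\alpha \log l+{\rm const}, \qquad
\alpha \simeq \frac{\log \sigma(l_b)-\log \sigma(l_a)}{\log l_b-\log l_a} .
\]
For an idealized random sequence in which each site is occupied with probability $p$ independently of the others (an assumption made only for this example), the count in a window of size $l$ is binomial, so that
\[
\sigma^2=l\,p(1-p), \qquad \sigma \propto l^{1/2},
\]
which reproduces $\alpha=0.5$. In practice, a least-squares fit over many values of $l$ within the segment would normally replace the two-point estimate.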
289:
From these plots and from several other studies,
we make the following observations: (i) the $\sigma$ versus $l$ plot is
not linear throughout but starts bending around some region (say $l_c$,
which could be different for different distributions), indicating a change
of slope from $\alpha_1$ to $\alpha_2$; (ii) in most cases,
$\alpha_1$ deviates only weakly from $0.5$, whereas $\alpha_2$ deviates significantly
from $0.5$ and also depends on the sequence length $L$; (iii) the individual
nucleotide distributions may have stronger correlations than sums such as the $GA$
and $TC$ distributions or any other combinations.
299:
300: \begin{figure}
301: \centerline{\hbox{
302: \psfig{figure=ec1.eps,width=4.0in,height=4.2in}}}
\caption{ (a) The factorial moments $F_q$ versus $l$ for a typical $A$ distribution
of an {\it E.~coli} sequence of length 1.2 Mbps. (b) The corresponding
$\sigma$ versus $l$ for {\it E.~coli} sequences of length 1.2 Mbps (solid curves)
and of length 30 Kbps (dashed curves). The curves are scaled up appropriately for clarity.}
307: \label{ec1}
308: \end{figure}
309:
310:
Since $\sigma$ versus $l$ in the log-log plot starts bending around $l_c$,
we can extract the slopes by dividing the entire length into two segments,
one for $l<l_c$ and the other for $l>l_c$. This can be done by examining
each case individually.
However, we have noticed an approximate correlation
between this bending region in the $\sigma$ versus $l$ plot
and the cross over
points $l_q$ of the corresponding factorial moments, i.e., the slope changes
around the same region where the factorial moments become unity. This
is a purely phenomenological observation, which holds for several DNA sequences as listed in the tables, with a
few exceptions which we will discuss below.
It may be mentioned here that although two complementary distributions
have the same fluctuations, they need not have identical factorial moments.
Figure \ref{lam} shows the plots of $F_q$ versus $l$ for the $A$ and $GTC$ distributions
of the $LAMCG$ sequence.
Since the two are complementary, they have
identical fluctuations at all scales (hence the same bending region), but the
cross over regions in the $F_q$ plots are different, being higher for the $GTC$ distribution
(due to its larger average values $n_0$ at all scales). While the $l_q$ value of the
$A$ distribution shows an approximate correlation with the bending region of the
$\sigma$ versus $l$ plot where a possible slope change occurs, the $l_q$
values of the $GTC$ distribution show no such correlation. This is true for any
complementary distributions of $G$, $A$, $T$ and $C$ except for the $GA$ and $TC$ distributions, since
both have nearly
the same, overlapping cross over regions.
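
The connection between the cross over point and the width can be made explicit for the lowest order, $q=2$. The following short derivation uses only the definitions of $f_q$ and $F_q$ given earlier and is offered as a sketch rather than as a justification of the observation for all $q$. Since $f_1=n_0$ and $f_2=\langle n(n-1)\rangle=\sigma^2+n_0^2-n_0$,
\[
F_2=\frac{f_2}{f_1^2}=1+\frac{\sigma^2-n_0}{n_0^2} ,
\]
so that $F_2$ crosses unity at the window size where $\sigma^2=n_0$. Two complementary distributions share the same $\sigma$ but have different mean counts, so this condition is met at different values of $l$; if $\sigma^2$ grows faster than $n_0$ with $l$, it is met at a larger $l$ for the group with the larger mean count.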
336:
337: \begin{figure}
338: \centerline{\hbox{
339: \psfig{figure=lam.eps,width=4.0in,height=4.2in}}}
340: \caption{ The factorial moments $F_q$ versus $l$ for $G$ and $ATC$ distributions
341: of $LAMCG$ sequence}
342: \label{lam}
343: \end{figure}
344:
345:
Therefore, only the $l_q$ values of the $G$, $A$, $T$,
$C$ and $GA$ distributions are used as approximate length scales $(l_c)$.
The entire length of the sequence is divided into
two parts, one for $0< l <l_{c1}$ and the other for $l_{c2}<l<L_{max}$, where $l_{c1}$
and $l_{c2}$ are the minimum and maximum of all the $l_c$ corresponding to the
$G$, $A$, $T$, $C$ and $GA$ distributions. We take $L_{max}=L/30$, i.e., we have at
least $30$ independent data sets, so that the statistical analysis is
meaningful. Excluding the region $l_{c1}<l<l_{c2}$, we have extracted
$\alpha_1$ and $\alpha_2$, since the linearity in these two segments
is found to be extremely good in most
cases. The results are summarized in three tables, which cover
both intronless and
intron-containing sequences. The tables show the length of the sequence $L$
used in the analyses, the cross over values $l_q$ (same as $l_c$),
the slope parameters $\alpha_1$
and $\alpha_2$, and the corresponding percentage of the nucleotide contents
$P$. A general observation is that the sequence is weakly
correlated at short distances, with $\alpha_1$ quite close to $0.5$, whereas
for $l>l_c$ the correlation is relatively stronger, with a larger value
of $\alpha_2$. We now discuss a few exceptions, such as $SC\_MIT$ and
$PODOT7$ ($T7$ bacteriophage, $39936$ bps). Figure \ref{pc1} shows the
factorial moments of a typical $G$ distribution. In both cases, the factorial
moments do not have any cross over point.
In the case of $SC\_MIT$, the factorial moments are much higher than unity
even at small distances and start decreasing afterwards. Similar behavior
is found for the $C$ distribution. However, the $A$, $T$ and $GA$ distributions
do have $l_c$ points. Therefore, using $l_{c1} \sim 36$ and $l_{c2} \sim 184$,
we estimated $\alpha_1$ and $\alpha_2$ for the $G$, $A$, $T$, $C$ and $GA$ distributions,
which are listed in table III. The symbol $'*'$ indicates the absence of any critical
value. It is interesting to note that $\alpha_1$ is quite large
and in some cases $\alpha_1 > \alpha_2$.
On the other hand, the factorial moments of a sequence like $PODOT7$ do not
reach unity at any scale. The absence of such a scale is indicated by
the symbol $'-'$ in table III. Sequences of this type behave like a purely random
one, having $\alpha$ values quite close to $0.5$. We have listed a few such exceptional sequences
in table III.
382:
383: \begin{figure}
384: \centerline{\hbox{
385: \psfig{figure=pc1.eps,width=4.0in,height=4.2in}}}
386: \caption{ The factorial moments $F_q$ versus $l$ for $G$ distributions
387: of $SC\_MIT$ (scaled up) and PODOT7 (T7 bacteriophage) sequences.}
388: \label{pc1}
389: \end{figure}
390:
Further, we have noticed that
the factorial moments of many sequences start decreasing at large distances.
In a few cases, the
factorial moments start decreasing even at very short distances.
Consequently, the slope also changes accordingly. However, we do not
assign any cause to this behavior because of insufficient statistics.
397:
398:
The slope $\alpha=0.5$ corresponds to the normal diffusion
process of a random Brownian trajectory. The basic idea of Brownian motion is that
of a random walk with a Gaussian probability distribution for the position
of the random walker after a time $t$, with the variance ($\sigma^2$)
proportional to $t$ ($\sigma \sim t^\alpha$ where $\alpha=0.5$).
However, nature shows
many examples of anomalous diffusion, characterized by a variance
which does not grow linearly in time \cite{KLA}.
In such cases either the diffusion is accelerated, if $\alpha > 0.5$, or
the growth is
dispersive, if $\alpha < 0.5$. As found in the analyses (see tables I and II),
$\alpha_2 > 0.5$ at large distances for most of the sequences irrespective of
their intron contents. However, a few sequences, shown in table III,
are peculiar and may also have an $\alpha$ which decreases at large distances.
In such cases, $\alpha<0.5$, which may indicate the influence of dispersive
dynamics. This aspect needs further investigation.
Finally, we would like to add that $\alpha_1$ is close to $0.5$ for
most of the sequences at short distances (see tables I and II). Although $\alpha=0.5$
would suggest random behavior, this cannot be concluded from the
present analyses unless the short distance effects are taken into consideration
\cite{GAL}.
420:
421: \section{Patchy sequences}
422:
In the following, we investigate whether the mosaic character of DNA,
consisting of patches of different composition, can account for the apparent
long range correlations in DNA sequences \cite{KAR}. Chargaff's second parity
rule states that in a single strand $G \approx C$ and $T \approx A$.
However, asymmetries in base composition have been observed in many
sequences. A quantitative estimate of the $GC$ and $AT$ skews can be
obtained from the relations $(G-C)/(G+C)$ (excess of $G$ nucleotides over $C$
nucleotides) and $(A-T)/(A+T)$ (excess of $A$ nucleotides over $T$ nucleotides).
This is operationally equivalent to estimating $n$ as defined in Eq.(1), except that
$n$ now represents the quantity $(G-C)/(G+C)$
for the $GC$ skew and $(A-T)/(A+T)$
for the $AT$ skew in a fixed window of size
$(L/20)$. We consider $LAMCG$ as an example and plot $n$ (defined appropriately)
versus $l_0$, where
the starting position of the sliding window $l_0$ varies over $1$, $2$, $3$, etc.,
up to $L-l$. Figure \ref{skew} shows the plots of the $GC$ and $AT$ skews as a function
of the length for the $LAMCG$ sequence.
The plots show a change in the direction of the slope with a change in the sign of
the skew. The quantity and quality of the skew can be assessed from the $V$
or inverted-$V$ shape of the curves.
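
Written out for a single window, the two skews take the following form. This is a sketch: the counts $n_G$, $n_C$, $n_A$ and $n_T$ are assumed to be obtained from Eq.(1) with $u_i=1$ for the corresponding single nucleotide in the window starting at $l_0$, and the symbols $S_{GC}$ and $S_{AT}$ are introduced here only for illustration.
\[
S_{GC}(l_0)=\frac{n_G-n_C}{n_G+n_C}, \qquad
S_{AT}(l_0)=\frac{n_A-n_T}{n_A+n_T} .
\]
Both quantities lie between $-1$ and $+1$ and vanish when the parity rule holds exactly within the window; they are undefined only if the denominator vanishes, which is not expected for a window as large as $L/20$.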
443:
444: \begin{figure}
445: \centerline{\hbox{
446: \psfig{figure=patch.ps,width=4.0in,height=4.2in}}}
447: \caption{ The $GC$ and $AT$ skews as a function of $l_0$ for $LAMCG$ sequence.}
448: \label{skew}
449: \end{figure}
450:
From the above plots, we can identify
three well known
compositional domains of $LAMCG$ of size 22000 bps ($GA$ content 0.54), 17000
bps ($GA$ content 0.47) and 9000 bps ($GA$ content 0.54). We also consider
an artificial sequence generated by joining three random
patches of size 22000 bps, 17000 bps and 9000 bps respectively, with the appropriate
$G$, $A$, $T$ and $C$ contents. In addition, we consider another heterogeneous sequence
generated from {\it E.~coli} DNA by
a mobile insertion of $TN10$ at location 22000 bps. The corresponding
random patches are of size 22000 bps, 9147 bps and 22000 bps respectively
\footnote{ Please note the distinction between the random sequence
which is generated by joining three random patches of total length $L$
and a purely random one of length $L$. Although both sequences have the same
percentage of nucleotide contents in the length $L$,
the former is random only patchwise.}
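
To see how a sequence that is random only patchwise can imitate long range correlations, consider the following rough estimate. It is only a sketch and rests on assumptions not made in the analysis itself: the sites inside patch $k$ are occupied independently with probability $p_k$, a fraction $\rho_k$ of all windows falls inside patch $k$, and windows straddling a patch boundary are neglected (reasonable only for $l$ much smaller than the patch sizes). The symbols $\rho_k$, $p_k$, $\bar{p}$ and $l^{*}$ are introduced for this illustration. Decomposing the variance of the count $n$ into its within-patch and between-patch parts gives
\[
\sigma^2 \simeq l \sum_k \rho_k\, p_k(1-p_k) + l^2 \sum_k \rho_k\, (p_k-\bar{p})^2 ,
\qquad \bar{p}=\sum_k \rho_k\, p_k ,
\]
so that the effective exponent drifts from $\alpha=0.5$ at small $l$ towards $\alpha=1$ at large $l$, with the two terms becoming equal at
\[
l^{*} \simeq \frac{\sum_k \rho_k\, p_k(1-p_k)}{\sum_k \rho_k\, (p_k-\bar{p})^2} .
\]
For the three $GA$ domains quoted above ($\rho_k=22/48$, $17/48$, $9/48$ and $p_k=0.54$, $0.47$, $0.54$) one finds $\bar{p} \approx 0.515$, a numerator of about $0.249$ and a denominator of about $1.1 \times 10^{-3}$, i.e., $l^{*}$ of the order of $2 \times 10^{2}$ sites. This is an order-of-magnitude estimate under the stated assumptions, not a result of the analysis.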
466:
467:
468: \begin{figure}
469: \centerline{\hbox{
470: \psfig{figure=lar.eps,width=4.0in,height=4.2in}}}
\caption{ The $F_q$ versus $l$ of the $C$ distribution
for $LAMCG$ and for an artificial
sequence generated by joining three randomly generated patches
of size 22000 bps, 17000 bps and 9000 bps with the same $G$, $A$, $T$ and $C$
contents as those of $LAMCG$.}
476: \label{lar}
477: \end{figure}
478:
479:
Figure \ref{lar} shows the $F_q$ versus $l$ plot of a typical $C$ distribution
for $LAMCG$ and for an artificially generated sequence (random only patchwise).
Interestingly, the factorial
moments behave similarly in both cases.
Figure \ref{rans1} shows similar $\sigma$ versus $l$ plots for both real
and artificially generated (from random patches) sequences.
Although the two agree in some cases, in general they are not identical at the
level of individual nucleotides, particularly at large distances (note that
the scale is highly compressed). This deviation
means that at large distances the density distribution functions will
differ significantly because of their different widths.
So, at first sight, the $\sigma$ versus $l$ plots indicate that
the actual DNA sequences and the random patches need not have identical
slopes $\alpha$ (hence widths $\sigma$) at large distances for all
the nucleotides, although they agree in some cases.
Even at short distances, although the DNA and the
patchwise random
sequences have nearly identical widths $\sigma$, the full shapes
of the distributions need
not be identical. To demonstrate this, we invoke the principle of
complementarity mentioned before.
501:
502:
503:
504: \begin{figure}
505: \centerline{\hbox{
506: \psfig{figure=rans1.eps,width=3.0in,height=3.2in}}}
\caption{ The standard deviation $\sigma$ versus $l$ for $G$, $A$, $T$, $C$, and $GA$
distributions. (a) $LAMCG$ and an artificial
sequence generated by joining three randomly generated patches
of size 22000 bps, 17000 bps and 9000 bps with the same $G$, $A$, $T$ and $C$
contents as those of $LAMCG$. (b) {\it E.~coli} with a $TN10$ mobile
transposon (9147 bps) at location 22000 bps. The three random patches
are of size 22000 bps, 9147 bps and 22000 bps with the appropriate
$G$, $A$, $T$ and $C$ contents. }
515: \label{rans1}
516:
517: \end{figure}
518:
519:
520:
521:
Figure \ref{fig5}(a)
shows the $G$ and $ATC$ distributions (leftmost) for the $LAMCG$ sequence at
$l=300$. Notice that
although the $\sigma$ versus $l$ plots are identical, i.e., both distributions have the same
fluctuations at all scales, the distribution functions are not the same. Such
differences are not found for a purely random sequence (rightmost). The middle
figure corresponds to the artificially generated (patchwise random)
sequence. Although the artificially
generated sequence mimics the real sequence to some extent, it is not fully
capable of reproducing the characteristics of a real sequence. Figure
\ref{fig5}(b)
shows another comparison, for the {\it E.~coli}::$TN10$ sequence, of the $A$ and $GTC$
distributions. This discrepancy, which the artificially generated sequence cannot
reproduce, will be more
prominent at higher $l$ values.
537:
538:
539:
540:
541:
542: \begin{figure}
543: \centerline{\hbox{
544: \psfig{figure=fig5.eps,width=3.0in,height=3.2in}}}
\caption{The density distribution $P_n$ versus $n-n_0$ (where
$n_0$ is the average density) for a real DNA sequence (leftmost),
for an artificially generated sequence (middle) and for a completely
random sequence (rightmost), shown for two complementary
distributions. (a) $LAMCG$ and (b) {\it E.~coli}::$TN10$.}
550:
551: \label{fig5}
552:
553: \end{figure}
554:
555:
556: \section {Density distributions}
557:
In \cite{AKM1}, we demonstrated that the density distribution $P_n$
is Gaussian at short distances and starts deviating from it as the distance
increases. Figure \ref{den} shows another example, where $P_n$ is
plotted for two complementary distributions at $l=25$, $100$ and $200$.
The complementary distributions are nearly identical at short
distances and coincide with the random distribution, whereas the $P_n$ distributions
for $G$, $ATC$ and the purely random sequence are all different at larger distances.
565:
566: \begin{figure}
567: \centerline{\hbox{
568: \psfig{figure=den.ps,width=3.0in,height=3.2in}}}
\caption{The density distribution $P_n$ versus $n-n_0$ (where
$n_0$ is the average density) for the $LAMCG$ sequence at $l=25$, $100$
and $200$. The solid and the dashed curves are for the $G$ and
$ATC$ distributions respectively, whereas the dotted curve is for a
purely random sequence.}
574: \label{den}
575: \end{figure}
576:
577:
Thus, irrespective of intron contents, most of the sequences follow Gaussian
statistics at short distances. At large distances, however, the statistics
deviate significantly from Gaussian behavior.
581:
582: \section {Conclusions}
In conclusion, we have extended our previous work to extract the slope
parameter $\alpha$ for several intron-containing and intronless DNA sequences.
The advantage of the present method is that the variance analysis
can be applied to any individual nucleotide or group of nucleotides. We believe that the
individual nucleotides provide a more fundamental measure of the correlation
than any combination or group (as in the DNA walk representation), where the
effects may be reduced or washed out. Another interesting aspect is that
the (lower) factorial moments of most of the DNA sequences cross unity in
a very narrow region in $l$, where the $\sigma$ versus $l$ plot on the log-log scale
also shows a bend. Although a formal justification for this correlation
has not been provided, we have used this scale as an approximate measure
to exclude the bending regions from the slope analysis. Based on this scale,
we divide the DNA sequence into two segments to extract the slope parameters.
It is found that below this scale the correlation is weak and the DNA
statistics are essentially Gaussian, while above it all DNA sequences show
strong long range correlations, irrespective of their intron content, with a
significant deviation from Gaussian behavior. It may be mentioned here
that the controversies that exist in this field of research are primarily
due to the different approaches adopted in various models. In this context,
our analysis is model independent, as it only involves counting an individual
nucleotide or a group of nucleotides in a given length to build the density distribution.
In this work, we do not advocate any specific model,
although the extracted slope parameters indicate the presence of anomalous
diffusion of both enhanced and dispersive nature. Instead, we
provide an elegant tool to measure
the degree of correlation unambiguously, so that the interpretation of
the data, including theoretical analyses, becomes more meaningful. This work will
also provide further impetus to develop models for understanding
DNA dynamics.
612:
613:
614:
615:
616:
617:
618: \begin{table}
619: \squeezetable
\caption{Summary of the correlation analysis of intron-containing sequences.
$l_c$ is the characteristic length scale.
$\alpha_1$ is the slope parameter for $l<l_{c1}$ and $\alpha_2$ is the slope parameter for
$l_{c2} < l < l_{max}$, where $l_{c1}$ and $l_{c2}$ are the minimum and the
maximum of all the $l_c$, and $l_{max}=L/30$, where $L$ is the total length of the
sequence. The acronym in column 1 is the GenBank name of the sequence. Since
the factorial moments for all $q$ do not cross at exactly the same point,
we have chosen the $l_c$ for which $F_q$ for $q=2,3,4$ and $6$ approach unity
simultaneously. $P$ denotes the percentage of $G$, $A$, $T$ and $C$ in the sequence.
The crossover point $l_c$ has not been fine tuned; it is only approximate.}
630:
631: \begin{tabular}{|c|c|c| c |c| c| c|c|}
632:
633: Sequence& L& $l_c$, $\alpha$ &G&A&T&C&GA\\
634: \hline
635: Human $\beta$-globin & 73,308 & $l_c$ & 12 & 14 & 14 & 14 & 32\\
636: (Chromosomal region) & & $ \alpha_1$ &0.640 &0.644 &0.671 &0.620 &0.652\\
637: HUMHBB & & $\alpha_2$ &0.703 &0.783 &0.812 &0.655 &0.758\\
638: & &P &20.2 & 30.1 &30.4 & 19.3 & 50.3\\
639: \hline
640: Adenovirus type 2 & 35,937 & $l_c$ & 24 & 12 & 12 & 36 &132\\
641: (Intron containing) & &$\alpha_1$ &0.598 &0.586 &0.567 &0.583 &0.564\\
642: ADRCG & &$\alpha_2$ &0.862 &0.815 &0.816 &0.758 &0.661\\
643: & &P &27.3 & 23.2 &21.6 & 27.9 & 50.5\\
644: \hline
645: Chicken embryonic MHC& 31,111 &$l_c$ & 24 & 36 & 14 & 28 &48\\
646: (Gene) & &$\alpha_1$ &0.644 &0.578 &0.658 &0.581 &0.623\\
647: CHKMYHE & &$\alpha_2$ &0.775 &0.698 &0.800 &0.715 &0.762\\
648: & &P &22.2 & 31.3 &26.7 & 19.8 & 53.5\\
649: \hline
650: Human $\beta$-cardiac MHC& 28,438 &$l_c$ & 16 & 16 & 10 & 18 &20\\
651: (Gene) & &$\alpha_1$ &0.638 &0.579 &0.627 &0.620 &0.664\\
652: HUMBMYH7 & &$\alpha_2$ &0.681 &0.663 &0.700 &0.673 &0.688\\
653: & &P &25.9 & 23.6 &23.0 & 27.5 & 49.5\\
654: \hline
655: Drosophila melanogaster MHC& 22,663 &$l_c$ & 20 & 20 & 14 & 36 &156\\
656: (Gene) & &$\alpha_1$ &0.648 &0.594 &0.644 &0.562 &0.569\\
657: DROMHC & &$\alpha_2$ &0.820 &0.652 &0.798 &0.707 &0.719\\
658: & &P &20.5 & 30.3 &25.4 & 23.8 & 50.8\\
659: \hline
660: Chicken c-myb oncogene & 8200 &$l_c$& 14 & 10 & 10 & 12 &48\\
661: (Gene) & &$\alpha_1$ &0.663 &0.661 &0.688 &0.670 &0.645\\
662: CHKMYB15 & &$\alpha_2$ &0.749 &0.873 &0.752 &0.852 &0.550\\
663: & &P &28.4 & 21.9 &23.5 & 22.2 & 50.3\\
664:
665: \end{tabular}
666: \end{table}
667:
668:
669: \begin{table}
670: %\squeezetable
671:
\caption{Same as Table I, but for intronless sequences.
For $E. Coli$,
$l_{max}$ is chosen as 120,0000 bps. The data are taken from the site
{\bf http://www.ncbi.nlm.nih.gov}.}
676:
677:
678: \begin{tabular}{|c|c|c| c |c| c| c|c|}
679: Sequence& L& $l_c$, $\alpha$ &G&A&T&C&GA\\
680: \hline
681: $E. Coli K12$ & 1200000&$l_c$ & 100 & 32 & 32 & 92 &684\\
682: & &$\alpha_1$ &0.535 &0.542 &0.549 &0.532 &0.529\\
683: & &$\alpha_2$ &0.665 &0.639 &0.664 &0.674 &0.614\\
684: & &$\alpha_2$ &0.654 &0.654 &0.655 &0.715 &0.563\\
685: & &P &27.2 &23.6 &24.2 &25.0 & 50.8\\
686: \hline
687: H. Influenzae & 240000&$l_c$ & 52 & 48 & 56 & 52 &214\\
688: & &$\alpha_1$ &0.542 &0.552 &0.543 &0.547 &0.543\\
689: & &$\alpha_2$ &0.720 &0.712 &0.635 &0.770 &0.709\\
690: & &P &17.9 &31.6 &30.7 &19.8 & 49.5\\
691: \hline
692: Bacillus subtilis & 3840x60&$l_c$ & 80 & 40 & 22 & 132 &274\\
693: & &$\alpha_1$ &0.538 &0.545 &0.550 &0.508 &0.536\\
694: & &$\alpha_2$ &0.815 &0.770 &0.816 &0.779 &0.766\\
695: & &P &24.5 &29.5 &26.5 &19.5 & 54.0\\
696: \hline
697: Mycobacterium & 9665x60&$l_c$ & 20 & 64 & 44 & 24 &136\\
698: tuberculosis & &$\alpha_1$ &0.549 &0.535 &0.548 &0.540 &0.542\\
699: & &$\alpha_2$ &0.827 &0.681 &0.826 &0.765 &0.791\\
700: & &P &15.92 &34.57 &33.73 &15.78 & 50.49\\
701: \hline
702: Cyano bacterium & 4166x60&$l_c$ & 32 & 40 & 28 & 24 &304\\
703: & &$\alpha_1$ &0.545 &0.532 &0.542 &0.541 &0.535\\
704: & &$\alpha_2$ &0.730 &0.678 &0.763 &0.733 &0.587\\
705: & &P &24.1 &26.0 &26.0 &23.9 & 50.1\\
706: \hline
707: Schizosaccharomyces & 19431 &$l_c$ & 32 & 60 & 80 &304 &160\\
Mitochondrion        &        &$\alpha_1$   &0.547 &0.561 &0.568 &0.504 &0.543\\
709: NC-001326 & &$\alpha_2$ &0.698 &0.690 &0.774 &0.465 &0.773\\
710: & & P &15.8 &33.8 &36.1 &14.3 &49.6 \\
711: \hline
712: Human Cytomegalovirus & 229354 &$l_c$ & 36 & 10 & 10 & 32 &148\\
713: Strain AD169 & &$\alpha_1$ &0.582 &0.588 &0.596 &0.581 &0.575\\
714: HEHCMVCG & &$\alpha_2$ &0.806 &0.799 &0.800 &0.800 &0.682\\
715: \hline
716: dmal &889x60&$l_c$ & 20 & 12 & 12 & 22 &68\\
717: & &$\alpha_1$ &0.575 &0.628 &0.599 &0.559 &0.60\\
718: & &$\alpha_2$ &0.730 &0.782 &0.602 &0.720 &0.596\\
719: \hline
720: Chicken nonmuscle MHC &7003 &$l_c$ & 96 & 72 & 12 & 28 & 64\\
721: (cDNA) & &$\alpha_1$ &0.573 &0.538 &0.569 &0.554 &0.627\\
722: CHKMYHN & &$\alpha_2$ &0.722 &0.833 &0.841 &0.601 &0.842\\
723: & &P &27.0 &31.2 &20.6 &21.2 & 58.2\\
724: \hline
725: Bacteriophage $\lambda$& 48,502&$l_c$ & 56 & 36 & 18 &124 &168\\
726: (Intronless virus) & &$\alpha_1$ &0.563 &0.541 &0.598 &0.513 &0.550\\
727: LAMCG & &$\alpha_2$ &0.935 &0.819 &0.911 &0.810 &0.866\\
728: & &P &26.4 &25.4 &24.7 &23.5 & 51.8\\
729: \hline
730: Human dystrophin & 13,957&$l_c$ & 136 & 56 & 14 & 22 &128\\
731: (cDNA) & &$\alpha_1$ &0.530 &0.552 &0.569 &0.552 &0.544\\
732: HUMDYS:M18533 & &$\alpha_2$ &0.738 &0.634 &0.777 &0.720 &0.725\\
733: & & P &22.4 &33.0 & 24.7 &19.9 &55.4\\
734:
735: \end{tabular}
736: \end{table}
737:
738:
739: \begin{table}
740: %\squeezetable
741:
\caption{Same as Table II.
The symbol $*$ indicates that the factorial moments are larger
than unity even at very short distances, whereas $-$ indicates that the factorial
moments do not reach unity.}
746:
747: \begin{tabular}{|c|c|c| c |c| c| c|c|}
748: Sequence& L& $l_c$, $\alpha$ &G&A&T&C&GA\\
749: \hline
750: SC-MIT & 85779 &$l_c$ & * & 36 & 36 & * &184\\
751: Nc-001224 & &$\alpha_1$ &0.732 &0.697 &0.680 &0.720 &0.578\\
752: & &$\alpha_2$ &0.698 &0.540 &0.747 &0.508 &0.730\\
753: & & P &9.1 &42.2 &40.7 &8.0 &51.3 \\
754: \hline
755: Pichia canadensis & 27694 &$l_c$ & * & 36 & 64 &* &96\\
Mitochondrion        &        &$\alpha_1$   &0.654 &0.688 &0.624 &0.615 &0.620\\
757: NC-001762 & &$\alpha_2$ &0.662 &0.755 &0.784 &0.660 &0.801\\
758: & & P &10.2 &41.6 &40.2 &8.0 &51.84 \\
759: \hline
760: Ti(Plasmid) &24595 &$l_c$ & 76 & 24 & 32 & 40 & -\\
761: & &$\alpha_1$ &0.543 &0.564 &0.552 &0.586 &0.508\\
                   &           &$\alpha_2$   &0.706 &0.700 &0.676 &0.728 &0.433\\
763: & &P &23.5 &26.6 &27.5 &22.4 & 50.1\\
764: \hline
765: BacteriophageT7 &39937 &$l_c$ & - & 116 & 884 & 1284 &-\\
NC-001604     &           &$\alpha_1$ ($l<116$)   &0.526 &0.571 &0.529 &0.530 &0.530\\
                   &           &$\alpha_2$ ($116<l<1330$)   &0.560 &0.587 &0.590 &0.566 &0.551\\
768: & &P &25.8 &27.2 &24.4 &22.6 & 53.0\\
769: \hline
770: Tyorg & 196x60 &$l_c$ & - & 96 &- &36 & 96\\
771: & &$\alpha_1$&0.491 & 0.560 & 0.515 & 0.620 & 0.587\\
772: & &$\alpha_2$&0.370 & 0.715 & 0.514 & 0.799 & 0.704\\
773: & &P &16.0 &35.9 &26.7 &21.4 & 51.9\\
774: \end{tabular}
775: \end{table}
776:
777:
778:
779: \appendix
780: \renewcommand{\thefigure}{A\arabic{figure}}
781: \section *{Random walk model}
782:
The method of DNA walks, first suggested by Peng et al. \cite{PENG1}, is based
on the rule that the walker moves either up $(u_i=1)$ or down $(u_i=-1)$ at each
step $i$ of the walk. This is the case of a correlated random walk, which differs
from an uncorrelated walk, where the direction of each step is independent of the
previous steps. They assign $u_i=1$ if a pyrimidine occurs at site
$i$ and $u_i=-1$ if the site contains a purine.
789: The net displacement $(y)$ of the walker after $l$ steps is defined
790: as
791: \equation
y(l)=\sum_{i=1}^l u_i
793: \endequation
The variance of the displacement can be estimated from
\equation
\sigma^2(l,L)=\frac{1}{L-l} \sum_{l_0=1}^{L-l} (\Delta y(l_0,l)-{\bar {\Delta y(l)}})^2
\endequation
798: where $L$ is the number of nucleotides in the entire sequence and
799: \equation
800: {\bar {\Delta y(l)}}=\frac{1}{L-l} \sum_{l_0=1}^{L-l} \Delta y(l_0,l)
801: \endequation
802: where $\Delta y(l_0,l)=y(l_0+l)-y(l_0)$.
803: It was found \cite{PENG1} that the fluctuations can be approximated by
804: \equation
805: \sigma(l,L) \sim l^\alpha
806: \endequation
where $\alpha$ is the correlation exponent. For $\alpha$ close to $0.5$, there
is no correlation or only short range correlation in the sequence. A value of $\alpha$
significantly different from $0.5$ indicates long range correlations.
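
The link between this walk and a density distribution can be made explicit with
a short worked step. The symbol $n(l_0,l)$, the number of purines among the $l$
sites following $l_0$, and its average $\bar n(l)$ over $l_0$ are introduced here
only for illustration. Since each pyrimidine contributes $+1$ and each purine $-1$,
\equation
\Delta y(l_0,l)=[l-n(l_0,l)]-n(l_0,l)=l-2n(l_0,l),
\endequation
and therefore
\equation
\sigma^2(l,L)=\frac{4}{L-l} \sum_{l_0=1}^{L-l} [n(l_0,l)-\bar n(l)]^2 .
\endequation
The walk fluctuation is thus twice the fluctuation of the purine count in a
window of length $l$ and scales with the same exponent $\alpha$. As a check,
if the sites are assumed independent with purine probability $p$, the expected
value is $\sigma^2=4lp(1-p)$, which gives $\alpha=0.5$.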
810:
811:
812: \appendix
813: \setcounter {figure}{0}
814: \renewcommand{\thefigure}{B\arabic{figure}}
815: \section *{B}
816:
817:
In the previous analyses, we accounted for the non-occurrence of a particular
nucleotide. This is operationally equivalent to building the density spectrum
$P_n$ including $n=0$. If the nucleotide compositional asymmetry is quite large, as in
$SC\_MIT$, the occurrence $n$ can be zero for some nucleotides, particularly at
short distances. Therefore, we can build the $P_n$ distribution either including
or excluding the zero$^{th}$ channel. Figure \ref{ap1}(a) compares
the $\sigma$ versus $l$ plots for two complementary distributions of
the $LAMCG$ sequence, both with (top panel, where the $G$ and $ATC$ distributions have
identical slopes at all scales) and without (bottom panel)
inclusion of the $n=0$ channel in the $P_n$ spectra. Interestingly, when the
$n=0$ channel is absent, the complementarity relation is not satisfied, particularly at
short distances. However, the difference vanishes at larger distances,
where $n>1$ always. Figure \ref{ap1}(b) shows another example, an $F_q$ versus
$l$ plot for a typical $SC\_MIT$ sequence. The spectrum that excludes the
$n=0$ channel behaves differently from the one that includes it (compare
with figure \ref{pc1}, where $F_q$ versus $l$ has no crossover).
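
The role of the zero$^{th}$ channel in complementarity can be illustrated with a
short argument. This is a sketch in notation introduced here, where $n_G$ and
$n_{ATC}$ denote the counts of $G$ and of $A$, $T$ or $C$ in the same window of
length $l$, and the bar denotes an average over windows. Every site is either
$G$ or not, so
\equation
n_G+n_{ATC}=l ,
\endequation
and, when all windows are retained,
\equation
\overline{(n_{ATC}-\bar n_{ATC})^2}=\overline{(n_G-\bar n_G)^2} .
\endequation
If the $n=0$ channel is excluded, the $G$ spectrum loses the windows with
$n_G=0$ while the $ATC$ spectrum loses those with $n_{ATC}=0$. The two averages
then run over different sets of windows and the equality need not hold. Once
$l$ is large enough that neither count is ever zero, the two sets coincide again.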
834:
835:
836: \begin{figure}
837: \centerline{\hbox{
838: \psfig{figure=ap1.eps,width=3.0in,height=3.2in}}}
\caption{ (a) The variance $\sigma$ versus $l$ for the $G$ (solid curves)
and $ATC$ (dotted curves) distributions of the $LAMCG$ sequence.
The top panel is
for the distribution in which complementarity is preserved,
while in the bottom panel complementarity is not satisfied, particularly
at small distances. (b) $F_q$ versus $l$ plot for the $G$ distribution of $SC\_MIT$
for the case when complementarity is not preserved.
The curves are scaled up appropriately for
better clarity.}
848: \label{ap1}
849: \end{figure}
850:
Since the spectrum behaves differently when the zero$^{th}$ channel is not included,
we have analysed the spectra of three typical sequences, listed in the table below.
While the $\alpha_2$ values are essentially the same as before, the
$\alpha_1$ values are quite different. In fact, we have noticed a general
trend in which $\alpha_1$ is higher than the previous values, although the corresponding
density distributions do not deviate significantly from Gaussian behavior
at short distances. In the previous analysis, however, we always included the
zero$^{th}$ channel so that the complementarity property is satisfied at all
scales. Moreover, we also found a correlation between $\alpha$ and Gaussian
statistics: a deviation of $\alpha$ from $0.5$ is accompanied by a
corresponding deviation of the $P_n$ distribution from Gaussian behavior.
For example, in the case of $SC\_MIT$, $\alpha$ is quite large at short
distances. Accordingly, the $P_n$ distribution also shows a strong deviation from
Gaussian statistics. However, this is not necessarily true when
complementarity is not preserved while building the spectrum:
at short distances, a deviation of $\alpha$ from $0.5$
does not always imply a strong deviation from Gaussian statistics.
868:
869:
870: \begin{table}
871: %\squeezetable
872:
\caption{The slope parameters for three typical sequences where complementarity
is not preserved.}
875:
876: \begin{tabular}{|c|c|c| c |c| c| c|c|}
877: Sequence& L& $l_c$, $\alpha$ &G&A&T&C&GA\\
878: \hline
879: Bacteriophage $\lambda$& 48,502&$l_c$ & 56 & 36 & 18 &124 &168\\
880: (Intronless virus) & &$\alpha_1$ &0.720 &0.670 &0.740 &0.680 &0.580\\
881: LAMCG & &$\alpha_2$ &0.935 &0.819 &0.910 &0.800 &0.860\\
882: & &P &26.4 &25.4 &24.7 &23.5 & 51.8\\
883: \hline
884: SC-MIT & 85779 &$l_c$ & 14 & 36 & 40 & 12 &184\\
885: Nc-001224 & &$\alpha_1$ &0.703 &0.760 &0.750 &0.700 &0.630\\
886: & &$\alpha_2$ &0.694 &0.540 &0.750 &0.510 &0.730\\
887: & & P &9.1 &42.2 &40.7 &8.0 &51.3 \\
888: \hline
889: BacteriophageT7 &39937 &$l_c$ & - & 116 & 884 & 1284 &-\\
NC-001604     &           &$\alpha_1$ ($l<116$)   &0.560 &0.610 &0.570 &0.570 &0.530\\
                   &           &$\alpha_2$ ($116<l<1330$)   &0.560 &0.587 &0.590 &0.566 &0.551\\
892: & &P &25.8 &27.2 &24.4 &22.6 & 53.0\\
893: \end{tabular}
894: \end{table}
895:
896:
897:
898:
899:
900:
901: \begin{thebibliography}{99}
902:
903: \bibitem{LI1} For a review on long range correlation in DNA sequences,
904: see for example, W. Li, Computers Chem, {\bf 21}, 257 (1997);
905: http://linkage.rockefeller.edu/wli/dna\_corr.html
906:
907: \bibitem{LI2} W. Li, Int. Journal of Bifurcation and Chaos, {\bf 2(1)}, 137 (1992).
908:
\bibitem{LI3} W. Li and K. Kaneko, Europhys. Lett., {\bf 17}, 655 (1992).
910:
911: \bibitem{LI4} W. Li, T. Marr and K. Kaneko, Physica {\bf D75}, 392 (1994).
912:
913: \bibitem{VOSS} R. F. Voss, Phys. Rev. Lett., {\bf 68}, 3805 (1992); Fractals {\bf 2}, 1 (1994).
914:
915: \bibitem{BUL1} S.V. Buldyrev, A. L. Goldberger, S. Havlin, C. K. Peng,
916: M. Simons, F. Sciortino and H. E. Stanley, Phys. Rev. Lett.,
917: {\bf 71}, 1776 (1993).
918:
\bibitem{BOR} B. Borstnik, D. Pumpernik, and D. Lukman, Europhys. Lett., {\bf 23}, 389 (1993).
920:
921: \bibitem{LU} X. Lu, Z. Sun, H. Chen, and Y. Li, Phys. Rev. {\bf E58}, 3578 (1998).
922:
923: \bibitem{VIE} M. de Vieira, Phys. Rev. {\bf E60}, 5932 (1999).
924:
925: \bibitem{AZB} M. Ya. Azbel, Phys. Rev. Lett., {\bf 75}, 168 (1995).
926:
\bibitem{HER} H. Herzel and I. Gro{\ss}e, Physica {\bf A216}, 518 (1995).
928:
929: \bibitem{LUO} Liaofu Luo, Weijiang Lee, Lijun Jia, Fengmin Ji, and Lu Tsai, Phys. Rev. {\bf E58}, 861 (1998).
930:
931: \bibitem{PENG1}C. K. Peng, S.V. Buldyrev, A. L. Goldberger, S. Havlin,
932: F. Sciortino, M. Simons, and H. E. Stanley, Nature (London),
933: {\bf 356}, 168 (1992).
934:
935: \bibitem{MAD} J. Maddox, Nature (London), {\bf 358}, 103 (1992).
936:
\bibitem{NEE} S. Nee, Nature (London), {\bf 357}, 450 (1992).
938:
\bibitem{CHA} C. A. Chatzidimitriou-Dreismann and D. Larhammar, Nature (London), {\bf 361}, 212 (1993).
940:
941: \bibitem{PRA}V. V. Prabhu, and J. M. Claverie, Nature (London), {\bf 357}, 782 (1992).
942:
\bibitem{KAR} S. Karlin and V. Brendel, Science, {\bf 259}, 677 (1993).
944:
945: \bibitem{STA} H. E. Stanley, S.V. Buldyrev, A. L. Goldberger, Z. D. Goldberg,
946: S. Havlin, R. N. Mantegna, S. M. Ossadnik, C. K. Peng, and
947: M. Simons, Physica {\bf A205}, 214 (1994).
948:
949: \bibitem{BUL2} S.V. Buldyrev, N. V. Dokholyan, A. L. Goldberger,
	      S. Havlin, C. K. Peng, H. E. Stanley and G. M. Viswanathan,
951: Physica {\bf A249}, 430 (1998).
952:
\bibitem{ARN1} A. Arneodo, E. Bacry, P. V. Graves and J. F. Muzy, Phys. Rev. Lett.,
               {\bf 74}, 3293 (1995).
955:
\bibitem{ARN2} A. Arneodo, Y. D'Aubenton-Carafa, B. Audit, E. Bacry,
	       J. F. Muzy, and C. Thermes, Physica {\bf A249}, 439 (1998).
958:
959:
960: \bibitem{MAN} R. N. Mantegna, S.V. Buldyrev, A. L. Goldberger,
961: S. Havlin, C. K. Peng, M. Simons, and H. E. Stanley,
	       Phys. Rev. Lett., {\bf 73}, 333 (1994); Phys. Rev. {\bf E52}, 2939 (1995).
963:
964:
965: \bibitem{BUL3} S.V. Buldyrev, A. L. Goldberger, S. V. Havlin, R. N. Mantegna,
966: M. E. Matsa, C. K. Peng, M. Simons, and H. E. Stanley,
967: Phys. Rev. {\bf E51}, 5084 (1995).
968:
969: \bibitem{PENG2} C. K. Peng, S.V. Buldyrev, S. V. Havlin, M. Simons, H. E. Stanley,
970: and A. L. Goldberger, Phys. Rev. {\bf E49}, 1685 (1994).
971:
972: \bibitem{AKM1} A. K. Mohanty, and A. V. S. S. Narayana Rao, Phys. Rev. Lett., {\bf 84}, 1832 (2000).
973:
\bibitem{AKM2} A. K. Mohanty, and S. K. Kataria, Phys. Rev. Lett., {\bf 73}, 2672 (1994);
Phys. Rev. Lett., {\bf 75}, 2449 (1995); Phys. Rev. {\bf C53}, 887 (1996).
976:
977: \bibitem{KLA} For a review see, J. Klafter, M. F. Shlesinger and G. Zumofen,
978: Physics Today, {\bf 49}, 33 (1996); M. F. Shlesinger, J. Klafter
979: and G. Zumofen, Am. J. Phys., {\bf 67}, 1253 (1999).
980:
981:
982:
\bibitem{GAL} Bernaola-Galvan and P. Carpena, (To be published).
984:
985:
986: %\bibitem{ALL1} P. Allegrini, P. Grigolini and B. J. West, Phys. Rev. {\bf E54}, 4760 (1996).
987: %
988: %\bibitem{ALL2} P. Allegrini, M. Barbi, P. Grigolini and B. J. West,
989: % Phys. Rev. E, {\bf 52}, 5281 {1995}
990:
991: %\bibitem{ALL3} P. Allegrini, P. Grigolini and B. J. West,
992: % Phys. Lett A, {\bf 211}, 217 {1996}
993:
994: \end{thebibliography}
995: \end{document}
996: