physics0202075/pre.tex
1: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2: %            typeset in RevTex.
3: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
4: %\documentstyle[pre,aps,psfig,twocolumn]{revtex}
5: %\documentstyle[preprint,aps,psfig]{revtex}
6: \documentstyle[pre,aps,psfig]{revtex}
7: %\pagestyle{empty}
8: \begin{document}
9: 
10: %\input psfig
11: 
12: %\twocolumn[
13: 
14: %\hsize\textwidth\columnwidth\hsize\csname @twocolumnfalse\endcsname
15: 
16: % \draft command makes pacs numbers print
17: \draft
18: 
19: \title{Long range correlations in DNA sequences}
20: %\author{A. V. S. S. Narayana Rao and A. K. Mohanty$^*$}
21: \author{A. K. Mohanty and A. V. S. S. Narayana Rao$^*$}
22: \address{Nuclear Physics Division, Bhabha Atomic Research Centre, Mumbai-400085}
\address{$^*$Molecular Biology and Agriculture Division, Bhabha Atomic Research Centre, Mumbai-400085}
24: 
25: 
26: 
27: 
28: 
29: 
30: \maketitle
31: 
32: \begin{abstract}
The so-called long range correlation properties of DNA
sequences are studied using a variance analysis of the density
distribution of a single nucleotide or a group of nucleotides in a model independent way.
This method, which was suggested earlier, has been applied to extract
the slope parameters that characterize the correlation properties of several
intron containing and intronless DNA sequences. An important aspect of all the DNA
sequences is the property of complementarity, by virtue
of which any two complementary
distributions ($GA$ is complementary to $TC$, and $G$ is complementary to $ATC$)
have identical fluctuations at all scales although their distribution functions
need not be identical. Due to this complementarity, the well known DNA walk
representation, whose statistical interpretation is still unresolved, is shown
to be a special case of the present formalism with a density distribution
corresponding to a purine or a pyrimidine group. Another interesting aspect
of most of the DNA sequences is that the factorial moments
as a function of length exceed unity around the region where the variance
versus length in a log-log plot shows a bending. This is a purely phenomenological
observation,
which is found for several DNA sequences with a few exceptions. Therefore, this
length scale has been used as an approximate measure to exclude the bending regions
from the slope analyses. The asymmetries in the nucleotide contents, or
the patchy structure, as a possible origin of the long range correlations have
also been investigated.
56: 
57: 
58: \end{abstract}
59: 
\pacs{87.14.Gg, 87.16.Ac, 05.10.-a}
61: %]
62: 
63: \section{INTRODUCTION}
Recently, there has been considerable interest in the finding of long range
correlations in genomic DNA sequences \cite{LI1}. A DNA sequence is a chain
of sites, each occupied by either a purine (adenine or guanine) or a
pyrimidine (cytosine or thymine). For mathematical modeling, the DNA
sequence can be considered as a string of symbols (G, A, T and C) whose
correlation structure is characterized completely by all possible base-base
correlation functions or their corresponding power spectra. Different techniques,
including mutual information functions and power spectra analyses
\cite{LI1,LI2,LI3,LI4,VOSS,BUL1,BOR,LU,VIE}, autocorrelation \cite{AZB,HER,LUO}, the DNA
walk representation \cite{PENG1,MAD,NEE,CHA,PRA,KAR,STA,BUL2}, wavelet
analysis \cite{ARN1,ARN2} and Zipf analysis \cite{MAN}, have been used for statistical
analyses of DNA sequences. Despite this effort, it is still an
open question whether
the long range correlation properties are different for protein
coding (exonic) and non-coding (intronic, intergenic) sequences \cite{BUL3}.
On more fundamental grounds, there is a continuing debate as to whether
the reported long range correlations really mean a lack of independence at long
distances or simply reflect the patchiness (bias in nucleotide composition) of
DNA sequences. There have been attempts to eliminate local patchiness using
methods such as min-max \cite{PENG1}, detrended fluctuation analysis (DFA) \cite{BUL3,PENG2}
and wavelet analysis \cite{ARN1}. In spite of the success of the DNA walk in modeling the long
range correlations observed in DNA sequences, as indicated by the
power law increase
in the variance and the inverse power law spectrum \cite{VOSS,VIE}, the problem of its correct
statistical interpretation is still unresolved and is attracting
the attention of an increasing number of investigators. Since approaches
based on different models predict different correlation structures, there is
no unique measure of the degree of correlation in DNA sequences.
Therefore, it is important
to investigate the correlations and extract the power law exponent $\alpha$
in a model independent way, so that the interpretation of the data, including the
theoretical analysis, becomes more meaningful.
Another source of
confusion in this field is the absence of a clear definition of the
term ``long range''. Clearly, what is considered to be long is relative to what
is considered to be short. To overcome some of these problems, we have recently
suggested a new method \cite{AKM1} to measure the degree of correlation
using the variance analysis of the density distribution of a single nucleotide or a group
of nucleotides. We have also suggested a way to find an approximate length
scale above which all DNA sequences show strong long range correlations irrespective
of their intron contents, while below it the correlation is relatively weak.
Further, the density distribution, which is nearly Gaussian at short distances,
shows significant deviations from Gaussian statistics at large distances.
In this paper, we present the details of
the analyses and also
extract the correlation parameter $\alpha$ for several
intron containing and intronless sequences.
111: 
\section {Density distribution and factorial moments}
113: In the present method, we build the frequency spectrum of a
114: single or a group of nucleotides by dividing the DNA sequence into many
115: equal intervals of length $l$. For example, to build a purine spectrum,
116: we compute
117: \equation n=\sum_{i=l_0}^{l_0+l} u_i \endequation
where $u_i=1$ if site $i$ is occupied by a $G$ or an $A$ and $u_i=0$ otherwise.
Ideally, one can divide the entire DNA sequence of length $L$ into $m$
equal intervals of size $l$ $(l=L/m)$. The purine or $GA$ spectrum can be built
by computing $n$ from all the intervals. Alternatively, $n$ can be computed
in any segment between $l_0$ and $l_0+l$ and the spectrum (the $n$ distribution
or $P_n$) is built by varying
the starting position $l_0$ from 1, 2, 3, etc. up to $L-l$ so as to cover the whole
sequence
\footnote{At short distances, $n$ can be zero
due to the non-occurrence of a given nucleotide. In such cases, the density
spectrum can be built either including or excluding the zeroth channel. In this
analysis, we include the zeroth channel so that the complementarity is
satisfied, which is not the case when the zeroth channel is excluded.
See appendix B for details.}.
We adopt this second procedure for better statistics. Finally, the
variance of this $P_n$ distribution is obtained from
$\sigma^2=\langle n^2\rangle-n_0^2$, where $n_0=\langle n\rangle$ is the average count. The standard
deviation (SD) $\sigma$ in general depends on the interval or
window size $l$.
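
As a reference point for the analysis below, the variance expected for an
uncorrelated sequence can be written down directly. The following is only a
sketch under assumptions that are not part of the analysis itself: the sites are
taken to be independent, the chosen nucleotide group occurs with a constant
probability $p$, and the window is taken to contain exactly $l$ sites. The count
$n$ is then binomially distributed, so that
\[
n_0=\langle n\rangle = lp, \qquad
\sigma^2=\langle n^2\rangle-n_0^2 = lp(1-p), \qquad
\sigma=\sqrt{p(1-p)}\; l^{1/2}.
\]
For example, $p=0.5$ and $l=100$ give $n_0=50$ and $\sigma=5$. A growth of
$\sigma$ faster or slower than $l^{1/2}$ therefore signals a departure from this
uncorrelated reference.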
136: 
In addition to the standard deviation $\sigma$, we also
compute the factorial moments $F_q$ of $P_n$.
The normalized factorial moments of order $q$ are written as
\equation F_q=\frac{f_q}{f_1^q} \endequation
where
\equation f_q=\sum_{n=q}^{\infty} P_n\, n(n-1)\cdots(n-q+1)
             =\sum_{n=q} ^{\infty} \frac{n!}{(n-q)!} P_n . \endequation
As will be shown later, the factorial moments have a distinct advantage over
the normal moments in distinguishing a genomic sequence from a random one.
It may be mentioned here that for a Poisson distribution, the factorial
moments for all $q$ become unity, i.e. for
\equation P_n=\frac{a^n e^{-a}}{n!} \endequation
the above expression for $f_q$ becomes
150: \equation
151: f_q=\sum_{n=q}^\infty \frac{n!}{(n-q)!} \frac{a^n e^{-a}}{n!}
152:    =\sum_{n=q}^\infty \frac{a^n e^{-a}}{(n-q)!}
153:    =\sum_{m=0}^\infty \frac{a^{m+q} e^{-a}}{m!}
154:    =a^q\sum_{m=0}^\infty \frac{a^m e^{-a}}{m!}
155:    =a^q
156:    \endequation
which gives $F_q=1$, since $f_1=a$.
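
For comparison, a sequence of independent sites observed through a finite
window is binomial rather than Poissonian. As a sketch (assuming independent
sites, a constant occurrence probability $p$ and a window of exactly $l$ sites),
\[
P_n={l \choose n}p^n(1-p)^{l-n}, \qquad
f_q=\frac{l!}{(l-q)!}\,p^q, \qquad
F_q=\frac{l(l-1)\cdots(l-q+1)}{l^q}=\prod_{k=0}^{q-1}\left(1-\frac{k}{l}\right).
\]
Under these assumptions $F_q\le 1$ for every $l$, independently of $p$, and
$F_q$ approaches unity from below as $l$ grows. For instance, $F_2=1-1/l$, which
is $0.99$ at $l=100$.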
158: 
159: 
160: 
161: In this work, we have applied the above factorial moment
162: analysis (generally used to study the fluctuations during a phase transition
163: \cite{AKM2}) to study the dynamical fluctuations present in the DNA sequences.
164: 
165: 
\section {Principle of complementarity}
167: 
A general property noticed for all the genomic sequences (of statistically
significant length), with a few exceptions, is that the distribution of any
single nucleotide or group of nucleotides with probability of occurrence $p$ has
the same standard deviation $\sigma$ as that of its complementary group, which has
probability of occurrence $(1-p)$, although the two have different distribution
functions. This implies that even a single nucleotide distribution,
say the $G$ distribution, has the same $\sigma$ as the $ATC$ distribution, and
the $GA$ distribution has the same $\sigma$ as the $TC$ distribution.
Figure \ref{sd1} shows $\sigma$ versus $l$ plots for the $G$ and $GA$ distributions
(solid curves) for two typical sequences, $DROMHC$ (Drosophila melanogaster,
MHC, 22663 bps, $20.5 \%$ $G$, $30.3 \%$ $A$, $25.4 \% $ $T$, $23.8 \%$ $C$) and
$SC\_MIT$
(yeast mitochondrial DNA, $9.1 \%$ $G$, $42.2 \%$ $A$, $40.7 \%$ $T$,
$8.0 \%$ $C$).
As can be seen from the figure, the $G$ and $GA$ distributions have the same $\sigma$
at all scales as the $ATC$ and $TC$ distributions (filled circles), although
the distribution functions of the two complementary groups need not be identical.
The above agreement is exact for most of the DNA sequences
(with a few exceptions) as well as for
random sequences. For example, the $\sigma$ values for the
$G$ and $ATC$ distributions of $SC\_MIT$ and $E. Coli::TN10$ ($E. Coli$ with a
$TN10$ mobile transposon (9147 bps) inserted at location 22000 bps) show $2\%$ to $3\%$
deviations at all scales, depending on the total
length of the sequence, whereas for the other
DNA sequences as well as for random sequences the
agreement is exact.
(This difference is not visible in figure \ref{sd1}
in the case of $SC\_MIT$, as the deviation is insignificant over a large distance.)
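
One way to see why complementary groups can share the same $\sigma$ is the
following sketch, which assumes that every site in the window carries exactly
one of the four symbols $G$, $A$, $T$ or $C$ (no ambiguous bases) and that the
zeroth channel is included. If $n$ counts a group in a window of $l$ sites, the
complementary count in the same window is $n'=l-n$, so that
\[
\langle n'\rangle = l-n_0, \qquad
\sigma'^2=\langle (n'-\langle n'\rangle)^2\rangle
         =\langle (n_0-n)^2\rangle=\sigma^2, \qquad
P'_{n'}=P_{l-n'} .
\]
In terms of the deviation from the mean, $P'$ is the mirror image of $P$. The
two widths coincide at every $l$, while the shapes coincide only if $P_n$ is
symmetric about $n_0$.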
196: 
197: 
198: 
199: 
200: \begin{figure}
201: \centerline{\hbox{
202: \psfig{figure=sd1.eps,width=3.0in,height=3.2in}}}
\caption{ The standard deviation $\sigma$ versus $l$ for the $G$ and $GA$ distributions
(solid curves). The top panel is
for $DROMHC$ (Drosophila melanogaster, MHC) and the bottom panel is for
  $SC\_MIT$ (yeast mitochondrial DNA). The filled circles are for
  the complementary $ATC$ and $TC$ distributions.
  The curve $RW$  (dotted curve) corresponds to the
  random walk model (see text for details). The curves are scaled up appropriately for
  better clarity.}
211: \label{sd1}
212: \end{figure}
213: 
Within the present formalism, we can also reproduce the result of the random walk
$(RW)$ model (see the appendix for more detail) by assigning
$u_i=1$ for the purine group ($G$ and $A$)
and $u_i=-1$ for the pyrimidine group ($T$ and $C$). However, unlike the random
walk model, which interprets $+1$ and $-1$ as a step up and a
step down, here $P_n$ is the frequency distribution of $n$,
which gives the excess or deficit of purines over pyrimidines. The $\sigma$
versus $l$ curve obtained from this assignment is also shown in
figure \ref{sd1} (see the dotted curves labeled $RW$) for comparison. It is
interesting to note that the $RW$ curves show a parallel shift with respect
to the $GA$ or $TC$ curves, indicating that the $GA$ or $TC$ distributions and the $RW$
model have similar fluctuations at all scales. This is an interesting
observation, as we can now use the $GA$ or $TC$ distributions as alternatives
to the DNA walk representation to study the correlation. The advantage is that, since
$n$ represents a sum of non-negative terms, unlike in the DNA walk model, the entire spectrum lies
on the positive side of the axis, which is essential for computing
higher moments like $F_q$ of the distributions.
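
The parallel shift can be made explicit with a short calculation (a sketch,
assuming a window of $l$ sites, each occupied by a purine or a pyrimidine).
Writing $n_{GA}$ for the purine count obtained with $u_i=0,1$, the $\pm 1$
assignment gives
\[
n_{RW}=n_{GA}-n_{TC}=2n_{GA}-l, \qquad
\sigma_{RW}=2\,\sigma_{GA}, \qquad
\log\sigma_{RW}=\log\sigma_{GA}+\log 2 .
\]
On a log-log plot the $RW$ curve is thus the $GA$ curve displaced by the
constant $\log 2$, with the same slope at every $l$.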
231: 
232: 
It is also important to note that although the complementary distributions
have the same $\sigma$ at all scales, the distribution functions need not be
exactly identical.
Figure \ref{sd2} shows typical normalized density distribution functions $P_n$
of the two complementary distributions $G$ and $ATC$
for the above two sequences ($SC\_MIT$ and $DROMHC$)
as a function of $n-n_0$ (where $n_0$ is the average
count) at a typical length scale of
$l=150$ (left panels). The panels on the
right show the $P_n$ distributions (the $x$-axis is shifted by 100 for clarity)
for
two purely random sequences having the same length and nucleotide
contents as the $DROMHC$ and $SC\_MIT$ sequences.
Although the $\sigma$ versus $l$ plots are (nearly)
identical, i.e. both distributions have the same fluctuations at all scales,
the distribution functions are not identical.
This is an important characteristic of
a DNA sequence which is not found in a random one.
251: 
252: 
253: 
254: \begin{figure}
255: \centerline{\hbox{
256: \psfig{figure=sd2.eps,width=4.0in,height=4.2in}}}
\caption{ The complementary $G$ and $ATC$ density distributions at
a typical distance of $l=150$
for the above two sequences. The curves on the right
(shifted by $100$ units) show the corresponding
distributions for a purely random sequence with the appropriate $G$, $A$, $T$
and $C$ contents.}
263: \label{sd2}
264: \end{figure}
265: 
266: \section {Extraction of slope parameter}
The long range correlations are generally studied through the relation
$\sigma \sim l^\alpha$, where the parameter $\alpha$ is extracted from the
$\sigma$ versus $l$ plot on a log-log scale. For a completely
random sequence, $\alpha = 0.5$. A deviation of $\alpha$ from $0.5$
indicates the presence of long range correlations. We have estimated $\sigma$
for the $G$, $A$, $T$, $C$ and $GA$ distributions of several DNA sequences and
found that the $\sigma$ versus $l$ plot on a log-log scale is not linear over
the entire length \footnote{We consider only the $G$, $A$, $T$ and $C$
distributions to extract the correlation parameters for the individual nucleotides,
and the $GA$ distribution to simulate the results of the random walk model.}.
Figure \ref{ec1} shows the $\sigma$ versus $l$ plot (bottom panel)
for a typical $E. Coli$ sequence of length $L=1.2$ Mbps (solid curves)
and $L=30$ Kbps (dotted curves). The top panel shows the factorial
moments for $q=2$, 3, 4 and 6 for a typical $A$ distribution, although
similar plots can be obtained for the other nucleotide distributions as well.
A general feature of the factorial moments of DNA sequences, with a few
exceptions, is that at short distances $F_q < 1.0$ for all $q$, and $F_q$ exceeds
unity at some point, say at $l_q$. This behavior is not found for a purely
random sequence, where $F_q$ is always $\le 1.0$. Further, the moments of different $q$ do not
cross unity at exactly the same point, $l_q$ being larger for higher $q$ values.
However, this variation is insignificant over a very large scale if we
restrict ourselves to the lower moments, say up to $q=6$.
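
For the lowest order, the crossing of unity can be related to the variance by
an identity that follows from the definitions of $f_q$ and $\sigma^2$ alone:
\[
f_2=\langle n(n-1)\rangle=\sigma^2+n_0^2-n_0, \qquad
F_2=\frac{f_2}{n_0^2}=1+\frac{\sigma^2-n_0}{n_0^2}.
\]
Hence $F_2$ exceeds unity exactly when $\sigma^2>n_0$. As an illustration
(assuming an uncorrelated sequence with constant occurrence probability $p$ and
a window of $l$ sites), $\sigma^2=lp(1-p)<lp=n_0$, so $F_2<1$ at all $l$, in
line with the behavior quoted above for a purely random sequence. This identity
covers only $q=2$ and is not offered as an explanation of the observed $l_q$
for the higher moments.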
289: 
From these plots and from several other studies,
we make the following observations: (i) the $\sigma$ versus $l$ plot is
not linear throughout, but starts bending around some region (say $l_c$,
which could be different for different distributions), indicating a change
of slope from $\alpha_1$ to $\alpha_2$; (ii) in most cases,
$\alpha_1$ deviates only weakly from $0.5$, while $\alpha_2$ deviates significantly
from $0.5$ and also depends on the sequence length $L$; (iii) the individual
nucleotide distributions may have stronger correlations than sums like the $GA$
and $TC$ distributions or any other combinations.
299: 
300: \begin{figure}
301: \centerline{\hbox{
302: \psfig{figure=ec1.eps,width=4.0in,height=4.2in}}}
\caption{ (a) The factorial moments $F_q$ versus $l$ for a typical $A$ distribution
of the $E. Coli$ sequence of length 1.2 Mbps. (b) The corresponding
$\sigma$ versus $l$ for $E. Coli$ of length 1.2 Mbps (solid curves)
and of length 30 Kbps (dashed curves). The curves are scaled up appropriately for clarity.}
307: \label{ec1}
308: \end{figure}
309: 
310: 
Since the $\sigma$ versus $l$ curve in the log-log plot starts bending around $l_c$,
we can extract the slopes by dividing the entire length into two segments,
one for $l<l_c$ and the other for $l>l_c$. This can be done by examining
each case individually.
However, we have noticed an approximate correlation
between this bending region in the $\sigma$ versus $l$ plot
and the crossover
points $l_q$ of the corresponding factorial moments, i.e. the slope changes
around the same region where the factorial moments become unity. This
is a purely phenomenological observation, which is found for the several DNA sequences listed in the tables, with a
few exceptions which we discuss below.
It may be mentioned here that although two complementary distributions
have the same fluctuations, they need not have identical factorial moments.
Figure \ref{lam} shows the plots of $F_q$ versus $l$ for the $A$ and $GTC$ distributions
of the $LAMCG$ sequence.
Since the two are complementary, they have
identical fluctuations at all scales (hence the same bending region), but the
crossover regions in the $F_q$ plots are different, being higher for the $GTC$ distribution
(due to its larger average values $n_0$ at all scales). While the $l_q$ value of the
$A$ distribution shows an approximate correlation with the bending region of the
$\sigma$ versus $l$ plot where a possible slope change occurs, the $l_q$
values of the $GTC$ distribution show no such correlation. This is true for any
complementary distributions of $G$, $A$, $T$ and $C$ except for the $GA$ and $TC$ distributions, since
the two have nearly
the same, overlapping crossover regions.
336: 
337: \begin{figure}
338: \centerline{\hbox{
339: \psfig{figure=lam.eps,width=4.0in,height=4.2in}}}
340: \caption{ The factorial moments $F_q$ versus $l$ for $G$ and $ATC$ distributions
341: of $LAMCG$ sequence}
342: \label{lam}
343: \end{figure}
344: 
345: 
Therefore, only the $l_q$ values of the $G$, $A$, $T$,
$C$ and $GA$ distributions are used as approximate length scales $(l_c)$.
The entire length range is divided into
two parts, one for $0< l <l_{c1}$ and the other for $l_{c2}<l<L_{max}$, where $l_{c1}$
and $l_{c2}$ are the minimum and maximum of all the $l_c$ values corresponding to the
$G$, $A$, $T$, $C$ and $GA$ distributions. We take $L_{max}=L/30$, i.e. we have at
least $30$ independent data sets, so that the statistical analysis is
meaningful. Excluding the region $l_{c1}<l<l_{c2}$, we have extracted
$\alpha_1$ and $\alpha_2$, since the linearity in these two segments
is found to be extremely good in most
cases.  The results are summarized in three tables, which cover
both intronless and
intron containing sequences. The tables show the length of the sequence $L$
used in the analyses, the crossover values $l_q$ (same as $l_c$),
the slope parameters $\alpha_1$
and $\alpha_2$, and the corresponding percentage nucleotide contents
$P$. A general observation is that the sequence is weakly
correlated at short distances, with $\alpha_1$ quite close to $0.5$, whereas
for $l>l_c$ the correlation is relatively stronger, with a larger value
of $\alpha_2$. We now discuss a few exceptions, such as $SC\_MIT$ and
$PODOT7$ ($T7$ bacteriophage, $39936$ bps). Figure \ref{pc1} shows the
factorial moments of typical $G$ distributions. In both cases, the factorial
moments do not have any crossover point.
In the case of $SC\_MIT$, the factorial moments are much higher than unity
even at small distances and decrease afterwards. Similar behavior
is found for the $C$ distribution. However, the $A$, $T$ and $GA$ distributions
do have $l_c$ points. Therefore, using $l_{c1} \sim 36$ and $l_{c2} \sim 184$,
we estimated $\alpha_1$ and $\alpha_2$ for the $G$, $A$, $T$, $C$ and $GA$ distributions,
which are listed in table III. The symbol $'*'$ indicates the absence of any critical
value. It is interesting to note that $\alpha_1$ is quite large
and in some cases $\alpha_1 > \alpha_2$.
On the other hand, the factorial moments of a sequence like $PODOT7$ do not
reach unity at any scale. The absence of such a scale is indicated by
the symbol $'-'$ in table III. Sequences of this type behave like a purely random
one, having $\alpha$ values quite close to $0.5$. We have listed a few such
exceptional sequences in table III.
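
The fitting procedure is not spelled out here. One common choice, given only
as an assumption, is an ordinary least squares fit of a straight line to the
log-log data within each segment. With $x_j=\ln l_j$ and $y_j=\ln\sigma(l_j)$
for the $N$ window sizes $l_j$ in a segment, the slope would be
\[
\alpha=\frac{N\sum_j x_j y_j-\sum_j x_j\sum_j y_j}
            {N\sum_j x_j^2-\left(\sum_j x_j\right)^2},
\]
evaluated once for $0<l_j<l_{c1}$ to obtain $\alpha_1$ and once for
$l_{c2}<l_j<L_{max}$ to obtain $\alpha_2$. For $SC\_MIT$, for example, the two
segments would be $l_j<36$ and $184<l_j<L_{max}$.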
382: 
383: \begin{figure}
384: \centerline{\hbox{
385: \psfig{figure=pc1.eps,width=4.0in,height=4.2in}}}
386: \caption{ The factorial moments $F_q$ versus $l$ for $G$ distributions
387: of $SC\_MIT$ (scaled up) and PODOT7 (T7 bacteriophage) sequences.}
388: \label{pc1}
389: \end{figure}
390: 
Further, we have noticed that
the factorial moments for many sequences start decreasing at large distances.
In a few cases, the
factorial moments start decreasing even at very short distances.
Consequently, the slope also changes accordingly. However, we do not
assign any reason to this behavior, owing to the lack of sufficient statistics.
397: 
398: 
The slope $\alpha=0.5$ corresponds to the normal diffusion
process of a random Brownian trajectory. The basic idea of Brownian motion is that
of a random walk with a Gaussian probability distribution for the position
of the random walker after a time $t$, with the variance ($\sigma^2$)
proportional to $t$ ($\sigma \sim t^\alpha$ where $\alpha=0.5$).
However, nature shows
many examples of anomalous diffusion, characterized by a variance
which does not grow linearly in time \cite{KLA}.
In such cases either the diffusion is accelerated, if $\alpha > 0.5$, or
the growth is
dispersive, if $\alpha < 0.5$. As found in the analyses (see tables I and II),
$\alpha_2 > 0.5$ at large distances for most of the sequences, irrespective of
their intron contents. However, a few sequences shown in table III are
not only peculiar but may also have an $\alpha$ which decreases at large distances.
In such cases $\alpha<0.5$, which may indicate the influence of dispersive
dynamics. This aspect needs further investigation.
Finally, we would like to add that $\alpha_1$ is close to $0.5$ for
most of the sequences at short distances (see tables I and II). Although $\alpha=0.5$
would suggest random behavior, this cannot be concluded from the
present analyses unless the short distance effects are taken into consideration
\cite{GAL}.
420: 
421: \section{Patchy sequences}
422: 
In the following, we investigate whether the mosaic character of DNA,
consisting of patches of different composition, can account for the apparent
long range correlations in DNA sequences \cite{KAR}. Chargaff's second parity
rule states that in a single strand $G \approx C$ and $T \approx  A$.
However, asymmetries in base composition have been observed in many
sequences. A quantitative estimate of the $GC$ and $AT$ skews can be
obtained from the relations $(G-C)/(G+C)$ (excess of $G$ nucleotides over $C$
nucleotides) and $(A-T)/(A+T)$ (excess of $A$ nucleotides over $T$ nucleotides).
This is operationally equivalent to estimating $n$ as defined in Eq.(1), except that
$n$ now represents the quantity $(G-C)/(G+C)$
for the $GC$ skew and $(A-T)/(A+T)$
for the $AT$ skew in a fixed window of size
$l=L/20$. We consider $LAMCG$ as an example and plot $n$ (defined appropriately)
versus $l_0$, where
the starting position of the sliding window $l_0$ varies from $1$, $2$, $3$, etc.
up to $L-l$. Figure \ref{skew} shows the plots of the $GC$ and $AT$ skews as a function
of $l_0$ for the $LAMCG$ sequence.
The plots show a change in the direction of the slope with a change in the sign of
the skew. The magnitude and nature of the skew can be assessed from the $V$
or inverted-$V$ shape of the curves.
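
Written out explicitly (a sketch in the notation of Eq.(1); the symbols
$n_X$, $S_{GC}$ and $S_{AT}$ are introduced here only for illustration), let
$n_X(l_0)$ be the number of sites occupied by nucleotide $X$ in the window
starting at $l_0$. The two skews are then
\[
S_{GC}(l_0)=\frac{n_G(l_0)-n_C(l_0)}{n_G(l_0)+n_C(l_0)}, \qquad
S_{AT}(l_0)=\frac{n_A(l_0)-n_T(l_0)}{n_A(l_0)+n_T(l_0)} .
\]
Each skew lies between $-1$ and $+1$ and vanishes when Chargaff's second parity
rule holds exactly within the window. For instance, a window containing
$600$ $G$ and $500$ $C$ sites has $S_{GC}=100/1100\approx 0.09$.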
443: 
444: \begin{figure}
445: \centerline{\hbox{
446: \psfig{figure=patch.ps,width=4.0in,height=4.2in}}}
447: \caption{ The $GC$ and $AT$ skews as a function of $l_0$ for $LAMCG$ sequence.}
448: \label{skew}
449: \end{figure}
450: 
From the above plots, we can identify
three well known
compositional domains of $LAMCG$, of size 22000 bps ($GA$ content 0.54), 17000
bps ($GA$ content 0.47) and 9000 bps ($GA$ content 0.54). We also consider
an artificially generated sequence obtained by joining three random
patches of size 22000 bps, 17000 bps and 9000 bps respectively, with the appropriate
$G$, $A$, $T$ and $C$ contents. We also consider another heterogeneous sequence,
generated from $E. Coli$ DNA by
a mobile insertion of $TN10$ at location 22000 bps. The corresponding
random patches are of size 22000 bps, 9147 bps and 22000 bps respectively
\footnote{Please note the distinction between the random sequence
which is generated by joining three random patches of total length $L$
and a purely random one of length $L$. Although both sequences have the same
percentage nucleotide contents over the length $L$,
the former is random only patch wise.}.
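
A rough estimate shows how patchiness alone can change the growth of $\sigma$
with $l$. The following is only a sketch under simplifying assumptions that are
not part of the analysis above: each patch $k$ is internally uncorrelated with
its own occurrence probability $p_k$, it occupies a fraction $w_k$ of the
sequence, and $l$ is small enough that windows straddling a patch boundary can
be neglected. The law of total variance then gives
\[
\sigma^2 \approx l\sum_k w_k\,p_k(1-p_k) + l^2\sum_k w_k\,(p_k-\bar p)^2 ,
\qquad \bar p=\sum_k w_k\,p_k .
\]
The first term grows as $l$ (slope $0.5$ in $\sigma$) and the second as $l^2$
(slope $1$), so the apparent slope increases once the second term takes over.
For the three $LAMCG$ domains quoted above ($GA$ contents $0.54$, $0.47$ and
$0.54$ with $w_k=22/48$, $17/48$ and $9/48$), one finds $\bar p\approx 0.515$,
$\sum_k w_k p_k(1-p_k)\approx 0.249$ and
$\sum_k w_k(p_k-\bar p)^2\approx 1.12\times 10^{-3}$, so the two terms become
comparable near $l\approx 0.249/(1.12\times 10^{-3})\approx 220$. This number
is illustrative only and is not a fitted value.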
466: 
467: 
468: \begin{figure}
469: \centerline{\hbox{
470: \psfig{figure=lar.eps,width=4.0in,height=4.2in}}}
\caption{ The $F_q$ versus $l$ plots of the $C$ distribution
for $LAMCG$ and for an artificial
sequence generated by joining three randomly generated patches
of size 22000 bps, 17000 bps and 9000 bps with the same $G$, $A$, $T$ and $C$
contents as $LAMCG$.}
476: \label{lar}
477: \end{figure}
478: 
479: 
Figure \ref{lar} shows the $F_q$ versus $l$ plot of a typical $C$ distribution
for $LAMCG$ and for the artificially generated sequence (random only patch wise).
Interestingly, the factorial
moments behave similarly in both cases.
Figure \ref{rans1} shows the corresponding $\sigma$ versus $l$ plots for both the real
and the artificially generated (from random patches) sequences.
Although the two agree in some cases, in general they are not identical at the
level of individual nucleotides, particularly at large distances (note that
the scale is highly compressed). This deviation
means that at large distances the density distribution functions
differ significantly because of their different widths.
So, at first look, the $\sigma$ versus $l$ plot shows that
the actual DNA sequences and the random patches need not have identical
slopes $\alpha$ (hence widths $\sigma$) at large distances for all
the nucleotides, although they agree in some cases.
Even at short distances, although the DNA and the
random
sequences have nearly identical widths $\sigma$, the full shapes
of the distributions need
not be identical. To demonstrate this, we invoke the principle of
complementarity mentioned before.
501: 
502: 
503: 
504: \begin{figure}
505: \centerline{\hbox{
506: \psfig{figure=rans1.eps,width=3.0in,height=3.2in}}}
\caption{ The standard deviation $\sigma$ versus $l$ for the $G$, $A$, $T$, $C$, and $GA$
distributions. (a) For $LAMCG$ and an artificial
sequence generated by joining three randomly generated patches
of size 22000 bps, 17000 bps and 9000 bps with the same $G$, $A$, $T$ and $C$
contents as $LAMCG$. (b) For $E. Coli$ with a $TN10$ mobile
transposon (9147 bps) inserted at location 22000 bps. The three random patches
are of size 22000 bps, 9147 bps and 22000 bps with the appropriate
$G$, $A$, $T$ and $C$ contents. }
515: \label{rans1}
516: 
517: \end{figure}
518: 
519: 
520: 
521: 
Figure \ref{fig5}(a)
shows the $G$ and $ATC$ distributions (left most) for the $LAMCG$ sequence at
$l=300$. Notice that
although the $\sigma$ versus $l$ plots are identical, i.e. both distributions have the same
fluctuations at all scales, the distribution functions are not the same. Such
differences are not found for a purely random sequence (right most). The middle
panel corresponds to the artificially generated (patch-wise random)
sequence. Although the artificially
generated sequence mimics the real sequence to some extent, it does not fully
reproduce the characteristics of a real sequence. Figure
\ref{fig5}(b)
shows another comparison, for the $E. Coli::TN10$ sequence, for the $A$ and $GTC$
distributions. The discrepancy is more
prominent at higher $l$ values, which the artificially generated sequence cannot
reproduce.
537: 
538: 
539: 
540: 
541: 
542: \begin{figure}
543: \centerline{\hbox{
544: \psfig{figure=fig5.eps,width=3.0in,height=3.2in}}}
\caption{The density distribution $P_n$ versus $n-n_0$ (where
$n_0$ is the average density) for a real DNA sequence (left most),
for an artificially generated sequence (middle) and for a completely
random sequence (right most), shown for two complementary
distributions: (a) for $LAMCG$ and (b) for $E. Coli::TN10$.}
550: 
551: \label{fig5}
552: 
553: \end{figure}
554: 
555: 
556: \section {Density distributions}
557: 
In \cite{AKM1}, we demonstrated that the density distribution $P_n$
is Gaussian at short distances and starts deviating from a Gaussian as the distance
increases. Figure \ref{den} shows another example, where $P_n$ is
plotted for two complementary distributions at $l=25$, $100$ and $200$.
The complementary distributions are nearly identical at short
distances and coincide with the random distribution, whereas the $P_n$ distributions
for $G$, $ATC$ and the purely random sequence are all different at larger distances.
565: 
566: \begin{figure}
567: \centerline{\hbox{
568: \psfig{figure=den.ps,width=3.0in,height=3.2in}}}
\caption{The density distribution $P_n$ versus $n-n_0$ (where
$n_0$ is the average density) for the $LAMCG$ sequence at $l=25$, $100$
and $200$. The solid and the dashed curves are for the $G$ and
$ATC$ distributions respectively, whereas the dotted curve is for a
purely random sequence.}
574: \label{den}
575: \end{figure}
576: 
577: 
Thus, irrespective of intron content, most of the sequences follow Gaussian
statistics at short distances. At large distances, however, the statistics
deviate significantly from Gaussian behavior.
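
As a reference point for the purely random curve, the following is a minimal
sketch that assumes, for illustration only, that each site independently
carries the chosen nucleotide (or group) with probability $p$. The symbol $p$
and the label ``rand'' are introduced here and are not part of the analysis
above. Under this assumption the count $n$ in a window of length $l$ is
binomially distributed,
\equation
P_n^{\rm rand}=\frac{l!}{n!\,(l-n)!}\,p^n(1-p)^{l-n},\qquad
n_0=lp,\qquad \sigma^2=lp(1-p).
\endequation
For large $l$ this approaches a Gaussian with the same mean and variance, so
that $\sigma\sim l^{1/2}$. A measured $P_n$ that is broader than this form at
large $l$ is therefore one way in which a departure from uncorrelated
behavior can show up.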
581: 
582: \section {Conclusions}
In conclusion, we have extended our previous work to extract the slope
parameter $\alpha$ for several intron-containing and intronless DNA sequences.
The advantage of the present method is that the variance analysis
can be applied to any individual nucleotide or group of nucleotides. We believe that the
individual nucleotides provide a more fundamental measure of the correlation
than any combination or group (like the DNA walk representation), where the
effects may get reduced or washed out. Another interesting aspect is that
the (lower) factorial moments of most of the DNA sequences cross unity in
a very narrow region in $l$, where the $\sigma$ versus $l$ plot on the log-log scale
also shows a bending. Although a formal justification for this correlation
has not been provided, we have used this scale as an approximate measure
to exclude the bending regions from the slope analysis. Based on this scale,
we divide the range of $l$ into two segments to extract the slope parameters.
It is found that below this scale the correlation is weak and the DNA
statistics are essentially Gaussian, while above it all DNA sequences show
strong long range correlations, irrespective of their intron content, with a
significant deviation from Gaussian behavior. It may be mentioned here
that the controversies that exist in this field of research are primarily
due to the different approaches adopted in various models. In this context,
our analysis is model independent, as it only involves counting an individual nucleotide
or a group of nucleotides in a given length to build the density distribution.
In this work, we do not advocate any specific model,
although the extracted slope parameters indicate the presence of anomalous
diffusion of both enhanced and dispersive nature. Instead, we
provide a tool to measure
the degree of correlation unambiguously, so that the interpretation of
the data, including theoretical analyses, becomes more meaningful. This work should
also provide further impetus to develop models for understanding
DNA dynamics.
612: 
613: 
614: 
615: 
616: 
617: 
618: \begin{table}
619: \squeezetable
\caption{Summary of the correlation analysis of intron-containing sequences.
$l_c$ is the characteristic length scale.
$\alpha_1$ is the slope parameter for $l<l_{c1}$ and $\alpha_2$ is the slope parameter for
$l_{c2} < l < l_{max}$, where $l_{c1}$ and $l_{c2}$ are the minimum and the
maximum of all the $l_c$, and $l_{max}=L/30$, where $L$ is the total length of the
sequence. The acronym in column 1 is the GenBank name of the sequence. Since
the factorial moments for all $q$ do not cross at exactly the same point,
we have chosen the $l_c$ at which $F_q$ for $q=2,3,4$ and $6$ approach unity
simultaneously. $P$ denotes the percentage of $G$, $A$, $T$ and $C$ in the sequence.
The cross-over point $l_c$ has not been fine tuned; it is only approximate.}
630: 
631: \begin{tabular}{|c|c|c| c |c| c| c|c|}
632: 
633: Sequence& L& $l_c$, $\alpha$ &G&A&T&C&GA\\
634: \hline
635: Human $\beta$-globin & 73,308  & $l_c$    & 12   & 14   & 14   & 14   & 32\\
636: (Chromosomal region) &          & $ \alpha_1$   &0.640 &0.644 &0.671 &0.620 &0.652\\
637: HUMHBB               &          & $\alpha_2$ &0.703 &0.783 &0.812 &0.655 &0.758\\
638:                      &          &P           &20.2  & 30.1 &30.4 & 19.3 & 50.3\\
639: \hline
640: Adenovirus type 2    & 35,937  & $l_c$    & 24   & 12   & 12   & 36   &132\\
641: (Intron containing)   &         &$\alpha_1$     &0.598 &0.586 &0.567 &0.583 &0.564\\
642: ADRCG                &         &$\alpha_2$     &0.862 &0.815 &0.816 &0.758 &0.661\\
643:                      &          &P           &27.3  & 23.2 &21.6 & 27.9 & 50.5\\
644: \hline
645: Chicken embryonic MHC& 31,111  &$l_c$     & 24   & 36   &  14   & 28   &48\\
646: (Gene)               &      &$\alpha_1$        &0.644 &0.578 &0.658  &0.581 &0.623\\
647: CHKMYHE              &      &$\alpha_2$        &0.775 &0.698 &0.800   &0.715 &0.762\\
648:                      &          &P           &22.2  & 31.3 &26.7 & 19.8 & 53.5\\
649: \hline
650: Human $\beta$-cardiac MHC& 28,438  &$l_c$     & 16   & 16   &  10   & 18   &20\\
651: (Gene)               &      &$\alpha_1$        &0.638 &0.579 &0.627  &0.620 &0.664\\
652: HUMBMYH7              &      &$\alpha_2$        &0.681 &0.663 &0.700   &0.673 &0.688\\
653:                      &          &P           &25.9  & 23.6 &23.0 & 27.5 & 49.5\\
654: \hline
655: Drosophila melanogaster MHC& 22,663  &$l_c$     & 20   & 20   &  14   & 36   &156\\
656: (Gene)               &      &$\alpha_1$        &0.648 &0.594 &0.644  &0.562 &0.569\\
657: DROMHC                     &      &$\alpha_2$        &0.820  &0.652 &0.798  &0.707 &0.719\\
658:                      &          &P           &20.5  & 30.3 &25.4 & 23.8 & 50.8\\
659: \hline
660: Chicken c-myb oncogene    & 8200  &$l_c$& 14  & 10   &  10   & 12   &48\\
661: (Gene)              &    &$\alpha_1$      &0.663 &0.661 &0.688  &0.670 &0.645\\
662: CHKMYB15            &    &$\alpha_2$      &0.749 &0.873 &0.752  &0.852 &0.550\\
663:                      &          &P           &28.4  & 21.9 &23.5 & 22.2 & 50.3\\
664: 
665: \end{tabular}
666: \end{table}
667: 
668: 
669: \begin{table}
670: %\squeezetable
671: 
\caption{Same as Table I, but for intronless sequences.
For $E. Coli$,
$l_{max}$ is chosen as 120,0000 bps. The data are taken from the site
{\bf http://www.ncbi.nlm.nih.gov}.}
676: 
677: 
678: \begin{tabular}{|c|c|c| c |c| c| c|c|}
679: Sequence& L& $l_c$, $\alpha$ &G&A&T&C&GA\\
680: \hline
681: $E. Coli K12$           & 1200000&$l_c$        & 100  & 32   &  32   & 92   &684\\
682:                        &    &$\alpha_1$        &0.535 &0.542 &0.549  &0.532 &0.529\\
683:                        &    &$\alpha_2$        &0.665 &0.639 &0.664  &0.674 &0.614\\
684:                        &    &$\alpha_2$        &0.654 &0.654 &0.655  &0.715   &0.563\\
685:                  &    &P                 &27.2  &23.6  &24.2   &25.0  & 50.8\\
686: \hline
687: H. Influenzae                    & 240000&$l_c$        &  52  & 48   &  56   & 52   &214\\
688:                        &    &$\alpha_1$        &0.542 &0.552 &0.543  &0.547 &0.543\\
689:                        &    &$\alpha_2$        &0.720 &0.712 &0.635  &0.770 &0.709\\
690:                  &    &P                 &17.9  &31.6  &30.7   &19.8  & 49.5\\
691: \hline
692: Bacillus subtilis                  & 3840x60&$l_c$        &  80  & 40   &  22   & 132   &274\\
693:                        &    &$\alpha_1$        &0.538 &0.545 &0.550  &0.508 &0.536\\
694:                        &    &$\alpha_2$        &0.815 &0.770 &0.816  &0.779 &0.766\\
695:                  &    &P                 &24.5  &29.5  &26.5   &19.5  & 54.0\\
696: \hline
697: Mycobacterium                 & 9665x60&$l_c$        &  20  & 64   &  44   & 24   &136\\
698: tuberculosis                       &    &$\alpha_1$        &0.549 &0.535 &0.548  &0.540 &0.542\\
699:                  &    &$\alpha_2$        &0.827 &0.681 &0.826  &0.765 &0.791\\
700:                  &    &P                 &15.92  &34.57  &33.73   &15.78  & 50.49\\
701: \hline
702: Cyano bacterium                   & 4166x60&$l_c$     &  32  & 40   &  28   & 24   &304\\
703:                        &    &$\alpha_1$        &0.545 &0.532 &0.542  &0.541 &0.535\\
704:                        &    &$\alpha_2$        &0.730 &0.678 &0.763  &0.733 &0.587\\
705:                  &    &P                 &24.1  &26.0  &26.0   &23.9  & 50.1\\
706: \hline
707: Schizosaccharomyces    & 19431      &$l_c$     & 32   & 60   & 80    &304  &160\\
Mitochondrion          &    &$\alpha_1$        &0.547 &0.561 &0.568  &0.504  &0.543\\
709: NC-001326              &    &$\alpha_2$        &0.698 &0.690 &0.774  &0.465  &0.773\\
710:                        &    & P                &15.8  &33.8  &36.1   &14.3   &49.6 \\
711: \hline
712: Human Cytomegalovirus  & 229354 &$l_c$     & 36   & 10   & 10    & 32   &148\\
713: Strain AD169                &    &$\alpha_1$        &0.582 &0.588 &0.596  &0.581 &0.575\\
714: HEHCMVCG                    &    &$\alpha_2$        &0.806 &0.799 &0.800   &0.800  &0.682\\
715: \hline
716: dmal                   &889x60&$l_c$      & 20   & 12   & 12    & 22   &68\\
717:                        &    &$\alpha_1$        &0.575 &0.628 &0.599  &0.559 &0.60\\
718:                        &    &$\alpha_2$        &0.730 &0.782 &0.602  &0.720 &0.596\\
719: \hline
720: Chicken nonmuscle MHC       &7003  &$l_c$     & 96  & 72   & 12   & 28   &  64\\
721: (cDNA)                 &    &$\alpha_1$        &0.573 &0.538 &0.569  &0.554    &0.627\\
722: CHKMYHN                &    &$\alpha_2$        &0.722 &0.833  &0.841  &0.601   &0.842\\
723:                  &    &P                 &27.0  &31.2  &20.6   &21.2  & 58.2\\
724: \hline
725: Bacteriophage $\lambda$& 48,502&$l_c$     & 56   & 36   &  18   &124   &168\\
726: (Intronless virus)     &    &$\alpha_1$        &0.563 &0.541 &0.598  &0.513 &0.550\\
727: LAMCG                  &    &$\alpha_2$        &0.935 &0.819 &0.911  &0.810 &0.866\\
728:                  &    &P                 &26.4  &25.4  &24.7   &23.5  & 51.8\\
729: \hline
730: Human dystrophin       & 13,957&$l_c$     & 136  & 56   &  14   & 22   &128\\
731: (cDNA)                 &    &$\alpha_1$        &0.530 &0.552 &0.569  &0.552 &0.544\\
732: HUMDYS:M18533          &    &$\alpha_2$        &0.738 &0.634 &0.777  &0.720 &0.725\\
733:                        &    &     P            &22.4  &33.0  & 24.7  &19.9  &55.4\\
734: 
735: \end{tabular}
736: \end{table}
737: 
738: 
739: \begin{table}
740: %\squeezetable
741: 
\caption{Same as Table II.
The symbol $*$ indicates that the factorial moments are larger
than unity even at very short distance, whereas $-$ indicates that the factorial
moments do not reach unity.}
746: 
747: \begin{tabular}{|c|c|c| c |c| c| c|c|}
748: Sequence& L& $l_c$, $\alpha$ &G&A&T&C&GA\\
749: \hline
750: SC-MIT                 & 85779 &$l_c$     & *  & 36   & 36    & *  &184\\
751: Nc-001224              &    &$\alpha_1$        &0.732 &0.697 &0.680  &0.720  &0.578\\
752:                        &    &$\alpha_2$        &0.698 &0.540 &0.747  &0.508  &0.730\\
753:                        &    & P                &9.1   &42.2  &40.7   &8.0    &51.3 \\
754: \hline
755: Pichia canadensis      & 27694      &$l_c$     & *    & 36   & 64    &*  &96\\
Mitochondrion          &    &$\alpha_1$        &0.654 &0.688 &0.624  &0.615  &0.620\\
757: NC-001762              &    &$\alpha_2$        &0.662 &0.755 &0.784  &0.660  &0.801\\
758:                        &    & P                &10.2  &41.6  &40.2   &8.0    &51.84 \\
759: \hline
760: Ti(Plasmid)            &24595  &$l_c$     & 76  & 24   & 32    & 40   & -\\
761:                        &    &$\alpha_1$        &0.543 &0.564 &0.552  &0.586    &0.508\\
                       &    &$\alpha_2$        &0.706 &0.700 &0.676  &0.728   &0.433\\
763:                  &    &P                 &23.5  &26.6  &27.5   &22.4  & 50.1\\
764: \hline
765: BacteriophageT7                 &39937  &$l_c$     & -  & 116   & 884    & 1284   &-\\
766: NC-001604               &    &$\alpha_1<116$        &0.526 &0.571 &0.529  &0.530    &0.530\\
767:                        &  &$116<\alpha_2<1330$        &0.560 &0.587  &0.590  &0.566   &0.551\\
768:                  &    &P                 &25.8  &27.2  &24.4   &22.6  & 53.0\\
769: \hline
770: Tyorg                  & 196x60 &$l_c$     &  - & 96 &- &36 & 96\\
771:                        &        &$\alpha_1$&0.491 & 0.560 & 0.515 & 0.620 & 0.587\\
772:                        &        &$\alpha_2$&0.370 & 0.715 & 0.514 & 0.799 & 0.704\\
773:                  &    &P                 &16.0  &35.9  &26.7  &21.4  & 51.9\\
774: \end{tabular}
775: \end{table}
776: 
777: 
778: 
779: \appendix
780: \renewcommand{\thefigure}{A\arabic{figure}}
781: \section *{Random walk model}
782: 
The method of DNA walks, first suggested by Peng et al. \cite{PENG1}, is based
on the rule that the walker moves either up $(u_i=1)$ or down $(u_i=-1)$ for each
step $i$ of the walk. This is the case of a correlated random walk and differs
from an uncorrelated walk, where the direction of each step is independent of the
previous steps. Further, they assign $u_i=1$ if a pyrimidine occurs at the site
$i$, whereas $u_i=-1$ if the site contains a purine.
789: The net displacement $(y)$ of the walker after $l$ steps is defined
790: as
791: \equation
y(l)=\sum_{i=1}^l u_i
793: \endequation
794: The standard deviation of the above quantity can be estimated from
795: \equation
\sigma^2(l,L)=\frac{1}{L-l} \sum_{l_0=1}^{L-l} (\Delta y(l_0,l)-{\bar {\Delta y(l)}})^2
797: \endequation
798: where $L$ is the number of nucleotides in the entire sequence and
799: \equation
800: {\bar {\Delta y(l)}}=\frac{1}{L-l} \sum_{l_0=1}^{L-l} \Delta y(l_0,l)
801: \endequation
802: where $\Delta y(l_0,l)=y(l_0+l)-y(l_0)$.
803: It was found \cite{PENG1} that the fluctuations can be approximated by
804: \equation
805: \sigma(l,L) \sim l^\alpha
806: \endequation
where $\alpha$ is the correlation exponent. For $\alpha$ close to $0.5$, there
is no correlation or only short range correlation in the sequence. If $\alpha$
is significantly different from $0.5$, it indicates long range correlations.
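
The link between this walk and a density distribution can be made explicit
with a short sketch. Let $n(l_0,l)$ denote the number of purines among the
$l$ sites that follow site $l_0$, and let $\sigma_n^2(l)$ be the variance of
this count over the same windows; both symbols are introduced here for
illustration. Each pyrimidine contributes $+1$ and each purine $-1$ to the
displacement, so that
\equation
\Delta y(l_0,l)=[l-n(l_0,l)]-n(l_0,l)=l-2n(l_0,l)
\endequation
and therefore
\equation
\sigma^2(l,L)=4\,\sigma_n^2(l).
\endequation
The walk fluctuation is thus twice the standard deviation of the purine
count, and both scale with the same exponent $\alpha$. As a check, for an
uncorrelated sequence with purine fraction $p$ (an assumption made only for
this example) one has $\sigma_n^2=lp(1-p)$, which gives
$\sigma=2\sqrt{lp(1-p)}\sim l^{1/2}$, i.e. $\alpha=0.5$.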
810: 
811: 
812: \appendix
813: \setcounter {figure}{0}
814: \renewcommand{\thefigure}{B\arabic{figure}}
815: \section *{B}
816: 
817: 
In the previous analysis, we accounted for the non-occurrence of a particular
nucleotide. This is operationally equivalent to building the density spectrum
$P_n$ including $n=0$. If the nucleotide compositional asymmetry is quite large, as in
$SC\_MIT$, the occurrence $n$ can be zero for some nucleotides, particularly at
short distances. Therefore, we can build the $P_n$ distribution either including
or excluding the zero$^{th}$ channel. Figure \ref{ap1}(a) compares
the $\sigma$ versus $l$ plots for two complementary distributions of
the $LAMCG$ sequence, both with (top panel, where the $G$ and $ATC$ distributions have
identical slopes at all scales) and without (bottom panel)
the $n=0$ channel in the $P_n$ spectra. Interestingly, when the
$n=0$ channel is excluded, the complementarity relation is not satisfied, particularly at
short distances. However, the difference disappears at larger distances,
where $n>1$ always. Figure \ref{ap1}(b) shows another example, the $F_q$ versus
$l$ plot for a typical $SC\_MIT$ sequence. The spectrum obtained by excluding the
$n=0$ channel behaves differently from the one in which the zero$^{th}$ channel is included (compare
with figure \ref{pc1}, where $F_q$ versus $l$ has no cross over).
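
Why the $n=0$ channel matters for complementarity can be seen from a short
sketch; the notation $n'$ and the superscripts $(G)$ and $(ATC)$ are
introduced here for illustration. In any window of length $l$, the count $n$
of a nucleotide group and the count $n'$ of its complement (for example $G$
and $ATC$) add up to $l$, so that
\equation
n'=l-n,\qquad P^{(ATC)}_{n'}=P^{(G)}_{l-n'},\qquad
\sigma^2_{ATC}=\sigma^2_{G},
\endequation
provided both distributions are built from the same set of windows. Excluding
the windows with $n=0$ from the $G$ spectrum removes the windows with $n'=l$,
whereas excluding the zero$^{th}$ channel from the $ATC$ spectrum removes the
windows with $n=l$, which is a different set. The two spectra are then built
from different windows, and the equality of the variances need not hold. At
large $l$, windows with $n=0$ or $n'=0$ become rare, so the two prescriptions
are expected to coincide.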
834: 
835: 
836: \begin{figure}
837: \centerline{\hbox{
838: \psfig{figure=ap1.eps,width=3.0in,height=3.2in}}}
\caption{ (a) The variance $\sigma$ versus $l$ for the $G$ (solid curves)
and $ATC$ (dotted curves) distributions of the $LAMCG$ sequence.
The top panel is
for the case in which complementarity is preserved,
while in the bottom panel complementarity is not satisfied, particularly
at small distances. (b) $F_q$ versus $l$ plot for the $G$ distribution of $SC\_MIT$
for the case in which complementarity is not preserved.
The curves are scaled up appropriately for
 better clarity.}
847:  better clarity.}
848: \label{ap1}
849: \end{figure}
850: 
Since the spectrum behaves differently when the zero$^{th}$ channel is not included,
we have analysed the spectra of three typical sequences, listed in the table below.
Notice that while the $\alpha_2$ values are essentially the same as before, the
$\alpha_1$ values are quite different. In fact, we have noticed a general
trend in which $\alpha_1$ is higher than the previous values, although the corresponding
density distributions do not deviate significantly from Gaussian behavior
at short distances. In the previous analysis, however, we always included the
zero$^{th}$ channel so that the complementarity property is satisfied at all
scales. Moreover, we also found a correlation between $\alpha$ and Gaussian
statistics: a deviation of $\alpha$ from $0.5$ is accompanied by a
corresponding deviation of the $P_n$ distribution from Gaussian behavior.
For example, in the case of $SC\_MIT$, $\alpha$ is quite large at short
distance. Accordingly, the $P_n$ distribution also shows a strong deviation from
Gaussian statistics. However, this is not necessarily true when
complementarity is not preserved while building the spectrum:
at short distances, the deviation of $\alpha$ from $0.5$
then does not always imply a strong deviation from Gaussian statistics.
868: 
869: 
870: \begin{table}
871: %\squeezetable
872: 
\caption{The slope parameters for three typical sequences where the complementarity
is not preserved.}
875: 
876: \begin{tabular}{|c|c|c| c |c| c| c|c|}
877: Sequence& L& $l_c$, $\alpha$ &G&A&T&C&GA\\
878: \hline
879: Bacteriophage $\lambda$& 48,502&$l_c$     & 56   & 36   &  18   &124   &168\\
880: (Intronless virus)     &    &$\alpha_1$        &0.720 &0.670 &0.740  &0.680 &0.580\\
881: LAMCG                  &    &$\alpha_2$        &0.935 &0.819 &0.910  &0.800 &0.860\\
882:                  &    &P                 &26.4  &25.4  &24.7   &23.5  & 51.8\\
883: \hline
884: SC-MIT                 & 85779 &$l_c$     & 14  & 36   & 40    & 12  &184\\
885: Nc-001224              &    &$\alpha_1$        &0.703 &0.760 &0.750  &0.700  &0.630\\
886:                        &    &$\alpha_2$        &0.694 &0.540 &0.750  &0.510  &0.730\\
887:                        &    & P                &9.1   &42.2  &40.7   &8.0    &51.3 \\
888: \hline
889: BacteriophageT7                 &39937  &$l_c$     & -  & 116   & 884    & 1284   &-\\
890: NC-001604               &    &$\alpha_1<116$        &0.560 &0.610 &0.570  &0.570    &0.530\\
891:                        &  &$116<\alpha_2<1330$        &0.560 &0.587  &0.590  &0.566   &0.551\\
892:                  &    &P                 &25.8  &27.2  &24.4   &22.6  & 53.0\\
893: \end{tabular}
894: \end{table}
895: 
896: 
897: 
898: 
899: 
900: 
901: \begin{thebibliography}{99}
902: 
903: \bibitem{LI1} For a review on long range correlation in DNA sequences,
904:               see for example, W. Li, Computers Chem, {\bf 21}, 257 (1997);
905:               http://linkage.rockefeller.edu/wli/dna\_corr.html
906: 
907: \bibitem{LI2} W. Li, Int. Journal of Bifurcation and Chaos, {\bf 2(1)}, 137 (1992).
908: 
\bibitem{LI3} W. Li and K. Kaneko, Europhys. Lett., {\bf 17}, 655 (1992).
910: 
911: \bibitem{LI4} W. Li, T. Marr and K. Kaneko, Physica {\bf D75}, 392 (1994).
912: 
913: \bibitem{VOSS} R. F. Voss, Phys. Rev. Lett., {\bf 68}, 3805 (1992); Fractals {\bf 2}, 1 (1994).
914: 
915: \bibitem{BUL1} S.V. Buldyrev, A. L. Goldberger, S. Havlin, C. K. Peng,
916:            M. Simons, F. Sciortino and H. E. Stanley, Phys. Rev. Lett.,
917:            {\bf 71}, 1776 (1993).
918: 
\bibitem{BOR} B. Borstnik, D. Pumpernik, and D. Lukman, Europhys. Lett., {\bf 23}, 389 (1993).
920: 
921: \bibitem{LU} X. Lu, Z. Sun, H. Chen, and Y. Li, Phys. Rev. {\bf E58}, 3578 (1998).
922: 
923: \bibitem{VIE} M. de Vieira, Phys. Rev. {\bf E60}, 5932 (1999).
924: 
925: \bibitem{AZB} M. Ya. Azbel, Phys. Rev. Lett., {\bf 75}, 168 (1995).
926: 
927: \bibitem{HER} H. Herzel, I. Gro$\beta$e, Physica {\bf A216}, 518 (1995).
928: 
929: \bibitem{LUO} Liaofu Luo, Weijiang Lee, Lijun Jia, Fengmin Ji, and Lu Tsai, Phys. Rev. {\bf E58}, 861 (1998).
930: 
931: \bibitem{PENG1}C. K. Peng, S.V. Buldyrev, A. L. Goldberger, S. Havlin,
932:            F. Sciortino, M. Simons, and H. E. Stanley, Nature (London),
933:            {\bf 356}, 168 (1992).
934: 
935: \bibitem{MAD} J. Maddox, Nature (London), {\bf 358}, 103 (1992).
936: 
\bibitem{NEE} S. Nee, Nature (London), {\bf 357}, 450 (1992).
938: 
\bibitem{CHA} C. A. Chatzidimitriou-Dreismann and D. Larhammar, Nature (London), {\bf 361}, 212 (1993).
940: 
941: \bibitem{PRA}V. V. Prabhu, and J. M. Claverie, Nature (London), {\bf 357}, 782 (1992).
942: 
943: \bibitem{KAR}S. Karlin and V. Brendel Science, {\bf 259}, 677 (1993).
944: 
945: \bibitem{STA} H. E. Stanley, S.V. Buldyrev, A. L. Goldberger, Z. D. Goldberg,
946:            S. Havlin, R. N. Mantegna, S. M. Ossadnik, C. K. Peng, and
947:            M. Simons, Physica {\bf A205}, 214 (1994).
948: 
949: \bibitem{BUL2} S.V. Buldyrev, N. V. Dokholyan, A. L. Goldberger,
950:            S. Havlin, C. K. Peng, H. E. Stanley and G. M. Visvanathan,
951:            Physica {\bf A249}, 430 (1998).
952: 
\bibitem{ARN1} A. Arneodo, E. Bacry, P. V. Graves and J. F. Muzy, Phys. Rev. Lett.,
954:            {\bf 74}, 3293 (1995).
955: 
\bibitem{ARN2} A. Arneodo, Y. D'Aubenton-Carafa, B. Audit, E. Bacry,
              J. F. Muzy, and C. Thermes, Physica {\bf A249}, 439 (1998).
958: 
959: 
960: \bibitem{MAN} R. N. Mantegna, S.V. Buldyrev, A. L. Goldberger,
961:            S. Havlin, C. K. Peng, M. Simons, and  H. E. Stanley,
           Phys. Rev. Lett., {\bf 73}, 333 (1994); Phys. Rev. {\bf E52}, 2939 (1995).
963: 
964: 
965: \bibitem{BUL3}  S.V. Buldyrev, A. L. Goldberger, S. V. Havlin, R. N. Mantegna,
966:                 M. E. Matsa, C. K. Peng, M. Simons, and H. E. Stanley,
967:                 Phys. Rev. {\bf E51}, 5084 (1995).
968: 
969: \bibitem{PENG2} C. K. Peng, S.V. Buldyrev, S. V. Havlin, M. Simons, H. E. Stanley,
970:                 and A. L. Goldberger, Phys. Rev. {\bf E49}, 1685 (1994).
971: 
972: \bibitem{AKM1} A. K. Mohanty, and A. V. S. S. Narayana Rao, Phys. Rev. Lett., {\bf 84}, 1832 (2000).
973: 
\bibitem{AKM2} A. K. Mohanty and S. K. Kataria, Phys. Rev. Lett., {\bf 73}, 2672 (1994);
           Phys. Rev. Lett., {\bf 75}, 2449 (1995); Phys. Rev. {\bf C53}, 887 (1996).
976: 
977: \bibitem{KLA} For a review see, J. Klafter, M. F. Shlesinger and G. Zumofen,
978:               Physics Today, {\bf 49}, 33 (1996); M. F. Shlesinger, J. Klafter
979:               and G. Zumofen, Am. J. Phys., {\bf 67}, 1253 (1999).
980: 
981: 
982: 
\bibitem{GAL} Bernaola-Galvan and P. Carpena, (To be published).
984: 
985: 
986: %\bibitem{ALL1} P. Allegrini, P. Grigolini and B. J. West, Phys. Rev. {\bf E54}, 4760 (1996).
987: %
988: %\bibitem{ALL2} P. Allegrini, M. Barbi, P. Grigolini and B. J. West,
989: %           Phys. Rev. E, {\bf 52}, 5281 {1995}
990: 
991: %\bibitem{ALL3} P. Allegrini, P. Grigolini and B. J. West,
992: %           Phys. Lett A, {\bf 211}, 217 {1996}
993: 
994: \end{thebibliography}
995: \end{document}
996: