
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%            typeset in RevTex.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\documentstyle[pre,aps,psfig,twocolumn]{revtex}
%\documentstyle[preprint,aps,psfig]{revtex}
\documentstyle[pre,aps,psfig]{revtex}
%\pagestyle{empty}
\begin{document}

%\input psfig

%\twocolumn[

%\hsize\textwidth\columnwidth\hsize\csname @twocolumnfalse\endcsname

% \draft command makes pacs numbers print
\draft

\title{Long range correlations in DNA sequences}
%\author{A. V. S. S. Narayana Rao and A. K. Mohanty$^*$}
\author{A. K. Mohanty and A. V. S. S. Narayana Rao$^*$}
\address{Nuclear Physics Division, Bhabha Atomic Research Centre, Mumbai-400085}
\address{$^*$Molecular Biology and Agriculture Division, Bhabha Atomic Research Centre, Mumbai-400085}

\maketitle

\begin{abstract}
The so-called long range correlation properties of DNA
sequences are studied using a variance analysis of the density
distribution of a single nucleotide or a group of nucleotides in a
model-independent way. This method, which was suggested earlier, is applied
to extract slope parameters that characterize the correlation properties of
several intron-containing and intronless DNA sequences. An important aspect
of all the DNA sequences is the property of complementarity, by virtue
of which any two complementary
distributions (for example, $GA$ is complementary to $TC$, and $G$ is
complementary to $ATC$)
have identical fluctuations at all scales although their distribution functions
need not be identical. Owing to this complementarity, the famous DNA walk
representation, whose statistical interpretation is still unresolved, is shown
to be a special case of the present formalism with a density distribution
corresponding to a purine or a pyrimidine group. Another interesting aspect
of most of the DNA sequences is that the factorial moments
as a function of length exceed unity around the region where the variance
versus length in a log-log plot shows a bending. This is a purely
phenomenological observation, found for several DNA sequences with a few
exceptions. Therefore, this
length scale has been used as an approximate measure to exclude the bending regions
from the slope analyses. The asymmetries in the nucleotide contents, or
the patchy structure, as a possible origin of the long range correlations have
also been investigated.
\end{abstract}

\pacs{PACS number(s): 87.14.Gg, 87.16.Ac, 05.10.-a}
%]

\section{INTRODUCTION}
Recently, there has been considerable interest in the finding of long range
correlations in genomic DNA sequences \cite{LI1}. A DNA sequence is a chain
of sites, each occupied by either a purine (adenine or guanine) or a
pyrimidine (cytosine or thymine). For mathematical modeling, the DNA
sequence may be considered as a string of symbols ($G$, $A$, $T$ and $C$) whose
correlation structure can be characterized completely by all possible base-base
correlation functions or their corresponding power spectra. Different techniques,
including mutual information functions and power spectrum analyses
\cite{LI1,LI2,LI3,LI4,VOSS,BUL1,BOR,LU,VIE}, autocorrelation \cite{AZB,HER,LUO}, the DNA
walk representation \cite{PENG1,MAD,NEE,CHA,PRA,KAR,STA,BUL2}, wavelet
analysis \cite{ARN1,ARN2} and Zipf analysis \cite{MAN}, have been used for statistical
analyses of DNA sequences. Despite this effort, it is still an
open question whether
the long range correlation properties are different for protein
coding (exonic) and noncoding (intronic, intergenic) sequences \cite{BUL3}.
On more fundamental grounds, there is a continuing debate as to whether
the reported long range correlations really mean a lack of independence at long
distances or simply reflect the patchiness (bias in nucleotide composition) of
DNA sequences. There have been attempts to eliminate local patchiness using
methods such as min-max \cite{PENG1}, detrended fluctuation analysis (DFA) \cite{BUL3,PENG2}
and wavelet analysis \cite{ARN1}. In spite of the success of the DNA walk in
modeling the long range correlations observed in DNA sequences, as indicated by the
power law increase
in the variance and the inverse power law spectrum \cite{VOSS,VIE}, the problem of its correct
statistical interpretation is still unresolved and is attracting
the attention of an increasing number of investigators. Since approaches
based on different models predict different correlation structures, there is
no unique measure of the degree of correlation in DNA sequences.
Therefore, it is important
to investigate the correlations and to extract the power law exponent $\alpha$
in a model-independent way, so that the interpretation of the data, including the
theoretical analysis, becomes more meaningful.
Another source of confusion in this field is the absence of a clear definition of the
term ``long range''. Clearly, what is considered to be long is relative to what
is considered to be short. To overcome some of these problems, we have recently
suggested a new method \cite{AKM1} to measure the degree of correlation
using the variance analysis of the density distribution of a single nucleotide
or a group of nucleotides. We have also suggested a way to find an approximate length
scale above which all DNA sequences show strong long range correlations irrespective
of their intron contents, while below it the correlation is relatively weak.
Further, the density distribution, which is nearly Gaussian at short distances,
shows significant deviations from Gaussian statistics at large distances.
In this paper, we present the details of
the analyses and
extract the correlation parameter $\alpha$ for several
intron-containing and intronless sequences.

\section{Density distribution and factorial moments}
In the present method, we build the frequency spectrum of a
single nucleotide or a group of nucleotides by dividing the DNA sequence into many
equal intervals of length $l$. For example, to build a purine spectrum,
we compute
\equation n=\sum_{i=l_0}^{l_0+l} u_i \endequation
where $u_i=1$ if site $i$ is occupied by a $G$ or an $A$ and $u_i=0$ otherwise.
Ideally, one can divide the entire DNA sequence of length $L$ into $m$
equal intervals of size $l$ $(l=L/m)$. The purine or $GA$ spectrum can be built
by computing $n$ from all the intervals. Alternatively, $n$ can be computed
in any segment between $l_0$ and $l_0+l$, and the spectrum (the $n$ distribution,
or $P_n$) is built by varying
the starting position $l_0$ over $1$, $2$, $3$, etc., up to $L-l$ so as to cover the whole
sequence\footnote{At short distances, $n$ can be zero
owing to the non-occurrence of a given nucleotide. In such cases, the density
spectrum can be built either including or excluding the zeroth channel. In this
analysis, we include the zeroth channel so that complementarity is
satisfied, which is not the case when the zeroth channel is excluded.
See Appendix B for details.}.
We adopt this second procedure for better statistics. Finally, the
standard deviation (SD) $\sigma$ of this $P_n$ distribution is obtained from
$\sigma^2=\langle n^2\rangle-n_0^2$, where $n_0=\langle n\rangle$ is the average
count; in general $\sigma$ depends on the interval or
window size $l$.
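
As an illustration of this sliding-window procedure, consider the toy sequence
$GATTACA$ ($L=7$), which is not taken from the data analysed here. The symbols $w$
and $M$ are introduced only for this example: $w$ is the number of consecutive sites
in a window and $M$ the number of window positions. With the purine assignment,
$u_i=(1,1,0,0,1,0,1)$; taking $w=3$ gives $M=5$ windows with counts
$n=2,1,1,1,2$, so that
$$ n_0=\frac{1}{M}\sum_{l_0} n(l_0)=\frac{7}{5}=1.4, \qquad
\langle n^2\rangle=\frac{1}{M}\sum_{l_0} n(l_0)^2=\frac{11}{5}=2.2, \qquad
\sigma^2=\langle n^2\rangle-n_0^2=0.24 . $$
The same counts give $P_1=3/5$ and $P_2=2/5$, with all other $P_n=0$.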


In addition to the variance $\sigma^2$, we also
compute the factorial moments $F_q$ of $P_n$.
The normalized factorial moments of order $q$ are written as
\equation F_q=\frac{f_q}{f_1^q} \endequation
where
\equation f_q=\sum_{n=q}^{\infty} P_n\, n(n-1)\cdots(n-q+1)
             =\sum_{n=q}^{\infty} \frac{n!}{(n-q)!} P_n . \endequation
As will be shown later, the factorial moments have a distinct advantage over
the ordinary moments in distinguishing a genomic sequence from a random one.
It may be mentioned here that for a Poisson distribution the normalized factorial
moments become unity for all $q$, i.e., for
\equation P_n=\frac{a^n e^{-a}}{n!} \endequation
the above expression for $f_q$ becomes (with $m=n-q$)
\begin{eqnarray}
f_q &=& \sum_{n=q}^{\infty} \frac{n!}{(n-q)!}\,\frac{a^n e^{-a}}{n!}
     = \sum_{n=q}^{\infty} \frac{a^n e^{-a}}{(n-q)!} \nonumber \\
    &=& \sum_{m=0}^{\infty} \frac{a^{m+q} e^{-a}}{m!}
     = a^q\sum_{m=0}^{\infty} \frac{a^m e^{-a}}{m!}
     = a^q ,
\end{eqnarray}
which, since $f_1=a$, gives $F_q=1$.
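
Two elementary consequences of the above definitions may help in reading the
factorial moment plots; they are standard results, given here only as a guide and
not as part of the analysis. First, for $q=2$,
$$ F_2=\frac{\langle n(n-1)\rangle}{n_0^2}=1+\frac{\sigma^2-n_0}{n_0^2}, $$
where $n_0=f_1$ is the mean count, so that $F_2$ exceeds unity exactly when the
variance $\sigma^2$ exceeds the mean $n_0$. Second, if the sites in a window of $w$
sites (a symbol introduced here for illustration) were occupied independently with
probability $p$, $P_n$ would be binomial and, for $q\le w$,
$$ f_q=\frac{w!}{(w-q)!}\,p^q, \qquad
F_q=\prod_{k=1}^{q-1}\left(1-\frac{k}{w}\right) \le 1, $$
which approaches unity from below as $w$ increases.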


In this work, we apply the above factorial moment
analysis (generally used to study the fluctuations during a phase transition
\cite{AKM2}) to study the dynamical fluctuations present in DNA sequences.

\section{Principle of complementarity}

A general property noticed for all genomic sequences of statistically
significant length, with a few exceptions, is that the distribution of any
single nucleotide or group of nucleotides with probability of occurrence $p$ has
the same width $\sigma$ as that of its complementary group, which has
probability of occurrence $(1-p)$, although the two have different distribution
functions. This implies that even a single-nucleotide distribution,
say the $G$ distribution, has the same $\sigma$ as the $ATC$ distribution, and
the $GA$ distribution has the same $\sigma$ as the $TC$ distribution.
Figure \ref{sd1} shows $\sigma$ versus $l$ for the $G$ and $GA$ distributions
(solid curves) of two typical sequences, $DROMHC$ (Drosophila melanogaster,
MHC, 22663 bps, $20.5\%$ $G$, $30.3\%$ $A$, $25.4\%$ $T$, $23.8\%$ $C$) and
$SC\_MIT$
(yeast mitochondrial DNA, $9.1\%$ $G$, $42.2\%$ $A$, $40.7\%$ $T$,
$8.0\%$ $C$).
As can be seen from the figure, the $G$ and $GA$ distributions have the same $\sigma$
at all scales as the $ATC$ and $TC$ distributions (filled circles), although
the distribution functions of the two complementary groups need not be identical.
This agreement is exact for most of the DNA sequences as well as for
random sequences. The exceptions are few: for example, the $\sigma$ values for the
$G$ and $ATC$ distributions of $SC\_MIT$ and of $E. Coli::TN10$ ($E. Coli$ with a
$TN10$ mobile transposon (9147 bps) at location 22000 bps) show $2\%$ to $3\%$
deviations at all scales, depending on the total
length of the sequences.
(This difference is not visible in figure \ref{sd1}
for $SC\_MIT$, as the deviation is insignificant over a large distance.)
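
The equality of the two widths can be understood from a simple counting argument,
sketched here under the assumption that every site in a window of $w$ sites ($w$
is a symbol introduced here for illustration) is occupied by exactly one of $G$,
$A$, $T$ and $C$. If $n$ is the count of a group and $\bar{n}$ the count of its
complementary group in the same window, then
\begin{eqnarray*}
\bar{n} &=& w-n, \\
\langle \bar{n}\rangle &=& w-n_0, \\
\langle (\bar{n}-\langle\bar{n}\rangle)^2\rangle &=& \langle (n-n_0)^2\rangle=\sigma^2 ,
\end{eqnarray*}
while the distribution functions are related by $\bar{P}_{\bar{n}}=P_{w-\bar{n}}$.
Plotted against the deviation from the mean, the complementary distribution is
therefore the mirror image of the original one, and the two coincide only if $P_n$
is symmetric about $n_0$.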

\begin{figure}
\centerline{\hbox{
\psfig{figure=sd1.eps,width=3.0in,height=3.2in}}}
\caption{The standard deviation $\sigma$ versus $l$ for the $G$ and $GA$ distributions
(solid curves). The top panel is
for $DROMHC$ (Drosophila melanogaster, MHC) and the bottom panel is for
$SC\_MIT$ (yeast mitochondrial DNA). The filled circles are for
the complementary $ATC$ and $TC$ distributions.
The curve $RW$ (dotted curve) corresponds to the
slope in the case of the random walk (see text for details). The curves are scaled up
appropriately for clarity.}
\label{sd1}
\end{figure}

Within the present formalism, we can also reproduce the result of the random walk
($RW$) model (see the Appendix for more details) by assigning
$u_i=1$ for the purine group ($G$ and $A$)
and $u_i=-1$ for the pyrimidine group ($T$ and $C$). However, unlike the random
walk model, in which $+1$ and $-1$ are interpreted as a step up and a
step down, here $P_n$ can be considered as the frequency distribution of $n$,
which gives the excess or deficit of purines over pyrimidines. The $\sigma$
versus $l$ curves obtained from this assignment are also shown in
figure \ref{sd1} (dotted curves labeled $RW$) for comparison. It is
interesting to note that the $RW$ curves show a parallel shift with respect
to the $GA$ or $TC$ curves, indicating that the $GA$ or $TC$ distributions and the $RW$
model have similar fluctuations at all scales. This is an interesting
observation, as we can now use the $GA$ or $TC$ distributions as alternatives
to the DNA walk representation to study the correlations. The advantage is that,
since $n$ now represents a count, the entire spectrum lies
on the non-negative side of the axis, unlike in the DNA walk model; this is
essential for computing
higher moments such as $F_q$ of the distributions.
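
The parallel shift has a simple origin, which can be sketched under the assumption
that each of the $w$ sites in a window ($w$ being a symbol introduced here for
illustration) carries exactly one of the four symbols. Writing $n_{GA}$ for the
purine count and $n_{RW}$ for the sum obtained with the $\pm 1$ assignment,
\begin{eqnarray*}
n_{RW} &=& n_{GA}-(w-n_{GA}) = 2n_{GA}-w, \\
\sigma_{RW} &=& 2\,\sigma_{GA}, \\
\log\sigma_{RW} &=& \log\sigma_{GA}+\log 2 ,
\end{eqnarray*}
so that on a log-log plot the two curves differ by a constant offset and yield the
same slope. Any additional scaling applied to the plotted curves for clarity changes
the offset but not the slope.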


It is also important to note that although the complementary distributions
have the same $\sigma$ at all scales, the distribution functions need not be
exactly identical.
Figure \ref{sd2} shows typical normalized density distribution functions $P_n$
of the two complementary distributions $G$ and $ATC$
for the above two sequences ($SC\_MIT$ and $DROMHC$)
as a function of $n-n_0$ (where $n_0$ is the average
count) at a typical length scale of
$l=150$ (left-hand curves). The right-hand curves
show the $P_n$ distributions (with the $x$-axis shifted by 100 for clarity)
for
two purely random sequences having the same length and nucleotide
contents as the $DROMHC$ and $SC\_MIT$ sequences.
Although the $\sigma$ versus $l$ plots are (nearly)
identical, i.e., both distributions have the same fluctuations at all scales,
the distribution functions are not identical.
This is an important characteristic of
a DNA sequence which is not found in a random one.
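
For orientation, the width expected for a purely random sequence at this length
scale can be estimated with a short calculation. The sketch assumes that the sites
are occupied independently with probability $p$ equal to the quoted nucleotide
fraction and that the window contains $l$ sites; the numbers below are worked out
here for illustration and are not taken from the original analysis. The count is
then binomial, and for $l=150$ it is well approximated by a Gaussian,
$$ P_n \simeq \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(n-n_0)^2}{2\sigma^2}\right],
\qquad n_0=lp, \qquad \sigma^2=lp(1-p). $$
For the $G$ distribution of $DROMHC$ ($p=0.205$) this gives $n_0\simeq 30.8$ and
$\sigma\simeq 4.9$, and for $SC\_MIT$ ($p=0.091$) it gives $n_0\simeq 13.7$ and
$\sigma\simeq 3.5$. Since $\sigma^2\propto l$, this reference case corresponds to
a width growing as $l^{0.5}$.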

\begin{figure}
\centerline{\hbox{
\psfig{figure=sd2.eps,width=4.0in,height=4.2in}}}
\caption{The complementary $G$ and $ATC$ density distributions at
a typical distance of $l=150$
for the above two sequences. The curves on the right
(shifted by $100$ units) show the corresponding
distributions for purely random sequences with the appropriate $G$, $A$, $T$
and $C$ contents.}
\label{sd2}
\end{figure}

\section{Extraction of slope parameter}
The long range correlations are generally studied using the relation
$\sigma \sim l^\alpha$, where the parameter $\alpha$ is extracted from the
$\sigma$ versus $l$ plot on a log-log scale. For a completely
random sequence, $\alpha \sim 0.5$; a deviation of $\alpha$ from $0.5$
indicates the presence of long range correlations. We have estimated $\sigma$
for the $G$, $A$, $T$, $C$ and $GA$ distributions of several DNA sequences and
found that the $\sigma$ versus $l$ plot on the log-log scale is not linear over
the entire length\footnote{We consider only the $G$, $A$, $T$ and $C$
distributions to extract the correlation parameters for the individual nucleotides,
and the $GA$ distribution to simulate the results of the random walk model.}.
Figure \ref{ec1} shows the $\sigma$ versus $l$ plot (bottom panel)
for a typical $E. Coli$ sequence of length $L=1.2$ Mbps (solid curves)
and $L=30$ Kbps (dotted curves). The top panel shows the factorial
moments for $q=2$, $3$, $4$ and $6$ for a typical $A$ distribution;
similar plots can be obtained for the other nucleotide distributions as well.
A general feature of the factorial moments of DNA sequences, with a few
exceptions, is that $F_q < 1.0$ for all $q$ at short distances and that $F_q$ exceeds
unity at some point, say $l_q$. This behavior is not found for a purely
random sequence, where $F_q$ is always $\le 1.0$. Further, the moments of different
order do not cross unity at exactly the same point, $l_q$ being larger for higher $q$.
However, this variation is insignificant on a very large scale if we
restrict ourselves to the lower moments, say up to $q=6$.

From these plots, and from several other studies,
we make the following observations: (i) the $\sigma$ versus $l$ plot is
not linear throughout, but starts bending around some region (say $l_c$,
which could be different for different distributions), indicating a change
of slope from $\alpha_1$ to $\alpha_2$; (ii) in most cases,
$\alpha_1$ deviates only weakly from $0.5$, whereas $\alpha_2$ deviates significantly
from $0.5$ and also depends on the sequence length $L$; (iii) the individual
nucleotide distributions may have stronger correlations than sums such as the $GA$
and $TC$ distributions or any other combinations.

\begin{figure}
\centerline{\hbox{
\psfig{figure=ec1.eps,width=4.0in,height=4.2in}}}
\caption{(a) The factorial moments $F_q$ versus $l$ for a typical $A$ distribution
of an $E. Coli$ sequence of length 1.2 Mbps. (b) The corresponding
$\sigma$ versus $l$ for $E. Coli$ sequences of length 1.2 Mbps (solid curves)
and of length 30 Kbps (dashed curves). The curves are scaled up appropriately for clarity.}
\label{ec1}
\end{figure}

Since $\sigma$ versus $l$ in the log-log plot starts bending around $l_c$,
we can extract the slope by dividing the entire length into two segments,
one for $l<l_c$ and the other for $l>l_c$. This can be done by examining
each case individually.
However, we have noticed an approximate correlation
between this bending region in the $\sigma$ versus $l$ plot
and the crossover
points $l_q$ of the corresponding factorial moments, i.e., the slope changes
around the same region where the factorial moments become unity. This
is a purely phenomenological observation, found for several DNA sequences as
listed in the tables, with a
few exceptions which we discuss below.
It may be mentioned here that although two complementary distributions
have the same fluctuations, they need not have identical factorial moments.
Figure \ref{lam} shows $F_q$ versus $l$ for the $A$ and $GTC$ distributions
of the $LAMCG$ sequence.
Since the two are complementary, they have
identical fluctuations at all scales (hence the same bending region), but the
crossover regions in the $F_q$ plots are different, being higher for the $GTC$ distribution
(owing to its larger average value $n_0$ at all scales). While the $l_q$ value of the
$A$ distribution shows an approximate correlation with the bending region of the
$\sigma$ versus $l$ plot, where a possible slope change occurs, the $l_q$
values of the $GTC$ distribution show no such correlation. This is true for any
pair of complementary distributions of $G$, $A$, $T$ and $C$, except for the $GA$
and $TC$ distributions, which
have nearly
overlapping crossover regions.

\begin{figure}
\centerline{\hbox{
\psfig{figure=lam.eps,width=4.0in,height=4.2in}}}
\caption{The factorial moments $F_q$ versus $l$ for $G$ and $ATC$ distributions
of $LAMCG$ sequence.}
\label{lam}
\end{figure}

Therefore, only the $l_q$ values of the $G$, $A$, $T$,
$C$ and $GA$ distributions are used as approximate length scales ($l_c$).
The entire length range is divided into
two parts, one for $0<l<l_{c1}$ and the other for $l_{c2}<l<L_{max}$, where $l_{c1}$
and $l_{c2}$ are the minimum and maximum of all the $l_c$ values corresponding to the
$G$, $A$, $T$, $C$ and $GA$ distributions. We take $L_{max}=L/30$, so that there are at
least $30$ independent data sets and the statistical analysis is
meaningful. Excluding the region $l_{c1}<l<l_{c2}$, we have extracted
$\alpha_1$ and $\alpha_2$, since the linearity in these two segments
is found to be extremely good in most
cases. The results are summarized in three tables, which cover
both intronless and
intron-containing sequences. The tables show the length $L$ of the sequence
used in the analyses, the crossover values $l_q$ (same as $l_c$),
the slope parameters $\alpha_1$
and $\alpha_2$, and the corresponding percentage nucleotide contents
$P$. A general observation is that the sequence is weakly
correlated at short distances, with $\alpha_1$ quite close to $0.5$, whereas
for $l>l_c$ the correlation is relatively stronger, with a larger value
$\alpha_2$.

We now discuss a few exceptions, such as $SC\_MIT$ and
$PODOT7$ ($T7$ bacteriophage, $39936$ bps). Figure \ref{pc1} shows the
factorial moments of typical $G$ distributions. In both cases, the factorial
moments do not have any crossover point.
In the case of $SC\_MIT$, the factorial moments are much higher than unity
even at small distances and decrease afterwards. Similar behavior
is found for the $C$ distribution. However, the $A$, $T$ and $GA$ distributions
do have $l_c$ points. Therefore, using $l_{c1} \sim 36$ and $l_{c2} \sim 184$,
we estimated $\alpha_1$ and $\alpha_2$ for the $G$, $A$, $T$, $C$ and $GA$ distributions,
which are listed in table III. The symbol $'*'$ indicates the absence of any critical
value. It is interesting to note that $\alpha_1$ is quite large,
and in some cases $\alpha_1 > \alpha_2$.
On the other hand, the factorial moments of a sequence like $PODOT7$ do not
reach unity at any scale. The absence of such a scale is indicated by
the symbol $'-'$ in table III. Sequences of this type behave like a purely random
one, with $\alpha$ values quite close to $0.5$. We have listed a few such
exceptional sequences in table III.
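
The text does not specify how the slopes are fitted; one common choice, given here
purely as an illustrative assumption, is an ordinary least-squares fit of a straight
line on the log-log plot. With $x_k=\log l_k$ and $y_k=\log\sigma(l_k)$ for the $N$
window sizes $l_k$ lying in a given segment (the symbols $x_k$, $y_k$, $N$ and $c$
are introduced here for illustration), the model $y=\alpha x+c$ gives
$$ \alpha=\frac{\sum_{k}(x_k-\bar{x})(y_k-\bar{y})}{\sum_{k}(x_k-\bar{x})^2},
\qquad \bar{x}=\frac{1}{N}\sum_k x_k, \qquad \bar{y}=\frac{1}{N}\sum_k y_k . $$
Applying this to the points with $0<l<l_{c1}$ would yield $\alpha_1$, and to the
points with $l_{c2}<l<L_{max}$ would yield $\alpha_2$. As a check of the arithmetic,
the hypothetical points $(l,\sigma)=(10,2)$, $(100,8)$ and $(1000,32)$ lie on a
straight line in the log-log plot and give $\alpha=\log_{10}4\simeq 0.60$.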

\begin{figure}
\centerline{\hbox{
\psfig{figure=pc1.eps,width=4.0in,height=4.2in}}}
\caption{The factorial moments $F_q$ versus $l$ for the $G$ distributions
of the $SC\_MIT$ (scaled up) and $PODOT7$ ($T7$ bacteriophage) sequences.}
\label{pc1}
\end{figure}

Further, we note that
the factorial moments of many sequences start decreasing at large distances.
In a few cases, the
factorial moments start decreasing even at very short distances.
Consequently, the slope also changes accordingly. However, we do not
assign any reason to this behavior, owing to insufficient statistics.

The slope $\alpha=0.5$ corresponds to normal diffusion,
as for a random Brownian trajectory. The basic idea of Brownian motion is that
of a random walk in which the position
of the walker after a time $t$ has a Gaussian probability distribution with
variance $\sigma^2$
proportional to $t$ ($\sigma \sim t^\alpha$ with $\alpha=0.5$).
However, nature shows
many examples of anomalous diffusion, characterized by a variance
which does not grow linearly in time \cite{KLA}.
In such cases, the diffusion is either accelerated, if $\alpha > 0.5$, or
dispersive, if $\alpha < 0.5$. As found in the analyses (see tables I and II),
$\alpha_2 > 0.5$ at large distances for most of the sequences irrespective of
their intron contents. However, a few sequences shown in table III
are not only peculiar in this respect but may also have an $\alpha$ which decreases
at large distances.
In such cases $\alpha<0.5$, which may indicate the influence of dispersive
dynamics. This aspect needs further investigation.
Finally, we add that $\alpha_1$ is close to $0.5$ for
most of the sequences at short distances (see tables I and II). Although $\alpha=0.5$
would suggest random behavior, this cannot be concluded from the
present analyses unless the short distance effects are taken into consideration
\cite{GAL}.

\section{Patchy sequences}

In the following, we investigate whether the mosaic character of DNA,
consisting of patches of different composition, can account for the apparent
long range correlations in DNA sequences \cite{KAR}. Chargaff's second parity
rule states that in a single strand $G \approx C$ and $T \approx A$.
However, asymmetries in base composition have been observed in many
sequences. A quantitative estimate of the $GC$ and $AT$ skews can be
obtained from the ratios $(G-C)/(G+C)$ (excess of $G$ over $C$
nucleotides) and $(A-T)/(A+T)$ (excess of $A$ over $T$ nucleotides).
This is operationally equivalent to estimating $n$ as defined in Eq.~(1), except
that $n$ now represents the quantity $(G-C)/(G+C)$
for the $GC$ skew and $(A-T)/(A+T)$
for the $AT$ skew in a fixed window of size
$l=L/20$. We consider $LAMCG$ as an example and plot $n$ (defined appropriately)
versus $l_0$, where
the starting position $l_0$ of the sliding window varies over $1$, $2$, $3$, etc.,
up to $L-l$. Figure \ref{skew} shows the $GC$ and $AT$ skews as a function
of $l_0$ for the $LAMCG$ sequence.
The plots show a change in the direction of the slope accompanying a change in the
sign of
the skew. The magnitude and nature of the skew can be assessed from the $V$
or inverted-$V$ shape of the curves.
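
Written out explicitly, with $n_X(l_0)$ denoting the number of nucleotides of type
$X$ in the window starting at $l_0$ (a notation introduced here for illustration),
the two skews are
$$ S_{GC}(l_0)=\frac{n_G(l_0)-n_C(l_0)}{n_G(l_0)+n_C(l_0)}, \qquad
S_{AT}(l_0)=\frac{n_A(l_0)-n_T(l_0)}{n_A(l_0)+n_T(l_0)} . $$
Both lie between $-1$ and $+1$ and vanish when Chargaff's second parity rule holds
exactly within the window. As a purely hypothetical example, a window containing
$330$ $G$ and $270$ $C$ nucleotides has $S_{GC}=60/600=0.1$.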

\begin{figure}
\centerline{\hbox{
\psfig{figure=patch.ps,width=4.0in,height=4.2in}}}
\caption{The $GC$ and $AT$ skews as a function of $l_0$ for the $LAMCG$ sequence.}
\label{skew}
\end{figure}

From the above plots, we can identify
the three well known
compositional domains of $LAMCG$, of size 22000 bps ($GA$ content 0.54), 17000
bps ($GA$ content 0.47) and 9000 bps ($GA$ content 0.54). We also consider
an artificial sequence generated by joining three random
patches of size 22000 bps, 17000 bps and 9000 bps, respectively, with the appropriate
$G$, $A$, $T$ and $C$ contents. In addition, we consider another heterogeneous sequence,
generated from $E. Coli$ DNA by
a mobile insertion of $TN10$ at location 22000 bps. The corresponding
random patches are of size 22000 bps, 9147 bps and 22000 bps, respectively\footnote{Please
note the distinction between a random sequence
generated by joining three random patches of total length $L$
and a purely random one of length $L$. Although both sequences have the same
percentage nucleotide contents over the length $L$,
the former is random only patch-wise.}.
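
A rough estimate indicates how a patch-wise random sequence can by itself produce
a change of slope. The following is only a sketch, under the assumptions that the
sites within each patch are occupied independently, that windows straddling a patch
boundary are neglected ($l$ much smaller than the patch sizes), and that the window
contains $l$ sites; the symbols $w_k$, $p_k$ and $\bar{p}$ are introduced here for
illustration. If patch $k$ covers a fraction $w_k$ of the sequence and has $GA$
content $p_k$, the $GA$ count has
\begin{eqnarray*}
\bar{p} &=& \sum_k w_k p_k , \\
\sigma^2 &=& l\sum_k w_k p_k(1-p_k) + l^2\sum_k w_k (p_k-\bar{p})^2 .
\end{eqnarray*}
The first term alone gives $\sigma\sim l^{0.5}$, whereas the second grows as $l^2$
and pushes the local slope toward unity. For the three patches quoted above
($w_k=22/48$, $17/48$, $9/48$ and $p_k=0.54$, $0.47$, $0.54$), one finds
$\bar{p}\simeq 0.515$, $\sum_k w_k p_k(1-p_k)\simeq 0.249$ and
$\sum_k w_k(p_k-\bar{p})^2\simeq 1.1\times 10^{-3}$, so that the two terms become
comparable at $l$ of order $2\times 10^{2}$.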

\begin{figure}
\centerline{\hbox{
\psfig{figure=lar.eps,width=4.0in,height=4.2in}}}
\caption{The $F_q$ versus $l$ of the $C$ distribution
for $LAMCG$ and for an artificial
sequence generated by joining three randomly generated patches
of size 22000 bps, 17000 bps and 9000 bps with the same $G$, $A$, $T$ and $C$
contents as $LAMCG$.}
\label{lar}
\end{figure}

Figure \ref{lar} shows the $F_q$ versus $l$ plot of a typical $C$ distribution
for $LAMCG$ and for the artificially generated sequence (random only patch-wise).
Interestingly, the factorial
moments behave similarly in both cases.
Figure \ref{rans1} shows the corresponding $\sigma$ versus $l$ plots for the real
and the artificially generated (from random patches) sequences.
Although the two agree in some cases, in general they are not identical at the
level of individual nucleotides, particularly at large distances (note that
the scale is highly compressed). This deviation
means that at large distances the density distribution functions
differ significantly because of their different widths.
So, at first sight, the $\sigma$ versus $l$ plots show that
the actual DNA sequences and the random-patch sequences need not have identical
slopes $\alpha$ (and hence widths $\sigma$) at large distances for all
the nucleotides, although they agree in some cases.
Even at short distances, where the DNA and the
random-patch
sequences have nearly identical widths $\sigma$, the full shapes
of the distributions need
not be identical. To demonstrate this, we invoke the principle of
complementarity mentioned before.

\begin{figure}
\centerline{\hbox{
\psfig{figure=rans1.eps,width=3.0in,height=3.2in}}}
\caption{The standard deviation $\sigma$ versus $l$ for the $G$, $A$, $T$, $C$ and $GA$
distributions. (a) $LAMCG$ and an artificial
sequence generated by joining three randomly generated patches
of size 22000 bps, 17000 bps and 9000 bps with the same $G$, $A$, $T$ and $C$
contents as $LAMCG$. (b) $E. Coli$ with a $TN10$ mobile
transposon (9147 bps) at location 22000 bps. The three random patches
are of size 22000 bps, 9147 bps and 22000 bps with the appropriate
$G$, $A$, $T$ and $C$ contents.}
\label{rans1}
\end{figure}

Figure \ref{fig5}(a)
shows the $G$ and $ATC$ distributions (leftmost) for the $LAMCG$ sequence at
$l=300$. Notice that
although the $\sigma$ versus $l$ plots are identical, i.e., both distributions have the same
fluctuations at all scales, the distribution functions are not the same. Such
differences are not found for a truly random sequence (rightmost). The middle
curves correspond to the artificially generated patch-wise random
sequence. Although the artificially
generated sequence mimics the real sequence to some extent, it is not fully
capable of reproducing the characteristics of the real sequence. Figure
\ref{fig5}(b)
shows another comparison, for the $A$ and $GTC$
distributions of the $E. Coli::TN10$ sequence. This discrepancy becomes more
prominent at higher $l$ values, where the artificially generated sequence cannot
reproduce the real distributions.

\begin{figure}
\centerline{\hbox{
\psfig{figure=fig5.eps,width=3.0in,height=3.2in}}}
\caption{The density distribution $P_n$ versus $n-n_0$ (where
$n_0$ is the average count) for a real DNA sequence (leftmost),
for an artificially generated sequence (middle) and for a completely
random sequence (rightmost), shown for two complementary
distributions. (a) $LAMCG$ and (b) $E. Coli::TN10$.}
\label{fig5}
\end{figure}

\section{Density distributions}

In \cite{AKM1}, we demonstrated that the density distribution $P_n$
is Gaussian at short distances and starts deviating from the Gaussian form as the distance
increases. Figure \ref{den} shows another example, where $P_n$ is
plotted for two complementary distributions at $l=25$, $100$ and $200$.
The complementary distributions are nearly identical at short
distances and coincide with the random distribution, whereas the $P_n$ distributions
for $G$, $ATC$ and the purely random sequence are all different at larger distances.

565:

566: \begin{figure}

567: \centerline{\hbox{

568: \psfig{figure=den.ps,width=3.0in,height=3.2in}}}

569: \caption{The density distribution $P_n$ versus $n-n_0$ (where

570: $n_0$ is average density) for $LAMCG$ sequence at $l=25$, $100$

571: and $200$. The solid and the dashed curves are for the $G$ and

572: $ATC$ distributions, respectively, whereas the dotted curve is for a

573: purely random sequence.}

574: \label{den}

575: \end{figure}

576:

577:

578: Thus, irrespective of intron contents, most of the sequences follow Gaussian

579: statistics at short distances. However, at large distances, the statistics

580: deviate significantly from Gaussian behavior.
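As an illustration of how such a comparison can be set up, the following is a minimal sketch and not the code used for the figures. It assumes that $P_n$ is built from all overlapping windows of length $l$ and compares it with a Gaussian of the same mean and variance. The function names and the uniformly random test sequence are hypothetical.

\begin{verbatim}
```python
import math
import random
from collections import Counter

def density_distribution(seq, bases, l):
    """P_n: fraction of length-l windows holding n sites drawn from `bases`."""
    hits = [1 if b in bases else 0 for b in seq]
    count = sum(hits[:l])
    tally = Counter([count])
    # Slide the window one site at a time, updating the count incrementally.
    for start in range(1, len(seq) - l + 1):
        count += hits[start + l - 1] - hits[start - 1]
        tally[count] += 1
    total = sum(tally.values())
    return {n: c / total for n, c in sorted(tally.items())}

def gaussian_reference(p_n):
    """Gaussian with the same mean and variance as P_n, evaluated at each n."""
    mean = sum(n * p for n, p in p_n.items())
    var = sum((n - mean) ** 2 * p for n, p in p_n.items())
    norm = 1.0 / math.sqrt(2.0 * math.pi * var)
    return {n: norm * math.exp(-((n - mean) ** 2) / (2.0 * var)) for n in p_n}

# Assumption: an uncorrelated test sequence with equal base probabilities.
random.seed(2)
seq = "".join(random.choice("GATC") for _ in range(20000))
p_n = density_distribution(seq, "G", 25)
ref = gaussian_reference(p_n)
max_gap = max(abs(p_n[n] - ref[n]) for n in p_n)
```
\end{verbatim}

For a real sequence one would replace the random test string by the GenBank entry and repeat the comparison at several values of $l$; a growing gap at large $l$ would correspond to the deviation from Gaussian behavior described above.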

581:

582: \section {Conclusions}

583: In conclusion, we have extended our previous work to extract the slope

584: parameter $\alpha$ for several intron-containing and intronless DNA sequences.

585: The advantage of the present method is that the variance analysis

586: can be applied to any individual or group of nucleotides. We believe that the

587: individual nucleotides provide a more fundamental measure of the correlation

588: than any combination or group (like the DNA walk representation) where the

589: effects may get reduced or washed out. Another interesting aspect is

590: that the (lower) factorial moments of most of the DNA sequences cross unity in

591: a very narrow region in $l$ where the $\sigma$ versus $l$ plot on the log-log scale

592: also shows a bend. Although a formal justification for this correlation

593: has not been provided, we have used this scale as an approximate measure

594: to exclude the bending regions from the slope analyses. Based on this scale,

595: we divide the DNA sequence into two segments to extract the slope parameters.

596: It is found that below this scale, the correlation is weak and the DNA

597: statistics is essentially Gaussian while above this all DNA sequences show

598: strong long range correlations irrespective of their intron contents with a

599: significant deviation from the Gaussian behavior. It may be mentioned here

600: that the controversies that exist in this field of research are primarily

601: due to different approaches that are adopted in various models. In this context,

602: our analysis is model independent as it only involves the counting of an individual

603: or a group of nucleotides in a given length to build the density distribution.

604: In this work, we do not advocate any specific model,

605: although the extracted slope parameters indicate the presence of anomalous

606: diffusion of both enhanced and dispersive nature. Instead, we

607: provide an elegant tool to measure

608: the degree of correlations unambiguously so that the interpretation of

609: the data including theoretical analyses will become more meaningful. This work will

610: also provide further impetus to develop models for the understanding of

611: the DNA dynamics.

612:

613:

614:

615:

616:

617:

618: \begin{table}

619: \squeezetable

620: \caption{Summary of the correlation analysis of intron containing sequences.

621: $l_c$ is the characteristic length scale.

622: $\alpha_1$ is the slope parameter for $l<l_{c1}$ and $\alpha_2$ is the slope parameter for

623: $l_{c2} < l < l_{max}$, where $l_{c1}$ and $l_{c2}$ are the minimum and the

624: maximum of all the $l_c$, and $l_{max}=L/30$, where $L$ is the total length of the

625: sequence. The acronym in column 1 is the GenBank name of the sequence. Since

626: the factorial moments for all $q$ do not cross at exactly the same point,

627: we have chosen the $l_c$ at which $F_q$ for $q=2,3,4$ and $6$ approach unity

628: simultaneously. $P$ denotes percentage of $G$, $A$, $T$ and $C$ in the sequence.

629: The crossover point $l_c$ has not been fine-tuned; it is only approximate.}

630:

631: \begin{tabular}{|c|c|c| c |c| c| c|c|}

632:

633: Sequence& L& $l_c$, $\alpha$ &G&A&T&C&GA\\

634: \hline

635: Human $\beta$-globin & 73,308  & $l_c$    & 12   & 14   & 14   & 14   & 32\\

636: (Chromosomal region) &          & $ \alpha_1$   &0.640 &0.644 &0.671 &0.620 &0.652\\

637: HUMHBB               &          & $\alpha_2$ &0.703 &0.783 &0.812 &0.655 &0.758\\

638:                      &          &P           &20.2  & 30.1 &30.4 & 19.3 & 50.3\\

639: \hline

640: Adenovirus type 2    & 35,937  & $l_c$    & 24   & 12   & 12   & 36   &132\\

641: (Intron containing)   &         &$\alpha_1$     &0.598 &0.586 &0.567 &0.583 &0.564\\

642: ADRCG                &         &$\alpha_2$     &0.862 &0.815 &0.816 &0.758 &0.661\\

643:                      &          &P           &27.3  & 23.2 &21.6 & 27.9 & 50.5\\

644: \hline

645: Chicken embryonic MHC& 31,111  &$l_c$     & 24   & 36   &  14   & 28   &48\\

646: (Gene)               &      &$\alpha_1$        &0.644 &0.578 &0.658  &0.581 &0.623\\

647: CHKMYHE              &      &$\alpha_2$        &0.775 &0.698 &0.800   &0.715 &0.762\\

648:                      &          &P           &22.2  & 31.3 &26.7 & 19.8 & 53.5\\

649: \hline

650: Human $\beta$-cardiac MHC& 28,438  &$l_c$     & 16   & 16   &  10   & 18   &20\\

651: (Gene)               &      &$\alpha_1$        &0.638 &0.579 &0.627  &0.620 &0.664\\

652: HUMBMYH7              &      &$\alpha_2$        &0.681 &0.663 &0.700   &0.673 &0.688\\

653:                      &          &P           &25.9  & 23.6 &23.0 & 27.5 & 49.5\\

654: \hline

655: Drosophila melanogaster MHC& 22,663  &$l_c$     & 20   & 20   &  14   & 36   &156\\

656: (Gene)               &      &$\alpha_1$        &0.648 &0.594 &0.644  &0.562 &0.569\\

657: DROMHC                     &      &$\alpha_2$        &0.820  &0.652 &0.798  &0.707 &0.719\\

658:                      &          &P           &20.5  & 30.3 &25.4 & 23.8 & 50.8\\

659: \hline

660: Chicken c-myb oncogene    & 8200  &$l_c$& 14  & 10   &  10   & 12   &48\\

661: (Gene)              &    &$\alpha_1$      &0.663 &0.661 &0.688  &0.670 &0.645\\

662: CHKMYB15            &    &$\alpha_2$      &0.749 &0.873 &0.752  &0.852 &0.550\\

663:                      &          &P           &28.4  & 21.9 &23.5 & 22.2 & 50.3\\

664:

665: \end{tabular}

666: \end{table}

667:

668:

669: \begin{table}

670: %\squeezetable

671:

672: \caption{Same as Table I, but for intronless sequences.

673: For $E. Coli$,

674: $l_{max}$ is chosen as 120,0000 bps. The data is taken from the site

675: {\bf http://www.ncbi.nlm.nih.gov}.}

676:

677:

678: \begin{tabular}{|c|c|c| c |c| c| c|c|}

679: Sequence& L& $l_c$, $\alpha$ &G&A&T&C&GA\\

680: \hline

681: $E. Coli K12$           & 1200000&$l_c$        & 100  & 32   &  32   & 92   &684\\

682:                        &    &$\alpha_1$        &0.535 &0.542 &0.549  &0.532 &0.529\\

683:                        &    &$\alpha_2$        &0.665 &0.639 &0.664  &0.674 &0.614\\

684:                        &    &$\alpha_2$        &0.654 &0.654 &0.655  &0.715   &0.563\\

685:                  &    &P                 &27.2  &23.6  &24.2   &25.0  & 50.8\\

686: \hline

687: H. Influenzae                    & 240000&$l_c$        &  52  & 48   &  56   & 52   &214\\

688:                        &    &$\alpha_1$        &0.542 &0.552 &0.543  &0.547 &0.543\\

689:                        &    &$\alpha_2$        &0.720 &0.712 &0.635  &0.770 &0.709\\

690:                  &    &P                 &17.9  &31.6  &30.7   &19.8  & 49.5\\

691: \hline

692: Bacillus subtilis                  & 3840x60&$l_c$        &  80  & 40   &  22   & 132   &274\\

693:                        &    &$\alpha_1$        &0.538 &0.545 &0.550  &0.508 &0.536\\

694:                        &    &$\alpha_2$        &0.815 &0.770 &0.816  &0.779 &0.766\\

695:                  &    &P                 &24.5  &29.5  &26.5   &19.5  & 54.0\\

696: \hline

697: Mycobacterium                 & 9665x60&$l_c$        &  20  & 64   &  44   & 24   &136\\

698: tuberculosis                       &    &$\alpha_1$        &0.549 &0.535 &0.548  &0.540 &0.542\\

699:                  &    &$\alpha_2$        &0.827 &0.681 &0.826  &0.765 &0.791\\

700:                  &    &P                 &15.92  &34.57  &33.73   &15.78  & 50.49\\

701: \hline

702: Cyano bacterium                   & 4166x60&$l_c$     &  32  & 40   &  28   & 24   &304\\

703:                        &    &$\alpha_1$        &0.545 &0.532 &0.542  &0.541 &0.535\\

704:                        &    &$\alpha_2$        &0.730 &0.678 &0.763  &0.733 &0.587\\

705:                  &    &P                 &24.1  &26.0  &26.0   &23.9  & 50.1\\

706: \hline

707: Schizosaccharomyces    & 19431      &$l_c$     & 32   & 60   & 80    &304  &160\\

708: Mitochondrion          &    &$\alpha_1$        &0.547 &0.561 &0.568  &0.504  &0.543\\

709: NC-001326              &    &$\alpha_2$        &0.698 &0.690 &0.774  &0.465  &0.773\\

710:                        &    & P                &15.8  &33.8  &36.1   &14.3   &49.6 \\

711: \hline

712: Human Cytomegalovirus  & 229354 &$l_c$     & 36   & 10   & 10    & 32   &148\\

713: Strain AD169                &    &$\alpha_1$        &0.582 &0.588 &0.596  &0.581 &0.575\\

714: HEHCMVCG                    &    &$\alpha_2$        &0.806 &0.799 &0.800   &0.800  &0.682\\

715: \hline

716: dmal                   &889x60&$l_c$      & 20   & 12   & 12    & 22   &68\\

717:                        &    &$\alpha_1$        &0.575 &0.628 &0.599  &0.559 &0.60\\

718:                        &    &$\alpha_2$        &0.730 &0.782 &0.602  &0.720 &0.596\\

719: \hline

720: Chicken nonmuscle MHC       &7003  &$l_c$     & 96  & 72   & 12   & 28   &  64\\

721: (cDNA)                 &    &$\alpha_1$        &0.573 &0.538 &0.569  &0.554    &0.627\\

722: CHKMYHN                &    &$\alpha_2$        &0.722 &0.833  &0.841  &0.601   &0.842\\

723:                  &    &P                 &27.0  &31.2  &20.6   &21.2  & 58.2\\

724: \hline

725: Bacteriophage $\lambda$& 48,502&$l_c$     & 56   & 36   &  18   &124   &168\\

726: (Intronless virus)     &    &$\alpha_1$        &0.563 &0.541 &0.598  &0.513 &0.550\\

727: LAMCG                  &    &$\alpha_2$        &0.935 &0.819 &0.911  &0.810 &0.866\\

728:                  &    &P                 &26.4  &25.4  &24.7   &23.5  & 51.8\\

729: \hline

730: Human dystrophin       & 13,957&$l_c$     & 136  & 56   &  14   & 22   &128\\

731: (cDNA)                 &    &$\alpha_1$        &0.530 &0.552 &0.569  &0.552 &0.544\\

732: HUMDYS:M18533          &    &$\alpha_2$        &0.738 &0.634 &0.777  &0.720 &0.725\\

733:                        &    &     P            &22.4  &33.0  & 24.7  &19.9  &55.4\\

734:

735: \end{tabular}

736: \end{table}

737:

738:

739: \begin{table}

740: %\squeezetable

741:

742: \caption{Same as table II.

743: The symbol $*$ indicates that the factorial moments are larger

744: than unity even at very short distance, whereas $-$ indicates that the factorial

745: moments do not reach unity.}

746:

747: \begin{tabular}{|c|c|c| c |c| c| c|c|}

748: Sequence& L& $l_c$, $\alpha$ &G&A&T&C&GA\\

749: \hline

750: SC-MIT                 & 85779 &$l_c$     & *  & 36   & 36    & *  &184\\

751: Nc-001224              &    &$\alpha_1$        &0.732 &0.697 &0.680  &0.720  &0.578\\

752:                        &    &$\alpha_2$        &0.698 &0.540 &0.747  &0.508  &0.730\\

753:                        &    & P                &9.1   &42.2  &40.7   &8.0    &51.3 \\

754: \hline

755: Pichia canadensis      & 27694      &$l_c$     & *    & 36   & 64    &*  &96\\

756: Mitochondrion          &    &$\alpha_1$        &0.654 &0.688 &0.624  &0.615  &0.620\\

757: NC-001762              &    &$\alpha_2$        &0.662 &0.755 &0.784  &0.660  &0.801\\

758:                        &    & P                &10.2  &41.6  &40.2   &8.0    &51.84 \\

759: \hline

760: Ti(Plasmid)            &24595  &$l_c$     & 76  & 24   & 32    & 40   & -\\

761:                        &    &$\alpha_1$        &0.543 &0.564 &0.552  &0.586    &0.508\\

762:                        &    &$\alpha_2$        &0.706 &0.700 &0.676  &0.728   &0.433\\

763:                  &    &P                 &23.5  &26.6  &27.5   &22.4  & 50.1\\

764: \hline

765: BacteriophageT7                 &39937  &$l_c$     & -  & 116   & 884    & 1284   &-\\

766: NC-001604               &    &$\alpha_1$ ($l<116$)        &0.526 &0.571 &0.529  &0.530    &0.530\\

767:                        &  &$\alpha_2$ ($116<l<1330$)        &0.560 &0.587  &0.590  &0.566   &0.551\\

768:                  &    &P                 &25.8  &27.2  &24.4   &22.6  & 53.0\\

769: \hline

770: Tyorg                  & 196x60 &$l_c$     &  - & 96 &- &36 & 96\\

771:                        &        &$\alpha_1$&0.491 & 0.560 & 0.515 & 0.620 & 0.587\\

772:                        &        &$\alpha_2$&0.370 & 0.715 & 0.514 & 0.799 & 0.704\\

773:                  &    &P                 &16.0  &35.9  &26.7  &21.4  & 51.9\\

774: \end{tabular}

775: \end{table}

776:

777:

778:

779: \appendix

780: \renewcommand{\thefigure}{A\arabic{figure}}

781: \section *{Random walk model}

782:

783: The method of DNA walks, first suggested by Peng et al.\ \cite{PENG1}, is based

784: on the rule that the walker moves either up $(u_i=1)$ or down $(u_i=-1)$ at each

785: step $i$ of the walk. This is the case of a correlated random walk and differs

786: from an uncorrelated walk where the direction of each step is independent of the

787: previous steps. Further, they assign $u_i=1$ if a pyrimidine occurs at the site

788: $i$ whereas $u_i=-1$ if the site contains a purine.

789: The net displacement $(y)$ of the walker after $l$ steps is defined

790: as

791: \equation

792: y(l)=\sum_{i=1}^l u_i

793: \endequation

794: The standard deviation of the above quantity can be estimated from

795: \equation

796: \sigma^2(l,L)=\frac{1}{L-l} \sum_{l_0=1}^{L-l} (\Delta y(l_0,l)-{\bar {\Delta y(l)}})^2

797: \endequation

798: where $L$ is the number of nucleotides in the entire sequence and

799: \equation

800: {\bar {\Delta y(l)}}=\frac{1}{L-l} \sum_{l_0=1}^{L-l} \Delta y(l_0,l)

801: \endequation

802: where $\Delta y(l_0,l)=y(l_0+l)-y(l_0)$.

803: It was found \cite{PENG1} that the fluctuations can be approximated by

804: \equation

805: \sigma(l,L) \sim l^\alpha

806: \endequation

807: where $\alpha$ is the correlation exponent. For $\alpha$ close to $0.5$, there

808: is no correlation or only short range correlation in the sequence. If $\alpha$

809: is significantly different from $0.5$, it indicates long range correlations.
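The definitions above translate directly into a short computation. The following is a minimal sketch under stated assumptions, not the implementation of \cite{PENG1}: the function names are hypothetical, $\alpha$ is taken as the least-squares slope of $\log \sigma$ versus $\log l$, and the test input is an artificially generated uncorrelated sequence, for which $\alpha$ should come out close to $0.5$.

\begin{verbatim}
```python
import math
import random

PYRIMIDINES = {"C", "T"}

def dna_walk(seq):
    """Displacement y(0..L): u_i = +1 for a pyrimidine, -1 for a purine."""
    y = [0]
    for base in seq:
        y.append(y[-1] + (1 if base in PYRIMIDINES else -1))
    return y

def sigma(y, l):
    """Standard deviation of Delta y(l0, l) = y(l0 + l) - y(l0), l0 = 1..L-l."""
    L = len(y) - 1
    deltas = [y[l0 + l] - y[l0] for l0 in range(1, L - l + 1)]
    mean = sum(deltas) / len(deltas)
    return math.sqrt(sum((d - mean) ** 2 for d in deltas) / len(deltas))

def fit_alpha(y, lengths):
    """Least-squares slope of log(sigma) versus log(l) (an assumed fit choice)."""
    xs = [math.log(l) for l in lengths]
    ys = [math.log(sigma(y, l)) for l in lengths]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (v - my) for x, v in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Assumption: an uncorrelated sequence with equal base probabilities.
random.seed(0)
seq = "".join(random.choice("GATC") for _ in range(20000))
alpha = fit_alpha(dna_walk(seq), [4, 8, 16, 32, 64])
```
\end{verbatim}

The fit range is a free choice in this sketch; for a real sequence it would have to be restricted to a region where the log-log plot is approximately straight.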

810:

811:

812: \appendix

813: \setcounter {figure}{0}

814: \renewcommand{\thefigure}{B\arabic{figure}}

815: \section *{B}

816:

817:

818: In the previous analyses, we account for the non-occurrence of a particular

819: nucleotide. This is operationally equivalent to building the density spectrum

820: $P_n$ including $n=0$. If the nucleotide compositional asymmetry is quite large, as in

821: $SC\_MIT$, the occurrence $n$ can be zero for some nucleotides particularly at

822: short distances. Therefore, we can build the $P_n$ distribution either including

823: or excluding the zero$^{th}$ channel. Figure \ref{ap1}(a) shows the comparison

824: of the $\sigma$ versus $l$ plot for two complementary distributions corresponding

825: to a $LAMCG$ sequence both with (top panel where $G$ and $ATC$ distributions have

826: identical slopes at all scales) and without (bottom panel)

827: inclusion of the $n=0$ channel in the $P_n$ spectra. Interestingly, without the

828: $n=0$ channel the complementarity relation is not satisfied, particularly at

829: short distances. However, the difference does not exist at larger distances

830: where always $n>1$. Figure \ref{ap1}(b) shows another example of $F_q$ versus

831: $l$ plot for a typical $SC\_MIT$ sequence. The spectrum built without the

832: $n=0$ channel behaves differently from the one that includes it (compare

833: with figure \ref{pc1}, where $F_q$ versus $l$ has no crossover).
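The effect of the zero$^{th}$ channel on complementarity can be reproduced with a small numerical experiment. This is only a sketch under assumptions: the function names are hypothetical, the test sequence is artificially generated with a $G$-poor composition loosely in the spirit of $SC\_MIT$, and excluding the $n=0$ channel is modelled by simply dropping every window in which the counted group does not occur.

\begin{verbatim}
```python
import math
import random

def window_counts(seq, bases, l):
    """Number of sites occupied by one of `bases` in each window of length l."""
    hits = [1 if b in bases else 0 for b in seq]
    count = sum(hits[:l])
    counts = [count]
    for start in range(1, len(seq) - l + 1):
        count += hits[start + l - 1] - hits[start - 1]
        counts.append(count)
    return counts

def sigma(counts, include_zero=True):
    """Standard deviation of n, optionally dropping the n = 0 channel."""
    kept = [n for n in counts if include_zero or n > 0]
    mean = sum(kept) / len(kept)
    return math.sqrt(sum((n - mean) ** 2 for n in kept) / len(kept))

# Assumption: hypothetical G-poor composition (10% G, 40% A, 40% T, 10% C).
random.seed(1)
seq = "".join(random.choices("GATC", weights=[10, 40, 40, 10], k=5000))
l = 8
g = window_counts(seq, "G", l)
atc = window_counts(seq, "ATC", l)
with_zero = (sigma(g), sigma(atc))
without_zero = (sigma(g, include_zero=False), sigma(atc, include_zero=False))
```
\end{verbatim}

With the $n=0$ channel included, the two counts satisfy $n_G + n_{ATC} = l$ in every window, so the two values in \verb|with_zero| agree. Once windows with $n_G=0$ are dropped from the $G$ spectrum only, the two spectra are built from different sets of windows and the values in \verb|without_zero| differ at this short $l$.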

834:

835:

836: \begin{figure}

837: \centerline{\hbox{

838: \psfig{figure=ap1.eps,width=3.0in,height=3.2in}}}

839: \caption{ (a) The variance $\sigma$ versus $l$ for $G$ (solid curves)

840: and $ATC$ distributions (dotted curves) for $LAMCG$ sequence.

841: Top panel is

842: for the distribution for which complementarity is preserved

843: while complementarity is not satisfied in the bottom panel, particularly

844: at small distances. (b) $F_q$ versus $l$ plot for $G$ distribution of $SC\_MIT$

845: for the case when complementarity is not preserved.

846: The curves are scaled up appropriately for

847:  better clarity.}

848: \label{ap1}

849: \end{figure}

850:

851: Since the spectrum behaves differently when zero$^{th}$ channel is not included,

852: we have analysed the spectrum of three typical sequences listed in the table below.

853: Notice now that while $\alpha_2$ values are essentially same as before, the

854: $\alpha_1$ values are quite different. In fact, we have noticed a general

855: trend where $\alpha_1$ is higher than the previous values although the corresponding

856: density distributions do not deviate significantly from the Gaussian behavior

857: at short distances. However, in the previous analysis, we always include the

858: zero$^{th}$ channel so that the complementarity property is satisfied at all

859: scales. Moreover, we also found a correlation between $\alpha$ and Gaussian

860: statistics, namely the deviation of $\alpha$ from $0.5$ also shows a

861: corresponding deviation of $P_n$ distribution from Gaussian behavior.

862: For example, in case of $SC\_MIT$, the $\alpha$ is quite large at a short

863: distance. Accordingly, the $P_n$ distribution also shows strong deviation from

864: the Gaussian statistics. However, this is not necessarily true when

865: complementarity is not preserved while building the spectrum.

866: At short distances, the deviation of $\alpha$ from $0.5$

867: does not always mean a strong deviation from the Gaussian statistics.

868:

869:

870: \begin{table}

871: %\squeezetable

872:

873: \caption{The slope parameters for three typical sequences where the complementarity

874: is not preserved.}

875:

876: \begin{tabular}{|c|c|c| c |c| c| c|c|}

877: Sequence& L& $l_c$, $\alpha$ &G&A&T&C&GA\\

878: \hline

879: Bacteriophage $\lambda$& 48,502&$l_c$     & 56   & 36   &  18   &124   &168\\

880: (Intronless virus)     &    &$\alpha_1$        &0.720 &0.670 &0.740  &0.680 &0.580\\

881: LAMCG                  &    &$\alpha_2$        &0.935 &0.819 &0.910  &0.800 &0.860\\

882:                  &    &P                 &26.4  &25.4  &24.7   &23.5  & 51.8\\

883: \hline

884: SC-MIT                 & 85779 &$l_c$     & 14  & 36   & 40    & 12  &184\\

885: Nc-001224              &    &$\alpha_1$        &0.703 &0.760 &0.750  &0.700  &0.630\\

886:                        &    &$\alpha_2$        &0.694 &0.540 &0.750  &0.510  &0.730\\

887:                        &    & P                &9.1   &42.2  &40.7   &8.0    &51.3 \\

888: \hline

889: BacteriophageT7                 &39937  &$l_c$     & -  & 116   & 884    & 1284   &-\\

890: NC-001604               &    &$\alpha_1$ ($l<116$)        &0.560 &0.610 &0.570  &0.570    &0.530\\

891:                        &  &$\alpha_2$ ($116<l<1330$)        &0.560 &0.587  &0.590  &0.566   &0.551\\

892:                  &    &P                 &25.8  &27.2  &24.4   &22.6  & 53.0\\

893: \end{tabular}

894: \end{table}

895:

896:

897:

898:

899:

900:

901: \begin{thebibliography}{99}

902:

903: \bibitem{LI1} For a review on long range correlation in DNA sequences,

904:               see for example, W. Li, Computers Chem, {\bf 21}, 257 (1997);

905:               http://linkage.rockefeller.edu/wli/dna\_corr.html

906:

907: \bibitem{LI2} W. Li, Int. Journal of Bifurcation and Chaos, {\bf 2(1)}, 137 (1992).

908:

909: \bibitem{LI3} W. Li and K. Kaneko, Euro Phys. Lett, {\bf 17}, 655 (1992).

910:

911: \bibitem{LI4} W. Li, T. Marr and K. Kaneko, Physica {\bf D75}, 392 (1994).

912:

913: \bibitem{VOSS} R. F. Voss, Phys. Rev. Lett., {\bf 68}, 3805 (1992); Fractals {\bf 2}, 1 (1994).

914:

915: \bibitem{BUL1} S.V. Buldyrev, A. L. Goldberger, S. Havlin, C. K. Peng,

916:            M. Simons, F. Sciortino and H. E. Stanley, Phys. Rev. Lett.,

917:            {\bf 71}, 1776 (1993).

918:

919: \bibitem{BOR} B. Borstnik, D. Pumpernik, and D. Lukman, Euro phys. Lett., {\bf 23}, 389 (1993).

920:

921: \bibitem{LU} X. Lu, Z. Sun, H. Chen, and Y. Li, Phys. Rev. {\bf E58}, 3578 (1998).

922:

923: \bibitem{VIE} M. de Vieira, Phys. Rev. {\bf E60}, 5932 (1999).

924:

925: \bibitem{AZB} M. Ya. Azbel, Phys. Rev. Lett., {\bf 75}, 168 (1995).

926:

927: \bibitem{HER} H. Herzel, I. Gro$\beta$e, Physica {\bf A216}, 518 (1995).

928:

929: \bibitem{LUO} Liaofu Luo, Weijiang Lee, Lijun Jia, Fengmin Ji, and Lu Tsai, Phys. Rev. {\bf E58}, 861 (1998).

930:

931: \bibitem{PENG1}C. K. Peng, S.V. Buldyrev, A. L. Goldberger, S. Havlin,

932:            F. Sciortino, M. Simons, and H. E. Stanley, Nature (London),

933:            {\bf 356}, 168 (1992).

934:

935: \bibitem{MAD} J. Maddox, Nature (London), {\bf 358}, 103 (1992).

936:

937: \bibitem{NEE}S. Nee, Nature (London), {\bf 357}, 450 (1992).

938:

939: \bibitem{CHA}Chatzidimitriou-Dreismann and Larhammar D, Nature (London), {\bf 361}, 212 (1993).

940:

941: \bibitem{PRA}V. V. Prabhu, and J. M. Claverie, Nature (London), {\bf 357}, 782 (1992).

942:

943: \bibitem{KAR}S. Karlin and V. Brendel Science, {\bf 259}, 677 (1993).

944:

945: \bibitem{STA} H. E. Stanley, S.V. Buldyrev, A. L. Goldberger, Z. D. Goldberg,

946:            S. Havlin, R. N. Mantegna, S. M. Ossadnik, C. K. Peng, and

947:            M. Simons, Physica {\bf A205}, 214 (1994).

948:

949: \bibitem{BUL2} S.V. Buldyrev, N. V. Dokholyan, A. L. Goldberger,

950:            S. Havlin, C. K. Peng, H. E. Stanley and G. M. Visvanathan,

951:            Physica {\bf A249}, 430 (1998).

952:

953: \bibitem{ARN1}A. Arnedo, E. Bacry, P. V. Graves and J. F. Muzy, Phys. Rev. Lett.,

954:            {\bf 74}, 3293 (1995).

955:

956: \bibitem{ARN2}A. Arnedo, Y. D'Aubenton-Carafa, B. Audit, E. Bacry,

957:               J. F. Muzy, and C. Thermes, Physica {\bf A249}, 439 (1998).

958:

959:

960: \bibitem{MAN} R. N. Mantegna, S.V. Buldyrev, A. L. Goldberger,

961:            S. Havlin, C. K. Peng, M. Simons, and  H. E. Stanley,

962:            Phys. Rev. Lett., {\bf 73}, 333 (1994); Phys. Rev. {\bf E52}, 2939 (1995).

963:

964:

965: \bibitem{BUL3}  S.V. Buldyrev, A. L. Goldberger, S. V. Havlin, R. N. Mantegna,

966:                 M. E. Matsa, C. K. Peng, M. Simons, and H. E. Stanley,

967:                 Phys. Rev. {\bf E51}, 5084 (1995).

968:

969: \bibitem{PENG2} C. K. Peng, S.V. Buldyrev, S. V. Havlin, M. Simons, H. E. Stanley,

970:                 and A. L. Goldberger, Phys. Rev. {\bf E49}, 1685 (1994).

971:

972: \bibitem{AKM1} A. K. Mohanty, and A. V. S. S. Narayana Rao, Phys. Rev. Lett., {\bf 84}, 1832 (2000).

973:

974: \bibitem{AKM2}A. K. Mohanty,  and S. K.  Kataria, Phys. Rev. Lett, {\bf 73}, 2672 (1994);

975:            Phys. Rev. Lett, {\bf 75}, 2449 (1995); Phys. Rev. C, {\bf C53}, 887 (1996).

976:

977: \bibitem{KLA} For a review see, J. Klafter, M. F. Shlesinger and G. Zumofen,

978:               Physics Today, {\bf 49}, 33 (1996); M. F. Shlesinger, J. Klafter

979:               and G. Zumofen, Am. J. Phys., {\bf 67}, 1253 (1999).

980:

981:

982:

983: \bibitem{GAL} Bernaola-Galvan and P. Carpena, (To be published).

984:

985:

986: %\bibitem{ALL1} P. Allegrini, P. Grigolini and B. J. West, Phys. Rev. {\bf E54}, 4760 (1996).

987: %

988: %\bibitem{ALL2} P. Allegrini, M. Barbi, P. Grigolini and B. J. West,

989: %           Phys. Rev. E, {\bf 52}, 5281 {1995}

990:

991: %\bibitem{ALL3} P. Allegrini, P. Grigolini and B. J. West,

992: %           Phys. Lett A, {\bf 211}, 217 {1996}

993:

994: \end{thebibliography}

995: \end{document}

996: