0411:q-bio0411016/tmp.tex

1: \def \manuflag {0}

2:

3: \ifnum \manuflag = 0

4:  \documentclass[pre,twocolumn,showpacs,amsmath,amssymb]{revtex4}

5:  \usepackage{psfig}

6:  \usepackage{epsfig}

7:  \usepackage{delarray}

8:  \usepackage{graphicx}

9:  \usepackage{dcolumn}

10:  \usepackage{bm}

11:  \def \Title{

12: Universal $1/f$ noise, cross-overs of scaling exponents,

13: and chromosome specific patterns of GC content in

14: DNA sequences of the human genome}

15:  \def \figsize{6.85cm}

16:  \def \figname{\footnotesize \sc FIG.}

17:  \def \tblname{\footnotesize \sc TABLE}

18:  \sloppy

19:  \newcommand{\SEC}{\section}

20:  \newcommand{\SUBSEC}{\subsection}

21: \else

22:  \documentclass[pre,onecolumn,showpacs,amsmath,amssymb]{revtex4}

23:  \usepackage{psfig}

24:  \usepackage{epsfig}

25:  \usepackage{delarray}

26:  \usepackage{graphicx}

27:  \usepackage{dcolumn}

28:  \usepackage{bm}

29:  \usepackage{rotating}

30:  \def \Title{Universal $1/f$ noise, cross-overs of scaling

31: exponents, and chromosome specific patterns of GC content in

32: DNA sequences of the human genome}

33:  \def \figsize{12.6cm}

34:  \renewcommand{\baselinestretch}{2.4}

35: \fi

36:

37: \def \Abstract{

38: Spatial fluctuations of guanine and cytosine base content (GC\%) are

39: studied by spectral analysis for the complete set of human genomic DNA

40: sequences. We find that (i) the $1/f^{\alpha}$ decay is universally

41: observed in the power spectra of all twenty-four chromosomes, and that

42: (ii) the exponent $\alpha \approx 1$ extends to about $10^7$ bases,

43: one order of magnitude longer than what has previously been observed.

44: We further find that (iii) almost all human chromosomes exhibit a

45: cross-over from $\alpha_1 \approx 1$ ($1/f^{\alpha_1}$) at lower

46: frequency to $\alpha_2 < 1$ ($1/f^{\alpha_2}$) at higher frequency,

47: typically occurring at around 30,000--100,000

48: bases, while (iv) the cross-over in this frequency range is

49: virtually absent in

50: human chromosome 22. In addition to the universal $1/f^\alpha$ noise in power

51: spectra, we find (v) several lines of evidence for chromosome-specific

52: correlation structures, including a 500,000 bases long oscillation in

53: human chromosome 21. The universal $1/f^\alpha$ spectrum in the human

54: genome is further substantiated by a resistance to variance reduction

55: in guanine and cytosine content when the window size is increased.

56: }

57:

58:

59:

60: \ifnum \manuflag = 0

61:  \begin{document}

62:  \title{\Title}

63:  \author{Wentian Li}

64:  \email{wli@nslij-genetics.org}

65:  \affiliation{The Robert S. Boas Center for Genomics and Human Genetics,

66: 	      North Shore LIJ Institute for Medical Research, 350 Community Dr.,

67: 	      Manhasset, NY 10030.}

68:  \author{Dirk Holste}

69:  \email{holste@mit.edu}

70:  \affiliation{Department of Biology,

71: 	      Massachusetts Institute of Technology, Cambridge, MA 02139.}

72:  \begin{abstract}

73:    \Abstract

74:  \end{abstract}

75:  \pacs{87.10.+e, 87.14.Gg, 87.15.Cc, 02.50.-r, 02.50.Tt, 89.75.Da, 89.75.Fb, 05.40.-a}

76:  %% \keywords{Suggested keywords if desired}

77:  \maketitle

78: \else

79:  \begin{document}

80:  \title{\Title}

81:  \author{Wentian Li}

82:  \email{...@...}

83:  \affiliation{...}

84:  \author{Dirk Holste}

85:  \email{holste@mit.edu}

86:  \affiliation{Department of Biology,

87: 	      Massachusetts Institute of Technology, Cambridge, MA 02139.}

88:  \begin{abstract}

89:    \Abstract

90:  \end{abstract}

91:  \pacs{87.10.+e, 02.50.-r, 05.40.-a  \hfill {\tt Thu Sep  5 15:36:39 EDT 2002}}

92:  %% \keywords{Suggested keywords if desired}

93: \maketitle

94: \fi

95:

96: \SEC{Introduction}

97:

98: By measuring the proportion of a signal's power $S(f)$ falling into

99: a range of frequency components $f$, a power spectrum of the form

100: $S(f) \sim  1/f^\alpha$ distinguishes between two prototypes of noise:

101: white noise ($\alpha = 0$) and Brownian noise ($\alpha = 2$). The

102: intermediate range, termed ``$1/f$ noise'', can practically be defined

103: as $1/f^\alpha$ ($0.5 \lesssim \alpha \lesssim 1.5$). $1/f$ noise was experimentally

104: observed first in electric current fluctuations of the thermionic

105: tube at the beginning of the twentieth century \cite{johnson}.

106: Since then, $1/f$ noise has been found repeatedly in many other

107: conducting materials \cite{1freview}.  More generally, it has also

108: been observed in wide ranges of natural as well as human-related

109: phenomena, including traffic flow, star light, speech, music

110: and human coordination \cite{1freview-other,1fbib}.

111: For biological sequences, such as DNA, the concept of slow-varying,

112: multiple-length variations in the power of frequency components

113: can be translated to long-ranging correlations in the spatial

114: arrangement of the four bases adenine (A), cytosine (C), guanine

115: (G) and thymine (T).  One can categorize chemically A, C,

116: G, and T as strong (G or C) or weak (A or T) bonding. It has been

117: shown that fluctuations of the GC base content along a DNA sequence

118: are typically more strongly correlated when compared to other possible

119: binary classifications \cite{mapping,dirk}.

120: Initial studies of $1/f$ noise in DNA sequences were motivated

121: by a model of spatial $1/f$ noise of symbolic sequence evolution

122: \cite{wli-em}.  Subsequently, empirical $1/f$ spectra were

123: indeed observed in non-protein-coding DNA sequences

124: \cite{wli-dna}, and their generality in DNA sequences was further

125: illustrated in \cite{voss}.

126:

127: $1/f$ noise has been detected in a variety of different species and

128: taxonomic classes, including bacteria \cite{bac}, yeasts

129: \cite{yeast}, insects \cite{fukushima}, and other higher eukaryotic

130: genomes. Integrating this and several other lines of evidence, a

131: consensus on $1/f$ noise in DNA sequences has emerged:

132: (1) for DNA sequences of the order of $10^6$~bases ($1$~Mb), $1/f^\alpha$

133: spectrum ($\alpha \approx 1$) is consistently observed; (2) for

134: isochores, which are DNA sequences of relatively homogeneous base

135: concentration at least $300\cdot 10^3$ bases ($300$~kb) long

136: \cite{isochore,isochore-clay,cc03}, $1/f^\alpha$ spectrum is also

137: observed,  but typically shows a smaller exponent $\alpha < 0.7$

138: \cite{isochore-clay,clay3,isochore-spe}; (3) for DNA sequences of

139: the order of several kb, the decay of $S(f)$ is non-trivial and may

140: depend on whether the sequence is protein-coding \cite{wli-dna}.

141: The viral DNA sequence of the $\lambda$-phage, e.g., shows a single step in its GC

142: base concentration and its spectrum is $S(f) \sim 1/f^2$, which is

143: characteristic of random block sequences \cite{wli-complexity}.

144: We note that the universal scaling of $S(f) \sim 1/f^\alpha$ ($\alpha \approx 1$)

145: across all species discussed in \cite{voss} has apparently been

146: restricted to a length scale of $1$~kb, by averaging the spectrum over

147: many $N=2$~kb DNA segments.

148:

149: With the availability of the first completed version of the DNA

150: sequence of the human genome \cite{lander}, several studies have been able

151: to demonstrate that the base-base correlation function $\Gamma(d)$

152: ($d$ distance between bases) of several DNA sequences follows a

153: power-law decay, $\Gamma(d) \sim 1/d^{\gamma}$. For instance, the DNA

154: sequence of human chromosome 22 shows statistically significant

155: power-law correlations up to $d=1$~Mb, and

156: correlations in the DNA sequence of chromosome 21 are statistically

157: significant up to several~Mb (with the scaling exponent $\gamma$

158: changing beyond a few~kb) \cite{dirk,pedro}.

159: While the DNA sequences of human chromosomes 21 and 22 are about

160: $34$~Mb long, in order to estimate the limit of the range of $1/f^\alpha$

161: spectrum, longer sequences are necessary.

162:

163:

164:

165: Since the release of the draft human genome sequence in February

166: 2001, and as of 2004, a dozen (out of 24) human chromosomes

167: have been completed with a sequence accuracy meeting the

168: standard of less than one error per 10,000 DNA

169: bases (99.99\% accuracy) \cite{hg-quality}.  Building upon the release

170: of updated, high-quality sequence data, in the era of genomics we can

171: now conduct a systematic analysis of several issues of $1/f$ noise in

172: the DNA sequences of our own species {\em Homo sapiens}, which

173: have been pursued over the last decade in a fragmentary manner.

174:

175: In this paper, we use the DNA sequences of the complete set of

176: twenty-two autosomes and two sex chromosomes to address the following

177: issues: Is $1/f$ noise

178: universally present across the entire set of human genome sequences?

179: Does $1/f$ noise extend to lower frequency ranges in longer DNA

180: sequences? Is the decay of $S(f)$ characterized by a single exponent

181: $\alpha$, or does it exhibit cross-overs (multiple scaling exponents)?

182: Given the presence of universal variations at multiple scales,

183: do these co-exist with variations at chromosome-specific scales?

184:

185: \begin{figure}

186: \centerline{\psfig{figure=dirk1.eps,width=68mm,angle=0}}

187: \caption{\label{fig-dirk1}

188: Double-logarithmic representation of the human genome-wide length

189: distribution of interspersed repeat sequences, non-repetitive

190: sequences, and sequences of unknown base composition (gaps).  The

191: length distribution of interspersed repeats and non-repetitive sequences

192: exhibits a power-law-like decay, while that of gap sequences

193: is scattered across different sequence lengths. The peaks

194: at $\sim 300$ bases and

195: several kb correspond to Alu and possibly LINE repeats.

196: }

197: \end{figure}

198:

199: \begin{figure}[ht]

200: \centerline{\psfig{figure=dirk2.eps,width=68mm,angle=0}}

201: \caption{\label{fig-dirk2}

202: Distribution of genome-wide GC content (GC\%) of the human genome

203: for interspersed repeat sequences, non-repetitive sequences,

204: and all (``overall'') sequences with sequence segments of $20~{\rm kb}$.

205: The mode (peak location) of non-repetitive sequences is at

206: $\sim$35\%, while the mode of repetitive sequences is shifted

207: to a higher GC\% ($\sim$42\%). The fraction

208: of non-repetitive sequences with GC\%~$>$~50\% is markedly

209: larger as compared to the repetitive sequences.

210: }

211: \end{figure}

212:

213:

214: \SEC{Data and methods}

215:

216: In this section, we introduce the data for human genome sequences, as well

217: as the notation and definitions used throughout this study.

218: Twenty-four chromosomes are assembled in build 34

219: of the NCBI (human genome hg16 release). Sequence data were downloaded

220: from the UCSC human genome repository (available at {\sf http://genome.ucsc.edu/}).

221: Unsequenced bases are kept to preserve spacing between

222: bases. Human chromosomes (Chr) 13, 14, 15, 21, and 22 contain a large

223: amount of unsequenced bases at the left end of their DNA sequences,

224: amounting to about 15\%, 17\%, 18\%, 21\%, and 29\% of the

225: individual chromosome size,

226: respectively; 51\% of chromosome Y is unsequenced.

227:

228: Our analysis of human DNA sequences is conducted using coarse-grained data.

229: Each original sequence was transformed into a spatial series

230: of GC content (GC\%) values. To this end, we evenly partition a

231: DNA sequence into $N$ non-overlapping windows of length $w$ bases,

232: compute $\rho_i(w)=$GC\%$_i$ for each window $i$, to obtain a spatial

233: GC\% series:

234: \begin{equation}

235: \label{rho}

236: \{ \rho_i \} \equiv  \{ \rho_i(w) \}  \equiv \{ \mbox{GC\%}_i \}

237: \hspace{0.1in}

238: \mbox{$i = 1, 2, \dots, N$}

239: \end{equation}

240: Table~1 lists the corresponding window sizes

241: for each human chromosome.  Since different human chromosomes have

242: different sizes, whereas the number of partitions ($N$) is the same,

243: the window lengths vary.
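
As a worked illustration (derived here from the rounded values in Table~1, so the resulting lengths are only approximate), the window size is the chromosome length $L$ divided by the number of windows:
\[
w = \frac{L}{N}, \qquad N = 2^{17} = 131\,072 .
\]
For chromosome 1, $w \approx 1.88$~kb corresponds to $L \approx 131\,072 \times 1.88~\mbox{kb} \approx 246$~Mb, and for chromosome 22, $w \approx 0.38$~kb corresponds to $L \approx 50$~Mb. Both lengths include the unsequenced bases that are kept in the sequence.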

244:

245:

246: \begin{table}[ht]

247: \caption{\label{tab:table1}

248: Average GC content ($\overline{ \mbox{GC}\%}$ or $\overline{ \rho }$), the window

249: size ($w$) for partitions using $N=2^{17}$ non-overlapping windows for

250: twenty-four human chromosomes. Low-frequency scaling exponents

251: $\alpha_1$ are estimated  from $S(f; s=3) \sim 1/f^{\alpha_1}$

252: in the range of $10^{-7} < f < 10^{-5}$ base$^{-1}$, and high-frequency

253: scaling exponents $\alpha_2$  are estimated in the range of

254: $10^{-5} < f < 2 \times 10^{-4}$ base$^{-1}$. The difference

255: between the two scaling exponents, $\Delta \alpha \equiv \alpha_1-\alpha_2$,

256: is listed in the fifth column. Low- and high-frequency exponents for $S(f)$

257: with substituted interspersed repeats are indicated by

258: $\alpha'_1$ and $\alpha'_2$, and their difference by

259: $\Delta \alpha' \equiv \alpha'_1-\alpha'_2 $.

260: }

261: \begin{ruledtabular}

262: \begin{tabular}{l|c|c|c|c|c|c|}

263: Chr & $\overline{ \mbox{GC}\%}$ & $w$ (kb) & $\alpha_1$ $\alpha_2$ & $\Delta \alpha$

264:  & $\alpha'_1$ $\alpha'_2$ & $\Delta \alpha'$ \\

265: \hline

266: 1  & 41.7 &1.88 & 0.88 0.46 & 0.42 & 0.80 0.29 &0.51\\

267: 2  & 40.2 &1.86 & 0.99 0.51&  0.48 & 0.96 0.30 &0.66\\

268: 3  & 39.7 &1.52 & 0.95 0.43& 0.53 & 0.88 0.27 &0.61 \\

269: 4  & 38.2 &1.46 & 0.87 0.34& 0.53 & 0.75 0.19 &0.57\\

270: 5  & 39.5 &1.38 & 0.89 0.39& 0.51 & 0.88 0.23 &0.65 \\

271: 6  & 39.6 &1.30 & 0.99 0.36& 0.63 & 0.86 0.24 &0.63\\

272: 7  & 40.7 &1.21 & 0.97 0.46& 0.51 & 0.87 0.33 &0.55 \\

273: 8  & 40.1 &1.12 & 0.97 0.42& 0.55 & 0.91 0.26 &0.66\\

274: 9  & 41.3 &1.04 & 0.96 0.39& 0.57 & 0.90 0.28 &0.62 \\

275: 10 & 41.6 &1.03 & 0.97 0.52& 0.46 & 0.95 0.34 &0.61 \\

276: 11 & 41.6 &1.03 & 1.05 0.50& 0.55 & 0.97 0.35 &0.62 \\

277: 12 & 40.8 &1.01 & 0.97 0.39& 0.59 & 0.89 0.28 &0.61 \\

278: 13 & 38.5 & 0.86 & 0.83 0.33 & 0.50 & 0.73 0.24 &0.49 \\

279: 14 & 40.9 &0.80  & 1.03 0.36 & 0.66 & 0.95 0.27 &0.68\\

280: 15 & 42.2 &0.76 & 0.90 0.50 & 0.40 & 0.83 0.39 &0.44 \\

281: 16 & 44.8 &0.69 & 0.91 0.51 & 0.40 &0.81 0.36 &0.45\\

282: 17 & 45.5 &0.62 & 0.98 0.57 & 0.42 & 0.89 0.44 &0.46 \\

283: 18 & 39.8 &0.58 & 1.12 0.40 & 0.72 & 1.12 0.28 &0.83 \\

284: 19 & 48.4 &0.49  & 1.00 0.56 & 0.44 & 0.81 0.37 &0.45 \\

285: 20 & 44.1 &0.49  & 0.87 0.51 & 0.36 & 0.83 0.30 &0.53 \\

286: 21 & 40.9 &0.36 & 0.91 0.33 & 0.58 & 0.86 0.22 &0.64 \\

287: 22 & 47.9 &0.38  & 0.90 0.62 & 0.28 & 0.86 0.40 &0.45 \\

288: X  & 39.4 &1.17  & 0.93 0.38 & 0.54 & 0.73 0.18 & 0.55 \\

289: Y  & 39.1 &0.38  & 0.83 0.38 & 0.45 & 0.70 0.21 & 0.49 \\

290: \end{tabular}

291: \end{ruledtabular}

292: \end{table}

293:

294: Human DNA sequences contain a large fraction of interspersed repeats,

295: i.e., copies of an ancestral sequence fragment that possess a high

296: similarity between the duplicated and the ancestral sequence.  One can

297: detect interspersed repeats by using the program {\sf RepeatMasker}

298: \cite{repeatmasker}. ``Soft-masked'' annotations of interspersed repeats

299: are taken from the DNA sequences of the UCSC human genome repository

300: ({\sf http://genome.ucsc.edu/}), where repetitive (non-repetitive)

301: bases are annotated in small (capital) letters. Figure~\ref{fig-dirk1}

302: shows the length distribution of the three sequence classes of uninterrupted

303: non-repetitive, interspersed repeat, and gap sequences.

304: Figure~\ref{fig-dirk2} shows the corresponding distribution of the

305: genome-wide GC\% for these three sequence classes.

306:

307:

308: To investigate the effect of interspersed repeats, we substitute

309: them by random bases according to the chromosomal level of GC\%.

310: Transformed, repeat-substituted DNA sequences of original human

311: chromosomes are distinguished from original sequences.  On the

312: coarse-grained level, it is equivalent to

313: the replacement in the $\{ \rho_i \}$ ($i=1, 2, \dots, N$) series

314: of any values calculated from the interspersed repeats by a random

315: value which is sampled from a Gaussian distribution; the

316: mean and variance of this Gaussian distribution are the

317: same as those of GC\% in the original sequence.  Another possibility

318: consists in substituting repetitive sequences by

319: a constant value (e.g., the averaged GC\% value

320: of the original sequence).  This method introduces

321: additional correlations (and less variance) in the $\{ \rho_i \}$

322: series,  and is not adopted in this paper.

323:

324: Three different, albeit functionally related, measures are

325: applied to the $\{ \rho_i \}$ series: the power spectrum

326: as a function of the frequency $S(f)$, the correlation function

327: $\Gamma(d)$ as a function of the distance $d$ between

328: windows, and variance $\sigma^2(w)$ of GC\%

329: series as a function of the window size $w$.

330:

331: First, we conduct spectral analyses by calculating the power spectrum,

332: the squared modulus of the Fourier transform, defined as:

333: \begin{equation}

334: S(f) \equiv \frac{1}{N} \left| \sum_{k=1}^{N} \rho_k

335: \cdot

336: e^{ -i 2 \pi k f/N} \right|^2,

337: \end{equation}

338: where $N$ is the total number of windows, and $f$ is measured

339: in units of cycle/window, which can be converted to units

340: of cycle/base by the window size (cf. Table~1).
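
To make the unit conversion explicit, consider the following sketch. It assumes that the spectral components are indexed by an integer $j = 1, 2, \dots, N/2$, an index introduced here only for illustration. Component $j$ has frequency $j/N$ in cycle/window, and dividing by the window size $w$ (in bases) gives
\[
f_j = \frac{j}{N w} \quad \mbox{cycle/base}, \qquad \frac{1}{N w} \le f_j \le \frac{1}{2 w} .
\]
For chromosome 1 ($w \approx 1.88$~kb), the accessible range therefore extends from about $4 \times 10^{-9}$ to about $2.7 \times 10^{-4}$ cycle/base.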

341:

342: Coarse-graining ``hides'' base-base correlations at scales smaller

343: than $w$ bases.  The choice of $N = 2^{17}$ windows was made such

344: that it is (i) sufficiently large to cover small-scale fluctuations,

345: while (ii) at the same time sufficiently small so that the spectral analysis is

346: computationally feasible. As different chromosomes have different

347: lengths, an equal number of partitions leads to different window sizes $w$.

348:

349: The unsmoothed $S(f)$, or periodogram, contains $N/2$ independent

350: spectral components.  One can filter periodograms to obtain a

351: ``smoothed'' spectrum $S(f;s)$, where $s$ is the span-size

352: parameter.  Since filtering with a relatively large $s$-value

353: possibly distorts the shape of $S(f;s)$ at lower frequency components,

354: different span-sizes are applied for different frequency ranges.
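
As an illustration of what smoothing with span $s$ means, a plain moving-average (Daniell-type) filter with odd span $s$ replaces each periodogram value by the mean of its $s$ nearest neighbors. This is a sketch under that assumption: the index $j$ and the half-width $m$ are notation introduced here, and the filter applied by a given statistics package may weight the end points differently.
\[
S(f_j; s) = \frac{1}{s} \sum_{k=-m}^{m} S(f_{j+k}), \qquad m = \frac{s-1}{2} .
\]
With $s=1$ the periodogram is left unchanged, whereas $s=31$ averages over 31 adjacent components. This is why large spans are avoided at the lowest frequencies, where only a few components are available.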

355:

356:

357: The second measure applied to the $\{ \rho_i \}$ series

358: is the correlation function, $\Gamma(d)$, which is computed

359: from two truncated series

360: of $\{ \rho_i \}$, $ \rho' = \{ \rho_k \}$  ($k=1, 2, \dots, N-d$) and

361: $ \rho'' = \{ \rho_k \}$  ($k=d+1, d+2, \dots,  N$):

362: \begin{equation}

363: \label{gamma}

364: \Gamma(d) \equiv \frac{ \rm Cov(\rho', \rho'')}

365: { \sqrt{ \rm Var(\rho')} \sqrt{ \rm Var(\rho'')}}

366: \end{equation}

367: where $\mbox{Cov}( \rho', \rho'')=

368: \langle \rho' \rho''\rangle - \langle \rho'\rangle \langle \rho''\rangle $

369: and $\mbox{Var}( \rho') = \langle \rho'^2 \rangle -  \langle  \rho' \rangle^2$

370: (or $\mbox{Var}( \rho'') = \langle \rho''^2 \rangle -  \langle  \rho'' \rangle^2$)

371: are the covariance and variance.

372: Note that the $\Gamma(d)$ defined in Eq.(\ref{gamma})

373: is slightly different from that defined using

374: a periodic boundary condition.

375:

376:

377:

378: The third and final measure applied to the

379: $\{ \rho_i \}$ series is the variance $\sigma^2(w)$:

380: \begin{equation}

381: \sigma^2(w) \equiv \langle \rho(w)^2 \rangle -

382: \langle \rho(w) \rangle^2

383: \end{equation}

384: as a function of the window size $w$.

385: The power spectrum, the correlation function, and the window-size-dependent

386: variance are interrelated quantities \cite{clay3}:

387: \begin{equation}

388:   \sigma^2(w) \sim \frac{ \Gamma(0) }{ w } \cdot

389: \bigg\{ 1 + \frac{2}{w}\sum_{d=1}^{w-1} (w-d) \Gamma(d) \bigg\}.

390: \end{equation}

391: If $S(f) \sim 1/f^\alpha$, $\Gamma(d) \sim 1/d^\gamma$,

392: $\sigma^2(w) \sim 1/w^\beta$ are power-law functions,

393: then their scaling exponents are related

394: by $\alpha = 1-  \gamma$ and  $\gamma=\beta$ \cite{clay3}.
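
A short heuristic derivation of these relations can be sketched as follows, for $0 < \gamma < 1$, with sums approximated by integrals and constant prefactors dropped. Suppose $\Gamma(d) \sim d^{-\gamma}$. Then
\[
\sum_{d=1}^{w-1} (w-d)\, d^{-\gamma} \approx \int_0^w (w-x)\, x^{-\gamma}\, dx
= \frac{w^{2-\gamma}}{(1-\gamma)(2-\gamma)} ,
\]
so that the bracketed term in the variance relation grows as $w^{1-\gamma}$ for large $w$, and $\sigma^2(w) \sim w^{-\gamma}$, i.e., $\beta = \gamma$. Similarly, since the power spectrum is the Fourier transform of the correlation function,
\[
S(f) \sim \int_0^\infty x^{-\gamma} \cos(2 \pi f x)\, dx \propto f^{\gamma-1} ,
\]
which gives $\alpha = 1 - \gamma$.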

395:

396: The calculation of $S(f)$ and $\Gamma(d)$ was carried out by the

397: statistical package {\sl S-PLUS} (Version 3.4, MathSoft, Inc.), and the

398: type of filter implemented for $S(f)$ is the Daniell-filter \cite{daniell}.

399:

400:

401: \SEC{$1/f$ noise  is a universal feature of human DNA sequences}

402:

403:

404: In this section, we use the power spectrum $S(f)$ to study

405: GC\% of human genome sequences, with

406: the goals of testing the universality of $1/f$ noise, quantifying

407: different decay ranges for $S(f) \sim 1/f^\alpha$, and comparing

408: $S(f)$ across DNA sequences of different human chromosomes.

409:

410: Figure~\ref{fig3:s(f)} shows for $N=2^{17}$ GC\% values the power

411: spectra $S(f)$ across all human chromosomes. We find that $S(f)$

412: exhibits no clear plateau at

413: low frequency ($< 10^{-6}$ cycle/base) and increases steadily

414: with decreasing frequency. The decay can be mathematically

415: approximated by a power-law of the form $S(f) \sim 1/f^\alpha$

416: with $\alpha \approx 1$.  Table~1 lists for the frequency range

417: $f=$ 10~Mb$^{-1}$--100~kb$^{-1}$ the estimated scaling exponent

418: $\alpha_1$ for all chromosomes, using a best-fit regression

419: of $\log_{10} S(f; s=3) = a - \alpha_1 \log_{10}(f)$. We find that

420: $\alpha_1$ is typically close to 1, with

421: little variation across chromosomes.
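
The exponent estimate can be written explicitly as an ordinary least-squares slope. This is a sketch, and the symbols $x_j$, $y_j$, and $n$ are introduced here for illustration. Let $x_j = \log_{10} f_j$ and $y_j = \log_{10} S(f_j; s=3)$ for the $n$ spectral components in the fitting range. A spectrum $S(f) \sim 1/f^{\alpha_1}$ has slope $-\alpha_1$ in the double-logarithmic plot, so
\[
\hat{\alpha}_1 = - \frac{\sum_{j=1}^{n} (x_j - \bar{x})(y_j - \bar{y})}{\sum_{j=1}^{n} (x_j - \bar{x})^2} ,
\]
where $\bar{x}$ and $\bar{y}$ denote the sample means. Since the fitting range $10^{-7} < f < 10^{-5}$ base$^{-1}$ spans two decades, $\alpha_1 \approx 1$ corresponds to a roughly hundredfold drop of $S(f)$ across it.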

422:

423: A closer inspection of Fig.~\ref{fig3:s(f)} shows that the

424: majority of $1/f$ spectra undergo a cross-over from $\alpha_1 \approx 1$

425: to $\alpha_2 < 1 $ at high frequency. The deviation from $\alpha_1 \approx 1$

426: starts at about 30--100~kb and continues at smaller distances.

427: Figure~\ref{fig4:ex} illustrates this feature for $S(f; s=31)$

428: of the DNA sequences of Chr15, Chr21, and Chr22 in more detail.

429: We find that chromosomes 15 and 21 exhibit clear cross-overs

430: at about 100~kb, while chromosome 22 exhibits no apparent break-point.

431: Table~1 contains for the frequency range of

432: $f=$ 100~kb$^{-1}$--5~kb$^{-1}$ the corresponding scaling

433: exponents $\alpha_2$, obtained from the

434: regression $\log_{10} S(f; s=3) = a - \alpha_2 \log_{10}(f)$.

435: We find a pronounced difference in absolute values between

436: $\alpha_1 \approx 1$ and $\alpha_2 < 1$, indicating a transition

437: from the universal $1/f^{\alpha_1}$ ($\alpha_1 \approx 1 $)

438: spectrum at low frequency to a

439: more flattened $1/f^{\alpha_2}$ ($\alpha_2 < 1$) spectrum

440: at higher frequency.

441:

442:

443: Figure~\ref{fig5:alpha}(a) shows for all human chromosomes $\alpha_1$

444: and $\alpha_2$ as a function of chromosome-specific GC\%. The

445: majority of human chromosomes have a specific GC content ranging between

446: 38--43\%, whereas chromosomes 16, 17, 19, 20, and 22 have higher GC\%

447: up to 49\%.  While the low-frequency scaling exponent $\alpha_1$

448: remains approximately independent of GC\%, Fig.~\ref{fig5:alpha}(a)

449: shows that $\alpha_2$ increases with increasing GC\% and

450: gives rise to a positive correlation between $\alpha_2$ and GC\%.

451:

452: The three chromosomes illustrated in Fig.~\ref{fig4:ex} exhibit

453: different degrees of transition from the $1/f^{\alpha_1}$

454: ($\alpha_1 \approx 1$) to the flattened

455: $1/f^{\alpha_2}$ ($\alpha_2 <1$) spectrum, with chromosome 21 (22)

456: undergoing the sharpest (smoothest) transition. This

457: observation can be further quantified by the change in

458: scaling exponents $\alpha_1$ and $\alpha_2$. Table~1 lists for

459: all chromosomes $\Delta \alpha = \alpha_1 - \alpha_2$.

460: Chromosome 22 is distinct from all other human chromosomes

461: as the most scale-invariant one (same or similar scaling

462: exponent at different length scales).

463: The same observation that human chromosome 22 was perhaps different

464: from the remaining human chromosomes was made using limited

465: sequence data in \cite{isochore-clay,pedro}.

466:

467: \begin{figure*}

468: \centerline{\psfig{figure=pre-fig1.eps,width=80mm,angle=-90}}

469: \caption{\label{fig3:s(f)}

470: Double-logarithmic representation of

471: power spectra $S(f)$ of GC\% of all twenty-four human

472: chromosomes. Each plot shows $S(f)$ of six chromosomes

473: (shifted on the $y$-axis for clearer representation):

474: chromosomes (a) 1--6; (b) 7--12; (c) 13--18; (d) 19--22, X, and Y.

475: The $x$-axis (in logarithmic scale) is converted from cycle/window

476: to cycle/base by using the window sizes listed in Table~1.

477: $S(f)$ is filtered at different levels for different frequency

478: ranges: $S(f; s=1)$ for the first ten spectral components,

479: $S(f; s=3)$ for the components 11--30,

480: $S(f; s=31)$ for the components 31--400,

481: and $S(f; s=501)$  for the components 401--65536 (=$2^{16}$).

482: }

483: \end{figure*}

484:

485:

486: \begin{figure}

487: \centerline{\psfig{figure=pre-fig2.eps,width=55mm,angle=-90}}

488: \caption{\label{fig4:ex}

489: Cross-over from $S(f) \sim 1/f^{\alpha_1}$ to $S(f) \sim 1/f^{\alpha_2}$

490: illustrated for human chromosomes 15, 21, and 22

491: (smoothed with the span size of 31, and shown in double-logarithmic scale).

492: The scaling exponents $\alpha_1$ and $\alpha_2$ are shown  for

493: the frequency ranges 10~Mb$^{-1}$--100~kb$^{-1}$ and 100~kb$^{-1}$--5~kb$^{-1}$.

494: }

495: \end{figure}

496:

497: \begin{figure}

498: \centerline{\psfig{figure=pre-fig3.eps,width=70mm,angle=-90}}

499: \caption{\label{fig5:alpha}

500: (a) Scaling exponents $\alpha_1$ and $\alpha_2$

501: for fitting the power spectrum $S(f) \sim 1/f^{\alpha_i}$

502: ($i=1,2$) at the frequency range of 10~Mb$^{-1}$--100~kb$^{-1}$,

503: and 100~kb$^{-1}$--5~kb$^{-1}$, respectively, versus the

504: chromosome-specific GC content of all 24 human chromosomes.

505: (b) Scaling exponents $\alpha'_1$ and $\alpha'_2$

506: for $S(f)$ with substituted interspersed repeats.

507: }

508: \end{figure}

509:

510:

511: \SEC{Interspersed repeats are not responsible for

512: $1/f$ spectrum}

513:

514: About 45\% of human genomic DNA sequences are interspersed

515: repeats \cite{lander}. Interspersed repeats consist of copies of the same

516: sequence segment that are inserted in the human genome, possess a high

517: similarity between the duplicated and ancestral sequence, and have

518: been implicated in a variety of biological functions, including genome

519: organization, human chromosome segregation, or regulation of gene

520: expression \cite{repeats-bio}. Large copy numbers increase

521: the sequence redundancy and it has been shown, e.g., that

522: about 10\% interspersed Alu repeats significantly increase

523: base-base correlations in the range up to 300~bases

524: \cite{dirk}.

525:

526:

527:

528: Figure~\ref{fig6:rep} shows the power spectrum $S(f)$ for

529: the original human chromosome 1 and for the transformed sequence in

530: which interspersed repeats are substituted.  We find in

531: the low-frequency range of $10^{-7} < f < 10^{-5}$ cycle/base

532: that $S(f)$ decays in the original sequence

533: with $\alpha_1 \approx 0.88$ and in the transformed sequence with

534: $\alpha^\prime_1 \approx 0.80$, indicating only marginal differences in

535: the decay properties of $S(f)$ due to repetitive sequences.  In

536: contrast, in the high frequency range of $10^{-5} < f < 2 \times 10^{-4}$

537: we find $\alpha_2 \approx 0.46$ and $\alpha^\prime_2 \approx 0.29$,

538: and thus interspersed repeats contribute to the decay of $S(f)$

539: at high frequencies: substituting them flattens the power spectrum.

540:

541:

542: The scaling exponents $\alpha^\prime_1$ and $\alpha^\prime_2$ for

543: repeat-substituted DNA sequences of all 24

544: human chromosomes are shown in Table~1.  The difference

545: between low- and high-frequency ranges for DNA sequences of original

546: chromosomes, $\Delta \alpha= \alpha_1 - \alpha_2$, is smaller

547: than the difference between low- and high-frequency ranges for

548: transformed sequences, $\Delta \alpha' = \alpha^\prime_1-\alpha^\prime_2$.

549: When we compare $\alpha_1$ and $\alpha^\prime_1$, as well as $\alpha_2$ and

550: $\alpha^\prime_2$, we find that the magnitude of $\alpha^\prime_1$

551: ($\alpha^\prime_2$) is never larger than that of $\alpha_1$

552: ($\alpha_2$), which means a flattened spectrum

553: due to the substitution of interspersed repeats.

554: The average change of low-frequency

555: scaling exponents, $ \alpha_1- \alpha'_1$, is about 0.07,

556: whereas the average change of high-frequency scaling

557: exponents, $ \alpha_2- \alpha'_2$, is about 0.14. This

558: confirms that the universal presence of $1/f$ spectrum

559: at low frequency is not caused by interspersed

560: repeats,  but that interspersed repeats affect $S(f)$

561: predominantly  at high frequencies.  A similar conclusion

562: that the decay rate of base-base correlations in DNA sequences of

563: human chromosomes 20, 21, and 22 is not markedly affected

564: by the substitution of interspersed repeats was reached in \cite{dirk}.

565:

566:

567: We note that the extent of deviation,

568: $|\alpha'-\alpha|$, depends on how the replacement of

569: interspersed repeats  is conducted. Possible substitutions

570: of interspersed repeats include the substitution by

571: a constant value or a randomly sampled value.

572: In general, the substitution of GC\% values

573: calculated from the repetitive sequences by random values enhances

574: the deviation and flattens the spectrum $S(f)$ more than the

575: substitution by a constant value (e.g., average GC\%).

576:

577:

578: \begin{figure}

579: \centerline{\psfig{figure=pre-fig4.eps,width=60mm,angle=-90}}

580: \caption{\label{fig6:rep}

581: Power spectra $S(f)$ of GC\% for the original and the

582: transformed (interspersed repeats substituted)  DNA

583: sequence of human chromosome 1.  The scaling exponents for

584: low-frequency (10~Mb--100~kb) and high-frequency

585: (100~kb--5~kb) ranges are obtained by a best-fit regression

586: of $\log_{10} S(f)$ over $\log_{10} f$.

587: }

588: \end{figure}

589:

590: \SEC{Resistance to variance reduction at larger window sizes}

591:

592: In this section, we study the decay properties of the

593: variance ($\sigma^2$) of spatial GC\% series  as

594: a function of different window sizes $w$,  and

595: we compare the scaling of $\sigma^2$ with the

596: scaling of the power spectrum $S(f)$.

597:

598:

599: Early experimental measurements of the GC\% distribution by

600: using cesium chloride (CsCl) profiles \cite{cscl} showed for mouse

601: ({\em Mus musculus}) genomic DNA sequences that the

602: variance of GC\% values does not markedly decrease with the DNA segment size

603:  \cite{macaya}. This experimental observation is directly related

604: to the presence of $1/f$ spectra in DNA sequences

605: \cite{isochore-clay,li-nova}. If the variance of the spatial

606: GC\% series calculated at the window size $w$ is $\sigma^2(w)$,

607: then a scaling of

608: $\sigma^2(w) \sim 1/w^\beta$ implies a corresponding

609: scaling in the power spectrum $S(f) \sim 1/f^{1-\beta}$

610: \cite{isochore-clay,beran}. If GC\% is obtained from

611: $w$ uncorrelated bases, it follows a binomial distribution.

612: Consequently, $\sigma^2(w) \sim \langle\rho \rangle

613: (1- \langle \rho \rangle) /w \sim 1/w$ with

614: $\beta=1$. The corresponding scaling exponent of

615: the power spectrum is $\alpha=1-\beta=0$, and thus

616: $S(f) \sim \mbox{const.}$, which corresponds to white noise.
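
The relation between the two exponents can be made explicit by a short calculation. The following is only a sketch under assumptions not spelled out in the text: the GC\% series is treated as a stationary sequence with unit spacing, $S(f)$ is normalized such that its integral over $|f| \le 1/2$ equals the variance at the finest resolution, and $S(f) = C |f|^{-\alpha}$ with $0 \le \alpha < 1$ is taken to hold over the whole frequency range ($C$ is a constant introduced here for illustration). The variance of the average over $w$ consecutive values is then
\[
\sigma^2(w) = \int_{-1/2}^{1/2} S(f)\,
\frac{\sin^2(\pi f w)}{w^2 \sin^2(\pi f)}\, df
\approx C\, w^{\alpha-1} \int_{-\infty}^{\infty}
|u|^{-\alpha}\, \frac{\sin^2(\pi u)}{(\pi u)^2}\, du ,
\]
where $u = f w$ and the approximation holds for $w \gg 1$. The remaining integral converges for $\alpha < 1$ and does not depend on $w$, so that $\sigma^2(w) \sim 1/w^{1-\alpha}$, i.e., $\beta = 1-\alpha$. For $\alpha = 0$ the integral equals one and the white-noise result $\sigma^2(w) = C/w$ is recovered. As $\alpha$ approaches 1 the integral diverges at small $u$, so a low-frequency cutoff set by the finite sequence length is needed in that limit.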

617:

618: Figure~\ref{fig7:var} shows $\sigma^2(w)$ as a function of window size $w$

619: for all human chromosomes. In a double-logarithmic representation,

620: we find that $\log (\sigma^2(w)) $ decays approximately

621: linearly with $\log( w) $. A decay according to

622: $\sigma^2(w) \sim 1/w^\beta$ with $\beta=1$ leads to

623: white noise. This situation is indicated in Fig.~\ref{fig7:var}

624: by the straight line. An inspection of Fig.~\ref{fig7:var}

625: shows, however, that the variance decays at a much slower

626: rate than would be expected for white noise. The variance of

627: the DNA sequence of human chromosome 1, for example,

628: gives rise to $\beta \approx 0.12$, and the

629: corresponding scaling exponent $\alpha_1 \approx

630: 1- \beta =0.88$  is indeed close to the estimated

631: exponent listed in Table~1. The scaling of the variance

632: with the exponent $\beta  \ll 1 $ is in accord with

633: the low-frequency $1/f$ noise.

634:

635:

636: \begin{figure}

637: \centerline{\psfig{figure=pre-fig5.eps,width=65mm,angle=-90}}

638: \caption{\label{fig7:var}

639: Double-logarithmic representation of the variance $\sigma^2(w)$

640: of the spatial GC\% series for all human chromosomes (Chr)

641: as a function of the window size $w$:

642: (a) $\bigcirc$ Chr1,

643: $\triangle$ Chr2,

644: $+$ Chr3, $\times$ Chr4, $\diamondsuit$ Chr5, $\bigtriangledown$ Chr6;

645: (b) $\bigcirc$ Chr7, $\triangle$ Chr8, $+$ Chr9, $\times$ Chr10,

646: $\diamondsuit$ Chr11, $\bigtriangledown$ Chr12;

647: (c)  $\bigcirc$ Chr13, $\triangle$ Chr14, $+$ Chr15,

648: $\times$ Chr16, $\diamondsuit$ Chr17, $\bigtriangledown$ Chr18;

649: (d)  $\bigcirc$ Chr19, $\triangle$ Chr20, $+$ Chr21, $\times$ Chr22,

650: $\diamondsuit$ ChrX, $\bigtriangledown$ ChrY.

651: Straight lines indicate $\sigma^2(w) \sim 1/w$ (corresponding

652: to white noise).  One regression line for Chr1 ($\beta \approx $0.12)

653: and a piece-wise regression for Chr13 ($\beta \approx $0.27 and

654: $\beta \approx $0.10) are drawn. The 95\% confidence

655: interval for the $\sigma^2(w)$ estimation of Chr1 at each point

656: of $w$ is marked by a vertical dashed line.

657: }

658: \end{figure}

659:

660:

661: The scaling of $\sigma^2(w)$ shown in Fig.~\ref{fig7:var}

662: differs from one human chromosome to another. In the range

663: of $w=$ 1~kb--5~Mb, for example, human chromosome

664: 13 exhibits a clear transition from $\beta_2 \approx 0.27$ ($w < $ 50~kb)

665: to $\beta_1 \approx 0.10$ ($w > $ 50~kb), corresponding to

666: $S(f) \sim 1/f^{0.73}$ and $S(f) \sim 1/f^{0.9}$,

667: respectively, at high- and low-frequency ranges.

668: Other human chromosomes, although generally exhibiting a power-law

669: scaling form of $\sigma^2(w)$, show deviations from the

670: $\sigma^2(w) \sim 1/w^{\beta}$ line for the largest

671: window sizes tested.

672:

673:

674: The investigation of $\sigma^2(w)$ as a function of

675: different window sizes $w$ requires careful examination

676: \cite{audit,pedro04}.  First, since we partition each

677: human chromosome into $2^k$ ($k=$17, 16, $\dots$)

678: windows, the variance of GC\% series $\{ \rho_i \} $

679: could be accidentally large when windows reside on

680: the isochore borders, and small by chance if they

681: start/end within an isochore.

682:

683: Second, when the number of windows is small (e.g., the last

684: point of $\sigma^2(w)$ for each chromosome in Fig.~\ref{fig7:var}

685: is calculated with the largest window size that

686: gives rise to  32 windows), the standard error of the

687: sample variance is large. The 95\% confidence interval for

688: $\sigma^2(w)$ of Chr1 is shown in Fig.~\ref{fig7:var}(a),

689: using the interval:

690: [$ (n-1) \sigma^2/t_{0.975}, (n-1) \sigma^2 /t_{0.025}$],

691: where $n$ is the number of windows and $t_x$ is defined by $\int_{0}^{t_x} \chi^2({\rm df}=n-1)\, dt= x$

692: (where $\chi^2({\rm df})$ is the chi-square distribution with ${\rm df}$

693: degrees of freedom) \cite{snedecor}.  Figure~\ref{fig7:var}(a)

694: shows that for fewer windows (and larger window sizes),

695: the 95\% confidence interval of $\sigma^2(w)$ could be large

696: such that the estimated value of $\beta$ may change from sample to sample.
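
As an illustration of the size of this uncertainty, consider the largest window size, for which $n=32$ windows are available. This worked example is a sketch that assumes independent, normally distributed window values (an idealization for a long-range correlated series); the symbols $n$ for the number of windows, $\chi^2_{p,\nu}$ for the $p$-quantile of the chi-square distribution with $\nu$ degrees of freedom, and $\sigma^2_{\rm true}$ for the underlying variance are introduced here only for illustration. With $\nu = n-1 = 31$, standard tables give $\chi^2_{0.025,31} \approx 17.5$ and $\chi^2_{0.975,31} \approx 48.2$, so that
\[
\frac{31\,\sigma^2}{48.2} \approx 0.64\,\sigma^2
\;\le\; \sigma^2_{\rm true} \;\le\;
\frac{31\,\sigma^2}{17.5} \approx 1.77\,\sigma^2 .
\]
On the logarithmic scale of Fig.~\ref{fig7:var}, this corresponds to roughly $-0.19$ and $+0.25$ in $\log_{10} \sigma^2$, which is comparable to the total change of $\log_{10} \sigma^2$ over one decade of $w$ when $\beta$ is as small as 0.1--0.3.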

697:

698: Finally, the relationship between the scaling exponents,

699: $\alpha+\beta = 1$ \cite{beran,isochore-clay},

700: is based on the assumption that both $S(f)$ and $\sigma^2(w)$ are

701: pure power-law functions. If $S(f)$ is a piece-wise

702: power-law function, as in the case of GC\% fluctuation of

703: human chromosomes, a correction term to the relationship

704: $\alpha +\beta=1$ is expected.

705:

706: \begin{figure}

707: \centerline{\psfig{figure=pre-fig6.eps,width=65mm,angle=-90}}

708: \caption{\label{fig8:corr}

709: Correlation function $\Gamma(d)$ for 24 human chromosomes (Chr)

710: as a function of the window distance $d$ (converted to bases

711: by the window size listed in Table~1).  The distance is

712: represented on a logarithmic scale. (a) Chr1--6; (b) Chr7--12;

713: (c) Chr13--18; and (d) Chr19--22, ChrX, and ChrY.

714: }

715: \end{figure}

716:

717: \begin{figure}

718: \centerline{\psfig{figure=pre-fig7.eps,width=50mm,angle=-90}}

719: \caption{\label{fig9:ch21}

720: Correlation function $\Gamma(d)$ for human chromosome 21

721: as a function of the window distance $d$  (converted to bases

722: by the window size given in Table~1). The oscillation in $\Gamma(d)$

723: is highlighted by vertical lines, indicating the

724: distances of $d=$500~kb, 1~Mb, 1.5~Mb, and 2~Mb.

725: }

726: \end{figure}

727:

728:

729:

730: \SEC{Chromosome-specific correlation structures}

731:

732:

733: Evidently, $1/f$ noise in music and speech signals \cite{voss-music}

734: does not prevent music and speech from sounding different.

735: Similarly, universal $1/f^\alpha$ spectra in GC\% fluctuations

736: across human chromosomes do not imply that all chromosomes

737: exhibit the same detailed correlation structure. The generic trend

738: of $S(f)$ spectra to increase at low frequency may ``co-exist''

739: with small peaks at higher frequency.  Such chromosome-specific

740: characteristic length scales can be more intuitively examined

741: by correlation functions. In this section, we investigate

742: the correlation function $\Gamma(d)$

743: of coarse-grained DNA sequences of human chromosomes with the

744: aim of further examining chromosome-specific structures,

745: such as characteristic length scales and oscillation

746: detected by $\Gamma(d)$.

747:

748: Figure~\ref{fig8:corr} shows, for all human chromosomes,

749: the correlation functions $\Gamma(d)$ of the GC\% series

750: $\{ \rho_i \}$ calculated for the window sizes given in Table~1.

751: For each chromosome, the minimum

752: (maximum) distance is 80 (16,000) windows.

753: Since each chromosome is partitioned into $2^{17}$ windows,

754: the maximum distance $d$ at which the correlation is examined

755: is about $16{,}000/2^{17} \approx 12\%$ of the total sequence length.

756:

757: An inspection of Fig.~\ref{fig8:corr} shows that the magnitude

758: of correlation at the distance of $d=1~{\rm Mb}$ is clearly above

759: the noise level. With the exception of Chr15, Chr22, and ChrY,

760: the correlation function satisfies $\Gamma(d) > 0.1$ at $d=1~{\rm Mb}$

761: for all other chromosomes. The low correlation in ChrY is

762: due to the fact that about half of the bases are unsequenced,

763: and the substitution of gaps by random values lowers the correlation.

764: At even longer distances such as $d=$10~Mb, correlations

765: $\Gamma(d=10$~Mb) for chromosomes 1 and 6 are still

766: above the 0.1 level.

767:

768: Given different window sizes ($w$) due to different chromosome

769: sizes and provided that the covariance of GC\% is approximately

770: independent of $w$, a scaling of the variance according to

771: $1/w^\beta$ implies that the  correlation function

772: $\Gamma(d)$ in Eq.~(\ref{gamma}) increases with the window

773: size as $\sim w^\beta$. Test calculations of covariance

774: for $2^{15}$ and $2^{17}$ windows show that the covariance

775: differs by less than 1\% (and hence is fairly independent in

776: this range of window sizes).

777: Consequently, for a detailed comparison of correlation

778: functions calculated for different chromosomes one has to

779: take into account the different window sizes.
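
The size of this window-size effect can be estimated with a short calculation. As a sketch, assume that $\Gamma(d)$ in Eq.~(\ref{gamma}) is the covariance at distance $d$ normalized by the variance $\sigma^2(w)$, which is how the argument above reads; the symbols ${\rm Cov}(d)$, $w_1$, and $w_2$ are introduced here only for illustration. For two window sizes $w_1$ and $w_2$ and a covariance that is independent of $w$,
\[
\frac{\Gamma_{w_2}(d)}{\Gamma_{w_1}(d)}
= \frac{{\rm Cov}(d)/\sigma^2(w_2)}{{\rm Cov}(d)/\sigma^2(w_1)}
= \left( \frac{w_2}{w_1} \right)^{\beta} .
\]
Going from $2^{17}$ to $2^{15}$ windows enlarges the window size fourfold, so with $\beta \approx 0.12$ (the value quoted above for Chr1) the correlation function would increase by a factor of about $4^{0.12} \approx 1.18$.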

780:

781:

782:

783: Any deviation from the monotonic decrease of $\Gamma(d)$

784: might be indicative of correlations at characteristic

785: length scales (visible as ``bumps''). For example,

786: Fig.~\ref{fig8:corr} shows for chromosome 1 such a bump at

787: $d \approx$ 21--23~Mb. Bumps or sharper peaks in other chromosomes include

788: $d \approx$ 9.3~Mb (Chr2), 7.2~Mb (Chr10), 3.2--3.8~Mb

789: (Chr12), and 2.4--3.1~Mb (Chr19).  One  plausible

790: explanation is that for chromosomes 2, 10, 12, and 19 one or a few

791: alternations of GC-rich/GC-poor isochores \cite{isochore} with these

792: length scales enhance the correlation.

793:

794:

795: Chromosome 21 stands out among all human chromosomes for

796: having comparatively higher correlations at distances of

797: several Mb (despite having a smaller $w^\beta$ factor than

798: other chromosomes due to a smaller window size).  A detailed

799: inspection of Fig.~\ref{fig9:ch21} uncovers an oscillation of

800: $\Gamma(d)$ with a period of about 500~kb, ranging from $d=$500~kb to $d=$2~Mb, which has

801: not been reported before. It can be further shown that this

802: oscillation is not due to the substitution of interspersed repeats

803: \cite{li-holste-21},  and it is localized to about one-eighth

804: of the right distal end of chromosome 21  \cite{li-holste-21}.

805:

806:

807:

808: \SEC{Discussion}

809:

810: We study correlation structures and spectral components in

811: the set of human chromosomes, using power spectra,

812: coarse-grained correlation functions, and the variance

813: at different window sizes. All three measures are

814: interrelated and highlight compositional structures

815: at different feature levels.  Our results firmly establish

816: the presence of long-range correlations and

817: $1/f^\alpha$ spectra in the DNA sequences of the set

818: of twenty-four  human chromosomes.

819:

820: Using updated and completed human sequence data, we find the

821: presence of $1/f$ noise in the DNA sequences of

822: all human chromosomes. We further find that, with the exception

823: of chromosome 22, all chromosomes exhibit a cross-over

824: from $1/f^{\alpha_1}$ at low frequency to $1/f^{\alpha_2}$

825: scaling at high frequency ($\alpha_1 > \alpha_2$).

826: The finding of two scaling ranges at low and high frequency

827: is in accord with previous findings obtained from

828: sequence data of lower quality, and it refines the break-point

829: regions for each individual chromosome.

830:

831:

832: We also examined the effect of about 45\% interspersed

833: repeats in the human genome. Using a procedure that

834: masks and subsequently substitutes interspersed repeats

835: with random GC\% values, we find that interspersed

836: repeats (i) only marginally affect the scaling exponent

837: $\alpha_1$ in the low-frequency range, but (ii) lower

838: $\alpha_2$ in the high-frequency range (cf.~Fig.~\ref{fig5:alpha}(b)).

839: This supports the general understanding that interspersed repeats

840: only contribute to short-range (high-frequency) correlations \cite{dirk}.

841:

842: We have shown elsewhere that $1/f^\alpha$

843: spectra of GC\% fluctuation are also universally present

844: in the mouse {\sl Mus musculus}  genomic DNA sequences

845: \cite{1fmouse}. It is known that human and mouse genomes

846: are separated by approximately 65--75 million years of evolution. Besides

847: the similarity (or homology) between

848: these two genomes on a local scale, there is in fact a

849: large amount of reshuffling of the chromosome segments

850: at a global scale when two current-day copies of

851: the two genomes are compared side-by-side \cite{pevzner}. Since reshuffling

852: of a sequence at global scales could potentially destroy

853: long-range correlations, it is still to be resolved

854: under what conditions a reshuffling of the

855: human genome into the mouse genome, or vice versa,

856: conserves $1/f$ noise.

857:

858: One possible explanation of why $1/f^\alpha$ spectra appear

859: in both the human and the mouse genomes is that such long-range

860: patterns were probably generated from ancestral DNA

861: sequences by sequence evolutionary mechanisms. One

862: sequence evolution model, termed the expansion-modification

863: (EM) model, is known to generate $1/f^\alpha$ spectra \cite{wli-em}.

864: The EM model incorporates duplications and mutations.

865: Since the duplication process is an essential element in

866: evolutionary genomics \cite{ohno}, whose role is perhaps as important as Darwin's

867: natural selection \cite{meyer}, even an as yet unsophisticated

868: incorporation of duplications in the EM model may capture the

869: essence of the evolutionary origin of long-range correlations

870: in DNA sequences. In the EM model, only the duplication of

871: segments with the same length scale is included, whereas

872: in reality segments with a broad range of length scales

873: are duplicated \cite{lander}.

874:

875: One frequently posed question concerns the ``biological meaning''

876: of $1/f^\alpha$ spectra or long-range correlations in DNA

877: sequences. In order to address this question, one may ask

878: a couple of related questions beforehand.

879: Does the compositional GC\% have any biological effects?

880: What biological functions of the DNA molecule are of relevance?

881: From the {\sl functional genomics} perspective, interesting

882: biological processes related to DNA molecules include transcription,

883: replication, and recombination, and  their potential connection

884: to GC\% has been reviewed in Refs.~\cite{bernardi,bernardi-book,li-nova}.

885: Generally speaking, GC\% has a statistical association with

886: all three processes, though the cause-and-effect role has

887: not yet been firmly established. Recent studies show that broadly

888: expressed ``housekeeping genes'' tend to be located in GC-rich

889: regions \cite{housekeeping}. To understand the genome-wide organization

890: of biological units that play a role in those processes

891: (e.g., genes, origins and timing of replication, or recombination hotspots),

892: at times it is more feasible to directly study the

893: spatial distribution of functional units instead of using

894: the GC\% as a surrogate.

895:

896: From the {\sl biophysics and cellular biology}

897: perspective, GC\% is linked with  bands from chromosome-staining

898: \cite{gojobori}, and in addition, possibly with the

899: matrix/scaffold attachment/associated regions located at

900: the end of DNA loops \cite{attachment}.  It has also been

901: suggested that GC-rich chromosomes (or regions)

902: tend to be located in the interior of the nucleus

903: during interphase and are more ``open'' in their

904: tertiary structure, whereas GC-poor segments are

905: more likely to be close to the surface of the

906: nucleus and more condensed \cite{interphase}.

907:

908: Further exploration of the relationship between GC\% fluctuations,

909: as well as its large-scale patterns, and the above

910: biological processes is beyond the scope of this

911: paper. For bacterial genomes, an attempt has been

912: made to relate the scale-invariance feature in sequence

913: statistics to the genome organization of transcription

914: activities \cite{audit}. It is clear that more

915: integrated computational and experimental analyses

916: need to be carried out along similar lines

917: before one can give universal $1/f$ spectra in

918: DNA sequences a satisfactory biological explanation.

919:

920:

921:

922:

923: \SEC*{Acknowledgements}

924:

925: We thank S. Guharay for participating in the early stage

926: of this project, and O. Clay, J.L. Oliver, and A. Fukushima for

927: valuable discussions.

928:

929:

930:

931: \begin{thebibliography}{99}

932:

933: \bibitem{johnson}

934: J.B. Johnson, Phys. Rev. {\bf 26}, 71-85 (1925).

935:

936: \bibitem{1freview}

937: A. van der Ziel, Adv. Electronics and Electronics Phys.

938: {\bf 49}, 225-297 (1979);

939: P. Dutta and P.M. Horn, Rev. Mod. Phys. {\bf 53}, 497-516 (1981);

940: M.B. Weissman, Rev. Mod. Phys.  {\bf 60}, 537-571 (1988);

941: H. Wong, Microelectronics Reliability {\bf 43}, 585-599 (2003).

942:

943: \bibitem{1freview-other}

944: M. Gardner, Sci. Am. {\bf 238}, 16-32 (1978);

945: W. Press, Comments on Astrophys. {\bf 7}, 103-119 (1978);

946: B.J. West and M.F. Shlesinger, Am. Sci. {\bf 78}, 40-45 (1990);

947: E. Milotti, arXiv preprint, physics/0204033 (2002).

948:

949: \bibitem{1fbib}

950: W. Li, {\em A bibliography on 1/f noise} (online),

951: {\sf http://www.nslij-genetics.org/wli/1fnoise/}.

952:

953: \bibitem{mapping}

954: H. Herzel and I. Grosse, Physica A {\bf 216}, 518-542 (1995);

955: S.V. Buldyrev {\sl et al.}, Phys. Rev. E {\bf 51}, 5084-5091 (1995).

956:

957: \bibitem{dirk}

958: D. Holste, I. Grosse and H. Herzel, Phys. Rev. E, {\bf 64}, 041917 (2001);

959: D. Holste, {\sl et al.}, Phys. Rev. E, {\bf 67}, 061913 (2003).

960:

961: \bibitem{wli-em}

962: W. Li, Europhys. Lett. {\bf 10}, 395-400 (1989);

963: W. Li, Phys. Rev. A, {\bf 43}, 5240-5260 (1991).

964:

965: \bibitem{wli-dna}

966: W. Li, Int. J. Bifurcation \& Chaos, {\bf 2}, 137-154 (1992);

967: W. Li and  K. Kaneko, Europhys. Lett. {\bf 17}, 655-660 (1992).

968:

969: \bibitem{voss}

970: R.F. Voss, Phys. Rev. Lett., {\bf 68}, 3805-3808 (1992).

971:

972: \bibitem{bac}

973: X. Lu, {\sl et al.}, Phys. Rev. E, {\bf 58}, 3578-3584 (1998);

974: M. de Sousa Vieira, Phys. Rev. E, {\bf 60}, 5932-5937 (1999).

975:

976: \bibitem{yeast}

977: W. Li, {\sl et al.}, Genome Res. {\bf 8}, 916-928 (1998).

978:

979: \bibitem{fukushima}

980: A. Fukushima,

981: {\em Periodicity in Genome Architecture from Bacteria to Human}

982: (Ph.D Thesis, Nara Institute of Science and Technology, 2003);

983: A. Fukushima, {\sl et al.}, Gene, {\bf 300}, 203-211 (2002).

984:

985: \bibitem{isochore}

986: G. Cuny, {\sl et al.}, Euro. J. Biochem. {\bf 99}, 179-186 (1981);

987: G. Bernardi, {\sl et al.}, Science, {\bf 228}, 953-958 (1985);

988: G. Bernardi, Gene, {\bf 241}, 3-17 (2000).

989:

990: \bibitem{isochore-clay}

991: O. Clay, {\sl et al.}, Gene, {\bf 276}, 15-24 (2001);

992: O. Clay and  G. Bernardi,  Gene, {\bf 276}, 25-31 (2001).

993:

994:

995: \bibitem{cc03}

996: W. Li, {\sl et al.}, Comput. Biol. and Chem., {\bf 27}, 5-10 (2003).

997:

998: \bibitem{clay3}

999: O. Clay, Gene, {\bf 276}, 33-38 (2001).

1000:

1001: \bibitem{isochore-spe}

1002: W. Li,  Gene, {\bf 300}, 129-139 (2002).

1003:

1004: \bibitem{wli-complexity}

1005: W. Li, Complexity, {\bf 3}, 33-37 (1997).

1006:

1007:

1008: \bibitem{lander}

1009: E.S. Lander, {\sl et al.}, Nature, {\bf 409}, 860-921 (2001).

1010:

1011: \bibitem{pedro}

1012: P. Bernaola-Galv\'{a}n, {\sl et al.},

1013: Gene, {\bf 300}, 105-115 (2002).

1014:

1015: \bibitem{hg-quality}

1016: J. Schmutz {\sl et al.},

1017: Nature {\bf 429}, 365-368 (2004).

1018:

1019:

1020: \bibitem{repeatmasker}

1021: A.F.A.~Smit and  P.~Green,

1022: unpublished results

1023: (URL: {\sf http://repeatmasker.genome.washington.edu/}).

1024:

1025: \bibitem{daniell}

1026: P.J. Daniell,

1027: {\em Suppl. J. Royal Stat. Soc.} {\bf 8}, 88--90 (1946).

1028:

1029:

1030:

1031: \bibitem{repeats-bio}

1032: J.R.~Korenberg and M.C.~Rykowski, Cell {\bf 53}, 391 (1988);

1033: P.~Medstrand {\em et al.}, Genome Res. {\bf 12}, 1483 (2002);

1034: M.-A.~Hakimi {\em et al.}, Nature {\bf 418}, 994 (2002);

1035: J.S.~Han, S.T.~Szak, and J.D.~Boeke, Nature {\bf 429}, 268--274 (2004).

1036:

1037: \bibitem{cscl}

1038: O. Clay, {\sl et al.},

1039: Euro. Biophys. J., {\bf 32}, 418-426 (2003).

1040:

1041: \bibitem{macaya}

1042: G. Macaya, J.P. Thiery, and G. Bernardi, J. Mol. Biol. {\bf 108}, 237-254 (1976).

1043:

1044: \bibitem{li-nova}

1045: W. Li, in {\sl Progress in Bioinformatics} (Nova Science Publisher, 2005),

1046: to appear.

1047:

1048: \bibitem{beran}

1049: J. Beran, {\sl Statistics for Long-Memory Processes} (Chapman \& Hall, 1994).

1050:

1051: \bibitem{audit}

1052: B. Audit and C.A. Ouzounis, J. Mol. Biol. {\bf 332}, 617-633 (2003).

1053:

1054: \bibitem{pedro04}

1055: P. Bernaola-Galv\'{a}n, {\em et al.,}

1056: Gene, {\bf 333}, 121-133 (2004).

1057:

1058:

1059: \bibitem{snedecor}

1060: G.W. Snedecor and  W.G. Cochran,

1061: {\sl Statistical Methods}, Seventh Edition

1062: (Iowa State University Press, 1980).

1063:

1064: \bibitem{voss-music}

1065: R.F. Voss and  J. Clarke,  Nature, {\bf 258}, 317-318 (1975);

1066: K. J. Hsu and  A. Hsu,  Proc. Natl. Acad. Sci., {\bf 88}, 3507-3509 (1991).

1067:

1068:

1069:

1070:

1071: \bibitem{li-holste-21}

1072: W. Li and  D. Holste, Comp. Bio. Chem., {\bf 28}, in press (2004).

1073:

1074:

1075:

1076: \bibitem{1fmouse}

1077: W. Li and D. Holste, Fluct. Noise Lett. {\bf 4}, in press (2004).

1078:

1079: \bibitem{pevzner}

1080: P. Pevzner and G. Tesler, Genome Res. {\bf 13}, 37-45 (2003).

1081:

1082: \bibitem{ohno}

1083: S. Ohno, {\em Evolution by Gene Duplication}

1084: (Springer-Verlag, Berlin, 1970).

1085:

1086: \bibitem{meyer}

1087: A. Meyer and Y. van de Peer, J. Struct. Funct. Genomics,  {\bf 3}, vii-ix (2003).

1088:

1089: \bibitem{bernardi}

1090: G. Bernardi, Ann. Rev. Genet. {\bf 23}, 637-661 (1989);

1091: G. Bernardi, Ann. Rev. Genet. {\bf 29}, 445-476 (1995).

1092:

1093: \bibitem{bernardi-book}

1094: G. Bernardi, {\em Structural and Evolutionary Genomics}

1095: (Elsevier, 2004).

1096:

1097: \bibitem{housekeeping}

1098: M.J. Lercher, A.O. Urrutia, and L.D. Hurst,

1099: Nature Genet. {\bf 31}, 180-183 (2002);

1100: M.J. Lercher, {\sl et al.}, Hum. Mol. Genet. {\bf 12}, 2411-2415

1101: (2003);

1102: R. Versteeg, {\sl et al.}, Genome Res. {\bf 13}, 1998-2004 (2003).

1103:

1104: \bibitem{gojobori}

1105: Y. Niimura and  T. Gojobori,

1106: Proc. Natl. Acad. Sci. {\bf 99}, 797-802 (2002).

1107:

1108: \bibitem{attachment}

1109: P.A. Dijkwel and  J.L. Hamlin, Int. Rev. Cytol.  {\bf 162A}, 455-484 (1995);

1110: S.V. Razin, I.I. Gromova, and  O.V. Iarovaia, Int. Rev. Cytol.

1111: {\bf 162B}, 405-448 (1995).

1112:

1113:

1114: \bibitem{interphase}

1115: S. Boyle, {\sl et al.}, Hum. Mol. Genet. {\bf 10}, 211-219 (2001);

1116: S. Saccone, {\sl et al.}, Gene, {\bf 300}, 169-178 (2002).

1117:

1118:

1119: \end{thebibliography}

1120:

1121:

1122: \end{document}

1123:

1124:

1125: