q-bio0411016/tmp.tex
1: \def \manuflag {0}
2:  
3: \ifnum \manuflag = 0
4:  \documentclass[pre,twocolumn,showpacs,amsmath,amssymb]{revtex4}
5:  \usepackage{psfig}	
6:  \usepackage{epsfig}		
7:  \usepackage{delarray}
8:  \usepackage{graphicx}
9:  \usepackage{dcolumn}
10:  \usepackage{bm}
11:  \def \Title{
12: Universal $1/f$ noise, cross-overs of scaling exponents, 
13: and chromosome specific patterns of GC content in
14: DNA sequences of the human genome}
15:  \def \figsize{6.85cm}
16:  \def \figname{\footnotesize \sc FIG.}
17:  \def \tblname{\footnotesize \sc TABLE}
18:  \sloppy
19:  \newcommand{\SEC}{\section}
20:  \newcommand{\SUBSEC}{\subsection}
21: \else
22:  \documentclass[pre,onecolumn,showpacs,amsmath,amssymb]{revtex4}
23:  \usepackage{psfig}	
24:  \usepackage{epsfig}		
25:  \usepackage{delarray}
26:  \usepackage{graphicx}
27:  \usepackage{dcolumn}
28:  \usepackage{bm}
29:  \usepackage{rotating}
30:  \def \Title{Universal $1/f$ noise, cross-overs of scaling
31: exponents, and chromosome specific patterns of GC content in
32: DNA sequences of the human genome}
33:  \def \figsize{12.6cm}
34:  \renewcommand{\baselinestretch}{2.4} 
35: \fi
36: 
37: \def \Abstract{
38: Spatial fluctuations of guanine and cytosine base content (GC\%) are
39: studied by spectral analysis for the complete set of human genomic DNA
40: sequences. We find that (i) the $1/f^{\alpha}$ decay is universally
41: observed in the power spectra of all twenty-four chromosomes, and that
42: (ii) the exponent $\alpha \approx 1$ extends to about $10^7$ bases,
43: one order of magnitude longer than what has previously been observed.
44: We further find that (iii) almost all human chromosomes exhibit a
45: cross-over from $\alpha_1 \approx 1$ ($1/f^{\alpha_1}$) at lower
46: frequency to $\alpha_2 < 1$ ($1/f^{\alpha_2}$) at higher frequency, 
47: typically occurring at around 30,000--100,000
48: bases, while (iv) the cross-over in this frequency range is 
49: virtually absent in
50: human chromosome 22. In addition to the universal $1/f^\alpha$ noise in power
51: spectra, we find (v) several lines of evidence for chromosome-specific
52: correlation structures, including a 500,000 bases long oscillation in
53: human chromosome 21. The universal $1/f^\alpha$ spectrum in human 
54: genome is further substantiated by a resistance to variance reduction 
55: in guanine and cytosine content when the window size is increased.
56: }
57: 
58: 
59: 
60: \ifnum \manuflag = 0
61:  \begin{document}
62:  \title{\Title}
63:  \author{Wentian Li}
64:  \email{wli@nslij-genetics.org}
65:  \affiliation{The Robert S. Boas Center for Genomics and Human Genetics,
66: 	      North Shore LIJ Institute for Medical Research, 350 Community Dr.,
67: 	      Manhasset, NY 10030.}
68:  \author{Dirk Holste}
69:  \email{holste@mit.edu}
70:  \affiliation{Department of Biology, 
71: 	      Massachusetts Institute of Technology, Cambridge, MA 02139.}
72:  \begin{abstract}
73:    \Abstract
74:  \end{abstract}
75:  \pacs{87.10.+e, 87.14.Gg, 87.15.Cc, 02.50.-r, 02.50.Tt, 89.75.Da, 89.75.Fb, 05.40.-a}
76:  %% \keywords{Suggested keywords if desired}
77:  \maketitle
78: \else
79:  \begin{document}
80:  \title{\Title}
81:  \author{Wentian Li}
82:  \email{...@...}
83:  \affiliation{...}
84:  \author{Dirk Holste}
85:  \email{holste@mit.edu}
86:  \affiliation{Department of Biology,
87: 	      Massachusetts Institute of Technology, Cambridge, MA 02139.}
88:  \begin{abstract}
89:    \Abstract
90:  \end{abstract}
91:  \pacs{87.10.+e, 02.50.-r, 05.40.-a  \hfill {\tt Thu Sep  5 15:36:39 EDT 2002}}
92:  %% \keywords{Suggested keywords if desired}
93: \maketitle
94: \fi
95: 
96: \SEC{Introduction}
97: 
98: By measuring the proportion of a signal's power $S(f)$ falling into
99: a range of frequency components $f$, a power spectrum of the form
100: $S(f) \sim  1/f^\alpha$ distinguishes between two prototypes of noise:
101: white noise ($\alpha = 0$) and Brownian noise ($\alpha = 2$). The
102: intermediate range, termed ``$1/f$ noise'', can in practice be defined
103: as $1/f^\alpha$ with $0.5 \lesssim \alpha \lesssim 1.5$. $1/f$ noise was first
104: observed experimentally in electric current fluctuations of the thermionic
105: tube at the beginning of the twentieth century \cite{johnson}.
106: Since then, $1/f$ noise has been found repeatedly in many other 
107: conducting materials \cite{1freview}.  More generally, it has also 
108: been observed in wide ranges of natural as well as human-related
109: phenomena, including traffic flow, star light, speech, music 
110: and human coordination \cite{1freview-other,1fbib}.
111: For biological sequences, such as DNA, the concept of slow-varying, 
112: multiple-length variations in the power of frequency components 
113: can be translated to long-ranging correlations in the spatial 
114: arrangement of the four bases adenine (A), cytosine (C), guanine 
115: (G) and thymine (T).  One can categorize chemically A, C,
116: G, and T as strong (G or C) or weak (A or T) bonding. It has been
117: shown that fluctuations of the GC base content along a DNA sequence
118: are typically more strongly correlated than those of other possible
119: binary classifications \cite{mapping,dirk}. 
120: Initial studies of $1/f$ noise in DNA sequences were motivated 
121: by a model of spatial $1/f$ noise of symbolic sequence evolution 
122: \cite{wli-em}.  Subsequently, empirical $1/f$ spectra were 
123: indeed observed in non-protein-coding DNA sequences 
124: \cite{wli-dna}, and their generality in DNA sequences was further 
125: illustrated in \cite{voss}.
126: 
127: $1/f$ noise has been detected in a variety of different species and
128: taxonomic classes, including bacteria \cite{bac}, yeasts
129: \cite{yeast}, insects \cite{fukushima}, and other higher eukaryotic
130: genomes. Integrating this and several other lines of evidence, a
131: consensus on $1/f$ noise in DNA sequences has emerged: 
132: (1) for DNA sequences of the order of $10^6$~bases ($1$~Mb), a $1/f^\alpha$ 
133: spectrum ($\alpha \approx 1$) is consistently observed; (2) for 
134: isochores, which are DNA sequences of relatively homogeneous base 
135: concentration at least $300\cdot 10^3$ bases ($300$~kb) long 
136: \cite{isochore,isochore-clay,cc03}, a $1/f^\alpha$ spectrum is also
137: observed,  but typically shows a smaller exponent $\alpha < 0.7$
138: \cite{isochore-clay,clay3,isochore-spe}; (3) for DNA sequences of
139: the order of several kb, the decay of $S(f)$ is non-trivial and may
140: depend on whether the sequence is protein-coding \cite{wli-dna}. 
141: The viral DNA sequence of the $\lambda$-phage, e.g., shows a single step in its GC
142: base concentration and its spectrum is $S(f) \sim 1/f^2$, which is
143: characteristic of random block sequences \cite{wli-complexity}. 
144: We note that the universal scaling of $S(f) \sim 1/f^\alpha$ ($\alpha \approx 1$)
145: across all species discussed in \cite{voss} has apparently been
146: restricted to a length scale of $1$~kb, by averaging the spectrum over
147: many $N=2$~kb DNA segments.
148: 
149: With the availability of the first completed version of the DNA
150: sequence of the human genome \cite{lander}, several studies have been able
151: to demonstrate that the base-base correlation function $\Gamma(d)$
152: ($d$ distance between bases) of several DNA sequences follows a
153: power-law decay, $\Gamma(d) \sim 1/d^{\gamma}$. For instance, the DNA
154: sequence of human chromosome 22 shows statistically significant
155: power-law correlations up to $d=1$~Mb, and
156: correlations in the DNA sequence of chromosome 21 are statistically
157: significant up to several~Mb (with the scaling exponent $\gamma$
158: changing beyond a few~kb) \cite{dirk,pedro}.
159: While the DNA sequences of human chromosomes 21 and 22 are about
160: $34$~Mb long, in order to estimate the limit of the range of $1/f^\alpha$
161: spectrum, longer sequences are necessary. 
162: 
163: 
164: 
165: The draft of the human genome sequence was released in February 2001.
166: About three years later, in 2004, a dozen (out of 24) human chromosomes
167: have been completed to the accuracy standard of
168: less than one error per 10,000 DNA
169: bases (99.99\% accuracy) \cite{hg-quality}.  Building upon the release
170: of updated, high-quality sequence data, in the era of genomics we can
171: now conduct a systematic analysis of several issues of $1/f$ noise in
172: the DNA sequences of our own species {\em Homo sapiens}, which 
173: have been pursued over the last decade in a fragmentary manner.
174: 
175: In this paper, we use the DNA sequences of the complete set of
176: twenty-two autosomes and two sex chromosomes to address the following 
177: issues: Is $1/f$ noise
178: universally present across the entire set of human genome sequences?
179: Does $1/f$ noise extend to lower frequency ranges in longer DNA
180: sequences? Is the decay of $S(f)$ characterized by a single exponent
181: $\alpha$, or does it exhibit cross-overs (multiple scaling exponents)? 
182: Given the presence of universal variations at multiple scales, 
183: do these co-exist with variations at chromosome-specific scales?
184: 
185: \begin{figure}
186: \centerline{\psfig{figure=dirk1.eps,width=68mm,angle=0}}
187: \caption{\label{fig-dirk1}
188: Double-logarithmic representation of the human genome-wide length
189: distribution of interspersed repeat sequences, non-repetitive
190: sequences, and sequences of unknown base composition (gaps).  The
191: length distribution of interspersed repeats and non-repetitive sequences
192: exhibits a power-law-like decay, while that of gap sequences
193: is scattered across different sequence lengths. The peaks 
194: at $\sim 300$ bases and
195: several kb correspond to Alu and possibly LINE repeats.
196: }
197: \end{figure}
198: 
199: \begin{figure}[ht]
200: \centerline{\psfig{figure=dirk2.eps,width=68mm,angle=0}}
201: \caption{\label{fig-dirk2}
202: Distribution of genome-wide GC content (GC\%) of the human genome
203: for interspersed repeat sequences, non-repetitive sequences, 
204: and all (``overall") sequences with sequence segments of $20~{\rm kb}$. 
205: The mode (peak location) of non-repetitive sequences is at 
206: $\sim$35\%, while the mode of repetitive sequences is shifted 
207: to a higher GC\% ($\sim$42\%). The fraction 
208: of non-repetitive sequences with GC\%~$>$~50\% is markedly 
209: larger than that of repetitive sequences.
210: }
211: \end{figure}
212: 
213: 
214: \SEC{Data and methods}
215: 
216: In this section, we introduce the data for human genome sequences, as well
217: as the notation and definitions used throughout this study.
218: All twenty-four chromosomes are taken from NCBI build 34
219: (human genome release hg16). Sequence data were downloaded
220: from the UCSC human genome repository (available at {\sf http://genome.ucsc.edu/}).
221: Unsequenced bases are kept to preserve the spacing between 
222: bases. Human chromosomes (Chr) 13, 14, 15, 21, and 22 contain a large 
223: amount of unsequenced bases at the left end of their DNA sequences, 
224: accounting for about 15\%, 17\%, 18\%, 21\%, and 29\% of the 
225: individual chromosome size,
226: respectively; 51\% of chromosome Y is unsequenced.
227: 
228: Our analysis of human DNA sequences is conducted using coarse-grained data.
229: Each original sequence is transformed into a spatial series
230: of GC content (GC\%) values. To this end, we evenly partition a
231: DNA sequence into $N$ non-overlapping windows of length $w$ bases 
232: and compute $\rho_i(w)=$GC\%$_i$ for each window $i$ to obtain a spatial
233: GC\% series: 
234: \begin{equation}
235: \label{rho}
236: \{ \rho_i \} \equiv  \{ \rho_i(w) \}  \equiv \{ \mbox{GC\%}_i \},
237: \hspace{0.1in}
238: i = 1, 2, \dots, N.
239: \end{equation}
240: Table~1 lists the corresponding window sizes 
241: for each human chromosome.  Since different human chromosomes have 
242: different sizes, whereas the number of partitions ($N$) is the same, 
243: the window lengths vary.
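
As a worked illustration of this relation (a sketch based only on the values in Table~1, assuming that $w$ is simply the chromosome length $L$, including unsequenced bases, divided by $N$), the window size and the implied sequence length are related by
\begin{equation}
w = \frac{L}{N}, \hspace{0.1in} N = 2^{17} = 131072 .
\end{equation}
For chromosome 1, $w \approx 1.88$~kb implies $L \approx 1.88~\mbox{kb} \times 131072 \approx 246$~Mb, whereas for chromosome 21, $w \approx 0.36$~kb implies $L \approx 47$~Mb.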
244: 
245: 
246: \begin{table}[ht]
247: \caption{\label{tab:table1}
248: Average GC content ($\overline{ \mbox{GC}\%}$ or $\overline{ \rho }$) and the window 
249: size ($w$) for partitions using $N=2^{17}$ non-overlapping windows for
250: twenty-four human chromosomes. Low-frequency scaling exponents 
251: $\alpha_1$ are estimated  from $S(f; s=3) \sim 1/f^{\alpha_1}$
252: in the range of $10^{-7} < f < 10^{-5}$ base$^{-1}$, and high-frequency
253: scaling exponents $\alpha_2$  are estimated in the range of
254: $10^{-5} < f < 2 \times 10^{-4}$ base$^{-1}$. The difference
255: between the two scaling exponents, $\Delta \alpha \equiv \alpha_1-\alpha_2$,
256: is listed in the fifth column. Low- and high-frequency exponents for $S(f)$
257: with substituted interspersed repeats are indicated by
258: $\alpha'_1$ and $\alpha'_2$, and their difference by
259: $\Delta \alpha' \equiv \alpha'_1-\alpha'_2 $.
260: }
261: \begin{ruledtabular}
262: \begin{tabular}{l|c|c|c|c|c|c|}
263: Chr & $\overline{ \mbox{GC}\%}$ & $w$ (kb) & $\alpha_1$ $\alpha_2$ & $\Delta \alpha$ 
264:  & $\alpha'_1$ $\alpha'_2$ & $\Delta \alpha'$ \\
265: \hline
266: 1  & 41.7 &1.88 & 0.88 0.46 & 0.42 & 0.80 0.29 &0.51\\
267: 2  & 40.2 &1.86 & 0.99 0.51&  0.48 & 0.96 0.30 &0.66\\
268: 3  & 39.7 &1.52 & 0.95 0.43& 0.53 & 0.88 0.27 &0.61 \\
269: 4  & 38.2 &1.46 & 0.87 0.34& 0.53 & 0.75 0.19 &0.57\\
270: 5  & 39.5 &1.38 & 0.89 0.39& 0.51 & 0.88 0.23 &0.65 \\
271: 6  & 39.6 &1.30 & 0.99 0.36& 0.63 & 0.86 0.24 &0.63\\
272: 7  & 40.7 &1.21 & 0.97 0.46& 0.51 & 0.87 0.33 &0.55 \\
273: 8  & 40.1 &1.12 & 0.97 0.42& 0.55 & 0.91 0.26 &0.66\\
274: 9  & 41.3 &1.04 & 0.96 0.39& 0.57 & 0.90 0.28 &0.62 \\
275: 10 & 41.6 &1.03 & 0.97 0.52& 0.46 & 0.95 0.34 &0.61 \\
276: 11 & 41.6 &1.03 & 1.05 0.50& 0.55 & 0.97 0.35 &0.62 \\
277: 12 & 40.8 &1.01 & 0.97 0.39& 0.59 & 0.89 0.28 &0.61 \\
278: 13 & 38.5 & 0.86 & 0.83 0.33 & 0.50 & 0.73 0.24 &0.49 \\
279: 14 & 40.9 &0.80  & 1.03 0.36 & 0.66 & 0.95 0.27 &0.68\\
280: 15 & 42.2 &0.76 & 0.90 0.50 & 0.40 & 0.83 0.39 &0.44 \\
281: 16 & 44.8 &0.69 & 0.91 0.51 & 0.40 &0.81 0.36 &0.45\\
282: 17 & 45.5 &0.62 & 0.98 0.57 & 0.42 & 0.89 0.44 &0.46 \\
283: 18 & 39.8 &0.58 & 1.12 0.40 & 0.72 & 1.12 0.28 &0.83 \\
284: 19 & 48.4 &0.49  & 1.00 0.56 & 0.44 & 0.81 0.37 &0.45 \\
285: 20 & 44.1 &0.49  & 0.87 0.51 & 0.36 & 0.83 0.30 &0.53 \\
286: 21 & 40.9 &0.36 & 0.91 0.33 & 0.58 & 0.86 0.22 &0.64 \\
287: 22 & 47.9 &0.38  & 0.90 0.62 & 0.28 & 0.86 0.40 &0.45 \\
288: X  & 39.4 &1.17  & 0.93 0.38 & 0.54 & 0.73 0.18 & 0.55 \\
289: Y  & 39.1 &0.38  & 0.83 0.38 & 0.45 & 0.70 0.21 & 0.49 \\
290: \end{tabular}
291: \end{ruledtabular}
292: \end{table}
293: 
294: Human DNA sequences contain a large fraction of interspersed repeats,
295: i.e., copies of an ancestral sequence fragment that possess a high
296: similarity between the duplicated and the ancestral sequence.  One can
297: detect interspersed repeats by using the program {\sf RepeatMasker}
298: \cite{repeatmasker}. ``Soft-masked'' annotations of interspersed repeats 
299: are taken from the DNA sequences of the UCSC human genome repository
300: ({\sf http://genome.ucsc.edu/}), where repetitive (non-repetitive)
301: bases are annotated in small (capital) letters. Figure~\ref{fig-dirk1} 
302: shows the length distribution of the three sequence classes of uninterrupted
303: non-repetitive, interspersed repeat, and gap sequences.
304: Figure~\ref{fig-dirk2} shows the corresponding distribution of the
305: genome-wide GC\% for these three sequence classes.
306: 
307: 
308: To investigate the effect of interspersed repeats, we substitute 
309: them by random bases drawn according to the chromosomal level of GC\%. 
310: In what follows, the transformed, repeat-substituted DNA sequences 
311: are distinguished from the original sequences.  On the 
312: coarse-grained level, this substitution is equivalent to
313: replacing, in the $\{ \rho_i \}$ ($i=1, 2, \dots, N$) series,
314: any value calculated from interspersed repeats by a random
315: value sampled from a Gaussian distribution whose
316: mean and variance are the
317: same as those of GC\% in the original sequence.  Another possibility
318: consists in substituting repetitive sequences
319: by a constant value (e.g., the average GC\% value
320: of the original sequence).  This method introduces
321: additional correlations (and less variance) in the $\{ \rho_i \}$
322: series and is not adopted in this paper.
323:  
324: Three different, albeit functionally related, measures are
325: applied to the $\{ \rho_i \}$ series: the power spectrum
326: as a function of the frequency $S(f)$, the correlation function
327: $\Gamma(d)$ as a function of the distance $d$ between
328: windows, and variance $\sigma^2(w)$ of GC\%
329: series as a function of the window size $w$.
330: 
331: First, we conduct spectral analyses by calculating the power spectrum, 
332: the squared modulus of the Fourier transform normalized by $N$, defined as:
333: \begin{equation}
334: S(f) \equiv \frac{1}{N} \left| \sum_{k=1}^{N} \rho_k
335: \cdot
336: e^{ -i 2 \pi k f/N} \right|^2,
337: \end{equation}
338: where $N$ is the total number of windows, and $f$ is measured
339: in units of cycle/window, which can be converted to units
340: of cycle/base by the window size (cf. Table~1). 
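
To make the unit conversion explicit, the following sketch assumes that $f$ in the definition above is the integer index of the spectral component, so that component $f$ corresponds to $f/N$ cycle/window:
\begin{equation}
f_{\rm base} = \frac{f}{N w}, \hspace{0.1in} f = 1, 2, \dots, N/2 .
\end{equation}
With the chromosome 1 values of Table~1 ($N = 2^{17}$, $w \approx 1.88$~kb), the lowest accessible frequency is $1/(Nw) \approx 4.1 \times 10^{-9}$ cycle/base and the highest is $1/(2w) \approx 2.7 \times 10^{-4}$ cycle/base.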
341: 
342: Coarse-graining ``hides'' base-base correlations at scales smaller
343: than $w$ bases.  The choice of $N = 2^{17}$ windows was made such 
344: that it is (i) sufficiently large to cover small-scale fluctuations, 
345: while (ii) at the same time sufficiently small so that the spectral analysis is
346: computationally feasible. As different chromosomes have different
347: lengths, an equal number of partitions leads to different window sizes $w$. 
348: 
349: The unsmoothed $S(f)$, or periodogram, contains $N/2$ independent
350: spectral components.  One can filter periodograms to obtain a
351: ``smoothed'' spectrum $S(f;s)$, where $s$ is the span-size
352: parameter.  Since filtering with a relatively large $s$-value 
353: possibly distorts the shape of $S(f;s)$ at lower frequency components, 
354: different span-sizes are applied for different frequency ranges.
355: 
356: 
357: The second measure applied to the $\{ \rho_i \}$ series
358: is the correlation function, $\Gamma(d)$, which is computed
359: from two truncated series 
360: of $\{ \rho_i \}$, $ \rho' = \{ \rho_k \}$  ($k=1, 2, \dots, N-d$) and 
361: $ \rho'' = \{ \rho_k \}$  ($k=d+1, d+2, \dots,  N$):
362: \begin{equation}
363: \label{gamma}
364: \Gamma(d) \equiv \frac{ \mbox{Cov}(\rho', \rho'')}
365: { \sqrt{ \mbox{Var}(\rho')} \sqrt{ \mbox{Var}(\rho'')}},
366: \end{equation}
367: where $\mbox{Cov}( \rho', \rho'')=
368: \langle \rho' \rho''\rangle - \langle \rho'\rangle \langle \rho''\rangle $ 
369: and $\mbox{Var}( \rho') = \langle \rho'^2 \rangle -  \langle  \rho' \rangle^2$ 
370: (or $\mbox{Var}( \rho'') = \langle \rho''^2 \rangle -  \langle  \rho'' \rangle^2$) 
371: are the covariance and variance.
372: Note that the $\Gamma(d)$ defined in Eq.(\ref{gamma})
373: is slightly different from that defined using
374: a periodic boundary condition.
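
For comparison, one common form of the correlation function under a periodic boundary condition (given here as an illustrative assumption, not as a definition used in this paper) is
\begin{equation}
\Gamma_{\rm per}(d) = \frac{ \frac{1}{N} \sum_{k=1}^{N} \rho_k \, \rho_{k \oplus d} - \langle \rho \rangle^2 }
{ \langle \rho^2 \rangle - \langle \rho \rangle^2 },
\end{equation}
where $k \oplus d$ denotes $k+d$ wrapped around to the range $1, \dots, N$, and the averages are taken over the full series. The two definitions differ only through the $d$ terms that wrap around the end of the series and through the use of full-series rather than truncated-series means and variances, so the difference is small for $d \ll N$.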
375: 
376: 
377: 
378: The third and final measure applied to the 
379: $\{ \rho_i \}$ series is the variance $\sigma^2(w)$:
380: \begin{equation}
381: \sigma^2(w) \equiv \langle \rho(w)^2 \rangle -
382: \langle \rho(w) \rangle^2
383: \end{equation}
384: as a function of the window size $w$.
385: The power spectrum, the correlation function, and the window-size-dependent
386: variance are interrelated quantities \cite{clay3}:
387: \begin{equation}
388:   \sigma^2(w) \sim \frac{ \Gamma(0) }{ w } \cdot 
389: \bigg\{ 1 + \frac{2}{w}\sum_{d=1}^{w-1} (w-d) \Gamma(d) \bigg\}.
390: \end{equation}
391: If $S(f) \sim 1/f^\alpha$, $\Gamma(d) \sim 1/d^\gamma$,
392: $\sigma^2(w) \sim 1/w^\beta$ are power-law functions, 
393: then their scaling exponents are related  
394: by $\alpha = 1-  \gamma$ and  $\gamma=\beta$ \cite{clay3}.
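
The relation $\gamma = \beta$ can be made plausible by the following sketch, which assumes $\Gamma(d) \approx c \, d^{-\gamma}$ with a constant $c$ and $0 < \gamma < 1$, and approximates the sum by an integral:
\begin{equation}
\sum_{d=1}^{w-1} (w-d) \, \Gamma(d) \approx c \int_0^w (w-x) \, x^{-\gamma} \, dx
= \frac{c \, w^{2-\gamma}}{(1-\gamma)(2-\gamma)} .
\end{equation}
Inserting this into the expression for $\sigma^2(w)$ gives
\begin{equation}
\sigma^2(w) \sim \frac{\Gamma(0)}{w} + \frac{2 c \, \Gamma(0)}{(1-\gamma)(2-\gamma)} \, w^{-\gamma} ,
\end{equation}
and for large $w$ the second term dominates, so that $\sigma^2(w) \sim 1/w^{\gamma}$, i.e., $\beta = \gamma$. The relation $\alpha = 1 - \gamma$ follows analogously, since the Fourier transform of $d^{-\gamma}$ scales as $f^{\gamma-1}$ at low frequency.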
395: 
396: The calculation of $S(f)$ and $\Gamma(d)$ was carried out by the 
397: statistical package {\sl S-PLUS} (Version 3.4, MathSoft, Inc.), and the
398: type of filter implemented for $S(f)$ is the Daniell-filter \cite{daniell}.
399: 
400:  
401: \SEC{$1/f$ noise  is a universal feature of human DNA sequences}
402: 
403: 
404: In this section, we use the power spectrum $S(f)$ to study
405: GC\% of human genome sequences, with
406: the goals of testing the universality of $1/f$ noise, quantifying
407: different decay ranges for $S(f) \sim 1/f^\alpha$, and comparing
408: $S(f)$ across DNA sequences of different human chromosomes.
409: 
410: Figure~\ref{fig3:s(f)} shows for $N=2^{17}$ GC\% values the power 
411: spectra $S(f)$ across all human chromosomes. We find that $S(f)$ 
412: exhibits no clear plateau at 
413: low frequency ($< 10^{-6}$ cycle/base) and increases steadily
414: with decreasing frequency. The decay can be mathematically 
415: approximated by a power-law of the form $S(f) \sim 1/f^\alpha$ 
416: with $\alpha \approx 1$.  Table~1 lists for the frequency range
417: $f=$ 10~Mb$^{-1}$--100~kb$^{-1}$ the estimated scaling exponent
418: $\alpha_1$ for all chromosomes, using a best-fit regression
419: of $\log_{10} S(f; s=3) = a - \alpha_1 \log_{10}(f)$. We find that
420: $\alpha_1$ is typically close to 1, with 
421: little variation across chromosomes.
422: 
423: A closer inspection of Fig.~\ref{fig3:s(f)} shows that the 
424: majority of $1/f$ spectra undergo a cross-over from $\alpha_1 \approx 1$ 
425: to $\alpha_2 < 1 $ at high frequency. The deviation from $\alpha_1 \approx 1$ 
426: starts at about 30--100~kb and continues at smaller distances.
427: Figure~\ref{fig4:ex} illustrates this feature for $S(f; s=31)$ 
428: of the DNA sequences of Chr15, Chr21, and Chr22 in more detail.
429: We find that chromosomes 15 and 21 exhibit clear cross-overs 
430: at about 100~kb, while chromosome 22 exhibits no apparent break-point.
431: Table~1 contains for the frequency range of 
432: $f=$ 100~kb$^{-1}$--5~kb$^{-1}$ the corresponding scaling
433: exponents $\alpha_2$, obtained from the 
434: regression $\log_{10} S(f; s=3) = a - \alpha_2 \log_{10}(f)$. 
435: We find a pronounced difference in absolute values between 
436: $\alpha_1 \approx 1$ and $\alpha_2 < 1$, indicating a transition
437: from the universal $1/f^{\alpha_1}$ ($\alpha_1 \approx 1 $) 
438: spectrum at low frequency to a 
439: more flattened $1/f^{\alpha_2}$ ($\alpha_2 < 1$) spectrum
440: at higher frequency. 
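
The two fitting ranges can be restated in units of cycle/base, and the size of the effect can be illustrated with the chromosome 1 exponents of Table~1 (a back-of-the-envelope sketch that idealizes each range as a pure power law):
\begin{equation}
10~\mbox{Mb}^{-1} = 10^{-7}~\mbox{base}^{-1}, \hspace{0.1in}
100~\mbox{kb}^{-1} = 10^{-5}~\mbox{base}^{-1}, \hspace{0.1in}
5~\mbox{kb}^{-1} = 2 \times 10^{-4}~\mbox{base}^{-1} .
\end{equation}
Across the two decades of the low-frequency range, $S(f)$ then decreases by a factor of about $100^{0.88} \approx 58$, while across the high-frequency range (a factor of 20 in $f$) it decreases only by a factor of about $20^{0.46} \approx 4$.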
441: 
442: 
443: Figure~\ref{fig5:alpha}(a) shows for all human chromosomes $\alpha_1$ 
444: and $\alpha_2$ as a function of chromosome-specific GC\%. The
445: majority of human chromosomes have a specific GC content ranging between
446: 38--43\%, whereas chromosomes 16, 17, 19, 20, and 22 have higher GC\%
447: up to 49\%.  While the low-frequency scaling exponent $\alpha_1$ 
448: remains approximately independent of GC\%, Fig.~\ref{fig5:alpha}(a)
449: shows that $\alpha_2$ increases with increasing GC\% and
450: gives rise to a positive correlation between $\alpha_2$ and GC\%.
451: 
452: The three chromosomes illustrated in Fig.~\ref{fig4:ex} exhibit
453: different degrees of transition from the $1/f^{\alpha_1}$
454: ($\alpha_1 \approx 1$) to the flattened
455: $1/f^{\alpha_2}$ ($\alpha_2 <1$) spectrum, with chromosome 21 (22)
456: undergoing the sharpest (smoothest) transition. This
457: observation can be further quantified by the change in
458: scaling exponents $\alpha_1$ and $\alpha_2$. Table~1 lists for
459: all chromosomes $\Delta \alpha = \alpha_1 - \alpha_2$. 
460: With the smallest $\Delta \alpha$ (0.28), chromosome 22 is distinct from all other human chromosomes
461: as the most scale-invariant one (same or similar scaling 
462: exponent at different length scales).
463: The same observation that human chromosome 22 was perhaps different 
464: from the remaining human chromosomes was made using limited 
465: sequence data in \cite{isochore-clay,pedro}. 
466: 
467: \begin{figure*}
468: \centerline{\psfig{figure=pre-fig1.eps,width=80mm,angle=-90}}
469: \caption{\label{fig3:s(f)} 
470: Double-logarithmic representation of 
471: power spectra $S(f)$ of GC\% of all twenty-four human 
472: chromosomes. Each plot shows $S(f)$ of six chromosomes
473: (shifted on the $y$-axis for clearer representation):
474: chromosomes (a) 1--6; (b) 7--12; (c) 13--18; (d) 19--22, X, and Y. 
475: The $x$-axis (in logarithmic scale) is converted from cycle/window 
476: to cycle/base by using the window sizes listed in Table~1.
477: $S(f)$ is filtered at different levels for different frequency
478: ranges: $S(f; s=1)$ for the first ten spectral components,
479: $S(f; s=3)$ for the components 11--30,
480: $S(f; s=31)$ for the components 31--400,
481: and $S(f; s=501)$  for the components 400--65536 (=$2^{16}$).
482: }
483: \end{figure*}
484: 
485: 
486: \begin{figure}
487: \centerline{\psfig{figure=pre-fig2.eps,width=55mm,angle=-90}}
488: \caption{\label{fig4:ex} 
489: Cross-over from $S(f) \sim 1/f^{\alpha_1}$ to $S(f) \sim 1/f^{\alpha_2}$ 
490: illustrated for human chromosomes 15, 21, and 22
491: (smoothed with the span size of 31, and shown in double-logarithmic scale).
492: The scaling exponents $\alpha_1$ and $\alpha_2$ are shown  for 
493: the frequency ranges 10~Mb$^{-1}$--100~kb$^{-1}$ and 100~kb$^{-1}$--5~kb$^{-1}$.
494: }
495: \end{figure}
496: 
497: \begin{figure}
498: \centerline{\psfig{figure=pre-fig3.eps,width=70mm,angle=-90}}
499: \caption{\label{fig5:alpha} 
500: (a) Scaling exponents $\alpha_1$ and $\alpha_2$
501: for fitting the power spectrum $S(f) \sim 1/f^{\alpha_i}$ 
502: ($i=1,2$) at the frequency range of 10~Mb$^{-1}$--100~kb$^{-1}$,
503: and 100~kb$^{-1}$--5~kb$^{-1}$, respectively, versus the
504: chromosome-specific GC content of all 24 human chromosomes.
505: (b) Scaling exponents $\alpha'_1$ and $\alpha'_2$
506: for $S(f)$ with substituted interspersed repeats.
507: }
508: \end{figure}
509: 
510: 
511: \SEC{Interspersed repeats are not responsible for 
512: $1/f$ spectrum}
513: 
514: About 45\% of human genomic DNA sequences are interspersed 
515: repeats \cite{lander}. Interspersed repeats consist of copies of the same
516: sequence segment that are inserted in the human genome, possess a high
517: similarity between the duplicated and ancestral sequence, and have
518: been implicated in a variety of biological functions, including genome
519: organization, human chromosome segregation, or regulation of gene
520: expression \cite{repeats-bio}. Large copy numbers increase 
521: the sequence redundancy and it has been shown, e.g., that
522: about 10\% interspersed Alu repeats significantly increase
523: base-base correlations in the range up to 300~bases
524: \cite{dirk}. 
525: 
526: 
527: 
528: Figure~\ref{fig6:rep} shows the power spectrum $S(f)$ for 
529: the original human chromosome 1 and for the transformed sequence in 
530: which interspersed repeats are substituted.  We find in 
531: the low-frequency range of $10^{-7} < f < 10^{-5}$ cycle/base 
532: that $S(f)$ decays in the original sequence
533: with $\alpha_1 \approx 0.88$ and in the transformed sequence with
534: $\alpha^\prime_1 \approx 0.80$, indicating only marginal differences in
535: the decay properties of $S(f)$ due to repetitive sequences.  In
536: contrast, in the high-frequency range of $10^{-5} < f < 2 \times 10^{-4}$ cycle/base
537: we find $\alpha_2 \approx 0.46$ and $\alpha^\prime_2 \approx 0.29$,
538: and thus interspersed repeats contribute to the decay properties 
539: of $S(f)$ for high-frequency components: substituting them flattens the power spectrum.
540: 
541: 
542: The scaling exponents $\alpha^\prime_1$ and $\alpha^\prime_2$ for 
543: repeat-substituted DNA sequences of all 24
544: human chromosomes are shown in Table~1.  The difference
545: between low- and high-frequency exponents for DNA sequences of original
546: chromosomes, $\Delta \alpha= \alpha_1 - \alpha_2$, is typically smaller 
547: than the corresponding difference for 
548: transformed sequences, $\Delta \alpha' = \alpha^\prime_1-\alpha^\prime_2$.  
549: When we compare $\alpha_1$ and $\alpha^\prime_1$, as well as $\alpha_2$ and
550: $\alpha^\prime_2$, we find that $\alpha^\prime_1$
551: ($\alpha^\prime_2$) is never larger than $\alpha_1$
552: ($\alpha_2$), which means that the spectrum is flattened
553: by the substitution of interspersed repeats.  
554: The average change of low-frequency
555: scaling exponents, $ \alpha_1- \alpha'_1$, is about 0.07, 
556: whereas the average change of high-frequency scaling
557: exponents, $ \alpha_2- \alpha'_2$, is about 0.14. This
558: confirms that the universal presence of $1/f$ spectrum
559: at low frequency is not caused by interspersed
560: repeats,  but that interspersed repeats affect $S(f)$
561: predominantly  at high frequencies.  A similar conclusion 
562: that the decay rate of base-base correlations in DNA sequences of
563: human chromosomes 20, 21, and 22 is not markedly affected
564: by the substitution of interspersed repeats was reached in \cite{dirk}.
565: 
566: 
567: We note that the extent of deviation,
568: $|\alpha'-\alpha|$, depends on how the replacement of 
569: interspersed repeats  is conducted. Possible substitutions
570: of interspersed repeats include the substitution by
571: a constant value or a randomly sampled value.
572: In general, the substitution of GC\% values
573: calculated from the repetitive sequences by random values enhances 
574: the deviation and flattens the spectrum $S(f)$ more than the
575: substitution by a constant value (e.g., average GC\%). 
576: 
577: 
578: \begin{figure}
579: \centerline{\psfig{figure=pre-fig4.eps,width=60mm,angle=-90}}
580: \caption{\label{fig6:rep} 
581: Power spectra $S(f)$ of GC\% for the original and the
582: transformed (interspersed repeats substituted)  DNA 
583: sequence of human chromosome 1.  The scaling exponents for the 
584: low-frequency (10~Mb--100~kb) and high-frequency
585: (100~kb--5~kb) ranges are obtained by a best-fit regression
586: of $\log_{10} S(f)$ on $\log_{10} f$.
587: }
588: \end{figure}
589: 
590: \SEC{Resistance to variance reduction at larger window sizes}
591: 
592: In this section, we study the decay properties of the
593: variance ($\sigma^2$) of spatial GC\% series  as 
594: a function of different window sizes $w$,  and
595: we compare the scaling of $\sigma^2$ with the
596: scaling of the power spectrum $S(f)$.
597: 
598: 
599: Early experimental measurements of the GC\% distribution
600: using cesium chloride (CsCl) profiles \cite{cscl} showed for mouse 
601: ({\em Mus musculus}) genomic DNA sequences that the 
602: variance of GC\% values does not markedly decrease with the DNA segment size
603:  \cite{macaya}. This experimental observation is directly related 
604: to the presence of $1/f$ spectra in DNA sequences 
605: \cite{isochore-clay,li-nova}. If the variance of the spatial 
606: GC\% series calculated at the window size $w$ is $\sigma^2(w)$, 
607: then a scaling of 
608: $\sigma^2(w) \sim 1/w^\beta$ implies a corresponding 
609: scaling in the power spectrum $S(f) \sim 1/f^{1-\beta}$ 
610: \cite{isochore-clay,beran}. If GC\% is obtained from 
611: $w$ uncorrelated bases, it follows a binomial distribution.
612: Consequently, $\sigma^2(w) \sim \langle\rho \rangle 
613: (1- \langle \rho \rangle) /w \sim 1/w$ with 
614: $\beta=1$. The corresponding scaling exponent of
615: the power spectrum is $\alpha=1-\beta=0$, and thus
616: $S(f) \sim \mbox{const.}$, which corresponds to white noise.
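
As a numerical illustration of this white-noise baseline (a sketch using the chromosome 1 values of Table~1, $\langle \rho \rangle \approx 0.417$ and $w \approx 1880$ bases, and assuming independent bases):
\begin{equation}
\sigma^2(w) = \frac{ \langle \rho \rangle (1- \langle \rho \rangle) }{w}
\approx \frac{0.417 \times 0.583}{1880} \approx 1.3 \times 10^{-4} ,
\end{equation}
i.e., a standard deviation of about $0.011$, or roughly one percentage point of GC\%.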
617: 
618: Figure~\ref{fig7:var} shows $\sigma^2(w)$ as a function of window size $w$ 
619: for all human chromosomes. In a double-logarithmic representation,
620: we find that $\log (\sigma^2(w)) $ decays approximately
621: linearly with $\log( w) $. A decay according to  
622: $\sigma^2(w) \sim 1/w^\beta$ with $\beta=1$ leads to 
623: white noise. This situation is indicated in Fig.~\ref{fig7:var}
624: by the straight line. An inspection of Fig.~\ref{fig7:var}
shows, however, that the variance decays at a much slower 
rate than would be expected for white noise. The variance of
the DNA sequence of human chromosome 1, for example, 
gives rise to $\beta \approx 0.12$, and the
corresponding scaling exponent $\alpha_1 \approx
1- \beta =0.88$  is indeed close to the estimated
exponent listed in Table~1. The scaling of the variance
with an exponent $\beta  \ll 1 $ is in accord with
633: the low-frequency $1/f$ noise.
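
As a rough numerical illustration of this difference (simple arithmetic based on 
the exponents quoted above, assuming a single power law over the whole range; it is 
not an additional measurement): increasing the window size by a factor of $10^3$, 
e.g., from 1~kb to 1~Mb, reduces the variance by
\[
\frac{\sigma^2(w)}{\sigma^2(10^3 w)} = (10^3)^{\beta} \approx
\left\{ \begin{array}{ll}
2.3 & \mbox{for $\beta = 0.12$ (Chr1)}, \\
10^3 & \mbox{for $\beta = 1$ (white noise)}.
\end{array} \right.
\]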
634: 
635: 
636: \begin{figure}
637: \centerline{\psfig{figure=pre-fig5.eps,width=65mm,angle=-90}}
638: \caption{\label{fig7:var} 
639: Double-logarithmic representation of the variance $\sigma^2(w)$ 
640: of the spatial GC\% series for all human chromosomes (Chr)
641: as a function of the window size $w$: 
642: (a) $\bigcirc$ Chr1,
643: $\triangle$ Chr2,
644: $+$ Chr3, $\times$ Chr4, $\diamondsuit$ Chr5, $\bigtriangledown$ Chr6;
645: (b) $\bigcirc$ Chr7, $\triangle$ Chr8, $+$ Chr9, $\times$ Chr10,
646: $\diamondsuit$ Chr11, $\bigtriangledown$ Chr12;
647: (c)  $\bigcirc$ Chr13, $\triangle$ Chr14, $+$ Chr15,
648: $\times$ Chr16, $\diamondsuit$ Chr17, $\bigtriangledown$ Chr18;
649: (d)  $\bigcirc$ Chr19, $\triangle$ Chr20, $+$ Chr21, $\times$ Chr22,
$\diamondsuit$ ChrX, $\bigtriangledown$ ChrY.
651: Straight lines indicate $\sigma^2(w) \sim 1/w$ (corresponding
to white noise).  One regression line for Chr1 ($\beta \approx 0.12$)
and a piece-wise regression for Chr13 ($\beta \approx 0.27$ and 
$\beta \approx 0.10$) are drawn. The 95\% confidence
655: interval for the $\sigma^2(w)$ estimation of Chr1 at each point
656: of $w$ is marked by a vertical dashed line.
657: }
658: \end{figure}
659: 
660: 
661: The scaling of $\sigma^2(w)$ shown in Fig.~\ref{fig7:var} 
662: differs from one human chromosome to another. For instance,
in the range of $w=$ 1~kb--5~Mb, human chromosome 
13 exhibits a clear transition from $\beta_2 \approx 0.27$ ($w < $ 50~kb) 
665: to $\beta_1 \approx 0.10$ ($w > $ 50~kb), corresponding to 
$S(f) \sim 1/f^{0.73}$ and $S(f) \sim 1/f^{0.9}$,
667: respectively, at high- and low-frequency ranges.  
668: Other human chromosomes, although generally exhibiting a power-law 
scaling form of $\sigma^2(w)$, show deviations from the
$\sigma^2(w) \sim 1/w^{\beta}$ line for the largest
671: window sizes tested.
672: 
673:  
674: The investigation of $\sigma^2(w)$ as a function of
675: different window sizes $w$ requires careful examination
676: \cite{audit,pedro04}.  First, since we partition each
human chromosome into $2^k$ ($k=$17, 16, $\dots$) 
678: windows, the variance of GC\% series $\{ \rho_i \} $ 
679: could be accidentally large when windows reside on
680: the isochore borders, and small by chance if they
681: start/end within an isochore. 
682: 
683: Second, when the number of windows is small (e.g. the last 
684: point of $\sigma^2(w)$ for each chromosome in Fig.~\ref{fig7:var} 
685: is calculated with the largest window size that
686: gives rise to  32 windows), the standard error of the 
687: sample variance is large. The 95\% confidence interval for
688: $\sigma^2(w)$ of Chr1 is shown in Fig.~\ref{fig7:var}(a), 
using the interval
$[ (n-1) \sigma^2/t_{0.975},\, (n-1) \sigma^2 /t_{0.025}]$,
where $n$ is the number of windows and $t_x$ is defined by 
$\int_{0}^{t_x} \chi^2_{n-1}(t)\, dt= x$
(where $\chi^2_{\rm df}(t)$ is the density of the chi-square distribution with ${\rm df}$
693: degrees of freedom) \cite{snedecor}.  Figure~\ref{fig7:var}(a)
694: shows that for fewer windows (and larger window sizes), 
695: the 95\% confidence interval of $\sigma^2(w)$ could be large
696: such that the estimated value of $\beta$ may change from sample to sample.
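
For orientation, the width of such an interval can be estimated for the last point 
of each curve (a sketch; the quantiles are quoted approximately from standard 
chi-square tables, and independence of the windows is assumed, which long-range 
correlations do not strictly satisfy). With 32 windows there are 31 degrees of 
freedom, for which the 2.5\% and 97.5\% quantiles of the chi-square distribution 
are approximately 17.5 and 48.2. The 95\% confidence interval then extends from about
\[
\frac{31}{48.2}\, \sigma^2 \approx 0.64\, \sigma^2
\quad \mbox{to} \quad
\frac{31}{17.5}\, \sigma^2 \approx 1.77\, \sigma^2 ,
\]
i.e., the upper and lower limits differ by almost a factor of three.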
697: 
Finally, the relationship between the scaling exponents, 
$\alpha+\beta = 1$ \cite{beran,isochore-clay},
is based on the assumption that both $S(f)$ and $\sigma^2(w)$ are 
exact power-law functions. If $S(f)$ is a piece-wise
702: power-law function, as in the case of GC\% fluctuation of
703: human chromosomes, a correction term to the relationship
704: $\alpha +\beta=1$ is expected.
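
The origin of this relationship can be sketched as follows (a heuristic outline, 
assuming a stationary series whose autocovariance decays as an exact power law, 
$C(d) \sim d^{-\beta}$ with $0 < \beta < 1$; the symbol $C(d)$ is introduced here 
only for this sketch). The variance of the average over $w$ consecutive values is
\[
\sigma^2(w) = \frac{1}{w^2} \sum_{i,j=1}^{w} C(|i-j|)
 = \frac{1}{w} \left[ C(0) + 2 \sum_{d=1}^{w-1} \left( 1- \frac{d}{w} \right) C(d) \right]
 \sim \frac{1}{w^{\beta}} ,
\]
whereas the Fourier transform of the same autocovariance behaves at low frequency as
\[
S(f) = \sum_{d=-\infty}^{\infty} C(|d|)\, e^{-2 \pi i f d} \sim \frac{1}{f^{1-\beta}} ,
\]
so that $\alpha = 1- \beta$. Both asymptotic forms rely on a single power law 
holding over all scales, which is why a cross-over in $S(f)$ is expected to modify 
the simple relationship.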
705:  
706: \begin{figure}
707: \centerline{\psfig{figure=pre-fig6.eps,width=65mm,angle=-90}}
708: \caption{\label{fig8:corr} 
709: Correlation function $\Gamma(d)$ for 24 human chromosomes (Chr)
710: as a function of the window distance $d$ (converted to bases 
711: by the window size listed in Table~1).  The distance is
712: represented on a logarithmic scale. (a) Chr1--6; (b) Chr7--12; 
713: (c) Chr13--18; and (d) Chr19--22, ChrX, and ChrY.
714: }
715: \end{figure}
716: 
717: \begin{figure}
718: \centerline{\psfig{figure=pre-fig7.eps,width=50mm,angle=-90}}
719: \caption{\label{fig9:ch21} 
720: Correlation function $\Gamma(d)$ for human chromosome 21
721: as a function of the window distance $d$  (converted to bases 
722: by the window size given in Table~1). The oscillation in $\Gamma(d)$
723: is highlighted by vertical lines, indicating the
724: distances of $d=$500~kb, 1~Mb, 1.5~Mb, and 2~Mb.
725: }
726: \end{figure}
727: 
728: 
729: 
730: \SEC{Chromosome-specific correlation structures}
731: 
732: 
The presence of $1/f$ noise in music and speech signals \cite{voss-music}
does not prevent music and speech from sounding different.
735: Similarly, universal $1/f^\alpha$ spectra in GC\% fluctuations
736: across human chromosomes do not imply that all chromosomes
737: exhibit the same detailed correlation structure. The generic trend
of $S(f)$ spectra to increase at low frequency may ``co-exist''
739: with small peaks at higher frequency.  Such chromosome-specific 
740: characteristic length scales can be more intuitively examined
741: by correlation functions. In this section, we investigate
742: the correlation function $\Gamma(d)$
743: of coarse-grained DNA sequences of human chromosomes with the
744: aim of further examining chromosome-specific structures,
such as characteristic length scales and oscillations
746: detected by $\Gamma(d)$.
747: 
Figure~\ref{fig8:corr} shows, for all human chromosomes,
the correlation functions $\Gamma(d)$ of the GC\% series
$\{ \rho_i \}$ calculated for the window sizes given in Table~1.
For each chromosome, the minimum 
752: (maximum) distance is 80 (16,000) windows.
753: Since each chromosome is partitioned into $2^{17}$ windows, 
754: the maximum distance $d$ at which the correlation is examined 
is about $16{,}000/2^{17} \approx 12\%$ of the total sequence length.
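
To convert these window counts into bases (an illustrative calculation; the 
chromosome length $L \approx 245$~Mb is an assumed round value for a large 
chromosome, not a number quoted from Table~1):
\[
w = \frac{L}{2^{17}} \approx \frac{245 \times 10^{6}}{131\,072} \approx 1.9~{\rm kb}, \quad
80\, w \approx 150~{\rm kb}, \quad
16\,000\, w \approx 30~{\rm Mb} .
\]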
756: 
757: An inspection of Fig.~\ref{fig8:corr} shows that the magnitude
758: of correlation at the distance of $d=1~{\rm Mb}$ is clearly above
the noise level. With the exceptions of Chr15, Chr22, and ChrY, 
all chromosomes have $\Gamma(d) > 0.1$ at $d=1~{\rm Mb}$.
The low correlation in ChrY is 
762: due to the fact that about half of the bases are unsequenced, 
763: and the substitution of gaps by random values lowers the correlation. 
764: At even longer distances such as $d=$10~Mb, correlations 
765: $\Gamma(d=10$~Mb) for chromosomes 1 and 6 are still
766: above the 0.1 level. 
767: 
Given the different window sizes ($w$) that result from different chromosome 
sizes, and provided that the covariance of GC\% is approximately 
770: independent of $w$, a scaling of the variance according to 
771: $1/w^\beta$ implies that the  correlation function
772: $\Gamma(d)$ in Eq.(\ref{gamma}) increases with the window
773: size as $\sim w^\beta$. Test calculations of covariance 
774: for $2^{15}$ and $2^{17}$ windows show that the covariance 
775: differs by less than 1\% (and hence is fairly independent in 
776: this range of window sizes).
777: Consequently, for a detailed comparison of correlation 
778: functions calculated for different chromosomes one has to 
take into account the different window sizes.
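
The size of this effect can be estimated as follows (a sketch, assuming that 
$\Gamma(d)$ in Eq.(\ref{gamma}) is the covariance normalized by the variance, and 
using $\beta \approx 0.12$ from Chr1 purely as a representative value). For two 
window sizes $w_a$ and $w_b$ (labels introduced here for illustration),
\[
\frac{\Gamma_{w_a}(d)}{\Gamma_{w_b}(d)} \approx
\frac{\sigma^2(w_b)}{\sigma^2(w_a)} =
\left( \frac{w_a}{w_b} \right)^{\beta} ,
\]
so that a five-fold difference in window size changes $\Gamma(d)$ by a factor of 
$5^{0.12} \approx 1.2$, and a two-fold difference by $2^{0.12} \approx 1.09$.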
780: 
781: 
782: 
783: Any deviation from the monotonic decrease of $\Gamma(d)$ 
784: might be indicative of correlations at characteristic
length scales (visible as ``bumps''). For example, 
786: Fig.~\ref{fig8:corr} shows for chromosome 1 such a bump at 
787: $d \approx$ 21--23~Mb. Bumps or sharper peaks in other chromosomes include 
788: $d \approx$ 9.3~Mb (Chr2), 7.2~Mb (Chr10), 3.2--3.8~Mb 
(Chr12), and 2.4--3.1~Mb (Chr19).  One  plausible
explanation is that for chromosomes 2, 10, 12, and 19 one or a few
alternations of GC-rich/GC-poor isochores \cite{isochore} with these 
792: length scales enhance the correlation.
793: 
794: 
795: Chromosome 21 stands out among all human chromosomes for
having comparatively higher correlations at distances of
797: several Mb (despite having a smaller $w^\beta$ factor than 
798: other chromosomes due to a smaller window size).  A detailed 
799: inspection of Fig.~\ref{fig9:ch21} uncovers an oscillation of 
$\Gamma(d)$ with a period of about 500~kb, ranging from $d=$500~kb to $d=$2~Mb, which has
801: not been reported before. It can be further shown that this 
802: oscillation is not due to the substitution of interspersed repeats 
803: \cite{li-holste-21},  and it is localized to about one-eighth 
804: of the right distal end of chromosome 21  \cite{li-holste-21}. 
805: 
806: 
807: 
808: \SEC{Discussions}
809: 
810: We study correlation structures and spectral components in 
811: the set of human chromosomes, using power spectra, 
812: coarse-grained correlation functions, and the variance 
813: of different window sizes. All three measures are 
814: interrelated and highlight compositional structures 
815: at different feature levels.  Our results firmly establish 
the presence of long-range correlations and 
817: $1/f^\alpha$ spectra in the DNA sequences of the set 
818: of twenty-four  human chromosomes. 
819: 
820: Using updated and completed human sequence data, we find the 
presence of $1/f$ noise in the DNA sequences of 
all human chromosomes. We further find that, with the exception 
of chromosome 22, all chromosomes exhibit a cross-over 
from $1/f^{\alpha_1}$ scaling at low frequency to $1/f^{\alpha_2}$ 
scaling at high frequency ($\alpha_1 > \alpha_2$).
The finding of two scaling ranges at low and high frequency 
is in accord with previous findings, obtained from 
sequence data of lower quality, and it refines the break-point 
829: regions for each individual chromosome.
830: 
831: 
We also examined the effect of interspersed 
repeats, which make up about 45\% of the human genome. Using a procedure that 
masks and subsequently substitutes interspersed repeats 
with random GC\% values, we find that interspersed 
repeats (i) only marginally affect the scaling exponent 
$\alpha_1$ in the low-frequency range, but (ii) lower 
$\alpha_2$ in the high-frequency range (cf.~Fig.~\ref{fig5:alpha}(b)).
This supports the general understanding that interspersed repeats 
only contribute to short-range (high-frequency) correlations \cite{dirk}.
841: 
842: We have shown elsewhere that $1/f^\alpha$
843: spectra of GC\% fluctuation are also universally present
844: in the mouse {\sl Mus musculus}  genomic DNA sequences
845: \cite{1fmouse}. It is known that human and mouse genomes 
846: are separated by approximately 65--75 million years of evolution. Besides 
the similarity (or homology) between
848: these two genomes on a local scale, there is in fact a
849: large amount of reshuffling of the chromosome segments
850: at a global scale when two current-day copies of
851: the two genomes are compared side-by-side \cite{pevzner}. Since reshuffling
852: of a sequence at global scales could potentially destroy
853: long-range correlations, it is still to be resolved
854: under what conditions a reshuffling of the 
855: human genome into the mouse genome, or vice versa,
conserves $1/f$ noise.
857: 
858: One possible explanation of why $1/f^\alpha$ spectra appear
859: in both the human and the mouse genomes is that such long-range
860: patterns were probably generated from ancestral DNA
861: sequences by sequence evolutionary mechanisms. One
sequence evolution model, termed the expansion-modification
863: (EM) model, is known to generate $1/f^\alpha$ spectra \cite{wli-em}.
864: The EM model incorporates duplications and mutations. 
865: Since the duplication process is an essential element in 
866: evolutionary genomics \cite{ohno}, whose role is perhaps as important as Darwin's 
natural selection \cite{meyer}, even a still unsophisticated 
868: incorporation of duplications in the EM model may capture the 
869: essence of the evolutionary origin of long-range correlations
870: in DNA sequences. In the EM model, only the duplication of
871: segments with the same length scale is included, whereas
872: in reality segments with a broad range of length scales
873: are duplicated \cite{lander}.
874: 
One frequently posed question concerns the ``biological meaning''
876: of $1/f^\alpha$ spectra or long-range correlations in DNA 
877: sequences. In order to address this question, one may ask
878: a couple of related questions beforehand. 
879: Does the compositional GC\% have any biological effects?
880: What biological functions of the DNA molecule are of relevance?
881: From the {\sl functional genomics} perspective, interesting
882: biological processes related to DNA molecules include transcription, 
883: replication, and recombination, and  their potential connection 
884: to GC\% has been reviewed in \cite{bernardi,bernardi-book,li-nova}.
885: Generally speaking, GC\% has a statistical association with 
886: all three processes, though the cause-and-effect role has
887: not yet been firmly established. Recent studies show that broadly
expressed ``housekeeping genes'' tend to be located in GC-rich
889: regions \cite{housekeeping}. To understand the genome-wide organization 
890: of biological units that play a role in those processes
891: (e.g., genes, origins and timing of replication, or recombination hotspots), 
892: at times it is more feasible to directly study the
893: spatial distribution of functional units instead of using 
894: the GC\% as a surrogate. 
895: 
896: From the {\sl biophysics and cellular biology}  
897: perspective, GC\% is linked with  bands from chromosome-staining 
898: \cite{gojobori}, and in addition, possibly with the 
899: matrix/scaffold attachment/associated regions located at
900: the end of DNA loops \cite{attachment}.  It has also been
901: suggested that GC-rich chromosomes (or regions)
tend to be located in the interior of the nucleus 
during interphase and are more ``open'' in their 
tertiary structure, whereas GC-poor segments are
more likely to be close to the surface of the
nucleus and more condensed \cite{interphase}.
907: 
908: Further exploration of the relationship between GC\% fluctuations,
as well as their large-scale patterns, and the above
biological processes is beyond the scope of this
paper. For bacterial genomes, an attempt has been
made to relate the scale-invariance feature in sequence 
statistics to the genome organization of transcription 
activities \cite{audit}. It is clear that more 
integrated computational and experimental analyses 
need to be carried out along similar lines 
917: before one can give universal $1/f$ spectra in 
918: DNA sequences a satisfactory biological explanation.
919: 
920: 
921: 
922: 
923: \SEC*{Acknowledgements}
924: 
We thank S. Guharay for participating in the early stage 
of this project, and O. Clay, J.L. Oliver, and A. Fukushima for
valuable discussions.
928: 
929: 
930: 
931: \begin{thebibliography}{99}
932: 
933: \bibitem{johnson}
934: J.B. Johnson, Phys. Rev. {\bf 26}, 71-85 (1925).
935: 
936: \bibitem{1freview}
937: A. van der Ziel, Adv. Electronics and Electronics Phys. 
938: {\bf 49}, 225-297 (1979); 
939: P. Dutta and P.M. Horn, Rev. Mod. Phys. {\bf 53}, 497-516 (1981); 
940: M.B. Weissman, Rev. Mod. Phys.  {\bf 60}, 537-571 (1988);
941: H. Wong, Microelectronics Reliability {\bf 43}, 585-599 (2003).
942: 
943: \bibitem{1freview-other}
944: M. Gardner, Sci. Am. {\bf 238}, 16-32 (1978); 
945: W. Press, Comments on Astrophys. {\bf 7},103-119 (1978); 
946: B.J. West and M.F. Shlesinger, Am. Sci. {\bf 78}, 40-45 (1990);
947: E. Milotti, arXiv preprint, physics/0204033 (2002).
948: 
949: \bibitem{1fbib}
950: W. Li, {\em A bibliography on 1/f noise} (online),
951: {\sf http://www.nslij-genetics.org/wli/1fnoise/}.
952: 
953: \bibitem{mapping}
954: H. Herzel and I. Grosse, Physica A {\bf 216}, 518-542 (1995);
955: S.V. Buldyrev {\sl et al.}, Phys. Rev. E {\bf 51}, 5084-5091 (1995).
956: 
957: \bibitem{dirk}
958: D. Holste, I. Grosse and H. Herzel, Phys. Rev. E, {\bf 64}, 041917 (2001);
959: D. Holste, {\sl et al.}, Phys. Rev. E, {\bf 67}, 061913 (2003).
960: 
961: \bibitem{wli-em}
962: W. Li, Europhys. Lett. {\bf 10},395-400 (1989);
963: W. Li, Phys. Rev. A, {\bf 43}, 5240-5260 (1991).
964: 
965: \bibitem{wli-dna}
966: W. Li, Int. J. Bifurcation \& Chaos, {\bf 2}, 137-154 (1992); 
967: W. Li and  K. Kaneko, Europhys. Lett. {\bf 17}, 655-660 (1992).
968: 
969: \bibitem{voss}
970: R.F. Voss, Phys. Rev. Lett., {\bf 68},3805-3808 (1992).
971: 
972: \bibitem{bac}
973: X. Lu, {\sl et al.}, Phys. Rev. E, {\bf 58}, 3578-3584 (1998);
974: M. de Sousa Vieira, Phys. Rev. E, {\bf 60}, 5932-5937 (1999).
975: 
976: \bibitem{yeast}
977: W. Li, {\sl et al.}, Genome Res. {\bf 8}, 916-928 (1998).
978: 
979: \bibitem{fukushima}
A. Fukushima,
981: {\em Periodicity in Genome Architecture from Bacteria to Human}
982: (Ph.D Thesis, Nara Institute of Science and Technology, 2003);
983: A. Fukushima, {\sl et al.}, Gene, {\bf 300}, 203-211 (2002).
984: 
985: \bibitem{isochore}
986: G. Cuny, {\sl et al.}, Euro. J. Biochem. {\bf 99}, 179-186 (1981);
987: G. Bernardi, {\sl et al.}, Science, {\bf 228}, 953-958 (1985);
988: G. Bernardi, Gene, {\bf 241}, 3-17 (2000).
989: 
990: \bibitem{isochore-clay}
991: O. Clay, {\sl et al.}, Gene, {\bf 276}, 15-24 (2001);
992: O. Clay and  G. Bernardi,  Gene, {\bf 276}, 25-31 (2001).
993: 
994: 
995: \bibitem{cc03}
996: W. Li, {\sl et al.}, Comput. Biol. and Chem., {\bf 27}, 5-10 (2003). 
997: 
998: \bibitem{clay3}
O. Clay, Gene, {\bf 276}, 33-38 (2001).
1000: 
1001: \bibitem{isochore-spe}
1002: W. Li,  Gene, {\bf 300}, 129-139 (2002).
1003: 
1004: \bibitem{wli-complexity}
1005: W. Li, Complexity, {\bf 3}, 33-37 (1997).
1006: 
1007: 
1008: \bibitem{lander}
1009: E.S. Lander, {\sl et al.}, Nature, {\bf 409}, 860-921 (2001).
1010: 
1011: \bibitem{pedro}
1012: P. Bernaola-Galv\'{a}n, {\sl et al.},
1013: Gene, {\bf 300}, 105-115 (2002).
1014: 
1015: \bibitem {hg-quality}
1016: J. Schmutz {\sl et al.},
1017: Nature {\bf 429}, 365-368 (2004).
1018: 
1019: 
1020: \bibitem{repeatmasker}
1021: A.F.A.~Smit and  P.~Green,
1022: unpublished results
1023: (URL: {\sf http://repeatmasker.genome.washington.edu/}).
1024: 
1025: \bibitem{daniell}
1026: P.J. Daniell,
1027: {\em Suppl. J. Royal Stat. Soc.} {\bf 8} 88--90 (1946).
1028: 
1029: 
1030: 
1031: \bibitem{repeats-bio}
J.R.~Korenberg and M.C.~Rykowski, Cell {\bf 53}, 391 (1988);
1033: P.~Medstrand {\em et al.}, Genome Res. {\bf 12}, 1483 (2002);
1034: M.-A.~Hakimi {\em et al.}, Nature {\bf 418}, 994 (2002);
1035: J.S.~Han, S.T.~Szak, and J.D.~Boeke, Nature {\bf 429}, 268--274 (2004).
1036: 
1037: \bibitem{cscl}
1038: O. Clay, {\sl et al.}, 
1039: Euro. Biophys. J., {\bf 32}, 418-426 (2003).
1040: 
1041: \bibitem{macaya}
1042: G. Macaya, J.P. Thiery, and G. Bernardi, J. Mol. Biol. {\bf 108}, 237-254 (1976).
1043: 
1044: \bibitem{li-nova}
1045: W. Li, in {\sl Progress in Bioinformatics} (Nova Science Publisher, 2005),
1046: to appear.
1047: 
1048: \bibitem{beran}
1049: J. Beran, {\sl Statistics for Long-Memory Processes} (Chapman \& Hall, 1994).
1050: 
1051: \bibitem{audit}
1052: B. Audit and C.A. Ouzounis, J. Mol. Biol. {\bf 332}, 617-633 (2003).
1053: 
1054: \bibitem{pedro04}
1055: P. Bernaola-Galv\'{a}n, {\em et al.,}
1056: Gene, {\bf 333}, 121-133 (2004). 
1057: 
1058: 
1059: \bibitem{snedecor}
1060: G.W. Snedecor and  W.G. Cochran,
1061: {\sl Statistical Methods}, Seventh Edition
1062: (Iowa State University Press, 1980).
1063: 
1064: \bibitem{voss-music}
1065: R.F. Voss and  J. Clarke,  Nature, {\bf 258}, 317-318 (1975);
1066: K. J. Hsu and  A. Hsu,  Proc. Natl. Acad. Sci., {\bf 88}, 3507-3509 (1991).
1067: 
1068: 
1069: 
1070: 
1071: \bibitem{li-holste-21}
1072: W. Li and  D. Holste, Comp. Bio. Chem., {\bf 28} in press (2004).
1073: 
1074: 
1075: 
1076: \bibitem{1fmouse}
1077: W. Li and D. Holste, Fluct. Noise Lett. {\bf 4} in press (2004).
1078: 
1079: \bibitem{pevzner}
1080: P. Pevzner and G. Tesler, Genome Res. {\bf 13}, 37-45 (2003).
1081: 
1082: \bibitem{ohno}
S. Ohno, {\em Evolution by Gene Duplication}
1084: (Springer-Verlag, Berlin, 1970).
1085: 
1086: \bibitem{meyer}
1087: A. Meyer and Y. van de Peer, J. Struct. Funct. Genomics,  {\bf 3}, vii-ix (2003).
1088: 
1089: \bibitem{bernardi}
1090: G. Bernardi, Ann. Rev. Genet. {\bf 23}, 637-661 (1989);
1091: G. Bernardi, Ann. Rev. Genet. {\bf 29}, 445-476 (1995).
1092: 
1093: \bibitem{bernardi-book}
1094: G. Bernardi, {\em Structural and Evolutionary Genomics}
1095: (Elsevier, 2004).
1096: 
1097: \bibitem{housekeeping}
1098: M.J. Lercher, A.O. Urrutia, and L.D. Hurst,
1099: Nature Genet. {\bf 31}, 180-183 (2002);
1100: M.J. Lercher, {\sl et al.}, Hum. Mol. Genet. {\bf 12}, 2411-2415
1101: (2003);
1102: R. Versteeg, {\sl et al.}, Genome Res. {\bf 13}, 1998-2004 (2003).
1103: 
1104: \bibitem{gojobori}
1105: Y. Niimura and  T. Gojobori,
1106: Proc. Natl. Acad. Sci. {\bf 99}, 797-802 (2002).
1107: 
1108: \bibitem{attachment}
1109: P.A. Dijkwel and  J.L. Hamlin, Int. Rev. Cytol.  {\bf 162A}, 455-484 (1995);
S.V. Razin, I.I. Gromova, and  O.V. Iarovaia, Int. Rev. Cytol.
1111: {\bf 162B}, 405-448 (1995).
1112: 
1113: 
1114: \bibitem{interphase}
1115: S. Boyle, {\sl et al.}, Hum. Mol. Genet. {\bf 10}, 211-219 (2001);
1116: S. Saccone, {\sl et al.}, Gene, {\bf 300}, 169-178 (2002).
1117: 
1118: 
1119: \end{thebibliography}
1120: 
1121: 
1122: \end{document}
1123: 
1124: 
1125: