1: \def \manuflag {0}
2:
3: \ifnum \manuflag = 0
4: \documentclass[pre,twocolumn,showpacs,amsmath,amssymb]{revtex4}
5: \usepackage{psfig}
6: \usepackage{epsfig}
7: \usepackage{delarray}
8: \usepackage{graphicx}
9: \usepackage{dcolumn}
10: \usepackage{bm}
11: \def \Title{
12: Universal $1/f$ noise, cross-overs of scaling exponents,
13: and chromosome specific patterns of GC content in
14: DNA sequences of the human genome}
15: \def \figsize{6.85cm}
16: \def \figname{\footnotesize \sc FIG.}
17: \def \tblname{\footnotesize \sc TABLE}
18: \sloppy
19: \newcommand{\SEC}{\section}
20: \newcommand{\SUBSEC}{\subsection}
21: \else
22: \documentclass[pre,onecolumn,showpacs,amsmath,amssymb]{revtex4}
23: \usepackage{psfig}
24: \usepackage{epsfig}
25: \usepackage{delarray}
26: \usepackage{graphicx}
27: \usepackage{dcolumn}
28: \usepackage{bm}
29: \usepackage{rotating}
\def \Title{Universal $1/f$ noise, cross-overs of scaling
31: exponents, and chromosome specific patterns of GC content in
32: DNA sequences of the human genome}
33: \def \figsize{12.6cm}
34: \renewcommand{\baselinestretch}{2.4}
35: \fi
36:
37: \def \Abstract{
38: Spatial fluctuations of guanine and cytosine base content (GC\%) are
39: studied by spectral analysis for the complete set of human genomic DNA
40: sequences. We find that (i) the $1/f^{\alpha}$ decay is universally
41: observed in the power spectra of all twenty-four chromosomes, and that
42: (ii) the exponent $\alpha \approx 1$ extends to about $10^7$ bases,
43: one order of magnitude longer than what has previously been observed.
44: We further find that (iii) almost all human chromosomes exhibit a
45: cross-over from $\alpha_1 \approx 1$ ($1/f^{\alpha_1}$) at lower
46: frequency to $\alpha_2 < 1$ ($1/f^{\alpha_2}$) at higher frequency,
47: typically occurring at around 30,000--100,000
48: bases, while (iv) the cross-over in this frequency range is
49: virtually absent in
50: human chromosome 22. In addition to the universal $1/f^\alpha$ noise in power
51: spectra, we find (v) several lines of evidence for chromosome-specific
52: correlation structures, including a 500,000 bases long oscillation in
53: human chromosome 21. The universal $1/f^\alpha$ spectrum in human
54: genome is further substantiated by a resistance to variance reduction
55: in guanine and cytosine content when the window size is increased.
56: }
57:
58:
59:
60: \ifnum \manuflag = 0
61: \begin{document}
62: \title{\Title}
63: \author{Wentian Li}
64: \email{wli@nslij-genetics.org}
65: \affiliation{The Robert S. Boas Center for Genomics and Human Genetics,
66: North Shore LIJ Institute for Medical Research, 350 Community Dr.,
67: Manhasset, NY 10030.}
68: \author{Dirk Holste}
69: \email{holste@mit.edu}
70: \affiliation{Department of Biology,
71: Massachusetts Institute of Technology, Cambridge, MA 02139.}
72: \begin{abstract}
73: \Abstract
74: \end{abstract}
\pacs{87.10.+e, 87.14.Gg, 87.15.Cc, 02.50.-r, 02.50.Tt, 89.75.Da, 89.75.Fb, 05.40.-a}
76: %% \keywords{Suggested keywords if desired}
77: \maketitle
78: \else
79: \begin{document}
80: \title{\Title}
81: \author{Wentian Li}
82: \email{...@...}
83: \affiliation{...}
84: \author{Dirk Holste}
85: \email{holste@mit.edu}
86: \affiliation{Department of Biology,
87: Massachusetts Institute of Technology, Cambridge, MA 02139.}
88: \begin{abstract}
89: \Abstract
90: \end{abstract}
91: \pacs{87.10.+e, 02.50.-r, 05.40.-a \hfill {\tt Thu Sep 5 15:36:39 EDT 2002}}
92: %% \keywords{Suggested keywords if desired}
93: \maketitle
94: \fi
95:
96: \SEC{Introduction}
97:
98: By measuring the proportion of a signal's power $S(f)$ falling into
99: a range of frequency components $f$, a power spectrum of the form
100: $S(f) \sim 1/f^\alpha$ distinguishes between two prototypes of noise:
101: white noise ($\alpha = 0$) and Brownian noise ($\alpha = 2$). The
intermediate range, termed ``$1/f$ noise'', can practically be defined
as $1/f^\alpha$ ($0.5 \lesssim \alpha \lesssim 1.5$). $1/f$ noise was first
observed experimentally in electric current fluctuations of the thermionic
tube at the beginning of the twentieth century \cite{johnson}.
106: Since then, $1/f$ noise has been found repeatedly in many other
107: conducting materials \cite{1freview}. More generally, it has also
108: been observed in wide ranges of natural as well as human-related
109: phenomena, including traffic flow, star light, speech, music
110: and human coordination \cite{1freview-other,1fbib}.
111: For biological sequences, such as DNA, the concept of slow-varying,
112: multiple-length variations in the power of frequency components
113: can be translated to long-ranging correlations in the spatial
114: arrangement of the four bases adenine (A), cytosine (C), guanine
115: (G) and thymine (T). One can categorize chemically A, C,
116: G, and T as strong (G or C) or weak (A or T) bonding. It has been
shown that fluctuations of the GC base content along a DNA sequence
are typically more strongly correlated than those of other possible
binary classifications \cite{mapping,dirk}.
120: Initial studies of $1/f$ noise in DNA sequences were motivated
121: by a model of spatial $1/f$ noise of symbolic sequence evolution
122: \cite{wli-em}. Subsequently, empirical $1/f$ spectra were
123: indeed observed in non-protein-coding DNA sequences
124: \cite{wli-dna}, and their generality in DNA sequences was further
125: illustrated in \cite{voss}.
126:
127: $1/f$ noise has been detected in a variety of different species and
128: taxonomic classes, including bacteria \cite{bac}, yeasts
129: \cite{yeast}, insects \cite{fukushima}, and other higher eukaryotic
130: genomes. Integrating this and several other lines of evidence, a
131: consensus on $1/f$ noise in DNA sequences has emerged:
132: (1) for DNA sequences of the order of $10^6$~bases ($1$~Mb), $1/f^\alpha$
133: spectrum ($\alpha \approx 1$) is consistently observed; (2) for
134: isochores, which are DNA sequences of relatively homogeneous base
135: concentration at least $300\cdot 10^3$ bases ($300$~kb) long
136: \cite{isochore,isochore-clay,cc03}, $1/f^\alpha$ spectrum is also
137: observed, but typically shows a smaller exponent $\alpha < 0.7$
138: \cite{isochore-clay,clay3,isochore-spe}; (3) for DNA sequences of
139: the order of several kb, the decay of $S(f)$ is non-trivial and may
140: depend on whether the sequence is protein-coding \cite{wli-dna}.
141: The viral DNA sequence of the $\lambda$-phage, e.g., shows a single step in its GC
142: base concentration and its spectrum is $S(f) \sim 1/f^2$, which is
143: characteristic of random block sequences \cite{wli-complexity}.
144: We note that the universal scaling of $S(f) \sim 1/f^\alpha$ ($\alpha \approx 1$)
145: across all species discussed in \cite{voss} has apparently been
146: restricted to a length scale of $1$~kb, by averaging the spectrum over
147: many $N=2$~kb DNA segments.
148:
149: With the availability of the first completed version of the DNA
sequence of the human genome \cite{lander}, several studies have been able
151: to demonstrate that the base-base correlation function $\Gamma(d)$
152: ($d$ distance between bases) of several DNA sequences follows a
153: power-law decay, $\Gamma(d) \sim 1/d^{\gamma}$. For instance, the DNA
154: sequence of human chromosome 22 shows statistically significant
155: power-law correlations up to $d=1$~Mb, and
correlations in the DNA sequence of chromosome 21 are statistically
157: significant up to several~Mb (with the scaling exponent $\gamma$
158: changing beyond a few~kb) \cite{dirk,pedro}.
159: While the DNA sequences of human chromosomes 21 and 22 are about
160: $34$~Mb long, in order to estimate the limit of the range of $1/f^\alpha$
161: spectrum, longer sequences are necessary.
162:
163:
164:
The draft of the human genome sequence was released in February
2001. About three years later, in 2004, a dozen (out of 24) human chromosomes
had been completed to the
standard of less than one error per 10,000 DNA
bases (99.99\% accuracy) \cite{hg-quality}. Building upon the release
of updated, high-quality sequence data, we can
now conduct a systematic analysis of several issues of $1/f$ noise in
the DNA sequences of our own species, {\em Homo sapiens}, which
have been pursued over the last decade in a fragmentary manner.
174:
175: In this paper, we use the DNA sequences of the complete set of
176: twenty-two autosomes and two sex chromosomes to address the following
177: issues: Is $1/f$ noise
178: universally present across the entire set of human genome sequences?
179: Does $1/f$ noise extend to lower frequency ranges in longer DNA
180: sequences? Is the decay of $S(f)$ characterized by a single exponent
181: $\alpha$, or does it exhibit cross-overs (multiple scaling exponents)?
182: Given the presence of universal variations at multiple scales,
183: do these co-exist with variations at chromosome-specific scales?
184:
185: \begin{figure}
186: \centerline{\psfig{figure=dirk1.eps,width=68mm,angle=0}}
187: \caption{\label{fig-dirk1}
188: Double-logarithmic representation of the human genome-wide length
189: distribution of interspersed repeat sequences, non-repetitive
190: sequences, and sequences of unknown base composition (gaps). The
191: length distribution of interspersed repeats and non-repetitive sequences
exhibits a power-law-like decay, while that of gap sequences
is scattered across different sequence lengths. The peaks
194: at $\sim 300$ bases and
195: several kb correspond to Alu and possibly LINE repeats.
196: }
197: \end{figure}
198:
199: \begin{figure}[ht]
200: \centerline{\psfig{figure=dirk2.eps,width=68mm,angle=0}}
201: \caption{\label{fig-dirk2}
202: Distribution of genome-wide GC content (GC\%) of the human genome
203: for interspersed repeat sequences, non-repetitive sequences,
204: and all (``overall") sequences with sequence segments of $20~{\rm kb}$.
205: The mode (peak location) of non-repetitive sequences is at
$\sim$35\%, while the mode of repetitive sequences is shifted
to a higher GC\% ($\sim$42\%). The fraction
of non-repetitive sequences with GC\%~$>$~50\% is markedly
larger than that of the repetitive sequences.
210: }
211: \end{figure}
212:
213:
214: \SEC{Data and methods}
215:
216: In this section, we introduce the data for human genome sequences, as well
217: as the notation and definitions used throughout this study.
218: Twenty-four chromosomes are assembled in build 34
219: of the NCBI (human genome hg16 release). Sequence data were downloaded
220: from the UCSC human genome repository (available at {\sf http://genome.ucsc.edu/}).
Unsequenced bases are kept to preserve the spacing between
bases. Human chromosomes (Chr) 13, 14, 15, 21, and 22 contain large
amounts of unsequenced bases at the left end of their DNA sequences,
amounting to about 15\%, 17\%, 18\%, 21\%, and 29\% of the
individual chromosome size,
respectively; 51\% of chromosome Y is unsequenced.
227:
Our analysis of human DNA sequences is conducted using coarse-grained data.
Each original sequence is transformed into a spatial series
of GC content (GC\%) values. To this end, we evenly partition a
DNA sequence into $N$ non-overlapping windows of length $w$ bases
and compute $\rho_i(w)=$GC\%$_i$ for each window $i$ to obtain a spatial
GC\% series:
234: \begin{equation}
235: \label{rho}
\{ \rho_i \} \equiv \{ \rho_i(w) \} \equiv \{ \mbox{GC\%}_i \},
\hspace{0.1in}
i = 1, 2, \dots, N.
239: \end{equation}
240: Table~1 lists the corresponding window sizes
241: for each human chromosome. Since different human chromosomes have
242: different sizes, whereas the number of partitions ($N$) is the same,
243: the window lengths vary.
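
The coarse-graining step can be sketched in Python as follows. This is a minimal illustration, not the code used for this study: the function name {\tt gc\_series}, the use of NumPy, and the choice to count unsequenced bases (N) as non-GC within a window are assumptions made here for the example.

```python
import numpy as np

def gc_series(seq, n_windows):
    """Return (rho, w): GC% per window and the window length in bases.

    The sequence is cut into n_windows equal, non-overlapping windows;
    any remainder at the right end is dropped (an assumption of this sketch).
    """
    w = len(seq) // n_windows
    if w == 0:
        raise ValueError("sequence is shorter than the number of windows")
    raw = seq[: w * n_windows].upper().encode("ascii")
    bases = np.frombuffer(raw, dtype="S1")
    # Strong (G or C) bases count as 1; A, T, and unsequenced N count as 0.
    is_gc = (bases == b"G") | (bases == b"C")
    rho = 100.0 * is_gc.reshape(n_windows, w).mean(axis=1)
    return rho, w

rho, w = gc_series("GGCCAATTGCAT", 3)   # three windows of four bases each
```

For a chromosome, {\tt n\_windows} would be $2^{17}$, which gives window lengths of the order of the values of $w$ listed in Table~1.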
244:
245:
246: \begin{table}[ht]
247: \caption{\label{tab:table1}
Average GC content ($\overline{ \mbox{GC}\%}$ or $\overline{ \rho }$) and the window
size ($w$) for partitions using $N=2^{17}$ non-overlapping windows for
twenty-four human chromosomes. Low-frequency scaling exponents
$\alpha_1$ are estimated from $S(f; s=3) \sim 1/f^{\alpha_1}$
in the range of $10^{-7} < f < 10^{-5}$ base$^{-1}$, and high-frequency
scaling exponents $\alpha_2$ are estimated in the range of
$10^{-5} < f < 2 \times 10^{-4}$ base$^{-1}$. The difference
between the two scaling exponents, $\Delta \alpha \equiv \alpha_1-\alpha_2$,
is listed in the fifth column. Low- and high-frequency exponents for $S(f)$
with substituted interspersed repeats are indicated by
$\alpha'_1$ and $\alpha'_2$, and their difference by
$\Delta \alpha' \equiv \alpha'_1-\alpha'_2 $.
260: }
261: \begin{ruledtabular}
262: \begin{tabular}{l|c|c|c|c|c|c|}
263: Chr & $\overline{ GC\%}$ & $w$ (kb) & $\alpha_1$ $\alpha_2$ & $\Delta \alpha$
264: & $\alpha'_1$ $\alpha'_2$ & $\Delta \alpha'$ \\
265: \hline
266: 1 & 41.7 &1.88 & 0.88 0.46 & 0.42 & 0.80 0.29 &0.51\\
267: 2 & 40.2 &1.86 & 0.99 0.51& 0.48 & 0.96 0.30 &0.66\\
268: 3 & 39.7 &1.52 & 0.95 0.43& 0.53 & 0.88 0.27 &0.61 \\
269: 4 & 38.2 &1.46 & 0.87 0.34& 0.53 & 0.75 0.19 &0.57\\
270: 5 & 39.5 &1.38 & 0.89 0.39& 0.51 & 0.88 0.23 &0.65 \\
271: 6 & 39.6 &1.30 & 0.99 0.36& 0.63 & 0.86 0.24 &0.63\\
272: 7 & 40.7 &1.21 & 0.97 0.46& 0.51 & 0.87 0.33 &0.55 \\
273: 8 & 40.1 &1.12 & 0.97 0.42& 0.55 & 0.91 0.26 &0.66\\
274: 9 & 41.3 &1.04 & 0.96 0.39& 0.57 & 0.90 0.28 &0.62 \\
275: 10 & 41.6 &1.03 & 0.97 0.52& 0.46 & 0.95 0.34 &0.61 \\
276: 11 & 41.6 &1.03 & 1.05 0.50& 0.55 & 0.97 0.35 &0.62 \\
277: 12 & 40.8 &1.01 & 0.97 0.39& 0.59 & 0.89 0.28 &0.61 \\
278: 13 & 38.5 & 0.86 & 0.83 0.33 & 0.50 & 0.73 0.24 &0.49 \\
279: 14 & 40.9 &0.80 & 1.03 0.36 & 0.66 & 0.95 0.27 &0.68\\
280: 15 & 42.2 &0.76 & 0.90 0.50 & 0.40 & 0.83 0.39 &0.44 \\
281: 16 & 44.8 &0.69 & 0.91 0.51 & 0.40 &0.81 0.36 &0.45\\
282: 17 & 45.5 &0.62 & 0.98 0.57 & 0.42 & 0.89 0.44 &0.46 \\
283: 18 & 39.8 &0.58 & 1.12 0.40 & 0.72 & 1.12 0.28 &0.83 \\
284: 19 & 48.4 &0.49 & 1.00 0.56 & 0.44 & 0.81 0.37 &0.45 \\
285: 20 & 44.1 &0.49 & 0.87 0.51 & 0.36 & 0.83 0.30 &0.53 \\
286: 21 & 40.9 &0.36 & 0.91 0.33 & 0.58 & 0.86 0.22 &0.64 \\
287: 22 & 47.9 &0.38 & 0.90 0.62 & 0.28 & 0.86 0.40 &0.45 \\
288: X & 39.4 &1.17 & 0.93 0.38 & 0.54 & 0.73 0.18 & 0.55 \\
289: Y & 39.1 &0.38 & 0.83 0.38 & 0.45 & 0.70 0.21 & 0.49 \\
290: \end{tabular}
291: \end{ruledtabular}
292: \end{table}
293:
294: Human DNA sequences contain a large fraction of interspersed repeats,
295: i.e., copies of an ancestral sequence fragment that possess a high
296: similarity between the duplicated and the ancestral sequence. One can
297: detect interspersed repeats by using the program {\sf RepeatMasker}
298: \cite{repeatmasker}. ``Soft-masked'' annotations of interspersed repeats
299: are taken from the DNA sequences of the UCSC human genome repository
300: ({\sf http://genome.ucsc.edu/}), where repetitive (non-repetitive)
bases are annotated in small (capital) letters. Figure~\ref{fig-dirk1}
shows the length distribution of the three sequence classes of uninterrupted
non-repetitive, interspersed repeat, and gap sequences.
Figure~\ref{fig-dirk2} shows the corresponding distribution of the
genome-wide GC\% for these sequence classes.
306:
307:
308: To investigate the effect of interspersed repeats, we substitute
309: them by random bases according to the chromosomal level of GC\%.
We refer to the resulting sequences as transformed
(repeat-substituted) sequences, as opposed to the original sequences. On the
coarse-grained level, this is equivalent to
replacing, in the $\{ \rho_i \}$ ($i=1, 2, \dots, N$) series,
any value calculated from interspersed repeats by a random
value sampled from a Gaussian distribution; the
mean and variance of this Gaussian distribution are the
same as those of GC\% in the original sequence. Another possibility
is to substitute repetitive sequences
by a constant value (e.g., the average GC\% value
of the original sequence). This method introduces
additional correlations (and less variance) in the $\{ \rho_i \}$
series, and is not adopted in this paper.
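
A minimal sketch of the coarse-grained substitution is given below. It assumes that a Boolean mask marking the windows derived from interspersed repeats is already available; how such a mask is built from the soft-masked annotation is not specified here, and the names {\tt substitute\_repeat\_windows}, {\tt is\_repeat}, and {\tt seed} are hypothetical.

```python
import numpy as np

def substitute_repeat_windows(rho, is_repeat, seed=0):
    """Replace GC% values of repeat-derived windows by Gaussian samples.

    The Gaussian has the mean and variance of the original GC% series.
    `is_repeat` is a Boolean mask of the same length as `rho` (assumed given).
    """
    rho = np.asarray(rho, dtype=float)
    mask = np.asarray(is_repeat, dtype=bool)
    rng = np.random.default_rng(seed)
    out = rho.copy()
    out[mask] = rng.normal(rho.mean(), rho.std(), size=int(mask.sum()))
    return out

rho = np.array([40.0, 42.0, 38.0, 45.0, 41.0])
mask = np.array([False, True, False, True, False])
rho_sub = substitute_repeat_windows(rho, mask)
```

Windows outside the mask are left untouched, so any difference between the spectra of {\tt rho} and {\tt rho\_sub} is due to the substituted windows alone.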
323:
324: Three different, albeit functionally related, measures are
325: applied to the $\{ \rho_i \}$ series: the power spectrum
326: as a function of the frequency $S(f)$, the correlation function
327: $\Gamma(d)$ as a function of the distance $d$ between
328: windows, and variance $\sigma^2(w)$ of GC\%
329: series as a function of the window size $w$.
330:
First, we conduct spectral analyses by calculating the power spectrum,
the squared modulus of the discrete Fourier transform divided by $N$, defined as:
\begin{equation}
S(f) \equiv \frac{1}{N} \left| \sum_{k=1}^{N} \rho_k
\cdot
e^{ -i 2 \pi k f/N} \right|^2,
\end{equation}
where $N$ is the total number of windows, and $f$ is measured
339: in units of cycle/window, which can be converted to units
340: of cycle/base by the window size (cf. Table~1).
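
The definition above can be sketched with a fast Fourier transform. This is an illustration under stated assumptions rather than the S-PLUS computation used in this study: the function name {\tt periodogram} is hypothetical, NumPy indexes the sum from $0$ to $N-1$ instead of $1$ to $N$ (which changes only the phase, not the squared modulus), and the zero-frequency component is dropped.

```python
import numpy as np

def periodogram(rho, w):
    """Return (freq, spec): frequencies in cycle/base and the raw S(f).

    rho : GC% series of N windows
    w   : window length in bases, used to convert cycle/window to cycle/base
    """
    rho = np.asarray(rho, dtype=float)
    n = rho.size
    spec = np.abs(np.fft.rfft(rho)) ** 2 / n   # components 0 .. n//2
    k = np.arange(1, n // 2 + 1)               # skip the zero frequency
    freq = k / (n * w)                         # k cycles per n*w bases
    return freq, spec[1 : n // 2 + 1]

# A pure oscillation with 4 cycles over 64 windows of 1 kb each.
j = np.arange(64)
freq, spec = periodogram(np.cos(2 * np.pi * 4 * j / 64), w=1000)
```

In this toy example the only non-vanishing component is the fourth one, at $4/(64 \times 1000)$ cycle/base.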
341:
342: Coarse-graining ``hides'' base-base correlations at scales smaller
343: than $w$ bases. The choice of $N = 2^{17}$ windows was made such
344: that it is (i) sufficiently large to cover small-scale fluctuations,
345: while (ii) at the same time sufficiently small so that the spectral analysis is
computationally feasible. As different chromosomes have different
lengths, an equal number of partitions leads to different window sizes $w$.
348:
349: The unsmoothed $S(f)$, or periodogram, contains $N/2$ independent
350: spectral components. One can filter periodograms to obtain a
351: ``smoothed'' spectrum $S(f;s)$, where $s$ is the span-size
352: parameter. Since filtering with a relatively large $s$-value
353: possibly distorts the shape of $S(f;s)$ at lower frequency components,
354: different span-sizes are applied for different frequency ranges.
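
As an illustration of the smoothing step, the sketch below applies an equal-weight moving average of span $s$ to a periodogram. It is an assumption that this simple form matches the filter used here; in particular, the shortening of the averaging window near the ends of the spectrum and the name {\tt daniell\_smooth} are choices made for this example only.

```python
import numpy as np

def daniell_smooth(spec, span):
    """Equal-weight moving average over `span` adjacent spectral components.

    `span` must be odd. Near the ends, the average is taken over the
    components that exist (an edge rule assumed for this sketch).
    """
    if span < 1 or span % 2 == 0:
        raise ValueError("span must be a positive odd integer")
    spec = np.asarray(spec, dtype=float)
    half = span // 2
    csum = np.concatenate(([0.0], np.cumsum(spec)))
    idx = np.arange(spec.size)
    lo = np.maximum(idx - half, 0)
    hi = np.minimum(idx + half + 1, spec.size)
    return (csum[hi] - csum[lo]) / (hi - lo)

smoothed = daniell_smooth([1.0, 2.0, 3.0, 4.0, 5.0], span=3)
```

A span of $s=1$ returns the periodogram unchanged, which is why small spans are preferable for the lowest-frequency components, where only few spectral estimates are available.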
355:
356:
357: The second measure applied to the $\{ \rho_i \}$ series
358: is the correlation function, $\Gamma(d)$, which is computed
359: from two truncated series
360: of $\{ \rho_i \}$, $ \rho' = \{ \rho_k \}$ ($k=1, 2, \dots, N-d$) and
361: $ \rho'' = \{ \rho_k \}$ ($k=d+1, d+2, \dots, N$):
362: \begin{equation}
363: \label{gamma}
\Gamma(d) \equiv \frac{ \mbox{Cov}(\rho', \rho'')}
{ \sqrt{ \mbox{Var}(\rho')} \sqrt{ \mbox{Var}(\rho'')}},
366: \end{equation}
367: where $\mbox{Cov}( \rho', \rho'')=
368: \langle \rho' \rho''\rangle - \langle \rho'\rangle \langle \rho''\rangle $
369: and $\mbox{Var}( \rho') = \langle \rho'^2 \rangle - \langle \rho' \rangle^2$
370: (or $\mbox{Var}( \rho'') = \langle \rho''^2 \rangle - \langle \rho'' \rangle^2$)
371: are the covariance and variance.
372: Note that the $\Gamma(d)$ defined in Eq.(\ref{gamma})
373: is slightly different from that defined using
374: a periodic boundary condition.
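
The definition of $\Gamma(d)$ from two truncated series can be sketched as follows. The function name {\tt correlation} is hypothetical, and the code is an illustration rather than the S-PLUS routine used in this study.

```python
import numpy as np

def correlation(rho, d):
    """Gamma(d) computed from the two truncated series rho' and rho''."""
    rho = np.asarray(rho, dtype=float)
    if not 0 < d < rho.size - 1:
        raise ValueError("d must satisfy 0 < d < N - 1")
    first, second = rho[:-d], rho[d:]     # rho' (k = 1..N-d), rho'' (k = d+1..N)
    cov = np.mean(first * second) - first.mean() * second.mean()
    return cov / np.sqrt(first.var() * second.var())

series = [0, 1, 0, 1, 0, 1, 0, 1]
g1 = correlation(series, 1)   # neighbouring windows are anti-correlated
g2 = correlation(series, 2)   # windows two apart are identical
```

Because the means and variances are recomputed for each truncated series, the result differs slightly from an estimate that wraps the series around periodically.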
375:
376:
377:
378: The third and final measure applied to the
379: $\{ \rho_i \}$ series is the variance $\sigma^2(w)$:
380: \begin{equation}
381: \sigma^2(w) \equiv \langle \rho(w)^2 \rangle -
382: \langle \rho(w) \rangle^2
383: \end{equation}
384: as a function of the window size $w$.
385: The power spectrum, the correlation function, and the window-size-dependent
386: variance are interrelated quantities \cite{clay3}:
387: \begin{equation}
388: \sigma^2(w) \sim \frac{ \Gamma(0) }{ w } \cdot
389: \bigg\{ 1 + \frac{2}{w}\sum_{d=1}^{w-1} (w-d) \Gamma(d) \bigg\}.
390: \end{equation}
391: If $S(f) \sim 1/f^\alpha$, $\Gamma(d) \sim 1/d^\gamma$,
392: $\sigma^2(w) \sim 1/w^\beta$ are power-law functions,
393: then their scaling exponents are related
394: by $\alpha = 1- \gamma$ and $\gamma=\beta$ \cite{clay3}.
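
The relation $\gamma = \beta$ can be made plausible by a short calculation. The following is a sketch under assumptions that are not part of the original argument: $\Gamma(d)/\Gamma(0) \approx c\, d^{-\gamma}$ with a constant $c$ and $0 < \gamma < 1$, and $w \gg 1$ so that the sum can be replaced by an integral:
\begin{equation}
\sum_{d=1}^{w-1} (w-d)\, d^{-\gamma} \approx
\int_0^w (w-x)\, x^{-\gamma}\, dx =
\frac{w^{2-\gamma}}{(1-\gamma)(2-\gamma)} .
\end{equation}
Inserting this into the expression for $\sigma^2(w)$ gives
\begin{equation}
\sigma^2(w) \sim \frac{ \Gamma(0) }{ w } \cdot
\bigg\{ 1 + \frac{2 c\, w^{1-\gamma}}{(1-\gamma)(2-\gamma)} \bigg\}
\sim w^{-\gamma} \hspace{0.1in} (w \gg 1),
\end{equation}
since the second term in the braces dominates for large $w$. Comparing with $\sigma^2(w) \sim 1/w^\beta$ yields $\beta = \gamma$.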
395:
396: The calculation of $S(f)$ and $\Gamma(d)$ was carried out by the
397: statistical package {\sl S-PLUS} (Version 3.4, MathSoft, Inc.), and the
398: type of filter implemented for $S(f)$ is the Daniell-filter \cite{daniell}.
399:
400:
401: \SEC{$1/f$ noise is a universal feature of human DNA sequences}
402:
403:
404: In this section, we use the power spectrum $S(f)$ to study
405: GC\% of human genome sequences, with
406: the goals of testing the universality of $1/f$ noise, quantifying
407: different decay ranges for $S(f) \sim 1/f^\alpha$, and comparing
408: $S(f)$ across DNA sequences of different human chromosomes.
409:
410: Figure~\ref{fig3:s(f)} shows for $N=2^{17}$ GC\% values the power
411: spectra $S(f)$ across all human chromosomes. We find that $S(f)$
412: exhibits no clear plateau at
413: low frequency ($< 10^{-6}$ cycle/base) and increases steadily
414: with decreasing frequency. The decay can be mathematically
415: approximated by a power-law of the form $S(f) \sim 1/f^\alpha$
with $\alpha \approx 1$. Table~1 lists for the frequency range
$f=$ 10~Mb$^{-1}$--100~kb$^{-1}$ the estimated scaling exponent
$\alpha_1$ for all chromosomes, using a best-fit regression
of $\log_{10} S(f; s=3) = a - \alpha_1 \log_{10}(f)$. We find that
$\alpha_1$ is close to 1 for all chromosomes (between 0.83 and 1.12),
with little variation across chromosomes.
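
The estimation of a scaling exponent by regression in double-logarithmic coordinates can be sketched as follows. The function name {\tt fit\_alpha} and the use of an ordinary least-squares fit via NumPy are assumptions of this example; the synthetic spectrum is constructed only to check that a known exponent is recovered.

```python
import numpy as np

def fit_alpha(freq, spec, f_lo, f_hi):
    """Estimate alpha in S(f) ~ 1/f^alpha for f_lo < f < f_hi.

    Fits log10 S = a - alpha * log10 f by least squares and returns alpha.
    """
    freq = np.asarray(freq, dtype=float)
    spec = np.asarray(spec, dtype=float)
    sel = (freq > f_lo) & (freq < f_hi)
    slope, _intercept = np.polyfit(np.log10(freq[sel]), np.log10(spec[sel]), 1)
    return -slope

# Synthetic spectrum with a known exponent of 0.9 (illustrative values only).
freq = np.logspace(-7, -4, 200)
spec = 3.0 * freq ** -0.9
alpha_1 = fit_alpha(freq, spec, 1e-7, 1e-5)
```

The same function applied to the range $10^{-5} < f < 2 \times 10^{-4}$ cycle/base would give the high-frequency exponent.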
422:
423: A closer inspection of Fig.~\ref{fig3:s(f)} shows that the
424: majority of $1/f$ spectra undergo a cross-over from $\alpha_1 \approx 1$
425: to $\alpha_2 < 1 $ at high frequency. The deviation from $\alpha_1 \approx 1$
starts at about 30--100~kb and continues toward smaller length scales.
427: Figure~\ref{fig4:ex} illustrates this feature for $S(f; s=31)$
428: of the DNA sequences of Chr15, Chr21, and Chr22 in more detail.
429: We find that chromosomes 15 and 21 exhibit clear cross-overs
430: at about 100~kb, while chromosome 22 exhibits no apparent break-point.
431: Table~1 contains for the frequency range of
432: $f=$ 100~kb$^{-1}$--5~kb$^{-1}$ the corresponding scaling
433: exponents $\alpha_2$, obtained from the
regression $\log_{10} S(f; s=3) = a - \alpha_2 \log_{10}(f)$.
435: We find a pronounced difference in absolute values between
436: $\alpha_1 \approx 1$ and $\alpha_2 < 1$, indicating a transition
437: from the universal $1/f^{\alpha_1}$ ($\alpha_1 \approx 1 $)
438: spectrum at low frequency to a
439: more flattened $1/f^{\alpha_2}$ ($\alpha_2 < 1$) spectrum
440: at higher frequency.
441:
442:
443: Figure~\ref{fig5:alpha}(a) shows for all human chromosomes $\alpha_1$
444: and $\alpha_2$ as a function of chromosome-specific GC\%. The
445: majority of human chromosomes have a specific GC content ranging between
446: 38--43\%, whereas chromosomes 16, 17, 19, 20, and 22 have higher GC\%
447: up to 49\%. While the low-frequency scaling exponent $\alpha_1$
448: remains approximately independent of GC\%, Fig.~\ref{fig5:alpha}(a)
449: shows that $\alpha_2$ increases with increasing GC\% and
450: gives rise to a positive correlation between $\alpha_2$ and GC\%.
451:
452: The three chromosomes illustrated in Fig.~\ref{fig4:ex} exhibit
453: different degrees of transition from the $1/f^{\alpha_1}$
454: ($\alpha_1 \approx 1$) to the flattened
455: $1/f^{\alpha_2}$ ($\alpha_2 <1$) spectrum, with chromosome 21 (22)
undergoing the sharpest (smoothest) transition. This
observation can be further quantified by the difference between the
scaling exponents $\alpha_1$ and $\alpha_2$. Table~1 lists for
all chromosomes $\Delta \alpha = \alpha_1 - \alpha_2$.
Chromosome 22 has the smallest value ($\Delta \alpha = 0.28$) and is thus
distinct from all other human chromosomes
as the most scale-invariant one (same or similar scaling
exponent at different length scales).
463: The same observation that human chromosome 22 was perhaps different
464: from the remaining human chromosomes was made using limited
465: sequence data in \cite{isochore-clay,pedro}.
466:
467: \begin{figure*}
468: \centerline{\psfig{figure=pre-fig1.eps,width=80mm,angle=-90}}
469: \caption{\label{fig3:s(f)}
470: Double-logarithmic representation of
471: power spectra $S(f)$ of GC\% of all twenty-four human
472: chromosomes. Each plot shows $S(f)$ of six chromosomes
473: (shifted on the $y$-axis for clearer representation):
474: chromosomes (a) 1--6; (b) 7--12; (c) 13--18; (d) 19--22, X, and Y.
475: The $x$-axis (in logarithmic scale) is converted from cycle/window
476: to cycle/base by using the window sizes listed in Table~1.
477: $S(f)$ is filtered at different levels for different frequency
478: ranges: $S(f; s=1)$ for the first ten spectral components,
479: $S(f; s=3)$ for the components 11--30,
480: $S(f; s=31)$ for the components 31--400,
481: and $S(f; s=501)$ for the components 400--65536 (=$2^{16}$).
482: }
483: \end{figure*}
484:
485:
486: \begin{figure}
487: \centerline{\psfig{figure=pre-fig2.eps,width=55mm,angle=-90}}
488: \caption{\label{fig4:ex}
489: Cross-over from $S(f) \sim 1/f^{\alpha_1}$ to $S(f) \sim 1/f^{\alpha_2}$
490: illustrated for human chromosomes 15, 21, and 22
491: (smoothed with the span size of 31, and shown in double-logarithmic scale).
492: The scaling exponents $\alpha_1$ and $\alpha_2$ are shown for
the frequency ranges 10~Mb$^{-1}$--100~kb$^{-1}$ and 100~kb$^{-1}$--5~kb$^{-1}$.
494: }
495: \end{figure}
496:
497: \begin{figure}
498: \centerline{\psfig{figure=pre-fig3.eps,width=70mm,angle=-90}}
499: \caption{\label{fig5:alpha}
500: (a) Scaling exponents $\alpha_1$ and $\alpha_2$
501: for fitting the power spectrum $S(f) \sim 1/f^{\alpha_i}$
502: ($i=1,2$) at the frequency range of 10~Mb$^{-1}$--100~kb$^{-1}$,
503: and 100~kb$^{-1}$--5~kb$^{-1}$, respectively, versus the
504: chromosome-specific GC content of all 24 human chromosomes.
505: (b) Scaling exponents $\alpha'_1$ and $\alpha'_2$
506: for $S(f)$ with substituted interspersed repeats.
507: }
508: \end{figure}
509:
510:
511: \SEC{Interspersed repeats are not responsible for
512: $1/f$ spectrum}
513:
514: About 45\% of human genomic DNA sequences are interspersed
515: repeats \cite{lander}. Interspersed repeats consist of copies of the same
516: sequence segment that are inserted in the human genome, possess a high
517: similarity between the duplicated and ancestral sequence, and have
518: been implicated in a variety of biological functions, including genome
519: organization, human chromosome segregation, or regulation of gene
520: expression \cite{repeats-bio}. Large copy numbers increase
521: the sequence redundancy and it has been shown, e.g., that
522: about 10\% interspersed Alu repeats significantly increase
523: base-base correlations in the range up to 300~bases
524: \cite{dirk}.
525:
526:
527:
Figure~\ref{fig6:rep} shows the power spectrum $S(f)$ for
529: the original human chromosome 1 and for the transformed sequence in
530: which interspersed repeats are substituted. We find in
531: the low-frequency range of $10^{-7} < f < 10^{-5}$ cycle/base
532: that $S(f)$ decays in the original sequence
533: with $\alpha_1 \approx 0.88$ and in the transformed sequence with
534: $\alpha^\prime_1 \approx 0.80$, indicating only marginal differences in
535: the decay properties of $S(f)$ due to repetitive sequences. In
contrast, in the high-frequency range of $10^{-5} < f < 2 \times 10^{-4}$ cycle/base
we find $\alpha_2 \approx 0.46$ and $\alpha^\prime_2 \approx 0.29$.
Interspersed repeats thus contribute to the decay of $S(f)$ at
high frequencies, and substituting them flattens the power spectrum.
540:
541:
542: The scaling exponents $\alpha^\prime_1$ and $\alpha^\prime_2$ for
543: repeat-substituted DNA sequences of all 24
human chromosomes are shown in Table~1. The difference
between low- and high-frequency exponents for DNA sequences of original
chromosomes, $\Delta \alpha= \alpha_1 - \alpha_2$, is generally smaller
than the corresponding difference for
transformed sequences, $\Delta \alpha' = \alpha^\prime_1-\alpha^\prime_2$.
When we compare $\alpha_1$ and $\alpha^\prime_1$, as well as $\alpha_2$ and
$\alpha^\prime_2$, we find that $\alpha^\prime_1$
($\alpha^\prime_2$) never exceeds $\alpha_1$
($\alpha_2$) and is almost always smaller, which means a flattened spectrum
due to the substitution of interspersed repeats.
554: The average change of low-frequency
555: scaling exponents, $ \alpha_1- \alpha'_1$, is about 0.07,
556: whereas the average change of high-frequency scaling
557: exponents, $ \alpha_2- \alpha'_2$, is about 0.14. This
558: confirms that the universal presence of $1/f$ spectrum
559: at low frequency is not caused by interspersed
560: repeats, but that interspersed repeats affect $S(f)$
561: predominantly at high frequencies. A similar conclusion
562: that the decay rate of base-base correlations in DNA sequences of
human chromosomes 20, 21, and 22 is not markedly affected
564: by the substitution of interspersed repeats was reached in \cite{dirk}.
565:
566:
567: We note that the extent of deviation,
568: $|\alpha'-\alpha|$, depends on how the replacement of
569: interspersed repeats is conducted. Possible substitutions
570: of interspersed repeats include the substitution by
571: a constant value or a randomly sampled value.
572: In general, the substitution of GC\% values
573: calculated from the repetitive sequences by random values enhances
574: the deviation and flattens the spectrum $S(f)$ more than the
575: substitution by a constant value (e.g., average GC\%).
576:
577:
578: \begin{figure}
579: \centerline{\psfig{figure=pre-fig4.eps,width=60mm,angle=-90}}
580: \caption{\label{fig6:rep}
581: Power spectra $S(f)$ of GC\% for the original and the
582: transformed (interspersed repeats substituted) DNA
sequence of human chromosome 1. The scaling exponents for the
low-frequency (10~Mb--100~kb) and high-frequency
(100~kb--5~kb) ranges are obtained by a best-fit regression
of $\log_{10} S(f)$ over $\log_{10} f$.
587: }
588: \end{figure}
589:
590: \SEC{Resistance to variance reduction at larger window sizes}
591:
592: In this section, we study the decay properties of the
593: variance ($\sigma^2$) of spatial GC\% series as
a function of different window sizes $w$, and
595: we compare the scaling of $\sigma^2$ with the
596: scaling of the power spectrum $S(f)$.
597:
598:
Early experimental measurements of the GC\% distribution
using cesium chloride (CsCl) profiles \cite{cscl} showed for mouse
({\em Mus musculus}) genomic DNA sequences that the
variance of GC\% values does not markedly decrease with the DNA segment size
\cite{macaya}. This experimental observation is directly related
to the presence of $1/f$ spectra in DNA sequences
\cite{isochore-clay,li-nova}. If the variance of the spatial
606: GC\% series calculated at the window size $w$ is $\sigma^2(w)$,
607: then a scaling of
608: $\sigma^2(w) \sim 1/w^\beta$ implies a corresponding
609: scaling in the power spectrum $S(f) \sim 1/f^{1-\beta}$
610: \cite{isochore-clay,beran}. If GC\% is obtained from
611: $w$ uncorrelated bases, it follows a binomial distribution.
612: Consequently, $\sigma^2(w) \sim \langle\rho \rangle
613: (1- \langle \rho \rangle) /w \sim 1/w$ with
614: $\beta=1$. The corresponding scaling exponent of
the power spectrum is $\alpha=1-\beta=0$, and thus
$S(f) \sim \mbox{const.}$, which is equivalent to white noise.
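
The white-noise reference case can be checked numerically. The sketch below draws uncorrelated bases with a fixed GC probability and estimates $\beta$ from the decay of the variance; the function names, the random seed, the GC probability of 0.41, and the window sizes are illustrative assumptions, and $\rho$ is expressed here as a fraction rather than a percentage.

```python
import numpy as np

def variance_vs_window(is_gc, window_sizes):
    """Variance of the GC fraction for each window size w."""
    is_gc = np.asarray(is_gc, dtype=float)
    variances = []
    for w in window_sizes:
        n = is_gc.size // w                      # number of complete windows
        rho = is_gc[: n * w].reshape(n, w).mean(axis=1)
        variances.append(rho.var())
    return np.array(variances)

def fit_beta(window_sizes, variances):
    """Estimate beta in sigma^2(w) ~ 1/w^beta by a log-log regression."""
    slope, _intercept = np.polyfit(np.log10(window_sizes), np.log10(variances), 1)
    return -slope

rng = np.random.default_rng(1)
uncorrelated = rng.random(2 ** 20) < 0.41        # independent bases, 41% GC
sizes = [16, 64, 256, 1024]
beta = fit_beta(sizes, variance_vs_window(uncorrelated, sizes))
```

For such an uncorrelated sequence the estimate of $\beta$ should lie close to 1, whereas a markedly smaller value indicates long-range correlations.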
617:
618: Figure~\ref{fig7:var} shows $\sigma^2(w)$ as a function of window size $w$
619: for all human chromosomes. In a double-logarithmic representation,
620: we find that $\log (\sigma^2(w)) $ decays approximately
linearly with $\log( w) $. A decay according to
$\sigma^2(w) \sim 1/w^\beta$ with $\beta=1$ corresponds to
white noise. This situation is indicated in Fig.~\ref{fig7:var}
by the straight line. An inspection of Fig.~\ref{fig7:var}
shows, however, that the variance decays at a much slower
rate than expected for white noise. The variance of
the DNA sequence of human chromosome 1, e.g.,
gives rise to $\beta \approx 0.12$, and the
corresponding scaling exponent $\alpha_1 \approx
1- \beta =0.88$ is indeed close to the estimated
exponent listed in Table~1. The scaling of the variance
with the exponent $\beta \ll 1 $ is in accord with
the low-frequency $1/f$ noise.
634:
635:
636: \begin{figure}
637: \centerline{\psfig{figure=pre-fig5.eps,width=65mm,angle=-90}}
638: \caption{\label{fig7:var}
639: Double-logarithmic representation of the variance $\sigma^2(w)$
640: of the spatial GC\% series for all human chromosomes (Chr)
641: as a function of the window size $w$:
642: (a) $\bigcirc$ Chr1,
643: $\triangle$ Chr2,
644: $+$ Chr3, $\times$ Chr4, $\diamondsuit$ Chr5, $\bigtriangledown$ Chr6;
645: (b) $\bigcirc$ Chr7, $\triangle$ Chr8, $+$ Chr9, $\times$ Chr10,
646: $\diamondsuit$ Chr11, $\bigtriangledown$ Chr12;
647: (c) $\bigcirc$ Chr13, $\triangle$ Chr14, $+$ Chr15,
648: $\times$ Chr16, $\diamondsuit$ Chr17, $\bigtriangledown$ Chr18;
649: (d) $\bigcirc$ Chr19, $\triangle$ Chr20, $+$ Chr21, $\times$ Chr22,
$\diamondsuit$ ChrX, $\bigtriangledown$ ChrY.
651: Straight lines indicate $\sigma^2(w) \sim 1/w$ (corresponding
652: to white noise). One regression line for Chr1 ($\beta \approx $0.12)
653: and a piece-wise regression for Chr13 ($\beta \approx $0.27 and
654: $\beta \approx $0.10) are drawn. The 95\% confidence
655: interval for the $\sigma^2(w)$ estimation of Chr1 at each point
656: of $w$ is marked by a vertical dashed line.
657: }
658: \end{figure}
659:
660:
The scaling of $\sigma^2(w)$ shown in Fig.~\ref{fig7:var}
differs from one human chromosome to another. For instance,
in the range of $w=$ 1~kb--5~Mb, human chromosome
13 exhibits a clear transition from $\beta_2 \approx 0.27$ ($w < $ 50~kb)
to $\beta_1 \approx 0.10$ ($w > $ 50~kb), corresponding via
$\alpha = 1-\beta$ to
$S(f) \sim 1/f^{0.73}$ and $S(f) \sim 1/f^{0.9}$
at high- and low-frequency ranges, respectively.
Other human chromosomes, although generally exhibiting a power-law
scaling form of $\sigma^2(w)$, show deviations from the
$\sigma^2(w) \sim 1/w^{\beta}$ line for the largest
window sizes tested.
672:
673:
674: The investigation of $\sigma^2(w)$ as a function of
675: different window sizes $w$ requires careful examination
\cite{audit,pedro04}. First, since we partition each
human chromosome into $2^k$ ($k=$17, 16, $\dots$)
windows, the variance of the GC\% series $\{ \rho_i \} $
679: could be accidentally large when windows reside on
680: the isochore borders, and small by chance if they
681: start/end within an isochore.
682:
683: Second, when the number of windows is small (e.g. the last
684: point of $\sigma^2(w)$ for each chromosome in Fig.~\ref{fig7:var}
685: is calculated with the largest window size that
686: gives rise to 32 windows), the standard error of the
687: sample variance is large. The 95\% confidence interval for
688: $\sigma^2(w)$ of Chr1 is shown in Fig.~\ref{fig7:var}(a),
using the interval
[$ (n-1) \sigma^2/t_{0.975}, (n-1) \sigma^2 /t_{0.025}$],
where $n$ is the number of windows and $t_x$ is the $x$-quantile defined by
$\int_{0}^{t_x} \chi^2({\rm df}=n-1)\, dt= x$
(where $\chi^2({\rm df})$ is the chi-square distribution with ${\rm df}$
degrees of freedom) \cite{snedecor}. Figure~\ref{fig7:var}(a)
shows that for fewer windows (and larger window sizes),
the 95\% confidence interval of $\sigma^2(w)$ can be so large
that the estimated value of $\beta$ may change from sample to sample.
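
As a rough numerical illustration (a sketch using standard chi-square
table values, not a computation taken from the data), consider the last
point of each curve, which is based on $n=32$ windows and therefore on
31 degrees of freedom. The 2.5\% and 97.5\% quantiles of the chi-square
distribution with 31 degrees of freedom are approximately 17.54 and
48.23, so that the 95\% confidence interval of the variance, written
from lower to upper limit, is
\[
\left[ \frac{31\, \sigma^2}{48.23}, \; \frac{31\, \sigma^2}{17.54} \right]
\approx \left[ 0.64\, \sigma^2, \; 1.77\, \sigma^2 \right] .
\]
The upper and lower limits thus differ by a factor of about 2.75,
which is not negligible compared with the slow decay of $\sigma^2(w)$
expected for small $\beta$.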
697:
Finally, the relationship between the scaling exponents,
$\alpha+\beta = 1$ \cite{beran,isochore-clay},
is based on the assumption that both $S(f)$ and $\sigma^2(w)$ are
ideal power-law functions. If $S(f)$ is a piece-wise
power-law function, as in the case of the GC\% fluctuations of
human chromosomes, a correction term to the relationship
$\alpha +\beta=1$ is expected.
705:
706: \begin{figure}
707: \centerline{\psfig{figure=pre-fig6.eps,width=65mm,angle=-90}}
708: \caption{\label{fig8:corr}
709: Correlation function $\Gamma(d)$ for 24 human chromosomes (Chr)
710: as a function of the window distance $d$ (converted to bases
711: by the window size listed in Table~1). The distance is
712: represented on a logarithmic scale. (a) Chr1--6; (b) Chr7--12;
713: (c) Chr13--18; and (d) Chr19--22, ChrX, and ChrY.
714: }
715: \end{figure}
716:
717: \begin{figure}
718: \centerline{\psfig{figure=pre-fig7.eps,width=50mm,angle=-90}}
719: \caption{\label{fig9:ch21}
720: Correlation function $\Gamma(d)$ for human chromosome 21
721: as a function of the window distance $d$ (converted to bases
722: by the window size given in Table~1). The oscillation in $\Gamma(d)$
723: is highlighted by vertical lines, indicating the
724: distances of $d=$500~kb, 1~Mb, 1.5~Mb, and 2~Mb.
725: }
726: \end{figure}
727:
728:
729:
730: \SEC{Chromosome-specific correlation structures}
731:
732:
The presence of $1/f$ noise in music and speech signals \cite{voss-music}
does not prevent music and speech from sounding different.
735: Similarly, universal $1/f^\alpha$ spectra in GC\% fluctuations
736: across human chromosomes do not imply that all chromosomes
737: exhibit the same detailed correlation structure. The generic trend
of $S(f)$ spectra to increase at low frequency may ``co-exist''
with small peaks at higher frequency. Such chromosome-specific
740: characteristic length scales can be more intuitively examined
741: by correlation functions. In this section, we investigate
742: the correlation function $\Gamma(d)$
743: of coarse-grained DNA sequences of human chromosomes with the
744: aim of further examining chromosome-specific structures,
such as characteristic length scales and oscillations
detected by $\Gamma(d)$.
747:
Figure~\ref{fig8:corr} shows, for all human chromosomes,
the correlation function $\Gamma(d)$ of the GC\% series
$\{ \rho_i \}$ calculated for the window sizes given in Table~1.
For each chromosome, the minimum
(maximum) distance is 80 (16,000) windows.
Since each chromosome is partitioned into $2^{17}$ windows,
the maximum distance $d$ at which the correlation is examined
is about $16{,}000/2^{17} \approx 12\%$ of the total sequence length.
756:
757: An inspection of Fig.~\ref{fig8:corr} shows that the magnitude
758: of correlation at the distance of $d=1~{\rm Mb}$ is clearly above
the noise level. With the exceptions of Chr15, Chr22, and ChrY,
the correlation function satisfies $\Gamma(d) > 0.1$ at $d=1~{\rm Mb}$
for all chromosomes. The low correlation in ChrY is
due to the fact that about half of the bases are unsequenced,
and the substitution of gaps by random values lowers the correlation.
At even longer distances such as $d=10~{\rm Mb}$, the correlations
$\Gamma(d=10~{\rm Mb})$ for chromosomes 1 and 6 are still
above the 0.1 level.
767:
Given the different window sizes ($w$) that result from different chromosome
sizes, and provided that the covariance of GC\% is approximately
independent of $w$, a scaling of the variance according to
$1/w^\beta$ implies that the correlation function
$\Gamma(d)$ in Eq.(\ref{gamma}) increases with the window
size as $\sim w^\beta$. Test calculations of the covariance
for $2^{15}$ and $2^{17}$ windows show that the covariance
differs by less than 1\% (and hence is fairly independent of $w$ in
this range of window sizes).
Consequently, a detailed comparison of correlation
functions calculated for different chromosomes has to
take into account the different window sizes.
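
The size of this effect can be estimated with a short calculation.
This is an illustrative sketch that assumes, as above, that
$\Gamma(d)$ is the covariance normalized by the variance $\sigma^2(w)$
and that the covariance does not depend on $w$; the labels $w_a$ and
$w_b$ are introduced here only for this example. For two window sizes
$w_a$ and $w_b$,
\[
\frac{\Gamma_{w_a}(d)}{\Gamma_{w_b}(d)}
\approx \frac{\sigma^2(w_b)}{\sigma^2(w_a)}
= \left( \frac{w_a}{w_b} \right)^{\beta} .
\]
With $\beta \approx 0.12$ (the value estimated for chromosome 1) and a
window size ratio of $w_a/w_b = 4$, as between partitions into $2^{15}$
and $2^{17}$ windows, the ratio is $4^{0.12} \approx 1.18$, i.e.,
the correlation function changes by roughly 18\%.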
780:
781:
782:
783: Any deviation from the monotonic decrease of $\Gamma(d)$
784: might be indicative of correlations at characteristic
length scales (visible as ``bumps''). For example,
786: Fig.~\ref{fig8:corr} shows for chromosome 1 such a bump at
787: $d \approx$ 21--23~Mb. Bumps or sharper peaks in other chromosomes include
788: $d \approx$ 9.3~Mb (Chr2), 7.2~Mb (Chr10), 3.2--3.8~Mb
(Chr12), and 2.4--3.1~Mb (Chr19). One plausible
explanation is that for chromosomes 2, 10, 12, and 19 one or a few
alternations of GC-rich/GC-poor isochores \cite{isochore} with these
length scales enhance the correlation.
793:
794:
Chromosome 21 stands out among all human chromosomes for
having comparatively high correlations at distances of
several Mb (despite having a smaller $w^\beta$ factor than
other chromosomes due to a smaller window size). A detailed
inspection of Fig.~\ref{fig9:ch21} uncovers an oscillation of
$\Gamma(d)$ with a period of about 500~kb, ranging from $d=$500~kb to $d=$2~Mb, which has
not been reported before. It can be further shown that this
802: oscillation is not due to the substitution of interspersed repeats
803: \cite{li-holste-21}, and it is localized to about one-eighth
804: of the right distal end of chromosome 21 \cite{li-holste-21}.
805:
806:
807:
808: \SEC{Discussions}
809:
810: We study correlation structures and spectral components in
811: the set of human chromosomes, using power spectra,
812: coarse-grained correlation functions, and the variance
at different window sizes. All three measures are
interrelated and highlight compositional structures
at different feature levels. Our results firmly establish
the presence of long-range correlations and
817: $1/f^\alpha$ spectra in the DNA sequences of the set
818: of twenty-four human chromosomes.
819:
Using updated and completed human sequence data, we find the
presence of $1/f$ noise in the DNA sequences of
all human chromosomes. We further find that, with the exception
of chromosome 22, all chromosomes exhibit a cross-over
from $1/f^{\alpha_1}$ scaling at low frequency to $1/f^{\alpha_2}$
scaling at high frequency ($\alpha_1 > \alpha_2$).
The finding of two scaling ranges at low and high frequency
is in accord with previous findings, obtained from
sequence data of lower quality, and it refines the break-point
regions for each individual chromosome.
830:
831:
We also examined the effect of the roughly 45\% of the human genome
that consists of interspersed repeats. Using a procedure that
masks and subsequently substitutes interspersed repeats
with random GC\% values, we find that interspersed
repeats (i) only marginally affect the scaling exponent
$\alpha_1$ in the low-frequency range, but (ii) lower
$\alpha_2$ in the high-frequency range (cf.~Fig.~\ref{fig5:alpha}(b)).
This supports the general understanding that interspersed repeats
only contribute to short-range (high-frequency) correlations \cite{dirk}.
841:
842: We have shown elsewhere that $1/f^\alpha$
843: spectra of GC\% fluctuation are also universally present
844: in the mouse {\sl Mus musculus} genomic DNA sequences
845: \cite{1fmouse}. It is known that human and mouse genomes
are separated by approximately 65--75 million years of evolution. Besides
the similarity (or homology) between
these two genomes on a local scale, there is in fact a
large amount of reshuffling of chromosome segments
at a global scale when present-day copies of
the two genomes are compared side by side \cite{pevzner}. Since reshuffling
of a sequence at global scales could potentially destroy
long-range correlations, it is still to be resolved
under what conditions a reshuffling of the
human genome into the mouse genome, or vice versa,
conserves $1/f$ noise.
857:
858: One possible explanation of why $1/f^\alpha$ spectra appear
859: in both the human and the mouse genomes is that such long-range
860: patterns were probably generated from ancestral DNA
sequences by sequence evolutionary mechanisms. One
sequence evolution model, termed the expansion-modification
(EM) model, is known to generate $1/f^\alpha$ spectra \cite{wli-em}.
The EM model incorporates duplications and mutations.
Since the duplication process is an essential element in
evolutionary genomics \cite{ohno}, whose role is perhaps as important as Darwin's
natural selection \cite{meyer}, even a still unsophisticated
incorporation of duplications in the EM model may capture the
essence of the evolutionary origin of long-range correlations
870: in DNA sequences. In the EM model, only the duplication of
871: segments with the same length scale is included, whereas
872: in reality segments with a broad range of length scales
873: are duplicated \cite{lander}.
874:
One frequently posed question concerns the ``biological meaning''
876: of $1/f^\alpha$ spectra or long-range correlations in DNA
877: sequences. In order to address this question, one may ask
878: a couple of related questions beforehand.
879: Does the compositional GC\% have any biological effects?
880: What biological functions of the DNA molecule are of relevance?
881: From the {\sl functional genomics} perspective, interesting
882: biological processes related to DNA molecules include transcription,
883: replication, and recombination, and their potential connection
884: to GC\% has been reviewed in \cite{bernardi,bernardi-book,li-nova}.
885: Generally speaking, GC\% has a statistical association with
886: all three processes, though the cause-and-effect role has
887: not yet been firmly established. Recent studies show that broadly
expressed ``housekeeping genes'' tend to be located in GC-rich
889: regions \cite{housekeeping}. To understand the genome-wide organization
890: of biological units that play a role in those processes
891: (e.g., genes, origins and timing of replication, or recombination hotspots),
892: at times it is more feasible to directly study the
893: spatial distribution of functional units instead of using
894: the GC\% as a surrogate.
895:
896: From the {\sl biophysics and cellular biology}
897: perspective, GC\% is linked with bands from chromosome-staining
898: \cite{gojobori}, and in addition, possibly with the
899: matrix/scaffold attachment/associated regions located at
900: the end of DNA loops \cite{attachment}. It has also been
suggested that GC-rich chromosomes (or regions)
tend to be located in the interior of the nucleus
during interphase and are more ``open'' in their
tertiary structure, whereas GC-poor segments are
more likely to be close to the surface of the
nucleus and more condensed \cite{interphase}.
907:
908: Further exploration of the relationship between GC\% fluctuations,
909: as well as its large-scale patterns, and the above
910: biological processes is beyond the scope of this
paper. For bacterial genomes, an attempt has been
made to relate the scale-invariance feature in sequence
statistics to the genome organization of transcription
activities \cite{audit}. It is clear that more
integrated computational and experimental analyses
need to be carried out along similar lines
before one can give universal $1/f$ spectra in
DNA sequences a satisfactory biological explanation.
919:
920:
921:
922:
923: \SEC*{Acknowledgements}
924:
We thank S. Guharay for participating in the early stage
of this project, and O. Clay, J.L. Oliver, and A. Fukushima for
valuable discussions.
928:
929:
930:
931: \begin{thebibliography}{99}
932:
933: \bibitem{johnson}
934: J.B. Johnson, Phys. Rev. {\bf 26}, 71-85 (1925).
935:
936: \bibitem{1freview}
937: A. van der Ziel, Adv. Electronics and Electronics Phys.
938: {\bf 49}, 225-297 (1979);
939: P. Dutta and P.M. Horn, Rev. Mod. Phys. {\bf 53}, 497-516 (1981);
940: M.B. Weissman, Rev. Mod. Phys. {\bf 60}, 537-571 (1988);
941: H. Wong, Microelectronics Reliability {\bf 43}, 585-599 (2003).
942:
943: \bibitem{1freview-other}
944: M. Gardner, Sci. Am. {\bf 238}, 16-32 (1978);
945: W. Press, Comments on Astrophys. {\bf 7},103-119 (1978);
946: B.J. West and M.F. Shlesinger, Am. Sci. {\bf 78}, 40-45 (1990);
947: E. Milotti, arXiv preprint, physics/0204033 (2002).
948:
949: \bibitem{1fbib}
950: W. Li, {\em A bibliography on 1/f noise} (online),
951: {\sf http://www.nslij-genetics.org/wli/1fnoise/}.
952:
953: \bibitem{mapping}
954: H. Herzel and I. Grosse, Physica A {\bf 216}, 518-542 (1995);
955: S.V. Buldyrev {\sl et al.}, Phys. Rev. E {\bf 51}, 5084-5091 (1995).
956:
957: \bibitem{dirk}
958: D. Holste, I. Grosse and H. Herzel, Phys. Rev. E, {\bf 64}, 041917 (2001);
959: D. Holste, {\sl et al.}, Phys. Rev. E, {\bf 67}, 061913 (2003).
960:
961: \bibitem{wli-em}
962: W. Li, Europhys. Lett. {\bf 10},395-400 (1989);
963: W. Li, Phys. Rev. A, {\bf 43}, 5240-5260 (1991).
964:
965: \bibitem{wli-dna}
966: W. Li, Int. J. Bifurcation \& Chaos, {\bf 2}, 137-154 (1992);
967: W. Li and K. Kaneko, Europhys. Lett. {\bf 17}, 655-660 (1992).
968:
969: \bibitem{voss}
970: R.F. Voss, Phys. Rev. Lett., {\bf 68},3805-3808 (1992).
971:
972: \bibitem{bac}
973: X. Lu, {\sl et al.}, Phys. Rev. E, {\bf 58}, 3578-3584 (1998);
974: M. de Sousa Vieira, Phys. Rev. E, {\bf 60}, 5932-5937 (1999).
975:
976: \bibitem{yeast}
977: W. Li, {\sl et al.}, Genome Res. {\bf 8}, 916-928 (1998).
978:
979: \bibitem{fukushima}
A. Fukushima,
981: {\em Periodicity in Genome Architecture from Bacteria to Human}
982: (Ph.D Thesis, Nara Institute of Science and Technology, 2003);
983: A. Fukushima, {\sl et al.}, Gene, {\bf 300}, 203-211 (2002).
984:
985: \bibitem{isochore}
986: G. Cuny, {\sl et al.}, Euro. J. Biochem. {\bf 99}, 179-186 (1981);
987: G. Bernardi, {\sl et al.}, Science, {\bf 228}, 953-958 (1985);
988: G. Bernardi, Gene, {\bf 241}, 3-17 (2000).
989:
990: \bibitem{isochore-clay}
991: O. Clay, {\sl et al.}, Gene, {\bf 276}, 15-24 (2001);
992: O. Clay and G. Bernardi, Gene, {\bf 276}, 25-31 (2001).
993:
994:
995: \bibitem{cc03}
996: W. Li, {\sl et al.}, Comput. Biol. and Chem., {\bf 27}, 5-10 (2003).
997:
998: \bibitem{clay3}
O. Clay, Gene, {\bf 276}, 33-38 (2001).
1000:
1001: \bibitem{isochore-spe}
1002: W. Li, Gene, {\bf 300}, 129-139 (2002).
1003:
1004: \bibitem{wli-complexity}
1005: W. Li, Complexity, {\bf 3}, 33-37 (1997).
1006:
1007:
1008: \bibitem{lander}
1009: E.S. Lander, {\sl et al.}, Nature, {\bf 409}, 860-921 (2001).
1010:
1011: \bibitem{pedro}
1012: P. Bernaola-Galv\'{a}n, {\sl et al.},
1013: Gene, {\bf 300}, 105-115 (2002).
1014:
1015: \bibitem {hg-quality}
1016: J. Schmutz {\sl et al.},
1017: Nature {\bf 429}, 365-368 (2004).
1018:
1019:
1020: \bibitem{repeatmasker}
1021: A.F.A.~Smit and P.~Green,
1022: unpublished results
1023: (URL: {\sf http://repeatmasker.genome.washington.edu/}).
1024:
1025: \bibitem{daniell}
1026: P.J. Daniell,
{\em Suppl. J. Royal Stat. Soc.} {\bf 8}, 88--90 (1946).
1028:
1029:
1030:
1031: \bibitem{repeats-bio}
J.R.~Korenberg and M.C.~Rykowski, Cell {\bf 53}, 391 (1988);
1033: P.~Medstrand {\em et al.}, Genome Res. {\bf 12}, 1483 (2002);
1034: M.-A.~Hakimi {\em et al.}, Nature {\bf 418}, 994 (2002);
1035: J.S.~Han, S.T.~Szak, and J.D.~Boeke, Nature {\bf 429}, 268--274 (2004).
1036:
1037: \bibitem{cscl}
1038: O. Clay, {\sl et al.},
1039: Euro. Biophys. J., {\bf 32}, 418-426 (2003).
1040:
1041: \bibitem{macaya}
1042: G. Macaya, J.P. Thiery, and G. Bernardi, J. Mol. Biol. {\bf 108}, 237-254 (1976).
1043:
1044: \bibitem{li-nova}
1045: W. Li, in {\sl Progress in Bioinformatics} (Nova Science Publisher, 2005),
1046: to appear.
1047:
1048: \bibitem{beran}
1049: J. Beran, {\sl Statistics for Long-Memory Processes} (Chapman \& Hall, 1994).
1050:
1051: \bibitem{audit}
1052: B. Audit and C.A. Ouzounis, J. Mol. Biol. {\bf 332}, 617-633 (2003).
1053:
1054: \bibitem{pedro04}
1055: P. Bernaola-Galv\'{a}n, {\em et al.,}
1056: Gene, {\bf 333}, 121-133 (2004).
1057:
1058:
1059: \bibitem{snedecor}
1060: G.W. Snedecor and W.G. Cochran,
1061: {\sl Statistical Methods}, Seventh Edition
1062: (Iowa State University Press, 1980).
1063:
1064: \bibitem{voss-music}
1065: R.F. Voss and J. Clarke, Nature, {\bf 258}, 317-318 (1975);
1066: K. J. Hsu and A. Hsu, Proc. Natl. Acad. Sci., {\bf 88}, 3507-3509 (1991).
1067:
1068:
1069:
1070:
1071: \bibitem{li-holste-21}
1072: W. Li and D. Holste, Comp. Bio. Chem., {\bf 28} in press (2004).
1073:
1074:
1075:
1076: \bibitem{1fmouse}
1077: W. Li and D. Holste, Fluct. Noise Lett. {\bf 4} in press (2004).
1078:
1079: \bibitem{pevzner}
1080: P. Pevzner and G. Tesler, Genome Res. {\bf 13}, 37-45 (2003).
1081:
1082: \bibitem{ohno}
S. Ohno, {\em Evolution by Gene Duplication}
1084: (Springer-Verlag, Berlin, 1970).
1085:
1086: \bibitem{meyer}
1087: A. Meyer and Y. van de Peer, J. Struct. Funct. Genomics, {\bf 3}, vii-ix (2003).
1088:
1089: \bibitem{bernardi}
1090: G. Bernardi, Ann. Rev. Genet. {\bf 23}, 637-661 (1989);
1091: G. Bernardi, Ann. Rev. Genet. {\bf 29}, 445-476 (1995).
1092:
1093: \bibitem{bernardi-book}
1094: G. Bernardi, {\em Structural and Evolutionary Genomics}
1095: (Elsevier, 2004).
1096:
1097: \bibitem{housekeeping}
1098: M.J. Lercher, A.O. Urrutia, and L.D. Hurst,
1099: Nature Genet. {\bf 31}, 180-183 (2002);
1100: M.J. Lercher, {\sl et al.}, Hum. Mol. Genet. {\bf 12}, 2411-2415
1101: (2003);
1102: R. Versteeg, {\sl et al.}, Genome Res. {\bf 13}, 1998-2004 (2003).
1103:
1104: \bibitem{gojobori}
1105: Y. Niimura and T. Gojobori,
1106: Proc. Natl. Acad. Sci. {\bf 99}, 797-802 (2002).
1107:
1108: \bibitem{attachment}
1109: P.A. Dijkwel and J.L. Hamlin, Int. Rev. Cytol. {\bf 162A}, 455-484 (1995);
S.V. Razin, I.I. Gromova, and O.V. Iarovaia, Int. Rev. Cytol.
{\bf 162B}, 405-448 (1995).
1112:
1113:
1114: \bibitem{interphase}
1115: S. Boyle, {\sl et al.}, Hum. Mol. Genet. {\bf 10}, 211-219 (2001);
1116: S. Saccone, {\sl et al.}, Gene, {\bf 300}, 169-178 (2002).
1117:
1118:
1119: \end{thebibliography}
1120:
1121:
1122: \end{document}
1123:
1124:
1125: