1: \documentclass{article}[15pt]
2: 
3: \usepackage[dvips]{epsfig}
4: \usepackage{rotating}
5: 
6: \pagestyle{myheadings}  % specify the format for running headings
7: \setlength{\textwidth}{460pt}
8: \setlength{\textheight}{620pt}
9: \setlength{\oddsidemargin}{-22pt}
10: \setlength{\evensidemargin}{-22pt}
11: \setlength{\topmargin}{0pt}
12: 
13: \renewcommand{\baselinestretch}{1.17} % double spacing
14: 
15: \begin{document}
16: 
17: 
18: \title{
19: Spectral Analysis of Guanine and Cytosine Fluctuations
20: of Mouse Genomic DNA
21: \vspace{0.2in}
22: \author{
23: Wentian Li$^{a}$ and  Dirk Holste$^{b}$ \\
24: {\small \sl  a. The Robert S. Boas Center for Genomics and Human Genetics}\\
25: {\small \sl North Shore LIJ Institute for Medical Research, Manhasset, NY 11030, USA.}\\
26: {\small \sl b. Department of Biology, Massachusetts Institute of Technology, 
27: Cambridge, MA 02139, USA. }
28: }
29: \date{}
30: }
31: \maketitle  % End title section
32: \markboth{\sl W. Li, D. Holste}{\sl W. Li, D. Holste}
33: 
34: 
35: 
36: {\bf key words:
37: DNA sequences; GC fluctuations; 1/f noise; long-range 
38: correlations; mouse genome.}
39: 
40: \begin{abstract}
41: We study global fluctuations of the guanine and cytosine base content
42: (GC\%) in mouse genomic DNA using spectral analyses.  Power spectra 
43: $S(f)$ of GC\% fluctuations in all nineteen autosomal and
44: two sex chromosomes are observed to have the universal
45: functional form $S(f) \sim 1/f^{\alpha}$ ($\alpha \approx 1$)
46: over several orders of magnitude in the frequency range
47: $10^{-7}<f< 10^{-5}$ cycle/base, corresponding to long-range
48: GC\% correlations at distances between 100 kb and 10 Mb.
49: $S(f)$ for higher frequencies ($f > 10^{-5}$~cycle/base) shows a
50: flattened power-law function with $\alpha < 1$
51: across all twenty-one chromosomes. The substitution of about 38\% 
52: interspersed
53: repeats does not affect the functional form of $S(f)$, indicating
54: that these are not predominantly responsible for the long-range 
55: multi-scale
56: GC\% fluctuations in mammalian genomes. Several biological
57: implications of the large-scale GC\% fluctuation are discussed,
58: including neutral evolutionary history by DNA duplication,
59: chromosomal bands, spatial distribution of transcription units
60: (genes), replication timing, and recombination hot spots.
61: \end{abstract}
62: 
63: \large
64: 
65: \section{Introduction}
66: 
67: \indent 
68: 
69: DNA sequences, the blueprint of almost all essential genetic
70: information, are polymers consisting of two complementary strands of
71: four types of bases: adenine (A), cytosine (C), guanine (G), and
72: thymine (T). Among the four bases, the presence of A on one strand is
73: always paired with T on the opposite strand, forming a ``base pair''
74: with 2 hydrogen bonds; similarly, G and C are complementary to one
75: another, forming a base pair with 3 hydrogen bonds 
76: \cite{pauling,watson,calladine}.
77: Consequently, one may characterize AT base-pairs as ``weak'' bases
78: and GC base-pairs as ``strong'' bases.  In addition, the frequency of A
79: (G) on a single strand is approximately equal to the frequency of T
80: (C) on the same strand, a phenomenon that has been termed ``strand
81: symmetry'' \cite{fickett92} or ``Chargaff's second parity rule''
82: \cite{forsdyke00}. Therefore, DNA sequences can be transformed into
83: reduced 2-symbol sequences of weak W (A or T) and strong S (G or C) 
84: bases.  The
85: percentage of S (G or C) bases of a DNA sequence segment is denoted as
86: the GC base content (GC\%). 
87: 
88: The spatial variation of GC\% along a DNA sequence has been of
89: long-standing interest \cite{churchill,elton,ikemura88,ikemura90}.
90: GC\%-series can be considered as fluctuating or unsteady signals, and
91: consequently many signal processing and stochastic analysis techniques
92: can be applied to characterize and quantify the statistical properties
93: of the DNA sequences \cite{anastassiou,cristea,vaidy}.  In particular,
94: spectral and correlation analyses are standard tools that can be
95: applied \cite{anastassiou04}.  Initial spectral
96: analysis \cite{li92-1,likaneko,voss} provided evidence that
97: DNA sequences, especially non-protein-coding sequences,
98: exhibit a power spectrum $S(f)$ that can be approximated by $S(f) \sim
99: 1/f^{\alpha}$ ($\alpha \approx 1$) and are termed ``$1/f$ noise'' (or
100: ``$1/f^\alpha$ noise'' with $ 0.5 \le \alpha \le 1.5 $)
101: \cite{keshner,1f,milo,west,press}.  $1/f$ noise lies between
102: white noise ($\alpha=0$) and Brownian noise ($\alpha=2$)
103: \cite{gardner}, and is indicative of a wide distribution of length
104: scales (or time, in the case of stochastic processes) \cite{peng}.
105: 
106: The observation of $1/f^\alpha$ spectra  in many, but not all,
107: short DNA sequences (of the order of a few thousand bases)
108: poses the question of whether $1/f^\alpha$ spectra are a
109: universal characteristic across all DNA sequences. Several
110: lines of evidence show that the $1/f^\alpha$ spectrum 
111: is indeed a generic phenomenon of GC\% fluctuations in DNA sequences
112: and is found in genomic DNA sequences from different taxonomic
113: classes, including genomes from bacteria \cite{maria,lu}, yeast
114: \cite{li-gr}, insect \cite{fukushima-t}, worm (W Li, unpublished data), 
115: and human \cite{fukushima,li-holste}. The human and mouse genomes
116: are evolutionarily separated by about 65--75 million years, and they
117: exhibit a high level of homology \cite{mouse}.  Yet several
118: species-specific differences exist that might lead to different
119: functional properties of $S(f)$:
120: \begin{itemize}
121: \item
122: While the overall, genome-wide GC\% of both human and mouse
123: genomic DNA sequences is about 42\%, the distribution of GC\% is
124: different: GC\% when measured in 20 kb ($20\times 10^{3}$ bases)
125: windows in mouse genomic DNA lacks extremely high and low GC\% values
126: \cite{mouse}.
127: \item
128: There exist pronounced differences between large sequence
129: segments (of the order of several Mb ($10^{6}$ bases)) of human and
130: mouse chromosomes due to chromosomal rearrangements. At such length
131: scales, GC\% correlations existing in the human genome may be absent
132: in the mouse genome.
133: \end{itemize}
134: 
135: This Letter examines the presence of $1/f^\alpha$ spectra in
136: spatial GC\% variations across all {\sl Mus musculus} chromosomes.
137: A graphic display of the mouse genome GC\% fluctuation can
138: be found at \cite{paces}.
139: 
140: \section{Material and Methods}
141: \subsection{DNA Sequence Data}
142: We download mouse genomic DNA sequences for nineteen autosomal
143: chromosomes (Chr1--Chr19) and two sex chromosomes (ChrX and ChrY) from
144: the UCSC Genome Bioinformatics Site {\sf http://genome.ucsc.edu/}
145: (the October 2003 release, or UCSC version {\sl mm04}).
146: Each of the twenty-one chromosomes is evenly
147: partitioned into $2^{17}=131,072$ non-overlapping windows of $\omega$
148: bases, and the GC\% of each window is computed.  A fraction of the bases
149: is not yet characterized; at these positions A, C, G, or T is
150: replaced by the symbol ``N'' (sequence gaps). Windows that
151: contain only uncharacterized bases have an undetermined GC\% value.
152: In this study, we replace all undetermined GC\% values by randomly
153: chosen values from a normal distribution with GC\% mean and variance
154: taken from the empirical distribution of all determined GC\% windows.
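
As a rough illustration of the windowing step (a sketch under stated assumptions, not a specification of the exact procedure), a chromosome of length $\ell$ bases cut into $N=2^{17}$ equal windows gives a window size and a per-window GC content of
\[
\omega = \frac{\ell}{N}, \qquad
({\rm GC\%})_k = 100 \times
\frac{n_k({\rm G}) + n_k({\rm C})}
     {n_k({\rm A}) + n_k({\rm C}) + n_k({\rm G}) + n_k({\rm T})} ,
\]
where $n_k(\cdot)$ denotes the number of occurrences of a base in window $k$. Leaving ``N'' symbols out of the denominator is an assumption made here for illustration; it is consistent with GC\% being undetermined only when a window contains no characterized base. For example, the value $\omega = 1.52$~kb listed for Chr1 in Table~1 corresponds to $\ell = N\omega \approx 131{,}072 \times 1.52~\mbox{kb} \approx 199$~Mb.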
155: 
156: \begin{figure}[htbp]
157: %% \centering{\resizebox{4cm}{!}{\includegraphics{fnlf1.eps}}}
158:   \centerline{\psfig{figure=gc-distri.eps,width=95mm,angle=0}}
159:   \caption{\label{fig1}
160: Genome-wide GC content (GC\%) of mouse {\em Mus musculus} 
161: for overall, non-repetitive, and repetitive sequences.
162:   }
163: \end{figure}
164: 
165: 
166: Higher eukaryotic genomes are enriched in repetitive sequences
167: \cite{smit}. Repeats are approximate copies of DNA sequence segments,
168: and interspersed repeats are an abundant class of repetitive sequence
169: segments in mammalian genomes that are scattered throughout the
170: genome.  In the human and the mouse genome, transposon-derived
171: interspersed repeats constitute about 35--45\%~\cite{lander,venter} and
172: 38\% \cite{mouse} of the total genome, respectively.  As can be 
173: seen from Fig.~1, the distribution of GC\% for repetitive 
174: sequences is markedly different from that of the non-repetitive sequences.
175: In order to study the effects of interspersed repeats,
176: we use interspersed repeat-annotated versions of mouse
177: chromosomes \cite{repeatmasker},
178: and separately analyze GC\% fluctuations obtained from
179: DNA sequences with retained and substituted interspersed repeats.  In the
180: latter DNA sequences, we replace GC\% from interspersed repeats by
181: random GC\% values drawn from a normal distribution with mean and
182: variance taken from the empirical distribution of the non-repetitive
183: proportion of each individual chromosome.
184: 
185: \subsection{Spectral Analysis of DNA Sequences}
186: 
187: \indent
188: 
189: We coarse-grain sequences into a spatial-series of GC contents
190: (GC\%) and conduct spectral analysis of the spatial GC\%-series.  To
191: this end, we choose a window of size $\omega$ bases, compute GC\%, move
192: the window along the DNA sequence by $\Delta\omega$ bases, and iterate
193: the computation to obtain a spatial-series of GC\% values.
194: Non-overlapping windows are obtained by setting $\Delta\omega=\omega$.
195: 
196: 
197: The power spectrum, the squared magnitude of the Fourier
198: transform normalized by the number of windows, is defined as
199: \begin{equation}
200: S(f) \equiv
201: \frac{1}{N} \left| \sum_{k=1}^{N} ({\rm GC\%})_k \cdot e^{ -i 2 \pi k 
202: f/N} \right|^2
203: \label{DEF}
204: \end{equation}
205: where $N$ is the total number of windows. Table~1 lists the window
206: sizes and averaged GC\% calculated at these window sizes for all
207: chromosomes.
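
To relate Eq.~(\ref{DEF}) to physical units, one may read $f$ in the exponent as an integer harmonic index, $f = 1, 2, \dots, N/2$ (this reading of the notation is an assumption stated here for illustration). The corresponding spatial frequency and length scale are then
\[
f_{\rm phys} = \frac{f}{N\omega}~~\mbox{cycle/base}, \qquad
L = \frac{1}{f_{\rm phys}} = \frac{N\omega}{f} .
\]
For instance, with $N = 2^{17}$ and $\omega = 1.52$~kb, the lowest harmonic ($f=1$) gives $L = N\omega \approx 199$~Mb, i.e. $f_{\rm phys} \approx 5\times 10^{-9}$ cycle/base, while the highest harmonic ($f = N/2 = 65{,}536$) gives $L = 2\omega \approx 3.0$~kb, i.e. $f_{\rm phys} \approx 3.3\times 10^{-4}$ cycle/base. Under the same reading, a length scale of $L = 10$~Mb on this chromosome corresponds to a harmonic index of only $f \approx 20$.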
208: 
209: 
210: \begin{table}[htbp]
211: \caption{\label{tab:table1}
212:   Window sizes ($\omega$) and average GC contents ($\overline{\rm 
213: GC\%}$).  Each mouse chromosome (Chr) is partitioned into $2^{17}$ 
214: non-overlapping windows.
215:   \vspace*{5mm}
216:   }
217: \centering\footnotesize
218: \begin{tabular}{rcc|rcc}
219: Chr & $\overline{\rm GC\%}$ & $\omega$~(kb) & Chr & $\overline{\rm 
220: GC\%}$ & $\omega$~(kb) \\ \hline
221:   1 & 41                & 1.52         &  11 & 44                & 0.93 \\
222:   2 & 42                & 1.39         &  12 & 42                & 0.88 \\
223:   3 & 40                & 1.25         &  13 & 42                & 0.90 \\
224:   4 & 42                & 1.18         &  14 & 41                & 0.90 \\
225:   5 & 43                & 1.15         &  15 & 42                & 0.81 \\
226:   6 & 41                & 1.15         &  16 & 41                & 0.76 \\
227:   7 & 43                & 1.05         &  17 & 43                & 0.73 \\
228:   8 & 42                & 1.00         &  18 & 41                & 0.69 \\
229:   9 & 43                & 0.96         &  19 & 43                & 0.47 \\
230:  10 & 41                & 1.02         &  X/Y & 39/39            & 1.21/0.17 \\
231: \end{tabular}
232: \end{table}
233: 
234: \section{Results}
235: 
236: \indent
237: 
238: We compute spectra with the Fast Fourier Transform (FFT), as implemented in
239: the {\small\protect\sf S-PLUS} statistical package (Version 3.4,
240: MathSoft, Inc.).  The {\small\protect\sf S-PLUS} subroutine {\small\sf
241: Spectrum} takes a discrete FFT as input and returns a
242: periodogram (the power spectrum in decibel units, $10\cdot\log_{10} S(f)$), and
243: subsequently applies Daniell-filtering (i.e. rectangular window)
244: \cite{daniell,priestley} to compute a smoothed spectrum using a 
245: user-specified parameter value ({\sf span}).
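
In decibel units a power law appears as a straight line against $\log_{10} f$. Assuming $S(f) = c/f^{\alpha}$ with a constant $c$,
\[
10\log_{10} S(f) = 10\log_{10} c \; - \; 10\,\alpha\,\log_{10} f ,
\]
so that $\alpha = 1$ corresponds to a drop of 10~dB for every tenfold increase in $f$. For the smoothing step, a plain rectangular (Daniell) filter of odd span $2m+1$ can be sketched as
\[
\hat{S}(f_j) = \frac{1}{2m+1} \sum_{i=-m}^{m} S(f_{j+i}) ,
\]
where $f_j$ denotes the $j$-th spectral component. This form is an assumption for illustration only: the treatment of end points, any modified end weights, and whether the average is taken before or after conversion to decibels may differ in the actual implementation.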
246: 
247: Figure~2 shows the power spectrum $S(f)$ as a function of the
248: frequency $f$ across nineteen autosomal and two sex chromosomes.  We
249: find for sequences with retained interspersed repeats that $S(f)$
250: exhibits the functional form $S(f)\sim 1/f^{\alpha}$ persistently
251: across twenty-one chromosomes. The exponent is $\alpha \approx 1$
252: in the frequency range $10^{-7} < f < 10^{-5}$ cycle/base, corresponding to
253: length scales $L=1/f$ of 100~kb $< L <$ 10~Mb.  At
254: higher frequencies $f > 10^{-5}$ cycle/base ($L < 100$~kb), $S(f)$
255: generally becomes flattened with $\alpha < 1$ across all chromosomes.
256: This deviation from the $1/f$ spectrum was also observed in
257: the human genome \cite{fukushima,fukushima-t,li-holste}.
258: At lower frequencies $f < 10^{-7}$ cycle/base  ($L > 10$~Mb), there are far
259: fewer spectral components, and hence $S(f)$ shows relatively larger
260: fluctuations,  and the estimation of $S(f)\sim 1/f^{\alpha}$
261: is less reliable. In the frequency range of $10^{-8} < f < 10^{-7}$
262: cycle/base, only $S(f)$ for Chr2, Chr4, Chr7, Chr11 and Chr16 is 
263: indicative of a persistence of $\alpha\approx 1$.
264: 
265: When we compute $S(f)$ for mouse chromosomes $1, 2, \dots, 19$, X, and Y
266: with substituted interspersed repeats, we find that $S(f)$ is higher than
267: $S(f)$ obtained for the original sequences, especially at
268: frequencies $f > 10^{-5}$ cycle/base ($L < 100$~kb).  One possible
269: explanation is that the substitution of GC\% estimated from
270: repetitive sequences by random GC\% values increases the level
271: of white noise fluctuations at length scales comparable to lengths
272: of interspersed repeats.
273: 
274: It is interesting to note that the substitution of about 38\% interspersed
275: repeats hardly affects $S(f)$ at intermediate and lower frequencies. A
276: similar observation has been made for human genomic DNA sequences
277: \cite{holste03,li-holste}. Thus, interspersed repeats may not contribute
278: predominantly to long-range correlations in mammalian genomic DNA.
279: 
280: 
281: % \begin{figure}[htbp]
282: \begin{figure}[bh]
283: %% \centering{\resizebox{4cm}{!}{\includegraphics{fnlf1.eps}}}
284:   \centerline{\psfig{figure=mouse-spec-one-by-one.eps,width=95mm,angle=-90}}
285:   \caption{\label{fig2}
286:   Double logarithmic representation of the power spectrum $10\log_{10}
287:   S(f)$ of GC\% fluctuations across nineteen autosomal (Chr1--Chr19)
288:   and two sex (ChrX and ChrY) mouse chromosomes.  The functional form of $S(f)$
289:   can be approximated by $S(f)\sim 1/f^{\alpha}$ over several orders of
290:   magnitude in frequency ($\alpha=1$ is plotted for comparison).  Two
291:   curves represent $S(f)$ obtained for DNA sequences with retained and
292:   substituted interspersed repeats, respectively.  For clear 
293: representation,
294:   $S(f)$ is smoothed at different frequency ranges, using
295:   Daniell/rectangular-filter with sizes ({\sf span} parameter) 1, 3, 31,
296:   and 501 for the 1--10, 10--100, 100--500, and 500--65,536 spectral
297:   components. Vertical lines mark the length scales $L=1/f$
298:   (base/cycle) for $L=10$~Mb, $L=1$~Mb, and $L=100$~kb.
299:   }
300: \end{figure}
301: 
302: \section{Discussion}
303: \subsection{Spatial $1/f$ spectra are not a generally held property}
304: 
305: \indent
306: 
307: Before discussing the universal spectral shape of GC\% of
308: genomic DNA sequences, note that not
309: all spatial sequences or signals exhibit $1/f^\alpha$ ($\alpha 
310: \approx 1$) power spectra.
311: 
312: An instructive example is provided  by the spatial spectrum of images 
313: taken of natural scenes, where it is known that the {\sl amplitude
314: spectrum} of image pixels is typically  $1/f$, and consequently 
315: its power spectrum is $S(f)\sim 1/f^2$ \cite{burton,field,tolhurst92}.
316: Sometimes, the exponent $\alpha$ in $S(f) \sim 1/f^\alpha$ 
317: may not be exactly equal to 2: for example, it was shown that 
318: underwater images tend to exhibit a larger
319: exponent $\alpha$ (or steeper slope) than that of atmospheric
320: images \cite{balboa}. There are mainly two theories of the
321: $1/f^2$ scaling in such images: (i) it is caused by
322: luminance edges, and (ii) it is caused by a power-law distribution
323: of sizes of regions with constant intensity \cite{balboa2}.
324: Experiments have been carried out to test whether a change of the 
325: slope $\alpha$ in images can
326: be detected visually by human subjects \cite{tolhurst97,tolhurst00}.
327: 
328: This well-established $1/f^2$ spatial power spectrum in
329: images provides an example showing that spatial
330: power spectra are not necessarily of the form $S(f) \sim 1/f$.
331: Rather, $S(f) \sim 1/f$ and $1/f^2$ spectra are considered to belong
332: to two different classes \cite{gardner}, and so the exponent
333: $\alpha \approx 1$ observed in DNA sequences is not 
334: {\sl a priori} expected.
335: 
336: \subsection{Spatial $1/f$ spectra are consistent with the
337: evolutionary expansion-modification model}
338: 
339: \indent
340: 
341: One hypothesis is that $1/f^\alpha$ ($\alpha \approx 1$) constitutes
342: a universal property of all long DNA sequences subject to
343: neutral evolution that involved duplications and
344: mutations \cite{li92-1,li91}. A simplified model, termed
345: ``expansion-modification (EM) model'' \cite{li89,li91}, generates
346: a binary 2-symbol sequence by two local operations: (i)
347: expansion/duplication: $0 \rightarrow 00$, $ 1 \rightarrow 11$;
348: and (ii) modification/mutation: $0 \rightarrow 1$,
349: $1 \rightarrow 0$. When the probability of the first operation
350: is large (e.g. probability $p_1=0.9$), resulting binary sequences
351: exhibit $1/f^\alpha$ ($\alpha \approx 1$) power spectra.
352: If the probability of the second operation ($p_2$) is large, the
353: resulting sequences exhibit white spectra ($\alpha\approx 0$) \cite{li91}.
354: Since the sequence generating process is hierarchical, it
355: is implicit that the resulting sequence
356: exhibits scale-invariance (or perhaps multiple-scale-invariance
357: \cite{mansilla}).
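
A minimal worked illustration of why expansion dominates at large $p_1$, assuming (as a simplification made here, not taken from the model's original definition) that in each generation every symbol is updated independently, being duplicated with probability $p_1$ and flipped with probability $p_2 = 1 - p_1$: the expected sequence length $\langle n_t \rangle$ after $t$ generations then obeys
\[
\langle n_{t+1} \rangle = (2p_1 + p_2)\,\langle n_t \rangle = (1+p_1)\,\langle n_t \rangle ,
\qquad
\langle n_t \rangle = (1+p_1)^{t}\, n_0 .
\]
With $p_1 = 0.9$ and a single starting symbol ($n_0 = 1$), twenty generations give $\langle n_{20} \rangle = 1.9^{20} \approx 3.8 \times 10^{5}$ symbols. Under this reading, a symbol flipped in an early generation seeds a domain that is repeatedly doubled afterwards, whereas a flip in a late generation affects only a short stretch, so domains of many different sizes coexist in the final sequence.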
358: 
359: The EM model contains two features that
360: are essential for DNA evolution: duplication and mutation.
361: Duplications, both inter-chromosomal
362: and intra-chromosomal, expand DNA sequences and provide a
363: potential for genes to develop novel functions
364: \cite{ohno}. From one point of view, duplications have a larger impact
365: than natural selection in Darwin's evolution theory, as
366: duplications and the resulting redundancy actually created
367: the foundation upon which natural selection acts \cite{meyer}.
368: Although point mutations might be detrimental to 
369: biological fitness, they nevertheless provide a potential for evolution,
370: perhaps on a smaller scale as compared to that for duplications.
371: 
372: A more realistic modeling of neutral evolution of DNA sequences
373: by duplication and mutation beyond the EM model is still lacking
374: \cite{eichler}, and
375: so the hypothesis that all long DNA sequences undergoing
376: duplications and mutations
377: exhibit $1/f^\alpha$ ($\alpha \approx 1$) power spectra
378: remains to be validated.  The result presented in this paper,
379: that {\sl Mus musculus} genomic DNA exhibits $\alpha \approx 1$,
380: adds another line of evidence toward verification of this hypothesis.
381: 
382: \subsection{High level of chromosomal segment translocations did
383: not destroy $1/f$ spectra in mouse genomic DNA}
384: 
385: \indent
386: 
387: About 90\% of the human and 93\% of the mouse
388: genome reside within syntenic blocks, in which the
389: order of a series of biological markers (e.g. genes) is
390: approximately conserved \cite{mouse}. However, with the
391: exception of ChrX, these syntenic blocks have different loci
392: on human and mouse chromosomes.
393: About 65--75 million years ago, the human genome and the
394: mouse genome embarked on a different evolutionary history,
395: with many chromosomal translocations, that left
396: syntenic blocks of a human chromosome
397: scattered on different mouse chromosomes.
398: 
399: This basic picture indicates that the observation of $1/f$
400: spectra in the human genome \cite{li-holste} does not guarantee
401: similar spectra in the mouse genome, as random translocations
402: can easily destroy any long-range order in a given DNA sequence.
403: An alternative explanation of $1/f$ spectra in the mouse
404: genome, in light of observed $1/f$ spectra in the human genome,
405: is that both are in fact a consequence of the large-scale
406: dynamics, such as translocation and duplication. Theoretically,
407: DNA sequences frozen at about 65--75 million years ago before
408: the divergence of human and mouse species could test this
409: hypothesis.
410: 
411: Besides translocation/duplication, mutations may also affect
412: GC\% fluctuations. If chromosomal segment translocations
413: were the only process in evolution, the human and mouse genomes
414: ought to have the same GC content distribution.
415: Nevertheless, the GC\% distribution in the mouse genome is tighter
416: (with smaller variance) and lacks the segments with higher GC\%
417: (``outliers'') found in the human genome, a fact observed both
418: experimentally \cite{thiery} and by sequence analysis \cite{mouse}.
419: 
420: One might think that this difference of variances of
421: the GC\% distributions between the human and the mouse 
422: genome may render their spatial spectra different. However, 
423: as pointed out in \cite{clay,li04},
424: the exponent $\alpha$ in $1/f^\alpha$ is related to how
425: the variance of GC\% changes with the window size,
426: rather than to the variance itself (and the isochore
427: or fairly homogeneous sequence segment of about 200--300 kb
428: long or more \cite{bernardi89,bernardi01} is related to
429: the asymmetry or the third moment of the GC\% distribution).
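
This relation can be made explicit under standard assumptions that are not derived in this paper: for a stationary series whose spectrum behaves as $S(f)\sim 1/f^{\alpha}$ at low frequencies with $0<\alpha<1$, the variance of GC\% measured in windows of size $w$ decays as
\[
\sigma^2(w) \sim w^{\alpha - 1} ,
\]
compared with $\sigma^2(w) \sim w^{-1}$ for an uncorrelated sequence ($\alpha = 0$). As $\alpha$ approaches 1 the decay becomes very slow. Multiplying all fluctuations by a constant factor changes the prefactor of $\sigma^2(w)$ but not the exponent, which illustrates why two genomes with different GC\% variances can still share the same $\alpha$.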
430: 
431: \subsection{GC content correlates with many biological features
432: and $1/f$ spectra of GC\% imply a scale-invariant distribution
433: of these features}
434: 
435: \indent
436: 
437: Although variations in GC\% are not of direct biological relevance,
438: they are correlated with many other measurements and
439: biological functions of DNA sequences, including chromosome
440: bands, protein-coding gene density, replication timing,
441: and recombination hot spots \cite{bernardi89,bernardi04}.
442: If the correlation between GC\% and a biological function
443: is strong, then the scale-invariance pattern in GC\% can be
444: transformed into a similar spatial structure of these 
445: biologically functioning units.
446: 
447: While a stretched out chromosome is too thin to be visible
448: (the diameter of a DNA thread is about $2\times 10^{-9}$~m 
449: \cite{calladine}), during the metaphase of a cell cycle 
450: it becomes visible because it is tightly packed into a chromatin 
451: structure (with the length of a typical human chromosome in this
452: compact form of about $10^{-5}$~m \cite{calladine}).  Using 
453: Giemsa dyes to stain chromosomes leads to alternating dark and 
454: light bands \cite{ried}. The staining difference 
455: at different chromosomal regions is thought to be caused by the degree 
456: of condensation of the chromatin structure \cite{saitoh}.
457: The connection between GC\% and Giemsa bands was suggested long
458: ago: Giemsa-light bands (termed R-bands) are GC-rich, whereas
459: Giemsa-dark bands (G-bands) are GC-poor \cite{coming,ikemura88}.
460: This connection has been further calibrated, such that
461: being {\sl relatively} GC-rich or GC-poor as compared to the
462: flanking regions is correlated with Giemsa-bands \cite{gojobori}.
463: This new proposal manages to reproduce chromosome bands quite 
464: well by sequence analysis \cite{gojobori}.
465: 
466: After the sequencing of the human genome was completed,
467: it was revealed that many long GC-poor regions contain a
468: low gene density (``gene deserts'') \cite{lander,bernardi01}, confirming
469: the earlier proposed correlation between GC\% and gene
470: density \cite{b85,b91,b96}. In fact, the gene density
471: not only increases with GC\%, but increases faster than
472: linearly \cite{bernardi01}. Extremely GC-rich regions
473: in the human genome are also extremely gene-rich, and beyond the
474: GC content of GC\% $\approx$ 46\% the gene-density
475: increases markedly.
476:  
477: Due to the large genome size for most eukaryotic species,
478: the replication of DNA sequences starts at multiple positions.
479: Specific chromosome regions replicate earlier in time, while other
480: regions replicate later, marked by a clear boundary between the
481: two types of regions.
482: While a correlation between replication timing and chromosome
483: bands was proposed in \cite{coming,bernardi89}
484: (R-bands replicate earlier), the extent and biological relevance 
485: are still subject to investigation \cite{ikemura02}.
486: 
487: Furthermore, regions with high recombination rate (recombination 
488: hot spots) have been associated with being GC-rich \cite{clark}.
489: While this correlation has been established for the yeast {\sl Saccharomyces 
490: cerevisiae}
491: genome sequence \cite{gerton}, no conclusive results have yet been
492: obtained for higher genomes such as human genomic DNA \cite{mcvean}.
493: The general difficulty in the determination of regional recombination rates
494: in the human genome is that it is either indirect (using
495: pedigree analysis \cite{yu,kong}) or limited to only
496: male samples (using sperm typing \cite{jeffreys}). Even a newly
497: proposed population genealogy-based inference of recombination
498: rates \cite{mcvean} is still not a direct measurement. In addition,
499: it has also been proposed that recombination events increase
500: the local GC content \cite{birdsell,galtier,duret04}. Although this switches
501: the roles of cause and effect, from a statistical (correlation) point 
502: of view the outcome remains the same. 
503: 
504: Instead of using the GC\% as a ``surrogate'', there also
505: have been attempts to
506: study large-scale patterns of biological units directly.
507: It has been shown that in circular bacterial genome sequences, 
508: the positions and orientations
509: of genes do not have any preferable length scale and are
510: scale-invariant \cite{audit}. This result should be directly linked
511: to a similar scale-invariance of GC\% in bacterial genomes.
512: The universally observed $1/f^\alpha$ ($\alpha \approx 1$) spectra
513: in the mouse {\sl Mus musculus} chromosomes, as well as in
514: human chromosomes \cite{li-holste}, motivate further sequence 
515: analysis on the spatial arrangement of functional biological units.
516: 
517: 
518: \section*{Acknowledgements}
519: W. Li acknowledges support from NIH-N01-AR12256.
520: 
532: 
533: \begin{thebibliography}{99}
534: 
535: \bibitem{pauling}
536: L. Pauling and R.B. Corey,
537: {\em A proposed structure for the nucleic acids},
538: {\em Proceedings of the National Academy of Sciences} {\bf 39} (1953) 84--97.
539: 
540: \bibitem{watson}
541: J.D. Watson and F.H.C. Crick,
542: {\em A structure for deoxyribose nucleic acid},
543: {\em Nature} {\bf 171}  (1953) 737--738.
544: 
545: \bibitem{calladine}
546: C.R. Calladine and H.R. Drew,
547: {\em Understanding DNA}
548: (Academic Press, London, 1992).
549: 
550: \bibitem{fickett92}
551: J.W. Fickett, D.C. Torney and D.R. Wolf,
552: {\em Base compositional structure of genomes},
553: {\em Genomics} {\bf 13} (1992) 1056--1064.
554: 
555: \bibitem{forsdyke00}
556: D.R. Forsdyke and J.R.  Mortimer,
557: {\em Chargaff's legacy},
558: {\em Gene} {\bf 261} (2000) 127--137.
559: 
560: \bibitem{churchill}
561: G.A. Churchill,
562: {\em Stochastic models for heterogeneous DNA sequences},
563: {\em Bulletin of Mathematical Biology} {\bf 51} (1989) 79--94.
564: 
565: \bibitem{elton}
566: R.A. Elton,
567: {\em Theoretical models for heterogeneity for base composition in DNA},
568: {\em Journal of Theoretical Biology} {\bf  45} (1974) 533--553.
569: 
570: \bibitem{ikemura88}
571: T. Ikemura and S. Aota,
572: {\em Global variation in G+C content along vertebrate genome DNA:
573: possible correlation with chromosome band structure},
574: {\em Journal of Molecular Biology} {\bf 203} (1988) 1--13.
575: 
576: \bibitem{ikemura90}
577: T. Ikemura, K.N. Wada and S. Aota,
578: {\em Giant G+C\% mosaic structures of the human genome found by 
579: arrangement of GenBank
580: human DNA sequences according to genetic positions},
581: {\em Genomics} {\bf 8} (1990) 207--216.
582: 
583: \bibitem{anastassiou}
584: D. Anastassiou,
585: {\em Genomic signal processing},
586: {\em IEEE Signal Processing Magazine} {\bf 18} (2001) 8--20.
587: 
588: \bibitem{cristea}
589: P.D. Cristea,
590: {\em Large scale features in DNA genomic signals},
591: {\em Signal Processing} {\bf 83} (2003) 871--888.
592: 
593: \bibitem{vaidy}
594: P.P. Vaidyanathan and B.J. Yoon,
595: {\em The role of signal-processing concepts in genomics and proteomics},
596: {\em Journal of the Franklin Institute} {\bf 341} (2004) 111--135.
597: 
598: \bibitem{anastassiou04}
599: D. Sussillo, A. Kundaje and D. Anastassiou,
600: {\em Spectrogram analysis of genomics},
601: {\em EURASIP Journal on Applied Signal Processing} {\bf 2004} (2004) 29--42.
602: 
603: \bibitem{li92-1}
604: W. Li,
605: {\em Generating nontrivial long-range correlations and 1/f spectra by 
606: replication and mutation},
607: {\em International Journal of Bifurcation and Chaos} {\bf 2} (1992) 
608: 137--154.
609: 
610: \bibitem{likaneko}
611: W. Li and K. Kaneko,
612: {\em Long-range correlation and partial $1/f^{\alpha}$ spectrum in a 
613: non-coding DNA sequence},
614: {\em Europhysics Letters} {\bf 17} (1992) 655--660.
615: 
616: \bibitem{voss}
617: R. Voss,
618: {\em Evolution of long-range fractal correlations and 1/f noise in DNA 
619: base sequences},
620: {\em Physical Review Letters} {\bf 68} (1992) 3805--3808.
621: 
622: \bibitem{keshner}
623: M.S. Keshner,
624: {\em 1/f noise},
625: {\em Proceedings of the IEEE} {\bf 70} (1982) 212--218.
626: 
627: \bibitem{1f}
628: W. Li, 
629: {\em An online bibliography on 1/f noise},
630: {\sf http://www.nslij-genetics.org/wli/1fnoise/}.
631: 
632: \bibitem{milo}
633: E. Milotti,
634: {\em 1/f noise: a pedagogical review},
635: arXiv preprint, physics/0204033 (2002)
636: {\sf http://arxiv.org/abs/physics/0204033}.
637: 
638: \bibitem{west}
639: B.J. West and M.F. Shlesinger,
640: {\em The noise in natural phenomena},
641: {\em American Scientist} {\bf 78} (1990) 40--45.
642: 
643: \bibitem{press}
644: W. Press,
645: {\em Flicker noise in astronomy and elsewhere},
646: {\em Comments on Astrophysics} {\bf 7} (1978) 103--119.
647: 
648: \bibitem{gardner}
649: M. Gardner,
650: {\em Mathematical games -- white and brown music, fractal curves and 1/f 
651: fluctuations},
652: {\em Scientific American} {\bf 238} (1978) 16--32.
653: 
654: \bibitem{peng}
655: J.M. Hausdorff and C.K. Peng,
656: {\em Multiscaled randomness: a possible source of 1/f noise in biology},
657: {\em Physical Review E} {\bf 54} (1996) 2154--2155.
658: 
659: \bibitem{maria}
660: M. de Sousa Vieira,
661: {\em Statistics of DNA sequences: a low frequency analysis},
662: {\em Physical Review E} {\bf 60} (1999) 5932--5937.
663: 
664: \bibitem{lu}
665: X. Lu, Z.R. Sun, H.M. Chen and Y.D. Li,
666: {\em Characterizing self-similarity in bacteria DNA sequences},
667: {\em Physical Review E} {\bf 58} (1998) 3578--3584.
668: 
669: \bibitem{li-gr}
670: W. Li, G. Stolovitzky, P. Bernaola-Galvan and J.L. Oliver,
671: {\em Compositional heterogeneity within, and uniformity between, DNA
672: sequences of yeast chromosomes},
673: {\em Genome Research} {\bf 8} (1998) 916--928.
674: 
675: \bibitem{fukushima-t}
676: A. Fukushima,
677: {\em Periodicity in Genome Architecture from Bacteria to Human}
678: (Ph.D. Thesis, Nara Institute of Science and Technology, 2003).
679: 
680: \bibitem{fukushima}
681: A. Fukushima, T. Ikemura, M. Kinouchi, T. Oshima, Y. Kudo, H. Mori
682: and S. Kanaya,
683: {\em Periodicity in prokaryotic and eukaryotic genomes identified by
684: power spectrum analysis},
685: {\em Gene} {\bf 300} (2002) 203--211.
686: 
687: \bibitem{li-holste}
688: W. Li and D. Holste,
689: {\em Universal 1/f noise, cross-overs of scaling exponents,
690: and chromosome specific patterns of GC content in
691: DNA sequences of the human genome}, 
692: {\em Physical Review E}, submitted.
693: 
694: \bibitem{mouse}
695: R.H. Waterston {\em et al.},
696: {\em Initial sequencing and comparative analysis of the mouse genome},
697: {\em Nature} {\bf 420} (2002) 520--562.
698: 
699: \bibitem{paces}
700: J. Pa\u{c}es, R. Z\'{i}ka, V. Pa\u{c}es, A. Pavl\'{i}\u{c}ek, O. Clay
701: and G. Bernardi,
702: {\em Representing GC variation along eukaryotic chromosomes},
703: {\em Gene} {\bf 333} (2004) 135--141.
704: 
705: \bibitem{smit}
706: A.F. Smit,
707: {\em Interspersed repeats and other mementos of transposable elements in 
708: mammalian genomes}, 
709: {\em Current Opinion in Genetics \& Development} {\bf 9} (1999) 657--663.
710: 
711: \bibitem{lander}
712: E.S. Lander {\em et al.},
713: {\em Initial sequencing and analysis of the human genome},
714: {\em Nature} {\bf 409} (2001) 860--921.
715: 
716: \bibitem{venter}
717: J.C. Venter {\em et al.},
718: {\em The sequence of the human genome},
719: {\em Science} {\bf 291} (2001) 1304--1351.
720: 
721: \bibitem{repeatmasker}
722: A.F. Smit and P. Green,
723: RepeatMasker,
724: {\sf http://repeatmasker.genome.washington.edu/}.
725: 
726: 
727: \bibitem{daniell}
728: P.J. Daniell,
729: {\em Discussion on the Paper by Bartlett, Foster, Cunningham and Hynd},
730: {\em Supplement to the Journal of the Royal Statistical Society} {\bf  
731: 8} (1946) 88--90.
732: 
733: \bibitem{priestley}
734: M.B. Priestley,
735: {\em Spectral Analysis and Time Series}
736: (Academic Press, London, 1981).
737: 
738: 
739: 
740: \bibitem{holste03}
741: D. Holste, S. Beirer, P. Schieg, I. Grosse and H. Herzel,
742: {\em Repeats and correlations in human DNA sequences},
743: {\em Physical Review E} {\bf 67} (2003) 061913.
744: 
745: \bibitem{burton}
746: G.J. Burton and I.R. Moorhead,
747: {\em Color and spatial structure in natural scenes},
748: {\em Applied Optics} {\bf 26} (1987) 157--170.
749: 
750: \bibitem{field}
751: D.J. Field,
752: {\em Relations between the statistics of natural images
753: and the response properties of cortical cells},
754: {\em Journal of the Optical Society of America A} {\bf 4} (1987) 2379--2394.
755: 
756: \bibitem{tolhurst92}
757: D.J. Tolhurst, Y. Tadmor and T. Chao,
758: {\em Amplitude spectra of natural images},
759: {\em Ophthalmic \& Physiological Optics} {\bf 12} (1992) 229--232.
760: 
761: \bibitem{balboa}
762: R.M. Balboa and N.M. Grzywacz,
763: {\em Power spectra and distribution of contrasts of natural images from 
764: different habitats},
765: {\em Vision Research} {\bf 43} (2003) 2527--2537.
766: 
767: \bibitem{balboa2}
768: R.M. Balboa, C.W. Tyler and N.M. Grzywacz,
769: {\em Occlusions contribute to scaling in natural images},
770: {\em Vision Research} {\bf 41} (2001) 955--964.
771: 
772: \bibitem{tolhurst97}
773: D.J. Tolhurst and Y. Tadmor,
774: {\em Discrimination of changes in the slopes of the
775: amplitude spectra of natural images: band-limited
776: contrast and psychometric functions},
777: {\em Perception} {\bf 26}  (1997) 1011--1025.
778: 
779: \bibitem{tolhurst00}
780: C.A. Parraga and D.J. Tolhurst,
781: {\em The effect of contrast randomisation on the discrimination of
782: changes in the slopes of the amplitude spectra of natural scenes},
783: {\em Perception} {\bf 29} (2000) 1101--1116.
784: 
785: \bibitem{li91}
786: W. Li,
787: {\em Expansion-modification systems: a model for spatial 1/f spectra},
788: {\em Physical Review A} {\bf 43} (1991) 5240--5260.
789: 
790: \bibitem{li89}
791: W. Li,
792: {\em Spatial 1/f spectra in open dynamical systems},
793: {\em Europhysics Letters} {\bf 10} (1989) 395--400.
794: 
795: 
796: \bibitem{mansilla}
797: R. Mansilla and G. Cocho,
798: {\em Multiscaling in expansion-modification systems: an
799: explanation for long range correlation in DNA},
800: {\em Complex Systems} {\bf 12} (2000) 207--240.
801: 
802: \bibitem{ohno}
803: S. Ohno,
804: {\em Evolution by Gene Duplication}
805: (Springer-Verlag, Berlin, 1970).
806: 
807: \bibitem{meyer}
808: A. Meyer and Y. van de Peer,
809: {\em Natural selection merely modified while redundancy created.
810: Susumu Ohno's idea of evolutionary importance of gene and genome 
811: duplication},
812: {\em Journal of Structural and Functional Genomics} {\bf 3} (2003) vii--ix.
813: 
814: \bibitem{eichler}
815: E.E. Eichler and D. Sankoff,
816: {\em Structural dynamics of eukaryotic chromosome evolution},
817: {\em Science} {\bf 301} (2003) 793--797.
818: 
819: \bibitem{thiery}
820: J.P. Thiery, G. Macaya and G. Bernardi,
821: {\em An analysis of eukaryotic genomes by density gradient centrifugation},
822: {\em Journal of Molecular Biology} {\bf 108} (1976) 219--235.
823: 
824: \bibitem{clay}
825: O. Clay, N. Carels, C. Douady, G. Macaya and G. Bernardi,
826: {\em Compositional heterogeneity within and among isochores in mammalian 
827: genomes},
828: {\em Gene} {\bf 276} (2001) 15--24.
829: 
830: \bibitem{li04}
831: W. Li,
832: {\em Large-scale fluctuation of guanine and cytosine content
833: in genome sequences: isochores and 1/f spectra},
834: in: {\em Progress in Bioinformatics} (Nova Science, 2005).
835: 
836: \bibitem{bernardi89}
837: G. Bernardi,
838: {\em The isochore organization of the human genome},
839: {\em Annual Review of Genetics} {\bf 23} (1989) 637--661.
840: 
841: \bibitem{bernardi04}
842: G. Bernardi, {\em Structural and Evolutionary Genomics}
843: (Elsevier, 2004).
844: 
845: 
846: \bibitem{bernardi01}
847: G. Bernardi,
848: {\em Misunderstandings about isochores. Part I},
849: {\em Gene} {\bf 276} (2001) 3--13.
850: 
851: \bibitem{ried}
852: T. Ried,
853: {\em Cytogenetics -- in color and digitized},
854: {\em New England Journal of Medicine} {\bf 350} (2004) 1597--1600.
855: 
856: \bibitem{saitoh}
857: Y. Saitoh and U.K. Laemmli,
858: {\em From the chromosomal loops and the scaffold to the classic
859: bands of metaphase chromosomes},
860: {\em Cold Spring Harbor Symposia on Quantitative Biology} {\bf 58} 
861: (1993) 755--765.
862: 
863: \bibitem{coming}
864: D.E. Comings,
865: {\em Mechanisms of chromosome banding and implications for chromosome 
866: structure},
867: {\em Annual Review of Genetics} {\bf 12} (1978) 25--46.
868: 
869: \bibitem{gojobori}
870: Y. Niimura and T. Gojobori,
871: {\em {\rm In silico} chromosome staining: reconstruction
872: of Giemsa bands from the whole human genome sequence},
873: {\em Proceedings of the National Academy of Sciences} {\bf 99} (2002) 
874: 797--802.
875: 
876: \bibitem{b85}
877: G. Bernardi, B. Olofsson, J. Filipski, M. Zerial, J. Salinas,
878: G. Cuny, M. Meunier-Rotival and F. Rodier,
879: {\em The mosaic genome of warm-blooded vertebrates},
880: {\em Science} {\bf 228} (1985) 953--958.
881: 
882: \bibitem{b91}
883: D. Mouchiroud, G. D'Onofrio, B. Aissani, G. Macaya, C. Gautier
884: and G. Bernardi,
885: {\em The distribution of genes in the human genome},
886: {\em Gene} {\bf 100} (1991) 181--187.
887: 
888: \bibitem{b96}
889: S. Zoubak, O. Clay and G. Bernardi,
890: {\em The gene distribution of the human genome},
891: {\em Gene} {\bf 174} (1996) 95--102.
892: 
893: \bibitem{ikemura02}
894: Y. Watanabe, A. Fujiyama, Y. Ichiba, M. Hattori,
895: T. Yada, Y. Sakaki and T. Ikemura,
896: {\em Chromosome-wide assessment of replication timing for
897: human chromosomes 11q and 21q: disease-related genes in timing-switch 
898: regions},
899: {\em Human Molecular Genetics} {\bf 11} (2002) 13--21.
900: 
901: \bibitem{clark}
902: S.M. Fullerton, A. Bernardo Carvalho and A.G. Clark,
903: {\em Local rates of recombination are positively correlated with GC 
904: content in the human genome},
905: {\em Molecular Biology and Evolution} {\bf 18} (2001) 1139--1142.
906: 
907: \bibitem{gerton}
908: J.L. Gerton, J. DeRisi, R. Shroff, M. Lichten, P.O. Brown and T.D. Petes,
909: {\em Global mapping of meiotic recombination hotspots
910: and coldspots in the yeast {\rm Saccharomyces cerevisiae}},
911: {\em Proceedings of the National Academy of Sciences} {\bf 97} (2000) 
912: 11383--11390.
913: 
914: \bibitem{mcvean}
915: G.A.T. McVean, S.R. Myers, S. Hunt, P. Deloukas, D.R. Bentley and P. Donnelly,
916: {\em The fine-scale structure of recombination rate
917: variation in the human genome},
918: {\em Science} {\bf 304} (2004) 581--584.
919: 
920: \bibitem{yu}
921: A. Yu, C. Zhao, Y. Fan, W. Jang, A.J. Mungall, P. Deloukas, A. Olsen,
922: N.A. Doggett, N. Ghebranious, K.W. Broman and J.L. Weber,
923: {\em Comparison of human genetic and sequence-based physical maps},
924: {\em Nature} {\bf 409} (2001) 951--953.
925: 
926: \bibitem{kong}
927: A. Kong {\em et al.},
928: % D.F. Gudbjartsson, J. Sainz, G.M. Jonsdottir, S.A. Gudjonsson,
929: % B. Richardsson, S. Sigurdardottir, J. Barnard, B. Hallbeck, G. Masson,
930: % A. Shlien, S.T. Palsson, M.L. Frigge, T.E. Thorgeirsson, J.R.  Gulcher, K. Stefansson,
931: {\em A high-resolution recombination map of the human genome},
932: {\em Nature Genetics} {\bf 31} (2002) 241--247.
933: 
934: \bibitem{jeffreys}
935: A.J. Jeffreys, A. Ritchie and R. Neumann,
936: {\em High resolution analysis of haplotype diversity and meiotic crossover
937: in the human TAP2 recombination hotspot},
938: {\em Human Molecular Genetics} {\bf 9} (2000) 725--733.
939: 
940: \bibitem{birdsell}
941: J.A. Birdsell,
942: {\em Integrating genomics, bioinformatics, and classical genetics to
943: study the effects of recombination on genome evolution},
944: {\em Molecular Biology and Evolution} {\bf 19} (2002) 1181--1197.
945: 
946: \bibitem{galtier}
947: J.I. Montoya-Burgos, P. Boursot and N. Galtier,
948: {\em Recombination explains isochores in mammalian genomes},
949: {\em Trends in Genetics} {\bf 19} (2003) 128--130.
950: 
951: %% \bibitem{li-pre}
952: %% W Li,
953: %% {\em Large-scale patterns in DNA texts},
954: %% preprint (1999),
955: %% {\sl  http://www.nslij-genetics.org/wli/pub/sa\_pre.pdf}
956: 
957: \bibitem{duret04}
958: J. Meunier and L. Duret,
959: {\em Recombination drives the evolution of GC-content in the human genome},
960: {\em Molecular Biology and Evolution} {\bf 21} (2004) 984--990.
961: 
962: \bibitem{audit}
963: B. Audit and C.A. Ouzounis,
964: {\em From genes to genomes: universal, scale-invariant properties of 
965: microbial chromosome organisation},
966: {\em Journal of Molecular Biology} {\bf 332} (2003) 617--633.
967: 
968: \end{thebibliography}
969: 
970: \end{document}
971: