
1: \documentclass{article}[15pt]

2:

3: \usepackage[dvips]{epsfig}

4: \usepackage{rotating}

5:

6: \pagestyle{myheadings}  % specify the format for running headings

7: \setlength{\textwidth}{460pt}

8: \setlength{\textheight}{620pt}

9: \setlength{\oddsidemargin}{-22pt}

10: \setlength{\evensidemargin}{-22pt}

11: \setlength{\topmargin}{0pt}

12:

13: \renewcommand{\baselinestretch}{1.17} % double spacing

14:

15: \begin{document}

16:

17:

18: \title{

19: Spectral Analysis of Guanine and Cytosine Fluctuations

20: of Mouse Genomic DNA

21: \vspace{0.2in}

22: \author{

23: Wentian Li$^{a}$ and  Dirk Holste$^{b}$ \\

24: {\small \sl  a. The Robert S. Boas Center for Genomics and Human Genetics}\\

25: {\small \sl North Shore LIJ Institute for Medical Research, Manhasset, NY 11030, USA.}\\

26: {\small \sl b. Department of Biology, Massachusetts Institute of Technology,

27: Cambridge, MA 02139, USA. }

28: }

29: \date{}

30: }

31: \maketitle  % End title section

32: \markboth{\sl W. Li, D. Holste}{\sl W. Li, D. Holste}

33:

34:

35:

36: {\bf key words:

37: DNA sequences; GC fluctuations; 1/f noise; long-range

38: correlations; mouse genome.}

39:

40: \begin{abstract}

41: We study global fluctuations of the guanine and cytosine base content

42: (GC\%) in mouse genomic DNA using spectral analyses.  Power spectra

43: $S(f)$ of GC\% fluctuations in all nineteen autosomal and

44: two sex chromosomes are observed to have the universal

45: functional form $S(f) \sim 1/f^{\alpha}$ ($\alpha \approx 1$)

46: over several orders of magnitude in the frequency range

47: $10^{-7}<f< 10^{-5}$ cycle/base, corresponding to long-range

48: GC\% correlations at distances between 100 kb and 10 Mb.

49: $S(f)$ for higher frequencies ($f > 10^{-5}$~cycle/base) shows a

50: flattened power-law function with $\alpha < 1$

51: across all twenty-one chromosomes. The substitution of about 38\%

52: interspersed

53: repeats does not affect the functional form of $S(f)$, indicating

54: that these are not predominantly responsible for the long-ranged

55: multi-scale

56: GC\% fluctuations in mammalian genomes. Several biological

57: implications of the large-scale GC\% fluctuation are discussed,

58: including neutral evolutionary history by DNA duplication,

59: chromosomal bands, spatial distribution of transcription units

60: (genes), replication timing, and recombination hot spots.

61: \end{abstract}

62:

63: \large

64:

65: \section{Introduction}

66:

67: \indent

68:

69: DNA sequences, the blueprint of almost all essential genetic

70: information, are polymers consisting of two complementary strands of

71: four types of bases: adenine (A), cytosine (C), guanine (G), and

72: thymine (T). Among the four bases, the presence of A on one strand is

73: always paired with T on the opposite strand, forming a ``base pair''

74: with 2 hydrogen bonds; similarly, G and C are complementary to one

75: another, forming a base pair with 3 hydrogen bonds

76: \cite{pauling,watson,calladine}.

77: Consequently, one may characterize AT base-pairs as ``weak'' bases

78: and GC base-pairs as ``strong'' bases.  In addition, the frequency of A

79: (G) on a single strand is approximately equal to the frequency of T

80: (C) on the same strand, a phenomenon that has been termed ``strand

81: symmetry'' \cite{fickett92} or ``Chargaff's second parity''

82: \cite{forsdyke00}. Therefore, DNA sequences can be transformed into

83: reduced 2-symbol sequences of weak W (A or T) and strong S (G or C)

84: bases.  The

85: percentage of S (G or C) bases of a DNA sequence segment is denoted as

86: the GC base content (GC\%).
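
The reduction to weak and strong symbols and the GC\% of a segment can be illustrated with a minimal Python sketch. The function names {\sf to\_weak\_strong} and {\sf gc\_percent} are hypothetical, and the choice to exclude uncharacterized symbols from the denominator is an assumption rather than a statement of the procedure used in this study.

```python
def to_weak_strong(segment: str) -> str:
    """Map A/T to W (weak) and G/C to S (strong); anything else becomes N."""
    table = {"A": "W", "T": "W", "G": "S", "C": "S"}
    return "".join(table.get(base, "N") for base in segment.upper())


def gc_percent(segment: str) -> float:
    """Percentage of strong (G or C) bases among the characterized bases.

    Assumption: symbols other than A, C, G, T are ignored in the count.
    """
    seq = segment.upper()
    strong = seq.count("G") + seq.count("C")
    weak = seq.count("A") + seq.count("T")
    if strong + weak == 0:
        raise ValueError("segment contains no A, C, G, or T")
    return 100.0 * strong / (strong + weak)
```

For example, a segment with equal numbers of strong and weak bases has a GC\% of 50.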

87:

88: The spatial variation of GC\% along a DNA sequence has been of

89: long-standing interest \cite{churchill,elton,ikemura88,ikemura90}.

90: GC\%-series can be considered as fluctuating or unsteady signals, and

91: consequently many signal processing and stochastic analysis techniques

92: can be applied to characterize and quantify the statistical properties

93: of the DNA sequences \cite{anastassiou,cristea,vaidy}.  In particular,

94: spectral and correlation analyses are standard tools that can be

95: applied \cite{anastassiou04}.  Initial spectral

96: analysis \cite{li92-1,likaneko,voss} provided evidence that

97: DNA sequences, especially non-protein-coding sequences,

98: exhibit a power spectrum $S(f)$ that can be approximated by $S(f) \sim

99: 1/f^{\alpha}$ ($\alpha \approx 1$), a behavior termed ``$1/f$ noise'' (or

100: ``$1/f^\alpha$ noise'' with $ 0.5 \le \alpha \le 1.5 $)

101: \cite{keshner,1f,milo,west,press}.  $1/f$ noise lies between the

102: realms of white noise ($\alpha=0$) and Brownian noise ($\alpha=2$)

103: \cite{gardner}, and is indicative of a wide distribution of length

104: scales (or time, in the case of stochastic processes) \cite{peng}.

105:

106: The observation of $1/f^\alpha$ spectra  in many, but not all,

107: short DNA sequences (of the order of a few thousand bases)

108: poses the question of whether $1/f^\alpha$ spectra are a

109: universal characteristic across all DNA sequences. Several

110: lines of evidence show that the $1/f^\alpha$ spectrum

111: is indeed a generic phenomenon of GC\% fluctuations in DNA sequences

112: and is found in genomic DNA sequences from different taxonomic

113: classes, including genomes from bacteria \cite{maria,lu}, yeast

114: \cite{li-gr}, insect \cite{fukushima-t}, worm (W Li, unpublished data),

115: and human \cite{fukushima,li-holste}. The human and mouse genomes

116: are evolutionarily separated by about 65-75 million years, and they

117: exhibit a high level of homology \cite{mouse}.  Yet several

118: species-specific differences exist that might lead to different

119: functional properties of $S(f)$:

120: \begin{itemize}

121: \item

122: While the overall, genome-wide GC\% of both human and mouse

123: genomic DNA sequences is about 42\%, the distribution of GC\% is

124: different: GC\% when measured in 20 kb ($20\times 10^{3}$ bases)

125: windows in mouse genomic DNA lacks extremely high and low GC\% values

126: \cite{mouse}.

127: \item

128: There exist pronounced differences between large sequence

129: segments (of the order of several Mb ($10^{6}$ bases)) of human and

130: mouse chromosomes due to chromosomal rearrangements. At such length

131: scales, GC\% correlations existing in the human genome may be absent

132: in the mouse genome.

133: \end{itemize}

134:

135: This Letter examines the presence of $1/f^\alpha$ spectra in

136: spatial GC\% variations across all {\sl Mus musculus} chromosomes.

137: A graphic display of the mouse genome GC\% fluctuation can

138: be found at \cite{paces}.

139:

140: \section{Materials and Methods}

141: \subsection{DNA Sequence Data}

142: We download mouse genomic DNA sequences for nineteen autosomal

143: chromosomes (Chr1--Chr19) and two sex chromosomes (ChrX and ChrY) from

144: the UCSC Genome Bioinformatics Site {\sf http://genome.ucsc.edu/}

145: (the October 2003 release, or UCSC version {\sl mm04}).

146: Each of the twenty-one chromosomes is evenly

147: partitioned into $2^{17}=131{,}072$ non-overlapping windows of $\omega$

148: bases, and the GC\% of each window is computed.  A fraction of bases

149: are as yet uncharacterized, and at these positions A, C, G, or T is

150: substituted by the symbol ``N'' (sequence gaps). Windows that

151: contain only uncharacterized bases have an undetermined GC\% value.

152: In this study, we replace all undetermined GC\% values by randomly

153: chosen values from a normal distribution with GC\% mean and variance

154: taken from the empirical distribution of all determined GC\% windows.
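
The substitution of undetermined windows can be sketched as follows. This is an illustration only: the name {\sf fill\_undetermined}, the use of {\sf None} to mark an undetermined window, the fixed random seed, and the use of the population (rather than sample) standard deviation are all assumptions not specified in the text.

```python
import random
import statistics


def fill_undetermined(gc_series, seed=0):
    """Replace None entries by draws from a normal distribution.

    The mean and standard deviation are estimated from the determined
    windows. Drawn values are not clipped to [0, 100] in this sketch.
    """
    known = [v for v in gc_series if v is not None]
    if not known:
        raise ValueError("no determined GC% values to estimate from")
    mean = statistics.fmean(known)
    sd = statistics.pstdev(known)  # assumption: population standard deviation
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) if v is None else v for v in gc_series]
```

Determined windows pass through unchanged, so only the gaps acquire random values.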

155:

156: \begin{figure}[htbp]

157: %% \centering{\resizebox{4cm}{!}{\includegraphics{fnlf1.eps}}}

158:   \centerline{\psfig{figure=gc-distri.eps,width=95mm,angle=0}}

159:   \caption{\label{fig1}

160: Genome-wide GC content (GC\%) of mouse {\em Mus musculus}

161: for overall, non-repetitive, and repetitive sequences.

162:   }

163: \end{figure}

164:

165:

166: Higher eukaryotic genomes are enriched in repetitive sequences

167: \cite{smit}. Repeats are approximate copies of DNA sequence segments,

168: and interspersed repeats are an abundant class of repetitive sequence

169: segments in mammalian genomes that are scattered throughout the

170: genome.  In the human and the mouse genome, transposon-derived

171: interspersed repeats constitute about 35--45\% \cite{lander,venter} and

172: 38\% \cite{mouse} of the total genome, respectively.  As can be

173: seen from Fig.~1, the distribution of GC\% for repetitive

174: sequences is markedly different from that of the non-repetitive sequences.

175: In order to study the effects of interspersed repeats,

176: we use interspersed repeat-annotated versions of mouse

177: chromosomes \cite{repeatmasker},

178: and separately analyze GC\% fluctuations obtained from

179: DNA sequences with retained and substituted interspersed repeats.  In the

180: latter DNA sequences, we substitute GC\% from interspersed repeats by

181: random GC\% values drawn from a normal distribution with mean and

182: variance taken from the empirical distribution of the non-repetitive

183: proportion of each individual chromosome.

184:

185: \subsection{Spectral Analysis of DNA Sequences}

186:

187: \indent

188:

189: We coarse-grain sequences into a spatial-series of GC contents

190: (GC\%) and conduct spectral analysis of the spatial GC\%-series.  To

191: this end, we choose a window of size $\omega$ bases, compute GC\%, move

192: the window along the DNA sequence by $\Delta\omega$ bases, and iterate

193: the computation to obtain a spatial-series of GC\% values.

194: Non-overlapping windows are obtained by setting $\Delta\omega=\omega$.
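
The coarse-graining step can be sketched in Python as below. The function name {\sf gc\_series} and its parameters are hypothetical; returning {\sf None} for a window without any characterized base, and dropping a trailing partial window, are assumptions made for this illustration.

```python
def gc_series(sequence, window, step=None):
    """Spatial series of GC% values for windows of `window` bases.

    `step` plays the role of the window shift; step == window (the
    default) gives non-overlapping windows. A window containing no
    A, C, G, or T yields None (undetermined GC%).
    """
    if step is None:
        step = window
    seq = sequence.upper()
    series = []
    # A trailing partial window is dropped (assumption of this sketch).
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        strong = chunk.count("G") + chunk.count("C")
        weak = chunk.count("A") + chunk.count("T")
        total = strong + weak
        series.append(100.0 * strong / total if total else None)
    return series
```

Calling {\sf gc\_series(seq, 4)} gives non-overlapping windows of four bases, whereas {\sf gc\_series(seq, 4, step=1)} slides the window one base at a time.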

195:

196:

197: The power spectrum, the squared modulus of the Fourier

198: transform normalized by the number of windows, is defined as

199: \begin{equation}

200: S(f) \equiv

201: \frac{1}{N} \left| \sum_{k=1}^{N} ({\rm GC\%})_k \cdot e^{ -i 2 \pi k

202: f/N} \right|^2

203: \label{DEF}

204: \end{equation}

205: where $N$ is the total number of windows and $f$ is the harmonic index, corresponding to a spatial frequency of $f/(N\omega)$ cycle/base. Table~1 lists the window

206: sizes and averaged GC\% calculated at these window sizes for all

207: chromosomes.
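
The definition in Eq.~(\ref{DEF}) can be evaluated directly for a short series, as in the following sketch. It uses a plain discrete Fourier sum rather than an FFT, so it is only practical for small $N$. The names {\sf power\_spectrum} and {\sf harmonic\_to\_frequency} are hypothetical, and reading $f$ as an integer harmonic index with spatial frequency $f/(N\omega)$ is an interpretation of the exponent in the definition.

```python
import cmath


def power_spectrum(series):
    """S(f) = |sum_k x_k exp(-i 2 pi k f / N)|^2 / N for f = 0 .. N//2.

    The window index k runs from 1 to N, as in the definition. The mean
    is not subtracted, so S(0) reflects the average GC% level.
    """
    n = len(series)
    spectrum = []
    for f in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * cmath.pi * k * f / n)
                    for k, x in enumerate(series, start=1))
        spectrum.append(abs(total) ** 2 / n)
    return spectrum


def harmonic_to_frequency(f, n, window):
    """Convert harmonic index f to cycle/base for n windows of `window` bases."""
    return f / (n * window)
```

With $N=2^{17}$ windows of $\omega \approx 1.5$~kb, the first harmonic corresponds to roughly $5\times 10^{-9}$ cycle/base, i.e. one cycle over the whole chromosome.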

208:

209:

210: \begin{table}[htbp]

211: \caption{\label{tab:table1}

212:   Window sizes ($\omega$) and average GC contents ($\overline{\rm

213: GC\%}$).  Each mouse chromosome (Chr) is partitioned into $2^{17}$

214: non-overlapping windows.

215:   \vspace*{5mm}

216:   }

217: \centering\footnotesize

218: \begin{tabular}{rcc|rcc}

219: Chr & $\overline{\rm GC\%}$ & $\omega$~(kb) & Chr & $\overline{\rm

220: GC\%}$ & $\omega$~(kb) \\ \hline

221:   1 & 41                & 1.52         &  11 & 44                & 0.93 \\

222:   2 & 42                & 1.39         &  12 & 42                & 0.88 \\

223:   3 & 40                & 1.25         &  13 & 42                & 0.90 \\

224:   4 & 42                & 1.18         &  14 & 41                & 0.90 \\

225:   5 & 43                & 1.15         &  15 & 42                & 0.81 \\

226:   6 & 41                & 1.15         &  16 & 41                & 0.76 \\

227:   7 & 43                & 1.05         &  17 & 43                & 0.73 \\

228:   8 & 42                & 1.00         &  18 & 41                & 0.69 \\

229:   9 & 43                & 0.96         &  19 & 43                & 0.47 \\

230:  10 & 41                & 1.02         &  X/Y & 39/39            & 1.21/0.17 \\

231: \end{tabular}

232: \end{table}

233:

234: \section{Results}

235:

236: \indent

237:

238: We use the Fast Fourier Transform (FFT), as implemented in

239: the {\small\protect\sf S-PLUS} statistical package (Version 3.4,

240: MathSoft, Inc.).  The {\small\protect\sf S-PLUS} subroutine {\small\sf

241: Spectrum} takes as input a discrete FFT to calculate as output a

242: periodogram (the power spectrum in units $10\cdot\log_{10} S(f)$), and

243: subsequently applies Daniell-filtering (i.e. rectangular window)

244: \cite{daniell,priestley} to compute a smoothed spectrum using a

245: user-specified parameter value ({\sf span}).
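
The smoothing step can be illustrated with a simple moving average over neighbouring spectral components. This sketch does not reproduce the {\sf S-PLUS} routine: the names {\sf daniell\_smooth} and {\sf to\_decibel} are hypothetical, equal weights are assumed throughout, and shrinking the averaging window at the two ends of the spectrum is an assumption of this illustration.

```python
import math


def daniell_smooth(spectrum, span):
    """Rectangular moving average of odd width `span` over the spectrum."""
    if span < 1 or span % 2 == 0:
        raise ValueError("span must be a positive odd integer")
    half = span // 2
    n = len(spectrum)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)          # the window is truncated at the edges
        hi = min(n, i + half + 1)
        smoothed.append(sum(spectrum[lo:hi]) / (hi - lo))
    return smoothed


def to_decibel(spectrum):
    """Express strictly positive spectral values as 10 * log10(S)."""
    return [10.0 * math.log10(s) for s in spectrum]
```

A span of 1 leaves the periodogram unchanged; larger spans reduce the scatter of the estimate at the cost of frequency resolution.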

246:

247: Figure~2 shows the power spectrum $S(f)$ as a function of the

248: frequency $f$ across nineteen autosomal and two sex chromosomes.  We

249: find for sequences with retained interspersed repeats that $S(f)$

250: exhibits the functional form $S(f)\sim 1/f^{\alpha}$ persistently

251: across twenty-one chromosomes. The exponent is $\alpha \approx 1$

252: for frequencies in the range $10^{-7} < f < 10^{-5}$ cycle/base, corresponding to

253: length scales $L=1/f$ of 100~kb $< L <$ 10~Mb.  At

254: higher frequencies $f > 10^{-5}$ cycle/base ($L < 100$~kb), $S(f)$

255: generally becomes flattened with $\alpha < 1$ across all chromosomes.

256: This deviation from the $1/f$ spectrum was also observed in the

257: human genome \cite{fukushima,fukushima-t,li-holste}.

258: At lower frequencies $f < 10^{-7}$ cycle/base ($L > 10$~Mb), there are far

259: fewer spectral components, and hence $S(f)$ shows relatively larger

260: fluctuations, and the estimation of $\alpha$ in $S(f)\sim 1/f^{\alpha}$

261: is less reliable. In the frequency range $10^{-8} < f < 10^{-7}$

262: cycle/base, only $S(f)$ for Chr2, Chr4, Chr7, Chr11, and Chr16 is

263: indicative of a persistence of $\alpha\approx 1$.
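
One way to quantify the exponent within a given frequency band is a least-squares fit of $\log_{10} S(f)$ against $\log_{10} f$. The sketch below is an assumed procedure for illustration and not necessarily how the spectra here were assessed; the name {\sf fit\_alpha} and its arguments are hypothetical.

```python
import math


def fit_alpha(freqs, spectrum, f_min, f_max):
    """Estimate alpha in S(f) ~ 1/f^alpha for f_min < f < f_max.

    Fits a straight line to (log10 f, log10 S) by ordinary least squares
    and returns the negative of the slope.
    """
    points = [(math.log10(f), math.log10(s))
              for f, s in zip(freqs, spectrum)
              if f_min < f < f_max and s > 0]
    if len(points) < 2:
        raise ValueError("need at least two spectral components in the band")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx
```

For the band discussed above one would call {\sf fit\_alpha(freqs, spectrum, 1e-7, 1e-5)}, with frequencies in cycle/base, so that the fitted components correspond to length scales $L=1/f$ between 100~kb and 10~Mb.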

264:

265: When we compute $S(f)$ for mouse chromosomes $1, 2, \dots, 19$, X, and Y

266: with substituted interspersed repeats, we find that $S(f)$ is higher than

267: $S(f)$ obtained for the original sequences, especially at

268: frequencies $f > 10^{-5}$ cycle/base ($L < 100$~kb).  One possible

269: explanation is that the substitution of GC\% estimated from

270: repetitive GC bases by random GC\% values increases the level

271: of white noise fluctuations at length scales comparable to lengths

272: of interspersed repeats.

273:

274: It is interesting to note that the substitution of about 38\% interspersed

275: repeats hardly affects $S(f)$ at intermediate and lower frequencies. A

276: similar observation has been made for human genomic DNA sequences

277: \cite{holste03,li-holste}. Thus, interspersed repeats may not contribute

278: predominantly to long-range correlations in mammalian genomic DNA.

279:

280:

281: % \begin{figure}[htbp]

282: \begin{figure}[bh]

283: %% \centering{\resizebox{4cm}{!}{\includegraphics{fnlf1.eps}}}

284:   \centerline{\psfig{figure=mouse-spec-one-by-one.eps,width=95mm,angle=-90}}

285:   \caption{\label{fig2}

286:   Double logarithmic representation of the power spectrum $10\log_{10}

287:   S(f)$ of GC\% fluctuations across nineteen autosomal (Chr1--Chr19)

288:   and two sex (ChrX and ChrY) mouse chromosomes.  The functional form of $S(f)$

289:   can be approximated by $S(f)\sim 1/f^{\alpha}$ over several orders of

290:   magnitude in frequency ($\alpha=1$ is plotted for comparison).  Two

291:   curves represent $S(f)$ obtained for DNA sequences with retained and

292:   substituted interspersed repeats, respectively.  For clear

293: representation,

294:   $S(f)$ is smoothed at different frequency ranges, using

295:   a Daniell (rectangular) filter with sizes ({\sf span} parameter) 1, 3, 31,

296:   and 501 for the 1--10, 10--100, 100--500, and 500--65,536 spectral

297:   components. Vertical lines mark the length scales $L=1/f$

298:   (base/cycle) for $L=10$~Mb, $L=1$~Mb, and $L=100$~kb.

299:   }

300: \end{figure}

301:

302: \section{Discussion}

303: \subsection{Spatial $1/f$ spectra are not a generally held property}

304:

305: \indent

306:

307: Before discussing the universal spectral shape of GC\% of

308: genomic DNA sequences, note that not

309: all spatial sequences or signals exhibit $1/f^\alpha$ ($\alpha

310: \approx 1$) power spectra.

311:

312: An instructive example is provided  by the spatial spectrum of images

313: taken of natural scenes, where it is known that the {\sl amplitude

314: spectrum} of image pixels is typically  $1/f$, and consequently

315: its power spectrum is $S(f)\sim 1/f^2$ \cite{burton,field,tolhurst92}.

316: Sometimes, the exponent $\alpha$ in $S(f) \sim 1/f^\alpha$

317: may not be exactly equal to 2: for example, it was shown that

318: underwater images tend to exhibit a larger

319: exponent $\alpha$ (or steeper slope) than that of atmospheric

320: images \cite{balboa}. There are mainly two theories of the

321: $1/f^2$ scaling in such images: (i) it is caused by

322: luminance edges, and (ii) it is caused by a power-law distribution

323: of sizes of regions with constant intensity \cite{balboa2}.

324: Experiments have been carried out to test whether images with a change of the

325: slope $\alpha$ can

326: be detected visually by human subjects \cite{tolhurst97, tolhurst00}.

327:

328: This well-established $1/f^2$ spatial power spectrum in

329: images provides an example showing that spatial

330: power spectra are not necessarily of the form $S(f) \sim 1/f$.

331: Rather, $S(f) \sim 1/f$ and $1/f^2$ spectra are considered to belong

332: to two different classes \cite{gardner}, and so the exponent

333: $\alpha \approx 1$ observed in DNA sequences is not

334: {\sl a priori} expected.

335:

336: \subsection{Spatial $1/f$ spectra are consistent with the

337: evolutionary expansion-modification model}

338:

339: \indent

340:

341: One hypothesis is that $1/f^\alpha$ ($\alpha \approx 1$) constitutes

342: a universal property of all long DNA sequences subject to

343: neutral evolution that involved duplications and

344: mutations \cite{li92-1,li91}. A simplified model, termed

345: ``expansion-modification (EM) model'' \cite{li89,li91}, generates

346: a binary 2-symbol sequence by two local operations: (i)

347: expansion/duplication: $0 \rightarrow 00$, $ 1 \rightarrow 11$;

348: and (ii) modification/mutation: $0 \rightarrow 1$,

349: $1 \rightarrow 0$. When the probability of the first operation

350: is large (e.g. probability $p_1=0.9$), resulting binary sequences

351: exhibit $1/f^\alpha$ ($\alpha \approx 1$) power spectra.

352: If the probability of the second operation ($p_2$) is large, the

353: resulting sequences exhibit white spectra ($\alpha\approx 0$) \cite{li91}.

354: Since the sequence generating process is hierarchical, it

355: is implicit that the resulting sequence

356: exhibits scale-invariance (or perhaps multiple-scale-invariance

357: \cite{mansilla}).
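
The two local operations of the EM model can be simulated with a few lines of Python. This is a minimal sketch under stated assumptions: every symbol is updated once per generation, a symbol is either duplicated (with probability $p_1$) or flipped (with probability $p_2 = 1 - p_1$), and the sequence starts from a single 0. The names {\sf em\_generation} and {\sf em\_sequence} are hypothetical, and the published model may differ in these details.

```python
import random


def em_generation(seq, p_expand, rng):
    """Apply one round of expansion/modification to every symbol."""
    out = []
    for symbol in seq:
        if rng.random() < p_expand:
            out.extend((symbol, symbol))   # expansion: 0 -> 00, 1 -> 11
        else:
            out.append(1 - symbol)         # modification: 0 -> 1, 1 -> 0
    return out


def em_sequence(generations, p_expand=0.9, seed=0):
    """Grow a binary sequence from the single symbol 0."""
    rng = random.Random(seed)
    seq = [0]
    for _ in range(generations):
        seq = em_generation(seq, p_expand, rng)
    return seq
```

With {\sf p\_expand} close to 1 the sequence roughly doubles in length per generation, so about 17 generations would give a length comparable to the $2^{17}$ windows analyzed per chromosome.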

358:

359: The EM model contains two features that

360: are essential for DNA evolution: duplication and mutation.

361: Duplications, whether inter-chromosomal

362: or intra-chromosomal, expand DNA sequences and provide a

363: potential for genes to develop novel functions

364: \cite{ohno}. In one point of view, duplications have a larger impact

365: than natural selection in Darwin's evolution theory, as

366: duplications and the resulting redundancy actually created

367: the foundation upon which natural selection acts \cite{meyer}.

368: Although point mutations might be detrimental to

369: biological fitness, they nevertheless provide a potential for evolution,

370: perhaps on a smaller scale as compared to that for duplications.

371:

372: A more realistic modeling of neutral evolution of DNA sequences

373: by duplication and mutation beyond the EM model is still lacking

374: \cite{eichler}, and

375: so the hypothesis that all long DNA sequences undergoing

376: duplications and mutations

377: exhibit $1/f^\alpha$ ($\alpha \approx 1$) power spectra

378: remains to be validated.  The result presented in this paper,

379: that {\sl Mus musculus} genomic DNA exhibits $\alpha \approx 1$,

380: adds another line of evidence toward verification of this hypothesis.

381:

382: \subsection{High level of chromosomal segment translocations did

383: not destroy $1/f$ spectra in mouse genomic DNA}

384:

385: \indent

386:

387: About 90\% of the human and 93\% of the mouse

388: genome reside within syntenic blocks, in which the

389: order of a series of biological markers (e.g. genes) is

390: approximately conserved \cite{mouse}. However, with the

391: exception of ChrX, these syntenic blocks have different loci

392: on human and mouse chromosomes.

393: About 65-75 million years ago, the human genome and the

394: mouse genome embarked on a different evolutionary history,

395: with many chromosomal translocations, that left

396: syntenic blocks of a human chromosome

397: scattered on different mouse chromosomes.

398:

399: This basic picture indicates that the observation of $1/f$

400: spectra in the human genome \cite{li-holste} will not guarantee

401: similar spectra in the mouse genome, as random translocations

402: can easily destroy any long-range order in a given DNA sequence.

403: An alternative explanation of $1/f$ spectra in the mouse

404: genome, in light of observed $1/f$ spectra in the human genome,

405: is that both are in fact a consequence of the large-scale

406: dynamics, such as translocation and duplication. Theoretically,

407: DNA sequences frozen at about 65-75 million years ago before

408: the divergence of human and mouse species could test this

409: hypothesis.

410:

411: Besides translocation/duplication, mutations may also affect

412: GC\% fluctuations. If chromosomal segment translocations

413: were the only process in the evolution, the human and mouse genomes

414: ought to have the same GC content distribution.

415: Nevertheless, the GC\% distribution in the mouse genome is tighter

416: (with smaller variance) and lacks the segments with high GC\%

417: (``outliers'') found in the human genome, a fact observed both

418: experimentally \cite{thiery} and by sequence analysis \cite{mouse}.

419:

420: One might think that this difference of variances of

421: the GC\% distributions between the human and the mouse

422: genome may render their spatial spectra different. However,

423: as pointed out in \cite{clay,li04},

424: the exponent $\alpha$ in $1/f^\alpha$ is related to how

425: the variance of GC\% changes with the window size,

426: instead of the variance itself (and the isochore

427: or fairly homogeneous sequence segment of about 200--300 kb

428: long or more \cite{bernardi89,bernardi01} is related to

429: the asymmetry or the third moment of the GC\% distribution).
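
How the variance of GC\% changes with the window size can be examined by merging adjacent windows and recomputing the variance at each level, as in the sketch below. The name {\sf variance\_by\_window} is hypothetical, and the use of equal-weight averaging, the population variance, and dropping leftover windows are assumptions. The sketch only produces the variance at each merging factor and does not by itself convert it into an estimate of $\alpha$.

```python
import statistics


def variance_by_window(series, factors):
    """Variance of GC% after merging `m` adjacent windows, for each m.

    Merging m windows of size omega emulates a window of size m * omega
    (assuming every window holds the same number of bases).
    """
    variances = {}
    for m in factors:
        n_blocks = len(series) // m   # leftover windows at the end are dropped
        if n_blocks < 1:
            raise ValueError("merging factor exceeds series length")
        means = [statistics.fmean(series[i * m:(i + 1) * m])
                 for i in range(n_blocks)]
        variances[m] = statistics.pvariance(means)
    return variances
```

Plotting the returned variances against $m\,\omega$ on double logarithmic axes shows how quickly GC\% fluctuations average out with increasing window size.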

430:

431: \subsection{GC content correlates with many biological features

432: and $1/f$ spectra of GC\% imply a scale-invariant distribution

433: of these features}

434:

435: \indent

436:

437: Although variations in GC\% are not of direct biological relevance,

438: they are correlated with many other measurements and

439: biological functions of DNA sequences, including chromosome

440: bands, protein-coding gene density, replication timing,

441: or recombination hot spots \cite{bernardi89,bernardi04}.

442: If the correlation between GC\% and a biological function

443: is strong, then the scale-invariance pattern in GC\% can be

444: transformed into a similar spatial structure of these

445: biologically functioning units.

446:

447: While a stretched out chromosome is too thin to be visible

448: (the diameter of a DNA thread is about $2\times 10^{-9}$m

449: \cite{calladine}), during the metaphase of a cell cycle

450: it becomes visible because it is tightly packed into a chromatin

451: structure (with the length of a typical human chromosome of this

452: compact form of about $10^{-5}$m \cite{calladine}).  Using

453: Giemsa dyes to stain chromosomes leads to alternating dark and

454: light bands \cite{ried}. The mechanism of this staining difference

455: at different chromosomal regions is thought to be caused by the degree

456: of condensation of the chromatin structure \cite{saitoh}.

457: The connection between GC\% and Giemsa bands was suggested long

458: ago: Giemsa-light bands (termed R-bands) are GC-rich, whereas

459: Giemsa-dark bands (G-bands) are GC-poor \cite{coming,ikemura88}.

460: This connection has been further calibrated, such that

461: being {\sl relatively} GC-rich or GC-poor as compared to the

462: flanking regions is correlated with Giemsa-bands \cite{gojobori}.

463: This new proposal manages to reproduce chromosome bands quite

464: well by sequence analysis \cite{gojobori}.

465:

466: After the sequencing of the human genome was completed,

467: it was revealed that many long GC-poor regions contain a

468: low gene density (``gene deserts'') \cite{lander,bernardi01}, confirming

469: the earlier proposed correlation between GC\% and gene

470: density \cite{b85,b91,b96}. In fact, the gene density

471: not only increases with GC\%, but increases faster than

472: in a linear fashion \cite{bernardi01}. Extremely GC-rich regions

473: in the human genome are also extremely gene-rich, and beyond the

474: GC content of GC\% $\approx$ 46\% the gene-density

475: increases markedly.

476:

477: Due to the large genome size for most eukaryotic species,

478: the replication of DNA sequences starts at multiple positions.

479: Specific chromosome regions replicate earlier in time, while other

480: regions replicate later, marked by a clear boundary between the

481: two types of regions.

482: While a correlation between replication timing and chromosome

483: bands was proposed in \cite{coming,bernardi89}

484: (R-bands replicate earlier), the extent and biological relevance

485: are still subject to investigation \cite{ikemura02}.

486:

487: Furthermore, regions with high recombination rates (recombination

488: hot spots) have been associated with being GC-rich \cite{clark}.

489: While this correlation has been established for the yeast {\sl Saccharomyces

490: cerevisiae}

491: genome sequence \cite{gerton}, no conclusive results have yet been

492: obtained for higher genomes such as human genomic DNA \cite{mcvean}.

493: The general difficulty in the determination of regional recombination rates

494: in the human genome is that it is either indirect (using

495: pedigree analysis \cite{yu,kong}) or limited to only

496: male samples (using sperm typing \cite{jeffreys}). Even a newly

497: proposed population genealogy-based inference of recombination

498: rates \cite{mcvean} is still not a direct measurement. In addition,

499: it has also been proposed that recombination events increase

500: the local GC content \cite{birdsell,galtier,duret04}. Although this switches

501: the roles of cause and effect, from a statistical (correlation) point

502: of view, the outcome remains the same.

503:

504: Instead of using the GC\% as a ``surrogate'', there also

505: have been attempts to

506: study large-scale patterns of biological units directly.

507: It has been shown that in circular bacterial genome sequences,

508: the positions and orientations

509: of genes do not have any preferable length scale and are

510: scale-invariant \cite{audit}. This result should be directly linked

511: to a similar scale-invariance of GC\% in bacterial genomes.

512: The universally observed $1/f^\alpha$ ($\alpha \approx 1$) spectra

513: in the mouse {\sl Mus musculus} chromosomes, as well as in

514: human chromosomes \cite{li-holste}, motivate further sequence

515: analysis on the spatial arrangement of functional biological units.

516:

517:

518: \section*{Acknowledgements}

519: W. Li acknowledges support from NIH-N01-AR12256.

520:


532:

533: \begin{thebibliography}{99}

534:

535: \bibitem{pauling}

536: L. Pauling and R.B. Corey,

537: {\em A proposed structure for the nucleic acids},

538: {\em Proceedings of National Academy of Sciences} {\bf 39} (1953) 84--97.

539:

540: \bibitem{watson}

541: J.D. Watson and F.H.C. Crick,

542: {\em A structure for deoxyribose nucleic acid},

543: {\em Nature} {\bf 171}  (1953) 737--738.

544:

545: \bibitem {calladine}

546: C.R. Calladine and H.R. Drew,

547: {\em Understanding DNA}

548: (Academic Press, London, 1992).

549:

550: \bibitem{fickett92}

551: J.W. Fickett, D.C. Torney and D.R. Wolf,

552: {\em Base compositional structure of genomes},

553: {\em Genomics} {\bf 13} (1992) 1056--1064.

554:

555: \bibitem{forsdyke00}

556: D.R. Forsdyke and J.R.  Mortimer,

557: {\em Chargaff's legacy},

558: {\em Gene} {\bf 261} (2000) 127--137.

559:

560: \bibitem{churchill}

561: G.A. Churchill,

562: {\em Stochastic models for heterogeneous DNA sequences},

563: {\em Bulletin of Mathematical Biology} {\bf 51} (1989) 79--94.

564:

565: \bibitem{elton}

566: R.A. Elton,

567: {\em Theoretical models for heterogeneity for base composition in DNA},

568: {\em Journal of Theoretical Biology} {\bf  45} (1974) 533--553.

569:

570: \bibitem{ikemura88}

571: T. Ikemura and S. Aota,

572: {\em Global variation in G+C content along vertebrate genome DNA:

573: possible correlation with chromosome band structure},

574: {\em Journal of Molecular Biology} {\bf 203} (1988) 1--13.

575:

576: \bibitem{ikemura90}

577: T. Ikemura, K.N. Wada and S. Aota,

578: {\em Giant G+C\% mosaic structures of the human genome found by

579: arrangement of GenBank

580: human DNA sequences according to genetic positions},

581: {\em Genomics} {\bf 8} (1990) 207--216.

582:

583: \bibitem{anastassiou}

584: D. Anastassiou,

585: {\em Genomic signal processing},

586: {\em IEEE Signal Processing Magazine} {\bf 18} (2001) 8--20.

587:

588: \bibitem{cristea}

589: P.D. Cristea,

590: {\em Large scale features in DNA genomic signals},

591: {\em Signal Processing} {\bf 83} (2003) 871--888.

592:

593: \bibitem{vaidy}

594: P.P. Vaidyanathan and B.J. Yoon,

595: {\em The role of signal-processing concepts in genomics and proteomics},

596: {\em Journal of the Franklin Institute} {\bf 341} (2004) 111--135.

597:

598: \bibitem{anastassiou04}

599: D. Sussillo, A. Kundaje and D. Anastassiou,

600: {\em Spectrogram analysis of genomics},

601: {\em EURASIP Journal on Applied Signal Processing} {\bf 2004} (2004) 29--42.

602:

603: \bibitem{li92-1}

604: W. Li,

605: {\em Generating nontrivial long-range correlations and 1/f spectra by

606: replication and mutation},

607: {\em International Journal of Bifurcation and Chaos} {\bf 2} (1992)

608: 137--154.

609:

610: \bibitem{likaneko}

611: W. Li and K. Kaneko,

612: {\em Long-range correlation and partial $1/f^{\alpha}$ spectrum in a

613: non-coding DNA sequence},

614: {\em Europhysics Letters} {\bf 17} (1992) 655--660.

615:

616: \bibitem{voss}

617: R. Voss,

618: {\em Evolution of long-range fractal correlations and 1/f noise in DNA

619: base sequences},

620: {\em Physical Review Letters} {\bf 68} (1992) 3805--3808.

621:

622: \bibitem{keshner}

623: M.S. Keshner,

624: {\em 1/f noise},

625: {\em Proceedings of the IEEE} {\bf 70} (1982) 212--218.

626:

627: \bibitem{1f}

628: W. Li,

629: {\em An online bibliography on 1/f noise},

630: {\sf http://www.nslij-genetics.org/wli/1fnoise/}

631:

632: \bibitem{milo}

633: E. Milotti,

634: {\em 1/f noise: a pedagogical review},

635: arxiv preprint, physics/0204033 (2002)

636: {\sf http://arxiv.org/abs/physics/0204033}.

637:

638: \bibitem{west}

639: B.J. West and M.F. Shlesinger,

640: {\em The noise in natural phenomena},

641: {\em American Scientist} {\bf 78} (1990) 40--45.

642:

643: \bibitem{press}

644: W. Press,

645: {\em Flicker noise in astronomy and elsewhere},

646: {\em Comments on Astrophysics} {\bf 7} (1978) 103--119.

647:

648: \bibitem{gardner}

649: M. Gardner,

650: {\em Mathematical games -- white and brown music, fractal curves and 1/f

651: fluctuations},

652: {\em Scientific American} {\bf 238} (1978) 16--32.

653:

654: \bibitem{peng}

655: J.M. Hausdorff and C.K. Peng,

656: {\em Multiscaled randomness: a possible source of 1/f noise in biology},

657: {\em Physical Review E} {\bf 54} (1996) 2154--2155.

658:

659: \bibitem{maria}

660: M. de Sousa Vieira,

661: {\em Statistics of DNA sequences: a low frequency analysis},

662: {\em Physical Review E} {\bf 60} (1999) 5932--5937.

663:

664: \bibitem{lu}

665: X. Lu, Z.R. Sun, H.M. Chen and Y.D. Li,

666: {\em Characterizing self-similarity in bacteria DNA sequences},

667: {\em Physical Review E} {\bf 58} (1998) 3578--3584.

668:

669: \bibitem{li-gr}

670: W. Li, G. Stolovitzky, P. Bernaola-Galvan and J.L. Oliver,

671: {\em Compositional heterogeneity within, and uniformity between, DNA

672: sequences of yeast chromosomes},

673: {\em Genome Research} {\bf 8} (1998) 916--928.

674:

675: \bibitem{fukushima-t}

676: A. Fukushima,

677: {\em Periodicity in Genome Architecture from Bacteria to Human}

678: (Ph.D. thesis, Nara Institute of Science and Technology, 2003).

679:

680: \bibitem{fukushima}

681: A. Fukushima, T. Ikemura, M. Kinouchi, T. Oshima, Y. Kudo, H. Mori

682: and S. Kanaya,

683: {\em Periodicity in prokaryotic and eukaryotic genomes identified by

684: power spectrum analysis},

685: {\em Gene} {\bf 300} (2002) 203--211.

686:

687: \bibitem{li-holste}

688: W. Li and D. Holste,

689: {\em Universal 1/f noise, cross-overs of scaling exponents,

690: and chromosome specific patterns of GC content in

691: DNA sequences of the human genome},

692: {\em Physical Review E}, submitted.

693:

694: \bibitem{mouse}

695: R.H. Waterston {\em et al.},

696: {\em Initial sequencing and comparative analysis of the mouse genome},

697: {\em Nature} {\bf 420} (2002) 520--562.

698:

699: \bibitem{paces}

700: J. Pa\u{c}es, R. Z\'{i}ka, V. Pa\u{c}es, A. Pavl\'{i}\u{c}ek, O. Clay

701: and G. Bernardi,

702: {\em Representing GC variation along eukaryotic chromosomes},

703: {\em Gene} {\bf 333} (2004) 135--141.

704:

705: \bibitem{smit}

706: A.F. Smit,

707: {\em Interspersed repeats and other mementos of transposable elements in

708: mammalian genomes},

709: {\em Current Opinion in Genetics \& Development} {\bf 9} (1999) 657--663.

710:

711: \bibitem{lander}

712: E.S. Lander {\em et al.},

713: {\em Initial sequencing and analysis of the human genome},

714: {\em Nature} {\bf 409} (2001) 860--921.

715:

716: \bibitem{venter}

717: J.C. Venter {\em et al.},

718: {\em The sequence of the human genome},

719: {\em Science} {\bf 291} (2001) 1304--1351.

720:

721: \bibitem{repeatmasker}

722: A.F. Smit and P. Green,

723: RepeatMasker,

724: {\sf http://repeatmasker.genome.washington.edu/}.

725:

726:

727: \bibitem{daniell}

728: P.J. Daniell,

729: {\em Discussion on the Paper by Bartlett, Foster, Cunningham and Hynd},

730: {\em Supplement to the Journal of the Royal Statistical Society} {\bf

731: 8} (1946) 88--90.

732:

733: \bibitem{priestley}

734: M.B. Priestley,

735: {\em Spectral Analysis and Time Series}

736: (Academic Press, London, 1981).

737:

738:

739:

740: \bibitem{holste03}

741: D. Holste, S. Beirer, P. Schieg, I. Grosse and H. Herzel,

742: {\em Repeats and correlations in human DNA sequences},

743: {\em Physical Review E} {\bf 67} (2003) 061913.

744:

745: \bibitem{burton}

746: G.J. Burton and I.R. Moorhead,

747: {\em Color and spatial structure in natural scenes},

748: {\em Applied Optics} {\bf 26} (1987) 157--170.

749:

750: \bibitem{field}

751: D.J. Field,

752: {\em Relations between the statistics of natural images

753: and the response properties of cortical cells},

754: {\em Journal of the Optical Society of America A} {\bf 4} (1987) 2379--2394.

755:

756: \bibitem{tolhurst92}

757: D.J. Tolhurst, Y. Tadmor and T. Chao,

758: {\em Amplitude spectra of natural images},

759: {\em Ophthalmic \& Physiological Optics} {\bf 12} (1992) 229--232.

760:

761: \bibitem{balboa}

762: R.M. Balboa and N.M. Grzywacz,

763: {\em Power spectra and distribution of contrasts of natural images from

764: different habitats},

765: {\em Vision Research} {\bf 43} (2003) 2527--2537.

766:

767: \bibitem{balboa2}

768: R.M. Balboa, C.W. Tyler and N.M. Grzywacz,

769: {\em Occlusions contribute to scaling in natural images},

770: {\em Vision Research} {\bf 41} (2001) 955--964.

771:

772: \bibitem{tolhurst97}

773: D.J. Tolhurst and Y. Tadmor,

774: {\em Discrimination of changes in the slopes of the

775: amplitude spectra of natural images: band-limited

776: contrast and psychometric functions},

777: {\em Perception} {\bf 26}  (1997) 1011--1025.

778:

779: \bibitem{tolhurst00}

780: C.A. Parraga and D.J. Tolhurst,

781: {\em The effect of contrast randomisation on the discrimination of

782: changes in the slopes of the amplitude spectra of natural scenes},

783: {\em Perception} {\bf 29} (2000) 1101--1116.

784:

785: \bibitem{li91}

786: W. Li,

787: {\em Expansion-modification systems: a model for spatial 1/f spectra},

788: {\em Physical Review A} {\bf 43} (1991) 5240--5260.

789:

790: \bibitem{li89}

791: W. Li,

792: {\em Spatial 1/f spectra in open dynamical systems},

793: {\em Europhysics Letters} {\bf 10} (1989) 395--400.

794:

795:

796: \bibitem{mansilla}

797: R. Mansilla and G. Cocho,

798: {\em Multiscaling in expansion-modification systems: an

799: explanation for long range correlation in DNA},

800: {\em Complex Systems} {\bf 12} (2000) 207--240.

801:

802: \bibitem{ohno}

803: S. Ohno,

804: {\em Evolution by Gene Duplication}

805: (Springer-Verlag, Berlin, 1970).

806:

807: \bibitem{meyer}

808: A. Meyer and Y. van de Peer,

809: {\em Natural selection merely modified while redundancy created.

810: Susumu Ohno's idea of evolutionary importance of gene and genome

811: duplication},

812: {\em Journal of Structural and Functional Genomics} {\bf 3} (2003) vii--ix.

813:

814: \bibitem{eichler}

815: E.E. Eichler and D. Sankoff,

816: {\em Structural dynamics of eukaryotic chromosome evolution},

817: {\em Science} {\bf 301} (2003) 793--797.

818:

819: \bibitem{thiery}

820: J.P. Thiery, G. Macaya and G. Bernardi,

821: {\em An analysis of eukaryotic genomes by density gradient centrifugation},

822: {\em Journal of Molecular Biology} {\bf 108} (1976) 219--235.

823:

824: \bibitem{clay}

825: O. Clay, N. Carels, C. Douady, G. Macaya and G. Bernardi,

826: {\em Compositional heterogeneity within and among isochores in mammalian

827: genomes},

828: {\em Gene} {\bf 276} (2001) 15--24.

829:

830: \bibitem{li04}

831: W. Li,

832: {\em Large-scale fluctuation of guanine and cytosine content

833: in genome sequences: isochores and 1/f spectra},

834: in: {\em Progress in Bioinformatics} (Nova Science, 2005).

835:

836: \bibitem{bernardi89}

837: G. Bernardi,

838: {\em The isochore organization of the human genome},

839: {\em Annual Review of Genetics} {\bf 23} (1989) 637--661.

840:

841: \bibitem{bernardi04}

842: G. Bernardi, {\em Structural and Evolutionary Genomics}

843: (Elsevier, 2004).

844:

845:

846: \bibitem{bernardi01}

847: G. Bernardi,

848: {\em Misunderstandings about isochores. Part I},

849: {\em Gene} {\bf 276} (2001) 3--13.

850:

851: \bibitem{ried}

852: T. Ried,

853: {\em Cytogenetics -- in color and digitized},

854: {\em New England Journal of Medicine} {\bf 350} (2004) 1597--1600.

855:

856: \bibitem{saitoh}

857: Y. Saitoh and U.K. Laemmli,

858: {\em From the chromosomal loops and the scaffold to the classic

859: bands of metaphase chromosomes},

860: {\em Cold Spring Harbor Symposium on Quantitative Biology} {\bf 58}

861: (1993) 755--765.

862:

863: \bibitem{coming}

864: D.E. Comings,

865: {\em Mechanisms of chromosome banding and implications for chromosome

866: structure},

867: {\em Annual Review of Genetics} {\bf 12} (1978) 25--46.

868:

869: \bibitem{gojobori}

870: Y. Niimura and T. Gojobori,

871: {\em {\rm In silico} chromosome staining: reconstruction

872: of Giemsa bands from the whole human genome sequence},

873: {\em Proceedings of the National Academy of Sciences} {\bf 99} (2002)

874: 797--802.

875:

876: \bibitem{b85}

877: G. Bernardi, B. Olofsson, J. Filipski, M. Zerial, J. Salinas,

878: G. Cuny, M. Meunier-Rotival and F. Rodier,

879: {\em The mosaic genome of warm-blooded vertebrates},

880: {\em Science} {\bf 228} (1985) 953--958.

881:

882: \bibitem{b91}

883: D. Mouchiroud, G. D'Onofrio, B. Aissani, G. Macaya, C. Gautier

884: and G. Bernardi,

885: {\em The distribution of genes in the human genome},

886: {\em Gene} {\bf 100} (1991) 181--187.

887:

888: \bibitem{b96}

889: S. Zoubak, O. Clay and G. Bernardi,

890: {\em The gene distribution of the human genome},

891: {\em Gene} {\bf 174} (1996) 95--102.

892:

893: \bibitem{ikemura02}

894: Y. Watanabe, A. Fujiyama, Y. Ichiba, M. Hattori,

895: T. Yada, Y. Sakaki and T. Ikemura,

896: {\em Chromosome-wide assessment of replication timing for

897: human chromosomes 11q and 21q: disease-related genes in timing-switch

898: regions},

899: {\em Human Molecular Genetics} {\bf 11} (2002) 13--21.

900:

901: \bibitem{clark}

902: S.M. Fullerton, A. Bernardo Carvalho and A.G. Clark,

903: {\em Local rates of recombination are positively correlated with GC

904: content in the human genome},

905: {\em Molecular Biology and Evolution} {\bf 18} (2001) 1139--1142.

906:

907: \bibitem{gerton}

908: J.L. Gerton, J. DeRisi, R. Shroff, M. Lichten, P.O. Brown and T.D. Petes,

909: {\em Global mapping of meiotic recombination hotspots

910: and coldspots in the yeast {\rm Saccharomyces cerevisiae}},

911: {\em Proceedings of the National Academy of Sciences} {\bf 97} (2000)

912: 11383--11390.

913:

914: \bibitem{mcvean}

915: G.A.T. McVean, S.R. Myers, S. Hunt, P. Deloukas, D.R. Bentley and P. Donnelly,

916: {\em The fine-scale structure of recombination rate

917: variation in the human genome},

918: {\em Science} {\bf 304} (2004) 581--584.

919:

920: \bibitem{yu}

921: A. Yu, C. Zhao, Y. Fan, W. Jang, A.J. Mungall, P. Deloukas, A. Olsen,

922: N.A. Doggett, N. Ghebranious, K.W. Broman and J.L. Weber,

923: {\em Comparison of human genetic and sequence-based physical maps},

924: {\em Nature} {\bf 409} (2001) 951--953.

925:

926: \bibitem{kong}

927: A. Kong {\em et al.},

928: % D.F. Gudbjartsson, J. Sainz, G.M. Jonsdottir, S.A. Gudjonsson,

929: % B. Richardsson, S. Sigurdardottir, J. Barnard, B. Hallbeck, G. Masson,

930: % A. Shlien, S.T. Palsson, M.L. Frigge, T.E. Thorgeirsson, J.R.  Gulcher, K. Stefansson,

931: {\em A high-resolution recombination map of the human genome},

932: {\em Nature Genetics} {\bf 31} (2002) 241--247.

933:

934: \bibitem{jeffreys}

935: A.J. Jeffreys, A. Ritchie and R. Neumann,

936: {\em High resolution analysis of haplotype diversity and meiotic crossover

937: in the human TAP2 recombination hotspot},

938: {\em Human Molecular Genetics} {\bf 9} (2000) 725--733.

939:

940: \bibitem{birdsell}

941: J.A. Birdsell,

942: {\em Integrating genomics, bioinformatics, and classical genetics to

943: study the effects of recombination on genome evolution},

944: {\em Molecular Biology and Evolution} {\bf 19} (2002) 1181--1197.

945:

946: \bibitem{galtier}

947: J.I. Montoya-Burgos, P. Boursot and N. Galtier,

948: {\em Recombination explains isochores in mammalian genomes},

949: {\em Trends in Genetics} {\bf 19} (2003) 128--130.

950:

951: %% \bibitem{li-pre}

952: %% W Li,

953: %% {\em Large-scale patterns in DNA texts},

954: %% preprint (1999),

955: %% {\sl  http://www.nslij-genetics.org/wli/pub/sa\_pre.pdf}

956:

957: \bibitem{duret04}

958: J. Meunier and L. Duret,

959: {\em Recombination drives the evolution of GC-content in the human genome},

960: {\em Molecular Biology and Evolution} {\bf 21} (2004) 984--990.

961:

962: \bibitem{audit}

963: B. Audit and C.A. Ouzounis,

964: {\em From genes to genomes: universal, scale-invariant properties of

965: microbial chromosome organisation},

966: {\em Journal of Molecular Biology} {\bf 332} (2003) 617--633.

967:

968: \end{thebibliography}

969:

970: \end{document}

971: