
1: \documentclass[12pt]{iopart}

2: \usepackage{graphicx, bm, color, cite}

3: \usepackage{iopams}

4: \usepackage[nolists]{endfloat}

5:

6: \newcommand{\transpose}{^\mathrm{T}}

7: \newcommand{\expm}[1]{\exp\!\big(#1\big)}

8: \newcommand{\expt}[1]{\langle #1 \rangle}

9:

10: \newcommand{\be}{\begin{eqnarray}}

11: \newcommand{\ee}{\end{eqnarray}}

12:

13: \newcommand{\beq}{\begin{equation}}

14: \newcommand{\eeq}{\end{equation}}

15:

16: \newcommand{\ta}{\tau_1}

17: \newcommand{\tb}{\tau_2}

18:

19: \begin{document}

20:

21: \title{Gene-history correlation and population structure}

22:

23: \author{A. Eriksson\dag\ and B. Mehlig\ddag}

24:  \address{\dag\ Dept. of Physical Resource Theory, Chalmers and G\"oteborg

25:  University, Sweden}

26:  \address{\ddag\ Dept. of Theoretical Physics, G\"oteborg

27:  University and Chalmers, Sweden}

28:

29: \begin{abstract}

30: Correlation of gene histories in the human genome determines the

31: patterns of genetic variation ({\em haplotype structure}) and is

32: crucial to understanding genetic factors in common diseases. We

33: derive closed analytical expressions for the correlation of gene

34: histories in established demographic models for genetic evolution

35: and show how to extend the analysis to more realistic (but more

36: complicated) models of demographic structure. We identify two

37: contributions to the correlation of gene histories in divergent

38: populations: linkage disequilibrium, and differences in the

39: demographic history of individuals in the sample. These two

40: factors contribute to correlations at different length scales: the

41: former at small, and the latter at large scales. We show that

42: recent mixing events in divergent populations limit the range of

43: correlations and compare our findings to empirical results on the

44: correlation of gene histories in the human genome.

45: \end{abstract}

46:

47: \submitto{Physical Biology}

48: \pacs{89.75.Hc,87.23.Kg,02.50.Ga}

49: %\keywords{Suggested keywords}

50:

51: \maketitle

52:

53:

54: \clearpage \newpage %

55: \section{Introduction}

56: \label{sec:introduction}

57:

58: Populations are shaped by demographic, historical and

59: social factors, determining gene histories in characteristic ways.

60: Empirical data on genetic variation are now routinely interpreted

61: using well-established gene-genealogical models

62: \cite{hudson90,nordborg_tavare02,reich_etal02,hapmap_group03} of

63: the population in question. Local properties of genetic variation

64: (pertaining to {\em loci}, short stretches of a chromosome) in

65: such models are very well understood, by means of models of

66: bottlenecks, population expansion \cite{tajima87a, tajima87b,

67: slatkin_hudson91, sano_etal04}, and migration \cite{wakeley96,

68: teshima_tajima03, stumph_goldstein03}.

69: By contrast, very little is known about global patterns

70: \cite{patil_etal01}.

71: %

72: Global correlation and variation of patterns appear to be the key

73: to understanding the genetic factors contributing to common

74: diseases: there is now a wealth of empirical information on the

75: variation of genetic material in the human genome

76: \cite{snp_group01}. Many common diseases (such as cancer, obesity,

77: cardiovascular disorder and diabetes) are caused by combinations

78: of genetic and environmental factors \cite{hapmap_group03}. In

79: some cases a common variant of a single gene is responsible for

80: specific syndromes. In more complex diseases, however, it may not

81: be possible to link a disease to a single genetic factor. It is

82: thus necessary to understand genome-wide association of genetic

83: factors.

84:

85: Mutations and linkage disequilibrium (explained and illustrated in

86: figure~\ref{fig:samplegenealogy}) determine the genetic history of

87: a population, which in turn shapes the patterns of genetic

88: variation of interest in gene association studies

89: \cite{patil_etal01,hapmap_group03}.

90: %

91: The question is: how strongly are the patterns at two different

92: loci correlated?

93: %

94: Reich \etal \cite{reich_etal02} estimate the empirical association

95: of polymorphism rates, as a function of the physical distance

96: between the loci on the same chromosome, from human population

97: data (compensating for variations in the mutation rate along the

98: chromosome by comparing to the population data from the great

99: apes). Assuming a neutral model with uniform mutation rate, the

100: covariance of polymorphism rates is given by the covariance of the

101: times to the most recent common ancestor of the two loci (cf.

102: figure \ref{fig:samplegenealogy}c).

103: %

104: Kaplan and Hudson \cite{kaplan_hudson85} (see also

105: \cite{hudson83}) analysed the association of polymorphism rates for

106: short loci, within the standard unstructured neutral model. This

107: was further developed by Pluzhnikov and Donnelly

108: \cite{pluzhnikov_donelly96}, who analysed optimal sample sizes for

109: surveying genetic diversity.

110: %

111: Hudson \cite{hudson01} and McVean \etal \cite{mcvean_etal02}

112: estimate the recombination rate likelihood from two-locus sample

113: statistics, based on simulations. Recombination rate likelihoods,

114: conditional on more than two sites, have also been estimated using

115: Monte-Carlo methods

116: \cite{griffiths_marjoram96,kuhner_etal00,nielsen00}. Although

117: statistically powerful, these methods are computationally very

118: demanding.

119: %

120: Linkage disequilibrium is often assessed through summary

121: statistics such as $r^2$ \cite{hill_robertson68} or $D'$

122: \cite{tajima87a}. McVean \cite{mcvean02} introduced an

123: approximation $\sigma^2_d$ of the expected value of $r^2$, and

124: showed that the approximation is accurate, in the absence of

125: demographic structure, if the expectations are taken conditional

126: on intermediate allelic frequencies.

127:

128: In this paper, we derive analytical expressions for the

129: correlation of genetic histories in established models of

130: demographic history (see figure~\ref{fig:pop struct models}a--c)

131: in the limit of negligible selection.

132: These results are of interest for several reasons.

133: First, as explained in the following, they enable us

134: to gain a qualitative understanding of the relative importance

135: of different biological factors determining the empirically

136: observed patterns of linkage disequilibrium. Second,

137: the analytical results summarised in this article

138: can be easily generalised as explained below

139: (see figure~\ref{fig:pop struct models}d,e).

140: Third, our analytical expressions for the decorrelation

141: of gene histories allow for studying the implications

142: of variations of the recombination rate along the chromosomes

143: \cite{kong_etal02,eriksson_mehlig04}.

144: %

145: The remainder of this paper is organised into five parts. We

146: begin by discussing gene-history correlations and linkage

147: disequilibrium in section \ref{sec:gene-history correlations}

148: (see also figure~\ref{fig:samplegenealogy}). In section \ref{sec:methods}

149: we describe our method. We summarise our results in section

150: \ref{sec:results} and discuss their implications in section

151: \ref{sec:discussion}. In section \ref{sec:conclusions} we draw

152: conclusions. Two appendices summarise details of our calculations.

153:

154: %==================================================================

155: \begin{figure}

156:    \centerline{\includegraphics{fig1.eps}}

157: \caption{\label{fig:samplegenealogy}

158: Gene history and polymorphic sites. \textbf{a} In DNA, genetic

159: information is encoded by base-pairs of the four nucleic acids

160: adenine ({\tt A}), thymine ({\tt T}), guanine ({\tt G}), and

161: cytosine ({\tt C}). In a sample of three individuals, we show

162: three polymorphic sites, with two nucleotides around each

163: polymorphism. \textbf{b} The most common variation is a difference

164: at a single position (SNP), caused by a mutation at the position

165: in an individual in the history of the population, where e.g. a

166: fraction of the population has the nucleotide {\tt T} at the site,

167: and the rest has the nucleotide {\tt A}. The three mutations in

168: panel \textbf{a} are shown as filled circles. Mutation 4 does not

169: cause a polymorphism in the sample, since all individuals in the

170: sample inherit the mutation from the common ancestor. Given

171: $\tau$ (the number of generations since the most recent common

172: ancestor) of a stretch of $L$ nucleotides, the number of

173: differences between two individuals is assumed to be Poisson

174: distributed with expected value $2 \mu L \tau$, where $\mu$ is the

175: mutation rate per site per generation \cite{hudson90}. \textbf{c}

176: In recombination, part of a \emph{gamete} (one of the two copies

177: of a chromosome) is inherited from one parent and the rest from

178: the other parent. We show a sample gene history with one

179: recombination event, for two loci ($x$ and $y$) in two gametes

180: $i$ and $j$. The time axis is the same

181: as in panel \textbf{b}. The ancestral histories of

182: loci $x$ and $y$ are shown in blue and red, respectively. The

183: times until the most recent common ancestor are $\tau_{x(ij)}$ and

184: $\tau_{y(ij)}$ for loci $x$ and $y$, respectively. In the absence of

185: recombination, two loci on the same gamete share the same genetic

186: history, and have the same time to the most recent common

187: ancestor, $\tau_{x(ij)} = \tau_{y(ij)}$, causing \emph{linkage

188: disequilibrium}. If a recombination event occurs in the genetic

189: history of a sample, it may lead to a decorrelation of $\tau_{x(ij)}$

190: and $\tau_{y(ij)}$.

191: $x_i$ represents the genetic material at locus $x$ of

192: chromosome $i$. Dashes correspond to genetic material not in the

193: history of the sample, and the diamonds to common ancestral

194: material.

195: }

196: \end{figure}

197: \begin{figure}

198:    \centerline{\includegraphics{fig2.eps}}

199: \caption{\label{fig:pop struct models}

200: Models illustrating demographic history, i.e. changes in

201: population size and structure. \textbf{a} Population bottleneck.

202: \textbf{b},\textbf{c} Models of population structure and

203: expansion. \textbf{d} A more general model of demographic

204: structure. \textbf{e} Demographic structure determining genetic

205: variation in the laboratory-mouse genome \cite{wade_etal02} (time

206: here is measured in years).

207: }

208: \end{figure}

209:

210: \clearpage \newpage %

211: \section{Gene-history correlations, linkage disequilibrium, and

212: patterns of genetic variation}

213: \label{sec:gene-history correlations}

214:

215: Genetic variation is caused by multiple factors. Together,

216: mutations and recombination (figure~\ref{fig:samplegenealogy}) are

217: the most important determinants of the large-scale haplotype

218: structure in the human genome \cite{reich_etal02, patil_etal01,

219: hapmap_group03}. The genetic history of nearby sites is closely

220: related, while distant sites may become unrelated only a few

221: generations in the past.

222:

223: Correlation of gene histories determines the degree of association

224: between patterns of genetic variation at different loci.

225: An example is the correlation of the counts of

226: single-nucleotide polymorphisms (SNPs) at different loci:

227: let $S_{x(ij)}$ be the number of SNPs

228: at locus $x$ between a pair of chromosomes $i$ and $j$.

229: Further, let $\tau_{x(ij)}$ denote

230: the time to the most recent common ancestor of a locus at position

231: $x$ on chromosomes $i$ and $j$, and define $\tau_{y(ij)}$

232: correspondingly for the locus at position $y$.

233: Then the sample covariance of the number of SNPs

234: in non-overlapping loci $x$ and $y$ is

235: related to the covariance of times $\tau_{x(ij)}$ and $\tau_{y(ij)}$ as

236: follows

237: \begin{equation}\label{eq:cov S_a S_b}

238:    \mathrm{cov}[S_{x(ij)},S_{y(ij)}] \approx (2 \mu L)^2 \, \mathrm{cov}[\tau_{x(ij)},\tau_{y(ij)}]\,.

239: \end{equation}

240: Here  $L$ is the size of the loci, assuming variations in the

241: mutation rate $\mu$ along the chromosome are negligible. For

242: (\ref{eq:cov S_a S_b}) to hold, $L$ must be small enough that the

243: sites within each locus have a high degree of linkage (in humans,

244: $L$ must be of the order of or smaller than a few hundred

245: base-pairs).

246:
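A short sketch of why (\ref{eq:cov S_a S_b}) holds, under two assumptions: given $\tau_{x(ij)}$ (in generations), the count $S_{x(ij)}$ is Poisson distributed with mean $2\mu L\tau_{x(ij)}$, as in figure~\ref{fig:samplegenealogy}; and, as an additional assumption made here for illustration, mutations at the two non-overlapping loci are independent given the two gene histories. The law of total covariance then gives
\begin{eqnarray*}
   \mathrm{cov}[S_{x(ij)},S_{y(ij)}]
   &=& \expt{\mathrm{cov}[S_{x(ij)},S_{y(ij)}\,|\,\tau_{x(ij)},\tau_{y(ij)}]} \\
   && +\ \mathrm{cov}\big[\expt{S_{x(ij)}\,|\,\tau_{x(ij)}},\expt{S_{y(ij)}\,|\,\tau_{y(ij)}}\big] \\
   &=& 0 + (2\mu L)^2\,\mathrm{cov}[\tau_{x(ij)},\tau_{y(ij)}]\,.
\end{eqnarray*}
The first term vanishes by the conditional independence, and the second follows from $\expt{S_{x(ij)}\,|\,\tau_{x(ij)}} = 2\mu L\,\tau_{x(ij)}$. In practice the relation is only approximate, because the time to the most recent common ancestor can vary between sites within a locus of finite length.
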

247: Associations between SNPs in the genetic mosaic

248: allow for efficient mapping of genes. Suitably

249: chosen, a relatively small set of SNPs can capture most of the

250: common patterns of variation in the genome \cite{hapmap_group03}.

251:

252: The decay of the covariance $\mbox{cov}[\tau_{x(ij)},\tau_{y(ij)}]$ as a

253: function of $|x-y|$ measures linkage disequilibrium.

254: In the remainder of this section we briefly comment on other

255: common measures of linkage disequilibrium. Global association

256: between patterns of diversity, quantified by the extent of linkage

257: disequilibrium, is often measured by Tajima's $D'$ \cite{tajima87a} or

258: alternatively by

259: \beq

260:    r^2 = \frac{D^2}{f_{A(x)} (1 - f_{A(x)}) f_{B(y)} (1 - f_{B(y)})},

261: \eeq

262: where $D = f_{A(x)B(y)} - f_{A(x)} f_{B(y)}$, $A(x)$ and $B(y)$ are

263: the allelic types at the loci $x$ and $y$, respectively, and

264: $f_{A(x)B(y)}$ is the frequency of alleles $A(x)$ and $B(y)$ on the

265: same chromosome in the sample \cite{tajima87a}. McVean

266: \cite{mcvean02} introduced an approximation to the expected value

267: of $r^2$, called $\sigma^2_d$, which makes the connection to the

268: correlation of gene history explicit. With the notation $E_{ij,kl}

269: = \expt{ \tau_{x(ij)} \tau_{y(kl)}}$,

270: \beq\label{eq:sigma2 def}

271:    \sigma^2_d =

272:    \frac{ (n^2 - 2n + 2)E_{ij,ij} - 2(n-2)^2 E_{ij,ik} + (n-2)(n-3) E_{ij,kl} }

273:         { 2 E_{ij,ij} + 4(n-2) E_{ij,ik} + (n - 2)(n - 3) E_{ij,kl}} \, .

274: \eeq

275: The factors $E_{ij,ij}$ and $E_{ij,ik}$ are defined analogously.

276: For unstructured populations, $\sigma^2_d$ and the expected value

277: of $r^2$ are approximately equal under the neutral dynamics, if

278: the expectation is conditioned on intermediate allelic frequencies

279: \cite{mcvean02}.

280:

281:

282: \clearpage \newpage %

283: \section{Methods}

284: \label{sec:methods}

285:

286: In the following we analyse how correlation of gene histories

287: depends on demographic factors. In a large, unstructured population

288: with constant population size, and when selection is negligible,

289: the ancestral history of a locus may be modeled as a Markov

290: process \cite{griffiths81, hudson_kaplan85, nordborg_tavare02},

291: where the states of the process correspond to different

292: configurations of ancestral DNA through the history of the sample.

293:

294: We trace the ancestral history of two loci (at positions $x$ and

295: $y$) in $n$ individuals, from the present

296: back in time until the most recent common ancestor has been found

297: for all loci. When the population size $N$ is large, the genealogical

298: process may be approximated by the so-called coalescent process \cite{hudson90}.

299: Recombination is modeled as a Poisson

300: process with rate $r$ per generation per chromosome: for any given

301: chromosome, with probability $r$ (also known as the recombination

302: fraction) the loci stem from different parents. The

303: probability that one pair of individuals has a common ancestor in

304: the preceding generation, and the probability that an individual

305: inherits genetic material from both parents, are expanded in

306: $N^{-1}$ to the first order. Time is measured in units of $2N$

307: generations. In the limit of large $N$, the time to the next event

308: is approximately exponentially distributed \cite{hudson90}.

309:

310: By explicitly taking into account the symmetries of the state

311: space of the coalescent for two individuals, we obtain a compact

312: representation of the Markov process

313: (figure~\ref{fig:markovgraph}) which allows us to derive and

314: understand gene-history correlations in the models mentioned

315: in the introduction.

316:

317: We illustrate our approach by re-deriving Hudson's result for the

318: correlation of gene histories in the unstructured, constant

319: population-size coalescent model \cite{hudson83}. Consider a

320: sample of two individuals. Figure~\ref{fig:markovgraph} shows a

321: representation of the coalescent for this case. Each node in the

322: graph corresponds to a configuration of ancestral DNA (listed in

323: the table in figure~\ref{fig:markovgraph}). Due to the symmetries

324: of the coalescent, many different configurations may be mapped

325: onto the same node.

326:

327: \begin{figure}

328: \centerline{

329: \begin{tabular}{@{}ll@{}}

330:    %%%%%%%%%%%%%%%%%%%%%%%%%%

331:    \includegraphics{fig3.eps}

332:    %%%%%%%%%%%%%%%%%%%%%%%%%%

333:    &

334:    %%%%%%%%%%%%%%%%%%%%%%%%%%

335:    \raisebox{2.5cm}{

336:    \begin{tabular}{cl}

337:       \br

338:       \small State $i\ $ & \small Population \\

339:       \mr

340:       \small\raisebox{1.5ex}{$1$} & \small\shortstack{$x_iy_i$,\,$x_jy_j$\\

341:                     $x_iy_j$,\,$x_jy_i$} \\

342:       \mr

343:       \small\raisebox{4ex}{$2$} & \small\shortstack{

344:                     $x_i-$,\,$-y_i$,\,$x_jy_j$\\

345:                     $x_iy_i$,\,$x_j-$,\,$-y_j$\\

346:                     $x_i-$,\,$-y_j$,\,$x_jy_i$\\

347:                     $x_iy_j$,\,$x_j-$,\,$-y_i$

348:                  }\\

349:       \mr

350:       \small$3$ & \small$x_i-$,\,$-y_i$,\,$x_j-$,\,$-y_j$ \\

351:       \mr

352:       \small\raisebox{1ex}{$4$} & \small\shortstack{$x_i\scriptstyle\lozenge$,\,$x_j\scriptstyle\lozenge$\\

353:                     ${\scriptstyle\lozenge}y_i$,\,${\scriptstyle\lozenge}y_j$}\\

354:       \mr

355:       \small$5$ & \small$\scriptstyle\lozenge\lozenge$ \\

356:       \br

357:    \end{tabular}

358:    }

359:    %%%%%%%%%%%%%%%%%%%%%%%%%%

360: \end{tabular}

361: }

362: \caption{\label{fig:markovgraph}

363: A graph representation of the coalescent process for two loci ($x$

364: and $y$) and two chromosomes ($i$ and $j$). The transition rates

365: (measured in units of $2N$ generations) between the different

366: groups of states, corresponding to the table, are printed along

367: the arrows ($R = 4Nr$). The process starts in state $1$ and

368: ends in state $5$, the only absorbing state. If the path goes from

369: state $1$ to state $5$ we have linkage, but if the system enters

370: state $4$ linkage is broken.

371: Same notation as in figure~\ref{fig:samplegenealogy}.

372: }

373: \end{figure}

374:

375: The time evolution of the probability distribution $P_i(t)$ over the states

376: $i$ is given by the master equation

377: \begin{equation}

378:    \partial_t P_i(t) = \sum_j w_{j \rightarrow i} P_j(t) - \sum_j  w_{i \rightarrow j} P_i(t)\,,

379: \end{equation}

380: where $w_{i \rightarrow j}$ is the transition rate from state $i$

381: to state $j$, given in figure~\ref{fig:markovgraph}. As above, time is

382: measured in units of $2N$ generations. The process is started in

383: state $1$, and proceeds until it comes to state $5$. We find that

384: $\langle\tau_{x(ij)}\tau_{y(ij)}\rangle$ is given by the exit rates to state

385: $5$, via states $1$ and $4$. Let $\ta$ be the first time at which a locus

386: coalesces, and $\tb$ be the time when both loci have coalesced.

387: Since $\tau_{x(ij)}\tau_{y(ij)} = \ta\tb$ we obtain

388: \begin{equation}

389: \label{eq:corr}

390:    \left<\tau_{x(ij)}\tau_{y(ij)}\right> =

391:       \int_0^\infty \Big[

392:       {\bm u}_1\transpose\tau_1^2

393:       + {\bm u}_2\transpose\!

394:       \int_{\tau_1}^\infty \tau_1 \tau_2\,

395:       {\rm e}^{\tau_1\!-\!\tau_2} \,\rmd\tau_2

396:       \Big] {\rm e}^{{\bf M} \tau_1} \,{\bm v} \,\rmd\tau_1 \,,

397: \end{equation}

398: where ${\bm v} = {\bm u}_1 = (1,0,0)\transpose$, ${\bm u}_2 =

399: (0,2,2)\transpose$ and $\mathbf{M}$ is a three-by-three matrix

400: defined by $M_{ij} = w_{j \rightarrow i}$ for $i,j = 1,

401: \dots, 3$ and $i \ne j$, and $M_{ii} = - \sum_{j \ne i} w_{i

402: \rightarrow j}$, where the sum runs over all five states so that the exits to states $4$ and $5$ are included. Evaluating (\ref{eq:corr}) we obtain the

403: well-known result \cite{hudson_kaplan85,hudson83}

404: \begin{equation}\label{eq:rho_no_pop_struct}

405:    \rho(\tau_{x(ij)},\tau_{y(ij)}) \equiv

406:    \frac{\left<\tau_{x(ij)}\tau_{y(ij)}\right> - \left<\tau\right>^2}{ \left<\tau^2\right>

407:    -  \left<\tau\right>^2} =  \frac{R + 18}{R^2 + 13 R + 18} \,,

408: \end{equation}

409: where $R = 4Nr$. In order to calculate $\sigma^2_d$ for the

410: unstructured model, we obtain $\expt{\tau_{x(ij)}\tau_{y(ik)}}$

411: and $\expt{\tau_{x(ij)}\tau_{y(kl)}}$ from (\ref{eq:corr}) with

412: ${\bm v} = (0,1,0)\transpose$ and ${\bm v} = (0,0,1)\transpose$,

413: respectively. Inserting these into eq.~(\ref{eq:sigma2 def}), we recover

414: the result of McVean \cite{mcvean02}:

415: \beq

416:    \sigma^2_d =

417:  \frac{2\,( 6 + R )  + n\,( 10 + 11 R + R^2 ) +  n^2 ( 10 + R )  }

418:   {2\,( 6 + R )  - n\,( 14 + 13 R + R^2 )  + n^2 ( 22 + 13 R + R^2 ) }.

419: \eeq

420:
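As a simple consistency check on the two expressions above (a sketch based only on the formulas as stated, not an independent derivation): equation (\ref{eq:rho_no_pop_struct}) gives $\rho = 1$ at $R = 0$ and $\rho \approx 1/R$ for $R \gg 1$. For $\sigma^2_d$, dividing numerator and denominator by $n^2$ and letting $n \rightarrow \infty$ leaves only the terms proportional to $n^2$,
\[
   \sigma^2_d \rightarrow \frac{10 + R}{22 + 13 R + R^2}\,,
\]
which equals $5/11$ at $R = 0$ and also decays as $1/R$ for $R \gg 1$. The same limit follows directly from (\ref{eq:sigma2 def}), which for large $n$ reduces to $(E_{ij,ij} - 2 E_{ij,ik} + E_{ij,kl})/E_{ij,kl}$.
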

421: In the following, we consider models corresponding to Markov processes with rates which are

422: piece-wise constant functions

423: of time $t$. This allows us to calculate

424: $\langle\tau_{x(ij)}\tau_{y(ij)}\rangle$ from (\ref{eq:corr}) by taking

425: $\mathbf{M}$ and ${\bm u}$ to be functions of time.

426:

427:

428: \clearpage \newpage %

429: \section{Results}

430: \label{sec:results}

431:

432: After having illustrated our approach, we now briefly describe

433: the demographic models we have considered and summarise our results

434: for gene-history correlations in these models. Mathematical details

435: are given in appendices A and B. Implications are discussed

436: in section \ref{sec:discussion}.

437:

438: \subsection{Bottleneck model}

439: \label{sec:bottleneck_model}

440:

441: Consider (cf.~\cite{eyre-walker_etal98}) an unstructured

442: population of constant size $N$ until $\tau_0 = 2 N G$ generations

443: ago. The population was then subject to a severe bottleneck of

444: short duration, followed by a rapid expansion to a very large

445: (infinite) population size (figure~\ref{fig:pop struct models}a).

446: Between the bottleneck and now, the population size is taken to be

447: effectively infinite, and thus the probability that two randomly

448: sampled individuals have a common ancestor more recent than the bottleneck

449: is negligible. Since the bottleneck is very narrow and has a short

450: duration, we may ignore the effect of recombination during the

451: bottleneck. It is convenient to parameterise the duration of the

452: bottleneck in terms of the probability $F$ that a single locus

453: coalesces during the bottleneck. In the limit when both the

454: population size and duration of the bottleneck are small (compared

455: to $2N$ individuals and generations, respectively), we obtain

456: (appendix A):

457: \begin{equation}\label{eq:rho_bottleneck}

458:    \rho(\tau_{x(ij)},\tau_{y(ij)})  = \frac{A + B\,e^{-R G/2} + C\,e^{-R G}}{15\,

459:    (2 - h)\,(18 + 13\,R + R^2 ) }\,,

460: \end{equation}

461: where $h = 1 - F$ and

462: \begin{eqnarray}

463:    A &=& 6 ( 36 - 45 h + 20 h^2 - h^5 )

464:                 + 3 ( 28 - 65 h +\nonumber\\&&+\ 40 h^2 - 3 h^5 ) R

465:                 + {( 1 - h ) }^3 ( 6 + 3 h + h^2) R^2 \,, \\

466:    B &=& 12( 9 - 5 h^2 + h^5 ) + ( 3 - 5 h^2 + 2 h^5 ) R^2\nonumber\\

467:                &&+\ 6 ( 7 - 10 h^2 + 3 h^5 ) R \,, \\

468:    C &=& 6 ( 36 - 10 h^2 - h^5 ) + ( 6 - 5 h^2 - h^5 ) R^2  \nonumber\\

469:                &&+\ 3 ( 28 - 20 h^2 - 3 h^5 ) R \,.

470: \end{eqnarray}

471: We thus find

472: that this model exhibits correlations at arbitrarily large values of $R$,

473: a consequence of an infinite expansion rate after the bottleneck,

474: and negligible recombination within it. If, instead, the expansion

475: were to a finite population size (smaller than $GN$, say), the

476: correlations would still converge to a constant at large $R$. The

477: constant, however, is expected to be lower than the asymptotic

478: value obtained from (\ref{eq:rho_bottleneck}) as $R\rightarrow\infty$. Finally, if the

479: bottleneck lasts long enough for significant recombination to

480: occur within it, we still find long-range correlations, up to

481: scales of the order of $(2\tau_{\rm D}r)^{-1}$ where $\tau_{\rm

482: D}$ is the duration of the bottleneck (in generations). Beyond

483: this, the correlations decay, and in the limit $R\rightarrow\infty$

484: we have $\rho(\tau_{x(ij)},\tau_{y(ij)})\rightarrow 0$ as in the

485: unstructured population model.

486:
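Two limits of (\ref{eq:rho_bottleneck}) can be read off directly from the coefficients $A$, $B$ and $C$; the following is a sketch that uses only the expressions as stated above. At $R = 0$ the exponentials equal unity and the constant terms add up to $A + B + C = 270\,(2 - h)$, which equals the denominator $15\,(2-h)\cdot 18$, so that $\rho = 1$ as expected for completely linked loci. For $R \rightarrow \infty$ at fixed $G > 0$ the terms proportional to $B$ and $C$ vanish, and the ratio of the coefficients of $R^2$ gives
\[
   \rho(\tau_{x(ij)},\tau_{y(ij)}) \rightarrow
   \frac{(1-h)^3\,(6 + 3h + h^2)}{15\,(2-h)}\,.
\]
This asymptotic value equals $1/5$ for $h = 0$ (every locus coalesces in the bottleneck, $F = 1$) and vanishes for $h = 1$ (no coalescence in the bottleneck, $F = 0$).
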

487: By the same approach, we calculate

488: $\expt{\tau_{x(ij)}\tau_{y(ik)}}$ and $\expt{ \tau_{x(ij)}

489: \tau_{y(kl)}}$. Inserting this into (\ref{eq:sigma2 def}) yields,

490: for large $n$:

491: \begin{eqnarray}\label{eq:sigma2_bottleneck}

492:    \sigma^2_d &=& \frac{e^{-G\,R}}{\expt{ \tau_{x(ij)} \tau_{y(kl)}}} \Big[ 18\,h\,( 36 - 10\,h^2 - h^5)  +

493:    9\,h\,( 28 - 20\,h^2 - 3\,h^5) \,R + \nonumber\\&& 3\,h\,( 6 - 5\,h^2 - h^5) \,R^2 \Big] \, ,

494: \end{eqnarray}

495: where

496: \begin{eqnarray}

497:   \expt{ \tau_{x(ij)} \tau_{y(kl)}} &=& 18\,( 45\,G^2 + 36\,h + 90\,G\,h + 20\,h^3 - h^6 )  +\nonumber\\&&

498:   9\,( 65\,G^2 + 28\,h + 130\,G\,h + 40\,h^3 - 3\,h^6) \,R +\nonumber\\&&

499:   ( 45\,G^2 + 18\,h + 90\,G\,h + 30\,h^3 - 3\,h^6) \,R^2 \, .

500: \end{eqnarray}

501: Note that $\sigma^2_d \rightarrow 0$ as $R \rightarrow \infty$.

502: In particular, the difference to expression (7) is not large.

503: Hence, when the aim is to detect the population-size variations it

504: is better to focus on single-locus statistics.

505:

506: \subsection{Model of divergent populations, I}

507: \label{sec:div_model_1}

508:

509: Reich {\em et al.} consider a model of a diverging population

510: \cite{reich_etal02}: the population was unstructured with constant

511: population size $N$ until $\tau_0 =2 N G$ generations ago, when

512: the population split into two parts of equal size $N$ (note

513: that this implies a rapid population expansion from $N/2$ to $N$

514: after the split). The model is illustrated in figure~\ref{fig:pop

515: struct models}c. A portion $p$ of the sample is chosen from the

516: first population, and the rest from the second population. For any

517: two individuals in the sample, the expectation

518: $\expt{\tau_{x(ij)}\tau_{y(ij)}}$ depends on whether the

519: individuals come from the same sub-population or not. Using the

520: technique illustrated above, it is straightforward to calculate

521: the expectation for both cases. Again, we find long-range

522: correlations, namely

523: \begin{equation}\label{eq:corr_model_2c}

524:    \rho(\tau_{x(ij)},\tau_{y(ij)})

525:    = 1 - \frac{1}{1 + 2\,p\,(1-p)\,(1 - 2\,p + 2\,p^2)\,G^2} \,,

526: \end{equation}

527: in the limit of large $R$ (in appendix B we describe how to

528: obtain the full result, valid for arbitrary values of $R$).

529:
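The large-$R$ limit (\ref{eq:corr_model_2c}) can be understood from a short argument, sketched here under the following assumptions: the sample is large, so that a random pair comes from different sub-populations with probability $q = 2\,p\,(1-p)$; for $R \rightarrow \infty$ the two loci are independent once it is known whether the pair is of the same or of different origin; and time is measured in units of $2N$ generations. A pair from the same sub-population then has an exponentially distributed $\tau$ with unit mean and unit variance, while for a pair from different sub-populations $\tau$ is shifted by $G$. Only the dependence of the conditional mean on the type of pair contributes to the covariance,
\begin{eqnarray*}
   \mathrm{cov}[\tau_{x(ij)},\tau_{y(ij)}] &=& q\,(1-q)\,G^2\,, \\
   \mbox{var}[\tau] &=& 1 + q\,(1-q)\,G^2\,,
\end{eqnarray*}
so that $\rho = 1 - [1 + q\,(1-q)\,G^2]^{-1}$. Since $q\,(1-q) = 2\,p\,(1-p)\,(1 - 2\,p + 2\,p^2)$, this reproduces (\ref{eq:corr_model_2c}).
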

530: Further, in the limit of large $R$ and large sample size $n$, we have

531: \beq\label{eq:sigma2_model_2c}

532:    \sigma^2_d = \frac{2\, p^2\,(1 - p)^2\,G}{ 1 + 2\,p\,(1-p)\,G} .

533: \eeq

534: Thus, for this model $\sigma^2_d$ is finite in the limit of large

535: $R$, as opposed to $\sigma^2_d$ in the unstructured model (section

536: \ref{sec:methods}) and the bottleneck model

537: (section \ref{sec:bottleneck_model}).

538:

539: \subsection{Model of divergent populations, II}

540: \label{sec:div_model_2}

541:

542: Now consider the model of two diverging sub-populations

543: \cite{eyre-walker_etal98} in figure~\ref{fig:pop struct models}b.

544: The population was unstructured with constant size of $N$

545: individuals until $\tau_0=2 N G$ generations ago, when a fraction

546: $\gamma$ of the population diverged. In subsequent generations,

547: the two sub-populations were unstructured but with no contact

548: between sub-populations. Individuals are randomly chosen from the

549: joint population. For two individuals in the sample, there are

550: three cases: both individuals may come from the smaller

551: sub-population, they may come from the larger sub-population, or

552: from different sub-populations. Using equation (\ref{eq:corr}) we

553: find long-range correlations: in the limit of large $R$,

554: $\rho$ remains finite,

555: \begin{eqnarray}\label{eq:corr_model_2b}

556:    \rho(\tau_{x(ij)},\tau_{y(ij)})  &=& \frac{1}{\mbox{var}[\tau]} \big[

557:    1 - 2 s + 2 s^2 + 2 G\left( 2 + G \right) s +

558:    s^2  {\rm e}^{-\frac{2 G}{\gamma }} +\\

559:    &&s^2 {\rm e}^{-\frac{2 G}{1 - \gamma }} +

560:    2 s {\left( 1 - \gamma  \right) }^2  {\rm e}^{-\frac{G}{1 - \gamma }} +

561:    2 s {\gamma }^2  {\rm e}^{-\frac{G}{\gamma }}  - \left<\tau\right>^2

562:    \big]\,

563:    \nonumber

564: \end{eqnarray}

565: where $s = \gamma\,(1-\gamma)$ and

566: \begin{eqnarray}

567:    \left<\tau\right> &=& 1 + s (2 G - 1) + s \gamma {\rm e}^{-\frac{G}{\gamma}}

568:    + s (1 - \gamma) {\rm e}^{-\frac{G}{1 - \gamma}}\, \\

569:    \mbox{var}[\tau] &=&  2 + 2 s  \big[ 2 s  + (G + 1)^2 +

570:        \gamma (1 + G + \gamma) {\rm e}^{-\frac{G}{\gamma}}

571:        +\nonumber\\&&+\ (1 - \gamma) (2 + G - \gamma)

572:        {\rm e}^{-\frac{G}{1 - \gamma}} - 3 \big] - \left<\tau\right>^2\,.

573: \end{eqnarray}

574: See the appendix for the full result. The long-range correlations

575: are found to be due to sampling of different sub-populations.

576:
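The expression for $\left<\tau\right>$ above can be checked by conditioning on the origin of the pair. The sketch below assumes that the diverged sub-population has size $\gamma N$ and the remaining one $(1-\gamma) N$, so that in units of $2N$ generations pairs coalesce at rates $1/\gamma$ and $1/(1-\gamma)$ within the sub-populations, and at unit rate in the ancestral population; the subscripts s, l and d (same smaller, same larger, different sub-populations) are introduced here for illustration only. Integrating the probability that the pair has not yet coalesced gives
\begin{eqnarray*}
   \left<\tau\right>_{\rm s} &=& \int_0^G {\rm e}^{-t/\gamma}\,{\rm d}t + {\rm e}^{-\frac{G}{\gamma}}
      = \gamma + (1-\gamma)\,{\rm e}^{-\frac{G}{\gamma}}\,, \\
   \left<\tau\right>_{\rm l} &=& (1-\gamma) + \gamma\,{\rm e}^{-\frac{G}{1-\gamma}}\,, \\
   \left<\tau\right>_{\rm d} &=& G + 1\,.
\end{eqnarray*}
Weighting these with the probabilities $\gamma^2$, $(1-\gamma)^2$ and $2 s$ of the three cases, and using $\gamma^3 + (1-\gamma)^3 = 1 - 3 s$, yields
\[
   \left<\tau\right> = 1 + s\,(2 G - 1) + s\,\gamma\,{\rm e}^{-\frac{G}{\gamma}}
   + s\,(1-\gamma)\,{\rm e}^{-\frac{G}{1-\gamma}}\,,
\]
in agreement with the expression given above. In the limit of large $R$ the two loci are independent given the type of pair, so that $\expt{\tau_{x(ij)}\tau_{y(ij)}}$ is the same weighted sum with each conditional mean squared; this is the origin of the bracket in (\ref{eq:corr_model_2b}).
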

577: In the limit of large $R$ and large sample size, we have

578: \beq\label{eq:sigma2_model_2b}

579:    \sigma^2_d = \frac{\gamma^2 (1 - \gamma)^2}{\expt{\tau}^2} \left[ 2\,G + \gamma\,(1 - \rme^{-\frac{G}{1-\gamma}}) + (1 - \gamma)(1 - \rme^{-\frac{G}{\gamma}}) \right]^2 .

580: \eeq

581: Again, we find that $\sigma^2_d$ is finite in the limit of large

582: $R$.

583:

584:

585: \clearpage \newpage %

586: \section{Discussion}

587: \label{sec:discussion}

588:

589: Figure~\ref{fig:pop struct results} shows the correlations

590: $\rho(\tau_{x(ij)},\tau_{y(ij)})$ in the demographic models

591: considered, with parameters chosen to be consistent with the

592: empirically estimated time to the most recent common ancestor and

593: its coefficient of variation \cite{reich_etal02}.

594: %

595: When plotting the correlation of gene histories against physical

596: positions, we need to translate the recombination fraction $r$

597: into the corresponding expected number $\sigma x$ of crossover

598: events between the two loci. There are many such maps proposed in

599: the literature (see e.g. \cite{mcpeek_speed95} for a review of

600: these). They differ in how they model the chiasma process, but all

601: models have in common that for small enough $r$, $r \approx \sigma

602: x$. In humans, $r \approx \sigma x$ for $x \lesssim 10^6$\,bp. At

603: larger distances, deviations from linearity are not noticeable

604: since the expressions for $\rho(\tau_{x(ij)},\tau_{y(ij)})$ and

605: $\sigma^2_d$ converge for large $R$ (to different values, in

606: general).

607: %

608: Also shown are empirical estimates of lower and upper bounds on

609: the correlation of gene histories in the human genome

610: \cite{reich_etal02}. The correlations for the models described in

611: section \ref{sec:results} are substantially larger at large

612: distances than those for the unstructured model, but they lie

613: significantly below the lower bound of the empirical data at

614: intermediate distances. We comment on possible causes for

615: this discrepancy in our conclusions.

616:

617: \begin{figure}

618:    \centerline{\includegraphics{fig4.eps}}

619: \caption{\label{fig:pop struct results}

620: Correlation $\rho(\tau_{y(ij)},\tau_{(y+x)(ij)})$ of gene histories as a

621: function of the distance $x$ between them. Equations

622: (\ref{eq:rho_no_pop_struct}), (\ref{eq:rho_bottleneck}), and exact

623: expressions corresponding to (\ref{eq:corr_model_2c}) and

624: (\ref{eq:corr_model_2b}), from the appendix, were used. In all

625: cases, $r = 1.2$ cM/Mb; $N$ and $\mu$ were chosen to be

626: consistent with $2N\left<\tau\right> = 1.55\times 10^4$, and a

627: coefficient of variation of $0.94$ \cite{reich_etal02} (except in

628: the unstructured model). The lines are: the unstructured

629: coalescent (dashed), bottleneck model with $H = 0.1$ (red),

630: divergent model in figure~\ref{fig:pop struct models}b with

631: $\gamma = 0.2$ (blue), and divergent model in figure~\ref{fig:pop

632: struct models}c with $p = 0.3$ (green). Also shown are empirical

633: estimates of lower and upper bounds for the correlation of gene

634: histories in the human genome (squares) \cite{reich_etal02}.

635: }

636: \end{figure}

637:

638: Our results allow us to gain a qualitative understanding

639: of the influence of demographic factors on the decorrelation

640: of gene histories.

641: First, we find that models of bottlenecks and divergent

642: populations  (figure~\ref{fig:pop struct models}) both exhibit

643: long-range correlations in gene histories, as numerically

644: demonstrated in \cite{reich_etal02}, but for very different

645: reasons. In bottlenecks, the length scale at which we find

646: significant correlations is governed by the degree of

647: recombination

648: within the

649: bottleneck: low recombination in the bottleneck gives rise to

650: long-range correlations. Further, the amount of correlation is

651: affected by the rate of expansion of the population after the

652: bottleneck: rapid expansion gives high correlations. Long-range

653: correlation in divergent models, on the other hand, we ascribe to the

654: fact that the covariance of $\tau_{x(ij)}$ and $\tau_{y(ij)}$ (that is, the

655: number of generations since the common ancestor of two copies of

656: loci $x$ and $y$) is different when individuals are selected from

657: the same or different sub-populations: typically, the covariance

658: is lower for individuals from the same sub-population than from

659: different ones. We find that this effect persists even for loci

660: far apart, but is decreased by population expansions during the

661: divergence.

662:

663: Second, we identify two contributions to the correlation of gene

664: histories in divergent populations: linkage disequilibrium and the

665: sampling of sub-populations with different demographic histories.

666: At short ranges, linkage disequilibrium correlates nearby patterns

667: by co-inheritance. Thus, for small distances, we conclude that the

668: demographic structure is unimportant: all reasonable models must

669: give high correlation for small distances. For long ranges, by

670: contrast, correlations due to linkage disequilibrium are expected

671: to vanish, but the contribution from differences in gene history

672: across sub-populations remains.

673:

674: Third, the domestication of crops and animals has shaped the

675: genetic makeup of the species, through selection for desirable

676: traits but also through the demographic history of each species

677: \cite{eyre-walker_etal98}. The pattern of genetic differences in

678: the laboratory mouse population depends strongly on its

679: demographic history \cite{wade_etal02}. In divergent populations,

680: we find that long-range correlations are insensitive to the

681: demographic history of the sub-populations. As a consequence, we

682: predict that the most important contribution to the correlation of

683: gene history in the laboratory mouse is from the original

684: divergence from the wild-type mouse.

685:

686: Fourth, we found that within the models described in

687: section~\ref{sec:results}, gene-history correlations are

688: substantially increased as compared with the unstructured,

689: standard model. However, the correlations still lie significantly

690: below the empirically determined data at intermediate distances.

691: In \cite{eriksson_mehlig04} it was shown that incorporating

692: empirically observed variations in the recombination-rate along

693: the chromosomes \cite{kong_etal02} significantly increases the

694: correlations in this regime.

695: Our analytical expressions for the correlation of gene

696: histories allow for studying the effect  of such variations in the

697: recombination rate in models with demographic population structure.

698:

699: Fifth,  we briefly mention possible extensions of the scheme introduced

700: in this paper.

701: In more general sampling schemes (different from those depicted

702: in figure~\ref{fig:pop struct models}), we may use the expressions for

703: $\left<\tau_{x(ij)}\,\tau_{y(ij)}\right>$ conditional on whether the

704: individuals in the sample came from the same sub-population or

705: not, and conditional on the population size during the divergence,

706: to calculate the correlation of gene histories by weighting the

707: different contributions by the probability that they occur under

708: the sampling scheme. Also, it is straightforward to extend the

709: calculations to combinations of bottlenecks and divergent

710: populations (figure~\ref{fig:pop struct models}d), and to more

711: complicated models involving more than two diverging branches

712: (figure~\ref{fig:pop struct models}e). It is expected that the

713: most distant (symmetric) divergence determines the long-range

714: correlations.

715:

716: How would a recent mixing event (figure~\ref{fig:pop struct

717: models}e) affect the correlation of gene histories? A merging of

718: the divergent populations $g$ generations ago leads to a

719: decorrelation of gene histories at distances of the order of $(4 g

720: r)^{-1}$, since then ancestral lines of both loci may come from

721: different sub-populations with approximately equal probability.
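
To give a feeling for the length scale $(4 g r)^{-1}$, the following
arithmetic uses the recombination rate of $1.2$ cM/Mb quoted in the caption
of figure~\ref{fig:pop struct results}, that is $r = 1.2\times 10^{-8}$ per
bp and generation. The values of $g$ are hypothetical and serve only as an
illustration:
\[
   (4 g r)^{-1} \approx \left\{\begin{array}{ll}
      2.1\times 10^{5}\,\mbox{bp} & \mbox{for $g = 10^2$} \\
      2.1\times 10^{4}\,\mbox{bp} & \mbox{for $g = 10^3$}
   \end{array}\right.
\]
Under these assumptions, the more recent the mixing event, the longer the
distances over which the correlations survive.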

722:

723: Finally, we have argued that the correlation $\rho(\tau_{x(ij)},\tau_{y(ij)})$ of gene

724: histories determines the association of SNP counts,

725: $\mbox{cov}[S_{x(ij)},S_{y(ij)}]$. Conversely, one may be interested

726: in estimating model

727: parameters from population data, deducing

728: $\rho(\tau_{x(ij)},\tau_{y(ij)})$

729: from the pairwise statistic $\mbox{cov}[S_{x(ij)},S_{y(ij)}]$.

730: Three questions arise. First, how can one in practice estimate $\mbox{cov}[\tau_{x(ij)},\tau_{y(ij)}]$

731: from the variance of SNP counts? Second,

732: how good is this estimate? Third, how much of

733: the information in the full data set (possibly pertaining to a large

734: number of individuals) is retained in the pair-wise statistic

735: $\mbox{cov}[S_{x(ij)},S_{y(ij)}]$?

736: We begin by answering the last question.

737: Due to the high amount of association between the chromosomes in a

738: sample, the information on genealogical history accumulates slowly as the

739: sample size is increased \cite{hudson01}. It follows that most

740: information can be found in pair-wise comparisons between the

741: chromosomes in the sample as used in eq.~(\ref{eq:cov S_a S_b}).

742: Going back to the first two questions, an estimator for

743: $\rho(\tau_{y(ij)},\tau_{(y+x)(ij)})$ can be

744: constructed as follows.

745: Assuming that the length $L_\mathrm{c}$

746: of the sequences is long, we can estimate the correlation of

747: polymorphism rates by averaging over all pairs and positions:

748: \begin{equation}\label{eq:estimate_rho}

749:    \rho(\tau_{y(ij)},\tau_{(y+x)(ij)}) \approx \hat{\rho}(x) =  \frac{\overline{S_y S_{y+x}} - \overline{S_y}^2}{\overline{S_y^2} - \overline{S_y}^2  - \overline{S_y}},

750: \end{equation}

751: where

752: \begin{equation}\label{eq:sequence_average_def}

753:    \overline{S_y S_{y+x}} = \frac{2}{n(n-1)(L_\mathrm{c} - x - L)}

754:    \sum_{i=2}^n \sum_{j=1}^{i-1} \sum_{y=1}^{L_\mathrm{c}-x-L} S_{y(ij)} S_{(y+x)(ij)} \,,

755: \end{equation}

756: and the single-locus quantities $\overline{S_y}$ and

757: $\overline{S_y^2}$ are defined similarly. Instead of regularly

758: spaced bins, as in (\ref{eq:sequence_average_def}), one may use

759: randomly positioned bins. For unstructured populations, and for

760: populations with bottlenecks and expansions, the accuracy of the

761: estimator $\hat{\rho}(x)$ depends mostly on the number of bins

762: (and hence on $L_\mathrm{c}$), and improves only slowly with

763: increasing $n$. For divergent models, however, increasing $n$

764: improves the sampling from the different sub-populations. In

765: figure~\ref{fig:estimate_rho} we show how $\hat{\rho}(x)$ compares

766: to $\rho(\tau_{y(ij)},\tau_{(y+x)(ij)})$ when applied to a sample. As can be

767: seen in the figure, when $x < L$ the bins overlap and

768: $\hat{\rho}(x)$ overestimates the correlations, but

769: otherwise it works well.
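
The form of the denominator in (\ref{eq:estimate_rho}) can be motivated by
the following sketch. It assumes, as is standard for neutral mutations in
the infinite-sites model, that conditional on the gene history the SNP
count in a bin is Poisson distributed with mean $c\,\tau_{y(ij)}$, and that
counts in non-overlapping bins are conditionally independent. The constant
$c$ is a placeholder introduced here for the product of mutation rate, bin
length and time scale. Then
\be
   \expt{S_y} &=& c\,\expt{\tau} , \nonumber\\
   \mbox{var}[S_y] &=& c\,\expt{\tau} + c^2\,\mbox{var}[\tau] , \nonumber\\
   \mbox{cov}[S_y, S_{y+x}] &=& c^2\,\mbox{cov}[\tau_y, \tau_{y+x}] , \nonumber
\ee
so that
\[
   \rho(\tau_y, \tau_{y+x}) = \frac{\mbox{cov}[S_y, S_{y+x}]}{\mbox{var}[S_y] - \expt{S_y}} .
\]
Subtracting $\overline{S_y}$ in the denominator thus removes the Poisson
part of the variance of the SNP counts. For overlapping bins ($x < L$) the
two counts share mutations, the assumption of conditional independence
fails, and the covariance in the numerator is inflated.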

770:

771: \begin{figure}

772:    \centerline{\includegraphics{fig5.eps}}

773: \caption{\label{fig:estimate_rho} %

774: Comparison of $\hat{\rho}(x)$ (markers) to

775: $\rho(\tau_{y(ij)},\tau_{(y+x)(ij)})$ (solid lines, calculated from theory),

776: for an unstructured population (red) and a divergent population

777: (blue). The estimates $\hat{\rho}(x)$ were obtained from a single

778: sample of 50 individuals, with $L_\mathrm{c} = 10$Mb, for

779: different bin sizes $L = 100$bp (diamonds), $L = 500$bp (circles)

780: and $L = 1$kb (squares). The parameters for the divergent model

781: are: $G = 0.6$, $p = 0.3$, $N = 6963.7$, $r = 0.95633$cM/Mb,

782: $\theta = 7.6\times 10^{-4}$. In the unstructured population model, the

783: population size is $N = 10^4$.

784: }

785: \end{figure}

786:

787:

788: \clearpage \newpage %

789: \section{Conclusions and outlook}

790: \label{sec:conclusions}

791:

792: We have derived closed analytical expressions for the correlation

793: of gene histories in established demographic models for genetic

794: evolution. These expressions allow us to understand and quantitatively

795: determine how demographic factors give rise to long-range

796: correlations in gene histories.

797:

798: The correlations analysed here determine

799: the two-person summary statistic (\ref{eq:cov S_a S_b}).

800: More information is contained in the mosaics of SNP

801: haplotype patterns for more than two individuals, and their

802: associations \cite{hudson01}. It is of great interest to derive

803: corresponding expressions for correlations between such patterns

804: in the models considered in this paper, especially

805: in the case of more than two loci.

806: Finally we note that the

807: quantity $\sigma_d^2$, a measure of linkage disequilibrium, was

808: shown to be a good approximation to $r^2$ in the case of

809: unstructured populations \cite{mcvean_etal02}. It is necessary to

810: investigate the relation between $r^2$ and $\sigma^2_d$ in models

811: with demographic structure.

812:

813:

814:

815: \clearpage \newpage %

816: { \appendix

817:

818: \section*{Appendix A: Derivation of bottleneck formula}

819: \setcounter{section}{1}

820: \label{app:bottleneck formula}

821:

822: During the bottleneck, the time between coalescent events is

823: exponentially distributed with rate ${n \choose 2}/(2\,\gamma N)$,

824: where $n$ is the number of lines carrying ancestral material.

825: Recombination events occur with rate $n\,R/(4N)$, independent of

826: $\gamma$. Thus when $\gamma$ is very small, coalescent events

827: dominate the process.

828:

829: We assume that during the bottleneck, the reduction in effective

830: population size is so drastic that $\gamma$ is effectively zero.

831: By rescaling the time by a factor of $\gamma$ and taking the limit

832: of $\gamma \rightarrow 0$ we find

833: \be

834:    \mathbf{M}' = \lim_{\gamma \rightarrow 0} \mathbf{M}(\gamma)\,\gamma =

835:    \left[\begin{array}{rrr}

836:          -1 & 1 & 0 \\

837:          0 & -3 & 4 \\

838:          0 & 0 & -6

839:    \end{array}\right],

840: \ee

841: so the time evolution operator becomes

842: \be

843:    \exp(\mathbf{M}'\,t) =

844:    \left[\begin{array}{rrr}

845:       e^{-t} &  \frac{1}{2}\,e^{-t} - \frac{1}{2}\,e^{-3 t} & \frac{2}{5}\,e^{-t} - \frac{2}{3}\,e^{-3 t} +  \frac{4}{15}\,e^{-6 t}\\

846:       0 & e^{-3 t} &  \frac{4}{3}\,e^{-3 t} - \frac{4}{3}\,e^{-6 t}\\

847:       0 & 0 & e^{-6 t} \\

848:   \end{array}\right] .

849: \ee

850: In the original model, the inbreeding coefficient $F$ was

851: specified. We choose to parameterise the severity of the

852: bottleneck by its duration $D$. If the process is in state $1$

853: (figure~3)

854: when entering the bottleneck, the probability of coalescence

855: during the bottleneck is

856: \be

857:    \int_0^D {\bm u}_1\transpose\,\rme^{\mathbf{M}'\,t}\,{\bm u}_1 \,\rmd t = 1 - \rme^{-D},

858: \ee

859: so we see that by taking $D = -\ln(1 - F)$, we get the correct

860: inbreeding coefficient. We can now express the time evolution

861: operator from the beginning to the end of the bottleneck as

862: \be\label{eq:propagator_bn}

863:    \exp(\mathbf{M}'\,D) =

864:    \left[\begin{array}{rrr}

865:       H & \frac{1}{2}\,H\,(1 - H^2) & \frac{2}{15}\,H\,( 3 - 5\,H^2 + 2\,H^5 ) \\

866:       0 & H^3 & \frac{4}{3}\,H^3\,(1 - H^3) \\

867:       0 & 0 & H^6

868:    \end{array}\right] ,

869: \ee

870: where $H = 1 - F$. The probability that the loci become linked

871: during the bottleneck depends on the state of the process when the

872: bottleneck is entered:

873: \be\label{eq:prob_linked_bn}

874:    \int_0^D {\bm u}_1\transpose\, \rme^{\mathbf{M}'\,t} \,\rmd t =

875:    \left\{\begin{array}{ll}

876:       F & \mbox{in state $1$} \\

877:       \frac{1}{6}\,( 2 + H )\,F^2 & \mbox{in state $2$} \\

878:       \frac{2}{45}\,( 5 + 6\,H + 3\,H^2 + H^3 )\,F^3 &  \mbox{in state $3$}

879:    \end{array}\right.

880: \ee
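
As an illustration of how the entries of (\ref{eq:prob_linked_bn}) arise,
the following sketch works out the case of state $2$, using only the
matrix exponential given above and $H = \rme^{-D}$:
\[
   \int_0^D \left( \frac{1}{2}\,\rme^{-t} - \frac{1}{2}\,\rme^{-3 t} \right) \rmd t
   = \frac{1}{2}\,(1 - H) - \frac{1}{6}\,(1 - H^3)
   = \frac{1}{6}\,(2 - 3\,H + H^3)
   = \frac{1}{6}\,(2 + H)\,(1 - H)^2 ,
\]
which is the quoted result since $F = 1 - H$. For the value $H = 0.1$ used
for the bottleneck model in figure~\ref{fig:pop struct results}, this
evaluates to approximately $0.28$, compared with $F = 0.9$ when the
process enters the bottleneck in state $1$.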

881: Similarly, we have the probability that one locus, but not the

882: other, reaches its most recent common ancestor during the

883: bottleneck, depending on the state of the process when entering

884: the bottleneck:

885: \be\label{eq:prob_apart_bn}

886:    \int_0^D {\bm u}_2\transpose\, \rme^{\mathbf{M}'\,t} \,\rmd t =

887:    \left\{\begin{array}{ll}

888:       0  & \mbox{in state $1$}\\

889:       \frac{2}{3} \,( 1 - H^3) & \mbox{in state $2$} \\

890:        \frac{1}{9}\,(7 - 8\,H^3 + H^6) & \mbox{in state $3$}

891:    \end{array}\right.

892: \ee

893: Together, (\ref{eq:propagator_bn}), (\ref{eq:prob_linked_bn}) and

894: (\ref{eq:prob_apart_bn}) determine the

895: state of the process after the bottleneck. Using this

896: information and the method

897: for the unstructured population as outlined in section 2 allows

898: us to derive the gene-history correlation for the bottleneck model.

899:

900: \section*{Appendix B: Correlation of gene histories in divergent

901: populations}

902: \setcounter{section}{2}

903: \label{app:div pop}

904:

905: Assume that individuals come from the left sub-population with

906: probability $p$ and from the right one with probability $1-p$. The

907: population sizes in the left and right sub-populations are $\gamma

908: N$ and $\Gamma N$, respectively, and the population size before

909: the divergence is $N$.

910: %

911: The two-person coalescent process is described by a Markov process

912: over the states in table~\ref{tab:states}, where state $1$ is the

913: absorbing state of the process, and the process starts in one of

914: states $3 - 11$.

915: %

916:

917: \begin{table}

918: \caption{\label{tab:states} %

919: The states of the Markov process of loci $x$ and $y$ in

920: chromosomes $i$ and $j$, for the divergent population. For each

921: state we show the corresponding configurations of the

922: sub-populations, separated by a vertical bar. A dash denotes

923: genetic material that is not ancestral to any locus in the sample.

924: The symbol $\phi$ denotes a sub-population unrelated to the sample,

925: and a diamond denotes a common ancestor to chromosomes $i$ and

926: $j$ (for that locus).

927: }

928: \begin{indented}

929:    \item[]\begin{tabular}{c r@{ $|$ }l}

930:    \br

931:    State  & \multicolumn{2}{c}{Population configuration} \\

932:    \mr

933:    0 & $\phi$ & $\phi$ \\

934:    \mr

935:    1 & $x_i \diamond$, $x_j \diamond$ & $\phi$ \\

936:    2 & $x_i \diamond$ &  $x_j \diamond$ \\

937:    \mr

938:    3 & $x_i y_i$, $x_j y_j$ & $\phi$ \\

939:    4 & $x_i y_i$ & $x_j y_j$ \\

940:    \mr

941:    5 & $x_i y_i$, $x_j-$, $-y_j$ & $\phi$ \\

942:    6 & $x_i y_i$, $x_j-$ & $-y_j$ \\

943:    7 & $x_i y_i$ & $x_j-$, $-y_j$ \\

944:    \mr

945:    8 & $x_i-$, $-y_i$, $x_j-$, $-y_j$ & $\phi$ \\

946:    9 & $x_i-$, $-y_i$, $x_j-$ & $-y_j$ \\

947:    10 & $x_i-$, $-y_i$ &  $x_j-$, $-y_j$  \\

948:    11 & $x_i-$, $x_j-$ & $-y_i$, $-y_j$ \\

949:    \br

950: \end{tabular}

951: \end{indented}

952: \end{table}

953: %

954: We now define $e_i = \expt{\, \ta \tb\, |\, \mbox{Process starting

955: in state $i$}\,}$. With these, we may write

956: \be

957:    \expt{ \tau_{x(ij)} \tau_{y(ij)} } &=&

958:          p^2\, e_3(\gamma) + (1 - p)^2\, e_3(\Gamma) + 2 p (1 - p)\, e_4(\gamma,\Gamma), \\

959:    %

960:    \expt{ \tau_{x(ij)} \tau_{y(ik)} } &=&

961:          p^3\, e_5(\gamma) + (1-p)^3\, e_5(\Gamma) \nonumber\\&+&

962:          2 p (1 - p)^2\, e_6(\gamma)  + 2 p^2 (1 - p)\, e_6(\Gamma) \nonumber\\&+&

963:          p (1 - p)^2\, e_7(\gamma, \Gamma) + p^2 (1 - p)\, e_7(\Gamma, \gamma), \\

964:    %

965:    \expt{ \tau_{x(ij)} \tau_{y(kl)} } &=&

966:          p^4\, e_8(\gamma) + (1 - p)^4\, e_8(\Gamma) \nonumber\\&+&

967:          4 p^3 (1 - p)\, e_9(\gamma) + 4 p (1 - p)^3\, e_9(\Gamma) \nonumber\\&+&

968:          4 p^2 (1 - p)^2\, e_{10}(\gamma,\Gamma) + 2 p^2 (1 - p)^2\, e_{11}(\gamma,\Gamma).

969: \ee
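
A simple check on these expressions: the weights are the probabilities of
the possible assignments of the sampled chromosomes to the two
sub-populations, and must therefore sum to unity in each case. Writing
$q = 1 - p$ (a shorthand introduced here only for this check), we have
\be
   p^2 + q^2 + 2 p q &=& (p + q)^2 = 1 , \nonumber\\
   p^3 + q^3 + 2 p q^2 + 2 p^2 q + p q^2 + p^2 q &=& (p + q)^3 = 1 , \nonumber\\
   p^4 + q^4 + 4 p^3 q + 4 p q^3 + 4 p^2 q^2 + 2 p^2 q^2 &=& (p + q)^4 = 1 . \nonumber
\ee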

970: From this, the correlation $\rho(\tau_{x(ij)},\tau_{y(ij)})$ and $\sigma^2_d$

971: may be calculated for both models of divergent populations:

972: setting $\gamma = \Gamma = 1$ gives the model described in section

973: \ref{sec:div_model_1}; setting $\Gamma = 1 - \gamma$ and $p =

974: \gamma$ gives the model described in section

975: \ref{sec:div_model_2}.

976:

977:

978:

979: \subsection*{Calculation of $e_3,\ldots,e_{11}$ for the model

980: introduced in section 4.2}

981:

982: \newcommand{\MM}{\mathbf{M}_1}

983:

984: The two-locus coalescent in a population of size $\gamma N$ is

985: described by a Markov process with the evolution matrix

986: \beq

987:   \MM = \left[ \begin{array}{ccc}

988:        - 1/\gamma - R & 1/\gamma & 0  \\

989:       R &  - 3/\gamma - R/2 & 4/\gamma  \\

990:       0 & R/2 &  - 6/\gamma

991:    \end{array}  \right]\!\!,

992: \eeq

993: where $R = 4Nr$. Before the divergence, $\gamma = 1$ and we denote

994: the corresponding evolution matrix by $\mathbf{M}$.

996: Assuming that the population is in state $3$, $5$, or

997: $8$ with probabilities $v_1$, $v_2$, and $v_3$, respectively, we

998: proceed as for the unstructured population in section

999: \ref{sec:methods}, calculating $\expt{\ta \tb}$ conditional on

1000: starting from distribution ${\bm v}$. We obtain

1001:  $e_3(\gamma) = c_\mathrm{s}(\gamma, (1,\, 0,\, 0)\transpose)$,

1002:  $e_5(\gamma) = c_\mathrm{s}(\gamma, (0,\, 1,\, 0)\transpose)$,

1003:  and

1004:  $e_8(\gamma) = c_\mathrm{s}(\gamma, (0,\, 0,\, 1)\transpose)$,

1005: where

1006: \be

1007:  \fl    c_\mathrm{s}(\gamma, {\bm v}) &=& \frac{{\bm u}_1\transpose}{\gamma} \, (-\MM)^{-3} \, \big[2\,\mathbf{I} - (2\,\mathbf{I} - 2\,\frac{G}{\gamma}\,\MM + \frac{G^2}{\gamma^2}\, \MM^2)\,\expm{\MM G}\big] {\bm v} \nonumber\\

1008:  \fl    &+& {\bm u}_1\transpose \, (-\mathbf{M})^{-3} \, \big(2\,\mathbf{I} - 2\,G\,\mathbf{M} + G^2\,\mathbf{M}^2 \big) \, \expm{\MM G}  {\bm v} \nonumber\\

1009:  \fl    &+& \frac{ {\bm u}_2\transpose}{\gamma} \, (-\MM)^{-3}  \Big\{ 2\,\mathbf{I} - \gamma\,\MM - \left[ 2\,\mathbf{I} - (2\,G + \gamma)\,\MM + G\,(G + \gamma)\, \MM^2 \right]\expm{\MM G} \Big\}  {\bm v} \nonumber\\

1010:  \fl    &+& (1 - \gamma)\,{\bm u}_2\transpose \, (\mathbf{I} + \gamma\,\MM)^{-2} \, \Big\{ \gamma\,e^{-G/\gamma}\,\mathbf{I} + \left[ \, (G - \gamma)\,\mathbf{I} + \gamma\,G\,\MM \, \right]\,\expm{\MM G} \Big\}  {\bm v} \nonumber\\

1011:  \fl    &+& {\bm u}_2\transpose (-\mathbf{M})^{-3} \left[ 2\,\mathbf{I} - (1 + 2\,G)\,\mathbf{M} + G\,(G + 1)\,\mathbf{M}^2 \right] \expm{\MM G}  {\bm v} .

1012: \ee

1013:

1014:

1015:

1016: During the split, the coalescent is described by a Markov process

1017: with the evolution matrix

1018: \beq

1019:   \mathbf{M}_2 = \left[ \begin{array}{cc}

1020:       - 1/\gamma - R/2 & 2/\gamma  \\[2pt]

1021:       R/2 &  - 3/\gamma

1022:    \end{array}  \right]\!\!.

1023: \eeq

1024: A coalescent event during the split happens with the distribution

1025:  $\gamma^{-1} (1,\, 1)\, \rme^{\mathbf{M}_2 \ta} {\bm v},$

1026: %\eeq

1027: where ${\bm v} = (1,\, 0)$ when starting from state $6$ and ${\bm

1028: v} = (0,\, 1)$  when starting from state $9$. Thus, we have the

1029: contribution

1030: \[

1031:  \int_0^G\!\! \ta \, \frac{1}{\gamma}\, (1,\, 1) \rme^{\mathbf{M}_2 \ta}\, {\bm v}\, \rmd\ta

1032:  \int_G^\infty\!\! \tb\, \rme^{-(\tb - G)} \rmd\tb

1033: \]

1034: The population is in state $5$ or $8$, right before the split,

1035: with probability ${\bm a}\, \expm{\mathbf{M}_2 G}\, {\bm v}$,

1036: where ${\bm a} = (1, 0)$ for state $5$ and ${\bm a} = (0, 1)$ for

1037: state $8$. From this we obtain

1038: \be

1039:    e_6(\gamma) &=& A(\gamma) + R \gamma \,  B(\gamma) \nonumber\\

1040:    e_9(\gamma) &=& A(\gamma) - 4\, B(\gamma) \nonumber

1041: \ee

1042: where

1043: \beq

1044:  \fl  A(\gamma) = (1 + G) \gamma + \left[ (1 + G)(1 - \gamma)  + \frac{24 + 4 R \gamma }{( 4 + R \gamma)( 18 + 13 R + R^2 ) } \right] \mathrm{e}^{-G/\gamma}

1045: \eeq

1046: and

1047: \beq

1048:  \fl  B(\gamma) = \frac{2}{( 4 + R\,\gamma) \, ( 18 + 13\,R + R^2 ) }\, \exp\!\left(- \frac{G\,( 6 + R\,\gamma) }{2\,\gamma }\right)

1049: \eeq
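
As a check on $A(\gamma)$ and $B(\gamma)$, consider the limit $G = 0$, in
which there is no period of separation and state $6$ coincides with state
$5$ in a single population of size $N$. In this limit
\[
 \fl  e_6(\gamma)\big|_{G = 0} = 1 + \frac{24 + 4 R \gamma}{( 4 + R \gamma)( 18 + 13 R + R^2 )} + \frac{2 R \gamma}{( 4 + R \gamma)( 18 + 13 R + R^2 )}
   = 1 + \frac{6}{18 + 13 R + R^2} ,
\]
which is independent of $\gamma$, as it must be, and equals the value for
state $5$ in the unstructured population.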

1050:

1051:

1052: Now consider starting from states $4$, $7$ or $10$.

1053: In these cases, there is no coalescent event during the split. In

1054: each sub-population the coalescent is described by a Markov

1055: process with the evolution matrix

1056: \beq

1057:   \mathbf{M}_3 = \left[ \begin{array}{cc}

1058:       - R/2 & 1/\gamma  \\[2pt]

1059:       R/2 &  - 1/\gamma

1060:    \end{array}  \right]\!\!.

1061: \eeq

1062: Note that the columns sum to zero: the probability of escaping

1063: from these states is zero during the split.

1064:

1065: Right before the split, the population is in state $3$, $5$ or $8$

1066: with probability $\phi_1$, $\phi_2$, and $\phi_3$, respectively.

1067: Then, the contribution is

1068: \be

1069:  \fl  && \int_G^\infty \! \left[ \ta^2\, {\bm u}_1\transpose + \int_{\ta}^\infty\!\! \ta \tb\, \rme^{\ta - \tb}\, \rmd\tb \, {\bm u}_2\transpose \right] \rme^{\mathbf{M}\,(\ta - G)}\, {\bm \phi} \,\, \rmd\ta \nonumber\\

1070:  \fl  &&\hspace{1cm}=\ (1 + G)^2 (\phi_1 + \phi_2 + \phi_3)  + \frac{(R + 18)\phi_1 + 6 \phi_2 + 4 \phi_3}{R^2 + 13 R + 18}

1071: \ee

1072: Now define $P_\mathrm{L}(\gamma)$ as the probability of the

1073: genetic material being on the same gamete at the moment of the

1074: split, given that it is on the same gamete in the sample. We have

1075: \beq

1076:    P_\mathrm{L}(\gamma) = (1,\, 0)\, \expm{\mathbf{M}_3\, G}\, (1,\,0)\transpose = \frac{2 +  R \gamma \exp\!\left(- \frac{G (2 + R \gamma)}{2 \gamma} \right)}{2 + R \gamma}.

1077: \eeq

1078: Similarly, we define $P_\mathrm{B}(\gamma)$ as the probability of

1079: the genetic material being on the same gamete at the moment of the

1080: split, given that it is on different gametes in the sample. We

1081: have

1082: \beq

1083:    P_\mathrm{B}(\gamma) = (1,\, 0)\, \expm{\mathbf{M}_3\, G}\, (0,\,1)\transpose = \frac{2 -  2  \exp\!\left(- \frac{G (2 + R \gamma)}{2 \gamma} \right)}{2 + R \gamma}.

1084: \eeq
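
The form of $P_\mathrm{L}(\gamma)$ and $P_\mathrm{B}(\gamma)$ can be read
off from the spectrum of $\mathbf{M}_3$, as the following sketch shows.
The matrix has the eigenvalues $0$ and $-(2 + R \gamma)/(2 \gamma)$, and
the right eigenvector belonging to the eigenvalue zero, normalised to unit
sum, is $(2,\, R \gamma)\transpose / (2 + R \gamma)$. Hence
\be
   G \rightarrow 0: && P_\mathrm{L}(\gamma) \rightarrow 1 , \quad P_\mathrm{B}(\gamma) \rightarrow 0 , \nonumber\\
   G \rightarrow \infty: && P_\mathrm{L}(\gamma),\, P_\mathrm{B}(\gamma) \rightarrow \frac{2}{2 + R \gamma} . \nonumber
\ee
For a long separation the memory of the initial configuration is lost, and
both probabilities approach the stationary probability of finding the two
loci on the same gamete.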

1085: If the sample is in state $4$, we have

1086: \be

1087:    \phi_1 &=& P_\mathrm{L}(\gamma) \, P_\mathrm{L}(\Gamma) \nonumber\\

1088:    \phi_2 &=& P_\mathrm{L}(\gamma) \, [1 -  P_\mathrm{L}(\Gamma)] + [1 - P_\mathrm{L}(\gamma)] \, P_\mathrm{L}(\Gamma) \nonumber\\

1089:    \phi_3 &=& [1 - P_\mathrm{L}(\gamma)] \, [1 - P_\mathrm{L}(\Gamma)]

1090: \ee

1091: Since $\phi_1 + \phi_2 + \phi_3 = 1$ we have

1092: \beq

1093:   \fl e_4(\gamma,\Gamma) = (1 + G)^2 + \frac{4 + 2\,P_\mathrm{L}(\gamma) + 2\,P_\mathrm{L}(\Gamma) + (10 +  R)\,P_\mathrm{L}(\gamma)\, P_\mathrm{L}(\Gamma) }{R^2 + 13 R + 18}

1094: \eeq

1095: Similarly, we obtain

1096: \beq

1097:   \fl  e_7(\gamma,\Gamma) = (1 + G)^2 + \frac{4 + 2\,P_\mathrm{L}(\gamma) + 2\,P_\mathrm{B}(\Gamma) + (10 +  R)\,P_\mathrm{L}(\gamma)\, P_\mathrm{B}(\Gamma) }{R^2 + 13 R + 18}

1098: \eeq

1099: and

1100: \beq

1101:   \fl  e_{10}(\gamma,\Gamma) = (1 + G)^2 + \frac{4 + 2\,P_\mathrm{B}(\gamma) + 2\,P_\mathrm{B}(\Gamma) + (10 +  R)\,P_\mathrm{B}(\gamma)\, P_\mathrm{B}(\Gamma) }{R^2 + 13 R + 18}

1102: \eeq

1103: %

1104: % --------------------------------------------------------------

1105: %

1106: Finally, starting from state $11$, we obtain

1107: \beq

1108:   \fl e_{11}(\gamma,\Gamma) =

1109:    \frac{4}{18 + 13R + R^2} \, \rme^{-G/\gamma - G/\Gamma} +

1110:    \left[ \gamma  + ( 1 - \gamma  ) \mathrm{e}^{-G/\gamma} \right]\!

1111:    \left[ \Gamma  + ( 1 - \Gamma  )\mathrm{e}^{-G/\Gamma} \right]

1112: \eeq
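
The structure of this result can be understood as follows (a sketch using
only the coalescence rates stated above). Two lines in a sub-population of
size $\gamma N$ coalesce at rate $1/\gamma$ during the split and at rate
$1$ before the divergence, so the probability that they have not coalesced
by time $t$ is $\rme^{-t/\gamma}$ for $t < G$ and
$\rme^{-G/\gamma}\,\rme^{-(t - G)}$ for $t > G$. Integrating this
probability over $t$ gives the expected coalescence time
\[
   \int_0^G \rme^{-t/\gamma}\,\rmd t + \rme^{-G/\gamma} \int_0^\infty \rme^{-s}\,\rmd s
   = \gamma\,(1 - \rme^{-G/\gamma}) + \rme^{-G/\gamma}
   = \gamma + (1 - \gamma)\,\rme^{-G/\gamma} ,
\]
and the corresponding factor for the other sub-population follows on
replacing $\gamma$ by $\Gamma$. The two loci evolve independently during
the split; with probability $\rme^{-G/\gamma - G/\Gamma}$ neither pair has
coalesced at the moment of the split, in which case the process continues
from state $8$ and contributes the covariance $4/(18 + 13 R + R^2)$.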

1113:

1114:

1115: \subsection*{Calculation of $e_3,\ldots,e_{11}$ for the

1116: model introduced in section 4.3}

1117: In this model, $\gamma = \Gamma = 1$ so the formulas simplify

1118: considerably. Starting from state $3$, $5$ or $8$, we obtain

1119: \be

1120:    e_3 &=& 1 + \frac{18 + R}{R^2 + 13 R + 18} \nonumber\\

1121:    e_5 &=& 1 + \frac{6}{R^2 + 13 R + 18} \nonumber\\

1122:    e_8 &=& 1 + \frac{4}{R^2 + 13 R + 18} \nonumber\\

1123: \ee

1124: as calculated by Griffiths \cite{griffiths81}. Starting from state

1125: $6$ or $9$, we obtain

1126: \be

1127:    e_6 &=& (1 + G) + \frac{ (24 + 4 R) \rme^{-G} + 2 R\, \rme^{-G(6 + R)/2}} {( 4 + R )( 18 + 13 R + R^2 ) } \\

1128:    e_9 &=& (1 + G) + \frac{ (24 + 4 R) \rme^{-G} - 8 \, \rme^{-G(6 + R)/2}} {( 4 + R )( 18 + 13 R + R^2 ) }

1129: \ee

1130: Starting from state $4$, $7$ or $10$, we obtain

1131: \be

1132:    e_4    &=& a +        8 R\, b + R^2\, c \nonumber \\

1133:    e_7    &=& a +  4 (R - 2)\, b - 2 R\, c \nonumber\\

1134:    e_{10} &=& a -         16\, b +   4\, c

1135: \ee

1136: where

1137: \be

1138:    a &=& (1 + G)^2 - \frac{8}{(2 + R)^2} - \frac{21}{2 + R} + \frac{3\,( 81 + 7 R)}{18 + 13 R + R^2} \nonumber\\

1139:    b &=& \frac{6 + R}{(2 + R)^2 (18 + 13 R + R^2)}\, \mathrm{e}^{-G(2 + R)/2} \nonumber\\

1140:    c &=& \frac{10 + R}{(2 + R)^2 (18 + 13 R + R^2)}\, \mathrm{e}^{-G(2 + R)}

1141: \ee

1142: Finally, starting from state $11$ gives

1143: \beq

1144:    e_{11} = 1 + \frac{4 \mathrm{e}^{-2G}}{18 + 13 R + R^2} .

1145: \eeq

1146:

1147: } % end of appendix

1148:

1149:

1150: \newpage

1151: \section*{References}

1152:

1153: \bibliographystyle{prsty}

1154: %\bibliographystyle{unsrt}

1155:

1156: \begin{thebibliography}{10}

1157:

1158: \bibitem{hudson90}

1159: R.~R. Hudson,  in {\em Oxford Surveys in Evolutionary Biology}, edited by D.

1160:   Futuyma and J. Antonovics (Oxford University Press, Oxford, 1990), pp.\ 1 --

1161:   43.

1162:

1163: \bibitem{nordborg_tavare02}

1164: M. Nordborg and S. Tavar\'e, Trends in Genetics {\bf 18},  83   (2002).

1165:

1166: \bibitem{reich_etal02}

1167: D.~E. Reich {\it et~al.}, Nature Genetics {\bf 32},  135   (2002).

1168:

1169: \bibitem{hapmap_group03}

1170: {Int. HapMap Consortium}, Nature {\bf 426},  789  (2003).

1171:

1172: \bibitem{tajima87a}

1173: F. Tajima, Genetics {\bf 123},  585   (1989).

1174:

1175: \bibitem{tajima87b}

1176: F. Tajima, Genetics {\bf 123},  597   (1989).

1177:

1178: \bibitem{slatkin_hudson91}

1179: M. Slatkin and R.~R. Hudson, Genetics {\bf 129},  555  (1991).

1180:

1181: \bibitem{sano_etal04}

1182: A. Sano, A. Shimizu, and M. Iizuka, Theor. Pop. Biol. {\bf 65},  39   (2004).

1183:

1184: \bibitem{wakeley96}

1185: J. Wakeley, Theor. Pop. Biol. {\bf 49},  39   (1996).

1186:

1187: \bibitem{teshima_tajima03}

1188: K.~M. Teshima and F. Tajima, Theor. Pop. Biol. {\bf 62},  81   (2003).

1189:

1190: \bibitem{stumph_goldstein03}

1191: M.~P.~H. Stumpf and D.~L. Goldstein, Curr. Biol. {\bf 13},  1   (2003).

1192:

1193: \bibitem{patil_etal01}

1194: N. Patil {\it et~al.}, Science {\bf 294},  1719  (2001).

1195:

1196: \bibitem{snp_group01}

1197: {Int. SNP Map Working Group}, Nature {\bf 409},  928  (2001).

1198:

1199: \bibitem{kaplan_hudson85}

1200: N. Kaplan and R.~R. Hudson, Theor. Pop. Biol. {\bf 28},  382   (1985).

1201:

1202: \bibitem{hudson83}

1203: R.~R. Hudson, Theor. Pop. Biol. {\bf 23},  183   (1983).

1204:

1205: \bibitem{pluzhnikov_donelly96}

1206: A. Pluzhnikov and P. Donnelly, Genetics {\bf 144},  1247   (1996).

1207:

1208: \bibitem{hudson01}

1209: R. Hudson, Genetics {\bf 159},  1805--1817  (2001).

1210:

1211: \bibitem{mcvean_etal02}

1212: G. McVean, P. Awadalla, and P. Fearnhead, Genetics {\bf 160},  1231--1241

1213:   (2002).

1214:

1215: \bibitem{griffiths_marjoram96}

1216: R.~C. Griffiths and P. Marjoram, J. Comput. Biol. {\bf 3},  479--502  (1996).

1217:

1218: \bibitem{kuhner_etal00}

1219: M.~K. Kuhner, J. Yamato, and J. Felsenstein, Genetics {\bf 156},  1393--1401

1220:   (2000).

1221:

1222: \bibitem{nielsen00}

1223: R. Nielsen, Genetics {\bf 154},  931--942  (2000).

1224:

1225: \bibitem{hill_robertson68}

1226: W.~G. Hill and A. Robertson, Theor. Appl. Genet. {\bf 38},  473   (1968).

1227:

1228: \bibitem{mcvean02}

1229: G. McVean, Genetics {\bf 162},  987   (2002).

1230:

1231: \bibitem{kong_etal02}

1232: A. Kong {\it et~al.}, Nature Genetics {\bf 31},  241   (2002).

1233:

1234: \bibitem{eriksson_mehlig04}

1235: A. Eriksson and B. Mehlig, Submitted to Genetics  (2004).

1236:

1237: \bibitem{griffiths81}

1238: R.~C. Griffiths, Theor. Pop. Biol. {\bf 19},  169   (1981).

1239:

1240: \bibitem{hudson_kaplan85}

1241: R.~R. Hudson and N.~L. Kaplan, Genetics {\bf 111},  147   (1985).

1242:

1243: \bibitem{eyre-walker_etal98}

1244: A. Eyre-Walker {\it et~al.}, Proc. Natl. Acad. Sci. {\bf 95},  4441   (1998).

1245:

1246: \bibitem{mcpeek_speed95}

1247: M.~S. McPeek and T.~P. Speed, Genetics {\bf 139},  1031   (1995).

1248:

1249: \bibitem{wade_etal02}

1250: C.~M. Wade {\it et~al.}, Nature {\bf 420},  574   (2002).

1251:

1252: \end{thebibliography}

1253:

1254:

1255: \newpage

1256: \section*{Glossary}

1257:

1258: \emph{Locus} %

1259: A specific chromosomal location.

1260: \\[1ex]

1261: \emph{Allele} %

1262: One of several alternative forms of a gene, or DNA sequence, at a

1263: locus.

1264: \\[1ex]

1265: \emph{Genetic mosaic} %

1266: The pattern of differences between individuals in a population.

1267: \\[1ex]

1268: \emph{Haplotype} %

1269: A block of closely linked alleles that are inherited together.

1270: Such alleles are often used as markers in the process of gene

1271: mapping.

1272: \\[1ex]

1273: \emph{Linkage disequilibrium} %

1274: At linkage equilibrium, traits at different loci are inherited

1275: independently. Deviation from this is called linkage

1276: disequilibrium.

1277: \\[1ex]

1278: \emph{Population bottleneck} %

1279: When the population has been subject to a drastic decrease in

1280: abundance, followed by a rapid increase in abundance. This may

1281: happen e.g. when a small part of a population colonises a new

1282: environment, without extensive interbreeding with the main

1283: population.

1284: \\[1ex]

1285: \emph{SNP} %

1286: Single nucleotide polymorphism. A difference in the genetic code

1287: at a single position.

1288: \\[1ex]

1289: \emph{Markov process} %

1290: A stochastic process, where the future development depends only on

1291: the present state (no memory).

1292: \\[1ex]

1293: \emph{Divergence} %

1294: When a population splits into two parts that do not interbreed,

1295: the independent accumulation of neutral mutations within each

1296: sub-population causes the number of genetic differences

1297: between individuals from different sub-populations to increase with

1298: time.

1299: \\[1ex]

1300: \emph{Gene history} %

1301: The sequence of ancestors to a gene.

1302: \\[1ex]

1303: \emph{Coalescent process} %

1304: An approximation of neutral evolution, valid for large

1305: populations.

1306: \\[1ex]

1307: \emph{Chiasma process} %

1308: Exchange of genetic material between the two copies of a chromosome pair

1309: during the production of gametes (egg or sperm cells).

1310: \\[1ex]

1311: \emph{Recombination fraction} %

1312: The probability that two loci on the same chromosome were inherited

1313: from different parents.

1314:

1315: \end{document}

1316: