1: \documentclass[12pt]{iopart}
2: \usepackage{graphicx, bm, color, cite}
3: \usepackage{iopams}
4: \usepackage[nolists]{endfloat}
5:
6: \newcommand{\transpose}{^\mathrm{T}}
7: \newcommand{\expm}[1]{\exp\!\big(#1\big)}
8: \newcommand{\expt}[1]{\langle #1 \rangle}
9:
10: \newcommand{\be}{\begin{eqnarray}}
11: \newcommand{\ee}{\end{eqnarray}}
12:
13: \newcommand{\beq}{\begin{equation}}
14: \newcommand{\eeq}{\end{equation}}
15:
16: \newcommand{\ta}{\tau_1}
17: \newcommand{\tb}{\tau_2}
18:
19: \begin{document}
20:
21: \title{Gene-history correlation and population structure}
22:
23: \author{A. Eriksson\dag\ and B. Mehlig\ddag}
24: \address{\dag\ Dept. of Physical Resource Theory, Chalmers and G\"oteborg
25: University, Sweden}
26: \address{\ddag\ Dept. of Theoretical Physics, G\"oteborg
27: University and Chalmers, Sweden}
28:
29: \begin{abstract}
30: Correlation of gene histories in the human genome determines the
31: patterns of genetic variation ({\em haplotype structure}) and is
32: crucial to understanding genetic factors in common diseases. We
33: derive closed analytical expressions for the correlation of gene
34: histories in established demographic models for genetic evolution
35: and show how to extend the analysis to more realistic (but more
36: complicated) models of demographic structure. We identify two
37: contributions to the correlation of gene histories in divergent
38: populations: linkage disequilibrium, and differences in the
39: demographic history of individuals in the sample. These two
40: factors contribute to correlations at different length scales: the
41: former at small, and the latter at large scales. We show that
42: recent mixing events in divergent populations limit the range of
43: correlations and compare our findings to empirical results on the
44: correlation of gene histories in the human genome.
45: \end{abstract}
46:
47: \submitto{Physical Biology}
48: \pacs{89.75.Hc,87.23.Kg,02.50.Ga}
49: %\keywords{Suggested keywords}
50:
51: \maketitle
52:
53:
54: \clearpage \newpage %
55: \section{Introduction}
56: \label{sec:introduction}
57:
58: Populations are shaped by demographic, historical and
59: social factors, determining gene histories in characteristic ways.
60: Empirical data on genetic variation are now routinely interpreted
61: using well-established gene-genealogical models
62: \cite{hudson90,nordborg_tavare02,reich_etal02,hapmap_group03} of
63: the population in question. Local properties of genetic variation
64: (pertaining to {\em loci}, short stretches of a chromosome) in
65: such models are very well understood, by means of models of
66: bottlenecks, population expansion \cite{tajima87a, tajima87b,
67: slatkin_hudson91, sano_etal04}, and migration \cite{wakeley96,
68: teshima_tajima03, stumph_goldstein03}.
By contrast, very little is known about global patterns
70: \cite{patil_etal01}.
71: %
72: Global correlation and variation of patterns appear to be the key
73: to understanding the genetic factors contributing to common
74: diseases: there is now a wealth of empirical information on the
75: variation of genetic material in the human genome
76: \cite{snp_group01}. Many common diseases (such as cancer, obesity,
77: cardiovascular disorder and diabetes) are caused by combinations
78: of genetic and environmental factors \cite{hapmap_group03}. In
79: some cases a common variant of a single gene is responsible for
80: specific syndromes. In more complex diseases, however, it may not
81: be possible to link a disease to a single genetic factor. It is
82: thus necessary to understand genome-wide association of genetic
83: factors.
84:
85: Mutations and linkage disequilibrium (explained and illustrated in
86: figure~\ref{fig:samplegenealogy}) determine the genetic history of
87: a population, which in turn shapes the patterns of genetic
88: variation of interest in gene association studies
89: \cite{patil_etal01,hapmap_group03}.
90: %
91: The question is: how strongly are the patterns at two different
92: loci correlated?
93: %
94: Reich \etal \cite{reich_etal02} estimate the empirical association
95: of polymorphism rates, as a function of the physical distance
96: between the loci on the same chromosome, from human population
97: data (compensating for variations in the mutation rate along the
98: chromosome by comparing to the population data from the great
99: apes). Assuming a neutral model with uniform mutation rate, the
100: covariance of polymorphism rates is given by the covariance of the
times to the most recent common ancestor of the two loci (cf.\
figure~\ref{fig:samplegenealogy}c).
103: %
104: Kaplan and Hudson \cite{kaplan_hudson85} (see also
105: \cite{hudson83}) analysed the association of polymorphism rates for
106: short loci, within the standard unstructured neutral model. This
was further developed by Pluzhnikov and Donnelly
108: \cite{pluzhnikov_donelly96}, who analysed optimal sample sizes for
109: surveying genetic diversity.
110: %
111: Hudson \cite{hudson01} and McVean \etal \cite{mcvean_etal02}
112: estimate the recombination rate likelihood from two-locus sample
113: statistics, based on simulations. Recombination rate likelihoods,
114: conditional on more than two sites, have also been estimated using
115: Monte-Carlo methods
116: \cite{griffiths_marjoram96,kuhner_etal00,nielsen00}. Although
117: statistically powerful, these methods are computationally very
118: demanding.
119: %
120: Linkage disequilibrium is often assessed through summary
121: statistics such as $r^2$ \cite{hill_robertson68} or $D'$
\cite{tajima87a}. McVean \cite{mcvean02} introduced an
123: approximation $\sigma^2_d$ of the expected value of $r^2$, and
124: showed that the approximation is accurate, in the absence of
125: demographic structure, if the expectations are taken conditional
126: on intermediate allelic frequencies.
127:
128: In this paper, we derive analytical expressions for the
129: correlation of genetic histories in established models of
130: demographic history (see figure~\ref{fig:pop struct models}a--c)
131: in the limit of negligible selection.
These results are of interest for several reasons.
133: First, as explained in the following, they enable us
134: to gain a qualitative understanding of the relative importance
135: of different biological factors determining the empirically
136: observed patterns of linkage disequilibrium. Second,
137: the analytical results summarised in this article
138: can be easily generalised as explained below
139: (see figure~\ref{fig:pop struct models}d,e).
140: Third, our analytical expressions for the decorrelation
141: of gene histories allow for studying the implications
142: of variations of the recombination rate along the chromosomes
143: \cite{kong_etal02,eriksson_mehlig04}.
144: %
145: The remainder of this paper is organised into five parts. We
146: begin by discussing gene-history correlations and linkage
147: disequilibrium in section \ref{sec:gene-history correlations}
148: (see also figure~\ref{fig:samplegenealogy}). In section \ref{sec:methods}
149: we describe our method. We summarise our results in section
150: \ref{sec:results} and discuss their implications in section
151: \ref{sec:discussion}. In section \ref{sec:conclusions} we draw
152: conclusions. Two appendices summarise details of our calculations.
153:
154: %==================================================================
155: \begin{figure}
156: \centerline{\includegraphics{fig1.eps}}
157: \caption{\label{fig:samplegenealogy}
158: Gene history and polymorphic sites. \textbf{a} In DNA, genetic
159: information is encoded by base-pairs of the four nucleic acids
160: adenine ({\tt A}), thymine ({\tt T}), guanine ({\tt G}), and
161: cytosine ({\tt C}). In a sample of three individuals, we show
162: three polymorphic sites, with two nucleotides around each
163: polymorphism. \textbf{b} The most common variation is a difference
164: at a single position (SNP), caused by a mutation at the position
165: in an individual in the history of the population, where e.g. a
166: fraction of the population has the nucleotide {\tt T} at the site,
167: and the rest has the nucleotide {\tt A}. The three mutations in
168: panel \textbf{a} are shown as filled circles. Mutation 4 does not
169: cause a polymorphism in the sample, since all individuals in the
sample inherit the mutation from the common ancestor. Given
171: $\tau$ (the number of generations since the most recent common
172: ancestor) of a stretch of $L$ nucleotides, the number of
173: differences between two individuals is assumed to be Poisson
174: distributed with expected value $2 \mu L \tau$, where $\mu$ is the
175: mutation rate per site per generation \cite{hudson90}. \textbf{c}
176: In recombination, part of a \emph{gamete} (one of the two copies
177: of a chromosome) is inherited from one parent and the rest from
178: the other parent. We show a sample gene history with one
179: recombination event, for two loci ($x$ and $y$) in two gametes
180: $i$ and $j$. The time axis is the same
as in panel \textbf{b}. The ancestral histories of
loci $x$ and $y$ are shown in blue and red, respectively. The
183: times until the most recent common ancestor are $\tau_{x(ij)}$ and
184: $\tau_{y(ij)}$ for loci $x$ and $y$, respectively. In the absence of
185: recombination, two loci on the same gamete share the same genetic
186: history, and have the same time to the most recent common
187: ancestor, $\tau_{x(ij)} = \tau_{y(ij)}$, causing \emph{linkage
188: disequilibrium}. If a recombination event occurs in the genetic
189: history of a sample, it may lead to a decorrelation of $\tau_{x(ij)}$
190: and $\tau_{y(ij)}$.
191: $x_i$ represents the genetic material at locus $x$ of
192: chromosome $i$. Dashes correspond to genetic material not in the
193: history of the sample, and the diamonds to common ancestral
194: material.
195: }
196: \end{figure}
197: \begin{figure}
198: \centerline{\includegraphics{fig2.eps}}
199: \caption{\label{fig:pop struct models}
200: Models illustrating demographic history, i.e. changes in
201: population size and structure. \textbf{a} Population bottleneck.
202: \textbf{b},\textbf{c} Models of population structure and
203: expansion. \textbf{d} A more general model of demographic
204: structure. \textbf{e} Demographic structure determining genetic
205: variation in the laboratory-mouse genome \cite{wade_etal02} (time
206: here is measured in years).
207: }
208: \end{figure}
209:
210: \clearpage \newpage %
211: \section{Gene-history correlations, linkage disequilibrium, and
212: patterns of genetic variation}
213: \label{sec:gene-history correlations}
214:
215: Genetic variation is caused by multiple factors. Together,
216: mutations and recombination (figure~\ref{fig:samplegenealogy}) are
217: the most important determinants of the large-scale haplotype
218: structure in the human genome \cite{reich_etal02, patil_etal01,
219: hapmap_group03}. The genetic history of nearby sites is closely
220: related, while distant sites may become unrelated only a few
221: generations in the past.
222:
223: Correlation of gene histories determines the degree of association
224: between patterns of genetic variation at different loci.
225: An example is the correlation of the counts of
226: single-nucleotide polymorphisms (SNPs) at different loci:
227: let $S_{x(ij)}$ be the number of SNPs
228: at locus $x$ between a pair of chromosomes $i$ and $j$.
229: Further, let $\tau_{x(ij)}$ denote
230: the time to the most recent common ancestor of a locus at position
231: $x$ on chromosomes $i$ and $j$, and define $\tau_{y(ij)}$
232: correspondingly for the locus at position $y$.
233: Then the sample covariance of the number of SNPs
234: in non-overlapping loci $x$ and $y$ is
235: related to the covariance of times $\tau_{x(ij)}$ and $\tau_{y(ij)}$ as
236: follows
237: \begin{equation}\label{eq:cov S_a S_b}
238: \mathrm{cov}[S_{x(ij)},S_{y(ij)}] \approx (2 \mu L)^2 \, \mathrm{cov}[\tau_{x(ij)},\tau_{y(ij)}]\,.
239: \end{equation}
240: Here $L$ is the size of the loci, assuming variations in the
241: mutation rate $\mu$ along the chromosome are negligible. For
242: (\ref{eq:cov S_a S_b}) to hold, $L$ must be small enough that the
243: sites within each locus have a high degree of linkage (in humans,
244: $L$ must be of the order of or smaller than a few hundred
245: base-pairs).
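
The origin of (\ref{eq:cov S_a S_b}) can be sketched as follows, under two
assumptions that go slightly beyond what is stated above: that, given the
coalescence times, mutations at the two non-overlapping loci occur
independently, and that $\tau$ is measured in generations. With the number
of differences at each locus Poisson distributed with mean $2\mu L\tau$
(figure~\ref{fig:samplegenealogy}b), the law of total covariance gives,
suppressing the pair index $(ij)$,
\[
\mathrm{cov}[S_x,S_y]
 = \expt{\mathrm{cov}[S_x,S_y\,|\,\tau_x,\tau_y]}
 + \mathrm{cov}\big[\expt{S_x|\tau_x},\expt{S_y|\tau_y}\big]
 = 0 + (2\mu L)^2\,\mathrm{cov}[\tau_x,\tau_y]\,.
\]
The same argument applied to a single locus yields
\[
\mathrm{var}[S_x] = 2\mu L\,\expt{\tau_x} + (2\mu L)^2\,\mathrm{var}[\tau_x]\,,
\]
so the Poisson term $2\mu L\,\expt{\tau_x}$ would have to be subtracted
from the variance of the SNP counts before their correlation coefficient
can be compared with that of the coalescence times.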
246:
Associations between SNPs in the genetic mosaic
allow for efficient mapping of genes. Suitably
249: chosen, a relatively small set of SNPs can capture most of the
250: common patterns of variation in the genome \cite{hapmap_group03}.
251:
252: The decay of the covariance $\mbox{cov}[\tau_{x(ij)},\tau_{y(ij)}]$ as a
253: function of $|x-y|$ measures linkage disequilibrium.
254: In the remainder of this section we briefly comment on other
255: common measures of linkage disequilibrium. Global association
between patterns of diversity, quantified by the extent of linkage
disequilibrium, is often measured by Tajima's $D'$ \cite{tajima87a} or
258: alternatively by
259: \beq
260: r^2 = \frac{D^2}{f_{A(x)} (1 - f_{A(x)}) f_{B(y)} (1 - f_{B(y)})},
261: \eeq
262: where $D = f_{A(x)B(y)} - f_{A(x)} f_{B(y)}$, $A(x)$ and $B(y)$ are
263: the allelic types at the loci $x$ and $y$, respectively, and
$f_{A(x)B(y)}$ is the frequency of alleles $A(x)$ and $B(y)$ on the
265: same chromosome in the sample \cite{tajima87a}. McVean
266: \cite{mcvean02} introduced an approximation to the expected value
267: of $r^2$, called $\sigma^2_d$, which makes the connection to the
268: correlation of gene history explicit. With the notation $E_{ij,kl}
269: = \expt{ \tau_{x(ij)} \tau_{y(kl)}}$,
270: \beq\label{eq:sigma2 def}
271: \sigma^2_d =
272: \frac{ (n^2 - 2n + 2)E_{ij,ij} - 2(n-2)^2 E_{ij,ik} + (n-2)(n-3) E_{ij,kl} }
273: { 2 E_{ij,ij} + 4(n-2) E_{ij,ik} + (n - 2)(n - 3) E_{ij,kl}} \, .
274: \eeq
275: The factors $E_{ij,ij}$ and $E_{ij,ik}$ are defined analogously.
276: For unstructured populations, $\sigma^2_d$ and the expected value
277: of $r^2$ are approximately equal under the neutral dynamics, if
278: the expectation is conditioned on intermediate allelic frequencies
279: \cite{mcvean02}.
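
Two limits of (\ref{eq:sigma2 def}) serve as simple checks; they follow
directly from the expression as printed. For $n = 2$ all terms
proportional to $(n-2)$ vanish, so that
$\sigma^2_d = 2E_{ij,ij}/(2E_{ij,ij}) = 1$. For large samples only the
terms of order $n^2$ survive,
\[
\sigma^2_d \rightarrow
\frac{E_{ij,ij} - 2\,E_{ij,ik} + E_{ij,kl}}{E_{ij,kl}}
\qquad (n \rightarrow \infty)\,.
\]
If the coalescence times at the two loci are independent and identically
distributed for all pairs of chromosomes, all three expectations equal
$\expt{\tau}^2$ and the numerator vanishes.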
280:
281:
282: \clearpage \newpage %
283: \section{Methods}
284: \label{sec:methods}
285:
286: In the following we analyse how correlation of gene histories
depends on demographic factors. In a large, unstructured population
288: with constant population size, and when selection is negligible,
289: the ancestral history of a locus may be modeled as a Markov
290: process \cite{griffiths81, hudson_kaplan85, nordborg_tavare02},
291: where the states of the process correspond to different
292: configurations of ancestral DNA through the history of the sample.
293:
294: We trace the ancestral history of two loci (at positions $x$ and
295: $y$) in $n$ individuals, from the present
296: back in time until the most recent common ancestor has been found
297: for all loci. When the population size $N$ is large, the genealogical
298: process may be approximated by the so-called coalescent process \cite{hudson90}:
299: recombination is modeled as a Poisson
300: process with rate $r$ per generation per chromosome: for any given
301: chromosome, with probability $r$ (also known as the recombination
302: fraction) the loci stem from different parents. The
303: probability that one pair of individuals has a common ancestor in
304: the preceding generation, and the probability that an individual
305: inherits genetic material from both parents, are expanded in
306: $N^{-1}$ to the first order. Time is measured in units of $2N$
307: generations. In the limit of large $N$, the time to the next event
308: is approximately exponentially distributed \cite{hudson90}.
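
As a brief illustration of the last two statements (a sketch assuming the
usual random-mating population of $2N$ gametes, which is not spelled out
above): two lineages have a common ancestor in the preceding generation
with probability $1/(2N)$, so the probability that they remain distinct
for $g$ generations is
\[
\left(1 - \frac{1}{2N}\right)^{g} \approx {\rm e}^{-g/(2N)} = {\rm e}^{-t}\,,
\qquad t = \frac{g}{2N}\,.
\]
The coalescence time of a single locus is then exponentially distributed
with unit rate, giving $\expt{\tau} = 1$, $\expt{\tau^2} = 2$ and
$\mathrm{var}[\tau] = 1$ in units of $2N$ generations. In the same units a
chromosome carrying both loci recombines at rate $2Nr = R/2$, where
$R = 4Nr$ is the scaled recombination rate.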
309:
310: By explicitly taking into account the symmetries of the state
311: space of the coalescent for two individuals, we obtain a compact
312: representation of the Markov process
313: (figure~\ref{fig:markovgraph}) which allows us to derive and
314: understand gene-history correlations in the models mentioned
315: in the introduction.
316:
317: We illustrate our approach by re-deriving Hudson's result for the
318: correlation of gene histories in the unstructured, constant
319: population-size coalescent model \cite{hudson83}. Consider a
320: sample of two individuals. Figure~\ref{fig:markovgraph} shows a
321: representation of the coalescent for this case. Each node in the
322: graph corresponds to a configuration of ancestral DNA (listed in
323: the table in figure~\ref{fig:markovgraph}). Due to the symmetries
324: of the coalescent, many different configurations may be mapped
325: onto the same node.
326:
327: \begin{figure}
328: \centerline{
329: \begin{tabular}{@{}ll@{}}
330: %%%%%%%%%%%%%%%%%%%%%%%%%%
331: \includegraphics{fig3.eps}
332: %%%%%%%%%%%%%%%%%%%%%%%%%%
333: &
334: %%%%%%%%%%%%%%%%%%%%%%%%%%
335: \raisebox{2.5cm}{
336: \begin{tabular}{cl}
337: \br
338: \small State $i\ $ & \small Population \\
339: \mr
340: \small\raisebox{1.5ex}{$1$} & \small\shortstack{$x_iy_i$,\,$x_jy_j$\\
341: $x_iy_j$,\,$x_jy_i$} \\
342: \mr
343: \small\raisebox{4ex}{$2$} & \small\shortstack{
344: $x_i-$,\,$-y_i$,\,$x_jy_j$\\
345: $x_iy_i$,\,$x_j-$,\,$-y_j$\\
346: $x_i-$,\,$-y_j$,\,$x_jy_i$\\
347: $x_iy_j$,\,$x_j-$,\,$-y_i$
348: }\\
349: \mr
350: \small$3$ & \small$x_i-$,\,$-y_i$,\,$x_j-$,\,$-y_j$ \\
351: \mr
352: \small\raisebox{1ex}{$4$} & \small\shortstack{$x_i\scriptstyle\lozenge$,\,$x_j\scriptstyle\lozenge$\\
353: ${\scriptstyle\lozenge}y_i$,\,${\scriptstyle\lozenge}y_j$}\\
354: \mr
355: \small$5$ & \small$\scriptstyle\lozenge\lozenge$ \\
356: \br
357: \end{tabular}
358: }
359: %%%%%%%%%%%%%%%%%%%%%%%%%%
360: \end{tabular}
361: }
362: \caption{\label{fig:markovgraph}
363: A graph representation of the coalescent process for two loci ($x$
364: and $y$) and two chromosomes ($i$ and $j$). The transition rates
365: (measured in units of $2N$ generations) between the different
366: groups of states, corresponding to the table, are printed along
367: the arrows ($R = 4Nr$). The process starts in state $1$ and
368: ends in state $5$, the only absorbing state. If the path goes from
369: state $1$ to state $5$ we have linkage, but if the system enters
370: state $4$ linkage is broken.
Same notation as in figure~\ref{fig:samplegenealogy}.
372: }
373: \end{figure}
374:
375: The time evolution of the probability distribution $P_i(t)$ over the states
376: $i$ is given by the master equation
377: \begin{equation}
378: \partial_t P_i(t) = \sum_j w_{j \rightarrow i} P_j(t) - \sum_j w_{i \rightarrow j} P_i(t)\,,
379: \end{equation}
380: where $w_{i \rightarrow j}$ is the transition rate from state $i$
381: to state $j$, given in figure~\ref{fig:markovgraph}. As above, time is
382: measured in units of $2N$ generations. The process is started in
383: state $1$, and proceeds until it comes to state $5$. We find that
384: $\langle\tau_{x(ij)}\tau_{y(ij)}\rangle$ is given by the exit rates to state
385: $5$, via states $1$ and $4$. Let $\ta$ be the first time at which a locus
386: coalesces, and $\tb$ be the time when both loci have coalesced.
387: Since $\tau_{x(ij)}\tau_{y(ij)} = \ta\tb$ we obtain
388: \begin{equation}
389: \label{eq:corr}
390: \left<\tau_{x(ij)}\tau_{y(ij)}\right> =
391: \int_0^\infty \Big[
392: {\bm u}_1\transpose\tau_1^2
393: + {\bm u}_2\transpose\!
394: \int_{\tau_1}^\infty \tau_1 \tau_2\,
395: {\rm e}^{\tau_1\!-\!\tau_2} \,\rmd\tau_2
396: \Big] {\rm e}^{{\bf M} \tau_1} \,{\bm v} \,\rmd\tau_1 \,,
397: \end{equation}
398: where ${\bm v} = {\bm u}_1 = (1,0,0)\transpose$, ${\bm u}_2 =
399: (0,2,2)\transpose$ and $\mathbf{M}$ is a three-by-three matrix
defined by $\mathbf{M}_{ij} = w_{j \rightarrow i}$ for $i,j = 1,
\dots, 3$ and $i \ne j$, and $\mathbf{M}_{ii} = - \sum_{j \ne i} w_{i
\rightarrow j}$, where the sum runs over all five states so that the
exits to states $4$ and $5$ are included. Evaluating (\ref{eq:corr}) we obtain the
403: well-known result \cite{hudson_kaplan85,hudson83}
404: \begin{equation}\label{eq:rho_no_pop_struct}
405: \rho(\tau_{x(ij)},\tau_{y(ij)}) \equiv
406: \frac{\left<\tau_{x(ij)}\tau_{y(ij)}\right> - \left<\tau\right>^2}{ \left<\tau^2\right>
407: - \left<\tau\right>^2} = \frac{R + 18}{R^2 + 13 R + 18} \,,
408: \end{equation}
409: where $R = 4Nr$. In order to calculate $\sigma^2_d$ for the
410: unstructured model, we obtain $\expt{\tau_{x(ij)}\tau_{y(ik)}}$
411: and $\expt{\tau_{x(ij)}\tau_{y(kl)}}$ from (\ref{eq:corr}) with
412: ${\bm v} = (0,1,0)\transpose$ and ${\bm v} = (0,0,1)\transpose$,
413: respectively. Inserting these into eq.~(\ref{eq:sigma2 def}), we recover
414: the result of McVean \cite{mcvean02}:
415: \beq
416: \sigma^2_d =
417: \frac{2\,( 6 + R ) + n\,( 10 + 11 R + R^2 ) + n^2 ( 10 + R ) }
418: {2\,( 6 + R ) - n\,( 14 + 13 R + R^2 ) + n^2 ( 22 + 13 R + R^2 ) }.
419: \eeq
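
To make the evaluation of (\ref{eq:corr}) concrete, the following sketch
assumes the standard two-locus transition rates for the states of
figure~\ref{fig:markovgraph}. The rates are given only in the figure, so
the matrix below is an assumption chosen to be consistent with
${\bm u}_1$ and ${\bm u}_2$, not a transcription of the figure. With the
states ordered $1,2,3$,
\[
\mathbf{M} = \left(\begin{array}{ccc}
-(1+R) & 1 & 0 \\
R & -(3 + R/2) & 4 \\
0 & R/2 & -6
\end{array}\right) ,
\]
where the diagonal elements are the total exit rates. Using
$\int_{\tau_1}^\infty \tau_2\,{\rm e}^{\tau_1-\tau_2}\,\rmd\tau_2 = \tau_1 + 1$
and $\int_0^\infty \tau^k\,{\rm e}^{\mathbf{M}\tau}\,\rmd\tau
= k!\,(-\mathbf{M})^{-(k+1)}$, the double integral reduces to
\[
\expt{\tau_x\tau_y} =
-2\,({\bm u}_1 + {\bm u}_2)\transpose\,\mathbf{M}^{-3}\,{\bm v}
+ {\bm u}_2\transpose\,\mathbf{M}^{-2}\,{\bm v}\,.
\]
For the three starting vectors ${\bm v}$ this gives
\begin{eqnarray*}
E_{ij,ij} &=& \frac{R^2 + 14 R + 36}{R^2 + 13 R + 18}\,, \\
E_{ij,ik} &=& \frac{R^2 + 13 R + 24}{R^2 + 13 R + 18}\,, \\
E_{ij,kl} &=& \frac{R^2 + 13 R + 22}{R^2 + 13 R + 18}\,.
\end{eqnarray*}
Since $\expt{\tau} = 1$ and $\mathrm{var}[\tau] = 1$ in units of $2N$
generations, $E_{ij,ij} - 1$ reproduces (\ref{eq:rho_no_pop_struct}). As a
numerical check, $R = 2$ yields $E_{ij,ij} = 17/12$, $E_{ij,ik} = 9/8$,
$E_{ij,kl} = 13/12$ and $\rho = 5/12$. For large $n$ the three expressions
give $\sigma^2_d \rightarrow (10 + R)/(22 + 13 R + R^2)$, in agreement with
the terms of order $n^2$ in the result above.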
420:
421: In the following, we consider models corresponding to Markov processes with rates which are
422: piece-wise constant functions
423: of time $t$. This allows us to calculate
424: $\langle\tau_{x(ij)}\tau_{y(ij)}\rangle$ from (\ref{eq:corr}) by taking
425: $\mathbf{M}$ and ${\bm u}$ to be functions of time.
426:
427:
428: \clearpage \newpage %
429: \section{Results}
430: \label{sec:results}
431:
432: After having illustrated our approach, we now briefly describe
433: the demographic models we have considered and summarise our results
434: for gene-history correlations in these models. Mathematical details
435: are given in appendices A and B. Implications are discussed
in section~\ref{sec:discussion}.
437:
438: \subsection{Bottleneck model}
439: \label{sec:bottleneck_model}
440:
Consider (cf.~\cite{eyre-walker_etal98}) an unstructured
442: population of constant size $N$ until $\tau_0 = 2 N G$ generations
443: ago. The population was then subject to a severe bottleneck of
444: short duration, followed by a rapid expansion to a very large
445: (infinite) population size (figure~\ref{fig:pop struct models}a).
446: Between the bottleneck and now, the population size is taken to be
effectively infinite, so the probability that two randomly
sampled individuals have a common ancestor more recent than the bottleneck
449: is negligible. Since the bottleneck is very narrow and has a short
450: duration, we may ignore the effect of recombination during the
451: bottleneck. It is convenient to parameterise the duration of the
452: bottleneck in terms of the probability $F$ that a single locus
453: coalesces during the bottleneck. In the limit when both the
454: population size and duration of the bottleneck are small (compared
455: to $2N$ individuals and generations, respectively), we obtain
456: (appendix A):
457: \begin{equation}\label{eq:rho_bottleneck}
458: \rho(\tau_{x(ij)},\tau_{y(ij)}) = \frac{A + B\,e^{-R G/2} + C\,e^{-R G}}{15\,
459: (2 - h)\,(18 + 13\,R + R^2 ) }\,,
460: \end{equation}
461: where $h = 1 - F$ and
462: \begin{eqnarray}
463: A &=& 6 ( 36 - 45 h + 20 h^2 - h^5 )
464: + 3 ( 28 - 65 h +\nonumber\\&&+\ 40 h^2 - 3 h^5 ) R
465: + {( 1 - h ) }^3 ( 6 + 3 h + h^2) R^2 \,, \\
466: B &=& 12( 9 - 5 h^2 + h^5 ) + ( 3 - 5 h^2 + 2 h^5 ) R^2\nonumber\\
467: &&+\ 6 ( 7 - 10 h^2 + 3 h^5 ) R \,, \\
468: C &=& 6 ( 36 - 10 h^2 - h^5 ) + ( 6 - 5 h^2 - h^5 ) R^2 \nonumber\\
469: &&+\ 3 ( 28 - 20 h^2 - 3 h^5 ) R \,.
470: \end{eqnarray}
471: We thus find
472: that this model exhibits correlations at arbitrarily large values of $R$,
473: a consequence of an infinite expansion rate after the bottleneck,
474: and negligible recombination within it. If, instead, the expansion
were to a finite population size (smaller than $GN$, say), the
correlations would still converge to a constant at large $R$. The
constant, however, is expected to be lower than the asymptotic
value obtained from (\ref{eq:rho_bottleneck}) as $R\rightarrow\infty$. Finally, if the
479: bottleneck lasts long enough for significant recombination to
480: occur within it, we still find long-range correlations, up to
481: scales of the order of $(2\tau_{\rm D}r)^{-1}$ where $\tau_{\rm
482: D}$ is the duration of the bottleneck (in generations). Beyond
483: this, the correlations decay, and in the limit $R\rightarrow\infty$
484: we have $\rho(\tau_{x(ij)},\tau_{y(ij)})\rightarrow 0$ as in the
485: unstructured population model.
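
Two limits provide a quick consistency check of
(\ref{eq:rho_bottleneck}). They follow from the coefficients $A$, $B$ and
$C$ as printed and are offered as a sketch, not as part of the derivation
in appendix A. At $R = 0$ one finds $A + B + C = 270\,(2 - h)$, which
equals the denominator $15\,(2-h)\cdot 18$, so that $\rho = 1$ for
completely linked loci. For $R \rightarrow \infty$ at fixed $G > 0$ the
exponential terms vanish and only the $R^2$ term of $A$ contributes:
\[
\rho(\tau_{x(ij)},\tau_{y(ij)}) \rightarrow
\frac{(1-h)^3\,(6 + 3h + h^2)}{15\,(2 - h)}
= \frac{F^3\,(10 - 5F + F^2)}{15\,(1 + F)}\,.
\]
This plateau vanishes as $2F^3/3$ for weak bottlenecks ($F \rightarrow 0$).
Finally, for $h = 1$ and $G \rightarrow 0$ the numerator becomes
$15\,(18 + R)$ and (\ref{eq:rho_bottleneck}) reduces to
(\ref{eq:rho_no_pop_struct}).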
486:
487: By the same approach, we calculate
488: $\expt{\tau_{x(ij)}\tau_{y(ik)}}$ and $\expt{ \tau_{x(ij)}
489: \tau_{y(kl)}}$. Inserting this into (\ref{eq:sigma2 def}) yields,
490: for large $n$:
491: \begin{eqnarray}\label{eq:sigma2_bottleneck}
492: \sigma^2_d &=& \frac{e^{-G\,R}}{\expt{ \tau_{x(ij)} \tau_{y(kl)}}} \Big[ 18\,h\,( 36 - 10\,h^2 - h^5) +
493: 9\,h\,( 28 - 20\,h^2 - 3\,h^5) \,R + \nonumber\\&& 3\,h\,( 6 - 5\,h^2 - h^5) \,R^2 \Big] \, ,
494: \end{eqnarray}
495: where
496: \begin{eqnarray}
497: \expt{ \tau_{x(ij)} \tau_{y(kl)}} &=& 18\,( 45\,G^2 + 36\,h + 90\,G\,h + 20\,h^3 - h^6 ) +\nonumber\\&&
498: 9\,( 65\,G^2 + 28\,h + 130\,G\,h + 40\,h^3 - 3\,h^6) \,R +\nonumber\\&&
499: ( 45\,G^2 + 18\,h + 90\,G\,h + 30\,h^3 - 3\,h^6) \,R^2 \, .
500: \end{eqnarray}
501: Note that $\sigma^2_d \rightarrow 0$ as $R \rightarrow \infty$.
In particular, it does not differ much from expression (7) for the unstructured model.
503: Hence, when the aim is to detect the population-size variations it
504: is better to focus on single-locus statistics.
505:
506: \subsection{Model of divergent populations, I}
507: \label{sec:div_model_1}
508:
Reich \etal consider a model of a diverging population
510: \cite{reich_etal02}: the population was unstructured with constant
511: population size $N$ until $\tau_0 =2 N G$ generations ago, when
the population split into two parts of equal size $N$ (note
513: that this implies a rapid population expansion from $N/2$ to $N$
514: after the split). The model is illustrated in figure~\ref{fig:pop
515: struct models}c. A portion $p$ of the sample is chosen from the
516: first population, and the rest from the second population. For any
517: two individuals in the sample, the expectation
518: $\rho(\tau_{x(ij)},\tau_{y(ij)})$ depends on whether the
519: individuals come from the same sub-population or not. Using the
520: technique illustrated above, it is straightforward to calculate
521: the expectation for both cases. Again, we find long-range
522: correlations, namely
523: \begin{equation}\label{eq:corr_model_2c}
524: \rho(\tau_{x(ij)},\tau_{y(ij)})
525: = 1 - \frac{1}{1 + 2\,p\,(1-p)\,(1 - 2\,p + 2\,p^2)\,G^2} \,,
526: \end{equation}
527: in the limit of large $R$ (in appendix B we describe how to
528: obtain the full result, valid for arbitrary values of $R$).
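
The large-$R$ limit (\ref{eq:corr_model_2c}) can be understood from a
short calculation, sketched here under the assumption that for
$R \rightarrow \infty$ the two loci have independent histories once it is
known whether the two sampled chromosomes come from the same or from
different sub-populations. Let $q = 2p(1-p)$ denote the probability of the
latter (a symbol introduced here for convenience). In units of $2N$
generations the coalescence time is exponentially distributed with unit
mean in the first case, and equals $G$ plus such an exponential time in
the second. Then
\begin{eqnarray*}
\expt{\tau} &=& 1 + qG\,, \\
\mathrm{cov}[\tau_x,\tau_y] &=& (1-q) + q\,(1+G)^2 - (1+qG)^2
 = q\,(1-q)\,G^2\,, \\
\mathrm{var}[\tau] &=& 2 + q\,(G^2 + 2G) - (1+qG)^2
 = 1 + q\,(1-q)\,G^2\,,
\end{eqnarray*}
and the ratio of the last two lines is
$\rho = 1 - [1 + q(1-q)G^2]^{-1}$. This coincides with
(\ref{eq:corr_model_2c}) because $q\,(1-q) = 2p(1-p)(1 - 2p + 2p^2)$. For
example, $p = 1/2$ gives $\rho = G^2/(4 + G^2)$.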
529:
530: Further, in the limit of large $R$ and large sample size $n$, we have
531: \beq\label{eq:sigma2_model_2c}
\sigma^2_d = \frac{4\, p^2\,(1 - p)^2\,G^2}{ [\,1 + 2\,p\,(1-p)\,G\,]^2} .
533: \eeq
Thus, for this model $\sigma^2_d$ remains non-zero in the limit of large
$R$, as opposed to $\sigma^2_d$ in the unstructured model (section
\ref{sec:methods}) and the bottleneck model
(section \ref{sec:bottleneck_model}), where it vanishes in this limit.
538:
539: \subsection{Model of divergent populations, II}
540: \label{sec:div_model_2}
541:
542: Now consider the model of two diverging sub-populations
543: \cite{eyre-walker_etal98} in figure~\ref{fig:pop struct models}b.
544: The population was unstructured with constant size of $N$
545: individuals until $\tau_0=2 N G$ generations ago, when a fraction
546: $\gamma$ of the population diverged. In subsequent generations,
the two sub-populations were unstructured but with no contact
548: between sub-populations. Individuals are randomly chosen from the
549: joint population. For two individuals in the sample, there are
550: three cases: both individuals may come from the smaller
551: sub-population, they may come from the larger sub-population, or
552: from different sub-populations. Using equation (\ref{eq:corr}) we
553: find long-range correlations: in the limit of large $R$,
554: $\rho$ remains finite,
555: \begin{eqnarray}\label{eq:corr_model_2b}
556: \rho(\tau_{x(ij)},\tau_{y(ij)}) &=& \frac{1}{\mbox{var}[\tau]} \big[
557: 1 - 2 s + 2 s^2 + 2 G\left( 2 + G \right) s +
558: s^2 {\rm e}^{-\frac{2 G}{\gamma }} +\\
559: &&s^2 {\rm e}^{-\frac{2 G}{1 - \gamma }} +
560: 2 s {\left( 1 - \gamma \right) }^2 {\rm e}^{-\frac{G}{1 - \gamma }} +
2 s {\gamma }^2 {\rm e}^{-\frac{G}{\gamma }} - \left<\tau\right>^2
562: \big]\,
563: \nonumber
564: \end{eqnarray}
565: where $s = \gamma\,(1-\gamma)$ and
566: \begin{eqnarray}
567: \left<\tau\right> &=& 1 + s (2 G - 1) + s \gamma {\rm e}^{-\frac{G}{\gamma}}
568: + s (1 - \gamma) {\rm e}^{-\frac{G}{1 - \gamma}}\, \\
569: \mbox{var}[\tau] &=& 2 + 2 s \big[ 2 s + (G + 1)^2 +
570: \gamma (1 + G + \gamma) {\rm e}^{-\frac{G}{\gamma}}
571: +\nonumber\\&&+\ (1 - \gamma) (2 + G - \gamma)
572: {\rm e}^{-\frac{G}{1 - \gamma}} - 3 \big] - \left<\tau\right>^2\,.
573: \end{eqnarray}
574: See the appendix for the full result. The long-range correlations
575: are found to be due to sampling of different sub-populations.
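
The structure of (\ref{eq:corr_model_2b}) becomes transparent in a short
sketch, assuming that the two sub-populations comprise fractions $\gamma$
and $1-\gamma$ of the $N$ individuals and that for
$R \rightarrow \infty$ the loci are independent given the origin of the
two sampled chromosomes. In units of $2N$ generations, a pair from a
sub-population of relative size $\phi$ coalesces at rate $1/\phi$ until
time $G$ and at unit rate thereafter, so its mean coalescence time is
\[
m(\phi) = \phi + (1 - \phi)\,{\rm e}^{-G/\phi}\,,
\]
while a pair from different sub-populations has mean $1 + G$. The three
cases occur with probabilities $\gamma^2$, $(1-\gamma)^2$ and $2s$, so that
\begin{eqnarray*}
\expt{\tau} &=& \gamma^2\,m(\gamma) + (1-\gamma)^2\,m(1-\gamma) + 2s\,(1 + G)\,, \\
\expt{\tau_x\tau_y} &\rightarrow& \gamma^2\,m(\gamma)^2
 + (1-\gamma)^2\,m(1-\gamma)^2 + 2s\,(1 + G)^2\,.
\end{eqnarray*}
Expanding with $\gamma^3 + (1-\gamma)^3 = 1 - 3s$ and
$\gamma^4 + (1-\gamma)^4 = 1 - 4s + 2s^2$ reproduces the expression for
$\expt{\tau}$ and all terms in the square bracket of
(\ref{eq:corr_model_2b}) other than $-\expt{\tau}^2$.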
576:
577: In the limit of large $R$ and large sample size, we have
578: \beq\label{eq:sigma2_model_2b}
579: \sigma^2_d = \frac{\gamma^2 (1 - \gamma)^2}{\expt{\tau}^2} \left[ 2\,G + \gamma\,(1 - \rme^{-\frac{G}{1-\gamma}}) + (1 - \gamma)(1 - \rme^{-\frac{G}{\gamma}}) \right]^2 .
580: \eeq
Again, we find that $\sigma^2_d$ remains non-zero in the limit of large
$R$.
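
Expression (\ref{eq:sigma2_model_2b}) can be traced back to
(\ref{eq:sigma2 def}) as follows; the argument is a sketch that assumes
independent loci at large $R$, conditional on the sub-populations sampled.
For large $n$ only the terms of order $n^2$ in (\ref{eq:sigma2 def})
survive, so that
$\sigma^2_d \rightarrow (E_{ij,ij} - 2E_{ij,ik} + E_{ij,kl})/E_{ij,kl}$.
Let $m_{\rm s} = \gamma + (1-\gamma)\,{\rm e}^{-G/\gamma}$,
$m_{\rm l} = (1-\gamma) + \gamma\,{\rm e}^{-G/(1-\gamma)}$ and
$m_{\rm d} = 1 + G$ denote the mean coalescence times (in units of $2N$
generations) for pairs from the smaller, the larger, and different
sub-populations; these symbols are introduced here for illustration only.
Then
\begin{eqnarray*}
E_{ij,ij} &=& \gamma^2 m_{\rm s}^2 + (1-\gamma)^2 m_{\rm l}^2
 + 2\gamma(1-\gamma)\,m_{\rm d}^2\,, \\
E_{ij,ik} &=& \gamma\,[\gamma m_{\rm s} + (1-\gamma) m_{\rm d}]^2
 + (1-\gamma)\,[\gamma m_{\rm d} + (1-\gamma) m_{\rm l}]^2\,, \\
E_{ij,kl} &=& \expt{\tau}^2\,,
\end{eqnarray*}
and a short calculation gives
\[
E_{ij,ij} - 2E_{ij,ik} + E_{ij,kl}
= \gamma^2 (1-\gamma)^2\,(2 m_{\rm d} - m_{\rm s} - m_{\rm l})^2\,,
\]
with
$2 m_{\rm d} - m_{\rm s} - m_{\rm l} = 2G
+ \gamma\,(1 - {\rm e}^{-G/(1-\gamma)})
+ (1-\gamma)\,(1 - {\rm e}^{-G/\gamma})$,
which is the square bracket in (\ref{eq:sigma2_model_2b}).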
583:
584:
585: \clearpage \newpage %
586: \section{Discussion}
587: \label{sec:discussion}
588:
589: Figure~\ref{fig:pop struct results} shows the correlations
590: $\rho(\tau_{x(ij)},\tau_{y(ij)})$ in the demographic models
591: considered, with parameters chosen to be consistent with the
592: empirically estimated time to the most recent common ancestor and
593: its coefficient of variation \cite{reich_etal02}.
594: %
595: When plotting the correlation of gene histories against physical
596: positions, we need to translate the recombination fraction $r$
597: into the corresponding expected number $\sigma x$ of crossover
598: events between the two loci. There are many such maps proposed in
599: the literature (see e.g. \cite{mcpeek_speed95} for a review of
600: these). They differ in how they model the chiasma process, but all
601: models have in common that for small enough $r$, $r \approx \sigma
x$. In humans, $r \approx \sigma x$ for $x \lesssim 10^6$\,bp. At
603: larger distances, deviations from linearity are not noticeable
604: since the expressions for $\rho(\tau_{x(ij)},\tau_{y(ij)})$ and
605: $\sigma^2_d$ converge for large $R$ (to different values, in
606: general).
607: %
608: Also shown are empirical estimates of lower and upper bounds on
609: the correlation of gene histories in the human genome
610: \cite{reich_etal02}. The correlations for the models described in
611: section \ref{sec:results} are substantially larger at large
612: distances than those for the unstructured model, but they lie
613: significantly below the lower bound of the empirical data, at
614: intermediate distances. We comment on possible causes for
615: this discrepancy in our conclusions.
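
As a rough numerical illustration for the unstructured model only, assume
$\expt{\tau} = 1$, so that $2N = 1.55\times 10^4$, together with a
recombination rate of $1.2$ cM/Mb, i.e. $\sigma = 1.2\times 10^{-8}$ per
base pair per generation (the values quoted in
figure~\ref{fig:pop struct results}). In the linear regime
$r \approx \sigma x$ this gives
\[
R = 4N\sigma x \approx 3.7\times 10^{-4}\,x/\mathrm{bp}\,.
\]
Inserting into (\ref{eq:rho_no_pop_struct}), a separation of
$x = 10^4$\,bp corresponds to $R \approx 3.7$ and $\rho \approx 0.27$,
whereas $x = 10^5$\,bp corresponds to $R \approx 37$ and
$\rho \approx 0.03$.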
616:
617: \begin{figure}
618: \centerline{\includegraphics{fig4.eps}}
619: \caption{\label{fig:pop struct results}
620: Correlation $\rho(\tau_{y(ij)},\tau_{(y+x)(ij)})$ of gene histories as a
621: function of the distance $x$ between them. Equations
622: (\ref{eq:rho_no_pop_struct}), (\ref{eq:rho_bottleneck}), and exact
623: expressions corresponding to (\ref{eq:corr_model_2c}) and
624: (\ref{eq:corr_model_2b}), from the appendix, were used. In all
625: cases, $r = 1.2$ cM/Mb, $N$ and $\mu$ were chosen to be
626: consistent with $2N\left<\tau\right> = 1.55\times 10^4$, and a
627: coefficient of variation of $0.94$ \cite{reich_etal02} (except in
628: the unstructured model). The lines are: the unstructured
629: coalescent (dashed), bottleneck model with $H = 0.1$ (red),
630: divergent model in figure~\ref{fig:pop struct models}b with
631: $\gamma = 0.2$ (blue), and divergent model in figure~\ref{fig:pop
632: struct models}c with $p = 0.3$ (green). Also shown are empirical
633: estimates of lower and upper bounds for the correlation of gene
634: histories in the human genome (squares) \cite{reich_etal02}.
635: }
636: \end{figure}
637:
638: Our results allow us to gain a qualitative understanding
639: of the influence of demographic factors on the decorrelation
640: of gene histories.
641: First, we find that models of bottlenecks and divergent
642: populations (figure~\ref{fig:pop struct models}) both exhibit
643: long-range correlations in gene histories, as numerically
644: demonstrated in \cite{reich_etal02}, but for very different
645: reasons. In bottlenecks, the length scale at which we find
646: significant correlations is governed by the degree of
647: recombination
648: within the
649: bottleneck: low recombination in the bottleneck gives rise to
650: long-range correlations. Further, the amount of correlation is
651: affected by the rate of expansion of the population after the
bottleneck: rapid expansion gives high correlations. Long-range
correlation in divergent models, on the other hand, we ascribe to the
fact that the covariance of $\tau_{x(ij)}$ and $\tau_{y(ij)}$ (the
numbers of generations since the common ancestors of the two copies of
loci $x$ and $y$, respectively) is different when individuals are selected from
657: the same or different sub-populations: typically, the covariance
658: is lower for individuals from the same sub-population than from
659: different ones. We find that this effect persists even for loci
660: far apart, but is decreased by population expansions during the
661: divergence.
662:
663: Second, we identify two contributions to the correlation of gene
664: histories in divergent populations: linkage disequilibrium and the
665: sampling of sub-populations with different demographic histories.
666: At short ranges, linkage disequilibrium correlates nearby patterns
667: by co-inheritance. Thus, for small distances, we conclude that the
668: demographic structure is unimportant: all reasonable models must
669: give high correlation for small distances. For long ranges, by
670: contrast, correlations due to linkage disequilibrium are expected
671: to vanish, but the contribution from differences in gene history
672: across sub-populations remains.
673:
674: Third, the domestication of crops and animals has shaped the
675: genetic makeup of the species, through selection for desirable
676: traits but also through the demographic history of each species
677: \cite{eyre-walker_etal98}. The pattern of genetic differences in
678: the laboratory mouse population depends strongly on its
679: demographic history \cite{wade_etal02}. In divergent populations,
680: we find that long-range correlations are insensitive to the
681: demographic history of the sub-populations. As a consequence, we
682: predict that the most important contribution to the correlation of
683: gene history in the laboratory mouse is from the original
684: divergence from the wild-type mouse.
685:
686: Fourth, we found that within the models described in
687: section~\ref{sec:results}, gene-history correlations are
688: substantially increased as compared with the unstructured,
689: standard model. However, the correlations still lie significantly
690: below the empirically determined data at intermediate distances.
691: In \cite{eriksson_mehlig04} it was shown that incorporating
692: empirically observed variations in the recombination-rate along
693: the chromosomes \cite{kong_etal02} significantly increases the
694: correlations in this regime.
695: Our analytical expressions for the correlation of gene
696: histories allow for studying the effect of such variations in the
697: recombination rate in models with demographic population structure.
698:
699: Fifth, we briefly mention possible extensions of the scheme introduced
700: in this paper.
701: In more general sampling schemes (different from those depicted
702: in figure~\ref{fig:pop struct models}), we may use the expressions for
703: $\left<\tau_{x(ij)}\,\tau_{y(ij)}\right>$ conditional on whether the
704: individuals in the sample came from the same sub-population or
705: not, and conditional on the population size during the divergence,
706: to calculate the correlation of gene histories by weighting the
707: different contributions by the probability that they occur under
the sampling scheme. Also, it is straightforward to extend the
709: calculations to combinations of bottlenecks and divergent
710: populations (figure~\ref{fig:pop struct models}d), and to more
711: complicated models involving more than two diverging branches
712: (figure~\ref{fig:pop struct models}e). It is expected that the
713: most distant (symmetric) divergence determines the long-range
714: correlations.
715:
716: How would a recent mixing event (figure~\ref{fig:pop struct
717: models}e) affect the correlation of gene histories? A merging of
718: the divergent populations $g$ generations ago leads to a
719: decorrelation of gene histories at distances of the order of $(4 g
720: r)^{-1}$, since then ancestral lines of both loci may come from
721: different sub-populations with approximately equal probability.
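
To give a sense of scale (an illustrative estimate only; the value of
$g$ used here is an assumption and is not taken from data), take
$r = 1.2$\,cM/Mb, that is $1.2\times 10^{-8}$ per base pair and
generation, as in figure~\ref{fig:pop struct results}, and $g = 100$
generations. Then
\[
(4 g r)^{-1} = \left(4 \times 100 \times 1.2\times 10^{-8}\right)^{-1}\,\mbox{bp}
\approx 2\times 10^{5}\,\mbox{bp} .
\]
Since this scale is inversely proportional to $g$, a more recent merging
corresponds to a proportionally longer decorrelation distance.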
722:
723: Finally, we have argued that the correlation $\rho(\tau_{x(ij)},\tau_{y(ij)})$ of gene
724: histories determines the association of SNP counts,
$\mbox{cov}[S_{x(ij)},S_{y(ij)}]$. Conversely, one may be interested
726: in estimating model
727: parameters from population data, deducing
728: $\rho(\tau_{x(ij)},\tau_{y(ij)})$
729: from the pairwise statistic $\mbox{cov}[S_{x(ij)},S_{y(ij)}]$.
730: Three questions arise. First, how can one in practice estimate $\mbox{cov}[\tau_{x(ij)},\tau_{y(ij)}]$
731: from the variance of SNP counts? Second,
how good is this estimate? Third, how much of
the information in the full data set (possibly pertaining to a large
734: number of individuals) is retained in the pair-wise statistic
735: $\mbox{cov}[S_{x(ij)},S_{y(ij)}]$?
736: We begin by answering the last question.
737: Due to the high amount of association between the chromosomes in a
738: sample, the information on genealogical history accumulates slowly as the
739: sample size is increased \cite{hudson01}. It follows that most
740: information can be found in pair-wise comparisons between the
741: chromosomes in the sample as used in eq.~(\ref{eq:cov S_a S_b}).
742: Going back to the first two questions, an estimator for
743: $\rho(\tau_{y(ij)},\tau_{(y+x)(ij)})$ can be
744: constructed as follows.
Assuming that the length $L_\mathrm{c}$
of the sequences is large, we can estimate the correlation of
747: polymorphism rates by averaging over all pairs and positions:
748: \begin{equation}\label{eq:estimate_rho}
749: \rho(\tau_{y(ij)},\tau_{(y+x)(ij)}) \approx \hat{\rho}(x) = \frac{\overline{S_y S_{y+x}} - \overline{S_y}^2}{\overline{S_y^2} - \overline{S_y}^2 - \overline{S_y}},
750: \end{equation}
751: where
752: \begin{equation}\label{eq:sequence_average_def}
753: \overline{S_y S_{y+x}} = \frac{2}{n(n-1)(L_\mathrm{c} - x - L)}
\sum_{i=2}^n \sum_{j=1}^{i-1} \sum_{y=1}^{L_\mathrm{c}-x-L} S_{y(ij)} S_{(y+x)(ij)} \,,
755: \end{equation}
756: and the single-locus quantities $\overline{S_y}$ and
757: $\overline{S_y^2}$ are defined similarly. Instead of regularly
758: spaced bins, as in (\ref{eq:sequence_average_def}), one may use
759: randomly positioned bins. For unstructured populations, and for
760: populations with bottlenecks and expansions, the accuracy of the
761: estimator $\hat{\rho}(x)$ depends mostly on the number of bins
762: (and hence on $L_\mathrm{c}$), and improves only slowly with
763: increasing $n$. For divergent models, however, increasing $n$
764: improves the sampling from the different sub-populations. In
765: figure~\ref{fig:estimate_rho} we show how $\hat{\rho}(x)$ compares
766: to $\rho(\tau_{y(ij)},\tau_{(y+x)(ij)})$ when applied to a sample. As can be
767: seen in the figure, when $x < L$ the bins overlap and
768: $\hat{\rho}(x)$ overestimates the correlations, but
769: otherwise it works well.
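
The form of the denominator in (\ref{eq:estimate_rho}) can be understood
from a short calculation, sketched here under the assumption that,
conditional on the gene histories, the SNP counts in two non-overlapping
bins are independent Poisson variables with means $c\,\tau_{x(ij)}$ and
$c\,\tau_{y(ij)}$. The constant $c$ (proportional to the mutation rate
and the bin length) is notation introduced for this sketch only. The
laws of total expectation and total variance then give
\[
\expt{S} = c\,\expt{\tau}, \qquad
\mbox{var}[S] = c^2\,\mbox{var}[\tau] + c\,\expt{\tau}, \qquad
\mbox{cov}[S_x,S_y] = c^2\,\mbox{cov}[\tau_x,\tau_y],
\]
so that
\[
\rho(\tau_x,\tau_y) = \frac{\mbox{cov}[S_x,S_y]}{\mbox{var}[S] - \expt{S}} .
\]
Replacing expectation values by sequence averages yields an expression
of the form (\ref{eq:estimate_rho}). When the bins overlap, the two
counts share mutations and the conditional independence assumed here no
longer holds.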
770:
771: \begin{figure}
772: \centerline{\includegraphics{fig5.eps}}
773: \caption{\label{fig:estimate_rho} %
774: Comparison of $\hat{\rho}(x)$ (markers) to
775: $\rho(\tau_{y(ij)},\tau_{(y+x)(ij)})$ (solid lines, calculated from theory),
776: for an unstructured population (red) and a divergent population
(blue). The estimates $\hat{\rho}(x)$ were obtained from a single
sample of 50 individuals, with $L_\mathrm{c} = 10$\,Mb, for
different bin sizes $L = 100$\,bp (diamonds), $L = 500$\,bp (circles)
and $L = 1$\,kb (squares). The parameters for the divergent model
are: $G = 0.6$, $p = 0.3$, $N = 6963.7$, $r = 0.95633$\,cM/Mb,
$\theta = 7.6\times 10^{-4}$. In the unstructured population model, the
783: population size is $N = 10^4$.
784: }
785: \end{figure}
786:
787:
788: \clearpage \newpage %
789: \section{Conclusions and outlook}
790: \label{sec:conclusions}
791:
792: We have derived closed analytical expressions for the correlation
793: of gene histories in established demographic models for genetic
794: evolution. These expressions allow us to understand and quantitatively
determine how demographic factors give rise to long-range
796: correlations in gene histories.
797:
798: The correlations analysed here determine
799: the two-person summary statistic (\ref{eq:cov S_a S_b}).
800: More information is contained in the mosaics of SNP
801: haplotype patterns for more than two individuals, and their
802: associations \cite{hudson01}. It is of great interest to derive
803: corresponding expressions for correlations between such patterns
804: in the models considered in this paper, especially
805: in the case of more than two loci.
806: Finally we note that the
807: quantity $\sigma_d^2$, a measure of linkage disequilibrium, was
808: shown to be a good approximation to $r^2$ in the case of
809: unstructured populations \cite{mcvean_etal02}. It is necessary to
810: investigate the relation between $r^2$ and $\sigma^2_d$ in models
811: with demographic structure.
812:
813:
814:
815: \clearpage \newpage %
816: { \appendix
817:
818: \section*{Appendix A: Derivation of bottleneck formula}
819: \setcounter{section}{1}
820: \label{app:bottleneck formula}
821:
822: During the bottleneck, the time between coalescent events is
823: exponentially distributed with rate ${n \choose 2}/(2\,\gamma N)$,
824: where $n$ is the number of lines carrying ancestral material.
Recombination events occur with rate $n\,R/(4N)$, independent of
826: $\gamma$. Thus when $\gamma$ is very small, coalescent events
827: dominate the process.
828:
829: We assume that during the bottleneck, the reduction in effective
830: population size is so drastic that $\gamma$ is effectively zero.
831: By rescaling the time by a factor of $\gamma$ and taking the limit
832: of $\gamma \rightarrow 0$ we find
833: \be
834: \mathbf{M}' = \lim_{\gamma \rightarrow 0} \mathbf{M}(\gamma)\,\gamma =
835: \left[\begin{array}{rrr}
836: -1 & 1 & 0 \\
837: 0 & -3 & 4 \\
838: 0 & 0 & -6
839: \end{array}\right],
840: \ee
841: so the time evolution operator becomes
842: \be
843: \exp(\mathbf{M}'\,t) =
844: \left[\begin{array}{rrr}
845: e^{-t} & \frac{1}{2}\,e^{-t} - \frac{1}{2}\,e^{-3 t} & \frac{2}{5}\,e^{-t} - \frac{2}{3}\,e^{-3 t} + \frac{4}{15}\,e^{-6 t}\\
846: 0 & e^{-3 t} & \frac{4}{3}\,e^{-3 t} - \frac{4}{3}\,e^{-6 t}\\
847: 0 & 0 & e^{-6 t} \\
848: \end{array}\right] .
849: \ee
850: In the original model, the inbreeding coefficient $F$ was
851: specified. We choose to parameterise the severity of the
852: bottleneck by its duration $D$. If the process is in state $1$
853: (figure~3)
854: when entering the bottleneck, the probability of coalescence
855: during the bottleneck is
856: \be
857: \int_0^D {\bm u}_1\transpose\,\rme^{\mathbf{M}'\,t}\,{\bm u}_1 \,\rmd t = 1 - \rme^{-D},
858: \ee
859: so we see that by taking $D = -\ln(1 - F)$, we get the correct
860: inbreeding coefficient. We can now express the time evolution
861: operator from the beginning to the end of the bottleneck as
862: \be\label{eq:propagator_bn}
863: \exp(\mathbf{M}'\,D) =
864: \left[\begin{array}{rrr}
865: H & \frac{1}{2}\,H\,(1 - H^2) & \frac{2}{15}\,H\,( 3 - 5\,H^2 + 2\,H^5 ) \\
866: 0 & H^3 & \frac{4}{3}\,H^3\,(1 - H^3) \\
867: 0 & 0 & H^6
868: \end{array}\right] ,
869: \ee
870: where $H = 1 - F$. The probability that the loci become linked
871: during the bottleneck depends on the state of the process when the
872: bottleneck is entered:
873: \be\label{eq:prob_linked_bn}
874: \int_0^D {\bm u}_1\transpose\, \rme^{\mathbf{M}'\,t} \,\rmd t =
875: \left\{\begin{array}{ll}
876: F & \mbox{in state $1$} \\
877: \frac{1}{6}\,( 2 + H )\,F^2 & \mbox{in state $2$} \\
878: \frac{2}{45}\,( 5 + 6\,H + 3\,H^2 + H^3 )\,F^3 & \mbox{in state $3$}
879: \end{array}\right.
880: \ee
881: Similarly, we have the probability that one locus, but not the
882: other, reaches its most recent common ancestor during the
883: bottleneck, depending on the state of the process when entering
884: the bottleneck:
885: \be\label{eq:prob_apart_bn}
886: \int_0^D {\bm u}_2\transpose\, \rme^{\mathbf{M}'\,t} \,\rmd t =
887: \left\{\begin{array}{ll}
888: 0 & \mbox{in state $1$}\\
889: \frac{2}{3} \,( 1 - H^3) & \mbox{in state $2$} \\
890: \frac{1}{9}\,(7 - 8\,H^3 + H^6) & \mbox{in state $3$}
891: \end{array}\right.
892: \ee
893: Together, (\ref{eq:propagator_bn}), (\ref{eq:prob_linked_bn}) and
(\ref{eq:prob_apart_bn}) determine the
895: state of the process after the bottleneck. Using this
896: information and the method
897: for the unstructured population as outlined in section 2 allows
898: us to derive the gene-history correlation for the bottleneck model.
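
As a check on (\ref{eq:prob_linked_bn}), the entry for state $2$ can be
obtained by integrating the corresponding element of
$\exp(\mathbf{M}'\,t)$ and using $H = \rme^{-D}$ and $F = 1 - H$:
\[
\int_0^D \left( \frac{1}{2}\,\rme^{-t} - \frac{1}{2}\,\rme^{-3t} \right) \rmd t
= \frac{1}{2}\,(1 - H) - \frac{1}{6}\,(1 - H^3)
= \frac{1}{6}\,(1 - H)^2\,(2 + H)
= \frac{1}{6}\,(2 + H)\,F^2 .
\]
The entry for state $3$ follows in the same way from the third element
of the first row of $\exp(\mathbf{M}'\,t)$.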
899:
900: \section*{Appendix B: Correlation of gene histories in divergent
901: populations}
902: \setcounter{section}{2}
903: \label{app:div pop}
904:
Assume that individuals come from the left sub-population with
probability $p$ and from the right one with probability $1-p$. The
population sizes in the left and right sub-populations are $\gamma
N$ and $\Gamma N$, respectively, and the population size before
the divergence is $N$.
%
The two-person coalescent process is described by a Markov process
over the states in table~\ref{tab:states}, where state $1$ is the
absorbing state of the process, and the process starts in one of
states $3$--$11$.
915: %
916:
917: \begin{table}
918: \caption{\label{tab:states} %
919: The states of the Markov process of loci $x$ and $y$ in
920: chromosomes $i$ and $j$, for the divergent population. For each
921: state we show the corresponding configurations of the
922: sub-populations, separated by a vertical bar. A dash denotes
923: genetic material that is not ancestral to any locus in the sample.
The symbol $\phi$ denotes a sub-population unrelated to the sample,
and a diamond denotes a common ancestor to chromosomes $i$ and
$j$ (for that locus).
927: }
928: \begin{indented}
929: \item[]\begin{tabular}{c r@{ $|$ }l}
930: \br
931: State & \multicolumn{2}{c}{Population configuration} \\
932: \mr
933: 0 & $\phi$ & $\phi$ \\
934: \mr
935: 1 & $x_i \diamond$, $x_j \diamond$ & $\phi$ \\
936: 2 & $x_i \diamond$ & $x_j \diamond$ \\
937: \mr
938: 3 & $x_i y_i$, $x_j y_j$ & $\phi$ \\
939: 4 & $x_i y_i$ & $x_j y_j$ \\
940: \mr
941: 5 & $x_i y_i$, $x_j-$, $-y_j$ & $\phi$ \\
942: 6 & $x_i y_i$, $x_j-$ & $-y_j$ \\
943: 7 & $x_i y_i$ & $x_j-$, $-y_j$ \\
944: \mr
945: 8 & $x_i-$, $-y_i$, $x_j-$, $-y_j$ & $\phi$ \\
946: 9 & $x_i-$, $-y_i$, $x_j-$ & $-y_j$ \\
947: 10 & $x_i-$, $-y_i$ & $x_j-$, $-y_j$ \\
948: 11 & $x_i-$, $x_j-$ & $-y_i$, $-y_j$ \\
949: \br
950: \end{tabular}
951: \end{indented}
952: \end{table}
953: %
954: We now define $e_i = \expt{\, \ta \tb\, |\, \mbox{Process starting
955: in state $i$}\,}$. With these, we may write
956: \be
957: \expt{ \tau_{x(ij)} \tau_{y(ij)} } &=&
958: p^2\, e_3(\gamma) + (1 - p)^2\, e_3(\Gamma) + 2 p (1 - p)\, e_4(\gamma,\Gamma), \\
959: %
960: \expt{ \tau_{x(ij)} \tau_{y(ik)} } &=&
961: p^3\, e_5(\gamma) + (1-p)^3\, e_5(\Gamma) \nonumber\\&+&
2 p^2 (1 - p)\, e_6(\gamma) + 2 p (1 - p)^2\, e_6(\Gamma) \nonumber\\&+&
963: p (1 - p)^2\, e_7(\gamma, \Gamma) + p^2 (1 - p)\, e_7(\Gamma, \gamma), \\
964: %
965: \expt{ \tau_{x(ij)} \tau_{y(kl)} } &=&
966: p^4\, e_8(\gamma) + (1 - p)^4\, e_8(\Gamma) \nonumber\\&+&
967: 4 p^3 (1 - p)\, e_9(\gamma) + 4 p (1 - p)^3\, e_9(\Gamma) \nonumber\\&+&
968: 4 p^2 (1 - p)^2\, e_{10}(\gamma,\Gamma) + 2 p^2 (1 - p)^2\, e_{11}(\gamma,\Gamma).
969: \ee
970: From this, the correlation $\rho(\tau_{x(ij)},\tau_{y(ij)})$ and $\sigma^2_d$
971: may be calculated for both models of divergent populations:
972: setting $\gamma = \Gamma = 1$ gives the model described in section
973: \ref{sec:div_model_1}; setting $\Gamma = 1 - \gamma$ and $p =
974: \gamma$ gives the model described in section
975: \ref{sec:div_model_2}.
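
A simple check on these expressions is that the weights multiplying the
$e_i$ in each of them sum to unity, since they are the probabilities of
the possible assignments of the sampled chromosomes to the two
sub-populations. For two chromosomes,
\[
p^2 + (1 - p)^2 + 2 p (1 - p) = \left[ p + (1 - p) \right]^2 = 1 ,
\]
and for three and four chromosomes the weights are the terms of the
binomial expansions of $[p + (1-p)]^3$ and $[p + (1-p)]^4$. For example,
\[
p^4 + 4 p^3 (1 - p) + 6 p^2 (1 - p)^2 + 4 p (1 - p)^3 + (1 - p)^4 = 1 ,
\]
where the coefficient $6$ of the middle term is shared between
$e_{10}$ (weight $4$) and $e_{11}$ (weight $2$).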
976:
977:
978:
979: \subsection*{Calculation of $e_3,\ldots,e_{11}$ for the model
980: introduced in section 4.2}
981:
982: \newcommand{\MM}{\mathbf{M}_1}
983:
984: The two-locus coalescent in a population of size $\gamma N$ is
985: described by a Markov process with the evolution matrix
986: \beq
987: \MM = \left[ \begin{array}{ccc}
988: - 1/\gamma - R & 1/\gamma & 0 \\
989: R & - 3/\gamma - R/2 & 4/\gamma \\
990: 0 & R/2 & - 6/\gamma
\end{array} \right]\!\!,
\eeq
where $R = 4Nr$. Before the divergence, $\gamma = 1$ and we denote
the corresponding evolution matrix by $\mathbf{M}$.
Assuming that the population is in state $3$, $5$, or
997: $8$ with probabilities $v_1$, $v_2$, and $v_3$, respectively, we
998: proceed as for the unstructured population in section
999: \ref{sec:methods}, calculating $\expt{\ta \tb}$ conditional on
1000: starting from distribution ${\bm v}$. We obtain
1001: $e_3(\gamma) = c_\mathrm{s}(\gamma, (1,\, 0,\, 0)\transpose)$,
1002: $e_5(\gamma) = c_\mathrm{s}(\gamma, (0,\, 1,\, 0)\transpose)$,
1003: and
1004: $e_8(\gamma) = c_\mathrm{s}(\gamma, (0,\, 0,\, 1)\transpose)$,
1005: where
1006: \be
1007: \fl c_\mathrm{s}(\gamma, {\bm v}) &=& \frac{{\bm u}_1\transpose}{\gamma} \, (-\MM)^{-3} \, \big[2\,\mathbf{I} - (2\,\mathbf{I} - 2\,\frac{G}{\gamma}\,\MM + \frac{G^2}{\gamma^2}\, \MM^2)\,\expm{\MM G}\big] {\bm v} \nonumber\\
1008: \fl &+& {\bm u}_1\transpose \, (-\mathbf{M})^{-3} \, \big(2\,\mathbf{I} - 2\,G\,\mathbf{M} + G^2\,\mathbf{M}^2 \big) \, \expm{\MM G} {\bm v} \nonumber\\
1009: \fl &+& \frac{ {\bm u}_2\transpose}{\gamma} \, (-\MM)^{-3} \Big\{ 2\,\mathbf{I} - \gamma\,\MM - \left[ 2\,\mathbf{I} - (2\,G + \gamma)\,\MM + G\,(G + \gamma)\, \MM^2 \right]\expm{\MM G} \Big\} {\bm v} \nonumber\\
1010: \fl &+& (1 - \gamma)\,{\bm u}_2\transpose \, (\mathbf{I} + \gamma\,\MM)^{-2} \, \Big\{ \gamma\,e^{-G/\gamma}\,\mathbf{I} + \left[ \, (G - \gamma)\,\mathbf{I} + \gamma\,G\,\MM \, \right]\,\expm{\MM G} \Big\} {\bm v} \nonumber\\
1011: \fl &+& {\bm u}_2\transpose (-\mathbf{M})^{-3} \left[ 2\,\mathbf{I} - (1 + 2\,G)\,\mathbf{M} + G\,(G + 1)\,\mathbf{M}^2 \right] \expm{\MM G} {\bm v} .
1012: \ee
1013:
1014:
1015:
1016: During the split, the coalescent is described by a Markov process
1017: with the evolution matrix
1018: \beq
1019: \mathbf{M}_2 = \left[ \begin{array}{cc}
1020: - 1/\gamma - R/2 & 2/\gamma \\[2pt]
1021: R/2 & - 3/\gamma
1022: \end{array} \right]\!\!.
1023: \eeq
1024: A coalescent event during the split happens with the distribution
1025: $\gamma^{-1} (1,\, 1)\, \rme^{\mathbf{M}_2 \ta} {\bm v},$
1026: %\eeq
1027: where ${\bm v} = (1,\, 0)$ when starting from state $6$ and ${\bm
1028: v} = (0,\, 1)$ when starting from state $9$. Thus, we have the
1029: contribution
1030: \[
1031: \int_0^G\!\! \ta \, \frac{1}{\gamma}\, (1,\, 1) \rme^{\mathbf{M}_2 \ta}\, {\bm v}\, \rmd\ta
1032: \int_G^\infty\!\! \tb\, \rme^{-(\tb - G)} \rmd\tb
1033: \]
1034: The population is in state $5$ or $8$, right before the split,
1035: with probability ${\bm a}\, \expm{\mathbf{M}_2 G}\, {\bm v}$,
1036: where ${\bm a} = (1, 0)$ for state $5$ and ${\bm a} = (0, 1)$ for
1037: state $8$. From this we obtain
1038: \be
1039: e_6(\gamma) &=& A(\gamma) + R \gamma \, B(\gamma) \nonumber\\
e_9(\gamma) &=& A(\gamma) - 4\, B(\gamma) \nonumber
1041: \ee
1042: where
1043: \beq
1044: \fl A(\gamma) = (1 + G) \gamma + \left[ (1 + G)(1 - \gamma) + \frac{24 + 4 R \gamma }{( 4 + R \gamma)( 18 + 13 R + R^2 ) } \right] \mathrm{e}^{-G/\gamma}
1045: \eeq
1046: and
1047: \beq
1048: \fl B(\gamma) = \frac{2}{( 4 + R\,\gamma) \, ( 18 + 13\,R + R^2 ) }\, \exp\!\left(- \frac{G\,( 6 + R\,\gamma) }{2\,\gamma }\right)
1049: \eeq
1050:
1051:
1052: Now consider starting from states $4$, $7$ or $10$.
1053: In these cases, there is no coalescent event during the split. In
1054: each sub-population the coalescent is described by a Markov
1055: process with the evolution matrix
1056: \beq
1057: \mathbf{M}_3 = \left[ \begin{array}{cc}
1058: - R/2 & 1/\gamma \\[2pt]
1059: R/2 & - 1/\gamma
1060: \end{array} \right]\!\!.
1061: \eeq
1062: Note that the columns sum to zero: the probability of escaping
1063: from these states is zero during the split.
1064:
1065: Right before the split, the population is in state $3$, $5$ or $8$
1066: with probability $\phi_1$, $\phi_2$, and $\phi_3$, respectively.
1067: Then, the contribution is
1068: \be
1069: \fl && \int_G^\infty \! \left[ \ta^2\, {\bm u}_1\transpose + \int_{\ta}^\infty\!\! \ta \tb\, \rme^{\ta - \tb}\, \rmd\tb \, {\bm u}_2\transpose \right] \rme^{\mathbf{M}\,(\ta - G)}\, {\bm \phi} \,\, \rmd\ta \nonumber\\
1070: \fl &&\hspace{1cm}=\ (1 + G)^2 (\phi_1 + \phi_2 + \phi_3) + \frac{(R + 18)\phi_1 + 6 \phi_2 + 4 \phi_3}{R^2 + 13 R + 18}
1071: \ee
1072: Now define $P_\mathrm{L}(\gamma)$ as the probability of the
1073: genetic material being on the same gamete at the moment of the
1074: split, given that it is on the same gamete in the sample. We have
1075: \beq
1076: P_\mathrm{L}(\gamma) = (1,\, 0)\, \expm{\mathbf{M}_3\, G}\, (1,\,0)\transpose = \frac{2 + R \gamma \exp\!\left(- \frac{G (2 + R \gamma)}{2 \gamma} \right)}{2 + R \gamma}.
1077: \eeq
1078: Similarly, we define $P_\mathrm{B}(\gamma)$ as the probability of
1079: the genetic material being on the same gamete at the moment of the
1080: split, given that it is on different gametes in the sample. We
1081: have
1082: \beq
1083: P_\mathrm{B}(\gamma) = (1,\, 0)\, \expm{\mathbf{M}_3\, G}\, (0,\,1)\transpose = \frac{2 - 2 \exp\!\left(- \frac{G (2 + R \gamma)}{2 \gamma} \right)}{2 + R \gamma}.
1084: \eeq
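
These two probabilities are those of a two-state Markov chain. As a
sketch (the symbols $a$ and $b$ are introduced here for illustration
only), write $a = R/2$ for the rate of leaving the linked state and
$b = 1/\gamma$ for the rate of returning to it. The matrix
$\mathbf{M}_3$ then has the eigenvalues $0$ and $-(a + b)$, and
\[
P_\mathrm{L} = \frac{b}{a + b} + \frac{a}{a + b}\, \rme^{-(a + b)\,G} ,
\qquad
P_\mathrm{B} = \frac{b}{a + b} \left( 1 - \rme^{-(a + b)\,G} \right) ,
\]
with $b/(a + b) = 2/(2 + R\gamma)$ and $(a + b)\,G = G\,(2 + R\gamma)/(2\gamma)$.
For $G \rightarrow \infty$ both probabilities tend to the stationary value
$2/(2 + R\gamma)$.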
1085: If the sample is in state $4$, we have
1086: \be
1087: \phi_1 &=& P_\mathrm{L}(\gamma) \, P_\mathrm{L}(\Gamma) \nonumber\\
1088: \phi_2 &=& P_\mathrm{L}(\gamma) \, [1 - P_\mathrm{L}(\Gamma)] + [1 - P_\mathrm{L}(\gamma)] \, P_\mathrm{L}(\Gamma) \nonumber\\
1089: \phi_3 &=& [1 - P_\mathrm{L}(\gamma)] \, [1 - P_\mathrm{L}(\Gamma)]
1090: \ee
1091: Since $\phi_1 + \phi_2 + \phi_3 = 1$ we have
1092: \beq
1093: \fl e_4(\gamma,\Gamma) = (1 + G)^2 + \frac{4 + 2\,P_\mathrm{L}(\gamma) + 2\,P_\mathrm{L}(\Gamma) + (10 + R)\,P_\mathrm{L}(\gamma)\, P_\mathrm{L}(\Gamma) }{R^2 + 13 R + 18}
1094: \eeq
1095: Similarly, we obtain
1096: \beq
1097: \fl e_7(\gamma,\Gamma) = (1 + G)^2 + \frac{4 + 2\,P_\mathrm{L}(\gamma) + 2\,P_\mathrm{B}(\Gamma) + (10 + R)\,P_\mathrm{L}(\gamma)\, P_\mathrm{B}(\Gamma) }{R^2 + 13 R + 18}
1098: \eeq
1099: and
1100: \beq
1101: \fl e_{10}(\gamma,\Gamma) = (1 + G)^2 + \frac{4 + 2\,P_\mathrm{B}(\gamma) + 2\,P_\mathrm{B}(\Gamma) + (10 + R)\,P_\mathrm{B}(\gamma)\, P_\mathrm{B}(\Gamma) }{R^2 + 13 R + 18}
1102: \eeq
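
The numerators in these three expressions follow from inserting the
probabilities $\phi_1$, $\phi_2$ and $\phi_3$ into
$(R + 18)\,\phi_1 + 6\,\phi_2 + 4\,\phi_3$. Writing $P$ and $P'$ for the
two probabilities involved (a shorthand used only in this remark; for
$e_4$, $P = P_\mathrm{L}(\gamma)$ and $P' = P_\mathrm{L}(\Gamma)$), one finds
\[
(R + 18)\,P P' + 6\,\left[ P\,(1 - P') + (1 - P)\,P' \right] + 4\,(1 - P)(1 - P')
= 4 + 2\,P + 2\,P' + (10 + R)\,P P' .
\]
For $e_7$ and $e_{10}$ the same algebra applies, with $P_\mathrm{L}$
replaced by $P_\mathrm{B}$ for each sub-population in which the genetic
material is on different gametes in the sample.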
1103: %
1104: % --------------------------------------------------------------
1105: %
1106: Finally, starting from state $11$, we obtain
1107: \beq
1108: \fl e_{11}(\gamma,\Gamma) =
1109: \frac{4}{18 + 13R + R^2} \, \rme^{-G/\gamma - G/\Gamma} +
1110: \left[ \gamma + ( 1 - \gamma ) \mathrm{e}^{-G/\gamma} \right]\!
\left[ \Gamma + ( 1 - \Gamma )\mathrm{e}^{-G/\Gamma} \right]
1112: \eeq
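
The factors in square brackets are single-locus expectations. As a
sketch of where they come from: two lines in a sub-population of
relative size $\gamma$ coalesce at rate $1/\gamma$ during the split
(times smaller than $G$) and at rate $1$ before it, so that
\[
\expt{\tau} = \int_0^G \tau\, \frac{1}{\gamma}\, \rme^{-\tau/\gamma}\, \rmd\tau
+ \rme^{-G/\gamma}\, (G + 1)
= \gamma + (1 - \gamma)\, \rme^{-G/\gamma} .
\]
The corresponding expression for the other sub-population follows by
replacing $\gamma$ with $\Gamma$.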
1113:
1114:
1115: \subsection*{Calculation of $e_3,\ldots,e_{11}$ for the
1116: model introduced in section 4.3}
1117: In this model, $\gamma = \Gamma = 1$ so the formulas simplify
1118: considerably. Starting from state $3$, $5$ or $8$, we obtain
1119: \be
1120: e_3 &=& 1 + \frac{18 + R}{R^2 + 13 R + 18} \nonumber\\
1121: e_5 &=& 1 + \frac{6}{R^2 + 13 R + 18} \nonumber\\
e_8 &=& 1 + \frac{4}{R^2 + 13 R + 18} \nonumber
1123: \ee
1124: as calculated by Griffiths \cite{griffiths81}. Starting from state
1125: $6$ or $9$, we obtain
1126: \be
e_6 &=& (1 + G) + \frac{ (24 + 4 R) \rme^{-G} + 2 R\, \rme^{-G(6 + R)/2}} {( 4 + R )( 18 + 13 R + R^2 ) } \\
e_9 &=& (1 + G) + \frac{ (24 + 4 R) \rme^{-G} - 8 \, \rme^{-G(6 + R)/2}} {( 4 + R )( 18 + 13 R + R^2 ) }
1129: \ee
1130: Starting from state $4$, $7$ or $10$, we obtain
1131: \be
1132: e_4 &=& a + 8 R\, b + R^2\, c \nonumber \\
1133: e_7 &=& a + 4 (R - 2)\, b - 2 R\, c \nonumber\\
1134: e_{10} &=& a - 16\, b + 4\, c
1135: \ee
1136: where
1137: \be
1138: a &=& (1 + G)^2 - \frac{8}{(2 + R)^2} - \frac{21}{2 + R} + \frac{3\,( 81 + 7 R)}{18 + 13 R + R^2} \nonumber\\
1139: b &=& \frac{6 + R}{(2 + R)^2 (18 + 13 R + R^2)}\, \mathrm{e}^{-G(2 + R)/2} \nonumber\\
1140: c &=& \frac{10 + R}{(2 + R)^2 (18 + 13 R + R^2)}\, \mathrm{e}^{-G(2 + R)}
1141: \ee
1142: Finally, starting from state $11$ gives
1143: \beq
1144: e_{11} = 1 + \frac{4 \mathrm{e}^{-2G}}{18 + 13 R + R^2} .
1145: \eeq
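
A useful consistency check on the expressions in this appendix is the
limit $G \rightarrow 0$, in which the sub-populations have had no time
to evolve separately. In this limit $P_\mathrm{L} \rightarrow 1$ and
$P_\mathrm{B} \rightarrow 0$, so the expressions for $e_4$, $e_7$ and
$e_{10}$ in terms of $P_\mathrm{L}$ and $P_\mathrm{B}$ reduce to
\[
e_4 \rightarrow 1 + \frac{18 + R}{R^2 + 13 R + 18} = e_3 , \qquad
e_7 \rightarrow 1 + \frac{6}{R^2 + 13 R + 18} = e_5 , \qquad
e_{10} \rightarrow 1 + \frac{4}{R^2 + 13 R + 18} = e_8 ,
\]
and likewise $e_{11} \rightarrow e_8$, as expected for a single
unstructured population.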
1146:
1147: } % end of appendix
1148:
1149:
1150: \newpage
1151: \section*{References}
1152:
1153: \bibliographystyle{prsty}
1154: %\bibliographystyle{unsrt}
1155:
1156: \begin{thebibliography}{10}
1157:
1158: \bibitem{hudson90}
1159: R.~R. Hudson, in {\em Oxford Surveys in Evolutionary Biology}, edited by D.
1160: Futuyma and J. Antonovics (Oxford University Press, Oxford, 1990), pp.\ 1 --
1161: 43.
1162:
1163: \bibitem{nordborg_tavare02}
1164: M. Nordborg and S. Tavar\'e, Trends in Genetics {\bf 18}, 83 (2002).
1165:
1166: \bibitem{reich_etal02}
1167: D.~E. Reich {\it et~al.}, Nature Genetics {\bf 32}, 135 (2002).
1168:
1169: \bibitem{hapmap_group03}
1170: {Int. HapMap Consortium}, Nature {\bf 426}, 789 (2003).
1171:
\bibitem{tajima87a}
F. Tajima, Genetics {\bf 123}, 585 (1989).

\bibitem{tajima87b}
F. Tajima, Genetics {\bf 123}, 597 (1989).
1177:
1178: \bibitem{slatkin_hudson91}
1179: M. Slatkin and R.~R. Hudson, Genetics {\bf 129}, 555 (1991).
1180:
1181: \bibitem{sano_etal04}
1182: A. Sano, A. Shimizu, and M. Iizuka, Theor. Pop. Biol. {\bf 65}, 39 (2004).
1183:
1184: \bibitem{wakeley96}
1185: J. Wakeley, Theor. Pop. Biol. {\bf 49}, 39 (1996).
1186:
1187: \bibitem{teshima_tajima03}
1188: K.~M. Teshima and F. Tajima, Theor. Pop. Biol. {\bf 62}, 81 (2003).
1189:
1190: \bibitem{stumph_goldstein03}
M.~P.~H. Stumpf and D.~B. Goldstein, Curr. Biol. {\bf 13}, 1 (2003).
1192:
1193: \bibitem{patil_etal01}
1194: N. Patil {\it et~al.}, Science {\bf 294}, 1719 (2001).
1195:
1196: \bibitem{snp_group01}
1197: {Int. SNP Map Working Group}, Nature {\bf 409}, 928 (2001).
1198:
1199: \bibitem{kaplan_hudson85}
1200: N. Kaplan and R.~R. Hudson, Theor. Pop. Biol. {\bf 28}, 382 (1985).
1201:
1202: \bibitem{hudson83}
1203: R.~R. Hudson, Theor. Pop. Biol. {\bf 23}, 183 (1983).
1204:
1205: \bibitem{pluzhnikov_donelly96}
A. Pluzhnikov and P. Donnelly, Genetics {\bf 144}, 1247 (1996).
1207:
\bibitem{hudson01}
R.~R. Hudson, Genetics {\bf 159}, 1805 (2001).

\bibitem{mcvean_etal02}
G. McVean, P. Awadalla, and P. Fearnhead, Genetics {\bf 160}, 1231 (2002).

\bibitem{griffiths_marjoram96}
R.~C. Griffiths and P. Marjoram, J. Comput. Biol. {\bf 3}, 479 (1996).

\bibitem{kuhner_etal00}
M.~K. Kuhner, J. Yamato, and J. Felsenstein, Genetics {\bf 156}, 1393 (2000).

\bibitem{nielsen00}
R. Nielsen, Genetics {\bf 154}, 931 (2000).
1224:
1225: \bibitem{hill_robertson68}
1226: W.~G. Hill and A. Robertson, Theor. Appl. Genet. {\bf 38}, 473 (1968).
1227:
1228: \bibitem{mcvean02}
1229: G. McVean, Genetics {\bf 162}, 987 (2002).
1230:
1231: \bibitem{kong_etal02}
A. Kong {\it et~al.}, Nature Genetics {\bf 31}, 241 (2002).
1233:
1234: \bibitem{eriksson_mehlig04}
1235: A. Eriksson and B. Mehlig, Submitted to Genetics (2004).
1236:
1237: \bibitem{griffiths81}
1238: R.~C. Griffiths, Theor. Pop. Biol. {\bf 19}, 169 (1981).
1239:
1240: \bibitem{hudson_kaplan85}
1241: R.~R. Hudson and N.~L. Kaplan, Genetics {\bf 111}, 147 (1985).
1242:
1243: \bibitem{eyre-walker_etal98}
1244: A. Eyre-Walker {\it et~al.}, Proc. Natl. Acad. Sci. {\bf 95}, 4441 (1998).
1245:
1246: \bibitem{mcpeek_speed95}
1247: M.~S. McPeek and T.~P. Speed, Genetics {\bf 139}, 1031 (1995).
1248:
1249: \bibitem{wade_etal02}
1250: C.~M. Wade {\it et~al.}, Nature {\bf 420}, 574 (2002).
1251:
1252: \end{thebibliography}
1253:
1254:
1255: \newpage
1256: \section*{Glossary}
1257:
1258: \emph{Locus} %
1259: A specific chromosomal location.
1260: \\[1ex]
1261: \emph{Allele} %
1262: One of several alternative forms of a gene, or DNA sequence, at a
1263: locus.
1264: \\[1ex]
1265: \emph{Genetic mosaic} %
1266: The pattern of differences between individuals in a population.
1267: \\[1ex]
1268: \emph{Haplotype} %
1269: A block of closely linked alleles that are inherited together.
1270: Such alleles are often used as markers in the process of gene
1271: mapping.
1272: \\[1ex]
1273: \emph{Linkage disequilibrium} %
1274: At linkage equilibrium, traits at different loci are inherited
1275: independently. Deviation from this is called linkage
1276: disequilibrium.
1277: \\[1ex]
1278: \emph{Population bottleneck} %
1279: When the population has been subject to a drastic decrease in
1280: abundance, followed by a rapid increase in abundance. This may
happen e.g. when a small part of a population colonises a new
1282: environment, without extensive interbreeding with the main
1283: population.
1284: \\[1ex]
1285: \emph{SNP} %
1286: Single nucleotide polymorphism. A difference in the genetic code
1287: at a single position.
1288: \\[1ex]
1289: \emph{Markov process} %
1290: A stochastic process, where the future development depends only on
1291: the present state (no memory).
1292: \\[1ex]
1293: \emph{Divergence} %
When a population splits into two parts that do not interbreed,
the independent accumulation of neutral mutations within each
sub-population causes the number of genetic differences
between individuals from different sub-populations to increase with
time.
1299: \\[1ex]
1300: \emph{Gene history} %
1301: The sequence of ancestors to a gene.
1302: \\[1ex]
1303: \emph{Coalescent process} %
1304: An approximation of neutral evolution, valid for large
1305: populations.
1306: \\[1ex]
1307: \emph{Chiasma process} %
Exchange of genetic material between the two copies in a chromosome pair
during the production of gametes (egg or sperm cells).
1310: \\[1ex]
1311: \emph{Recombination fraction} %
The probability that two loci on the same chromosome were inherited
from different parents.
1314:
1315: \end{document}
1316: