q-bio0507018/ms.tex
1: \documentclass[12pt,preprint]{aastex}
2: 
3: \usepackage{natbib}
4: \bibpunct[]{(}{)}{,}{a}{}{,}
5: 
6: \usepackage{amsmath}
7: \usepackage{amsfonts}
8: \usepackage{amsthm}
9: 
10: \newcommand{\me}{\mathrm{e}}
11: \newcommand{\mi}{\mathrm{i}}
12: \newcommand{\dif}{\mathrm{d}}
13: \usepackage{bm}
14: \renewcommand{\vec}[1]{\bm{#1}}
15: \DeclareMathAlphabet{\mathsfsl}{OT1}{cmss}{m}{sl}
16: \newcommand{\tensor}[1]{\mathsfsl{#1}}
17: 
18: \DeclareMathOperator*{\argmax}{argmax}
19: 
20: \DeclareMathOperator{\expn}{Mean}
21: \DeclareMathOperator{\medn}{Median}
22: 
23: \DeclareMathOperator{\podist}{Po}
24: \DeclareMathOperator{\tdist}{T}
25: \DeclareMathOperator{\ndist}{N}
26: \DeclareMathOperator{\betadist}{B}
27: \DeclareMathOperator{\expdist}{E}
28: \DeclareMathOperator{\gamdist}{G}
29: \DeclareMathOperator{\dirproc}{DP}
30: 
31: \DeclareMathOperator{\unifdist}{U}
32: \DeclareMathOperator{\bindist}{Bin}
33: 
34: \DeclareMathOperator{\bindens}{Bin}
35: \DeclareMathOperator{\betadens}{B}
36: 
37: \newcommand{\nd}{\ensuremath{n_\mathrm{d}}}
38: \newcommand{\nc}{\ensuremath{n_\mathrm{c}}}
39: \newcommand{\eya}[3]{\ensuremath{\hat{y}_{\mathrm{#1},#2}{(#3)}}}
40: \newcommand{\ya}[3]{\ensuremath{y_{\mathrm{#1},#2}{(#3)}}}
41: \newcommand{\ey}[2]{\ensuremath{\hat{y}_{\mathrm{#1},#2}}}
42: \newcommand{\y}[2]{\ensuremath{y_{\mathrm{#1},#2}}}
43: 
44: \newcommand{\p}[2]{\ensuremath{\pi_{#1}{(#2)}}}
45: \newcommand{\pmin}{\ensuremath{p_{\min}}}
46: \newcommand{\muminp}{\ensuremath{\hat{\mu}_{\min p}}}
47: 
48: \newcommand{\ceil}[1]{\ensuremath{\lceil #1 \rceil}}
49: \newcommand{\floor}[1]{\ensuremath{\lfloor #1 \rfloor}}
50: 
51: 
52: \begin{document}
53: \bibliographystyle{toby}
54:  
55: \title{Bayesian Method for Disease QTL Detection and Mapping, using a
56:   Case and Control Design and DNA Pooling}
57: 
58: \author{Toby Johnson\altaffilmark{1}}
59: \affil{School of Biological Sciences, The University of Edinburgh}
60: \affil{West Mains Road, Edinburgh, EH9 3JT}
61: \email{toby.johnson@ed.ac.uk}
62: 
63: \altaffiltext{1}{Jointly affiliated to Rothamsted Research and to The University of Edinburgh}
64: 
65: \slugcomment{draft \today}
66: 
67: \begin{abstract}
68:   This paper describes a Bayesian statistical method for determining
69:   the genetic basis of a complex genetic trait.  The method uses a
70:   sample of unrelated individuals classified into two groups, for
71:   example cases and controls.  Each group is assumed to have been
72:   genotyped at a battery of marker loci using DNA pooling, a technique
73:   that economises on laboratory effort.  The aim is to detect and
74:   map a quantitative trait locus (QTL) that is \emph{not} one of the
75:   typed markers.  The method works by conducting an exact Bayesian
76:   analysis under a number of simplifying population genetic
77:   assumptions that are somewhat unrealistic.  Despite this, the method
78:   is shown to perform acceptably on datasets simulated under a more
79:   realistic model, and furthermore is shown to outperform classical
80:   single point methods.
81: \end{abstract}
82: 
83: \section{Introduction}
84: For many traits of interest, including susceptibility to many genetic
85: diseases, the genetic basis is complex, meaning that many genes of
86: individually small effect (quantitative trait loci; QTLs) contribute.
87: For detecting and mapping such QTLs, association mapping studies that
88: use large samples of essentially unrelated individuals may have two
89: advantages over linkage studies that use pedigrees or families: For a
90: given sample size, association studies may have more power to detect a
91: QTL \citep[e.g.][]{rischmerikangas1996,risch2000}, and they may allow
92: the QTL to be fine mapped with greater resolution or precision
93: \citep[e.g.][]{terwilliger1995,mcpeek1999}.  One important
94: experimental design is a genome wide scan, in which many thousands of
95: markers covering the whole genome are typed \citep[see
96: e.g.][]{lander1996,rischmerikangas1996,kruglyak1999,carlson2003}.  The
97: aim may be to detect QTLs that are not candidate genes and that have
98: escaped detection in linkage studies.  After preliminary analysis,
99: attention may focus on relatively small chromosomal regions, each
100: containing perhaps only one QTL.  Because of
101: nongenetic factors and the effects of other genetically distant QTLs,
102: each focal QTL will explain only a small fraction of the variance in
103: phenotype.  If the trait is binary (such as presence or absence of a
104: disease), the difference in QTL allele frequencies between the two
105: trait groups will be small.  The large numbers of markers required for
106: a genome wide scan will need to have been typed in large numbers of
107: individuals to detect such a QTL, and to infer its position,
108: frequency, and effect on the trait.
109: 
110: In order to reduce the cost of such a study, an experimental strategy
111: called DNA pooling has been proposed
112: \citep[e.g.][]{arnheim1985,barcellos1997,sham2002,norton2004}.  Here
113: DNA from individuals with similar phenotypes is physically mixed
114: together into a pool before genotyping.  After overheads to do with
115: construction of pools and assay development, the cost of genotyping an
116: entire pool at a given marker is reduced to the cost of genotyping a
117: single individual.  Thus costs may be reduced by a factor close to the
118: number of individuals in a pool (divided by the number of experimental
119: replicates used for each pool), when the overheads can be spread
120: across many markers and across many disease studies.
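
As a rough illustration of this scaling (a back-of-envelope sketch, not
a costing of any real study), suppose each pool contains $n$ individuals
and is assayed in $r$ experimental replicates.  The symbols $n$ and $r$
and the numbers below are assumptions introduced here purely for
illustration.  Ignoring overheads, the per-marker genotyping cost
relative to individual genotyping is reduced by a factor of about
\begin{equation*}
  \frac{n}{r}\;\mbox{,}\qquad\mbox{for example}\qquad\frac{500}{4}=125\;\mbox{.}
\end{equation*}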
121: 
122: In this paper I consider the simplest experimental design where there
123: are two trait groups.  These could have been classified by the
124: presence or absence of a binary trait such as a disease.  Alternatively,
125: if the trait is quantitative, individuals from each tail of the trait
126: distribution could make up the two groups, for example the upper and
127: lower 10\% tails \citep[e.g.][]{darvasisoller1994,bader2001}.  For
128: simplicity I will refer throughout the paper to the two trait groups
129: as cases and controls, and to one of the alleles at the QTL as the
130: disease allele.  I note that data collected from more than two groups
131: can be analysed using the method developed here, by discarding or
132: combining data from some of the groups.  (An extreme example is where
133: each pool contains two chromosomes, i.e.\ genotype data are
134: available.)  Such an approach is valid and may be useful, but will
135: almost certainly not make most efficient use of the available data,
136: since it will be based on only a marginal observation that is almost
137: certainly not sufficient.
138: 
139: Compared with individual genotyping, a DNA pooling strategy incurs
140: three types of information loss and error.  At each marker only the
141: marginal counts of the alleles present within each pool are
142: available, and so there is (i) no information about deviations from
143: Hardy--Weinberg proportions within each marker within each pool
144: \citep{rischteng1998} and (ii) no information about phase or linkage
145: disequilibrium across markers within each pool \citep{johnson2005a}.
146: Further (iii), the marginal counts are only estimated from some kind
147: of quantitative genotyping experiment \citep[e.g.][]{germer2000,lehellard2002},
148: rather than by counting individual genotyping experiments.
149: 
150: The present approach deals with (iii) by allowing a quite general
151: class of models for errors in allele frequency estimation.  It
152: attempts to deal with (i) and (ii) by using the full likelihood given
153: the available data.  This likelihood does however assume a model that
154: makes simplifying assumptions that are not as realistic as one would
155: like.  This may be an acceptable price to pay, because it allows all
156: the necessary computations to be done analytically or using simple
157: numerical algorithms.  The aim is to develop a method that can be used
158: on very large data sets, with hundreds of cases and hundreds of
159: controls typed at hundreds of markers.
160: 
161: The simple model used here is a special case of the model introduced
162: by \citet{mcpeek1999} and studied by \citet{morris2000},
163: \citet{zhangzhao2000,zhangzhao2002} and \citet{liu2001} for haplotype
164: and genotype data.  In brief summary, I assume a unique mutation
165: event (perhaps a single nucleotide polymorphism, or a deletion)
166: generated the disease allele at the disease QTL, that there is a star
167: shaped genealogy since that mutation, that the disease allele is
168: absent from the control group, and that Hardy--Weinberg and linkage
169: equilibrium apply in the control group.
170: 
171: Avoiding the use of more complicated algorithms such as Markov chain
172: Monte Carlo (MCMC) means that there are no concerns about mixing and
173: convergence, and no difficulties with computing normalising constants
174: and Bayes factors for model testing and model comparison.  When the
175: model is correct, the Bayes factor is \citep[in various classical
176: senses, see][ ch.5]{ohaganforster} an optimal test statistic for
177: detecting a QTL \citep[see also][]{patterson2004}.  Therefore the
178: present approach can be viewed as choosing an approximate model that
179: is simple enough to allow an exact and optimal analysis.  It will
180: therefore complement alternative approaches that perform approximate
181: or sub-optimal analyses for more realistic models.
182: 
183: One example of such an alternative approach is to perform a classical
184: single point analysis, which applies a separate test to the data for
185: each marker.  Such an approach can be made essentially free of any
186: population genetic model, meaning that no worrying assumptions are made
187: but also that there is no efficient framework for combining analyses
188: across many markers (e.g.\ to correct for multiple testing).  When
189: genotype data are available and when the QTL is \emph{not} one of the
190: typed markers, such single point methods are known to lack power
191: \citep[e.g.][]{risch2000,mott2000,zollnerpritchard2005}, and may
192: produce inefficient point estimates and inefficiently large region
193: estimates for the position of the QTL
194: \citep[e.g.][]{morris2002,morris2003}.
195: 
196: One justification for the method developed here is that although the
197: model is simple, it is to my knowledge the most complex model for
198: which a Bayesian analysis has been implemented using data from DNA
199: pools.  To provide further justification and reassurance about the
200: simplifying assumptions made, I have tested the method on data
201: simulated under a more realistic full coalescent model.  The
202: (necessarily classical) measures of performance are encouraging in
203: three respects.  Firstly, the power to detect a QTL is substantially
204: higher than for classical single point analysis.  Secondly, point
205: estimates for the position of the QTL derived from the present method
206: outperform the simple procedure of choosing the map position of the
207: marker with the most significant single point test result \citep[the
208: ``minimum $p$-value method'',
209: e.g.][]{kaplanmorris2001a,kaplanmorris2001b}.  Thirdly, after
210: ``flattening'' the posterior using a factor \citep[derived
211: by][]{mcpeek1999} that corrects for the fact that the true genealogy
212: is not in fact star shaped, credibility intervals for the position of
213: the QTL cover its true position with frequency equal to their nominal
214: size.  That is, they are well calibrated.
215: 
216: The structure of this paper is as follows.  In
217: section~\ref{sec:appr-model-prior} I review the model introduced by
218: \citet{mcpeek1999}, for which an exact Bayesian analysis can be
219: performed using multilocus estimated allele frequency data, collected
220: using DNA pools.  In section~\ref{sec:analysis} I describe in detail
221: how the computations for such an analysis are performed.  In
222: section~\ref{sec:simul-model} I review a more realistic coalescent
223: model that I have used to generate simulated data sets on which to
224: test the current approach.  In section~\ref{sec:results} I describe
225: the performance of the method developed in
226: sections~\ref{sec:appr-model-prior}--\ref{sec:analysis} on those
227: simulated data sets.  In section~\ref{sec:discussion} I summarise the
228: results and discuss how the method could be improved.
229: 
230: \section{The Simplified Model and Prior}
231: \label{sec:appr-model-prior}
232: The main notations and abbreviations used are listed in
233: table~\ref{tab:notations-used}, and the conditional independencies of
234: the variables are represented in figure~\ref{fig:factorisemodel}.
235: 
236: I assume the data (collectively denoted $\hat{\vec{y}}$) consist of
237: allele frequency estimates at $L$ single nucleotide polymorphisms
238: (SNPs), obtained from a single pool of \nd{} case chromosomes and a
239: single pool of \nc{} control chromosomes.  Let $m_i$ be the known map
240: position of the $i$-th SNP.  Let the alleles at each SNP be
241: arbitrarily labelled 0 and 1, with $\eya{d}{i}{1}\equiv
242: \nd-\eya{d}{i}{0}$ the estimated count of the 1 allele at the $i$-th
243: SNP in the cases, and $\eya{c}{i}{1}\equiv \nc-\eya{c}{i}{0}$ the
244: estimated count of the 1 allele at the $i$-th SNP in the controls.
245: When there is no ambiguity, $\hat{y}$ will mean the estimated count of
246: the 1 allele at some SNP in some pool that is to be inferred from the
247: context.  (This can easily be generalised to let $\ey{d}{i}$ be
248: arbitrary vector valued information about the counts of the two
249: alleles at the $i$-th SNP in the cases, etc.)  The method may be
250: generalised to allow more than two alleles at each marker.
251: 
252: Let $\ya{d}{i}{1}$ and $\ya{c}{i}{1}$ be the true counts of the 1
253: allele at the $i$-th SNP in the cases and in the controls
254: respectively.  I assume that an error model
255: $\Pr{(\ey{d}{i},\ey{c}{i}|\y{d}{i},\y{c}{i})}$ has been chosen and
256: parameterised, for example from a calibration data set consisting of
257: pairs $(\hat{y},y)$ that were obtained from a given set of individuals
258: by genotyping them as a pool and by genotyping them individually.
259: Note that the error model cannot be factorised unless we assume that
260: the errors in the case and control pools are unconditionally
261: independent.  However, we may believe that the data were obtained
262: using assays that vary in precision across SNPs.  Letting $e_i$
263: indicate the unknown precision of the assay used for the $i$-th SNP,
264: we could model these beliefs as
265: \begin{equation}
266:   \Pr{(\ey{d}{i},\ey{c}{i}|\y{d}{i},\y{c}{i})}=\sum_{e_i}{\Pr{(\ey{d}{i}|\y{d}{i},e_i)}\Pr{(\ey{c}{i}|\y{c}{i},e_i)}\Pr{(e_i)}}\;\mbox{,}
267:   \label{eq:errorfactorise}
268: \end{equation}
269: that is, the error distribution is modelled as a mixture distribution
270: where each component of the mixture factorises.  In the following,
271: reasonable flexibility in the nature of this error model is allowed:
272: the errors must be independent across SNPs but they need not be
273: independent across pools and they do not need to be identically
274: distributed across SNPs.  An example error model is given in
275: section~\ref{sec:simul-model}.
276: 
277: Let the unknown map position of the disease QTL relative to the
278: leftmost marker SNP be $\mu$; often this will be the variable of main
279: interest.  I allow any prior $\Pr{(\mu)}$; obvious choices would be a
280: uniform density on physical distance, or one based on known gene
281: density \citep[see e.g.][]{rannalareeve2001}.  I assume a unique
282: mutation event generated the disease allele at the disease QTL, so
283: some number of haplotypes ($x_\mu$) in the case pool carry the disease
284: allele and are identical by descent (i.b.d.) at this position.  I
285: assume $x_\mu\sim\bindist{(\nd,\rho)}$ for a rate $\rho$,
286: and since $\rho$ itself is uncertain I assume a beta prior with
287: parameter $R=(R_1,R_0)$, so the prior mean is $R_1/(R_0+R_1)$.
288: Specifying $(R_1,R_0)=(1,1)$ specifies a flat prior on $[0,1]$ for
289: $\rho$.  I assume that the disease allele is absent from the control
290: pool.  The adequacy of this approximation, and ways in which it could
291: be relaxed, are taken up in the discussion in section~\ref{sec:discussion}.
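
To make the prior on $x_\mu$ concrete, the following sketch integrates
$\rho$ out analytically.  It is an illustration of what the stated
binomial and beta assumptions imply, not a step of the analysis, and it
writes the beta density as $\betadens{(\rho,R_1,R_0)}$.  The marginal
prior for $x_\mu$ is beta-binomial,
\begin{eqnarray*}
  \Pr{(x_\mu)} &=& \int_0^1{\bindens{(x_\mu,\nd,\rho)}\,\betadens{(\rho,R_1,R_0)}\,\dif\rho}\\
  &=& \binom{\nd}{x_\mu}\,
  \frac{\Gamma{(R_1+R_0)}}{\Gamma{(R_1)}\,\Gamma{(R_0)}}\,
  \frac{\Gamma{(x_\mu+R_1)}\,\Gamma{(\nd-x_\mu+R_0)}}{\Gamma{(\nd+R_1+R_0)}}\;\mbox{,}
\end{eqnarray*}
which for the flat choice $(R_1,R_0)=(1,1)$ reduces to
$\Pr{(x_\mu)}=1/(\nd+1)$ for every $x_\mu\in\{0,\ldots,\nd\}$.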
292: 
293: Each disease allele is embedded in a block of i.b.d.\ haplotype, the
294: ``ancestral haplotype'', with breakpoints at distances $d_\mathrm{L}$
295: and $d_\mathrm{R}$ to the left and right of the position of the QTL.
296: I assume a star shaped genealogy for the disease allele, so that
297: $d_\mathrm{L}$ and $d_\mathrm{R}$ for all blocks of ancestral
298: haplotype are conditionally independent and identically distributed
299: (i.i.d.) from an exponential distribution with mean $1/\tau$ Morgans,
300: where $\tau$ is the age of the disease allele in generations and
301: distances are measured in Morgans \citetext{\citealt{mcpeek1999}, see
302:   also \citealt{morris2000,zhangzhao2000,zhangzhao2002,liu2001}}.
303: (The left and right breakpoints are the positions of the nearest
304: crossovers in $\tau$ meioses where crossovers occur as a Poisson
305: process with rate 1.)  I allow any prior $\Pr{(\tau)}$; lognormal and
306: gamma are reasonable choices.  Specifying an exponential prior with
307: sufficiently small parameter $T$ (so the prior mean $1/T$ is
308: sufficiently large) specifies a prior that is effectively flat over
309: the region where the likelihood is non-negligible.  This has the
310: effect of making the posterior model probability or Bayes factor
311: proportional to $T$.  A crucial variable in the analysis is $x_i$, the
312: number of chromosomes in the case pool that carry the i.b.d.\
313: haplotype at the $i$-th SNP.  The assumptions above specify a
314: distribution for the $x_i$.  A key feature of the present method is
315: that the $x_i$ are not independent across SNPs, in contrast to what
316: is assumed in composite likelihood methods
317: \citep[e.g.][]{terwilliger1995,xiong1997,collins1998,maniatis2004,maniatis2005}.
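
As a small worked example of the exponential breakpoint assumption (the
numerical values are chosen here for illustration only), the probability
that a block of ancestral haplotype extends at least a distance $d$ to
the right of the QTL, and the probability that it extends at least $d$
on both sides, are
\begin{equation*}
  \Pr{(d_\mathrm{R}>d)}=\me^{-\tau d}
  \qquad\mbox{and}\qquad
  \Pr{(d_\mathrm{L}>d,\,d_\mathrm{R}>d)}=\me^{-2\tau d}\;\mbox{.}
\end{equation*}
With $\tau=100$ generations and $d=0.01$ Morgans these are
$\me^{-1}\approx 0.37$ and $\me^{-2}\approx 0.14$ respectively.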
318: 
319: All other non-i.b.d.\ haplotype is called heterogeneous ``non-ancestral
320: haplotype''.  The allele present in non-ancestral haplotype is assumed
321: conditionally independent across chromosomes within SNPs, and across
322: SNPs within chromosomes, with \p{i}{1} the probability of a 1 allele
323: at the $i$-th SNP and $\p{i}{0}\equiv 1-\p{i}{1}$.  (This is
324: equivalent to assuming Hardy--Weinberg and linkage equilibrium in
325: blocks of non-ancestral chromosome.)  This assumption is a very bad
326: one when haplotype or genotype data are available
327: \citep{liu2001,morris2002,listephens2003}, but when only multilocus
328: allele frequency data are available it may be more innocuous; this
329: assumption leads to a massive simplification in the likelihood.  I
330: assume an independent beta prior for each $\p{i}{1}$, with parameter
331: $P_i=(P_{i,1},P_{i,0})$, so the prior mean is
332: $P_{i,1}/(P_{i,0}+P_{i,1})$.  The method should be robust to
333: misspecification of $P_i$ as long as both elements are much smaller
334: than $\nc$.
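
A simple calculation suggests why this robustness might be expected.
As a sketch only, it uses the control pool alone, treats the true
control counts as known, and ignores the information in the cases.  The
posterior for $\p{i}{1}$ is then a beta distribution with mean
\begin{equation*}
  \frac{\ya{c}{i}{1}+P_{i,1}}{\nc+P_{i,0}+P_{i,1}}\;\mbox{.}
\end{equation*}
For illustrative values $\nc=200$ and $\ya{c}{i}{1}=60$, the choice
$P_i=(1,1)$ gives $61/202\approx 0.302$ and the choice $P_i=(5,1)$ gives
$65/206\approx 0.316$, both close to the sample frequency $0.30$.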
335: 
336: The allele present on the ancestral haplotype at the position of the
337: $i$-th SNP, $a_i$, is assumed to be a single draw from the same
338: distribution as for a block of non-ancestral chromosome.  That is, each
339: $a_i$ is an independent Bernoulli variable with parameter $\p{i}{1}$.
340: 
341: \section{Analysis}
342: \label{sec:analysis}
343: \subsection{Overview}
344: The purpose of the analysis is to compute the posterior distribution
345: for quantities of interest.  Here I focus on computing the Bayes
346: factor in favour of a model in which a QTL is present, versus a model
347: with no QTL.  This Bayes factor allows the posterior model
348: probabilities to be computed for any prior on the two models.  I also
349: compute the posteriors for $\mu$, $\tau$ and $\rho$, given that a QTL
350: is present, marginalising all other variables.
351: 
352: The model has a hierarchical structure as shown in
353: figure~\ref{fig:factorisemodel}.  Note that the joint distribution of
354: the $x_i$ depends on $\mu$, $\tau$ and $\rho$, but that the
355: probability of the data at the $i$-th SNP only depends on $x_i$ and
356: other variables $\pi_i$ and $a_i$ for which the prior is independent
357: across SNPs.  The first step of the analysis, in section
358: \ref{sec:at-each-snp}, is to perform an independent calculation for
359: each SNP that integrates out $\pi_i$ and $a_i$ conditional on $x_i$.
360: The second step of the analysis, in section
361: \ref{sec:hidden-markov-model}, is to integrate out all the $x_i$ and
362: $x_\mu$ conditional on $\mu$, $\tau$ and $\rho$.  This gives the
363: posterior density for these three variables.  The final step, in
364: section \ref{sec:final-posterior}, is to compute the marginal
365: posteriors for each of $\mu$, $\tau$ and $\rho$, and the normalising
366: constant or probability of the data given models with and without a
367: QTL.
368: 
369: \subsection{Calculations for the $i$-th SNP}
370: \label{sec:at-each-snp}
371: Note that all probabilities in this section are conditional on $x_i$,
372: the number of chromosomes in the case pool carrying the ancestral
373: i.b.d.\ chromosome.  At the $i$-th SNP let $\ya{d}{i}{1}\equiv
374: \nd-\ya{d}{i}{0}$ be the (unknown) actual count of the 1 allele in
375: cases, and $\ya{c}{i}{1}\equiv \nc-\ya{c}{i}{0}$ be the (unknown) actual
376: count of the 1 allele in controls.  Let
377: $\y{d}{i}=(\ya{d}{i}{1},\ya{d}{i}{0})$ and a similar notation apply
378: for controls.  Under the modelling assumptions we have
379: \begin{eqnarray}
380:   \label{eq:ydprob}
381:   \Pr{(\y{d}{i}|x_i,a_i,\pi_i)} &=&
382:   \bindens{(\ya{d}{i}{a_i}-x_i,\nd-x_i,\pi_i(a_i))} \\
383:   \label{eq:ycprob}
384:   \Pr{(\y{c}{i}|\pi_i)} &=&
385:   \bindens{(\ya{c}{i}{1},\nc,\pi_i(1))}
386: \end{eqnarray}
387: where $\bindens{(x,n,p)}$ is the probability of observing $x$
388: successes in $n$ independent trials with success probability $p$ (and
389: understood to be zero unless $0\le x\le n$).  For a more detailed
390: motivation of (\ref{eq:ydprob}) and (\ref{eq:ycprob}) see \citet[
391: appendix A]{johnson2005a}.
392: 
393: Then the probability of the observed data can be written
394: \begin{equation}
395:   \Pr{(\ey{d}{i},\ey{c}{i}|x_i,a_i,\pi_i)} =
396:   \sum_{\y{d}{i}}{\sum_{\y{c}{i}}{
397:       \Pr{(\ey{d}{i},\ey{c}{i}|\y{d}{i},\y{c}{i})}
398:       \Pr{(\y{d}{i}|x_i,a_i,\pi_i)}\Pr{(\y{c}{i}|\pi_i)}
399:     }}\;\mbox{.}
400:   \label{eq:SNPi-prob}
401: \end{equation}
402: 
403: We can simplify at this stage by writing down the distribution
404: marginal to $a_i$ and $\pi_i$ and moving the respective sum and
405: integral as far inside the expression as possible, to get
406: \begin{eqnarray}
407:   \Pr{(\ey{d}{i},\ey{c}{i}|x_i)} &=&
408:   \sum_{\y{d}{i}}{\sum_{\y{c}{i}}{\bigg(
409:       \Pr{(\ey{d}{i},\ey{c}{i}|\y{d}{i},\y{c}{i})}\bigg.}}\nonumber\\
410:     &&\bigg.\int{\sum_{a_i}{\big(
411:       \Pr{(\y{d}{i}|x_i,a_i,\pi_i)}\Pr{(a_i|\pi_i)}\big)}\Pr{(\y{c}{i}|\pi_i)}\Pr{(\pi_i)}\dif\pi_i}\bigg)\;\mbox{.}
412: \end{eqnarray}
413: The innermost sum over $a_i$ can be rewritten
414: \begin{eqnarray}
415:   \label{eq:emprob}
416:   \Pr{(\y{d}{i}|x_i,\pi_i)} &=&
417:   \sum_{a_i}{\Pr{(\y{d}{i}|x_i,a_i,\pi_i)}\Pr{(a_i|\pi_i)}} \\
418:   &=& \sum_{a_i}{\bigg(
419:     \bindens{(\ya{d}{i}{a_i}-x_i+1,\nd-x_i+1,\pi_i(a_i))}\bigg.}\nonumber\\
420:   &&\qquad\times
421:   \bigg.\binom{\nd-x_i}{\ya{d}{i}{a_i}-x_i}\bigg/\binom{\nd-x_i+1}{\ya{d}{i}{a_i}-x_i+1}\bigg)\\
422:   &=& \sum_{a_i}{\bigg(
423:     \bindens{(\ya{d}{i}{a_i}-x_i+1,\nd-x_i+1,\pi_i(a_i))}\bigg.}\nonumber\\
424:   &&\qquad\times
425:   \bigg.\frac{\ya{d}{i}{a_i}-x_i+1}{\nd-x_i+1}\bigg)\;\mbox{.}
426: \end{eqnarray}
427: This allows the integral over $\pi_i$ to be computed
428: analytically when (as assumed above) the prior $\Pr{(\pi_i(a_i))}$ is a
429: beta distribution with parameter $(P_{i,a_i},P_{i,1-a_i})$\begin{eqnarray}
430:   \label{eq:ygivenxraw}
431:   \Pr{(\y{d}{i},\y{c}{i}|x_i)} &=& \int{
432:     \Pr{(\y{d}{i}|x_i,\pi_i)}\Pr{(\y{c}{i}|\pi_i)}\Pr{(\pi_i)}\dif \pi_i}\\
433:   \label{eq:ygivenxbinombeta}
434: &=&
435:   \sum_{a_i}{\bigg(\frac{(\nd-x_i+1)!\;\Gamma{(\nc+P_{i,0}+P_{i,1})}}
436:   {\Gamma{(\nd-x_i+1+\nc+P_{i,0}+P_{i,1})}}\bigg.}\nonumber\\
437: %
438:   &&\qquad\times \frac{\Gamma{(\ya{d}{i}{a_i}-x_i+1+\ya{c}{i}{a_i}+P_{i,a_i})}}
439:   {(\ya{d}{i}{a_i}-x_i+1)!\;\Gamma{(\ya{c}{i}{a_i}+P_{i,a_i})}}\nonumber\\
440: %
441:   &&\qquad\times \frac{\Gamma{(\ya{d}{i}{1-a_i}+\ya{c}{i}{1-a_i}+P_{i,1-a_i})}}
442:   {\ya{d}{i}{1-a_i}!\;\Gamma{(\ya{c}{i}{1-a_i}+P_{i,1-a_i})}}\nonumber\\
443:   &&\qquad\times
444:   \bigg.\frac{\ya{d}{i}{a_i}-x_i+1}{\nd-x_i+1}\bigg)
445:   \label{eq:probSNPfull}
446: \end{eqnarray}
447: Thus the probability of the observed data reduces to
448: \begin{equation}
449:   \Pr{(\ey{d}{i},\ey{c}{i}|x_i)} =
450:   \sum_{\y{d}{i}}{\sum_{\y{c}{i}}{
451:       \Pr{(\ey{d}{i},\ey{c}{i}|\y{d}{i},\y{c}{i})}
452:       \Pr{(\y{d}{i},\y{c}{i}|x_i)}
453:     }}
454:   \label{eq:probSNPsumsum}
455: \end{equation}
456: where the first term in the summand is the error
457: model and the second term is given by
458: (\ref{eq:probSNPfull}).  
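
The binomial identity used in rewriting (\ref{eq:emprob}) can be checked
directly.  As shorthand introduced only for this check, write
$k=\ya{d}{i}{a_i}-x_i$, $n=\nd-x_i$ and $p=\pi_i(a_i)$, and recall that
$\Pr{(a_i|\pi_i)}=\pi_i(a_i)=p$ because $a_i$ is a Bernoulli draw.  Then
\begin{eqnarray*}
  \bindens{(k,n,p)}\,p &=& \binom{n}{k}\,p^{k+1}(1-p)^{n-k}\\
  &=& \bindens{(k+1,n+1,p)}\,\binom{n}{k}\bigg/\binom{n+1}{k+1}\\
  &=& \bindens{(k+1,n+1,p)}\,\frac{k+1}{n+1}\;\mbox{,}
\end{eqnarray*}
since
$\binom{n}{k}\big/\binom{n+1}{k+1}=\frac{n!}{k!\,(n-k)!}\,\frac{(k+1)!\,(n-k)!}{(n+1)!}=\frac{k+1}{n+1}$.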
459: 
460: Because $\ey{d}{i}$ and $\ey{c}{i}$ are fixed and $a_i$ and $\pi_i$
461: have been integrated out, values of (\ref{eq:probSNPsumsum}) for each
462: $x_i$ can be precomputed and stored in a small lookup table of size
463: $(\nd+1)$.  It is therefore feasible to use a complicated and
464: hopefully realistic error distribution, as discussed above.  For many
465: error models, computation can be sped up without loss of accuracy
466: by judicious reduction of the range of the summation in
467: (\ref{eq:probSNPsumsum}).
468: 
469: \subsection{Hidden Markov Model for $x_\mu$ and the $x_i$}
470: \label{sec:hidden-markov-model}
471: Throughout this section, all probabilities are conditional on given
472: values of $\mu$, $\tau$ and $\rho$ but to make a clearer exposition this is
473: suppressed in the notation.
474: 
475: Assume for clarity of exposition that $\mu$ is a position within the
476: battery of marker SNPs.  Let $\ceil{\mu}=\min{\{i:m_i\ge \mu\}}$ and
477: $\floor{\mu}=\max{\{i:m_i< \mu\}}$ denote the indices of the SNPs to
478: the right and left of the QTL.  The algorithm works with
479: obvious modifications when $\mu$ is a position outside the battery of
480: marker SNPs.
481: 
482: Under the modelling assumptions, as we move away from the position of
483: the disease locus, the number of chromosomes containing i.b.d.\
484: ancestral chromosome, $x_i$, is Markovian.  The \textit{transition
485:   probabilities} of a (nonstationary and inhomogeneous) hidden Markov
486: model \citep[HMM; see e.g.][ ch.3]{durbinbook}, for marker SNPs to the right
487: of $\mu$ (i.e.\ $\ceil{\mu}\le i$), are
488: \begin{equation}
489:   \label{eq:jumpprob}
490:   \Pr{(x_{i+1}|x_{i})}= 
491:   \bindens{(x_{i+1},\;x_{i},\;\exp{(-\tau\times(m_{i+1}-m_{i}))})}
492: \end{equation}
493: Similar equations hold for marker SNPs to the left of $\mu$, and
494: for $\Pr{(x_i|x_\mu)}$.  The \textit{emission probabilities} of the
495: HMM are given by (\ref{eq:probSNPsumsum}).  There is an important
496: difference between this HMM and the HMMs of \citet{mcpeek1999},
497: \citet{morris2000}, \citet{zhangzhao2000,zhangzhao2002} and \citet{liu2001}.  Those
498: authors modelled the observed haplotypes or genotypes as a set of
499: conditionally independent HMMs, conditional on (in the present
500: notation) $(a_1,\ldots,a_L)$, $(\pi_1,\ldots,\pi_L)$ as well as $\mu$,
501: $\tau$ and $\rho$.  Thus they had to use expensive numerical
502: algorithms (optimisation or MCMC) on the high dimensional space of
503: variables on which the HMMs were conditioned.  Here I am able to model
504: all the data as a \emph{single} HMM conditional on only $\mu$, $\tau$
505: and $\rho$.  I am thus able to integrate out the high dimensional
506: $(a_1,\ldots,a_L)$ and $(\pi_1,\ldots,\pi_L)$ using an efficient
507: numerical algorithm and am left with only a low dimensional space (the
508: space for $(\mu,\tau,\rho)$) on which I will need to use an expensive
509: numerical algorithm.
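
For concreteness, the ``similar equations'' mentioned after
(\ref{eq:jumpprob}) can plausibly be written out as follows.  This is a
sketch obtained by symmetry from (\ref{eq:jumpprob}), assuming that
$\mu$ and the $m_i$ are measured in Morgans on the same scale.  For
marker SNPs to the left of the QTL ($2\le i\le\floor{\mu}$),
\begin{equation*}
  \Pr{(x_{i-1}|x_{i})}=
  \bindens{(x_{i-1},\;x_{i},\;\exp{(-\tau\times(m_{i}-m_{i-1}))})}\;\mbox{,}
\end{equation*}
and for the two SNPs flanking the QTL,
\begin{eqnarray*}
  \Pr{(x_{\ceil{\mu}}|x_\mu)} &=&
  \bindens{(x_{\ceil{\mu}},\;x_\mu,\;\exp{(-\tau\times(m_{\ceil{\mu}}-\mu))})}\;\mbox{,}\\
  \Pr{(x_{\floor{\mu}}|x_\mu)} &=&
  \bindens{(x_{\floor{\mu}},\;x_\mu,\;\exp{(-\tau\times(\mu-m_{\floor{\mu}}))})}\;\mbox{.}
\end{eqnarray*}
As an illustrative numerical case, with $x_i=10$, $\tau=100$ and
$m_{i+1}-m_i=0.001$ Morgans, each i.b.d.\ chromosome is retained with
probability $\me^{-0.1}\approx 0.905$, so the expected value of
$x_{i+1}$ is about $9.05$.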
510: 
511: We can use the backwards propagation algorithm for HMMs to sum over
512: the hidden states $(x_1,\ldots,x_L)$ \citetext{see e.g.\
513:   \citealt{durbinbook} ch.3, \citealt{liubook2001} pp.28--31}.
514: Readers familiar with HMMs may wish to skip the rest of this section.
515: 
516: Define the \textit{backwards variables} for SNPs at positions to the
517: right of the QTL
518: \begin{equation}
519:   \label{eq:bvdefright}
520:   b(x_i) := \Pr{(\ey{d}{i+1},\ey{c}{i+1},\ldots,\ey{d}{L},\ey{c}{L}|x_i)}
521: \end{equation}
522: and for SNPs at positions to the left of the QTL
523: \begin{equation}
524: \label{eq:bvdefleft}
525:   b'(x_i) := \Pr{(\ey{d}{1},\ey{c}{1},\ldots,\ey{d}{i-1},\ey{c}{i-1}|x_i)}
526: \end{equation}
527: Here backwards is relative to the direction in which the hidden
528: process is Markovian.  Equation~(\ref{eq:bvdefleft}) should not be
529: confused with a forwards variable; forwards variables are not used in this
530: computation.  Note also that the backwards variables are functions of
531: both the SNP index $i$ and the value of $x_i$ at that SNP, but that I have
532: adopted a more streamlined notation that I hope is unambiguous.
533: 
534: The backwards variables have the obvious interpretation when the
535: arguments run out of
536: range, that
537: \begin{eqnarray}
538:   \label{eq:hmm-initialise-r}
539:   b(x_L) &:=& \Pr{(\mbox{nothing}|x_L)} = 1 \\
540:   b'(x_1) &:=& \Pr{(\mbox{nothing}|x_1)} = 1
541:   \label{eq:hmm-initialise-l}
542: \end{eqnarray}
543: 
544: Using (\ref{eq:hmm-initialise-r}) to \textit{initialise} the backwards variables
545: at $i=L$, we then \textit{propagate} leftwards along the
546: chromosomes: for each
547: $i=L-1,\ldots,\ceil{\mu}$ in turn, we compute the backwards variables for every $x_i$ using
548: \begin{equation}
549:   b(x_i) =\sum_{x_{i+1}}{\Pr{(x_{i+1}|x_i)}
550:     \Pr{(\ey{d}{i+1},\ey{c}{i+1}|x_{i+1})} b(x_{i+1})}
551: \end{equation}
552: and then \textit{terminate} the algorithm by computing
553: \begin{equation}
554:   \label{eq:modelprobgivenmu}
555:   \Pr{(\ey{d}{\ceil{\mu}},\ey{c}{\ceil{\mu}},\ldots,\ey{d}{L},\ey{c}{L}|x_\mu)} 
556:   = \sum_{x_{\ceil{\mu}}}{\Pr{(x_{\ceil{\mu}}|x_\mu)}
557:     \Pr{(\ey{d}{\ceil{\mu}},\ey{c}{\ceil{\mu}}|x_{\ceil{\mu}})} b{(x_{\ceil{\mu}})}}
558: \end{equation}
559: for every $x_\mu$.
560: 
561: Likewise, using (\ref{eq:hmm-initialise-l}) to initialise the $b'$
562: backwards variables at $i=1$ we can propagate rightwards along the
563: chromosomes (backwards in the sense that time or space is usually
564: considered in HMMs) for $i=2,\ldots,\floor{\mu}$ and terminate in a
565: symmetric manner to compute
566: $\Pr{(\ey{d}{1},\ey{c}{1},\ldots,\ey{d}{\floor{\mu}},\ey{c}{\floor{\mu}}|x_\mu)}$.
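
Written out explicitly, and offered as a sketch by symmetry with the
rightward case rather than as a quotation of the implementation, the
propagation step for $i=2,\ldots,\floor{\mu}$ would read
\begin{equation*}
  b'(x_i) =\sum_{x_{i-1}}{\Pr{(x_{i-1}|x_i)}
    \Pr{(\ey{d}{i-1},\ey{c}{i-1}|x_{i-1})}\, b'(x_{i-1})}
\end{equation*}
and the termination step, for every $x_\mu$, would read
\begin{equation*}
  \Pr{(\ey{d}{1},\ey{c}{1},\ldots,\ey{d}{\floor{\mu}},\ey{c}{\floor{\mu}}|x_\mu)}
  = \sum_{x_{\floor{\mu}}}{\Pr{(x_{\floor{\mu}}|x_\mu)}
    \Pr{(\ey{d}{\floor{\mu}},\ey{c}{\floor{\mu}}|x_{\floor{\mu}})}\, b'{(x_{\floor{\mu}})}}\;\mbox{.}
\end{equation*}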
567: 
568: We then obtain the probability of all the data by computing
569: \begin{eqnarray}
570:   \label{eq:modelprobgivenrho}
571:   \Pr{(\hat{\vec{y}})} &=&
572:   \Pr{(\ey{d}{1},\ey{c}{1},\ldots,\ey{d}{L},\ey{c}{L})} \nonumber \\
573:   &=& \sum_{x_\mu}{\bigg(\bindens{(x_\mu,\nd,\rho)}
574:     \Pr{(\ey{d}{1},\ey{c}{1},\ldots,\ey{d}{\floor{\mu}},\ey{c}{\floor{\mu}}|x_\mu)}\bigg.}\nonumber\\
575:   &&\qquad \bigg. \Pr{(\ey{d}{\ceil{\mu}},\ey{c}{\ceil{\mu}},\ldots,\ey{d}{L},\ey{c}{L}|x_\mu)}\bigg) 
576: \end{eqnarray}
577: 
578: Restoring the conditioning that has been implicit throughout this
579: section, (\ref{eq:modelprobgivenrho}) is
580: $\Pr{(\hat{\vec{y}}|\mu,\tau,\rho)}$, the probability of all the data
581: conditional on $(\mu,\tau,\rho)$ and marginal to $(a_1,\ldots,a_L)$
and $(\pi_1,\ldots,\pi_L)$.
583: 
584: 
585: \subsection{Marginal Posteriors for $\mu$, $\tau$ and $\rho$ and Model
586: Probabilities}
587: \label{sec:final-posterior}
588: The posterior for $\mu$, $\tau$ and $\rho$, up to a normalising
589: constant, is obtained simply by multiplying
590: (\ref{eq:modelprobgivenrho}) by the relevant priors.
591: \begin{equation}
592:   \Pr{(\hat{\vec{y}},\mu,\tau,\rho)} = \Pr{(\hat{\vec{y}}|\mu,\tau,\rho)}
593:   \,\Pr{(\mu)}\,\Pr{(\tau)}\,\betadens{(\rho,R_1,R_0)}
594: \end{equation}
595: so
596: \begin{equation}
597:   \Pr{(\mu,\tau,\rho|\hat{\vec{y}})} = \Pr{(\hat{\vec{y}}|\mu,\tau,\rho)}
598:   \,\Pr{(\mu)}\,\Pr{(\tau)}\,\betadens{(\rho,R_1,R_0)}\,\frac{1}{\Pr{(\hat{\vec{y}})}}\;\mbox{.}
599: \end{equation}
600: Summarising this posterior is not entirely trivial for large data
601: sets, because computing the posterior at any given point
602: $(\mu,\tau,\rho)$ using the propagation algorithm of
603: section~\ref{sec:hidden-markov-model} takes on the order of
604: $L\,n_d^2$ operations (and all addition has to be done in log-space,
see e.g.\ \citet[pp.~77--78]{durbinbook}), so we cannot rely on being able
606: to make an arbitrarily large number of such computations.
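
As a minimal sketch of what addition in log-space involves (the symbols
$a_k$ and $a_{\max}$ are introduced here purely for illustration and are
not part of the model), a sum of probabilities stored as logarithms
$a_k=\ln p_k$ can be accumulated using the identity
\begin{equation*}
  \ln\sum_k{\me^{a_k}} = a_{\max} + \ln\sum_k{\me^{a_k-a_{\max}}}\mbox{,}
  \qquad a_{\max}=\max_k a_k\mbox{,}
\end{equation*}
in which every exponent is at most zero and the largest term contributes
exactly one to the inner sum, so the computation neither overflows nor
underflows to zero.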
607: 
608: I use Cartesian product quadrature \citep[CPQ; see e.g.][
609: \textbf{9.43}--\textbf{9.44}]{ohaganforster} to compute marginal
posteriors for each of $\mu$, $\tau$ and $\rho$, and the normalising
611: constant or marginal likelihood $\Pr{(\hat{\vec{y}})}$ assuming there
612: is a disease QTL.  CPQ makes the approximation
613: \begin{eqnarray}
614:   \Pr{(\hat{\vec{y}})}&=&\int{\int{\int{
615:         \Pr{(\hat{\vec{y}},\mu,\tau,\rho)}\,\dif\mu}\,\dif\tau}\,\dif\rho}
616:   \nonumber\\
617:   &\simeq&\sum_j{\sum_k{\sum_\ell{ w_j^{(\mu)} w_k^{(\tau)} w_\ell^{(\rho)} \Pr{(\hat{\vec{y}},\mu_j,\tau_k,\rho_\ell)}}}}
618:   \label{eq:cpq-howto}
619: \end{eqnarray}
620: where e.g.\ $j$ indexes a set of \emph{design points}
621: $\{\mu_1,\mu_2,\ldots\}$, and $w_j^{(\mu)}$ is a \emph{weight}
622: associated with the $j$-th design point.  (A quantity proportional to)
623: the marginal posterior for any variable $\mu$, $\tau$ or $\rho$ is obtained by omitting the
624: respective sum from (\ref{eq:cpq-howto}).  When computed using priors
625: describing our beliefs about these variables given that there is a
626: QTL in the region of interest, the quantity in (\ref{eq:cpq-howto})
627: will be called $\Pr{(\hat{\vec{y}}|\mbox{QTL})}$.
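
For instance, omitting the sum over $j$ and normalising gives the
following sketch of the marginal posterior for $\mu$ at the design point
$\mu_j$ (the expressions for $\tau$ and $\rho$ follow by symmetry):
\begin{equation*}
  \Pr{(\mu_j|\hat{\vec{y}})} \simeq
  \frac{\sum_k{\sum_\ell{ w_k^{(\tau)} w_\ell^{(\rho)}
        \Pr{(\hat{\vec{y}},\mu_j,\tau_k,\rho_\ell)}}}}
  {\sum_{j'}{\sum_k{\sum_\ell{ w_{j'}^{(\mu)} w_k^{(\tau)} w_\ell^{(\rho)}
          \Pr{(\hat{\vec{y}},\mu_{j'},\tau_k,\rho_\ell)}}}}}\;\mbox{.}
\end{equation*}
The denominator is the approximation (\ref{eq:cpq-howto}), so by
construction $\sum_j{w_j^{(\mu)}\Pr{(\mu_j|\hat{\vec{y}})}}=1$ under the
same quadrature rule.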
628: 
629: The choice of design points and weights depends on the region of
630: interest and the prior, and the quadrature rule to be used.  For
example, in all calculations in section~\ref{sec:results} the region of
632: interest is $\mu\in(0,1)$, measured in Mb or cM.  With a uniform prior
for $\mu$, exponential prior for $\tau$ with $T=1/1000$, and flat beta
634: prior for $\rho$ with $R_1=R_0=1$, I used 100 design points for each
635: variable as follows:
636: \begin{eqnarray}
637:   \mu_j = \left(j+1/2\right)/100\mbox{,}&
638:   w_j^{(\mu)} = 0.01\mbox{,}& j=0,1,\ldots,99\nonumber\\ 
639:   \tau_k = \exp{(k/11)}\mbox{,}&
640:   w_k^{(\tau)} = \exp{(k/11)}/11\mbox{,}& k=0,1,\ldots,99\nonumber\\
641:   \rho_\ell = \left(\ell+1/2\right)/100\mbox{,}&
642:   w_\ell^{(\rho)} = 0.01\mbox{,}& \ell=0,1,\ldots,99
643:   \label{eq:cpq-default-design}
644: \end{eqnarray}
645: Here the quadrature rule is very simple and corresponds approximately
646: to the trapezoid rule or two point Newton--Cotes rule.
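
The weights in (\ref{eq:cpq-default-design}) can be motivated as follows;
this is only a sketch, and the variable $u$, the spacing $\Delta u$ and
the integrand $f$ are notation introduced here.  For $\mu$ and $\rho$ the
design points are the midpoints of 100 equal subintervals of $(0,1)$, and
each weight is the subinterval width $1/100$.  For $\tau$ the design
points are equally spaced in $u=\ln{(\tau)}$ with spacing $\Delta
u=1/11$, and since $\dif\tau=\tau\,\dif u$,
\begin{equation*}
  \int{f(\tau)\,\dif\tau} = \int{f(\me^u)\,\me^u\,\dif u}
  \simeq \sum_k{f(\tau_k)\,\tau_k\,\Delta u}\mbox{,}
\end{equation*}
which gives $w_k^{(\tau)}=\tau_k\,\Delta u=\exp{(k/11)}/11$.  With
$k=0,1,\ldots,99$ the design points for $\tau$ run from $\tau_0=1$ to
$\tau_{99}=\me^{9}\simeq 8103$.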
647: 
648: A simple model with no QTL is to assume there are no blocks of i.b.d.\
649: haplotype, $x_i=0$ for all $i$.  This corresponds to a degenerate
650: prior at $\rho=0$ (or the limits $\mu\to\pm\infty$ or $\tau\to\infty$).
651: The probability of the data under this model, $\Pr{(\hat{\vec{y}}|\mbox{no
652:     QTL})}$, is easily computed directly from
653: (\ref{eq:probSNPsumsum}).  The Bayes factor (BF) in favour of the
654: model with a QTL is then
655: \begin{equation}
656:   \label{eq:BF-def}
657:   \mbox{BF} = \frac{\Pr{(\mbox{QTL}|\hat{\vec{y}})}}{\Pr{(\mbox{no
658:         QTL}|\hat{\vec{y}})}}\bigg/
659: \frac{\Pr{(\mbox{QTL})}}{\Pr{(\mbox{no
660:         QTL})}}
661: =\frac{\Pr{(\hat{\vec{y}}|\mbox{QTL})}}{\Pr{(\hat{\vec{y}}|\mbox{no QTL})}}\;\mbox{.}
662: \end{equation}
663: In addition to its Bayesian interpretation, the
664: BF (or any isotonic transformation thereof) has good
665: properties, from a classical frequentist perspective, as a test
666: statistic to test the null model with no QTL against the alternative
667: with a QTL \citetext{\citealt{ohaganforster} ch.5,
668:   \citealt{patterson2004}}.  Tests based on the BF are admissible,
669: which means that no other test has greater power for all
670: $(\mu,\tau,\rho)$.  There may be other tests that have greater power
671: for some $(\mu,\tau,\rho)$, and are therefore also admissible, but it
672: is ``unusual and strange'' to find an admissible test that is not
673: based on the BF computed using \emph{some} prior.  Furthermore, (up to
674: isomorphism) the BF uniquely maximises average power, when the
675: averaging is done with respect to the prior for $(\mu,\tau,\rho)$ used
676: to compute the BF.  Of course, this theory only applies when the model
677: is correct, but we might hope that the most powerful test for an
678: approximate model would be approximately most powerful for a more
679: realistic model.  The BF is a sensible way to combine information
680: across many markers to produce a single test, and thus avoids (or
681: overcomes) the problematic need to correct for multiple testing
682: \citep{patterson2004}.
683: 
684: I have also implemented a Markov chain Monte Carlo
685: \citep[MCMC; see e.g.][]{gilks1996,liubook2001} sampler, which uses
686: the Metropolis--Hastings algorithm to sample from
687: the posterior density for $(\mu,\tau,\rho)$.  
688: A proposal distribution that seems to work reasonably well
is to update a single variable, chosen at random with equal
690: probability, using the following:
691: \begin{eqnarray}
692:   \label{eq:proposal-z}
693:   \mu' &\sim& \ndist{\left(\mu,(0.1(m_L-m_1))^2\right)}\nonumber\\
694:   \label{eq:proposal-t}
695:   t' &\sim& \gamdist{\left(10,t/10\right)}\nonumber\\
696:   \label{eq:proposal-r}
697:   r' &\sim& \betadist{\left(5r,5(1-r)\right)}
698: \end{eqnarray}
699: Kernel based methods \citep[see e.g.][]{silverman1986} can then be
700: used to estimate marginal posteriors for each of the variables, and
701: these seemed similar to the marginal posteriors computed using CPQ in
702: the cases I examined.  However, standard numerical methods to estimate
703: the model probability $\Pr{(\hat{\vec{y}})}$ from the MCMC output
\citep[see e.g.][ \textbf{10.46}]{ohaganforster} converged very slowly
705: and did not seem reliable.
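
For reference, the way such proposals enter a Metropolis--Hastings
sampler can be sketched as follows.  This is the standard acceptance
rule rather than a description of the specific implementation, and the
symbols $\theta$ and $q$ are introduced here for illustration.  Writing
$\theta$ for the current value of $(\mu,\tau,\rho)$, $\theta'$ for the
proposed value, and $q(\theta'|\theta)$ for the proposal density, the
proposal is accepted with probability
\begin{equation*}
  \min{\left\{1,\;
      \frac{\Pr{(\hat{\vec{y}},\theta')}\,q(\theta|\theta')}
      {\Pr{(\hat{\vec{y}},\theta)}\,q(\theta'|\theta)}\right\}}\mbox{.}
\end{equation*}
Only the unnormalised posterior is needed, because
$\Pr{(\hat{\vec{y}})}$ cancels in the ratio.  A normal proposal centred
on the current value is symmetric, so its $q$ terms cancel, whereas gamma
and beta proposals are not symmetric in general and their $q$ terms must
be retained.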
706: 
707: In addition to the ease and reliability with which the normalising
708: constant or BF can be computed, CPQ offers substantial advantages over
709: alternatives such as MCMC in low dimensional situations such as this
one.  There are no concerns about burn-in or mixing.  CPQ makes
711: efficient use of the evaluation of the posterior density at each
712: design point. We can investigate sensitivity to prior specification
713: afterwards without redoing much of the computation.
714: 
715: In this particular situation CPQ offers an additional advantage, that
716: by traversing the design points in a particular order many of the
backwards variables computed in the propagation algorithm of
section~\ref{sec:hidden-markov-model} can be stored and reused, and
thus a CPQ algorithm with $n$ design points runs much faster than an
MCMC algorithm with $n$ samples.  The details are as follows: The
721: posterior can be computed on all points on a three dimensional lattice
722: most efficiently by traversing the lattice with different values of
723: $\tau$ in the outermost loop, different values of $\mu$ in the middle
724: loop, and different values of $\rho$ in the innermost loop.  This is
725: because each time $\rho$ changes but $\mu$ and $\tau$ do not then only
726: (\ref{eq:modelprobgivenrho}) has to be recomputed.  Each time $\mu$
727: changes but $\tau$ does not then (\ref{eq:modelprobgivenmu}) always
728: has to be recomputed, and if the old and new values of $\mu$ are
729: separated by one or more typed SNPs then one or more columns of
730: backwards variables also have to be recomputed.  If $\mu$ always
731: increases then the $b$ are all computed first and then as the lattice
732: is traversed extra columns of $b'$ are computed and columns of $b$ are
733: simply discarded.  Changing $\tau$ means that the transition
734: probabilities~(\ref{eq:jumpprob}) change and everything has to be
735: recomputed.  The run time of the whole algorithm (propagation and CPQ)
736: scales quadratically in $n_d$ and linearly in $L+d_\mu$ where $d_\mu$
737: is the number of design points used for $\mu$.  If a small map region
738: is found to be interesting additional design points can be added later
739: at moderate cost.
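
To make the saving concrete, here is a rough count for a lattice with
$d_\tau$, $d_\mu$ and $d_\rho$ design points for the three variables
($d_\tau$ and $d_\rho$ are notation introduced here by analogy with
$d_\mu$).  With the loop order described above and $\mu$ always
increasing, each column of $b$ and $b'$ is computed once per value of
$\tau$, the termination step (\ref{eq:modelprobgivenmu}) is performed
once per pair $(\tau,\mu)$, and only the final sum
(\ref{eq:modelprobgivenrho}) over $x_\mu$ is performed at every lattice
point:
\begin{equation*}
  \underbrace{d_\tau}_{\mbox{\scriptsize propagations}}
  \qquad
  \underbrace{d_\tau d_\mu}_{\mbox{\scriptsize terminations}}
  \qquad
  \underbrace{d_\tau d_\mu d_\rho}_{\mbox{\scriptsize final sums}}
\end{equation*}
For a $100\times100\times100$ lattice these counts are $10^2$, $10^4$ and
$10^6$ respectively.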
740: 
741: The disadvantages of CPQ are that we learn little about the posterior
742: until calculations have been completed for all the design points, we
743: may belatedly discover that our choice of design points was not a good
744: one, that MCMC \emph{may} be more sensitive to very narrow spikes
745: containing substantial probability mass (these will be missed if they
fall in between the design points) and that CPQ will not scale well to
747: higher dimensional spaces which we might need to study if we
748: elaborated the model.
749: 
750: 
751: \section{Coalescent Simulation Model and Error Model}
752: \label{sec:simul-model}
753: In this section I describe a more realistic model that was used to
754: simulate datasets on which to test the method described above in
755: sections~\ref{sec:appr-model-prior}--\ref{sec:analysis}.  I also
756: describe a specific model for errors in allele frequency estimation.
757: Each simulated dataset was generated as follows.
758: 
759: First I used the \texttt{mksamples} program of \citet{hudson2002} to
760: simulate a sample of 20,000 1Mb long regions, assuming the standard
761: neutral coalescent model with population recombination rate
762: $4N_{\mathrm{e}}c=400$ ($N_{\mathrm{e}}=10,000$ assuming
763: $1\mbox{cM}/\mbox{Mb}$), and assuming the infinite sites mutation
764: model with population mutation rate $4N_{\mathrm{e}}\mu=10$.  (This is
an unrealistically low mutation rate; the idea is to simulate some of
766: the SNPs in the region rather than all of them.)  Chromosomes were
767: paired at random to generate a sample of 10,000 individuals.
768: 
769: One SNP with a minor allele frequency between 10\% and 20\% was chosen
770: as the disease QTL.  The disease status of each individual was
771: simulated assuming multiplicative risks, so that
772: $\gamma_{01}/\gamma_{00}=\gamma_{11}/\gamma_{01}=g$.  Here $\gamma_G$
773: is the penetrance (probability of having the disease) given genotype
774: $G$ at the disease QTL, with 1 the minor allele.  The parameter $g$ is
775: called the allelic or genotype relative risk 
\citep[see e.g.][]{rischmerikangas1996}.  Simulations for this paper used
777: values of $g=4$ and $g=1$.  The penetrance of the wild type
778: homozygote, $\gamma_{00}$, was set so that the marginal probability of
779: having the disease was 0.02.  (Thus the number of case chromosomes
780: $\nd$ was random with expectation $0.02\times10,000\times2=400$.)
781: Data from all $\nd/2$ case individuals, and an equal number $\nc/2$ of
randomly chosen control individuals, were used to make up the two
783: pools.
784: 
785: Excluding the disease QTL, all simulated SNPs with a minor allele
786: frequency greater than 0.05 in the $\nc/2$ individuals in the control
787: pool were analysed, so the number of SNPs $L$ was also random.  The
allele frequencies at each SNP and for each pool were either
assumed to be known exactly, or assumed to have been
estimated using the lag between kinetic PCR growth curves
791: \citep{germer2000}, using
792: \begin{equation}
793:   \label{eq:yhatfromlag}
794:   \hat{y} = \frac{1}{1+2^{\Delta\widehat{C_t}}}\times n\;\mbox{.}
795: \end{equation}
796: Here $\hat{y}$ is shorthand for the estimated count of the 1 allele,
797: $n$ is the number of chromosomes in the pool,
798: $\Delta\widehat{C_t}=\widehat{C_t}(1)-\widehat{C_t}(0)$, and
799: $\widehat{C_t}(a)$ is the number of PCR cycles before the amount of
800: PCR product for allele $a$ exceeds some threshold level
801: \citep{germer2000}.  The model for $\Pr{(\hat{y}|y,e)}$ is then as
802: follows: Define the true lag $\Delta C_t=\log_2((n-y)/y)$, which is
803: the lag that would give the correct frequency when
804: (\ref{eq:yhatfromlag}) was used.  I assume that the observed lag
805: $\Delta\widehat{C_t}$ averaged across $r$ experiments is normally
806: distributed with mean $\Delta C_t$ and variance $\sigma^2/r$, where
807: $\sigma^2$ is the variance in lags across replicate experiments.
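
As a purely illustrative example of (\ref{eq:yhatfromlag}), with numbers
chosen here for convenience rather than taken from any dataset, a pool of
$n=400$ chromosomes with an observed lag of $\Delta\widehat{C_t}=2$
cycles gives
\begin{equation*}
  \hat{y} = \frac{400}{1+2^{2}} = 80\mbox{,}
\end{equation*}
that is, an estimated frequency of $\hat{y}/n=0.2$ for the 1 allele.
Conversely, a true count of $y=80$ in the same pool corresponds to a
true lag of $\Delta C_t=\log_2{(320/80)}=2$ cycles.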
808: 
809: Using the Jacobian
810: \begin{equation}
811:   \label{eq:yhatzhatjacobian}
812:   \frac{\dif \hat{y}}{\dif (\Delta\widehat{C_t})} = \ln{(2)}\hat{y}(n-\hat{y})\frac{1}{n}
813: \end{equation}
814: we can write down the error model in
815: the form required in (\ref{eq:SNPi-prob}) for the analysis,
816: \begin{equation}
817:   \label{eq:errorgaussdeltact}
818:   \Pr{(\hat{y}|y,n,\sigma^2,r)} = \frac{n}{\ln{(2)}\hat{y}(n-\hat{y})}
819:   \frac{1}{\sqrt{2\pi\sigma^2/r}}
820:   \exp{\left(-\frac{\left(\ln\left(\frac{(n-y)\hat{y}}{y(n-
821: \hat{y})}\right)\right)^2}{2\ln{(2)}^2\sigma^2/r}\right)} \;\mbox{.}
822: \end{equation}
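
The Jacobian (\ref{eq:yhatzhatjacobian}) and the exponent in
(\ref{eq:errorgaussdeltact}) can be checked with the following sketch, in
which $z$ is shorthand introduced here for $\Delta\widehat{C_t}$.
Differentiating (\ref{eq:yhatfromlag}) and using
$1/(1+2^z)=\hat{y}/n$ and $2^z/(1+2^z)=(n-\hat{y})/n$ gives
\begin{equation*}
  \frac{\dif\hat{y}}{\dif z}
  = -\frac{n\ln{(2)}\,2^{z}}{(1+2^{z})^2}
  = -\ln{(2)}\,\hat{y}(n-\hat{y})\frac{1}{n}\mbox{,}
\end{equation*}
whose magnitude is (\ref{eq:yhatzhatjacobian}); only the magnitude enters
the change of variables.  The deviation of the observed lag from the true
lag is
\begin{equation*}
  z-\Delta C_t
  = \log_2{\left(\frac{n-\hat{y}}{\hat{y}}\right)}
  - \log_2{\left(\frac{n-y}{y}\right)}
  = -\frac{1}{\ln{(2)}}
  \ln{\left(\frac{(n-y)\hat{y}}{y(n-\hat{y})}\right)}\mbox{,}
\end{equation*}
and squaring this and dividing by $2\sigma^2/r$ gives the exponent in
(\ref{eq:errorgaussdeltact}).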
823: 
824: \section{Testing the Method} 
825: \label{sec:results}
826: All measurements of the performance of the Bayesian method described
827: in sections \ref{sec:appr-model-prior}--\ref{sec:analysis} are based
828: on analyses of datasets simulated under a more realistic model, as described in
829: section~\ref{sec:simul-model}.  Results are reported for two situations, either where the allele
830: frequencies in each pool are known exactly, or where there are errors in allele
831: frequency estimation using (\ref{eq:errorgaussdeltact}) and assuming
832: that $n$, $r=2$ replicates and $\sigma=0.2\mbox{ PCR cycles}$ are all known.
833: This magnitude of error is comparable to those reported by
834: \citet{germer2000} and \citet{shiffman2004}.
835: Using (\ref{eq:yhatzhatjacobian}) we can say that these parameter values
836: correspond to a ``typical'' error in allele frequency estimate
837: $\hat{y}/n$ of about $y/n (1-y/n)\ln{(2)}\sigma/\sqrt{2}\simeq y/n
838: (1-y/n)0.098$ or, for intermediate allele frequencies, about 2.5\%.
839: 
840: For each situation, allele frequencies known either exactly or
841: estimated with errors, I analysed 500 datasets simulated
842: assuming there was a QTL with a
843: genotype relative risk $g=4$, and 500 datasets simulated assuming a null model
844: with no QTL ($g=1$, so the penetrances
845: $\gamma_{00}=\gamma_{01}=\gamma_{11}=0.02$ are all equal).  In these
846: simulations, the median number of case or control individuals,
847: $\nd/2=\nc/2$, was 200 (interquartile range 191--209, range 154--248).
848: The median number of SNPs, $L$, was 28 (interquartile range 24--32,
849: range 12--51).  These simulations assumed relative risks that are 
850: higher, and correspondingly sample sizes that are smaller than may be
851: realistic for many studies of QTLs influencing complex genetic
852: diseases.  This reflects the need to analyse a reasonably large number
853: of simulated datasets with the computing resources currently available
854: to me.  The mean time to run an analysis on a simulated dataset, using
855: CPQ with the design (\ref{eq:cpq-default-design}) which requires
856: evaluating the posterior at $10^6$ points on a $100\times100\times100$
857: lattice, was about 36 minutes on a 2.4GHz Intel\textregistered{}
858: Xeon\texttrademark{} processor (totalling about 50 processor days for
859: all the simulated datasets).
860: 
861: The inferences from the Bayesian method described here are compared
862: against simple but widely used classical single point analyses.  When
863: allele frequencies in each pool are known exactly, a chi squared test
864: can be used on the counts of the two alleles in the two pools
865: \citep[see e.g.\ ][]{clayton2001}, at each
866: marker SNP separately.  \citet{visscher2003} consider how to perform
867: an equivalent test when the errors in allele frequency estimates are
868: Gaussian.  The relatively small Gaussian errors in $\Delta\widehat{C_t}$
869: simulated here will produce errors in allele frequency estimates that
870: are approximately Gaussian (to the extent that (\ref{eq:yhatfromlag})
871: is linear, and in fact are underdispersed relative to a Gaussian in
872: the direction of extreme allele frequencies).  \citet{visscher2003}
873: show that a ``shrunk'' test statistic is approximately distributed as
874: $\chi^2_{(1)}$.  This shrunk statistic is equal to the ordinary
875: $\chi^2$ statistic computed using a point estimate of the counts,
876: times a factor $2V_s/(2V_s+V_e)$ where $V_s$ is the estimated sampling
877: variance of the allele frequency due to sampling a finite number of
878: cases and controls, under the null hypothesis of equal allele
879: frequency in cases and controls, and $V_e$ is the variance of the
880: allele frequency in either pool due to experimental error.  Using
881: (\ref{eq:yhatzhatjacobian}), for the simulations performed here this
882: shrinking factor is (approximately, for small $\sigma$)
883: \begin{equation}
884:   \frac{2}{2+(\nd+\nc)\hat{p}(1-\hat{p})\ln{(2)}^2\sigma^2/r}
  \;\mbox{,}
886:   \label{eq:visscher-shrink-factor}
887: \end{equation}
888: where $\hat{p}$ is the allele frequency estimated under the null
889: hypothesis, i.e.\ by pooling the case and control pools.
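
To give a feel for the size of this factor, here is an illustrative
evaluation of (\ref{eq:visscher-shrink-factor}) assuming
$\nd=\nc=400$, $\sigma=0.2$ and $r=2$, so that
$(\nd+\nc)\ln{(2)}^2\sigma^2/r\simeq 800\times0.4805\times0.02\simeq7.69$:
\begin{eqnarray*}
  \hat{p}=0.5\mbox{:}&&\frac{2}{2+7.69\times0.25}\simeq 0.51\\
  \hat{p}=0.1\mbox{:}&&\frac{2}{2+7.69\times0.09}\simeq 0.74
\end{eqnarray*}
Under these assumed values the ordinary $\chi^2$ statistic is therefore
roughly halved at intermediate allele frequencies, and reduced by about a
quarter at $\hat{p}=0.1$.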
890: 
891: The most widely considered statistics from a classical single point
892: analysis are as follows:  Let \pmin{} be the smallest $p$-value of the
893: $L$ (shrunk) chi squared tests applied to a given dataset, and let
894: \muminp{} be the map position of the marker with the smallest
895: $p$-value.
896: 
897: It is worth emphasising that all the tests described in the following
898: sections concern the classical sense performance of statistics
899: computed from the Bayesian analysis.  Strictly, the Bayesian sense
900: performance can only be tested by conducting a Bayesian analysis
901: assuming a more realistic model (or prior).
902: 
903: \subsection{Power to Detect a QTL}
904: \label{sec:power-detect-qtl}
905: In this section I compare the power of tests to detect a QTL.  I
906: consider two different test statistics, and different methods of
907: determining critical regions.  The first test statistic is
908: $2\ln\mbox{BF}$, twice the logarithm of the Bayes factor
909: (\ref{eq:BF-def}).  The second test statistic is $\pmin\times L$.
910: Multiplying the smallest single point $p$-value by $L$ achieves a
simple Bonferroni correction for multiple testing that makes the
912: critical region independent of $L$.  Critical regions were determined
913: either analytically (by arbitrary or approximate methods), or
914: empirically (from analyses of datasets simulated under the null
915: model).  I report the performance of tests with nominal sizes of
916: $\alpha=0.05$ and $\alpha=0.01$; a more general comparison is made in
917: figures~\ref{fig:roc-no-error} and \ref{fig:roc-with-error}.  For each
918: test, the true size was estimated using 500 simulations under the null
919: model with genotype relative risk $g=1$, i.e.\ where risk is
920: independent of genotype at the QTL.  The power against an alternative
921: with $g=4$ was estimated using 500 simulations.  For each test I
922: report the estimated size and power, along with exact 95\% binomial
923: confidence intervals for their values.
924: 
925: From a Bayesian perspective, $2\ln\mbox{BF}>0$ indicates evidence in
926: favour of the model with a QTL over the model with no QTL.  As 
927: tables~\ref{tab:power-no-error} and \ref{tab:power-error}
928: show, the test with this critical region has small size and reasonable
929: power.  However, much more simulation work, for different models and
930: combinations of parameters, is required to establish the generality of
931: this result.  Also, at least from a classical perspective it is
932: desirable to be able to choose a critical region according to the size
933: (or perhaps power) that is desired.
934: 
935: An arbitrary critical region is
936: $2\ln\mbox{BF}>2\ln{(\frac{1-\alpha}{\alpha})}$.  I say arbitrary
937: because this in fact guarantees nothing about the classical sense
938: error rate, but does bound the Bayesian sense error rate: The
939: posterior probability that there is no QTL is less than $\alpha$,
$\Pr{(\mbox{no QTL}|\hat{\vec{y}})}<\alpha$, when the model, the prior
941: $\Pr{(\mbox{QTL})}=\Pr{(\mbox{no QTL})}$, and the prior for
942: $(\mu,\tau,\rho)$ are correctly specified. It can be seen from
943: tables~\ref{tab:power-no-error} and \ref{tab:power-error} that these
944: arbitrary critical regions give tests with true sizes that are smaller
945: than $\alpha$.  Such tests are therefore conservative.  Causes may
946: include the simplified model used to compute the BF, the dependence of
947: the BF on the prior specification $T$, and the fact that the critical
948: region bounds the Bayesian sense error rate rather than controls the
949: classical sense error rate.  The use of these arbitrary critical
950: regions $2\ln\mbox{BF}>2\ln{(\frac{1-\alpha}{\alpha})}$ entails a loss
951: of power due to the actual size of the test being smaller than
952: intended, so a better method for determining a critical region is
953: desirable.
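
The origin of this threshold can be sketched in one line.  With equal
prior probabilities for the two models, (\ref{eq:BF-def}) shows that the
posterior odds equal the BF, so
\begin{equation*}
  \Pr{(\mbox{no QTL}|\hat{\vec{y}})} = \frac{1}{1+\mbox{BF}} < \alpha
  \quad\Longleftrightarrow\quad
  \mbox{BF} > \frac{1-\alpha}{\alpha}\mbox{.}
\end{equation*}
Numerically, the thresholds on $2\ln\mbox{BF}$ are $2\ln{(19)}\simeq5.89$
for $\alpha=0.05$ and $2\ln{(99)}\simeq9.19$ for $\alpha=0.01$.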
954: 
Assuming that the $\chi^2_{(1)}$ approximation is good, with the shrinking
factor (\ref{eq:visscher-shrink-factor}), and using simple Bonferroni
correction, suggests an approximate critical region $\pmin\times
L<\alpha$.  These tests have true sizes equal to or slightly smaller
than their nominal sizes, which is expected because the Bonferroni
960: correction is conservative.  This effect is expected to increase in
961: severity as the marker density increases, because there will be a
962: greater number of more positively correlated tests.
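
As an illustration of what this critical region means for a single
marker, taking the median value $L=28$ from the simulations described
above, the condition $\pmin\times L<\alpha$ is equivalent to
\begin{equation*}
  \pmin < \frac{\alpha}{L} \simeq
  \left\{\begin{array}{ll}
      1.8\times10^{-3} & (\alpha=0.05)\\
      3.6\times10^{-4} & (\alpha=0.01)
    \end{array}\right.
\end{equation*}
If each single point $p$-value is uniformly distributed under the null
model, the probability that at least one of the $L$ values falls below
$\alpha/L$ is at most $L\times\alpha/L=\alpha$, whatever the correlation
between tests, which is why the correction can only be conservative.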
963: 
964: Although these respectively arbitrary and approximate methods for
965: determining critical regions are not terribly accurate, and cannot be
966: recommended, it is worth noting that there is no clear difference in
967: power between the two test statistics, for tests with the same nominal
968: size.  Since the test based on $2\ln\mbox{BF}$ is more conservative,
969: it might reasonably be preferred.
970: 
971: It is not very meaningful to compare the power of tests with different
972: sizes.  Therefore I used simulations to estimate exact critical
973: regions, so that the power of different tests with true size $\alpha$
974: could be compared.  For these tests, the estimated size is exactly
975: equal to the nominal size, because the same set of simulations are
976: used to compute both values.  Although the critical region for
977: $\alpha=0.01$ is unlikely to be well estimated using only 500
978: simulations, by a simple symmetry argument this procedure still allows
979: a fair comparison across the different test statistics.  In every case
980: the test based on $2\ln\mbox{BF}$ is substantially more powerful than
981: the test based on $\pmin\times L$.
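
A rough sense of the Monte Carlo precision involved can be had from the
following sketch, which uses only the number of simulations reported
above.  With 500 simulations under the null model, an empirical critical
region of size $\alpha$ is determined by about
\begin{equation*}
  500\,\alpha = 25 \;\;(\alpha=0.05)
  \qquad\mbox{or}\qquad
  500\,\alpha = 5 \;\;(\alpha=0.01)
\end{equation*}
of the most extreme null values of the statistic.  Similarly, a power
estimated from 500 simulations has a binomial standard error of at most
$\sqrt{0.25/500}\simeq0.022$.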
982: 
By combining simulations in which there is and is not a QTL, we can
984: view test statistics as \emph{classifiers}, and ask how well they
985: discriminate between the two cases.  The receiver operating
986: characteristics (ROC) for the two statistics are compared in
987: figures~\ref{fig:roc-no-error} and \ref{fig:roc-with-error}.  The ROC curves are equivalent to plotting estimated power
988: ($=\mbox{sensitivity}$) against size of test ($=1-\mbox{specificity}$)
989: for all possible tests (in fact, only all tests with non-disjoint
990: critical regions).  When viewed in this way, it can be seen that the
991: BF based statistic is uniformly equal to or superior to the minimum
992: $p$-value based statistic.  The advantage of the BF is greater when
993: there are errors in allele frequency estimation.  This may be because,
994: when the dataset is less informative, it may be more important to have
995: a model based way to combine information across SNPs.
996: 
997: For comparison, I have also plotted the ROC curves for a test
998: statistic derived from the nonparametric likelihood approach that I
999: have described previously \citep{johnson2005a}.  This nonparametric
1000: likelihood ratio (NLR) statistic is defined in the same way as the
1001: BF~(\ref{eq:BF-def}), using the same value for
1002: $\Pr{(\hat{\vec{y}}|\mbox{no QTL})}$, but makes the approximation
1003: \begin{equation}
1004:   \label{eq:proftestdef}
1005:   \Pr{(\hat{\vec{y}}|\mbox{QTL})} \simeq
1006:   \prod_{i=1}^{L}{\Pr{(\ey{d}{i},\ey{c}{i}|x_i^*)}}
1007: \end{equation}
1008: where $(x_1^*,x_2^*,\ldots,x_L^*)$ is the set of hidden states in the
1009: HMM that maximise the probability~(\ref{eq:proftestdef}) under an
1010: order restriction that they are either a weakly increasing sequence, a
1011: weakly decreasing sequence, or a weakly increasing then weakly
1012: decreasing sequence.  This order restriction must be true regardless
1013: of the shape of the genealogy at the QTL.  The NLR statistic is not a
1014: very good approximation to the BF, in particular because it can never
1015: be negative.  As far as I know, there is no theoretical reason to
1016: believe that it should have good properties as a test statistic.
1017: However, as figures~\ref{fig:roc-no-error} and
1018: \ref{fig:roc-with-error} show, tests based on the NLR are superior to
1019: tests based on $\pmin\times L$ and are not clearly distinguishable
1020: from tests based on the BF.  For the simulated datasets studied here,
once the lookup table of emission
1022: probabilities~(\ref{eq:probSNPsumsum}) has been computed, computing
1023: the NLR is over $10^4$ times faster than computing the BF.
1024: Furthermore, a Viterbi-like algorithm \citep[see][]{durbinbook} for
1025: computing the NLR has time complexity $\mathrm{O}{(L\times\nd)}$,
1026: compared with the CPQ algorithm for computing the BF which has time
1027: complexity $\mathrm{O}{((L+d_\mu)\times\nd^2)}$.
1028: 
1029: It may concern some readers that the critical regions and sizes and
1030: powers of tests were all estimated while allowing the numbers of
1031: cases, controls and marker SNPs all to vary across simulations.  To
1032: interpret results acquired in this way, a formal classical framework
1033: would require us to view the genotype relative risk $g$ as the single
1034: parameter, and the number of case chromosomes \nd{} and number of SNPs
1035: $L$ as random variables.  It is true that even in such a framework we
1036: would normally wish to perform tests conditional on the values of
ancillary variables such as \nd{} and $L$ that contain no information
1038: about whether there is a QTL in the region.
1039: However, it is a feature (or weakness) of classical inference that one
1040: is often free to choose whether to condition on any given variable
1041: \citep[but see e.g.][]{jaynes1976,robinson1979}.  The present results
1042: therefore do have a sound classical interpretation.  In any case, the
1043: small number of simulations performed here do not allow the luxury of
1044: estimating critical regions conditional on \nd{} or $L$.  The
1045: simulation procedure as used reflects a likely feature of real
1046: datasets, that SNP density will be higher in regions of the genome
1047: where the genealogy is deeper.  To alter the simulation procedure so
1048: that all simulated datasets had the same value of $L$ would require
1049: the introduction of an \textit{ad hoc} algorithm to select the $L$
1050: markers to be used from a larger number of candidates.
1051: 
1052: As shown in figure~\ref{fig:teststats}, the null distributions of both test statistics show a negative
1053: relationship with $L$.  The negative relationship is most pronounced
1054: for the $2\ln\mbox{BF}$ statistic, for the situation where there are
1055: errors in allele frequency estimation.  In this case a linear
1056: regression of $2\ln\mbox{BF}$ on $L$ had a slope significantly
1057: different from zero ($p=0.010$), and if the values of $2\ln\mbox{BF}$
1058: are partitioned into two groups according to the rank of $L$, the
1059: hypothesis that they are drawn from the same distribution can be
1060: rejected using a Kolmogorov--Smirnov test ($p=0.004$).  These tests do
1061: not detect significant dependence ($p>0.05$) for the $2\ln\mbox{BF}$
1062: statistic when the allele frequencies are known exactly, or for the
1063: $\pmin\times L$ statistic.
1064: 
1065: Although the simulations described here are adequate for demonstrating
1066: the superiority of the BF based test over the \pmin{} based test, we
1067: should be cautious about extrapolating from the current results.  In
1068: particular, it seems that the arbitrary
1069: ($2\ln\mbox{BF}>2\ln{(\frac{1-\alpha}{\alpha})}$) or approximate
(Bonferroni) critical regions described above will become more
1071: conservative as SNP number or density increases.  Performing tests
1072: that are not conditioned on SNP number and density will introduce
1073: recognisable subset biases \citep[see e.g.][]{robinson1979}.  In a
1074: real situation, a critical region should be determined using
simulations conditioned on as many ancillary statistics of the
observed data as possible, although for complex simulation models it
may be a matter of guesswork which statistics are approximately
ancillary.  An approach that could be most useful in practice is a
1079: variant of the permutation test of \citet{churchill1994}.  This could
1080: be applied if there were matched pairs of pools of cases and controls,
1081: and each pair were typed in separate DNA pooling experiments
1082: \citep{shiffman2004}.  Then the phenotype labels could be permuted
1083: within each pair, giving a set of equiprobable values for any test
1084: statistic under the null hypothesis.  Such an approach could not be
1085: explored here because it would require too much computation.
1086: 
1087: 
1088: \subsection{Sensitivity to prior specification}
1089: \label{sec:prior-specification}
1090: It is important to appreciate that the Bayes factor does not depend on
1091: the prior probabilities for the two models (QTL or no QTL), but
1092: \emph{does} depend on the priors for the parameters within the QTL
1093: model.  Misspecification of these priors could adversely influence the
1094: performance of the BF as a test statistic, and it is important to
1095: examine typical levels of robustness to the prior.  To explore this, I
1096: compare the analyses above that used relatively flat generic priors to
1097: analyses that used priors that were in a way optimised for the
1098: simulated datasets under consideration.
1099: 
1100: Note that the prior for $\mu$ is correct, but that the approximate
1101: model here uses two other parameters $\tau$ and $\rho$ that do not
have any direct correspondence to parameters of the coalescent model
1103: that the data were simulated under.  It is therefore difficult to say
1104: what the best prior is for analysing the simulated datasets.  Loosely
1105: speaking, we might imagine that for any one simulated dataset, in the
1106: limit of an infinite amount of informative data the posterior for
1107: $\tau$ or $\rho$ would converge to a single value, which we could call
1108: the ``best approximating'' value for that dataset.  However, with less
1109: than an infinite amount of data the posterior mean for either variable
1110: would lie somewhere between the prior mean and the best approximating
1111: value.  Thus the distribution of posterior means across simulations
1112: would be (very loosely speaking) in between the degenerate distribution
1113: at the prior mean, and the distribution of best approximating values.
1114: Figure~\ref{fig:new-prior} shows that this is indeed the case for $\mu$, for which the prior mean
1115: is 0.5 and the true correct prior is uniform on $[0,1]$.  The
1116: distributions of posterior means for $\tau$ and $\rho$ shown in
1117: figure~\ref{fig:new-prior} suggest a lognormal prior for $\tau$ (with
1118: $\ln{(\tau)}$ having prior mean 6.8 and prior standard deviation 0.74)
1119: and a beta prior for $\rho$ (with $R_1=3.2$ and $R_0=7.8$, $\rho$
1120: having prior mean 0.29).  Here I am assuming independent priors.  Note
1121: that this exercise in prior specification was totally \textit{ad hoc}.
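
As a quick check on what these values imply, standard properties of the
beta and lognormal distributions can be used (the symbols $m$ and $s$,
for the prior mean and standard deviation of $\ln{(\tau)}$, are
introduced here only for this calculation):
\begin{align*}
  \expn{(\rho)} &= \frac{R_1}{R_1+R_0} = \frac{3.2}{3.2+7.8} \approx 0.29 , \\
  \medn{(\tau)} &= \me^{m} = \me^{6.8} \approx 9.0 \times 10^{2} , \\
  \expn{(\tau)} &= \me^{m+s^2/2} = \me^{6.8+0.74^2/2} \approx 1.2 \times 10^{3} .
\end{align*}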
1122: 
1123: As can be seen in figures~\ref{fig:roc-no-error} and
1124: \ref{fig:roc-with-error}, the ROC of the test statistic
1125: $2\ln\mbox{BF}$ computed using these priors is hardly
1126: better than that using the original priors.  The power for
1127: tests of sizes $\alpha=0.05$ and $\alpha=0.01$ is not significantly
1128: different, based on 500 simulations.  This suggests that the
1129: performance of the BF as a test statistic, for these datasets, is
1130: quite robust to prior specification within the QTL model.
1131: 
1132: Because most of the computation in CPQ can be reused, computing the BF
1133: for a different prior took on average less than three minutes,
1134: compared with the 36 minutes required to compute the BF for the
1135: original prior.
1136: 
1137: \subsection{Estimation of QTL Position}
1138: \label{sec:estim-qtl-posit}
1139: Figures~\ref{fig:example} and \ref{fig:nexample} show analyses of four
1140: randomly chosen simulated datasets with QTLs ($g=4$).  These
1141: illustrate the fact that these datasets contain only weak information
1142: about the position of the QTL (or at least that the Bayesian method
1143: described here only extracts weak information).  It nonetheless seems
1144: worthwhile to examine how much information is present.
1145: 
1146: It has been suggested that the map position of the marker with the
1147: most significant single point test result (i.e.\ the minimum
1148: $p$-value) would be a ``good'' point estimate for the position of the
1149: QTL \citep{kaplanmorris2001a,kaplanmorris2001b}.  However, I point out
1150: that it is asymptotically inadmissible for a model very similar to the
1151: one assumed here.  This argument considers the limit of a QTL of small
1152: effect.  One can imagine models where the position of the marker with
1153: the minimum $p$-value, \muminp, will tend to become uniformly
1154: distributed on $(0,1)$, independent of the true value of $\mu$, as the
1155: effect of the QTL tends to zero.  The estimator \muminp{} then has
1156: expected loss $\frac{1}{2}-\mu(1-\mu)$ under absolute error loss and
1157: $\frac{1}{3}-\mu(1-\mu)$ under squared error loss.  The estimator
1158: $\hat{\mu}=\frac{1}{2}$ has uniformly lower expected loss,
1159: $|\frac{1}{2}-\mu|$ under absolute error loss and
1160: $(\frac{1}{2}-\mu)^2$ under squared error loss.  This argument does
1161: not technically apply for the model simulated here because SNPs
1162: (including the QTL) tend to be concentrated in regions where the
1163: genealogy is deepest, so even completely ignoring the genotype data,
1164: the position of any SNP is informative about the positions of all
1165: other SNPs including the QTL.  It does however suggest that better
1166: point estimates may be found, and suggests what their asymptotic
1167: behaviour ought to be, at least approximately.
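
The expected losses of the two estimators can be derived from
elementary integrals.  If \muminp{} is uniformly distributed on $(0,1)$
independently of $\mu$, then
\begin{align*}
  \mathrm{E}\,|\muminp-\mu| &= \int_0^1 |u-\mu| \,\dif u
  = \frac{\mu^2+(1-\mu)^2}{2} = \frac{1}{2}-\mu(1-\mu) , \\
  \mathrm{E}\,(\muminp-\mu)^2 &= \int_0^1 (u-\mu)^2 \,\dif u
  = \frac{\mu^3+(1-\mu)^3}{3} = \frac{1}{3}-\mu(1-\mu) .
\end{align*}
Subtracting the corresponding losses of the constant estimator
$\hat{\mu}=\frac{1}{2}$ leaves $\min{(\mu,1-\mu)}^2 \geq 0$ under
absolute error loss and $\frac{1}{12}$ under squared error loss, so
the constant estimator is never worse for any value of $\mu$ in this
limit.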
1168: 
1169: The performance of different methods for estimating the position of
1170: the QTL was assessed using the 500 simulations with $g=4$, for the two
1171: situations with and without errors in allele frequency estimation.
1172: Due to the nature of the simulations performed here, the errors
1173: reported are averaged over the distribution of the true value of
1174: $\mu$.  They are therefore not classical expected losses in the strict
1175: sense, but expected losses averaged with respect to a distribution of
1176: parameter values.  Bayesian point estimators have uniquely best
1177: performance when measured in this way \citep[ch.~5]{ohaganforster};
1178: the theory requires that the model and prior are both correct.  In
1179: particular, under squared error losses the average expected loss is
1180: minimised by the posterior mean, and under absolute error losses the
1181: average expected loss is minimised by the posterior median.  As shown
1182: in table~\ref{tab:point-exp-loss}, point estimators derived from the
1183: posterior calculated using the Bayesian method described here are
1184: superior to \muminp{}, the map location of the marker with the
1185: smallest $p$-value.  When allele frequencies in each pool are known
1186: exactly the Bayesian analysis produces a 21\% reduction in root average
1187: mean squared error and a 13\% reduction in average mean absolute
1188: error. When there are errors in allele frequency estimation the
1189: figures are similar, 18\% and 11\% respectively.  The nonparametric
1190: method developed previously by me \citep{johnson2005a} produces point
1191: estimates that are competitive with the Bayesian method under squared
1192: error losses.
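
In symbols, the standard decision-theoretic result referred to here
can be written as follows, with $a$ denoting a candidate point estimate
(a symbol introduced only for this statement):
\begin{align*}
  \expn{(\mu|\hat{\vec{y}})} &= \operatorname*{argmin}_{a} \,
  \mathrm{E}\left[ (\mu-a)^2 \,\middle|\, \hat{\vec{y}} \right] , \\
  \medn{(\mu|\hat{\vec{y}})} &= \operatorname*{argmin}_{a} \,
  \mathrm{E}\left[ |\mu-a| \,\middle|\, \hat{\vec{y}} \right] .
\end{align*}
Because each estimator minimises the posterior expected loss for every
dataset, it also minimises the loss averaged over datasets and over the
prior for $\mu$, provided the model and prior are correct.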
1193: 
1194: Figures~\ref{fig:coverage} and~\ref{fig:ncoverage} show results from the 500 datasets simulated with $g=4$.  The coverage
1195: of credibility intervals constructed from the marginal posterior for
1196: $\mu$ falls well below nominal levels.  This suggests strongly that
1197: the simple model used for the analyses is not a good approximation to
1198: the more realistic model the data were simulated under.  One way to
1199: improve the model would be to allow a more realistic model for the
1200: shape of the genealogy at the QTL.  To explicitly model this genealogy
1201: and hence the joint distribution of breakpoints between ancestral and
1202: nonancestral chromosome would require something like the MCMC sampler
1203: of \citet{rannalareeve2001} or \citet{morris2002}.  Although taking
1204: such an approach is highly desirable, it may not scale well to large
1205: datasets and it seems worthwhile to investigate approximations.
1206: 
1207: One approximation is the ``pairwise correction'' derived by
1208: \citet{mcpeek1999} and justified by them by the use of a quasi-score
1209: function, and used in a Bayesian context by \citet{morris2000}.
1210: Essentially, this involves flattening the likelihood function by
1211: raising all likelihoods to a power $w_n=(1+(n-1)c_n)^{-1}<1$.  Here
1212: $c_n$ is the pairwise correlation over sampled chromosomes of the
1213: conditional score function for the position of the QTL.  An expression
1214: for $c_n$ for a coalescent model is given in appendix D of
1215: \citet{mcpeek1999}.  The $n$ in this equation (which $c_n$ also
1216: depends on) is the number of chromosomes carrying the QTL.  It is not
1217: at all clear whether or how this correction should be applied in the
1218: present context, because (i) as noted by \citet{morris2000} the
1219: quasi-score justification used by \citet{mcpeek1999} does not apply in
1220: a Bayesian setting, (ii) in the present work the likelihood is never
1221: written as a conditional product across chromosomes carrying the QTL,
1222: (iii) it is not known how many chromosomes carry the QTL, and (iv) in
1223: my computational implementation no proper likelihoods are ever
1224: calculated, only likelihoods marginal to $(\pi_1,\ldots,\pi_L)$.
1225: However, the following \textit{ad hoc} approach does produce corrected
1226: credibility intervals that achieve coverage very close to their
1227: nominal levels.  The procedure is to first estimate $n$ by
1228: $\nd\mathrm{E}{(\rho)}$, the product of the number of case chromosomes
1229: and the posterior expectation of $\rho$, and then to flatten the
1230: marginal posterior for $\mu$ by raising it to a power $w_n$ and
1231: renormalising.  For the simulations performed here, $w_n$ had median
1232: $0.56$ and interquartile range 0.48--0.63.  When this procedure was
1233: applied, good agreement between nominal and achieved coverage was
1234: obtained (figures~\ref{fig:coverage} and \ref{fig:ncoverage}).  This
1235: suggests that the most serious misspecification of the current model
1236: is the assumption of a star shaped genealogy, rather than the
1237: assumption of linkage equilibrium in nonancestral blocks or the
1238: absence of the disease variant in the control pool.
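
Written out, the flattening procedure amounts to the following sketch,
in which $\hat{n}$ and $p_w$ are labels introduced here for the plug-in
estimate of $n$ and for the flattened posterior density:
\begin{equation*}
  \hat{n} = \nd\,\mathrm{E}{(\rho|\hat{\vec{y}})} , \qquad
  w_{\hat{n}} = \frac{1}{1+(\hat{n}-1)c_{\hat{n}}} , \qquad
  p_w(\mu|\hat{\vec{y}}) =
  \frac{p(\mu|\hat{\vec{y}})^{w_{\hat{n}}}}
       {\int_0^1 p(\mu'|\hat{\vec{y}})^{w_{\hat{n}}} \,\dif\mu'} .
\end{equation*}
Because $w_{\hat{n}}<1$, the ratio of posterior densities between any
two positions is pulled towards one, which widens the resulting
credibility intervals.  For example, the median value $w_n=0.56$
corresponds to $(n-1)c_n = 1/0.56-1 \approx 0.79$.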
1239: 
1240: 
1241: \subsection{Application to real data}
1242: \label{sec:appl-real-data}
1243: It is not really possible to examine the effectiveness of the Bayesian
1244: method described here on real data, due to a lack of relevant
1245: published datasets.  Primarily for the purpose of comparison with
1246: other fine scale mapping methods, I have applied it to the dataset of
1247: \citet{hosking2002}, and to quasi-synthetic datasets generated from
1248: that dataset.  \citet{hosking2002} collected data using individual
1249: genotyping.  In order to pretend that the data were acquired using DNA
1250: pooling, I use a hypergeometric error model to relate the observed
1251: counts with missing data to the underlying full data that were not
1252: observed.  This assumes the data are missing at random within and
1253: across SNPs.
1254: 
1255: To my knowledge, no fine scale mapping method has been published that
1256: does not perform well on the data of \citet{hosking2002}.  Therefore,
1257: observing that the present method performs acceptably, as shown in
1258: figure~\ref{fig:hosking}, is not necessarily encouraging.  To simulate a disease with a complex
1259: genetic basis, I generated three datasets by randomly relabelling
1260: controls as cases with probability 10\%, 20\% or 30\%.  As shown in
1261: figure~\ref{fig:hosking}, on all four datasets 95\% credibility
1262: intervals covered the true location of the CYP2D6 gene after the
1263: correction factor of \citet{mcpeek1999} had been applied to flatten
1264: the posterior. This provides weak evidence that the method developed
1265: here may be reliable for mapping QTLs from real data.
1266: 
1267: \section{Discussion}
1268: \label{sec:discussion}
1269: In this paper I have described and tested a Bayesian method for
1270: detecting and mapping a QTL, using multilocus data collected using DNA
1271: pooling within two trait groups.
1272: 
1273: Relatively recently, likelihood based fine scale mapping methods have
1274: been developed for genotype data that build on previous haplotype
1275: based analyses by treating the unobserved haplotypes as missing data
1276: and integrating over all possible haplotypes that are consistent with
1277: the observed genotypes.  This integration can be performed either
1278: using Markov chain Monte Carlo (MCMC)
1279: \citep{liu2001,reeverannala2002,morris2003} or using exact numerical
1280: methods for hidden Markov models \citep{zhangzhao2002}.  Data from DNA
1281: pools are estimated counts of alleles at each locus with no phase
1282: information.  Fine scale mapping from genotype data and from DNA pools
1283: can in theory be regarded as closely related missing data problems.
1284: 
1285: The approach taken in this paper combines elements of the approaches
1286: of \citet{zhangzhao2002} and of \citet{morris2000} and
1287: \citet{liu2001}.  Like \citet{zhangzhao2002}, I use a model that is
1288: sufficiently simple that I can use hidden Markov model (HMM) methods
1289: to sum over all possible haplotypes that are consistent with the
1290: observed data.  However, after computing the likelihood using a
1291: propagation algorithm, \citet{zhangzhao2002} then maximise that
1292: likelihood with respect to the remaining model parameters.  In
1293: contrast and like \citet{morris2000} and \citet{liu2001}, I embed the
1294: HMM within a fully Bayesian approach and compute posterior probability
1295: distributions for the quantities of interest.
1296: 
1297: One advantage of a Bayesian approach is that probability statements
1298: can be made directly about quantities of interest.  For example, we
1299: can state the probability that there is a QTL in any given region,
1300: including the whole region under study.  Thus, mapping and detecting a
1301: QTL are intimately related aspects of the same analysis.  They are
1302: different inferences that are made from the same posterior probability
1303: distribution.  Within the Bayesian framework there is no need to
1304: choose between a bewildering array of estimators, test statistics and
1305: methods for correcting for multiple testing; the approach has a
1306: pleasing simplicity, at least conceptually.
1307: 
1308: However, the probabilities computed in a Bayesian analysis are only
1309: meaningful if the model and prior are realistic.  The Catch-22 is that
1310: in order to compute Bayesian posterior probabilities, I had to assume a
1311: model that was worryingly oversimplified and not very believable.  The
1312: present work is therefore best regarded as a step towards Bayesian
1313: analysis of data collected using DNA pooling.  It may be helpful to
1314: draw parallels with methods for analysis of genotype data (collected
1315: using individual typing).  Sadly, the present method allows us to make
1316: inferences assuming a model less elaborate than the one of
1317: \citet{morris2000}, whereas we might aspire to being able to assume a
1318: model like the one of \citet{morris2002} or
1319: \citet{zollnerpritchard2005}.  However, Bayesian analysis of such
1320: realistic models has required Markov chain Monte Carlo (MCMC) to
1321: integrate over high dimensional spaces of auxiliary variables or
1322: missing data.  Such computationally intensive approaches may have
1323: difficulty handling large datasets.  In contrast, the method described
1324: here is relatively fast, and large datasets could be analysed with
1325: realistic computational resources.  For example, 27 processor-days
1326: would be required to analyse data from a whole genome scan with 100
1327: cases, 500,000 SNPs, and evaluation of the posterior at points 50kb
1328: apart.  In contrast, \citet{zollnerpritchard2005} estimate that their
1329: MCMC based procedure for data from individual typings would take 85
1330: processor-years for the same scale of analysis.  A further advantage
1331: of avoiding Monte Carlo methods is that the large numbers of analyses
1332: needed for a sliding window analysis, or a permutation test, can be
1333: performed without needing human intervention to adjust mixing
1334: parameters or monitor convergence.  Finally and perhaps most
1335: significantly, I am able to compute a Bayes factor (BF) to compare
1336: models in which there is, and is not, a disease QTL in the whole
1337: region of interest.  To my knowledge, no association mapping method
1338: using genotype data is able to do this, although \citet{patterson2004}
1339: are able to compute a BF for \emph{admixture} mapping using
1340: genotype data.
1341: 
1342: There is a Bayesian justification for the present method. (``This is
1343: the best model for which a Bayesian analysis of data from DNA pools is
1344: currently possible.'')  However, serious concerns about model
1345: inadequacy (``Well, that model simply isn't good enough!'') mean that,
1346: in this paper, I have mostly focussed on the classical frequentist
1347: justification.  Using simulations assuming a more realistic model, I
1348: have shown that the present method is superior to classical
1349: single point methods of analysis.  Single point methods are the most
1350: obvious way to analyse data collected using DNA pooling, although
1351: composite likelihood methods
1352: \citep{terwilliger1995,xiong1997,collins1998,maniatis2004,maniatis2005}
1353: could also be used.  The simulation results demonstrate that the BF
1354: computed using the present method makes a more powerful test for
1355: the presence of a QTL than the minimum $p$-value from single point
1356: tests, that the posterior density for the position of the QTL leads to
1357: a better point estimator than the position of the marker with the
1358: minimum $p$-value, and that well calibrated credibility intervals can
1359: be derived from the posterior density for the position of the QTL,
1360: after applying the correction of \citet{mcpeek1999}.
1361: 
1362: The performance of composite likelihood (CL) methods was not examined
1363: here.  This was because no CL method has been developed that allows
1364: errors in allele frequency estimation, and because, to my knowledge,
1365: no CL method assumes a model that is obviously more realistic than the
1366: model assumed by the present method.  In particular, all CL methods
1367: implicitly assume linkage equilibrium in non-ancestral blocks of
1368: chromosome.  In the notation of the present paper, CL methods assume
1369: that the number of chromosomes carrying ancestral haplotype at the
1370: $i$-th SNP, $x_i$, is conditionally independent across SNPs.  Even a
1371: poor model that does capture some aspect of the dependence across
1372: SNPs, such as the star shaped genealogy assumed here, seems
1373: preferable.  To my knowledge, there is no CL method that produces well
1374: calibrated confidence or credibility intervals.  Perhaps because of
1375: this, \citet{maniatis2005} state that ``[t]he main objective in
1376: positional cloning is to estimate the kb location of a causal SNP as
1377: accurately as possible, with its support interval an important but
1378: secondary objective.''  However, it seems to me that we should focus
1379: on methods for computing well calibrated credibility intervals, and
1380: ideally a well calibrated posterior density.  The acid test is to ask
1381: whether a statistical method informs us about what is a good action or
1382: decision to be taken subsequently.  A point estimate for QTL position,
1383: without a reliable measure of precision, is not very helpful for
1384: planning future experiments to further refine the position of that
1385: QTL.
1386: 
1387: One of the more surprising results is that, in the simulations
1388: performed here, the nonparametric likelihood ratio (NLR) test derived
1389: from the method proposed previously by me \citep{johnson2005a} is
1390: basically as powerful as the BF for detecting a QTL.  This is
1391: surprising because there is no theoretical basis for the NLR test
1392: statistic, but a strong theoretical basis for the BF test statistic.
1393: Since the NLR can be computed much more quickly, both in absolute and
1394: complexity terms, its performance in simulations over a wider range of
1395: parameters will be examined in a subsequent paper.
1396: 
1397: Given that the NLR performs as well as the BF for detecting a QTL, but
1398: that the BF is much more expensive to compute, one might reasonably
1399: ask what are the benefits of the Bayesian method described here.
1400: Firstly, the BF has a Bayesian interpretation, and since $2\ln\mbox{BF}$
1401: can be negative it can indicate Bayesian sense evidence in favour of there
1402: being no QTL.  The NLR test statistic can never be negative, has no
1403: direct Bayesian interpretation, and is not a good approximation to the
1404: BF.  Secondly, the posterior median from the Bayesian method provides
1405: superior point estimates under absolute error losses.  Thirdly, the
1406: Bayesian method produces well calibrated credibility intervals, but
1407: the profile likelihood method I proposed previously does not
1408: \citep[see figure 4 of][]{johnson2005a}.  Finally, the unconditional
1409: coverage frequencies of credibility intervals say nothing about the
1410: conditional or Bayesian sense performance of a method.  For multistage
1411: QTL mapping experiments we should probably guide our choice of where
1412: to type further markers using the typically complex, heavy tailed and
1413: often multimodal posterior distributions computed using the Bayesian
1414: method described here, as exemplified in figure~\ref{fig:nexample}.
1415: If analysing data from a whole genome scan, I would recommend a
1416: multistage analysis that first uses the NLR statistic to identify
1417: regions of interest, and then uses the CPQ algorithm to compute Bayes
1418: factors and posterior distributions for QTL position within those
1419: regions.
1420: 
1421: 
1422: Given the large number of simplifications made in specifying the model
1423: used here, one might wonder why the method works at all.  The three
1424: most obviously inadequate approximations are the star shaped
1425: genealogy, the absence of the disease allele in the control pool,
1426: and the assumption of linkage equilibrium in non-ancestral
1427: blocks of chromosome.  I will briefly discuss these inadequacies in turn.
1428: 
1429: Figures~\ref{fig:coverage} and \ref{fig:ncoverage} show that
1430: credibility intervals only achieve prescribed coverage levels when a
1431: correction is made for the genealogy not in fact being star shaped.
1432: This suggests a serious inadequacy of the model.  This is further
1433: supported by the observation of the very similar ROC curves in
1434: figures~\ref{fig:roc-no-error} and \ref{fig:roc-with-error} for the
1435: theoretically optimal BF (assuming a star shaped genealogy), and the
1436: NLR statistic that has no theoretical basis \citep[but allows any
1437: shape genealogy;][]{johnson2005a}.  Addressing this inadequacy is
1438: likely to lead to greater power to detect a QTL, and perhaps narrower
1439: credibility intervals at a given level.  However, it will be hard to
1440: achieve without imposing a substantial computational burden.  In
1441: particular, it may become difficult to compute the BF test statistic
1442: if MCMC is used to integrate over genealogies at the QTL.
1443: 
1444: Although it is conceptually straightforward to allow blocks of
1445: ancestral chromosome in the control pool, this would increase the
1446: number of hidden states at each SNP from $(\nd+1)$ to
1447: $(\nd+1)(\nc+1)$.  Since the propagation algorithm
1448: (section~\ref{sec:hidden-markov-model}) requires time that is
1449: quadratic in the number of hidden states, the analysis would be
1450: intractable using the current approach.  As an alternative, any number of
1451: separate pools could be treated as conditionally independent HMMs, but
1452: then we would have to integrate over the high dimensional space of
1453: allele frequencies and ancestral haplotypes using MCMC or importance
1454: sampling (see below).
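
To give a rough sense of scale (the pool size in this example is
hypothetical and chosen only for illustration), the cost per SNP of a
propagation step that is quadratic in the number of hidden states would
grow from order $(\nd+1)^2$ to order
\begin{equation*}
  \left[ (\nd+1)(\nc+1) \right]^2 = (\nd+1)^2 (\nc+1)^2 ,
\end{equation*}
an increase by a factor of $(\nc+1)^2$.  With $\nc=200$ control
chromosomes, for instance, this factor would be $201^2 \approx 4 \times
10^4$.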
1455: 
1456: It is possible that the current model adapts to fit there being blocks
1457: of ancestral chromosome in the control pool, by appropriate adjustment
1458: of the allele frequency parameters.  Ancestral blocks that are
1459: explicitly modelled in the disease pool would then represent
1460: additional blocks beyond what would be expected according to the
1461: adjusted allele frequency parameters.  If this was so, the parameter
1462: $\rho$ might be best interpreted as representing the rate of
1463: \emph{excess} disease alleles in the disease pool.
1464: 
1465: Since only marginal observations are available, the assumption of
1466: linkage equilibrium may be relatively innocuous.  Since there is
1467: virtually no information in the data about linkage disequilibrium,
1468: introducing parameters describing linkage disequilibrium into the
1469: model might have little effect on inferences about the quantities of
1470: interest.  It is possible to retain the present framework where all
1471: the data are modelled as a single HMM, but to include pairwise linkage
1472: disequilibrium by allowing allelic state along each chromosome to be a
1473: first order Markov chain \citep[see e.g.][]{liu2001,morris2002}.  This
1474: will be quite computationally expensive, but could be examined in the
1475: future.
1476: 
1477: For the parameters chosen for the simulations performed here, the
1478: benefits of the present Bayesian method are somewhat modest.  It
1479: remains unclear whether there would be larger benefits for other
1480: values of the simulation parameters, in particular more SNPs in the
1481: dataset, and/or larger benefits from a Bayesian analysis with a more
1482: realistic model.  Clarification of both points awaits access to
1483: substantial computational resources.  It is worth commenting that
1484: many of the variables in the present model also feature in more
1485: elaborate models, and therefore the present approach could be used to
1486: generate (for example) a joint importance sampling distribution for
1487: the ancestral haplotype, allele frequencies, and age and position of
1488: the QTL.
1489: 
1490: Even the simulated datasets studied here were generated under a model
1491: that lacks realism in several respects.  For example, in simulating
1492: errors in allele frequency estimation I have ignored differential
1493: amplification of the two alleles, which may cause estimates of allele
1494: frequencies obtained using DNA pools to be biased.  This manifests
1495: itself as only a second order effect on the difference in allele
1496: frequency between case and control pools \citep{visscher2003}.
1497: Differential amplification can be accommodated easily in the present
1498: method of analysis, for example by making \ey{d}{i} a vector
1499: consisting of data from the pool and also from heterozygous
1500: individuals or pools of known composition.  Even if no data from
1501: heterozygotes is available, it is possible to compute a
1502: $\Pr{(\hat{y}|y)}$ by integrating over a distribution of differential
1503: amplification constants, as in the approach of \citet{moskvina2005}.
1504: 
1505: One feature of the posteriors calculated using the present method (and
1506: especially after \citet{mcpeek1999} flattening) is that they are very
1507: heavy tailed, and so large credibility intervals (99\%, 99.9\%) tend
1508: to be very wide, perhaps almost as wide as credibility intervals
1509: computed from the prior!  This suggests that, if a series of fine
1510: scale mapping experiments were conducted using DNA pooling, we would
1511: not be making radical reductions in the size of the region under study
1512: at each stage, but rather would be increasing the density of markers
1513: in some regions more than others after each stage of analysis.
1514: 
1515: 
1516: 
1517: 
1518: 
1519: 
1520: 
1521: 
1522: 
1523: 
1524: 
1525: 
1526: \section*{Software}
1527: A software package implementing the methods described here is
1528: available from the web site
1529: \texttt{http://homepages.ed.ac.uk/tobyj/software/}~.  Source code is
1530: available and the package can be distributed freely under the terms of
1531: the GNU general public licence \citep{fsf1991}.
1532: 
1533: \bibliography{tobyrefs}
1534: 
1535: \clearpage
1536: \begin{deluxetable}{l p{0.8\textwidth}}
1537:   \tablecaption{Frequently used notations.\label{tab:notations-used}}
1538:   \tablehead{
1539:     \colhead{Symbol} & \colhead{Meaning}}
1540:   \startdata
1541:   $a_i$ & Allele present (0 or 1) on ancestral haplotype at $i$-th SNP \\
1542:   $b$,$b'$ & Backwards variables, see (\ref{eq:bvdefright}) and (\ref{eq:bvdefleft}) \\
1543:   $\betadist{(\alpha,\beta)}$ & Beta distribution with parameters $\alpha$ and $\beta$ \\
1544:   $\bindist{(n,p)}$ & Binomial distribution with parameters $n$ and $p$ \\
1545:   $\bindens{(x,n,p)}$ & Probability of drawing $x$ from a binomial distribution with parameters $n$ and $p$ \\
1546:   $\mbox{BF}$ & Bayes factor \\
1547:   CPQ & Cartesian product quadrature \\
1548:   $d_\mu$ & Number of design points used for $\mu$ in quadrature algorithm \\
1549:   $e_i$ & Precision of assay used to genotype $i$-th SNP \\
1550:   $\expdist{(\lambda)}$ & Exponential distribution with rate parameter $\lambda$ (mean $1/\lambda$) \\
1551:   $g$ & Genotype relative risk; factor by which disease allele
1552:   increases penetrance or risk \\
1553:   $\gamdist{(\alpha,\beta)}$ & Gamma distribution with shape parameter $\alpha$ and scale parameter $\beta$ \\
1554:   HMM & Hidden Markov model \\
1555:   $i$ & Index of SNP, $i=1,\ldots,L$ \\
1556:   i.b.d. & Identical by descent \\
1557:   $L$ & Number of SNPs \\
1558:   $m_i$ & Map position of the $i$-th marker \\
1559:   MCMC & Markov chain Monte Carlo \\
1560:   $\nc$, $\nd$ & Number of chromosomes in control and case pools respectively \\
1561:   $\ndist{(\mu,\sigma^2)}$ & Normal distribution with mean $\mu$ and variance $\sigma^2$ \\
1562:   $\mbox{NLR}$ & Nonparametric likelihood ratio, see (\ref{eq:proftestdef}) \\
1563:   $\pmin$ & Smallest $p$-value out of $L$ tests in single point analysis\\
1564:   $P_{i,a}$ & Prior parameter: $\p{i}{a}\sim\betadist{(P_{i,a},P_{i,1-a})}$ \\
1565:   $r$ & Number of experimental replicates used to estimate $\Delta\widehat{C_t}$\\
1566:   $R$ & Prior parameter: $\rho\sim\betadist{(R_1,R_0)}$ \\
1567:   ROC & Receiver operating characteristic \\
1568:   $T$ & Prior parameter: $\tau\sim\expdist{(T)}$ \\
1569:   $x_i$ & Number of chromosomes in case pool carrying ancestral
1570:   i.b.d.\ haplotype at $i$-th SNP \\
1571:   $x_\mu$ & Number of chromosomes in case pool carrying ancestral i.b.d.\ haplotype at position of QTL \\
1572:   \ya{c}{i}{a} & True count of allele $a$ at $i$-th SNP in control pool \\
1573:   \ya{d}{i}{a} & True count of allele $a$ at $i$-th SNP in case pool \\
1574:   \eya{c}{i}{a} & Estimated count of allele $a$ at $i$-th SNP in control pool \\
1575:   \eya{d}{i}{a} & Estimated count of allele $a$ at $i$-th SNP in case pool \\
1576:   $\hat{y}$ & Shorthand for \eya{c}{i}{1} or \eya{d}{i}{1} for some $i$ \\
1577:   $\hat{\vec{y}}$ & All the data \\
1578:   $\alpha$ & Nominal size (rate of type I error) of a test \\
1579:   $\Delta\widehat{C_t}$ & Estimated lag between PCR growth curves used to type DNA pool\\
1580:   $\gamma_G$ & Penetrance (risk of disease) for genotype $G$ at the QTL\\
1581:   $\mu$ & Map position of the disease locus \\
1582:   $\muminp$ & Map position of SNP with smallest $p$-value in single
1583:   point analysis \\
1584:   $\p{i}{1}$ & Expected frequency of allele 1 at $i$-th SNP in
1585:   non-ancestral chromosome \\
1586:   $\rho$ & Expected frequency of disease allele in case pool \\
1587:   $\sigma$ & Standard deviation of experimental error in estimation of
1588:   $\Delta\widehat{C_t}$ \\
1589:   $\tau$ & Age of the disease allele \\
1590:   \enddata
1591: \end{deluxetable}
1592: 
1593: \begin{deluxetable}{l l l l l l}
1594:   \tablecaption{Performance of tests to detect a disease QTL, when allele
1595:     frequencies in each pool are known exactly.
1596:     \label{tab:power-no-error}}
1597:   \tablehead{
1598:     \colhead{Statistic} & \colhead{Method} & \colhead{Nominal size} &
1599:     \colhead{Critical value} & \colhead{True size} & 
1600:     \colhead{Power} }
1601:   \startdata
1602:   $2\ln{\mbox{BF}}$&Arbitrary&&0&0.080&0.870\\
1603:   &&&&(0.058, 0.107)&(0.837, 0.898)\\
1604:   $2\ln{\mbox{BF}}$&Arbitrary&0.05\tablenotemark{a}&5.889&0.010&0.710\\
1605:   &&&&(0.003, 0.023)&(0.668, 0.749)\\
1606:   $p_{\min}\times L$&Bonferroni&0.05&0.05&0.040&0.720\\
1607:   &&&&(0.025, 0.061)&(0.678, 0.759)\\
1608:   $2\ln{\mbox{BF}}$&Simulation&0.05&0.903&0.050&0.842\\
1609:   &&&&(0.033, 0.073)&(0.807, 0.873)\\
1610:   $p_{\min}\times L$&Simulation&0.05&0.063&0.050&0.740\\
1611:   &&&&(0.033, 0.073)&(0.699, 0.778)\\
1612:   $2\ln{\mbox{BF}}$&Arbitrary&0.01\tablenotemark{a}&9.19&0.002&0.628\\
1613:   &&&&(0.000, 0.011)&(0.584, 0.670)\\
1614:   $p_{\min}\times L$&Bonferroni&0.01&0.010&0.000&0.582\\
1615:   &&&&(0.000, 0.006)&(0.537, 0.626)\\
1616:   $2\ln{\mbox{BF}}$&Simulation&0.01&5.864&0.010&0.710\\
1617:   &&&&(0.003, 0.023)&(0.668, 0.749)\\
1618:   $p_{\min}\times L$&Simulation&0.01&0.027&0.010&0.664\\
1619:   &&&&(0.003, 0.023)&(0.621, 0.705)\\
1620:   \enddata
1621:   \tablenotetext{a}{Not a nominal size in the classical sense, but a nominal
1622:     upper bound on the error rate in the Bayesian sense.}
1623: \end{deluxetable}
1624: 
1625: \begin{deluxetable}{l l l l l l}
1626:   \tablecaption{Performance of tests to detect a disease QTL, when there
1627:     are errors in allele frequency estimation with $r=2$ replicates and
1628:     $\sigma=0.2\mbox{ PCR cycles}$.
1629:     \label{tab:power-error}}
1630:   \tablehead{
1631:     \colhead{Statistic} & \colhead{Method} & \colhead{Nominal size} &
1632:     \colhead{Critical value} & \colhead{True size} & 
1633:     \colhead{Power} }
1634:   \startdata
1635:   $2\ln{\mbox{BF}}$&Arbitrary&&0&0.080&0.782\\
1636:   &&&&(0.058, 0.107)&(0.743, 0.817)\\
1637:   $2\ln{\mbox{BF}}$&Arbitrary&0.05\tablenotemark{a}&5.889&0.008&0.532\\
1638:   &&&&(0.002, 0.020)&(0.487, 0.576)\\
1639:   $p_{\min}\times L$&Bonferroni&0.05&0.05&0.040&0.560\\
1640:   &&&&(0.025, 0.061)&(0.515, 0.604)\\
1641:   $2\ln{\mbox{BF}}$&Simulation&0.05&0.723&0.050&0.746\\
1642:   &&&&(0.033, 0.073)&(0.705, 0.784)\\
1643:   $p_{\min}\times L$&Simulation&0.05&0.085&0.050&0.642\\
1644:   &&&&(0.033, 0.073)&(0.598, 0.684)\\
1645:   $2\ln{\mbox{BF}}$&Arbitrary&0.01\tablenotemark{a}&9.19&0.000&0.424\\
1646:   &&&&(0.000, 0.006)&(0.380, 0.469)\\
1647:   $p_{\min}\times L$&Bonferroni&0.01&0.01&0.010&0.432\\
1648:   &&&&(0.003, 0.023)&(0.388, 0.477)\\
1649:   $2\ln{\mbox{BF}}$&Simulation&0.01&4.28&0.010&0.596\\
1650:   &&&&(0.003, 0.023)&(0.552, 0.639)\\
1651:   $p_{\min}\times L$&Simulation&0.01&0.01&0.010&0.432\\
1652:   &&&&(0.003, 0.023)&(0.388, 0.477)\\
1653:   \enddata
1654:   \tablenotetext{a}{Not a nominal size in the classical sense, but a nominal
1655:     upper bound on the error rate in the Bayesian sense.}
1656: \end{deluxetable}
1657: 
1658: \begin{deluxetable}{r r r}
1659:   \tablecaption{Performance of point estimators of QTL position.\label{tab:point-exp-loss}}
1660:   \tablehead{
1661:     Estimator&\multicolumn{2}{c}{Average
1662:       expected loss under}\\\cline{2-3}
1663:     &squared error losses&absolute error losses}
1664:   \startdata
1665:   \cutinhead{Allele frequencies known exactly}
1666:   \muminp&0.208$\,{}^2$&0.120\\
1667:   $\expn{(\mu|\hat{\vec{y}})}$&0.165$\,{}^2$&0.107\\
1668:   $\medn{(\mu|\hat{\vec{y}})}$&0.166$\,{}^2$&0.105\\
1669:   NP method&0.165$\,{}^2$&0.112\\
1670:   \cutinhead{Errors in allele frequency estimation}
1671:   \muminp&0.239$\,{}^2$&0.146\\
1672:   $\expn{(\mu|\hat{\vec{y}})}$&0.195$\,{}^2$&0.132\\
1673:   $\medn{(\mu|\hat{\vec{y}})}$&0.203$\,{}^2$&0.130\\
1674:   NP method&0.198$\,{}^2$&0.136\\
1675:   \enddata
1676: \end{deluxetable}
1677: 
1678: \clearpage
1679: \listoffigures
1680: 
1681: \begin{figure}[p]
1682:   \begin{center}
1683:     \begin{picture}(0,0)%
1684:       \includegraphics{hier}%
1685:     \end{picture}%
1686:     \setlength{\unitlength}{4144sp}%
1687:     % 
1688:     \begingroup\makeatletter\ifx\SetFigFont\undefined%
1689:     \gdef\SetFigFont#1#2#3#4#5{%
1690:       \reset@font\fontsize{#1}{#2pt}%
1691:       \fontfamily{#3}\fontseries{#4}\fontshape{#5}%
1692:       \selectfont}%
1693:     \fi\endgroup%
1694:     \begin{picture}(5697,3186)(-3164,-1918)
1695:       \put(-2429,-961){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\vec{x}=(x_1,x_2,\ldots,x_L)$}%
1696:             }}}}
1697:       \put(-2294,119){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\rho$}%
1698:             }}}}
1699:       \put(-1664,119){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\mu$}%
1700:             }}}}
1701:       \put(-1034,119){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\tau$}%
1702:             }}}}
1703:       \put(-2339,-421){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$x_\mu$}%
1704:             }}}}
1705:       \put(-1799,659){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{prior}%
1706:             }}}}
1707:       \put(1261,119){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\pi_i$}%
1708:             }}}}
1709:       \put(1261,659){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$P_i$}%
1710:             }}}}
1711:       \put(631,-421){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$a_i$}%
1712:             }}}}
1713:       \put(  1,-421){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$x_i$}%
1714:             }}}}
1715:       \put(1261,-961){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$e_i$}%
1716:             }}}}
1717:       \put(-179,1109){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$i=1,2,\ldots,L$}%
1718:             }}}}
1719:       \put(586,-1501){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\hat{y}_{\mathrm{d},i}$}%
1720:             }}}}
1721:       \put(586,-961){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$y_{\mathrm{d},i}$}%
1722:             }}}}
1723:       \put(1846,-961){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$y_{\mathrm{c},i}$}%
1724:             }}}}
1725:       \put(1846,-1501){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\hat{y}_{\mathrm{c},i}$}%
1726:             }}}}
1727:       \put(-2339,659){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$R$}%
1728:             }}}}
1729:       \put(-1079,659){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$T$}%
1730:             }}}}
1731:       \put(-1709,-1501){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\vec{m}$}%
1732:             }}}}
1733:       \put(-3149,659){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{prior:}%
1734:             }}}}
1735:       \put(-3149,-1501){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{data:}%
1736:             }}}}
1737:     \end{picture}%
1738:   \end{center}
1739:   \caption{Hierarchical or Bayesian network structure of the model.  The region inside the
1740:     rectangle is duplicated for each of $L$ SNPs.  Lines indicate the
1741:     dependence structure of the model: variables not connected are
1742:     independent, conditional on all other variables in the model.}
1743:   \label{fig:factorisemodel}
1744: \end{figure}
1745: 
1746: \begin{figure}[p]
1747:   \begin{center}
1748:     \includegraphics{roc1}
1749:   \end{center}
1750:   \caption{Sensitivity vs.\ specificity for $2\ln\mbox{BF}$ (solid
1751:     line) and $\pmin\times L$ (dashed line), when allele frequencies
1752:     in each pool are known exactly.  The dotted line shows the
1753:     performance of $2\ln\mbox{BF}$ computed using the priors obtained
1754:     from figure~\ref{fig:new-prior}, and the dot-dashed line shows the
1755:     performance of a nonparametric likelihood ratio test
1756:     statistic \citep[; see text]{johnson2005a}.}
1757:   \label{fig:roc-no-error}
1758: \end{figure}
1759: 
1760: \begin{figure}[p]
1761:   \begin{center}
1762:     \includegraphics{roc2}
1763:   \end{center}
1764:   \caption{Sensitivity vs.\ specificity for $2\ln\mbox{BF}$ (solid
1765:     line) and $\pmin\times L$ (dashed line), when there are errors in
1766:     allele frequency estimation with $r=2$ replicates and
1767:     $\sigma=0.2\mbox{ PCR cycles}$.  The dotted line shows the
1768:     performance of $2\ln\mbox{BF}$ computed using the priors obtained
1769:     from figure~\ref{fig:new-prior}, and the dot-dashed line shows the
1770:     performance of a nonparametric likelihood ratio test statistic
1771:     \citep[; see text]{johnson2005a}.}
1772:   \label{fig:roc-with-error}
1773: \end{figure}
1774: 
1775: \begin{figure}[p]
1776:   \begin{center}
1777:     \includegraphics{teststats2}
1778:   \end{center}
1779:   \caption{Sampling distribution of test statistics $2\ln\mbox{BF}$
1780:     (top) and $\pmin\times L$ (bottom, on log scale) under the null model
1781:     ($g=1$), as functions of $L$, the number of SNPs in the simulated
1782:     data set.  The 0.95 and 0.99 quantiles are shown as solid lines.
1783:     The least squares linear regression is shown as a dotted line.
1784:     Results shown are for the situation where there are errors in
1785:     allele frequency estimation, but results are similar when allele
1786:     frequencies are known exactly.}
1787:   \label{fig:teststats}
1788: \end{figure}
1789: 
1790: \begin{figure}[p]
1791:   \begin{center}
1792:     \includegraphics{newprior}
1793:   \end{center}
1794:   \caption{Original priors (dotted lines) and distribution of
1795:     posterior expectations (solid lines) for the three parameters of
1796:     the approximate model.  This suggests more accurately specified
1797:     priors (dashed lines) as described in the text.}
1798:   \label{fig:new-prior}
1799: \end{figure}
1800: 
1801: \begin{figure}[p]
1802:   \begin{center}
1803:     \includegraphics{exampleB}
1804:   \end{center}
1805:   \caption{Example simulated datasets with $g=4$, where allele
1806:     frequencies are known exactly.  Points are $-\log_{10}{p}$ for
1807:     single point $\chi^2$ tests.  Dotted lines are posterior density
1808:     and solid lines are posterior density with McPeek--Strahs
1809:     correction.  Vertical dashed lines show position of disease QTL.}
1810:   \label{fig:example}
1811: \end{figure}
1812: 
1813: \begin{figure}[p]
1814:   \begin{center}
1815:     \includegraphics{nexampleB}
1816:   \end{center}
1817:   \caption{The same simulated datasets as shown in
1818:     figure~\ref{fig:example}, but with errors in allele frequency
1819:     estimation with $\sigma=0.2$ PCR cycles and $r=2$ experimental
1820:     replicates.  Points are $-\log_{10}{p}$ for single point shrunk
1821:     \citep{visscher2003} $\chi^2$ tests.  Dotted lines are posterior
1822:     density and solid lines are posterior density with McPeek--Strahs
1823:     correction.  Vertical dashed lines show position of disease QTL.}
1824:   \label{fig:nexample}
1825: \end{figure}
1826: 
1827: \begin{figure}[p]
1828:   \begin{center}
1829:     \includegraphics{coverage1}
1830:   \end{center}
1831:   \caption{Nominal and achieved coverage of credibility intervals for
1832:     position of QTL, when allele frequencies are known exactly.
1833:     Credibility intervals were constructed either without (dotted
1834:     line) or with (solid line) the approximate correction factor of
1835:     \citet{mcpeek1999}.}
1836:   \label{fig:coverage}
1837: \end{figure}
1838: 
1839: \begin{figure}[p]
1840:   \begin{center}
1841:     \includegraphics{coverage2}
1842:   \end{center}
1843:   \caption{Nominal and achieved coverage of credibility intervals for
1844:     position of QTL, when there are errors in allele frequency
1845:     estimation, with $\sigma=0.2$ PCR cycles and $r=2$ replicates.  Credibility intervals
1846:     were constructed either without (dotted line) or with (solid line)
1847:     the approximate correction factor of \citet{mcpeek1999}.}
1848:   \label{fig:ncoverage}
1849: \end{figure}
1850: 
1851: \begin{figure}[p]
1852:   \begin{center}
1853:     \includegraphics{hosking_panel}
1854:   \end{center}
1855:   \caption{Analysis of data of \citet[; top panel]{hosking2002}, and
1856:     quasi-synthetic datasets generated by randomly relabelling
1857:     controls as cases with probability 10\%, 20\% or 30\% (lower three
1858:     panels, top to bottom). Points are $-\log_{10}{(p)}$ from single
1859:     point $\chi^2$ tests, and dashed and solid lines are the marginal
1860:     posterior for disease gene position, without and with the
1861:     correction factor of \citet{mcpeek1999}, respectively.  Vertical dashed lines
1862:     show the true position of CYP2D6.}
1863:   \label{fig:hosking}
1864: \end{figure}
1865: \end{document}
1866: 
1867: