0507:q-bio0507018/ms.tex

1: \documentclass[12pt,preprint]{aastex}

2:

3: \usepackage{natbib}

4: \bibpunct[]{(}{)}{,}{a}{}{,}

5:

6: \usepackage{amsmath}

7: \usepackage{amsfonts}

8: \usepackage{amsthm}

9:

10: \newcommand{\me}{\mathrm{e}}

11: \newcommand{\mi}{\mathrm{i}}

12: \newcommand{\dif}{\mathrm{d}}

13: \usepackage{bm}

14: \renewcommand{\vec}[1]{\bm{#1}}

15: \DeclareMathAlphabet{\mathsfsl}{OT1}{cmss}{m}{sl}

16: \newcommand{\tensor}[1]{\mathsfsl{#1}}

17:

18: \DeclareMathOperator*{\argmax}{argmax}

19:

20: \DeclareMathOperator{\expn}{Mean}

21: \DeclareMathOperator{\medn}{Median}

22:

23: \DeclareMathOperator{\podist}{Po}

24: \DeclareMathOperator{\tdist}{T}

25: \DeclareMathOperator{\ndist}{N}

26: \DeclareMathOperator{\betadist}{B}

27: \DeclareMathOperator{\expdist}{E}

28: \DeclareMathOperator{\gamdist}{G}

29: \DeclareMathOperator{\dirproc}{DP}

30:

31: \DeclareMathOperator{\unifdist}{U}

32: \DeclareMathOperator{\bindist}{Bin}

33:

34: \DeclareMathOperator{\bindens}{Bin}

35: \DeclareMathOperator{\betadens}{B}

36:

37: \newcommand{\nd}{\ensuremath{n_\mathrm{d}}}

38: \newcommand{\nc}{\ensuremath{n_\mathrm{c}}}

39: \newcommand{\eya}[3]{\ensuremath{\hat{y}_{\mathrm{#1},#2}{(#3)}}}

40: \newcommand{\ya}[3]{\ensuremath{y_{\mathrm{#1},#2}{(#3)}}}

41: \newcommand{\ey}[2]{\ensuremath{\hat{y}_{\mathrm{#1},#2}}}

42: \newcommand{\y}[2]{\ensuremath{y_{\mathrm{#1},#2}}}

43:

44: \newcommand{\p}[2]{\ensuremath{\pi_{#1}{(#2)}}}

45: \newcommand{\pmin}{\ensuremath{p_{\min}}}

46: \newcommand{\muminp}{\ensuremath{\hat{\mu}_{\min p}}}

47:

48: \newcommand{\ceil}[1]{\ensuremath{\lceil #1 \rceil}}

49: \newcommand{\floor}[1]{\ensuremath{\lfloor #1 \rfloor}}

50:

51:

52: \begin{document}

53: \bibliographystyle{toby}

54:

55: \title{Bayesian Method for Disease QTL Detection and Mapping, using a

56:   Case and Control Design and DNA Pooling}

57:

58: \author{Toby Johnson\altaffilmark{1}}

59: \affil{School of Biological Sciences, The University of Edinburgh}

60: \affil{West Mains Road, Edinburgh, EH9 3JT}

61: \email{toby.johnson@ed.ac.uk}

62:

63: \altaffiltext{1}{Jointly affiliated to Rothamsted Research and to The University of Edinburgh}

64:

65: \slugcomment{draft \today}

66:

67: \begin{abstract}

68:   This paper describes a Bayesian statistical method for determining

69:   the genetic basis of a complex genetic trait.  The method uses a

70:   sample of unrelated individuals classified into two groups, for

71:   example cases and controls.  Each group is assumed to have been

72:   genotyped at a battery of marker loci using a laboratory effort

73:   efficient technique called DNA pooling.  The aim is to detect and

74:   map a quantitative trait locus (QTL) that is \emph{not} one of the

75:   typed markers.  The method works by conducting an exact Bayesian

76:   analysis under a number of simplifying population genetic

77:   assumptions that are somewhat unrealistic.  Despite this, the method

78:   is shown to perform acceptably on datasets simulated under a more

79:   realistic model, and furthermore is shown to outperform classical

80:   single point methods.

81: \end{abstract}

82:

83: \section{Introduction}

84: For many traits of interest, including susceptibility to many genetic

85: diseases, the genetic basis is complex, meaning that many genes of

86: individually small effect (quantitative trait loci; QTLs) contribute.

87: For detecting and mapping such QTLs, association mapping studies that

88: use large samples of essentially unrelated individuals may have two

89: advantages over linkage studies that use pedigrees or families: for a

90: given sample size, association studies may have more power to detect a

91: QTL \citep[e.g.][]{rischmerikangas1996,risch2000}, and they may allow

92: the QTL to be fine mapped with greater resolution or precision

93: \citep[e.g.][]{terwilliger1995,mcpeek1999}.  One important

94: experimental design is a genome wide scan, in which many thousands of

95: markers covering the whole genome are typed \citep[see

96: e.g.][]{lander1996,rischmerikangas1996,kruglyak1999,carlson2003}.  The

97: aim may be to detect QTLs that are not candidate genes and that have

98: escaped detection in linkage studies.  After preliminary analysis,

99: attention may focus on relatively small chromosomal regions, each

100: containing perhaps only one QTL.  Because of

101: nongenetic factors and the effects of other genetically distant QTLs,

102: each focal QTL will explain only a small fraction of the variance in

103: phenotype.  If the trait is binary (such as presence or absence of a

104: disease), the difference in QTL allele frequencies between the two

105: trait groups will be small.  The large numbers of markers required for

106: a genome wide scan will need to have been typed in large numbers of

107: individuals to detect such a QTL, and to infer its position,

108: frequency, and effect on the trait.

109:

110: In order to reduce the cost of such a study, an experimental strategy

111: called DNA pooling has been proposed

112: \citep[e.g.][]{arnheim1985,barcellos1997,sham2002,norton2004}.  Here

113: DNA from individuals with similar phenotypes is physically mixed

114: together into a pool before genotyping.  After overheads to do with

115: construction of pools and assay development, the cost of genotyping an

116: entire pool at a given marker is reduced to the cost of genotyping a

117: single individual.  Thus costs may be reduced by a factor close to the

118: number of individuals in a pool (divided by the number of experimental

119: replicates used for each pool), when the overheads can be spread

120: across many markers and across many disease studies.
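
As a purely illustrative calculation (the numbers are hypothetical and
not taken from any study), write $n$ for the number of individuals in a
pool and $r$ for the number of experimental replicates per pool.
Ignoring overheads, individual genotyping costs $n$ assays per marker
and pooling costs $r$, so the cost reduction factor is approximately
\[
  \frac{n}{r}\;\mbox{,}\qquad\mbox{e.g.}\qquad\frac{500}{4}=125
\]
for a pool of 500 individuals assayed in 4 replicates.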

121:

122: In this paper I consider the simplest experimental design where there

123: are two trait groups.  These could have been classified by the

124: presence or absence of a binary trait such as a disease.  Alternatively,

125: if the trait is quantitative, individuals from each tail of the trait

126: distribution could make up the two groups, for example the upper and

127: lower 10\% tails \citep[e.g.][]{darvasisoller1994,bader2001}.  For

128: simplicity I will refer throughout the paper to the two trait groups

129: as cases and controls, and to one of the alleles at the QTL as the

130: disease allele.  I note that data collected from more than two groups

131: can be analysed using the method developed here, by discarding or

132: combining data from some of the groups.  (An extreme example is where

133: each pool contains two chromosomes, i.e.\ genotype data are

134: available.)  Such an approach is valid and may be useful, but will

135: almost certainly not make most efficient use of the available data,

136: since it will be based on only a marginal observation that is almost

137: certainly not sufficient.

138:

139: Compared with individual genotyping, a DNA pooling strategy incurs

140: three types of information loss and error.  At each marker only the

141: marginal counts of the alleles present within each pool are

142: available, and so there is (i) no information about deviations from

143: Hardy--Weinberg proportions within each marker within each pool

144: \citep{rischteng1998} and (ii) no phase or linkage

145: information across markers within each pool \citep{johnson2005a}.

146: Further (iii), the marginal counts are only estimated from some kind

147: of quantitative genotyping experiment \citep[e.g.][]{germer2000,lehellard2002},

148: rather than by counting individual genotyping experiments.

149:

150: The present approach deals with (iii) by allowing a quite general

151: class of models for errors in allele frequency estimation.  It

152: attempts to deal with (i) and (ii) by using the full likelihood given

153: the available data.  This likelihood does however assume a model that

154: makes simplifying assumptions that are not as realistic as one would

155: like.  This may be an acceptable price to pay, because it allows all

156: the necessary computations to be done analytically or using simple

157: numerical algorithms.  The aim is to develop a method that can be used

158: on very large data sets, with hundreds of cases and hundreds of

159: controls typed at hundreds of markers.

160:

161: The simple model used here is a special case of the model introduced

162: by \citet{mcpeek1999} and studied by \citet{morris2000},

163: \citet{zhangzhao2000,zhangzhao2002} and \citet{liu2001} for haplotype

164: and genotype data.  In brief summary, I assume a unique mutation

165: event (perhaps a single nucleotide polymorphism, or a deletion)

166: generated the disease allele at the disease QTL, that there is a star

167: shaped genealogy since that mutation, that the disease allele is

168: absent from the control group, and that Hardy--Weinberg and linkage

169: equilibrium apply in the control group.

170:

171: Avoiding the use of more complicated algorithms such as Markov chain

172: Monte Carlo (MCMC) means that there are no concerns about mixing and

173: convergence, and no difficulties with computing normalising constants

174: and Bayes factors for model testing and model comparison.  When the

175: model is correct, the Bayes factor is \citep[in various classical

176: senses, see][ ch.5]{ohaganforster} an optimal test statistic for

177: detecting a QTL \citep[see also][]{patterson2004}.  Therefore the

178: present approach can be viewed as choosing an approximate model that

179: is simple enough to allow an exact and optimal analysis.  It will

180: therefore complement alternative approaches that perform approximate

181: or sub-optimal analyses for more realistic models.

182:

183: One example of such an alternative approach is to perform a classical

184: single point analysis, which applies a separate test to the data for

185: each marker.  Such an approach can be made essentially free of any

186: population genetic model, meaning that no worrying assumptions are made

187: but also that there is no efficient framework for combining analyses

188: across many markers (e.g.\ to correct for multiple testing).  When

189: genotype data are available and when the QTL is \emph{not} one of the

190: typed markers, such single point methods are known to lack power

191: \citep[e.g.][]{risch2000,mott2000,zollnerpritchard2005}, and may

192: produce inefficient point estimates and inefficiently large region

193: estimates for the position of the QTL

194: \citep[e.g.][]{morris2002,morris2003}.

195:

196: One justification for the method developed here is that although the

197: model is simple, it is to my knowledge the most complex model for

198: which a Bayesian analysis has been implemented using data from DNA

199: pools.  To provide further justification and reassurance about the

200: simplifying assumptions made, I have tested the method on data

201: simulated under a more realistic full coalescent model.  The

202: (necessarily classical) measures of performance are encouraging in

203: three respects.  Firstly, the power to detect a QTL is substantially

204: higher than for classical single point analysis.  Secondly, point

205: estimates for the position of the QTL derived from the present method

206: outperform the simple procedure of choosing the map position of the

207: marker with the most significant single point test result \citep[the

208: ``minimum $p$-value method'',

209: e.g.][]{kaplanmorris2001a,kaplanmorris2001b}.  Thirdly, after

210: ``flattening'' the posterior using a factor \citep[derived

211: by][]{mcpeek1999} that corrects for the fact that the true genealogy

212: is not in fact star shaped, credibility intervals for the position of

213: the QTL cover its true position with frequency equal to their nominal

214: size.  That is, they are well calibrated.

215:

216: The structure of this paper is as follows.  In

217: section~\ref{sec:appr-model-prior} I review the model introduced by

218: \citet{mcpeek1999}, for which an exact Bayesian analysis can be

219: performed using multilocus estimated allele frequency data, collected

220: using DNA pools.  In section~\ref{sec:analysis} I describe in detail

221: how the computations for such an analysis are performed.  In

222: section~\ref{sec:simul-model} I review a more realistic coalescent

223: model that I have used to generate simulated data sets on which to

224: test the current approach.  In section~\ref{sec:results} I describe

225: the performance of the method developed in

226: sections~\ref{sec:appr-model-prior}--\ref{sec:analysis} on those

227: simulated data sets.  In section~\ref{sec:discussion} I summarise the

228: results and discuss how the method could be improved.

229:

230: \section{The Simplified Model and Prior}

231: \label{sec:appr-model-prior}

232: The main notations and abbreviations used are listed in

233: table~\ref{tab:notations-used}, and the conditional independencies of

234: the variables are represented in figure~\ref{fig:factorisemodel}.

235:

236: I assume the data (collectively denoted $\hat{\vec{y}}$) consist of

237: allele frequency estimates at $L$ single nucleotide polymorphisms

238: (SNPs), obtained from a single pool of \nd{} case chromosomes and a

239: single pool of \nc{} control chromosomes.  Let $m_i$ be the known map

240: position of the $i$-th SNP.  Let the alleles at each SNP be

241: arbitrarily labelled 0 and 1, with $\eya{d}{i}{1}\equiv

242: \nd-\eya{d}{i}{0}$ the estimated count of the 1 allele at the $i$-th

243: SNP in the cases, and $\eya{c}{i}{1}\equiv \nc-\eya{c}{i}{0}$ the

244: estimated count of the 1 allele at the $i$-th SNP in the controls.

245: When there is no ambiguity, $\hat{y}$ will mean the estimated count of

246: the 1 allele at some SNP in some pool that is to be inferred from the

247: context.  (This can easily be generalised to let $\ey{d}{i}$ be

248: arbitrary vector valued information about the counts of the two

249: alleles at the $i$-th SNP in the cases, etc.)  The method may be

250: generalised to allow more than two alleles at each marker.

251:

252: Let $\ya{d}{i}{1}$ and $\ya{c}{i}{1}$ be the true counts of the 1

253: allele at the $i$-th SNP in the cases and in the controls

254: respectively.  I assume that an error model

255: $\Pr{(\ey{d}{i},\ey{c}{i}|\y{d}{i},\y{c}{i})}$ has been chosen and

256: parameterised, for example from a calibration data set consisting of

257: pairs $(\hat{y},y)$ that were obtained from a given set of individuals

258: by genotyping them as a pool and by genotyping them individually.

259: Note that the error model cannot be factorised unless we assume that

260: the errors in the case and control pools are unconditionally

261: independent.  However, we may believe that the data were obtained

262: using assays that vary in precision across SNPs.  Letting $e_i$

263: indicate the unknown precision of the assay used for the $i$-th SNP,

264: we could model these beliefs as

265: \begin{equation}

266:   \Pr{(\ey{d}{i},\ey{c}{i}|\y{d}{i},\y{c}{i})}=\sum_{e_i}{\Pr{(\ey{d}{i}|\y{d}{i},e_i)}\Pr{(\ey{c}{i}|\y{c}{i},e_i)}\Pr{(e_i)}}\;\mbox{,}

267:   \label{eq:errorfactorise}

268: \end{equation}

269: that is, the error distribution is modelled as a mixture distribution

270: where each component of the mixture factorises.  In the following,

271: reasonable flexibility in the nature of this error model is allowed:

272: the errors must be independent across SNPs but they need not be

273: independent across pools and they do not need to be identically

274: distributed across SNPs.  An example error model is given in

275: section~\ref{sec:simul-model}.

276:

277: Let the unknown map position of the disease QTL relative to the

278: leftmost marker SNP be $\mu$; often this will be the variable of main

279: interest.  I allow any prior $\Pr{(\mu)}$; obvious choices would be a

280: uniform density on physical distance, or one based on known gene

281: density \citep[see e.g.][]{rannalareeve2001}.  I assume a unique

282: mutation event generated the disease allele at the disease QTL, so

283: some number of haplotypes ($x_\mu$) in the case pool carry the disease

284: allele and are identical by descent (i.b.d.) at this position.  I

285: assume $x_\mu\sim\bindist{(\nd,\rho)}$ for a rate $\rho$,

286: and since $\rho$ itself is uncertain I assume a beta prior with

287: parameter $R=(R_1,R_0)$, so the prior mean is $R_1/(R_0+R_1)$.

288: Specifying $(R_1,R_0)=(1,1)$ specifies a flat prior on $[0,1]$ for

289: $\rho$.  I assume that the disease allele is absent from the control

290: pool.  The adequacy of this approximation, and ways in which it could

291: be relaxed, are taken up in the discussion in section~\ref{sec:discussion}.
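
To illustrate what these prior assumptions imply (this marginalisation
is given only as a check on intuition; the analysis below keeps $\rho$
explicit), integrating $\rho$ out of the binomial model for $x_\mu$
against the beta prior gives a beta-binomial distribution,
\begin{eqnarray*}
  \Pr{(x_\mu)} &=& \int_0^1{\binom{\nd}{x_\mu}\rho^{x_\mu}(1-\rho)^{\nd-x_\mu}\,
    \frac{\Gamma{(R_1+R_0)}}{\Gamma{(R_1)}\,\Gamma{(R_0)}}\,
    \rho^{R_1-1}(1-\rho)^{R_0-1}\dif\rho} \\
  &=& \binom{\nd}{x_\mu}\,
    \frac{\Gamma{(R_1+R_0)}\,\Gamma{(x_\mu+R_1)}\,\Gamma{(\nd-x_\mu+R_0)}}
    {\Gamma{(R_1)}\,\Gamma{(R_0)}\,\Gamma{(\nd+R_1+R_0)}}\;\mbox{.}
\end{eqnarray*}
For the flat prior $(R_1,R_0)=(1,1)$ this reduces to
$\Pr{(x_\mu)}=1/(\nd+1)$, so every value $x_\mu=0,\ldots,\nd$ is a
priori equally likely.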

292:

293: Each disease allele is embedded in a block of i.b.d.\ haplotype, the

294: ``ancestral haplotype'', with breakpoints at distances $d_\mathrm{L}$

295: and $d_\mathrm{R}$ to the left and right of the position of the QTL.

296: I assume a star shaped genealogy for the disease allele, so that

297: $d_\mathrm{L}$ and $d_\mathrm{R}$ for all blocks of ancestral

298: haplotype are conditionally independent and identically distributed

299: (i.i.d.) from an exponential distribution with mean $1/\tau$ Morgans,

300: where $\tau$ is the age of the disease allele in generations and

301: distances are measured in Morgans \citetext{\citealt{mcpeek1999}, see

302:   also \citealt{morris2000,zhangzhao2000,zhangzhao2002,liu2001}}.

303: (The left and right breakpoints are the positions of the nearest

304: crossovers in $\tau$ meioses where crossovers occur as a Poisson

305: process with rate 1.)  I allow any prior $\Pr{(\tau)}$; lognormal and

306: gamma are reasonable choices.  Specifying an exponential prior with

307: sufficiently small parameter $T$ (so the prior mean $1/T$ is

308: sufficiently large) specifies a prior that is effectively flat over

309: the region where the likelihood is non-negligible.  This has the

310: effect of making the posterior model probability or Bayes factor

311: proportional to $T$.  A crucial variable in the analysis is $x_i$, the

312: number of chromosomes in the case pool that carry the i.b.d.\

313: haplotype at the $i$-th SNP.  The assumptions above specify a

314: distribution for the $x_i$.  A key feature of the present method is

315: that the $x_i$ are not independent across SNPs, in contrast to what

316: is assumed in composite likelihood methods

317: \citep[e.g.][]{terwilliger1995,xiong1997,collins1998,maniatis2004,maniatis2005}.
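
As a numerical illustration of the exponential breakpoint model (the
values of $\tau$ and of the distance are hypothetical, chosen only for
illustration), the probability that a given block of ancestral
haplotype extends at least a distance $d$ Morgans to the right of the
QTL is
\[
  \Pr{(d_\mathrm{R}>d)}=\exp{(-\tau d)}\;\mbox{,}
\]
so for $\tau=100$ generations and a SNP at $d=0.01$ Morgans from the
QTL, each chromosome carrying the disease allele retains the ancestral
haplotype at that SNP with probability $\exp{(-1)}\approx 0.37$, and the
expected value of $x_i$ given $x_\mu$ is $x_\mu\exp{(-1)}$.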

318:

319: All other non-i.b.d.\ haplotype is called heterogeneous ``non-ancestral

320: haplotype''.  The allele present in non-ancestral haplotype is assumed

321: conditionally independent across chromosomes within SNPs, and across

322: SNPs within chromosomes, with \p{i}{1} the probability of a 1 allele

323: at the $i$-th SNP and $\p{i}{0}\equiv 1-\p{i}{1}$.  (This is

324: equivalent to assuming Hardy--Weinberg and linkage equilibrium in

325: blocks of non-ancestral chromosome.)  This assumption is a very bad

326: one when haplotype or genotype data are available

327: \citep{liu2001,morris2002,listephens2003}, but when only multilocus

328: allele frequency data are available it may be more innocuous; this

329: assumption leads to a massive simplification in the likelihood.  I

330: assume an independent beta prior for each $\p{i}{1}$, with parameter

331: $P_i=(P_{i,1},P_{i,0})$, so the prior mean is

332: $P_{i,1}/(P_{i,0}+P_{i,1})$.  The method should be robust to

333: misspecification of $P_i$ as long as both elements are much smaller

334: than $\nc$.

335:

336: The allele present on the ancestral haplotype at the position of the

337: $i$-th SNP, $a_i$, is assumed to be a single draw from the same

338: distribution as for a block of non-ancestral chromosome.  That is, each

339: $a_i$ is an independent Bernoulli variable with parameter $\p{i}{1}$.

340:

341: \section{Analysis}

342: \label{sec:analysis}

343: \subsection{Overview}

344: The purpose of the analysis is to compute the posterior distribution

345: for quantities of interest.  Here I focus on computing the Bayes

346: factor in favour of a model in which a QTL is present, versus a model

347: with no QTL.  This Bayes factor allows the posterior model

348: probabilities to be computed for any prior on the two models.  I also

349: compute the posteriors for $\mu$, $\tau$ and $\rho$, given that a QTL

350: is present, marginalising all other variables.

351:

352: The model has a hierarchical structure as shown in

353: figure~\ref{fig:factorisemodel}.  Note that the joint distribution of

354: the $x_i$ depends on $\mu$, $\tau$ and $\rho$, but that the

355: probability of the data at the $i$-th SNP only depends on $x_i$ and

356: other variables $\pi_i$ and $a_i$ for which the prior is independent

357: across SNPs.  The first step of the analysis, in section

358: \ref{sec:at-each-snp}, is to perform an independent calculation for

359: each SNP that integrates out $\pi_i$ and $a_i$ conditional on $x_i$.

360: The second step of the analysis, in section

361: \ref{sec:hidden-markov-model}, is to integrate out all the $x_i$ and

362: $x_\mu$ conditional on $\mu$, $\tau$ and $\rho$.  This gives the

363: posterior density for these three variables.  The final step, in

364: section \ref{sec:final-posterior}, is to compute the marginal

365: posteriors for each of $\mu$, $\tau$ and $\rho$, and the normalising

366: constant or probability of the data given models with and without a

367: QTL.

368:

369: \subsection{Calculations for the $i$-th SNP}

370: \label{sec:at-each-snp}

371: Note that all probabilities in this section are conditional on $x_i$,

372: the number of chromosomes in the case pool carrying the ancestral

373: i.b.d.\ chromosome.  At the $i$-th SNP let $\ya{d}{i}{1}\equiv

374: \nd-\ya{d}{i}{0}$ be the (unknown) actual count of the 1 allele in

375: cases, and $\ya{c}{i}{1}\equiv \nc-\ya{c}{i}{0}$ be the (unknown)

376: actual count of the 1 allele in controls.  Let

377: $\y{d}{i}=(\ya{d}{i}{1},\ya{d}{i}{0})$ and a similar notation apply

378: for controls.  Under the modelling assumptions we have

379: \begin{eqnarray}

380:   \label{eq:ydprob}

381:   \Pr{(\y{d}{i}|x_i,a_i,\pi_i)} &=&

382:   \bindens{(\ya{d}{i}{a_i}-x_i,\nd-x_i,\pi_i(a_i))} \\

383:   \label{eq:ycprob}

384:   \Pr{(\y{c}{i}|\pi_i)} &=&

385:   \bindens{(\ya{c}{i}{1},\nc,\pi_i(1))}

386: \end{eqnarray}

387: where $\bindens{(x,n,p)}$ is the probability of observing $x$

388: successes in $n$ independent trials with success probability $p$ (and

389: understood to be zero unless $0\le x\le n$).  For a more detailed

390: motivation of (\ref{eq:ydprob}) and (\ref{eq:ycprob}) see \citet[

391: appendix A]{johnson2005a}.

392:

393: Then the probability of the observed data can be written

394: \begin{equation}

395:   \Pr{(\ey{d}{i},\ey{c}{i}|x_i,a_i,\pi_i)} =

396:   \sum_{\y{d}{i}}{\sum_{\y{c}{i}}{

397:       \Pr{(\ey{d}{i},\ey{c}{i}|\y{d}{i},\y{c}{i})}

398:       \Pr{(\y{d}{i}|x_i,a_i,\pi_i)}\Pr{(\y{c}{i}|\pi_i)}

399:     }}\;\mbox{.}

400:   \label{eq:SNPi-prob}

401: \end{equation}

402:

403: We can simplify at this stage by writing down the distribution

404: marginal to $a_i$ and $\pi_i$ and moving the respective sum and

405: integral as far inside the expression as possible, to get

406: \begin{eqnarray}

407:   \Pr{(\ey{d}{i},\ey{c}{i}|x_i)} &=&

408:   \sum_{\y{d}{i}}{\sum_{\y{c}{i}}{\bigg(

409:       \Pr{(\ey{d}{i},\ey{c}{i}|\y{d}{i},\y{c}{i})}\bigg.}}\nonumber\\

410:     &&\bigg.\int{\sum_{a_i}{\big(

411:       \Pr{(\y{d}{i}|x_i,a_i,\pi_i)}\Pr{(a_i|\pi_i)}\big)}\Pr{(\y{c}{i}|\pi_i)}\Pr{(\pi_i)}\dif\pi_i}\bigg)\;\mbox{.}

412: \end{eqnarray}

413: The innermost sum over $a_i$ can be rewritten

414: \begin{eqnarray}

415:   \label{eq:emprob}

416:   \Pr{(\y{d}{i}|x_i,\pi_i)} &=&

417:   \sum_{a_i}{\Pr{(\y{d}{i}|x_i,a_i,\pi_i)}\Pr{(a_i|\pi_i)}} \\

418:   &=& \sum_{a_i}{\bigg(

419:     \bindens{(\ya{d}{i}{a_i}-x_i+1,\nd-x_i+1,\pi_i(a_i))}\bigg.}\nonumber\\

420:   &&\qquad\times

421:   \bigg.\binom{\nd-x_i}{\ya{d}{i}{a_i}-x_i}\bigg/\binom{\nd-x_i+1}{\ya{d}{i}{a_i}-x_i+1}\bigg)\\

422:   &=& \sum_{a_i}{\bigg(

423:     \bindens{(\ya{d}{i}{a_i}-x_i+1,\nd-x_i+1,\pi_i(a_i))}\bigg.}\nonumber\\

424:   &&\qquad\times

425:   \bigg.\frac{\ya{d}{i}{a_i}-x_i+1}{\nd-x_i+1}\bigg)\;\mbox{.}

426: \end{eqnarray}

427: This allows the integral over $\pi_i$ to be computed

428: analytically when (as assumed above) the prior $\Pr{(\pi_i(a_i))}$ is a

429: beta distribution with parameter $(P_{i,a_i},P_{i,1-a_i})$\begin{eqnarray}

430:   \label{eq:ygivenxraw}

431:   \Pr{(\y{d}{i},\y{c}{i}|x_i)} &=& \int{

432:     \Pr{(\y{d}{i}|x_i,\pi_i)}\Pr{(\y{c}{i}|\pi_i)}\Pr{(\pi_i)}\dif \pi_i}\\

433:   \label{eq:ygivenxbinombeta}

434: &=&

435:   \sum_{a_i}{\bigg(\frac{(\nd-x_i+1)!\;\Gamma{(\nc+P_{i,0}+P_{i,1})}}

436:   {\Gamma{(\nd-x_i+1+\nc+P_{i,0}+P_{i,1})}}\bigg.}\nonumber\\

437: %

438:   &&\qquad\times \frac{\Gamma{(\ya{d}{i}{a_i}-x_i+1+\ya{c}{i}{a_i}+P_{i,a_i})}}

439:   {(\ya{d}{i}{a_i}-x_i+1)!\;\Gamma{(\ya{c}{i}{a_i}+P_{i,a_i})}}\nonumber\\

440: %

441:   &&\qquad\times \frac{\Gamma{(\ya{d}{i}{1-a_i}+\ya{c}{i}{1-a_i}+P_{i,1-a_i})}}

442:   {\ya{d}{i}{1-a_i}!\;\Gamma{(\ya{c}{i}{1-a_i}+P_{i,1-a_i})}}\nonumber\\

443:   &&\qquad\times

444:   \bigg.\frac{\ya{d}{i}{a_i}-x_i+1}{\nd-x_i+1}\bigg)

445:   \label{eq:probSNPfull}

446: \end{eqnarray}

447: Thus the probability of the observed data reduces to

448: \begin{equation}

449:   \Pr{(\ey{d}{i},\ey{c}{i}|x_i)} =

450:   \sum_{\y{d}{i}}{\sum_{\y{c}{i}}{

451:       \Pr{(\ey{d}{i},\ey{c}{i}|\y{d}{i},\y{c}{i})}

452:       \Pr{(\y{d}{i},\y{c}{i}|x_i)}

453:     }}

454:   \label{eq:probSNPsumsum}

455: \end{equation}

456: where the first term in the summand is the error

457: model and the second term is given by

458: (\ref{eq:probSNPfull}).
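
The rewriting of the sum over $a_i$ used above rests on an elementary
identity for the binomial probability, stated here in generic symbols
$k$, $n$ and $p$ (which are introduced only for this illustration):
\begin{eqnarray*}
  p\,\bindens{(k,n,p)} &=& \binom{n}{k}\,p^{k+1}(1-p)^{n-k} \\
  &=& \frac{\binom{n}{k}}{\binom{n+1}{k+1}}\,\bindens{(k+1,n+1,p)}
  = \frac{k+1}{n+1}\,\bindens{(k+1,n+1,p)}\;\mbox{.}
\end{eqnarray*}
For example, with $k=1$, $n=2$ and $p=1/2$, the left hand side is
$\frac{1}{2}\times\frac{1}{2}=\frac{1}{4}$ and the right hand side is
$\frac{2}{3}\times\frac{3}{8}=\frac{1}{4}$.  The integral over $\pi_i$
then follows from the standard beta integral
\[
  \int_0^1{p^{\alpha-1}(1-p)^{\beta-1}\dif p}
  =\frac{\Gamma{(\alpha)}\,\Gamma{(\beta)}}{\Gamma{(\alpha+\beta)}}\;\mbox{.}
\]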

459:

460: Because $\ey{d}{i}$ and $\ey{c}{i}$ are fixed and $a_i$ and $\pi_i$

461: have been integrated out, values of (\ref{eq:probSNPsumsum}) for each

462: $x_i$ can be precomputed and stored in a small lookup table of size

463: $(\nd+1)$.  It is therefore feasible to use a complicated and

464: hopefully realistic error distribution, as discussed above.  For many

465: error models, computation can be speeded up without loss of accuracy

466: by judicious reduction of the range of the summation in

467: (\ref{eq:probSNPsumsum}).

468:

469: \subsection{Hidden Markov Model for $x_\mu$ and the $x_i$}

470: \label{sec:hidden-markov-model}

471: Throughout this section, all probabilities are conditional on given

472: values of $\mu$, $\tau$ and $\rho$, but for clarity of exposition this is

473: suppressed in the notation.

474:

475: Assume for clarity of exposition that $\mu$ is a position within the

476: battery of marker SNPs.  Let $\ceil{\mu}=\min{\{i:m_i\ge \mu\}}$ and

477: $\floor{\mu}=\max{\{i:m_i< \mu\}}$ denote the indices of the SNPs to

478: the right and left of the QTL.  The algorithm works with

479: obvious modifications when $\mu$ is a position outside the battery of

480: marker SNPs.
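
For example (with hypothetical map positions chosen only for
illustration), if $L=4$ SNPs lie at
$(m_1,m_2,m_3,m_4)=(0,0.01,0.02,0.03)$ Morgans and $\mu=0.014$, then
\[
  \floor{\mu}=\max{\{i:m_i<0.014\}}=2
  \qquad\mbox{and}\qquad
  \ceil{\mu}=\min{\{i:m_i\ge 0.014\}}=3\;\mbox{,}
\]
so SNPs 1 and 2 lie to the left of the QTL and SNPs 3 and 4 lie to the
right.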

481:

482: Under the modelling assumptions, as we move away from the position of

483: the disease locus, the number of chromosomes containing i.b.d.\

484: ancestral chromosome, $x_i$, is Markovian.  The \textit{transition

485:   probabilities} of a (nonstationary and inhomogeneous) hidden Markov

486: model \cite[HMM; see e.g.][ ch.3]{durbinbook}, for marker SNPs to the right

487: of $\mu$ (i.e.\ $\ceil{\mu}\le i$), are

488: \begin{equation}

489:   \label{eq:jumpprob}

490:   \Pr{(x_{i+1}|x_{i})}=

491:   \bindens{(x_{i+1},\;x_{i},\;\exp{(-\tau\times(m_{i+1}-m_{i}))})}

492: \end{equation}

493: Similar equations hold for marker SNPs to the left of $\mu$, and

494: for $\Pr{(x_i|x_\mu)}$.  The \textit{emission probabilities} of the

495: HMM are given by (\ref{eq:probSNPsumsum}).  There is an important

496: difference between this HMM and the HMMs of \citet{mcpeek1999},

497: \citet{morris2000}, \citet{zhangzhao2000,zhangzhao2002} and \citet{liu2001}.  Those

498: authors modelled the observed haplotypes or genotypes as a set of

499: conditionally independent HMMs, conditional on (in the present

500: notation) $(a_1,\ldots,a_L)$, $(\pi_1,\ldots,\pi_L)$ as well as $\mu$,

501: $\tau$ and $\rho$.  Thus they had to use expensive numerical

502: algorithms (optimisation or MCMC) on the high dimensional space of

503: variables on which the HMMs were conditioned.  Here I am able to model

504: all the data as a \emph{single} HMM conditional on only $\mu$, $\tau$

505: and $\rho$.  I am thus able to integrate out the high dimensional

506: $(a_1,\ldots,a_L)$ and $(\pi_1,\ldots,\pi_L)$ using an efficient

507: numerical algorithm and am left with only a low dimensional space (the

508: space for $(\mu,\tau,\rho)$) on which I will need to use an expensive

509: numerical algorithm.
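
As a numerical illustration of (\ref{eq:jumpprob}), with hypothetical
values chosen only for illustration: for $\tau=100$ generations and
adjacent SNPs separated by $m_{i+1}-m_i=0.001$ Morgans, each block of
ancestral haplotype present at SNP $i$ survives to SNP $i+1$ with
probability
\[
  \exp{(-100\times 0.001)}=\exp{(-0.1)}\approx 0.905\;\mbox{,}
\]
so that given $x_i=10$, $x_{i+1}$ is binomial with 10 trials and mean
approximately $9.05$.  Because $\bindens{(x_{i+1},x_i,\cdot)}$ is zero
for $x_{i+1}>x_i$, the transition matrix is triangular and $x_i=0$ is
an absorbing state.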

510:

511: We can use the backwards propagation algorithm for HMMs to sum over

512: the hidden states $(x_1,\ldots,x_L)$ \citetext{see e.g.\

513:   \citealt{durbinbook} ch.3, \citealt{liubook2001} pp.28--31}.

514: Readers familiar with HMMs may wish to skip the rest of this section.

515:

516: Define the \textit{backwards variables} for SNPs at positions to the

517: right of the QTL

518: \begin{equation}

519:   \label{eq:bvdefright}

520:   b(x_i) := \Pr{(\ey{d}{i+1},\ey{c}{i+1},\ldots,\ey{d}{L},\ey{c}{L}|x_i)}

521: \end{equation}

522: and for SNPs at positions to the left of the QTL

523: \begin{equation}

524: \label{eq:bvdefleft}

525:   b'(x_i) := \Pr{(\ey{d}{1},\ey{c}{1},\ldots,\ey{d}{i-1},\ey{c}{i-1}|x_i)}

526: \end{equation}

527: Here backwards is relative to the direction in which the hidden

528: process is Markovian.  Equation~(\ref{eq:bvdefleft}) should not be

529: confused with a forwards variable; forwards variables are not used in this

530: computation.  Note also that the backwards variables are functions of

531: which SNP ($i$) and the value of $x_i$ at that SNP, but that I have

532: adopted a more streamlined notation that I hope is unambiguous.

533:

534: The backwards variables have the obvious interpretation when the

535: arguments run out of

536: range, that

537: \begin{eqnarray}

538:   \label{eq:hmm-initialise-r}

539:   b(x_L) &:=& \Pr{(\mbox{nothing}|x_L)} = 1 \\

540:   b'(x_1) &:=& \Pr{(\mbox{nothing}|x_1)} = 1

541:   \label{eq:hmm-initialise-l}

542: \end{eqnarray}

543:

544: Using (\ref{eq:hmm-initialise-r}) to \textit{initialise} the backwards variables

545: at $i=L$, we then \textit{propagate} leftwards along the

546: chromosomes: for each

547: $i=L-1,\ldots,\ceil{\mu}$ in turn, we compute the backwards variables for every $x_i$ using

548: \begin{equation}

549:   b(x_i) =\sum_{x_{i+1}}{\Pr{(x_{i+1}|x_i)}

550:     \Pr{(\ey{d}{i+1},\ey{c}{i+1}|x_{i+1})} b(x_{i+1})}

551: \end{equation}

552: and then \textit{terminate} the algorithm by computing

553: \begin{equation}

554:   \label{eq:modelprobgivenmu}

555:   \Pr{(\ey{d}{\ceil{\mu}},\ey{c}{\ceil{\mu}},\ldots,\ey{d}{L},\ey{c}{L}|x_\mu)}

556:   = \sum_{x_{\ceil{\mu}}}{\Pr{(x_{\ceil{\mu}}|x_\mu)}

557:     \Pr{(\ey{d}{\ceil{\mu}},\ey{c}{\ceil{\mu}}|x_{\ceil{\mu}})} b{(x_{\ceil{\mu}})}}

558: \end{equation}

559: for every $x_\mu$.
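
To make the recursion concrete, consider the smallest nontrivial case
$\nd=1$ (a sketch for illustration only).  Write
$s_i=\exp{(-\tau\times(m_{i+1}-m_i))}$ and
$E_i(x)=\Pr{(\ey{d}{i},\ey{c}{i}|x_i=x)}$; both are shorthand
introduced here and not used elsewhere.  Then $x_i\in\{0,1\}$, a
chromosome that has lost the ancestral haplotype cannot regain it, and
the propagation step reads
\begin{eqnarray*}
  b(x_i=0) &=& E_{i+1}(0)\,b(x_{i+1}=0)\;\mbox{,} \\
  b(x_i=1) &=& (1-s_i)\,E_{i+1}(0)\,b(x_{i+1}=0)
  + s_i\,E_{i+1}(1)\,b(x_{i+1}=1)\;\mbox{.}
\end{eqnarray*}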

560:

561: Likewise, using (\ref{eq:hmm-initialise-l}) to initialise the $b'$

562: backwards variables at $i=1$ we can propagate rightwards along the

563: chromosomes (backwards in the sense that time or space is usually

564: considered in HMMs) for $i=2,\ldots,\floor{\mu}$ and terminate in a

565: symmetric manner to compute

566: $\Pr{(\ey{d}{1},\ey{c}{1},\ldots,\ey{d}{\floor{\mu}},\ey{c}{\floor{\mu}}|x_\mu)}$.

567:

568: We then obtain the probability of all the data by computing

569: \begin{eqnarray}

570:   \label{eq:modelprobgivenrho}

571:   \Pr{(\hat{\vec{y}})} &=&

572:   \Pr{(\ey{d}{1},\ey{c}{1},\ldots,\ey{d}{L},\ey{c}{L})} \nonumber \\

573:   &=& \sum_{x_\mu}{\bigg(\bindens{(x_\mu,n_d,\rho)}

574:     \Pr{(\ey{d}{1},\ey{c}{1},\ldots,\ey{d}{\floor{\mu}},\ey{c}{\floor{\mu}}|x_\mu)}\bigg.}\nonumber\\

575:   &&\qquad \bigg. \Pr{(\ey{d}{\ceil{\mu}},\ey{c}{\ceil{\mu}},\ldots,\ey{d}{L},\ey{c}{L}|x_\mu)}\bigg)

576: \end{eqnarray}

577:

578: Restoring the conditioning that has been implicit throughout this

579: section, (\ref{eq:modelprobgivenrho}) is

580: $\Pr{(\hat{\vec{y}}|\mu,\tau,\rho)}$, the probability of all the data

581: conditional on $(\mu,\tau,\rho)$ and marginal to $(a_1,\ldots,a_L)$

582: and $(\pi_1,\ldots,\pi_L)$.

583:

584:

585: \subsection{Marginal Posteriors for $\mu$, $\tau$ and $\rho$ and Model

586: Probabilities}

587: \label{sec:final-posterior}

588: The posterior for $\mu$, $\tau$ and $\rho$, up to a normalising

589: constant, is obtained simply by multiplying

590: (\ref{eq:modelprobgivenrho}) by the relevant priors:

591: \begin{equation}

592:   \Pr{(\hat{\vec{y}},\mu,\tau,\rho)} = \Pr{(\hat{\vec{y}}|\mu,\tau,\rho)}

593:   \,\Pr{(\mu)}\,\Pr{(\tau)}\,\betadens{(\rho,R_1,R_0)}

594: \end{equation}

595: so

596: \begin{equation}

597:   \Pr{(\mu,\tau,\rho|\hat{\vec{y}})} = \Pr{(\hat{\vec{y}}|\mu,\tau,\rho)}

598:   \,\Pr{(\mu)}\,\Pr{(\tau)}\,\betadens{(\rho,R_1,R_0)}\,\frac{1}{\Pr{(\hat{\vec{y}})}}\;\mbox{.}

599: \end{equation}

600: Summarising this posterior is not entirely trivial for large data

601: sets, because computing the posterior at any given point

602: $(\mu,\tau,\rho)$ using the propagation algorithm of

603: section~\ref{sec:hidden-markov-model} takes on the order of

604: $L\,n_d^2$ operations (and all addition has to be done in log-space,

605: see e.g.\ \citet[p.~77--78]{durbinbook}), so we cannot rely on being able

606: to make an arbitrarily large number of such computations.
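
As a minimal illustration of what addition in log-space involves (the symbols $a$ and $b$ are introduced here only for this sketch and are not used elsewhere), suppose two probabilities $p$ and $q$ are stored as $a=\ln p$ and $b=\ln q$.  Their sum can be obtained without ever forming $p$ or $q$, which might underflow, using
\begin{equation*}
  \ln{(p+q)} = \max{(a,b)} + \ln{\left(1+\me^{-|a-b|}\right)}\;\mbox{.}
\end{equation*}
The exponent is never positive, so the exponential cannot overflow.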

607:

608: I use Cartesian product quadrature \citep[CPQ; see e.g.][

609: \textbf{9.43}--\textbf{9.44}]{ohaganforster} to compute marginal

610: posteriors for each of $\mu$, $\tau$ and $\rho$, and the normalising

611: constant or marginal likelihood $\Pr{(\hat{\vec{y}})}$ assuming there

612: is a disease QTL.  CPQ makes the approximation

613: \begin{eqnarray}

614:   \Pr{(\hat{\vec{y}})}&=&\int{\int{\int{

615:         \Pr{(\hat{\vec{y}},\mu,\tau,\rho)}\,\dif\mu}\,\dif\tau}\,\dif\rho}

616:   \nonumber\\

617:   &\simeq&\sum_j{\sum_k{\sum_\ell{ w_j^{(\mu)} w_k^{(\tau)} w_\ell^{(\rho)} \Pr{(\hat{\vec{y}},\mu_j,\tau_k,\rho_\ell)}}}}

618:   \label{eq:cpq-howto}

619: \end{eqnarray}

620: where e.g.\ $j$ indexes a set of \emph{design points}

621: $\{\mu_1,\mu_2,\ldots\}$, and $w_j^{(\mu)}$ is a \emph{weight}

622: associated with the $j$-th design point.  A quantity proportional to

623: the marginal posterior for any one of $\mu$, $\tau$ or $\rho$ is obtained by omitting the

624: respective sum and weight from (\ref{eq:cpq-howto}).  When computed using priors

625: describing our beliefs about these variables given that there is a

626: QTL in the region of interest, the quantity in (\ref{eq:cpq-howto})

627: will be called $\Pr{(\hat{\vec{y}}|\mbox{QTL})}$.
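
To make the marginal posteriors concrete, here is a sketch for $\mu$ under the approximation (\ref{eq:cpq-howto}).  The unnormalised marginal posterior density at the design point $\mu_j$ is
\begin{equation*}
  \Pr{(\hat{\vec{y}},\mu_j)} \simeq \sum_k{\sum_\ell{ w_k^{(\tau)} w_\ell^{(\rho)}
      \Pr{(\hat{\vec{y}},\mu_j,\tau_k,\rho_\ell)}}}
\end{equation*}
and dividing by (\ref{eq:cpq-howto}) gives the normalised density $\Pr{(\mu_j|\hat{\vec{y}})}$, for which $\sum_j{w_j^{(\mu)}\Pr{(\mu_j|\hat{\vec{y}})}}=1$ by construction.  The marginals for $\tau$ and $\rho$ follow in the same way.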

628:

629: The choice of design points and weights depends on the region of

630: interest and the prior, and the quadrature rule to be used.  For

631: example, for all calculations in section~\ref{sec:results} the region of

632: interest is $\mu\in(0,1)$, measured in Mb or cM.  With a uniform prior

633: for $\mu$, exponential prior for $\tau$ with $T=1/1000$, and flat beta

634: prior for $\rho$ with $R_1=R_0=1$, I used 100 design points for each

635: variable as follows:

636: \begin{eqnarray}

637:   \mu_j = \left(j+1/2\right)/100\mbox{,}&

638:   w_j^{(\mu)} = 0.01\mbox{,}& j=0,1,\ldots,99\nonumber\\

639:   \tau_k = \exp{(k/11)}\mbox{,}&

640:   w_k^{(\tau)} = \exp{(k/11)}/11\mbox{,}& k=0,1,\ldots,99\nonumber\\

641:   \rho_\ell = \left(\ell+1/2\right)/100\mbox{,}&

642:   w_\ell^{(\rho)} = 0.01\mbox{,}& \ell=0,1,\ldots,99

643:   \label{eq:cpq-default-design}

644: \end{eqnarray}

645: Here the quadrature rule is very simple and corresponds approximately

646: to the trapezoid rule or two point Newton--Cotes rule.
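
The weights for $\tau$ in (\ref{eq:cpq-default-design}) can be understood through a change of variable.  The following is a sketch in which $u=\ln\tau$ and the integrand $f$ are symbols introduced only for illustration.  Since $\dif\tau=\me^{u}\,\dif u$,
\begin{equation*}
  \int{f(\tau)\,\dif\tau} = \int{f(\me^{u})\,\me^{u}\,\dif u}
  \simeq \sum_{k=0}^{99}{\frac{1}{11}\,\me^{k/11}\,f(\me^{k/11})}
\end{equation*}
which is an equally weighted rule on design points equally spaced in $u$ with spacing $1/11$, giving $w_k^{(\tau)}=\exp{(k/11)}/11$.  The design points for $\tau$ therefore run from $\tau_0=1$ to $\tau_{99}=\me^{9}\simeq8103$, covering nearly four orders of magnitude with only 100 points.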

647:

648: A simple model with no QTL is to assume there are no blocks of i.b.d.\

649: haplotype, $x_i=0$ for all $i$.  This corresponds to a degenerate

650: prior at $\rho=0$ (or the limits $\mu\to\pm\infty$ or $\tau\to\infty$).

651: The probability of the data under this model, $\Pr{(\hat{\vec{y}}|\mbox{no

652:     QTL})}$, is easily computed directly from

653: (\ref{eq:probSNPsumsum}).  The Bayes factor (BF) in favour of the

654: model with a QTL is then

655: \begin{equation}

656:   \label{eq:BF-def}

657:   \mbox{BF} = \frac{\Pr{(\mbox{QTL}|\hat{\vec{y}})}}{\Pr{(\mbox{no

658:         QTL}|\hat{\vec{y}})}}\bigg/

659: \frac{\Pr{(\mbox{QTL})}}{\Pr{(\mbox{no

660:         QTL})}}

661: =\frac{\Pr{(\hat{\vec{y}}|\mbox{QTL})}}{\Pr{(\hat{\vec{y}}|\mbox{no QTL})}}\;\mbox{.}

662: \end{equation}

663: In addition to its Bayesian interpretation, the

664: BF (or any isotonic transformation thereof) has good

665: properties, from a classical frequentist perspective, as a test

666: statistic to test the null model with no QTL against the alternative

667: with a QTL \citetext{\citealt{ohaganforster} ch.5,

668:   \citealt{patterson2004}}.  Tests based on the BF are admissible,

669: which means that no other test has greater power for all

670: $(\mu,\tau,\rho)$.  There may be other tests that have greater power

671: for some $(\mu,\tau,\rho)$, and are therefore also admissible, but it

672: is ``unusual and strange'' to find an admissible test that is not

673: based on the BF computed using \emph{some} prior.  Furthermore, (up to

674: isomorphism) the BF uniquely maximises average power, when the

675: averaging is done with respect to the prior for $(\mu,\tau,\rho)$ used

676: to compute the BF.  Of course, this theory only applies when the model

677: is correct, but we might hope that the most powerful test for an

678: approximate model would be approximately most powerful for a more

679: realistic model.  The BF is a sensible way to combine information

680: across many markers to produce a single test, and thus avoids (or

681: overcomes) the problematic need to correct for multiple testing

682: \citep{patterson2004}.

683:

684: I have also implemented a Markov chain Monte Carlo

685: \citep[MCMC; see e.g.][]{gilks1996,liubook2001} sampler, which uses

686: the Metropolis--Hastings algorithm to sample from

687: the posterior density for $(\mu,\tau,\rho)$.

688: A proposal distribution that seems to work reasonably well

689: is to update a single variable, chosen at random with equal

690: probability, using the following:

691: \begin{eqnarray}

692:   \label{eq:proposal-z}

693:   \mu' &\sim& \ndist{\left(\mu,(0.1(m_L-m_1))^2\right)}\nonumber\\

694:   \label{eq:proposal-t}

695:   \tau' &\sim& \gamdist{\left(10,\tau/10\right)}\nonumber\\

696:   \label{eq:proposal-r}

697:   \rho' &\sim& \betadist{\left(5\rho,5(1-\rho)\right)}

698: \end{eqnarray}
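
For reference, the standard Metropolis--Hastings acceptance probability is sketched here; the symbols $\theta=(\mu,\tau,\rho)$ and $q$ are introduced for illustration only.  A proposed move from $\theta$ to $\theta'$, drawn from a proposal density $q(\theta'|\theta)$, is accepted with probability
\begin{equation*}
  \min{\left(1,\frac{\Pr{(\theta'|\hat{\vec{y}})}\,q(\theta|\theta')}
      {\Pr{(\theta|\hat{\vec{y}})}\,q(\theta'|\theta)}\right)}\;\mbox{.}
\end{equation*}
The normalising constant $\Pr{(\hat{\vec{y}})}$ cancels in this ratio.  The normal proposal for $\mu$ is symmetric, so its $q$ terms cancel, whereas the gamma and beta proposals are not symmetric and their $q$ terms must be retained.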

699: Kernel based methods \citep[see e.g.][]{silverman1986} can then be

700: used to estimate marginal posteriors for each of the variables, and

701: these seemed similar to the marginal posteriors computed using CPQ in

702: the cases I examined.  However, standard numerical methods to estimate

703: the model probability $\Pr{(\hat{\vec{y}})}$ from the MCMC output

704: \citep[see e.g.][\textbf{10.46}]{ohaganforster} converged very slowly

705: and did not seem reliable.

706:

707: In addition to the ease and reliability with which the normalising

708: constant or BF can be computed, CPQ offers substantial advantages over

709: alternatives such as MCMC in low dimensional situations such as this

710: one.  There are no concerns about burn-in or mixing.  CPQ makes

711: efficient use of the evaluation of the posterior density at each

712: design point. We can investigate sensitivity to prior specification

713: afterwards without redoing much of the computation.

714:

715: In this particular situation CPQ offers an additional advantage:

716: by traversing the design points in a particular order, many of the

717: backwards variables computed in the propagation algorithm of

718: section~\ref{sec:hidden-markov-model} can be stored and reused, and

719: thus a CPQ algorithm with $n$ design points runs much faster than an

720: MCMC algorithm with $n$ samples.  The details are as follows: The

721: posterior can be computed on all points on a three dimensional lattice

722: most efficiently by traversing the lattice with different values of

723: $\tau$ in the outermost loop, different values of $\mu$ in the middle

724: loop, and different values of $\rho$ in the innermost loop.  This is

725: because each time $\rho$ changes but $\mu$ and $\tau$ do not then only

726: (\ref{eq:modelprobgivenrho}) has to be recomputed.  Each time $\mu$

727: changes but $\tau$ does not then (\ref{eq:modelprobgivenmu}) always

728: has to be recomputed, and if the old and new values of $\mu$ are

729: separated by one or more typed SNPs then one or more columns of

730: backwards variables also have to be recomputed.  If $\mu$ always

731: increases then the $b$ are all computed first and then as the lattice

732: is traversed extra columns of $b'$ are computed and columns of $b$ are

733: simply discarded.  Changing $\tau$ means that the transition

734: probabilities~(\ref{eq:jumpprob}) change and everything has to be

735: recomputed.  The run time of the whole algorithm (propagation and CPQ)

736: scales quadratically in $n_d$ and linearly in $L+d_\mu$ where $d_\mu$

737: is the number of design points used for $\mu$.  If a small map region

738: is found to be interesting, additional design points can be added later

739: at moderate cost.

740:

741: The disadvantages of CPQ are that we learn little about the posterior

742: until calculations have been completed for all the design points, that we

743: may belatedly discover that our choice of design points was not a good

744: one, that MCMC \emph{may} be more sensitive to very narrow spikes

745: containing substantial probability mass (these will be missed if they

746: fall in between the design points), and that CPQ will not scale well to

747: higher dimensional spaces, which we might need to study if we

748: elaborated the model.

749:

750:

751: \section{Coalescent Simulation Model and Error Model}

752: \label{sec:simul-model}

753: In this section I describe a more realistic model that was used to

754: simulate datasets on which to test the method described above in

755: sections~\ref{sec:appr-model-prior}--\ref{sec:analysis}.  I also

756: describe a specific model for errors in allele frequency estimation.

757: Each simulated dataset was generated as follows.

758:

759: First I used the \texttt{mksamples} program of \citet{hudson2002} to

760: simulate a sample of 20,000 1Mb long regions, assuming the standard

761: neutral coalescent model with population recombination rate

762: $4N_{\mathrm{e}}c=400$ ($N_{\mathrm{e}}=10,000$ assuming

763: $1\mbox{cM}/\mbox{Mb}$), and assuming the infinite sites mutation

764: model with population mutation rate $4N_{\mathrm{e}}\mu=10$.  (This is

765: an unrealistically low mutation rate; the idea is to simulate some of

766: the SNPs in the region rather than all of them.)  Chromosomes were

767: paired at random to generate a sample of 10,000 individuals.

768:

769: One SNP with a minor allele frequency between 10\% and 20\% was chosen

770: as the disease QTL.  The disease status of each individual was

771: simulated assuming multiplicative risks, so that

772: $\gamma_{01}/\gamma_{00}=\gamma_{11}/\gamma_{01}=g$.  Here $\gamma_G$

773: is the penetrance (probability of having the disease) given genotype

774: $G$ at the disease QTL, with 1 the minor allele.  The parameter $g$ is

775: called the allelic or genotype relative risk

776: \citep[see e.g.][]{rischmerikangas1996}.  Simulations for this paper used

777: values of $g=4$ and $g=1$.  The penetrance of the wild type

778: homozygote, $\gamma_{00}$, was set so that the marginal probability of

779: having the disease was 0.02.  (Thus the number of case chromosomes

780: $\nd$ was random with expectation $0.02\times10,000\times2=400$.)

781: Data from all $\nd/2$ case individuals, and an equal number $\nc/2$ of

782: randomly chosen control individuals, were used to make up the two

783: pools.

784:

785: Excluding the disease QTL, all simulated SNPs with a minor allele

786: frequency greater than 0.05 in the $\nc/2$ individuals in the control

787: pool were analysed, so the number of SNPs $L$ was also random.  The

788: estimated allele frequencies at each SNP and for each pool were either

789: assumed to be known exactly, or assumed to have been

790: estimated using the lag between kinetic PCR growth curves

791: \citep{germer2000}, using

792: \begin{equation}

793:   \label{eq:yhatfromlag}

794:   \hat{y} = \frac{1}{1+2^{\Delta\widehat{C_t}}}\times n\;\mbox{.}

795: \end{equation}

796: Here $\hat{y}$ is shorthand for the estimated count of the 1 allele,

797: $n$ is the number of chromosomes in the pool,

798: $\Delta\widehat{C_t}=\widehat{C_t}(1)-\widehat{C_t}(0)$, and

799: $\widehat{C_t}(a)$ is the number of PCR cycles before the amount of

800: PCR product for allele $a$ exceeds some threshold level

801: \citep{germer2000}.  The model for $\Pr{(\hat{y}|y,e)}$ is then as

802: follows: Define the true lag $\Delta C_t=\log_2((n-y)/y)$, which is

803: the lag that would give the correct frequency when

804: (\ref{eq:yhatfromlag}) was used.  I assume that the observed lag

805: $\Delta\widehat{C_t}$ averaged across $r$ experiments is normally

806: distributed with mean $\Delta C_t$ and variance $\sigma^2/r$, where

807: $\sigma^2$ is the variance in lags across replicate experiments.

808:

809: Using the Jacobian

810: \begin{equation}

811:   \label{eq:yhatzhatjacobian}

812:   \left|\frac{\dif \hat{y}}{\dif (\Delta\widehat{C_t})}\right| = \ln{(2)}\hat{y}(n-\hat{y})\frac{1}{n}

813: \end{equation}

814: we can write down the error model in

815: the form required in (\ref{eq:SNPi-prob}) for the analysis,

816: \begin{equation}

817:   \label{eq:errorgaussdeltact}

818:   \Pr{(\hat{y}|y,n,\sigma^2,r)} = \frac{n}{\ln{(2)}\hat{y}(n-\hat{y})}

819:   \frac{1}{\sqrt{2\pi\sigma^2/r}}

820:   \exp{\left(-\frac{\left(\ln\left(\frac{(n-y)\hat{y}}{y(n-

821: \hat{y})}\right)\right)^2}{2\ln{(2)}^2\sigma^2/r}\right)} \;\mbox{.}

822: \end{equation}
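
The step from the Gaussian model for the lag to (\ref{eq:errorgaussdeltact}) is a change of variables, sketched here with $\phi$ denoting the normal density of $\Delta\widehat{C_t}$ (a symbol introduced only for this illustration).  Inverting (\ref{eq:yhatfromlag}) gives $\Delta\widehat{C_t}=\log_2{((n-\hat{y})/\hat{y})}$, so
\begin{equation*}
  \Pr{(\hat{y}|y,n,\sigma^2,r)} =
  \phi{\left(\Delta\widehat{C_t}\right)}\left|\frac{\dif (\Delta\widehat{C_t})}{\dif \hat{y}}\right|
  \quad\mbox{with}\quad
  \Delta\widehat{C_t}-\Delta C_t =
  \frac{1}{\ln{(2)}}\ln{\left(\frac{(n-\hat{y})y}{\hat{y}(n-y)}\right)}\;\mbox{.}
\end{equation*}
The first factor of (\ref{eq:errorgaussdeltact}) is the reciprocal of the magnitude of (\ref{eq:yhatzhatjacobian}), and squaring the difference of lags removes the dependence on the sign of the logarithm, which gives the exponent shown.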

823:

824: \section{Testing the Method}

825: \label{sec:results}

826: All measurements of the performance of the Bayesian method described

827: in sections \ref{sec:appr-model-prior}--\ref{sec:analysis} are based

828: on analyses of datasets simulated under a more realistic model, as described in

829: section~\ref{sec:simul-model}.  Results are reported for two situations, either where the allele

830: frequencies in each pool are known exactly, or where there are errors in allele

831: frequency estimation using (\ref{eq:errorgaussdeltact}) and assuming

832: that $n$, $r=2$ replicates and $\sigma=0.2\mbox{ PCR cycles}$ are all known.

833: This magnitude of error is comparable to those reported by

834: \citet{germer2000} and \citet{shiffman2004}.

835: Using (\ref{eq:yhatzhatjacobian}) we can say that these parameter values

836: correspond to a ``typical'' error in allele frequency estimate

837: $\hat{y}/n$ of about $y/n (1-y/n)\ln{(2)}\sigma/\sqrt{2}\simeq y/n

838: (1-y/n)0.098$ or, for intermediate allele frequencies, about 2.5\%.

839:

840: For each situation, allele frequencies known either exactly or

841: estimated with errors, I analysed 500 datasets simulated

842: assuming there was a QTL with a

843: genotype relative risk $g=4$, and 500 datasets simulated assuming a null model

844: with no QTL ($g=1$, so the penetrances

845: $\gamma_{00}=\gamma_{01}=\gamma_{11}=0.02$ are all equal).  In these

846: simulations, the median number of case or control individuals,

847: $\nd/2=\nc/2$, was 200 (interquartile range 191--209, range 154--248).

848: The median number of SNPs, $L$, was 28 (interquartile range 24--32,

849: range 12--51).  These simulations assumed relative risks that are

850: higher, and correspondingly sample sizes that are smaller than may be

851: realistic for many studies of QTLs influencing complex genetic

852: diseases.  This reflects the need to analyse a reasonably large number

853: of simulated datasets with the computing resources currently available

854: to me.  The mean time to run an analysis on a simulated dataset, using

855: CPQ with the design (\ref{eq:cpq-default-design}) which requires

856: evaluating the posterior at $10^6$ points on a $100\times100\times100$

857: lattice, was about 36 minutes on a 2.4GHz Intel\textregistered{}

858: Xeon\texttrademark{} processor (totalling about 50 processor days for

859: all the simulated datasets).

860:

861: The inferences from the Bayesian method described here are compared

862: against simple but widely used classical single point analyses.  When

863: allele frequencies in each pool are known exactly, a chi squared test

864: can be used on the counts of the two alleles in the two pools

865: \citep[see e.g.\ ][]{clayton2001}, at each

866: marker SNP separately.  \citet{visscher2003} consider how to perform

867: an equivalent test when the errors in allele frequency estimates are

868: Gaussian.  The relatively small Gaussian errors in $\Delta\widehat{C_t}$

869: simulated here will produce errors in allele frequency estimates that

870: are approximately Gaussian (to the extent that (\ref{eq:yhatfromlag})

871: is linear, and in fact are underdispersed relative to a Gaussian in

872: the direction of extreme allele frequencies).  \citet{visscher2003}

873: show that a ``shrunk'' test statistic is approximately distributed as

874: $\chi^2_{(1)}$.  This shrunk statistic is equal to the ordinary

875: $\chi^2$ statistic computed using a point estimate of the counts,

876: times a factor $2V_s/(2V_s+V_e)$ where $V_s$ is the estimated sampling

877: variance of the allele frequency due to sampling a finite number of

878: cases and controls, under the null hypothesis of equal allele

879: frequency in cases and controls, and $V_e$ is the variance of the

880: allele frequency in either pool due to experimental error.  Using

881: (\ref{eq:yhatzhatjacobian}), for the simulations performed here this

882: shrinking factor is (approximately, for small $\sigma$)

883: \begin{equation}

884:   \frac{2}{2+(\nd+\nc)\hat{p}(1-\hat{p})\ln{(2)}^2\sigma^2/r}

885:   \;\mbox{,}

886:   \label{eq:visscher-shrink-factor}

887: \end{equation}

888: where $\hat{p}$ is the allele frequency estimated under the null

889: hypothesis, i.e.\ by pooling the case and control pools.

890:

891: The most widely considered statistics from a classical single point

892: analysis are as follows:  Let \pmin{} be the smallest $p$-value of the

893: $L$ (shrunk) chi squared tests applied to a given dataset, and let

894: \muminp{} be the map position of the marker with the smallest

895: $p$-value.

896:

897: It is worth emphasising that all the tests described in the following

898: sections concern the classical sense performance of statistics

899: computed from the Bayesian analysis.  Strictly, the Bayesian sense

900: performance can only be tested by conducting a Bayesian analysis

901: assuming a more realistic model (or prior).

902:

903: \subsection{Power to Detect a QTL}

904: \label{sec:power-detect-qtl}

905: In this section I compare the power of tests to detect a QTL.  I

906: consider two different test statistics, and different methods of

907: determining critical regions.  The first test statistic is

908: $2\ln\mbox{BF}$, twice the logarithm of the Bayes factor

909: (\ref{eq:BF-def}).  The second test statistic is $\pmin\times L$.

910: Multiplying the smallest single point $p$-value by $L$ achieves a

911: simple Bonferroni correction for multiple testing that makes the

912: critical region independent of $L$.  Critical regions were determined

913: either analytically (by arbitrary or approximate methods), or

914: empirically (from analyses of datasets simulated under the null

915: model).  I report the performance of tests with nominal sizes of

916: $\alpha=0.05$ and $\alpha=0.01$; a more general comparison is made in

917: figures~\ref{fig:roc-no-error} and \ref{fig:roc-with-error}.  For each

918: test, the true size was estimated using 500 simulations under the null

919: model with genotype relative risk $g=1$, i.e.\ where risk is

920: independent of genotype at the QTL.  The power against an alternative

921: with $g=4$ was estimated using 500 simulations.  For each test I

922: report the estimated size and power, along with exact 95\% binomial

923: confidence intervals for their values.
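
As a sketch of how such intervals are defined, assuming the Clopper--Pearson construction (the usual meaning of an exact binomial interval; the symbols $k$, $N$, $p_{\mathrm{lo}}$ and $p_{\mathrm{hi}}$ are introduced here for illustration): if $k$ of $N=500$ simulations fall in the critical region, the limits solve
\begin{equation*}
  \sum_{i=k}^{N}{\binom{N}{i}p_{\mathrm{lo}}^{i}(1-p_{\mathrm{lo}})^{N-i}} = 0.025
  \quad\mbox{and}\quad
  \sum_{i=0}^{k}{\binom{N}{i}p_{\mathrm{hi}}^{i}(1-p_{\mathrm{hi}})^{N-i}} = 0.025\;\mbox{,}
\end{equation*}
with $p_{\mathrm{lo}}=0$ when $k=0$ and $p_{\mathrm{hi}}=1$ when $k=N$.  For example, $k=0$ gives $p_{\mathrm{hi}}=1-0.025^{1/500}\simeq0.0074$, so a test that never rejects in 500 null simulations is only shown to have size below about 0.74\%.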

924:

925: From a Bayesian perspective, $2\ln\mbox{BF}>0$ indicates evidence in

926: favour of the model with a QTL over the model with no QTL.  As

927: tables~\ref{tab:power-no-error} and \ref{tab:power-error}

928: show, the test with this critical region has small size and reasonable

929: power.  However, much more simulation work, for different models and

930: combinations of parameters, is required to establish the generality of

931: this result.  Also, at least from a classical perspective it is

932: desirable to be able to choose a critical region according to the size

933: (or perhaps power) that is desired.

934:

935: An arbitrary critical region is

936: $2\ln\mbox{BF}>2\ln{(\frac{1-\alpha}{\alpha})}$.  I say arbitrary

937: because this in fact guarantees nothing about the classical sense

938: error rate, but does bound the Bayesian sense error rate: The

939: posterior probability that there is no QTL is less than $\alpha$,

940: $\Pr{(\mbox{no QTL}|\hat{\vec{y}})}<\alpha$, when the model, the prior

941: $\Pr{(\mbox{QTL})}=\Pr{(\mbox{no QTL})}$, and the prior for

942: $(\mu,\tau,\rho)$ are correctly specified. It can be seen from

943: tables~\ref{tab:power-no-error} and \ref{tab:power-error} that these

944: arbitrary critical regions give tests with true sizes that are smaller

945: than $\alpha$.  Such tests are therefore conservative.  Causes may

946: include the simplified model used to compute the BF, the dependence of

947: the BF on the prior specification $T$, and the fact that the critical

948: region bounds the Bayesian sense error rate rather than controls the

949: classical sense error rate.  The use of these arbitrary critical

950: regions $2\ln\mbox{BF}>2\ln{(\frac{1-\alpha}{\alpha})}$ entails a loss

951: of power due to the actual size of the test being smaller than

952: intended, so a better method for determining a critical region is

953: desirable.
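
The Bayesian sense bound follows in one step from (\ref{eq:BF-def}).  As a sketch, with equal prior probabilities for the two models the posterior odds equal the BF, so
\begin{equation*}
  \Pr{(\mbox{no QTL}|\hat{\vec{y}})} = \frac{1}{1+\mbox{BF}} < \alpha
  \quad\Longleftrightarrow\quad
  \mbox{BF} > \frac{1-\alpha}{\alpha}\;\mbox{.}
\end{equation*}
Numerically, the thresholds on $2\ln\mbox{BF}$ are $2\ln{(19)}\simeq5.89$ for $\alpha=0.05$ and $2\ln{(99)}\simeq9.19$ for $\alpha=0.01$.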

954:

955: Assuming goodness of the $\chi^2_{(1)}$ approximation, with the shrinking

956: factor (\ref{eq:visscher-shrink-factor}), and using simple Bonferroni

957: correction, suggests an approximate critical region $\pmin\times

958: L<\alpha$.  These tests have true sizes equal to or slightly smaller

959: than their nominal sizes, which is expected because the Bonferroni

960: correction is conservative.  This effect is expected to increase in

961: severity as the marker density increases, because there will be a

962: greater number of more positively correlated tests.
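
The conservativeness can be seen from the union bound, sketched here with $p_i$ denoting the single point $p$-value at marker $i$ (notation introduced for illustration).  For a given $L$, if each $p_i$ is uniformly distributed under the null model, then
\begin{equation*}
  \Pr{(\pmin\times L<\alpha)} =
  \Pr{\left(\bigcup_{i=1}^{L}{\{p_i<\alpha/L\}}\right)}
  \le \sum_{i=1}^{L}{\Pr{(p_i<\alpha/L)}} = \alpha
\end{equation*}
with equality only if the events are mutually exclusive.  Positive correlation between tests at nearby markers makes the events overlap, so the true size falls further below $\alpha$.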

963:

964: Although these respectively arbitrary and approximate methods for

965: determining critical regions are not terribly accurate, and cannot be

966: recommended, it is worth noting that there is no clear difference in

967: power between the two test statistics, for tests with the same nominal

968: size.  Since the test based on $2\ln\mbox{BF}$ is more conservative,

969: it might reasonably be preferred.

970:

971: It is not very meaningful to compare the power of tests with different

972: sizes.  Therefore I used simulations to estimate exact critical

973: regions, so that the power of different tests with true size $\alpha$

974: could be compared.  For these tests, the estimated size is exactly

975: equal to the nominal size, because the same set of simulations is

976: used to compute both values.  Although the critical region for

977: $\alpha=0.01$ is unlikely to be well estimated using only 500

978: simulations, by a simple symmetry argument this procedure still allows

979: a fair comparison across the different test statistics.  In every case

980: the test based on $2\ln\mbox{BF}$ is substantially more powerful than

981: the test based on $\pmin\times L$.

982:

983: By combining simulations in which there is and is not a QTL, we can

984: view test statistics as \emph{classifiers}, and ask how well they

985: discriminate between the two cases.  The receiver operating

986: characteristics (ROC) for the two statistics are compared in

987: figures~\ref{fig:roc-no-error} and \ref{fig:roc-with-error}.  The ROC curves are equivalent to plotting estimated power

988: ($=\mbox{sensitivity}$) against size of test ($=1-\mbox{specificity}$)

989: for all possible tests (in fact, only all tests with non-disjoint

990: critical regions).  When viewed in this way, it can be seen that the

991: BF based statistic is uniformly equal to or superior to the minimum

992: $p$-value based statistic.  The advantage of the BF is greater when

993: there are errors in allele frequency estimation.  This may be because,

994: when the dataset is less informative, it may be more important to have

995: a model based way to combine information across SNPs.

996:

997: For comparison, I have also plotted the ROC curves for a test

998: statistic derived from the nonparametric likelihood approach that I

999: have described previously \citep{johnson2005a}.  This nonparametric

1000: likelihood ratio (NLR) statistic is defined in the same way as the

1001: BF~(\ref{eq:BF-def}), using the same value for

1002: $\Pr{(\hat{\vec{y}}|\mbox{no QTL})}$, but makes the approximation

1003: \begin{equation}

1004:   \label{eq:proftestdef}

1005:   \Pr{(\hat{\vec{y}}|\mbox{QTL})} \simeq

1006:   \prod_{i=1}^{L}{\Pr{(\ey{d}{i},\ey{c}{i}|x_i^*)}}

1007: \end{equation}

1008: where $(x_1^*,x_2^*,\ldots,x_L^*)$ is the set of hidden states in the

1009: HMM that maximise the probability~(\ref{eq:proftestdef}) under an

1010: order restriction that they are either a weakly increasing sequence, a

1011: weakly decreasing sequence, or a weakly increasing then weakly

1012: decreasing sequence.  This order restriction must be true regardless

1013: of the shape of the genealogy at the QTL.  The NLR statistic is not a

1014: very good approximation to the BF, in particular because its logarithm can never

1015: be negative.  As far as I know, there is no theoretical reason to

1016: believe that it should have good properties as a test statistic.

1017: However, as figures~\ref{fig:roc-no-error} and

1018: \ref{fig:roc-with-error} show, tests based on the NLR are superior to

1019: tests based on $\pmin\times L$ and are not clearly distinguishable

1020: from tests based on the BF.  For the simulated datasets studied here,

1021: once the lookup table of emission

1022: probabilities~(\ref{eq:probSNPsumsum}) has been computed, computing

1023: the NLR is over $10^4$ times faster than computing the BF.

1024: Furthermore, a Viterbi-like algorithm \citep[see][]{durbinbook} for

1025: computing the NLR has time complexity $\mathrm{O}{(L\times\nd)}$,

1026: compared with the CPQ algorithm for computing the BF which has time

1027: complexity $\mathrm{O}{((L+d_\mu)\times\nd^2)}$.

1028:

1029: It may concern some readers that the critical regions and sizes and

1030: powers of tests were all estimated while allowing the numbers of

1031: cases, controls and marker SNPs all to vary across simulations.  To

1032: interpret results acquired in this way, a formal classical framework

1033: would require us to view the genotype relative risk $g$ as the single

1034: parameter, and the number of case chromosomes \nd{} and number of SNPs

1035: $L$ as random variables.  It is true that even in such a framework we

1036: would normally wish to perform tests conditional on the values of

1037: ancillary variables such as \nd{} and $L$ that contain no information

1038: about whether there is a QTL in the region.

1039: However, it is a feature (or weakness) of classical inference that one

1040: is often free to choose whether to condition on any given variable

1041: \citep[but see e.g.][]{jaynes1976,robinson1979}.  The present results

1042: therefore do have a sound classical interpretation.  In any case, the

1043: small number of simulations performed here does not allow the luxury of

1044: estimating critical regions conditional on \nd{} or $L$.  The

1045: simulation procedure as used reflects a likely feature of real

1046: datasets, that SNP density will be higher in regions of the genome

1047: where the genealogy is deeper.  To alter the simulation procedure so

1048: that all simulated datasets had the same value of $L$ would require

1049: the introduction of an \textit{ad hoc} algorithm to select the $L$

1050: markers to be used from a larger number of candidates.

1051:

1052: As shown in figure~\ref{fig:teststats}, the null distributions of both test statistics show a negative

1053: relationship with $L$.  The negative relationship is most pronounced

1054: for the $2\ln\mbox{BF}$ statistic, for the situation where there are

1055: errors in allele frequency estimation.  In this case a linear

1056: regression of $2\ln\mbox{BF}$ on $L$ had a slope significantly

1057: different from zero ($p=0.010$), and if the values of $2\ln\mbox{BF}$

1058: are partitioned into two groups according to the rank of $L$, the

1059: hypothesis that they are drawn from the same distribution can be

1060: rejected using a Kolmogorov--Smirnov test ($p=0.004$).  These tests do

1061: not detect significant dependence ($p>0.05$) for the $2\ln\mbox{BF}$

1062: statistic when the allele frequencies are known exactly, or for the

1063: $\pmin\times L$ statistic.

1064:

1065: Although the simulations described here are adequate for demonstrating

1066: the superiority of the BF based test over the \pmin{} based test, we

1067: should be cautious about extrapolating from the current results.  In

1068: particular, it seems that the arbitrary

1069: ($2\ln\mbox{BF}>2\ln{(\frac{1-\alpha}{\alpha})}$) or approximate

1070: (Bonferroni) critical regions described above will become more

1071: conservative as SNP number or density increases.  Performing tests

1072: that are not conditioned on SNP number and density will introduce

1073: recognisable subset biases \citep[see e.g.][]{robinson1979}.  In a

1074: real situation, a critical region should be determined using

1075: simulations conditioned on as many ancillary statistics of the

1076: observed data as possible, although for complex simulation models it

1077: may be a matter of guesswork which statistics are approximately

1078: ancillary.  An approach that could be most useful in practice is a

1079: variant of the permutation test of \citet{churchill1994}.  This could

1080: be applied if there were matched pairs of pools of cases and controls,

1081: and each pair were typed in separate DNA pooling experiments

1082: \citep{shiffman2004}.  Then the phenotype labels could be permuted

1083: within each pair, giving a set of equiprobable values for any test

1084: statistic under the null hypothesis.  Such an approach could not be

1085: explored here because it would require too much computation.

1086:

1087:

1088: \subsection{Sensitivity to Prior Specification}

1089: \label{sec:prior-specification}

1090: It is important to appreciate that the Bayes factor does not depend on

1091: the prior probabilities for the two models (QTL or no QTL), but

1092: \emph{does} depend on the priors for the parameters within the QTL

1093: model.  Misspecification of these priors could adversely influence the

1094: performance of the BF as a test statistic, and it is important to

1095: examine typical levels of robustness to the prior.  To explore this, I

1096: compare the analyses above that used relatively flat generic priors to

1097: analyses that used priors that were in a way optimised for the

1098: simulated datasets under consideration.

1099:
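The dependence can be seen schematically as follows.  This is a sketch in shorthand introduced here for illustration only: $\theta$ stands for all parameters of the QTL model, and ``QTL'' and ``no QTL'' label the two models.
\begin{equation*}
  \mbox{BF}=\frac{\Pr{(\hat{\vec{y}}\mid\text{QTL})}}{\Pr{(\hat{\vec{y}}\mid\text{no QTL})}},
  \qquad
  \Pr{(\hat{\vec{y}}\mid\text{QTL})}=\int \Pr{(\hat{\vec{y}}\mid\theta,\text{QTL})}\,p(\theta\mid\text{QTL})\,\dif\theta .
\end{equation*}
The prior model probabilities enter only afterwards, through the identity that the posterior odds equal the BF multiplied by the prior odds, whereas the within-model prior $p(\theta\mid\text{QTL})$ appears inside the integral that defines the numerator.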

1100: Note that the prior for $\mu$ is correct, but that the approximate

1101: model here uses two other parameters $\tau$ and $\rho$ that do not

1102: have any direct correspondence to parameters of the coalescent model

1103: that the data were simulated under.  It is therefore difficult to say

1104: what the best prior is for analysing the simulated datasets.  Loosely

1105: speaking, we might imagine that for any one simulated dataset, in the

1106: limit of an infinite amount of informative data the posterior for

1107: $\tau$ or $\rho$ would converge to a single value, which we could call

1108: the ``best approximating'' value for that dataset.  However, with less

1109: than an infinite amount of data the posterior mean for either variable

1110: would lie somewhere between the prior mean and the best approximating

1111: value.  Thus the distribution of posterior means across simulations

1112: would be (very loosely speaking) in between the degenerate distribution

1113: at the prior mean, and the distribution of best approximating values.

1114: Figure~\ref{fig:new-prior} shows that this is indeed the case for $\mu$, for which the prior mean

1115: is 0.5 and the correct prior is uniform on $[0,1]$.  The

1116: distributions of posterior means for $\tau$ and $\rho$ shown in

1117: figure~\ref{fig:new-prior} suggest a lognormal prior for $\tau$ (with

1118: $\ln{(\tau)}$ having prior mean 6.8 and prior standard deviation 0.74)

1119: and a beta prior for $\rho$ (with $R_1=3.2$ and $R_0=7.8$, $\rho$

1120: having prior mean 0.29).  Here I am assuming independent priors.  Note

1121: that this exercise in prior specification was totally \textit{ad hoc}.

1122:
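For orientation, the stated prior parameters imply the following summary values.  This is arithmetic from the standard moment formulae for the beta and lognormal distributions, added here as a check rather than taken from the analyses; $\tau$ is in whatever units are used elsewhere in the paper.
\begin{align*}
  \mathrm{E}{(\rho)} &= \frac{R_1}{R_1+R_0}=\frac{3.2}{11}\approx 0.29, &
  \mathrm{sd}{(\rho)} &= \sqrt{\frac{R_1R_0}{(R_1+R_0)^2(R_1+R_0+1)}}=\sqrt{\frac{24.96}{1452}}\approx 0.13,\\
  \medn{(\tau)} &= \me^{6.8}\approx 9.0\times 10^{2}, &
  \mathrm{E}{(\tau)} &= \me^{6.8+0.74^2/2}\approx 1.18\times 10^{3}.
\end{align*}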

1123: As can be seen in figures~\ref{fig:roc-no-error} and

1124: \ref{fig:roc-with-error}, the ROC of the test statistic

1125: $2\ln\mbox{BF}$ computed using these priors is hardly

1126: different from that using the original priors.  The power for

1127: tests of sizes $\alpha=0.05$ and $\alpha=0.01$ is not significantly

1128: different, based on 500 simulations.  This suggests that the

1129: performance of the BF as a test statistic, for these datasets, is

1130: quite robust to prior specification within the QTL model.

1131:

1132: Because most of the computation in CPQ can be reused, computing the BF

1133: for a different prior took on average less than three minutes,

1134: compared with the 36 minutes required to compute the BF for the

1135: original prior.

1136:

1137: \subsection{Estimation of QTL Position}

1138: \label{sec:estim-qtl-posit}

1139: Figures~\ref{fig:example} and \ref{fig:nexample} show analyses of four

1140: randomly chosen simulated datasets with QTLs ($g=4$).  These

1141: illustrate the fact that these datasets contain only weak information

1142: about the position of the QTL (or at least that the Bayesian method

1143: described here only extracts weak information).  It nonetheless seems

1144: worthwhile to examine how much information is present.

1145:

1146: It has been suggested that the map position of the marker with the

1147: most significant single point test result (i.e.\ the minimum

1148: $p$-value) would be a ``good'' point estimate for the position of the

1149: QTL \citep{kaplanmorris2001a,kaplanmorris2001b}.  However, I point out

1150: that it is asymptotically inadmissible for a model very similar to the

1151: one assumed here.  This argument considers the limit of a QTL of small

1152: effect.  One can imagine models where the position of the marker with

1153: the minimum $p$-value, \muminp, will tend to become uniformly

1154: distributed on $(0,1)$, independent of the true value of $\mu$, as the

1155: effect of the QTL tends to zero.  The estimator \muminp{} then has

1156: expected loss $\frac{1}{2}-\mu(1-\mu)$ under absolute error loss and

1157: $\frac{1}{3}-\mu(1-\mu)$ under squared error loss.  The estimator

1158: $\hat{\mu}=\frac{1}{2}$ has uniformly lower expected loss,

1159: $|\frac{1}{2}-\mu|$ under absolute error loss and

1160: $(\frac{1}{2}-\mu)^2$ under squared error loss.  This argument does

1161: not technically apply for the model simulated here because SNPs

1162: (including the QTL) tend to be concentrated in regions where the

1163: genealogy is deepest, so even completely ignoring the genotype data,

1164: the position of any SNP is informative about the positions of all

1165: other SNPs including the QTL.  It does however suggest that better

1166: point estimates may be found, and suggests what their asymptotic

1167: behaviour ought to be, at least approximately.

1168:
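The expected losses in this argument can be checked directly.  As a sketch, suppose (as in the limiting argument) that \muminp{} is distributed as $U$, uniform on $(0,1)$ and independent of $\mu$; the symbols $U$ and $d$ are introduced here only for this check.
\begin{align*}
  \mathrm{E}|U-\mu| &= \int_0^{\mu}(\mu-u)\,\dif u+\int_{\mu}^{1}(u-\mu)\,\dif u
   = \tfrac{1}{2}\mu^2+\tfrac{1}{2}(1-\mu)^2 = \tfrac{1}{2}-\mu(1-\mu),\\
  \mathrm{E}(U-\mu)^2 &= \tfrac{1}{12}+\left(\tfrac{1}{2}-\mu\right)^2 = \tfrac{1}{3}-\mu(1-\mu).
\end{align*}
Writing $d=|\frac{1}{2}-\mu|$, these are $\frac{1}{4}+d^2$ and $\frac{1}{12}+d^2$ respectively.  The constant estimator $\hat{\mu}=\frac{1}{2}$ has losses $d$ and $d^2$.  Under absolute error loss $d\le\frac{1}{4}+d^2$ because $(d-\frac{1}{2})^2\ge 0$, with equality only when $\mu$ is 0 or 1, and under squared error loss $d^2<\frac{1}{12}+d^2$ for every $\mu$.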

1169: The performance of different methods for estimating the position of

1170: the QTL was assessed using the 500 simulations with $g=4$, for the two

1171: situations with and without errors in allele frequency estimation.

1172: Due to the nature of the simulations performed here, the errors

1173: reported are averaged over the distribution of the true value of

1174: $\mu$.  They are therefore not classical expected losses in the strict

1175: sense, but expected losses averaged with respect to a distribution of

1176: parameter values.  Bayesian point estimators have uniquely best

1177: performance when measured in this way \citep[ch.~5]{ohaganforster};

1178: the theory requires that the model and prior are both correct.  In

1179: particular, under squared error losses the average expected loss is

1180: minimised by the posterior mean, and under absolute error losses the

1181: average expected loss is minimised by the posterior median.  As shown

1182: in table~\ref{tab:point-exp-loss}, point estimators derived from the

1183: posterior calculated using the Bayesian method described here are

1184: superior to \muminp{}, the map location of the marker with the

1185: smallest $p$-value.  When allele frequencies in each pool are known

1186: exactly the Bayesian analysis produces a 21\% reduction in root average

1187: mean squared error and a 13\% reduction in average mean absolute

1188: error. When there are errors in allele frequency estimation the

1189: figures are similar, 18\% and 11\% respectively.  The nonparametric

1190: method developed previously by me \citep{johnson2005a} produces point

1191: estimates that are competitive with the Bayesian method under squared

1192: error losses.

1193:
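The optimality statement used above can be made explicit.  In notation introduced here for illustration only, let $\ell(m,\mu)$ be the loss from reporting $m$ when the true position is $\mu$, and let $p(\mu)$ be the distribution of $\mu$ across simulations.  Assuming the model and prior are correct, the average expected loss of an estimator $\hat{\mu}(\hat{\vec{y}})$ can be rewritten as
\begin{equation*}
  \int \mathrm{E}\left[\ell\{\hat{\mu}(\hat{\vec{y}}),\mu\}\mid\mu\right]p(\mu)\,\dif\mu
  =\int\left[\int \ell\{\hat{\mu}(\hat{\vec{y}}),\mu\}\,p(\mu\mid\hat{\vec{y}})\,\dif\mu\right]\Pr{(\hat{\vec{y}})}\,\dif\hat{\vec{y}},
\end{equation*}
where the outer integral is a sum if the data are discrete.  The right hand side is minimised by minimising the inner, posterior expected, loss separately for each dataset.  For $\ell(m,\mu)=(m-\mu)^2$ the minimiser is $m=\expn{(\mu\mid\hat{\vec{y}})}$, and for $\ell(m,\mu)=|m-\mu|$ it is $m=\medn{(\mu\mid\hat{\vec{y}})}$.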

1194: Figures~\ref{fig:coverage} and~\ref{fig:ncoverage} show results from the 500 datasets simulated with $g=4$.  The coverage

1195: of credibility intervals constructed from the marginal posterior for

1196: $\mu$ falls well below nominal levels.  This suggests strongly that

1197: the simple model used for the analyses is not a good approximation to

1198: the more realistic model the data were simulated under.  One way to

1199: improve the model would be to allow a more realistic model for the

1200: shape of the genealogy at the QTL.  To explicitly model this genealogy

1201: and hence the joint distribution of breakpoints between ancestral and

1202: nonancestral chromosome would require something like the MCMC sampler

1203: of \citet{rannalareeve2001} or \citet{morris2002}.  Although taking

1204: such an approach is highly desirable, it may not scale well to large

1205: datasets and it seems worthwhile to investigate approximations.

1206:

1207: One approximation is the ``pairwise correction'' derived by

1208: \citet{mcpeek1999} and justified by them by the use of a quasi-score

1209: function, and used in a Bayesian context by \citet{morris2000}.

1210: Essentially, this involves flattening the likelihood function by

1211: raising all likelihoods to a power $w_n=(1+(n-1)c_n)^{-1}<1$.  Here

1212: $c_n$ is the pairwise correlation over sampled chromosomes of the

1213: conditional score function for the position of the QTL.  An expression

1214: for $c_n$ for a coalescent model is given in appendix D of

1215: \citet{mcpeek1999}.  The $n$ in this equation (which $c_n$ also

1216: depends on) is the number of chromosomes carrying the QTL.  It is not

1217: at all clear whether or how this correction should be applied in the

1218: present context, because (i) as noted by \citet{morris2000} the

1219: quasi-score justification used by \citet{mcpeek1999} does not apply in

1220: a Bayesian setting, (ii) in the present work the likelihood is never

1221: written as a conditional product across chromosomes carrying the QTL,

1222: (iii) it is not known how many chromosomes carry the QTL, and (iv) in

1223: my computational implementation no proper likelihoods are ever

1224: calculated, only likelihoods marginal to $(\pi_1,\ldots,\pi_L)$.

1225: However, the following \textit{ad hoc} approach does produce corrected

1226: credibility intervals that achieve coverage very close to their

1227: nominal levels.  The procedure is to first estimate $n$ by

1228: $\nd\mathrm{E}{(\rho)}$, the product of the number of case chromosomes

1229: and the posterior expectation of $\rho$, and then to flatten the

1230: marginal posterior for $\mu$ by raising it to a power $w_n$ and

1231: renormalising.  For the simulations performed here, $w_n$ had median

1232: $0.56$ and interquartile range 0.48--0.63.  When this procedure was

1233: applied, good agreement between nominal and achieved coverage was

1234: obtained (figures~\ref{fig:coverage} and \ref{fig:ncoverage}).  This

1235: suggests that the most serious misspecification of the current model

1236: is the assumption of a star shaped genealogy, rather than the

1237: assumption of linkage equilibrium in nonancestral blocks or the

1238: absence of the disease variant in the control pool.

1239:
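The flattening procedure described above can be summarised as follows.  This is a sketch: $\hat{n}$ and $\tilde{p}$ are symbols introduced here, and the range of integration assumes that $\mu$ is scaled to lie in $[0,1]$.
\begin{equation*}
  \hat{n}=\nd\,\mathrm{E}{(\rho\mid\hat{\vec{y}})},\qquad
  w_{\hat{n}}=\frac{1}{1+(\hat{n}-1)c_{\hat{n}}},\qquad
  \tilde{p}(\mu\mid\hat{\vec{y}})=
  \frac{p(\mu\mid\hat{\vec{y}})^{w_{\hat{n}}}}{\int_0^1 p(\mu'\mid\hat{\vec{y}})^{w_{\hat{n}}}\,\dif\mu'} .
\end{equation*}
As a purely hypothetical numerical example (these values are not taken from the simulations), $\hat{n}=30$ and $c_{\hat{n}}=0.027$ would give $w_{\hat{n}}=1/(1+29\times 0.027)=1/1.783\approx 0.56$.  Because $w_{\hat{n}}<1$, the flattened density $\tilde{p}$ is more spread out than the original posterior, so credibility intervals computed from it are wider.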

1240:

1241: \subsection{Application to real data}

1242: \label{sec:appl-real-data}

1243: It is not really possible to examine the effectiveness of the Bayesian

1244: method described here on real data, due to a lack of relevant

1245: published datasets.  Primarily for the purpose of comparison with

1246: other fine scale mapping methods, I have applied it to the dataset of

1247: \citet{hosking2002}, and to quasi-synthetic datasets generated from

1248: that dataset.  \citet{hosking2002} collected data using individual

1249: genotyping.  In order to pretend that the data were acquired using DNA

1250: pooling, I use a hypergeometric error model to relate the observed

1251: counts with missing data to the underlying full data that were not

1252: observed.  This assumes the data are missing at random within and

1253: across SNPs.

1254:

1255: To my knowledge, no fine scale mapping method has been published that

1256: does not perform well on the data of \citet{hosking2002}.  Therefore,

1257: observing that the present method performs acceptably, as shown in

1258: figure~\ref{fig:hosking}, is not necessarily encouraging.  To simulate a disease with a complex

1259: genetic basis, I generated three datasets by randomly relabelling

1260: controls as cases with probability 10\%, 20\% or 30\%.  As shown in

1261: figure~\ref{fig:hosking}, on all four datasets 95\% credibility

1262: intervals covered the true location of the CYP2D6 gene after the

1263: correction factor of \citet{mcpeek1999} had been applied to flatten

1264: the posterior. This provides weak evidence that the method developed

1265: here may be reliable for mapping QTLs from real data.

1266:

1267: \section{Discussion}

1268: \label{sec:discussion}

1269: In this paper I have described and tested a Bayesian method for

1270: detecting and mapping a QTL, using multilocus data collected using DNA

1271: pooling within two trait groups.

1272:

1273: Relatively recently, likelihood based fine scale mapping methods have

1274: been developed for genotype data that build on previous haplotype

1275: based analyses by treating the unobserved haplotypes as missing data

1276: and integrating over all possible haplotypes that are consistent with

1277: the observed genotypes.  This integration can be performed either

1278: using Markov chain Monte Carlo (MCMC)

1279: \citep{liu2001,reeverannala2002,morris2003} or using exact numerical

1280: methods for hidden Markov models \citep{zhangzhao2002}.  Data from DNA

1281: pools are estimated counts of alleles at each locus with no phase

1282: information.  Fine scale mapping from genotype data and from DNA pools

1283: can in theory be regarded as closely related missing data problems.

1284:

1285: The approach taken in this paper combines elements of the approaches

1286: of \citet{zhangzhao2002} and of \citet{morris2000} and

1287: \citet{liu2001}.  Like \citet{zhangzhao2002}, I use a model that is

1288: sufficiently simple that I can use hidden Markov model (HMM) methods

1289: to sum over all possible haplotypes that are consistent with the

1290: observed data.  However, after computing the likelihood using a

1291: propagation algorithm, \citet{zhangzhao2002} then maximise that

1292: likelihood with respect to the remaining model parameters.  In

1293: contrast and like \citet{morris2000} and \citet{liu2001}, I embed the

1294: HMM within a fully Bayesian approach and compute posterior probability

1295: distributions for the quantities of interest.

1296:

1297: One advantage of a Bayesian approach is that probability statements

1298: can be made directly about quantities of interest.  For example, we

1299: can state the probability that there is a QTL in any given region,

1300: including the whole region under study.  Thus, mapping and detecting a

1301: QTL are intimately related aspects of the same analysis.  They are

1302: different inferences that are made from the same posterior probability

1303: distribution.  Within the Bayesian framework there is no need to

1304: choose between a bewildering array of estimators, test statistics and

1305: methods for correcting for multiple testing; the approach has a

1306: pleasing simplicity, at least conceptually.

1307:

1308: However, the probabilities computed in a Bayesian analysis are only

1309: meaningful if the model and prior are realistic.  The Catch-22 is that

1310: in order to compute Bayesian posterior probabilities, I had to assume a

1311: model that was worryingly oversimplified and not very believable.  The

1312: present work is therefore best regarded as a step towards Bayesian

1313: analysis of data collected using DNA pooling.  It may be helpful to

1314: draw parallels with methods for analysis of genotype data (collected

1315: using individual typing).  Sadly, the present method allows us to make

1316: inferences assuming a model less elaborate than the one of

1317: \citet{morris2000}, whereas we might aspire to being able to assume a

1318: model like the one of \citet{morris2002} or

1319: \citet{zollnerpritchard2005}.  However, Bayesian analysis of such

1320: realistic models has required Markov chain Monte Carlo (MCMC) to

1321: integrate over high dimensional spaces of auxiliary variables or

1322: missing data.  Such computationally intensive approaches may have

1323: difficulty handling large datasets.  In contrast, the method described

1324: here is relatively fast, and large datasets could be analysed with

1325: realistic computational resources.  For example, 27 processor-days

1326: would be required to analyse data from a whole genome scan with 100

1327: cases, 500,000 SNPs, and evaluation of the posterior at points 50kb

1328: apart.  In contrast, \citet{zollnerpritchard2005} estimate that their

1329: MCMC based procedure for data from individual typings would take 85

1330: processor-years for the same scale of analysis.  A further advantage

1331: of avoiding Monte Carlo methods is that the large numbers of analyses

1332: needed for a sliding window analysis, or a permutation test, can be

1333: performed without needing human intervention to adjust mixing

1334: parameters or monitor convergence.  Finally and perhaps most

1335: significantly, I am able to compute a Bayes factor (BF) to compare

1336: models in which there is, and is not, a disease QTL in the whole

1337: region of interest.  To my knowledge, no association mapping method

1338: using genotype data is able to do this, although \citet{patterson2004}

1339: are able to compute a BF for \emph{admixture} mapping using

1340: genotype data.

1341:

1342: There is a Bayesian justification for the present method. (``This is

1343: the best model for which a Bayesian analysis of data from DNA pools is

1344: currently possible.'')  However, serious concerns about model

1345: inadequacy (``Well, that model simply isn't good enough!'') mean that,

1346: in this paper, I have mostly focussed on the classical frequentist

1347: justification.  Using simulations assuming a more realistic model, I

1348: have shown that the present method is uniformly superior to classical

1349: single point methods of analysis.  Single point methods are the most

1350: obvious way to analyse data collected using DNA pooling, although

1351: composite likelihood methods

1352: \citep{terwilliger1995,xiong1997,collins1998,maniatis2004,maniatis2005}

1353: could also be used.  The simulation results demonstrate that the BF

1354: computed using the present method makes a more powerful test for

1355: the presence of a QTL than the minimum $p$-value from single point

1356: tests, that the posterior density for the position of the QTL leads to

1357: a better point estimator than the position of the marker with the

1358: minimum $p$-value, and that well calibrated credibility intervals can

1359: be derived from the posterior density for the position of the QTL,

1360: after applying the correction of \citet{mcpeek1999}.

1361:

1362: The performance of composite likelihood (CL) methods was not examined

1363: here.  This was because no CL method has been developed that allows

1364: errors in allele frequency estimation, and because, to my knowledge,

1365: no CL method assumes a model that is obviously more realistic than the

1366: model assumed by the present method.  In particular, all CL methods

1367: implicitly assume linkage equilibrium in non-ancestral blocks of

1368: chromosome.  In the notation of the present paper, CL methods assume

1369: that the number of chromosomes carrying ancestral haplotype at the

1370: $i$-th SNP, $x_i$, is conditionally independent across SNPs.  Even a

1371: poor model that does capture some aspect of the dependence across

1372: SNPs, such as the star shaped genealogy assumed here, seems

1373: preferable.  To my knowledge, there is no CL method that produces well

1374: calibrated confidence or credibility intervals.  Perhaps because of

1375: this, \citet{maniatis2005} state that ``[t]he main objective in

1376: positional cloning is to estimate the kb location of a causal SNP as

1377: accurately as possible, with its support interval an important but

1378: secondary objective.''  However, it seems to me that we should focus

1379: on methods for computing well calibrated credibility intervals, and

1380: ideally a well calibrated posterior density.  The acid test is to ask

1381: whether a statistical method informs us about what is a good action or

1382: decision to be taken subsequently.  A point estimate for QTL position,

1383: without a reliable measure of precision, is not very helpful for

1384: planning future experiments to further refine the position of that

1385: QTL.

1386:

1387: One of the more surprising results is that, in the simulations

1388: performed here, the nonparametric likelihood ratio (NLR) test derived

1389: from the method proposed previously by me \citep{johnson2005a} is

1390: basically as powerful as the BF for detecting a QTL.  This is

1391: surprising because there is no theoretical basis for the NLR test

1392: statistic, but a strong theoretical basis for the BF test statistic.

1393: Since the NLR can be computed much more quickly, both in absolute and

1394: complexity terms, its performance in simulations over a wider range of

1395: parameters will be examined in a subsequent paper.

1396:

1397: Given that the NLR performs as well as the BF for detecting a QTL, but

1398: that the BF is much more expensive to compute, one might reasonably

1399: ask what are the benefits of the Bayesian method described here.

1400: Firstly, the BF has a Bayesian interpretation, and since $2\ln\mbox{BF}$

1401: can be negative it can indicate evidence, in the Bayesian sense, in favour of there

1402: being no QTL.  The NLR test statistic can never be negative, has no

1403: direct Bayesian interpretation, and is not a good approximation to the

1404: BF.  Secondly, the posterior median from the Bayesian method provides

1405: superior point estimates under absolute error losses.  Thirdly, the

1406: Bayesian method produces well calibrated credibility intervals, but

1407: the profile likelihood method I proposed previously does not

1408: \citep[see figure 4 of][]{johnson2005a}.  Finally, the unconditional

1409: coverage frequencies of credibility intervals say nothing about the

1410: conditional or Bayesian sense performance of a method.  For multistage

1411: QTL mapping experiments we should probably guide our choice of where

1412: to type further markers using the typically complex, heavy tailed and

1413: often multimodal posterior distributions computed using the Bayesian

1414: method described here, as exemplified in figure~\ref{fig:nexample}.

1415: If analysing data from a whole genome scan, I would recommend a

1416: multistage analysis that first uses the NLR statistic to identify

1417: regions of interest, and then uses the CPQ algorithm to compute Bayes

1418: factors and posterior distributions for QTL position within those

1419: regions.

1420:

1421:

1422: Given the large number of simplifications made in specifying the model

1423: used here, one might wonder why the method works at all.  The three

1424: most obviously inadequate approximations are the star shaped

1425: genealogy, the absence of the disease allele in the control pool,

1426: and the assumption of linkage equilibrium in non-ancestral

1427: blocks of chromosome.  I will briefly discuss these inadequacies in turn.

1428:

1429: Figures~\ref{fig:coverage} and \ref{fig:ncoverage} show that

1430: credibility intervals only achieve prescribed coverage levels when a

1431: correction is made for the genealogy not in fact being star shaped.

1432: This suggests a serious inadequacy of the model.  This is further

1433: supported by the observation of the very similar ROC curves in

1434: figures~\ref{fig:roc-no-error} and \ref{fig:roc-with-error} for the

1435: theoretically optimal BF (assuming a star shaped genealogy), and the

1436: NLR statistic that has no theoretical basis \citep[but allows any

1437: shape genealogy;][]{johnson2005a}.  Addressing this inadequacy is

1438: likely to lead to greater power to detect a QTL, and perhaps narrower

1439: credibility intervals at a given coverage level.  However, it will be hard to

1440: achieve without imposing a substantial computational burden.  In

1441: particular, it may become difficult to compute the BF test statistic

1442: if MCMC is used to integrate over genealogies at the QTL.

1443:

1444: Although it is conceptually straightforward to allow blocks of

1445: ancestral chromosome in the control pool, this would increase the

1446: number of hidden states at each SNP from $(\nd+1)$ to

1447: $(\nd+1)(\nc+1)$.  Since the propagation algorithm

1448: (section~\ref{sec:hidden-markov-model}) requires time that is

1449: quadratic in the number of hidden states, the analysis would be

1450: intractable using the current approach.  As an alternative, any number of

1451: separate pools could be treated as conditionally independent HMMs, but

1452: then we would have to integrate over the high dimensional space of

1453: allele frequencies and ancestral haplotypes using MCMC or importance

1454: sampling (see below).

1455:
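As a rough illustration of the scale of this increase (a back-of-envelope sketch; the pool size used is hypothetical and is not taken from the simulations), write $S$ for the number of hidden states per SNP, so that the propagation step costs of order $S^2$ operations per SNP.  Allowing ancestral blocks in the control pool multiplies this cost by
\begin{equation*}
  \frac{\left[(\nd+1)(\nc+1)\right]^2}{(\nd+1)^2}=(\nc+1)^2,
  \qquad\text{e.g.}\quad \nc=200\ \Rightarrow\ (\nc+1)^2=201^2=40\,401 .
\end{equation*}
The factor depends only on the size of the control pool and grows quadratically with it.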

1456: It is possible that the current model adapts to fit there being blocks

1457: of ancestral chromosome in the control pool, by appropriate adjustment

1458: of the allele frequency parameters.  Ancestral blocks that are

1459: explicitly modelled in the disease pool would then represent

1460: additional blocks beyond what would be expected according to the

1461: adjusted allele frequency parameters.  If this were so, the parameter

1462: $\rho$ might be best interpreted as representing the rate of

1463: \emph{excess} disease alleles in the disease pool.

1464:

1465: Since only marginal observations are available, the assumption of

1466: linkage equilibrium may be relatively innocuous.  Since there is

1467: virtually no information in the data about linkage disequilibrium,

1468: introducing parameters describing linkage disequilibrium into the

1469: model might have little effect on inferences about the quantities of

1470: interest.  It is possible to retain the present framework where all

1471: the data are modelled as a single HMM, but to include pairwise linkage

1472: disequilibrium by allowing allelic state along each chromosome to be a

1473: first order Markov chain \citep[see e.g.][]{liu2001,morris2002}.  This

1474: will be quite computationally expensive, but could be examined in the

1475: future.

1476:

1477: For the parameters chosen for the simulations performed here, the

1478: benefits of the present Bayesian method are somewhat modest.  It

1479: remains unclear whether there would be larger benefits for other

1480: values of the simulation parameters, in particular more SNPs in the

1481: dataset, and/or larger benefits from a Bayesian analysis with a more

1482: realistic model.  Clarification of both points awaits access to

1483: substantial computational resources.  It is worth commenting that

1484: many of the variables in the present model also feature in more

1485: elaborate models, and therefore the present approach could be used to

1486: generate (for example) a joint importance sampling distribution for

1487: the ancestral haplotype, allele frequencies, and age and position of

1488: the QTL.

1489:

1490: Even the simulated datasets studied here were generated under a model

1491: that lacks realism in several respects.  For example, in simulating

1492: errors in allele frequency estimation I have ignored differential

1493: amplification of the two alleles, which may cause estimates of allele

1494: frequencies obtained using DNA pools to be biased.  This manifests

1495: itself as only a second order effect on the difference in allele

1496: frequency between case and control pools \citep{visscher2003}.

1497: Differential amplification can be accommodated easily in the present

1498: method of analysis, for example by making \ey{d}{i} a vector

1499: consisting of data from the pool and also from heterozygous

1500: individuals or pools of known composition.  Even if no data from

1501: heterozygotes are available, it is possible to compute

1502: $\Pr{(\hat{y}|y)}$ by integrating over a distribution of differential

1503: amplification constants, as in the approach of \citet{moskvina2005}.

1504:

1505: One feature of the posteriors calculated using the present method (and

1506: especially after \citet{mcpeek1999} flattening) is that they are very

1507: heavy tailed, and so large credibility intervals (99\%, 99.9\%) tend

1508: to be very wide, perhaps almost as wide as credibility intervals

1509: computed from the prior!  This suggests that, if a series of fine

1510: scale mapping experiments were conducted using DNA pooling, we would

1511: not be making radical reductions in the size of the region under study

1512: at each stage, but rather would be increasing the density of markers

1513: in some regions more than others after each stage of analysis.

1514:

1515:

1516:

1517:

1518:

1519:

1520:

1521:

1522:

1523:

1524:

1525:

1526: \section*{Software}

1527: A software package implementing the methods described here is

1528: available from the web site

1529: \texttt{http://homepages.ed.ac.uk/tobyj/software/}~.  Source code is

1530: available and the package can be distributed freely under the terms of

1531: the GNU general public licence \citep{fsf1991}.

1532:

1533: \bibliography{tobyrefs}

1534:

1535: \clearpage

1536: \begin{deluxetable}{l p{0.8\textwidth}}

1537:   \tablecaption{Frequently used notations.\label{tab:notations-used}}

1538:   \tablehead{

1539:     \colhead{Symbol} & \colhead{Meaning}}

1540:   \startdata

1541:   $a_i$ & Allele present (0 or 1) on ancestral haplotype at $i$-th SNP \\

1542:   $b$,$b'$ & Backwards variables, see (\ref{eq:bvdefright}) and (\ref{eq:bvdefleft}) \\

1543:   $\betadist{(\alpha,\beta)}$ & Beta distribution with parameters $\alpha$ and $\beta$ \\

1544:   $\bindist{(n,p)}$ & Binomial distribution with parameters $n$ and $p$ \\

1545:   $\bindens{(x,n,p)}$ & Probability of drawing $x$ from a binomial distribution with parameters $n$ and $p$ \\

1546:   $\mbox{BF}$ & Bayes factor \\

1547:   CPQ & Cartesian product quadrature \\

1548:   $d_\mu$ & Number of design points used for $\mu$ in quadrature algorithm \\

1549:   $e_i$ & Precision of assay used to genotype $i$-th SNP \\

1550:   $\expdist{(\lambda)}$ & Exponential distribution with rate parameter $\lambda$ (mean $1/\lambda$) \\

1551:   $g$ & Genotype relative risk; factor by which disease allele

1552:   increases penetrance or risk \\

1553:   $\gamdist{(\alpha,\beta)}$ & Gamma distribution with shape parameter $\alpha$ and scale parameter $\beta$ \\

1554:   HMM & Hidden Markov model \\

1555:   $i$ & Index of SNP, $i=1,\ldots,L$ \\

1556:   i.b.d. & Identical by descent \\

1557:   $L$ & Number of SNPs \\

1558:   $m_i$ & Map position of the $i$-th marker \\

1559:   MCMC & Markov chain Monte Carlo \\

1560:   $\nc$, $\nd$ & Number of chromosomes in control and case pools respectively \\

1561:   $\ndist{(\mu,\sigma^2)}$ & Normal distribution with mean $\mu$ and variance $\sigma^2$ \\

1562:   $\mbox{NLR}$ & Nonparametric likelihood ratio, see (\ref{eq:proftestdef}) \\

1563:   $\pmin$ & Smallest $p$-value out of $L$ tests in single point analysis\\

1564:   $P_{i,a}$ & Prior parameter: $\p{i}{a}\sim\betadist{(P_{i,a},P_{i,1-a})}$ \\

1565:   $r$ & Number of experimental replicates used to estimate $\Delta\widehat{C_t}$\\

1566:   $R$ & Prior parameter: $\rho\sim\betadist{(R_1,R_0)}$ \\

1567:   ROC & Receiver operating characteristic \\

1568:   $T$ & Prior parameter: $\tau\sim\expdist{(T)}$ \\

1569:   $x_i$ & Number of chromosomes in case pool carrying ancestral

1570:   i.b.d.\ haplotype at $i$-th SNP \\

1571:   $x_\mu$ & Number of chromosomes in case pool carrying ancestral i.b.d.\ haplotype at position of QTL \\

1572:   \ya{c}{i}{a} & True count of allele $a$ at $i$-th SNP in control pool \\

1573:   \ya{d}{i}{a} & True count of allele $a$ at $i$-th SNP in case pool \\

1574:   \eya{c}{i}{a} & Estimated count of allele $a$ at $i$-th SNP in control pool \\

1575:   \eya{d}{i}{a} & Estimated count of allele $a$ at $i$-th SNP in case pool \\

1576:   $\hat{y}$ & Shorthand for \eya{c}{i}{1} or \eya{d}{i}{1} for some $i$ \\

1577:   $\hat{\vec{y}}$ & All the data \\

1578:   $\alpha$ & Nominal size (rate of type I error) of a test \\

1579:   $\Delta\widehat{C_t}$ & Estimated lag between PCR growth curves used to type DNA pool\\

1580:   $\gamma_G$ & Penetrance (risk of disease) for genotype $G$ at the QTL\\

1581:   $\mu$ & Map position of the disease locus \\

1582:   $\muminp$ & Map position of SNP with smallest $p$-value in single

1583:   point analysis \\

1584:   $\p{i}{1}$ & Expected frequency of allele 1 at $i$-th SNP in

1585:   non-ancestral chromosomes \\

1586:   $\rho$ & Expected frequency of disease allele in case pool \\

1587:   $\sigma$ & Standard deviation of experimental error in estimation of

1588:   $\Delta\widehat{C_t}$ \\

1589:   $\tau$ & Age of the disease allele \\

1590:   \enddata

1591: \end{deluxetable}

1592:

1593: \begin{deluxetable}{l l l l l l}

1594:   \tablecaption{Performance of tests to detect a disease QTL, when allele

1595:     frequencies in each pool are known exactly.

1596:     \label{tab:power-no-error}}

1597:   \tablehead{

1598:     \colhead{Statistic} & \colhead{Method} & \colhead{Nominal size} &

1599:     \colhead{Critical value} & \colhead{True size} &

1600:     \colhead{Power} }

1601:   \startdata

1602:   $2\ln{\mbox{BF}}$&Arbitrary&&0&0.080&0.870\\

1603:   &&&&(0.058, 0.107)&(0.837, 0.898)\\

1604:   $2\ln{\mbox{BF}}$&Arbitrary&0.05\tablenotemark{a}&5.889&0.010&0.710\\

1605:   &&&&(0.003, 0.023)&(0.668, 0.749)\\

1606:   $p_{\min}\times L$&Bonferroni&0.05&0.05&0.040&0.720\\

1607:   &&&&(0.025, 0.061)&(0.678, 0.759)\\

1608:   $2\ln{\mbox{BF}}$&Simulation&0.05&0.903&0.050&0.842\\

1609:   &&&&(0.033, 0.073)&(0.807, 0.873)\\

1610:   $p_{\min}\times L$&Simulation&0.05&0.063&0.050&0.740\\

1611:   &&&&(0.033, 0.073)&(0.699, 0.778)\\

1612:   $2\ln{\mbox{BF}}$&Arbitrary&0.01\tablenotemark{a}&9.19&0.002&0.628\\

1613:   &&&&(0.000, 0.011)&(0.584, 0.670)\\

1614:   $p_{\min}\times L$&Bonferroni&0.01&0.010&0.000&0.582\\

1615:   &&&&(0.000, 0.006)&(0.537, 0.626)\\

1616:   $2\ln{\mbox{BF}}$&Simulation&0.01&5.864&0.010&0.710\\

1617:   &&&&(0.003, 0.023)&(0.668, 0.749)\\

1618:   $p_{\min}\times L$&Simulation&0.01&0.027&0.010&0.664\\

1619:   &&&&(0.003, 0.023)&(0.621, 0.705)\\

1620:   \enddata

1621:   \tablenotetext{a}{Not a nominal size in the classical sense, but a nominal upper

1622:     bound on the error rate in the Bayesian sense.}
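  \tablecomments{A reading aid for the Bonferroni rows, assuming the
    standard Bonferroni correction over the $L$ single point tests (an
    assumption of this note, written in the notation of the table).  The
    null hypothesis is rejected when
    \[ \pmin\times L\le\alpha \quad\Longleftrightarrow\quad \pmin\le\alpha/L , \]
    so the critical value on the $\pmin\times L$ scale equals the nominal
    size $\alpha$.  Writing $p_i$ for the single point $p$-value at the
    $i$-th SNP (a symbol introduced here for illustration), and assuming
    each $p_i$ is valid under the null, $\Pr(p_i\le u)\le u$, the union
    bound gives
    \[ \Pr(\pmin\le\alpha/L) \le \sum_{i=1}^{L}\Pr(p_i\le\alpha/L) \le L\times\frac{\alpha}{L} = \alpha , \]
    so the true size should not exceed the nominal size whatever the
    dependence between SNPs, in line with the estimated true sizes of
    0.040 and 0.000 reported for these rows.}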

1623: \end{deluxetable}

1624:

1625: \begin{deluxetable}{l l l l l l}

1626:   \tablecaption{Performance of tests to detect a disease QTL, when there

1627:     are errors in allele frequency estimation with $r=2$ replicates and

1628:     $\sigma=0.2\mbox{ PCR cycles}$.

1629:     \label{tab:power-error}}

1630:   \tablehead{

1631:     \colhead{Statistic} & \colhead{Method} & \colhead{Nominal size} &

1632:     \colhead{Critical value} & \colhead{True size} &

1633:     \colhead{Power} }

1634:   \startdata

1635:   $2\ln{\mbox{BF}}$&Arbitrary&&0&0.080&0.782\\

1636:   &&&&(0.058, 0.107)&(0.743, 0.817)\\

1637:   $2\ln{\mbox{BF}}$&Arbitrary&0.05\tablenotemark{a}&5.889&0.008&0.532\\

1638:   &&&&(0.002, 0.020)&(0.487, 0.576)\\

1639:   $p_{\min}\times L$&Bonferroni&0.05&0.05&0.040&0.560\\

1640:   &&&&(0.025, 0.061)&(0.515, 0.604)\\

1641:   $2\ln{\mbox{BF}}$&Simulation&0.05&0.723&0.050&0.746\\

1642:   &&&&(0.033, 0.073)&(0.705, 0.784)\\

1643:   $p_{\min}\times L$&Simulation&0.05&0.085&0.050&0.642\\

1644:   &&&&(0.033, 0.073)&(0.598, 0.684)\\

1645:   $2\ln{\mbox{BF}}$&Arbitrary&0.01\tablenotemark{a}&9.19&0.000&0.424\\

1646:   &&&&(0.000, 0.006)&(0.380, 0.469)\\

1647:   $p_{\min}\times L$&Bonferroni&0.01&0.01&0.010&0.432\\

1648:   &&&&(0.003, 0.023)&(0.388, 0.477)\\

1649:   $2\ln{\mbox{BF}}$&Simulation&0.01&4.28&0.010&0.596\\

1650:   &&&&(0.003, 0.023)&(0.552, 0.639)\\

1651:   $p_{\min}\times L$&Simulation&0.01&0.01&0.010&0.432\\

1652:   &&&&(0.003, 0.023)&(0.388, 0.477)\\

1653:   \enddata

1654:   \tablenotetext{a}{Not a nominal size in the classical sense, but a nominal upper

1655:     bound on the error rate in the Bayesian sense.}

1656: \end{deluxetable}

1657:

1658: \begin{deluxetable}{r r r}

1659:   \tablecaption{Performance of point estimators of QTL position.\label{tab:point-exp-loss}}

1660:   \tablehead{

1661:     Estimator&\multicolumn{2}{c}{Average

1662:       expected loss under}\\\cline{2-3}

1663:     &squared error loss&absolute error loss}

1664:   \startdata

1665:   \cutinhead{Allele frequencies known exactly}

1666:   \muminp&0.208$\,{}^2$&0.120\\

1667:   $\expn{(\mu|\hat{\vec{y}})}$&0.165$\,{}^2$&0.107\\

1668:   $\medn{(\mu|\hat{\vec{y}})}$&0.166$\,{}^2$&0.105\\

1669:   NP method&0.165$\,{}^2$&0.112\\

1670:   \cutinhead{Errors in allele frequency estimation}

1671:   \muminp&0.239$\,{}^2$&0.146\\

1672:   $\expn{(\mu|\hat{\vec{y}})}$&0.195$\,{}^2$&0.132\\

1673:   $\medn{(\mu|\hat{\vec{y}})}$&0.203$\,{}^2$&0.130\\

1674:   NP method&0.198$\,{}^2$&0.136\\

1675:   \enddata

1676: \end{deluxetable}

1677:

1678: \clearpage

1679: \listoffigures

1680:

1681: \begin{figure}[p]

1682:   \begin{center}

1683:     \begin{picture}(0,0)%

1684:       \includegraphics{hier}%

1685:     \end{picture}%

1686:     \setlength{\unitlength}{4144sp}%

1687:     %

1688:     \begingroup\makeatletter\ifx\SetFigFont\undefined%

1689:     \gdef\SetFigFont#1#2#3#4#5{%

1690:       \reset@font\fontsize{#1}{#2pt}%

1691:       \fontfamily{#3}\fontseries{#4}\fontshape{#5}%

1692:       \selectfont}%

1693:     \fi\endgroup%

1694:     \begin{picture}(5697,3186)(-3164,-1918)

1695:       \put(-2429,-961){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\vec{x}=(x_1,x_2,\ldots,x_L)$}%

1696:             }}}}

1697:       \put(-2294,119){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\rho$}%

1698:             }}}}

1699:       \put(-1664,119){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\mu$}%

1700:             }}}}

1701:       \put(-1034,119){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\tau$}%

1702:             }}}}

1703:       \put(-2339,-421){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$x_\mu$}%

1704:             }}}}

1705:       \put(-1799,659){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{prior}%

1706:             }}}}

1707:       \put(1261,119){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\pi_i$}%

1708:             }}}}

1709:       \put(1261,659){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$P_i$}%

1710:             }}}}

1711:       \put(631,-421){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$a_i$}%

1712:             }}}}

1713:       \put(  1,-421){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$x_i$}%

1714:             }}}}

1715:       \put(1261,-961){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$e_i$}%

1716:             }}}}

1717:       \put(-179,1109){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$i=1,2,\ldots,L$}%

1718:             }}}}

1719:       \put(586,-1501){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\hat{y}_{\mathrm{d},i}$}%

1720:             }}}}

1721:       \put(586,-961){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$y_{\mathrm{d},i}$}%

1722:             }}}}

1723:       \put(1846,-961){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$y_{\mathrm{c},i}$}%

1724:             }}}}

1725:       \put(1846,-1501){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\hat{y}_{\mathrm{c},i}$}%

1726:             }}}}

1727:       \put(-2339,659){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$R$}%

1728:             }}}}

1729:       \put(-1079,659){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$T$}%

1730:             }}}}

1731:       \put(-1709,-1501){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{$\vec{m}$}%

1732:             }}}}

1733:       \put(-3149,659){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{prior:}%

1734:             }}}}

1735:       \put(-3149,-1501){\makebox(0,0)[lb]{\smash{{\SetFigFont{12}{14.4}{\familydefault}{\mddefault}{\updefault}{data:}%

1736:             }}}}

1737:     \end{picture}%

1738:   \end{center}

1739:   \caption{Hierarchical or Bayesian network structure of the model.  The region inside the

1740:     rectangle is duplicated for each of $L$ SNPs.  Lines indicate the

1741:     dependence structure of the model:  Variables not connected are

1742:     independent, conditional on all other variables in the model.}

1743:   \label{fig:factorisemodel}

1744: \end{figure}

1745:

1746: \begin{figure}[p]

1747:   \begin{center}

1748:     \includegraphics{roc1}

1749:   \end{center}

1750:   \caption{Sensitivity vs.\ specificity for $2\ln\mbox{BF}$ (solid

1751:     line) and $\pmin\times L$ (dashed line), when allele frequencies

1752:     in each pool are known exactly.  The dotted line shows the

1753:     performance of $2\ln\mbox{BF}$ computed using the priors obtained

1754:     from figure~\ref{fig:new-prior}, and the dot-dashed line shows the

1755:     performance of a nonparametric likelihood ratio test

1756:     statistic \citep[; see text]{johnson2005a}.}

1757:   \label{fig:roc-no-error}

1758: \end{figure}

1759:

1760: \begin{figure}[p]

1761:   \begin{center}

1762:     \includegraphics{roc2}

1763:   \end{center}

1764:   \caption{Sensitivity vs.\ specificity for $2\ln\mbox{BF}$ (solid

1765:     line) and $\pmin\times L$ (dashed line), when there are errors in

1766:     allele frequency estimation with $r=2$ replicates and

1767:     $\sigma=0.2\mbox{ PCR cycles}$.  The dotted line shows the

1768:     performance of $2\ln\mbox{BF}$ computed using the priors obtained

1769:     from figure~\ref{fig:new-prior}, and the dot-dashed line shows the

1770:     performance of a nonparametric likelihood ratio test statistic

1771:     \citep[; see text]{johnson2005a}.}

1772:   \label{fig:roc-with-error}

1773: \end{figure}

1774:

1775: \begin{figure}[p]

1776:   \begin{center}

1777:     \includegraphics{teststats2}

1778:   \end{center}

1779:   \caption{Sampling distribution of test statistics $2\ln\mbox{BF}$

1780:     (top) and $\pmin\times L$ (bottom, on log scale) under null model

1781:     ($g=1$), as functions of $L$, the number of SNPs in the simulated

1782:     data set.  The 0.95 and 0.99 quantiles are shown as solid lines.

1783:     The least squares linear regression is shown as a dotted line.

1784:     Results shown are for the situation where there are errors in

1785:     allele frequency estimation, but results are similar when allele

1786:     frequencies are known exactly.}

1787:   \label{fig:teststats}

1788: \end{figure}

1789:

1790: \begin{figure}[p]

1791:   \begin{center}

1792:     \includegraphics{newprior}

1793:   \end{center}

1794:   \caption{Original priors (dotted lines) and distribution of

1795:     posterior expectations (solid lines) for the three parameters of

1796:     the approximate model.  This suggests more accurately specified

1797:     priors (dashed lines) as described in the text.}

1798:   \label{fig:new-prior}

1799: \end{figure}

1800:

1801: \begin{figure}[p]

1802:   \begin{center}

1803:     \includegraphics{exampleB}

1804:   \end{center}

1805:   \caption{Example simulated datasets with $g=4$ and where allele

1806:     frequencies are known exactly.  Points are $-\log_{10}{p}$ for

1807:     single point $\chi^2$ tests.  Dotted lines are posterior density

1808:     and solid lines are posterior density with McPeek--Strahs

1809:     correction.  Vertical dashed lines show position of disease QTL.}

1810:   \label{fig:example}

1811: \end{figure}

1812:

1813: \begin{figure}[p]

1814:   \begin{center}

1815:     \includegraphics{nexampleB}

1816:   \end{center}

1817:   \caption{The same simulated datasets as shown in

1818:     figure~\ref{fig:example}, but with errors in allele frequency

1819:     estimation with $\sigma=0.2$ PCR cycles and $r=2$ experimental

1820:     replicates.  Points are $-\log_{10}{p}$ for single point shrunk

1821:     \citep{visscher2003} $\chi^2$ tests.  Dotted lines are posterior

1822:     density and solid lines are posterior density with McPeek--Strahs

1823:     correction.  Vertical dashed lines show position of disease QTL.}

1824:   \label{fig:nexample}

1825: \end{figure}

1826:

1827: \begin{figure}[p]

1828:   \begin{center}

1829:     \includegraphics{coverage1}

1830:   \end{center}

1831:   \caption{Nominal and achieved coverage of credibility intervals for

1832:     position of QTL, when allele frequencies are known exactly.

1833:     Credibility intervals were constructed either without (dotted

1834:     line) or with (solid line) the approximate correction factor of

1835:     \citet{mcpeek1999}.}

1836:   \label{fig:coverage}

1837: \end{figure}

1838:

1839: \begin{figure}[p]

1840:   \begin{center}

1841:     \includegraphics{coverage2}

1842:   \end{center}

1843:   \caption{Nominal and achieved coverage of credibility intervals for

1844:     position of QTL, when there are errors in allele frequency

1845:     estimation, with $\sigma=0.2$ PCR cycles and $r=2$ replicates.  Credibility intervals

1846:     were constructed either without (dotted line) or with (solid line)

1847:     the approximate correction factor of \citet{mcpeek1999}.}

1848:   \label{fig:ncoverage}

1849: \end{figure}

1850:

1851: \begin{figure}[p]

1852:   \begin{center}

1853:     \includegraphics{hosking_panel}

1854:   \end{center}

1855:   \caption{Analysis of data of \citet[; top panel]{hosking2002}, and

1856:     quasi-synthetic datasets generated by randomly relabelling

1857:     controls as cases with probability 10\%, 20\% or 30\% (lower three

1858:     panels, top to bottom). Points are $-\log_{10}{(p)}$ from single

1859:     point $\chi^2$ tests, and dashed and solid lines are the marginal

1860:     posterior for disease gene position, without and with the

1861:     correction factor of \citet{mcpeek1999}.  Vertical dashed lines

1862:     show the true position of CYP2D6.}

1863:   \label{fig:hosking}

1864: \end{figure}

1865: \end{document}

1866:

1867: