0312:q-bio0312017/caps-analysis.tex

1: \documentclass[11pt,e-only,final]{amsart}

2:

3: % -------------------------------- amsart publication data

4: \newcommand{\publname}{}

5: \dateposted{Preprint posted electronically on December 10, 2003}

6: \renewcommand{\volinfo}[1]{} % eats the comma in the cls file

7: \newcommand{\pageinfo}{Pages 1--\pageref{veryend}}

8: \copyrightinfo{2003}{M. Cs\H{u}r\"os, B. Li, and A. Milosavljevic}

9: \PII{}

10: % --------------------------------

11:

12: \usepackage{graphicx}

13: \usepackage{latexsym,cmmib57}

14: \usepackage[psamsfonts]{eucal}

15: %\usepackage{probability,url}

16: %\usepackage{marginkern}

17: %\usepackage{hyperref}

18: \usepackage[in]{fullpage}

19: \usepackage{chicago}

20:

21: \newcommand{\probgen}[4]{#1 #2{#4} #3}

22: \newcommand{\conditional}[6]{%

23:        #1\!\left#2\,{#5}\,\vphantom{#6}\right#3\left.%

24:        \vphantom{#5}{#6}\,\right#4}

25: \newcommand{\condgen}[6]{{#1}#2 #5 #3 #6 #4}

26: \newcommand{\bbrd}[1]{\mbox{\rm{I}\kern-.1667em{#1}}}

27: \newcommand{\EXP}{\mathbb{E}}

28: \newcommand{\PROB}{\mathbb{P}}

29: \newcommand{\Probsm}[1]{\probgen{\PROB}{\bigl\{}{\bigr\}}{#1}}

30: \newcommand{\Probmd}[1]{\probgen{\PROB}{\Bigl\{}{\Bigr\}}{#1}}

31: \newcommand{\Problg}[1]{\probgen{\PROB}{\biggl\{}{\biggr\}}{#1}}

32: \newcommand{\Probxl}[1]{\probgen{\PROB}{\Biggl\{}{\Biggr\}}{#1}}

33: \newcommand{\Probcsm}[2]{\condgen{\PROB}{\bigl\{}{\bigm|}{\bigr\}}{#1}{#2}}

34: \newcommand{\Probcmd}[2]{\condgen{\PROB}{\Bigl\{}{\Bigm|}{\Bigr\}}{#1}{#2}}

35: \newcommand{\Probclg}[2]{\condgen{\PROB}{\biggl\{}{\biggm|}{\biggr\}}{#1}{#2}}

36: \newcommand{\Probcxl}[2]{\condgen{\PROB}{\Biggl\{}{\Biggm|}{\Biggr\}}{#1}{#2}}

37:

38: \newcommand{\Expc}[2]{\conditional{\EXP}{[}{|}{]}{#1}{#2}}

39: \newcommand{\Expcsm}[2]{\condgen{\EXP}{\bigl[}{\bigm|}{\bigr]}{#1}{#2}}

40: \newcommand{\Expcmd}[2]{\condgen{\EXP}{\Bigl[}{\Bigm|}{\Bigr]}{#1}{#2}}

41: \newcommand{\Expclg}[2]{\condgen{\EXP}{\biggl[}{\biggm|}{\biggr]}{#1}{#2}}

42: \newcommand{\Expcxl}[2]{\condgen{\EXP}{\Biggl[}{\Biggm|}{\Biggr]}{#1}{#2}}

43:

44:

45: \newcommand{\myfigheight}{.30\textheight}

46:

47: \newtheorem{theorem}{Theorem}

48:

49: \newsavebox{\fmbox}

50: \newenvironment{fmpage}[1]

51:         {\begin{lrbox}{\fmbox}\begin{minipage}{#1}}

52:         {\end{minipage}\end{lrbox}\fbox{\usebox{\fmbox}}}

53:

54: \newcommand{\captionstyle}{\small}

55: \newcommand{\mysectionstyle}{\medskip\bf}

56:

57: \newcommand{\nclones}{N} % number of clones

58: \newcommand{\poolsize}{m} % number of clones within a pool

59: \newcommand{\clength}{L} % length of a clone

60: \newcommand{\nfrags}{F} % total number of shotgun fragments

61: \newcommand{\flength}{\ell} % expected shotgun fragment length

62: \newcommand{\coverage}{c} % shotgun coverage

63: \newcommand{\rlength}{\lambda} % random read length

64: \newcommand{\mlength}{M} % number of positions where match is detected

65: \newcommand{\npools}{K} % number of pools

66: \newcommand{\glength}{G} % genome length

67:

68: \newcommand{\cset}{\mathcal{B}} % clone set

69: \newcommand{\clone}{B}

70: \newcommand{\pset}{\mathcal{P}} % pool set

71: \newcommand{\pool}{P}

72:

73: \newcommand{\nrect}{R} % number of preserved rectangles

74: \newcommand{\rectangle}{\mathcal{R}} % rectangle

75: \newcommand{\irect}{I} % indicator for preserved rectangle

76:

77: \newcommand{\mindecon}{k} % min. pools needed for deconvolution

78: \newcommand{\sigsize}{n} % number of pools one clone is included in

79:

80: \newcommand{\incmatrix}{\mathbf{M}}

81: \newcommand{\sigvector}{\mathbf{c}}

82: \newcommand{\sigindexvector}{\mathbf{x}}

83: \newcommand{\sigdistance}{\Delta}

84:

85: \newcommand{\codeword}{\mathbf{c}}

86: \newcommand{\codeset}{\mathcal{C}}

87: \newcommand{\codelen}{\sigsize}

88: \newcommand{\codedist}{d}

89: \newcommand{\codedim}{\mindecon}

90: \newcommand{\codegen}{\mathbf{G}}

91: \newcommand{\binsubst}{\phi}

92: \newcommand{\vecsubst}{f}

93: \newcommand{\msgword}{\mathbf{u}}

94: \newcommand{\weight}{w}

95:

96: \newcommand{\eventA}{\mathcal{X}}

97: \newcommand{\eventB}{\mathcal{Y}}

98:

99: \newcommand{\Galois}[1]{\mathbb{F}_{#1}}

100: \newcommand{\ambiguity}{t}

101:

102: \newcommand{\wgscoverage}{w}

103: \newcommand{\pcoverage}{a}

104: \newcommand{\roverlap}{\sigma}

105: \newcommand{\fpooled}{\mu}

106: \newcommand{\soverlap}{\vartheta}

107:

108: \newcommand{\coverlap}{\Theta}

109:

110: \newcommand{\Pa}{P_0}

111: \newcommand{\Pb}{P_{\fpooled}^{(\sigsize)}}

112: \newcommand{\Pc}{P_{\fpooled}^{(\infty)}}

113:

114: \newcommand{\ilen}{\lambda}

115:

116: \newcommand{\ndfactor}{\beta}

117: \newcommand{\pnd}{q}

118:

119: \newcommand{\rvgaps}{G}

120: \newcommand{\rvreads}{R}

121: \newcommand{\gengaps}{\mathcal{G}}

122: \newcommand{\nreads}{r}

123: \newcommand{\ngaps}{g}

124: \newcommand{\mreads}{\lambda}

125:

126: \begin{document}

127: \title[CAPSS and CAPS-MAP]{Clone-array pooled shotgun mapping and sequencing:\\

128: 		design and analysis of experiments}

129:

130: \author[M. Cs{\H u}r\"os]{Mikl\'os Cs\H{u}r\"os}\address{MC:

131:                 D\'epartement d'informatique et de recherche op\'erationnelle,

132:                 Universit\'e de Montr\'eal,

133:                 CP 6128 succ. Centre-Ville,

134:                 Montr\'eal, Qu\'ebec H3C 3J7, Canada.

135:                 Phone: +1 (514) 343-6111x1655, Fax: +1 (514) 343-5834.}

136:                 \email{csuros@iro.umontreal.ca}

137:                 \urladdr{http://www.iro.umontreal.ca/\textasciitilde{}csuros/}

138: \author[B. Li]{Bingshan Li}\address{BL:

139: 				Human Genome Sequencing Center, Department of Molecular and Human Genetics,

140: 				Baylor College of Medicine, Houston, Texas 77030, USA.}

141: \author[A. Milosavljevic]{

142:         Aleksandar Milosavljevic}\address{AM:

143:               Bioinformatics Research Laboratory,

144:               Program in Structural and Computational Biology and Molecular Biophysics, and

145:                 Human Genome Sequencing Center ---

146:                 Department of Molecular and Human Genetics,

147:                 Baylor College of Medicine,

148:                 Houston, Texas 77030, USA.}

149:                 \email{amilosav@bcm.tmc.edu}

150:                 \urladdr{http://www.brl.bcm.tmc.edu/}

151:

152: \begin{abstract}

153: This paper studies sequencing and mapping

154: methods that rely solely on pooling and shotgun sequencing of clones.

155: First, we scrutinize and improve

156: the recently proposed Clone-Array Pooled

157: Shotgun Sequencing (CAPSS) method,

158: which delivers a BAC-linked assembly of a whole genome sequence.

159: Secondly, we introduce a novel physical mapping method, called

160: {\em Clone-Array Pooled Shotgun Mapping} (CAPS-MAP), which

161: computes the physical ordering of BACs in a random library.

162: Both CAPSS and CAPS-MAP

163: construct subclone libraries from

164: pooled genomic BAC clones.

165:

166: We propose

167: algorithmic and experimental improvements

168: that make CAPSS a viable option for

169: sequencing a set of BACs. We provide the first

170: probabilistic model of CAPSS sequencing progress. The model leads to

171: theoretical results supporting previous, less formal arguments on

172: the practicality of CAPSS.

173: We demonstrate the

174: usefulness of CAPS-MAP for clone overlap detection

175: with a probabilistic analysis,

176: and a simulated assembly

177: of the {\em Drosophila melanogaster} genome.

178: Our analysis indicates that CAPS-MAP is well-suited for

179: detecting BAC overlaps in a highly redundant library,

180: relying on a low amount of shotgun sequence information.

181: Consequently, it is a practical method

182: for computing the physical ordering of clones in

183: a random library, without requiring additional clone fingerprinting.

184: Since CAPS-MAP requires only shotgun sequence reads,

185: it can be seamlessly incorporated into a

186: sequencing project with almost no experimental overhead.

187: \end{abstract}

188:

189: \keywords{sequencing, physical mapping, pooled shotgun sequencing}

190:

191:

192: \maketitle

193:

194:

195:

196: %{\bf ACM subject classification:}

197: %J.3 [\textbf{Life and Medical Sciences}]: \textit{biology and genetics};

198: %G.2.1 [\textbf{Discrete Mathematics}]: \textit{combinatorics};

199: %G.3 [\textbf{Discrete Mathematics}]: \textit{probability and statistics}

200: %

201:

202: \section{Introduction}

203: In a {\em hierarchical approach} to large genome sequencing,

204: one first breaks many genome copies into random fragments.

205: A {\em library} is constructed by cloning the fragments,

206: typically as

207: {\em Bacterial Artificial

208: Chromosome} inserts (BACs).

209: Some BACs in the library are selected for complete sequencing.

210: Each selected BAC sequence is assembled individually

211: using the shotgun method: a {\em subclone} library

212: is prepared by cloning short fragments of the BAC.

213: Subsequently, sequence {\em reads}

214: are produced from a sufficient number of randomly chosen subclones.

215: The reads are assembled algorithmically into the BAC sequence.

216: An alternative to the hierarchical, or {\em clone-by-clone},

217: strategy is the {\em whole-genome shotgun} approach~\shortcite{WGS},

218: which employs a few (essentially 1--3) subclone libraries prepared

219: from the entire genome,

220: without resorting to an intermediate BAC library.

221: The main advantage of the whole-genome approach

222: is that it eliminates

223: the need to prepare tens of thousands of subclone libraries to

224: sequence a mammalian genome.

225: However, it is generally an inadequate strategy for finishing the assembly

226: of such large repeat-rich genomes.

227: For a review of contemporary sequencing methodologies, see, e.g.,

228: \shortciteN{Sequencing.review}.

229:

230: %

231: %BAC-end sequencing~\shortcite{BAC.end} was proposed as an alternative to restriction

232: %fingerprinting. However, the cost of BAC end sequencing is high and a significant

233: %fraction of mammalian BAC ends contain repetitive elements that are not informative for mapping.

234: %

235:

236: A new BAC-based sequencing strategy, called Clone-Array Pooled

237: Shotgun Sequencing (CAPSS), was proposed recently~\shortcite{CAPSS}.

238: CAPSS assembles the complete sequences of individual BACs

239: as does the clone-by-clone approach, but requires a much smaller

240: number of subclone library preparations.

241: The strategy is currently being applied for the first time on a

242: genome scale in the context of sequencing the honey bee genome.

243: This paper provides the theory for the design and analysis of pooling-based

244: genome projects. It also introduces the CAPS-MAP method for

245: physical mapping,

246: and transversal pooling designs for both CAPSS and CAPS-MAP,

247: thereby laying the theoretical foundation for pooling-based genome-scale

248: sequencing projects.

249:

250: \begin{figure}

251: \centerline{\includegraphics[height=0.3\textheight]{capss-method}}

252: \caption[CAPSS]{\captionstyle

253: CAPSS strategy for arrayed BACs.

254: DNA extracted from each clone is pooled together with

255: other clones in the same row

256: and column.

257: Subclone libraries are prepared from the pools,

258: and shotgun sequences are collected from the

259: sublibraries. Sequences are assembled into contigs.

260: If a contig contains sequences from a

261: row and a column pool's sublibrary,

262: the contig is assigned to the BAC at the

263: intersection of the row and the column.}

264: \label{fig:capss}

265: \end{figure}

266:

267: In a clone-by-clone approach,

268: BACs are sequenced independently:

269: one subclone library is constructed

270: for every clone.

271: In contrast, DNA from BACs is pooled together in a CAPSS approach,

272: and subclone libraries are prepared from the pools.

273: A CAPSS experiment is designed so

274: that the number of subclone libraries is much smaller than the number

275: of clones, yet the pooling design enables the assembly

276: of individual clone sequences. In what follows,

277: by {\em pooled shotgun} (CAPS) sequences we mean

278: shotgun sequence reads collected from a

279: subclone library that was constructed using pooled BACs.

280: For the computational

281: aspects of sequence assembly, pooled shotgun sequences are random subsequences

282: originating from a set of clone sequences.

283:

284: The original CAPSS proposal of \shortciteN{CAPSS} relied

285: on a simple rectangular

286: design defined by an array layout of BACs

287: (Figure~\ref{fig:capss}). The pools correspond to the rows and columns. An array layout

288: reduces the number of shotgun library preparations from the number of BACs

289: to the order of its square root when compared

290: to clone-by-clone sequencing. This reduction can be important

291: in the case of a mammalian genome,

292: for which

293: even a minimally overlapping tiling path contains between twenty and thirty

294: thousand clones~\shortcite{human.genome}.
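
As a rough illustration of this reduction (with hypothetical numbers, and assuming a square array, which the design does not require), $\nclones=22{,}500$ clones fit on a $150\times 150$ array. Clone-by-clone sequencing then needs $22{,}500$ subclone libraries, whereas the rectangular design needs one library per row and one per column:
\[
2\sqrt{\nclones} = 2\cdot 150 = 300 .
\]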

295:

296: This paper has two goals. First, after pointing out some

297: shortcomings of the original CAPSS proposal, we propose

298: algorithmic and experimental improvements

299: that make CAPSS a viable option for

300: sequencing a set of BACs.

301: Specifically, to increase the accuracy of CAPSS, we apply transversal pooling designs,

302: which we previously developed for the PGI method of comparative

303: physical mapping, a method that also uses pooled shotgun sequencing~\shortcite{PGI.conf}.

304: We provide the first

305: probabilistic model of CAPSS sequencing progress.

306: The model leads to

307: theoretical results supporting previous, less formal arguments on

308: the practicality of CAPSS.

309:

310: The paper's second goal is to introduce the {\em Clone-Array Pooled Shotgun

311: Mapping (CAPS-MAP)} method to detect clone overlaps in a random BAC library.

312: The information on clone overlaps is used to compute the physical

313: ordering of clones in

314: the library, without requiring additional clone fingerprinting.

315: CAPS-MAP operates in the same experimental framework as CAPSS.

316: It needs only

317: shotgun sequences, which makes

318: it a cost-effective method that can be seamlessly integrated into a

319: sequencing project with very little experimental overhead.

320: We demonstrate the usefulness of CAPS-MAP for clone overlap detection

321: with a probabilistic analysis.

322: In addition to the theoretical results, we illustrate the method's performance

323: in a simulated project using the Drosophila genome assembly.

324:

325:

326: \section{Transversal designs}

327: %{\mysectionstyle 2. Transversal designs. \ \ }

328: It was proposed by \shortciteN{CAPSS} that CAPSS be

329: used in hybrid projects, combining

330: whole-genome shotgun (WGS) and pooled shotgun (CAPS) sequences.

331: The motivation is that the pooled shotgun sequences

332: can provide the localization information

333: for the whole-genome shotgun sequences

334: so that the latter can be used for a

335: clone-linked assembly.

336: After WGS and CAPS sequences from a set of pools

337: are assembled into contigs,

338: the contigs need to be mapped to individual BACs.

339: There are a few challenges to contig mapping.

340: We mention here three

341: main problems: false negatives, ambiguities, and false mapping.

342: A false negative refers to a situation where a BAC is not sampled in a

343: pool it is included in, due to the low number of CAPS sequences

344: collected.

345: A false negative for a simple rectangular design means that

346: no contigs can be mapped to the BAC.

347: Ambiguities and false mappings are caused by overlapping clones,

348: or more generally, by clones that have highly similar regions.

349: The mapping of a contig is ambiguous if it is not possible to decide

350: which clones the contig should be assigned to, in cases where

351: two or more clone sets are equally likely choices for the mapping.

352: False mapping occurs when an insufficient number of CAPS

353: sequences are collected, and a contig

354: that covers overlapping BACs gets assigned to the wrong clone or clone set.

355: %False mapping is more detrimental than ambiguity

356: %since it is not detected during contig mapping.

357:

358: One strategy used to overcome the mapping problems

359: involves transversal pooling designs~\shortcite{PGI.conf,CGT}.

360: For a transversal design with~$\sigsize$ pool sets,

361: every clone is included in exactly one pool of each pool set, and any subset

362: with two of those pools uniquely identifies the clone.

363: Half of the pool

364: sets are designated as column pools, and the other half as row pools

365: to realize the design with the help of BAC arrays.

366: In a transversal

367: double-array design (i.e., one with four pool sets),

368: the same set of BACs is arrayed twice.

372: Thus, each BAC ends up being sampled in two column-pools and

373: two row-pools. One of the arrays contains an arbitrary arrangement of

374: BACs, while the other is ``reshuffled'' relative to the first.

375: More generally,

376: clones can be arranged on~$d$ reshuffled arrays using

377: a transversal pooling design with~$\sigsize=2d$ pool sets.
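
One concrete way to obtain such a reshuffling is sketched here for illustration only; it is not necessarily the construction used in the cited work. Assume that the arrays are $q\times q$ with $q$ an odd prime, and that clones are indexed by pairs $(x,y)$ with $0\le x,y<q$. For $d=2$, place clone $(x,y)$ at
\[
\text{array 1: row } x,\ \text{column } y; \qquad
\text{array 2: row } (x+y) \bmod q,\ \text{column } (x+2y) \bmod q .
\]
The four pool indices are the linear forms $x$, $y$, $x+y$, and $x+2y$ over~$\Galois{q}$. For any two of these forms, the determinant of their coefficients equals $\pm 1$ or $2$, which is nonzero modulo~$q$. Hence the values of any two pool indices determine $(x,y)$, as the transversal property requires.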

378:

379: The number of arrays in a transversal design may be adjusted to allow

380: unambiguous and correct contig mapping for any redundancy in a BAC

381: library. Specifically, it can be shown~\shortcite{PGI.conf,CGT}

382: that a $d$-array transversal design can accurately resolve BACs at

383: up to $(2d-1)$X redundancy.

384: We previously described and analyzed transversal designs in

385: the context of pooled shotgun experiments~\shortcite{PGI.conf}

386: and compared their performance to

387: other designs. Even though our analysis was performed for

388: the Pooled Genomic Indexing (PGI) method

389: in the context of comparative physical mapping,

390: the results are generally

391: valid for CAPSS and CAPS-MAP as well. Specifically,

392: our results indicate that transversal designs

393: reduce the frequency of false negatives and false mappings

394: when compared to a simple rectangular design.

395: Furthermore, when compared to other more complicated designs,

396: they achieve an optimal balance between the number of shotgun

397: library preparations and the frequency of contig mapping problems.

398: Transversal designs also enjoy a practical advantage

399: over more complicated combinatorial designs, in that they are

400: readily implemented using existing automated clone arraying technologies.

401:

402: When a transversal design is used,

403: contig mapping can be implemented very efficiently,

404: based on an algorithm that runs in~$O(\nclones+M)$ time for

405: mapping~$M$ contigs onto~$\nclones$ BACs. Without going into

406: details,

407: the main idea is to first build in~$O(\nclones)$ time

408: a hash table that maps pool pairs to BACs. Based on the property of transversal designs

409: that two pools identify a clone, this table contains all pool pairs that identify a

410: unique clone. For each contig, it takes~$O(1)$ time using the hash table to

411: either identify the most likely clone set to which

412: the contig can be mapped, or to declare the contig ambiguous.
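
To make the table size concrete (a sketch, assuming that every pair of pools containing a given clone is stored as a key), each clone belongs to~$\sigsize$ pools and therefore contributes $\binom{\sigsize}{2}$ pool pairs, so the table holds
\[
\nclones\binom{\sigsize}{2} = \nclones\,\frac{\sigsize(\sigsize-1)}{2}
\]
entries. For a double-array design ($\sigsize=4$) this is $6\nclones$, which is $O(\nclones)$ for a fixed number of arrays.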

413:

414: \section{Sequence assembly}\label{sec:capss}

415: %{\mysectionstyle 3. Pooled shotgun reads for sequencing. \ \ }

416: This section analyzes CAPSS progress in a hybrid

417: project that uses whole-genome

418: and pooled shotgun sequences. CAPS sequences are collected

419: using a transversal design with~$\sigsize$ pool sets, i.e., $\sigsize/2$

420: arrays.

421: In order to derive a probabilistic model for such experiments,

422: we introduce some standard

423: simplifying assumptions and the following notation.

424: Assume that every clone has the same length~$\clength$

425: (100--200 thousand base pairs in practice),

426: and that each shotgun sequence has the same length~$\flength$ (e.g., 500 bp).

427: The WGS and CAPS sequences are

428: combined and compared to each other to

429: find overlaps between them.

430: Overlapping sequences form {\em islands}.

431: Islands with two or more sequences are {\em contigs}.

432: An overlap between two shotgun sequences

433: is detected if it is at least of length~$\soverlap\flength$

434: where~$0<\soverlap\le 1$.

435: Statistics for islands and for gaps between islands

436: are well known~\shortcite{LanderWaterman,WendlWaterston}.

437: We are interested in statistics for

438: {\em clone-linked contigs}, those that are assigned

439: to BACs using the pooling information.

440:

441: Let~$\pcoverage$ be the coverage by CAPS sequences, i.e.,

442: if~$\nfrags_{\mathrm{p}}$ CAPS sequences

443: are collected, then~$\pcoverage=\frac{\nfrags_{\mathrm{p}}\flength}{\nclones\clength}$

444: where~$\nclones$ is the total number of clones.

445: Let~$\wgscoverage$ denote the coverage by WGS

446: sequences, i.e., if~$\nfrags_{\mathrm{w}}$ WGS

447: sequences are collected,

448: then~$\wgscoverage=\frac{\nfrags_{\mathrm{w}}\flength}{\glength}$

449: where~$\glength$ is the genome length.

450: Notice that~$\wgscoverage=0$ is possible.

451: Here we consider the simplest case of

452: assembling the sequence of a single

453: clone that does not overlap with any other clone.

454: Such a clone is covered by

455: a total coverage of~$(\pcoverage+\wgscoverage)$.

456: Although we concentrate on sequencing a particular clone,

457: the transversal design allows the simultaneous sequencing

458: of multiple, possibly overlapping clones

459: by combining WGS sequences with CAPS sequences from many (or even all) pools.

460: Regions of overlapping clones have higher

461: coverage since they are covered by

462: more CAPS sequences than a single clone.

463: The sequencing of overlapping regions thus progresses

464: faster than what is suggested by the statistics

465: for a single clone.

466: We examine the case of assigning contigs to overlapping BACs

467: in \S\ref{sec:capsmap}.

468: Two shotgun sequences from different pools suffice to assign a contig to a single BAC.

469: In a practical setting, it may be advantageous to require more

470: stringent criteria in order to avoid false mappings.

471: Theorem~\ref{tm:capss} can be readily adapted for such

472: criteria, albeit resulting in bulkier formulas.
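
For a numerical illustration of the coverage definitions above (the values are hypothetical, chosen within the ranges quoted in the text), take $\nclones=1000$ clones of length $\clength=150{,}000$~bp and shotgun sequences of length $\flength=500$~bp. Collecting $\nfrags_{\mathrm{p}}=600{,}000$ CAPS sequences gives
\[
\pcoverage=\frac{\nfrags_{\mathrm{p}}\flength}{\nclones\clength}
=\frac{600{,}000\cdot 500}{1000\cdot 150{,}000}=2,
\]
so that a non-overlapping clone has total coverage $\pcoverage+\wgscoverage=2+\wgscoverage$.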

473:

474: Figures~\ref{fig:capss.islands} and~\ref{fig:capss.cover}

475: compare different experimental designs based on Theorem~\ref{tm:capss}

476: and simulations.

477: Figure~\ref{fig:capss.islands} plots the island statistics from the theorem.

478: It illustrates that at lower coverage (about $\coverage<4$), the fraction

479: of pooled shotgun sequences makes a large difference in sequencing progress.

480: This difference

481: shows mainly in the number of clone-linked contigs, as the

482: contig sizes do not differ much. At high coverage levels, when

483: sequencing is nearly complete, the impact of pooled sequences is smaller,

484: i.e., WGS sequences can make up for a lower pooled coverage.

485:

486: Figure~\ref{fig:capss.cover}a

487: shows that while more arrays increase the sequencing success, the improvements

488: are very small after the second array.

489: Notice that if the clones are selected from a minimally overlapping tiling path,

490: then no part of the genome is covered by more than two BACs,

491: and thus two arrays suffice for the unambiguous mapping of all contigs

492: that cover clone overlaps.

493: Figure~\ref{fig:capss.cover}b plots the N50 values. The N50

494: contig length is the value~$l$ such that half of the

495: sequenced nucleotides belong to contigs of length at least~$l$.

496: The statistics for all designs converge to

497: those of a non-pooled sequencing project as the coverage increases.

498: In other words, the negative effects of pooling diminish and the

499: project progresses just as without pooling:

500: for example, at total coverage 4--5X, 99\% of the clone is sequenced.

501:

502:

503: \begin{theorem}\label{tm:capss}

504: Let~$\roverlap=1-\soverlap$ where~$\soverlap$

505: is the fraction of length two shotgun sequences must share in order for

506: the overlap to be detected.

507: Consider a BAC that does not overlap with other

508: clones.

509: Define~$\coverage=\wgscoverage+\pcoverage$,

510: the total coverage.

511: Let

512: $

513: X_1 = \frac{\wgscoverage+\frac{\pcoverage}{\sigsize}}{\coverage}$,

514: %\qquad

515: $X_2 = \frac{\wgscoverage}{\coverage}$,

516: and $Y_i=1-(1-e^{-\coverage\roverlap})X_i$ for~$i=1,2$.

517:

518: \begin{enumerate}

519: \item[(i)]

520: The expected number of clone-linked contigs covering the clone equals

521: \begin{equation}\label{eq:num.link}

522: \frac{\clength}{\flength}\coverage e^{-\coverage\roverlap} p_{\mathrm{link}},

523: \end{equation}

524: where

525: \begin{equation}\label{eq:prob.link}

526: p_{\mathrm{link}}=

527: \begin{cases}

528:  1-e^{-\coverage\roverlap}

529:  	\biggl(\sigsize\frac{X_1}{Y_1}-(\sigsize-1)\frac{X_2}{Y_2}\biggr)

530: 			& \text{if $\wgscoverage>0$;}\\*

531: \frac{1-e^{-\pcoverage\roverlap}}{1+\frac{1}{\sigsize-1}e^{-\pcoverage\roverlap}}

532: 			& \text{if $\wgscoverage=0$.}

533: \end{cases}

534: \end{equation}

535:

536: \item[(ii)] The expected number of shotgun sequences in a clone-linked contig is

537: \begin{equation}\label{eq:nr.link}

538: \nfrags_{\mathrm{link}}

539: = \begin{cases}

540: \frac{e^{\coverage\roverlap}}{p_{\mathrm{link}}}

541: \Biggl(1-e^{-2\coverage\roverlap}\biggl(

542:  				\sigsize\frac{X_1}{Y_1^2}

543: 				-(\sigsize-1)\frac{X_2}{Y_2^2}\biggr)\Biggr) & \text{if $\wgscoverage>0$;}\\

544: e^{\pcoverage\roverlap}+\frac{1+\frac{1}{\sigsize-1}}{1+\frac{e^{-\pcoverage\roverlap}}{\sigsize-1}}

545: 					& \text{if $\wgscoverage=0$.}

546: \end{cases}

547: \end{equation}

548:

549: \item[(iii)]

550: Define

551: \begin{align}

552: \label{eq:mk.nolink}

553: \nfrags_{\mathrm{nolink}}

554: =	\frac{\sigsize\frac{X_1}{Y_1^2}-(\sigsize-1)\frac{X_2}{Y_2^2}}{\sigsize\frac{X_1}{Y_1}-(\sigsize-1)\frac{X_2}{Y_2}},

555: \intertext{and}

556: \ilen_{\mathrm{CBC}}

557: =

558: 	\frac{e^{\coverage\roverlap}-1}{\coverage}+\soverlap.

559: \end{align}

560: The expected length of a clone-linked contig

561: can be written as $\flength\ilen_{\mathrm{link}}$

562: where $\ilen_{\mathrm{link}}$

563: is bounded as

564: \begin{equation}\label{eq:len.link}

565: \frac{\ilen_{\mathrm{CBC}}-\Bigl(\nfrags_{\mathrm{nolink}}\roverlap+\soverlap\Bigr)(1-p_{\mathrm{link}})}{

566: 	p_{\mathrm{link}}}

567: \le

568: 	\ilen_{\mathrm{link}}

569: \le

570: 	\frac{\ilen_{\mathrm{CBC}}}{p_{\mathrm{link}}}. % \frac{\ilen_{\mathrm{CBC}}-(1-p_{\mathrm{link}})}{p_{\mathrm{link}}}.

571: \end{equation}

572: Furthermore, when~$\fpooled=\pcoverage/\coverage$ is kept constant,

573: $\nfrags_{\mathrm{nolink}}$ increases monotonically

574: with~$\coverage$ and

575: \begin{equation}\label{eq:len.limit}

576: \lim_{\coverage\to\infty}\nfrags_{\mathrm{nolink}}

577: =

578: \begin{cases}

579: \fpooled^{-1}

580: 	\frac{(3\sigsize^2-3\sigsize+1)-\fpooled(2\sigsize^2-3\sigsize+1)}{

581: 		(2\sigsize^2-3\sigsize+1)-\fpooled(\sigsize^2-2\sigsize+1)}

582: 		& \text{if $\wgscoverage>0$;}\\

583: \frac{\sigsize}{\sigsize-1} & \text{if $\wgscoverage=0$.}

584: \end{cases}

585: \end{equation}

586: \end{enumerate}

587: \end{theorem}
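
As a numerical check of the case $\wgscoverage=0$ (illustrative values only, not taken from the experiments: $\sigsize=4$, $\pcoverage=2$, $\soverlap=0.1$, and $\clength/\flength=300$), we have $\roverlap=0.9$ and $e^{-\pcoverage\roverlap}=e^{-1.8}\approx 0.165$. Equations~\eqref{eq:prob.link}, \eqref{eq:num.link}, and~\eqref{eq:nr.link} then give
\begin{align*}
p_{\mathrm{link}} & \approx \frac{1-0.165}{1+\frac{0.165}{3}} \approx 0.79, \\
\frac{\clength}{\flength}\,\pcoverage\, e^{-\pcoverage\roverlap}\, p_{\mathrm{link}}
	& \approx 300\cdot 2\cdot 0.165\cdot 0.79 \approx 78, \\
\nfrags_{\mathrm{link}} & \approx e^{1.8}+\frac{1+\frac{1}{3}}{1+\frac{0.165}{3}} \approx 6.05+1.26 \approx 7.3.
\end{align*}
Under these assumptions, roughly $79\%$ of the islands are clone-linked, and a clone-linked contig contains about seven shotgun sequences on average.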

588:

589: \begin{figure}

590: \begin{tabular}{cc}\includegraphics[width=0.48\textwidth]{capss-mapped-ireads} &

591: \includegraphics[width=0.48\textwidth]{capss-mapped-nislands}\\

592: a & b

593: \end{tabular}

594: \caption[CAPSS island statistics]{\captionstyle

595: CAPSS (Theorem~\ref{tm:capss}): clone-linked contig statistics. The values are calculated

596: from Theorem~\ref{tm:capss} for two-array transversal designs

597: and different pooled coverage levels~$\pcoverage$. Overlaps between shotgun sequences are detected

598: with~$\soverlap=0.1$. The number of contigs on the right-hand side is given

599: in multiples of~$\clength/\flength$. The abscissa is the total coverage~$\coverage$.}

600: \label{fig:capss.islands}

601: \end{figure}

602:

603: \begin{figure}

604: \begin{tabular}{cc}\includegraphics[width=0.48\textwidth]{capss-mapped-cover} &

605: \includegraphics[width=0.48\textwidth]{capss-mapped-n50}\\

606: {\captionstyle a} & {\captionstyle b}

607: \end{tabular}

608: \caption[CAPSS sequencing progress]{\captionstyle

609: 	CAPSS (Theorem~\ref{tm:capss}): sequencing progress.

610: 	The left-hand side plots the

611: 	fraction of bases covered by clone-linked contigs as

612: 	a function of total coverage ($\coverage=\pcoverage+\wgscoverage$)

613: 	for different designs. Notice that the improvement from two arrays

614: 	to four arrays ($\sigsize=4$ vs.\ $\sigsize=8$) is marginal.

615: 	The right-hand side plots the N50 values

616: 	for different designs with two arrays, as multiples of~$\flength$.

617: 	All values were calculated

618: 	with shotgun sequence overlap detection~$\soverlap=0.1$. The N50 plot was obtained

619: 	from simulation: each point is an average of 200 measurements.

620: }\label{fig:capss.cover}

621: \end{figure}

622:

623: \begin{proof}

624: The proof relies on a Poisson process model, following the technique

625: of \shortciteN{Waterman}.

626: We model the location of the shotgun sequences as a Poisson process

627: with rate~$\coverage$.

628: Define~$\fpooled=\pcoverage/\coverage$, the fraction of CAPS sequences.

629: Every sequence is either a

630: WGS sequence, with probability~$(1-\fpooled)$, or

631: a CAPS sequence from any given one of the clone's~$\sigsize$ pools, with

632: probability~$\fpooled/\sigsize$ for each pool.

633: First we state the well-known facts~\shortcite{LanderWaterman,Waterman}

634: about apparent islands, whether or not they are linked to a clone.

635: The event~$E$ that a given shotgun sequence is the right-hand end of an apparent island

636: has probability~$J=\PROB E=e^{-\coverage\roverlap}$.

637: For the $k$-th read, define~$M_k$ as the number of reads

638: from its right-hand end until the first gap towards the left.

639: The probability that an island has~$j$ sequences in it equals

640: \[

641: \Probcmd{M_k=j}{E}=(1-J)^{j-1}J.

642: \]

643:

644: An island can be mapped to a clone if it contains sequences from at least two pools.

645: The probability of mapping the island ending at the $k$-th read

646: (event $D_k$)

647: depends on the number of shotgun sequences in the island. Using inclusion-exclusion:

648: \begin{multline}\label{eq:dme}

649: \Probcmd{D_k}{M_k=j}  \\*

650: \begin{aligned}

651: & =  1-\sum_{\text{pools}}\Probcmd{\text{CAPS reads from only one pool+WGS}}{M_k=j}\\

652: & +(\sigsize-1) \Probcmd{\text{only WGS reads}}{M_k=j}\\*

653: & =  1-\sigsize\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)^{j}+(\sigsize-1)(1-\fpooled)^j.

654: \end{aligned}

655: \end{multline}
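
As a sanity check of Equation~\eqref{eq:dme} (a worked special case, spelled out here for convenience), set $j=1$:
\[
1-\sigsize\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)+(\sigsize-1)(1-\fpooled)
= 1-\sigsize+(\sigsize-1)\fpooled+(\sigsize-1)-(\sigsize-1)\fpooled = 0,
\]
so an island consisting of a single shotgun sequence is never clone-linked. For $j=2$ and $\fpooled=1$, the expression equals $1-\sigsize\cdot\sigsize^{-2}=1-1/\sigsize$, which is the probability that two CAPS sequences come from different pools.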

656:

657: By Equation~\eqref{eq:dme},

658: the number of shotgun sequences in a clone-linked island is distributed

659: according to the probabilities

660: \begin{multline}\label{eq:prob.dm}

661: \Probcmd{D_k,M_k=j}{E}  = \Probcmd{D_k}{M_k=j, E}\Probcmd{M_k=j}{E} \\*

662: \begin{aligned}

663: & =  \biggl(1-\sigsize\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)^{j}

664: 					+(\sigsize-1)(1-\fpooled)^j\biggr)

665: 				(1-J)^{j-1}J\\

666: & = \Pa(j) - \sigsize\Pb(j) + (\sigsize-1)\Pc(j),

667: \end{aligned}

668: \end{multline}

669: with

670: \begin{subequations}\label{eq:px}

671: \begin{align}

672: \Pa(j) & =  (1-J)^{j-1}J; \\*

673: \Pb(j) & = 	\biggl((1-J)\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)

674: 				\biggr)^{j-1}

675: 					J\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr);\\*

676: \Pc(j) & = \Bigl((1-J)(1-\fpooled)\Bigr)^{j-1} J(1-\fpooled).

677: \end{align}

678: \end{subequations}

679:

680: Now, for all~$0< z\le 1$,

681: \begin{equation}

682: \sum_{j=1}^\infty (1-z)^{j-1} = \frac{1}{z}; \quad

683: \sum_{j=1}^\infty j(1-z)^{j-1} = \frac{1}{z^2}. \label{eq:z}

684: \end{equation}

685: Using Equation~\eqref{eq:z},

686: \begin{align*}

687: \Probcmd{D_k}{E}

688: & =\sum_{j=1}^\infty \Probcmd{D_k, M_k=j}{E} \\*

689: & = 1-\frac{\sigsize J(1-\frac{\sigsize-1}{\sigsize}\fpooled)}{1-(1-J)\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)}

690: 	+\frac{(\sigsize-1)J(1-\fpooled)}{1-(1-J)(1-\fpooled)}.

691: \end{align*}

692: In Equation~\eqref{eq:prob.link},

693: $p_{\mathrm{link}}=\Probcmd{D_k}{E}$.

694: Equation~\eqref{eq:num.link} follows from

695: the fact that the expected number of shotgun fragments covering the clone

696: equals~$\coverage\clength/\flength$.
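
To see how the formula for~$\Probcmd{D_k}{E}$ behaves, it may help to look at the special case~$\fpooled=1$ (no WGS reads); this is an illustrative calculation under that assumption only.
The last term vanishes and the middle term simplifies, so that
\[
\Probcmd{D_k}{E}
= 1-\frac{J}{1-\frac{1-J}{\sigsize}}
= 1-\frac{\sigsize J}{\sigsize-1+J}
= \frac{(\sigsize-1)(1-J)}{\sigsize-1+J}.
\]
In this case the probability tends to~1 as~$J\to 0$, that is, as the coverage grows.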

697:

698: By definition of the conditional probability,

699: \[

700: \Probcmd{M_k=j}{D_k, E}=

701: \frac{\Probcmd{D_k,M_k=j}{E}}{\Probcmd{D_k}{E}} = \frac{\Pa(j)-\sigsize\Pb(j)+(\sigsize-1)\Pc(j)}{p_{\mathrm{link}}},

702: \]

703: where the values can be plugged in from

704: Equations~\eqref{eq:prob.link} and~\eqref{eq:px}.

705: By Equation~\eqref{eq:z},

706: \begin{equation}\label{eq:mk}

707: \Expcmd{M_k}{D_k,E}

708: = \frac{p_{\mathrm{link}}^{-1}}{J}

709: 	\Biggl(1-\frac{\sigsize J^2\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)}{

710: 				\Bigl(1-(1-J)(1-\frac{\sigsize-1}{\sigsize}\fpooled)\Bigr)^{2}}

711: 			+\frac{(\sigsize-1)J^2(1-\fpooled)}{

712: 				\Bigl(1-(1-J)(1-\fpooled)\Bigr)^2}

713: 	\Biggr),

714: \end{equation}

715: which corresponds to~(ii)

716: with~$\nfrags_{\mathrm{link}}=\Expcmd{M_k}{D_k,E}$.

717: It is interesting to notice that

718: when~$\fpooled=1$, in Equation~\eqref{eq:mk},

719: \[

720: \frac2{J(\roverlap)} \ge \Expcmd{M_k}{D_k,E} > \frac1{J(\roverlap)},

721: \]

722: and that $\Expcmd{M_k}{D_k,E}J^{-1}(\roverlap)$ decreases when the coverage~$\coverage$ increases.

723:

724: By Equation~\eqref{eq:prob.link},

725: \begin{equation}\label{eq:nolink}

726: \Probcmd{\overline{D_k}}{E}

727: = 1-p_{\mathrm{link}}

728: =J\frac{1-\Bigl(1-J\Bigr)\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)(1-\fpooled)}{%

729: 	\biggl(1-\Bigl(1-J\Bigr)\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)\biggr)

730: 	\biggl(1-\Bigl(1-J\Bigr)\Bigl(1-\fpooled\Bigr)\biggr)}.

731: \end{equation}

732: The expected number of shotgun sequences in an island that is not mapped to a clone equals

733: \[

734: \Expcmd{M_k}{\overline{D_k},E}

735: =\frac{\Expcmd{M_k}{E}-\Expcmd{M_k}{D_k,E}\Probcmd{D_k}{E}}{\Probcmd{\overline{D_k}}{E}}.

736: \]

737: Using~$\Expcmd{M_k}{E}=J^{-1}$ and Equations~\eqref{eq:mk}, \eqref{eq:prob.link},

738: and~\eqref{eq:nolink},

739: we get Equation~\eqref{eq:mk.nolink} with the notation

740: $\nfrags_{\mathrm{nolink}}=\Expcmd{M_k}{\overline{D_k},E}$.

741:

742: Let~$\flength\ilen_k$ be the length of the island ending with the $k$-th sequence.

743: The length of a non-linked island can be bounded as

744: $\flength\Expcmd{\ilen_{\mathrm{nolink}}}{\overline{D_k},E}$

745: with

746: \[

747: 1\le

748: \Expcmd{\ilen_{\mathrm{nolink}}}{\overline{D_k},E}

749: \le \Expcmd{M_k}{\overline{D_k},E}\roverlap+\soverlap.

750: \]

751: The bounds of Equation~\eqref{eq:len.link} follow from

752: \[

753: \Expcmd{\ilen_k}{D_k,E}

754:  =  \frac{\Expcmd{\ilen_k}{E}-\Expcmd{\ilen_k}{\overline{D_k},E}\Probcmd{\overline{D_k}}{E}}{

755: 	\Probcmd{D_k}{E}},

756: \]

757: where $\Expcmd{\ilen_k}{E}=\ilen_{\mathrm{CBC}}=\frac{J^{-1}-1}{\coverage}+\soverlap$

758: \shortcite{Waterman}.

759: \end{proof}

760:

761: The value~$\ilen_{\mathrm{CBC}}$ in the

762: theorem is the expected island length in a non-pooled sequencing project.

763: By Equation~\eqref{eq:len.limit}, and

764: the fact that $\lim_{\coverage\to\infty}p_{\mathrm{link}}=1$,

765: we have

766: $\lim_{\coverage\to\infty}\ilen_{\mathrm{link}}=\ilen_{\mathrm{CBC}}$

767: when the ratio of CAPS sequences is kept constant.

768: This limit result is not surprising given that

769: every island can be assigned to a clone with near certainty when the

770: sequence read coverage is large.

771:

772: \section{Clone overlap detection}\label{sec:capsmap}

773: %{\mysectionstyle 4. Pooled shotgun reads for clone overlap detection. \ \ }

774: The key observation for this section is that a transversal design

775: makes it possible to map a contig unambiguously to more than one

776: BAC at once. Now, a contig that is mapped to two clones simultaneously can be

777: viewed as evidence that the two clones overlap. Taking the idea further,

778: an entire set of BACs can be tested for overlaps in this manner,

779: which leads us to the Clone-Array Pooled Shotgun Mapping (CAPS-MAP) method

780: that is

781: described as follows. A redundant collection of random BACs covering a large

782: genome is grouped into subsets of size~$q^2$.

783: Pooled shotgun sequence reads are collected from each clone group using a

784: transversal design with~$d$ arrays of size~$q\times q$.

785: Partitioning into subsets may be dictated by the practical

786: concerns of chemistry, biology and robotic automation.

787: For array sizes that are multiples of 8 or 12 or both

788: (yielding standard dimensions of a

789: 96-well microtiter plate), such as~$q=24$, or~$q=48$,

790: there exist known~\shortcite{design.handbook} transversal designs.

791: A pooling design with a few ($d=2,3,4$) arrays suffices to compute the

792: physical ordering of BACs in the library, depending on the

793: library's redundancy and the array sizes.

794: In addition to the CAPS sequences,

795: WGS sequences are used to increase read

796: contig lengths.

797: The shotgun sequences are compared to each other to

798: find the overlaps between them,

799: and are assembled into contigs. Contigs that map

800: unambiguously to more than one clone are taken as evidence that the clones overlap.

801: See Figures~\ref{fig:capsm} and~\ref{fig:capsm.fn} for illustrations.

802: The clone overlap information can then be used to

803: compute the physical ordering of the BACs in the library, and

804: to select a minimal tiling path for complete sequencing,

805: just as if the overlaps were detected using a

806: fingerprinting scheme \shortcite{map.sequenceready}.

807:

808: Theorem~\ref{tm:capsm} considers the case of detecting an overlap between

809: two clones in different clone groups. Similar analyses can be carried out

810: for more general cases with more overlapping clones, or clones

811: in the same clone group, resulting in more cumbersome formulas.

812:

813: \begin{figure}

814: \centerline{\includegraphics[height=0\myfigheight]{capsmap-method}}

815: \caption[CAPS-MAP]{\captionstyle

816: CAPS-MAP detects overlaps between clones by identifying situations where

817: a read contig maps simultaneously to two clones. This figure illustrates a

818: transversal pooling design with two clone groups and two arrays per group.

819: The transversal design guarantees that the intersection of any

820: two pools out of four possible for each BAC (two row and two

821: column pools) uniquely identifies the BAC.

822: Note that overlaps between clones on the same array can also be

823: detected by a transversal design.}\label{fig:capsm}

824: \end{figure}

825:

826: \begin{figure}

827: \centerline{\includegraphics[height=\myfigheight]{capsmap-onearray}}

828: \caption{

829: Overlaps between clones on the same array can also be

830: detected by a transversal design, even in the presence of false negatives,

831: i.e., situations where a particular BAC

832: is not represented in a particular pool. Specifically, overlap between

833: the two BACs illustrated in the figure is detected despite the fact that

834: each BAC is sampled in only three pools.}\label{fig:capsm.fn}

835: \end{figure}

836:

837: Figure~\ref{fig:capsm.probs} plots

838: the overlap detection probabilities

839: in a few scenarios with different

840: amounts of CAPS and WGS sequences.

841: Based on the figure, the probability of detecting an overlap

842: approaches~1 exponentially fast as the overlap length grows. The same exponential

843: behavior is characteristic of clone anchoring methods for overlap detection

844: \shortcite{map.anchor}. Consequently, clone contig statistics for CAPS-MAP can be

845: calculated using a clone anchoring model with an appropriate

846: anchoring process intensity. Clone contig statistics can also be estimated

847: using a fingerprinting model \shortcite{LanderWaterman}

848: by noticing that clone overlaps above a certain length are detected with near certainty.

849: Figure~\ref{fig:capsm.probs} indicates that using 1X CAPS coverage and

850: 2--5X WGS coverage, BAC overlaps of more than 20000 bp are detected almost certainly.

851: While CAPS-MAP uses only the fact that a contig is mapped to multiple BACs, and

852: not the actual contig sequence, the sequence information is used in

853: the ensuing sequencing phase, and thus CAPS-MAP represents very little overhead in a

854: genome sequencing project.

855:

856: It is worth pointing out here that CAPS-MAP detects very short, or even

857: {\em negative} clone overlaps with non-negligible probability.

858: A short region of the genome

859: that is not covered by BACs in the library can be bridged by WGS sequences.

860: The bridging WGS sequences may form a contig with CAPS sequences from the two BACs

861: at the gap's ends that can be mapped to the two clones simultaneously.

862: This unique feature of CAPS-MAP among clone overlap detection methods

863: does not interfere with the calculation of the physical ordering of BACs.

864: At the same time, it does decrease the necessary BAC library size for

865: sequencing the genome completely.

866: After the clones are selected for complete sequencing, the

867: already collected WGS sequences are included in the genome sequence assembly.

868: Consequently, negative overlaps detected by CAPS-MAP are already covered by

869: shotgun sequences

870: in the sequencing phase, and

871: pose no additional requirements for shotgun sequence collection.

872:

873: \begin{theorem}\label{tm:capsm}

874: Let two clones from different clone groups

875: share an overlap. % of length~$\coverlap\clength$ with~$0<\coverlap\le 1$.

876: Define~$\coverage_2=2\pcoverage+\wgscoverage$,

877: the total shotgun sequence coverage for the overlap.

878: Define

879: \begin{gather*}

880: \begin{aligned}

881: \ndfactor_1 & = \frac{\wgscoverage+(1+\frac{1}{\sigsize})\pcoverage}{\coverage_2} &

882: \ndfactor_2 & = \frac{\wgscoverage+\pcoverage}{\coverage_2} &

883: \ndfactor_3 & = \frac{\wgscoverage+\frac{2\pcoverage}{\sigsize}}{\coverage_2} &

884: \ndfactor_4 & = \frac{\wgscoverage+\frac{\pcoverage}{\sigsize}}{\coverage_2} &

885: \ndfactor_5 & = \frac{\wgscoverage}{\coverage_2};

886: \end{aligned}\\*

887: \gamma_i=1-(1-e^{-\coverage_2\roverlap})\ndfactor_i

888: \quad\text{ for $i=1,\dotsc,5$}.

889: \end{gather*}

890:

891: \begin{enumerate}

892: \item[(i)] An apparent island in the

893: overlap consisting of~$j>0$ shotgun sequences

894: is mapped to the two clones simultaneously

895: with probability~$1-\pnd(j)$ where

896: \begin{equation}\label{eq:pnd}

897: \pnd(j)

898: = 2\sigsize \ndfactor_1^j

899: 	- 2(\sigsize-1) \ndfactor_2^j

900: 	- \sigsize^2 \ndfactor_3^j

901: 	+ 2\sigsize(\sigsize-1) \ndfactor_4^j

902: 	- (\sigsize-1)^2 \ndfactor_5^j

903: 	< 2\sigsize\ndfactor_1^j.

904: \end{equation}

905:

906: \item[(ii)] An apparent island covering the overlap is mapped

907: to the two clones simultaneously with probability

908: \begin{equation}\label{eq:prob.map}

909: p_2

910: = 1-e^{-\coverage_2\roverlap}\biggl(

911: 	2\sigsize\frac{\ndfactor_1}{\gamma_1}

912: 	-2(\sigsize-1)\frac{\ndfactor_2}{\gamma_2}

913: 	-\sigsize^2\frac{\ndfactor_3}{\gamma_3}

914: 	+2\sigsize(\sigsize-1)\frac{\ndfactor_4}{\gamma_4}

915: 	-(\sigsize-1)^2\frac{\ndfactor_5}{\gamma_5}\biggr).

916: \end{equation}

917:

918: \end{enumerate}

919: \end{theorem}

920:

921:

922: \begin{proof}

923: The overlap is detected if

924: it is covered by an island that can be simultaneously mapped

925: to the two clones.

926: We model the location of the shotgun sequences as a Poisson process

927: with rate~$\coverage_2$.

928: Define~$\fpooled_2=\frac{2\pcoverage}{\coverage_2}$,

929: the fraction of CAPS sequences covering the overlap.

930: Every shotgun sequence is either a

931: WGS sequence with probability~$(1-\fpooled_2)$, or

932: comes from each one of the two clones' pools with

933: probability~$\fpooled_2/(2\sigsize)$.

934: The event~$E_2$ that a given shotgun sequence is the right-hand end of an apparent island

935: has probability~$J_2=\PROB E_2=e^{-\coverage_2\roverlap}$.

936: For the $k$-th sequence, define~$M_k$ as the number of sequences

937: from its right-hand end until the first gap towards the left.

938: The probability that an island has~$j$ sequences in it equals

939: \[

940: \Probcmd{M_k=j}{E_2}=\Bigl(1-J_2\Bigr)^{j-1}J_2.

941: \]

942: The probability of mapping the island that ends at the $k$-th shotgun sequence

943: (event $D_k$)

944: depends on the number of sequences in the island. We

945: calculate the probability of event~$\overline{D_k}$ in separate cases.

946: Let~$p_{0,0}(j)$ denote the probability that the island

947: consists of WGS sequence reads only given that it has~$j$ reads.

948: Then

949: \begin{subequations}

950: \begin{equation}\label{eq:p00}

951: p_{0,0}(j) = (1-\fpooled_2)^j.

952: \end{equation}

953: Let~$p_{0,*}(j)$ denote the probability that the island

954: consists of CAPS sequences for one clone only and WGS sequences, given that it has~$j$

955: shotgun sequences in it:

956: \begin{equation}\label{eq:p0x}

957: p_{0,*}(j)=\Bigl(1-\frac{\fpooled_2}2\Bigr)^j.

958: \end{equation}

959: Let~$p_{1,0}(j)$ denote the probability that the island

960: consists of CAPS sequences from a fixed pool and WGS sequences, given that it has~$j$

961: shotgun sequences in it:

962: \begin{equation}\label{eq:p10}

963: p_{1,0}(j)=\Bigl(1-\frac{\sigsize-\frac12}{\sigsize}\fpooled_2\Bigr)^j-p_{0,0}(j).

964: \end{equation}

965: Let~$p_{1,1}(j)$ denote the probability that the island

966: consists of CAPS sequences from a fixed pool for one clone,

967: from another fixed pool for the other clone,

968: and WGS sequences, given that it has~$j$ shotgun sequences in it:

969: \begin{equation}\label{eq:p11}

970: p_{1,1}(j)=\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled_2\Bigr)^j-2p_{1,0}(j)-p_{0,0}(j).

971: \end{equation}

972: Let~$p_{1,+}(j)$ denote the probability that the island

973: consists of CAPS sequences from a fixed pool for one clone,

974: at least one CAPS sequence for the other clone, and WGS sequences:

975: \begin{equation}\label{eq:p1x}

976: p_{1,+}(j)=\Bigl(1-\frac{\sigsize-1}{2\sigsize}\fpooled_2\Bigr)^j

977: 	-p_{0,*}(j)-p_{1,0}(j).

978: \end{equation}

979: \end{subequations}

980: Using inclusion-exclusion,

981: \[

982: \Probcmd{\overline{D_k}}{E_2, M_k=j}

983: =\Bigl(2p_{0,*}(j)-p_{0,0}(j)\Bigr)

984: +2\sigsize p_{1,+}(j)-\sigsize^2 p_{1,1}(j).

985: \]

986: By Equations~(\ref{eq:p00}--\ref{eq:p1x}),

987: \begin{multline}\label{eq:map.fmap.j}

988: \Probcmd{\overline{D_k}}{E_2, M_k=j}

989: = 2\sigsize \Bigl(1-\frac{\sigsize-1}{2\sigsize}\fpooled_2\Bigr)^j

990:  - 2(\sigsize-1) \Bigl(1-\frac{\fpooled_2}2\Bigr)^j\\*

991:  - \sigsize^2 \Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled_2\Bigr)^j

992:  + 2\sigsize(\sigsize-1) \Bigl(1-\frac{\sigsize-\frac12}{\sigsize}\fpooled_2\Bigr)^j

993:  - (\sigsize-1)^2 (1-\fpooled_2)^j,

994: \end{multline}

995: which corresponds to Equation~\eqref{eq:pnd}

996: with $\pnd(j)=\Probcmd{\overline{D_k}}{E_2, M_k=j}$.
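
As a sanity check on this expression (an illustrative calculation, not part of the original argument), take~$j=1$ and write each base in terms of~$\wgscoverage$ and~$\pcoverage$ using the definitions of~$\ndfactor_1,\dotsc,\ndfactor_5$.
The coefficients of~$\wgscoverage/\coverage_2$ and of~$\pcoverage/\coverage_2$ add up to
\begin{align*}
2\sigsize-2(\sigsize-1)-\sigsize^2+2\sigsize(\sigsize-1)-(\sigsize-1)^2 & = 1,\\
2\sigsize\Bigl(1+\frac1{\sigsize}\Bigr)-2(\sigsize-1)-\sigsize^2\cdot\frac{2}{\sigsize}+2\sigsize(\sigsize-1)\cdot\frac{1}{\sigsize} & = 2,
\end{align*}
so that~$\pnd(1)=(\wgscoverage+2\pcoverage)/\coverage_2=1$: an island consisting of a single read can never be mapped to both clones.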

997: Using the same technique as before,

998: \[

999: 1-p_2=\Probcmd{\overline{D_k}}{E_2}=\sum_{j=1}^{\infty}

1000: \Probcmd{\overline{D_k}}{E_2, M_k=j}\Probcmd{M_k=j}{E_2},

1001: \]

1002: leading to Equation~\eqref{eq:prob.map}.
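
For completeness, the summation can be spelled out term by term (a sketch using only the first identity of Equation~\eqref{eq:z}).
For each~$i=1,\dotsc,5$,
\[
\sum_{j=1}^{\infty}\ndfactor_i^{j}\Bigl(1-J_2\Bigr)^{j-1}J_2
= J_2\ndfactor_i\sum_{j=1}^{\infty}\Bigl((1-J_2)\ndfactor_i\Bigr)^{j-1}
= \frac{J_2\ndfactor_i}{1-(1-J_2)\ndfactor_i}
= e^{-\coverage_2\roverlap}\frac{\ndfactor_i}{\gamma_i},
\]
and applying this to the five terms of~$\pnd(j)$ gives the bracketed expression in Equation~\eqref{eq:prob.map}.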

1003:

1004: Recall that~$\pnd(j)$ is the probability of failing to map a

1005: contig of~$j$ reads to the two clones simultaneously.

1006: In order to show that the inequality in Equation~\eqref{eq:pnd}

1007: holds, we prove that

1008: \begin{equation}\label{eq:map.fmap.bound}

1009: \pnd(j) < 2\sigsize \ndfactor_1^j-(2\sigsize-1)\ndfactor_3^j < 2\sigsize\ndfactor_1^j.

1010: \end{equation}

1011: Notice that $\ndfactor_5<\ndfactor_4<\ndfactor_3<\ndfactor_2<\ndfactor_1$ and thus

1012: $\pnd(j)\nearrow 2\sigsize\ndfactor_1^j$. Since~$\ndfactor_4=(\ndfactor_3+\ndfactor_5)/2$,

1013: it follows from the convexity of~$x^j$ that

1014: \begin{equation}\label{eq:b345}

1015: 2\ndfactor_4^j \le \ndfactor_3^j+\ndfactor_5^j.

1016: \end{equation}

1017: (Alternatively, notice that the same inequality follows from

1018: $p_{1,1}(j)\ge 0$ in Equation~\eqref{eq:p11}.)
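
The smallest cases make the convexity inequality concrete (a worked instance added for illustration).
For~$j=1$ the two sides coincide, since~$2\ndfactor_4=\ndfactor_3+\ndfactor_5$, while for~$j=2$,
\[
\ndfactor_3^2+\ndfactor_5^2-2\ndfactor_4^2
= \ndfactor_3^2+\ndfactor_5^2-\frac{(\ndfactor_3+\ndfactor_5)^2}{2}
= \frac{(\ndfactor_3-\ndfactor_5)^2}{2} > 0 .
\]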

1019: We proceed by rearranging the equality of Equation~\eqref{eq:pnd}:

1020: \begin{multline*}

1021: 2\sigsize \ndfactor_1^j - (2\sigsize-1) \ndfactor_3^j - \pnd(j)  =

1022: 	2(\sigsize-1) \ndfactor_2^j

1023: 	+ (\sigsize-1)^2 \ndfactor_3^j

1024: 	- 2\sigsize(\sigsize-1) \ndfactor_4^j

1025: 	+ (\sigsize-1)^2 \ndfactor_5^j\\*

1026: =

1027: 	 (\sigsize-1)^2\underbrace{\Bigl(

1028: 		\ndfactor_3^j

1029: 		+\ndfactor_5^j

1030: 		-2\ndfactor_4^j

1031: 		\Bigr)}_{\text{$\ge 0$ by Eq.~\eqref{eq:b345}}}

1032: 	+2(\sigsize-1) \underbrace{\Bigl(

1033: 		\ndfactor_2^j

1034: 		-\ndfactor_4^j

1035: 		\Bigr)}_{\text{$>0$ since $\ndfactor_2>\ndfactor_4$}},

1036: \end{multline*}

1037: which proves Equation~\eqref{eq:map.fmap.bound}.

1038: \end{proof}

1039:

1040: It is difficult to derive useful closed formulas

1041: for the probability of overlap detection.

1042: For example, based on Equation~\eqref{eq:prob.map},

1043: the number of contigs in the overlap that are simultaneously

1044: mapped to the clones can be modeled as arrivals in a Poisson process

1045: with intensity $\coverage_2e^{-\coverage_2\roverlap}p_2$.

1046: For practical values of~$\coverage_2$, this

1047: model seriously underestimates the probability of overlap detection.

1048: The problem is similar to the one of using Lander-Waterman statistics \cite{LanderWaterman}

1049: at high coverage levels (see \citeN{WendlWaterston} for a discussion).

1050: For a more suitable model, let~$\rvgaps$ be the number of gaps entirely contained in the

1051: overlap, and number the islands from 0 to~$\rvgaps$.

1052: Let~$j_0, j_1, \dotsc, j_{\rvgaps}$ denote the number of shotgun sequences in

1053: the islands. The probability that none of the islands can be

1054: mapped simultaneously to the two clones can be calculated as

1055: \begin{equation}\label{eq:pnomap.exp}

1056: p_{\mathrm{nomap}}(j_0,\dotsc,j_{\rvgaps})

1057: 	=\prod_{i=0}^{\rvgaps}\pnd(j_i),

1058: \end{equation}

1059: 	where~$\pnd(j)$ is defined by Equation~\eqref{eq:pnd}.

1060: (Notice that~$\rvgaps$ and the~$j_i$ are random variables.)

1061: We are interested in the expected value

1062: $p_{\mathrm{nomap}}

1063: 	=\EXP p_{\mathrm{nomap}}(j_0,\dotsc,j_{\rvgaps})$.

1064: In order to get a good assessment of CAPS-MAP performance,

1065: we found that

1066: it is best to use a Monte-Carlo estimation\footnote{

1067: Specifically, for every overlap size considered,

1068: we carried out a number of simulated experiments.

1069: Each experiment used a fixed number of

1070: shotgun sequences~$\rvreads$ placed

1071: randomly in the overlap, and

1072: produced an instance of

1073: a $(j_0,\dotsc,j_{\rvgaps})$ vector, for which

1074: $p_{\mathrm{nomap}}(j_0,\dotsc,j_{\rvgaps})$

1075: was calculated using Equation~\eqref{eq:pnd}.

1076: The average of these values was used to estimate $p_{\mathrm{nomap}}$.

1077: The average was weighted with the probabilities

1078: of different~$\rvreads$ values, given by a Poisson distribution.

1079: The set of~$\rvreads$ values was chosen so that

1080: it provided a sufficient

1081: accuracy for the weighted average estimate.

1082: For every~$\rvreads$, ten thousand experiments were

1083: performed.}

1084: of this expected value;

1085: see Figure~\ref{fig:capsm.probs}.

1086: For an alternative, observe that

1087: the inequality of

1088: Equation~\eqref{eq:pnd} implies

1089: $p_{\mathrm{nomap}}	< \EXP\Bigl[\ndfactor_1^{\rvreads}(2\sigsize)^{\rvgaps+1}\Bigr]$

1090: where~$\rvreads$ is the number of sequences

1091: in the overlap, and thus $\rvreads=\sum_{i=0}^{\rvgaps} j_i$.

1092: Based on this observation,

1093: we derived bounds (see Appendix)

1094: that are useful for large values of~$\coverage_2$ (e.g., $\coverage_2=7$),

1095: but at lower coverages, this approach also

1096: underestimates the overlap detection probabilities significantly.

1097:

1098: \begin{figure}

1099: \centerline{

1100: \includegraphics[height=0\myfigheight]{capsm-nodetect}}

1101: \caption[CAPS-MAP statistics]{\captionstyle

1102: 	Clone overlap detection.

1103: 	The graph shows the probability of not detecting an overlap between

1104: 	two clones, as a function of the overlap size. The plots were calculated by a Monte-Carlo

1105: 	method using Theorem~\ref{tm:capsm}.

1106: 	All plots use~$\soverlap=0.1$ for shotgun sequence overlap detection, and

1107: 	$\flength=500$ for shotgun sequence length.

1108: }

1109: 	\label{fig:capsm.probs}

1110: \end{figure}

1111:

1112: \section{BAC ordering}

1113: Our analyses so far have focused on detecting BAC overlaps via CAPS-MAP.

1114: This localized perspective was partly adopted to ease the theoretical analysis.

1115: In practice, mapping is performed based on a global clone-contig incidence matrix.

1116: The global approach exploits the dependencies in the collected data for increased

1117: accuracy.

1118: The algorithmic issues are very similar

1119: to those encountered

1120: in the context of STS-based physical mapping

1121: \shortcite{Gusfield}.

1122: Define the mapping matrix~$\mathbf{M}$ in which the rows correspond to the BACs,

1123: the columns correspond to the contigs, and $\mathbf{M}[i,j]=1$

1124: if contig~$j$ is linked to clone~$i$, otherwise $\mathbf{M}[i,j]=0$.

1125: We want to find the true ordering of the rows and columns,

1126: defined by their physical locations on the genome.

1127: Assume for a moment that the matrix is completely error-free,

1128: i.e., all contigs are correctly assembled, and all contig-clone overlaps

1129: are detected.

1130: It is not hard to see that the row and column permutations

1131: corresponding to the correct ordering result in a matrix~$\mathbf{M}'$

1132: that satisfies the {\em consecutive ones} property (C1P) in the rows and the columns:

1133: for every row~$i$, there exist $a\le b$ with $\mathbf{M}'[i,j]=1$ if and only if

1134: $a\le j\le b$, and the same property holds for the columns.

1135: (A sufficient condition ensuring row-wise (or column-wise) C1P is that

1136: 	if the left endpoint of a contig (or a clone) precedes the left endpoint

1137: 	of another one, the same holds for their right endpoints.)

1138: Finding such permutations is a well-known problem~\shortcite{C1P},

1139: and can be done in linear time.
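
A small hypothetical instance (three BACs and four contigs, not taken from the simulation) may help to fix the definition.
In the matrix on the left, the first row reads $1\,0\,1\,0$ and violates the property; exchanging the second and third columns gives the matrix on the right, which has consecutive ones in every row and every column:
\[
\mathbf{M}=\begin{pmatrix}
1 & 0 & 1 & 0\\
0 & 1 & 1 & 0\\
0 & 1 & 0 & 1
\end{pmatrix},
\qquad
\mathbf{M}'=\begin{pmatrix}
1 & 1 & 0 & 0\\
0 & 1 & 1 & 0\\
0 & 0 & 1 & 1
\end{pmatrix}.
\]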

1140: When the matrix is not error-free, one can use

1141: techniques introduced for STS-based physical mapping.

1142: In \S\ref{sec:simulation} we detail a

1143: method that relies on traveling salesperson tours.

1144:

1145: \section{CAPS-MAP simulation of Drosophila assembly}\label{sec:simulation}

1146: We tested the CAPS-MAP approach

1147: by simulating the assembly of the {\em Drosophila melanogaster}

1148: genome. One of the main goals of the

1149: simulated assembly was to predict the performance of a

1150: hybrid approach combining WGS

1151: and CAPS sequences in a

1152: project that closely resembles the

1153: setup of the honey bee genome project

1154: (\verb|http://www.hgsc.bcm.tmc.edu/projects/honeybee/|),

1155: currently pursued at the Human Genome Sequencing Center

1156: (HGSC)

1157: of Baylor College of Medicine.

1158:

1159: We concatenated all of the Drosophila genome sequence (Release 2.5,

1160: 112.6 million bases),

1161: and generated 2880 BAC sequences

1162: by randomly picking their locations and lengths. The

1163: mean BAC insert length was 150 kbp,

1164: and its standard deviation was 500 bp.

1165: The resulting random BAC library provides 3.6X coverage of the genome.

1166: BACs were arrayed by first partitioning them into 5 groups, and then

1167: using a two-array $24\times24$ transversal design for each group.

1168: Every BAC was covered by 1.2X CAPS sequences: 0.4X per pool on

1169: the first array and 0.2X per pool on the second (reshuffled) array.

1170: In addition, WGS sequences were produced at 4X genome coverage.

1171: The shotgun sequences were generated using the program \texttt{wgs-simulator}

1172: (written by K.~James Durbin),

1173: which mimics shotgun sequence collection realistically by

1174: relying on sequence quality files \shortcite{Phred.error}

1175: produced in sequencing projects.
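
The design parameters above are mutually consistent, as the following arithmetic shows (assuming, as in Figure~\ref{fig:capsm}, that every BAC lies in one row pool and one column pool on each of the two arrays):
\[
5\times 24^2 = 2880 \text{ BACs},
\qquad
2\times 0.4\mathrm{X} + 2\times 0.2\mathrm{X} = 1.2\mathrm{X} \text{ CAPS coverage per BAC}.
\]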

1176:

1177: Shotgun sequences were assembled into contigs using

1178: the Atlas suite of genome assembly tools (\verb|http://www.hgsc.bcm.tmc.edu/downloads/software/atlas/|)

1179: and Phrap (\verb|http://www.phrap.org/|).

1180: A contig was mapped

1181: to a clone if it contained sequences from all four clone pools.

1182: Contigs that mapped to more than one

1183: BAC provided evidence of BAC overlaps.

1184: BACs were grouped into maximal overlapping sets, or {\em bactigs}.

1185:

1186: We compared the overlap graphs to assess CAPS-MAP overlap detection.

1187: The vertices of the overlap graphs are the BACs, and two BACs

1188: are connected if there is an overlap between them.

1189: The true overlap graph for the original

1190: BACs contains 2880 vertices, and 10992 edges in 66 graph components.

1191: The overlap graph calculated from the bactigs

1192: has 9193 edges in 110 components.

1193: Among its edges, 8527 (93\%) are correct,

1194: and 2465 (22\%) of the true overlaps are not discovered.

1195: The median length of detected overlaps is 87 kbp, and the median length

1196: of undetected overlaps is 42 kbp.

1197: There are 666 edges that correspond to no real

1198: overlaps. The vast majority of these ``false positives''

1199: are instances when a long read contig links several BACs,

1200: which do not always overlap pairwise.

1201: All but two of the

1202: CAPS-MAP bactigs are true overlapping sets of BACs.

1203: CAPS-MAP links the assembled contigs to BACs correctly even in these two

1204: bactigs:

1205: the source of the error is the read contig assembly.

1206: Table~\ref{tbl:droso} shows statistics on the bactig sizes and genome coverage.
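
The percentages quoted for the overlap graphs follow directly from the edge counts reported above; the arithmetic is spelled out here for convenience:
\[
9193-8527=666,
\qquad
10992-8527=2465,
\qquad
\frac{8527}{9193}\approx 0.928,
\qquad
\frac{2465}{10992}\approx 0.224 .
\]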

1207:

1208: \begin{table}

1209: \begin{center}

1210: \small

1211: \begin{tabular}{|r|r|r|}

1212: \hline

1213: Minimum bactig size & Genome covered & Number of BACs in bactigs \\

1214: \hline

1215: 2 & 97.1\% & 2758 \\

1216: 3 & 96.7\% & 2746 \\

1217: 5 & 94.9\% & 2714 \\

1218: 10 & 88.5\% & 2565 \\

1219: 15 & 82.4\% & 2400 \\

1220: 20 & 77.9\% & 2284 \\

1221: 30 & 65.0\% & 1945 \\

1222: 51 & 50.9\% & 1521 \\

1223: 60 & 40.3\% & 1195 \\

1224: \hline

1225: %Min bactig size	Bases covered	Total BACs	Genome percentage covered

1226: %1	109852500	2766	97.49%

1227: %2	109466060	2758	97.14%

1228: %3	108913450	2746	96.65%

1229: %5	106944875	2714	94.91%

1230: %10	99773079	2565	88.54%

1231: %15	92872143	2400	82.42%

1232: %20	87814164	2284	77.93%

1233: %25	80304207	2107	71.26%

1234: %30	73277950	1945	65.03%

1235: %35	68186296	1813	60.51%

1236: %40	62570414	1660	55.53%

1237: %45	60987140	1616	54.12%

1238: %50	59293215	1571	52.62%

1239: %51	57301929	1521	50.85%

1240: %60	45414106	1195	40.30%

1241: \end{tabular}

1242: \end{center}

1243: \caption{Statistics for simulated Drosophila assembly.

1244: This table details the genome and BAC library coverage

1245: by bactig sizes. More than half of the genome is covered by bactigs

1246: with at least 51 BACs in them, defining the N50

1247: statistic for the clone map.

1248: }\label{tbl:droso}

1249: \end{table}

1250:

1251: BACs were ordered within each bactig.

1252: For every bactig, an overlap matrix~$\mathbf{M}$ was

1253: calculated, in which

1254: the rows correspond to the bactig's clones,

1255: the columns correspond to the contigs linked to at least one bactig clone,

1256: and $\mathbf{M}[i,j]=1$

1257: if contig~$j$ is linked to clone~$i$, otherwise $\mathbf{M}[i,j]=0$.

1258: The following

1259: traveling salesperson (TSP)

1260: formulation is used to find the correct column permutation.

1261: We search for a tour in a graph, in which every vertex corresponds to a

1262: contig (and thus a column), with an additional vertex~$u_0$.

1263: The weight of an edge between vertices~$u$ and~$u'$, corresponding

1264: to contigs~$j$ and~$j'$, is the

1265: number of rows in which they differ:

1266: $w(u,u')=\sum_i\chi\Bigl\{\mathbf{M}[i,j]\ne\mathbf{M}[i,j']\Bigr\}$,

1267: where~$\chi\{\cdot\}$ is the indicator function.

1268: The weight of an edge between~$u$ and~$u_0$ is the sum of ones

1269: in the column~$j$ that corresponds to~$u$: $w(u,u_0)=\sum_i\mathbf{M}[i,j]$.

1270: Now, a Hamilton path with the minimum weight in this graph

1271: gives the best column permutation in the sense that it

1272: minimizes the number of gaps between blocks of ones within rows

1273: \cite{mapping.tsp}.
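
A toy instance illustrates the edge weights (a hypothetical example with two clones and three contigs, assuming that the tour is closed through~$u_0$).
Let
\[
\mathbf{M}=\begin{pmatrix}
1 & 1 & 0\\
0 & 1 & 1
\end{pmatrix},
\]
and let~$u_1,u_2,u_3$ be the vertices of the three columns.
Then $w(u_1,u_2)=w(u_2,u_3)=1$, $w(u_1,u_3)=2$, and $w(u_1,u_0)=w(u_3,u_0)=1$, $w(u_2,u_0)=2$.
Up to direction there are three tours:
\begin{align*}
u_0,u_1,u_2,u_3,u_0 &: \quad 1+1+1+1 = 4,\\
u_0,u_1,u_3,u_2,u_0 &: \quad 1+2+1+2 = 6,\\
u_0,u_2,u_1,u_3,u_0 &: \quad 2+1+2+1 = 6.
\end{align*}
The minimum is attained by the column order $1,2,3$ (or its reverse), in which every row consists of a single block of ones.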

1274: The best row ordering could be found in an analogous manner, but we used

1275: a simpler method which worked better in practice.

1276: Clones are ordered relative to the contig order

1277: by placing clone~$\clone$ before~$\clone'$ if

1278: the first contig $\clone$ is linked to is before the first

1279: contig $\clone'$ is linked to,

1280: or if their first contigs are identical

1281: but $\clone$ has its last contig before $\clone'$.

1282:

1283: We used the \texttt{concorde} program \shortcite{concorde}

1284: to solve the TSP instances.

1285: The resulting row permutation is

1286: then further analyzed to find clones for which the

1287: permutation arbitrarily enforces an order. Specifically,

1288: if consecutive rows of the permuted matrix~$\mathbf{M}'$ are identical,

1289: then the order of the corresponding clones is not resolved.

1290: Subsequently, we compared the TSP orders to the true orders, which

1291: are known since the BAC sequences were generated artificially.

1292: Figure~\ref{fig:fly.order} shows the

1293: outcome of the comparison for two bactigs.

1294: The TSP order is very close to the true order.

1295:

1296: \begin{figure}

1297: \centerline{\includegraphics[height=.15\textheight]{bactig-example-1}}

1298: \centerline{\includegraphics[height=.15\textheight]{bactig-example-2}}

1299: \caption{

1300: Correctness of BAC ordering in Drosophila simulation.

1301: The top (\textsf{Loc})

1302: of each graph shows the relative

1303: physical location of each BAC,

1304: the middle (\textsf{True}) shows

1305: the correct BAC order, and

1306: the bottom (\textsf{TSP})

1307: shows the TSP order,

1308: and the BAC identifiers.

1309: Identical BACs

1310: are connected in order to

1311: display the differences between the two permutations.

1312: The order of BACs at the bottom is not resolved when they are connected

1313: with a horizontal line.

1314: By resolving them optimally,

1315: bactig 23 produces the order of 72 BACs with 12 breakpoints

1316: and bactig 16 orders 50 BACs with 9 breakpoints.

1317: (Breakpoints are neighbors in

1318: the TSP order that are not neighbors in the true order.)

1319: }

1320: \label{fig:fly.order}

1321: \end{figure}

1322:

1323: \section{Discussion}

1324: %{\mysectionstyle 5. Discussion. \ \ }

1325: The experimental expedience of shotgun sequencing has been essential

1326: for the success of genome-scale sequencing projects in the past decade.

1327: The power of the concept comes from the now established

1328: fact that the loss of information about read localization

1329: incurred by random subcloning can be largely recovered

1330: in the assembly step using sequence information.

1331: Clone pooling is similar in spirit to shotgun sequencing in that it

1332: introduces experimental expedience by dramatically reducing the number

1333: of subclone library preparations. The clone pooling step leads to a

1334: temporary loss of information about localization of shotgun sequences on

1335: individual BAC clones. We have demonstrated that

1336: sequence information can be used to

1337: successfully recover most of the information lost in

1338: pooling.

1339:

1340: Our analyses presented here indicate the theoretical feasibility of the CAPS-MAP

1341: method and provide guidance for the design of genome-scale CAPS-MAP

1342: experiments. In particular, our analysis indicates that transversal

1343: pooling designs can accommodate high levels of clone redundancy and

1344: perform well even at low levels of shotgun sequence coverage of clone

1345: pools.

1346:

1347: Practical biological and technical considerations may set a limit to the

1348: array size. In the case of large genomes, these limitations may imply that the set of

1349: BACs is partitioned and that pooling is applied separately to individual

1350: subsets. This results in a lower clone redundancy within individual arrays

1351: and a larger number of pools. Our analysis allows for the

1352: partitioning of clones. It also allows for the

1353: possibility of including whole-genome shotgun sequence reads.

1354: It thus covers

1355: realistic and practical scenarios of the CAPSS and CAPS-MAP methods'

1356: application.

1357:

1358: \section*{Acknowledgements}

1359: We are grateful to

1360: Richard Gibbs and George

1361: Weinstock for sharing pre-publication information on CAPSS and for useful

1362: comments.

1363: Our discussion of

1364: computing CAPS-MAP overlap detection probabilities

1365: has greatly benefited from conversations with

1366: Luc Devroye and Michael Waterman.

1367: This work was supported by grants

1368: R01~HG02583-01 from NHGRI at the NIH,

1369: U01~RR18464 from the NCRR,

1370: and

1371: 250391-02 from the NSERC.

1372:

1373: \textsc{Remark.}\ \

1374: An extended abstract of this paper is published in

1375: {\em Genome Informatics}, vol.~14, Universal Academy Press, Tokyo

1376: (Proceedings of the

1377: 14th International Conference on Genome Informatics (GIW),

1378: December 14--17, 2003, Yokohama, Japan).

1379:

1380:

1381: \bibliographystyle{chicago}

1382: \begin{thebibliography}{}

1383:

1384: \bibitem[\protect\citeauthoryear{Alizadeh, Karp, Newberg, and Weisser}{Alizadeh

1385:   et~al.}{1995}]{mapping.tsp}

1386: Alizadeh, F., R.~M. Karp, L.~A. Newberg, and D.~K. Weisser (1995).

1387: \newblock Physical mapping of chromosomes: a combinatorial problem in molecular

1388:   biology.

1389: \newblock {\em Algorithmica\/}~{\em 13}, 52--76.

1390:

1391: \bibitem[\protect\citeauthoryear{Applegate, Bixby, Chv\'atal, and

1392:   Cook}{Applegate et~al.}{1999}]{concorde}

1393: Applegate, D., R.~Bixby, V.~Chv\'atal, and W.~Cook (1999).

1394: \newblock Concorde 99.12.15 release.

1395: \newblock \verb|http://www.math.princeton.edu/tsp/concorde.html|.

1396:

1397: \bibitem[\protect\citeauthoryear{Arratia, Lander, Tavar{\'e}, and

1398:   Waterman}{Arratia et~al.}{1991}]{map.anchor}

1399: Arratia, R., E.~S. Lander, S.~Tavar{\'e}, and M.~S. Waterman (1991).

1400: \newblock Genomic mapping by anchoring random clones: A mathematical analysis.

1401: \newblock {\em Genomics\/}~{\em 11}, 806--827.

1402:

1403: \bibitem[\protect\citeauthoryear{Booth and Lueker}{Booth and

1404:   Lueker}{1976}]{C1P}

1405: Booth, K.~S. and G.~S. Lueker (1976).

1406: \newblock Testing for the {C}onsecutive {O}nes {P}roperty, interval graphs, and

1407:   graph planarity using {PQ}-tree algorithms.

1408: \newblock {\em Journal of Computer and System Sciences\/}~{\em 13}, 335--379.

1409:

1410: \bibitem[\protect\citeauthoryear{Cai, Chen, Gibbs, and Bradley}{Cai

1411:   et~al.}{2001}]{CAPSS}

1412: Cai, W.-W., R.~Chen, R.~A. Gibbs, and A.~Bradley (2001).

1413: \newblock A clone-array pooled strategy for sequencing large genomes.

1414: \newblock {\em Genome Research\/}~{\em 11}, 1619--1623.

1415:

1416: \bibitem[\protect\citeauthoryear{Colbourn and Dinitz}{Colbourn and

1417:   Dinitz}{1996}]{design.handbook}

1418: Colbourn, C.~J. and J.~H. Dinitz (Eds.) (1996).

1419: \newblock {\em The {CRC} Handbook of Combinatorial Designs}.

1420: \newblock Boca Raton: CRC Press.

1421:

1422: \bibitem[\protect\citeauthoryear{Cs{\H u}r\"os and Milosavljevic}{Cs{\H u}r\"os

1423:   and Milosavljevic}{2002}]{PGI.conf}

1424: Cs{\H u}r\"os, M. and A.~Milosavljevic (2002).

1425: \newblock Pooled genomic indexing ({PGI}): mathematical analysis and experiment

1426:   design.

1427: \newblock In {\em Algorithms in Bioinformatics: Second International Workshop},

1428:   Volume 2452 of {\em {LNCS}}, pp.\  10--28. Berlin Heidelberg:

1429:   Springer-Verlag.

1430:

1431: \bibitem[\protect\citeauthoryear{Du and Hwang}{Du and Hwang}{2000}]{CGT}

1432: Du, D.-Z. and F.~K. Hwang (2000).

1433: \newblock {\em Combinatorial Group Testing and Its Applications\/} (2nd ed.).

1434: \newblock Singapore: World Scientific.

1435:

1436: \bibitem[\protect\citeauthoryear{Ewens and Grant}{Ewens and Grant}{2001}]{EG}

1437: Ewens, W.~J. and G.~R. Grant (2001).

1438: \newblock {\em Statistical Methods in Bioinformatics: An Introduction}.

1439: \newblock New York: Springer-Verlag.

1440:

1441: \bibitem[\protect\citeauthoryear{Ewing and Green}{Ewing and

1442:   Green}{1998}]{Phred.error}

1443: Ewing, B. and P.~Green (1998).

1444: \newblock Base-calling of automated sequencer traces using {\em {p}hred}: {II}.

1445:   error probabilities.

1446: \newblock {\em Genome Research\/}~{\em 8}, 186--194.

1447:

1448: \bibitem[\protect\citeauthoryear{Green}{Green}{2001}]{Sequencing.review}

1449: Green, E.~D. (2001).

1450: \newblock Strategies for the systematic sequencing of complex genomes.

1451: \newblock {\em Nature Reviews Genetics\/}~{\em 2}, 573--583.

1452:

1453: \bibitem[\protect\citeauthoryear{Gusfield}{Gusfield}{1997}]{Gusfield}

1454: Gusfield, D. (1997).

1455: \newblock {\em Algorithms on Strings, Trees, and Sequences: Computer Science

1456:   and Computational Biology}.

1457: \newblock UK: Cambridge University Press.

1458:

1459: \bibitem[\protect\citeauthoryear{{IHGSC}}{{IHGSC}}{2001}]{human.genome}

1460: {IHGSC} (2001).

1461: \newblock Initial sequencing and analysis of the human genome.

1462: \newblock {\em Nature\/}~{\em 409\/}(6822), 860--921.

1463:

1464: \bibitem[\protect\citeauthoryear{Lander and Waterman}{Lander and

1465:   Waterman}{1988}]{LanderWaterman}

1466: Lander, E.~S. and M.~S. Waterman (1988).

1467: \newblock Genomic mapping by fingerprinting random clones: a mathematical

1468:   analysis.

1469: \newblock {\em Genomics\/}~{\em 2}, 231--239.

1470:

1471: \bibitem[\protect\citeauthoryear{Marra, Kucaba, Dietrich, Green, Brownstein,

1472:   Wilson, McDonald, Hillier, McPherson, and Waterston}{Marra

1473:   et~al.}{1997}]{map.sequenceready}

1474: Marra, M.~A., T.~A. Kucaba, N.~L. Dietrich, E.~D. Green, B.~Brownstein, R.~K.

1475:   Wilson, K.~M. McDonald, L.~W. Hillier, J.~D. McPherson, and R.~H. Waterston

1476:   (1997).

1477: \newblock High throughput fingerprint analysis of large-insert clones.

1478: \newblock {\em Genome Research\/}~{\em 7}, 1072--1084.

1479:

1480: \bibitem[\protect\citeauthoryear{Waterman}{Waterman}{1995}]{Waterman}

1481: Waterman, M.~S. (1995).

1482: \newblock {\em Introduction to Computational Molecular Biology: Maps, Sequences

1483:   and Genomes}.

1484: \newblock Boca Raton: Chapman \&\ Hall.

1485:

1486: \bibitem[\protect\citeauthoryear{Weber and Myers}{Weber and Myers}{1997}]{WGS}

1487: Weber, J.~L. and E.~W. Myers (1997).

1488: \newblock Human whole-genome shotgun sequencing.

1489: \newblock {\em Genome Research\/}~{\em 7}, 401--409.

1490:

1491: \bibitem[\protect\citeauthoryear{Wendl and Waterston}{Wendl and

1492:   Waterston}{2002}]{WendlWaterston}

1493: Wendl, M.~C. and R.~H. Waterston (2002).

1494: \newblock Generalized gap model for bacterial artificial chromosome clone

1495:   fingerprint mapping and shotgun sequencing.

1496: \newblock {\em Genome Research\/}~{\em 12}, 1943--1949.

1497:

1498: \end{thebibliography}

1499:

1500: \clearpage

1501: \appendix

1502: \section*{Appendix}

1503: Here we expand our discussion on the probability of overlap detection in CAPS-MAP.

1504: In particular, we derive formulas that

1505: show the exponential decay of the probability of not detecting an overlap

1506: when the coverage~$\coverage_2$ is not too small.

1507: We start with the bound

1508: \begin{equation}\label{eq:pnomap.bound.def}

1509: p_{\mathrm{nomap}} < \EXP\Bigl[\ndfactor_1^{\rvreads}(2\sigsize)^{\rvgaps+1}\Bigr].

1510: \end{equation}

1511:

1512:

1513: Define

1514: \[

1515: \gengaps_{\nreads}(z) = \Expcmd{z^{\rvgaps}}{\rvreads=\nreads},

1516: \]

1517: the probability generating function for

1518: the distribution of the number of gaps conditioned on the number of shotgun sequences.

1519: Define the events~$A_i$ for $i=1,\dotsc,\nreads-1$: $A_i$ denotes

1520: the event that the $i$-th sequence is followed by a gap, conditioned

1521: on the event $\{\rvreads=\nreads\}$.

1522: For arbitrary~$\ngaps$ and any set of indices $i_1<i_2<\dotsb<i_{\ngaps}$,

1523: \[

1524: \PROB\Bigl\{A_{i_1}A_{i_2}\dotsm A_{i_{\ngaps}}\Bigr\}

1525: 	= (1-\ngaps\delta)_{+}^{\nreads},

1526: \]

1527: where $\delta=\frac{\roverlap\flength}{\coverlap\clength}$,

1528: and $(x)_+=\max\{0,x\}$ \shortcite{EG,WendlWaterston}.

1529: Let

1530: \begin{align*}

1531: S_0 & = 1\\*

1532: S_{\ngaps} & = \sum_{i_1<\dotsb< i_{\ngaps}} \PROB\Bigl\{A_{i_1}A_{i_2}\dotsm A_{i_{\ngaps}}\Bigr\}

1533: 	= \binom{\nreads-1}{\ngaps}(1-\ngaps\delta)_{+}^{\nreads}.

1534: \end{align*}

1535: Using inclusion-exclusion,

1536: \[

1537: \Probcmd{\rvgaps=\ngaps}{\rvreads=\nreads}

1538: =\sum_{j=\ngaps}^{\nreads-1}

1539: 	\binom{j}{\ngaps}(-1)^{j-\ngaps} S_j.

1540: \]

1541: Hence,

1542: \begin{align*}

1543: \gengaps_{\nreads}(z) & =

1544: \sum_{\ngaps=0}^{\nreads-1}

1545: z^\ngaps \sum_{j=\ngaps}^{\nreads-1} \binom{j}{\ngaps}(-1)^{j-\ngaps} S_j\\*

1546: & = \sum_{j=0}^{\nreads-1} S_j

1547: 	\sum_{\ngaps=0}^j (-1)^{j-\ngaps} \binom{j}{\ngaps} z^{\ngaps} \\*

1548: & = \sum_{j=0}^{\nreads-1} S_j (z-1)^j.

1549: \end{align*}

1550: Substituting the $S_j$ values:

1551: \begin{equation}\label{eq:gengaps}

1552: \gengaps_{\nreads}(z) =

1553: 	\sum_{j=0}^{\nreads-1}

1554: 		\binom{\nreads-1}{j} (1-j\delta)_{+}^{\nreads} (z-1)^j,

1555: \end{equation}

1556: a result interesting on its own.
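
Equation~\eqref{eq:gengaps} can be sanity-checked directly; the following remarks are a sketch that uses only the formula itself. At $z=1$ every term with $j\ge1$ vanishes, so $\gengaps_{\nreads}(1)=1$, as required of a probability generating function. Differentiating once at $z=1$ leaves only the $j=1$ term, which gives the conditional mean number of gaps for $\nreads\ge2$:
\[
\Expcmd{\rvgaps}{\rvreads=\nreads}
	= \gengaps_{\nreads}'(1)
	= (\nreads-1)(1-\delta)_{+}^{\nreads}.
\]
In the smallest nontrivial case $\nreads=2$, the formula reduces to
\[
\gengaps_{2}(z) = 1 + (1-\delta)_{+}^{2}(z-1),
\]
that is, the single possible gap occurs with probability $(1-\delta)_{+}^{2}$.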

1557:

1558: Returning to Equation~\eqref{eq:pnomap.bound.def}, we have

1559: \begin{equation}\label{eq:pnomap.bound.1}

1560: p_{\mathrm{nomap}}

1561: < \EXP\biggl[

1562: 	2\sigsize \ndfactor_1^{\rvreads}

1563: 	 \sum_{j=0}^{\rvreads-1}

1564: 		\binom{\rvreads-1}{j} (1-j\delta)_{+}^{\rvreads} (2\sigsize-1)^j

1565: 	\biggr],

1566: \end{equation}

1567: where~$\rvreads$ is a Poisson random variable with

1568: mean

1569: \[

1570: \mreads=

1571: \frac{\coverage_2\coverlap\clength}{\flength}.

1572: \]

1573: For every~$\nreads\ge0$,

1574: $(1-j\delta)_{+}^{\nreads} \le e^{-j\nreads\delta}$,

1575: hence

1576: \[

1577: \sum_{j=0}^{\nreads-1}

1578: 		\binom{\nreads-1}{j} (1-j\delta)_{+}^{\nreads} (2\sigsize-1)^j

1579: 	\le \Bigl(1+(2\sigsize-1)e^{-\nreads\delta}\Bigr)^{\nreads-1}.

1580: \]
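
To spell out the two ingredients of this step (assuming $2\sigsize\ge1$, so that all terms of the sum are nonnegative): the elementary inequality $1-x\le e^{-x}$ with $x=j\delta$ gives $(1-j\delta)_{+}^{\nreads}\le e^{-j\nreads\delta}$, and the binomial theorem then yields
\[
\sum_{j=0}^{\nreads-1}\binom{\nreads-1}{j}\Bigl((2\sigsize-1)e^{-\nreads\delta}\Bigr)^{j}
	= \Bigl(1+(2\sigsize-1)e^{-\nreads\delta}\Bigr)^{\nreads-1}.
\]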

1581: Consequently, by Equation~\eqref{eq:pnomap.bound.1},

1582: \[

1583: p_{\mathrm{nomap}}

1584: < \EXP\biggl[

1585: 	2\sigsize \ndfactor_1^{\rvreads}

1586: 		\Bigl(1+(2\sigsize-1)e^{-\rvreads\delta}\Bigr)^{\rvreads-1}

1587: 	\biggr].

1588: \]

1589: Recall that the random value we take the expectation of

1590: is an upper bound on~$p_{\mathrm{nomap}}(j_0,\dotsc,j_{\rvgaps})$,

1591: and thus if it is larger than one, it is useless.

1592: Let

1593: \[

1594: f(\nreads)=

1595: 	\min\Bigl\{1,2\sigsize \ndfactor_1^{\nreads}

1596: 		\Bigl(1+(2\sigsize-1)e^{-\nreads\delta}\Bigr)^{\nreads-1}\Bigr\}.

1597: \]

1598: So we have in fact the bound

1599: \begin{equation}\label{eq:pnomap.bound.2}

1600: p_{\mathrm{nomap}}

1601: 	< \EXP f(\rvreads).

1602: \end{equation}

1603: In order to achieve exponential decay in the bound, we would like to have

1604: \[

1605: \ndfactor_1\Bigl(1+(2\sigsize-1)e^{-\nreads_0\delta}\Bigr)<1

1606: \]

1607: for some~$\nreads_0<\mreads$. Rearranging the inequality, we have

1608: \begin{equation}\label{eq:goodaw}

1609: (2\sigsize-1)\frac{\sigsize(\pcoverage+\wgscoverage)+\pcoverage}{(\sigsize-1)\pcoverage}

1610: 	< e^{(2\pcoverage+\wgscoverage)\roverlap},

1611: \end{equation}

1612: which is satisfied when~$\pcoverage$ and~$\wgscoverage$ are not too small

1613: (see Figure~\ref{fig:goodaw}).

1614:

1615: \begin{figure}

1616: \centerline{\includegraphics[height=.3\textheight]{good-aw}}

1617: \caption{Values of the pooled shotgun coverage~$\pcoverage$

1618: and WGS coverage~$\wgscoverage$, for which the clone overlap detection bound applies,

1619: are above the graphs (see Equation~\eqref{eq:goodaw}).}\label{fig:goodaw}

1620: \end{figure}

1621:

1622: There are several possible ways to

1623: exploit the fact that the exponential component of $f(\nreads)$ becomes small

1624: already for~$\nreads$ less than the expected value~$\mreads$.

1625: The main idea is that when evaluating

1626: $\EXP f(\rvreads)=\sum f(\nreads)\PROB\{\rvreads=\nreads\}$

1627: in Equation~\eqref{eq:pnomap.bound.2},

1628: either the probability of~$\rvreads=\nreads$ is small, or

1629: the value of~$f(\nreads)$ is small.

1630: Let~$0<k<\mreads$ be an integer threshold (that we specify later), and let~$\alpha=k/\mreads$.

1631: To proceed with Equation~\eqref{eq:pnomap.bound.2}, we condition

1632: on the event~$\{\rvreads\le\alpha\mreads\}$.

1633: We use the bound

1634: \begin{equation}\label{eq:poisson.bound}

1635: \PROB\{\rvreads\le \alpha\mreads\}

1636: 	< \frac{e^{-\mreads(1-\alpha)^2/2}}{(1-\alpha)\sqrt{2\pi\alpha\mreads}},

1637: \end{equation}

1638: which we prove here quickly.

1639: By definition,

1640: \begin{align*}

1641: \PROB\{\rvreads\le\alpha\mreads\}

1642: & = \sum_{\nreads=0}^k

1643: 	\frac{{\mreads}^{\nreads}}{\nreads!} e^{-\mreads}

1644: \le e^{-\mreads}\frac{{\mreads}^{k}}{k!}

1645: 	\sum_{\nreads=0}^k \Bigl(\frac{k}{\mreads}\Bigr)^{\nreads}\\*

1646: & <  e^{-\mreads}\frac{{\mreads}^{k}}{k!} (1-\alpha)^{-1}

1647: < e^{-\mreads(1-\alpha+\alpha\ln\alpha)} \frac{1}{(1-\alpha)\sqrt{2\pi\alpha\mreads}},

1648: \end{align*}

1649: where we used the Stirling lower bound $k!>(k/e)^k\sqrt{2\pi k}$. Using a Taylor series expansion,

1650: \[

1651: 1-\alpha+\alpha\ln\alpha = \frac12 (1-\alpha)^2 + \frac16 (1-\alpha)^3 + \frac{1}{12}(1-\alpha)^4 + \dotsb

1652: \]

1653: and thus $1-\alpha+\alpha\ln\alpha>\frac12 (1-\alpha)^2$ for~$0<\alpha<1$,

1654: and Equation~\eqref{eq:poisson.bound} follows.
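
To give a feeling for the size of this bound, here is a numerical illustration with values chosen only for the example, $\mreads=100$ and $\alpha=\frac12$ (so $k=50$):
\[
\PROB\{\rvreads\le 50\}
	< \frac{e^{-100\cdot(1/2)^2/2}}{\frac12\sqrt{2\pi\cdot 50}}
	= \frac{e^{-12.5}}{5\sqrt{\pi}}
	\approx 4.2\times 10^{-7}.
\]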

1655:

1656: Now,

1657: \begin{align*}

1658: \EXP f(\rvreads)

1659:  & = \Expcmd{f(\rvreads)}{\rvreads\le\alpha\mreads}\PROB\{\rvreads\le\alpha\mreads\}

1660:  +\Expcmd{f(\rvreads)}{\rvreads>\alpha\mreads} \PROB\{\rvreads>\alpha\mreads\}\\*

1661:  & \le \PROB\{\rvreads\le\alpha\mreads\} + \Expcmd{f(\rvreads)}{\rvreads>\alpha\mreads}\PROB\{\rvreads>\alpha\mreads\}\\*

1662:  & < \frac{e^{-\mreads(1-\alpha)^2/2}}{(1-\alpha)\sqrt{2\pi\alpha\mreads}}

1663:  	+ \frac{2\sigsize e^{-\mreads}

1664:  		\sum_{\nreads=0}^{\infty}

1665:  			\frac{\Bigl(\mreads\ndfactor_1(1+(2\sigsize-1)e^{-\alpha\delta\mreads})\Bigr)^{\nreads}}{\nreads!}}{1+(2\sigsize-1)e^{-\alpha\delta\mreads}}

1666:  		\\*

1667:  & = \frac{\exp\Bigl(-\mreads(1-\alpha)^2/2\Bigr)}{(1-\alpha)\sqrt{2\pi\alpha\mreads}}

1668:  	+ \frac{2\sigsize \exp\biggl(-\mreads\Bigl(1-\ndfactor_1(1+(2\sigsize-1)e^{-\alpha\coverage_2\roverlap})\Bigr)\biggr)}{

1669:  		1+(2\sigsize-1)e^{-\alpha\coverage_2\roverlap}},

1670:  \end{align*}

1671:  where we used~$\delta\mreads=\coverage_2\roverlap$.

1672: Figure~\ref{fig:balance.alpha} shows values of~$\alpha$ for different~$\pcoverage,\wgscoverage$ pairs

1673: that balance the exponents in the two terms.
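
The closed form of the second term can be checked with the exponential series. Writing $x=\ndfactor_1\bigl(1+(2\sigsize-1)e^{-\alpha\delta\mreads}\bigr)$ as a shorthand (the symbol~$x$ is introduced here only for this remark),
\[
e^{-\mreads}\sum_{\nreads=0}^{\infty}\frac{(\mreads x)^{\nreads}}{\nreads!}
	= e^{-\mreads}e^{\mreads x}
	= \exp\bigl(-\mreads(1-x)\bigr),
\]
so the second term decays exponentially in~$\mreads$ exactly when $x<1$.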

1674:

1675: \begin{figure}

1676: \centerline{\includegraphics[width=\textwidth]{balance-alpha}}

1677: \caption{Balanced $\alpha$ values for our exponential bound.}\label{fig:balance.alpha}

1678: \end{figure}

1679:

1680: After choosing a balancing~$\alpha$ value for a given~$(\pcoverage,\wgscoverage)$ pair,

1681: we obtain

1682: \[

1683: \EXP f(\rvreads) < X_1 \exp(-X_2 \roverlap\clength),

1684: \]

1685: where~$X_1$ and~$X_2$ are constants that do not depend on~$\roverlap$.

1686: The bound becomes small ($<10^{-8}$)

1687: for larger~$\coverage_2$ values (e.g., $\coverage_2=7$),

1688: but even then, it is not very tight.

1689: Based on simulation results, the tightness is lost

1690: with the inequality of Equation~\eqref{eq:pnomap.bound.def},

1691: and not in the following steps.

1692: For example, we evaluated the bounds of Equations~\eqref{eq:pnomap.bound.1}

1693: and~\eqref{eq:pnomap.bound.2} numerically.

1694: While they are fairly close to each other, and to the exponential bound

1695: using~$\alpha$, they already bound the expected value of~Equation~\eqref{eq:pnomap.exp}

1696: rather loosely in many cases.

1697: Furthermore, even for~$(\pcoverage,\wgscoverage)$ pairs

1698: for which we cannot establish exponential decay

1699: using the inequality of Equation~\eqref{eq:pnomap.bound.def}, the overlap

1700: detection probability may get very close to one.

1701: For instance, a two-array design with

1702: $\pcoverage=0.5$ and~$\wgscoverage=2$ falls below the curve

1703: of Figure~\ref{fig:goodaw}, yet can be employed efficiently

1704: in CAPS-MAP as shown in Figure~\ref{fig:capsm.probs}.

1705: Therefore, we prefer using a Monte-Carlo evaluation of Equation~\eqref{eq:pnomap.exp}

1706: to predict the experimental performance of CAPS-MAP.\label{veryend}

1707:

1708: \end{document}