1: \documentclass[11pt,e-only,final]{amsart}
2:
3: % -------------------------------- amsart publication data
4: \newcommand{\publname}{}
5: \dateposted{Preprint posted electronically on December 10, 2003}
6: \renewcommand{\volinfo}[1]{} % eats the comma in the cls file
7: \newcommand{\pageinfo}{Pages 1--\pageref{veryend}}
\copyrightinfo{2003}{M. Cs\H{u}r\"os, B. Li, and A. Milosavljevic}
9: \PII{}
10: % --------------------------------
11:
12: \usepackage{graphicx}
13: \usepackage{latexsym,cmmib57}
14: \usepackage[psamsfonts]{eucal}
15: %\usepackage{probability,url}
16: %\usepackage{marginkern}
17: %\usepackage{hyperref}
18: \usepackage[in]{fullpage}
19: \usepackage{chicago}
20:
21: \newcommand{\probgen}[4]{#1 #2{#4} #3}
22: \newcommand{\conditional}[6]{%
23: #1\!\left#2\,{#5}\,\vphantom{#6}\right#3\left.%
24: \vphantom{#5}{#6}\,\right#4}
25: \newcommand{\condgen}[6]{{#1}#2 #5 #3 #6 #4}
26: \newcommand{\bbrd}[1]{\mbox{\rm{I}\kern-.1667em{#1}}}
27: \newcommand{\EXP}{\mathbb{E}}
28: \newcommand{\PROB}{\mathbb{P}}
29: \newcommand{\Probsm}[1]{\probgen{\PROB}{\bigl\{}{\bigr\}}{#1}}
30: \newcommand{\Probmd}[1]{\probgen{\PROB}{\Bigl\{}{\Bigr\}}{#1}}
31: \newcommand{\Problg}[1]{\probgen{\PROB}{\biggl\{}{\biggr\}}{#1}}
32: \newcommand{\Probxl}[1]{\probgen{\PROB}{\Biggl\{}{\Biggr\}}{#1}}
33: \newcommand{\Probcsm}[2]{\condgen{\PROB}{\bigl\{}{\bigm|}{\bigr\}}{#1}{#2}}
34: \newcommand{\Probcmd}[2]{\condgen{\PROB}{\Bigl\{}{\Bigm|}{\Bigr\}}{#1}{#2}}
35: \newcommand{\Probclg}[2]{\condgen{\PROB}{\biggl\{}{\biggm|}{\biggr\}}{#1}{#2}}
36: \newcommand{\Probcxl}[2]{\condgen{\PROB}{\Biggl\{}{\Biggm|}{\Biggr\}}{#1}{#2}}
37:
38: \newcommand{\Expc}[2]{\conditional{\EXP}{[}{|}{]}{#1}{#2}}
39: \newcommand{\Expcsm}[2]{\condgen{\EXP}{\bigl[}{\bigm|}{\bigr]}{#1}{#2}}
40: \newcommand{\Expcmd}[2]{\condgen{\EXP}{\Bigl[}{\Bigm|}{\Bigr]}{#1}{#2}}
41: \newcommand{\Expclg}[2]{\condgen{\EXP}{\biggl[}{\biggm|}{\biggr]}{#1}{#2}}
42: \newcommand{\Expcxl}[2]{\condgen{\EXP}{\Biggl[}{\Biggm|}{\Biggr]}{#1}{#2}}
43:
44:
45: \newcommand{\myfigheight}{.30\textheight}
46:
47: \newtheorem{theorem}{Theorem}
48:
49: \newsavebox{\fmbox}
50: \newenvironment{fmpage}[1]
51: {\begin{lrbox}{\fmbox}\begin{minipage}{#1}}
52: {\end{minipage}\end{lrbox}\fbox{\usebox{\fmbox}}}
53:
54: \newcommand{\captionstyle}{\small}
55: \newcommand{\mysectionstyle}{\medskip\bf}
56:
57: \newcommand{\nclones}{N} % number of clones
58: \newcommand{\poolsize}{m} % number of clones within a pool
59: \newcommand{\clength}{L} % length of a clone
60: \newcommand{\nfrags}{F} % total number of shotgun fragments
61: \newcommand{\flength}{\ell} % expected shotgun fragment length
62: \newcommand{\coverage}{c} % shotgun coverage
63: \newcommand{\rlength}{\lambda} % random read length
64: \newcommand{\mlength}{M} % number of positions where match is detected
65: \newcommand{\npools}{K} % number of pools
66: \newcommand{\glength}{G} % genome length
67:
68: \newcommand{\cset}{\mathcal{B}} % clone set
69: \newcommand{\clone}{B}
70: \newcommand{\pset}{\mathcal{P}} % pool set
71: \newcommand{\pool}{P}
72:
73: \newcommand{\nrect}{R} % number of preserved rectangles
74: \newcommand{\rectangle}{\mathcal{R}} % rectangle
75: \newcommand{\irect}{I} % indicator for preserved rectangle
76:
77: \newcommand{\mindecon}{k} % min. pools needed for deconvolution
78: \newcommand{\sigsize}{n} % number of pools one clone is included in
79:
80: \newcommand{\incmatrix}{\mathbf{M}}
81: \newcommand{\sigvector}{\mathbf{c}}
82: \newcommand{\sigindexvector}{\mathbf{x}}
83: \newcommand{\sigdistance}{\Delta}
84:
85: \newcommand{\codeword}{\mathbf{c}}
86: \newcommand{\codeset}{\mathcal{C}}
87: \newcommand{\codelen}{\sigsize}
88: \newcommand{\codedist}{d}
89: \newcommand{\codedim}{\mindecon}
90: \newcommand{\codegen}{\mathbf{G}}
91: \newcommand{\binsubst}{\phi}
92: \newcommand{\vecsubst}{f}
93: \newcommand{\msgword}{\mathbf{u}}
94: \newcommand{\weight}{w}
95:
96: \newcommand{\eventA}{\mathcal{X}}
97: \newcommand{\eventB}{\mathcal{Y}}
98:
99: \newcommand{\Galois}[1]{\mathbb{F}_{#1}}
100: \newcommand{\ambiguity}{t}
101:
102: \newcommand{\wgscoverage}{w}
103: \newcommand{\pcoverage}{a}
104: \newcommand{\roverlap}{\sigma}
105: \newcommand{\fpooled}{\mu}
106: \newcommand{\soverlap}{\vartheta}
107:
108: \newcommand{\coverlap}{\Theta}
109:
110: \newcommand{\Pa}{P_0}
111: \newcommand{\Pb}{P_{\fpooled}^{(\sigsize)}}
112: \newcommand{\Pc}{P_{\fpooled}^{(\infty)}}
113:
114: \newcommand{\ilen}{\lambda}
115:
116: \newcommand{\ndfactor}{\beta}
117: \newcommand{\pnd}{q}
118:
119: \newcommand{\rvgaps}{G}
120: \newcommand{\rvreads}{R}
121: \newcommand{\gengaps}{\mathcal{G}}
122: \newcommand{\nreads}{r}
123: \newcommand{\ngaps}{g}
124: \newcommand{\mreads}{\lambda}
125:
126: \begin{document}
127: \title[CAPSS and CAPS-MAP]{Clone-array pooled shotgun mapping and sequencing:\\
128: design and analysis of experiments}
129:
130: \author[M. Cs{\H u}r\"os]{Mikl\'os Cs\H{u}r\"os}\address{MC:
131: D\'epartement d'informatique et de recherche op\'erationnelle,
132: Universit\'e de Montr\'eal,
133: CP 6128 succ. Centre-Ville,
134: Montr\'eal, Qu\'ebec H3C 3J7, Canada.
135: Phone: +1 (514) 343-6111x1655, Fax: +1 (514) 343-5834.}
136: \email{csuros@iro.umontreal.ca}
137: \urladdr{http://www.iro.umontreal.ca/\textasciitilde{}csuros/}
138: \author[B. Li]{Bingshan Li}\address{BL:
139: Human Genome Sequencing Center, Department of Molecular and Human Genetics,
Baylor College of Medicine, Houston, Texas, 77030, USA.}
141: \author[A. Milosavljevic]{
142: Aleksandar Milosavljevic}\address{AM:
143: Bioinformatics Research Laboratory,
144: Program in Structural and Computational Biology and Molecular Biophysics, and
145: Human Genome Sequencing Center ---
146: Department of Molecular and Human Genetics,
147: Baylor College of Medicine,
148: Houston, Texas 77030, USA.}
149: \email{amilosav@bcm.tmc.edu}
150: \urladdr{http://www.brl.bcm.tmc.edu/}
151:
152: \begin{abstract}
153: This paper studies sequencing and mapping
154: methods that rely solely on pooling and shotgun sequencing of clones.
155: First, we scrutinize and improve
156: the recently proposed Clone-Array Pooled
157: Shotgun Sequencing (CAPSS) method,
158: which delivers a BAC-linked assembly of a whole genome sequence.
159: Secondly, we introduce a novel physical mapping method, called
160: {\em Clone-Array Pooled Shotgun Mapping} (CAPS-MAP), which
161: computes the physical ordering of BACs in a random library.
162: Both CAPSS and CAPS-MAP
163: construct subclone libraries from
164: pooled genomic BAC clones.
165:
166: We propose
167: algorithmic and experimental improvements
168: that make CAPSS a viable option for
169: sequencing a set of BACs. We provide the first
170: probabilistic model of CAPSS sequencing progress. The model leads to
171: theoretical results supporting previous, less formal arguments on
172: the practicality of CAPSS.
173: We demonstrate the
174: usefulness of CAPS-MAP for clone overlap detection
with a probabilistic analysis
and a simulated assembly
of the {\em Drosophila melanogaster} genome.
178: Our analysis indicates that CAPS-MAP is well-suited for
179: detecting BAC overlaps in a highly redundant library,
180: relying on a low amount of shotgun sequence information.
181: Consequently, it is a practical method
182: for computing the physical ordering of clones in
183: a random library, without requiring additional clone fingerprinting.
184: Since CAPS-MAP requires only shotgun sequence reads,
185: it can be seamlessly incorporated into a
186: sequencing project with almost no experimental overhead.
187: \end{abstract}
188:
189: \keywords{sequencing, physical mapping, pooled shotgun sequencing}
190:
191:
192: \maketitle
193:
194:
195:
196: %{\bf ACM subject classification:}
197: %J.3 [\textbf{Life and Medical Sciences}]: \textit{biology and genetics};
198: %G.2.1 [\textbf{Discrete Mathematics}]: \textit{combinatorics};
199: %G.3 [\textbf{Discrete Mathematics}]: \textit{probability and statistics}
200: %
201:
202: \section{Introduction}
203: In a {\em hierarchical approach} to large genome sequencing,
204: one first breaks many genome copies into random fragments.
205: A {\em library} is constructed by cloning the fragments,
206: typically as
207: {\em Bacterial Artificial
208: Chromosome} inserts (BACs).
209: Some BACs in the library are selected for complete sequencing.
210: Each selected BAC sequence is assembled individually
211: using the shotgun method: a {\em subclone} library
212: is prepared by cloning short fragments of the BAC.
213: Subsequently, sequence {\em reads}
214: are produced from a sufficient number of randomly chosen subclones.
215: The reads are assembled algorithmically into the BAC sequence.
216: An alternative to the hierarchical, or {\em clone-by-clone},
217: strategy is the {\em whole-genome shotgun} approach~\shortcite{WGS},
which employs a few (typically one to three) subclone libraries prepared
219: from the entire genome,
220: without resorting to an intermediate BAC library.
221: The main advantage of the whole-genome approach
222: is that it eliminates
223: the need to prepare tens of thousands of subclone libraries to
224: sequence a mammalian genome.
225: However, it is generally an inadequate strategy for finishing the assembly
226: of such large repeat-rich genomes.
227: For a review of contemporary sequencing methodologies, see, e.g.,
228: \shortciteN{Sequencing.review}.
229:
230: %
231: %BAC-end sequencing~\shortcite{BAC.end} was proposed as an alternative to restriction
232: %fingerprinting. However, the cost of BAC end sequencing is high and a significant
233: %fraction of mammalian BAC ends contain repetitive elements that are not informative for mapping.
234: %
235:
236: A new BAC-based sequencing strategy, called Clone-Array Pooled
237: Shotgun Sequencing (CAPSS), was proposed recently~\shortcite{CAPSS}.
238: CAPSS assembles the complete sequences of individual BACs
239: as does the clone-by-clone approach, but requires a much smaller
240: number of subclone library preparations.
241: The strategy is currently being applied for the first time on a
242: genome scale in the context of sequencing the honey bee genome.
243: This paper provides the theory for the design and analysis of pooling-based
244: genome projects. It also introduces the CAPS-MAP method for
245: physical mapping,
246: and transversal pooling designs for both CAPSS and CAPS-MAP,
247: thereby laying the theoretical foundation for pooling-based genome-scale
248: sequencing projects.
249:
250: \begin{figure}
251: \centerline{\includegraphics[height=0.3\textheight]{capss-method}}
252: \caption[CAPSS]{\captionstyle
253: CAPSS strategy for arrayed BACs.
254: DNA extracted from each clone is pooled together with
255: other clones in the same row
256: and column.
257: Subclone libraries are prepared from the pools,
258: and shotgun sequences are collected from the
259: sublibraries. Sequences are assembled into contigs.
260: If a contig contains sequences from a
261: row and a column pool's sublibrary,
262: the contig is assigned to the BAC at the
263: intersection of the row and the column.}
264: \label{fig:capss}
265: \end{figure}
266:
267: In a clone-by-clone approach,
268: BACs are sequenced independently:
269: one subclone library is constructed
270: for every clone.
In contrast, in a CAPSS approach, DNA from several BACs is pooled,
272: and subclone libraries are prepared from the pools.
273: A CAPSS experiment is designed so
274: that the number of subclone libraries is much smaller than the number
275: of clones, yet the pooling design enables the assembly
276: of individual clone sequences. In what follows,
277: by {\em pooled shotgun} (CAPS) sequences we mean
278: shotgun sequence reads collected from a
279: subclone library that was constructed using pooled BACs.
280: For the computational
281: aspects of sequence assembly, pooled shotgun sequences are random subsequences
282: originating from a set of clone sequences.
283:
284: The original CAPSS proposal of \shortciteN{CAPSS} relied
285: on a simple rectangular
286: design defined by an array layout of BACs
(Figure~\ref{fig:capss}). The pools correspond to the rows and columns. Compared
to clone-by-clone sequencing, an array layout
reduces the number of shotgun library preparations
from the number of BACs to the order of its square root. This reduction can be important
in the case of a mammalian genome,
292: for which
293: even a minimally overlapping tiling path contains between twenty and thirty
294: thousand clones~\shortcite{human.genome}.
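
As a rough illustration (our own arithmetic, assuming a square array
and a hypothetical tiling path of $\nclones=25{,}000$ clones),
an array with about $\sqrt{\nclones}\approx 158$ rows and as many columns yields
\[
2\sqrt{\nclones}\approx 2\times 158 = 316
\]
pools when both row and column pools are counted, i.e., roughly
three hundred subclone libraries instead of $25{,}000$.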
295:
296: This paper has two goals. First, after pointing out some
297: shortcomings of the original CAPSS proposal, we propose
298: algorithmic and experimental improvements
299: that make CAPSS a viable option for
300: sequencing a set of BACs.
Specifically, to increase the accuracy of CAPSS, we apply transversal pooling designs,
which we previously developed for the PGI method of comparative
physical mapping, a method that also relies on pooled shotgun sequencing~\shortcite{PGI.conf}.
304: We provide the first
305: probabilistic model of CAPSS sequencing progress.
306: The model leads to
307: theoretical results supporting previous, less formal arguments on
308: the practicality of CAPSS.
309:
310: The paper's second goal is to introduce the {\em Clone-Array Pooled Shotgun
311: Mapping (CAPS-MAP)} method to detect clone overlaps in a random BAC library.
312: The information on clone overlaps is used to compute the physical
313: ordering of clones in
314: the library, without requiring additional clone fingerprinting.
315: CAPS-MAP operates in the same experimental framework as CAPSS.
316: It needs only
317: shotgun sequences, which makes
318: it a cost-effective method that can be seamlessly integrated into a
319: sequencing project with very little experimental overhead.
320: We demonstrate the usefulness of CAPS-MAP for clone overlap detection
321: with a probabilistic analysis.
322: In addition to the theoretical results, we illustrate the method's performance
in a simulated project using the {\em Drosophila} genome assembly.
324:
325:
326: \section{Transversal designs}
327: %{\mysectionstyle 2. Transversal designs. \ \ }
\shortciteN{CAPSS} proposed that CAPSS be
329: used in hybrid projects, combining
330: whole-genome shotgun (WGS) and pooled shotgun (CAPS) sequences.
331: The motivation is that the pooled shotgun sequences
332: can provide the localization information
333: for the whole-genome shotgun sequences
334: so that the latter can be used for a
335: clone-linked assembly.
336: After WGS and CAPS sequences from a set of pools
337: are assembled into contigs,
338: the contigs need to be mapped to individual BACs.
339: There are a few challenges to contig mapping.
340: We mention here three
341: main problems: false negatives, ambiguities, and false mapping.
342: A false negative refers to a situation where a BAC is not sampled in a
343: pool it is included in, due to the low number of CAPS sequences
344: collected.
345: A false negative for a simple rectangular design means that
346: no contigs can be mapped to the BAC.
347: Ambiguities and false mappings are caused by overlapping clones,
348: or more generally, by clones that have highly similar regions.
The mapping of a contig is ambiguous when
two or more clone sets are equally likely choices for the mapping,
so that it is not possible to decide which clones the contig should be assigned to.
352: False mapping occurs when an insufficient number of CAPS
353: sequences are collected, and a contig
354: that covers overlapping BACs gets assigned to the wrong clone or clone set.
355: %False mapping is more detrimental than ambiguity
356: %since it is not detected during contig mapping.
357:
358: One strategy used to overcome the mapping problems
359: involves transversal pooling designs~\shortcite{PGI.conf,CGT}.
360: For a transversal design with~$\sigsize$ pool sets,
361: every clone is included in exactly one pool of each pool set, and any subset
362: with two of those pools uniquely identifies the clone.
363: Half of the pool
364: sets are designated as column pools, and the other half as row pools
365: to realize the design with the help of BAC arrays.
In a transversal
double-array design (i.e., one with four pool sets),
the same set of BACs is independently arrayed twice,
so that each BAC is sampled in two column pools and
two row pools. One of the arrays contains an arbitrary arrangement of
374: BACs, while the other is ``reshuffled'' relative to the first.
375: More generally,
376: clones can be arranged on~$d$ reshuffled arrays using
377: a transversal pooling design with~$\sigsize=2d$ pool sets.
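
A small example may help to fix ideas. The construction below is only
an illustrative sketch based on linear forms over~$\Galois{3}$, and it is not
necessarily the one used in the cited work. Label nine clones
as~$\clone_{i,j}$ with $i,j\in\{0,1,2\}$, and define four pool sets
($\sigsize=4$, $d=2$) by
\[
r_1=i,\qquad c_1=j,\qquad r_2=(i+j)\bmod 3,\qquad c_2=(i+2j)\bmod 3,
\]
where $(r_1,c_1)$ is the position of the clone on the first array
and $(r_2,c_2)$ is its position on the reshuffled array.
For instance, $\clone_{1,2}$ belongs to the pools $r_1=1$, $c_1=2$, $r_2=0$, and $c_2=2$.
Any two of the four values determine~$(i,j)$ uniquely, because every pair of the
linear forms $i$, $j$, $i+j$, and $i+2j$ is linearly independent modulo~$3$.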
378:
379: The number of arrays in a transversal design may be adjusted to allow
380: unambiguous and correct contig mapping for any redundancy in a BAC
381: library. Specifically, it can be shown~\shortcite{PGI.conf,CGT}
382: that a $d$-array transversal design can accurately resolve BACs at
383: up to $(2d-1)$X redundancy.
384: We previously described and analyzed transversal designs in
385: the context of pooled shotgun experiments~\shortcite{PGI.conf}
386: and compared their performance to
387: other designs. Even though our analysis was performed for
388: the Pooled Genomic Indexing (PGI) method
389: in the context of comparative physical mapping,
390: the results are generally
391: valid for CAPSS and CAPS-MAP as well. Specifically,
392: our results indicate that transversal designs
393: reduce the frequency of false negatives and false mappings
394: when compared to a simple rectangular design.
395: Furthermore, when compared to other more complicated designs,
396: they achieve an optimal balance between the number of shotgun
397: library preparations and the frequency of contig mapping problems.
398: Transversal designs also enjoy a practical advantage
399: over more complicated combinatorial designs, in that they are
400: readily implemented using existing automated clone arraying technologies.
401:
402: When a transversal design is used,
403: contig mapping can be implemented very efficiently,
404: based on an algorithm that runs in~$O(\nclones+M)$ time for
405: mapping~$M$ contigs onto~$\nclones$ BACs. Without going into
406: details,
407: the main idea is to first build in~$O(\nclones)$ time
408: a hash table that maps pool pairs to BACs. Based on the property of transversal designs
409: that two pools identify a clone, this table contains all pool pairs that identify a
410: unique clone. For each contig, it takes~$O(1)$ time using the hash table to
411: either identify the most likely clone set to which
412: the contig can be mapped, or to declare the contig ambiguous.
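
To illustrate the bookkeeping (a sketch under our own assumptions, with
hypothetical clone and pool labels), consider a single $2\times 2$ array
with row pools $R_1,R_2$, column pools $C_1,C_2$, and clones
$\clone_1,\dots,\clone_4$ arrayed row by row. The table then holds the four entries
\[
(R_1,C_1)\mapsto \clone_1,\quad
(R_1,C_2)\mapsto \clone_2,\quad
(R_2,C_1)\mapsto \clone_3,\quad
(R_2,C_2)\mapsto \clone_4 .
\]
A contig with reads from $R_1$ and $C_2$ is looked up directly and mapped to~$\clone_2$,
whereas a contig with reads from $R_1$, $R_2$, and~$C_1$ matches the entries
of both $\clone_1$ and~$\clone_3$, which suggests the clone set $\{\clone_1,\clone_3\}$.
In general, each clone contributes one entry for every pair of its
$\sigsize$ pools, so the table has $\nclones\binom{\sigsize}{2}$ entries
(e.g., $6\nclones$ for a double-array design), which is $O(\nclones)$
when~$\sigsize$ is a constant.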
413:
414: \section{Sequence assembly}\label{sec:capss}
415: %{\mysectionstyle 3. Pooled shotgun reads for sequencing. \ \ }
416: This section analyzes CAPSS progress in a hybrid
417: project that uses whole-genome
418: and pooled shotgun sequences. CAPS sequences are collected
419: using a transversal design with~$\sigsize$ pool sets, i.e., $\sigsize/2$
420: arrays.
421: In order to derive a probabilistic model for such experiments,
422: we introduce some standard
423: simplifying assumptions and the following notations.
424: Assume that every clone has the same length~$\clength$
425: (100--200 thousand base pairs in practice),
426: and that each shotgun sequence has the same length~$\flength$ (e.g., 500 bp).
427: The WGS and CAPS sequences are
428: combined and compared to each other to
429: find overlaps between them.
430: Overlapping sequences form {\em islands}.
431: Islands with two or more sequences are {\em contigs}.
An overlap between two shotgun sequences
is detected if its length is at least~$\soverlap\flength$,
where~$0<\soverlap\le 1$.
435: Statistics for islands, and gaps between islands
436: are well known~\shortcite{LanderWaterman,WendlWaterston}.
437: We are interested in statistics for
438: {\em clone-linked contigs}, those that are assigned
439: to BACs using the pooling information.
440:
441: Let~$\pcoverage$ be the coverage by CAPS sequences, i.e.,
442: if~$\nfrags_{\mathrm{p}}$ CAPS sequences
443: are collected, then~$\pcoverage=\frac{\nfrags_{\mathrm{p}}\flength}{\nclones\clength}$
444: where~$\nclones$ is the total number of clones.
445: Let~$\wgscoverage$ denote the coverage by WGS
446: sequences, i.e., if~$\nfrags_{\mathrm{w}}$ WGS
447: sequences are collected,
448: then~$\wgscoverage=\frac{\nfrags_{\mathrm{w}}\flength}{\glength}$
449: where~$\glength$ is the genome length.
450: Notice that~$\wgscoverage=0$ is possible.
451: Here we consider the simplest case of
452: assembling the sequence of a single
453: clone that does not overlap with any other clone.
Such a clone has
a total coverage of~$(\pcoverage+\wgscoverage)$.
456: Although we concentrate on sequencing a particular clone,
457: the transversal design allows the simultaneous sequencing
458: of multiple, possibly overlapping clones
459: by combining WGS sequences with CAPS sequences from many (or even all) pools.
460: Regions of overlapping clones have higher
461: coverage since they are covered by
462: more CAPS sequences than a single clone.
The sequencing of overlapping regions thus progresses
faster than what is suggested by the statistics
for a single clone.
466: We examine the case of assigning contigs to overlapping BACs
467: in \S\ref{sec:capsmap}.
468: Two shotgun sequences from different pools suffice to assign a contig to a single BAC.
469: In a practical setting, it may be advantageous to require more
470: stringent criteria in order to avoid false mappings.
471: Theorem~\ref{tm:capss} can be readily adapted for such
472: criteria, albeit resulting in bulkier formulas.
473:
474: Figures~\ref{fig:capss.islands} and~\ref{fig:capss.cover}
475: compare different experimental designs based on Theorem~\ref{tm:capss}
476: and simulations.
477: Figure~\ref{fig:capss.islands} plots the island statistics from the theorem.
478: It illustrates that for lower coverages (about $\coverage<4$), the ratio
479: of pooled shotgun sequences makes a large difference in the sequencing.
480: This difference
481: is mainly shown in the number of clone-linked contigs, as the
contig sizes do not differ much. At large coverage levels, when
sequencing is nearly completed, the impact of pooled sequences is smaller,
i.e., WGS sequences can make up for a lower pooled coverage.
485:
486: Figure~\ref{fig:capss.cover}a
487: shows that while more arrays increase the sequencing success, the improvements
488: are very small after the second array.
489: Notice that if the clones are selected from a minimally overlapping tiling path,
490: then no part of the genome is covered by more than two BACs,
491: and thus two arrays suffice for the unambiguous mapping of all contigs
492: that cover clone overlaps.
493: Figure~\ref{fig:capss.cover}b plots the N50 values. The N50
494: contig length is the value~$l$ such that half of the
495: sequenced nucleotides belong to contigs of length at least~$l$.
496: The statistics for all designs converge to
497: those of a non-pooled sequencing project as the coverage increases.
498: In other words, the negative effects of pooling diminish and the
499: project progresses just as without pooling:
500: for example, at total coverage 4--5X, 99\% of the clone is sequenced.
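
As a small worked example of the N50 definition (with made-up contig lengths),
suppose that the contigs have lengths $40$, $30$, $20$, and~$10$ in units of~$\flength$,
for a total of~$100$. Contigs of length at least~$40$ contain only $40<50$ of the sequenced
nucleotides, while contigs of length at least~$30$ contain $40+30=70\ge 50$,
so the N50 contig length is~$l=30$.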
501:
502:
503: \begin{theorem}\label{tm:capss}
Let~$\roverlap=1-\soverlap$, where~$\soverlap$
is the fraction of their length that two shotgun sequences must share in order for
their overlap to be detected.
507: Consider a BAC that does not overlap with other
508: clones.
509: Define~$\coverage=\wgscoverage+\pcoverage$,
510: the total coverage.
511: Let
512: $
513: X_1 = \frac{\wgscoverage+\frac{\pcoverage}{\sigsize}}{\coverage}$,
514: %\qquad
515: $X_2 = \frac{\wgscoverage}{\coverage}$,
516: and $Y_i=1-(1-e^{-\coverage\roverlap})X_i$ for~$i=1,2$.
517:
518: \begin{enumerate}
519: \item[(i)]
520: The expected number of clone-linked contigs covering the clone equals
521: \begin{equation}\label{eq:num.link}
522: \frac{\clength}{\flength}\coverage e^{-\coverage\roverlap} p_{\mathrm{link}},
523: \end{equation}
524: where
525: \begin{equation}\label{eq:prob.link}
526: p_{\mathrm{link}}=
527: \begin{cases}
528: 1-e^{-\coverage\roverlap}
529: \biggl(\sigsize\frac{X_1}{Y_1}-(\sigsize-1)\frac{X_2}{Y_2}\biggr)
530: & \text{if $\wgscoverage>0$;}\\*
531: \frac{1-e^{-\pcoverage\roverlap}}{1+\frac{1}{\sigsize-1}e^{-\pcoverage\roverlap}}
532: & \text{if $\wgscoverage=0$.}
533: \end{cases}
534: \end{equation}
535:
536: \item[(ii)] The expected number of shotgun sequences in a clone-linked contig is
537: \begin{equation}\label{eq:nr.link}
538: \nfrags_{\mathrm{link}}
539: = \begin{cases}
540: \frac{e^{\coverage\roverlap}}{p_{\mathrm{link}}}
541: \Biggl(1-e^{-2\coverage\roverlap}\biggl(
542: \sigsize\frac{X_1}{Y_1^2}
543: -(\sigsize-1)\frac{X_2}{Y_2^2}\biggr)\Biggr) & \text{if $\wgscoverage>0$;}\\
544: e^{\pcoverage\roverlap}+\frac{1+\frac{1}{\sigsize-1}}{1+\frac{e^{-\pcoverage\roverlap}}{\sigsize-1}}
545: & \text{if $\wgscoverage=0$.}
546: \end{cases}
547: \end{equation}
548:
549: \item[(iii)]
550: Define
551: \begin{align}
552: \label{eq:mk.nolink}
553: \nfrags_{\mathrm{nolink}}
554: = \frac{\sigsize\frac{X_1}{Y_1^2}-(\sigsize-1)\frac{X_2}{Y_2^2}}{\sigsize\frac{X_1}{Y_1}-(\sigsize-1)\frac{X_2}{Y_2}},
555: \intertext{and}
\ilen_{\mathrm{CBC}}
557: =
558: \frac{e^{\coverage\roverlap}-1}{\coverage}+\soverlap.
559: \end{align}
560: The expected length of a clone-linked contig
561: can be written as $\flength\ilen_{\mathrm{link}}$
562: where $\ilen_{\mathrm{link}}$
563: is bounded as
564: \begin{equation}\label{eq:len.link}
565: \frac{\ilen_{\mathrm{CBC}}-\Bigl(\nfrags_{\mathrm{nolink}}\roverlap+\soverlap\Bigr)(1-p_{\mathrm{link}})}{
566: p_{\mathrm{link}}}
567: \le
568: \ilen_{\mathrm{link}}
569: \le
570: \frac{\ilen_{\mathrm{CBC}}}{p_{\mathrm{link}}}. % \frac{\ilen_{\mathrm{CBC}}-(1-p_{\mathrm{link}})}{p_{\mathrm{link}}}.
571: \end{equation}
572: Furthermore, when~$\fpooled=\pcoverage/\coverage$ is kept constant,
573: $\nfrags_{\mathrm{nolink}}$ increases monotonically
574: with~$\coverage$ and
575: \begin{equation}\label{eq:len.limit}
576: \lim_{\coverage\to\infty}\nfrags_{\mathrm{nolink}}
577: =
578: \begin{cases}
579: \fpooled^{-1}
580: \frac{(3\sigsize^2-3\sigsize+1)-\fpooled(2\sigsize^2-3\sigsize+1)}{
581: (2\sigsize^2-3\sigsize+1)-\fpooled(\sigsize^2-2\sigsize+1)}
582: & \text{if $\wgscoverage>0$;}\\
583: \frac{\sigsize}{\sigsize-1} & \text{if $\wgscoverage=0$.}
584: \end{cases}
585: \end{equation}
586: \end{enumerate}
587: \end{theorem}
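
As a numerical illustration of the case~$\wgscoverage=0$ (the parameter values
are our own choice, not taken from an actual project), let $\sigsize=4$,
$\pcoverage=2$, and $\soverlap=0.1$, so that $\roverlap=0.9$ and
$e^{-\pcoverage\roverlap}=e^{-1.8}\approx 0.165$. Then
\begin{align*}
p_{\mathrm{link}} &= \frac{1-0.165}{1+\frac{0.165}{3}}\approx 0.79, &
\nfrags_{\mathrm{link}} &= e^{1.8}+\frac{1+\frac13}{1+\frac{0.165}{3}}
\approx 6.05+1.26 = 7.31.
\end{align*}
By Equation~\eqref{eq:num.link}, the expected number of clone-linked contigs is about
$2\times 0.165\times 0.79\approx 0.26$ in multiples of~$\clength/\flength$,
i.e., about $78$ contigs for a clone with $\clength/\flength=300$.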
588:
589: \begin{figure}
590: \begin{tabular}{cc}\includegraphics[width=0.48\textwidth]{capss-mapped-ireads} &
591: \includegraphics[width=0.48\textwidth]{capss-mapped-nislands}\\
592: a & b
593: \end{tabular}
594: \caption[CAPSS island statistics]{\captionstyle
595: CAPSS (Theorem~\ref{tm:capss}): clone-linked contig statistics. The values are calculated
596: from Theorem~\ref{tm:capss} for two-array transversal designs
597: and different pooled coverage levels~$\pcoverage$. Overlaps between shotgun sequences are detected
598: with~$\soverlap=0.1$. The number of contigs on the right-hand side is given
599: in multiples of~$\clength/\flength$. The abscissa is the total coverage~$\coverage$.}
600: \label{fig:capss.islands}
601: \end{figure}
602:
603: \begin{figure}
604: \begin{tabular}{cc}\includegraphics[width=0.48\textwidth]{capss-mapped-cover} &
605: \includegraphics[width=0.48\textwidth]{capss-mapped-n50}\\
606: {\captionstyle a} & {\captionstyle b}
607: \end{tabular}
608: \caption[CAPSS sequencing progress]{\captionstyle
609: CAPSS (Theorem~\ref{tm:capss}): sequencing progress.
610: The left-hand side plots the
611: fraction of bases covered by clone-linked contigs as
612: a function of total coverage ($\coverage=\pcoverage+\wgscoverage$)
613: for different designs. Notice that the improvement from two arrays
614: to four arrays ($\sigsize=4$ vs.\ $\sigsize=8$) is marginal.
615: The right-hand side plots the N50 values
616: for different designs with two arrays, as multiples of~$\flength$.
617: All values were calculated
618: with shotgun sequence overlap detection~$\soverlap=0.1$. The N50 plot was obtained
619: from simulation: each point is an average of 200 measurements.
620: }\label{fig:capss.cover}
621: \end{figure}
622:
623: \begin{proof}
624: The proof relies on a Poisson process model, following the technique
625: of \shortciteN{Waterman}.
626: We model the location of the shotgun sequences as a Poisson process
627: with rate~$\coverage$.
628: Define~$\fpooled=\pcoverage/\coverage$, the fraction of CAPS sequences.
629: Every sequence is either a
630: WGS sequence with probability~$(1-\fpooled)$, or
631: comes from each one of the clone's pools with
632: probability~$\fpooled/\sigsize$.
633: First we state the well-known facts~\shortcite{LanderWaterman,Waterman}
634: about apparent islands, whether or not they are linked to a clone.
635: The event~$E$ that a given shotgun sequence is the right-hand end of an apparent island
636: has probability~$J=\PROB E=e^{-\coverage\roverlap}$.
637: For the $k$-th read, define~$M_k$ as the number of reads
638: from its right-hand end until the first gap towards the left.
639: The probability that an island has~$j$ sequences in it equals
640: \[
641: \Probcmd{M_k=j}{E}=(1-J)^{j-1}J.
642: \]
643:
644: An island can be mapped to a clone if it contains sequences from at least two pools.
645: The probability of mapping the island ending at the $k$-th read
646: (event $D_k$)
647: depends on the number of shotgun sequences in the island. Using inclusion-exclusion:
648: \begin{multline}\label{eq:dme}
649: \Probcmd{D_k}{M_k=j} \\*
650: \begin{aligned}
651: & = 1-\sum_{\text{pools}}\Probcmd{\text{CAPS reads from only one pool+WGS}}{M_k=j}\\
652: & +(\sigsize-1) \Probcmd{\text{only WGS reads}}{M_k=j}\\*
653: & = 1-\sigsize\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)^{j}+(\sigsize-1)(1-\fpooled)^j.
654: \end{aligned}
655: \end{multline}
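
Two special cases serve as a quick sanity check of Equation~\eqref{eq:dme}
(this verification is ours, added for illustration). For~$j=1$,
\[
1-\sigsize\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)+(\sigsize-1)(1-\fpooled)
=1-\sigsize+(\sigsize-1)\fpooled+(\sigsize-1)-(\sigsize-1)\fpooled=0,
\]
as expected, since a single sequence cannot come from two pools.
For~$j=2$ and~$\fpooled=1$, the formula gives $1-\sigsize\cdot\sigsize^{-2}=1-\frac{1}{\sigsize}$,
which is the probability that two CAPS sequences come from different pools.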
656:
657: By Equation~\eqref{eq:dme},
658: the number of shotgun sequences in a clone-linked island is distributed
659: by the probabilities
660: \begin{multline}\label{eq:prob.dm}
661: \Probcmd{D_k,M_k=j}{E} = \Probcmd{D_k}{M_k=j, E}\Probcmd{M_k=j}{E} \\*
662: \begin{aligned}
663: & = \biggl(1-\sigsize\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)^{j}
664: +(\sigsize-1)(1-\fpooled)^j\biggr)
665: (1-J)^{j-1}J\\
& = \Pa(j) - \sigsize\Pb(j) + (\sigsize-1)\Pc(j),
667: \end{aligned}
668: \end{multline}
669: with
670: \begin{subequations}\label{eq:px}
671: \begin{align}
672: \Pa(j) & = (1-J)^{j-1}J; \\*
673: \Pb(j) & = \biggl((1-J)\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)
674: \biggr)^{j-1}
675: J\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr);\\*
676: \Pc(j) & = \Bigl((1-J)(1-\fpooled)\Bigr)^{j-1} J(1-\fpooled).
677: \end{align}
678: \end{subequations}
679:
680: Now, for all~$0< z\le 1$,
681: \begin{equation}
682: \sum_{j=1}^\infty (1-z)^{j-1} = \frac{1}{z}; \quad
683: \sum_{j=1}^\infty j(1-z)^{j-1} = \frac{1}{z^2}. \label{eq:z}
684: \end{equation}
685: Using Equation~\eqref{eq:z},
686: \begin{align*}
687: \Probcmd{D_k}{E}
688: & =\sum_{j=1}^\infty \Probcmd{D_k, M_k=j}{E} \\*
689: & = 1-\frac{\sigsize J(1-\frac{\sigsize-1}{\sigsize}\fpooled)}{1-(1-J)\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)}
690: +\frac{(\sigsize-1)J(1-\fpooled)}{1-(1-J)(1-\fpooled)}.
691: \end{align*}
692: In Equation~\eqref{eq:prob.link},
693: $p_{\mathrm{link}}=\Probcmd{D_k}{E}$.
694: Equation~\eqref{eq:num.link} follows from
695: the fact that the expected number of shotgun fragments covering the clone
696: equals~$\coverage\clength/\flength$.
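
To match this expression with the statement of the theorem, it may help to spell out
the substitution (a short verification, using only the definitions above).
Since $\wgscoverage=(1-\fpooled)\coverage$ and $\pcoverage=\fpooled\coverage$,
\[
X_1=1-\frac{\sigsize-1}{\sigsize}\fpooled,\qquad
X_2=1-\fpooled,\qquad
Y_i=1-(1-J)X_i,
\]
so that $\Probcmd{D_k}{E}=1-J\bigl(\sigsize\frac{X_1}{Y_1}-(\sigsize-1)\frac{X_2}{Y_2}\bigr)$
with~$J=e^{-\coverage\roverlap}$. When~$\wgscoverage=0$, we have $X_1=1/\sigsize$ and $X_2=0$,
and the expression reduces to
\[
1-\frac{\sigsize J}{\sigsize-1+J}=\frac{(\sigsize-1)(1-J)}{\sigsize-1+J}
=\frac{1-J}{1+\frac{J}{\sigsize-1}}.
\]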
697:
698: By definition of the conditional probability,
699: \[
700: \Probcmd{M_k=j}{D_k, E}=
701: \frac{\Probcmd{D_k,M_k=j}{E}}{\Probcmd{D_k}{E}} = \frac{\Pa(j)-\sigsize\Pb(j)+(\sigsize-1)\Pc(j)}{p_{\mathrm{link}}},
702: \]
703: where the values can be plugged in from
704: Equations~\eqref{eq:prob.link} and~\eqref{eq:px}.
705: By Equation~\eqref{eq:z},
706: \begin{equation}\label{eq:mk}
707: \Expcmd{M_k}{D_k,E}
708: = \frac{p_{\mathrm{link}}^{-1}}{J}
\Biggl(1-\frac{\sigsize J^2\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)}{
710: \Bigl(1-(1-J)(1-\frac{\sigsize-1}{\sigsize}\fpooled)\Bigr)^{2}}
711: +\frac{(\sigsize-1)J^2(1-\fpooled)}{
712: \Bigl(1-(1-J)(1-\fpooled)\Bigr)^2}
713: \Biggr),
714: \end{equation}
715: which corresponds to~(ii)
716: with~$\nfrags_{\mathrm{link}}=\Expcmd{M_k}{D_k,E}$.
Note that
when~$\fpooled=1$, Equation~\eqref{eq:mk} gives
\[
\frac2{J(\roverlap)} \ge \Expcmd{M_k}{D_k,E} > \frac1{J(\roverlap)},
\]
and that the product $\Expcmd{M_k}{D_k,E}\,J(\roverlap)$, i.e., the ratio of this expectation to the unconditional mean~$J^{-1}(\roverlap)$, decreases as the coverage~$\coverage$ increases.
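As a check on this remark, here is a short derivation in which the auxiliary quantity~$x$ is introduced only for this purpose. For~$\fpooled=1$ we have $\Pc(j)=0$ and $1-\frac{\sigsize-1}{\sigsize}\fpooled=\frac1{\sigsize}$. Writing $x=\frac{\sigsize J}{\sigsize-1+J}$, Equation~\eqref{eq:z} applied to Equation~\eqref{eq:px} gives
\[
\Probcmd{D_k}{E} = 1-x,
\qquad
\Expcmd{M_k}{D_k,E} = \frac{1-x^2}{J(1-x)} = \frac{1+x}{J}.
\]
Since $0<x<1$ whenever $\sigsize>1$ and $0<J<1$, the two bounds follow. Moreover, $x$ increases with~$J$, so the ratio of the conditional expectation to the unconditional mean~$J^{-1}$, which equals~$1+x$, shrinks as $J$ decreases, that is, as the coverage grows.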
723:
724: By Equation~\eqref{eq:prob.link},
725: \begin{equation}\label{eq:nolink}
726: \Probcmd{\overline{D_k}}{E}
727: = 1-p_{\mathrm{link}}
=J\frac{1-\Bigl(1-J\Bigr)\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)(1-\fpooled)}{%
729: \biggl(1-\Bigl(1-J\Bigr)\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled\Bigr)\biggr)
730: \biggl(1-\Bigl(1-J\Bigr)\Bigl(1-\fpooled\Bigr)\biggr)}.
731: \end{equation}
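The algebra behind this closed form can be checked directly. The following sketch uses the shorthand $q=1-J$, $a=1-\frac{\sigsize-1}{\sigsize}\fpooled$, and $b=1-\fpooled$, which is introduced here and is not part of the original notation. Since $\sigsize a-(\sigsize-1)b=1$,
\[
\frac{\sigsize Ja}{1-qa}-\frac{(\sigsize-1)Jb}{1-qb}
= J\,\frac{\sigsize a(1-qb)-(\sigsize-1)b(1-qa)}{(1-qa)(1-qb)}
= J\,\frac{1-qab}{(1-qa)(1-qb)},
\]
and the left-hand side is $1-\Probcmd{D_k}{E}$ by the expression for~$\Probcmd{D_k}{E}$ derived above.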
732: The expected number of shotgun sequences in an island that is not mapped to a clone equals
733: \[
734: \Expcmd{M_k}{\overline{D_k},E}
735: =\frac{\Expcmd{M_k}{E}-\Expcmd{M_k}{D_k,E}\Probcmd{D_k}{E}}{\Probcmd{\overline{D_k}}{E}}.
736: \]
737: Using~$\Expcmd{M_k}{E}=J^{-1}$ and Equations~\eqref{eq:mk}, \eqref{eq:prob.link},
738: and~\eqref{eq:nolink},
739: we get Equation~\eqref{eq:mk.nolink} with the notation
740: $\nfrags_{\mathrm{nolink}}=\Expcmd{M_k}{\overline{D_k},E}$.
741:
742: Let~$\flength\ilen_k$ be the length of the island ending with the $k$-th sequence.
The expected length of a non-linked island is
$\flength\Expcmd{\ilen_{\mathrm{nolink}}}{\overline{D_k},E}$,
where
746: \[
747: 1\le
748: \Expcmd{\ilen_{\mathrm{nolink}}}{\overline{D_k},E}
749: \le \Expcmd{M_k}{\overline{D_k},E}\roverlap+\soverlap.
750: \]
751: The bounds of Equation~\eqref{eq:len.link} follow from
752: \[
753: \Expcmd{\ilen_k}{D_k,E}
754: = \frac{\Expcmd{\ilen_k}{E}-\Expcmd{\ilen_k}{\overline{D_k},E}\Probcmd{\overline{D_k}}{E}}{
\Probcmd{D_k}{E}},
756: \]
757: where $\Expcmd{\ilen_k}{E}=\ilen_{\mathrm{CBC}}=\frac{J^{-1}-1}{\coverage}+\soverlap$
758: \shortcite{Waterman}.
759: \end{proof}
760:
761: The value~$\ilen_{\mathrm{CBC}}$ in the
762: theorem is the expected island length in a non-pooled sequencing project.
763: By Equation~\eqref{eq:len.limit}, and
764: the fact that $\lim_{\coverage\to\infty}p_{\mathrm{link}}=1$,
765: we have
766: $\lim_{\coverage\to\infty}\ilen_{\mathrm{link}}=\ilen_{\mathrm{CBC}}$
767: when the ratio of CAPS sequences is kept constant.
768: This limit result is not surprising given that
769: every island can be assigned to a clone with near certainty when the
770: sequence read coverage is large.
771:
772: \section{Clone overlap detection}\label{sec:capsmap}
773: %{\mysectionstyle 4. Pooled shotgun reads for clone overlap detection. \ \ }
774: The key observation for this section is that a transversal design
775: makes it possible to map a contig unambiguously to more than one
776: BAC at once. Now, a contig that is mapped to two clones simultaneously can be
777: viewed as evidence that the two clones overlap. Taking the idea further,
778: an entire set of BACs can be tested for overlaps in this manner,
which leads us to the Clone-Array Pooled Shotgun Mapping (CAPS-MAP) method,
described as follows. A redundant collection of random BACs covering a large
782: genome is grouped into subsets of size~$q^2$.
783: Pooled shotgun sequence reads are collected from each clone group using a
784: transversal design with~$d$ arrays of size~$q\times q$.
785: Partitioning into subsets may be dictated by the practical
786: concerns of chemistry, biology and robotic automation.
For array sizes that are multiples of 8 or 12 or both
(the dimensions of a standard
96-well microtiter plate), such as~$q=24$ or~$q=48$,
transversal designs are known~\shortcite{design.handbook}.
A pooling design with a few ($d=2,3,4$) arrays suffices to compute the
physical ordering of BACs in the library, depending on the
library's redundancy and the array sizes.
794: In addition to the CAPS sequences,
795: WGS sequences are used to increase read
796: contig lengths.
797: The shotgun sequences are compared to each other to
798: find the overlaps between them,
799: and are assembled into contigs. Contigs that map
800: unambiguously to more than one clone are taken as evidence that the clones overlap.
801: See Figures~\ref{fig:capsm} and~\ref{fig:capsm.fn} for illustrations.
802: The clone overlap information can then be used to
803: compute the physical ordering of the BACs in the library, and
804: to select a minimal tiling path for complete sequencing,
805: just as if the overlaps were detected using a
806: fingerprinting scheme \shortcite{map.sequenceready}.
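As a concrete illustration of the bookkeeping, assume (as in Figure~\ref{fig:capsm}) that each $q\times q$ array contributes $q$ row pools and $q$ column pools. A clone group with $d$ arrays then has
\[
q^2 \text{ clones},\qquad 2dq \text{ pools},\qquad 2d \text{ pools per clone}.
\]
For $q=24$ and $d=2$, this gives $576$ clones sequenced through $96$ pools, with every clone present in four pools.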
807:
808: Theorem~\ref{tm:capsm} considers the case of detecting an overlap between
809: two clones in different clone groups. Similar analyses can be carried out
810: for more general cases with more overlapping clones, or clones
811: in the same clone group, resulting in more cumbersome formulas.
812:
813: \begin{figure}
814: \centerline{\includegraphics[height=0\myfigheight]{capsmap-method}}
815: \caption[CAPS-MAP]{\captionstyle
816: CAPS-MAP detects overlaps between clones by identifying situations where
817: a read contig maps simultaneously to two clones. This figure illustrates a
818: transversal pooling design with two clone groups and two arrays per group.
819: The transversal design guarantees that the intersection of any
820: two pools out of four possible for each BAC (two row and two
821: column pools) uniquely identifies the BAC.
822: Note that overlaps between clones on the same array can also be
823: detected by a transversal design.}\label{fig:capsm}
824: \end{figure}
825:
826: \begin{figure}
827: \centerline{\includegraphics[height=\myfigheight]{capsmap-onearray}}
828: \caption{
829: Overlaps between clones on the same array can also be
830: detected by a transversal design, even in the presence of false negatives,
831: i.e., situations where a particular BAC
832: is not represented in a particular pool. Specifically, overlap between
833: the two BACs illustrated in the figure is detected despite the fact that
834: each BAC is sampled in only three pools.}\label{fig:capsm.fn}
835: \end{figure}
836:
837: Figure~\ref{fig:capsm.probs} plots
838: the overlap detection probabilities
839: in a few scenarios with different
840: amounts of CAPS and WGS sequences.
Based on the figure, the probability of detecting an overlap
approaches~1 exponentially fast as the overlap length grows. The same exponential
843: behavior is characteristic of clone anchoring methods for overlap detection
844: \shortcite{map.anchor}. Consequently, clone contig statistics for CAPS-MAP can be
845: calculated using a clone anchoring model with an appropriate
846: anchoring process intensity. Clone contig statistics can also be estimated
847: using a fingerprinting model \shortcite{LanderWaterman}
848: by noticing that clone overlaps above a certain length are detected with near certainty.
849: Figure~\ref{fig:capsm.probs} indicates that using 1X CAPS coverage and
850: 2--5X WGS coverage, BAC overlaps of more than 20000 bp are detected almost certainly.
851: While CAPS-MAP uses only the fact that a contig is mapped to multiple BACs, and
852: not the actual contig sequence, the sequence information is used in
853: the ensuing sequencing phase, and thus CAPS-MAP represents very little overhead in a
854: genome sequencing project.
855:
856: It is worth pointing out here that CAPS-MAP detects very short, or even
857: {\em negative} clone overlaps with non-negligible probability.
858: A short region of the genome
859: that is not covered by BACs in the library can be bridged by WGS sequences.
860: The bridging WGS sequences may form a contig with CAPS sequences from the two BACs
861: at the gap's ends that can be mapped to the two clones simultaneously.
862: This unique feature of CAPS-MAP among clone overlap detection methods
863: does not interfere with the calculation of the physical ordering of BACs.
864: At the same time, it does decrease the necessary BAC library size for
865: sequencing the genome completely.
866: After the clones are selected for complete sequencing, the
867: already collected WGS sequences are included in the genome sequence assembly.
868: Consequently, negative overlaps detected by CAPS-MAP are already covered by
869: shotgun sequences
870: in the sequencing phase, and
871: pose no additional requirements for shotgun sequence collection.
872:
873: \begin{theorem}\label{tm:capsm}
874: Let two clones from different clone groups
875: share an overlap. % of length~$\coverlap\clength$ with~$0<\coverlap\le 1$.
876: Define~$\coverage_2=2\pcoverage+\wgscoverage$,
877: the total shotgun sequence coverage for the overlap.
878: Define
879: \begin{gather*}
880: \begin{aligned}
881: \ndfactor_1 & = \frac{\wgscoverage+(1+\frac{1}{\sigsize})\pcoverage}{\coverage_2} &
882: \ndfactor_2 & = \frac{\wgscoverage+\pcoverage}{\coverage_2} &
883: \ndfactor_3 & = \frac{\wgscoverage+\frac{2\pcoverage}{\sigsize}}{\coverage_2} &
884: \ndfactor_4 & = \frac{\wgscoverage+\frac{\pcoverage}{\sigsize}}{\coverage_2} &
885: \ndfactor_5 & = \frac{\wgscoverage}{\coverage_2};
886: \end{aligned}\\*
887: \gamma_i=1-(1-e^{-\coverage_2\roverlap})\ndfactor_i
888: \quad\text{ for $i=1,\dotsc,5$}.
889: \end{gather*}
890:
891: \begin{enumerate}
892: \item[(i)] An apparent island in the
893: overlap consisting of~$j>0$ shotgun sequences
894: is mapped to the two clones simultaneously
895: with probability~$1-\pnd(j)$ where
896: \begin{equation}\label{eq:pnd}
897: \pnd(j)
898: = 2\sigsize \ndfactor_1^j
899: - 2(\sigsize-1) \ndfactor_2^j
900: - \sigsize^2 \ndfactor_3^j
901: + 2\sigsize(\sigsize-1) \ndfactor_4^j
902: - (\sigsize-1)^2 \ndfactor_5^j
903: < 2\sigsize\ndfactor_1^j.
904: \end{equation}
905:
906: \item[(ii)] An apparent island covering the overlap is mapped
907: to the two clones simultaneously with probability
908: \begin{equation}\label{eq:prob.map}
909: p_2
910: = 1-e^{-\coverage_2\roverlap}\biggl(
911: 2\sigsize\frac{\ndfactor_1}{\gamma_1}
912: -2(\sigsize-1)\frac{\ndfactor_2}{\gamma_2}
913: -\sigsize^2\frac{\ndfactor_3}{\gamma_3}
914: +2\sigsize(\sigsize-1)\frac{\ndfactor_4}{\gamma_4}
915: -(\sigsize-1)^2\frac{\ndfactor_5}{\gamma_5}\biggr).
916: \end{equation}
917:
918: \end{enumerate}
919: \end{theorem}
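To illustrate Equation~\eqref{eq:pnd}, consider hypothetical values chosen here only for the arithmetic: $\sigsize=4$, $\pcoverage=1$, and $\wgscoverage=2$, so that $\coverage_2=4$. Then
\[
\ndfactor_1=0.8125,\quad
\ndfactor_2=0.75,\quad
\ndfactor_3=0.625,\quad
\ndfactor_4=0.5625,\quad
\ndfactor_5=0.5,
\]
and
\begin{align*}
\pnd(1) & = 6.5-4.5-10+13.5-4.5 = 1,\\
\pnd(4) & = 8\ndfactor_1^4-6\ndfactor_2^4-16\ndfactor_3^4+24\ndfactor_4^4-9\ndfactor_5^4 \approx 0.9868.
\end{align*}
The value for $j=4$ agrees with a direct count. The inclusion-exclusion in the proof below suggests that an island is mapped to both clones only if it contains reads from two distinct pools of each clone; with four reads, each coming from a given pool with probability $1/16$, this happens with probability $\binom{4}{2}\cdot 12\cdot 12\cdot (1/16)^4\approx 0.0132$. The same reading explains why $\pnd(j)=1$ for $j\le 3$.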
920:
921:
922: \begin{proof}
923: The overlap is detected if
924: it is covered by an island that can be simultaneously mapped
925: to the two clones.
926: We model the location of the shotgun sequences as a Poisson process
927: with rate~$\coverage_2$.
928: Define~$\fpooled_2=\frac{2\pcoverage}{\coverage_2}$,
929: the fraction of CAPS sequences covering the overlap.
Every shotgun sequence is either a
WGS sequence, with probability~$1-\fpooled_2$, or
comes from a given one of the $2\sigsize$ pools that contain either clone, with
probability~$\fpooled_2/(2\sigsize)$ for each pool.
934: The event~$E_2$ that a given shotgun sequence is the right-hand end of an apparent island
935: has probability~$J_2=\PROB E_2=e^{-\coverage_2\roverlap}$.
936: For the $k$-th sequence, define~$M_k$ as the number of sequences
937: from its right-hand end until the first gap towards the left.
938: The probability that an island has~$j$ sequences in it equals
939: \[
940: \Probcmd{M_k=j}{E_2}=\Bigl(1-J_2\Bigr)^{j-1}J_2.
941: \]
942: The probability of mapping the island that ends at the $k$-th shotgun sequence
943: (event $D_k$)
944: depends on the number of sequences in the island. We
945: calculate the probability of event~$\overline{D_k}$ in separate cases.
Let~$p_{0,0}(j)$ denote the probability that the island
consists of WGS sequence reads only, given that it has~$j$ reads.
948: Then
949: \begin{subequations}
950: \begin{equation}\label{eq:p00}
951: p_{0,0}(j) = (1-\fpooled_2)^j.
952: \end{equation}
Let~$p_{0,*}(j)$ denote the probability that the island
consists of CAPS sequences for one clone only and WGS sequences, given that it has~$j$
shotgun sequences in it:
956: \begin{equation}\label{eq:p0x}
957: p_{0,*}(j)=\Bigl(1-\frac{\fpooled_2}2\Bigr)^j.
958: \end{equation}
Let~$p_{1,0}(j)$ denote the probability that the island
consists of CAPS sequences from a fixed pool and WGS sequences, given that it has~$j$
shotgun sequences in it:
962: \begin{equation}\label{eq:p10}
963: p_{1,0}(j)=\Bigl(1-\frac{\sigsize-\frac12}{\sigsize}\fpooled_2\Bigr)^j-p_{0,0}(j).
964: \end{equation}
Let~$p_{1,1}(j)$ denote the probability that the island
consists of CAPS sequences from a fixed pool for one clone,
from another fixed pool for the other clone,
and WGS sequences, given that it has~$j$ shotgun sequences in it:
969: \begin{equation}\label{eq:p11}
970: p_{1,1}(j)=\Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled_2\Bigr)^j-2p_{1,0}(j)-p_{0,0}(j).
971: \end{equation}
Let~$p_{1,+}(j)$ denote the probability that the island
consists of CAPS sequences from a fixed pool for one clone,
at least one CAPS sequence for the other clone, and WGS sequences:
975: \begin{equation}\label{eq:p1x}
976: p_{1,+}(j)=\Bigl(1-\frac{\sigsize-1}{2\sigsize}\fpooled_2\Bigr)^j
977: -p_{0,*}(j)-p_{1,0}(j).
978: \end{equation}
979: \end{subequations}
980: Using inclusion-exclusion,
981: \[
982: \Probcmd{\overline{D_k}}{E_2, M_k=j}
983: =\Bigl(2p_{0,*}(j)-p_{0,0}(j)\Bigr)
984: +2\sigsize p_{1,+}(j)-\sigsize^2 p_{1,1}(j).
985: \]
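The three groups of terms can be read as follows. The events $N_A$, $N_B$, $S_A$, and $S_B$ are auxiliary notation introduced here, and all probabilities are conditional on~$E_2$ and~$M_k=j$. Let~$N_A$ be the event that the island has no CAPS sequence for the first clone, and let~$S_A$ be the event that it has CAPS sequences from exactly one pool of the first clone and at least one CAPS sequence for the second clone; define~$N_B$ and~$S_B$ symmetrically. Then
\begin{align*}
\PROB(N_A\cup N_B) & = 2p_{0,*}(j)-p_{0,0}(j), \\
\PROB(S_A) = \PROB(S_B) & = \sigsize\, p_{1,+}(j), \\
\PROB(S_A\cap S_B) & = \sigsize^2 p_{1,1}(j),
\end{align*}
and since $S_A\cup S_B$ is disjoint from $N_A\cup N_B$, the probability of failing to map the island to both clones is $\PROB(N_A\cup N_B)+\PROB(S_A)+\PROB(S_B)-\PROB(S_A\cap S_B)$.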
986: By Equations~(\ref{eq:p00}--\ref{eq:p1x}),
987: \begin{multline}\label{eq:map.fmap.j}
988: \Probcmd{\overline{D_k}}{E_2, M_k=j}
989: = 2\sigsize \Bigl(1-\frac{\sigsize-1}{2\sigsize}\fpooled_2\Bigr)^j
990: - 2(\sigsize-1) \Bigl(1-\frac{\fpooled_2}2\Bigr)^j\\*
991: - \sigsize^2 \Bigl(1-\frac{\sigsize-1}{\sigsize}\fpooled_2\Bigr)^j
992: + 2\sigsize(\sigsize-1) \Bigl(1-\frac{\sigsize-\frac12}{\sigsize}\fpooled_2\Bigr)^j
993: - (\sigsize-1)^2 (1-\fpooled_2)^j,
994: \end{multline}
995: which corresponds to Equation~\eqref{eq:pnd}
996: with $\pnd(j)=\Probcmd{\overline{D_k}}{E_2, M_k=j}$.
Using the same technique as before (summing a geometric series over~$j$),
998: \[
999: 1-p_2=\Probcmd{\overline{D_k}}{E_2}=\sum_{j=1}^{\infty}
1000: \Probcmd{\overline{D_k}}{E_2, M_k=j}\Probcmd{M_k=j}{E_2},
1001: \]
1002: leading to Equation~\eqref{eq:prob.map}.
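The summation step can be sketched as follows, with $\ndfactor$ standing for a generic base $0\le\ndfactor\le 1$ (written without subscript only in this remark):
\[
\sum_{j=1}^{\infty}\ndfactor^j\bigl(1-J_2\bigr)^{j-1}J_2
= J_2\ndfactor\sum_{j=1}^{\infty}\bigl((1-J_2)\ndfactor\bigr)^{j-1}
= \frac{J_2\,\ndfactor}{1-(1-J_2)\ndfactor}.
\]
With $J_2=e^{-\coverage_2\roverlap}$, the denominator for $\ndfactor=\ndfactor_i$ is exactly~$\gamma_i$, so each of the five terms of Equation~\eqref{eq:map.fmap.j} contributes a multiple of $e^{-\coverage_2\roverlap}\ndfactor_i/\gamma_i$.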
1003:
1004: Recall that~$\pnd(j)$ is the probability of failing to map a
1005: contig of~$j$ reads to the two clones simultaneously.
1006: In order to show that the inequality in Equation~\eqref{eq:pnd}
1007: holds, we prove that
1008: \begin{equation}\label{eq:map.fmap.bound}
\pnd(j) < 2\sigsize \ndfactor_1^j-(2\sigsize-1)\ndfactor_3^j < 2\sigsize\ndfactor_1^j.
1010: \end{equation}
1011: Notice that $\ndfactor_5<\ndfactor_4<\ndfactor_3<\ndfactor_2<\ndfactor_1$ and thus
1012: $\pnd(j)\nearrow 2\sigsize\ndfactor_1^j$. Since~$\ndfactor_4=(\ndfactor_3+\ndfactor_5)/2$,
1013: it follows from the convexity of~$x^j$ that
1014: \begin{equation}\label{eq:b345}
1015: 2\ndfactor_4^j \le \ndfactor_3^j+\ndfactor_5^j.
1016: \end{equation}
1017: (Alternatively, notice that the same inequality follows from
1018: $p_{1,1}(j)\ge 0$ in Equation~\eqref{eq:p11}.)
1019: We proceed by rearranging the equality of Equation~\eqref{eq:pnd}:
1020: \begin{multline*}
1021: 2\sigsize \ndfactor_1^j - (2\sigsize-1) \ndfactor_3^j - \pnd(j) =
1022: 2(\sigsize-1) \ndfactor_2^j
1023: + (\sigsize-1)^2 \ndfactor_3^j
1024: - 2\sigsize(\sigsize-1) \ndfactor_4^j
1025: + (\sigsize-1)^2 \ndfactor_5^j\\*
1026: =
1027: (\sigsize-1)^2\underbrace{\Bigl(
1028: \ndfactor_3^j
1029: +\ndfactor_5^j
1030: -2\ndfactor_4^j
\Bigr)}_{\text{$\ge 0$ by Eq.~\eqref{eq:b345}}}
1032: +2(\sigsize-1) \underbrace{\Bigl(
1033: \ndfactor_2^j
1034: -\ndfactor_4^j
1035: \Bigr)}_{\text{$>0$ since $\ndfactor_2>\ndfactor_4$}},
1036: \end{multline*}
1037: which proves Equation~\eqref{eq:map.fmap.bound}.
1038: \end{proof}
1039:
1040: It is difficult to derive useful closed formulas
1041: for the probability of overlap detection.
1042: For example, based on Equation~\eqref{eq:prob.map},
1043: the number of contigs in the overlap that are simultaneously
1044: mapped to the clones can be modeled as arrivals in a Poisson process
1045: with intensity $\coverage_2e^{-\coverage_2\roverlap}p_2$.
1046: For practical values of~$\coverage_2$, this
1047: model seriously underestimates the probability of overlap detection.
1048: The problem is similar to the one of using Lander-Waterman statistics \cite{LanderWaterman}
1049: at high coverage levels (see \citeN{WendlWaterston} for a discussion).
1050: For a more suitable model, let~$\rvgaps$ be the number of gaps entirely contained in the
1051: overlap, and number the islands from 0 to~$\rvgaps$.
Let~$j_0, j_1, \dotsc, j_{\rvgaps}$ denote the numbers of shotgun sequences in
1053: the islands. The probability that none of the islands can be
1054: mapped simultaneously to the two clones can be calculated as
1055: \begin{equation}\label{eq:pnomap.exp}
1056: p_{\mathrm{nomap}}(j_0,\dotsc,j_{\rvgaps})
1057: =\prod_{i=0}^{\rvgaps}\pnd(j_i),
1058: \end{equation}
1059: where~$\pnd(j)$ is defined by Equation~\eqref{eq:pnd}.
1060: (Notice that~$\rvgaps$ and the~$j_i$ are random variables.)
1061: We are interested in the expected value
1062: $p_{\mathrm{nomap}}
1063: =\EXP p_{\mathrm{nomap}}(j_0,\dotsc,j_{\rvgaps})$.
1064: In order to get a good assessment of CAPS-MAP performance,
1065: we found that
1066: it is best to use a Monte-Carlo estimation\footnote{
1067: Specifically, for every overlap size considered,
1068: we carried out a number of simulated experiments.
1069: Each experiment used a fixed number of
1070: shotgun sequences~$\rvreads$ placed
1071: randomly in the overlap, and
1072: produced an instance of
1073: a $(j_0,\dotsc,j_{\rvgaps})$ vector, for which
1074: $p_{\mathrm{nomap}}(j_0,\dotsc,j_{\rvgaps})$
1075: was calculated using Equation~\eqref{eq:pnd}.
1076: The average of these values was used to estimate $p_{\mathrm{nomap}}$.
1077: The average was weighted with the probabilities
1078: of different~$\rvreads$ values, given by a Poisson distribution.
1079: The set of~$\rvreads$ values was chosen so that
1080: it provided a sufficient
1081: accuracy for the weighted average estimate.
1082: For every~$\rvreads$, ten thousand experiments were
1083: performed.}
1084: of this expected value;
1085: see Figure~\ref{fig:capsm.probs}.
1086: For an alternative, observe that
1087: the inequality of
1088: Equation~\eqref{eq:pnd} implies
1089: $p_{\mathrm{nomap}} < \EXP\Bigl[\ndfactor_1^{\rvreads}(2\sigsize)^{\rvgaps+1}\Bigr]$
1090: where~$\rvreads$ is the number of sequences
1091: in the overlap, and thus $\rvreads=\sum_{i=0}^{\rvgaps} j_i$.
1092: Based on this observation,
1093: we derived bounds (see Appendix)
1094: that are useful for large values of~$\coverage_2$ (e.g., $\coverage_2=7$),
1095: but at lower coverages, this approach also
1096: underestimates the overlap detection probabilities significantly.
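The Monte-Carlo estimate described in the footnote can be summarized in a formula. This is a sketch in which the Poisson mean~$\lambda$ of~$\rvreads$, the set~$S$ of simulated read counts, and the number~$T$ of repetitions are symbols introduced here for illustration:
\[
\hat p_{\mathrm{nomap}}
= \sum_{m\in S} e^{-\lambda}\frac{\lambda^m}{m!}\cdot
\frac1T\sum_{t=1}^{T}\ \prod_{i=0}^{\rvgaps^{(m,t)}}\pnd\bigl(j_i^{(m,t)}\bigr),
\]
where $\bigl(j_0^{(m,t)},\dotsc,j_{\rvgaps^{(m,t)}}^{(m,t)}\bigr)$ is the vector of island sizes obtained in the $t$-th random placement of~$m$ reads in the overlap, and $T=10000$ according to the footnote. Read counts outside~$S$ are dropped, so $S$ has to carry nearly all of the Poisson mass for the estimate to be accurate.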
1097:
1098: \begin{figure}
1099: \centerline{
1100: \includegraphics[height=0\myfigheight]{capsm-nodetect}}
1101: \caption[CAPS-MAP statistics]{\captionstyle
1102: Clone overlap detection.
1103: The graph shows the probability of not detecting an overlap between
1104: two clones, as a function of the overlap size. The plots were calculated by a Monte-Carlo
1105: method using Theorem~\ref{tm:capsm}.
1106: All plots use~$\soverlap=0.1$ for shotgun sequence overlap detection, and
1107: $\flength=500$ for shotgun sequence length.
1108: }
1109: \label{fig:capsm.probs}
1110: \end{figure}
1111:
1112: \section{BAC ordering}
1113: Our analyses so far have focused on detecting BAC overlaps via CAPS-MAP.
1114: This localized perspective was partly adopted to ease the theoretical analysis.
1115: In practice, mapping is performed based on a global clone-contig incidence matrix.
1116: The global approach exploits the dependencies in the collected data for increased
1117: accuracy.
1118: The algorithmic issues are very similar
1119: to those encountered
1120: in the context of STS-based physical mapping
1121: \shortcite{Gusfield}.
1122: Define the mapping matrix~$\mathbf{M}$ in which the rows correspond to the BACs,
1123: the columns correspond to the contigs, and $\mathbf{M}[i,j]=1$
1124: if contig~$j$ is linked to clone~$i$, otherwise $\mathbf{M}[i,j]=0$.
1125: We want to find the true ordering of the rows and columns,
1126: defined by their physical locations on the genome.
1127: Assume for a moment that the matrix is completely error-free,
1128: i.e., all contigs are correctly assembled, and all contig-clone overlaps
1129: are detected.
1130: It is not hard to see that the row and column permutations
1131: corresponding to the correct ordering result in a matrix~$\mathbf{M}'$
1132: that satisfies the {\em consecutive ones} property (C1P) in the rows and the columns:
1133: for every row~$i$, there exist $a\le b$ with $\mathbf{M}'[i,j]=1$ if and only if
1134: $a\le j\le b$, and the same property holds for the columns.
1135: (A sufficient condition ensuring row-wise (or column-wise) C1P is that
1136: if the left endpoint of a contig (or a clone) precedes the left endpoint
1137: of another one, the same holds for their right endpoints.)
1138: Finding such permutations is a well-known problem~\shortcite{C1P},
1139: and can be done in linear time.
1140: When the matrix is not error-free, one can use
1141: techniques introduced for STS-based physical mapping.
1142: In \S\ref{sec:simulation} we detail a
1143: method that relies on traveling salesperson tours.
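As a small illustration, using a toy matrix invented here rather than data from the simulation, consider three clones and four contigs whose true order gives
\[
\mathbf{M}'=\begin{pmatrix} 1&1&0&0\\ 0&1&1&0\\ 0&0&1&1 \end{pmatrix}.
\]
Every row and every column of~$\mathbf{M}'$ has its ones in a single block, so the matrix has the C1P in both directions. Exchanging the second and third columns produces
\[
\begin{pmatrix} 1&0&1&0\\ 0&1&1&0\\ 0&1&0&1 \end{pmatrix},
\]
in which the first and third rows contain a zero between two ones. An ordering procedure has to undo such exchanges.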
1144:
1145: \section{CAPS-MAP simulation of Drosophila assembly}\label{sec:simulation}
1146: We tested the CAPS-MAP approach
1147: by simulating the assembly of the {\em Drosophila melanogaster}
genome. One of the main goals of the
simulated assembly was to predict the performance of a
hybrid approach combining WGS
and CAPS sequences in a
project that closely resembles the
setup of the honey bee genome project
1154: (\verb|http://www.hgsc.bcm.tmc.edu/projects/honeybee/|),
1155: currently pursued at the Human Genome Sequencing Center
1156: (HGSC)
1157: of Baylor College of Medicine.
1158:
From the concatenation of all Drosophila genome sequence (Release 2.5,
112.6 million bases),
2880 BAC sequences were generated
by randomly picking their locations and lengths. The
1163: mean BAC insert length was 150 kbp,
1164: and its standard deviation was 500 bp.
The resulting random BAC library provides approximately 3.8X coverage of the genome.
1166: BACs were arrayed by first partitioning them into 5 groups, and then
1167: using a two-array $24\times24$ transversal design for each group.
1168: Every BAC was covered by 1.2X CAPS sequences: 0.4X per pool on
1169: the first array and 0.2X per pool on the second (reshuffled) array.
1170: In addition, WGS sequences were produced at 4X genome coverage.
1171: The shotgun sequences were generated using the program \texttt{wgs-simulator}
1172: (written by K.~James Durbin),
1173: which mimics shotgun sequence collection realistically by
1174: relying on sequence quality files \shortcite{Phred.error}
1175: produced in sequencing projects.
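The design parameters fit together as follows. The pool count assumes that every clone belongs to one row pool and one column pool on each array, as in Figure~\ref{fig:capsm}:
\[
5\times 24^2 = 2880 \text{ BACs},\qquad
5\times 2\times(24+24) = 480 \text{ pools},\qquad
2\times 0.4+2\times 0.2 = 1.2\text{X CAPS coverage per BAC}.
\]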
1176:
1177: Shotgun sequences were assembled into contigs using
1178: the Atlas suite of genome assembly tools (\verb|http://www.hgsc.bcm.tmc.edu/downloads/software/atlas/|)
1179: and Phrap (\verb|http://www.phrap.org/|).
1180: A contig was mapped
1181: to a clone if it contained sequences from all four clone pools.
Contigs that mapped to more than one
BAC provided evidence of BAC overlaps.
1184: BACs were grouped into maximal overlapping sets, or {\em bactigs}.
1185:
1186: We compared the overlap graphs to assess CAPS-MAP overlap detection.
1187: The vertices of the overlap graphs are the BACs, and two BACs
1188: are connected if there is an overlap between them.
1189: The true overlap graph for the original
1190: BACs contains 2880 vertices, and 10992 edges in 66 graph components.
1191: The overlap graph calculated from the bactigs
1192: has 9193 edges in 110 components.
1193: Among its edges, 8527 (93\%) are correct,
1194: and 2465 (22\%) of the true overlaps are not discovered.
1195: The median length of detected overlaps is 87 kbp, and the median length
1196: of undetected overlaps is 42 kbp.
1197: There are 666 edges that correspond to no real
1198: overlaps. The vast majority of these ``false positives''
1199: are instances when a long read contig links several BACs,
1200: which do not always overlap pairwise.
1201: All but two of the
1202: CAPS-MAP bactigs are true overlapping sets of BACs.
1203: CAPS-MAP links the assembled contigs to BACs correctly even in these two
1204: bactigs:
1205: the source of the error is the read contig assembly.
1206: Table~\ref{tbl:droso} shows statistics on the bactig sizes and genome coverage.
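The edge counts reported above are mutually consistent, as the following arithmetic shows:
\[
9193-8527=666,\qquad
10992-8527=2465,\qquad
\frac{8527}{9193}\approx 0.928,\qquad
\frac{2465}{10992}\approx 0.224.
\]
In terms that are not used elsewhere in this paper, this corresponds to a precision of about $93\%$ and a recall of about $78\%$ for overlap detection.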
1207:
1208: \begin{table}
1209: \begin{center}
1210: \small
1211: \begin{tabular}{|r|r|r|}
1212: \hline
Minimum bactig size (BACs) & Genome covered & Number of BACs in bactigs \\
1214: \hline
1215: 2 & 97.1\% & 2758 \\
1216: 3 & 96.7\% & 2746 \\
1217: 5 & 94.9\% & 2714 \\
1218: 10 & 88.5\% & 2565 \\
1219: 15 & 82.4\% & 2400 \\
1220: 20 & 77.9\% & 2284 \\
1221: 30 & 65.0\% & 1945 \\
1222: 51 & 50.9\% & 1521 \\
1223: 60 & 40.3\% & 1195 \\
1224: \hline
1225: %Min bactig size Bases covered Total BACs Genome percentage covered
1226: %1 109852500 2766 97.49%
1227: %2 109466060 2758 97.14%
1228: %3 108913450 2746 96.65%
1229: %5 106944875 2714 94.91%
1230: %10 99773079 2565 88.54%
1231: %15 92872143 2400 82.42%
1232: %20 87814164 2284 77.93%
1233: %25 80304207 2107 71.26%
1234: %30 73277950 1945 65.03%
1235: %35 68186296 1813 60.51%
1236: %40 62570414 1660 55.53%
1237: %45 60987140 1616 54.12%
1238: %50 59293215 1571 52.62%
1239: %51 57301929 1521 50.85%
1240: %60 45414106 1195 40.30%
1241: \end{tabular}
1242: \end{center}
1243: \caption{Statistics for simulated Drosophila assembly.
1244: This table details the genome and BAC library coverage
1245: by bactig sizes. More than half of the genome is covered by bactigs
1246: with at least 51 BACs in them, defining the N50
1247: statistic for the clone map.
1248: }\label{tbl:droso}
1249: \end{table}
1250:
1251: BACs were ordered within each bactig.
1252: For every bactig, an overlap matrix~$\mathbf{M}$ was
1253: calculated, in which
1254: the rows correspond to the bactig's clones,
1255: the columns correspond to the contigs linked to at least one bactig clone,
1256: and $\mathbf{M}[i,j]=1$
1257: if contig~$j$ is linked to clone~$i$, otherwise $\mathbf{M}[i,j]=0$.
1258: The following
1259: traveling salesperson (TSP)
1260: formulation is used to find the correct column permutation.
1261: We search for a tour in a graph, in which every vertex corresponds to a
1262: contig (and thus a column), with an additional vertex~$u_0$.
1263: The weight of an edge between vertices~$u$ and~$u'$, corresponding
1264: to contigs~$j$ and~$j'$, is the
1265: number of rows in which they differ:
1266: $w(u,u')=\sum_i\chi\Bigl\{\mathbf{M}[i,j]\ne\mathbf{M}[i,j']\Bigr\}$,
1267: where~$\chi\{\cdot\}$ is the indicator function.
1268: The weight of an edge between~$u$ and~$u_0$ is the sum of ones
1269: in the column~$j$ that corresponds to~$u$: $w(u,u_0)=\sum_i\mathbf{M}[i,j]$.
Now, a minimum-weight tour (Hamiltonian cycle) in this graph
1271: gives the best column permutation in the sense that it
1272: minimizes the number of gaps between blocks of ones within rows
1273: \cite{mapping.tsp}.
1274: The best row ordering could be found in an analogous manner, but we used
1275: a simpler method which worked better in practice.
Clones are ordered relative to the contig order
by placing clone~$\clone$ before~$\clone'$ if
the first contig linked to~$\clone$ precedes the first
contig linked to~$\clone'$,
or if their first contigs are identical
but the last contig of~$\clone$ precedes that of~$\clone'$.
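To make the edge weights of the TSP formulation concrete, here is a toy instance invented for illustration, with three clones and four contigs whose columns are named $c_1,\dotsc,c_4$ only for this example:
\[
\mathbf{M}=\begin{pmatrix} 1&1&0&0\\ 0&1&1&0\\ 0&0&1&1 \end{pmatrix}.
\]
The tour $u_0,c_1,c_2,c_3,c_4,u_0$ has weight $1+1+2+1+1=6$, whereas the tour $u_0,c_1,c_3,c_2,c_4,u_0$ has weight $1+3+2+3+1=10$. In both cases the weight equals twice the number of blocks of ones in the rows under that column order (three blocks and five blocks, respectively), because every block is entered once and left once along the tour when $u_0$ is treated as an all-zero column.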
1282:
1283: We used the \texttt{concorde} program \shortcite{concorde}
1284: to solve the TSP instances.
The resulting row permutation is
then further analyzed to find clones whose relative order
the permutation imposes arbitrarily. Specifically,
if consecutive rows of the permuted matrix~$\mathbf{M}'$ are identical,
then the order of the corresponding clones is not resolved.
Subsequently, we compared the TSP orders to the true orders, which
are known since the BAC sequences were generated artificially.
1292: Figure~\ref{fig:fly.order} shows the
1293: outcome of the comparison for two bactigs.
1294: The TSP order is very close to the true order.
1295:
1296: \begin{figure}
1297: \centerline{\includegraphics[height=.15\textheight]{bactig-example-1}}
1298: \centerline{\includegraphics[height=.15\textheight]{bactig-example-2}}
1299: \caption{
1300: Correctness of BAC ordering in Drosophila simulation.
1301: The top (\textsf{Loc})
1302: of each graph shows the relative
1303: physical location of each BAC,
1304: the middle (\textsf{True}) shows
1305: the correct BAC order, and
1306: the bottom (\textsf{TSP})
1307: shows the TSP order,
1308: and the BAC identifiers.
1309: Identical BACs
1310: are connected in order to
1311: display the differences between the two permutations.
1312: The order of BACs at the bottom is not resolved when they are connected
1313: with a horizontal line.
1314: By resolving them optimally,
1315: bactig 23 produces the order of 72 BACs with 12 breakpoints
1316: and bactig 16 orders 50 BACs with 9 breakpoints.
1317: (Breakpoints are neighbors in
1318: the TSP order that are not neighbors in the true order.)
1319: }
1320: \label{fig:fly.order}
1321: \end{figure}
1322:
1323: \section{Discussion}
1324: %{\mysectionstyle 5. Discussion. \ \ }
1325: The experimental expedience of shotgun sequencing has been essential
1326: for the success of genome-scale sequencing projects in the past decade.
1327: The power of the concept comes from the now established
1328: fact that the loss of information about read localization
1329: incurred by random subcloning can be largely recovered
1330: in the assembly step using sequence information.
1331: Clone pooling is similar in spirit to shotgun sequencing in that it
1332: introduces experimental expedience by dramatically reducing the number
1333: of subclone library preparations. The clone pooling step leads to a
1334: temporary loss of information about localization of shotgun sequences on
1335: individual BAC clones. We have demonstrated that
1336: sequence information can be used to
1337: successfully recover most of the information lost in
1338: pooling.
1339:
1340: Our analyses presented here indicate the theoretical feasibility of the CAPS-MAP
1341: method and provide guidance for the design of genome-scale CAPS-MAP
1342: experiments. In particular, our analysis indicates that transversal
1343: pooling designs can accommodate high levels of clone redundancy and
1344: perform well even at low levels of shotgun sequence coverage of clone
1345: pools.
1346:
Practical biological and technical considerations may limit the
array size. For large genomes, such limits may require partitioning the set of
BACs and applying pooling separately to the individual
subsets. This results in a lower clone redundancy within individual arrays
and a larger number of pools. Our analysis allows for such
partitioning of clones, as well as for the
inclusion of whole-genome shotgun sequence reads.
It thus covers
realistic and practical application scenarios of the CAPSS and CAPS-MAP
methods.
1357:
1358: \section*{Acknowledgements}
1359: We are grateful to
1360: Richard Gibbs and George
1361: Weinstock for sharing pre-publication information on CAPSS and for useful
1362: comments.
1363: Our discussion of
1364: computing CAPS-MAP overlap detection probabilities
1365: has greatly benefited from conversations with
1366: Luc Devroye and Michael Waterman.
1367: This work was supported by grants
R01~HG02583-01 from NHGRI at the NIH,
1369: U01~RR18464 from the NCRR,
1370: and
1371: 250391-02 from the NSERC.
1372:
1373: \textsc{Remark.}\ \
An extended abstract of this paper is published in
{\em Genome Informatics}, vol.~14, Universal Academy Press, Tokyo
1376: (Proceedings of the
1377: 14th International Conference on Genome Informatics (GIW),
1378: December 14--17, 2003, Yokohama, Japan).
1379:
1380:
1381: \bibliographystyle{chicago}
1382: \begin{thebibliography}{}
1383:
1384: \bibitem[\protect\citeauthoryear{Alizadeh, Karp, Newberg, and Weisser}{Alizadeh
1385: et~al.}{1995}]{mapping.tsp}
1386: Alizadeh, F., R.~M. Karp, L.~A. Newberg, and D.~K. Weisser (1995).
1387: \newblock Physical mapping of chromosomes: a combinatorial problem in molecular
1388: biology.
1389: \newblock {\em Algorithmica\/}~{\em 13}, 52--76.
1390:
1391: \bibitem[\protect\citeauthoryear{Applegate, Bixby, Chv\'atal, and
1392: Cook}{Applegate et~al.}{1999}]{concorde}
1393: Applegate, D., R.~Bixby, V.~Chv\'atal, and W.~Cook (1999).
1394: \newblock Concorde 99.12.15 release.
1395: \newblock \verb|http://www.math.princeton.edu/tsp/concorde.html|.
1396:
1397: \bibitem[\protect\citeauthoryear{Arratia, Lander, Tavar{\'e}, and
1398: Waterman}{Arratia et~al.}{1991}]{map.anchor}
1399: Arratia, R., E.~S. Lander, S.~Tavar{\'e}, and M.~S. Waterman (1991).
1400: \newblock Genomic mapping by anchoring random clones: A mathematical analysis.
1401: \newblock {\em Genomics\/}~{\em 11}, 806--827.
1402:
1403: \bibitem[\protect\citeauthoryear{Booth and Lueker}{Booth and
1404: Lueker}{1976}]{C1P}
1405: Booth, K.~S. and G.~S. Lueker (1976).
1406: \newblock Testing for the {C}onsecutive {O}nes {P}roperty, interval graphs, and
1407: graph planarity using {PQ}-tree algorithms.
1408: \newblock {\em Journal of Computer and System Sciences\/}~{\em 13}, 335--379.
1409:
1410: \bibitem[\protect\citeauthoryear{Cai, Chen, Gibbs, and Bradley}{Cai
1411: et~al.}{2001}]{CAPSS}
1412: Cai, W.-W., R.~Chen, R.~A. Gibbs, and A.~Bradley (2001).
1413: \newblock A clone-array pooled strategy for sequencing large genomes.
1414: \newblock {\em Genome Research\/}~{\em 11}, 1619--1623.
1415:
1416: \bibitem[\protect\citeauthoryear{Colbourn and Dinitz}{Colbourn and
1417: Dinitz}{1996}]{design.handbook}
1418: Colbourn, C.~J. and J.~H. Dinitz (Eds.) (1996).
1419: \newblock {\em The {CRC} Handbook of Combinatorial Designs}.
1420: \newblock Boca Raton: CRC Press.
1421:
1422: \bibitem[\protect\citeauthoryear{Cs{\H u}r\"os and Milosavljevic}{Cs{\H u}r\"os
1423: and Milosavljevic}{2002}]{PGI.conf}
1424: Cs{\H u}r\"os, M. and A.~Milosavljevic (2002).
1425: \newblock Pooled genomic indexing ({PGI}): mathematical analysis and experiment
1426: design.
1427: \newblock In {\em Algorithms in Bioinformatics: Second International Workshop},
1428: Volume 2452 of {\em {LNCS}}, pp.\ 10--28. Berlin Heidelberg:
1429: Springer-Verlag.
1430:
1431: \bibitem[\protect\citeauthoryear{Du and Hwang}{Du and Hwang}{2000}]{CGT}
1432: Du, D.-Z. and F.~K. Hwang (2000).
1433: \newblock {\em Combinatorial Group Testing and Its Applications\/} (2nd ed.).
1434: \newblock Singapore: World Scientific.
1435:
1436: \bibitem[\protect\citeauthoryear{Ewens and Grant}{Ewens and Grant}{2001}]{EG}
1437: Ewens, W.~J. and G.~R. Grant (2001).
1438: \newblock {\em Statistical Methods in Bioinformatics: An Introduction}.
1439: \newblock New York: Springer-Verlag.
1440:
1441: \bibitem[\protect\citeauthoryear{Ewing and Green}{Ewing and
1442: Green}{1998}]{Phred.error}
1443: Ewing, B. and P.~Green (1998).
1444: \newblock Base-calling of automated sequencer traces using {\em {p}hred}: {II}.
1445: error probabilities.
1446: \newblock {\em Genome Research\/}~{\em 8}, 186--194.
1447:
1448: \bibitem[\protect\citeauthoryear{Green}{Green}{2001}]{Sequencing.review}
1449: Green, E.~D. (2001).
1450: \newblock Strategies for the systematic sequencing of complex genomes.
1451: \newblock {\em Nature Reviews Genetics\/}~{\em 2}, 573--583.
1452:
1453: \bibitem[\protect\citeauthoryear{Gusfield}{Gusfield}{1997}]{Gusfield}
1454: Gusfield, D. (1997).
1455: \newblock {\em Algorithms on Strings, Trees, and Sequences: Computer Science
1456: and Computational Biology}.
1457: \newblock UK: Cambridge University Press.
1458:
1459: \bibitem[\protect\citeauthoryear{{IHGSC}}{{IHGSC}}{2001}]{human.genome}
1460: {IHGSC} (2001).
1461: \newblock Initial sequencing and analysis of the human genome.
\newblock {\em Nature\/}~{\em 409\/}(6822), 860--921.
1463:
1464: \bibitem[\protect\citeauthoryear{Lander and Waterman}{Lander and
1465: Waterman}{1988}]{LanderWaterman}
1466: Lander, E.~S. and M.~S. Waterman (1988).
1467: \newblock Genomic mapping by fingerprinting random clones: a mathematical
1468: analysis.
1469: \newblock {\em Genomics\/}~{\em 2}, 231--239.
1470:
1471: \bibitem[\protect\citeauthoryear{Marra, Kucaba, Dietrich, Green, Brownstein,
1472: Wilson, McDonald, Hillier, McPherson, and Waterston}{Marra
1473: et~al.}{1997}]{map.sequenceready}
1474: Marra, M.~A., T.~A. Kucaba, N.~L. Dietrich, E.~D. Green, B.~Brownstein, R.~K.
1475: Wilson, K.~M. McDonald, L.~W. Hillier, J.~D. McPherson, and R.~H. Waterston
1476: (1997).
1477: \newblock High throughput fingerprint analysis of large-insert clones.
1478: \newblock {\em Genome Research\/}~{\em 7}, 1072--1084.
1479:
1480: \bibitem[\protect\citeauthoryear{Waterman}{Waterman}{1995}]{Waterman}
1481: Waterman, M.~S. (1995).
1482: \newblock {\em Introduction to Computational Molecular Biology: Maps, Sequences
1483: and Genomes}.
1484: \newblock Boca Raton: Chapman \&\ Hall.
1485:
1486: \bibitem[\protect\citeauthoryear{Weber and Myers}{Weber and Myers}{1997}]{WGS}
1487: Weber, J.~L. and E.~W. Myers (1997).
1488: \newblock Human whole-genome shotgun sequencing.
1489: \newblock {\em Genome Research\/}~{\em 7}, 401--409.
1490:
1491: \bibitem[\protect\citeauthoryear{Wendl and Waterston}{Wendl and
1492: Waterston}{2002}]{WendlWaterston}
1493: Wendl, M.~C. and R.~H. Waterston (2002).
1494: \newblock Generalized gap model for bacterial artificial chromosome clone
1495: fingerprint mapping and shotgun sequencing.
1496: \newblock {\em Genome Research\/}~{\em 12}, 1943--1949.
1497:
1498: \end{thebibliography}
1499:
1500: \clearpage
1501: \appendix
1502: \section*{Appendix}
Here we expand our discussion of the probability of overlap detection in CAPS-MAP.
In particular, we derive formulas that
show that the probability of not detecting an overlap decays exponentially
when the coverage~$\coverage_2$ is not too small.
We start with the bound
\begin{equation}\label{eq:pnomap.bound.def}
p_{\mathrm{nomap}} < \EXP\Bigl[\ndfactor_1^{\rvreads}(2\sigsize)^{\rvgaps+1}\Bigr].
\end{equation}
1511:
1512:
Define
\[
\gengaps_{\nreads}(z) = \Expcmd{z^{\rvgaps}}{\rvreads=\nreads},
\]
the probability generating function for
the distribution of the number of gaps conditioned on the number of shotgun sequences.
Define the events~$A_i$ for $i=1,\dotsc,\nreads-1$: $A_i$ denotes
the event that the $i$-th sequence is followed by a gap, conditioned
on the event $\{\rvreads=\nreads\}$.
For arbitrary~$\ngaps$ and any set of indices $i_1<i_2<\dotsb<i_{\ngaps}$,
1523: \[
1524: \PROB\Bigl\{A_{i_1}A_{i_2}\dotsm A_{i_{\ngaps}}\Bigr\}
1525: = (1-\ngaps\delta)_{+}^{\nreads},
1526: \]
1527: where $\delta=\frac{\roverlap\flength}{\coverlap\clength}$,
1528: and $(x)_+=\max\{0,x\}$ \shortcite{EG,WendlWaterston}.
1529: Let
\begin{align*}
S_0 & = 1,\\*
S_{\ngaps} & = \sum_{i_1<\dotsb< i_{\ngaps}} \PROB\Bigl\{A_{i_1}A_{i_2}\dotsm A_{i_{\ngaps}}\Bigr\}
	= \binom{\nreads-1}{\ngaps}(1-\ngaps\delta)_{+}^{\nreads}.
\end{align*}
1535: Using inclusion-exclusion,
1536: \[
1537: \Probcmd{\rvgaps=\ngaps}{\rvreads=\nreads}
1538: =\sum_{j=\ngaps}^{\nreads-1}
1539: \binom{j}{\ngaps}(-1)^{j-\ngaps} S_j.
1540: \]
1541: Hence,
1542: \begin{align*}
1543: \gengaps_{\nreads}(z) & =
1544: \sum_{\ngaps=0}^{\nreads-1}
1545: z^\ngaps \sum_{j=\ngaps}^{\nreads-1} \binom{j}{\ngaps}(-1)^{j-\ngaps} S_j\\*
1546: & = \sum_{j=0}^{\nreads-1} S_j
1547: \sum_{\ngaps=0}^j (-1)^{j-\ngaps} \binom{j}{\ngaps} z^{\ngaps} \\*
1548: & = \sum_{j=0}^{\nreads-1} S_j (z-1)^j.
1549: \end{align*}
1550: Substituting the $S_j$ values:
1551: \begin{equation}\label{eq:gengaps}
1552: \gengaps_{\nreads}(z) =
1553: \sum_{j=0}^{\nreads-1}
1554: \binom{\nreads-1}{j} (1-j\delta)_{+}^{\nreads} (z-1)^j,
1555: \end{equation}
a result of independent interest.
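
As a quick sanity check of Equation~\eqref{eq:gengaps} (an illustration that uses only the definitions above), take~$\nreads=2$.
Only the event~$A_1$ can occur, and the formula reduces to
\[
\gengaps_{2}(z) = 1 + (1-\delta)_{+}^{2}(z-1),
\]
so a single gap appears with probability~$(1-\delta)_{+}^{2}$
and no gap with probability~$1-(1-\delta)_{+}^{2}$.
For general~$\nreads\ge 1$, setting~$z=1$ gives~$\gengaps_{\nreads}(1)=S_0=1$,
as required of a probability generating function,
and differentiating at~$z=1$ gives the conditional expected number of gaps:
\[
\Expcmd{\rvgaps}{\rvreads=\nreads}
= \gengaps_{\nreads}'(1) = S_1 = (\nreads-1)(1-\delta)_{+}^{\nreads}.
\]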
1557:
1558: Returning to Equation~\eqref{eq:pnomap.bound.def}, we have
1559: \begin{equation}\label{eq:pnomap.bound.1}
1560: p_{\mathrm{nomap}}
1561: < \EXP\biggl[
1562: 2\sigsize \ndfactor_1^{\rvreads}
1563: \sum_{j=0}^{\rvreads-1}
1564: \binom{\rvreads-1}{j} (1-j\delta)_{+}^{\rvreads} (2\sigsize-1)^j
1565: \biggr],
1566: \end{equation}
1567: where~$\rvreads$ is a Poisson random variable with
1568: mean
1569: \[
1570: \mreads=
\frac{\coverage_2\coverlap\clength}{\flength}.
1572: \]
1573: For every~$\nreads\ge0$,
1574: $(1-j\delta)_{+}^{\nreads} \le e^{-j\nreads\delta}$,
1575: hence
1576: \[
1577: \sum_{j=0}^{\nreads-1}
1578: \binom{\nreads-1}{j} (1-j\delta)_{+}^{\nreads} (2\sigsize-1)^j
1579: \le \Bigl(1+(2\sigsize-1)e^{-\nreads\delta}\Bigr)^{\nreads-1}.
1580: \]
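To spell out this step: assuming~$2\sigsize\ge 1$, so that every term of the sum is nonnegative,
each factor~$(1-j\delta)_{+}^{\nreads}$ can be replaced by~$e^{-j\nreads\delta}$,
and the resulting sum is a binomial expansion.
With the shorthand~$x=(2\sigsize-1)e^{-\nreads\delta}$ (introduced only for this remark),
\[
\sum_{j=0}^{\nreads-1}
\binom{\nreads-1}{j} e^{-j\nreads\delta} (2\sigsize-1)^j
= \sum_{j=0}^{\nreads-1} \binom{\nreads-1}{j} x^j
= (1+x)^{\nreads-1}.
\]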
1581: Consequently, by Equation~\eqref{eq:pnomap.bound.1},
1582: \[
1583: p_{\mathrm{nomap}}
1584: < \EXP\biggl[
1585: 2\sigsize \ndfactor_1^{\rvreads}
1586: \Bigl(1+(2\sigsize-1)e^{-\rvreads\delta}\Bigr)^{\rvreads-1}
1587: \biggr].
1588: \]
Recall that the random variable whose expectation we take
is an upper bound on the probability~$p_{\mathrm{nomap}}(j_0,\dotsc,j_{\rvgaps})$,
and is thus uninformative whenever it exceeds one, so we may cap it at one.
1592: Let
1593: \[
1594: f(\nreads)=
1595: \min\Bigl\{1,2\sigsize \ndfactor_1^{\nreads}
1596: \Bigl(1+(2\sigsize-1)e^{-\nreads\delta}\Bigr)^{\nreads-1}\Bigr\}.
1597: \]
1598: So we have in fact the bound
1599: \begin{equation}\label{eq:pnomap.bound.2}
1600: p_{\mathrm{nomap}}
1601: < \EXP f(\rvreads).
1602: \end{equation}
1603: In order to achieve exponential decay in the bound, we would like to have
1604: \[
1605: \ndfactor_1\Bigl(1+(2\sigsize-1)e^{-\nreads_0\delta}\Bigr)<1
1606: \]
1607: for some~$\nreads_0<\mreads$. Rearranging the inequality, we have
1608: \begin{equation}\label{eq:goodaw}
1609: (2\sigsize-1)\frac{\sigsize(\pcoverage+\wgscoverage)+\pcoverage}{(\sigsize-1)\pcoverage}
1610: < e^{(2\pcoverage+\wgscoverage)\roverlap},
1611: \end{equation}
1612: which is satisfied when~$\pcoverage$ and~$\wgscoverage$ are not too small
1613: (see Figure~\ref{fig:goodaw}).
1614:
1615: \begin{figure}
1616: \centerline{\includegraphics[height=.3\textheight]{good-aw}}
\caption{Values of the pooled shotgun coverage~$\pcoverage$
and WGS coverage~$\wgscoverage$ for which the clone overlap detection bound applies
lie above the graphs (see Equation~\eqref{eq:goodaw}).}\label{fig:goodaw}
1620: \end{figure}
1621:
There are several possible ways to
exploit the fact that the exponential component of $f(\nreads)$ begins to decay
already for~$\nreads$ less than the expected value~$\mreads$.
1625: The main idea is that when evaluating
1626: $\EXP f(\rvreads)=\sum f(\nreads)\PROB\{\rvreads=\nreads\}$
1627: in Equation~\eqref{eq:pnomap.bound.2},
1628: either the probability of~$\rvreads=\nreads$ is small, or
1629: the value of~$f(\nreads)$ is small.
Let~$0<k<\mreads$ be a threshold (that we specify later), and let~$\alpha=k/\mreads$.
1631: To proceed with Equation~\eqref{eq:pnomap.bound.2}, we condition
1632: on the event~$\{\rvreads\le\alpha\mreads\}$.
1633: We use the bound
1634: \begin{equation}\label{eq:poisson.bound}
1635: \PROB\{\rvreads\le \alpha\mreads\}
1636: < \frac{e^{-\mreads(1-\alpha)^2/2}}{(1-\alpha)\sqrt{2\pi\alpha\mreads}},
1637: \end{equation}
1638: which we prove here quickly.
1639: By definition,
1640: \begin{align*}
1641: \PROB\{\rvreads\le\alpha\mreads\}
1642: & \le \sum_{\nreads=0}^k
1643: \frac{{\mreads}^{\nreads}}{\nreads!} e^{-\mreads}
1644: < e^{-\mreads}\frac{{\mreads}^{k}}{k!}
1645: \sum_{\nreads=0}^k \Bigl(\frac{k}{\mreads}\Bigr)^{\nreads}\\*
1646: & < e^{-\mreads}\frac{{\mreads}^{k}}{k!} (1-\alpha)^{-1}
	< e^{-\mreads(1-\alpha+\alpha\ln\alpha)} \frac{1}{(1-\alpha)\sqrt{2\pi\alpha\mreads}},
\end{align*}
where we used the Stirling bound $k!>(k/e)^k\sqrt{2\pi k}$. Using a Taylor series expansion,
\[
1-\alpha+\alpha\ln\alpha = \frac12 (1-\alpha)^2 + \frac16 (1-\alpha)^3 + \frac{1}{12}(1-\alpha)^4 + \dotsb,
\]
1653: and thus $1-\alpha+\alpha\ln\alpha>\frac12 (1-\alpha)^2$ for~$0<\alpha<1$,
1654: and Equation~\eqref{eq:poisson.bound} follows.
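
To give a feel for the magnitude of the bound in Equation~\eqref{eq:poisson.bound},
consider the purely illustrative values~$\mreads=100$ and~$\alpha=1/2$
(chosen for this remark only, not taken from the experiments).
Then
\[
\frac{e^{-\mreads(1-\alpha)^2/2}}{(1-\alpha)\sqrt{2\pi\alpha\mreads}}
= \frac{e^{-12.5}}{0.5\sqrt{100\pi}}
\approx 4.2\times 10^{-7},
\]
so the probability that the number of shotgun sequences falls below half of its mean
is already negligible at this value of~$\mreads$.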
1655:
1656: Now,
1657: \begin{align*}
1658: \EXP f(\rvreads)
1659: & = \Expcmd{f(\rvreads)}{\rvreads\le\alpha\mreads}\PROB\{\rvreads\le\alpha\mreads\}
1660: +\Expcmd{f(\rvreads)}{\rvreads>\alpha\mreads} \PROB\{\rvreads>\alpha\mreads\}\\*
& \le \PROB\{\rvreads\le\alpha\mreads\} + \Expcmd{f(\rvreads)}{\rvreads>\alpha\mreads}\PROB\{\rvreads>\alpha\mreads\}\\*
& < \frac{e^{-\mreads(1-\alpha)^2/2}}{(1-\alpha)\sqrt{2\pi\alpha\mreads}}
	+ \frac{2\sigsize e^{-\mreads}
		\sum_{\nreads=0}^{\infty}
		\frac{\Bigl(\mreads\ndfactor_1(1+(2\sigsize-1)e^{-\alpha\delta\mreads})\Bigr)^{\nreads}}{\nreads!}}{1+(2\sigsize-1)e^{-\alpha\delta\mreads}}
1666: \\*
1667: & = \frac{\exp\Bigl(-\mreads(1-\alpha)^2/2\Bigr)}{(1-\alpha)\sqrt{2\pi\alpha\mreads}}
1668: + \frac{2\sigsize \exp\biggl(-\mreads\Bigl(1-\ndfactor_1(1+(2\sigsize-1)e^{-\alpha\coverage_2\roverlap})\Bigr)\biggr)}{
1669: 1+(2\sigsize-1)e^{-\alpha\coverage_2\roverlap}},
1670: \end{align*}
1671: where we used~$\delta\mreads=\coverage_2\roverlap$.
1672: Figure~\ref{fig:balance.alpha} shows values of~$\alpha$ for different~$\pcoverage,\wgscoverage$ pairs
1673: that balance the exponents in the two terms.
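One natural reading of balancing (stated here as an assumption about how the values are obtained)
is to equate the two exponential rates, that is, to choose~$\alpha\in(0,1)$ solving
\[
\frac{(1-\alpha)^2}{2}
= 1-\ndfactor_1\Bigl(1+(2\sigsize-1)e^{-\alpha\coverage_2\roverlap}\Bigr),
\]
so that both terms decay at the same exponential rate in~$\mreads$.
Provided that~$\ndfactor_1>0$, $2\sigsize>1$, and~$\coverage_2\roverlap>0$,
the left-hand side is decreasing in~$\alpha$ on~$(0,1)$ while the right-hand side is increasing,
so such an~$\alpha$ is unique whenever it exists.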
1674:
1675: \begin{figure}
1676: \centerline{\includegraphics[width=\textwidth]{balance-alpha}}
1677: \caption{Balanced $\alpha$ values for our exponential bound.}\label{fig:balance.alpha}
1678: \end{figure}
1679:
1680: After choosing a balancing~$\alpha$ value for a given~$(\pcoverage,\wgscoverage)$ pair,
1681: we obtain
1682: \[
1683: \EXP f(\rvreads) < X_1 \exp(-X_2 \roverlap\clength),
1684: \]
1685: where~$X_1$ and~$X_2$ are constants that do not depend on~$\roverlap$.
1686: The bound becomes small ($<10^{-8}$)
1687: for larger~$\coverage_2$ values (e.g., $\coverage_2=7$),
1688: but even then, it is not very tight.
1689: Based on simulation results, the tightness is lost
1690: with the inequality of Equation~\eqref{eq:pnomap.bound.def},
1691: and not in the following steps.
1692: For example, we evaluated the bounds of Equations~\eqref{eq:pnomap.bound.1}
1693: and~\eqref{eq:pnomap.bound.2} numerically.
1694: While they are fairly close to each other, and to the exponential bound
1695: using~$\alpha$, they already bound the expected value of~Equation~\eqref{eq:pnomap.exp}
1696: rather loosely in many cases.
1697: Furthermore, even for~$(\pcoverage,\wgscoverage)$ pairs
1698: for which we cannot establish exponential decay
1699: using the inequality of Equation~\eqref{eq:pnomap.bound.def}, the overlap
1700: detection probability may get very close to one.
1701: For instance, a two-array design with
1702: $\pcoverage=0.5$ and~$\wgscoverage=2$ falls below the curve
1703: of Figure~\ref{fig:goodaw}, yet can be employed efficiently
1704: in CAPS-MAP as shown in Figure~\ref{fig:capsm.probs}.
Therefore, we prefer using a Monte Carlo evaluation of Equation~\eqref{eq:pnomap.exp}
1706: to predict the experimental performance of CAPS-MAP.\label{veryend}
1707:
1708: \end{document}