1: \documentclass[pre,epsfig,aps,twocolumn]{revtex4}
2:
3: \usepackage{epsfig}
4: \usepackage{fancyheadings}
5: \usepackage{amstext}
6:
7: \begin{document}
8:
9: \sloppy
10: \pagestyle{empty}
11:
12: \title{Inferring DNA sequences from mechanical unzipping: an
13: ideal-case study}
14: \author{V. Baldazzi $^{1,2,3}$, S. Cocco $^2$, E. Marinari $^4$,
15: R. Monasson $^3$}
16: \affiliation{
17: $^1$ Dipartimento di Fisica, Universit\`a di Roma
18: {\em Tor Vergata}, Roma, Italy\\
19: $^2$ CNRS-Laboratoire de Physique Statistique de l'ENS, 24 rue Lhomond,
20: 75005 Paris, France\\
21: $^3$ CNRS-Laboratoire de Physique Th\'eorique de l'ENS, 24 rue Lhomond,
22: 75005 Paris, France\\
23: $^4$ Dipartimento di Fisica and INFN, Universit\`a di Roma
24: {\em La Sapienza}, P.le Aldo Moro 2, 00185 Roma, Italy
25: }
26:
27: \begin{abstract}
28: We introduce and test a method to predict the
29: sequence of DNA molecules from {\em in silico} unzipping
30: experiments. The method is based on Bayesian inference
31: and on the Viterbi decoding algorithm. The probability of misprediction
32: decreases exponentially with the number of unzippings, with a decay
33: rate depending on the applied force and the sequence content.
34: \end{abstract}
35:
36: \maketitle
37:
38: %Introduction
39:
DNA molecules are the carriers of genetic information, and
knowledge of their sequences is of great biological and
medical importance. State-of-the-art DNA sequencing methods
43: rely on biochemical and gel electrophoresis techniques \cite{mb}, and
44: are able to correctly predict about 99.9\% of the bases. They were massively
used over the past ten years to obtain the human genome (and those
46: of other organisms).
47:
48: Nevertheless, the quest for alternative (cheaper and/or faster)
49: sequencing methods is an active field of research. In this
50: regard, recent single molecule micro-manipulations are of particular
51: interest. Among them are DNA unzipping under a mechanical action
52: \cite{Ess97,Boc02,Lip,Har03,Dan03}
53: or due to translocation through nanopores
54: \cite{Mat04}, the observation of the sequence-dependent
55: activity of an exonuclease \cite{Van03,Per03}, the optical analysis of
DNA polymerization in a nano-chip device \cite{Lev03}, and the detection of
57: single DNA hybridization \cite{Zoc03}. Hereafter, we focus on
58: mechanical unzipping (see Figure~\ref{fig1}), first realized by
59: Bockelmann, Heslot and coworkers in 1997 \cite{Ess97,Boc02}. In their
60: experiment, the strands are pulled apart under a constant velocity.
61: The force is measured and fluctuates around $15$ pN for
the $\lambda$-phage DNA (a virus with a $48{,}502$ base pair long genome), with higher
63: (respectively, lower) values corresponding to the unzipping of GC (AT)
64: rich regions. Researchers have also unzipped RNA
65: molecules \cite{Mat04,Lip,Har03}, or DNA under a constant force
66: (instead of velocity) \cite{Dan03}. Figure~\ref{fig2}A sketches a
67: fixed-force output signal, with its pauses in the opening at
68: sequence-specific positions.
69:
70: Various theoretical works have studied and reproduced
71: the unzipping signal related to a given sequence
72: \cite{Boc02,Coc3,Coc4,Lub,Hwa,Felix,mar}. Hereafter we address
73: the inverse problem: given an unzipping signal (for example the one of
74: Figure \ref{fig2}A), can we predict the underlying sequence? We
75: propose a Bayesian inference method to solve this problem
76: \cite{bayes}, and test
77: it {\em in silico} on the $\lambda$-phage. We analytically study the
78: dependence of the quality of the prediction on the sequence content,
79: on the force, and on the number of unzippings.
80: Finally we list the main obstacles to be circumvented prior
81: to practical applications.
82:
83: \begin{figure}
84: \begin{center}
85: \psfig{figure=./fig1.eps,height=3cm,angle=0}
86: \end{center}
87: \caption{An unzipping experiment. The extremities of the molecule are
stretched apart under a force $f$. The fork at location $n$ (number of
89: open base pairs) moves
90: backward or forward with rates (probability per unit of time)
91: $r_c$ and $r_o$ (\ref{ratemd}).}
92: \label{fig1}
93: \end{figure}
94:
95: \begin{figure}
96: \begin{center}
97: \psfig{figure=./fig2.eps,height=8cm,angle=-90}
98: \end{center}
99: \caption{Fixed-force unzipping of $\lambda$-phage. {\bf A.} number $n$ of
100: open base pairs vs. time $t$ for forces $f$ ranging from $15.5$ to
101: $17$ pN from model (\ref{ratemd}). {\bf B.}
102: magnification of the boxed region in {\bf A}
103: after a $90$ degree clockwise rotation. {\bf C.} free
104: energy landscape $g(n)$ versus $n$ for the first $450$ bases and
$f=16$ pN. Down and up arrows indicate, respectively, a local minimum
at $n=50$ and two maxima at $n=232$ and $n=327$ (see the text).}
107: \label{fig2}
108: \end{figure}
109:
110: %inference
111:
112: Let ${\cal S}=\{b_1,b_2,\ldots,b_N\}$ denote the sequence of $N$ bases
113: along the $5'\to3'$ strand (the other strand is
114: complementary). We model the unzipping of the molecule through the
115: evolution of the number $n$ of open base pairs \cite{Coc4};
116: base pair opening ($n\to n+1$) and closing ($n\to n-1$) happen with rates
117: (Figure~\ref{fig1})
118: \begin{equation}
119: r_o (n) = r\; \exp\{g_0(n)\} \; , \;\;
120: r_c = r\; \exp\{g_{ss}\} \; .
121: \label{ratemd}
122: \end{equation}
123: $g_0(n)$ is the binding energy of base pair (bp) $n$ in units
124: of k$_B$T \cite{Zuk};
125: it depends on the base $b_n=A,T,G$, or $C$ and,
126: due to stacking effects, on the nearest base $b_{n+1}$.
127: $g_{ss}$ is the work needed to stretch an open bp under a force $f$
128: in units of k$_B$T ;
129: according to the modified freely--jointed--chain model
130: \cite{Coc3}, ${g}_{ss}= -2
131: \ell/\ell_0\,\ln [\sinh(x)/x]$ where $x\equiv \ell_0 \, f / k_B T$,
132: and $\ell_0=15$ {\AA} and $\ell= 5.6$ {\AA} are, respectively, the Kuhn
133: and effective nucleotide lengths.
134: Relation (\ref{ratemd}) implies that the opening rate
135: at base $n$ is a function of the sequence, $r_o(n)=r_o(b_n,b_{n+1})$,
136: while the closing rate $r_c$ only depends on the force \cite{lungo}.
137: This {\em a priori} choice has been shown \cite{Coc4} to reproduce
138: quantitatively the behavior of unzipping experiments on short
139: polynucleotides \cite{Lip}, with a typical frequency $r \simeq
140: 10^{6-7}$ sec$^{-1}$.
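
As a rough numerical illustration of (\ref{ratemd}), the sketch below evaluates $g_{ss}$ and the two rates in units of $r$. The value $k_BT \simeq 4.11$ pN\,nm (room temperature) and the function names \texttt{g\_ss} and \texttt{rates} are assumptions made for this example only.

```python
import math

KT = 4.11      # k_B T in pN*nm near 25 C (assumed value)
L0 = 1.5       # Kuhn length in nm (15 Angstrom)
L_NT = 0.56    # effective nucleotide length in nm (5.6 Angstrom)

def g_ss(force_pn):
    """Stretching free energy of one open bp (two nucleotides), in k_B T."""
    x = L0 * force_pn / KT
    return -2.0 * (L_NT / L0) * math.log(math.sinh(x) / x)

def rates(g0, force_pn, r=1.0):
    """Opening and closing rates for a bp of binding energy g0 (in k_B T)."""
    return r * math.exp(g0), r * math.exp(g_ss(force_pn))

# g_ss becomes more negative as the force grows, so closing gets slower.
for f in (15.5, 16.0, 17.0):
    print(f, round(g_ss(f), 3))
```

With these assumed constants, $g_{ss}$ at $f=16$ pN is close to $-2.5$, comparable to the binding energies of the individual base pairs, which is why the fork neither opens nor closes freely around this force.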
141:
142: Rates (\ref{ratemd}) define a one-dimensional biased random walk for
143: the fork position (number of open bp) $n(t)$ in the potential
144: $\displaystyle{g(n)=n \, g_{ss} -\sum_{i=1} ^n g_0(i)}$, that can be
145: interpreted as the free energy of the molecule when the first $n$ bp
146: are open. We show in Figure~\ref{fig2}B\&C a typical time-trace of $n(t)$
147: generated by Monte Carlo (MC) simulation for the $\lambda$-phage sequence,
148: together with the free energy landscape $g(n)$. Plateaus of $n(t)$
149: coincide with deep local minima of $g(n)$, where the fork remains
150: trapped for a long time. As the force increases, opening becomes more
151: favorable, and plateaus shrink.
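
A trace of this kind can be generated with a continuous-time (Gillespie-type) simulation of the random walk. The sketch below is an illustration under simplifying assumptions, not the code used for the figures: the function name \texttt{unzip\_trace}, the reflecting boundary at the first bond, the stopping rule at the last bond, and the completion of the stacking table by strand complementarity are all choices made here.

```python
import math
import random

# Stacking energies g0 (k_B T) as quoted in the reference list; the six
# missing pairs are filled in by strand complementarity (an assumption).
G0 = {"AA": -1.78, "AT": -1.55, "AC": -2.52, "AG": -2.22, "TA": -1.06,
      "TC": -2.28, "TG": -2.54, "CC": -3.14, "CG": -3.85, "GC": -3.90}
COMP = {"A": "T", "T": "A", "G": "C", "C": "G"}
for pair, g in list(G0.items()):
    G0[COMP[pair[1]] + COMP[pair[0]]] = g

def unzip_trace(seq, gss, rng, r=1.0):
    """One unzipping of seq; returns per-bond lists (t, u, d).

    Bond n (0-based) joins seq[n] and seq[n+1]. The fork starts at n = 0,
    cannot close below 0, and the run stops once the last bond has opened.
    """
    n_bonds = len(seq) - 1
    t = [0.0] * n_bonds
    u = [0] * n_bonds
    d = [0] * n_bonds
    rc = r * math.exp(gss)
    n = 0
    while n < n_bonds:
        ro = r * math.exp(G0[seq[n:n + 2]])
        close = rc if n > 0 else 0.0
        total = ro + close
        t[n] += rng.expovariate(total)   # exponential waiting time
        if rng.random() < ro / total:    # open: n -> n + 1
            u[n] += 1
            n += 1
        else:                            # close: n -> n - 1
            d[n] += 1
            n -= 1
    return t, u, d

t, u, d = unzip_trace("GATTACAGATTACA", -2.5, random.Random(0))
```

Times are expressed in units of $1/r$ when \texttt{r=1}. The three returned lists play the role of the waiting times and transition counts recorded from a time-trace.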
152:
153: Our {\em in silico} time-traces are stochastic due to the thermal noise:
154: two runs will give different traces. The probability of a time-trace
155: only depends on the set ${\cal N}=\{t_n,u_n,d_n\}$
156: of times $t_n$ spent on each base $n$, and of numbers $u_n$ and
157: $d_n$ of up ($n\to n+1$) and down ($n\to n-1$) transitions respectively.
158: Given the sequence ${\cal S}$,
159: this probability reads
160: \begin{equation}
161: {\cal P}({\cal N} | {\cal S} )= c \prod_n \,M (b_n,b_{n+1}; t_n,u_n,d_n)\;,
162: \label{p}
163: \end{equation}
164: where $c$ is a (sequence-independent) normalization constant and
165: $M (b_n,b_{n+1} ;t_n,u_n,d_n) =r_o\left(b_n,b_{n+1}\right)^{u_n} \,
166: r_c^{d_n}\; \exp\{-(r_o(b_n,b_{n+1})+r_c)t_n \}$.
167: Equation (\ref{p}) provides the solution of the direct problem:
168: given the sequence ${\cal S}$ what is the distribution of the
169: time-traces ${\cal N}$? The inverse problem, that is the prediction
170: of the sequence given some time-trace, can be
171: addressed within the Bayesian inference framework.
The probability that the DNA sequence is ${\cal S}$ given an observed ${\cal N}$
173: is \cite{bayes}
174: \begin{equation}
175: \label{bayes}
176: {\cal P}({\cal S}|{\cal N})= \frac{{\cal P }( {\cal N }|{\cal S})
177: \;{\cal P}_0({\cal S}) }{ {{\cal P}({\cal N})}}\;.
178: \end{equation}
179: The value of ${\cal S}$ that maximizes this probability, ${\cal S}^*$,
180: is our prediction for the sequence. In the absence of any {\em a
181: priori} information about the sequence, ${\cal P}_0({\cal S})$ is the flat
182: distribution, equal to $4^{-N}$. The maximization of ${\cal P}({\cal
183: S}|{\cal N})$ then reduces to that of
184: ${\cal P}( {\cal N}|{\cal S})$ (\ref{p}).
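
As a short worked step (an illustration added for clarity, with $g_0$ treated as a continuous variable in the last remark), the logarithm of each factor in (\ref{p}) reads
\[
\ln M (b_n,b_{n+1};t_n,u_n,d_n) = u_n \ln r_o(b_n,b_{n+1}) - r_o(b_n,b_{n+1})\, t_n
+ \left[ d_n \ln r_c - r_c\, t_n \right] ,
\]
where the bracketed term does not depend on the sequence and can be dropped in the maximization. Seen as a function of a freely varying opening rate, the remaining part is maximal for $r_o = u_n/t_n$: roughly speaking, the data favor the pair of bases whose opening rate best matches the observed number of openings per unit of time spent on the base.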
185:
186: In practice the most likely sequence ${\cal S}^*$ may be found using
187: the Viterbi algorithm \cite{viterbi}. The procedure is equivalent to
188: a zero temperature transfer matrix technique exploiting the
189: nearest-neighbor nature of couplings between bases in (\ref{p}). The
190: probability $P_n$ for the base $b_{n}$ fulfills the recursive equation
191: \begin{equation}
192: \label{recur}
193: P_{n+1}(b_{n+1}) \propto \max_{b_{n}} \; P_n(b_n) \, M (b_{n},b_{n+1} ;
194: t_n,u_n,d_n) \;,
195: \end{equation}
196: where the proportionality constant is irrelevant for our purpose. The
197: maximum in (\ref{recur}) is reached for some base $b_n^{max}
198: (b_{n+1})$ that depends on the next base $b_{n+1}$. Starting
199: from $P_1 (b_1)= \frac 14$, we obtain the probability
200: $P_N(b_N)$ for the last base of the sequence through iterations of
201: (\ref{recur}). Maximization of $P_N(b_N)$ yields the most likely value
202: for this last base, $b^*_N$. The whole optimal sequence ${\cal S}^*$
203: is then recursively obtained from the relation $b^*_{n-1} =
204: b_{n-1}^{max} (b_n^*)$.
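
The recursion (\ref{recur}) and the backtracking step can be sketched in a few lines. This is a minimal illustration working with log-probabilities; the names \texttt{viterbi} and \texttt{log\_m}, the completion of the stacking table by strand complementarity, and the choice of time units ($r=1$) are assumptions of the example.

```python
import math

# Stacking energies g0 (k_B T) from the reference list, completed by
# strand complementarity (an assumption of this sketch).
G0 = {"AA": -1.78, "AT": -1.55, "AC": -2.52, "AG": -2.22, "TA": -1.06,
      "TC": -2.28, "TG": -2.54, "CC": -3.14, "CG": -3.85, "GC": -3.90}
COMP = {"A": "T", "T": "A", "G": "C", "C": "G"}
for pair, g in list(G0.items()):
    G0[COMP[pair[1]] + COMP[pair[0]]] = g
BASES = "ATGC"

def log_m(pair, t, u, r=1.0):
    """Sequence-dependent part of ln M: u ln(r_o) - r_o t."""
    ro = r * math.exp(G0[pair])
    return u * math.log(ro) - ro * t

def viterbi(t, u, r=1.0):
    """Most likely sequence of len(t) + 1 bases given per-bond data.

    t[n], u[n]: total time and number of openings for the bond between
    bases n and n+1 (0-based). The closings d[n] only enter through
    sequence-independent factors, so they are not needed here.
    """
    score = {b: math.log(0.25) for b in BASES}   # flat prior on first base
    back = []                                    # back[n][next] = best base
    for tn, un in zip(t, u):
        new_score, pointers = {}, {}
        for nxt in BASES:
            best = max(BASES,
                       key=lambda b: score[b] + log_m(b + nxt, tn, un, r))
            pointers[nxt] = best
            new_score[nxt] = score[best] + log_m(best + nxt, tn, un, r)
        score = new_score
        back.append(pointers)
    seq = [max(BASES, key=lambda b: score[b])]   # most likely last base
    for pointers in reversed(back):              # trace back base by base
        seq.append(pointers[seq[-1]])
    return "".join(reversed(seq))
```

The cost grows linearly with the number of bases, with $4\times 4$ comparisons per step, instead of the $4^N$ evaluations required by a brute-force maximization.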
205:
206:
207: % risultati
208:
209: We have tested our sequencing method on the $\lambda$--phage. First
210: we build a dynamical process on the sequence ${\cal S}^\lambda$ of the phage
211: with rates (\ref{ratemd}), and
212: generate an unzipping trace ${\cal N}$ by a MC procedure. Then we use
213: the Viterbi procedure (which ignores the phage sequence) to make a
214: prediction for the sequence, ${\cal S}^*$, from this signal ${\cal
N}$. We estimate the error on the prediction of base $n$ from
216: the failure rate
217: \begin{equation} \label{defom}
218: \epsilon _n = \hbox{\rm Probability}
219: \left[ b_n^* \ne b_n^\lambda \right]\;,
220: \end{equation}
221: where the probability is computed by repeating the procedure over
222: different MC runs.
223: The errors $\epsilon _n$ are shown in Figure~\ref{fig3} (with
224: the continuous curve) for the first $450$ bases
225: at a force of $16$ pN. Values range from $0$ (perfect prediction) to
226: $0.75$ (random guess of one among four bases). A comparison with the
227: free energy $g(n)$ (Figure~\ref{fig2}) shows that $\epsilon _n$ is
228: small in the flattest part of the landscape ($350< n< 450$), or in
local minima, e.g.\ base $n=50$,
230: preceded by 4 weak bases and followed by 4 strong bases
231: (...TTTA-A-GGCG...). Conversely, bases that are not well determined
correspond to local maxima of the landscape, e.g.\ bases $n=327$ and $328$,
located between $7$ strong and $7$ weak bases
234: (...GCCGCCG-TC-ATAAAAT...). We plot the average fraction of
235: mispredicted bases, $\displaystyle{\epsilon = \frac 1 N \sum_{n}
236: \epsilon _n}$, in Figure~\ref{fig4}A. As shown in Fig.~\ref{fig2},
237: for a larger force, there are more open bases (about $60$, $600$ and
238: $5000$ at $15.5$, $16$ and $17$ pN in about $100$ seconds), but the
239: time spent on each base is smaller, and therefore $\epsilon$ is larger
240: ($\epsilon =20 \%,23\%,47\% $).
241: Most errors are due to the difficulty of distinguishing A from T, and
242: G from C. The probability that a weak
243: (A or T) base is confused with a strong one (G or
244: C), or vice-versa, is plotted in Figure~\ref{fig4}B.
245:
Performance can be greatly improved by collecting information from
247: multiple unzippings. As the
248: number of passages over the same base $n$ gets larger, the total
249: waiting times $t_n$ and transition parameters $u_n,d_n$ become less
250: affected by fluctuations, and reflect more faithfully the
251: thermodynamic signature of the base. In practice, we look for the
252: most likely sequence ${\cal S}^*$ given $R$ unzipping signals
253: ${\cal N}_1, {\cal N}_2,\ldots , {\cal N}_R$.
Figures \ref{fig3}A and \ref{fig4} show the drop in the
255: probability of error when the number $R$ of
256: unzippings increases. Observe from Figure \ref{fig3}A that the decay of
257: $\epsilon _n$ with $R$ (\ref{defom}) varies from base to base.
258: The decrease of the total error $\epsilon$
259: is much faster for AT vs. GC (Figure~\ref{fig4}B) than for
260: complete (Figure~\ref{fig4}A) recognition.
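
To make the combination of signals explicit (a worked step added for illustration, assuming the $R$ unzippings are independent and writing $t_n^{(k)},u_n^{(k)},d_n^{(k)}$ for the data of the $k$-th run, a notation introduced here), note that the factors $M$ in (\ref{p}) multiply into a factor of the same form,
\[
\prod_{k=1}^R {\cal P}({\cal N}_k | {\cal S}) \propto \prod_n
M \Big( b_n,b_{n+1}; \sum_{k} t_n^{(k)}, \sum_{k} u_n^{(k)}, \sum_{k} d_n^{(k)} \Big) \ ,
\]
with a sequence-independent proportionality constant. Under this assumption the inference from $R$ signals is the same as the inference from a single signal whose waiting times and transition numbers are the sums over the runs.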
261:
262: \begin{figure}
263: \begin{center}
264: \psfig{figure=./fig3.eps,height=7cm,angle=-90}
265: \caption{{\bf A}. Probability $\epsilon _n$ of an error (top)
266: and entropy $\sigma _n$ (middle) versus the base index $n$,
267: for the first
268: $450$ bp of DNA $\lambda$-phage at $f=16$~pN. Full lines correspond
269: to $R=1$ unzipping, dotted lines to $R=40$.
270: {\bf B.} Theoretical values for the decay constants $R_n^c$ in
271: $\epsilon _n$ (\ref{rcf}). For instance, base
$232$ (arrow) is characterized by $R_{232}^c \simeq 10$: it is
poorly predicted with $R=1$ unzipping, but well predicted with $R=40$
unzippings. }
275: \label{fig3}
276: \end{center}
277: \end{figure}
278:
279: \begin{figure}
280: \begin{center}
281: \psfig{figure=./fig4a.eps,height=4cm,angle=-90}
282: \psfig{figure=./fig4b.eps,height=4cm,angle=-90}
283: \caption{{\bf A.}
284: Fraction $\epsilon$ of mispredicted bases for the $\lambda$-phage
285: versus the number $R$ of unzippings, averaged over $1000$ samples of
286: $R$ unzippings, and
287: for forces of $15.5$, $16$ and $17$ pN (from bottom to top).
{\bf B}. Same as {\bf A}, but we only discriminate between weak and
strong bases.}
290: \label{fig4}
291: \end{center}
292: \end{figure}
293:
It is useful to build indicators of performance that do not rely on
295: the exact knowledge of the unzipped sequence (used here for checking
296: the quality of our results but unknown in practical
applications). To this aim, we calculate the optimal sequences ${\cal S}^*_b$
298: when base $n$ is constrained to value $b$, and the corresponding
299: probabilities $P_n^*(b)$.
300: We then define the Shannon entropy
301: \begin{equation}
302: \sigma_n=- \sum_{b=A,T,G,C}\, \langle P_n^*(b)\,
303: \log _4 P_n^*(b) \rangle\; ,
304: \end{equation}
305: where $\langle\cdot\rangle$ denotes the average over MC data. $\sigma
306: _n$ is low when one of the four bases has much higher probability than
307: the other ones and close to unity for uncertain predictions
308: (equiprobable bases). Figure \ref{fig3} shows that $\sigma _n$ and
309: $\epsilon_n$ as a function of the base index $n$ are indeed very
310: similar: the Shannon entropy is a good indicator of the success of our
311: reconstruction.
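
The entropy itself is straightforward to evaluate once the four probabilities are known. In the minimal sketch below, the names \texttt{entropy4} and \texttt{sigma\_n} are hypothetical, and the probabilities $P_n^*(b)$ are assumed to have been normalized to sum to one over the four bases.

```python
import math

def entropy4(p):
    """Shannon entropy in base 4 of a normalized 4-component distribution."""
    return -sum(x * math.log(x, 4) for x in p if x > 0.0)

def sigma_n(samples):
    """Average of entropy4 over a list of distributions (one per data set)."""
    return sum(entropy4(p) for p in samples) / len(samples)

# Equiprobable bases give an entropy of one, a certain prediction gives zero.
print(entropy4([0.25, 0.25, 0.25, 0.25]), entropy4([1.0, 0.0, 0.0, 0.0]))
```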
312:
313: % Theory
314:
315: Our analytical study of the dependence of the quality of
316: the prediction upon the force, the sequence content, and the number
317: of unzippings confirms that the probability of error $\epsilon _n$
318: decreases very quickly with $R$,
319: \begin{equation} \label{rcf}
320: \epsilon _n \sim e^{ -R/R^c_n} \ .
321: \end{equation}
322: As $f$ decreases to its critical value (below which the
323: molecule cannot open), the decay constant
324: $R_n^c$ decreases to zero, and predictions
325: drastically improve at fixed $R$.
326: Our theoretical values for $R^c _n$ are shown in Figure~\ref{fig3}B
327: for $f=16$ pN, and vary from 0.1 to 45 with the base index $n$. The
328: agreement with the decay of $\epsilon _n$ from $R=1$ to $40$ unzippings
329: (Figure \ref{fig3}A) is excellent. Note that $\epsilon$ in
330: Figure~\ref{fig4} is not a pure exponential, but a superposition of
331: exponentials with $n$-dependent decay constants $R^c_n$.
332: We now present the calculation of $R^c_n$ in three steps.
333:
334: {\em (a) Pairing only, high force.}
335: Assume first that there are only 2 and not 4 bp-types, called
336: $+$ and $-$, and no stacking interaction. Call $\Delta$ the difference
337: between the (pairing) free-energies of $+$ and $-$ bp, and $\langle
338: t_\pm\rangle$ the average time spent by the
339: fork on a $\pm$ bp before moving forward or backward. Consider
340: now a bp of type $b$ and call $t$ the time spent on this bp
341: divided by the number $R$ of unzippings. From the central limit
342: theorem, for large $R$,
343: $t$ gets narrowly peaked around its mean value $\langle
344: t_b\rangle$, with Gaussian fluctuations $\delta t\sim R^{-\frac 12}$.
345: Bayes prediction (\ref{bayes}) will be erroneous,
346: $b^*=-b$, when $t$ is closer to $\langle
347: t_{-b}\rangle$ than to its expected value $\langle t_b\rangle$.
348: The probability of error is thus given
349: by the Gaussian tail, and scales as $\epsilon \sim \exp( - \delta t^{-2})$,
350: hence (\ref{rcf}). A careful calculation \cite{lungo} gives the
351: precise value of the decay constant in (\ref{rcf}),
352: \begin{equation}\label{nostackomega}
353: R ^c= \frac 1{\tau -1 -\ln \tau}\quad \hbox{\rm with}
354: \quad \tau = \frac {\Delta}{1- e^{-\Delta}} \ .
355: \end{equation}
356: Good predictions are obtained when the molecule is unzipped a
357: few $R^c$ times (for example $R \simeq 4 R^c$
358: gives $\epsilon \simeq 2\%$).
359: %$R^c$ is well approximated by $\frac 8{\Delta^2}$ for $\Delta < 3$ k$_B$T.
360: %This formula allows us to qualitatively understand Figure~\ref{fig4}.
361: To distinguish weak (AT) from strong (CG) bp only we have
362: $\Delta \simeq 2.8$ \cite{Zuk} and
363: $R^c\simeq 1$ (Figure~\ref{fig4}B), while complete recognition corresponds
364: to $\Delta \simeq 0.5$ and $R^c \simeq 30$ (Figure~\ref{fig4}A).
365: %The quantitative understanding of Figure~\ref{fig4} requires the
366: %calculation of $R^c$ as a function of the force (see below).
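
The numbers quoted above can be checked directly from (\ref{nostackomega}). The short sketch below (the function name \texttt{decay\_constant} is an assumption of the example) also prints the small-$\Delta$ estimate $8/\Delta^2$, which follows from expanding $\tau -1 -\ln \tau$ to second order in $\Delta$.

```python
import math

def decay_constant(delta):
    """R^c from the pairing-only formula, for a free-energy gap delta (k_B T)."""
    tau = delta / (1.0 - math.exp(-delta))
    return 1.0 / (tau - 1.0 - math.log(tau))

# Weak-vs-strong discrimination (delta ~ 2.8) and complete recognition
# (delta ~ 0.5), compared with the small-delta estimate 8 / delta**2.
for delta in (2.8, 0.5):
    print(delta, round(decay_constant(delta), 1), round(8.0 / delta**2, 1))
```

The formula is symmetric under $\Delta \to -\Delta$, so the sign convention chosen for the free-energy difference does not matter.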
367:
{\em (b) Pairing and Stacking, high force.} In the presence
369: of stacking interactions, the error $\epsilon _b$ on base $b$
370: depends on the neighboring bases, say, $x$ and $y$.
371: At large $R$, errors are rare and
372: are typically due to a single base mis-prediction e.g. $b\to b'$. The
373: probability $\epsilon_{b\to b'}$ of this mistake is the product of the
374: probabilities $\epsilon _{xb\to xb'}$ and $\epsilon _{by\to b'y}$ of
375: the two bond violations. We estimate $ \epsilon
376: _{xb\to xb'} \sim e^{-R/R^c_{xb\to xb'}}$
377: from (\ref{rcf}) where $R^c_{xb\to xb'}$ is
378: given by (\ref{nostackomega}) with $\Delta = g_0^{xb'}-g_0^{xb}$.
379: A similar expression is readily obtained for the $by$ bond. Knowing
380: the asymptotic behavior of $\epsilon _{b\to b'}$, we calculate
381: $\epsilon _b \sim e^{-R/R^c_{xby}}$ by selecting the worst value for $b'$,
382: \begin{equation}
383: \label{id}
384: \frac 1{R^c _{xby}} = \min _{b' (\ne b)} \left[ \frac 1{R^c _{xb \to xb'}}
385: + \frac 1{R^c _{by \to b'y}} \right] \ .
386: \end{equation}
387: The above derivation is confirmed by exact calculations based on
388: techniques for 1D disordered systems \cite{diso,lungo}.
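
As an illustration of (\ref{id}), the sketch below combines the two bond contributions for a base $b$ between neighbors $x$ and $y$ and returns the most confusable alternative $b'$. The names \texttt{inv\_rc} and \texttt{rc\_triplet} are hypothetical, and the stacking table is completed by strand complementarity, which is an assumption of the example.

```python
import math

# Stacking energies g0 (k_B T) from the reference list, completed by
# strand complementarity (an assumption of this sketch).
G0 = {"AA": -1.78, "AT": -1.55, "AC": -2.52, "AG": -2.22, "TA": -1.06,
      "TC": -2.28, "TG": -2.54, "CC": -3.14, "CG": -3.85, "GC": -3.90}
COMP = {"A": "T", "T": "A", "G": "C", "C": "G"}
for pair, g in list(G0.items()):
    G0[COMP[pair[1]] + COMP[pair[0]]] = g

def inv_rc(delta):
    """1/R^c for a free-energy gap delta (zero if the gap vanishes)."""
    if abs(delta) < 1e-12:
        return 0.0
    tau = delta / (1.0 - math.exp(-delta))
    return tau - 1.0 - math.log(tau)

def rc_triplet(x, b, y):
    """Decay constant R^c_{xby} and the most confusable base b'."""
    rate, worst = min(
        (inv_rc(G0[x + bp] - G0[x + b]) + inv_rc(G0[bp + y] - G0[b + y]), bp)
        for bp in "ATGC" if bp != b)
    return 1.0 / rate, worst

print(rc_triplet("T", "A", "A"))
```

The minimum over $b'$ selects the slowest-decaying error, so the returned decay constant is the largest one among the three possible substitutions.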
389:
390: {\em (c) Moderate force.}
391: The above calculations are correct for high forces. At moderate forces,
392: bp can close and are visited several times by the fork. The effective number
393: of unzippings is $R\times \langle u_{n}\rangle$, where $\langle u_n\rangle$
394: is the average number of openings of bp $n$ during a single unzipping.
395: The decay constant is thus, from (\ref{rcf}),
396: \begin{equation}
397: R^c_n = {R ^ c _{b_{n-1}b_nb_{n+1}} }/{\langle u_n\rangle }\ .
398: \end{equation}
399: As the force is lowered, $\langle u_n\rangle$ increases (from 1
400: at high force), and $R^c_n$ diminishes. To
401: calculate $\langle u_n\rangle$, we consider the 1D transient random walk
402: defined by the probabilities $q_m\equiv
403: r_c/(r_o(m)+r_c)$ and $1-q_m$ for closing or opening bp $m$.
404: Let $p_{m}^{(n)}$ be the probability that the fork will never
405: reach position $n$ starting from $m (>n)$. The ratios $\rho _m^{(n)}
406: = p_{m}^{(n)}/p_{m+1}^{(n)}$ fulfill the Riccati recursion relation
407: \cite{lungo} $\rho _{m+1} ^{(n)} = (1- q_{m+1}) / (1-q_{m+1} \, \rho
408: _m ^{(n)} )$. Iterating with boundary condition $\rho_n^{(n)}=0$
allows us to obtain $\langle u _ n\rangle = 1/p_{n+1}^{(n)} = \prod
_{m>n} 1/\rho _m^{(n)}$: each opening of bp $n$ is followed by a return
of the fork to $n$ with probability $1-p_{n+1}^{(n)}$.
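
The recursion is easily iterated numerically. In the sketch below (the function name \texttt{mean\_openings} and the truncation at the end of the list are assumptions), $\langle u_n\rangle$ is taken to be the inverse of the probability that the fork, once at $n+1$, never comes back to $n$, that is, the inverse of the product of the ratios $\rho_m^{(n)}$ over $m>n$.

```python
def mean_openings(q, n):
    """Average number of openings of bp n in a single unzipping.

    q[m] is the closing probability at position m. The Riccati recursion
    is iterated from rho = 0 at m = n up to the end of the list, which
    assumes that the fork never comes back once past the last position.
    """
    rho = 0.0
    p_no_return = 1.0
    for m in range(n + 1, len(q)):
        rho = (1.0 - q[m]) / (1.0 - q[m] * rho)
        p_no_return *= rho       # accumulates p_{n+1}^{(n)}
    return 1.0 / p_no_return

# Homogeneous check: constant closing probability q = 0.25.
print(mean_openings([0.25] * 400, 0))
```

For a homogeneous closing probability $q<1/2$ the result tends to $(1-q)/(1-2q)$, which reduces to one opening per base at high force ($q\to 0$) and grows as the critical force is approached ($q\to 1/2$).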
411:
412: % perspective
413:
414: Finally we discuss the difficulties hindering a direct application of
415: our inference method to real data (see also \cite{hwa}),
and possible ways out.
417:
418: First, temporal resolution is limited in practice. The frequency bandwidth
419: is controlled by the viscous friction and the stiffness of the
420: setup, with a typical value of $10$ kHz \cite{Boc02,bustt}. The
421: corresponding time, $\delta\tau \simeq 100$ $\mu$sec, is about $10$
422: (resp. $200$) times longer than the typical opening time for GC
423: (resp. AT) bp. As a result, the fork can move by $D (> 1)$ bp during
424: the time interval $\delta\tau$. We have taken into account such moves by
425: considering interactions between bases at distance $\le D$ in the
probability ${\cal P}({\cal N}|{\cal S})$, and modified the
427: reconstruction procedure accordingly (the transfer matrix has now
428: dimension $4^D$) \cite{lungo}. In practice, when
429: $\delta\tau = 1\ \mu$sec, sequences cannot be predicted with the usual
430: $D=1$ reconstruction procedure, but are correctly inferred with the
$D=6$ procedure. Though the current time resolution is still far from this
value, future experimental progress and new technologies, e.g.\ the
combination of optical traps and single-molecule fluorescence
\cite{Lan03}, could help bridge the gap.
435:
436: Secondly, thermal fluctuations
437: of the open strands lead to an uncertainty $\delta n$
438: over the position $n$ of the fork \cite{siggia} e.g. $\delta n \simeq 5$
439: for $f\simeq 15$ pN and $n=300$ open bp \cite{Coc3}. The presence
440: of correlations between bases at distance $D\le \delta n$ does not
441: affect the result (\ref{rcf}) for $\epsilon _n$ as long as the relaxation
442: time of the strands is smaller than the bp opening time {\em
i.e.} up to a few hundred open bp. What happens for larger values of $n$
444: is currently under study.
445:
Thirdly, we have so far assumed perfect knowledge of the
447: dynamics of unzipping. In practice, any functional form for ${\cal P}({\cal
448: N}|{\cal S})$ will be only approximate for a given experimental
setup. A possible way out, based on a learning principle, is
the following: in a first stage, unzipping data corresponding to a known
sequence ($\lambda$-phage) are collected to calibrate ${\cal P}$; in a second
stage, predictions are made for new sequences.
453:
454: Last of all, our study of fixed-force unzipping shows that bases located in
455: local minima of the free-energy landscape are well predicted, while
456: maxima are much harder to predict. Accuracy could be greatly improved
through an adequate force vs. time scheme capable of bringing
the fork to the right place and making it spend time there.
Investigation of the fixed-velocity case, where the force signal is
remarkably affected by single-base mutations \cite{Boc02},
will be very interesting.
462:
463: In conclusion, we hope the present study will motivate further work
to assess and improve the performance of unzipping-based sequencing.
465:
466: %grazie
467:
468: This work has been partially sponsored by the EC FP6
469: program under contract IST-001935, EVERGROW, and the
470: French ACI-DRAB \& PPF Biophysique-ENS actions.
471: \begin{thebibliography}{999999}
472:
473: \bibitem{mb}
474: P.C. Turner, A.G. McLennan, A.D. Bates, M.R.H. White,
475: Molecular Biology, Springer-Verlag (2000).
476:
477: \bibitem{Ess97}
478: B. Essevaz-Roulet, U. Bockelmann, F. Heslot,
479: {\em Proc. Natl. Acad. Sci. (USA) } {\bf 94}, 11935 (1997).
480:
481: \bibitem{Boc02}
482: U. Bockelmann {\em et al.}
483: {\em Biophys. J.} {\bf 82}, 1537 (2002).
484:
485: \bibitem{Lip}
J. Liphardt {\em et al.} {\em Science} {\bf 292}, 733 (2001).
487:
488: \bibitem{Har03}
489: S. Harlepp {\em et al.}
490: {\em Eur. Phys. J. E} {\bf 12}, 605 (2003).
491:
492: \bibitem{Dan03}
C. Danilowicz {\em et al.}
494: {\em Proc. Natl. Acad. Sci. (USA) } {\bf 100}, 1694 (2003).
495:
496: \bibitem{Mat04}
497: J. Math\'e {\em et al.}
498: {\em Biophys. J.} {\bf 87}, 3205 (2004).
499:
500: %exonucleasi
501: \bibitem{Van03}
502: M. van Oijen {\em et al.} {\em Science} {\bf 301}, 123 (2003).
503:
504: \bibitem{Per03}
505: T. Perkins {\em et al.} {\em Science} {\bf 301}, 1914 (2003).
506:
507: %zero mode vaweguide
508: \bibitem{Lev03}
509: M.J. Levene {\em et al.} {\em Science} {\bf 299}, 682 (2003).
510:
511: % single molecule hybridization
512: \bibitem{Zoc03}
513: M. Singh-Zocchi {\em et al.} {\em Proc. Natl. Acad. Sci. (USA)}
514: {\bf 100}, 7605 (2003).
515:
516: \bibitem{Coc3}
517: S. Cocco, R. Monasson, J. Marko.
518: {\em C.R. Physique} {\bf 3}, 569 (2002).
519:
520: \bibitem{Coc4}
521: S. Cocco, R. Monasson, J. Marko.
522: {\em Eur. Phys. J. E} {\bf 10}, 153 (2003).
523:
524: \bibitem{Lub}
525: D.K. Lubensky, D.R. Nelson. {\em Phys. Rev. Lett.} {\bf 85},
526: 1572 (2000); {\em Phys. Rev. E} {\bf 65}, 031917 (2002).
527:
528: \bibitem{Hwa}
529: U. Gerland, R. Bundschuh, T. Hwa.
530: {\em Biophys. J.} {\bf 81}, 1324 (2001).
531:
532: \bibitem{Felix}
533: M. Manosas, F. Ritort, {\em cond-mat/0405035} (2004).
534:
535: \bibitem{mar}
536: D. Marenduzzo {\em et al.} {\em Phys. Rev. Lett.} {\bf 88}, 028102
537: (2002).
538:
539: \bibitem{bayes}
M.H. DeGroot, Probability and Statistics, Addison-Wesley
541: Publishing Co. (1986).
542:
543: \bibitem{Zuk}
544: M. Zuker.
{\em Curr. Opin. Struct. Biol.} {\bf 10}, 303 (2000). From J. SantaLucia Jr.,
{\em Proc. Natl. Acad. Sci. (USA)} {\bf 95}, 1460 (1998),
547: $g_0^{AA}=-1.78$, $g_0^{AT}=-1.55$, $g_0^{AC}=-2.52$, $g_0^{AG}=-2.22$,
548: $g_0^{TA}=-1.06$,
549: $g_0^{TC}=-2.28$, $g_0^{TG}=-2.54$, $g_0^{CC}=-3.14$, $g_0^{CG}=-3.85$,
550: $g_0^{GC}=-3.90$ k$_B$T at $T=25$~C, 150 mM Na.
551:
552:
553: \bibitem{lungo}
554: V. Baldazzi {\em et al.}, in preparation (2005).
555:
556: \bibitem{viterbi}
557: A.J. Viterbi, {\em IEEE Trans. Inf. Th.} {\bf 13},
558: 260 (1967).
559:
560: \bibitem{diso}
561: F.J. Dyson {\em Phys. Rev.} {\bf 92}, 1331 (1953).
562:
563: \bibitem{siggia}
564: R.E. Thompson, E.D. Siggia.
565: {\em Europhys. Lett.} {\bf 31}, 335 (1995).
566:
567: \bibitem{bustt}
568: B. Onoa {\em et al.} {\em Science} {\bf 299}, 1892 (2003) (supplementary
569: materials).
570:
571: \bibitem{hwa}
572: U. Gerland, R. Bundschuh, T. Hwa.
573: {\em Phys. Biol.} {\bf 1}, 19 (2004).
574:
575: % opticaltrap+fluorescence
576: \bibitem{Lan03}
577: M.J. Lang, P.M. Fordyce, S.M. Block.
578: {\em J. Biol.} {\bf 2}, 6 (2003).
579:
580: \end{thebibliography}
581:
582:
583: \end{document}
584:
585: