0506:cond-mat0506221/art.tex

1: \documentclass[pre,epsfig,aps,twocolumn]{revtex4}

2:

3: \usepackage{epsfig}

4: \usepackage{fancyheadings}

5: \usepackage{amstext}

6:

7: \begin{document}

8:

9: \sloppy

10: \pagestyle{empty}

11:

12: \title{Inferring DNA sequences from mechanical unzipping: an

13:   ideal-case study}

14: \author{V. Baldazzi $^{1,2,3}$, S. Cocco $^2$, E. Marinari $^4$,

15: R. Monasson $^3$}

16: \affiliation{

17: $^1$ Dipartimento di Fisica, Universit\`a di Roma

18: {\em Tor Vergata}, Roma, Italy\\

19: $^2$ CNRS-Laboratoire de Physique Statistique de l'ENS, 24 rue Lhomond,

20: 75005 Paris, France\\

21: $^3$ CNRS-Laboratoire de Physique Th\'eorique de l'ENS, 24 rue Lhomond,

22: 75005 Paris, France\\

23: $^4$ Dipartimento di Fisica and INFN, Universit\`a di Roma

24: {\em La Sapienza}, P.le Aldo Moro 2,  00185 Roma, Italy

25: }

26:

27: \begin{abstract}

28: We introduce and test  a method to predict the

29: sequence of DNA molecules from {\em in silico}  unzipping

30: experiments.  The method is based on Bayesian inference

31: and on the Viterbi decoding algorithm.  The probability of misprediction

32: decreases exponentially with the number of unzippings, with a decay

33: rate depending on the applied force and the sequence content.

34: \end{abstract}

35:

36: \maketitle

37:

38: %Introduction

39:

DNA molecules carry genetic information, and
knowledge of their sequences is very important from the biological and
medical points of view. State-of-the-art DNA sequencing methods
rely on biochemical and gel electrophoresis techniques \cite{mb}, and
correctly predict about 99.9\% of the bases. They have been used massively
over the past ten years to obtain the human genome (and those
of other organisms).

47:

48: Nevertheless, the quest for alternative (cheaper and/or faster)

49: sequencing methods is an active field of research. In this

regard, recent single-molecule micromanipulation experiments are of particular
interest. Among them are DNA unzipping under a mechanical action
\cite{Ess97,Boc02,Lip,Har03,Dan03}
or due to translocation through nanopores
\cite{Mat04}, the observation of the sequence-dependent
activity of an exonuclease \cite{Van03,Per03}, the optical analysis of
DNA polymerization in a nano-chip device \cite{Lev03}, and the detection of
single DNA hybridization events \cite{Zoc03}.  Hereafter, we focus on

58: mechanical unzipping (see Figure~\ref{fig1}), first realized by

59: Bockelmann, Heslot and coworkers in 1997 \cite{Ess97,Boc02}. In their

experiment, the strands are pulled apart at a constant velocity.
The measured force fluctuates around $15$ pN for
the $\lambda$-phage DNA (a virus with a $48{,}502$ bp genome), with higher

63: (respectively, lower) values corresponding to the unzipping of GC (AT)

64: rich regions.  Researchers have also unzipped RNA

65: molecules \cite{Mat04,Lip,Har03}, or DNA under a constant force

66: (instead of velocity) \cite{Dan03}.  Figure~\ref{fig2}A sketches a

67: fixed-force output signal, with its pauses in the opening at

68: sequence-specific positions.

69:

70: Various theoretical works have studied and reproduced

71: the unzipping signal related to a given sequence

72: \cite{Boc02,Coc3,Coc4,Lub,Hwa,Felix,mar}. Hereafter we address

73: the inverse problem: given an unzipping signal (for example the one of

74: Figure \ref{fig2}A), can we predict the underlying sequence?  We

75: propose a Bayesian inference method to solve this problem

76: \cite{bayes},  and test

77: it {\em in silico} on the $\lambda$-phage.  We analytically study the

78: dependence of the quality of the prediction on the sequence content,

79: on the force, and on the number of unzippings.

80: Finally we list the main obstacles to be circumvented prior

81: to practical applications.

82:

83: \begin{figure}

84: \begin{center}

85: \psfig{figure=./fig1.eps,height=3cm,angle=0}

86: \end{center}

\caption{An unzipping experiment. The extremities of the molecule are
pulled apart under a force $f$. The fork at location $n$ (number of
open base pairs) moves
backward or forward with rates (probabilities per unit time)
$r_c$ and $r_o$, respectively; see Eq.~(\ref{ratemd}).}

92: \label{fig1}

93: \end{figure}

94:

95: \begin{figure}

96: \begin{center}

97: \psfig{figure=./fig2.eps,height=8cm,angle=-90}

98: \end{center}

99: \caption{Fixed-force unzipping of $\lambda$-phage. {\bf A.} number $n$ of

100: open base pairs vs. time $t$ for forces $f$ ranging from $15.5$ to

101: $17$ pN from  model (\ref{ratemd}). {\bf B.}

102: magnification of the boxed region in {\bf A}

103: after a $90$ degree clockwise rotation. {\bf C.} free

104: energy landscape $g(n)$ versus $n$ for the first $450$ bases and

$f=16$ pN.  Down and up arrows indicate, respectively, a local minimum
at $n=50$ and two maxima at $n=232$ and $n=327$ (see the text).}

107: \label{fig2}

108: \end{figure}

109:

110: %inference

111:

112: Let ${\cal S}=\{b_1,b_2,\ldots,b_N\}$ denote the sequence of $N$ bases

113: along the $5'\to3'$ strand (the other strand is

114: complementary).  We model the unzipping of the molecule through the

115: evolution of the number $n$ of open base pairs \cite{Coc4};

116: base pair opening ($n\to n+1$) and closing ($n\to n-1$) happen with rates

117: (Figure~\ref{fig1})

118: \begin{equation}

119: r_o (n) = r\; \exp\{g_0(n)\} \; ,  \;\;

120: r_c     = r\; \exp\{g_{ss}\} \; .

121: \label{ratemd}

122: \end{equation}

Here $g_0(n)$ is the binding free energy of base pair (bp) $n$ in units
of k$_B$T \cite{Zuk};
it depends on the base $b_n=A,T,G$, or $C$ and,
due to stacking effects, on the neighboring base $b_{n+1}$.
$g_{ss}$ is the work needed to stretch an open bp under a force $f$,
in units of k$_B$T;

129: according to the modified freely--jointed--chain model

130: \cite{Coc3}, ${g}_{ss}= -2

131: \ell/\ell_0\,\ln [\sinh(x)/x]$ where $x\equiv \ell_0 \, f / k_B T$,

132: and $\ell_0=15$ {\AA} and $\ell= 5.6$ {\AA} are, respectively, the Kuhn

133: and  effective nucleotide lengths.

134: Relation (\ref{ratemd}) implies that the opening rate

135: at base $n$ is a function of the sequence, $r_o(n)=r_o(b_n,b_{n+1})$,

136: while the closing rate $r_c$ only depends on the force \cite{lungo}.

137: This {\em a priori} choice has been shown \cite{Coc4} to reproduce

138: quantitatively the behavior of unzipping experiments on short

polynucleotides \cite{Lip}, with a typical frequency
$r \simeq 10^{6}$--$10^{7}$ s$^{-1}$.
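As a rough numerical illustration of (\ref{ratemd}), the sketch below evaluates $g_{ss}$ and the ratio $r_c/r$ at a given force. It assumes $k_BT \simeq 4.11$ pN\,nm (room temperature; this value is our assumption and is not quoted in the text), and the function names are hypothetical.

```python
import math

KBT_PN_NM = 4.11       # assumed thermal energy near 25 C, in pN nm
KUHN_NM = 1.5          # l_0 = 15 Angstrom (Kuhn length)
NUCLEOTIDE_NM = 0.56   # l = 5.6 Angstrom (effective nucleotide length)


def g_ss(force_pn):
    """Stretching work per open bp, in units of kBT (modified FJC form)."""
    x = KUHN_NM * force_pn / KBT_PN_NM
    return -2.0 * NUCLEOTIDE_NM / KUHN_NM * math.log(math.sinh(x) / x)


def closing_rate_ratio(force_pn):
    """Closing rate in units of the attempt frequency: r_c / r = exp(g_ss)."""
    return math.exp(g_ss(force_pn))
```

With these assumptions, $g_{ss}\simeq -2.5$ at $f=16$ pN, which falls inside the range of $g_0$ values listed in \cite{Zuk}: opening is locally favored on AT-rich stretches and disfavored on GC-rich ones.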

141:

142: Rates (\ref{ratemd}) define a one-dimensional biased random walk for

143: the fork position (number of open bp) $n(t)$ in the potential

144: $\displaystyle{g(n)=n \, g_{ss} -\sum_{i=1} ^n g_0(i)}$, that can be

145: interpreted as the free energy of the molecule when the first $n$ bp

146: are open.  We show in Figure~\ref{fig2}B\&C a typical time-trace of $n(t)$

147: generated by Monte Carlo (MC) simulation for the $\lambda$-phage sequence,

148: together with the free energy landscape $g(n)$. Plateaus of $n(t)$

149: coincide with deep local minima of $g(n)$, where the fork remains

150: trapped for a long time. As the force increases, opening becomes more

151: favorable, and plateaus shrink.
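A time-trace of this random walk can be generated with a continuous-time (Gillespie-type) scheme. The sketch below is only one possible implementation, not necessarily the MC procedure used for the figures. It assumes a 0-based convention in which the fork at position $n$ opens the bp with free energy \texttt{g0[n]}, a reflecting boundary at $n=0$, and a run that stops when the last bp opens; all names are hypothetical.

```python
import math
import random


def simulate_unzipping(g0, g_ss, r=1.0, seed=0):
    """Simulate one unzipping and return the per-base statistics (t, u, d).

    g0[n] : binding free energy (kBT) of the bp opened from position n
    g_ss  : stretching work per open bp (kBT)
    r     : attempt frequency (sets the time unit)
    """
    rng = random.Random(seed)
    n_bp = len(g0)
    t = [0.0] * n_bp   # total time spent at each position
    u = [0] * n_bp     # number of opening moves n -> n+1
    d = [0] * n_bp     # number of closing moves n -> n-1
    r_c = r * math.exp(g_ss)
    n = 0
    while n < n_bp:
        r_o = r * math.exp(g0[n])
        r_close = r_c if n > 0 else 0.0   # assumed reflecting wall at n = 0
        total = r_o + r_close
        t[n] += rng.expovariate(total)    # exponential waiting time
        if rng.random() < r_o / total:
            u[n] += 1
            n += 1
        else:
            d[n] += 1
            n -= 1
    return t, u, d
```

Since the fork ends up past every position, each bond is crossed exactly once more in the opening direction than in the closing one, which gives a simple sanity check on the output.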

152:

153: Our {\em in silico} time-traces are stochastic due to the thermal noise:

154: two runs will give different traces. The probability of a time-trace

155: only depends on the set ${\cal N}=\{t_n,u_n,d_n\}$

156: of times $t_n$ spent on each base $n$, and of numbers  $u_n$ and

157: $d_n$ of up ($n\to n+1$) and down ($n\to n-1$) transitions respectively.

158: Given the sequence ${\cal S}$,

159: this probability reads

160: \begin{equation}

161: {\cal P}({\cal N} | {\cal S} )= c \prod_n \,M (b_n,b_{n+1}; t_n,u_n,d_n)\;,

162: \label{p}

163: \end{equation}

164: where $c$ is a (sequence-independent) normalization constant and

165: $M (b_n,b_{n+1} ;t_n,u_n,d_n) =r_o\left(b_n,b_{n+1}\right)^{u_n} \,

166: r_c^{d_n}\; \exp\{-(r_o(b_n,b_{n+1})+r_c)t_n \}$.

167: Equation (\ref{p}) provides the solution of the direct problem:

168: given the sequence ${\cal S}$ what is the distribution of the

169: time-traces ${\cal N}$? The inverse problem, that is the prediction

170: of the sequence given some time-trace, can be

171: addressed within the Bayesian inference framework.

The probability that the DNA sequence is ${\cal S}$, given an observed ${\cal N}$,
is \cite{bayes}

174: \begin{equation}

175: \label{bayes}

176: {\cal P}({\cal S}|{\cal N})= \frac{{\cal P }( {\cal N }|{\cal S})

177: \;{\cal P}_0({\cal S}) }{ {{\cal P}({\cal N})}}\;.

178: \end{equation}

179: The value of ${\cal S}$ that maximizes this probability, ${\cal S}^*$,

180: is our prediction for the sequence. In the absence of any {\em a

181: priori} information about the sequence, ${\cal P}_0({\cal S})$ is the flat

182: distribution, equal to $4^{-N}$. The maximization of ${\cal P}({\cal

183: S}|{\cal N})$ then reduces to that of

184: ${\cal P}( {\cal N}|{\cal S})$  (\ref{p}).
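For numerical work it is convenient to take the logarithm of (\ref{p}). The following rewriting is ours and follows directly from (\ref{ratemd}) and (\ref{p}): using $\ln r_o = \ln r + g_0$ and dropping all terms that do not depend on the sequence, one is left with
\[
\ln {\cal P}({\cal N}|{\cal S}) = \hbox{\rm const} + \sum_n \Big[ u_n\, g_0(b_n,b_{n+1}) - r\, e^{g_0(b_n,b_{n+1})}\, t_n \Big] \ ,
\]
so that, within model (\ref{ratemd}), the numbers $d_n$ of closing events do not enter the sequence-dependent part. As a consistency check, each term of the sum, seen as a function of a free parameter $g_0$, is maximal when $r\,e^{g_0} = u_n/t_n$, that is, when the opening rate equals the observed number of openings per unit time spent on the base.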

185:

186: In practice the most likely sequence ${\cal S}^*$ may be found using

187: the Viterbi algorithm \cite{viterbi}. The procedure is equivalent to

188: a zero temperature transfer matrix technique exploiting the

189: nearest-neighbor nature of couplings between bases in (\ref{p}). The

190: probability $P_n$ for the base $b_{n}$ fulfills the recursive equation

191: \begin{equation}

192: \label{recur}

193: P_{n+1}(b_{n+1}) \propto \max_{b_{n}} \; P_n(b_n) \, M (b_{n},b_{n+1} ;

194: t_n,u_n,d_n) \;,

195: \end{equation}

196: where the proportionality constant is irrelevant for our purpose.  The

197: maximum in (\ref{recur}) is reached for some base $b_n^{max}

198: (b_{n+1})$ that depends on the next base $b_{n+1}$.  Starting

199: from $P_1 (b_1)= \frac 14$, we obtain the probability

200: $P_N(b_N)$ for the last base of the sequence through iterations of

201: (\ref{recur}). Maximization of $P_N(b_N)$ yields the most likely value

202: for this last base, $b^*_N$.  The whole optimal sequence ${\cal S}^*$

203: is then recursively obtained from the relation $b^*_{n-1} =

204: b_{n-1}^{max} (b_n^*)$.
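Recursion (\ref{recur}) and the backtracking step can be sketched as follows. This is a minimal illustration written in log-space to avoid numerical underflow, keeping only the sequence-dependent part $u_n g_0 - r\,e^{g_0} t_n$ of $\ln M$. The function name, the 0-based indexing of bonds and the dictionary layout of \texttt{g0} (keys such as \texttt{"AT"}) are our assumptions.

```python
import math

BASES = "ATGC"


def viterbi_sequence(t, u, g0, r=1.0):
    """Most likely sequence given times t[n] and opening counts u[n].

    t[n], u[n] refer to the bond (b_n, b_{n+1}), n = 0 .. N-2.
    g0[a + b] is the free energy (kBT) of base a followed by base b.
    """
    def log_m(a, b, n):
        # sequence-dependent part of ln M(a, b; t_n, u_n, d_n)
        return u[n] * g0[a + b] - r * math.exp(g0[a + b]) * t[n]

    score = {b: 0.0 for b in BASES}   # flat prior on the first base
    back = []                         # back[n][b_next] = best b_n
    for n in range(len(t)):
        new_score, pointer = {}, {}
        for b_next in BASES:
            best = max(BASES, key=lambda a: score[a] + log_m(a, b_next, n))
            pointer[b_next] = best
            new_score[b_next] = score[best] + log_m(best, b_next, n)
        score = new_score
        back.append(pointer)

    base = max(BASES, key=lambda b: score[b])   # most likely last base
    sequence = [base]
    for pointer in reversed(back):              # backtrack to the first base
        base = pointer[base]
        sequence.append(base)
    return "".join(reversed(sequence))
```

The cost is linear in the number of bases, with $4\times 4$ comparisons per step, instead of the $4^N$ evaluations of a brute-force search.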

205:

206:

207: % risultati

208:

209: We have tested our sequencing method on the $\lambda$--phage.  First

210: we build a dynamical process on the sequence ${\cal S}^\lambda$ of the phage

211: with rates (\ref{ratemd}), and

212: generate an unzipping trace ${\cal N}$ by a MC procedure.  Then we use

213: the Viterbi procedure (which ignores the phage sequence) to make a

214: prediction for the sequence, ${\cal S}^*$, from this signal ${\cal

N}$. We estimate the error on the prediction of base $n$ from
the failure rate

217: \begin{equation} \label{defom}

218: \epsilon _n = \hbox{\rm Probability}

219: \left[ b_n^*  \ne b_n^\lambda \right]\;,

220: \end{equation}

221: where the probability is computed by repeating the procedure over

222: different MC runs.

223: The errors $\epsilon _n$ are shown in Figure~\ref{fig3} (with

224: the continuous curve) for the first $450$ bases

225: at a force of $16$ pN.  Values range from $0$ (perfect prediction) to

226: $0.75$ (random guess of one among four bases).  A comparison with the

227: free energy $g(n)$ (Figure~\ref{fig2}) shows that $\epsilon _n$ is

228: small in the flattest part of the landscape ($350< n< 450$), or in

229: local minima e.g. the $n=50$ base

230: preceded by 4 weak bases and followed by 4 strong bases

231: (...TTTA-A-GGCG...).  Conversely, bases that are not well determined

232: correspond to local maxima of the landscape e.g. $n=327$, $328$ bases

233: between $7$ strong and $7$ weak bases

234: (...GCCGCCG-TC-ATAAAAT...).  We plot the average fraction of

235: mispredicted bases, $\displaystyle{\epsilon = \frac 1 N \sum_{n}

236: \epsilon _n}$, in Figure~\ref{fig4}A.  As shown in Fig.~\ref{fig2},

237: for a larger force, there are more open bases (about $60$, $600$ and

238: $5000$ at $15.5$, $16$ and $17$ pN in about $100$ seconds), but the

239: time spent on each base is smaller, and therefore $\epsilon$ is larger

240: ($\epsilon =20 \%,23\%,47\% $).

241: Most errors are due to the difficulty of distinguishing A from T, and

242: G from C. The probability  that a weak

243: (A or T) base is confused with a strong one (G or

244: C), or vice-versa, is plotted in Figure~\ref{fig4}B.

245:

Performance can be greatly improved by collecting information from
multiple unzippings. As the

248: number of passages over the same base $n$ gets larger, the total

249: waiting times $t_n$ and transition parameters $u_n,d_n$ become less

250: affected by fluctuations, and reflect more faithfully the

251: thermodynamic signature of the base. In practice, we look for the

252: most likely sequence ${\cal S}^*$ given $R$ unzipping signals

253: ${\cal N}_1, {\cal N}_2,\ldots , {\cal N}_R$.

Figures \ref{fig3}A and \ref{fig4} show the drop in the
probability of error as the number $R$ of
unzippings increases. Observe from Figure \ref{fig3}A that the decay of
$\epsilon _n$ (\ref{defom}) with $R$ varies from base to base.
The decrease of the total error $\epsilon$
is much faster for weak vs. strong (AT vs. GC) discrimination (Figure~\ref{fig4}B) than for
complete recognition (Figure~\ref{fig4}A).
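The extension to $R$ unzippings requires no new algorithm, provided the $R$ traces are statistically independent (an assumption we make explicit here). The likelihoods (\ref{p}) then multiply, and for each bond
\[
\prod_{\rho=1}^{R} M\big(b_n,b_{n+1}; t_n^{(\rho)},u_n^{(\rho)},d_n^{(\rho)}\big) = M\Big(b_n,b_{n+1}; \sum_{\rho} t_n^{(\rho)}, \sum_{\rho} u_n^{(\rho)}, \sum_{\rho} d_n^{(\rho)}\Big) \ ,
\]
because $M$ is a product of powers of the rates and of an exponential that is linear in $t_n$. Recursion (\ref{recur}) can therefore be run once on the summed waiting times and transition counts.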

261:

262: \begin{figure}

263: \begin{center}

264: \psfig{figure=./fig3.eps,height=7cm,angle=-90}

265: \caption{{\bf A}. Probability $\epsilon _n$ of an error (top)

266: and entropy $\sigma _n$ (middle) versus the base index $n$,

267: for the first

268: $450$ bp of DNA $\lambda$-phage at $f=16$~pN. Full lines correspond

269: to $R=1$ unzipping, dotted lines to $R=40$.

270: {\bf B.} Theoretical values for the  decay constants $R_n^c$  in

271:  $\epsilon _n$  (\ref{rcf}). For instance, base

$232$ (arrow) is characterized by $R_{232}^c \simeq 10$; it is
poorly predicted with $R=1$ unzipping but well predicted with $R=40$
unzippings. }

275: \label{fig3}

276: \end{center}

277: \end{figure}

278:

279: \begin{figure}

280: \begin{center}

281: \psfig{figure=./fig4a.eps,height=4cm,angle=-90}

282: \psfig{figure=./fig4b.eps,height=4cm,angle=-90}

283: \caption{{\bf A.}

284: Fraction $\epsilon$ of mispredicted bases for the $\lambda$-phage

285: versus the number $R$ of unzippings, averaged over $1000$ samples of

286: $R$ unzippings, and

287: for forces of $15.5$, $16$ and $17$ pN (from bottom to top).

{\bf B.} Same as {\bf A}, but discriminating only between weak and
strong bases.}

290: \label{fig4}

291: \end{center}

292: \end{figure}

293:

It is useful to build performance indicators that do not rely on
exact knowledge of the unzipped sequence (used here to check
the quality of our results, but unknown in practical
applications). To this end, we calculate the optimal sequences ${\cal S}^*_b$
when base $n$ is constrained to the value $b$, and the corresponding
probabilities $P_n^*(b)$.

300: We then define the Shannon entropy

301: \begin{equation}

302: \sigma_n=-  \sum_{b=A,T,G,C}\, \langle P_n^*(b)\,

303: \log _4 P_n^*(b) \rangle\; ,

304: \end{equation}

305: where $\langle\cdot\rangle$ denotes the average over MC data.  $\sigma

306: _n$ is low when one of the four bases has much higher probability than

307: the other ones and close to unity for uncertain predictions

308: (equiprobable bases).  Figure \ref{fig3} shows that $\sigma _n$ and

309: $\epsilon_n$ as a function of the base index $n$ are indeed very

310: similar: the Shannon entropy is a good indicator of the success of our

311: reconstruction.
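For a single MC sample, the entropy term can be evaluated as in the short sketch below. We assume that the four weights $P_n^*(b)$ are normalized to sum to one before taking the base-4 logarithm; this normalization and the function name are our assumptions.

```python
import math


def shannon_entropy_base4(weights):
    """Entropy (in units of log 4) of four non-negative weights P_n^*(b)."""
    total = sum(weights)
    probs = [w / total for w in weights]
    # terms with p = 0 contribute nothing (p log p -> 0)
    return -sum(p * math.log(p, 4) for p in probs if p > 0.0)
```

The value ranges from $0$ (one base carries all the weight) to $1$ (four equiprobable bases); averaging over MC samples then gives $\sigma_n$.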

312:

313: % Theory

314:

315: Our analytical study of the dependence of the quality of

316: the prediction upon the force, the sequence content, and the number

317: of unzippings confirms that the probability of error $\epsilon _n$

318: decreases very quickly with $R$,

319: \begin{equation} \label{rcf}

320: \epsilon _n \sim e^{ -R/R^c_n} \ .

321: \end{equation}

322:  As $f$ decreases to its critical value (below which the

323: molecule cannot open), the decay constant

324: $R_n^c$ decreases to zero, and predictions

325: drastically improve at fixed $R$.

326: Our theoretical values for $R^c _n$ are shown in Figure~\ref{fig3}B

327: for $f=16$ pN, and vary from 0.1 to 45 with the base index $n$. The

328: agreement with the decay of $\epsilon _n$ from $R=1$ to $40$ unzippings

329: (Figure \ref{fig3}A) is excellent. Note that $\epsilon$ in

330: Figure~\ref{fig4} is not a pure exponential, but a superposition of

331: exponentials with $n$-dependent decay constants $R^c_n$.

332: We now present the calculation of $R^c_n$ in three steps.

333:

334: {\em (a) Pairing only, high force.}

335: Assume first that there are only 2 and not 4 bp-types, called

336: $+$ and $-$, and no stacking interaction. Call $\Delta$ the difference

337: between the (pairing) free-energies of $+$ and $-$ bp, and  $\langle

338: t_\pm\rangle$ the average time spent by the

339: fork on a $\pm$ bp before moving forward or backward. Consider

340: now a bp of type $b$ and call $t$ the time spent on this bp

341: divided by the number $R$ of unzippings. From the central limit

342: theorem, for large  $R$,

343: $t$ gets narrowly peaked around its mean value $\langle

344: t_b\rangle$, with Gaussian fluctuations $\delta t\sim R^{-\frac 12}$.

The Bayesian prediction (\ref{bayes}) will be erroneous,

346: $b^*=-b$, when  $t$ is closer to $\langle

347: t_{-b}\rangle$ than to its expected value $\langle t_b\rangle$.

348: The probability of error is thus given

349: by the Gaussian tail, and scales as $\epsilon \sim \exp( - \delta t^{-2})$,

350: hence (\ref{rcf}).  A careful calculation \cite{lungo} gives the

351: precise value of the decay constant in (\ref{rcf}),

352: \begin{equation}\label{nostackomega}

353: R ^c= \frac 1{\tau -1 -\ln \tau}\quad \hbox{\rm with}

354: \quad \tau = \frac {\Delta}{1- e^{-\Delta}} \ .

355: \end{equation}

356: Good predictions are obtained when  the molecule is unzipped a

357: few $R^c$ times (for example $R \simeq 4 R^c$

358: gives $\epsilon \simeq 2\%$).

359: %$R^c$ is well approximated by $\frac 8{\Delta^2}$ for $\Delta < 3$ k$_B$T.

360: %This formula  allows us to qualitatively understand Figure~\ref{fig4}.

To distinguish only weak (AT) from strong (GC) bp, we have
$\Delta \simeq 2.8$ \cite{Zuk} and
$R^c\simeq 1$ (Figure~\ref{fig4}B), while complete recognition corresponds
to $\Delta \simeq 0.5$ and  $R^c \simeq 30$ (Figure~\ref{fig4}A).
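As a numerical check of (\ref{nostackomega}), added here for illustration: for $\Delta=2.8$ one finds $\tau\simeq 2.98$ and $\tau-1-\ln\tau\simeq 0.89$, hence $R^c\simeq 1.1$; for $\Delta=0.5$, $\tau\simeq 1.27$ and $\tau-1-\ln\tau\simeq 0.031$, hence $R^c\simeq 32$. For small $\Delta$, the expansion
\[
\tau = 1+\frac{\Delta}{2}+\frac{\Delta^2}{12}+O(\Delta^4)\ , \qquad
\tau-1-\ln\tau \simeq \frac{(\tau-1)^2}{2}\simeq \frac{\Delta^2}{8}\ ,
\]
gives $R^c\simeq 8/\Delta^2$: halving the free-energy difference between two candidate bases multiplies the required number of unzippings by four.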

365: %The quantitative understanding of Figure~\ref{fig4} requires the

366: %calculation of $R^c$ as a function of the force (see below).

367:

{\em (b) Pairing and stacking, high force.} In the presence
of stacking interactions, the error $\epsilon _b$ on base $b$
depends on the neighboring bases, say, $x$ and $y$.

371:  At large $R$, errors are rare and

372: are typically due to a single base mis-prediction e.g. $b\to b'$. The

373: probability $\epsilon_{b\to b'}$ of this mistake is the product of the

374: probabilities $\epsilon _{xb\to xb'}$ and $\epsilon _{by\to b'y}$ of

375: the two bond violations. We estimate $ \epsilon

376: _{xb\to xb'} \sim e^{-R/R^c_{xb\to xb'}}$

377: from (\ref{rcf}) where $R^c_{xb\to xb'}$ is

378: given by (\ref{nostackomega}) with $\Delta = g_0^{xb'}-g_0^{xb}$.

379: A similar expression is readily obtained for the $by$ bond. Knowing

380: the asymptotic behavior of $\epsilon _{b\to b'}$, we calculate

381: $\epsilon _b \sim e^{-R/R^c_{xby}}$ by selecting the worst value for $b'$,

382: \begin{equation}

383: \label{id}

384: \frac 1{R^c _{xby}} = \min _{b' (\ne b)} \left[  \frac 1{R^c _{xb \to xb'}}

385: + \frac 1{R^c _{by \to b'y}} \right] \ .

386: \end{equation}

387: The above derivation is confirmed by exact calculations based on

388: techniques for 1D disordered systems \cite{diso,lungo}.
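Relations (\ref{nostackomega}) and (\ref{id}) can be combined numerically as in the sketch below. It uses the ten stacking free energies listed in \cite{Zuk} and fills in the six missing pairs by reverse-complement symmetry, which is our assumption about how the table should be completed; the function names are hypothetical.

```python
import math

# g_0 values (kBT) as listed in the bibliography entry Zuk; the remaining
# pairs are obtained by reverse-complement symmetry (an assumption).
G0 = {"AA": -1.78, "AT": -1.55, "AC": -2.52, "AG": -2.22, "TA": -1.06,
      "TC": -2.28, "TG": -2.54, "CC": -3.14, "CG": -3.85, "GC": -3.90}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
for pair in list(G0):
    G0.setdefault(COMPLEMENT[pair[1]] + COMPLEMENT[pair[0]], G0[pair])


def inverse_rc(delta):
    """1/R^c = tau - 1 - ln(tau), with tau = delta / (1 - exp(-delta))."""
    if abs(delta) < 1e-12:
        return 0.0   # equal free energies: this bond cannot discriminate
    tau = delta / (1.0 - math.exp(-delta))
    return tau - 1.0 - math.log(tau)


def rc_triplet(x, b, y, g0=G0):
    """Decay constant R^c_{xby}, set by the slowest-decaying error b -> b'."""
    inv = min(
        inverse_rc(g0[x + alt] - g0[x + b]) + inverse_rc(g0[alt + y] - g0[b + y])
        for alt in "ATGC" if alt != b
    )
    return math.inf if inv == 0.0 else 1.0 / inv
```

Note that $\tau-1-\ln\tau$ is unchanged under $\Delta\to-\Delta$, so the sign convention chosen for $\Delta$ does not matter.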

389:

390: {\em (c) Moderate force.}

391: The above calculations are correct for high forces. At moderate forces,

392: bp can close and are visited several times by the fork. The effective number

393: of unzippings is $R\times \langle u_{n}\rangle$, where $\langle u_n\rangle$

394: is the average number of openings of bp $n$ during a single unzipping.

395: The decay constant is thus, from (\ref{rcf}),

396: \begin{equation}

397: R^c_n =  {R ^ c _{b_{n-1}b_nb_{n+1}} }/{\langle u_n\rangle }\ .

398: \end{equation}

399: As the force is lowered, $\langle u_n\rangle$ increases (from 1

400: at high force), and $R^c_n$ diminishes.  To

401: calculate $\langle u_n\rangle$, we consider the 1D transient random walk

402: defined by the probabilities $q_m\equiv

403: r_c/(r_o(m)+r_c)$ and $1-q_m$ for closing or opening bp $m$.

404: Let $p_{m}^{(n)}$ be the probability that the fork will never

405: reach position $n$ starting from $m (>n)$.  The ratios $\rho _m^{(n)}

406: = p_{m}^{(n)}/p_{m+1}^{(n)}$ fulfill the Riccati recursion relation

407: \cite{lungo} $\rho _{m+1} ^{(n)} = (1- q_{m+1}) / (1-q_{m+1} \, \rho

408: _m ^{(n)} )$.  Iterating with boundary condition $\rho_n^{(n)}=0$

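allows us to obtain $\langle u _ n\rangle = 1/p_{n+1}^{(n)} = \prod
_{m>n} 1/\rho _m^{(n)}$, since the product of the ratios telescopes to
$p_{n+1}^{(n)}/p_{\infty}^{(n)}$ with $p_{\infty}^{(n)}=1$ for a transient walk.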
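The Riccati recursion is straightforward to iterate numerically. In the sketch below, the running product of the ratios $\rho_m^{(n)}$ is read as the probability that the fork, once at $n+1$, escapes without ever returning to $n$, and the mean number of openings is reported as the inverse of this probability, which is at least $1$ as required at high force. The chain is truncated at the end of the list \texttt{q}, assumed to lie far enough beyond $n$; the function names are hypothetical.

```python
import math


def closing_probabilities(g0, g_ss):
    """q_m = r_c / (r_o(m) + r_c) = 1 / (1 + exp(g0[m] - g_ss))."""
    return [1.0 / (1.0 + math.exp(g - g_ss)) for g in g0]


def mean_openings(q, n):
    """Mean number of openings of bp n for a forward-biased (transient) walk."""
    rho = 0.0      # boundary condition rho_n = 0
    escape = 1.0   # running product of rho_m for m > n
    for m in range(n + 1, len(q)):
        rho = (1.0 - q[m]) / (1.0 - q[m] * rho)
        escape *= rho
    return 1.0 / escape
```

For a homogeneous sequence with constant closing probability $q<1/2$, the result reduces to $(1-q)/(1-2q)$, which diverges as the walk loses its forward bias ($q\to 1/2$).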

411:

412: % perspective

413:

Finally we discuss the difficulties hindering a direct application of
our inference method to real data (see also \cite{hwa}),
and possible ways around them.

417:

418: First, temporal resolution is limited in practice. The frequency bandwidth

419: is controlled by the viscous friction and the stiffness of the

420: setup, with a typical value of $10$ kHz \cite{Boc02,bustt}. The

421: corresponding time, $\delta\tau \simeq 100$ $\mu$sec, is about $10$

422: (resp. $200$) times longer than the typical opening time for GC

423: (resp. AT) bp.  As a result, the fork can move by $D (> 1)$ bp during

the time interval $\delta\tau$.  We have taken such moves into account by
considering interactions between bases at distance $\le D$ in the
probability ${\cal P}({\cal N}|{\cal S})$, and modified the
reconstruction procedure accordingly (the transfer matrix now has
dimension $4^D$) \cite{lungo}.  In practice, when
$\delta\tau = 1\ \mu$sec, sequences cannot be predicted with the usual
$D=1$ reconstruction procedure, but are correctly inferred with the
$D=6$ procedure.  Though current time resolution falls well short of this
value, future experimental progress and new technologies, e.g.
the combination of optical traps and single-molecule fluorescence
\cite{Lan03}, could help bridge the gap.

435:

436: Secondly, thermal fluctuations

437: of the open strands lead to  an uncertainty $\delta n$

438: over the position $n$ of the fork \cite{siggia} e.g. $\delta n \simeq 5$

439: for $f\simeq 15$ pN and $n=300$ open bp \cite{Coc3}. The presence

440: of correlations between bases at distance $D\le \delta n$ does not

441: affect the result (\ref{rcf}) for $\epsilon _n$  as long as the relaxation

442: time of the strands is smaller than the bp opening time {\em

443: i.e.} up to a few hundreds open bp. What happens for larger values of $n$

444: is currently under study.

445:

Thirdly, we have so far assumed perfect knowledge of the
dynamics of unzipping. In practice, any functional form for ${\cal P}({\cal
N}|{\cal S})$ will be only approximate for a given experimental
setup. A possible way out, based on a learning principle, is
the following: in a first stage, unzipping data for a known
sequence (e.g. the $\lambda$-phage) are collected to calibrate ${\cal P}$; in a second
stage, predictions are made for new sequences.

453:

Finally, our study of fixed-force unzipping shows that bases located in
local minima of the free-energy landscape are well predicted, while
those at maxima are much harder to predict. Accuracy could be greatly improved
through an adequate force vs. time scheme capable of bringing
the fork to the right place and making it spend time there.
Investigation of the fixed-velocity case, where the force signal is
markedly affected by single-base mutations \cite{Boc02},
will be very interesting.

462:

In conclusion, we hope the present study will motivate further work
to assess and improve the performance of unzipping-based sequencing.

465:

466: %grazie

467:

468: This work has been partially sponsored by the EC FP6

469: program under contract IST-001935, EVERGROW, and the

470: French  ACI-DRAB \& PPF Biophysique-ENS actions.

471: \begin{thebibliography}{999999}

472:

473: \bibitem{mb}

474: P.C. Turner, A.G. McLennan, A.D. Bates, M.R.H. White,

475: Molecular Biology, Springer-Verlag (2000).

476:

477: \bibitem{Ess97}

478: B. Essevaz-Roulet, U. Bockelmann,  F. Heslot,

479: {\em Proc. Natl. Acad. Sci. (USA) } {\bf 94}, 11935 (1997).

480:

481: \bibitem{Boc02}

482: U. Bockelmann {\em et al.}

483: {\em Biophys. J.} {\bf 82}, 1537 (2002).

484:

485: \bibitem{Lip}

J. Liphardt {\em et al.} {\em Science} {\bf 292}, 733 (2001).

487:

488: \bibitem{Har03}

489: S. Harlepp {\em et al.}

490: {\em Eur. Phys. J. E} {\bf 12}, 605 (2003).

491:

492: \bibitem{Dan03}

C. Danilowicz {\em et al.}

494: {\em Proc. Natl. Acad. Sci. (USA) } {\bf 100}, 1694 (2003).

495:

496: \bibitem{Mat04}

497: J. Math\'e {\em et al.}

498: {\em Biophys. J.} {\bf 87}, 3205 (2004).

499:

500: %exonucleasi

501: \bibitem{Van03}

M. van Oijen {\em et al.} {\em Science}  {\bf 301}, 1235 (2003).

503:

504: \bibitem{Per03}

505: T. Perkins {\em et al.}  {\em Science} {\bf 301}, 1914 (2003).

506:

%zero-mode waveguide

508: \bibitem{Lev03}

509: M.J. Levene {\em et al.}  {\em Science}  {\bf 299}, 682 (2003).

510:

511: % single molecule hybridization

512: \bibitem{Zoc03}

513: M. Singh-Zocchi {\em et al.} {\em Proc. Natl. Acad. Sci. (USA)}

514: {\bf 100}, 7605 (2003).

515:

516: \bibitem{Coc3}

517: S. Cocco, R. Monasson, J. Marko.

518: {\em C.R. Physique} {\bf 3}, 569  (2002).

519:

520: \bibitem{Coc4}

521: S. Cocco, R. Monasson, J. Marko.

522: {\em Eur. Phys. J. E} {\bf 10}, 153 (2003).

523:

524: \bibitem{Lub}

525: D.K. Lubensky, D.R. Nelson. {\em Phys. Rev. Lett.} {\bf 85},

526: 1572 (2000); {\em Phys. Rev. E} {\bf 65}, 031917 (2002).

527:

528: \bibitem{Hwa}

529: U. Gerland, R. Bundschuh, T. Hwa.

530: {\em Biophys. J.} {\bf 81}, 1324 (2001).

531:

532: \bibitem{Felix}

533: M. Manosas, F. Ritort, {\em cond-mat/0405035} (2004).

534:

535: \bibitem{mar}

536: D. Marenduzzo {\em et al.} {\em Phys. Rev. Lett.} {\bf 88}, 028102

537: (2002).

538:

539: \bibitem{bayes}

M.H. DeGroot, Probability and Statistics, Addison-Wesley

541:  Publishing Co. (1986).

542:

543: \bibitem{Zuk}

544: M. Zuker.

{\em Curr. Opin. Struct. Biol.} {\bf 10}, 303 (2000). From J. SantaLucia Jr.,
{\em Proc. Natl. Acad. Sci. (USA)} {\bf 95}, 1460 (1998),

547: $g_0^{AA}=-1.78$, $g_0^{AT}=-1.55$, $g_0^{AC}=-2.52$, $g_0^{AG}=-2.22$,

548: $g_0^{TA}=-1.06$,

549: $g_0^{TC}=-2.28$, $g_0^{TG}=-2.54$, $g_0^{CC}=-3.14$, $g_0^{CG}=-3.85$,

550: $g_0^{GC}=-3.90$ k$_B$T at $T=25$~C, 150 mM Na.

551:

552:

553: \bibitem{lungo}

554: V. Baldazzi {\em et al.}, in preparation (2005).

555:

556: \bibitem{viterbi}

557: A.J. Viterbi, {\em IEEE Trans. Inf. Th.} {\bf 13},

558: 260 (1967).

559:

560: \bibitem{diso}

561: F.J. Dyson {\em Phys. Rev.} {\bf 92}, 1331 (1953).

562:

563: \bibitem{siggia}

564: R.E. Thompson, E.D. Siggia.

565: {\em Europhys. Lett.} {\bf 31}, 335 (1995).

566:

567: \bibitem{bustt}

568: B. Onoa {\em et al.} {\em Science} {\bf 299}, 1892 (2003) (supplementary

569: materials).

570:

571: \bibitem{hwa}

572: U. Gerland, R. Bundschuh, T. Hwa.

573: {\em Phys. Biol.} {\bf 1}, 19 (2004).

574:

575: % opticaltrap+fluorescence

576: \bibitem{Lan03}

577: M.J. Lang, P.M. Fordyce, S.M. Block.

578:  {\em J. Biol.} {\bf 2}, 6 (2003).

579:

580: \end{thebibliography}

581:

582:

583: \end{document}

584:

585: