
1: \documentclass[twocolumn,floatfix,prb,amsmath,amssymb,showkeys]{revtex4}

2: \usepackage{amsmath,amssymb}

3: \usepackage{epsfig}

4: \usepackage{graphicx}

5: \usepackage{dcolumn}

6: \usepackage{bm}

7:

8: \begin{document}

9:

10: \title{Estimation of protein folding probability from equilibrium simulations}

11:

12:

13: \author{Francesco Rao}

14: \author{Giovanni Settanni}

15: \author{Enrico Guarnera}

16: \author{Amedeo Caflisch}

17:

18: \email[corresponding author, tel: +41 44 635 55 21,

19: fax: +41 44 635 68 62, e-mail: ]{caflisch@bioc.unizh.ch}

20:

21: \affiliation{Department of Biochemistry, University of Zurich,

22:              Winterthurerstrasse 190, CH-8057 Zurich, Switzerland\\

23:              tel: +41 44 635 55 21, fax: +41 44 635 68 62,\\

24:              e-mail: caflisch@bioc.unizh.ch}

25:

26: \date{\today}

27:

28: \begin{abstract}

29:

30: The assumption that similar structures have similar folding probabilities

31: ($p_{fold}$) leads naturally to a procedure to evaluate $p_{fold}$ for every

32: snapshot saved along an equilibrium folding-unfolding trajectory of a

33: structured peptide or protein.  The procedure utilizes a structurally

34: homogeneous clustering and does not require any additional simulation.

35: It can be used to detect multiple folding pathways as shown for a three-stranded antiparallel $\beta$-sheet peptide investigated by implicit solvent molecular dynamics simulations.

36:

37: \end{abstract}

38:

39: \keywords{molecular dynamics; transition state; p$_{fold}$; multiple pathways; denatured state ensemble}

40:

41: \maketitle

42:

\section{Introduction}

44:

45: The folding probability ($p_{fold}$) of a protein conformation saved along a

46: Monte Carlo or molecular dynamics (MD) trajectory is the probability to fold

47: before unfolding \cite{Du:On}.  It is a useful measure of kinetic distance from

48: the folded, i.e., functional state, and can be used to validate transition

49: state ensemble (TSE) structures, which should have $p_{fold}\approx 0.5$.  Such

50: validation consists of starting a large number of trajectories from putative

51: TSE structures with varying initial distribution of velocities and counting the

number of those that fold within a ``commitment'' time, which has to be chosen

53: much longer than the shortest time-scales of conformational fluctuations and

54: much shorter than the average folding time \cite{Hubner:Commitment}.  The

55: concept of $p_{fold}$ calculation originates from a method for determining

56: transmission coefficients, starting from a known transition state

57: \cite{Chandler:Statistical} and the identification of simpler transition states

58: in protein dynamics (e.g., tyrosine ring flips) \cite{Northrup:Dynamical}.  The

59: approach has been used to identify the otherwise very elusive folding TSE by

atomistic Monte Carlo off-lattice simulations of small proteins with a G\={o}
potential \cite{Li:Constr,Hubner:Commitment}, as well as implicit solvent MD

62: \cite{Gsponer:Molecular,Rao:The} and Monte Carlo \cite{Lenz:Folding}

simulations with a physicochemically based potential. The large number of trial simulations needed for a reliable evaluation of $p_{fold}$ makes the estimation of the folding probability computationally very expensive.

64: For this reason, here we propose a method to estimate folding probabilities

65: for \textit{all} structures visited in an equilibrium folding-unfolding trajectory without

66: any additional simulation.

67:

68: \section{Methods}

69:

%---[ TABLES ]--------------------------------------------------------------
%---------------------------------------------------------------------------
\begingroup
\squeezetable
\begin{table}[h]
\caption{\label{list}{\sc DRMS} clusters used for the calculation of $P_{f}$.}
\begin{ruledtabular}
\begin{tabular}{ccccccc}
Cluster &
$P_{f}^C$ \footnote{Cluster-$p_{fold}$ [$P_f^C$, Eq.\ \ref{cluster-pfold}].} &
$P_{f}$ \footnote{Traditional, i.e., computationally expensive $P_f$ value [Eq.\ \ref{average-pfi}].} &
$\sigma_{p_{fold}}$ \footnote{Standard deviation of $p_{fold}$ in a cluster [Eq.\ \ref{sigma-pfold}].} &
$N$ \footnote{Total number of trials used to evaluate $P_f$. For every structure $n_{t}=10$ trials were performed \hbox{($N=n_{t}\ W_{sample}$)}
except for clusters 7 and 25 for which 20 and 50 trials were performed, respectively.} &
$W$ \footnote{Number of snapshots in the cluster.} &
$W_{sample}$ \footnote{Number of snapshots used to evaluate $P_f$.
The $W_{sample}$ subset was obtained by selecting structures in a cluster every $|W/W_{sample}|$ saved conformations.} \\
\hline
\hline
     1 &   0.00 &   0.03 &   0.04 &  150 &  144 &   15\\
     2 &   0.11 &   0.05 &   0.06 &  150 &  449 &   15\\
     3 &   0.06 &   0.05 &   0.07 &  120 &   36 &   12\\
     4 &   0.08 &   0.07 &   0.08 &  140 &  555 &   14\\
     5 &   0.10 &   0.08 &   0.06 &  100 &   10 &   10\\
     6 &   0.13 &   0.12 &   0.18 &  160 &  911 &   16\\
     7 &   0.25 &   0.16 &   0.07 &   80 &    4 &    4\\
     8 &   0.23 &   0.20 &   0.31 &  150 &  141 &   15\\
     9 &   0.21 &   0.22 &   0.15 &  140 &  178 &   14\\
    10 &   0.12 &   0.23 &   0.20 &  120 &   48 &   12\\
    11 &   0.57 &   0.25 &   0.14 &  140 &   14 &   14\\
    12 &   0.05 &   0.27 &   0.19 &  100 &   19 &   10\\
    13 &   0.23 &   0.29 &   0.38 &  140 &  391 &   14\\
    14 &   0.08 &   0.30 &   0.15 &  120 &   12 &   12\\
    15 &   0.72 &   0.35 &   0.23 &  130 &  129 &   13\\
    16 &   0.19 &   0.38 &   0.18 &  130 &   26 &   13\\
    17 &   0.38 &   0.44 &   0.39 &  160 &   16 &   16\\
    18 &   0.38 &   0.51 &   0.28 &  160 &   16 &   16\\
    19 &   0.65 &   0.60 &   0.29 &  100 &   20 &   10\\
    20 &   0.57 &   0.61 &   0.35 &   70 &    7 &    7\\
    21 &   0.48 &   0.63 &   0.32 &  140 &   27 &   14\\
    22 &   0.74 &   0.65 &   0.40 &  140 &  539 &   14\\
    23 &   0.68 &   0.66 &   0.18 &  140 &   28 &   14\\
    24 &   0.38 &   0.71 &   0.24 &  130 &   13 &   13\\
    25 &   0.50 &   0.72 &   0.20 &  100 &    2 &    2\\
    26 &   0.82 &   0.76 &   0.31 &  170 &   17 &   17\\
    27 &   0.50 &   0.78 &   0.14 &  120 &   12 &   12\\
    28 &   0.78 &   0.78 &   0.22 &  180 &   18 &   18\\
    29 &   0.70 &   0.79 &   0.19 &  130 &  189 &   13\\
    30 &   0.77 &   0.79 &   0.17 &  150 &   30 &   15\\
    31 &   0.85 &   0.81 &   0.11 &  130 &   13 &   13\\
    32 &   0.91 &   0.83 &   0.20 &  140 &  401 &   14\\
    33 &   0.90 &   0.85 &   0.27 &  100 &   20 &   10\\
    34 &   0.85 &   0.85 &   0.10 &  120 &   48 &   12\\
    35 &   0.94 &   0.88 &   0.13 &  170 & 1990 &   17\\
    36 &   0.71 &   0.94 &   0.07 &   70 &    7 &    7\\
    37 &   0.95 &   0.95 &   0.06 &  150 &  855 &   15\\
\end{tabular}
\end{ruledtabular}
\end{table}
\endgroup

130:

131:

132: \subsection{Molecular dynamics simulations}

133:

134: Beta3s is a designed 20-residue sequence whose

135: solution conformation has been investigated by NMR spectroscopy

136: \cite{DeAlba:Denovo}. The NMR data indicate that beta3s in aqueous solution

forms a monomeric (up to more than 1~mM concentration) triple-stranded

138: antiparallel $\beta$-sheet, in equilibrium with the

139: denatured state \cite{DeAlba:Denovo}. We have previously shown that in

140: implicit solvent \cite{Ferrara:Evaluation} molecular dynamics simulations

141: beta3s folds reversibly to the NMR solution conformation, irrespective of the

142: starting structure \cite{Ferrara:Folding}.

143: Recently, four molecular dynamics simulations of beta3s were performed at 330 K

144: for a total simulation time of 12.6 $\mu$s \cite{Cavalli:Fast}.  There are 72

145: folding events and 73 unfolding events and the average time required to go from

146: the denatured state to the folded conformation is 83 ns.  The 12.6 $\mu$s of

147: simulation length is about two orders of magnitude longer than the average

148: folding or unfolding time, which are similar because at 330 K the native and

denatured states are almost equally populated \cite{Cavalli:Fast}. For the
$p_{fold}$ analysis the first 0.65 $\mu$s of each of the four simulations was
discarded. The remaining 10 $\mu$s of simulation time yields a total of 500000 snapshots because coordinates were saved every 20 ps.
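
As a consistency check on these numbers, the following short calculation uses only the values quoted above; the symbols $T_{\rm used}$ and $N_{\rm snap}$ are introduced here for illustration only:
\begin{equation*}
\begin{split}
T_{\rm used} & = 12.6\ \mu{\rm s} - 4 \times 0.65\ \mu{\rm s} = 10.0\ \mu{\rm s} \\
N_{\rm snap} & = \frac{10.0\ \mu{\rm s}}{20\ {\rm ps}} = \frac{10^{7}\ {\rm ps}}{20\ {\rm ps}} = 5\times 10^{5}
\end{split}
\end{equation*}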

152:

153: The simulations  were performed with

154: the program CHARMM {\cite{Brooks:CHARMM}}.  Beta3s was modeled by explicitly

155: considering all heavy atoms and the hydrogen atoms bound to nitrogen or oxygen

156: atoms (PARAM19 force field {\cite{Brooks:CHARMM}}).  A mean field approximation

157: based on the solvent accessible surface was used to describe the main effects

158: of the aqueous solvent on the solute \cite{Ferrara:Evaluation}.  The two

159: surface tension-like parameters of the solvation model were optimized without using beta3s.  The

160: same force field and implicit solvent model have been used recently in

161: molecular dynamics simulations of the early steps of ordered aggregation

\cite{Gsponer:Therole}, and folding of structured peptides \cite{Ferrara:Evaluation,Ferrara:Folding}, as well as small

163: proteins of about 60 residues \cite{Gsponer:Role}. Despite

164: the absence of collisions with water molecules, in the simulations with

165: implicit solvent the separation of time scales is comparable with that observed

166: experimentally.  Helices fold in about 1 ns \cite{Ferrara:Thermodynamics},

167: $\beta$-hairpins in about 10 ns \cite{Ferrara:Thermodynamics} and

168: triple-stranded $\beta$-sheets in about 100 ns \cite{Cavalli:Fast}, while the

169: experimental values are $\sim$0.1 $\mu$s \cite{Eaton:Fast}, $\sim$1 $\mu$s

170: \cite{Eaton:Fast} and $\sim$10 $\mu$s \cite{DeAlba:Denovo}, respectively.
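
The statement about the separation of time scales can be checked directly from the approximate folding times quoted above. In the sketch below the ratio $r$ between experimental and simulated folding time is a symbol introduced here for illustration:
\begin{equation*}
\begin{split}
r_{\rm helix} & \simeq \frac{0.1\ \mu{\rm s}}{1\ {\rm ns}} = 10^{2} \\
r_{\rm hairpin} & \simeq \frac{1\ \mu{\rm s}}{10\ {\rm ns}} = 10^{2} \\
r_{\rm sheet} & \simeq \frac{10\ \mu{\rm s}}{100\ {\rm ns}} = 10^{2}
\end{split}
\end{equation*}
Within the accuracy of these order-of-magnitude estimates, the implicit solvent simulations are faster than experiment by a roughly constant factor of about $10^{2}$, so the relative spacing of the three folding times is retained.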

171:

172: \subsection{Clusterization}

173:

174: \begin{figure}[h]

175: \includegraphics[angle=-90,width=8cm]{fig1.eps}

176: \caption{Probability distribution for the first passage time (fpt) to the most

177: populated cluster (\emph{folded state}) of the DRMS 1.2 \AA\

178: clusterization.}

179: \label{fig.pfpt}

180: \end{figure}

181:

182:

183:

184: % There are several procedures for clustering conformations according

185: % to structural similarity.

The 500000 conformations obtained from the simulations of beta3s (see above) were clustered by the leader algorithm \cite{hartigan}. Briefly, the first structure defines the first cluster, and each
subsequent structure is compared with the first conformation of each cluster found so far until a similar one is found, in which case the structure is assigned to that cluster.
If the structural deviation (see below) from the first conformation of every known cluster exceeds a given threshold, a new cluster is defined.
The leader algorithm is very fast even when analyzing large sets of
structures, as in the present work.

191: The results presented here were obtained with a structural comparison based on the Distance Root Mean Square (DRMS) deviation considering all distances involving C$_\alpha$ and/or C$_\beta$ atoms and a cutoff of 1.2 \AA.  This yielded 78183 clusters.

192: The DRMS and root mean square deviation of atomic coordinates (upon

193: optimal superposition) have been shown to be highly correlated \cite{Hubner:Commitment}. The DRMS cutoff of 1.2~\AA\ was chosen on the basis of the distribution of the pairwise DRMS values in a subsample of the wild-type trajectories.  The distribution shows

194: two main peaks that originate from intra- and inter-cluster distances,

195: respectively

196: (data not shown).  The cutoff is located at the minimum between the two

197: peaks.

198: The main findings of this work are valid

199: also for clusterization based on secondary structure similarity

200: \cite{Rao:The}

201: (see Suppl.\ Mat.).
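
For reference, the DRMS between two conformations $a$ and $b$ is commonly written as shown below. This is a sketch of the standard definition, which the text does not spell out; the symbols $d_{ij}^{a}$, $d_{ij}^{b}$, and $N_{d}$ are notation assumed here, and the sum runs over the $N_{d}$ pairs of C$_\alpha$ and/or C$_\beta$ atoms considered:
\begin{equation*}
{\rm DRMS}(a,b)=\sqrt{\frac{1}{N_{d}}\sum_{i<j}\left(d_{ij}^{a}-d_{ij}^{b}\right)^2}
\end{equation*}
where $d_{ij}^{a}$ is the distance between atoms $i$ and $j$ in conformation $a$. Because only intramolecular distances enter, no superposition of the two structures is required.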

202:

203:

204: %\subsection{Clusterization}

205:

206: %The

207: %500000 conformations obtained from the simulations of beta3s (see above)

208: %were clustered by the leader

209: %algorithm \cite{hartigan} based on the Distance Root Mean Square (DRMS)

210: %deviation considering C$_\alpha$ and C$_\beta$ atoms and a cutoff of 1.2 \AA.  This yields 78183 clusters.

211: %In general, conformations visited along an MD simulation can be structurally clustered in several ways. Recently a method based on secondary structure \cite{Rao:The} has been applied for the clusterization of  beta3s conformations (see Supp. Mat. for details).

212:

213: \subsection{Folding probability}

214:

215:

216:

217: \begin{figure*}

218: \includegraphics[width=7.0cm, angle=-90]{fig2.eps}

219: \caption{Standard deviation

220: $\sigma_{p_{fold}}=\sqrt{\left< (p_{fold}(i)-P_{f}[\alpha])^2

221: \right>_{i \in \alpha}}$ of the $p_{fold}$ for the 37 DRMS clusters used in the study.

222: \textbf{(A)} $\sigma_{p_{fold}}$ as a function of $P_{f}$ compared to a

223: Bernoulli distribution (solid line). Ten trials were performed for each

224: snapshot.  The largest values for the standard deviation are located around the

225: 0.5 region and this is probably due to the Bernoulli process ($\theta=0,1$)

226: used for the calculation of $p_{fold}$. \textbf{(B)} $\sigma_{p_{fold}}$

227: dependence on the number of trials used to evaluate $p_{fold}$. The dashed curves are fits with a $\frac{a}{\sqrt{x}}+b$ function.  The horizontal

dashed lines are drawn to help identify in \textbf{A} the two clusters used

229: in \textbf{B}.   \textbf{(C)} Dependence of $P_f$  on the number of trials $n_t$ for the two clusters used in \textbf{B}.}

230: \label{fig.sigmapfold}

231: \end{figure*}

232:

233:

234: For the computation of $p_{fold}$ a criterion ($\Phi$) is needed to determine

235: when the system reaches the folded state.

Given a clusterization of the structures, a natural choice for $\Phi$ is a visit to the most populated cluster, which for
structured peptides and proteins is not degenerate (other criteria are also possible, e.g., a fraction of native contacts $Q$ larger than a given threshold).

238: Given $\Phi$ and a commitment time ($\tau_{commit}$),

239: the folding probability $p_{fold}(i)$ of an MD snapshot $i$ is computed as \cite{Du:On,Hubner:Commitment}

%
\begin{equation}
p_{fold}(i)=\frac{n_{f}(i)}{n_t(i)}\ ,
\label{pfi}
\end{equation}
%
where $n_{f}(i)$ is the number of trials started from snapshot $i$ that reach
the folded state within a time $\tau_{commit}$, and $n_t(i)$ is the total number of trials started from snapshot $i$.

248:

249: Every simulation started from snapshot $i$ can

250: be considered as a Bernoulli trial of a random variable

251: $\theta$ with value 1 (folding within $\tau_{commit}$) or 0

252: (no folding within $\tau_{commit}$).  The variable

253: $\theta$ has average and variance on the average of the form:

%
\begin{equation}
\begin{split}
\langle \theta \rangle & =
p_{fold}\ =  \frac{1}{n_t}\sum_{i=1}^{n_t} \theta_i \\
\sigma^2_{\left< \theta \right>} & = \frac{1}{n_t}p_{fold}(1-p_{fold})
\end{split}
\end{equation}
%
where $n_t$ is the total number of trials and $\theta_i$ is the outcome of trial $i$; the statistical error on the $p_{fold}$
value decreases as $1/\sqrt{n_t}$.
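
To give a sense of scale for the variance expression above, consider a snapshot at the top of the barrier, $p_{fold}=0.5$. The two values of $n_t$ below are chosen for illustration:
\begin{equation*}
\begin{split}
n_t=10: & \quad \sigma_{\left< \theta \right>}=\sqrt{\frac{0.5\times 0.5}{10}}\simeq 0.16 \\
n_t=100: & \quad \sigma_{\left< \theta \right>}=\sqrt{\frac{0.5\times 0.5}{100}}= 0.05
\end{split}
\end{equation*}
Ten trials, the value used for most snapshots in Tab.~\ref{list}, therefore locate $p_{fold}$ only to within about $\pm 0.16$ near the barrier, and a tenfold increase in the number of trials reduces the error by a factor of about three.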

265:

266: In Fig.\ \ref{fig.pfpt} the distribution of the first passage time (fpt)

to the folded state is shown. The double-peak
shape of the distribution provides evidence for the different time scales
of \emph{intra}-basin and \emph{inter}-basin transitions.

270: A value of 5 ns is chosen

271: for $\tau_{commit}$ because events with

272: smaller time scales correspond to the diffusion within the native free-energy

273: basin, while events with larger time scales are transitions from other basins

274: to the native one, i.e., folding/unfolding events \cite{Cavalli:Fast}.
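
Two simple ratios, computed from values given in the text, show where the chosen commitment time sits relative to the other time scales (a worked check, not an additional result):
\begin{equation*}
\begin{split}
\frac{\tau_{commit}}{\mbox{average folding time}} & = \frac{5\ {\rm ns}}{83\ {\rm ns}} \simeq 0.06 \\
\frac{\tau_{commit}}{\mbox{saving interval}} & = \frac{5000\ {\rm ps}}{20\ {\rm ps}} = 250
\end{split}
\end{equation*}
Hence $\tau_{commit}$ is more than an order of magnitude shorter than the average folding time, and each $\tau_{commit}$-segment of trajectory spans 250 saved snapshots.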

275:

276:

277: \section{Folding probability from equilibrium trajectories}

278:

279: \begin{figure*}[t]

280: \includegraphics[width=11.5cm, angle=-90]{fig3.eps}

281: \caption{Cluster folding probability $P_{f}^{C}$. \textbf{(A)} Scatter plot of $P_{f}^C$ versus $P_{f}$.

282: The DRMS 1.2 \AA\ clusterization and the folding criterion $\Phi$

283: (reaching the most populated cluster within $\tau_{commit}=5$ ns)

were used. \textbf{(B)} Probability distribution of the $p_{fold}$ value for the 500000 snapshots saved along the 10~$\mu$s MD trajectory. The folding probability for snapshot $i$ is computed as $p_{fold}(i)=P_f^C[\alpha]$ for $i \in \alpha$. \textbf{(C--E)} Scatter plot of $P_{f}^C$ versus $P_{f}$ for 1.0, 5.0, and 10~$\mu$s of simulation time, respectively.}

285: \label{fig.scatterpfold}

286: \end{figure*}

287:

288:

289:

290: The basic assumption of the present work is that conformations that

291: are structurally similar have the same kinetic behavior, hence they have

similar values of $p_{fold}$.  Note that the converse is not necessarily true, as explained in Section IV for the TSE and the denatured state. To exploit this assumption, snapshots saved along
a trajectory are grouped into clusters of structurally similar conformations\cite{Symbolic}.
Then, the $\tau_{commit}$-segment of MD trajectory following each snapshot is
analyzed to check whether the folding condition $\Phi$ is met (i.e., the snapshot
``folds'').  For each cluster, the ratio between the number of snapshots that lead to
folding and the total number of snapshots in the cluster is defined as the
cluster-$p_{fold}$ ($P_{f}^C$; throughout the text uppercase $P$ and lowercase

299: $p$ refer to folding probability for clusters and individual snapshots,

300: respectively).  This value is an approximation of the $p_{fold}$ of any single

301: structure in the cluster which is valid if the cluster consists of structurally

302: similar conformations.  In other words, the occurrence of the folding event for

303: the snapshots of a given cluster can be considered as a Bernoulli trial of a

304: random variable $\theta$.  The average of $\theta$ and variance on the average

305: for the set of snapshots belonging to a given cluster $\alpha$ can be written

306: as:

%
\begin{equation}
\begin{split}
P_{f}^C [\alpha] & = \langle \theta \rangle =
\frac{1}{W}\sum_{i=1}^{W} \theta_i\ , \qquad i \in \alpha \\
\sigma^2_{\left< \theta \right>} & = \frac{1}{W}P_{f}^C(1-P_{f}^C)
\end{split}
\label{cluster-pfold}
\end{equation}
%

317: where $W$ is the number of snapshots in cluster $\alpha$.  $P_{f}^C$ is

318: the average folding probability over a set of structurally homogeneous

319: conformations. Using the clustering and the folding criterion $\Phi$

320: introduced above, values of $P_{f}^C$ for the 78183 clusters can be computed

321: by Eq.\ \ref{cluster-pfold}, i.e., the number of conformations of the

322: cluster that fold within 5 ns divided by the total number of conformations

323: belonging to the cluster.
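
As a worked illustration of the variance in Eq.\ \ref{cluster-pfold}, consider two entries of Tab.~\ref{list} with very different populations. The numbers are taken from the table; treating the snapshots of a cluster as independent Bernoulli trials is the assumption already made in Eq.\ \ref{cluster-pfold}:
\begin{equation*}
\begin{split}
\mbox{cluster 35 } (W=1990,\ P_f^C=0.94): & \quad \sigma_{\left< \theta \right>}=\sqrt{\frac{0.94\times 0.06}{1990}}\simeq 0.005 \\
\mbox{cluster 25 } (W=2,\ P_f^C=0.50): & \quad \sigma_{\left< \theta \right>}=\sqrt{\frac{0.50\times 0.50}{2}}\simeq 0.35
\end{split}
\end{equation*}
The estimate is thus sharp for highly populated clusters and very uncertain for clusters with only a few snapshots.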

324:

325: In this article we provide evidence that the basic assumption mentioned above,

326: that is, similar conformations have similar folding probabilities, holds in

the case of beta3s, a three-stranded antiparallel $\beta$-sheet peptide investigated by MD \cite{Cavalli:Fast}.

Moreover, we show that the computationally expensive quantity
%
\begin{equation}
P_{f}[\alpha] = \frac{1}{W}\sum_{i=1}^{W} p_{fold}(i)\ ,
\qquad i \in \alpha
\label{average-pfi}
\end{equation}
%
which is measured by
starting several simulations from each snapshot $i$ in the cluster $\alpha$
with $W$ snapshots, is well approximated by $P_{f}^C$, whose evaluation is
straightforward.

340:

341:

342: To test the assumption that similar structures have similar $p_{fold}$ and to

343: compare the values of $P_{f}^C$ with those obtained from the standard approach

344: \cite{Du:On}, folding probabilities $P_{f}$ were computed for the structures of

345: 37 clusters by starting several 5 ns MD runs from each structure and counting

those that fold (Eqs.~\ref{pfi} and~\ref{average-pfi}).  The 37 clusters chosen
among the 78183 include both highly and sparsely populated clusters with $P_{f}^C$
values evenly distributed in the range between 0 and 1 (see Tab.~\ref{list}).  In the
case of large clusters a subset of snapshots is considered for the
computation of $P_{f}$. In those cases $W$ is replaced in Eq.~\ref{average-pfi} by $W_{sample}<W$, the number of snapshots involved
in the calculation.

352:

353:

354: \begin{figure*}[p]

355: \includegraphics[width=12cm]{fig4.ps}

356: \caption{Transition state ensemble (TSE) of beta3s. \textbf{(A)} RMSD pairwise distribution for structures with $p_{fold}>0.51$ (native state), $0.49< p_{fold}< 0.51$ (TSE), and $p_{fold}<0.49$ (denatured state). \textbf{(B)} Type I and \textbf{(C)} type II transition states (thin lines). Structures are superimposed on residues 2-11 and 10-19 with an average pairwise RMSD of 0.81 and 0.82 \AA\ for type I and type II, respectively. For comparison, the native state is shown as a thick line with a circle to label the N-terminus.}

357: \label{fig.pair}

358: \end{figure*}

359:

360:

361:

362:

363:

The standard deviation of $p_{fold}$ in a cluster is computed as
%
\begin{equation}
\sigma_{p_{fold}}=\sqrt{\left< (p_{fold}(i)-P_{f}[\alpha])^2 \right>_{i \in \alpha}}\ .
\label{sigma-pfold}
\end{equation}
%

371: In the case of full kinetic inhomogeneity, i.e., random grouping of snapshots,

372: the $p_{fold}$ value  for all snapshots in a given cluster will be equal to 0

373: or 1, indicating the coexistence (in the same cluster) of structures that

374: either exclusively fold or unfold. In this case $\sigma_{p_{fold}}$ reflects

the Bernoulli distribution (see Suppl.\ Mat.).  Fig.~\ref{fig.sigmapfold}A shows that, even when only $n_t=10$ runs
per snapshot are used to compute $p_{fold}$, $\sigma_{p_{fold}}$ values are not
compatible with those of a Bernoulli distribution.  Moreover, the values of the
standard deviation decrease when the number of trials $n_t$ increases,
as reported in Fig.~\ref{fig.sigmapfold}B for two sample clusters. The asymptotic values
of $\sigma_{p_{fold}}$  ($n_t \rightarrow \infty$) for these two data sets are
0.05 and 0.2. These values cannot reach zero because snapshots in a cluster
are similar but not identical.  These results suggest that snapshots inside the same

383: cluster are kinetically homogeneous and a statistical description of $p_{fold}$ can

384: be adopted, that is, folding probabilities are computed as cluster averages

385: (instead of single snapshots) by means of $P_f$ and $P_f^C$.
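
The Bernoulli reference used in this comparison can be made explicit with a short derivation. It is a sketch under the stated assumption of full kinetic inhomogeneity, in which a fraction $P_f$ of the snapshots in the cluster has $p_{fold}(i)=1$ and the remaining fraction $1-P_f$ has $p_{fold}(i)=0$. Inserting these values in Eq.\ \ref{sigma-pfold} gives
\begin{equation*}
\begin{split}
\sigma^2_{p_{fold}} & = P_f\,(1-P_f)^2 + (1-P_f)\,(0-P_f)^2 \\
 & = P_f(1-P_f)\left[(1-P_f)+P_f\right] = P_f(1-P_f)
\end{split}
\end{equation*}
so that $\sigma_{p_{fold}}=\sqrt{P_f(1-P_f)}$, which reaches its maximum of 0.5 at $P_f=0.5$. For comparison, cluster 18 in Tab.~\ref{list} has $P_f=0.51$, for which this limit gives $\sqrt{0.51\times 0.49}\simeq 0.50$, whereas the measured value is $\sigma_{p_{fold}}=0.28$.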

386:

We still have to verify that $P_{f}^C$ indeed approximates the computationally expensive $P_{f}$. For the 37 clusters mentioned above a correlation of 0.89 between $P_{f}^C$ and $P_{f}$ is found with a slope of 0.86 (see Fig.~\ref{fig.scatterpfold}A and
Tab.~\ref{list}), indicating that the procedure is able to estimate folding

389: probabilities for clusters on the folding-transition barrier ($P_{f}\sim 0.5$)

390: as well as in the folding ($P_{f}\sim 1.0$) or unfolding ($P_{f}\sim 0.0$)

391: regions.  The error bars for $P_{f}^C$ in Fig.~\ref{fig.scatterpfold}A are derived from the

392: definition of variance given in Eq.\ \ref{cluster-pfold}.  In the same spirit

393: of Eq.\ \ref{cluster-pfold} the folding probability $P_{f}$ and its variance

394: are written as

%
\begin{equation}
\begin{split}
P_{f} & = \left < \theta \right> = \frac{1}{N}\sum_{i=1}^N \theta_i \\
\sigma^2_{\left< \theta \right>} & = \frac{1}{N} P_{f}(1-P_{f})
\end{split}
\label{sigmapfc}
\end{equation}
%
where $N=\sum n_t$ is the total number of runs and $\theta$ is equal to 1 or 0
if the run folded or did not fold within $\tau_{commit}$, respectively.  Note that the same number of runs
$n_t$ has been used for every snapshot of a cluster. The large vertical error bars in
Fig.~\ref{fig.scatterpfold}A correspond to clusters with fewer than 10 snapshots.  The largest

408: deviations between $P_{f}$ and $P_{f}^C$ are around the $0.5$ region. This is

409: due to the limited number of crossings of the folding barrier observed in the

410: MD simulation (Fig.~\ref{fig.scatterpfold}B, around 70 events of folding \cite{Cavalli:Fast}).

411: Improvements in the accuracy for the estimation of $P_f$ are achieved as the

412: number of folding events, i.e., the simulation time, increases (Fig.~\ref{fig.scatterpfold}C-E).
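
The link between Eq.~\ref{average-pfi} and Eq.~\ref{sigmapfc} can be written out explicitly. The following is a short sketch, assuming as stated above that the same $n_t$ is used for each of the $W_{sample}$ snapshots of a cluster, so that $N=n_t W_{sample}$; the run index $k$ is introduced here for illustration:
\begin{equation*}
\begin{split}
P_f & =\frac{1}{W_{sample}}\sum_{i=1}^{W_{sample}}\frac{n_f(i)}{n_t}
=\frac{1}{n_t W_{sample}}\sum_{i=1}^{W_{sample}} n_f(i) \\
 & =\frac{1}{N}\sum_{k=1}^{N}\theta_k
\end{split}
\end{equation*}
because $\sum_i n_f(i)$ counts all runs that folded. As a numerical example with values from Tab.~\ref{list}, cluster 18 has $P_f=0.51$ and $N=160$, which gives $\sigma_{\left< \theta \right>}=\sqrt{0.51\times 0.49/160}\simeq 0.04$.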

413:

414: The two main results of this study, i.e., the kinetic homogeneity of the clusters and the

415: validity of $P_f^C$ as an approximation of $P_f$, are robust with respect to the choice of the clusterization.  Similar results

416: can be obtained also with different flavors of conformation space partitioning,

417: as long as they group together structurally homogeneous conformations, e.g., clusterization based on root mean square deviation of atomic coordinates (RMSD) or secondary structure strings (see Supp.\ Mat.).

418: The latter are appropriate for structured peptides but not for proteins with irregular secondary structure because of string degeneracy. Note that partitions

419: based on order parameters (like native contacts) are usually unsatisfactory and

420: not robust. This is mainly due to the fact that clusters defined in this way

421: are characterized by large structural heterogeneities \cite{Rao:The}.

422:

423: \section{Analysis of transition state ensemble}

424:

The folding probability of structure $i$ is estimated as $p_{fold}(i)=P_{f}^{C}[\alpha]$ for $i \in \alpha$. This approximation makes it possible to plot the pairwise RMSD distribution of beta3s structures with
$p_{fold}>0.51$ (native state), $0.49< p_{fold}< 0.51$ (transition state ensemble, TSE), and $p_{fold}<0.49$ (denatured state) (Fig.~\ref{fig.pair}A). For the native state, the distribution is peaked around low values of RMSD ($\sim 1.5$~\AA), indicating that structures with $p_{fold}>0.51$ are structurally similar and belong to a non-degenerate state. The statistical weight of this group of structures is 49.4\%, which corresponds to the expected statistics for the native state because the simulations were performed at the melting temperature.
In the case of the TSE, the distribution is broad because of the coexistence of heterogeneous structures. This scenario is compatible with the presence of multiple folding pathways. Beta3s folding was already shown to involve two main average pathways that differ in the order of formation of the two hairpins \cite{Ferrara:Folding,Rao:The}. Here, a \textit{naive} approach based on the number of native contacts \cite{Ferrara:Folding} is used to structurally characterize the folding barrier. TSE structures with more native contacts in the first hairpin than in the second hairpin are called type I conformations (Fig.~\ref{fig.pair}B); otherwise they are called type II (Fig.~\ref{fig.pair}C). In both cases the transition state is characterized by the presence of one of the two native hairpins, while the rest of the peptide is mainly unstructured. These findings are also in agreement with
the complex network analysis of beta3s reported in Ref.~\onlinecite{Rao:The}. Finally, the denatured state shows a broad pairwise RMSD distribution around even larger values of RMSD ($\sim 5.5$~\AA), indicating the presence of highly heterogeneous conformations.

429:

430:

431:

432: \section{Conclusions}

433:

434: Two main results have emerged from the present study. First, snapshots grouped in structurally homogeneous clusters are

435: characterized by similar values of $p_{fold}$. This result justifies the use of

436: a statistical approach for the study of the kinetic properties of the

437: structures sampled along a simulation.  Second, given a set of structurally

438: homogeneous clusters and a folding criterion, it is possible to obtain a first

439: approximation of the folding probability for every structure sampled along an

440: equilibrium folding-unfolding simulation. Thus, the cluster-$p_{fold}$ is a quantitative

441: measure of the kinetic distance from the native state and is computationally very cheap\cite{Computation}. Furthermore, it can be used to detect multiple folding pathways. The accuracy in the identification of the transition state ensemble improves as the number of folding events observed in the simulation increases.

442: %

443: %\footnote{The computation of $P_{f}^C$ presented in this work takes few seconds

444: %on a desktop computer}.

445: %

Recently the cluster-$p_{fold}$ approach has been used to identify the transition state ensemble of a large set of beta3s mutants (for a total of 0.65~ms of simulation time \cite{Settanni:Phi}), which would have been impossible with traditional methods.  As a further
application, the cluster-$p_{fold}$ procedure can be used to validate TSE conformations
obtained with widespread G\={o} models.

449:

450:

451: \begin{acknowledgments}

We thank Stefanie Muff for useful and stimulating discussions and comments on the manuscript.

453: We also thank Dr.\ Emanuele Paci for interesting discussions.

454: We acknowledge an anonymous referee for suggesting the use of cluster-$p_{fold}$ to detect multiple pathways.

455: The molecular

456: dynamics simulations were performed on the Matterhorn Beowulf cluster at the

457: Informatikdienste of the University of Zurich.  We thank C.\ Bollinger, Dr.\

458: T.\ Steenbock, and Dr.\ A.\ Godknecht for setting up and maintaining the

459: cluster.  This work was supported by the Swiss National Science Foundation

grant no.\ 205321-105946/1.

461: \end{acknowledgments}

462:

463: %\bibliography{/home/caflisch/tex/a-bib}

464: \bibliography{a-bib}

465:

466: %\end{document}

467:

468: %\clearpage

469:

470: %%---[ TABLES ]--------------------------------------------------------------

471: %%---------------------------------------------------------------------------

472: %\begingroup

473: %\squeezetable

474: %\begin{table}[h]

475: %\caption{\label{list}{\sc DRMS} clusters used for the calculation of $P_{f}$.}

476: %\begin{ruledtabular}

477: %\begin{tabular}{ccccccc}

478: %Cluster &

479: %$P_{f}^C$ \footnote{Cluster-$p_{fold}$ [$P_f^C$, Eq.\ \ref{cluster-pfold}].} &

480: %$P_{f}$ \footnote{Traditional, i.e., computationally expensive $P_f$ value [Eq.\ \ref{average-pfi}].} &

481: %$\sigma_{p_{fold}}$ \footnote{Standard deviation of $p_{fold}$ in a cluster [Eq.\ \ref{sigma-pfold}].} &

482: %$N$ \footnote{Total number of trials used to evaluate $P_f$. For every structure $n_{t}=10$ trials were performed \hbox{($N=n_{t}\ W_{sample}$)}

483: %except for clusters 7 and 25 for which 20 and 50 trials were performed, respectively.} &

484: %$W$ \footnote{Number of snapshots in the cluster.} &

485: %$W_{sample}$ \footnote{Number of snapshots used to evaluate $P_f$.

486: %The $W_{sample}$ subset was obtained by selecting structures in a cluster every $|W/W_{sample}|$ saved conformations.} \\

487: %\hline

488: %\hline

489: %     1 &   0.00 &   0.03 &   0.04 &  150 &  144 &   15\\

490: %     2 &   0.11 &   0.05 &   0.06 &  150 &  449 &   15\\

491: %     3 &   0.06 &   0.05 &   0.07 &  120 &   36 &   12\\

492: %     4 &   0.08 &   0.07 &   0.08 &  140 &  555 &   14\\

493: %     5 &   0.10 &   0.08 &   0.06 &  100 &   10 &   10\\

494: %     6 &   0.13 &   0.12 &   0.18 &  160 &  911 &   16\\

495: %     7 &   0.25 &   0.16 &   0.07 &   80 &    4 &    4\\

496: %     8 &   0.23 &   0.20 &   0.31 &  150 &  141 &   15\\

497: %     9 &   0.21 &   0.22 &   0.15 &  140 &  178 &   14\\

498: %    10 &   0.12 &   0.23 &   0.20 &  120 &   48 &   12\\

499: %    11 &   0.57 &   0.25 &   0.14 &  140 &   14 &   14\\

500: %    12 &   0.05 &   0.27 &   0.19 &  100 &   19 &   10\\

501: %    13 &   0.23 &   0.29 &   0.38 &  140 &  391 &   14\\

502: %    14 &   0.08 &   0.30 &   0.15 &  120 &   12 &   12\\

503: %    15 &   0.72 &   0.35 &   0.23 &  130 &  129 &   13\\

504: %    16 &   0.19 &   0.38 &   0.18 &  130 &   26 &   13\\

505: %    17 &   0.38 &   0.44 &   0.39 &  160 &   16 &   16\\

506: %    18 &   0.38 &   0.51 &   0.28 &  160 &   16 &   16\\

507: %    19 &   0.65 &   0.60 &   0.29 &  100 &   20 &   10\\

508: %    20 &   0.57 &   0.61 &   0.35 &   70 &    7 &    7\\

509: %    21 &   0.48 &   0.63 &   0.32 &  140 &   27 &   14\\

510: %    22 &   0.74 &   0.65 &   0.40 &  140 &  539 &   14\\

511: %    23 &   0.68 &   0.66 &   0.18 &  140 &   28 &   14\\

512: %    24 &   0.38 &   0.71 &   0.24 &  130 &   13 &   13\\

513: %    25 &   0.50 &   0.72 &   0.20 &  100 &    2 &    2\\

514: %    26 &   0.82 &   0.76 &   0.31 &  170 &   17 &   17\\

515: %    27 &   0.50 &   0.78 &   0.14 &  120 &   12 &   12\\

516: %    28 &   0.78 &   0.78 &   0.22 &  180 &   18 &   18\\

517: %    29 &   0.70 &   0.79 &   0.19 &  130 &  189 &   13\\

518: %    30 &   0.77 &   0.79 &   0.17 &  150 &   30 &   15\\

519: %    31 &   0.85 &   0.81 &   0.11 &  130 &   13 &   13\\

520: %    32 &   0.91 &   0.83 &   0.20 &  140 &  401 &   14\\

521: %    33 &   0.90 &   0.85 &   0.27 &  100 &   20 &   10\\

522: %    34 &   0.85 &   0.85 &   0.10 &  120 &   48 &   12\\

523: %    35 &   0.94 &   0.88 &   0.13 &  170 & 1990 &   17\\

524: %    36 &   0.71 &   0.94 &   0.07 &   70 &    7 &    7\\

525: %    37 &   0.95 &   0.95 &   0.06 &  150 &  855 &   15\\

526: %\end{tabular}

527: %\end{ruledtabular}

528: %\end{table}

529: %\endgroup

530:

531: %\clearpage

532:

533: %%---[ FIGURES ]-------------------------------------------------------------

534: %%---------------------------------------------------------------------------

535: %\begin{figure}[h]

536: %\includegraphics[width=7.5cm]{eps/pfpt.pdf}

537: %\caption{Probability distribution for the first passage time (fpt) to the most

538: %populated cluster (\emph{folded state}) of the DRMS 1.2 \AA\

539: %clusterization.}

540: %\label{fig.pfpt}

541: %\end{figure}

542:

543: %\clearpage

544:

545: %\begin{figure*}

546: %%\includegraphics[width=7.0cm, angle=-90]{eps/sigmapfold.eps}

547: %\includegraphics[width=14cm]{eps/sigmapfold.pdf}

548: %\caption{Standard deviation

549: %$\sigma_{p_{fold}}=\sqrt{\left< (p_{fold}(i)-P_{f}[\alpha])^2

550: %\right>_{i \in \alpha}}$ of the $p_{fold}$ for the 37 DRMS clusters used in the study.

551: %\textbf{(A)} $\sigma_{p_{fold}}$ as a function of $P_{f}$ compared to a

552: %Bernoulli distribution (solid line). Ten trials were performed for each

553: %snapshot.  The largest values for the standard deviation are located around the

554: %0.5 region and this is probably due to the Bernoulli process ($\theta=0,1$)

555: %used for the calculation of $p_{fold}$. \textbf{(B)} $\sigma_{p_{fold}}$

556: %dependence on the number of trials used to evaluate $p_{fold}$. The dashed curves are fits with a $\frac{a}{\sqrt{x}}+b$ function.  The horizontal

557: %dashed lines are drawn to help identifying in \textbf{A} the two clusters used

558: %in \textbf{B}.   \textbf{(C)} Dependence of $P_f$  on the number of trials $n_t$ for the two clusters used in \textbf{B}.}

559: %\label{fig.sigmapfold}

560: %\end{figure*}

561:

562: %\clearpage

563:

564: %\begin{figure}[h]

565: %%\includegraphics[width=7.5cm, angle=-90]{eps/scatterpfold.eps}

566: %\includegraphics[width=14cm]{eps/scatterpfold-multi.pdf}

567: %\caption{Cluster folding probability $P_{f}^{C}$. \textbf{(A)} Scatter plot of $P_{f}^C$ versus $P_{f}$.

568: %The DRMS 1.2 \AA\ clusterization and the folding criterion $\Phi$

569: %(reaching the most populated cluster within $\tau_{commit}=5$ ns)

570: %were used. \textbf{(B)} Probability distribution of the $p_{fold}$ value for the 500000 snapshots saved along the $10\ \mu s$ MD trajectory. The folding probability for snapshot $i$ is computed as $p_{fold}(i)=P_f^C[\alpha]$ for $i \in \alpha$. \textbf{(C-E)} Scatter plot of $P_{f}^C$ versus $P_{f}$ for 1.0, 5.0, and 10 $\mu s$ of simulation time, respectively.}

571: %\label{fig.scatterpfold}

572: %\end{figure}

573:

574: %\clearpage

575:

576: %\begin{figure}[h]

577: %\includegraphics[width=10cm]{eps/pairRMSD.pdf}

578: %\caption{Transition state ensemble (TSE) of beta3s. \textbf{(A)} RMSD pairwise distribution for structures with $p_{fold}>0.51$ (native state), $0.49< p_{fold}< 0.51$ (TSE), and $p_{fold}<0.49$ (denatured state). \textbf{(B)} Type I and \textbf{(C)} type II transition states (thin lines). Structures are superimposed on residues 2-11 and 10-19 with an average pairwise RMSD of 0.81 and 0.82 \AA\ for type I and type II, respectively. For comparison, the native state is shown as a thick line with a circle to label the N-terminus.}

579:

580: %\label{fig.pair}

581: %\end{figure}

582:

583: %

584:

585: \clearpage

586:

587: \onecolumngrid

588:

589: % -----------------------------------------------------------

590: %               SUPPLEMENTARY MATERIAL

591: % -----------------------------------------------------------

592: \markright{\centerline{\LARGE\sc Supplementary Material}}

593: \pagestyle{myheadings}

594:

595: \clearpage

596:

597: \setcounter{section}{0}

598: %\renewcommand{\thepage}{S-\arabic{section}}

599:

600: \setcounter{page}{1}

601: \renewcommand{\thepage}{S-\arabic{page}}

602:

603: \setcounter{figure}{0}

604: \renewcommand{\thefigure}{S\arabic{figure}}

605:

606: \setcounter{table}{0}

607: \renewcommand{\thetable}{S-\Roman{table}}

608:

609:

610: \section{Secondary structure clusterization}

611:

612: Recently, secondary structure has been used to cluster the conformation space

613: of peptides (F. Rao et al., JMB 342, 299, 2004). Secondary structure along an MD

614: simulation trajectory can be easily calculated using known algorithms

615: (C.A.F. Andersen et al., Structure 10, 174, 2002).  A cluster is a single string

616: of secondary structure, e.g., the most populated conformation for beta3s is

617: {\tt -EEEESSEEEEEESSEEEE-}, where ``{\tt E}'', ``{\tt S}'', and ``{\tt -}'' stand for

618: extended, bend, and unstructured, respectively.  There are 8 possible ``letters''

619: in the secondary structure ``alphabet'': ``{\tt H}'', ``{\tt G}'', ``{\tt I}'', ``{\tt

620: E}'', ``{\tt B}'', ``{\tt T}'', ``{\tt S}'', and ``{\tt -}'', standing for $\alpha$

621: helix, $3_{10}$ helix, $\pi$ helix, extended, isolated $\beta$-bridge, hydrogen

622: bonded turn, bend, and unstructured, respectively.  Since the N- and C-terminal

623: residues are always assigned a ``{\tt -}'', a 20-residue peptide can in principle

624: assume $8^{18}\simeq 10^{16}$ conformations.
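
The grouping described above can be illustrated with a minimal Python sketch. It assumes that one secondary structure string per snapshot has already been computed by an external tool; the function name {\tt cluster\_by\_sstr} and the toy strings are hypothetical and are not taken from the original analysis.

```python
def cluster_by_sstr(sstr_strings):
    """Group snapshot indices by their secondary structure string.

    Each distinct string defines one cluster (hypothetical helper)."""
    clusters = {}
    for index, sstr in enumerate(sstr_strings):
        clusters.setdefault(sstr, []).append(index)
    return clusters

# Toy per-snapshot strings for a 20-residue peptide (illustrative data only).
snapshots = [
    "-EEEESSEEEEEESSEEEE-",
    "-EEEESSEEEEEESSEEEE-",
    "-EEEETTEEEEEESSEEEE-",
    "-" * 20,
    "-EEEESSEEEEEESSEEEE-",
]

clusters = cluster_by_sstr(snapshots)

# The most populated cluster is the string with the largest number of members.
most_populated = max(clusters, key=lambda s: len(clusters[s]))

# Upper bound on the number of strings: 8 letters on the 18 non-terminal residues.
n_possible = 8 ** 18
```

In this toy input the most populated string collects snapshots 0, 1, and 4, and {\tt n\_possible} evaluates to about $1.8\times 10^{16}$, in line with the estimate given above.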

625:

626: \begin{figure*}[h]

627: \includegraphics[angle=-90,width=70mm]  {stdev.eps}

628: \includegraphics[angle=-90,width=80mm]  {scatterpfold-sstr.eps}

629: \caption{\textbf{(left)} $p_{fold}$ standard deviation inside a cluster for 16

630: secondary structure (\emph{sstr}) and 37 DRMS 1.2 \AA\ clusters.  Both

631: \emph{sstr} and DRMS 1.2 \AA\ clusterizations show similar

632: fluctuations.  \textbf{(right)} Scatter plot of $P_{f}^C$ versus $P_{f}$ for

633: \emph{sstr} clusterization.  In this case the folding criterion is based on

634: the native contacts $Q$ (Settanni et al., PNAS 102, 628, 2005). A folding

635: (unfolding) event occurs when $Q>0.85$ ($Q<0.15$).}

636: \end{figure*}

637:

638: \begin{figure*}[h]

639: \includegraphics[angle=-90,width=90mm]  {prmsd.eps}

640: \includegraphics[angle=-90,width=90mm]  {psigmarmsd.eps}

641: \caption{\textbf{(top)} Probability to have a given pairwise

642: root mean square deviation (RMSD) inside a cluster for the secondary

643: structure (\emph{sstr}) and DRMS 1.2 \AA\ clusterizations. \textbf{(bottom)}

644: Probability to have a given variance for the RMSD inside a cluster.

645: Both plots show that secondary structure clusters are less structurally

646: homogeneous than DRMS 1.2 \AA\ clusters.}

647: \end{figure*}

648:

649: \clearpage

650:

651: \section{First passage times}

652:

653: The first passage time (fpt) to a given cluster $\alpha$ is computed, for each

654: snapshot, as the time elapsed along the MD trajectory until the first

655: subsequent snapshot belonging to $\alpha$. In Fig.\ S3 the fpt distribution

656: to the folded state is

657: shown for two different clusterizations of the conformation space. The double-peak

658: shape of the distribution provides evidence of the different time scales

659: of \emph{intra}-basin and \emph{inter}-basin transitions. The wider shape

660: of the \emph{intra}-basin peak for the secondary structure clusterization is

661: consistent with its higher degree of structural diversity with

662: respect to the DRMS 1.2 \AA\ clusterization (see previous section).
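
As an illustration of this definition, the following Python sketch computes the fpt of every snapshot to a target cluster in a single backward pass over the trajectory. The function name, the cluster labels, and the saving interval {\tt dt} are assumptions made for the example; snapshots after which the target is never visited again are assigned {\tt None}.

```python
def first_passage_times(labels, target, dt=1.0):
    """Time from each snapshot to the first later snapshot in `target`.

    labels : cluster label of every snapshot, in trajectory order
    target : label of the cluster of interest
    dt     : time between saved snapshots (assumed constant)
    Returns a list with None where the target is not reached again."""
    fpt = [None] * len(labels)
    next_hit = None  # index of the nearest later snapshot in the target
    for i in range(len(labels) - 1, -1, -1):
        if next_hit is not None:
            fpt[i] = (next_hit - i) * dt
        if labels[i] == target:
            next_hit = i
    return fpt

# Toy trajectory: "F" marks the folded cluster, "U" any other cluster.
labels = ["U", "U", "F", "F", "U", "F", "U"]
fpt = first_passage_times(labels, target="F", dt=20.0)
```

A histogram of the non-{\tt None} entries of {\tt fpt} would give a distribution of the kind discussed above; the backward scan keeps the cost linear in the number of snapshots.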

663:

664: \begin{figure*}[h]

665: \includegraphics[angle=-90,width=90mm]  {pfpt.eps}

666: \includegraphics[angle=-90,width=90mm]  {pfpt-sstr.eps}

667: \caption{Probability distribution for the first passage times (fpt) to the most

668: populated cluster (\emph{folded state}). \textbf{(top)} DRMS 1.2 \AA\

669: clusterization.  \textbf{(bottom)} Secondary structure clusterization.}

670: \end{figure*}

671:

672: \clearpage

673:

674: \section{Random clusterization}

675:

676: The results of this section were obtained using the DRMS 1.2 \AA\

677: clusterization.

678: In the text, evidence was provided that the standard deviation of $p_{fold}$

679: %

680: \begin{equation*} \sigma_{p_{fold}}=\sqrt{\left< (p_{fold}(i)-P_{f}[\alpha])^2 \right>_{i \in \alpha}}

681: \end{equation*}

682: %

683: is not compatible with that of a Bernoulli distribution.  This means that

684: snapshots in a cluster have similar values of $p_{fold}$ and are kinetically homogeneous.  This is

685: not the case for a random clusterization of the snapshots.  Since it is not

686: feasible to compute $p_{fold}$ for every snapshot of a simulation, the

687: $p_{fold}$ of snapshot $i$ is assumed to be equal to the cluster folding

688: probability $P_f^C$ of its cluster (as computed in the text).

689: Then, snapshots are reshuffled into 50000 random clusters.  The folding

690: probability for a random cluster $\alpha_R$ is computed as $P_f=\left<

691: p_{fold}\right>_{\alpha_R}$.  Most of the snapshots will have $p_{fold}$ close

692: to $1$ or $0$ (see Fig.\ 3B in the text) and because of the random grouping,

693: i.e., no kinetic homogeneity, the above standard deviation $\sigma_{p_{fold}}$

694: resembles the one of a Bernoulli distribution as shown in Fig.\

695: \ref{sigmapfold-null}. Data obtained from the DRMS 1.2 \AA\ clusterization

696: deviate from this behavior (compare Fig.\ 2A and Fig.\ \ref{sigmapfold-null}).

697: Moreover, this deviation becomes larger as the number of trials $n_t$ (in this

698: case $10$) increases (see Fig.\ 1B).
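
The reshuffling test can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the procedure used to produce the figure: the function names are hypothetical, the per-snapshot $p_{fold}$ values are idealized as exactly $0$ or $1$, and only 10 random clusters are used instead of 50000. For values restricted to $0$ and $1$ with mean $P$, the standard deviation equals the Bernoulli value $\sqrt{P(1-P)}$, which is what the random clusters reproduce.

```python
import random
import statistics

def random_cluster_stats(pfold, n_clusters, seed=0):
    """Shuffle per-snapshot p_fold values into random clusters and
    return (mean, standard deviation) of p_fold for each cluster."""
    rng = random.Random(seed)
    shuffled = list(pfold)
    rng.shuffle(shuffled)
    # Deal the shuffled values into n_clusters groups of nearly equal size.
    clusters = [shuffled[k::n_clusters] for k in range(n_clusters)]
    return [(statistics.fmean(c), statistics.pstdev(c)) for c in clusters if c]

def bernoulli_sigma(p):
    """Standard deviation of a 0/1 variable with mean p."""
    return (p * (1.0 - p)) ** 0.5

# Idealized input: 600 snapshots with p_fold = 0 and 400 with p_fold = 1.
pfold = [0.0] * 600 + [1.0] * 400
stats = random_cluster_stats(pfold, n_clusters=10)

# Largest difference between the observed sigma and the Bernoulli value.
max_gap = max(abs(sigma - bernoulli_sigma(mean)) for mean, sigma in stats)
```

With this idealized input {\tt max\_gap} vanishes up to rounding error. A kinetically homogeneous clusterization would instead give standard deviations below the Bernoulli curve.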

699:

700: \begin{figure*}[h]

701: \includegraphics[angle=-90,width=80mm]  {figS4.ps}

702: \caption{Standard deviation $\sigma_{p_{fold}}$ for a random clusterization.

703: Black dots, red curve, blue squares, blue curve show

704: $\sigma_{p_{fold}}$ for the random clusters, its histogram,

705: $\sigma_{p_{fold}}$ for 37 non-random clusters (see text), and its

706: histogram, respectively.}

707: \label{sigmapfold-null}

708: \end{figure*}

709:

710: \end{document}

711: