1: \documentclass[twocolumn,floatfix,prb,amsmath,amssymb,showkeys]{revtex4}
2: \usepackage{amsmath,amssymb}
3: \usepackage{epsfig}
4: \usepackage{graphicx}
5: \usepackage{dcolumn}
6: \usepackage{bm}
7:
8: \begin{document}
9:
10: \title{Estimation of protein folding probability from equilibrium simulations}
11:
12:
13: \author{Francesco Rao}
14: \author{Giovanni Settanni}
15: \author{Enrico Guarnera}
16: \author{Amedeo Caflisch}
17:
18: \email[corresponding author, tel: +41 44 635 55 21,
19: fax: +41 44 635 68 62, e-mail: ]{caflisch@bioc.unizh.ch}
20:
21: \affiliation{Department of Biochemistry, University of Zurich,
22: Winterthurerstrasse 190, CH-8057 Zurich, Switzerland\\
23: tel: +41 44 635 55 21, fax: +41 44 635 68 62,\\
24: e-mail: caflisch@bioc.unizh.ch}
25:
26: \date{\today}
27:
28: \begin{abstract}
29:
30: The assumption that similar structures have similar folding probabilities
31: ($p_{fold}$) leads naturally to a procedure to evaluate $p_{fold}$ for every
32: snapshot saved along an equilibrium folding-unfolding trajectory of a
33: structured peptide or protein. The procedure utilizes a structurally
34: homogeneous clustering and does not require any additional simulation.
35: It can be used to detect multiple folding pathways as shown for a three-stranded antiparallel $\beta$-sheet peptide investigated by implicit solvent molecular dynamics simulations.
36:
37: \end{abstract}
38:
39: \keywords{molecular dynamics; transition state; p$_{fold}$; multiple pathways; denatured state ensemble}
40:
41: \maketitle
42:
\section{Introduction}
44:
The folding probability ($p_{fold}$) of a protein conformation saved along a
Monte Carlo or molecular dynamics (MD) trajectory is the probability that the
conformation folds before it unfolds \cite{Du:On}. It is a useful measure of the
kinetic distance from the folded, i.e., functional, state and can be used to
validate transition state ensemble (TSE) structures, which should have
$p_{fold}\approx 0.5$. Such validation consists of starting a large number of
trajectories from putative TSE structures with varying initial distributions of
velocities and counting those that fold within a ``commitment'' time, which has
to be chosen much longer than the shortest time scales of conformational
fluctuations and much shorter than the average folding time
\cite{Hubner:Commitment}. The concept of $p_{fold}$ calculation originates from
a method for determining transmission coefficients, starting from a known
transition state \cite{Chandler:Statistical}, and from the identification of
simpler transition states in protein dynamics (e.g., tyrosine ring flips)
\cite{Northrup:Dynamical}. The approach has been used to identify the otherwise
very elusive folding TSE by atomistic Monte Carlo off-lattice simulations of
small proteins with a G\=o potential \cite{Li:Constr,Hubner:Commitment}, as well
as by implicit solvent MD \cite{Gsponer:Molecular,Rao:The} and Monte Carlo
\cite{Lenz:Folding} simulations with a physico-chemically based potential. The
large number of trial simulations needed for a reliable evaluation of
$p_{fold}$ makes the estimation of the folding probability computationally very
expensive. For this reason, we propose here a method to estimate folding
probabilities for \textit{all} structures visited in an equilibrium
folding-unfolding trajectory without any additional simulation.
67:
68: \section{Methods}
69:
70: %---[ TABLES ]--------------------------------------------------------------
71: %---------------------------------------------------------------------------
72: \begingroup
73: \squeezetable
74: \begin{table}[h]
75: \caption{\label{list}{\sc DRMS} clusters used for the calculation of $P_{f}$.}
76: \begin{ruledtabular}
77: \begin{tabular}{ccccccc}
78: Cluster &
79: $P_{f}^C$ \footnote{Cluster-$p_{fold}$ [$P_f^C$, Eq.\ \ref{cluster-pfold}].} &
80: $P_{f}$ \footnote{Traditional, i.e., computationally expensive $P_f$ value [Eq.\ \ref{average-pfi}].} &
81: $\sigma_{p_{fold}}$ \footnote{Standard deviation of $p_{fold}$ in a cluster [Eq.\ \ref{sigma-pfold}].} &
82: $N$ \footnote{Total number of trials used to evaluate $P_f$. For every structure $n_{t}=10$ trials were performed \hbox{($N=n_{t}\ W_{sample}$)}
83: except for clusters 7 and 25 for which 20 and 50 trials were performed, respectively.} &
84: $W$ \footnote{Number of snapshots in the cluster.} &
85: $W_{sample}$ \footnote{Number of snapshots used to evaluate $P_f$.
86: The $W_{sample}$ subset was obtained by selecting structures in a cluster every $|W/W_{sample}|$ saved conformations.} \\
87: \hline
88: \hline
89: 1 & 0.00 & 0.03 & 0.04 & 150 & 144 & 15\\
90: 2 & 0.11 & 0.05 & 0.06 & 150 & 449 & 15\\
91: 3 & 0.06 & 0.05 & 0.07 & 120 & 36 & 12\\
92: 4 & 0.08 & 0.07 & 0.08 & 140 & 555 & 14\\
93: 5 & 0.10 & 0.08 & 0.06 & 100 & 10 & 10\\
94: 6 & 0.13 & 0.12 & 0.18 & 160 & 911 & 16\\
95: 7 & 0.25 & 0.16 & 0.07 & 80 & 4 & 4\\
96: 8 & 0.23 & 0.20 & 0.31 & 150 & 141 & 15\\
97: 9 & 0.21 & 0.22 & 0.15 & 140 & 178 & 14\\
98: 10 & 0.12 & 0.23 & 0.20 & 120 & 48 & 12\\
99: 11 & 0.57 & 0.25 & 0.14 & 140 & 14 & 14\\
100: 12 & 0.05 & 0.27 & 0.19 & 100 & 19 & 10\\
101: 13 & 0.23 & 0.29 & 0.38 & 140 & 391 & 14\\
102: 14 & 0.08 & 0.30 & 0.15 & 120 & 12 & 12\\
103: 15 & 0.72 & 0.35 & 0.23 & 130 & 129 & 13\\
104: 16 & 0.19 & 0.38 & 0.18 & 130 & 26 & 13\\
105: 17 & 0.38 & 0.44 & 0.39 & 160 & 16 & 16\\
106: 18 & 0.38 & 0.51 & 0.28 & 160 & 16 & 16\\
107: 19 & 0.65 & 0.60 & 0.29 & 100 & 20 & 10\\
108: 20 & 0.57 & 0.61 & 0.35 & 70 & 7 & 7\\
109: 21 & 0.48 & 0.63 & 0.32 & 140 & 27 & 14\\
110: 22 & 0.74 & 0.65 & 0.40 & 140 & 539 & 14\\
111: 23 & 0.68 & 0.66 & 0.18 & 140 & 28 & 14\\
112: 24 & 0.38 & 0.71 & 0.24 & 130 & 13 & 13\\
113: 25 & 0.50 & 0.72 & 0.20 & 100 & 2 & 2\\
114: 26 & 0.82 & 0.76 & 0.31 & 170 & 17 & 17\\
115: 27 & 0.50 & 0.78 & 0.14 & 120 & 12 & 12\\
116: 28 & 0.78 & 0.78 & 0.22 & 180 & 18 & 18\\
117: 29 & 0.70 & 0.79 & 0.19 & 130 & 189 & 13\\
118: 30 & 0.77 & 0.79 & 0.17 & 150 & 30 & 15\\
119: 31 & 0.85 & 0.81 & 0.11 & 130 & 13 & 13\\
120: 32 & 0.91 & 0.83 & 0.20 & 140 & 401 & 14\\
121: 33 & 0.90 & 0.85 & 0.27 & 100 & 20 & 10\\
122: 34 & 0.85 & 0.85 & 0.10 & 120 & 48 & 12\\
123: 35 & 0.94 & 0.88 & 0.13 & 170 & 1990 & 17\\
124: 36 & 0.71 & 0.94 & 0.07 & 70 & 7 & 7\\
125: 37 & 0.95 & 0.95 & 0.06 & 150 & 855 & 15\\
126: \end{tabular}
127: \end{ruledtabular}
128: \end{table}
129: \endgroup
130:
131:
132: \subsection{Molecular dynamics simulations}
133:
134: Beta3s is a designed 20-residue sequence whose
135: solution conformation has been investigated by NMR spectroscopy
136: \cite{DeAlba:Denovo}. The NMR data indicate that beta3s in aqueous solution
forms a monomeric (up to more than 1~mM concentration) triple-stranded
138: antiparallel $\beta$-sheet, in equilibrium with the
139: denatured state \cite{DeAlba:Denovo}. We have previously shown that in
140: implicit solvent \cite{Ferrara:Evaluation} molecular dynamics simulations
141: beta3s folds reversibly to the NMR solution conformation, irrespective of the
142: starting structure \cite{Ferrara:Folding}.
143: Recently, four molecular dynamics simulations of beta3s were performed at 330 K
144: for a total simulation time of 12.6 $\mu$s \cite{Cavalli:Fast}. There are 72
145: folding events and 73 unfolding events and the average time required to go from
146: the denatured state to the folded conformation is 83 ns. The 12.6 $\mu$s of
147: simulation length is about two orders of magnitude longer than the average
148: folding or unfolding time, which are similar because at 330 K the native and
denatured states are almost equally populated \cite{Cavalli:Fast}. For the
$p_{fold}$ analysis the first 0.65 $\mu$s of each of the four simulations were
neglected. Because coordinates were saved every 20 ps, the remaining 10 $\mu$s of simulation yield a total of 500000 snapshots.
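
As a quick consistency check, the retained simulation time and the number of snapshots follow directly from the values quoted above:
\begin{equation*}
(12.6 - 4\times 0.65)\ \mu\mathrm{s} = 10.0\ \mu\mathrm{s}, \qquad
\frac{10\ \mu\mathrm{s}}{20\ \mathrm{ps}} = 5\times 10^{5}\ \mathrm{snapshots}.
\end{equation*}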
152:
153: The simulations were performed with
154: the program CHARMM {\cite{Brooks:CHARMM}}. Beta3s was modeled by explicitly
155: considering all heavy atoms and the hydrogen atoms bound to nitrogen or oxygen
156: atoms (PARAM19 force field {\cite{Brooks:CHARMM}}). A mean field approximation
157: based on the solvent accessible surface was used to describe the main effects
158: of the aqueous solvent on the solute \cite{Ferrara:Evaluation}. The two
159: surface tension-like parameters of the solvation model were optimized without using beta3s. The
160: same force field and implicit solvent model have been used recently in
161: molecular dynamics simulations of the early steps of ordered aggregation
\cite{Gsponer:Therole}, and folding of structured peptides \cite{Ferrara:Evaluation,Ferrara:Folding}, as well as small
163: proteins of about 60 residues \cite{Gsponer:Role}. Despite
164: the absence of collisions with water molecules, in the simulations with
165: implicit solvent the separation of time scales is comparable with that observed
166: experimentally. Helices fold in about 1 ns \cite{Ferrara:Thermodynamics},
167: $\beta$-hairpins in about 10 ns \cite{Ferrara:Thermodynamics} and
168: triple-stranded $\beta$-sheets in about 100 ns \cite{Cavalli:Fast}, while the
169: experimental values are $\sim$0.1 $\mu$s \cite{Eaton:Fast}, $\sim$1 $\mu$s
170: \cite{Eaton:Fast} and $\sim$10 $\mu$s \cite{DeAlba:Denovo}, respectively.
171:
172: \subsection{Clusterization}
173:
174: \begin{figure}[h]
175: \includegraphics[angle=-90,width=8cm]{fig1.eps}
176: \caption{Probability distribution for the first passage time (fpt) to the most
177: populated cluster (\emph{folded state}) of the DRMS 1.2 \AA\
178: clusterization.}
179: \label{fig.pfpt}
180: \end{figure}
181:
182:
183:
184: % There are several procedures for clustering conformations according
185: % to structural similarity.
The 500000 conformations obtained from the simulations of beta3s (see above) were clustered by the leader algorithm \cite{hartigan}. Briefly, the first structure defines the first cluster, and each
subsequent structure is compared with the first conformation of each cluster found so far and is assigned to the first cluster for which the two are similar.
If the structural deviation (see below) from the first conformation of all of the known clusters exceeds a given threshold, a new cluster is defined.
The leader algorithm is very fast even when analyzing large sets of
structures, as in the present work.
The results presented here were obtained with a structural comparison based on the distance root mean square (DRMS) deviation considering all distances involving C$_\alpha$ and/or C$_\beta$ atoms and a cutoff of 1.2~\AA. This yielded 78183 clusters.
The DRMS and the root mean square deviation of atomic coordinates (upon
optimal superposition) have been shown to be highly correlated \cite{Hubner:Commitment}. The DRMS cutoff of 1.2~\AA\ was chosen on the basis of the distribution of the pairwise DRMS values in a subsample of the wild-type trajectories. The distribution shows
two main peaks that originate from intra- and inter-cluster distances,
respectively (data not shown), and the cutoff is located at the minimum
between the two peaks.
The main findings of this work are also valid
for a clusterization based on secondary structure similarity
\cite{Rao:The} (see Supp.\ Mat.).
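
For concreteness, the DRMS between two conformations $a$ and $b$ is assumed here to follow its standard definition (the expression is not given explicitly in the text, and the symbols $d_{ij}$, $N_{p}$, and $c_k$ are introduced only for this illustration):
\begin{equation*}
\mathrm{DRMS}(a,b)=\sqrt{\frac{1}{N_{p}}\sum_{i<j}\left(d_{ij}^{a}-d_{ij}^{b}\right)^{2}}\ ,
\end{equation*}
where $d_{ij}^{a}$ is the distance between atoms $i$ and $j$ in conformation $a$ and the sum runs over the $N_{p}$ pairs of C$_\alpha$ and C$_\beta$ atoms considered. With this measure, the assignment rule of the leader algorithm can be summarized as follows: a conformation $x$ joins the first cluster $k$ whose first member $c_k$ satisfies $\mathrm{DRMS}(x,c_k)\le 1.2$~\AA, and it defines a new cluster otherwise. With 500000 snapshots and 78183 clusters, the average cluster contains $500000/78183\approx 6.4$ snapshots.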
202:
203:
212:
213: \subsection{Folding probability}
214:
215:
216:
217: \begin{figure*}
218: \includegraphics[width=7.0cm, angle=-90]{fig2.eps}
219: \caption{Standard deviation
220: $\sigma_{p_{fold}}=\sqrt{\left< (p_{fold}(i)-P_{f}[\alpha])^2
221: \right>_{i \in \alpha}}$ of the $p_{fold}$ for the 37 DRMS clusters used in the study.
222: \textbf{(A)} $\sigma_{p_{fold}}$ as a function of $P_{f}$ compared to a
223: Bernoulli distribution (solid line). Ten trials were performed for each
224: snapshot. The largest values for the standard deviation are located around the
225: 0.5 region and this is probably due to the Bernoulli process ($\theta=0,1$)
226: used for the calculation of $p_{fold}$. \textbf{(B)} $\sigma_{p_{fold}}$
227: dependence on the number of trials used to evaluate $p_{fold}$. The dashed curves are fits with a $\frac{a}{\sqrt{x}}+b$ function. The horizontal
dashed lines are drawn to help identify in \textbf{A} the two clusters used
229: in \textbf{B}. \textbf{(C)} Dependence of $P_f$ on the number of trials $n_t$ for the two clusters used in \textbf{B}.}
230: \label{fig.sigmapfold}
231: \end{figure*}
232:
233:
234: For the computation of $p_{fold}$ a criterion ($\Phi$) is needed to determine
235: when the system reaches the folded state.
Given a clusterization of the structures, a natural choice for $\Phi$ is a visit to the most populated cluster, which for
structured peptides and proteins is not degenerate (other criteria are also possible, e.g., a fraction of native contacts $Q$ larger than a given threshold).
238: Given $\Phi$ and a commitment time ($\tau_{commit}$),
239: the folding probability $p_{fold}(i)$ of an MD snapshot $i$ is computed as \cite{Du:On,Hubner:Commitment}
240: %
241: \begin{equation}
p_{fold}(i)=\frac{n_{f}(i)}{n_t(i)}\ ,
243: \label{pfi}
244: \end{equation}
245: %
where $n_{f}(i)$ is the number of trials started from snapshot $i$
that reach the folded state within a time $\tau_{commit}$ and $n_t(i)$ is the total number of trials started from snapshot $i$.
248:
249: Every simulation started from snapshot $i$ can
250: be considered as a Bernoulli trial of a random variable
251: $\theta$ with value 1 (folding within $\tau_{commit}$) or 0
252: (no folding within $\tau_{commit}$). The variable
253: $\theta$ has average and variance on the average of the form:
254: %
255: \begin{equation}
256: \begin{split}
257: \langle \theta \rangle = &
258: p_{fold}\ = \frac{1}{n_t}\sum_{i=1}^{n_t} \theta_i \\
259: \sigma^2_{\left< \theta \right>} = & \frac{1}{n_t}p_{fold}(1-p_{fold})
260: \end{split}
261: \end{equation}
262: %
where $n_t$ is the total number of trials; the accuracy of the $p_{fold}$
value increases with $n_t$.
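
To illustrate why the traditional evaluation is expensive, consider a structure with $p_{fold}=0.5$, for which the variance given above is largest. With $n_t=10$ trials the standard error of the estimate is
\begin{equation*}
\sigma_{\langle\theta\rangle}=\sqrt{\frac{0.5\times 0.5}{10}}\approx 0.16\ ,
\end{equation*}
and reducing it to 0.05 would require $n_t=0.25/0.05^{2}=100$ trials for that single structure.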
265:
In Fig.\ \ref{fig.pfpt} the distribution of the first passage time (fpt)
to the folded state is shown. The double-peak
shape of the distribution provides evidence for the separation of time scales
between \emph{intra}-basin and \emph{inter}-basin transitions.
A value of 5 ns is chosen
for $\tau_{commit}$ because events with
shorter time scales correspond to diffusion within the native free-energy
basin, while events with longer time scales are transitions from other basins
to the native one, i.e., folding/unfolding events \cite{Cavalli:Fast}.
275:
276:
277: \section{Folding probability from equilibrium trajectories}
278:
279: \begin{figure*}[t]
280: \includegraphics[width=11.5cm, angle=-90]{fig3.eps}
281: \caption{Cluster folding probability $P_{f}^{C}$. \textbf{(A)} Scatter plot of $P_{f}^C$ versus $P_{f}$.
282: The DRMS 1.2 \AA\ clusterization and the folding criterion $\Phi$
283: (reaching the most populated cluster within $\tau_{commit}=5$ ns)
were used. \textbf{(B)} Probability distribution of the $p_{fold}$ value for the 500000 snapshots saved along the 10~$\mu$s MD trajectory. The folding probability for snapshot $i$ is computed as $p_{fold}(i)=P_f^C[\alpha]$ for $i \in \alpha$. \textbf{(C-E)} Scatter plot of $P_{f}^C$ versus $P_{f}$ for 1.0, 5.0, and 10~$\mu$s of simulation time, respectively.}
285: \label{fig.scatterpfold}
286: \end{figure*}
287:
288:
289:
The basic assumption of the present work is that conformations that
are structurally similar have the same kinetic behavior, hence they have
similar values of $p_{fold}$. Note that the converse is not necessarily true, as explained in Section IV for the TSE and the denatured state. To exploit this assumption, snapshots saved along
a trajectory are grouped in structurally similar clusters \cite{Symbolic}.
Then, the $\tau_{commit}$-segment of MD trajectory following each snapshot is
analyzed to check if the folding condition $\Phi$ is met (i.e., the snapshot
``folds''). For each cluster, the ratio between the number of snapshots that lead to
folding and the total number of snapshots in the cluster is defined as the
cluster-$p_{fold}$ ($P_{f}^C$; throughout the text uppercase $P$ and lowercase
$p$ refer to folding probability for clusters and individual snapshots,
respectively). This value is an approximation of the $p_{fold}$ of any single
structure in the cluster, which is valid if the cluster consists of structurally
similar conformations. In other words, the occurrence of the folding event for
the snapshots of a given cluster can be considered as a Bernoulli trial of a
random variable $\theta$. The average of $\theta$ and the variance of the average
for the set of snapshots belonging to a given cluster $\alpha$ can be written
as:
307: %
308: \begin{equation}
309: \begin{split}
310: P_{f}^C [\alpha] & = \langle \theta \rangle =
311: \frac{1}{W}\sum_{i=1}^{W} \theta_i\ , \qquad i \in \alpha \\
312: \sigma^2_{\left< \theta \right>} & = \frac{1}{W}P_{f}^C(1-P_{f}^C)
313: \end{split}
314: \label{cluster-pfold}
315: \end{equation}
316: %
317: where $W$ is the number of snapshots in cluster $\alpha$. $P_{f}^C$ is
318: the average folding probability over a set of structurally homogeneous
319: conformations. Using the clustering and the folding criterion $\Phi$
320: introduced above, values of $P_{f}^C$ for the 78183 clusters can be computed
321: by Eq.\ \ref{cluster-pfold}, i.e., the number of conformations of the
322: cluster that fold within 5 ns divided by the total number of conformations
323: belonging to the cluster.
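
As a worked illustration of Eq.\ \ref{cluster-pfold}, consider two of the clusters listed in Tab.~\ref{list}, treating the snapshots of a cluster as independent trials, as the Bernoulli picture does: cluster 7 ($W=4$, $P_f^C=0.25$) and cluster 35 ($W=1990$, $P_f^C=0.94$). The standard errors are
\begin{equation*}
\begin{split}
\sigma_{\langle\theta\rangle}[7] & =\sqrt{\frac{0.25\times 0.75}{4}}\approx 0.22\ ,\\
\sigma_{\langle\theta\rangle}[35] & =\sqrt{\frac{0.94\times 0.06}{1990}}\approx 0.005\ .
\end{split}
\end{equation*}
The statistical error of $P_f^C$ is therefore largest for sparsely populated clusters.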
324:
In this article we provide evidence that the basic assumption mentioned above,
that is, similar conformations have similar folding probabilities, holds in
the case of beta3s, a three-stranded antiparallel $\beta$-sheet peptide investigated by MD \cite{Cavalli:Fast}.
Moreover, we show that the computationally expensive average
329: %
330: \begin{equation}
331: P_{f}[\alpha] = \frac{1}{W}\sum_{i=1}^{W} p_{fold}(i)\ ,
332: \qquad i \in \alpha
333: \label{average-pfi}
334: \end{equation}
335: %
336: which is measured by
337: starting several simulations from each snapshot $i$ in the cluster $\alpha$
338: with $W$ snapshots, is well approximated by $P_{f}^C$ whose evaluation is
339: straightforward.
340:
341:
To test the assumption that similar structures have similar $p_{fold}$ and to
compare the values of $P_{f}^C$ with those obtained from the standard approach
\cite{Du:On}, folding probabilities $P_{f}$ were computed for the structures of
37 clusters by starting several 5 ns MD runs from each structure and counting
those that fold (Eqs.~\ref{pfi} and~\ref{average-pfi}). The 37 clusters chosen
among the 78183 include both highly and sparsely populated clusters with $P_{f}^C$
values evenly distributed in the range between 0 and 1 (see Tab.~\ref{list}). In the
case of large clusters a subset of snapshots is considered for the
computation of $P_{f}$. In those cases $W$ is replaced in Eq.~\ref{average-pfi} by $W_{sample}<W$, which is the number of snapshots involved
in the calculation.
352:
353:
354: \begin{figure*}[p]
355: \includegraphics[width=12cm]{fig4.ps}
356: \caption{Transition state ensemble (TSE) of beta3s. \textbf{(A)} RMSD pairwise distribution for structures with $p_{fold}>0.51$ (native state), $0.49< p_{fold}< 0.51$ (TSE), and $p_{fold}<0.49$ (denatured state). \textbf{(B)} Type I and \textbf{(C)} type II transition states (thin lines). Structures are superimposed on residues 2-11 and 10-19 with an average pairwise RMSD of 0.81 and 0.82 \AA\ for type I and type II, respectively. For comparison, the native state is shown as a thick line with a circle to label the N-terminus.}
357: \label{fig.pair}
358: \end{figure*}
359:
360:
361:
362:
363:
364: The standard deviation of $p_{fold}$ in a cluster is computed as
365: %
366: \begin{equation}
367: \sigma_{p_{fold}}=\sqrt{\left< (p_{fold}(i)-P_{f}[\alpha])^2 \right>_{i \in \alpha}}
368: \label{sigma-pfold}
369: \end{equation}
370: %
In the case of full kinetic inhomogeneity, i.e., random grouping of snapshots,
the $p_{fold}$ value for all snapshots in a given cluster will be equal to 0
or 1, indicating the coexistence (in the same cluster) of structures that
either exclusively fold or exclusively unfold. In this case $\sigma_{p_{fold}}$ reflects
the Bernoulli distribution (see Supp.\ Mat.). Fig.~\ref{fig.sigmapfold}A shows that, even when only $n_t=10$ runs
per snapshot are used to compute $p_{fold}$, the $\sigma_{p_{fold}}$ values are not
compatible with those of a Bernoulli distribution. Moreover, the values of the
standard deviation decrease when the number of trials $n_t$ increases,
as reported in Fig.~\ref{fig.sigmapfold}B for two sample clusters. The asymptotic values
of $\sigma_{p_{fold}}$ ($n_t \rightarrow \infty$) for these two data sets are
0.05 and 0.2. These values cannot reach zero because snapshots in a cluster
are similar but not identical. These results suggest that snapshots inside the same
cluster are kinetically homogeneous and that a statistical description of $p_{fold}$ can
be adopted, that is, folding probabilities are computed as cluster averages
(instead of single snapshots) by means of $P_f$ and $P_f^C$.
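
The Bernoulli limit invoked above can be made explicit with a short derivation, given here as a sketch under the assumption that the reference curve in Fig.~\ref{fig.sigmapfold}A is $\sqrt{P_f(1-P_f)}$ (see also the Supp.\ Mat.). If a fraction $P_f$ of the snapshots in a cluster has $p_{fold}=1$ and the remaining fraction $1-P_f$ has $p_{fold}=0$, Eq.~\ref{sigma-pfold} gives
\begin{equation*}
\sigma^2_{p_{fold}} = P_f(1-P_f)^2 + (1-P_f)P_f^2 = P_f(1-P_f)\ .
\end{equation*}
For cluster 18 of Tab.~\ref{list} ($P_f=0.51$) this limit is $\sqrt{0.51\times 0.49}\approx 0.50$, whereas the measured value is $\sigma_{p_{fold}}=0.28$.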
386:
We still have to verify that $P_{f}^C$ indeed approximates the computationally expensive $P_{f}$. For the 37 clusters mentioned above, a correlation of 0.89 between $P_{f}^C$ and $P_{f}$ is found with a slope of 0.86 (see Fig.~\ref{fig.scatterpfold}A and
Tab.~\ref{list}), indicating that the procedure is able to estimate folding
probabilities for clusters on the folding-transition barrier ($P_{f}\sim 0.5$)
as well as in the folding ($P_{f}\sim 1.0$) or unfolding ($P_{f}\sim 0.0$)
regions. The error bars for $P_{f}^C$ in Fig.~\ref{fig.scatterpfold}A are derived from the
definition of variance given in Eq.\ \ref{cluster-pfold}. In the same spirit
as Eq.\ \ref{cluster-pfold}, the folding probability $P_{f}$ and its variance
are written as
395: %
396: \begin{equation}
397: \begin{split}
398: P_{f} = & \left < \theta \right> = \frac{1}{N}\sum_{i=1}^N \theta_i \\
399: \sigma^2_{\left< \theta \right>} = & \frac{1}{N} P_{f}(1-P_{f})
400: \end{split}
401: \label{sigmapfc}
402: \end{equation}
403: %
where $N=\sum n_t$ is the total number of runs and $\theta$ is equal to 1 or 0
if the run folded or did not fold, respectively. Note that the same number of runs
$n_t$ has been used for every snapshot of a cluster. The large vertical error bars in
Fig.~\ref{fig.scatterpfold}A correspond to clusters with fewer than 10 snapshots. The largest
deviations between $P_{f}$ and $P_{f}^C$ are around the $0.5$ region. This is
due to the limited number of crossings of the folding barrier observed in the
MD simulation (Fig.~\ref{fig.scatterpfold}B, around 70 folding events \cite{Cavalli:Fast}).
Improvements in the accuracy of the estimation of $P_f$ are achieved as the
number of folding events, i.e., the simulation time, increases (Fig.~\ref{fig.scatterpfold}C-E).
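
The cost of the traditional approach can be read off Tab.~\ref{list}. Summing the column $N$ over the 37 clusters gives 4840 runs. Assuming that every run is carried out for the full commitment time of 5~ns (an upper bound if runs are stopped upon folding), the total is
\begin{equation*}
4840 \times 5\ \mathrm{ns} = 24.2\ \mu\mathrm{s}\ ,
\end{equation*}
which is more than twice the 10~$\mu$s of equilibrium simulation from which $P_f^C$ is obtained for all 78183 clusters. As an example of Eq.~\ref{sigmapfc}, for cluster 18 ($N=160$, $P_f=0.51$) the standard error is $\sqrt{0.51\times 0.49/160}\approx 0.04$.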
413:
The two main results of this study, i.e., the kinetic homogeneity of the clusters and the
validity of $P_f^C$ as an approximation of $P_f$, are robust with respect to the choice of the clusterization. Similar results
can also be obtained with different flavors of conformation space partitioning,
as long as they group together structurally homogeneous conformations, e.g., clusterization based on root mean square deviation of atomic coordinates (RMSD) or secondary structure strings (see Supp.\ Mat.).
The latter are appropriate for structured peptides but not for proteins with irregular secondary structure because of string degeneracy. Note that partitions
based on order parameters (like the number of native contacts) are usually unsatisfactory and
not robust, mainly because clusters defined in this way
are characterized by large structural heterogeneities \cite{Rao:The}.
422:
423: \section{Analysis of transition state ensemble}
424:
The folding probability of structure $i$ is estimated as $p_{fold}(i)=P_{f}^{C}[\alpha]$ for $i \in \alpha$. This approximation makes it possible to plot the pairwise RMSD distribution of beta3s structures with
$p_{fold}>0.51$ (native state), $0.49< p_{fold}< 0.51$ (transition state ensemble, TSE), and $p_{fold}<0.49$ (denatured state) (Fig.~\ref{fig.pair}A). For the native state, the distribution is peaked around low values of RMSD ($\sim 1.5$~\AA), indicating that structures with $p_{fold}>0.51$ are structurally similar and belong to a non-degenerate state. The statistical weight of this group of structures is 49.4\%, which corresponds to the expected statistics for the native state because the simulations are performed at the melting temperature.
In the case of the TSE, the distribution is broad because of the coexistence of heterogeneous structures. This scenario is compatible with the presence of multiple folding pathways. Beta3s folding was already shown to involve two main average pathways depending on the order of formation of the two hairpins \cite{Ferrara:Folding,Rao:The}. Here, a \textit{naive} approach based on the number of native contacts \cite{Ferrara:Folding} is used to structurally characterize the folding barrier. TSE structures in which the first hairpin has more native contacts than the second hairpin are called type I conformations (Fig.~\ref{fig.pair}B); otherwise they are called type II (Fig.~\ref{fig.pair}C). In both cases the transition state is characterized by the presence of one of the two native hairpins, while the rest of the peptide is mainly unstructured. These findings are also in agreement with
the complex network analysis of beta3s reported in Ref.~\onlinecite{Rao:The}. Finally, the denatured state shows a broad pairwise RMSD distribution around even larger values of RMSD ($\sim 5.5$~\AA), indicating the presence of highly heterogeneous conformations.
429:
430:
431:
432: \section{Conclusions}
433:
434: Two main results have emerged from the present study. First, snapshots grouped in structurally homogeneous clusters are
435: characterized by similar values of $p_{fold}$. This result justifies the use of
436: a statistical approach for the study of the kinetic properties of the
437: structures sampled along a simulation. Second, given a set of structurally
438: homogeneous clusters and a folding criterion, it is possible to obtain a first
439: approximation of the folding probability for every structure sampled along an
440: equilibrium folding-unfolding simulation. Thus, the cluster-$p_{fold}$ is a quantitative
measure of the kinetic distance from the native state and is computationally very cheap \cite{Computation}. Furthermore, it can be used to detect multiple folding pathways. The accuracy in the identification of the transition state ensemble improves as the number of folding events observed in the simulation increases.
442: %
443: %\footnote{The computation of $P_{f}^C$ presented in this work takes few seconds
444: %on a desktop computer}.
445: %
Recently, the cluster-$p_{fold}$ approach has been used to identify the transition state ensemble of a large set of beta3s mutants (for a total of 0.65~ms of simulation time \cite{Settanni:Phi}), which would have been impossible with traditional methods. As a further
application, the cluster-$p_{fold}$ procedure can be used to validate TSE conformations
obtained by widespread G\=o models.
449:
450:
451: \begin{acknowledgments}
We thank Stefanie Muff for useful and stimulating discussions and comments on the manuscript.
453: We also thank Dr.\ Emanuele Paci for interesting discussions.
454: We acknowledge an anonymous referee for suggesting the use of cluster-$p_{fold}$ to detect multiple pathways.
455: The molecular
456: dynamics simulations were performed on the Matterhorn Beowulf cluster at the
457: Informatikdienste of the University of Zurich. We thank C.\ Bollinger, Dr.\
458: T.\ Steenbock, and Dr.\ A.\ Godknecht for setting up and maintaining the
cluster. This work was supported by the Swiss National Science Foundation
grant No.\ 205321-105946/1.
461: \end{acknowledgments}
462:
463: %\bibliography{/home/caflisch/tex/a-bib}
464: \bibliography{a-bib}
465:
466: %\end{document}
467:
468: %\clearpage
469:
530:
531: %\clearpage
532:
533: %%---[ FIGURES ]-------------------------------------------------------------
534: %%---------------------------------------------------------------------------
535: %\begin{figure}[h]
536: %\includegraphics[width=7.5cm]{eps/pfpt.pdf}
537: %\caption{Probability distribution for the first passage time (fpt) to the most
538: %populated cluster (\emph{folded state}) of the DRMS 1.2 \AA\
539: %clusterization.}
540: %\label{fig.pfpt}
541: %\end{figure}
542:
543: %\clearpage
544:
545: %\begin{figure*}
546: %%\includegraphics[width=7.0cm, angle=-90]{eps/sigmapfold.eps}
547: %\includegraphics[width=14cm]{eps/sigmapfold.pdf}
548: %\caption{Standard deviation
549: %$\sigma_{p_{fold}}=\sqrt{\left< (p_{fold}(i)-P_{f}[\alpha])^2
550: %\right>_{i \in \alpha}}$ of the $p_{fold}$ for the 37 DRMS clusters used in the study.
551: %\textbf{(A)} $\sigma_{p_{fold}}$ as a function of $P_{f}$ compared to a
552: %Bernoulli distribution (solid line). Ten trials were performed for each
553: %snapshot. The largest values for the standard deviation are located around the
554: %0.5 region and this is probably due to the Bernoulli process ($\theta=0,1$)
555: %used for the calculation of $p_{fold}$. \textbf{(B)} $\sigma_{p_{fold}}$
556: %dependence on the number of trials used to evaluate $p_{fold}$. The dashed curves are fits with a $\frac{a}{\sqrt{x}}+b$ function. The horizontal
557: %dashed lines are drawn to help identifying in \textbf{A} the two clusters used
558: %in \textbf{B}. \textbf{(C)} Dependence of $P_f$ on the number of trials $n_t$ for the two clusters used in \textbf{B}.}
559: %\label{fig.sigmapfold}
560: %\end{figure*}
561:
562: %\clearpage
563:
564: %\begin{figure}[h]
565: %%\includegraphics[width=7.5cm, angle=-90]{eps/scatterpfold.eps}
566: %\includegraphics[width=14cm]{eps/scatterpfold-multi.pdf}
567: %\caption{Cluster folding probability $P_{f}^{C}$. \textbf{(A)} Scatter plot of $P_{f}^C$ versus $P_{f}$.
568: %The DRMS 1.2 \AA\ clusterization and the folding criterion $\Phi$
569: %(reaching the most populated cluster within $\tau_{commit}=5$ ns)
570: %were used. \textbf{(B)} Probability distribution of the $p_{fold}$ value for the 500000 snapshots saved along the $10\ \mu s$ MD trajectory. The folding probability for snapshot $i$ is computed as $p_{fold}(i)=P_f^C[\alpha]$ for $i \in \alpha$. \textbf{(C-E)} Scatter plot of $P_{f}^C$ versus $P_{f}$ for 1.0, 5.0, and 10 $\mu s$ of simulation time, respectively.}
571: %\label{fig.scatterpfold}
572: %\end{figure}
573:
574: %\clearpage
575:
576: %\begin{figure}[h]
577: %\includegraphics[width=10cm]{eps/pairRMSD.pdf}
578: %\caption{Transition state ensemble (TSE) of beta3s. \textbf{(A)} RMSD pairwise distribution for structures with $p_{fold}>0.51$ (native state), $0.49< p_{fold}< 0.51$ (TSE), and $p_{fold}<0.49$ (denatured state). \textbf{(B)} Type I and \textbf{(C)} type II transition states (thin lines). Structures are superimposed on residues 2-11 and 10-19 with an average pairwise RMSD of 0.81 and 0.82 \AA\ for type I and type II, respectively. For comparison, the native state is shown as a thick line with a circle to label the N-terminus.}
579:
580: %\label{fig.pair}
581: %\end{figure}
582:
583: %
584:
585: \clearpage
586:
587: \onecolumngrid
588:
589: % -----------------------------------------------------------
590: % SUPPLEMENTARY MATERIAL
591: % -----------------------------------------------------------
592: \markright{\centerline{\LARGE\sc Supplementary Material}}
593: \pagestyle{myheadings}
594:
595: \clearpage
596:
597: \setcounter{section}{0}
598: %\renewcommand{\thepage}{S-\arabic{section}}
599:
600: \setcounter{page}{1}
601: \renewcommand{\thepage}{S-\arabic{page}}
602:
603: \setcounter{figure}{0}
604: \renewcommand{\thefigure}{S\arabic{figure}}
605:
606: \setcounter{table}{0}
607: \renewcommand{\thetable}{S-\Roman{table}}
608:
609:
610: \section{Secondary structure clusterization}
611:
Recently, secondary structure has been used to cluster the conformation space
of peptides (F. Rao et al., JMB 342, 299, 2004). The secondary structure along an MD
simulation trajectory can be readily calculated using known algorithms
(C.A.F. Andersen et al., Structure 10, 174, 2002). A cluster is a single string
of secondary structure, e.g., the most populated conformation for beta3s is
{\tt -EEEESSEEEEEESSEEEE-}, where ``{\tt E}'', ``{\tt S}'', and ``{\tt -}'' stand for
extended, bend, and unstructured, respectively. There are 8 possible ``letters''
in the secondary structure ``alphabet'': ``{\tt H}'', ``{\tt G}'', ``{\tt I}'',
``{\tt E}'', ``{\tt B}'', ``{\tt T}'', ``{\tt S}'', and ``{\tt -}'', standing for $\alpha$-helix,
$3_{10}$ helix, $\pi$-helix, extended, isolated $\beta$-bridge, hydrogen-bonded
turn, bend, and unstructured, respectively. Since the N- and C-terminal
residues are always assigned a ``{\tt -}'', a 20-residue peptide can in principle
assume $8^{18}\simeq 10^{16}$ conformations.
625:
626: \begin{figure*}[h]
627: \includegraphics[angle=-90,width=70mm] {stdev.eps}
628: \includegraphics[angle=-90,width=80mm] {scatterpfold-sstr.eps}
\caption{\textbf{(left)} Standard deviation of $p_{fold}$ within a cluster for 16
secondary structure (\emph{sstr}) and 37 DRMS 1.2 \AA\ clusters. The
\emph{sstr} and DRMS 1.2 \AA\ clusterizations show similar
fluctuations. \textbf{(right)} Scatter plot of $P_{f}^C$ versus $P_{f}$ for
the \emph{sstr} clusterization. In this case the folding criterion is based on
the native contacts $Q$ (Settanni et al., PNAS 102, 628, 2005). A folding
(unfolding) event occurs when $Q>0.85$ ($Q<0.15$).}
636: \end{figure*}
637:
638: \begin{figure*}[h]
639: \includegraphics[angle=-90,width=90mm] {prmsd.eps}
640: \includegraphics[angle=-90,width=90mm] {psigmarmsd.eps}
\caption{\textbf{(top)} Probability distribution of the pairwise
root mean square deviation (RMSD) within a cluster for the secondary
structure (\emph{sstr}) and DRMS 1.2 \AA\ clusterizations. \textbf{(bottom)}
Probability distribution of the variance of the RMSD within a cluster.
Both plots show that secondary structure clusters are less structurally
homogeneous than DRMS 1.2 \AA\ clusters.}
647: \end{figure*}
648:
649: \clearpage
650:
651: \section{First passage times}
652:
The first passage time (fpt) to a given cluster $\alpha$ is the time elapsed
along the MD trajectory between a given snapshot and the first
subsequent snapshot belonging to $\alpha$. Fig.\ S3 shows the fpt distribution
to the folded state for two different clusterizations of the conformation
space. The double-peak shape of the distribution provides evidence of the
different time scales of \emph{intra}-basin and \emph{inter}-basin
transitions. The wider \emph{intra}-basin peak for the secondary structure
clusterization is consistent with its higher degree of structural diversity with
respect to the DRMS 1.2 \AA\ clusterization (see previous section).
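
As a sketch of the definition above, in notation that is assumed here and does
not appear in the main text, let $x_i$ denote the snapshot saved at time $t_i$
along the trajectory. The fpt from snapshot $i$ to cluster $\alpha$ may then be
written as
%
\begin{equation*}
\mathrm{fpt}_{\alpha}(i)=\min_{j>i}\left\{\, t_j-t_i \;:\; x_j\in\alpha \,\right\},
\end{equation*}
%
where the strict inequality $j>i$ is one reading of ``first subsequent
snapshot''. Whether a snapshot that already belongs to $\alpha$ is instead
assigned a vanishing fpt is not specified by the definition above.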
663:
664: \begin{figure*}[h]
665: \includegraphics[angle=-90,width=90mm] {pfpt.eps}
666: \includegraphics[angle=-90,width=90mm] {pfpt-sstr.eps}
667: \caption{Probability distribution for the first passage times (fpt) to the most
668: populated cluster (\emph{folded state}). \textbf{(top)} DRMS 1.2 \AA\
669: clusterization. \textbf{(bottom)} Secondary structure clusterization.}
670: \end{figure*}
671:
672: \clearpage
673:
674: \section{Random clusterization}
675:
The results of this section were obtained using the DRMS 1.2 \AA\
clusterization.
Evidence was provided in the main text that the standard deviation of $p_{fold}$ within a cluster,
679: %
680: \begin{equation*} \sigma_{p_{fold}}=\sqrt{\left< (p_{fold}(i)-P_{f}[\alpha])^2 \right>_{i \in \alpha}}
681: \end{equation*}
682: %
is not compatible with that of a Bernoulli distribution. This means that
snapshots in a cluster have similar values of $p_{fold}$, i.e., they are kinetically homogeneous. This is
not the case for a random clusterization of the snapshots. Since it is not
feasible to compute $p_{fold}$ for every snapshot of a simulation, the
$p_{fold}$ of snapshot $i$ is assumed to be equal to the cluster folding
probability $P_f^C$ of its cluster (as computed in the text).
The snapshots are then reshuffled into 50000 random clusters. The folding
probability for a random cluster $\alpha_R$ is computed as $P_f=\left<
p_{fold}\right>_{\alpha_R}$. Most of the snapshots have $p_{fold}$ close
to $1$ or $0$ (see Fig.\ 3B in the text) and, because the random grouping
lacks kinetic homogeneity, the above standard deviation $\sigma_{p_{fold}}$
resembles that of a Bernoulli distribution, as shown in Fig.\
\ref{sigmapfold-null}. Data obtained from the DRMS 1.2 \AA\ clusterization
deviate from this behavior (compare Fig.\ 2A and Fig.\ \ref{sigmapfold-null}).
Moreover, this deviation becomes larger as the number of trials $n_t$ (10 in
this case) increases (see Fig.\ 1B).
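
The Bernoulli reference can be made explicit with a short calculation, given
here as a sketch under the idealized assumption that every snapshot in a random
cluster $\alpha_R$ has $p_{fold}$ exactly equal to $0$ or $1$. In that limit
$p_{fold}(i)^2=p_{fold}(i)$, so that
%
\begin{equation*}
\sigma_{p_{fold}}^2=\left< p_{fold}^2\right>_{\alpha_R}-P_f^2
=P_f-P_f^2=P_f\,(1-P_f),
\end{equation*}
%
i.e., $\sigma_{p_{fold}}=\sqrt{P_f\,(1-P_f)}$. This expression vanishes at
$P_f=0$ and $P_f=1$ and reaches its maximum value of $0.5$ at $P_f=0.5$.
A cluster whose snapshots have similar $p_{fold}$ values falls below this curve.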
699:
700: \begin{figure*}[h]
701: \includegraphics[angle=-90,width=80mm] {figS4.ps}
\caption{Standard deviation $\sigma_{p_{fold}}$ for a random clusterization.
Black dots and the red curve show $\sigma_{p_{fold}}$ for the random clusters
and its histogram, respectively. Blue squares and the blue curve show
$\sigma_{p_{fold}}$ for the 37 non-random clusters (see text) and its
histogram, respectively.}
707: \label{sigmapfold-null}
708: \end{figure*}
709:
710: \end{document}
711: