0703:physics0703222/go4.tex

1: \documentclass[prb,twocolumn,amsmath,amssymb]{revtex4}

2: %\documentclass[prb,preprint,amsmath,amssymb]{revtex4}

3:

4: \usepackage{graphicx} %figures

5: \bibliographystyle{apsrev}

6:

7: \newcommand{\taud}{\tau_{\text{dec}}}

8: \newcommand{\tsim}{t_{\text{sim}}}

9:

10: \begin{document}

11:

12: \title{Demonstrated convergence of the equilibrium ensemble for a fast

13:     united-residue protein model}

14: \author{F.\ Marty Ytreberg\footnote{E-mail: ytreberg@uidaho.edu}}

15: \affiliation{Department of Physics,

16:     University of Idaho, Moscow, ID 83844-0903}

17: \author{Svetlana Kh.\ Aroutiounian\footnote{The first two authors contributed

18:     equally}}

19: \affiliation{Department of Physics, Dillard University,

20: 2601 Gentilly Blvd., New Orleans, LA 70122}

21: \author{Daniel M.\ Zuckerman\footnote{E-mail: dmz@ccbb.pitt.edu}}

22: \affiliation{Department of Computational Biology,

23:     University of Pittsburgh, 3064 BST-3, Pittsburgh, PA 15213}

24: \date{\today}

25:

26: \begin{abstract}

27: Due to the time-scale limitations of all-atom simulation of proteins,

28: there has been substantial interest in coarse-grained approaches.

29: Some methods, like ``Resolution Exchange,''

30: [E.\ Lyman \emph{et al.}, Phys.\ Rev.\ Lett.\ {\bf 96}, 028105 (2006)]

31: can accelerate canonical

32: all-atom sampling, but require properly distributed coarse ensembles.

33: We therefore demonstrate that full sampling can indeed be achieved in a

34: sufficiently simplified protein model, as verified by a recently

35: developed convergence analysis.

36: The model accounts for protein backbone geometry in that rigid

37: peptide planes rotate according to atomistically defined dihedral

38: angles, but there are only two degrees of freedom

39: ($\phi$ and $\psi$ dihedrals) per residue.

40: Our convergence analysis indicates that small proteins

41: (up to 89 residues in our tests) can be simulated for more than

42: 50 ``structural decorrelation times'' in less than a week on

43: a single processor.

44: We show that the fluctuation behavior is reasonable,

45: and discuss applications, limitations, and extensions of the model.

46: \end{abstract}

47:

48: \maketitle

49:

50: \section{Introduction}

51: How simplified must a molecular model of a protein be for

52: it to allow full canonical sampling?

53: This question may be important to the solution of the protein

54: sampling problem---the generation of protein structures properly

55: distributed according to statistical mechanics---because of the

56: well-known inadequacy of all-atom simulations, which

57: are limited to sub-microsecond timescales.

58: Even small peptides have proven slow to reach convergence

59: \cite{lyman-converge}.

60: Sophisticated atomistic methods, moreover, which often employ

61: elevated temperatures \cite{swendsen-repx,nemoto,hansmann,okamoto,garcia-repx},

62: have yet to show they can overcome the remaining gap in

63: timescales \cite{zuckerman-barriers}---which is

64: generally considered to be several orders of magnitude.

65: On the other hand, because of the drastically reduced numbers

66: of degrees of freedom and smoother landscapes,

67: coarse-grained models

68: (e.g., Refs.\ \onlinecite{levitt-nature,go,scheraga75,kuntz-coarse,

69: miyazawa,skolnick,wolynes,dill,thirumalai,

70: friesner,jernigan-bahar,karplus97,scheraga97a,scheraga97b,clementi-pnas,hall,

71: shakhnovich,voth-forcematching,zuckerman-cam})

72: may have the potential to aid the ultimate solution

73: to the sampling problem, particularly in light of recently developed

74: algorithms like ``Resolution Exchange''

75: \cite{lyman-resx,lyman-resx2}

76: and related methods \cite{luo-coarse,vangunsteren-resx,voth-resx}.

77:

78: Although the Resolution Exchange approach, in principle, can produce

79: properly distributed atomistic ensembles of protein configurations,

80: it requires full sampling at the coarse-grained

81: level \cite{lyman-resx,lyman-resx2}.

82: While the potential for such full sampling has been suggested by

83: some studies of folding and conformational change

84: (e.g., Refs.\ \onlinecite{clementi-jmb,zuckerman-cam}),

85: convergence has yet to be carefully quantified in equilibrium

86: sampling of folded proteins.

87: How much coarse-graining really is necessary?

88: What is the precise computational cost of different approaches?

89: This report begins to answer these questions by studying

90: a united-residue model with realistic backbone geometry.

91:

92: We will require a quantitative method for assessing sampling.

93: A number of approaches have been suggested

94: \cite{brooks-converge,thirumalai-converge,vangunsteren-converge,

95: pande-converge,lyman-converge},

96: but we rely on a recently proposed statistical approach which

97: directly probes the fundamental configuration-space distribution

98: \cite{lyman-converge,lyman-converge2}.

99: The method does not require knowledge of important

100: configurational states or any parameter fitting.

101: In essence, the approach attempts to answer the most

102: fundamental statistical question,

103: ``What is the minimum time interval between snapshots so

104: that a set of structures will behave as if each member

105: were drawn independently from the configuration-space

106: distribution exhibited by the full trajectory?''

107: This interval is termed the structural decorrelation time $\taud$,

108: and the goal is to generate simulations of length $\tsim \gg \taud$.

109:

110: In this report, we demonstrate the convergence

111: of the equilibrium ensemble for several proteins using a fast,

112: united-residue model employing rigid peptide planes.

113: The relative motion of the planes is determined by the

114: \emph{atomistic} geometry embodied in the $\phi$ and $\psi$

115: dihedral-angle rotations, as explained below.

116: We believe such realistic backbone geometry will

117: be necessary for success in Resolution Exchange studies.

118: The use of geometric look-up tables enables the rapid use of

119: only two degrees of freedom per residue ($\phi$ and $\psi$),

120: and one interaction site at the alpha-carbon.

121: The simulations are therefore extremely fast.

122: G{\=o} interactions stabilize the native state

123: while permitting substantial fluctuations in the overall backbone.

124:

125: After the model and the simulation approach are explained,

126: the fluctuations are compared with experimental data

127: from X-ray temperature factors and the diversity of NMR structure sets.

128: The simulations are then analyzed for convergence and timing.

129:

130: \begin{figure}

131:     \begin{center}

132: 	\includegraphics[width=5.7cm,clip]{pep-plane.eps}

133:     \end{center}

134:     \caption{\label{fig-rigid}

135: 	The rigid peptide plane model used in this study.

136: 	Note that, in the coarse-grained simulations,

137: 	only alpha-carbons are represented,

138: 	and the only degrees of freedom are $\phi$ and $\psi$.

139: 	Other atoms are shown in the figure only to clarify

140:         the geometry and our

141: 	assumption of rigid peptide planes.

142:     }

143: \end{figure}

144:

145: \section{Coarse-grained model}

146: The coarse-grained model used for this study

147: was chosen to meet several criteria:

148: (i) the fewest number of degrees of freedom per residue;

149: (ii) the ability to utilize lookup tables for enhanced simulation speed;

150: (iii) the stability of the native state along with the potential

151: for substantial non-native fluctuations; and,

152: (iv) the ability to allow the addition of

153: chemical detail, as simply as possible.

154: Thus, we chose a rigid peptide plane model with G{\=o}

155: interactions \cite{go,go2,go-pnas}

156: and sterics based on alpha-carbon interaction sites as shown

157: in Fig.\ \ref{fig-rigid}.

158: The use of such a simple model, we emphasize, is consistent

159: with our goal of understanding both the potential, and the

160: limitations of coarse models for statistically valid sampling.

161: Once we have understood the costs associated with the present model,

162: we can design more realistic models, as discussed below.

163: In other words, we made no attempt to design the most

164: chemically realistic coarse-grained model,

165: although we believe the use of atomistic peptide geometry

166: is an improvement over a coarse model we considered

167: previously \cite{zuckerman-cam}.

168:

169: The rigid peptide planes allow the use of only two degrees

170: of freedom per residue, arguably the fewest that one would consider

171: in such a model.

172: Indeed, this is fewer than in a freely rotating chain,

173: although admittedly our model requires somewhat more

174: complex simulation moves, described below.

175:

176: G{\=o} interactions were used because they simultaneously

177: stabilize the native

178: state of the protein and also permit reasonable equilibrium

179: fluctuations, as was shown in an earlier study \cite{zuckerman-cam}.

180: Given our interest in native-state fluctuations and the lack of a

181: \emph{universal} coarse-grained model capable of stabilizing the

182: native state for \emph{any} protein, G{\=o} interactions are a

183: natural choice for enforcing stability.

184: Further, beyond the reasonable ``local'' fluctuations shown below,

185: the model also exhibits partial unfolding events which are

186: expected both theoretically and

187: experimentally \cite{falke,englander,kern}.

188:

189: Because we see the present model as only a first step in the

190: development of better models, it is important that it easily

191: allows for the addition of

192: chemical detail, such as Ramachandran propensities which

193: require only the dihedral angles we use explicitly \cite{richardson}.

194: Furthermore, with a rigid peptide plane, the locations

195: of all backbone atoms---and the beta carbon---are known implicitly.

196: Thus hydrogen-bonding and hydrophobic interactions \cite{dill} can

197: be included in the model with little effort.

198: In other words, the ``extendibility'' of the

199: present simple model was a significant factor in its design.

200:

201: \subsection{Potential energy of model system}

202: The total potential used in the model is given by

203: \begin{equation}

204:     U = U^{\rm nat} + U^{\rm non},

205:     \label{eq-u}

206: \end{equation}

207: where $U^{\rm nat}$ is the total energy for native contacts,

208: and $U^{\rm non}$ is the total energy for non-native contacts.

209:

210: For the G{\=o} interactions, all residues that are separated by a distance

211: \emph{less} than $R_{\rm cut}$ in the experimental structure are

212: given native interaction energies defined by a square well:

213: \begin{align}

214:     U^{\rm nat} &= \sum_{ \{i<j\} }^{\rm native} u^{\rm nat}(r_{ij}),

215: 	\nonumber \\

216:     u^{\rm nat}(r_{ij}) &= \left\{

217:     \begin{array}{l}

218: 	\infty \;\; {\rm if} \;\; r_{ij} < r_{ij}^{\rm nat}(1-\delta)\\

219: 	-\epsilon \;\; {\rm if} \;\;

220: 	    r_{ij}^{\rm nat}(1-\delta) \leq r_{ij}

221: 	    < r_{ij}^{\rm nat}(1+\delta)\\

222: 	0 \;\; {\rm otherwise}

223:     \end{array}

224:     \right.,

225:     \label{eq-un}

226: \end{align}

227: where $r_{ij}$ is the $C_\alpha-C_\alpha$

228: distance between residue $i$ and $j$, $r_{ij}^{\rm nat}$

229: is the distance between the residues in the experimental structure,

230: $\epsilon$ determines the energy scale of the native

231: G{\=o} attraction,

232: and $\delta$ is a parameter to choose the width of the well.

233: All residues that are separated by \emph{more} than $R_{\rm cut}$ in the

234: experimental structure are

235: given non-native interaction energies defined by

236: \begin{align}

237:     U^{\rm non} &= \sum_{ \{i<j\} }^{\rm non-native} u^{\rm non}(r_{ij}),

238: 	\nonumber \\

239:     u^{\rm non}(r_{ij}) &= \left\{

240:     \begin{array}{l}

241: 	\infty \;\; {\rm if} \;\; r_{ij} < (\rho_i+\rho_j)(1-\delta)\\

242: 	+h\epsilon \;\; {\rm if} \;\; (\rho_i+\rho_j)(1-\delta)

243: 	    \leq r_{ij} < R_{\rm cut}\\

244: 	0 \;\; {\rm otherwise}

245:     \end{array}

246:     \right.,

247:     \label{eq-unn}

248: \end{align}

249: where $\rho_i$ is the hard-core radius of residue $i$ defined as half the

250: $C_\alpha$ distance to the nearest non-covalently-bonded residue,

251: and $h$ determines the strength of the repulsive interaction.

252:

253: For this study, parameters were chosen to be similar to those

254: in Ref.\ \onlinecite{zuckerman-cam}, i.e.,

255: $\epsilon=1.0$, $h=0.3$, $\delta=0.1$, and $R_\text{cut}=8.0$ \AA.
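
As a brief illustration of Eq.\ (\ref{eq-un}) with these parameters, consider a
hypothetical native pair with $r_{ij}^{\rm nat} = 6.0$ \AA\ (a value chosen only
for this example, not taken from any of the proteins studied).
With $\delta = 0.1$ the well boundaries are
\[
    r_{ij}^{\rm nat}(1-\delta) = 5.4\ \text{\AA}, \qquad
    r_{ij}^{\rm nat}(1+\delta) = 6.6\ \text{\AA},
\]
so the pair energy is infinite for $r_{ij} < 5.4$ \AA,
equals $-\epsilon = -1.0$ for $5.4\ \text{\AA} \leq r_{ij} < 6.6\ \text{\AA}$,
and vanishes otherwise.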

256:

257: \subsection{Monte Carlo simulation}

258: The protein fluctuations were generated using

259: Metropolis Monte Carlo \cite{metropolis}.

260: Trial configurations were generated by adding a random Gaussian

261: deviate to the values of three sequential pairs of backbone torsions

262: (three $\phi$ and three $\psi$ angles).

263: We found that changing six sequential backbone

264: torsions maximizes the rate of convergence of the equilibrium ensemble

265: (data not shown).

266: The energy of the trial configuration was

267: then determined using Eq.\ (\ref{eq-u}), and the conformation

268: was accepted with probability $\min (1,e^{-\Delta U/k_BT} )$,

269: where $\Delta U$ is the total change in potential energy of the system.

270: The width of the Gaussian distribution for generating random deviates

271: was chosen such that the acceptance

272: ratio was about 40\% for all simulations.

273: The choice of temperature is discussed below.
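
To give a sense of scale for the acceptance rule, the following worked numbers
assume a trial move that changes exactly one contact and use $k_BT = 0.5$,
the value adopted for protein G; the single-contact scenario is an
illustrative assumption, not a description of typical moves.
Breaking one native contact costs $\Delta U = +\epsilon = 1.0$, and forming one
non-native contact costs $\Delta U = +h\epsilon = 0.3$, so that
\[
    e^{-\Delta U/k_BT} = e^{-2.0} \approx 0.14
    \quad \text{and} \quad
    e^{-\Delta U/k_BT} = e^{-0.6} \approx 0.55,
\]
respectively.
Any move that produces a hard-core overlap has $\Delta U = \infty$ and is always rejected.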

274:

275: \subsection{Use of lookup tables}

276: The speed of the coarse-grained simulation was enhanced by using

277: lookup tables to avoid unnecessary computation.

278: In general, utilizing lookup tables increases memory

279: usage while decreasing the number of computations.

280: Since memory is inexpensive and can be expanded easily,

281: utilizing as much memory as possible can be an effective

282: technique for increasing the speed of simulations.

283:

284: In our model there are only two degrees of freedom per residue

285: ($\phi,\psi$), but $C_\alpha$ distances $r_{ij}$ must be

286: computed to determine native and non-native interaction energies

287: given by Eqs.\ (\ref{eq-un}) and (\ref{eq-unn}).

288: All peptide planes are considered to possess ideal,

289: rigid geometry as determined by energy minimization of

290: the all-atom OPLS forcefield \cite{oplsaa}

291: using the {\sc tinker} simulation package \cite{tinker}.

292:

293: Given a sequence of three residues (alpha carbons),

294: we employed a lookup table to provide the Cartesian coordinates

295: of the third residue---starting from the N-terminus---and

296: its normal vector as a function of

297: $\phi$ and $\psi$; see Fig.\ \ref{fig-rigid}.

298: The table values assume that the first residue is at the origin

299: and the second residue is located on the $z$-axis. Once the coordinates

300: for the third residue were determined via the lookup table, the fourth

301: residue position was determined using the lookup table in conjunction with

302: a coordinate rotation and shift. Continuing in this fashion, coordinates

303: for the entire protein were determined.
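
The chain-building step can be summarized schematically as follows.
The notation is introduced here only for illustration, and the precise
bookkeeping of reference frames is an assumption rather than a description
of the actual implementation:
\[
    \mathbf{r}_{k+1} = \mathbf{s}_k + \mathbf{R}_k\,
	\mathbf{x}^{\rm table}(\phi,\psi),
    \qquad
    \mathbf{n}_{k+1} = \mathbf{R}_k\, \mathbf{n}^{\rm table}(\phi,\psi),
\]
where $\mathbf{x}^{\rm table}$ and $\mathbf{n}^{\rm table}$ are the tabulated
position and normal vector of the third residue in the reference frame
(first residue at the origin, second residue on the $z$-axis),
$\mathbf{R}_k$ is the rotation that carries the reference frame onto the frame
defined by the residues already placed, and $\mathbf{s}_k$ is the shift that
moves the reference origin to the corresponding residue position.
Only one table look-up, one rotation, and one shift are then needed per residue.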

304:

305: The resolution of the lookup table is an important consideration, i.e.,

306: the number of $\phi,\psi$ values for which Cartesian coordinates are stored.

307: In our simulations, we tried resolutions as high as $0.1^\circ$

308: and as low as $1.0^\circ$, and found no difference between the results.

309: Thus, all simulation results presented here use lookup tables with a

310: resolution of $1.0^\circ$.
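
A rough estimate shows why the coarser table is attractive.
The estimate assumes a uniform grid covering the full $360^\circ$ range of both
angles, six stored numbers per entry (three coordinates and three normal-vector
components), and 8-byte floating-point storage; none of these storage details
are specified above.
The number of entries is
\[
    N_{1.0^\circ} = 360 \times 360 = 129\,600,
    \qquad
    N_{0.1^\circ} = 3600 \times 3600 \approx 1.3 \times 10^{7},
\]
corresponding to about $6$ MB and $620$ MB, respectively, under these assumptions.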

311:

312: \subsection{\label{sec-equil}Initial protein relaxation}

313: One perhaps unexpected complication of utilizing a rigid peptide plane model

314: is that great care must be taken to relax the protein

315: before simulations can be performed.

316: Although initial values of $\phi,\psi$ are obtained from the

317: X-ray or NMR structure,

318: there are slight deviations from planar/ideal geometry in

319: a real protein. These deviations, while small, can accumulate rapidly to

320: become very large differences in the Cartesian coordinate positions

321: of the residues.

322: Thus, the positions of residues near the beginning of the protein

323: will be nearly correct, while the residues near the end of the protein

324: will likely have large errors---compared to the experimental structure being

325: modeled---which can create severe steric clashes or even incorrect

326: protein topology.

327: The severity of these ``errors'' necessitates the use of a relaxation

328: procedure to generate a suitable starting structure---i.e., a set of $\phi$

329: and $\psi$ angles which, with our ideal-geometry peptide planes,

330: lead to a topologically reasonable and relatively clash-free structure.

331:

332: Before we detail our relaxation procedure, we note that the need for this

333: additional calculation is an artifact of the simplicity of

334: our model which can be overcome.

335: With the use of lookup tables, in fact,

336: it is possible to include \emph{flexible} peptide planes

337: without significantly increasing the computational

338: cost of the model.

339: Such an approach, which does not require initial relaxation,

340: is currently under investigation with promising preliminary results

341: (data not shown).

342:

343: The relaxation procedure employed in the present study first

344: uses the $\phi,\psi$ values directly obtained from

345: the experimental structure.

346: These dihedrals provide the initial (problematic)

347: structure for a coarse-grained simulation.

348: Due to the deviations from planarity described above,

349: the root-mean-square deviation (RMSD)

350: between the initial structure we create

351: and the experimental structure tends to be large ($\sim 10$ \AA\

352: was not uncommon for the proteins in this study).

353: To increase the number of native contacts and reduce the number

354: of steric clashes, we next performed what we term ``RMSD Monte

355: Carlo'' to relax the protein to a low RMSD structure.

356: Trial moves for RMSD Monte Carlo were created as described above, but accepted

357: with probability $\min (1,e^{-\Delta(\text{RMSD})/k_BT_\text{RMSD}} )$,

358: where $k_BT_\text{RMSD} = 10^{-7}$ was chosen so that moves to a higher

359: RMSD were rare.

360: In other words, the energy function itself was not used in this initial phase.
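
The effect of such a small $k_BT_\text{RMSD}$ can be seen from two worked values,
assuming that the RMSD and $k_BT_\text{RMSD}$ are both measured in \AA\
(the units are not stated explicitly above).
A trial move that raises the RMSD by $10^{-6}$ \AA\ or by $10^{-5}$ \AA\ is accepted
with probability
\[
    e^{-10^{-6}/10^{-7}} = e^{-10} \approx 4.5 \times 10^{-5}
    \quad \text{or} \quad
    e^{-10^{-5}/10^{-7}} = e^{-100} \approx 3.7 \times 10^{-44},
\]
respectively, so the procedure acts essentially as a stochastic minimization of the RMSD.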

361:

362: Since residues near the beginning of the protein have less

363: error in the starting structure than residues near the end, we used

364: RMSD Monte Carlo in segments. The first twenty residues were relaxed

365: until the RMSD was constant within a tolerance of 0.0001 \AA, followed

366: by the first forty, then the first sixty and so on until the RMSD of the

367: entire protein was relaxed. The RMSD Monte Carlo simulation

368: typically brought the RMSD of the simulated structure to less than

369: 0.5 \AA; however, there were generally still steric clashes,

370: and some native contacts were still not present.

371:

372: The final stage of relaxation was to do regular (i.e., using energy)

373: Metropolis Monte Carlo simulation at a very low temperature.

374:

375: Relaxation was performed until four criteria were met:

376: (i) the number of native contacts in the relaxed structure

377: was equal to that in the NMR or X-ray structure;

378: (ii) no steric clashes were present;

379: (iii) no non-native contacts were present,

380: i.e., $U^{\rm non} = 0$ in Eq.\ (\ref{eq-unn}); and

381: (iv) the RMSD was less than 1.0 \AA.

382: When these criteria were

383: met, the structure was saved and used

384: as the starting configuration in all future simulations of the protein.

385:

386: \begin{figure*}

387:     \begin{center}

388: 	\includegraphics[width=5.7cm,clip]{rmsf-bar.eps}

389: 	\hfill

390: 	\includegraphics[width=5.7cm,clip]{rmsf-cam.eps}

391: 	\hfill

392: 	\includegraphics[width=5.7cm,clip]{rmsf-pg.eps}

393:     \end{center}

394:     \begin{center}

395: 	\includegraphics[width=5.7cm,clip]{rmsd-1a19.eps}

396: 	\hfill

397: 	\includegraphics[width=5.7cm,clip]{rmsd-1cll.eps}

398: 	\hfill

399: 	\includegraphics[width=5.7cm,clip]{rmsd-1pgb.eps}

400:     \end{center}

401:     \caption{\label{fig-rmsf}

402: 	(Color online)

403: 	Relative alpha-carbon root mean square fluctuations for three

404: 	different proteins: (a) barstar, (b) calmodulin, and (c) protein G.

405: 	Each plot shows results for the

406: 	X-ray structure (dot-dash), the NMR ensemble (dash),

407: 	and the coarse-grained simulation (solid).

408: 	X-ray results were given by $\sqrt{3B/8\pi^2}$, where

409: 	$B$ is the temperature factor given in the PDB entry.

410: 	NMR and simulation data were generated using the

411: 	g\_rmsf program in the {\sc gromacs} molecular simulation

412: 	package \cite{gromacs}; each ensemble was aligned to the

413: 	first structure in the corresponding trajectory.

414: 	For each coarse-grained simulation, $2\times 10^9$ Monte

415: 	Carlo steps were performed with snapshots saved every

416: 	1000 steps, and the potential energy

417: 	\eqref{eq-u} was set up using the X-ray structure.

418:         Panels (d)--(f) show the corresponding whole-structure

419: 	fluctuations as indicated by the RMSD from the experimental structures.

420:     }

421: \end{figure*}

422:

423: \section{Results and Discussion}

424: Using the coarse-grained protein model described above, we

425: generated and tested equilibrium ensembles for three proteins:

426: barstar (PDB entry 1A19, residues 1--89),

427: the N-terminal domain of calmodulin (PDB entry 1CLL, residues 4--75), and

428: the binding domain of protein G (PDB entry 1PGB, residues 1--56).

429:

430: For each protein, the initial simulation structure was generated,

431: followed by RMSD and energy relaxation, as described in

432: Sec.\ \ref{sec-equil}. Then, production runs of

433: $2 \times 10^9$ Monte Carlo moves were performed with snapshots

434: saved every 1000 moves, generating an equilibrium ensemble

435: with $2 \times 10^6$ frames.

436:

437: In an attempt to obtain consistent results for the three proteins,

438: we chose the temperature of the simulation, $k_BT$, to

439: be slightly below the unfolding temperature of the protein. The unfolding

440: temperature was determined by running simulations over a broad range

441: of temperatures and studying the RMSD as a function of simulation

442: time. The temperatures used in the simulations were $k_BT=0.6$ for barstar,

443: $k_BT=0.4$ for calmodulin and $k_BT=0.5$ for protein G.

444:

445: \subsection{Speed of simulations}

446: Due to the use of lookup tables for coordinate transformations,

447: the small number of degrees of freedom,

448: and the use of simple square-well potentials, the equilibrium

449: ensembles were generated very rapidly.

450:

451: Running on one Xeon 2.4 GHz processor, $2 \times 10^9$ Monte

452: Carlo moves with snapshots saved every 1000 steps took roughly

453: 6 days for barstar, 4 days for calmodulin, and 3 days for protein G.

454: Thus, less than a week was required to obtain well-converged

455: (see Sec.\ \ref{sec-conv}) simulations

456: of these coarse-grained proteins.
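
For reference, these wall-clock times translate into the following approximate
throughputs, obtained simply by dividing $2 \times 10^9$ moves by the run
length in seconds (one day being $86\,400$ s):
\[
    \frac{2\times 10^{9}}{6 \times 86\,400\ \text{s}} \approx 3.9\times 10^{3}\ \text{s}^{-1},
    \quad
    \frac{2\times 10^{9}}{4 \times 86\,400\ \text{s}} \approx 5.8\times 10^{3}\ \text{s}^{-1},
    \quad
    \frac{2\times 10^{9}}{3 \times 86\,400\ \text{s}} \approx 7.7\times 10^{3}\ \text{s}^{-1},
\]
for barstar, calmodulin, and protein G, respectively.
Because the run times are quoted only roughly, these rates should be read as
order-of-magnitude estimates.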

457:

458: \subsection{Protein fluctuations}

459: We first sought to determine whether fluctuations in the

460: coarse-grained simulation are reasonable.

461: Figure \ref{fig-rmsf} shows the alpha-carbon

462: relative root mean square fluctuation for three

463: different proteins.

464: The figures show that there is reasonable qualitative agreement

465: between the NMR, X-ray and simulation data.
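
As an example of the conversion used for the X-ray curves, consider an
illustrative temperature factor of $B = 20$ \AA$^2$
(a hypothetical value, not taken from any of the PDB entries studied).
The corresponding root mean square fluctuation is
\[
    \sqrt{\frac{3B}{8\pi^2}}
    = \sqrt{\frac{60\ \text{\AA}^2}{78.96}}
    \approx 0.87\ \text{\AA}.
\]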

466:

467: It should be noted that, in fact, \emph{none} of the three

468: data sets in Figs.\ \ref{fig-rmsf}a, b and c represents the true

469: fluctuations in the protein---for different reasons.

470: The X-ray temperature factor, in addition to thermal fluctuations,

471: includes crystal lattice artifacts and other experimental errors \cite{northrup}.

472: NMR ensembles tend to be biased, perhaps severely, toward low energy structures, and

473: thus also do not represent equilibrium ensembles \cite{spronk}.

474: Finally, our simulation data are

475: not accurate due to the lack of chemical detail in the forcefield.

476:

477: In spite of the limitations of the analysis, we conclude

478: from Fig.\ \ref{fig-rmsf} that

479: the fluctuations of the coarse-grained model are in fact

480: reasonable.

481:

482: The bottom panels of Fig.\ \ref{fig-rmsf} show the whole-molecule

483: fluctuations exhibited throughout the trajectories.

484: In addition to the ability to sample large conformational

485: fluctuations---such as in the case of calmodulin and,

486: to a lesser degree, for protein G---the trajectories are

487: visibly more converged than is typically observed in atomistic

488: simulations, where RMSD values rarely reach a plateau value,

489: let alone sampling around that plateau value multiple times

490: as would be desirable.

491:

492: \subsection{\label{sec-conv}Convergence analysis}

493: The primary purpose of this report is to demonstrate the convergence

494: of the equilibrium ensemble for a coarse-grained protein.

495: The details of the convergence analysis are described in

496: Ref.\ \onlinecite{lyman-converge2}, so we will only briefly describe

497: the method here.

498:

499: Previously, Lyman and Zuckerman \cite{lyman-converge}

500: developed an approach which groups sampled conformations

501: into structural histogram bins, using the RMSD as a metric.

502: While promising, the primary limitation of the method was

503: the lack of a quantitative measure of the convergence.

504:

505: In the method used here, convergence

506: was analyzed by studying the variance of the structural histogram bin

507: populations \cite{lyman-converge2}.

508: The new approach allows a rigorous

509: \emph{quantitative} estimation of convergence---the structural

510: decorrelation time

511: $\taud$, given by the time between frames required for the

512: variance to reach an analytically computable independent-sampling value.

513: Intuitively, and mathematically, $\taud$ is the time interval

514: between snapshots for which they behave as if each frame

515: were drawn independently.

516: If simulation times $\tsim \gg \taud$

517: are obtained, the equilibrium ensemble is considered converged.
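
To indicate what an ``analytically computable'' reference value looks like,
we note the simplest case as a sketch; the exact expression used in the
analysis, which may include a finite-sample correction, is given in
Ref.\ \onlinecite{lyman-converge2}.
If $n$ configurations are drawn independently and a given structural bin has
fractional population $f$, the observed fraction $\hat{f}$ in that bin has the
binomial variance
\[
    \sigma^2_{\rm ideal}(\hat{f}) = \frac{f(1-f)}{n}.
\]
The symbols $n$, $f$, and $\hat{f}$ are introduced here only for this illustration.
Correlated frames yield a larger observed variance, and the ratio of the
observed to the ideal variance approaches one as the interval between
frames approaches $\taud$.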

518:

519: Perhaps the most important feature of the convergence

520: analysis for our study is that the method does not require

521: any prior knowledge of important states.

522: Furthermore, there is no parameter-fitting or subjective analysis of any kind.

523:

524: \begin{figure*}

525:     \begin{center}

526: 	\includegraphics[width=5.7cm,clip]{conv-bar.eps}

527: 	\hfill

528: 	\includegraphics[width=5.7cm,clip]{conv-cam.eps}

529: 	\hfill

530: 	\includegraphics[width=5.7cm,clip]{conv-pg.eps}

531:     \end{center}

532:     \caption{\label{fig-conv}

533: 	Convergence analysis for coarse-grained simulations of

534: 	three different proteins:

535: 	(a) barstar, (b) calmodulin, and (c) protein G.

536: 	Each plot shows the convergence properties for the same trajectories

537: 	as used for Fig.\ \ref{fig-rmsf},

538: 	analyzed using the procedure in

539: 	Ref.\ \onlinecite{lyman-converge2}.

540: 	The number of frames required to reach the value of one

541: 	(the solid horizontal line) is an approximation

542: 	of the structural decorrelation time $\taud$

543: 	and is shown on each plot.

544: 	The three curves on each plot are results for different

545: 	histogram sub-sample sizes \cite{lyman-converge2}

546: 	and demonstrate the robustness

547: 	of the value of $\taud$.

548: 	The plots predict that the decorrelation times are roughly

549: 	40 000 frames for barstar, 20 000 frames for calmodulin

550: 	and 30 000 frames for protein G.

551: 	Note that the total number of frames generated for each protein

552: 	during the simulation was $2\times10^6$.

553: 	Thus, since each simulation was more than $50 \taud$

554: 	in length, we conclude that the equilibrium ensembles

555: 	are well-converged.

556:         Error bars represent 80\% confidence intervals

557: 	in the expected fluctuations around the ideal value of one,

558: 	based on the given trajectory length and the numerical procedure

559: 	used to generate the solid curve.

560:     }

561: \end{figure*}

562:

563: Figure \ref{fig-conv} shows the convergence properties

564: of the coarse-grained simulations using the same trajectories

565: as in Fig.\ \ref{fig-rmsf}.

566: The ratio of the observed variance to the ideal variance for independent

567: sampling is plotted as a function of the time between the configurations

568: used to compute the observed variance.

569: When this ratio decreases to one, the structural decorrelation time $\taud$

570: has been reached, as shown in the figure.

571: The analysis predicts that each simulation

572: is at least 50 times longer than the

573: structural decorrelation time.
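
Explicitly, using the approximate decorrelation times read from
Fig.\ \ref{fig-conv} and the $2\times 10^6$ frames saved per trajectory,
the ratios are
\[
    \frac{\tsim}{\taud} \approx
    \frac{2\times 10^6}{4\times 10^4} = 50, \qquad
    \frac{2\times 10^6}{2\times 10^4} = 100, \qquad
    \frac{2\times 10^6}{3\times 10^4} \approx 67,
\]
for barstar, calmodulin, and protein G, respectively.
Since frames were saved every 1000 Monte Carlo moves, these decorrelation times
correspond to roughly $4\times 10^7$, $2\times 10^7$, and $3\times 10^7$ moves.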

574:

575: Thus we conclude that, in less than a week of single-processor

576: time, the equilibrium ensembles for these three proteins are

577: well converged.

578:

579: \section{Conclusions}

580: We have demonstrated the convergence of the equilibrium

581: ensemble for a simple united-residue protein model.

582: The model assumes rigid peptide planes, with atomistically

583: correct geometry, and exhibits reasonable residue-level

584: fluctuations based on the planes' geometry, G{\=o} interactions,

585: and sterics.

586:

587: Most importantly, the results indicate \emph{quantitatively}

588: that carefully designed united-residue models have

589: the potential to fully sample protein fluctuations.

590: By using only two degrees of freedom per residue,

591: lookup tables for coordinate transforms, and

592: simple square well potentials, we were able

593: to demonstrate that converged equilibrium ensembles

594: can be obtained in less

595: than a week of single-processor time.

596: The quantitative convergence analysis indicates that more than 50

597: ``decorrelation times'' were simulated in each case,

598: which suggests high-precision ensembles.

599: In addition to application in Resolution Exchange sampling of

600: all-atom models \cite{lyman-resx,lyman-resx2},

601: such speed opens up the long-term possibility of large-scale

602: simulation of many proteins.

603:

604: One important practical limitation of the ideal-peptide-plane

605: geometry in the present model is the need to relax

606: the initial structure.

607: Proteins larger than 100 residues are difficult to relax.

608: However, we have already begun investigating a flexible-plane

609: model incorporating lookup tables which exhibits no such

610: limitation and remains computationally affordable.

611: We will report on the flexible model in the future.

612:

613: Although the intrinsic atomistic geometry of the peptide

614: plane was included in our model, it lacks chemical interactions.

615: Yet because we obtained converged ensembles in such a short

616: time, it is clear we can ``afford'' extensions

617: to the model which include realistic chemistry.

618: For instance, additional potential energy terms such as

619: Ramachandran propensities \cite{richardson},

620: hydrophobic interactions \cite{dill}

621: and hydrogen-bonding can be included at small cost.

622:

623: Aside from the potential for rigorous atomistic

624: sampling \cite{lyman-resx,lyman-resx2,ytreberg-bbrw},

625: it is important to note the general usefulness of coarse-grained

626: models for generating \emph{ad hoc} atomistic ensembles.

627: Specifically, upon generating a well-sampled ensemble of coarse-grained

628: structures, atomic detail can be added using existing software

629: such as the packages in Refs.\ \onlinecite{sccomp,rapper}.

630: Once minimized and relaxed,

631: these (now) atomically detailed structures form

632: an \emph{ad hoc} ensemble which

633: may be of immediate use in docking \cite{knegtel,shoichet-nature}

634: and homology modeling applications.

635: Further, in principle, such structures can be re-weighted

636: into the Boltzmann distribution

637: \cite{ytreberg-bbrw}.

638:

639: In the long term, one can imagine a day when structural databases

640: will be based not on single (static) structures but rather

641: will collect ensembles---as envisioned in the authors' scheme for an

642: ``Ensemble Protein Database'' (http://www.epdb.pitt.edu/).

643:

644: \begin{acknowledgments}

645: We thank Edward Lyman, Bin Zhang and Artem Mamonov

646: for helpful discussions.

647: Funding was provided by the National Institutes of Health

648: under fellowship GM073517 (to F.M.Y.),

649: and grants GM070987 and ES007318,

650: and by the National Science Foundation grant MCB-0643456.

651: \end{acknowledgments}

652:

653: \bibliography{/home/marty/res/tex/my}

654:

655: \end{document}

656: