1: \documentclass[twocolumn, aps]{revtex4}
2: %\documentclass[preprint, aps]{revtex4}
3: % This tex file has the bibliography inserted at the end
4: \usepackage{natbib}
5: \usepackage{amsmath}
6: %\usepackage{graphicx}
7:
8: % The macro below allows you to use .eps files in pdflatex.
9: % It converts on the fly .eps to .pdf files if you use pdflatex
10: % otherwise, if you are using latex, it just uses the .eps file
11: %
12: % Note: filename suffix (.eps) is left out of the includegraphics statement
13: % Note: you must use the command pdflatex -enable-write18 <filename.tex>
14: % which enables the running of epstopdf as a separate program.
15: % The default does not allow pdflatex to launch sub-processes
16:
17: %\ifx\pdfoutput\undefined
18: % this is the case we are running LaTeX, not pdflatex
19: \usepackage{graphicx}
20: %\else
21: % this is the case we are running pdflatex, so convert .eps files to .pdf
22: %\usepackage[pdftex]{graphicx}
23: %\usepackage{epstopdf}
24: %\fi
25:
26: \begin{document}
27:
28: \title{Protein Structure Prediction: The Next Generation}
29: \author{Michael C. Prentiss}
30:
31: \affiliation{Center for Theoretical Biological Physics, La Jolla, CA 92093, Department of Chemistry and Biochemistry, University of California at San Diego, La Jolla, CA 92093
32: }
33:
34: \author{Corey Hardin}
\affiliation{Department of Chemistry, University of Illinois at Urbana-Champaign, 600 South Mathews Ave, Urbana, IL 61801}
37:
38: \author{Michael P. Eastwood}
39: \affiliation{Department of Chemistry and Biochemistry, University of California at San Diego, La Jolla, CA 92093
40: }
41:
42: \author{Chenghong Zong}
43: \affiliation{Center for Theoretical Biological Physics, La Jolla, CA 92093, Department of Chemistry and Biochemistry, University of California at San Diego, La Jolla, CA 92093
44: }
45: \author{Peter G. Wolynes}
\affiliation{Center for Theoretical Biological Physics, La Jolla, CA 92093, Department of Chemistry and Biochemistry, Department of Physics, University of
California at San Diego, La Jolla, CA 92093}
48:
49: \date{\today}
50:
51: \begin{abstract}
52: Over the last 10-15 years a general understanding of the chemical
53: reaction of protein folding has emerged from statistical mechanics.
54: The lessons learned from protein folding kinetics based on energy
55: landscape ideas have benefited protein structure prediction, in
56: particular the development of coarse grained models. We survey results
57: from blind structure prediction. We explore how second generation
58: prediction energy functions can be developed by introducing
59: information from an ensemble of previously simulated structures. This
procedure relies on the assumption of a funnelled energy landscape, in
keeping with the principle of minimal frustration. First generation
62: simulated structures provide an improved input for associative memory
63: energy functions in comparison to the experimental protein structures
64: chosen on the basis of sequence alignment.
65: \end{abstract}
66:
67: \maketitle
68: Every other summer, research groups compare their different protein
69: structure prediction methods via the Critical Assessment of Techniques
70: for Protein Structure Prediction (CASP) experiment. During the CASP
experiment, sequences of experimentally determined protein structures
that are not publicly available are placed on the web. This exercise is
double blind: neither the organisers nor the participants know
the experimentally determined structure. Groups respond with up to 5
ranked predictions before a predetermined date, such as the
publication of the structures. Since the inception of CASP, the three
dimensional structure prediction category has expanded to address
related prediction questions such as sequence to structure alignment
quality, amino acid sidechain placement, domain boundaries in
multi-domain proteins, and the ordered or disordered nature of a protein sequence
81: \cite{Moult}.
82:
83: These different prediction questions can be examined from a common
84: framework: the principle of minimal frustration. The principle of
85: minimal frustration states that native contacts must be more
86: favourable, in a strict statistical sense \cite{GoldsteinRA-AMH-92},
than non-native contacts in order for proteins to fold on physiological
time scales \cite{BryngelsonJD87}. Without a sufficient energetic
bias towards the native state, the multi-dimensional energy surface as
a function of native structure possesses too many minima for an
efficient stochastic search. Such an energy surface would lead to
slow folding kinetics, even if the proteins ever found a sufficiently
stable native state. This is not the case, since we know most proteins
fold without assistance \cite{AnfinsenCB73}. The opposite of a
rough energy surface, one that is biased toward the native basin without any
local minima, is an absolute manifestation of the principle of minimal
frustration. Funnelled energy surfaces that have no unfavourable
energetic traps (\emph{i.e.} G\=o models) have been shown to reproduce
most features of experimental folding kinetics \cite{GoN83,
Koga_Takada01, portman98}. These energy landscape concepts can be
applied richly in several areas of chemistry and physics \cite{Wales}.
102: Apparently, evolution's energy function is minimally frustrated.
103:
104: The correlation between a protein sequence and its three dimensional
105: structure can be described using similar landscape language. As a
106: protein sequence diverges away from a consensus wild type sequence,
107: the potential for energetically unfavourable interactions increases.
108: The wild type sequence and its homologues will fold toward the same
109: native basin. Only once enough frustrating contacts are added to the
110: wild type sequence will the sequences no longer correspond to the native
111: state ensemble. Sequences with over 25\% sequence
112: identity to previously determined protein structures are called
113: comparative modelling targets. The energy landscape underlying such a
114: prediction is a G\=o model based on the structure of the known
homologue. This heavily funnelled energy surface yields
high resolution structures, with the discrepancies being the turns and
residues which have poor sequence to structure alignments.
Fig.~\ref{casp6_difficulty} shows the distribution of homology
of protein sequences to known structures included in CASP6. Since
120: proteins below 25\% sequence identity are considered new fold
121: recognition targets, 70\% of the structures were comparative modelling
122: targets. Recently sequenced genomes such as \textit{E. coli} have the
123: same ratio of \textit{ab initio} to comparative modelling targets,
124: which suggests the analysis of this ratio over time could be a useful
125: measure of the progress of efforts to experimentally find examples of
126: all of Nature's protein structures.
127:
128: \begin{figure}{\par\centering
129: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/difficultyhisto.eps}}
130: \par}
\caption{\label{casp6_difficulty}The difficulty of the prediction
132: targets as defined by percent identity. Proteins below 25\%
133: sequence identity are usually considered \textit{ab initio} or
134: fold recognition targets.}
135: \end{figure}
136:
137: In contrast to comparative modelling, \textit{ab initio} structure
138: predictions do not have the advantage of creating G\=o like energy
139: surfaces. While many \textit{ab initio} targets contain less than 150
140: residues, and thus are candidates for standard techniques, there are
141: several that are longer as shown in Fig.~\ref{casp6_ab_length}. Most
142: longer sequences will be multi-domain proteins. This causes new
143: problems. Folding a protein with two hydrophobic cores allows
144: for new sources of frustration, beyond those present in single domain
145: proteins. To obtain predictions for such problematic sequences, they
146: usually must be divided into their constituent domains. Current
147: methods for dividing the sequence into domains range from purely
148: sequence based algorithms, which look for sequence patterns in
149: multiple sequence alignments, to simulation techniques that look for
150: hydrophobic core formation amongst multiple independent simulations
151: \cite{Wheelan_etal00,Heringa02,Rigden02}.
152:
153: \begin{figure}{\par\centering
154: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/abitinio_length.eps}}
155: \par}
\caption{\label{casp6_ab_length}The amino acid lengths of the
\textit{ab initio} prediction targets for CASP6.}
158: \end{figure}
159:
160: The case studies we highlight of difficult structure predictions were
161: chosen from our participation in the CASP5 and CASP6 experiments. In
162: CASP5, we utilised several improved techniques, such as a backbone hydrogen
163: bond term for the proper formation of beta sheets, and a liquid
164: crystal like term to ensure parallel or anti-parallel sheet formation
165: \cite{Hardin2002ab}. We also performed target sequence averaging
166: which enhances the funnelling of the prediction landscape
\cite{Eastwood00}, and assessed our ensemble of sampled structures
with a twenty letter contact potential for submission \cite{KoretkeKK96}. Our
most striking result from this round of blind prediction was a
prediction for target T0170 (Protein Data Bank \cite{Berman} ID
1UZC). Fig.~\ref{casp5t0170overlay} presents the sequence dependent
172: overlay of our Model 1 structure with the experimentally determined
173: structure. The sequence dependent alignment quality of this structure
174: is high as measured by a Q score of 0.38. Q is an order parameter
175: defined in Eq.~\ref{q} that measures the sequence dependent structural
176: complementarity of two structures, where Q is defined as a normalised
177: summation of C-alpha pairwise contact differences.
178: \begin{equation}
179: \label{q}
180: Q=\frac{2}{(N-1)(N-2)}\sum_{i<j-1}
181: \exp\left[-\frac{(r_{ij}-r_{ij}^{\rm N})^2}{\sigma_{ij}^2}\right]
182: \end{equation}
183: The resulting order parameter, Q, ranges from 0, when there is no
184: similarity between structures at a pair level, to 1 which is an exact
185: match. Q has been shown to be more sensitive in determining the
186: quality of intermediate quality protein structure predictions
\cite{Eastwood00}. A Q score of 0.4 for single domain proteins corresponds to an
RMSD of 5$\text{\AA}$. In most cases the reference state for the Q
score is the native state, but often one wants to compare structural
similarity between structures in a simulation. A sequence independent
measure, CE \cite{Bourne98}, also scores well (CE Z-score = 4.1). The CE
Z-score measures structural complementarity without regard to sequence
information, and is parameterised such that structures with a
Z-score greater than 4 belong to the same protein
structure family. The contact map of the prediction,
Fig.~\ref{casp5t0170contactmap}, which identifies all of the C-alpha
contacts within 9$\text{\AA}$ and whose axes
are the residue indices of the protein, shows the correct packing of the helices.
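
The normalisation of Eq.~\ref{q} can be checked by counting the pairs in the sum; this is a short worked check added for illustration. For a chain of $N$ residues, the condition $i<j-1$ selects
\[
\sum_{i=1}^{N-2}(N-1-i)=\frac{(N-1)(N-2)}{2}
\]
pairs, so the prefactor gives $Q=1$ when every $r_{ij}$ equals $r_{ij}^{\rm N}$. A single pair whose distance deviates from its native value by exactly $\sigma_{ij}$ contributes $e^{-1}\approx 0.37$ to the sum instead of 1.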
199: \begin{figure}
200: \centering \includegraphics[width=\linewidth]{figures/t0170overlay.eps}
201: \caption{\label{casp5t0170overlay}Sequence dependent
202: superpositions of Model 1 structure against the native state for
203: CASP5 target T0170 (PDB ID 1UZC). Blue represents the
204: prediction and the native state is represented with red.}
205: \end{figure}
206:
207: \begin{figure}{\par\centering
208: \resizebox*{2.0in}{2.0in}{\rotatebox{90}{\includegraphics{figures/t0170contactmap.eps}}}
209: \par}
210: \caption{\label{casp5t0170contactmap}Contact map of target T0170 (PDB
211: ID 1UZC) model
212: 1 structure against the NMR structure.}
213: \end{figure}
214:
Fig.~\ref{casp6_hubbard} shows the size of partially correct
segments, continuous in sequence, under an RMSD cutoff. When compared
against the other predictions, our Model 1 prediction (dark blue) was
amongst the best of all submitted structures. The relative
success of the prediction also classifies this target as being of
moderate difficulty. In this example CASP demonstrates that small (70
residues) all alpha proteins are beginning to be successfully
predicted by a variety of \textit{ab initio} techniques.
223: \begin{figure}
224: \centering
225: \includegraphics[width=\linewidth]{figures/t0170hubbard.eps}
\caption{\label{casp6_hubbard}Percentage of residues under an RMSD
limit. (Dark Blue - Model 1, Light Blue - Models 2-5, Orange - Other
Groups' Predictions)}
229: \end{figure}
230:
231: \section*{Methods}
232:
233: \subsection*{Energy Functions and Sampling}
234:
235: We used an Associative Memory Hamiltonian (AMH), with optimised
236: parameters to sample and predict structures
237: \cite{FriedrichsMS89,FriedrichsMS90,FriedrichsMS91}. The AMH uses a
238: reduced description of the amino acid chain in order to gain the
239: orders of magnitude computational acceleration over all atom models
240: needed to fold moderate length proteins with ordinary
241: computational resources, and has been described in great detail before
242: \cite{Eastwood00}. This is possible due to reducing the number of
243: atoms per residue from over 10 to only three backbone atoms: the
244: \(C_{\alpha},C_{\beta}\), and \(O\). The remaining backbone heavy
245: atoms (\(N,C'\)) can be reconstituted using the ideal geometry of the
peptide bond as a template. We also reduced the complexity of the
amino acid code from twenty letters to four. We chose the four
letter code, which has the advantage of preserving a diversity of
contacts, because it is still simple enough that the number of
coefficients that need to be optimised does not create problems of
inaccurate statistics due to the limited number of interactions encountered in the
molten globule state. Specifically, four amino acid classes are
defined: hydrophilic (A, G, P, S, T), hydrophobic (C, I, L, M, F, W,
Y, V), acidic (N, D, Q, E), and basic (R, H, K) \cite{Hardin00}. The
255: optimisation procedure produces an energy landscape that discriminates
256: the native state from misfolded states, while avoiding kinetic traps
257: reasonably well \cite{GoldsteinRA92,GoldsteinRA-AMH-92}. The AMH is
258: an analogue to the neural networks designed by Hopfield to synthesise
259: information from multiple previous experiences \cite{Hopfield_1982}.
260: This energy function recalls structural patterns in a set of known
261: protein structures. The Hamiltonian produces an energetically
262: favourable minimum when there is sufficient coherence between a set of
263: three dimensional protein structures.
264:
265: The AMH energy function, in its most general sense, consists of a
266: backbone term, $E_{\rm back}$ and interaction term, $E_{\rm int}$
267: defined by,
268: \begin{equation}
269: \label{amh_function}
270: E_{\rm total}=E_{\rm back}+E_{\rm int}.
271: \end{equation}
272: The backbone energy term consists of several terms that reproduce the
self-avoiding behaviour of the polypeptide chain, given by,
274: \begin{equation}
275: \label{amh_back}
276: E_{\rm back}=-(E_{\rm SHAKE}+ E_{\rm rama} + E_{\rm ev} + E_{\rm chain} + E_{\rm chi}).
277: \end{equation}
278: As in many molecular mechanics energy functions, covalent bonds are
279: preserved by using the SHAKE algorithm \cite{Ryckaert77} \(E_{\rm
280: SHAKE}\), which enables an increase of the time step size, and
281: eliminates the need for a traditional harmonic calculation. The SHAKE
282: algorithm preserves the distances between neighbouring
283: \(C_{\alpha}\)-\(C_{\beta}\), and \(C_{\alpha}\)-\(O\) atoms. The
284: neighbouring residues limit the variety of angles the backbone atoms
can occupy, producing a Ramachandran plot \cite{Rama}. This
286: distribution of angles is reinforced by a potential, \(E_{\rm rama}\)
287: with low barriers to encourage rapid local backbone movements.
288: Another term, \(E_{\rm ev}\) maintains a sequence specific excluded
289: volume constraint between \(C_{\alpha}\)-\(C_{\alpha}\),
290: \(C_{\beta}\)-\(C_{\beta}\), \(O\)-\(O\), \(C_{\alpha}\)-\(C_{\beta}\)
291: atoms. The chain connectivity, and planarity of the peptide bond due
292: to resonance is ensured by means of a harmonic potential, \(E_{\rm
chain}\). Also the chirality of the \(C_{\alpha}\), due to its four
different bonding partners, is maintained using the scalar product of
neighbouring unit vectors of carbon and nitrogen bonds, \(E_{\rm
chi}\).
297:
298: While \(E_{\rm back}\) creates peptide like stereo-chemistry, it
299: does not introduce the majority of the attractive
interactions that result in folding. Such interactions are supplied by
the rest of the potential \(E_{\rm int}\). The interactions described
by $E_{\rm int}$ depend on the sequence separation $\left\vert i-j
\right\vert$. Specifically,
304: they are divided into three proximity classes $x(\left\vert
305: i-j\right\vert)$: $x={\rm short}$ ($\left\vert i-j\right\vert<5$),
306: $x={\rm medium}$ ($5\le\left\vert i-j\right\vert\le12$) and $x={\rm
307: long}$ ($\left\vert i-j\right\vert>12$) as defined by Eq.~\ref{eint}.
308: \begin{equation}
309: \label{eint}
310: E_{\rm int}=E_{\rm short}+E_{\rm med}+E_{\rm long}.
311: \end{equation}
These sequence separation classes are also referred to as local,
super-secondary, and tertiary, respectively.
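
For orientation, the number of residue pairs in each proximity class follows from counting the $N-d$ pairs at each sequence separation $d=\left\vert i-j\right\vert$. The symbols $N_{\rm short}$, $N_{\rm med}$ and $N_{\rm long}$ are introduced here only for this count, and the expressions assume $N>12$:
\[
\begin{aligned}
N_{\rm short}&=\sum_{d=1}^{4}(N-d)=4N-10,\\
N_{\rm med}&=\sum_{d=5}^{12}(N-d)=8N-68,\\
N_{\rm long}&=\sum_{d=13}^{N-1}(N-d)=\frac{(N-12)(N-13)}{2}.
\end{aligned}
\]
For an illustrative chain of $N=70$ residues this gives 270, 492 and 1653 pairs, which sum to $N(N-1)/2=2415$. The short and medium classes grow linearly with $N$, while the long class grows quadratically.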
314:
315: The AMH interaction potential \(E_{\rm int}\) is based on correlations
316: between a target's sequence signified by \(i,j\), and the
317: sequence-structure patterns in a set of memory proteins \(\mu\)
318: represented as \(i',j'\), and a pairwise contact potential. The pairs
319: in the target and in the memory are first associated using a
320: sequence-structure threading algorithm \cite{KoretkeKK96}. The
321: database is assumed to contain a subset of pair distances, which may
322: match the associated pair distances in the target structure. The
323: general form of the associative memory interaction is:
324: \begin{equation}
325: \begin{split}
326: \label{amh_general}
327: E_{\rm int}&= -\frac{\epsilon}{a}\sum ^{N_{mem}}_{\mu}\sum_{j-12 \leq i \leq j-3} \\
328: & \gamma(P_{i},P_{j},P^{\mu}_{i'},P^{\mu }_{j'})
329: \exp\left[-\frac{(r_{ij}-r_{i'j'}^{\rm
330: \mu})^2}{2\sigma^2_{ij}}\right] \\
&- \frac{\epsilon}{a}\sum_{k=1}^{3}C_{k}(N)\gamma (P_{i},P_{j},k)U_{k}(r_{ij})
332: \end{split}
333: \end{equation}
334: where the similarity between target pair distances \(r_{ij}\), with
335: aligned memory pair distances \(r^{\mu}_{i'j'}\) is measured by Gaussian
336: functions whose widths are given by \(\sigma_{ij}=\left\vert
337: i-j\right\vert^{0.15}\text{\AA}\). The set of parameters,
\(\gamma\), encodes the similarity between residues \(i\) and \(j\), and the
memory residues \(i'\) and \(j'\). Favourable interactions occur during
340: coherence in the distances achieved in the sequence to structure
341: alignments. The encoding of the alignment information in
342: Eq.~\ref{amh_general} is only an example of what is used for the
343: all-alpha energy functions. Other encodings have been used in the
344: alpha-beta energy function \cite{Hardin2002ab} to improve the
345: discrimination between helices and strands. While the first term in
346: Eq.~\ref{amh_general} is the superposition of interactions over a set
347: of experimentally determined structures, it also shares a dependence
on the sequence separation between the interacting residues. For
residues separated by more than 12 residues, a contact potential
\(E_{\rm long}\), described by the second term in
Eq.~\ref{amh_general}, is used; it does not depend on interaction
information from the structures used to define local in sequence
interactions. In this term $C_{k}(N)$ represents a sequence length
dependent scaling to account for the variation in probability
355: distributions based on sequence length. Five wells instead of the
356: three defined here by $U_{k}(r_{ij})$ determine interactions in the
357: alpha-beta energy function \cite{Hardin2002ab}. Energy units
358: $\epsilon$ are defined excluding backbone contributions in terms of a
359: native state energy in Eq.~\ref{units},
360: \begin{equation}
361: \label{units}
362: \epsilon=\frac{\left\vert E^{\rm N}_{\rm amh}\right\vert}{4N},
363: \end{equation}
where $N$ is the number of residues. The distance class scaling, $a$, is
constant in each of the energy classes because they are designed to
be equal during the optimisation.
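
As a worked illustration of the Gaussian widths \(\sigma_{ij}=\left\vert i-j\right\vert^{0.15}\text{\AA}\) quoted above, the two ends of the memory term's sequence separation range give
\[
3^{0.15}\approx 1.18\,\text{\AA},\qquad 12^{0.15}\approx 1.45\,\text{\AA},
\]
so the tolerance on a pair distance widens only slowly with sequence separation. Similarly, Eq.~\ref{units} implies $\left\vert E^{\rm N}_{\rm amh}\right\vert=4N\epsilon$; for an illustrative chain of $N=70$ residues (a value chosen here, not a fitted one) the native state energy excluding backbone contributions is $280\,\epsilon$ in magnitude.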
367:
368: The solvent in these energy functions is treated in a mean field
369: manner, where the implicitly solvated native states of the proteins define the
370: energy gap to the molten globule state. Solvent effects are also
371: present in the sequence to structure alignment energy functions, but
372: they are not explicitly represented in the molecular dynamics energy
373: function. Water mediated contacts with an expanded 20 letter code in
374: the contact potential were introduced \cite{Papoian04pnas}, based upon
375: previous work which examined protein recognition
\cite{Papoian03biopoly, Papoian03jacs}. The water mediated contacts,
along with a new one dimensional burial term, have shown promising
results, especially for long proteins.
379:
380: Once the energy function is optimised, the minima of the energy
381: function are probed via simulated annealing with molecular dynamics
simulations. This minimisation technique integrates Newton's
equations of motion to determine the energy of the next time step.
384: Simulated annealing slowly reduces the temperature from a high value
385: as in the tempering of steel in metallurgy. This minimisation algorithm
386: allows for local searches, while allowing modest energy barriers
387: to be overcome.
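
A linear schedule is one simple way to realise the temperature reduction described above. In the sketch below, $T_{0}$, $T_{\rm f}$ and $t_{\rm f}$ (initial temperature, final temperature, and run length) are symbols introduced here for illustration and are not parameters taken from the simulations:
\[
T(t)=T_{0}-\left(T_{0}-T_{\rm f}\right)\frac{t}{t_{\rm f}},\qquad 0\le t\le t_{\rm f}.
\]
The temperature falls at the constant rate $(T_{0}-T_{\rm f})/t_{\rm f}$, so a longer run with the same end points cools more slowly and leaves more time to cross modest barriers at each temperature.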
388:
389: Energy landscape ideas have generated an optimisation scheme for
390: creating funnelled energy surfaces. While funnelled, the
391: parameterisation does not eliminate all non-native minima. The
392: superposition of several energy surfaces reduces the likelihood of
393: such trapping in local minima \cite{maxfield79,finkelstein98}. The
394: flexibility of the AMH framework provides several ways of
395: incorporating multiple sequence alignment information. Some of the
396: options include creating a consensus sequence \cite{Eastwood00},
397: simulating different homologue sequences concurrently, and averaging
the resulting forces and energies \cite{Hardin2002ab}. The averaged
AMH energy function we used averages the forces and the energies of
these simulations over a set of sequences, because this allows for more
generalisable results than may occur with other techniques, and is
described by Eqs.~\ref{msa1} and~\ref{msa2},
403:
404: \begin{equation}
405: \begin{split}
406: \label{msa1}
407: E_{\rm short + medium} &= -\frac{1}{N_{seq}}\frac{\epsilon}{a} \sum ^{seq}_{1
408: }\sum ^{N_{mem}}_{\mu}\sum_{j-12 \leq i \leq j-3} \\
& \gamma (P_{i},P_{j},P^{\mu}_{i'},P^{\mu }_{j'}) \exp\left[-\frac{(r_{ij}-r_{i'j'}^{\rm \mu})^2}{2\sigma^2_{ij}}\right]
410: \end{split}
411: \end{equation}
412:
413: \begin{equation}
414: \label{msa2}
E_{\rm long} = -\frac{1}{N_{seq}}\frac{\epsilon}{a} \sum ^{seq}_{1 }\sum_{k=1}^{3}C_{k}(N)\gamma (P_{i},P_{j},k)U_{k}(r_{ij})
416: \end{equation}
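
Because Eqs.~\ref{msa1} and~\ref{msa2} are linear averages, the forces entering the molecular dynamics are the same average of the per-sequence forces. Writing $E^{(s)}$ for the interaction energy evaluated with the $\gamma$ parameters of homologue $s$, with $s$ running over the $N_{seq}$ sequences (a shorthand introduced here), the force on atom $k$ is
\[
\mathbf{F}_{k}=-\nabla_{\mathbf{r}_{k}}\left(\frac{1}{N_{seq}}\sum_{s=1}^{N_{seq}}E^{(s)}\right)
=\frac{1}{N_{seq}}\sum_{s=1}^{N_{seq}}\mathbf{F}^{(s)}_{k}.
\]
As the equations are written, the distances $r_{ij}$ carry no sequence index, so all homologues act on one shared set of coordinates and only the sequence dependent weights differ between terms.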
417:
418: To superimpose multiple energy landscapes, we need a multiple sequence
alignment to a set of sequence homologues. Sequences homologous to
the target sequence are first identified by using PSI-Blast with
default parameters \cite{Altschul1997}. Sequences below an upper and above
a lower sequence identity threshold (70\% and 30\%, respectively, in this work) are
then aligned against each other, and proteins that have greater than
90\% sequence identity to other identified sequence homologues are
removed. The culling of the sequence homologues via open source
bioinformatic libraries \cite{bioperl} is necessary for two reasons.
427: Some classes of proteins have a large number of sequence homologues,
428: and performing a multiple sequence alignment can be impractical. Also
429: removing sequence homologues attempts to remove biases introduced when
430: there are few homologues. The remaining sequences were aligned using
a multiple sequence alignment algorithm \cite{CLUSTAL}. Within the AMH
energy function, gaps occurring in a sequence alignment could be
addressed in a variety of ways; in this work gaps in the target
sequence are ignored, while gaps within homologues are completed with
residues from the target protein. This strategy may introduce biases
toward the target sequence, but this approach is preferred to
ignoring interactions. Fig.~\ref{casp6t0212_msa} shows a
representative multiple sequence alignment for a target, coloured with
respect to the four letter code of the AMH. If one focuses on the
hydrophobic yellow residues, the alternating hydrophobic and hydrophilic
patterns characteristic of beta strand formation are apparent.
442:
443: \begin{figure}{\par\centering
444: \includegraphics[width=\linewidth]{figures/t0212_msa.eps}
445: \par}
446: \caption{\label{casp6t0212_msa} Multiple sequence alignment for target
447: T0212 (PDB 1TZA) coloured with respect to a four letter code, where red
448: represents acidic residues, blue represents polar residues, yellow
449: represents nonpolar residues, and green represents basic residues.}
450: \end{figure}
451:
452: Another way of introducing the characteristics of multiple funnelled
453: energy landscapes is using information derived from neural networks
454: trained on multiple sequence alignments. Even with different
455: architectures, neural networks typically achieve 75\% accuracy when
predicting secondary structure. Recently it has been shown that artful
457: combinations of two different predictions can slightly improve the
458: results \cite{Zhang_etal03}. This secondary structure information was
459: added by a biasing energy function to either a helix or a strand via,
460: $E_{Q_{ss}} = 10^5 \epsilon(Q-Q_{ss})^4$ \cite{Eastwood00}, where
461: ${Q_{ss}}$ is defined by Eq.~\ref{qmf_ss},
462: \begin{equation}
463: \label{qmf_ss}
464: Q_{\rm ss}= \sum_{k}^{n}\frac{2}{(N_{k}-1)(N_{k}-2)}\sum_{i<j-1}\exp\left[-
465: \frac{(r_{ij}-r_{ij}^{\rm ss})^2}{\sigma_{ij}^2}\right].
466: \end{equation}
${Q_{\rm ss}}$ takes the same form as the $Q$ defined before in
Eq.~\ref{q}, except that the potential acts over $n$ independent secondary
structure units derived from secondary structure prediction. The
distances that define the energy minimum, $r_{ij}^{\rm ss}$, are determined
from experimentally determined Cartesian distances. Previously in an
472: effort to incorporate this secondary structure information, the
473: Ramachandran potential has been altered to bias the backbone
474: \cite{Hardin02a}. The local in sequence potential $E_{\rm Q_{\rm
475: ss}}$ is preferred to the Ramachandran potential biasing because
476: it avoids SHAKE violations when the strength of the bias is increased.
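
The behaviour of the quartic bias $E_{Q_{ss}}$ can be illustrated with two hypothetical deviations, writing $\delta=Q-Q_{ss}$ (a shorthand introduced here):
\[
\begin{aligned}
E_{Q_{ss}}(\delta=0.01)&=10^{5}\epsilon\,(0.01)^{4}=10^{-3}\epsilon,\\
E_{Q_{ss}}(\delta=0.1)&=10^{5}\epsilon\,(0.1)^{4}=10\,\epsilon.
\end{aligned}
\]
A tenfold larger deviation costs $10^{4}$ times more energy, and the slope $dE_{Q_{ss}}/d\delta=4\times10^{5}\epsilon\,\delta^{3}$ vanishes at the target, so the bias is nearly flat close to the predicted secondary structure and steep away from it.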
477:
478: For most selected CASP6 targets, we followed the same protocol. We
479: averaged the AMH potential over multiple sequence homologues when they
480: were available. In most cases, information from secondary structure
481: prediction was used to bias secondary structure units to their
482: predicted structures. Molecular dynamics with simulated annealing
sampled low energy structures. Constant temperature simulations
slightly above the predicted glass transition temperature were also used to generate
candidate structures. We collected structures above \(T_{K}\), which
486: usually gives the fastest folding thereby compromising between the
487: funnelled and glassy behaviour of the energy function. Once the
488: kinetics of the structure slows, the diversity of structures
489: encountered disappears. The slow kinetics regime typically
490: predominates around a temperature of 0.9. While using a linear
491: annealing schedule up to \(T_{K}\), about 25 different collapsed
492: structures were collected during each simulation. The amount of
493: sampling performed for each structure varied from about 500 to 20,000
494: different structures. While this was roughly 50 times more sampling
495: than we had previously performed in the CASP setting, it is dwarfed by
496: the efforts of others who can sample in the millions of structures by
497: using more powerful computational resources \cite{BONN2001b}.
498: Subsequently, a smaller subset of structures was selected for
499: submission by evaluating the size of the hydrophobic core and the
500: hydrophilic surface area. Further selection criteria included visual
501: inspection, agreement with the preliminary secondary structure
502: prediction, and low energies predicted from a second optimised contact
503: energy function.
504:
505: \subsection*{Selection of Structures}
506:
507: To select candidate structures from independent simulated annealing or
508: constant temperature trajectories, we calculated both the buried
509: hydrophobic surface area and the exposed hydrophilic surface area
510: along the trajectory. In an effort to calculate the buried or exposed
511: surface area, we assigned residues which have greater than the mean
512: total surface area as solvent exposed, and the converse as solvent
513: buried. We scaled each surface area by a weight to represent the
likelihood of amino acid burial. The weight was modelled on the free energy
cost of transferring each amino acid from octanol to water
\cite{zhou:zhou04} in an effort to introduce sequence specificity,
as shown in Eq.~\ref{sa},
518: \begin{equation}
519: \label{sa}
520: E_{\rm Burial}=\sum_{i}^{N}
521: \begin{cases}
\gamma_{i}\,SA_{i},& \text{if $ SA_{i} > $ total surface} \\
0,& \text{if $ SA_{i} \leq $ total surface}
524: \end{cases}
525: \end{equation}
526:
527: This normalisation is desirable because the surface accessibility is
528: calculated from our minimal \(C_{\alpha}, C_{\beta }\), and \(O\)
529: atoms, which produces amino acids of the same volume. Such an energy
530: term would be more valuable if non-additive interactions, and a larger
531: number of hydration layers were added. The
532: unavoidable inaccuracies in atomistic force fields, and the slow
533: glassy kinetics of sidechain rearrangements prevented any completion
534: of the backbone and sidechains with all-atoms or minimisation of
535: putative structures \cite{kussell:168101}.
536:
537: \begin{table}
538: \caption{\label{table1} Linear Regression of Hydrophobic Burial Energy }
{\centering \begin{tabular}{|c|c|c|} \hline Protein & Fold class &
Correlation coefficient \\ \hline \hline
541: 1R69&\(\alpha\)&.22 \\
542: 1BG8&\(\alpha\)&.33 \\
543: 1UTG&\(\alpha\)&.63 \\
544: 1MBA&\(\alpha\)&.40 \\
545: 2MHR&\(\alpha\)&.46 \\
546: 1IGD&\(\alpha/\beta\)&-.70 \\
547: 3IL8&\(\alpha/\beta\)&-.06 \\
548: 1TIG&\(\alpha/\beta\)&.02 \\
549: % 3chyz&\(\alpha/\beta\)& fill in value \\
550: % 5nulz&\(\alpha/\beta\)& fill in value \\
551: 1BFG&\(\beta\)&.16 \\
552: 1CKA&\(\beta\)&-.14 \\
553: 1JV5&\(\beta\)&.11 \\
554: 1K0S&\(\beta\)&.27 \\
555: \hline
556: \end{tabular}\par}
557: \end{table}
558:
559: %% \clearpage
560:
561: Another parameter we used after sampling to select and examine
structures was based on sequence specific backbone probabilities. The
specificity of local interactions has been fruitful for improving
structure predictions of collapsed proteins \cite{Baker97}. In a similar
spirit, sequence specific nearest neighbour probabilities were also used
566: \cite{Betancourt:2004}. Local signals have also been theoretically
567: shown to contribute roughly a third of the total folding gap for
568: $\alpha$ helical proteins \cite{SavenJG96}. Similarly we started
569: looking at such probabilities to further improve the backbone
570: potential of the AMH, but without needing secondary structure
571: prediction.
572: \begin{equation}
573: \label{mscore}
E_{\rm trimer}=\sum_{i=2}^{N-1}\log P(i-1,i,i+1,\phi,\psi)
575: \end{equation}
Somewhat surprisingly, the sum of the resulting $\log$
probabilities, which were derived from 4,012 highly resolved protein
structures, could be used as an additional measure as part of a
strategy for the selection of structures out of an ensemble.
Table~\ref{table1} shows the linear
correlation coefficients for structures of varying Q-scores
sampled above \(T_{\rm K}\), which is where the best predictions
usually occur before glassy dynamics dominates the kinetics. For
proteins with all \(\alpha\) and \(\alpha/\beta\) compositions, the
summed log probabilities provide discrimination, but not for the
all \(\beta\) folds. These results, shown in Table~\ref{table_marcio},
echo the previous findings in terms of the \(\phi\), \(\psi\)
probability maps, and also the finding that all beta structures are
less well predicted when a dihedral angle energy function is minimised. The
weakness of nearest neighbour excluded volume effects in determining
local structure is also demonstrated by the consistent weakness of
secondary structure prediction with respect to beta strands. Alpha
helices are correctly predicted with roughly 80\% accuracy, while beta
strands average 60\% accuracy, by such purely sequence-based algorithms.
The difficulty of predicting some circular dichroism spectroscopy
results for beta to coil transitions can also be attributed to the
weakness of the local backbone excluded volume interactions.
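
For concreteness, the following sketch assumes that the tabulated
coefficients are ordinary Pearson linear correlation coefficients
between a score $E$ (for example $E_{\rm trimer}$) and $Q$ over the
sampled structures. The estimator is not specified in the text, so
this is an assumption, and the symbols $M$, $E_k$, $Q_k$, $\bar{E}$
and $\bar{Q}$ are introduced only for this illustration.
\begin{equation*}
r=\frac{\sum_{k=1}^{M}(E_{k}-\bar{E})(Q_{k}-\bar{Q})}
{\sqrt{\sum_{k=1}^{M}(E_{k}-\bar{E})^{2}}\,\sqrt{\sum_{k=1}^{M}(Q_{k}-\bar{Q})^{2}}}
\end{equation*}
Under this reading, a value of $r$ near zero, as found for the all
\(\beta\) entries of Table~\ref{table_marcio}, means that the score
carries almost no linear information about $Q$. Note also that the sum
of logarithms in Eq.~\ref{mscore} is the logarithm of a product, so
$E_{\rm trimer}$ can be read as a log-likelihood of the backbone
angles if the trimer terms are treated as independent factors.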
597:
598: \begin{table}
599: \caption{\label{table_marcio} Linear Regression of Mscore }
600: {\centering \begin{tabular}{|c|c|c|} \hline Proteins & fold class &
601: Correlation Coefficient \\ \hline \hline
602: 1R69&\(\alpha\)&.29 \\
603: 1BG8&\(\alpha\)&.04 \\
604: 1UTG&\(\alpha\)&.26 \\
605: 1MBA&\(\alpha\)&.26 \\
606: 2MHR&\(\alpha\)&.10 \\
607: 1IGD&\(\alpha/\beta\)&.37 \\
608: 3IL8&\(\alpha/\beta\)&.13 \\
609: 1TIG&\(\alpha/\beta\)&.19 \\
610: % 3CHY&\(\alpha/\beta\)& .40 \\
611: % 5NUL&\(\alpha/\beta\)& .25 \\
612: 1BFG&\(\beta\)&.08 \\
613: 1CKA&\(\beta\)&.03 \\
614: 1JV5&\(\beta\)&-.07 \\
615: 1K0S&\(\beta\)&-.10 \\
616: \hline
617: \end{tabular}\par}
618: \end{table}
619:
620: %\clearpage
621:
622: \section*{Results}
623: \subsection*{Blind Simulations}
624:
For \textit{ab initio} blind predictions in CASP6, we selected
sequences if there were no experimentally determined homologous
structures found by automated comparative modelling servers. The
overall results for the \textit{ab initio} structure prediction
simulation are summarised in Table~\ref{casp6_table1}, where the
abbreviations are length = the number of amino acids, temp =
temperature where the best structure was encountered, sub Q and samp Q =
the best submitted and sampled structures, respectively, as judged
by their Q scores, and traj = number of independent trajectories
simulated. The
CASP6 targets are classified under the following categories: NF = new
fold, FR/A = fold recognition analog, FR/H = fold recognition homologue,
CM/H = comparative modelling hard. Targets T0207 and T0270 were
removed from the experiment, so their CASP class is undefined. Structures for T0207 and
T0272-b were not submitted. There
are a few main points from this data. Using a Q of 0.4 as a measure of
successful prediction, we were able to encounter high quality
structures for 4 targets and nearly so for 4 others. The temperature
at which the best structures were sampled was, for most targets,
between 1.2 and 0.8,
which is the annealing regime we investigated most thoroughly. This
suggests our annealing schedules were close to the behaviour we sought
\textit{a priori}. Longer target sequences
clearly reduced the quality of our predictions. Also, the proteins
for which we had a greater number of trajectories naturally showed better
structures. A final observation is that the difference between the
best submitted structure and the best sampled structure is
disappointingly large for some of the targets. This can be attributed
to our strategy of maximising the number of simulations performed rather
than more carefully studying our trajectories. This difference would
be smaller if greater care were taken in the selection of the
structures, but the number of high quality structures would have been
smaller.
657: \begin{table}
658: \caption{\label{casp6_table1}CASP6 Results: Best Submitted and Sampled Structures }
659: {\centering \begin{tabular}{|c|c|c|c|c|c|c|c|} \hline
660: target & length & fold &sub Q&samp Q& temp & traj & CASP \\ \hline \hline
661: T0281 & 70 & \(\alpha/\beta\) &.34 &.48 & 0.85 & 986 & NF \\
662: T0201 & 94 & \(\alpha/\beta\) &.36 &.44 & 1.39 & 199 & NF \\
663: T0212 & 123 & \(\beta\) &.26 &.42 & 1.30 & 97 & FR/A \\
664: T0230 & 102 & \(\alpha/\beta\) &.31 &.42 & 1.05 & 395 & FR/A \\
665: \hline \hline
666: T0207 & 76 & \(\alpha/\beta\) & -- &.39 & 0.98 & 297 & -- \\
667: T0224 & 87 & \(\alpha/\beta\) &.30 &.38 & 1.20 & 501 & FR/H \\
668: T0263 & 97 & \(\alpha/\beta\) &.34 &.38 & 0.94 & 404 & FR/H \\
669: T0272-a & 85 & \(\alpha/\beta\) &.30 &.37 & 0.94 & 30 & FR/A \\
670: T0265 & 102 & \(\alpha/\beta\) &.29 &.34 & 0.83 & 374 & CM/H \\
671: T0213 & 103 & \(\alpha/\beta\) &.26 &.32 & 0.98 & 448 & FR/H \\
672: T0243 & 88 & \(\alpha/\beta\) &.31 &.32 & 0.95 & 418 & FR/H \\
673: T0239 & 98 & \(\alpha/\beta\) &.25 &.32 & 0.99 & 424 & FR/A \\
674: T0214 & 110 & \(\alpha/\beta\) &.24 &.30 & 0.41 & 348 & FR/H \\
675: T0242 & 115 & \(\alpha/\beta\) &.27 &.30 & 0.89 & 358 & NF \\
676: \hline \hline
677: T0270-b & 125 & \(\alpha/\beta\) &.27 &.28 & 0.99 & 32 & -- \\
678: T0270-a & 122 & \(\alpha/\beta\) &.25 &.27 & 0.80 & 47 & -- \\
679: T0272-b & 124 & \(\alpha/\beta\) & -- &.26 & 0.81 & 34 & FR/A \\
680: T0273 & 186 & \(\alpha/\beta\) &.22 &.24 & 0.98 & 189 & NF \\
681: \hline
682: \end{tabular}\par}
683: \end{table}
684:
Calculating the free energy of several randomly chosen CASP6 targets,
shown in Fig.~\ref{fq_totalt0214t0243}, provides us with probabilities of
what we would have expected to see if more simulations had been
performed during the CASP season. We can estimate how
many independent structures need to be seen at a given temperature to
sample the region 10 $k_BT$ above the minimum of the free
energy. We see that roughly $e^{10}\approx 2\times10^{4}$ independent sampled
structures would be needed at a temperature of 1.0. Target T0242 (PDB
ID 2BLK) illustrates why the best structure we encountered had a Q
score of 0.3. For this target, we sampled roughly 7000 different
structures. To achieve a Q of 0.45, according to the free
energy analysis, we would need to increase our sampling by a factor
of 3.
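
The arithmetic behind this estimate can be written out in one line.
As a sketch, assume that independent samples are Boltzmann distributed
along $Q$, so that a region lying $\Delta F$ above the free energy
minimum is visited with a relative probability of roughly
$e^{-\Delta F/k_BT}$. The symbols $\Delta F$ and $N_{\rm samp}$ are
introduced here only for illustration.
\begin{equation*}
N_{\rm samp}\sim e^{\Delta F/k_BT},\qquad
\Delta F=10\,k_BT\;\Rightarrow\;N_{\rm samp}\sim e^{10}\approx 2.2\times10^{4}
\end{equation*}
For T0242, the roughly 7000 structures sampled fall short of this
number by a factor of $2.2\times10^{4}/7000\approx 3$, consistent with
the increase in sampling quoted above.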
698:
699: \begin{figure}{\par\centering
700: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/fq_totalt0214-t0243.eps}}
701: \par}
702: \caption{\label{fq_totalt0214t0243}Free Energy calculations for
703: CASP6 targets T0213, T0214, T0224, T0242, and T0243.}
704: \end{figure}
705:
When extrapolating to lower temperatures, we see lower barriers to the
folded state, and thus, if sampling were more complete, one would see
better structures at these temperatures. This further cooling would
be a favorable strategy except that dynamic slowing due to the
approach of the glass transition, which occurs at a
temperature of 0.9, interferes. Naturally, it is best to sample just above the
glass transition temperature, which can be approximately found from
the Q-Q correlation ($\langle Q(t)Q(t+\tau)\rangle$) \cite{AllenTildesley} and by
using the Kolmogorov-Smirnov test to assess the independence of samples
\cite{Eastwood2003}. Table~\ref{casp6_likely} indicates the
best structure we would be likely to see under such sampling
conditions. The differences between thermodynamically accessible
structures and those that were sampled suggest that additional
simulations would have improved the best structures sampled
considerably. The free energy of target T0243 (PDB ID not available) is
significantly different due to its unusual architecture, which contains
a buried helix.
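
One common way to turn the Q-Q correlation into a working diagnostic,
given here as a sketch rather than as the exact estimator used in this
work, is the normalised autocorrelation function. The symbols
$C_{Q}(\tau)$ and $\tau_{Q}$ are notation introduced for this
illustration.
\begin{equation*}
C_{Q}(\tau)=\frac{\langle Q(t)Q(t+\tau)\rangle-\langle Q\rangle^{2}}
{\langle Q^{2}\rangle-\langle Q\rangle^{2}},\qquad C_{Q}(0)=1
\end{equation*}
A correlation time $\tau_{Q}$ can then be defined by
$C_{Q}(\tau_{Q})=1/e$. Structures separated by several $\tau_{Q}$ may
be treated as approximately independent, and a rapid growth of
$\tau_{Q}$ on cooling signals the approach to the glass transition.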
723:
724: \begin{table}
\caption{\label{casp6_likely} Likely Quality of Structure Seen
at a Free Energy of 10 $k_BT$ for CASP6 Targets }
727: {\centering \begin{tabular}{|c|c|c|c|c|} \hline
728: Target & PDB & length & Probable Q & Sampled Q \\ \hline
729: T0213 & 1TE7 & 103&.43 &.32 \\
730: T0214 & 1S04 & 110&.40 &.30 \\
731: T0224 & 1RHX & 87&.39 &.38 \\
732: T0242 & 2BLK & 123&.45 &.30 \\
733: T0243 & --- & 88&.28 &.32 \\ \hline
734: \end{tabular}\par}
735: \end{table}
736:
737: % \clearpage
738:
As in Fig.~\ref{casp5t0170contactmap}, we compare contact maps
between the predictions and the experimentally resolved structure.
Contact maps are often more insightful than superimposed structures,
especially when viewed in two dimensions. We compare the submitted
structures with the best structure encountered during our sampling to
determine which aspects of folding are being captured by our energy
functions. For a short target, T0201 (PDB ID 1S12), we see in
Fig.~\ref{contact_T0201} that a small difference in the contact maps
can sometimes greatly improve the quality of the
prediction, even though a large number of contacts are already correct.
There was a larger fraction of incorrect contacts in our best
submitted structure for target T0230 (PDB ID 1WCJ) than
in the best sampled structure, as shown in
Fig.~\ref{contact_T0230}. The incorrect parallel docking of the first
two helices is largely resolved in the best sampled structure and the
Q score improves considerably. Similar analysis for target T0281 (PDB
ID 1WHZ) shows incorrect long range contacts between the two otherwise
properly oriented helices, and disordered intermediate interactions, as
in Fig.~\ref{contact_T0281}. Again, the best sampled structure has
these problems largely resolved.
759:
760: \begin{figure}{\par\centering
761: \resizebox*{2.0in}{2.0in}{\rotatebox{00}{\includegraphics{figures/contactbest_sub_t0201.eps}}}
762: \hspace{.5in}
763: \resizebox*{2.0in}{2.0in}{\rotatebox{00}{\includegraphics{figures/contactbest_samp_t0201.eps}}}
764: \par}
765: \caption{\label{contact_T0201}Contact maps for the best submitted
766: (Q=.36) and the best sampled (Q=.44) structures for target T0201.}
767: \end{figure}
768:
769: \begin{figure}{\par\centering
770: \resizebox*{2.0in}{2.0in}{\rotatebox{00}{\includegraphics{figures/contactbest_sub_t0230.eps}}}
771: \hspace{.5in}
772: \resizebox*{2.0in}{2.0in}{\rotatebox{00}{\includegraphics{figures/contactbest_samp_t0230.eps}}}
773: \par}
774: \caption{\label{contact_T0230}Contact maps for the best submitted
775: (Q=.31) and the best sampled (Q=.42) structures for target T0230.}
776: \end{figure}
777:
778: \begin{figure}{\par\centering
779: \resizebox*{2.0in}{2.0in}{\rotatebox{00}{\includegraphics{figures/contactbest_sub_t0281.eps}}}
780: \hspace{.5in}
781: \resizebox*{2.0in}{2.0in}{\rotatebox{90}{\includegraphics{figures/contactbest_samp_t0281.eps}}}
782: \par}
783: \caption{\label{contact_T0281}Contact maps for the best submitted
784: (Q=.34) and the best sampled (Q=.48) structures for target T0281.}
785: \end{figure}
786:
787: % \clearpage
788:
One amusing way to analyze predicted structures is to view the results
of different structure prediction schemes as intermediates along a
kinetic folding coordinate. How far did the simulated annealing get in
the folding pathway? By mapping the likelihood of folding
\cite{Shanknovich_1997} against its location on a folding free energy
surface, we can assess how close the model structure is to the folded
state in a kinetic sense. The energy function for the kinetic
modeling is a G\=o model, \emph{i.e.}, an ideally non-frustrated energy
function. The difference between the G\=o model and the structure
prediction energy functions is a measure of the quality of those
structure prediction schemes. A pairwise additive G\=o model was
created based on the experimentally determined native structure of the
protein. As discussed previously \cite{Eastwood00}, this
G\=o model has both polypeptide backbone energy terms, which are the
same as in the structure prediction energy function described by
Eq.~\ref{amh_back}, and an interaction potential in which the Gaussian
interaction potential distances \(r_{ij}^{N}\) are determined by the
native state, as formally described in Eq.~\ref{amh_go}.
807: \begin{equation}
808: \label{amh_go}
809: E_{\text{G\=o}}=- \frac{\epsilon}{a} \sum_{i<j-3} \gamma_{\text{G\=o}}[x_{(|i-j|)}]\exp\left[-\frac{(r_{ij}-r_{ij}^{N})^2}{\sigma_{ij}^2}\right]
810: \end{equation}
In this minimal model, the interactions are defined between \(
C^{\alpha}-C^{\alpha}, C^{\alpha}-C^{\beta}, C^{\beta}-C^{\alpha},
C^{\beta}-C^{\beta} \) atom pairs of residues separated by more than
three residues in sequence. The weights
\(\gamma_{\text{G\=o}}\), or depths of the Gaussian wells, are set to
(.177,.048,.430) in order to divide the interaction
energy approximately equally between the different distance classes as defined in
the original structure prediction energy function. The widths of the
Gaussians, $\sigma_{ij}^2$, are defined by the sequence separation as
before. Notice that the G\=o Hamiltonian does not contain a summation
over a set of memory structures as in the AMH, because all of
the contacts in this definition of a G\=o model use only the native
state. One hundred independent simulations of this G\=o energy
function are performed starting from the best structure of each of three
different structure prediction groups. Pfold is then calculated by
simply determining whether each simulation started from the model
structure folds to the native structure or not. The results in
Fig.~\ref{pfold_fig} compare three minimalist models, one of which
(the Baker Group) has undergone a further atomistic refinement. Although the
minimalist models are only a few $k_BT$ from the barrier's peak, they
only infrequently cross it. This also suggests that a more detailed, less
coarse-grained sampling procedure may be necessary for correctly
assigning hydrophobic packing and hydrogen bonding patterns.
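
The Pfold estimate described above amounts to a simple counting
procedure. As a sketch, assume that each trajectory is an independent
trial with a binary outcome, and write $n_{\rm fold}$ for the number of
trajectories that reach the native structure and $n_{\rm traj}$ for the
total number run. Both symbols, and the binomial error model, are
introduced here for illustration only.
\begin{equation*}
P_{\rm fold}\approx\frac{n_{\rm fold}}{n_{\rm traj}},\qquad
\sigma_{P}\approx\sqrt{\frac{P_{\rm fold}\,(1-P_{\rm fold})}{n_{\rm traj}}}
\end{equation*}
With $n_{\rm traj}=100$, the binomial standard error $\sigma_{P}$ is at
most $0.05$, a value reached at $P_{\rm fold}=0.5$.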
834:
835: \begin{figure}{\par\centering
836: \resizebox*{3.5in}{3.0in}{\rotatebox{00}{\includegraphics{figures/t0281_pfold_3d.eps}}}
837: \par}
\caption{\label{pfold_fig}G\=o Model Free Energy Surface with final
prediction structures shown. The Pfold values for the three
predictions are 0.07 (Wolynes Group), 0.02 (Scheraga Group), and
0.97 (Baker Group), with an error of $\pm 0.1$.}
842: \end{figure}
843:
844: %\clearpage
845: \subsection*{The Next Generation in Structure Prediction}
846:
Examining the contact maps of structures encountered during the CASP
experiment, we observed that contacts between residues with a
large separation in
sequence can be inaccurate, even when most of the contacts within a
sequence separation of 12 residues are native-like. A different way of
expressing this idea is that the amount of funnelling differs among
the distance classes. When comparing the quality of the
intermediate range interactions in the sampled structures with that of the
memories obtained by sequence analysis from the Protein Data Bank, a
dramatic increase in native-like interactions is seen, as shown in
Fig.~\ref{lh_256ba}. While this was not used in the recent CASP
exercise, we thought it would be interesting and straightforward to
improve the prediction energy function by using these first generation
results as better memory structures in the AMH. Sequence to structure
alignments yield gap-less identity alignments, thereby eliminating any
possibility of secondary structure registry shift irregularities.
863:
864: \begin{figure}{\par\centering
865: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/256ba_qscore_bw.eps}}
866: % \hspace{0.3in} \vspace{0.3in}
867: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/256ba_qscore_short_bw.eps}}
868: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/256ba_qscore_medium_bw.eps}}
869: % \hspace{0.3in}
870: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/256ba_qscore_long_bw.eps}}
871: \par}
\caption{\label{lh_256ba}These figures show the total Q, and the Q
in the different distance classes, comparing PDB structures,
structures from a temperature of 1, and structures from a temperature
near zero as inputs to AMH simulations. The
lowest temperature structures show the largest improvement because they are
fully collapsed.}
878: \end{figure}
879:
Different energy functions have been used to identify native-like
proteins from an ensemble of simulated structures. Alternatively, one
can rely on energy landscape ideas and assume a mean field contact
potential derived from the energy minima of the simulated energy
function. This approach has the additional advantage that it does
not rely on a distinct energy function: one is simply seeing how
close simulated annealing came to completely accessing the global
minimum of the prediction energy function. To select structures, a
pairwise Q, denoted by a lower case q, is calculated between all of the
ground state structures encountered in 200 independent simulations.
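
As an illustration of how such a pairwise comparison might be
computed, the sketch below assumes that q has the Gaussian form
commonly used for Q in AMH studies, with the native distances replaced
by those of a second structure. The structure indices $a$ and $b$, the
number of structures $M$, and the mean $\bar{q}_{a}$ are notation
introduced here, and the exact functional form and selection rule used
in this work may differ.
\begin{align*}
q_{ab}&=\frac{2}{(N-1)(N-2)}\sum_{i<j-1}
\exp\left[-\frac{(r_{ij}^{a}-r_{ij}^{b})^{2}}{\sigma_{ij}^{2}}\right]\\
\bar{q}_{a}&=\frac{1}{M-1}\sum_{b\neq a}q_{ab}
\end{align*}
With this normalisation $q_{aa}=1$. A structure with a large
$\bar{q}_{a}$ lies close to the centre of the ensemble of ground
states, which is one plausible selection criterion under the mean field
picture described above.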
890:
By dividing the inter-chain interactions under the same definitions as
used in the energy function, we find that the potential for improvement from such
second generation structures over the original memories is
considerable for protein 256B. As seen in Fig.~\ref{lh_256ba}, the
low temperature structures identified by little q have an increased
number of native-like contacts in all distance classes. This style of
analysis also suggests potential changes in the energy function. The
interactions that are distant in sequence are also improved over those of the
original memories used in the energy function. In order to utilise this
improvement, the energy function in the distant interaction class was
modified. The original function used a multi-well contact potential,
which does not use any information from the memory proteins. For this
third distance class, the next generation energy function uses
associative memory contacts, much as was done before for modelling with
homologues \cite{Koretke98}. The energy function now takes the form
906: \begin{equation}
E_{\rm int}=- \sum _{c=1}^{3}\frac{\epsilon}{a_{c}}\sum ^{n}_{\mu }\sum ^{N}_{i<j} \gamma
(P_{i}P_{j}P^{\mu }_{i'}P^{\mu }_{j'})\Theta
(r_{ij}-r^{\mu }_{i'j'}).
910: \end{equation}
911: The parameters for this new distance class are taken from the second
912: distance class. The total energy is defined over the set of memory
913: structures as defined by Eq.~\ref{next_gen_units}
914: \begin{equation}
915: \label{next_gen_units}
916: \epsilon=\frac{1}{36}\sum_{1}^{\mu}\frac{\left\vert E^{\rm model}_{\rm amh}\right\vert}{4N},
917: \end{equation}
instead of using the values taken from the optimisation. Some next
generation memory structures are more collapsed than the memory
structures used in the initial round of simulation. Furthermore, the
scaling is changed from the 1:1:1 scaling of the initial round of
simulation amongst the three different (local, super-secondary, tertiary)
distance classes to 1.5:0.5:1 in an effort to approximate the equal
division of energy in each distance class. To examine the equilibrium
properties of this energy function, we need
to estimate the glass transition temperature. As previously explored
\cite{Eastwood2003}, we use the Kolmogorov-Smirnov test to determine
whether two independent simulations have been sampled from the same
equilibrium distribution. This test ensures that
simulations are equilibrated.
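
For reference, the two-sample Kolmogorov-Smirnov statistic compares
the empirical cumulative distributions of an order parameter from two
runs. The sketch below uses notation introduced only for illustration:
$F_{1}$ and $F_{2}$ are the empirical cumulative distributions of an
order parameter such as $q$ from two independent simulations that
contain $n_{1}$ and $n_{2}$ effectively independent samples.
\begin{equation*}
D=\sup_{q}\left|F_{1}(q)-F_{2}(q)\right|,\qquad
n_{\rm eff}=\frac{n_{1}n_{2}}{n_{1}+n_{2}}
\end{equation*}
In the standard large-sample form of the test, the hypothesis that
both runs sample the same distribution is rejected at the 5\% level
when $D\sqrt{n_{\rm eff}}>1.36$.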
931: \begin{figure}
932: \centering
933: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/Pq_black_white_15.eps}}
934: \hspace{.5in}
935: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/Pq_black_white_14.eps}}
\caption{\label{ks_256ba}The Kolmogorov-Smirnov test shows the constant
temperature simulation falling out of equilibrium at the lower
temperature of 1.4. The probability distributions of
structures from two independent simulations are no longer the
same.}
941: \end{figure}
Once the glass transition temperature ($T_{K}$) is estimated using the
Kolmogorov-Smirnov test, we can use standard techniques to
quantify the equilibrium properties of different energy functions.
The proteins we used to study the next generation AMH strategy are
cytochrome B562 (PDB ID 256B) and HDEA (PDB ID 1BG8), because they are
both of moderate size and one of them (1BG8) was not in the training
set of proteins used to optimise the original energy function. An
additional advantage of this choice is that these proteins have different
fold types. According to CATH \cite{FPearl}, HDEA belongs to the
orthogonal bundle architecture, while cytochrome B562 represents an
up-down bundle. Using umbrella sampling combined with the weighted
histogram analysis method, we are able to sample parts of phase space that
would rarely be encountered during a simulation \cite{kong96}. When
using memories with a larger number of native contacts, we see
improved free energy and energy profiles, as shown in
Fig.~\ref{amh_256ba}.
958: \begin{figure}{\par\centering
959: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/fq_bw_paper.eps}}
960: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/eq_local_bw_paper.eps}}
961: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/eq_ss_bw_paper.eps}}
962: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/eq_tert_bw_paper.eps}}
963: \par}
\caption{\label{amh_256ba}The free energy of the two different energy
functions for the protein 256B shows roughly a 5-10 $k_BT$
improvement for this protein. The primary improvements are in the
medium and long range distance classes.}
968: \end{figure}
This is even more impressive when we consider that
the energy function has not yet been properly optimised for this new
Hamiltonian. For the other target, the results are also not
surprising. In this case, the next generation memories used to simulate
this protein were not of greater structural quality than the
initial set. Thus a very similar free energy profile was generated, as
seen in Fig.~\ref{amh_1bg8a}. Our use of q as an order
parameter was successful in identifying the high Q structures for the
256B example. This is due to the highly funnelled character of
the first generation energy function. The original energy function
for 1BG8 is not as funnelled, so there is poorer enrichment by
scanning with little q. This limitation could be overcome by increasing
the amount of sampling of structures in the first generation simulations.
982: \begin{figure}
983: \centering
984: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/qscore_bw_1bg8.eps}}
985: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/fq_bw_paper_1bg8.eps}}
\caption{\label{amh_1bg8a}The free energy of the two different energy
functions for the protein 1BG8 shows little improvement. The
memories, however, show no enrichment in native contacts.}
989: \end{figure}
More simulations would be expected to yield better structures,
as was demonstrated during the CASP5 exercise. This difference in the
enrichment could be anticipated by using the Kolmogorov-Smirnov measure
to differentiate the distribution of the q values encountered between
structures derived from simulation and from the Protein Data Bank.
995:
996: \section*{Conclusion}
997:
These case studies from our participation in the CASP experiment
provide only a snapshot of our group's prediction schemes. They provide a
series of lessons for us and, we hope, for others. In the future, a
more balanced effort between the sampling of structures and the
selection of structures from that ensemble would appear to be desirable.
More effort in
selection would clearly have improved the results submitted in CASP6.
While it was computationally impractical to quench all of the
structures simulated during the prediction season, the comparison of
the contact maps further demonstrated that tempering of the structures
would have improved intermediate range ordering. Using preliminary
structures as input to a next generation of AMH modelling improves the
quality of the prediction results. While these results may initially
appear to be model or energy function specific, we feel that any
algorithm that uses structures as an input would benefit from similar
next generation approaches.
1013:
1014: \section*{Acknowledgments}
1015: The authors thank Joe Hegler, Zaida Luthey-Schulten, Garegin Papoian,
1016: and Marcio Von Muhlen for their key roles in developing codes used in
1017: this study and for many helpful discussions over the years. The
1018: efforts of P.G.W. are supported through the National Institutes of
1019: Health Grant 5RO1GM44557. Computing resources were supplied by the
1020: Center for Theoretical Biological Physics through National Science
1021: Foundation Grants PHY0216576 and PHY0225630.
1022:
1023: %\bibliographystyle{pnas}
1024: \bibliographystyle{ieeetr}
1025: %\bibliography{refs}
1026:
1027:
1028: \providecommand{\refin}[1]{\\ \textbf{Referenced in:} #1}
1029: \begin{thebibliography}{10}
1030:
1031: \bibitem{Moult}
1032: Moult,~J.;\ \ Fidelis,~K.;\ \ Zemla,~A.;\ \ Hubbard,~T. \textit{Proteins}
1033: \textbf{2003,} \textsl{53 Suppl 6,} 334-339.
1034:
1035: \bibitem{GoldsteinRA-AMH-92}
1036: Goldstein,~R.~A.;\ \ Luthey-Schulten,~Z.~A.;\ \ Wolynes,~P.~G. \textit{Proc
1037: Natl Acad Sci USA} \textbf{1992,} \textsl{89,} 4918-4922.
1038:
1039: \bibitem{BryngelsonJD87}
1040: Bryngelson,~J.~D.;\ \ Wolynes,~P.~G. \textit{Proc Natl Acad Sci USA}
1041: \textbf{1987,} \textsl{84,} 7524-7528.
1042:
1043: \bibitem{AnfinsenCB73}
1044: Anfinsen,~C.~B. \textit{Science} \textbf{1973,} \textsl{181,} 223-230.
1045:
1046: \bibitem{GoN83}
1047: G{\={o}},~N. \textit{Annu Rev Biophys and Bioeng} \textbf{1983,} \textsl{12,}
1048: 183-210.
1049:
1050: \bibitem{Koga_Takada01}
1051: Koga,~N.;\ \ Takada,~S. \textit{J Mol Biol} \textbf{2001,} \textsl{313,}
1052: 171-180.
1053:
1054: \bibitem{portman98}
1055: Portman,~J.~J.;\ \ Takada,~S.;\ \ Wolynes,~P.~G. \textit{Phys Rev Lett}
1056: \textbf{1998,} \textsl{81,} 5237--5240.
1057:
1058: \bibitem{Wales}
1059: Wales,~D. \textit{Energy Landscapes;} Cambridge University Press: Cambridge,
1060: UK, 2003.
1061:
1062: \bibitem{Wheelan_etal00}
1063: Wheelan,~S.~J.;\ \ Marchler-Bauer,~A.;\ \ Bryant,~S.~H. \textit{Bioinformatics}
1064: \textbf{2000,} \textsl{16,} 613-618.
1065:
1066: \bibitem{Heringa02}
1067: George,~R.~A.;\ \ Heringa,~J. \textit{J Mol Biol} \textbf{2002,} \textsl{316,}
1068: 839-851.
1069:
1070: \bibitem{Rigden02}
1071: Rigden,~D.~J. \textit{Protein Eng} \textbf{2002,} \textsl{15,} 65-77.
1072:
1073: \bibitem{Hardin2002ab}
1074: Hardin,~C.;\ \ Eastwood,~M.;\ \ Prentiss,~M.;\ \ Luthey-Schulten,~Z.;\ \
1075: Wolynes,~P.~G. \textit{Proc. Nat. Acad. Sci. U.S.A.} \textbf{2002,}
1076: \textsl{100,} 1679-1684.
1077:
1078: \bibitem{Eastwood00}
1079: Eastwood,~M.~P.;\ \ Hardin,~C.;\ \ Luthey-Schulten,~Z.;\ \ Wolynes,~P.~G.
1080: \textit{IBM Systems Research} \textbf{2001,} \textsl{45,} 475-497.
1081:
1082: \bibitem{KoretkeKK96}
1083: Koretke,~K.~K.;\ \ Luthey-Schulten,~Z.;\ \ Wolynes,~P.~G. \textit{Protein Sci}
1084: \textbf{1996,} \textsl{5,} 1043-1059.
1085:
1086: \bibitem{Berman}
1087: Berman,~H.~M.;\ \ Westbrook,~J.;\ \ Feng,~Z.;\ \ Gilliland,~G.;\ \
1088: Bhat,~T.~N.;\ \ Weissig,~H.;\ \ Shindyalov,~I.~N.;\ \ Bourne,~P.~E.
1089: \textit{Nucl. Acids Res.} \textbf{2000,} \textsl{28,} 235-242.
1090:
1091: \bibitem{Bourne98}
1092: Shindyalov,~I.;\ \ Bourne,~P. \textit{Protein Engineering} \textbf{1998,}
1093: \textsl{11,} 739-747.
1094:
1095: \bibitem{FriedrichsMS89}
1096: Friedrichs,~M.~S.;\ \ Wolynes,~P.~G. \textit{Science} \textbf{1989,}
1097: \textsl{246,} 371-373.
1098:
1099: \bibitem{FriedrichsMS90}
1100: Friedrichs,~M.;\ \ Wolynes,~P.~G. \textit{Tet Comp Meth} \textbf{1990,}
1101: \textsl{3,} 175.
1102:
1103: \bibitem{FriedrichsMS91}
1104: Friedrichs,~M.~S.;\ \ Goldstein,~R.~A.;\ \ Wolynes,~P.~G. \textit{J Mol Biol}
1105: \textbf{1991,} \textsl{222,} 1013-1034.
1106:
1107: \bibitem{Hardin00}
1108: Hardin,~C.;\ \ Eastwood,~M.;\ \ Luthey-Schulten,~Z.;\ \ Wolynes,~P.~G.
1109: \textit{Proc Natl Acad Sci USA} \textbf{2000,} \textsl{97,} 14235-14240.
1110:
1111: \bibitem{GoldsteinRA92}
1112: Goldstein,~R.;\ \ Luthey-Schulten,~Z.~A.;\ \ Wolynes,~P.~G. \textit{Proc Natl
1113: Acad Sci USA} \textbf{1992,} \textsl{89,} 9029-9033.
1114:
1115: \bibitem{Hopfield_1982}
1116: Hopfield,~J.~J. \textit{Proc Natl Acad Sci USA} \textbf{1982,} \textsl{79,}
1117: 2554-2558.
1118:
1119: \bibitem{Ryckaert77}
1120: Ryckaert,~J.;\ \ Ciccotti,~G.;\ \ Berendsen,~H. \textit{J Comput Phys}
1121: \textbf{1977,} \textsl{23,} 327-341.
1122:
1123: \bibitem{Rama}
1124: Ramachandran,~G.;\ \ Sasisekharan,~V. \textit{Adv Protein Chem} \textbf{1968,}
1125: \textsl{23,} 283-438.
1126:
1127: \bibitem{Papoian04pnas}
1128: Papoian,~G.~A.;\ \ Ulander,~J.;\ \ Eastwood,~M.~P.;\ \ Luthey-Schulten,~Z.;\ \
1129: Wolynes,~P.~G. \textit{Proc Natl Acad Sci U S A} \textbf{2004,} \textsl{101,}
1130: 3352-3357.
1131:
1132: \bibitem{Papoian03biopoly}
1133: Papoian,~G.~A.;\ \ Wolynes,~P.~G. \textit{Biopolymers} \textbf{2003,}
1134: \textsl{68,} 333-349.
1135:
1136: \bibitem{Papoian03jacs}
1137: Papoian,~G.~A.;\ \ Ulander,~J.;\ \ Wolynes,~P.~G. \textit{J Am Chem Soc}
1138: \textbf{2003,} \textsl{125,} 9170-9178.
1139:
1140: \bibitem{maxfield79}
1141: Maxfield,~F.~R.;\ \ Scheraga,~H.~A. \textit{Biochemistry} \textbf{1979,}
1142: \textsl{18,} 697--704.
1143:
1144: \bibitem{finkelstein98}
1145: Finkelstein,~A.~V. \textit{Phys Rev Lett} \textbf{1998,} \textsl{80,}
1146: 4823-4825.
1147:
1148: \bibitem{Altschul1997}
1149: Altschul,~S.;\ \ Madden,~T.;\ \ Schaffer,~A.;\ \ Zhang,~J.;\ \ Zhang,~Z.;\ \
1150: Miller,~W.;\ \ Lipman,~D. \textit{Nucl. Acids Res.} \textbf{1997,}
1151: \textsl{25,} 3389-3402.
1152:
1153: \bibitem{bioperl}
1154: Stajich,~J.~E. \textit{et al.}\ \textit{Genome Res.} \textbf{2002,}
1155: \textsl{12,} 1611-1618.
1156:
1157: \bibitem{CLUSTAL}
1158: Thompson,~J.;\ \ Higgins,~D.;\ \ Gibson,~T. \textit{Nucl. Acids Res.}
1159: \textbf{1994,} \textsl{22,} 4673-4680.
1160:
1161: \bibitem{Zhang_etal03}
1162: Zhang,~Y.;\ \ Kolinski,~A.;\ \ Skolnick,~J. \textit{Biophys J} \textbf{2003,}
1163: \textsl{85,} 1145-1164.
1164:
1165: \bibitem{Hardin02a}
1166: Hardin,~C.;\ \ Eastwood,~M.;\ \ Prentiss,~M.;\ \ Luthey-Schulten,~Z.;\ \
1167: Wolynes,~P.~G. \textit{J Comput Chem} \textbf{2002,} \textsl{23,} 138-146.
1168:
1169: \bibitem{BONN2001b}
1170: Bonneau,~R.;\ \ Tsai,~J.;\ \ Ruczinski,~I.;\ \ Chivian,~D.;\ \ Rohl,~C.;\ \
1171: Strauss,~C. E.~M.;\ \ Baker,~D. \textit{Proteins} \textbf{2001,}
1172: \textsl{Suppl 5,} 119-126.
1173:
1174: \bibitem{zhou:zhou04}
1175: Zhou,~H.;\ \ Zhou,~Y. \textit{Proteins} \textbf{2004,} \textsl{54,} 315-322.
1176:
1177: \bibitem{kussell:168101}
1178: Kussell,~E.;\ \ Shakhnovich,~E.~I. \textit{Phys Rev Lett} \textbf{2002,}
1179: \textsl{89,} 168101.
1180:
1181: \bibitem{Baker97}
1182: Simons,~K.;\ \ Kooperberg,~C.;\ \ Huang,~E.;\ \ Baker,~D. \textit{J. Mol.
1183: Biol.} \textbf{1997,} \textsl{268,} 209-225.
1184:
1185: \bibitem{Betancourt:2004}
1186: Betancourt,~M.;\ \ Skolnick,~J. \textit{J Mol Biol} \textbf{2004,} \textsl{2,}
1187: 635-649.
1188:
1189: \bibitem{SavenJG96}
1190: Saven,~J.~G.;\ \ Wolynes,~P.~G. \textit{J. Mol. Biol.} \textbf{1996,}
1191: \textsl{257,} 199-216.
1192:
1193: \bibitem{AllenTildesley}
1194: Allen,~M.~P.;\ \ Tildesley,~D.~J. \textit{Computer Simulation of Liquids;}
1195: Clarendon Press: New York, NY, USA, 1987.
1196:
1197: \bibitem{Eastwood2003}
1198: Eastwood,~M.;\ \ Hardin,~C.;\ \ Luthey-Schulten,~Z.;\ \ Wolynes,~P.~G.
1199: \textit{J Chem Phys} \textbf{2003,} \textsl{118,} 8500-8512.
1200:
1201: \bibitem{Shanknovich_1997}
Du,~R.;\ \ Pande,~V.;\ \ Grosberg,~A.~Y.;\ \ Shakhnovich,~E.~I. \textit{J Chem Phys}
1203: \textbf{1997,} \textsl{108,} 334-350.
1204:
1205: \bibitem{Koretke98}
1206: Koretke,~K.~K.;\ \ Luthey-Schulten,~Z.;\ \ Wolynes,~P.~G. \textit{Proc Natl
1207: Acad Sci USA} \textbf{1998,} \textsl{95,} 2932-2937.
1208:
1209: \bibitem{FPearl}
1210: Pearl,~F. \textit{et al.}\ \textit{Nucl. Acids Res.} \textbf{2005,}
1211: \textsl{33,} D247-251.
1212:
1213: \bibitem{kong96}
1214: Kong,~X.;\ \ {Brooks III},~C.~L. \textit{J Chem Phys} \textbf{1996,}
1215: \textsl{105,} 2414--2423.
1216:
1217: \end{thebibliography}
1218:
1219: \end{document}
1220: