q-bio0606012/lanl.tex
1: \documentclass[twocolumn, aps]{revtex4}
2: %\documentclass[preprint, aps]{revtex4}
3: % This tex file has the bibliography inserted at the end
4: \usepackage{natbib}
5: \usepackage{amsmath}
6: %\usepackage{graphicx}
7: 
8: % The macro below allows you to use .eps files in pdflatex.
9: % It converts on the fly .eps to .pdf files if you use pdflatex
10: %    otherwise, if you are using latex, it just uses the .eps file
11: %
12: % Note: filename suffix (.eps) is left out of the includegraphics statement
13: % Note: you must use the command pdflatex -enable-write18 <filename.tex>
14: %       which enables the running of epstopdf as a separate program.
15: %       The default does not allow pdflatex to launch sub-processes
16: 
17: %\ifx\pdfoutput\undefined
18: % this is the case we are running LaTeX, not pdflatex
19: \usepackage{graphicx}
20: %\else
21: % this is the case we are running pdflatex, so convert .eps files to .pdf
22: %\usepackage[pdftex]{graphicx}
23: %\usepackage{epstopdf}
24: %\fi
25: 
26: \begin{document}
27: 
28: \title{Protein Structure Prediction: The Next Generation}
29: \author{Michael C. Prentiss}
30: 
31: \affiliation{Center for Theoretical Biological Physics, La Jolla, CA 92093, Department of Chemistry and Biochemistry,  University of California at San Diego, La Jolla, CA 92093
32: }
33: 
34: \author{Corey Hardin}
\affiliation{Department of Chemistry, University of Illinois at Urbana-Champaign, 600 South Mathews Ave, Urbana, IL 61801}
37: 
38: \author{Michael P. Eastwood}
39: \affiliation{Department of Chemistry and Biochemistry,  University of California at San Diego, La Jolla, CA 92093
40: }
41: 
42: \author{Chenghong Zong}
43: \affiliation{Center for Theoretical Biological Physics, La Jolla, CA 92093, Department of Chemistry and Biochemistry,  University of California at San Diego, La Jolla, CA 92093
44: }
45: \author{Peter G. Wolynes}
\affiliation{Center for Theoretical Biological Physics, La Jolla, CA 92093, Department of Chemistry and Biochemistry, Department of Physics, University of
    California at San Diego, La Jolla, CA 92093}
48: 
49: \date{\today}
50: 
51: \begin{abstract}
52: Over the last 10-15 years a general understanding of the chemical
53: reaction of protein folding has emerged from statistical mechanics.
54: The lessons learned from protein folding kinetics based on energy
55: landscape ideas have benefited protein structure prediction, in
56: particular the development of coarse grained models. We survey results
57: from blind structure prediction.  We explore how second generation
58: prediction energy functions can be developed by introducing
59: information from an ensemble of previously simulated structures.  This
procedure relies on the assumption of a funnelled energy landscape, in
keeping with the principle of minimal frustration.  First generation
62: simulated structures provide an improved input for associative memory
63: energy functions in comparison to the experimental protein structures
64: chosen on the basis of sequence alignment.
65: \end{abstract}
66: 
67: \maketitle
Every other summer, research groups compare their protein
structure prediction methods via the Critical Assessment of Techniques
for Protein Structure Prediction (CASP) experiment.  During the CASP
experiment, sequences of experimentally determined protein structures
that are not yet publicly available are placed on the web.  The exercise is
double blind: neither the organisers nor the participants know
the experimentally determined structure.  Groups respond with up to 5
ranked predictions before a predetermined date, such as the
publication of the structures.  Since the inception of CASP, the three
dimensional structure prediction category has expanded to address
related prediction questions such as sequence to structure alignment
quality, amino acid sidechain placement, domain boundaries in
multi-domain proteins, and the ordered or disordered nature of a protein sequence
\cite{Moult}.
82: 
These different prediction questions can be examined from a common
framework: the principle of minimal frustration.  The principle of
minimal frustration states that native contacts must be more
favourable, in a strict statistical sense \cite{GoldsteinRA-AMH-92},
than non-native contacts in order for proteins to fold on physiological
time scales \cite{BryngelsonJD87}.  Without a sufficient energetic
bias towards the native state, the multi-dimensional energy surface as
a function of native structure possesses too many minima for an
efficient stochastic search.  Such a rough energy surface would lead to
slow folding kinetics, if the protein found a sufficiently
stable native state at all.  Real proteins do not behave this way, since
most proteins are known to fold without assistance \cite{AnfinsenCB73}.
The opposite of a rough energy surface, one that is biased toward the
native basin without any local minima, is an absolute manifestation of
the principle of minimal frustration.  Funnelled energy surfaces that
have no unfavourable energetic traps (\emph{i.e.} G\=o models) have been
shown to reproduce most features of experimental folding kinetics
\cite{GoN83, Koga_Takada01, portman98}.  These energy landscape concepts
can also be applied fruitfully in several areas of chemistry and
physics \cite{Wales}.
Apparently, evolution's energy function is minimally frustrated.
103: 
104: The correlation between a protein sequence and its three dimensional
105: structure can be described using similar landscape language.  As a
106: protein sequence diverges away from a consensus wild type sequence,
107: the potential for energetically unfavourable interactions increases.
108: The wild type sequence and its homologues will fold toward the same
109: native basin.  Only once enough frustrating contacts are added to the
110: wild type sequence will the sequences no longer correspond to the native
111: state ensemble.  Sequences with over 25\% sequence
112: identity to previously determined protein structures are called
113: comparative modelling targets.  The energy landscape underlying such a
114: prediction is a G\=o model based on the structure of the known
homologue.  This heavily funnelled energy surface yields
high resolution structures, with the discrepancies found in the turns and
in residues which have poor sequence to structure alignments.
Fig.~\ref{casp6_difficulty} shows the distribution of sequence identity
between the CASP6 target sequences and known structures.  Since
proteins below 25\% sequence identity are considered new fold or fold
recognition targets, 70\% of the structures were comparative modelling
122: targets.  Recently sequenced genomes such as \textit{E. coli} have the
123: same ratio of \textit{ab initio} to comparative modelling targets,
124: which suggests the analysis of this ratio over time could be a useful
125: measure of the progress of efforts to experimentally find examples of
126: all of Nature's protein structures. 
127:  
128: \begin{figure}{\par\centering
129:  \resizebox*{3.0in}{2.0in}{\includegraphics{figures/difficultyhisto.eps}}
130:  \par}
\caption{\label{casp6_difficulty}The difficulty of the prediction
132:  targets as defined by percent identity. Proteins below 25\%
133:  sequence identity are usually considered \textit{ab initio} or
134:  fold recognition targets.}
135: \end{figure}
136: 
137:  In contrast to comparative modelling, \textit{ab initio} structure
138:  predictions do not have the advantage of creating G\=o like energy
139:  surfaces.  While many \textit{ab initio} targets contain less than 150
140:  residues, and thus are candidates for standard techniques, there are
141:  several that are longer as shown in Fig.~\ref{casp6_ab_length}.  Most
142:  longer sequences will be multi-domain proteins. This causes new
143:  problems.  Folding a protein with two hydrophobic cores allows
144:  for new sources of frustration, beyond those present in single domain
145:  proteins.  To obtain predictions for such problematic sequences, they
146:  usually must be divided into their constituent domains.  Current
147:  methods for dividing the sequence into domains range from purely
148:  sequence based algorithms, which look for sequence patterns in
149:  multiple sequence alignments, to simulation techniques that look for
150:  hydrophobic core formation amongst multiple independent simulations
151:  \cite{Wheelan_etal00,Heringa02,Rigden02}.
152: 
153: \begin{figure}{\par\centering
154:  \resizebox*{3.0in}{2.0in}{\includegraphics{figures/abitinio_length.eps}}
155:  \par}
 \caption{\label{casp6_ab_length}Amino acid lengths of the \textit{ab initio}
     prediction targets in CASP6.}
158: \end{figure}
159: 
 The case studies of difficult structure predictions that we highlight were
  chosen from our participation in the CASP5 and CASP6 experiments.  In
  CASP5, we utilised several improved techniques, such as a backbone hydrogen
  bond term for the proper formation of beta sheets, and a liquid
  crystal like term to ensure parallel or anti-parallel sheet formation
  \cite{Hardin2002ab}.  We also performed target sequence averaging,
  which enhances the funnelling of the prediction landscape
  \cite{Eastwood00}, and assessed our ensemble of sampled structures
  with a twenty letter contact potential for submission \cite{KoretkeKK96}.  Our
  most striking result from this round of blind prediction was a
  prediction for target T0170 (Protein Data Bank \cite{Berman} ID
  1UZC).  Fig.~\ref{casp5t0170overlay} presents the sequence dependent
  overlay of our Model 1 structure with the experimentally determined
  structure.  The sequence dependent alignment quality of this structure
  is high, as measured by a Q score of 0.38.  Q is an order parameter,
  defined in Eq.~\ref{q}, that measures the sequence dependent structural
  similarity of two structures as a normalised
  summation over C-alpha pairwise distance differences.
178:    \begin{equation}
179:    \label{q}
180:    Q=\frac{2}{(N-1)(N-2)}\sum_{i<j-1}
181:        \exp\left[-\frac{(r_{ij}-r_{ij}^{\rm N})^2}{\sigma_{ij}^2}\right]
182:   \end{equation}      
  The resulting order parameter, Q, ranges from 0, when there is no
  similarity between structures at a pair level, to 1, which is an exact
  match.  Q has been shown to be more sensitive in determining the
  quality of intermediate quality protein structure predictions
  \cite{Eastwood00}.  A Q score of 0.4 for a single domain protein
  corresponds to an RMSD of 5$\text{\AA}$.  In most cases the reference state for the Q
  score is the native state, but often one wants to compare structural
  similarity between structures in a simulation.  A sequence independent
  measure, CE \cite{Bourne98}, also scores the prediction well (CE Z-score = 4.1).  The CE
  Z-score measures structural similarity without regard to sequence
  information, and is parameterised such that pairs of structures with a
  Z-score greater than 4 belong to the same protein
  structure family.  The contact map of the prediction,
  Fig.~\ref{casp5t0170contactmap}, which identifies all of the C-alpha
  intramolecular contacts within 9$\text{\AA}$ and whose axes
  are the residue indices of the protein, shows the correct packing of the helices.
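  The Q score of Eq.~\ref{q} can be evaluated directly from two sets of
  C-alpha coordinates.  The following Python sketch is illustrative
  only.  The function name \texttt{q\_score} and the coordinate format
  are assumptions made here, and the width
  $\sigma_{ij}=\left\vert i-j\right\vert^{0.15}\text{\AA}$ is borrowed
  from the Gaussian width quoted in the Methods for the AMH interaction,
  which may differ from the width used for the Q values reported in the
  text.

```python
import math

def q_score(coords, ref_coords):
    """Q between a structure and a reference structure.

    Both arguments are lists of (x, y, z) C-alpha positions in Angstrom.
    """
    n = len(coords)
    if n != len(ref_coords) or n < 3:
        raise ValueError("need two structures of equal length with at least 3 residues")
    total = 0.0
    for i in range(n):
        for j in range(i + 2, n):            # pairs with i < j - 1
            r = math.dist(coords[i], coords[j])
            r_ref = math.dist(ref_coords[i], ref_coords[j])
            sigma = (j - i) ** 0.15          # assumed width, in Angstrom
            total += math.exp(-((r - r_ref) ** 2) / sigma ** 2)
    n_pairs = (n - 1) * (n - 2) / 2          # number of pairs with i < j - 1
    return total / n_pairs
```

  Dividing by the number of pairs is the same normalisation as the
  prefactor $2/[(N-1)(N-2)]$, so identical structures give Q equal to 1.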
199:  \begin{figure}
200:      \centering \includegraphics[width=\linewidth]{figures/t0170overlay.eps}
201:      \caption{\label{casp5t0170overlay}Sequence dependent
202:        superpositions of Model 1 structure against the native state for
203:        CASP5 target T0170 (PDB ID 1UZC).  Blue represents the
204:        prediction and the native state is represented with red.}
205:  \end{figure}
206: 
207:  \begin{figure}{\par\centering
208:  \resizebox*{2.0in}{2.0in}{\rotatebox{90}{\includegraphics{figures/t0170contactmap.eps}}}
209:  \par}
210:  \caption{\label{casp5t0170contactmap}Contact map of target T0170 (PDB
211:      ID 1UZC) model
212:    1 structure against the NMR structure.}
213:  \end{figure}
214: 
  Fig.~\ref{casp6_hubbard} shows the size of partially correct
  segments, continuous in sequence, that fall under an RMSD cutoff.  When compared
  against the other predictions, our Model 1 prediction (dark blue) was
  amongst the best of all submitted structures.  The relative
  success of the predictions also classifies this target as being of
  moderate difficulty.  In this example, CASP demonstrates that small (70
  residues) all alpha proteins are beginning to be successfully
  predicted by a variety of \textit{ab initio} techniques.
223:   \begin{figure}
224:       \centering
225:       \includegraphics[width=\linewidth]{figures/t0170hubbard.eps}
      \caption{\label{casp6_hubbard}Percentage of residues under an RMSD
        limit. (Dark Blue - Model 1, Light Blue - Models 2-5, Orange - Other
        Groups' Predictions)}
229:   \end{figure}
230: 
231:  \section*{Methods}
232: 
233:  \subsection*{Energy Functions and Sampling}
234: 
235:  We used an Associative Memory Hamiltonian (AMH), with optimised
236:  parameters to sample and predict structures
237:  \cite{FriedrichsMS89,FriedrichsMS90,FriedrichsMS91}. The AMH uses a
238:  reduced description of the amino acid chain in order to gain the
239:  orders of magnitude computational acceleration over all atom models
240:  needed to fold moderate length proteins with ordinary
241:  computational resources, and has been described in great detail before
 \cite{Eastwood00}.  This acceleration is achieved by reducing the number of
 atoms per residue from over 10 to only three: the
 \(C_{\alpha},C_{\beta}\), and \(O\). The remaining backbone heavy
 atoms (\(N,C'\)) can be reconstituted using the ideal geometry of the
 peptide bond as a template.  We also reduced the complexity of the
 amino acid code from twenty letters to four.  We chose the four
 letter code because it preserves a diversity of contacts while
 remaining simple enough that the number of
 coefficients to be optimised does not create problems of
 inaccurate statistics, given the limited number of interactions encountered in the
 molten globule state.  Specifically, four amino acid classes are
 defined: hydrophilic (A, G, P, S, T), hydrophobic (C, I, L, M, F, W,
 Y, V), acidic (N, D, Q, E), and basic (R, H, K) \cite{Hardin00}. The
255:  optimisation procedure produces an energy landscape that discriminates
256:  the native state from misfolded states, while avoiding kinetic traps
257:  reasonably well \cite{GoldsteinRA92,GoldsteinRA-AMH-92}.  The AMH is
258:  an analogue to the neural networks designed by Hopfield to synthesise
259:  information from multiple previous experiences \cite{Hopfield_1982}.
260:  This energy function recalls structural patterns in a set of known
261:  protein structures.  The Hamiltonian produces an energetically
262:  favourable minimum when there is sufficient coherence between a set of
263:  three dimensional protein structures.
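 The reduction from the twenty letter to the four letter code can be
 written as a simple lookup.  The sketch below uses exactly the four
 classes listed above; the names \texttt{FOUR\_LETTER\_CLASSES} and
 \texttt{reduce\_sequence} are hypothetical and are not taken from the
 AMH code.

```python
# Four letter classes as listed in the text; identifiers are illustrative.
FOUR_LETTER_CLASSES = {
    "hydrophilic": "AGPST",
    "hydrophobic": "CILMFWYV",
    "acidic": "NDQE",
    "basic": "RHK",
}

# Invert the table so that each one letter amino acid code maps to its class.
RESIDUE_TO_CLASS = {
    residue: name
    for name, members in FOUR_LETTER_CLASSES.items()
    for residue in members
}

def reduce_sequence(sequence):
    """Translate a one letter amino acid sequence into four letter classes."""
    return [RESIDUE_TO_CLASS[residue] for residue in sequence.upper()]
```

 For instance, \texttt{reduce\_sequence("ADKL")} returns one class per
 residue, in sequence order.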
264: 
 The AMH energy function, in its most general sense, consists of a
 backbone term, $E_{\rm back}$, and an interaction term, $E_{\rm int}$,
 defined by
 \begin{equation}
   \label{amh_function}
   E_{\rm total}=E_{\rm back}+E_{\rm int}.
 \end{equation} 
 The backbone energy consists of several terms that reproduce the
 self-avoiding behaviour of the polypeptide chain, given by
274:  \begin{equation}
275:    \label{amh_back}
276:    E_{\rm back}=-(E_{\rm SHAKE}+ E_{\rm rama} + E_{\rm ev} + E_{\rm chain} + E_{\rm chi}).
277:  \end{equation}
 As in many molecular mechanics energy functions, covalent bonds are
 preserved by using the SHAKE algorithm \cite{Ryckaert77}, \(E_{\rm
   SHAKE}\), which enables an increase of the time step size and
 eliminates the need for a traditional harmonic calculation.  The SHAKE
 algorithm preserves the distances between neighbouring
 \(C_{\alpha}\)-\(C_{\beta}\) and \(C_{\alpha}\)-\(O\) atoms.  The
 neighbouring residues limit the variety of angles the backbone atoms
 can occupy, producing a Ramachandran plot \cite{Rama}.  This
 distribution of angles is reinforced by a potential, \(E_{\rm rama}\),
 with low barriers to encourage rapid local backbone movements.
 Another term, \(E_{\rm ev}\), maintains a sequence specific excluded
 volume constraint between \(C_{\alpha}\)-\(C_{\alpha}\),
 \(C_{\beta}\)-\(C_{\beta}\), \(O\)-\(O\), and \(C_{\alpha}\)-\(C_{\beta}\)
 atoms.  The chain connectivity, and the planarity of the peptide bond due
 to resonance, are ensured by means of a harmonic potential, \(E_{\rm
   chain}\).  The chirality of the \(C_{\alpha}\), due to its four
 different bonding partners, is maintained using the scalar product of
 neighbouring unit vectors of carbon and nitrogen bonds, \(E_{\rm
   chi}\).
297: 
 While \(E_{\rm back}\) creates peptide like stereo-chemistry, it
 does not introduce the majority of the attractive
 interactions that result in folding.  Such interactions are supplied by
 the rest of the potential, \(E_{\rm int}\).  The interactions described
 by $E_{\rm int}$ depend on the sequence separation $\left\vert i-j
 \right\vert$.  Specifically,
 they are divided into three proximity classes $x(\left\vert
   i-j\right\vert)$: $x={\rm short}$ ($\left\vert i-j\right\vert<5$),
 $x={\rm medium}$ ($5\le\left\vert i-j\right\vert\le12$) and $x={\rm
   long}$ ($\left\vert i-j\right\vert>12$), as defined by Eq.~\ref{eint},
 \begin{equation}
 \label{eint}
 E_{\rm int}=E_{\rm short}+E_{\rm med}+E_{\rm long}.
 \end{equation}
 These distance classes are also referred to as local,
 super-secondary, and tertiary, respectively.
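 The assignment of a residue pair to a proximity class follows directly
 from the inequalities above.  A minimal Python sketch, in which the
 function name \texttt{proximity\_class} is an assumption made for
 illustration:

```python
def proximity_class(i, j):
    """Return the proximity class of residues i and j from their sequence separation."""
    separation = abs(i - j)
    if separation < 5:
        return "short"      # local
    if separation <= 12:
        return "medium"     # super-secondary
    return "long"           # tertiary
```

 The boundaries are inclusive on the medium class, so separations of 5
 and 12 are both classed as medium.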
314: 
315:  The AMH interaction potential \(E_{\rm int}\) is based on correlations
316:  between a target's sequence signified by \(i,j\), and the
317:  sequence-structure patterns in a set of memory proteins \(\mu\)
318:  represented as \(i',j'\), and a pairwise contact potential.  The pairs
319:  in the target and in the memory are first associated using a
320:  sequence-structure threading algorithm \cite{KoretkeKK96}. The
321:  database is assumed to contain a subset of pair distances, which may
322:  match the associated pair distances in the target structure.  The
323:  general form of the associative memory interaction is:
324:   \begin{equation}
325:   \begin{split} 
326:    \label{amh_general}
327:     E_{\rm int}&= -\frac{\epsilon}{a}\sum ^{N_{mem}}_{\mu}\sum_{j-12 \leq i \leq j-3} \\             
328:      & \gamma(P_{i},P_{j},P^{\mu}_{i'},P^{\mu }_{j'})
329:    \exp\left[-\frac{(r_{ij}-r_{i'j'}^{\rm
330:       \mu})^2}{2\sigma^2_{ij}}\right] \\
           &- \frac{\epsilon}{a}\sum_{k=1}^{3}C_{k}(N)\gamma (P_{i},P_{j},k)U_{k}(r_{ij})
332:   \end{split}
333:   \end{equation}
 where the similarity between target pair distances \(r_{ij}\) and
 aligned memory pair distances \(r^{\mu}_{i'j'}\) is measured by Gaussian
 functions whose widths are given by \(\sigma_{ij}=\left\vert
   i-j\right\vert^{0.15}\text{\AA}\).  The set of parameters,
 \(\gamma\), encodes the similarity between residues \(i\) and \(j\) and the
 memory residues \(i'\) and \(j'\).  Favourable interactions occur when
 the distances achieved in the sequence to structure
 alignments are coherent.  The encoding of the alignment information in
 Eq.~\ref{amh_general} is only an example of what is used for the
 all-alpha energy functions.  Other encodings have been used in the
 alpha-beta energy function \cite{Hardin2002ab} to improve the
 discrimination between helices and strands.  While the first term in
 Eq.~\ref{amh_general} is the superposition of interactions over a set
 of experimentally determined structures, it also depends
 on the sequence separation between the interacting residues.  For
 residues separated by more than 12 residues, a contact potential
 \(E_{\rm long}\) is used, as described by the second term in
 Eq.~\ref{amh_general}, which does not depend on interaction
 information from the structures used to define local in sequence
 interactions.  In this term $C_{k}(N)$ represents a sequence length
 dependent scaling that accounts for the variation in probability
 distributions with sequence length.  Five wells, instead of the
 three defined here by $U_{k}(r_{ij})$, determine interactions in the
 alpha-beta energy function \cite{Hardin2002ab}.  Energy units
 $\epsilon$ are defined, excluding backbone contributions, in terms of a
 native state energy in Eq.~\ref{units},
 \begin{equation}
   \label{units}
   \epsilon=\frac{\left\vert E^{\rm N}_{\rm amh}\right\vert}{4N},
 \end{equation}
 where $N$ is the number of residues.  A distance class scaling, $a$, is
 constant in each of the energy classes because they are designed to
 be equal during the optimisation.
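 To make the associative memory term concrete, the contribution of a
 single residue pair to the first term of Eq.~\ref{amh_general} can be
 sketched as a sum of Gaussians over the memories.  This is an
 illustrative Python fragment rather than the AMH implementation: the
 function name, the argument layout, and the default
 \texttt{eps\_over\_a = 1.0} are assumptions, and the $\gamma$ values
 are supplied by the caller instead of being looked up from the
 optimised parameters.

```python
import math

def memory_pair_energy(r_ij, memory_distances, gammas, separation, eps_over_a=1.0):
    """Associative memory energy of one residue pair, summed over memories.

    r_ij             -- pair distance in the simulated structure (Angstrom)
    memory_distances -- aligned pair distances, one per memory (Angstrom)
    gammas           -- one weight per memory (supplied by the caller)
    separation       -- sequence separation |i - j|, which sets the width
    """
    sigma = separation ** 0.15
    total = 0.0
    for r_mem, gamma in zip(memory_distances, gammas):
        total += gamma * math.exp(-((r_ij - r_mem) ** 2) / (2.0 * sigma ** 2))
    return -eps_over_a * total
```

 When several memories agree on the same distance their Gaussians add
 coherently, which deepens the minimum at that distance, whereas a
 memory with a very different distance contributes almost nothing.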
367: 
368:  The solvent in these energy functions is treated in a mean field
369:  manner, where the implicitly solvated native states of the proteins define the
370:  energy gap to the molten globule state.  Solvent effects are also
371:  present in the sequence to structure alignment energy functions, but
372:  they are not explicitly represented in the molecular dynamics energy
373:  function. Water mediated contacts with an expanded 20 letter code in
374:  the contact potential were introduced \cite{Papoian04pnas}, based upon
375:  previous work which examined protein recognition
 \cite{Papoian03biopoly, Papoian03jacs}.  The water mediated contacts,
 along with a new one dimensional burial term, have shown promising
 results, especially for long proteins.
379: 
 Once the energy function is optimised, the minima of the energy
 function are probed via simulated annealing with molecular dynamics
 simulations.  Molecular dynamics integrates Newton's
 equations of motion to propagate the structure from one time step to the next.
 Simulated annealing slowly reduces the temperature from a high value,
 as in the annealing of steel in metallurgy.  This minimisation algorithm
 allows for local searches, while allowing modest energy barriers
 to be overcome.
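 The two ingredients named above, a decreasing temperature schedule and
 the integration of Newton's equations of motion, can each be sketched
 in a few lines of Python.  Both functions are illustrative
 assumptions.  The text does not state which integrator or thermostat
 was used, so velocity Verlet is shown only as one common choice, for a
 single particle in one dimension, and the coupling of the temperature
 to the dynamics is not shown.

```python
def linear_schedule(t_start, t_end, n_steps):
    """Temperatures changing linearly from t_start to t_end, both included."""
    if n_steps < 2:
        raise ValueError("need at least two temperature points")
    return [t_start + (t_end - t_start) * k / (n_steps - 1) for k in range(n_steps)]

def velocity_verlet_step(x, v, force, dt, mass=1.0):
    """Advance one particle in one dimension by a single time step dt."""
    v_half = v + 0.5 * dt * force(x) / mass      # half kick with the old force
    x_new = x + dt * v_half                      # drift
    v_new = v_half + 0.5 * dt * force(x_new) / mass  # half kick with the new force
    return x_new, v_new
```

 In an annealing run, each temperature of the schedule would be held
 for a number of integration steps before moving on to the next one.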
388: 
389:  Energy landscape ideas have generated an optimisation scheme for
390:  creating funnelled energy surfaces.  While funnelled, the
391:  parameterisation does not eliminate all non-native minima.  The
392:  superposition of several energy surfaces reduces the likelihood of
393:  such trapping in local minima \cite{maxfield79,finkelstein98}.  The
394:  flexibility of the AMH framework provides several ways of
395:  incorporating multiple sequence alignment information.  Some of the
396:  options include creating a consensus sequence \cite{Eastwood00},
397:  simulating different homologue sequences concurrently, and averaging
 the resulting forces and energies \cite{Hardin2002ab}.  The averaged
 AMH energy function we used averages the forces and the energies of
 these simulations over a set of sequences, because this allows for more
 generalisable results than may occur with other techniques; it is
 described by Eqs.~\ref{msa1} and~\ref{msa2},
403: 
  \begin{equation}
   \begin{split}
   \label{msa1}
    E_{\rm short + medium} &= -\frac{1}{N_{seq}}\frac{\epsilon}{a} \sum ^{seq}_{1
      }\sum ^{N_{mem}}_{\mu}\sum_{j-12 \leq i \leq j-3} \\
       & \gamma (P_{i},P_{j},P^{\mu}_{i'},P^{\mu }_{j'}) \exp\left[-\frac{(r_{ij}-r_{i'j'}^{\rm \mu})^2}{2\sigma^2_{ij}}\right]
  \end{split}
  \end{equation}

 \begin{equation}
  \label{msa2}
    E_{\rm long} = -\frac{1}{N_{seq}}\frac{\epsilon}{a} \sum ^{seq}_{1 }\sum_{k=1}^{3}C_{k}(N)\gamma (P_{i},P_{j},k)U_{k}(r_{ij})
 \end{equation}
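 The prefactor $1/N_{seq}$ in Eqs.~\ref{msa1} and \ref{msa2} amounts to
 evaluating the same structure with each homologue sequence in turn and
 taking the mean.  A minimal Python sketch follows, assuming a
 caller-supplied function \texttt{pair\_energy(sequence, coords)}; that
 function, and the toy energy used to exercise it, are hypothetical
 stand-ins for the AMH terms.

```python
def sequence_averaged_energy(pair_energy, sequences, coords):
    """Mean of pair_energy(sequence, coords) over all aligned homologue sequences."""
    if not sequences:
        raise ValueError("need at least one sequence")
    return sum(pair_energy(seq, coords) for seq in sequences) / len(sequences)

def toy_energy(sequence, coords):
    """Toy stand-in for an energy term: -1 per leucine, ignoring the coordinates."""
    return -float(sequence.count("L"))
```

 Because the average is linear, the forces are averaged in the same way
 as the energies.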
417: 
 To superimpose multiple energy landscapes, we need a multiple sequence
 alignment to a set of sequence homologues.  Sequences homologous to
 the target sequence are first identified by using PSI-Blast with
 default parameters \cite{Altschul1997}.  Sequences that satisfy upper and lower
 sequence identity thresholds (70\% and 30\% in this work) are
 then aligned against each other, and proteins that have greater than
 90\% sequence identity to other identified sequence homologues are
 removed.  The culling of the sequence homologues, performed with open source
 bioinformatic libraries \cite{bioperl}, is necessary for two reasons.
 Some classes of proteins have a large number of sequence homologues,
 and performing a multiple sequence alignment can be impractical.  Also,
 removing sequence homologues attempts to remove biases introduced when
 there are few homologues.  The remaining sequences were aligned using
 a multiple sequence alignment algorithm \cite{CLUSTAL}.  Within the AMH
 energy function, gaps occurring in a sequence alignment could be
 addressed in a variety of ways.  In this work, gaps in the target
 sequence are ignored, while gaps within homologues are filled with
 residues from the target protein.  This strategy may introduce biases
 toward the target sequence, but this approach is preferred to
 ignoring interactions.  Fig.~\ref{casp6t0212_msa} shows a
 representative multiple sequence alignment for a target, coloured with
 respect to the four letter code of the AMH.  If one focuses on the
 hydrophobic yellow residues, the alternating hydrophobic and hydrophilic
 patterns characteristic of beta strand formation are apparent.
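 The culling and gap handling steps described above can be sketched in
 Python as follows.  This is not the BioPerl based procedure that was
 actually used.  The function names, the greedy order of the culling,
 and the definition of percent identity over columns in which neither
 sequence has a gap are all assumptions made for illustration.

```python
def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def cull_redundant(aligned, max_identity=90.0):
    """Greedily keep sequences no more than max_identity identical to any kept one."""
    kept = []
    for seq in aligned:
        if all(percent_identity(seq, other) <= max_identity for other in kept):
            kept.append(seq)
    return kept

def fill_gaps(target, homologue):
    """Skip columns where the target has a gap; fill homologue gaps from the target."""
    return "".join(
        t if h == "-" else h
        for t, h in zip(target, homologue)
        if t != "-"
    )
```

 The output of \texttt{fill\_gaps} has the same length as the ungapped
 target, so every homologue can be simulated on the target's chain.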
442: 
443:  \begin{figure}{\par\centering
444:  \includegraphics[width=\linewidth]{figures/t0212_msa.eps}
445:  \par}
446:  \caption{\label{casp6t0212_msa} Multiple sequence alignment for target
447:    T0212 (PDB 1TZA) coloured with respect to a four letter code, where red
448:    represents acidic residues, blue represents polar residues, yellow
449:    represents nonpolar residues, and green represents basic residues.}
450:  \end{figure}
451: 
 Another way of introducing the characteristics of multiple funnelled
 energy landscapes is to use information derived from neural networks
 trained on multiple sequence alignments.  Even with different
 architectures, neural networks typically achieve 75\% accuracy when
 predicting secondary structure.  Recently it has been shown that artful
 combinations of two different predictions can slightly improve the
 results \cite{Zhang_etal03}.  This secondary structure information was
 added through an energy function that biases toward either a helix or a strand,
 $E_{Q_{\rm ss}} = 10^5 \epsilon(Q-Q_{\rm ss})^4$ \cite{Eastwood00}, where
 ${Q_{\rm ss}}$ is defined by Eq.~\ref{qmf_ss},
 \begin{equation}
   \label{qmf_ss}
   Q_{\rm ss}= \sum_{k}^{n}\frac{2}{(N_{k}-1)(N_{k}-2)}\sum_{i<j-1}\exp\left[-
     \frac{(r_{ij}-r_{ij}^{\rm ss})^2}{\sigma_{ij}^2}\right].
 \end{equation}
 ${Q_{\rm ss}}$ takes the same form as the $Q$ defined before in
 Eq.~\ref{q}, except that the potential acts over $n$ independent secondary
 structure units derived from secondary structure prediction.  The
 distances that define the energy minimum, $r_{ij}^{\rm ss}$, are determined
 from experimentally determined Cartesian distances.  Previously, in an
 effort to incorporate this secondary structure information, the
 Ramachandran potential was altered to bias the backbone
 \cite{Hardin02a}.  The local in sequence potential $E_{Q_{\rm
     ss}}$ is preferred to the Ramachandran potential biasing because
 it avoids SHAKE violations when the strength of the bias is increased.
477: 
 For most selected CASP6 targets, we followed the same protocol.  We
 averaged the AMH potential over multiple sequence homologues when they
 were available.  In most cases, information from secondary structure
 prediction was used to bias secondary structure units to their
 predicted structures.  Molecular dynamics with simulated annealing
 sampled low energy structures.  Constant temperature simulations
 slightly above the predicted glass temperature were also used to generate
 candidate structures.  We collected structures above \(T_{K}\), which
 usually gives the fastest folding, thereby compromising between the
 funnelled and glassy behaviour of the energy function.  Once the
 kinetics of the structure slows, the diversity of structures
 encountered disappears.  The slow kinetics regime typically
 predominates around a temperature of 0.9.  While using a linear
 annealing schedule up to \(T_{K}\), about 25 different collapsed
 structures were collected during each simulation.  The amount of
 sampling performed for each target varied from about 500 to 20,000
 different structures.  While this was roughly 50 times more sampling
 than we had previously performed in the CASP setting, it is dwarfed by
 the efforts of others who can sample millions of structures by
 using more powerful computational resources \cite{BONN2001b}.
 Subsequently, a smaller subset of structures was selected for
 submission by evaluating the size of the hydrophobic core and the
 hydrophilic surface area.  Further selection criteria included visual
 inspection, agreement with the preliminary secondary structure
 prediction, and low energies predicted from a second optimised contact
 energy function.
504: 
505:  \subsection*{Selection of Structures}
506: 
 To select candidate structures from independent simulated annealing or
 constant temperature trajectories, we calculated both the buried
 hydrophobic surface area and the exposed hydrophilic surface area
 along the trajectory.  To calculate the buried or exposed
 surface area, we assigned residues which have greater than the mean
 total surface area as solvent exposed, and the converse as solvent
 buried.  We scaled each surface area by a weight, $\gamma_{i}$, that represents the
 likelihood of amino acid burial.  The weight was modelled on the free energy
 cost of transferring each amino acid from octanol to water
 \cite{zhou:zhou04} in an effort to introduce sequence specificity,
 as shown in Eq.~\ref{sa},
      \begin{equation}
       \label{sa}
        E_{\rm Burial}=\sum_{i}^{N}
          \begin{cases}
           \gamma_{i}\,SA_{i},& \text{if $ SA_{i} > $ mean surface area} \\
                          0,&  \text{if $ SA_{i} \leq $ mean surface area}
          \end{cases}
      \end{equation}
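 Following the description in the text, in which a residue counts as
 exposed when its surface area exceeds the mean over all residues, the
 burial term can be sketched in Python as below.  The function name
 \texttt{burial\_energy} and the example weights are hypothetical, and
 real weights would come from the octanol to water transfer free
 energies cited above.

```python
def burial_energy(surface_areas, weights):
    """Sum of weight * area over residues whose area exceeds the mean area."""
    if len(surface_areas) != len(weights) or not surface_areas:
        raise ValueError("need one weight per residue and at least one residue")
    mean_area = sum(surface_areas) / len(surface_areas)
    return sum(w * sa for sa, w in zip(surface_areas, weights) if sa > mean_area)
```

 Residues at or below the mean surface area contribute nothing to the
 sum, which corresponds to the zero branch of Eq.~\ref{sa}.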
526: 
This normalisation is desirable because the surface accessibility is
  calculated from our minimal \(C_{\alpha}, C_{\beta }\), and \(O\)
  atoms, which produces amino acids of the same volume.  Such an energy
 term would be more valuable if non-additive interactions and a larger
 number of hydration layers were added.  The
  unavoidable inaccuracies in atomistic force fields, and the slow
  glassy kinetics of sidechain rearrangements, discouraged any all-atom
  completion of the backbone and sidechains or minimisation of
  putative structures \cite{kussell:168101}.
536: 
  \begin{table}
    \caption{\label{table1} Linear Regression of Hydrophobic Burial Energy}
    {\centering \begin{tabular}{|c|c|c|} \hline
        Protein & Fold class & Correlation coefficient \\ \hline \hline
        1R69&\(\alpha\)&0.22 \\
        1BG8&\(\alpha\)&0.33 \\
        1UTG&\(\alpha\)&0.63 \\
        1MBA&\(\alpha\)&0.40 \\
        2MHR&\(\alpha\)&0.46 \\
        1IGD&\(\alpha/\beta\)&$-0.70$ \\
        3IL8&\(\alpha/\beta\)&$-0.06$ \\
        1TIG&\(\alpha/\beta\)&0.02 \\
        % 3chyz&\(\alpha/\beta\)& fill in value  \\
        % 5nulz&\(\alpha/\beta\)& fill in value  \\
        1BFG&\(\beta\)&0.16 \\
        1CKA&\(\beta\)&$-0.14$ \\
        1JV5&\(\beta\)&0.11 \\
        1K0S&\(\beta\)&0.27 \\
        \hline
      \end{tabular}\par}
  \end{table}
558: 
559: %% \clearpage
560: 
  Another parameter we used after sampling to select and examine
  structures was based on sequence specific backbone probabilities.  The
  specificity of local interactions has been fruitful for improving
  structure predictions of collapsed proteins \cite{Baker97}.  In a
  similar spirit, sequence specific nearest neighbour probabilities have
  also been used \cite{Betancourt:2004}.  Local signals have also been
  shown theoretically to contribute roughly a third of the total folding
  gap for $\alpha$ helical proteins \cite{SavenJG96}.  Similarly, we
  started looking at such probabilities to further improve the backbone
  potential of the AMH, but without needing secondary structure
  prediction.
572:   \begin{equation} 
573:    \label{mscore}
    E_{\rm trimer}=\sum_{i=2}^{N-1}\log P(i-1,i,i+1,\phi,\psi)
575:   \end{equation}
  Somewhat surprisingly, the summation of the resulting $\log$
  probabilities, derived from 4,012 highly resolved protein structures,
  could be used as an additional measure as part of a strategy for the
  selection of structures out of an ensemble.  Table~\ref{table1} shows
  the linear correlation coefficients between structures of varying
  Q-scores, sampled above \(T_{\rm K}\), which is where the best
  predictions usually occur before glassy dynamics dominates the
  kinetics.  For proteins with all \(\alpha\) and \(\alpha/\beta\)
  compositions, the summed log probabilities provide discrimination,
  but not within the all \(\beta\) folds.  These results, shown in
  Table~\ref{table_marcio}, echo the previous findings in terms of the
  \(\phi\), \(\psi\) probability maps and also that all \(\beta\)
  structures are less well predicted when a dihedral angle energy
  function is minimised.  The weakness of nearest neighbour excluded
  volume effects in determining local structure is also demonstrated by
  the consistent weakness of secondary structure prediction with
  respect to beta strands.  Alpha helices are correctly predicted to
  roughly 80\% accuracy while beta strands average 60\% accuracy by
  such pure sequence based algorithms.  The difficulty of predicting
  some circular dichroism spectroscopy results for beta to coil
  transitions can also be attributed to the weakness of the local
  backbone excluded volume interactions.
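
The score of Eq.~\ref{mscore} and its linear correlation with Q can be
sketched as follows.  This is only an illustration: the lookup table
\texttt{prob}, the discretisation of \((\phi,\psi)\) into labelled bins,
and the floor probability for combinations missing from the table are
all assumptions, not the procedure used to build our tables.

\begin{verbatim}
```python
import math


def trimer_score(sequence, phi_psi_bins, prob, floor=1e-6):
    """Sum log P(aa[i-1], aa[i], aa[i+1], bin[i]) over interior residues.

    sequence: one-letter amino acid string.
    phi_psi_bins: one discretised (phi, psi) bin label per residue.
    prob: dict mapping (trimer, bin) -> probability (hypothetical table).
    floor: probability assumed for combinations absent from the table.
    """
    total = 0.0
    for i in range(1, len(sequence) - 1):
        key = (sequence[i - 1:i + 2], phi_psi_bins[i])
        total += math.log(prob.get(key, floor))
    return total


def pearson(xs, ys):
    """Linear (Pearson) correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy table and structure, for illustration only.
table = {("AKL", "H"): 0.5, ("KLE", "H"): 0.25}
score = trimer_score("AKLE", ["C", "H", "H", "C"], table)
```
\end{verbatim}

In practice one would compute \texttt{trimer\_score} for every sampled
structure and pass the scores, together with the corresponding Q values,
to \texttt{pearson}.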
597: 
598:   \begin{table}
599:     \caption{\label{table_marcio} Linear Regression of Mscore }
600:     {\centering \begin{tabular}{|c|c|c|} \hline Proteins & fold class &
601:         Correlation  Coefficient \\ \hline \hline
602:         1R69&\(\alpha\)&.29 \\
603:         1BG8&\(\alpha\)&.04 \\
604:         1UTG&\(\alpha\)&.26 \\
605:         1MBA&\(\alpha\)&.26 \\
606:         2MHR&\(\alpha\)&.10 \\
607:         1IGD&\(\alpha/\beta\)&.37 \\
608:         3IL8&\(\alpha/\beta\)&.13 \\
609:         1TIG&\(\alpha/\beta\)&.19 \\
610:   %      3CHY&\(\alpha/\beta\)& .40 \\
611:   %      5NUL&\(\alpha/\beta\)& .25 \\
612:         1BFG&\(\beta\)&.08 \\
613:         1CKA&\(\beta\)&.03 \\
614:         1JV5&\(\beta\)&-.07 \\
615:         1K0S&\(\beta\)&-.10 \\
616:         \hline
617:       \end{tabular}\par}
618:   \end{table}
619: 
620:   %\clearpage
621: 
622:   \section*{Results}
623:   \subsection*{Blind Simulations}
624: 
  For \textit{ab initio} blind predictions in CASP6, we selected
  sequences for which no experimentally determined homologous
  structures were found by automated comparative modelling servers.  The
  overall results for the \textit{ab initio} structure prediction
  simulations are summarised in Table~\ref{casp6_table1}, where the
  abbreviations are: length = the number of amino acids; temp =
  temperature where the best structure was encountered; sub Q and samp Q
  = the best submitted and best sampled structures, respectively, as
  judged by a function of Q; and traj = number of independent
  trajectories simulated.  The
  CASP6 targets are classified under the following categories (NF=new
  fold, FR/A=fold recognition analog, FR/H=fold recognition homologue,
  CM/H=comparative modelling hard).  Targets T0207 and T0270 were
  removed from the experiment, so their CASP classes are undefined.
  Structures for T0207 and T0272-b were not submitted.  There
  are a few main points from these data.  Using a Q of 0.4 as a measure
  of successful prediction, we were able to encounter high quality
  structures for 4 targets and nearly so for 4 others.  For most
  targets, the temperature at which the best structures were sampled
  was between 0.8 and 1.2, which is the annealing regime we
  investigated most thoroughly.  This
  suggests our annealing schedules were close to the behaviour we sought
  \textit{a priori}.  Longer target sequences clearly reduced the
  quality of our predictions.  Also, the proteins
  for which we had a greater number of trajectories naturally showed
  better structures.  A final observation is that the difference between
  the best submitted structure and the best sampled structure is
  disappointingly large for some of the targets.  This can be attributed
  to our strategy of maximising the number of simulations performed
  rather than more carefully studying our trajectories.  This difference
  would be smaller if greater care had been taken in the selection of
  the structures, but the number of high quality structures would have
  been smaller.
657:   \begin{table}
658:     \caption{\label{casp6_table1}CASP6 Results: Best Submitted and Sampled Structures  }
659:     {\centering \begin{tabular}{|c|c|c|c|c|c|c|c|} \hline 
660:         target & length & fold &sub Q&samp Q& temp & traj & CASP \\ \hline \hline
661:         T0281   & 70  & \(\alpha/\beta\) &.34 &.48 & 0.85 & 986 & NF     \\
662:         T0201   & 94  & \(\alpha/\beta\) &.36 &.44 & 1.39 & 199 & NF     \\ 
663:         T0212   & 123 & \(\beta\)        &.26 &.42 & 1.30 &  97 & FR/A   \\ 
664:         T0230   & 102 & \(\alpha/\beta\) &.31 &.42 & 1.05 & 395 & FR/A   \\
665:         \hline   \hline
666:         T0207   & 76  & \(\alpha/\beta\) & -- &.39 & 0.98 & 297 & --     \\
667:         T0224   & 87  & \(\alpha/\beta\) &.30 &.38 & 1.20 & 501 & FR/H   \\
668:         T0263   & 97  & \(\alpha/\beta\) &.34 &.38 & 0.94 & 404 & FR/H   \\ 
669:         T0272-a & 85  & \(\alpha/\beta\) &.30 &.37 & 0.94 &  30 & FR/A   \\ 
670:         T0265   & 102 & \(\alpha/\beta\) &.29 &.34 & 0.83 & 374 & CM/H   \\
671:         T0213   & 103 & \(\alpha/\beta\) &.26 &.32 & 0.98 & 448 & FR/H   \\
672:         T0243   & 88  & \(\alpha/\beta\) &.31 &.32 & 0.95 & 418 & FR/H   \\
673:         T0239   & 98  & \(\alpha/\beta\) &.25 &.32 & 0.99 & 424 & FR/A   \\ 
674:         T0214   & 110 & \(\alpha/\beta\) &.24 &.30 & 0.41 & 348 & FR/H   \\ 
675:         T0242   & 115 & \(\alpha/\beta\) &.27 &.30 & 0.89 & 358 & NF     \\
676:         \hline  \hline
677:         T0270-b & 125 & \(\alpha/\beta\) &.27 &.28 & 0.99 &  32 & --     \\ 
678:         T0270-a & 122 & \(\alpha/\beta\) &.25 &.27 & 0.80 &  47 & --     \\      
679:         T0272-b & 124 & \(\alpha/\beta\) & -- &.26 & 0.81 &  34 & FR/A   \\ 
680:         T0273   & 186 & \(\alpha/\beta\) &.22 &.24 & 0.98 & 189 & NF     \\ 
681:         \hline
682:       \end{tabular}\par}
683:   \end{table}
684: 
  Calculating the free energy of several randomly chosen CASP6 targets
  in Fig.~\ref{fq_totalt0214t0243} provides us with probabilities of
  what we would have expected to see if more simulations had been
  performed during the CASP season.  We can estimate how
  many independent structures need to be seen at this temperature to
  sample the region 10 $k_BT$ above the minimum of the free
  energy.  We see that roughly $e^{10}\approx 2\times10^{4}$ independent
  sampled structures would be needed at a temperature of 1.0.  Target
  T0242 (PDB ID 2BLK) illustrates why the best structure we encountered
  had a Q score of 0.3.  For this target, we sampled roughly 7000
  different structures.  To achieve a Q of 0.45, according to the free
  energy analysis, we would need to increase our sampling by a factor 
  of 3.
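
  The arithmetic behind this estimate can be written out explicitly.  As
  a rough sketch, assume that independent samples are drawn from the
  Boltzmann distribution along Q, so that a region lying $\Delta F$
  above the free energy minimum is visited with relative probability
  $e^{-\Delta F/k_BT}$.  The symbols $N_{\rm needed}$ and
  $N_{\rm sampled}$ are introduced here only for this illustration.
  The number of independent structures needed to observe such a region
  once is then of order
  \begin{equation*}
    N_{\rm needed}\sim e^{\Delta F/k_BT}=e^{10}\approx 2.2\times10^{4},
    \qquad
    \frac{N_{\rm needed}}{N_{\rm sampled}}\approx
    \frac{2.2\times10^{4}}{7\times10^{3}}\approx 3,
  \end{equation*}
  which is consistent with the factor of 3 quoted for target T0242.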
698: 
699: \begin{figure}{\par\centering
700:  \resizebox*{3.0in}{2.0in}{\includegraphics{figures/fq_totalt0214-t0243.eps}}
701:  \par}
702:   \caption{\label{fq_totalt0214t0243}Free Energy calculations for
703:          CASP6 targets T0213, T0214, T0224, T0242, and T0243.}
704: \end{figure}
705: 
  When extrapolating to lower temperatures, we see lower barriers to the
  folded state, and thus if sampling were more complete one would see
  better structures at these temperatures.  This further cooling would
  be a favorable strategy except that dynamic slowing due to the
  approach of the glass transition, which occurs at a temperature of
  0.9, interferes.  Naturally, it is best to sample just above the
  glass transition temperature, which can be approximately found from
  the Q-Q correlation ($\langle Q(t)Q(t+\tau)\rangle$)
  \cite{AllenTildesley}, and by using the Kolmogorov-Smirnov test to
  assess the independence of samples \cite{Eastwood2003}.
  Table~\ref{casp6_likely} indicates the
  best structure we would be likely to see under such sampling
  conditions.  The differences between thermodynamically accessible
  structures and those that were sampled suggest that additional
  simulations would have improved the best structures sampled
  considerably.  The free energy of target T0243 (PDB ID not available) is
  significantly different due to its unusual architecture, which
  contains a buried helix.
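
  A minimal sketch of estimating the Q-Q correlation from a single
  trajectory is given below.  The normalisation (subtracting the mean
  and dividing by the variance) and the function name
  \texttt{q\_autocorrelation} are illustrative assumptions; other
  conventions are equally common.

\begin{verbatim}
```python
def q_autocorrelation(q_series, max_lag):
    """Normalised autocorrelation C(tau) of a time series Q(t).

    C(tau) = <dQ(t) dQ(t+tau)> / <dQ^2>, with dQ = Q - <Q>, so C(0) = 1.
    A slowly decaying C(tau) signals that successive frames are not
    independent samples.
    """
    n = len(q_series)
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must satisfy 0 <= max_lag < len(q_series)")
    mean_q = sum(q_series) / n
    dq = [q - mean_q for q in q_series]
    var = sum(d * d for d in dq) / n
    if var == 0.0:
        raise ValueError("constant series has undefined autocorrelation")
    corr = []
    for lag in range(max_lag + 1):
        pairs = n - lag  # number of frame pairs separated by this lag
        c = sum(dq[t] * dq[t + lag] for t in range(pairs)) / pairs
        corr.append(c / var)
    return corr


# Toy alternating series, for illustration only.
c = q_autocorrelation([0.2, 0.4, 0.2, 0.4, 0.2, 0.4], max_lag=2)
```
\end{verbatim}

  The lag at which the correlation has decayed gives a rough spacing
  between frames that can be treated as independent.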
723: 
724:   \begin{table}
    \caption{\label{casp6_likely} Likely Quality of Structure Seen 
      at a Free Energy of 10 $k_BT$ for CASP6 Targets}
727:     {\centering \begin{tabular}{|c|c|c|c|c|} \hline 
728:         Target & PDB & length  & Probable Q & Sampled Q \\ \hline    
729:         T0213  & 1TE7 & 103&.43 &.32 \\ 
730:         T0214  & 1S04 & 110&.40 &.30 \\
731:         T0224  & 1RHX &  87&.39 &.38 \\ 
732:         T0242  & 2BLK & 123&.45 &.30 \\ 
733:         T0243  & ---  &  88&.28 &.32 \\ \hline
734:       \end{tabular}\par}
735:   \end{table}
736: 
737:  % \clearpage
738: 
  As in Fig.~\ref{casp5t0170contactmap}, we compare contact maps
  between the predictions and the experimentally resolved structure.
  Contact maps are often more insightful than superimposed structures,
  especially when viewed in 2 dimensions.  We compare the submitted
  structures with the best structure encountered during our sampling to
  determine what aspects of folding are being captured by our energy
  functions.  For a short target, T0201 (PDB ID 1S12), we see in
  Fig.~\ref{contact_T0201} that sometimes a small difference in the
  contact maps can greatly improve the quality of the
  prediction even though a large number of contacts are already correct.
  There was a larger fraction of incorrect contacts in our best
  submitted structure for target T0230 (PDB ID 1WCJ) than in the best
  generated structure, as shown in
  Fig.~\ref{contact_T0230}.  The incorrect parallel docking of the first
  two helices is largely resolved in the best sampled structure and the
  Q score improves considerably.  Similar analysis for target T0281 (PDB
  ID 1WHZ) shows incorrect long range contacts between the two otherwise
  properly oriented helices, and disordered intermediate interactions, as
  in Fig.~\ref{contact_T0281}.  Again, the best sampled structure has
  these problems largely resolved.
759: 
760:  \begin{figure}{\par\centering
761:  \resizebox*{2.0in}{2.0in}{\rotatebox{00}{\includegraphics{figures/contactbest_sub_t0201.eps}}}
762:  \hspace{.5in}
763:  \resizebox*{2.0in}{2.0in}{\rotatebox{00}{\includegraphics{figures/contactbest_samp_t0201.eps}}}
764:  \par}
765:  \caption{\label{contact_T0201}Contact maps for the best submitted
766:         (Q=.36) and the best sampled (Q=.44) structures for target T0201.}
767:  \end{figure}
768: 
769:  \begin{figure}{\par\centering
770:  \resizebox*{2.0in}{2.0in}{\rotatebox{00}{\includegraphics{figures/contactbest_sub_t0230.eps}}}
771:  \hspace{.5in}
772:  \resizebox*{2.0in}{2.0in}{\rotatebox{00}{\includegraphics{figures/contactbest_samp_t0230.eps}}}
773:  \par}
774:      \caption{\label{contact_T0230}Contact maps for the best submitted
775:        (Q=.31) and the best sampled (Q=.42) structures for target T0230.}
776:  \end{figure}
777: 
778:   \begin{figure}{\par\centering
779:  \resizebox*{2.0in}{2.0in}{\rotatebox{00}{\includegraphics{figures/contactbest_sub_t0281.eps}}}
780:  \hspace{.5in}
781:  \resizebox*{2.0in}{2.0in}{\rotatebox{90}{\includegraphics{figures/contactbest_samp_t0281.eps}}}
782:  \par}
783:  \caption{\label{contact_T0281}Contact maps for the best submitted
784:        (Q=.34) and the best sampled (Q=.48) structures for target T0281.}
785:  \end{figure}
786: 
787:  % \clearpage
788: 
  One amusing way to analyze predicted structures is to view the results
  of different structure prediction schemes as intermediates along a
  kinetic folding coordinate.  How far did the simulated annealing get in
  the folding pathway?  By mapping the likelihood of folding
  \cite{Shanknovich_1997} against its location on a folding free energy
  surface, we can assess how close the model structure is to the folded
  state in a kinetic sense.  The energy function for the kinetic
  modeling is a G\=o model, \emph{i.e.} an ideally non-frustrated energy
  function.  The difference between the G\=o model and the structure
  prediction energy functions is a measure of the quality of those
  structure prediction schemes.  A pairwise additive G\=o model was
  created based on the experimentally determined native structure of the
  protein.  As has been discussed previously \cite{Eastwood00}, this
  G\=o model has both polypeptide backbone energy terms that are the
  same as in the structure prediction energy function described by
  Eq.~\ref{amh_back} and an interaction potential where the Gaussian
  interaction potential distances \(r_{ij}^{N}\) are determined by the
  native state, formally described in Eq.~\ref{amh_go}.
807:   \begin{equation}
808:     \label{amh_go}
809:     E_{\text{G\=o}}=- \frac{\epsilon}{a} \sum_{i<j-3} \gamma_{\text{G\=o}}[x_{(|i-j|)}]\exp\left[-\frac{(r_{ij}-r_{ij}^{N})^2}{\sigma_{ij}^2}\right]
810:   \end{equation}
  The interactions are defined in this minimal model between residues
  with greater than three residues in sequence separation, over the \(
  C^{\alpha}-C^{\alpha}, C^{\alpha}-C^{\beta}, C^{\beta}-C^{\alpha},
  C^{\beta}-C^{\beta} \) atom pairs.  The weights
  \(\gamma_{\text{G\=o}}\), or the depths of the Gaussian wells, are set
  to (.177,.048,.430) in order to approximately divide the interaction
  energy equally between the different distance classes as defined in
  the original structure prediction energy function.  The widths of the
  Gaussians, $\sigma_{ij}^2$, are defined by the sequence separation as
  before.  Notice that the G\=o Hamiltonian does not contain a summation
  over a set of memory structures as in the AMH; this is because all of
  the contacts in this definition of a G\=o model use only the native
  state.  One hundred independent simulations of this G\=o energy
  function are performed starting with the best structure of each of
  three different structure prediction groups.  Pfold is then calculated
  by simply determining the fraction of simulations started from the
  model structure that fold to the native structure.  The results in
  Fig.~\ref{pfold_fig} compare three minimalist models, one of which
  (the Baker Group) has undergone a further atomistic refinement.
  Although the minimalist models are only a few $k_BT$ from the
  barrier's peak, they only infrequently cross it.  This also suggests
  that a detailed, less coarse grained sampling procedure may be
  necessary for correctly assigning hydrophobic packing and hydrogen
  bonding patterns.
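
  The Pfold estimate described above amounts to a success fraction over
  independent runs.  The sketch below assumes each run has been reduced
  to a boolean outcome (folded or not) and uses the binomial standard
  error as one possible uncertainty estimate; that choice is an
  assumption for illustration and is not necessarily how the error
  quoted in Fig.~\ref{pfold_fig} was obtained.

\begin{verbatim}
```python
import math


def pfold(outcomes):
    """Return (p, standard_error) from a list of boolean folding outcomes.

    p is the fraction of runs that reached the native state; the
    standard error is the binomial estimate sqrt(p * (1 - p) / n).
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one simulation outcome")
    p = sum(1 for folded in outcomes if folded) / n
    return p, math.sqrt(p * (1.0 - p) / n)


# Illustration: 7 folding events out of 100 runs.
p, err = pfold([True] * 7 + [False] * 93)
```
\end{verbatim}

  For one hundred runs the binomial standard error is largest at
  $p=0.5$, where it equals 0.05.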
834: 
835:    \begin{figure}{\par\centering
836:        \resizebox*{3.5in}{3.0in}{\rotatebox{00}{\includegraphics{figures/t0281_pfold_3d.eps}}}
837:        \par}
     \caption{\label{pfold_fig}G\=o Model Free Energy Surface with final
       prediction structures shown. The Pfold values for the three
       structures are the Wolynes Group 0.07, Scheraga Group 0.02, and the
       Baker Group 0.97, with an error of $\pm 0.1$.}
842:    \end{figure}
843: 
844:   %\clearpage
845:   \subsection*{The Next Generation in Structure Prediction}
  Examining the contact maps of structures encountered during the CASP
  experiment, we observed that contacts between residues with a 
  large separation in
  sequence can be inaccurate, even when most of the contacts within a 12
  residue sequence separation are native like.  A different way of
  expressing this idea is that the amount of funnelling differs between
  the distance classes.  When comparing the quality of the
  intermediate range interactions in the sampled structures with that of
  the memories obtained by sequence analysis from the protein data bank,
  a dramatic increase in native-like interactions is seen, as shown in
  Fig.~\ref{lh_256ba}.  While this was not used in the recent CASP
  exercise, we thought it would be interesting and straightforward to
  improve the prediction energy function by using these first generation
  results as better memory structures in the AMH.  Sequence to structure 
  alignments yield gap-less identity alignments, thereby eliminating any 
  possibility of secondary structure registry shift irregularities.
863: 
864:  \begin{figure}{\par\centering
865:  \resizebox*{3.0in}{2.0in}{\includegraphics{figures/256ba_qscore_bw.eps}}
866: %   \hspace{0.3in} \vspace{0.3in}
867:  \resizebox*{3.0in}{2.0in}{\includegraphics{figures/256ba_qscore_short_bw.eps}}
868:  \resizebox*{3.0in}{2.0in}{\includegraphics{figures/256ba_qscore_medium_bw.eps}}
869: %   \hspace{0.3in}
870:  \resizebox*{3.0in}{2.0in}{\includegraphics{figures/256ba_qscore_long_bw.eps}}
871: \par}
   \caption{\label{lh_256ba}These figures show the total Q, and the Q
     in the different distance classes, for PDB structures,
     structures from a temperature of 1, and structures from a
     temperature near zero, all used as inputs to AMH simulations.  The
     lowest temperature structures show the largest improvement because
     they are fully collapsed.}
878:  \end{figure}
879: 
  Different energy functions have been used to identify native like
  proteins from an ensemble of simulated structures.  Alternatively, one
  can rely on energy landscape ideas, and assume a mean field contact
  potential derived from the energy minima of the simulated energy
  function.  This approach has the additional advantage that it does
  not rely on using a distinct energy function: one is simply seeing how
  close simulated annealing was to completely accessing the global
  minimum of the prediction energy function.  To select structures, a
  pairwise Q, denoted by a lower case q, is calculated between all of the
  ground state structures encountered in 200 independent simulations.
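
  The selection by pairwise q can be sketched as below.  We assume here
  that q between two structures is a normalised sum of Gaussians in the
  differences of their pair distances, with a width growing as
  $|i-j|^{0.15}$, and that the structure with the highest mean q to all
  others is selected.  The functional form, the exclusion of nearest
  neighbours, the selection rule, and all names are assumptions for
  illustration only.

\begin{verbatim}
```python
import math


def pair_q(coords_a, coords_b, exponent=0.15):
    """Similarity q in (0, 1] between two structures of equal length.

    coords_*: list of (x, y, z) C-alpha positions. For every residue pair
    with |i - j| >= 2, a Gaussian in the difference of the two pair
    distances is accumulated, with width sigma = |i - j|**exponent
    (assumed form), and the sum is divided by the number of pairs.
    """
    n = len(coords_a)
    if n != len(coords_b) or n < 3:
        raise ValueError("need two structures of equal length >= 3")
    total, pairs = 0.0, 0
    for i in range(n - 2):
        for j in range(i + 2, n):
            ra = math.dist(coords_a[i], coords_a[j])
            rb = math.dist(coords_b[i], coords_b[j])
            sigma = (j - i) ** exponent
            total += math.exp(-((ra - rb) ** 2) / (2.0 * sigma ** 2))
            pairs += 1
    return total / pairs


def most_central(structures):
    """Index of the structure with the highest mean q to all the others."""
    if len(structures) < 2:
        raise ValueError("need at least two structures")

    def mean_q(k):
        others = [pair_q(structures[k], s)
                  for m, s in enumerate(structures) if m != k]
        return sum(others) / len(others)

    return max(range(len(structures)), key=mean_q)
```
\end{verbatim}

  The cost grows quadratically with the number of structures, which is
  modest for a few hundred annealing end points.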
890: 
  By dividing the inter-chain interactions under the same definitions as
  used in the energy function, the potential for improvements from such
  second generation structures over the original memories is
  considerable for protein 256B.  As seen in Fig.~\ref{lh_256ba}, the
  low temperature structures identified by little q have an increased
  number of native like contacts in all distance classes.  This style of
  analysis also suggests potential changes in the energy function.  The
  long distance in sequence interactions are also improved over those of
  the original memories used in the energy function.  In order to
  utilise this improvement, the energy function in the distant
  interaction class was
  modified.  The original function used a multi-well contact potential,
  which does not use any information from the memory proteins.  For this
  third distance class, the next generation energy function uses
  associative memory contacts, much as was done before for modelling with
  homologues \cite{Koretke98}.  The energy function now takes the form
906:   \begin{equation}
    E_{\rm int}=-  \sum ^{3}_{c}\frac{\epsilon}{a_{c}}\sum ^{n}_{\mu }\sum ^{N}_{i<j} \gamma
908:     (P_{i}P_{j}P^{\mu }_{i'}P^{\mu }_{j'})\Theta
909:     (r_{ij}-r^{\mu }_{i'j'}). 
910:   \end{equation}
  The parameters for this new distance class are taken from the second
  distance class.  The energy unit $\epsilon$ is defined over the set of
  memory structures by Eq.~\ref{next_gen_units}
914:   \begin{equation}
915:     \label{next_gen_units}
916:     \epsilon=\frac{1}{36}\sum_{1}^{\mu}\frac{\left\vert E^{\rm model}_{\rm amh}\right\vert}{4N},
917:   \end{equation}
  instead of using the values taken from the optimisation.  Some next
  generation memory structures are more collapsed than the memory
  structures used in the initial round of simulation.  Furthermore, the
  scaling is changed from the initial round of simulation's 1:1:1
  scaling amongst the three different (local, super-secondary, tertiary)
  distance classes to 1.5:0.5:1 in an effort to approximate the equal
  division of energy in each distance class.  To examine the equilibrium 
  properties of this energy function, we need
  to estimate the glass transition temperature.  As previously explored
  \cite{Eastwood2003}, we use the Kolmogorov-Smirnov test to determine
  if two independent simulations have been sampled from the same
  equilibrium distribution.  This test ensures that
  simulations are equilibrated.
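
  For concreteness, the two-sample Kolmogorov-Smirnov statistic can be
  computed directly from two sets of Q values, one from each independent
  simulation.  The sketch below is a plain implementation of the
  textbook statistic with the usual large-sample approximation for the
  critical value.  The significance level and the function names are
  assumptions for illustration, and correlated frames would first need
  to be thinned to approximately independent samples.

\begin{verbatim}
```python
import math


def ks_statistic(sample_a, sample_b):
    """Maximum distance between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    na, nb = len(a), len(b)
    if na == 0 or nb == 0:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples (handles ties).
        while i < na and a[i] == x:
            i += 1
        while j < nb and b[j] == x:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d


def same_distribution(sample_a, sample_b, c_alpha=1.36):
    """True if the KS statistic is below the large-sample critical value.

    c_alpha = 1.36 corresponds to a 5% significance level.
    """
    na, nb = len(sample_a), len(sample_b)
    critical = c_alpha * math.sqrt((na + nb) / (na * nb))
    return ks_statistic(sample_a, sample_b) <= critical
```
\end{verbatim}

  Applying \texttt{same\_distribution} to pairs of runs at successively
  lower temperatures gives a simple operational estimate of where the
  simulations fall out of equilibrium.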
931:  \begin{figure}
932:    \centering
933:    \resizebox*{3.0in}{2.0in}{\includegraphics{figures/Pq_black_white_15.eps}}
934:    \hspace{.5in}
935:    \resizebox*{3.0in}{2.0in}{\includegraphics{figures/Pq_black_white_14.eps}}
   \caption{\label{ks_256ba}The Kolmogorov-Smirnov test shows the
     constant temperature simulation falling out of equilibrium at the
     lower temperature of 1.4.  The probability distributions of
     structures from two independent simulations are no longer the
     same.}
941:  \end{figure}
 Once the glass transition temperature ($T_{K}$) is estimated using the
 Kolmogorov-Smirnov test, we can use standard techniques to
 quantify the equilibrium properties of different energy functions.
 The proteins we used to study the next generation AMH strategy are
 cytochrome B562 (PDB ID 256B) and HDEA (PDB ID 1BG8), because they are
 both of moderate size and one of them (1BG8) was not in the training
 set of proteins used to optimize the original energy function.  An
 additional advantage of this choice is that these proteins have
 different fold types.  According to CATH \cite{FPearl}, HDEA belongs to
 the orthogonal bundle architecture, while cytochrome B562 represents an
 up-down bundle.  Using umbrella sampling combined with the weighted
 histogram analysis method, we are able to sample parts of phase space
 that would rarely be encountered during a simulation \cite{kong96}.
 When using memories with a larger number of native contacts, we see
 improved free energy and energy profiles, as shown in
 Fig.~\ref{amh_256ba}. 
958: \begin{figure}{\par\centering
959: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/fq_bw_paper.eps}}
960: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/eq_local_bw_paper.eps}}
961: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/eq_ss_bw_paper.eps}}
962: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/eq_tert_bw_paper.eps}}
963: \par}
   \caption{\label{amh_256ba}The free energy of the two different energy
     functions for the protein 256B shows roughly a 5-10 $k_BT$
     improvement for this protein.  The primary improvements are in the
     medium and long range distance classes.}
968:  \end{figure}
 This is even more impressive when we consider that
 this energy function has not yet been properly optimised for this new
 Hamiltonian.  For the other target, the results are also not
 surprising.  In this case, the next generation memories used to simulate
 this protein were not of greater structural quality than the
 initial set.  Thus a very similar free energy profile was generated, as
 seen in Fig.~\ref{amh_1bg8a}.  Our use of q as an order
 parameter was successful in identifying the high Q structures for the
 256B example.  This is due to the highly funnelled character of
 the first generation energy function.  The original energy function
 for 1BG8 is not as funnelled, so there is poorer enrichment by
 scanning with little q.  This limitation could be overcome by increasing 
 the amount of sampling of structures in the first generation simulations. 
982: \begin{figure}
983:    \centering
984: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/qscore_bw_1bg8.eps}}
985: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/fq_bw_paper_1bg8.eps}}
\caption{\label{amh_1bg8a}The free energy of the two different energy
     functions for the protein 1BG8 shows little improvement.  The
     memories, though, show no enrichment in native contacts.}
989:  \end{figure}
 More simulations would be expected to yield better structures,
 as was demonstrated during the CASP5 exercise.  This difference in the 
 enrichment could be anticipated by using the Kolmogorov-Smirnov measure
 to compare the distributions of the q values encountered for
 structures derived from simulation and from the protein databank.
995: 
996:  \section*{Conclusion}
997: 
 These case studies from our participation in the CASP experiment only
 provide a snapshot of our group's prediction schemes.  They produce a
 series of lessons for us and, we hope, for others.  In the future, a
 more balanced effort between the sampling and the selection of
 structures from that ensemble would appear to be desirable.  More
 effort in selection would clearly have improved the results submitted
 in CASP6.  While it was computationally impractical to quench all of
 the structures simulated during the prediction season, the comparison
 of the contact maps further demonstrated that tempering of the
 structures would have improved intermediate range ordering.  Using
 preliminary structures as input to a next generation of AMH modelling
 improves the quality of the prediction results.  While these results
 may initially appear to be model or energy function specific, we feel
 that any algorithm that uses structures as an input would benefit from
 similar next generation approaches.
1013: 
1014:  \section*{Acknowledgments}
1015:  The authors thank Joe Hegler, Zaida Luthey-Schulten, Garegin Papoian,
1016:  and Marcio Von Muhlen for their key roles in developing codes used in
1017:  this study and for many helpful discussions over the years.  The
1018:  efforts of P.G.W. are supported through the National Institutes of
1019:  Health Grant 5RO1GM44557.  Computing resources were supplied by the
1020:  Center for Theoretical Biological Physics through National Science
1021:  Foundation Grants PHY0216576 and PHY0225630.
1022: 
1023: %\bibliographystyle{pnas}
1024: \bibliographystyle{ieeetr}
1025: %\bibliography{refs}
1026: 
1027: 
1028: \providecommand{\refin}[1]{\\ \textbf{Referenced in:} #1}
1029: \begin{thebibliography}{10}
1030: 
1031: \bibitem{Moult}
1032: Moult,~J.;\ \ Fidelis,~K.;\ \ Zemla,~A.;\ \ Hubbard,~T. \textit{Proteins}
1033:   \textbf{2003,} \textsl{53 Suppl 6,} 334-339.
1034: 
1035: \bibitem{GoldsteinRA-AMH-92}
1036: Goldstein,~R.~A.;\ \ Luthey-Schulten,~Z.~A.;\ \ Wolynes,~P.~G. \textit{Proc
1037:   Natl Acad Sci USA} \textbf{1992,} \textsl{89,} 4918-4922.
1038: 
1039: \bibitem{BryngelsonJD87}
1040: Bryngelson,~J.~D.;\ \ Wolynes,~P.~G. \textit{Proc Natl Acad Sci USA}
1041:   \textbf{1987,} \textsl{84,} 7524-7528.
1042: 
1043: \bibitem{AnfinsenCB73}
1044: Anfinsen,~C.~B. \textit{Science} \textbf{1973,} \textsl{181,} 223-230.
1045: 
1046: \bibitem{GoN83}
1047: G{\={o}},~N. \textit{Annu Rev Biophys and Bioeng} \textbf{1983,} \textsl{12,}
1048:   183-210.
1049: 
1050: \bibitem{Koga_Takada01}
1051: Koga,~N.;\ \ Takada,~S. \textit{J Mol Biol} \textbf{2001,} \textsl{313,}
1052:   171-180.
1053: 
1054: \bibitem{portman98}
1055: Portman,~J.~J.;\ \ Takada,~S.;\ \ Wolynes,~P.~G. \textit{Phys Rev Lett}
1056:   \textbf{1998,} \textsl{81,} 5237--5240.
1057: 
1058: \bibitem{Wales}
1059: Wales,~D. \textit{Energy Landscapes;} Cambridge University Press: Cambridge,
1060:   UK, 2003.
1061: 
1062: \bibitem{Wheelan_etal00}
1063: Wheelan,~S.~J.;\ \ Marchler-Bauer,~A.;\ \ Bryant,~S.~H. \textit{Bioinformatics}
1064:   \textbf{2000,} \textsl{16,} 613-618.
1065: 
1066: \bibitem{Heringa02}
1067: George,~R.~A.;\ \ Heringa,~J. \textit{J Mol Biol} \textbf{2002,} \textsl{316,}
1068:   839-851.
1069: 
1070: \bibitem{Rigden02}
1071: Rigden,~D.~J. \textit{Protein Eng} \textbf{2002,} \textsl{15,} 65-77.
1072: 
1073: \bibitem{Hardin2002ab}
1074: Hardin,~C.;\ \ Eastwood,~M.;\ \ Prentiss,~M.;\ \ Luthey-Schulten,~Z.;\ \
1075:   Wolynes,~P.~G. \textit{Proc. Nat. Acad. Sci. U.S.A.} \textbf{2002,}
1076:   \textsl{100,} 1679-1684.
1077: 
1078: \bibitem{Eastwood00}
1079: Eastwood,~M.~P.;\ \ Hardin,~C.;\ \ Luthey-Schulten,~Z.;\ \ Wolynes,~P.~G.
1080:   \textit{IBM Systems Research} \textbf{2001,} \textsl{45,} 475-497.
1081: 
1082: \bibitem{KoretkeKK96}
1083: Koretke,~K.~K.;\ \ Luthey-Schulten,~Z.;\ \ Wolynes,~P.~G. \textit{Protein Sci}
1084:   \textbf{1996,} \textsl{5,} 1043-1059.
1085: 
1086: \bibitem{Berman}
1087: Berman,~H.~M.;\ \ Westbrook,~J.;\ \ Feng,~Z.;\ \ Gilliland,~G.;\ \
1088:   Bhat,~T.~N.;\ \ Weissig,~H.;\ \ Shindyalov,~I.~N.;\ \ Bourne,~P.~E.
1089:   \textit{Nucl. Acids Res.} \textbf{2000,} \textsl{28,} 235-242.
1090: 
1091: \bibitem{Bourne98}
1092: Shindyalov,~I.;\ \ Bourne,~P. \textit{Protein Engineering} \textbf{1998,}
1093:   \textsl{11,} 739-747.
1094: 
1095: \bibitem{FriedrichsMS89}
1096: Friedrichs,~M.~S.;\ \ Wolynes,~P.~G. \textit{Science} \textbf{1989,}
1097:   \textsl{246,} 371-373.
1098: 
1099: \bibitem{FriedrichsMS90}
1100: Friedrichs,~M.;\ \ Wolynes,~P.~G. \textit{Tet Comp Meth} \textbf{1990,}
1101:   \textsl{3,} 175.
1102: 
1103: \bibitem{FriedrichsMS91}
1104: Friedrichs,~M.~S.;\ \ Goldstein,~R.~A.;\ \ Wolynes,~P.~G. \textit{J Mol Biol}
1105:   \textbf{1991,} \textsl{222,} 1013-1034.
1106: 
1107: \bibitem{Hardin00}
1108: Hardin,~C.;\ \ Eastwood,~M.;\ \ Luthey-Schulten,~Z.;\ \ Wolynes,~P.~G.
1109:   \textit{Proc Natl Acad Sci USA} \textbf{2000,} \textsl{97,} 14235-14240.
1110: 
1111: \bibitem{GoldsteinRA92}
1112: Goldstein,~R.;\ \ Luthey-Schulten,~Z.~A.;\ \ Wolynes,~P.~G. \textit{Proc Natl
1113:   Acad Sci USA} \textbf{1992,} \textsl{89,} 9029-9033.
1114: 
1115: \bibitem{Hopfield_1982}
1116: Hopfield,~J.~J. \textit{Proc Natl Acad Sci USA} \textbf{1982,} \textsl{79,}
1117:   2554-2558.
1118: 
1119: \bibitem{Ryckaert77}
1120: Ryckaert,~J.;\ \ Ciccotti,~G.;\ \ Berendsen,~H. \textit{J Comput Phys}
1121:   \textbf{1977,} \textsl{23,} 327-341.
1122: 
1123: \bibitem{Rama}
1124: Ramachandran,~G.;\ \ Sasisekharan,~V. \textit{Adv Protein Chem} \textbf{1968,}
1125:   \textsl{23,} 283-438.
1126: 
1127: \bibitem{Papoian04pnas}
1128: Papoian,~G.~A.;\ \ Ulander,~J.;\ \ Eastwood,~M.~P.;\ \ Luthey-Schulten,~Z.;\ \
1129:   Wolynes,~P.~G. \textit{Proc Natl Acad Sci USA} \textbf{2004,} \textsl{101,}
1130:   3352--3357.
1131: 
1132: \bibitem{Papoian03biopoly}
1133: Papoian,~G.~A.;\ \ Wolynes,~P.~G. \textit{Biopolymers} \textbf{2003,}
1134:   \textsl{68,} 333-349.
1135: 
1136: \bibitem{Papoian03jacs}
1137: Papoian,~G.~A.;\ \ Ulander,~J.;\ \ Wolynes,~P.~G. \textit{J Am Chem Soc}
1138:   \textbf{2003,} \textsl{125,} 9170-9178.
1139: 
1140: \bibitem{maxfield79}
1141: Maxfield,~F.~R.;\ \ Scheraga,~H.~A. \textit{Biochemistry} \textbf{1979,}
1142:   \textsl{18,} 697--704.
1143: 
1144: \bibitem{finkelstein98}
1145: Finkelstein,~A.~V. \textit{Phys Rev Lett} \textbf{1998,} \textsl{80,}
1146:   4823-4825.
1147: 
1148: \bibitem{Altschul1997}
1149: Altschul,~S.;\ \ Madden,~T.;\ \ Schaffer,~A.;\ \ Zhang,~J.;\ \ Zhang,~Z.;\ \
1150:   Miller,~W.;\ \ Lipman,~D. \textit{Nucl. Acids Res.} \textbf{1997,}
1151:   \textsl{25,} 3389-3402.
1152: 
1153: \bibitem{bioperl}
1154: Stajich,~J.~E. \textit{et al.}\  \textit{Genome Res.} \textbf{2002,}
1155:   \textsl{12,} 1611-1618.
1156: 
1157: \bibitem{CLUSTAL}
1158: Thompson,~J.;\ \ Higgins,~D.;\ \ Gibson,~T. \textit{Nucl. Acids Res.}
1159:   \textbf{1994,} \textsl{22,} 4673-4680.
1160: 
1161: \bibitem{Zhang_etal03}
1162: Zhang,~Y.;\ \ Kolinski,~A.;\ \ Skolnick,~J. \textit{Biophys J} \textbf{2003,}
1163:   \textsl{85,} 1145-1164.
1164: 
1165: \bibitem{Hardin02a}
1166: Hardin,~C.;\ \ Eastwood,~M.;\ \ Prentiss,~M.;\ \ Luthey-Schulten,~Z.;\ \
1167:   Wolynes,~P.~G. \textit{J Comput Chem} \textbf{2002,} \textsl{23,} 138-146.
1168: 
1169: \bibitem{BONN2001b}
1170: Bonneau,~R.;\ \ Tsai,~J.;\ \ Ruczinski,~I.;\ \ Chivian,~D.;\ \ Rohl,~C.;\ \
1171:   Strauss,~C.~E.~M.;\ \ Baker,~D. \textit{Proteins} \textbf{2001,}
1172:   \textsl{Suppl 5,} 119-126.
1173: 
1174: \bibitem{zhou:zhou04}
1175: Zhou,~H.;\ \ Zhou,~Y. \textit{Proteins} \textbf{2004,} \textsl{54,} 315-322.
1176: 
1177: \bibitem{kussell:168101}
1178: Kussell,~E.;\ \ Shakhnovich,~E.~I. \textit{Phys Rev Lett} \textbf{2002,}
1179:   \textsl{89,} 168101.
1180: 
1181: \bibitem{Baker97}
1182: Simons,~K.;\ \ Kooperberg,~C.;\ \ Huang,~E.;\ \ Baker,~D. \textit{J Mol
1183:   Biol} \textbf{1997,} \textsl{268,} 209--225.
1184: 
1185: \bibitem{Betancourt:2004}
1186: Betancourt,~M.;\ \ Skolnick,~J. \textit{J Mol Biol} \textbf{2004,} \textsl{342,}
1187:   635--649.
1188: 
1189: \bibitem{SavenJG96}
1190: Saven,~J.~G.;\ \ Wolynes,~P.~G. \textit{J Mol Biol} \textbf{1996,}
1191:   \textsl{257,} 199--216.
1192: 
1193: \bibitem{AllenTildesley}
1194: Allen,~M.~P.;\ \ Tildesley,~D.~J. \textit{Computer Simulation of Liquids;}
1195:   Clarendon Press: New York, NY, USA, 1987.
1196: 
1197: \bibitem{Eastwood2003}
1198: Eastwood,~M.;\ \ Hardin,~C.;\ \ Luthey-Schulten,~Z.;\ \ Wolynes,~P.~G.
1199:   \textit{J Chem Phys} \textbf{2003,} \textsl{118,} 8500-8512.
1200: 
1201: \bibitem{Shanknovich_1997}
1202: Du,~R.;\ \ Pande,~V.;\ \ Grosberg,~A.~Y.;\ \ Shakhnovich,~E.~I. \textit{J Chem Phys}
1203:   \textbf{1998,} \textsl{108,} 334--350.
1204: 
1205: \bibitem{Koretke98}
1206: Koretke,~K.~K.;\ \ Luthey-Schulten,~Z.;\ \ Wolynes,~P.~G. \textit{Proc Natl
1207:   Acad Sci USA} \textbf{1998,} \textsl{95,} 2932-2937.
1208: 
1209: \bibitem{FPearl}
1210: Pearl,~F. \textit{et al.}\  \textit{Nucl. Acids Res.} \textbf{2005,}
1211:   \textsl{33,} D247-251.
1212: 
1213: \bibitem{kong96}
1214: Kong,~X.;\ \ {Brooks III},~C.~L. \textit{J Chem Phys} \textbf{1996,}
1215:   \textsl{105,} 2414--2423.
1216: 
1217: \end{thebibliography}
1218: 
1219: \end{document}
1220: