0606:q-bio0606012/lanl.tex

1: \documentclass[twocolumn, aps]{revtex4}

2: %\documentclass[preprint, aps]{revtex4}

3: % This tex file has the bibliography inserted at the end

4: \usepackage{natbib}

5: \usepackage{amsmath}

6: %\usepackage{graphicx}

7:

8: % The macro below allows you to use .eps files in pdflatex.

9: % It converts on the fly .eps to .pdf files if you use pdflatex

10: %    otherwise, if you are using latex, it just uses the .eps file

11: %

12: % Note: filename suffix (.eps) is left out of the includegraphics statement

13: % Note: you must use the command pdflatex -enable-write18 <filename.tex>

14: %       which enables the running of epstopdf as a separate program.

15: %       The default does not allow pdflatex to launch sub-processes

16:

17: %\ifx\pdfoutput\undefined

18: % this is the case we are running LaTeX, not pdflatex

19: \usepackage{graphicx}

20: %\else

21: % this is the case we are running pdflatex, so convert .eps files to .pdf

22: %\usepackage[pdftex]{graphicx}

23: %\usepackage{epstopdf}

24: %\fi

25:

26: \begin{document}

27:

28: \title{Protein Structure Prediction: The Next Generation}

29: \author{Michael C. Prentiss}

30:

31: \affiliation{Center for Theoretical Biological Physics, La Jolla, CA 92093, Department of Chemistry and Biochemistry,  University of California at San Diego, La Jolla, CA 92093

32: }

33:

34: \author{Corey Hardin}

35: \affiliation{Department of Chemistry, University of Illinois at Urbana-Champaign, 600 South Mathews Ave, Urbana, IL 61801

36: }

37:

38: \author{Michael P. Eastwood}

39: \affiliation{Department of Chemistry and Biochemistry,  University of California at San Diego, La Jolla, CA 92093

40: }

41:

42: \author{Chenghong Zong}

43: \affiliation{Center for Theoretical Biological Physics, La Jolla, CA 92093, Department of Chemistry and Biochemistry,  University of California at San Diego, La Jolla, CA 92093

44: }

45: \author{Peter G. Wolynes}

46: \affiliation{Center for Theoretical Biological Physics, La Jolla, CA 92093, Department of Chemistry and Biochemistry,  Department of Physics, University of

47:     California at San Diego, La Jolla, CA 92093}

48:

49: \date{\today}

50:

51: \begin{abstract}

52: Over the last 10-15 years a general understanding of the chemical

53: reaction of protein folding has emerged from statistical mechanics.

54: The lessons learned from protein folding kinetics based on energy

55: landscape ideas have benefited protein structure prediction, in

56: particular the development of coarse grained models. We survey results

57: from blind structure prediction.  We explore how second generation

58: prediction energy functions can be developed by introducing

59: information from an ensemble of previously simulated structures.  This

60: procedure relies on the assumption of a funnelled energy landscape

61: in keeping with the principle of minimal frustration.  First generation

62: simulated structures provide an improved input for associative memory

63: energy functions in comparison to the experimental protein structures

64: chosen on the basis of sequence alignment.

65: \end{abstract}

66:

67: \maketitle

68: Every other summer, research groups compare their different protein

69: structure prediction methods via the Critical Assessment of Techniques

70: for Protein Structure Prediction (CASP) experiment.  During the CASP

71: experiment, sequences of experimentally determined protein structures

72: that are not publicly available are placed on the web. This exercise is

73: double blind where neither the organisers nor the participants know

74: the experimentally determined structure.  Groups respond with up to 5

75: ranked predictions, before a predetermined date, such as the

76: publication of the structures.  Since the inception of CASP, the three

77: dimensional structure prediction category has expanded to address

78: related prediction questions such as sequence to structure alignment

79: quality, amino acid sidechain placement, domain boundaries in multi-domain

80: proteins, and the ordered or disordered nature of a protein sequence

81: \cite{Moult}.

82:

83: These different prediction questions can be examined from a common

84: framework: the principle of minimal frustration.  The principle of

85: minimal frustration states that native contacts must be more

86: favourable, in a strict statistical sense \cite{GoldsteinRA-AMH-92},

87: than non-native contacts in order for proteins to fold on physiological

88: time scales \cite{BryngelsonJD87}.  Without a sufficient energetic

89: bias towards the native state, the multi-dimensional energy surface as

90: a function of native structure possesses too many minima for an

91: efficient stochastic search.  Such an energy surface would lead to

92: slow folding kinetics, if the proteins ever found a sufficiently

93: stable native state at all.  This is not the case, since we know most proteins

94: fold without assistance \cite{AnfinsenCB73}.  The opposite of a

95: rough energy surface, one biased toward the native basin without any

96: local minima, is an absolute manifestation of the principle of minimal

97: frustration. Funnelled energy surfaces that have no unfavourable

98: energetic traps (\emph{i.e.} G\=o models) have been shown to reproduce

99: most features of experimental folding kinetics \cite{GoN83,

100: Koga_Takada01, portman98}. These energy landscape concepts can be

101: richly applied in several areas of chemistry and physics \cite{Wales}.

102: Apparently, evolution's energy function is minimally frustrated.

103:

104: The correlation between a protein sequence and its three dimensional

105: structure can be described using similar landscape language.  As a

106: protein sequence diverges away from a consensus wild type sequence,

107: the potential for energetically unfavourable interactions increases.

108: The wild type sequence and its homologues will fold toward the same

109: native basin.  Only once enough frustrating contacts are added to the

110: wild type sequence will the sequences no longer correspond to the native

111: state ensemble.  Sequences with over 25\% sequence

112: identity to previously determined protein structures are called

113: comparative modelling targets.  The energy landscape underlying such a

114: prediction is a G\=o model based on the structure of the known

115: homologue. This heavily funnelled energy surface yields

116: high resolution structures, with the discrepancies lying in the turns and

117: in residues which have poor sequence to structure alignments.

118: Fig.~\ref{casp6_difficulty} shows the distribution of homology

119: of protein sequences to known structures included in CASP6.  Since

120: proteins below 25\% sequence identity are considered new fold

121: recognition targets, 70\% of the structures were comparative modelling

122: targets.  Recently sequenced genomes such as \textit{E. coli} have the

123: same ratio of \textit{ab initio} to comparative modelling targets,

124: which suggests the analysis of this ratio over time could be a useful

125: measure of the progress of efforts to experimentally find examples of

126: all of Nature's protein structures.

127:

128: \begin{figure}{\par\centering

129:  \resizebox*{3.0in}{2.0in}{\includegraphics{figures/difficultyhisto.eps}}

130:  \par}

131: \caption{\label{casp6_difficulty}The difficulty of the prediction

132:  targets as defined by percent identity. Proteins below 25\%

133:  sequence identity are usually considered \textit{ab initio} or

134:  fold recognition targets.}

135: \end{figure}

136:

137:  In contrast to comparative modelling, \textit{ab initio} structure

138:  predictions do not have the advantage of creating G\=o like energy

139:  surfaces.  While many \textit{ab initio} targets contain less than 150

140:  residues, and thus are candidates for standard techniques, there are

141:  several that are longer as shown in Fig.~\ref{casp6_ab_length}.  Most

142:  longer sequences will be multi-domain proteins. This causes new

143:  problems.  Folding a protein with two hydrophobic cores allows

144:  for new sources of frustration, beyond those present in single domain

145:  proteins.  To obtain predictions for such problematic sequences, they

146:  usually must be divided into their constituent domains.  Current

147:  methods for dividing the sequence into domains range from purely

148:  sequence based algorithms, which look for sequence patterns in

149:  multiple sequence alignments, to simulation techniques that look for

150:  hydrophobic core formation amongst multiple independent simulations

151:  \cite{Wheelan_etal00,Heringa02,Rigden02}.

152:

153: \begin{figure}{\par\centering

154:  \resizebox*{3.0in}{2.0in}{\includegraphics{figures/abitinio_length.eps}}

155:  \par}

156:  \caption{\label{casp6_ab_length}Amino acid lengths of the \textit{ab initio}

157:      prediction targets for CASP6.}

158: \end{figure}

159:

160:  The case studies we highlight of difficult structure predictions were

161:   chosen from our participation in the CASP5 and CASP6 experiments.  In

162:   CASP5, we utilised several improved techniques, such as a backbone hydrogen

163:   bond term for the proper formation of beta sheets, and a liquid

164:   crystal like term to ensure parallel or anti-parallel sheet formation

165:   \cite{Hardin2002ab}.  We also performed target sequence averaging

166:   which enhances the funnelling of the prediction landscape

167:   \cite{Eastwood00}, and assessed our ensemble of sampled structures

168:   with a twenty letter contact potential for submission \cite{KoretkeKK96}.  Our

169:   most striking result from this round of blind prediction was a

170:   prediction for target T0170 (Protein Data Bank \cite{Berman}, PDB

171:   ID 1UZC).  Fig.~\ref{casp5t0170overlay} presents the sequence dependent

172:   overlay of our Model 1 structure with the experimentally determined

173:   structure.  The sequence dependent alignment quality of this structure

174:   is high as measured by a Q score of 0.38. Q is an order parameter

175:   defined in Eq.~\ref{q} that measures the sequence dependent structural

176:   complementarity of two structures, where Q is defined as a normalised

177:   summation of C-alpha pairwise contact differences.

178:    \begin{equation}

179:    \label{q}

180:    Q=\frac{2}{(N-1)(N-2)}\sum_{i<j-1}

181:        \exp\left[-\frac{(r_{ij}-r_{ij}^{\rm N})^2}{\sigma_{ij}^2}\right]

182:   \end{equation}

183:   The resulting order parameter, Q, ranges from 0, when there is no

184:   similarity between structures at a pair level, to 1 which is an exact

185:   match.  Q has been shown to be more sensitive in determining the

186:   quality of intermediate quality protein structure predictions

187:   \cite{Eastwood00}. A Q score of 0.4 for a single domain protein corresponds to an

188:   RMSD of 5$\text{\AA}$.  In most cases the reference state for the Q

189:   score is the native state, but often one wants to compare structural

190:   similarity between structures in a simulation.  A sequence independent

191:   measure, CE \cite{Bourne98}, also scores well (CE Z-score = 4.1).  The CE

192:   Z-score measures structural complementarity without regard to sequence

193:   information, and is parameterised such that structures with a

194:   Z-score greater than 4 belong to the same protein

195:   structure family.  The contact map of the prediction,

196:   Fig.~\ref{casp5t0170contactmap}, which identifies all of the C-alpha

197:   pairs within 9$\text{\AA}$ of each other and whose axes

198:   are the residue indices of the protein, shows the correct packing of the helices.
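  As an illustration of Eq.~\ref{q}, the following Python sketch evaluates Q for two C-alpha traces and collects the residue pairs that would appear in a contact map.  It is not the code used for the predictions.  The function names \texttt{q\_score} and \texttt{contact\_map} are hypothetical, and the width $\sigma_{ij}=\left\vert i-j\right\vert^{0.15}\text{\AA}$ is assumed to be the same one quoted for the Gaussians of the interaction potential.

\begin{verbatim}
```python
import math


def q_score(coords, ref_coords, exponent=0.15):
    """Q of Eq. (q) for two C-alpha traces of equal length N.

    coords, ref_coords: lists of (x, y, z) tuples in Angstrom.
    Only pairs with j > i + 1 enter the sum.
    Assumption: sigma_ij = |i - j|**exponent Angstrom.
    """
    n = len(coords)
    if n != len(ref_coords) or n < 3:
        raise ValueError("need two traces of equal length >= 3")
    total = 0.0
    for i in range(n):
        for j in range(i + 2, n):
            sigma = (j - i) ** exponent
            r_ij = math.dist(coords[i], coords[j])
            r_ref = math.dist(ref_coords[i], ref_coords[j])
            total += math.exp(-((r_ij - r_ref) ** 2) / sigma ** 2)
    # (N-1)(N-2)/2 is the number of pairs with j > i + 1
    return 2.0 * total / ((n - 1) * (n - 2))


def contact_map(coords, cutoff=9.0):
    """Residue index pairs (i, j), i < j, with C-alpha atoms within cutoff."""
    n = len(coords)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(coords[i], coords[j]) <= cutoff
    }
```
\end{verbatim}

  Two identical traces give $Q=1$ in this sketch, because the sum runs over exactly $(N-1)(N-2)/2$ pairs.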

199:  \begin{figure}

200:      \centering \includegraphics[width=\linewidth]{figures/t0170overlay.eps}

201:      \caption{\label{casp5t0170overlay}Sequence dependent

202:        superpositions of Model 1 structure against the native state for

203:        CASP5 target T0170 (PDB ID 1UZC).  Blue represents the

204:        prediction and the native state is represented with red.}

205:  \end{figure}

206:

207:  \begin{figure}{\par\centering

208:  \resizebox*{2.0in}{2.0in}{\rotatebox{90}{\includegraphics{figures/t0170contactmap.eps}}}

209:  \par}

210:  \caption{\label{casp5t0170contactmap}Contact map of target T0170 (PDB

211:      ID 1UZC) model

212:    1 structure against the NMR structure.}

213:  \end{figure}

214:

215:   Fig.~\ref{casp6_hubbard} shows the size of partially correct

216:   continuous in sequence segments under an RMSD cutoff.  When compared

217:   against the other predictions, our Model 1 prediction (Dark Blue) was

218:   amongst the best of all submitted structures.  The relative

219:   success of the prediction also classifies this target as being of

220:   moderate difficulty.  In this example CASP demonstrates that small (70

221:   residues) all alpha proteins are beginning to be successfully

222:   predicted by a variety of \textit{ab initio} techniques.

223:   \begin{figure}

224:       \centering

225:       \includegraphics[width=\linewidth]{figures/t0170hubbard.eps}

226:       \caption{\label{casp6_hubbard}Percentage of residues under a RMSD

227:         limit. (Dark Blue - Model 1, Light Blue - Model 2-5, Orange - Other

228:         Groups Prediction)}

229:   \end{figure}

230:

231:  \section*{Methods}

232:

233:  \subsection*{Energy Functions and Sampling}

234:

235:  We used an Associative Memory Hamiltonian (AMH), with optimised

236:  parameters to sample and predict structures

237:  \cite{FriedrichsMS89,FriedrichsMS90,FriedrichsMS91}. The AMH uses a

238:  reduced description of the amino acid chain in order to gain the

239:  orders of magnitude computational acceleration over all atom models

240:  needed to fold moderate length proteins with ordinary

241:  computational resources, and has been described in great detail before

242:  \cite{Eastwood00}.  This is possible due to reducing the number of

243:  atoms per residue from over 10 to only three backbone atoms: the

244:  \(C_{\alpha},C_{\beta}\), and \(O\). The remaining backbone heavy

245:  atoms (\(N,C'\)) can be reconstituted using the ideal geometry of the

246:  peptide bond as a template. Also we reduced the complexity of the

247:  amino acid code from twenty letters, to four.  We chose the four

248:  letter code, which has the advantage of preserving a diversity of

249:  contacts, because it is still simple enough that the number of

250:  coefficients that need to be optimised does not create problems of

251:  inaccurate statistics due to limits of interactions encountered in the

252:  molten globule state.  Specifically four amino acid classes are

253:  defined: hydrophilic (A, G, P, S, T), hydrophobic (C, I, L, M, F, W,

254:  Y, V), acidic (N,D,Q,E), and basic (R,H,K) \cite{Hardin00}. The

255:  optimisation procedure produces an energy landscape that discriminates

256:  the native state from misfolded states, while avoiding kinetic traps

257:  reasonably well \cite{GoldsteinRA92,GoldsteinRA-AMH-92}.  The AMH is

258:  an analogue to the neural networks designed by Hopfield to synthesise

259:  information from multiple previous experiences \cite{Hopfield_1982}.

260:  This energy function recalls structural patterns in a set of known

261:  protein structures.  The Hamiltonian produces an energetically

262:  favourable minimum when there is sufficient coherence between a set of

263:  three dimensional protein structures.
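 The four letter code listed above can be written down directly as a lookup table.  The sketch below is only an illustration of that reduction; the names \texttt{FOUR\_LETTER\_CLASSES} and \texttt{reduce\_sequence} are hypothetical and do not come from the AMH code.

\begin{verbatim}
```python
# Amino acid classes as listed in the text (one letter codes).
FOUR_LETTER_CLASSES = {
    "hydrophilic": "AGPST",
    "hydrophobic": "CILMFWYV",
    "acidic": "NDQE",
    "basic": "RHK",
}

# Invert the table: residue letter -> class name.
RESIDUE_TO_CLASS = {
    residue: name
    for name, members in FOUR_LETTER_CLASSES.items()
    for residue in members
}


def reduce_sequence(sequence):
    """Map a one letter amino acid sequence onto the four classes."""
    return [RESIDUE_TO_CLASS[residue] for residue in sequence.upper()]
```
\end{verbatim}

 With four classes, a symmetric pair table has only ten distinct class pairs, compared with 210 for the twenty letter code, which is the reduction in coefficients referred to above.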

264:

265:  The AMH energy function, in its most general sense, consists of a

266:  backbone term, $E_{\rm back}$ and interaction term, $E_{\rm int}$

267:  defined by,

268:  \begin{equation}

269:    \label{amh_function}

270:    E_{\rm total}=E_{\rm back}+E_{\rm int}.

271:  \end{equation}

272:  The backbone energy term consists of several terms that reproduce the

273:  self-avoiding behaviour of the polypeptide chain given by,

274:  \begin{equation}

275:    \label{amh_back}

276:    E_{\rm back}=-(E_{\rm SHAKE}+ E_{\rm rama} + E_{\rm ev} + E_{\rm chain} + E_{\rm chi}).

277:  \end{equation}

278:  As in many molecular mechanics energy functions, covalent bonds are

279:  preserved by using the SHAKE algorithm \cite{Ryckaert77} \(E_{\rm

280:    SHAKE}\), which enables an increase of the time step size, and

281:  eliminates the need for a traditional harmonic calculation.  The SHAKE

282:  algorithm preserves the distances between neighbouring

283:  \(C_{\alpha}\)-\(C_{\beta}\), and \(C_{\alpha}\)-\(O\) atoms.  The

284:  neighbouring residues limit the variety of angles the backbone atoms

285:  can occupy, producing a Ramachandran plot \cite{Rama}.  This

286:  distribution of angles is reinforced by a potential, \(E_{\rm rama}\)

287:  with low barriers to encourage rapid local backbone movements.

288:  Another term, \(E_{\rm ev}\) maintains a sequence specific excluded

289:  volume constraint between \(C_{\alpha}\)-\(C_{\alpha}\),

290:  \(C_{\beta}\)-\(C_{\beta}\), \(O\)-\(O\), \(C_{\alpha}\)-\(C_{\beta}\)

291:  atoms.  The chain connectivity, and planarity of the peptide bond due

292:  to resonance is ensured by means of a harmonic potential, \(E_{\rm

293:    chain}\).  Also the chirality of the \(C_{\alpha}\), due to its four

294:  different bonding partners, is maintained using the scalar product of

295:  neighbouring unit vectors of carbon and nitrogen bonds, \(E_{\rm

296:    chi}\).

297:

298:  While \(E_{\rm back}\) creates peptide like stereo-chemistry, it

299:  does not introduce the majority of the attractive

300:  interactions that result in folding. Such interactions are supplied by

301:  the rest of the potential \(E_{\rm int}\). The interactions described

302:  by $E_{\rm int}$ depend on the sequence separation $\left\vert i-j

303:  \right\vert$. Specifically,

304:  they are divided into three proximity classes $x(\left\vert

305:    i-j\right\vert)$: $x={\rm short}$ ($\left\vert i-j\right\vert<5$),

306:  $x={\rm medium}$ ($5\le\left\vert i-j\right\vert\le12$) and $x={\rm

307:    long}$ ($\left\vert i-j\right\vert>12$) as defined by Eq.~\ref{eint}.

308:  \begin{equation}

309:  \label{eint}

310:  E_{\rm int}=E_{\rm short}+E_{\rm med}+E_{\rm long}.

311:  \end{equation}

312:  These distance classes are also referred to as local,

313:  super-secondary, and tertiary, respectively.
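 The three proximity classes amount to a simple rule on the sequence separation.  A minimal sketch, with the hypothetical function name \texttt{proximity\_class}, is:

\begin{verbatim}
```python
def proximity_class(i, j):
    """Proximity class of residues i and j from their sequence separation."""
    separation = abs(i - j)
    if separation < 5:
        return "short"    # local
    if separation <= 12:
        return "medium"   # super-secondary
    return "long"         # tertiary
```
\end{verbatim}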

314:

315:  The AMH interaction potential \(E_{\rm int}\) is based on correlations

316:  between a target's sequence signified by \(i,j\), and the

317:  sequence-structure patterns in a set of memory proteins \(\mu\)

318:  represented as \(i',j'\), and a pairwise contact potential.  The pairs

319:  in the target and in the memory are first associated using a

320:  sequence-structure threading algorithm \cite{KoretkeKK96}. The

321:  database is assumed to contain a subset of pair distances, which may

322:  match the associated pair distances in the target structure.  The

323:  general form of the associative memory interaction is:

324:   \begin{equation}

325:   \begin{split}

326:    \label{amh_general}

327:     E_{\rm int}&= -\frac{\epsilon}{a}\sum ^{N_{mem}}_{\mu}\sum_{j-12 \leq i \leq j-3} \\

328:      & \gamma(P_{i},P_{j},P^{\mu}_{i'},P^{\mu }_{j'})

329:    \exp\left[-\frac{(r_{ij}-r_{i'j'}^{\rm

330:       \mu})^2}{2\sigma^2_{ij}}\right] \\

331:            &- \frac{\epsilon}{a}\sum_{k=1}^{3}C_{k}(N)\gamma (P_{i},P_{j},k)U_{k}(r_{ij})

332:   \end{split}

333:   \end{equation}

334:  where the similarity between target pair distances \(r_{ij}\), with

335:  aligned memory pair distances \(r^{\mu}_{i'j'}\) is measured by Gaussian

336:  functions whose widths are given by \(\sigma_{ij}=\left\vert

337:    i-j\right\vert^{0.15}\text{\AA}\). The set of parameters,

338:  \(\gamma\), encodes the similarity between residues \(i\) and \(j\), and the

339:  memory residues \(i'\) and \(j'\).  Favourable interactions occur when there is

340:  coherence in the distances obtained from the sequence to structure

341:  alignments. The encoding of the alignment information in

342:  Eq.~\ref{amh_general} is only an example of what is used for the

343:  all-alpha energy functions.  Other encodings have been used in the

344:  alpha-beta energy function \cite{Hardin2002ab} to improve the

345:  discrimination between helices and strands. While the first term in

346:  Eq.~\ref{amh_general} is the superposition of interactions over a set

347:  of experimentally determined structures, it also shares a dependence

348:  on the sequence separation between the interacting residues.  For

349:  residues separated by more than 12 residues, a contact potential

350:  \(E_{\rm long}\), described by the second term in

351:  Eq.~\ref{amh_general}, is used; it does not depend on interaction

352:  information from the structures used to define local in sequence

353:  interactions.  In this term $C_{k}(N)$ represents a sequence length

354:  dependence scaling to account for the variation in probability

355:  distributions based on sequence length. Five wells instead of the

356:  three defined here by $U_{k}(r_{ij})$ determine interactions in the

357:  alpha-beta energy function \cite{Hardin2002ab}.  Energy units

358:  $\epsilon$ are defined excluding backbone contributions in terms of a

359:  native state energy in Eq.~\ref{units},

360:  \begin{equation}

361:    \label{units}

362:    \epsilon=\frac{\left\vert E^{\rm N}_{\rm amh}\right\vert}{4N},

363:  \end{equation}

364:  where $N$ is the number of residues. A distance class scaling $a$, is

365:  constant in each of the energy classes because they are designed to

366:  be equal during the optimisation.
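 To make the first term of Eq.~\ref{amh_general} concrete, the sketch below evaluates the associative memory contribution of a single target pair $(i,j)$ summed over the memories.  It is an illustration under simplifying assumptions, not the AMH implementation: the weights $\gamma$ are passed in as plain numbers, one per memory, and the names \texttt{sigma\_ij} and \texttt{memory\_pair\_energy} are hypothetical.

\begin{verbatim}
```python
import math


def sigma_ij(i, j, exponent=0.15):
    """Gaussian width |i - j|**0.15 in Angstrom, as quoted in the text."""
    return abs(i - j) ** exponent


def memory_pair_energy(i, j, r_ij, memory_distances, gammas,
                       epsilon=1.0, a=1.0):
    """Associative memory energy of one target pair (i, j).

    memory_distances[mu] is the aligned memory distance r^mu_{i'j'};
    gammas[mu] is the (assumed, precomputed) weight for that memory.
    """
    sigma = sigma_ij(i, j)
    total = 0.0
    for r_mu, gamma in zip(memory_distances, gammas):
        total += gamma * math.exp(-((r_ij - r_mu) ** 2) / (2.0 * sigma ** 2))
    return -(epsilon / a) * total
```
\end{verbatim}

 When several memories agree on the same pair distance their Gaussians add coherently and the well deepens, while a memory with a very different distance contributes almost nothing at that separation.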

367:

368:  The solvent in these energy functions is treated in a mean field

369:  manner, where the implicitly solvated native states of the proteins define the

370:  energy gap to the molten globule state.  Solvent effects are also

371:  present in the sequence to structure alignment energy functions, but

372:  they are not explicitly represented in the molecular dynamics energy

373:  function. Water mediated contacts with an expanded 20 letter code in

374:  the contact potential were introduced \cite{Papoian04pnas}, based upon

375:  previous work which examined protein recognition

376:  \cite{Papoian03biopoly, Papoian03jacs}.  The water mediated contacts

377:  along with a new one dimensional burial term have shown promising

378:  results especially for long proteins.

379:

380:  Once the energy function is optimised, the minima of the energy

381:  function are probed via simulated annealing with molecular dynamics

382:  simulations.  This minimisation technique integrates Newton's

383:  equations of motion to determine the configuration at the next time step.

384:  Simulated annealing slowly reduces the temperature from a high value

385:  as in the tempering of steel in metallurgy.  This minimisation algorithm

386:  allows for local searches, while allowing modest energy barriers

387:  to be overcome.
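 The idea of annealing with a linearly decreasing temperature can be illustrated on a toy problem.  The sketch below is not the molecular dynamics used in this work: it swaps in overdamped Langevin dynamics on a one dimensional double well $U(x)=(x^{2}-1)^{2}$, and all names and parameter values (\texttt{linear\_schedule}, \texttt{anneal}, the time step, the friction) are assumptions chosen for illustration.

\begin{verbatim}
```python
import math
import random


def linear_schedule(t_start, t_end, n_steps):
    """Temperatures changing linearly from t_start to t_end."""
    if n_steps < 2:
        raise ValueError("need at least two steps")
    return [t_start + (t_end - t_start) * k / (n_steps - 1)
            for k in range(n_steps)]


def double_well_force(x):
    """Force -dU/dx for the toy potential U(x) = (x**2 - 1)**2."""
    return -4.0 * x * (x * x - 1.0)


def anneal(x0, temperatures, dt=0.01, friction=1.0, seed=0):
    """Overdamped Langevin dynamics with a time dependent temperature."""
    rng = random.Random(seed)
    x = x0
    for temperature in temperatures:
        # Thermal kick whose variance shrinks as the system is cooled
        kick = math.sqrt(2.0 * temperature * dt / friction)
        x += double_well_force(x) * dt / friction + kick * rng.gauss(0.0, 1.0)
    return x
```
\end{verbatim}

 At high temperature the particle crosses the barrier at $x=0$ freely; as the temperature falls it becomes confined to one of the two wells near $x=\pm 1$.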

388:

389:  Energy landscape ideas have generated an optimisation scheme for

390:  creating funnelled energy surfaces.  While funnelled, the

391:  parameterisation does not eliminate all non-native minima.  The

392:  superposition of several energy surfaces reduces the likelihood of

393:  such trapping in local minima \cite{maxfield79,finkelstein98}.  The

394:  flexibility of the AMH framework provides several ways of

395:  incorporating multiple sequence alignment information.  Some of the

396:  options include creating a consensus sequence \cite{Eastwood00},

397:  simulating different homologue sequences concurrently, and averaging

398:  the resulting forces and energies \cite{Hardin2002ab}.  The averaged

399:  AMH energy function we used averages the forces and the energies of

400:  these simulations over a set of sequences, because it allows for more

401:  generalisable results than may occur with other techniques, and is

402:  described in Eqs.~\ref{msa1} and~\ref{msa2},

403:

404:   \begin{equation}

405:    \begin{split}

406:    \label{msa1}

407:     E_{\rm short + medium} &= -\frac{1}{N_{seq}}\frac{\epsilon}{a} \sum ^{seq}_{1

408:       }\sum ^{N_{mem}}_{\mu}\sum_{j-12 \leq i \leq j-3} \\

409:        & \gamma (P_{i},P_{j},P^{\mu}_{i'},P^{\mu }_{j'}) \exp\left[-\frac{(r_{ij}-r_{i'j'}^{\rm \mu})^2}{2\sigma^2_{ij}}\right]

410:   \end{split}

411:   \end{equation}

412:

413:  \begin{equation}

414:   \label{msa2}

415:     E_{\rm long} = -\frac{1}{N_{seq}}\frac{\epsilon}{a} \sum ^{seq}_{1 }\sum_{k=1}^{3}C_{k}(N)\gamma (P_{i},P_{j},k)U_{k}(r_{ij})

416:  \end{equation}

417:

418:  To superimpose multiple energy landscapes, we need a multiple sequence

419:  alignment to a set of sequence homologues.  Sequences homologous to

420:  the target sequence are first identified by using PSI-Blast with

421:  default parameters \cite{Altschul1997}.  The sequences selected by

422:  sequence identity thresholds (70\% and 30\% in this work) are

423:  then aligned against each other, and proteins that have greater than

424:  90\% sequence identity to other identified sequence homologues are

425:  removed.  The culling of the sequence homologues via open source

426:  bioinformatic libraries \cite{bioperl} is necessary for two reasons.

427:  Some classes of proteins have a large number of sequence homologues,

428:  and performing a multiple sequence alignment can be impractical.  Also

429:  removing sequence homologues attempts to remove biases introduced when

430:  there are few homologues.  The remaining sequences were aligned using

431:  a multiple sequence alignment algorithm \cite{CLUSTAL}.  Within the AMH

432:  energy function, gaps occurring in a sequence alignment could be

433:  addressed in a variety of ways.  In this work gaps in the target

434:  sequence are ignored, while gaps within homologues are completed with

435:  residues from the target protein.  This strategy may introduce biases

436:  toward the target sequence, but this approach is preferred to perhaps

437:  ignoring interactions.  Fig.~\ref{casp6t0212_msa} shows a

438:  representative multiple sequence alignment for a target, coloured with

439:  respect to the four letter code of the AMH.  If one focuses on the

440:  hydrophobic yellow residues, the alternating hydrophobic and hydrophilic

441:  patterns characteristic of beta strand formation are apparent.
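 The culling and gap handling steps described above can be sketched as follows.  This is a hedged illustration and not the BioPerl based pipeline used in this work: sequences are assumed to be already aligned strings of equal length with \texttt{-} marking gaps, the greedy culling order is an assumption, and the function names are hypothetical.

\begin{verbatim}
```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    columns = [(a, b) for a, b in zip(seq_a, seq_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def cull_redundant(homologues, max_identity=90.0):
    """Greedily drop homologues above max_identity to one already kept."""
    kept = []
    for sequence in homologues:
        if all(percent_identity(sequence, other) <= max_identity
               for other in kept):
            kept.append(sequence)
    return kept


def fill_gaps_with_target(target, homologue, gap="-"):
    """Ignore target gap columns; fill homologue gaps with target residues."""
    filled = []
    for t, h in zip(target, homologue):
        if t == gap:
            continue  # gap in the target: column is ignored
        filled.append(t if h == gap else h)
    return "".join(filled)
```
\end{verbatim}

 After \texttt{fill\_gaps\_with\_target}, every homologue has the length of the ungapped target, so the same pair indices $(i,j)$ can be used for all sequences in the average.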

442:

443:  \begin{figure}{\par\centering

444:  \includegraphics[width=\linewidth]{figures/t0212_msa.eps}

445:  \par}

446:  \caption{\label{casp6t0212_msa} Multiple sequence alignment for target

447:    T0212 (PDB 1TZA) coloured with respect to a four letter code, where red

448:    represents acidic residues, blue represents polar residues, yellow

449:    represents nonpolar residues, and green represents basic residues.}

450:  \end{figure}

451:

452:  Another way of introducing the characteristics of multiple funnelled

453:  energy landscapes is using information derived from neural networks

454:  trained on multiple sequence alignments.  Even with different

455:  architectures, neural networks typically achieve 75\% accuracy when

456:  predicting secondary structure.  Recently it has been shown that artful

457:  combinations of two different predictions can slightly improve the

458:  results \cite{Zhang_etal03}.  This secondary structure information was

459:  added by a biasing energy function to either a helix or a strand via,

460:  $E_{Q_{ss}} = 10^5 \epsilon(Q-Q_{ss})^4$ \cite{Eastwood00}, where

461:  ${Q_{ss}}$ is defined by Eq.~\ref{qmf_ss},

462:  \begin{equation}

463:    \label{qmf_ss}

464:    Q_{\rm ss}= \sum_{k}^{n}\frac{2}{(N_{k}-1)(N_{k}-2)}\sum_{i<j-1}\exp\left[-

465:      \frac{(r_{ij}-r_{ij}^{\rm ss})^2}{\sigma_{ij}^2}\right].

466:  \end{equation}

467:  ${Q_{\rm ss}}$ takes the same form as the $Q$ defined before in

468:  Eq.~\ref{q} except that the potential acts over $n$ independent secondary

469:  structure units derived from secondary structure prediction. The

470:  distances that define the energy minimum, $r_{ij}^{\rm ss}$, are determined

471:  from experimentally determined Cartesian distances.  Previously in an

472:  effort to incorporate this secondary structure information, the

473:  Ramachandran potential has been altered to bias the backbone

474:  \cite{Hardin02a}.  The local in sequence potential $E_{\rm Q_{\rm

475:      ss}}$ is preferred to the Ramachandran potential biasing because

476:  it avoids SHAKE violations when the strength of the bias is increased.

477:

478:  For most selected CASP6 targets, we followed the same protocol.  We

479:  averaged the AMH potential over multiple sequence homologues when they

480:  were available.  In most cases, information from secondary structure

481:  prediction was used to bias secondary structure units to their

482:  predicted structures.  Molecular dynamics with simulated annealing

483:  sampled low energy structures.  Constant temperature simulations

484:  slightly above the predicted glass temperature were also used to generate

485:  candidate structures. We collected structures above \(T_{K}\), which

486:  usually gives the fastest folding thereby compromising between the

487:  funnelled and glassy behaviour of the energy function.  Once the

488:  kinetics of the structure slows, the diversity of structures

489:  encountered disappears.  The slow kinetics regime typically

490:  predominates around a temperature of 0.9.  While using a linear

491:  annealing schedule up to \(T_{K}\), about 25 different collapsed

492:  structures were collected during each simulation.  The amount of

493:  sampling performed for each structure varied from about 500 to 20,000

494:  different structures.  While this was roughly 50 times more sampling

495:  than we had previously performed in the CASP setting, it is dwarfed by

496:  the efforts of others who can sample in the millions of structures by

497:  using more powerful computational resources \cite{BONN2001b}.

498:  Subsequently, a smaller subset of structures was selected for

499:  submission by evaluating the size of the hydrophobic core and the

500:  hydrophilic surface area.  Further selection criteria included visual

501:  inspection, agreement with the preliminary secondary structure

502:  prediction, and low energies predicted from a second optimised contact

503:  energy function.

504:

505:  \subsection*{Selection of Structures}

506:

507:  To select candidate structures from independent simulated annealing or

508:  constant temperature trajectories, we calculated both the buried

509:  hydrophobic surface area and the exposed hydrophilic surface area

510:  along the trajectory.  In an effort to calculate the buried or exposed

511:  surface area, we assigned residues which have greater than the mean

512:  total surface area as solvent exposed, and the converse as solvent

513:  buried.  We scaled each surface area by a weight to represent the

514:  likelihood of amino acid burial. The weight was modelled on the free energy

515:  cost of transferring each amino acid from octanol to water

516:  \cite{zhou:zhou04} in an effort to introduce a sequence specificity

517:  as shown in Eq.~\ref{sa},

518:       \begin{equation}

519:        \label{sa}

520:         E_{\rm Burial}=\sum_{i}^{N}

521:           \begin{cases}

522:            \gamma_{i}\,SA_{i},& \text{if $ SA_{i} > $ mean total surface area} \\

523:                           0,&  \text{if $ SA_{i} \le $ mean total surface area}

524:           \end{cases}

525:       \end{equation}

526:

527: This normalisation is desirable because the surface accessibility is

528:   calculated from our minimal \(C_{\alpha}, C_{\beta }\), and \(O\)

529:   atoms, which produces amino acids of the same volume.  Such an energy

530:  term would be more valuable if non-additive interactions, and a larger

531:  number of hydration layers were added.  The

532:   unavoidable inaccuracies in atomistic force fields, and the slow

533:   glassy kinetics of sidechain rearrangements prevented any completion

534:   of the backbone and sidechains with all-atoms or minimisation of

535:   putative structures \cite{kussell:168101}.
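
As an illustration of Eq.~\ref{sa}, the selection rule can be written with a step function.  The notation \(\langle SA\rangle\) for the mean surface area and the numerical values below are hypothetical and serve only to make the bookkeeping explicit.
\begin{equation}
  E_{\rm Burial}=\sum_{i=1}^{N}\gamma_{i}\,SA_{i}\,
  \Theta\!\left(SA_{i}-\langle SA\rangle\right),
  \qquad
  \langle SA\rangle=\frac{1}{N}\sum_{j=1}^{N}SA_{j},
\end{equation}
where \(\Theta(x)=1\) for \(x>0\) and \(\Theta(x)=0\) otherwise.  For a toy chain of three residues with \(SA=(40,100,160)\) in arbitrary units, \(\langle SA\rangle=100\), so only the third residue exceeds the mean and \(E_{\rm Burial}=160\,\gamma_{3}\).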

536:

537:   \begin{table}

538:     \caption{\label{table1} Linear Regression of Hydrophobic Burial Energy }

539:     {\centering \begin{tabular}{|c|c|c|} \hline Proteins & fold class &

540:         Correlation  Coefficient \\ \hline \hline

541:         1R69&\(\alpha\)&.22 \\

542:         1BG8&\(\alpha\)&.33 \\

543:         1UTG&\(\alpha\)&.63 \\

544:         1MBA&\(\alpha\)&.40 \\

545:         2MHR&\(\alpha\)&.46 \\

546:         1IGD&\(\alpha/\beta\)&-.70 \\

547:         3IL8&\(\alpha/\beta\)&-.06 \\

548:         1TIG&\(\alpha/\beta\)&.02    \\

549:         % 3chyz&\(\alpha/\beta\)& fill in value  \\

550:         % 5nulz&\(\alpha/\beta\)& fill in value  \\

551:         1BFG&\(\beta\)&.16 \\

552:         1CKA&\(\beta\)&-.14 \\

553:         1JV5&\(\beta\)&.11 \\

554:         1K0S&\(\beta\)&.27  \\

555:         \hline

556:       \end{tabular}\par}

557:   \end{table}

558:

559: %% \clearpage

560:

561:   Another parameter we used after sampling to select and examine

562:   structures was based on sequence specific backbone probabilities.  The

563:   specificity of local interactions has been fruitful for improving

564:   structure predictions of collapsed proteins \cite{Baker97}.  In a similar

565:   spirit, sequence specific nearest neighbour probabilities have also been used

566:   \cite{Betancourt:2004}.  Local signals have also been theoretically

567:   shown to contribute roughly a third of the total folding gap for

568:   $\alpha$ helical proteins \cite{SavenJG96}.  Similarly we started

569:   looking at such probabilities to further improve the backbone

570:   potential of the AMH, but without needing secondary structure

571:   prediction.

572:   \begin{equation}

573:    \label{mscore}

574:     E_{\rm trimer}=\sum_{i=2}^{N-1}\log P(i-1,i,i+1,\phi,\psi)

575:   \end{equation}

576:   Somewhat surprisingly, the summation of the resulting $\log$

577:   probabilities from 4,012 highly resolved protein structures could be

578:   used as an additional measure as part of a strategy for the selection

579:   of structures out of an ensemble.  Table~\ref{table1} shows the linear

580:   correlation coefficients between structures of varying Q-scores,

581:   sampled above \(T_{\rm K}\) which is where the best predictions

582:   usually occur before glassy dynamics dominates the kinetics.  For both

583:   proteins with all \(\alpha\), and \(\alpha/\beta\) compositions, the

584:   summed log probabilities provide discrimination, but not within the

585:   all \(\beta\) folds.  These results, shown in Table~\ref{table_marcio},

586:   echo the previous findings in terms of the \(\phi\), \(\psi\)

587:   probability maps and also the observation that all beta structures are less well

588:   predicted when a dihedral angle energy function is minimised.  The

589:   weakness of nearest neighbour excluded volume effects to determine

590:   local structure is also demonstrated in the consistent weakness of

591:   secondary structure prediction with respect to beta strands. Alpha

592:   helices are correctly predicted to roughly 80\% accuracy while beta

593:   strands average 60\% accuracy by such pure sequence based algorithms.

594:   The difficulty of predicting some circular dichroism spectroscopy

595:   results for beta to coil transitions can also be attributed to the

596:   weakness of the local backbone excluded volume interactions.
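
The correlation coefficients in Tables~\ref{table1} and \ref{table_marcio} can be read as follows.  We assume here that the standard Pearson (linear) correlation coefficient is meant; the symbols \(S_{k}\) (score of sampled structure \(k\)), \(Q_{k}\) (its Q-score), and \(M\) (number of sampled structures) are notation introduced only for this sketch.
\begin{equation}
  r=\frac{\sum_{k=1}^{M}(S_{k}-\bar{S})(Q_{k}-\bar{Q})}
         {\sqrt{\sum_{k=1}^{M}(S_{k}-\bar{S})^{2}\,\sum_{k=1}^{M}(Q_{k}-\bar{Q})^{2}}},
  \qquad
  \bar{S}=\frac{1}{M}\sum_{k=1}^{M}S_{k},\quad
  \bar{Q}=\frac{1}{M}\sum_{k=1}^{M}Q_{k}.
\end{equation}
Values of \(|r|\) near one indicate that the score tracks Q almost linearly, while values near zero indicate that the score carries little linear information for selecting high Q structures.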

597:

598:   \begin{table}

599:     \caption{\label{table_marcio} Linear Regression of Mscore }

600:     {\centering \begin{tabular}{|c|c|c|} \hline Proteins & fold class &

601:         Correlation  Coefficient \\ \hline \hline

602:         1R69&\(\alpha\)&.29 \\

603:         1BG8&\(\alpha\)&.04 \\

604:         1UTG&\(\alpha\)&.26 \\

605:         1MBA&\(\alpha\)&.26 \\

606:         2MHR&\(\alpha\)&.10 \\

607:         1IGD&\(\alpha/\beta\)&.37 \\

608:         3IL8&\(\alpha/\beta\)&.13 \\

609:         1TIG&\(\alpha/\beta\)&.19 \\

610:   %      3CHY&\(\alpha/\beta\)& .40 \\

611:   %      5NUL&\(\alpha/\beta\)& .25 \\

612:         1BFG&\(\beta\)&.08 \\

613:         1CKA&\(\beta\)&.03 \\

614:         1JV5&\(\beta\)&-.07 \\

615:         1K0S&\(\beta\)&-.10 \\

616:         \hline

617:       \end{tabular}\par}

618:   \end{table}

619:

620:   %\clearpage

621:

622:   \section*{Results}

623:   \subsection*{Blind Simulations}

624:

625:   For \textit{ab initio} blind predictions in CASP6, we selected

626:   sequences if there were no experimentally determined homologous

627:   structures found by automated comparative modelling servers.  The

628:   overall results for the \textit{ab initio} structure prediction

629:   simulation are summarised in Table~\ref{casp6_table1}, where the

630:   abbreviations are length = the number of amino acids, temp =

631:   temperature where the best structure was encountered, sub Q and samp Q =

632:   the best submitted and best sampled structures, respectively, as judged

633:   by a function of Q, and traj = number of independent trajectories

634:   simulated.  The

635:   CASP6 targets are classified under the following categories (NF=new

636:   fold, FR/A=fold recognition analog, FR/H=fold recognition homologue,

637:   CM/H=comparative modelling hard).  Targets T0207 and T0270 were

638:   removed from the experiment, so their CASP classes are undefined.  Structures for T0207 and

639:       T0272-b were not submitted.  A

640:   few main points emerge from these data.  Using a Q of 0.4 as a measure

641:   of successful prediction, we were able to encounter high quality

642:   structures for 4 targets and nearly so for 4 others.  For most targets, the temperature

643:   at which the best structures were sampled was between 1.2 and 0.8,

644:   which is the annealing regime we investigated most thoroughly.  This

645:   suggests our annealing schedules were close to the behaviour we sought

646:   \textit{a priori}. Longer target sequences

647:   clearly reduced the quality of our predictions.  Also, the proteins

648:   for which we had a greater number of trajectories naturally showed better

649:   structures. A final observation is that the difference between the

650:   best submitted structure and the best sampled structure is

651:   disappointingly large for some of the targets.  This can be attributed to

652:   our strategy of maximising the number of simulations performed rather

653:   than more carefully studying our trajectories.  This difference would

654:   have been smaller if greater care had been taken in the selection of the

655:   structures, but the number of high quality structures would have been

656:   smaller.

657:   \begin{table}

658:     \caption{\label{casp6_table1}CASP6 Results: Best Submitted and Sampled Structures  }

659:     {\centering \begin{tabular}{|c|c|c|c|c|c|c|c|} \hline

660:         target & length & fold &sub Q&samp Q& temp & traj & CASP \\ \hline \hline

661:         T0281   & 70  & \(\alpha/\beta\) &.34 &.48 & 0.85 & 986 & NF     \\

662:         T0201   & 94  & \(\alpha/\beta\) &.36 &.44 & 1.39 & 199 & NF     \\

663:         T0212   & 123 & \(\beta\)        &.26 &.42 & 1.30 &  97 & FR/A   \\

664:         T0230   & 102 & \(\alpha/\beta\) &.31 &.42 & 1.05 & 395 & FR/A   \\

665:         \hline   \hline

666:         T0207   & 76  & \(\alpha/\beta\) & -- &.39 & 0.98 & 297 & --     \\

667:         T0224   & 87  & \(\alpha/\beta\) &.30 &.38 & 1.20 & 501 & FR/H   \\

668:         T0263   & 97  & \(\alpha/\beta\) &.34 &.38 & 0.94 & 404 & FR/H   \\

669:         T0272-a & 85  & \(\alpha/\beta\) &.30 &.37 & 0.94 &  30 & FR/A   \\

670:         T0265   & 102 & \(\alpha/\beta\) &.29 &.34 & 0.83 & 374 & CM/H   \\

671:         T0213   & 103 & \(\alpha/\beta\) &.26 &.32 & 0.98 & 448 & FR/H   \\

672:         T0243   & 88  & \(\alpha/\beta\) &.31 &.32 & 0.95 & 418 & FR/H   \\

673:         T0239   & 98  & \(\alpha/\beta\) &.25 &.32 & 0.99 & 424 & FR/A   \\

674:         T0214   & 110 & \(\alpha/\beta\) &.24 &.30 & 0.41 & 348 & FR/H   \\

675:         T0242   & 115 & \(\alpha/\beta\) &.27 &.30 & 0.89 & 358 & NF     \\

676:         \hline  \hline

677:         T0270-b & 125 & \(\alpha/\beta\) &.27 &.28 & 0.99 &  32 & --     \\

678:         T0270-a & 122 & \(\alpha/\beta\) &.25 &.27 & 0.80 &  47 & --     \\

679:         T0272-b & 124 & \(\alpha/\beta\) & -- &.26 & 0.81 &  34 & FR/A   \\

680:         T0273   & 186 & \(\alpha/\beta\) &.22 &.24 & 0.98 & 189 & NF     \\

681:         \hline

682:       \end{tabular}\par}

683:   \end{table}

684:

685:   Calculating the free energy of several randomly chosen CASP6 targets

686:   in Fig.~\ref{fq_totalt0214t0243} provides us with probabilities of

687:   what we would have expected to see if more simulations had been

688:   performed during the CASP season.  We can estimate how

689:   many independent structures need to be seen at this temperature to

690:   sample the region 10 $k_BT$ greater than the minimum of the free

691:   energy.  We see roughly $e^{10}\approx 2\times10^{4}$ independent sampled

692:   structures would be needed at a temperature of 1.0.  Target T0242 (PDB

693:   ID 2BLK) illustrates why the best structure we encountered had a Q

694:   score of 0.3. For this target, we sampled roughly 7000 different

695:   structures. To achieve a Q of 0.45, according to the free

696:   energy analysis we would need to increase our sampling by a factor

697:   of 3.
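
The sampling estimate above can be made explicit with a short back-of-the-envelope calculation.  It assumes that independent samples are drawn from the Boltzmann distribution along Q, so that a region lying \(\Delta F\) above the free energy minimum is visited with relative probability \(e^{-\Delta F/k_BT}\); the symbols \(N_{\rm req}\) and \(N_{\rm sampled}\) are introduced here for illustration.
\begin{equation}
  N_{\rm req}\approx e^{\Delta F/k_BT}=e^{10}\approx 2.2\times10^{4},
  \qquad
  \frac{N_{\rm req}}{N_{\rm sampled}}\approx
  \frac{2.2\times10^{4}}{7\times10^{3}}\approx 3,
\end{equation}
which reproduces the factor of 3 quoted for target T0242.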

698:

699: \begin{figure}{\par\centering

700:  \resizebox*{3.0in}{2.0in}{\includegraphics{figures/fq_totalt0214-t0243.eps}}

701:  \par}

702:   \caption{\label{fq_totalt0214t0243}Free Energy calculations for

703:          CASP6 targets T0213, T0214, T0224, T0242, and T0243.}

704: \end{figure}

705:

706:   When extrapolating to lower temperatures, we see lower barriers to the

707:   folded state, and thus if sampling were more complete one would see

708:   better structures at these temperatures.  This further cooling would

709:   be a favorable strategy were it not for the dynamic slowing on

710:   approaching the glass transition, which occurs at a

711:   temperature of 0.9.  Naturally, it is best to sample just above the

712:   glass transition temperature, which can be approximately found from

713:   the Q-Q correlation function ($\langle Q(t)Q(t+\tau)\rangle$) \cite{AllenTildesley}, and by

714:   using the Kolmogorov-Smirnov test to assess the independence of samples

715:   \cite{Eastwood2003}.  Table~\ref{casp6_likely} indicates the

716:   best structure we would be likely to see under such sampling

717:   conditions.  The differences between thermodynamically accessible

718:   structures and those that were sampled suggest that increased

719:   simulations would have improved the best structures sampled

720:   considerably.  The free energy of target T0243 (PDB ID not available) is

721:   significantly different due to its unusual architecture that contains

722:   a buried helix.
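
A sketch of how the Q-Q correlation function can be turned into an estimate of the number of independent samples follows.  The normalisation and the correlation time estimate are the standard ones from time series analysis; the symbols \(C(\tau)\), \(\tau_{c}\), \(t_{\rm run}\), and \(N_{\rm ind}\) are notation introduced here for illustration.
\begin{equation}
  C(\tau)=\frac{\langle Q(t)Q(t+\tau)\rangle-\langle Q\rangle^{2}}
               {\langle Q^{2}\rangle-\langle Q\rangle^{2}},
  \qquad
  \tau_{c}=\int_{0}^{\infty}C(\tau)\,d\tau,
  \qquad
  N_{\rm ind}\approx\frac{t_{\rm run}}{2\tau_{c}}.
\end{equation}
As the glass transition is approached \(\tau_{c}\) grows, so a run of fixed length \(t_{\rm run}\) yields fewer independent structures, which is why sampling just above the transition is preferred.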

723:

724:   \begin{table}

725:     \caption{\label{casp6_likely} Likely Quality of Structure Seen

726:       at a Free Energy of 10 $k_BT$ for CASP6 Targets }

727:     {\centering \begin{tabular}{|c|c|c|c|c|} \hline

728:         Target & PDB & length  & Probable Q & Sampled Q \\ \hline

729:         T0213  & 1TE7 & 103&.43 &.32 \\

730:         T0214  & 1S04 & 110&.40 &.30 \\

731:         T0224  & 1RHX &  87&.39 &.38 \\

732:         T0242  & 2BLK & 123&.45 &.30 \\

733:         T0243  & ---  &  88&.28 &.32 \\ \hline

734:       \end{tabular}\par}

735:   \end{table}

736:

737:  % \clearpage

738:

739:   As in Fig.~\ref{casp5t0170contactmap}, we compare contact maps

740:   between the predictions and the experimentally resolved structure.

741:   Contact maps are often more insightful than superimposed structures,

742:   especially when viewed in two dimensions.  We compare the submitted

743:   structures with the best structure encountered during our sampling to

744:   determine which aspects of folding are being captured by our energy

745:   functions. For a short target, T0201 (PDB ID 1S12), we see that

746:   sometimes a small difference in the contact maps in

747:   Fig.~\ref{contact_T0201} can greatly improve the quality of the

748:   prediction even though a large number of contacts are already correct.

749:   There was a larger fraction of incorrect contacts in our best

750:   submitted structure for target T0230 (PDB ID 1WCJ) than we would have

751:   seen in the best generated structure as shown in

752:   Fig.~\ref{contact_T0230}.  The incorrect parallel docking of the first

753:   two helices is largely resolved in the best sampled structure and the

754:   Q score improves considerably.  Similar analysis for target T0281 (PDB

755:   ID 1WHZ) shows incorrect long range contacts between the two otherwise

756:   properly oriented helices, and disordered intermediate interactions as

757:   in Fig.~\ref{contact_T0281}.  Again the best sampled structure has

758:   these problems largely resolved.

759:

760:  \begin{figure}{\par\centering

761:  \resizebox*{2.0in}{2.0in}{\rotatebox{00}{\includegraphics{figures/contactbest_sub_t0201.eps}}}

762:  \hspace{.5in}

763:  \resizebox*{2.0in}{2.0in}{\rotatebox{00}{\includegraphics{figures/contactbest_samp_t0201.eps}}}

764:  \par}

765:  \caption{\label{contact_T0201}Contact maps for the best submitted

766:         (Q=.36) and the best sampled (Q=.44) structures for target T0201.}

767:  \end{figure}

768:

769:  \begin{figure}{\par\centering

770:  \resizebox*{2.0in}{2.0in}{\rotatebox{00}{\includegraphics{figures/contactbest_sub_t0230.eps}}}

771:  \hspace{.5in}

772:  \resizebox*{2.0in}{2.0in}{\rotatebox{00}{\includegraphics{figures/contactbest_samp_t0230.eps}}}

773:  \par}

774:      \caption{\label{contact_T0230}Contact maps for the best submitted

775:        (Q=.31) and the best sampled (Q=.42) structures for target T0230.}

776:  \end{figure}

777:

778:   \begin{figure}{\par\centering

779:  \resizebox*{2.0in}{2.0in}{\rotatebox{00}{\includegraphics{figures/contactbest_sub_t0281.eps}}}

780:  \hspace{.5in}

781:  \resizebox*{2.0in}{2.0in}{\rotatebox{90}{\includegraphics{figures/contactbest_samp_t0281.eps}}}

782:  \par}

783:  \caption{\label{contact_T0281}Contact maps for the best submitted

784:        (Q=.34) and the best sampled (Q=.48) structures for target T0281.}

785:  \end{figure}

786:

787:  % \clearpage

788:

789:   One amusing way to analyze predicted structures is to view the results

790:   of different structure prediction schemes as intermediates along a

791:   kinetic folding coordinate.  How far did the simulated annealing get in

792:   the folding pathway?  By mapping the likelihood of folding

793:   \cite{Shanknovich_1997} against its location on a folding free energy

794:   surface, we can assess how close the model structure is to the folded

795:   state in a kinetic sense.  The energy function for the kinetic

796:   modeling is a G\=o model, \emph{i.e.}, an ideally non-frustrated energy

797:   function.  The difference between the G\=o model and the structure

798:   prediction energy functions is a measure of the quality of those

799:   structure prediction schemes. A pairwise additive G\=o model was

800:   created based on the native structure of the experimentally determined

801:   protein. As has been discussed previously \cite{Eastwood00}, this

802:   G\=o model has both polypeptide backbone energy terms that are the

803:   same as in the structure prediction energy function, as described by

804:   Eq.~\ref{amh_back}, and an interaction potential where the Gaussian

805:   interaction potential distances \(r_{ij}^{N}\) are determined by the

806:   native state, as formally described in Eq.~\ref{amh_go}.

807:   \begin{equation}

808:     \label{amh_go}

809:     E_{\text{G\=o}}=- \frac{\epsilon}{a} \sum_{i<j-3} \gamma_{\text{G\=o}}[x_{(|i-j|)}]\exp\left[-\frac{(r_{ij}-r_{ij}^{N})^2}{\sigma_{ij}^2}\right]

810:   \end{equation}

811:   The interactions in this minimal model are defined for residues with

812:   more than three residues of sequence separation, between \(

813:   C^{\alpha}-C^{\alpha}, C^{\alpha}-C^{\beta}, C^{\beta}-C^{\alpha},

814:   C^{\beta}-C^{\beta} \) atom pairs. The weights

815:   \(\gamma_{\text{G\=o}}\), or the depths of the Gaussian wells, are set to

816:   (.177,.048,.430) in order to approximately divide the interaction

817:   energy equally between the different distance classes as defined in

818:   the original structure prediction energy function.  The widths of the

819:   Gaussians, $\sigma_{ij}^2$, are defined by the sequence separation as

820:   before.  Notice that the G\=o Hamiltonian does not contain a summation

821:   over a set of memory structures as in the AMH; this is because all of

822:   the contacts in this definition of a G\=o model use only the native

824:   function are performed starting with the best structure of three

825:   different structure prediction groups.  Pfold is then calculated by

826:   simply determining whether the simulation started from the model

827:   structure folds to the native structure or not.  The results in

828:   Fig.~\ref{pfold_fig} compare three minimalist models, one of which

829:   (the Baker Group) has undergone a further atomistic refinement. Although the

830:   minimalist models are only a few $k_BT$ from the barrier's peak, they

831:   only infrequently cross it.  This also suggests that a more detailed, less

832:   coarse-grained sampling procedure may be necessary for correctly

833:   assigning hydrophobic packing and hydrogen bonding patterns.
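
The Pfold estimate can be summarised as a simple counting estimator.  The binomial error estimate below is a standard result and is given only as one plausible reading of the uncertainty quoted in Fig.~\ref{pfold_fig}; the symbols \(n\) (number of G\=o simulations started from a model structure) and \(n_{\rm fold}\) (number that reach the native state) are introduced for illustration.
\begin{equation}
  P_{\rm fold}\approx\frac{n_{\rm fold}}{n},
  \qquad
  \sigma_{P}=\sqrt{\frac{P_{\rm fold}(1-P_{\rm fold})}{n}}\leq\frac{1}{2\sqrt{n}}.
\end{equation}
For \(n=100\) the bound is \(0.05\), so two standard errors amount to at most \(0.1\).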

834:

835:    \begin{figure}{\par\centering

836:        \resizebox*{3.5in}{3.0in}{\rotatebox{00}{\includegraphics{figures/t0281_pfold_3d.eps}}}

837:        \par}

838:      \caption{\label{pfold_fig}G\=o Model Free Energy Surface with final

839:        prediction structures shown. The Pfold values for the three

840:        predictions are the Wolynes Group 0.07, Scheraga Group 0.02, and the

841:        Baker Group 0.97 with an error of +/- 0.1.}

842:    \end{figure}

843:

844:   %\clearpage

845:   \subsection*{The Next Generation in Structure Prediction}

846:

847:   Examining the contact maps of structures encountered during the CASP

848:   experiment, we observed that contacts between residues with a

849:   large separation in

850:   sequence can be inaccurate, even when most of the contacts within a

851:   sequence separation of 12 residues are native-like.  A different way of

852:   expressing this idea is that the amount of funnelling differs among

853:   the different distance classes.  When comparing the quality of the

854:   intermediate range interactions in the sampled structures with the

855:   memories obtained with sequence analysis from the protein data bank, a

856:   dramatic increase of native-like interactions is seen as shown in

857:   Fig.~\ref{lh_256ba}.  While this was not used in the recent CASP

858:   exercise, we thought it would be interesting and straightforward to

859:   improve the prediction energy function by using these first generation

860:   results as better memory structures in the AMH.  Sequence to structure

861:   alignments yield gap-less identity alignments thereby eliminating any

862:   possibility of secondary structure registry shift irregularities.

863:

864:  \begin{figure}{\par\centering

865:  \resizebox*{3.0in}{2.0in}{\includegraphics{figures/256ba_qscore_bw.eps}}

866: %   \hspace{0.3in} \vspace{0.3in}

867:  \resizebox*{3.0in}{2.0in}{\includegraphics{figures/256ba_qscore_short_bw.eps}}

868:  \resizebox*{3.0in}{2.0in}{\includegraphics{figures/256ba_qscore_medium_bw.eps}}

869: %   \hspace{0.3in}

870:  \resizebox*{3.0in}{2.0in}{\includegraphics{figures/256ba_qscore_long_bw.eps}}

871: \par}

872:    \caption{\label{lh_256ba}These figures show the total Q, and the Q

873:      in the different distance classes between PDB structures,

874:      structures from a temperature of 1, and a temperature near zero

875:      for structures used as inputs to AMH simulations.  The

876:      lowest temperature structures show the largest improvement because they are

877:      fully collapsed.}

878:  \end{figure}

879:

880:   Different energy functions have been used to identify native like

881:   proteins from an ensemble of simulated structures.  Alternatively, one

882:   can rely on energy landscape ideas, and assume a mean field contact

883:   potential derived from the energy minima of the simulated energy

884:   function.  This approach has the additional advantage that it does

885:   not rely on using a distinct energy function: one is simply seeing how

886:   close simulated annealing was to completely accessing the global

887:   minimum of the prediction energy function.  To select structures, a

888:   pairwise Q, denoted by a lower case q, is calculated between all of the

889:   ground state structures encountered in 200 independent simulations.
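
One way to write the pairwise comparison is sketched below.  It assumes that q takes the same Gaussian form as the G\=o contacts in Eq.~\ref{amh_go}, with the native distances replaced by those of a second sampled structure; the normalisation, the restriction \(i<j-1\), the mean \(\bar{q}_{a}\), and the symbol \(M\) for the number of structures compared are assumptions made for illustration.
\begin{equation}
  q_{ab}=\frac{2}{(N-1)(N-2)}\sum_{i<j-1}
  \exp\left[-\frac{(r_{ij}^{a}-r_{ij}^{b})^{2}}{\sigma_{ij}^{2}}\right],
  \qquad
  \bar{q}_{a}=\frac{1}{M-1}\sum_{b\neq a}q_{ab}.
\end{equation}
With this normalisation two identical structures give \(q_{ab}=1\), and structures with a large \(\bar{q}_{a}\) lie near the centre of the sampled ensemble.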

890:

891:   By dividing the inter-chain interactions under the same definitions as

892:   used in the energy function, the potential for improvements from such

893:   second generation structures over the original memories is

894:   considerable for protein 256B.  As seen in Fig.~\ref{lh_256ba}, the

895:   low temperature structures identified by little q have an increased

896:   number of native-like contacts in all distance classes.  This style of

897:   analysis also suggests potential changes in the energy function.  The

898:   interactions that are distant in sequence are also improved over those of the

899:   original memories used in the energy function.  In order to utilise this

900:   improvement the energy function in the distant interaction class was

901:   modified.  The original function used a multi-well contact potential,

902:   which does not use any information from the memory proteins.  For this

903:   third distance class the next generation energy function uses

904:   associative memory contacts much as was done before for modelling with

905:   homologues \cite{Koretke98}.  The energy function now takes the form

906:   \begin{equation}

907:     E_{\rm int}=-  \sum ^{c}_{3}\frac{\epsilon}{a_{c}}\sum ^{n}_{\mu }\sum ^{N}_{i<j} \gamma

908:     (P_{i}P_{j}P^{\mu }_{i'}P^{\mu }_{j'})\Theta

909:     (r_{ij}-r^{\mu }_{i'j'}).

910:   \end{equation}

911:   The parameters for this new distance class are taken from the second

912:   distance class.  The total energy is defined over the set of memory

913:   structures as defined by Eq.~\ref{next_gen_units}

914:   \begin{equation}

915:     \label{next_gen_units}

916:     \epsilon=\frac{1}{36}\sum_{1}^{\mu}\frac{\left\vert E^{\rm model}_{\rm amh}\right\vert}{4N},

917:   \end{equation}

918:   instead of using the values taken from the optimisation.  Some next

919:   generation memory structures are more collapsed than the memory

920:   structures used in the initial round of simulation.  Furthermore, the

921:   scaling is changed from the initial round of simulation's 1:1:1

922:   scaling amongst the three different (local, super-secondary, tertiary)

923:   distance classes to 1.5:0.5:1 in an effort to approximate the equal

924:   division of energy in each distance class. To examine the equilibrium

925:   properties of this energy function, we need

926:   to estimate the glass transition temperature.  As previously explored

927:   \cite{Eastwood2003}, we use the Kolmogorov-Smirnov test to determine

928:   if two independent simulations have been sampled from the same

929:   equilibrium distribution.  This test ensures that

930:   simulations are equilibrated.
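
For reference, the two-sample Kolmogorov-Smirnov comparison can be written as follows.  The formulae are the standard large-sample ones for independent samples; the symbols \(F_{1}\), \(F_{2}\) (empirical cumulative distributions of q from the two simulations), \(n_{1}\), \(n_{2}\) (numbers of independent samples), and the sample sizes in the example are hypothetical.
\begin{equation}
  D=\sup_{q}\left|F_{1}(q)-F_{2}(q)\right|,
  \qquad
  D>c(\alpha)\sqrt{\frac{n_{1}+n_{2}}{n_{1}n_{2}}},
  \qquad
  c(\alpha)=\sqrt{-\tfrac{1}{2}\ln\frac{\alpha}{2}}.
\end{equation}
The two simulations are judged to come from different distributions at significance level \(\alpha\) when the inequality holds.  For \(\alpha=0.05\), \(c\approx1.36\); with \(n_{1}=n_{2}=200\) the threshold is \(1.36\times0.1\approx0.14\).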

931:  \begin{figure}

932:    \centering

933:    \resizebox*{3.0in}{2.0in}{\includegraphics{figures/Pq_black_white_15.eps}}

934:    \hspace{.5in}

935:    \resizebox*{3.0in}{2.0in}{\includegraphics{figures/Pq_black_white_14.eps}}

936:    \caption{\label{ks_256ba}Kolmogorov-Smirnov test shows the constant

937:      temperature simulation falling out of equilibrium at a lower

938:      temperature of 1.4. The probability distributions of

939:      structures from two independent simulations are no longer the

940:      same.}

941:  \end{figure}

942:  Once the glass transition temperature ($T_{K}$) is estimated using the

943:  Kolmogorov-Smirnov test, we can use standard techniques to

944:  quantify the equilibrium properties of different energy functions.

945:  The proteins we used for study of the next generation AMH strategy are

946:  cytochrome B562 (PDB ID 256B) and HDEA (PDB ID 1BG8), because they are

947:  both of moderate size and one of them (1BG8) was not in the training

948:  set of proteins used to optimize the original energy function. An

949:  additional advantage of this choice is that these proteins have different

950:  fold types. According to CATH \cite{FPearl}, HDEA belongs to the

951:  orthogonal bundle architecture, while cytochrome B562 represents an

952:  up-down bundle.  Using umbrella sampling combined with the weighted

953:  histogram analysis method, we are able to sample parts of phase space that

954:  would rarely be encountered during a simulation \cite{kong96}.  When

955:  using memories with a larger number of native contacts, we see

956:  improved free energy and energy profiles as shown in

957:  Fig.~\ref{amh_256ba}.

958: \begin{figure}{\par\centering

959: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/fq_bw_paper.eps}}

960: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/eq_local_bw_paper.eps}}

961: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/eq_ss_bw_paper.eps}}

962: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/eq_tert_bw_paper.eps}}

963: \par}

964:    \caption{\label{amh_256ba}The free energy of the two different energy

965:      functions for the protein 256B shows roughly a 5-10 $k_BT$

966:      improvement for this protein.  The primary improvements are in the

967:      medium and long range distance classes.}

968:  \end{figure}

969:  This is even more impressive when we consider that

970:  this energy function has not yet been properly optimised for this new

971:  Hamiltonian.  For the other target, the results are also not

972:  surprising.  In this case the next generation memories used to simulate

973:  this protein were not of greater structural quality than the

974:  initial set.  Thus a very similar free energy profile was generated as

975:  seen in Fig.~\ref{amh_1bg8a}.  Our use of q as an order

976:  parameter was successful in identifying the high Q protein for the

977:  256B example.  This is due to the highly funnelled character of

978:  the first generation energy function.  The original energy function

979:  for 1BG8 is not as funnelled, so there is poorer enrichment by

980:  scanning with little q. This limitation could be overcome by increasing

981:  the amount of sampling of structures in the first generation simulations.

982: \begin{figure}

983:    \centering

984: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/qscore_bw_1bg8.eps}}

985: \resizebox*{3.0in}{2.0in}{\includegraphics{figures/fq_bw_paper_1bg8.eps}}

986: \caption{\label{amh_1bg8a}The free energies of the two different energy

987:      functions for the protein 1BG8 show little improvement.  The

988:      memories, though, show no enrichment in native contacts.}

989:  \end{figure}

990:  More simulations would be expected to yield better structures,

991:  as was demonstrated during the CASP5 exercise. This difference in the

992:  enrichment could be anticipated by using the Kolmogorov-Smirnov measure

993:  to differentiate the distribution of the q values encountered between

994:  structures derived from simulation and the protein databank.
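
The free energy profiles discussed above combine biased simulations through umbrella sampling and histogram reweighting.  A compact statement of the standard equations is given below as a sketch; the harmonic form of the bias and the symbols \(k_{u}\), \(Q_{k}\), \(n_{k}(Q)\) (histogram counts in window \(k\)), \(N_{k}\) (total samples in window \(k\)), and \(f_{k}\) are assumptions introduced here, since the specific bias used is not stated in this section.
\begin{align}
  V_{k}(Q)&=\tfrac{1}{2}k_{u}\,(Q-Q_{k})^{2},\\
  P(Q)&=\frac{\sum_{k}n_{k}(Q)}
             {\sum_{k}N_{k}\exp\left\{-\left[V_{k}(Q)-f_{k}\right]/k_BT\right\}},\\
  e^{-f_{k}/k_BT}&=\sum_{Q}P(Q)\,e^{-V_{k}(Q)/k_BT},\\
  F(Q)&=-k_BT\ln P(Q).
\end{align}
The second and third equations are iterated until the window free energies \(f_{k}\) are self-consistent, after which the unbiased profile \(F(Q)\) covers regions of Q that an unbiased run would rarely visit.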

995:

996:  \section*{Conclusion}

997:

998:  These case studies from our participation in the CASP experiment only

999:  provide a snapshot of our group's prediction schemes.  They provide a

1000:  series of lessons for us and, we hope, for others.  In the future, a

1001:  more balanced effort between the sampling and selection of structures

1002:  from that ensemble would appear to be desirable.  More effort in

1003:  selection would have clearly improved the results submitted in CASP6.

1004:  While it was computationally impractical to quench all of the

1005:  structures simulated during the prediction season, the comparison of

1006:  the contact maps demonstrated further that tempering of the structure

1007:  would have improved intermediate range ordering.  Using preliminary

1008:  structures as input to a next generation of AMH modelling improves the

1009:  quality of the prediction results.  While these results may initially

1010:  appear to be model or energy function specific, we feel that any

1011:  algorithm that uses structures as an input would benefit from similar

1012:  next generation approaches.

1013:

1014:  \section*{Acknowledgments}

1015:  The authors thank Joe Hegler, Zaida Luthey-Schulten, Garegin Papoian,

1016:  and Marcio Von Muhlen for their key roles in developing codes used in

1017:  this study and for many helpful discussions over the years.  The

1018:  efforts of P.G.W. are supported through the National Institutes of

1019:  Health Grant 5RO1GM44557.  Computing resources were supplied by the

1020:  Center for Theoretical Biological Physics through National Science

1021:  Foundation Grants PHY0216576 and PHY0225630.

1022:

1023: %\bibliographystyle{pnas}

1024: \bibliographystyle{ieeetr}

1025: %\bibliography{refs}

1026:

1027:

1028: \providecommand{\refin}[1]{\\ \textbf{Referenced in:} #1}

1029: \begin{thebibliography}{10}

1030:

1031: \bibitem{Moult}

1032: Moult,~J.;\ \ Fidelis,~K.;\ \ Zemla,~A.;\ \ Hubbard,~T. \textit{Proteins}

1033:   \textbf{2003,} \textsl{53 Suppl 6,} 334-339.

1034:

1035: \bibitem{GoldsteinRA-AMH-92}

1036: Goldstein,~R.~A.;\ \ Luthey-Schulten,~Z.~A.;\ \ Wolynes,~P.~G. \textit{Proc

1037:   Natl Acad Sci USA} \textbf{1992,} \textsl{89,} 4918-4922.

1038:

1039: \bibitem{BryngelsonJD87}

1040: Bryngelson,~J.~D.;\ \ Wolynes,~P.~G. \textit{Proc Natl Acad Sci USA}

1041:   \textbf{1987,} \textsl{84,} 7524-7528.

1042:

1043: \bibitem{AnfinsenCB73}

1044: Anfinsen,~C.~B. \textit{Science} \textbf{1973,} \textsl{181,} 223-230.

1045:

1046: \bibitem{GoN83}

1047: G{\={o}},~N. \textit{Annu Rev Biophys and Bioeng} \textbf{1983,} \textsl{12,}

1048:   183-210.

1049:

1050: \bibitem{Koga_Takada01}

1051: Koga,~N.;\ \ Takada,~S. \textit{J Mol Biol} \textbf{2001,} \textsl{313,}

1052:   171-180.

1053:

1054: \bibitem{portman98}

1055: Portman,~J.~J.;\ \ Takada,~S.;\ \ Wolynes,~P.~G. \textit{Phys Rev Lett}

1056:   \textbf{1998,} \textsl{81,} 5237--5240.

1057:

1058: \bibitem{Wales}

1059: Wales,~D. \textit{Energy Landscapes;} Cambridge University Press: Cambridge,

1060:   UK, 2003.

1061:

1062: \bibitem{Wheelan_etal00}

1063: Wheelan,~S.~J.;\ \ Marchler-Bauer,~A.;\ \ Bryant,~S.~H. \textit{Bioinformatics}

1064:   \textbf{2000,} \textsl{16,} 613-618.

1065:

1066: \bibitem{Heringa02}

1067: George,~R.~A.;\ \ Heringa,~J. \textit{J Mol Biol} \textbf{2002,} \textsl{316,}

1068:   839-851.

1069:

1070: \bibitem{Rigden02}

1071: Rigden,~D.~J. \textit{Protein Eng} \textbf{2002,} \textsl{15,} 65-77.

1072:

1073: \bibitem{Hardin2002ab}

1074: Hardin,~C.;\ \ Eastwood,~M.;\ \ Prentiss,~M.;\ \ Luthey-Schulten,~Z.;\ \

1075:   Wolynes,~P.~G. \textit{Proc. Nat. Acad. Sci. U.S.A.} \textbf{2002,}

1076:   \textsl{100,} 1679-1684.

1077:

1078: \bibitem{Eastwood00}

1079: Eastwood,~M.~P.;\ \ Hardin,~C.;\ \ Luthey-Schulten,~Z.;\ \ Wolynes,~P.~G.

1080:   \textit{IBM Systems Research} \textbf{2001,} \textsl{45,} 475-497.

1081:

1082: \bibitem{KoretkeKK96}

1083: Koretke,~K.~K.;\ \ Luthey-Schulten,~Z.;\ \ Wolynes,~P.~G. \textit{Protein Sci}

1084:   \textbf{1996,} \textsl{5,} 1043-1059.

1085:

1086: \bibitem{Berman}

1087: Berman,~H.~M.;\ \ Westbrook,~J.;\ \ Feng,~Z.;\ \ Gilliland,~G.;\ \

1088:   Bhat,~T.~N.;\ \ Weissig,~H.;\ \ Shindyalov,~I.~N.;\ \ Bourne,~P.~E.

1089:   \textit{Nucleic Acids Res} \textbf{2000,} \textsl{28,} 235-242.

1090:

1091: \bibitem{Bourne98}

1092: Shindyalov,~I.;\ \ Bourne,~P. \textit{Protein Eng} \textbf{1998,}

1093:   \textsl{11,} 739-747.

1094:

1095: \bibitem{FriedrichsMS89}

1096: Friedrichs,~M.~S.;\ \ Wolynes,~P.~G. \textit{Science} \textbf{1989,}

1097:   \textsl{246,} 371-373.

1098:

1099: \bibitem{FriedrichsMS90}

1100: Friedrichs,~M.;\ \ Wolynes,~P.~G. \textit{Tet Comp Meth} \textbf{1990,}

1101:   \textsl{3,} 175.

1102:

1103: \bibitem{FriedrichsMS91}

1104: Friedrichs,~M.~S.;\ \ Goldstein,~R.~A.;\ \ Wolynes,~P.~G. \textit{J Mol Biol}

1105:   \textbf{1991,} \textsl{222,} 1013-1034.

1106:

1107: \bibitem{Hardin00}

1108: Hardin,~C.;\ \ Eastwood,~M.;\ \ Luthey-Schulten,~Z.;\ \ Wolynes,~P.~G.

1109:   \textit{Proc Natl Acad Sci USA} \textbf{2000,} \textsl{97,} 14235-14240.

1110:

1111: \bibitem{GoldsteinRA92}

1112: Goldstein,~R.;\ \ Luthey-Schulten,~Z.~A.;\ \ Wolynes,~P.~G. \textit{Proc Natl

1113:   Acad Sci USA} \textbf{1992,} \textsl{89,} 9029-9033.

1114:

1115: \bibitem{Hopfield_1982}

1116: Hopfield,~J.~J. \textit{Proc Natl Acad Sci USA} \textbf{1982,} \textsl{79,}

1117:   2554-2558.

1118:

1119: \bibitem{Ryckaert77}

1120: Ryckaert,~J.;\ \ Ciccotti,~G.;\ \ Berendsen,~H. \textit{J Comput Phys}

1121:   \textbf{1977,} \textsl{23,} 327-341.

1122:

1123: \bibitem{Rama}

1124: Ramachandran,~G.;\ \ Sasisekharan,~V. \textit{Adv Protein Chem} \textbf{1968,}

1125:   \textsl{23,} 283-438.

1126:

1127: \bibitem{Papoian04pnas}

1128: Papoian,~G.~A.;\ \ Ulander,~J.;\ \ Eastwood,~M.~P.;\ \ Luthey-Schulten,~Z.;\ \

1129:   Wolynes,~P.~G. \textit{Proc Natl Acad Sci USA} \textbf{2004,} \textsl{101,}

1130:   3352-3357.

1131:

1132: \bibitem{Papoian03biopoly}

1133: Papoian,~G.~A.;\ \ Wolynes,~P.~G. \textit{Biopolymers} \textbf{2003,}

1134:   \textsl{68,} 333-349.

1135:

1136: \bibitem{Papoian03jacs}

1137: Papoian,~G.~A.;\ \ Ulander,~J.;\ \ Wolynes,~P.~G. \textit{J Am Chem Soc}

1138:   \textbf{2003,} \textsl{125,} 9170-9178.

1139:

1140: \bibitem{maxfield79}

1141: Maxfield,~F.~R.;\ \ Scheraga,~H.~A. \textit{Biochemistry} \textbf{1979,}

1142:   \textsl{18,} 697--704.

1143:

1144: \bibitem{finkelstein98}

1145: Finkelstein,~A.~V. \textit{Phys Rev Lett} \textbf{1998,} \textsl{80,}

1146:   4823-4825.

1147:

1148: \bibitem{Altschul1997}

1149: Altschul,~S.;\ \ Madden,~T.;\ \ Schaffer,~A.;\ \ Zhang,~J.;\ \ Zhang,~Z.;\ \

1150:   Miller,~W.;\ \ Lipman,~D. \textit{Nucleic Acids Res} \textbf{1997,}

1151:   \textsl{25,} 3389-3402.

1152:

1153: \bibitem{bioperl}

1154: Stajich,~J.~E. \textit{et al.}\  \textit{Genome Res} \textbf{2002,}

1155:   \textsl{12,} 1611-1618.

1156:

1157: \bibitem{CLUSTAL}

1158: Thompson,~J.;\ \ Higgins,~D.;\ \ Gibson,~T. \textit{Nucleic Acids Res}

1159:   \textbf{1994,} \textsl{22,} 4673-4680.

1160:

1161: \bibitem{Zhang_etal03}

1162: Zhang,~Y.;\ \ Kolinski,~A.;\ \ Skolnick,~J. \textit{Biophys J} \textbf{2003,}

1163:   \textsl{85,} 1145-1164.

1164:

1165: \bibitem{Hardin02a}

1166: Hardin,~C.;\ \ Eastwood,~M.;\ \ Prentiss,~M.;\ \ Luthey-Schulten,~Z.;\ \

1167:   Wolynes,~P.~G. \textit{J Comput Chem} \textbf{2002,} \textsl{23,} 138-146.

1168:

1169: \bibitem{BONN2001b}

1170: Bonneau,~R.;\ \ Tsai,~J.;\ \ Ruczinski,~I.;\ \ Chivian,~D.;\ \ Rohl,~C.;\ \

1171:   Strauss,~C.~E.~M.;\ \ Baker,~D. \textit{Proteins} \textbf{2001,}

1172:   \textsl{Suppl 5,} 119-126.

1173:

1174: \bibitem{zhou:zhou04}

1175: Zhou,~H.;\ \ Zhou,~Y. \textit{Proteins} \textbf{2004,} \textsl{54,} 315-322.

1176:

1177: \bibitem{kussell:168101}

1178: Kussell,~E.;\ \ Shakhnovich,~E.~I. \textit{Phys Rev Lett} \textbf{2002,}

1179:   \textsl{89,} 168101.

1180:

1181: \bibitem{Baker97}

1182: Simons,~K.;\ \ Kooperberg,~C.;\ \ Huang,~E.;\ \ Baker,~D. \textit{J Mol

1183:   Biol} \textbf{1997,} \textsl{268,} 209-225.

1184:

1185: \bibitem{Betancourt:2004}

1186: Betancourt,~M.;\ \ Skolnick,~J. \textit{J Mol Biol} \textbf{2004,} \textsl{342,}

1187:   635-649.

1188:

1189: \bibitem{SavenJG96}

1190: Saven,~J.~G.;\ \ Wolynes,~P.~G. \textit{J Mol Biol} \textbf{1996,}

1191:   \textsl{257,} 199-216.

1192:

1193: \bibitem{AllenTildesley}

1194: Allen,~M.~P.;\ \ Tildesley,~D.~J. \textit{Computer Simulation of Liquids;}

1195:   Clarendon Press: New York, NY, USA, 1987.

1196:

1197: \bibitem{Eastwood2003}

1198: Eastwood,~M.;\ \ Hardin,~C.;\ \ Luthey-Schulten,~Z.;\ \ Wolynes,~P.~G.

1199:   \textit{J Chem Phys} \textbf{2003,} \textsl{118,} 8500-8512.

1200:

1201: \bibitem{Shanknovich_1997}

1202: Du,~R.;\ \ Pande,~V.;\ \ Grosberg,~A.~Y.;\ \ Shakhnovich,~E.~I. \textit{J Chem Phys}

1203:   \textbf{1998,} \textsl{108,} 334-350.

1204:

1205: \bibitem{Koretke98}

1206: Koretke,~K.~K.;\ \ Luthey-Schulten,~Z.;\ \ Wolynes,~P.~G. \textit{Proc Natl

1207:   Acad Sci USA} \textbf{1998,} \textsl{95,} 2932-2937.

1208:

1209: \bibitem{FPearl}

1210: Pearl,~F. \textit{et al.}\  \textit{Nucleic Acids Res} \textbf{2005,}

1211:   \textsl{33,} D247-251.

1212:

1213: \bibitem{kong96}

1214: Kong,~X.;\ \ {Brooks III},~C.~L. \textit{J Chem Phys} \textbf{1996,}

1215:   \textsl{105,} 2414--2423.

1216:

1217: \end{thebibliography}

1218:

1219: \end{document}

1220: