1: \documentclass[12pt]{article}
2: \usepackage{times}
3: \usepackage{graphicx}
4: \usepackage{proteins,citesupernumber}
5:
6: %\renewcommand{\baselinestretch}{1.5}
7: \setlength{\textheight}{20cm}
8: \begin{document}
9: \setlength{\baselineskip}{20pt}
10: \begin{flushleft}
11: {\Large \bf Predicting Residue-wise Contact Orders of Native Protein Structure from Amino Acid Sequence}
12:
13: \vspace{5mm}
14: Akira R. Kinjo$^{1,2,*}$ and Ken Nishikawa$^{1,2}$
15:
16: \vspace{3mm}
17: $^{1}$Center for Information Biology and DNA Data Bank of Japan,
National Institute of Genetics, Mishima, 411-8540, Japan\\
19: $^{2}$Department of Genetics,
20: The Graduate University for Advanced Studies (SOKENDAI),
21: Mishima, 411-8540, Japan
22:
23: \vspace{1cm}
24: $^{*}$Correspondence to Akira R. Kinjo.\\
25: Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Mishima, 411-8540, Japan\\
26: Tel: +81-55-981-6859, Fax: +81-55-981-6889\\
27: E-mail: akinjo@genes.nig.ac.jp
28:
29: \vspace{1cm}
30: Running title: Residue-wise contact order prediction.
31:
32: \vspace{1cm}
33: Key words: protein structure prediction; residue-wise contact order; linear regression; one-dimensional structure.
34: \end{flushleft}
35: \newpage
36: \begin{abstract}
Residue-wise contact order (RWCO) is a new kind of one-dimensional protein
structure that represents the extent of long-range contacts.
39: We have recently shown that a set of three types of one-dimensional structures
40: (secondary structure, contact number, and RWCO) contains sufficient
41: information for reconstructing the three-dimensional structure of proteins.
42: Currently, there exist prediction methods for secondary structure and contact
number from amino acid sequence, but none exists for RWCO. Also, the properties of
amino acids that affect RWCO are not clearly understood. Here, we present a
linear regression-based method to predict RWCO from amino acid sequence,
and analyze the regression parameters to identify the properties that
correlate with the RWCO. The present method achieves a significant
correlation of 0.59 between the native and predicted RWCOs on average.
49: An unusual feature of the RWCO prediction is the remarkably large optimal
50: half window size of 26 residues.
51: The regression parameters for the central and near-central residues of the
52: local sequence segment highly correlate with those of the contact
53: number prediction, and hence with hydrophobicity.
54: \end{abstract}
55: \emph{Key words:} protein structure prediction, residue-wise contact order,
56: one-dimensional structure, linear regression.
57: \newpage
58: \section*{Introduction}
59: One of the main goals of protein structure prediction is to provide an
60: intuitive picture of the relationship between the amino acid sequence
61: and the native three-dimensional (3D) structure of proteins.
62: To this end, a number of methods have been developed for \textit{ab initio} or
63: \textit{de novo} protein structure prediction. However, such methods are
64: usually very complicated and make it difficult to intuitively understand
65: the relationship between amino acid sequence and 3D structure.
In this respect, one-dimensional (1D) structures\cite{Rost2003} of
proteins may serve as convenient intermediate representations between
sequence and structure,
because the correspondence between sequence and structural
characteristics is easy to grasp.
71:
72: Since 1D structures are 3D structural features projected onto strings of
73: residue-wise structural assignments\cite{Rost2003}, a large part of 3D
74: information appears to be lost. That is, the correspondence between amino
75: acid sequence and 1D structures does not seem to be sufficient for
76: uncovering the correspondence between amino acid sequence and 3D structure.
77: However, Porto \textit{et al.}\cite{PortoETAL2004} have recently shown that
78: the contact matrix of a protein structure can be uniquely recovered from
79: its principal eigenvector. Since the protein 3D structure can be recovered
80: from the contact matrix\cite{VendruscoloETAL1997}, the result of
81: Porto \textit{et al.}\cite{PortoETAL2004} indicates that
82: the information contained in the 3D structure can be expressed as a
83: one-dimensional representation.
Furthermore, we have recently shown that the 3D structure of a protein can be
reconstructed from a set of three types of
1D structures\cite{KinjoANDNishikawa2005}.
In other words, the 3D structure of a protein is essentially equivalent
to a set of three types of 1D structures,
namely, secondary structure, contact number, and
residue-wise contact order.
91: The fact that the 3D structure of a protein can be recovered from
92: a set of these 1D structures
93: opens a new possibility for elucidating the sequence-structure relationship
94: of proteins.
95:
96: The secondary structure of a protein is a string of symbols representing
97: $\alpha$ helix, $\beta$ strand, or coils. The contact number of each residue
98: in a protein is defined by the number of contacts the residue makes with other
99: residues in the protein. More precisely,
100: if we represent the contact map of the protein by $C_{i,j}$ ($C_{i,j} = 1$
101: if the $i$-th and $j$-th residues are in
102: contact, or $C_{i,j} = 0$ otherwise), the contact number $n_{i}$
103: of the $i$-th residue is defined by $n_{i} = \sum_{j}C_{i,j}$.
104: Similarly, the residue-wise contact order (RWCO) $o_{i}$
105: of the $i$-th residue of a protein is defined
106: by $o_{i} = \sum_{j}|i-j|C_{i,j}$,
107: that is, a sum of sequence separations between the residue and the
108: contacting residues\cite{KinjoANDNishikawa2005}.
109: The contact order was first introduced as a per-protein quantity by
Plaxco \textit{et al.}\cite{PlaxcoETAL1998} to study the correlation between protein
111: topology and folding rate. The RWCO introduced here is a generalization of
112: the contact order, and is a per-residue quantity.
113:
114: At least in principle, if we can predict those 1D structures,
115: we can also construct the corresponding 3D structures. Many accurate methods
116: have been developed for secondary structure prediction\cite{Rost2003}.
We have developed a method to predict the contact number from amino acid
sequence\cite{KinjoETAL2005} with an average correlation of 0.63
between the native and predicted contact numbers. However, there is
no method for predicting RWCO from amino acid sequence to date,
and it is not clear whether such prediction is possible at all.
122: The primary objective of the present paper is to develop a method
123: to predict RWCO from amino acid sequence.
124:
125: While the accurate prediction of structural properties is important for
126: its own sake, for a thorough understanding of the
127: sequence-structure relationship, we still need to identify the properties
128: of amino acid sequence that determine the structure.
From the large body of past studies on secondary structure prediction,
it is now well established
131: that each amino acid has a particular propensity for a particular secondary
132: structure, although the final secondary structures in the native
133: structure are determined in the global context. Also, contact number is
134: closely related to the hydrophobicity of amino acids. Thus, both
135: secondary structure and contact number have clear connections with
136: the properties of amino acids. As for the residue-wise contact order,
137: its geometrical meaning is clear (i.e., a quantity related to the extent of
138: long-range contacts), but the conjugate properties of amino acids are not.
As the second objective of the present study, we attempt to identify the
properties of amino acids that affect RWCO by examining the parameters derived
for the prediction method.
142:
143: The prediction method developed in this paper is based on a simple linear
144: regression scheme which was also applied to the contact number
145: prediction in our previous study\cite{KinjoETAL2005}.
146: By examining the regression parameters,
147: we show that the RWCO is primarily determined by the pattern of hydrophobicity
148: of amino acids.
149: Although the method is extremely simple, it yields a significant
150: correlation of 0.59 between the native and predicted RWCOs.
While further refinement is definitely necessary before the method can be applied
to 3D structure prediction, the present method will serve as a basis for
153: more elaborate methods yet to be developed.
154:
\section*{Materials and Methods}
156: \subsection*{Definition of residue-wise contact order}
157: As mentioned in the Introduction, the residue-wise contact order (RWCO) of
158: the $i$-th residue is defined by
159: \begin{equation}
160: o_{i} = \frac{1}{L}\sum_{j:|j-i|>2}|i-j|C_{i,j}\label{eq:def}
161: \end{equation}
where, in addition to the definition given in the Introduction,
the summation is normalized by the length $L$ of the amino acid
sequence of the protein, and $C_{i,j}$ represents the contact map of the protein.
We exclude trivial contacts between nearest and next-nearest neighbors
along the sequence ($|i-j| \leq 2$).
166: To make the RWCO useful for molecular dynamics simulations, the contact
167: between two residues is defined by a smooth sigmoid function:
168: \begin{equation}
169: C_{i,j} = 1/\{1+\exp[w(r_{i,j} - d_c)]\}
170: \end{equation}
171: where $r_{i,j}$ is the distance between $C_{\beta}$ atoms of the $i$-th
172: and $j$-th
173: residues ($C_{\alpha}$ atoms for glycine), $d_c$ is the cut-off distance for
174: the contact definition, and $w$ is
175: a parameter that determines the sharpness of the sigmoid function.
176: To be consistent with our previous
177: studies\cite{KinjoETAL2005,KinjoANDNishikawa2005},
178: we set $d_c = 12$\AA{} and $w=3$ throughout the present paper.
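
As a quick numerical illustration of this contact definition (a worked
example, not a result of the present study; $w$ is taken here to be in
units of \AA$^{-1}$, which is an assumption), with $d_c = 12$\AA{} and
$w = 3$ the contact function takes the values
\begin{eqnarray*}
C_{i,j}(r_{i,j} = 11\mbox{\AA}) & = & 1/(1+e^{-3}) \approx 0.95,\\
C_{i,j}(r_{i,j} = 12\mbox{\AA}) & = & 1/2,\\
C_{i,j}(r_{i,j} = 13\mbox{\AA}) & = & 1/(1+e^{3}) \approx 0.05,
\end{eqnarray*}
so that the transition from ``in contact'' to ``not in contact'' takes place
within about 1\AA{} of the cut-off distance.
For a hypothetical chain of $L = 100$ residues in which residue $i = 10$ is
in full contact ($C_{i,j} \approx 1$) only with residues $j = 14$ and
$j = 50$, Eq.~(\ref{eq:def}) gives $o_{10} \approx (4 + 40)/100 = 0.44$;
the single long-range contact contributes ten times as much as the
short-range one.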
179:
180: We also define the normalized (relative) RWCO by
181: \begin{equation}
182: {y}_{i}^{p} = ({o}_{i}^{p} - \langle {o}_{i}^{p} \rangle)/
183: \sqrt{\langle({o}_{i}^{p} - \langle {o}_{i}^{p} \rangle)^2\rangle}
184: \label{eq:normal}
185: \end{equation}
where $\langle \cdot \rangle$ denotes the average over the residues of the given
protein chain $p$.
188:
189: \subsection*{Prediction scheme}
190: To predict the RWCO of each residue in a protein, we first conduct three
191: iterations of PSI-BLAST\cite{AltschulETAL1997} search against the
192: NCBI non-redundant amino acid sequence database to obtain the sequence profile
193: of the protein with the E-value cut-off of $10^{-7}$.
194: We use the amino acid score table of the
195: PSI-BLAST profile which is represented as $f(i,a)$
196: ($i$: site, $a$: amino acid) in the following (instead of the frequency table
197: used in the previous study\cite{KinjoETAL2005}).
198:
199: The RWCO $\hat{o}_{i}^{p}$ of the $i$-th residue in the protein $p$
200: is predicted in two steps. First we predict the normalized RWCO $y_{i}^p$ for
201: each residue, and then we combine it with the mean $\mu^p$ and standard
202: deviation (S.D.) $\sigma^p$ of the RWCOs of the protein,
203: which are predicted separately. The normalized RWCO is predicted by
204: the following linear regression scheme:
205: \begin{equation}
206: \hat{y}_{i}^{p} = \sum_{m=-M}^{M}\sum_{a}^{\mbox{\scriptsize residue types}}C_{m,a}f^{p}(i+m,a) + C\label{eq:reg}
207: \end{equation}
208: where $M$ is the half window size (a free parameter to be determined),
209: $f^{p} (i+m,a)$ represents an element of the PSI-BLAST profile of the
210: protein $p$, and $C_{m,a}$ and $C$ are regression parameters.
211: Both amino and carboxyl termini are treated by introducing an extra symbol
212: for the ``terminal residue.''
213: Thus, the RWCO of the $i$-th residue is expressed as a linear function of
214: the local sequence of $2M+1$ residues surrounding the $i$-th residue.
215:
216: The values of $C_{m,a}$ and $C$ are determined so as to minimize the prediction
217: error over a database of protein structures. The error function is defined by
218: \begin{equation}
219: E = \sum_{p}\sum_{i}(y_{i}^{p} - \hat{y}_{i}^{p})^{2}
220: \end{equation}
221: where $y_{i}^{p}$ is the observed normalized RWCO of the $i$-th residue of the
222: protein $p$.
223: The minimization of $E$ can be achieved by the usual least squares method.
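
To make the least squares step explicit (a standard formulation; the
symbols $\mathbf{x}_{i}^{p}$, $X$, $\mathbf{c}$, and $\mathbf{y}$ are
introduced here for illustration only), one may collect the profile elements
$f^{p}(i+m,a)$ of the window around the $i$-th residue, together with a
constant 1, into a row vector $\mathbf{x}_{i}^{p}$, stack these rows over all
residues of all training proteins into a matrix $X$, and stack the
corresponding $y_{i}^{p}$ into a vector $\mathbf{y}$.
With the parameters $C_{m,a}$ and $C$ collected in a vector $\mathbf{c}$,
the error function reads $E = |\mathbf{y} - X\mathbf{c}|^{2}$, and setting
its derivative with respect to $\mathbf{c}$ to zero yields the normal
equations
\[
X^{T}X\mathbf{c} = X^{T}\mathbf{y}.
\]
Assuming 21 symbols per window position (20 amino acid types plus the
terminal symbol), the number of parameters is $21(2M+1)+1$, that is,
22 for $M=0$ and 1114 for $M=26$.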
224:
225: The mean ($\mu^p$) and standard deviation ($\sigma^p$) of
226: the RWCOs of a protein are predicted from the amino acid
227: composition ($f_a^p$) and sequence length ($L^p$) of the protein $p$ in
228: the same manner as we have done for the contact number
229: prediction\cite{KinjoETAL2005}.
230: That is, the mean and S.D. are predicted by the following linear regression
231: scheme:
232: \begin{eqnarray}
233: \hat{\mu}^{p} & = &\sum_{a}A_{a}f_{a}^{p} + A_{l}F(L^{p}) + A\\
234: \hat{\sigma}^{p} & = &\sum_{a}D_{a}f_{a}^{p} + D_{l}F(L^{p}) + D
235: \end{eqnarray}
where $F(L^p) = L^p$ for $L^p < 300$ and $F(L^p) = 300$ for $L^p \geq 300$,
and $A_{a}$, $A_{l}$, $A$, $D_{a}$, $D_{l}$, and $D$ are regression parameters.
238: The final value for the predicted absolute RWCO ($\hat{o}_{i}^{p}$) is given by
239: \begin{equation}
240: \hat{o}_{i}^{p} = \hat{\mu}^{p} + \hat{\sigma}^{p}\hat{y}_{i}^{p}.
241: \label{eq:pred}
242: \end{equation}
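For illustration only, with purely hypothetical values
$\hat{\mu}^{p} = 8.0$, $\hat{\sigma}^{p} = 5.0$, and
$\hat{y}_{i}^{p} = 1.2$ (none of which is taken from the results of
this study), Eq.~(\ref{eq:pred}) gives
\[
\hat{o}_{i}^{p} = 8.0 + 5.0 \times 1.2 = 14.0 .
\]
Likewise, the length term saturates by construction: a chain of 172 residues
enters the regression with $F(L^{p}) = 172$, whereas a chain of 450 residues
enters with $F(L^{p}) = 300$.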
243:
244: \subsection*{Data set}
245: We first selected representative proteins from each superfamily of
246: all-$\alpha$, all-$\beta$, $\alpha/ \beta$, $\alpha + \beta$,
247: and multi-domain classes of the SCOP\cite{SCOP} (version 1.65) protein
248: structure classification database through the ASTRAL\cite{ASTRAL}
249: database. Those structures which were present in this superfamily
250: representative set but were absent from the 40\% representative set of
251: ASTRAL, those containing chain breaks (except for termini), or those
252: with the average contact number of less than 7.5 (non-compact structures)
253: were discarded.
254: Non-standard amino acid residues were converted to the corresponding standard
255: residues when possible, otherwise discarded.
256: When $C_\beta$ atoms were absent in non-glycine residues, they were modeled
257: by the SCWRL\cite{SCWRL3} side-chain prediction program.
In the end, 680 protein chains remained. The list of this data set
will be available from the authors' website.
260:
261: For training the parameters and testing the prediction accuracy, we performed
262: a 15-fold cross-validation test. The 680 proteins were randomly
263: divided into two groups, one consisting of 630 proteins for training
264: the parameters (training set), and the other (test set) consisting of
265: 50 proteins for testing the prediction using the parameters obtained from
the training set. The procedure was repeated 15 times.
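
As a simple consistency check on this protocol, 15 test sets of
50 proteins each amount to
\[
15 \times 50 = 750 = 162 + 164 + 166 + 240 + 18
\]
test predictions, in agreement with the class totals listed in
Table~\ref{tab:histo}. Since 750 exceeds the 680 chains in the data set,
some proteins necessarily appear in more than one test set.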
267:
268: \subsection*{Measures of prediction accuracy}
269: We employ two measures for evaluating the prediction accuracy.
270: The first one is the correlation coefficient ($Cor_p$) between the observed
271: and predicted RWCOs for a given protein $p$, which is defined by
272: \begin{equation}
273: Cor_{p} = \frac{\langle (o_{i}^{p} - \langle o_{i}^{p} \rangle)(\hat{o}_{i}^{p} - \langle \hat{o}_{i}^{p}\rangle)\rangle}{
274: \sqrt{\langle (o_{i}^{p} - \langle o_{i}^{p} \rangle)^{2}\rangle}
275: \sqrt{\langle (\hat{o}_{i}^{p} - \langle \hat{o}_{i}^{p} \rangle)^{2}\rangle}}.
276: \label{eq:cor}
277: \end{equation}
278: The $Cor_p$ measures the consistency of the normalized RWCOs.
279: In order to measure the accuracy of the predicted absolute values, we
280: use the RMS error divided by the standard deviation of the observed
281: RWCO ($DevA_p$):
282: \begin{equation}
283: DevA_{p} = \frac{\sqrt{\langle (o_{i}^{p} - \hat{o}_{i}^{p})^{2}\rangle}}
284: {\sqrt{\langle (o_{i}^{p} - \langle o_{i}^{p} \rangle)^2\rangle}}.
285: \label{eq:deva}
286: \end{equation}
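
Two elementary properties of these measures follow directly from the
definitions and may help in reading the results (the symbols $s_p$ and
$b_p$ are introduced here for illustration only).
First, because $\hat{o}_{i}^{p}$ is an affine function of $\hat{y}_{i}^{p}$
(Eq.~\ref{eq:pred}), $Cor_p$ does not depend on the predicted mean and
S.D. as long as $\hat{\sigma}^{p} > 0$, and it equals the correlation
between $y_{i}^{p}$ and $\hat{y}_{i}^{p}$.
Second, let $s_p$ be the ratio of the S.D. of the predicted RWCOs to that of
the observed RWCOs, and let
$b_p = (\langle \hat{o}_{i}^{p} \rangle - \langle o_{i}^{p} \rangle)/
\sqrt{\langle (o_{i}^{p} - \langle o_{i}^{p} \rangle)^2\rangle}$
be the normalized error of the mean. Expanding the numerator of
Eq.~(\ref{eq:deva}) then gives
\[
DevA_{p}^{2} = 1 + s_{p}^{2} - 2 s_{p} Cor_{p} + b_{p}^{2}.
\]
In particular, a trivial prediction that assigns the observed mean to every
residue gives $DevA_p = 1$, and for a given $Cor_p \geq 0$ the smallest
attainable value is $DevA_p = \sqrt{1 - Cor_{p}^{2}}$, which is reached at
$s_p = Cor_p$ and $b_p = 0$.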
287:
288: \section*{Results}
289: \subsection*{Optimal window size}
290: In the prediction scheme presented in this paper, the half window size $M$
291: is a free parameter. We determine its value so that the prediction accuracy
292: is maximized. We have performed a 15-fold cross-validation test with $M$
293: ranging from 0 to 40. The result is summarized in Figure~\ref{fig:window}.
294: The correlation coefficient $Cor_p$ (averaged over the test sets)
295: ranges from 0.48 at $M=0$ to $\approx$ 0.59 at $M=26$
296: (Figure \ref{fig:window} A). It should be noted that the correlation of 0.48
297: is already statistically significant given the
298: average sequence length (172 residues) of the proteins in the data set.
299: The value of $Cor_p$ monotonically increases from $M=0$ to $M=26$, but
300: starts to saturate for $M > 20$ and decreases slowly for $M>26$.
301: The deviation $DevA_p$
302: (averaged over the test sets) shows a consistent trend with $Cor_p$
303: (Figure \ref{fig:window} B), and it reaches the minimum value of $\approx$
304: 1.03 at $M=26$.
305: Thus, the optimal window size has been determined to be $M=26$.
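
As a rough check of the significance of $Cor_p = 0.48$ at $M=0$ noted above
(a standard $t$ statistic, not necessarily the test used in this study; it
treats the residues as independent, which neighboring residues are not, and
therefore tends to overstate the significance), a correlation $r = 0.48$
over $n = 172$ residues gives
\[
t = r\sqrt{\frac{n-2}{1-r^{2}}} = 0.48\sqrt{\frac{170}{1 - 0.2304}}
\approx 7.1 .
\]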
306:
This optimal window size of $M=26$ is much larger than those for any
other 1D structure prediction. As far as we are aware, this is the longest
range of correlation observed between a 1D structure and amino acid sequence.
For example, the optimal half window size is
$M=9$ for contact number prediction (see below) and $M = 6$--$8$ for secondary
structure prediction. Large window sizes usually result in over-fitting
the training data, but such is not the case for RWCO prediction, because the
accuracy was evaluated on the test sets by cross-validation. This unusually
long-range correlation with
amino acid sequence is a conspicuous property of the RWCO.
316:
317: \subsection*{Distribution of correlation}
318: As indicated by the average values of $Cor_p$ and $DevA_p$,
319: the linear regression method with $M=26$ tends to produce more accurate
predictions than with other window sizes. However, the prediction
accuracies for individual proteins differ significantly, as shown in
Figure \ref{fig:len_cor}. While most of the proteins
are decently predicted with correlations of 0.5 or higher,
some proteins exhibit very poor correlations. The poorly predicted proteins
are found not to be well packed because of the small size of the protein (e.g.,
SCOP domain d1fs1a1),
a large fraction of structurally disordered regions (e.g., d1cpo\_1), or
being a subunit of a large complex (e.g., d1mtyg\_).
329:
330: The prediction accuracy does not strikingly differ depending on the structural
331: class of proteins (Table \ref{tab:histo}). However, all-$\alpha$ proteins
332: show slightly poorer correlations compared to other classes,
333: and $\alpha + \beta$ proteins show relatively better correlations. The latter
334: may be due to the over-dominance of the $\alpha + \beta$ proteins in the
335: data sets.
336:
337: In Figure \ref{fig:ex}, three examples of predicted RWCO are shown.
338: Despite the relatively good correlation between the native and predicted RWCOs,
339: the absolute values of predicted RWCOs at many sites
340: significantly differ from the corresponding native RWCOs.
341: This behavior is indicated by the relatively
342: large value of $DevA_p \approx 1.03$ (Figure \ref{fig:window} B).
343: In particular, we notice that RWCOs of large values are consistently
underestimated. This behavior suggests that some cooperative effects
should be taken into account for better prediction.
Given that the present method is based exclusively
on one-body terms (Eq. \ref{eq:reg}), the prediction accuracy achieved is
satisfactory, at least qualitatively.
349:
350: \subsection*{Regression parameters as functions of sequence position}
351: Since the present study is the first attempt to develop a prediction method
352: for RWCO, it is of interest to examine the properties of amino acid
353: residues that affect the RWCO, which are reflected in the values of
354: the regression coefficients $C_{m,a}$.
355: Figure \ref{fig:aaprop} shows the values of $C_{m,a}$ for each amino acid
356: type $a$ as a function of the window position $m$. For all the amino acid
357: types, the peak of $C_{m,a}$, when present, is at the center ($m=0$).
358: We can easily recognize that these values, those at $m=0$ in particular,
359: are related to the hydrophobicity of amino acids.
360: That is, $C_{0,a}>0$ for hydrophobic residues and $C_{0,a}<0$ for hydrophilic
361: residues.
When the amino acid index (AAindex) database\cite{TomiiANDKanehisa1996}
was scanned for indices that
correlate highly with $C_{0,a}$, we found various hydrophobicity scales
whose correlations with $C_{0,a}$ exceed 0.90 (data not shown).
366: Therefore, we can conclude that the RWCO is primarily determined by the
367: pattern of hydrophobicity along the sequence.
368:
369: Some amino acid types exhibit oscillation with the periodicity of
3 to 4 residues, which is expected for the $\alpha$ helix. In fact,
such residues (e.g., GLU, GLN, ALA, etc.) have high $\alpha$ helix
propensity. In contrast, the residues of high $\beta$ strand
propensity (e.g., ILE, VAL, etc.) do not exhibit such oscillation.
374: Therefore, in addition to the hydrophobic properties, the parameters for
375: RWCO also contain information for secondary structures.
376:
377: \section*{Discussion}
378: \subsection*{Comparison with contact number prediction}
379: As can be seen from their definitions, the native RWCOs and contact numbers
380: show a high correlation of 0.7
381: (data not shown). This is also consistent with the finding that RWCOs are
382: primarily determined by hydrophobicity. Because of the correlation
383: between RWCO and contact number, it is of interest to ask whether it is possible to ``predict''
384: RWCOs using contact number prediction, and vice versa.
385: The result of this ``cross-prediction'' is listed in Table \ref{tab:cnrwco}.
386: Here, the contact number prediction\cite{KinjoETAL2005} is based
387: on exactly the same linear regression scheme as the RWCO prediction method.
To make the two prediction methods directly comparable,
we have determined the regression parameters and the optimal
half window size for the contact number prediction using the same training
and test data sets as used here.
392: The resulting contact number prediction method yields the average
393: prediction accuracy of $Cor_p \approx 0.70$ and
394: $DevA_p \approx 0.803$ with the optimal half window size of 9
395: (Table \ref{tab:cnrwco}, Case B), a remarkable improvement over our
396: previous study
397: ($Cor_p \approx 0.63$ and $DevA_p \approx 0.941$)\cite{KinjoETAL2005}
398: which is likely to be due to the use of PSI-BLAST score profiles
399: (we used frequency profiles derived from the HSSP database\cite{HSSP}
400: in the previous study).
401: When the values obtained from the contact number prediction are compared
402: to the native RWCOs, the highest correlation is 0.50 with the optimal half
403: window size
404: of $M = 4$ (Table \ref{tab:cnrwco}, Case C). Although the correlation of 0.50
405: is statistically significant,
406: the value is much lower than the one
407: obtained for the proper prediction of RWCO, $Cor_p \approx 0.59$
408: (Table \ref{tab:cnrwco}, Case A). For the ``prediction'' in the opposite
409: direction, that is, when the values obtained from the RWCO prediction are
410: compared to the native contact numbers, the correlation is as high as 0.62
411: with the optimal half window size of $M=4$ (Table \ref{tab:cnrwco}, Case D).
412: Again, this value, though statistically significant, is lower than the
413: proper contact number prediction ($Cor_p \approx 0.70$).
414: Interestingly, for the Cases C and D in
415: Table \ref{tab:cnrwco}, the optimal half window sizes coincide ($M = 4$).
Therefore, it is expected that the contact number and RWCO are very closely
related to each other in terms of the short-range pattern of the
418: local amino acid sequence. In other words, the distinction between the
419: contact number and RWCO originates from the interactions of longer range.
420:
421: To further clarify the correlation between
422: RWCO and contact number predictions, we compared the regression
423: parameters $C_{m,a}$ for RWCO and contact number predictions up to the
424: half window size of $M=9$ (Figure \ref{fig:parcor}).
It can be clearly seen that both sets of regression parameters
correlate strongly (correlation of $>0.7$)
427: with each other within the window positions of
428: $-4 \leq m \leq 4$ (Figure \ref{fig:parcor}), which confirms
429: the above observation (Table \ref{tab:cnrwco}, Cases C and D).
430:
431: \subsection*{Perspective for improving prediction accuracy}
432: The method for predicting RWCOs from amino acid sequence
433: developed in this paper is a very primitive one.
434: While the correlation of 0.59 between the native and predicted
435: RWCOs is significant, it is not as high as 0.70 in the case of the
436: contact number prediction (Table \ref{tab:cnrwco})
437: based on the same linear regression scheme.
438: Furthermore, the agreement of absolute RWCO values
439: is relatively poor, especially so for RWCOs of large values.
440: As mentioned above, inclusion of many-body
441: effects seems mandatory for better RWCO prediction.
A popular method for dealing with many-body terms is artificial
neural networks. Other non-linear regression schemes such as radial basis
function or support vector regression may also be
applicable.
446: Neural network methods as well as a support vector regression method
447: have been successfully applied to real value prediction of solvent
448: accessibility\cite{AhmadETAL2003,AdamczakETAL2004,YuanANDHuang2004}.
449: Solvent accessibility is closely related to the hydrophobicity of amino
acids, and hence is likely to be related to the RWCO. Thus, we can expect
that such non-linear regression approaches may also be useful for predicting RWCO.
However, since the RWCO prediction requires a rather long segment of local
453: amino acid sequence (half window size of $M=26$),
454: straightforward application of non-linear regression methods requiring
455: a great number of parameters may not work.
456: The number of parameters must be somehow reduced.
457: How to extract essential parameters for RWCO prediction is left for
458: future studies.
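
To give a sense of scale for this difficulty (an illustrative count under
assumptions not made in the text above: 21 symbols per window position, and
a fully connected network with a single hidden layer of $H$ units and one
output), the input dimension for $M=26$ is $21 \times 53 = 1113$.
The linear scheme then has $1113 + 1 = 1114$ parameters, whereas such a
network has
\[
1113H + H + H + 1 = 1115H + 1
\]
parameters (input weights, hidden biases, output weights, and output bias),
for example 11151 for $H = 10$.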
459:
460: An alternative route to the improved accuracy is to properly treat
461: the large deviation of RWCOs along the amino acid sequence.
462: For the contact number, its average over a local segment
463: tends to be close to the average over the whole sequence,
464: whereas, for the RWCO, such is not the case.
465: For example, for the SCOP domain d1a9xb1 (Figure \ref{fig:ex}C),
the average contact numbers for the whole domain, for residues 1 to 20,
and for residues 51 to 70 are, respectively, 25.5, 28.4, and 26.6, whereas
468: the corresponding averages of the RWCOs are 8.0, 14.3, and 4.9, respectively.
469: Since the present method is based on the globally normalized RWCO
470: (Eq. \ref{eq:normal}), such large deviations are difficult to
471: handle. If this limitation is overcome, better prediction accuracy may be
472: obtained.
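
Expressed as ratios to the whole-domain averages, the values quoted above
for d1a9xb1 give
\begin{eqnarray*}
\mbox{contact number:} & & 28.4/25.5 \approx 1.11, \quad 26.6/25.5 \approx 1.04,\\
\mbox{RWCO:} & & 14.3/8.0 \approx 1.79, \quad 4.9/8.0 \approx 0.61 .
\end{eqnarray*}
That is, in this example the local averages of the contact number stay
within about 10\% of the global average, whereas those of the RWCO lie
roughly 80\% above and 40\% below it.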
473:
474: \section*{Acknowledgment}
475: The authors thank Satoshi Fukuchi, Yoshiaki Minezaki, and Yasuo Shirakihara
476: for helpful comments.
477: Most of the computations were carried out at the supercomputing facility of
478: National Institute of Genetics, Japan. This work was supported in part by a
479: grant-in-aid from the MEXT, Japan.
480:
481: The list of the SCOP domain identifiers used in the present study, and
482: the optimal parameter sets are available at the URL
483: http://maccl01.genes.nig.ac.jp/\~{}akinjo/rwco/.
484:
485: %\bibliographystyle{unsrt}
486: %\bibliography{refs,mypaper}
487: \begin{thebibliography}{10}
488:
489: \bibitem{Rost2003}
490: B.~Rost.
491: \newblock Prediction in {1D}: secondary structure, membrane helices, and
492: accessibility.
493: \newblock In P.~E. Bourne and H.~Weissig, editors, {\em Structural
494: Bioinformatics}, chapter~28, pages 559--587. Wiley-Liss, Inc., Hoboken,
495: U.S.A., 2003.
496:
497: \bibitem{PortoETAL2004}
498: M.~Porto, U.~Bastolla, H.~E. Roman, and M.~Vendruscolo.
499: \newblock Reconstruction of protein structures from a vectorial representation.
500: \newblock {\em Phys. Rev. Lett.}, 92:218101, 2004.
501:
502: \bibitem{VendruscoloETAL1997}
503: M.~Vendruscolo, E.~Kussell, and E.~Domany.
504: \newblock Recovery of protein structure from contact maps.
505: \newblock {\em Fold. Des.}, 2:295--306, 1997.
506:
507: \bibitem{KinjoANDNishikawa2005}
508: A.~R. Kinjo and K.~Nishikawa.
509: \newblock Recoverable one-dimensional encoding of protein three-dimensional
510: structures.
511: \newblock {\em (submitted)}, 2005.
512: \newblock http://arXiv.org/abs/q-bio.BM/0501005.
513:
514: \bibitem{PlaxcoETAL1998}
515: K.~W. Plaxco, K.~T. Simons, and D.~Baker.
516: \newblock Contact order, transition state placement and the refolding rates of
517: single domain proteins.
518: \newblock {\em J. Mol. Biol.}, 277:985--994, 1998.
519:
520: \bibitem{KinjoETAL2005}
521: A.~R. Kinjo, K.~Horimoto, and K.~Nishikawa.
522: \newblock Predicting absolute contact numbers of native protein structure from
523: amino acid sequence.
524: \newblock {\em Proteins}, 58:158--165, 2005.
525:
526: \bibitem{AltschulETAL1997}
527: S.~F. Altschul, T.~L. Madden, A.~A. Schaffer, J.~Zhang, Z.~Zhang, W.~Miller,
528: and D.~L. Lipman.
\newblock Gapped {BLAST} and {PSI-BLAST}: a new generation of protein database
530: search programs.
531: \newblock {\em Nucleic Acids Res.}, 25:3389--3402, 1997.
532:
533: \bibitem{SCOP}
534: A.~G. Murzin, S.~E. Brenner, T.~Hubbard, and C.~Chothia.
535: \newblock {SCOP}: A structural classification of proteins database for the
536: investigation of sequences and structures.
537: \newblock {\em J. Mol. Biol.}, 247:536--540, 1995.
538:
539: \bibitem{ASTRAL}
540: J.~M. Chandonia, G.~Hon, N.~S. Walker, L.~{Lo Conte}, P.~Koehl, M.~Levitt, and
541: S.~E. Brenner.
\newblock The {ASTRAL} compendium in 2004.
543: \newblock {\em Nucleic Acids Res.}, 32:D189--D192, 2004.
544:
545: \bibitem{SCWRL3}
546: A.~A. Canutescu, A.~A. Shelenkov, and R.~L. Dunbrack.
547: \newblock A graph theory algorithm for protein side-chain prediction.
548: \newblock {\em Protein Sci.}, 12:2001--2014, 2003.
549:
550: \bibitem{TomiiANDKanehisa1996}
551: K.~Tomii and M.~Kanehisa.
552: \newblock Analysis of amino acid indices and mutation matrices for sequence
553: comparison and structure prediction of proteins.
554: \newblock {\em Protein Eng.}, 9:27--36, 1996.
555:
556: \bibitem{HSSP}
557: C.~Sander and R.~Schneider.
558: \newblock Database of homology-derived protein structures.
559: \newblock {\em Proteins}, 9:56--68, 1991.
560:
561: \bibitem{AhmadETAL2003}
562: S.~Ahmad, M.~M. Gromiha, and A.~Sarai.
563: \newblock Real value prediction of solvent accessibility from amino acid
564: sequence.
565: \newblock {\em Proteins}, 50:629--635, 2003.
566:
567: \bibitem{AdamczakETAL2004}
568: R.~Adamczak, A.~Porollo, and J.~Meller.
569: \newblock Accurate prediction of solvent accessibility using neural
570: networks-based regression.
571: \newblock {\em Proteins}, 56:753--767, 2004.
572:
573: \bibitem{YuanANDHuang2004}
574: Z.~Yuan and B.~Huang.
575: \newblock Prediction of protein accessible surface areas by support vector
576: regression.
577: \newblock {\em Proteins}, 57:558--564, 2004.
578:
579: \end{thebibliography}
580:
581:
582: \newpage
583: \begin{table}
584: \caption{\label{tab:histo}Distribution of $Cor_p$ for each SCOP class$^a$.}
585: \begin{center}
586: \begin{tabular}{lrrrrr}\hline
587: range$^b$ &\multicolumn{5}{c}{SCOP class$^c$}\\
588: ($Cor_p$) & a & b & c & d & e\\\hline
589: (-1,0.2] & 4(3) & 1(0.6)& 7(4) & 2(0.8) & 0 \\
590: (0.2,0.4] & 23(14) & 17(10) & 14(8) & 22(9) & 1(5) \\
591: (0.4,0.6] & 61(38) & 54(33) & 55(33) & 72(30) & 11(61) \\
592: (0.6,0.8] & 73(45) & 86(52) & 82(49) & 136(57) & 6(33) \\
593: (0.8,1.0] & 1(0.6)& 6(4) & 8(5) & 8(3) & 0 \\
594: total & 162 & 164 & 166 & 240 & 18\\
595: \hline
596: \end{tabular}
597: \end{center}
598: $^a$ The number (percentage in the parentheses)
599: of occurrences of $Cor_p$ for the proteins in the test sets,
600: classified according to the SCOP database.\\
601: $^b$ The range ``$(x,y]$'' denotes $x < Cor_p \leq y$.\\
602: $^c$ a: all-$\alpha$, b: all-$\beta$, c: $\alpha / \beta$, d: $\alpha + \beta$,
603: e: multi-domain.
604: \end{table}
605: ~\\
606: \newpage
607: \begin{table}
608: \caption{\label{tab:cnrwco}Cross-prediction between residue-wise contact orders and contact numbers.}
609: \begin{center}
610: \begin{tabular}{cccrrr}\hline
611: Case & Train$^a$ & Test$^b$ & $M^c$ & $Cor_p$ & $DevA_p$ \\\hline
612: A & RWCO & RWCO & 26 & 0.59 & 1.03 \\
613: B & CN & CN & 9 & 0.70 & 0.803\\
614: C & CN & RWCO & 4 & 0.50 & N.A.$^d$ \\
615: D & RWCO & CN & 4 & 0.62 & N.A.$^d$ \\\hline
616: \end{tabular}
617: \end{center}
618: $^a$Target values for which the regression parameters were trained. ``RWCO'' and ``CN''
619: indicate that the regression parameters were trained to fit the residue-wise contact orders
620: and contact numbers, respectively.\\
621: $^b$Target values for which the ``prediction'' was applied. ``RWCO'' and ``CN'' indicate
622: that predicted values were compared with the native residue-wise contact orders and native
623: contact numbers, respectively.\\
624: $^c$Optimal half window size for the prediction.\\
625: $^d$Not applicable because the ranges of RWCO and CN values are different.
626: \end{table}
627: ~\\
628:
629: \begin{figure}
630: \begin{center}
631: \includegraphics[width=8cm]{window_sc.eps}
632: \end{center}
633: \caption{\label{fig:window}Prediction accuracy as a function of window size.
634: (A) The correlation coefficient ($Cor_p$) between the native and predicted RWCO, averaged
635: over the test set proteins. (B) Deviation of the predicted RWCO from the native one ($DevA_p$), averaged over the test set proteins.}
636: \end{figure}
637: ~\\
638: \newpage
639: \begin{figure}
640: \includegraphics[width=8cm]{len_cor.eps}
641: \caption{\label{fig:len_cor}$Cor_p$ plotted against chain length. Each point represents a protein in one of the test sets.}
642: \end{figure}
643:
644: \begin{figure}
645: \begin{center}
646: \includegraphics[width=7cm]{./example4.eps}
647: \end{center}
648: \caption{\label{fig:ex}Examples of prediction. Red: native RWCO; Green: predicted RWCO.
649: (A) SCOP domain d1a6m\_\_ (myoglobin, all-$\alpha$), $Cor_p = 0.73$,
650: $DevA_p = 0.75$;
651: (B) SCOP domain d1ifra\_ (Lamin A/C globular tail domain, all-$\beta$),
652: $Cor_p =0.72$, $DevA_p = 0.87$;
653: (C) SCOP domain d1a9xb1 (Carbamoyl phosphate synthetase, small subunit N-terminal domain, $\alpha / \beta$), $Cor_p = 0.72$, $DevA_p = 0.81$. }
654: \end{figure}
655: ~\\
656:
657: \begin{figure}
658: \begin{center}
659: \includegraphics[width=16cm]{a2z.eps}
660: \end{center}
661: \caption{\label{fig:aaprop}$C_{m,a}$ for each amino acid type ($a$) as a function of the window position ($m$).}
662: \end{figure}
663: ~\\
664: \newpage
665: \begin{figure}
666: \includegraphics[width=8cm]{par_cor.eps}
667: \caption{\label{fig:parcor}Correlation between the regression parameters
668: $C_{m,a}$ for contact number and RWCO predictions for each window position.
669: The horizontal axis is the window position $m$ in the local sequence.
670: The vertical axis is the correlation coefficient between the regression
671: parameters $C_{m,a}$ for RWCO prediction and those for contact number
672: prediction at the window position $m$.}
673: \end{figure}
674: \end{document}
675: