1: \documentclass[12pt]{article}
2: \usepackage{times}
3: \usepackage{graphicx}
4: \usepackage{proteins,citesupernumber}
5: 
6: %\renewcommand{\baselinestretch}{1.5}
7: \setlength{\textheight}{20cm}
8: \begin{document}
9: \setlength{\baselineskip}{20pt}
10: \begin{flushleft}
11: {\Large \bf Predicting Residue-wise Contact Orders of Native Protein Structure from  Amino Acid Sequence}
12: 
13: \vspace{5mm}
14: Akira R. Kinjo$^{1,2,*}$ and Ken Nishikawa$^{1,2}$
15: 
16: \vspace{3mm}
17: $^{1}$Center for Information Biology and DNA Data Bank of Japan,
National Institute of Genetics, Mishima, 411-8540, Japan\\
19: $^{2}$Department of Genetics, 
20: The Graduate University for Advanced Studies (SOKENDAI), 
21: Mishima, 411-8540, Japan
22: 
23: \vspace{1cm}
24: $^{*}$Correspondence to Akira R. Kinjo.\\
25: Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Mishima, 411-8540, Japan\\
26: Tel: +81-55-981-6859, Fax: +81-55-981-6889\\
27: E-mail: akinjo@genes.nig.ac.jp
28: 
29: \vspace{1cm}
30: Running title: Residue-wise contact order prediction.
31: 
32: \vspace{1cm}
33: Key words: protein structure prediction; residue-wise contact order; linear regression; one-dimensional structure.
34: \end{flushleft}
35: \newpage
36: \begin{abstract}
Residue-wise contact order (RWCO) is a new kind of one-dimensional protein 
structure which represents the extent of long-range contacts. 
39: We have recently shown that a set of three types of one-dimensional structures 
40: (secondary structure, contact number, and RWCO) contains sufficient 
41: information for reconstructing the three-dimensional structure of proteins.
42: Currently, there exist prediction methods for secondary structure and contact 
number from amino acid sequence, but none exists for RWCO. Also, the properties of 
amino acids that affect RWCO are not clearly understood. Here, we present a
linear regression-based method to predict RWCO from amino acid sequence, 
and analyze the regression parameters to identify the properties that 
correlate with the RWCO. The present method achieves a significant 
correlation of 0.59 between the native and predicted RWCOs on average. 
49: An unusual feature of the RWCO prediction is the remarkably large optimal 
50: half window size of 26 residues.
51: The regression parameters for the central and near-central residues of the 
52: local sequence segment highly correlate with those of the contact 
53: number prediction, and hence with hydrophobicity.
54: \end{abstract}
55: \emph{Key words:} protein structure prediction, residue-wise contact order,
56: one-dimensional structure, linear regression.
57: \newpage
58: \section*{Introduction}
59: One of the main goals of protein structure prediction is to provide an 
60: intuitive picture of the relationship between the amino acid sequence
61: and the native three-dimensional (3D) structure of proteins.
62: To this end, a number of methods have been developed for \textit{ab initio} or 
63: \textit{de novo} protein structure prediction. However, such methods are 
64: usually very complicated and make it difficult to intuitively understand 
65: the relationship between amino acid sequence and 3D structure.
In this respect, one-dimensional (1D) structures\cite{Rost2003} of 
proteins may be convenient intermediate representations of both 
sequence and structure 
of proteins, as they make it easy to grasp the correspondence between 
sequence and structural characteristics. 
71: 
72: Since 1D structures are 3D structural features projected onto strings of 
73: residue-wise structural assignments\cite{Rost2003}, a large part of 3D 
74: information appears to be lost. That is, the correspondence between amino 
75: acid sequence and 1D structures does not seem to be sufficient for 
76: uncovering the correspondence between amino acid sequence and 3D structure.
77: However, Porto \textit{et al.}\cite{PortoETAL2004} have recently shown that 
78: the contact matrix of a protein structure can be uniquely recovered from 
79: its principal eigenvector. Since the protein 3D structure can be recovered 
80: from the contact matrix\cite{VendruscoloETAL1997}, the result of 
81: Porto \textit{et al.}\cite{PortoETAL2004} indicates that 
82: the information contained in the 3D structure can be expressed as a 
83: one-dimensional representation.
Furthermore, we have recently shown that the 3D structure of a protein can be 
reconstructed from a set of three types of 
1D structures\cite{KinjoANDNishikawa2005}. 
In other words, the 3D structure of a protein is essentially equivalent 
to a set of three types of 1D structures, 
namely secondary structure, contact number, and 
residue-wise contact order.
91: The fact that the 3D structure of a protein can be recovered from 
92: a set of these 1D structures
93: opens a new possibility for elucidating the sequence-structure relationship
94: of proteins.
95: 
The secondary structure of a protein is a string of symbols representing 
$\alpha$ helix, $\beta$ strand, or coil. The contact number of each residue 
98: in a protein is defined by the number of contacts the residue makes with other 
99: residues in the protein. More precisely, 
100: if we represent the contact map of the protein by $C_{i,j}$ ($C_{i,j} = 1$ 
101: if the $i$-th and $j$-th residues are in 
102: contact, or $C_{i,j} = 0$ otherwise), the contact number $n_{i}$ 
103: of the $i$-th residue is defined by $n_{i} = \sum_{j}C_{i,j}$. 
104: Similarly, the residue-wise contact order (RWCO) $o_{i}$ 
105: of the $i$-th residue of a protein is defined 
106: by $o_{i} = \sum_{j}|i-j|C_{i,j}$,
107: that is, a sum of sequence separations between the residue and the 
108: contacting residues\cite{KinjoANDNishikawa2005}.
The contact order was first introduced as a per-protein quantity by 
Plaxco \textit{et al.}\cite{PlaxcoETAL1998} to study the correlation between protein 
111: topology and folding rate. The RWCO introduced here is a generalization of
112:  the contact order, and is a per-residue quantity. 
113: 
114: At least in principle, if we can predict those 1D structures, 
115: we can also construct the corresponding 3D structures. Many accurate methods
116: have been developed for secondary structure prediction\cite{Rost2003}. 
We have developed a method to predict the contact number from amino acid 
sequence\cite{KinjoETAL2005} with an average correlation of 0.63 
between the native and predicted contact numbers. However, to date there is 
no method for predicting RWCO from amino acid sequence, 
and it is not clear whether such a prediction is possible at all. 
122: The primary objective of the present paper is to develop a method 
123: to predict RWCO from amino acid sequence.
124: 
125: While the accurate prediction of structural properties is important for 
126: its own sake, for a thorough understanding of the 
127: sequence-structure relationship, we still need to identify the properties 
128: of amino acid sequence that determine the structure.
From the vast number of past studies on secondary structure prediction, 
we are now convinced 
131: that each amino acid has a particular propensity for a particular secondary 
132: structure, although the final secondary structures in the native 
133: structure are determined in the global context. Also, contact number is 
134:  closely related to the hydrophobicity of amino acids. Thus, both 
135: secondary structure and contact number have clear connections with
136: the properties of amino acids. As for the residue-wise contact order, 
137:  its geometrical meaning is clear (i.e., a quantity related to the extent of 
138: long-range contacts), but the conjugate properties of amino acids are not.
As the second objective of the present study, we attempt to identify the 
properties of amino acids that affect RWCO by examining the parameters derived 
for the prediction method.
142: 
143: The prediction method developed in this paper is based on a simple linear 
144: regression scheme which was also applied to the contact number 
145: prediction in our previous study\cite{KinjoETAL2005}. 
146: By examining the regression parameters, 
147: we show that the RWCO is primarily determined by the pattern of hydrophobicity
148: of amino acids. 
149: Although the method is extremely simple, it yields a significant 
150: correlation of 0.59 between the native and predicted RWCOs. 
While further refinement is definitely necessary before the method can be 
applied to 3D structure prediction, the present method will serve as a basis for 
153: more elaborate methods yet to be developed.
154: 
\section*{Materials and Methods}
156: \subsection*{Definition of residue-wise contact order}
157: As mentioned in the Introduction, the residue-wise contact order (RWCO) of 
158: the $i$-th residue is defined by
159: \begin{equation}
160: o_{i} = \frac{1}{L}\sum_{j:|j-i|>2}|i-j|C_{i,j}\label{eq:def}
161: \end{equation}
162: where the summation is normalized by the length $L$ of the amino acid 
163: sequence of the protein and $C_{i,j}$ represents the contact map of the protein.
164: We exclude trivial contacts between nearest- and next-nearest residues 
165: along the sequence.
166: To make the RWCO useful for molecular dynamics simulations, the contact 
167: between two residues is defined by a smooth sigmoid function:
168: \begin{equation}
169: C_{i,j} = 1/\{1+\exp[w(r_{i,j} - d_c)]\}
170: \end{equation}
171: where $r_{i,j}$ is the distance between $C_{\beta}$ atoms of the $i$-th 
172: and $j$-th 
173: residues ($C_{\alpha}$ atoms for glycine), $d_c$ is the cut-off distance for 
174: the contact definition, and $w$ is 
175: a parameter that determines the sharpness of the sigmoid function. 
176: To be consistent with our previous 
177: studies\cite{KinjoETAL2005,KinjoANDNishikawa2005},
178: we set $d_c = 12$\AA{} and $w=3$ throughout the present paper.
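
The two definitions above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names \texttt{contact\_map} and \texttt{rwco} are hypothetical, and the input is assumed to be an array of $C_{\beta}$ coordinates ($C_{\alpha}$ for glycine) in \AA{}ngstr\"om units.

```python
import numpy as np

def contact_map(coords, d_c=12.0, w=3.0):
    """Smooth contact map C[i, j] = 1 / (1 + exp(w * (r_ij - d_c)))."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    r = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise distances r_ij
    # Clip the exponent so that very distant pairs do not overflow exp().
    return 1.0 / (1.0 + np.exp(np.clip(w * (r - d_c), -50.0, 50.0)))

def rwco(coords, d_c=12.0, w=3.0):
    """Residue-wise contact order o_i = (1/L) * sum_{|i-j|>2} |i-j| * C[i, j]."""
    c = contact_map(coords, d_c, w)
    n = len(c)
    idx = np.arange(n)
    sep = np.abs(idx[:, None] - idx[None, :])  # sequence separation |i - j|
    mask = sep > 2  # drop self, nearest and next-nearest neighbors
    return (sep * c * mask).sum(axis=1) / n
```

For a fully extended toy chain with 3.8 \AA{} spacing, only pairs separated by three residues lie within the 12 \AA{} cut-off, so the resulting RWCO values are small and symmetric about the middle of the chain.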
179: 
180: We also define the normalized (relative) RWCO by
181: \begin{equation}
182: {y}_{i}^{p} = ({o}_{i}^{p} - \langle {o}_{i}^{p} \rangle)/
183: \sqrt{\langle({o}_{i}^{p} - \langle {o}_{i}^{p} \rangle)^2\rangle}
184: \label{eq:normal}
185: \end{equation}
186: where $\langle \cdot \rangle$ denotes averaging operation over the given 
187: protein chain $p$.
188: 
189: \subsection*{Prediction scheme}
190: To predict the RWCO of each residue in a protein, we first conduct three 
191: iterations of PSI-BLAST\cite{AltschulETAL1997} search against the 
192: NCBI non-redundant amino acid sequence database to obtain the sequence profile 
193: of the protein with the E-value cut-off of $10^{-7}$. 
194: We use the amino acid score table of the  
195: PSI-BLAST  profile which is represented as $f(i,a)$ 
196: ($i$: site, $a$: amino acid) in the following (instead of the frequency table 
197: used in the previous study\cite{KinjoETAL2005}).
198: 
199: The RWCO $\hat{o}_{i}^{p}$ of the $i$-th residue in the protein $p$ 
200: is predicted in two steps. First we predict the normalized RWCO $y_{i}^p$ for 
201: each residue, and then we combine it with the mean $\mu^p$ and standard 
202: deviation (S.D.) $\sigma^p$ of the RWCOs of the protein, 
203: which are predicted separately. The normalized RWCO is predicted by 
204: the following linear regression scheme:
205: \begin{equation}
206: \hat{y}_{i}^{p} = \sum_{m=-M}^{M}\sum_{a}^{\mbox{\scriptsize residue types}}C_{m,a}f^{p}(i+m,a) + C\label{eq:reg}
207: \end{equation}
208: where $M$ is the half window size (a free parameter to be determined), 
209: $f^{p} (i+m,a)$ represents an element of the PSI-BLAST profile of the 
210: protein $p$, and $C_{m,a}$ and $C$ are regression parameters.
211: Both amino and carboxyl termini are treated by introducing an extra symbol 
212: for the ``terminal residue.'' 
213: Thus, the RWCO of the $i$-th residue is expressed as a linear function of 
214: the local sequence of $2M+1$ residues surrounding the $i$-th residue.
215: 
216: The values of $C_{m,a}$ and $C$ are determined so as to minimize the prediction
217: error over a database of protein structures. The error function is defined by
218: \begin{equation}
219: E = \sum_{p}\sum_{i}(y_{i}^{p} - \hat{y}_{i}^{p})^{2}
220: \end{equation}
221: where $y_{i}^{p}$ is the observed normalized RWCO of the $i$-th residue of the 
222: protein $p$.
223: The minimization of $E$ can be achieved by the usual least squares method.
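
The regression of Eq. \ref{eq:reg} and its least-squares fit can be sketched as below. This is only an illustration under assumptions: the function names are hypothetical, the profile is assumed to be an array with one row per site and one column per amino acid type, and the ``terminal residue'' symbol is assumed to be encoded as an extra indicator column, a detail that is not specified in the text.

```python
import numpy as np

def window_features(profile, M):
    """Stack the 2M+1 profile rows around each site into one feature vector.

    profile: (L, A) array of profile scores f(i, a).
    Window positions beyond either terminus are marked by an extra
    indicator column (an assumed encoding of the terminal symbol).
    """
    L, A = profile.shape
    padded = np.zeros((L + 2 * M, A + 1))
    padded[M:M + L, :A] = profile
    padded[:M, A] = 1.0        # positions before the amino terminus
    padded[M + L:, A] = 1.0    # positions after the carboxyl terminus
    rows = [padded[i:i + 2 * M + 1].ravel() for i in range(L)]
    return np.asarray(rows)

def fit(profiles, targets, M):
    """Least-squares estimate of the coefficients C_{m,a} and the constant C."""
    X = np.vstack([window_features(p, M) for p in profiles])
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # column for the constant C
    y = np.concatenate(targets)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict(profile, coef, M):
    """Predicted normalized RWCO for every site of one protein."""
    X = window_features(profile, M)
    return X @ coef[:-1] + coef[-1]
```

In this sketch \texttt{np.linalg.lstsq} returns the minimum-norm solution, so the fit remains well defined even when some feature columns are identically zero, as happens for the terminal indicator at the central window position.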
224: 
225: The mean ($\mu^p$) and standard deviation ($\sigma^p$) of 
226: the RWCOs of a protein are predicted from the amino acid 
227: composition ($f_a^p$) and sequence length ($L^p$) of the protein $p$ in 
228: the same manner as we have done for the contact number 
229: prediction\cite{KinjoETAL2005}.
230: That is, the mean and S.D. are predicted by the following linear regression 
231: scheme:
232: \begin{eqnarray}
233:   \hat{\mu}^{p} & = &\sum_{a}A_{a}f_{a}^{p} + A_{l}F(L^{p}) + A\\
234:   \hat{\sigma}^{p} & = &\sum_{a}D_{a}f_{a}^{p} + D_{l}F(L^{p}) + D
235: \end{eqnarray}
236: where $F(L^p) = L^p$ for $L^p < 300$ and $F(L^p) = 300$ for $L^p \geq 300$,
and $A_{a}, A_{l}, A, D_{a}, D_{l}$, and $D$ are regression parameters.
238: The final value for the predicted absolute RWCO ($\hat{o}_{i}^{p}$) is given by 
239: \begin{equation}
240:   \hat{o}_{i}^{p} = \hat{\mu}^{p} + \hat{\sigma}^{p}\hat{y}_{i}^{p}.
241: \label{eq:pred}
242: \end{equation}
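
The capped length term $F(L^p)$ and the final combination of Eq. \ref{eq:pred} can be sketched as follows. The function names are hypothetical and the numerical values in the usage note are arbitrary; no fitted parameter values are implied.

```python
import numpy as np

def length_feature(L, cap=300):
    """F(L) = L for L < cap, and cap otherwise."""
    return min(L, cap)

def global_features(composition, L):
    """Feature vector [f_a ..., F(L), 1] shared by the mean and S.D. regressions."""
    return np.concatenate([np.asarray(composition, dtype=float),
                           [length_feature(L), 1.0]])

def absolute_rwco(y_hat, mu_hat, sigma_hat):
    """Predicted absolute RWCO: mu_hat + sigma_hat * y_hat for every site."""
    return mu_hat + sigma_hat * np.asarray(y_hat, dtype=float)
```

For instance, with an arbitrary predicted mean of 8.0 and S.D. of 2.0, normalized predictions of $-1$, $0$, and $1$ map to absolute values of 6, 8, and 10.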
243: 
244: \subsection*{Data set}
245: We first selected representative proteins from each superfamily of 
246: all-$\alpha$, all-$\beta$, $\alpha/ \beta$, $\alpha + \beta$, 
247: and multi-domain classes of the SCOP\cite{SCOP} (version 1.65) protein 
248: structure classification database through the ASTRAL\cite{ASTRAL}  
249: database. Those structures which were present in this superfamily 
250: representative set but were absent from the 40\% representative set of 
251: ASTRAL, those containing chain breaks (except for termini), or those 
252: with the average contact number of less than 7.5 (non-compact structures) 
253: were discarded. 
254: Non-standard amino acid residues were converted to the corresponding standard
255: residues when possible, otherwise discarded. 
256: When $C_\beta$ atoms were absent in non-glycine residues, they were modeled 
257: by the SCWRL\cite{SCWRL3} side-chain prediction program.
In the end, 680 protein chains remained. The list of the proteins in this 
data set is available from the authors' website.
260: 
261: For training the parameters and testing the prediction accuracy, we performed 
262: a 15-fold cross-validation test. The 680 proteins were randomly 
263: divided into two groups, one consisting of 630 proteins for training 
264: the parameters (training set), and the other (test set) consisting of 
265: 50 proteins for testing the prediction using the parameters obtained from 
the training set. The procedure was repeated 15 times.
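
The splitting procedure can be sketched as below. It is assumed here that each of the 15 rounds draws a fresh random partition, so that a protein may appear in more than one test set; the function name \texttt{random\_splits} and the fixed seed are hypothetical.

```python
import random

def random_splits(items, n_test=50, n_rounds=15, seed=0):
    """Yield (training, test) lists from repeated random partitioning."""
    rng = random.Random(seed)
    items = list(items)
    for _ in range(n_rounds):
        shuffled = items[:]
        rng.shuffle(shuffled)
        # The first n_test items form the test set, the rest the training set.
        yield shuffled[n_test:], shuffled[:n_test]
```

With 680 items and the default arguments, every round yields 630 training and 50 test proteins, for 750 test cases in total.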
267: 
268: \subsection*{Measures of prediction accuracy}
269: We employ two measures for evaluating the prediction accuracy. 
270: The first one is the correlation coefficient ($Cor_p$) between the observed 
271: and predicted RWCOs for a given protein $p$, which is defined by
272: \begin{equation}
273:   Cor_{p} = \frac{\langle (o_{i}^{p} - \langle o_{i}^{p} \rangle)(\hat{o}_{i}^{p} - \langle \hat{o}_{i}^{p}\rangle)\rangle}{
274: \sqrt{\langle (o_{i}^{p} - \langle o_{i}^{p} \rangle)^{2}\rangle} 
275: \sqrt{\langle (\hat{o}_{i}^{p} - \langle \hat{o}_{i}^{p} \rangle)^{2}\rangle}}.
276: \label{eq:cor}
277: \end{equation}
$Cor_p$ measures the consistency of the normalized RWCOs.
279: In order to measure the accuracy of the predicted absolute values, we 
280: use the RMS error divided by the standard deviation of the observed
281: RWCO ($DevA_p$):
282: \begin{equation}
283:     DevA_{p} = \frac{\sqrt{\langle (o_{i}^{p} - \hat{o}_{i}^{p})^{2}\rangle}}
284: {\sqrt{\langle (o_{i}^{p} - \langle o_{i}^{p} \rangle)^2\rangle}}.
285: \label{eq:deva}
286: \end{equation}
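
Both measures are straightforward to compute for a single protein; a minimal sketch with hypothetical function names is given below. Note that, by Eq. \ref{eq:deva}, a predictor that returns the observed mean for every residue gives $DevA_p = 1$, which is a useful reference point when reading the values reported later.

```python
import numpy as np

def cor_p(observed, predicted):
    """Pearson correlation between observed and predicted RWCOs of one protein."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    do, dp = o - o.mean(), p - p.mean()
    return (do * dp).mean() / np.sqrt((do ** 2).mean() * (dp ** 2).mean())

def dev_a(observed, predicted):
    """RMS error divided by the standard deviation of the observed RWCOs."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    rms_error = np.sqrt(((o - p) ** 2).mean())
    return rms_error / np.sqrt(((o - o.mean()) ** 2).mean())
```

The correlation is undefined when either series is constant, so \texttt{cor\_p} should only be applied to predictions that vary along the chain.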
287: 
288: \section*{Results}
289: \subsection*{Optimal window size}
290: In the prediction scheme presented in this paper, the half window size $M$ 
291: is a free parameter. We determine its value so that the prediction accuracy 
292: is maximized. We have performed a 15-fold cross-validation test with $M$ 
293: ranging from 0 to 40. The result is summarized in Figure~\ref{fig:window}. 
294: The correlation coefficient $Cor_p$ (averaged over the test sets) 
295: ranges from 0.48 at $M=0$ to $\approx$ 0.59 at $M=26$ 
296: (Figure \ref{fig:window} A). It should be noted that the correlation of 0.48 
297: is already statistically significant given the 
298: average sequence length (172 residues) of the proteins in the data set.
299: The value of  $Cor_p$ monotonically increases from $M=0$ to $M=26$, but 
300: starts to saturate for $M > 20$ and decreases slowly for $M>26$. 
301: The deviation $DevA_p$ 
302: (averaged over the test sets) shows a consistent trend with $Cor_p$ 
303: (Figure \ref{fig:window} B), and it reaches the minimum value of $\approx$ 
304: 1.03 at $M=26$.
305: Thus, the optimal window size has been determined to be $M=26$.
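
As a rough check of the significance of the correlation at $M=0$ noted above, one may apply the standard $t$-test for a Pearson correlation coefficient $r$ computed from $n$ samples. This is only an illustrative estimate and not necessarily the test used here: it assumes that the residues of a chain are independent samples, which neighboring residues are not, so the value below is likely to overstate the significance to some extent. With $r = 0.48$ and $n = 172$,
\[
t = r\sqrt{\frac{n-2}{1-r^{2}}} = 0.48\sqrt{\frac{170}{1-0.2304}} \approx 7.1 ,
\]
which lies far above the conventional significance thresholds for 170 degrees of freedom.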
306: 
307: This optimal window size of $M=26$ is much larger than the ones for any 
308: other 1D structure predictions. As far as we are aware, this is the longest 
309: range of correlation observed between 1D structure and amino acid sequence.
310: For example, the optimal half window size  is 
$M=9$ for contact number prediction (see below) and $M = 6$--$8$ for secondary 
structure prediction. Large window sizes usually lead to over-fitting of 
the training data, but the cross-validation tests show that this is not 
the case for RWCO prediction. This unusually long-range correlation with 
315: amino acid sequence is a conspicuous property of the RWCO. 
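
To put this window size in perspective, the number of free parameters in Eq. \ref{eq:reg} can be counted. The count below is an estimate under the assumption that the sum over residue types runs over the 20 standard amino acids plus the one terminal symbol; the exact number of symbols is not stated explicitly. For $M = 26$,
\[
N_{\mathrm{param}} = (2M+1)\times 21 + 1 = 53 \times 21 + 1 = 1114 ,
\]
whereas a training set of 630 proteins with an average length of about 172 residues provides roughly $630 \times 172 \approx 1.1\times 10^{5}$ residues, that is, on the order of 100 data points per parameter.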
316: 
317: \subsection*{Distribution of correlation}
318: As indicated by the average values of $Cor_p$ and $DevA_p$, 
319: the linear regression method with $M=26$ tends to produce more accurate
320: predictions than with other window sizes. However, the prediction 
321: accuracies for individual proteins do differ significantly as shown in 
Figure \ref{fig:len_cor}. While most of the proteins
are predicted reasonably well, with correlations of 0.5 or higher, 
some proteins exhibit very poor correlations. The poorly predicted proteins 
were found not to be well packed, because of the small size of the protein 
(e.g., SCOP domain d1fs1a1), 
a large fraction of structurally disordered regions (e.g., d1cpo\_1), or 
being a subunit of a large complex (e.g., d1mtyg\_). 
329: 
330: The prediction accuracy does not strikingly differ depending on the structural
331: class of proteins (Table \ref{tab:histo}). However, all-$\alpha$ proteins 
332: show slightly poorer correlations compared to other classes, 
and $\alpha + \beta$ proteins show relatively better correlations. The latter
may be due to the over-representation of the $\alpha + \beta$ proteins in the 
data sets.
336: 
337: In Figure \ref{fig:ex}, three examples of predicted RWCO are shown. 
338: Despite the relatively good correlation between the native and predicted RWCOs,
339: the absolute values of predicted RWCOs at many sites 
340: significantly differ from the corresponding native RWCOs. 
341: This behavior is indicated by the relatively
342: large value of $DevA_p \approx 1.03$ (Figure \ref{fig:window} B). 
In particular, we notice that large RWCO values are consistently  
underestimated. This behavior suggests that some cooperative effects 
should be taken into account for better prediction. 
Given that the present method is based exclusively
on one-body terms (Eq. \ref{eq:reg}), the prediction accuracy achieved is
satisfactory, at least qualitatively. 
349: 
350: \subsection*{Regression parameters as functions of sequence position}
351: Since the present study is the first attempt to develop a prediction method 
352: for RWCO, it is of interest to examine the properties of amino acid
353: residues that affect the RWCO, which are reflected in the values of 
354: the regression coefficients $C_{m,a}$. 
355: Figure \ref{fig:aaprop} shows the values of $C_{m,a}$ for each amino acid 
356: type $a$ as a function of the window position $m$. For all the amino acid 
357: types, the peak of $C_{m,a}$, when present, is at the center ($m=0$). 
358: We can easily recognize that these values, those at $m=0$ in particular, 
359: are related to the hydrophobicity of amino acids. 
360: That is, $C_{0,a}>0$ for hydrophobic residues and $C_{0,a}<0$ for hydrophilic 
361: residues. 
When the amino acid index (AAindex) database\cite{TomiiANDKanehisa1996} 
was scanned for indices that 
correlate highly with $C_{0,a}$, we found various hydrophobicity scales 
whose correlations with $C_{0,a}$ exceed 0.90 (data not shown).  
366: Therefore, we can conclude that the RWCO is primarily determined by the 
367: pattern of hydrophobicity along the sequence.
368: 
Some amino acid types exhibit oscillation with a periodicity of 
3 to 4 residues, which is expected for the $\alpha$ helix. In fact, 
such residues (e.g., GLU, GLN, ALA, etc.) have a high $\alpha$ helix 
propensity. In contrast, the residues with a high $\beta$ strand 
propensity (e.g., ILE, VAL, etc.) do not exhibit such oscillation. 
374: Therefore, in addition to the hydrophobic properties, the parameters for 
375: RWCO also contain information for secondary structures.
376: 
377: \section*{Discussion}
378: \subsection*{Comparison with contact number prediction}
379: As can be seen from their definitions, the native RWCOs and contact numbers 
380: show a high correlation of 0.7 
381: (data not shown). This is also consistent with the finding that RWCOs are 
382: primarily determined by hydrophobicity. Because of the correlation 
383: between RWCO and contact number, it is of interest to ask whether it is possible to ``predict''
384: RWCOs using contact number prediction, and vice versa.
385: The result of this ``cross-prediction'' is listed in Table \ref{tab:cnrwco}.
386: Here, the contact number prediction\cite{KinjoETAL2005} is based 
387: on exactly the same linear regression scheme as the RWCO prediction method.
To put the two prediction methods on an equal footing, we determined 
the regression parameters and the optimal 
half window size for the contact number prediction using the same training 
and test data sets as used here. 
The resulting contact number prediction method yields an average 
prediction accuracy of $Cor_p \approx 0.70$ and 
$DevA_p \approx 0.803$ with the optimal half window size of 9 
(Table \ref{tab:cnrwco}, Case B), a remarkable improvement over our 
previous study 
($Cor_p \approx 0.63$ and $DevA_p \approx 0.941$)\cite{KinjoETAL2005}, 
which is likely due to the use of PSI-BLAST score profiles 
(we used frequency profiles derived from the HSSP database\cite{HSSP}
in the previous study). 
401: When the values obtained from the contact number prediction are compared 
402: to the native RWCOs, the highest correlation is 0.50 with the optimal half 
403: window size 
404: of $M = 4$ (Table \ref{tab:cnrwco}, Case C). Although the correlation of 0.50
405: is statistically significant, 
406: the value is much lower than the one 
407: obtained for the proper prediction of RWCO, $Cor_p \approx 0.59$
408: (Table \ref{tab:cnrwco}, Case A). For the ``prediction'' in the opposite 
409: direction, that is, when the values obtained from the RWCO prediction are 
410: compared to the native contact numbers, the correlation is as high as 0.62
411: with the optimal half window size of $M=4$ (Table \ref{tab:cnrwco}, Case D). 
412: Again, this value, though statistically significant, is lower than the 
413: proper contact number prediction ($Cor_p \approx 0.70$).
414: Interestingly, for the Cases C and D in 
415: Table \ref{tab:cnrwco}, the optimal half window sizes coincide ($M = 4$). 
This suggests that the contact number and RWCO are very closely 
related to each other in terms of the short-range pattern of the 
418: local amino acid sequence. In other words, the distinction between the 
419: contact number and RWCO originates from the interactions of longer range.
420: 
421: To further clarify the correlation between 
422:  RWCO and contact number predictions, we compared the regression 
423: parameters $C_{m,a}$ for RWCO and contact number predictions up to the 
424: half window size of $M=9$ (Figure \ref{fig:parcor}). 
It can be clearly seen that both sets of regression parameters
correlate strongly (correlation of $>0.7$) 
427: with each other within the window positions of 
428: $-4 \leq m \leq 4$ (Figure \ref{fig:parcor}), which confirms 
429: the above observation (Table \ref{tab:cnrwco}, Cases C and D).
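
A comparison of this kind can be sketched as below, assuming that the correlation is taken over amino acid types separately at each window position $m$. The function name is hypothetical, and both coefficient arrays are assumed to have been cut to the same window (here $-9 \leq m \leq 9$) with rows ordered from $m=-M$ to $m=M$.

```python
import numpy as np

def positionwise_correlation(c_rwco, c_cn):
    """Pearson correlation over amino acid types at each window position.

    c_rwco, c_cn: arrays of shape (2M+1, A) holding the coefficients C_{m,a}
    of the two regressions, with rows ordered m = -M, ..., M.
    """
    a = np.asarray(c_rwco, dtype=float)
    b = np.asarray(c_cn, dtype=float)
    a = a - a.mean(axis=1, keepdims=True)  # center each position over types
    b = b - b.mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    return num / den
```

The result is one correlation value per window position, which can then be plotted against $m$.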
430: 
431: \subsection*{Perspective for improving prediction accuracy}
432: The method for predicting RWCOs from amino acid sequence  
433: developed in this paper is a very primitive one.
434: While the correlation of 0.59 between the native and predicted 
435: RWCOs is significant, it is not as high as 0.70 in the case of the 
436: contact number prediction (Table \ref{tab:cnrwco}) 
437: based on the same linear regression scheme.
438: Furthermore, the agreement of absolute RWCO values 
439: is relatively poor, especially so for RWCOs of large values. 
440: As mentioned above, inclusion of many-body
441: effects seems mandatory for better RWCO prediction. 
A popular method for dealing with many-body terms is artificial 
neural networks. Other non-linear regression schemes, such as radial basis
function or support vector regression, may also be 
applicable.
446: Neural network methods as well as a support vector regression method
447: have been successfully applied to real value prediction of solvent
448: accessibility\cite{AhmadETAL2003,AdamczakETAL2004,YuanANDHuang2004}. 
449: Solvent accessibility is closely related to the hydrophobicity of amino 
acids, and hence is likely to be related to the RWCO. Thus, we can expect 
that such non-linear regression approaches may also be useful for predicting RWCO.
However, since the RWCO prediction requires a rather long segment of local 
453: amino acid sequence (half window size of $M=26$), 
454: straightforward application of non-linear regression methods requiring 
455: a great number of parameters may not work.
456: The number of parameters must be somehow reduced.
457: How to extract essential parameters for RWCO prediction is left for 
458: future studies.
459: 
An alternative route to improved accuracy is to properly treat 
461: the large deviation of RWCOs along the amino acid sequence. 
462: For the contact number, its average over a local segment 
463: tends to be close to the average over the whole sequence, 
464: whereas, for the RWCO, such is not the case. 
For example, for the SCOP domain d1a9xb1 (Figure \ref{fig:ex}C), 
the average contact numbers for the whole domain, for residues 1 to 20, 
and for residues 51 to 70 are 25.5, 28.4, and 26.6, respectively, whereas 
the corresponding averages of the RWCOs are 8.0, 14.3, and 4.9.
469: Since the present method is based on the globally normalized RWCO
470: (Eq. \ref{eq:normal}), such large deviations are difficult to 
471: handle. If this limitation is overcome, better prediction accuracy may be 
472: obtained.
473: 
474: \section*{Acknowledgment}
475: The authors thank Satoshi Fukuchi, Yoshiaki Minezaki, and Yasuo Shirakihara
476: for helpful comments.
Most of the computations were carried out at the supercomputing facility of
the National Institute of Genetics, Japan. This work was supported in part by a
479: grant-in-aid from the MEXT, Japan.
480: 
481: The list of the SCOP domain identifiers used in the present study, and
482: the optimal parameter sets are available at the URL
483: http://maccl01.genes.nig.ac.jp/\~{}akinjo/rwco/.
484: 
485: %\bibliographystyle{unsrt}
486: %\bibliography{refs,mypaper}
487: \begin{thebibliography}{10}
488: 
489: \bibitem{Rost2003}
490: B.~Rost.
491: \newblock Prediction in {1D}: secondary structure, membrane helices, and
492:   accessibility.
493: \newblock In P.~E. Bourne and H.~Weissig, editors, {\em Structural
494:   Bioinformatics}, chapter~28, pages 559--587. Wiley-Liss, Inc., Hoboken,
495:   U.S.A., 2003.
496: 
497: \bibitem{PortoETAL2004}
498: M.~Porto, U.~Bastolla, H.~E. Roman, and M.~Vendruscolo.
499: \newblock Reconstruction of protein structures from a vectorial representation.
500: \newblock {\em Phys. Rev. Lett.}, 92:218101, 2004.
501: 
502: \bibitem{VendruscoloETAL1997}
503: M.~Vendruscolo, E.~Kussell, and E.~Domany.
504: \newblock Recovery of protein structure from contact maps.
505: \newblock {\em Fold. Des.}, 2:295--306, 1997.
506: 
507: \bibitem{KinjoANDNishikawa2005}
508: A.~R. Kinjo and K.~Nishikawa.
509: \newblock Recoverable one-dimensional encoding of protein three-dimensional
510:   structures.
511: \newblock {\em (submitted)}, 2005.
512: \newblock http://arXiv.org/abs/q-bio.BM/0501005.
513: 
514: \bibitem{PlaxcoETAL1998}
515: K.~W. Plaxco, K.~T. Simons, and D.~Baker.
516: \newblock Contact order, transition state placement and the refolding rates of
517:   single domain proteins.
518: \newblock {\em J. Mol. Biol.}, 277:985--994, 1998.
519: 
520: \bibitem{KinjoETAL2005}
521: A.~R. Kinjo, K.~Horimoto, and K.~Nishikawa.
522: \newblock Predicting absolute contact numbers of native protein structure from
523:   amino acid sequence.
524: \newblock {\em Proteins}, 58:158--165, 2005.
525: 
526: \bibitem{AltschulETAL1997}
527: S.~F. Altschul, T.~L. Madden, A.~A. Schaffer, J.~Zhang, Z.~Zhang, W.~Miller,
528:   and D.~L. Lipman.
\newblock Gapped {BLAST} and {PSI}-{BLAST}: A new generation of protein database
530:   search programs.
531: \newblock {\em Nucleic Acids Res.}, 25:3389--3402, 1997.
532: 
533: \bibitem{SCOP}
534: A.~G. Murzin, S.~E. Brenner, T.~Hubbard, and C.~Chothia.
535: \newblock {SCOP}: A structural classification of proteins database for the
536:   investigation of sequences and structures.
537: \newblock {\em J. Mol. Biol.}, 247:536--540, 1995.
538: 
539: \bibitem{ASTRAL}
540: J.~M. Chandonia, G.~Hon, N.~S. Walker, L.~{Lo Conte}, P.~Koehl, M.~Levitt, and
541:   S.~E. Brenner.
\newblock The {ASTRAL} compendium in 2004.
543: \newblock {\em Nucleic Acids Res.}, 32:D189--D192, 2004.
544: 
545: \bibitem{SCWRL3}
546: A.~A. Canutescu, A.~A. Shelenkov, and R.~L. Dunbrack.
547: \newblock A graph theory algorithm for protein side-chain prediction.
548: \newblock {\em Protein Sci.}, 12:2001--2014, 2003.
549: 
550: \bibitem{TomiiANDKanehisa1996}
551: K.~Tomii and M.~Kanehisa.
552: \newblock Analysis of amino acid indices and mutation matrices for sequence
553:   comparison and structure prediction of proteins.
554: \newblock {\em Protein Eng.}, 9:27--36, 1996.
555: 
556: \bibitem{HSSP}
557: C.~Sander and R.~Schneider.
558: \newblock Database of homology-derived protein structures.
559: \newblock {\em Proteins}, 9:56--68, 1991.
560: 
561: \bibitem{AhmadETAL2003}
562: S.~Ahmad, M.~M. Gromiha, and A.~Sarai.
563: \newblock Real value prediction of solvent accessibility from amino acid
564:   sequence.
565: \newblock {\em Proteins}, 50:629--635, 2003.
566: 
567: \bibitem{AdamczakETAL2004}
568: R.~Adamczak, A.~Porollo, and J.~Meller.
569: \newblock Accurate prediction of solvent accessibility using neural
570:   networks-based regression.
571: \newblock {\em Proteins}, 56:753--767, 2004.
572: 
573: \bibitem{YuanANDHuang2004}
574: Z.~Yuan and B.~Huang.
575: \newblock Prediction of protein accessible surface areas by support vector
576:   regression.
577: \newblock {\em Proteins}, 57:558--564, 2004.
578: 
579: \end{thebibliography}
580: 
581: 
582: \newpage
\begin{table}
\caption{\label{tab:histo}Distribution of $Cor_p$ for each SCOP class$^a$.}
\begin{center}
  \begin{tabular}{lrrrrr}\hline
Range$^b$ &\multicolumn{5}{c}{SCOP class$^c$}\\
($Cor_p$) & a & b & c & d & e\\\hline
(-1,0.2]  &  4(3)   &   1(0.6) &   7(4)  &   2(0.8) &   0     \\
(0.2,0.4] & 23(14)  &  17(10)  &  14(8)  &  22(9)   &   1(5)  \\
(0.4,0.6] & 61(38)  &  54(33)  &  55(33) &  72(30)  &  11(61) \\
(0.6,0.8] & 73(45)  &  86(52)  &  82(49) & 136(57)  &   6(33) \\
(0.8,1.0] &  1(0.6) &   6(4)   &   8(5)  &   8(3)   &   0     \\
Total     &  162    &  164     &  166    &  240     &  18     \\
\hline
  \end{tabular}
\end{center}
$^a$ Number of test-set proteins (percentage of the class total in parentheses) 
whose $Cor_p$ falls in each range, 
classified according to the SCOP database.\\
$^b$ The range ``$(x,y]$'' denotes $x < Cor_p \leq y$.\\
$^c$ a: all-$\alpha$, b: all-$\beta$, c: $\alpha / \beta$, d: $\alpha + \beta$, 
e: multi-domain.
\end{table}
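
As a reading aid, the percentages in Table~\ref{tab:histo} are the counts
divided by the corresponding column totals. For example, for the
$\alpha + \beta$ class (d), the fraction of test proteins with
$0.4 < Cor_p \leq 0.8$ is
\[
  \frac{72 + 136}{240} = \frac{208}{240} \approx 0.87,
\]
which agrees with the sum of the two tabulated percentages,
$30\% + 57\% = 87\%$.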
~\\
\newpage
\begin{table}
  \caption{\label{tab:cnrwco}Cross-prediction between residue-wise contact orders and contact numbers.}
  \begin{center}
    \begin{tabular}{cccrrr}\hline
Case & Train$^a$ & Test$^b$ & $M$$^c$ & $Cor_p$ & $DevA_p$  \\\hline
A    &  RWCO & RWCO & 26  & 0.59 & 1.03 \\
B    &  CN   & CN   & 9  & 0.70 &  0.803\\
C    &  CN   & RWCO & 4  & 0.50 & N.A.$^d$  \\
D    &  RWCO & CN   & 4  & 0.62 & N.A.$^d$  \\\hline
    \end{tabular}
  \end{center}
$^a$Target quantity to which the regression parameters were fitted. ``RWCO'' and ``CN'' 
indicate that the parameters were trained on residue-wise contact orders 
and contact numbers, respectively.\\
$^b$Target quantity with which the predicted values were compared. ``RWCO'' and ``CN'' indicate 
comparison with the native residue-wise contact orders and the native 
contact numbers, respectively.\\
$^c$Optimal half window size for the prediction.\\
$^d$Not applicable because RWCO and CN take values in different ranges.
\end{table}
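
Footnote $d$ of Table~\ref{tab:cnrwco} can be made concrete with a short
worked argument. As a sketch only, assume that $Cor_p$ is the Pearson
correlation coefficient between native values $y_i$ and predicted values
$\hat{y}_i$ over the residues $i$ of protein $p$; the symbols $y_i$,
$\hat{y}_i$, $a$ and $b$ are introduced here for illustration and are not
notation from the main text. Rescaling the predictions as
$\hat{y}'_i = a \hat{y}_i + b$ with $a > 0$ leaves the correlation unchanged:
\[
  Cor(y, a\hat{y} + b)
  = \frac{a\,\mathrm{cov}(y,\hat{y})}{\sigma_{y}\, a\,\sigma_{\hat{y}}}
  = Cor(y,\hat{y}),
\]
whereas any measure built from the residuals $y_i - \hat{y}'_i$ changes with
$a$ and $b$. This is consistent with reporting $Cor_p$ but not $DevA_p$ in
cases C and D, where values predicted on the scale of one quantity are
compared with native values of the other.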
~\\

\begin{figure}
  \begin{center}
\includegraphics[width=8cm]{window_sc.eps}
  \end{center}
\caption{\label{fig:window}Prediction accuracy as a function of window size.
(A) The correlation coefficient ($Cor_p$) between the native and predicted RWCO, averaged
over the test set proteins. (B) Deviation of the predicted RWCO from the native RWCO ($DevA_p$), averaged over the test set proteins.}
\end{figure}
~\\
\newpage
\begin{figure}
  \includegraphics[width=8cm]{len_cor.eps}
\caption{\label{fig:len_cor}$Cor_p$ plotted against chain length. Each point represents a protein in one of the test sets.}
\end{figure}

\begin{figure}
  \begin{center}
    \includegraphics[width=7cm]{./example4.eps}
  \end{center}
\caption{\label{fig:ex}Examples of predictions. Red: native RWCO; green: predicted RWCO.
(A) SCOP domain d1a6m\_\_ (myoglobin, all-$\alpha$), $Cor_p = 0.73$, 
$DevA_p = 0.75$; 
(B) SCOP domain d1ifra\_ (lamin A/C globular tail domain, all-$\beta$), 
$Cor_p = 0.72$, $DevA_p = 0.87$;
(C) SCOP domain d1a9xb1 (carbamoyl phosphate synthetase, small subunit N-terminal domain, $\alpha / \beta$), $Cor_p = 0.72$, $DevA_p = 0.81$.}
\end{figure}
~\\

\begin{figure}
  \begin{center}
\includegraphics[width=16cm]{a2z.eps}
  \end{center}
\caption{\label{fig:aaprop}Regression parameters $C_{m,a}$ for each amino acid type ($a$) as a function of the window position ($m$).}
\end{figure}
~\\
\newpage
\begin{figure}
  \includegraphics[width=8cm]{par_cor.eps}
\caption{\label{fig:parcor}Correlation between the regression parameters 
$C_{m,a}$ for contact number prediction and those for RWCO prediction 
at each window position.
The horizontal axis is the window position $m$ in the local sequence; 
the vertical axis is the correlation coefficient between the two sets of 
parameters at position $m$.}
\end{figure}
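
The quantity plotted in Figure~\ref{fig:parcor} can be written out
explicitly. As a sketch, assuming the coefficient is the usual Pearson
correlation taken over the amino acid types $a$ (the caption does not state
this), and using the illustrative notation $C^{\mathrm{RWCO}}_{m,a}$,
$C^{\mathrm{CN}}_{m,a}$ and $\rho_m$, which is introduced only here:
\[
  \rho_m =
  \frac{\sum_{a} \bigl(C^{\mathrm{RWCO}}_{m,a} - \bar{C}^{\mathrm{RWCO}}_{m}\bigr)
                 \bigl(C^{\mathrm{CN}}_{m,a} - \bar{C}^{\mathrm{CN}}_{m}\bigr)}
       {\sqrt{\sum_{a} \bigl(C^{\mathrm{RWCO}}_{m,a} - \bar{C}^{\mathrm{RWCO}}_{m}\bigr)^{2}}\,
        \sqrt{\sum_{a} \bigl(C^{\mathrm{CN}}_{m,a} - \bar{C}^{\mathrm{CN}}_{m}\bigr)^{2}}},
\]
where $\bar{C}^{\mathrm{RWCO}}_{m}$ and $\bar{C}^{\mathrm{CN}}_{m}$ denote the
averages of the respective parameters over $a$ at fixed window position $m$.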
\end{document}