
1: \documentclass[12pt]{article}

2: \usepackage{times}

3: \usepackage{graphicx}

4: \usepackage{proteins,citesupernumber}

5:

6: %\renewcommand{\baselinestretch}{1.5}

7: \setlength{\textheight}{20cm}

8: \begin{document}

9: \setlength{\baselineskip}{20pt}

10: \begin{flushleft}

11: {\Large \bf Predicting Residue-wise Contact Orders of Native Protein Structure from  Amino Acid Sequence}

12:

13: \vspace{5mm}

14: Akira R. Kinjo$^{1,2,*}$ and Ken Nishikawa$^{1,2}$

15:

16: \vspace{3mm}

17: $^{1}$Center for Information Biology and DNA Data Bank of Japan,

18: National Institute of Genetics, Mishima, 411-8540, Japan\\

19: $^{2}$Department of Genetics,

20: The Graduate University for Advanced Studies (SOKENDAI),

21: Mishima, 411-8540, Japan

22:

23: \vspace{1cm}

24: $^{*}$Correspondence to Akira R. Kinjo.\\

25: Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Mishima, 411-8540, Japan\\

26: Tel: +81-55-981-6859, Fax: +81-55-981-6889\\

27: E-mail: akinjo@genes.nig.ac.jp

28:

29: \vspace{1cm}

30: Running title: Residue-wise contact order prediction.

31:

32: \vspace{1cm}

33: Key words: protein structure prediction; residue-wise contact order; linear regression; one-dimensional structure.

34: \end{flushleft}

35: \newpage

36: \begin{abstract}

37: Residue-wise contact order (RWCO) is a new kind of one-dimensional protein

38: structure that represents the extent of long-range contacts.

39: We have recently shown that a set of three types of one-dimensional structures

40: (secondary structure, contact number, and RWCO) contains sufficient

41: information for reconstructing the three-dimensional structure of proteins.

42: Currently, there exist prediction methods for secondary structure and contact

43: number from amino acid sequence, but none exists for RWCO. Also, the properties of

44: amino acids that affect RWCO are not clearly understood. Here, we present a

45: linear regression-based method to predict RWCO from amino acid sequence,

46: and analyze the regression parameters to identify the properties that

47: correlate with the RWCO. The present method achieves a significant

48: correlation of 0.59 between the native and predicted RWCOs on average.

49: An unusual feature of the RWCO prediction is the remarkably large optimal

50: half window size of 26 residues.

51: The regression parameters for the central and near-central residues of the

52: local sequence segment highly correlate with those of the contact

53: number prediction, and hence with hydrophobicity.

54: \end{abstract}

55: \emph{Key words:} protein structure prediction, residue-wise contact order,

56: one-dimensional structure, linear regression.

57: \newpage

58: \section*{Introduction}

59: One of the main goals of protein structure prediction is to provide an

60: intuitive picture of the relationship between the amino acid sequence

61: and the native three-dimensional (3D) structure of proteins.

62: To this end, a number of methods have been developed for \textit{ab initio} or

63: \textit{de novo} protein structure prediction. However, such methods are

64: usually very complicated and make it difficult to intuitively understand

65: the relationship between amino acid sequence and 3D structure.

66: In this respect, one-dimensional (1D) structures\cite{Rost2003} of

67: proteins may be convenient intermediate representations of both

68: sequence and structure,

69: because they make it easy to grasp the correspondence between sequence and

70: structural characteristics.

71:

72: Since 1D structures are 3D structural features projected onto strings of

73: residue-wise structural assignments\cite{Rost2003}, a large part of 3D

74: information appears to be lost. That is, the correspondence between amino

75: acid sequence and 1D structures does not seem to be sufficient for

76: uncovering the correspondence between amino acid sequence and 3D structure.

77: However, Porto \textit{et al.}\cite{PortoETAL2004} have recently shown that

78: the contact matrix of a protein structure can be uniquely recovered from

79: its principal eigenvector. Since the protein 3D structure can be recovered

80: from the contact matrix\cite{VendruscoloETAL1997}, the result of

81: Porto \textit{et al.}\cite{PortoETAL2004} indicates that

82: the information contained in the 3D structure can be expressed as a

83: one-dimensional representation.

84: Furthermore, we have recently shown that the 3D structure of proteins can be

85: reconstructed from a set of three types of

86: 1D structures\cite{KinjoANDNishikawa2005}.

87: In other words, the 3D structure of a protein is essentially equivalent

88: to a set of three types of 1D structures.

89: These 1D structures are secondary structure, contact number, and

90: residue-wise contact order.

91: The fact that the 3D structure of a protein can be recovered from

92: a set of these 1D structures

93: opens a new possibility for elucidating the sequence-structure relationship

94: of proteins.

95:

96: The secondary structure of a protein is a string of symbols representing

97: $\alpha$ helix, $\beta$ strand, or coils. The contact number of each residue

98: in a protein is defined by the number of contacts the residue makes with other

99: residues in the protein. More precisely,

100: if we represent the contact map of the protein by $C_{i,j}$ ($C_{i,j} = 1$

101: if the $i$-th and $j$-th residues are in

102: contact, or $C_{i,j} = 0$ otherwise), the contact number $n_{i}$

103: of the $i$-th residue is defined by $n_{i} = \sum_{j}C_{i,j}$.

104: Similarly, the residue-wise contact order (RWCO) $o_{i}$

105: of the $i$-th residue of a protein is defined

106: by $o_{i} = \sum_{j}|i-j|C_{i,j}$,

107: that is, a sum of sequence separations between the residue and the

108: contacting residues\cite{KinjoANDNishikawa2005}.

109: The contact order was first introduced as a per-protein quantity by

110: Plaxco \textit{et al.}\cite{PlaxcoETAL1998} to study the correlation between protein

111: topology and folding rate. The RWCO introduced here is a generalization of

112:  the contact order, and is a per-residue quantity.
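
The link between the two quantities can be made explicit. As a small illustration based only on the definitions above (the symbol $\bar{s}_{i}$ is introduced here for convenience and is not used elsewhere in the paper), for any residue with $n_{i} > 0$ one may write
\[
o_{i} = n_{i}\,\bar{s}_{i}, \qquad
\bar{s}_{i} = \frac{\sum_{j}|i-j|C_{i,j}}{\sum_{j}C_{i,j}},
\]
where $\bar{s}_{i}$ is the contact-weighted mean sequence separation between residue $i$ and its contact partners. Two residues with the same contact number can therefore have very different RWCOs, depending on whether their contacts are made with residues that are near or far along the sequence.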

113:

114: At least in principle, if we can predict those 1D structures,

115: we can also construct the corresponding 3D structures. Many accurate methods

116: have been developed for secondary structure prediction\cite{Rost2003}.

117: We have developed a method to predict the contact number from amino acid

118: sequence\cite{KinjoETAL2005} with the average correlation of 0.63

119: between the native and predicted contact numbers. However, there is

120: no method for predicting RWCO from amino acid sequence to date,

121: and it is not clear if the prediction is possible at all.

122: The primary objective of the present paper is to develop a method

123: to predict RWCO from amino acid sequence.

124:

125: While the accurate prediction of structural properties is important for

126: its own sake, for a thorough understanding of the

127: sequence-structure relationship, we still need to identify the properties

128: of amino acid sequence that determine the structure.

129: From the vast number of studies on secondary structure prediction in the past,

130: we are now convinced

131: that each amino acid has a particular propensity for a particular secondary

132: structure, although the final secondary structures in the native

133: structure are determined in the global context. Also, contact number is

134:  closely related to the hydrophobicity of amino acids. Thus, both

135: secondary structure and contact number have clear connections with

136: the properties of amino acids. As for the residue-wise contact order,

137:  its geometrical meaning is clear (i.e., a quantity related to the extent of

138: long-range contacts), but the corresponding properties of amino acids are not.

139: As the second objective of the present study, we attempt to identify the

140: amino acid properties affecting RWCO by examining the parameters derived

141: for the prediction method.

142:

143: The prediction method developed in this paper is based on a simple linear

144: regression scheme which was also applied to the contact number

145: prediction in our previous study\cite{KinjoETAL2005}.

146: By examining the regression parameters,

147: we show that the RWCO is primarily determined by the pattern of hydrophobicity

148: of amino acids.

149: Although the method is extremely simple, it yields a significant

150: correlation of 0.59 between the native and predicted RWCOs.

151: While further refinement is definitely necessary to apply the method

152: for 3D structure prediction, the present method will serve as a basis for

153: more elaborate methods yet to be developed.

154:

155: \section*{Materials and Methods}

156: \subsection*{Definition of residue-wise contact order}

157: As mentioned in the Introduction, the residue-wise contact order (RWCO) of

158: the $i$-th residue is defined by

159: \begin{equation}

160: o_{i} = \frac{1}{L}\sum_{j:|j-i|>2}|i-j|C_{i,j}\label{eq:def}

161: \end{equation}

162: where the summation is normalized by the length $L$ of the amino acid

163: sequence of the protein and $C_{i,j}$ represents the contact map of the protein.

164: We exclude trivial contacts between nearest and next-nearest neighbor residues

165: along the sequence.

166: To make the RWCO useful for molecular dynamics simulations, the contact

167: between two residues is defined by a smooth sigmoid function:

168: \begin{equation}

169: C_{i,j} = 1/\{1+\exp[w(r_{i,j} - d_c)]\}

170: \end{equation}

171: where $r_{i,j}$ is the distance between $C_{\beta}$ atoms of the $i$-th

172: and $j$-th

173: residues ($C_{\alpha}$ atoms for glycine), $d_c$ is the cut-off distance for

174: the contact definition, and $w$ is

175: a parameter that determines the sharpness of the sigmoid function.

176: To be consistent with our previous

177: studies\cite{KinjoETAL2005,KinjoANDNishikawa2005},

178: we set $d_c = 12$\AA{} and $w=3$ throughout the present paper.
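
With these values, the contact function switches from nearly 1 to nearly 0 within about 1~\AA{} on either side of $d_c$. As a quick numerical check (simple arithmetic on the definition above; the three distances are chosen only for illustration, and $w$ is assumed to be expressed in \AA$^{-1}$ so that the exponent is dimensionless):
\begin{eqnarray*}
r_{i,j} = 11\mbox{ \AA} &:& C_{i,j} = 1/(1+e^{-3}) \approx 0.953,\\
r_{i,j} = 12\mbox{ \AA} &:& C_{i,j} = 1/(1+e^{0}) = 0.5,\\
r_{i,j} = 13\mbox{ \AA} &:& C_{i,j} = 1/(1+e^{3}) \approx 0.047.
\end{eqnarray*}
Residue pairs well inside the cut-off thus contribute almost their full sequence separation $|i-j|$ to Eq. (\ref{eq:def}), whereas pairs well outside contribute almost nothing.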

179:

180: We also define the normalized (relative) RWCO by

181: \begin{equation}

182: {y}_{i}^{p} = ({o}_{i}^{p} - \langle {o}_{i}^{p} \rangle)/

183: \sqrt{\langle({o}_{i}^{p} - \langle {o}_{i}^{p} \rangle)^2\rangle}

184: \label{eq:normal}

185: \end{equation}

186: where $\langle \cdot \rangle$ denotes the averaging operation over the given

187: protein chain $p$.

188:

189: \subsection*{Prediction scheme}

190: To predict the RWCO of each residue in a protein, we first conduct three

191: iterations of PSI-BLAST\cite{AltschulETAL1997} search against the

192: NCBI non-redundant amino acid sequence database to obtain the sequence profile

193: of the protein with the E-value cut-off of $10^{-7}$.

194: We use the amino acid score table of the

195: PSI-BLAST  profile which is represented as $f(i,a)$

196: ($i$: site, $a$: amino acid) in the following (instead of the frequency table

197: used in the previous study\cite{KinjoETAL2005}).

198:

199: The RWCO $\hat{o}_{i}^{p}$ of the $i$-th residue in the protein $p$

200: is predicted in two steps. First we predict the normalized RWCO $y_{i}^p$ for

201: each residue, and then we combine it with the mean $\mu^p$ and standard

202: deviation (S.D.) $\sigma^p$ of the RWCOs of the protein,

203: which are predicted separately. The normalized RWCO is predicted by

204: the following linear regression scheme:

205: \begin{equation}

206: \hat{y}_{i}^{p} = \sum_{m=-M}^{M}\sum_{a}^{\mbox{\scriptsize residue types}}C_{m,a}f^{p}(i+m,a) + C\label{eq:reg}

207: \end{equation}

208: where $M$ is the half window size (a free parameter to be determined),

209: $f^{p} (i+m,a)$ represents an element of the PSI-BLAST profile of the

210: protein $p$, and $C_{m,a}$ and $C$ are regression parameters.

211: Both amino and carboxyl termini are treated by introducing an extra symbol

212: for the ``terminal residue.''

213: Thus, the RWCO of the $i$-th residue is expressed as a linear function of

214: the local sequence of $2M+1$ residues surrounding the $i$-th residue.
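
To give a sense of the size of this model, the number of regression parameters can be counted directly from Eq. (\ref{eq:reg}). The count below assumes that the residue types comprise the 20 standard amino acids plus the single terminal symbol, i.e., 21 types in total; this number is an assumption made here for illustration, and $N_{\mbox{\scriptsize par}}$ is a symbol introduced only for this purpose:
\[
N_{\mbox{\scriptsize par}}(M) = 21\,(2M+1) + 1 .
\]
Under this assumption, $N_{\mbox{\scriptsize par}}(0) = 22$, $N_{\mbox{\scriptsize par}}(9) = 21 \times 19 + 1 = 400$, and $N_{\mbox{\scriptsize par}}(26) = 21 \times 53 + 1 = 1114$. The number of parameters grows only linearly with the half window size $M$.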

215:

216: The values of $C_{m,a}$ and $C$ are determined so as to minimize the prediction

217: error over a database of protein structures. The error function is defined by

218: \begin{equation}

219: E = \sum_{p}\sum_{i}(y_{i}^{p} - \hat{y}_{i}^{p})^{2}

220: \end{equation}

221: where $y_{i}^{p}$ is the observed normalized RWCO of the $i$-th residue of the

222: protein $p$.

223: The minimization of $E$ can be achieved by the usual least squares method.
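
For reference, the least squares solution can be written in closed form. The following is the standard textbook derivation, given only as a sketch; the vectors $\mathbf{x}_{i}^{p}$ and $\mathbf{c}$ are notation introduced here, and the numerical procedure actually used to solve the equations is not specified in this section. Let $\mathbf{x}_{i}^{p}$ be the column vector that collects all $f^{p}(i+m,a)$ in the window together with a final element equal to 1, and let $\mathbf{c}$ collect the corresponding $C_{m,a}$ and $C$, so that $\hat{y}_{i}^{p} = \mathbf{c}^{\top}\mathbf{x}_{i}^{p}$. Setting the gradient of $E$ with respect to $\mathbf{c}$ to zero gives the normal equations
\[
\left(\sum_{p}\sum_{i}\mathbf{x}_{i}^{p}\,{\mathbf{x}_{i}^{p}}^{\top}\right)\mathbf{c}
= \sum_{p}\sum_{i}y_{i}^{p}\,\mathbf{x}_{i}^{p},
\]
which determine $\mathbf{c}$ uniquely whenever the matrix on the left-hand side is non-singular.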

224:

225: The mean ($\mu^p$) and standard deviation ($\sigma^p$) of

226: the RWCOs of a protein are predicted from the amino acid

227: composition ($f_a^p$) and sequence length ($L^p$) of the protein $p$ in

228: the same manner as we have done for the contact number

229: prediction\cite{KinjoETAL2005}.

230: That is, the mean and S.D. are predicted by the following linear regression

231: scheme:

232: \begin{eqnarray}

233:   \hat{\mu}^{p} & = &\sum_{a}A_{a}f_{a}^{p} + A_{l}F(L^{p}) + A\\

234:   \hat{\sigma}^{p} & = &\sum_{a}D_{a}f_{a}^{p} + D_{l}F(L^{p}) + D

235: \end{eqnarray}

236: where $F(L^p) = L^p$ for $L^p < 300$ and $F(L^p) = 300$ for $L^p \geq 300$,

237: and $A_{a}, A_{l}, A, D_{a}, D_{l}, D$ are regression parameters.

238: The final value for the predicted absolute RWCO ($\hat{o}_{i}^{p}$) is given by

239: \begin{equation}

240:   \hat{o}_{i}^{p} = \hat{\mu}^{p} + \hat{\sigma}^{p}\hat{y}_{i}^{p}.

241: \label{eq:pred}

242: \end{equation}

243:

244: \subsection*{Data set}

245: We first selected representative proteins from each superfamily of

246: all-$\alpha$, all-$\beta$, $\alpha/ \beta$, $\alpha + \beta$,

247: and multi-domain classes of the SCOP\cite{SCOP} (version 1.65) protein

248: structure classification database through the ASTRAL\cite{ASTRAL}

249: database. Those structures which were present in this superfamily

250: representative set but were absent from the 40\% representative set of

251: ASTRAL, those containing chain breaks (except for termini), or those

252: with the average contact number of less than 7.5 (non-compact structures)

253: were discarded.

254: Non-standard amino acid residues were converted to the corresponding standard

255: residues when possible, otherwise discarded.

256: When $C_\beta$ atoms were absent in non-glycine residues, they were modeled

257: by the SCWRL\cite{SCWRL3} side-chain prediction program.

258: In the end, 680 protein chains remained. The list of this data set

259: will be available from the author's website.

260:

261: For training the parameters and testing the prediction accuracy, we performed

262: a 15-fold cross-validation test. The 680 proteins were randomly

263: divided into two groups, one consisting of 630 proteins for training

264: the parameters (training set), and the other (test set) consisting of

265: 50 proteins for testing the prediction using the parameters obtained from

266: the training set. The procedure was iterated 15 times.
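
As a consistency check on this procedure (simple arithmetic, assuming that each of the 15 iterations draws a fresh random split as described), the total number of test-set predictions is
\[
15 \times 50 = 750 = 162 + 164 + 166 + 240 + 18 ,
\]
where the right-hand side is the sum of the class totals listed in Table \ref{tab:histo}. Since $750 > 680$, the test sets of different iterations cannot all be disjoint, so some proteins are counted in more than one test set.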

267:

268: \subsection*{Measures of prediction accuracy}

269: We employ two measures for evaluating the prediction accuracy.

270: The first one is the correlation coefficient ($Cor_p$) between the observed

271: and predicted RWCOs for a given protein $p$, which is defined by

272: \begin{equation}

273:   Cor_{p} = \frac{\langle (o_{i}^{p} - \langle o_{i}^{p} \rangle)(\hat{o}_{i}^{p} - \langle \hat{o}_{i}^{p}\rangle)\rangle}{

274: \sqrt{\langle (o_{i}^{p} - \langle o_{i}^{p} \rangle)^{2}\rangle}

275: \sqrt{\langle (\hat{o}_{i}^{p} - \langle \hat{o}_{i}^{p} \rangle)^{2}\rangle}}.

276: \label{eq:cor}

277: \end{equation}

278: The $Cor_p$ measures the consistency of the normalized RWCOs.

279: In order to measure the accuracy of the predicted absolute values, we

280: use the RMS error divided by the standard deviation of the observed

281: RWCO ($DevA_p$):

282: \begin{equation}

283:     DevA_{p} = \frac{\sqrt{\langle (o_{i}^{p} - \hat{o}_{i}^{p})^{2}\rangle}}

284: {\sqrt{\langle (o_{i}^{p} - \langle o_{i}^{p} \rangle)^2\rangle}}.

285: \label{eq:deva}

286: \end{equation}
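
To help interpret the magnitude of $DevA_p$, it can be decomposed as follows. This is a sketch derived from the definitions above; the symbols $b$, $k$, $m$, and $s$ are introduced here for illustration only. Write $\mu$ and $\sigma$ for the observed mean and S.D. of $o_{i}^{p}$, and $m$ and $s$ for the mean and S.D. of $\hat{y}_{i}^{p}$ over the chain, and assume $\sigma > 0$ and $\hat{\sigma}^{p} s > 0$. Substituting Eq. (\ref{eq:pred}) into Eq. (\ref{eq:deva}) gives
\[
DevA_{p}^{2} = b^{2} + 1 + k^{2} - 2k\,Cor_{p}, \qquad
b = \frac{\mu - \hat{\mu}^{p} - \hat{\sigma}^{p} m}{\sigma}, \qquad
k = \frac{\hat{\sigma}^{p} s}{\sigma}.
\]
Two limiting cases are instructive. A constant prediction equal to the observed mean (no offset, $b=0$, and no spread, so that the terms in $k$ vanish) yields $DevA_p = 1$. For a given $Cor_p$, the smallest attainable value is obtained with $b = 0$ and $k = Cor_p$, namely $DevA_p = \sqrt{1 - Cor_p^{2}}$; for a single protein with $Cor_p = 0.59$ this is $\sqrt{1 - 0.3481} \approx 0.81$. Values of $DevA_p$ above this bound therefore reflect errors in the predicted offset or scale rather than in the shape of the profile.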

287:

288: \section*{Results}

289: \subsection*{Optimal window size}

290: In the prediction scheme presented in this paper, the half window size $M$

291: is a free parameter. We determine its value so that the prediction accuracy

292: is maximized. We have performed a 15-fold cross-validation test with $M$

293: ranging from 0 to 40. The result is summarized in Figure~\ref{fig:window}.

294: The correlation coefficient $Cor_p$ (averaged over the test sets)

295: ranges from 0.48 at $M=0$ to $\approx$ 0.59 at $M=26$

296: (Figure \ref{fig:window} A). It should be noted that the correlation of 0.48

297: is already statistically significant given the

298: average sequence length (172 residues) of the proteins in the data set.

299: The value of  $Cor_p$ monotonically increases from $M=0$ to $M=26$, but

300: starts to saturate for $M > 20$ and decreases slowly for $M>26$.

301: The deviation $DevA_p$

302: (averaged over the test sets) shows a consistent trend with $Cor_p$

303: (Figure \ref{fig:window} B), and it reaches the minimum value of $\approx$

304: 1.03 at $M=26$.

305: Thus, the optimal window size has been determined to be $M=26$.
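
The statistical significance of the correlation at $M=0$ noted above can be illustrated with the usual $t$ statistic for a correlation coefficient. This is only a rough sketch: it treats the residues of a chain of average length ($N = 172$) as independent samples, which is an assumption made here and is not strictly valid because neighboring residues are correlated. With $r = 0.48$,
\[
t = \frac{r\sqrt{N-2}}{\sqrt{1-r^{2}}}
  = \frac{0.48\sqrt{170}}{\sqrt{1-0.48^{2}}} \approx 7.1 ,
\]
which is far above the conventional threshold of $t \approx 2$ for the 5\% level. The effective number of independent residues is smaller than $N$, so the true significance is somewhat weaker than this estimate suggests.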

306:

307: This optimal window size of $M=26$ is much larger than the ones for any

308: other 1D structure predictions. As far as we are aware, this is the longest

309: range of correlation observed between 1D structure and amino acid sequence.

310: For example, the optimal half window size  is

311: $M=9$ for contact number prediction (see below) and $M = 6$--$8$ for secondary

312: structure prediction. Large window sizes usually result in over-fitting

313: the training data, but such is not the case for RWCO prediction, as we have

314: performed cross-validation tests. This unusually long-range correlation with

315: amino acid sequence is a conspicuous property of the RWCO.

316:

317: \subsection*{Distribution of correlation}

318: As indicated by the average values of $Cor_p$ and $DevA_p$,

319: the linear regression method with $M=26$ tends to produce more accurate

320: predictions than with other window sizes. However, the prediction

321: accuracies for individual proteins do differ significantly as shown in

322: Figure \ref{fig:len_cor}. While most of the proteins

323: are decently predicted with correlations of 0.5 or higher,

324: some proteins exhibit very poor correlations. The poorly predicted proteins

325: were found not to be well packed, owing to the small size of the protein (e.g.,

326: SCOP domain d1fs1a1),

327:  a large fraction of structurally disordered regions (e.g., d1cpo\_1), or

328: being a subunit of a large complex (e.g., d1mtyg\_).

329:

330: The prediction accuracy does not strikingly differ depending on the structural

331: class of proteins (Table \ref{tab:histo}). However, all-$\alpha$ proteins

332: show slightly poorer correlations compared to other classes,

333: and $\alpha + \beta$ proteins show relatively better correlations. The latter

334: may be due to the over-dominance of the $\alpha + \beta$ proteins in the

335: data sets.

336:

337: In Figure \ref{fig:ex}, three examples of predicted RWCO are shown.

338: Despite the relatively good correlation between the native and predicted RWCOs,

339: the absolute values of predicted RWCOs at many sites

340: significantly differ from the corresponding native RWCOs.

341: This behavior is indicated by the relatively

342: large value of $DevA_p \approx 1.03$ (Figure \ref{fig:window} B).

343: In particular, we notice that  RWCOs of large values are consistently

344: underestimated. This behavior suggests that some cooperative effects

345: should be taken into account for better prediction.

346: Given that the present method is based exclusively

347: on one-body terms (Eq. \ref{eq:reg}), the prediction accuracy achieved is

348: satisfactory, at least qualitatively.

349:

350: \subsection*{Regression parameters as functions of sequence position}

351: Since the present study is the first attempt to develop a prediction method

352: for RWCO, it is of interest to examine the properties of amino acid

353: residues that affect the RWCO, which are reflected in the values of

354: the regression coefficients $C_{m,a}$.

355: Figure \ref{fig:aaprop} shows the values of $C_{m,a}$ for each amino acid

356: type $a$ as a function of the window position $m$. For all the amino acid

357: types, the peak of $C_{m,a}$, when present, is at the center ($m=0$).

358: We can easily recognize that these values, those at $m=0$ in particular,

359: are related to the hydrophobicity of amino acids.

360: That is, $C_{0,a}>0$ for hydrophobic residues and $C_{0,a}<0$ for hydrophilic

361: residues.

362: When the amino acid index (AAindex) database\cite{TomiiANDKanehisa1996}

363: was scanned for indices that

364: correlate highly with $C_{0,a}$, we found various hydrophobicity scales

365: with correlations with $C_{0,a}$ over 0.90 (data not shown).

366: Therefore, we can conclude that the RWCO is primarily determined by the

367: pattern of hydrophobicity along the sequence.

368:

369: Some amino acid types exhibit oscillation with the periodicity of

370: 3 to 4 residues, which is expected for the $\alpha$ helix. In fact,

371: such residues (e.g., GLU, GLN, ALA, etc.) are of high $\alpha$ helix

372: propensity. On the contrary, the residues of high $\beta$ strand

373: propensity (e.g., ILE, VAL, etc.) do not exhibit such oscillation.

374: Therefore, in addition to the hydrophobic properties, the parameters for

375: RWCO also contain information for secondary structures.

376:

377: \section*{Discussion}

378: \subsection*{Comparison with contact number prediction}

379: As can be expected from their definitions, the native RWCOs and contact numbers

380: show a high correlation of 0.7

381: (data not shown). This is also consistent with the finding that RWCOs are

382: primarily determined by hydrophobicity. Because of the correlation

383: between RWCO and contact number, it is of interest to ask whether it is possible to ``predict''

384: RWCOs using contact number prediction, and vice versa.

385: The result of this ``cross-prediction'' is listed in Table \ref{tab:cnrwco}.

386: Here, the contact number prediction\cite{KinjoETAL2005} is based

387: on exactly the same linear regression scheme as the RWCO prediction method.

388: To make the two prediction methods directly comparable,

389: we have determined the regression parameters and the optimal

390: half window size for the contact number prediction using the same training

391: and test data sets as used here.

392: The resulting contact number prediction method yields the average

393: prediction accuracy of $Cor_p \approx 0.70$ and

394: $DevA_p \approx 0.803$ with the optimal half window size of 9

395: (Table \ref{tab:cnrwco}, Case B), a remarkable improvement over our

396: previous study

397: ($Cor_p \approx 0.63$ and $DevA_p \approx 0.941$)\cite{KinjoETAL2005}

398: which is likely to be due to the use of PSI-BLAST score profiles

399: (we used frequency profiles derived from the HSSP database\cite{HSSP}

400: in the previous study).

401: When the values obtained from the contact number prediction are compared

402: to the native RWCOs, the highest correlation is 0.50 with the optimal half

403: window size

404: of $M = 4$ (Table \ref{tab:cnrwco}, Case C). Although the correlation of 0.50

405: is statistically significant,

406: the value is much lower than the one

407: obtained for the proper prediction of RWCO, $Cor_p \approx 0.59$

408: (Table \ref{tab:cnrwco}, Case A). For the ``prediction'' in the opposite

409: direction, that is, when the values obtained from the RWCO prediction are

410: compared to the native contact numbers, the correlation is as high as 0.62

411: with the optimal half window size of $M=4$ (Table \ref{tab:cnrwco}, Case D).

412: Again, this value, though statistically significant, is lower than the

413: proper contact number prediction ($Cor_p \approx 0.70$).

414: Interestingly, for the Cases C and D in

415: Table \ref{tab:cnrwco}, the optimal half window sizes coincide ($M = 4$).

416: Therefore, it is expected that the contact number and RWCO are very closely

417: related to each other in terms of the short-range pattern of the

418: local amino acid sequence. In other words, the distinction between the

419: contact number and RWCO originates from the interactions of longer range.

420:

421: To further clarify the correlation between

422:  RWCO and contact number predictions, we compared the regression

423: parameters $C_{m,a}$ for RWCO and contact number predictions up to the

424: half window size of $M=9$ (Figure \ref{fig:parcor}).

425: It can be clearly seen that both sets of regression parameters

426: correlate strongly (correlation of $>0.7$)

427: with each other within the window positions of

428: $-4 \leq m \leq 4$ (Figure \ref{fig:parcor}), which confirms

429: the above observation (Table \ref{tab:cnrwco}, Cases C and D).

430:

431: \subsection*{Perspective for improving prediction accuracy}

432: The method for predicting RWCOs from amino acid sequence

433: developed in this paper is a very primitive one.

434: While the correlation of 0.59 between the native and predicted

435: RWCOs is significant, it is not as high as 0.70 in the case of the

436: contact number prediction (Table \ref{tab:cnrwco})

437: based on the same linear regression scheme.

438: Furthermore, the agreement of absolute RWCO values

439: is relatively poor, especially so for RWCOs of large values.

440: As mentioned above, inclusion of many-body

441: effects seems mandatory for better RWCO prediction.

442: A popular method for dealing with many-body terms is artificial

443: neural networks. Other non-linear regression schemes such as radial basis

444: or support vector regressions may also be

445: applicable.

446: Neural network methods as well as a support vector regression method

447: have been successfully applied to real value prediction of solvent

448: accessibility\cite{AhmadETAL2003,AdamczakETAL2004,YuanANDHuang2004}.

449: Solvent accessibility is closely related to the hydrophobicity of amino

450: acids, and hence is likely to be related to the RWCO. Thus, we can expect that

451: such non-linear regression approaches may also be useful for predicting RWCO.

452: However, since the RWCO prediction requires a rather long segment of local

453: amino acid sequence (half window size of $M=26$),

454: straightforward application of non-linear regression methods requiring

455: a great number of parameters may not work.

456: The number of parameters must be somehow reduced.

457: How to extract essential parameters for RWCO prediction is left for

458: future studies.

459:

460: An alternative route to the improved accuracy is to properly treat

461: the large deviation of RWCOs along the amino acid sequence.

462: For the contact number, its average over a local segment

463: tends to be close to the average over the whole sequence,

464: whereas, for the RWCO, such is not the case.

465: For example, for the SCOP domain d1a9xb1 (Figure \ref{fig:ex}C),

466: the average contact number for the whole domain, for residues 1 to 20,

467: and for residues 51 to 70 are, respectively, 25.5, 28.4, and 26.6, whereas

468: the corresponding averages of the RWCOs are 8.0, 14.3, and 4.9, respectively.

469: Since the present method is based on the globally normalized RWCO

470: (Eq. \ref{eq:normal}), such large deviations are difficult to

471: handle. If this limitation is overcome, better prediction accuracy may be

472: obtained.
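
The contrast in the d1a9xb1 example can be expressed as ratios of the local averages to the whole-domain average (simple arithmetic on the numbers quoted above):
\begin{eqnarray*}
\mbox{contact number} &:& 28.4/25.5 \approx 1.11, \qquad 26.6/25.5 \approx 1.04,\\
\mbox{RWCO} &:& 14.3/8.0 \approx 1.79, \qquad 4.9/8.0 \approx 0.61 .
\end{eqnarray*}
The local contact number averages stay within about 11\% of the global value, whereas the local RWCO averages differ from it by roughly 40\% to 80\%, which a single per-protein mean and S.D. cannot capture.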

473:

474: \section*{Acknowledgment}

475: The authors thank Satoshi Fukuchi, Yoshiaki Minezaki, and Yasuo Shirakihara

476: for helpful comments.

477: Most of the computations were carried out at the supercomputing facility of

478: National Institute of Genetics, Japan. This work was supported in part by a

479: grant-in-aid from the MEXT, Japan.

480:

481: The list of the SCOP domain identifiers used in the present study, and

482: the optimal parameter sets are available at the URL

483: http://maccl01.genes.nig.ac.jp/\~{}akinjo/rwco/.

484:

485: %\bibliographystyle{unsrt}

486: %\bibliography{refs,mypaper}

487: \begin{thebibliography}{10}

488:

489: \bibitem{Rost2003}

490: B.~Rost.

491: \newblock Prediction in {1D}: secondary structure, membrane helices, and

492:   accessibility.

493: \newblock In P.~E. Bourne and H.~Weissig, editors, {\em Structural

494:   Bioinformatics}, chapter~28, pages 559--587. Wiley-Liss, Inc., Hoboken,

495:   U.S.A., 2003.

496:

497: \bibitem{PortoETAL2004}

498: M.~Porto, U.~Bastolla, H.~E. Roman, and M.~Vendruscolo.

499: \newblock Reconstruction of protein structures from a vectorial representation.

500: \newblock {\em Phys. Rev. Lett.}, 92:218101, 2004.

501:

502: \bibitem{VendruscoloETAL1997}

503: M.~Vendruscolo, E.~Kussell, and E.~Domany.

504: \newblock Recovery of protein structure from contact maps.

505: \newblock {\em Fold. Des.}, 2:295--306, 1997.

506:

507: \bibitem{KinjoANDNishikawa2005}

508: A.~R. Kinjo and K.~Nishikawa.

509: \newblock Recoverable one-dimensional encoding of protein three-dimensional

510:   structures.

511: \newblock {\em (submitted)}, 2005.

512: \newblock http://arXiv.org/abs/q-bio.BM/0501005.

513:

514: \bibitem{PlaxcoETAL1998}

515: K.~W. Plaxco, K.~T. Simons, and D.~Baker.

516: \newblock Contact order, transition state placement and the refolding rates of

517:   single domain proteins.

518: \newblock {\em J. Mol. Biol.}, 277:985--994, 1998.

519:

520: \bibitem{KinjoETAL2005}

521: A.~R. Kinjo, K.~Horimoto, and K.~Nishikawa.

522: \newblock Predicting absolute contact numbers of native protein structure from

523:   amino acid sequence.

524: \newblock {\em Proteins}, 58:158--165, 2005.

525:

526: \bibitem{AltschulETAL1997}

527: S.~F. Altschul, T.~L. Madden, A.~A. Schaffer, J.~Zhang, Z.~Zhang, W.~Miller,

528:   and D.~L. Lipman.

529: \newblock Gapped {BLAST} and {PSI}-{BLAST}: A new generation of protein database

530:   search programs.

531: \newblock {\em Nucleic Acids Res.}, 25:3389--3402, 1997.

532:

533: \bibitem{SCOP}

534: A.~G. Murzin, S.~E. Brenner, T.~Hubbard, and C.~Chothia.

535: \newblock {SCOP}: A structural classification of proteins database for the

536:   investigation of sequences and structures.

537: \newblock {\em J. Mol. Biol.}, 247:536--540, 1995.

538:

539: \bibitem{ASTRAL}

540: J.~M. Chandonia, G.~Hon, N.~S. Walker, L.~{Lo Conte}, P.~Koehl, M.~Levitt, and

541:   S.~E. Brenner.

542: \newblock The {ASTRAL} compendium in 2004.

543: \newblock {\em Nucleic Acids Res.}, 32:D189--D192, 2004.

544:

545: \bibitem{SCWRL3}

546: A.~A. Canutescu, A.~A. Shelenkov, and R.~L. Dunbrack.

547: \newblock A graph theory algorithm for protein side-chain prediction.

548: \newblock {\em Protein Sci.}, 12:2001--2014, 2003.

549:

550: \bibitem{TomiiANDKanehisa1996}

551: K.~Tomii and M.~Kanehisa.

552: \newblock Analysis of amino acid indices and mutation matrices for sequence

553:   comparison and structure prediction of proteins.

554: \newblock {\em Protein Eng.}, 9:27--36, 1996.

555:

556: \bibitem{HSSP}

557: C.~Sander and R.~Schneider.

558: \newblock Database of homology-derived protein structures.

559: \newblock {\em Proteins}, 9:56--68, 1991.

560:

561: \bibitem{AhmadETAL2003}

562: S.~Ahmad, M.~M. Gromiha, and A.~Sarai.

563: \newblock Real value prediction of solvent accessibility from amino acid

564:   sequence.

565: \newblock {\em Proteins}, 50:629--635, 2003.

566:

567: \bibitem{AdamczakETAL2004}

568: R.~Adamczak, A.~Porollo, and J.~Meller.

569: \newblock Accurate prediction of solvent accessibility using neural

570:   networks-based regression.

571: \newblock {\em Proteins}, 56:753--767, 2004.

572:

573: \bibitem{YuanANDHuang2004}

574: Z.~Yuan and B.~Huang.

575: \newblock Prediction of protein accessible surface areas by support vector

576:   regression.

577: \newblock {\em Proteins}, 57:558--564, 2004.

578:

579: \end{thebibliography}

580:

581:

582: \newpage

583: \begin{table}

584: \caption{\label{tab:histo}Distribution of $Cor_p$ for each SCOP class$^a$.}

585: \begin{center}

586:   \begin{tabular}{lrrrrr}\hline

587: range$^b$ &\multicolumn{5}{c}{SCOP class$^c$}\\

588: ($Cor_p$) & a & b & c & d & e\\\hline

589: (-1,0.2]  &  4(3)  &    1(0.6)&    7(4) &    2(0.8) &    0 \\

590: (0.2,0.4] & 23(14) &   17(10) &   14(8) &   22(9) &    1(5) \\

591: (0.4,0.6] & 61(38) &   54(33) &   55(33) &   72(30) &   11(61) \\

592: (0.6,0.8] & 73(45) &   86(52) &   82(49) &  136(57) &    6(33) \\

593: (0.8,1.0] &  1(0.6)&    6(4)  &    8(5) &    8(3) &    0 \\

594: total     &  162 &  164 &  166 &  240 &   18\\

595: \hline

596:   \end{tabular}

597: \end{center}

598: $^a$ The number of test-set proteins (percentage in parentheses)

599: whose $Cor_p$ falls within each range,

600: classified according to the SCOP database.\\

601: $^b$ The range ``$(x,y]$'' denotes $x < Cor_p \leq y$.\\

602: $^c$ a: all-$\alpha$, b: all-$\beta$, c: $\alpha / \beta$, d: $\alpha + \beta$,

603: e: multi-domain.

604: \end{table}

605: ~\\

606: \newpage

607: \begin{table}

608:   \caption{\label{tab:cnrwco}Cross-prediction between residue-wise contact orders and contact numbers.}

609:   \begin{center}

610:     \begin{tabular}{cccrrr}\hline

611: Case & Train$^a$ & Test$^b$ & $M^c$ & $Cor_p$ & $DevA_p$  \\\hline

612: A    &  RWCO & RWCO & 26  & 0.59 & 1.03 \\

613: B    &  CN   & CN   & 9  & 0.70 &  0.803\\

614: C    &  CN   & RWCO & 4  & 0.50 & N.A.$^d$  \\

615: D    &  RWCO & CN   & 4  & 0.62 & N.A.$^d$  \\\hline

616:     \end{tabular}

617:   \end{center}

618: $^a$Target values for which the regression parameters were trained. ``RWCO'' and ``CN''

619: indicate that the regression parameters were trained to fit the residue-wise contact orders

620: and contact numbers, respectively.\\

621: $^b$Target values to which the ``prediction'' was applied. ``RWCO'' and ``CN'' indicate

622: that predicted values were compared with the native residue-wise contact orders and native

623: contact numbers, respectively.\\

624: $^c$Optimal half window size for the prediction.\\

625: $^d$Not applicable because the ranges of RWCO and CN values are different.

626: \end{table}

627: ~\\

628:

629: \begin{figure}

630:   \begin{center}

631: \includegraphics[width=8cm]{window_sc.eps}

632:   \end{center}

633: \caption{\label{fig:window}Prediction accuracy as a function of window size.

634: (A) The correlation coefficient ($Cor_p$) between the native and predicted RWCO, averaged

635: over the test set proteins. (B) Deviation of the predicted RWCO from the native one ($DevA_p$), averaged over the test set proteins.}

636: \end{figure}

637: ~\\

638: \newpage

639: \begin{figure}

640:   \includegraphics[width=8cm]{len_cor.eps}

641: \caption{\label{fig:len_cor}$Cor_p$ plotted against chain length. Each point represents a protein in one of the test sets.}

642: \end{figure}

643:

644: \begin{figure}

645:   \begin{center}

646:     \includegraphics[width=7cm]{./example4.eps}

647:   \end{center}

648: \caption{\label{fig:ex}Examples of prediction. Red: native RWCO; Green: predicted RWCO.

649: (A) SCOP domain d1a6m\_\_ (myoglobin, all-$\alpha$), $Cor_p = 0.73$,

650: $DevA_p = 0.75$;

651: (B) SCOP domain d1ifra\_ (Lamin A/C globular tail domain, all-$\beta$),

652: $Cor_p =0.72$, $DevA_p = 0.87$;

653: (C) SCOP domain d1a9xb1 (Carbamoyl phosphate synthetase, small subunit N-terminal domain, $\alpha / \beta$), $Cor_p = 0.72$, $DevA_p = 0.81$. }

654: \end{figure}

655: ~\\

656:

657: \begin{figure}

658:   \begin{center}

659: \includegraphics[width=16cm]{a2z.eps}

660:   \end{center}

661: \caption{\label{fig:aaprop}$C_{m,a}$ for each amino acid type ($a$) as a function of the window position ($m$).}

662: \end{figure}

663: ~\\

664: \newpage

665: \begin{figure}

666:   \includegraphics[width=8cm]{par_cor.eps}

667: \caption{\label{fig:parcor}Correlation between the regression parameters

668: $C_{m,a}$ for contact number and RWCO predictions for each window position.

669: The horizontal axis is the window position $m$ in the local sequence.

670: The vertical axis is the correlation coefficient between the regression

671: parameters $C_{m,a}$ for RWCO prediction and those for contact number

672: prediction at the window position $m$.}

673: \end{figure}

674: \end{document}

675: