q-bio0503032/kinjo.tex
1: \documentclass[12pt]{article}
2: %%
3: %% $Id: kinjo.tex,v 2.7 2005/09/26 01:39:24 akinjo Exp akinjo $
4: %% 
5: \usepackage{times}
6: \usepackage{graphicx}
7: %\usepackage{natbib}
8: %\usepackage{citesupernumber}
9: %\usepackage{biophysics}
10: %\renewcommand{\baselinestretch}{1.5}
11: \begin{document}
12: \begin{center}
13: {\Large \bf Predicting Secondary Structures, Contact Numbers, and Residue-wise Contact Orders of Native Protein Structure from Amino Acid Sequence by Critical Random Networks}  
14: 
15: {Akira R. Kinjo$^*$  and Ken Nishikawa
16: 
17: {\em Center for Information Biology and DNA Data Bank of Japan,\\
18: National Institute of Genetics, Mishima, 411-8540, Japan;\\
19: Department of Genetics, The Graduate University for Advanced Studies \\
20: (SOKENDAI), Mishima, 411-8540, Japan}}
21: 
22: \end{center}
23: 
24: \begin{flushleft}
25: Running title: Protein structure prediction in 1D.
26: 
27: $^*$Correspondence to A. R. Kinjo.\\
28: Center for Information Biology and DNA Data Bank of Japan,\\
29: National Institute of Genetics, Mishima, Shizuoka, 411-8540, Japan\\
30: Tel: +81-55-981-6859\\
31: Fax: +81-55-981-6889\\
32: E-mail: akinjo@genes.nig.ac.jp
33: \end{flushleft}
34: 
35: \begin{abstract}
36: Prediction of one-dimensional protein structures such as secondary structures 
37:  and contact numbers is useful for three-dimensional structure 
38: prediction and important for understanding the sequence-structure
39:  relationship. 
40: Here we present a new machine-learning method, critical random networks (CRNs), 
41: for predicting one-dimensional structures, and apply it, with position-specific
42: scoring matrices, to the prediction of secondary structures (SS), contact 
43: numbers (CN), and residue-wise contact orders (RWCO).
44: The present method achieves, on average, a $Q_3$ accuracy of 77.8\% for SS,
45: and correlation coefficients of 0.726 and 0.601 for CN and RWCO, respectively. 
46: The accuracy of the SS prediction is comparable to other state-of-the-art 
47: methods, and that of the CN prediction is a significant improvement over  
48: previous methods. We give a detailed formulation of the critical random 
49: networks-based prediction scheme, and examine the context-dependence of 
50: prediction accuracies. In order to study the nonlinear and multi-body effects,
51: we compare the CRNs-based method with a purely linear method based on 
52: position-specific scoring matrices. Although not superior to the CRNs-based 
53: method, the surprisingly good accuracy achieved by the linear method highlights 
54: the difficulty in extracting structural features of higher order from amino 
55: acid sequence beyond that provided by the position-specific scoring matrices.
56: \end{abstract}
57: \begin{flushleft}
58:   \textit{Key words:} Protein structure prediction, one-dimensional structure, 
59: position-specific scoring matrix, critical random network
60: \end{flushleft}
61: 
62: \section*{Introduction}
63: Predicting the three-dimensional structure of a protein from its 
64: amino acid sequence is an essential step toward the 
65: thorough bottom-up understanding of complex biological phenomena.
66: Recently, much progress has been made in developing 
67: so-called \emph{ab initio} or \emph{de novo} structure prediction 
68: methods\cite{BonneauANDBaker2001}.
69: In the standard approach to such \emph{de novo} structure predictions,
70: a protein is represented as a physical object in three-dimensional (3D) space,
71: and the global minimum of free energy surface is sought with a given 
72: force-field or a set of scoring functions. In the minimization process, 
73: structural features predicted from the amino acid sequence
74: may be used as restraints to limit the conformational space to be sampled.
75: Such structural features include so-called one-dimensional (1D) structures of 
76: proteins.
77: 
78: Protein 1D structures are 3D structural features
79: projected onto strings of residue-wise structural assignments 
80: along the amino acid sequence\cite{Rost2003}. 
81: For example, a string of secondary structures is a 1D structure. 
82: Other 1D structures include (solvent) 
83: accessibilities\cite{LeeANDRichards1971}, contact 
84: numbers\cite{KinjoETAL2005} and recently introduced residue-wise contact 
85: orders\cite{KinjoANDNishikawa2005}. 
86: The contact number, also referred to as coordination number or Ooi 
87: number\cite{NishikawaANDOoi1980}, 
88: of a residue is the number of contacts that the residue makes with 
89: other residues in the native 3D structure, while the residue-wise 
90: contact order of a 
91: residue is the sum of sequence separations between that residue and 
92: contacting residues.
93: We have recently shown that it is possible to reconstruct the native 
94: 3D structure of a protein from a set of three types of native 1D structures, 
95: namely secondary structures (SS), contact numbers (CN), and residue-wise 
96: contact orders (RWCO)\cite{KinjoANDNishikawa2005}. 
97: Therefore, these 1D structures contain rich information regarding the 
98: corresponding 3D structure, and their accurate prediction may be very helpful
99: for 3D structure prediction. 
100: 
101: In our previous study\cite{KinjoETAL2005}, we have developed a simple linear 
102: method to predict contact numbers from amino acid sequence. In that method, 
103: the use of multiple sequence alignment 
104: was shown to improve the prediction accuracy, achieving an average 
105: correlation coefficient of 0.63 between predicted and observed contact 
106: numbers per protein. There, we used an amino acid frequency table obtained from
107:  the HSSP\cite{HSSP} multiple sequence alignment.
108: 
109: In this paper, we extend the previous method by introducing a new framework 
110: called critical random networks (CRNs), and apply it to the prediction of
111: secondary structure and residue-wise contact order in addition to contact 
112: number prediction. In this framework, a state vector of a large dimension 
113: is associated with each site of a target sequence.
114: The state vectors are connected via random nearest-neighbor interactions.
115: The values of the state vectors are determined by solving an equation of 
116: state. Then a 1D quantity of each site is predicted as a linear function
117: of the state vector of the site as well as the corresponding local PSSM segment.
118: This approach was inspired by the method of echo state networks 
119: (ESNs) which has been recently developed and successfully applied 
120: to complex time series analysis\cite{Jaeger2001,JaegerANDHaas2004}. 
121: Unlike ESNs which treat infinite series of input signals in one direction 
122: (from the past to the future), CRNs treat finite systems incorporating 
123: both up- and downstream information at the same time. Also, 
124: the so-called echo state property is not imposed on the network, 
125: but the system is instead set at a critical point of the network.
126: As the input to CRNs-based prediction, we employ position-specific 
127: scoring matrices (PSSMs) generated by PSI-BLAST\cite{AltschulETAL1997}. 
128: By the combination of PSSMs and CRNs, accurate predictions of 
129: SS, CN and RWCO have been achieved. 
130: 
131: Currently, almost all the accurate methods for one-dimensional structure 
132: predictions combine some kind of sophisticated machine-learning approach 
133: such as neural networks and support vector machines with PSSMs. The method
134: presented here is no exception.
135: This trend raises a question as to what extent the machine-learning 
136: approaches are effective. In this study, we address this question by comparing
137: the CRNs-based method with a purely linear method based on PSSMs. Although
138: not as good as the CRNs-based method, the linear predictions are of 
139: surprisingly high quality. This result suggests that, although not 
140: insignificant, the effect of the machine-learning approaches is relatively 
141: of minor importance while the use of PSSMs is the most significant ingredient 
142: in one-dimensional structure prediction. The problem of how to effectively 
143: extract meaningful information from the amino acid sequence beyond that 
144: provided by PSSMs requires yet further studies.
145: 
146: \section*{Materials and Methods}
147: \subsection*{Definition of 1D structures}
148: \paragraph{Secondary structures (SS)}
149: Secondary structures were defined by the DSSP program\cite{DSSP}.
150: For three-state SS prediction, a simple encoding scheme was employed.
151: That is, $\alpha$ helices ($H$), $\beta$ strands ($E$), and other structures
152: (``coils'') defined by DSSP were encoded as $H$, $E$, and $C$, respectively.
153: For SS prediction, we introduce feature variables $(y_i^H, y_i^E, y_i^C)$ 
154: to represent each type of secondary structures at the $i$-th residue position,
155: so that $H$ is represented as $(1,-1,-1)$, $E$ as $(-1,1,-1)$, and $C$ as 
156: $(-1,-1,1)$.
157: \paragraph{Contact numbers (CN)}
158: Let $C_{i,j}$ represent the contact map of a protein. Usually, the contact 
159: map is defined so that $C_{i,j} = 1$ if the $i$-th and $j$-th residues are in 
160: contact by some definition, or $C_{i,j} = 0$, otherwise. As in our 
161: previous study, we slightly modify the definition using a sigmoid function. 
162: That is, 
163: \begin{equation}
164:   C_{i,j} = 1/\{1+\exp[w(r_{i,j} - d)]\}
165: \end{equation}
166: where $r_{i,j}$ is the distance between $C_{\beta}$ ($C_{\alpha}$ 
167: for glycines) atoms of the $i$-th and $j$-th residues, $d = 12$\AA{} is a 
168: cutoff distance, and $w$ is a sharpness parameter of the sigmoid function 
169: which is set to 3\cite{KinjoETAL2005,KinjoANDNishikawa2005}. The rather 
170: generous cutoff length of 12\AA{} was shown to optimize the prediction 
171: accuracy\cite{KinjoETAL2005}. The use of the sigmoid function enables us to 
172: use the contact numbers in molecular dynamics 
173: simulations\cite{KinjoANDNishikawa2005}.
174: Using the above definition of the contact map, the contact number of the
175: $i$-th residue of a protein is defined as
176: \begin{equation}
177:   n_i = \sum_{j:|i-j|>2}C_{i,j}. \label{eq:defcn}
178: \end{equation}
179: The feature variable $y_i$ for CN is defined as $y_i = n_i / \log L$ where 
180: $L$ is the sequence length of a target protein. The normalization 
181: factor $\log L$ is introduced because we have observed that the contact 
182: number averaged over a protein chain is roughly proportional to $\log L$,
183: and thus division by this value removes the size-dependence of predicted
184: contact numbers.
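As a numerical illustration of the sigmoid contact definition (a sketch that assumes $w = 3$ is expressed in units of \AA$^{-1}$; the distances 11\AA{} and 13\AA{} are chosen only for illustration), residue pairs at $r_{i,j} = 11$\AA{}, 12\AA{} and 13\AA{} contribute, respectively,
\[
  C_{i,j} = \frac{1}{1+e^{-3}} \approx 0.953, \qquad
  C_{i,j} = \frac{1}{1+e^{0}} = 0.5, \qquad
  C_{i,j} = \frac{1}{1+e^{3}} \approx 0.047 .
\]
Under this assumption the contact number $n_i$ is close to a count of residues within 12\AA{}, with only pairs near the cutoff contributing appreciably fractional values.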
185: \paragraph{Residue-wise contact orders (RWCO)}
186: RWCOs were first introduced in Kinjo and Nishikawa\cite{KinjoANDNishikawa2005}.
187: Using the same notation as contact numbers (see above), 
188: the RWCO of the $i$-th residue in a protein structure is defined by 
189: \begin{equation}
190:   o_i = \sum_{j:|i-j|>2}|i-j|C_{i,j}. \label{eq:defrwco}
191: \end{equation}
192: The feature variable $y_i$ for RWCO is defined as $y_i = o_i / L$ where 
193: $L$ is the sequence length. For a reason similar to that for CN, the normalization
194: factor $L$ was introduced to remove the size-dependence of the predicted
195: RWCOs (the RWCO averaged over a protein chain is roughly proportional to the 
196: chain length).
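For example (hypothetical numbers, for illustration only), suppose that residue $i = 10$ of a chain of length $L = 100$ is in full contact ($C_{i,j} = 1$) with residues $j = 14$, $20$ and $50$, and that $C_{i,j} = 0$ for all other $j$. Then Eqs. \ref{eq:defcn} and \ref{eq:defrwco} give
\[
  n_i = 1 + 1 + 1 = 3, \qquad
  o_i = 4 + 10 + 40 = 54, \qquad
  y_i = o_i / L = 0.54 ,
\]
where $y_i$ is the RWCO feature variable. Two residues with the same contact number can thus have very different RWCOs, depending on how far along the sequence their contact partners lie.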
197: 
198: \subsection*{Linear regression scheme}
199: The input to the prediction scheme we develop in this paper is a 
200: position-specific scoring matrix (PSSM) of the amino acid sequence of 
201: a target protein.
202: Let us denote the PSSM by $U = (\mathbf{u}_1, \cdots , \mathbf{u}_{L})$ 
203: where $L$ is the sequence length of the target protein and 
204: $\mathbf{u}_i$ is a 20-vector containing the scores of 20 types of 
205: amino acid residues at the $i$-th position: 
206: $\mathbf{u}_i = (u_{1,i}, \cdots , u_{20,i})^{t}$.
207: 
208: When predicting a type of 1D structure, we first predict the feature 
209: variable(s) for that type of 1D structure [i.e., $y_i = y_i^H$, etc. for SS,
210: $n_i/\log L$ for CN, and $o_i/L$ for RWCO], and then 
211: the feature variable is converted to the target 1D structure.
212: Prediction of the feature variable $y_i$ can be considered as a mapping 
213: from a given PSSM $U$ to $y_i$. More formally, we are going to 
214: establish the functional form of the mapping $F$ in $\hat{y}_{i} = F(U,i)$
215: where $\hat{y}_{i}$ is the predicted value of the feature variable $y_i$.
216: In our previous paper, we showed that CN can be predicted to a moderate 
217: accuracy by a simple linear regression scheme with a 
218: local sequence window\cite{KinjoETAL2005}. 
219: Accordingly, we assume that the function $F$ can be decomposed into 
220: linear ($F_l$) and nonlinear ($F_n$) parts: $F = F_{l} + F_{n}$. 
221: 
222: The linear part is expressed as 
223: \begin{equation}
224:   F_l(U,i) = \sum_{m=-M}^{M}\sum_{a=1}^{21}D_{m,a}u_{a,i+m}
225: \label{eq:lin}
226: \end{equation}
227: where $M$ is the half window size of the local PSSM segment around 
228: the $i$-th residue, and $\{D_{m,a}\}$ are the weights to be trained. 
229: To treat N- and C-termini separately, we introduced 
230: the ``terminal residue'' as the 21st kind of amino acid residue.
231: The value of $u_{21,i+m}$ is set to unity if $i+m<1$ or $i+m>L$, or to zero 
232: otherwise. The ``terminal residue'' for the central residue ($m=0$) serves 
233: as a bias term and is always set to unity. 
234: 
235: To establish the nonlinear part, we first introduce an $N$-dimensional 
236: ``state vector'' $\mathbf{x}_i = (x_{1,i}, \cdots , x_{N,i})^{t}$ 
237: for the $i$-th sequence position where the dimension $N$ is a free parameter.
238: The value of $\mathbf{x}_i$ is determined by solving the equation of state
239: which is described in the next subsection. For the moment, let us assume 
240: that the equation of state has been solved, and denote 
241: the solution by $\mathbf{x}_{i}^{*}$. The state 
242: vector can be considered as a function of the whole PSSM $U$ 
243: (i.e., $\mathbf{x}_{i}^{*} = \mathbf{x}_{i}^{*}(U)$), and 
244: implicitly incorporates nonlinear and long-range effects. Now, the nonlinear 
245: part $F_n$ is expressed as a linear projection of the state vector:
246: \begin{equation}
247:   F_n(U,i) = \sum_{k=1}^{N}E_{k}x_{k,i}^{*}(U)
248: \label{eq:nonlin}
249: \end{equation}
250: where $\{E_{k}\}$ are the weights to be trained. 
251: 
252: In summary, the prediction scheme is expressed as 
253: \begin{equation}
254:     \hat{y}_{i} = \sum_{m=-M}^{M}\sum_{a=1}^{21}D_{m,a}u_{a,i+m}
255: + \sum_{k=1}^{N}E_{k}x_{k,i}^{*}(U) \label{eq:pred0}
256: \end{equation}
257: Regarding $\mathbf{u}_{i-M}, \cdots, \mathbf{u}_{i+M}$ and 
258: $\mathbf{x}_{i}^{*}$ as independent variables, Eq. \ref{eq:pred0} reduces to
259: a simple linear regression problem for which the optimal weights $\{D_{m,a}\}$ 
260: and $\{E_k\}$ are readily determined by using a least squares method.
261: For CN or RWCO predictions, the predicted feature variable can be easily 
262: converted to the corresponding 1D quantities by multiplying by 
263: $\log L$ or $L$, respectively.
264: For SS prediction, the secondary structure $\hat{s}_i$ of the $i$-th residue
265: is given by $\hat{s}_i = \mathrm{arg}\max_{s\in \{H, E, C\}}\hat{y}_i^s$.
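The least squares step can be sketched as follows, assuming ordinary (unregularized) least squares over all residues of the training set; the stacked vector $\mathbf{z}_i$ and the weight vector $\mathbf{w}$ are notation introduced here for illustration only. Let $\mathbf{z}_i$ collect the $21(2M+1)$ entries $u_{a,i+m}$ and the $N$ entries $x_{k,i}^{*}$, and let $\mathbf{w}$ collect the corresponding weights $D_{m,a}$ and $E_k$, so that $\hat{y}_i = \mathbf{w}^{t}\mathbf{z}_i$. Minimizing the squared error then leads to the normal equations:
\[
  \min_{\mathbf{w}} \sum_{i}\left(y_i - \mathbf{w}^{t}\mathbf{z}_i\right)^2
  \quad \Longrightarrow \quad
  \left(\sum_{i}\mathbf{z}_i\mathbf{z}_i^{t}\right)\mathbf{w} = \sum_{i} y_i \mathbf{z}_i ,
\]
where the sums run over the training residues. For SS prediction, one such problem would presumably be solved for each of the three feature variables $y_i^H$, $y_i^E$ and $y_i^C$.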
266: 
267: \subsection*{Critical random networks and the equation of state}
268: We now describe the equation of state for the system of state vectors.
269: We denote $L$ state vectors along the amino acid sequence by
270: $\mathbf{X} = 
271: (\mathbf{x}_{1}, \cdots , \mathbf{x}_{L}) \in \mathbf{R}^{N\times L}$, 
272: and define a nonlinear mapping 
273: $g_i : \mathbf{R}^{N\times L} \to \mathbf{R}^{N}$ for $i = 1, \cdots , L$ by
274: \begin{equation}
275:   g_i(\mathbf{X}) = \tanh \left[\beta W(\mathbf{x}_{i-1}+\mathbf{x}_{i+1})+\alpha V \mathbf{u}_{i}\right]
276: \end{equation}
277: where $\beta$ and $\alpha$ are positive constants, $W$ is an $N\times N$ 
278: block-diagonal orthogonal random matrix, and $V$ is an $N\times 21$ random 
279: matrix (a unit bias term is assumed in $\mathbf{u}_i$). 
280: The hyperbolic tangent function ($\tanh$) is applied element-wise. 
281: We impose the boundary conditions as
282: $\mathbf{x}_0 = \mathbf{x}_{L+1} = \mathbf{0}$.
283: In this equation, the term containing $W$ represents nearest-neighbor 
284: interactions along the sequence. The amino acid sequence information is 
285: taken into account as an external field in the form of 
286: $\alpha{}V\mathbf{u}_{i}$. Next we define a mapping 
287: $G : \mathbf{R}^{N\times L} \to \mathbf{R}^{N\times L}$ by
288: \begin{equation}
289:   G(\mathbf{X}) = (g_{1}(\mathbf{X}), \cdots , g_{L}(\mathbf{X})).
290: \end{equation}
291: Using this mapping $G$, the equation of state is defined as 
292: \begin{equation}
293:   \mathbf{X} = G(\mathbf{X}). \label{eq:fixpoint}
294: \end{equation}
295: That is, the state vectors are determined as a fixed point of the mapping 
296: $G$. More explicitly, Eq. \ref{eq:fixpoint} can be expressed as 
297: \begin{equation}
298:     \mathbf{x}_{i} = \tanh \left[\beta W(\mathbf{x}_{i-1}+\mathbf{x}_{i+1})+\alpha V \mathbf{u}_{i}\right], \label{eq:eos}
299: \end{equation}
300: for $i = 1, \cdots, L$. That is, the state vector $\mathbf{x}_i$ of the site
301: $i$ is determined by the interaction with the state vectors of the neighboring 
302: sites $i-1$ and $i+1$ as well as with the `external field' $\mathbf{u}_i$ of 
303: the site. The information of the external field at each site is propagated
304: throughout the whole amino acid sequence via the nearest-neighbor interactions.
305: Therefore, solving Eq. (\ref{eq:eos}) means finding the state vectors that 
306: are consistent with the external field as well as the nearest-neighbor 
307: interactions, and each state vector in the obtained solution 
308: $\{\mathbf{x}_i\}$ self-consistently embodies the information of the whole 
309: amino acid sequence in a mean-field sense.
310: 
311: For $\beta < 0.5$, it can be shown 
312: that $G$ is a contraction mapping in $\mathbf{R}^{N\times L}$  
313: (with an appropriate norm defined therein). 
314: And hence, by the contraction mapping principle\cite{TakahashiNLFA}, the 
315: mapping $G$ has a unique fixed point independently of the strength $\alpha$ 
316: of the external field.
317: When $\beta$ is sufficiently smaller than 0.5, 
318: the correlation between two state vectors, say $\mathbf{x}_{i}$ 
319: and $\mathbf{x}_{j}$, is expected to decay exponentially as a function of 
320: the sequential separation $|i-j|$.
321: On the other hand, for $\beta > 0.5$, the number of the fixed points
322: varies depending on the strength of the external field $\alpha$.
323: In this regime, we cannot reliably solve the equation of 
324: state (Eq.\ref{eq:fixpoint}). 
325: In this sense, $\beta = 0.5$ can be considered as a critical 
326: point of the system $\mathbf{X}$. 
327: From an analogy with critical phenomena of physical 
328: systems\cite{Goldenfeld1992} (note the formal similarity of Eq. \ref{eq:eos} 
329: with the mean field equation of the Ising model), the correlation length 
330: between state vectors is expected to diverge, or become long when the 
331: external field is finite but small. 
332: We call the system defined by Eq. \ref{eq:eos}
333: with $\beta = 0.5$ a critical random network (CRN). 
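The contraction property for $\beta < 0.5$ can be sketched as follows. The particular norm $\|\mathbf{X}\| = \max_{i}\|\mathbf{x}_i\|_{\mathbf{R}^{N}}$ is an assumption made here for illustration, since only ``an appropriate norm'' is required. Because $\tanh$ is applied element-wise and satisfies $|\tanh a - \tanh b| \leq |a - b|$, the external field term cancels in the difference, and because the orthogonal matrix $W$ preserves the Euclidean norm, any $\mathbf{X}$ and $\mathbf{Y}$ obeying the boundary conditions satisfy
\begin{eqnarray*}
  \left\|g_i(\mathbf{X}) - g_i(\mathbf{Y})\right\|_{\mathbf{R}^{N}}
  &\leq& \beta \left\|W\left[(\mathbf{x}_{i-1}-\mathbf{y}_{i-1}) + (\mathbf{x}_{i+1}-\mathbf{y}_{i+1})\right]\right\|_{\mathbf{R}^{N}} \\
  &=& \beta \left\|(\mathbf{x}_{i-1}-\mathbf{y}_{i-1}) + (\mathbf{x}_{i+1}-\mathbf{y}_{i+1})\right\|_{\mathbf{R}^{N}} \\
  &\leq& 2\beta \left\|\mathbf{X}-\mathbf{Y}\right\| .
\end{eqnarray*}
Taking the maximum over $i$ gives $\|G(\mathbf{X}) - G(\mathbf{Y})\| \leq 2\beta\|\mathbf{X}-\mathbf{Y}\|$, so that $G$ is a contraction whenever $2\beta < 1$, independently of $\alpha$.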
334: 
335: The equation of state (Eq. \ref{eq:eos}) is parameterized by two random 
336: matrices $W$ and $V$, and consequently, so are the predicted feature variables 
337: $\hat{y}_{i}$. Following a standard technique of statistical 
338: learning such as neural networks\cite{Haykin}, we may improve the prediction 
339: accuracy by averaging $\hat{y}_{i}$ obtained by multiple CRNs with 
340: different pairs of $W$ and $V$. This averaging operation reduces the prediction
341: errors due to the random fluctuations in the estimated parameters.
342: We employ such an ensemble prediction with 10 sets of random matrices $W$ and
343: $V$ in the following. The use of a larger number of random matrices for 
344: ensemble predictions improved the prediction accuracies slightly, but the 
345: difference was insignificant.
346: 
347: \subsection*{Numerics}
348: Here we describe the value of the free parameters used, and a numerical 
349: procedure to solve the equation of state.
350: 
351: The half window size $M$ in the linear part of Eq. \ref{eq:pred0} is 
352: set to 9 for SS and CN predictions, and to 26 for RWCO prediction. 
353: These values were found to be optimal in preliminary studies\cite{KinjoETAL2005,
354: KinjoANDNishikawa2005b}.
355: Regarding the dimension $N$ of the state vector, we have found that $N=2000$ 
356: gives the best result after some experimentation, and this value is 
357: used throughout. With a state vector of dimension as large as 2000, it is 
358: expected that various properties of amino acid sequences can be extracted and 
359: memorized. If the dimension is too large, overfitting may occur, but we did
360: not find such a case up to $N=2000$. Therefore, in principle, the state vector
361: dimension could be even larger (but the computational cost becomes a problem).
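With these settings, the number of trainable weights in Eq. \ref{eq:pred0} for a single feature variable is $21(2M+1) + N$, which evaluates to
\[
  21 \times 19 + 2000 = 2399 \quad (M = 9), \qquad
  21 \times 53 + 2000 = 3113 \quad (M = 26),
\]
for SS or CN prediction and for RWCO prediction, respectively. This count is read directly off the index ranges in Eq. \ref{eq:pred0}; the weights on the state vector thus far outnumber those on the local PSSM window.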
362: 
363: Each element in the $N\times 21$ random matrix $V$ in Eq. \ref{eq:eos} is 
364: obtained from a uniform distribution in the range [-1, 1] and the strength 
365: parameter $\alpha$ is set to 0.01. 
366: Here and in the following, all random numbers were generated by the Mersenne 
367: twister algorithm\cite{MersenneTwister}. 
368: The $N\times N$ random matrix $W$ is obtained in the following manner. 
369: First we generate a random block diagonal matrix $A$ whose block sizes 
370: are drawn from a uniform distribution of integers 2 to 20 (both inclusive), 
371: and the values of the block elements are drawn 
372: from the standard Gaussian distribution (zero mean and unit variance). 
373: By applying singular value decomposition, we have $A = U\Sigma V^{t}$ 
374: where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal 
375: matrix of singular values (here, $U$ and $V$ denote the SVD factors of $A$, not the PSSM or the random matrix of Eq. \ref{eq:eos}).
376: We set $W = UV^{t}$ which is orthogonal as well as block diagonal.
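That this construction yields an orthogonal matrix follows directly from the orthogonality of the SVD factors of $A$ ($U^{t}U = I$ and $VV^{t} = I$):
\[
  W^{t}W = (UV^{t})^{t}(UV^{t}) = VU^{t}UV^{t} = VV^{t} = I .
\]
The block-diagonal form is retained if, as assumed in this sketch, the decomposition is carried out block by block, so that the SVD factors are themselves block diagonal with the same block sizes as $A$.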
377: 
378: To solve the equation of state (Eq. \ref{eq:eos}), we use a simple functional
379: iteration with a Gauss-Seidel-like updating scheme. 
380: Let $\nu$ denote the stage of iteration.
381: We set the initial value of the state vectors (with $\nu = 0$) as 
382: \begin{equation}
383:   \mathbf{x}_{i}^{(0)} = \tanh \left[\alpha V \mathbf{u}_{i}\right].\label{eq:init_eos}
384: \end{equation}
385: Then, for $i = 1, \cdots , L$ (in increasing order of $i$), we update 
386: the state vectors by
387: \begin{equation}
388:   \mathbf{x}_{i}^{(2\nu+1)} \gets \tanh \left[\beta W(\mathbf{x}_{i-1}^{(2\nu+1)}+\mathbf{x}_{i+1}^{(2\nu)})+\alpha V \mathbf{u}_{i}\right].
389: \label{eq:feos}
390: \end{equation}
391: Next, we update them in the reverse order. That is, for $i = L, \cdots , 1$ 
392: (in decreasing order of $i$), 
393: \begin{equation}
394:   \mathbf{x}_{i}^{(2\nu+2)} \gets \tanh \left[\beta W(\mathbf{x}_{i-1}^{(2\nu+1)}+\mathbf{x}_{i+1}^{(2\nu+2)})+\alpha V \mathbf{u}_{i}\right].
395: \label{eq:beos}
396: \end{equation}
397: We then set $\nu \gets \nu + 1$, and iterate Eqs. (\ref{eq:feos}) and (\ref{eq:beos}) until $\{\mathbf{x}_{i}\}$ converges. The convergence criterion is 
398: \begin{equation}
399: \sqrt{\sum_{i=1}^{L}\left\|\mathbf{x}_{i}^{(2\nu+2)}-\mathbf{x}_{i}^{(2\nu+1)}\right\|_{\mathbf{R}^{N}}^{2}/{NL}}<10^{-7}
400: \end{equation}
401: where $\left\|\cdot\right\|_{\mathbf{R}^{N}}$ denotes the Euclidean norm.
402: Convergence is typically achieved within 100 to 200 iterations for one protein.
403: 
404: \subsection*{Preparation of training and test sets}
405: We use the same set of proteins as used in our preliminary 
406: study\cite{KinjoANDNishikawa2005b}. In this set, there are 680 protein domains
407: selected from the ASTRAL database\cite{ASTRAL}, 
408: each of which represents a superfamily from one of all-$\alpha$, all-$\beta$,
409: $\alpha/\beta$, $\alpha+\beta$ or ``multi-domain'' classes of the SCOP database 
410: (release 1.65, December 2003)\cite{SCOP}. Conversely, each SCOP superfamily
411: is represented by only one of the protein domains in the data set. 
412: Thus, no pair of protein domains in the data set are expected to 
413: be homologous to each other.
414: For training the parameters and testing the prediction accuracy, 15-fold
415: cross-validation is employed. The set of 680 proteins is randomly 
416: divided into two groups: one consisting of 630 proteins (training set), 
417: and the other consisting of 50 proteins (test set). For each training set, 
418: the regression parameters $\{D_{m,a}\}$ and $\{E_{k}\}$ are determined, and 
419: using these parameters, the prediction accuracy is evaluated for the 
420: corresponding test set. 
421: This procedure was repeated 15 times with different random divisions, 
422: leading to 15 pairs of training and test sets. Consequently, there is some redundancy among the training and test sets of different pairs, although the two sets within each pair share no 
423: proteins in common. This poses no problem since our objective is to 
424: estimate the average accuracy of the predictions. A similar validation procedure was also employed by Petersen et al.\cite{PetersenETAL2000}.
425: In total, 750 ($= 15\times 50$) proteins were tested over which 
426: the averages of the measures of accuracy (see below) were calculated.
427: 
428: \subsection*{Preparation of position-specific scoring matrix}
429: To obtain the position-specific scoring matrix (PSSM) of a protein, 
430: we conducted ten iterations of PSI-BLAST\cite{AltschulETAL1997} search
431: against a customized sequence database with the E-value cutoff of 
432: 0.0005\cite{TomiiANDAkiyama2004}. The sequence database was compiled from the 
433: DAD database provided by DNA Data Bank of Japan\cite{DDBJ2005}, from which 
434: redundancy was removed by the program CD-HIT\cite{CD-HIT} with 95\% identity 
435: cutoff. This database was subsequently filtered by the program 
436: PFILT used in the PSIPRED program\cite{Jones1999}.
437: We use the position-specific scoring matrices (PSSM) rather than the frequency 
438: tables for the prediction.
439: 
440: \subsection*{Measures of accuracy}
441: For assessing the quality of SS predictions, we mainly use $Q_3$ and 
442: $SOV$ (the 1999 revision)\cite{SOV99}. The $Q_3$ measure quantifies the percentage of correctly predicted residues, while the $SOV$ measure evaluates the 
443: segment overlaps of secondary structural elements of predicted and native 
444: structures. Optionally, we use $Q_s$ and $Q_s^{pre}$
445: (with $s$ being $H$, $E$, or $C$) and Matthews' correlation coefficient $MC$. 
446: The $Q_s$ is defined by the percentage of correctly predicted SS type $s$ 
447: out of the native SS type $s$, and $Q_s^{pre}$ is defined by the percentage 
448: of correctly predicted SS type $s$ out of the predicted SS type $s$.
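In terms of per-residue counts, these measures can be written as follows. This is a sketch: $TP_s$, $FP_s$, $FN_s$ and $TN_s$ are notation introduced here for the numbers of true positives, false positives, false negatives and true negatives for SS type $s$, and the standard form of Matthews' correlation coefficient is assumed.
\[
  Q_s = 100\,\frac{TP_s}{TP_s+FN_s}, \qquad
  Q_s^{pre} = 100\,\frac{TP_s}{TP_s+FP_s},
\]
\[
  MC_s = \frac{TP_s\,TN_s - FP_s\,FN_s}
  {\sqrt{(TP_s+FP_s)(TP_s+FN_s)(TN_s+FP_s)(TN_s+FN_s)}} .
\]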
449: 
450: For CN and RWCO predictions, we use two measures for evaluating the prediction 
451: accuracy.
452: The first one is the correlation coefficient ($Cor$) between the observed
453: ($n_{i}$) and predicted ($\hat{n}_{i}$) CN or RWCO\cite{KinjoETAL2005}.
454: The second  is the RMS error normalized by the standard deviation of the 
455: native CN or RWCO ($DevA$)\cite{KinjoETAL2005}.
456: While $Cor$ measures the quality of relative values, $DevA$ measures that of 
457: absolute values of the predicted CN or RWCO.
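Written out for a single chain, the two measures read as follows. This is a sketch that follows the verbal definitions above, assumes that $Cor$ is the Pearson correlation coefficient, and uses $\langle n \rangle$ and $\langle \hat{n} \rangle$ (notation introduced here) for the chain averages of the observed and predicted values; the cited reference should be consulted for the exact forms.
\[
  Cor = \frac{\sum_{i}(n_i - \langle n \rangle)(\hat{n}_i - \langle \hat{n} \rangle)}
  {\sqrt{\sum_{i}(n_i - \langle n \rangle)^2 \, \sum_{i}(\hat{n}_i - \langle \hat{n} \rangle)^2}}, \qquad
  DevA = \sqrt{\frac{\sum_{i}(n_i - \hat{n}_i)^2}{\sum_{i}(n_i - \langle n \rangle)^2}} .
\]
With this form, a trivial predictor that outputs $\hat{n}_i = \langle n \rangle$ for every residue gives $DevA = 1$, so values below unity indicate an improvement over predicting the chain average.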
458: 
459: Note that the measures $Q_3$, $SOV$,  $Cor$ and $DevA$ are defined for a 
460: single protein chain. In practice, we average these quantities over the 
461: proteins in the test sets to estimate the average accuracy of prediction.
462: On the other hand, per-residue measures, $Q_s$, $Q_s^{pre}$ and $MC$, 
463: were calculated using all the residues in the test data sets, rather than 
464: on a per-protein basis.
465: 
466: \section*{Results}
467: We examine the prediction accuracies for SS, CN, and RWCO in turn.
468: The main results are summarized in Table \ref{tab:summ} and Figure \ref{fig:histo}. Finally, in order to assess the effect of the nonlinear terms, we examine 
469: the prediction results obtained using only linear terms (Eq. \ref{eq:lin}).
470: \begin{table}
471: \caption{\label{tab:summ}Summary of average prediction accuracies.}
472:   \begin{center}
473:   \begin{tabular}[h]{ll}\hline
474: Struct. & Accuracy \\\hline
475: SS  & $Q_3$ = 77.8; $SOV$ = 77.3\\
476: CN  & $Cor$ = 0.726; $DevA$ = 0.707\\
477: RWCO& $Cor$ = 0.601; $DevA$ = 0.881\\\hline
478:   \end{tabular}
479:   \end{center}
480: \end{table}
481: 
482: \begin{figure}[htb]
483: \begin{center}
484: \includegraphics[width=7cm]{./histos.eps}
485: \end{center}
486: \caption{\label{fig:histo}Histograms of accuracy measure obtained by  
487: ensemble predictions using 10 critical random networks. (a) $Q_3$ for 
488: secondary structure prediction; (b) $Cor$ for contact number prediction; 
489: (c) $Cor$ for residue-wise contact order prediction.}
490: \end{figure}
491: 
492: \subsection*{Secondary structure prediction}
493: The average accuracy of secondary structure prediction achieved by 
494: the ensemble CRNs-based approach is $Q_3=77.8$\% and $SOV=77.3$ 
495: (Table \ref{tab:summ}). This is comparable to the current state-of-the-art
496: predictors such as PSIPRED\cite{Jones1999}. The results in 
497: terms of per-residue accuracies ($Q_s$ and $Q_s^{pre}$) are listed in 
498: Table \ref{tab:ss}.
499: The values of $Q_s$ suggest that the present method 
500: underestimates $\alpha$ helices ($H$) and, especially, $\beta$ strands ($E$) 
501: compared to coils $C$. 
502: However, when a residue is predicted as being $H$ or $E$, the probability
503: of the correct prediction is rather high, especially for $E$ 
504: ($Q_E^{pre} =$ 79.9\%).
505: The histogram of $Q_3$ (Figure \ref{fig:histo}a) shows that the peak 
506: of the histogram resides well beyond $Q_3$ = 80\%, and that 
507: only 20\% of the predictions exhibit $Q_3$ of less than 70\%. These 
508: observations demonstrate the capability of the CRNs-based prediction schemes.
509: \begin{table}
510: \caption{\label{tab:ss}Summary of per-residue accuracies for SS predictions.}
511:   \begin{center}
512:   \begin{tabular}[h]{lrrr}\hline
513: measure    & $H$ & $E$ & $C$ \\\hline
514: $Q_s$      & 78.4 & 61.9 & 84.6 \\
515: $Q_s^{pre}$ & 81.9 & 79.9 & 74.3\\
516: $MC$       &  0.704 & 0.636 & 0.602 \\\hline
517:   \end{tabular}
518:   \end{center}
519: \end{table}
520: 
521: \subsection*{Contact number prediction}
522: Using an ensemble of CRNs, a correlation coefficient ($Cor$) of 0.726 and 
523: a normalized RMS error ($DevA$) of 0.707 were achieved for CN predictions on 
524: average (Table \ref{tab:summ}). This result is a significant improvement over
525: the previous method\cite{KinjoETAL2005} which yielded 
526: $Cor=0.627$ and $DevA = 0.941$. The median of the distribution of $Cor$ 
527: (Figure \ref{fig:histo}b) is 0.744, indicating that the majority of 
528: the predictions are of very high accuracy. 
529: 
530: We have also examined the dependence of prediction accuracy on the structural
531: class of target proteins (Table \ref{tab:cnhisto}). 
532: Among all the structural classes, $\alpha/\beta$ proteins are predicted most 
533: accurately with $Cor=$ 0.757 and $DevA =$ 0.668. The accuracies for other 
534: classes do not differ qualitatively although all-$\beta$ proteins are predicted
535: slightly less accurately.
536: \begin{table}
537: \caption{\label{tab:cnhisto}Summary of CN predictions for each SCOP class$^a$.}
538: \begin{center}
539:   \begin{tabular}{lrrrrr}\hline
540: range$^b$ &\multicolumn{5}{c}{SCOP class$^c$}\\
541: ($Cor$) & a & b & c & d & e\\\hline
542: (-1,0.5]  &    8 &    6 &    3 &   14 &    1 \\
543: (0.5,0.6] &   19 &   25 &    8 &   19 &    1 \\
544: (0.6,0.7] &   29 &   29 &   22 &   54 &    3 \\
545: (0.7,0.8] &   62 &   66 &   76 &   85 &   10 \\
546: (0.8,0.9] &   43 &   38 &   57 &   67 &    3 \\
547: (0.9,1.0] &    1 &    0 &    0 &    1 &    0 \\
548: total    &  162 &  164 &  166 &  240 &   18 \\\hline
549: average $Cor$ & 0.721 & 0.712 & 0.757 & 0.728 & 0.722\\
550: average $DevA$ & 0.715 & 0.726 & 0.668 & 0.717 & 0.705\\
551: \hline
552:   \end{tabular}
553: \end{center}
554: $^a$ The number of occurrences of $Cor$ for the proteins in the test sets,
555: classified according to the SCOP database; average values of $Cor$ and $DevA$ 
556: are also listed for each class.\\
557: $^b$ The range ``$(x,y]$'' denotes $x < Cor \leq y$.\\
558: $^c$ a: all-$\alpha$; b: all-$\beta$; c: $\alpha / \beta$; d: $\alpha + \beta$;
559: e: multi-domain.
560: \end{table}
561: 
562: \subsection*{Residue-wise contact order prediction}
For RWCO prediction, the average accuracy was $Cor$ = 0.601 and 
$DevA$ = 0.881. Although these figures appear poor compared to those 
of the CN prediction described above, they are nevertheless statistically 
significant. The distribution of $Cor$ is rather dispersed 
(Figure \ref{fig:histo}c), indicating that the prediction accuracy 
strongly depends on the characteristics of each target protein.
As for CN, we also examined the dependence of prediction
accuracy on the structural class of target 
proteins (Table \ref{tab:rwcohisto}).
In this case, we found a notable dependence of prediction accuracy on 
structural class. The best accuracy is obtained for $\alpha+\beta$
proteins with $Cor = $ 0.629 and $DevA = $ 0.832. For these proteins, 
the distribution of $Cor$ is also favorable in that the fraction of
poor predictions is relatively small (e.g., 14\% with $Cor \leq$ 0.5). 
Interestingly, all-$\beta$ proteins also show good accuracies, whereas 
all-$\alpha$ proteins are particularly poorly predicted. These observations 
suggest that the correlation between amino acid sequence and RWCO is 
strongly dependent on the structural class of the target protein. 
However, the rather dispersed distribution of $Cor$ within each class 
(Table \ref{tab:rwcohisto}) also suggests that there are more detailed 
effects of the global context on the accuracy of RWCO prediction.
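
The class dependence can be made concrete by computing, from the counts in 
Table \ref{tab:rwcohisto}, the fraction of proteins that fall in the lowest 
range ($Cor \leq 0.5$) for each SCOP class:
\[
\begin{array}{lrcr}
\mbox{a (all-$\alpha$)}      & 58/162 & \approx & 35.8\% \\
\mbox{b (all-$\beta$)}       & 31/164 & \approx & 18.9\% \\
\mbox{c ($\alpha/\beta$)}    & 46/166 & \approx & 27.7\% \\
\mbox{d ($\alpha+\beta$)}    & 34/240 & \approx & 14.2\% \\
\mbox{e (multi-domain)}      &  6/18  & \approx & 33.3\%
\end{array}
\]
Thus more than a third of the all-$\alpha$ proteins fall in the poorest 
range, compared with about one in seven of the $\alpha+\beta$ proteins. The 
figure for class e rests on only 18 proteins and should be read with caution.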
584: \begin{table}
\caption{\label{tab:rwcohisto}Summary of RWCO predictions for each SCOP class$^a$.}
586: \begin{center}
587:   \begin{tabular}{lrrrrr}\hline
588: range &\multicolumn{5}{c}{SCOP class}\\
589: ($Cor$) & a & b & c & d & e\\\hline
590: (-1,0.5]  &   58 &   31 &   46 &   34 &    6 \\
591: (0.5,0.6] &   29 &   37 &   31 &   56 &    4 \\
592: (0.6,0.7] &   41 &   27 &   33 &   65 &    5 \\
593: (0.7,0.8] &   24 &   47 &   40 &   72 &    3 \\
594: (0.8,0.9] &   10 &   22 &   16 &   13 &    0 \\
595: total    &  162 &  164 &  166 &  240 &   18 \\\hline
596: average $Cor$ & 0.549 & 0.620 & 0.595 & 0.629 & 0.564\\
597: average $DevA$ & 0.981 & 0.869 & 0.857 & 0.832 & 0.957\\
598: \hline
599:   \end{tabular}
600: \end{center}
601: $^a$See Table \ref{tab:cnhisto} for notations.
602: \end{table}
603: 
604: \subsection*{Purely linear predictions with PSSMs}
Almost all modern methods for 1D structure prediction make use of PSSMs
in combination with some kind of machine-learning technique such as 
feed-forward or recurrent neural networks or support vector machines. 
The present study is no exception. Curiously, machine-learning approaches 
have become so widespread that no attempt appears to have been made to test
the simplest linear predictors based on PSSMs. 
In this subsection, we present results of 1D predictions using only the linear 
terms (Eq. \ref{eq:lin}), without CRNs. In this prediction scheme, 
the input is a local segment of a PSSM generated by PSI-BLAST, and a feature 
variable is predicted by a straightforward linear regression. 
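
For concreteness, such a window-based linear predictor can be sketched as 
follows. The symbols are introduced here only for illustration and are not 
necessarily those of Eq. \ref{eq:lin}: $u_{j,a}$ denotes the PSSM score of 
amino acid type $a$ at site $j$ (20 types are assumed), $M$ is an assumed 
half-width of the window, and $w_0$ and $w_{k,a}$ are regression weights. 
The predicted feature at site $i$ is then
\[
\hat{y}_i = w_0 + \sum_{k=-M}^{M} \sum_{a=1}^{20} w_{k,a}\, u_{i+k,a},
\]
and the weights are chosen to minimize the squared error 
$\sum_i (y_i - \hat{y}_i)^2$ over the training set, where $y_i$ is the 
observed value. No nonlinear transformation of the input is involved; 
the treatment of window positions that extend past the chain termini is 
left unspecified in this sketch.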
615: 
As can be clearly seen in Table \ref{tab:lin}, the results of the linear 
predictions are surprisingly good, although not as good as those with CRNs. 
For example, in SS prediction, the purely linear scheme achieved 
$Q_3$ = 75.2\%, which is lower than that of the CRNs-based scheme (77.8\%) 
by only 2.6\%. Although this is of course a large difference in a 
statistical sense, there may not be a discernible difference when individual 
predictions are concerned. (However, the improvement in the $SOV$ measure by 
using CRNs is quite large, indicating that the nonlinear terms in CRNs are 
indeed able to extract cooperative features.) It is widely accepted that the 
upper limit of accuracy ($Q_3$) of SS prediction based on a local window of 
a single sequence is less than 70\%\cite{CrooksANDBrenner2004}. 
Therefore, an increase of more than 5\% in $Q_3$ is brought about simply 
by the use of PSSMs.
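
Taking 70\% as the assumed ceiling for single-sequence, local-window 
prediction and 77.8\% as the average $Q_3$ of the CRNs-based scheme, the 
total gain in $Q_3$ can be decomposed roughly as
\[
77.8 - 70 \;=\; (75.2 - 70) + (77.8 - 75.2) \;=\; 5.2 + 2.6 ,
\]
so that about two thirds of the gain (in percentage points) is attributable 
to PSSMs and about one third to the nonlinear CRN terms. Because 70\% is 
only an upper bound on the single-sequence accuracy, the share due to PSSMs 
is, if anything, larger than this estimate.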
629: 
Similar observations also hold for CN and RWCO predictions 
(Table \ref{tab:lin}). In the case of CN prediction, we previously 
obtained $Cor$ = 0.555 by a simple linear method
with single sequences\cite{KinjoETAL2005}. Therefore, the effect of 
PSSMs is even more dramatic than for SS prediction. This may be because
the most conspicuous feature of amino acid sequences conserved among 
distant homologs (as detected by PSI-BLAST)
is the hydrophobicity of amino acid residues\cite{KinjoANDNishikawa2004}, 
which is closely related to contact numbers.
Of course, the improvement by the use of PSSMs is largely made possible by 
the recent growth of amino acid sequence 
databases\cite{PrzybylskiANDRost2002}.
642: \begin{table}
643: \caption{\label{tab:lin}Summary of prediction accuracies using only linear terms.}
644:   \begin{center}
645:   \begin{tabular}[h]{ll}\hline
646: Struct. & Accuracy \\\hline
647: SS  & $Q_3$ = 75.2; $SOV$ = 72.7\\
648: CN  & $Cor$ = 0.701; $DevA$ = 0.735\\
649: RWCO& $Cor$ = 0.584; $DevA$ = 0.902\\\hline
650:   \end{tabular}
651:   \end{center}
652: \end{table}
653: 
654: \subsection*{The significance of criticality}
655: The condition of criticality ($\beta = 0.5$ in Eq. \ref{eq:eos}) is expected 
656: to enhance the extraction of the long-range correlations of an amino acid 
sequence, thus improving the prediction accuracy. To confirm this 
point, we tested the method by setting $\beta = 0.1$ so that the 
network of state vectors is no longer at the critical point (the 
prediction and validation schemes were otherwise the same as above). 
661: The prediction accuracies obtained by these non-critical random networks 
662: were $Q_3 = 76.7$\% and $SOV = 76.6$ for SS, 
663: $Cor = 0.716$ and $DevA = 0.719$ for CN, 
664: and $Cor = 0.589$ and $DevA = 0.897$ for RWCO.
665: These values are inferior to those obtained by the critical random networks
666: (Table \ref{tab:summ}), although slightly better than the purely linear 
667: predictions (Table \ref{tab:lin}). 
668: Therefore, compared to the non-critical random networks, the critical random 
669: networks can indeed extract more information from amino acid sequence and 
670: improve the prediction accuracies. 
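
Collecting the figures quoted in this and the preceding subsection, the 
three schemes are ordered consistently (purely linear $<$ non-critical $<$ 
critical) for each of the following measures; the values for $\beta = 0.5$ 
are the averages reported for the CRNs:
\[
\begin{array}{lccc}
 & \mbox{linear} & \beta = 0.1 & \beta = 0.5 \\
Q_3\ \mbox{(SS, \%)} & 75.2 & 76.7 & 77.8 \\
Cor\ \mbox{(CN)}     & 0.701 & 0.716 & 0.726 \\
Cor\ \mbox{(RWCO)}   & 0.584 & 0.589 & 0.601
\end{array}
\]
For $Q_3$, the step from non-critical to critical networks 
($77.8 - 76.7 = 1.1$ points) is of the same order as the step from the 
linear predictor to non-critical networks ($76.7 - 75.2 = 1.5$ points).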
671: 
672: \section*{Discussion}
673: \subsection*{Comparison with other methods}
674: Regarding the framework of 1D structure prediction, the critical random 
675: networks are most closely related to bidirectional recurrent neural
676: networks (BRNNs)\cite{BaldiETAL1999}, in that both can treat a whole amino 
677: acid sequence rather than only a local window segment. The main differences
678: are the following. First, network weights between input and hidden 
679: layers as well as those between hidden units are trained in BRNNs, 
680: whereas the corresponding weights in CRNs (random matrices $V$ and $W$, 
681: respectively, in Eq. \ref{eq:eos}) are fixed. Second, the output layer is 
682: nonlinear in BRNNs but linear in CRNs. Third, the network components that 
683: propagate sequence information from N-terminus to C-terminus are decoupled 
684: from those in the opposite direction in BRNNs, but they are coupled in CRNs. 
685: 
686: Regarding the accuracy of SS prediction, BRNNs\cite{PollastriETAL2002b} and 
687: CRNs exhibit comparable results of $Q_3 \approx$ 78\%. 
688: However, a standard local window-based approach using feed-forward neural 
689: networks can also achieve this level of accuracy\cite{Jones1999}. 
690: Thus, the CRNs-based method is not a single best predictor, but may serve as 
691: an addition to consensus predictions.
692: 
Although BRNNs have also been applied to CN prediction\cite{PollastriETAL2002},
contact numbers were predicted there as 2-state categorical data (buried or 
exposed), so the results cannot be directly compared. Nevertheless, we can 
convert CRNs-based real-value predictions into 2-state predictions. By using 
the same thresholds for 
the 2-state discretization as Pollastri et al.\cite{PollastriETAL2002}
(i.e., the average CN for each residue type), 
we obtained $Q_2 =$ 75.6\% per chain (75.1\% per residue) 
and Matthews' correlation coefficient $MC =$ 0.503,
whereas those obtained by BRNNs are $Q_2 =$ 73.9\% (per residue) 
and $MC =$ 0.478. 
Therefore, for 2-state CN prediction, the present method yields more 
accurate results. 
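
The 2-state conversion and the two measures can be written out explicitly. 
The notation below is introduced only for illustration: $n$ is a (predicted 
or observed) contact number of a residue of amino acid type $a$, and 
$\bar{n}_a$ is the average CN of that residue type, used as the threshold. 
Each value is discretized as
\[
s(n, a) = \left\{ \begin{array}{ll}
\mbox{buried}  & \mbox{if } n \geq \bar{n}_a, \\
\mbox{exposed} & \mbox{otherwise,}
\end{array} \right.
\]
where counting ties as buried is an assumption of this sketch. Treating 
``buried'' as the positive class and denoting the numbers of true and false 
positives and negatives by $\mathit{TP}$, $\mathit{FP}$, $\mathit{TN}$, and 
$\mathit{FN}$, the standard definitions are
\[
Q_2 = \frac{\mathit{TP} + \mathit{TN}}
           {\mathit{TP} + \mathit{TN} + \mathit{FP} + \mathit{FN}},
\qquad
MC = \frac{\mathit{TP} \cdot \mathit{TN} - \mathit{FP} \cdot \mathit{FN}}
          {\sqrt{(\mathit{TP}+\mathit{FP})(\mathit{TP}+\mathit{FN})
                 (\mathit{TN}+\mathit{FP})(\mathit{TN}+\mathit{FN})}} .
\]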
706: 
707: Since the present study is the very first attempt to predict RWCOs, there are 
708: no alternative methods to compare with. However, the comparison of 
709: CRNs-based methods for SS and CN predictions with other methods suggests that
710:  the accuracy of the RWCO prediction presented here may be the best 
711: possible result using any of the statistical learning methods currently 
712: available for 1D structure predictions. 
713: 
714: \subsection*{Possibilities for further improvements}
715: In the present study, we employed the simplest possible architecture for CRNs
716: in which different sites are connected via nearest-neighbor interactions.
717: A number of possibilities exist for the elaboration of the architecture.
For example, we may introduce short-cuts between distant sites to treat 
non-local interactions more directly. Since the prediction accuracies 
depend on the structural context of target proteins (Tables \ref{tab:cnhisto} 
and \ref{tab:rwcohisto}), it may also be useful to include more global
features of amino acid sequences, such as the bias of amino acid composition
or the average of PSSM components. These possibilities are to be pursued in 
future studies.
725: 
726: \section*{Conclusion}
727: We have developed a novel method, CRNs-based regression, for predicting
728: 1D protein structures from amino acid sequence. When combined 
729: with position-specific scoring matrices produced by PSI-BLAST, this method
730: yields SS predictions as accurate as the best current predictors, CN
731: predictions far better than previous methods, and RWCO predictions 
significantly correlated with observed values. We also examined 
the effect of PSSMs on prediction accuracy, and showed that most of the 
improvement is brought about by the use of PSSMs, although the further 
improvement due to the CRNs-based method is also significant. 
To achieve qualitatively better predictions, however, it seems
necessary to take into account other, more global, information than is 
provided by PSSMs.
739: 
740: \section*{Acknowledgments}
The authors thank Motonori Ota for critical comments on an early version of 
the manuscript, and Kentaro Tomii for advice on the use of PSI-BLAST.
Most of the computations were carried out at the supercomputing facility of
the National Institute of Genetics, Japan. This work was supported in part by 
a grant-in-aid from MEXT, Japan.
746: The source code of the programs for the CRNs-based prediction  
747: as well as the lists of protein domains used in this study are available at 
748: \verb|http://maccl01.genes.nig.ac.jp/~akinjo/crnpred_suppl/|.
749: 
750: %\bibliographystyle{biophysics}
751: %\bibliography{refs,mypaper}
752: \begin{thebibliography}{10}
753: 
754: \bibitem{BonneauANDBaker2001}
755: Bonneau, R. and Baker, D.
756: \newblock Ab initio protein structure prediction: progress and prospects.
757: \newblock {\em Annu. Rev. Biophys. Biomol. Struct.}{ \bf 30}, 173--189, 2001.
758: 
759: \bibitem{Rost2003}
760: Rost, B.
761: \newblock Prediction in {1D}: secondary structure, membrane helices, and
762:   accessibility.
763: \newblock In {\em Structural Bioinformatics, }Bourne, P.~E. and Weissig, H.,
764:   editors, chapter~28,  559--587. Wiley-Liss, Inc., Hoboken, U.S.A., 2003.
765: 
766: \bibitem{LeeANDRichards1971}
767: Lee, B. and Richards, F.~M.
768: \newblock The interpretation of protein structures: Estimation of static
769:   accessibility.
770: \newblock {\em J. Mol. Biol.}{ \bf 55}, 379--400, 1971.
771: 
772: \bibitem{KinjoETAL2005}
773: Kinjo, A.~R., Horimoto, K., and Nishikawa, K.
774: \newblock Predicting absolute contact numbers of native protein structure from
775:   amino acid sequence.
776: \newblock {\em Proteins}{ \bf 58}, 158--165, 2005.
777: 
778: \bibitem{KinjoANDNishikawa2005}
779: Kinjo, A.~R. and Nishikawa, K.
780: \newblock Recoverable one-dimensional encoding of protein three-dimensional
781:   structures.
782: \newblock {\em Bioinformatics}{ \bf 21}, 2167--2170, 2005.
783: \newblock doi:10.1093/bioinformatics/bti330.
784: 
785: \bibitem{NishikawaANDOoi1980}
786: Nishikawa, K. and Ooi, T.
787: \newblock Prediction of the surface-interior diagram of globular proteins by an
788:   empirical method.
789: \newblock {\em Int. J. Peptide Protein Res.}{ \bf 16}, 19--32, 1980.
790: 
791: \bibitem{HSSP}
792: Sander, C. and Schneider, R.
793: \newblock Database of homology-derived protein structures.
794: \newblock {\em Proteins}{ \bf 9}, 56--68, 1991.
795: 
796: \bibitem{Jaeger2001}
797: Jaeger, H.
798: \newblock The ``echo state'' approach to analysing and training recurrent
799:   neural networks.
\newblock Technical Report 148, GMD - German National Research Institute for
  Computer Science, 2001.
802: 
803: \bibitem{JaegerANDHaas2004}
804: Jaeger, H. and Haas, H.
805: \newblock Harnessing nonlinearity: predicting chaotic systems and saving energy
806:   in wireless communication.
807: \newblock {\em Science}{ \bf 304}, 78--80, 2004.
808: 
809: \bibitem{AltschulETAL1997}
Altschul, S.~F., Madden, T.~L., Schaffer, A.~A., Zhang, J., Zhang, Z., Miller,
  W., and Lipman, D.~J.
\newblock Gapped {BLAST} and {PSI}-{BLAST}: A new generation of protein database
  search programs.
814: \newblock {\em Nucleic Acids Res.}{ \bf 25}, 3389--3402, 1997.
815: 
816: \bibitem{DSSP}
817: Kabsch, W. and Sander, C.
818: \newblock Dictionary of protein secondary structure: Pattern recognition of
819:   hydrogen bonded and geometrical features.
820: \newblock {\em Biopolymers}{ \bf 22}, 2577--2637, 1983.
821: 
822: \bibitem{TakahashiNLFA}
823: Takahashi, W.
824: \newblock Nonlinear functional analysis: Fixed point theorems and related
825:   topics.
\newblock Kindai Kagaku Sha, Tokyo, 1988.
827: \newblock In Japanese.
828: 
829: \bibitem{Goldenfeld1992}
830: Goldenfeld, N.
831: \newblock Lectures on phase transitions and the renormalization group,
832:   volume~85 of {\em Frontiers in physics}.
\newblock Addison-{W}esley, Reading, Massachusetts, 1992.
834: 
835: \bibitem{Haykin}
836: Haykin, S.
837: \newblock Neural networks: A comprehensive foundation.
\newblock Prentice-Hall, Upper Saddle River, New Jersey, 2nd edition, 1999.
839: 
840: \bibitem{KinjoANDNishikawa2005b}
841: Kinjo, A.~R. and Nishikawa, K.
842: \newblock Predicting residue-wise contact orders of native protein structure
843:   from amino acid sequence.
\newblock arXiv.org:q-bio.BM/0501015, 2005.
845: 
846: \bibitem{MersenneTwister}
847: Matsumoto, M. and Nishimura, T.
848: \newblock Mersenne twister: a 623-dimensionally equidistributed uniform
849:   pseudorandom number generator.
850: \newblock {\em {ACM} Trans. Model. Comput. Simul.}{ \bf 8}, 3--30, 1998.
851: 
852: \bibitem{ASTRAL}
853: Chandonia, J.~M., Hon, G., Walker, N.~S., {Lo Conte}, L., Koehl, P., Levitt,
854:   M., and Brenner, S.~E.
\newblock The {ASTRAL} compendium in 2004.
856: \newblock {\em Nucleic Acids Res.}{ \bf 32}, D189--D192, 2004.
857: 
858: \bibitem{SCOP}
859: Murzin, A.~G., Brenner, S.~E., Hubbard, T., and Chothia, C.
860: \newblock {SCOP}: A structural classification of proteins database for the
861:   investigation of sequences and structures.
862: \newblock {\em J. Mol. Biol.}{ \bf 247}, 536--540, 1995.
863: 
864: \bibitem{PetersenETAL2000}
865: Petersen, T.~N., Lundegaard, C., Nielsen, M., Bohr, H., Bohr, J., Brunak, S.,
866:   Gippert, G.~P., and Lund, O.
867: \newblock Prediction of protein secondary structure at 80\% accuracy.
868: \newblock {\em Proteins}{ \bf 41}, 17--20, 2000.
869: 
870: \bibitem{TomiiANDAkiyama2004}
871: Tomii, K. and Akiyama, Y.
872: \newblock {FORTE}: a profile-profile comparison tool for protein fold
873:   recognition.
874: \newblock {\em Bioinformatics}{ \bf 20}, 594--595, 2004.
875: 
876: \bibitem{DDBJ2005}
877: Tateno, Y., Saitou, N., Okubo, K., Sugawara, H., and Gojobori, T.
878: \newblock {DDBJ} in collaboration with mass-sequencing teams on annotation.
879: \newblock {\em Nucleic Acids Res.}{ \bf 33}, D25--D28, 2005.
880: 
881: \bibitem{CD-HIT}
882: Li, W., Jaroszewski, L., and Godzik, A.
883: \newblock Tolerating some redundancy significantly speeds up clustering of
884:   large protein databases.
885: \newblock {\em Bioinformatics}{ \bf 18}, 77--82, 2002.
886: 
887: \bibitem{Jones1999}
888: Jones, D.~T.
889: \newblock Protein secondary structure prediction based on position-specific
890:   scoring matrices.
891: \newblock {\em J. Mol. Biol.}{ \bf 292}, 195--202, 1999.
892: 
893: \bibitem{SOV99}
894: Zemla, A., Venclovas, C., Fidelis, K., and Rost, B.
\newblock A modified definition of {SOV}, a segment-based measure for protein
896:   secondary structure prediction assessment.
897: \newblock {\em Proteins}{ \bf 34}, 220--223, 1999.
898: 
899: \bibitem{CrooksANDBrenner2004}
900: Crooks, G.~E. and Brenner, S.~E.
901: \newblock Protein secondary structure: entropy, correlations and prediction.
902: \newblock {\em Bioinformatics}{ \bf 20}, 1603--1611, 2004.
903: 
904: \bibitem{KinjoANDNishikawa2004}
905: Kinjo, A.~R. and Nishikawa, K.
906: \newblock Eigenvalue analysis of amino acid substitution matrices reveals a
907:   sharp transition of the mode of sequence conservation in proteins.
908: \newblock {\em Bioinformatics}{ \bf 20}, 2504--2508, 2004.
909: 
910: \bibitem{PrzybylskiANDRost2002}
911: Przybylski, D. and Rost, B.
912: \newblock Alignments grow, secondary structure prediction improves.
913: \newblock {\em Proteins}{ \bf 46}, 197--205, 2002.
914: 
915: \bibitem{BaldiETAL1999}
916: Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G.
917: \newblock Exploiting the past and the future in protein secondary structure
918:   prediction.
919: \newblock {\em Bioinformatics}{ \bf 15}, 937--946, 1999.
920: 
921: \bibitem{PollastriETAL2002b}
922: Pollastri, G., Przybylski, D., Rost, B., and Baldi, P.
923: \newblock Improving the prediction of protein secondary structure in three and
924:   eight classes using recurrent neural networks and profiles.
925: \newblock {\em Proteins}{ \bf 47}, 228--235, 2002.
926: 
927: \bibitem{PollastriETAL2002}
928: Pollastri, G., Baldi, P., Fariselli, P., and Casadio, R.
929: \newblock Prediction of coordination number and relative solvent accessibility
930:   in proteins.
931: \newblock {\em Proteins}{ \bf 47}, 142--153, 2002.
932: 
933: \end{thebibliography}
934: \end{document}
935: