1: %\documentstyle[aps,preprint]{revtex}
2: %\documentstyle[twocolumn,aps,epsf]{revtex}
3: %%%%%%%%%%%%
4: \documentstyle{article}%
5: \hoffset=-2cm%
6: \voffset=-1cm%
7: %\def\acknowledgments{\bigskip\hspace{0.5cm}\parbox{15cm}}
8: \textwidth=16.5cm%
9: \textheight=22cm%
10: %\documentstyle[aps]{revtex}
11:
12: \begin{document}
13: %\input{psfig}
14: %\preprint
15: %\draft
16: \title{Self-Organizing Approach for Finding Borders \\
17: of DNA Coding Regions}
18: \author{Fang Wu$^1$ and Wei-Mou Zheng$^2$\\
19: {$^1$\it Department of Physics, Peking University, Beijing 100871,
20: China}\\
21: {$^2$\it Institute of Theoretical Physics, Academia Sinica,
22: Beijing 100080, China}}
23: %\author{Wei-Mou Zheng}
24: %\address{Institute of Theoretical Physics, Academia Sinica, Beijing 100080, China}
25: \date{}
26: \maketitle
27:
28: \begin{abstract}
29: A self-organizing approach is proposed for gene finding based on the
30: model of codon usage for coding regions and positional preference for
31: noncoding regions. The symmetry between the direct and reverse coding regions
32: is adopted for reducing the number of parameters. Without requiring prior
training, parameters are estimated by iteration. By employing the
sliding-window technique and a likelihood ratio, a very accurate segmentation is
obtained.
36: \end{abstract}
37:
38: %\pacs{}
39: \leftline{PACS number(s): 87.15.Cc, 87.14.Gg, 87.10.+e}%
40: %\begin{multicols}{2}
41: %\narrowtext
42:
43: %\section{Introduction}
44: %\subsection{}
The volume of raw DNA sequence data is increasing at a phenomenal pace, providing
a rich source of material for study. As a consequence, we now face the
tremendous challenge of extracting information from this formidable volume
of data. Computational methods for reliably detecting
protein-coding regions are becoming more and more important.
50:
51: Genome annotation by statistical methods is based on various statistical
52: models of genomic sequences \cite{fick,fick2}, one of the most popular
53: being the inhomogeneous, three-period Markov chain model for
54: protein-coding regions with an ordinary Markov model for noncoding
55: regions. The independent random chain model can be included in this
56: category by regarding it as a Markov chain of order 0. The codon usage model
57: is the independent random chain model of non-overlapping triplets, and
corresponds to an inhomogeneous Markov model of order 2. Signals
in a short segment are usually buried in large fluctuations. With
well-chosen parameters, statistical models work as noise filters that pick out
the signals.
62:
Methods based on local inhomogeneity, e.g. position asymmetry
or period-3 periodicity, suffer from such fluctuations.
65: Most of the current computer methods for locating genes require
66: some prior knowledge of the sequence's statistical properties such as the
67: codon usage or positional preference \cite{grant,fick3,karl}. That is, a
68: sizable training set is necessary for estimating good parameters of the
69: model in use \cite{boro1,boro2}.
70: Strongly biased by the training, such models have little power to discover
71: surprising or atypical features. Thus, it is desirable to decipher the
72: genomic information in an objective way. Audic and Claverie \cite{audic}
73: have proposed a method which does not require learning of species-specific
74: features from an arbitrary training set for predicting protein-coding
75: regions. They use an {\it ab initio} iterative Markov modeling procedure
76: to automatically partition genome sequences into direct coding, reverse
coding, and noncoding segments. This is an expectation-maximization (EM)
algorithm, which is useful in modeling with hidden variables and
alternates between an expectation step and a maximization step \cite{baldi,law,car}.
80: Such a self-organizing or adaptive approach uses all the available
81: unannotated genomic data for its calibration.
82:
Before introducing the model we use and describing the technical details, we
explain the EM algorithm with a simple pedagogical model which assumes
85: that a DNA sequence written in four letters $\{a, c, g, t\}$ is generated by
86: independent tosses of two four-sided dice. An
87: annotation maps the DNA sequence site-to-site to a two-letter sequence of
88: the alphabet $\{C, N\}$ ($C$ for coding and $N$ for noncoding). Two sets
89: $\{p_a, p_c, p_g, p_t\}$ and $\{q_a, q_c, q_g, q_t\}$ of positional nucleotide
90: probabilities are associated with the two dice $C$ and $N$, respectively.
91: The total probability for the given DNA sequence $S=s_1s_2\ldots$ to be seen
92: under the model is the partition or likelihood function
93: \begin{equation}
94: Z=\sum_H P(S|H_\alpha )=\sum_H \prod_i P(s_i|h_i^\alpha ),
95: \end{equation}
where the summation is over all the possible ``annotations''
$H=\{H_\alpha \}$ with $H_\alpha =h_1^\alpha h_2^\alpha \ldots$, $h_i^\alpha
\in\{N,C\}$, and $P(s|C)=p_s$, $P(s|N)=q_s$. The two unknown sets of
probabilities can be determined by maximizing the likelihood $Z$. From
Bayesian statistics
101: \begin{equation}
102: P(H_\alpha |S)=\frac {P(S|H_\alpha )P(H_\alpha )} {\sum_H P(S|H_\alpha )
103: P(H_\alpha )},
104: \end{equation}
with the prior $P(H_\alpha )$ assumed, the most probable $H_\alpha$ can then
be selected as the inferred annotation. Since coding regions are
organized in blocks, the first simplification is a window
coarse-graining. The sequence $S$ is divided into nonoverlapping window
segments of constant length $w$, and each window is assigned in its entirety
to either $N$ or $C$. Conducting a Bayesian analysis for window $W_j$
and assuming a uniform prior, we have
$P(h|W_j) \propto P(W_j|h) $.
The second simplification is to introduce a ``temperature'' $\tau$ (as in
simulated annealing),
replace $P(W_j|h_j)$ with $[P(W_j|h_j)]^{1/\tau}$ and take the limit
$\tau\to 0$. In this way we keep only a single term, i.e. the greatest one,
in the summation for $Z$. Window $W$ is inferred to belong to either $N$
or $C$ depending on whether $P(W|N)$ or $P(W|C)$ is larger. The likelihood
maximization is then equivalent to estimating the nucleotide probabilities by
the frequencies in the two window classes inferred from the pre-assumed $\{p_s\}$ and
$\{q_s\}$. Consistency requires
that the estimated probabilities be equal to $\{p_s\}$ and $\{q_s\}$.
This ``fixed point'' can be found by iteration. As an example, we use the
124: first $99\times 5\,051 = 500\,049$ nucleotides of the complete genome
of {\it E.~coli} as the input data. Statistical significance requires
126: that the window size cannot be too small, while a large window size would give
127: poor resolution in discriminating different regions. The window size is
128: chosen to be $w=99$. We assign the $5\,051$ fixed nonoverlapping windows
129: to the two subsets of $N$ and $C$ in either a periodic or a random way.
130: We estimate $p_s$ and $q_s$ from the counts of different nucleotides
131: in each subset. The likelihood functions for
132: each window are then calculated using the estimated $p_s$ and
133: $q_s$, and the assignment of the windows to $C$ or $N$ is updated
134: according to which of $P(W|C)$ and $P(W|N)$ is larger. This ends one
iteration. For different initializations the iteration converges, to a
precision of $10^{-4}$, to a single fixed point at around step 28, which also
gives the final assignment of the windows to $N$ and $C$.
The final $p_s$ and $q_s$ are $\{0.219, 0.270, 0.289,
0.222\}$ and $\{0.279, 0.213, 0.227, 0.281\}$. The $q_s$ estimated
from the complete genome are $\{ 0.285, 0.214, 0.218, 0.283\}$, which are
rather close to the corresponding converged values.
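
As a hedged illustration, one iteration of the scheme just described can be
written out explicitly. The count notation $n_s^{(j)}$ and the index sets
${\cal C}$ and ${\cal N}$ are introduced here only for this sketch and are not
part of the model definition. If $n_s^{(j)}$ denotes the number of occurrences
of nucleotide $s$ in window $W_j$, with $\sum_s n_s^{(j)}=w$, the independence
of the tosses gives
\[
\ln P(W_j|C)=\sum_{s} n_s^{(j)}\ln p_s ,\qquad
\ln P(W_j|N)=\sum_{s} n_s^{(j)}\ln q_s ,
\]
so that window $W_j$ is assigned to $C$ when
$\sum_s n_s^{(j)}\ln (p_s/q_s)>0$ and to $N$ when this sum is negative (ties,
which are not specified above, may be broken arbitrarily). Writing ${\cal C}$
and ${\cal N}$ for the resulting sets of window indices, both assumed
nonempty, the re-estimated probabilities are the pooled frequencies
\[
p_s'=\frac{\sum_{j\in{\cal C}} n_s^{(j)}}{w\,|{\cal C}|},\qquad
q_s'=\frac{\sum_{j\in{\cal N}} n_s^{(j)}}{w\,|{\cal N}|},
\]
and the fixed point is reached when $p_s'=p_s$ and $q_s'=q_s$ to within the
chosen precision.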
142:
143: More realistic models take the three phases in the coding regions and the
144: opposite ordering of the direct and reverse coding regions into
145: account. Such models adopt 7 subsets: one for noncoding (N), three for
146: direct coding (C$_1$, C$_2$, C$_3$) and three for reverse coding (C$_4$,
147: C$_5$, C$_6$). The subscript $i$ in C$_i$ indicates the phase 0, 1 or 2
according to $i$ (mod 3). From the genomic data statistics, we may assume
149: that there is symmetry between the direct and reverse coding regions, which
150: means that a reverse coding sequence is indistinguishable from a direct
151: coding sequence if we make the exchanges $a\leftrightarrow t$,
152: $c\leftrightarrow g$ and reverse the order. For the model based on the
153: positional preference of codons, instead of 7 sets of positional
154: nucleotide probabilities, we need only 4 sets. The reduction of the total
155: number of parameters by the symmetry consideration improves
the statistics. The iteration procedure is similar to that for the
previous model, the only difference being
that now we have to estimate 4 sets of probabilities and calculate 7
likelihood functions for each window.
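
The symmetry can be made explicit as follows; this is a sketch in notation
introduced here for illustration only. Let $p^{(k)}_s$ ($k=1,2,3$) be the
probability of nucleotide $s$ at codon position $k$ of a direct coding region,
and let $\bar s$ denote the complement of $s$ ($\bar a=t$, $\bar c=g$,
$\bar g=c$, $\bar t=a$). A codon of a reverse coding region, read along the
direct strand, appears reversed and complemented, so the corresponding
positional probabilities $r^{(k)}_s$ would follow from
\[
r^{(k)}_s = p^{(4-k)}_{\bar s},\qquad k=1,2,3 .
\]
Since each set of four probabilities sums to one, the number of free
parameters drops from $7\times 3=21$ to $4\times 3=12$.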
160:
We actually use a more refined model based on codon usage, which requires a set of
64 probabilities for coding regions. For noncoding segments, 4
positional nucleotide probabilities are used just as before. To simplify
the programming, we move the windows with a nonzero phase-shift by one
or two nucleotides to clear the phase-shift, although we could instead calculate the
marginal distribution probabilities for uni- and bi-nucleotides. For example,
167: we replace the window $W=s_is_{i+1}\ldots s_{i+w-1}$ marked as C$_2$ with
$W'=s_{i+1}s_{i+2}\ldots s_{i+w}$. (The
alternative way is to consider a cyclic transformation.) Our further
discussions are all based on this model. It is observed that the iteration
also converges quickly to a fixed point. In contrast to the two-set model, where
coding and noncoding are symmetric and extra knowledge is required to
relate one set to coding and the other to noncoding, we can now clearly
distinguish coding from noncoding regions, even with their phases fixed.
Direct and reverse coding sets are symmetric in the model.
However, the fact that the stop codons $taa$, $tag$ and $tga$ are rare can
be used to remove the symmetry between direct and reverse coding. That
is, if the converged probabilities for $taa$, $tag$ and $tga$ are all
significantly smaller than the other 61, sets C$_1$, C$_2$ and
C$_3$ do indeed correspond to direct coding. (Otherwise, those of the
reverse complements $tta$, $cta$ and $tca$ would be small instead.)
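
For concreteness, the window likelihoods in the codon usage model may be
written as follows. This is a sketch only: the symbols $\pi_{xyz}$ for the
codon probabilities, $q_s$ for the noncoding nucleotide probabilities and the
bar for the complementary nucleotide ($\bar a=t$, $\bar c=g$, $\bar g=c$,
$\bar t=a$) are notational assumptions made here. For a window
$W=s_1s_2\ldots s_w$ whose phase-shift has been cleared, with $w$ a multiple
of 3,
\begin{eqnarray*}
\ln P(W|{\rm direct}) &=& \sum_{k=0}^{w/3-1}\ln \pi_{s_{3k+1}s_{3k+2}s_{3k+3}},\\
\ln P(W|{\rm reverse}) &=& \sum_{k=0}^{w/3-1}\ln \pi_{\bar s_{3k+3}\bar s_{3k+2}\bar s_{3k+1}},\\
\ln P(W|N) &=& \sum_{i=1}^{w}\ln q_{s_i}.
\end{eqnarray*}
For $w=99$ each coding likelihood is a product over 33 codons, and the model
has $63+3=66$ free parameters in total. The second line also shows why the
roles of $taa$, $tag$, $tga$ and of $tta$, $cta$, $tca$ are interchanged when
the direct and reverse labels are swapped.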
182:
We employ the sliding-window technique to improve the resolution as follows.
We shift each window by 3 nucleotides, initialize the window assignment with
the converged probabilities just obtained, and then find new assignments
for the shifted windows by iteration. We repeat the shifting process 32
times to cover the window width. This yields 33 assignments for each
triplet, except for a few sites at the two ends. By a majority vote we
then obtain a triplet assignment of the whole sequence.
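
The count of 33 follows from simple arithmetic, sketched here under the
assumption that after $k$ shifts the windows start at sites $99j+3k+1$
($j=0,1,\ldots$). With $w=99=33\times 3$ the possible offsets are
\[
3k\in\{0,3,6,\ldots,96\},\qquad 96/3+1=33 ,
\]
and one further shift ($k=33$) would simply reproduce the original window grid.
For each $k$ a given triplet away from the sequence ends lies in exactly one
window, so it collects one assignment per offset, i.e. $32+1=33$ in total. An
absolute majority then corresponds to at least 17 of the 33 votes.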
190:
191: Recently, an entropic segmentation method that uses the Jensen-Shannon
192: measure for sequences of a 12-letter alphabet has been proposed to find
borders between coding and noncoding regions \cite{stan}. Its best
result was obtained on the genome of the bacterium {\it Rickettsia
prowazekii}. We test our approach on the same genome data.
To assess the accuracy we obtain the ``true''
197: assignment of sites based on the known annotation as follows. If a
198: nucleotide is in a noncoding region, it belongs to N. If it is in a coding
199: (or reverse coding) region and the site-index of the beginning nucleotide
200: plus 1 is congruent to $i$ modulo 3, the nucleotide under consideration
201: will belong to C$_{1+i}$ (or C$_{4+i}$). For overlapping coding zones we
may keep two alternative assignments. We define three rates of accuracy
$R_2$, $R_3$ and $R_7$: $R_2$ only discriminates coding from noncoding
segments, $R_3$ discriminates noncoding, direct coding and reverse coding
while ignoring the phases, and $R_7$ covers the full discrimination of the 7
sets. For the total of $N =1\,111\,523$ nucleotides, we obtain
$R_2=91.7\%$, $R_3=89.8\%$ and $R_7=89.7\%$. (The rates without window sliding
are $R_2=89.1\%$, $R_3=84.8\%$ and $R_7=84.4\%$.)
208:
For finding block borders, to eliminate illusory fluctuations
we accept only the assignments for which all 33 samplings are identical, and
regard the others as undetermined. When two adjacent identified blocks have
the same assignment we join them, together with the sites between, into a
single zone of that assignment. Otherwise, we take the middle site of
the intervening undetermined zone as the border, and assign the two sides
according to their corresponding flanking blocks. We can do
the job better by means of a likelihood ratio. Suppose that the left
217: block is assigned to $l$, and the right to $r$. A point $m$ in the
218: intervening zone divides the zone into two segments $L_m$ and $R_m$. The
219: likelihood ratio is defined as
220: \begin{equation}
221: \Gamma_m =\frac {P(L_m|l)P(R_m|r)}{P(L_m|r)P(R_m|l)}.
222: \end{equation}
The border is placed at the point $m$ that maximizes $\Gamma_m$. This segmentation
finally gives the accuracy rates $R_2=93.3\%$, $R_3=92.8\%$ and $R_7=92.7\%$.
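
In practice it may be convenient to work with $\ln\Gamma_m$. As a sketch,
assume (this is an assumption made here for illustration, not a statement
about the implementation) that the likelihood of the intervening zone $Z$
factorizes across the cut, $P(Z|h)=P(L_m|h)P(R_m|h)$ for $h=l,r$, as it does
for independent sites, or for independent codons when $m$ respects the codon
frame. Then
\[
\ln\Gamma_m = 2\left[\ln P(L_m|l)+\ln P(R_m|r)\right]-\ln P(Z|l)-\ln P(Z|r),
\]
where the last two terms do not depend on $m$. Under this assumption,
maximizing $\Gamma_m$ is equivalent to maximizing the joint likelihood
$P(L_m|l)P(R_m|r)$ of the zone with the border placed at $m$.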
225:
In Ref.~\cite{stan} the coincidence between the
borders inferred from the segmentation and those from the known annotation is
quantified by
229: \begin{equation}
230: D=\frac 1{2N}\left[ \sum_i \min_j|b_i-c_j|+\sum_j \min_i|b_i-c_j|
231: \right],
232: \end{equation}
233: where $\{b_i\}$ is the set of all borders between coding and noncoding
234: regions, and $\{c_j\}$ is the set of all cuts produced by the
segmentation. We use an even more stringent version of $D$ by interpreting $\{b_i\}$
and $\{c_j\}$ as the borders of all coding zones. That is, we include
the borders of each overlapping coding zone. The total number of ``CDS'' in
the annotation is 834, one of which has two joint zones. We obtain
$1-D=87.7\%$, compared with $\sim 80\%$ in Ref.~\cite{stan}. In Fig.~1 we show
a comparison of the inferred segmentation with the known coding regions.
241: In the section from $475\,500$ to $497\,500$ there are two overlaps (one
242: for direct, and the other for reverse coding regions), and the shortest gap
243: separating adjacent coding regions is just 1 nucleotide (at $486\,215$). They
do not escape detection. As mentioned in Ref.~\cite{stan}, there are
two very close coding regions in the same phase ($538\,197:539\,879$ and
$539\,937:540\,887$). The result of the majority vote for this section is shown in
Fig.~2. We do see a peak in the counts for set N between the two coding
regions. The highest count for N is 32, short of the 33 identical samplings
we require, and so the peak is ignored in our strategy.
There is indeed plenty of room for improving this approach. A larger width $w=123$
250: gives higher accuracy rates: $R_2=93.6\%$, $R_3=93.3\%$, $R_7=93.0\%$ and
$1-D= 88.2\%$. When we consider only the triplets with all 33 assignments
identical in window sliding, the rates are $R_2=98.7\%$, $R_3=98.6\%$ and
$R_7=98.6\%$. In the above we avoid setting up an arbitrary cut-off threshold.
254: If a threshold of 17 counts is used to determine the segments whose central
255: parts have 33 identical samplings,
256: for $w=99$ we predict a total of $1\,001\,351\ (90.1\%)$ sites with
257: accuracies $R_2= 97.4\%$ and $R_3= 95.4\%$. The accuracy rate for noncoding
258: regions is 96.5\%, much
259: higher than that of Ref.~\cite{audic}. It is important and feasible to
integrate biological signals into our algorithm. We expect that our algorithm,
with certain modifications, will work well for other species, too.
262:
263: %\acknowledgments
264: \begin{quotation}
265: { This work was supported in part by the Special Funds for Major National
266: Basic Research Projects, the National Natural Science Foundation
267: of China and Research Project 248 of Beijing.}
268: \end{quotation}
269:
270: % REFERENCES
271: \begin{thebibliography}{99}
272: %\begin{references}
273:
274: \bibitem{fick} J.W. Fickett, Comput. Chem. {\bf 20}, 103 (1996).
275: \bibitem{fick2} J.W. Fickett and C.S. Tung, Nucleic Acids Res. {\bf 20},
276: 6441 (1992).
277: \bibitem{grant} R.~Grantham, C.~Gautier, M.~Gouy, M.~Jacobzone, and
R.~Mercier, Nucleic Acids Res. {\bf 9}, R43 (1981).
279: \bibitem{fick3} J.~W.~Fickett, Nucleic Acids Res. {\bf 10},
280: 5303 (1982).
281: \bibitem{karl} S.~Karlin and J.~Mrazek, J.~Mol. Biol. {\bf 262}, 459 (1996).
282: \bibitem{boro1} M. Borodovsky and J. D. McIninch, Comput. Chem. {\bf 17},
283: 123 (1993).
284: \bibitem{boro2} M. Borodovsky, J. D. McIninch, E.~V.~Koonin, K.~E.~Rudd,
285: C.~Medigue, and A.~Danchin, Nucleic Acids Res. {\bf 23}, 3554 (1995).
286: \bibitem{audic} S. Audic and J.-M. Claverie, Proc. Natl. Acad. Sci. USA,
287: {\bf 95}, 10026 (1998).
\bibitem{baldi} P. Baldi, Bioinformatics {\bf 16}, 367 (2000); P.~Baldi
and S.~Brunak, {\it Bioinformatics: The Machine Learning Approach} (The MIT
Press, Cambridge, MA, 1998).
291: \bibitem{law} C. E. Lawrence and A. A. Reilly, Proteins {\bf 7}, 41
292: (1990).
293: \bibitem{car} L. R. Cardon and G. D. Stormo, J. Mol. Biol. {\bf 223},
294: 159 (1992).
295: \bibitem{stan} P. Bernaola-Galv\'an, I. Grosse, P. Carpena, J.L. Oliver,
296: R. Rom\'an-Rold\'an, and H.E. Stanley, Phys. Rev. Lett. {\bf 85}, 1342 (2000).
297: %\end{references}
298: \end{thebibliography}
299:
300: %\newpage
301: % FIGURE CAPTIONS
302: \begin{figure}[hb]
303: \caption{
304: %Fig.~1
305: Comparison between the inferred segmentation (dotted lines) and the known
306: coding regions of {\it Rickettsia} (shaded areas).}
307: %\label{fig1}
308: \end{figure}
309:
310: \begin{figure}[ht]
311: \caption{
312: %Fig.~2
313: Counts of majority assignment in the section containing two very close
314: coding regions (shaded areas) in the same phase. A peak corresponding to
315: noncoding assignment is clearly seen.}
316: %\label{fig2}
317: \end{figure}
318:
319: \end{document}
320: