
1:  %\documentstyle[aps,preprint]{revtex}

2:  %\documentstyle[twocolumn,aps,epsf]{revtex}

3: %%%%%%%%%%%%

4: \documentstyle{article}%

5:       \hoffset=-2cm%

6:     \voffset=-1cm%

7: 	%\def\acknowledgments{\bigskip\hspace{0.5cm}\parbox{15cm}}

8:   \textwidth=16.5cm%

9:   \textheight=22cm%

10:   %\documentstyle[aps]{revtex}

11:

12: \begin{document}

13: %\input{psfig}

14: %\preprint

15: %\draft

16: \title{Self-Organizing Approach for Finding Borders \\

17: of DNA Coding Regions}

18: \author{Fang Wu$^1$ and Wei-Mou Zheng$^2$\\

19:   {$^1$\it Department of Physics, Peking University, Beijing 100871,

20:   China}\\

21:   {$^2$\it Institute of Theoretical Physics, Academia Sinica,

22:   Beijing 100080, China}}

23: %\author{Wei-Mou Zheng}

24: %\address{Institute of Theoretical Physics, Academia Sinica, Beijing 100080, China}

25: \date{}

26: \maketitle

27:

28: \begin{abstract}

29: A self-organizing approach is proposed for gene finding based on the

30: model of codon usage for coding regions and positional preference for

31: noncoding regions. The symmetry between the direct and reverse coding regions

32: is adopted for reducing the number of parameters. Without requiring prior

33: training, parameters are estimated by iteration. By employing the window

34: sliding technique and likelihood ratio, a very accurate segmentation is

35: obtained.

36: \end{abstract}

37:

38: %\pacs{}

39: \leftline{PACS number(s): 87.15.Cc, 87.14.Gg, 87.10.+e}%

40: %\begin{multicols}{2}

41: %\narrowtext

42:

43: %\section{Introduction}

44: %\subsection{}

45: The volume of raw DNA sequence data is increasing at a phenomenal pace, providing

46: a rich source of data to study. As a consequence, we now face the

47: tremendous challenge of extracting information from the formidable volume

48: of DNA sequence data. Computational methods for reliably detecting

49: protein-coding regions are becoming more and more important.

50:

51: Genome annotation by statistical methods is based on various statistical

52: models of genomic sequences \cite{fick,fick2}, one of the most popular

53: being the inhomogeneous, three-period Markov chain model for

54: protein-coding regions with an ordinary Markov model for noncoding

55: regions. The independent random chain model can be included in this

56: category by regarding it as a Markov chain of order 0. The codon usage model

57: is the independent random chain model of non-overlapping triplets, and

58: corresponds to an inhomogeneous Markov model of order 2. Signals

59: in a short segment are usually buried in large fluctuations. With well

60: chosen parameters statistical models work as a noise filter to pick out

61: the signals.

62:

63: Methods based on local inhomogeneity, e.g. position asymmetry

64: or periodicity of period 3, suffer from fluctuations.

65: Most of the current computer methods for locating genes require

66: some prior knowledge of the sequence's statistical properties such as the

67: codon usage or positional preference \cite{grant,fick3,karl}. That is, a

68: sizable training set is necessary for estimating good parameters of the

69: model in use \cite{boro1,boro2}.

70: Strongly biased by the training, such models have little power to discover

71: surprising or atypical features. Thus, it is desirable to decipher the

72: genomic information in an objective way. Audic and Claverie \cite{audic}

73: have proposed a method which does not require learning of species-specific

74: features from an arbitrary training set for predicting protein-coding

75: regions. They use an {\it ab initio} iterative Markov modeling procedure

76: to automatically partition genome sequences into direct coding, reverse

77: coding, and noncoding segments. This is an expectation-maximization (EM)

78: algorithm, which is useful in modeling with hidden variables, and is

79: performed in two steps of expectation and maximization \cite{baldi,law,car}.

80: Such a self-organizing or adaptive approach uses all the available

81: unannotated genomic data for its calibration.

82:

83: Before introducing the model we use and describing the technical details, we

84: explain the EM algorithm with a simple pedagogic model which assumes

85: that a DNA sequence written in four letters $\{a, c, g, t\}$ is generated by

86: independent tosses of two four-sided dice. An

87: annotation maps the DNA sequence site-to-site to a two-letter sequence of

88: the alphabet $\{C, N\}$ ($C$ for coding and $N$ for noncoding). Two sets

89: $\{p_a, p_c, p_g, p_t\}$ and $\{q_a, q_c, q_g, q_t\}$ of positional nucleotide

90: probabilities are associated with the two dice $C$ and $N$, respectively.

91: The total probability for the given DNA sequence $S=s_1s_2\ldots$ to be seen

92: under the model is the partition or likelihood function

93: \begin{equation}

94: Z=\sum_H P(S|H_\alpha )=\sum_H \prod_i P(s_i|h_i^\alpha ),

95: \end{equation}

96: where the summation is over all the possible ``annotations"

97: $H=\{H_\alpha \}$ with $H_\alpha =h_1^\alpha h_2^\alpha \ldots$, $h_i^\alpha

98: \in\{N,C\}$, and $P(s|C)=p_s$, $P(s|N)=q_s$. The unknown two sets of

99: probabilities can be determined by maximizing the likelihood $Z$. From

100: Bayesian statistics

101: \begin{equation}

102: P(H_\alpha |S)=\frac {P(S|H_\alpha )P(H_\alpha )} {\sum_H P(S|H_\alpha )

103: P(H_\alpha )},

104: \end{equation}

105: with prior $P(H_\alpha )$ assumed, the most probable $H_\alpha$ can then

106: be selected as the inferred annotation. As we know, coding regions are

107: organized in blocks. The first simplification is the window

108: coarse-graining. The sequence $S$ is divided into nonoverlapping window

109: segments of constant length $w$, and each whole window is entirely assigned

110: to either $N$ or $C$. Conducting Bayesian analysis for window $W_j$

111: and accepting uniform prior, we have

112: $P(h|W_j) \propto P(W_j|h) $.

113: The second simplification is to introduce ``temperature" $\tau$ (as in

114: the simulated annealing),

115: replace $P(W_j|h_j)$ with $[P(W_j|h_j)]^{1/\tau}$ and take the limit

116: $\tau\to 0$. In this way we keep only a single term, i.e. the greatest one,

117: in the summation for $Z$. Window $W$ is inferred to belong to either $N$

118: or $C$ depending on whether $P(W|N)$ or $P(W|C)$ is larger. The likelihood

119: maximization is then equivalent to estimating nucleotide probabilities with

120: frequencies in two window classes inferred from the pre-assumed $\{p_s\}$ and

121: $\{q_s\}$. Consistency requires

122: that the estimated probabilities must be equal to $\{p_s\}$ and $\{q_s\}$.

123: This ``fixed point" can be found by iteration. As an example, we use the

124: first $99\times 5\,051 = 500\,049$ nucleotides of the complete genome

125: of {\it E.~coli} as the input data. Statistical significance requires

126: that the window size cannot be too small, while a large window size would give

127: poor resolution in discriminating different regions. The window size is

128: chosen to be $w=99$. We assign the $5\,051$ fixed nonoverlapping windows

129: to the two subsets of $N$ and $C$ in either a periodic or a random way.

130: We estimate $p_s$ and $q_s$ from the counts of different nucleotides

131: in each subset. The likelihood functions for

132: each window are then calculated using the estimated $p_s$ and

133: $q_s$, and the assignment of the windows to $C$ or $N$ is updated

134: according to which of $P(W|C)$ and $P(W|N)$ is larger. This ends one

135: iteration. For different initializations, the iteration converges to the same fixed point,

136: to a precision of $10^{-4}$, at around step 28, and also yields a final

137: assignment of the windows to $N$ and $C$.

138: The final $p_s$ and $q_s$ are $\{0.219, 0.270, 0.289,

139: 0.222\}$ and $\{0.279, 0.213, 0.227, 0.281\}$. The $q_s$ estimated

140: from the complete genome are $\{ 0.285, 0.214, 0.218, 0.283\}$, which are

141: rather close to the corresponding convergent values.
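
The iteration described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the program used for the reported results: the names {\tt estimate}, {\tt log\_likelihood} and {\tt hard\_em} are hypothetical, the sequence is assumed to be a lowercase string over $\{a, c, g, t\}$, a tiny pseudocount is added to avoid $\log 0$, and convergence is declared when the window assignment stops changing rather than by the $10^{-4}$ precision criterion on the probabilities.

```python
import math
import random
from collections import Counter

ALPHABET = "acgt"
PSEUDO = 1e-9  # assumption: tiny pseudocount so that log(0) never occurs


def estimate(windows, labels, label):
    """Nucleotide frequencies over all windows currently assigned to `label`."""
    counts = Counter()
    for win, lab in zip(windows, labels):
        if lab == label:
            counts.update(win)
    total = sum(counts[s] for s in ALPHABET) + len(ALPHABET) * PSEUDO
    return {s: (counts[s] + PSEUDO) / total for s in ALPHABET}


def log_likelihood(window, probs):
    """log P(W|h) for independent sites."""
    return sum(math.log(probs[s]) for s in window)


def hard_em(seq, w=99, labels=None, max_iter=100, seed=0):
    """Zero-temperature iteration on fixed nonoverlapping windows of length w."""
    windows = [seq[j:j + w] for j in range(0, len(seq) - w + 1, w)]
    if labels is None:  # random initial assignment
        rng = random.Random(seed)
        labels = [rng.choice("CN") for _ in windows]
    labels = list(labels)
    for _ in range(max_iter):
        p = estimate(windows, labels, "C")
        q = estimate(windows, labels, "N")
        new = ["C" if log_likelihood(win, p) >= log_likelihood(win, q) else "N"
               for win in windows]
        if new == labels:  # fixed point of the assignment
            break
        labels = new
    return estimate(windows, labels, "C"), estimate(windows, labels, "N"), labels
```

As in the text, the labels $C$ and $N$ are interchangeable in this two-set model, so which set corresponds to coding must be decided with extra knowledge.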

142:

143: More realistic models take the three phases in the coding regions and the

144: opposite ordering of the direct and reverse coding regions into

145: account. Such models adopt 7 subsets: one for noncoding (N), three for

146: direct coding (C$_1$, C$_2$, C$_3$) and three for reverse coding (C$_4$,

147: C$_5$, C$_6$). The subscript $i$ in C$_i$ indicates the phase 0, 1 or 2

148: according to $i$ (mod 3). From the genomic data statistics, we may assume

149: that there is symmetry between the direct and reverse coding regions, which

150: means that a reverse coding sequence is indistinguishable from a direct

151: coding sequence if we make the exchanges $a\leftrightarrow t$,

152: $c\leftrightarrow g$ and reverse the order. For the model based on the

153: positional preference of codons, instead of 7 sets of positional

154: nucleotide probabilities, we need only 4 sets. The reduction of the total

155: number of parameters by the symmetry consideration improves

156: the statistics. The procedure of

157: iteration is similar to that for the last model, the only difference being

158: that now we have to estimate 4 sets of probabilities and calculate 7

159: likelihood functions for a window.
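
To illustrate how the symmetry reduces 7 sets to 4, the following Python sketch computes the 7 log-likelihoods of a window from one noncoding set and three codon-position sets, scoring the reverse sets on the reverse complement of the window. The function names, the label strings and the particular phase convention (which codon position the first site of the window is taken to occupy) are assumptions made for this illustration only.

```python
import math

COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Exchange a<->t, c<->g and reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]


def seven_log_likelihoods(window, codon_pos, noncoding):
    """codon_pos: three dicts of nucleotide probabilities, one per codon position.
    noncoding: one dict of nucleotide probabilities.
    Returns a dict of log-likelihoods for N, C1..C3 (direct) and C4..C6 (reverse)."""
    scores = {"N": sum(math.log(noncoding[s]) for s in window)}
    rc = reverse_complement(window)
    for phase in range(3):
        # assumed convention: site k of the window sits at codon position (k + phase) mod 3
        scores[f"C{phase + 1}"] = sum(
            math.log(codon_pos[(k + phase) % 3][s]) for k, s in enumerate(window))
        # reverse sets reuse the direct parameters on the reverse complement
        scores[f"C{phase + 4}"] = sum(
            math.log(codon_pos[(k + phase) % 3][s]) for k, s in enumerate(rc))
    return scores
```

The window is then assigned to the set with the largest score, exactly as in the two-set case.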

160:

161: We use a better model based on the codon usage. We now need a set of

162: 64 probabilities for coding regions. For noncoding segments, 4

163: positional nucleotide probabilities are used just as before. To simplify

164: the programming, we move the windows with a phase-shift other than zero by one

165: or two nucleotides to clear the phase-shift, although we can calculate the

166: marginal distribution probabilities for uni- and bi-nucleotides. For example,

167: we replace the window $W=s_is_{i+1}\ldots s_{i+w-1}$ marked as C$_2$ with

168: $W'=s_{i+1}s_{i+2}\ldots s_{i+w}$. (The

169: alternative way is to consider a cyclic transformation.) Our further

170: discussions are all based on this model. It is observed that the iteration

171: also quickly converges to a fixed point. In contrast to the two-set model, where

172: coding and noncoding are symmetric, and extra knowledge is required to

173: relate one set to coding and the other to noncoding, we can now distinctly

174: distinguish coding from noncoding regions, even with their phases fixed.

175: Direct and reverse coding sets are symmetric in the model.

176: However, the fact that stop codons $taa$, $tag$ and $tga$ are rare can

177: be used to remove the symmetry between direct and reverse coding. That

178: is, if the convergent probabilities for $taa$, $tag$ and $tga$ are all

179: significantly small in comparison with the other 61, sets C$_1$, C$_2$ and

180: C$_3$ then do indeed correspond to direct coding. (Otherwise, those of $tta$,

181: $cta$ and $tca$ would be small instead.)

182:

183: We employ the sliding window technique to improve the resolution as follows.

184: We shift each window by 3 nucleotides, initiate the window assignment with

185: the convergent probabilities just obtained, and then find new assignments

186: for the shifted windows by iteration. We repeat the shifting process 32

187: times to cover the window width. This ends with 33 assignments for

188: triplets, except for a few sites at the two ends. By a majority vote we

189: can obtain a triplet assignment of the whole sequence.
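
The majority vote over the shifted window assignments can be sketched as follows. The sketch assumes that each of the shifted runs has already been expanded into one label per triplet, so that all runs have the same length; the function name {\tt majority\_vote} and the label strings are hypothetical. Ties are broken in favour of the label seen first, which is a choice made here and not specified in the text.

```python
from collections import Counter


def majority_vote(assignments):
    """assignments: list of label lists, one list per window shift,
    each giving one label per triplet of the sequence.
    Returns a list of (label, count) pairs, one per triplet."""
    result = []
    for votes in zip(*assignments):
        label, count = Counter(votes).most_common(1)[0]
        result.append((label, count))
    return result
```

Keeping the vote count alongside the label makes it straightforward to single out triplets on which all shifted assignments agree.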

190:

191: Recently, an entropic segmentation method that uses the Jensen-Shannon

192: measure for sequences of a 12-letter alphabet has been proposed to find

193: borders between coding and noncoding regions \cite{stan}. Their best

194: result was obtained on the genome of the bacterium {\it Rickettsia

195: prowazekii}. We test our approach with the same genome data.

196: To inspect the accuracy we obtain the ``true"

197: assignment of sites based on the known annotation as follows. If a

198: nucleotide is in a noncoding region, it belongs to N. If it is in a coding

199: (or reverse coding) region and the site-index of the beginning nucleotide

200: plus 1 is congruent to $i$ modulo 3, the nucleotide under consideration

201: will belong to C$_{1+i}$ (or C$_{4+i}$). For overlapping coding zones we

202: may keep two alternative assignments. We define three rates of accuracy

203: $R_2$, $R_3$ and $R_7$: $R_2$ only discriminates coding from noncoding

204: segments while $R_7$ covers full discrimination of the 7 sets, and $R_3$

205: ignores the phases. For the total $N =1\,111\,523$ nucleotides, we obtain

206: $R_2=91.7\%$, $R_3=89.8\%$ and $R_7=89.7\%$. (The rates without window sliding

207: are $R_2=89.1\%$, $R_3=84.8\%$ and $R_7=84.4\%$.)
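
For concreteness, the three accuracy rates can be computed from per-site labels as in the Python sketch below. It assumes a single ``true" label per site, so the alternative assignments kept for overlapping coding zones are not handled, and it uses the hypothetical label strings {\tt N}, {\tt C1}, \ldots, {\tt C6}.

```python
def accuracy_rates(true, pred):
    """Return (R2, R3, R7) as fractions for two equally long label sequences."""

    def strand(label):
        # three classes: noncoding, direct coding (C1..C3), reverse coding (C4..C6)
        if label == "N":
            return "N"
        return "direct" if label in ("C1", "C2", "C3") else "reverse"

    n = len(true)
    pairs = list(zip(true, pred))
    r2 = sum((t == "N") == (p == "N") for t, p in pairs) / n  # coding vs noncoding
    r3 = sum(strand(t) == strand(p) for t, p in pairs) / n    # phases ignored
    r7 = sum(t == p for t, p in pairs) / n                    # all 7 sets
    return r2, r3, r7
```

By construction $R_2 \ge R_3 \ge R_7$, in agreement with the ordering of the rates quoted above.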

208:

209: For finding block borders, to eliminate illusory fluctuations

210: we accept only the assignments for which all 33 samplings are identical, and

211: regard others as undetermined. When two adjacent identified blocks are of

212: the same assignment we join the two together with the sites between into a

213: single zone of the same assignment. Otherwise, we take the middle site of

214: the intervening undetermined zone as the border, and assign the two sides

215: according to their corresponding flank blocks. We can do

216: the job better by means of the likelihood ratio. Suppose that the left

217: block is assigned to $l$, and the right to $r$. A point $m$ in the

218: intervening zone divides the zone into two segments $L_m$ and $R_m$. The

219: likelihood ratio is defined as

220: \begin{equation}

221: \Gamma_m =\frac {P(L_m|l)P(R_m|r)}{P(L_m|r)P(R_m|l)}.

222: \end{equation}

223: The maximal $\Gamma_m$ places the border at $m$. This segmentation

224: finally gives the accuracy rates $R_2=93.3\%$, $R_3=92.8\%$ and $R_7=92.7\%$.
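
The search for the border that maximizes $\Gamma_m$ can be sketched as below. Working with logarithms, $\log\Gamma_m$ is the sum of $\log P(s|l)-\log P(s|r)$ over $L_m$ minus the same sum over $R_m$, so a single pass with a running prefix sum suffices. The sketch assumes that the symbols of the zone (nucleotides or codons) are independent under both models and that their log-probabilities are supplied as dictionaries; the name {\tt best\_border} is hypothetical.

```python
def best_border(zone, logp_l, logp_r):
    """zone: sequence of symbols in the undetermined zone.
    logp_l, logp_r: dicts mapping a symbol to its log-probability under
    the left-block model l and the right-block model r.
    Returns (m, log Gamma_m) with L_m = zone[:m] and R_m = zone[m:]."""
    d = [logp_l[s] - logp_r[s] for s in zone]
    total = sum(d)
    best_m, best_val = 0, -total  # m = 0: L_m is empty, R_m is the whole zone
    prefix = 0.0
    for m in range(1, len(zone) + 1):
        prefix += d[m - 1]
        val = 2.0 * prefix - total  # sum over L_m minus sum over R_m
        if val > best_val:
            best_m, best_val = m, val
    return best_m, best_val
```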

225:

226: In Ref.~\cite{stan} the quantity quantifying the coincidence between

227: borders inferred from the segmentation and those from the known annotation is

228: defined by

229: \begin{equation}

230: D=\frac 1{2N}\left[ \sum_i \min_j|b_i-c_j|+\sum_j \min_i|b_i-c_j|

231: \right],

232: \end{equation}

233: where $\{b_i\}$ is the set of all borders between coding and noncoding

234: regions, and $\{c_j\}$ is the set of all cuts produced by the

235: segmentation. We use an even harsher quantity $D$ by interpreting $\{b_i\}$

236: and $\{c_j\}$ as the borders of all coding zones. That is, we include

237: borders of each overlapping coding zone. The total number of ``CDS" in

238: the annotation is 834, one of which has two joint zones. We obtain

239: $1-D=87.7\%$, compared with $\sim 80\%$ of Ref.~\cite{stan}. In Fig.~1 we show

240: a comparison of the inferred segmentation with the known coding regions.

241: In the section from $475\,500$ to $497\,500$ there are two overlaps (one

242: for direct, and the other for reverse coding regions), and the shortest gap

243: separating adjacent coding regions is just 1 nucleotide (at $486\,215$). They

244: do not escape detection. As mentioned in \cite{stan}, there are

245: two very close coding regions in the same phase ($538\,197:539\,879$ and

246: $539\,937:540\,887$). The result from the majority vote is shown in Fig.~2 for

247: the section. We see indeed a peak of the counts for set N between the two coding

248: regions. The highest count for N is 32, short of the 33 identical samplings we require, and so the peak is ignored in our strategy.

249: There is indeed plenty of room for improving this approach. A larger width $w=123$

250: gives higher accuracy rates: $R_2=93.6\%$, $R_3=93.3\%$, $R_7=93.0\%$ and

251: $1-D= 88.2\%$. When we consider only the triplets with all 33 assignments

252: identical in window sliding the rates are $R_2=98.7\%$, $R_3=98.6\%$ and

253: $R_7=98.6\%$. In the above we avoid setting up an arbitrary cut-off threshold.

254: If a threshold of 17 counts is used to determine the segments whose central

255: parts have 33 identical samplings,

256: for $w=99$ we predict a total of $1\,001\,351\ (90.1\%)$ sites with

257: accuracies $R_2= 97.4\%$ and $R_3= 95.4\%$. The accuracy rate for noncoding

258: regions is 96.5\%, much

259: higher than that of Ref.~\cite{audic}. It is important and feasible to

260: integrate biological signals into our algorithm. We expect that our algorithm,

261: with certain modifications, will work well for other species, too.
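
As an aside, the border-coincidence quantity $D$ defined above is straightforward to evaluate. The Python sketch below is a direct transcription of the formula for two nonempty lists of positions; the function name {\tt border\_mismatch} is hypothetical, and the quadratic-time search is used only for clarity.

```python
def border_mismatch(borders, cuts, n):
    """D = (1/2N) [ sum_i min_j |b_i - c_j| + sum_j min_i |b_i - c_j| ].
    borders: annotated border positions b_i; cuts: inferred cut positions c_j;
    n: total sequence length N. Both lists must be nonempty."""
    s1 = sum(min(abs(b - c) for c in cuts) for b in borders)
    s2 = sum(min(abs(b - c) for b in borders) for c in cuts)
    return (s1 + s2) / (2 * n)
```

For example, with borders at 100 and 500, cuts at 110, 480 and 900, and $N=1000$, the two sums are 30 and 430, giving $D=0.23$.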

262:

263: %\acknowledgments

264: \begin{quotation}

265: { This work was supported in part by the Special Funds for Major National

266: Basic Research Projects, the National Natural Science Foundation

267: of China and Research Project 248 of Beijing.}

268: \end{quotation}

269:

270: % REFERENCES

271: \begin{thebibliography}{99}

272: %\begin{references}

273:

274: \bibitem{fick} J.W. Fickett, Comput. Chem. {\bf 20}, 103 (1996).

275: \bibitem{fick2} J.W. Fickett and C.S. Tung, Nucleic Acids Res. {\bf 20},

276: 6441 (1992).

277: \bibitem{grant} R.~Grantham, C.~Gautier, M.~Gouy, M.~Jacobzone, and

278: R.~Mercier, Nucleic Acids Res. {\bf 9}, R43 (1981).

279: \bibitem{fick3} J.~W.~Fickett, Nucleic Acids Res. {\bf 10},

280: 5303 (1982).

281: \bibitem{karl} S.~Karlin and J.~Mrazek, J.~Mol. Biol. {\bf 262}, 459 (1996).

282: \bibitem{boro1} M. Borodovsky and J. D. McIninch, Comput. Chem. {\bf 17},

283: 123 (1993).

284: \bibitem{boro2} M. Borodovsky, J. D. McIninch, E.~V.~Koonin, K.~E.~Rudd,

285: C.~Medigue, and A.~Danchin, Nucleic Acids Res. {\bf 23}, 3554 (1995).

286: \bibitem{audic} S. Audic and J.-M. Claverie, Proc. Natl. Acad. Sci. USA,

287: {\bf 95}, 10026 (1998).

288: \bibitem{baldi} P. Baldi, Bioinformatics {\bf 16}, 367 (2000); P.~Baldi

289: and S.~Brunak, {\it Bioinformatics: The Machine Learning Approach} (The MIT

290: Press, Cambridge, Ma., 1998).

291: \bibitem{law} C. E. Lawrence and A. A. Reilly, Proteins {\bf 7}, 41

292: (1990).

293: \bibitem{car} L. R. Cardon and G. D. Stormo, J. Mol. Biol. {\bf 223},

294: 159 (1992).

295: \bibitem{stan} P. Bernaola-Galv\'an, I. Grosse, P. Carpena, J.L. Oliver,

296: R. Rom\'an-Rold\'an, and H.E. Stanley, Phys. Rev. Lett. {\bf 85}, 1342 (2000).

297: %\end{references}

298: \end{thebibliography}

299:

300: %\newpage

301: % FIGURE CAPTIONS

302: \begin{figure}[hb]

303: \caption{

304: %Fig.~1

305: Comparison between the inferred segmentation (dotted lines) and the known

306: coding regions of {\it Rickettsia} (shaded areas).}

307: %\label{fig1}

308: \end{figure}

309:

310: \begin{figure}[ht]

311: \caption{

312: %Fig.~2

313: Counts of majority assignment in the section containing two very close

314: coding regions (shaded areas) in the same phase. A peak corresponding to

315: noncoding assignment is clearly seen.}

316: %\label{fig2}

317: \end{figure}

318:

319: \end{document}

320: