0701:q-bio0701036/talk.tex

1: \documentclass[10pt,conference]{IEEEtran}

2:

3: \usepackage{amssymb}

4: \usepackage{amsmath}

5: \usepackage{eucal}	% use Zapf's beautiful calligraphic characters

6:

7: \newtheorem{theorem}{Theorem}[section]

8:

9: \newcommand{\defeq}{\stackrel{\triangle}{=}}

10:

11: \begin{document}

12:

13: \title{Parametrized Stochastic Grammars for\\

14: RNA Secondary Structure Prediction}

15: \author{\authorblockN{Robert S. Maier}

16: \authorblockA{Departments of Mathematics and Physics\\

17: University of Arizona\\

18: Tucson, AZ 85721, USA\\

19: Email: rsm@math.arizona.edu}}

20:

21: \maketitle

22:

23: \begin{abstract}

24: We propose a two-level stochastic context-free grammar (SCFG) architecture

25: for parametrized stochastic modeling of a family of RNA sequences,

26: including their secondary structure.  A~stochastic model of this type can

27: be used for maximum a~posteriori estimation of the secondary structure of

28: any new sequence in the family.  The proposed SCFG architecture models RNA

29: subsequences comprising paired bases as stochastically weighted

30: Dyck-language words, i.e., as weighted balanced-parenthesis expressions.

31: The length of each run of unpaired bases, forming a loop or a bulge, is

32: taken to have a phase-type distribution: that of the hitting time in a

33: finite-state Markov chain.  Without loss of generality, each such Markov

34: chain can be taken to have a bounded complexity.  The scheme yields an

35: overall family SCFG with a manageable number of parameters.

36: \end{abstract}

37:

38: \section{Introduction}

39: \label{sec:intro}

40: In biological sequence analysis, probability distributions over finite

41: ($1$-dimensional) sequences of symbols, representing nucleotides or amino

42: acids, play a major role.  They specify the probability of a sequence

43: belonging to a specified family, and are usually generated by Markov

44: chains.  These include the stochastic finite-state Moore machines called

45: hidden Markov models (HMMs), and infinite-state Markov chains such as

46: stochastic push-down automata (SPDAs).  By computing the most probable path

47: through the Markov chain, one can answer such questions as ``What hidden

48: (e.g., phylogenetic) structure does a sequence have?'', and ``What

49: secondary structure will a sequence give rise~to?''.  The number of Markov

50: model parameters should ideally be kept to a minimum, to facilitate

51: parameter estimation and model validation.

52:

53: The a~priori modeling of an RNA sequence family is considered here.  Due~to

54: Watson--Crick base pairing, a recursively structured RNA sequence will

55: fold, and display secondary structure.  To model stochastically both

56: pairings and runs of unpaired bases (which form loops and bulges), results

57: from a subfield of formal language theory, the {\em structure theory of

58: weighted strings\/}~\cite{KuichSalomaa} (each string being weighted by an

59: element of a specified `semiring' such as~${\mathbb{R}}_+$), are reviewed

60: and employed in stochastic model construction.

61:

62: In Section~\ref{sec:duration}, {\em duration modeling\/} is discussed: the

63: modeling of a probability distribution on `runs', i.e., on the natural

64: numbers~$\mathbb{N}$.  A non-RNA biological example is the modeling and

65: prediction of CpG~islands in a DNA sequence.  A~sequence may flip between

66: CpG and non-CpG states, with distinct HMMs for generation of symbols in

67: $\{A,T,G,C\}$.  For ease of HMM parameter estimation, and for finding the

68: most probable parse, or path through the model (e.g., by the Viterbi

69: algorithm), the length of each CpG island and non-CpG region should be

70: modeled in a Markovian way, as the first hitting time in a finite-state

71: Markov chain.  That~is, on~$\mathbb N=\{0,1,2,\dots\}$, the set of possible

72: lengths, it should have a {\em phase-type

73: distribution\/}~\cite{Neuts,OCinn90}.  There is a theorem of the author's

74: on such distributions~\cite{Maier8a}, which grew out of results on

75: positively weighted {\em regular\/} sequences~\cite{Katayama,Soittola}.  It

76: says that without loss of generality, the structure of the Markov chain can

77: be greatly restricted: its `cyclicity' can be required to be at most~$2$.

78: This has implications for HMM parametrization.

79:

80: The generating function $G(z)$ of a phase-type (PH) distribution

81: on~$\mathbb N$ (which is a normalized $\mathbb{R}_+$-weighted regular

82: language over a $1$-letter alphabet) is a {\em rational\/} function of~$z$.

83: Going beyond regular languages to the context-free case yields an {\em

84: algebraic\/} generating function: one of several variables, if each type of

85: letter in the sequence is separately kept track~of.  In RNA secondary

86: structure prediction, stochastic context-free grammars (SCFGs), usually in

87: Chomsky normal form, have been used~\cite{Sakakibara94}.  They tend to be

88: complicated; if the grammar has $k$~non-terminal symbols, then it may have

89: $O(k^3)$~transition probabilities, which must be estimated from training

90: sequences~\cite{Lari90}.  What is needed is a class of SCFGs with

91: (i)~restricted internal structure, (ii)~equivalent modeling power, and

92: (iii)~computationally convenient parametrization.  Finding such a class of

93: models is a hard problem: even on the level of $1$-letter-alphabet (i.e.,

94: univariate) generating functions, it involves the constructive theory of

95: algebraic functions.

96:

97: In Section~\ref{sec:algebraic}, as a small step toward solving this

98: problem, it is pointed~out that there is a class of probability

99: distributions on~$\mathbb N$ with generating functions (i.e.,

100: $z$-transforms) that are algebraic and non-rational, which can be

101: conveniently parametrized.  This is the class of algebraic {\em

102: hypergeometric distributions\/}.  E.g., the $\mathbb{N}$-valued random

103: variable~$\tau$ could satisfy $\sum_{n=0}^\infty z^n\, Pr(\tau=n)\propto

104: {}_2F_1(a,b;c;z)$, where ${}_2F_1(a,b;c;\cdot)$ is Gauss's (parametrized)

105: hypergeometric function.  If $a,b,c$ are suitably chosen, $n\mapsto

106: Pr(\tau=n)$ will be a probability density function with an algebraic

107: $z$-transform.  Algebraic hypergeometric probability densities satisfy nice

108: recurrence relations, and SCFG interpretations for them can be worked~out.

109:

110: A more general approach toward solving the above problem, not restricted to

111: the case of a $1$-letter alphabet, employs SCFGs with a two-level

112: structure.  In~effect, these are SCFGs wrapped around HMMs.  The following

113: is an illustration.  A~probabilistically weighted Dyck language over the

114: alphabet $\{a,b\}$, i.e., a distribution over the words in $\{a,b\}^*$ that

115: comprise nested $a$\textendash\nobreak$b$ pairs, is generated from a

116: symbol~$S$ by repeated application of the production rule $S\mapsto

117: p_1\cdot ab+p_2\cdot abS+p_3\cdot aSb +p_4\cdot aSbS$.  The

118: probabilities~$p_i$ sum to~$1$.  If each of $a,b$ in~turn represents a

119: weighted {\em regular\/} language over some alphabet~$\Sigma$ (e.g.,~a

120: PH-distribution if $\Sigma$~has only one letter), then the resulting

121: distribution over words in~$\Sigma^*$ comes from a SCFG with the stated

122: two-level structure.  This setup is familiar from (unweighted) language

123: theory applied to compilation: the top-level structure of a program is

124: specified as a word in a context-free language, and islands of low-level

125: structure (e.g., identifier names and arithmetic literals) as words in

126: regular languages.

127:

128: In Section~\ref{sec:modeling}, it is indicated how the idea of a SCFG

129: wrapped around HMMs can be applied to RNA structure prediction: initially,

130: to the parametric stochastic modeling, in a given sequence family, of the

131: recursive primary structure that induces secondary folding.  The goal is

132: parameter estimation and model validation, by comparison with data on real

133: RNA sequences.  Knudsen and Hein~\cite{Knudsen99} and

134: Nebel~\cite{Nebel2004} have worked on this, using Dyck-like languages, but

135: stochastic modeling using distinct SCFG and HMM levels is a significant

136: advance.

137:

138: On the level of primary RNA structure, paired nucleotides will make~up a

139: subsequence of the full nucleotide sequence, and must constitute a Dyck

140: word, for simplicity written as a word over~$\{a,b\}$.  A distribution over

141: the infinite family of such Dyck words is determined by the above

142: stochastic production rule, the parameters $p_1,p_2,p_3,p_4$ in which are

143: specific to the family being modeled.  The production rule for {\em full\/}

144: sequences, including unpaired nucleotides, will have not $ab$, $abS$,

145: $aSb$, $aSbS$ on its right-hand side, but rather $IaIbI$, $IaIbS$, $IaSbI$,

146: $IaSbS$, where each~$I$ expands to a `run' of unpaired nucleotides.  If the

147: four nucleotides are treated as equally likely in this context, each~$I$

148: will be a stochastic language over a $1$-letter alphabet, and the length of

149: each run is reasonably modeled as having a PH-distribution.  The PH class

150: includes geometric distributions, but is more general.  The overall SCFG is

151: obtained by wrapping the Dyck SCFG around the finite-state Markov chains

152: that yield the PH-distributions.

153:

154: From a given family of RNA sequences, Dyck SCFG parameters can be

155: estimated, e.g., by the standard Inside--Outside Algorithm~\cite{Lari90};

156: and then HMM parameters (i.e., PH-distribution parameters) can be estimated

157: separately.  By employing a large enough class of PH-distributions, it

158: should be possible to produce a better fit to data on secondary structure

159: than was obtained from the few-parameter models of Knudsen and

160: Hein~\cite{Knudsen99} and Nebel~\cite{Nebel2004}.  Once the family has been

161: modeled, the most likely parse tree for any new RNA sequence in the family

162: can be computed by maximum a~posteriori estimation, using the CYK

163: algorithm~\cite{Sakakibara94}.  The sequence is predicted to have the

164: secondary structure represented by that parse tree.

165:

166: \section{Duration Modeling}

167: \label{sec:duration}

168: Since loops and bulges in RNA secondary structure comprise runs of unpaired

169: nucleotides, they can be modeled without taking long-range covariation into

170: account.  The appropriate stochastic model is an HMM {\em with

171: absorption\/}, since the accurate modeling of run lengths is a goal.  Any

172: such HMM will specify a probability distribution on the set of finite

173: strings~$\Sigma^*$, where $\Sigma=\{A,U,G,C\}$ is the alphabet set, and

174: long words are exponentially unlikely.  There should be little change in

175: the nucleotide distribution along typical runs, so the distribution of the

176: string length $\tau\in\mathbb{N}$ is what is important.

177:

178: The time~$\tau$ to reach a final (absorbing) state in an irreducible

179: discrete-time Markov chain on a state space $Q=\{1,\dots,m\}$, with a

180: transition matrix $\mathbf{T}=(T_{ij})_{i,j=1}^m$ that is {\em

181: substochastic\/} (i.e., $\sum_{j=1}^m T_{ij}\le1$), and an initial state

182: vector $\mathbf{\alpha}=(\alpha_i)_{i=1}^m$ that is also substochastic

183: (i.e., $\sum_{i=1}^m\alpha_i\le1$), is said to have a discrete

184: PH-distribution.  The substochasticity of $\mathbf{T}$

185: and~$\mathbf{\alpha}$ expresses the absorption of probability, since they

186: can be extended to a larger state space $\tilde Q=Q\cup\{m+1\}$, on which

187: they will be stochastic.  The added state $m+1$ is absorbing.
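
As a sketch of how such a distribution is computed, assume the convention (standard in the PH literature, but an assumption here) that $\tau$~counts the steps taken to reach the absorbing state~$m+1$, and introduce the auxiliary notation $t_i\defeq 1-\sum_{j=1}^m T_{ij}$ for the one-step absorption probability from state~$i$.  Then
\begin{align*}
  Pr(\tau=0) &= 1-\sum_{i=1}^m\alpha_i,\\
  Pr(\tau=n) &= \sum_{i,j=1}^m \alpha_i\,(\mathbf{T}^{n-1})_{ij}\,t_j,\qquad n\ge1.
\end{align*}
For instance, the choice $m=1$, $\mathbf{T}=(q)$, $\mathbf{\alpha}=(1)$ gives the geometric distribution $Pr(\tau=n)=(1-q)\,q^{n-1}$, $n\ge1$, the $z$-transform of which is $(1-q)z/(1-qz)$.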

188:

189: There is a close connection between PH-distributions and finite automata

190: theory, in particular the theory of rational series over

191: semirings~\cite{KuichSalomaa}.  If $A$~is a semiring (a~set having binary

192: addition and multiplication operations, $\oplus$~and~$\odot$, each with an

193: associated identity element; but not necessarily having a unary negation

194: operation), then an $A$-{\em rational sequence\/}, $a=(a_n)_{n=0}^\infty\in

195: A^{\mathbb{N}}$, is a sequence of the form $a_n=\oplus_{i,j=1}^m

196: \left[u_i\odot(\mathbf{M}^n)_{ij}\odot v_j\right]$, where for some $m>0$,

197: $\mathbf{M}\in A^{m\times m}$ and $\mathbf{u},\mathbf{v}\in A^m$.  It is an

198: $A$-weighted regular language over a $1$-letter alphabet.  Semirings of

199: interest here include $\mathbb{R}$, $\mathbb{R}_+=\{x\in\mathbb{R}\mid

200: x\ge0\}$, and the Boolean semiring $\mathbb{B}=\{0,1\}$.

201:

202: \smallskip

203: \begin{theorem}[\cite{Maier9}]

204: \label{thm:normalization}

205:   Any PH-distribution on~$\mathbb{N}$ is an $\mathbb{R}_+$-rational

206:   sequence.  Any {\em summable\/} $\mathbb{R}_+$-rational sequence, if

207:   normalized to have unit sum, becomes a PH-distribution.

208: \end{theorem}

209:

210: \smallskip

211: If $\tau\in\mathbb{N}$ is PH-distributed, it is useful to focus on its

212: $z$-transform, i.e., $G(z)=E\left[z^\tau\right]=\sum_{n=0}^\infty

213: z^n\,Pr(\tau=n)$.  This will be a rational function, in~$\mathbb{R}_+(z)$.

214: If the distribution is {\em finitely supported\/}, it will be a polynomial,

215: in~$\mathbb{R}_+[z]$.

216:

217: \smallskip

218: \begin{theorem}[\cite{Maier9}]

219: \label{thm:2}

220:   Any PH-distribution on~$\mathbb{N}$ can be generated from finitely

221:   supported distributions by repeated applications of (i)~the binary

222:   operation of mixture, i.e., $G_1,G_2\mapsto pG_1+(1-p)G_2$, where

223:   $p\in(0,1)$, (ii)~the binary operation of convolution, i.e.,

224:   $G_1,G_2\mapsto G_1G_2$, and (iii)~the unary `geometric mixture'

225:   operation, i.e., $G\mapsto (1-p)\sum_{k=0}^\infty p^kG^k=(1-p)/(1-pG)$,

226:   where $p\in(0,1)$.

227: \end{theorem}

228:

229: \smallskip

230: This is a variant of the Kleene--Sch\"utzenberger Theorem on the

231: $A$-rational series associated to $A$-finite automata~\cite{KuichSalomaa}.

232: The Boolean ($A=\mathbb{B}$) case of their theorem is familiar from formal

233: language theory: it says that any regular language over a finite alphabet

234: can be generated from {\em finite\/} languages by repeated applications of

235: (i)~union, (ii)~concatenation, and (iii)~the so-called Kleene star

236: operation.  Just as in formal language theory, the third operation of

237: Theorem~\ref{thm:2} can be implemented on the automaton level by adding

238: `loopback', or cycle-inducing, transitions from final state(s) back to

239: initial state(s).
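
A small worked instance, which is illustrative and not taken from the cited references, may make the three operations concrete.  Let $G_0(z)=z$ be the $z$-transform of the point mass at $n=1$, which is finitely supported.  Applying the geometric mixture operation once, and then convolving $r$~copies of the result, yields
\begin{align*}
  G_1(z) &= \frac{1-p}{1-pz} = \sum_{n=0}^\infty (1-p)\,p^n\,z^n,\\
  G_1(z)^r &= \sum_{n=0}^\infty \binom{n+r-1}{n}(1-p)^r\,p^n\,z^n.
\end{align*}
That~is, a geometric distribution arises from a single `loopback', and a negative binomial distribution from the serial composition of $r$~such loops.  A~mixture $w\,G_1^{r}+(1-w)\,G_1^{r'}$, $w\in(0,1)$, places two such chains in parallel.  In each of these cases the geometric mixture operation is nested only one deep.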

240:

241: \smallskip

242: \begin{theorem}[\cite{Maier8a}]

243: \label{thm:3}

244: The unary--binary computation tree leading to any PH-distribution

245: on~$\mathbb{N}$, the leaves of which are finitely supported distributions,

246: can be required without loss of generality to have at most~$2$ unary

247: `geometric mixture' nodes along the path extending to the root from any

248: leaf.  That~is, those operations do~not need to be more than doubly nested.

249: \end{theorem}

250:

251: \smallskip

252: This is a normalized, or `stochastic', version of a result on the

253: representation of $\mathbb{R}_+$-rational

254: sequences~\cite{Katayama,Soittola}.  Results of this type are strongly

255: semiring-dependent.  It is not difficult to see that in the cases

256: $A=\mathbb{R}$ and~$\mathbb{B}$, the analogue of the number~`$2$' is~`$1$'.

257: (This is because, e.g., a $\mathbb{B}$-rational sequence is simply a

258: sequence in~$\mathbb{B}^\mathbb{N}$ that is eventually periodic.)  The

259: proof of Theorem~\ref{thm:3} is an explicit construction, which respects

260: positivity constraints at each stage.  That~is, the construction solves the

261: {\em representation problem\/} for univariate PH-distributions, which has

262: strong connections to the positive realization problem in control

263: theory~\cite{Commault2003}.

264:

265: What the theorem says, since operations of types (i),(ii),(iii) correspond

266: to parallel composition, serial composition, and cyclic iteration of Markov

267: chains, is that any PH-distribution arises without loss of generality from

268: a Markov chain in which cycles of states are nested at most $2$~deep.

269: That~is, the chain may include cycles, and cycles within cycles, but not

270: cycles within cycles within cycles.  So for modeling purposes, the chain

271: transition matrix~$\mathbf{T}$ may be taken to have a highly restricted

272: structure.  A~completely connected transition graph on a state space of

273: size~$m$ would have ${m}^{2}$~possible transitions, and would be

274: unnecessarily general when $m$~is large.

275:

276: Unpaired nucleotide run lengths in RNA are naturally modeled as having

277: (discrete-time) PH~distributions because of the close connection with HMMs,

278: and the consequent ease of parameter estimation.  However, the class of

279: PH~distributions is so versatile that it would be useful in this context,

280: regardless.  Discrete PH~distributions include geometric and negative

281: binomial distributions, and are dense (in~a suitable sense) in the class of

282: distributions $Pr(\tau=n)$, $n\in\mathbb{N}$, which have leading-order

283: geometric falloff as~$n\to\infty$.

284:

285: Any PH distribution on~$\mathbb{N}$ has a $z$-transform

286: $G(z)=E\left[z^\tau\right]$ that is rational in the conventional sense;

287: equivalently, it must satisfy a finite-depth recurrence relation of the

288: form $\sum_{k=0}^N c_k\,Pr(\tau=n+k)=0$.  In~fact, any probability

289: distribution on~$\mathbb{N}$ with (i)~a rational $z$-transform $G(z)$, and

290: (ii)~the property that the pole which $G(z)$ necessarily has at~$z=1$ is

291: the {\em only\/} pole on the circle $\left|z\right|=1$, is necessarily a PH

292: distribution~\cite{OCinn90}.  This is a sort of converse of the

293: Perron--Frobenius Theorem.  However, there are distributions

294: on~$\mathbb{N}$ which satisfy~(i) but not~(ii), and are not PH

295: distributions.  They are necessarily $\mathbb{R}$-rational sequences, but are

296: not $\mathbb{R}_+$-rational sequences as defined above, even though they

297: are sequences of elements of~$\mathbb{R}_+$ (probabilities).  In

298: abstract-algebraic terms, this situation is possible because the

299: semiring~$\mathbb{R}$ is not a {\em Fatou extension\/} of the

300: semiring~$\mathbb{R}_+$~\cite{KuichSalomaa}.  The existence of pathological

301: examples of this type does not vitiate the usefulness of PH distributions

302: in run-length modeling.

303:

304: \section{Algebraic Sequences}

305: \label{sec:algebraic}

306: Most work on RNA secondary structure prediction that draws on formal

307: language theory has employed SCFGs~\cite{Knudsen99,Nebel2004,Sakakibara94}.

308: A~CFG in Chomsky normal form~(CNF), used for generating strings

309: in~$\Sigma^*$ where $\Sigma$~is a finite alphabet set, is a set of

310: production rules of the form $V\mapsto W_1W_2$ or $V\mapsto a$, where

311: $V,W_1,W_2$ are elements of a set~$\mathcal{V}$ of `variables', i.e.,

312: nonterminal symbols, and~$a\in\Sigma$.  There is a distinguished start

313: symbol $S\in\mathcal{V}$ with which the process begins.  Applying the

314: production rules repeatedly yields a subset $L\subset\Sigma^*$, i.e., a

315: language.  An SCFG assigns probabilities (which add to unity) to the

316: productions of each $V\in\mathcal{V}$, and yields a probability

317: distribution over the strings in $L\subset\Sigma^*$, i.e., over~$\Sigma^*$.

318:

319: The probability distribution $P:\Sigma^*\to[0,1]\subset\mathbb{R}_+$

320: produced by an SCFG is an example of an $\mathbb{R}_+$-algebraic series.

321: In general, if $A$~is a semiring, an $A$-algebraic series (of CNF type)

322: over an alphabet~$\Sigma$ is a weighting function $f\colon\Sigma^*\to A$

323: obtained as one component (i.e., the component~$f_S$) of the formal

324: solution of a coupled set of quadratic equations

325: \begin{displaymath}

326:   f_V=\sum_{W_1,W_2\in\mathcal{V}} c_{V;W_1,W_2}\, f_{W_1}f_{W_2} +

327:   \sum_{a\in\Sigma} c_{V;a}\,a,\qquad V\in\mathcal{V},

328: \end{displaymath}

329: computed by iteration~\cite{KuichSalomaa}.  The coefficients

330: $c_{V;W_1,W_2}$ and~$c_{V;a}$ are elements of~$A$, so each~$f_V$ is a sum

331: of $A$-weighted strings in~$\Sigma^*$, or equivalently a function

332: $f_V\colon\Sigma^*\to A$.  It~is clear that Theorem~\ref{thm:normalization}

333: has an analogue: any probability distribution on~$\Sigma^*$ produced by a

334: SCFG of CNF type is simply a {\em normalized\/} $\mathbb{R}_+$-algebraic

335: series.

336:

337: Any SCFG of CNF type has up to $\left|\mathcal{V}\right|^3 +

338: \left|\mathcal{V}\right|\left|\Sigma\right|$ parameters, which may be too

339: many for practical estimation if a small sequence family is being modeled.

340: To~facilitate modeling, one should use an SCFG with a restricted structure,

341: and also exploit results from weighted automata theory.  If the nucleotide

342: distribution does not vary much along typical sequences, then the alphabet

343: set~$\Sigma$ can be taken to be a $2$-letter alphabet $\{a,b\}$ (if~one is

344: modeling Watson--Crick pairing, exclusively) or even a $1$-letter alphabet

345: (if~one is modeling runs of unpaired bases).  Also, one can leverage the

346: fact that $A$-algebraic series subsume $A$-rational series, which implies

347: (in~the $1$-letter case) that $A$-algebraic {\em sequences\/}, which are

348: effectively indexed by~$\mathbb{N}$, subsume $A$-rational sequences.  In

349: the Boolean ($A=\mathbb{B}$) case, the first statement is the familiar

350: Chomsky hierarchy.

351:

352: In the case of a $1$-letter alphabet $\Sigma=\{a\}$, an SCFG defines a

353: probability distribution on $\mathbb{N}\cong\{a\}^*$.  That is, it defines

354: an $\mathbb{N}$-valued random variable~$\tau$, the length of the string

355: emitted by the stochastic push-down automaton (SPDA) corresponding to

356: the~SCFG.  The SPDA uses~$\mathcal{V}$, the set of nonterminal symbols, as

357: its stack alphabet, and its stack is initially occupied by the start

358: symbol~$S$.  The stochastic production rules specify what happens when a

359: symbol $V\in\mathcal{V}$ is popped off the stack: either two symbols

360: $W_1,W_2\in\mathcal{V}$ are pushed back, or a letter~`$a$' is emitted.  By

361: construction, at~least one letter must be emitted by a CNF-type SCFG before

362: its stack empties, so $Pr(\tau=0)=0$.
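
The following illustrative example, which is not drawn from the cited references, shows how an algebraic and non-rational $z$-transform arises.  Take $\mathcal{V}=\{S\}$, with productions $S\mapsto SS$ (probability~$p$) and $S\mapsto a$ (probability~$1-p$).  The quadratic system reduces to the single equation $G(z)=p\,G(z)^2+(1-p)\,z$, and the solution that vanishes at $z=0$ is
\begin{displaymath}
  G(z)=\frac{1-\sqrt{1-4p(1-p)z}}{2p}
  =\sum_{n=1}^\infty C_{n-1}\,p^{n-1}(1-p)^n\,z^n,
\end{displaymath}
where $C_k=\binom{2k}{k}/(k+1)$ is the $k$'th Catalan number, which counts the binary parse trees with $k+1$~leaves.  As expected, $Pr(\tau=0)=0$.  Setting $z=1$ gives $G(1)=\left(1-\left|1-2p\right|\right)/(2p)$, which equals~$1$ if and only if $p\le1/2$.  For $p>1/2$ the total mass is $(1-p)/p<1$, i.e., with positive probability the stack never empties, so the weights would need to be normalized to obtain a distribution.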

363:

364: The class of probability distributions on~$\mathbb{N}$ associated to SCFGs

365: (whether or~not of CNF type), i.e., that of normalized

366: $\mathbb{R}_+$-algebraic sequences, is potentially useful in parametric

367: stochastic modeling, but has not been widely employed.  It will be denoted

368: $\mathcal{F}_{\rm alg}$ here, since each distribution in~it has an

369: algebraic $z$-transform $G(z)=\sum_{n=0}^\infty z^n\,Pr(\tau=n)$.  For any

370: SCFG, an algebraic equation satisfied by~$G(z)$ can be computed by

371: polynomial elimination (e.g., by computing the resultant of the above

372: system of quadratic equations).  Let ${\it PH}_d$ denote the class of

373: discrete phase-type distributions.

374:

375: \smallskip

376: \begin{theorem}

377: \label{thm:hadamardetc}

378:   (i) ${\it PH}_d\subset\mathcal{F}_{\rm alg}$.  (ii)~If $X,Y$ are

379:   independent $\mathbb{N}$-valued random variables~(RVs) with distributions

380:   in~${\it PH}_d$, then conditioning on~$X=Y$ yields an~RV with

381:   distribution in~$\mathcal{F}_{\rm alg}$.  (iii)~If, furthermore, $Z$~is

382:   an independent $\mathbb{N}$-valued RV with distribution

383:   in~$\mathcal{F}_{\rm alg}$, then conditioning on~$X=Z$ yields an~RV with

384:   distribution in~$\mathcal{F}_{\rm alg}$.

385: \end{theorem}

386:

387: \smallskip

388: These are `normalized' versions of standard facts on $A$-rational

389: and~$A$-algebraic series, in particular on their composition under the

390: Hadamard product $(x_n),(y_n)\mapsto (x_ny_n)$, in the special case when

391: $A=\mathbb{R}_+$ and~$\Sigma=\{a\}$.  (See~\cite{Fliess74,KuichSalomaa}.)

392: They have direct probabilistic proofs.  E.g., to prove~(i), one would show

393: that starting from the distribution of~$\tau\in\mathbb{N}$, the absorption

394: time in a~HMM, one can construct an SCFG that yields the same distribution

395: on~$\mathbb{N}$.  (If $Pr(\tau=0)>0$ then the SCFG cannot be of CNF type.)

396: The procedure is similar to constructing a PDA that accepts a specified

397: regular language.

398:

399: Much as with discrete PH distributions, it is difficult to parametrize

400: distributions in the class~$\mathcal{F}_{\rm alg}$ without, rather

401: explicitly, parametrizing the stochastic automata (SCFGs or SPDAs) that

402: give rise to them; or at~least their $z$-transforms.  It is difficult,

403: in~general, to characterize when a probability distribution on~$\mathbb{N}$

404: that has an algebraic $z$-transform lies in~$\mathcal{F}_{\rm alg}$.

405:

406: The following example illustrates the problem.  Any distribution $n\mapsto

407: Pr(\tau=n)$ on~$\mathbb{N}$ that has an algebraic $z$-transform necessarily

408: satisfies a finite-depth recurrence of the form $\sum_{k=0}^N

409: C_k(n)\,Pr(\tau=n+k)=0$, where the functions~$C_k$, $k=0,\dots,N,$ are

410: polynomial in~$n$.  (If~none of the~$C_k$ depends on~$n$, then the

411: $z$-transform will be rational.)  Consider, for example, the $2$-term

412: recurrence

413: \begin{displaymath}

414: (n+a)(n+b)\,Pr(\tau=n) = (n+c)(n+1)\,Pr(\tau=n+1),

415: \end{displaymath}

416: where $a,b,c\in\mathbb{R}$ are parameters, which is of this form.  The

417: $z$-transform $G(z)=\sum_{n=0}^\infty z^n\,Pr(\tau=n)$ of its solution is

418: proportional, by definition, to ${}_2F_1(a,b;c;z)$, which is Gauss's

419: parametrized hypergeometric function.  The set of triples

420: $(a,b;c)\in\mathbb{R}^3$ that yields an {\em algebraic\/} $z$-transform,

421: and hence an $\mathbb{R}$-algebraic sequence $n\mapsto Pr(\tau=n)$, is

422: explicitly known.  It was derived in the nineteenth century by

423: H.~A. Schwarz~\cite[Chap.~VII]{Poole36}.  Unfortunately, it is an {\em

424: infinite discrete\/} subset of~$\mathbb{R}^3$, not a continuous subset.
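
As a concrete instance, chosen here for illustration and checked directly rather than read off from the cited list, let $(a,b;c)=(\tfrac12,1;2)$.  The recurrence becomes $Pr(\tau=n+1)/Pr(\tau=n)=(n+\tfrac12)/(n+2)$, and its normalized solution is
\begin{displaymath}
  Pr(\tau=n)=\frac{C_n}{2\cdot4^n},\qquad
  G(z)=\tfrac12\,{}_2F_1(\tfrac12,1;2;z)=\frac{1}{1+\sqrt{1-z}}=\frac{1-\sqrt{1-z}}{z},
\end{displaymath}
where $C_n=\binom{2n}{n}/(n+1)$ is the $n$'th Catalan number.  Here $G(1)=1$, and $G$~is visibly algebraic but not rational.  One can check that this is also the distribution of $\tau'-1$, where $\tau'$~is the length of the string generated by the $1$-letter SCFG $S\mapsto\tfrac12\cdot SS+\tfrac12\cdot a$, so at~least this member of the class has a simple SCFG interpretation.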

425:

426: In~general, the $z$-transform of the solution of a finite-depth recurrence

427: of the above form will be algebraic in~$z$ only if the overall parameter

428: vector of its coefficients, the polynomials $\{C_k(n)\}_{k=0}^N$, is

429: confined to a submanifold of positive codimension.  For distributions

430: in~$\mathcal{F}_{\rm alg}$, this makes recurrence-based parametrization

431: less useful than SCFG-based or $z$-transform-based parametrization.

432:

433: \section{Modeling Secondary Structure}

434: \label{sec:modeling}

435: A new scheme for modeling the prior distribution of secondary structures in

436: an RNA sequence family will now be proposed.  It will exploit the insights

437: of Sections \ref{sec:duration} and~\ref{sec:algebraic}, on the class of

438: discrete phase-type distributions on~$\mathbb{N}$ (i.e.,~${\it PH}_d$), and

439: the larger class of $\mathbb{R}_+$-algebraic distributions on~$\mathbb{N}$

440: (i.e.,~$\mathcal{F}_{\rm alg}$).

441:

442: If $\Sigma=\{A,U,G,C\}$ is the alphabet set, any SCFG, or its associated

443: SPDA, will define a probability distribution on~$\Sigma^*$, the set of

444: finite length sequences~\cite{Sakakibara94}.  (The distribution of the

445: sequence length, which is a random variable, lies in~$\mathcal{F}_{\rm

446: alg}$.)  But even if the SCFG is in Chomsky normal form (CNF), the number

447: of its parameters grows cubically in the number of grammar variables, as

448: mentioned above.  To~facilitate estimation, the model should have a

449: restricted structure.

450:

451: The models of Knudsen and Hein~\cite{Knudsen99} and Nebel~\cite{Nebel2004}

452: are representative.  The Knudsen--Hein SCFG has variable set

453: $\mathcal{V}=\{S,L,F\}$, and production rules

454: \begin{align*}

455:   S&\mapsto LS\mid L,{\rm\ i.e.,\ }S\mapsto L^+\defeq L\mid L^2\mid L^3\mid\cdots,\\

456:   L&\mapsto s\mid a_1Fb_1,\\

457:   F&\mapsto a_2Fb_2\mid LS,{\rm\ i.e.,\ }F\mapsto a_2Fb_2\mid LL^+.

458: \end{align*}

459: Here $s$ signifies an unpaired base and $a_i\dots b_i$ signifies two bases

460: that are paired in the secondary structure, so $L^+$~produces runs of

461: unpaired bases, i.e., loops (which may include stems) and $F$~produces runs

462: of paired bases, i.e., stems (which may include loops of length at

463: least~$2$).  This SCFG is not a CNF one, but model parameters may be

464: estimated by a variant of the Inside--Outside algorithm.  If one takes

465: single base frequencies and pair frequencies (i.e., the probability of

466: $a_i\dots b_i$ representing $A$\textendash\nobreak$U$,

467: $G$\textendash\nobreak$C$, or even $G$\textendash\nobreak$U$) into account,

468: one has only three independent parameters to be estimated, one probability

469: per production rule.  Knudsen and Hein (cf.\ Nebel) used as their primary

470: training set a subset of the European database of large subunit ribosomal

471: RNAs (LSU rRNAs)~\cite{DeRijk98,Wuyts2001}.  For the probabilities of $LS$

472: vs.~$L$ (from~$S$), they estimated $87\%$ vs.~$13\%$; for $s$ vs.~$a_1Fb_1$

473: (from~$L$), $90\%$ vs.~$10\%$; and for $a_2Fb_2$ vs.~$LS$ (from~$F$),

474: $79\%$ vs.~$21\%$.  Their training set actually included tRNAs as~well,

475: since they were attempting to model the family of folded RNA molecules as a

476: whole.

477:

478: As Knudsen and Hein note, their model yields loops and stems with

479: geometrically distributed lengths.  To improve quantitative agreement, it

480: would need to be made more sophisticated.  It would also benefit from a

481: cleaner separation between its two levels: the paired-base and

482: unpaired-base levels, i.e., the context-free and regular levels (in~the

483: formal language sense), i.e., the SPDA and HMM levels (in~the stochastic

484: automata-theoretic sense).  The above production rules couple the two

485: levels together.  It is not clear from Ref.~\cite{Knudsen99} how well the

486: model stochastically fits the lengths of (i)~the training sequences, (ii)~the

487: subsequences comprising paired bases, and (iii)~the subsequences comprising

488: unpaired bases.  Separating the two levels should facilitate the separate

489: fitting of these quantities.
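
The geometric behavior can be read~off from the production rules and the estimates quoted above.  The following calculation is an illustration based on those figures, not one reported in Ref.~\cite{Knudsen99}.  Since $S\mapsto LS$ and $S\mapsto L$ have probabilities $0.87$ and~$0.13$, the number~$k$ of symbols~$L$ generated from a single~$S$ satisfies
\begin{displaymath}
  Pr(k)=0.13\,(0.87)^{k-1},\quad k\ge1,\qquad E[k]=1/0.13\approx7.7.
\end{displaymath}
Similarly, the number~$j$ of applications of $F\mapsto a_2Fb_2$ before $F\mapsto LS$ is applied satisfies
\begin{displaymath}
  Pr(j)=0.21\,(0.79)^{j},\quad j\ge0,\qquad E[j]=0.79/0.21\approx3.8,
\end{displaymath}
so a stem, counting its initial $a_1\dots b_1$ pair, has a mean length of about $4.8$~pairs.  Each of these distributions has a single parameter, so fixing its mean fixes its variance and its entire shape.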

490:

491: By definition, folded RNA secondary structure is characterized by a

492: subsequence comprising paired bases, so the stochastic modeling of

493: secondary structure in a given family should initially focus on such

494: subsequences.  If pseudo-knots (a~thorny problem for automata-theoretic

495: modeling) are ignored, these subsequences are effectively {\em Dyck

496: words\/}, or balanced parenthesis expressions.  In the absence of

497: covariation, one expects to be able to generate such words over

498: $\{A,U,G,C\}$ from classical Dyck words over the $2$-letter alphabet

499: $\{a,b\}$, consisting of opening and closing parentheses, by replacing each

500: $a$\textendash\nobreak$b$ pair independently by an

501: $A$\textendash\nobreak$U$, $G$\textendash\nobreak$C$, or

502: $G$\textendash\nobreak$U$ pair, according to observed pair frequencies.

503: Knudsen and Hein note that order matters: in

504: $G$\textendash\nobreak$C$\,/\,$C$\textendash\nobreak$G$ pairs in~tRNA,

505: the~$G$ tends to be nearer the $5'$~end of the RNA than the~$C$.  Still to

506: be resolved, of~course, is the selection of the underlying probability

507: distribution over Dyck words in~$\{a,b\}^*$.

508:

509: One could start with any CFG that unambiguously generates the Dyck words

510: over $\{a,b\}$, and make it stochastic by weighting its productions.  The

511: simplest such CFG is a $1$-variable one, $S\mapsto ab\mid abS\mid aSb \mid aSbS$.

512: The corresponding SCFG is

513: \begin{displaymath}

514: S   \mapsto p_1\cdot ab+p_2\cdot abS+p_3\cdot aSb +p_4\cdot aSbS,

515: \end{displaymath}

516: where $\sum_ip_i=1$.  This SCFG, with $3$~free parameters, is so simple

517: that it can be studied analytically.  The length of a Dyck word is an

518: $\mathbb{N}$-valued random variable, the distribution of which lies

519: in~$\mathcal{F}_{\rm alg}$, with a parameter-dependent, algebraic

520: $z$-transform.  As was explained in Section~\ref{sec:algebraic}, it is best

521: to parametrize distributions in~$\mathcal{F}_{\rm alg}$ by the SCFGs that

522: give rise to them, rather than by explicit formulas or even by the

523: recurrence relations that they satisfy; and this is an example.

524:
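As a worked illustration (a sketch, in notation introduced only for this
example), let $G(z)=\sum_{n}\Pr({\rm length}=n)\,z^n$ denote the
$z$-transform of the length, counted in letters, of the Dyck word generated
from~$S$.  The symbol~$G$ and this normalization are assumptions, and may
differ from the conventions of Section~\ref{sec:algebraic}.  Since each
alternative emits exactly one $a$\textendash\nobreak$b$ pair, the production
rule translates term by term into
\begin{displaymath}
G = p_1 z^2 + (p_2+p_3)\, z^2 G + p_4\, z^2 G^2 ,
\end{displaymath}
and for $p_4>0$ the root that vanishes at $z=0$ is
\begin{displaymath}
G(z) = \frac{1-(p_2+p_3)z^2 - \sqrt{[1-(p_2+p_3)z^2]^2 - 4p_1p_4z^4}}{2p_4z^2}.
\end{displaymath}
At $z=1$ this gives $G(1)=\min(1,p_1/p_4)$, so the SCFG generates a finite
word with probability~$1$ if and only if $p_1\ge p_4$.  If $p_1>p_4$,
differentiating the quadratic equation at $z=1$ gives a mean length of
$2/(p_1-p_4)$ letters.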

525: This Dyck model could be made arbitrarily more versatile, since arbitrarily

526: complicated CFGs that generate the Dyck language over $\{a,b\}$ can readily

527: be constructed.  One could, for instance, iterate $S\mapsto ab\mid abS\mid

528: aSb \mid aSbS$ once, obtaining a production rule for~$S$ with

529: $25$~alternatives on its right-hand side.  Weighting them with

530: probabilities would yield an SCFG with $24$~independent parameters, which

531: would be capable of much more accurate fitting of data on an empirical

532: family of RNA sequences.  In~general, one could choose model parameters to

533: fit not~only the observed distribution of Dyck word lengths (i.e.,

534: per-family paired-base subsequence lengths), but also the distribution of

535: lengths of stems, i.e., runs of contiguous paired bases, which may be far

536: from geometric.

537:
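The count of $25$ can be checked directly (a small worked calculation, not
taken from the cited literature).  Replacing each occurrence of~$S$ on the
right-hand side by one of the four original alternatives leaves $ab$
unchanged, turns each of $abS$ and $aSb$ into four alternatives, and turns
$aSbS$ into $4\times4$ alternatives, so that
\begin{displaymath}
1 + 4 + 4 + 4^2 = 25 .
\end{displaymath}
For example, $abS$ yields $abab$, $ababS$, $abaSb$, and $abaSbS$.  Since the
$25$ probabilities must sum to~$1$, only $24$ of them are independent.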

538: The preceding discussion of Dyck words formed from paired bases ignored

539: loops, i.e., runs of unpaired bases.  They are best handled on a second

540: level of the SCFG.  The simplest production rule for {\em full\/} sequences

541: would have not $ab$, $abS$, $aSb$, $aSbS$ on its right-hand side, but rather

542: $IaIbI$, $IaIbS$, $IaSbI$, $IaSbS$, where each of the eight~$I$s expands to

543: a run of unpaired bases.  In the absence of covariation, modeling each run

544: is a matter of duration modeling.  Starting from an $\mathbb{N}$-valued

545: random run length, or equivalently a distribution over finite $1$-letter

546: words, one would generate a run of unpaired bases by replacing each letter

547: independently by $A,U,G,C$, according to family-specific single-base

548: frequencies.

549:
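One hedged way of writing out such a two-level rule, reusing the weights
$p_1,\dots,p_4$ of the Dyck SCFG and assuming for simplicity that every run
has the same geometric length distribution with a hypothetical
ratio~$q\in(0,1)$, is
\begin{displaymath}
\begin{array}{rcl}
S & \mapsto & p_1\cdot IaIbI + p_2\cdot IaIbS + p_3\cdot IaSbI + p_4\cdot IaSbS,\\
I & \mapsto & (1-q)\cdot\epsilon + q\cdot xI,
\end{array}
\end{displaymath}
where $\epsilon$ is the empty word and $x$ is a placeholder letter, to be
replaced independently by $A,U,G,C$.  Here a run of length~$n$ has
probability $(1-q)q^n$, $n\ge0$.  The symbols $q$ and~$x$, and the sharing of
a single distribution among all eight~$I$s, are illustrative assumptions; in
practice each position could be given its own run-length distribution.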

550: Each run length is naturally taken to have a distribution in~${\it PH}_d$,

551: since that will allow the resulting run of bases to be generated by an HMM

552: (with absorption).  Each run length will be the absorption time in a

553: finite-state Markov chain, the parameters of which, i.e., transition

554: probabilities, can be estimated from empirical data.  Geometric

555: distributions and their generalizations are appropriate.  It follows from

556: Theorem~\ref{thm:3} that employing a large Markov chain with a fully

557: connected transition graph, and hence a number of parameters that grows

558: quadratically in the number of states, would {\em not\/} be appropriate.

559: Without loss of generality, each transition graph can be assumed to have no

560: `cycles within cycles within cycles'.

561:
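For concreteness, the following sketch uses a standard matrix representation
of discrete phase-type distributions.  The symbols $\alpha$, $T$, $t$, and
the convention that absorption takes at least one step, are assumptions made
for this example, and may differ in detail from the definition
of~${\it PH}_d$ used earlier.  Let $T$ be the substochastic matrix of
transition probabilities among the transient states, $\alpha$ the row vector
of initial probabilities, $\mathbf{1}$ the all-ones column vector, and
$t=\mathbf{1}-T\mathbf{1}$ the column vector of absorption probabilities.
Then the run length~$N$ satisfies
\begin{displaymath}
\Pr(N=n) = \alpha\, T^{n-1}\, t, \quad n\ge1,
\qquad
\sum_{n\ge1}\Pr(N=n)\,z^n = z\,\alpha\,({\rm Id}-zT)^{-1}\,t ,
\end{displaymath}
where ${\rm Id}$ is the identity matrix.  A single transient state with
self-loop probability~$r$ gives the geometric distribution
$\Pr(N=n)=(1-r)r^{n-1}$.  Two transient states in series, with self-loop
probabilities $r_1,r_2$, give the $z$-transform
\begin{displaymath}
\frac{(1-r_1)z}{1-r_1z}\cdot\frac{(1-r_2)z}{1-r_2z},
\end{displaymath}
which has only two parameters, yet is no~longer geometric.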

562: In this extended (two-level) stochastic model of the secondary structure of

563: a family of RNA sequences, the sequence length distribution still lies

564: in~$\mathcal{F}_{\rm alg}$.  That is because a (Dyck-type) SPDA wrapped

565: around one or more HMMs is still an SPDA, with an SCFG representation.

566: This observation is similar to the proof of Theorem~\ref{thm:hadamardetc}:

567: what has been constructed here is simply an SCFG with a special structure,

568: not given explicitly in Chomsky normal form.  The full set of model

569: parameters could be estimated by the Inside--Outside

570: algorithm~\cite{Lari90}, rather than by estimating Dyck-SCFG and run-length

571: parameters separately; but that would be less efficient.

572:
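The claim about the sequence length distribution can be illustrated by a
short calculation, under assumptions made only for this sketch: the
full-sequence rule is weighted as
$S\mapsto p_1\cdot IaIbI+p_2\cdot IaIbS+p_3\cdot IaSbI+p_4\cdot IaSbS$, and
every~$I$ has the same run-length distribution in~${\it PH}_d$, with
$z$-transform~$\phi(z)$, a rational function of~$z$.  Writing $H(z)$ for the
$z$-transform of the full sequence length, the rule gives
\begin{displaymath}
H = z^2\left[\,p_1\phi^3 + (p_2+p_3)\,\phi^2 H + p_4\,\phi H^2\right],
\end{displaymath}
a quadratic equation for~$H$ with coefficients rational in~$z$.  Hence $H$
is algebraic, consistent with the statement that the length distribution
lies in~$\mathcal{F}_{\rm alg}$.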

573: The test of the proposed SCFG architecture will be its value in secondary

574: structure prediction, since from any RNA sequence the most likely parse

575: tree, and paired-base subsequence, can be computed by maximum a~posteriori

576: estimation.

577:

578: %\bibliographystyle{IEEEtranS}

579: %\bibliography{general}

580:

581: % Generated by IEEEtranS.bst, version: 1.12 (2007/01/11)

582: \begin{thebibliography}{10}

583: \providecommand{\url}[1]{#1}

584: \csname url@samestyle\endcsname

585: \providecommand{\newblock}{\relax}

586: \providecommand{\bibinfo}[2]{#2}

587: \providecommand{\BIBentrySTDinterwordspacing}{\spaceskip=0pt\relax}

588: \providecommand{\BIBentryALTinterwordstretchfactor}{4}

589: \providecommand{\BIBentryALTinterwordspacing}{\spaceskip=\fontdimen2\font plus

590: \BIBentryALTinterwordstretchfactor\fontdimen3\font minus

591:   \fontdimen4\font\relax}

592: \providecommand{\BIBforeignlanguage}[2]{{%

593: \expandafter\ifx\csname l@#1\endcsname\relax

594: \typeout{** WARNING: IEEEtranS.bst: No hyphenation pattern has been}%

595: \typeout{** loaded for the language `#1'. Using the pattern for}%

596: \typeout{** the default language instead.}%

597: \else

598: \language=\csname l@#1\endcsname

599: \fi

600: #2}}

601: \providecommand{\BIBdecl}{\relax}

602: \BIBdecl

603:

604: \bibitem{Commault2003}

605: C.~Commault and S.~Mocanu, ``Phase-type distributions and representations: Some

606:   results and open problems for system theory,'' \emph{Int. J.~Control},

607:   vol.~76, no.~6, pp. 566--580, 2003.

608:

609: \bibitem{DeRijk98}

610: P.~de~Rijk, A.~Caers, Y.~van~de Peer, and R.~de~Wachter, ``Database on the

611:   structure of large ribosomal subunit {RNA},'' \emph{Nucleic Acids Research},

612:   vol.~26, no.~1, pp. 183--186, 1998.

613:

614: \bibitem{Fliess74}

615: M.~Fliess, ``Sur divers produits de s{\'e}ries formelles,'' \emph{Bull. Soc.

616:   Math. France}, vol. 102, pp. 181--191, 1974.

617:

618: \bibitem{Katayama}

619: T.~Katayama, M.~Okamoto, and H.~Enomoto, ``Characterization of the

620:   structure-generating functions of regular sets and the {D0L} growth

621:   functions,'' \emph{Inform. and Control}, vol.~36, no.~1, pp. 85--101, 1978.

622:

623: \bibitem{Knudsen99}

624: B.~Knudsen and J.~Hein, ``{RNA} secondary structure prediction using stochastic

625:   context-free grammars and evolutionary history,'' \emph{Bioinformatics},

626:   vol.~15, no.~6, pp. 446--454, 1999.

627:

628: \bibitem{KuichSalomaa}

629: W.~Kuich and A.~Salomaa, \emph{Semirings, Automata, Languages}.\hskip 1em plus

630:   0.5em minus 0.4em\relax New York/Berlin: Springer-Verlag, 1986.

631:

632: \bibitem{Lari90}

633: K.~Lari and S.~J. Young, ``The estimation of stochastic context-free grammars

634:   using the {I}nside--{O}utside algorithm,'' \emph{Computer Speech and

635:   Language}, vol.~4, no.~1, pp. 35--56, 1990.

636:

637: \bibitem{Maier8a}

638: R.~S. Maier, ``Phase-type distributions and the structure of finite {M}arkov

639:   chains,'' \emph{J.~Comp. Appl. Math.}, vol.~46, no.~3, pp. 449--453, 1993.

640:

641: \bibitem{Maier9}

642: R.~S. Maier and C.~A. O'Cinneide, ``A closure characterization of phase-type

643:   distributions,'' \emph{J.~Appl. Probab.}, vol.~29, no.~1, pp. 92--103, 1992.

644:

645: \bibitem{Nebel2004}

646: M.~E. Nebel, ``Investigation of the {B}ernoulli model for {RNA} secondary

647:   structure prediction,'' \emph{Bull. Math. Biol.}, vol.~66,

648: %  no.~5,

649:   pp. 925--964, 2004.

650:

651: \bibitem{Neuts}

652: M.~F. Neuts, \emph{Matrix-Geometric Solutions in Stochastic Models}.\hskip 1em

653:   plus 0.5em minus 0.4em\relax Baltimore, Maryland: Johns Hopkins University

654:   Press, 1981.

655:

656: \bibitem{OCinn90}

657: C.~A. O'Cinneide, ``Characterization of phase-type distributions,'' \emph{Comm.

658:   Statist. Stochastic Models}, vol.~6, no.~1, pp. 1--57, 1990.

659:

660: \bibitem{Poole36}

661: E.~G.~C. Poole, \emph{Linear Differential Equations}.\hskip 1em plus 0.5em

662:   minus 0.4em\relax Oxford: Oxford University Press, 1936.

663:

664: \bibitem{Sakakibara94}

665: Y.~Sakakibara, M.~Brown, R.~Hughey, I.~S. Mian, K.~Sj{\"o}lander, R.~C.

666:   Underwood, and D.~Haussler, ``Stochastic context-free grammars for {tRNA}

667:   modeling,'' \emph{Nucleic Acids Research}, vol.~22, no.~23, pp. 5112--5120,

668:   1994.

669:

670: \bibitem{Soittola}

671: M.~Soittola, ``Positive rational sequences,'' \emph{Theoret. Comput. Sci.},

672:   vol.~2, no.~3, pp. 317--322, 1976.

673:

674: \bibitem{Wuyts2001}

675: J.~Wuyts, P.~de~Rijk, Y.~van~de Peer, T.~Winkelmans, and R.~de~Wachter, ``The

676:   {E}uropean large ribosomal subunit {RNA} database,'' \emph{Nucleic Acids

677:   Research}, vol.~29, no.~1, pp. 175--177, 2001.

678: \end{thebibliography}

679: \end{document}

680: