1: \documentclass[10pt,conference]{IEEEtran}
2: 
3: \usepackage{amssymb}
4: \usepackage{amsmath}
5: \usepackage{eucal}	% use Zapf's beautiful calligraphic characters
6: 
7: \newtheorem{theorem}{Theorem}[section]
8: 
9: \newcommand{\defeq}{\stackrel{\triangle}{=}}
10: 
11: \begin{document}
12: 
13: \title{Parametrized Stochastic Grammars for\\
14: RNA Secondary Structure Prediction}
15: \author{\authorblockN{Robert S. Maier}
16: \authorblockA{Departments of Mathematics and Physics\\
17: University of Arizona\\
18: Tucson, AZ 85721, USA\\
19: Email: rsm@math.arizona.edu}}
20: 
21: \maketitle
22: 
23: \begin{abstract}
24: We propose a two-level stochastic context-free grammar (SCFG) architecture
25: for parametrized stochastic modeling of a family of RNA sequences,
26: including their secondary structure.  A~stochastic model of this type can
27: be used for maximum a~posteriori estimation of the secondary structure of
28: any new sequence in the family.  The proposed SCFG architecture models RNA
29: subsequences comprising paired bases as stochastically weighted
30: Dyck-language words, i.e., as weighted balanced-parenthesis expressions.
31: The length of each run of unpaired bases, forming a loop or a bulge, is
32: taken to have a phase-type distribution: that of the hitting time in a
33: finite-state Markov chain.  Without loss of generality, each such Markov
34: chain can be taken to have a bounded complexity.  The scheme yields an
35: overall family SCFG with a manageable number of parameters.
36: \end{abstract}
37: 
38: \section{Introduction}
39: \label{sec:intro}
40: In biological sequence analysis, probability distributions over finite
41: ($1$-dimensional) sequences of symbols, representing nucleotides or amino
42: acids, play a major role.  They specify the probability of a sequence
43: belonging to a specified family, and are usually generated by Markov
44: chains.  These include the stochastic finite-state Moore machines called
45: hidden Markov models (HMMs), and infinite-state Markov chains such as
46: stochastic push-down automata (SPDAs).  By computing the most probable path
47: through the Markov chain, one can answer such questions as ``What hidden
48: (e.g., phylogenetic) structure does a sequence have?'', and ``What
49: secondary structure will a sequence give rise~to?''.  The number of Markov
50: model parameters should ideally be kept to a minimum, to facilitate
51: parameter estimation and model validation.
52: 
53: The a~priori modeling of an RNA sequence family is considered here.  Due~to
54: Watson--Crick base pairing, a recursively structured RNA sequence will
55: fold, and display secondary structure.  To model stochastically both
56: pairings and runs of unpaired bases (which form loops and bulges), results
57: from a subfield of formal language theory, the {\em structure theory of
58: weighted strings\/}~\cite{KuichSalomaa} (each string being weighted by an
59: element of a specified `semiring' such as~${\mathbb{R}}_+$), are reviewed
60: and employed in stochastic model construction.
61: 
62: In Section~\ref{sec:duration}, {\em duration modeling\/} is discussed: the
63: modeling of a probability distribution on `runs', i.e., on the natural
64: numbers~$\mathbb{N}$.  A non-RNA biological example is the modeling and
65: prediction of CpG~islands in a DNA sequence.  A~sequence may flip between
66: CpG and non-CpG states, with distinct HMMs for generation of symbols in
67: $\{A,T,G,C\}$.  For ease of HMM parameter estimation, and for finding the
68: most probable parse, or path through the model (e.g., by the Viterbi
69: algorithm), the length of each CpG island and non-CpG region should be
70: modeled in a Markovian way, as the first hitting time in a finite-state
71: Markov chain.  That~is, on~$\mathbb N=\{0,1,2,\dots\}$, the set of possible
72: lengths, it should have a {\em phase-type
73: distribution\/}~\cite{Neuts,OCinn90}.  There is a theorem of the author's
74: on such distributions~\cite{Maier8a}, which grew out of results on
75: positively weighted {\em regular\/} sequences~\cite{Katayama,Soittola}.  It
76: says that without loss of generality, the structure of the Markov chain can
77: be greatly restricted: its `cyclicity' can be required to be at most~$2$.
78: This has implications for HMM parametrization.
79: 
80: The generating function $G(z)$ of a phase-type (PH) distribution
81: on~$\mathbb N$ (which is a normalized $\mathbb{R}_+$-weighted regular
82: language over a $1$-letter alphabet) is a {\em rational\/} function of~$z$.
83: Going beyond regular languages to the context-free case yields an {\em
84: algebraic\/} generating function: one of several variables, if each type of
85: letter in the sequence is separately kept track~of.  In RNA secondary
86: structure prediction, stochastic context-free grammars (SCFGs), usually in
87: Chomsky normal form, have been used~\cite{Sakakibara94}.  They tend to be
88: complicated; if the grammar has $k$~non-terminal symbols, then it may have
89: $O(k^3)$~transition probabilities, which must be estimated from training
90: sequences~\cite{Lari90}.  What is needed is a class of SCFGs with
91: (i)~restricted internal structure, (ii)~equivalent modeling power, and
92: (iii)~computationally convenient parametrization.  Finding such a class of
93: models is a hard problem: even on the level of $1$-letter-alphabet (i.e.,
94: univariate) generating functions, it involves the constructive theory of
95: algebraic functions.
96: 
97: In Section~\ref{sec:algebraic}, as a small step toward solving this
98: problem, it is pointed~out that there is a class of probability
99: distributions on~$\mathbb N$ with generating functions (i.e.,
100: $z$-transforms) that are algebraic and non-rational, which can be
101: conveniently parametrized.  This is the class of algebraic {\em
102: hypergeometric distributions\/}.  E.g., the $\mathbb{N}$-valued random
103: variable~$\tau$ could satisfy $\sum_{n=0}^\infty z^n\, Pr(\tau=n)\propto
104: {}_2F_1(a,b;c;z)$, where ${}_2F_1(a,b;c;\cdot)$ is Gauss's (parametrized)
105: hypergeometric function.  If $a,b,c$ are suitably chosen, $n\mapsto
106: Pr(\tau=n)$ will be a probability density function with an algebraic
107: $z$-transform.  Algebraic hypergeometric probability densities satisfy nice
108: recurrence relations, and SCFG interpretations for them can be worked~out.
109: 
110: A more general approach toward solving the above problem, not restricted to
111: the case of a $1$-letter alphabet, employs SCFGs with a two-level
112: structure.  In~effect, these are SCFGs wrapped around HMMs.  The following
113: is an illustration.  A~probabilistically weighted Dyck language over the
114: alphabet $\{a,b\}$, i.e., a distribution over the words in $\{a,b\}^*$ that
115: comprise nested $a$\textendash\nobreak$b$ pairs, is generated from a
116: symbol~$S$ by repeated application of the production rule $S\mapsto
117: p_1\cdot ab+p_2\cdot abS+p_3\cdot aSb +p_4\cdot aSbS$.  The
118: probabilities~$p_i$ sum to~$1$.  If each of $a,b$ in~turn represents a
119: weighted {\em regular\/} language over some alphabet~$\Sigma$ (e.g.,~a
120: PH-distribution if $\Sigma$~has only one letter), then the resulting
121: distribution over words in~$\Sigma^*$ comes from an SCFG with the stated
122: two-level structure.  This setup is familiar from (unweighted) language
123: theory applied to compilation: the top-level structure of a program is
124: specified as a word in a context-free language, and islands of low-level
125: structure (e.g., identifier names and arithmetic literals) as words in
126: regular languages.
127: 
128: In Section~\ref{sec:modeling}, it is indicated how the idea of an SCFG
129: wrapped around HMMs can be applied to RNA structure prediction: initially,
130: to the parametric stochastic modeling, in a given sequence family, of the
131: recursive primary structure that induces secondary folding.  The goal is
132: parameter estimation and model validation, by comparison with data on real
133: RNA sequences.  Knudsen and Hein~\cite{Knudsen99} and
134: Nebel~\cite{Nebel2004} have worked on this, using Dyck-like languages, but
135: stochastic modeling using distinct SCFG and HMM levels is a significant
136: advance.
137: 
138: On the level of primary RNA structure, paired nucleotides will make~up a
139: subsequence of the full nucleotide sequence, and must constitute a Dyck
140: word, for simplicity written as a word over~$\{a,b\}$.  A distribution over
141: the infinite family of such Dyck words is determined by the above
142: stochastic production rule, the parameters $p_1,p_2,p_3,p_4$ in which are
143: specific to the family being modeled.  The production rule for {\em full\/}
144: sequences, including unpaired nucleotides, will have not $ab$, $abS$,
145: $aSb$, $aSbS$ on its right-hand side, but rather $IaIbI$, $IaIbS$, $IaSbI$,
146: $IaSbS$, where each~$I$ expands to a `run' of unpaired nucleotides.  If the
147: four nucleotides are treated as equally likely in this context, each~$I$
148: will be a stochastic language over a $1$-letter alphabet, and the length of
149: each run is reasonably modeled as having a PH-distribution.  The PH class
150: includes geometric distributions, but is more general.  The overall SCFG is
151: obtained by wrapping the Dyck SCFG around the finite-state Markov chains
152: that yield the PH-distributions.
153: 
154: From a given family of RNA sequences, Dyck SCFG parameters can be
155: estimated, e.g., by the standard Inside--Outside Algorithm~\cite{Lari90};
156: and then HMM parameters (i.e., PH-distribution parameters) can be estimated
157: separately.  By employing a large enough class of PH-distributions, it
158: should be possible to produce a better fit to data on secondary structure
159: than was obtained from the few-parameter models of Knudsen and
160: Hein~\cite{Knudsen99} and Nebel~\cite{Nebel2004}.  Once the family has been
161: modeled, the most likely parse tree for any new RNA sequence in the family
162: can be computed by maximum a~posteriori estimation, using the CYK
163: algorithm~\cite{Sakakibara94}.  The sequence is predicted to have the
164: secondary structure represented by that parse tree.
165: 
166: \section{Duration Modeling}
167: \label{sec:duration}
168: Since loops and bulges in RNA secondary structure comprise runs of unpaired
169: nucleotides, they can be modeled without taking long-range covariation into
170: account.  The appropriate stochastic model is an HMM {\em with
171: absorption\/}, since the accurate modeling of run lengths is a goal.  Any
172: such HMM will specify a probability distribution on the set of finite
173: strings~$\Sigma^*$, where $\Sigma=\{A,U,G,C\}$ is the alphabet set, and
174: long words are exponentially unlikely.  There should be little change in
175: the nucleotide distribution along typical runs, so the distribution of the
176: string length $\tau\in\mathbb{N}$ is what is important.
177: 
178: The time~$\tau$ to reach a final (absorbing) state in an irreducible
179: discrete-time Markov chain on a state space $Q=\{1,\dots,m\}$, with a
180: transition matrix $\mathbf{T}=(T_{ij})_{i,j=1}^m$ that is {\em
181: substochastic\/} (i.e., $\sum_{j=1}^m T_{ij}\le1$), and an initial state
182: vector $\mathbf{\alpha}=(\alpha_i)_{i=1}^m$ that is also substochastic
183: (i.e., $\sum_{i=1}^m\alpha_i\le1$), is said to have a discrete
184: PH-distribution.  The substochasticity of $\mathbf{T}$
185: and~$\mathbf{\alpha}$ expresses the absorption of probability, since they
186: can be extended to a larger state space $\tilde Q=Q\cup\{m+1\}$, on which
187: they will be stochastic.  The added state $m+1$ is absorbing.
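
As a concrete illustration (a sketch, under the assumed convention that
$\tau$~counts the steps taken up to and including the one that enters
state~$m+1$), let $\mathbf{t}=(t_i)_{i=1}^m$ with $t_i\defeq
1-\sum_{j=1}^m T_{ij}$ be the vector of one-step absorption probabilities;
the symbol~$\mathbf{t}$ is introduced only for this illustration.  Then
\begin{displaymath}
Pr(\tau=0)=1-\sum_{i=1}^m\alpha_i,\qquad
Pr(\tau=n)=\sum_{i,j=1}^m \alpha_i\,(\mathbf{T}^{n-1})_{ij}\,t_j,\quad n\ge1.
\end{displaymath}
For instance, if $m=1$, $\alpha_1=1$ and $T_{11}=q\in(0,1)$, this reduces
to the geometric distribution $Pr(\tau=n)=(1-q)\,q^{n-1}$, $n\ge1$.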
188: 
189: There is a close connection between PH-distributions and finite automata
190: theory, in particular the theory of rational series over
191: semirings~\cite{KuichSalomaa}.  If $A$~is a semiring (a~set having binary
192: addition and multiplication operations, $\oplus$~and~$\odot$, each with an
193: associated identity element; but not necessarily having a unary negation
194: operation), then an $A$-{\em rational sequence\/}, $a=(a_n)_{n=0}^\infty\in
195: A^{\mathbb{N}}$, is a sequence of the form $a_n=\bigoplus_{i,j=1}^m
196: \left[u_i\odot(\mathbf{M}^n)_{ij}\odot v_j\right]$, where for some $m>0$,
197: $\mathbf{M}\in A^{m\times m}$ and $\mathbf{u},\mathbf{v}\in A^m$.  It is an
198: $A$-weighted regular language over a $1$-letter alphabet.  Semirings of
199: interest here include $\mathbb{R}$, $\mathbb{R}_+=\{x\in\mathbb{R}\mid
200: x\ge0\}$, and the Boolean semiring $\mathbb{B}=\{0,1\}$.
201: 
202: \smallskip
203: \begin{theorem}[\cite{Maier9}]
204: \label{thm:normalization}
205:   Any PH-distribution on~$\mathbb{N}$ is an $\mathbb{R}_+$-rational
206:   sequence.  Any {\em summable\/} $\mathbb{R}_+$-rational sequence, if
207:   normalized to have unit sum, becomes a PH-distribution.
208: \end{theorem}
209: 
210: \smallskip
211: If $\tau\in\mathbb{N}$ is PH-distributed, it is useful to focus on its
212: $z$-transform, i.e., $G(z)=E\left[z^\tau\right]=\sum_{n=0}^\infty
213: z^n\,Pr(\tau=n)$.  This will be a rational function, in~$\mathbb{R}_+(z)$.
214: If the distribution is {\em finitely supported\/}, it will be a polynomial,
215: in~$\mathbb{R}_+[z]$.
216: 
217: \smallskip
218: \begin{theorem}[\cite{Maier9}]
219: \label{thm:2}
220:   Any PH-distribution on~$\mathbb{N}$ can be generated from finitely
221:   supported distributions by repeated applications of (i)~the binary
222:   operation of mixture, i.e., $G_1,G_2\mapsto pG_1+(1-p)G_2$, where
223:   $p\in(0,1)$, (ii)~the binary operation of convolution, i.e.,
224:   $G_1,G_2\mapsto G_1G_2$, and (iii)~the unary `geometric mixture'
225:   operation, i.e., $G\mapsto (1-p)\sum_{k=0}^\infty p^kG^k=(1-p)/(1-pG)$,
226:   where $p\in(0,1)$.
227: \end{theorem}
228: 
229: \smallskip
230: This is a variant of the Kleene--Sch\"utzenberger Theorem on the
231: $A$-rational series associated to $A$-finite automata~\cite{KuichSalomaa}.
232: The Boolean ($A=\mathbb{B}$) case of their theorem is familiar from formal
233: language theory: it says that any regular language over a finite alphabet
234: can be generated from {\em finite\/} languages by repeated applications of
235: (i)~union, (ii)~concatenation, and (iii)~the so-called Kleene star
236: operation.  Just as in formal language theory, the third operation of
237: Theorem~\ref{thm:2} can be implemented on the automaton level by adding
238: `loopback', or cycle-inducing, transitions from final state(s) back to
239: initial state(s).
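
To illustrate how the operations of Theorem~\ref{thm:2} combine (an example
chosen here for concreteness), start from the finitely supported
distribution concentrated at $n=1$, for which $G(z)=z$.  Applying the
geometric mixture operation with parameter~$p$ gives
\begin{displaymath}
G(z)\mapsto\frac{1-p}{1-pz}=\sum_{n=0}^\infty (1-p)\,p^n z^n,
\end{displaymath}
i.e., a geometric distribution on~$\mathbb{N}$.  Convolving $r$~copies of
the result gives $\left[(1-p)/(1-pz)\right]^r$, the $z$-transform of a
negative binomial distribution, $Pr(\tau=n)=\binom{n+r-1}{n}(1-p)^r p^n$.
In this construction the geometric mixture operation is never nested.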
240: 
241: \smallskip
242: \begin{theorem}[\cite{Maier8a}]
243: \label{thm:3}
244: The unary--binary computation tree leading to any PH-distribution
245: on~$\mathbb{N}$, the leaves of which are finitely supported distributions,
246: can be required without loss of generality to have at most~$2$ unary
247: `geometric mixture' nodes along the path extending to the root from any
248: leaf.  That~is, those operations do~not need to be more than doubly nested.
249: \end{theorem}
250: 
251: \smallskip
252: This is a normalized, or `stochastic', version of a result on the
253: representation of $\mathbb{R}_+$-rational
254: sequences~\cite{Katayama,Soittola}.  Results of this type are strongly
255: semiring-dependent.  It is not difficult to see that in the cases
256: $A=\mathbb{R}$ and~$\mathbb{B}$, the analogue of the number~`$2$' is~`$1$'.
257: (This is because, e.g., a $\mathbb{B}$-rational sequence is simply a
258: sequence in~$\mathbb{B}^\mathbb{N}$ that is eventually periodic.)  The
259: proof of Theorem~\ref{thm:3} is an explicit construction, which respects
260: positivity constraints at each stage.  That~is, the construction solves the
261: {\em representation problem\/} for univariate PH-distributions, which has
262: strong connections to the positive realization problem in control
263: theory~\cite{Commault2003}.
264: 
265: What the theorem says, since operations of types (i),(ii),(iii) correspond
266: to parallel composition, serial composition, and cyclic iteration of Markov
267: chains, is that any PH-distribution arises without loss of generality from
268: a Markov chain in which cycles of states are nested at most $2$~deep.
269: That~is, the chain may include cycles, and cycles within cycles, but not
270: cycles within cycles within cycles.  So for modeling purposes, the chain
271: transition matrix~$\mathbf{T}$ may be taken to have a highly restricted
272: structure.  A~completely connected transition graph on a state space of
273: size~$m$ would have ${m}^{2}$~possible transitions, and would be
274: unnecessarily general when $m$~is large.
275: 
276: Unpaired nucleotide run lengths in RNA are naturally modeled as having
277: (discrete-time) PH~distributions because of the close connection with HMMs,
278: and the consequent ease of parameter estimation.  However, the class of
279: PH~distributions is so versatile that it would be useful in this context,
280: regardless.  Discrete PH~distributions include geometric and negative
281: binomial distributions, and are dense (in~a suitable sense) in the class of
282: distributions $Pr(\tau=n)$, $n\in\mathbb{N}$, which have leading-order
283: geometric falloff as~$n\to\infty$.
284: 
285: Any PH distribution on~$\mathbb{N}$ has a $z$-transform
286: $G(z)=E\left[z^\tau\right]$ that is rational in the conventional sense;
287: equivalently, it must satisfy a finite-depth recurrence relation of the
288: form $\sum_{k=0}^N c_k\,Pr(\tau=n+k)=0$.  In~fact, any probability
289: distribution on~$\mathbb{N}$ with (i)~a rational $z$-transform $G(z)$, and
290: (ii)~the property that the pole which $G(z)$ necessarily has at~$z=1$ is
291: the {\em only\/} pole on the circle $\left|z\right|=1$, is necessarily a PH
292: distribution~\cite{OCinn90}.  This is a sort of converse of the
293: Perron--Frobenius Theorem.  However, there are distributions
294: on~$\mathbb{N}$ which satisfy~(i) but not~(ii), and are not PH
295: distributions.  They are necessarily $\mathbb{R}$-rational sequences, but are
296: not $\mathbb{R}_+$-rational sequences as defined above, even though they
297: are sequences of elements of~$\mathbb{R}_+$ (probabilities).  In
298: abstract-algebraic terms, this situation is possible because the
299: semiring~$\mathbb{R}$ is not a {\em Fatou extension\/} of the
300: semiring~$\mathbb{R}_+$~\cite{KuichSalomaa}.  The existence of pathological
301: examples of this type does not vitiate the usefulness of PH distributions
302: in run-length modeling.
303: 
304: \section{Algebraic Sequences}
305: \label{sec:algebraic}
306: Most work on RNA secondary structure prediction that draws on formal
307: language theory has employed SCFGs~\cite{Knudsen99,Nebel2004,Sakakibara94}.
308: A~CFG in Chomsky normal form~(CNF), used for generating strings
309: in~$\Sigma^*$ where $\Sigma$~is a finite alphabet set, is a set of
310: production rules of the form $V\mapsto W_1W_2$ or $V\mapsto a$, where
311: $V,W_1,W_2$ are elements of a set~$\mathcal{V}$ of `variables', i.e.,
312: nonterminal symbols, and~$a\in\Sigma$.  There is a distinguished start
313: symbol $S\in\mathcal{V}$ with which the process begins.  Applying the
314: production rules repeatedly yields a subset $L\subset\Sigma^*$, i.e., a
315: language.  An SCFG assigns probabilities (which add to unity) to the
316: productions of each $V\in\mathcal{V}$, and yields a probability
317: distribution over the strings in $L\subset\Sigma^*$, i.e., over~$\Sigma^*$.
318: 
319: The probability distribution $P:\Sigma^*\to[0,1]\subset\mathbb{R}_+$
320: produced by an SCFG is an example of an $\mathbb{R}_+$-algebraic series.
321: In general, if $A$~is a semiring, an $A$-algebraic series (of CNF type)
322: over an alphabet~$\Sigma$ is a weighting function $f\colon\Sigma^*\to A$
323: obtained as one component (i.e., the component~$f_S$) of the formal
324: solution of a coupled set of quadratic equations
325: \begin{displaymath}
326:   f_V=\sum_{W_1,W_2\in\mathcal{V}} c_{V;W_1,W_2}\, f_{W_1}f_{W_2} +
327:   \sum_{a\in\Sigma} c_{V;a}\,a,\qquad V\in\mathcal{V},
328: \end{displaymath}
329: computed by iteration~\cite{KuichSalomaa}.  The coefficients
330: $c_{V;W_1,W_2}$ and~$c_{V;a}$ are elements of~$A$, so each~$f_V$ is a sum
331: of $A$-weighted strings in~$\Sigma^*$, or equivalently a function
332: $f_V\colon\Sigma^*\to A$.  It~is clear that Theorem~\ref{thm:normalization}
333: has an analogue: any probability distribution on~$\Sigma^*$ produced by an
334: SCFG of CNF type is simply a {\em normalized\/} $\mathbb{R}_+$-algebraic
335: series.
336: 
337: Any SCFG of CNF type has $\left|\mathcal{V}\right|^3 +
338: \left|\mathcal{V}\right|\left|\Sigma\right|$ parameters, which may be too
339: many for practical estimation if a small sequence family is being modeled.
340: To~facilitate modeling, one should use an SCFG with a restricted structure,
341: and also exploit results from weighted automata theory.  If the nucleotide
342: distribution does not vary much along typical sequences, then the alphabet
343: set~$\Sigma$ can be taken to be a $2$-letter alphabet $\{a,b\}$ (if~one is
344: modeling Watson--Crick pairing, exclusively) or even a $1$-letter alphabet
345: (if~one is modeling runs of unpaired bases).  Also, one can leverage the
346: fact that $A$-algebraic series subsume $A$-rational series, which implies
347: (in~the $1$-letter case) that $A$-algebraic {\em sequences\/}, which are
348: effectively indexed by~$\mathbb{N}$, subsume $A$-rational sequences.  In
349: the Boolean ($A=\mathbb{B}$) case, the first statement is the familiar
350: Chomsky hierarchy.
351: 
352: In the case of a $1$-letter alphabet $\Sigma=\{a\}$, an SCFG defines a
353: probability distribution on $\mathbb{N}\cong\{a\}^*$.  That is, it defines
354: an $\mathbb{N}$-valued random variable~$\tau$, the length of the string
355: emitted by the stochastic push-down automaton (SPDA) corresponding to
356: the~SCFG.  The SPDA uses~$\mathcal{V}$, the set of nonterminal symbols, as
357: its stack alphabet, and its stack is initially occupied by the start
358: symbol~$S$.  The stochastic production rules specify what happens when a
359: symbol $V\in\mathcal{V}$ is popped off the stack: either two symbols
360: $W_1,W_2\in\mathcal{V}$ are pushed back, or a letter~`$a$' is emitted.  By
361: construction, at~least one letter must be emitted by a CNF-type SCFG before
362: its stack empties, so $Pr(\tau=0)=0$.
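
As a minimal worked example (the grammar is chosen here for illustration,
and is not taken from the cited literature), consider the one-variable
CNF-type SCFG $S\mapsto p\cdot SS+(1-p)\cdot a$, with
$p\in(0,\tfrac12]$.  Conditioning on the first production shows that the
$z$-transform $G(z)=E\left[z^\tau\right]$ satisfies $G=p\,G^2+(1-p)\,z$, so
that
\begin{displaymath}
G(z)=\frac{1-\sqrt{1-4p(1-p)z}}{2p},\qquad
Pr(\tau=n)=C_{n-1}\,p^{n-1}(1-p)^n,\quad n\ge1,
\end{displaymath}
where $C_k=\binom{2k}{k}/(k+1)$ is the $k$'th Catalan number, which counts
the binary parse trees with $k+1$~leaves.  Here $Pr(\tau=0)=0$ and
$G(1)=1$, as expected; for $p>\tfrac12$ one would find $G(1)=(1-p)/p<1$,
i.e., the stack fails to empty with positive probability.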
363: 
364: The class of probability distributions on~$\mathbb{N}$ associated to SCFGs
365: (whether or~not of CNF type), i.e., that of normalized
366: $\mathbb{R}_+$-algebraic sequences, is potentially useful in parametric
367: stochastic modeling, but has not been widely employed.  It will be denoted
368: $\mathcal{F}_{\rm alg}$ here, since each distribution in~it has an
369: algebraic $z$-transform $G(z)=\sum_{n=0}^\infty z^n\,Pr(\tau=n)$.  For any
370: SCFG, an algebraic equation satisfied by~$G(z)$ can be computed by
371: polynomial elimination (e.g., by computing the resultant of the above
372: system of quadratic equations).  Let ${\it PH}_d$ denote the class of
373: discrete phase-type distributions.
374: 
375: \smallskip
376: \begin{theorem}
377: \label{thm:hadamardetc}
378:   (i) ${\it PH}_d\subset\mathcal{F}_{\rm alg}$.  (ii)~If $X,Y$ are
379:   independent $\mathbb{N}$-valued random variables~(RVs) with distributions
380:   in~${\it PH}_d$, then conditioning on~$X=Y$ yields an~RV with
381:   distribution in~$\mathcal{F}_{\rm alg}$.  (iii)~If, furthermore, $Z$~is
382:   an independent $\mathbb{N}$-valued RV with distribution
383:   in~$\mathcal{F}_{\rm alg}$, then conditioning on~$X=Z$ yields an~RV with
384:   distribution in~$\mathcal{F}_{\rm alg}$.
385: \end{theorem}
386: 
387: \smallskip
388: These are `normalized' versions of standard facts on $A$-rational
389: and~$A$-algebraic series, in particular on their composition under the
390: Hadamard product $(x_n),(y_n)\mapsto (x_ny_n)$, in the special case when
391: $A=\mathbb{R}_+$ and~$\Sigma=\{a\}$.  (See~\cite{Fliess74,KuichSalomaa}.)
392: They have direct probabilistic proofs.  E.g., to prove~(i), one would show
393: that starting from the distribution of~$\tau\in\mathbb{N}$, the absorption
394: time in an~HMM, one can construct an SCFG that yields the same distribution
395: on~$\mathbb{N}$.  (If $Pr(\tau=0)>0$ then the SCFG cannot be of CNF type.)
396: The procedure is similar to constructing a PDA that accepts a specified
397: regular language.
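
For example (a sketch for the simplest case only), the geometric
distribution $Pr(\tau=n)=(1-q)\,q^{n-1}$, $n\ge1$, which is the absorption
time of a one-state chain with $T_{11}=q$, is produced by the right-linear
SCFG $S\mapsto q\cdot aS+(1-q)\cdot a$.  Its $z$-transform satisfies the
linear equation $G=z\left[qG+(1-q)\right]$, so
\begin{displaymath}
G(z)=\frac{(1-q)\,z}{1-qz},
\end{displaymath}
which is rational, as it must be for a distribution in~${\it PH}_d$.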
398: 
399: Much as with discrete PH distributions, it is difficult to parametrize
400: distributions in the class~$\mathcal{F}_{\rm alg}$ without, rather
401: explicitly, parametrizing the stochastic automata (SCFGs or SPDAs) that
402: give rise to them; or at~least their $z$-transforms.  It is difficult,
403: in~general, to characterize when a probability distribution on~$\mathbb{N}$
404: that has an algebraic $z$-transform lies in~$\mathcal{F}_{\rm alg}$.
405: 
406: The following example illustrates the problem.  Any distribution $n\mapsto
407: Pr(\tau=n)$ on~$\mathbb{N}$ that has an algebraic $z$-transform necessarily
408: satisfies a finite-depth recurrence of the form $\sum_{k=0}^N
409: C_k(n)\,Pr(\tau=n+k)=0$, where the functions~$C_k$, $k=0,\dots,N,$ are
410: polynomial in~$n$.  (If~none of the~$C_k$ depends on~$n$, then the
411: $z$-transform will be rational.)  Consider, for example, the $2$-term
412: recurrence
413: \begin{displaymath}
414: (n+a)(n+b)\,Pr(\tau=n) = (n+c)(n+1)\,Pr(\tau=n+1),
415: \end{displaymath}
416: where $a,b,c\in\mathbb{R}$ are parameters, which is of this form.  The
417: $z$-transform $G(z)=\sum_{n=0}^\infty z^n\,Pr(\tau=n)$ of its solution is
418: proportional, by definition, to ${}_2F_1(a,b;c;z)$, which is Gauss's
419: parametrized hypergeometric function.  The set of triples
420: $(a,b;c)\in\mathbb{R}^3$ that yields an {\em algebraic\/} $z$-transform,
421: and hence an $\mathbb{R}$-algebraic sequence $n\mapsto Pr(\tau=n)$, is
422: explicitly known.  It was derived in the nineteenth century by
423: H.~A. Schwarz~\cite[Chap.~VII]{Poole36}.  Unfortunately, it is an {\em
424: infinite discrete\/} subset of~$\mathbb{R}^3$, not a continuous subset.
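
One triple for which everything can be checked by hand (given here as an
illustration; it is not singled out in the sources cited above) is
$(a,b;c)=(\tfrac12,1;2)$.  The recurrence becomes
$Pr(\tau=n+1)/Pr(\tau=n)=(n+\tfrac12)/(n+2)$, which is solved by
\begin{displaymath}
Pr(\tau=n)=\frac{C_n}{2^{2n+1}},\qquad
G(z)=\tfrac12\,{}_2F_1(\tfrac12,1;2;z)=\frac{1-\sqrt{1-z}}{z},
\end{displaymath}
where $C_n=\binom{2n}{n}/(n+1)$ is the $n$'th Catalan number.  The
$z$-transform is algebraic, since it satisfies $zG^2-2G+1=0$, and $G(1)=1$.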
425: 
426: In~general, the $z$-transform of the solution of a finite-depth recurrence
427: of the above form will be algebraic in~$z$ only if the overall parameter
428: vector of its coefficients, the polynomials $\{C_k(n)\}_{k=0}^N$, is
429: confined to a submanifold of positive codimension.  For distributions
430: in~$\mathcal{F}_{\rm alg}$, this makes recurrence-based parametrization
431: less useful than SCFG-based or $z$-transform-based parametrization.
432: 
433: \section{Modeling Secondary Structure}
434: \label{sec:modeling}
435: A new scheme for modeling the prior distribution of secondary structures in
436: an RNA sequence family will now be proposed.  It will exploit the insights
437: of Sections \ref{sec:duration} and~\ref{sec:algebraic}, on the class of
438: discrete phase-type distributions on~$\mathbb{N}$ (i.e.,~${\it PH}_d$), and
439: the larger class of $\mathbb{R}_+$-algebraic distributions on~$\mathbb{N}$
440: (i.e.,~$\mathcal{F}_{\rm alg}$).
441: 
442: If $\Sigma=\{A,U,G,C\}$ is the alphabet set, any SCFG, or its associated
443: SPDA, will define a probability distribution on~$\Sigma^*$, the set of
444: finite length sequences~\cite{Sakakibara94}.  (The distribution of the
445: sequence length, which is a random variable, lies in~$\mathcal{F}_{\rm
446: alg}$.)  But even if the SCFG is in Chomsky normal form (CNF), the number
447: of its parameters grows cubically in the number of grammar variables, as
448: mentioned above.  To~facilitate estimation, the model should have a
449: restricted structure.
450: 
451: The models of Knudsen and Hein~\cite{Knudsen99} and Nebel~\cite{Nebel2004}
452: are representative.  The Knudsen--Hein SCFG has variable set
453: $\mathcal{V}=\{S,L,F\}$, and production rules
454: \begin{align*}
455:   S&\mapsto LS\mid L,{\rm\ i.e.,\ }S\mapsto L^+\defeq L\mid L^2\mid L^3\mid\cdots,\\
456:   L&\mapsto s\mid a_1Fb_1,\\
457:   F&\mapsto a_2Fb_2\mid LS,{\rm\ i.e.,\ }F\mapsto a_2Fb_2\mid LL^+.
458: \end{align*}
459: Here $s$ signifies an unpaired base and $a_i\dots b_i$ signifies two bases
460: that are paired in the secondary structure, so $L^+$~produces runs of
461: unpaired bases, i.e., loops (which may include stems), and $F$~produces runs
462: of paired bases, i.e., stems (which may include loops of length at
463: least~$2$).  This SCFG is not a CNF one, but model parameters may be
464: estimated by a variant of the Inside--Outside algorithm.  If one takes
465: single base frequencies and pair frequencies (i.e., the probability of
466: $a_i\dots b_i$ representing $A$\textendash\nobreak$U$,
467: $G$\textendash\nobreak$C$, or even $G$\textendash\nobreak$U$) into account,
468: one has only three independent parameters to be estimated, one probability
469: per production rule.  Knudsen and Hein (cf.\ Nebel) used as their primary
470: training set a subset of the European database of long subunit ribosomal
471: RNAs (LSU rRNAs)~\cite{DeRijk98,Wuyts2001}.  For the probabilities of $LS$
472: vs.~$L$ (from~$S$), they estimated $87\%$ vs.~$13\%$; for $s$ vs.~$a_1Fb_1$
473: (from~$L$), $90\%$ vs.~$10\%$; and for $a_2Fb_2$ vs.~$LS$ (from~$F$),
474: $79\%$ vs.~$21\%$.  Their training set actually included tRNAs as~well,
475: since they were attempting to model the family of folded RNA molecules as a
476: whole.
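
These estimates make the geometric character of the model easy to quantify
(the following arithmetic is a back-of-envelope illustration based only on
the rounded percentages just quoted).  The number~$k$ of~$L$s produced
from~$S$ satisfies $Pr(k)=(0.87)^{k-1}(0.13)$, $k\ge1$, with mean
$1/0.13\approx7.7$.  Likewise, a stem opened by $L\mapsto a_1Fb_1$ is
extended by $j$~applications of $F\mapsto a_2Fb_2$ with
$Pr(j)=(0.79)^{j}(0.21)$, $j\ge0$, so its mean length is
\begin{displaymath}
1+\frac{0.79}{0.21}=\frac{1}{0.21}\approx4.8
\end{displaymath}
base pairs.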
477: 
478: As Knudsen and Hein note, their model yields loops and stems with
479: geometrically distributed lengths.  To improve quantitative agreement, it
480: would need to be made more sophisticated.  It would also benefit from a
481: cleaner separation between its two levels: the paired-base and
482: unpaired-base levels, i.e., the context-free and regular levels (in~the
483: formal language sense), i.e., the SPDA and HMM levels (in~the stochastic
484: automata-theoretic sense).  The above production rules couple the two
485: levels together.  It is not clear from Ref.~\cite{Knudsen99} how well the
486: model stochastically fits the lengths of (i)~the training sequences, (ii)~the
487: subsequences comprising paired bases, and (iii)~the subsequences comprising
488: unpaired bases.  Separating the two levels should facilitate the separate
489: fitting of these quantities.
490: 
491: By definition, folded RNA secondary structure is characterized by a
492: subsequence comprising paired bases, so the stochastic modeling of
493: secondary structure in a given family should initially focus on such
494: subsequences.  If pseudo-knots (a~thorny problem for automata-theoretic
495: modeling) are ignored, these subsequences are effectively {\em Dyck
496: words\/}, or balanced parenthesis expressions.  In the absence of
497: covariation, one expects to be able to generate such words over
498: $\{A,U,G,C\}$ from classical Dyck words over the $2$-letter alphabet
499: $\{a,b\}$, consisting of opening and closing parentheses, by replacing each
500: $a$\textendash\nobreak$b$ pair independently by an
501: $A$\textendash\nobreak$U$, $G$\textendash\nobreak$C$, or
502: $G$\textendash\nobreak$U$ pair, according to observed pair frequencies.
503: Knudsen and Hein note that order matters: in
504: $G$\textendash\nobreak$C$\,/\,$C$\textendash\nobreak$G$ pairs in~tRNA,
505: the~$G$ tends to be nearer the $5'$~end of the RNA than the~$C$.  Still to
506: be resolved, of~course, is the selection of the underlying probability
507: distribution over Dyck words in~$\{a,b\}^*$.
508: 
509: One could start with any CFG that unambiguously generates the Dyck words
510: over $\{a,b\}$, and make it stochastic by weighting its productions.  The
511: simplest such CFG is a $1$-variable one, $S\mapsto ab\mid abS\mid aSb \mid aSbS$.
512: The corresponding SCFG is
513: \begin{displaymath}
514: S   \mapsto p_1\cdot ab+p_2\cdot abS+p_3\cdot aSb +p_4\cdot aSbS,
515: \end{displaymath}
516: where $\sum_ip_i=1$.  This SCFG, with $3$~free parameters, is so simple
517: that it can be studied analytically.  The length of a Dyck word is an
518: $\mathbb{N}$-valued random variable, the distribution of which lies
519: in~$\mathcal{F}_{\rm alg}$, with a parameter-dependent, algebraic
520: $z$-transform.  As was explained in Section~\ref{sec:algebraic}, it is best
521: to parametrize distributions in~$\mathcal{F}_{\rm alg}$ by the SCFGs that
522: give rise to them, rather than by explicit formulas or even by the
523: recurrence relations that they satisfy; and this is an example.
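
As a hedged illustration of this analytic tractability (a sketch, not taken
from the preceding sections), write
$G(z)=\sum_n\Pr[\text{length}=n]\,z^n$ for the generating function of the
Dyck word length; the symbol~$G$ and this sign convention for the
$z$-transform are assumptions made here.  Conditioning on the first
production applied to~$S$ gives
\begin{displaymath}
G = z^2\left[p_1 + (p_2+p_3)\,G + p_4\,G^2\right],
\end{displaymath}
a quadratic equation whose root vanishing at $z=0$ is, for $p_4>0$,
\begin{displaymath}
G(z) = \frac{1-(p_2+p_3)z^2-\sqrt{[1-(p_2+p_3)z^2]^2-4p_1p_4z^4}}{2p_4z^2}.
\end{displaymath}
Setting $z=1$ and using $\sum_ip_i=1$ gives $G(1)=\min(p_1,p_4)/p_4$, so
under these assumptions the SCFG generates a finite word with
probability~$1$ precisely when $p_4\le p_1$.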
524: 
525: This Dyck model could be made arbitrarily more versatile, since arbitrarily
526: complicated CFGs that generate the Dyck language over $\{a,b\}$ can readily
527: be constructed.  One could, for instance, iterate $S\mapsto ab\mid abS\mid
528: aSb \mid aSbS$ once, obtaining a production rule for~$S$ with
529: $25$~alternatives on its right-hand side.  Weighting them with
530: probabilities would yield an SCFG with $24$~independent parameters, which
531: would be capable of much more accurate fitting of data on an empirical
532: family of RNA sequences.  In~general, one could choose model parameters to
533: fit not~only the observed distribution of Dyck word lengths (i.e.,
534: per-family paired-base subsequence lengths), but also the distribution of
535: lengths of stems, i.e., runs of contiguous paired bases, which may be far
536: from geometric.
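
The count of $25$ can be checked by a short calculation, which uses nothing
beyond the rule itself.  Substituting the four alternatives for each
occurrence of~$S$ on the right-hand side leaves $ab$ unchanged, turns each
of $abS$ and $aSb$ into four alternatives, and turns $aSbS$ into $4\times4$
alternatives:
\begin{displaymath}
1+4+4+4^2=25,
\end{displaymath}
so normalization of the weights leaves $25-1=24$ free parameters.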
537: 
538: The preceding discussion of Dyck words formed from paired bases ignored
539: loops, i.e., runs of unpaired bases.  They are best handled on a second
540: level of the SCFG.  The simplest production rule for {\em full\/} sequences
541: would have not $ab$, $abS$, $aSb$, $aSbS$ on its right-hand side, but rather
542: $IaIbI$, $IaIbS$, $IaSbI$, $IaSbS$, where each of the eight~$I$s expands to
543: a run of unpaired bases.  In the absence of covariation, modeling each run
544: is a matter of duration modeling.  Starting from an $\mathbb{N}$-valued
545: random run length, or equivalently a distribution over finite $1$-letter
546: words, one would generate a run of unpaired bases by replacing each letter
547: independently by one of $A,U,G,C$, according to family-specific single-base
548: frequencies.
549: 
550: Each run length is naturally taken to have a distribution in~${\it PH}_d$,
551: since that will allow the resulting run of bases to be generated by an HMM
552: (with absorption).  Each run length will be the absorption time in a
553: finite-state Markov chain, the parameters of which, i.e., transition
554: probabilities, can be estimated from empirical data.  Geometric
555: distributions, and their phase-type generalizations, are appropriate.  It follows from
556: Theorem~\ref{thm:3} that employing a large Markov chain with a fully
557: connected transition graph, and hence a number of parameters that grows
558: quadratically in the number of states, would {\em not\/} be appropriate.
559: Without loss of generality, each transition graph can be assumed to have no
560: `cycles within cycles within cycles'.
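
For illustration only, the simplest case is a Markov chain with a single
transient state that is re-entered with probability~$q$ and left for the
absorbing state with probability $1-q$.  The symbols $q$ and~$N$, and the
convention that the absorption time~$N$ counts the absorbing step, are
assumptions made here, not notation from the earlier sections.  Then
\begin{displaymath}
\Pr[N=n]=(1-q)\,q^{n-1},\quad n\ge1,
\qquad
\sum_{n\ge1}\Pr[N=n]\,z^n=\frac{(1-q)z}{1-qz},
\end{displaymath}
a geometric run length with one parameter and a rational generating
function.  Placing $k$ such states in series, with self-loop probabilities
$q_1,\dots,q_k$, gives the product of $k$ such factors, with only
$k$~parameters.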
561: 
562: In this extended (two-level) stochastic model of the secondary structure of
563: a family of RNA sequences, the sequence length distribution still lies
564: in~$\mathcal{F}_{\rm alg}$.  That is because a (Dyck-type) SPDA wrapped
565: around one or more HMMs is still an SPDA, with an SCFG representation.
566: This observation is similar to the proof of Theorem~\ref{thm:hadamardetc}:
567: what has been constructed here is simply an SCFG with a special structure,
568: not given explicitly in Chomsky normal form.  The full set of model
569: parameters could be estimated by the Inside--Outside
570: algorithm~\cite{Lari90}, rather than by estimating Dyck-SCFG and run-length
571: parameters separately; but that would be less efficient.
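
The claim that the sequence length distribution remains
in~$\mathcal{F}_{\rm alg}$ can be illustrated by a hedged sketch.  Suppose,
purely for illustration, that the four full-sequence alternatives $IaIbI$,
$IaIbS$, $IaSbI$, $IaSbS$ carry weights $p_1,\dots,p_4$, and that every~$I$
expands independently to a run whose length has the same rational
generating function~$H(z)$.  (The common~$H$, the reuse of the~$p_i$, and
the symbol~$F$ below are assumptions made here.)  The generating
function~$F(z)$ of the full sequence length then satisfies
\begin{displaymath}
F = z^2\left[p_1H^3 + (p_2+p_3)\,H^2F + p_4\,HF^2\right],
\end{displaymath}
a polynomial equation in~$F$ with coefficients rational in~$z$, so $F$ is
algebraic.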
572: 
573: The test of the proposed SCFG architecture will be its value in secondary
574: structure prediction, since from any RNA sequence the most likely parse
575: tree, and paired-base subsequence, can be computed by maximum a~posteriori
576: estimation.
577: 
578: %\bibliographystyle{IEEEtranS}
579: %\bibliography{general}
580: 
581: % Generated by IEEEtranS.bst, version: 1.12 (2007/01/11)
582: \begin{thebibliography}{10}
583: \providecommand{\url}[1]{#1}
584: \csname url@samestyle\endcsname
585: \providecommand{\newblock}{\relax}
586: \providecommand{\bibinfo}[2]{#2}
587: \providecommand{\BIBentrySTDinterwordspacing}{\spaceskip=0pt\relax}
588: \providecommand{\BIBentryALTinterwordstretchfactor}{4}
589: \providecommand{\BIBentryALTinterwordspacing}{\spaceskip=\fontdimen2\font plus
590: \BIBentryALTinterwordstretchfactor\fontdimen3\font minus
591:   \fontdimen4\font\relax}
592: \providecommand{\BIBforeignlanguage}[2]{{%
593: \expandafter\ifx\csname l@#1\endcsname\relax
594: \typeout{** WARNING: IEEEtranS.bst: No hyphenation pattern has been}%
595: \typeout{** loaded for the language `#1'. Using the pattern for}%
596: \typeout{** the default language instead.}%
597: \else
598: \language=\csname l@#1\endcsname
599: \fi
600: #2}}
601: \providecommand{\BIBdecl}{\relax}
602: \BIBdecl
603: 
604: \bibitem{Commault2003}
605: C.~Commault and S.~Mocanu, ``Phase-type distributions and representations: Some
606:   results and open problems for system theory,'' \emph{Int. J.~Control},
607:   vol.~76, no.~6, pp. 566--580, 2003.
608: 
609: \bibitem{DeRijk98}
610: P.~de~Rijk, A.~Caers, Y.~van~de Peer, and R.~de~Wachter, ``Database on the
611:   structure of large ribosomal subunit {RNA},'' \emph{Nucleic Acids Research},
612:   vol.~26, no.~1, pp. 183--186, 1998.
613: 
614: \bibitem{Fliess74}
615: M.~Fliess, ``Sur divers produits de s{\'e}ries formelles,'' \emph{Bull. Soc.
616:   Math. France}, vol. 102, pp. 181--191, 1974.
617: 
618: \bibitem{Katayama}
619: T.~Katayama, M.~Okamoto, and H.~Enomoto, ``Characterization of the
620:   structure-generating functions of regular sets and the {D0L} growth
621:   functions,'' \emph{Inform. and Control}, vol.~36, no.~1, pp. 85--101, 1978.
622: 
623: \bibitem{Knudsen99}
624: B.~Knudsen and J.~Hein, ``{RNA} secondary structure prediction using stochastic
625:   context-free grammars and evolutionary history,'' \emph{Bioinformatics},
626:   vol.~15, no.~6, pp. 446--454, 1999.
627: 
628: \bibitem{KuichSalomaa}
629: W.~Kuich and A.~Salomaa, \emph{Semirings, Automata, Languages}.\hskip 1em plus
630:   0.5em minus 0.4em\relax New York/Berlin: Springer-Verlag, 1986.
631: 
632: \bibitem{Lari90}
633: K.~Lari and S.~J. Young, ``The estimation of stochastic context-free grammars
634:   using the {I}nside--{O}utside algorithm,'' \emph{Computer Speech and
635:   Language}, vol.~4, no.~1, pp. 35--56, 1990.
636: 
637: \bibitem{Maier8a}
638: R.~S. Maier, ``Phase-type distributions and the structure of finite {M}arkov
639:   chains,'' \emph{J.~Comp. Appl. Math.}, vol.~46, no.~3, pp. 449--453, 1993.
640: 
641: \bibitem{Maier9}
642: R.~S. Maier and C.~A. O'Cinneide, ``A closure characterization of phase-type
643:   distributions,'' \emph{J.~Appl. Probab.}, vol.~29, no.~1, pp. 92--103, 1992.
644: 
645: \bibitem{Nebel2004}
646: M.~E. Nebel, ``Investigation of the {B}ernoulli model for {RNA} secondary
647:   structure prediction,'' \emph{Bull. Math. Biol.}, vol.~66, 
648: %  no.~5, 
649:   pp. 925--964, 2004.
650: 
651: \bibitem{Neuts}
652: M.~F. Neuts, \emph{Matrix-Geometric Solutions in Stochastic Models}.\hskip 1em
653:   plus 0.5em minus 0.4em\relax Baltimore, Maryland: Johns Hopkins University
654:   Press, 1981.
655: 
656: \bibitem{OCinn90}
657: C.~A. O'Cinneide, ``Characterization of phase-type distributions,'' \emph{Comm.
658:   Statist. Stochastic Models}, vol.~6, no.~1, pp. 1--57, 1990.
659: 
660: \bibitem{Poole36}
661: E.~G.~C. Poole, \emph{Linear Differential Equations}.\hskip 1em plus 0.5em
662:   minus 0.4em\relax Oxford: Oxford University Press, 1936.
663: 
664: \bibitem{Sakakibara94}
665: Y.~Sakakibara, M.~Brown, R.~Hughey, I.~S. Mian, K.~Sj{\"o}lander, R.~C.
666:   Underwood, and D.~Haussler, ``Stochastic context-free grammars for {tRNA}
667:   modeling,'' \emph{Nucleic Acids Research}, vol.~22, no.~23, pp. 5112--5120,
668:   1994.
669: 
670: \bibitem{Soittola}
671: M.~Soittola, ``Positive rational sequences,'' \emph{Theoret. Comput. Sci.},
672:   vol.~2, no.~3, pp. 317--322, 1976.
673: 
674: \bibitem{Wuyts2001}
675: J.~Wuyts, P.~de~Rijk, Y.~van~de Peer, T.~Winkelmans, and R.~de~Wachter, ``The
676:   {E}uropean large ribosomal subunit {RNA} database,'' \emph{Nucleic Acids
677:   Research}, vol.~29, no.~1, pp. 175--177, 2001.
678: \end{thebibliography}
679: \end{document}
680: