1: \documentclass[11pt]{article}
2: \usepackage{epsfig}
3: \usepackage{amssymb}
4: \usepackage{amsthm}
5: \usepackage{amscd}
6: \usepackage{amsfonts}
7: \usepackage{amsmath}
8: \usepackage[T1]{fontenc}
9: \usepackage{ae,aecompl}
10: \usepackage{pslatex}
11: \usepackage{graphicx}
12: \usepackage{color}
13: \usepackage{latexsym} % for \Box.
14: %\usepackage{fullpage}
15: \usepackage{url}
16: \usepackage{setspace}
17:
18: \setlength{\textwidth}{6.5in}
19: \setlength{\textheight}{8.5in}
20: \setlength{\oddsidemargin}{0.0in}
21: \setlength{\evensidemargin}{0.0in}
22: \setlength{\topmargin}{0.0in}
23:
24: \newcommand{\mysec}[1]{Section~\ref{sec:#1}}
25: \newcommand{\fig}[1]{Figure~\ref{fig:#1}}
26: \newcommand{\tbl}[1]{Table~\ref{tbl:#1}}
27: \newtheorem{thm}{Theorem}
28: \newtheorem{claim}[thm]{Claim}
29: \newtheorem{conj}[thm]{Conjecture}
30: \newtheorem{defn}[thm]{Definition}
31: \newtheorem{ex}[thm]{Example}
32: \newtheorem{obs}[thm]{Observation}
33: \newtheorem{lem}[thm]{Lemma}
34: \newtheorem{cor}[thm]{Corollary}
35: \newtheorem{prop}[thm]{Proposition}
36: \newtheorem{fact}[thm]{Fact}
37: \newcommand{\ZZ}{\mathbb{Z}}
38: \newcommand{\RR}{\mathbb{R}}
39: \newcommand{\NN}{\mathbb{N}}
40: \newcommand{\TP}{\mathbb{TP}}
41: \newcommand{\bin}[2]{{#1\choose #2}}
42:
43: \pagestyle{myheadings}
44: %\markright{Draft version October 24th, 2003}
45: \begin{document}
46: \title{Parametric Inference for Biological Sequence Analysis}
47:
48: \author{Lior Pachter and Bernd Sturmfels \\
49: Department of Mathematics, University of California, Berkeley, CA
50: 94720}
51:
52: \maketitle
53:
54: \begin{abstract}
55: One of the major successes in computational biology has been the
56: unification, using the graphical model formalism, of a multitude of
57: algorithms for annotating and comparing biological sequences.
58: Graphical models that have been applied towards these problems include
59: hidden Markov models for annotation, tree
60: models for phylogenetics, and pair hidden Markov models for
61: alignment. A single algorithm, the sum-product algorithm, solves
62: many of the inference problems associated with different statistical models.
63: This paper introduces the \emph{polytope propagation algorithm}
64: for computing the Newton polytope of an observation from a graphical model.
65: This algorithm is a geometric version of the sum-product algorithm and
66: is used to analyze the parametric behavior of maximum a posteriori
67: inference calculations for graphical models.
68: \end{abstract}
69:
70: %\doublespacing
71:
72: \section{Inference with Graphical Models for Biological Sequence Analysis}
73:
74: This paper develops a new
75: algorithm for graphical models based on the mathematical foundation for statistical models proposed in \cite{Pachter:04}. Its relevance
76: for computational biology can be summarized as follows:
77:
78: \textbf {(a) Graphical models are a unifying statistical framework for biological sequence analysis.}
79:
80: \textbf {(b) Parametric inference is important for obtaining biologically meaningful results.}
81:
82: \textbf {(c) The polytope propagation algorithm solves the parametric inference problem.}
83:
84: \vskip .1cm
85:
86: Thesis (a) states that graphical models are good models for biological sequences. This emerging understanding is the result of practical success with
87: probabilistic algorithms, and also the observation that inference algorithms for graphical models subsume many apparently non-statistical methods.
88: A noteworthy example of the latter is the explanation of classic alignment
89: algorithms such as Needleman-Wunsch and Smith-Waterman in terms of the Viterbi algorithm for pair hidden Markov models \cite{Bucher:96}.
Graphical models are now used for many problems including motif detection, gene finding, alignment, phylogeny reconstruction and protein structure prediction. For example, most gene prediction methods are now based on hidden Markov models (HMMs), and previously non-probabilistic methods
now have HMM-based re-implementations.
92:
93: In typical applications, biological sequences are modeled as {\em observed random variables} $Y_1,\ldots,Y_n$ in a graphical model. The observed random variables may correspond to sequence elements such as nucleotides or amino acids. {\em Hidden random variables} $X_1,\ldots,X_m$ encode information of interest that is unknown, but which one would like to infer. For example, the information could be an annotation, alignment or ancestral sequence in a phylogenetic tree. One of the strengths of graphical models is that by virtue of being probabilistic, they can be combined into powerful models where the hidden variables are more complex. For example, hidden Markov models can be combined with pair hidden Markov models to simultaneously align and annotate sequences \cite{Alexandersson:03}. One of the drawbacks of such approaches is that the models have more parameters and as a result inferences could be less robust.
94:
95: For a fixed observed sequence $\sigma_1 \sigma_2 \ldots \sigma_n$ and {\em fixed parameters},
96: the standard inference problems are:
97: \begin{enumerate}
98: \item[1.] the calculation of {\em marginal probabilities}:
99: \[ p_{\sigma_1 \cdots \sigma_n}
100: \quad = \quad
101: \sum_{h_1,\ldots,h_m} {\rm Prob} (X_1=h_1,\ldots,X_m=h_m,Y_1=\sigma_1,\ldots,Y_n=\sigma_n) \]
102: \item[2.] the calculation of {\em maximum a posteriori log probabilities}:
103: \[ \delta_{\sigma_1 \cdots \sigma_n}
104: \quad = \quad
105: \min_{h_1,\ldots,h_m} - {\rm log} \left( {\rm Prob} (X_1=h_1,\ldots,X_m=h_m,Y_1=\sigma_1,\ldots,Y_n=\sigma_n) \right), \]
106: \end{enumerate}
107: where the $h_i$ range over all the possible assignments for the hidden random variables $X_i$.
108: In practice, it is the solution to Problem 2 that is of interest, since it is the one that solves the problem of finding the genes in a genome or the ``best'' alignment for a pair of sequences.
109: A shortcoming of this approach is that the solution $\widehat {\bf h} = (\hat h_1, \ldots, \hat h_m)$ may vary considerably with a change in parameters.
110:
111: Thesis (b) suggests that a {\em parametric} solution to the inference problem can help in ascertaining the reliability, robustness and biological meaning of an inference result. By {\em parametric inference} we mean the solution of
112: Problem 2 for all model parameters simultaneously. In this way we can decide if a solution
113: obtained for particular parameters is an artifact or is largely independent of the chosen
114: parameters. This approach has already been applied successfully to the problem of pairwise sequence alignment in which parameter choices are known to be crucial in obtaining good alignments \cite{Fernandez-Baca:00, Gusfield:96, Waterman:92}.
115: Our aim is to develop this approach for arbitrary graphical models.
In thesis (c) we claim that the polytope propagation algorithm is efficient for solving the parametric inference problem and, in certain cases, is not much slower than solving Problem 2 for fixed parameters.
117: The algorithm is a geometric
118: version of the sum-product algorithm, which is the standard tool for
119: inference with graphical models.
120:
121: The mathematical setting for understanding
122: the polytope propagation algorithm is {\em tropical geometry}.
123: The connection between tropical geometry and parametric inference in statistical models
124: is developed in the companion paper \cite{Pachter:04}. Here we describe the details of the polytope propagation algorithm (Section 3) in two familiar settings: the hidden Markov model for annotation (Section 2) and the pair hidden Markov model for alignment (Section 4). Finally, in Section 5, we discuss some practical aspects of parametric inference, such as specializing parameters, the construction of single cones which eliminates the need for identifying all possible maximum a posteriori explanations, and the relevance of our findings to Bayesian computations.
125:
126: \section{Parametric Inference with Hidden Markov Models}
127: Hidden Markov models play a central role in sequence analysis,
128: where they are widely used to annotate DNA sequences \cite{Baldi:98}.
129: A simple example is the CpG island annotation problem \cite[\S 3]{Durbin:98}.
130: CpG sites are locations in DNA sequences where
131: the nucleotide cytosine (C) is situated next to a guanine (G) nucleotide (the ``p'' comes from the fact that a phosphate links them together). There are regions with many CpG sites in eukaryotic
132: genomes, and these are of interest because of the action of DNA methyltransferase, which
133: recognizes CpG sites and converts the cytosine into 5-methylcytosine. Spontaneous deamination
134: causes the 5-methylcytosine to be converted into thymine (T), and the mutation is not fixed
135: by DNA repair mechanisms. This results in a gradual erosion
of CpG sites in the genome. {\em CpG islands} are regions of DNA with many unmethylated CpG sites. Spontaneous deamination of unmethylated cytosine in these sites yields uracil, which is recognized and repaired, resulting
137: in a restored CpG site. The computational identification of CpG islands is important, because they are associated with promoter regions of genes, and are known to be involved
138: in gene silencing.
139:
140: Unfortunately, there is no sequence characterization of CpG islands. A generally accepted definition due to Gardiner-Garden and Frommer \cite{Gardiner-Garden:87}
141: is that a CpG island is a region of DNA at least 200bp long with a G+C content of at least 50\%, and with a ratio of observed to expected CpG sites of at least 0.6. This arbitrary
definition has since been refined (e.g. \cite{Takai:02}); however, even analysis of the complete sequence of the human genome \cite{Lander:01} has failed to
143: reveal precise criteria for what constitutes a CpG island. Hidden Markov models can be used to predict CpG islands \cite[\S 3]{Durbin:98}. We have selected this application of HMMs
144: in order to illustrate our approach to parametric inference in a mathematically simple setting.
145:
146: The CpG island HMM we consider has $n$ hidden binary random variables $X_i$, and $n$ observed random variables $Y_i$ that take
147: on the values $\{A,C,G,T\}$ (see Figure 1 in \cite{Pachter:04}). In general, an
148: HMM can be characterized by the following conditional
149: independence statements for $i = 1 , \ldots,n$:
150: \begin{eqnarray*} & p(X_i \, | \,X_1,X_2,\ldots,X_{i-1}) \quad
151: = \quad p(X_i \,| \, X_{i-1}),
152: \\& p(Y_i \, |\, X_1,\ldots,X_i,Y_1,\ldots,Y_{i-1})\quad =
153: \quad p(Y_i \,|\, X_i). \end{eqnarray*}
154: The CpG island HMM has twelve model parameters, namely, the
155: entries of the transition matrices
156: $$ S \, = \, \begin{pmatrix}
157: s_{00} & s_{01} \\
158: s_{10} & s_{11} \\
159: \end{pmatrix}
160: \qquad \hbox{and} \qquad
161: T \, = \, \begin{pmatrix}
162: t_{0A} & t_{0C} & t_{0G} & t_{0T} \\
163: t_{1A} & t_{1C} & t_{1G} & t_{1T}
164: \end{pmatrix}.
165: $$
Here the hidden state space has just two states, non-CpG $=0$ and CpG $=1$,
167: with transitions allowed between them, but in more complicated applications, such as gene finding,
168: the state space is used to model numerous gene components (such as introns and exons) and
169: the sparsity pattern of the matrix $S$ is crucial. In its algebraic representation
170: \cite[\S 2]{Pachter:04}, the HMM is given as the image
171: of the polynomial map
172: \begin{equation}
173: \label{polymap}
174: f \, : \, {\bf R}^{12} \rightarrow {\bf R}^{4^n}, \,\,\,
175: (S,T) \ \mapsto \ \sum_{h \in \{0,1\}^n} \ \ t_{h_1 \sigma_1}
176: s_{h_1 h_2} t_{h_2 \sigma_2} s_{h_2 h_3} \cdots
177: s_{h_{n-1} h_n} t_{h_n \sigma_n}.
178: \end{equation}
Inference problem 1 asks for the evaluation of one coordinate polynomial $f_\sigma$ of the map $f$. This can be done in linear time (in $n$) using the
180: \emph{forward algorithm} \cite{Jordan:02},
181: which recursively evaluates the formula
182: \begin{equation}
183: \label{sum-product}
184: f_{\sigma} \quad = \quad
185: \sum_{h_n=0}^1 t_{h_n \sigma_n} \biggl(
186: \sum_{h_{n-1}=0}^1 s_{h_{n-1} h_n} t_{h_{n-1} \sigma_{n-1}}
187: \cdots
188: \bigl(
\sum_{h_2=0}^1 s_{h_2 h_3} t_{h_2 \sigma_2}
(\sum_{h_1=0}^1 s_{h_1 h_2} t_{h_1 \sigma_1} )\bigr) \cdots \biggr)
191: \end{equation}
192: Problem 2 is to identify the largest term in the expansion of $f_\sigma$.
193: Equivalently, if we write $u_{ij} = - {\rm log}(s_{ij})$ and
194: $v_{ij} = - {\rm log}(t_{ij})$ then Problem 2 is to evaluate the piecewise-linear function
195: \begin{equation}
196: \label{ref:Viterbi}
197: g_{\sigma} \,\, = \,\,
198: {\rm min}_{h_n} v_{h_n \sigma_n} + \bigl(
199: {\rm min}_{h_{n-1}} u_{h_{n-1} h_n} + v_{h_{n-1} \sigma_{n-1}} +
200: \cdots +
201: \bigl(
{\rm min}_{h_2} u_{h_2 h_3} + v_{h_2 \sigma_2} +
203: ( {\rm min}_{h_1} u_{h_1 h_2} + v_{h_1 \sigma_1} )\bigr) \cdots \ \bigr).
204: \end{equation}
205: This formula can be efficiently evaluated by recursively computing the
206: parenthesized expressions. This is known as the
207: \emph{Viterbi algorithm} in the HMM literature.
208: The Viterbi and forward algorithms are instances of
209: the more general {\em sum-product algorithm} \cite{Kschischang:01}.
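
The relationship between the two recursions can be sketched in a few lines of Python. This is only an illustrative sketch: the numerical values in \texttt{S} and \texttt{T} are hypothetical, the function names are ours, and, as in (\ref{polymap}), no initial state distribution is included. Both recursions are checked against a brute-force enumeration of all $2^n$ hidden paths.

```python
import itertools
import math

# Hypothetical parameter values: S[h][h2] stands for s_{h h2},
# T[h][c] stands for t_{h c}. They are chosen for illustration only.
S = [[0.8, 0.2], [0.3, 0.7]]
T = [
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
]
STATES = (0, 1)


def forward(sigma):
    """Evaluate f_sigma with the nested sums, i.e. sum-product in (+, *)."""
    # alpha[h] = sum of the monomials of all partial paths ending in state h
    alpha = [T[h][sigma[0]] for h in STATES]
    for c in sigma[1:]:
        alpha = [
            sum(alpha[h] * S[h][h2] for h in STATES) * T[h2][c]
            for h2 in STATES
        ]
    return sum(alpha)


def viterbi(sigma):
    """Evaluate g_sigma with the same recursion in (min, +) on -log values."""
    u = [[-math.log(x) for x in row] for row in S]
    v = [{c: -math.log(p) for c, p in row.items()} for row in T]
    cost = [v[h][sigma[0]] for h in STATES]
    for c in sigma[1:]:
        cost = [
            min(cost[h] + u[h][h2] for h in STATES) + v[h2][c]
            for h2 in STATES
        ]
    return min(cost)


def brute_force(sigma):
    """Return (sum, -log of largest term) over all 2^n hidden paths."""
    weights = []
    for h in itertools.product(STATES, repeat=len(sigma)):
        w = T[h[0]][sigma[0]]
        for i in range(1, len(sigma)):
            w *= S[h[i - 1]][h[i]] * T[h[i]][sigma[i]]
        weights.append(w)
    return sum(weights), -math.log(max(weights))
```

The only difference between \texttt{forward} and \texttt{viterbi} is the pair of operations used, which is the sense in which both are instances of the sum-product algorithm.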
210:
211: What we are proposing in this paper is to compute
212: the collection of cones in ${\bf R}^{12}$
213: on which the piecewise-linear function $g_\sigma$ is linear.
214: This may be feasible because the number of cones grows polynomially in $n$.
215: Each cone is indexed by
216: a binary sequence ${\bf h} \in \{0,1\}^n$ which represents the CpG islands found
217: for any system of parameters $(u_{ij}, v_{ij})$ in that cone. A binary sequence which
218: arises in this manner is an \emph{explanation for $\sigma$} in the sense of
219: \cite[\S 4]{Pachter:04}.
220: Our results in \cite{Pachter:04} imply that the number of explanations
221: scales polynomially with $n$.
222:
223: \begin{thm}
224: For any given DNA sequence $\sigma$ of length $n$, the
225: number of bit strings $\widehat {\bf h} \in \{0,1\}^n$ which are
226: explanations for the sequence $\sigma$ in the CpG island HMM
227: is bounded above by a constant times $n^{5.25}$.
228: \end{thm}
229:
230: \begin{proof}
231: There are a total of $2 \cdot 4 + 4 = 12$ parameters which is the dimension of the
232: ambient space. Note, however, that for a fixed observed sequence the number of times
233: the observation $A$ is made is fixed, and similarly for $C,G,T$. Furthermore, the total
number of transitions in the hidden states must equal $n-1$. Together, these constraints remove
235: five degrees of freedom. We can thus apply \cite[Theorem 7]{Pachter:04}
236: with $d=12-5 = 7$. This shows that
237: the total number of vertices of the Newton polytope of $\,f_{\bf \sigma}\,$ is
238: $\,O(n^{\frac{7 \cdot 6}{8}}) = O(n^{5.25})$.
239: \end{proof}
240:
241: \begin{figure}[ht]
242: \begin{center}
243: \includegraphics[scale=0.35]{CpGSchlegel_mod.ps}
244: \end{center}
245: \caption{The Schlegel diagram of the Newton polytope of
246: an observation in the CpG island HMM.}
247: \label{fig:Newton_polytope}
248: \end{figure}
249:
250:
251: We explain the biological meaning of our parametric analysis
252: with a very small example.
253: Let us consider the
254: following special case of the CpG island HMM.
First, assume that $t_{iA}=t_{iT}$ and that $t_{iC}=t_{iG}$, i.e.,
the output probability depends only on whether the nucleotide
belongs to the pair $\{A,T\}$ or to the pair $\{C,G\}$. Furthermore, assume that
$t_{0A}=t_{0G}$, which means that the two classes are emitted
with equal probability in the non-CpG island state
(i.e. base composition is uniform in non-CpG islands).
261:
262: Suppose that the observed sequence is
263: ${\bf \sigma}=AATAGCGG$. We ask for {\em all}
264: the possible explanations for ${\bf \sigma}$,
265: that is, for all possible maximum a posteriori
266: CpG island annotations for all parameters.
267: A priori, the number of explanations is bounded by $2^8 = 256$, the total
268: number of binary strings of length eight. However, of the
269: $256$ binary strings, only $25$ are explanations.
270: Figure 1 is a geometric representation of the
271: solution to this problem: the Newton polytope of $f_\sigma$ is
272: a $4$-dimensional polytope with $25$ vertices.
273: The figure is a \emph{Schlegel diagram} of this polytope.
274: It was drawn with the software POLYMAKE
275: \cite{Gawrilow:00,Gawrilow:01}.
276: The $25$ vertices in Figure 1 correspond to the
277: $25$ annotations, which are the explanations for $\sigma$
278: as the parameters vary. Two annotations are connected by
279: an edge if and only if their parameter cones share a wall.
280: From this geometric representation, we can determine all
281: parameters which result in the
282: same maximum a posteriori prediction.
283:
284: \section{Polytope Propagation}
285:
286: The evaluation of $g_{\sigma}$ for fixed parameters using the formulation in (\ref{ref:Viterbi}) is known as the Viterbi algorithm in the HMM literature. We begin by re-interpreting this algorithm as a convex optimization problem.
287:
288: \begin{defn}
289: The Newton polytope of a polynomial
290: \[ f(x_1,\ldots,x_d) \quad = \quad \sum_{i=1}^{n} c_i \cdot x_1^{a_{1,i}} x_2^{a_{2,i}} \cdots x_d^{a_{d,i}} \]
291: is defined to be the convex hull of the lattice points in ${\bf R}^d$ corresponding to
292: the monomials in $f$:
293: \[ NP(f) \quad = \quad
{\rm conv}\{(a_{1,1},a_{2,1},\ldots,a_{d,1}), \ldots, (a_{1,n},a_{2,n},\ldots,a_{d,n})\}. \]
295: \end{defn}
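
As a small illustration (a toy polynomial chosen only for this remark, not one arising from the models in this paper), take
\[ f(x_1,x_2) \quad = \quad x_1^2 + 3 x_1 x_2 + x_2^2 . \]
Its monomials have exponent vectors $(2,0)$, $(1,1)$ and $(0,2)$, so
\[ NP(f) \quad = \quad {\rm conv}\{(2,0),(1,1),(0,2)\} \]
is the line segment from $(2,0)$ to $(0,2)$. Since $(1,1) = \frac{1}{2}(2,0) + \frac{1}{2}(0,2)$, the polytope has only two vertices although $f$ has three monomials. The coefficients play no role in the construction.
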
296: Recall that for a fixed observation there are natural polynomials associated with a graphical model, which we have been denoting by $f_{\sigma}$.
297: In the CpG island example from Section 2, these polynomials are the coordinates
298: $f_\sigma$ of the polynomial map $f$ in (\ref{polymap}).
299: Each coordinate polynomial $f_\sigma$ is the sum of $2^n$ monomials,
300: where $n = |\sigma|$. The crucial observation is that even though the number of monomials grows exponentially with $n$, the number of vertices of the
301: Newton polytope $NP(f_\sigma)$ is much smaller. The Newton polytope
302: is important for us because its vertices represent the solutions to the
303: inference problem 2.
304:
305: \begin{prop}
306: \label{polytopepropagation}
307: The maximum a posteriori log probabilities $\,\delta_{\sigma}\,$
308: in Problem 2 can be determined by
309: minimizing a linear functional over the Newton polytope of $\,f_\sigma$.
310: \end{prop}
311:
312: \begin{proof}
313: This is nothing but a restatement of the fact that when passing to logarithms, monomials in the parameters become linear functions in the logarithms of the parameters. \end{proof}
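
In more detail, and with notation introduced only for this remark: write $\theta_1,\ldots,\theta_d$ for the model parameters and set $w_i = -{\rm log}(\theta_i)$. Each term in the expansion of $f_\sigma$ is a monomial $\theta^a = \theta_1^{a_1} \cdots \theta_d^{a_d}$ with exponent vector $a \in \NN^d$, and
\[ - {\rm log}(\theta^a) \quad = \quad a_1 w_1 + \cdots + a_d w_d \quad = \quad \langle a, w \rangle . \]
Hence
\[ \delta_\sigma \quad = \quad \min_{a} \, \langle a, w \rangle \quad = \quad \min_{a \in NP(f_\sigma)} \langle a, w \rangle , \]
where the first minimum runs over the exponent vectors of the monomials of $f_\sigma$. The second equality holds because a linear functional on a polytope attains its minimum at a vertex, and every vertex of $NP(f_\sigma)$ is one of these exponent vectors.
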
314:
315: Our main result in this section is an algorithm which we state
316: in the form of a theorem.
317:
318: \begin{thm}[Polytope propagation]
319: Let $f_{\sigma}$ be the polynomial associated to a fixed observation $\sigma$ from a graphical model. The list of all vertices of the Newton polytope of $f_{\sigma}$ can be
320: computed efficiently by recursive convex hull and Minkowski sum computations on unions of polytopes.
321: \end{thm}
322:
323: \begin{proof}
324: Observe that if $f_1,f_2$ are polynomials then $NP(f_1 \cdot f_2) = NP(f_1) + NP(f_2)$
325: where the $+$ on the right hand side denotes the Minkowski sum of the two
326: polytopes. Similarly, $\,NP(f_1+f_2) = {\rm conv} \bigl( NP(f_1) \cup NP(f_2) \bigr)\,$
327: if $f_1$ and $f_2$ are polynomials with positive coefficients.
328: The recursive description of $f_{\bf \sigma}$ given in (\ref{sum-product}) can be used
329: to evaluate the Newton polytope efficiently. The necessary geometric
330: primitives are precisely Minkowski sum and convex hull of unions of convex polytopes.
331: These primitives run in polynomial
332: time since the dimension of the polytopes is fixed. This is the
333: case in our situation since we consider graphical models
334: with a fixed number of parameters. We can hence
335: run the sum-product algorithm efficiently in the
336: semiring known as the \emph{polytope algebra}.
337: The size of the output scales polynomially by \cite[Thm.~7]{Pachter:04}.
338: \end{proof}
339:
340:
341: \begin{figure}[ht]
342: \begin{center}
343: \includegraphics[scale=0.75]{polytope_propagation.ps}
344: \end{center}
345: \caption{Graphical representation of the polytope propagation algorithm for a hidden Markov model.
346: For a particular pair of parameters, there is one
347: optimal Viterbi path (shown as large vertices on the polytopes).}
348: \label{fig:HMMpoly}
349: \end{figure}
350:
351: Figure 2 shows an example of the polytope propagation algorithm for a hidden Markov model
352: with all random variables binary and with the following transition and output
353: matrices:
354: $$ S \, = \, \begin{pmatrix}
355: s_{00} & 1 \\
356: 1 & s_{11} \\
357: \end{pmatrix}
358: \qquad \hbox{and} \qquad
359: T \, = \, \begin{pmatrix}
360: s_{00} & 1 \\
361: 1 & s_{11}
362: \end{pmatrix}.
363: $$
364: Here we specialized to only two parameters in order to simplify the diagram.
365: When we run polytope propagation for long enough DNA sequences
366: $\sigma$ in the
367: CpG island HMM of Section 2 with all $12$ free parameters, we get a diagram just like Figure 2,
368: but with each polygon replaced by a seven-dimensional polytope.
369:
370: It is useful to note that for HMMs, the Minkowski sum operations are simply shifts of the polytopes, and therefore the only non-trivial geometric operations required are the convex hulls of unions of polytopes.
371: The polytope in Figure 1 was computed using polytope propagation. This polytope
372: has dimension $4$ (rather than $7$) because the sequence ${\bf \sigma}=AATAGCGG$ is so short.
373: We wish to emphasize that the small size of our examples is only for clarity; there is no practical
374: or theoretical barrier to computing much larger instances.
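
The recursion can be sketched in Python for the two-parameter model of Figure 2, where every polytope is a polygon in the plane. This is a minimal sketch under our own assumptions and not the POLYMAKE implementation: the function names are hypothetical, observations are encoded as tuples of $0$s and $1$s, exponent vectors record the powers of $(s_{00}, s_{11})$, and the convex hull is computed with the standard monotone chain method. The result is compared with the hull of all $2^n$ exponent vectors.

```python
import itertools

# Exponent vectors of the entries of S and T in the two-parameter model:
# s_00 -> (1, 0), s_11 -> (0, 1), and the constant 1 -> (0, 0).
E = [[(1, 0), (0, 0)], [(0, 0), (0, 1)]]
STATES = (0, 1)


def cross(o, a, b):
    """Twice the signed area of the triangle o, a, b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull(points):
    """Sorted vertex list of the convex hull of planar lattice points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half(seq):
        out = []
        for p in seq:
            # drop points that are not strict left turns (not vertices)
            while len(out) >= 2 and cross(out[-2], out[-1], p) <= 0:
                out.pop()
            out.append(p)
        return out

    lower, upper = half(pts), half(reversed(pts))
    return sorted(set(lower[:-1] + upper[:-1]))


def shift(poly, e):
    """Minkowski sum of a polygon with a single point e."""
    return [(x + e[0], y + e[1]) for (x, y) in poly]


def propagate(sigma):
    """Vertices of NP(f_sigma) via the forward recursion on polytopes."""
    # P[h] = Newton polytope of the partial sum over paths ending in h
    P = [[E[h][sigma[0]]] for h in STATES]
    for c in sigma[1:]:
        P = [
            shift(hull(shift(P[0], E[0][h2]) + shift(P[1], E[1][h2])),
                  E[h2][c])
            for h2 in STATES
        ]
    return hull(P[0] + P[1])


def brute_force(sigma):
    """Hull of the exponent vectors of all 2^n monomials of f_sigma."""
    pts = []
    for h in itertools.product(STATES, repeat=len(sigma)):
        x, y = E[h[0]][sigma[0]]
        for i in range(1, len(sigma)):
            for e in (E[h[i - 1]][h[i]], E[h[i]][sigma[i]]):
                x, y = x + e[0], y + e[1]
        pts.append((x, y))
    return hull(pts)
```

Addition of polynomials becomes the hull of a union and multiplication by a parameter becomes a shift, so the structure of the loop is the same as in the forward algorithm.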
375:
376:
377: For general graphical models, the running time of the Minkowski sum and convex hull computations depends on the number of parameters, and the number of vertices in each computation. These are
clearly bounded by the total number of vertices of $NP(f_{\sigma})$, which is bounded above by \cite[Theorem 7]{Pachter:04}:
379: $$ \# \,{\rm vertices} (NP(f_\sigma)) \,\,\, \leq \,\,\,
380: {\rm constant} \cdot E^{d(d-1)/(d+1)} \,\,\, \leq \,\,\, {\rm
381: constant} \cdot E^{d-1} . $$
382: Here $E$ is the number of edges in the graphical model (often linear in the number of vertices of the model). The dimension $d$ of the Newton polytope $NP(f_\sigma)$
383: is fixed because it is bounded above by the number of model parameters.
384: The total running time
385: of the polytope propagation algorithm can then be estimated by multiplying the running time for the geometric operations of Minkowski sum and convex hull
386: with the running time of the sum-product algorithm. In any case,
387: the running time scales polynomially in $E$.
388:
389: We have shown in \cite[\S 4]{Pachter:04}
390: that the vertices of $NP(f_{\sigma})$ correspond to explanations
391: for the observation $\sigma$. In parametric inference we are interested
392: in identifying the parameter regions that lead to the same explanations.
Since parameters can be identified
with linear functionals, the parameters that lead to the same explanation (i.e. a vertex $v$) are exactly those linear functionals that attain their minimum over $NP(f_\sigma)$ at $v$. The
395: set of these linear functionals is the {\em normal cone of
396: $NP(f_\sigma)$ at $v$}. The collection of all normal cones
397: at the various vertices $v$ forms the {\em normal fan} of the polytope. Putting this together with Proposition \ref{polytopepropagation} we obtain:
398:
399: \begin{prop}
400: The normal fan of the Newton polytope of $f_{\sigma}$ solves the parametric
401: inference problem for an observation $\sigma$ in a graphical model.
402: It is computed using the polytope propagation algorithm.
403: \end{prop}
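
For a concrete picture of a normal fan, consider the toy polynomial $f(x_1,x_2) = 1 + x_1 + x_2$ (chosen for illustration only; it does not come from a graphical model). Its Newton polytope is the triangle with vertices $(0,0)$, $(1,0)$ and $(0,1)$, on which a linear functional $w = (w_1,w_2)$ takes the values $0$, $w_1$ and $w_2$. The three normal cones are therefore
\[ \{ w_1 \geq 0, \ w_2 \geq 0 \}, \qquad
\{ w_1 \leq 0, \ w_1 \leq w_2 \}, \qquad
\{ w_2 \leq 0, \ w_2 \leq w_1 \} , \]
corresponding to the vertices $(0,0)$, $(1,0)$ and $(0,1)$ respectively. These cones cover the plane, and two of them share a wall exactly when the corresponding vertices are joined by an edge of the triangle.
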
404:
405: An implementation of polytope propagation for arbitrary graphical models
406: is currently being developed within the
407: geometry software package POLYMAKE \cite{Gawrilow:00,Gawrilow:01} by Michael Joswig.
408:
409: \section{Parametric Sequence Alignment}
410:
The \emph{sequence alignment} problem asks for the best alignment between two sequences that have evolved from a common ancestor via a series of mutations, insertions and deletions. Formally,
412: given two sequences $\,\sigma^1 =
413: \sigma^1_1 \sigma^1_2 \cdots \sigma^1_n \,$ and
414: $\,\sigma^2 = \sigma^2_1 \sigma^2_2 \cdots \sigma^2_m \,$
415: over the alphabet $ \{0,1,\ldots,l-1\}$,
416: an \emph{alignment} is a string over the alphabet $\{M,I,D\}$ such that
417: $\#M+\#D= n$ and $\#M+\#I=m $.
418: Here $\#M, \#I, \#D$ denote the number of characters $M,I,D$
419: in the word respectively. An alignment records the ``edit steps'' from the sequence
420: $\sigma^1$ to the sequence $\sigma^2$, where edit operations consist of changing characters,
421: preserving them, or inserting/deleting them. An $I$ in the alignment string
422: corresponds to an insertion in the first sequence, a $D$ is a deletion in the first
423: sequence, and an $M$ is either a character change, or lack thereof.
We write ${\cal A}_{n,m}$ for the set of all alignments.
For a given $h \in {\cal A}_{n,m}$, we denote the $j$th character in $h$ by $h_j$, we write $\,h[i] \,$ for $\,\#M+\#D \,$ in the prefix
$\,h_1 h_2 \ldots h_i$, and we write
$\,h \langle j \rangle\,$ for $\,\#M+\#I\,$ in
the prefix $\,h_1 h_2 \ldots h_j$.
Thus $h[i]$ and $h \langle j \rangle$ record how many characters of $\sigma^1$ and $\sigma^2$, respectively, have been consumed by the prefix.
The cardinality of the set ${\cal A}_{n,m}$ of all alignments can be computed
430: as the coefficient of $x^m y^n$ in the generating function
431: $1/(1-x-y-xy)$. These coefficients are known as
432: \emph{Delannoy numbers} in combinatorics
433: \cite[\S 6.3]{Stanley:99}.
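
The count can also be obtained without the generating function. The following Python sketch (function names are ours) uses the recurrence obtained by removing the last character of an alignment, and compares it with a direct count that groups alignments by their number $k$ of $M$'s; an alignment with $k$ matches is an arrangement of $k$ $M$'s, $n-k$ $D$'s and $m-k$ $I$'s.

```python
from functools import lru_cache
from math import factorial


@lru_cache(maxsize=None)
def num_alignments(n, m):
    """Count strings over {M, I, D} with #M + #D = n and #M + #I = m."""
    if n == 0 or m == 0:
        return 1  # only I's (or only D's) can be used
    # the last character is D, I or M, respectively
    return (num_alignments(n - 1, m)
            + num_alignments(n, m - 1)
            + num_alignments(n - 1, m - 1))


def num_alignments_by_matches(n, m):
    """Same count, summing multinomial coefficients over the number of M's."""
    return sum(
        factorial(n + m - k)
        // (factorial(k) * factorial(n - k) * factorial(m - k))
        for k in range(min(n, m) + 1)
    )
```

For sequences of lengths $2$ and $3$ both functions return $25$.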
434:
435: {\em Bayesian multi-nets} were introduced in \cite{Friedman:97} and are
436: extensions of graphical models via the introduction of class nodes, and a
437: set of local networks corresponding to values of the class nodes.
438: In other words, the value of a random variable can change the structure
439: of the graph underlying the graphical model. The
440: {\em pair hidden Markov model} (see Figure \ref{fig:pairHMM}) is
441: an instance of a Bayesian multinet. In this model,
442: the hidden states (unshaded nodes forming the chain) take on
443: one of the values $M,I,D$. Depending on the value at a hidden node,
444: either one or two characters are generated; this is encoded by plates (squares around the observed states) and class nodes (unshaded nodes in the plates).
445: The class nodes take on the values $0$ or $1$ corresponding to whether
446: or not a character is generated.
447: Pair hidden Markov models are
448: therefore probabilistic models of alignments, in which the structure of
449: the model depends on the assignments to the hidden states.
450: \begin{figure}
451: \begin{center}
452: \includegraphics[scale=0.7]{pairhmm.ps}
453: \end{center}
454: \caption{A pair hidden Markov model for sequence alignment.}
455: \label{fig:pairHMM}
456: \end{figure}
457:
458: Our next result gives the precise description of the pair HMM for sequence alignment in
459: the language of algebraic statistics, namely, we represent this model
460: by means of a polynomial map $f$.
461: Let $\sigma^1$, $\sigma^2$ be the output strings from a pair hidden Markov model (of lengths $n,m$ respectively). Then:
462: \begin{equation}
463: \label{pairhmm}
464: f_{\sigma^1,\sigma^2}
465: \quad = \quad \sum_{h \in {\cal A}_{n,m}}
466: t_{h_1}(\sigma^1_{h[1]},\sigma^2_{h \langle 1 \rangle}) \cdot
467: \prod_{i = 2}^{|h|}
468: s_{h_{i-1}h_i} \cdot t_{h_i}(\sigma^1_{h[i]},\sigma^2_{h \langle i \rangle}) ,
469: \end{equation}
470: where $s_{h_{i-1}h_i}$ is the transition probability from state $h_{i-1}$ to $h_i$ and $t_{h_i}(\sigma^1_{h[i]},\sigma^2_{h \langle i \rangle})$ are the output probabilities
471: for a given state $h_i$ and the corresponding output characters on the strings $\sigma^1,\sigma^2$.
472:
473: \begin{prop} \label{pairHMMmap}
474: The pair hidden Markov model for sequence alignment is the
475: image of a polynomial map $f : {\bf R}^{9 + 2l+ l^2 }
476: \rightarrow {\bf R}^{l^{n+m}}$.
The coordinates of $f$ are the polynomials in (\ref{pairhmm}),
each of degree $2(n+m)-1$.
480: \end{prop}
481:
482: We need to explain why the number of parameters is $9 + 2l+ l^2 $.
483: First, there are nine parameters
484: $$ S \quad = \quad
485: \begin{pmatrix}
486: s_{MM} & s_{MI} & s_{MD} \\
487: s_{IM} & s_{II} & s_{ID} \\
488: s_{DM} & s_{DI} & s_{DD}
489: \end{pmatrix} , $$
490: which play the same role as in Section 2,
491: namely, they represent transition probabilities
492: in the Markov chain. There are
493: $l^2$ parameters $\,t_M(a,b) =: t_{Mab}\,$
494: for the probability that letter $a$ in
495: $\sigma^1$ is matched with letter $b$ in $\sigma^2$.
496: The insertion parameters $\,t_I(a,b) \,$
497: depend only on the letter $b$, and the
498: deletion parameters $\,t_D(a,b) \,$
499: depend only on the letter $a$, so there
500: are only $2l $ of these parameters. In the upcoming example,
501: which explains the algebraic representation of
502: Proposition \ref{pairHMMmap},
503: we use the abbreviations $\,t_{Ib}\,$ and $\,t_{Da}\,$
504: for these parameters.
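
As a quick check of the count $9 + 2l + l^2$: for the binary alphabet ($l = 2$) it gives
\[ 9 + 2 \cdot 2 + 2^2 \quad = \quad 17 , \]
and for the DNA alphabet $\{A,C,G,T\}$ ($l = 4$, used here only as an illustration) it gives
\[ 9 + 2 \cdot 4 + 4^2 \quad = \quad 33 . \]
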
505:
506: Consider two sequences $\, \sigma^1 = ij \,$ and $\sigma^2 = klm \,$
507: of length $n = 2$ and $m = 3$ over any alphabet.
508: The number of alignments is $\,\#( {\cal A}_{n,m} ) = 25$, and they are listed in Table 1.
509: \begin{table}
510: \begin{center}
511: \begin{tabular} {|l|l|l|} \hline
512: %$$
513: %\begin{matrix}
514: IIIDD & \,\, $( \,\cdot \cdot \cdot ij \,,\, klm\cdot \cdot \, )$ & $
515: t_{Ik} s_{II} t_{Il} s_{II} t_{Im} s_{ID} t_{Di} s_{DD} t_{Dj} $\\
516: IIDID & \,\, $( \,\cdot \cdot i\cdot j \, ,\, kl\cdot m\cdot \, )$ & $
517: t_{Ik} s_{II} t_{Il} s_{ID} t_{Di} s_{DI} t_{Im} s_{ID} t_{Dj} $\\
518: IIDDI & \,\, $( \,\cdot \cdot ij \,\cdot \,,\, kl\cdot \cdot m \, )$ & $
519: t_{Ik} s_{II} t_{Il} s_{ID} t_{Di} s_{DD} t_{Dj} s_{DI} t_{Im} $\\
520: IDIID & \,\, $( \,\cdot \, i\cdot \cdot j\,,\, k\cdot lm\cdot \, )$ & $
521: t_{Ik} s_{ID} t_{Di} s_{DI} t_{Il} s_{II} t_{Im} s_{ID} t_{Dj} $\\
522: IDIDI & \,\, $( \,\cdot \, i\cdot j\cdot \,,\, k\cdot l\cdot m \, )$ & $
523: t_{Ik} s_{ID} t_{Di} s_{DI} t_{Il} s_{ID} t_{Dj} s_{DI} t_{Im} $\\
524: IDDII & \,\, $( \,\cdot \,ij \cdot \cdot \,,\, k\cdot \cdot lm \, )$ & $
525: t_{Ik} s_{ID} t_{Di} s_{DD} t_{Dj} s_{DI} t_{Il} s_{II} t_{Im} $\\
526: DIIID & \,\, $( \,i\cdot \cdot \cdot j \,,\, \cdot \, klm\cdot \, )$ & $
527: t_{Di} s_{DI} t_{Ik} s_{II}
528: t_{Il} s_{II} t_{Im} s_{ID} t_{Dj} $\\
529: DIIDI & \,\, $( \,i\cdot \cdot j\cdot \,,\, \cdot \,kl\cdot m \, )$ & $
530: t_{Di} s_{DI} t_{Ik} s_{II} t_{Il} s_{ID} t_{Dj} s_{DI} t_{Im} $\\
531: DIDII & \,\, $( \,i\cdot j\cdot \cdot \,,\, \cdot \,k\cdot lm \, )$ & $
532: t_{Di} s_{DI} t_{Ik} s_{ID} t_{Dj} s_{DI} t_{Il} s_{II} t_{Im} $\\
533: DDIII & \,\, $( \,ij\cdot \cdot \,\cdot\, ,\, \cdot \cdot klm \, )$ & $
534: t_{Di} s_{DD} t_{Dj} s_{DI} t_{Ik} s_{II} t_{Il} s_{II} t_{Im} $\\
535: MIID & \,\, $( \,i\cdot \cdot j \,,\, klm \,\cdot \, )$ & $ t_{Mik} s_{MI} t_{Il} s_{II} t_{Im} s_{ID} t_{Dj} $\\
536: MIDI & \,\, $( \,i\cdot j\cdot \,,\, kl\cdot m \, )$ & $ t_{Mik} s_{MI} t_{Il} s_{ID} t_{Dj} s_{DI} t_{Im} $\\
537: MDII & \,\, $( \,ij\cdot \cdot \,,\, k\cdot lm \, )$ & $ t_{Mik} s_{MD} t_{Dj} s_{DI} t_{Il} s_{II} t_{Im} $\\
538: IMID & \,\, $( \,\cdot \,i\cdot j \,,\, klm\cdot \, )$ & $ t_{Ik} s_{IM} t_{Mil} s_{MI} t_{Im} s_{ID} t_{Dj} $\\
539: IMDI & \,\, $( \,\cdot \,ij\,\cdot \,,\, kl\cdot m \, )$ & $ t_{Ik} s_{IM} t_{Mil} s_{MD} t_{Dj} s_{DI} t_{Im} $\\
540: IIMD & \,\, $( \,\cdot \cdot ij\,,\, klm \,\cdot \, )$ & $ t_{Ik} s_{II} t_{Il} s_{IM} t_{Mim} s_{MD} t_{Dj} $\\
541: IIDM & \,\, $( \,\cdot \cdot ij\,,\, kl\cdot m \, )$ & $ t_{Ik} s_{II} t_{Il} s_{ID} t_{Di} s_{DM} t_{Mjm} $\\
542: IDMI & \,\, $( \,\cdot ij\cdot \,,\, k\cdot lm \, )$ & $ t_{Ik} s_{ID} t_{Di} s_{DM} t_{Mjl} s_{MI} t_{Im} $\\
543: IDIM & \,\, $( \,\cdot i\cdot j\,,\, k\cdot lm \, )$ & $ t_{Ik} s_{ID} t_{Di} s_{DI} t_{Il} s_{IM} t_{Mjm} $\\
544: DMII & \,\, $( \,ij\cdot \cdot \,,\, \cdot \,klm \, )$ & $ t_{Di} s_{DM} t_{Mjk} s_{MI} t_{Il} s_{II} t_{Im} $\\
545: DIMI & \,\, $( \,i\cdot j\cdot \,,\, \cdot \,klm \, )$ & $ t_{Di} s_{DI} t_{Ik} s_{IM} t_{Mjl} s_{MI} t_{Im} $\\
546: DIIM & \,\, $( \,i\cdot \cdot j\,,\, \cdot \,klm \, )$ & $ t_{Di} s_{DI} t_{Ik} s_{II} t_{Il} s_{IM} t_{Mjm} $\\
547: MMI & \,\, $( \,ij \,\cdot\,\, , \,\,klm \, )$ & $ t_{Mik} s_{MM} t_{Mjl} s_{MI} t_{Im} $\\
548: MIM & \,\, $( \,i \cdot j \,\,,\,\, klm \, )$ & $ t_{Mik} s_{MI} t_{Il} s_{IM} t_{Mjm} $\\
549: IMM & \,\, $( \,\cdot \,ij \,\,,\,\, klm \, )$ & $ t_{Ik} s_{IM} t_{Mil} s_{MM} t_{Mjm} $\\ \hline
550: %\end{matrix}
551: %$$
552: \end{tabular}
553: \end{center}
554: \caption{Alignments for a pair of sequences of length $2$ and $3$.}
555: \end{table}
556: The polynomial $f_{\sigma^1,\sigma^2}$ is the sum of the
557: $25$ monomials (of degree $9,7,5$) in the rightmost column.
558: For instance, if we consider strings over the binary
559: alphabet $\{0,1\}$, then there are $17$ parameters
560: (nine $s$-parameters and eight $t$-parameters), and
561: the pair HMM for alignment is the image of a map
562: $ \, f : {\bf R}^{17} \rightarrow {\bf R}^{32}$.
563: The coordinate of $f$ which is indexed by
564: $(i,j,k,l,m) \in \{0,1\}^5$ equals the
565: $25$-term polynomial gotten by summing the
566: rightmost column in Table 1.
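
The count of $25$ terms can be checked by hand. The following tally is a
sanity check added for illustration; the symbol $q$ for the number of match
states is notation introduced here. An alignment of sequences of lengths $n$ and $m$
with exactly $q$ states $M$ has $n-q$ states $D$ and $m-q$ states $I$, so it
is a word of length $n+m-q$, and its monomial has $n+m-q$ factors $t$ and
$n+m-q-1$ factors $s$. For $n=2$ and $m=3$ this gives
$$ \sum_{q=0}^{2} \frac{(5-q)!}{q!\,(2-q)!\,(3-q)!}
\quad = \quad \frac{5!}{0!\,2!\,3!} \, + \, \frac{4!}{1!\,1!\,2!} \, + \, \frac{3!}{2!\,0!\,1!}
\quad = \quad 10 + 12 + 3 \quad = \quad 25 , $$
with monomial degrees $2(5-q)-1 = 9,7,5$ for $q = 0,1,2$ respectively.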
567:
The parametric inference problem for sequence alignment is solved
by computing the Newton polytopes $NP(f_{\sigma^1,\sigma^2})$ with the
polytope propagation algorithm.
In the terminology introduced in \cite[\S 4]{Pachter:04},
an observation $\sigma$ in the pair HMM is the pair of sequences
$(\sigma^1,\sigma^2)$, and the possible explanations
are the optimal alignments of these sequences with
respect to the various choices of parameters.
In summary, the vertices of the Newton polytope
$NP(f_{\sigma^1,\sigma^2})$ correspond to the optimal alignments.
If the observed sequences $\sigma^1,\sigma^2$ are not fixed then we are in the situation of
\cite[Proposition 6]{Pachter:04}.
Each parameter choice
defines a function from pairs of sequences to alignments:
$$\, \{0,\ldots,l-1\}^n \times \{0,\ldots,l-1\}^m
\rightarrow {\cal A}_{n,m} \,,\quad ( \sigma^1,\sigma^2) \mapsto \hat {\bf h} .$$
584: The number of such functions
585: grows doubly-exponentially in $n$ and $m$, but only
586: a tiny fraction of them are \emph{inference functions},
587: which means they correspond to the vertices of the Newton polytope
588: of the map $f$.
589: It is an interesting combinatorial problem to characterize
590: the inference functions for sequence alignment.
591:
592: An important observation is that our formulation in Problem 2 is equivalent to
593: combinatorial ``scoring schemes'' or ``generalized edit distances'' which
594: can be used to assign weights to alignments \cite{Bucher:96}.
595: For example, the simplest scoring scheme consists of two parameters:
596: a mismatch score $mis$, and an indel score $gap$ \cite{Fernandez-Baca:00, Gusfield:94, Waterman:92}.
597: The weight of an alignment is the sum of the scores for all positions in the alignment, where a match is assigned a score of $1$.
598: This is equivalent to specializing the logarithmic parameters
599: $U = - {\rm log} (S)$ and $V = - {\rm log} (T)$ of the pair hidden Markov model as follows:
600: \begin{equation}
601: \label{specialize}
602: u_{ij} = 0, \quad
603: v_{Mij}=1 \,\hbox{ if $i=j$}, \,\,\,\,
604: v_{Mij}=mis\, \hbox{ if $i \neq j$, and }\,\,\,\,
605: v_{Ij} = v_{Di} = gap
606: \qquad \hbox{for all $i,j$}.
607: \end{equation}
608: This specialization of the parameters
609: corresponds to intersecting the normal fan of
610: the Newton polytope with a two-dimensional affine subspace
611: (whose coordinates are called $mis$ and $gap$).
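
To see concretely why only two dimensions remain, here is a short derivation;
the counts $x$, $y$, $z$ are notation introduced for this illustration.
Suppose an alignment of sequences of lengths $n$ and $m$ has $x$ matches,
$y$ mismatches and $z$ indels. Every match or mismatch uses one character
from each sequence and every indel uses one character, so $2x + 2y + z = n+m$.
Under (\ref{specialize}) the weight of the alignment is
$$ x \, + \, mis \cdot y \, + \, gap \cdot z \quad = \quad
\frac{n+m}{2} \, + \, (mis - 1) \cdot y \, + \, (gap - \tfrac{1}{2}) \cdot z , $$
which, for fixed $n$ and $m$, depends on the alignment only through the pair $(y,z)$.
For instance, for $\sigma^1 = 01$ and $\sigma^2 = 011$ the alignment MMI in Table 1
has $(x,y,z) = (2,0,1)$ and weight $2 + gap$, while DDIII has $(x,y,z) = (0,0,5)$
and weight $5 \cdot gap$.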
612:
613: Efficient software for parametrically aligning the sequences with two free parameters
614: already exists (XPARAL \cite{Gusfield:96}).
615: Consider the example of the following two sequences:
$\sigma^1=AGGACCGATTACAGTTCAA$ and $\sigma^2=TTCCTAGGTTAAACCTCATGCA$. XPARAL returns four cones, whereas a computation of the Newton polytope reveals seven vertices (three of which correspond to positive $mis$ or $gap$ values). The polytope propagation algorithm has
617: the same running time as XPARAL: for two sequences of
618: length $n,m$, the method requires $O(nm)$ two-dimensional convex hull computations. The number of points in each computation is bounded by the total
619: number of points in the final convex hull (or equivalently the number, $K$, of explanations). Each convex hull computation therefore
620: requires at most $O(K {\rm log}(K))$ operations, thus giving an $O(nmK {\rm log}(K))$ algorithm for solving the parametric alignment problem. However, this
621: running time can be improved by observing that the convex hull computations that need to be carried out have a very special form, namely in each
step of the algorithm we need to compute the convex hull of two superimposed convex polygons. This procedure is in fact a primitive of the divide-and-conquer
approach to convex hull computation, and there is a well-known $O(K)$ algorithm for solving it \cite[\S 3.3.5]{Preparata:85}. Therefore, for two parameters, our recursive approach
624: to solving the parametric problem yields an $O(Kmn)$ algorithm, matching the running time of XPARAL and the conjecture of Waterman, Eggert and Lander \cite{Waterman:92}.
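
The recursion behind this count can be sketched as follows. This is an
illustration under the assumption that the polygons are propagated along the
standard dynamic programming recursion for pairwise alignment; the symbols
$P_{i,j}$, $\delta_{i,j}$, $e_1$ and $e_2$ are introduced here for this sketch only.
Let $P_{i,j}$ be the convex hull of the points
(number of mismatches, number of indels) taken over all alignments of the
first $i$ characters of $\sigma^1$ with the first $j$ characters of $\sigma^2$.
Write $e_1 = (1,0)$ and $e_2 = (0,1)$, and let $\delta_{i,j} = 1$ if the $i$th character
of $\sigma^1$ differs from the $j$th character of $\sigma^2$ and $\delta_{i,j} = 0$ otherwise. Then
$$ P_{i,j} \quad = \quad {\rm conv} \bigl( \, ( P_{i-1,j-1} + \delta_{i,j} \, e_1 ) \, \cup \,
( P_{i-1,j} + e_2 ) \, \cup \, ( P_{i,j-1} + e_2 ) \, \bigr),
\qquad P_{0,0} = \{ (0,0) \}, $$
where terms with a negative index are omitted. Each of the $O(nm)$ cells is obtained
by two merges of translated convex polygons, and by the bound quoted above each
polygon has at most $K$ vertices, so each merge costs $O(K)$.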
625:
626: \begin{figure}[ht]
627: \begin{center}
628: \includegraphics[scale=1.2]{alignment_pic2.ps}
629: \end{center}
\caption{Edge graph of the Newton polytope for a four-parameter alignment problem.}
631: \label{fig:parametric}
632: \end{figure}
633:
634:
In order to demonstrate the practicality of our approach for higher-dimensional problems, we implemented a four-parameter recursive parametric alignment solver. The more
general alignment model includes different transition/transversion parameters (instead of just one mismatch parameter), and separate parameters for
opening gaps and extending gaps. A transition is a mutation from one purine ($A$ or $G$) to another, or from one pyrimidine ($C$ or $T$) to another, and a transversion is a mutation
from a purine to a pyrimidine or vice versa. More precisely, if we let $P_u=\{A,G\}$ and $P_y=\{C,T\}$, the model is:
639: \begin{eqnarray*}
640: \label{specialize2}
641: u_{MM} = u_{IM} = u_{DM} & = & 0\\
642: u_{MI} = u_{MD} & = & gapopen\\
643: u_{II} = u_{DD} & = & gapextend\\
644: v_{Mij} & = & \hbox {$1$ if $i=j$}\\
645: v_{Mij} & = & transt\, \ \hbox {if $i \neq j$, and $i,j \in P_u$ or $i,j \in P_y$}\\
646: v_{Mij} & = & transv\, \ \hbox {if $i \neq j$, and $i \in P_u, j \in P_y$ or vice versa}\\
647: v_{Ij} = v_{Di} & = & \hbox{$0$ for all $i,j$}.
648: \end{eqnarray*}
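
As a small worked instance of this scoring, chosen for illustration only, take
$\sigma^1 = AG$ and $\sigma^2 = AAC$ and the alignment MMI from Table 1.
The first column matches $A$ with $A$ and contributes $1$; the second column
pairs $G$ with $A$, both purines, and contributes $transt$; the state change
from $M$ to $M$ contributes $0$, the state change from $M$ to $I$ contributes
$gapopen$, and the inserted character contributes $0$. The weight of this alignment is therefore
$$ 1 \, + \, transt \, + \, gapopen . $$
We deliberately chose an alignment that avoids the state changes $ID$ and $DI$, whose
parameters are not listed in the specialization above.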
649:
For the two sequences $\sigma^1$ and $\sigma^2$ in the example above, the number of vertices of the four-dimensional
Newton polytope (shown in \fig{parametric}) is $224$ (to be compared to $7$ for the two-parameter case).
652:
653:
654: \section{Practical Aspects of Parametric Inference}
655:
656: We begin by pointing out that parametric inference is useful for Bayesian computations. Consider the problem where we have a prior distribution $\pi(s)$ on our parameters
657: $s = (s_1,\ldots,s_d)$, and we would like to compute the posterior probability of a maximum a posteriori explanation $\widehat {\bf h}$:
658: \begin{equation}
659: \label{Bayesian}
660: {\rm Prob}({\bf X} = \widehat {\bf h} \,|\, {\bf Y} = {\bf \sigma}) \quad
661: = \quad \int_{s} {\rm Prob}({\bf X}= \widehat {\bf h} \,|\, {\bf Y}
662: = {\bf \sigma},\,s_1,\ldots,s_d \,)\pi(s) ds.
663: \end{equation}
664: This is an important problem, since it can give a quantitative assessment of the validity of $\widehat {\bf h}$ in a setting where we have prior, but not certain, information about the parameters, and also because we may want to sample $\widehat {\bf h}$ according to its posterior distribution (for an example of how this can be applied in computational biology see \cite{Liu:94}). Unfortunately, these integrals may be difficult to compute. We propose the following simple
665: Monte Carlo algorithm for computing a numerical approximation
666: to the integral (\ref{Bayesian}):
667:
668: \begin{prop}
669: Select $N$ parameter vectors $s^{(1)},\ldots,s^{(N)}$
670: according to the distribution $\pi(s)$, where
671: $N$ is much larger than the
672: number of vertices of the Newton polytope $\,NP(f_\sigma)$.
673: Let $K$ be the number of $s^{(i)}$
674: such that $-{\rm log}(s^{(i)})$ lies in the normal cone of
675: $NP(f_\sigma)$ indexed by the explanation $\widehat {\bf h}$.
676: Then $K/N$ approximates (\ref{Bayesian}).
677: \end{prop}
678:
679: \begin{proof}
680: The expression $\,
681: {\rm Prob}({\bf X}= \widehat {\bf h} \,|\, {\bf Y}
682: = {\bf \sigma},\,s_1,\ldots,s_d \,) \,$ is
683: zero or one depending on whether the vector
684: $-{\rm log}(s) = (-{\rm log}(s_1),\ldots,-{\rm log}(s_d))
685: $ lies in the normal cone of
686: $NP(f_\sigma)$ indexed by $\widehat {\bf h}$. This membership test can be done without ever running the sum-product algorithm if we precompute an inequality representation of the normal cones.
687: \end{proof}
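
In symbols, the estimate and its accuracy can be written as follows. The
notation $a_{\widehat{\bf h}}$, $C_{\widehat{\bf h}}$ and $p$ is introduced here for
illustration, and the error estimate is the standard one for a binomial
proportion, assuming the $N$ parameter vectors are drawn independently; it is not a
result specific to this model. Let $a_{\widehat{\bf h}}$ denote the vertex of
$NP(f_\sigma)$ corresponding to $\widehat{\bf h}$. Its normal cone is
$$ C_{\widehat{\bf h}} \quad = \quad \bigl\{ \, v \in {\bf R}^d \,\, : \,\,
a_{\widehat{\bf h}} \cdot v \, \leq \, a' \cdot v \,\,\, \hbox{for all vertices} \,\, a' \,\, \hbox{of} \,\, NP(f_\sigma) \, \bigr\}, $$
and it suffices to impose the inequalities for the vertices $a'$ adjacent to
$a_{\widehat{\bf h}}$ in the edge graph. With ${\bf 1}[\,\cdot\,]$ denoting the indicator
function, the estimate is
$$ \frac{K}{N} \quad = \quad \frac{1}{N} \sum_{i=1}^{N}
{\bf 1} \bigl[ \, -{\rm log}(s^{(i)}) \in C_{\widehat{\bf h}} \, \bigr]. $$
If $p$ denotes the value of the integral (\ref{Bayesian}), then $K$ is binomially
distributed with parameters $N$ and $p$, so $K/N$ has mean $p$ and standard
deviation $\sqrt{p(1-p)/N} \, \leq \, 1/(2\sqrt{N})$.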
688:
689: The bound on the number of vertices of the Newton polytope
690: in \cite[\S 4]{Pachter:04} provides a valuable tool for
691: estimating the quality of this Monte Carlo approximation.
692: We believe that the tropical geometry developed in \cite{Pachter:04}
693: will also be useful for more refined analytical approaches to
694: Bayesian integrals. The study of Newton polytopes
695: can also complement the algebraic geometry
696: approach to model selection proposed in \cite{Rusakov:02}.
697:
698:
Another application of parametric inference is to problems where the number of parameters may be very large, but where we want to fix a large subset of them, thereby reducing the dimensions of the polytopes. Gene finding models, for example, may have up to thousands of parameters, and input sequences can be millions of base pairs long; however, we are usually only interested in studying the dependence of inference on a select few. Although specializing parameters reduces the dimension of the parameter space, the explanations correspond to vertices of a
700: \emph{regular subdivision of the Newton polytope}, rather than just to the vertices of the polytope itself. This is explained below (readers may also
701: want to refer to \cite{Pachter:04} for more background).
702:
703: Consider a graphical model with parameters $s_1,\ldots, s_{d}$
704: of which the parameters $s_1,\ldots , s_{r}$ are
705: free but $\, s_{r+1} = S_{r+1}, \ldots, s_d = S_d \,$
706: where the $S_i$ are fixed non-negative numbers.
707: Then the coordinate polynomials $f_\sigma$ of our model
708: specialize to polynomials in $r$ unknowns
709: whose coefficients $c_a$ are non-negative numbers:
710: $$\, \tilde f_\sigma(s_1,\ldots,s_{r}) \quad = \quad
711: f_\sigma(s_1,\ldots,s_{r}, S_{r+1}, \ldots, S_{d})
712: \quad = \quad \sum_{a \in {\bf N}^r} c_a \cdot s_1^{a_1} \cdots s_{r}^{a_r}. $$
713: The \emph{support} of this polynomial is the finite set
714: $\, {\cal A}_\sigma \, = \, \{\, a \in {\bf N}^r \, : \,c_a > 0 \,\}$.
715: The convex hull of $\, {\cal A}_\sigma\,$ in ${\bf R}^r$
716: is the Newton polytope of the polynomial $\tilde f_\sigma = \tilde f_\sigma(s_1,\ldots,s_r)$. For example, in the case of the hidden Markov model with output parameters specialized,
717: the Newton polytope of
718: $\tilde f_{\sigma}$ is the polytope associated with a Markov chain.
719: Kuo \cite{Kuo:04} shows that the size of these
720: polytopes does not depend on the length of the chain.
721:
722: Let ${\bf h}$ be any explanation for $\sigma$ in the original model
and let $(u_1,\ldots,u_r,u_{r+1}, \ldots,u_d)$ be the vertex
724: of the Newton polytope of $f_\sigma$ corresponding
725: to that explanation. We abbreviate $\,a_{\bf h} = (u_1,\ldots,u_r)\,$
726: and $\, S_{\bf h} \, = \,S_{r+1}^{u_{r+1}} \cdots S_d^{u_d}$.
727: The assignment
728: $\, {\bf h} \mapsto a_{\bf h}\,$ defines a map
729: from the set of explanations of $\sigma$ to the support
730: $\, {\cal A}_\sigma$. The convex hull of
731: the image coincides with the Newton polytope of $\,\tilde f_\sigma$.
732: We define
733: \begin{equation}
734: \label{fromHtoA}
735: w_a \, = \, {\rm min} \bigl\{ \, - {\rm log}(S_{\bf h}) \, \, :\,\,
736: {\bf h} \, \, \hbox{is an explanation for } \, \sigma \,\,\, \hbox{with}\,\,\,
737: a_{\bf h} = a \, \bigr\}.
738: \end{equation}
739: If the specialization is sufficiently generic
then this minimum is attained uniquely,
741: and, for simplicity, we will assume that this is the case.
742: If a point $a \in {\cal A}_\sigma$ is not the image of any explanation ${\bf h}$ then
743: we set $w_a = \infty$.
744: The assignment $a \mapsto w_a$ is a real valued function
745: on the support of our polynomial $\tilde f_\sigma$,
746: and it defines a \emph{regular polyhedral subdivision} $\, \Delta_w \,$
747: of the Newton polytope $NP(\tilde f_\sigma)$. Namely, $\Delta_w$ is the polyhedral
748: complex consisting of all lower faces of the polytope gotten by taking the
749: convex hull of the points $(a,w_a)$ in ${\bf R}^{r+1}$.
750: See \cite{Sturmfels:96} for details on regular triangulations
751: and regular polyhedral subdivisions.
752:
753: \begin{thm}
754: The explanations for the observation $\sigma$ in the specialized model are
755: in bijection with the vertices of the regular polyhedral subdivision $\, \Delta_w \,$
756: of the Newton polytope of the specialized polynomial $\, \tilde f_\sigma$.
757: \end{thm}
758:
759: \begin{proof}
760: The point $(a,w_a)$ is a vertex of $\Delta_w$ if and only if
761: the following open polyhedron is non-empty:
762: $$ P_a \quad = \quad \bigl\{ \, v \in {\bf R}^r \,\, : \,\,
763: a \cdot v + w_a \, < \, a' \cdot v + w_{a'}\,\, \hbox{for all}\,\,
a' \in {\cal A}_\sigma \backslash \{a\} \, \bigr\}. $$
765: If $v$ is a point in $P_a$ then we set
766: $\,s_i = {\rm exp}(-v_i)\,$ for $i=1,\ldots,r$,
767: and we consider the explanation ${\bf h}$
768: which attains the minimum in (\ref{fromHtoA}).
769: Now all parameters have been specialized
770: and ${\bf h}$ is the solution to Problem 2.
771: This argument is reversible: any explanation for
772: $\sigma$ in the specialized model arises from
773: one of the non-empty polyhedra $P_a$.
774: We note that the collection of polyhedra $P_a$ defines a polyhedral
775: subdivision of ${\bf R}^r$ which is geometrically dual
776: to the subdivision $\Delta_w$ of the Newton polytope
777: of $\tilde f_\sigma$.
778: \end{proof}
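
A one-dimensional toy example, with numbers chosen purely for illustration and
not derived from any particular model, shows how an interior point of the Newton
polytope can carry an explanation. Take $r=1$, ${\cal A}_\sigma = \{0,1,2\}$ and
heights $(w_0,w_1,w_2) = (1,0,1)$. The Newton polytope is the segment $[0,2]$,
and the polyhedra from the proof are
$$ P_0 = \{ v \, : \, 1 < v , \,\, 1 < 2v+1 \} = (1,\infty), \quad
P_1 = \{ v \, : \, v < 1 , \,\, v < 2v+1 \} = (-1,1), \quad
P_2 = \{ v \, : \, 2v+1 < 1 , \,\, 2v+1 < v \} = (-\infty,-1). $$
All three are non-empty, so $\Delta_w$ has the vertices $0,1,2$ and subdivides
$[0,2]$ into $[0,1]$ and $[1,2]$, although $1$ is not a vertex of the Newton polytope.
With the heights $(0,1,0)$ instead, we get
$\, P_1 = \{ v \, : \, v+1 < 0 , \,\, v+1 < 2v \} = \emptyset$,
so the point $1$ is not a vertex of $\Delta_w$ and carries no explanation.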
779:
780: \vskip .1cm
781:
782: In practical applications of parametric inference, it
783: may be of interest to compute only one normal cone of the Newton polytope (for example the cone containing some fixed parameters). We conclude this section by observing that the polytope propagation algorithm is suitable for this computation as well:
784:
785: \begin{prop}
Let $v$ be a vertex of a $d$-dimensional Newton polytope of a hidden Markov model. Then the normal cone of $v$ can be computed using a polytope propagation algorithm
in dimension $d-1$.
788: \end{prop}
789:
790: \begin{proof} We run the standard polytope propagation algorithm
791: described in Section 4,
792: but at each step we record only the minimizing vertex in the direction of the log parameters, together with its neighboring vertices in the edge graph of the Newton polytope. It follows, by induction, that given this information at the $n$th step, we can use it to find the minimizing vertices and related neighbors in the $(n+1)$st step.
793: \end{proof}
794:
795: \section{Summary}
796:
797: We envision a number of biological applications for the polytope propagation algorithm, including:
798:
799: \begin{itemize}
800: \item Full parametric inference using the normal fan of the Newton polytope of an observation when the graphical model under
801: consideration has only few model parameters.
802: \item Utilization of the edge graph of the polytope to identify stable parts of
803: alignments and annotations.
804: \item Construction
805: of the normal cone containing a specific parameter vector
806: when computation of the full Newton polytope is infeasible.
807: \item Computation of the posterior probability
808: (in the sense of Bayesian statistics) of an alignment
809: or annotation. The regions for the relevant integrations
810: are the normal cones of the Newton polytope.
811: \end{itemize}
812:
813:
814: As we have seen, the computation of Newton polytopes for (interesting) graphical models is certainly feasible for a few free parameters, and we expect that further analysis of the computational geometry should yield efficient algorithms in higher dimensions. For example, the key operation, computation of convex hulls of unions of convex polytopes, is likely to be considerably easier than general convex hull computations even in high
dimensions. Fukuda, Liebling and L\"{u}tolf \cite{Fukuda:01} give a polynomial time algorithm for computing extended convex hulls (convex hulls of unions of convex polytopes) under
the assumption that the polytopes are in general position. Furthermore, it should be possible to optimize the geometric algorithms for specific models of interest, and combinatorial analysis of the Newton polytopes arising in graphical models should yield better complexity estimates (see, e.g., \cite{Fernandez-Baca:00,Gusfield:94}).
817: Michael Joswig is currently working on a general polytope propagation implementation in POLYMAKE \cite{Gawrilow:00,Gawrilow:01}.
818:
819: In the case where computation of the Newton polytope is impractical, it is still possible to identify the cone containing a specific parameter, and this can be used to quantitatively measure the robustness of the inference. Parameters near a boundary are unlikely to lead to biologically meaningful results. Furthermore, the edge graph can be used to identify common regions in the explanations corresponding to adjacent vertices. In the case of alignment, biologists might see a collection of alignments rather than just one optimal one, with common sub-alignments highlighted. This is quite different from returning the $k$ best alignments, since suboptimal alignments may not be vertices of the Newton polytope. The solution we propose explicitly identifies all suboptimal alignments that can result from similar parameter choices.
820:
821: \section{Acknowledgments}
822: Lior Pachter was supported in part by a grant from the NIH (R01-HG02362-02).
823: Bernd Sturmfels was supported by
824: a Hewlett Packard Visiting Research Professorship 2003/2004
825: at MSRI Berkeley and in part by the NSF (DMS-0200729).
826:
827: \nocite{*}
828: \begin{thebibliography}{26}
829:
830: \bibitem{Alexandersson:03} M. Alexandersson, S. Cawley and L. Pachter: SLAM - Cross-species Gene Finding and Alignment with a Generalized Pair Hidden Markov Model, Genome Research 13 (2003) 496--502.
\bibitem{Baldi:98} P. Baldi and S. Brunak: Bioinformatics. The Machine Learning Approach. A Bradford Book. The MIT Press. Cambridge, Massachusetts, 1998.
832: \bibitem{Bucher:96} P. Bucher and K. Hofmann: A sequence similarity
833: search algorithm based on a probabilistic interpretation of an alignment
834: scoring system, Proceedings of the Conference on Intelligent Systems for Molecular Biology, 1996, 44--51.
835: \bibitem{Durbin:98} R. Durbin, S. Eddy, A. Krogh and G. Mitchison: Biological Sequence Analysis (Probabilistic Models of Proteins and Nucleic Acids),
836: Cambridge University Press, 1998.
837: \bibitem{Fernandez-Baca:00} D. Fern\'andez-Baca, T. Sepp\"al\"ainen and G. Slutzki:
838: Parametric multiple sequence alignment and phylogeny construction,
839: in Combinatorial Pattern Matching, Lecture
840: Notes in Computer Science (R. Giancarlo and D. Sankoff eds.), Vol. 1848, 2000, 68--82.
841: \bibitem{Friedman:97} N. Friedman, D. Geiger and M. Goldszmidt: Bayesian network classifiers, Machine Learning 29 (1997) 131--161.
\bibitem{Fukuda:01} K. Fukuda, T.H. Liebling and C. L\"{u}tolf: Extended convex hull, Computational Geometry 20 (2001) 13--23.
843: \bibitem{Gardiner-Garden:87} M. Gardiner-Garden and M. Frommer: CpG islands in vertebrate genomes, Journal of Molecular Biology 196 (1987) 261--282.
\bibitem{Gawrilow:00} E. Gawrilow and M. Joswig: polymake: a Framework for Analyzing Convex Polytopes, Polytopes -- Combinatorics and Computation (G. Kalai and G.M. Ziegler eds.), Birkh\"{a}user (2000).
845: \bibitem{Gawrilow:01} E. Gawrilow and M. Joswig: polymake: an Approach to Modular Software Design in Computational Geometry, Proceedings of the 17th Annual Symposium on Computational Geometry, ACM, 2001, 222--231.
846: \bibitem{Gusfield:94} D. Gusfield, K. Balasubramanian, and D. Naor: Parametric optimization of sequence alignment, Algorithmica 12 (1994) 312--326.
\bibitem{Gusfield:96} D. Gusfield and P. Stelling: Parametric and inverse-parametric sequence alignment with XPARAL, Methods in Enzymology 266 (1996) 481--494.
848: \bibitem{Jordan:02} M.I. Jordan and Y. Weiss: Graphical Models:
849: Probabilistic Inference, in {\it Handbook of Brain Theory and
850: Neural Networks, 2nd edition}, M. Arbib (Ed.), Cambridge, MA, MIT
851: Press, 2002.
852: \bibitem{Kschischang:01} F. Kschischang, B. Frey, and H. A. Loeliger: Factor graphs and the sum-product algorithm, IEEE Trans. Inform. Theory 47 (2001) 498--519.
853: \bibitem{Kuo:04} E. Kuo, Viterbi sequences and polytopes, {\tt http://front.math.ucdavis.edu/math.CO/0401342}.
854: \bibitem{Lander:01} E.S. Lander et al.: Initial sequencing and analysis of the human genome, Nature 409 (2001) 860--921.
855: \bibitem{Liu:94} J. Liu: The collapsed Gibbs sampler with applications to a gene regulation problem, J. Amer. Statist. Assoc.~89 (1994) 958--966.
856: \bibitem{Pachter:04} L. Pachter and B. Sturmfels: Tropical geometry of statistical models, companion paper, submitted.
\bibitem{Preparata:85} F. P. Preparata and M. I. Shamos: Computational Geometry: An Introduction, Springer-Verlag, 1985.
858: \bibitem{Rusakov:02} D.~Rusakov and D.~Geiger: Asymptotic model
859: selection for naive Bayesian networks, Uncertainty in
860: Artificial Intelligence, 2002, 438--445.
861: \bibitem{Stanley:99} R. Stanley: Enumerative Combinatorics, Volume 2,
862: Cambridge University Press, 1999.
863: \bibitem{Sturmfels:96} B. Sturmfels: Gr\"obner Bases and Convex Polytopes, University Lecture Series, Vol. 8, American Mathematical Society, 1996.
864: \bibitem{Takai:02} D. Takai and P. A. Jones: Comprehensive analysis of CpG islands in human chromosomes 21 and 22, Proc. Natl. Acad. Sci. USA 99 (2002) 3740--3745.
865: \bibitem{Waterman:92} M. Waterman, M. Eggert and E. Lander: Parametric
866: sequence comparisons, Proc. Natl. Acad. Sci. USA 89 (1992) 6090--6093.
867: \end{thebibliography}
868: \end{document}
869:
870: