1: \section{Memory requirements of the on-line Viterbi algorithm}
2:
In this section, we analyze the memory requirements of the on-line Viterbi
algorithm. The amount of memory used varies throughout the
execution of the algorithm; of special interest are asymptotic
bounds on the expected maximum amount of memory used
while decoding a sequence of length $n$.
8:
We use an analogy to random walks and results from extreme value theory to
argue that for symmetric two-state HMMs, the expected maximum memory
is $\Theta(m\log n)$. We also conduct experiments on an HMM for
gene finding, using both real and simulated DNA sequences.
13:
14: \subsection{Symmetric two-state HMMs}
15:
16: Consider a two-state HMM over a binary alphabet as shown in Figure
17: \ref{fig:twostate}a. For simplicity, we assume $t<1/2$ and $e<1/2$.
18: The back pointers between the sequence positions $i$ and $i+1$ can
19: form one of the configurations i--iii shown in Figure
20: \ref{fig:twostate}b. Denote $p_A=\log P(i,A)$ and $p_B=\log P(i,B)$,
21: where $P(i,j)$ is the table of probabilities from the Viterbi algorithm.
The recurrence used in the Viterbi algorithm implies that
configuration i occurs when $\log t-\log(1-t)\le p_A-p_B\le \log (1-t)
- \log t$, configuration ii occurs when $p_A-p_B\ge \log(1-t)-\log t$,
and configuration iii occurs when $p_A-p_B\le \log t - \log(1-t)$.
Configuration iv never occurs for $t<1/2$.
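
The thresholds above can be read off the recurrence directly. As a sketch,
assume that the self-transition probability is $1-t$ and the cross-transition
probability is $t$, and write $e_A(x)$, $e_B(x)$ for the emission probabilities
of symbol $x$ and $x_{i+1}$ for the symbol at position $i+1$ (this notation is
ours and is introduced only for illustration). Then
\begin{eqnarray*}
\log P(i+1,A) &=& \log e_A(x_{i+1}) + \max\{p_A+\log(1-t),\ p_B+\log t\},\\
\log P(i+1,B) &=& \log e_B(x_{i+1}) + \max\{p_B+\log(1-t),\ p_A+\log t\}.
\end{eqnarray*}
The back pointer of state $A$ at position $i+1$ leads to state $A$ at position
$i$ when $p_A-p_B\ge \log t-\log(1-t)$, and the back pointer of state $B$ leads
to state $B$ when $p_A-p_B\le \log(1-t)-\log t$ (ignoring how ties are broken).
Both conditions hold in configuration i, only the first holds in
configuration ii, and only the second holds in configuration iii.
Both conditions can fail simultaneously only if
$\log(1-t)-\log t < \log t-\log(1-t)$, that is, only if $t>1/2$, which is why
the remaining configuration is excluded by the assumption $t<1/2$.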
27:
28: \begin{figure}[t]
29: \centerline{\includegraphics[width=0.8\textwidth]{figures/twostate.eps}}
30: \caption{{\bf (a) Symmetric two-state HMM} with two parameters:
31: $e$ for emission
probabilities and $t$ for transition probabilities.
33: {\bf (b) Possible back-pointer configurations} for the two-state HMM.
34: \label{fig:twostate}}
35: \end{figure}
36:
Note that for a two-state HMM, a coalescence point occurs
whenever one of the configurations ii or iii occurs. Thus the memory
used by the algorithm is proportional to the length of a continuous sequence
of configurations i. We will call such a sequence of configurations
a \emph{run}.
42:
First, we analyze the length distribution of runs under the
assumption that the input sequence is a sequence of uniform
i.i.d. binary random variables. In such a case, we represent the run
by a symmetric random walk corresponding
to the random variable
$X=\frac{p_A-p_B - (\log t-\log(1-t))}{\log (1-e) - \log e}.$ Whenever
this variable is within the interval $(0,K)$, where
$K = \left\lceil 2 \frac{\log(1-t)-\log(t)}{\log(1-e)-\log(e)}\right\rceil,$
configuration i occurs, and the quantity $p_A-p_B$ is updated by
$\log(1-e)-\log e$, if the symbol at the corresponding sequence position
is 0, or by $\log e - \log(1-e)$, if this symbol is 1. These shifts
correspond to updating the value of $X$ by $+1$ or $-1$.
55:
When $X$ reaches 0, we have a coalescence point in configuration iii, and
$p_A-p_B$ is initialized to $\log t - \log(1-t) \pm (\log e - \log(1-e))$,
which means either initialization of $X$ to $+1$ or another coalescence
point, depending on the symbol at the corresponding sequence position.
The other case, when $X$ reaches $K$ and we have a coalescence point in
configuration ii, is symmetric.
62:
63: We can now apply the classical results from the theory of random walks
64: (see \cite[ch.14.3,14.5]{Feller1968}) to analyze the expected length
65: of runs.
66:
67: \begin{lemma}
68: Assuming that the input sequence is uniformly i.i.d., the expected length of a
69: run of a symmetrical two-state HMM is $K-1$.
70: \end{lemma}
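
The value $K-1$ can be checked against the classical expected-duration
formula. As a sketch under the random-walk model above: for a symmetric walk
on $\{0,\dots,K\}$ with absorbing barriers, the expected number of steps $D_z$
until absorption from starting point $z$ satisfies
\begin{equation*}
D_z = 1 + \frac{1}{2}\left(D_{z-1}+D_{z+1}\right),\qquad D_0=D_K=0,
\end{equation*}
whose solution is $D_z=z(K-z)$ \cite[ch.14.3]{Feller1968}. A run starts at
$z=1$ or, symmetrically, at $z=K-1$, and in both cases $D_z=K-1$.
For illustration only, with the parameter values $t=0.01$ and $e=0.3$
(chosen by us, not taken from the experiments), we get
$K=\lceil 2\ln 99/\ln(7/3)\rceil=\lceil 10.85\rceil=11$, and thus an expected
run length of $10$.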
71:
Therefore, the larger $K$ is, the more memory is required to decode the
HMM. The worst case is achieved as $e$ approaches $1/2$.
In such a case, the two states are indistinguishable and being in state
$A$ is equivalent to being in state $B$. Using the theory of random walks,
we can also characterize the distribution of run lengths.
77:
78: \begin{lemma}
79: \label{lem:distrib}
80: Let $R_\ell$ be the event that the length of a run of a symmetrical
81: two-state HMM is either $2\ell+1$ or $2\ell+2$. Then,
82: assuming that the input sequence is uniformly i.i.d., for some constants
83: $b,c>0$:
84: \begin{equation}
85: b\cdot\cos^{2\ell}\frac{\pi}{K}\le \Pr(R_\ell)
86: \le c\cdot \cos^{2\ell}\frac{\pi}{K}
87: \end{equation}
88: \end{lemma}
89:
90: \def\pivk{\frac{\pi v}{K}}
91: \def\pik{\frac{\pi}{K}}
92: \begin{proof}
For a symmetric random walk on the interval $(0,K)$ with absorbing barriers
and with starting point
$z$, the probability of the event $W_{z,n}$ that this random walk ends
at point $0$ after $n$ steps is zero if $n-z$ is odd, and is equal to the
following quantity if $n-z$ is even \cite[ch.14.5]{Feller1968}:
98: \begin{equation}
99: \Pr(W_{z,n}) = \frac{2}{K}\sum_{0<v<K/2}
100: \cos^{n-1}\pivk \sin\pivk \sin\frac{\pi z v}{K}
101: \end{equation}
Using symmetry, note that the probability of the same random walk
ending after $n$ steps at barrier $K$ is the same as the probability of
$W_{K-z,n}$. A run starts at $z=1$, so absorption at $0$ takes an odd
number of steps, while absorption at $K$ takes a number of steps of the
same parity as $K-1$. Thus, if $K$ is even, we can state:
\begin{eqnarray}
\Pr(R_\ell) &=& \Pr(W_{1,2\ell+1}) + \Pr(W_{K-1,2\ell+1}) \nonumber\\
&=& \frac{2}{K}\sum_{0<v<K/2}\cos^{2\ell}\pivk
\sin\pivk\left(\sin\pivk+(-1)^{v+1}\sin\pivk\right)
\nonumber\\
&=& \frac{4}{K}\sum_{0<v<K/2,\mbox{ $v$ odd}}
\cos^{2\ell}\pivk\sin^2\pivk
\end{eqnarray}
There are at most $K/4$ terms in the sum and they can all be bounded from above
by
$\cos^{2\ell}\pik$. Thus, we can
give both upper and lower bounds on $\Pr(R_\ell)$ using only the
first term of the sum as follows:
\begin{equation}
\frac{4}{K}\sin^2\pik \cos^{2\ell}\pik
\le \Pr(R_\ell) \le \cos^{2\ell}\pik
\end{equation}
Similarly, if $K$ is odd, we can state:
123: \begin{eqnarray}
124: \Pr(R_\ell) &=& \Pr(W_{1,2\ell+1}) + \Pr(W_{K-1,2\ell+2})\nonumber \\
125: &=& \frac{2}{K}\sum_{0<v<K/2}\cos^{2\ell}\pivk
126: \sin^2\pivk\left(1+(-1)^{v+1}\cos\pivk\right)
127: \end{eqnarray}
128: and thus we have a similar bound:
129: \begin{equation}
130: \frac{2}{K}\sin^2\pik\left(1+\cos\pik\right)\cos^{2\ell}\pik
131: \le \Pr(R_\ell) \le 2\cos^{2\ell}\pik
132: \end{equation}
133: \qed
134: \end{proof}
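
As a sanity check of the formula for $\Pr(W_{z,n})$, consider the smallest
non-trivial case $K=3$ (a worked example of ours, not part of the original
argument). The sum contains only the term $v=1$, with
$\cos\frac{\pi}{3}=\frac{1}{2}$ and $\sin\frac{\pi}{3}\sin\frac{2\pi}{3}=
\sin^2\frac{\pi}{3}=\frac{3}{4}$, so
\begin{eqnarray*}
\Pr(W_{1,2\ell+1}) &=& \frac{2}{3}\left(\frac{1}{2}\right)^{2\ell}\cdot\frac{3}{4}
 = \frac{1}{2}\left(\frac{1}{4}\right)^{\ell},\\
\Pr(W_{2,2\ell+2}) &=& \frac{2}{3}\left(\frac{1}{2}\right)^{2\ell+1}
 \sin\frac{\pi}{3}\sin\frac{2\pi}{3}
 = \frac{1}{4}\left(\frac{1}{4}\right)^{\ell}.
\end{eqnarray*}
By symmetry, $\Pr(W_{2,2\ell+2})$ is also the probability that a walk started
at $1$ is absorbed at $3$ after $2\ell+2$ steps. Both values agree with a
direct count: a walk started at $1$ returns to $1$ after two steps with
probability $\frac{1}{4}$, and otherwise it is absorbed at $0$ after one step
(probability $\frac{1}{2}$) or at $3$ after two steps (probability
$\frac{1}{4}$). Hence, for $K=3$, a run has length $2\ell+1$ or $2\ell+2$ with
probability $\frac{3}{4}\left(\frac{1}{4}\right)^{\ell}
=\frac{3}{4}\cos^{2\ell}\frac{\pi}{3}$, and these probabilities sum to one
over $\ell\ge 0$.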
135:
136:
The previous lemma characterizes the length distribution of a single
run. However, to analyze memory requirements for a sequence of length
$n$, we need to consider the maximum over several runs whose total length
is $n$. A similar problem was studied for the runs of
heads in a sequence of $n$ coin tosses
\cite{Guibas1980,Gordon1986}. For coin tosses, the length distribution
of runs is geometric, while in our case the run length distribution is only
bounded by geometrically decaying functions. Still, we can prove that the expected
length of the longest run grows logarithmically with the length of the
sequence, as is the case for the coin tosses.
147:
148: \begin{lemma}
149: \label{lem:max}
150: Let $X_1,X_2,\dots$ be a sequence of i.i.d. random variables drawn from a
151: geometrically decaying distribution over positive integers, i.e.
152: there exist constants $a,b,c$, $a\in (0,1)$,
153: $0<b\le c$, such that for all integers $k\ge 1$,
154: $b a^k \le \Pr(X_i > k) \le c a^k.$
155:
156: Let $N_n$ be the largest index such that $\sum_{i=1\dots N_n} X_i\le n$,
157: and let $Y_n$ be $\max\{X_1,X_2,\dots,X_{N_n},n-\sum_{i=1}^{N_n} X_i\}$.
158: Then
159: \begin{equation}
160: E[Y_n] = \log_{1/a} n + o(\log n)
161: \end{equation}
162: \end{lemma}
163:
164: \begin{proof}
165:
Let $Z_n = \max_{i=1\dots n} X_i$ be the maximum of the first $n$
runs. Clearly, $\Pr(Z_n \le k) = \Pr(X_i \le k)^n$, and therefore
$(1 - c a^k)^n \le \Pr(Z_n \le k) \le (1 - b a^k)^n$ for all integers
$k\ge \log_{1/a}(c)$.
170:
171:
172: \paragraph{Lower bound:}
173: Let $t_n = \log_{1/a} n - \sqrt{\ln n}$.
174: If $Y_n\le t_n$, we need at
175: least $n/t_n$ runs to reach the sum $n$, i.e.
176: $N_n\ge n/t_n-1$ (discounting the last
177: incomplete run). Therefore
178: \begin{equation}
179: \Pr(Y_n\le t_n) \le \Pr(Z_{\frac{n}{t_n}-1} \le t_n)
180: \le (1 - b a^{t_n})^{\frac{n}{t_n}-1}=
181: (1-ba^{t_n})^{a^{-t_n}a^{t_n}(\frac{n}{t_n}-1)}
182: \end{equation}
183:
184: Since $\lim_{n \to \infty} a^{t_n}(n/t_n-1) =
185: \infty$ and $\lim_{x \to 0} (1-b x)^{1/x} =
186: e^{-b}$, we get $\lim_{n\to\infty} \Pr(Y_n\le t_n) = 0$.
187: Note that $E[Y_n] \ge t_n (1-\Pr(Y_n \le t_n))$, and thus we
188: get the desired bound.
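
In more detail (a sketch of the omitted steps, using only the definitions
above), the divergence of the exponent follows from
\begin{equation*}
a^{t_n} = a^{\log_{1/a} n}\cdot a^{-\sqrt{\ln n}}
 = \frac{1}{n}\left(\frac{1}{a}\right)^{\sqrt{\ln n}},
\qquad\mbox{so}\qquad
a^{t_n}\left(\frac{n}{t_n}-1\right)
 = \left(\frac{1}{a}\right)^{\sqrt{\ln n}}\left(\frac{1}{t_n}-\frac{1}{n}\right).
\end{equation*}
The factor $(1/a)^{\sqrt{\ln n}}$ grows faster than any power of $\ln n$,
whereas $t_n = O(\log n)$, and hence the product tends to infinity. At the same
time $a^{t_n}\to 0$, so the bound on $\Pr(Y_n\le t_n)$ has the form of a
quantity converging to $e^{-b}<1$ raised to a power that tends to infinity.
Finally, since $t_n = \log_{1/a} n - o(\log n)$, we obtain
$E[Y_n]\ge t_n(1-o(1)) = \log_{1/a} n - o(\log n)$.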
189:
190: \paragraph{Upper bound:}
191: Clearly, $Y_n\le Z_n$ and so $E[Y_n] \le E[Z_n]$.
192: Let $Z'_n$ be the
193: maximum of $n$ i.i.d. geometric random variables $X'_1, \dots, X'_n$
194: such that $\Pr(X'_i\le k) = 1-a^k$.
195:
196: We will compare
197: $E[Z_n]$ to the expected value of variable $Z'_n$.
198: Without loss of generality, $c\ge 1$. For any real
199: $x\ge \log_{1/a}(c)+1$ we have:
200: \begin{eqnarray*}
201: \Pr(Z_n\le x)
202: &\ge& (1-c a^{\lfloor x\rfloor})^n \\
203: &=& \left(1-a^{\lfloor x\rfloor -\log_{1/a}(c)}\right)^n\\
204: &\ge& \left(1-a^{\lfloor x -\log_{1/a}(c)-1\rfloor}\right)^n\\
205: &=& \Pr(Z'_n\le x -\log_{1/a}(c)-1)\\
206: &=& \Pr(Z'_n+\log_{1/a}(c)+1 \le x)
207: \end{eqnarray*}
This inequality holds even for $x<\log_{1/a}(c)+1$, since the
right-hand side is zero in such a case.
Therefore, $E[Z_n]\le E[Z'_n+\log_{1/a}(c)+1] =E[Z'_n] + O(1)$.
The expected value of $Z'_n$ is $\log_{1/a}(n)+o(\log n)$ \cite{Schuster1985},
which proves our claim.\qed
213: \end{proof}
214:
215: %% sum_i=k^infty a^k = a^k/(a-1) (to apply distributions)
216: %% need to multiply by two, since this is a distribution
217: %% of 2-steps rather than single steps
218: %% 2*1/ln(1/cos^2(\pi/K)) = 1/ln(1/cos(\pi/K))
219:
Using Lemma \ref{lem:max} together with the
characterization of run length distributions in Lemma
\ref{lem:distrib}, we can conclude that for symmetric two-state HMMs,
the expected maximum memory required to process
a uniform i.i.d. input sequence of length $n$ is
225: $(1/\ln(1/\cos(\pi/K)))\cdot \ln n + o(\log n)$. \footnote{%
226: We omitted the first run, which has a different
227: starting point and thus does not follow the distribution
228: outlined in Lemma \ref{lem:distrib}. However, the expected
229: length of this run does not depend on $n$ and thus contributes only
230: a lower-order term. We also omitted the runs of length one that start
231: outside the interval $(0,K)$; these
232: runs again contribute only to lower order terms of the lower bound.}
Using the Taylor
expansion of the constant term as $K$ grows to infinity,
$1/\ln(1/\cos(\pi/K)) = 2K^2/\pi^2 + O(1)$,
we obtain that the maximum memory grows
approximately as $(2K^2/\pi^2)\ln n$.
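
The expansion of the constant can be verified as follows (a short derivation
of ours, with $x=\pi/K$):
\begin{eqnarray*}
\ln\frac{1}{\cos x} &=& \frac{x^2}{2}+\frac{x^4}{12}+O(x^6),\\
\frac{1}{\ln(1/\cos x)} &=& \frac{2}{x^2}\left(1+\frac{x^2}{6}+O(x^4)\right)^{-1}
 = \frac{2}{x^2}-\frac{1}{3}+O(x^2),
\end{eqnarray*}
so the constant equals $2K^2/\pi^2-1/3+O(1/K^2)$. As an illustration with a
value of $K$ chosen by us, for $K=11$ the approximation $2K^2/\pi^2$ gives
about $24.5$, while the exact constant is about $24.2$. For $n=10^6$ the
leading term is then roughly $24.2\cdot\ln 10^6\approx 334$, ignoring the
$o(\log n)$ term.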
238:
The asymptotic bound $\Theta(\log n)$ can be easily extended to
sequences that are generated by the symmetric HMM itself, instead of by a uniform
i.i.d. process. The underlying process can be described as a random walk with
242: approximately $2K$ states on two $(0,K)$ lines, each line
243: corresponding to sequence symbols generated by one of the two
244: states. The distribution of run lengths still decays geometrically
245: as required by Lemma \ref{lem:max}; the base of the exponent is the
246: largest eigenvalue of the transition matrix
247: with absorbing states omitted (see e.g. \cite[Claim 2]{Buhler2005}).
248:
249: The situation is more complicated in the case of non-symmetric
250: two-state HMMs.
251: Here, our random walks proceed in steps that are arbitrary real
252: numbers, different in each direction. We are not aware of any
results that would help us to directly analyze distributions
of runs in these models; however, we conjecture that the size of
the longest run is still $\Theta(\log n)$. Perhaps, to obtain
256: bounds on the length distribution of runs, one can approximate
257: the behaviour of such non-discrete random walks by a different
258: model (for example, \cite[ch.7]{Durrett1996}).
259:
260:
261: \subsection{Multi-state HMMs}
262:
263: Our analysis technique cannot be easily extended to HMMs with many
264: states. In two-state HMMs, each new coalescence event clears the
265: memory, and thus the execution of the algorithm can be divided
into more or less independent runs. A coalescence event in
267: a multi-state HMM results in a non-trivial tree left in memory,
268: sometimes with a substantial depth. Thus, the sizes of
269: consecutive runs are no longer independent
270: (see Figure \ref{fig:max}a).
271:
272: To evaluate the memory requirements of our algorithm for multi-state
273: HMMs, we have implemented the algorithm and performed several experiments
274: on both simulated and biological sequences. First, we generalized
275: the symmetric HMMs from the previous section to multiple states.
The symmetric HMM with $m$ states emits symbols over an $m$-letter
alphabet, where each state emits one symbol with higher probability
than the other symbols. The transitions are equiprobable,
except for self-transitions. We have tested the algorithm for
$m\le 6$ and sequences generated both by a uniform i.i.d. process and
by the HMM itself. The observed data are consistent with logarithmic
growth of the average maximum memory needed to decode a sequence of length $n$
(data not shown).
284:
285: We have also evaluated the algorithm using a simplified
286: HMM for gene finding with 265 states. The emission probabilities
287: of the states
are defined using at most 4th-order Markov chains, and the
289: structure of the HMM reflects known properties of genes (similar
290: to the structure shown in \cite{Brejova2007}). The HMM was
291: trained on RefSeq annotations of human chromosomes 1 and 22.
292:
293: In gene finding, we segment the input DNA sequence into exons
294: (protein-coding sequence intervals), introns (non-coding sequence
295: separating exons within a gene), and intergenic regions (sequence
separating genes). A common measure of accuracy is exon sensitivity (the
fraction of real exons that we have successfully and exactly predicted).
The implementation
used here has exon sensitivity 37\% on the testing set
of genes by Guigo et al. \cite{Guigo2006}. A realistic gene finder,
such as ExonHunter \cite{Brejova2005}, trained on the same data set
achieves sensitivity of 53\%. This difference is due to additional features
303: that are not implemented in our test, namely GC content levels, non-geometric
304: length distributions, and sophisticated signal models.
305:
306: \iffalse
307: masked sequence results this Genscan ExonHunter
308: Gene Sensitivity 6.76% 15.88% 8.78%
309: Gene Specificity 3.13% 9.81% 12.50%
310: Exon Sensitivity 37.13% 58.84% 52.91%
311: Exon Specificity 29.27% 45.43% 66.47%
312: Nucleotide Sensitivity 71.48% 83.68% 77.55%
313: Nucleotide Specificity 36.62% 59.70% 80.13%
314: \fi
315:
316: We have tested the algorithm on 20~MB long sequences: regions from the human
317: genome, simulated
318: sequences generated by the HMM, and i.i.d. sequences.
Regions of the human genome were chosen from the hg18 assembly so that
they do not contain sequencing gaps. The distribution for the i.i.d. sequences
mirrors the distribution of bases in human chromosome 1.
322:
323: The results are shown in Figure \ref{fig:max}b.
324: The average maximum length of the table over several samples appears to grow
325: faster than logarithmically with the length of the sequence, though
it seems to be bounded by a polylogarithmic function. It is not clear whether
the faster growth is an artifact that would disappear
with longer sequences or a higher number of samples.
329:
330: \begin{figure}[t]
331: \begin{minipage}[b]{0.48\textwidth}
332: \centerline{\bf (a)}
333: \includegraphics[scale=0.55]{figures/zuby.eps}
334: \end{minipage}
335: \hfill
336: \begin{minipage}[b]{0.48\textwidth}
337: \centerline{\bf (b)}
338: \includegraphics[scale=0.55]{figures/max.eps}
339: \end{minipage}
340: \caption{{\bf Memory requirements of a gene finding HMM.} a) Actual
341: length of table used on a segment of human chromosome 1. b) Average maximum
342: table length needed for prefixes of 20~MB sequences.
343: \label{fig:max}}
344: \end{figure}
345:
346: The HMM for gene finding has a special structure,
347: with three copies of the state for introns that have the same emission
348: probabilities and the same self-transition probability.
In two-state symmetric HMMs, similar emission probabilities of the two states
lead to an increase in the length of individual runs. Intron states of a
gene finder are an extreme example of this phenomenon.
352:
353:
Nonetheless, on average a table of length roughly 100,000 is sufficient
to process sequences of length 20~MB, which is a 200-fold improvement compared
to the trivial Viterbi algorithm. In addition, the length of
the table did not exceed
222,000 on any of the 20~MB human segments.
As we can see in Figure \ref{fig:max}a, most of the time the
program keeps only a relatively short table; the average length on the human
segments is 11,000. The low average length can be
a significant advantage if multiple processes share the same memory.
363: