
1: \section{Memory requirements of the on-line Viterbi algorithm}

2:

3: In this section, we analyze the memory requirements of the on-line Viterbi

4: algorithm. The memory used by the algorithm is variable throughout the

5: execution of the algorithm, but of special interest are asymptotic

6: bounds on the expected maximum amount of memory used by the algorithm

7: while decoding a sequence of length $n$.

8:

9: We use an analogy to random walks and results in extreme value theory to

10: argue that for symmetric two-state HMMs, the expected maximum memory

11: is $\Theta(m\log n)$. We also conduct experiments on an HMM for

12: gene finding, using both real and simulated DNA sequences.

13:

14: \subsection{Symmetric two-state HMMs}

15:

16: Consider a two-state HMM over a binary alphabet as shown in Figure

17: \ref{fig:twostate}a. For simplicity, we assume $t<1/2$ and $e<1/2$.

18: The back pointers between the sequence positions $i$ and $i+1$ can

19: form one of the configurations i--iii shown in Figure

20: \ref{fig:twostate}b. Denote $p_A=\log P(i,A)$ and $p_B=\log P(i,B)$,

21: where $P(i,j)$ is the table of probabilities from the Viterbi algorithm.

22: The recurrence used in the Viterbi algorithm implies that

23: the configuration i occurs when $\log t-\log(1-t)\le p_A-p_B\le \log (1-t)

24: - \log t$, configuration ii occurs when $p_A-p_B\ge \log(1-t)-\log t$,

25: and configuration iii occurs when $p_A-p_B\le \log t - \log(1-t)$.

26: Configuration iv never happens for $t<1/2$.

27:
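
As a brief sketch of where these thresholds come from, write $e_A(x)$ and
$e_B(x)$ for the emission probabilities of the symbol $x$ at position $i+1$
(this notation is introduced here only for illustration), and read
configuration i as the one in which each state points back to itself, which is
what the thresholds above suggest. The Viterbi recurrence then gives
\begin{eqnarray*}
\log P(i+1,A) &=& \log e_A(x) + \max\{p_A+\log(1-t),\; p_B+\log t\},\\
\log P(i+1,B) &=& \log e_B(x) + \max\{p_B+\log(1-t),\; p_A+\log t\}.
\end{eqnarray*}
The emission term is common to both candidates and does not affect the choice
of back pointer. The back pointer of $A$ leads to $A$ when
$p_A-p_B\ge \log t-\log(1-t)$, and the back pointer of $B$ leads to $B$ when
$p_A-p_B\le \log(1-t)-\log t$. Both conditions hold in configuration i, only
the first in configuration ii, and only the second in configuration iii.
Configuration iv would require both conditions to fail, which is impossible
because $\log t-\log(1-t) < 0 < \log(1-t)-\log t$ for $t<1/2$.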

28: \begin{figure}[t]

29: \centerline{\includegraphics[width=0.8\textwidth]{figures/twostate.eps}}

30: \caption{{\bf (a) Symmetric two-state HMM} with two parameters:

31: $e$ for emission

32: probabilities and $t$ for transition probabilities.

33: {\bf (b) Possible back-pointer configurations} for the two-state HMM.

34: \label{fig:twostate}}

35: \end{figure}

36:

37: Note that for a two-state HMM, a coalescence point occurs

38: whenever one of the configurations ii or iii occurs. Thus the memory

39: used by the algorithm is proportional to the length of a continuous sequence

40: of configurations i. We will call such a sequence of configurations

41: a \emph{run}.

42:

43: First, we analyze the length distribution of runs under the

44: assumption that the input sequence is a sequence of uniform

45: i.i.d. binary random variables. In such a case, we represent the run

46: by a symmetric random walk corresponding

47: to a random variable

48: $X=\frac{(p_A-p_B) - (\log t-\log(1-t))}{\log (1-e) - \log e}.$ Whenever

49: this variable is within the interval $(0,K)$, where

50: $K = \left\lceil 2 \frac{\log(1-t)-\log(t)}{\log(1-e)-\log(e)}\right\rceil,$

51: the configuration i occurs, and the quantity $p_A-p_B$ is updated by

52: $\log(1-e)-\log e$, if the symbol at the corresponding sequence position

53: is 0, or $\log e - \log(1-e)$, if this symbol is 1. These shifts

54: correspond to updating the value of $X$ by $+1$ or $-1$.

55:
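
For a concrete feel for the size of $K$, consider the values $t=0.1$ and
$e=0.3$, chosen here purely for illustration and not taken from any
experiment. The ratio of logarithms does not depend on the base, so using
natural logarithms we get
\begin{eqnarray*}
\log(1-t)-\log t &=& \ln 9 \approx 2.197,\\
\log(1-e)-\log e &=& \ln (7/3) \approx 0.847,\\
K &=& \left\lceil 2\cdot \frac{2.197}{0.847}\right\rceil
   = \lceil 5.19 \rceil = 6.
\end{eqnarray*}
In this example the walk therefore needs six unit steps to cross the interval
in which configuration i occurs. Moving $e$ closer to $1/2$ shrinks the
denominator and increases $K$.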

56: When $X$ reaches 0, we have a coalescence point in configuration iii, and

57: the quantity $p_A-p_B$ is reset to $\log t - \log(1-t) \pm (\log e - \log(1-e))$,

58: which either means initialization of $X$ to $+1$, or another coalescence

59: point, depending on the symbol at the corresponding sequence position.

60: The other case, when $X$ reaches $K$ and we have a coalescence point in

61: configuration ii, is symmetric.

62:

63: We can now apply the classical results from the theory of random walks

64: (see \cite[ch.14.3,14.5]{Feller1968}) to analyze the expected length

65: of runs.

66:

67: \begin{lemma}

68: Assuming that the input sequence is uniformly i.i.d., the expected length of a

69: run of a symmetrical two-state HMM is $K-1$.

70: \end{lemma}

71:
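
A short sketch of where the value $K-1$ comes from, under the assumption that
a run is identified with a symmetric walk that starts at $z=1$ (or,
symmetrically, at $z=K-1$) and stops at $0$ or $K$. The symbol $D_z$ for the
expected number of steps from the starting point $z$ is introduced here only
for illustration. Conditioning on the first step gives
\begin{equation}
D_z = 1 + \frac{1}{2}\left(D_{z+1} + D_{z-1}\right) \quad (0<z<K),
\qquad D_0 = D_K = 0,
\end{equation}
and one can check by substitution that $D_z = z(K-z)$ solves this recurrence,
since
$\frac{1}{2}\left((z+1)(K-z-1)+(z-1)(K-z+1)\right) = z(K-z)-1$.
Setting $z=1$ yields $D_1 = K-1$.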

72: Therefore, the larger $K$ is, the more memory is required to decode the

73: HMM. The worst case is achieved as $e$ approaches $1/2$.

74: In such a case, the two states are indistinguishable and being in state

75: $A$ is equivalent to being in state $B$. Using the theory of random walks,

76: we can also characterize the distribution of run lengths.

77:

78: \begin{lemma}

79: \label{lem:distrib}

80: Let $R_\ell$ be the event that the length of a run of a symmetrical

81: two-state HMM is either $2\ell+1$ or $2\ell+2$. Then,

82: assuming that the input sequence is uniformly i.i.d., for some constants

83: $b,c>0$:

84: \begin{equation}

85: b\cdot\cos^{2\ell}\frac{\pi}{K}\le \Pr(R_\ell)

86: \le c\cdot \cos^{2\ell}\frac{\pi}{K}

87: \end{equation}

88: \end{lemma}

89:

90: \def\pivk{\frac{\pi v}{K}}

91: \def\pik{\frac{\pi}{K}}

92: \begin{proof}

93: For a symmetric random walk on interval $(0,K)$ with absorbing barriers

94: and with starting point

95: $z$, the probability of event $W_{z,n}$ that this random walk ends

96: in point $0$ after $n$ steps is zero, if $n-z$ is odd, and the

97: following quantity, if $n-z$ is even \cite[ch.14.5]{Feller1968}:

98: \begin{equation}

99: \Pr(W_{z,n}) = \frac{2}{K}\sum_{0<v<K/2}

100:        \cos^{n-1}\pivk \sin\pivk \sin\frac{\pi z v}{K}

101: \end{equation}

102: Using symmetry, note that the probability of the same random walk

103: ending after $n$ steps at barrier $K$ is the same as probability of

104: $W_{K-z,n}$. Thus, if $K$ is even (so that both barriers can be reached only after an odd number of steps), we can state:

105: \begin{eqnarray}

106: \Pr(R_\ell) &=& \Pr(W_{1,2\ell+1}) + \Pr(W_{K-1,2\ell+1}) \nonumber\\

107:             &=& \frac{2}{K}\sum_{0<v<K/2}\cos^{2\ell}\pivk

108:               \sin\pivk\left(\sin\pivk+(-1)^{v+1}\sin\pivk\right)

109:               \nonumber\\

110:             &=& \frac{4}{K}\sum_{0<v<K/2,\mbox{ $v$ odd}}

111:                   \cos^{2\ell}\pivk\sin^2\pivk

112: \end{eqnarray}

113: There are at most $K/4$ terms in the sum and they can all be bounded from above

114: by

115: $\cos^{2\ell}\pik$. Thus, we can

116: give both upper and lower bounds on $\Pr(R_\ell)$ using only the

117: first term of the sum as follows:

118: \begin{equation}

119: \frac{4}{K}\sin^2\pik \cos^{2\ell}\pik

120: \le \Pr(R_\ell) \le \cos^{2\ell}\pik

121: \end{equation}

122: Similarly, if $K$ is odd (so that barrier $K$ can be reached only after an even number of steps), we can state:

123: \begin{eqnarray}

124: \Pr(R_\ell) &=& \Pr(W_{1,2\ell+1}) + \Pr(W_{K-1,2\ell+2})\nonumber \\

125:             &=& \frac{2}{K}\sum_{0<v<K/2}\cos^{2\ell}\pivk

126:                            \sin^2\pivk\left(1+(-1)^{v+1}\cos\pivk\right)

127: \end{eqnarray}

128: and thus we have a similar bound:

129: \begin{equation}

130: \frac{2}{K}\sin^2\pik\left(1+\cos\pik\right)\cos^{2\ell}\pik

131: \le \Pr(R_\ell) \le 2\cos^{2\ell}\pik

132: \end{equation}

133: \qed

134: \end{proof}

135:

136:

137: The previous lemma characterizes the length distribution of a single

138: run. However, to analyze memory requirements for a sequence of length

139: $n$, we need to consider the maximum over several runs whose total length

140: is $n$. A similar problem was studied for the runs of

141: heads in a sequence of $n$ coin tosses

142: \cite{Guibas1980,Gordon1986}. For coin tosses, the length distribution

143: of runs is geometric, while in our case it is only bounded by

144: geometrically decaying functions. Still, we can prove that the expected

145: length of the longest run grows logarithmically with the length of the

146: sequence, as is the case for the coin tosses.

147:

148: \begin{lemma}

149: \label{lem:max}

150: Let $X_1,X_2,\dots$ be a sequence of i.i.d. random variables drawn from a

151: geometrically decaying distribution over positive integers, i.e.

152: there exist constants $a,b,c$, $a\in (0,1)$,

153: $0<b\le c$, such that for all integers $k\ge 1$,

154: $b a^k \le \Pr(X_i > k) \le c a^k.$

155:

156: Let $N_n$ be the largest index such that $\sum_{i=1\dots N_n} X_i\le n$,

157: and let $Y_n$ be $\max\{X_1,X_2,\dots,X_{N_n},n-\sum_{i=1}^{N_n} X_i\}$.

158: Then

159: \begin{equation}

160: E[Y_n] = \log_{1/a} n + o(\log n)

161: \end{equation}

162: \end{lemma}

163:

164: \begin{proof}

165:

166: Let $Z_n = \max_{i=1\dots n} X_i$ be the maximum of the first $n$

167: runs. Clearly, $\Pr(Z_n \le k) = \Pr(X_i \le k)^n$, and therefore

168: $(1 - c a^k)^n \le \Pr(Z_n \le k) \le (1 - b a^k)^n$ for all integers

169: $k\ge \log_{1/a}(c)$.

170:

171:

172: \paragraph{Lower bound:}

173: Let $t_n = \log_{1/a} n - \sqrt{\ln n}$.

174: If $Y_n\le t_n$, we need at

175: least $n/t_n$ runs to reach the sum $n$, i.e.

176: $N_n\ge n/t_n-1$ (discounting the last

177: incomplete run). Therefore

178: \begin{equation}

179: \Pr(Y_n\le t_n) \le \Pr(Z_{\frac{n}{t_n}-1} \le t_n)

180: \le (1 - b a^{t_n})^{\frac{n}{t_n}-1}=

181: (1-ba^{t_n})^{a^{-t_n}a^{t_n}(\frac{n}{t_n}-1)}

182: \end{equation}

183:

184: Since $\lim_{n \to \infty} a^{t_n}(n/t_n-1) =

185: \infty$ and $\lim_{x \to 0} (1-b x)^{1/x} =

186: e^{-b}$, we get $\lim_{n\to\infty} \Pr(Y_n\le t_n) = 0$.

187: Note that $E[Y_n] \ge t_n (1-\Pr(Y_n \le t_n))$, and thus we

188: get the desired bound.

189:
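
The limit used above can be checked directly; the following is only a sketch
of the omitted step. From the definition of $t_n$,
\begin{eqnarray*}
a^{t_n} &=& a^{\log_{1/a} n}\cdot a^{-\sqrt{\ln n}}
         = \frac{1}{n}\left(\frac{1}{a}\right)^{\sqrt{\ln n}},\\
a^{t_n}\left(\frac{n}{t_n}-1\right)
        &=& \left(\frac{1}{a}\right)^{\sqrt{\ln n}}
            \left(\frac{1}{t_n}-\frac{1}{n}\right).
\end{eqnarray*}
Here $(1/a)^{\sqrt{\ln n}} = \exp\left(\sqrt{\ln n}\,\ln(1/a)\right)$ grows
faster than any fixed power of $\ln n$, while $t_n\le \log_{1/a} n$, so the
product tends to infinity. Moreover, $\sqrt{\ln n} = o(\log n)$, so
$t_n = \log_{1/a} n - o(\log n)$ and the inequality
$E[Y_n] \ge t_n (1-\Pr(Y_n \le t_n))$ gives
$E[Y_n]\ge \log_{1/a} n - o(\log n)$.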

190: \paragraph{Upper bound:}

191: Clearly, $Y_n\le Z_n$ and so $E[Y_n] \le E[Z_n]$.

192: Let $Z'_n$ be the

193: maximum of $n$ i.i.d. geometric random variables $X'_1, \dots, X'_n$

194: such that $\Pr(X'_i\le k) = 1-a^k$.

195:

196: We will compare

197: $E[Z_n]$ to the expected value of variable $Z'_n$.

198: Without loss of generality, $c\ge 1$.  For any real

199: $x\ge \log_{1/a}(c)+1$ we have:

200: \begin{eqnarray*}

201: \Pr(Z_n\le x)

202: &\ge& (1-c a^{\lfloor x\rfloor})^n \\

203: &=& \left(1-a^{\lfloor x\rfloor -\log_{1/a}(c)}\right)^n\\

204: &\ge& \left(1-a^{\lfloor x -\log_{1/a}(c)-1\rfloor}\right)^n\\

205: &=& \Pr(Z'_n\le x -\log_{1/a}(c)-1)\\

206: &=& \Pr(Z'_n+\log_{1/a}(c)+1 \le x)

207: \end{eqnarray*}

208: This inequality holds even for $x<\log_{1/a}(c)+1$, since the

209: right-hand side is zero in such case.

210: Therefore, $E[Z_n]\le E[Z'_n+\log_{1/a}(c)+1] =E[Z'_n] + O(1)$.

211: The expected value of $Z'_n$ is $\log_{1/a}(n)+o(\log n)$ \cite{Schuster1985},

212: which proves our claim.\qed

213: \end{proof}

214:

215: %% sum_i=k^infty a^k = a^k/(a-1) (to apply distributions)

216: %% need to multiply by two, since this is a distribution

217: %% of 2-steps rather than single steps

218: %% 2*1/ln(1/cos^2(\pi/K)) = 1/ln(1/cos(\pi/K))

219:

220: Using results of Lemma \ref{lem:max} together with the

221: characterization of run length distributions by Lemma

222: \ref{lem:distrib}, we can conclude that for symmetric two-state HMMs,

223: the expected maximum memory required to process

224: a uniform i.i.d. input sequence of length $n$ is

225: $(1/\ln(1/\cos(\pi/K)))\cdot \ln n + o(\log n)$. \footnote{%

226: We omitted the first run, which has a different

227: starting point and thus does not follow the distribution

228: outlined in Lemma \ref{lem:distrib}. However, the expected

229: length of this run does not depend on $n$ and thus contributes only

230: a lower-order term. We also omitted the runs of length one that start

231: outside the interval $(0,K)$; these

232: runs again contribute only to lower order terms of the lower bound.}

233: Using the Taylor

234: expansion of the constant term as $K$ grows to infinity,

235: $1/\ln(1/\cos(\pi/K)) = 2K^2/\pi^2 + O(1)$,

236: we obtain that the maximum memory grows

237: approximately as $(2K^2/\pi^2)\ln n$.

238:
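
The constant can be traced as follows. This is a sketch, and the symbols $L$
(the length of a run) and $x=\pi/K$ are introduced here only for
illustration. By Lemma \ref{lem:distrib}, the probability that
$L\in\{2\ell+1,2\ell+2\}$ is bounded from both sides by constant multiples of
$\cos^{2\ell}(\pi/K)$. Summing the geometric series over $\ell$ shows that
$\Pr(L>k)$ is bounded from both sides by constant multiples of
$\cos^{k}(\pi/K)$, with constants that depend on $K$ but not on $k$. This is
the hypothesis of Lemma \ref{lem:max} with $a=\cos(\pi/K)$, so that
\begin{equation}
\log_{1/a} n = \frac{\ln n}{\ln(1/\cos(\pi/K))}.
\end{equation}
For the expansion, as $x\to 0$,
\begin{eqnarray*}
\ln\frac{1}{\cos x} &=& \frac{x^2}{2}+\frac{x^4}{12}+O(x^6),\\
\frac{1}{\ln(1/\cos x)} &=& \frac{2}{x^2}-\frac{1}{3}+O(x^2)
  = \frac{2K^2}{\pi^2}-\frac{1}{3}+O(K^{-2}).
\end{eqnarray*}
As an illustrative check with $K=6$, the exact constant is
$1/\ln(1/\cos(\pi/6))\approx 6.95$, while the leading term $2K^2/\pi^2$ is
approximately $7.30$.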

239: The asymptotic bound $\Theta(\log n)$ can be easily extended to

240: sequences that are generated by the symmetric HMM, rather than by a uniform

241: i.i.d. process. The underlying process can be described as a random walk with

242: approximately $2K$ states on two $(0,K)$ lines, each line

243: corresponding to sequence symbols generated by one of the two

244: states. The distribution of run lengths still decays geometrically

245: as required by Lemma \ref{lem:max}; the base of the exponent is the

246: largest eigenvalue of the transition matrix

247: with absorbing states omitted (see e.g. \cite[Claim 2]{Buhler2005}).

248:

249: The situation is more complicated in the case of non-symmetric

250: two-state HMMs.

251: Here, our random walks proceed in steps that are arbitrary real

252: numbers, different in each direction. We are not aware of any

253: results that would help us to directly analyze distributions

254: of runs in these models; however, we conjecture that the size of

255: the longest run is still $\Theta(\log n)$. Perhaps, to obtain

256: bounds on the length distribution of runs, one can approximate

257: the behaviour of such non-discrete random walks by a different

258: model (for example, \cite[ch.7]{Durrett1996}).

259:

260:

261: \subsection{Multi-state HMMs}

262:

263: Our analysis technique cannot be easily extended to HMMs with many

264: states. In two-state HMMs, each new coalescence event clears the

265: memory, and thus the execution of the algorithm can be divided

266: into more or less independent runs. A coalescence event in

267: a multi-state HMM results in a non-trivial tree left in memory,

268: sometimes with a substantial depth. Thus, the sizes of

269: consecutive runs are no longer independent

270: (see Figure \ref{fig:max}a).

271:

272: To evaluate the memory requirements of our algorithm for multi-state

273: HMMs, we have implemented the algorithm and performed several experiments

274: on both simulated and biological sequences. First, we generalized

275: the symmetric HMMs from the previous section to multiple states.

276: The symmetric HMM with $m$ states emits symbols over $m$-letter

277: alphabet, where each state emits one symbol with higher probability

278: than the other symbols. All transitions are equiprobable,

279: except for self-transitions. We have tested the algorithm for

280: $m\le 6$ and sequences generated both by a uniform i.i.d. process, and

281: by the HMM itself. Observed data are consistent with the logarithmic

282: growth of average maximum memory needed to decode a sequence of length $n$

283: (data not shown).

284:

285: We have also evaluated the algorithm using a simplified

286: HMM for gene finding with 265 states. The emission probabilities

287: of the states

288: are defined using Markov chains of order at most 4, and the

289: structure of the HMM reflects known properties of genes (similar

290: to the structure shown in \cite{Brejova2007}). The HMM was

291: trained on RefSeq annotations of human chromosomes 1 and 22.

292:

293: In gene finding, we segment the input DNA sequence into exons

294: (protein-coding sequence intervals), introns (non-coding sequence

295: separating exons within a gene), and intergenic regions (sequence

296: separating genes). A common measure of accuracy is exon sensitivity (the

297: fraction of real exons that we have successfully and exactly predicted).

298: The implementation

299: used here has exon sensitivity 37\% on the testing set

300: of genes by Guigo et al. \cite{Guigo2006}. A realistic gene finder,

301: such as ExonHunter \cite{Brejova2005}, trained on the same data set

302: achieves sensitivity of 53\%. This difference is due to additional features

303: that are not implemented in our test, namely GC content levels, non-geometric

304: length distributions, and sophisticated signal models.

305:

306: \iffalse

307: masked sequence results         this           Genscan        ExonHunter

308: Gene Sensitivity                6.76%           15.88%          8.78%

309: Gene Specificity                3.13%           9.81%           12.50%

310: Exon Sensitivity                37.13%          58.84%          52.91%

311: Exon Specificity                29.27%          45.43%          66.47%

312: Nucleotide Sensitivity          71.48%          83.68%          77.55%

313: Nucleotide Specificity          36.62%          59.70%          80.13%

314: \fi

315:

316: We have tested the algorithm on 20~MB long sequences: regions from the human

317: genome, simulated

318: sequences generated by the HMM, and i.i.d. sequences.

319: Regions of the human genome were chosen from hg18 assembly so that

320: they do not contain sequencing gaps. The distribution for the i.i.d. sequences

321: mirrors the distribution of bases in the human chromosome 1.

322:

323: The results are shown in Figure \ref{fig:max}b.

324: The average maximum length of the table over several samples appears to grow

325: faster than logarithmically with the length of the sequence, though

326: it seems to be bounded by a polylogarithmic function. It is not clear whether

327: the faster growth is an artifact that would disappear

328: with longer sequences or a larger number of samples.

329:

330: \begin{figure}[t]

331: \begin{minipage}[b]{0.48\textwidth}

332: \centerline{\bf (a)}

333: \includegraphics[scale=0.55]{figures/zuby.eps}

334: \end{minipage}

335: \hfill

336: \begin{minipage}[b]{0.48\textwidth}

337: \centerline{\bf (b)}

338: \includegraphics[scale=0.55]{figures/max.eps}

339: \end{minipage}

340: \caption{{\bf Memory requirements of a gene finding HMM.} a) Actual

341: length of table used on a segment of human chromosome 1. b) Average maximum

342: table length needed for prefixes of 20~MB sequences.

343: \label{fig:max}}

344: \end{figure}

345:

346: The HMM for gene finding has a special structure,

347: with three copies of the state for introns that have the same emission

348: probabilities and the same self-transition probability.

349: In two-state symmetric HMMs, similar emission probabilities of the two states

350: lead to an increase in the length of individual runs. Intron states of a

351: gene finder are an extreme example of this phenomenon.

352:

353:

354: Nonetheless, on average a table of length roughly 100,000 is sufficient

355: to process sequences of length 20~MB, which is a 200-fold improvement compared

356: to the trivial Viterbi algorithm. In addition, the length of

357: the table did not exceed

358: 222,000 on any of the 20~MB human segments.

359: As we can see in Figure \ref{fig:max}a, most of the time the

360: program keeps only a relatively short table; the average length on the human

361: segments is 11,000. The low average length can be

362: a significant advantage if multiple processes share the same memory.

363: