
1: \documentclass[11pt]{amsart}

2:

3: \usepackage{graphicx}

4: \usepackage{hyperref}

5: \usepackage{url}

6:

7: \input{pstricks}

8: \input{pst-node}

9: \usepackage{pst-tree}

10:

11: % trees and tree nodes:

12: \newcommand{\tree}[2]{\pstree[treemode=U,arrows=->,treefit=tight,treesep=0.5cm,levelsep=1cm]{#1}{#2}}

13: \newcommand{\node}[1]{\Tr{\psframebox[linecolor=white,framearc=.5]{#1}}}

14: %\newcommand{\node}[1]{\Toval{#1}}

15: %%\renewcommand{\root}[1]{\Tr{\psframebox[linecolor=white,framearc=.5]{#1}}}

16: \renewcommand{\root}[1]{\Toval{#1}}

17:

18: % xy pic:

19: %\input xy

20: %\xyoption{all}

21: %\CompileMatrices

22:

23: \DeclareMathOperator{\rank}{rank}

24: \DeclareMathOperator{\Prob}{Prob}

25: \DeclareMathOperator{\diag}{diag}

26:

27: % "A independent of B given C"

28: \newcommand{\ind}{\mbox{$\perp \kern-5.5pt \perp$}}

29: \newcommand{\nind}{\mbox{$\not\hspace{-4pt}\ind$}}

30:

31: \newcommand{\one}{\mathbf 1}

32: \newcommand{\pa}{\mathrm{pa}}  % parent

33: \newcommand{\ch}{\mathrm{ch}}  % child

34: \newcommand{\cT}{\mathcal{T}}  % mutagenetic tree

35: \newcommand{\cM}{\mathcal{M}}  % mixture model

36: \newcommand{\cI}{\mathcal{I}}  % states

37: \newcommand{\cC}{\mathcal{C}}  % compatible states

38: \newcommand{\cS}{\mathcal{S}}  % star

39: \newcommand{\cE}{\mathcal{E}}

40: \newcommand{\cG}{\mathcal{G}}

41: \newcommand{\cB}{\mathcal{B}}

42: \newcommand{\R}{\mathbb{R}}

43:

44: \newcommand{\RP}{\mathcal R}  % risk polynomial

45: \newcommand{\ba}{\mathbf a}

46: \newcommand{\bU}{\mathbf U}

47: \newcommand{\bI}{\mathbf I}

48: \newcommand{\muta}[3]{\rho_{#1,#2}^{#3}}

49: \newcommand{\thet}[1]{\theta^e_{#1_{\pa(e)}, #1_e}}

50:

51: \newtheorem{thm}{Theorem}

52: \newtheorem{lemma}[thm]{Lemma}

53: \newtheorem{prop}[thm]{Proposition}

54: \newtheorem{cor}[thm]{Corollary}

55: \newtheorem{prob}[thm]{Problem}

56: \newtheorem{conj}[thm]{Conjecture}

57: \newtheorem{alg}[thm]{Algorithm}

58:

59: \newtheorem{ex}[thm]{Example}

60: \newtheorem{df}[thm]{Definition}

61:

62: \title{Evolution on distributive lattices}

63:

64: \author[Beerenwinkel, Eriksson, and Sturmfels]{

65: Niko Beerenwinkel$^*$ \and Nicholas Eriksson  \and Bernd Sturmfels\\

66: Department of Mathematics\\

67: University of California\\

68: Berkeley, CA 94720, USA\\

69: $\{$niko,eriksson,bernd$\}$@math.berkeley.edu\\

70: $^*$Corresponding Author:\\

71: phone: +1 (510) 642-3529, fax: +1 (510) 642-8204

72: }

73:

74:

75: %\date{\today}

76:

77: \begin{document}

78:

79: \begin{abstract}

80: We consider the directed evolution of a population after an

81: intervention that has significantly altered the underlying

82: fitness landscape.

83: We model the space of genotypes as a distributive lattice;

84: the fitness landscape is a real-valued function on

85: that lattice. The risk of escape from intervention, i.e., the

86: probability that the population

87: develops an escape mutant before extinction, is

88: encoded in the  risk polynomial.

89: Tools from algebraic combinatorics are applied

90: to compute the risk polynomial in terms of

91: the fitness landscape. In an application to

92:  the development of drug

93: resistance in HIV, we study the

94:  risk of viral escape from

95: treatment with the protease inhibitors ritonavir

96: and indinavir.

97: \end{abstract}

98:

99: \maketitle

100:

101: \begin{quote}

102: \noindent {\bf Keywords:}

103: fitness landscape, distributive lattice, directed evolution,

104: risk polynomial, chain polynomial,

105: HIV drug resistance, Bayesian network, mutagenetic tree

106: \end{quote}

107:

108:

109:

110: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

111: \section{Introduction}

112:

113: The evolutionary fate of a population is determined by the replication

114: dynamics of the ensemble and by the reproductive success of its individuals.

115: We are interested in scenarios where most individuals have a low fitness,

116: eventually leading to extinction, and only a few types of individuals

117: (``escape mutants'')

118: can survive permanently. These situations often arise due to

119: a significant change of the underlying fitness landscape.

120: For example, a virus

121: that has been transmitted to a new host is confronted with a new immune

122: response. Likewise, medical interventions such as radiation therapy,

123: vaccination, or chemotherapy result in altered fitness landscapes for the

124: targeted agents, which may be bacteria, viruses, or cancer cells.

125:

126: Given a population and such a hostile fitness landscape, the central question

127: is whether the population will survive.

128: In the case of medical interventions we wish to know the probability

129: of successful treatment. Answering this question involves computing

130: the risk of evolutionary escape, i.e., the probability that the

131: population develops an escape mutant before extinction.

132: We present a mathematical framework for computing such probabilities.

133:

134: Our primary application is the evolution of drug resistance

135: during treatment of HIV-infected patients \cite{Clavel2004}.

136: We consider therapy with two different protease inhibitors (PIs).

137: These compounds interfere with HIV particle maturation

138: by inhibiting the viral protease enzyme.

139: The effectiveness of PI therapy is limited

140: by the development of drug resistance.

141: Rapid and highly error-prone replication of a large virus

142: population generates mutants that resist the selective pressure of

143: drug therapy. PI resistance is caused by mutations in the protease gene

144: that reduce the binding affinity of the drug to the enzyme.

145: These mutations have been shown to accumulate in a stepwise manner

146: \cite{Berkhout1999}. For most PIs, no single mutation confers

147: a significant level of resistance, but multiple mutations are

148: required for escape from drug pressure.

149: Quantitative predictions of the probability of successful PI treatment

150: would help in finding effective antiretroviral

151: combination therapies. Selecting a drug combination

152: amounts to controlling the viral fitness landscape.

153:

154: We regard the directed evolution of a population towards an escape state

155: as a fluctuation on a fitness landscape. The space of

156: genotypes is modeled as follows. We start with a

157: finite partially ordered set (poset) $\cE$ whose elements are called

158: \emph{events}. The events are non-reversible

159: mutations with some constraints on their order of occurrence.

160: Such constraints are primarily due to

161: epistatic effects between different loci in a genome

162: \cite{Bonhoeffer2000}.

163: The event constraints define the poset structure:

164: $\,e_1 < e_2 \,$ in $\cE$ means that

165: event $e_1$ must occur before event $e_2$ can occur.

166: Each genotype $g$ is represented by a subset of $\cE$, namely,

167: the set of all events that occurred to create $g$.

168: Thus a genotype $g$ is an \emph{order ideal} in the  poset $\cE$.

169: The space of genotypes $\cG$ is the set of

170: all order ideals in $\cE$, which is a {\em distributive lattice}

171: \cite[Sec.~3.4]{Stanley1999}.

172: The order relation on $\cG$ is set inclusion and

173: corresponds to the accumulation of mutations.

174: This mathematical formulation is reasonable in the above situations,

175: where a population is exposed to strong selective pressure.

176:

177: \begin{figure}

178: \includegraphics[width=\textwidth]{landscape}

179: \caption{An event poset, its genotype lattice, and a fitness landscape.}

180: \label{fig:ex1}

181: \end{figure}

182:

183: The risk of escape is governed by the structure of $\cG$,

184: the fitness function on $\cG$, and the population dynamics

185: (such as the mutation rates and population size). Our focus

186: is on the dependency of the risk of escape

187: on the assigned fitness values for each genotype $g \in \cG$.

188: This leads us to the \emph{risk polynomial},

189: which is shown to be equivalent to a well-known object in

190: algebraic combinatorics. Indeed, one of the objectives of this

191: work is to provide a bridge between algebraic combinatorics

192: and evolutionary biology.

193:

194:

195: This paper  is organized as follows. In Section~\ref{sec:fitness}

196: we formalize our

197: model of a static fitness landscape on the genotype lattice $\cG$

198: derived from an event poset $\cE$,

199: and we discuss evolution on the lattice $\cG$.

200: In Section~\ref{sec:branching} we review the multistate

201: branching process studied by Iwasa, Michor and Nowak

202: \cite{Iwasa2003,Iwasa2004}.

203:

204: In Section~\ref{sec:bayes} we study the Bayesian networks

205: which arise from identifying the events in $\cE$

206: with binary random variables. These

207: statistical models can be used

208: to infer the genotype space from

209: given data. For conjunctive Bayesian networks

210: we recover the distributive lattice of order ideals in $\cE$.

211: Of particular interest is

212: the case where $\cE$ is a directed forest: here the Bayesian network

213: is a mutagenetic tree model \cite{Beerenwinkel2005c,Beerenwinkel2005f}.

214: The application of our methods

215: to the development of PI resistance in HIV

216: is presented in  Section~\ref{sec:apply}.

217:

218: The Appendix summarizes various representations

219: of the risk polynomial in terms of structures from

220: algebraic combinatorics. Efficient methods for computing

221: the risk polynomial and their implementation are presented.

222:

223: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

224:

225: \section{Fitness landscapes on distributive lattices}   \label{sec:fitness}

226:

227: A partially ordered set (or poset) is a set $\cE$

228: together with a binary relation, denoted ``$\leq$'', which is

229: reflexive, antisymmetric, and transitive. Here

230: we fix a finite poset $\cE$  whose elements are called \emph{events}.

231: If the number of events is $n$ then we

232: often identify the set underlying $\cE$ with

233: the set $\,[n] = \{1,2,\ldots,n\}$. In this way,

234: the subsets of $\cE$ are encoded by the $2^n$ binary strings of length $n$.

235: The empty subset of $\cE$ is

236: encoded by the all-zero string $\hat{0} = 0 0 \cdots 0$

237: which represents the \emph{wild type}, and

238: the full set $\cE$ is

239: encoded by the  all-one string $\hat{1} = 11  \cdots 1$

240: which represents the \emph{escape state}.

241:

242: An order ideal $g$ in a poset $\cE$ is a subset of $\cE$

243: that is closed downward;

244: that is, if $e_2 \in g$ and $e_1 \le e_2$, then $e_1 \in g$.

245: The set of all order ideals of $\cE$ forms a distributive lattice

246: $J(\cE)$ under inclusion. Birkhoff's Representation Theorem

247: \cite[Thm.~3.4.1]{Stanley1999}

248: states that every finite distributive lattice has the form

249: $J(\cE)$ for some finite poset $\cE$.

250: We write $\cG = J(\cE)$, and we

251: call $\cG$ the {\em genotype lattice}.

252:

253: \begin{ex} \rm

254: Let $\cE$ be the trivial poset,

255: where no two events are comparable,

256: with $|\cE| = n$.

257: Then $\cG = J(\cE)$ is the Boolean lattice consisting

258: of all subsets of $\cE$ ordered by inclusion.

259: This means that every combination of mutations

260: can arise, and the mutations can occur in any order. Each of

261: the $2^n$ binary strings $g \in \{0,1\}^n$ represents

262: a mutational pattern, or genotype.

263: \end{ex}

264:

265: In general, the event poset $\cE$ does have non-trivial

266: relations $e_1 < e_2$. The relation $e_1 < e_2$ excludes all

267: genotypes $g$ with $g_{e_1} = 0$ and

268: $g_{e_2} = 1$ from $\cG$. The remaining genotypes

269: $g$ form a sublattice of the Boolean lattice $\{0,1\}^n$,

270: and this is precisely our distributive lattice

271: $\cG = J(\cE)$. Note that the

272: lattice $\cG$ is ranked, with the rank function given by

273: $\rank(g) = |g|$.

274:

275: \begin{ex} \rm \label{FourEvents}

276: Consider a scenario with $n=4$ mutation events, labeled $\cE = \{1,2,3,4\}$.

277: Suppose that event $3$ can only  occur after

278: events $1$ and $2$,

279: and event $4$ can only occur after event $2$.

280: This allows for precisely eight genotypes

281: \[

282: \cG \,\, = \,\, \bigl\{

283: 0000, 1000, 0100, 1100, 0101, 1110, 1101, 1111  \bigr\}.

284: \]

285: The event poset $\cE$ and the genotype lattice $\cG$ are

286: shown in Figure~\ref{fig:ex1}.

287: \end{ex}
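
As an illustrative check of Example~\ref{FourEvents} (a sketch added for the reader, using only the relations $1 < 3$, $2 < 3$ and $2 < 4$ stated there), one can filter the $2^4 = 16$ binary strings $g = g_1 g_2 g_3 g_4$ by the downward-closure conditions
\[
g_3 \le g_1, \qquad g_3 \le g_2, \qquad g_4 \le g_2 .
\]
If $g_2 = 0$, then $g_3 = g_4 = 0$, which leaves $0000$ and $1000$.
If $g_2 = 1$ and $g_1 = 0$, then $g_3 = 0$ and $g_4$ is free, which gives $0100$ and $0101$.
If $g_1 = g_2 = 1$, then $g_3$ and $g_4$ are both free, which gives $1100$, $1110$, $1101$ and $1111$.
This recovers the $2 + 2 + 4 = 8$ genotypes listed above, with $1, 2, 2, 2, 1$ genotypes of rank $0, 1, 2, 3, 4$, respectively.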

288:

289: A fitness landscape associates to each possible genotype

290: a number which quantifies the reproductive capacity of

291: an individual with that genotype \cite{Reidys2002}. We define a

292: \emph{fitness landscape} on the distributive lattice $\cG$

293:  to be any function ${\mathbf f} \colon \cG \to \mathbb{R}$.

294: The value  ${\mathbf f}(g)$ at any $g \in \cG$

295: is the  \emph{fitness} of the genotype $g$.

296: Thus, the space of all fitness landscapes is the finite-dimensional

297: vector space $\mathbb{R}^\cG$.

298:

299: We shall consider certain special models of fitness landscapes,

300: which are represented by linear subspaces of $\mathbb{R}^\cG$.

301: In the following definitions, a genotype $g$ is regarded

302: as a subset of the event poset $\cE$, where $|\cE| = n$.

303: A \emph{constant fitness landscape} has the

304: form ${\mathbf f}(g) \equiv a$ for some constant $a$.

305: Thus the constant landscapes form a

306: line through the origin in $\mathbb{R}^\cG$.

307: A \emph{graded fitness landscape} is a landscape on

308: $\cG$ whose fitness values depend only on the rank. Equivalently, we have

309: ${\mathbf f}(g) = a_{|g|}$ for

310: constants $a_0,a_1,\ldots,a_n$. Thus, graded fitness landscapes

311: form an $(n+1)$-dimensional linear subspace of $\mathbb{R}^\cG$.
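
To illustrate these definitions on the lattice of Example~\ref{FourEvents} (an added sketch; the constants $a_i$ are those of the definition above), note that $\mathbb{R}^\cG$ has dimension $|\cG| = 8$, whereas a graded fitness landscape is determined by the $n+1 = 5$ constants $a_0,\ldots,a_4$:
\begin{align*}
{\mathbf f}(0000) &= a_0, &
{\mathbf f}(1000) &= {\mathbf f}(0100) = a_1, &
{\mathbf f}(1100) &= {\mathbf f}(0101) = a_2, \\
{\mathbf f}(1110) &= {\mathbf f}(1101) = a_3, &
{\mathbf f}(1111) &= a_4 .
\end{align*}
Setting $a_0 = a_1 = \cdots = a_4 = a$ gives the constant landscape ${\mathbf f} \equiv a$.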

312:

313: Our biological application in Section~\ref{sec:apply} uses

314: the graded fitness landscape model, which means that the

315: fitness of a virus type depends only on the number of mutations it

316: harbors. We shall

317: model situations where a virus escapes from a wild

318: type $\hat{0}$ to a drug-resistant type $\hat{1}$.  In this case, we

319: assume a graded fitness landscape that is

320: monotonically increasing with rank, i.e.,

321: \[

322:    a_0 \,<\, a_1 \,< \,a_2 \,<\, \cdots \,<\, a_n.

323: \]

324:   This implies that the fitness landscape ${\mathbf f}$ has a unique

325: local (and global) maximum at the drug-resistant type $\hat{1}$,

326: which is the top element in $\cG$.

327:

328: We next introduce the mathematical framework

329: for evolution on a fitness landscape. The general

330: setup is as in the work of Reidys and Stadler

331: \cite{Reidys2002}, but this is adapted here to our specific

332: situation, where the genotypes form a

333:  distributive lattice $\cG$. The order relation on $\cG$,

334:  which comes from inclusion of subsets of $\cE$,  induces a

335: neighborhood structure on $\cG$ where the neighbors

336: of $g \in \cG$ are the genotypes that strictly contain $g$,

337: \begin{equation}   \label{eq:neighborhood}

338:    N(g) \, := \, \bigl\{ h \in \cG \,\mid \, g \subset h \bigr\}.

339: \end{equation}

340: Unlike the typical situation considered in \cite{Reidys2002},

341: this notion of neighborhood is not symmetric. To be precise,

342: we have that $h \in N(g)$ implies $g \not\in N(h)$.

343:

344: This neighborhood structure implies that mutational

345: changes are possible only upward in the genotype lattice.

346: This structure models a directed evolutionary

347: process from the wild type $\hat{0}$ towards the escape state

348: $\hat{1}$. Typically, our configuration space $\cG$ is a small subset

349: of the Boolean lattice $\{0,1\}^n$ of all binary strings.

350: Indeed, in the course of viral evolution,

351:  a population will visit only a small fraction of $\{0,1\}^n$,

352:  as most mutants are not viable.

353:

354:

355: Suppose that the number of genotypes in $\cG$ is $m$.

356:  We wish to define dynamics between the states of $\cG$.

357:  To this end, we fix a linear extension of $\cG$, and we

358:   introduce an

359:  $m \times m$ matrix of transition rates, written

360:  ${\bf U} = (u_{gh})$, whose rows and columns

361:  are indexed by genotypes $g,h \in \cG$.

362: Each entry $u_{gh}$ of the matrix ${\bf U}$ is a non-negative

363: real number which is zero unless $h \in N(g)$.

364: In the framework of algebraic combinatorics, it

365: is convenient to think of the matrix ${\bf U}$ as an element in the

366: incidence algebra of $\cG$;

367: see \cite[Sec.~3.6]{Stanley1999}.

368:

369: We further assume that the non-zero mutation rates

370: $u_{gh}$ depend only on the events in $h \backslash g$.

371: Equivalently, the rate at which a collection of mutation events

372: occurs is independent of which other mutations have

373: already occurred. With this assumption, there are only $n$ free

374: parameters $\mu_1,\ldots,\mu_n$ in the matrix ${\bf U}$,

375: where $\mu_e$ is the mutation rate of event $e$.

376: Then

377: \begin{equation} \label{eq:muta}

378: u_{gh} \,\,=\,\, \begin{cases}

379:    \,\,\,  \prod_{e \in h\backslash g} \mu_e & \text{if $g \subset h$}\\

380:    \,\,\,  0                               & \text{otherwise}.

381:    \end{cases}

382: \end{equation}

383: In particular, if all rates are the same, say $\mu = \mu_1 = \dots = \mu_n$, then

384: the entries of $\bU$ are $\, u_{gh} \,=\,  \mu^{|h \backslash g|}\,$

385: if $g \subset h$ and $\,u_{gh} = 0\,$ otherwise.

386:

387: \begin{ex} \rm \label{FourEvents2}

388: For the genotype lattice $\cG$ in

389: Figure~\ref{fig:ex1}, the matrix $\bU$ equals

390: \[

391: \bordermatrix{ & \! 0000 \! &

392: \! 1000 \! & \! 0100 \! & \! 1100 \! & \! 0101 \! & 1110 & 1101 & 1111 \cr

393: 0000 &           0  & \mu_1 & \mu_2 & \mu_1 \mu_2 & \mu_2 \mu_4 &

394: \! \mu_1 \mu_2 \mu_3 \! & \! \mu_1 \mu_2 \mu_4 \!&

395: \! \mu_1 \mu_2 \mu_3 \mu_4 \! \cr

396: 1000 & 0 & 0 & 0 &\mu_2 & 0 & \mu_2 \mu_3 & \mu_2 \mu_4 & \mu_2 \mu_3 \mu_4 \cr

397: 0100 & 0 & 0 & 0 &\mu_1 & \mu_4 & \mu_1 \mu_3 & \mu_1 \mu_4 & \mu_1 \mu_3 \mu_4

398:  \cr

399: 1100 &        0    &  0   &  0   &  0   &  0   & \mu_3  & \mu_4 & \mu_3 \mu_4

400: \cr

401: 0101 &        0    &  0   &  0   &  0   &  0   &  0  &  \mu_1 & \mu_1 \mu_3 \cr

402: 1110 &        0    &  0   &  0   &  0   &  0   &  0  &  0   &  \mu_4   \cr

403: 1101 &        0    &  0   &  0   &  0   &  0   &  0  &  0   &  \mu_3   \cr

404: 1111 &        0    &  0   &  0   &  0   &  0   &  0  &  0   &  0   \cr}

405: \]

406: Note that the entry in row $g$ and column $h$ of

407: any power $\bU^k$ equals $u_{gh}$ times the number

408: of chains $g = g_0 \subset g_1 \subset \cdots \subset g_k = h$ of length $k$ in $\cG$. In particular,

409: $\,\bU^5 = 0 $, since no chain in $\cG$ has more than four steps.

410: \end{ex}
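
The counting remark at the end of Example~\ref{FourEvents2} can be checked by hand on two entries (an added illustration, computed directly from the matrix above).
For $g = 0000$ and $h = 1100$ the only genotypes strictly between $g$ and $h$ are $1000$ and $0100$, so
\[
(\bU^2)_{0000,1100} \,\,=\,\, u_{0000,1000}\, u_{1000,1100} \,+\, u_{0000,0100}\, u_{0100,1100}
\,\,=\,\, \mu_1 \mu_2 + \mu_2 \mu_1 \,\,=\,\, 2\, u_{0000,1100} .
\]
Likewise, every four-step sequence from $0000$ to $1111$ adds exactly one event per step.
The admissible orders of events are $1,2,3,4$; $1,2,4,3$; $2,1,3,4$; $2,1,4,3$; and $2,4,1,3$, so
\[
(\bU^4)_{0000,1111} \,\,=\,\, 5\, \mu_1 \mu_2 \mu_3 \mu_4 \,\,=\,\, 5\, u_{0000,1111} .
\]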

411:

412: Let ${\mathbf f}$ be a fitness landscape on $\cG$ and  $\,{\mathbf F} \,=\,

413: \diag\bigl({\bf f}(g) \mid g \in \cG \bigr)\,$  the $m \times m$ diagonal

414: matrix whose entries are the fitness values.

415: The entry of the matrix product ${\bU} {\mathbf F}$ in row $g$ and column $h$

416: represents the  probability of genotype $g$ transitioning

417: into genotype $h$ in one step.

418: A precise probabilistic derivation and interpretation

419: will be given in the next section.

420:

421: We are interested

422: in \emph{all} mutational pathways that lead from the wild type

423: $\hat{0}$ to the escape state $\hat{1}$.

424: Towards this end, note that the entry $(g,h)$ of the matrix

425: $({\bU}{\mathbf F})^k$ represents the probability of

426: genotype $g$ evolving to genotype $h$

427: along any mutational pathway (chain) of length $k$ in

428: the genotype lattice $\cG$.

429: The chains from $\hat{0}$ to $\hat{1}$ in $\cG$

430: are accounted for by the

431: upper right hand entry of $({\bU}{\mathbf F})^k$.

432: Note that the matrix $\,({\bU}{\mathbf F})^k\,$ is zero for $k > n$.

433:

434: To account for chains of arbitrary length, we consider the matrix

435: \begin{equation}

436: \label{GeometricSeries}

437: (\bI - {\bU} {\mathbf F})^{-1} - \bI \,\,\, = \,\,\,

438:   {\bU}{\mathbf F}

439:  +  ({\bU}{\mathbf F})^2

440:  +  ({\bU}{\mathbf F})^3

441:  + \cdots  +   ({\bU}{\mathbf F})^n,

442: \end{equation}

443: where $\bI$ is the $m \times m$ identity matrix.

444: We summarize our discussion in the following proposition,

445: which is proved by elementary matrix algebra.

446:

447: \begin{prop}

448: \label{ZeroUnless}

449: The entry of the matrix (\ref{GeometricSeries})

450: in row $g$ and column $h$ is zero unless

451: $g \subset h$, in which case it is

452: $\,u_{gh} \cdot {\mathbf f}(h) \cdot P_{gh}({\mathbf f}) \,$

453: where $P_{gh}$ is a polynomial function of degree

454: $|h \backslash g|-1$ on the space

455: of all fitness landscapes $\,\mathbb{R}^\cG $.

456: \end{prop}

457:

458: The polynomial  $\,P_{gh}({\mathbf f}) \,$  is the

459: generating function for all chains from $g$ to $h$ in $\cG$.

460: This will be made precise in the following corollary.

461: We shall restrict ourselves to the most important case

462: when  $ g = \hat{0}$ is the wild type

463: and $h = \hat{1}$ is the escape state.

464: Studying $\,P_{\hat{0} \hat{1}}({\mathbf f})\,$ only

465: is no loss of generality because any

466: interval of a distributive lattice

467: is again a distributive lattice.

468:

469: Proposition~\ref{ZeroUnless} tells us

470: that $\,P_{\hat{0} \hat{1}}({\mathbf f})  \,$

471: is a polynomial of   degree $n-1$

472: in the unknown fitness values ${\bf f}(g)$,

473: which are also written as $f_g$, where $g \in \cG$.

474:

475:

476: \begin{cor} \label{AllThoseChains}

477: The polynomial $\,P_{\hat{0} \hat{1}}({\mathbf f})  \,$

478: in the upper-right entry of

479: (\ref{GeometricSeries}) equals

480: \begin{equation}

481: \label{RISK}

482: P_{\hat{0} \hat{1}}({\mathbf f})

483: \quad  = \sum_{\hat{0}=g_0 \subset g_1 \subset \dots \subset g_k = \hat{1}}

484: \!\!\!\!\!\!  f_{g_1} f_{g_2} \cdots f_{g_{k-1}},

485: \end{equation}

486: where the sum runs over all chains

487: from $\hat{0}$ to $\hat{1}$ in

488: the genotype lattice $\cG$.

489: \end{cor}
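
As a minimal sanity check of Proposition~\ref{ZeroUnless} and Corollary~\ref{AllThoseChains} (an added example, assuming the trivial poset on $n = 2$ events, so that $\cG = \{00, 10, 01, 11\}$), the upper-right entries of the first two powers of $\bU{\mathbf F}$ are
\begin{align*}
(\bU{\mathbf F})_{00,11} &\,=\, \mu_1 \mu_2 \, f_{11}, \\
\bigl((\bU{\mathbf F})^2\bigr)_{00,11} &\,=\, \mu_1 f_{10} \cdot \mu_2 f_{11} \,+\, \mu_2 f_{01} \cdot \mu_1 f_{11}
 \,=\, \mu_1 \mu_2 \, f_{11} \, (f_{10} + f_{01}),
\end{align*}
and the corresponding entries of all higher powers vanish.
Summing gives $\,u_{00,11} \cdot f_{11} \cdot (1 + f_{10} + f_{01})$, so that
$P_{\hat{0}\hat{1}}({\mathbf f}) = 1 + f_{10} + f_{01}$.
This polynomial has degree $n - 1 = 1$ and one term for each of the three chains
$00 \subset 11$, $\,00 \subset 10 \subset 11$, and $\,00 \subset 01 \subset 11$.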

490:

491:

492:

493: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

494:

495: \section{The risk of escape}   \label{sec:branching}

496:

497: For a poset of events $\cE$ and

498: the corresponding distributive lattice $\cG = J(\cE)$,

499: the \emph{risk polynomial} of $\cG$ is defined as the

500: polynomial (\ref{RISK}), which we denote by $\,\RP(\cG;{\mathbf f})$.

501: The risk polynomial was introduced

502:   in \cite{Iwasa2003,Iwasa2004}.

503: In this section we review the evolutionary dynamics model

504: proposed in these papers, and we

505: discuss the probabilistic meaning

506: of the risk polynomial.

507:

508: \begin{ex} \label{ex:rp} \rm

509: Let $\cG$ be the genotype lattice in Figure~\ref{fig:ex1}.

510: Then the risk polynomial

511: $\,\RP(\cG;{\mathbf f})\,$ is the following

512: polynomial of degree three in six unknowns:

513: \begin{eqnarray*}

514: & 1 + f_{1000} + f_{0100} + f_{1100} + f_{0101} + f_{1110} + f_{1101} \\

515: &  + f_{1000} f_{1100} + f_{0100} f_{1100} +

516:     f_{0100} f_{0101}  + f_{1000} f_{1110} + f_{0100} f_{1110}

517:  \\ &   + f_{1000} f_{1101} + f_{0100} f_{1101} +  f_{1100} f_{1110}

518:   + f_{1100} f_{1101} + f_{0101} f_{1101} \\

519: &   + f_{1000} f_{1100}  f_{1110} + f_{0100} f_{1100} f_{1110}

520:  + f_{1000} f_{1100} f_{1101}  \\ &

521:      +  f_{0100} f_{1100} f_{1101} + f_{0100} f_{0101} f_{1101}.

522: \end{eqnarray*}

523: \end{ex}

524:

525: If we restrict the fitness landscape

526:  ${\mathbf f}$ to lie in a linear subspace of $\mathbb{R}^\cG$,

527: then $\,\RP(\cG;{\mathbf f})$ specializes to a polynomial in fewer unknowns.

528: For example, the risk polynomial for graded fitness landscapes

529: is obtained from the specialization

530: ${\mathbf f}(g) = a_{|g|}$. That risk polynomial has

531: degree $n-1$ and is denoted by $\RP(\cG; a_1,\ldots,a_{n-1})$.

532: For instance, $\,\RP(\cG;{\mathbf f})\,$ in

533: Example \ref{ex:rp} specializes to

534: \[

535:    \RP(\cG; a_1,a_2,a_3) =

536:        1 + 2 a_1 + 2 a_2 + 2 a_3

537:      + 3 a_1 a_2 + 4 a_1 a_3

538:      + 3 a_2 a_3 + 5 a_1 a_2 a_3.

539: \]

540: For constant fitness landscapes

541: $\, {\mathbf f} \equiv a \,$, the risk polynomial is a polynomial in one unknown $\,a$.

542: It is denoted $\RP(\cG; a)$. In our running example,

543: \[

544:    \RP(\cG; a) \, = \, 1 + 6 a + 10 a^2 + 5 a^3.

545: \]
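
These specializations can be read off by counting (an added check based on Example~\ref{ex:rp}).
The coefficient of $a_1 a_3$ is $4$ because exactly four of the quadratic monomials pair a genotype of rank one with a genotype of rank three, namely
$f_{1000} f_{1110}$, $f_{0100} f_{1110}$, $f_{1000} f_{1101}$ and $f_{0100} f_{1101}$.
Setting $a_1 = a_2 = a_3 = a$ collapses the graded polynomial to the constant one,
\[
  1 + (2 + 2 + 2)\, a + (3 + 4 + 3)\, a^2 + 5\, a^3 \,\, = \,\, 1 + 6 a + 10 a^2 + 5 a^3 ,
\]
which, at the purely illustrative value $a = 1/10$, evaluates to $1 + 0.6 + 0.1 + 0.005 = 1.705$.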

546:

547: We now make precise the notion of {\em risk of escape}, which will

548: justify our definition of the risk polynomial.

549: Our derivation is based on the model

550: for the dynamics of a replicating population

551: on a fitness landscape studied by

552: Iwasa, Michor and Nowak  \cite{Iwasa2003,Iwasa2004}.

553: See also the work of Wilke \cite{Wilke2003}

554: and the references given therein for approaches

555: to computing fixation probabilities.

556:

557:

558: A {\em multistate branching process} \cite{Athreya1972} consists

559: of a set of genotypes along with a fitness landscape and mutation

560: rates between genotypes.  We assume a discrete time process, where

561: in one generation an individual with genotype $g$

562: has a random number of offspring following a Poisson distribution

563: with mean $R_g$.  Some of these offspring may be mutants according to

564: the mutation rates $u_{gh}$.

565: The parameter $R_g$ is the {\em basic

566: reproductive ratio} \cite[Chap.~3]{Nowak2000}.

567:

568: We assume there is no interaction between individuals; each reproduces

569: at a rate independent of the distribution of the population.

570: Let $\muta{g}{h}{k}$ be the probability

571: that one individual of genotype $g$ has $k$ children of type $h$.  Then, \begin{equation}\label{eq:1}

572: \muta{g}{h}{k}  \, = \,

573: \frac {(u_{gh}R_g)^k \cdot e^{-u_{gh}R_g}} {k!}.

574: \end{equation}

575: The {\em reproductive fitness} $f_g$ is related to

576: the reproductive ratio $R_g$ by

577: \begin{equation}

578: \label{Randf}

579:  f_g \, = \,  \frac {R_g} {1-R_g}

580: \qquad \hbox{and} \qquad

581: R_g \, = \, \frac{f_g}{1+f_g}.

582: \end{equation}
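
For orientation (illustrative values only, not taken from data), the correspondence (\ref{Randf}) gives
\[
 R_g = \tfrac{1}{2} \iff f_g = 1,
 \qquad
 R_g = \tfrac{1}{10} \iff f_g = \tfrac{1}{9}.
\]
More generally, for $0 \le R_g < 1$ the geometric series yields
\[
 f_g \,\, = \,\, \frac{R_g}{1 - R_g} \,\, = \,\, R_g + R_g^2 + R_g^3 + \cdots ,
\]
so $f_g \approx R_g$ when $R_g$ is small, and for a nonnegative ratio $R_g$ the fitness $f_g$ is positive and finite exactly when $0 < R_g < 1$.
One way to read the series, offered here as an interpretation rather than a statement from the cited papers, is as the expected total number of descendants of a single individual when each generation has mean offspring number $R_g$.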

583:

584: Let $\xi_g$ be the probability of escape  starting with one individual of

585: genotype $g$, so $1 - \xi_g$ is the probability of extinction.

586: In particular, $\xi_{\hat{1}}$ is the probability that one resistant

587: virus will not become extinct.

588: Each of these probabilities is a function

589: of the mutation rates $u_{gh}$ and the reproductive ratios $R_g$.

590: We assume that the $u_{gh}$ are as in

591: (\ref{eq:muta}), but with $u_{gg} = 1$.

592: Thus, each escape probability $\xi_g$  can be expressed

593: as a function of the $\mu_e$

594: for $e \in \cE$ and  (using the relation (\ref{Randf}))

595:  the fitness values $f_g$ for $g \in \cG$.

596:

597: \begin{thm} \label{thm:1}

598: If $\xi_g \ll 1$ for $g \neq \hat{1}$, then

599: the probability of escape on

600: the fitness landscape $\mathbf{f} \in \mathbb{R}^{\cG}$ starting with one

601: individual of wild type $\hat{0}$, satisfies

602: \begin{equation} \label{eq:2}

603:   \xi_{\hat{0}} \quad  \approx  \quad \xi_{\hat{1}} \cdot f_{\hat{0}} \cdot

604:     \prod_{e \in \cE} {\mu_e} \cdot \RP(\cG;\mathbf{f}).

605: \end{equation}

606: \end{thm}

607:

608: \begin{proof}

609: The probability of extinction

610: satisfies the recursive formula

611: \begin{equation}

612: \label{michorproof}

613:    1-\xi_g \quad = \quad  \prod_{h \supseteq g} \sum_{k=0}^{\infty}

614: (1-\xi_h)^k \cdot

615:        \muta{g}{h}{k} .

616:   \end{equation}

617: Using (\ref{eq:1}), the right hand side

618: of (\ref{michorproof}) can be rewritten as follows:

619: \begin{equation*}

620: \label{michorproof2}

621:       \prod_{h \supseteq g}

622:        {\rm exp}({(1-\xi_h)u_{gh}R_g} ) \cdot {\rm exp} ({-u_{gh}R_g})

623:      \quad = \quad \exp\left(\sum_{h \supseteq g} -\xi_h u_{gh} R_g\right).

624: \end{equation*}
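The rewriting rests on the probability generating function of the Poisson distribution (a standard identity, spelled out here for convenience; the symbols $s$ and $\lambda$ are auxiliary): for any real $s$ and any $\lambda \geq 0$,
\[
 \sum_{k=0}^{\infty} s^k \cdot \frac{\lambda^k e^{-\lambda}}{k!}
 \,\, = \,\, e^{-\lambda} \cdot e^{s \lambda} \,\, = \,\, e^{(s-1)\lambda} .
\]
It is applied with $s = 1 - \xi_h$ and $\lambda = u_{gh} R_g$, so that each factor of the product equals $\exp(-\xi_h u_{gh} R_g)$.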

625: We conclude that

626: \[

627:    \log(1-\xi_g) \quad = \quad - \sum_{h\supseteq g} \xi_h u_{gh} R_g \quad

628: \qquad \hbox{for all} \,\, g \in \cG.

629: \]

630: Under the assumption that $\xi_g \ll 1$ for $g \neq \hat{1}$, we can

631: linearize the logarithms using

632: the relation $\,\log(1-\xi_g) \approx -\xi_g$. This implies,

633: for $\, g \in \cG \backslash \{\hat{1}\}$,

634: \begin{eqnarray*}

635:    \xi_g \quad  \approx  & R_g \cdot \sum_{h \supseteq g} \xi_h u_{gh} \\

636:   \quad    = & \frac {R_g} {1-R_g u_{gg}} \cdot \sum_{h \supset g} \xi_h u_{gh} \\

637:      = & f_g \cdot \sum_{h \supset g} \xi_h u_{gh}.

638: \end{eqnarray*}

639:

640: The theorem now

641: follows by setting $g = \hat{0}$ and expanding the last equation recursively.

642: Here we are using the fact  from (\ref{eq:muta}) that the

643: product of the $u_{gh}$ over any

644: chain from $\hat{0}$ to $\hat{1}$ in $\cG$

645: equals $\,\prod_{e \in \cE} \mu_e$.

646: \end{proof}
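
To see the recursive expansion at work in the smallest nontrivial case (an added illustration, assuming the chain poset $1 < 2$, so that $\cG = \{00, 10, 11\}$), apply the last equation of the proof twice:
\begin{align*}
 \xi_{10} &\,\approx\, f_{10} \cdot \mu_2 \, \xi_{11}, \\
 \xi_{00} &\,\approx\, f_{00} \cdot \bigl( \mu_1 \, \xi_{10} + \mu_1 \mu_2 \, \xi_{11} \bigr)
   \,\approx\, \xi_{11} \cdot f_{00} \cdot \mu_1 \mu_2 \cdot (1 + f_{10}) .
\end{align*}
Here $1 + f_{10} = \RP(\cG;{\mathbf f})$: the summand $1$ comes from the chain $00 \subset 11$ and the summand $f_{10}$ from the chain $00 \subset 10 \subset 11$, in agreement with (\ref{eq:2}).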

647:

648: The typical situation of interest is a fitness landscape for which

649: only the escape state has a basic reproductive ratio greater than one,

650: i.e.,

651: \[

652:    R_{\hat{1}} > 1 \qquad \mbox{and} \qquad

653:    R_g < 1 \quad \mbox{for all} \quad g \not= \hat{1}.

654: \]

655: When the positive numbers $R_g$ are very small for

656:  $g  \in \cG \backslash \{\hat{1}\}$, then the approximation

657: (\ref{eq:2}) is valid, and

658: it shows the crucial role that the risk polynomial

659: $\RP(\cG;\mathbf{f})$ plays in

660: assessing the risk of escape from the wild type $\hat{0}$

661: to the escape state $\hat{1}$.

662: The theorem implies that the risk of escape

663: of a population of $N$ wild type viruses

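664: is $1-(1-\xi_{\hat{0}})^N$, because the $N$ lineages evolve independently and all of them go extinct with probability $(1-\xi_{\hat{0}})^N$. In Section~\ref{sec:discussion} we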

665: discuss the situation in which the population is not homogeneous

666: at the time of intervention.

667:

668: \smallskip

669:

670: The risk of escape is an important quantity in analyzing the

671: invasiveness of pathogens and in assessing

672: the success probability of medical interventions such as

673: chemotherapy. However, putting this concept into practice

674: depends on our ability to actually compute the risk polynomial.

675: It turns out that methods from algebraic combinatorics lead

676: to efficient algorithms for this task.

677: In the Appendix, several methods are presented in detail.

678:

679: \begin{figure}

680: \centering

681: \includegraphics[width=.6\textwidth]{fence}

682: %\begin{verbatim}

683: %                      7 8 9 101112

684: %                      |/|/|/|/|/|

685: %                      1 2 3 4 5 6

686: %\end{verbatim}

687: \caption{Example of an event poset whose general risk polynomial

688: is of degree 11 in 375 unknowns.}

689: \label{fig:poset}

690: \end{figure}

691:

692: Our method of choice from a practical perspective

693: relies on computing linear extensions

694: of the event poset $\cE$ (Theorem~\ref{thm:linearExtensions}, Appendix).

695: Our software implementation is available at

696: \url{http://bio.math.berkeley.edu/riskpoly/} .

697: For an example of the efficiency of the software,

698: let $\cE$ be the poset in Figure~\ref{fig:poset}

699: on $n=12$ events with cover relations

700: $i < 6 + i$ for $1 \leq i \leq 6$ and $i < 7 + i$ for $1 \leq i \leq 5$.

701: Here the genotype lattice $\cG$ consists of $377$ genotypes.

702: The risk polynomial $\RP(\cG; {\mathbf f})$ is a polynomial

703: of degree 11 in 375 unknowns $f_g$, one for each genotype $g \neq \hat{0}, \hat{1}$.

704: This polynomial has 224,750,298 monomials in the 375

705: unknowns, but we represent it as a sum of

706: 2,702,765 products, one for each

707: linear extension of the event poset $\cE$.

708: Our software takes about ten seconds to compute

709: this representation of $\RP(\cG; {\mathbf f})$.

710: The result takes up 200MB of disk space.

711:

712:  The univariate risk polynomial for this example is

713: \begin{multline*}

714: 1 + 375a + 19088a^2 + 324498 a^3 + 2610169 a^4 + 11729394 a^5 +

715: 32080336 a^6 +\\ 55597909 a^7 + 61448965 a^8 + 42020208 a^9 + 16216590

716: a^{10} + 2702765a^{11}.

717: \end{multline*}

718: Thus, exact symbolic computations, as opposed to numerical approximations,

719: may be necessary and feasible when one is interested in

720: assessing the risk of escape in applications like

721: the one described in Section \ref{sec:apply} below.
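
As a consistency check on the numbers above, we sketch how the extreme
coefficients of the univariate polynomial can be recovered by hand. The
check assumes that the coefficient of $a^k$ counts the $k$-element chains
in $\cG \backslash \{\hat{0},\hat{1}\}$, and the symbol $I_m$ below is
notation introduced only for this remark. The poset in
Figure~\ref{fig:poset} is the zigzag
\[
   7 > 1 < 8 > 2 < 9 > 3 < 10 > 4 < 11 > 5 < 12 > 6 ,
\]
so its order ideals correspond to its antichains, which are the independent
sets of a path on $12$ vertices. If $I_m$ denotes the number of independent
sets of a path on $m$ vertices, then
\[
   I_0 = 1, \qquad I_1 = 2, \qquad I_m = I_{m-1} + I_{m-2}
   \qquad \hbox{and hence} \qquad I_{12} = 377 .
\]
Removing $\hat{0}$ and $\hat{1}$ leaves $377 - 2 = 375$ genotypes, which is
the coefficient of $a$. The coefficient $2702765$ of $a^{11}$ is the number
of maximal chains of $\cG$, i.e., the number of linear extensions of $\cE$.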

722:

723:

724:

725: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

726:

727: \section{Distributive lattices from Bayesian networks}   \label{sec:bayes}

728:

729: In this section, we present a family of statistical models that naturally

730: gives rise to distributive lattices.  This statistical interpretation

731: provides a method for deriving the genotype lattice $\cG$ directly from data.

732: The basic idea is to estimate the poset structure on $\cE$ from

733: observed genotypes, by applying model selection techniques

734: to a range of Bayesian networks, and to

735:  define $\cG$ as the set of all genotypes with

736: non-zero probability in the model.

737:

738:

739: We first make precise the derivation of a genotype space

740: from a statistical model.

741: Let $\cE$ be an unordered set of $n$ genetic events.

742: The events are labeled by $1,2,\ldots,n$. Subsets

743: of $\cE$ are identified with binary strings $g \in \{0,1\}^n$.

744: They are the possible genotypes.

745: We consider binary random variables $X_{\cE} = (X_1, \dots, X_n)$,

746: where $X_e = 1$ indicates the occurrence of event $e$.

747: Let $\Delta$ denote the $(2^n-1)$-dimensional simplex

748: of probability distributions on $ \{0,1\}^n$. A {\em statistical

749: model} for $X_{\cE}$ is a map $\,p  \colon \Theta \to \Delta$,

750: where $\Theta$ is some parameter space.

751: The $g$-th coordinate of $p$, denoted

752: $p_g$,  is the

753: probability of genotype $g \in \{0,1\}^n$ under the model $p$.

754: The \emph{induced genotype space} of the model

755: $\,p \colon \Theta \to \Delta\,$ is the set

756: $\cG_p $ of all strings $\, g \in \{0,1\}^n \,$ such that

757: $p_g$ is not the zero function on $\Theta$.

758: We regard $\cG_p$ as a poset ordered by inclusion.

759:

760: Now consider a directed acyclic graph on the set of events $\cE$.

761: We will also call this graph $\cE$.

762: The {\em Bayesian network model}, or directed acyclic graphical model,

763: defined by $\cE$ is the family of joint distributions

764: that factor as

765: \begin{equation*}   \label{eqn:bayesnet}

766:    \Pr(X_1, \dots, X_n)

767:      \quad = \quad \prod_{e \in \cE} \Pr(X_e \mid X_{\pa(e)}),

768: \end{equation*}

769: where $\pa(e)$ denotes the set of parents of $e$ in $\cE$.

Equivalently, a Bayesian network is specified by a set of conditional independence
statements: each node is independent of its non-descendants given its parents.

772: See \cite{Lauritzen1996} for an introduction  to the relevant

773: statistical theory and \cite{Garcia2005} for an algebraic perspective.

774:

775: The parameters for a Bayesian network are specified by

776: providing, for each event $e \in \cE$,

777: a $2^{|\pa(e)|} \times 2$

778: matrix $\theta^e$. The matrix entries are

779: \[   \theta^e_{g_{\pa(e)},g_e}

780:      \quad = \quad \Pr\left( X_e = g_e \mid X_{\pa(e)}

781:      = g_{\pa(e)} \right),

782: \]

783: for  $ \, g_{\pa(e)} \in \{0,1\}^{\pa(e)},

784: \, g_e \in \{0,1\}$.

785: These conditional probabilities satisfy

786: \begin{equation}

787: \label{SumToOne}

788: \theta^e_{g_{\pa(e)},0} \geq 0\,,\,\,\,

789: \theta^e_{g_{\pa(e)},1} \geq 0 \,\, \quad \hbox{and} \,\,\quad

790:   \theta^e_{g_{\pa(e)},0}\, +\, \theta^e_{g_{\pa(e)},1} \,\,= \,\,1 .

791: \end{equation}

792:

793: Set $d = \sum_{e \in \cE} 2^{|\pa(e)|}$ and $\Theta = [0,1]^d$.

794: The points in the cube $\Theta$ are identified with $n$-tuples

795: of matrices $\,\theta = (\theta^e \,|\, e \in \cE)\,$ as above.

796: The {\em general Bayesian network} is the polynomial map

797: $\, p  \colon  \Theta \, \rightarrow \,\Delta \,$

798:  whose coordinates are

799:   \begin{equation}   \label{eqn:bayesfactor}

800:    p_g(\theta) \,\,\,= \,\,\, \prod_{e \in \cE} \theta^e_{g_{\pa(e)}, g_e}.

801: \end{equation}

802: The general Bayesian network on $\cE$ induces the

803: genotype space $\cG_p = \{0,1\}^n$, the Boolean lattice on $\cE$.

804: Indeed, the factorization~(\ref{eqn:bayesfactor}) implies

805: that no genotype $g \in \{0,1\}^n$ has probability zero for all

806: parameter values.

807:

808: To obtain other genotype spaces, we replace the

809: cube $\Theta = [0,1]^d$ by one of its faces, as follows.

810: For each event $e \in \cE$ consider a Boolean function

811: $\,\beta_e \colon \{0,1\}^{\pa(e)} \rightarrow \{0,1\}$.

If $\beta_e(g_{\pa(e)}) = 0$ then
the row of the $2^{|\pa(e)|} \times 2$-matrix $\theta^e$
indexed by $g_{\pa(e)}$ is fixed
to be the vector $(1,0)$;

816: otherwise that row remains indeterminate

817: subject to the constraints (\ref{SumToOne}).

818: Let $\Theta^\beta$ denote the face of $\Theta$

819: determined by these requirements

820: and  $\,p^\beta \colon \Theta^\beta \,\rightarrow \,\Delta\,$

821:  the restriction of the polynomial map $p$ to $\Theta^\beta$.

822: The resulting model is the Bayesian network on $\cE$ constrained by the

Boolean functions $\beta_e$.

824:

If all Boolean functions $\beta_e$ are disjunctions

826: then we get the {\em disjunctive Bayesian network} on $ \cE$.

827: In this model, an event $e$ can only occur if at least one

828: of its parent events has already occurred.

If all Boolean functions $\beta_e$ are conjunctions

830: then we get the {\em conjunctive Bayesian network} on $\cE$.

831: In this model, an event $e$ can only occur if all

832: of its parent events have already occurred.

833: These restricted Bayesian network models induce

834:  interesting genotype spaces.

835: Our main result in this section concerns the conjunctive case.

836:

837:

838: We regard the given directed acyclic graph $\cE$ as a poset by setting $e_1

839: \leq e_2$ if there exists a path from $e_1$ to $e_2$.

840: We write $\,p^{\rm conj} \colon [0,1]^n \rightarrow \Delta\,$

841: for the conjunctive Bayesian network on $\cE$,

842: since it has precisely $n$ free parameters.

843:

844: \begin{thm} \label{fromBNtoDL}

845: The genotype space induced by the conjunctive

846: Bayesian network on $\cE$ is the distributive lattice of order ideals

847: in $\cE$, i.e., $\cG_{p^{\rm conj}} = J(\cE)$.

848: \end{thm}

849:

850: \begin{proof}

851: The possible genotypes $g $ are binary strings whose coordinates $g_e$

852: indicate whether or not the event $e$ has occurred. If $p$ is

853: any of the Bayesian network models discussed above, then

854:  (\ref{eqn:bayesfactor}) implies that $g \in \cG_p$ if and only if

855: each $\thet{g}$ is non-zero. Consider now the

856:  conjunctive model $\,p = p^{\rm conj}$.

857: Here, the conditional probability

858:  $\thet{g}$ is non-zero if and

859: only if $g_e = 1$ implies $g_{\pa(e)} = (1, \dots, 1)$.  This is

860: precisely the condition for $g$ to be an order ideal in $\cE$.

861: Thus $\cG_p$ is the distributive lattice of order ideals of $\cE$.

862: \end{proof}

863:

864: The following example illustrates Theorem \ref{fromBNtoDL},

865: and it compares the genotype spaces induced by

866: the disjunctive and the conjunctive  Bayesian network.

The former need not be a distributive lattice,
but the latter always is.

869:

870: \begin{ex} \label{ex:conjunctive} \rm

871: Let $\cE$ be the event poset in Figure~\ref{fig:ex1}.

872: The general Bayesian network model defined by $\cE$

873: is parametrized by the following four matrices:

874:

875: \vspace{1ex}

876: \parbox{3.5cm}{ \centering

877: $

878:   \begin{array}{l}

879:     \theta^1 =

880:     \left( \begin{array}{cc}

881:       a & 1-a

882:     \end{array} \right), \\[2ex]

883:     \theta^2 =

884:      \left( \begin{array}{cc}

885:       b & 1-b

886:     \end{array} \right),

887:   \end{array}

888: $

889: }

890: \parbox{4.5cm}{ \centering

891: $

892:   \theta^3 = \left(

893:   \begin{array}{cc}

894:    c_{00} & 1 - c_{00} \\

895:    c_{01} & 1 - c_{01} \\

896:    c_{10} & 1 - c_{10} \\

897:    c_{11} & 1 - c_{11}

898:   \end{array} \right),

899: $

900: }

901: \parbox{4cm}{ \centering

902: $

903:     \theta^4 =

904:      \left( \begin{array}{cc}

905:       d_0 & 1 - d_0 \\

906:       d_1 & 1 - d_1

907:     \end{array} \right).

908: $

909: }\\

910: \vspace{1ex}

911:

912: \noindent The map $p \colon [0,1]^8 \to \Delta$ has coordinates

913: \begin{eqnarray*} &

914:  p_{0000} \, = \, a b c_{00} d_0, &

915:  p_{0001} \, = \, a b c_{00} (1-d_0) , \\ &

916:  p_{0010} \, = \, a b (1-c_{00}) d_0, &

917:  p_{0011} \, = \, a b (1-c_{00}) (1-d_0), \\ &

918:  p_{0100} \, = \, a (1-b) c_{01} d_1, &

919:  p_{0101} \, = \, a (1-b) c_{01} (1-d_1), \\ &

920:  p_{0110} \, = \, a (1-b) (1-c_{01}) d_1, &

921:  p_{0111} \, = \, a (1-b) (1-c_{01}) (1-d_1), \\ &

922:  p_{1000} \, = \, (1-a) b c_{10} d_0, &

923:  p_{1001} \, = \, (1-a) b c_{10} (1-d_0), \\ &

924:  p_{1010} \, = \, (1-a) b (1-c_{10}) d_0, &

925:  p_{1011} \, = \, (1-a) b (1-c_{10}) (1-d_0), \\ &

926:  p_{1100} \, = \, (1-a) (1-b) c_{11} d_1, &

927:  p_{1101} \, = \, (1-a) (1-b) c_{11} (1-d_1), \\ &

928:  p_{1110} \, = \, (1-a) (1-b) (1-c_{11}) d_1, &

929:  p_{1111} \, = \, (1\!-\!a) (1\!-\!b) (1 \! - \! c_{11}) (1 \! - \! d_1).

930: \end{eqnarray*}

931: This model induces the Boolean lattice $\{0,1\}^4$ as genotype space.

932:

933: The disjunctive Bayesian network is the

934: six-dimensional  submodel  obtained by setting

935: $\,c_{00}=1 \,$ and $ \,d_0=1 $. This substitution implies

936: \[

937: p_{0001} \,=\, p_{0010} \, = \, p_{0011} \,=\,

938: p_{1001}  \,=\, p_{1011} \,\, = \,\, 0.

939: \]

940: The genotype space $\,\cG_{p^{\rm disj}}$

941: consists of the remaining eleven strings in

942: $\{0,1\}^4$. Note that

943: $\,\cG_{p^{\rm disj}} \,$ is not

944: a  lattice because it is not

945: closed under intersections. For instance,

946: $\,1010$ and $ 0110 $ are in $ \cG_{p^{\rm disj}} \,$

947: but $\,0010 =  1010\,\cap \, 0110

948:  \not\in \cG_{p^{\rm disj}} $.

949:

950: The conjunctive Bayesian network is the

951: four-dimensional  submodel  obtained by setting

952: $\, c_{00}= c_{01}= c_{10}= d_0 = 1$. The

953: remaining eight non-zero probabilities are

954: indexed by the eight genotypes in Figure~\ref{fig:ex1}:

955: \begin{eqnarray*}

956: &  p_{0000} \, = \, a b   \, ,\,\,&

957:  p_{0100} \, = \, a (1-b)  d_1 \, ,\,\,\\

958: & p_{0101} \, = \, a (1-b) (1-d_1) \, ,\,\, &

959:  p_{1000} \, = \, (1-a) b \,,\,\,\, \\

960: & p_{1100} \, = \, (1-a) (1-b) c_{11} d_1 \, ,\,\, &

961:  p_{1101} \, = \, (1-a) (1-b) c_{11} (1-d_1) \, ,\,\, \\

962: & p_{1110} \, = \, (1-a) (1-b) (1-c_{11}) d_1 \, ,\,\, &

963:  p_{1111} \, = \, (1\! - \! a) (1\! - \! b)

964:  (1 \! - \! c_{11}) (1 \! - \! d_1).

965: \end{eqnarray*}

966: \end{ex}
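
As a sanity check on Example~\ref{ex:conjunctive}, one can verify directly
that the eight conjunctive probabilities sum to one for all parameter
values. Grouping the terms by the values of $g_1$ and $g_2$ gives
\begin{align*}
  \sum_{g \in J(\cE)} p_g
  \,\,&=\,\, ab \,+\, a(1-b)\bigl[d_1 + (1-d_1)\bigr] \,+\, (1-a)b \\
  &\qquad +\, (1-a)(1-b)\bigl[c_{11} + (1-c_{11})\bigr]\bigl[d_1 + (1-d_1)\bigr] \\
  &=\,\, ab + a(1-b) + (1-a)b + (1-a)(1-b) \,\,=\,\, 1 .
\end{align*}
Thus no probability mass is lost by discarding the other eight genotypes.
The eight remaining strings are also closed under union and intersection,
as Theorem~\ref{fromBNtoDL} predicts; for instance,
$0101 \cup 1000 = 1101$ and $0101 \cap 1110 = 0100$.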

967:

968:

969: If $\cE$ is a directed forest, i.e.,

970: if every $e \in \cE$ has at most one parent,

971: then we can augment $\cE$ to a tree $\cE^T$

972: by adding an auxiliary root node $0$

973: which points to the roots (edges with no parents) of the forest.

974: On the resulting tree $\cE^T$ we consider the

975: {\em mutagenetic tree model} of \cite{Beerenwinkel2005f, Desper1999}.

976:

977: \begin{prop}   \label{prop:forest}

978: If $\cE$ is a directed forest then the following three statistical

979: models coincide: the disjunctive Bayesian network on $\cE$,

980: the conjunctive Bayesian network on $\cE$, and the

981:  mutagenetic tree model on $\cE^T$.

982: \end{prop}

983:

984: \begin{proof}

985: The disjunctive and the conjunctive networks

986: coincide because they are defined by the same

987: specializations of the parameters $\,\theta^e$.

988: The identification with the mutagenetic tree model follows from

989: \cite[Thm.~14.6]{Beerenwinkel2005c}.

990: \end{proof}
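
The first step of this proof can be made explicit in a small case. The
following is only an illustrative sketch, and the label $u$ for the parent
is our notation. If an event $e$ has a single parent $u$, so that
$\pa(e) = \{u\}$, then the disjunction and the conjunction over $\pa(e)$
are the same Boolean function,
\[
   \beta_e(g_u) \,\, = \,\, g_u .
\]
In both models the row of $\theta^e$ indexed by $g_u = 0$ is therefore fixed
to $(1,0)$, while the row indexed by $g_u = 1$ remains free:
\[
   \theta^e \,\, = \,\,
   \begin{pmatrix} 1 & 0 \\ \theta^e_{1,0} & \theta^e_{1,1} \end{pmatrix} .
\]
This is the specialization $d_0 = 1$ of the matrix $\theta^4$ in
Example~\ref{ex:conjunctive}, which appears in both the disjunctive and the
conjunctive submodel. We assume here, as in that example, that events
without parents are left unconstrained.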

991:

992: Mutagenetic tree models can be learned from observed data by an efficient

993: combinatorial algorithm.

994: With appropriate edge weights that depend on the pairwise

995: probabilities of events, a mutagenetic tree can be obtained as the maximum

996: weight branching rooted at 0 in the complete graph on $\{0,\dots,n\}$; see

997: \cite{Desper1999}. This gives an efficient method for learning

998: the poset $\cE$, and hence the genotype lattice $\cG = J(\cE)$, from

999: data. It would be interesting to extend this model selection

1000: technique to arbitrary  conjunctive Bayesian networks.

1001:

1002: %If $\cE$ is a directed forest, the algebraic geometry of the

1003: %Bayesian network model is well-understood and the risk polynomial

1004: %can be derived directly from the algebraic invariants of the model.

1005: %Details of this algebraic statistical perspective are given

1006: %in the Appendix.

1007:

1008:

1009: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1010:

1011: \section{Applications to HIV drug resistance}   \label{sec:apply}

1012:

1013: We investigate the development of resistance during treatment of HIV

1014: infected patients with two different PIs. Consider the seven genetic events

1015: \[

1016:    \cE \, = \, \left\{ \mbox{K20R,~M36I,~M46I,~I54V,~A71V,~V82A,~I84V} \right\},

1017: \]

1018: where K20R stands for the amino acid change from lysine (K) to arginine (R)

1019: at position 20 of the protease chain, etc.

1020: The occurrence of these mutations confers broad cross-resistance to the

1021: entire class of PIs. Appearance of the virus with

1022: all 7 mutations renders most of the PIs ineffective for subsequent

1023: treatment.  We analyze the risk of reaching this escape state under

1024: therapy with the PIs ritonavir (RTV) and indinavir (IDV)

1025: \cite{Condra1996, Molla1996}.

1026:

1027:

1028: We use mutagenetic trees for estimating preferred mutational pathways

1029: and for defining genotype lattices.

1030: For both drugs, a tree $\cE^T$ is learned from genotypes derived

1031: from patients under the respective therapy. We used 112 and 691 samples

1032: from the Stanford HIV Drug Resistance Database \cite{Rhee2003}

1033: for ritonavir and indinavir, respectively.

1034: Figure~\ref{fig:trees} shows the inferred mutagenetic trees.

1035: The models indicate that the evolution of ritonavir

1036: resistance is partly a linear process, whereas indinavir resistance

1037: develops in a less ordered fashion. This is consistent with

1038: previous studies \cite{Condra1996, Molla1996}.

1039: The genotype lattices $\cG$  have size

1040: $16 $  for ritonavir and $45$ for indinavir.

1041: We study the risk polynomials on these

1042: lattices under different fitness landscape models.

1043:

1044: \begin{figure}[!tpb]

1045: \centering

1046: \begin{tabular}{ccc}

1047: \tree{\root{0}}{

1048: 	\tree{\node{V82A}}{

1049: 		\tree{\node{M46I}}{

1050: 			\node{I84V}

1051: 			}

1052: 		\tree{\node{I54V}}{

1053: 			\tree{\node{A71V}}{

1054: 				\tree{\node{K20R}}{

1055: 					\node{M36I}

1056: 					}

1057: 				}

1058: 			}

1059: 		}

1060: 	}

1061: & ~~~~~~~~~~~~~ &

1062: \tree{\root{0}}{

1063: 	\tree{\node{M36I}}{

1064: 		\node{K20R}

1065: 		}

1066: 	\tree{\node{V82A}}{

1067: 		\node{I54V}

1068: 		\node{A71V}

1069: 		}

1070: 	\tree{\node{M46I}}{

1071: 		\node{I84V}

1072: 		}

1073: 	}\\[3ex]

1074: (a) & & (b)

1075: \end{tabular}

1076: \caption{Mutagenetic tree $\cE^T$ for the development of resistance

1077: to (a) ritonavir and (b) indinavir in the HIV-1 protease.

1078: The event poset $\cE$ is obtained by removing the

1079: root node ``0''.}

1080: \label{fig:trees}

1081: \end{figure}

1082:

1083:

1084:

1085: For the constant fitness landscape on $\,\cG \backslash

1086: \{\hat{0}, \hat{1}\}$, we obtain

1087: \begin{eqnarray*}

1088:   \RP_{\rm RTV}(a) &=& 15a^6+70a^5+131a^4+124a^3+61a^2+14a+1, \\

1089:   \RP_{\rm IDV}(a) &=& 420a^6+1470a^5+1970a^4+1250a^3+372a^2+43a+1.

1090: \end{eqnarray*}

1091: Thus, the risk of developing all seven PI resistance mutations

1092: is higher under indinavir therapy than under ritonavir:

1093: $  \RP_{\rm IDV}(a) >   \RP_{\rm RTV}(a)$ for $a > 0$.

1094: Intuitively, the risk under ritonavir is lower because

1095: the mutations must occur in a certain order. Likewise,

1096: the high risk under indinavir results from many mutations occurring

1097: independently, which gives rise to a large genotype lattice and to many

1098: mutational pathways from the wild type to the escape state.
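
The extreme coefficients of these two polynomials can be checked by hand
from Figure~\ref{fig:trees}. The following sketch assumes that the
constant-landscape risk polynomial counts the chains in
$\cG \backslash \{\hat{0},\hat{1}\}$ by their number of elements; the symbol
$h(e)$, the number of events in the subtree of $\cE$ rooted at $e$, is
notation introduced only for this check. For ritonavir, every non-empty
order ideal contains V82A together with an initial segment of each of the
two chains below it, of lengths $2$ and $4$. For indinavir, the order
ideals are products of order ideals of the three components. Hence
\[
   |\cG_{\rm RTV}| \,=\, 1 + 3 \cdot 5 \,=\, 16
   \qquad \hbox{and} \qquad
   |\cG_{\rm IDV}| \,=\, 3 \cdot 5 \cdot 3 \,=\, 45 ,
\]
in agreement with the linear coefficients $14 = 16 - 2$ and $43 = 45 - 2$.
The leading coefficients count the linear extensions of $\cE$, which for a
forest on $n$ events number $n! / \prod_{e \in \cE} h(e)$:
\[
   \frac{7!}{7 \cdot 2 \cdot 1 \cdot 4 \cdot 3 \cdot 2 \cdot 1}
   \,=\, \frac{5040}{336} \,=\, 15
   \qquad \hbox{and} \qquad
   \frac{7!}{2 \cdot 1 \cdot 3 \cdot 1 \cdot 1 \cdot 2 \cdot 1}
   \,=\, \frac{5040}{12} \,=\, 420 .
\]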

1099:

1100: More realistic fitness landscapes may be derived by modeling viral fitness

1101: as a function of drug concentration. We follow the approach pursued

1102: in \cite{Stilianakis1997a} and use a simple saturation function for

1103: this dependency. Specifically, we assume viral fitness to be the following

1104: function of drug concentration $D$,

1105: \begin{equation}   \label{eqn:drugfitness}

1106:    f_g(D) \quad = \quad \frac{\phi_g}{1 + D/r_g},

1107: \end{equation}

1108: where $\phi_g$ denotes the fitness of genotype $g$ in the absence of drug

1109: and $r_g$ the IC$_{50}$ value of $g$, i.e., the drug concentration necessary

1110: to inhibit viral replication \emph{in vitro} by 50\%. The IC$_{50}$ value

1111: is a measure of resistance. We will assume

1112: throughout that all $\phi_g \equiv \phi$ are equal.

1113: If we assume, in addition,

1114: that the resistance landscape is constant on $\cG \backslash \{\hat{0},\hat{1}\}$,

1115: with $r_g \equiv r$,

1116: then the substitution (\ref{eqn:drugfitness}) turns

1117: the risk polynomial into a rational function in $\phi$, $D$, and $r$.

1118: For example, for ritonavir, this rational function is

1119: \[

1120:    \frac{(15\phi^2r^2+10\phi Dr+10\phi r^2+D^2+2Dr+r^2)(\phi r+D+r)^4}{(D+r)^6}.

1121: \]
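To see where this expression comes from, here is a short derivation, in
which $a$ denotes the common fitness value on
$\cG \backslash \{\hat{0},\hat{1}\}$. The polynomial $\RP_{\rm RTV}(a)$ factors as
\[
   \RP_{\rm RTV}(a) \,\, = \,\, (1+a)^4 \, (1 + 10a + 15a^2) .
\]
Substituting $a = \phi/(1 + D/r) = \phi r/(D+r)$ gives
\[
   1 + a \,=\, \frac{\phi r + D + r}{D+r}
   \qquad \hbox{and} \qquad
   1 + 10a + 15a^2 \,=\,
   \frac{15\phi^2 r^2 + 10 \phi r (D+r) + (D+r)^2}{(D+r)^2} ,
\]
and expanding
$10\phi r(D+r) + (D+r)^2 = 10\phi Dr + 10\phi r^2 + D^2 + 2Dr + r^2$
yields the numerator displayed above over the common denominator $(D+r)^6$.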

1122:

1123: \begin{figure}

1124: \centering

1125: \includegraphics[width=.8\textwidth,angle=270]{gfl}

1126: \caption{Graded resistance landscapes for ritonavir (RTV, bullets)

1127: and indinavir (IDV, squares). Resistance is quantified as the

1128: drug concentration necessary to inhibit viral replication \emph{in vitro}

1129: by 50\% (IC$_{50}$).}

1130: \label{fig:gfl}

1131: \end{figure}

1132:

1133: In general, the IC$_{50}$ values $r_g$ are distinct and can be determined

1134: experimentally for some genotypes

1135: by phenotypic resistance testing \cite{Walter1999},

1136: and may be predicted for all genotypes using regression techniques

1137: \cite{Beerenwinkel2003d}.

PI phenotypic resistance data suggest a graded resistance landscape;

1139: see \cite{Berkhout1999} and \cite[Tab.~3]{Condra1996}.

1140: Hence, we estimate the resistance $r \in \mathbb{R}^8$

1141: for ritonavir and indinavir by defining $r_k$

1142: as the mean predicted IC$_{50}$ of all

1143: genotypes of rank~$k$. The resulting resistance landscapes

1144: are shown in Figure~\ref{fig:gfl}.

1145:

1146: \begin{figure}

1147: \centering

1148: \includegraphics[height=\textwidth,angle=270]{drugfit}

1149: \caption{Drug dependent risk. The log of the risk polynomial

1150: for ritonavir (a) and indinavir (b)

1151: is displayed as a function of plasma drug concentration $D$. Marked

1152: values denote mean trough ($C_{\min}$) and peak ($C_{\max}$)

1153: levels observed in clinical studies. The parameter $\phi$ is

1154: the relative fitness of mutants as compared to the wild type

1155: in the absence of drug.}

1156: \label{fig:drugfit}

1157: \end{figure}

1158:

1159: The graded risk polynomials $\RP(a_1,a_2,a_3,a_4,a_5,a_6)$ have 64 terms. After

1160: substituting $a_k = \phi/(1 + D/r_k)$, we obtain rational risk functions in $D$

1161: with parameter $\phi$. Figure~\ref{fig:drugfit} illustrates the dependency of

1162: the risk on drug concentration for three different values of $\phi$. For both

1163: drugs we indicate published mean plasma trough ($C_{\min}$) and peak ($C_{\max}$) levels

1164: observed in clinical settings.
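
The count of 64 terms can be understood as follows. This is a sketch under
the assumption that the graded risk polynomial is obtained from the chain
polynomial of $\cG$ by assigning the same fitness $a_k$ to every genotype of
rank $k$; the coefficients $c_S$ are notation introduced only here. A chain
in $\cG \backslash \{\hat{0},\hat{1}\}$ contains at most one genotype of each
rank $k \in \{1,\dots,6\}$, so every monomial is squarefree:
\[
   \RP(a_1,\dots,a_6) \,\, = \,\,
   \sum_{S \subseteq \{1,\dots,6\}} c_S \prod_{k \in S} a_k ,
\]
where $c_S$ is the number of chains whose elements have exactly the ranks
in $S$. There are $2^6 = 64$ subsets $S$, and each $c_S$ is positive because
every maximal chain of $\cG$ passes through all ranks. Setting
$a_1 = \cdots = a_6 = a$ recovers the univariate polynomials; for ritonavir,
for example, the coefficients $c_S$ with $|S| = 1$ sum to $14$.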

1165:

1166: This example illustrates how the risk

1167: polynomial can be used to study viral escape as a function of

1168: different parameters. For instance, given a pharmacokinetics model

1169: of antiretroviral drug therapy, we can compute

1170: the risk of developing resistance after a patient has missed a dose.

1171: Thus, our mathematical framework may help in designing robust drug combinations.

1172:

1173:

1174:

1175: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1176:

1177: \section{Discussion}   \label{sec:discussion}

1178:

1179: We have presented a computational framework for assessing

1180: the risk of escape of an evolving population of pathogens.

1181: The risk of escape is the probability that the population

1182: reaches an escape state before extinction.

1183: In virus transmissions, for example, this probability is

1184: the chance of survival in the new host. In the situation

1185: of antiretroviral therapy, the risk of escape is the

1186: probability of therapy failure due to the development

1187: of drug resistance.

1188:

1189: The general setup we consider for computing the risk of escape

1190: includes an event poset, a fitness landscape on its induced

1191: genotype lattice, and a branching process on this lattice.

1192: The event poset $\cE$ consists of all mutational events that can

1193: occur and encodes the constraints which apply to their order of

1194: occurrence. From this structure the genotype space $\cG$ is obtained

1195: by considering all mutational pathways that respect the order

1196: constraints. This natural construction endows $\cG$ with the

1197: mathematical structure of a distributive lattice.

1198: The risk polynomial, the crucial factor in

1199: computing the risk of escape, turns out to coincide with the chain

1200: polynomial of the genotype lattice. We have presented

1201: methods from algebraic combinatorics that exploit

1202: this connection and that result in efficient algorithms.

1203:

1204: The space of genotypes may also be inferred from

1205: observed genotype data using statistical model selection tools.

1206: We have identified a class of Bayesian network models,

1207: the conjunctive Bayesian networks, whose support induces

1208: a genotype lattice.

1209: Mutagenetic tree models arise as important special cases.

1210: Here, both statistical model selection

1211: and risk computation are particularly efficient, and readily available

1212: with existing software \cite{Beerenwinkel2005b}

1213: coupled with our implementation of the linear extensions

1214: method (Theorem~\ref{thm:linearExtensions}, Appendix).

1215:

1216: \smallskip

1217:

1218: %The risk polynomial is a crucial factor in assessing the risk

1219: %of escape from strong selective pressure experienced by

1220: %a population evolving according to a multitype branching process.

1221: We have focused on the dependency of the risk polynomial

1222: on the fitness landscape and considered throughout a homogeneous

1223: wild type population prior to intervention. However, the risk of

1224: escape is calculated  similarly for a quasispecies

1225: distribution at the time of intervention. In fact,

1226: this involves computing the risk polynomial of

1227: the prior fitness landscape \cite{Iwasa2003}.

1228: In contrast, the branching process

model cannot account

1230: for recombination, horizontal gene transfer, or frequency

1231: dependent selection, since evolution is assumed to take place

1232: in multiple lineages independently.

1233:

1234: The main challenge in using our method to compute the risk

1235: of escape from antiretroviral therapy lies in accurately

1236: modeling the fitness landscape.

1237: The dependency (\ref{eqn:drugfitness}) of the fitness on drug

1238: concentration may be improved by experimentally determined

1239: viral replicative capacities in the

1240: absence of drugs. An alternative approach to derive a

1241: fitness landscape for HIV-1 proteases is based on estimating

1242: the binding affinity of the drug to the mutant protease, and

1243: the mutant's ability to cleave its natural substrates

1244: \cite{Rosin1999a}.

1245: These calculations are based on simplified molecular

1246: modeling techniques.

1247: The resulting fitness landscape does not account for different

1248: drug levels, but it is independent of experimental

1249: resistance and fitness data.

1250:

1251: Escape from indinavir and ritonavir therapy may in some cases

1252: involve mutations other than the seven we considered, although those

1253: are the most frequent mutations observed after therapy failure

1254: \cite{Condra1996,Molla1996}.

1255: On the other hand, viral escape might be accomplished with

1256: genotypes that harbor fewer than all of the mutations.

1257: Thus it would be desirable to compute the risk of reaching

1258: any of several escape states, rather than only the $11\cdots 1$ type.

1259: This computation will involve similar techniques to those presented

1260: in Section~\ref{sec:branching} and the Appendix.

1261:

1262: Finally, the PIs form only one out of four distinct

1263: classes of antiretroviral drugs

1264: that are in current clinical use. The standard of care is combination

1265: therapy with at least three different drugs from two different drug

1266: classes. Modeling the fitness landscape of combination therapy in

1267: terms of viral drug resistance and drug exposure is even more

1268: challenging, but can eventually help in designing optimal

1269: antiretroviral therapies.  Algebraic combinatorics offers

1270: tools for the mathematical analysis of these

1271: biomedical problems.

1272:

1273:

1274:

1275: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1276:

1277: \section*{Acknowledgements}

1278:

1279: Niko Beerenwinkel is supported by Deutsche Forschungsgemeinschaft under

1280: grant No.\ BE~3217/1-1.

1281: Nicholas Eriksson and Bernd Sturmfels are supported by

1282: the U.S.~National Science Foundation,

1283: under the grants  EF-0331494 and DMS-0456960

1284: respectively, and by the DARPA program

1285: {\em Fundamental Laws in Biology} (HR0011-05-1-0057).

1286:

1287: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1288:

1289: \bigskip

1290:

1291: \bibliographystyle{plain}

1292:

1293: \begin{thebibliography}{10}

1294:

1295: \bibitem{Athreya1972}

1296: K.B. Athreya and P.E. Ney.

1297: \newblock {\em Branching processes}.

1298: \newblock Dover, Mineola, New York, 1972.

1299:

1300: \bibitem{Beerenwinkel2003d}

1301: N. Beerenwinkel, M. D{\"a}umer, M. Oette, K. Korn, D. Hoffmann,

1302:   R. Kaiser, T. Lengauer, J. Selbig, and H. Walter.

1303: \newblock Geno2pheno: {E}stimating phenotypic drug resistance from {HIV}-1

1304:   genotypes.

1305: \newblock {\em Nucl. Acids Res.}, 31(13):3850--3855, Jul 2003.

1306:

1307: \bibitem{Beerenwinkel2005c}

1308: N. Beerenwinkel and M. Drton.

1309: \newblock Mutagenetic tree models.

1310: \newblock In L.~Pachter and B.~Sturmfels, editors, {\em Algebraic Statistics

1311:   for Computational Biology}, chapter~14, pages 278--290. Cambridge University

1312:   Press, Cambridge, UK, 2005.

1313:

1314: \bibitem{Beerenwinkel2005f}

1315: N. Beerenwinkel, J. Rahnenf{\"u}hrer, M. D{\"a}umer, D.

1316:   Hoffmann, R. Kaiser, J. Selbig, and T. Lengauer.

1317: \newblock Learning multiple evolutionary pathways from cross-sectional data.

1318: \newblock {\em J. Comput. Biol.}, 12(6):584--598, 2005.

1319:

1320: \bibitem{Beerenwinkel2005b}

1321: N. Beerenwinkel, J. Rahnenf{\"u}hrer, R. Kaiser, D. Hoffmann,

1322:   J. Selbig, and T. Lengauer.

1323: \newblock Mtreemix: a software package for learning and using mixture models of

1324:   mutagenetic trees.

1325: \newblock {\em Bioinformatics}, 21(9):2106--2107, May 2005.

1326:

1327: \bibitem{Berkhout1999}

1328: B.~Berkhout.

1329: \newblock {HIV}-1 evolution under pressure of protease inhibitors: Climbing the

1330:   stairs of viral fitness.

1331: \newblock {\em J. Biomed. Sci.}, 6:298--305, 1999.

1332:

1333: \bibitem{Bonhoeffer2000}

1334: S. Bonhoeffer, C. Chappey, N.T. Parkin, J.M. Whitcomb, and C.J. Petropoulos.

1335: \newblock Evidence for Positive Epistasis in HIV-1.

1336: \newblock {\em Science}, 306:1547--1550, 2004.

1337:

1338: \bibitem{brightwell}

1339: G. Brightwell and P. Winkler.

1340: \newblock Counting linear extensions.

1341: \newblock {\em Order}, 8(3):225--242, 1991.

1342:

1343: \bibitem{Clavel2004}

1344: F. Clavel and A.J. Hance.

1345: \newblock H{IV} drug resistance.

1346: \newblock {\em N. Engl. J. Med.}, 350(10):1023--1035, Mar 2004.

1347:

1348: \bibitem{Condra1996}

1349: J.H. Condra, D.J. Holder, W.A. Schleif, O.M. Blahy, R.M. Danovich, L.J.

1350:   Gabryelski, D.J. Graham, D.~Laird, J.C. Quintero, A.~Rhodes, H.L. Robbins,

1351:   E.~Roth, M.~Shivaprakash, T.~Yang, J.A. Chodakewitz, P.J. Deutsch, R.Y.

1352:   Leavitt, F.E. Massari, J.W. Mellors, K.E. Squires, R.T. Steigbigel,

1353:   H.~Teppler, and E.A. Emini.

1354: \newblock Genetic correlates of in vivo viral resistance to indinavir, a human

1355:   immunodeficiency virus type 1 protease inhibitor.

1356: \newblock {\em J. Virol.}, 70(12):8270--8276, 1996.

1357:

1358: \bibitem{Desper1999}

1359: R.~Desper, F.~Jiang, O.P. Kallioniemi, H.~Moch, C.H. Papadimitriou, and A.A.

1360:   Sch{\"a}ffer.

1361: \newblock Inferring tree models for oncogenesis from comparative genome

1362:   hybridization data.

1363: \newblock {\em J. Comput. Biol.}, 6(1):37--51, 1999.

1364:

1365: \bibitem{ehrenborg1996}

1366: R. Ehrenborg.

1367: \newblock On posets and {H}opf algebras.

1368: \newblock {\em Adv. Math.}, 119(1):1--25, 1996.

1369:

1370: \bibitem{Garcia2005}

1371:  L.~Garcia, M.~Stillman, and B.~Sturmfels.

1372: \newblock Algebraic geometry of {B}ayesian networks.

1373: \newblock {\em J. Symbol. Comput.}, 39:331--355, 2005.

1374:

1375: \bibitem{Iwasa2003}

1376: Y. Iwasa, F. Michor, and M.A. Nowak.

1377: \newblock Evolutionary dynamics of escape from biomedical intervention.

1378: \newblock {\em Proc. Biol. Sci.}, 270(1533):2573--2578, Dec 2003.

1379:

1380: \bibitem{Iwasa2004}

1381: Y. Iwasa, F. Michor, and M.A. Nowak.

1382: \newblock Evolutionary dynamics of invasion and escape.

1383: \newblock {\em J. Theor. Biol.}, 226(2):205--214, Jan 2004.

1384:

1385: %\bibitem{Kimmel2002}

1386: %M. Kimmel and D.E. Axelrod.

1387: %\newblock {\em Branching Processes in Biology}.

1388: %\newblock Springer, 2002.

1389:

1390: \bibitem{Lauritzen1996}

1391: S.L. Lauritzen.

1392: \newblock {\em Graphical Models}.

1393: \newblock Clarendon Press, 1996.

1394:

1395: \bibitem{Miller2004}

1396: E. Miller and B. Sturmfels.

1397: \newblock {\em Combinatorial commutative algebra}, volume 227 of {\em Graduate

1398:   Texts in Mathematics}.

1399: \newblock Springer, New York, 2005.

1400:

1401: \bibitem{Molla1996}

1402: A.~Molla, M.~Korneyeva, Q.~Gao, S.~Vasavanonda, P.J. Schipper, H.M. Mo,

1403:   M.~Markowitz, T.~Chernyavskiy, P.~Niu, N.~Lyons, A.~Hsu, G.R. Granneman,

1404:   D.D. Ho, C.A. Boucher, J.M. Leonard, D.W. Norbeck, and D.J. Kempf.

1405: \newblock Ordered accumulation of mutations in {HIV} protease confers

1406:   resistance to ritonavir.

1407: \newblock {\em Nat. Med.}, 2(7):760--766, Jul 1996.

1408:

1409: \bibitem{Nowak2000}

1410: M.A. Nowak and R.M. May.

1411: \newblock {\em Virus dynamics}.

1412: \newblock Oxford University Press, 2000.

1413:

1414: %\bibitem{Pachter2005}

1415: %L. Pachter and B. Sturmfels, editors.

1416: %\newblock {\em Algebraic Statistics for Computational Biology}.

1417: %\newblock Oxford University Press, 2005.

1418:

1419: \bibitem{pruesse1994}

1420: G. Pruesse and F. Ruskey.

1421: \newblock Generating linear extensions fast.

1422: \newblock {\em SIAM J. Comput.}, 23(2):373--386, 1994.

1423:

1424: \bibitem{Reidys2002}

1425: C.M. Reidys and P.F. Stadler.

1426: \newblock Combinatorial landscapes.

1427: \newblock {\em SIAM Review}, 44:3--54, 2002.

1428:

1429: \bibitem{Rhee2003}

1430: S.-Y. Rhee, M.J. Gonzales, R. Kantor, B.J. Betts, J. Ravela,

1431:   and R.W. Shafer.

1432: \newblock Human immunodeficiency virus reverse transcriptase and protease

1433:   sequence database.

1434: \newblock {\em Nucl. Acids Res.}, 31(1):298--303, Jan 2003.

1435:

1436:

1437: \bibitem{Rosin1999a}

1438: C.D. Rosin, R.K. Belew, G.M. Morris, A.J. Olson, and D.S. Goodsell.

1439: \newblock Coevolutionary analysis of resistance-evading peptidomimetic

1440:   inhibitors of {HIV-1} protease.

1441: \newblock {\em Proc. Natl. Acad. Sci. U. S. A.}, 96:1369--1374, 1999.

1442:

1443: \bibitem{Stanley1996}

1444: R.P. Stanley.

1445: \newblock A matrix for counting paths in acyclic digraphs.

1446: \newblock {\em J. Combin. Theory Ser. A}, 74(1):169--172, 1996.

1447:

1448: \bibitem{Stanley1999}

1449: R.P. Stanley.

1450: \newblock {\em Enumerative combinatorics. {V}ol. 1}, volume~49 of {\em

1451:   Cambridge Studies in Advanced Mathematics}.

1452: \newblock Cambridge University Press, Cambridge, 1997.

1453: %\newblock With a foreword by Gian-Carlo Rota, Corrected reprint of the 1986

1454: %  original.

1455:

1456: \bibitem{Stilianakis1997a}

1457: N.I. Stilianakis, C.A. Boucher, M.D.~De Jong, R.~Van Leeuwen, R.~Schuurman, and

1458:   R.J.~De Boer.

1459: \newblock Clinical data sets of human immunodeficiency virus type 1 reverse

1460:   transcriptase resistant mutants explained by a mathematical model.

1461: \newblock {\em J. Virol.}, 71(1):161--168, 1997.

1462:

1463: \bibitem{Varol1981}

1464: Y.L.~Varol and D.~Rotem.

1465: \newblock An algorithm to generate all topological sorting arrangements.

1466: \newblock {\em Comput. J.}, 24(1):83--84, 1981.

1467:

1468: \bibitem{Walter1999}

1469: H.~Walter, B.~Schmidt, K.~Korn, A.~M. Vandamme, T.~Harrer, and K.~{\"U}berla.

1470: \newblock Rapid, phenotypic {HIV-1} drug sensitivity assay for protease and

1471:   reverse transcriptase inhibitors.

1472: \newblock {\em J. Clin. Virol.}, 13:71--80, 1999.

1473:

1474: \bibitem{Wilke2003}

1475: C.O.~Wilke.

1476: \newblock Probability of fixation of an advantageous mutant

1477: in a viral quasispecies.

1478: \newblock {\em Genetics}, 163:467--474, 2003.

1479:

1480: \end{thebibliography}

1481:

1482: \section*{Appendix: Mathematics and computation of the risk polynomial}

1483:

1484: Here we discuss in more detail mathematical properties

1485: of the risk polynomial and we present several methods for computing it.

1486: The given data consists of an $n$-element poset $\cE$

1487: and its induced genotype lattice $\cG$, which is the distributive

1488: lattice of order ideals in $\cE$. We assume that $\cG$ has

1489: $m$ elements, which are encoded either

1490: as subsets of $\cE$ or as binary strings in $\{0,1\}^n$.

1491: The risk polynomial is the polynomial $\,\RP(\cG;{\bf f})\,$

1492: in the $m$ unknowns $f_g = {\bf f}(g)$,

1493: one for each genotype $g$.

1494: We are also interested in  specializations of

1495: $\RP(\cG;{\bf f})$ obtained by setting some (or all) of the unknowns

1496: equal to each other, such as

1497: the graded risk polynomial and the univariate risk polynomial.

1498:

1499:

1500: \subsection*{Stanley's linear algebra method}

1501:

1502: A direct method for computing the risk polynomial is given

1503: in Section~\ref{sec:branching}.

1504:  Namely, we can set all $\mu_e$  equal to one

1505: in the matrix ${\bf U}$ and then compute the upper right

1506: entry of the matrix $\,({\bf I} - {\bf UF})^{-1} - {\bf I} \,$ of

1507: equation (\ref{GeometricSeries}).

1508: In practice, one would compute this entry

1509: by a dynamic program which runs in time $O(m^2)$.

1510: That dynamic program is easily   derived by resolving the recursion

1511: in  the last equation of the proof  of Theorem~\ref{thm:1}.
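One way to organize such a dynamic program is sketched here. The sketch relies only on the description of the risk polynomial as the generating function for all chains in $\cG \backslash \{\hat{0},\hat{1}\}$, where $\hat{0}$ and $\hat{1}$ are the bottom and top genotypes. The auxiliary polynomials $r_g$ are notation introduced for this illustration and are not taken from the proof of Theorem~\ref{thm:1}. For each genotype $g \neq \hat{0},\hat{1}$ let $r_g$ denote the generating function for the chains whose largest element is $g$. Then
\[
r_g \,\, = \,\, f_g \cdot \Bigl( 1 + \sum_{h \,:\, \hat{0} \subset h \subset g} r_h \Bigr)
\qquad \hbox{and} \qquad
\RP(\cG;{\bf f}) \,\, = \,\, 1 + \sum_{g \neq \hat{0},\hat{1}} r_g ,
\]
where all inclusions are strict. If the genotypes are processed in order of increasing cardinality, then each $r_g$ is obtained from at most $m-2$ previously computed values, which is consistent with the $O(m^2)$ bound when the fitness values $f_g$ are given numerically.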

1512:

1513:

1514: The following alternative linear algebra technique for

1515: computing polynomials similar to our risk polynomials

1516: was given by Stanley  in \cite{Stanley1996}.

1517: Let $\,\cG' = \cG  \backslash \{\hat{0}, \hat{1}\} \,$ denote

1518: the genotype lattice with the top element

1519: $\hat{1}$ and the bottom element $\hat{0}$ removed.

1520: We define ${\bf A} $ to be the {\em anti-adjacency matrix} of the truncated

1521: genotype lattice $\cG'$. Thus ${\bf A}$ is the $(m-2) \times (m-2)$-matrix

1522: with rows and columns indexed by $\cG'$, and whose entry

1523: in row $g$ and column $h$ is $0$ if $g$ is a proper subset of $h$

1524: and is $1$ otherwise; in particular, every diagonal entry is $1$. We write ${\bf I}$ for the

1525: $(m-2) \times (m-2)$ identity matrix and

1526: $\, {\bf F}' = \diag \bigl( \,{\bf f}(g) \,|\, g \in \cG' \bigr)\,$ for the

1527: $ (m-2) \times (m-2)$-diagonal matrix whose entries are the

1528: fitness values. Stanley's result reads as follows.

1529:

1530: \begin{thm}[Stanley \cite{Stanley1996}]

1531: \label{stanley}

1532: The risk polynomial $\,\RP(\cG; {\bf f})\,$ equals

1533: the determinant of the $(m-2) \times (m-2)$-matrix $\, {\bf I} \, +\, {\bf F}' \cdot {\bf A}$.

1534: \end{thm}

1535:

1536: \begin{ex} \rm

1537: Let $\cG$ be the genotype lattice in Figure~\ref{fig:ex1}. Then $m =8$ and

1538:  $\, {\bf I} \, +\, {\bf F}' \cdot {\bf A}\,$ is the $6 \times 6$-matrix

1539: \[

1540: \bordermatrix{ &  1000 & 0100 & 1100 & 0101 & 1110 & 1101 \cr

1541: 1000 & 1 + f_{1000} &   f_{1000} &    0  &   f_{1000} &   0 &   0 \cr

1542: 0100 &       f_{0100} &    1 + f_{0100} &    0  &    0  &    0 &    0 \cr

1543: 1100 & f_{1100} &   f_{1100} &   1 + f_{1100 } &   f_{1100 } & 0 &  0 \cr

1544: 0101 &  f_{0101} & f_{0101} &  f_{0101 } & 1 + f_{0101} &  f_{0101} & 0 \cr

1545: 1110 &  f_{1110 } &  f_{1110} &  f_{1110} & f_{1110} &  1 + f_{1110} &  f_{1110} \cr

1546: 1101 & f_{1101} &  f_{1101} & f_{1101} &  f_{1101} &  f_{1101} &  1 + f_{1101} \cr}.

1547: \]

1548: The determinant of this matrix is

1549: the risk polynomial of Example~\ref{ex:rp}.

1550: \end{ex}
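As an illustrative sanity check of Theorem~\ref{stanley}, consider the two smallest nontrivial cases; these toy posets are chosen here only for illustration. If $\cE$ is a chain of three events, then $\cG' = \{100, 110\}$ with $100 \subset 110$, and
\[
\det \begin{pmatrix} 1 + f_{100} & 0 \\ f_{110} & 1 + f_{110} \end{pmatrix}
\,\, = \,\, 1 + f_{100} + f_{110} + f_{100} f_{110},
\]
one term for each of the four chains $\emptyset$, $\{100\}$, $\{110\}$, $\{100,110\}$. If $\cE$ consists of two incomparable events, then $\cG' = \{10, 01\}$ is an antichain, and
\[
\det \begin{pmatrix} 1 + f_{10} & f_{10} \\ f_{01} & 1 + f_{01} \end{pmatrix}
\,\, = \,\, 1 + f_{10} + f_{01},
\]
because the product $f_{10} f_{01}$ cancels, reflecting the fact that $\{10, 01\}$ is not a chain.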

1551:

1552:

1553: \subsection*{The Hilbert series method}

1554:

1555: A more conceptual way of thinking about the risk polynomial

1556: is based on the following algebraic construction.

1557: The {\em Stanley-Reisner ideal} $\,I_{\cG'}\,$ of $\cG'$

1558: is the ideal generated by all quadratic monomials

1559: $\,f_g \cdot f_h \,$ where $g$ and $h$

1560: are genotypes that are incomparable,

1561: i.e., neither $g \subseteq h$ nor $h \subseteq g$ holds.

1562: The ambient polynomial ring $\,S = \mathbb{R}[{\bf f}] $

1563: is generated by the unknowns $f_g$ where $g \in \cG'$.

1564: The {\em Hilbert series} of the quotient ring $\,S/I_{\cG'}\,$

1565: is the formal sum over all monomials

1566: $\,{\bf f}^u \, = \,\prod_{g \in \cG'} f_g^{u_g}\,$

1567: which  are not in the ideal $\,I_{\cG'}$.

1568: This is a formal generating function which can be

1569: written as a rational function of the following form

1570: \[

1571: H(S/I_{\cG'}; {\bf f}) \quad = \quad

1572: \frac{K_\cG({\bf f})}{\prod_{g \in \cG'} (1-f_g)}.

1573: \]

1574: Here $K_\cG({\bf f})$ is a polynomial

1575: in the unknowns $f_g$ with integer coefficients.

1576: The polynomial $K_\cG({\bf f})$

1577: is known as the {\em K-polynomial} of the ideal $I_{\cG'}$.

1578: We refer to \cite{Miller2004} for an introduction

1579: to Stanley-Reisner ideals and their K-polynomials.

1580:

1581: If $\cE$ is a directed forest (and we identify $f_g = p_g$)

1582: then Proposition \ref{prop:forest} and

1583: \cite[Thm.~14.11]{Beerenwinkel2005c} imply that

1584: the ideal $I_{\cG'}$ is an initial monomial ideal

1585: of the conjunctive Bayesian network on $\cE$.

1586: In a forthcoming paper we shall prove

1587: that this initial ideal property holds

1588: for all event posets (not just forests).

1589:

1590: \begin{ex} \rm

1591: Let $\cG$ be the genotype lattice in Figure~\ref{fig:ex1}.

1592: Then

1593: \[

1594: I _{\cG'} \quad = \quad \langle\,

1595: f_{0101} f_{1110},\,

1596: f_{1101} f_{1110},\,

1597: f_{0101} f_{1100},\,

1598: f_{0101} f_{1000},\,

1599: f_{0100} f_{1000}

1600: \rangle

1601: \]

1602: %Comparing these monomials to the underlined initial monomials in

1603: %Example~\ref{ex:conjunctive}, we see that

1604: %$I_{\cG'}$

1605: is indeed the initial monomial ideal

1606: of the conjunctive Bayesian network

1607: % in that example.

1608: in Example~\ref{ex:conjunctive}.

1609: The K-polynomial $K_{\cG}({\bf f})$ equals

1610: \begin{eqnarray*}

1611: & 1

1612: - f_{0101} f_{1110}

1613: - f_{1101} f_{1110}

1614: - f_{0101} f_{1100}

1615: - f_{0101} f_{1000}

1616: - f_{0100} f_{1000} \\ &

1617: + f_{0100} f_{1000} f_{0101}

1618: + f_{1000} f_{0101} f_{1100}

1619: + f_{1000} f_{0101} f_{1110}

1620: + f_{0101} f_{1100} f_{1110} \\ &

1621: + f_{0101} f_{1110} f_{1101}

1622: + f_{0100} f_{1000} f_{1110} f_{1101} \\ &

1623: - f_{1000} f_{0101} f_{1100} f_{1110}

1624: - f_{0100} f_{1000} f_{0101} f_{1110} f_{1101}.

1625: \end{eqnarray*}

1626: \end{ex}
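As a consistency check on this example, one can set every unknown $f_g$ equal to a single variable $t$; the variable $t$ is introduced only for this illustration. The two terms of degree four cancel, and the K-polynomial specializes to
\[
K_\cG(t) \,\, = \,\, 1 - 5t^2 + 5t^3 - t^5 \,\, = \,\, (1-t)^3 (1 + 3t + t^2).
\]
On the other hand, the monomials outside $I_{\cG'}$ are exactly those supported on a chain of $\cG'$, and $\cG'$ has $1, 6, 10, 5$ chains with $0, 1, 2, 3$ elements respectively. Hence the specialized Hilbert series is
\[
1 + \frac{6t}{1-t} + \frac{10 t^2}{(1-t)^2} + \frac{5 t^3}{(1-t)^3}
\,\, = \,\, \frac{1 + 3t + t^2}{(1-t)^3}
\,\, = \,\, \frac{K_\cG(t)}{(1-t)^6},
\]
in agreement with the rational form of $H(S/I_{\cG'}; {\bf f})$ for the six unknowns.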

1627:

1628: \smallskip

1629:

1630: %Just as in the proof of Corollary~\ref{cor:gb}, we see

1631: Again using Proposition~\ref{prop:forest} and

1632: Theorem~14.11 in \cite{Beerenwinkel2005c}

1633: we see that

1634: the risk polynomial  $\,\RP(\cG; {\bf f})\,$

1635: is the sum of all squarefree monomials

1636: in the expansion of the Hilbert series $H(S/I_{\cG'}; {\bf f})$.

1637: Equivalently, $\,\RP(\cG; {\bf f})\,$ is the reduction of

1638: $H(S/I_{\cG'}; {\bf f})$ modulo the ideal generated

1639: by the squares $\,f_g^2 \,$ of the unknowns.

1640: Since $\,1/(1-f_g)\,$ equals $\,1+f_g\,$ modulo

1641: $\,\langle \, f_g^2 \, \rangle $, we have the following result.

1642:

1643: \begin{prop} \label{reisner}

1644: The risk polynomial  $\,\RP(\cG; {\bf f})\,$

1645: of the genotype lattice $\cG$ is the sum of

1646: all squarefree terms in the expansion of

1647: \[

1648: K_\cG({\bf f}) \cdot \prod_{g \in \cG'} (1+f_g),

1649: \]

1650: where $K_\cG({\bf f})$ is the $K$-polynomial

1651: of the Stanley-Reisner ideal $I_{\cG'}$.

1652: \end{prop}

1653:

1654: The univariate risk polynomial $\,\RP(\cG; a) \,$

1655:  is derived from $\,\RP(\cG;{\bf f})\,$

1656: by replacing each $f_g$ by the scalar unknown $a$.

1657: We have

1658: \[

1659: \RP(\cG;a) \quad = \quad

1660: c_0 + c_1 a + c_2 a^2 + \cdots + c_{n-1} a^{n-1},

1661: \]

1662: where $c_i$ is the number of chains with $i$ elements in $\cG'$ (so $c_0 = 1$ counts the empty chain). Thus,

1663: $(c_0,\ldots,c_{n-1})$ is the $f$-vector

1664: of the simplicial complex of chains in $\cG'$.

1665: Likewise, we get the graded risk polynomial from

1666: $\RP(\cG;{\bf f})$ by replacing each $f_g$ by

1667: $a_{|g|}$. We note that the graded risk polynomial is  related to

1668: Ehrenborg's quasi-symmetric function encoding \cite{ehrenborg1996}

1669: of the flag $f$-vector of the chain complex of $\cG'$.
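For a concrete instance, consider the genotype lattice of Figure~\ref{fig:ex1}, for which $\cG' = \{1000, 0100, 1100, 0101, 1110, 1101\}$ and $n = 4$. The following expression is obtained by listing the chains of $\cG'$ by hand and recording the cardinalities of their elements; it is meant as an illustration. The graded risk polynomial is
\[
1 + 2 a_1 + 2 a_2 + 2 a_3 + 3 a_1 a_2 + 4 a_1 a_3 + 3 a_2 a_3 + 5 a_1 a_2 a_3 .
\]
Setting $a_1 = a_2 = a_3 = a$ gives the univariate risk polynomial $\,1 + 6a + 10a^2 + 5a^3$, so that $(c_0,c_1,c_2,c_3) = (1,6,10,5)$.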

1670:

1671:

1672: \subsection*{The linear extensions method}

1673:

1674: One advantage of both Theorem~\ref{stanley}

1675: and Proposition~\ref{reisner} is that these

1676: formulas do not actually depend on the

1677: fact that $\cG$ is a distributive lattice.

1678: They also apply if the set

1679: $\cG$ of genotypes is an arbitrary

1680: poset. This is relevant for our

1681: discussion of the statistical models in Section~\ref{sec:bayes},

1682: where we introduced a more general

1683: class of posets $\cG_p \subseteq \{0,1\}^n$.

1684:

1685: This advantage is also a disadvantage:

1686: Theorem~\ref{stanley} and Proposition~\ref{reisner}

1687: do not give the most efficient methods for

1688: computing  $\RP(\cG;{\bf f})$ when $\cG$ is  the distributive lattice

1689: induced by an event poset $\cE$. In what follows

1690: we present a specialized and more efficient

1691: algorithm for the risk polynomial.

1692:  The input to this algorithm consists of

1693: the event poset $\cE$. It is not necessary

1694: to compute the genotype lattice $\cG$

1695: as this will be done as a byproduct of our approach,

1696: which is to compute  the risk polynomial $\RP(\cG;{\bf f}) $ directly from $\cE$.

1697:

1698: As before, we assume that $\cE$ has $n$ elements, and

1699: we write $[n]$ for the linearly ordered set $\{1,2,\ldots,n\}$.

1700: A {\em linear extension} of $\cE$ is an order-preserving

1701: bijection $\,\pi \colon \cE \rightarrow [n]$. This means that

1702: $e < e'$ in $\cE$ implies $\pi(e) < \pi(e')$.

1703: Every linear extension  $\,\pi \colon \cE \rightarrow [n]$

1704: gives rise to an ordered list of $n-1$ genotypes

1705: $\,g^{(1)},g^{(2)}, \ldots,g^{(n-1)}\,$ in

1706: $\,\cG' = \cG \backslash \{\hat{0},\hat{1}\}$ as follows.

1707: The genotype $g^{(i)}$ is

1708: the subset of $\cE$ consisting of all

1709: events whose image under $\pi$

1710: is among the first $i$ positive integers. In symbols,

1711: $\, g^{(i)} \,= \, \pi^{-1}(\{1,2,\ldots,i\}) $.

1712: The sequence $g^{(1)}, g^{(2)}, \ldots, g^{(n-1)}$, derived from $\pi$,

1713: represents a mutational pathway in $\cG$.

1714:

1715: We now fix one distinguished linear extension of $\cE$,

1716: that is, we identify the set underlying $\cE$ with $[n]$ itself.

1717: Then a linear extension is simply

1718: any permutation $\pi$ of $[n]$ which preserves the

1719: order relations in $\cE$.  We define

1720: \begin{equation}

1721: \label{FPi}

1722: {\bf f}(\pi) \quad = \quad

1723: \prod_{i: \pi(i) < \pi(i+1)} ( f_{g^{(i)}} + 1)

1724: \cdot

1725: \prod_{i: \pi(i) > \pi(i+1)}  f_{g^{(i)}}  ,

1726: \end{equation}

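1727: where $i$ runs over $\{1,2,\ldots,n-1\}$ and $\pi(i)$ denotes the event in position $i$ of the linear extension, that is, the unique element of $g^{(i)} \backslash g^{(i-1)}$, with $g^{(0)} = \hat{0}$ and $g^{(n)} = \hat{1}$.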

1728: Our algorithm amounts to evaluating

1729: the  risk polynomial by means of the

1730: following explicit summation formula.

1731:

1732: \begin{thm}   \label{thm:linearExtensions}

1733: The risk polynomial  $\RP(\cG;{\bf f}) $

1734: equals the sum of the products ${\bf f}(\pi)$

1735: where $\pi$ runs over all linear extensions of

1736: the event poset $\cE$.

1737: \end{thm}

1738:

1739: \begin{proof}

1740: The relationship between chains in $\cG$ and

1741: linear extensions of $\cE$ is the content of

1742: \cite[Prop.~3.5.2]{Stanley1999}.

1743: The distributive lattice $\cG$ has a canonical

1744: {\em R-labeling} \cite[Sec.~3.13]{Stanley1999}

1745: which assigns to each edge of the Hasse diagram of

1746: $\cG$ the corresponding element of $\cE$.

1747: In view of this R-labeling, Exercise~59d in \cite[Chap.~3]{Stanley1999}

1748: tells us that the poset $\,\cG'  = \cG \backslash \{\hat{0},\hat{1}\}\,$ is

1749: {\em chain-partitionable}.

1750: Each product  ${\bf f}(\pi)$ as in (\ref{FPi})

1751:  is the generating function for

1752:  all the chains in precisely one part of that chain

1753: partition of $\cG'$. Adding up all products

1754: gives the generating function for all chains,

1755: which is the risk polynomial.

1756: \end{proof}

1757:

1758: \begin{ex} \rm

1759: The event poset $\cE$ in Figure~\ref{fig:ex1} has five linear extensions $\pi$:

1760: \begin{eqnarray*}

1761: \pi \quad \,\,\,& {\bf f}(\pi) \\

1762: (1, 2, 3, 4) &  (1+f_{1000})(1+f_{1100})(1+f_{1110})  \\

1763: (1, 2, 4, 3) & (1 + f_{1000}) (1+f_{1100}) f_{1101}         \\

1764: (2, 1, 3, 4) & f_{0100}(1+f_{1100})(1+f_{1110})          \\

1765: (2, 1, 4, 3) & f_{0100}(1+f_{1100}) f_{1101}                 \\

1766: (2, 4, 1, 3) & (1+f_{0100}) f_{0101} (1+f_{1101})

1767: \end{eqnarray*}

1768: The sum of these five products equals the risk polynomial  $\RP(\cG;{\bf f}) $.

1769: \end{ex}

1770:

1771:

1772: \subsection*{Implementation}

1773:

1774: Pruesse and Ruskey \cite{pruesse1994} showed that

1775: the linear extensions of a poset $\cE$ can be computed in time linear in

1776: the number of linear extensions.

1777: Thus, their algorithm computes $\RP(\cG;{\bf f}) $ in

1778: time linear in the size of the output of

1779: Theorem~\ref{thm:linearExtensions}.  That output is in

1780: factored form (\ref{FPi}) and is always more compact than the

1781: expanded risk polynomial.  In this manner, we compute the risk

1782: polynomial in time sublinear in the size of the expanded risk

1783: polynomial.

1784:

1785: To obtain the univariate risk polynomial, we take the sum of the terms

1786: $\,(1+a)^{n-1-\delta} a^\delta$, where $\delta = \delta(\pi)$

1787: is the number of descents of the linear extension $\pi$.

1788: Similarly, the graded risk polynomial $\RP(\cG; a_1,\ldots,a_{n-1})$ is found by

1789: keeping track of the descent set of each linear extension $\pi$.

1790: We believe that this method is best possible for general posets

1791: $\cE$. Notice that the leading coefficient of

1792: the univariate risk polynomial is the number of linear extensions of

1793: $\cE$, and it is \#P-complete to count linear extensions \cite{brightwell}.
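As a worked illustration of the descent count, take the event poset of Figure~\ref{fig:ex1}, where $n = 4$. Its five linear extensions, written as lists of events, are $(1,2,3,4)$, $(1,2,4,3)$, $(2,1,3,4)$, $(2,1,4,3)$ and $(2,4,1,3)$. Their descent sets are $\emptyset$, $\{3\}$, $\{1\}$, $\{1,3\}$ and $\{2\}$, so $\delta$ takes the values $0, 1, 1, 2, 1$. Summing the corresponding terms gives
\[
(1+a)^3 \, + \, 3 a (1+a)^2 \, + \, a^2 (1+a) \,\, = \,\, 1 + 6a + 10a^2 + 5a^3 ,
\]
and the leading coefficient $5$ is indeed the number of linear extensions.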

1794:

1795: When $\cE$ is a directed forest, the

1796: recursive structure can be used to help compute the risk polynomial.

1797: In this case, $\cE$ is built up by the operations of disjoint union

1798: and ordinal sum from the one element poset.  For example, in the univariate case,

1799: the zeta polynomial \cite[Sec.~3.11]{Stanley1999} of $\cG$ behaves nicely under these operations and

1800: can be used to write down the risk polynomial. Based on these

1801: considerations, we can design an efficient algorithm for

1802: computing the univariate risk polynomial of a directed forest.

1803:

1804: Using the method of Theorem~\ref{thm:linearExtensions}, we have developed software

1805: for computing risk polynomials.

1806: The input to our program is an arbitrary event poset $\cE$,

1807: and the output is

1808: the risk polynomial, the graded risk polynomial

1809: or the univariate risk polynomial.  Optionally, the user can also

1810: input either exact fitness values or upper and lower bounds for each fitness

1811: value.  The output in this case is either the exact risk of escape

1812: or upper and lower bounds for the risk.

1813: It is designed to integrate with the package

1814: \texttt{Mtreemix} \cite{Beerenwinkel2005b},

1815: allowing the user to start with data, infer a mutagenetic

1816: tree, and then easily compute the risk

1817: polynomial.

1818: Our software is available at

1819: \[

1820:    \url{http://bio.math.berkeley.edu/riskpoly/}

1821: \]

1822: We use the algorithm of \cite{Varol1981} for computing linear

1823: extensions.  Although this algorithm is not asymptotically optimal, as

1824: shown in \cite{pruesse1994}, it

1825: is simple to implement and efficient in practice.

1826:

1827:

1828:

1829: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1830:

1831:

1832:

1833:

1834: \end{document}

1835: