0703:q-bio0703034/inv.tex

1: %\documentclass[12pt,envcountsame]{llncs}

2: \documentclass[12pt]{article}

3: %\pagenumbering{arabic}

4: %\pagestyle{plain}

5: \usepackage{fullpage}

6:

7: \usepackage{amsthm}

8: \usepackage{amsmath}

9: \usepackage{amsfonts}

10: \usepackage{graphicx}

11: \usepackage{url}

12:

13: %\spnewtheorem{alg}[theorem]{Algorithm}{\bfseries}{}%{\itshape}

14: %\spnewtheorem{op}[theorem]{Problem}{\bfseries}{}

15: \newtheorem{thm}{Theorem}

16: \newtheorem{lemma}[thm]{Lemma}

17: \newtheorem{prop}[thm]{Proposition}

18: \newtheorem{cor}[thm]{Corollary}

19: \newtheorem{prob}[thm]{Problem}

20: \newtheorem{conj}[thm]{Conjecture}

21: \newtheorem{alg}[thm]{Algorithm}

22: \newtheorem{rmk}[thm]{Remark}

23:

24: \newtheorem{example}[thm]{Example}

25: \newtheorem{definition}[thm]{Definition}

26:

27:

28: \newcommand{\sM}{\mathcal M}

29: \newcommand{\cE}{\mathcal{E}}

30: \newcommand{\cG}{\mathcal{G}}

31: \newcommand{\NN}{\mathbb{N}}

32: \newcommand{\rr}{\mathbb{R}}

33: \newcommand{\e}{\mathrm{e}}

34: \newcommand{\tr}{\mathrm{tr}}

35: \newcommand{\bof}{\mathbf{f}}

36: \newcommand{\Ta}{T_1}

37: \newcommand{\Tb}{T_2}

38: \newcommand{\Tc}{T_3}

39: \newcommand{\hpt}{\hat{p}(\theta)}

40: \newcommand{\hpi}{\hat{p}_{i}}

41: \newcommand{\hpb}{\hat{p}_{2}}

42: \newcommand{\hpc}{\hat{p}_{3}}

43: \newcommand{\diag}{\mathrm{diag}}

44:

45: \begin{document}

46: \title{Metric learning for phylogenetic invariants}

47: %\author{Nicholas Eriksson\inst{1}\fnmsep*\and Yuan Yao\inst{2}}

48: %\institute{\url{nke@stanford.edu}\\ Department of Statistics, \\Stanford University, \\Stanford, CA 94305-4065 \and

49: %\url{yuany@math.stanford.edu}\\Department of Mathematics,\\ Stanford University, \\Stanford, CA 94305-4065}

50: %\date{\today}

51: \author{Nicholas Eriksson\\

52: \url{nke@stanford.edu}\\

53: Department of Statistics,\\

54: Stanford University, \\

55: Stanford, CA 94305-4065

56: \and

57: Yuan Yao\\

58: \url{yuany@math.stanford.edu}\\

59: Department of Mathematics,\\

60: Stanford University, \\

61: Stanford, CA 94305-4065

62: }

63: \date{\today}

64:

65: \maketitle

66:

67: \begin{abstract}

68: We introduce new methods for phylogenetic tree quartet construction by using

69: machine learning to optimize the power of phylogenetic invariants.

70: Phylogenetic

71: invariants are polynomials in the joint probabilities which vanish under a

72: model of evolution on a phylogenetic tree.  We give algorithms for selecting a

73: good set of invariants and for learning a metric on this set of invariants

74: which optimally distinguishes the different models.  Our learning algorithms

75: involve linear and semidefinite programming on data simulated over a wide range

76: of parameters.  We provide extensive tests of the learned metrics on simulated data

77: from phylogenetic trees with four leaves under the Jukes-Cantor and Kimura

78: 3-parameter models of DNA evolution.  Our method greatly improves on other uses

79: of invariants and is competitive with or better than neighbor-joining.  In

80: particular, we obtain metrics trained on trees with short internal branches

81: which perform much better than neighbor joining on this region of parameter

82: space.

83: \end{abstract}

84: \begin{quotation}

85: \noindent \small {\bf Keywords:}

86: Phylogenetic invariants, algebraic statistics, semidefinite programming,

87: Felsenstein zone.

88: \end{quotation}

89:

90:

91: \section{Introduction}

92:

93: Phylogenetic invariants have been used for tree construction since the linear

94: invariants of Lake and Cavender-Felsenstein \cite{Lake1987,Cavender1987} were

95: found.  Although the linear invariants of the Jukes-Cantor model are powerful

96: enough to asymptotically distinguish between trees on 4 taxa \cite{Lake1987},

97: these linear invariants do not perform well in simulations

98: \cite{Huelsenbeck1995}.

99:

100: In the last 20 years, the entire set of phylogenetic invariants has been found

101: for many models of evolution (see \cite{Allman2007} and the references

102: therein).  Since the invariants provide an essentially complete description of

103: the model, using more invariants should give more power to distinguish between

104: different models.  However, different invariants give vastly different power at

105: distinguishing models and it is not known how to find the most powerful

106: invariants.

107:

108: In this paper, we use techniques from machine learning to find metrics on the

109: space of invariants which optimize their tree reconstruction power for the

110: Jukes-Cantor and Kimura 3-parameter phylogenetic models on trees with four

111: leaves.  Specifically, we apply \emph{metric learning} algorithms inspired by

112: \cite{XingNg03} to find the metric which best distinguishes the models.  For

113: training data we use simulations over a wide range of the parameter space.

114: Our main biological result is the construction of a metric which outperforms

115: neighbor-joining on trees simulated from the Felsenstein zone (i.e., trees with

116: a short interior edge). More generally, we find that metric learning

117: significantly improves upon other uses of invariants and is competitive with

118: neighbor joining even for short sequences and homogeneous rates.

119:

120: Casanellas and Fern\'{a}ndez-S\'{a}nchez \cite{Casanellas2006} also used the Kimura

121: 3-parameter invariants to construct trees with four taxa.  Their results

122: indicated that invariants can sometimes perform better than commonly used methods (e.g.,

123: neighbor joining and maximum likelihood) for data that evolved with

124: non-homogeneous rates and for extremely long sequences.   They used the $l_1$

125: norm on the space of invariants, weighing each polynomial equally.

126:

127: This paper builds on \cite{Casanellas2006}

128: by showing how to improve upon the $l_1$ norm on the space of invariants.

129: The $l_1$ norm behaves poorly

130: since it gives equal weight to

131: informative and non-informative invariants.

132: Simulating data and using metric learning

133: improves the performance of invariants by putting much more weight on the

134: powerful invariants.

135: This allows us to build an algorithm which is very accurate for trees with

136: short internal edges.

137: %%Second, we show that their method gives different answers depending on the

138: %%order of the input sequences because they use sets of invariants which are not

139: %%closed under the symmetry group of the tree.  We show how to build sets of

140: %%invariants which are closed under permutations of the input, thus giving an

141: %%algorithm that does not change if the input is permuted.

142:

143: %This paper is organized as follows.  In Section~\ref{sec:2}, we give a

144: %description of the phylogenetic invariants and models we will use.

145: %Section~\ref{sec:3}, contains our algorithms for metric learning using

146: %linear and semidefinite programming.

147: %%These algorithms take simulated data and

148: %%find a metric on the space of invariants such that the invariants from the correct

149: %%tree are minimized while the incorrect ones are simultaneously maximized.

150: %Section~\ref{sec:4} gives the results of extensive simulations with our

151: %algorithms for the Jukes-Cantor (JC69) and Kimura 3-parameter (K81) models.

152:

153: %The remainder of this paper is organized as follows.  We begin by giving a

154: %short introduction to phylogenetic invariants and motivate our algorithm.

155: %Section~\ref{sec:methods} contains the techniques used to select a set of

156: %invariants and the metric learning algorithms.  Section~\ref{sec:results} \dots

157:

158: %\section{Phylogenetic invariants}

159: %\label{sec:2}

160: We begin by briefly introducing the models and phylogenetic invariants we will use.

161: Section~\ref{sec:methods} describes the metric learning algorithms; Section~\ref{sec:results} gives

162: the results of our simulation studies; and we conclude with a short discussion.

163:

164: By \emph{phylogenetic tree}, we mean a binary, unrooted tree with labelled

165: leaves.  There are three such trees with four leaves labelled $0, 1, 2, 3$; we

166: call these trees $\Ta$, $\Tb$, and $\Tc$ according to which leaf is on the same

167: ``cherry'' as leaf 0.

168: We consider two phylogenetic models on these trees:  the Jukes-Cantor (JC69)

169: model of evolution \cite{Jukes1969} and the Kimura 3-parameter (K81) model

170: \cite{Kimura1981}, both with uniform root distribution.

171: %and parameters $\alpha, \beta, \gamma$.

172:

173: These models associate to each edge $e$

174: of the tree a transition matrix

175: $M_e = \e^{Qt_e}$ where

176: $t_e$ is the length of the edge $e$

177: and $Q$ is a rate matrix:

178: %\[

179: %M_e = \frac 1 4

180: %\begin{pmatrix}

181: %	1 + 3 \e^{-\frac{4}{3}t_e} &

182: %	1 - \e^{-\frac{4}{3}t_e}&

183: %	1 - \e^{-\frac{4}{3}t_e}&

184: %	1 - \e^{-\frac{4}{3}t_e}\\

185: %	1 - \e^{-\frac{4}{3}t_e}&

186: %	1 + 3 \e^{-\frac{4}{3}t_e} &

187: %	1 - \e^{-\frac{4}{3}t_e}&

188: %	1 - \e^{-\frac{4}{3}t_e}\\

189: %	1 - \e^{-\frac{4}{3}t_e}&

190: %	1 - \e^{-\frac{4}{3}t_e}&

191: %	1 + 3 \e^{-\frac{4}{3}t_e} &

192: %	1 - \e^{-\frac{4}{3}t_e}\\

193: %	1 - \e^{-\frac{4}{3}t_e}&

194: %	1 - \e^{-\frac{4}{3}t_e}&

195: %	1 - \e^{-\frac{4}{3}t_e}&

196: %	1 + 3 \e^{-\frac{4}{3}t_e}

197: %\end{pmatrix}

198: %\]

199: \[

200: Q =

201: \begin{pmatrix}

202: 	-3 \alpha & \alpha & \alpha & \alpha \\

203: 	\alpha & -3 \alpha & \alpha & \alpha \\

204: 	\alpha & \alpha & -3 \alpha & \alpha \\

205: 	\alpha & \alpha & \alpha & -3 \alpha\\

206: \end{pmatrix}

207: \quad \text{ or } \quad

208: \begin{pmatrix}

209: 	\cdot & \gamma & \alpha & \beta\\

210: 	\gamma & \cdot & \beta & \alpha\\

211: 	\alpha & \beta & \cdot & \gamma\\

212: 	\beta & \alpha & \gamma & \cdot

213: \end{pmatrix}

214: \]

215: for JC69 and K81 respectively, where $\cdot = -\gamma - \alpha - \beta$.
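
For the JC69 model the matrix exponential can be written in closed form, which may help in reading the formulas that follow.  The symbols $J$ (the $4 \times 4$ all-ones matrix) and $I$ (the identity) are introduced here only for this illustration.  Since $Q = \alpha (J - 4I)$ and $J^2 = 4J$, the matrix $\frac14 J$ is a projection and we get
\[
M_e = \e^{Qt_e} = \frac14 J + \e^{-4 \alpha t_e} \left( I - \frac14 J \right),
\]
so each diagonal entry of $M_e$ is $\frac14 + \frac34 \e^{-4 \alpha t_e}$ and each off-diagonal entry is $\frac14 - \frac14 \e^{-4 \alpha t_e}$.  If we assume the normalization $\alpha = \frac13$ (an assumption made for this sketch, under which $t_e$ is the expected number of substitutions per site), the exponent becomes $-\frac43 t_e$.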

216: %The diagonal entries of the rate

217: %matrix are chosen to make the rows sum to zero.

218: For a given tree, we write

219: $p_{ijkl} = \Pr(\text{leaf } 0 = i, \text{leaf } 1 = j, \text{leaf }  2 = k,

220: \text{leaf } 3 = l)$ for

221: $i,j,k,l \in \{\texttt{A,C,G,T}\}$ and write $p = (p_{\texttt{AAAA}}, \dots,

222: p_{\texttt{TTTT}})$ for the joint probability distribution.

223: \emph{Phylogenetic invariants} are polynomial equations which are

224: satisfied by the joint probabilities.  For example, $p_{\texttt{AAAA}} = p_{\texttt{CCCC}}$

225: holds for both JC69 and K81, but since this equation is true for all three trees,

226: we will ignore it and similar equations.

227:

228: \begin{example}\rm

229: 	\label{ex:4pt}

230: Consider the four-point condition

231: on a tree metric \cite{Buneman1974}.  It says that if $d$ is a tree metric on $(ij : kl)$, then

232: \begin{equation}\label{eq:4pt}

233: d_{ij} + d_{kl} <  d_{ik} + d_{jl} = d_{il} + d_{jk}.

234: \end{equation}

235: Given a probability distribution $p$, the maximum likelihood Jukes-Cantor distance is

236: \begin{equation} \label{eq:JCdist}

237: d_{ij} = - \frac 3 4 \log \left( 1 - \frac {4 m_{ij}} {3}\right)

238: \end{equation}

239: where $m_{ij}$ is the fraction of mismatches between the sequences at leaves $i$ and $j$, e.g.,

240: \[

241: m_{01} = \sum_{w,x,y,z \in \{\texttt{A,C,G,T}\}, w \neq x} p_{wxyz}.

242: \]

243: After substituting in (\ref{eq:4pt}) and exponentiating, the equality becomes

244: \begin{equation}\label{eq:4ptp}

245: \left(1 - \frac {4} {3} m_{ik}\right) \left(1 - \frac {4} {3} m_{jl}\right) =

246: \left(1 - \frac {4} {3} m_{il}\right) \left(1 - \frac {4} {3} m_{jk}\right).

247: \end{equation}

248: This observation is originally due to Cavender and Felsenstein

249: \cite{Cavender1987}.  The difference of the two sides of (\ref{eq:4ptp}) is a

250: quadratic polynomial in the joint probabilities which we will call the

251: four-point polynomial.

252: \end{example}
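
The step from (\ref{eq:4pt}) to (\ref{eq:4ptp}) can be spelled out as follows, assuming every $m_{ij} < \frac34$ so that the logarithms in (\ref{eq:JCdist}) are defined.  Substituting (\ref{eq:JCdist}) into the equality $d_{ik} + d_{jl} = d_{il} + d_{jk}$ gives
\[
- \frac34 \log \left[ \left(1 - \frac43 m_{ik}\right) \left(1 - \frac43 m_{jl}\right) \right] =
- \frac34 \log \left[ \left(1 - \frac43 m_{il}\right) \left(1 - \frac43 m_{jk}\right) \right],
\]
and multiplying both sides by $-\frac43$ and exponentiating yields (\ref{eq:4ptp}).  The same substitution applied to the inequality in (\ref{eq:4pt}) reverses its direction, because of the multiplication by the negative constant $-\frac43$:
\[
\left(1 - \frac43 m_{ij}\right) \left(1 - \frac43 m_{kl}\right) >
\left(1 - \frac43 m_{ik}\right) \left(1 - \frac43 m_{jl}\right).
\]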

253:

254:

255: \section{Methods}

256: \label{sec:methods}

257: Since both models we consider are \emph{group-based}, it is easiest to work in Fourier

258: coordinates which can roughly be thought of as the $m_{ij}$ coordinates in

259: Example~\ref{ex:4pt} (cf.\ \cite{Hendy1989,Evans1993,Sturmfels2005}).  The

260: website \url{http://www.math.tamu.edu/~lgp/small-trees/} contains lists of

261: invariants for different models on trees with a small number of taxa.

262:

263: Our first task is building a set of invariants for the two models.  The above

264: website shows 33 polynomials (plus two implied linear relations) for

265: JC69 and 8002 polynomials for K81.

266: However,

267: these sets of invariants

268: %have desirable algebraic properties,

269: %they lack one important feature for our use.

270: are not closed under the symmetries of $T_1$.

271: That is, each tree can be

272: written in the plane in eight different ways (for example, the tree $\Ta$ can

273: be written as (01 : 23),  (10 : 23), \dots, (32 : 10)), and each of these induces a

274: different order on the probability coordinates $p_{ijkl}$.  We need a set of

275: invariants which does not change under this reordering if we don't want the

276: resulting algorithm  to depend on the order of the input sequences.

277:

278: \begin{figure}

279: 	\centering

280: 	%"((W:$b,X:$a):$a,Y:$b,Z:$a);";

281: 	\includegraphics[width=.35\textwidth]{figures/4tree-ab-label}

282: 	\caption{The tree used in the simulations. Branch lengths $a$ and $b$

283: 	ranged from $0.01$ to $0.75$ in intervals of $0.02$.}

284: 	\label{fig:param}

285: \end{figure}

286:

287: After closing the sets of invariants under these symmetries, we are left with 49 polynomials for JC69 and

288: 11612 for K81.

289: However, our metric learning algorithms

290: run slowly as the number of invariants grows, so we had to find a subset of

291: a more manageable size.

292: We cut down the K81 invariants by testing each of the

293: 11612 invariants individually on the entire parameter space and only keeping

294: those which had good individual reconstruction rates.

295: Specifically,  we picked several different values for

296: $\gamma, \alpha, \beta$ and kept only those invariants which gave over a 62\%

297: reconstruction rate individually for sequences of length 100.  The result of

298: this calculation is sets of invariants $\bof^{JC69}_i$ and $\bof^{K81}_i$ of

299: cardinality 49 and 52, respectively.

300:

301: \begin{definition}

302: Given a probability distribution $\hat{p}$, and invariants $\bof_i = (f_{i,1},

303: \dots, f_{i,n})$, for tree $T_i$ (for $i = 1,2,3$), let

304: \[

305: \bof_i(\hat{p}) = (f_{i,1}(\hat{p}), \dots, f_{i,n}(\hat{p}))

306: \]

307: be the point in $\rr^{n}$ obtained by evaluating the invariants for $T_i$ at

308: $\hat{p}$.

309: \end{definition}

310:

311: If the probability distribution $\hat{p}$ actually comes from the model $T_1$,

312: then we will have $\bof_1(\hat{p}) = 0$, $\bof_2(\hat{p}) \neq 0$, and $\bof_3(\hat{p})

313: \neq 0$ generically (that is, except for points $\hat{p}$ which lie on the

314: intersection of two or more models).  This fact suggests that we can just pick

315: the tree $T_i$ such that $\bof_i$ is closest to zero.  However, the next example

316: shows that it is quite important to pick good polynomials and weigh them properly.

317:

318: \begin{example}\rm

319: Figure~\ref{fig:hist} shows the distribution of four of the invariants from $\bof^{JC69}_i$

320: on data from simulations

321: of 1000 i.i.d.\

322: draws from the Jukes-Cantor model on $T_1$ over a varying set of parameters.

323: The histograms show the distributions for the simulated tree ($T_1$) in yellow

324: and the distributions for the other trees in gray and black.

325: \begin{figure}

326: 	\centering

327: 	\includegraphics[width=.5\textwidth]{figures/4poly-hist.pdf}

328: 	\caption{Distributions of four polynomials $f_{i,10}, f_{i,48}, f_{i,23}$

329: 	and $f_{i,45}$ on simulated data.

330: 	The yellow histogram corresponds to the correct tree, the black and gray are

331: 	the other two trees.}

332: 	\label{fig:hist}

333: \end{figure}

334:

335: Polynomial 10 (upper left) distinguishes nicely between the three trees

336: with the correct tree tightly distributed around zero. It is correct 97\% of

337: the time on our space of trees (Figure~\ref{fig:param}). Polynomial 48 (upper right) also shows power to

338: distinguish between all three trees, but

339: the distributions are much more overlapping --- it is only correct 50.8\% of the time.

340: Polynomial 10 is the four-point invariant from Example~\ref{ex:4pt}; polynomial

341: 48 is one of Lake's linear invariants.

342: The two other examples show a polynomial (23) which is biased towards selecting

343: the wrong tree (only 16\% correct), and a polynomial (45) for which the correct tree is tightly

344: clustered around zero, but the incorrect trees are indistinguishable and have wide

345: variance (88.9\% correct).

346:

347: %\begin{figure}

348: %	\centering

349: %	\includegraphics[width=4in]{figures/myRank_inv}

350: %	\caption{Performance of the individual Jukes-Cantor invariants.}

351: %	\label{fig:indiv}

352: %\end{figure}

353:

354: The parameters used for the simulations are described in

355: Figure~\ref{fig:param}.  Since 1000 samples should be quite enough to determine

356: the structure of a tree on four taxa, it is revealing that many of the

357: individual polynomials are quite poor

358: %(see Figure~\ref{fig:indiv} for the individual prediction rates for each polynomial)

359: (the mean prediction rate for all 49 polynomials is only 42\%).

360: The invariants have quite different variances and means and it is not optimal

361: to take each one with equal weight.

362: \end{example}

363:

364:

365: %\section{Metric learning}

366: %\label{sec:3}

367: %In this section we propose a learning algorithm to find a

368: %metric on the invariants which optimizes their power.

369:

370: This example shows that we need to scale and weigh the individual invariants.

371: Recall that for a positive (semi)definite matrix $A$ %\in \rr^{d^2}$,

372: the Mahalanobis (semi)norm $\|\cdot \|_A$ is defined by

373: \[

374: \| x \|_A = \sqrt{x^t A x}.

375: \]

376: %It satisfies (i) $\|x\|_A\geq 0$, (ii) $\|c x\|_A = |c|\|x\|_A$ (for $c\in

377: %\rr$), and (iii) $\|x+y\|_A\leq \|x\|_A+\|y\|_A$. Note that in (i) it is

378: %possible to have $\|x\|_A= 0$ with $x\neq 0$ in contrast to a norm $\|\cdot \|$

379: %which satisfies $\|x\|=0$ if and only if $x=0$.

380: Notice that since $A$ is symmetric and positive semidefinite, it can be written as $A = U D

381: U^t$ where $U$ is orthogonal and $D$ is diagonal with non-negative entries.

382: Thus the positive semidefinite square root $B = U \sqrt{D} U^t$ is well defined and unique.  Now since $\|x\|_A^2

383: = x^tAx = (Bx)^t(Bx)=\|Bx\|^2$, we can view learning such a metric as finding a

384: transformation of the space of invariants that replaces each point $x$ with $Bx$ under the

385: Euclidean norm.

386: Accordingly, we will be searching for a positive semidefinite

387: matrix $A$ on the space of invariants which is ``optimal''.
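
To make this change of coordinates concrete, consider the special case of a diagonal matrix $A = \diag(w_1, \dots, w_n)$ with $w_k \geq 0$ (the weights $w_k$ are notation introduced only for this illustration).  Then $U$ can be taken to be the identity, $B = \diag(\sqrt{w_1}, \dots, \sqrt{w_n})$, and
\[
\|x\|_A^2 = \sum_{k=1}^n w_k x_k^2 = \sum_{k=1}^n \left(\sqrt{w_k}\, x_k\right)^2 = \|Bx\|^2,
\]
so learning a diagonal metric amounts to rescaling each invariant separately, and setting $w_k = 0$ discards the $k$th invariant altogether.  A non-diagonal $A$ additionally allows an orthogonal change of basis $U^t$ to be applied to the invariants before they are rescaled.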

388:

389: Let

390: $\hpt$ be an empirical probability distribution generated from a phylogenetic

391: model on tree $T_1$ with parameters $\theta$.

392: We wish to find $A$ such that the condition

393: \[

394: \|\bof_1(\hpt)\|_A < \min\left( \|\bof_2(\hpt)\|_A, \|\bof_3(\hpt)\|_A \right)

395: \]

396: holds for most $\hpt$ with $\theta$ chosen from a suitable parameter space $\Theta$.

397:

398: Now suppose that $\Theta$ is a finite set of parameters from which we generate

399: training data $\bof_1(\hpt), \bof_2(\hpt), \bof_3(\hpt)$ for $\theta \in \Theta$.

400: As we saw above, each of the eight possible ways of writing

401: each tree induces a signed permutation of the coordinates of each

402: $\bof_i(\hpt)$.  We write these permutations in matrix form as $\pi_1, \dots,

403: \pi_8$.

404: Given this training data, we wish to solve the following optimization problem.

405: %For this purpose, the training data are generated as a set of triples

406: %$\{(x_1(\theta), x_2(\theta), x_3(\theta))\in \rr^{3d}: \theta = (\phi_e,\pi_e)

407: %\in \Theta\}$, where $\Theta$ is a finite set and $d=49$ in our setting.

408:

409: \begin{alg}\rm

410: 	\label{alg:meta}

411: 	{\bf (Metric learning for invariants)}

412:

413: 	\noindent {\em Input:}  model invariants $\bof_i$ for $T_i$

414: 	and a finite set $\Theta$ of model parameters.

415:

416: 	\noindent {\em Output:} a positive semidefinite matrix $A$.

417:

418: 	\noindent {\em Procedure:}

419: 	\begin{enumerate}

420: 		\item Reduce the sets $\bof_i$ of invariants to a manageable size by testing individual invariants

421: 			on data simulated from $\theta \in \Theta$.

422: 		\item Augment the resulting sets so that they are closed under the eight

423: 			permutations of the input which fix tree $T_1$.

424: 		\item Compute the signed permutations $\pi_1, \dots, \pi_8$ which are induced on the

425: 			invariants $\bof_1$ by the above permutations.

426: 		\item Solve the following semidefinite programming problem:

427: 			\begin{equation*}

428: 				\begin{array}{ll}

429: 					%\text{Given: }& \bof_1, \bof_2, \bof_3

430: 					%\text{ and a finite set } \Theta \text{ of parameters,}\\

431: 					\text{Minimize: } &\sum_{\theta \in \Theta} \xi(\theta) + \lambda \tr A\\

432: 					\text{Subject to: } \quad

433: 					&\| \bof_1 (\hpt) \|_A^2   + \gamma \leq \min\left( \| \bof_2 (\hpt)\|_A^2 ,  \|\bof_3(\hpt) \|_A^2 \right)  + \xi(\theta) \quad\text{for all } \theta \in \Theta,\\

434: 					& \pi_i A = A \pi_i \quad\text{for } 1 \leq i \leq 8,\\

435: 					&A \succeq 0, \quad\text{and}\\

436: 					&\xi(\theta)  \geq 0,

437: 				\end{array}

438: 			\end{equation*}

439: 			where $A\succeq 0$ denotes that $A$ is a positive semidefinite

440: 			matrix.

441: 		\item Alternatively, if we restrict $A$ to be

442: 			diagonal, this becomes a linear program and can be solved for much

443: 			larger sets of invariants and parameters.

444: 	\end{enumerate}

445: \end{alg}

446:

447: In the optimization step, we use a regularization parameter $\lambda$ to keep $A$

448: small and a margin parameter $\gamma$ to increase the margin between the

449: distributions.  This is a convex optimization problem with a linear objective

450: function and linear matrix equality and inequality constraints. Hence it is a

451: semidefinite programming (SDP) problem.

452: The SDP problem above has a unique optimizer and can be solved in polynomial

453: time. Its complexity depends on the size of the set $\Theta$ since each

454: point in $\Theta$ contributes a linear constraint.
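
To see why the constraints are linear in $A$, note that for any fixed vector $x$ we have $\|x\|_A^2 = x^t A x = \tr(A x x^t)$, which is a linear function of the entries of $A$.  Writing $x_i = \bof_i(\hpt)$ (a shorthand used only in this paragraph), the constraint involving the minimum is therefore equivalent to the pair of linear inequalities
\[
\tr\left( A \left( x_j x_j^t - x_1 x_1^t \right) \right) + \xi(\theta) \geq \gamma, \qquad j = 2, 3.
\]
If $A = \diag(w_1, \dots, w_n)$ is restricted to be diagonal (again, $w_k$ is our illustrative notation), these inequalities become
\[
\sum_{k=1}^n w_k \left( f_{j,k}(\hpt)^2 - f_{1,k}(\hpt)^2 \right) + \xi(\theta) \geq \gamma, \qquad w_k \geq 0,
\]
with $\tr A = \sum_k w_k$, and the symmetry constraints $\pi_i A = A \pi_i$ reduce to requiring $w_k = w_{k'}$ whenever $\pi_i$ sends coordinate $k$ to $\pm$ coordinate $k'$.  Every constraint and the objective are then linear in $(w, \xi)$, which is the linear program mentioned in the last step of Algorithm~\ref{alg:meta}.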

455:

456:

457: For the range of parameters we consider in Section~\ref{sec:results}, we use

458: $\#(\Theta)=1444$.  To solve the SDP in our experiments, we use SeDuMi 1.1

459: \cite{Sturm1999} or DSDP5

460: \cite{BenYe05} with YALMIP \cite{Lofberg2004} as the parser.

461: Matlab code to implement the above algorithm can be found at

462: \url{http://math.stanford.edu/~yuany/metricPhylo/matlab/}.

463:

464: We found that although SeDuMi often runs into numerical issues, it generally

465: finds a good matrix $A$ whose performance is competitive with the neighbor-joining

466: algorithm. DSDP is more numerically stable, at the cost of

467: more computational time. We have found that setting $\lambda = 0.0001$ and

468: $\gamma = 0.005$ gives good results in our situation. As an example of the running times, in the case

469: of the JC69 model with a $49\times49$ semidefinite matrix $A$, YALMIP-SeDuMi takes 55.7

470: minutes to parse the constraints and solve the SDP, while YALMIP-DSDP takes

471: 167.1 minutes to finish the same job.  For details on experiments, see the next

472: section.

473:

474: Our algorithm was inspired by some early results on metric learning algorithms

475: such as \cite{XingNg03} and \cite{ShaSinNg04}, which aim to find a

476: (pseudo)-metric such that the mutual distances between similar examples are

477: minimized while the distances across dissimilar examples or classes are kept

478: large. Direct application of such an algorithm is not quite suitable in our setting.

479: As shown in Figure~\ref{fig:hist},

480: %a global view on the distribution of all three classes of data discloses that

481: the points from the correct tree overlap with those from the two incorrect trees. For

482: points in the overlapping region, it is hard to tell whether to shrink or

483: stretch their mutual distance. However, when the points appear in triples, it

484: is possible that for each triple the one closest to zero is generated from the

485: correct tree. Our algorithm is based on such an intuition and proved successful

486: in experiments.

487:

488: After using Algorithm~\ref{alg:meta} to find a good metric $A$, the following

489: simple algorithm allows us to construct trees on four taxa.

490:

491: \begin{alg}\rm

492: \label{alg:1}

493: {\bf (Tree construction with invariants)}

494:

495: \noindent {\em Input:}

496: A multiple alignment of 4 species and a semidefinite matrix $A$ from Algorithm~\ref{alg:meta}.

497: %invariant under

498: %the signed permutations $\pi_1, \dots, \pi_8$.

499:

500: \noindent {\em Output:}

501: A phylogenetic tree on the 4 species (without branch lengths).

502:

503: \noindent {\em Procedure:}

504: \begin{enumerate}

505: 	\item Form empirical distributions $\hat{p}$ by counting columns of the alignment.

506: 	\item Form the vectors $\bof_i = (f_{i,1}(\hat{p}), \dots, f_{i,n}(\hat{p}))$ for $1 \leq i \leq 3$.

507: 	\item Return $T_i$ where %$i$ is such that

508: 		the vector $\bof_i$ has smallest $A$-norm $\|\bof_i\|_A = \sqrt{\bof_i^t A \bof_i}$.

509: \end{enumerate}

510: \end{alg}
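
As a toy illustration of the last step of Algorithm~\ref{alg:1}, consider made-up values for $n = 2$ invariants and a hypothetical diagonal metric $A = \diag(9, \frac14)$; none of these numbers come from our simulations.  The table lists the evaluated invariants together with the squared Euclidean norm and the squared $A$-norm:
\[
\begin{array}{c|cc|cc}
 & f_{i,1}(\hat{p}) & f_{i,2}(\hat{p}) & \|\bof_i\|^2 & \|\bof_i\|_A^2 \\
\hline
T_1 & 0.05 & 0.40 & 0.1625 & 0.0625\\
T_2 & 0.30 & 0.10 & 0.1000 & 0.8125\\
T_3 & 0.30 & 0.20 & 0.1300 & 0.8200
\end{array}
\]
With the unweighted norm the algorithm would return $T_2$, whereas with the $A$-norm it returns $T_1$.  The difference arises because $A$ puts most of its weight on the first invariant, which is close to zero only for $T_1$, and down-weights the second invariant.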

511:

512:

513: %\begin{figure}

514: %\centering

515: %%\includegraphics[width=.5\textwidth]{figures/RedonTop.jpg}

516: %%\includegraphics[width=.5\textwidth]{figures/RedonBottom1.jpg}

517: %\includegraphics[width=.75\textwidth]{figures/RedonBottom}

518: %\caption{Projections of the three data sets onto the first two eigenspaces of

519: %the metric $A$.  Red circles correspond to the correct tree, blue dots and

520: %green x's to the two incorrect trees.}

521: %\label{fig:eigen}

522: %\end{figure}

523:

524:

525: \section{Results}

526: \label{sec:results}

527:

528: We tested our metric learning algorithms for the invariants $\bof^{JC69}$ and

529: $\bof^{K81}$ as described above.

530: We trained two metrics for JC69 on

531: the tree in Figure~\ref{fig:param}.

532: The first used simulations from branch lengths between $0.01$ and $0.75$ on a

533: grid with increments of magnitude $0.02$, for a total of $1444$ different

534: parameters.

535: This region of parameter space was chosen for direct comparison with

536: \cite{Huelsenbeck1995,Casanellas2006}.

537: The second metric used parameters $0.01 \leq a \leq 0.25$ and $0.51 \leq b \leq

538: 0.75$ with increments of $0.01$, giving $625$ trees with very short interior edges.

539: Similarly for K81, we learned a metric using parameters $0.01 \leq a,b \leq 0.75$ and

540: several sets of $\gamma, \alpha, \beta$.

541: After learning metrics, we performed simulation tests on the same space of

542: parameters that was used to train.  We compared neighbor-joining

543: \cite{Saitou1987}, phylogenetic invariants with the $l_1$ or $l_2$ norm and

544: phylogenetic invariants with our learned norms.

545:

546: Since the edge lengths are large for part of the parameter space, we often see

547: simulated alignments with more than 75\% mismatches between pairs of taxa. In

548: such a case, (\ref{eq:JCdist}) returns infinite distance estimates under the

549: Jukes-Cantor

550: model.  So the results depend on how neighbor joining treats

551: infinite distances.  In PHYLIP

552: \cite{PHYLIP}, the program \texttt{dnadist} doesn't return a distance

553: matrix if some distances are infinite.  However, in PAUP*, infinite distances are set to a fixed large

554: number.

555: Since we are only concerned with the tree topology, we believe that the most

556: fair comparison is between phylogenetic invariants and the method of PAUP*.

557: However, it should be noted that this can make a major difference in results using neighbor joining, since often the correct tree can be returned even if some distances are infinite. See Figure~\ref{fig:contour} for an example of the difference, and be warned that comparison between simulation studies done in different ways is difficult.

558:

559: Table~\ref{tab:JC} shows the results of 100 simulations at each of the 1444

560: parameter values for various sequence lengths using the JC69 model.  It gives

561: the percent correct over all 144,400 trials for

562: four different methods: invariants with $l_1$, $l_2$, and $A$-norms and

563: neighbor joining (using Jukes-Cantor distances and allowing infinite distances).  The contour

564: plots in Figure~\ref{fig:contour} show how the reconstruction rates vary across

565: parameter space for a sequence length of 100.

566: Notice that the $A$-norm shows

567: particularly good behavior over the entire range of parameters, even in the

568: ``Felsenstein zone'' in the upper left corner.

569: When trained on the Felsenstein zone, the learned metric can perform even

570: better.  The right half of Table~\ref{tab:JC} shows the result of training a metric on this zone.

571: Notice that the $A$-norm is now quite a bit better than neighbor joining, even

572: though the $l_1$ and $l_2$ norms are terrible.  However, this learned norm is slightly worse on the whole parameter space than the metric trained on the whole space.

573:

574: \begin{table}

575: 	\centering

576: 	\begin{tabular}{cccccl@{\hspace{1cm}}ccccc}

577: 		\multicolumn{5}{c}{Full parameter space} & &

578: 		\multicolumn{5}{c}{Felsenstein zone}\\

579: 		Length & $l_1$  & $l_2$ & $A$-norm  &  NJ & &

580: 		Length & $l_1$  & $l_2$ & $A$-norm  &  NJ\\

581: 		\cline{1-5} \cline{7-11}

582: 		 25 & 62.7  & 59.8  & 74.5  & 75.5 &&   25 & 31.8 & 31.3 & 58.9 & 52.5\\

583: 		 50 & 71.9  & 66.3  & 85.0  & 85.9 &&   50 & 35.9 & 33.0 & 69.9 & 63.2\\

584: 		 75 & 76.7  & 69.6  & 90.0  & 90.4 &&   75 & 39.2 & 34.4 & 76.5 & 69.5\\

585: 		100 & 79.8  & 72.0  & 92.7  & 92.9 &&  100 & 42.2 & 35.8 & 81.2 & 73.9\\

586: 		200 & 86.4  & 77.6  & 97.0  & 96.6 &&  200 & 50.9 & 39.1 & 90.1 & 83.5\\

587: 		300 & 89.2  & 80.1  & 98.2  & 97.7 &&  300 & 55.4 & 40.4 & 93.6 & 87.8\\

588: 		400 & 91.1  & 82.1  & 98.7  & 98.2 &&  400 & 59.5 & 41.4 & 95.1 & 90.2\\

589: 		500 & 92.3  & 83.5  & 99.0  & 98.4 &&  500 & 62.4 & 42.5 & 95.6 & 91.4\\

590: 		\cline{1-5} \cline{7-11}

591: 	\end{tabular}

592: 	\caption{Percent of trials reconstructed correctly for the Jukes-Cantor

593: 	model over the entire parameter space and the Felsenstein zone for the

594: 	respective metrics.}

595: 	\label{tab:JC}

596: \end{table}

597:

598: Table~\ref{tab:K81} shows results for the K81 model under two choices of $(\gamma, \alpha, \beta)$.

599: We only report the $l_2$ scores, since the $l_1$ scores are similar.  Of note

600: is the column ``$l_2$ restrict'' which shows the $l_2$ norm on the top $52$

601: invariants as ranked by individual power on simulations as in the previous

602: section.  This column is better than the $l_2$ norm on all $11612$ invariants,

603: showing that many invariants are actually harmful.

604: The $A$-norm again improves on even the restricted $l_2$ and beats

605: neighbor-joining (run with K81 distances) on all examples.

606:

607: \begin{table}

608: 	\centering

609: 	\begin{tabular}{cccccl@{\hspace{1cm}}ccccc}

610: %$\gamma,\alpha, \beta$ &

611: 		\multicolumn{5}{c}{$(\gamma, \alpha, \beta) = (0.1, 3.0, 0.5)$}&&

612: 		\multicolumn{5}{c}{$(\gamma, \alpha, \beta) = (0.2, 0.5, 0.3)$}\\

613: Length & $l_2$ & $l_2$ restrict & $A$-norm & NJ &&

614: Length & $l_2$ & $l_2$ restrict & $A$-norm & NJ\\

615: \cline{1-5} \cline{7-11}

616:  25 & 59.2 & 66.9 & 71.5 & 62.7 && 25 & 63.7 & 67.4 & 70.9 & 65.0 \\

617:  50 & 68.0 & 77.5 & 82.1 & 72.7 && 50 & 71.6 & 77.5 & 81.1 & 74.2 \\

618:  75 & 73.4 & 82.7 & 86.9 & 79.3 && 75 & 75.4 & 82.7 & 86.4 & 80.5 \\

619: 100 & 76.8 & 85.8 & 89.7 & 82.6 &&100 & 77.6 & 85.8 & 89.3 & 83.4 \\

620: 200 & 84.6 & 90.8 & 94.3 & 90.1 &&200 & 83.3 & 91.3 & 94.1 & 89.7 \\

621: 300 & 88.2 & 92.6 & 95.6 & 93.1 &&300 & 86.4 & 93.2 & 95.7 & 91.8 \\

622: 400 & 90.1 & 93.6 & 96.4 & 94.8 &&400 & 88.5 & 94.3 & 96.5 & 93.0 \\

623: 500 & 91.4 & 94.2 & 96.9 & 95.7 &&500 & 90.0 & 93.7 & 96.3 & 93.2 \\

624: \cline{1-5} \cline{7-11}

625: 	\end{tabular}

626: 	\caption{Percent of trials reconstructed correctly

627: 	for the Kimura 3-parameter model over the entire parameter space for two choices of $\gamma, \alpha, \beta$.}

628: 	\label{tab:K81}

629: \end{table}

630:

631: \begin{figure}

632: 	\centering

633: 	\begin{tabular}{cc}

634: 	\includegraphics[height=.35\textheight]{figures/contour-l1} &

635: 	%\includegraphics[height=.25\textheight]{figures/contour-l2}\\

636: 	\includegraphics[height=.35\textheight]{figures/contour-A} \\

637: 	\includegraphics[height=.35\textheight]{figures/contour-NJinf} &

638: 	%\multicolumn{2}{c}{

639: 	\includegraphics[height=.35\textheight]{figures/contour-NJ} \\

640: \end{tabular}

641: 	\caption{Contour plots for the three reconstruction methods for the

642: 	Jukes-Cantor model over parameter space with alignments of length 100.

643: 	Black areas correspond to parameters

644: 	$(a,b)$ for which the tree was reconstructed correctly  over 95\% of the

645: 	time, gray for over 50\%, light gray for over 33\%, and white for under 33\%.  }

646: 	\label{fig:contour}

647: \end{figure}

648:

649:

650: \section{Discussion}

651: \label{sec:dis}

652:

653: We have shown that machine learning algorithms can substantially improve the

654: tree construction performance of phylogenetic invariants.  As an example, for

655: sequences of length 100, the four-point invariant (Example~\ref{ex:4pt}) for

656: the K81 model is correct 82\%  of the time on data simulated from K81 with

657: parameters $(0.1, 3.0, 0.5)$.  This is roughly five percentage points better than the

658: $l_2$ norm on all 11612 invariants (76.8\%, Table~\ref{tab:K81}).

659:

660: The paper \cite{Casanellas2007} describes an algebraic method for picking a

661: subset of invariants for the K81 model.  The authors reduce to 48 invariants, which give

662: an improvement over all 11612 invariants (up to 82.6\% on the above

663: example using the $l_2$ norm).  However, of these 48, only 4 of them are among

664: the top 52 we selected for $\bof^{K81}$, and the remaining 44 invariants are

665: mostly quite poor (42\% average accuracy). After taking the closure of these 48 invariants, there are 156

666: total and the performance actually drops to 78.3\%.  It seems that the

667: conditions for an invariant to be powerful are not particularly related to the

668: algebraic criterion used in \cite{Casanellas2007}.

669:

670: All invariant-based methods depend heavily on the set of invariants that we

671: begin with.  Learning diagonal matrices $A$ had mixed performance, which further

672: suggests that the generating set we are using for the invariants is

673: non-optimal.  We believe that it is an important mathematical problem to

674: understand what properties are shared by the good invariants.  We suggest that

675: symmetry might be an important criterion to construct other polynomials like

676: the four-point condition with good power.

677:
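To make the role of the generating set concrete, the following sketch
compares diagonal and full matrices.  It assumes, purely for illustration,
that the $A$-norm of a vector of invariants $\bof = (f_1, \dots, f_n)$ is
given by the quadratic form $\|\bof\|_A^2 = \bof^{\top} A \bof$ with $A$
positive semidefinite; the symbols $a_i$, $B$, and $g_j$ are introduced here
and are not notation from the rest of the paper.  If $A = \diag(a_1, \dots,
a_n)$ with $a_i \geq 0$, then
\[
	\|\bof\|_A^2 = \sum_{i=1}^{n} a_i f_i^2 ,
\]
so learning can only reweight the generators we started with.  For a general
positive semidefinite matrix, written as $A = B^{\top} B$, we instead get
\[
	\|\bof\|_A^2 = \|B \bof\|_2^2 = \sum_{j} g_j^2 ,
	\qquad g_j = \sum_{i=1}^{n} B_{ji} f_i ,
\]
which is the $l_2$ norm of new invariants $g_j$ that are linear combinations
of the original ones.  Under this assumption, a full matrix can in effect move
to a different generating set within the linear span of the $f_i$, while a
diagonal matrix cannot.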

678: The learned metrics in this paper are somewhat dependent on the

679: parameters chosen to train them.  This can be a benefit, as it allows us to

680: train tree construction algorithms for specific regions of parameter space

681: (e.g., the Felsenstein zone).  However, we hope that improvements to the metric

682: programming will allow us to train on larger parameter sets and thus obtain

683: uniformly better algorithms.

684:

685: Notice that these methods only recover the tree topology, not the edge lengths.

686: We believe that if the edge lengths are needed, they should be estimated after

687: building the tree, in which case standard statistical methods such as maximum

688: likelihood can be used easily.

689: While the invariants discussed in this paper may not be practical for large

690: trees, we believe there is great use in understanding fully the problem of

691: building trees on four taxa.  For example, these methods can either be used

692: as an input to quartet-based tree construction algorithms or as a verification

693: step for larger phylogenetic trees.

694:

695: For a method of building trees on more than four taxa using phylogenetic

696: invariants, see \cite{Eriksson2005b,Kim2006}, which use numerical linear

697: algebra to evaluate invariants given by rank conditions on certain matrices in

698: order to construct phylogenetic trees.  This amounts to evaluating many

699: polynomials at once, allowing the method to run in polynomial time.

700:
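As a rough sketch of how a rank condition replaces many polynomial
evaluations (the notation $F$, $M$, and $\sigma_i$ is ours, chosen for
illustration), consider four taxa and the split $\{1,2\}|\{3,4\}$.  Arrange
the $256$ pattern probabilities $p_{ijkl}$ into a $16 \times 16$ flattening
matrix $F$ with rows indexed by $(i,j)$ and columns by $(k,l)$.  Assuming a
general Markov model on a tree containing this split, $F$ has rank at most
$4$, so all $\binom{16}{5}^2 = 19{,}079{,}424$ of its $5 \times 5$ minors are
invariants.  Rather than evaluating each minor, one can measure how far $F$
is from satisfying all of them at once: by the Eckart--Young theorem,
\[
	\min_{\mathrm{rank}(M) \leq 4} \|F - M\|_F
	= \sqrt{\sum_{i=5}^{16} \sigma_i^2} ,
\]
where $\sigma_1 \geq \dots \geq \sigma_{16}$ are the singular values of $F$.
A single singular value decomposition therefore scores the split without
listing the individual polynomials.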

701: The matrices $A$ for JC69 and K81 used in the tests can be found at

702: \url{http://stanford.edu/~nke/data/metricPhylo}.  A software package that can

703: run these tests is available at the same website.  It includes a program for

704: simulating evolution using any Markov model on a tree and several programs

705: using phylogenetic invariants.

706:

707: \section*{Acknowledgments}

708: N.~Eriksson was supported by NSF grant DMS-0603448 and wishes to thank MSRI and

709: the IMA for their hospitality. Y.~Yao was supported by DARPA grant 1092228.  We

710: wish to thank E.\ Allman, M.\ Drton, D.\ Ge, F.\ Memoli, L.\ Pachter, J.\

711: Rhodes, and Y.\ Ye for helpful comments and especially G.\ Carlsson for his

712: encouragement and computational resources.

713:

714: \bibliographystyle{alpha}

715: %\bibliography{nikos}

716:

717: \begin{thebibliography}{KKPP06}

718:

719: \bibitem[AR07]{Allman2007}

720: E~Allman and J~Rhodes.

721: \newblock Molecular phylogenetics from an algebraic viewpoint, 2007.

722: \newblock To appear.

723:

724: \bibitem[Bun74]{Buneman1974}

725: Peter Buneman.

726: \newblock A note on the metric properties of trees.

727: \newblock {\em J. Combinatorial Theory Ser. B}, 17:48--50, 1974.

728:

729: \bibitem[BY05]{BenYe05}

730: Steven~J. Benson and Yinyu Ye.

731: \newblock {DSDP5}: Software for semidefinite programming.

732: \newblock Technical Report ANL/MCS-P1289-0905, Mathematics and Computer Science

733:   Division, Argonne National Laboratory, Argonne, IL, September 2005.

734: \newblock Submitted to ACM Transactions on Mathematical Software.

735:

736: \bibitem[CF87]{Cavender1987}

737: J~Cavender and J~Felsenstein.

738: \newblock Invariants of phylogenies in a simple case with discrete states.

739: \newblock {\em Journal of Classification}, 4:57--71, 1987.

740:

741: \bibitem[CFS06]{Casanellas2006}

742: M.~Casanellas and J.~Fern\'{a}ndez-S\'{a}nchez.

743: \newblock {Performance of a New Invariants Method on Homogeneous and

744:   Non-homogeneous Quartet Trees}.

745: \newblock {\em Mol Biol Evol}, page msl153, 2006.

746:

747: \bibitem[CFS07]{Casanellas2007}

748: M.~Casanellas and J.~Fern\'{a}ndez-S\'{a}nchez.

749: \newblock {Geometry of the Kimura 3-parameter model}, 2007.

750: \newblock Available at arXiv:math.AG/0702834.

751:

752: \bibitem[Eri05]{Eriksson2005b}

753: Nicholas Eriksson.

754: \newblock Tree construction using singular value decomposition.

755: \newblock In L.~Pachter and B.~Sturmfels, editors, {\em Algebraic Statistics

756:   for Computational Biology}, chapter~19, pages 347--358. Cambridge University

757:   Press, Cambridge, UK, 2005.

758:

759: \bibitem[ES93]{Evans1993}

760: S~Evans and T~Speed.

761: \newblock Invariants of some probability models used in phylogenetic inference.

762: \newblock {\em The Annals of Statistics}, 21:355--377, 1993.

763:

764: \bibitem[Fel04]{PHYLIP}

765: J~Felsenstein.

766: \newblock {PHYLIP (Phylogeny Inference Package) version 3.6}.

767: \newblock Distributed by the author, Department of Genome Sciences, University

768:   of Washington, Seattle, 2004.

769:

770: \bibitem[HP89]{Hendy1989}

771: M~Hendy and D~Penny.

772: \newblock A framework for the quantitative study of evolutionary trees.

773: \newblock {\em Systematic Zoology}, 38(4), 1989.

774:

775: \bibitem[Hue95]{Huelsenbeck1995}

776: John~P. Huelsenbeck.

777: \newblock Performance of phylogenetic methods in simulations.

778: \newblock {\em Syst Biol}, 44(1):17--48, 1995.

779:

780: \bibitem[JC69]{Jukes1969}

781: TH~Jukes and C~Cantor.

782: \newblock Evolution of protein molecules.

783: \newblock In HN~Munro, editor, {\em Mammalian Protein Metabolism}, pages

784:   21--132. Academic Press, New York, 1969.

785:

786: \bibitem[Kim81]{Kimura1981}

787: M~Kimura.

788: \newblock Estimation of evolutionary distances between homologous nucleotide

789:   sequences.

790: \newblock {\em Proceedings of the National Academy of Sciences, USA},

791:   78:454--458, 1981.

792:

793: \bibitem[KKPP06]{Kim2006}

794: Young~Rock Kim, Oh-In Kwon, Seong-Hun Paeng, and Chun-Jae Park.

795: \newblock Phylogenetic tree constructing algorithms fit for grid computing with

796:   {SVD}.

797: \newblock Available at \url{http://arxiv.org/abs/q-bio.QM/0611015}, 2006.

798:

799: \bibitem[L\"04]{Lofberg2004}

800: J.~L\"ofberg.

801: \newblock {YALMIP} : a toolbox for modeling and optimization in {MATLAB}.

802: \newblock In {\em Computer Aided Control Systems Design}, pages 284--289, 2004.

803:

804: \bibitem[Lak87]{Lake1987}

805: JA~Lake.

806: \newblock A rate-independent technique for analysis of nucleic acid sequences:

807:   evolutionary parsimony.

808: \newblock {\em Molecular Biology and Evolution}, 4:167--191, 1987.

809:

810: \bibitem[SN87]{Saitou1987}

811: N~Saitou and M~Nei.

812: \newblock The neighbor-joining method: a new method for reconstructing

813:   phylogenetic trees.

814: \newblock {\em Molecular Biology and Evolution}, 4(4):406--425, 1987.

815:

816: \bibitem[SS05]{Sturmfels2005}

817: Bernd Sturmfels and Seth Sullivant.

818: \newblock {T}oric ideals of phylogenetic invariants.

819: \newblock {\em J Comput Biol}, 12(4):457--481, May 2005.

820:

821: \bibitem[SSSN04]{ShaSinNg04}

822: Shai Shalev-Shwartz, Yoram Singer, and Andrew~Y. Ng.

823: \newblock Online learning of pseudo-metrics.

824: \newblock In {\em Proceedings of the Twenty-first International Conference on

825:   Machine Learning}, 2004.

826:

827: \bibitem[Stu99]{Sturm1999}

828: J.F. Sturm.

829: \newblock Using {SeDuMi} 1.02, a {MATLAB} toolbox for optimization over

830:   symmetric cones.

831: \newblock {\em Optimization Methods and Software}, 11--12:625--653, 1999.

832: \newblock Special issue on Interior Point Methods (CD supplement with

833:   software).

834:

835: \bibitem[XNJR03]{XingNg03}

836: Eric Xing, Andrew~Y. Ng, Michael Jordan, and Stuart Russell.

837: \newblock Distance metric learning, with application to clustering with

838:   side-information.

839: \newblock In {\em NIPS}, 2003.

840:

841: \end{thebibliography}

842: \end{document}

843:

844: