1: %\documentclass[12pt,envcountsame]{llncs}
2: \documentclass[12pt]{article}
3: %\pagenumbering{arabic}
4: %\pagestyle{plain}
5: \usepackage{fullpage}
6:
7: \usepackage{amsthm}
8: \usepackage{amsmath}
9: \usepackage{amsfonts}
10: \usepackage{graphicx}
11: \usepackage{url}
12:
13: %\spnewtheorem{alg}[theorem]{Algorithm}{\bfseries}{}%{\itshape}
14: %\spnewtheorem{op}[theorem]{Problem}{\bfseries}{}
15: \newtheorem{thm}{Theorem}
16: \newtheorem{lemma}[thm]{Lemma}
17: \newtheorem{prop}[thm]{Proposition}
18: \newtheorem{cor}[thm]{Corollary}
19: \newtheorem{prob}[thm]{Problem}
20: \newtheorem{conj}[thm]{Conjecture}
21: \newtheorem{alg}[thm]{Algorithm}
22: \newtheorem{rmk}[thm]{Remark}
23:
24: \newtheorem{example}[thm]{Example}
25: \newtheorem{definition}[thm]{Definition}
26:
27:
28: \newcommand{\sM}{\mathcal M}
29: \newcommand{\cE}{\mathcal{E}}
30: \newcommand{\cG}{\mathcal{G}}
31: \newcommand{\NN}{\mathbb{N}}
32: \newcommand{\rr}{\mathbb{R}}
33: \newcommand{\e}{\mathrm{e}}
34: \newcommand{\tr}{\mathrm{tr}}
35: \newcommand{\bof}{\mathbf{f}}
36: \newcommand{\Ta}{T_1}
37: \newcommand{\Tb}{T_2}
38: \newcommand{\Tc}{T_3}
39: \newcommand{\hpt}{\hat{p}(\theta)}
40: \newcommand{\hpi}{\hat{p}_{i}}
41: \newcommand{\hpb}{\hat{p}_{2}}
42: \newcommand{\hpc}{\hat{p}_{3}}
43: \newcommand{\diag}{\mathrm{diag}}
44:
45: \begin{document}
46: \title{Metric learning for phylogenetic invariants}
47: %\author{Nicholas Eriksson\inst{1}\fnmsep*\and Yuan Yao\inst{2}}
48: %\institute{\url{nke@stanford.edu}\\ Department of Statistics, \\Stanford University, \\Stanford, CA 94305-4065 \and
49: %\url{yuany@math.stanford.edu}\\Department of Mathematics,\\ Stanford University, \\Stanford, CA 94305-4065}
50: %\date{\today}
51: \author{Nicholas Eriksson\\
52: \url{nke@stanford.edu}\\
53: Department of Statistics,\\
54: Stanford University, \\
55: Stanford, CA 94305-4065
56: \and
57: Yuan Yao\\
58: \url{yuany@math.stanford.edu}\\
59: Department of Mathematics,\\
60: Stanford University, \\
61: Stanford, CA 94305-4065
62: }
63: \date{\today}
64:
65: \maketitle
66:
67: \begin{abstract}
68: We introduce new methods for phylogenetic tree quartet construction by using
69: machine learning to optimize the power of phylogenetic invariants.
70: Phylogenetic
71: invariants are polynomials in the joint probabilities which vanish under a
72: model of evolution on a phylogenetic tree. We give algorithms for selecting a
73: good set of invariants and for learning a metric on this set of invariants
74: which optimally distinguishes the different models. Our learning algorithms
75: involve linear and semidefinite programming on data simulated over a wide range
76: of parameters. We provide extensive tests of the learned metrics on simulated data
77: from phylogenetic trees with four leaves under the Jukes-Cantor and Kimura
78: 3-parameter models of DNA evolution. Our method greatly improves on other uses
79: of invariants and is competitive with or better than neighbor-joining. In
80: particular, we obtain metrics trained on trees with short internal branches
81: which perform much better than neighbor joining on this region of parameter
82: space.
83: \end{abstract}
84: \begin{quotation}
85: \noindent \small {\bf Keywords:}
86: Phylogenetic invariants, algebraic statistics, semidefinite programming,
87: Felsenstein zone.
88: \end{quotation}
89:
90:
91: \section{Introduction}
92:
93: Phylogenetic invariants have been used for tree construction since the linear
94: invariants of Lake and Cavender-Felsenstein \cite{Lake1987,Cavender1987} were
95: found. Although the linear invariants of the Jukes-Cantor model are powerful
96: enough to asymptotically distinguish between trees on 4 taxa \cite{Lake1987},
97: these linear invariants do not perform well in simulations
98: \cite{Huelsenbeck1995}.
99:
100: In the last 20 years, the entire set of phylogenetic invariants has been found
101: for many models of evolution (see \cite{Allman2007} and the references
102: therein). Since the invariants provide an essentially complete description of
103: the model, using more invariants should give more power to distinguish between
104: different models. However, different invariants give vastly different power at
105: distinguishing models and it is not known how to find the most powerful
106: invariants.
107:
108: In this paper, we use techniques from machine learning to find metrics on the
109: space of invariants which optimize their tree reconstruction power for the
110: Jukes-Cantor and Kimura 3-parameter phylogenetic models on trees with four
111: leaves. Specifically, we apply \emph{metric learning} algorithms inspired by
112: \cite{XingNg03} to find the metric which best distinguishes the models. For
113: training data we use simulations over a wide range of the parameter space.
114: Our main biological result is the construction of a metric which outperforms
115: neighbor-joining on trees simulated from the Felsenstein zone (i.e., trees with
116: a short interior edge). More generally, we find that metric learning
117: significantly improves upon other uses of invariants and is competitive with
118: neighbor joining even for short sequences and homogeneous rates.
119:
120: Casanellas and Fern\'{a}ndez-S\'{a}nchez \cite{Casanellas2006} also used the Kimura
121: 3-parameter invariants to construct trees with four taxa. Their results
122: indicated that invariants can sometimes perform better than commonly used methods (e.g.,
123: neighbor joining and maximum likelihood) for data that evolved with
124: non-homogeneous rates and for extremely long sequences. They used the $l_1$
125: norm on the space of invariants, weighing each polynomial equally.
126:
This paper builds on \cite{Casanellas2006}
by showing how to improve upon the $l_1$ norm on the space of invariants.
The $l_1$ norm behaves poorly
since it weighs informative and non-informative invariants
equally.
132: Simulating data and using metric learning
133: improves the performance of invariants by putting much more weight on the
134: powerful invariants.
135: This allows us to build an algorithm which is very accurate for trees with
136: short internal edges.
137: %%Second, we show that their method gives different answers depending on the
138: %%order of the input sequences because they use sets of invariants which are not
139: %%closed under the symmetry group of the tree. We show how to build sets of
140: %%invariants which are closed under permutations of the input, thus giving an
141: %%algorithm that does not change if the input is permuted.
142:
143: %This paper is organized as follows. In Section~\ref{sec:2}, we give a
144: %description of the phylogenetic invariants and models we will use.
145: %Section~\ref{sec:3}, contains our algorithms for metric learning using
146: %linear and semidefinite programming.
147: %%These algorithms take simulated data and
148: %%find a metric on the space of invariants such that the invariants from the correct
149: %%tree are minimized while the incorrect ones are simultaneously maximized.
150: %Section~\ref{sec:4} gives the results of extensive simulations with our
151: %algorithms for the Jukes-Cantor (JC69) and Kimura 3-parameter (K81) models.
152:
153: %The remainder of this paper is organized as follows. We begin by giving a
154: %short introduction to phylogenetic invariants and motivate our algorithm.
155: %Section~\ref{sec:methods} contains the techniques used to select a set of
156: %invariants and the metric learning algorithms. Section~\ref{sec:results} \dots
157:
158: %\section{Phylogenetic invariants}
159: %\label{sec:2}
We begin by briefly introducing the models and phylogenetic invariants we will use.
Section~\ref{sec:methods} describes the metric learning algorithms, Section~\ref{sec:results} gives
the results of our simulation studies, and we conclude with a short discussion.
163:
By \emph{phylogenetic tree}, we mean a binary, unrooted tree with labelled
leaves. There are three such trees with four leaves labelled $0, 1, 2, 3$; we
call these trees $\Ta$, $\Tb$, and $\Tc$ according to which leaf is on the same
``cherry'' as leaf 0.
168: We consider two phylogenetic models on these trees: the Jukes-Cantor (JC69)
169: model of evolution \cite{Jukes1969} and the Kimura 3-parameter (K81) model
170: \cite{Kimura1981}, both with uniform root distribution.
171: %and parameters $\alpha, \beta, \gamma$.
172:
These models associate to each edge $e$
of the tree a transition matrix
$M_e = \e^{Qt_e}$,
where $t_e$ is the length of the edge $e$
and $Q$ is a rate matrix:
178: %\[
179: %M_e = \frac 1 4
180: %\begin{pmatrix}
181: % 1 + 3 \e^{-\frac{4}{3}t_e} &
182: % 1 - \e^{-\frac{4}{3}t_e}&
183: % 1 - \e^{-\frac{4}{3}t_e}&
184: % 1 - \e^{-\frac{4}{3}t_e}\\
185: % 1 - \e^{-\frac{4}{3}t_e}&
186: % 1 + 3 \e^{-\frac{4}{3}t_e} &
187: % 1 - \e^{-\frac{4}{3}t_e}&
188: % 1 - \e^{-\frac{4}{3}t_e}\\
189: % 1 - \e^{-\frac{4}{3}t_e}&
190: % 1 - \e^{-\frac{4}{3}t_e}&
191: % 1 + 3 \e^{-\frac{4}{3}t_e} &
192: % 1 - \e^{-\frac{4}{3}t_e}\\
193: % 1 - \e^{-\frac{4}{3}t_e}&
194: % 1 - \e^{-\frac{4}{3}t_e}&
195: % 1 - \e^{-\frac{4}{3}t_e}&
196: % 1 + 3 \e^{-\frac{4}{3}t_e}
197: %\end{pmatrix}
198: %\]
199: \[
200: Q =
201: \begin{pmatrix}
202: -3 \alpha & \alpha & \alpha & \alpha \\
203: \alpha & -3 \alpha & \alpha & \alpha \\
204: \alpha & \alpha & -3 \alpha & \alpha \\
205: \alpha & \alpha & \alpha & -3 \alpha\\
206: \end{pmatrix}
207: \quad \text{ or } \quad
208: \begin{pmatrix}
209: \cdot & \gamma & \alpha & \beta\\
210: \gamma & \cdot & \beta & \alpha\\
211: \alpha & \beta & \cdot & \gamma\\
212: \beta & \alpha & \gamma & \cdot
213: \end{pmatrix}
214: \]
215: for JC69 and K81 respectively, where $\cdot = -\gamma - \alpha - \beta$.
216: %The diagonal entries of the rate
217: %matrix are chosen to make the rows sum to zero.
218: For a given tree, we write
219: $p_{ijkl} = \Pr(\text{leaf } 0 = i, \text{leaf } 1 = j, \text{leaf } 2 = k,
220: \text{leaf } 3 = l)$ for
221: $i,j,k,l \in \{\texttt{A,C,G,T}\}$ and write $p = (p_{\texttt{AAAA}}, \dots,
222: p_{\texttt{TTTT}})$ for the joint probability distribution.
\emph{Phylogenetic invariants} are polynomial equations which are
satisfied by the joint probabilities. For example, $p_{\texttt{AAAA}} = p_{\texttt{CCCC}}$
holds for both JC69 and K81, but since this equation is true for all three trees,
we will ignore it and similar equations.
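
As a concrete instance of these transition matrices (a sketch, assuming the JC69 rate matrix above with a generic rate $\alpha$), note that $Q = \alpha (J - 4I)$, where $I$ is the identity and $J$ is the $4 \times 4$ all-ones matrix. Since $J^2 = 4J$, the matrix exponential can be written in closed form:
\[
M_e = \e^{Q t_e} = \e^{-4 \alpha t_e}\, I + \frac{1 - \e^{-4 \alpha t_e}}{4}\, J,
\]
so each diagonal entry of $M_e$ is $\frac 1 4 + \frac 3 4 \e^{-4 \alpha t_e}$ and each off-diagonal entry is $\frac 1 4 - \frac 1 4 \e^{-4 \alpha t_e}$.
Under the normalization $\alpha = 1/3$ (an assumption made here; it is the scaling for which $t_e$ is the expected number of substitutions per site), the exponent becomes $-\frac 4 3 t_e$, which is the exponent that appears in the Jukes-Cantor distance (\ref{eq:JCdist}).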
227:
228: \begin{example}\rm
229: \label{ex:4pt}
230: Consider the four-point condition
231: on a tree metric \cite{Buneman1974}. It says that if $d$ is a tree metric on $(ij : kl)$, then
232: \begin{equation}\label{eq:4pt}
233: d_{ij} + d_{kl} < d_{ik} + d_{jl} = d_{il} + d_{jk}.
234: \end{equation}
235: Given a probability distribution $p$, the maximum likelihood Jukes-Cantor distance is
236: \begin{equation} \label{eq:JCdist}
237: d_{ij} = - \frac 3 4 \log \left( 1 - \frac {4 m_{ij}} {3}\right)
238: \end{equation}
where $m_{ij}$ is the fraction of mismatches between the sequences at leaves $i$ and $j$, e.g.,
\[
m_{01} = \sum_{w,x,y,z \in \{\texttt{A,C,G,T}\}, w \neq x} p_{wxyz}.
\]
243: After substituting in (\ref{eq:4pt}) and exponentiating, the equality becomes
244: \begin{equation}\label{eq:4ptp}
245: \left(1 - \frac {4} {3} m_{ik}\right) \left(1 - \frac {4} {3} m_{jl}\right) =
246: \left(1 - \frac {4} {3} m_{il}\right) \left(1 - \frac {4} {3} m_{jk}\right).
247: \end{equation}
248: This observation is originally due to Cavender and Felsenstein
249: \cite{Cavender1987}. The difference of the two sides of (\ref{eq:4ptp}) is a
250: quadratic polynomial in the joint probabilities which we will call the
251: four-point polynomial.
252: \end{example}
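
The step from (\ref{eq:4pt}) to (\ref{eq:4ptp}) can be spelled out as follows; this is a short derivation that uses only the formulas of Example~\ref{ex:4pt}. Solving (\ref{eq:JCdist}) for the mismatch fraction gives
\[
1 - \frac 4 3 m_{ij} = \e^{-\frac 4 3 d_{ij}},
\]
so a product of two such factors turns a sum of distances into an exponent:
\[
\left(1 - \frac 4 3 m_{ik}\right)\left(1 - \frac 4 3 m_{jl}\right) = \e^{-\frac 4 3 (d_{ik} + d_{jl})},
\qquad
\left(1 - \frac 4 3 m_{il}\right)\left(1 - \frac 4 3 m_{jk}\right) = \e^{-\frac 4 3 (d_{il} + d_{jk})}.
\]
The equality in (\ref{eq:4pt}) makes the two exponents agree, which is exactly (\ref{eq:4ptp}). Since $x \mapsto \e^{-\frac 4 3 x}$ is decreasing, the strict inequality in (\ref{eq:4pt}) similarly corresponds to
\[
\left(1 - \frac 4 3 m_{ij}\right)\left(1 - \frac 4 3 m_{kl}\right) > \left(1 - \frac 4 3 m_{ik}\right)\left(1 - \frac 4 3 m_{jl}\right).
\]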
253:
254:
255: \section{Methods}
256: \label{sec:methods}
257: Since both models we consider are \emph{group-based}, it is easiest to work in Fourier
258: coordinates which can roughly be thought of as the $m_{ij}$ coordinates in
259: Example~\ref{ex:4pt} (cf.\ \cite{Hendy1989,Evans1993,Sturmfels2005}). The
260: website \url{http://www.math.tamu.edu/~lgp/small-trees/} contains lists of
261: invariants for different models on trees with a small number of taxa.
262:
263: Our first task is building a set of invariants for the two models. The above
264: website shows 33 polynomials (plus two implied linear relations) for
265: JC69 and 8002 polynomials for K81.
266: However,
267: these sets of invariants
268: %have desirable algebraic properties,
269: %they lack one important feature for our use.
270: are not closed under the symmetries of $T_1$.
271: That is, each tree can be
272: written in the plane in eight different ways (for example, the tree $\Ta$ can
273: be written as (01 : 23), (10 : 23), \dots, (32 : 10)), and each of these induces a
274: different order on the probability coordinates $p_{ijkl}$. We need a set of
275: invariants which does not change under this reordering if we don't want the
276: resulting algorithm to depend on the order of the input sequences.
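
For concreteness, the eight ways of writing $\Ta$ can be listed explicitly. The enumeration below is ours, obtained by swapping the two leaves within either cherry and by swapping the two cherries:
\[
(01 : 23),\ (10 : 23),\ (01 : 32),\ (10 : 32),\ (23 : 01),\ (32 : 01),\ (23 : 10),\ (32 : 10).
\]
As an illustration of the induced reordering, writing the tree as $(10 : 23)$ exchanges the roles of leaves 0 and 1 and therefore sends the coordinate $p_{ijkl}$ to $p_{jikl}$. In this sense, a set of invariants is closed under the symmetries of $\Ta$ if applying each of these eight coordinate substitutions to any polynomial in the set gives, up to sign, another polynomial in the set.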
277:
278: \begin{figure}
279: \centering
280: %"((W:$b,X:$a):$a,Y:$b,Z:$a);";
281: \includegraphics[width=.35\textwidth]{figures/4tree-ab-label}
282: \caption{The tree used in the simulations. Branch lengths $a$ and $b$
283: ranged from $0.01$ to $0.75$ in intervals of $0.02$.}
284: \label{fig:param}
285: \end{figure}
286:
After closing the sets of invariants under these symmetries, we are left with 49 polynomials for JC69 and
11612 for K81.
289: However, our metric learning algorithms
290: run slowly as the number of invariants grows, so we had to find a subset of
291: a more manageable size.
292: We cut down the K81 invariants by testing each of the
293: 11612 invariants individually on the entire parameter space and only keeping
294: those which had good individual reconstruction rates.
295: Specifically, we picked several different values for
296: $\gamma, \alpha, \beta$ and kept only those invariants which gave over a 62\%
reconstruction rate individually for sequences of length 100. The result of
this calculation is sets of invariants $\bof^{JC69}_i$ and $\bof^{K81}_i$ of
cardinality 49 and 52, respectively.
300:
301: \begin{definition}
302: Given a probability distribution $\hat{p}$, and invariants $\bof_i = (f_{i,1},
303: \dots, f_{i,n})$, for tree $T_i$ (for $i = 1,2,3$), let
304: \[
305: \bof_i(\hat{p}) = (f_{i,1}(\hat{p}), \dots, f_{i,n}(\hat{p}))
306: \]
307: be the point in $\rr^{n}$ obtained by evaluating the invariants for $T_i$ at
308: $\hat{p}$.
309: \end{definition}
310:
311: If the probability distribution $\hat{p}$ actually comes from the model $T_1$,
312: then we will have $\bof_1(\hat{p}) = 0$, $\bof_2(\hat{p}) \neq 0$, and $\bof_3(\hat{p})
313: \neq 0$ generically (that is, except for points $\hat{p}$ which lie on the
314: intersection of two or more models). This fact suggests that we can just pick
315: the tree $T_i$ such that $\bof_i$ is closest to zero. However, the next example
316: shows that it is quite important to pick good polynomials and weigh them properly.
317:
318: \begin{example}\rm
Figure~\ref{fig:hist} shows the distribution of four of the invariants from $\bof^{JC69}_i$
320: on data from simulations
321: of 1000 i.i.d.\
322: draws from the Jukes-Cantor model on $T_1$ over a varying set of parameters.
323: The histograms show the distributions for the simulated tree ($T_1$) in yellow
324: and the distributions for the other trees in gray and black.
325: \begin{figure}
326: \centering
327: \includegraphics[width=.5\textwidth]{figures/4poly-hist.pdf}
328: \caption{Distributions of four polynomials $f_{i,10}, f_{i,48}, f_{i,23}$
329: and $f_{i,45}$ on simulated data.
330: The yellow histogram corresponds to the correct tree, the black and gray are
331: the other two trees.}
332: \label{fig:hist}
333: \end{figure}
334:
335: Polynomial 10 (upper left) distinguishes nicely between the three trees
336: with the correct tree tightly distributed around zero. It is correct 97\% of
337: the time on our space of trees (Figure~\ref{fig:param}). Polynomial 48 (upper right) also shows power to
distinguish between all three trees, but
the distributions overlap much more; it is correct only 50.8\% of the time.
Polynomial 10 is the four-point invariant from Example~\ref{ex:4pt}, and polynomial
48 is one of Lake's linear invariants.
342: The two other examples show a polynomial (23) which is biased towards selecting
343: the wrong tree (only 16\% correct), and a polynomial (45) for which the correct tree is tightly
344: clustered around zero, but the incorrect trees are indistinguishable and have wide
345: variance (88.9\% correct).
346:
347: %\begin{figure}
348: % \centering
349: % \includegraphics[width=4in]{figures/myRank_inv}
350: % \caption{Performance of the individual Jukes-Cantor invariants.}
351: % \label{fig:indiv}
352: %\end{figure}
353:
354: The parameters used for the simulations are described in
355: Figure~\ref{fig:param}. Since 1000 samples should be quite enough to determine
356: the structure of a tree on four taxa, it is revealing that many of the
357: individual polynomials are quite poor
358: %(see Figure~\ref{fig:indiv} for the individual prediction rates for each polynomial)
359: (the mean prediction rate for all 49 polynomials is only 42\%).
360: The invariants have quite different variances and means and it is not optimal
361: to take each one with equal weight.
362: \end{example}
363:
364:
365: %\section{Metric learning}
366: %\label{sec:3}
367: %In this section we propose a learning algorithm to find a
368: %metric on the invariants which optimizes their power.
369:
370: This example shows that we need to scale and weigh the individual invariants.
Recall that for a positive (semi)definite matrix $A$,
the Mahalanobis (semi)norm $\|\cdot \|_A$ is defined by
373: \[
374: \| x \|_A = \sqrt{x^t A x}.
375: \]
376: %It satisfies (i) $\|x\|_A\geq 0$, (ii) $\|c x\|_A = |c|\|x\|_A$ (for $c\in
377: %\rr$), and (iii) $\|x+y\|_A\leq \|x\|_A+\|y\|_A$. Note that in (i) it is
378: %possible to have $\|x\|_A= 0$ with $x\neq 0$ in contrast to a norm $\|\cdot \|$
379: %which satisfies $\|x\|=0$ if and only if $x=0$.
Notice that since $A$ is positive semidefinite, it can be written as $A = U D
U^t$ where $U$ is orthogonal and $D$ is diagonal with non-negative entries.
Thus $A$ has a unique positive semidefinite square root $B = U \sqrt{D} U^t$. Now since $\|x\|_A^2
383: = x^tAx = (Bx)^t(Bx)=\|Bx\|^2$, we can view learning such a metric as finding a
384: transformation of the space of invariants that replaces each point $x$ with $Bx$ under the
385: Euclidean norm.
386: Accordingly, we will be searching for a positive semidefinite
387: matrix $A$ on the space of invariants which is ``optimal''.
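
A simple special case may help fix ideas (an illustrative sketch; the weights $w_k$ are notation introduced here). If $A = \diag(w_1, \dots, w_n)$ with all $w_k \geq 0$, then
\[
\|x\|_A^2 = \sum_{k=1}^{n} w_k x_k^2,
\qquad
B = \diag\left(\sqrt{w_1}, \dots, \sqrt{w_n}\right),
\]
so learning a diagonal metric amounts to rescaling the $k$th invariant by $\sqrt{w_k}$ before taking the Euclidean norm. Taking $A$ to be the identity recovers the unweighted $l_2$ norm, and setting $w_k = 0$ discards the $k$th invariant altogether. A non-diagonal $A$ additionally allows linear combinations of invariants to be weighted.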
388:
389: Let
390: $\hpt$ be an empirical probability distribution generated from a phylogenetic
391: model on tree $T_1$ with parameters $\theta$.
We wish to find $A$ such that the condition
\[
\|\bof_1(\hpt)\|_A < \min\left( \|\bof_2(\hpt)\|_A, \|\bof_3(\hpt)\|_A \right)
\]
holds for most $\hpt$ with $\theta$ chosen from a suitable parameter space $\Theta$.
397:
398: Now suppose that $\Theta$ is a finite set of parameters from which we generate
399: training data $\bof_1(\hpt), \bof_2(\hpt), \bof_3(\hpt)$ for $\theta \in \Theta$.
400: As we saw above, each of the eight possible ways of writing
401: each tree induces a signed permutation of the coordinates of each
402: $\bof_i(\hpt)$. We write these permutations in matrix form as $\pi_1, \dots,
403: \pi_8$.
404: Given this training data, we wish to solve the following optimization problem.
405: %For this purpose, the training data are generated as a set of triples
406: %$\{(x_1(\theta), x_2(\theta), x_3(\theta))\in \rr^{3d}: \theta = (\phi_e,\pi_e)
407: %\in \Theta\}$, where $\Theta$ is a finite set and $d=49$ in our setting.
408:
409: \begin{alg}\rm
410: \label{alg:meta}
411: {\bf (Metric learning for invariants)}
412:
413: \noindent {\em Input:} model invariants $\bof_i$ for $T_i$
414: and a finite set $\Theta$ of model parameters.
415:
\noindent {\em Output:} a positive semidefinite matrix $A$.
417:
418: \noindent {\em Procedure:}
419: \begin{enumerate}
420: \item Reduce the sets $\bof_i$ of invariants to a manageable size by testing individual invariants
421: on data simulated from $\theta \in \Theta$.
422: \item Augment the resulting sets so that they are closed under the eight
423: permutations of the input which fix tree $T_1$.
424: \item Compute the signed permutations $\pi_1, \dots, \pi_8$ which are induced on the
425: invariants $\bof_1$ by the above permutations.
426: \item Solve the following semidefinite programming problem:
427: \begin{equation*}
428: \begin{array}{ll}
429: %\text{Given: }& \bof_1, \bof_2, \bof_3
430: %\text{ and a finite set } \Theta \text{ of parameters,}\\
431: \text{Minimize: } &\sum_{\theta \in \Theta} \xi(\theta) + \lambda \tr A\\
432: \text{Subject to: } \quad
&\| \bof_1 (\hpt) \|_A^2 + \gamma \leq \min\left( \| \bof_2 (\hpt)\|_A^2 , \|\bof_3(\hpt) \|_A^2 \right) + \xi(\theta) \quad\text{for } \theta \in \Theta,\\
434: & \pi_i A = A \pi_i \quad\text{for } 1 \leq i \leq 8,\\
435: &A \succeq 0, \quad\text{and}\\
436: &\xi(\theta) \geq 0,
437: \end{array}
438: \end{equation*}
439: where $A\succeq 0$ denotes that $A$ is a positive semidefinite
440: matrix.
441: \item Alternatively, if we restrict $A$ to be
442: diagonal, this becomes a linear program and can be solved for much
443: larger sets of invariants and parameters.
444: \end{enumerate}
445: \end{alg}
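
The commutation constraint $\pi_i A = A \pi_i$ in Algorithm~\ref{alg:meta} is what makes the learned norm independent of the order of the input sequences. A short check, using only the fact that a signed permutation matrix is orthogonal ($\pi_i^t \pi_i = I$): for any $x \in \rr^n$,
\[
\|\pi_i x\|_A^2 = x^t \pi_i^t A \pi_i x = x^t \pi_i^t \pi_i A x = x^t A x = \|x\|_A^2 .
\]
Hence reordering the input, which replaces the vector of evaluated invariants $x$ by $\pi_i x$, does not change any of the norms being compared.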
446:
In the optimization step, we use a regularization parameter $\lambda$ to keep $A$
small and a margin parameter $\gamma$ (unrelated to the K81 rate parameter of the same name) to increase the margin between the
distributions. The slack variables $\xi(\theta)$ allow the margin constraint to be violated at a cost. This is a convex optimization problem with a linear objective
function and linear matrix equality and inequality constraints. Hence it is a
semidefinite programming (SDP) problem.
The SDP problem above has a unique optimizer and can be solved in polynomial
time. Its complexity depends on the cardinality of the set $\Theta$ since each
point in $\Theta$ contributes a linear constraint.
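
One way to see why the constraints are linear in $A$ is the following sketch. Fix $\theta$ and write $x_j = \bof_j(\hpt)$ for the evaluated invariants of tree $T_j$ (shorthand introduced here). Each squared norm is a linear function of the entries of $A$, since $\|x_j\|_A^2 = x_j^t A x_j = \tr(A x_j x_j^t)$, and the constraint involving the minimum is equivalent to the pair of linear inequalities
\[
\tr\left( A \left( x_1 x_1^t - x_j x_j^t \right) \right) + \gamma \leq \xi(\theta), \qquad j = 2, 3.
\]
For a fixed $A$, the smallest feasible slack is therefore
\[
\xi(\theta) = \max\left( 0,\ \gamma + \|x_1\|_A^2 - \min\left( \|x_2\|_A^2, \|x_3\|_A^2 \right) \right),
\]
which is zero exactly when the correct tree wins by a margin of at least $\gamma$. In the diagonal case $A = \diag(w_1, \dots, w_n)$, the condition $A \succeq 0$ reduces to $w_k \geq 0$ and the inequalities become
\[
\sum_{k=1}^{n} w_k \left( x_{1,k}^2 - x_{j,k}^2 \right) + \gamma \leq \xi(\theta), \qquad j = 2, 3,
\]
which, together with the objective $\sum_\theta \xi(\theta) + \lambda \sum_k w_k$, is a linear program in the weights $w_k$ and the slacks.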
455:
456:
457: For the range of parameters we consider in Section~\ref{sec:results}, we use
458: $\#(\Theta)=1444$. In our experiments to solve the SDP we use SeDuMi 1.1
459: \cite{Sturm1999} or DSDP5
460: \cite{BenYe05} with YALMIP \cite{Lofberg2004} as the parser.
461: Matlab code to implement the above algorithm can be found at
462: \url{http://math.stanford.edu/~yuany/metricPhylo/matlab/}.
463:
We found that although SeDuMi often runs into numerical issues, it generally
finds a good matrix $A$ whose performance is competitive with the neighbor-joining
algorithm. DSDP is numerically more stable at the cost of
more computational time. For example, in the case
of the JC69 model with a $49\times49$ semidefinite matrix $A$, YALMIP-SeDuMi takes 55.7
minutes to parse the constraints and solve the SDP, while YALMIP-DSDP takes
167.1 minutes to finish the same job. We have found that setting $\lambda = 0.0001$ and
$\gamma = 0.005$ gives good results in our situation. For details on experiments, see the next
section.
473:
474: Our algorithm was inspired by some early results on metric learning algorithms
475: such as \cite{XingNg03} and \cite{ShaSinNg04}, which aim to find a
476: (pseudo)-metric such that the mutual distances between similar examples are
477: minimized while the distances across dissimilar examples or classes are kept
478: large. Direct application of such an algorithm is not quite suitable in our setting.
As shown in Figure~\ref{fig:hist},
%a global view on the distribution of all three classes of data discloses that
the points from the correct tree overlap with those from the two incorrect trees. For
points in the overlapping region, it is hard to tell whether to shrink or
stretch their mutual distance. However, when the points appear in triples, it
is possible that for each triple the one closest to zero is generated from the
correct tree. Our algorithm is based on this intuition and proved successful
in experiments.
487:
488: After using Algorithm~\ref{alg:meta} to find a good metric $A$, the following
489: simple algorithm allows us to construct trees on four taxa.
490:
491: \begin{alg}\rm
492: \label{alg:1}
493: {\bf (Tree construction with invariants)}
494:
495: \noindent {\em Input:}
A multiple alignment of 4 species and a positive semidefinite matrix $A$ from Algorithm~\ref{alg:meta}.
497: %invariant under
498: %the signed permutations $\pi_1, \dots, \pi_8$.
499:
500: \noindent {\em Output:}
501: A phylogenetic tree on the 4 species (without branch lengths).
502:
503: \noindent {\em Procedure:}
504: \begin{enumerate}
\item Form the empirical distribution $\hat{p}$ by counting columns of the alignment.
506: \item Form the vectors $\bof_i = (f_{i,1}(\hat{p}), \dots, f_{i,n}(\hat{p}))$ for $1 \leq i \leq 3$.
507: \item Return $T_i$ where %$i$ is such that
508: the vector $\bof_i$ has smallest $A$-norm $\|\bof_i\|_A = \sqrt{\bof_i^t A \bof_i}$.
509: \end{enumerate}
510: \end{alg}
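
In symbols, Algorithm~\ref{alg:1} returns $T_{\hat\imath}$ with
\[
\hat\imath = \operatorname*{argmin}_{i \in \{1,2,3\}} \ \bof_i^t A\, \bof_i .
\]
The following toy computation shows how the choice of $A$ can change the answer. The numbers are hypothetical, chosen only for illustration, with $n = 2$ invariants of which the second is assumed to be noisy. Suppose $\bof_1 = (0.02, 0.5)$, $\bof_2 = (0.1, 0.1)$ and $\bof_3 = (0.1, 0.2)$. Then
\[
\begin{array}{lccc}
 & \|\bof_1\|_A^2 & \|\bof_2\|_A^2 & \|\bof_3\|_A^2 \\
A = \diag(1, 1) & 0.2504 & 0.0200 & 0.0500 \\
A = \diag(1, 0.01) & 0.0029 & 0.0101 & 0.0104
\end{array}
\]
so the unweighted $l_2$ norm selects $T_2$, while the metric that downweights the noisy coordinate selects $T_1$.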
511:
512:
513: %\begin{figure}
514: %\centering
515: %%\includegraphics[width=.5\textwidth]{figures/RedonTop.jpg}
516: %%\includegraphics[width=.5\textwidth]{figures/RedonBottom1.jpg}
517: %\includegraphics[width=.75\textwidth]{figures/RedonBottom}
518: %\caption{Projections of the three data sets onto the first two eigenspaces of
519: %the metric $A$. Red circles correspond to the correct tree, blue dots and
520: %green x's to the two incorrect trees.}
521: %\label{fig:eigen}
522: %\end{figure}
523:
524:
525: \section{Results}
526: \label{sec:results}
527:
528: We tested our metric learning algorithms for the invariants $\bof^{JC69}$ and
529: $\bof^{K81}$ as described above.
530: We trained two metrics for JC69 on
531: the tree in Figure~\ref{fig:param}.
532: The first used simulations from branch lengths between $0.01$ and $0.75$ on a
533: grid with increments of magnitude $0.02$, for a total of $1444$ different
534: parameters.
535: This region of parameter space was chosen for direct comparison with
536: \cite{Huelsenbeck1995,Casanellas2006}.
537: The second metric used parameters $0.01 \leq a \leq 0.25$ and $0.51 \leq b \leq
538: 0.75$ with increments of $0.01$, giving $625$ trees with very short interior edges.
539: Similarly for K81, we learned a metric using parameters $0.01 \leq a,b \leq 0.75$ and
540: several sets of $\gamma, \alpha, \beta$.
541: After learning metrics, we performed simulation tests on the same space of
542: parameters that was used to train. We compared neighbor-joining
543: \cite{Saitou1987}, phylogenetic invariants with the $l_1$ or $l_2$ norm and
544: phylogenetic invariants with our learned norms.
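
The sizes of the two JC69 training grids follow from a short count, assuming both endpoints of each range are included. On the full space each of $a$ and $b$ takes the values $0.01, 0.03, \dots, 0.75$, and on the restricted grid $a$ takes the values $0.01, 0.02, \dots, 0.25$ and $b$ the values $0.51, 0.52, \dots, 0.75$:
\[
\left( \frac{0.75 - 0.01}{0.02} + 1 \right)^2 = 38^2 = 1444,
\qquad
\left( \frac{0.25 - 0.01}{0.01} + 1 \right) \left( \frac{0.75 - 0.51}{0.01} + 1 \right) = 25 \cdot 25 = 625 .
\]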
545:
546: Since the edge lengths are large for part of the parameter space, we often see
547: simulated alignments with more than 75\% mismatches between pairs of taxa. In
548: such a case, (\ref{eq:JCdist}) returns infinite distance estimates under the
549: Jukes-Cantor
550: model. So the results depend on how neighbor joining treats
551: infinite distances. In PHYLIP
552: \cite{PHYLIP}, the program \texttt{dnadist} doesn't return a distance
553: matrix if some distances are infinite. However, in PAUP*, infinite distances are set to a fixed large
554: number.
555: Since we are only concerned with the tree topology, we believe that the most
556: fair comparison is between phylogenetic invariants and the method of PAUP*.
However, it should be noted that this choice can make a major difference in the neighbor-joining results, since the correct tree can often be returned even if some distances are infinite. See Figure~\ref{fig:contour} for an example of the difference, and be warned that comparing simulation studies done in different ways is difficult.
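
The threshold mentioned above can be read off from (\ref{eq:JCdist}): the logarithm is finite only when $1 - \frac 4 3 m_{ij} > 0$, that is, when
\[
m_{ij} < \frac 3 4 ,
\]
so for an alignment of length 100 the distance estimate is infinite or undefined as soon as 75 or more columns differ between the two sequences. As a rough indication of how close the model comes to this threshold, inverting (\ref{eq:JCdist}) gives an expected mismatch fraction of $m = \frac 3 4 \left( 1 - \e^{-\frac 4 3 d} \right)$ for two taxa at distance $d$. For the illustrative value $d = 2.25$ (a path of three edges of length $0.75$), this is $\frac 3 4 \left( 1 - \e^{-3} \right) \approx 0.713$, so sampling noise in short alignments can push the observed fraction past $\frac 3 4$.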
558:
Table~\ref{tab:JC} shows the results of 100 simulations at each of the 1444
parameter values for various sequence lengths using the JC69 model. It gives
the percent correct over all 144,400 trials for
four different methods: invariants with $l_1$, $l_2$, and $A$-norms and
neighbor joining (using Jukes-Cantor distances and allowing infinite distances). The contour
plots in Figure~\ref{fig:contour} show how the reconstruction rates vary across
parameter space for a sequence length of 100.
566: Notice that the $A$-norm shows
567: particularly good behavior over the entire range of parameters, even in the
568: ``Felsenstein zone'' in the upper left corner.
When trained on the Felsenstein zone, the learned metric can perform even
better. The right half of Table~\ref{tab:JC} shows the result of training a metric on this zone.
571: Notice that the $A$-norm is now quite a bit better than neighbor joining, even
572: though the $l_1$ and $l_2$ norms are terrible. However, this learned norm is slightly worse on the whole parameter space than the metric trained on the whole space.
573:
574: \begin{table}
575: \centering
576: \begin{tabular}{cccccl@{\hspace{1cm}}ccccc}
577: \multicolumn{5}{c}{Full parameter space} & &
578: \multicolumn{5}{c}{Felsenstein zone}\\
579: Length & $l_1$ & $l_2$ & $A$-norm & NJ & &
580: Length & $l_1$ & $l_2$ & $A$-norm & NJ\\
581: \cline{1-5} \cline{7-11}
582: 25 & 62.7 & 59.8 & 74.5 & 75.5 && 25 & 31.8 & 31.3 & 58.9 & 52.5\\
583: 50 & 71.9 & 66.3 & 85.0 & 85.9 && 50 & 35.9 & 33.0 & 69.9 & 63.2\\
584: 75 & 76.7 & 69.6 & 90.0 & 90.4 && 75 & 39.2 & 34.4 & 76.5 & 69.5\\
585: 100 & 79.8 & 72.0 & 92.7 & 92.9 && 100 & 42.2 & 35.8 & 81.2 & 73.9\\
586: 200 & 86.4 & 77.6 & 97.0 & 96.6 && 200 & 50.9 & 39.1 & 90.1 & 83.5\\
587: 300 & 89.2 & 80.1 & 98.2 & 97.7 && 300 & 55.4 & 40.4 & 93.6 & 87.8\\
588: 400 & 91.1 & 82.1 & 98.7 & 98.2 && 400 & 59.5 & 41.4 & 95.1 & 90.2\\
589: 500 & 92.3 & 83.5 & 99.0 & 98.4 && 500 & 62.4 & 42.5 & 95.6 & 91.4\\
590: \cline{1-5} \cline{7-11}
591: \end{tabular}
592: \caption{Percent of trials reconstructed correctly for the Jukes-Cantor
593: model over the entire parameter space and the Felsenstein zone for the
594: respective metrics.}
595: \label{tab:JC}
596: \end{table}
597:
598: Table~\ref{tab:K81} shows results for the K81 model under two choices of $(\gamma, \alpha, \beta)$.
599: We only report the $l_2$ scores, since the $l_1$ scores are similar. Of note
600: is the column ``$l_2$ restrict'' which shows the $l_2$ norm on the top $52$
601: invariants as ranked by individual power on simulations as in the previous
602: section. This column is better than the $l_2$ norm on all $11612$ invariants,
603: showing that many invariants are actually harmful.
604: The $A$-norm again improves on even the restricted $l_2$ and beats
605: neighbor-joining (run with K81 distances) on all examples.
606:
607: \begin{table}
608: \centering
609: \begin{tabular}{cccccl@{\hspace{1cm}}ccccc}
610: %$\gamma,\alpha, \beta$ &
611: \multicolumn{5}{c}{$(\gamma, \alpha, \beta) = (0.1, 3.0, 0.5)$}&&
612: \multicolumn{5}{c}{$(\gamma, \alpha, \beta) = (0.2, 0.5, 0.3)$}\\
Length & $l_2$ & $l_2$ restrict & $A$-norm & NJ&&
Length & $l_2$ & $l_2$ restrict & $A$-norm & NJ\\
615: \cline{1-5} \cline{7-11}
616: 25 & 59.2 & 66.9 & 71.5 & 62.7 && 25 & 63.7 & 67.4 & 70.9 & 65.0 \\
617: 50 & 68.0 & 77.5 & 82.1 & 72.7 && 50 & 71.6 & 77.5 & 81.1 & 74.2 \\
618: 75 & 73.4 & 82.7 & 86.9 & 79.3 && 75 & 75.4 & 82.7 & 86.4 & 80.5 \\
619: 100 & 76.8 & 85.8 & 89.7 & 82.6 &&100 & 77.6 & 85.8 & 89.3 & 83.4 \\
620: 200 & 84.6 & 90.8 & 94.3 & 90.1 &&200 & 83.3 & 91.3 & 94.1 & 89.7 \\
300 & 88.2 & 92.6 & 95.6 & 93.1 &&300 & 86.4 & 93.2 & 95.7 & 91.8 \\
622: 400 & 90.1 & 93.6 & 96.4 & 94.8 &&400 & 88.5 & 94.3 & 96.5 & 93.0 \\
623: 500 & 91.4 & 94.2 & 96.9 & 95.7 &&500 & 90.0 & 93.7 & 96.3 & 93.2 \\
624: \cline{1-5} \cline{7-11}
625: \end{tabular}
626: \caption{Percent of trials reconstructed correctly
627: for the Kimura 3-parameter model over the entire parameter space for two choices of $\gamma, \alpha, \beta$.}
628: \label{tab:K81}
629: \end{table}
630:
631: \begin{figure}
632: \centering
633: \begin{tabular}{cc}
634: \includegraphics[height=.35\textheight]{figures/contour-l1} &
635: %\includegraphics[height=.25\textheight]{figures/contour-l2}\\
636: \includegraphics[height=.35\textheight]{figures/contour-A} \\
637: \includegraphics[height=.35\textheight]{figures/contour-NJinf} &
638: %\multicolumn{2}{c}{
639: \includegraphics[height=.35\textheight]{figures/contour-NJ} \\
640: \end{tabular}
641: \caption{Contour plots for the three reconstruction methods for the
642: Jukes-Cantor model over parameter space with alignments of length 100.
643: Black areas correspond to parameters
644: $(a,b)$ for which the tree was reconstructed correctly over 95\% of the
645: time, gray for over 50\%, light gray for over 33\%, and white for under 33\%. }
646: \label{fig:contour}
647: \end{figure}
648:
649:
650: \section{Discussion}
651: \label{sec:dis}
652:
653: We have shown that machine learning algorithms can substantially improve the
654: tree construction performance of phylogenetic invariants. As an example, for
655: sequences of length 100, the four-point invariant (Example~\ref{ex:4pt}) for
656: the K81 model is correct 82\% of the time on data simulated from K81 with
parameters $(0.1, 3.0, 0.5)$. This single invariant outperforms the
$l_2$ norm on all 11612 invariants (76.8\%, Table~\ref{tab:K81}) by about five percentage points.
659:
The paper \cite{Casanellas2007} describes an algebraic method for picking a
subset of invariants for the K81 model. The authors reduce to 48 invariants, which give
an improvement over all 11612 invariants (up to 82.6\% on the above
example using the $l_2$ norm). However, only 4 of these 48 are among
the top 52 we selected for $\bof^{K81}$, and the remaining 44 invariants are
mostly quite poor (42\% average accuracy). Taking the closure of these 48 invariants yields 156
in total, and the performance actually drops to 78.3\%. It seems that the
conditions for an invariant to be powerful are not particularly related to the
algebraic criterion used in \cite{Casanellas2007}.
669:
All invariant-based methods depend heavily on the set of invariants that we
begin with. Learning diagonal matrices $A$ had mixed performance, which further
suggests that the generating set we are using for the invariants is
not optimal. We believe that it is an important mathematical problem to
674: understand what properties are shared by the good invariants. We suggest that
675: symmetry might be an important criterion to construct other polynomials like
676: the four-point condition with good power.
677:
The learned metrics in this paper are somewhat dependent on the
679: parameters chosen to train them. This can be a benefit, as it allows us to
680: train tree construction algorithms for specific regions of parameter space
681: (e.g., the Felsenstein zone). However, we hope that improvements to the metric
682: programming will allow us to train on larger parameter sets and thus obtain
683: uniformly better algorithms.
684:
Note that these methods recover only the tree topology, not the edge lengths.
We believe that if the edge lengths are needed, they should be estimated after
building the tree, in which case standard statistical methods such as maximum
likelihood can be used easily.
While the invariants discussed in this paper may not be practical for large
trees, we believe there is great value in fully understanding the problem of
building trees on four taxa. For example, these methods can be used either
as input to quartet-based tree construction algorithms or as a verification
step for larger phylogenetic trees.
694:
For a method of building trees on more than four taxa using phylogenetic
invariants, see \cite{Eriksson2005b,Kim2006}, which use numerical linear
algebra to evaluate invariants given by rank conditions on certain matrices in
order to construct phylogenetic trees. This amounts to evaluating many
polynomials at once, which allows the method to run in polynomial time.
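
To make the rank conditions concrete, the following display sketches the
four-taxon case. The notation $\mathrm{Flat}_{12|34}$, $\sigma_i$, and
$\delta_{12|34}$ is introduced here for illustration only and is not
necessarily the notation of the cited papers. Given a joint distribution
$p = (p_{ijkl})$ on the states at the four leaves, where $i,j,k,l$ range over
the four nucleotides, the flattening along the split $12|34$ is the
$16 \times 16$ matrix
\[
\mathrm{Flat}_{12|34}(p)_{(i,j),(k,l)} = p_{ijkl}.
\]
If $12|34$ is the split of the tree, this matrix has rank at most $4$, so all
of its $5 \times 5$ minors vanish and are therefore invariants. Rather than
evaluating each minor separately, one can compute the singular values
$\sigma_1 \geq \cdots \geq \sigma_{16}$ of $\mathrm{Flat}_{12|34}(\hat{p})$ for
the empirical distribution $\hat{p}$ and use
\[
\delta_{12|34}(\hat{p}) = \Bigl( \sum_{i=5}^{16} \sigma_i^2 \Bigr)^{1/2},
\]
which by the Eckart--Young theorem is the Frobenius distance from
$\mathrm{Flat}_{12|34}(\hat{p})$ to the nearest matrix of rank at most $4$.
Under these assumptions, one would select the split among $12|34$, $13|24$,
and $14|23$ with the smallest $\delta$.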
700:
701: The matrices $A$ for JC69 and K81 used in the tests can be found at
702: \url{http://stanford.edu/~nke/data/metricPhylo}. A software package that can
703: run these tests is available at the same website. It includes a program for
704: simulating evolution using any Markov model on a tree and several programs
705: using phylogenetic invariants.
706:
707: \section*{Acknowledgments}
708: N.~Eriksson was supported by NSF grant DMS-0603448 and wishes to thank MSRI and
709: the IMA for their hospitality. Y.~Yao was supported by DARPA grant 1092228. We
710: wish to thank E.\ Allman, M.\ Drton, D.\ Ge, F.\ Memoli, L.\ Pachter, J.\
711: Rhodes, and Y.\ Ye for helpful comments and especially G.\ Carlsson for his
712: encouragement and computational resources.
713:
714: \bibliographystyle{alpha}
715: %\bibliography{nikos}
716:
717: \begin{thebibliography}{KKPP06}
718:
719: \bibitem[AR07]{Allman2007}
720: E~Allman and J~Rhodes.
721: \newblock Molecular phylogenetics from an algebraic viewpoint, 2007.
722: \newblock To appear.
723:
724: \bibitem[Bun74]{Buneman1974}
725: Peter Buneman.
726: \newblock A note on the metric properties of trees.
727: \newblock {\em J. Combinatorial Theory Ser. B}, 17:48--50, 1974.
728:
729: \bibitem[BY05]{BenYe05}
730: Steven~J. Benson and Yinyu Ye.
731: \newblock {DSDP5}: Software for semidefinite programming.
732: \newblock Technical Report ANL/MCS-P1289-0905, Mathematics and Computer Science
733: Division, Argonne National Laboratory, Argonne, IL, September 2005.
734: \newblock Submitted to ACM Transactions on Mathematical Software.
735:
736: \bibitem[CF87]{Cavender1987}
737: J~Cavender and J~Felsenstein.
738: \newblock Invariants of phylogenies in a simple case with discrete states.
739: \newblock {\em Journal of Classification}, 4:57--71, 1987.
740:
741: \bibitem[CFS06]{Casanellas2006}
742: M.~Casanellas and J.~Fern\'{a}ndez-S\'{a}nchez.
743: \newblock {Performance of a New Invariants Method on Homogeneous and
744: Non-homogeneous Quartet Trees}.
745: \newblock {\em Mol Biol Evol}, page msl153, 2006.
746:
747: \bibitem[CFS07]{Casanellas2007}
748: M.~Casanellas and J.~Fern\'{a}ndez-S\'{a}nchez.
749: \newblock {Geometry of the Kimura 3-parameter model}, 2007.
\newblock Available at arXiv:math.AG/0702834.
751:
752: \bibitem[Eri05]{Eriksson2005b}
753: Nicholas Eriksson.
\newblock Tree construction using singular value decomposition.
755: \newblock In L.~Pachter and B.~Sturmfels, editors, {\em Algebraic Statistics
756: for Computational Biology}, chapter~19, pages 347--358. Cambridge University
757: Press, Cambridge, UK, 2005.
758:
759: \bibitem[ES93]{Evans1993}
760: S~Evans and T~Speed.
761: \newblock Invariants of some probability models used in phylogenetic inference.
762: \newblock {\em The Annals of Statistics}, 21:355--377, 1993.
763:
764: \bibitem[Fel04]{PHYLIP}
765: J~Felsenstein.
766: \newblock {PHYLIP (Phylogeny Inference Package) version 3.6}.
767: \newblock Distributed by the author, Department of Genome Sciences, University
768: of Washington, Seattle, 2004.
769:
770: \bibitem[HP89]{Hendy1989}
771: M~Hendy and D~Penny.
772: \newblock A framework for the quantitative study of evolutionary trees.
773: \newblock {\em Systematic Zoology}, 38(4), 1989.
774:
775: \bibitem[Hue95]{Huelsenbeck1995}
776: John~P. Huelsenbeck.
777: \newblock Performance of phylogenetic methods in simulations.
\newblock {\em Sys Biol}, 44(1):17--48, 1995.
779:
780: \bibitem[JC69]{Jukes1969}
781: TH~Jukes and C~Cantor.
782: \newblock Evolution of protein molecules.
783: \newblock In HN~Munro, editor, {\em Mammalian Protein Metabolism}, pages
784: 21--32. New York Academic Press, 1969.
785:
786: \bibitem[Kim81]{Kimura1981}
787: M~Kimura.
\newblock Estimation of evolutionary distances between homologous nucleotide
  sequences.
790: \newblock {\em Proceedings of the National Academy of Sciences, USA},
791: 78:454--458, 1981.
792:
793: \bibitem[KKPP06]{Kim2006}
794: Young~Rock Kim, Oh-In Kwon, Seong-Hun Paeng, and Chun-Jae Park.
795: \newblock Phylogenetic tree constructing algorithms fit for grid computing with
796: {SVD}.
797: \newblock Available at \url{http://arxiv.org/abs/q-bio.QM/0611015}, 2006.
798:
\bibitem[L{\"o}f04]{Lofberg2004}
800: J.~L\"ofberg.
801: \newblock {YALMIP} : a toolbox for modeling and optimization in {MATLAB}.
802: \newblock In {\em Computer Aided Control Systems Design}, pages 284--289, 2004.
803:
804: \bibitem[Lak87]{Lake1987}
805: JA~Lake.
\newblock A rate-independent technique for analysis of nucleic acid sequences:
807: evolutionary parsimony.
808: \newblock {\em Molecular Biology and Evolution}, 4:167--191, 1987.
809:
810: \bibitem[SN87]{Saitou1987}
811: N~Saitou and M~Nei.
812: \newblock The neighbor joining method: a new method for reconstructing
813: phylogenetic trees.
814: \newblock {\em Molecular Biology and Evolution}, 4(4):406--425, 1987.
815:
816: \bibitem[SS05]{Sturmfels2005}
817: Bernd Sturmfels and Seth Sullivant.
818: \newblock {T}oric ideals of phylogenetic invariants.
819: \newblock {\em J Comput Biol}, 12(4):457--481, May 2005.
820:
821: \bibitem[SSSN04]{ShaSinNg04}
822: Shai Shalev-Shwartz, Yoram Singer, and Andrew~Y. Ng.
823: \newblock Online learning of pseudo-metrics.
824: \newblock In {\em Proceedings of the Twenty-first International Conference on
825: Machine Learning}, 2004.
826:
827: \bibitem[Stu99]{Sturm1999}
828: J.F. Sturm.
829: \newblock Using {SeDuMi} 1.02, a {MATLAB} toolbox for optimization over
830: symmetric cones.
831: \newblock {\em Optimization Methods and Software}, 11--12:625--653, 1999.
832: \newblock Special issue on Interior Point Methods (CD supplement with
833: software).
834:
835: \bibitem[XNJR03]{XingNg03}
836: Eric Xing, Andrew~Y. Ng, Michael Jordan, and Stuart Russell.
837: \newblock Distance metric learning, with application to clustering with
838: side-information.
839: \newblock In {\em NIPS}, 2003.
840:
841: \end{thebibliography}
842: \end{document}
843:
844: