q-bio0703034/inv.tex
1: %\documentclass[12pt,envcountsame]{llncs}
2: \documentclass[12pt]{article}
3: %\pagenumbering{arabic}
4: %\pagestyle{plain}
5: \usepackage{fullpage}
6: 
7: \usepackage{amsthm}
8: \usepackage{amsmath}
9: \usepackage{amsfonts}
10: \usepackage{graphicx}
11: \usepackage{url}
12: 
13: %\spnewtheorem{alg}[theorem]{Algorithm}{\bfseries}{}%{\itshape}
14: %\spnewtheorem{op}[theorem]{Problem}{\bfseries}{}
15: \newtheorem{thm}{Theorem}
16: \newtheorem{lemma}[thm]{Lemma}
17: \newtheorem{prop}[thm]{Proposition}
18: \newtheorem{cor}[thm]{Corollary}
19: \newtheorem{prob}[thm]{Problem}
20: \newtheorem{conj}[thm]{Conjecture}
21: \newtheorem{alg}[thm]{Algorithm}
22: \newtheorem{rmk}[thm]{Remark}
23: 
24: \newtheorem{example}[thm]{Example}
25: \newtheorem{definition}[thm]{Definition}
26: 
27: 
28: \newcommand{\sM}{\mathcal M}
29: \newcommand{\cE}{\mathcal{E}}
30: \newcommand{\cG}{\mathcal{G}}
31: \newcommand{\NN}{\mathbb{N}}
32: \newcommand{\rr}{\mathbb{R}}
33: \newcommand{\e}{\mathrm{e}}
34: \newcommand{\tr}{\mathrm{tr}}
35: \newcommand{\bof}{\mathbf{f}}
36: \newcommand{\Ta}{T_1}
37: \newcommand{\Tb}{T_2}
38: \newcommand{\Tc}{T_3}
39: \newcommand{\hpt}{\hat{p}(\theta)}
40: \newcommand{\hpi}{\hat{p}_{i}}
41: \newcommand{\hpb}{\hat{p}_{2}}
42: \newcommand{\hpc}{\hat{p}_{3}}
43: \newcommand{\diag}{\mathrm{diag}}
44: 
45: \begin{document}
46: \title{Metric learning for phylogenetic invariants}
47: %\author{Nicholas Eriksson\inst{1}\fnmsep*\and Yuan Yao\inst{2}}
48: %\institute{\url{nke@stanford.edu}\\ Department of Statistics, \\Stanford University, \\Stanford, CA 94305-4065 \and
49: %\url{yuany@math.stanford.edu}\\Department of Mathematics,\\ Stanford University, \\Stanford, CA 94305-4065}
50: %\date{\today}
51: \author{Nicholas Eriksson\\
52: \url{nke@stanford.edu}\\
53: Department of Statistics,\\ 
54: Stanford University, \\
55: Stanford, CA 94305-4065
56: \and
57: Yuan Yao\\
58: \url{yuany@math.stanford.edu}\\
59: Department of Mathematics,\\ 
60: Stanford University, \\
61: Stanford, CA 94305-4065
62: }
63: \date{\today}
64: 
65: \maketitle
66: 
67: \begin{abstract}
68: We introduce new methods for phylogenetic tree quartet construction by using
69: machine learning to optimize the power of phylogenetic invariants.
70: Phylogenetic
71: invariants are polynomials in the joint probabilities which vanish under a
72: model of evolution on a phylogenetic tree.  We give algorithms for selecting a
73: good set of invariants and for learning a metric on this set of invariants
74: which optimally distinguishes the different models.  Our learning algorithms
75: involve linear and semidefinite programming on data simulated over a wide range
76: of parameters.  We provide extensive tests of the learned metrics on simulated data
77: from phylogenetic trees with four leaves under the Jukes-Cantor and Kimura
78: 3-parameter models of DNA evolution.  Our method greatly improves on other uses
79: of invariants and is competitive with or better than neighbor-joining.  In
80: particular, we obtain metrics trained on trees with short internal branches
81: which perform much better than neighbor joining on this region of parameter
82: space.
83: \end{abstract}
84: \begin{quotation}
85: \noindent \small {\bf Keywords:} 
86: Phylogenetic invariants, algebraic statistics, semidefinite programming,
87: Felsenstein zone.  
88: \end{quotation}
89: 
90: 
91: \section{Introduction}
92: 
93: Phylogenetic invariants have been used for tree construction since the linear
94: invariants of Lake and Cavender-Felsenstein \cite{Lake1987,Cavender1987} were
95: found.  Although the linear invariants of the Jukes-Cantor model are powerful
96: enough to asymptotically distinguish between trees on 4 taxa \cite{Lake1987},
97: these linear invariants do not perform well in simulations
98: \cite{Huelsenbeck1995}.
99: 
100: In the last 20 years, the entire set of phylogenetic invariants has been found
101: for many models of evolution (see \cite{Allman2007} and the references
102: therein).  Since the invariants provide an essentially complete description of
103: the model, using more invariants should give more power to distinguish between
104: different models.  However, different invariants give vastly different power at
105: distinguishing models and it is not known how to find the most powerful
106: invariants.  
107: 
108: In this paper, we use techniques from machine learning to find metrics on the
109: space of invariants which optimize their tree reconstruction power for the
110: Jukes-Cantor and Kimura 3-parameter phylogenetic models on trees with four
111: leaves.  Specifically, we apply \emph{metric learning} algorithms inspired by
112: \cite{XingNg03} to find the metric which best distinguishes the models.  For
113: training data we use simulations over a wide range of the parameter space.  
114: Our main biological result is the construction of a metric which outperforms
115: neighbor-joining on trees simulated from the Felsenstein zone (i.e., trees with
116: a short interior edge). More generally, we find that metric learning
117: significantly improves upon other uses of invariants and is competitive with
118: neighbor joining even for short sequences and homogeneous rates.  
119: 
120: Casanellas and Fern\'{a}ndez-S\'{a}nchez \cite{Casanellas2006} also used the Kimura
121: 3-parameter invariants to construct trees with four taxa.  Their results
122: indicated that invariants can sometimes perform better than commonly used methods (e.g.,
123: neighbor joining and maximum likelihood) for data that evolved with
124: non-homogeneous rates and for extremely long sequences.   They used the $l_1$
125: norm on the space of invariants, weighing each polynomial equally. 
126: 
This paper builds on \cite{Casanellas2006}
by showing how to improve upon the $l_1$ norm on the space of invariants.
The $l_1$ norm behaves poorly
since it gives equal weight to
informative and non-informative invariants.
132: Simulating data and using metric learning
133: improves the performance of invariants by putting much more weight on the
134: powerful invariants.  
135: This allows us to build an algorithm which is very accurate for trees with
136: short internal edges.
137: %%Second, we show that their method gives different answers depending on the
138: %%order of the input sequences because they use sets of invariants which are not
139: %%closed under the symmetry group of the tree.  We show how to build sets of
140: %%invariants which are closed under permutations of the input, thus giving an
141: %%algorithm that does not change if the input is permuted.  
142: 
143: %This paper is organized as follows.  In Section~\ref{sec:2}, we give a
144: %description of the phylogenetic invariants and models we will use.
145: %Section~\ref{sec:3}, contains our algorithms for metric learning using
146: %linear and semidefinite programming.  
147: %%These algorithms take simulated data and
148: %%find a metric on the space of invariants such that the invariants from the correct
149: %%tree are minimized while the incorrect ones are simultaneously maximized.
150: %Section~\ref{sec:4} gives the results of extensive simulations with our
151: %algorithms for the Jukes-Cantor (JC69) and Kimura 3-parameter (K81) models.
152: 
153: %The remainder of this paper is organized as follows.  We begin by giving a
154: %short introduction to phylogenetic invariants and motivate our algorithm.
155: %Section~\ref{sec:methods} contains the techniques used to select a set of
156: %invariants and the metric learning algorithms.  Section~\ref{sec:results} \dots
157: 
158: %\section{Phylogenetic invariants}
159: %\label{sec:2}
160: We begin by briefly introducing the models and phylogenetic invariants we will use.
Section~\ref{sec:methods} describes the metric learning algorithms; Section~\ref{sec:results} gives
162: the results of our simulation studies; and we conclude with a short discussion.
163: 
164: By \emph{phylogenetic tree}, we mean a binary, unrooted tree with labelled
leaves.  There are three such trees with four leaves labelled $0, 1, 2, 3$; we
call these trees $\Ta$, $\Tb$, and $\Tc$ according to which leaf is on the same
167: ``cherry'' as leaf 0.
168: We consider two phylogenetic models on these trees:  the Jukes-Cantor (JC69)
169: model of evolution \cite{Jukes1969} and the Kimura 3-parameter (K81) model
170: \cite{Kimura1981}, both with uniform root distribution.
171: %and parameters $\alpha, \beta, \gamma$.
172: 
173: These models associate to each edge $e$
174: of the tree a transition matrix 
$M_e = \e^{Qt_e}$,
where $t_e$ is the length of the edge $e$
177: and $Q$ is a rate matrix:
178: %\[
179: %M_e = \frac 1 4 
180: %\begin{pmatrix}
181: %	1 + 3 \e^{-\frac{4}{3}t_e} & 
182: %	1 - \e^{-\frac{4}{3}t_e}& 
183: %	1 - \e^{-\frac{4}{3}t_e}& 
184: %	1 - \e^{-\frac{4}{3}t_e}\\ 
185: %	1 - \e^{-\frac{4}{3}t_e}& 
186: %	1 + 3 \e^{-\frac{4}{3}t_e} & 
187: %	1 - \e^{-\frac{4}{3}t_e}& 
188: %	1 - \e^{-\frac{4}{3}t_e}\\ 
189: %	1 - \e^{-\frac{4}{3}t_e}& 
190: %	1 - \e^{-\frac{4}{3}t_e}& 
191: %	1 + 3 \e^{-\frac{4}{3}t_e} & 
192: %	1 - \e^{-\frac{4}{3}t_e}\\ 
193: %	1 - \e^{-\frac{4}{3}t_e}& 
194: %	1 - \e^{-\frac{4}{3}t_e}& 
195: %	1 - \e^{-\frac{4}{3}t_e}& 
196: %	1 + 3 \e^{-\frac{4}{3}t_e} 
197: %\end{pmatrix}
198: %\]
199: \[
200: Q = 
201: \begin{pmatrix}
202: 	-3 \alpha & \alpha & \alpha & \alpha \\
203: 	\alpha & -3 \alpha & \alpha & \alpha \\
204: 	\alpha & \alpha & -3 \alpha & \alpha \\
205: 	\alpha & \alpha & \alpha & -3 \alpha\\
206: \end{pmatrix}
207: \quad \text{ or } \quad 
208: \begin{pmatrix}
209: 	\cdot & \gamma & \alpha & \beta\\
210: 	\gamma & \cdot & \beta & \alpha\\
211: 	\alpha & \beta & \cdot & \gamma\\
212: 	\beta & \alpha & \gamma & \cdot
213: \end{pmatrix}
214: \]
215: for JC69 and K81 respectively, where $\cdot = -\gamma - \alpha - \beta$.
216: %The diagonal entries of the rate 
217: %matrix are chosen to make the rows sum to zero.
218: For a given tree, we write 
219: $p_{ijkl} = \Pr(\text{leaf } 0 = i, \text{leaf } 1 = j, \text{leaf }  2 = k, 
220: \text{leaf } 3 = l)$ for
221: $i,j,k,l \in \{\texttt{A,C,G,T}\}$ and write $p = (p_{\texttt{AAAA}}, \dots,
222: p_{\texttt{TTTT}})$ for the joint probability distribution. 
\emph{Phylogenetic invariants} are polynomial equations which are
satisfied by the joint probabilities.  For example, $p_{\texttt{AAAA}} = p_{\texttt{CCCC}}$
holds for both JC69 and K81, but since this equation is true for all three trees,
we will ignore it and similar equations.
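
For concreteness, the JC69 transition matrix can be written in closed form.
The following is a standard computation, sketched under the rate matrix $Q$
given above; the symbols $I$ (the $4 \times 4$ identity matrix) and $J$ (the
$4 \times 4$ all-ones matrix) are notation introduced only for this
illustration.  Since $Q = \alpha (J - 4I)$ and $J^2 = 4J$, we get
\[
\e^{Qt_e} = \e^{-4\alpha t_e}\, \e^{\alpha t_e J}
= \e^{-4\alpha t_e} \left( I + \frac{\e^{4 \alpha t_e} - 1}{4} J \right)
= \e^{-4\alpha t_e} I + \frac{1 - \e^{-4\alpha t_e}}{4} J ,
\]
so the diagonal entries of $M_e$ are $\frac 1 4 (1 + 3\e^{-4\alpha t_e})$ and
the off-diagonal entries are $\frac 1 4 (1 - \e^{-4\alpha t_e})$.  In
particular, the probability that the two ends of an edge carry different
nucleotides is
\[
m = \frac 3 4 \left( 1 - \e^{-4 \alpha t_e} \right),
\qquad \text{equivalently} \qquad
3 \alpha t_e = - \frac 3 4 \log \left( 1 - \frac{4m}{3} \right),
\]
which has the same form as the Jukes-Cantor distance (\ref{eq:JCdist}) when
the distance is measured as the expected number $3 \alpha t_e$ of
substitutions per site.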
227: 
228: \begin{example}\rm
229: 	\label{ex:4pt}
230: Consider the four-point condition
231: on a tree metric \cite{Buneman1974}.  It says that if $d$ is a tree metric on $(ij : kl)$, then
232: \begin{equation}\label{eq:4pt}
233: d_{ij} + d_{kl} <  d_{ik} + d_{jl} = d_{il} + d_{jk}.
234: \end{equation}
235: Given a probability distribution $p$, the maximum likelihood Jukes-Cantor distance is
236: \begin{equation} \label{eq:JCdist}
237: d_{ij} = - \frac 3 4 \log \left( 1 - \frac {4 m_{ij}} {3}\right)
238: \end{equation}
where $m_{ij}$ is the fraction of mismatches between the sequences at leaves $i$ and $j$, e.g.,
\[
m_{01} = \sum_{w,x,y,z \in \{\texttt{A,C,G,T}\}, w \neq x} p_{wxyz}.
242: \]
243: After substituting in (\ref{eq:4pt}) and exponentiating, the equality becomes
244: \begin{equation}\label{eq:4ptp}
245: \left(1 - \frac {4} {3} m_{ik}\right) \left(1 - \frac {4} {3} m_{jl}\right) =
246: \left(1 - \frac {4} {3} m_{il}\right) \left(1 - \frac {4} {3} m_{jk}\right). 
247: \end{equation}
248: This observation is originally due to Cavender and Felsenstein
249: \cite{Cavender1987}.  The difference of the two sides of (\ref{eq:4ptp}) is a
250: quadratic polynomial in the joint probabilities which we will call the
251: four-point polynomial. 
252: \end{example}
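
The step from (\ref{eq:4pt}) to (\ref{eq:4ptp}) can be spelled out as follows.
This is only a sketch of the substitution described in
Example~\ref{ex:4pt}, under the assumption that every $m_{ij} < 3/4$ so that
the logarithm in (\ref{eq:JCdist}) is defined.  By (\ref{eq:JCdist}),
\[
\exp\left( - \frac 4 3 d_{ij} \right) = 1 - \frac 4 3 m_{ij}.
\]
Multiplying the equality $d_{ik} + d_{jl} = d_{il} + d_{jk}$ by $-\frac 4 3$
and exponentiating gives
\[
\exp\left( - \frac 4 3 d_{ik} \right) \exp\left( - \frac 4 3 d_{jl} \right) =
\exp\left( - \frac 4 3 d_{il} \right) \exp\left( - \frac 4 3 d_{jk} \right),
\]
which is (\ref{eq:4ptp}) after substituting the first display.  Since
$x \mapsto \exp(-\frac 4 3 x)$ is decreasing, the strict inequality in
(\ref{eq:4pt}) turns into
\[
\left(1 - \frac {4} {3} m_{ij}\right) \left(1 - \frac {4} {3} m_{kl}\right) >
\left(1 - \frac {4} {3} m_{ik}\right) \left(1 - \frac {4} {3} m_{jl}\right)
\]
by the same argument.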
253: 
254: 
255: \section{Methods}
256: \label{sec:methods}
257: Since both models we consider are \emph{group-based}, it is easiest to work in Fourier
258: coordinates which can roughly be thought of as the $m_{ij}$ coordinates in
259: Example~\ref{ex:4pt} (cf.\ \cite{Hendy1989,Evans1993,Sturmfels2005}).  The
260: website \url{http://www.math.tamu.edu/~lgp/small-trees/} contains lists of
261: invariants for different models on trees with a small number of taxa.
262: 
263: Our first task is building a set of invariants for the two models.  The above
264: website shows 33 polynomials (plus two implied linear relations) for 
265: JC69 and 8002 polynomials for K81.
266: However, 
267: these sets of invariants 
268: %have desirable algebraic properties,
269: %they lack one important feature for our use.  
270: are not closed under the symmetries of $T_1$.
271: That is, each tree can be
272: written in the plane in eight different ways (for example, the tree $\Ta$ can
273: be written as (01 : 23),  (10 : 23), \dots, (32 : 10)), and each of these induces a
274: different order on the probability coordinates $p_{ijkl}$.  We need a set of
275: invariants which does not change under this reordering if we don't want the
276: resulting algorithm  to depend on the order of the input sequences.
277: 
278: \begin{figure}
279: 	\centering
280: 	%"((W:$b,X:$a):$a,Y:$b,Z:$a);";
281: 	\includegraphics[width=.35\textwidth]{figures/4tree-ab-label} 
282: 	\caption{The tree used in the simulations. Branch lengths $a$ and $b$
283: 	ranged from $0.01$ to $0.75$ in intervals of $0.02$.}
284: 	\label{fig:param}
285: \end{figure}
286: 
287: After performing this calculation, we are left with 49 polynomials for JC69 and
288: 11612 for K81. 
289: However, our metric learning algorithms
290: run slowly as the number of invariants grows, so we had to find a subset of
291: a more manageable size.  
292: We cut down the K81 invariants by testing each of the
293: 11612 invariants individually on the entire parameter space and only keeping
294: those which had good individual reconstruction rates.  
295: Specifically,  we picked several different values for
296: $\gamma, \alpha, \beta$ and kept only those invariants which gave over a 62\%
297: reconstruction rate individually for sequences of length 100.  The result of
298: this calculation is sets of invariants $\bof^{JC69}_i$ and $\bof^{K81}_i$ of
299: cardinality 49 and 52.
300: 
301: \begin{definition}
302: Given a probability distribution $\hat{p}$, and invariants $\bof_i = (f_{i,1},
303: \dots, f_{i,n})$, for tree $T_i$ (for $i = 1,2,3$), let 
304: \[
305: \bof_i(\hat{p}) = (f_{i,1}(\hat{p}), \dots, f_{i,n}(\hat{p}))
306: \]
307: be the point in $\rr^{n}$ obtained by evaluating the invariants for $T_i$ at
308: $\hat{p}$.  
309: \end{definition}
310: 
311: If the probability distribution $\hat{p}$ actually comes from the model $T_1$,
312: then we will have $\bof_1(\hat{p}) = 0$, $\bof_2(\hat{p}) \neq 0$, and $\bof_3(\hat{p})
313: \neq 0$ generically (that is, except for points $\hat{p}$ which lie on the
314: intersection of two or more models).  This fact suggests that we can just pick
315: the tree $T_i$ such that $\bof_i$ is closest to zero.  However, the next example
316: shows that it is quite important to pick good polynomials and weigh them properly.
317: 
318: \begin{example}\rm
Figure~\ref{fig:hist} shows the distribution of four of the invariants from $\bof^{JC69}_i$
320: on data from simulations
321: of 1000 i.i.d.\
322: draws from the Jukes-Cantor model on $T_1$ over a varying set of parameters.
323: The histograms show the distributions for the simulated tree ($T_1$) in yellow
324: and the distributions for the other trees in gray and black.
325: \begin{figure}
326: 	\centering
327: 	\includegraphics[width=.5\textwidth]{figures/4poly-hist.pdf}
328: 	\caption{Distributions of four polynomials $f_{i,10}, f_{i,48}, f_{i,23}$ 
329: 	and $f_{i,45}$ on simulated data.
330: 	The yellow histogram corresponds to the correct tree, the black and gray are
331: 	the other two trees.}
332: 	\label{fig:hist}
333: \end{figure}
334: 
335: Polynomial 10 (upper left) distinguishes nicely between the three trees
336: with the correct tree tightly distributed around zero. It is correct 97\% of
337: the time on our space of trees (Figure~\ref{fig:param}). Polynomial 48 (upper right) also shows power to
338: distinguish between all three trees, but
339: the distributions are much more overlapping --- it is only correct 50.8\% of the time.
340: Polynomial 10 is the four-point invariant from Example~\ref{ex:4pt}, polynomial
341: 48 is one of Lake's linear invariants.
342: The two other examples show a polynomial (23) which is biased towards selecting
343: the wrong tree (only 16\% correct), and a polynomial (45) for which the correct tree is tightly
344: clustered around zero, but the incorrect trees are indistinguishable and have wide
345: variance (88.9\% correct).
346: 
347: %\begin{figure}
348: %	\centering
349: %	\includegraphics[width=4in]{figures/myRank_inv}
350: %	\caption{Performance of the individual Jukes-Cantor invariants.}
351: %	\label{fig:indiv}
352: %\end{figure}
353: 
354: The parameters used for the simulations are described in
355: Figure~\ref{fig:param}.  Since 1000 samples should be quite enough to determine
356: the structure of a tree on four taxa, it is revealing that many of the
357: individual polynomials are quite poor 
358: %(see Figure~\ref{fig:indiv} for the individual prediction rates for each polynomial)
359: (the mean prediction rate for all 49 polynomials is only 42\%).  
360: The invariants have quite different variances and means and it is not optimal
361: to take each one with equal weight.  
362: \end{example}
363: 
364: 
365: %\section{Metric learning}
366: %\label{sec:3}
367: %In this section we propose a learning algorithm to find a 
368: %metric on the invariants which optimizes their power.
369: 
370: This example shows that we need to scale and weigh the individual invariants.
371: Recall that for a positive (semi)definite matrix $A$ %\in \rr^{d^2}$,
372: the Mahalanobis (semi)norm $\|\cdot \|_A$ is defined by 
373: \[
374: \| x \|_A = \sqrt{x^t A x}.
375: \]
376: %It satisfies (i) $\|x\|_A\geq 0$, (ii) $\|c x\|_A = |c|\|x\|_A$ (for $c\in
377: %\rr$), and (iii) $\|x+y\|_A\leq \|x\|_A+\|y\|_A$. Note that in (i) it is
378: %possible to have $\|x\|_A= 0$ with $x\neq 0$ in contrast to a norm $\|\cdot \|$
379: %which satisfies $\|x\|=0$ if and only if $x=0$.
380: Notice that since $A$ is positive semidefinite, it can be written as $A = U D
381: U^t$ where $U$ is  orthogonal and $D$ is diagonal with non-negative entries.
Thus $A$ has a unique positive semidefinite square root $B = U \sqrt{D} U^t$.  Now since $\|x\|_A^2
383: = x^tAx = (Bx)^t(Bx)=\|Bx\|^2$, we can view learning such a metric as finding a
384: transformation of the space of invariants that replaces each point $x$ with $Bx$ under the 
385: Euclidean norm.
386: Accordingly, we will be searching for a positive semidefinite
387: matrix $A$ on the space of invariants which is ``optimal''.
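
As a small illustration of this rescaling (the numbers below are made up for
the purpose of the example and are not taken from the simulations), let
$x = (3, 4)^t$ and $A = \diag(1, \frac{1}{16})$.  Then $B = \diag(1, \frac 1 4)$ and
\[
\|x\|_A = \sqrt{3^2 \cdot 1 + 4^2 \cdot \tfrac{1}{16}} = \sqrt{10},
\qquad
Bx = (3, 1)^t, \qquad \|Bx\| = \sqrt{3^2 + 1^2} = \sqrt{10},
\]
whereas the Euclidean norm ($A = I$) gives $\|x\| = 5$.  In general, for a
diagonal matrix $A = \diag(a_1, \dots, a_n)$ with $a_k \geq 0$ we have
$\|x\|_A^2 = \sum_{k} a_k x_k^2$, so that a small weight $a_k$ suppresses the
contribution of the $k$th coordinate.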
388: 
389: Let
390: $\hpt$ be an empirical probability distribution generated from a phylogenetic
391: model on tree $T_1$ with parameters $\theta$.
392: We wish to find $A$ such that the condition
393: \[
\|\bof_1(\hpt)\|_A < \min\left( \|\bof_2(\hpt)\|_A, \|\bof_3(\hpt)\|_A \right)
\]
holds for most $\hpt$ generated from parameters $\theta$ in a suitable parameter space $\Theta$.  
397: 
398: Now suppose that $\Theta$ is a finite set of parameters from which we generate
399: training data $\bof_1(\hpt), \bof_2(\hpt), \bof_3(\hpt)$ for $\theta \in \Theta$.
400: As we saw above, each of the eight possible ways of writing
401: each tree induces a signed permutation of the coordinates of each
402: $\bof_i(\hpt)$.  We write these permutations in matrix form as $\pi_1, \dots,
403: \pi_8$.
404: Given this training data, we wish to solve the following optimization problem.
405: %For this purpose, the training data are generated as a set of triples
406: %$\{(x_1(\theta), x_2(\theta), x_3(\theta))\in \rr^{3d}: \theta = (\phi_e,\pi_e)
407: %\in \Theta\}$, where $\Theta$ is a finite set and $d=49$ in our setting.
408: 
409: \begin{alg}\rm
410: 	\label{alg:meta}
411: 	{\bf (Metric learning for invariants)}
412: 
413: 	\noindent {\em Input:}  model invariants $\bof_i$ for $T_i$
414: 	and a finite set $\Theta$ of model parameters.
415: 
	\noindent {\em Output:} a positive semidefinite matrix $A$ 
417: 
418: 	\noindent {\em Procedure:}
419: 	\begin{enumerate}
420: 		\item Reduce the sets $\bof_i$ of invariants to a manageable size by testing individual invariants
421: 			on data simulated from $\theta \in \Theta$.
422: 		\item Augment the resulting sets so that they are closed under the eight
423: 			permutations of the input which fix tree $T_1$.
424: 		\item Compute the signed permutations $\pi_1, \dots, \pi_8$ which are induced on the
425: 			invariants $\bof_1$ by the above permutations.
426: 		\item Solve the following semidefinite programming problem:
427: 			\begin{equation*}
428: 				\begin{array}{ll}
429: 					%\text{Given: }& \bof_1, \bof_2, \bof_3
430: 					%\text{ and a finite set } \Theta \text{ of parameters,}\\
431: 					\text{Minimize: } &\sum_{\theta \in \Theta} \xi(\theta) + \lambda \tr A\\
432: 					\text{Subject to: } \quad
					&\| \bof_1 (\hpt) \|_A^2   + \gamma \leq \min\left( \| \bof_2 (\hpt)\|_A^2 ,  \|\bof_3(\hpt) \|_A^2 \right)  + \xi(\theta) \quad\text{for all } \theta \in \Theta,\\
434: 					& \pi_i A = A \pi_i \quad\text{for } 1 \leq i \leq 8,\\
435: 					&A \succeq 0, \quad\text{and}\\
436: 					&\xi(\theta)  \geq 0,
437: 				\end{array}
438: 			\end{equation*} 
439: 			where $A\succeq 0$ denotes that $A$ is a positive semidefinite
440: 			matrix. 
441: 		\item Alternatively, if we restrict $A$ to be
442: 			diagonal, this becomes a linear program and can be solved for much
443: 			larger sets of invariants and parameters.
444: 	\end{enumerate}
445: \end{alg}
446: 
447: In the optimization step, we use a regularization parameter $\lambda$ to keep $A$
448: small and a margin parameter $\gamma$ to increase the margin between the
449: distributions.  This is a convex optimization problem with a linear objective
450: function and linear matrix equality and inequality constraints. Hence it is a
451: semidefinite programming (SDP) problem. 
452: The SDP problem above has a unique optimizer and can be solved in polynomial
time. Its complexity depends on the size of the set $\Theta$ since each
point in $\Theta$ contributes linear constraints on $A$ (one for each of the two incorrect trees). 
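
One way to see the linear structure of the constraints in
Algorithm~\ref{alg:meta} is to rewrite them explicitly.  The following is a
sketch; the matrices $C_j(\theta)$ and the weights $a_k$ are notation
introduced here only for illustration.  For any vector $x$ we have
$\|x\|_A^2 = x^t A x = \tr(A x x^t)$.  Writing
\[
C_j(\theta) = \bof_j(\hpt)\, \bof_j(\hpt)^t - \bof_1(\hpt)\, \bof_1(\hpt)^t
\qquad (j = 2, 3),
\]
the margin constraint for a given $\theta$ is equivalent to the pair of
inequalities
\[
\tr\left( A\, C_j(\theta) \right) + \xi(\theta) \geq \gamma, \qquad j = 2, 3,
\]
each of which is linear in the unknowns $A$ and $\xi(\theta)$.  If $A$ is
restricted to be diagonal, say $A = \diag(a_1, \dots, a_n)$, these become
\[
\sum_{k=1}^{n} a_k \left( f_{j,k}(\hpt)^2 - f_{1,k}(\hpt)^2 \right) + \xi(\theta) \geq \gamma,
\qquad a_k \geq 0,
\]
and the objective is $\sum_{\theta \in \Theta} \xi(\theta) + \lambda \sum_{k} a_k$,
which is a linear program in the variables $a_k$ and $\xi(\theta)$.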
455: 
456: 
457: For the range of parameters we consider in Section~\ref{sec:results}, we use
458: $\#(\Theta)=1444$.  In our experiments to solve the SDP we use SeDuMi 1.1
459: \cite{Sturm1999} or DSDP5
460: \cite{BenYe05} with YALMIP \cite{Lofberg2004} as the parser.
461: Matlab code to implement the above algorithm can be found at
462: \url{http://math.stanford.edu/~yuany/metricPhylo/matlab/}. 
463: 
We found that although SeDuMi often runs into numerical issues, it generally
finds a good matrix $A$ whose performance is competitive with the neighbor-joining
algorithm. DSDP is numerically more stable, at the cost of
more computational time. We have found that setting $\lambda = 0.0001$ and
468: $\gamma = 0.005$ gives good results in our situation. For example, in the case
469: of the JC69 model with a $49\times49$ semidefinite matrix $A$, YALMIP-SeDuMi takes 55.7
470: minutes to parse the constraints and solve the SDP, while YALMIP-DSDP takes
471: 167.1 minutes to finish the same job.  For details on experiments, see the next
472: section. 
473: 
474: Our algorithm was inspired by some early results on metric learning algorithms
475: such as \cite{XingNg03} and \cite{ShaSinNg04}, which aim to find a
476: (pseudo)-metric such that the mutual distances between similar examples are
477: minimized while the distances across dissimilar examples or classes are kept
478: large. Direct application of such an algorithm is not quite suitable in our setting. 
479: As shown in Figure~\ref{fig:hist},
480: %a global view on the distribution of all three classes of data discloses that 
the points from the correct tree overlap with those from the two incorrect trees. For
482: points in the overlapping region, it is hard to tell whether to shrink or
483: stretch their mutual distance. However, when the points appear in triples, it
484: is possible that for each triple the one closest to zero is generated from the
485: correct tree. Our algorithm is based on such an intuition and proved successful
486: in experiments. 
487: 
488: After using Algorithm~\ref{alg:meta} to find a good metric $A$, the following
489: simple algorithm allows us to construct trees on four taxa.
490: 
491: \begin{alg}\rm
492: \label{alg:1}
493: {\bf (Tree construction with invariants)}  
494: 
495: \noindent {\em Input:} 
496: A multiple alignment of 4 species and a semidefinite matrix $A$ from Algorithm~\ref{alg:meta}.
497: %invariant under
498: %the signed permutations $\pi_1, \dots, \pi_8$.
499: 
500: \noindent {\em Output:}
501: A phylogenetic tree on the 4 species (without branch lengths).
502:  
503: \noindent {\em Procedure:}
504: \begin{enumerate}
505: 	\item Form empirical distributions $\hat{p}$ by counting columns of the alignment.
506: 	\item Form the vectors $\bof_i = (f_{i,1}(\hat{p}), \dots, f_{i,n}(\hat{p}))$ for $1 \leq i \leq 3$.
507: 	\item Return $T_i$ where %$i$ is such that 
508: 		the vector $\bof_i$ has smallest $A$-norm $\|\bof_i\|_A = \sqrt{\bof_i^t A \bof_i}$.
509: \end{enumerate}
510: \end{alg}
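
The role of the constraint $\pi_i A = A \pi_i$ in Algorithm~\ref{alg:meta} can
be checked directly.  The following is a sketch which uses only the fact that
a signed permutation matrix $\pi$ is orthogonal, i.e., $\pi^t \pi = I$.  If
$\pi A = A \pi$, then for any vector $x$,
\[
\| \pi x \|_A^2 = (\pi x)^t A (\pi x) = x^t \pi^t A \pi x
= x^t \pi^t \pi A x = x^t A x = \| x \|_A^2 .
\]
Hence the $A$-norm of $\bof_1(\hat{p})$ is unchanged when the input sequences
are reordered by one of the eight symmetries which fix $T_1$.  Note also that
in step 3 of Algorithm~\ref{alg:1} it suffices to compare the squared norms
$\bof_i^t A \bof_i$, since the square root is monotone.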
511: 
512: 
513: %\begin{figure}
514: %\centering
515: %%\includegraphics[width=.5\textwidth]{figures/RedonTop.jpg}
516: %%\includegraphics[width=.5\textwidth]{figures/RedonBottom1.jpg}
517: %\includegraphics[width=.75\textwidth]{figures/RedonBottom}
518: %\caption{Projections of the three data sets onto the first two eigenspaces of
519: %the metric $A$.  Red circles correspond to the correct tree, blue dots and
520: %green x's to the two incorrect trees.}
521: %\label{fig:eigen}
522: %\end{figure}
523: 
524: 
525: \section{Results}
526: \label{sec:results}
527: 
528: We tested our metric learning algorithms for the invariants $\bof^{JC69}$ and
529: $\bof^{K81}$ as described above.  
530: We trained two metrics for JC69 on 
531: the tree in Figure~\ref{fig:param}.
532: The first used simulations from branch lengths between $0.01$ and $0.75$ on a
533: grid with increments of magnitude $0.02$, for a total of $1444$ different
534: parameters.  
535: This region of parameter space was chosen for direct comparison with
536: \cite{Huelsenbeck1995,Casanellas2006}.
537: The second metric used parameters $0.01 \leq a \leq 0.25$ and $0.51 \leq b \leq
538: 0.75$ with increments of $0.01$, giving $625$ trees with very short interior edges.
539: Similarly for K81, we learned a metric using parameters $0.01 \leq a,b \leq 0.75$ and
540: several sets of $\gamma, \alpha, \beta$.
541: After learning metrics, we performed simulation tests on the same space of
542: parameters that was used to train.  We compared neighbor-joining
543: \cite{Saitou1987}, phylogenetic invariants with the $l_1$ or $l_2$ norm and
544: phylogenetic invariants with our learned norms. 
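
The sizes of the two training grids can be verified by a short count,
assuming that both endpoints of each range are included.  For the full
parameter space, each of $a$ and $b$ takes
\[
\frac{0.75 - 0.01}{0.02} + 1 = 38
\]
values, giving $38^2 = 1444$ parameter points.  For the second metric, $a$
and $b$ each take
\[
\frac{0.25 - 0.01}{0.01} + 1 = \frac{0.75 - 0.51}{0.01} + 1 = 25
\]
values, giving $25^2 = 625$ parameter points.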
545: 
546: Since the edge lengths are large for part of the parameter space, we often see
547: simulated alignments with more than 75\% mismatches between pairs of taxa. In
548: such a case, (\ref{eq:JCdist}) returns infinite distance estimates under the
549: Jukes-Cantor
550: model.  So the results depend on how neighbor joining treats
551: infinite distances.  In PHYLIP
552: \cite{PHYLIP}, the program \texttt{dnadist} doesn't return a distance
553: matrix if some distances are infinite.  However, in PAUP*, infinite distances are set to a fixed large
554: number. 
555: Since we are only concerned with the tree topology, we believe that the most
556: fair comparison is between phylogenetic invariants and the method of PAUP*.
However, it should be noted that this can make a major difference in results using neighbor joining, since often the correct tree can be returned even if some distances are infinite. See Figure~\ref{fig:contour} for an example of the difference and be warned that comparison between simulation studies done in different ways is difficult.
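
To illustrate how the distance estimate (\ref{eq:JCdist}) degenerates, here
are a few values computed directly from the formula (the mismatch fractions
are chosen only for illustration, and $\log$ denotes the natural logarithm):
\[
\begin{array}{ll}
m_{ij} = 0.3: & d_{ij} = -\frac 3 4 \log(0.6) \approx 0.383,\\
m_{ij} = 0.6: & d_{ij} = -\frac 3 4 \log(0.2) \approx 1.207,\\
m_{ij} = 0.7: & d_{ij} = -\frac 3 4 \log(0.0667) \approx 2.031,\\
m_{ij} \geq 0.75: & 1 - \frac 4 3 m_{ij} \leq 0, \text{ so the logarithm is undefined.}
\end{array}
\]
Under the Jukes-Cantor model the expected mismatch fraction is always
strictly less than $3/4$, so observed values of $m_{ij} \geq 0.75$ arise from
sampling noise in alignments of finite length.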
558: 
559: Table~\ref{tab:JC} shows the results of 100 simulations at each of the 1444
560: parameter values for various sequence lengths using the JC69 model.  It gives
561: the percent correct over all 144,400 trials for
four different methods: invariants with the $l_1$, $l_2$, and $A$-norms and
neighbor joining (using Jukes-Cantor distances and allowing infinite distances).  The contour
plots in Figure~\ref{fig:contour} show how the reconstruction rates vary across
parameter space for a sequence length of 100.  
566: Notice that the $A$-norm shows
567: particularly good behavior over the entire range of parameters, even in the
568: ``Felsenstein zone'' in the upper left corner.  
569: When trained on the Felsenstein zone, the learned metric can perform even
570: better.  Table~\ref{tab:JC} shows the result of training a metric on this zone.
571: Notice that the $A$-norm is now quite a bit better than neighbor joining, even
572: though the $l_1$ and $l_2$ norms are terrible.  However, this learned norm is slightly worse on the whole parameter space than the metric trained on the whole space.
573: 
574: \begin{table}
575: 	\centering
576: 	\begin{tabular}{cccccl@{\hspace{1cm}}ccccc}
577: 		\multicolumn{5}{c}{Full parameter space} & &
578: 		\multicolumn{5}{c}{Felsenstein zone}\\
579: 		Length & $l_1$  & $l_2$ & $A$-norm  &  NJ & &
580: 		Length & $l_1$  & $l_2$ & $A$-norm  &  NJ\\
581: 		\cline{1-5} \cline{7-11} 
582: 		 25 & 62.7  & 59.8  & 74.5  & 75.5 &&   25 & 31.8 & 31.3 & 58.9 & 52.5\\ 
583: 		 50 & 71.9  & 66.3  & 85.0  & 85.9 &&   50 & 35.9 & 33.0 & 69.9 & 63.2\\
584: 		 75 & 76.7  & 69.6  & 90.0  & 90.4 &&   75 & 39.2 & 34.4 & 76.5 & 69.5\\
585: 		100 & 79.8  & 72.0  & 92.7  & 92.9 &&  100 & 42.2 & 35.8 & 81.2 & 73.9\\
586: 		200 & 86.4  & 77.6  & 97.0  & 96.6 &&  200 & 50.9 & 39.1 & 90.1 & 83.5\\
587: 		300 & 89.2  & 80.1  & 98.2  & 97.7 &&  300 & 55.4 & 40.4 & 93.6 & 87.8\\
588: 		400 & 91.1  & 82.1  & 98.7  & 98.2 &&  400 & 59.5 & 41.4 & 95.1 & 90.2\\
589: 		500 & 92.3  & 83.5  & 99.0  & 98.4 &&  500 & 62.4 & 42.5 & 95.6 & 91.4\\
590: 		\cline{1-5} \cline{7-11} 
591: 	\end{tabular}
592: 	\caption{Percent of trials reconstructed correctly for the Jukes-Cantor
593: 	model over the entire parameter space and the Felsenstein zone for the
594: 	respective metrics.}
595: 	\label{tab:JC}
596: \end{table}
597: 
598: Table~\ref{tab:K81} shows results for the K81 model under two choices of $(\gamma, \alpha, \beta)$.
599: We only report the $l_2$ scores, since the $l_1$ scores are similar.  Of note
600: is the column ``$l_2$ restrict'' which shows the $l_2$ norm on the top $52$
601: invariants as ranked by individual power on simulations as in the previous
602: section.  This column is better than the $l_2$ norm on all $11612$ invariants,
603: showing that many invariants are actually harmful.
604: The $A$-norm again improves on even the restricted $l_2$ and beats
605: neighbor-joining (run with K81 distances) on all examples.
606: 
607: \begin{table}
608: 	\centering
609: 	\begin{tabular}{cccccl@{\hspace{1cm}}ccccc}
610: %$\gamma,\alpha, \beta$ & 
611: 		\multicolumn{5}{c}{$(\gamma, \alpha, \beta) = (0.1, 3.0, 0.5)$}&&
612: 		\multicolumn{5}{c}{$(\gamma, \alpha, \beta) = (0.2, 0.5, 0.3)$}\\
Length & $l_2$ & $l_2$ restrict & $A$-norm & NJ&&
Length & $l_2$ & $l_2$ restrict & $A$-norm & NJ\\
615: \cline{1-5} \cline{7-11}
616:  25 & 59.2 & 66.9 & 71.5 & 62.7 && 25 & 63.7 & 67.4 & 70.9 & 65.0 \\ 
617:  50 & 68.0 & 77.5 & 82.1 & 72.7 && 50 & 71.6 & 77.5 & 81.1 & 74.2 \\
618:  75 & 73.4 & 82.7 & 86.9 & 79.3 && 75 & 75.4 & 82.7 & 86.4 & 80.5 \\
619: 100 & 76.8 & 85.8 & 89.7 & 82.6 &&100 & 77.6 & 85.8 & 89.3 & 83.4 \\
620: 200 & 84.6 & 90.8 & 94.3 & 90.1 &&200 & 83.3 & 91.3 & 94.1 & 89.7 \\
300 & 88.2 & 92.6 & 95.6 & 93.1 &&300 & 86.4 & 93.2 & 95.7 & 91.8 \\
622: 400 & 90.1 & 93.6 & 96.4 & 94.8 &&400 & 88.5 & 94.3 & 96.5 & 93.0 \\
623: 500 & 91.4 & 94.2 & 96.9 & 95.7 &&500 & 90.0 & 93.7 & 96.3 & 93.2 \\
624: \cline{1-5} \cline{7-11}
625: 	\end{tabular}
626: 	\caption{Percent of trials reconstructed correctly 
627: 	for the Kimura 3-parameter model over the entire parameter space for two choices of $\gamma, \alpha, \beta$.} 
628: 	\label{tab:K81}
629: \end{table}
630: 
631: \begin{figure}
632: 	\centering
633: 	\begin{tabular}{cc}
634: 	\includegraphics[height=.35\textheight]{figures/contour-l1} & 
635: 	%\includegraphics[height=.25\textheight]{figures/contour-l2}\\
636: 	\includegraphics[height=.35\textheight]{figures/contour-A} \\
637: 	\includegraphics[height=.35\textheight]{figures/contour-NJinf} &
638: 	%\multicolumn{2}{c}{
639: 	\includegraphics[height=.35\textheight]{figures/contour-NJ} \\
640: \end{tabular}
641: 	\caption{Contour plots for the three reconstruction methods for the
642: 	Jukes-Cantor model over parameter space with alignments of length 100.
643: 	Black areas correspond to parameters
644: 	$(a,b)$ for which the tree was reconstructed correctly  over 95\% of the
645: 	time, gray for over 50\%, light gray for over 33\%, and white for under 33\%.  } 
646: 	\label{fig:contour}
647: \end{figure}
648: 
649: 
650: \section{Discussion}
651: \label{sec:dis}
652: 
We have shown that machine learning algorithms can substantially improve the
tree construction performance of phylogenetic invariants.  As an example, for
sequences of length 100, the four-point invariant (Example~\ref{ex:4pt}) for
the K81 model is correct 82\%  of the time on data simulated from K81 with
parameters $(0.1, 3.0, 0.5)$.  This is noticeably better than the
$l_2$ norm on all 11612 invariants (76.8\%, Table~\ref{tab:K81}), while the
learned $A$-norm reaches 89.7\% in the same setting (Table~\ref{tab:K81}).
659: 
The paper \cite{Casanellas2007} describes an algebraic method for picking a
subset of invariants for the K81 model.  The authors reduce to 48 invariants, which give
an improvement over all 11612 invariants (up to 82.6\% on the above
example using the $l_2$ norm).  However, only 4 of these 48 are among
the top 52 we selected for $\bof^{K81}$, and the remaining 44 invariants are
mostly quite poor (42\% average accuracy). After taking the closure of these 48 invariants, there are 156
in total and the performance actually drops to 78.3\%.  It seems that the
conditions for an invariant to be powerful are not particularly related to the
algebraic criterion used in \cite{Casanellas2007}.
669: 
All invariant-based methods depend heavily on the set of invariants that we
begin with.  Learning diagonal matrices $A$ had mixed performance, which further
suggests that the generating set we are using for the invariants is
non-optimal.  We believe that it is an important mathematical problem to
understand what properties are shared by the good invariants.  We suggest that
symmetry might be an important criterion for constructing other polynomials
that, like the four-point condition, have good power.
677: 
The learned metrics in this paper are somewhat dependent on the
679: parameters chosen to train them.  This can be a benefit, as it allows us to
680: train tree construction algorithms for specific regions of parameter space
681: (e.g., the Felsenstein zone).  However, we hope that improvements to the metric
682: programming will allow us to train on larger parameter sets and thus obtain
683: uniformly better algorithms.
684: 
685: Notice that these methods only recover the tree topology, not the edge lengths.
686: We believe that if the edge lengths are needed, they should be estimated after
687: building the tree, in which case standard statistical methods such as maximum
688: likelihood can be used easily.
While the invariants discussed in this paper may not be practical for large
trees, we believe there is great value in fully understanding the problem of
691: building trees on four taxa.  For example, these methods can either be used
692: as an input to quartet-based tree construction algorithms or as a verification
693: step for larger phylogenetic trees.
694: 
695: For a method of building trees on more than four taxa using phylogenetic
696: invariants, see \cite{Eriksson2005b,Kim2006}, which use numerical linear
697: algebra to evaluate invariants given by rank conditions on certain matrices in
order to construct phylogenetic trees.  This amounts to evaluating many
polynomials at once, which allows the method to run in polynomial time.
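
For concreteness, the rank conditions can be stated as follows; the notation
in this paragraph is introduced here for illustration and is only a sketch of
the idea in \cite{Eriksson2005b}.  For a split $S_1 | S_2$ of the taxa, let
$\mathrm{Flat}_{S_1|S_2}(p)$ denote the $4^{|S_1|} \times 4^{|S_2|}$ matrix
whose rows are indexed by site patterns on $S_1$, whose columns are indexed by
site patterns on $S_2$, and whose entries are the corresponding joint
probabilities.  If the split is induced by an edge of the tree, then under the
general Markov model this matrix has rank at most $4$, so all of its
$5 \times 5$ minors are invariants.  By the Eckart--Young theorem, if
$\sigma_1 \geq \sigma_2 \geq \cdots$ are the singular values of the empirical
flattening $\mathrm{Flat}_{S_1|S_2}(\hat{p})$, then its Frobenius distance to
the nearest matrix of rank at most $4$ is
\[
	\delta_{S_1|S_2}(\hat{p}) = \sqrt{\sum_{k \geq 5} \sigma_k^2},
\]
so a single singular value decomposition takes the place of evaluating the
individual minors one at a time.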
700: 
701: The matrices $A$ for JC69 and K81 used in the tests can be found at
702: \url{http://stanford.edu/~nke/data/metricPhylo}.  A software package that can
703: run these tests is available at the same website.  It includes a program for
704: simulating evolution using any Markov model on a tree and several programs
705: using phylogenetic invariants.
706: 
707: \section*{Acknowledgments}
708: N.~Eriksson was supported by NSF grant DMS-0603448 and wishes to thank MSRI and
709: the IMA for their hospitality. Y.~Yao was supported by DARPA grant 1092228.  We
710: wish to thank E.\ Allman, M.\ Drton, D.\ Ge, F.\ Memoli, L.\ Pachter, J.\
711: Rhodes, and Y.\ Ye for helpful comments and especially G.\ Carlsson for his
712: encouragement and computational resources. 
713: 
714: \bibliographystyle{alpha}
715: %\bibliography{nikos}
716: 
717: \begin{thebibliography}{KKPP06}
718: 
719: \bibitem[AR07]{Allman2007}
720: E~Allman and J~Rhodes.
721: \newblock Molecular phylogenetics from an algebraic viewpoint, 2007.
722: \newblock To appear.
723: 
724: \bibitem[Bun74]{Buneman1974}
725: Peter Buneman.
726: \newblock A note on the metric properties of trees.
727: \newblock {\em J. Combinatorial Theory Ser. B}, 17:48--50, 1974.
728: 
729: \bibitem[BY05]{BenYe05}
730: Steven~J. Benson and Yinyu Ye.
731: \newblock {DSDP5}: Software for semidefinite programming.
732: \newblock Technical Report ANL/MCS-P1289-0905, Mathematics and Computer Science
733:   Division, Argonne National Laboratory, Argonne, IL, September 2005.
734: \newblock Submitted to ACM Transactions on Mathematical Software.
735: 
736: \bibitem[CF87]{Cavender1987}
737: J~Cavender and J~Felsenstein.
738: \newblock Invariants of phylogenies in a simple case with discrete states.
739: \newblock {\em Journal of Classification}, 4:57--71, 1987.
740: 
741: \bibitem[CFS06]{Casanellas2006}
742: M.~Casanellas and J.~Fern\'{a}ndez-S\'{a}nchez.
743: \newblock {Performance of a New Invariants Method on Homogeneous and
744:   Non-homogeneous Quartet Trees}.
745: \newblock {\em Mol Biol Evol}, page msl153, 2006.
746: 
747: \bibitem[CFS07]{Casanellas2007}
748: M.~Casanellas and J.~Fern\'{a}ndez-S\'{a}nchez.
749: \newblock {Geometry of the Kimura 3-parameter model}, 2007.
\newblock available at arXiv:math.AG/0702834.
751: 
752: \bibitem[Eri05]{Eriksson2005b}
753: Nicholas Eriksson.
\newblock Tree construction using singular value decomposition.
755: \newblock In L.~Pachter and B.~Sturmfels, editors, {\em Algebraic Statistics
756:   for Computational Biology}, chapter~19, pages 347--358. Cambridge University
757:   Press, Cambridge, UK, 2005.
758: 
759: \bibitem[ES93]{Evans1993}
760: S~Evans and T~Speed.
761: \newblock Invariants of some probability models used in phylogenetic inference.
762: \newblock {\em The Annals of Statistics}, 21:355--377, 1993.
763: 
764: \bibitem[Fel04]{PHYLIP}
765: J~Felsenstein.
766: \newblock {PHYLIP (Phylogeny Inference Package) version 3.6}.
767: \newblock Distributed by the author, Department of Genome Sciences, University
768:   of Washington, Seattle, 2004.
769: 
770: \bibitem[HP89]{Hendy1989}
771: M~Hendy and D~Penny.
772: \newblock A framework for the quantitative study of evolutionary trees.
773: \newblock {\em Systematic Zoology}, 38(4), 1989.
774: 
775: \bibitem[Hue95]{Huelsenbeck1995}
776: John~P. Huelsenbeck.
777: \newblock Performance of phylogenetic methods in simulations.
\newblock {\em Syst Biol}, 44(1):17--48, 1995.
779: 
780: \bibitem[JC69]{Jukes1969}
781: TH~Jukes and C~Cantor.
782: \newblock Evolution of protein molecules.
783: \newblock In HN~Munro, editor, {\em Mammalian Protein Metabolism}, pages
784:   21--32. New York Academic Press, 1969.
785: 
786: \bibitem[Kim81]{Kimura1981}
787: M~Kimura.
\newblock Estimation of evolutionary distances between homologous nucleotide
789:   sequences.
790: \newblock {\em Proceedings of the National Academy of Sciences, USA},
791:   78:454--458, 1981.
792: 
793: \bibitem[KKPP06]{Kim2006}
794: Young~Rock Kim, Oh-In Kwon, Seong-Hun Paeng, and Chun-Jae Park.
795: \newblock Phylogenetic tree constructing algorithms fit for grid computing with
796:   {SVD}.
797: \newblock Available at \url{http://arxiv.org/abs/q-bio.QM/0611015}, 2006.
798: 
799: \bibitem[L\"04]{Lofberg2004}
800: J.~L\"ofberg.
801: \newblock {YALMIP} : a toolbox for modeling and optimization in {MATLAB}.
802: \newblock In {\em Computer Aided Control Systems Design}, pages 284--289, 2004.
803: 
804: \bibitem[Lak87]{Lake1987}
805: JA~Lake.
\newblock A rate-independent technique for analysis of nucleic acid sequences:
807:   evolutionary parsimony.
808: \newblock {\em Molecular Biology and Evolution}, 4:167--191, 1987.
809: 
810: \bibitem[SN87]{Saitou1987}
811: N~Saitou and M~Nei.
812: \newblock The neighbor joining method: a new method for reconstructing
813:   phylogenetic trees.
814: \newblock {\em Molecular Biology and Evolution}, 4(4):406--425, 1987.
815: 
816: \bibitem[SS05]{Sturmfels2005}
817: Bernd Sturmfels and Seth Sullivant.
818: \newblock {T}oric ideals of phylogenetic invariants.
819: \newblock {\em J Comput Biol}, 12(4):457--481, May 2005.
820: 
821: \bibitem[SSSN04]{ShaSinNg04}
822: Shai Shalev-Shwartz, Yoram Singer, and Andrew~Y. Ng.
823: \newblock Online learning of pseudo-metrics.
824: \newblock In {\em Proceedings of the Twenty-first International Conference on
825:   Machine Learning}, 2004.
826: 
827: \bibitem[Stu99]{Sturm1999}
828: J.F. Sturm.
829: \newblock Using {SeDuMi} 1.02, a {MATLAB} toolbox for optimization over
830:   symmetric cones.
831: \newblock {\em Optimization Methods and Software}, 11--12:625--653, 1999.
832: \newblock Special issue on Interior Point Methods (CD supplement with
833:   software).
834: 
835: \bibitem[XNJR03]{XingNg03}
836: Eric Xing, Andrew~Y. Ng, Michael Jordan, and Stuart Russell.
837: \newblock Distance metric learning, with application to clustering with
838:   side-information.
839: \newblock In {\em NIPS}, 2003.
840: 
841: \end{thebibliography}
842: \end{document}
843: 
844: