q-bio0512009/good.tex
1: \documentclass{article}
2: 
3: \usepackage{graphicx}
4: \usepackage{amsmath}
5: \usepackage{amssymb}
6: \usepackage{eucal}
7: \usepackage{natbib}
8: 
9: \newcommand{\ocaml}{\texttt{ocaml}}
10: \newcommand{\nbar}{\ensuremath{\bar{N}}}
11: \newcommand{\varn}{\ensuremath{\sigma_N^2}}
12: \newcommand{\Ic}{\ensuremath{I_c}}
13: \newcommand{\Qone}{\ensuremath{Q_1}}
14: \newcommand{\Itwo}{\ensuremath{I_2}}
15: \newcommand{\Bone}{\ensuremath{B_1}}
16: \newcommand{\Btwo}{\ensuremath{B_2}}
17: \newcommand{\Aone}{\ensuremath{A_1}}
18: \newcommand{\Atwo}{\ensuremath{A_2}}
19: 
20: \newcommand{\ra}{\rightarrow}
21: \newcommand{\Tn}{\ensuremath{\mathcal{T}_n}}
22: \newcommand{\R}{\mathbb{R}}
23: \newcommand{\x}{\mathbf{x}}
24: \newcommand{\one}{\ensuremath{\mathbf{1}}}
25: \newcommand{\argmax}{\operatornamewithlimits{argmax}}
26: \newcommand{\norm}[1]{\ensuremath{\|#1\|}}
27: \newcommand{\stack}[2]{\begin{smallmatrix} #1 \\ #2 \end{smallmatrix}}
28: 
29: \newcommand{\SBsection}[1]{\vspace{.6cm} \noindent \textsc{#1} \vspace{.2cm}}
30: \newcommand{\SBsubsection}[1]{\vspace{.4cm} \noindent \textit{#1} \vspace{.2cm}}
31: \newcommand{\SBsubsubsection}[1]{\vspace{.4cm} \noindent
32:   \textsc{\small #1} \vspace{.2cm}}
33: 
34: \newenvironment{parmatrix}{\left( \begin{array}}{\end{array} \right)}
35: 
36: % for double spacing
37: %\usepackage{doublespace}
38: 
39: \newtheorem{theorem}{Theorem}
40: \newtheorem{prop}{Proposition}
41: \newtheorem{lemma}{Lemma}
42: 
43: \title{A geometric approach to tree shape statistics}
44: \author{Frederick A. Matsen}
45: 
46: %%%%%%%%%%%%%%%%%%%%
47: % todo
48: % save generated trees
49: 
50: \begin{document}
51: 
52: \maketitle
53:  
54: \newcounter{count}
55: 
56: \begin{abstract}
57:   This article presents a new way to understand the descriptive
58:   ability of tree shape statistics. Where before tree shape statistics
59:   were chosen by their ability to distinguish between
60:   macroevolutionary models, the ``resolution'' presented in this
61:   paper quantifies the ability of a statistic to differentiate between
62:   similar and different trees. We term this a ``geometric'' approach
63:   to differentiate it from the model-based approach previously
64:   explored. A distinct advantage of this perspective is that it allows
65:   evaluation of multiple tree shape statistics describing different
66:   aspects of tree shape. After developing the methodology, it is
67:   applied here to make specific recommendations for a suite of three
68:   statistics which will hopefully prove useful in applications. The
69:   article ends with an application of the tree shape statistics to
70:   clarify the impact of omission of taxa on tree shape.
71: \end{abstract}
72: 
73: The analysis of phylogenetic tree shape provides one way of
74: understanding the forces guiding macroevolution, as well as
75: understanding possible biases of tree reconstruction methodology.
76: Although it has been a subject of study for many years, a recent
77: editorial in this journal \citep{simon-page} hints that finding the
78: forces guiding tree shape is a long-term challenge which has yet to be
79: completely understood. Joe \citet{felsenstein} concludes the chapter
80: on tree shape methodology in his recent book with the simple phrase
81: ``[c]learly this literature is in its early days.'' Indeed, tree shape
82: is still a challenge, and an important one. A complete understanding
83: would help resolve important questions in biology such as the roles of
84: adaptive radiation and environmental change in generating diversity.
85: Tree shape also poses difficult issues of its own, such as the impact
86: of missing or extinct taxa on our understanding of historical
87: biodiversity. Not only are many fundamental questions left unanswered,
88: but the area is ripe for progress: the large number and size of
89: contemporary phylogenies forms a fantastic corpus on which
90: macroevolutionary hypotheses can be tested.
91: 
92: In order to use phylogenetic tree shape as a tool, we need methods to
93: measure and quantify aspects of tree shape. Almost all work to this
94: day has been done with measures of tree ``balance,'' which is the
95: degree to which two sister taxa are of the same or different size. A
96: major vein of research has been to compare balance of trees created
97: from data to trees produced by one or another null model
98: \citep{Savage1983:225,Guyer1991:340,Guyer1993:253,stam02a}.
99: \citet{Kirkpatrick1993:1171}, in one of the early
100: papers in the area, quantified the power of different measures of tree
101: balance in distinguishing between two models of tree shape. The two
102: models are extremely simple: one, called the Yule or ERM model,
103: develops a tree by starting with a single species and then choosing
104: uniformly among species to speciate. The other, called the PDA model,
105: is simply the distribution on tree shapes induced by the uniform
106: distribution on labelled trees.
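
As a small check of these two definitions (a worked example added for
illustration, using only the definitions above), consider trees on
four leaves. There are exactly two shapes: the balanced shape, in
which the root splits the leaves two and two, and the unbalanced
``caterpillar'' shape. Under the ERM model the unique three-leaf shape
has one leaf outside its cherry; speciation of that leaf, which
happens with probability $1/3$, yields the balanced shape, while
speciation of either cherry leaf yields the caterpillar. Of the
$(2 \cdot 4 - 3)!! = 15$ labelled rooted bifurcating trees on four
leaves, $3$ are balanced and $12$ are caterpillars. Thus
\[
P_{\mathrm{ERM}}(\text{balanced}) = \frac{1}{3}, \qquad
P_{\mathrm{PDA}}(\text{balanced}) = \frac{3}{15} = \frac{1}{5},
\]
so already at four leaves the PDA model puts more weight on the
unbalanced shape than the ERM model does.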
107: 
108: Studies have shown that most trees created from data are less balanced
109: than would be expected from the ERM model, yet more balanced than
110: would be expected from the PDA model
111: \citep{Mooers1995:379,Mooers1997:31,Purvis2002:844}. Models of increasing
112: sophistication have appeared, attempting to re-create this
113: pattern of tree shape observed in nature. For example,
114: \citet{Heard1996:2141} found that speciation rate variation among
115: lineages can lead to imbalanced trees. 
116: \citet{Losos1995:329} found that short ``refractory periods,'' during
117: which a new species cannot speciate again, led to more balanced
118: trees, while \citet{Rogers1996:99} found that very long
119: refractory periods led to less balanced trees.
120: \citet{Aldous95,Aldous2001:23} was the first to propose a
121: (non-evolutionary) model which interpolated between the ERM and the
122: PDA models. More recently, \citet{Steel2001:91} and
123: \citet{Pinelis2003:1425} have developed evolutionary
124: models which also interpolate.
125: 
126: With these models, one could presumably arrange
127: parameters to correctly fit the observed pattern of imbalance as
128: reported by a given statistic. But is that really enough? What if
129: other aspects of the tree shape, not measured by the statistic, differ
130: considerably? After all, any single statistic is a one-dimensional
131: summary of a very complex set of data. One might follow the suggestion
132: of \citet{Agapow2002:866} and use two different
133: balance statistics which measure balance in different parts of the
134: tree, but in this paper we hope to present a more direct approach.
135: 
136: The only proposal made in the literature which has the potential to
137: encapsulate lots of information about the shape of a tree has been by
138: \citet{Aldous2001:23}. He suggests first constructing a
139: scatterplot of the interior nodes, where the $x$ coordinate is the
140: size of the subclade subtended by that interior node, and the $y$
141: coordinate is the size of the smaller daughter clade. The proposal is
142: then to perform nonlinear median regression on
143: the log-log version of this scatterplot and then use the fitted
144: function as a descriptor of tree shape. We will call the log-log
145: scatterplot the ``Aldous scatterplot'' in the following.
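
To make the construction concrete, here is a small worked example
(ours, with natural logarithms assumed; the choice of base only
rescales the axes). The caterpillar on four leaves has interior nodes
subtending $4$, $3$, and $2$ leaves, each with a smaller daughter
clade of size $1$, while the balanced tree on four leaves has a root
with daughters of size $2$ and two cherries:
\[
\begin{array}{lll}
\text{caterpillar:} & (4,1),\ (3,1),\ (2,1) &
  \mapsto\ (\log 4, 0),\ (\log 3, 0),\ (\log 2, 0) \\
\text{balanced:} & (4,2),\ (2,1),\ (2,1) &
  \mapsto\ (\log 4, \log 2),\ (\log 2, 0),\ (\log 2, 0)
\end{array}
\]
Note that the two cherries of the balanced tree land on the same point
of the scatterplot.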
146: 
147: There are a number of advantages to this approach. It is very natural
148: from a statistical viewpoint relative to the other, more ad-hoc,
149: measures of tree balance. The method has the potential to give quite a
150: lot of information about tree shape compared to a single summary
151: statistic. Finally, it allows comparison of trees of different sizes
152: by superposition of scatterplots, which is a significant advantage.
153: There is currently no generally accepted method for comparing trees of
154: different sizes using the standard statistics; this remains a
155: problematic issue \citep{Mooers1995:379,stam02a}.
156: 
157: However, there are three disadvantages which may make Aldous'
158: proposal less practical than might be hoped. The first is that regression
159: works best with many points of data, and thus one can only expect his
160: technique to work with rather large trees. This problem is exacerbated
161: by the fact that isomorphic subtrees are superimposed on one another
162: in the scatterplot, further reducing the number of fittable points.
163: The second is an inherent problem with summarizing a tree as a
164: scatterplot of this sort. Assume that tree $T$ has two non-isomorphic
165: subtrees $A$ and $B$ of the same size. Exchanging $A$ and $B$ in $T$
166: will not change the scatterplot and thus not change any regression
167: parameters, although the resulting tree may differ significantly in
168: shape. The third problem is that the resulting output can be hard to
169: interpret. What does, for example, the $k$th Taylor coefficient of the
170: fitted function actually signify? Despite these issues, we believe
171: that this technique is underutilized and may be the technique of
172: choice when working with large phylogenies.
173: 
174: Overall, it appears that additional methods would be useful for
175: understanding tree shape. This paper attempts to provide some of these
176: new methods.
177: 
178: \SBsection{The geometric approach}
179: 
180: The basic philosophy behind the geometric approach is that similar
181: trees should have similar statistics, and that rather different trees
182: should have different statistics. This philosophy is summarized in
183: Figure \ref{fig:example_stat}. All of the trees with six tips are
184: evaluated by two hypothetical statistics. The top axis shows what one
185: might consider a good statistic. The maximally balanced tree is on
186: the far left side, and the completely unbalanced tree is on the far
187: right. When a subtree is preserved, the statistic tends not to change
188: too much. The bottom axis shows what might be considered a bad
189: statistic. The extremes of tree balance are now put together, and two
190: similar trees are now on the two extremes of the axis.
191: 
192: \begin{figure}
193:   \begin{center}
194:   \includegraphics[angle=0,scale=.75]{example_stats.eps}
195: \end{center}
196: \caption{Good and bad statistics from the geometric perspective. The
197:   horizontal axes represent values of hypothetical statistics. In
198:   figure (a) very different trees are separated, while in figure (b)
199:   very different trees are close together.}
200: \label{fig:example_stat}
201: \end{figure}
202: 
203: If we are to apply this sort of intuition on trees, it is necessary to
204: formalize the notion of similar and different. We do so by
205: constructing a metric on unlabeled trees.
206: 
207: \SBsubsection{A metric for evolutionary histories}
208: 
209: Here we describe a metric on unlabeled trees which can be applied
210: directly to compare tree shapes or can be used to guide the selection
211: of statistics as described below. To begin we state that by ``tree''
212: we will mean a finite strictly bifurcating rooted tree without leaf
213: labels or specified edge lengths. We have chosen finite strictly
214: bifurcating rooted trees, as these correspond most naturally to the
215: output of models. This paper concerns itself with tree shape rather
216: than the identity of taxa, thus we consider unlabeled trees. Finally,
217: our intent in this paper is to understand the combinatorial content of
218: the tree, and thus we consider trees without specified edge lengths.
219: The case including edge lengths would be an interesting future
220: extension of this work, but would require a significant further
221: development of the methodology.
222: 
223: \begin{figure}
224:   \begin{center}
225:   \includegraphics[angle=0,scale=.45]{nni.eps}
226: \end{center}
227: \caption{A single rooted NNI move.}
228: \label{fig-nni}
229: \end{figure}
230: 
231: We recall that a metric $g$ is simply a set of ``distances'' between
232: pairs of a collection of objects satisfying (i) $g(x,y) = 0$ if and
233: only if $x=y$, (ii) $g(x,y) = g(y,x)$, (iii) the triangle inequality:
234: $g(x,y) + g(y,z) \geq g(x,z)$. The metric we consider is simply the
235: nearest neighbor interchange (NNI) metric on unlabeled trees, depicted
236: in Figure \ref{fig-nni}. A single NNI ``move'' represents a change of
237: branching order of a tree to one of two possible configurations. The
238: unlabeled NNI distance from one tree to another is defined to be the
239: minimum number of moves necessary to change one tree to the other.
240: Note that these interchanges have appeared before in 
241: \citet{Kuhner1995:1421} as proposal draws for their
242: Metropolis-Hastings approach to estimating population parameters.
243: 
244: Tree space equipped with the NNI metric is shown in Figure
245: \ref{fig:tree_space} for trees on 6 leaves. It is a graph which has
246: connections between any two trees which are a single NNI move apart.
247: Note that the NNI distance is a special case of the shortest-path
248: metric on a graph and thus we are justified in calling it a metric.
249: Also, although the metric is not explicitly model-based, a change of
250: branching order can be thought of as a change of timing of
251: diversification events.
252: 
253: \begin{figure}
254:   \begin{center}
255:   \includegraphics[angle=0,scale=.45]{tree_space.eps}
256: \end{center}
257: \caption{Unlabeled tree space equipped with the NNI metric. An edge
258:   between two trees means that a single NNI move changes one to the
259:   other.}
260: \label{fig:tree_space}
261: \end{figure}
262: 
263: Unsurprisingly, computing this metric is NP-complete, as can be seen by
264: a small modification of a similar proof by \citet{dasgupta}. Their
265: paper demonstrates that calculating the unrooted NNI distance on
266: unrooted trees is NP-complete. However, the unrooted NNI moves are
267: identical to the moves in Figure \ref{fig-nni} when the tree shown in
268: the diagram is chosen to be anything but the entire tree. Therefore we
269: can simply root the tree in Figure 4 of their paper on the far left
270: side of the main linear tree and the proof proceeds as usual.
271: 
272: \SBsubsection{Resolution of Statistics}
273: 
274: In this section we define the notion of ``resolution'' of a tree shape
275: statistic. Although the formal definition of the resolution is in
276: terms of the statistical method of multidimensional scaling, we will
277: first describe how resolution relates to the more common method of
278: principal component analysis, and then give an intuitive definition of
279: resolution as a measure of how much a statistic ``spreads out'' the
280: data. This resolution measure will be applied to various tree shape
281: statistics below where the underlying data
282: will be the tree space of a given number of leaves. In this way the
283: resolution will be our operational definition of performance for tree
284: shape statistics.
285: 
286: The resolution measure formalizes the intuitive notion that similar
287: objects should have similar statistics and rather different objects
288: should have different statistics. For the moment let us consider these
289: objects to be points in $n$-dimensional space. A natural statistic
290: which satisfies our criteria is the familiar first principal component
291: from multivariate statistics. It is some projection of the original
292: spatial data, so objects which are close together stay close together
293: after projection. Also, it is the direction along which variance of
294: the coordinates of the points is maximized, so as much as possible
295: objects which are far apart stay far apart. In this way we consider
296: the first principal component to be the best possible statistic for
297: this collection of points, and will assign it the highest resolution
298: value.
299: 
300: We can get at the principal component by thinking of it as the
301: maximization of a certain ``quadratic form.'' In the standard
302: formulation, the principal components are the eigenvectors of the
303: covariance matrix constructed from the coordinates of the sample
304: points. However, it turns out that even if we do not have the actual
305: coordinates of the points, but rather the distances between them, we
306: can still construct the covariance matrix. The process goes as follows:
307: let $H$ be the $n \times n$ ``centering matrix''
308: \[
309: H = I - (1/n) J
310: \]
311: where $J$ is the matrix with every entry equal to one. The operation
312: of the centering matrix on a vector subtracts off the average of the
313: entries of the vector from each component, so the result is a vector
314: which is perpendicular to the vector of ones. Let $S(A)$ be the
315: component-wise matrix squaring operation, such that the $ij$ entry of
316: $S(A)$ is $a_{ij}^2$. Then if $D$ is a ``euclidean distance matrix,''
317: i.e. a matrix such that the $ij$ entry is the distance between two
318: points $i$ and $j$ in a euclidean space, then $B = - H \, S(D) \, H$
319: will correspond, up to a positive constant factor, with the covariance matrix of those same
320: points calculated in the traditional way \citep{mardiaEA}.
321: 
322: With the covariance matrix now in hand, we can apply the Rayleigh
323: Quotient theorem, which is a special case of the Courant-Fischer
324: theorem. It states that the eigenvector corresponding to the largest
325: eigenvalue of a symmetric matrix $M$ maximizes the quadratic form $\x^T M
326: \x$ 
327: over all unit-length vectors $\x$ \citep{ortega}. Thus in our setting
328: the first principal component is the unit-norm $\x$ which maximizes
329: the quadratic form
330: \begin{equation}
331: \label{eq:qf1}
332: R(\x) \ = - \ \x^T H \, S(D) \, H \x.
333: \end{equation}
334: Again, the action of left multiplication by $H$ simply subtracts the
335: average of the components of $\x$. Therefore maximization is certainly
336: achieved by an $\x$ which has average zero, i.e. is perpendicular to
337: $\one$. On such $\x$, $H$ clearly has no effect. Therefore we can obtain
338: the first principal component as
339: \begin{equation}
340: \label{eq:qf2}
341: \argmax_{\stack{\norm{\x} = 1}{\x \perp \one}} \ \ - \x^T S(D) \x.
342: \end{equation}
343: Written out in a slightly longer form this is 
344: \begin{equation}
345: \label{eq:qf_intuitive}
346: \argmax_{\stack{\norm{\x} = 1}{\x \perp \one}} \ \sum_{i,j} -d_{ij}^2 x_i x_j
347: \end{equation}
348: This formula has a simple and intuitive explanation. As mentioned
349: above, in our view a statistic should assign very different values to
350: objects which are far apart. This equation simply formalizes this
351: intuition in a nice way: an individual term of the sum in
352: (\ref{eq:qf_intuitive}) will be maximized if $x_i$ is very negative
353: and if $x_j$ is very positive. The summation and the distances simply
354: combine all of these terms together in a weighted fashion such that
355: $ij$ pairs which are distant carry more weight than ones which are
356: close. Therefore the more distant objects will tend to be farther
357: apart in $x$-value, and the closer objects will tend to be closer in
358: $x$-value.
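
As a sanity check of (\ref{eq:qf1}), consider a toy configuration of
our own choosing (not one of the data sets analyzed in this paper):
three points on a line at positions $0$, $1$, and $2$. Then
\[
S(D) = \begin{pmatrix} 0 & 1 & 4 \\ 1 & 0 & 1 \\ 4 & 1 & 0 \end{pmatrix},
\qquad
- H \, S(D) \, H =
\begin{pmatrix} 2 & 0 & -2 \\ 0 & 0 & 0 \\ -2 & 0 & 2 \end{pmatrix}.
\]
The largest eigenvalue of the second matrix is $4$, with unit
eigenvector $\x = (-1, 0, 1)/\sqrt{2}$, which is exactly the centered
and normalized vector of positions. Indeed, only the $(1,3)$ and
$(3,1)$ terms of (\ref{eq:qf_intuitive}) are nonzero for this $\x$,
and together they contribute
\[
-2 \, d_{13}^2 \, x_1 x_3 = -2 \cdot 4 \cdot \left(-\tfrac{1}{2}\right) = 4.
\]
Here $- H \, S(D) \, H$ equals twice the matrix of inner products of
the centered positions $(-1, 0, 1)$; the constant factor does not
affect the maximizer.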
359: 
360: We will call the quadratic form $R$ of (\ref{eq:qf1}) the
361: ``resolution'' of a statistic, in the sense that a statistic which
362: differentiates between close and distant objects has a high level of
363: resolution. As mentioned above, the first principal component
364: maximizes $R$, and thus its value is an upper limit on the resolution
365: of a statistic. However, we will see below that some well-known
366: statistics on tree space achieve resolution nearly that of the first
367: principal component.
368: 
369: So far we have defined the resolution for distance matrices arising
370: from configurations of points in euclidean space. Although
371: phrased in a slightly unusual manner, this has led us into the
372: well-known area of principal component analysis. However, our intent
373: is to apply this technique to the space of all unlabeled trees with
374: the NNI metric. The distance matrix corresponding to this space is far
375: from being a euclidean distance matrix. Is it possible to continue
376: with the same formalism as in the euclidean setting?
377: 
378: It turns out that we can, and that the procedure is now called metric
379: multidimensional scaling (MDS) \citep{mardiaEA}. The only difference
380: is that $D$ is now allowed to be non-euclidean. In essence, when we
381: substitute a non-euclidean distance matrix into (\ref{eq:qf1}), we
382: consider the projection of the squared centered matrix onto the cone
383: of positive semidefinite matrices. Thus multidimensional scaling performs
384: principal component analysis on the ``closest'' euclidean distance
385: matrix to our original matrix in a specific sense \citep{dattorro}.
386: This operation certainly loses some data, but enough information is
387: retained to understand the descriptive ability of several statistics.
388: We visit this issue in the last section.
389: 
390: Note that this is not the first application of MDS to phylogenetic
391: analysis: \citet{Hillis2005:471} applied it with
392: interesting results to the space of trees with labeled tips. They used
393: MDS with the Robinson-Foulds distance metric as a tool for visualization
394: and analysis of the output of tree reconstruction software. Our intent
395: and methods differ here, as we are concerned with finding near-optimal
396: statistics for understanding unlabeled tree space with the NNI metric.
397: 
398: In this section we have defined the resolution as a function that allows
399: us to understand the descriptive ability of some statistic. At this
400: point we specialize to the case of tree shape statistics on tree space
401: equipped with the NNI metric. Resolution scores are calculated as
402: follows: first construct a vector whose entries are the values of the
403: statistic on all trees in tree space. Then apply the matrix $H$ to
404: center the vector; then normalize the vector in the euclidean sense
405: resulting in a vector $\hat{x}$. The resolution is the value of
406: $-\hat{x}^T S(D) \hat{x}$, in agreement with (\ref{eq:qf1}). We will use this definition to guide
407: selection of statistics.
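
The recipe can be traced by hand on a toy metric space (our own
illustrative choice, not one of the tree spaces analyzed below): four
objects arranged in a cycle, with the shortest-path distance between
them. Following the sign convention of (\ref{eq:qf1}), the resolution
of a centered, normalized vector $\hat{x}$ is
$-\hat{x}^T S(D) \hat{x}$, where
\[
S(D) = \begin{pmatrix}
0 & 1 & 4 & 1 \\ 1 & 0 & 1 & 4 \\ 4 & 1 & 0 & 1 \\ 1 & 4 & 1 & 0
\end{pmatrix}.
\]
The statistic $y = (0, 1, 2, 1)$, the distance from the first object,
has mean $1$, so $\hat{x} = (-1, 0, 1, 0)/\sqrt{2}$ and
\[
-\hat{x}^T S(D) \hat{x} = -2 \cdot 4 \cdot \left(-\tfrac{1}{2}\right) = 4,
\]
which equals the largest eigenvalue of $- H \, S(D) \, H$ for this
space. By contrast, the statistic $y = (1, 0, 1, 0)$ gives
$\hat{x} = (1, -1, 1, -1)/2$ and resolution $-2$: it assigns equal
values to the two most distant pairs of objects. A negative value is
possible here because this distance matrix is not euclidean.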
408: 
409: \newpage
410: \SBsection{Results}
411: 
412: In this section the methodology of the previous section is applied to
413: compare the resolution of tree shape statistics. We will first
414: evaluate the standard list of statistics
415: \citep{Kirkpatrick1993:1171,Agapow2002:866,felsenstein} according to the above
416: methodology. Then we search for a best second statistic given the
417: first, and the best third statistic given a first and second. Our
418: criterion for performance is high resolution on the whole unlabeled
419: tree space with the NNI metric as described in the previous section.
420: The tree space was generated and evaluated by an \texttt{ocaml}
421: \citep{ocaml} program whose source is available upon request.
422: 
423: We calculated the well-known statistics \nbar\ and \varn\ proposed by
424: \citet{Sackin1972:225}, \Ic\ proposed by \citet{Colless1982:100}, and
425: \Bone\ and \Btwo, proposed by \citet{Shao1990:266}. We added
426: to the list a rarely used statistic \Itwo, invented by
427: \citet{Mooers1997:31} to provide a measure which weights all nodes
428: equally. Finally, we implemented the proposal of 
429: \citet{Aldous2001:23} to perform median regression as described in the
430: introduction. We fit a quadratic polynomial to the data using median
431: regression and interpreted the linear and quadratic coefficients as
432: descriptive statistics which we call \Aone\ and \Atwo. 
433: 
434: We note here that although Aldous' paper did not explicitly specify
435: how to perform the median regression, we have chosen nonlinear median
436: regression as described by \citet{Koenker1978:33}.
437: This method minimizes the sum of the absolute deviations of the data
438: points from the fitted curve. Median regression performs much better (as a
439: maximum-likelihood estimator) than least-squares regression when
440: errors are non-gaussian, as in our case. It can be easily implemented
441: using linear programming; in this case it was implemented in 34 lines
442: of code using an \ocaml\ frontend to the GNU linear programming
443: package GLPK.
444: 
445: \begin{table}
446:   \centering
447: \hspace*{-2cm} 
448:     \begin{tabular}{cccccccccc}
449:       $n$ & $\lambda_0$ & \Ic & \nbar & \varn & \Itwo & \Bone & \Btwo & \Aone & \Atwo \\
450:       \hline
451:       7 & 7.01 & 6.29 & 6.34 & 6.07 & 5.90 & 6.22 & 6.29 & 2.67 & 2.70 \\
452:       8 & 21.48 & 19.43 & 19.07 & 18.05 & 17.67 & 18.89 & 19.04 & 5.82 & 6.02 \\
453:       9 & 48.06 & 43.24 & 43.38 & 41.13 & 39.44 & 42.29 & 42.57 & 7.71 & 8.42 \\
454:       10 & 125.11 & 116.37 & 115.93 & 110.07 & 103.60 & 111.18 & 111.55 & 31.14 & 33.74 \\
455:       11 & 299.82 & 283.47 & 282.88 & 268.50 & 249.33 & 269.62 & 269.56 & 84.38 & 89.79 \\
456:       12 & 755.12 & 714.86 & 714.04 & 676.40 & 626.25 & 676.61 & 672.84 & 224.32 & 241.35 \\
457:       13 & 1856.88 & 1760.73 & 1760.97 & 1663.67 & 1525.18 & 1661.87 & 1645.81 & 575.67 & 622.98 \\
458:       14 & 4619.28 & 4387.95 & 4385.72 & 4139.01 & 3779.58 & 4113.12 & 4051.89 & 1458.20 & 1583.53 \\
459:       15 & 11392.51 & 10819.20 & 10817.17 & 10190.62 & 9241.58 & 10106.57 & 9909.07 & 3788.17 & 4124.96 \\
460:     \end{tabular}
461:   \caption{The resolution scores for tree statistics on the NNI
462:     distance matrix.}
463: \label{table:first}
464: \end{table}
465: 
466: \begin{figure}
467:   \begin{center}
468:   \includegraphics[angle=-90,scale=.45]{fig1.ps}
469: \end{center}
470: \caption{Resolution scores divided by the first eigenvalue.}
471: \label{fig:first}
472: \end{figure}
473: 
474: The results of this analysis are presented in Table \ref{table:first}
475: and Figure \ref{fig:first}. First, we find that the resolution of two
476: statistics, \Ic\ and \nbar, is rather close to the first eigenvalue,
477: which is the upper limit for the resolution. This is quite remarkable,
478: in that two statistics which were designed ``by hand'' to measure a
479: visible aspect of tree shape end up having almost as much resolution
480: as theoretically possible. The fact that overall tree balance appears
481: as such an important descriptor justifies in a sense the
482: disproportionate amount of attention given to it in the tree shape
483: literature. Another nice fact is that the relative resolution scores
484: correspond loosely to the power of the statistics as found by
485: \citet{Agapow2002:866}: \Ic\ and \nbar\ have the most resolution,
486: followed by \varn\ and \Bone; \Btwo\ has the lowest resolution of the
487: standard suite of statistics. We report that in this first setting,
488: \Itwo\ does have substantially lower resolution than the other
489: statistics; however, we will see that it performs well in later
490: settings. Finally, it appears that the coefficients of the best-fit
491: quadratic polynomial on the Aldous scatterplot should not be used as a
492: first statistic in the simpleminded way presented here on small trees;
493: it is possible that an alternative formulation would yield better
494: results.
495: 
496: So far we have only validated that our technique gives results which
497: do not seem completely out of the ordinary. However, now we can do
498: something new. Let's say that we choose \Ic\ as our first statistic
499: and ask the question ``what is the best second number to know about a
500: tree given that we already know \Ic?'' This question has a
501: mathematical formulation: we simply project out the \Ic\ component of
502: the matrix $B$ and repeat the previous process. 
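
One way to make this precise is sketched here; it is our reading of
the procedure, and the exact normalization used is an assumption. Let
$c$ denote the centered, unit-norm vector of \Ic\ values, and let
$P = I - c \, c^T$ be the orthogonal projection onto the complement of
$c$. Replacing the matrix of the quadratic form (\ref{eq:qf1}) by
\[
- P \, H \, S(D) \, H \, P
\]
yields a form which vanishes on $c$ and which agrees with
(\ref{eq:qf1}) on vectors perpendicular to $c$. Under this reading a
statistic that is an affine function of \Ic\ has resolution zero, and
any other statistic is scored only on the part of its variation not
explained linearly by \Ic.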
503: 
504: \begin{table}
505:   \centering
506:   \begin{tabular}{cccccccc}
507: $n$ & \nbar & \varn & \Itwo & \Bone & \Btwo & \Aone & \Atwo \\
508: \hline
509: 7 & 0.15 & 0.03 & 1.89 & 0.75 & 0.53 & 2.68 & 2.74 \\
510: 8 & 0.35 & 0.24 & 5.42 & 1.75 & 1.34 & 6.05 & 6.10 \\
511: 9 & 0.88 & 0.54 & 14.94 & 6.45 & 5.16 & 7.43 & 8.50 \\
512: 10 & 1.85 & 1.77 & 42.47 & 14.55 & 12.37 & 31.72 & 33.76 \\
513: 11 & 4.12 & 5.52 & 110.23 & 40.09 & 35.80 & 85.44 & 89.11 \\
514: 12 & 8.91 & 16.80 & 293.51 & 97.41 & 91.67 & 224.61 & 230.85 \\
515: 13 & 20.06 & 48.81 & 749.81 & 253.42 & 249.96 & 577.10 & 593.12 \\
516: 14 & 44.64 & 139.34 & 1930.63 & 625.33 & 645.74 & 1431.73 & 1449.77 \\
517: 15 & 102.17 & 387.97 & 4883.15 & 1586.90 & 1710.31 & 3657.50 & 3657.96 \\
518: \end{tabular}
519: \caption{Resolution scores for tree statistics on the NNI
520:   distance matrix after projecting out \Ic.}
521: \label{table:second_a}
522: \end{table}
523: 
524: The resolution scores of the previously chosen statistics are listed
525: in Table \ref{table:second_a} with the exception of \Ic, which of
526: course has resolution zero because we have projected it out. We note
527: first that \nbar\ has rather small resolution, which is to be expected
528: because it is highly correlated with \Ic. Comparatively, \Itwo, \Aone, and
529: \Atwo\ now do better, which means that they measure a different
530: aspect of tree shape than does \Ic. 
531: 
532: However, it is possible to improve on existing statistics by
533: explicitly constructing a statistic which measures a different aspect
534: of tree shape than \Ic. Plotting the principal components of the $B$
535: matrix suggests that a good second statistic may be the change of
536: balance from the root to the tips. We have implemented this intuition
537: in two ways, first as the ``derived statistics'' of a given statistic,
538: and second as a specific statistic which we call \Qone.
539: 
540: First we describe the construction of the derived statistics of a
541: given statistic $Y$. Start by making a plot analogous to the Aldous
542: scatterplot, except now the $x$ axis is the size of the subtree and
543: $y$ is the value of the statistic $Y$. Now do median regression on
544: this scatterplot and report the slope of the best-fit line or the
545: quadratic coefficient of the best-fit quadratic polynomial. Given an
546: original statistic $Y$ we will call these two derived statistics $Y'$
547: and $Y''$ in analogy to the first and second derivatives of calculus.
548: Higher derived statistics are of course possible but will not be
549: investigated in this paper.
550: 
551: We have designed another statistic, which we call \Qone, which also
552: attempts to quantify the change of balance from the root to the tips.
553: The conceptual model for this statistic is the idea that at some time
554: in the past there may have been a change of evolutionary machinery
555: such that the balance before that time differs from the balance after
556: that time. In some sense the procedure tries to find that time and
557: then compares the balance before and after that time.
558: 
559: The procedure can be described as follows. Begin by assigning to each
560: internal node a ``local imbalance,'' which quantifies the degree of
561: imbalance just at that node. If a bifurcating internal node has
562: subtrees of size $s_l$ and $s_r$, the local imbalance at that node is
563: \[
564: \frac{|s_l - s_r|}{s_l + s_r - 2}.
565: \]
566: This quantity is similar to the summand in the definition of \Itwo\ by 
567: \citet{Mooers1997:31}. We set the local imbalance of a
568: three-node tree to be one at each node. We set the local imbalance of
569: a two-node tree to be zero unless it is part of a three-node tree.
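
For instance (illustrative values only), a node whose subtrees have
sizes $s_l = 4$ and $s_r = 2$ has local imbalance
\[
\frac{|4 - 2|}{4 + 2 - 2} = \frac{1}{2},
\]
a node with $s_l = s_r = 3$ has local imbalance $0$, and a node with
$s_l = 5$ and $s_r = 1$ has local imbalance $4/4 = 1$. More generally,
any split of the form $(s, 1)$ with $s \geq 2$ attains the value
$(s-1)/(s-1) = 1$, so the denominator $s_l + s_r - 2$ normalizes the
quantity to lie between $0$ and $1$.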
570: 
571: After local imbalances have been assigned, we iterate up the tree to
572: find a ``cut'' of the tree into one basal tree and then a collection
573: of distal trees, which must contain all of the leaves. The cut is
574: first chosen such that the average local imbalance of the internal
575: nodes of the distal trees is maximized. Then the first statistic is
576: computed, which is the average imbalance of the internal nodes of the
577: distal trees minus the average imbalance of the internal nodes of the
578: basal tree. This process is repeated to create a second statistic,
579: except a cut is chosen such that the imbalance of the internal nodes
580: of the distal trees is minimized. Whichever value is greater in
581: absolute value is then called \Qone.
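
The procedure can be written compactly as follows; the notation is
introduced only as a sketch. For a cut $c$, let $D(c)$ and $B(c)$
denote the sets of internal nodes of the distal trees and of the basal
tree, respectively, and for a set $S$ of internal nodes let
$\bar{\lambda}(S)$ be the average local imbalance over $S$. Writing
$c^{+}$ for a cut maximizing $\bar{\lambda}(D(c))$ and $c^{-}$ for a
cut minimizing it, the two candidate statistics are
\[
q^{\pm} = \bar{\lambda}(D(c^{\pm})) - \bar{\lambda}(B(c^{\pm})),
\]
and
\[
\Qone =
\begin{cases}
  q^{+} & \text{if } |q^{+}| \geq |q^{-}|, \\
  q^{-} & \text{otherwise.}
\end{cases}
\]
We assume here that only cuts for which both $D(c)$ and $B(c)$ are
nonempty are considered, and that ties are broken in favor of $q^{+}$;
neither convention is fixed by the verbal description.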
582: 
583: We also recall a statistic which has been understood from the
584: theoretical perspective but which is not in common usage in the tree
585: shape literature: the number of ``cherries'' of a tree. A ``cherry''
586: is simply a subtree of two leaves. \citet{mckenzie-steel} have shown
587: that the distribution of the number of cherries is asymptotically
588: normal under both the equal rates Markov and the uniform model (see
589: next section) and have derived the mean and variance for each. 
590: 
591: \begin{table}
592:   \centering
593:   \begin{tabular}{ccccccccc}
    $n$ & \Qone & cherries & $\Ic'$ & $\Itwo'$ & $\Bone''$ & $\Btwo''$ &
595:     \Aone & \Atwo \\
596:     \hline
597:     7 & 4.84 & 1.71 & 3.53 & 2.92 & 2.52 & 2.50 & 2.68 & 2.74 \\
598:     8 & 12.29 & 5.48 & 10.34 & 10.07 & 6.41 & 6.48 & 6.05 & 6.10 \\
599:     9 & 30.88 & 15.27 & 28.13 & 27.97 & 15.23 & 15.58 & 7.43 & 8.50 \\
600:     10 & 73.07 & 44.80 & 61.97 & 62.01 & 44.96 & 46.37 & 31.72 & 33.76 \\
601:     11 & 173.93 & 118.61 & 147.68 & 146.36 & 122.90 & 129.08 & 85.44 & 89.11 \\
602:     12 & 427.55 & 322.74 & 347.84 & 340.43 & 312.94 & 322.52 & 224.61 & 230.85 \\
603:     13 & 1024.86 & 833.99 & 871.08 & 868.45 & 798.39 & 823.73 & 577.10 & 593.12 \\
604:     14 & 2459.67 & 2171.81 & 2127.44 & 2059.13 & 2042.00 & 2101.81 & 1431.73 & 1449.77 \\
605:     15 & 5972.63 & 5530.14 & 5058.50 & 4873.71 & 5103.33 & 5232.47 & 3657.50 & 3657.96 \\
606:   \end{tabular}
607:   \caption{Resolution scores for tree statistics on the NNI
608:     distance matrix after projecting out \Ic.}
609: \label{table:second_b}
610: \end{table}
611: 
612: Table \ref{table:second_b} presents the somewhat surprising results of
613: the resolution method as applied to the distance matrix after \Ic\ has
614: been projected out. The best performance is achieved by \Qone, the
615: somewhat complicated statistic presented above, but close behind is
616: the number of cherries, perhaps the simplest possible statistic.
617: Although the performance of the cherry statistic lags behind the above
618: statistics as a first statistic (see Supplementary Material), it has
619: remarkably good performance as a second statistic. Similar performance
620: is achieved by the slightly more complex $\Ic'$. We also report the
values of $\Bone''$ and $\Btwo''$ due to their good performance.
622: 
623: Now assume we choose \Qone\ for our second statistic and look for a
624: third. As before, we project \Ic\ and \Qone\ out of our matrix and
625: compare scores. 
626: \begin{table}
627:   \centering
628:   \begin{tabular}{ccccccc}
629:     $n$ & $\Bone''$ & $\Btwo''$ & $\Qone''$ & $\Ic''$ & \Aone & \Atwo \\
630:     \hline
631:     7 & 2.34 & 2.42 & 1.87 & 1.30 & 1.77 & 1.97 \\
632:     8 & 6.53 & 6.75 & 4.39 & 5.12 & 5.08 & 5.55 \\
633:     9 & 15.55 & 15.86 & 9.73 & 12.89 & 7.44 & 8.43 \\
634:     10 & 44.91 & 45.83 & 38.16 & 37.03 & 31.60 & 33.83 \\
635:     11 & 122.45 & 127.04 & 99.51 & 92.82 & 85.30 & 88.91 \\
636:     12 & 313.13 & 321.23 & 245.45 & 250.41 & 223.88 & 230.76 \\
637:     13 & 798.41 & 820.11 & 645.11 & 619.09 & 577.72 & 586.28 \\
638:     14 & 2040.07 & 2095.10 & 1633.48 & 1524.52 & 1429.47 & 1428.79 \\
639:     15 & 5104.65 & 5223.00 & 3939.10 & 3822.40 & 3649.16 & 3603.47 \\
640:   \end{tabular}
641:   \caption{The resolution scores for tree statistics on the NNI
642:     distance matrix after projecting out \Ic\ and $\Qone$.}
643: \label{table:third}
644: \end{table}
This time it is $\Btwo''$ which performs the best. However, we note
646: that \Aone, \Atwo, and \Itwo\ are not far behind.
647: 
648: In the end, what is the best general-purpose suite of statistics to
649: use for tree shape description? For a first statistic, the answer is
650: probably \Ic\ or \nbar. They have high resolution and are simple to
651: compute. For a second statistic, \Qone\ has the highest resolution but
652: is somewhat complex; the number of cherries and $\Ic'$ also have good
resolution and simple interpretations. For a third statistic,
$\Btwo''$ has the highest resolution; however, if one is
interested in three statistics, another good recommendation is
the triple $(\Ic,\Ic',\Ic'')$, which has satisfactory resolution and
a clear interpretation.
658: 
659: \SBsubsection{Example application}
660: 
661: In the introduction, we proposed that ``interpolating'' evolutionary
662: models could be used to fit any given pattern of overall imbalance. We
663: argued that this fact motivates the use of multiple tree shape
664: statistics, as a single statistic may be insufficient to distinguish
665: between trees generated by the original evolutionary model and a
666: fitted one. In this section we investigate these matters using
667: simulations and the results of the previous sections.
668: 
669: The model we have chosen for this example application is Aldous'
``beta-splitting'' model \citep{Aldous95,Aldous2001:23}. It is
671: a simple model with a single parameter, $\beta$, which allows
672: interpolation between the ``comb'' tree ($\beta = -2$) and the
673: maximally balanced tree ($\beta = \infty$). The ``equal rates Markov''
674: or ERM tree (i.e. the coalescent tree distribution) emerges when
675: $\beta = 0$, and the ``proportional to different arrangements'' or PDA
676: tree (i.e. the distribution on tree shapes induced by a uniform
677: distribution on labeled trees) appears when $\beta = -1.5$. 
678: 
679: The idea of this model is to recursively split the tips into two
680: subclades using the beta distribution. More precisely, if we assume
681: that a clade has $n$ taxa, the probability of the split being between
682: subclades of size $i$ and $n-i$ is
683: \[
684: q_{n,\beta} (i) = C(n;\beta) \frac{\Gamma(\beta+i+1) \Gamma(\beta+n-i+1)}
685: {\Gamma(i+1) \Gamma(n-i+1)}
686: \]
687: where $C(n;\beta)$ is a normalizing constant. This distribution is
688: equivalent to scattering the taxa on the unit interval and then
689: splitting with the $B(\beta+1,\beta+1)$ distribution \citep{Aldous95}.
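
Two small cases may help to make the formula concrete; they are worked
here directly from the expression for $q_{n,\beta}(i)$. When
$\beta = 0$ every ratio of gamma functions equals one, so that
\[
q_{n,0}(i) = C(n;0) = \frac{1}{n-1}, \qquad i = 1, \dots, n-1,
\]
which is the uniform split distribution. For $n = 4$ this gives
probability $2/3$ to the unordered split $\{1,3\}$ and $1/3$ to
$\{2,2\}$. When $\beta = -3/2$ and $n = 4$, the unnormalized weights
are
\[
\frac{\Gamma(\frac{1}{2})\,\Gamma(\frac{5}{2})}{\Gamma(2)\,\Gamma(4)} = \frac{\pi}{8},
\qquad
\frac{\Gamma(\frac{3}{2})\,\Gamma(\frac{3}{2})}{\Gamma(3)\,\Gamma(3)} = \frac{\pi}{16},
\qquad
\frac{\Gamma(\frac{5}{2})\,\Gamma(\frac{1}{2})}{\Gamma(4)\,\Gamma(2)} = \frac{\pi}{8}
\]
for $i = 1, 2, 3$, so that
$q_{4,-3/2} = (\frac{2}{5}, \frac{1}{5}, \frac{2}{5})$. The unordered
split $\{1,3\}$ then has probability $4/5$ and $\{2,2\}$ has
probability $1/5$, in agreement with the $12$ comb-shaped and $3$
balanced trees among the $15$ rooted labeled binary trees on four
leaves.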
690: 
691: This model is easily adapted to a maximum-likelihood framework. The
692: likelihood of each tree for a given $\beta$ is the product of the
693: likelihoods of each split. We consider the likelihood of a
694: collection of trees to be the product of the likelihoods of each tree.
Using a trick from \citet{Aldous95}, one can derive a formula for
$C(n;\beta)$ and then find the $\beta$ which maximizes the log
likelihood of a collection of trees in the standard way.
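
As a sketch of the resulting objective, let $v$ range over the
internal nodes of a tree $T$, and suppose that the clade below $v$ has
$n_v$ taxa and is split into subclades of sizes $i_v$ and $n_v - i_v$
(this indexing is notation introduced only for illustration). The log
likelihood of a collection of trees is then
\[
\ell(\beta) = \sum_{T} \sum_{v} \log q_{n_v,\beta}(i_v),
\qquad
\hat{\beta} = \argmax_{\beta > -2} \, \ell(\beta),
\]
where, whatever closed form is used, the normalizing constant must
satisfy
\[
C(n;\beta)^{-1} = \sum_{i=1}^{n-1}
\frac{\Gamma(\beta+i+1)\, \Gamma(\beta+n-i+1)}{\Gamma(i+1)\, \Gamma(n-i+1)}.
\]
Whether each split contributes $q_{n_v,\beta}(i_v)$ or the unordered
probability $q_{n_v,\beta}(i_v) + q_{n_v,\beta}(n_v - i_v)$ when
$i_v \neq n_v - i_v$ is a convention which we leave as an assumption
of this sketch.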
698: 
699: As an application of the above statistics we investigate the effect of
700: missing taxa on phylogenetic tree shape using simulation. We will
701: model the effect on tree shape of a sequencing strategy which is
702: common in the realm of infectious disease: sequence only those strains
703: which are significantly different from previously sequenced strains.
704: We assume that the original tree emerged from an evolutionary process
705: which has the ERM distribution on trees. We then assume that the edge
706: lengths are distributed according to a $N(1,.25)$ Gaussian
707: distribution truncated below zero. Given such a tree with $n$ leaves,
708: we then recursively delete $k$ taxa in the following manner: find the
709: pair of taxa which are closest together in terms of tree distance
710: (including edge length), and randomly delete one of them. We then
711: perform a maximum-likelihood fit as described above on those trees,
712: resulting in a $\beta$, and then generate a sample of beta-splitting
713: trees on $n-k$ leaves using this $\beta$. Which statistics can
714: distinguish between the original trees and the fitted trees?
715: 
716: We performed this simulation study with a sample size of 500, $n=100$,
717: and $k=10$. The $\beta$ value fitted to the described deletion process
718: was $-1.02$, corresponding to a decrease in balance from the $\beta =
719: 0$ original tree. We then compared statistics between 500 of the ``fitted''
beta-splitting trees and the original trees with deleted taxa. Each
statistic was then evaluated with the two-tailed Wilcoxon rank-sum test
to assess its statistical power to differentiate between
the two distributions. The results of this analysis are in Table
\ref{table:example}.
725: 
726: Remarkably, the statistical power for this scenario corresponds with
727: the resolution of these statistics when \Ic\ has been projected out.
728: This makes some sense because when we fit a tree to the beta-splitting
729: model, we are primarily fitting the overall balance of the trees. We
730: recall that the four statistics with highest resolution after
731: projection were \Qone, the number of cherries, $\Ic'$, and \Itwo.
732: Three out of four of these statistics are also the most powerful for
our example application. Although this correspondence is suggestive,
it is not perfect; one reason is that the resolution
scores statistics on overall descriptive ability, whereas here we consider
statistical power to differentiate between two specific models. For
737: example, considering that cherries tend to be eliminated by the
738: described taxon deleting process, it is not surprising that the number
739: of cherries would have such high statistical power in this example
740: application. We have also included the statistics \Aone\ and \Atwo\ in
741: Table \ref{table:example} because they performed reasonably well; this
742: corresponds with their good resolution after projecting out \Ic\ as
743: shown in Table \ref{table:second_a}. It is not surprising that these
744: statistics perform better on relatively large trees. Finally, as might
745: be expected for a situation in which we have fitted the overall
746: balance of a tree to the model, the statistic \Ic\ has essentially no
747: power to distinguish between the two models.
748: 
749: \begin{table}
750: \centering
751: \begin{tabular}{cccccccc}
752: & \Ic & cherries & \Itwo & \Qone & $\Ic'$ & \Aone & \Atwo \\
753: \hline
754: NM & 0.077 & 30 & 0.47 & 0.24 & 0.015 & 0.62 & 0.056 \\
755: DM & 0.076 & 29 & 0.49 & 0.27 & 0.019 & 0.51 & 0.089 \\
$p$ & 0.16 & $7.6 \times 10^{-32}$ & $5.1 \times 10^{-13}$ & $1.8 \times 10^{-7}$ & $4.6 \times 10^{-7}$ & $4.4 \times 10^{-6}$ & $1.1 \times 10^{-6}$ \\
757: \end{tabular}
758: \caption{Comparison of the scores for various statistics when
759:   applied to trees from two different models. ``NM'' signifies the
760:   median score of the statistic when applied to a sample of ERM
761:   trees of size 90; ``DM'' signifies the median when applied to a
762:   sample of beta-splitting trees with leaves deleted as described in
  the text. The last line shows the $p$-value for the two-sided Wilcoxon
764:   rank-sum test.}
765: \label{table:example}
766: \end{table}
767: 
768: We argue that this simple simulation exercise further demonstrates
769: that the resolution measure can help guide the selection of good
770: general-purpose tree shape statistics. Although these statistics were
771: chosen on purely geometric grounds, they were also the most powerful
772: for this somewhat arbitrary model.
773: 
774: \SBsection{Extensions}
775: 
776: There are a number of limitations to this methodology which point the
777: way for future development. The first is that this application of the
778: MDS technique was to a specific model of tree space, namely that with
779: the unlabeled NNI distance. It is possible that this is not a good
choice. However, if another model is found which seems more
appropriate, it can easily be brought into the general framework
presented here to derive analogous results. Another aspect of this
problem is that the resolution measure as described implicitly assumes
the uniform distribution on trees. That is to say, trees which are
785: never seen in models or from data carry equal weight in the resolution
786: measure as trees which are common. This could decrease the utility of
787: the resolution measure, especially when considering large trees.
788: However, in the author's opinion there is no clear choice of
789: distribution. In fact, the main purpose of tree shape theory is to
790: think about what sorts of distributions are appropriate for tree
791: shape. If a clear alternative distribution is found, some
792: modifications will have to be made to the methodology to incorporate
793: this information.
794: 
Second, this methodology offers nothing to the debate about how to
compare the shapes of trees of different sizes. This is a very
797: fundamental problem which may be more philosophical than technical:
798: what does it actually mean to say that a tree of one size has a
similar shape to one of a different size? A common response in the
literature \citep{Mooers1995:379,stam02a} is to compare, in one
way or another, the shape of a given tree to a sample of trees from a
fixed distribution; knowing the distribution of the statistic, as for
the number of cherries \citep{mckenzie-steel}, makes this an attractive
option for some statistics. However, if we wish to have a descriptive
805: theory independent of perhaps over-simple models, some other method
806: will have to be found. This is clearly an interesting avenue for
807: future research.
808: 
809: Third, because the number of unlabeled binary trees is very large,
asymptotically $O(c^n n^{-3/2})$ \citep{harding,semple-steel},
811: we have had to limit ourselves to moderately small trees. This may
812: skew the analysis in that statistics which perform poorly for small
813: trees may perform quite well for large trees; an example case might be
814: Aldous' descriptors of tree shape. One response to this objection is
815: that Figure \ref{fig:first} shows a certain level of stability as $n$
816: increases: statistics which are good for smaller $n$ appear to be good
817: for larger $n$ as well. As our understanding of this NNI tree space is
818: very limited, we cannot prove any statement of this type at this time.
819: Furthermore, although increasingly large trees are now available, the
820: analysis of trees of intermediate size is still a challenge and at
821: worst the above methodology is applicable to that case. However, we do
822: consider this to be a problem for future research.
823: 
Fourth, multidimensional scaling with non-Euclidean data always loses
some information. This results from the fact that the analysis is
actually performed on a projection of the original distance matrix. As
mentioned, the NNI tree space is certainly non-Euclidean: even in the
innocuous-looking case of $n=6$ (see Figure~\ref{fig:tree_space}) some
distortion results from a Euclidean projection. The subject of how
830: much information is lost from this projection is very interesting but
831: requires a separate treatment. We will address these issues in a
832: future article.
833: 
Fifth, edge length information is conspicuously absent in tree shape
835: analysis. Typically information about timing of speciation (or other
836: branching) events is analyzed in a completely different manner, as a
837: lineages-through-time plot, which is then used to estimate speciation
838: and extinction rates with maximum likelihood \citep{neeEA94a}. Clearly
839: any analysis of this sort eliminates topological information which may
840: aid in choosing an evolutionary model. The tree shape literature has
841: already shown that the standard birth-death process where each leaf is
842: equally likely to split or be eliminated does not construct trees
843: which seem to reflect the imbalance seen in nature; nevertheless this
assumption is implicit in Nee et al.'s analysis. More work is needed
845: to integrate the tree shape and timing literature.
846: 
847: Finally, we come to a limitation which is fundamental to any
848: discussion of trees: with very few exceptions, trees are not actual
849: data. They are almost certainly flawed reconstructions of historical
850: events. A common response to this problem by coalescent theorists
851: trying to estimate evolutionary parameters is to simply ``integrate
852: out'' the history by performing MCMC iteration over all possible
853: histories \citep{Kuhner1995:1421}. However, we believe that there is a
854: signal in tree shape that stands out from the noise and which can
855: guide us in selection of evolutionary models. We also note that tree
856: shape has a role in understanding potential problems and biases of
857: tree reconstruction methods.
858: 
859: In summary, we have developed a new method for evaluating tree shape
860: statistics, which we call the ``resolution'' of a statistic. This
861: method formalizes the intuition that a good statistic takes on similar
862: values for similar trees and different values for rather different
863: trees. It has the advantage that it can help choose a $k$th statistic
864: given that $k-1$ other statistics are already known; this opens up the
865: possibility of finding a useful suite of statistics to describe a
866: tree. We then use the method to make specific recommendations for such
867: a suite of three statistics. Finally, we compare the results of the
868: geometric analysis to two model-based tree distributions and find that
869: statistics with good resolution were also the ones which had high
870: power to distinguish the two distributions. We hope that these
871: statistics and methodology will prove useful for scientists engaged in
872: the fascinating questions emerging from macroevolution and
873: phylogenetic reconstruction. We suggest that this paper represents a
874: small step in an area which will continue to pose interesting
875: questions for years to come.
876: 
877: \SBsubsubsection{Acknowledgments}
878: 
879: \begin{footnotesize}
880:   The author would like to thank Akira Sasaki for asking him the
881:   question ``what is a good way to numerically describe the shape of a
882:   tree?'' two years ago, as well as David Aldous, Steve Evans, Joseph
883:   Felsenstein, Susan Holmes, Arne Mooers, Montgomery Slatkin and John
884:   Wakeley for stimulating discussion and valuable comments. F.A.M. was
885:   supported by a Graduate Research Fellowship from the National
886:   Science Foundation.
887: \end{footnotesize}
888: 
889: \newpage
890: \bibliographystyle{plainnat}
891: \bibliography{/home/matsen/papers/bibtex_entries,good}
892: 
893: \SBsection{Supplementary Material} 
894: 
895: Here I will present tables of all of the statistics, not just the ones
896: with high resolution values.
897: 
898: \end{document}
899: 
900: