
1: \documentclass{article}

2:

3: \usepackage{graphicx}

4: \usepackage{amsmath}

5: \usepackage{amssymb}

6: \usepackage{eucal}

7: \usepackage{natbib}

8:

9: \newcommand{\ocaml}{\texttt{ocaml}}

10: \newcommand{\nbar}{\ensuremath{\bar{N}}}

11: \newcommand{\varn}{\ensuremath{\sigma_N^2}}

12: \newcommand{\Ic}{\ensuremath{I_c}}

13: \newcommand{\Qone}{\ensuremath{Q_1}}

14: \newcommand{\Itwo}{\ensuremath{I_2}}

15: \newcommand{\Bone}{\ensuremath{B_1}}

16: \newcommand{\Btwo}{\ensuremath{B_2}}

17: \newcommand{\Aone}{\ensuremath{A_1}}

18: \newcommand{\Atwo}{\ensuremath{A_2}}

19:

20: \newcommand{\ra}{\rightarrow}

21: \newcommand{\Tn}{\ensuremath{\mathcal{T}_n}}

22: \newcommand{\R}{\mathbb{R}}

23: \newcommand{\x}{\mathbf{x}}

24: \newcommand{\one}{\ensuremath{\mathbf{1}}}

25: \newcommand{\argmax}{\operatornamewithlimits{argmax}}

26: \newcommand{\norm}[1]{\ensuremath{\|#1\|}}

27: \newcommand{\stack}[2]{\begin{smallmatrix} #1 \\ #2 \end{smallmatrix}}

28:

29: \newcommand{\SBsection}[1]{\vspace{.6cm} \noindent \textsc{#1} \vspace{.2cm}}

30: \newcommand{\SBsubsection}[1]{\vspace{.4cm} \noindent \textit{#1} \vspace{.2cm}}

31: \newcommand{\SBsubsubsection}[1]{\vspace{.4cm} \noindent

32:   \textsc{\small #1} \vspace{.2cm}}

33:

34: \newenvironment{parmatrix}{\left( \begin{array}}{\end{array} \right)}

35:

36: % for double spacing

37: %\usepackage{doublespace}

38:

39: \newtheorem{theorem}{Theorem}

40: \newtheorem{prop}{Proposition}

41: \newtheorem{lemma}{Lemma}

42:

43: \title{A geometric approach to tree shape statistics}

44: \author{Frederick A. Matsen}

45:

46: %%%%%%%%%%%%%%%%%%%%

47: % todo

48: % save generated trees

49:

50: \begin{document}

51:

52: \maketitle

53:

54: \newcounter{count}

55:

56: \begin{abstract}

57:   This article presents a new way to understand the descriptive

58:   ability of tree shape statistics. Where before tree shape statistics

59:   were chosen by their ability to distinguish between

60:   macroevolutionary models, the ``resolution'' presented in this

61:   paper quantifies the ability of a statistic to differentiate between

62:   similar and different trees. We term this a ``geometric'' approach

63:   to differentiate it from the model-based approach previously

64:   explored. A distinct advantage of this perspective is that it allows

65:   evaluation of multiple tree shape statistics describing different

66:   aspects of tree shape. After developing the methodology, we

67:   apply it here to make specific recommendations for a suite of three

68:   statistics which will hopefully prove useful in applications. The

69:   article ends with an application of the tree shape statistics to

70:   clarify the impact of omission of taxa on tree shape.

71: \end{abstract}

72:

73: The analysis of phylogenetic tree shape provides one way of

74: understanding the forces guiding macroevolution, as well as

75: understanding possible biases of tree reconstruction methodology.

76: Although it has been a subject of study for many years, a recent

77: editorial in this journal \citep{simon-page} hints that finding the

78: forces guiding tree shape is a long-term challenge which has yet to be

79: completely understood. Joe \citet{felsenstein} concludes the chapter

80: on tree shape methodology in his recent book with the simple phrase

81: ``[c]learly this literature is in its early days.'' Indeed, tree shape

82: is still a challenge, and an important one. A complete understanding

83: would help resolve important questions in biology such as the roles of

84: adaptive radiation and environmental change in generating diversity.

85: Tree shape also poses difficult issues of its own, such as the impact

86: of missing or extinct taxa on our understanding of historical

87: biodiversity. Not only are many fundamental questions left unanswered,

88: but the area is ripe for progress: the large number and size of

89: contemporary phylogenies forms a fantastic corpus on which

90: macroevolutionary hypotheses can be tested.

91:

92: In order to use phylogenetic tree shape as a tool, we need methods to

93: measure and quantify aspects of tree shape. Almost all work to this

94: day has been done with measures of tree ``balance,'' which is the

95: degree to which two sister taxa are of the same or different size. A

96: major vein of research has been to compare balance of trees created

97: from data to trees produced by one or another null model

98: \citep{Savage1983:225,Guyer1991:340,Guyer1993:253,stam02a}.

99: \citet{Kirkpatrick1993:1171}, in one of the early

100: papers in the area, quantified the power of different measures of tree

101: balance in distinguishing between two models of tree shape. The two

102: models are extremely simple: one, called the Yule or ERM model,

103: develops a tree by starting with a single species and then choosing

104: uniformly among species to speciate. The other, called the PDA model,

105: is simply the distribution on tree shapes induced by the uniform

106: distribution on labelled trees.

107:

108: Studies have shown that most trees created from data are less balanced

109: than would be expected from the ERM model, yet more balanced than

110: would be expected from the PDA model

111: \citep{Mooers1995:379,Mooers1997:31,Purvis2002:844}. Models of increasing

112: sophistication have appeared, attempting to re-create the

113: pattern of tree shape observed in nature. For example,

114: \citet{Heard1996:2141} found that speciation rate variation among

115: lineages can lead to imbalanced trees.

116: \citet{Losos1995:329} found that short ``refractory periods'' (periods

117: during which a new species cannot speciate again) led to more balanced

118: trees, while \citet{Rogers1996:99} found that very long

119: refractory periods led to less balanced trees.

120: \citet{Aldous95,Aldous2001:23} was the first to propose a

121: (non-evolutionary) model which interpolated between the ERM and the

122: PDA models. More recently, \citet{Steel2001:91} and

123: \citet{Pinelis2003:1425} have developed evolutionary

124: models which also interpolate.

125:

126: With these models, one could presumably arrange

127: parameters to correctly fit the observed pattern of imbalance as

128: reported by a given statistic. But is that really enough? What if

129: other aspects of the tree shape, not measured by the statistic, differ

130: considerably? After all, any single statistic is a one-dimensional

131: summary of a very complex set of data. One might follow the suggestion

132: of Agapow and Purvis \citep{Agapow2002:866} and use two different

133: balance statistics which measure balance in different parts of the

134: tree, but in this paper we hope to present a more direct approach.

135:

136: The only proposal made in the literature which has the potential to

137: encapsulate a great deal of information about the shape of a tree is that of

138: \citet{Aldous2001:23}. He suggests first constructing a

139: scatterplot of the interior nodes, where the $x$ coordinate is the

140: size of the subclade subtended by that interior node, and the $y$

141: coordinate is the size of the smaller daughter clade. The proposal is

142: then to perform nonlinear median regression on

143: the log-log version of this scatterplot and then use the fitted

144: function as a descriptor of tree shape. We will call the log-log

145: scatterplot the ``Aldous scatterplot'' in the following.
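
As a small illustration, constructed here for concreteness rather than
taken from \citet{Aldous2001:23}, consider the two tree shapes on four
leaves. Listing each interior node as (size of subtended clade, size
of smaller daughter clade) gives
\[
\text{completely unbalanced: } \{(4,1),\,(3,1),\,(2,1)\},
\qquad
\text{balanced: } \{(4,2),\,(2,1),\,(2,1)\}.
\]
After taking logarithms of both coordinates, the balanced tree
contributes only two distinct points, because its two two-leaf
subtrees land on the same location in the plot.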

146:

147: There are a number of advantages to this approach. It is very natural

148: from a statistical viewpoint relative to the other, more ad-hoc,

149: measures of tree balance. The method has the potential to give quite a

150: lot of information about tree shape compared to a single summary

151: statistic. Finally, it allows comparison of trees of different sizes

152: by superposition of scatterplots, which is a significant advantage.

153: There is currently no generally accepted method for comparing trees of

154: different sizes using the standard statistics; this remains a

155: problematic issue \citep{Mooers1995:379,stam02a}.

156:

157: However, there are three disadvantages which may make Aldous'

158: proposal less practical than might be hoped. The first is that regression

159: works best with many points of data, and thus one can only expect his

160: technique to work with rather large trees. This problem is exacerbated

161: by the fact that isomorphic subtrees are superimposed on one another

162: in the scatterplot, further reducing the number of fittable points.

163: The second is an inherent problem with summarizing a tree as a

164: scatterplot of this sort. Assume that tree $T$ has two non-isomorphic

165: subtrees $A$ and $B$ of the same size. Exchanging $A$ and $B$ in $T$

166: will not change the scatterplot and thus not change any regression

167: parameters, although the resulting tree may differ significantly in

168: shape. The third problem is that the resulting output can be hard to

169: interpret. What does, for example, the $k$th Taylor coefficient of the

170: fitted function actually signify? Despite these issues, we believe

171: that this technique is underutilized and may be the technique of

172: choice when working with large phylogenies.

173:

174: Overall, it appears that additional methods would be useful for

175: understanding tree shape. This paper attempts to provide some of these

176: new methods.

177:

178: \SBsection{The geometric approach}

179:

180: The basic philosophy behind the geometric approach is that similar

181: trees should have similar statistics, and that rather different trees

182: should have different statistics. This philosophy is summarized in

183: Figure \ref{fig:example_stat}. All of the trees with six tips are

184: evaluated by two hypothetical statistics. The top axis shows what one

185: might consider a good statistic. The maximally balanced tree is on

186: the far left side, and the completely unbalanced tree is on the far

187: right. When a subtree is preserved, the statistic tends not to change

188: too much. The bottom axis shows what might be considered a bad

189: statistic. The extremes of tree balance are now put together, and two

190: similar trees are now on the two extremes of the axis.

191:

192: \begin{figure}

193:   \begin{center}

194:   \includegraphics[angle=0,scale=.75]{example_stats.eps}

195: \end{center}

196: \caption{Good and bad statistics from the geometric perspective. The

197:   horizontal axes represent values of hypothetical statistics. In

198:   figure (a) very different trees are separated, while in figure (b)

199:   very different trees are close together.}

200: \label{fig:example_stat}

201: \end{figure}

202:

203: If we are to apply this sort of intuition to trees, it is necessary to

204: formalize the notion of similar and different. We do so by

205: constructing a metric on unlabeled trees.

206:

207: \SBsubsection{A metric for evolutionary histories}

208:

209: Here we describe a metric on unlabeled trees which can be applied

210: directly to compare tree shapes or can be used to guide the selection

211: of statistics as described below. To begin we state that by ``tree''

212: we will mean a finite strictly bifurcating rooted tree without leaf

213: labels or specified edge lengths. We have chosen finite strictly

214: bifurcating rooted trees, as these correspond most naturally to the

215: output of models. This paper concerns itself with tree shape rather

216: than the identity of taxa, thus we consider unlabeled trees. Finally,

217: our intent in this paper is to understand the combinatorial content of

218: the tree, and thus we consider trees without specified edge lengths.

219: The case including edge lengths would be an interesting future

220: extension of this work, but would require a significant further

221: development of the methodology.

222:

223: \begin{figure}

224:   \begin{center}

225:   \includegraphics[angle=0,scale=.45]{nni.eps}

226: \end{center}

227: \caption{A single rooted NNI move.}

228: \label{fig-nni}

229: \end{figure}

230:

231: We recall that a metric $g$ is simply a set of ``distances'' between

232: pairs of a collection of objects satisfying (i) $g(x,y) = 0$ if and

233: only if $x=y$, (ii) $g(x,y) = g(y,x)$, (iii) the triangle inequality:

234: $g(x,y) + g(y,z) \geq g(x,z)$. The metric we consider is simply the

235: nearest neighbor interchange (NNI) metric on unlabeled trees, depicted

236: in Figure \ref{fig-nni}. A single NNI ``move'' represents a change of

237: branching order of a tree to one of two possible configurations. The

238: unlabeled NNI distance from one tree to another is defined to be the

239: minimum number of moves necessary to change one tree to the other.

240: Note that these interchanges have appeared before in

241: \citet{Kuhner1995:1421} as proposal draws for their

242: Metropolis-Hastings approach to estimating population parameters.

243:

244: Tree space equipped with the NNI metric is shown in Figure

245: \ref{fig:tree_space} for trees on 6 leaves. It is a graph which has

246: connections between any two trees which are a single NNI move apart.

247: Note that the NNI distance is a special case of the shortest-path

248: metric on a graph and thus we are justified in calling it a metric.

249: Also, although the metric is not explicitly model-based, a change of

250: branching order can be thought of as a change of timing of

251: diversification events.
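
The claim that the NNI distance is a metric can be checked directly
against conditions (i)--(iii) above. Write $g(T,T')$ for the length of
a shortest path between trees $T$ and $T'$ in this graph, assuming (as
the construction implicitly does) that any two trees with the same
number of leaves are joined by some path. A path of length zero joins
a tree only to itself, which gives (i). Every NNI move can be undone
by another NNI move, so reversing a path gives (ii). Finally,
concatenating a shortest path from $T$ to $T'$ with a shortest path
from $T'$ to $T''$ produces some path from $T$ to $T''$, and a
shortest path can only be shorter:
\[
g(T,T'') \ \leq \ g(T,T') + g(T',T'').
\]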

252:

253: \begin{figure}

254:   \begin{center}

255:   \includegraphics[angle=0,scale=.45]{tree_space.eps}

256: \end{center}

257: \caption{Unlabeled tree space equipped with the NNI metric. An edge

258:   between two trees means that a single NNI move changes one to the

259:   other.}

260: \label{fig:tree_space}

261: \end{figure}

262:

263: Unsurprisingly, computing this metric is NP-complete, as can be seen by

264: a small modification of a similar proof by \citet{dasgupta}. Their

265: paper demonstrates that calculating the unrooted NNI distance on

266: unrooted trees is NP-complete. However, the unrooted NNI moves are

267: identical to the moves in Figure \ref{fig-nni} when the tree shown in

268: the diagram is chosen to be anything but the entire tree. Therefore we

269: can simply root the tree in Figure 4 of their paper on the far left

270: side of the main linear tree and the proof proceeds as usual.

271:

272: \SBsubsection{Resolution of statistics}

273:

274: In this section we define the notion of ``resolution'' of a tree shape

275: statistic. Although the formal definition of the resolution is in

276: terms of the statistical method of multidimensional scaling, we will

277: first describe how resolution relates to the more common method of

278: principal component analysis, and then give an intuitive definition of

279: resolution as a measure of how much a statistic ``spreads out'' the

280: data. This resolution measure will be applied to various tree shape

281: statistics below where the underlying data

282: will be the tree space of a given number of leaves. In this way the

283: resolution will be our operational definition of performance for tree

284: shape statistics.

285:

286: The resolution measure formalizes the intuitive notion that similar

287: objects should have similar statistics and rather different objects

288: should have different statistics. For the moment let us consider these

289: objects to be points in $n$-dimensional space. A natural statistic

290: which satisfies our criteria is the familiar first principal component

291: from multivariate statistics. It is some projection of the original

292: spatial data, so objects which are close together stay close together

293: after projection. Also, it is the direction along which the variance of

294: the coordinates of the points is maximized, so that, as far as possible,

295: objects which are far apart stay far apart. In this way we consider

296: the first principal component to be the best possible statistic for

297: this collection of points, and will assign it the highest resolution

298: value.

299:

300: We can get at the principal component by thinking of it as the

301: maximization of a certain ``quadratic form.'' In the standard

302: formulation, the principal components are the eigenvectors of the

303: covariance matrix constructed from the coordinates of the sample

304: points. However, it turns out that even if we do not have the actual

305: coordinates of the points, but rather the distances between them, we

306: can still construct the covariance matrix. The process goes as follows:

307: let $H$ be the $n \times n$ ``centering matrix''

308: \[

309: H = I - (1/n) J

310: \]

311: where $J$ is the matrix with every entry equal to one. The operation

312: of the centering matrix on a vector subtracts off the average of the

313: entries of the vector from each component, so the result is a vector

314: which is perpendicular to the vector of ones. Let $S(A)$ be the

315: component-wise matrix squaring operation, such that the $ij$ entry of

316: $S(A)$ is $a_{ij}^2$. Then if $D$ is a ``euclidean distance matrix,''

317: i.e. a matrix such that the $ij$ entry is the distance between two

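318: points $i$ and $j$ in a euclidean space, then $B = - H \, S(D) \, H$

319: will correspond, up to a positive constant factor, with the matrix of inner

320: products of those same points after centering, which here plays the role of the covariance matrix calculated in the traditional way \citep{mardiaEA}.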
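
As a quick check of this construction on a toy configuration (chosen
here purely for illustration), take three points on a line at
positions $0$, $1$ and $3$. Their mean is $4/3$, so the centered
coordinates are $c = (-4/3,\, -1/3,\, 5/3)^T$, and
\[
S(D) = \begin{pmatrix} 0 & 1 & 9 \\ 1 & 0 & 4 \\ 9 & 4 & 0 \end{pmatrix},
\qquad
- H \, S(D) \, H
= \frac{2}{9} \begin{pmatrix} 16 & 4 & -20 \\ 4 & 1 & -5 \\ -20 & -5 & 25 \end{pmatrix}
= 2 \, c \, c^T .
\]
Thus the doubly centered matrix of squared distances recovers, up to
sign and a factor of two, the inner products of the centered points
without any reference to the coordinates themselves.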

321:

322: With the covariance matrix now in hand, we can apply the Rayleigh

323: Quotient theorem, which is a special case of the Courant-Fischer

324: theorem. It states that the eigenvector corresponding to the largest

325: eigenvalue of a symmetric matrix $M$ maximizes the quadratic form $\x^T M

326: \x$

327: over all unit-length vectors $\x$ \citep{ortega}. Thus in our setting

328: the first principal component is the unit-norm $\x$ which maximizes

329: the quadratic form

330: \begin{equation}

331: \label{eq:qf1}

332: R(\x) \ = - \ \x^T H \, S(D) \, H \x.

333: \end{equation}

334: Again, the action of left multiplication by $H$ simply subtracts the

335: average of the components of $\x$. Therefore maximization is certainly

336: achieved by an $\x$ which has average zero, i.e. is perpendicular to

337: \one. On such $\x$, $H$ clearly has no effect. Therefore we can obtain

338: the first principal component as

339: \begin{equation}

340: \label{eq:qf2}

341: \argmax_{\stack{\norm{\x} = 1}{\x \perp \one}} \ \ - \x^T S(D) \x.

342: \end{equation}

343: Written out in a slightly longer form this is

344: \begin{equation}

345: \label{eq:qf_intuitive}

346: \argmax_{\stack{\norm{\x} = 1}{\x \perp \one}} \ \sum_{i,j} -d_{ij}^2 x_i x_j

347: \end{equation}

348: This formula has a simple and intuitive explanation. As mentioned

349: above, in our view a statistic should assign very different values to

350: objects which are far apart. This equation simply formalizes this

351: intuition in a nice way: an individual term of the sum in

352: (\ref{eq:qf_intuitive}) will be maximized if $x_i$ is very negative

353: and if $x_j$ is very positive. The summation and the distances simply

354: combine all of these terms together in a weighted fashion such that

355: $ij$ pairs which are distant carry more weight than ones which are

356: close. Therefore the more distant objects will tend to be farther

357: apart in $x$-value, and the closer objects will tend to be closer in

358: $x$-value.

359:

360: We will call the quadratic form $R$ of (\ref{eq:qf1}) the

361: ``resolution'' of a statistic, in the sense that a statistic which

362: differentiates between close and distant objects has a high level of

363: resolution. As mentioned above, the first principal component

364: maximizes $R$, and thus its value is an upper limit on the resolution

365: of a statistic. However, we will see below that some well-known

366: statistics on tree space achieve resolution nearly that of the first

367: principal component.

368:

369: So far we have defined the resolution only for distance matrices

370: arising from configurations of points in euclidean space. Although

371: phrased in a slightly unusual manner, this has led us into the

372: well-known area of principal component analysis. However, our intent

373: is to apply this technique to the space of all unlabeled trees with

374: the NNI metric. The distance matrix corresponding to this space is far

375: from being a euclidean distance matrix. Is it possible to continue

376: with the same formalism as in the euclidean setting?

377:

378: It turns out that we can, and that the procedure is now called metric

379: multidimensional scaling (MDS) \citep{mardiaEA}. The only difference

380: is that $D$ is now allowed to be non-euclidean. In essence, when we

381: substitute a non-euclidean distance matrix into (\ref{eq:qf1}), we

382: consider the projection of the squared centered matrix onto the cone

383: of positive semidefinite matrices. Thus multidimensional scaling performs

384: principal component analysis on the ``closest'' euclidean distance

385: matrix to our original matrix in a specific sense \citep{dattorro}.

386: This operation certainly loses some data, but enough information is

387: retained to understand the descriptive ability of several statistics.

388: We visit this issue in the last section.

389:

390: Note that this is not the first application of MDS to phylogenetic

391: analysis: \citet{Hillis2005:471} applied it with

392: interesting results to the space of trees with labeled tips. They used

393: MDS with the Robinson-Foulds distance metric as a tool for visualization

394: and analysis of the output of tree reconstruction software. Our intent

395: and methods differ here, as we are concerned with finding near-optimal

396: statistics for understanding unlabeled tree space with the NNI metric.

397:

398: In this section we have defined the resolution as a function that allows

399: us to understand the descriptive ability of some statistic. At this

400: point we specialize to the case of tree shape statistics on tree space

401: equipped with the NNI metric. Resolution scores are calculated as

402: follows: first construct a vector whose entries are the values of the

403: statistic on each of the trees in tree space. Then apply the matrix $H$ to

404: center the vector, and normalize the result in the euclidean sense

405: to obtain a vector $\hat{x}$. The resolution is the value of

406: $- \hat{x}^T S(D) \hat{x}$. We will use this definition to guide

407: selection of statistics.
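
The recipe can be traced by hand on a toy example (an illustrative
configuration of our choosing, not one of the tree spaces analysed
below). Take three objects with pairwise distances $d_{12} = 1$,
$d_{13} = 3$, $d_{23} = 2$, and a statistic taking the values
$(0, 1, 3)$ on them. Centering gives $(-4/3,\, -1/3,\, 5/3)$, whose
squared euclidean length is $14/3$. Since the normalized vector
$\hat{x}$ is already centered, $H$ has no effect on it, and the
resolution $R$ of (\ref{eq:qf1}) is
\[
R(\hat{x}) = - \sum_{i,j} d_{ij}^2 \, \hat{x}_i \hat{x}_j
= - \frac{2}{14/3} \left( 1 \cdot \tfrac{4}{9}
  + 9 \cdot \left(-\tfrac{20}{9}\right)
  + 4 \cdot \left(-\tfrac{5}{9}\right) \right)
= \frac{28}{3} \approx 9.33 .
\]
A statistic which instead takes the values $(0, 3, 1)$, placing the
two most distant objects next to each other, has centered vector
$(-4/3,\, 5/3,\, -1/3)$ of the same length, and
\[
R(\hat{x})
= - \frac{2}{14/3} \left( 1 \cdot \left(-\tfrac{20}{9}\right)
  + 9 \cdot \tfrac{4}{9}
  + 4 \cdot \left(-\tfrac{5}{9}\right) \right)
= \frac{4}{21} \approx 0.19 .
\]
The first statistic keeps distant objects apart and scores far higher.
In this example its score coincides with the largest eigenvalue of
$- H \, S(D) \, H$, because these distances are realized by points at
$0$, $1$ and $3$ on a line and the statistic is exactly that
coordinate.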

408:

409: \newpage

410: \SBsection{Results}

411:

412: In this section the methodology of the previous section is applied to

413: compare the resolution of tree shape statistics. We will first

414: evaluate the standard list of statistics

415: \citep{Kirkpatrick1993:1171,Agapow2002:866,felsenstein} according to the above

416: methodology. Then we search for a best second statistic given the

417: first, and the best third statistic given a first and second. Our

418: criterion for performance is high resolution on the whole unlabeled

419: tree space with the NNI metric as described in the previous section.

420: The tree space was generated and evaluated by an \texttt{ocaml}

421: \citep{ocaml} program whose source is available upon request.

422:

423: We calculated the well-known statistics \nbar\ and \varn\ proposed by

424: \citet{Sackin1972:225}, \Ic\ proposed by \citet{Colless1982:100}, and

425: \Bone\ and \Btwo, proposed by \citet{Shao1990:266}. We added

426: to the list a rarely used statistic \Itwo, invented by

427: \citet{Mooers1997:31} to provide a measure which weights all nodes

428: equally. Finally, we implemented the proposal of

429: \citet{Aldous2001:23} to perform median regression as described in the

430: introduction. We fit a quadratic polynomial to the data using median

431: regression and interpreted the linear and quadratic coefficients as

432: descriptive statistics which we call \Aone\ and \Atwo.

433:

434: We note here that although Aldous' paper did not explicitly specify

435: how to perform the median regression, we have chosen nonlinear median

436: regression as described by \citet{Koenker1978:33}.

437: This method minimizes the sum of the absolute deviations of the data

438: points from the fitted curve. Median regression performs much better (as a

439: maximum-likelihood estimator) than least-squares regression when

440: errors are non-gaussian, as in our case. It can be easily implemented

441: using linear programming; in this case it was implemented in 34 lines

442: of code using an \ocaml\ frontend to the GNU linear programming

443: package GLPK.
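
For concreteness, the standard reduction of median regression to a
linear program can be written as follows. This is the textbook
formulation, offered as a sketch; the exact program passed to GLPK is
not recorded here. For data points $(u_i, v_i)$, $i = 1, \dots, m$,
and a quadratic $f(u) = a + b u + c u^2$ (symbols introduced only for
this illustration), one introduces nonnegative slack variables $p_i$
and $q_i$ and solves
\begin{align*}
\text{minimize} \quad & \sum_{i=1}^{m} (p_i + q_i) \\
\text{subject to} \quad & v_i - (a + b u_i + c u_i^2) = p_i - q_i,
\qquad p_i \geq 0, \quad q_i \geq 0, \qquad i = 1, \dots, m.
\end{align*}
At an optimum at most one of $p_i$ and $q_i$ is nonzero for each $i$,
so $p_i + q_i = |v_i - f(u_i)|$ and the objective equals the sum of
absolute residuals. The constraints are linear in the unknowns $a$,
$b$, $c$, $p_i$ and $q_i$ because the $u_i$ are data.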

444:

445: \begin{table}

446:   \centering

447: \hspace*{-2cm}

448:     \begin{tabular}{cccccccccc}

449:       $n$ & $\lambda_0$ & \Ic & \nbar & \varn & \Itwo & \Bone & \Btwo & \Aone & \Atwo \\

450:       \hline

451:       7 & 7.01 & 6.29 & 6.34 & 6.07 & 5.90 & 6.22 & 6.29 & 2.67 & 2.70 \\

452:       8 & 21.48 & 19.43 & 19.07 & 18.05 & 17.67 & 18.89 & 19.04 & 5.82 & 6.02 \\

453:       9 & 48.06 & 43.24 & 43.38 & 41.13 & 39.44 & 42.29 & 42.57 & 7.71 & 8.42 \\

454:       10 & 125.11 & 116.37 & 115.93 & 110.07 & 103.60 & 111.18 & 111.55 & 31.14 & 33.74 \\

455:       11 & 299.82 & 283.47 & 282.88 & 268.50 & 249.33 & 269.62 & 269.56 & 84.38 & 89.79 \\

456:       12 & 755.12 & 714.86 & 714.04 & 676.40 & 626.25 & 676.61 & 672.84 & 224.32 & 241.35 \\

457:       13 & 1856.88 & 1760.73 & 1760.97 & 1663.67 & 1525.18 & 1661.87 & 1645.81 & 575.67 & 622.98 \\

458:       14 & 4619.28 & 4387.95 & 4385.72 & 4139.01 & 3779.58 & 4113.12 & 4051.89 & 1458.20 & 1583.53 \\

459:       15 & 11392.51 & 10819.20 & 10817.17 & 10190.62 & 9241.58 & 10106.57 & 9909.07 & 3788.17 & 4124.96 \\

460:     \end{tabular}

461:   \caption{The resolution scores for tree statistics on the NNI

462:     distance matrix.}

463: \label{table:first}

464: \end{table}

465:

466: \begin{figure}

467:   \begin{center}

468:   \includegraphics[angle=-90,scale=.45]{fig1.ps}

469: \end{center}

470: \caption{Resolution scores divided by the first eigenvalue.}

471: \label{fig:first}

472: \end{figure}

473:

474: The results of this analysis are presented in Table \ref{table:first}

475: and Figure \ref{fig:first}. First, we find that the resolution of two

476: statistics, \Ic\ and \nbar, is rather close to the first eigenvalue,

477: which is the upper limit for the resolution. This is quite remarkable,

478: in that two statistics which were designed ``by hand'' to measure a

479: visible aspect of tree shape end up having almost as much resolution

480: as theoretically possible. The fact that overall tree balance appears

481: as such an important descriptor justifies in a sense the

482: disproportionate amount of attention given to it in the tree shape

483: literature. Another nice fact is that the relative resolution scores

484: correspond loosely to the power of the statistics as found by

485: \citet{Agapow2002:866}: \Ic\ and \nbar\ have the most resolution,

486: followed by \varn\ and \Bone; \Btwo\ has the lowest resolution of the

487: standard suite of statistics. We report that in this first setting,

488: \Itwo\ does have substantially lower resolution than the other

489: statistics; however, we will see that it performs well in later

490: settings. Finally, it appears that the coefficients of the best-fit

491: quadratic polynomial on the Aldous scatterplot should not be used as a

492: first statistic in the simpleminded way presented here on small trees;

493: it is possible that an alternative formulation would yield better

494: results.

495:

496: So far we have only validated that our technique gives results which

497: do not seem completely out of the ordinary. However, now we can do

498: something new. Let's say that we choose \Ic\ as our first statistic

499: and ask the question ``what is the best second number to know about a

500: tree given that we already know \Ic?'' This question has a

501: mathematical formulation: we simply project out the \Ic\ component of

502: the matrix $B$ and repeat the previous process.
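
One natural way to make ``projecting out'' precise, stated here as an
assumption about the intended construction rather than a record of the
computation, is the following. Let $\hat{v}$ be the centered,
unit-length vector of \Ic\ values over tree space, and let
$P = I - \hat{v} \hat{v}^T$ be the orthogonal projection onto the
complement of $\hat{v}$. The matrix $B$ is then replaced by
\[
B' = P \, B \, P .
\]
With this choice $\hat{v}^T B' \hat{v} = 0$, because
$P \hat{v} = \hat{v} - \hat{v} (\hat{v}^T \hat{v}) = 0$, while any
vector orthogonal to $\hat{v}$ has the same quadratic form under $B'$
as under $B$.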

503:

504: \begin{table}

505:   \centering

506:   \begin{tabular}{cccccccc}

507: $n$ & \nbar & \varn & \Itwo & \Bone & \Btwo & \Aone & \Atwo \\

508: \hline

509: 7 & 0.15 & 0.03 & 1.89 & 0.75 & 0.53 & 2.68 & 2.74 \\

510: 8 & 0.35 & 0.24 & 5.42 & 1.75 & 1.34 & 6.05 & 6.10 \\

511: 9 & 0.88 & 0.54 & 14.94 & 6.45 & 5.16 & 7.43 & 8.50 \\

512: 10 & 1.85 & 1.77 & 42.47 & 14.55 & 12.37 & 31.72 & 33.76 \\

513: 11 & 4.12 & 5.52 & 110.23 & 40.09 & 35.80 & 85.44 & 89.11 \\

514: 12 & 8.91 & 16.80 & 293.51 & 97.41 & 91.67 & 224.61 & 230.85 \\

515: 13 & 20.06 & 48.81 & 749.81 & 253.42 & 249.96 & 577.10 & 593.12 \\

516: 14 & 44.64 & 139.34 & 1930.63 & 625.33 & 645.74 & 1431.73 & 1449.77 \\

517: 15 & 102.17 & 387.97 & 4883.15 & 1586.90 & 1710.31 & 3657.50 & 3657.96 \\

518: \end{tabular}

519: \caption{Resolution scores for tree statistics on the NNI

520:   distance matrix after projecting out \Ic.}

521: \label{table:second_a}

522: \end{table}

523:

524: The resolution scores of the previously chosen statistics are listed

525: in Table \ref{table:second_a} with the exception of \Ic, which of

526: course has resolution zero because we have projected it out. We note

527: first that \nbar\ has rather small resolution, which is to be expected

528: because it is highly correlated with \Ic. Comparatively, \Itwo, \Aone, and

529: \Atwo\ now do better, which means that they measure a different

530: aspect of tree shape than does \Ic.

531:

532: However, it is possible to improve on existing statistics by

533: explicitly constructing a statistic which measures a different aspect

534: of tree shape than \Ic. Plotting the principal components of the $B$

535: matrix suggests that a good second statistic may be the change of

536: balance from the root to the tips. We have implemented this intuition

537: in two ways, first as the ``derived statistics'' of a given statistic,

538: and second as a specific statistic which we call \Qone.

539:

540: First we describe the construction of the derived statistics of a

541: given statistic $Y$. Start by making a plot analogous to the Aldous

542: scatterplot, except now the $x$ axis is the size of the subtree and

543: $y$ is the value of the statistic $Y$. Now do median regression on

544: this scatterplot and report the slope of the best-fit line or the

545: quadratic coefficient of the best-fit quadratic polynomial. Given an

546: original statistic $Y$ we will call these two derived statistics $Y'$

547: and $Y''$ in analogy to the first and second derivatives of calculus.

548: Higher derived statistics are of course possible but will not be

549: investigated in this paper.

550:

551: We have designed another statistic, which we call \Qone, which also

552: attempts to quantify the change of balance from the root to the tips.

553: The conceptual model for this statistic is the idea that at some time

554: in the past there may have been a change of evolutionary machinery

555: such that the balance before that time differs from the balance after

556: that time. In some sense the procedure tries to find that time and

557: then compares the balance before and after that time.

558:

559: The procedure can be described as follows. Begin by assigning to each

560: internal node a ``local imbalance,'' which quantifies the degree of

561: imbalance just at that node. If a bifurcating internal node has

562: subtrees of size $s_l$ and $s_r$, the local imbalance at that node is

563: \[

564: \frac{|s_l - s_r|}{s_l + s_r - 2}.

565: \]

566: This quantity is similar to the summand in the definition of \Itwo\ by

567: \citet{Mooers1997:31}. We set the local imbalance of a

568: three-node tree to be one at each node. We set the local imbalance of

569: a two-node tree to be zero unless it is part of a three-node tree.
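
The local imbalance assignment can be sketched in Python as follows. Trees are represented here as nested pairs, with any non-tuple value standing for a leaf; this representation and the function names are assumptions made for illustration. The treatment of two-leaf subtrees follows one reading of the special cases above (value one when the cherry sits directly inside a three-leaf subtree, zero otherwise) and should be checked against the intended definition.

```python
def leaf_count(tree):
    """Number of leaves; a leaf is any non-tuple value."""
    if not isinstance(tree, tuple):
        return 1
    left, right = tree
    return leaf_count(left) + leaf_count(right)


def local_imbalances(tree, in_three_leaf=False):
    """Return [(subtree size, local imbalance)] for each internal node, post-order."""
    if not isinstance(tree, tuple):
        return []
    left, right = tree
    s_l, s_r = leaf_count(left), leaf_count(right)
    size = s_l + s_r
    if size == 2:
        # The formula is 0/0 here; assumed special case (see lead-in).
        value = 1.0 if in_three_leaf else 0.0
    else:
        value = abs(s_l - s_r) / (size - 2)
    child_flag = size == 3
    return (local_imbalances(left, child_flag)
            + local_imbalances(right, child_flag)
            + [(size, value)])


comb = ((("a", "b"), "c"), "d")
balanced = (("a", "b"), ("c", "d"))
print(local_imbalances(comb))
print(local_imbalances(balanced))
```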

570:

571: After local imbalances have been assigned, we iterate up the tree to

572: find a ``cut'' of the tree into one basal tree and then a collection

573: of distal trees, which must contain all of the leaves. The cut is

574: first chosen such that the average local imbalance of the internal

575: nodes of the distal trees is maximized. Then the first statistic is

576: computed, which is the average imbalance of the internal nodes of the

577: distal trees minus the average imbalance of the internal nodes of the

578: basal tree. This process is repeated to create a second statistic,

579: except a cut is chosen such that the imbalance of the internal nodes

580: of the distal trees is minimized. Whichever value is greater in

581: absolute value is then called \Qone.

582:

583: We also recall a statistic which has been understood from the

584: theoretical perspective but which is not in common usage in the tree

585: shape literature: the number of ``cherries'' of a tree. A ``cherry''

586: is simply a subtree of two leaves. \citet{mckenzie-steel} have shown

587: that the distribution of the number of cherries is asymptotically

588: normal under both the equal rates Markov and the uniform model (see

589: next section) and have derived the mean and variance for each.
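
Counting cherries is simple enough to state in a few lines. The Python sketch below assumes a tree stored as nested pairs with non-tuple leaves; the representation and the name \texttt{count\_cherries} are illustrative choices.

```python
def count_cherries(tree):
    """Number of internal nodes whose two children are both leaves."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    if not isinstance(left, tuple) and not isinstance(right, tuple):
        return 1
    return count_cherries(left) + count_cherries(right)


balanced = (("a", "b"), ("c", "d"))
comb = (((("a", "b"), "c"), "d"), "e")
print(count_cherries(balanced), count_cherries(comb))
```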

590:

591: \begin{table}

592:   \centering

593:   \begin{tabular}{ccccccccc}

594:     $n$ & \Qone & cherries & $\Ic'$ & $\Itwo'$ & $\Bone''$ & $\Btwo''$ &

595:     \Aone & \Atwo \\

596:     \hline

597:     7 & 4.84 & 1.71 & 3.53 & 2.92 & 2.52 & 2.50 & 2.68 & 2.74 \\

598:     8 & 12.29 & 5.48 & 10.34 & 10.07 & 6.41 & 6.48 & 6.05 & 6.10 \\

599:     9 & 30.88 & 15.27 & 28.13 & 27.97 & 15.23 & 15.58 & 7.43 & 8.50 \\

600:     10 & 73.07 & 44.80 & 61.97 & 62.01 & 44.96 & 46.37 & 31.72 & 33.76 \\

601:     11 & 173.93 & 118.61 & 147.68 & 146.36 & 122.90 & 129.08 & 85.44 & 89.11 \\

602:     12 & 427.55 & 322.74 & 347.84 & 340.43 & 312.94 & 322.52 & 224.61 & 230.85 \\

603:     13 & 1024.86 & 833.99 & 871.08 & 868.45 & 798.39 & 823.73 & 577.10 & 593.12 \\

604:     14 & 2459.67 & 2171.81 & 2127.44 & 2059.13 & 2042.00 & 2101.81 & 1431.73 & 1449.77 \\

605:     15 & 5972.63 & 5530.14 & 5058.50 & 4873.71 & 5103.33 & 5232.47 & 3657.50 & 3657.96 \\

606:   \end{tabular}

607:   \caption{Resolution scores for tree statistics on the NNI

608:     distance matrix after projecting out \Ic.}

609: \label{table:second_b}

610: \end{table}

611:

612: Table \ref{table:second_b} presents the somewhat surprising results of

613: the resolution method as applied to the distance matrix after \Ic\ has

614: been projected out. The best performance is achieved by \Qone, the

615: somewhat complicated statistic presented above, but close behind is

616: the number of cherries, perhaps the simplest possible statistic.

617: Although the performance of the cherry statistic lags behind the above

618: statistics as a first statistic (see Supplementary Material), it has

619: remarkably good performance as a second statistic. Similar performance

620: is achieved by the slightly more complex $\Ic'$. We also report the

621: values of $B_1''$ and $B_2''$ due to their good performance.

622:

623: Now assume we choose \Qone\ for our second statistic and look for a

624: third. As before, we project \Ic\ and \Qone\ out of our matrix and

625: compare scores.

626: \begin{table}

627:   \centering

628:   \begin{tabular}{ccccccc}

629:     $n$ & $\Bone''$ & $\Btwo''$ & $\Qone''$ & $\Ic''$ & \Aone & \Atwo \\

630:     \hline

631:     7 & 2.34 & 2.42 & 1.87 & 1.30 & 1.77 & 1.97 \\

632:     8 & 6.53 & 6.75 & 4.39 & 5.12 & 5.08 & 5.55 \\

633:     9 & 15.55 & 15.86 & 9.73 & 12.89 & 7.44 & 8.43 \\

634:     10 & 44.91 & 45.83 & 38.16 & 37.03 & 31.60 & 33.83 \\

635:     11 & 122.45 & 127.04 & 99.51 & 92.82 & 85.30 & 88.91 \\

636:     12 & 313.13 & 321.23 & 245.45 & 250.41 & 223.88 & 230.76 \\

637:     13 & 798.41 & 820.11 & 645.11 & 619.09 & 577.72 & 586.28 \\

638:     14 & 2040.07 & 2095.10 & 1633.48 & 1524.52 & 1429.47 & 1428.79 \\

639:     15 & 5104.65 & 5223.00 & 3939.10 & 3822.40 & 3649.16 & 3603.47 \\

640:   \end{tabular}

641:   \caption{The resolution scores for tree statistics on the NNI

642:     distance matrix after projecting out \Ic\ and $\Qone$.}

643: \label{table:third}

644: \end{table}

645: This time it is $\Btwo''$ which performs the best. However, we note

646: that \Aone, \Atwo, and \Itwo\ are not far behind.

647:

648: In the end, what is the best general-purpose suite of statistics to

649: use for tree shape description? For a first statistic, the answer is

650: probably \Ic\ or \nbar. They have high resolution and are simple to

651: compute. For a second statistic, \Qone\ has the highest resolution but

652: is somewhat complex; the number of cherries and $\Ic'$ also have good

653: resolution and simple interpretations. For a third statistic the

654: statistic with the highest resolution is $\Btwo''$; however, if one is

655: interested in three statistics another good recommendation would be

656: the triple $(\Ic,\Ic',\Ic'')$ which has satisfactory resolution and

657: clear interpretation.

658:

659: \SBsubsection{Example application}

660:

661: In the introduction, we proposed that ``interpolating'' evolutionary

662: models could be used to fit any given pattern of overall imbalance. We

663: argued that this fact motivates the use of multiple tree shape

664: statistics, as a single statistic may be insufficient to distinguish

665: between trees generated by the original evolutionary model and a

666: fitted one. In this section we investigate these matters using

667: simulations and the results of the previous sections.

668:

669: The model we have chosen for this example application is Aldous'

670: ``beta-splitting'' model \citep{Aldous95,Aldous2001:23}. It is

671: a simple model with a single parameter, $\beta$, which allows

672: interpolation between the ``comb'' tree ($\beta = -2$) and the

673: maximally balanced tree ($\beta = \infty$). The ``equal rates Markov''

674: or ERM tree (i.e. the coalescent tree distribution) emerges when

675: $\beta = 0$, and the ``proportional to different arrangements'' or PDA

676: tree (i.e. the distribution on tree shapes induced by a uniform

677: distribution on labeled trees) appears when $\beta = -1.5$.

678:

679: The idea of this model is to recursively split the tips into two

680: subclades using the beta distribution. More precisely, if we assume

681: that a clade has $n$ taxa, the probability of the split being between

682: subclades of size $i$ and $n-i$ is

683: \[

684: q_{n,\beta} (i) = C(n;\beta) \frac{\Gamma(\beta+i+1) \Gamma(\beta+n-i+1)}

685: {\Gamma(i+1) \Gamma(n-i+1)}

686: \]

687: where $C(n;\beta)$ is a normalizing constant. This distribution is

688: equivalent to scattering the taxa on the unit interval and then

689: splitting with the $B(\beta+1,\beta+1)$ distribution \citep{Aldous95}.
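
The split distribution $q_{n,\beta}$ can be evaluated numerically by normalizing the gamma-function weights over $i = 1, \dots, n-1$. The Python sketch below does this in log space with \texttt{math.lgamma}; the function name is an assumption, and the normalizing constant $C(n;\beta)$ is obtained by direct summation rather than from a closed form.

```python
import math


def split_probabilities(n, beta):
    """Return [q_{n,beta}(1), ..., q_{n,beta}(n-1)] for beta > -2."""
    if n < 2 or beta <= -2:
        raise ValueError("need n >= 2 and beta > -2")
    log_w = [
        math.lgamma(beta + i + 1) + math.lgamma(beta + n - i + 1)
        - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        for i in range(1, n)
    ]
    top = max(log_w)  # subtract the maximum before exponentiating
    weights = [math.exp(v - top) for v in log_w]
    total = sum(weights)  # plays the role of 1 / C(n; beta), up to scaling
    return [w / total for w in weights]


print(split_probabilities(5, 0.0))   # ERM case: uniform over splits
print(split_probabilities(5, -1.5))  # PDA case: extreme splits favored
```

For $\beta = 0$ the gamma factors cancel and every split size is equally likely, which gives a quick sanity check on the normalization.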

690:

691: This model is easily adapted to a maximum-likelihood framework. The

692: likelihood of each tree for a given $\beta$ is the product of the

693: likelihoods of each split. We consider the likelihood of a

694: collection of trees to be the product of the likelihoods of each tree.

695: With a trick from \citet{Aldous95} one can derive a formula for

696: $C(n;\beta)$ and then find a $\beta$ which maximizes the log

697: likelihood of a collection of trees in the standard way.
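
A minimal Python sketch of such a fit is given below. It sums the log split probabilities over the internal nodes of each tree and maximizes over a grid of $\beta$ values. The nested-pair tree representation, the function names, the grid, and the numerical normalization (in place of a closed form for $C(n;\beta)$) are all assumptions for illustration. Treating left and right subclades as ordered changes the likelihood only by a factor that does not depend on $\beta$, so it does not affect the maximizer.

```python
import math


def log_split_probability(n, i, beta):
    """log q_{n,beta}(i), normalized numerically over i = 1..n-1."""
    def log_weight(j):
        return (math.lgamma(beta + j + 1) + math.lgamma(beta + n - j + 1)
                - math.lgamma(j + 1) - math.lgamma(n - j + 1))
    weights = [log_weight(j) for j in range(1, n)]
    top = max(weights)
    log_norm = top + math.log(sum(math.exp(w - top) for w in weights))
    return log_weight(i) - log_norm


def size_and_log_likelihood(tree, beta):
    """Return (leaf count, log-likelihood) of a nested-pair tree."""
    if not isinstance(tree, tuple):
        return 1, 0.0
    s_l, ll_l = size_and_log_likelihood(tree[0], beta)
    s_r, ll_r = size_and_log_likelihood(tree[1], beta)
    n = s_l + s_r
    return n, ll_l + ll_r + log_split_probability(n, s_l, beta)


def log_likelihood(trees, beta):
    """Log-likelihood of a collection of trees: sum over trees."""
    return sum(size_and_log_likelihood(t, beta)[1] for t in trees)


def fit_beta(trees, grid=None):
    """Grid-search maximum-likelihood estimate of beta (assumed grid)."""
    if grid is None:
        grid = [-1.9 + 0.1 * k for k in range(120)]  # -1.9 up to about 10
    return max(grid, key=lambda b: log_likelihood(trees, b))


comb = ((("a", "b"), "c"), "d")
balanced = (("a", "b"), ("c", "d"))
print(fit_beta([comb]), fit_beta([balanced]))
```

On these two toy trees the fitted value sits at the lower end of the grid for the comb and at the upper end for the balanced tree, as expected from the interpolation property of the model; a real fit would use a finer optimizer and many trees.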

698:

699: As an application of the above statistics we investigate the effect of

700: missing taxa on phylogenetic tree shape using simulation. We will

701: model the effect on tree shape of a sequencing strategy which is

702: common in the realm of infectious disease: sequence only those strains

703: which are significantly different from previously sequenced strains.

704: We assume that the original tree emerged from an evolutionary process

705: which has the ERM distribution on trees. We then assume that the edge

706: lengths are distributed according to a $N(1,.25)$ Gaussian

707: distribution truncated below zero. Given such a tree with $n$ leaves,

708: we then recursively delete $k$ taxa in the following manner: find the

709: pair of taxa which are closest together in terms of tree distance

710: (including edge length), and randomly delete one of them. We then

711: perform a maximum-likelihood fit as described above on those trees,

712: resulting in a $\beta$, and then generate a sample of beta-splitting

713: trees on $n-k$ leaves using this $\beta$. Which statistics can

714: distinguish between the original trees and the fitted trees?
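
The taxon deletion step can be sketched in Python as follows. The tree is stored as a dictionary mapping each non-root node to its parent and the length of the edge above it; this data structure, the function names, and the three-leaf example are hypothetical and chosen only to keep the sketch short. Generating the ERM trees and the truncated Gaussian edge lengths is not shown.

```python
import random
from itertools import combinations


def leaf_names(parent):
    """Leaves are nodes that never occur as a parent."""
    internal = {p for p, _ in parent.values()}
    return sorted(v for v in parent if v not in internal)


def ancestor_distances(parent, node):
    """Map node and each of its ancestors to the path length from node."""
    dist, out = 0.0, {node: 0.0}
    while node in parent:
        node, length = parent[node]
        dist += length
        out[node] = dist
    return out


def leaf_distance(parent, a, b):
    """Tree distance between a and b through their common ancestor."""
    from_a = ancestor_distances(parent, a)
    dist = 0.0
    while b not in from_a:
        b, length = parent[b]
        dist += length
    return from_a[b] + dist


def delete_leaf(parent, leaf):
    """Remove a leaf from a binary tree and suppress its parent node."""
    p, _ = parent.pop(leaf)
    sibling = next(v for v, (q, _) in parent.items() if q == p)
    if p in parent:  # merge the two edges around the suppressed node
        grandparent, p_len = parent.pop(p)
        parent[sibling] = (grandparent, parent[sibling][1] + p_len)
    else:            # p was the root, so the sibling becomes the root
        del parent[sibling]


def thin_taxa(parent, k, rng=random):
    """Delete k taxa, each time one of the two closest leaves at random."""
    for _ in range(k):
        a, b = min(combinations(leaf_names(parent), 2),
                   key=lambda pair: leaf_distance(parent, *pair))
        delete_leaf(parent, rng.choice((a, b)))
    return parent


tree = {"a": ("x", 0.1), "b": ("x", 0.2), "x": ("r", 1.0), "c": ("r", 5.0)}
thin_taxa(tree, 1, random.Random(0))
print(leaf_names(tree))
```

The pairwise search is quadratic in the number of leaves per deletion, which is adequate for trees of the size considered here but would need a faster closest-pair method for much larger trees.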

715:

716: We performed this simulation study with a sample size of 500, $n=100$,

717: and $k=10$. The $\beta$ value fitted to the described deletion process

718: was $-1.02$, corresponding to a decrease in balance from the $\beta =

719: 0$ original tree. We then compared statistics between 500 of the ``fitted''

720: beta-splitting trees and the original trees with deleted taxa. The

721: trees were then evaluated with the two-tailed Wilcoxon rank sum test

722: to find statistical power of each statistic to differentiate between

723: the two distributions. The results of this analysis are in Table

724: \ref{table:example}.
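
For readers who wish to reproduce this kind of comparison, a two-sided rank-sum $p$-value can be approximated as in the Python sketch below. It uses midranks for ties and the normal approximation without continuity or tie corrections; these simplifications, and the function name, are assumptions and need not match the software used for the results reported here.

```python
import math


def rank_sum_p_value(xs, ys):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    pooled = sorted((v, g) for g, sample in enumerate((xs, ys)) for v in sample)
    n1, n2 = len(xs), len(ys)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for _, g in pooled[i:j] if g == 0)
        i = j
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical statistic values for two samples of trees.
print(rank_sum_p_value(list(range(1, 11)), list(range(11, 21))))
```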

725:

726: Remarkably, the statistical power for this scenario corresponds with

727: the resolution of these statistics when \Ic\ has been projected out.

728: This makes some sense because when we fit a tree to the beta-splitting

729: model, we are primarily fitting the overall balance of the trees. We

730: recall that the four statistics with highest resolution after

731: projection were \Qone, the number of cherries, $\Ic'$, and \Itwo.

732: Three out of four of these statistics are also the most powerful for

733: our example application. Although this is an indicative

734: correspondence, one reason it is not perfect is that the resolution

735: scores trees based on overall descriptive ability and here we consider

736: statistical power to differentiate between two specific models. For

737: example, considering that cherries tend to be eliminated by the

738: described taxon deleting process, it is not surprising that the number

739: of cherries would have such high statistical power in this example

740: application. We have also included the statistics \Aone\ and \Atwo\ in

741: Table \ref{table:example} because they performed reasonably well; this

742: corresponds with their good resolution after projecting out \Ic\ as

743: shown in Table \ref{table:second_a}. It is not surprising that these

744: statistics perform better on relatively large trees. Finally, as might

745: be expected for a situation in which we have fitted the overall

746: balance of a tree to the model, the statistic \Ic\ has essentially no

747: power to distinguish between the two models.

748:

749: \begin{table}

750: \centering

751: \begin{tabular}{cccccccc}

752: & \Ic & cherries & \Itwo & \Qone & $\Ic'$ & \Aone & \Atwo \\

753: \hline

754: NM & 0.077 & 30 & 0.47 & 0.24 & 0.015 & 0.62 & 0.056 \\

755: DM & 0.076 & 29 & 0.49 & 0.27 & 0.019 & 0.51 & 0.089 \\

756: $p$ & 0.16 & 7.6e-32 & 5.1e-13 & 1.8e-07 & 4.6e-07 & 4.4e-06 & 1.1e-06 \\

757: \end{tabular}

758: \caption{Comparison of the scores for various statistics when

759:   applied to trees from two different models. ``NM'' signifies the

760:   median score of the statistic when applied to a sample of ERM

761:   trees of size 90; ``DM'' signifies the median when applied to a

762:   sample of beta-splitting trees with leaves deleted as described in

763:   the text. The last line shows the $p$-value for the two-sided Wilcoxon

764:   rank-sum test.}

765: \label{table:example}

766: \end{table}

767:

768: We argue that this simple simulation exercise further demonstrates

769: that the resolution measure can help guide the selection of good

770: general-purpose tree shape statistics. Although these statistics were

771: chosen on purely geometric grounds, they were also the most powerful

772: for this somewhat arbitrary model.

773:

774: \SBsection{Extensions}

775:

776: There are a number of limitations to this methodology which point the

777: way for future development. The first is that this application of the

778: MDS technique was to a specific model of tree space, namely that with

779: the unlabeled NNI distance. It is possible that this is not a good

780: choice. However, if another model is found which seems more

781: appropriate, it can easily be brought into the general framework

782: presented here to derive analogous results. Another angle of this

783: problem is that the resolution parameter described implicitly takes

784: the uniform distribution on trees. That is to say, trees which are

785: never seen in models or from data carry equal weight in the resolution

786: measure as trees which are common. This could decrease the utility of

787: the resolution measure, especially when considering large trees.

788: However, in the author's opinion there is no clear choice of

789: distribution. In fact, the main purpose of tree shape theory is to

790: think about what sorts of distributions are appropriate for tree

791: shape. If a clear alternative distribution is found, some

792: modifications will have to be made to the methodology to incorporate

793: this information.

794:

795: Second, this methodology offers nothing to the debate of how to

796: compare the shape of trees of different size. This is a very

797: fundamental problem which may be more philosophical than technical:

798: what does it actually mean to say that a tree of one size has a

799: similar shape to one of a different size? A common response in the

800: literature \citep{Mooers1995:379,stam02a} is to compare in one

801: way or another the shape of a given tree to a sample of trees from a

802: fixed distribution; knowing the distribution of the statistic as for

803: the number of cherries \citep{mckenzie-steel} makes this an attractive

804: option for some statistics. However, if we wish to have a descriptive

805: theory independent of perhaps over-simple models, some other method

806: will have to be found. This is clearly an interesting avenue for

807: future research.

808:

809: Third, because the number of unlabeled binary trees is very large,

810: asymptotically $O(c^n n^{-3/2})$ \citep{harding,semple-steel},

811: we have had to limit ourselves to moderately small trees. This may

812: skew the analysis in that statistics which perform poorly for small

813: trees may perform quite well for large trees; an example case might be

814: Aldous' descriptors of tree shape. One response to this objection is

815: that Figure \ref{fig:first} shows a certain level of stability as $n$

816: increases: statistics which are good for smaller $n$ appear to be good

817: for larger $n$ as well. As our understanding of this NNI tree space is

818: very limited, we cannot prove any statement of this type at this time.

819: Furthermore, although increasingly large trees are now available, the

820: analysis of trees of intermediate size is still a challenge and at

821: worst the above methodology is applicable to that case. However, we do

822: consider this to be a problem for future research.
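
To give a sense of the growth involved, the number of rooted unlabeled binary tree shapes on $n$ leaves (the Wedderburn--Etherington numbers) can be computed by a short recursion over the sizes of the two subtrees at the root. The Python sketch below assumes that these rooted shapes are the trees being counted; the function name is illustrative.

```python
def count_tree_shapes(max_leaves):
    """counts[n] = number of rooted unlabeled binary tree shapes on n leaves."""
    counts = [0, 1]  # counts[0] is unused padding; one shape on a single leaf
    for n in range(2, max_leaves + 1):
        # Unequal split sizes i < n - i: any pair of shapes is distinct.
        total = sum(counts[i] * counts[n - i] for i in range(1, (n + 1) // 2))
        if n % 2 == 0:
            # Equal split sizes: unordered pairs of shapes, repeats allowed.
            half = counts[n // 2]
            total += half * (half + 1) // 2
        counts.append(total)
    return counts


counts = count_tree_shapes(15)
print([counts[n] for n in range(7, 16)])
```

The count roughly doubles with each added leaf over this range, so distance matrices on all shapes quickly become large.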

823:

824: Fourth, multidimensional scaling with non-Euclidean data always loses

825: some information. This results from the fact that the analysis is

826: actually performed on a projection of the original distance matrix. As

827: mentioned, the NNI tree space is certainly non-Euclidean: even in the

828: innocuous-looking case of $n=6$ (see Figure~\ref{fig:tree_space}) some

829: distortion results from a Euclidean projection. The subject of how

830: much information is lost from this projection is very interesting but

831: requires a separate treatment. We will address these issues in a

832: future article.

833:

834: Fifth, edge length information is conspicuously absent in tree shape

835: analysis. Typically information about timing of speciation (or other

836: branching) events is analyzed in a completely different manner, as a

837: lineages-through-time plot, which is then used to estimate speciation

838: and extinction rates with maximum likelihood \citep{neeEA94a}. Clearly

839: any analysis of this sort eliminates topological information which may

840: aid in choosing an evolutionary model. The tree shape literature has

841: already shown that the standard birth-death process where each leaf is

842: equally likely to split or be eliminated does not construct trees

843: which seem to reflect the imbalance seen in nature; nevertheless this

844: assumption is implicit in Nee et al.'s analysis. More work is needed

845: to integrate the tree shape and timing literature.

846:

847: Finally, we come to a limitation which is fundamental to any

848: discussion of trees: with very few exceptions, trees are not actual

849: data. They are almost certainly flawed reconstructions of historical

850: events. A common response to this problem by coalescent theorists

851: trying to estimate evolutionary parameters is to simply ``integrate

852: out'' the history by performing MCMC iteration over all possible

853: histories \citep{Kuhner1995:1421}. However, we believe that there is a

854: signal in tree shape that stands out from the noise and which can

855: guide us in selection of evolutionary models. We also note that tree

856: shape has a role in understanding potential problems and biases of

857: tree reconstruction methods.

858:

859: In summary, we have developed a new method for evaluating tree shape

860: statistics, which we call the ``resolution'' of a statistic. This

861: method formalizes the intuition that a good statistic takes on similar

862: values for similar trees and different values for rather different

863: trees. It has the advantage that it can help choose a $k$th statistic

864: given that $k-1$ other statistics are already known; this opens up the

865: possibility of finding a useful suite of statistics to describe a

866: tree. We then use the method to make specific recommendations for such

867: a suite of three statistics. Finally, we compare the results of the

868: geometric analysis to two model-based tree distributions and find that

869: statistics with good resolution were also the ones which had high

870: power to distinguish the two distributions. We hope that these

871: statistics and methodology will prove useful for scientists engaged in

872: the fascinating questions emerging from macroevolution and

873: phylogenetic reconstruction. We suggest that this paper represents a

874: small step in an area which will continue to pose interesting

875: questions for years to come.

876:

877: \SBsubsubsection{Acknowledgments}

878:

879: \begin{footnotesize}

880:   The author would like to thank Akira Sasaki for asking him the

881:   question ``what is a good way to numerically describe the shape of a

882:   tree?'' two years ago, as well as David Aldous, Steve Evans, Joseph

883:   Felsenstein, Susan Holmes, Arne Mooers, Montgomery Slatkin and John

884:   Wakeley for stimulating discussion and valuable comments. F.A.M. was

885:   supported by a Graduate Research Fellowship from the National

886:   Science Foundation.

887: \end{footnotesize}

888:

889: \newpage

890: \bibliographystyle{plainnat}

891: \bibliography{/home/matsen/papers/bibtex_entries,good}

892:

893: \SBsection{Supplementary Material}

894:

895: Here I will present tables of all of the statistics, not just the ones

896: with high resolution values.

897:

898: \end{document}

899:

900: