
1: \documentclass{elsart}

2:

3: %\usepackage{showkeys}

4: \usepackage[mathlines]{lineno}

5: \usepackage{amssymb,amsmath,verbatim,graphicx}

6:

7: %\linenumbers

8:

9:

10: % macros

11: \newcommand\R{\mathbb R}

12: \newcommand\C{\mathbb C}

13: \newcommand\Z{\mathbb Z}

14:

15: \newcommand\diag{\operatorname{diag}}

16: \newcommand\im{\operatorname{Im}}

17: \newcommand\pr{\boldsymbol{\pi}_{GM}}

18: \newcommand\pI{\boldsymbol{\pi}_I}

19: \newcommand\M{\mathcal{M}}

20:

21: \newcommand\dsp{\displaystyle}

22:

23:

24: \begin{document}

25:

26: \begin{frontmatter}

27:

28: \title{Identifying evolutionary trees and substitution parameters

29: for the general Markov model with invariable sites}

30:

31:

32:

33: \thanks{This research was supported in part by the Institute for

34:   Mathematics and its Applications, with funds provided by the National

35:   Science Foundation.  We thank the IMA for its hospitality.}

36:

37: \author{Elizabeth S. Allman},

38: \ead{e.allman@uaf.edu}

39: \author{John A. Rhodes\corauthref{cor}}

40: \ead{j.rhodes@uaf.edu}

41: \corauth[cor]{Corresponding author.}

42:

43:

44: \address{Department of Mathematics and Statistics\\University of

45:   Alaska Fairbanks\\PO Box 756660\\Fairbanks, AK 99775}

46:

47:

48: \date{February 22, 2007}

49:

50:

51: \begin{abstract}

52:   The general Markov plus invariable sites (GM+I) model of biological

53:   sequence evolution is a two-class model in which an unknown

54:   proportion of sites are not allowed to change, while the remainder

55:   undergo substitutions according to a Markov process on a tree. For

56:   statistical use it is important to know if the model is

57:   identifiable: can both the tree topology and the numerical

58:   parameters be determined from a joint distribution describing

59:   sequences only at the leaves of the tree?  We establish that for

60:   generic parameters both the tree and all numerical parameter values

61:   can be recovered, up to clearly understood issues of `label

62:   swapping.'  The method of analysis is algebraic, using phylogenetic

63:   invariants to study the variety defined by the model.  Simple

64:   rational formulas, expressed in terms of determinantal ratios, are

65:   found for recovering numerical parameters describing the invariable

66:   sites.

67: \begin{keyword}

68: Phylogenetics, invariable site model, identifiability, phylogenetic

69: invariants \MSC 92D15; 14J99; 60J20

70: \end{keyword}

71:

72: \end{abstract}

73:

74: \end{frontmatter}

75:

76: \section{Introduction}\label{sec:intro}

77:

78: If a model of biological sequence evolution is to be used for

79: phylogenetic inference, it is essential that the model parameters of

80: interest --- certainly the tree parameter and usually the numerical

81: parameters

82: --- be identifiable from the joint distribution of states at the

83: leaves of the tree. Though often unstated, the assumption that model

84: parameters are identifiable underlies the use of both Maximum

85: Likelihood and Bayesian inference methods. As increasingly

86: complicated models, incorporating across-site rate variation,

87: covarion structure, or other types of mixtures, are implemented in

88: software packages, there is a real possibility that

89: non-identifiability could confound data analysis. Unfortunately, our

90: theoretical understanding of this issue lags well behind current

91: phylogenetic practice.

92:

93:

94: One natural approach to proving the identifiability of the tree

95: topology relies on the definition of a phylogenetic distance for the

96: model, and the $4$-point condition of Buneman \cite{Bun}. For

97: instance, Steel \cite{S94} used the log-det distance to establish

98: the identifiability of the tree topology under the general Markov

99: model and its submodels. Such a distance-based argument shows

100: additionally that $2$-marginalizations of the full joint

101: distribution suffice to recover the tree parameter, since distances

102: require only two-sequence comparisons. Once the tree has been

103: identified, the numerical parameters giving rise to a joint

104: distribution for the general Markov model

105: can be determined by an argument of Chang

106: \cite{MR97k:92011}.

107:

108: However, for more general mixture models and rates-across-sites

109: models no appropriate definition of a distance is known, so proving the

110: identifiability of the tree parameter requires a different approach.

111: (Though distance measures have been developed for GTR models with

112: rate-substitution \cite{GuLi96,WadSt97}, these require that one know

113: the rate distribution completely, and identifiability of the rate distribution

114: has yet to be addressed.

115: Although identifiability of the popular GTR+I+$\Gamma$ model of

116: sequence evolution was considered in \cite{Rog01}, there are gaps in

117: the argument, as was pointed out to us by An\'e \cite{AnePC}.)

118:

119: In \cite{ARidtree}, the viewpoint of algebraic geometry is used to

120: show the generic identifiability of the tree parameter for the

121: covarion model of \cite{MR1604518} and for certain mixture models with

122: a small number of classes.

123: Though this result is far more general than previous identifiability results,

124: it still fails to cover the type of rate-variation models

125: currently in common use for data analysis, and does not

126: address identifiability of numerical parameters at all.

127: Much more study of the identifiability question is needed.

128:

129:

130:

131: \smallskip

132:

133: In this paper, we focus on the \emph{general Markov plus invariable

134:  sites}, GM+I,  model of sequence evolution, a model that encompasses

135: the GTR+I model that is of more immediate interest to practitioners.

136: Note that previous work on GM+I by Baake \cite{MR1664261}

137: focused on \emph{non}-identifiability. In that paper

138: parameter choices for the $2$-state GM+I model

139: on two distinct 4-taxon trees

140: are constructed  that give rise to the same pairwise

141: joint distributions ($2$-marginals).  As both sets of parameters have

142: 50\% invariable sites, this shows that the

143: identifiability of the tree parameter cannot generally hold on the basis of

144: 2-sequence comparisons, even if the distribution of rate factors is

145: known. Furthermore, it shows a well-behaved phylogenetic distance cannot

146: be defined for this model, as existence of such a distance would imply tree

147: identifiability.

148:

149: Here we prove that all parameters for the GM+I model are indeed

150: identifiable, through $4$-sequence comparisons.  By identifiable, we

151: mean \emph{generically identifiable} in a geometric sense: For a

152: fixed tree, the set of numerical parameters for which the joint

153: distribution could have arisen from either a) a different tree, or

154: b) a `significantly different' (in a sense to be made clear later)

155: choice of numerical parameters on the same tree, is of strictly

156: lower dimension than that of the full numerical parameter space.

157: (For a concrete example of generic identifiability,

158: recall the results of Steel and Chang on the general Markov model:

159: assumptions that

160: the Markov edge matrices $M_e$ have determinant $\ne 0,1$ and that the

161: distribution of states at the root has strictly positive entries ensure

162: identifiability of all parameters. These are

163: generic conditions.) Thus for natural probability distributions on

164: the parameter space, with probability one a choice of parameters is

165: generic.

166:

167:

168:

169: \medskip

170:

171: Although identifiability of the tree parameter for GM+I follows from

172: more general results in \cite{ARidtree}, that paper did not consider

173: identifiability of numerical parameters. Our arguments here are

174: tailored to GM+I and yield stronger results

175: addressing numerical parameters as well as the tree. Our approach

176: is again based on the determination of \emph{phylogenetic

177: invariants} for the model. While the invariants described in

178: \cite{ARidtree} are invariants for more general models than GM+I,

179: the ones given in this paper apply only to GM+I and its submodels,

180: and are of much lower degree. As a byproduct of the development of

181: these GM+I invariants, we are led to rational formulas for

182: recovering all the parameters related to the invariable sites from a

183: joint distribution.  Indeed, these formulas are crucial to our

184: identification of numerical parameters.

185:

186: These formulas can be viewed as GM+I analogs of the formulas for the

187: proportion of invariable sites in group-based+I models that were

188: found by the capture-recapture argument of \cite{SHL00}.  In the

189: group-based setting, those formulas were developed into a heuristic

190: means of estimating the proportion of invariable sites from data

191: without performing a full tree inference. This has been implemented

192: in {\tt SplitsTree4} \cite{splitsTree4}. However, it remains unclear

193: whether a similar useful heuristic can be found for the formulas

194: presented in this paper.

195:

196: \smallskip

197:

198: Since our algebraic methods at times employ computational commutative

199: algebra software packages, and these tools are not commonly used

200: in the phylogenetics literature, we have included some examples of

201: code in Appendix \ref{app:code}.

202:

203:

204: \section{The GM+I Model}\label{sec:gmi}

205:

206: Let $T$ denote an \emph{$n$-taxon tree}, by which we mean a tree

207: with $n$ leaves labeled by the taxa $a_1,a_2,\dots, a_n$ and all

208: internal vertices of valence at least 3. We say $T$ is \emph{binary}

209: if all internal nodes have valence exactly 3.

210:

211: We begin by describing the parameterization of the $\kappa$-state GM+I

212: model of sequence evolution along $T$, where $\kappa=4$ corresponds to

213: usual models of DNA evolution.  The \emph{class size parameter}

214: $\delta$ denotes the probability that any particular site in a

215: sequence is invariable: conceptually, the flip of a biased coin

216: weighted by $\delta$ determines if a site is allowed to undergo state

217: transitions. If a site is invariable, it is assigned state

218: $i\in[\kappa]=\{1,2,\dots,\kappa\}$ with probability $\pi_I(i)$. Here

219: $\boldsymbol \pi_I=(\pi_I(1),\dots,\pi_I(\kappa))$ is a vector of

220: non-negative numbers summing to 1 giving the state distribution for

221: invariable sites.

222:

223: All sites that are not invariable mutate according to a common set

224: of parameters for the GM model, though independently of one another.

225: For these sites, we associate to each node (including leaves) of $T$

226: a random variable with state space $[\kappa]$.  Choosing any node

227: $r$ of $T$ to serve as a root, and directing all edges away from

228: $r$, let $T_r$ denote the resulting directed tree $T$.  A \emph{root

229: distribution} vector $\pr= (\pi_{GM}(1),\dots, \pi_{GM}(\kappa))$,

230: with non-negative entries summing to $1$, has entries $\pr(j)$

231: specifying the probability that the root variable is in state $j$.

232: For each directed edge $e = (v \to w)$ of $T_r$, let $M_e$ be a

233: $\kappa\times \kappa$ Markov matrix, so that $M_e(i,j)$ specifies

234: the conditional probability that the variable at $w$ is in state $j$

235: given that the variable at $v$ is in state $i$. Thus entries of all

236: $M_e$ are non-negative, with rows summing to 1.

237:

238: For the GM+I model on an $n$-taxon tree $T$ with edge set $E$, the

239: stochastic parameter space $S \subset {[0,1]}^N$

240: is of dimension

241: $N =1 + (\kappa-1) +(\kappa - 1) + |E|\kappa(\kappa - 1) = 2\kappa - 1 +

242: |E|\kappa(\kappa - 1).$
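As a quick check of this count, take $T$ to be binary with $n=4$ leaves, so that $|E|=2n-3=5$ (a standard fact about unrooted binary trees, used here only for illustration). Then
\begin{linenomath}
$$
N = 2\cdot 2-1+5\cdot 2\cdot 1 = 13 \quad (\kappa=2),
\qquad
N = 2\cdot 4-1+5\cdot 4\cdot 3 = 67 \quad (\kappa=4).
$$
\end{linenomath}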

243: The parameterization map giving the joint

244: distribution of the variables at the leaves of $T$ is

245: denoted by

246: \begin{linenomath}

247: \begin{align*}

248:   \phi_T: S &\longrightarrow {[0,1]}^{\kappa^n},\\

249:   \mathbf{s} &\longmapsto P.

250: \end{align*}

251: \end{linenomath}

252: We view $P$ as an $n$-dimensional $\kappa \times \dots \times

253: \kappa$ array, with dimensions corresponding to the ordered taxa

254: $a_1,a_2,\dots,a_n$, and with entries indexed by the states at the

255: leaves of $T$. The entries of $P$ are polynomial functions in the

256: parameters $\mathbf{s}$ explicitly given by

257: \begin{linenomath}

258: \begin{multline}

259:   P(i_1, \dots, i_n) =\\

260:   \delta\, \epsilon (i_1,i_2,\dots, i_n) \pI(i_1) +(1-\delta)

261:   \sum_{(j_v) \in \mathcal{H}} \left( \pr(j_r) \prod_{e} M_e(j_{v_i},

262:     j_{v_f})\right).\label{eq:Pdef}

263: \end{multline}

264: \end{linenomath}

265: Here $\epsilon(i_1,i_2,\dots, i_n)$ is 1 if all $i_j$ are equal and 0

266: otherwise, the product is taken over all edges $e=(v_i\to v_f)\in E$,

267: and the sum is taken over the set of all possible assignments of

268: states to nodes of $T$ extending the assignment $(i_1, \dots, i_n)$ to

269: the leaves: If $V$ is the set of vertices of $T$ then

270: \begin{linenomath}

271: $$\mathcal{H} = \left\{(j_v) \in [\kappa]^{|V|} \mid j_v = i_k \mbox{

272:   if } v \mbox{ is a leaf labeled by $a_k$} \right\}.

273: $$

274: \end{linenomath}

275: For notational ease, the entries of $P$, the \emph{pattern

276:   frequencies}, are also denoted by $p_{i_1 \dots i_n} = P(i_1, \dots,

277: i_n)$.

278:

279: We note that while a root $r$ was chosen for the tree in order to

280: explicitly describe the GM portion of the parameterization of our

281: model, the particular choice of $r$ is not important. Under mild

282: additional restrictions on model parameters, changing the root

283: location corresponds to a simple invertible change of variables in

284: the parameterization. (See \cite{SSH94}, \cite{AR03}, or \cite{ARgm}

285: for details.) This justifies our slight abuse of language in

286: referring to the GM or GM+I model on $T$, rather than on $T_r$, and

287: we omit future references to root location.

288:

289: Note that equation (\ref{eq:Pdef}) allows us to more succinctly

290: describe any $P\in \im(\phi_T)$ as

291: \begin{linenomath}

292: \begin{equation}P= (1-\delta) P_{GM} + \delta P_I

293: \label{eq:decomp}\end{equation}

294: \end{linenomath}

295: where $P_{GM}$ is an array in the

296: image of the GM parameterization map on $T$ and

297: $P_I=\diag(\boldsymbol \pi_I)$ is an $n$-dimensional array whose

298: off-diagonal entries are zeros and whose diagonal entries are those

299: of $\boldsymbol \pi_I$.

300:

301:

302: \section{Model Identifiability}\label{sec:modelId}

303:

304: We now make precise the various concepts of identifiability of a

305: phylogenetic model. To adapt standard statistical language to the

306: phylogenetic setting, for a fixed set $A$ of $n$ taxa and $\kappa\ge

307: 2$, consider a collection $\mathcal M$ of pairs $(T,\phi_T )$, where

308: $T$ is an $n$-taxon tree with leaf labels $A$, and $\phi_T:S_T\to

309: [0,1]^{\kappa^n}$ is a parameterization map of the joint

310: distribution of pattern frequencies for the model on $T$.  We say

311: \emph{the tree parameter is identifiable} for $\mathcal M$ if for

312: every $P\in \cup_{(T,\phi_T)\in

313:   \mathcal M} \im(\phi_T)$, there is a unique $T$ such that $P\in

314: \im(\phi_T)$. We say that \emph{numerical parameters are identifiable on a

315: tree $T$} if the map $\phi_T$ is injective, that is if for

316: every $P\in\im(\phi_T)$ there is a unique $\mathbf s\in S_T$ with

317: $\phi_T(\mathbf s)=P$. We say the \emph{model $\mathcal M$ is identifiable}

318: if the tree parameter is identifiable, and for each tree the numerical

319: parameters are identifiable.

320:

321:

322: It is well-known that such a definition of identifiability is too

323: stringent for phylogenetics. First, unless one restricts parameter

324: spaces, there is little hope that the tree parameter be identifiable:

325: One need only think of any standard model on a binary 4-taxon tree in

326: which the Markov matrix parameter on the internal edge is the identity

327: matrix. Any joint distribution arising from such a parameter choice

328: could have as well arisen from any other 4-taxon tree topology.

329:

330: Even if such `special' parameter choices are excluded so the tree

331: parameter becomes identifiable, identifiability of numerical

332: parameters also poses problems, as noted  by Chang

333: \cite{MR97k:92011}. For example, consider the 3-taxon tree with the

334: GM model. Then multiple parameter choices give rise to the same

335: joint distribution since the labeling of the states at the internal

336: node can be permuted in $\kappa!$ ways, as long as the Markov matrix

337: parameters are adjusted accordingly \cite{AR03}. The occurrence of

338: this sort of `label-swapping' non-identifiability in statistical

339: models with hidden (unobserved) variables is well-known, but is not

340: of great concern. However, even for this model more subtle forms of

341: non-identifiability can occur, in which infinitely many parameter

342: choices lead to the same joint distribution. These arise from

343: singularities in the model, and can be avoided by again restricting

344: parameter space. Such `generic' conditions for the GM model have

345: already been mentioned in the introduction.

346:

347: We therefore refine our notions of identifiability. Because we are

348: concerned primarily with models where the maps $\phi_T$ are given by

349: polynomials, we give a formulation appropriate to that setting.

350: Recall that given any collection $\mathcal F$ of polynomials

351: in $N$ variables, their common zero set,

352: \begin{linenomath}

353: $$

354: V(\mathcal F)=\{z\in \C^N \mid f(z)=0 \text{ for all } f\in \mathcal F\},

355: $$

356: \end{linenomath}

357: is the \emph{algebraic variety} defined by $\mathcal F$. If the algebraic

358: variety is a proper subset of $\C^N$, then it is said to be \emph{proper}.

359:

360:

361:

362: \begin{defn}

363: Let $\mathcal M$ be a model on a collection of $n$-taxon trees, as

364: defined above.

365: \begin{enumerate}

366: \item  We say \emph{the tree parameter is generically identifiable}

367: for $\mathcal M$ if for each tree $T$ there exists a proper

368: algebraic variety $X_T$ with the

369: property that

370: \begin{linenomath}

371: $$P\in \bigcup_{(T,\phi_T)\in \mathcal M}

372: \phi_T(S_T\smallsetminus X_T) \text{ implies } P\in

373: \phi_T(S_T\smallsetminus X_T) \text{ for a unique

374: $T$}.$$

375: \end{linenomath}

376:

377: \item We say that \emph{numerical parameters are generically

378: locally identifiable on a tree $T$} if there is a proper

379: algebraic variety $Y_T$ such that for all

380: $\mathbf s\in S_T\smallsetminus Y_T$, there is a

381: neighborhood of $\mathbf s$ on which $\phi_T$ is injective.

382:

383: \item We say the \emph{model $\mathcal M$ is generically locally

384: identifiable} if the tree parameter is generically identifiable, and

385: for each tree the numerical parameters are generically locally

386: identifiable. \end{enumerate}

387: \end{defn}

388:

389: Note that the notion of `generic' here is used to mean

390: `for all parameters but those lying on a proper

391: subvariety of the parameter space,' and such a variety

392: is necessarily of lower dimension than the full parameter

393: space. Using the standard measure

394: on the parameter space, viewed as a subset of $\R^N$,  this notion thus

395: also implies `for all

396: parameters except those in a set of measure 0.'
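For a concrete instance of such an exceptional set, consider $\kappa=2$ and write a single Markov matrix as
\begin{linenomath}
$$
M_e=\begin{pmatrix} 1-a & a\\ b & 1-b \end{pmatrix},\qquad a,b\in[0,1],
$$
\end{linenomath}
where the names $a,b$ are used only in this illustration. Then $\det M_e = (1-a)(1-b)-ab = 1-a-b$. The condition $\det M_e\ne 0$ excludes the line $a+b=1$, and the condition $\det M_e\ne 1$ excludes the line $a+b=0$, which meets the square $[0,1]^2$ only at $a=b=0$, where $M_e$ is the identity. Each excluded set is the zero set of a nonzero polynomial, hence a proper subvariety of measure 0 in the $(a,b)$-square.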

397:

398: \smallskip

399:

400: In the important special case of parameterization maps defined by

401: polynomial formulas, such as that for the GM+I model, generic local

402: identifiability of numerical parameters is equivalent to the notion in

403: algebraic geometry of the map $\phi_T$ being \emph{generically

404:   finite}. In this case, there exists a proper variety $Y_T$ and an

405: integer $k$, the degree of the map $\phi_T$, such that restricted to

406: $S_T\smallsetminus Y_T$ the map $\phi_T$ is not only locally injective

407: but also $k$-to-1: That is, if $\mathbf s\in S_T\smallsetminus Y_T$

408: and $P=\phi_T(\mathbf s)$, then the fiber $\phi^{-1}_T(P)$ has

409: cardinality $k$.

410:

411: Because of the label swapping issue at internal nodes, for the GM

412: model and GM+I on an $n$-taxon tree $T$ with vertex set $V$, fibers of

413: generic points will always have cardinality at least $(\kappa!)^{|V|-n}$, since the states at each of the $|V|-n$ internal nodes can be permuted independently.

414: Thus for these models, the best we can hope for is generic local

415: identifiability of the model (both tree and numerical parameters)

416: where the generic fiber has exactly this cardinality.  That in fact is

417: what we establish in the next section.

418:

419:

420:

421: \section{Generic Identifiability for the GM+I model}\label{sec:genericId}

422:

423: We begin our arguments by determining some phylogenetic invariants for

424: the GM+I model. The notion of a phylogenetic invariant was introduced

425: by Cavender and Felsenstein \cite{CF87} and Lake \cite{Lake87}, in the

426: hope that phylogenetic invariants might be useful for practical tree

427: inference.  Their role here, in proving identifiability, is more

428: theoretical but illustrates their value in analyzing models.

429:

430: \smallskip

431:

432: For a parameterization $\phi_T$ given by polynomial formulas on domain

433: $S_T\subseteq\R^N$, we may uniquely extend to a polynomial map with

434: domain $\C^N$, given by the same polynomial formulas, which we again

435: denote by $ \phi_T: \C^N \longrightarrow {\C}^{\kappa^n}.$

436:

437: \begin{rem}

438:   Extending parameters to include complex values is solely for

439:   mathematical convenience, as algebraic geometry provides the natural

440:   setting for our viewpoint.  The collection of stochastic joint

441:   distributions (arising from the original stochastic parameter space)

442:   is a proper subset of $\im(\phi_T)$.

443: \end{rem}

444:

445:

446: The \emph{phylogenetic variety}, $V_T$, is the smallest algebraic

447: variety in $\C^{\kappa^n}$ containing $ \phi_T(\C^N)$, \emph{i.e.},

448: the closure of the image of $\phi_T$ under the Zariski topology,

449: \begin{linenomath}

450:   $$V_T=\overline{\im(\phi_T)}\subseteq\C^{\kappa^n}.$$

451: \end{linenomath}

452:

453: \begin{rem}

454:   $V_T$ coincides with the closure of $\im(\phi_T)=\phi_T(\C^N)$ under

455:   the usual topology on $\C^{\kappa^n}.$ However, while $V_T\cap

456:   [0,1]^{\kappa^n}$ contains the closure of $\phi_T(S_T)$ under the

457:   usual topology, these need not be equal.

458: \end{rem}

459:

460:

461: Let $\C[P]$ denote the ring of polynomials in the $\kappa^{n}$

462: indeterminates $\{p_{i_1\dots i_n}\}.$ Then the collection of all

463: polynomials in $\C[P]$ vanishing on $V_T$ forms a prime ideal $I_T$.

464: We refer to $I_T$ as a \emph{phylogenetic ideal}, and its elements as

465: \emph{phylogenetic invariants}.  More explicitly, a polynomial $f\in

466: \C[P]$ is a phylogenetic invariant if, and only if, $f(P_0)=0$ for

467: every $P_0\in \phi_T(\C^{N})$, or equivalently, if, and only

468: if, $f(P_0)=0$ for every $P_0\in \phi_T(S_T)$.

469:

470:

471:

472: \medskip

473:

474: As we proceed, we consider first the special case of $4$-taxon

475: trees. We highlight the

476: $\kappa=2$ case, in part to illustrate the arguments for general

477: $\kappa$ more clearly, and in part because we can go further in understanding

478: the 2-state model.

479:

480: \medskip

481:

482: Consider the $4$-taxon binary tree $T_{ab|cd}$, with taxa $a,b,c,d$ as

483: shown in Figure \ref{fig:4taxa}.

484:

485: \begin{figure}[h]

486: \begin{center}

487: \includegraphics[height=.75in]{figv01.eps}

488: \end{center}

489: \caption{The 4-taxon tree $T_{ab|cd}$}\label{fig:4taxa}

490: \end{figure}

491: Suppose that $P$ is a $2 \times 2 \times 2

492: \times 2$  pattern frequency array,

493:  whose indices correspond to states $[2]=\{1,2\}$

494: at the taxa in alphabetical order.

495: Then the internal edge $e$ of $T$ defines the split $ab \mid

496: cd$ in the tree, and we define the \emph{edge flattening} $F_e$ of $P$

497: at $e$, a $2^2 \times 2^2$ matrix, by

498: \begin{linenomath}

499: \begin{equation}\label{eq:Flat}

500: F_e =

501: \begin{pmatrix}

502:   p_{1111} & p_{1112} & p_{1121} & p_{1122}\\

503:   p_{1211} & p_{1212} & p_{1221} & p_{1222}\\

504:   p_{2111} & p_{2112} & p_{2121} & p_{2122}\\

505:   p_{2211} & p_{2212} & p_{2221} & p_{2222}\\

506: \end{pmatrix}.

507: \end{equation}

508: \end{linenomath}

509: Notice that the rows of $F_e$ are indexed by the states at $\{ab\}$

510: and the columns by states at $\{cd\}$. The flattening $F_e$ is

511: intuitively motivated by considering a `collapsed' model induced by

512: $e$: taxa $a$ and $b$ are grouped together forming a single variable

513: $\{ab\}$ with $4$ states, and the grouping $\{cd\}$ forms a second

514: variable with $4$ states.

515:

516:

517: This construction can be generalized in a natural way: suppose $T$

518: is an $n$-taxon tree, and $P$ a $\kappa\times\dots\times\kappa$ array with

519: indices corresponding to the taxa labeling the leaves of $T$. Then

520: for any edge $e$ in $T$, we can form from $P$ the matrix $F_{e}$ of size

521: $\kappa^{n_1} \times \kappa^{n_2}$, where $n_1$ and $n_2$ are the

522: cardinalities of the two sets of taxa in the split induced by $e$.
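As a check on these sizes, note that $n_1+n_2=n$, so $F_e$ has $\kappa^{n_1}\kappa^{n_2}=\kappa^n$ entries, the same number as $P$; flattening only rearranges the entries. Two illustrative cases (chosen here as examples):
\begin{linenomath}
$$
\begin{array}{lll}
n=4,\ \kappa=2,\ \text{split } a\mid bcd: & F_e \text{ is } 2\times 2^3 = 2\times 8;\\
n=5,\ \kappa=4,\ \text{split } ab\mid cde: & F_e \text{ is } 4^2\times 4^3 = 16\times 64.
\end{array}
$$
\end{linenomath}
For the internal edge of $T_{ab|cd}$ with $\kappa=2$, the same rule gives the $2^2\times 2^2$ matrix of equation (\ref{eq:Flat}).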

523:

524:

525:

526: From \cite{ARgm} (for a more expository presentation, see also

527: \cite{ARnme}), we have:

528:

529:

530: \begin{thm} \label{thm:GM}

531:   For the $2$-state GM model on a binary $n$-taxon

532:   tree $T$, the phylogenetic ideal $I_T$ is generated by all $3\times 3$

533:   minors of all edge flattenings $F_e$ of $P$.

534:   Moreover, for the $\kappa$-state GM model on an $n$-taxon tree $T$,

535:   the phylogenetic ideal

536:   $I_T$ contains all

537:   $(\kappa+1) \times (\kappa+1)$ minors of all edge flattenings

538:   of $P$.

539: \end{thm}

540:

541: Using this result, we can deduce some

542: elements of the phylogenetic ideal for

543: the GM+I model for any number of taxa $n \ge 4$ and any number of

544: states $\kappa \ge 2$.

545:

546: \begin{prop}\label{prop:invariants} (Phylogenetic Invariants for GM+I)

547: \begin{enumerate}

548: \item \label{prop:inv:item1}

549: For the $4$-taxon tree $T_{ab|cd}$ and

550: the $2$-state GM+I model,

551: the cubic determinantal polynomials

552: \begin{linenomath}

553:   $$

554:   f_1=\left |\begin{matrix}

555:       p_{1112} & p_{1121} & p_{1122}\\

556:       p_{1212} & p_{1221} & p_{1222}\\

557:       p_{2112} & p_{2121} & p_{2122}\\

558: \end{matrix}\right |

559: \mbox{ and } f_2=\left |\begin{matrix}

560:     p_{1211} & p_{1212} & p_{1221}\\

561:     p_{2111} & p_{2112} & p_{2121}\\

562:     p_{2211} & p_{2212} & p_{2221}

563: \end{matrix}\right |

564: $$

565: \end{linenomath}

566: are phylogenetic invariants. These are the two $3\times 3$ minors of

567: the matrix flattening $F_{ab \mid cd}$ of equation (\ref{eq:Flat}) that do not

568: involve either of the entries $p_{1111}$ or $p_{2222}$.

569:

570: \item More generally, for $n\ge 4$ and $\kappa\ge 2$, consider the

571:   $\kappa$-state GM+I model on an $n$-taxon tree $T$. Then for each

572:   edge $e$ of $T$, all $(\kappa+1)\times (\kappa+1)$ minors of the

573:   flattening $F_e$ of $P$ that avoid all entries $p_{ii\dots i}$,

574:   $i\in[\kappa]$ are phylogenetic invariants.

575: \end{enumerate}

576: \end{prop}
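To see why exactly two minors occur in item \ref{prop:inv:item1}, recall that a $3\times 3$ minor of the $4\times 4$ matrix $F_e$ is obtained by deleting one row $r$ and one column $c$, so there are $16$ such minors. The entry $p_{1111}$ lies in position $(1,1)$ and $p_{2222}$ in position $(4,4)$. Both are avoided precisely when
\begin{linenomath}
$$
(r=1 \text{ or } c=1) \text{ and } (r=4 \text{ or } c=4),
\quad\text{that is,}\quad (r,c)\in\{(4,1),\,(1,4)\}.
$$
\end{linenomath}
The choice $(r,c)=(4,1)$ keeps rows $1$--$3$ and columns $2$--$4$, giving $f_1$; the choice $(r,c)=(1,4)$ keeps rows $2$--$4$ and columns $1$--$3$, giving $f_2$.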

577:

578: \begin{pf}  We prove the first statement in detail.

579: From equation (\ref{eq:decomp}),

580: for

581: any $P=\phi_T(\mathbf s)$ we have $P=(1-\delta)P_{GM}+\delta

582:  P_I$, where $P_{GM}$ is a 4-dimensional table arising from the GM

583:   model on $T$ and $P_I=\diag(\boldsymbol \pi_I)$ is a diagonal table with entries

584:   giving the distribution of states for the invariable sites.

585:   Flattening these tables with respect to the internal edge of the

586:   tree, we obtain

587: \begin{linenomath}

588: \begin{align}F_{ab \mid cd} &= (1-\delta) F_{GM} + \delta F_I\notag \\

589:   &=(1 - \delta)

590: \begin{pmatrix}

591:   \tilde p_{1111} & \tilde p_{1112} & \tilde p_{1121} & \tilde p_{1122}\\

592:   \tilde p_{1211} & \tilde p_{1212} & \tilde p_{1221} & \tilde p_{1222}\\

593:   \tilde p_{2111} & \tilde p_{2112} & \tilde p_{2121} & \tilde p_{2122}\\

594:   \tilde p_{2211} & \tilde p_{2212} & \tilde p_{2221} & \tilde

595:   p_{2222}

596: \end{pmatrix}

597: +

598: \delta \begin{pmatrix}

599:   \pi_I(1) &\ 0\ &\  0\ & 0 \\

600:   0 & 0 & 0 & 0\\

601:   0 & 0 & 0 & 0\\

602:   0 & 0 & 0 & \pi_I(2) \\

603: \end{pmatrix}.\label{eq:Psum}

604: \end{align}

605: \end{linenomath}

606: By Theorem \ref{thm:GM}, all $3\times 3$ minors of $F_{GM}$ vanish.

607: Since the `upper right' and `lower left' minors of $F_{ab|cd}$ are the

608: same as those of $F_{GM}$, up to a factor of $(1-\delta)^3$, they also

609: vanish.

610:

611: Straightforward

612: modifications to this argument give the general case.\hfill\qed

613: \end{pf}
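A sketch of how the general case can be organized, under the notation above: for any edge $e$ of an $n$-taxon tree, flattening equation (\ref{eq:decomp}) gives $F_e=(1-\delta)F_{GM}+\delta F_I$, where the only nonzero entries of $F_I$ are the values $\pi_I(i)$, located in the positions occupied by $p_{ii\dots i}$, $i\in[\kappa]$. If a $(\kappa+1)\times(\kappa+1)$ submatrix $A$ of $F_e$ avoids all of these positions, then $A=(1-\delta)A_{GM}$ for the corresponding submatrix $A_{GM}$ of $F_{GM}$, and so
\begin{linenomath}
$$
\det A = (1-\delta)^{\kappa+1}\det A_{GM} = 0,
$$
\end{linenomath}
where the last equality is Theorem \ref{thm:GM}. For $\kappa=2$ this recovers the factor $(1-\delta)^3$ appearing in the proof.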

614:

615:

616: For arbitrary $n,\kappa$, the GM+I model should have many

617: invariants other than those found here.

618: Among these is, of course, the stochastic invariant

619: \begin{linenomath}

620: $$f_s(P)=1-\sum_{\mathbf i\in [\kappa]^n} p_{\mathbf i}.$$

621: \end{linenomath}

622:

623: In the simplest interesting case of the GM+I model, however, we

624: have the following computational result.

625:

626: \begin{prop}\label{prop:invariantsK2} The phylogenetic ideal

627:   for the $2$-state GM+I model on the $4$-taxon tree $T_{ab|cd}$ of Figure

628:   \ref{fig:4taxa} is generated by $f_s$ and

629: the minors $f_1$, $f_2$ above;

630: \begin{linenomath}

631: $$I_T  = \langle f_s, f_1, f_2 \rangle.$$

632: \end{linenomath}

633: \end{prop}

634:

635: \begin{pf}

636:   A computation of the Jacobian of the parameterization

637: $\phi_T: S \subset \C^{13} \to \C^{2^4}$ shows it has full rank at

638: some points, and so $V_T$ is of dimension 13.  If $I =

639: \langle f_s,f_1, f_2 \rangle$, then $I \subseteq I_T$.  Another computation

640: shows that $I$ is prime and of

641:   dimension $13$.  Thus, necessarily $I =

642:   I_T$.  (The code for these computations is given in Appendix

643:   \ref{app:code}.)\hfill\qed

644: \end{pf}
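A heuristic check on the dimension count in this proof: $I$ is generated by three polynomials in the $2^4=16$ indeterminates $p_{ijkl}$, so by Krull's height theorem every irreducible component of $V(I)$ has dimension at least
\begin{linenomath}
$$
16-3=13,
$$
\end{linenomath}
which agrees both with the computed dimension of $I$ and with the parameter count $N=13$ for $\kappa=2$ and $|E|=5$. Since $V_T\subseteq V(I)$ and both are irreducible of dimension $13$, they must coincide, which is the content of the final step.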

645:

646:

647: Let $V_{ab | cd}$, $V_{ac | bd}$, $V_{ad | bc}$ be the varieties for

648: the $2$-state GM+I models for the three $4$-taxon binary tree topologies, with

649: corresponding phylogenetic ideals $I_{ab \mid cd}$, $I_{ac \mid bd}$,

650: $I_{ad \mid bc}$.

651: Of course Proposition \ref{prop:invariantsK2}

652: gives generators for each of these ideals --- two $3

653: \times 3$ minors of the flattenings of $P$ appropriate to those tree

654: topologies, along with $f_s$. A computation (see Appendix \ref{app:code}) shows

655: that these three ideals are distinct. Therefore the three varieties

656: are distinct, and their pairwise intersections are proper

657: subvarieties. Thus for any parameters $\mathbf s$

658: not lying in the inverse image

659: of these subvarieties, $T$ is uniquely determined from $\phi_T(\mathbf s)$.

660: Hence we obtain

661:

662: \begin{cor} \label{cor:identTreeN4k2}

663:   For the $2$-state GM+I model on binary 4-taxon trees, the tree parameter is

664:   generically identifiable.

665: \end{cor}

666:

667: As $\dim(V_{ab|cd})=13$, and the parameter space for $\phi_T$ is 13

668: dimensional, we also immediately obtain that the map $\phi_T$ is

669: generically finite. This yields

670:

671: \begin{cor} \label{cor:idenNumN4k2}

672: For the $2$-state GM+I model on a binary 4-taxon tree,

673: numerical parameters are generically locally

674: identifiable.

675: \end{cor}

676:

677: Note that this approach does not yield the cardinality of the

678: generic fiber of the parameterization map, which is also of

679: interest. We will return to this issue in Theorem

680: \ref{thm:genericIdent}.

681:

682: \medskip

683:

684: Further computations show that

685: $\dim(V_{ab|cd} \cap V_{ac|bd}\cap V_{ad|bc})=11$. As this

686: intersection contains all points arising from the GM+I

687: model on the 4-taxon star tree, which is an 11-parameter model, this

688: is not surprising. In fact, one can verify computationally

689: that the ideal $I_{ab|cd}+I_{ac|bd}+I_{ad|bc}$ is the defining

690: prime ideal of the star-tree variety.

691: We also note that the ideal $I_{ab|cd} + I_{ac|bd}$

692: decomposes into two primes, both of dimension 11. Thus the variety

693: defined by this ideal has two components, one of which is the variety

694: for the star tree.
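
As a consistency check, the dimensions above agree with a direct
parameter count, which we sketch here. For the $2$-state GM+I model the
root distribution contributes $1$ free parameter, each $2\times 2$ Markov
matrix contributes $2$, and the invariable site parameters $\delta$ and
$\boldsymbol \pi_I$ contribute $1$ each. A binary $4$-taxon tree has $5$
edges and the $4$-taxon star tree has $4$, so
\begin{linenomath}
$$\underbrace{1}_{\text{root}}+\underbrace{5\cdot 2}_{\text{edges}}
+\underbrace{1}_{\delta}+\underbrace{1}_{\boldsymbol \pi_I}=13,
\qquad
1+4\cdot 2+1+1=11.$$
\end{linenomath}
The first count matches the $13$ variables with respect to which the
Jacobian is computed in Appendix \ref{app:code}. It is also consistent
with $I_T$ having three generators in a ring with $2^4=16$
indeterminates, since $16-3=13$.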

695:

696: \medskip

697:

698: In principle, the ideal $I_T$ of all invariants for the GM+I model

699: on an arbitrary tree $T$ can be computed from the parameterization

700: map $\phi_T$ via an elimination of variables using Gr\"obner bases

701: \cite{MR2001c:92009}. However, if all invariants for the

702: $\kappa$-state GM model on $T$ are known, they can provide an

703: alternate approach to finding $I_T$ which, while still proceeding by

704: elimination, should be less computationally demanding.

705:

706: To present this most simply, we note that because

707: our varieties lie in the hyperplane described by the stochastic invariant,

708: it is natural to consider their projectivizations,

709: lying in $\mathbb P^{\kappa^n-1}$ rather than $\C^{\kappa^n}$. The

710: corresponding phylogenetic ideals, which we denote by $J_T$,

711: are generated by the homogeneous polynomials in $I_T$, and do not contain the

712: stochastic invariant. Conversely, $I_T$ is generated by the elements of

713: $J_T$ together with the stochastic invariant.

714:

715: In addition, we need

716: not restrict ourselves to the GM model, but rather deal with any

717: phylogenetic model parameterized by polynomials.

718:

719: \begin{prop} \label{prop:elim}

720: Suppose $\widetilde \phi_T:\C^N\to \C^{\kappa^n}$ is a

721: parameterization map for some phylogenetic model $\mathcal M$ on

722: $T$, with corresponding homogeneous phylogenetic ideal $\widetilde

723: J_T$. Let

724: \begin{linenomath}

725: $$\phi_T:\C^{N}\times\C^\kappa\to

726:  \C^{\kappa^n}$$

727: \end{linenomath}

728: be the parametrization map for the $\mathcal M$+I model

729: given by

730: \begin{linenomath}

731: $$\phi_T(\mathbf s,(\delta,\boldsymbol \pi_I))=(1-\delta)

732: \widetilde \phi_T(\mathbf s)+\delta \diag(\boldsymbol \pi_I).$$

733: \end{linenomath}

734: Let

735: $P'$ denote the collection of all indeterminate entries of $P$

736: except those in $P_{eq}=\{p_{ii\dots i}\mid i\in[\kappa]\}$. Then

737: the homogeneous phylogenetic ideal $J_T$ for the $\mathcal M$+I

738: model on $T$ is $J_T=\left (\widetilde J_T\cap \C[P'] \right)\C[P].$

739: Thus $J_T$ can be computed from $\widetilde J_T$ by elimination of

740: the variables in $P_{eq}$.

741: \end{prop}

742:

743: \begin{pf}

744: Extend the parameterization maps $\widetilde \phi_T, \phi_T$ to

745: parameterizations of cones by introducing an additional parameter,

746: \begin{linenomath}

747: $$ \widetilde \Phi_T(\mathbf s,t)=t\,\widetilde\phi_T(\mathbf s)$$

748: $$

749: \Phi_T(\mathbf s,(\delta,\boldsymbol \pi_I ),t)

750: =t\,\phi_T(\mathbf s,(\delta,\boldsymbol \pi_I))$$

751: \end{linenomath}

752: Then

753: $\im(\Phi_T)=\C^\kappa\times\operatorname{proj}(\im(\widetilde \Phi_T)),$

754: where $\C^\kappa$ corresponds to coordinates in $P_{eq}$ and

755: `$\operatorname{proj}$' denotes

756: the projection map from $P$-coordinates to $P'$-coordinates. As $J_T$

757: is the ideal of polynomials vanishing on $\im(\Phi_T)$, and

758: $\widetilde J_T\cap \C[P']$ the ideal

759: vanishing on $\operatorname{proj}(\im(\widetilde \Phi_T))$, the result follows.

760: \hfill\qed

761: \end{pf}

762:

763: Using this, in the appendix we give an alternate computation to show

764: both part (\ref{prop:inv:item1}) of Proposition

765: \ref{prop:invariants}, and Proposition \ref{prop:invariantsK2}.

766: While this computation is quite fast, a more naive attempt to find

767: GM+I invariants directly from the full parameterization map using

768: elimination was unsuccessful, demonstrating the utility of the

769: proposition.

770: Moreover, we can use this proposition to compute all 2-state GM+I invariants

771: on the 5-taxon binary tree as well. This leads us to

772:

773: \begin{conj} On an $n$-taxon binary tree, the ideal of homogeneous

774: invariants for the 2-state

775: GM+I model is generated by those $3\times 3$

776: minors of edge flattenings

777: that do not involve the variables $p_{11\dots1}$ and $p_{22\dots2}$,

778: together with the

779: stochastic invariant.

780: \end{conj}
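
To illustrate Proposition \ref{prop:elim} in the smallest case, take
$\kappa=2$ and $T=T_{ab|cd}$, so that $P_{eq}=\{p_{1111},p_{2222}\}$.
The following remarks are only a sketch of what the elimination must
produce; they rely on the $2$-state GM ideal being generated by the
$3\times 3$ minors of $F_{ab|cd}$, as is used in Appendix \ref{app:code}.
Order the rows and columns of the $4\times 4$ matrix $F_{ab|cd}$ as
$11,12,21,22$. Each $3\times 3$ minor arises by deleting one row and one
column, so there are
\begin{linenomath}
$$\binom{4}{3}^2=16$$
\end{linenomath}
of them. Such a minor avoids $p_{1111}$ only if the first row or first
column is deleted, and avoids $p_{2222}$ only if the last row or last
column is deleted. Hence exactly two of the $16$ minors lie in $\C[P']$:
the `upper right' minor (delete the last row and first column) and the
`lower left' minor (delete the first row and last column). These are the
determinants of the submatrices \texttt{UR} and \texttt{LL} in the
Singular code of Appendix \ref{app:code}. A priori the elimination ideal
might contain further elements, such as combinations of the other minors
in which $p_{1111}$ and $p_{2222}$ cancel, so a Gr\"obner basis
computation is still needed; that nothing more arises in this case is
consistent with Proposition \ref{prop:invariantsK2}.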

781:

782: \medskip

783:

784:

785: Although we are unable to determine all GM+I invariants for the

786: 4-taxon tree for general $\kappa$, using only those described in

787: Proposition \ref{prop:invariants} we can still obtain

788: identifiability results through a modified argument.

789:

790:

791: \begin{prop}\label{prop:treeId} For the $\kappa$-state GM+I model on

792: binary 4-taxon trees, $\kappa\ge 2$, the tree parameter is

793: generically identifiable.

794: \end{prop}

795:

796: \begin{pf} By the argument leading to Corollary \ref{cor:identTreeN4k2},

797: it is enough to show the varieties $V_{ab \mid cd}$, $V_{ac

798:     \mid bd}$, and $V_{ad|bc}$ are distinct.

799: Considering, for example, the first two, we can

800: show that the varieties $V_{ab \mid cd}$ and $V_{ac

801:     \mid bd}$ are distinct, by giving an invariant $f \in I_{ac \mid

802:     bd}$  and a point $P_0\in V_{ab|cd}$

803: such that $f(P_0)\ne 0$.

804:

805: Using Proposition \ref{prop:invariants}, we pick an

806:   invariant $f \in I_{ac \mid bd}$ as follows: In the flattening

807: $F_{ac|bd}$ according to the split $ac|bd$, choose any collection

808: of $\kappa+1$ $ac$-indices with distinct $a$ and $c$ states, \emph{e.g.},

809: $\{12,13,\dots,1\kappa,21,23\}$. Using the same set as $bd$-indices,

810: this determines a $(\kappa+1)\times(\kappa+1)$-minor $f$.

811:

812: We pick $P_0=\phi_{T_{ab|cd}}(\mathbf s)$ using the parameterization

813: of equation (\ref{eq:Pdef}) by making a specific choice of parameters

814: $\mathbf s$. On $T_{ab|cd}$, with the root $r$ located at one of the

815: internal nodes, choose parameters $\mathbf{s}$ as follows: Let

816: $\pr$, $\pI$ be arbitrary but with all entries of $\pr$ positive.

817: Pick any $\delta \in [0,1)$. For the four terminal edges choose

818: $M_e$ to be the $\kappa \times \kappa$ identity matrix

819: $I_\kappa$. For the single internal edge $e$ of $T$, choose any

820: Markov matrix $M_{e}$ with all positive entries. For such

821: parameters, the entries of the joint distribution $P_0 =

822: \phi_{T_{ab|cd}} (\mathbf{s})$ are zero except for the pattern

823: frequencies $p_{iijj}$, where the states at the leaves $a$ and $b$

824: agree and the states at the leaves $c$ and $d$ agree.  Since the

825: entries of $M_{e}$ and the root distributions are positive, each of

826: the $p_{iijj} > 0$.

827:

828: But considering the flattening $F_{ac \mid bd}$ of

829: $P_0=\phi_{T_{ab|cd}} (\mathbf{s})$ with respect to the `wrong'

830: topology $T_{ac \mid bd}$, we observe that the $\kappa^2$ non-zero

831: entries $p_{iijj}$ of $F_{ac \mid bd}$ all lie on the diagonal of

832: $F_{ac \mid bd}$, in the positions with $ij$ as both $ac$-index and

833: $bd$-index. Furthermore, by our choice of $f$, a subset of them

834: forms the diagonal of the submatrix whose determinant is $f$.

835: Therefore $f(P_0)\ne 0$.\hfill\qed

836: \end{pf}
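
To make the choices of $f$ and $P_0$ in this proof concrete, consider
$\kappa=3$ with the index set $\{12,13,21,23\}$. For this illustration
only, we assume the root is the internal node adjacent to $a$ and $b$,
with the internal edge $e$ directed away from it, and we write
$\pi_{GM}(i)$ and $M_e(i,j)$ for the entries of $\pr$ and $M_e$. Then for
$i\ne j$ the invariable sites contribute nothing to $p_{iijj}$, and
\begin{linenomath}
$$p_{iijj}=(1-\delta)\,\pi_{GM}(i)\,M_e(i,j)>0.$$
\end{linenomath}
The entry of $F_{ac|bd}$ with $ac$-index $ij$ and $bd$-index $kl$ is
$p_{ikjl}$, which vanishes at $P_0$ unless $i=k$ and $j=l$. The
$4\times 4$ submatrix defining $f$ is therefore diagonal at $P_0$, and
\begin{linenomath}
\begin{align*}
f(P_0)&=p_{1122}\,p_{1133}\,p_{2211}\,p_{2233}\\
&=(1-\delta)^4\,\pi_{GM}(1)^2\,\pi_{GM}(2)^2\,
M_e(1,2)\,M_e(1,3)\,M_e(2,1)\,M_e(2,3)\ne 0.
\end{align*}
\end{linenomath}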

837:

838: \begin{prop}(Recovery of invariable site parameters)\label{prop:idformulas}

839: \begin{enumerate}

840: \item For the 4-taxon tree $T_{ab|cd}$ and the 2-state GM+I model, suppose

841: $P=\phi_T(\mathbf s)$. Then generically the parameters in

842: $\mathbf s$ related to invariable sites can be recovered from $P$ by

843: the following formulas:

844: \begin{linenomath}

845: $$\delta=\frac {|A_1|+|A_2|}{|B|},\ \  \boldsymbol \pi_I=\frac 1{|A_1|+|A_2|} \left (

846: |A_1|,|A_2|\right ),$$ where $B=\begin{pmatrix}

847:     p_{1212} & p_{1221} \\

848:     p_{2112} & p_{2121}

849: \end{pmatrix}$,

850: $$A_1=\begin{pmatrix}

851:       p_{1111} & p_{1112} & p_{1121}\\

852:       p_{1211} & p_{1212} & p_{1221}\\

853:       p_{2111} & p_{2112} & p_{2121}\\

854: \end{pmatrix}, \ \

855: A_2=\begin{pmatrix}

856:     p_{1212} & p_{1221} & p_{1222}\\

857:     p_{2112} & p_{2121} & p_{2122}\\

858:     p_{2212} & p_{2221} & p_{2222}

859: \end{pmatrix}.

860: $$

861: \end{linenomath}

862: \item More generally, for the $\kappa$-state GM+I model on $T_{ab|cd}$,

863: the invariable site parameters can be recovered

864: from a generic point in the image of the parameterization map by

865: rational formulas of the form

866: \begin{linenomath}

867: $$\delta=\frac

868: {\sum_{i\in[\kappa]}|A_i|}{|B|}, \ \ \boldsymbol \pi_I=\frac

869: 1{\sum_{i\in[\kappa]} |A_i|} \left ( |A_1|,|A_2|,\dots, |A_\kappa|\right

870: ).$$

871: \end{linenomath}

872: Here $|B|$ is any $\kappa \times \kappa$ minor of $F_{ab|cd}$

873: that omits all rows and columns indexed by $ii$, and $|A_i|$ is

874: the $(\kappa+1)\times(\kappa+1)$ minor obtained by including all

875: rows and columns chosen for $B$ and in addition the $ii$ row and

876: $ii$ column.

877: \end{enumerate}

878: \end{prop}

879:

880: \begin{pf} We

881:   give the complete argument in the case $\kappa = 2$ first.  For a

882:   joint distribution $P \in \im(\phi_T)$, write $F_{ab \mid cd} =

883:   (1-\delta) F_{GM} + \delta F_I$ as in equation (\ref{eq:Psum}).  Since

884: $A_1$ is the `upper left'  $3 \times 3$ submatrix of $F_{ab \mid

885: cd}$, using linearity properties of the determinant, and that all $3

886: \times 3$ minors of $F_{GM}$ evaluate to zero, we observe that

887: \begin{linenomath}

888: \begin{align*}

889:   \vert A_1 \vert

890: &=(1 - \delta)^3 \left|

891: \begin{matrix}

892:   \tilde p_{1111} & \tilde p_{1112} & \tilde p_{1121} \\

893:   \tilde p_{1211} & \tilde p_{1212} & \tilde p_{1221} \\

894:   \tilde p_{2111} & \tilde p_{2112} & \tilde p_{2121} \\

895: \end{matrix}

896: \right| + \left|

897: \begin{matrix}

898:   \delta \pi_I(1) & 0 &\ 0 \\

899:   0 & (1-\delta) \tilde p_{1212} &\ (1-\delta) \tilde p_{1221} \\

900:   0 & (1-\delta) \tilde p_{2112} &\ (1-\delta) \tilde p_{2121} \\

901: \end{matrix}\right|\\

902: \\

903: &= \delta \pi_I(1) \left|

904: \begin{matrix}

905:   (1-\delta)\tilde p_{1212} &\  (1-\delta)\tilde p_{1221} \\

906:  (1-\delta) \tilde p_{2112} &\ (1-\delta )\tilde p_{2121} \\

907: \end{matrix}\right|.

908: \end{align*}

909: \end{linenomath}

910: Thus we have $\vert A_1 \vert = \delta \pi_I(1) \vert B \vert$. Now,

911: if $\vert B \vert \neq 0$, then

912: \begin{linenomath}

913: $$

914: \delta \pi_I(1) = \frac{\vert A_1 \vert}{\vert B \vert}.

915: $$

916: \end{linenomath}

917: As $|B|$ does not vanish on all of $V_T$, we have a rational formula

918: to compute $\delta \pi_I(1)$ for generic points on $V_T$.

919:

920: Similarly, since $A_2$ is the `lower right' submatrix of $F_{ab \mid

921: cd}$, then

922: \begin{linenomath}

923: $$\delta \pi_I(2) =

924: \frac{\vert A_2 \vert}{\vert B \vert}.

925: $$

926: \end{linenomath}

927: Adding these together, we obtain the stated rational expression for

928: $\delta$.

929:

930:

931: Assuming additionally the generic condition that $\delta \neq 0$, we find

932: \begin{linenomath}

933: $$\boldsymbol \pi_I =\left ( \frac{\vert A_1

934:   \vert}{\vert A_1\vert + \vert A_2 \vert},

935: \frac{\vert A_2

936:   \vert}{\vert A_1 \vert + \vert A_2 \vert}\right ).$$

937: \end{linenomath}

938: Thus the parameters $\delta, \boldsymbol \pi_I$ are

939: generically identifiable for GM+I on $T$.

940:

941: One readily sees the argument above can be modified for arbitrary

942: $\kappa$.\hfill\qed

943: \end{pf}
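
For the reader's convenience, we sketch one way the modification for
arbitrary $\kappa$ might go. The notation $\widetilde A_i$,
$\widetilde B$ for the submatrices of $F_{GM}$ in the same positions as
$A_i$, $B$ is introduced only for this sketch. Order the rows and columns
of $A_i$ so that those indexed by $ii$ come first. Since $B$ omits every
row and column indexed by some $jj$, the only entry of $A_i$ to which
$F_I$ contributes is the $(ii,ii)$ entry, and $B=(1-\delta)\widetilde B$.
Linearity of the determinant in the first column, together with the
vanishing of all $(\kappa+1)\times(\kappa+1)$ minors of $F_{GM}$, then
gives
\begin{linenomath}
\begin{align*}
|A_i|&=(1-\delta)^{\kappa+1}\,|\widetilde A_i|
+\delta\pi_I(i)\,\bigl|(1-\delta)\widetilde B\bigr|\\
&=\delta\pi_I(i)\,|B|.
\end{align*}
\end{linenomath}
Summing over $i\in[\kappa]$ and using $\sum_{i}\pi_I(i)=1$ yields
$\sum_{i}|A_i|=\delta\,|B|$, from which the stated formulas follow
whenever $|B|\ne 0$ and $\delta\ne 0$.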

944:

945: Note that when $\kappa>2$ the above proposition gives many

946: alternative rational formulas for the invariable site parameters, as

947: there are many options for choosing the matrix $B$.

948:

949:

950: We now obtain our main result.

951:

952: \begin{thm}\label{thm:genericIdent}

953: The $\kappa$-state GM+I model on $n$-taxon binary trees, with $n\ge

954: 4$, $\kappa \ge 2$, is generically locally identifiable.

955: Furthermore, for an $n$-taxon tree with vertex set $V$, the fibers of

956: generic points of $V_T$ under the parametrization map have

957: cardinality $(\kappa!)^{|V|-n}$. Thus for generic points, label

958: swapping at internal nodes is the only source of

959: non-identifiability.

960: \end{thm}

961:

962: \begin{pf} Suppose $T$ is an $n$-taxon tree with $P=\phi_T(\mathbf

963: s)$. Choose some subset of 4 taxa, say $\{a,b,c,d\}$, and suppose

964: the induced quartet tree is $T_{ab|cd}$. Then $P_{abcd}$, the

965: 4-marginalization of $P$, is easily seen to be of the form

966: $P_{abcd}=\phi_{T_{ab|cd}}(\mathbf s_{abcd})$ where $\mathbf

967: s_{abcd}=g(\mathbf s)$ and $g$ is a surjective polynomial function.

968: But the tree

969: $T_{ab|cd}$ is generically identifiable by Proposition

970: \ref{prop:treeId}, and thus invariable site parameters in $\mathbf s_{abcd}$

971: are generically identifiable by Proposition \ref{prop:idformulas}.

972: As these coincide with the invariable site parameters in $\mathbf

973: s$, and generic conditions on $\mathbf s_{abcd}$ imply generic

974: conditions on $\mathbf s$, the invariable site parameters are

975: generically identifiable for the full $n$-taxon model.

976:

977: As an $n$-taxon binary tree topology is determined

978: by the collection of all induced quartet tree topologies, one can now see

979: that $T$ is generically identifiable. Alternately,

980: using the identified invariable site parameters,

981: and assuming the additional

982: generic condition that $\delta\ne 1$, note that

983: \begin{linenomath}

984: $$P_{GM} =

985: \frac{1}{(1-\delta)} \left( P - \delta P_I\right)

986: $$

987: \end{linenomath}

988: is a joint

989: distribution arising from general Markov parameters. Thus generic

990: identifiability of the tree can also be obtained from

991: Steel's  result for the GM model \cite{S94} applied to $P_{GM}$.

992:

993:

994: The generic identifiability of the remaining numerical parameters follows

995: from Chang's argument \cite{MR97k:92011} applied to $P_{GM}$.

996: Chang's approach also indicates the cardinality of the generic fiber is

997: $(\kappa!)^{|V|-n}$ due to the label swapping phenomenon.\hfill\qed

998: \end{pf}

999:

1000:

1001:

1002:

1003: \section{Estimating Invariable Sites Parameters}\label{sec:estInv}

1004:

1005: The concrete result in Proposition \ref{prop:idformulas} gives

1006: explicit rational formulas for recovering parameters relating to

1007: invariable sites from the joint distribution. These can be viewed as

1008: generalizations of the formulas found in \cite{SHL00} for

1009: group-based models. As \cite{SHL00} develops the group-based model

1010: formulas into a heuristic means of estimating the invariable site

1011: parameters from data without performing a full Maximum Likelihood

1012: fit of data to a tree under an $\mathcal M$+I model, one might

1013: suspect the formulas of Proposition \ref{prop:idformulas} could be

1014: used similarly without the need to assume $\mathcal M$ was

1015: group-based, or approximately group-based.

1016:  We emphasize that however useful such an

1017: estimate might be, it would not be intended to replace a more

1018: statistical but time-consuming computation, such as obtaining the

1019: Maximum Likelihood estimates for these parameters.

1020:

1021: However, it is by no means obvious how to use these formulas well

1022: even for a heuristic estimate. First, for a 4-taxon tree

1023: we have many choices for the

1024: matrix $B$, in fact

1025: \begin{linenomath}

1026: $$\binom{\kappa^2-\kappa}{\kappa}^2$$

1027: \end{linenomath}

1028: of them, so even for $\kappa=4$, there are 245,025 basic sets of the

1029: formulae. Moreover, while these simple formulae

1030: emerged from our method of proof, one could in fact modify them by

1031: adding to any of them a rational function whose numerator is a

1032: phylogenetic invariant for the GM+I model, and whose denominator is

1033: not. Since the invariant vanishes on any joint distribution arising

1034: from the model, the resulting formulae will still recover invariable

1035: site information for generic parameters. Thus there are actually

1036: infinitely many formulas for recovering invariable site parameters.
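
The count of basic formulas above can be checked by a short
computation, included here only as an illustration. The matrix $B$ is a
$\kappa\times\kappa$ submatrix of $F_{ab|cd}$ whose rows are chosen from
the $\kappa^2-\kappa$ indices $ab$ with $a\ne b$, and whose columns are
chosen independently from the $\kappa^2-\kappa$ indices $cd$ with
$c\ne d$. Thus
\begin{linenomath}
$$\kappa=2:\ \binom{2}{2}^2=1,\qquad
\kappa=3:\ \binom{6}{3}^2=20^2=400,\qquad
\kappa=4:\ \binom{12}{4}^2=495^2=245{,}025.$$
\end{linenomath}
The case $\kappa=2$ recovers the single matrix $B$ appearing in the
first part of Proposition \ref{prop:idformulas}.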

1037:

1038: One can nonetheless consider simple averaging schemes using only the

1039: basic formulas of Proposition \ref{prop:idformulas} and find that on

1040: simulated data they perform quite well at approximately recovering

1041: invariable site parameters from empirical distributions. However,

1042: averaging the large number of formulas given here, and then also

1043: averaging over a large sample of quartets,

1044:  as is proposed in \cite{SHL00}, is more

1045: time consuming than one might wish for a fast heuristic. Moreover,

1046: one must be aware that the denominator in these formulas may vanish

1047: on an empirical distribution --- it is certain to be non-zero only

1048: for true distributions for GM+I arising from generic parameters.

1049:

1050: Nonetheless, it would be of interest to develop versions of these

1051: formulas with good statistical estimation properties, as the GM+I

1052: model encompasses models such as the GTR+I model which is often

1053: preferred in biological data analysis to group-based+I models. Of

1054: course addressing more general rate-variation models would be even

1055: more desirable, though our results here are not sufficient for that.

1056:

1057:

1058:

1059:

1060: \appendix

1061:

1062: \section{Code for Computational Algebra Software}\label{app:code}

1063:

1064: The following code is also available on the authors' websites.

1065:

1066: \subsection{Computation for Proposition \ref{prop:invariantsK2} }

1067:

1068: To show the variety has dimension 13, we execute the following Maple code:

1069:

1070: {

1071: \scriptsize

1072: \begin{verbatim}

1073: pa := Matrix([[p,1-p]]); Mae := Matrix([[1-a,a],[r,1-r]]);

1074: Meb := Matrix([[1-b,b],[s,1-s]]); Mef := Matrix([[1-e,e],[t,1-t]]);

1075: Mfc := Matrix([[1-c,c],[u,1-u]]); Mfd := Matrix([[1-d,d],[v,1-v]]);

1076: P := Array(1..2,1..2,1..2,1..2);

1077: for i from 1 to 2 do for j from 1 to 2 do for k from 1 to 2 do for l from 1 to 2 do

1078:   P[i,j,k,l]:=0;

1079:   for m from 1 to 2 do  for n from 1 to 2 do

1080:     P[i,j,k,l]:=P[i,j,k,l]+pa[1,i]*Mae[i,m]*Meb[m,j]*Mef[m,n]*Mfc[n,k]*Mfd[n,l];

1081:   od;od;

1082:   P[i,j,k,l]:=(1-w)*P[i,j,k,l];

1083: od;od;od;od;

1084: P[1,1,1,1]:=P[1,1,1,1]+w*q: P[2,2,2,2]:=P[2,2,2,2]+w*(1-q):  # invariable sites

1085: Q:=ListTools[Flatten](convert(P,listlist)):

1086: J:=VectorCalculus[Jacobian](Q,[a,b,c,d,e,r,s,t,u,v,p,q,w]):

1087: K:=subs({a=1/3,b=1/5,c=1/7,d=1/11,e=1/13,r=1/17,s=1/19,t=1/23,u=1/29,v=1/31,

1088:                                                           p=1/3,q=1/5,w=1/7},J):

1089: LinearAlgebra[Rank](K);  # rank 13 means full rank at this point

1090: \end{verbatim}

1091: }

1092:

1093:

1094: Using Singular \cite{sing}, we complete the proof:

1095:

1096: {

1097: \scriptsize

1098: \begin{verbatim}

1099: LIB "matrix.lib";  LIB "primdec.lib";

1100: ring r = 0, (p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15),dp;

1101: // Define matrix flattening F_{ab | cd} and polys fs, f1, f2

1102: matrix Fab[4][4]=p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15;

1103: matrix UR[3][3]=submat(Fab,1..3,2..4); matrix LL[3][3]=submat(Fab,2..4,1..3);

1104: poly f1=det(UR); poly f2=det(LL);

1105: poly fs = p0+p1+p2+p3+p4+p5+p6+p7+p8+p9+p10+p11+p12+p13+p14+p15-1;

1106: ideal I = fs,f1,f2;   // define ideal I

1107: dim(std(I));          // compute dimension of r/I

1108: primdecGTZ(I);        // compute primary decomposition of I to show prime

1109: \end{verbatim}

1110: }

1111:

1112:

1113: \subsection{Computation for intersections of $V_{ab|cd},V_{ac|bd},V_{ad|bc}$}

1114:

1115: Continuing the Singular session above, we execute the following:

1116:

1117: {

1118: \scriptsize

1119: \begin{verbatim}

1120: /* Define ideals Iac, Iad corresponding to two alternative tree

1121:    topologies for 4-taxon trees.  (So, I = Iab in this notation.)   */

1122: // Flattening for ac | bd split

1123: matrix Fac[4][4]=p0,p1,p4,p5,p2,p3,p6,p7,p8,p9,p12,p13,p10,p11,p14,p15;

1124: poly f3=det(submat(Fac,1..3,2..4)); poly f4=det(submat(Fac,2..4,1..3));

1125: ideal Iac = fs,f3,f4;

1126: // Flattening for  ad | bc split

1127: matrix Fad[4][4]=p0,p2,p4,p6,p1,p3,p5,p7,p8,p10,p12,p14,p9,p11,p13,p15;

1128: poly f5=det(submat(Fad,1..3,2..4)); poly f6=det(submat(Fad,2..4,1..3));

1129: ideal Iad = fs,f5,f6;

1130: reduce(f1,std(Iac));  // non-zero answer shows f1 not in Iac

1131: reduce(Iac,std(I));   // non-zero shows f3,f4 not in I

1132: ideal J = I,Iac; dim(std(J));  // show dim is 11

1133: ideal K = J,Iad; dim(std(K));  // show dim is 11

1134: primdecGTZ(K);        // show K prime, and thus ideal for star tree

1135: \end{verbatim}

1136: }

1137:

1138: \subsection{Computation of 2-state GM+I ideal, 4-taxon trees, using Proposition \ref{prop:elim} }

1139:

1140: The following Singular code performs the needed elimination for a binary tree:

1141:

1142: {

1143: \scriptsize

1144: \begin{verbatim}

1145: ideal Igm = minor(Fab,3);

1146: // Eliminate the `diagonal' variables

1147: ideal Igmi = elim1(Igm,p0*p15);

1148: \end{verbatim}

1149: }

1150:

1151: For the star tree, the 2-state GM ideal is known from \cite{ARgm}.

1152: Thus elimination can be used to find GM+I invariants. We also show

1153: this result agrees with $\mathtt K$ above.

1154:

1155: {

1156: \scriptsize

1157: \begin{verbatim}

1158: ideal Igm = minor(Fab,3),minor(Fac,3),minor(Fad,3);

1159: // Eliminate the `diagonal' variables

1160: ideal Igmi = elim1(Igm,p0*p15),fs;

1161: reduce(K,std(Igmi));  // all 0's indicates ideal containment

1162: reduce(Igmi,std(K));  // all 0's indicates ideal containment

1163: \end{verbatim}

1164: }

1165:

1166: \bibliographystyle{elsart-num} \bibliography{Phylo}

1167:

1168: \end{document}

1169: