q-bio0702050/gmi.tex
1: \documentclass{elsart}
2: 
3: %\usepackage{showkeys}
4: \usepackage[mathlines]{lineno}
5: \usepackage{amssymb,amsmath,verbatim,graphicx}
6: 
7: %\linenumbers
8: 
9: 
10: % macros
11: \newcommand\R{\mathbb R}
12: \newcommand\C{\mathbb C}
13: \newcommand\Z{\mathbb Z}
14: 
15: \newcommand\diag{\operatorname{diag}}
16: \newcommand\im{\operatorname{Im}}
17: \newcommand\pr{\boldsymbol{\pi}_{GM}}
18: \newcommand\pI{\boldsymbol{\pi}_I}
19: \newcommand\M{\mathcal{M}}
20: 
21: \newcommand\dsp{\displaystyle}
22: 
23: 
24: \begin{document}
25: 
26: \begin{frontmatter}
27: 
28: \title{Identifying evolutionary trees and substitution parameters
29: for the general Markov model with invariable sites}
30: 
31: 
32: 
33: \thanks{This research was supported in part by the Institute for
34:   Mathematics and its Applications, with funds provided by the National
35:   Science Foundation.  We thank the IMA for its hospitality.}
36: 
37: \author{Elizabeth S. Allman},
38: \ead{e.allman@uaf.edu}
39: \author{John A. Rhodes\corauthref{cor}}
40: \ead{j.rhodes@uaf.edu}
41: \corauth[cor]{Corresponding author.}
42: 
43: 
44: \address{Department of Mathematics and Statistics\\University of
45:   Alaska Fairbanks\\PO Box 756660\\Fairbanks, AK 99775}
46: 
47: 
48: \date{February 22, 2007}
49: 
50: 
51: \begin{abstract}
52:   The general Markov plus invariable sites (GM+I) model of biological
53:   sequence evolution is a two-class model in which an unknown
54:   proportion of sites are not allowed to change, while the remainder
55:   undergo substitutions according to a Markov process on a tree. For
56:   statistical use it is important to know if the model is
57:   identifiable; can both the tree topology and the numerical
58:   parameters be determined from a joint distribution describing
59:   sequences only at the leaves of the tree?  We establish that for
60:   generic parameters both the tree and all numerical parameter values
61:   can be recovered, up to clearly understood issues of `label
62:   swapping.'  The method of analysis is algebraic, using phylogenetic
63:   invariants to study the variety defined by the model.  Simple
64:   rational formulas, expressed in terms of determinantal ratios, are
65:   found for recovering numerical parameters describing the invariable
66:   sites.
67: \begin{keyword}
68: Phylogenetics, invariable site model, identifiability, phylogenetic
69: invariants \MSC 92D15; 14J99; 60J20
70: \end{keyword}
71: 
72: \end{abstract}
73: 
74: \end{frontmatter}
75: 
76: \section{Introduction}\label{sec:intro}
77: 
78: If a model of biological sequence evolution is to be used for
79: phylogenetic inference, it is essential that the model parameters of
80: interest --- certainly the tree parameter and usually the numerical
81: parameters
82: --- be identifiable from the joint distribution of states at the
83: leaves of the tree. Though often unstated, the assumption that model
84: parameters are identifiable underlies the use of both Maximum
85: Likelihood and Bayesian inference methods. As increasingly
86: complicated models, incorporating across-site rate variation,
87: covarion structure, or other types of mixtures, are implemented in
88: software packages, there is a real possibility that
89: non-identifiability could confound data analysis. Unfortunately, our
90: theoretical understanding of this issue lags well behind current
91: phylogenetic practice.
92: 
93: 
94: One natural approach to proving the identifiability of the tree
95: topology relies on the definition of a phylogenetic distance for the
96: model, and the $4$-point condition of Buneman \cite{Bun}. For
97: instance, Steel \cite{S94} used the log-det distance to establish
98: the identifiability of the tree topology under the general Markov
99: model and its submodels. Such a distance-based argument shows
100: additionally that $2$-marginalizations of the full joint
101: distribution suffice to recover the tree parameter, since distances
102: require only two-sequence comparisons. Once the tree has been
103: identified, the numerical parameters giving rise to a joint
104: distribution for the general Markov model
105: can be determined by an argument of Chang
106: \cite{MR97k:92011}.
107: 
108: However, for more general mixture models and rates-across-sites
109: models no appropriate definition of a distance is known, so proving the
110: identifiability of the tree parameter requires a different approach.
111: (Though distance measures have been developed for GTR models with
112: rate-substitution \cite{GuLi96,WadSt97}, these require that one know
113: the rate distribution completely, and identifiability of the rate distribution
114: has yet to be addressed.
115: Although identifiability of the popular GTR+I+$\Gamma$ model of
116: sequence evolution was considered in \cite{Rog01}, there are gaps in
117: the argument, as was pointed out to us by An\'e \cite{AnePC}.)
118: 
119: In \cite{ARidtree}, the viewpoint of algebraic geometry is used to
120: show the generic identifiability of the tree parameter for the
121: covarion model of \cite{MR1604518} and for certain mixture models with
122: a small number of classes.
123: Though this result is far more general than previous identifiability results,
124: it still fails to cover the type of rate-variation models
125: currently in common use for data analysis, and does not
126: address identifiability of numerical parameters at all.
127: Much more study of the identifiability question is needed.
128: 
129: 
130: 
131: \smallskip
132: 
133: In this paper, we focus on the \emph{general Markov plus invariable
134:  sites}, GM+I,  model of sequence evolution, a model that encompasses
135: the GTR+I model that is of more immediate interest to practitioners.
136: Note that previous work on GM+I by Baake \cite{MR1664261}
137: focused on \emph{non}-identifiability. In that paper
138: parameter choices for the $2$-state GM+I model
139: on two distinct 4-taxon trees
140: are constructed  that give rise to the same pairwise
141: joint distributions ($2$-marginals).  As both sets of parameters have
142: 50\% invariable sites, this shows that the
143: identifiability of the tree parameter cannot generally hold on the basis of
144: 2-sequence comparisons, even if the distribution of rate factors is
145: known. Furthermore, it shows a well-behaved phylogenetic distance cannot
146: be defined for this model, as existence of such a distance would imply tree
147: identifiability.
148: 
149: Here we prove that all parameters for the GM+I model are indeed
150: identifiable, through $4$-sequence comparisons.  By identifiable, we
151: mean \emph{generically identifiable} in a geometric sense: For a
152: fixed tree, the set of numerical parameters for which the joint
153: distribution could have arisen from either a) a different tree, or
154: b) a `significantly different' (in a sense to be made clear later)
155: choice of numerical parameters on the same tree, is of strictly
156: lower dimension than that of the full numerical parameter space.
157: (For a concrete example of generic identifiability,
158: recall the results of Steel and Chang on the general Markov model:
159: assumptions that
160: the Markov edge matrices $M_e$ have determinant $\ne 0,1$ and that the
161: distribution of states at the root has strictly positive entries ensure
162: identifiability of all parameters. These are
163: generic conditions.) Thus for natural probability distributions on
164: the parameter space, with probability one a choice of parameters is
165: generic.
166: 
167: 
168: 
169: \medskip
170: 
171: Although identifiability of the tree parameter for GM+I follows from
172: more general results in \cite{ARidtree}, that paper did not consider
173: identifiability of numerical parameters. Our arguments here are
174: tailored to GM+I and yield stronger results
175: addressing numerical parameters as well as the tree. Our approach
176: is again based on the determination of \emph{phylogenetic
177: invariants} for the model. While the invariants described in
178: \cite{ARidtree} are invariants for more general models than GM+I,
179: the ones given in this paper apply only to GM+I and its submodels,
180: and are of much lower degree. As a byproduct of the development of
181: these GM+I invariants, we are led to rational formulas for
182: recovering all the parameters related to the invariable sites from a
183: joint distribution.  Indeed, these formulas are crucial to our
184: identification of numerical parameters.
185: 
186: These formulas can be viewed as GM+I analogs of the formulas for the
187: proportion of invariable sites in group-based+I models that were
188: found by the capture-recapture argument of \cite{SHL00}.  In the
189: group-based setting, those formulas were developed into a heuristic
190: means of estimating the proportion of invariable sites from data
191: without performing a full tree inference. This has been implemented
192: in {\tt SplitsTree4} \cite{splitsTree4}. However, it remains unclear
193: whether a similar useful heuristic can be found for the formulas
194: presented in this paper.
195: 
196: \smallskip
197: 
198: Since our algebraic methods at times employ computational commutative
199: algebra software packages, and these tools are not commonly used
200: in the phylogenetics literature, we have included some examples of
201: code in Appendix \ref{app:code}.
202: 
203: 
204: \section{The GM+I Model}\label{sec:gmi}
205: 
206: Let $T$ denote an \emph{$n$-taxon tree}, by which we mean a tree
207: with $n$ leaves labeled by the taxa $a_1,a_2,\dots, a_n$ and all
208: internal vertices of valence at least 3. We say $T$ is \emph{binary}
209: if all internal nodes have valence exactly 3.
210: 
211: We begin by describing the parameterization of the $\kappa$-state GM+I
212: model of sequence evolution along $T$, where $\kappa=4$ corresponds to
213: usual models of DNA evolution.  The \emph{class size parameter}
214: $\delta$ denotes the probability that any particular site in a
215: sequence is invariable: conceptually, the flip of a biased coin
216: weighted by $\delta$ determines if a site is allowed to undergo state
217: transitions. If a site is invariable, it is assigned state
218: $i\in[\kappa]=\{1,2,\dots,\kappa\}$ with probability $\pi_I(i)$. Here
219: $\boldsymbol \pi_I=(\pi_I(1),\dots,\pi_I(\kappa))$ is a vector of
220: non-negative numbers summing to 1 giving the state distribution for
221: invariable sites.
222: 
223: All sites that are not invariable mutate according to a common set
224: of parameters for the GM model, though independently of one another.
225: For these sites, we associate to each node (including leaves) of $T$
226: a random variable with state space $[\kappa]$.  Choosing any node
227: $r$ of $T$ to serve as a root, and directing all edges away from
228: $r$, let $T_r$ denote the resulting directed tree $T$.  A \emph{root
229: distribution} vector $\pr= (\pi_{GM}(1),\dots, \pi_{GM}(\kappa))$,
230: with non-negative entries summing to $1$, has entries $\pr(j)$
231: specifying the probability that the root variable is in state $j$.
232: For each directed edge $e = (v \to w)$ of $T_r$, let $M_e$ be a
233: $\kappa\times \kappa$ Markov matrix, so that $M_e(i,j)$ specifies
234: the conditional probability that the variable at $w$ is in state $j$
235: given that the variable at $v$ is in state $i$. Thus entries of all
236: $M_e$ are non-negative, with rows summing to 1.
237: 
238: For the GM+I model on an $n$-taxon tree $T$ with edge set $E$, the
239: stochastic parameter space $S \subset {[0,1]}^N$
240: is of dimension
241: $N =1 + (\kappa-1) +(\kappa - 1) + |E|\kappa(\kappa - 1) = 2\kappa - 1 +
242: |E|\kappa(\kappa - 1).$
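As a quick check of this count (a worked instance added for
illustration), consider a binary $4$-taxon tree, which has $|E|=5$
edges. The formula then gives
\begin{linenomath}
$$
N = 2\cdot 2 - 1 + 5\cdot 2\cdot 1 = 13 \ \text{ for } \kappa=2,
\qquad
N = 2\cdot 4 - 1 + 5\cdot 4\cdot 3 = 67 \ \text{ for } \kappa=4.
$$
\end{linenomath}
Here the summands $1$, $\kappa-1$, and $\kappa-1$ count $\delta$, the
free entries of $\pI$, and the free entries of $\pr$, while each Markov
matrix $M_e$ contributes $\kappa(\kappa-1)$ free entries because its
rows sum to 1.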
243: The parameterization map giving the joint
244: distribution of the variables at the leaves of $T$ is
245: denoted by
246: \begin{linenomath}
247: \begin{align*}
248:   \phi_T: S &\longrightarrow {[0,1]}^{\kappa^n},\\
249:   \mathbf{s} &\longmapsto P.
250: \end{align*}
251: \end{linenomath}
252: We view $P$ as an $n$-dimensional $\kappa \times \dots \times
253: \kappa$ array, with dimensions corresponding to the ordered taxa
254: $a_1,a_2,\dots,a_n$, and with entries indexed by the states at the
255: leaves of $T$. The entries of $P$ are polynomial functions in the
256: parameters $\mathbf{s}$ explicitly given by
257: \begin{linenomath}
258: \begin{multline}
259:   P(i_1, \dots, i_n) =\\
260:   \delta\, \epsilon (i_1,i_2,\dots, i_n) \pI(i_1) +(1-\delta)
261:   \sum_{(j_v) \in \mathcal{H}} \left( \pr(j_r) \prod_{e} M_e(j_{v_i},
262:     j_{v_f})\right).\label{eq:Pdef}
263: \end{multline}
264: \end{linenomath}
265: Here $\epsilon(i_1,i_2,\dots, i_n)$ is 1 if all $i_j$ are equal and 0
266: otherwise, the product is taken over all edges $e=(v_i\to v_f)\in E$,
267: and the sum is taken over the set of all possible assignments of
268: states to nodes of $T$ extending the assignment $(i_1, \dots, i_n)$ to
269: the leaves: If $V$ is the set of vertices of $T$ then
270: \begin{linenomath}
271: $$\mathcal{H} = \left\{(j_v) \in [\kappa]^{|V|} \mid j_v = i_k \mbox{
272:   if } v \mbox{ is a leaf labeled by $a_k$} \right\}.
273: $$
274: \end{linenomath}
275: For notational ease, the entries of $P$, the \emph{pattern
276:   frequencies}, are also denoted by $p_{i_1 \dots i_n} = P(i_1, \dots,
277: i_n)$.
278: 
279: We note that while a root $r$ was chosen for the tree in order to
280: explicitly describe the GM portion of the parameterization of our
281: model, the particular choice of $r$ is not important. Under mild
282: additional restrictions on model parameters, changing the root
283: location corresponds to a simple invertible change of variables in
284: the parameterization. (See \cite{SSH94}, \cite{AR03}, or \cite{ARgm}
285: for details.) This justifies our slight abuse of language in
286: referring to the GM or GM+I model on $T$, rather than on $T_r$, and
287: we omit future references to root location.
288: 
289: Note that equation (\ref{eq:Pdef}) allows us to more succinctly
290: describe any $P\in \im(\phi_T)$ as
291: \begin{linenomath}
292: \begin{equation}P= (1-\delta) P_{GM} + \delta P_I
293: \label{eq:decomp}\end{equation}
294: \end{linenomath}
295: where $P_{GM}$ is an array in the
296: image of the GM parameterization map on $T$ and
297: $P_I=\diag(\boldsymbol \pi_I)$ is an $n$-dimensional array whose
298: off-diagonal entries are zeros and whose diagonal entries are those
299: of $\pi_I$.
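Entrywise, writing $\tilde p_{i_1\dots i_n}$ for the entries of
$P_{GM}$ (a notational convenience for this illustration), equation
(\ref{eq:decomp}) reads
\begin{linenomath}
$$
p_{i_1\dots i_n}=
\begin{cases}
(1-\delta)\,\tilde p_{i\,i\dots i}+\delta\,\pi_I(i), &
\text{if } i_1=i_2=\dots=i_n=i,\\
(1-\delta)\,\tilde p_{i_1\dots i_n}, & \text{otherwise.}
\end{cases}
$$
\end{linenomath}
Thus only the $\kappa$ diagonal entries $p_{ii\dots i}$ carry a
contribution from the invariable sites; every other entry of $P$ is a
rescaled entry of $P_{GM}$.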
300: 
301: 
302: \section{Model Identifiability}\label{sec:modelId}
303: 
304: We now make precise the various concepts of identifiability of a
305: phylogenetic model. To adapt standard statistical language to the
306: phylogenetic setting, for a fixed set $A$ of $n$ taxa and $\kappa\ge
307: 2$, consider a collection $\mathcal M$ of pairs $(T,\phi_T )$, where
308: $T$ is an $n$-taxon tree with leaf labels $A$, and $\phi_T:S_T\to
309: [0,1]^{\kappa^n}$ is a parameterization map of the joint
310: distribution of pattern frequencies for the model on $T$.  We say
311: \emph{the tree parameter is identifiable} for $\mathcal M$ if for
312: every $P\in \cup_{(T,\phi_T)\in
313:   \mathcal M} \im(\phi_T)$, there is a unique $T$ such that $P\in
314: \im(\phi_T)$. We say that \emph{numerical parameters are identifiable on a
315: tree $T$} if the map $\phi_T$ is injective, that is if for
316: every $P\in\im(\phi_T)$ there is a unique $\mathbf s\in S_T$ with
317: $\phi_T(\mathbf s)=P$. We say the \emph{model $\mathcal M$ is identifiable}
318: if the tree parameter is identifiable, and for each tree the numerical
319: parameters are identifiable.
320: 
321: 
322: It is well-known that such a definition of identifiability is too
323: stringent for phylogenetics. First, unless one restricts parameter
324: spaces, there is little hope that the tree parameter be identifiable:
325: One need only think of any standard model on a binary 4-taxon tree in
326: which the Markov matrix parameter on the internal edge is the identity
327: matrix. Any joint distribution arising from such a parameter choice
328: could have as well arisen from any other 4-taxon tree topology.
329: 
330: Even if such `special' parameter choices are excluded so the tree
331: parameter becomes identifiable, identifiability of numerical
332: parameters also poses problems, as noted  by Chang
333: \cite{MR97k:92011}. For example, consider the 3-taxon tree with the
334: GM model. Then multiple parameter choices give rise to the same
335: joint distribution since the labeling of the states at the internal
336: node can be permuted in $\kappa!$ ways, as long as the Markov matrix
337: parameters are adjusted accordingly \cite{AR03}. The occurrence of
338: this sort of `label-swapping' non-identifiability in statistical
339: models with hidden (unobserved) variables is well-known, but is not
340: of great concern. However, even for this model more subtle forms of
341: non-identifiability can occur, in which infinitely many parameter
342: choices lead to the same joint distribution. These arise from
343: singularities in the model, and can be avoided by again restricting
344: parameter space. Such `generic' conditions for the GM model have
345: already been mentioned in the introduction.
346: 
347: We therefore refine our notions of identifiability. Because we are
348: concerned primarily with models where the maps $\phi_T$ are given by
349: polynomials, we give a formulation appropriate to that setting.
350: Recall that given any collection $\mathcal F$ of polynomials
351: in $N$ variables, their common zero set,
352: \begin{linenomath}
353: $$
354: V(\mathcal F)=\{z\in \C^N \mid f(z)=0 \text{ for all } f\in \mathcal F\},
355: $$
356: \end{linenomath}
357: is the \emph{algebraic variety} defined by $\mathcal F$. If the algebraic
358: variety is a proper subset of $\C^N$, then it is said to be \emph{proper}.
359: 
360: 
361: 
362: \begin{defn}
363: Let $\mathcal M$ be a model on a collection of $n$-taxon trees, as
364: defined above.
365: \begin{enumerate}
366: \item  We say \emph{the tree parameter is generically identifiable}
367: for $\mathcal M$ if for each tree $T$ there exists a proper
368: algebraic variety $X_T$ with the
369: property that
370: \begin{linenomath}
371: $$P\in \bigcup_{(T,\phi_T)\in \mathcal M}
372: \phi_T(S_T\smallsetminus X_T) \text{ implies } P\in
373: \phi_T(S_T\smallsetminus X_T) \text{ for a unique
374: $T$}.$$
375: \end{linenomath}
376: 
377: \item We say that \emph{numerical parameters are generically
378: locally identifiable on a tree $T$} if there is a proper
379: algebraic variety $Y_T$ such that for all
380: $\mathbf s\in S_T\smallsetminus Y_T$, there is a
381: neighborhood of $\mathbf s$ on which $\phi_T$ is injective.
382: 
383: \item We say the \emph{model $\mathcal M$ is generically locally
384: identifiable} if the tree parameter is generically identifiable, and
385: for each tree the numerical parameters are generically locally
386: identifiable. \end{enumerate}
387: \end{defn}
388: 
389: Note that the notion of `generic' here is used to mean
390: `for all parameters but those lying on a proper
391: subvariety of the parameter space,' and such a variety
392: is necessarily of lower dimension than the full parameter
393: space. Using the standard measure
394: on the parameter space, viewed as a subset of $\R^N$,  this notion thus
395: also implies `for all
396: parameters except those in a set of measure 0.'
397: 
398: \smallskip
399: 
400: In the important special case of parameterization maps defined by
401: polynomial formulas, such as that for the GM+I model, generic local
402: identifiability of numerical parameters is equivalent to the notion in
403: algebraic geometry of the map $\phi_T$ being \emph{generically
404:   finite}. In this case, there exists a proper variety $Y_T$ and an
405: integer $k$, the degree of the map $\phi_T$, such that restricted to
406: $S_T\smallsetminus Y_T$ the map $\phi_T$ is not only locally injective
407: but also $k$-to-1: That is, if $\mathbf s\in S_T\smallsetminus Y_T$
408: and $P=\phi_T(\mathbf s)$, then the fiber $\phi^{-1}_T(P)$ has
409: cardinality $k$.
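A toy illustration, not drawn from the phylogenetic setting: the
polynomial map $\phi:\C\to\C$ given by $\phi(s)=s^2$ is generically
finite of degree $k=2$. Taking $Y=\{0\}$, every $s\ne 0$ has a
neighborhood on which $\phi$ is injective, and
\begin{linenomath}
$$
\phi^{-1}(\phi(s))=\{s,-s\}
$$
\end{linenomath}
has cardinality 2, while the fiber over $0$ consists of the single
point $0$.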
410: 
411: Because of the label swapping issue at internal nodes, for the GM
412: model and GM+I on an $n$-taxon tree $T$ with vertex set $V$, fibers of
413: generic points will always have cardinality at least $(\kappa!)^{|V|-n}$.
414: Thus for these models, the best we can hope for is generic local
415: identifiability of the model (both tree and numerical parameters)
416: where the generic fiber has exactly this cardinality.  That in fact is
417: what we establish in the next section.
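For instance, a binary $4$-taxon tree has $|V|=6$ vertices, of which
$|V|-n=2$ are internal. For $\kappa=2$, each internal node admits
$2!=2$ relabelings of its states, chosen independently of the other
node, so every generic fiber contains at least $2\cdot 2=4$ points.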
418: 
419: 
420: 
421: \section{Generic Identifiability for the GM+I model}\label{sec:genericId}
422: 
423: We begin our arguments by determining some phylogenetic invariants for
424: the GM+I model. The notion of a phylogenetic invariant was introduced
425: by Cavender and Felsenstein \cite{CF87} and Lake \cite{Lake87}, in the
426: hope that phylogenetic invariants might be useful for practical tree
427: inference.  Their role here, in proving identifiability, is more
428: theoretical but illustrates their value in analyzing models.
429: 
430: \smallskip
431: 
432: For a parameterization $\phi_T$ given by polynomial formulas on domain
433: $S_T\subseteq\R^N$, we may uniquely extend to a polynomial map with
434: domain $\C^N$, given by the same polynomial formulas, which we again
435: denote by $ \phi_T: \C^N \longrightarrow {\C}^{\kappa^n}.$
436: 
437: \begin{rem}
438:   Extending parameters to include complex values is solely for
439:   mathematical convenience, as algebraic geometry provides the natural
440:   setting for our viewpoint.  The collection of stochastic joint
441:   distributions (arising from the original stochastic parameter space)
442:   is a proper subset of $\im(\phi_T)$.
443: \end{rem}
444: 
445: 
446: The \emph{phylogenetic variety}, $V_T$, is the smallest algebraic
447: variety in $\C^{\kappa^n}$ containing $ \phi_T(\C^N)$, \emph{i.e.},
448: the closure of the image of $\phi_T$ under the Zariski topology,
449: \begin{linenomath}
450:   $$V_T=\overline{\im(\phi_T)}\subseteq\C^{\kappa^n}.$$
451: \end{linenomath}
452: 
453: \begin{rem}
454:   $V_T$ coincides with the closure of $\im(\phi_T)=\phi_T(\C^N)$ under
455:   the usual topology on $\C^{\kappa^n}.$ However, while $V_T\cap
456:   [0,1]^{\kappa^n}$ contains the closure of $\phi_T(S_T)$ under the
457:   usual topology, these need not be equal.
458: \end{rem}
459: 
460: 
461: Let $\C[P]$ denote the ring of polynomials in the $\kappa^{n}$
462: indeterminates $\{p_{i_1\dots i_n}\}.$ Then the collection of all
463: polynomials in $\C[P]$ vanishing on $V_T$ forms a prime ideal $I_T$.
464: We refer to $I_T$ as a \emph{phylogenetic ideal}, and its elements as
465: \emph{phylogenetic invariants}.  More explicitly, a polynomial $f\in
466: \C[P]$ is a phylogenetic invariant if, and only if, $f(P_0)=0$ for
467: every $P_0\in \phi_T(\C^N)$, or equivalently, if, and only
468: if, $f(P_0)=0$ for every $P_0\in \phi_T(S_T)$.
469: 
470: 
471: 
472: \medskip
473: 
474: As we proceed, we consider first the special case of $4$-taxon
475: trees. We highlight the
476: $\kappa=2$ case, in part to illustrate the arguments for general
477: $\kappa$ more clearly, and in part because we can go further in understanding
478: the 2-state model.
479: 
480: \medskip
481: 
482: Consider the $4$-taxon binary tree $T_{ab|cd}$, with taxa $a,b,c,d$ as
483: shown in Figure \ref{fig:4taxa}.
484: 
485: \begin{figure}[h]
486: \begin{center}
487: \includegraphics[height=.75in]{figv01.eps}
488: \end{center}
489: \caption{The 4-taxon tree $T_{ab|cd}$}\label{fig:4taxa}
490: \end{figure}
491: Suppose that $P$ is a $2 \times 2 \times 2
492: \times 2$  pattern frequency array,
493:  whose indices correspond to states $[2]=\{1,2\}$
494: at the taxa in alphabetical order.
495: Then the internal edge $e$ of $T$ defines the split $ab \mid
496: cd$ in the tree, and we define the \emph{edge flattening} $F_e$ of $P$
497: at $e$, a $2^2 \times 2^2$ matrix, by
498: \begin{linenomath}
499: \begin{equation}\label{eq:Flat}
500: F_e =
501: \begin{pmatrix}
502:   p_{1111} & p_{1112} & p_{1121} & p_{1122}\\
503:   p_{1211} & p_{1212} & p_{1221} & p_{1222}\\
504:   p_{2111} & p_{2112} & p_{2121} & p_{2122}\\
505:   p_{2211} & p_{2212} & p_{2221} & p_{2222}\\
506: \end{pmatrix}.
507: \end{equation}
508: \end{linenomath}
509: Notice that the rows of $F_e$ are indexed by the states at $\{ab\}$
510: and the columns by states at $\{cd\}$. The flattening $F_e$ is
511: intuitively motivated by considering a `collapsed' model induced by
512: $e$: taxa $a$ and $b$ are grouped together forming a single variable
513: $\{ab\}$ with $4$ states, and the grouping $\{cd\}$ forms a second
514: variable with $4$ states.
515: 
516: 
517: This construction can be generalized in a natural way: suppose $T$
518: is an $n$-taxon tree, and $P$ a $\kappa\times\dots\times\kappa$ array with
519: indices corresponding to the taxa labeling the leaves of $T$. Then
520: for any edge $e$ in $T$, we can form from $P$ the matrix $F_{e}$ of size
521: $\kappa^{n_1} \times \kappa^{n_2}$, where $n_1$ and $n_2$ are the
522: cardinalities of the two sets of taxa in the split induced by $e$.
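In index form (our notation for this illustration), if $e$ induces the
split $A_1\mid A_2$ of the taxa, then
\begin{linenomath}
$$
F_e\bigl((i_k)_{a_k\in A_1},\,(i_k)_{a_k\in A_2}\bigr)=p_{i_1\dots i_n},
$$
\end{linenomath}
with row and column indices ordered lexicographically as in equation
(\ref{eq:Flat}). For example, on a $5$-taxon tree an internal edge
inducing the split $a_1a_2\mid a_3a_4a_5$ gives a
$\kappa^2\times\kappa^3$ matrix, which is $16\times 64$ for
$\kappa=4$, while the pendant edge at $a_1$ gives a
$\kappa\times\kappa^4$ matrix, which is $4\times 256$ for $\kappa=4$.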
523: 
524: 
525: 
526: From \cite{ARgm} (for a more expository presentation, see also
527: \cite{ARnme}), we have:
528: 
529: 
530: \begin{thm} \label{thm:GM}
531:   For the $2$-state GM model on a binary $n$-taxon
532:   tree $T$, the phylogenetic ideal $I_T$ is generated by all $3\times 3$
533:   minors of all edge flattenings $F_e$ of $P$.
534:   Moreover, for the $\kappa$-state GM model on an $n$-taxon tree $T$,
535:   the phylogenetic ideal
536:   $I_T$ contains all
537:   $(\kappa+1) \times (\kappa+1)$ minors of all edge flattenings
538:   of $P$.
539: \end{thm}
540: 
541: Using this result, we can deduce some
542: elements of the phylogenetic ideal for
543: the GM+I model for any number of taxa $n \ge 4$ and any number of
544: states $\kappa \ge 2$.
545: 
546: \begin{prop}\label{prop:invariants} (Phylogenetic Invariants for GM+I)
547: \begin{enumerate}
548: \item \label{prop:inv:item1}
549: For the $4$-taxon tree $T_{ab|cd}$ and
550: the $2$-state GM+I model,
551: the cubic determinantal polynomials
552: \begin{linenomath}
553:   $$
554:   f_1=\left |\begin{matrix}
555:       p_{1112} & p_{1121} & p_{1122}\\
556:       p_{1212} & p_{1221} & p_{1222}\\
557:       p_{2112} & p_{2121} & p_{2122}\\
558: \end{matrix}\right |
559: \mbox{ and } f_2=\left |\begin{matrix}
560:     p_{1211} & p_{1212} & p_{1221}\\
561:     p_{2111} & p_{2112} & p_{2121}\\
562:     p_{2211} & p_{2212} & p_{2221}
563: \end{matrix}\right |
564: $$
565: \end{linenomath}
566: are phylogenetic invariants. These are the two $3\times 3$ minors of
567: the matrix flattening $F_{ab \mid cd}$ of equation (\ref{eq:Flat}) that do not
568: involve either of the entries $p_{1111}$ or $p_{2222}$.
569: 
570: \item More generally, for $n\ge 4$ and $\kappa\ge 2$, consider the
571:   $\kappa$-state GM+I model on an $n$-taxon tree $T$. Then for each
572:   edge $e$ of $T$, all $(\kappa+1)\times (\kappa+1)$ minors of the
573:   flattening $F_e$ of $P$ that avoid all entries $p_{ii\dots i}$,
574:   $i\in[\kappa]$ are phylogenetic invariants.
575: \end{enumerate}
576: \end{prop}
577: 
578: \begin{pf}  We prove the first statement in detail.
579: From equation (\ref{eq:decomp}),
580: for
581: any $P=\phi_T(s)$ we have $P=(1-\delta)P_{GM}+\delta
582:  P_I$, where $P_{GM}$ is a 4-dimensional table arising from the GM
583:   model on $T$ and $P_I=\diag(\pi_I)$ is a diagonal table with entries
584:   giving the distribution of states for the invariable sites.
585:   Flattening these tables with respect to the internal edge of the
586:   tree, we obtain
587: \begin{linenomath}
588: \begin{align}F_{ab \mid cd} &= (1-\delta) F_{GM} + \delta F_I\notag \\
589:   &=(1 - \delta)
590: \begin{pmatrix}
591:   \tilde p_{1111} & \tilde p_{1112} & \tilde p_{1121} & \tilde p_{1122}\\
592:   \tilde p_{1211} & \tilde p_{1212} & \tilde p_{1221} & \tilde p_{1222}\\
593:   \tilde p_{2111} & \tilde p_{2112} & \tilde p_{2121} & \tilde p_{2122}\\
594:   \tilde p_{2211} & \tilde p_{2212} & \tilde p_{2221} & \tilde
595:   p_{2222}
596: \end{pmatrix}
597: +
598: \delta \begin{pmatrix}
599:   \pi_I(1) &\ 0\ &\  0\ & 0 \\
600:   0 & 0 & 0 & 0\\
601:   0 & 0 & 0 & 0\\
602:   0 & 0 & 0 & \pi_I(2) \\
603: \end{pmatrix}.\label{eq:Psum}
604: \end{align}
605: \end{linenomath}
606: By Theorem \ref{thm:GM}, all $3\times 3$ minors of $F_{GM}$ vanish.
607: Since the `upper right' and `lower left' minors of $F_{ab|cd}$ are the
608: same as those of $F_{GM}$, up to a factor of $(1-\delta)^3$, they also
609: vanish.
610: 
611: Straightforward
612: modifications to this argument give the general case.\hfill\qed
613: \end{pf}
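To spell out the key step for $f_1$: the entries of $F_{ab|cd}$ in
rows $11,12,21$ and columns $12,21,22$ receive no contribution from
$F_I$, so each equals $(1-\delta)$ times the corresponding entry of
$F_{GM}$. Since a $3\times 3$ determinant is homogeneous of degree 3
in its entries,
\begin{linenomath}
$$
f_1(P)=(1-\delta)^3
\left |\begin{matrix}
  \tilde p_{1112} & \tilde p_{1121} & \tilde p_{1122}\\
  \tilde p_{1212} & \tilde p_{1221} & \tilde p_{1222}\\
  \tilde p_{2112} & \tilde p_{2121} & \tilde p_{2122}
\end{matrix}\right |
=(1-\delta)^3 f_1(P_{GM})=0.
$$
\end{linenomath}
The same computation applies to $f_2$, which uses rows $12,21,22$ and
columns $11,12,21$ and so also avoids both nonzero entries of $F_I$.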
614: 
615: 
616: For arbitrary $n,\kappa$, the GM+I model should have many invariants
617: other than those found here.
618: Among these is, of course, the stochastic invariant
619: \begin{linenomath}
620: $$f_s(P)=1-\sum_{\mathbf i\in [\kappa]^n} p_{\mathbf i}.$$
621: \end{linenomath}
622: 
623: In the simplest interesting case of the GM+I model, however, we
624: have the following computational result.
625: 
626: \begin{prop}\label{prop:invariantsK2} The phylogenetic ideal
627:   for the $2$-state GM+I model on the $4$-taxon tree $T_{ab|cd}$ of Figure
628:   \ref{fig:4taxa} is generated by $f_s$ and
629: the minors $f_1$, $f_2$ above;
630: \begin{linenomath}
631: $$I_T  = \langle f_s, f_1, f_2 \rangle.$$
632: \end{linenomath}
633: \end{prop}
634: 
635: \begin{pf}
636:   A computation of the Jacobian of the parameterization
637: $\phi_T: S \subset \C^{13} \to \C^{2^4}$ shows it has full rank at
638: some points, and so $V_T$ is of dimension 13.  If $I =
639: \langle f_s,f_1, f_2 \rangle$, then $I \subseteq I_T$.  Another computation
640: shows that $I$ is prime and of
641:   dimension $13$.  Thus, necessarily $I =
642:   I_T$.  (The code for these computations is given in Appendix
643:   \ref{app:code}.)\hfill\qed
644: \end{pf}
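As a consistency check on these dimensions (a remark we add, not part
of the computation): the ambient space $\C^{2^4}$ has dimension 16 and
$I$ is generated by three polynomials, so by Krull's height theorem
every component of $V(I)$ has dimension at least
\begin{linenomath}
$$
16-3=13 .
$$
\end{linenomath}
The computed dimension 13 therefore says that $f_s$, $f_1$, $f_2$ cut
out a variety of the smallest dimension that three equations allow.
It also agrees with the parameter count
$N=2\kappa-1+|E|\kappa(\kappa-1)=3+5\cdot 2=13$ of Section
\ref{sec:gmi} for $\kappa=2$ and $|E|=5$.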
645: 
646: 
647: Let $V_{ab | cd}$, $V_{ac | bd}$, $V_{ad | bc}$ be the varieties for
648: the $2$-state GM+I models for the three $4$-taxon binary tree topologies, with
649: corresponding phylogenetic ideals $I_{ab \mid cd}$, $I_{ac \mid bd}$,
650: $I_{ad \mid bc}$.
651: Of course Proposition \ref{prop:invariantsK2}
652: gives generators for each of these ideals --- two $3
653: \times 3$ minors of the flattenings of $P$ appropriate to those tree
654: topologies, along with $f_s$. A computation (see Appendix \ref{app:code}) shows
655: that these three ideals are distinct. Therefore the three varieties
656: are distinct, and their pairwise intersections are proper
657: subvarieties. Thus for any parameters $\mathbf s$
658: not lying in the inverse image
659: of these subvarieties, $T$ is uniquely determined from $\phi_T(\mathbf s)$.
660: Thus we obtain
661: 
662: \begin{cor} \label{cor:identTreeN4k2}
663:   For the $2$-state GM+I model on binary 4-taxon trees, the tree parameter is
664:   generically identifiable.
665: \end{cor}
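For concreteness, here is the flattening appropriate to the split
$ac\mid bd$, written out by us as an illustration by reindexing
equation (\ref{eq:Flat}). Rows are indexed by the states at $\{ac\}$
and columns by the states at $\{bd\}$, while the subscripts of each
$p_{i_1i_2i_3i_4}$ remain in the alphabetical order $a,b,c,d$:
\begin{linenomath}
$$
F_{ac \mid bd} =
\begin{pmatrix}
  p_{1111} & p_{1112} & p_{1211} & p_{1212}\\
  p_{1121} & p_{1122} & p_{1221} & p_{1222}\\
  p_{2111} & p_{2112} & p_{2211} & p_{2212}\\
  p_{2121} & p_{2122} & p_{2221} & p_{2222}
\end{pmatrix}.
$$
\end{linenomath}
The `upper right' and `lower left' $3\times 3$ minors of this matrix
are the two that avoid $p_{1111}$ and $p_{2222}$; by Proposition
\ref{prop:invariantsK2}, applied with the taxa relabeled, they
generate $I_{ac\mid bd}$ together with $f_s$.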
666: 
667: As $\dim(V_{ab|cd})=13$, and the parameter space for $\phi_T$ is
668: 13-dimensional, we also immediately obtain that the map $\phi_T$ is
669: generically finite. This yields
670: 
671: \begin{cor} \label{cor:idenNumN4k2}
672: For the $2$-state GM+I model on a binary 4-taxon tree,
673: numerical parameters are generically locally
674: identifiable.
675: \end{cor}
676: 
677: Note that this approach does not yield the cardinality of the
678: generic fiber of the parameterization map, which is also of
679: interest. We will return to this issue in Theorem
680: \ref{thm:genericIdent}.
681: 
682: \medskip
683: 
684: Further computations show that
685: $\dim(V_{ab|cd} \cap V_{ac|bd}\cap V_{ad|bc})=11$. As this
686: intersection contains all points arising from the GM+I
687: model on the 4-taxon star tree, which is an 11-parameter model, this
688: is not surprising. In fact, one can verify computationally
689: that the ideal $I_{ab|cd}+I_{ac|bd}+I_{ad|bc}$ is the defining
690: prime ideal of the star-tree variety.
691: We also note that the ideal $I_{ab|cd} + I_{ac|bd}$
692: decomposes into two primes, both of dimension 11. Thus the variety
693: defined by this ideal has two components, one of which is the variety
694: for the star tree.
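
The parameter count for the star tree can be checked in the same way as
for the binary tree. As a sketch, assuming the root is placed at the
central node and each of the four edges carries a $2\times 2$ Markov
matrix,
\begin{linenomath}
$$\underbrace{1}_{\pr}+\underbrace{4\cdot 2}_{\{M_e\}}
+\underbrace{1}_{\delta}+\underbrace{1}_{\pI}=11,$$
\end{linenomath}
which is two less than the $13$ parameters on a binary 4-taxon tree,
the difference being the two free entries of the Markov matrix on the
internal edge.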
695: 
696: \medskip
697: 
698: In principle, the ideal $I_T$ of all invariants for the GM+I model
699: on an arbitrary tree $T$ can be computed from the parameterization
700: map $\phi_T$ via an elimination of variables using Gr\"obner bases
701: \cite{MR2001c:92009}. However, if all invariants for the
702: $\kappa$-state GM model on $T$ are known, they can provide an
703: alternate approach to finding $I_T$ which, while still proceeding by
704: elimination, should be less computationally demanding.
705: 
706: To present this most simply, we note that because
707: our varieties lie in the hyperplane described by the stochastic invariant,
708: it is natural to consider their projectivizations,
709: lying in $\mathbb P^{\kappa^n-1}$ rather than $\C^{\kappa^n}$. The
710: corresponding phylogenetic ideals, which we denote by $J_T$,
711: are generated by the homogeneous polynomials in $I_T$, and do not contain the
712: stochastic invariant. Conversely, $I_T$ is generated by the elements of
713: $J_T$ together with the stochastic invariant.
714: 
715: In addition, we need
716: not restrict ourselves to the GM model, but rather deal with any
717: phylogenetic model parameterized by polynomials.
718: 
719: \begin{prop} \label{prop:elim}
720: Suppose $\widetilde \phi_T:\C^N\to \C^{\kappa^n}$ is a
721: parameterization map for some phylogenetic model $\mathcal M$ on
722: $T$, with corresponding homogeneous phylogenetic ideal $\widetilde
723: J_T$. Let
724: \begin{linenomath}
725: $$\phi_T:\C^{N}\times\C^\kappa\to
726:  \C^{\kappa^n}$$
727: \end{linenomath}
be the parameterization map for the $\mathcal M$+I model
729: given by
730: \begin{linenomath}
731: $$\phi_T(\mathbf s,(\delta,\boldsymbol \pi_I))=(1-\delta)
732: \widetilde \phi_T(\mathbf s)+\delta \diag(\boldsymbol \pi_I).$$
733: \end{linenomath}
734: Let
735: $P'$ denote the collection of all indeterminate entries of $P$
736: except those in $P_{eq}=\{p_{ii\dots i}\mid i\in[\kappa]\}$. Then
737: the homogeneous phylogenetic ideal $J_T$ for the $\mathcal M$+I
738: model on $T$ is $J_T=\left (\widetilde J_T\cap \C[P'] \right)\C[P].$
739: Thus $J_T$ can be computed from $\widetilde J_T$ by elimination of
740: the variables in $P_{eq}$.
741: \end{prop}
742: 
743: \begin{pf}
744: Extend the parameterization maps $\widetilde \phi_T, \phi_T$ to
745: parameterizations of cones by introducing an additional parameter,
746: \begin{linenomath}
747: $$ \widetilde \Phi_T(\mathbf s,t)=t\,\widetilde\phi_T(\mathbf s)$$
748: $$
749: \Phi_T(\mathbf s,(\delta,\boldsymbol \pi_I ),t)
750: =t\,\phi_T(\mathbf s,(\delta,\boldsymbol \pi_I))$$
751: \end{linenomath}
752: Then
753: $\im(\Phi_T)=\C^\kappa\times\operatorname{proj}(\im(\widetilde \Phi_T)),$
754: where $\C^\kappa$ corresponds to coordinates in $P_{eq}$ and
755: `$\operatorname{proj}$' denotes
756: the projection map from $P$-coordinates to $P'$-coordinates. As $J_T$
757: is the ideal of polynomials vanishing on $\im(\Phi_T)$, and
$\widetilde J_T\cap \C[P']$ the ideal
759: vanishing on $\operatorname{proj}(\im(\widetilde \Phi_T))$, the result follows.
760: \hfill\qed
761: \end{pf}
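
As an illustration of Proposition \ref{prop:elim}, consider $\kappa=2$
on $T_{ab|cd}$, taking as given (as the code in Appendix \ref{app:code}
does) that $\widetilde J_T$ is generated by the $3\times 3$ minors of
$F_{ab|cd}$. Here $P_{eq}=\{p_{1111},p_{2222}\}$ and $P'$ consists of the
remaining 14 indeterminates. A $3\times 3$ minor of the $4\times 4$
flattening omits one row and one column, so it avoids both $p_{1111}$
and $p_{2222}$ only if it omits the $22$ row and the $11$ column, or the
$11$ row and the $22$ column. These two minors are
\begin{linenomath}
$$\begin{vmatrix}
    p_{1112} & p_{1121} & p_{1122}\\
    p_{1212} & p_{1221} & p_{1222}\\
    p_{2112} & p_{2121} & p_{2122}
\end{vmatrix},\qquad
\begin{vmatrix}
    p_{1211} & p_{1212} & p_{1221}\\
    p_{2111} & p_{2112} & p_{2121}\\
    p_{2211} & p_{2212} & p_{2221}
\end{vmatrix}.$$
\end{linenomath}
Both lie in $\widetilde J_T\cap \C[P']$. In view of Proposition
\ref{prop:invariantsK2}, one expects the elimination to return exactly
the ideal they generate.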
762: 
763: Using this, in the appendix we give an alternate computation to show
764: both part (\ref{prop:inv:item1}) of Proposition
765: \ref{prop:invariants}, and Proposition \ref{prop:invariantsK2}.
766: While this computation is quite fast, a more naive attempt to find
767: GM+I invariants directly from the full parameterization map using
768: elimination was unsuccessful, demonstrating the utility of the
769: proposition.
770: Moreover, we can use this proposition to compute all 2-state GM+I invariants
771: on the 5-taxon binary tree as well. This leads us to
772: 
773: \begin{conj} On an $n$-taxon binary tree, the ideal of homogeneous
774: invariants for the 2-state
775: GM+I model is generated by those $3\times 3$
776: minors of edge flattenings
777: that do not involve the variables $p_{11\dots1}$ and $p_{22\dots2}$,
778: together with the
779: stochastic invariant.
780: \end{conj}
781: 
782: \medskip
783: 
784: 
785: Although we are unable to determine all GM+I invariants for the
786: 4-taxon tree for general $\kappa$, using only those described in
787: Proposition \ref{prop:invariants} we can still obtain
788: identifiability results through a modified argument.
789: 
790: 
791: \begin{prop}\label{prop:treeId} For the $\kappa$-state GM+I model on
792: binary 4-taxon trees, $\kappa\ge 2$, the tree parameter is
793: generically identifiable.
794: \end{prop}
795: 
796: \begin{pf} By the argument leading to Corollary \ref{cor:identTreeN4k2},
797: it is enough to show the varieties $V_{ab \mid cd}$, $V_{ac
798:     \mid bd}$, and $V_{ad|bc}$ are distinct.
799: Considering, for example, the first two, we can
800: show that the varieties $V_{ab \mid cd}$ and $V_{ac
801:     \mid bd}$ are distinct, by giving an invariant $f \in I_{ac \mid
802:     bd}$  and a point $P_0\in V_{ab|cd}$
803: such that $f(P_0)\ne 0$.
804: 
805: Using Proposition \ref{prop:invariants}, we pick an
806:   invariant $f \in I_{ac \mid bd}$ as follows: In the flattening
807: $F_{ac|bd}$ according to the split $ac|bd$, choose any collection
808: of $\kappa+1$ $ac$-indices with distinct $a$ and $c$ states, \emph{e.g.},
809: $\{12,13,\dots,1\kappa,21,23\}$. Using the same set as $bd$-indices,
810: this determines a $(\kappa+1)\times(\kappa+1)$-minor $f$.
811: 
812: We pick $P_0=\phi_{T_{ab|cd}}(\mathbf s)$ using the parameterization
813: of equation (\ref{eq:Pdef}) by making a specific choice of parameters
814: $\mathbf s$. On $T_{ab|cd}$, with the root $r$ located at one of the
815: internal nodes, choose parameters $\mathbf{s}$ as follows: Let
816: $\pr$, $\pI$ be arbitrary but with all entries of $\pr$ positive.
817: Pick any $\delta \in [0,1)$. For the four terminal edges choose
$M_e$ to be the $\kappa \times \kappa$ identity matrix
819: $I_\kappa$. For the single internal edge $e$ of $T$, choose any
820: Markov matrix $M_{e}$ with all positive entries. For such
821: parameters, the entries of the joint distribution $P_0 =
822: \phi_{T_{ab|cd}} (\mathbf{s})$ are zero except for the pattern
823: frequencies $p_{iijj}$, where the states at the leaves $a$ and $b$
824: agree and the states at the leaves $c$ and $d$ agree.  Since the
825: entries of $M_{e}$ and the root distributions are positive, each of
826: the $p_{iijj} > 0$.
827: 
828: But considering the flattening $F_{ac \mid bd}$ of
829: $P_0=\phi_{T_{ab|cd}} (\mathbf{s})$ with respect to the `wrong'
830: topology $T_{ac \mid bd}$, we observe that the $\kappa^2$ non-zero
831: entries $p_{iijj}$ of $F_{ac \mid bd}$ all lie on the diagonal of
832: $F_{ac \mid bd}$, in the positions with $ij$ as both $ac$-index and
833: $bd$-index. Furthermore, by our choice of $f$, a subset of them
834: forms the diagonal of the submatrix whose determinant is $f$.
835: Therefore $f(P_0)\ne 0$.\hfill\qed
836: \end{pf}
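
To make the construction in this proof concrete, take $\kappa=3$ and the
index set $\{12,13,21,23\}$, used both for the $ac$-indices (rows) and
the $bd$-indices (columns). This is only an illustrative sketch; note
that for $\kappa=2$ there are just two indices with distinct states, so
that case rests on Corollary \ref{cor:identTreeN4k2}. The entry in row
$ac$ and column $bd$ is $p_{abcd}$, so the submatrix of $F_{ac|bd}$ is
\begin{linenomath}
$$\begin{pmatrix}
    p_{1122} & p_{1123} & p_{1221} & p_{1223}\\
    p_{1132} & p_{1133} & p_{1231} & p_{1233}\\
    p_{2112} & p_{2113} & p_{2211} & p_{2213}\\
    p_{2132} & p_{2133} & p_{2231} & p_{2233}
\end{pmatrix}.$$
\end{linenomath}
At $P_0$ only the entries of the form $p_{iijj}$ are non-zero, and these
are exactly the diagonal entries, so
\begin{linenomath}
$$f(P_0)=p_{1122}\,p_{1133}\,p_{2211}\,p_{2233}>0.$$
\end{linenomath}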
837: 
838: \begin{prop}(Recovery of invariable site parameters)\label{prop:idformulas}
839: \begin{enumerate}
840: \item For the 4-taxon tree $T_{ab|cd}$ and the 2-state GM+I model, suppose
841: $P=\phi_T(\mathbf s)$. Then generically the parameters in
842: $\mathbf s$ related to invariable sites can be recovered from $P$ by
843: the following formulas:
844: \begin{linenomath}
845: $$\delta=\frac {|A_1|+|A_2|}{|B|},\ \  \boldsymbol \pi_I=\frac 1{|A_1|+|A_2|} \left (
846: |A_1|,|A_2|\right ),$$ where $B=\begin{pmatrix}
847:     p_{1212} & p_{1221} \\
848:     p_{2112} & p_{2121}
849: \end{pmatrix}$,
850: $$A_1=\begin{pmatrix}
851:       p_{1111} & p_{1112} & p_{1121}\\
852:       p_{1211} & p_{1212} & p_{1221}\\
853:       p_{2111} & p_{2112} & p_{2121}\\
854: \end{pmatrix}, \ \
855: A_2=\begin{pmatrix}
856:     p_{1212} & p_{1221} & p_{1222}\\
857:     p_{2112} & p_{2121} & p_{2122}\\
858:     p_{2212} & p_{2221} & p_{2222}
859: \end{pmatrix}.
860: $$
861: \end{linenomath}
862: \item More generally, for the $\kappa$-state GM+I model on $T_{ab|cd}$,
863: the invariable site parameters can be recovered
864: from a generic point in the image of the parameterization map by
865: rational formulas of the form
866: \begin{linenomath}
867: $$\delta=\frac
868: {\sum_{i\in[\kappa]}|A_i|}{|B|}, \ \ \boldsymbol \pi_I=\frac
1{\sum_{i\in[\kappa]} |A_i|} \left ( |A_1|,|A_2|,\dots, |A_\kappa|\right
870: ).$$
871: \end{linenomath}
872: Here $|B|$ is any $\kappa \times \kappa$ minor of $F_{ab|cd}$
873: that omits the all rows and columns indexed by $ii$, and $|A_i|$ is
874: the $(\kappa+1)\times(\kappa+1)$ minor obtained by including all
875: rows and columns chosen for $B$ and in addition the $ii$ row and
876: $ii$ column.
877: \end{enumerate}
878: \end{prop}
879: 
880: \begin{pf} We
881:   give the complete argument in the case $\kappa = 2$ first.  For a
882:   joint distribution $P \in \im(\phi_T)$, write $F_{ab \mid cd} =
883:   (1-\delta) F_{GM} + \delta F_I$ as in equation (\ref{eq:Psum}).  Since
884: $A_1$ is the `upper left'  $3 \times 3$ submatrix of $F_{ab \mid
885: cd}$, using linearity properties of the determinant, and that all $3
886: \times 3$ minors of $F_{GM}$ evaluate to zero, we observe that
887: \begin{linenomath}
888: \begin{align*}
889:   \vert A_1 \vert
890: &=(1 - \delta)^3 \left|
891: \begin{matrix}
892:   \tilde p_{1111} & \tilde p_{1112} & \tilde p_{1121} \\
893:   \tilde p_{1211} & \tilde p_{1212} & \tilde p_{1221} \\
894:   \tilde p_{2111} & \tilde p_{2112} & \tilde p_{2121} \\
895: \end{matrix}
896: \right| + \left|
897: \begin{matrix}
898:   \delta \pi_I(1) & 0 &\ 0 \\
899:   0 & (1-\delta) \tilde p_{1212} &\ (1-\delta) \tilde p_{1221} \\
900:   0 & (1-\delta) \tilde p_{2112} &\ (1-\delta) \tilde p_{2121} \\
901: \end{matrix}\right|\\
902: \\
903: &= \delta \pi_I(1) \left|
904: \begin{matrix}
905:   (1-\delta)\tilde p_{1212} &\  (1-\delta)\tilde p_{1221} \\
906:  (1-\delta) \tilde p_{2112} &\ (1-\delta )\tilde p_{2121} \\
907: \end{matrix}\right|.
908: \end{align*}
909: \end{linenomath}
910: Thus we have $\vert A_1 \vert = \delta \pi_I(1) \vert B \vert$. Now,
911: if $\vert B \vert \neq 0$, then
912: \begin{linenomath}
913: $$
914: \delta \pi_I(1) = \frac{\vert A_1 \vert}{\vert B \vert}.
915: $$
916: \end{linenomath}
917: As $|B|$ does not vanish on all of $V_T$, we have a rational formula
918: to compute $\delta \pi_I(1)$ for generic points on $V_T$.
919: 
920: Similarly, since $A_2$ is the `lower right' submatrix of $F_{ab \mid
921: cd}$, then
922: \begin{linenomath}
923: $$\delta \pi_I(2) =
924: \frac{\vert A_2 \vert}{\vert B \vert}.
925: $$
926: \end{linenomath}
927: Adding these together, we obtain the stated rational expression for
928: $\delta$.
929: 
930: 
931: Assuming additionally the generic condition that $\delta \neq 0$, then we find
932: \begin{linenomath}
933: $$\boldsymbol \pi_I =\left ( \frac{\vert A_1
934:   \vert}{\vert A_1\vert + \vert A_2 \vert},
935: \frac{\vert A_2
936:   \vert}{\vert A_1 \vert + \vert A_2 \vert}\right ).$$
937: \end{linenomath}
938: Thus the parameters $\delta, \boldsymbol \pi_I$ are
939: generically identifiable for GM+I on $T$.
940: 
941: One readily sees the argument above can be modified for arbitrary
942: $\kappa$.\hfill\qed
943: \end{pf}
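
The modification for general $\kappa$ can be sketched as follows. This
is our reading of the argument, with the rows and columns of $A_i$
ordered so that the $ii$ row and $ii$ column come first (another
ordering may change $|A_i|$ by a sign). Write $\widetilde A_i$,
$\widetilde B$ for the corresponding submatrices of $F_{GM}$. Since
$F_I$ is non-zero only in its $(ii,ii)$ entries, $B=(1-\delta)\widetilde
B$, and $A_i$ differs from $(1-\delta)\widetilde A_i$ only by the
addition of $\delta\pi_I(i)$ to its upper left entry. By linearity of
the determinant in the first row,
\begin{linenomath}
$$|A_i|=(1-\delta)^{\kappa+1}|\widetilde A_i|
+\delta\pi_I(i)\,(1-\delta)^{\kappa}|\widetilde B|
=\delta\pi_I(i)\,|B|,$$
\end{linenomath}
where the first term vanishes because all $(\kappa+1)\times(\kappa+1)$
minors of $F_{GM}$ are zero. Summing over $i\in[\kappa]$ and using
$\sum_i\pi_I(i)=1$ gives $\sum_i|A_i|=\delta|B|$, from which the stated
formulas follow when $|B|\ne 0$ and $\delta\ne 0$.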
944: 
945: Note that when $\kappa>2$ the above proposition gives many
946: alternative rational formulas for the invariable site parameters, as
947: there are many options for choosing the matrix $B$.
948: 
949: 
950: We now obtain our main result.
951: 
952: \begin{thm}\label{thm:genericIdent}
953: The $\kappa$-state GM+I model on $n$-taxon binary trees, with $n\ge
954: 4$, $\kappa \ge 2$, is generically locally identifiable.
Furthermore, for an $n$-taxon tree with vertex set $V$, the fibers of
generic points of $V_T$ under the parameterization map have
cardinality $(\kappa!)^{|V|-n}$. Thus for generic points, label
958: swapping at internal nodes is the only source of
959: non-identifiability.
960: \end{thm}
961: 
962: \begin{pf} Suppose $T$ is an $n$-taxon tree with $P=\phi_T(\mathbf
963: s)$. Choose some subset of 4 taxa, say $\{a,b,c,d\}$, and suppose
964: the induced quartet tree is $T_{ab|cd}$. Then $P_{abcd}$, the
965: 4-marginalization of $P$, is easily seen to be of the form
966: $P_{abcd}=\phi_{T_{ab|cd}}(\mathbf s_{abcd})$ where $\mathbf
967: s_{abcd}=g(\mathbf s)$ and $g$ is a surjective polynomial function.
968: But the tree
969: $T_{ab|cd}$ is generically identifiable by Proposition
970: \ref{prop:treeId}, and thus invariable site parameters in $\mathbf s_{abcd}$
971: are generically identifiable by Proposition \ref{prop:idformulas}.
972: As these coincide with the invariable site parameters in $\mathbf
973: s$, and generic conditions on $\mathbf s_{abcd}$ imply generic
974: conditions on $\mathbf s$, the invariable site parameters are
975: generically identifiable for the full $n$-taxon model.
976: 
977: As an $n$-taxon binary tree topology is determined
978: by the collection of all induced quartet tree topologies, one can now see
979: that $T$ is generically identifiable. Alternately,
980: using the identified invariable site parameters,
981: and assuming the additional
982: generic condition that $\delta\ne 1$, note that
983: \begin{linenomath}
984: $$P_{GM} =
985: \frac{1}{(1-\delta)} \left( P - \delta P_I\right)
986: $$
987: \end{linenomath}
988: is a joint
989: distribution arising from general Markov parameters. Thus generic
990: identifiability of the tree can also by obtained from
991: Steel's  result for the GM model \cite{S94} applied to $P_{GM}$.
992: 
993: 
994: The generic identifiability of the remaining numerical parameters follows
995: from Chang's argument \cite{MR97k:92011} applied to $P_{GM}$.
Chang's approach also indicates the cardinality of the generic fiber is
$(\kappa!)^{|V|-n}$, as the $\kappa$ states may be permuted
independently at each of the $|V|-n$ internal nodes.\hfill\qed
998: \end{pf}
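
Two steps of this proof can be written out explicitly; what follows is
an illustrative sketch rather than part of the argument. First, since
$P_I$ is supported on the constant patterns, the passage from $P$ to
$P_{GM}$ is, entry by entry,
\begin{linenomath}
$$(P_{GM})_{i_1\dots i_n}=
\begin{cases}
\dfrac{p_{i\dots i}-\delta\,\pi_I(i)}{1-\delta}, & i_1=\dots=i_n=i,\\[1.5ex]
\dfrac{p_{i_1\dots i_n}}{1-\delta}, & \text{otherwise}.
\end{cases}$$
\end{linenomath}
Second, reading the label swapping count as $\kappa!$ relabelings at each
internal node, and noting that a binary $n$-taxon tree has $n-2$
internal nodes, the generic fiber has $(\kappa!)^{n-2}$ points. For
$n=4$ this is $(2!)^2=4$ when $\kappa=2$ and $(4!)^2=576$ when
$\kappa=4$.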
999: 
1000: 
1001: 
1002: 
1003: \section{Estimating Invariable Sites Parameters}\label{sec:estInv}
1004: 
1005: The concrete result in Proposition \ref{prop:idformulas} gives
1006: explicit rational formulas for recovering parameters relating to
1007: invariable sites from the joint distribution. These can be viewed as
1008: generalizations of the formulas found in \cite{SHL00} for
1009: group-based models. As \cite{SHL00} develops the group-based model
1010: formulas into a heuristic means of estimating the invariable site
1011: parameters from data without performing a full Maximum Likelihood
fit of data to a tree under an $\mathcal M$+I model, one might
1013: suspect the formulas of Proposition \ref{prop:idformulas} could be
1014: used similarly without the need to assume $\mathcal M$ was
1015: group-based, or approximately group-based.
1016:  We emphasize that however useful such an
1017: estimate might be, it would not be intended to replace a more
1018: statistical but time-consuming computation, such as obtaining the
1019: Maximum Likelihood estimates for these parameters.
1020: 
1021: However, it is by no means obvious how to use these formulas well
1022: even for a heuristic estimate. First, for a 4-taxon tree
1023: we have many choices for the
1024: matrix $B$, in fact
1025: \begin{linenomath}
1026: $$\binom{\kappa^2-\kappa}{\kappa}^2$$
1027: \end{linenomath}
1028: of them, so even for $\kappa=4$, there are 245,025 basic sets of the
1029: formulae. Moreover, while these simple formulae
1030: emerged from our method of proof, one could in fact modify them by
1031: adding to any of them a rational function whose numerator is a
1032: phylogenetic invariant for the GM+I model, and whose denominator is
1033: not. Since the invariant vanishes on any joint distribution arising
1034: from the model, the resulting formulae will still recover invariable
1035: site information for generic parameters. Thus there are actually
1036: infinitely many formulas for recovering invariable site parameters.
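
The count of choices for $B$ can be checked directly for small
$\kappa$, since $B$ is determined by choosing $\kappa$ of the
$\kappa^2-\kappa$ rows, and likewise $\kappa$ of the columns, not
indexed by $ii$:
\begin{linenomath}
$$\binom{2}{2}^2=1,\qquad \binom{6}{3}^2=20^2=400,\qquad
\binom{12}{4}^2=495^2=245{,}025,$$
\end{linenomath}
for $\kappa=2,3,4$ respectively. In particular, for $\kappa=2$ the matrix
$B$ of Proposition \ref{prop:idformulas} is the only choice. As an
example of the modified formulas, if $f$ is any invariant for the GM+I
model and $c$ any constant, then
\begin{linenomath}
$$\delta=\frac{\sum_{i\in[\kappa]}|A_i|}{|B|}+c\,\frac{f}{|B|}$$
\end{linenomath}
also holds at generic points of the model, since $f$ vanishes there
while $|B|$ does not.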
1037: 
1038: One can nonetheless consider simple averaging schemes using only the
1039: basic formulas of Proposition \ref{prop:idformulas} and find that on
1040: simulated data they perform quite well at approximately recovering
1041: invariable site parameters from empirical distributions. However,
averaging the large number of formulas given here, and then also
1043: averaging over a large sample of quartets,
1044:  as is proposed in \cite{SHL00}, is more
1045: time consuming than one might wish for a fast heuristic. Moreover,
1046: one must be aware that the denominator in these formulas may vanish
1047: on an empirical distribution --- it is certain to be non-zero only
1048: for true distributions for GM+I arising from generic parameters.
1049: 
1050: Nonetheless, it would be of interest to develop versions of these
1051: formulas with good statistical estimation properties, as the GM+I
1052: model encompasses models such as the GTR+I model which is often
1053: preferred in biological data analysis to group-based+I models. Of
1054: course addressing more general rate-variation models would be even
1055: more desirable, though our results here are not sufficient for that.
1056: 
1057: 
1058: 
1059: 
1060: \appendix
1061: 
1062: \section{Code for Computational Algebra Software}\label{app:code}
1063: 
1064: The following code is also available on the authors' websites.
1065: 
1066: \subsection{Computation for Proposition \ref{prop:invariantsK2} }
1067: 
1068: To show the variety has dimension 13, we execute the following Maple code:
1069: 
1070: {
1071: \scriptsize
1072: \begin{verbatim}
1073: pa := Matrix([[p,1-p]]); Mae := Matrix([[1-a,a],[r,1-r]]);
1074: Meb := Matrix([[1-b,b],[s,1-s]]); Mef := Matrix([[1-e,e],[t,1-t]]);
1075: Mfc := Matrix([[1-c,c],[u,1-u]]); Mfd := Matrix([[1-d,d],[v,1-v]]);
1076: P := Array(1..2,1..2,1..2,1..2);
1077: for i from 1 to 2 do for j from 1 to 2 do for k from 1 to 2 do for l from 1 to 2 do
1078:   P[i,j,k,l]:=0;
1079:   for m from 1 to 2 do  for n from 1 to 2 do
1080:     P[i,j,k,l]:=P[i,j,k,l]+pa[1,i]*Mae[i,m]*Meb[m,j]*Mef[m,n]*Mfc[n,k]*Mfd[n,l];
1081:   od;od;
1082:   P[i,j,k,l]:=(1-w)*P[i,j,k,l];
1083: od;od;od;od;
1084: P[1,1,1,1]:=P[1,1,1,1]+w*q: P[2,2,2,2]:=P[2,2,2,2]+w*(1-q):
1085: Q:=ListTools[Flatten](convert(P,listlist)):
1086: J:=VectorCalculus[Jacobian](Q,[a,b,c,d,e,r,s,t,u,v,p,q,w]):
1087: K:=subs({a=1/3,b=1/5,c=1/7,d=1/11,e=1/13,r=1/17,s=1/19,t=1/23,u=1/29,v=1/31,
1088:                                                           p=1/3,q=1/5,w=1/7},J):
1089: LinearAlgebra[Rank](K);
1090: \end{verbatim}
1091: }
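
For readers matching this code to the text, our reading of the Maple
variables is as follows: the tree has internal nodes $e$ and $f$ and is
rooted at the leaf $a$, with root distribution $(p,1-p)$; the letters
$a,\dots,e$ and $r,\dots,v$ are the off-diagonal entries of the five
Markov matrices; and $w$, $q$ play the roles of $\delta$ and
$\pi_I(1)$. With $\pi_a=(p,1-p)$, the array $\mathtt P$ then holds
\begin{linenomath}
$$p_{ijkl}=(1-w)\,\pi_a(i)\sum_{m,n=1}^{2}
M_{ae}(i,m)M_{eb}(m,j)M_{ef}(m,n)M_{fc}(n,k)M_{fd}(n,l)
+w\,\epsilon_{ijkl},$$
\end{linenomath}
where $\epsilon_{1111}=q$, $\epsilon_{2222}=1-q$, and
$\epsilon_{ijkl}=0$ otherwise. Since the Jacobian has $13$ columns, and
its rank at any single point is a lower bound for its generic rank, a
rank of $13$ at the chosen rational point shows that the generic rank
is $13$.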
1092: 
1093: 
1094: Using Singular \cite{sing}, we complete the proof:
1095: 
1096: {
1097: \scriptsize
1098: \begin{verbatim}
1099: LIB "matrix.lib";  LIB "primdec.lib";
1100: ring r = 0, (p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15),dp;
1101: // Define matrix flattening F_{ab | cd} and polys fs, f1, f2
1102: matrix Fab[4][4]=p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15;
1103: matrix UR[3][3]=submat(Fab,1..3,2..4); matrix LL[3][3]=submat(Fab,2..4,1..3);
1104: poly f1=det(UR); poly f2=det(LL);
1105: poly fs = p0+p1+p2+p3+p4+p5+p6+p7+p8+p9+p10+p11+p12+p13+p14+p15-1;
1106: ideal I = fs,f1,f2;   // define ideal I
1107: dim(std(I));          // compute dimension of r/I
1108: primdecGTZ(I);        // compute primary decomposition of I to show prime
1109: \end{verbatim}
1110: }
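
The indexing convention in this code, as we read it, is that
$\mathtt{p}m$ stands for $p_{ijkl}$ with
$m=8(i-1)+4(j-1)+2(k-1)+(l-1)$, so that $\mathtt{p0}=p_{1111}$ and
$\mathtt{p15}=p_{2222}$. Because Singular fills a matrix row by row,
$\mathtt{Fab}$ is the flattening
\begin{linenomath}
$$F_{ab|cd}=\begin{pmatrix}
    p_{1111} & p_{1112} & p_{1121} & p_{1122}\\
    p_{1211} & p_{1212} & p_{1221} & p_{1222}\\
    p_{2111} & p_{2112} & p_{2121} & p_{2122}\\
    p_{2211} & p_{2212} & p_{2221} & p_{2222}
\end{pmatrix},$$
\end{linenomath}
with rows indexed by the states at $a,b$ and columns by the states at
$c,d$. The submatrices $\mathtt{UR}$ and $\mathtt{LL}$ omit the last row
and first column, and the first row and last column, respectively, so
neither $\mathtt{f1}$ nor $\mathtt{f2}$ involves $p_{1111}$ or
$p_{2222}$.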
1111: 
1112: 
1113: \subsection{Computation for intersections of $V_{ab|cd},V_{ac|bd},V_{ad|bc}$}
1114: 
1115: Continuing the Singular session above, we execute the following:
1116: 
1117: {
1118: \scriptsize
1119: \begin{verbatim}
1120: /* Define ideals Iac, Iad corresponding to two alternative tree
1121:    topologies for 4-taxon trees.  (So, I = Iab in this notation.)   */
1122: // Flattening for ac | bd split
1123: matrix Fac[4][4]=p0,p1,p4,p5,p2,p3,p6,p7,p8,p9,p12,p13,p10,p11,p14,p15;
1124: poly f3=det(submat(Fac,1..3,2..4)); poly f4=det(submat(Fac,2..4,1..3));
1125: ideal Iac = fs,f3,f4;
1126: // Flattening for  ad | bc split
1127: matrix Fad[4][4]=p0,p2,p4,p6,p1,p3,p5,p7,p8,p10,p12,p14,p9,p11,p13,p15;
1128: poly f5=det(submat(Fad,1..3,2..4)); poly f6=det(submat(Fad,2..4,1..3));
1129: ideal Iad = fs,f5,f6;
1130: reduce(f1,std(Iac));  // non-zero answer shows f1 not in Iac
1131: reduce(Iac,std(I));   // non-zero shows f3,f4 not in I
1132: ideal J = I,Iac; dim(std(J));  // show dim is 11
1133: ideal K = J,Iad; dim(std(K));  // show dim is 11
1134: primdecGTZ(K);        // show K prime, and thus ideal for star tree
1135: \end{verbatim}
1136: }
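
The decomposition of $I_{ab|cd}+I_{ac|bd}$ into two primes mentioned in
the text is not covered by the commands above. A minimal sketch of how
it might be checked, assuming the same session (so that $\mathtt J$ is
defined) and using the list of pairs returned by $\mathtt{primdecGTZ}$,
is:

{
\scriptsize
\begin{verbatim}
// Sketch only: not part of the original computation.
list pd = primdecGTZ(J);   // each entry is [primary ideal, associated prime]
size(pd);                  // number of components; the text reports 2
dim(std(pd[1][2]));        // dimension of the first prime
dim(std(pd[2][2]));        // dimension of the second prime
\end{verbatim}
}

Comparing $\mathtt{pd[i][1]}$ with $\mathtt{pd[i][2]}$ indicates whether
each primary component is itself prime.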
1137: 
1138: \subsection{Computation of 2-state GM+I ideal, 4-taxon trees, using Proposition \ref{prop:elim} }
1139: 
1140: The following Singular code performs the needed elimination for a binary tree:
1141: 
1142: {
1143: \scriptsize
1144: \begin{verbatim}
1145: ideal Igm = minor(Fab,3);
1146: // Eliminate the `diagonal' variables
1147: ideal Igmi = elim1(Igm,p0*p15);
1148: \end{verbatim}
1149: }
1150: 
1151: For the star tree, the 2-state GM ideal is known from \cite{ARgm}.
1152: Thus elimination can be used to find GM+I invariants. We also show
1153: this result agrees with $\mathtt K$ above.
1154: 
1155: {
1156: \scriptsize
1157: \begin{verbatim}
1158: ideal Igm = minor(Fab,3),minor(Fac,3),minor(Fad,3);
1159: // Eliminate the `diagonal' variables
1160: ideal Igmi = elim1(Igm,p0*p15),fs;
1161: reduce(K,std(Igmi));  // all 0's indicates ideal containment
1162: reduce(Igmi,std(K));  // all 0's indicates ideal containment
1163: \end{verbatim}
1164: }
1165: 
1166: \bibliographystyle{elsart-num} \bibliography{Phylo}
1167: 
1168: \end{document}
1169: