1: \documentclass[a4paper]{article}
2: \usepackage[latin1]{inputenc}
3: \usepackage[dvips]{graphics}
4: \usepackage[small]{caption}
5: \renewcommand{\figurename}{Fig.}
6: \title{Molecular Phylogenetic Analyses and Real Life Data}
7: \author{Kerstin Hoef-Emden}
8: 
9: \begin{document}
10: 
11: \maketitle
12: 
13: \begin{center}
14: 
15: Universität zu Köln, Botanisches Institut, Lehrstuhl I, Gyrhofstr. 15,
16: 
17: 50931 Köln, Germany
18: 
19: e-mail: kerstin.hoef-emden@uni-koeln.de
20: 
21: \end{center}
22: 
23: \bigskip
24: 
25: \section{What is Molecular Phylogeny?}
26: 
27: Most probably, all life existing today on earth shares a common ancestry
28: billions of years back in the past. A set of indispensable genes necessary
29: for maintenance of basic cell functions was passed on from the unknown
30: common ancestor to its extant descendants by asexual and/or sexual
31: reproduction. During the course of evolution, the genes, the numbers of
32: genes, their functions and the sizes of the genomes (i.e.\ the total DNA
33: content of a cell) became modified. If genes originate from a common
34: ancestor gene and fulfill the same function in a cell, they are said to be
35: homologous. The degree of divergence between homologous genes is considered
36: a measure for their relatedness (and also for the relatedness of the
37: organisms).
38: 
39: In molecular phylogeny, the relationships among, usually extant, organisms
40: are examined by comparing homologous DNA or protein sequences (i.e.\ the gene
41: products). The relationships are displayed as trees with branch (or edge)
42: lengths reflecting the degrees of genetic divergence. Each branch tip
43: represents an extant sequence; the internal nodes or vertices represent
44: unknown ancestors to the terminal nodes. The branching pattern and branch
45: lengths describe the evolutionary pathways leading to the sequences at the
46: terminal nodes. Clusters of terminal branches connected to a common ancestor
47: are termed clades.
48: 
49: The construction of phylogenetic trees has been shown to be an NP-hard
50: problem; the number of possible trees increases exponentially with the
51: number of DNA or protein sequences included in the phylogenetic analyses
52: \cite{Steel1992}. Due to the large amount of data and the complexity of the
53: task, phylogenetic trees cannot be inferred without the help of computers.
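
To make this growth concrete (a standard counting result, added here for
illustration; the notation is not part of the original text): for
\( n \ge 3 \) sequences, the number of distinct unrooted bifurcating tree
topologies is
\[
N_{\mathrm{unrooted}}(n) = (2n-5)!! = \prod_{k=3}^{n} (2k-5),
\]
and the number of rooted bifurcating topologies is
\( N_{\mathrm{rooted}}(n) = (2n-3)!! \). For only \( n = 10 \) sequences this
already gives \( N_{\mathrm{unrooted}} = 2\,027\,025 \) and
\( N_{\mathrm{rooted}} = 34\,459\,425 \) topologies, and each additional
sequence multiplies the count by a factor that itself keeps growing.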
54: 
55: Numerous studies addressing the problems of molecular phylogenetic analysis
56: methods in theory or practice have been published. The first publications on
57: phylogenetic methods date back to the 1960s. The methods and evolutionary
58: models were refined in the course of time, but problems still remain. The
59: references cited in this review represent only a few examples from a vast
60: body of literature. Also, only some of the most widely used methods in molecular
61: phylogeny are presented.
62: 
63: For digging into the mathematics behind the phylogenetic analyses methods
64: introduced below, one may start with Joe Felsenstein's book
65: \cite{Felsenstein2003}.
66: 
67: \section{Phylogenetic Analyses Methods}
68: 
69: DNA sequences are based on a four-letter code representing the four
70: nucleotides (A for adenine, C for cytosine, G for guanine, T for thymine),
71: whereas protein sequences are based on a twenty-letter code representing the
72: twenty different amino acids. Prior to the phylogenetic analyses, an
73: alignment of the sequences has to be assembled (the single sequence is also
74: termed a ``taxon'', because it represents a species, genus, individual or
75: strain). If sequences of homologous genes differ in length, e.g.\
76: due to insertions or deletions, gaps have to be inserted to place
77: functionally corresponding positions in the same vertical column of the
78: alignment (Fig.\ 1).
79: \begin{figure}[h] 
80:  \begin{center}
81:   \includegraphics{hoef1.eps}
82:   \caption{Excerpt from an alignment of nuclear ITS2 sequences. The ITS2 or
83: internal transcribed spacer 2 extends between two RNA coding genes of the
84: ribosomal operon. The ribosomal operon is transcribed in one piece. The two
85: internal transcribed spacers between the RNA coding regions fold up in a
86: specific way and are excised. Since the two ITS regions solely function as
87: spacers, they are under low selective pressure and, thus, display high
88: mutation rates. The example alignment shows ITS2 regions of closely related
89: organisms belonging to one genus. The sequences are oriented in horizontal
90: direction, whereas functionally corresponding positions are arranged in
91: columns. Several gaps had to be inserted due to insertions of nucleotides in
92: the sequences 1 and 5.}
93:  \end{center}
94: \end{figure}
95: Non-alignable regions such as insertions of several nucleotides need to be
96: excluded from the phylogenetic analyses. Improperly aligned sequences or
97: inclusions of non-alignable regions in the phylogenetic analyses may result
98: in artefactual phylogenetic trees.
99: 
100: In most standard methods for inferring phylogenetic trees, an optimality
101: criterion and a tree search algorithm have to be chosen. The optimality
102: criterion is used to determine the best among the considered trees by
103: defining a type of ``scoring'' system. Optimality criteria are e.g.\ maximum
104: parsimony, distance matrix or maximum likelihood \cite{Felsenstein2003}.
105: 
106: In unweighted maximum parsimony, each mutation from one nucleotide or amino
107: acid to another, e.g.\ from a C to a G, costs one ``penalty'' point. All point
108: mutations are considered equally likely. The mutations along a given tree
109: are summed up and the best tree or maximum parsimony tree is the one with
110: the lowest sum of penalty points. Unweighted maximum parsimony uses integer
111: values and often several to many equally parsimonious trees are found.
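
As a minimal illustration of this scoring (the notation and the four-taxon
example are hypothetical and only serve to show the principle), the length
of a tree \( T \) under unweighted parsimony can be written as
\[
L(T) = \sum_{j=1}^{m} \; \min_{\mathrm{ancestral\ states}} \;
\sum_{(u,v) \in T} c\left(x_u^{(j)}, x_v^{(j)}\right),
\qquad
c(a,b) = \left\{ \begin{array}{ll} 0 & \mbox{if } a = b, \\
1 & \mbox{if } a \neq b, \end{array} \right.
\]
where \( m \) is the number of alignment columns, the inner sum runs over
all branches \( (u,v) \) of the tree, and \( x_u^{(j)} \) is the character
state at node \( u \) in column \( j \). Consider one column in which taxa
1 and 2 show a C, and taxa 3 and 4 show a G. On the tree ((1,2),(3,4)) a
single substitution on the internal branch explains the column, so this
column contributes 1 to \( L(T) \). On the trees ((1,3),(2,4)) and
((1,4),(2,3)) at least two substitutions are required, so the column
contributes 2. The first tree is therefore the more parsimonious one for
this column.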
112: 
113: In distance analyses, the sequences are compared pairwise. Their genetic
114: divergences are transformed into distance values and listed in a triangular
115: distance matrix. Whereas maximum parsimony treats all mutations as equally
116: likely, the computation of distance matrices allows for different mutation
117: rates and other variations of parameters (i.e.\ evolutionary models, see
118: section below). To infer trees from a distance matrix, usually the
119: neighbor-joining algorithm is used (see below).
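
One concrete example of such a transformation is the Jukes-Cantor
correction. It is the simplest substitution model, is not discussed
further in this review, and is used here only as an illustration. If
\( p \) is the observed proportion of differing positions between two
aligned DNA sequences, the estimated number of substitutions per site is
\[
d = -\frac{3}{4} \ln \left( 1 - \frac{4}{3}\, p \right).
\]
For \( p = 0.10 \) this gives \( d \approx 0.107 \), and for \( p = 0.50 \)
it gives \( d \approx 0.824 \). The corrected distance exceeds the observed
proportion because multiple substitutions at the same position are taken
into account, and the correction grows as the sequences become more
divergent.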
120: 
121: Maximum likelihood is a probabilistic method and computationally the most costly
122: one (Fig.\ 2).
123: \begin{figure}[htbp]
124:  \begin{center}
125:   \includegraphics{hoef2.eps}
126:   \caption{Computation of the likelihood of a tree. To obtain the overall
127: likelihood value of a tree, for each position of the alignment the
128: probabilities of all possible combinations of ancestral character states are
129: computed. The site-wise likelihood is the sum of all these probabilities.
130: The site-wise likelihoods are then multiplied (i.e.\ their logarithms are summed) to
131: give the log likelihood of a given tree.}
132:  \end{center}
133: \end{figure}
134: It searches for the tree that maximizes the probability of observing the
135: data. The likelihood of a tree is usually expressed as its negative natural logarithm.
136: The maximum likelihood method also allows for different evolutionary models,
137: but differs from distance matrix methods in that it uses discrete characters
138: and may result in more than one optimal tree (however, rarely more than
139: two).
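
Written out as a sketch (the symbols \( m \), \( \theta \) and \( L_j \) are
notation introduced here, and independence of the alignment columns is
assumed), the likelihood of a tree \( T \) with evolutionary model
parameters \( \theta \) is
\[
L(T) = \prod_{j=1}^{m} L_j(T),
\qquad
\ln L(T) = \sum_{j=1}^{m} \ln L_j(T),
\]
where \( m \) is the number of alignment columns and the site-wise
likelihood
\[
L_j(T) = \sum_{\mathrm{ancestral\ states}}
P(\mathrm{column\ } j,\ \mathrm{ancestral\ states} \mid T, \theta)
\]
sums over all possible assignments of character states to the internal
nodes. For DNA sequences and an unrooted four-taxon tree with two internal
nodes, this sum has \( 4^2 = 16 \) terms per column. Since each \( L_j \) is
a probability smaller than one, the product quickly becomes extremely
small, which is why the logarithm is used in practice.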
140: 
141: The numbers of sequences used to infer phylogenetic trees in biological
142: research projects almost always prohibited exhaustive searches of the
143: complete tree space due to limitations of computation time. Thus, maximum
144: parsimony or maximum likelihood were usually combined with heuristic tree
145: search algorithms. For a heuristic search a first tree is generated e.g.\ by
146: adding the sequences step-by-step to the growing tree. This first tree is
147: then subjected to local and/or global rearrangements by swapping internal
148: branches or cutting the tree into pieces and rejoining the parts in
149: different places. This procedure is supposed to overcome potential local
150: optima and to find the global optimum. The construction of a tree by
151: neighbor-joining, the preferred method used with distance matrices, starts
152: with a star-like tree. The pair of sequences with the lowest genetic
153: divergence is joined (i.e.\ they are said to be neighbors) and the distance
154: matrix recalculated. These steps are repeated with the next closest related
155: sequences or clusters of sequences until the tree is completely resolved.
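
More precisely, in the commonly used formulation of the neighbor-joining
criterion (given here as a sketch; the notation is introduced for
illustration and does not appear in the original text), the pair
\( (i,j) \) joined in each step is the one that minimizes
\[
Q_{ij} = (n-2)\, d_{ij} - \sum_{k=1}^{n} d_{ik} - \sum_{k=1}^{n} d_{jk},
\]
where \( n \) is the current number of sequences or clusters and
\( d_{ij} \) are the entries of the distance matrix. The raw distance is
thus corrected for the average divergence of \( i \) and \( j \) from all
other sequences, so that two fast-evolving sequences are not joined merely
because both are distant from everything else. After \( i \) and \( j \)
have been joined into a new node \( u \), the distances of \( u \) to each
remaining sequence \( k \) are recalculated as
\[
d_{uk} = \frac{1}{2} \left( d_{ik} + d_{jk} - d_{ij} \right).
\]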
156: 
157: In Bayesian analyses, posterior probabilities for trees and evolutionary
158: parameters are calculated using Bayes' theorem
159: \cite{HuelsenbeckEtAl2001}. With the Bayes formula the posterior probability
160: of a tree given the data is calculated using prior probabilities of the data
161: and the tree, and the likelihood of a tree. Since it is impossible to
162: calculate all trees and evolutionary parameters from the space of the joint
163: posterior probability distribution, samples are drawn using
164: Metropolis-coupled Markov chain Monte Carlo simulations. This means that, at the
165: start of a Bayesian analysis, several chains are initialized to search for
166: the global optimum in the space of the joint posterior probability
167: distribution. Once initialized, the chains cross the space for several
168: hundred thousand to millions of generations by slightly modifying the
169: parameters (tree topology, branch lengths, evolutionary model parameters).
170: Trees and evolutionary model parameters are sampled only from the cold
171: chain; the other so-called heated chains traverse the space more easily and
172: exchange their status data from time to time with the cold chain. By doing
173: so, the heated chains help the cold chain to reach the global optimum, which
174: comprises a set of the best trees and evolutionary parameters. The presumed
175: global optimum is found when the likelihoods of the trees sampled from the
176: cold chain reach stationarity.
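
In symbols (a sketch with notation introduced here: \( \tau \) for the tree
with its branch lengths, \( \theta \) for the evolutionary model parameters
and \( D \) for the alignment), Bayes' theorem reads
\[
P(\tau, \theta \mid D) =
\frac{P(D \mid \tau, \theta)\, P(\tau, \theta)}{P(D)}.
\]
The denominator \( P(D) \) would require summing over all trees and
integrating over all parameter values, which is not feasible. In a
Metropolis step with a symmetric proposal, a move from the current state
\( (\tau, \theta) \) to a proposed state \( (\tau', \theta') \) is accepted
with probability
\[
r = \min \left( 1,\
\frac{P(D \mid \tau', \theta')\, P(\tau', \theta')}
{P(D \mid \tau, \theta)\, P(\tau, \theta)} \right),
\]
so that \( P(D) \) cancels and never has to be computed. A heated chain
typically samples from a flattened version of the posterior, for example
the posterior raised to a power \( 0 < \beta < 1 \), which lowers the
barriers between local optima.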
177: 
178: The phylogenetic trees inferred by the above-mentioned methods are usually
179: bifurcating trees. They may be rooted or unrooted. In rooted trees, the
180: most closely related sister group is used as an outgroup to define the direction of evolution in
181: the sequences. For example, to examine the relationships among chimpanzee, gorilla
182: and man, the orangutan would be the appropriate outgroup. Unrooted trees are
183: like looking onto the treetop from above without knowing where the stem is.
184: In unrooted trees it is not possible to tell where evolution started and in
185: which direction the sequences evolve.
186: 
187: \section{Models of Molecular Evolution}
188: 
189: In addition to exponentially growing numbers of possible trees,
190: phylogenetic analyses are further complicated by the fact that substitution
191: rates of nucleotides or amino acids may vary. Evolutionary models are an
192: attempt to approximate the complexity of molecular evolution as closely as
193: possible.
194: 
195: The proportions of the four nucleotides in a DNA sequence may differ from
196: gene to gene and, thus, need to be considered in phylogenetic analyses (base
197: frequencies). To account for differing substitution rates for the six types
198: of point mutations, a substitution rate matrix is used (Fig.\ 3A).
199: \begin{figure}[htbp]
200:  \begin{center}
201:   \includegraphics{hoef3.eps}
202:   \caption{Substitution rate matrices and among-site rate variation. Fig.\ 3A.
203: Examples for substitution rate matrices. To the left, the most complex type
204: implemented in phylogeny software programmes, the general time reversible
205: model (GTR) with six different substitution rates. To the right, a modified
206: GTR model, the Tamura-Nei model with three different mutation rates. Fig.\
207: 3B. Among-site rate variation in RNA and protein coding DNA. Sites with high
208: mutation rates are usually found in loop regions of RNA secondary structure,
209: whereas helices are more conserved (left). In protein coding DNA, the third
210: position of the codons is usually the most variable. The degenerate code
211: allows for several codons to represent the same amino acid. In this example,
212: codons for the amino acids serine, arginine and valine are shown. Between
213: DNA and protein, a transcription step to messenger RNA is necessary. Bold
214: face, positions with higher mutation rates. Fig.\ 3C. Modelling the
215: among-site rate variation using a gamma distribution. Examples for
216: continuous gamma distribution with different shape parameters to the left
217: and a discrete gamma distribution with seven rate categories to the right.
218: The discrete gamma distribution approximates a continuous gamma distribution
219: with a shape parameter \( \alpha \) of 1.}
220:  \end{center}
221: \end{figure}
222: However, depending on the positions in the alignment, these rates may be
223: higher or lower. Some positions are highly conserved and do not change at
224: all. Others evolve at differing rates (Fig.\ 3B). Both parameters, the
225: proportion of invariable sites and site-specific rate variation, modelled as
226: a gamma-distribution (Fig.\ 3C), belong to the among-site substitution rate
227: variation and can be explained by functional constraints on the gene
228: products.
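
As a sketch of how these two parameters enter the computation (the symbols
\( r \), \( K \), \( r_k \) and \( p_{\mathrm{inv}} \) are notation
introduced here), the relative rate \( r \) of a variable site is assumed
to follow a gamma distribution with mean 1 and shape parameter
\( \alpha \),
\[
f(r) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha - 1}\,
e^{-\alpha r}, \qquad r > 0,
\]
which has variance \( 1/\alpha \). A small \( \alpha \) (below 1) therefore
describes strong rate heterogeneity with many slowly evolving and a few
very fast sites, whereas a large \( \alpha \) describes nearly equal rates.
With a proportion \( p_{\mathrm{inv}} \) of invariable sites and a discrete
approximation with \( K \) equally probable rate categories
\( r_1, \ldots, r_K \), the likelihood of a single alignment column
\( j \) becomes
\[
L_j = p_{\mathrm{inv}}\, L_j(r = 0) +
(1 - p_{\mathrm{inv}})\, \frac{1}{K} \sum_{k=1}^{K} L_j(r_k).
\]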
229: 
230: For most data sets used in biological studies, it is impossible to infer
231: phylogenetic trees in a reasonable time by optimizing all likelihood
232: parameters at once during a maximum likelihood analysis, i.e.\ tree topology,
233: branch lengths of the trees, base frequencies, substitution rate matrix,
234: proportion of invariable sites and continuously gamma-distributed among-site
235: rate variation. A common approach is to determine first
236: the parameters of the evolutionary model that best fits the data
237: \cite{PosadaCrandall1998}. To find the appropriate evolutionary model, a
238: tree is inferred with a fast method (usually distance matrix with
239: neighbor-joining) and the likelihood values for this tree are calculated for
240: each available evolutionary model. The model that best fits the data is then
241: chosen by e.g.\ hierarchical likelihood ratio tests (hLRT) or by the Akaike
242: information criterion (AIC). Also, a discrete instead of a continuous
243: gamma-distributed among-site rate variation is used to reduce computation
244: times (Fig.\ 3C). Thus, during heuristic tree search only tree topology and
245: branch lengths need to be optimized, whereas the evolutionary model
246: parameters have been already estimated from the data set using an
247: approximate tree topology prior to the heuristic tree search.
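
The two selection criteria can be stated briefly (standard definitions;
the numerical values below are hypothetical and serve only as an
illustration). For two nested models with maximized likelihoods \( L_0 \)
(simpler model) and \( L_1 \) (more complex model), the likelihood ratio
test statistic is
\[
\delta = 2 \left( \ln L_1 - \ln L_0 \right),
\]
which is usually compared with a \( \chi^2 \) distribution whose degrees of
freedom equal the difference in the number of free parameters. The Akaike
information criterion of a model with \( k \) free parameters is
\[
\mathrm{AIC} = -2 \ln L + 2k,
\]
and the model with the smallest AIC is preferred. If, for example,
\( \ln L_0 = -5230.4 \) and \( \ln L_1 = -5221.1 \) with one additional
parameter, then \( \delta = 18.6 \), which exceeds the 5\% critical value of
3.84 for one degree of freedom, so the more complex model would be chosen.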
248: 
249: An additional evolutionary parameter, the covarion/covariotide model, takes
250: lineage-specific evolutionary rates into consideration, i.e.\ complete
251: sequences may evolve faster than others. The covarion/covariotide model,
252: however, has so far been implemented only in Bayesian phylogenetic analysis
253: programmes.
254: 
255: Protein coding DNA sequences are \textit{in vivo} first transcribed into
256: messenger RNA, then translated into a protein consisting of a string of
257: amino acids (Fig.\ 3B). The function of the protein is determined by folding
258: up into tertiary and quaternary structures and by amino acids with specific
259: chemical properties in specific positions. Maximum likelihood analyses of
260: DNA sequences are quite time intensive. Maximum likelihood analyses with 20
261: character states for the amino acids are even more time-consuming. Thus, in
262: protein phylogenies, substitution rate matrices were usually not computed
263: from the data sets; instead, pre-defined substitution rate matrices
264: empirically derived from large alignments of other proteins were used
265: \cite{Felsenstein2003}.
266: 
267: Phylogenetic trees can also be inferred from the DNA sequences of protein
268: coding genes, which, however, entails some pitfalls. In protein coding genes,
269: three nucleotides code for one amino acid, but the genetic code is
270: degenerate. This means that several three-nucleotide combinations may code
271: for the same amino acid (e.g.\ six codons are known to code for arginine,
272: leucine or serine; see Fig.\ 3B). As a consequence, a nucleotide change in
273: one codon position may be either without effect on the amino acid (= silent
274: or synonymous substitution), or cause a change of one amino acid to another
275: (= nonsynonymous substitution). Only nonsynonymous substitutions can result
276: in a loss or decrease of function and are thus subject to functional
277: constraints. However, the sophisticated evolutionary model parameters
278: mentioned above were developed in the first place to cope with RNA coding genes.
279: The three-nucleotide codon structure is ignored and synonymous and
280: nonsynonymous mutations are treated equally. Also, often several
281: evolutionary pathways are possible to evolve from one codon to another,
282: which further complicates the evolutionary model parameters. Often the third
283: positions of codons show nucleotide biases towards higher GC or AT contents.
284: 
285: However, from theoretical and simulation studies, but also empirically, it
286: became obvious that using wrong assumptions about the underlying
287: evolutionary processes may result in biased phylogenetic trees.
288:  
289: \section{Simulation Studies}
290: 
291: The accuracy of a method comprises consistency, efficiency and robustness. A
292: method is consistent, if it infers the correct phylogenetic tree with an
293: infinite amount of data. Efficiency describes the sensitivity of a method
294: concerning the lengths of sequences. The shorter the sequences can be for a
295: method to converge to the correct tree topology, the more efficient the
296: method is. Robustness considers using wrong assumptions about the underlying
297: evolutionary model. A method is robust, if it infers the correct
298: phylogenetic tree although a wrong evolutionary model was used. Since
299: biologists use DNA or protein sequences of finite lengths, in practice only
300: consistency and robustness of a method are of interest.
301: 
302: For example, in a simulation study by Huelsenbeck \cite{Huelsenbeck1995}, four-taxon
303: data sets of differing sequence lengths were generated \textit{in silico}
304: from a random starting sequence according to pre-specified evolutionary
305: models and phylogenetic trees (see parameter space in Fig.\ 4A).
306: \begin{figure}[htbp]
307:  \begin{center}
308:   \includegraphics{hoef4.eps}
309:   \caption{The long branch attraction artefact (LBA). Fig.\ 4A. The parameter
310: space with different tree topologies usually used in simulation studies with
311: four-taxon trees. Fig.\ 4B. An example of an LBA in a four-taxon tree. The
312: tree to the left corresponds to the tree in the top left corner of the
313: parameter space in Fig.\ 4A. The tree to the right shows the typical LBA bias.
314: The high evolutionary rates displayed by the long branches of the taxa A and
315: B cause reversals in the nucleotides, e.g.\ a C mutates to a G, then to a T and back
316: to a C. In combination with a high background noise, which blurs
317: phylogenetic signals, these reversals are presumably interpreted erroneously
318: as positives for genetic relatedness. The region in the parameter space
319: resulting in biased trees is also sometimes called the ``Felsenstein'' zone
320: of a method. This region is predominantly located in the top left, sometimes
321: extended to the top right of the parameter space shown in Fig.\ 4A. The
322: larger this ``Felsenstein'' zone is, the less robust the phylogenetic
323: method.}
324:   \end{center}
325: \end{figure} 
326: Different phylogenetic analyses methods were then used to infer trees from
327: the data sets and the conditions determined that caused the methods to infer
328: wrong tree topologies. The so-called long branch attraction artefact (LBA)
329: is the best-known phenomenon causing biased tree topologies. Usually,
330: LBAs were found in phylogenetic trees with extremely long terminal branches (i.e.\
331: branches with high evolutionary rates) but short internal branches (Fig.\
332: 4B). In most test situations, maximum likelihood outperformed other methods,
333: but it also failed to find the correct tree, if the assumed evolutionary
334: models were too different from the evolutionary processes under which the
335: simulated data sets had evolved.
336: 
337: \section{Phylogenetic Analyses and Real Life Data}
338: 
339: Since divergent branch lengths were almost always found in phylogenetic
340: analyses of \textit{in vivo} evolved sequences, the effects of potential
341: LBAs were a frequent matter of concern \cite{AndersonSwofford2004}.
342: Especially in large scale phylogenies comprising sequences of very different
343: organisms, long-branch taxa were often gathered ladder-like close to the
344: root of the trees, which may indicate a potential bias caused by LBAs. The
345: farther back in time the examined relationships of organisms reach, the
346: worse the resolution at the internal branches of a tree. It was found that
347: an addition of sequences to the data set and a complex evolutionary model
348: with a gamma-distributed among-site rate variation were the best options to
349: reduce artefacts in a phylogenetic tree \cite{Graybeal1998,BrunoHalpern1999}.
350: In particular, adding more sequences of the
351: problematic type could break up long branches, increase the resolution in
352: this part of a tree and thereby neutralise the LBA.
353: 
354: An example of how taxon sampling and choice of evolutionary model may affect
355: the results of a molecular phylogeny can be found in the cryptophytes, a
356: group consisting of microscopic flagellated unicells. Most of the genera in
357: this group are algae, i.e.\ they contain a pigmented plastid which is used to
358: turn the energy of light into chemical energy by photosynthesis. Two genera
359: are, however, colourless. \textit{Goniomonas} is phagotrophic; it feeds by
360: ingesting bacteria. The other genus, formerly classified as
361: \textit{Chilomonas}, feeds on organic molecules, but still harbours a
362: leukoplast, i.e.\ a colourless plastid. In a phylogenetic analysis with a low
363: number of nuclear 18S ribosomal DNA sequences, \textit{Goniomonas} and
364: ``\textit{Chilomonas}'' clustered together indicating a relationship of both
365: genera \cite{CavalierSmithEtAl1996}. In a later analysis, sequences of the
366: photosynthetic genus \textit{Cryptomonas} were added \cite{MarinEtAl1998}.
367: It turned out that \textit{Goniomonas} was the most basally diverging taxon,
368: whereas ``\textit{Chilomonas}'' was a colourless \textit{Cryptomonas}. The
369: clade with the genera \textit{Cryptomonas} and ``\textit{Chilomonas}'' seemed
370: to be the most basal group of the plastid-bearing cryptophytes. Thus, the
371: sisterhood of \textit{Goniomonas} and ``\textit{Chilomonas}'' was caused by
372: an LBA due to inappropriate taxon sampling. The analysis in
373: \cite{MarinEtAl1998}, however, was done using maximum likelihood under a
374: simple evolutionary model, i.e.\ without considering an among-site rate
375: variation. In a study using a complex evolutionary model with among-site
376: rate variation, the basal position of the
377: \textit{Cryptomonas}/``\textit{Chilomonas}'' clade was also shown to be an
378: artefact caused by long branch attraction \cite{HoefEmdenEtAl2002}.
379: 
380: Thus, long branch attraction artefacts are a real problem in phylogenies
381: inferred from \textit{in vivo} evolved sequences. The best options to cope
382: with LBAs, i.e.\ adding more taxa, and using complex evolutionary models and
383: robust methods, however, collide with another problem biologists were and
384: are still confronted with: computation times. The larger the number of
385: sequences, the more reliably the phylogenetic analysis methods work, but
386: exponentially more time is also needed to obtain results.
387: 
388: Bayesian analysis was introduced as a potentially faster alternative to
389: maximum likelihood analysis \cite{HuelsenbeckEtAl2001}. However, for large
390: data sets Markov chains often need to be run for more generations to reach a
391: plateau of likelihood values, which also increases computation times. In
392: addition, the posterior probabilities given for the different branches of
393: the consensus tree, in which the sampled trees are summarised, are more
394: optimistic than support values obtained from nonparametric bootstrapping
395: using the maximum likelihood criterion (i.e.\ a subsampling method with at
396: least 100, often more than 100 subsample data sets, to test the stability of
397: the branches of a tree). Bayesian analysis may be sped up by running the
398: different Markov chains on separate CPUs of a computing server or a cluster.
399: 
400: In heuristic tree searches using the maximum likelihood criterion, some
401: parallelised versions of programmes have been introduced e.g.\
402: \cite{StewartEtAl2001}. The tasks of tree generation and tree evaluation
403: were distributed among a master (tree generation and comparison) and worker
404: programmes (calculation of branch lengths and likelihoods).
405: 
406: Another attempt to decrease computation times was quartet-puzzling
407: \cite{StrimmervonHaeseler1996}. In quartet-puzzling, trees are computed for
408: quartets drawn from the n sequences of a larger data set using the maximum likelihood
409: criterion and weighted accordingly. The best of the three possible 4-trees
410: for each quartet is used to first assemble a large number of n-trees
411: (quartet-puzzling) and finally to obtain a consensus n-tree. This method is
412: much faster than a heuristic tree search, but more vulnerable to LBA. Among
413: hundreds to thousands of computed four-taxon trees, only a low number of
414: biased 4-trees suffices to pass on a topological error to the final n-tree.
415: In simulation studies, global character maximum likelihood almost always
416: outperformed quartet-puzzling or related methods \cite{RanwezGascuel2001}.
417: 
418: Other studies tried to overcome LBA and exponentially growing computing
419: times with longer sequences, e.g.\ by using complete genomes to infer
420: phylogenetic trees. Phylogenetic analyses of longer sequences increase the
421: computing times only linearly. Since sequencing of complete genomes needs
422: much more time and resources than that of single genes or smaller sets of
423: genes, the taxon sampling in these studies generally was lower. It has been
424: shown, however, that long sequences cannot compensate for an extended taxon
425: sampling. The low number of taxa included in a genome-scale analysis
426: resulted in high bootstrap support even for biased tree topologies
427: \cite{SoltisEtAl2004}. Also, genome-scale alignments cannot be refined
428: by eye anymore. They depend on automatic alignment algorithms, which may
429: perform badly by producing more or less biologically meaningless alignments
430: \cite{PollardEtAl2004}. A better option than using complete genomes
431: presumably is to sequence a set of genes, to refine the alignment of each
432: gene by eye, and to concatenate the genes \cite{BaptesteEtAl2002}.
433: 
434: Additional problems occur, if the evolution of a gene and/or a group of
435: organisms cannot be described by bifurcating trees. In sexually reproducing
436: populations, the examined gene may be present in differing alleles. Each
437: individual of a population inherits two alleles, one from its mother, the
438: other from its father. In addition, parts of the alleles can be exchanged by
439: genetic recombination. Genetic material may also be transferred between
440: unrelated organisms, e.g.\ by infection with viruses, by endosymbiosis or in
441: bacteria by exchange of plasmids. Whereas the inheritance of genes from
442: parents to child is called vertical gene transfer, the exchange of genetic
443: material between unrelated organisms is called lateral gene transfer. The
444: results of sexual reproduction or lateral gene transfers are genetic
445: chimaeras and reticulate evolutionary trees.
446: 
447: \section{Conclusions}
448: 
449: So far, there seems to be no easy way out of the treadmill of steeply
450: increasing computing times for phylogeneticists. New algorithms to reduce
451: time consumption in phylogenetic analysis have been proposed until recently,
452: e.g.\ \cite{GuindonGascuel2003}. However, only if the algorithms are offered
453: in software programmes suitable for the tasks of phylogenetic analysis, if
454: they are presented in an understandable way to biologists and if they
455: prove to be robust, will they be accepted and used.
456: 
457: \bigskip
458: 
459: \bibliographystyle{plain}
460: 
461: \begin{thebibliography}{99}
462: 
463: \bibitem{Steel1992} Steel M (1992) The complexity of reconstructing trees
464: from qualitative characters and subtrees. J. Classif. 9 (1): 91--116
465: 
466: \bibitem{Felsenstein2003} Felsenstein J (2003) Inferring phylogenies.
467: Sinauer Associates, Publishers, Sunderland
468: 
469: \bibitem{HuelsenbeckEtAl2001} Huelsenbeck JP, Ronquist F, Nielsen R,
470: Bollback JP (2001) Bayesian inference of phylogeny and its impact on
471: evolutionary biology. Science 294 (5550): 2310--2314
472: 
473: \bibitem{PosadaCrandall1998} Posada D, Crandall KA (1998) Modeltest:
474: testing the model of DNA substitution. Bioinformatics 14 (9): 817--818
475: 
476: \bibitem{Huelsenbeck1995} Huelsenbeck JP (1995) Performance of phylogenetic
477: methods in simulation. Syst. Biol. 44 (2): 17--48
478: 
479: \bibitem{AndersonSwofford2004} Anderson FE, Swofford DL (2004) Should we be
480: worried about long-branch attraction in real data sets? Investigations using
481: metazoan 18S rDNA. Mol. Phylogenet. Evol. 33 (2): 440--451
482: 
483: \bibitem{Graybeal1998} Graybeal A (1998) Is it better to add taxa or
484: characters to a difficult phylogenetic problem? Syst. Biol. 47 (1): 9--17
485: 
486: \bibitem{BrunoHalpern1999} Bruno WJ, Halpern AL (1999) Topological bias and
487: inconsistency of maximum likelihood using wrong models. Mol. Biol. Evol. 16
488: (4): 564--566
489: 
490: \bibitem{CavalierSmithEtAl1996} Cavalier-Smith T, Couch JA, Thorsteinsen KE,
491: Gilson P, Deane JA, Hill DRA, McFadden GI (1996) Cryptomonad nuclear and
492: nucleomorph SSU rRNA phylogeny. Eur. J. Phycol. 31 (4): 315--328
493: 
494: \bibitem{MarinEtAl1998} Marin B, Klingberg M, Melkonian M (1998)
495: Phylogenetic relationships among the Cryptophyta: analyses of   
496: nuclear-encoded SSU rRNA sequences support the monophyly of extant
497: plastid-containing lineages. Protist 149 (3): 265--276
498: 
499: \bibitem{HoefEmdenEtAl2002} Hoef-Emden K, Marin B, Melkonian M (2002)
500: Nuclear and nucleomorph SSU rDNA phylogeny in the Cryptophyta and the
501: evolution of cryptophyte diversity. J. Mol. Evol. 55 (2): 161--179
502: 
503: \bibitem{StewartEtAl2001} Stewart CA, Hart D, Berry DK, Olsen GJ, Wernert
504: EA, Fischer W (2001) Parallel implementation and performance of fastDNAml --
505: a program for maximum likelihood phylogenetic inference. Proc. SC2001,
506: Denver, CO, November 2001
507: 
508: \bibitem{StrimmervonHaeseler1996} Strimmer K, von Haeseler A (1996) Quartet
509: puzzling: a quartet maximum-likelihood method for reconstructing tree
510: topologies. Mol. Biol. Evol. 13 (7): 964--969
511: 
512: \bibitem{RanwezGascuel2001} Ranwez V, Gascuel O (2001) Quartet-based
513: phylogenetic inference: improvements and limits. Mol. Biol. Evol. 18 (6):
514: 1103--1116
515: 
516: \bibitem{SoltisEtAl2004} Soltis DE, Albert VA, Savolainen V, Hilu K, Qiu
517: Y-L, Chase MW, Farris JS, Stefanovi\'c S, Rice DW, Palmer JD, Soltis PS
518: (2004) Genome-scale data, angiosperm relationships, and `ending
519: incongruence': a cautionary tale in phylogenetics. Trends Plant Sci. 9 (10):
520: 477--483
521: 
522: \bibitem{PollardEtAl2004} Pollard DA, Bergman CM, Stoye J, Celniker SE,
523: Eisen MB (2004) Benchmarking tools for the alignment of functional noncoding
524: DNA. BMC Bioinformatics 5: 6
525: 
526: \bibitem{BaptesteEtAl2002} Bapteste E, Brinkmann H, Lee JA, Moore DV, Sensen
527: CW, Gordon P, Durufl\'e L, Gaasterland T, Lopez P, Müller M, Philippe H
528: (2002) The analysis of 100 genes supports the grouping of three highly
529: divergent amoebae: \textit{Dictyostelium}, \textit{Entamoeba}, and
530: \textit{Mastigamoeba}. Proc. Natl. Acad. Sci. USA 99 (3): 1414--1419
531: 
532: \bibitem{GuindonGascuel2003} Guindon S, Gascuel O (2003) A simple, fast, and
533: accurate algorithm to estimate large phylogenies by maximum likelihood.
534: Syst. Biol. 52 (5): 696--704
535: 
536: \end{thebibliography}
537: \end{document}