
1: \documentclass[a4paper]{article}

2: \usepackage[latin1]{inputenc}

3: \usepackage[dvips]{graphics}

4: \usepackage[small]{caption}

5: \renewcommand{\figurename}{Fig.}

6: \title{Molecular Phylogenetic Analyses and Real Life Data}

7: \author{Kerstin Hoef-Emden}

8:

9: \begin{document}

10:

11: \maketitle

12:

13: \begin{center}

14:

15: Universit\"at zu K\"oln, Botanisches Institut, Lehrstuhl I, Gyrhofstr. 15,

16:

17: 50931 K\"oln, Germany

18:

19: e-mail: kerstin.hoef-emden@uni-koeln.de

20:

21: \end{center}

22:

23: \bigskip

24:

25: \section{What is Molecular Phylogeny?}

26:

27: Most probably, all life existing today on earth shares a common ancestry

28: billions of years back in the past. A set of indispensable genes necessary

29: for maintenance of basic cell functions was passed on from the unknown

30: common ancestor to its extant descendants by asexual and/or sexual

31: reproduction. During the course of evolution, the genes, the numbers of

32: genes, their functions and the sizes of the genomes (i.e.\ the total DNA

33: content of a cell) became modified. If genes originate from a common

34: ancestral gene and fulfill the same function in a cell, they are said to be

35: homologous. The degree of divergence between homologous genes is considered

36: a measure for their relatedness (and also for the relatedness of the

37: organisms).

38:

39: In molecular phylogeny, the relationships among, usually extant, organisms

40: are examined by comparing homologous DNA or protein sequences (i.e.\ the gene

41: products). The relationships are displayed as trees with branch (or edge)

42: lengths reflecting the degrees of genetic divergence. Each branch tip

43: represents an extant sequence; the internal nodes or vertices represent

44: unknown ancestors to the terminal nodes. The branching pattern and branch

45: lengths describe the evolutionary pathways leading to the sequences at the

46: terminal nodes. Clusters of terminal branches connected to a common ancestor

47: are termed clades.

48:

49: The construction of phylogenetic trees has been shown to be an NP-hard

50: problem; the number of possible trees increases exponentially with the

51: number of DNA or protein sequences included in the phylogenetic analyses

52: \cite{Steel1992}. Due to the large amount of data and the complexity of the

53: task, phylogenetic trees cannot be inferred without help of computers.
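
To make the size of the search space concrete, the standard count of tree topologies can be quoted (given here as an illustration; the symbol $B(n)$ is introduced only for this purpose). For $n$ sequences, the number of distinct unrooted, strictly bifurcating tree topologies is
\[
  B(n) = (2n-5)!! = 3 \cdot 5 \cdot 7 \cdots (2n-5), \qquad n \ge 3 .
\]
Thus $B(4) = 3$, $B(10) = 2\,027\,025$ and $B(20) \approx 2.2 \times 10^{20}$. Each additional sequence multiplies the count by a factor that itself grows with $n$, so even modest data sets cannot be searched exhaustively.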

54:

55: Numerous studies addressing the problems of molecular phylogenetic analyses

56: methods in theory or practice have been published. The first publications on

57: phylogenetic methods date back to the 1960s. The methods and evolutionary

58: models were refined in the course of time, but problems still remain. The

59: references cited in this review represent only a few examples from a vast

60: body of literature. Also, only some of the most widely used methods in molecular

61: phylogeny are presented.

62:

63: To dig into the mathematics behind the phylogenetic analysis methods

64: introduced below, one may start with Joe Felsenstein's book

65: \cite{Felsenstein2003}.

66:

67: \section{Phylogenetic Analyses Methods}

68:

69: DNA sequences are based on a four-letter-code representing the four

70: nucleotides (A for adenine, C for cytosine, G for guanine, T for thymine),

71: whereas protein sequences are based on a twenty-letter-code representing the

72: twenty different amino acids. Prior to the phylogenetic analyses, an

73: alignment of the sequences has to be assembled (the single sequence is also

74: termed a ``taxon'', because it represents a species, genus, individual or

75: strain). If sequences of homologous genes show differences in length, e.g.\

76: due to insertions or deletions, gaps have to be inserted to place

77: functionally corresponding positions in the same vertical column of the

78: alignment (Fig.\ 1).

79: \begin{figure}[h]

80:  \begin{center}

81:   \includegraphics{hoef1.eps}

82:   \caption{Excerpt from an alignment of nuclear ITS2 sequences. The ITS2 or

83: internal transcribed spacer 2 lies between two RNA coding genes of the

84: ribosomal operon. The ribosomal operon is transcribed in one piece. The two

85: internal transcribed spacers between the RNA coding regions fold up in a

86: specific way and are excised. Since the two ITS regions solely function as

87: spacers, they are under low selective pressure and, thus, display high

88: mutation rates. The example alignment shows ITS2 regions of closely related

89: organisms belonging to one genus. The sequences are oriented in horizontal

90: direction, whereas functionally corresponding positions are arranged in

91: columns. Several gaps had to be inserted due to insertions of nucleotides in

92: the sequences 1 and 5.}

93:  \end{center}

94: \end{figure}

95: Non-alignable regions such as insertions of several nucleotides need to be

96: excluded from the phylogenetic analyses. Improperly aligned sequences or

97: inclusions of non-alignable regions in the phylogenetic analyses may result

98: in artefactual phylogenetic trees.

99:

100: In most standard methods for inferring phylogenetic trees, an optimality

101: criterion and a tree search algorithm have to be chosen. The optimality

102: criterion is used to determine the best among the considered trees by

103: defining a type of ``scoring'' system. Examples of optimality criteria are maximum

104: parsimony, distance matrix and maximum likelihood \cite{Felsenstein2003}.

105:

106: In unweighted maximum parsimony, each mutation from one nucleotide or amino

107: acid to another, e.g.\ from a C to a G, costs one ``penalty'' point. All point

108: mutations are considered equally likely. The mutations along a given tree

109: are summed up and the best tree or maximum parsimony tree is the one with

110: the lowest sum of penalty points. Unweighted maximum parsimony uses integer

111: values and often several to many equally parsimonious trees are found.
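
As a small worked illustration (a hypothetical single alignment column, not taken from any data set discussed here), suppose four taxa carry the nucleotides A, A, G and G at one site. Writing $s_i(T)$ for the minimum number of changes needed at site $i$ on tree $T$ (notation introduced here), the three possible unrooted four-taxon trees give
\[
  s_i\bigl((1,2),(3,4)\bigr) = 1, \qquad
  s_i\bigl((1,3),(2,4)\bigr) = 2, \qquad
  s_i\bigl((1,4),(2,3)\bigr) = 2 ,
\]
and the unweighted parsimony score of a tree is the sum over all $m$ alignment columns,
\[
  S(T) = \sum_{i=1}^{m} s_i(T) .
\]
The tree grouping taxa 1 and 2 collects the fewest penalty points from this site. A site with the pattern A, A, A, G needs exactly one change on every tree and therefore does not discriminate between the three topologies, which is one reason why several equally parsimonious trees are often found.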

112:

113: In distance analyses, the sequences are compared pairwise. Their genetic

114: divergences are transformed into distance values and listed in a triangular

115: distance matrix. Whereas maximum parsimony treats all mutations as equally

116: likely, the computation of distance matrices allows for different mutation

117: rates and other variations of parameters (i.e.\ evolutionary models, see the

118: section on models below). To infer trees from a distance matrix, usually the

119: neighbor-joining algorithm is used (see below).
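
The transformation of an observed divergence into a distance can be sketched with the simplest substitution model, the Jukes--Cantor model. It is chosen here only for illustration and assumes equal base frequencies and equal rates for all substitutions. If $p$ is the observed proportion of differing sites between two aligned sequences, the estimated number of substitutions per site is
\[
  d = -\frac{3}{4} \ln\left(1 - \frac{4}{3}\,p\right), \qquad 0 \le p < \frac{3}{4} .
\]
For example, $p = 0.10$ gives $d \approx 0.107$, whereas $p = 0.50$ gives $d \approx 0.824$. The correction grows with $p$ because multiple substitutions at the same site become more likely the more divergent two sequences are. More complex evolutionary models replace this formula but serve the same purpose.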

120:

121: Maximum likelihood is a probabilistic method and computationally the most

122: costly one (Fig.\ 2).

123: \begin{figure}[htbp]

124:  \begin{center}

125:   \includegraphics{hoef2.eps}

126:   \caption{Computation of the likelihood of a tree. To obtain the overall

127: likelihood value of a tree, for each position of the alignment the

128: probabilities of all possible combinations of ancestral character states are

129: computed. The site-wise likelihood is the sum of all these probabilities.

130: The site-wise likelihoods are then multiplied (equivalently, their logarithms

131: are summed) to give the (log) likelihood of a given tree.}

132:  \end{center}

133: \end{figure}

134: Maximum likelihood searches for the tree that maximizes the probability of observing the

135: data. The likelihood of a tree is usually reported as its negative natural logarithm.

136: The maximum likelihood method also allows for different evolutionary models,

137: but differs from distance matrix methods in that it uses discrete characters

138: and may result in more than one optimal tree (however, rarely more than

139: two).
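
In symbols (the notation is introduced here only for illustration), let $D_i$ denote column $i$ of an alignment with $m$ columns, $T$ the tree with its branch lengths and $\theta$ the parameters of the evolutionary model. Assuming that sites evolve independently,
\[
  L(T,\theta) = \prod_{i=1}^{m} P(D_i \mid T,\theta), \qquad
  \ln L(T,\theta) = \sum_{i=1}^{m} \ln P(D_i \mid T,\theta) .
\]
For an unrooted four-taxon tree with observed states $s_1,\dots,s_4$ at the tips, internal nodes $x$ and $y$, and branch lengths $t_1,\dots,t_5$, the site likelihood sums over all $4 \times 4$ combinations of ancestral nucleotides:
\[
  P(D_i \mid T,\theta) = \sum_{x}\sum_{y} \pi_x\, P_{x s_1}(t_1)\, P_{x s_2}(t_2)\,
  P_{x y}(t_5)\, P_{y s_3}(t_3)\, P_{y s_4}(t_4) ,
\]
where $\pi_x$ is the equilibrium frequency of nucleotide $x$ and $P_{ab}(t)$ is the probability of a change from $a$ to $b$ along a branch of length $t$ under a time-reversible model. The number of terms in the sum grows with every additional internal node, which explains much of the computational cost.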

140:

141: The numbers of sequences used to infer phylogenetic trees in biological

142: research projects almost always prohibited exhaustive searches of the

143: complete tree space due to limitations of computation time. Thus, maximum

144: parsimony or maximum likelihood were usually combined with heuristic tree

145: search algorithms. For a heuristic search a first tree is generated e.g.\ by

146: adding the sequences step-by-step to the growing tree. This first tree is

147: then subjected to local and/or global rearrangements by swapping internal

148: branches or cutting the tree into pieces and rejoining the parts in

149: different places. This procedure is supposed to overcome potential local

150: optima and to find the global optimum. The construction of a tree by

151: neighbor-joining, the preferred method used with distance matrices, starts

152: with a star-like tree. The pair of sequences that minimizes a criterion based on

153: their genetic divergence, corrected for their mean divergence from all other sequences, is joined (i.e.\ they are said to be neighbors) and the distance

154: matrix recalculated. These steps are repeated with the next closest related

155: sequences or clusters of sequences until the tree is completely resolved.
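
The selection step of neighbor-joining can be written out as follows (standard formulation; the symbols are introduced here for illustration). With $n$ sequences or clusters currently in the matrix, pairwise distances $d_{ij}$ and row sums $r_i = \sum_{k} d_{ik}$, the pair $(i,j)$ that minimizes
\[
  Q_{ij} = (n-2)\, d_{ij} - r_i - r_j
\]
is joined at a new node $u$. The branch lengths to $u$ and the distances from $u$ to every remaining entry $k$ are
\[
  d_{iu} = \frac{d_{ij}}{2} + \frac{r_i - r_j}{2(n-2)}, \qquad
  d_{ju} = d_{ij} - d_{iu}, \qquad
  d_{uk} = \frac{d_{ik} + d_{jk} - d_{ij}}{2} .
\]
Because of the correction by the row sums, the joined pair need not be the pair with the smallest raw distance $d_{ij}$, which matters when lineages evolve at different rates.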

156:

157: In Bayesian analyses, posterior probabilities for trees and evolutionary

158: parameters are calculated using Bayes' theorem

159: \cite{HuelsenbeckEtAl2001}. With Bayes' formula, the posterior probability

160: of a tree given the data is calculated from the prior probability of the

161: tree, the likelihood of the tree and the probability of the data. Since it is impossible to

162: calculate all trees and evolutionary parameters from the space of the joint

163: posterior probability distribution, samples are drawn using

164: Metropolis-coupled Markov chain Monte Carlo simulations. This means that, at the

165: start of a Bayesian analysis, several chains are initialized to search for

166: the global optimum in the space of the joint posterior probability

167: distribution. Once initialized, the chains cross the space for several

168: hundred thousand to millions of generations by slightly modifying the

169: parameters (tree topology, branch lengths, evolutionary model parameters).

170: Trees and evolutionary model parameters are sampled only from the cold

171: chain; the other so-called heated chains traverse the space more easily and

172: exchange their status data from time to time with the cold chain. By doing

173: so, the heated chains help the cold chain to reach the global optimum, which

174: comprises a set of the best trees and evolutionary parameters. The presumed

175: global optimum is found when the likelihoods of the trees sampled from the

176: cold chain reach stationarity.
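
Written out, with $T$ for the tree (topology and branch lengths), $\theta$ for the model parameters and $D$ for the alignment (symbols introduced here for illustration), Bayes' theorem reads
\[
  P(T,\theta \mid D) = \frac{P(D \mid T,\theta)\, P(T,\theta)}{P(D)} .
\]
The denominator $P(D)$ would require summing over all trees and integrating over all parameter values, which is why samples are drawn instead. In a Metropolis--Hastings step, a proposed state $(T',\theta')$ is accepted with probability
\[
  \min\left(1,\;
  \frac{P(D \mid T',\theta')\, P(T',\theta')}{P(D \mid T,\theta)\, P(T,\theta)}
  \cdot
  \frac{q(T,\theta \mid T',\theta')}{q(T',\theta' \mid T,\theta)}
  \right) ,
\]
where $q$ denotes the proposal density; $P(D)$ cancels from the ratio and never has to be computed. A heated chain can be thought of as applying the same rule to the posterior raised to a power $\beta$ with $0 < \beta < 1$, which flattens the surface and lets the chain cross valleys between optima more easily.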

177:

178: The phylogenetic trees inferred by the above-mentioned methods are usually

179: bifurcating trees. They may be rooted or unrooted. In rooted trees, an

180: outgroup, ideally the most closely related sister group, is used to define the direction of evolution in

181: the sequences. For example, to examine the relationships among chimpanzee, gorilla

182: and man, the orangutan would be the appropriate outgroup. Unrooted trees are

183: like looking onto the treetop from above without knowing where the stem is.

184: In unrooted trees it is not possible to tell where evolution started and in

185: which direction the sequences evolved.

186:

187: \section{Models of Molecular Evolution}

188:

189: In addition to exponentially growing numbers of possible trees,

190: phylogenetic analyses are further complicated by the fact that substitution

191: rates of nucleotides or amino acids may vary. Evolutionary models are an

192: attempt to approximate the complexity of molecular evolution as closely as

193: possible.

194:

195: The proportions of the four nucleotides in a DNA sequence may differ from

196: gene to gene and, thus, need to be considered in phylogenetic analyses (base

197: frequencies). To account for differing substitution rates for the six types

198: of point mutations, a substitution rate matrix is used (Fig.\ 3A).

199: \begin{figure}[htbp]

200:  \begin{center}

201:   \includegraphics{hoef3.eps}

202:   \caption{Substitution rate matrices and among-site rate variation. Fig.\ 3A.

203: Examples of substitution rate matrices. To the left, the most complex type

204: implemented in phylogeny software programmes, the general time reversible

205: model (GTR) with six different substitution rates. To the right, a modified

206: GTR model, the Tamura-Nei model with three different mutation rates. Fig.\

207: 3B. Among-site rate variation in RNA and protein coding DNA. Sites with high

208: mutation rates are usually found in loop regions of RNA secondary structure,

209: whereas helices are more conserved (left). In protein coding DNA, the third

210: position of the codons is usually the most variable. The degenerate code

211: allows for several codons to represent the same amino acid. In this example,

212: codons for the amino acids serine, arginine and valine are shown. Between

213: DNA and protein, a transcription step to messenger RNA is necessary. Bold

214: face, positions with higher mutation rates. Fig.\ 3C. Modelling the

215: among-site rate variation using a gamma distribution. Examples for

216: continuous gamma distribution with different shape parameters to the left

217: and a discrete gamma distribution with seven rate categories to the right.

218: The discrete gamma distribution approximates a continuous gamma distribution

219: with a shape parameter \( \alpha \) of 1.}

220:  \end{center}

221: \end{figure}

222: However, depending on the position in the alignment, these rates may be

223: higher or lower. Some positions are highly conserved and do not change at

224: all. Others evolve at differing rates (Fig.\ 3B). Both parameters, the

225: proportion of invariable sites and the site-specific rate variation modelled as

226: a gamma distribution (Fig.\ 3C), describe the among-site substitution rate

227: variation and can be explained by functional constraints on the gene

228: products.
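
The ingredients of such a model can be written compactly; the following is a sketch in notation introduced here for illustration. With base frequencies $\pi_A, \pi_C, \pi_G, \pi_T$ and six symmetric exchangeability parameters $\rho_{ij} = \rho_{ji}$, the entries of the general time reversible rate matrix $Q$ are
\[
  q_{ij} = \rho_{ij}\, \pi_j \quad (i \ne j), \qquad
  q_{ii} = -\sum_{j \ne i} q_{ij} ,
\]
and the substitution probabilities along a branch of length $t$ are the entries of $P(t) = e^{Qt}$. Among-site rate variation multiplies $t$ by a site-specific rate $r$ drawn from a gamma distribution with mean 1,
\[
  f(r) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha-1} e^{-\alpha r}, \qquad r > 0 ,
\]
whose variance is $1/\alpha$, so that small values of the shape parameter $\alpha$ correspond to strong rate heterogeneity. With a proportion $p_{\mathrm{inv}}$ of invariable sites and $K$ equally probable discrete rate categories $r_1,\dots,r_K$, the likelihood of site $i$ becomes
\[
  P(D_i) = p_{\mathrm{inv}}\, P(D_i \mid r = 0)
  + (1 - p_{\mathrm{inv}})\, \frac{1}{K} \sum_{k=1}^{K} P(D_i \mid r = r_k) ,
\]
where $P(D_i \mid r = 0)$ equals the frequency $\pi_a$ if all sequences show the same nucleotide $a$ at site $i$ and is zero otherwise.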

229:

230: For most data sets used in biological studies, it is impossible to infer

231: phylogenetic trees in a reasonable time by optimizing all likelihood

232: parameters at once during a maximum likelihood analysis, i.e.\ tree topology,

233: branch lengths of the trees, base frequencies, substitution rate matrix,

234: proportion of invariable sites and continuously gamma-distributed among-site

235: rate variation. A common approach has been to determine first

236: the parameters of the evolutionary model that fits the data best

237: \cite{PosadaCrandall1998}. To find the appropriate evolutionary model, a

238: tree is inferred with a fast method (usually distance matrix with

239: neighbor-joining) and the likelihood values for this tree are calculated for

240: each available evolutionary model. The model that fits the data best is then

241: chosen by e.g.\ hierarchical likelihood ratio tests (hLRT) or by the Akaike

242: information criterion (AIC). Also, a discrete instead of a continuous

243: gamma-distributed among-site rate variation is used to reduce computation

244: times (Fig.\ 3C). Thus, during heuristic tree search only tree topology and

245: branch lengths need to be optimized, whereas the evolutionary model

246: parameters have been already estimated from the data set using an

247: approximate tree topology prior to the heuristic tree search.
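
The two selection criteria can be stated briefly (standard definitions, given here for orientation; the numbers in the example are hypothetical). For two nested models with maximized log likelihoods $\ln L_0$ (simpler model) and $\ln L_1$ (more complex model), the likelihood ratio test statistic is
\[
  \delta = 2\,(\ln L_1 - \ln L_0) ,
\]
which is usually compared with a $\chi^2$ distribution whose degrees of freedom equal the number of additional free parameters. The Akaike information criterion of a model with $k$ free parameters is
\[
  \mathrm{AIC} = -2 \ln L + 2k ,
\]
and the model with the lowest AIC is preferred. If, for instance, adding one parameter raises $\ln L$ from $-5230.0$ to $-5190.0$, then $\delta = 80.0$, far above the 5\% critical value of 3.84 for one degree of freedom, and the AIC drops by 78.0. Both criteria would favour the more complex model in this case.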

248:

249: An additional evolutionary model, the covarion/covariotide model, takes

250: lineage-specific evolutionary rates into consideration, i.e.\ complete

251: sequences may evolve faster than others. The covarion/covariotide model,

252: however, has so far been implemented only in Bayesian phylogenetic analysis

253: programmes.

254:

255: Protein coding DNA sequences are \textit{in vivo} first transcribed into

256: messenger RNA, then translated into a protein consisting of a string of

257: amino acids (Fig.\ 3B). The function of the protein is determined by folding

258: up into tertiary and quaternary structures and by amino acids with specific

259: chemical properties in specific positions. Maximum likelihood analyses of

260: DNA sequences are quite time intensive. Maximum likelihood analyses with 20

261: character states for the amino acids are even more time-consuming. Thus, in

262: protein phylogenies, substitution rate matrices were usually not computed

263: from the data sets; instead, pre-defined substitution rate matrices

264: empirically derived from large alignments of other proteins were used

265: \cite{Felsenstein2003}.

266:

267: Phylogenetic trees can also be inferred from the DNA sequences of protein

268: coding genes, which, however, has some pitfalls. In protein coding genes,

269: three nucleotides code for one amino acid, but the genetic code is

270: degenerate. This means that several three-nucleotide combinations may code

271: for the same amino acid (e.g.\ arginine, leucine and serine are each encoded

272: by six codons; see Fig.\ 3B). As a consequence, a nucleotide change in

273: one codon position may be either without effect on the amino acid (= silent

274: or synonymous substitution), or cause a change of one amino acid to another

275: (= nonsynonymous substitution). Only nonsynonymous substitutions can result

276: in a loss or decrease of function and thus are subject to functional

277: constraints. However, the sophisticated evolutionary model parameters

278: mentioned above were developed in the first place to cope with RNA coding genes.

279: In these models, the three-nucleotide codon structure is ignored and synonymous and

280: nonsynonymous mutations are treated equally. Also, there are often several

281: possible evolutionary pathways from one codon to another,

282: which further complicates the evolutionary model parameters. Often the third

283: positions of codons show nucleotide biases towards higher GC or AT contents.

284:

285: Theoretical and simulation studies, as well as empirical observations, have

286: made it obvious that using wrong assumptions about the underlying

287: evolutionary processes may result in biased phylogenetic trees.

288:

289: \section{Simulation Studies}

290:

291: The accuracy of a method comprises consistency, efficiency and robustness. A

292: method is consistent, if it infers the correct phylogenetic tree with an

293: infinite amount of data. Efficiency describes the sensitivity of a method

294: concerning the lengths of sequences. The shorter the sequences can be for a

295: method to converge to the correct tree topology, the more efficient is the

296: method. Robustness concerns the use of wrong assumptions about the underlying

297: evolutionary model. A method is robust, if it infers the correct

298: phylogenetic tree although a wrong evolutionary model was used. Since

299: biologists use DNA or protein sequences of finite lengths, in practice only

300: efficiency and robustness of a method are of interest.
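
Consistency can also be stated formally (notation introduced here for illustration). If $\hat{T}_m$ denotes the tree that a method infers from $m$ alignment columns generated on the true tree $T$, the method is consistent when
\[
  \lim_{m \to \infty} P\bigl(\hat{T}_m = T\bigr) = 1 .
\]
Efficiency then describes how quickly $P(\hat{T}_m = T)$ approaches 1 as $m$ grows, and robustness whether the limit still holds when the data were generated under a model other than the one assumed in the analysis.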

301:

302: For example, in a simulation study by Huelsenbeck \cite{Huelsenbeck1995}, four-taxon

303: data sets of differing sequence lengths were generated \textit{in silico}

304: from a random starting sequence according to pre-specified evolutionary

305: models and phylogenetic trees (see parameter space in Fig.\ 4A).

306: \begin{figure}[htbp]

307:  \begin{center}

308:   \includegraphics{hoef4.eps}

309:   \caption{The long branch attraction artefact (LBA). Fig.\ 4A. The parameter

310: space with different tree topologies usually used in simulation studies with

311: four-taxon trees. Fig.\ 4B. An example of an LBA in a four-taxon tree. The

312: tree to the left corresponds to the tree in the top left corner of the

313: parameter space in Fig.\ 4A. The tree to the right shows the typical LBA bias.

314: The high evolutionary rates displayed by the long branches of the taxa A and

315: B cause reversals in the nucleotides, e.g.\ a C mutates to a G, then to a T and back

316: to a C. In combination with a high background noise, which blurs

317: phylogenetic signals, these reversals are presumably interpreted erroneously

318: as positives for genetic relatedness. The region in the parameter space

319: resulting in biased trees is also sometimes called the ``Felsenstein'' zone

320: of a method. This region is predominantly located in the top left, sometimes

321: extended to the top right of the parameter space shown in Fig.\ 4A. The

322: larger this ``Felsenstein'' zone is, the less robust the phylogenetic

323: method.}

324:   \end{center}

325: \end{figure}

326: Different phylogenetic analysis methods were then used to infer trees from

327: the data sets, and the conditions that caused the methods to infer

328: wrong tree topologies were determined. The so-called long branch attraction artefact (LBA)

329: is the best-known phenomenon causing biased tree topologies. Usually,

330: LBAs were found in phylogenetic trees with extremely long terminal branches (i.e.\

331: branches with high evolutionary rates) but short internal branches (Fig.\

332: 4B). In most test situations, maximum likelihood outperformed other methods,

333: but it also failed to find the correct tree if the assumed evolutionary

334: models were too different from the evolutionary processes under which the

335: simulated data sets had evolved.

336:

337: \section{Phylogenetic Analyses and Real Life Data}

338:

339: Since divergent branch lengths were almost always found in phylogenetic

340: analyses of \textit{in vivo} evolved sequences, the effects of potential

341: LBAs were a frequent matter of concern \cite{AndersonSwofford2004}.

342: Especially in large scale phylogenies comprising sequences of very different

343: organisms, long-branch taxa were often gathered ladder-like close to the

344: root of the trees, which may indicate a potential bias caused by LBAs. The

345: farther back in time the examined relationships of organisms reach, the

346: worse the resolution at the internal branches of a tree. It was found that

347: an addition of sequences to the data set and a complex evolutionary model

348: with a gamma-distributed among-site rate variation were the best options to

349: reduce artefacts in a phylogenetic tree

350: \cite{Graybeal1998,BrunoHalpern1999}. In particular, adding more sequences of the

351: problematic type could break up long branches, increase the resolution in

352: this part of a tree and thereby neutralise the LBA.

353:

354: An example of how taxon sampling and choice of evolutionary model may affect

355: the results of a molecular phylogeny can be found in the cryptophytes, a

356: group consisting of microscopic flagellated unicells. Most of the genera in

357: this group are algae, i.e.\ they contain a pigmented plastid which is used to

358: turn the energy of light into chemical energy by photosynthesis. Two genera

359: are, however, colourless. \textit{Goniomonas} is phagotrophic; it feeds by

360: ingesting bacteria. The other genus, formerly classified as

361: \textit{Chilomonas}, feeds on organic molecules, but still harbours a

362: leukoplast, i.e.\ a colourless plastid. In a phylogenetic analysis with a low

363: number of nuclear 18S ribosomal DNA sequences, \textit{Goniomonas} and

364: ``\textit{Chilomonas}'' clustered together indicating a relationship of both

365: genera \cite{CavalierSmithEtAl1996}. In a later analysis, sequences of the

366: photosynthetic genus \textit{Cryptomonas} were added \cite{MarinEtAl1998}.

367: It turned out that \textit{Goniomonas} was the most basally diverging taxon,

368: whereas ``\textit{Chilomonas}'' was a colourless \textit{Cryptomonas}. The

369: clade with the genera \textit{Cryptomonas} and ``\textit{Chilomonas}'' seemed

370: to be the most basal group of the plastid-bearing cryptophytes. Thus, the

371: sisterhood of \textit{Goniomonas} and ``\textit{Chilomonas}'' was caused by

372: an LBA due to inappropriate taxon sampling. The analysis in

373: \cite{MarinEtAl1998}, however, was done using maximum likelihood under a

374: simple evolutionary model, i.e.\ without considering an among-site rate

375: variation. In a study using a complex evolutionary model with among-site

376: rate variation, the basal position of the

377: \textit{Cryptomonas}/``\textit{Chilomonas}'' clade was also shown to be an

378: artefact caused by long branch attraction \cite{HoefEmdenEtAl2002}.

379:

380: Thus, long branch attraction artefacts are a real problem in phylogenies

381: inferred from \textit{in vivo} evolved sequences. The best options to cope

382: with LBAs, i.e.\ adding more taxa and using complex evolutionary models and

383: robust methods, however, collide with another problem biologists were and

384: still are confronted with: computation times. The larger the number of

385: sequences, the more reliably the phylogenetic analysis methods work, but

386: exponentially more time is also needed to obtain results.

387:

388: Bayesian analysis was introduced as a potentially faster alternative to

389: maximum likelihood analysis \cite{HuelsenbeckEtAl2001}. However, for large

390: data sets Markov chains often need to be run for more generations to reach a

391: plateau of likelihood values, which also increases computation times. In

392: addition, the posterior probabilities given for the different branches of

393: the consensus tree, in which the sampled trees are summarised, are more

394: optimistic than support values obtained from nonparametric bootstrapping

395: using the maximum likelihood criterion (i.e.\ a resampling method with at

396: least 100, often more than 100 resampled data sets, to test the stability of

397: the branches of a tree). Bayesian analysis may be sped up by running the

398: different Markov chains on separate CPUs of a computing server or a cluster.
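
The nonparametric bootstrap mentioned above can be summarised as follows (a sketch; the symbols $B$, $m$ and $\mathrm{BS}$ are introduced here for illustration). Each of $B$ pseudo-replicate alignments is built by drawing $m$ columns with replacement from the $m$ columns of the original alignment. A tree is inferred from each replicate, and the support of a branch is
\[
  \mathrm{BS}(\mbox{branch}) =
  \frac{\mbox{number of replicate trees containing the branch}}{B} .
\]
A given original column is missing from a replicate with probability
\[
  \left(1 - \frac{1}{m}\right)^{m} \approx e^{-1} \approx 0.37 ,
\]
so each replicate contains on average only about 63\% of the distinct original columns. Since every replicate requires its own tree search, $B = 100$ replicates cost roughly a hundred times as much as a single analysis.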

399:

400: In heuristic tree searches using the maximum likelihood criterion, some

401: parallelised versions of programmes have been introduced, e.g.\

402: \cite{StewartEtAl2001}. The tasks of tree generation and tree evaluation

403: were distributed among a master (tree generation and comparison) and worker

404: programmes (calculation of branch lengths and likelihoods).

405:

406: Another attempt to decrease computation times was quartet-puzzling

407: \cite{StrimmervonHaeseler1996}. In quartet-puzzling, trees are computed for

408: quartets of sequences drawn from the n sequences of a larger data set using the maximum likelihood

409: criterion and weighted accordingly. The best of the three possible 4-trees

410: for each quartet is used to first assemble a large number of n-trees

411: (quartet-puzzling) and finally to obtain a consensus n-tree. This method is

412: much faster than a heuristic tree search, but more vulnerable to LBA. Among

413: hundreds to thousands of computed four-taxon trees, only a low number of

414: biased 4-trees suffices to pass on a topological error to the final n-tree.

415: In simulation studies, global character maximum likelihood almost always

416: outperformed quartet-puzzling or related methods \cite{RanwezGascuel2001}.
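
The number of quartets that can be drawn from $n$ sequences is easy to compute and illustrates the scale of the problem:
\[
  {n \choose 4} = \frac{n(n-1)(n-2)(n-3)}{24} ,
\]
so that $n = 20$ sequences yield 4\,845 quartets and $n = 50$ sequences yield 230\,300, each with three possible 4-trees to evaluate. This number grows only polynomially with $n$, which is the source of the speed advantage, but it also means that the final n-tree rests on a very large number of small analyses, each of which sees only four sequences at a time.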

417:

418: Other studies tried to overcome LBA and exponentially growing computing

419: times with longer sequences, e.g.\ by using complete genomes to infer

420: phylogenetic trees. Phylogenetic analyses of longer sequences increase the

421: computing times only linearly. Since sequencing of complete genomes needs

422: much more time and resources than that of single genes or smaller sets of

423: genes, the taxon sampling in these studies generally was lower. It has been

424: shown, however, that long sequences cannot compensate for sparse taxon

425: sampling. The low number of taxa included in a genome-scale analysis

426: resulted in high bootstrap support even for biased tree topologies

427: \cite{SoltisEtAl2004}. Also, genome-scale alignments can no longer be refined

428: by eye. They depend on automatic alignment algorithms, which may

429: perform badly by producing more or less biologically meaningless alignments

430: \cite{PollardEtAl2004}. A better option than using complete genomes

431: presumably is to sequence a set of genes, to refine the alignment of each

432: gene by eye, and to concatenate the genes \cite{BaptesteEtAl2002}.

433:

434: Additional problems occur if the evolution of a gene and/or a group of

435: organisms cannot be described by bifurcating trees. In sexually reproducing

436: populations, the examined gene may be present in differing alleles. Each

437: individual of a population inherits two alleles, one from its mother, the

438: other from its father. In addition, parts of the alleles can be exchanged by

439: genetic recombination. Genetic material may also be transferred between

440: unrelated organisms, e.g.\ by infection with viruses, by endosymbiosis or in

441: bacteria by exchange of plasmids. Whereas the inheritance of genes from

442: parents to offspring is called vertical gene transfer, the exchange of genetic

443: material between unrelated organisms is called lateral gene transfer. The

444: results of sexual reproduction or lateral gene transfers are genetic

445: chimaeras and reticulate evolutionary trees.

446:

447: \section{Conclusions}

448:

449: So far, there seems to be no easy way out of the treadmill of steeply

450: increasing computing times for phylogeneticists. New algorithms to reduce

451: time consumption in phylogenetic analysis continue to be proposed,

452: e.g.\ \cite{GuindonGascuel2003}. However, only if the algorithms are offered

453: in software programmes suitable for the tasks of phylogenetic analysis, if

454: they are presented in an understandable way to biologists and if they

455: prove to be robust, will they be accepted and used.

456:

457: \bigskip

458:

459: \bibliographystyle{plain}

460:

461: \begin{thebibliography}{99}

462:

463: \bibitem{Steel1992} Steel M (1992) The complexity of reconstructing trees

464: from qualitative characters and subtrees. J. Classif. 9 (1): 91--116

465:

466: \bibitem{Felsenstein2003} Felsenstein J (2003) Inferring phylogenies.

467: Sinauer Associates, Publishers, Sunderland

468:

469: \bibitem{HuelsenbeckEtAl2001} Huelsenbeck JP, Ronquist F, Nielsen R,

470: Bollback JP (2001) Bayesian inference of phylogeny and its impact on

471: evolutionary biology. Science 294 (5550): 2310--2314

472:

473: \bibitem{PosadaCrandall1998} Posada D, Crandall KA (1998) Modeltest:

474: testing the model of DNA substitution. Bioinformatics 14 (9): 817--818

475:

476: \bibitem{Huelsenbeck1995} Huelsenbeck JP (1995) Performance of phylogenetic

477: methods in simulation. Syst. Biol. 44 (2): 17--48

478:

479: \bibitem{AndersonSwofford2004} Anderson FE, Swofford DL (2004) Should we be

480: worried about long-branch attraction in real data sets? Investigations using

481: metazoan 18S rDNA. Mol. Phylogenet. Evol. 33 (2): 440--451

482:

483: \bibitem{Graybeal1998} Graybeal A (1998) Is it better to add taxa or

484: characters to a difficult phylogenetic problem? Syst. Biol. 47 (1): 9--17

485:

486: \bibitem{BrunoHalpern1999} Bruno WJ, Halpern AL (1999) Topological bias and

487: inconsistency of maximum likelihood using wrong models. Mol. Biol. Evol. 16

488: (4): 564--566

489:

490: \bibitem{CavalierSmithEtAl1996} Cavalier-Smith T, Couch JA, Thorsteinsen KE,

491: Gilson P, Deane JA, Hill DRA, McFadden GI (1996) Cryptomonad nuclear and

492: nucleomorph SSU rRNA phylogeny. Eur. J. Phycol. 31 (4): 315--328

493:

494: \bibitem{MarinEtAl1998} Marin B, Klingberg M, Melkonian M (1998)

495: Phylogenetic relationships among the Cryptophyta: analyses of

496: nuclear-encoded SSU rRNA sequences support the monophyly of extant

497: plastid-containing lineages. Protist 149 (3): 265--276

498:

499: \bibitem{HoefEmdenEtAl2002} Hoef-Emden K, Marin B, Melkonian M (2002)

500: Nuclear and nucleomorph SSU rDNA phylogeny in the Cryptophyta and the

501: evolution of cryptophyte diversity. J. Mol. Evol. 55 (2): 161--179

502:

503: \bibitem{StewartEtAl2001} Stewart CA, Hart D, Berry DK, Olsen GJ, Wernert

504: EA, Fischer W (2001) Parallel implementation and performance of fastDNAml --

505: a program for maximum likelihood phylogenetic inference. Proc. SC2001,

506: Denver, CO, November 2001

507:

508: \bibitem{StrimmervonHaeseler1996} Strimmer K, von Haeseler A (1996) Quartet

509: puzzling: a quartet maximum-likelihood method for reconstructing tree

510: topologies. Mol. Biol. Evol. 13 (7): 964--969

511:

512: \bibitem{RanwezGascuel2001} Ranwez V, Gascuel O (2001) Quartet-based

513: phylogenetic inference: improvements and limits. Mol. Biol. Evol. 18 (6):

514: 1103--1116

515:

516: \bibitem{SoltisEtAl2004} Soltis DE, Albert VA, Savolainen V, Hilu K, Qiu

517: Y-L, Chase MW, Farris JS, Stefanovi\'c S, Rice DW, Palmer JD, Soltis PS

518: (2004) Genome-scale data, angiosperm relationships, and `ending

519: incongruence': a cautionary tale in phylogenetics. Trends Plant Sci. 9 (10):

520: 477--483

521:

522: \bibitem{PollardEtAl2004} Pollard DA, Bergman CM, Stoye J, Celniker SE,

523: Eisen MB (2004) Benchmarking tools for the alignment of functional noncoding

524: DNA. BMC Bioinformatics 5: 6

525:

526: \bibitem{BaptesteEtAl2002} Bapteste E, Brinkmann H, Lee JA, Moore DV, Sensen

527: CW, Gordon P, Durufl\'e L, Gaasterland T, Lopez P, M\"uller M, Philippe H

528: (2002) The analysis of 100 genes supports the grouping of three highly

529: divergent amoebae: \textit{Dictyostelium}, \textit{Entamoeba}, and

530: \textit{Mastigamoeba}. Proc. Natl. Acad. Sci. USA 99 (3): 1414--1419

531:

532: \bibitem{GuindonGascuel2003} Guindon S, Gascuel O (2003) A simple, fast, and

533: accurate algorithm to estimate large phylogenies by maximum likelihood.

534: Syst. Biol. 52 (5): 696--704

535:

536: \end{thebibliography}

537: \end{document}