METHODS
IN
MOLECULAR BIOLOGY™
Bioinformatics Volume I Data, Sequence Analysis and Evolution
Edited by
Jonathan M. Keith, PhD School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia
Series Editor: John Walker, Hatfield, Hertfordshire AL10 9NP, UK
ISBN: 978-1-58829-707-5 ISSN 1064-3745 DOI: 10.1007/978-1-60327-159-2
e-ISBN: 978-1-60327-159-2 e-ISSN: 1940-6029
Library of Congress Control Number: 2007943036 © 2008 Humana Press, a part of Springer Science+Business Media, LLC All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, 999 Riverview Drive, Suite 208, Totowa, NJ 07512 USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of going to press, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Cover illustration: Fig. 4, Chapter 19, “Inferring Ancestral Protein Interaction Networks,” by José M. Peregrín-Alvarez Printed on acid-free paper 9 8 7 6 5 4 3 2 1 springer.com
Chapter 14
Inferring Trees
Simon Whelan
Abstract
Molecular phylogenetics examines how biological sequences evolve and the historical relationships between them. An important aspect of many such studies is the estimation of a phylogenetic tree, which explicitly describes evolutionary relationships between the sequences. This chapter provides an introduction to evolutionary trees and some commonly used inferential methodology, focusing on the assumptions made and how they affect an analysis. Detailed discussion is also provided about some common algorithms used for phylogenetic tree estimation. Finally, there are a few practical guidelines, including how to combine multiple software packages to improve inference, and a comparison between Bayesian and maximum likelihood phylogenetics.
Key words: Phylogenetic inference, evolutionary trees, maximum likelihood, parsimony, distance methods, review.
1. Introduction
Phylogenetics and comparative genomics use multiple sequence alignments to study how the genetic material changes over evolutionary time and draw biologically interesting inferences. The primary aim of many studies, and an important byproduct of others, is finding the phylogenetic tree that best describes the evolutionary relationship of the sequences. Trees have proved useful in many areas of molecular biology. In studies of pathogens, they have offered a wealth of insights into the interaction between viruses and their hosts during evolution. Pre-eminent among these have been studies on HIV in which, for example, trees were crucial in demonstrating that HIV was the result of at least two different zoonoses from chimpanzees (1). Trees have
Jonathan M. Keith (ed.), Bioinformatics, Volume I: Data, Sequence Analysis, and Evolution, vol. 452 © 2008 Humana Press, a part of Springer Science + Business Media, Totowa, NJ Book doi: 10.1007/978-1-60327-159-2 Springerprotocols.com
also provided valuable insights in molecular and physiological studies, including how the protein repertoire evolved (2, 3), genome evolution (4, 5), and the development of taxonomic classification and species concepts (6, 7). Accurate reconstruction of evolutionary relationships is also important in studies in which the primary aim is not tree inference, such as investigating the selective pressures acting on proteins (see Chapter 15) or the identification of conserved elements in genomes (8). Failure to correctly account for the historical relationship of sequences can lead to inaccurate hypothesis testing and impaired biological conclusions (9). Despite the importance of phylogenetic trees, obtaining an accurate estimate of one is not a straightforward process. There are many phylogenetic software packages available, each with unique advantages and disadvantages. An up-to-date list of such packages is held at the URL: http://evolution.genetics.washington.edu/phylip/software.html. This chapter aims to provide an introductory guide to enable users to make informed decisions about the phylogenetic software they use, by describing what phylogenetic trees are, why they are useful, and some of the underlying principles of phylogenetic inference.
2. Underlying Principles
Phylogenetic trees make implicit assumptions about the data. Most importantly, these include that all sequences share a common ancestor, and that sequences evolving along all branches in the tree evolve independently. Violations of the former assumption occur when unrelated regions are included in the data. This occurs, for example, when only subsets of protein domains are shared between sequences or when data have entered the tree from other sources, such as sequence contamination, lateral gene transfer, or transposons. Violations of the second assumption occur when information in one part of a tree affects sequence in another. This occurs in gene families under gene conversion or lateral gene transfer. Before assuming a bifurcating tree for phylogenetic analyses, one should try to ensure that these implicit assumptions are not violated. The trees estimated in phylogenetics usually come in two flavors, rooted and unrooted, differing in the assumptions they make about the most recent common ancestor of the sequences, the root of the tree. Knowing the root of the tree enables the order of divergence events to be determined, which is valuable, for example, when investigating the evolution of phenotypic traits by examining their location on a tree.
[Figure 14.1: tree diagrams. Panel labels: A, B. Tip labels: Seq1 to Seq6, Out1. Annotations: Root, Inferred Root, Outgroup.]
Fig. 14.1. Two common forms of bifurcating tree are used in phylogenetics. (A) Rooted trees make explicit assumptions about the most recent common ancestor of sequences and can imply directionality of the evolutionary process. (B) Unrooted trees assume a time-reversible evolutionary process. The root of sequences 1–6 can be inferred by adding an outgroup.
Rooted trees explicitly identify the location of this ancestor (Fig. 14.1A), whereas unrooted trees do not (Fig. 14.1B). In practice, the majority of studies use unrooted trees and, if rooting is required, it can be inferred through the use of an outgroup (Fig. 14.1B).
2.1. Scoring Trees
To discriminate between trees, a score function is required that quantifies how well a phylogenetic tree describes the observed sequences. The last 40 years have led to the proposal of many different scoring functions, summarized briefly as statistical, parsimony, and distance-based. There has been considerable debate in the literature over which methodology is the most appropriate for inferring trees. A brief discussion of each is provided in the following, although a full discussion is beyond this chapter’s scope (but see Note 1; see also (6, 9–11) for an introduction). Statistical methods are currently in the ascendancy and use likelihood-based scoring functions. Likelihood is a measure proportional to the probability of observing the data given the parameters specifying an evolutionary model and branch lengths in the tree. These describe how sequences change over time and how much change has occurred on particular lineages, respectively (see Chapter 13) (9). Statistical methods come in two varieties: maximum likelihood (ML), which searches for the tree that maximizes the likelihood function, and Bayesian inference, which samples trees in proportion to the likelihood function and prior expectations. The primary strength behind statistical methods is that they are based on established and reliable methodology that has been applied to many areas of research, from classical population genetics to modeling world economies (12). They are also statistically consistent: Under an accurate evolutionary model they tend to converge to the “true tree” as longer sequences are used (13, 14). This, and other associated properties, enables statistical methodology to produce high-quality phylogenetic estimates with a minimum of bias under a wide range of
conditions. The primary criticism of statistical methods is that they are computationally intensive. Progress in phylogenetic software and computing resources is steadily meeting this challenge. Parsimony counts the minimum number of changes required on a tree to describe the observed data. Parsimony is intuitive to understand and computationally fast, relative to statistical methods. It is often criticized for being statistically inconsistent: Increasing sequence length in certain conditions can lead to greater confidence in the wrong tree. Parsimony does not include an explicit evolutionary model (although see 15). In some quarters, this is seen as an advantage because of a belief that evolution cannot be modeled (16). Others view this as a disadvantage because it does not account for widely acknowledged variation in the evolutionary process. In practice, parsimony may approximate statistical methods when the branch lengths on a tree are short (15). Distance-based criteria use pairwise estimates of evolutionary divergence to infer a tree, either by an algorithmic clustering approach (e.g., neighbor joining) (17) or assessing the fit of the distances to particular tree topologies (e.g., least-squares) (18). Distance methods are exceptionally quick: A phylogeny can be produced from thousands of sequences within minutes. When an adequate evolutionary model is used to obtain distances, the methodology can be consistent, although it is probable that it converges to the “true tree” at a slower rate than full statistical methods. An additional weakness of distance methods is that they use only pairwise comparisons of sequences to construct a tree. These comparisons are long distances relative to the branches on a tree, and are consequently harder to estimate and prone to larger variances. 
In contrast, statistical methods estimate evolutionary distances on a branch-by-branch basis through a tree, exploiting evolutionary information more efficiently by using all sequences in a tree to inform about branch lengths and reduce the variance of each estimate. Purely algorithmic distance methods, such as neighbor-joining and some of its derivatives, do not use a statistical measure to fit distances to a tree. The estimate is taken as the outcome of a predefined algorithm, making it unclear what criterion is used to estimate the tree. In practice, however, algorithmic methods appear to function as good approximations to other more robust methods, such as least-squares or the minimum-evolution criterion.
2.2. Why Estimating Trees Is Difficult
The effective and accurate estimation of trees remains difficult, despite their wide-ranging importance in experimental and computational studies. It is difficult to find the optimal tree in statistical and parsimony methods because the size of tree space rapidly increases with the number of sequences, making an exhaustive analysis impractical for even modest numbers of sequences. For 50 sequences, there are approximately 10^76 possible trees, a number comparable to the estimated number of atoms in the observable
universe. This necessitates heuristic approaches for searching tree space that speed up computation, usually at the expense of accuracy. The phylogenetic tree estimation problem is unusual and there are few well-studied examples from other research disciplines to draw on for heuristics (20). Consequently, there has been a lot of active research into methodology to find the optimal tree using a variety of novel heuristic algorithms. Many of these approaches progressively optimize the tree estimate by iteratively examining the score of nearby trees and making the highest scoring the new best estimate, stopping when no further improvement can be found. The nature of the heuristics means there is often no way of deciding whether the newly discovered optimum is the globally best tree or whether it is one of many other local optima in tree space (although see (19) for a description of a branch-and-bound algorithm). Through acknowledging this problem and applying phylogenetic software to its full potential, it is possible to produce good estimates of trees that exhibit many characteristics of a sequence’s evolutionary history.
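The growth of tree space can be made concrete with a short calculation. The sketch below assumes the standard counting formulas for strictly bifurcating trees on n labelled sequences, (2n - 5)!! unrooted and (2n - 3)!! rooted topologies; these formulas are not stated in the text above, and the function names are illustrative. Under that assumption, the figure of roughly 10^76 for 50 sequences corresponds to the rooted count.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 or 2; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n):
    """Unrooted bifurcating topologies for n >= 3 labelled sequences: (2n - 5)!!."""
    if n < 3:
        raise ValueError("need at least three sequences")
    return double_factorial(2 * n - 5)


def num_rooted_trees(n):
    """Rooted bifurcating topologies for n >= 2 labelled sequences: (2n - 3)!!."""
    if n < 2:
        raise ValueError("need at least two sequences")
    return double_factorial(2 * n - 3)


if __name__ == "__main__":
    # Python integers are arbitrary precision, so the counts are exact.
    for n in (4, 5, 10, 20, 50):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Even at 10 sequences the unrooted count already exceeds two million, which is why exhaustive evaluation is abandoned so early in favour of heuristics.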
3. Point Estimation of Trees
The majority of software for inferring trees results in a single (point) estimate of the tree that best describes the evolutionary relationships of the data. This is ultimately what most researchers are interested in and what is usually included in any published work. In order to obtain the best possible estimate it is valuable to understand how phylogenetic software functions and the strengths and weaknesses of different approaches (see Note 2). Phylogenetic heuristics can be summarized by the following four-step approach:
1. Propose a tree.
2. Refine the tree using a traversal scheme until no further improvement is found.
3. Check stopping criterion.
4. Resample from tree space and go to 2.
Not all four steps are employed by all phylogenetic software. Many distance methods, for example, use only step 1, whereas many others stop after refining a tree estimate.
3.1. Proposing an Initial Tree
Initial tree proposal is fundamental to all phylogenetic tree estimation and there is no substitute for a good starting topology when inferring trees. The most popular approach to producing an initial tree is to use distance-based clustering methods. This is usually chosen for computational speed, so purely algorithmic approaches are common. Occasionally, more sophisticated approaches, such
as quartet puzzling (21), are used. These can be highly effective for smaller datasets, but often do not scale well to larger numbers of sequences. An alternative approach is to choose a tree based on external information, such as the fossil record or prior studies, which can be appropriate when examining the relationship of well-characterized species and/or molecules. Another widely used approach is to use sequence-based clustering algorithms (9, 19). These are similar to distance-based methods, but instead of constructing a tree using pairwise distances, they use something akin to full statistical or parsimony approaches. The two most popular are the stepwise addition and star-decomposition algorithms (Fig. 14.2). Stepwise addition (Fig. 14.2A) starts with a tree of three sequences and adds the remaining sequences to the tree in a random order, each at the location that maximizes the scoring criterion. The order in which the sequences are added to the tree can affect the final proposed topology, which also allows a random order of addition to be used as a re-sampling step. Star-decomposition (Fig. 14.2B) starts with all of the sequences related by a star phylogeny: a tree with no defined internal branches. The algorithm progressively adds branches to this tree by resolving the multi-furcations (undefined regions of a tree) that increase the scoring criterion by the largest amount. Provided there are no tied optimal scores, the algorithm is deterministic, resulting in the same proposed tree each time.
3.2. Refining the Tree Estimate
Refining the tree estimate is the heuristic optimization step. The tree is improved using an iterative procedure that stops when no further improvement can be found. For each iteration, a traversal scheme is used to move around tree space and propose a set of candidate trees from the current tree estimate. Each tree is
[Figure 14.2: tree diagrams on Seq1 to Seq5. Panel A (stepwise addition): "Add Seq4 to branch leading to Seq3", then "Add Seq5 to branch leading to Seq2". Panel B (star-decomposition): "Add branch separating Seq2 and Seq5", then "Add branch separating Seq3 and Seq4".]
Fig. 14.2. Sequence-based clustering algorithms are frequently used to propose trees. (A) Stepwise addition progressively adds sequences to a tree at a location that maximizes the score function. (B) Star-decomposition starts with a topology with no defined internal branches and serially adds branches that maximize the score function. For example, in the first step branches could be added that separate all possible pairs of sequences.
assessed using the score function and one (or more) trees are chosen as the starting point for the next round of iteration. This is usually the best tree, although some approaches, such as Bayesian inference and simulated annealing, accept suboptimal trees (see the following). The popular traversal schemes discussed in the following share a common feature: they propose candidate trees by making small rearrangements on the current tree, examining each internal branch of a tree in turn and varying the way they propose trees from it. Three of the most popular methods are, in order of complexity and the number of candidate trees they produce: nearest neighbor interchange (NNI), subtree pruning and regrafting (SPR), and tree bisection and reconnection (TBR). NNI (Fig. 14.3A) breaks an internal branch to produce four subtrees. Each of the three possible combinations for arranging these subtrees is added to the list of candidate trees. SPR (Fig. 14.3B) is a generalization of NNI that generates candidate trees by breaking the internal branch under consideration and proposing new trees by putting together the resultant subtrees in different ways, one of which is the original topology. The set of candidate trees is generated by regrafting the broken branch of subtree A to each of subtree B’s branches; Fig. 14.3B shows
[Figure 14.3: schematic. Panel A, Nearest Neighbour Interchange: Step 1, take original topology; Step 2, choose one of three possible rearrangements. Panel B, Subtree Pruning and Regrafting: Step 1, take original topology; Step 2, break branch under consideration; Step 3, attach subtree to any branch in other subtree. Panel C, Tree Bisection and Reconnection: Step 1, take original topology; Step 2, break branch under consideration; Step 3, connect any branch from one subtree with any branch from the other.]
Fig. 14.3. Three related schemes for traversing tree space when searching for an optimal tree. (A) Nearest Neighbor Interchange: an internal branch is broken and the three potential arrangements of the four subtrees are examined. (B) Subtree Pruning and Regrafting: an internal branch is removed and all ways of regrafting one of the resulting subtrees to the other are examined. Dotted arrows demonstrate some potential regrafting points. (C) Tree Bisection and Reconnection: an internal branch is removed and all possible ways of reconnecting the subtrees are examined. Dotted lines demonstrate potential reconnection points.
three example regraftings (dotted arrows). To complete the set of candidate trees the order of the subtrees is reversed and the process repeated, this time regrafting subtree B to each of subtree A’s branches. TBR is similar to SPR, but generalizes it further to produce even more candidate trees per branch. When TBR breaks a branch, all possible ways of joining the two subtrees are added to the list. Some example reconnections are illustrated in Fig. 14.3C as dotted lines. NNI, SPR, and TBR are hierarchical in structure because the set of trees proposed by the more complex approach completely contains those proposed by the simpler approach (22). This hierarchical structure is not general: There are other approaches based on removing multiple branches from a tree that do not follow this pattern, but these are not widely implemented (23, 24). The advantages and disadvantages of these methods emerge from the number of candidate trees they produce. NNI produces a modest number of proposed trees, growing linearly with the number of sequences in the phylogeny. This limits its effectiveness by allowing only small steps in tree space and a greater susceptibility to local optima than more expansive schemes. The number of trees proposed by SPR rises rapidly as the number of sequences increases, making it computationally impractical for large numbers of sequences, although the greater number of candidate trees results in a larger step size and fewer local optima. Innovations based around SPR limit the number of candidate trees by bounding the number of steps away that a subtree can move from its original position. The subtree in Fig. 14.3B, for example, could be bounded in its movement to a maximum of two branches (all branches not represented by a triangular subtree in the figure). Each branch has a maximum number of subtrees it can produce, returning a linear relationship between the number of candidate trees and sequences. 
This approach offers a promising direction for future algorithm research, but currently is not widely implemented (25, 26). The characteristics of TBR are similar to SPR, but amplified because the number of candidate topologies per branch increases even more rapidly with the number of sequences.
3.3. Stopping Criteria
Many phylogenetic software packages do not resample tree space and stop after a single round of refinement. When resampling is used, a stopping rule is required. These are usually arbitrary, allowing only a pre-specified number of resamples or refinements. A recent innovation offers an alternative based on how frequently improvements in the overall optimal tree are observed. This dynamically estimates the number of iterations required before no further improvement in tree topology will be found and stops the algorithm when this has been exceeded (27).
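The four-step scheme of Section 3, combined with the simplest stopping rule mentioned here (a pre-specified number of resamples), can be sketched as a generic hill-climbing loop. This is a minimal illustration rather than the procedure of any particular package: the function names are hypothetical, and a small one-dimensional list of scores stands in for tree space so that the local optima are easy to see.

```python
import random


def heuristic_search(propose, neighbours, score, max_restarts, rng):
    """Hill-climb from several starting points and keep the best optimum found."""
    best, best_score = None, float("-inf")
    for _ in range(max_restarts):        # step 3: stop after a fixed number of restarts
        current = propose(rng)           # steps 1 and 4: propose or resample a start
        current_score = score(current)
        while True:                      # step 2: refine until no improvement is found
            candidate = max(neighbours(current), key=score)
            candidate_score = score(candidate)
            if candidate_score <= current_score:
                break                    # current point is a local optimum
            current, current_score = candidate, candidate_score
        if current_score > best_score:
            best, best_score = current, current_score
    return best, best_score


# Hypothetical stand-in for tree space: positions 0..9 with several local optima.
LANDSCAPE = [1, 3, 2, 5, 4, 9, 7, 6, 8, 2]


def toy_score(x):
    return LANDSCAPE[x]


def toy_neighbours(x):
    # The "traversal scheme": only adjacent positions are candidates.
    return [y for y in (x - 1, x + 1) if 0 <= y < len(LANDSCAPE)]


def toy_propose(rng):
    return rng.randrange(len(LANDSCAPE))


if __name__ == "__main__":
    print(heuristic_search(toy_propose, toy_neighbours, toy_score, 20, random.Random(0)))
```

With a single start the search can stall on a poor local optimum (starting at position 0 it stops at position 1 with score 3); additional restarts are what allow the global optimum at position 5 to be reached. In a real application `propose`, `neighbours`, and `score` would be replaced by a tree proposal method, a traversal scheme such as NNI, and a likelihood or parsimony score.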
3.4. Resampling from Tree Space
Sampling from one place in tree space and refining still may not find the globally optimal tree. The goal of resampling from tree space is to expand the area of tree space searched by the heuristic and uncover new, potentially better optima. This is achieved by starting the refinement procedure from another point in tree space. This approach was originally used in some of the earliest phylogenetic software (28), but is only lightly studied relative to improvements in the refinement methodology. Three of the many possible resampling schemes are discussed here: uniform resampling, stepwise addition, and importance quartet puzzling (IQP) (27). Uniform resampling is the simplest resampling strategy and the probability of picking each possible tree is one divided by the total number of trees. Although rarely used in practice, its deficiencies are edifying to the tree estimation problem. Each optimum has an area of tree space associated with it that will lead back to it during the refinement process, referred to as a center of attraction. In the majority of phylogenetic problems there are potentially large numbers of optima and the centers of attraction can be relatively small. Uniform sampling is prone to ending up in poor regions of tree space where nearby optima are unlikely to be particularly high. Finding centers with high optima can be difficult and requires an intelligent sampling process. Stepwise addition with random sequence ordering is a viable resampling strategy because adding sequences in a different order is liable to produce a different, but equally good, starting tree. Both stepwise addition and uniform resampling effectively throw away information from the current best estimate of a tree. IQP keeps some of this information and can be viewed as a partial stepwise addition process. It resamples by randomly removing a number of sequences from the current best tree, then consecutively adding them back to the tree in a random order using the IQP algorithm. 
This identifies good locations to insert sequences by examining a set of four-species subtrees that all include the newly added sequence (see ref. 27 for more details). IQP resampling has been demonstrated to be reasonably effective for tree estimation when coupled with NNI. An alternative approach to this purist resampling is to combine two or more refinement heuristics, the quicker of which (e.g., NNI) is used for the refinement step, and an alternative with a slower more expansive scheme (e.g., TBR) is used rarely to make larger steps in tree space.
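Stepwise addition with a random addition order, discussed above as a resampling scheme, can be sketched in a few lines. Several choices here are assumptions made for illustration: Fitch parsimony is used as the scoring criterion (so the best location is the one that minimizes the number of changes), trees are stored as nested tuples with an arbitrary root (the parsimony score does not depend on where the root is placed), and all function names are hypothetical. The toy alignment is the five-sequence "original data set" shown in Fig. 14.4.

```python
import random

# Toy alignment copied from the "Original data set" panel of Fig. 14.4.
SEQS = {"S1": "AAGCT", "S2": "AGGGT", "S3": "TAGCT", "S4": "AGGCT", "S5": "AGGGT"}


def _fitch(node, seqs, site):
    """Return (possible ancestral states, minimum changes) for one alignment column."""
    if isinstance(node, str):                      # a leaf is a sequence name
        return {seqs[node][site]}, 0
    left_states, left_cost = _fitch(node[0], seqs, site)
    right_states, right_cost = _fitch(node[1], seqs, site)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def fitch_score(tree, seqs):
    """Parsimony score: minimum number of changes summed over all columns."""
    n_sites = len(next(iter(seqs.values())))
    return sum(_fitch(tree, seqs, site)[1] for site in range(n_sites))


def insertions(tree, leaf):
    """Yield every tree formed by attaching `leaf` to one branch of `tree`."""
    yield (tree, leaf)                             # attach on the branch above this node
    if not isinstance(tree, str):
        left, right = tree
        for new_left in insertions(left, leaf):
            yield (new_left, right)
        for new_right in insertions(right, leaf):
            yield (left, new_right)


def stepwise_addition(seqs, rng):
    """Build a tree by adding sequences in random order at the best-scoring branch."""
    order = list(seqs)
    rng.shuffle(order)
    tree = ((order[0], order[1]), order[2])        # only one topology for three sequences
    for leaf in order[3:]:
        tree = min(insertions(tree, leaf), key=lambda t: fitch_score(t, seqs))
    return tree


if __name__ == "__main__":
    for seed in range(3):
        tree = stepwise_addition(SEQS, random.Random(seed))
        print(tree, fitch_score(tree, SEQS))
```

Different seeds give different addition orders and so may give different nested-tuple trees; on a dataset this small and free of conflict the scores agree, but on realistic data the starting trees can differ in both topology and score, which is what makes random addition order useful for resampling.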
3.5. Other Approaches to Point Estimation
Other popular approaches to phylogenetic tree estimation include genetic algorithms, simulated annealing (SA), and supertree reconstruction. Genetic algorithms are a general approach for numerical optimization that use evolutionary principles to allow a population of potential trees to adapt by improving
their fitness (score function) according to a refinement scheme defined by the software designer. As the algorithm progresses, a proxy for natural selection weeds out trees with a lower fitness, and better trees tend to become more highly represented. After a period of time, the algorithm is stopped and the best topology discovered is the point estimate. The construction of genetic algorithms is very much an art and highly dependent on the designer’s ability to construct a coherent and effective fitness scale, and the application of quasi-natural selection. Some approaches for estimating trees using genetic algorithms have been noticeably successful (29, 30). SA bases its optimization strategy on the observation that natural materials find their optimal energy state when allowed to cool slowly. Usual approaches to SA propose a tree at random from a traversal scheme, and the probability of accepting this as the current tree depends on whether it improves the score function. A tree that improves the score is accepted. Trees with lower scores are accepted with a probability related to the score difference between the current and new tree, and a “heat” variable that decreases slowly over time. This random element allows the optimization process to move between different centers of attraction. SA starts “hot,” frequently accepting poor trees and covering large tracts of tree space. As it gradually “cools,” it becomes increasingly focused on accepting only trees with a higher score and the algorithm settles on a best tree. SA has proved very useful in other difficult optimization problems, but has yet to be widely used in phylogenetics (but see (31–33)).
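The acceptance rule for SA can be written down compactly. The text says only that the acceptance probability is related to the score difference and the heat; the sketch below assumes one common concrete choice, the Metropolis-style rule exp(difference / heat), together with a geometric cooling schedule. Both are assumptions for illustration, and the function names are hypothetical.

```python
import math
import random


def acceptance_probability(current_score, new_score, heat):
    """Probability of moving to the proposed tree (higher scores are better)."""
    if new_score >= current_score:
        return 1.0                                   # improvements are always accepted
    return math.exp((new_score - current_score) / heat)


def accept(current_score, new_score, heat, rng):
    """Draw a random decision with the probability computed above."""
    return rng.random() < acceptance_probability(current_score, new_score, heat)


def cooling_schedule(start_heat, rate, steps):
    """Geometric cooling: the heat is multiplied by `rate` (0 < rate < 1) each step."""
    return [start_heat * rate ** i for i in range(steps)]


if __name__ == "__main__":
    rng = random.Random(0)
    current, proposed = -1050.0, -1053.0             # hypothetical log-likelihood scores
    for heat in cooling_schedule(8.0, 0.5, 5):
        p = acceptance_probability(current, proposed, heat)
        print(f"heat={heat:5.2f}  P(accept worse tree)={p:.4f}  drawn={accept(current, proposed, heat, rng)}")
```

The same drop in score is accepted readily while the heat is high and almost never once it is low, which is the behaviour described above as starting "hot" and gradually "cooling".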
4. Confidence Intervals on Point Estimates
In studies in which the phylogeny is of prime importance, it is necessary to attach a degree of confidence to the point estimate. This typically involves using computer simulations to generate new datasets from features of the original. The underlying principle is that simulated data represent independent draws from the same distribution over tree space (and evolutionary model space) as the process that generated the real data. This allows an assessment of the variability of the estimate and the construction of a confidence interval. This form of simulation is often known as a bootstrap (34), after the phrase “pulling oneself up by the bootstraps,” as it is employed when a problem is too difficult for classical statistics to solve. There are two broad approaches to bootstrap data widely used in phylogenetics: non-parametric bootstrapping and parametric bootstrapping. Figure 14.4 contains examples of all the methods discussed in the following.
[Figure 14.4: schematic; recoverable panel content follows.
Original data set: S1 AAGCT; S2 AGGGT; S3 TAGCT; S4 AGGCT; S5 AGGGT.
Non-parametric bootstrap: break sequence into constituent columns; probability distribution of columns (0.2 for each of the five columns); simulated data sets 1 to n.
Parametric bootstrap: tree and model from phylogenetic analysis; model and tree define probability distribution of columns; simulated data sets 1 to n.
Bootstrap values mapped onto tree: 1.00, 0.75.
Non-parametric hypothesis tests on five candidate trees: bootstrap probabilities 0.75, 0.23, 0.02, 0.00, 0.00; AU test 0.84, 0.34, 0.12, 0.04, 0.01; SH test 0.88, 0.41, 0.20, 0.06, 0.04.
SOWH test of monophyly: log likelihood under HN (monophyly of S2:S5) -1056.12; under HA (no constraints) -1047.69; likelihood difference (δ) 8.43, compared with the distribution produced by parametric bootstrapping under the null hypothesis.]
Fig. 14.4. Confidence in phylogenetic trees is often assessed through simulation approaches called bootstrapping. The different forms of bootstrap use the original data in different ways to simulate new datasets (center). Non-parametric bootstrapping (left) produces new data by resampling with replacement from the alignment columns from the original data. Parametric bootstrapping (right) uses parameters estimated from the original data to produce new datasets. Trees are estimated for each of the simulated sets and their results summarized using a variety of measures (bottom). The values shown are for demonstrative purposes only. See text for more details of the procedure.
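The column resampling that the caption describes for the non-parametric bootstrap, and the way bootstrap values summarize the resulting trees, can be sketched as follows. The tree estimation step itself is omitted; the sketch assumes that each replicate tree has already been reduced to a set of bipartitions, each represented (as a simplification) by the set of sequence names on one side. All names are hypothetical, and the alignment is the original data set of Fig. 14.4.

```python
import random

# Original data set from Fig. 14.4.
ALIGNMENT = {"S1": "AAGCT", "S2": "AGGGT", "S3": "TAGCT", "S4": "AGGCT", "S5": "AGGGT"}


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]   # each column equally likely
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def split_support(split, replicate_splits):
    """Fraction of replicate trees whose set of bipartitions contains `split`."""
    hits = sum(1 for splits in replicate_splits if split in splits)
    return hits / len(replicate_splits)


if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(3):
        print(bootstrap_replicate(ALIGNMENT, rng))

    # Hypothetical bipartitions recovered from four replicate trees.
    replicate_splits = [
        {frozenset({"S2", "S5"}), frozenset({"S1", "S3"})},
        {frozenset({"S2", "S5"}), frozenset({"S1", "S3"})},
        {frozenset({"S2", "S4"}), frozenset({"S1", "S3"})},
        {frozenset({"S2", "S5"}), frozenset({"S1", "S3"})},
    ]
    print(split_support(frozenset({"S2", "S5"}), replicate_splits))
```

Because whole columns are drawn, every column in a replicate is a column of the original alignment; only their frequencies change from one replicate to the next.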
4.1. The Non-Parametric Bootstrap
The non-parametric bootstrap is applicable to all methods of phylogenetic inference. It assumes that the probability distribution of observed columns in the original sequence alignment is representative of the complex and unknown evolutionary process. In other words, it assumes that if evolution had produced multiple copies of the original data, the average frequency of each alignment column would be exactly that observed in the original data. This philosophy underpins the simulation strategy. Sampling with replacement is used to produce a simulated dataset by repeatedly drawing from the original data to make a new dataset of suitable length. Each column in the original alignment has an equal probability of contributing to the simulated data. This approach allows non-parametric bootstrapping to encompass some of the complexities of sequence evolution that are not easily modeled, such as complex substitution patterns and rate variation, but introduces a finite sampling problem: Only a small proportion of all possible data columns could possibly be represented in the real data. For a complete DNA alignment covering 20 species there are ∼10^12 possible data columns (four DNA bases
raised to the 20th power). It would be unreasonable to expect any real dataset to provide a detailed representation of the probability distribution over this space. The most ubiquitous non-parametric bootstrap test of confidence is simple bootstrapping, which assesses confidence in a tree on a branch-by-branch basis (35). Tree estimates are obtained for a large number of simulated datasets. Bootstrap values are placed on branches of the original point estimate of the tree as the frequency that implied bipartitions are observed in trees estimated from the simulated data. This is useful for examining the evidence for particular subtrees, but becomes difficult to interpret for a whole tree because it hides information about how frequently other trees are estimated. This can be addressed by describing the bootstrap probability of different trees. In Fig. 14.4, the two bootstrap values are expanded to five bootstrap probabilities and, using hypothesis testing, three of the five can be rejected. These forms of simple bootstrapping are demonstrably biased and often place too much confidence in a small number of trees (36, 37), but due to their simplicity they remain practical and useful tools for exploring confidence in a tree estimate. Two other useful forms of the non-parametric bootstrap are employed in the Shimodaira-Hasegawa (SH) (38) and Approximately Unbiased (AU) (39) tests. These tests require a list of candidate trees to be proposed, representing a set of alternate hypotheses, such as species grouping, and usually containing the optimal tree estimate. The tests form a confidence set by calculating a value related to the probability of each tree being the best tree, and then rejecting those that fall below the critical value. A well-chosen list allows researchers to reject and support biologically interesting hypotheses based on tree shape, such as monophyly and gene duplication. Both tests control the level of type I (false-positive) error successfully.
In other words, the confidence interval is conservative and does not place unwarranted confidence in a small number of trees. This was a particular problem for the tests’ predecessor, the Kishino-Hasegawa (KH) test (40), which, due to a common misapplication, placed undue confidence in a small number of trees. The AU test is constructed in a subtly different manner from the SH test, which removes a potential bias and increases statistical power. This allows the AU test to reject more trees than the SH test and produce tighter confidence intervals (demonstrated in Fig. 14.4).
4.2. The Parametric Bootstrap
The parametric bootstrap is applicable to statistical methods of phylogenetic inference and is widely used to compare phylogenetic models as well as trees. The simulation is performed by generating new sequences and allowing them to evolve on a tree topology, according to the parameters in the statistical model estimated from the original data, such as replacement rates between
bases/residues and branch lengths. This completely defines the probability distribution over all possible data columns, even those not observed in the original data, which avoids potential problems introduced by sampling in the non-parametric bootstrap. This introduces a potential source of bias because errors in the evolutionary model and its assumptions are propagated in the simulation. As evolutionary models become more realistic and as the amount of data analyzed grows, the distributions across data columns produced by parametric and non-parametric bootstrapping may be expected to become increasingly similar. When the model is sufficiently accurate and the sequences become infinitely long the two distributions may be expected to be the same, although it is unclear whether anything close to this situation occurs in real data. The most popular test using the parametric bootstrap is the Swofford-Olsen-Waddell-Hillis (SOWH) (19) test. This also addresses the comparison of trees from a hypothesis testing perspective, by constructing a null hypothesis (HN) of interest and an alternative, more general hypothesis (HA). Figure 14.4 demonstrates this through a simple test of monophyly. A likelihood is calculated under the null hypothesis of monophyly, which restricts tree space by enforcing that a subset of sequences always constitutes a single clade (group) on a tree, in this case S2 and S5 always being together. In practice this means performing a tree search on the subset of tree space in which the clade exists. A second likelihood is calculated under the alternative hypothesis of no monophyly, which allows tree estimation from the entirety of tree space. The SOWH test examines the improvement in likelihood, d, observed by allowing the alternative hypothesis. To perform a statistical test, d needs to be compared with some critical value on the null distribution. 
No standard distribution is appropriate and parametric bootstrapping is used to estimate it, with the parameters required for the simulation taken from HN. Each simulated dataset is assessed under the null and alternative hypotheses and the value of d is taken as a sample from the null distribution. When repeated large numbers of times, this produces a distribution of d that is appropriate for significance testing. In the example, the observed value of d falls outside the 95% mark of the distribution, meaning that the null hypothesis of monophyly is rejected. This example demonstrates some of the strengths and weaknesses of parametric bootstrapping. The SOWH test does not require a limited list of trees to be defined because hypotheses can be constructed by placing simple restrictions on tree space. This allows the SOWH test to assess complex questions that are not otherwise easily addressed, but the computational burden of analyzing each simulated dataset may be as great as that of the original dataset. This can make large numbers of bootstrap replicates infeasible.
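The final comparison step of the SOWH test can be sketched in a few lines. This is a minimal illustration rather than the procedure of any particular package: the function names, the add-one correction in the p-value, and the placeholder null distribution are all assumptions made here. In a real analysis each simulated value of d would come from simulating a dataset under HN and repeating both tree searches on it.

```python
import random


def sowh_p_value(observed_d, null_ds):
    """Monte Carlo p-value: share of simulated d values at least as large as the observed d."""
    exceed = sum(1 for d in null_ds if d >= observed_d)
    # Add-one correction (an assumption of this sketch) keeps the estimate above zero.
    return (exceed + 1) / (len(null_ds) + 1)


def critical_value(null_ds, level=0.95):
    """Empirical quantile of the simulated null distribution of d."""
    ordered = sorted(null_ds)
    index = min(len(ordered) - 1, int(level * len(ordered)))
    return ordered[index]


# Placeholder null distribution, standing in for d values obtained from
# datasets simulated under the null hypothesis (hypothetical numbers).
rng = random.Random(1)
null_ds = [rng.expovariate(1.0) for _ in range(999)]
observed_d = 12.0  # hypothetical observed improvement in log-likelihood

p = sowh_p_value(observed_d, null_ds)
reject_null = observed_d > critical_value(null_ds, level=0.95)
```

The expensive part, which this sketch deliberately leaves out, is producing each entry of `null_ds`, since every entry requires two full tree searches.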
4.3. Limitations of Current Bootstrap Simulation
Both of the discussed bootstrapping approaches have theoretical and practical limitations. The primary practical limitation of both bootstrapping methods is that they are computationally very intensive, although there are many texts detailing approximations that make them more efficient (19, 41). The theoretical limitations of current bootstrapping methods stem from the simplifying assumptions they make when describing sequence evolution. The most serious of these are a poor choice of evolutionary model, the neglect of insertion-deletion mutations (indels), and the neglect of the effect of neighboring sites on sequence evolution. In statistical methods, a poor choice of model can adversely affect all forms of bootstrapping. In non-parametric bootstrapping, inaccuracies in the model can lead to biases in the tree estimate that may manifest as overconfidence in an incorrect tree topology. Modeling errors in parametric bootstrapping may result in differences between the probability distribution of columns generated during the simulation and the distribution of the “true” evolutionary process, which may make the simulated distribution inappropriate for statistical testing. The relative effects of model mis-specification on bootstrapping are generally poorly characterized and every care should be taken when choosing a model for phylogenetic inference (see Chapter 16). Indels are common mutations that can introduce alignment errors, which can impact phylogenetic analysis (see Chapters 7 and 15). These effects are generally poorly characterized and, to their detriment, the majority of phylogenetic methodologies ignore them. The context of a site in a sequence may have a significant effect on its evolution, and there are numerous well-characterized biological dependencies that are not covered by standard simulation models. In vertebrate genome sequences, for example, methylation of the cytosine in CG dinucleotides results in rapid mutation to TG.
Non-parametric bootstrap techniques using block resampling (42) or the use of more complex evolutionary models (e.g., hidden Markov models) (43) in the parametric bootstrap would alleviate some of these problems, but they are rarely used in practice.
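The difference between simple and block resampling of alignment columns can be illustrated with a short sketch. The scheme below is a moving-block bootstrap, chosen here as one common form of block resampling; it is an assumption of this example and is not taken from reference (42). The function name and the toy alignment are hypothetical.

```python
import random


def bootstrap_columns(alignment, block_size=1, rng=None):
    """Resample alignment columns with replacement, in contiguous blocks.

    alignment is a list of equal-length strings, one per sequence.
    block_size=1 gives the simple site-wise bootstrap; larger values keep
    neighbouring columns together so local dependencies are preserved.
    """
    rng = rng or random.Random()
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all sequences must have the same length")
    if not 1 <= block_size <= n_cols:
        raise ValueError("block_size must be between 1 and the alignment length")
    starts = range(n_cols - block_size + 1)
    columns = []
    while len(columns) < n_cols:
        start = rng.choice(starts)
        columns.extend(range(start, start + block_size))
    columns = columns[:n_cols]  # trim the last block to the original length
    return ["".join(seq[c] for c in columns) for seq in alignment]


# Toy alignment (hypothetical data, three sequences of six columns).
toy_alignment = ["ACGTAC", "ACGTTC", "TCGAAC"]
simple_replicate = bootstrap_columns(toy_alignment, block_size=1, rng=random.Random(0))
block_replicate = bootstrap_columns(toy_alignment, block_size=2, rng=random.Random(0))
```

Each replicate would then be passed to the same tree estimation procedure as the original data; the choice of block size is a modeling decision that this sketch does not address.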
5. Bayesian Inference of Trees
Bayesian inference of phylogenetic trees is a relatively recent innovation that simultaneously estimates the tree and the parameters in the evolutionary model, while providing a measure of confidence in those estimates. The following provides a limited introduction to Bayesian phylogenetics, highlighting
some of the principles behind the method, its advantages, disadvantages, and similarities to other methodology. More comprehensive guides to Bayesian phylogenetics can be found in Huelsenbeck et al. (44), Holder and Lewis (45), and online in the documentation for the BAMBE and MrBayes software. The major theoretical difference between Bayesian inference and likelihood approaches is that the former includes a factor describing prior expectations about a problem (see Note 3). More precisely, the prior is a probability distribution over all parameters, including the tree and model, describing how frequently one would expect values to be observed before evaluating evidence from the data. Bayesian inference examines the posterior distribution of the parameters of interest, such as the relative probabilities of different trees. The posterior is a probability distribution formed as a function of the prior and likelihood, which represents the information held within the data. When Bayesian inference is successful, a large dataset would ideally produce a posterior distribution focused tightly around a small number of good trees.
5.1. Bayesian Estimation of Trees
In order to make an inference about the phylogenetic tree, the posterior distribution needs to be processed in some way. To obtain the posterior probability of trees, the parameters not of direct interest to the analysis need to be integrated out of the posterior distribution. These are often referred to as “nuisance parameters,” and include components of the evolutionary model and branch lengths. A common summary of the list of trees that Bayesian phylogenetics produces is the maximum a posteriori probability (MAP) tree, which is the tree that contains the largest mass of probability in tree space. The integration required to obtain the MAP tree is represented in the transition from left to right in the posterior distribution section of Fig. 14.5, where the area under the curve for each tree on the left equates to the posterior probability for each tree on the right. The confidence in the MAP tree can be estimated naturally from the posterior probabilities and requires no additional computation. In Bayesian parlance, this is achieved by constructing a credibility interval, which is similar to the confidence interval of classical statistics, and is constructed by adding trees to the credible set in order of decreasing probability. For example, the credibility interval for the data in Fig. 14.5 would be constructed by adding the trees in the order of C, A, and B. The small posterior probability contained in B means that it is likely to be rejected from the credibility interval. Readers should be aware that there are other equally valid ways of summarizing the results of a Bayesian analysis, including the majority rule consensus tree (46).
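The construction of a credible set described above can be sketched as follows. The posterior probabilities are hypothetical values chosen only to reproduce the ordering C, A, B mentioned for Fig. 14.5; the figure itself gives no numbers, and the 95% level is likewise an assumption of this example.

```python
def credible_set(posterior, level=0.95):
    """Add trees in order of decreasing posterior probability until `level` is reached."""
    ranked = sorted(posterior.items(), key=lambda item: item[1], reverse=True)
    chosen, total = [], 0.0
    for tree, prob in ranked:
        if total >= level:
            break
        chosen.append(tree)
        total += prob
    return chosen, total


# Hypothetical posterior probabilities for three topologies.
posterior = {"A": 0.36, "B": 0.02, "C": 0.62}

map_tree = max(posterior, key=posterior.get)  # tree with the largest posterior mass
trees_in_set, mass = credible_set(posterior, level=0.95)
```

With these numbers, tree C is the MAP tree and the set is complete once A has been added, so B is left out, matching the behaviour described in the text.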
[Figure 14.5 panels: Prior (informative or uninformative, over parameter space); Sequence data (alignment of Seq1 to Seq4); Posterior distributions (over parameter space and over tree space, for Trees A, B, and C).]
Fig. 14.5. A schematic of Bayesian tree inference. The prior (left) contains the information or beliefs one has about the parameters contained in the tree and the evolutionary model before seeing the data. During Bayesian inference, this is combined with the information about the tree and model parameter values held in the original data (right) to produce the posterior distribution. These may be summarized to provide estimates of parameters (bottom left) and trees (bottom right).
5.2. Sampling the Posterior Using Markov Chain Monte Carlo
In phylogenetics, a precise analysis of the posterior distribution is usually not computationally possible because it requires a summation across all possible tree topologies. Markov chain Monte Carlo (MCMC) rescues Bayesian inference by forming a series (chain) of pseudo-random samples from the posterior distribution as an approximation to it. Understanding how this sampling works is useful to further explain Bayesian tree inference. A simplified description of the MCMC algorithm for tree estimation follows:
1. Get an initial estimate of the tree.
2. Propose a new tree (often by methods similar to the traversal schemes in Section 3.2).
3. Accept the tree according to a probability function.
4. Go to Step 2.
The options for obtaining an initial tree estimate (Step 1) are the same as for point estimation, although a random starting place can also be a good choice because starting different MCMC chains from very different places can be useful for assessing their convergence (see the following). The tree proposal mechanism in Step 2 needs to satisfy at least three criteria: (a) the proposal process is random; (b) every tree is connected to every other tree; and (c) the chain does not periodically repeat itself. Criteria (a) and (b) allow the chain potential access to all points in tree space, which
in principle allows complete sampling if the chain is allowed to run long enough. The final criterion (c) is a technical requirement that ensures the chain does not repeatedly visit the areas of tree space in the same order; that is, the chain is aperiodic. The ability of MCMC to effectively sample tree space is highly dependent on the proposal scheme, and the most popular schemes are similar to those used in point estimation (see the preceding). The probability of a new tree being accepted (Step 3) is the function that enables the MCMC algorithm to correctly sample from the posterior distribution. The probability of acceptance depends on the difference in likelihood scores of the current and new tree, their chances of occurring under the prior, and an additional correction factor dependent on the sampling approach. A good sampling scheme coupled with this acceptance probability enables the chain to frequently accept trees that offer an improvement, while occasionally accepting mildly poorer trees. Trees with very low posterior probability are rarely visited. The overall result is that the amount of time a chain spends in a region of tree space is directly proportional to the posterior probability of that region. This allows the posterior probability of trees to be easily calculated as the fraction of time that the chain spends visiting different topologies.
5.3. How Long to Run a Markov Chain Monte Carlo
The number of samples required for MCMC to successfully sample the posterior distribution is dependent on two factors: convergence and mixing. A chain is said to have converged when it begins to accurately sample from the posterior distribution, and the period before this happens is called burn in. The mixing of a chain is important because it controls how quickly a chain converges and its ability to sample effectively from the posterior distribution afterward. When a chain mixes well, all trees can be quickly reached from all other trees and MCMC is a highly effective method. When mixing is poor, the chain’s ability to sample effectively from the posterior is compromised. It is notoriously difficult to confirm that the chain has converged and is successfully mixing, but there are diagnostic tools available to help. A powerful way to examine these conditions is to run multiple chains and compare them. If a majority of chains starting from substantially different points in tree space concentrate their sampling in the same region, it is indicative that the chains have converged. Evidence for successful mixing can be found by comparing samples between converged chains. When samples are clearly different, it is strong evidence that the chain is not mixing well. These comparative approaches can go awry; for example, when a small number of good tree topologies with large centers of attraction are separated by long and deep troughs in the surface of posterior probabilities. If by chance all the chains start in the same center of attraction, they can misleadingly
appear to have converged and mixed well even when they have poorly sampled the posterior. This behavior has been induced for small trees under artificial mis-specifications of the evolutionary model, although the general prevalence of this problem is currently unknown. An alternative diagnostic is to examine a plot of the likelihood and/or model parameter values, such as rate variation parameters and sum of branches in the tree, against sample number. Before convergence these values may tend to show discernible patterns of change. The likelihood function, for example, may appear on average to steadily increase, as the chain moves to progressively better areas of tree space. When the chain converges, these values may appear to have quite large random fluctuations with no apparent trend. Fast fluctuation accompanied by quite large differences in likelihood, for example, would be indicative of successful mixing. This character alone is a weak indicator of convergence because chains commonly fluctuate before they find better regions of tree space. New sampling procedures, such as Metropolis Coupled MCMC (MC3), are being introduced that can address more difficult sampling and mixing problems and are likely to feature more frequently in phylogenetic inference. See also Note 3.
5.4. The Specification of Priors
The subjectivity and applicability of priors is one of the thorniest subjects in the use of Bayesian inference. Their pros and cons are widely discussed elsewhere (e.g., pros (44); cons (9)) and I shall concentrate on practical details of their use in tree inference. If there is sufficient information in a dataset and the priors adequately cover tree space and parameter space, then the choice of prior should have only minimal impact on an analysis. There are broadly two types of prior: informative priors and uninformative priors. Informative priors describe a definite belief in the evolutionary relationships in sequences prior to analyzing the data, potentially utilizing material from a broad spectrum of areas, from previous molecular and morphological studies to an individual researcher’s views and opinions. This information is processed to form a probability distribution over tree space. Strong belief in a subset of branches in a phylogeny can be translated as a higher prior probability for the subsection of tree space that contains them. The utility of informative priors has been demonstrated (44), but they are rarely used in the literature. This is partly because the choice of prior in Bayesian inference often rests with those who produce tree estimation software, not the researcher using it. Implementing an opinion as a prior can be an arduous process if you are not familiar with computer programming, which limits a potentially interesting and powerful tool. Uninformative priors are commonly used in phylogenetics and are intended to describe the position of no previous knowledge
about the evolutionary relationship between the sequences. In tree inference this can be interpreted as each tree being equally likely, which is philosophically similar to how other methods, such as likelihood and parsimony, treat tree estimation. Producing uninformative priors has proved problematic because of the interaction of tree priors with those of other parameters. This problem has been demonstrated to manifest itself as high confidence in incorrect tree topologies (47) and overly high support for particular branches in a tree (48, 49). Further research demonstrated that this is likely to be the result of how Bayesian analysis deals with trees in which the lengths of some branches are very small or zero (47, 50). There have been several suggestions to deal with these problems, including bootstrapping (49) and describing trees with zero branches in the prior (51). There is currently no settled opinion on the effectiveness of these methods. Users of Bayesian phylogenetics should keep abreast of developments in the area and employ due care and diligence, just as with any other tree estimation method.
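To give a sense of what "each tree being equally likely" means numerically, the following sketch computes a flat prior over topologies. It assumes, for illustration only, that the prior is restricted to unrooted, fully bifurcating topologies, of which there are (2n - 5)!! for n sequences; branch lengths and model parameters, whose priors interact with the tree prior as discussed above, are ignored here. The function names are hypothetical.

```python
from fractions import Fraction


def num_unrooted_trees(n_taxa):
    """Number of unrooted, fully bifurcating topologies: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def uniform_tree_prior(n_taxa):
    """Prior probability of any single topology when all are equally likely."""
    return Fraction(1, num_unrooted_trees(n_taxa))


prior_for_ten_sequences = uniform_tree_prior(10)
```

Even for ten sequences the prior probability of any one topology is about one in two million, which is one reason the data must carry most of the information in such an analysis.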
12. Notes
1. Methodology and statistical models: The first step in any study requiring the estimation of a phylogenetic tree is to decide which methodology to use. Statistical methods are arguably the most robust for inferring trees. Their computational limitations are outweighed by benefits, including favorable statistical properties and explicit modeling of the evolutionary process. Choice of evolutionary model is also important and other chapters offer more details about the considerations one should make prior to phylogenetic inference. In general, models that more realistically describe evolution are thought more likely to produce accurate estimates of the tree. Models should include at least two components: a factor defining variation in the overall rate between sites, and an adequate description of the replacement rates between nucleotides/amino acids. Phylogenetic descriptions of rate variation allow each site in an alignment to evolve at a different overall rate with a defined probability. The distribution of potential rates is often described as a Γ-distribution, defined through a single parameter α that is inversely proportional to the amount of rate variation (52). For DNA models, replacement rates between bases should minimally consist of the parameters of the widely implemented HKY model (53), which describe the relative frequency of the different bases and the bias toward transition
mutation. The replacement rates between amino acids in protein models are usually not directly estimated from the data of interest. Instead, substantial amounts of representative data are used to produce generally applicable models, including the empirical models of Dayhoff (54) and Whelan and Goldman (WAG) (55) for globular proteins, and mtREV (56) and mtmam (57) for mitochondrial proteins. It is also common practice to adjust the relative frequency of the amino acids in these models to better reflect the data under consideration (58, 59). Models describing the evolution of codons can also be used for phylogeny estimation (60), although their primary use in phylogenetics remains the study of selection in proteins (see Chapter 15).
2. Choosing phylogenetic software: After deciding upon an appropriate methodology, a set of phylogenetic software must be chosen to estimate the tree. Phylogenetic studies require the best possible estimate of the tree, albeit tempered by computational limitations, and choosing software to maximize the coverage of tree space is advantageous. This should ideally involve creating a list of potentially best trees from a range of powerful phylogenetic software packages using complementary methods of tree searching. When using software that does not resample from tree space, it is useful to manually start the estimation procedure from different points in tree space, mimicking resampling and expanding coverage. In practice, not all phylogenetic software may implement the chosen model. In these cases, the model most closely resembling the chosen model should be used. The final list of potentially best trees is informative about the difficulty of the phylogenetic inference problem on a particular dataset. If the trees estimated by different software and starting points frequently agree it is evidence that the tree estimate is good.
If few of the estimates agree, inferring a tree from those data is probably hard and continued effort may reveal even better trees. Direct comparisons between the scores of different phylogenetic software packages are difficult because the models used and the method for calculating scores can vary. The final list should be assessed using a single consistent software package that implements the chosen model. The tree with the highest score is taken to be the optimal estimate and confidence intervals can then be calculated.
3. Bayesian inference versus ML: Bayesian and ML approaches to statistical tree inference are highly complementary, sharing the same likelihood function and their use of evolutionary models. The MAP tree in Bayesian inference and the optimal tree under ML are likely to be comparable and this can be exploited by running them in parallel and comparing their optimal tree estimates. This is also a useful
diagnostic to assess whether the MCMC chain has converged: If the trees under Bayesian inference and ML agree, it is good evidence that the chain has successfully burned in and all is well. When the trees do not agree, one or both of the estimation procedures may have gone awry. When the ML estimate is better (in terms of likelihood or posterior probability) and does not feature in the MCMC chain, it may demonstrate that the MCMC chain did not converge. When this occurs, making any form of inference from the chain is unwise. When the Bayesian estimate is better, it is indicative that the MCMC tree search was more successful than the ML point estimation. This demonstrates a potential use of MCMC as a tool for ML tree estimation. If an MCMC sampler is functioning well, its tree search can potentially outperform the point estimation algorithms used under ML. The tree with the highest likelihood from the chain is therefore a strong candidate as a starting point for point estimation, or even as an optimal tree itself. This approach is not currently widely used for phylogenetic inference.
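The idea in this note, that a chain yields both posterior frequencies and a best-scoring tree, can be illustrated with a toy version of the MCMC scheme from Section 5.2. This is a sketch under strong simplifying assumptions and not how any phylogenetic package works: tree space is reduced to three labelled topologies with hypothetical scores, the score stands in for log-likelihood plus log prior, and the proposal chooses uniformly among the other topologies, so it is symmetric and the correction factor in the acceptance probability is 1.

```python
import math
import random
from collections import Counter


def metropolis_over_trees(log_score, start, n_steps, rng):
    """Toy Metropolis sampler over a small, fixed set of tree labels.

    log_score maps each label to an unnormalised log posterior. Returns the
    fraction of steps spent at each label and the best-scoring label visited.
    """
    labels = list(log_score)
    current = start
    best = start
    visits = Counter()
    for _ in range(n_steps):
        # Symmetric proposal: any other label, chosen uniformly at random.
        proposal = rng.choice([t for t in labels if t != current])
        log_ratio = log_score[proposal] - log_score[current]
        # Always accept improvements; accept poorer trees with probability exp(log_ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        visits[current] += 1
        if log_score[current] > log_score[best]:
            best = current
    frequencies = {t: visits[t] / n_steps for t in labels}
    return frequencies, best


# Hypothetical scores chosen so the target probabilities are 0.36, 0.02, and 0.62.
log_score = {"A": math.log(0.36), "B": math.log(0.02), "C": math.log(0.62)}
frequencies, best = metropolis_over_trees(log_score, "B", 50_000, random.Random(7))
```

The visit frequencies approximate the posterior probabilities of the three topologies, and `best` is the candidate that, in the spirit of this note, could be handed to an ML program as a starting tree.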
Acknowledgments
S.W. is funded by EMBL. Comments and suggestions from Nick Goldman, Lars Jermiin, Ari Loytynoja, and Fabio Pardi all helped improve previous versions of the manuscript.
References
1. Hahn, B. H., Shaw, G. M., de Cock, K. M., et al. (2000) AIDS as a zoonosis: Scientific and public health implications. Science 287, 607–614. 2. Pellegrini, M., Marcotte, E. M., Thompson, M. J., et al. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 96, 4285–4288. 3. Tatusov, R. L., Natale, D. A., Garkavtsev, I. V., et al. (2001) The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 29, 22–28. 4. Mouse Genome Sequencing Consortium. (2002) Initial sequencing of the mouse genome. Nature 420, 520–562. 5. The ENCODE Project Consortium. (2004) The ENCODE (Encyclopedia of DNA Elements) project. Science 306, 636–640.
6. Page, R. D. M., Holmes, E. C. (1998) Molecular Evolution: A Phylogenetic Approach. Blackwell Science, Oxford, UK. 7. Gogarten, J. P., Doolittle, W. F., Lawrence, J. G. (2002) Prokaryotic evolution in light of gene transfer. Mol Biol Evol 19, 2226– 2238. 8. Siepel, A., Bejerano, G., Pedersen, J. S., et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15, 1034–1050. 9. Felsenstein, J. (2004) Inferring Phylogenies. Sinauer Associates, Sunderland, MA. 10. Nei, M., Kumar, S. (2000) Molecular Evolution and Phylogenetics. Oxford University Press, New York. 11. Whelan, S., Lio, P., Goldman, N. (2001). Molecular phylogenetics: state-of-the-art methods for looking into the past. Trends Genet 17, 262–272.
13. Chang, J. T. (1996) Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency. Math Biosci 137, 51–73. 14. Rogers, J. S. (1997) On the consistency of maximum likelihood estimation of phylogenetic trees from nucleotide sequences. Syst Biol 46, 354–357. 15. Steel, M. A., Penny, D. (2000) Parsimony, likelihood, and the role of models in molecular phylogenetics. Mol Biol Evol 17, 839–850. 16. Siddall, M. E., Kluge, A. G. (1997) Probabilism and phylogenetic inference. Cladistics 13, 313–336. 17. Saitou, N., Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4, 406–425. 18. Fitch, W. M., Margoliash, E. (1967) Construction of phylogenetic trees. A method based on mutation distances as estimated from cytochrome c sequences is of general applicability. Science 155, 279–284. 19. Swofford, D. L., Olsen, G. J., Waddell, P. J., et al. (1996) Phylogenetic inference, in (Hillis, D. M., Moritz, C., and Mable, B. K., eds.), Molecular Systematics, 2nd ed. Sinauer, Sunderland, MA. 20. Yang, Z., Goldman, N., Friday, A. (1995) Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Syst Biol 44, 384–399. 21. Strimmer, K., von Haeseler, A. (1996) Quartet puzzling: A quartet maximum likelihood method for reconstructing tree topologies. Mol Biol Evol 13, 964–969. 22. Bryant, D. The splits in the neighbourhood of a tree. Ann Combinat 8, 1–11. 23. Sankoff, D., Abel, Y., Hein, J. (1994) A tree, a window, a hill; generalisation of nearest neighbor interchange in phylogenetic optimisation. J Classif 11, 209–232. 24. Ganapathy, G., Ramachandran, V., Warnow, T. (2004) On contract-and-refine transformations between phylogenetic trees. Proc Fifteenth ACM-SIAM Symp Discrete Algorithms (SODA), 893–902. 25. Wolf, M. J., Easteal, S., Kahn, M., et al. (2000) TrExML: a maximum-likelihood approach for extensive tree-space exploration. Bioinformatics 16, 383–394. 26.
Stamatakis, A., Ludwig, T., Meier, H. (2005) RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21, 456–463.
27. Vinh, L. S., von Haeseler, A. (2004) IQPNNI: moving fast through tree space and stopping in time. Mol Biol Evol 21, 1565–1571. 28. Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package). Distributed by the author. Department of Genetics, University of Washington, Seattle. 29. Lewis, P. O. (1998) A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data. Mol Biol Evol 15, 277–283. 30. Lemmon, A. R., Milinkovitch, M. C. (2002) The metapopulation genetic algorithm: an efficient solution for the problem of large phylogeny estimation. Proc Natl Acad Sci U S A 99, 10516–10521. 31. Lundy, M. (1985) Applications of the annealing algorithm to combinatorial problems in statistics. Biometrika 72, 191–198. 32. Salter, L., Pearl, D. K. (2001) Stochastic search strategy for estimation of maximum likelihood phylogenetic trees. Syst Biol 50, 7–17. 33. Keith, J. M., Adams, P., Ragan, M. A., et al. (2005) Sampling phylogenetic tree space with the generalized Gibbs sampler. Mol Phylogenet Evol 34, 459–468. 34. Efron, B., Tibshirani, R. J. (1993) An Introduction to the Bootstrap. Chapman and Hall, New York. 35. Felsenstein, J. (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783–791. 36. Hillis, D., Bull, J. (1993) An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst Biol 42, 182–192. 37. Efron, B., Halloran, E., Holmes, S. (1996) Bootstrap confidence levels for phylogenetic trees. Proc Natl Acad Sci U S A 93, 13429–13434. 38. Shimodaira, H., Hasegawa, M. (1999) Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol 16, 1114–1116. 39. Shimodaira, H. (2002) An approximately unbiased test of phylogenetic tree selection. Syst Biol 51, 492–508. 40. Kishino, H., Hasegawa, M. (1989) Evaluation of the maximum-likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea.
J Mol Evol 29, 170–179. 41. Hasegawa, M., Kishino, H. (1994) Accuracies of the simple methods for estimating the bootstrap probability of a maximum-likelihood tree. Mol Biol Evol 11, 142–145. 42. Davison, A. C., Hinkley, D. V. (1997) Bootstrap Methods and Their Application. Cambridge University Press, Cambridge, MA. 43. Siepel, A., Haussler, D. (2005) Phylogenetic hidden Markov models, in (Nielsen, R., ed.), Statistical Methods in Molecular Evolution. Springer, New York. 44. Huelsenbeck, J. P., Larget, B., Miller, R. E., et al. (2002) Potential applications and pitfalls of Bayesian inference of phylogeny. Syst Biol 51, 673–688. 45. Holder, M., Lewis, P. O. (2003) Phylogeny estimation: traditional and Bayesian approaches. Nat Rev Genet 4, 275–284. 46. Larget, B., Simon, D. (1999) Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol Biol Evol 16, 750–759. 47. Suzuki, Y., Glazko, G. V., Nei, M. (2002) Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics. Proc Natl Acad Sci U S A 99, 16138–16143. 48. Alfaro, M. E., Zoller, S., Lutzoni, F. (2003) Bayes or bootstrap? A simulation study comparing the performance of Bayesian Markov chain Monte Carlo sampling and bootstrapping in assessing phylogenetic confidence. Mol Biol Evol 20, 255–266. 49. Douady, C. J., Delsuc, F., Boucher, Y., et al. (2003) Comparison of Bayesian and maximum likelihood bootstrap measures of phylogenetic reliability. Mol Biol Evol 20, 248–254. 50. Yang, Z., Rannala, B. (2005) Branch-length prior influences Bayesian posterior probability of phylogeny. Syst Biol 54, 455–470. 51. Lewis, P. O., Holder, M. T., Holsinger, K. E. (2005) Polytomies and Bayesian phylogenetic inference. Syst Biol 54, 241–253.
52. Yang, Z. (1996) Among-site rate variation and its impact on phylogenetic analysis. Trends Ecol Evol 11, 367–372. 53. Hasegawa, M., Kishino, H., Yano, T. (1985) Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22, 160–174. 54. Dayhoff, M. O., Eck, R. V., Park, C. M. (1972) A model of evolutionary change in proteins, in (Dayhoff, M. O., ed.), Atlas of Protein Sequence and Structure, vol. 5. National Biomedical Research Foundation, Washington, DC. 55. Whelan, S., Goldman, N. (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum likelihood approach. Mol Biol Evol 18, 691–699. 56. Adachi, J., Hasegawa M. (1996) Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol 42, 459–468. 57. Yang, Z., Nielsen, R., Hasegawa, M. (1998) Models of amino acid substitution and applications to mitochondrial protein evolution. Mol Biol Evol 15, 1600–1611. 58. Cao, Y., Adachi, J., Janke, A., et al. (1994) Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: instability of a tree based on a single gene. J Mol Evol 39, 519–527. 59. Goldman, N., Whelan, S. (2002) A novel use of equilibrium frequencies in models of sequence evolution. Mol Biol Evol 19, 1821–1831. 60. Ren, F., Tanaka, H., Yang, Z. (2005) An empirical examination of the utility of codon-substitution models in phylogeny reconstruction. Syst Biol 54, 808–818.