Information about Computational Phylogenetics
Computational phylogenetics is the application of computational algorithms, methods and programs to phylogenetic analyses. The goal is to assemble a phylogenetic tree representing a hypothesis about the evolutionary ancestry of a set of genes, species, or other taxa. For example, these techniques have been used to explore the family tree of hominid species[1] and the relationships between specific genes shared by many types of organisms.[2] Traditional phylogenetics relies on morphological data obtained by measuring and quantifying the phenotypic properties of representative organisms, while the more recent field of molecular phylogenetics uses nucleotide sequences encoding genes or amino acid sequences encoding proteins as the basis for classification. Many forms of molecular phylogenetics are closely related to and make extensive use of sequence alignment in constructing and refining phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The phylogenetic trees constructed by computational methods are unlikely to perfectly reproduce the evolutionary tree that represents the historical relationships between the species being analyzed. The historical species tree may also differ from the historical tree of an individual homologous gene shared by those species.
Producing a phylogenetic tree requires a measure of homology among the characteristics shared by the taxa being compared. In morphological studies, this requires explicit decisions about which physical characteristics to measure and how to use them to encode distinct states corresponding to the input taxa. In molecular studies, a primary problem is in producing a multiple sequence alignment (MSA) between the genes or amino acid sequences of interest. Progressive sequence alignment methods produce a phylogenetic tree by necessity because they incorporate new sequences into the calculated alignment in order of genetic distance. Although a phylogenetic tree can always be constructed from an MSA, phylogenetics methods such as maximum parsimony and maximum likelihood do not require the production of an initial or concurrent MSA.
Types of phylogenetic trees
Phylogenetic trees generated by computational phylogenetics can be either rooted or unrooted depending on the input data and the algorithm used. A rooted tree is a directed graph that explicitly identifies a most recent common ancestor (MRCA), usually an imputed sequence that is not represented in the input. Genetic distance measures can be used to plot a tree with the input sequences as leaf nodes and their distances from the root proportional to their genetic distance from the hypothesized MRCA. Identification of a root usually requires the inclusion in the input data of at least one "outgroup" known to be only distantly related to the sequences of interest.
By contrast, unrooted trees plot the distances and relationships between input sequences without making assumptions regarding their descent. An unrooted tree can always be produced from a rooted tree, but a root cannot usually be placed on an unrooted tree without additional data on divergence rates, such as the assumption of the molecular clock hypothesis.[3]
The set of all possible phylogenetic trees for a given group of input sequences can be conceptualized as a discretely defined multidimensional "tree space" through which search paths can be traced by optimization algorithms. Although counting the total number of trees for a nontrivial number of input sequences can be complicated by variations in the definition of a tree topology, it is always true that there are more rooted than unrooted trees for a given number of inputs and choice of parameters.[4]
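As a rough illustration of how quickly tree space grows, the sketch below counts fully resolved (strictly bifurcating) topologies with labeled leaves using the standard double-factorial formulas: (2n-5)!! unrooted and (2n-3)!! rooted trees for n taxa. The function names are illustrative, and other definitions of a tree topology (for example, allowing multifurcations) give different counts.

```python
def double_factorial(k: int) -> int:
    """Return k * (k - 2) * (k - 4) * ... down to 1 or 2; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_taxa: int) -> int:
    """Unrooted bifurcating topologies on n labeled leaves: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted bifurcating tree")
    return double_factorial(2 * n_taxa - 5)


def count_rooted_trees(n_taxa: int) -> int:
    """Rooted bifurcating topologies on n labeled leaves: (2n - 3)!!."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa for a rooted bifurcating tree")
    return double_factorial(2 * n_taxa - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 10, 20):
        print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Each unrooted tree on n leaves has 2n-3 branches on which a root could be placed, which is why the rooted count is always the unrooted count multiplied by 2n-3.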
Coding characters and defining homology
Morphological analysis
The basic problem in morphological phylogenetics is the assembly of a matrix representing a mapping from each of the taxa being compared to representative measurements for each of the phenotypic characteristics being used as a classifier. The types of phenotypic data used to construct this matrix depend on the taxa being compared; for individual species, they may involve measurements of average body size, lengths or sizes of particular bones or other physical features, or even behavioral manifestations. Of course, since not every possible phenotypic characteristic could be measured and encoded for analysis, the selection of which features to measure is a major inherent obstacle to the method. The decision of which traits to use as a basis for the matrix necessarily represents a hypothesis about which traits of a species or higher taxon are evolutionarily relevant.[5] Morphological studies can be confounded by examples of convergent evolution of phenotypes.[6] A major challenge in constructing useful classes is the high likelihood of inter-taxon overlap in the distribution of the phenotype's variation. The inclusion of extinct taxa in morphological analysis is often difficult due to absence of or incomplete fossil records, but has been shown to have a significant effect on the trees produced; in one study only the inclusion of extinct species of apes produced a morphologically derived tree that was consistent with that produced from molecular data.<ref name="Strait" />
Some phenotypic classifications, particularly those used when analyzing very diverse groups of taxa, are discrete and unambiguous; classifying organisms as possessing or lacking a tail, for example, is straightforward in the majority of cases, as is counting features such as eyes or vertebrae. However, the most appropriate representation of continuously varying phenotypic measurements is a controversial problem without a general solution. A common method is simply to sort the measurements of interest into two or more classes, rendering continuous observed variation as discretely classifiable (e.g., all examples with humerus bones longer than a given cutoff are scored as members of one state, and all members whose humerus bones are shorter than the cutoff are scored as members of a second state). This results in an easily manipulated data set but has been criticized for poor reporting of the basis for the class definitions and for sacrificing information compared to methods that use a continuous weighted distribution of measurements.[7]
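The cutoff-based coding described above can be sketched in a few lines. The taxon names, humerus lengths, and the 250 mm cutoff below are invented purely for illustration; the point is only that the class boundaries are explicit parameters that a study would need to report.

```python
from bisect import bisect_right


def discretize(measurement: float, cutoffs: list[float]) -> int:
    """Return the index of the character state for a continuous measurement.

    `cutoffs` must be sorted in ascending order. A value equal to a cutoff is
    assigned to the higher state (an arbitrary convention chosen here).
    """
    return bisect_right(cutoffs, measurement)


# Hypothetical humerus lengths in millimetres (not real data).
humerus_mm = {"taxon_A": 212.0, "taxon_B": 305.5, "taxon_C": 298.0}
cutoffs_mm = [250.0]  # assumed single cutoff, giving two states: 0 and 1

character_states = {
    taxon: discretize(length, cutoffs_mm) for taxon, length in humerus_mm.items()
}

if __name__ == "__main__":
    print(character_states)
```

Note that taxon_B and taxon_C receive the same state even though their measurements differ, which is the loss of information the criticism above refers to.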
Because morphological data is extremely labor-intensive to collect, whether from literature sources or from field observations, reuse of previously compiled data matrices is not uncommon, although this may propagate flaws in the original matrix into multiple derivative analyses.[8]
Molecular analysis
The problem of character coding is very different in molecular analyses, as the characters in biological sequence data are immediate and discretely defined - distinct nucleotides in DNA or RNA sequences and distinct amino acids in protein sequences. However, defining homology can be challenging due to the inherent difficulties of multiple sequence alignment. For a given gapped MSA, several rooted phylogenetic trees can be constructed that vary in their interpretations of which changes are "mutations" versus ancestral characters, and which events are insertion mutations or deletion mutations. For example, given only a pairwise alignment with a gap region, it is impossible to determine whether one sequence bears an insertion mutation or the other carries a deletion. The problem is magnified in MSAs with unaligned and nonoverlapping gaps. In practice, sizable regions of a calculated alignment may be discounted in phylogenetic tree construction to avoid integrating noisy data into the tree calculation.
Distance-matrix methods
Distance-matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore they require an MSA as an input. Distance is often defined as the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches.<ref name="mount" /> Distance methods attempt to construct an all-to-all matrix from the sequence query set describing the distance between each sequence pair. From this is constructed a phylogenetic tree that places closely related sequences under the same interior node and whose branch lengths closely reproduce the observed distances between sequences. Distance-matrix methods may produce either rooted or unrooted trees, depending on the algorithm used to calculate them. They are frequently used as the basis for progressive and iterative types of multiple sequence alignment. The main disadvantage of distance-matrix methods is their inability to efficiently use information about local high-variation regions that appear across multiple subtrees.<ref name="felsenstein" />
Neighbor-joining
Neighbor-joining methods apply general data clustering techniques to sequence analysis using genetic distance as a clustering metric. The simple neighbor-joining method produces unrooted trees, but it does not assume a constant rate of evolution (i.e., a molecular clock) across lineages. Its relative, UPGMA (Unweighted Pair Group Method with Arithmetic mean), produces rooted trees and requires a constant-rate assumption - that is, it assumes an ultrametric tree in which the distances from the root to every branch tip are equal.
Fitch-Margoliash method
The Fitch-Margoliash method uses a weighted least squares method for clustering based on genetic distance.[9] Closely related sequences are given more weight in the tree construction process to correct for the increased inaccuracy in measuring distances between distantly related sequences. The distances used as input to the algorithm must be normalized to prevent large artifacts in computing relationships between closely related and distantly related groups. The distances calculated by this method must be linear; the linearity criterion for distances requires that the expected values of the branch lengths for two individual branches must equal the expected value of the sum of the two branch distances - a property that applies to biological sequences only when they have been corrected for the possibility of back mutations at individual sites. This correction is done through the use of a substitution matrix such as that derived from the Jukes-Cantor model of DNA evolution. The distance correction is only necessary in practice when the evolution rates differ among branches.<ref name="felsenstein" />
The least-squares criterion applied to these distances is more accurate but less efficient than the neighbor-joining methods. An additional improvement that corrects for correlations between distances that arise from many closely related sequences in the data set can also be applied at increased computational cost. Finding the optimal least-squares tree with any correction factor is NP-complete,[10] so heuristic search methods like those used in maximum-parsimony analysis are applied to the search through tree space.
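The distances that feed the methods in this section can be sketched as follows: the raw fraction of mismatches at aligned positions (with gaps either ignored or counted as mismatches), followed by the standard Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3) for multiple substitutions at the same site. This is a minimal sketch rather than any package's implementation; the function names and the toy alignment are assumptions made for illustration.

```python
import math
from itertools import combinations

GAP = "-"


def p_distance(seq_a: str, seq_b: str, gaps_as_mismatch: bool = False) -> float:
    """Fraction of mismatching positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        a_gap, b_gap = a == GAP, b == GAP
        if a_gap and b_gap:
            continue  # column carries no information for this pair
        if a_gap or b_gap:
            if gaps_as_mismatch:
                compared += 1
                mismatches += 1
            continue  # otherwise the gapped position is ignored
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return mismatches / compared


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor corrected distance in expected substitutions per site."""
    if p >= 0.75:
        raise ValueError("p-distance is saturated; correction is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment: dict[str, str]) -> dict[tuple[str, str], float]:
    """All pairwise corrected distances, keyed by sorted name pairs."""
    return {
        (a, b): jukes_cantor_distance(p_distance(alignment[a], alignment[b]))
        for a, b in combinations(sorted(alignment), 2)
    }


if __name__ == "__main__":
    toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGAACGA"}  # invented data
    for pair, d in distance_matrix(toy).items():
        print(pair, round(d, 4))
```

The corrected distance is always at least as large as the raw mismatch fraction, and the gap between the two widens as sequences diverge.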
Using outgroups
Independent information about the relationship between sequences or groups can be used to help reduce the tree search space and root unrooted trees. Standard usage of distance-matrix methods involves the inclusion of at least one outgroup sequence known to be only distantly related to the sequences of interest in the query set.<ref name="mount" /> This usage can be seen as a type of experimental control. If the outgroup has been appropriately chosen, it will have a much greater genetic distance and thus a longer branch length than any other sequence, and it will appear near the root of a rooted tree. Choosing an appropriate outgroup requires the selection of a sequence that is moderately related to the sequences of interest; too close a relationship defeats the purpose of the outgroup and too distant adds noise to the analysis.<ref name="mount" /> Care should also be taken to avoid situations in which the species from which the sequences were taken are distantly related, but the gene encoded by the sequences is highly conserved across lineages. Horizontal gene transfer, especially between otherwise divergent bacteria, can also confound outgroup usage.
Maximum parsimony
Maximum parsimony (MP) is a method of identifying the potential phylogenetic tree that requires the smallest total number of evolutionary events to explain the observed sequence data. Some ways of scoring trees also include a "cost" associated with particular types of evolutionary events and attempt to locate the tree with the smallest total cost. This is a useful approach in cases where not every possible type of event is equally likely - for example, when particular nucleotides or amino acids are known to be more mutable than others.
The most naive way of identifying the most parsimonious tree is simple enumeration - considering each possible tree in succession and searching for the tree with the smallest score. However, this is only possible for a relatively small number of sequences or species because the problem of identifying the most parsimonious tree is known to be NP-hard;<ref name="felsenstein" /> consequently a number of heuristic search methods for optimization have been developed to locate a highly parsimonious tree, if not the most optimal in the set. Most such methods involve a steepest descent-style minimization mechanism operating on a tree rearrangement criterion.
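Scoring a single candidate tree is the inexpensive part of a parsimony search; the hard part is the number of trees. The sketch below scores one fixed tree with Fitch's algorithm, a standard method for unordered, equally weighted characters that the text above does not name. The nested-tuple tree encoding, the function names, and the four-taxon alignment are assumptions made for this illustration.

```python
def fitch_site(tree, states):
    """Return (candidate_state_set, minimum_changes) for one character.

    `tree` is either a taxon name (leaf) or a pair (left_subtree, right_subtree).
    `states` maps each taxon name to its observed state at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        # The children can agree on a state, so no extra change is needed here.
        return shared, left_cost + right_cost
    # The children disagree, so at least one change occurs below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Invented toy alignment: A and B share states, as do C and D.
toy_alignment = {"A": "ACG", "B": "ACG", "C": "TCA", "D": "TCA"}
tree_ab_cd = (("A", "B"), ("C", "D"))
tree_ac_bd = (("A", "C"), ("B", "D"))

if __name__ == "__main__":
    print(parsimony_score(tree_ab_cd, toy_alignment))
    print(parsimony_score(tree_ac_bd, toy_alignment))
```

An exhaustive search would call `parsimony_score` on every topology, which is feasible only for small numbers of taxa; heuristic searches instead score a sequence of rearranged trees.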
Branch and bound
The branch and bound algorithm is a general method used to increase the efficiency of searches for near-optimal solutions of NP-hard problems, first applied to phylogenetics in the early 1980s.[11] Branch and bound is particularly well suited to phylogenetic tree construction because it inherently requires dividing a problem into a tree structure as it subdivides the problem space into smaller regions. As its name implies, it requires as input both a branching rule (in the case of phylogenetics, the addition of the next species or sequence to the tree) and a bound (a rule that excludes certain regions of the search space from consideration, thereby assuming that the optimal solution cannot occupy that region). Identifying a good bound is the most challenging aspect of the algorithm's application to phylogenetics. A simple way of defining the bound is a maximum number of assumed evolutionary changes allowed per tree. A set of criteria known as Zharkikh's rules[12] severely limit the search space by defining characteristics shared by all candidate "most parsimonious" trees. The two most basic rules require the elimination of all but one redundant sequence (for cases where multiple observations have produced identical data) and the elimination of character sites at which two or more states do not occur in at least two species. Under ideal conditions these rules and their associated algorithm would completely define a tree.
Sankoff-Morel-Cedergren algorithm
The Sankoff-Morel-Cedergren algorithm was among the first published methods to simultaneously produce an MSA and a phylogenetic tree for nucleotide sequences.[13] The method uses a maximum parsimony calculation in conjunction with a scoring function that penalizes gaps and mismatches, thereby favoring the tree that introduces a minimal number of such events. The imputed sequences at the interior nodes of the tree are scored and summed over all the nodes in each possible tree. The lowest-scoring tree sum provides both an optimal tree and an optimal MSA given the scoring function. Because the method is highly computationally intensive, an approximate method is used in which initial guesses for the interior alignments are refined one node at a time. Both the full and the approximate version are in practice calculated by dynamic programming.<ref name="felsenstein" />
MALIGN and POY
More recent phylogenetic tree/MSA methods use heuristics to isolate high-scoring, but not necessarily optimal, trees. The MALIGN method uses a maximum-parsimony technique to compute a multiple alignment by maximizing a cladogram score, and its companion POY uses an iterative method that couples the optimization of the phylogenetic tree with improvements in the corresponding MSA.[14] However, the use of these methods in constructing evolutionary hypotheses has been criticized as biased due to the deliberate construction of trees reflecting minimal evolutionary events.[15] Both programs are available from the American Museum of Natural History.
Maximum likelihood
The maximum likelihood method uses standard statistical techniques for inferring probability distributions to assign probabilities to particular possible phylogenetic trees. The method requires a substitution model to assess the probability of particular mutations; roughly, a tree that requires more mutations at interior nodes to explain the observed phylogeny will be assessed as having a lower probability. This is broadly similar to the maximum-parsimony method, but maximum likelihood allows additional statistical flexibility by permitting varying rates of evolution across both lineages and sites. In fact, the method requires that evolution at different sites and along different lineages must be statistically independent. Maximum likelihood is thus well suited to the analysis of distantly related sequences, but because it formally requires search of all possible combinations of tree topology and branch length, it is computationally expensive to perform on more than a few sequences.
The "pruning" algorithm, a variant of dynamic programming, is often used to reduce the search space by efficiently calculating the likelihood of subtrees.<ref name="felsenstein" /> The method calculates the likelihood for each site in a "linear" manner, starting at a node whose only descendants are leaves (that is, the tips of the tree) and working backwards toward the "bottom" node in nested sets. However, the trees produced by the method are only rooted if the substitution model is irreversible, which is not generally true of biological systems. The search for the maximum-likelihood tree also includes a branch length optimization component that is difficult to improve upon algorithmically; general global optimization tools such as the Newton-Raphson method are often used. Searching tree topologies defined by likelihood has not been shown to be NP-complete,<ref name="felsenstein" /> but remains extremely challenging because branch-and-bound search is not yet effective for trees represented in this way.
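The pruning recursion can be sketched for a fixed tree with fixed branch lengths. The example below assumes the Jukes-Cantor model (equal base frequencies, one rate) purely to keep the transition probabilities in closed form; the tree encoding, function names, and branch lengths are illustrative assumptions, and no branch-length optimization or topology search is attempted.

```python
import math

BASES = "ACGT"


def jc_prob(i: str, j: str, t: float) -> float:
    """Jukes-Cantor probability that base i becomes base j along a branch of
    length t, measured in expected substitutions per site."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def conditional_likelihoods(node, column):
    """Return {base: P(observed tips below node | node has that base)}.

    A leaf is a taxon name; an internal node is a tuple of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {b: 1.0 if column[node] == b else 0.0 for b in BASES}
    partial = {b: 1.0 for b in BASES}
    for child, length in node:
        child_like = conditional_likelihoods(child, column)
        for b in BASES:
            # Sum over the unknown state at the child end of the branch.
            partial[b] *= sum(jc_prob(b, c, length) * child_like[c] for c in BASES)
    return partial


def site_likelihood(tree, column) -> float:
    """Likelihood of one alignment column, averaging over the root state."""
    root = conditional_likelihoods(tree, column)
    return sum(0.25 * root[b] for b in BASES)


def log_likelihood(tree, alignment) -> float:
    """Sites are treated as independent, so per-site log-likelihoods add."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        math.log(site_likelihood(tree, {t: s[i] for t, s in alignment.items()}))
        for i in range(n_sites)
    )


if __name__ == "__main__":
    # Invented example: ((A:0.1, B:0.2):0.05, C:0.3)
    tree = ((((("A", 0.1), ("B", 0.2))), 0.05), ("C", 0.3))
    print(log_likelihood(tree, {"A": "ACGT", "B": "ACGA", "C": "ACTT"}))
```

Because the Jukes-Cantor model is reversible, moving the root along a branch leaves the likelihood unchanged: for two tips the result depends only on the total path length between them, which illustrates why a reversible model cannot by itself identify the root.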
Bayesian inference
Bayesian inference can be used to produce phylogenetic trees in a manner closely related to the maximum likelihood methods. Bayesian methods assume a prior probability distribution of the possible trees, which may simply be the probability of any one tree among all the possible trees that could be generated from the data, or may be a more sophisticated estimate derived from the assumption that divergence events such as speciation occur as stochastic processes. The choice of prior distribution is a point of contention among users of Bayesian-inference phylogenetics methods.<ref name="felsenstein" />
Implementations of Bayesian methods generally use Markov chain Monte Carlo sampling algorithms, although the choice of move set varies; selections used in Bayesian phylogenetics include circularly permuting leaf nodes of a proposed tree at each step[16] and swapping descendant subtrees of a random internal node between two related trees.[17] The use of Bayesian methods in phylogenetics has been controversial, largely due to incomplete specification of the choice of move set, acceptance criterion, and prior distribution in published work.<ref name="felsenstein" />
Model selection
Molecular phylogenetics methods rely on a defined substitution model that encodes a hypothesis about the relative rates of mutation at various sites along the gene or amino acid sequences being studied. At their simplest, substitution models aim to correct for differences in the rates of transitions and transversions in nucleotide sequences. The use of substitution models is necessitated by the fact that the genetic distance between two sequences increases linearly only for a short time after the two sequences diverge from each other (alternatively, the distance is linear only shortly before coalescence). The longer the amount of time after divergence, the more likely it becomes that two mutations occur at the same nucleotide site. Simple genetic distance calculations will thus undercount the number of mutation events that have occurred in evolutionary history. The extent of this undercount increases with increasing time since divergence, which can lead to the phenomenon of long branch attraction, or the misassignment of two distantly related but convergently evolving sequences as closely related.[18] The maximum parsimony method is particularly susceptible to this problem due to its explicit search for a tree representing a minimum number of distinct evolutionary events.<ref name="felsenstein" />
Types of models
All substitution models assign a set of weights to each possible change of state represented in the sequence. The most common model types are implicitly reversible because they assign the same weight to, for example, a G>C nucleotide mutation as to a C>G mutation. The simplest possible model, the Jukes-Cantor model, assigns an equal probability to every possible change of state for a given nucleotide base. The rate of change between any two distinct nucleotides will be one-third of the overall substitution rate.<ref name="felsenstein" /> More advanced models distinguish between transitions and transversions. The most general possible time-reversible model, called the GTR model, has six mutation rate parameters. An even more generalized model known as the general 12-parameter model breaks time-reversibility, at the cost of much additional complexity in calculating genetic distances that are consistent among multiple lineages.<ref name="felsenstein" /> One possible variation on this theme adjusts the rates so that overall GC content - an important measure of DNA double helix stability - varies over time.[19]
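One way to see how these models nest is to build the instantaneous rate matrix from six symmetric exchange rates (one per unordered pair of bases) and a set of base frequencies, which is a common parameterization of a time-reversible model. The sketch below is an illustration under that assumption, with hypothetical function names; setting all six rates equal with uniform frequencies recovers Jukes-Cantor, and giving transitions a separate rate gives a two-parameter transition/transversion model.

```python
from itertools import combinations

BASES = "ACGT"
TRANSITIONS = {frozenset("AG"), frozenset("CT")}  # purine-purine, pyrimidine-pyrimidine


def reversible_rate_matrix(exchange, freqs):
    """Build a rate matrix with q[x][y] = exchange[{x, y}] * freqs[y] for x != y.

    Diagonals are set so each row sums to zero, and the matrix is scaled so
    that the overall substitution rate at equilibrium is 1.
    """
    q = {x: {y: 0.0 for y in BASES} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                q[x][y] = exchange[frozenset((x, y))] * freqs[y]
        q[x][x] = -sum(q[x][y] for y in BASES if y != x)
    mean_rate = -sum(freqs[x] * q[x][x] for x in BASES)
    for x in BASES:
        for y in BASES:
            q[x][y] /= mean_rate
    return q


def is_time_reversible(q, freqs, tol=1e-12):
    """Check detailed balance: freqs[x] * q[x][y] == freqs[y] * q[y][x]."""
    return all(
        abs(freqs[x] * q[x][y] - freqs[y] * q[y][x]) < tol
        for x, y in combinations(BASES, 2)
    )


uniform = {b: 0.25 for b in BASES}
pairs = [frozenset(p) for p in combinations(BASES, 2)]  # the six exchange rates

# Jukes-Cantor: every exchange rate equal.
jc_matrix = reversible_rate_matrix({p: 1.0 for p in pairs}, uniform)

# Transition/transversion model: transitions assumed twice as fast (kappa = 2).
kappa = 2.0
ti_tv_matrix = reversible_rate_matrix(
    {p: (kappa if p in TRANSITIONS else 1.0) for p in pairs}, uniform
)

if __name__ == "__main__":
    print(jc_matrix["A"])
    print(ti_tv_matrix["A"])
```

With the overall rate scaled to 1, each off-diagonal Jukes-Cantor entry comes out as one-third, matching the statement above.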
Models may also allow for the variation of rates with positions in the input sequence. The most obvious example of such variation follows from the arrangement of nucleotides in protein-coding genes into three-base codons. If the location of the open reading frame (ORF) is known, rates of mutation can be adjusted for position of a given site within a codon, since it is known that wobble base pairing can allow for higher mutation rates in the third nucleotide of a given codon without affecting the codon's meaning in the genetic code.<ref name="Sullivan" /> A less hypothesis-driven example that does not rely on ORF identification simply assigns to each site a rate randomly drawn from a predetermined distribution, often the gamma distribution or log-normal distribution.<ref name="felsenstein" /> Finally, a more conservative estimate of rate variations known as the covarion method allows autocorrelated variations in rates, so that the mutation rate of a given site is correlated across sites and lineages.[20]
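The gamma-distributed rate variation mentioned above can be sketched by drawing one relative rate per site. The sketch assumes a single shape parameter `alpha` with the scale set to 1/alpha so that the mean rate is 1, which is a common convention rather than something stated in the text; the function name and the values used are illustrative.

```python
import random


def draw_site_rates(n_sites: int, alpha: float, seed=None) -> list[float]:
    """Draw one relative rate per site from a gamma distribution with mean 1.

    Small `alpha` gives strong heterogeneity (many nearly invariant sites and
    a few fast ones); large `alpha` makes the rates nearly equal.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); the mean is shape * scale = 1.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


if __name__ == "__main__":
    rates = draw_site_rates(10, alpha=0.5, seed=1)
    print([round(r, 3) for r in rates])
```

A site's branch lengths would then be multiplied by its rate before computing substitution probabilities for that site.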
Choosing the best model
The selection of an appropriate model is critical for the production of good phylogenetic analyses, both because underparameterized or overly restrictive models may produce aberrant behavior when their underlying assumptions are violated, and because overly complex or overparameterized models are computationally expensive and their parameters may be overfit.<ref name="Sullivan" /> The most common method of model selection is the likelihood ratio test (LRT), which produces a likelihood estimate that can be interpreted as a measure of "goodness of fit" between the model and the input data.<ref name="Sullivan" /> However, care must be taken in using these results, since a more complex model with more parameters will always have a higher likelihood than a simplified version of the same model, which can lead to the naive selection of models that are overly complex.<ref name="felsenstein" /> For this reason, model selection programs choose the simplest model that is not significantly worse than more complex substitution models. A significant disadvantage of the LRT is the necessity of making a series of pairwise comparisons between models; it has been shown that the order in which the models are compared has a major effect on the one that is eventually selected.[21] An alternative model selection method is the Akaike information criterion (AIC), formally an estimate of the Kullback-Leibler divergence between the true model and the model being tested. It can be interpreted as a likelihood estimate with a correction factor to penalize overparameterized models.<ref name="Sullivan" /> The AIC is calculated on an individual model rather than a pair, so it is independent of the order in which models are assessed. A related alternative, the Bayesian information criterion (BIC), has a similar basic interpretation but penalizes complex models more heavily.<ref name="Sullivan" />
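The criteria discussed above have simple closed forms, sketched below using the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of free parameters and n the number of sites. The function names, the candidate model labels, and their log-likelihoods and parameter counts are all hypothetical values invented for this illustration.

```python
import math

def lrt_statistic(lnl_simple, lnl_complex):
    """Likelihood ratio test statistic for two nested models. It is
    usually compared against a chi-squared distribution whose degrees
    of freedom equal the difference in parameter counts."""
    return 2.0 * (lnl_complex - lnl_simple)

def aic(lnl, k):
    """Akaike information criterion; lower is better."""
    return 2.0 * k - 2.0 * lnl

def bic(lnl, k, n_sites):
    """Bayesian information criterion; lower is better."""
    return k * math.log(n_sites) - 2.0 * lnl

def best_model(candidates, n_sites, criterion="aic"):
    """Return the name of the candidate with the lowest score.
    `candidates` maps a model name to (log-likelihood, parameter count)."""
    def score(item):
        _, (lnl, k) = item
        return aic(lnl, k) if criterion == "aic" else bic(lnl, k, n_sites)
    return min(candidates.items(), key=score)[0]

# Hypothetical fits of three nested models to a 1000-site alignment.
candidates = {
    "model_A": (-5230.0, 1),
    "model_B": (-5205.0, 5),
    "model_C": (-5199.0, 9),
}
print(best_model(candidates, 1000, "aic"))  # model_C
print(best_model(candidates, 1000, "bic"))  # model_B
```

With these made-up numbers the most complex model always has the highest likelihood, yet the two criteria disagree: the AIC accepts the extra parameters of model_C, while the heavier BIC penalty favors model_B. Because each score is computed per model, the result does not depend on the order in which the candidates are examined.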
See also
- List of phylogenetics software
- PHYLIP
- Phylogenetic comparative methods
- Phylogenetic tree
- Phylogenetics
- Systematics
- Joe Felsenstein
External links
- PHYLIP, a freely distributed phylogenetic analysis package
- PAUP, a similar analysis package available for purchase
- MrBayes, a program for the Bayesian estimation of phylogeny
- Modeltest, a program for selecting appropriate substitution models for nucleotide sequences
- CIPRES: Cyberinfrastructure for Phylogenetic Research
- Phylogenetic inferring on the T-REX server
- List of phylogeny programs
References
1. ^ Strait DS, Grine FE. (2004). Inferring hominoid and early hominid phylogeny using craniodental characters: the role of fossil taxa. J Hum Evol 47(6):399-452.
2. ^ Hodge T, Cope MJ. (2000). A myosin family tree. J Cell Sci 113: 3353-3354.
3. ^ Mount DM. (2004). Bioinformatics: Sequence and Genome Analysis 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.
4. ^ Felsenstein J. (2004). Inferring Phylogenies Sinauer Associates: Sunderland, MA.
5. ^ Swiderski DL, Zelditch ML, Fink WL. (1998). Why morphometrics is not special: coding quantitative data for phylogenetic analysis. Syst Biol 47(3):508-19.
6. ^ Gaubert P, Wozencraft WC, Cordeiro-Estrela P, Veron G. (2005). Mosaics of convergences and noise in morphological phylogenies: what's in a viverrid-like carnivoran? Syst Biol 54(6):865-94.
7. ^ Wiens JJ. (2001). Character analysis in morphological phylogenetics: problems and solutions. Syst Biol 50(5):689-99.
8. ^ Jenner RA. (2001). Bilaterian phylogeny and uncritical recycling of morphological data sets. Syst Biol 50(5): 730-743.
9. ^ Fitch WM, Margoliash E. (1967). Construction of phylogenetic trees. Science 155: 279-84.
10. ^ Day WHE. (1986). Computational complexity of inferring phylogenies from dissimilarity matrices. Bulletin of Mathematical Biology 49:461-7.
11. ^ Hendy MD, Penny D. (1982). Branch and bound algorithms to determine minimal evolutionary trees. Math Biosci 60: 133-42.
12. ^ Ratner VA, Zharkikh AA, Kolchanov N, Rodin S, Solovyov S, Antonov AS. (1995). Molecular Evolution Biomathematics Series Vol 24. Springer-Verlag: New York, NY.
13. ^ Sankoff D, Morel C, Cedergren RJ. (1973). Evolution of 5S RNA and the non-randomness of base replacement. Nature New Biology 245:232-4.
14. ^ Wheeler WC, Gladstein DG. (1994). MALIGN: a multiple nucleic acid sequence alignment program. J Heredity 85: 417-18.
15. ^ Simmons MP. (2004). Independence of alignment and tree search. Mol Phylogenet Evol 31(3):874-9.
16. ^ Mau B, Newton MA. (1997). Phylogenetic inference for binary data on dendrograms using Markov chain Monte Carlo. J Comp Graph Stat 6:122-31.
17. ^ Yang Z, Rannala B. (1997). Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol Biol Evol 46:409-18.
18. ^ Sullivan J, Joyce P. (2005). Model selection in phylogenetics. Annual Review of Ecology, Evolution, and Systematics. 36: 445-466.
19. ^ Galtier N, Gouy M. (1998). Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol Biol Evol 15:871–79.
20. ^ Fitch WM, Markowitz E. (1970). An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochemical Genetics 4:579-593.
21. ^ Pol D. (2004). Empirical problems of the hierarchical likelihood ratio test for model selection. Syst Biol 53:949–62.
2. ^ Hodge T, Cope MJ. (2000). A myosin family tree. J Cell Sci 113: 3353-3354.
3. ^ Mount DM. (2004). Bioinformatics: Sequence and Genome Analysis 2nd ed. Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.
4. ^ Felsenstein J. (2004). Inferring Phylogenies Sinauer Associates: Sunderland, MA.
5. ^ Swiderski DL, Zelditch ML, Fink WL. (1998). Why morphometrics is not special: coding quantitative data for phylogenetic analysis. 47(3):508-19.
6. ^ Gaubert P, Wozencraft WC, Cordeiro-Estrela P, Veron G. (2005). Mosaics of convergences and noise in morphological phylogenies: what's in a viverrid-like carnivoran? Syst Biol 54(6):865-94.
7. ^ Wiens JJ. (2001). Character analysis in morphological phylogenetics: problems and solutions. Syst Biol 50(5):689-99.
8. ^ Jenner RA. (2001). Bilaterian phylogeny and uncritical recycling of morphological data sets. Syst Biol 50(5): 730-743.
9. ^ Fitch WM, Margoliash E. (1967). Construction of phylogenetic trees. Science 155: 279-84.
10. ^ Day, WHE. (1986). Computational complexity of inferring phylogenies from dissimilarity matrices. Bulletin of Mathematical Biology 49:461-7.
11. ^ Hendy MD, Penny D. (1982). Branch and bound algorithms to determine minimal evolutionary trees. Math Biosci 60: 133-42.
12. ^ Ratner VA, Zharkikh AA, Kolchanov N, Rodin S, Solovyov S, Antonov AS. (1995). Molecular Evolution Biomathematics Series Vol 24. Springer-Verlag: New York, NY.
13. ^ Sankoff D, Morel C, Cedergren RJ. (1973). Evolution of 5S RNA and the non-randomness of base replacement. Nature New Biology 245:232-4.
14. ^ Wheeler WC, Gladstein DG. (1994). MALIGN: a multiple nucleic acid sequence alignment program. J Heredity 85: 417-18.
15. ^ Simmons MP. (2004). Independence of alignment and tree search. Mol Phylogenet Evol 31(3):874-9.
16. ^ Mau B, Newton MA. (1997). Phylogenetic inference for binary data on dendrograms using Markov chain Monte Carlo. J Comp Graph Stat 6:122-31.
17. ^ Yang Z, Rannala B. (1997). bayesian phylogenetic inference using DNA sequences: a Merkov chain Monte Carlo method. Mol Biol Evol 46:409-18.
18. ^ Sullivan J, Joyce P. (2005). Model selection in phylogenetics. Annual Review of Ecology, Evolution, and Systematics. 36: 445-466.
19. ^ Galtier N, Guoy M. (1998.) Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15:871–79.
20. ^ Fitch WM, Markowitz E. (1970). An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochemical Genetics 4:579-593.
21. ^ Pol D. (2004.) Empirical problems of the hierarchical likelihood ratio test for model selection. Syst Biol 53:949–62.
This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.