Discrete - Mark Pagel - University of Reading

User’s Manual for
Discrete
(copyright M. Pagel)
Mark Pagel
School of Animal and Microbial Sciences
University of Reading
Reading RG6 6AJ
UK
email: [email protected]
(www.ams.reading.ac.uk/zoology/pagel/)
Discrete
Discrete is a computer program for the comparative analysis of binary characters on phylogenetic
trees. The program implements a continuous time Markov model and was initially described in
Pagel (1994). Several papers since then describe other features of Discrete (Pagel, 1997,
1999a,b). I would be grateful if you would cite these papers, as appropriate, when you use
Discrete. A recent application can be found in Lutzoni, Pagel, and Reeb, 2001.
The application program can be used to
• test for correlated evolution between pairs of traits,
• find ancestral states (see especially Pagel, 1999b),
• test for rates of evolution,
• detect directional trait evolution,
• investigate the tempo and mode of trait evolution,
• detect differential rates of evolution in different branches of the tree using a gamma rate
heterogeneity model, and
• conduct Monte Carlo simulation studies of results.
This manual describes how to use Discrete and its hypothesis testing capabilities. The manual is
intended to provide enough of an introduction to use the program, although it does not attempt to
describe all of the things one might use the program for. Some features of Discrete are not yet
implemented in the program even though they appear in the menus. The current computer
interface was designed and programmed by Mr Peter Fredericks; Drs Heath and Richard Forster
worked on earlier versions of the program.
For variables with more than two states my related program called Multi-state is available.
However, Multi-state is intended only to study trait evolution and not correlated evolution
between pairs of traits. With more than two states a very large number of parameters is required
to test correlated evolution. But see Characters with more than two states under Hypothesis
Testing.
Discrete implements the Markov model in a maximum likelihood framework. This makes it
possible to analyse and test hypotheses about trait evolution without the need ever to reconstruct
ancestral states (although ancestral states can be estimated). Instead, the parameters of trait
evolution are estimated having summed the likelihood over all possible states at each node of the
tree (see Pagel, 1994 for further explanation of the likelihood approach). An advantage of this
approach over parsimony methods is that uncertainty in the ancestral state reconstructions is
automatically taken into account in all likelihood calculations. By comparison, parsimony
methods first infer the ancestral states and then treat them in later calculations as if they are
known without error. This lends a false degree of certainty to calculations and biases p-values.
This bias is most pronounced when traits evolve more than once on a tree, that is when trait
evolution is relatively rapid. Under such circumstances parsimony methods are known to
underestimate the amount of change, especially in long branches of the tree (see Pagel, 1999a for
an example).
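The idea of summing the likelihood over all possible states at each node can be illustrated for a single binary trait. The following is a minimal sketch, not the code used by Discrete: the tree, the branch lengths, the rate values, and the equal weighting of the two root states are all assumptions made for illustration. It uses the standard closed-form transition probabilities of a two-state continuous time Markov model.

```python
import math

def transition_probs(alpha, beta, t):
    """P[i][j] = probability of ending in state j after length t, starting in state i."""
    total = alpha + beta
    if total == 0.0:
        return [[1.0, 0.0], [0.0, 1.0]]
    decay = math.exp(-total * t)
    p01 = alpha / total * (1.0 - decay)   # 0 -> 1
    p10 = beta / total * (1.0 - decay)    # 1 -> 0
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]

def conditional_likelihoods(node, tree, tip_states, alpha, beta):
    """Return [L0, L1]: likelihood of the tip data below `node` for each state of `node`."""
    if node in tip_states:
        return [1.0 if tip_states[node] == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child, length in tree[node]:
        below = conditional_likelihoods(child, tree, tip_states, alpha, beta)
        p = transition_probs(alpha, beta, length)
        for s in (0, 1):
            # Sum over both possible states of the child.
            result[s] *= p[s][0] * below[0] + p[s][1] * below[1]
    return result

def log_likelihood(tree, root, tip_states, alpha, beta, root_weights=(0.5, 0.5)):
    """Log-likelihood of the tip data; equal root weights are an assumption."""
    l0, l1 = conditional_likelihoods(root, tree, tip_states, alpha, beta)
    return math.log(root_weights[0] * l0 + root_weights[1] * l1)

# Hypothetical four-species tree: each internal node lists (descendant, branch length).
tree = {
    "n3": [("n2", 1.0), ("s4", 3.0)],
    "n2": [("n1", 1.0), ("s3", 2.0)],
    "n1": [("s1", 1.0), ("s2", 1.0)],
}
tips = {"s1": 1, "s2": 1, "s3": 0, "s4": 0}
print(log_likelihood(tree, "n3", tips, alpha=0.3, beta=0.7))
```

No ancestral state is ever fixed in this calculation: every internal node contributes a sum over both of its possible states.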
One of the principal uses for Discrete is to test for correlated evolution between two binary
discrete characters. This is achieved by comparing the fit (likelihood) of two models to the data.
In one the two traits are allowed to evolve independently; in the other they evolve in a correlated
fashion. Evidence for a correlation is found if the model of correlated evolution fits the data
significantly better than the model of independent evolution (Pagel, 1994).
For a trait that can take only two values (e.g., 0,1), two rates must be estimated, one for transition
from “0” to “1”, and the other for transitions from “1” to “0”. These parameters are sufficient to
characterise the evolution of traits in isolation from one another. Four parameters are required
for two traits evolving independently (see Figure). The model of correlated or dependent trait
evolution considers the four possible states that two binary characters can jointly adopt (0,0; 0,1;
1,0; 1,1). It then allows one of the variables to change state in any branch of the tree, yielding
eight possible transitions to be estimated (Figure). These can be shown to be sufficient to
calculate the probability of any kind of change in any branch of the tree, and they can be used to
chart the most probable course of evolution from the ancestral state to the contemporary derived
state.
[Figure: Independent transitions between two binary states in two traits, X and Y (upper panel);
linked or correlated transitions among the four joint states 0,0; 0,1; 1,0; 1,1 of two binary
traits (lower panel). Dashed lines are not calculated.]
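The eight transitions of the dependent model can be laid out as a 4 x 4 instantaneous rate matrix. The sketch below is illustrative only: the function names are hypothetical, and the numbering of the joint states (1 = 0,0; 2 = 0,1; 3 = 1,0; 4 = 1,1, written X, Y) is assumed from the qij labels used in the Dependent menu section. Entries for simultaneous changes in both traits are left at zero, and the independent model appears as the special case in which each trait's rates do not depend on the state of the other trait.

```python
def dependent_rate_matrix(q12, q13, q21, q24, q31, q34, q42, q43):
    """4x4 rate matrix over the joint states 1=(0,0), 2=(0,1), 3=(1,0), 4=(1,1)."""
    rates = {(1, 2): q12, (1, 3): q13, (2, 1): q21, (2, 4): q24,
             (3, 1): q31, (3, 4): q34, (4, 2): q42, (4, 3): q43}
    matrix = [[0.0] * 4 for _ in range(4)]
    for (i, j), rate in rates.items():
        matrix[i - 1][j - 1] = rate
    for i in range(4):
        # Diagonal makes each row sum to zero, as for any rate matrix.
        matrix[i][i] = -sum(matrix[i])
    return matrix

def independent_rate_matrix(alpha1, beta1, alpha2, beta2):
    """Independent model written in the same 4-state form (4 free parameters)."""
    return dependent_rate_matrix(
        q12=alpha2, q34=alpha2,   # trait 2 gains at the same rate whatever trait 1 is
        q21=beta2, q43=beta2,     # trait 2 losses
        q13=alpha1, q24=alpha1,   # trait 1 gains
        q31=beta1, q42=beta1,     # trait 1 losses
    )

Q = independent_rate_matrix(alpha1=0.2, beta1=0.1, alpha2=0.5, beta2=0.4)
for row in Q:
    print(row)
```

Written this way, the two models differ by four free parameters, which is the difference used in the test of correlated evolution.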
Running Discrete
This version of the program runs on PCs under the Windows operating system. Macintosh users
can run Discrete by installing Connectix Virtual PC on their computers.
Data Input Format
Discrete requires a bifurcating phylogeny and data on species. It uses its own input format
(‘pag-format’) to describe the phylogenetic tree and the data. The attraction of the tree format is its
transparency, although it is not as compact as Phylip (Newick) or Nexus formats. We hope to
make available software for converting among these three formats.
The input format is simple. Consider the tree below of four species, and three internal nodes, 1,
2, and 3 where node 3 is also the root. Trees must be rooted.
s1         s2          s3          s4
|          |           |           |
|t1        |t2         |           |
|__________|           |           |
     |1                |t3         |
     |t5               |           |t4
     |_________________|           |
              |2                   |
              |t6                  |
              |____________________|
                        |3
The ‘pag’ input format for this tree is:
# example phylogenetic tree. Comments can precede tree if
# preceded by ‘#’ as in this line.
s1, 1, t1, data1,data2
s2,1, t2, data1, data2
s3, 2, t3, data1, data2
s4,3, t4,data1, data2
1, 2, t5
2, 3, t6
The “data1, data2” are the comparative data measured across species. Discrete takes two traits.
If only one trait is being investigated it can be duplicated to create a ‘dummy’ second trait.
Read the first line of the tree description as “species 1 goes to node 1 over ‘time’ or length t1, with
data1, data2”; the second line as “species 2 goes to node 1 over time or length t2, with data1, data2”,
and so on. Data points must be real numbers. Missing data are not allowed. Species with missing
data must be removed from the tree.
Beginning with the fifth line of the tree description, the connections among the internal nodes are
described: “node 1 goes to node 2 over time or length t5, and node 2 goes to node 3 over time or length t6”.
Nodes do not have data, and the branch lengths can be any real number. Branch lengths can be
any units but units of time and genetic distance (operational time) are especially useful. If no
branch length information is available, one option is to assign them all an arbitrary length of 1.0
(although it should be borne in mind that doing so implies that more total evolution has taken
place between the root and the tips of the tree for those species with more ancestors).
Only tips (species) have data, and the tree must be bifurcating. If you have so-called polytomies
in your tree, resolve them to bifurcations, or if this is not possible, remove species until a
bifurcating node remains. If the species that are removed all have the same values on the traits,
the analyses will not be affected in any substantial way. Trees must be rooted, although the root
itself is not described but inferred by the program from the input format.
Species names can contain any alphanumeric characters but should start with a letter. They must
be one word. Internal node names can be integers or alphanumeric strings. (Note that this input
format is more flexible than that for my related program Continuous that analyses quantitative
comparative data. Users anticipating using both methods may wish to have their input formats
conform to that required for Continuous; see its manual.)
It may often be easiest to number the species from left to right beginning with species s1 to
species sn. Then label the nodes as n+1, n+2 and so on until you reach the root. Every node in a
bifurcating tree must have two and only two descendants. The root does not "go to" any other
node and is not described any further. Items are separated by commas.
At the moment, Discrete does not give very much useful information when it tries to read in a
treefile that is wrong in some respect. The most common errors are failing to separate items by
commas, or inserting more than one comma, having too few or too many descendants of a node,
specifying the wrong descendant, or incorrectly specifying the number of species or variables.
The best way to de-bug a treefile that is not working is to print it out and compare it line by line
to a picture of the phylogeny.
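Some of the common errors listed above can also be caught mechanically before the file is loaded. The sketch below is an independent checker, not part of Discrete: the function names are hypothetical, the example uses made-up numeric branch lengths and data, and it assumes that species lines carry exactly five comma-separated items and internal node lines exactly three. It does not detect every possible mistake (for example, a wrongly specified but still bifurcating descendant).

```python
def parse_pag(text):
    """Split pag-format text into species records and internal node records."""
    tips = {}    # species name -> (ancestor, branch length, [data1, data2])
    nodes = {}   # internal node -> (ancestor, branch length)
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = [f.strip() for f in line.split(",")]
        if any(f == "" for f in fields):
            raise ValueError(f"line {number}: empty item (extra comma?)")
        if len(fields) == 5:
            name, ancestor, length, d1, d2 = fields
            tips[name] = (ancestor, float(length), [float(d1), float(d2)])
        elif len(fields) == 3:
            name, ancestor, length = fields
            nodes[name] = (ancestor, float(length))
        else:
            raise ValueError(f"line {number}: expected 3 or 5 items, found {len(fields)}")
    return tips, nodes

def check_tree(tips, nodes):
    """Return the name of the root, or raise ValueError if the tree is not bifurcating."""
    children = {}
    for name, record in list(tips.items()) + list(nodes.items()):
        children.setdefault(record[0], []).append(name)
    for parent, kids in children.items():
        if parent in tips:
            raise ValueError(f"species {parent} is used as an ancestor")
        if len(kids) != 2:
            raise ValueError(f"node {parent} has {len(kids)} descendants, expected 2")
    for name in nodes:
        if name not in children:
            raise ValueError(f"internal node {name} has no descendants")
    # The root is the only ancestor that does not itself "go to" another node.
    roots = [parent for parent in children if parent not in nodes]
    if len(roots) != 1:
        raise ValueError(f"expected one root, found {roots}")
    return roots[0]

EXAMPLE = """\
# four-species example with made-up branch lengths and data
s1, 1, 1.0, 0, 1
s2, 1, 1.0, 0, 0
s3, 2, 2.0, 1, 1
s4, 3, 3.0, 1, 0
1, 2, 1.0
2, 3, 1.0
"""

tips, nodes = parse_pag(EXAMPLE)
print("root is node", check_tree(tips, nodes))
```

Reporting the line number of the first malformed line is usually enough to locate a stray comma or a node with the wrong number of descendants.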
Loading and Viewing the Data: File Menu
When Discrete is started up a blank window will appear. Select Open from the File menu and
look for the input file in the box that appears. Input files must be text files in the format as
described above. If the input file has been saved by a word processor such as Microsoft Word be
sure to choose the text only option. Sometimes word processors can insert invisible characters
that can interfere with the input file being read in. If you suspect this, open the file in a word
processor and save it as some other file choosing a text only format.
If the treefile loads properly, the message “Data Loaded in Successfully” will appear, and
describe the file. To view the contents of the input file select Display Input from the File menu.
The Save Output command saves a copy of the analysis window in a text file and is useful when
a number of analyses have been run.
Subsequent files can be loaded the same way and will replace the previous file as the one that
Discrete will analyse.
Analysing Data
Discrete can be used to characterise and test hypotheses about the evolution of single traits, to
find ancestral states, to test for evidence of correlations among traits, and to conduct computer
simulations. All parameters are estimated by maximum likelihood, and found by searching a
likelihood surface for the value that maximises the likelihood of observing the data given the
value of the parameter, the phylogenetic tree, and the model of evolution. All likelihoods are
expressed as log-likelihoods.
The Independent, Dependent, Simulation, and Graphics menus provide the features for
analysing data. The features in these menus will be described in order below, although this will
not normally correspond to how they will be used when analysing data.
Independent Menu
The Independent menu provides a number of ways to investigate the evolution of single traits,
do ancestral state estimation, and provides the starting analyses for the test of correlated
evolution. The results to complete the test of correlated evolution are obtained from the
Dependent menu. How to conduct significance tests is described in the Hypothesis Testing
section. The following sections describe the Independent menu. All calculations in this menu fit
the model of independent trait evolution in contrast to the model of dependent evolution
described in the next section.
Run Independent Test command
This command calculates the log-likelihood of the model of independent evolution for the two
traits (see Pagel, 1994, 1997). This model allows the traits to evolve independently on the tree.
Each trait is characterised by a forward and backward transition rate, labelled “alpha” and “beta”,
respectively. They are the instantaneous transition rates from state 0 to state 1 (alpha) and from
state 1 to state 0 (beta). As rates they depend upon the lengths of the branches of the
phylogenetic tree. They are not probabilities. Alpha1 and beta1 correspond to trait 1 and alpha2
and beta2 to trait 2.
When the independent test is run the program prints out the likelihood of the model (which is the
sum of the likelihoods for each variable separately), the transition rate parameters, and
information on the state of other parameters that can be fixed or estimated.
Set Independent Variables command
This option opens a menu box that allows the user to estimate the ancestral state at the root of the
tree (other ancestral state reconstructions are described under the Graphics menu), calculate scaling
parameters, fix the values of parameters to predetermined values, and test for differential rates
of trait evolution.
The Parameter Restriction box allows one to choose a parameter to be fixed to a scalar value or
to be restricted to be equal to another parameter. The kind of restriction is chosen from the
Restriction Type menu. Setting a parameter to a constant fixes it at that value in all likelihood
calculations. Fixing a parameter to the value of 0.0 can be used to compare the likelihood
obtained when the parameter = 0.0 with that obtained when it is allowed to take its maximum
likelihood value (see Hypothesis testing). Fixing a parameter to be equal to another parameter
means that they are restricted to take the same estimated value in the model. This feature makes
it possible to test simpler models (for example, rate of forward changes equals rate of backward
changes is achieved by fixing alpha1=beta1 or alpha2 = beta2) against unrestricted models.
The choices made in the Parameter Restriction box are implemented the next time Run
Independent Test option is chosen.
The Set Model box allows the user to fix the root (Fix Root option) or allow it to remain free (if
not fixed the likelihood calculations sum over both values). If Root Reconstruction is switched
on, then the program automatically estimates the likelihood of the two alternative values at the
root and prints out the posterior root probabilities based upon what I have called the “local”
method (see Pagel, 1999b for an explanation of the calculations). The Bayesian Weights option
is described in Pagel (1999b) but is not fully tested.
The Independent Scaling option allows the user to estimate the value of the parameter κ (kappa).
Kappa is described in Pagel (1994). The kappa parameter differentially stretches or compresses
individual phylogenetic branch lengths and can be used to test for a punctuational versus gradual
mode of trait evolution. Kappa > 1.0 stretches long branches more than shorter ones, indicating
that longer branches contribute more to trait evolution (as if the rate of evolution accelerates
within a long branch). Kappa < 1.0 compresses longer branches more than shorter ones. In the
extreme of Kappa = 0.0, trait evolution is independent of the length of the branch. Kappa = 0.0 is
consistent with a punctuational mode of evolution.
Kappa is interesting in its own right and can be valuable for smoothing the likelihood surface. If
the phylogeny contains a wide range of branch lengths – some very long, others very short – it
can be difficult to fit the likelihood model. Kappa will often take a value <<1.0 on such trees,
making all branches roughly the same length.
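The effect of kappa on a set of branch lengths can be sketched as follows. This assumes that kappa acts as an exponent on each branch length (length raised to the power kappa), which is consistent with the behaviour described above but is not spelled out in this manual; the function name and the branch lengths are hypothetical.

```python
def scale_branches(lengths, kappa):
    """Raise every branch length to the power kappa (assumed form of the scaling)."""
    return [length ** kappa for length in lengths]

lengths = [0.1, 1.0, 10.0]   # made-up branch lengths spanning two orders of magnitude

for kappa in (0.0, 0.5, 1.0, 2.0):
    scaled = scale_branches(lengths, kappa)
    ratio = max(scaled) / min(scaled)
    print(f"kappa={kappa}: scaled={scaled}, longest/shortest={ratio:.3g}")
```

With kappa = 0.0 every branch has the same length, so the amount of change no longer depends on the original branch lengths; with kappa well below 1.0 the ratio of longest to shortest branch shrinks sharply, which is the smoothing effect mentioned above.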
The Advanced Options box contains parameters that the numerical analysis algorithm uses in its
‘hill-climbing’ routine. These are best left untouched, save for the Convergence value. This
value determines when to stop the likelihood search: if two successive likelihoods from the
search procedure differ by less than the Convergence value, the search is stopped. Smaller
numbers therefore cause a more stringent stopping rule to be enforced.
The choices made in the Set Model box are implemented the next time Run Independent Test
option is chosen.
Gamma Settings command
This menu implements a gamma rate heterogeneity model of trait evolution (using code based
upon Yang’s discrete gamma model; Yang, 1994, J. Mol. Evol. 39, 306). This model allows the traits to
evolve at different rates in different branches of the tree, where the distribution of rates is
assumed to follow a gamma distribution with a mean of 1.0. When the gamma parameter is
estimated, the likelihood of the basic model of trait evolution (Independent model or Dependent
model) is summed over the distribution of possible rates.
If the gamma model significantly improves the fit of the data to the underlying model (see
Hypothesis Testing), this indicates that rates of evolution are faster or slower in some
branches of the tree. The model does not at present identify which branches.
Gamma rate parameters can be estimated separately for the X and Y traits (trait 1 and trait 2),
they can be restricted to be equal to each other (via the Parameter Restriction window), or they
can be restricted to a constant. The gamma distribution is, for purposes of calculation, divided
into a number of discrete classes of equal area. Four divisions usually provide sufficient
resolution.
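The division of a mean-1.0 gamma distribution into equal-area classes can be illustrated with a simple Monte Carlo approximation. This is not Yang's exact calculation and not the code in Discrete: it draws many gamma variates, sorts them, cuts them into equally sized groups, and takes the mean rate of each group. The function names and the shape value are hypothetical, and the way Discrete applies the class rates to the tree is not described here, so the averaging step is shown only generically.

```python
import random

def discrete_gamma_rates(shape, n_classes=4, n_draws=100_000, seed=1):
    """Approximate mean rate of each equal-probability class of a mean-1.0 gamma."""
    rng = random.Random(seed)
    # A gamma with shape a and scale 1/a has mean 1.0.
    draws = sorted(rng.gammavariate(shape, 1.0 / shape) for _ in range(n_draws))
    size = n_draws // n_classes
    return [sum(draws[k * size:(k + 1) * size]) / size for k in range(n_classes)]

def gamma_averaged_likelihood(likelihood_at_rate, rates):
    """Average a likelihood over equally weighted rate classes.

    `likelihood_at_rate` is any function returning the likelihood of the
    basic model when its rates are multiplied by the given class rate.
    """
    return sum(likelihood_at_rate(rate) for rate in rates) / len(rates)

rates = discrete_gamma_rates(shape=0.5)
print("class rates:", [round(r, 3) for r in rates])
print("mean rate:", round(sum(rates) / len(rates), 3))
```

Each likelihood evaluation is repeated once per class, which is why the number of calculations grows with the number of divisions chosen.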
Choosing the gamma option greatly slows calculations as the parameter is often difficult to fit
and the number of likelihood calculations is increased by a factor equal to the number of
divisions chosen. It is recommended that the model is run a number of times when the gamma
option is on as the value of the parameter often varies from run to run (usually indicating that it
has little effect).
Discrete automatically incorporates the maximum likelihood values of the kappa and gamma
scaling parameters into its calculations, when these options are switched on.
Ancestral States command
This command allows one to estimate the best simultaneous set of ancestral states on the tree.
There are 2^n possible assignments of ancestral states of a binary character to n nodes. The option
calculates the likelihood of each of them and identifies the single assignment of ancestral states to
the n nodes that has the highest likelihood. For trees of more than about 18 nodes it can take a
very long time, especially if the ‘local’ option is used (Pagel, 1999b). The local option re-calculates
the Independent model for each of the 2^n assignments. The global option simply applies the
parameter values from the Independent model to each reconstruction.
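The exhaustive search over assignments can be sketched for a single trait as follows. The sketch corresponds loosely to the global option, in that fixed rate values are applied to every assignment; the tree, branch lengths, rates, and the equal root weighting are assumptions for illustration, and the code is not taken from Discrete.

```python
import itertools
import math

def transition_probs(alpha, beta, t):
    """Two-state transition probabilities over a branch of length t."""
    total = alpha + beta
    decay = math.exp(-total * t)
    p01 = alpha / total * (1.0 - decay)
    p10 = beta / total * (1.0 - decay)
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]

def joint_likelihood(assignment, tree, tip_states, alpha, beta):
    """Likelihood of the tips when every internal node is fixed by `assignment`."""
    states = dict(tip_states)
    states.update(assignment)
    likelihood = 0.5   # assumed equal weight for either state at the root
    for parent, branches in tree.items():
        for child, length in branches:
            p = transition_probs(alpha, beta, length)
            likelihood *= p[states[parent]][states[child]]
    return likelihood

def best_joint_assignment(tree, tip_states, alpha, beta):
    """Try all 2^n assignments to the n internal nodes and keep the most likely."""
    internal = list(tree)
    best = None
    for combo in itertools.product((0, 1), repeat=len(internal)):
        assignment = dict(zip(internal, combo))
        likelihood = joint_likelihood(assignment, tree, tip_states, alpha, beta)
        if best is None or likelihood > best[1]:
            best = (assignment, likelihood)
    return best

# Hypothetical four-species tree with three internal nodes (2^3 = 8 assignments).
tree = {
    "n3": [("n2", 1.0), ("s4", 3.0)],
    "n2": [("n1", 1.0), ("s3", 2.0)],
    "n1": [("s1", 1.0), ("s2", 1.0)],
}
tips = {"s1": 1, "s2": 1, "s3": 0, "s4": 0}
assignment, likelihood = best_joint_assignment(tree, tips, alpha=0.1, beta=0.1)
print(assignment, likelihood)
```

The loop doubles in length with every additional node, which is why the option becomes slow beyond about 18 nodes.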
The set of ancestral states derived from this option can differ from those obtained by separately
calculating the most probable ancestral state at each node (Graphics menu), allowing the others
to vary.
Dependent Menu
This menu effectively repeats the options of the Independent menu but here implements them for
the model of dependent trait evolution.
Run Dependent Test
This option calculates the likelihood of the 8 parameter model of dependent trait evolution. The
parameters are displayed as qij values and a table is drawn showing their correspondence to the
actual states of the traits. Thus, q12 estimates the rate at which the Y or trait2 character changes
from 0 to 1 when the X character is in state 0. The q34 parameter measures the same rate, but
now against a background of character X in state 1. Careful choice of comparisons of pairs of
parameters tests specific hypotheses of trait evolution. A number of these are described in Pagel
(1994).
The four ‘forward’ and four ‘backward’ transition rate parameters can be used to construct a
‘flow diagram’. The flow diagram charts the most probable way that the traits have evolved from
some ancestral state to some derived state. For example, if in the diagram below the state “0,0” is
thought to be ancestral, one may be interested in how evolution got to the state “1,1”. The
diagram shows that it could have gone via the intermediate state ‘1,0’ or via ‘0,1’. By testing
each qij for significance it will often be the case that one of the possible pathways is significant
but the other is not (see Cézilly, Dubois, and Pagel, 2000, Animal Behaviour 59, 1143-1152, for
an example).
The Flow Diagram (states shown as X, Y)
q12: 0,0 -> 0,1    q21: 0,1 -> 0,0
q13: 0,0 -> 1,0    q31: 1,0 -> 0,0
q24: 0,1 -> 1,1    q42: 1,1 -> 0,1
q34: 1,0 -> 1,1    q43: 1,1 -> 1,0
Set Dependent Variables menu
This menu repeats the options of the Set Independent Variables menu but now they are applied to
the dependent model. As before it is possible to fix parameters, set them to each other,
reconstruct ancestral states at the root, and choose the kappa scaling parameter.
The Root Reconstruction option calculates the most probable joint set of states at the root, and
prints out their probabilities. They will often be equal to the product of the corresponding root
probabilities from the Independent model, although they need not be.
Gamma Setting
This option finds a single gamma value that is optimal for scaling the dependent model. It can be
very difficult to fit.
Simulation menu
The simulation menu allows the user to set up and run a Monte Carlo simulation study of the
independent or dependent model. Its principal use is to find the approximate null hypothesis
distribution for the test of correlated evolution. This test compares the log-likelihoods of the
model of independent evolution with that of the model of dependent evolution, via what is known
as the likelihood ratio statistic (see Testing Correlated Evolution under Hypothesis Testing).
Run Simulations
This command runs the Monte Carlo simulations following the choices made in the Simulation
Setup menu.
The simulations print results to the screen: IL = likelihood of independent model as
fitted to simulated data; DL = likelihood of dependent model on same data; LR = likelihood ratio
= (DL-IL); (0,0),…(1,1) = the proportion of simulated tip values (species) with these trait combinations.
At the end of a run of n simulations the approximate p-value is printed out for the likelihood ratio
that was observed in the real data.
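An approximate p-value of this kind can be obtained as the proportion of simulated likelihood ratios that are at least as large as the observed one. The sketch below shows that calculation only; the function name and the numbers are made up, and the exact rule Discrete uses to compute its printed p-value is not documented here.

```python
def monte_carlo_p_value(observed_lr, simulated_lrs):
    """Proportion of simulated likelihood ratios at least as large as the observed one."""
    if not simulated_lrs:
        raise ValueError("no simulation results supplied")
    exceed = sum(1 for lr in simulated_lrs if lr >= observed_lr)
    return exceed / len(simulated_lrs)

# Made-up LR values (DL - IL) from ten simulated data sets under the independent model.
simulated = [0.4, 1.1, 0.2, 2.7, 0.9, 3.5, 0.6, 1.8, 0.1, 5.2]
print(monte_carlo_p_value(observed_lr=3.0, simulated_lrs=simulated))
```

Because the same proportion is obtained whether the LR values are doubled or not, the simple difference DL - IL printed by the simulations can be used directly, provided the observed value is expressed on the same scale.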
Important: for the p-value result and the simulations to be meaningful the exact forms of
the independent model and dependent model that are being tested should be run in
succession just before running the simulations.
Sometimes simulations fail owing to unusual combinations of data that cause floating point
errors. The simulation results are written out to a file so if a simulation fails, the runs to that
point can be retrieved. Simulations can be combined to yield larger data sets.
Simulation Setup
The Simulation Type menu allows the user to choose the Independent or the Dependent model as
the model that is used to generate the simulation data. The default is the independent model as
this is the model used to derive the null hypothesis sampling distribution for the test of correlated
evolution.
Fossil Records
If ‘fossils’ have been set (that is, nodes fixed to one or the other value of the trait) on the
phylogeny (Graphics menu) this option allows them to be included or not in the simulations. If
they have been used when the independent model was calculated then they can be employed in
the simulations.
Number of Simulation Runs
A minimum of 100 runs is recommended, although fewer can be used at first to check that the
settings are correct and that the run is producing meaningful results. Simulation results
(likelihoods and distributions of tip states) are printed out to the screen.
Simlimit and Variance
The simlimit command prevents the hill-climbing algorithm from getting stuck (2000 iterations is
a useful figure) and the variance command ignores simulated runs in which the variance of the
characters across the tips is too low. A value of 5 seems useful. Simulated data sets can, by
chance, all come up with the same value at the tips and then there is nothing for the model to
analyse.
Parameter Output
Selecting these options produces output files of the estimated parameter values for the simulated
data.
Graphics Menu
The Graphics menu allows the user to inspect the phylogeny, reconstruct ancestral states, assign
ancestral states to nodes, and to calculate likelihood surfaces for specified parameters.
Draw Phylogeny
This option produces a picture of the phylogeny drawn to scale from the branch length
information in the input file.
Clicking on a node (this can be tricky) brings up the Node Information box for that node. The
box gives information about the node, including its ancestral state (default = no state
information). By clicking on Fossil1 or Fossil2 it is possible to fix the value of the node at a
specified state, or return it to the ‘free’ or unfixed state.
Re-calculating the likelihood having set the node successively to state 0 and then 1 gives
information about which is the more probable state. This procedure is automated in the
Calculate Fossil Likelihood command. Pressing ‘GO’ instructs the program to calculate the
likelihoods of a ‘0’ and then a ‘1’ at the node and to print out their probabilities under the model.
This is a very quick way to do ancestral state reconstruction by maximum likelihood. Results are
printed to the text window.
NOTE potential problem: When ‘fossil’ likelihoods are calculated two kinds of calculation are
done, called ‘local’ and ‘global’ (see Pagel 1999 Systematic Biology for a description of these).
There is a mistake in the current version of Discrete (4.0) that means that if a series of ancestral
states are calculated in succession, the global estimates will only be correct for the first set. This
is because after calculating the global and local fossil likelihoods, the global alpha and beta
parameters of the independent model get lost. The consequence is that if one estimates a number
of ancestral nodes in a row, the global estimate no longer represents the true global estimate
because the initial parameter estimates are no longer the same. The local estimates are not
affected.
To get global estimates of ancestral state, first fix the alpha and beta parameters to their ML
estimates using the settings in the Set Independent Variables menu. When this is done the local
and global estimates using the “Go” button will necessarily be the same. The local estimates can
be obtained by unfixing the parameters and re-doing the analyses.
Clicking on the end of a terminal branch reveals information about the species.
Surface Plot
This command draws the likelihood surface for a parameter, given the instructions from the
Surface Setup command.
Surface Setup
It is often desirable to see how the likelihood changes for differing values of a parameter – this is
a likelihood surface in one dimension. The parameter is successively fixed at a series of values
and all other parameters are free to vary when the likelihood is calculated. The option allows the
user to choose a parameter to be plotted, specify the accuracy of the curve (number of points in
curve), and specify whether 95% confidence intervals should be included.
These plots can be quickly calculated for the standard parameters of the independent model, but
may take a long time for scaling parameters, gamma parameters, and parameters of the dependent
model.
Hypothesis Testing
All hypotheses are tested using the likelihood ratio statistic. The likelihood ratio statistic
compares the log-likelihood of a null hypothesis model to that of an alternative hypothesis model.
Discrete automatically calculates the log-likelihood of whatever model is chosen in the
Independent or Dependent menus, and displays this likelihood in the text window.
Once you have run the dependent test a likelihood ratio will be printed out. The value that is
printed out to the screen (and in the simulations) is the simple difference between the dependent
likelihood based upon the last Dependent model run, and the independent likelihood based upon
the last model run under the Independent analysis. Thus, for example, if one wishes to test for
correlated evolution, the Independent model should be run followed by the Dependent model.
Then, the likelihood ratio printed out will reflect the difference between these two models.
Conventionally, this difference is multiplied by 2 to form the likelihood ratio statistic. The
likelihood ratio (LR) test compares the goodness of fit of a model to the data with that of a
simpler model that lacks one or more of the parameters. The LR statistic is then defined as
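$$ LR = -2 \log_e \left[ \frac{H_0}{H_1} \right], $$
where H0 is the likelihood of the simpler (null) model and H1 is the likelihood of the (alternative)
model containing the parameters representing the evolutionary processes one wishes to estimate.
Equivalently, LR is twice the difference between the log-likelihood of H1 and the log-likelihood of H0.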
If the simpler model is a special case of the more complicated one, the LR statistic is
asymptotically distributed as a chi-squared variate with degrees of freedom equal to the
difference in the number of parameters between the two models, i.e., $LR \sim \chi^2(v)$, where $v$ is the
number of degrees of freedom. One test is a special case of another if it is possible to collapse
the more complicated model to the simpler model by setting some parameters to zero or to other
fixed values. For example, the model in which a parameter such as kappa is estimated collapses
to the default null hypothesis model of kappa = 1: the kappa=1 or null model is a special case of
the alternative hypothesis model in which kappa is free to take any value. In such circumstances
the two models are often referred to as being ‘nested’, and here they differ by one degree of
freedom.
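For nested models the test can be carried out from the two log-likelihoods alone. The sketch below doubles the difference in log-likelihoods and looks up the upper tail of the chi-squared distribution using closed-form expressions for integer degrees of freedom; the function names and the log-likelihood values are made up for illustration and are not output from Discrete.

```python
import math

def chi2_upper_tail(x, df):
    """Upper-tail probability of a chi-squared variate with integer df >= 1."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: start from the df = 1 tail and add one term per extra 2 df.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for k in range((df - 1) // 2):
        total += term
        term *= x / (2 * k + 3)
    return total

def likelihood_ratio_test(loglik_null, loglik_alt, df):
    """Return (LR statistic, p-value) for two nested models."""
    lr = 2.0 * (loglik_alt - loglik_null)
    return lr, chi2_upper_tail(lr, df)

# Made-up log-likelihoods: kappa fixed at 1.0 (null) versus kappa estimated (alternative).
lr, p = likelihood_ratio_test(loglik_null=-31.2, loglik_alt=-28.9, df=1)
print(f"LR = {lr:.2f}, p = {p:.4f}")
```

The value printed by Discrete is the simple difference in log-likelihoods, so it must be doubled, as `likelihood_ratio_test` does, before being compared with a chi-squared distribution.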
Testing Correlated Evolution. One of the principal uses of Discrete will be to test for correlated
evolution. Elsewhere (Pagel, 1994) this is called the omnibus test. This test is performed by
comparing the likelihoods of the models of independent and dependent evolution via a likelihood
ratio test. If the traits are correlated in the sample the dependent model will fit the data
significantly better. In the above equation for LR, H0 is the likelihood of the independent model
and H1 is the likelihood of the dependent model.
In their default states, these two models differ by four parameters. Simulation studies (Pagel,
1997) show that the likelihood ratio statistic in this instance is asymptotically distributed as a
chi-squared variate with 4 degrees of freedom. However, for small phylogenies or for traits that
show very little change on the tree, the null hypothesis distribution may be less than a
chi-squared with 4 degrees of freedom, approximating to a 3 or even 2 degree of freedom
distribution.
What this means is that if the result of the test exceeds the chi-squared 4 df criterion for p<0.05
(for example), one can safely reject the null hypothesis. If it doesn’t, it may still be possible to
reject the null if simulations show that the distribution is less than chi-squared with 4 df. This is
what the simulation options determine (see Simulation Setup under Simulation menu).
Other Examples of Likelihood Ratio tests with Discrete (Pagel, 1994 provides an outline of
tests and Pagel, 1999a gives an example of the test of correlated evolution with binary traits).
Models of Evolution. Do forward and backward transitions proceed at the same rate? Is the rate
of back transitions not different from zero? These and other examples can be tested by simple
LR tests with one degree of freedom. Compare the restricted model of independence
(alpha=beta, or beta=0.0) to the unrestricted model.
The dependent model has 8 parameters. Frequently it is possible to show that some of them do
not differ from zero. These parameters can then be set to zero to produce a simpler model of
dependent evolution. By implication, this model says something about how the two traits
evolved.
Conditional or Contingent trait evolution. Does the rate at which Trait 2 changes from 0 to 1
depend upon the state of Trait 1? This and other conditional tests are performed by restricting the
dependent model. Comparing the likelihood of a model in which q12 is restricted to q34 with the
likelihood of the unrestricted model tests this hypothesis of conditional evolution. The test has 1
df. An alternative form of this test separately asks whether each differs from 0.0. If one does and
the other does not, then it might be argued that they differ from each other. This test can be
slightly more powerful than the preceding test.
Punctuational and Gradual Trait Evolution. Perform an LR test of kappa = 0.0 (null) to kappa =
ML value. If kappa(ML) is not significantly different from 0.0, then trait evolution is consistent
with a punctuational mode of change. Kappa > 0.0 implies some form of gradualism. Test
whether kappa < 1.0 to see if default or ‘scaled’ gradualism is better supported.
More generally, the test of kappa is one of whether the branch lengths are informative about trait
evolution. If they are not, kappa will tend to go to zero.
Constant rate of change. Perform a LR test of the independent model with Gamma turned off
versus the same model with Gamma turned on. This test will have one degree of freedom for
each value of gamma estimated. Thus, if gamma is estimated only for Trait 1, the test will follow
a chi-squared 1 df distribution.
Ancestral States. The conventional cut-off point for preferring one state at a node over the other
is if their likelihoods differ by more than 2 log units or by more than 4 in the LR test.
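The cut-off can be written as a small decision rule. This is only a sketch: the function name and the example log-likelihoods are hypothetical, and the inputs are assumed to be the two log-likelihoods obtained for a node, one for each state.

```python
def preferred_state(lnl_state0, lnl_state1, cutoff=2.0):
    """Return 0 or 1 if one state is favoured by more than `cutoff`
    log units, otherwise None (the node is treated as ambiguous)."""
    diff = lnl_state1 - lnl_state0
    if diff > cutoff:
        return 1
    if diff < -cutoff:
        return 0
    return None

# Hypothetical log-likelihoods for three nodes (state 0, state 1).
print(preferred_state(-12.3, -9.8))   # difference of 2.5 log units
print(preferred_state(-10.0, -10.9))  # difference of 0.9 log units
print(preferred_state(-8.0, -11.0))   # difference of 3.0 log units
```

A difference of 2 log units corresponds to a value of 4 on the LR scale, since the LR statistic is twice the difference in log-likelihoods.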
Characters with more than two states. Discrete is not set up to calculate likelihoods for traits
with more than two states. However, any trait with more than two states can be represented as a
series of binary traits, each one contrasting a group labelled “1” with all of the others. Careful
choice of assignment of 1’s and 0’s in successive traits can account for the comparisons one may
wish to make. Each of the successive binary traits can then be correlated with some other binary
trait of interest.
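One simple way to build such a series is a one-versus-rest recoding, sketched below. The trait, its state labels and the function name are hypothetical; other assignments of 1's and 0's may suit a particular comparison better.

```python
def one_vs_rest(states):
    """Recode a multi-state trait as one binary trait per state.

    Returns a dict mapping each distinct state to a list of 0/1 values,
    with 1 wherever the species has that state and 0 for all others.
    """
    levels = sorted(set(states))
    return {lev: [1 if s == lev else 0 for s in states] for lev in levels}

# Hypothetical three-state trait scored for five species.
diet = ["fruit", "leaf", "insect", "fruit", "leaf"]
coded = one_vs_rest(diet)
for level, column in coded.items():
    print(level, column)
```

Each resulting column can then be entered as a binary trait and tested against the other binary trait of interest in a separate analysis.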
General Tips
Finding the maximum likelihood can be difficult for some data sets. Users should repeat analyses
of the independent and dependent models several times to get a sense of the stability of the result.
Sometimes a "local" optimum exists and the program will occasionally find that result rather
than the global optimum. On repeating the analysis it will become obvious which of the two is the
preferred result.
Some data sets have very difficult likelihood surfaces that return highly unsatisfactory results.
Data sets with a very large ratio of the longest to the shortest branch can sometimes behave
badly. These cases can often be dealt with by introducing a scaling parameter substantially less
than 1.0. This has the effect of shrinking all branches, but shrinking longer ones more than
shorter ones. There is nothing wrong with doing this; in fact the optimal branch length scaling is
interesting in its own right (see Pagel 1994, 1997). The scaling reflects the transformed space in
which the underlying model of evolution best fits the data.
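The effect of the scaling parameter on branch lengths can be seen with a few lines of arithmetic. The sketch below assumes that the scaling parameter acts as an exponent on each branch length, as described in Pagel (1994); the function names are hypothetical, and the branch lengths are those of the example data set in the Appendix.

```python
def scale_branches(lengths, kappa):
    """Raise every branch length to the power kappa (assumed form of scaling)."""
    return [b ** kappa for b in lengths]

def length_ratio(lengths):
    """Ratio of the longest to the shortest branch."""
    return max(lengths) / min(lengths)

# Branch lengths from the Appendix example (tips first, then internal nodes).
branches = [29, 9, 5, 20, 22, 3, 2, 2, 2, 15, 11, 10, 18, 28, 10, 56]

for kappa in (1.0, 0.5, 0.0):
    scaled = scale_branches(branches, kappa)
    print(f"kappa = {kappa}: longest/shortest = {length_ratio(scaled):.2f}")
```

Under this assumption a value of 0.5 reduces the ratio of the longest to the shortest branch from 28 to about 5.29, and a value of 0.0 makes all branches equal in length.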
Sometimes the best fit model returns very large values for some of the rate parameters. They can
be so large as to seem unrealistic. Usually this means that the likelihood surface is 'flat' for that
parameter and so, effectively, all values of the parameter return the same likelihood. The large
value then does not indicate a large effect.
References
Pagel, M. Detecting correlated evolution on phylogenies: a general method for the comparative
analysis of discrete characters. Proceedings of the Royal Society (B) 255, 37-45 (1994).
Pagel, M. Inferring evolutionary processes from phylogenies. Zoologica Scripta 26, 331-348
(1997).
Pagel, M. Inferring the historical patterns of biological evolution. Nature 401, 877-884 (1999a).
Pagel, M. The maximum likelihood approach to reconstructing ancestral character states of
discrete characters on phylogenies. Systematic Biology 48, 612-622 (1999b).
Lutzoni, F., Pagel, M., and Reeb, V. Major fungal lineages derived from lichen-symbiotic
ancestors. Nature 411, 937-940 (2001).
IX. Disclaimer
Discrete has been tested and gives correct results, to the best of my knowledge. However, no
specific claims are made for its accuracy and users are responsible for the interpretation and use
of all results derived from it.
X. Known Problems
Restricting a parameter of the Independent model to 0.0 will sometimes cause a log SING error.
The problem seems to be most acute on small phylogenies. If restricting a parameter to zero is
important for a hypothesis test, try setting it to a very small value such as 0.000001.
See the section on ancestral state reconstruction about a bug in one aspect of this code (easily
dealt with within the program).
Users are invited to report problems to [email protected]
Appendix: Example data set
This data set includes values of mating system (1=multi-male, 0=monogamy or unimale) and
presence/absence of oestrous advertisement (1=present, 0=absent) for nine Old World primates.
It is in the correct pag-file input format for Discrete. Branch lengths are genetic distances.
#data on primates: trait 1 = advertisement, trait 2 = mating system
homo.sapiens,n11,29,0,0
pan.trog,n10,9,1,1
pan.paniscus,n10,5,1,1
gorilla,n12,20,0,0
pongo.pyg,n13,22,1,0
Hylo.syndact.,n14,3,0,0
Hylo.sp,n14,2,0,0
col.guer,n16,2,0,0
col.bad,n16,2,1,1
n10,n11,15
n11,n12,11
n12,n13,10
n13,n15,18
n14,n15,28
n15,n17,10
n16,n17,56
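The layout of this file can be checked by reading it back into a tree. The sketch below is not part of Discrete, and its function names are hypothetical. It assumes the format suggested by the example alone: each line gives a node, its ancestor and a branch length, species lines carry two further fields for the trait values, and lines beginning with # are comments.

```python
PAG_TEXT = """\
#data on primates: trait 1 = advertisement, trait 2 = mating system
homo.sapiens,n11,29,0,0
pan.trog,n10,9,1,1
pan.paniscus,n10,5,1,1
gorilla,n12,20,0,0
pongo.pyg,n13,22,1,0
Hylo.syndact.,n14,3,0,0
Hylo.sp,n14,2,0,0
col.guer,n16,2,0,0
col.bad,n16,2,1,1
n10,n11,15
n11,n12,11
n12,n13,10
n13,n15,18
n14,n15,28
n15,n17,10
n16,n17,56
"""

def parse_pag(text):
    """Return (parent, length, traits) dicts keyed by node name."""
    parent, length, traits = {}, {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = [f.strip() for f in line.split(",")]
        node = fields[0]
        parent[node] = fields[1]
        length[node] = float(fields[2])
        if len(fields) > 3:  # species lines also carry trait values
            traits[node] = tuple(int(v) for v in fields[3:])
    return parent, length, traits

def find_root(parent):
    """The root is the only ancestor that never appears as a descendant."""
    roots = set(parent.values()) - set(parent)
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found %d" % len(roots))
    return roots.pop()

def to_newick(parent, length):
    """Write the tree as a Newick string with branch lengths."""
    children = {}
    for node, anc in parent.items():
        children.setdefault(anc, []).append(node)

    def render(node):
        if node in children:
            label = "(" + ",".join(render(c) for c in children[node]) + ")"
        else:
            label = node
        if node in length:  # the root has no branch above it
            label += f":{length[node]:g}"
        return label

    return render(find_root(parent)) + ";"

parent, length, traits = parse_pag(PAG_TEXT)
print("root:", find_root(parent))
print(to_newick(parent, length))
```

Printing the Newick string gives a quick check that every node is connected to a single root and that the branch lengths were entered as intended.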
The Phylogeny implied by this treefile
[Figure: phylogeny of the nine primate species (Homo sapiens, Pan troglodytes, Pan paniscus,
Gorilla gorilla, Pongo pygmaeus, Hylobates syndactylus, Hylobates sp, Colobus guereza, Colobus
badius), with each branch labelled by its length as given in the data file above.]
Analyses of Example data set
Log-likelihood of
Independent model: ≈ -10.52
Dependent model: ≈ -7.05; LR test ≈ 2 × 3.47 ≈ 6.94
Approximate p-value (100 runs of simulation) ≈ 0.02
Models of Evolution
Restrict alpha1 to 0.0 (set to 0.00001; setting it to 0.0 may cause a computational error in this
case, see X. Known Problems): log-likelihood ≈ -14.206
Restrict alpha2 to 0.0 (set to 0.00001; setting it to 0.0 may cause a computational error in this
case, see X. Known Problems): log-likelihood ≈ -15.051
Setting either alpha1 or alpha2 to zero causes a large decrease in the log-likelihood relative to
the unrestricted independent model. This is equivalent to saying that these two parameters are
statistically different from zero.
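These comparisons can be worked through as 1 df LR tests. The sketch below assumes that each restricted model is compared with the unrestricted independent model (log-likelihood about -10.52); the function and variable names are hypothetical.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variate with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

LNL_INDEPENDENT = -10.52  # unrestricted independent model (approximate)
restricted = {"alpha1": -14.206, "alpha2": -15.051}

results = {}
for name, lnl in restricted.items():
    lr = 2.0 * (LNL_INDEPENDENT - lnl)  # LR statistic for the restriction
    results[name] = (lr, chi2_sf_1df(lr))
    print(f"{name}: LR = {lr:.3f}, chi-squared 1 df p = {results[name][1]:.4f}")
```

Both LR values (about 7.37 and 9.06) exceed 3.84, the chi-squared 1 df criterion for p<0.05.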
Estimated ancestral states at root
Trait 1: approximately equally likely to be 0 or 1
Trait 2: approximately equally likely to be 0 or 1
Scaling Parameter κ
kappa ≈ 0.002; log-likelihood with kappa ≈ -10.42 (no significant improvement over the default
independent model)