Molecular Systematics

Tutorial “Molecular Systematics”, 18 April – 11 May 2016
Christian Printzen, Sabine Matuszak, Frank Lappe
1. Introduction
"... any systematic attempt to construct ... natural relationships becomes the purest speculation,
completely unsupported by any sort of evidence. ... the ultimate goal of biological classification
can not be achieved in the case of bacteria."
Stanier, Doudoroff & Adelberg, The microbial world, 2nd ed. (1964)
"A revolution is occurring in bacterial taxonomy. ... molecular sequencing techniques permit a
direct measurement of genealogical relationships."
Fox et al., Science 209: 457–463 (1980)
Systematics, and particularly phylogenetics, deals with the evolutionary relationships among
organisms. Starting with Charles Darwin’s main work “On the Origin of Species by Means of
Natural Selection" (Darwin 1859), systematists have attempted to discover genealogical relations among species. The icon for these relationships is the phylogenetic tree, resembling the well-known family trees with which we outline genealogical relationships among our family members.
The first “tree of life” was published in 1866 by Ernst Haeckel (Haeckel 1866). Until the middle
of the 20th century phylogenetic trees were almost exclusively based on morphological characters. This worked relatively well for highly evolved organisms with many observable characters,
but failed regularly in systematic groups with few visible and measurable features.
Bacteriologists were particularly frustrated. In 1964 (see first citation above) Stanier and his co-authors threw in the towel and buried a whole discipline. In their eyes bacterial systematics was
“the purest speculation”. Only 16 years later the situation had changed completely. Leading
bacteriologists declared a scientific revolution in Science. With the help of molecular data it had
become possible to reconstruct the phylogenetic history of bacteria. Today, the analysis of molecular data can help us answer questions that were thought to be unanswerable 40 years ago.
The history of an organism – from its evolutionary roots to its familial past – leaves traces in its
genome. Of course, Darwin and his contemporaries knew nothing about the mechanisms and
molecules of heredity. The road from the discovery of the origin of species and the laws of heredity to the calculation of molecular phylogenetic trees was a long one. Gregor Mendel's work on the
heredity of traits was published in 1866 (Mendel 1866), but only recognized in professional circles more than 30 years later. In 1869, the Swiss physiologist Friedrich Miescher discovered a
molecule which he called “nuclein” in nuclei of pus cells washed from bandages gathered from a
nearby hospital. He suspected this substance to be the molecule of heredity (Miescher 1874).
But it took 75 more years until Oswald Avery and his co-workers, through their groundbreaking
experiments on pneumococci, were able to show that deoxyribonucleic acid was indeed the
inherited material (Avery et al. 1944). Soon afterwards, its chemical and physical structure was
revealed by Erwin Chargaff et al. (1951), James Watson and Francis Crick (1953). The first
piece of a DNA sequence (only 20 base pairs!) was published another 15 years later by Ray Wu
and Dale Kaiser (1968).
At that time, deciphering a DNA or protein sequence was technically challenging and extremely
expensive. Molecular data was therefore not usable for taxonomists and systematists, who
needed large numbers of sequences for their datasets. This changed with the publication of two
articles that altered the whole of biology and, together with it, systematics and evolutionary science. In the first paper the English biochemist Frederic Sanger and two of his colleagues described how the nucleotide sequence of a DNA molecule could be identified with a relatively
simple procedure (Sanger et al. 1977). The other manuscript was rejected by Science and Nature (these journals also make mistakes!) and was therefore first published in a volume of symposium procedures. In it, the American biochemist Kary Mullis and co-workers described a
method to amplify DNA in a test tube, the polymerase chain reaction (PCR, Mullis et al. 1986,
Mullis & Faloona 1987). Both methods together enabled systematists to determine the nucleotide sequence of particular gene loci in different species in a relatively easy and cheap way.
Already in the early 1960s, Emil Zuckerkandl and Linus Pauling (1962) had postulated that phylogenetic relationships could be elucidated with the help of macromolecules. Three years later
they published what is probably the first molecular phylogenetic tree, based on protein sequences (Zuckerkandl & Pauling 1965). At about the same time, the development of statistical methods for phylogenetic reconstruction began, at first without direct reference to molecular data.
Charles Michener and Robert Sokal (1957) had made a start with the first statistical method for
the calculation of genealogical relationships based on “traditional” morphological characters.
Events came thick and fast in the 1960s, mainly through the works of Anthony Edwards and
Luca Cavalli-Sforza. In 1963 they introduced the principle of “minimum evolution”, better
known under the name “parsimony method” popularized by Joseph Camin and Robert Sokal
(1965). The works of Thomas Jukes and Charles Cantor (1969) as well as Walter Fitch (1971)
made this principle applicable to molecular data. In 1967 Anthony Edwards and Luca Cavalli-Sforza
described a method for the calculation of phylogenetic trees based on genetic distances. Using a similar method, Walter Fitch and Emanuel Margoliash (1967) published a phylogeny of
vertebrates in the same year. And already in 1964, Edwards and Cavalli-Sforza had theorized on
a technique to reconstruct phylogenies by maximum likelihood. This method is a well-established statistical tool for comparing models and hypotheses. Its application failed because, at that
time, no computers existed that could handle the massive number of computations required. It took
only eight years to develop and publish the basic methods of tree reconstruction (some of them
more than once in parallel), but another 17 years until Joseph Felsenstein (1981) introduced a
practical maximum likelihood method for phylogenetic reconstructions. Further milestones on
the way to the calculation of evolutionary trees were the neighbour joining method by Naruya
Saitou and Masatoshi Nei (1987) based on genetic distances and Bayesian tree reconstruction introduced by Yang and Rannala in 1997.
This compact course is designed to familiarize you with some of the methods of DNA sequence
analysis used in systematics today. We concentrate on the most widely used methods: parsimony, maximum likelihood and Bayesian tree reconstruction. Distance methods are nowadays
rarely used in phylogenetic reconstruction. The course can be regarded as a “cooking class” for
molecular systematics and this tutorial is the cookbook. It briefly explains the background of the
methods (normal text and blue boxes) and gives step by step instructions for lab procedures
and use of computer programs.
We use different computer programs for data analysis. Some of them have graphical user interfaces, others are command-line operated. This tutorial uses the following conventions:
Bold face denotes commands from drop-down menus or dialog boxes. >File>Open, for example, means “choose Open from the menu File”.
Courier denotes command line entries. hsearch addseq=random nreps=10000; <Enter> means “type these exact commands into the command line and then press Enter”.
2. Course Objectives
Generating DNA sequences has become rather simple routine work that is not hard to understand (although routine success in this field only comes with experience). Some of the methods
for calculating phylogenetic trees are not easy to understand. Although the computer does all of
the computations, you need at least an intuitive understanding of these methods and the calculation steps to interpret your data. This tutorial tries to explain the background of these methods
with as little mathematics as possible using examples and analogies. But it cannot do entirely
without some simple algebra and statistics. In order to make the most of this course you should
work through this tutorial thoroughly beforehand, even if you do not understand everything. This
will enable you to ask specific questions during the course. Participants with a basic
knowledge of German can also use the e-learning platform E-volution (Schulte et al. 2013, accessible under "Biowissenschaften" at: ) to complement the information and explanations given in this course.
After this course you should be able to:
1) describe and understand phylogenetic trees,
2) compute phylogenetic trees using maximum parsimony, maximum likelihood and Bayesian methods,
3) describe weaknesses and advantages of these methods,
4) understand and calculate the statistical support of branches in a phylogenetic tree,
5) date evolutionary events using molecular clock trees, and
6) reconstruct character evolution using phylogenetic trees.
3. Data Acquisition
To "sequence" DNA means to determine the order of the nucleotide bases – the base sequence – of a stretch of DNA. The information an organism needs for the synthesis of proteins is stored on the DNA in the form of base triplets (groups of three nucleotide bases). In addition, eukaryotes possess a large amount of non-coding DNA that apparently carries no information at all. DNA sequences are passed on from one generation to the next. Through occasional errors during DNA replication (mutations), ever larger sequence differences arise between organisms over time. The information important to systematists consists precisely of these heritably fixed changes (substitutions) that sequences undergo in the course of evolutionary or population history. Put simply: the more closely two organisms are related, the more similar their DNA sequences are. Molecular systematics is therefore about tracing the evolutionary history of organisms from differences in their DNA (or protein) sequences. This is a process comprising a whole series of working steps. The following summary is intended to give you an overview.
1) The first step of DNA sequencing is the extraction of DNA from tissue samples. The DNA must be released from the cells and all other cell components removed.
2) One or more predetermined DNA regions must be present in a concentration high enough for their base sequence to be determined. The polymerase chain reaction (PCR) serves to amplify the selected DNA regions.
3) The end product of the PCR serves as the starting material for the actual sequencing reaction. To obtain optimal results, a precisely defined amount of PCR product is used, whose concentration must therefore be determined beforehand.
The following working steps are not carried out practically in this course.
4) Purification of the PCR products removes all substances that interfere with the sequencing reaction.
5) The purified PCR product is subjected to a further PCR in the sequencing reaction. This time only one primer is used, so the DNA is no longer doubled. Usually, for reliability and data verification, both DNA strands are sequenced, and two or more reactions with different primers are therefore set up.
6) In the last step, the product of the sequencing reaction is separated electrophoretically in an automated DNA sequencer, and the DNA sequence is determined with the help of fluorescent labels.
Processing of the plant material and DNA extraction
Freshly collected plant material gives the best yield of DNA. The quality (expressed as the molecular weight of the DNA) is also highest when fresh material is used. In everyday practice, however, one often has to fall back on preserved material (who carries a DNA extraction lab along on collecting trips?). Plant DNA is best preserved by rapid drying with silica gel beads, storage in a special buffer mixture, or deep-freezing (although one usually does not carry a freezer in the field either). DNA can also be isolated from herbarium specimens, i.e. material dried by pressing, although it is often strongly degraded. A wide variety of DNA extraction methods has been developed for different groups of organisms. In this course we apply two methods as examples: an extraction with CTAB and chloroform, and one with a commercially available kit (the "DNeasy Plant Mini Kit" from Qiagen).
In both methods the first step is the easiest: destroy the plant material beyond recognition! This is necessary because in the next step the DNA is to be released from the cells. The larger the surface area, the faster the enzymes required for this can act. Nevertheless, proceed with caution: wear gloves!
DNA extraction with the CTAB method
Countless variants of this protocol exist for animal and plant material. Almost every researcher swears by his or her own recipe. In this course, a protocol optimized for the family Bromeliaceae, based on the basic recipe of Doyle & Doyle (1987), is used.
1. Each participant receives 4 plant samples: 3 from herbarium specimens of different ages and 1 sample dried in silica gel.
2. For each sample, label the lid and the side wall of a 2 mL Eppendorf tube with the sample number. Protect the label on the side with a small piece of adhesive tape. Weigh in 20–30 mg of each plant sample.
3. Pipette 650 µL CTAB extraction buffer (CTAB = cetyltrimethylammonium bromide) into each tube. Add 1.3 µL mercaptoethanol (= MET). Caution, MET is toxic – work under the fume hood!
4. Place the tubes in the heating block at 60 °C.
5. For each plant sample, set out a mortar with pestle and add a spatula tip of sand.
6. Put on safety goggles and cotton gloves, with latex gloves over them!
7. Fill the mortar halfway with liquid nitrogen. Caution! Even with cotton gloves, touch the cooled mortar only briefly. Add the plant sample and wait until the nitrogen has almost completely evaporated.
8. Start grinding immediately.
9. Transfer the plant powder with a spatula and funnel into the corresponding Eppendorf tube before it has thawed! Work under the fume hood!
10. Mix buffer and plant material by briefly shaking the closed tube.
11. When all samples have been processed and are in the heating block, wait 30–60 minutes. During this time, shake the tubes vigorously every 5 minutes. After consulting the course instructors, you may be able to take a break during this period.
12. Then let the samples cool to room temperature (RT).
13. Add 650 µL "chloroform–isoamyl alcohol 24 + 1" (thoroughly mixed). Caution: chloroform is harmful to health – work under the fume hood! Swirl the samples by hand for 10 minutes.
14. Centrifuge for 15 minutes at 9000 rpm and room temperature. Make sure the rotor is balanced!
15. Label new 1.5 mL Eppendorf tubes with the sample numbers.
16. Carefully draw off the upper, DNA-containing phase with the 1000 µL pipette and transfer it into the prepared 1.5 mL Eppendorf tube. Work under the fume hood!
17. Estimate the volume of the DNA solution and add 0.6 volumes of isopropanol (RT); close the lid and mix carefully. No streaks should remain visible.
18. Leave for 1 hour at –20 °C. After consulting the course instructors, you may be able to take a break during this period.
19. Centrifuge for 15 minutes at 13,000 rpm and 10 °C. The DNA now forms a pellet at the bottom of the tube.
20. Carefully decant the supernatant into a waste container. Dab the rim of the 1.5 mL Eppendorf tube thoroughly on tissue paper.
21. Add 500 µL of 70% undenatured ethanol. Swirl the tube briefly.
22. Centrifuge for 15 minutes at 13,000 rpm and 10 °C. The DNA forms a pellet at the bottom of the tube.
23. Carefully decant the supernatant into a waste container. Dab the rim of the 1.5 mL Eppendorf tube thoroughly on tissue paper.
24. Dry the pellet in the open 1.5 mL Eppendorf tube at room temperature for 20–30 minutes. Take care that the DNA does not dry out completely.
25. Add 100 µL TE buffer (pH 8) and 1 µL RNase solution (10 mg/mL). Incubate for 1–2 hours at 37 °C until the pellet has dissolved.
26. Store the DNA at 4 °C in the refrigerator for further use.
In this procedure the cell walls are broken open first. Several substances act in the extraction buffer that is then added: CTAB is a detergent and therefore destroys cell membranes, removes polysaccharides (usually abundant in plants in particular) and denatures proteins (e.g. nucleases). EDTA blocks DNA-modifying enzymes, and MET protects against oxidative damage. Incubation at 60 °C also helps denature proteins. Chloroform is used to separate the nucleic acids from unwanted substances. The added isoamyl alcohol prevents foaming. The nucleic acids collect in the upper, aqueous phase; the denatured proteins form a whitish interface with the lower organic phase, which contains lipids among other things. Sand and fibres settle at the very bottom. The nucleic acids are precipitated with isopropanol and then cleaned of salts and other water-soluble substances by washing with 70% ethanol. The addition of RNase degrades the RNA (after all, it is DNA we want to work with).
The DNA solution keeps for several years in the refrigerator. Freeze DNA solutions only if you are sure that they will rarely be needed again. Repeated thawing and freezing destroys the DNA.
DNA extraction with the "DNeasy Plant Mini Kit" (Qiagen)
1. As above, 4 plant samples are provided for each participant. Label one screw-cap tube for each sample.
2. Weigh out 20–30 mg of each sample and place it in the tube. Carefully add one or two ceramic beads to each sample.
3. Preheat a heating block to 65 °C.
4. Warm the bottle of buffer AP1 in the water bath.
5. Grind the material to a fine powder in the ball mill. Check the degree of grinding of your sample regularly.
6. Pipette 400 µL buffer AP1 (lysis buffer) and 4 µL RNase to each sample. Mix buffer and plant material by briefly shaking the closed tube.
7. Place the samples immediately in the preheated heating block.
8. When all samples have been processed and are in the heating block, wait 30–60 minutes. During this time, shake the tubes vigorously every 5 minutes and prepare the following working steps:
9. For each sample, label 1 violet column and 1 white column in its collecting tube, as well as two 1.5 mL tubes, with the respective sample number. Label the collecting tubes as well.
10. Warm buffer AE on a heating block (65 °C).
11. After the incubation, pipette 130 µL buffer P3 (precipitation buffer) to each sample. Change the pipette tip!
12. Mix by shaking and place the samples on ice for 5 minutes.
13. Centrifuge the mixture for 5 minutes at 14,000 rpm. Make sure the rotor is balanced!
14. Pipette the supernatant onto the violet column.
15. Centrifuge for 2 minutes at 14,000 rpm. Discard the column.
16. Pipette the flow-through (without the pellet!) into one of the 1.5 mL Eppendorf tubes.
17. Add 1.5 volumes of buffer AW1 (binding buffer) to the flow-through and mix immediately by thorough pipetting up and down. Change the pipette tip! How do you determine the volume of the flow-through?
18. Transfer 650 µL of this mixture onto the white column. Change the pipette tip!
19. Centrifuge for 1 minute at 8000 rpm.
20. Discard the flow-through. Repeat steps 18 and 19 until the entire sample has passed through the column. Then discard the 1.5 mL tubes.
21. Place the white column in a new 2 mL collecting tube and pipette 500 µL buffer AW2 (wash buffer) onto the column.
22. Centrifuge for 1 minute at 8000 rpm.
23. Discard the flow-through.
24. Pipette 500 µL buffer AW2 onto the column.
25. Centrifuge for 2 minutes at 14,000 rpm.
26. Transfer the white column into the second 1.5 mL Eppendorf tube and pipette 100 µL of warm buffer AE (elution buffer) onto the column. Let the columns stand for 5 minutes at room temperature.
27. Centrifuge for 1 minute at 8000 rpm.
28. Repeat the elution with a further 50 µL buffer AE. Discard the white column and close the Eppendorf tubes. Label them on the side with species name and date.
After the last step you are now holding the extracted DNA in a buffer solution. In detail, you mechanically disrupted the material in steps 1–4, lysed the cell membranes and released the DNA in steps 5–7, removed solid components and most polysaccharides in steps 9–15, and bound the DNA to a carrier membrane in steps 16–19. In steps 20–24 you cleaned the bound material of protein residues, further polysaccharides, nucleotides and inorganic cell components, and in steps 25–27 the DNA was released from the carrier membrane again.
If problems arise during the PCR (see below), one can try eluting with water instead of elution buffer and adding 10% TE buffer afterwards. The EDTA present at higher concentration in many elution buffers binds Mg ions, which often leads to poor product yield in the PCR.
Fluorometric quantification of the DNA with the Qubit fluorometer
Many PCR reactions run better if the amount of input DNA can be set precisely. Therefore, the amount of DNA in each eluate must be determined in the next step. The required dilution can then be prepared with double-distilled water.
1. Label one 0.5 mL tube for each sample (CTAB extraction and column extraction) and for standards 1 and 2.
2. For each assay (samples and standards), 200 µL of working solution is prepared: 199 µL buffer + 1 µL dye (do not use a glass vessel); mix.
3. Dispense 190 µL of the working solution into the 0.5 mL tube for each standard, but 198 µL for each sample.
4. Then add 10 µL per standard and 2 µL per sample. Vortex thoroughly and let stand for 2 min at RT.
5. Switch on the Qubit instrument: >Choose your assay >DNA >dsDNA broad range >Read new standards >Yes.
6. Following the instructions, measure standard 1 first, then standard 2 (>Read).
7. Start the measurement of the first sample with >Insert Assaytube >Read.
8. Then choose >Calculate stock concentration and use the wheel to set >Volume of sample used to 2 µL. Note down the value in µg/mL (equivalent to ng/µL).
9. Measure the next sample with >Read next sample.
10. Calculate how you must dilute the samples to obtain 10 µL each at a concentration of 5 ng/µL. Record your results in a table.
11. Label a new 1.5 mL tube for each dilution and prepare the dilutions according to your calculations.
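The dilution in steps 10–11 follows the usual C1·V1 = C2·V2 relation. A minimal Python sketch of the calculation (the function name and the example Qubit reading are illustrative, not part of the protocol):

```python
def dilution_volumes(c_stock, c_target=5.0, v_target=10.0):
    """Volumes (in µL) of DNA extract and water needed to obtain
    v_target µL at c_target ng/µL from a stock of c_stock ng/µL,
    using C1*V1 = C2*V2."""
    v_dna = c_target * v_target / c_stock
    v_water = v_target - v_dna
    return round(v_dna, 2), round(v_water, 2)

# Example: a measured stock concentration of 25 ng/µL
print(dilution_volumes(25.0))  # -> (2.0, 8.0)
```

Note that stocks below the target concentration (here < 5 ng/µL) cannot be "diluted up"; such samples have to be used undiluted.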
The polymerase chain reaction (PCR) is a series of biochemical reactions that take place at different temperatures in a single reaction tube. Essentially, PCR simulates in the test tube the DNA replication that occurs in nature before cell division. The isolated genomic DNA is combined in a buffer solution with a DNA polymerase, Mg2+ ions and deoxyribonucleotides (dNTPs). For the polymerase to be able to use the DNA as a template and synthesize a complementary strand from the nucleotides present in the solution, the two strands of the DNA must first be separated (denatured). This happens at 90–95 °C. In addition, the enzyme needs a starting strand to which it can attach further nucleotides. This short strand is supplied in the form of two "primers" (see Fig. 1). These short oligonucleotides bind at 40–60 °C to matching, complementary DNA segments (annealing). This makes it possible to determine which parts of the DNA are to be amplified. The most commonly used polymerase comes from the bacterium Thermus aquaticus, which lives in hot springs, and is therefore heat-stable (Taq polymerase). Its optimal reaction temperature is 72 °C. The cycle of DNA denaturation, primer annealing and DNA synthesis is run through 30 to 40 times in an automated thermocycler, with the amount of the selected DNA segment doubling each time in the (never achieved) ideal case.
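The "ideal-case doubling" can be made concrete: after n cycles, a single template would yield 2^n copies, while real reactions run at lower per-cycle efficiency. A small sketch (the 80% efficiency value is an arbitrary illustration):

```python
def amplified_copies(start_copies, cycles, efficiency=1.0):
    """Expected copy number after a given number of PCR cycles.
    efficiency = 1.0 means ideal doubling per cycle, which is
    never reached in practice."""
    return start_copies * (1.0 + efficiency) ** cycles

# 35 cycles from a single template molecule:
print(f"{amplified_copies(1, 35):.2e}")       # ideal case: 2**35 copies
print(f"{amplified_copies(1, 35, 0.8):.2e}")  # at 80% per-cycle efficiency
```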
In this course a segment of the chloroplast DNA is to be amplified, namely the barcoding region of rbcL (the gene for the large subunit of the enzyme RuBisCO).
Fig. 1: Primer positions for sequencing the 5' end of rbcL.
PCR reactions can be very temperamental; small changes in the protocol often lead to complete failure. Nevertheless, quite different protocols are often used in different labs. The most common approach to optimizing PCR reactions is trial and error: once the PCR finally runs as desired, the protocol is no longer changed (or only to test whether the result can be improved further).
Multiple bands, smeared bands or missing PCR product are the most common problems. Missing PCR product, for example, can have many different causes:
- The extraction protocol failed; the eluate contains no or too little DNA (this should be ruled out after quantifying the DNA).
- Too much DNA was added (this also causes problems, but can be avoided by quantification).
- The template DNA is degraded (e.g. by repeated thawing and refreezing).
- The DNA extract contains substances (e.g. polysaccharides, EDTA) that interfere with the PCR.
- The polymerase is too old or was not stored frozen.
- One or both primers are too old or were not stored frozen.
- Some idiot mislabelled the primers.
- The primers do not match the binding sites (this can occasionally happen even with supposedly universal primers).
- An ingredient was left out of the reaction mix.
- The annealing temperature is too high.
- The thermocycler is defective.
The cause can only be tracked down by ruling out the various possibilities one by one. In case of doubt, it is advisable to ask experienced colleagues or the internet for advice.
The polymerase chain reaction is carried out in PCR tubes (0.2 mL). Although the optimal reaction temperature is 72 °C, the polymerase is already active at room temperature and begins to attach nucleotides to primers here and there. This can interfere with the PCR, which is why the reaction is set up on ice. More expensive so-called "hot start" enzymes contain antibodies that deactivate the polymerase. Preheating to 96 °C for several minutes in the thermocycler denatures the antibodies and activates the enzyme. With such enzymes one can also work at room temperature.
1. Put on gloves and thaw the following ingredients:
- distilled and autoclaved water
- Peqlab 10x reaction buffer S (blue lid) = PCR buffer
- MgCl2
- dNTP mix
- primers rbcLa_F and rbcLa_R
- trehalose
After thawing, mix each reagent by brief swirling, spin it down briefly and keep it on ice (except DMSO).
2. Label one PCR tube (0.2 mL) for each PCR reaction and for the negative control, and place it on ice.
3. The Taq polymerase remains liquid despite storage in the freezer and is transferred directly onto ice.
4. Transfer your DNA extract directly from the refrigerator onto ice.
Apart from the template DNA, your PCR reactions contain the same chemicals in the same concentrations. To save unnecessary pipetting, one therefore first prepares a so-called "master mix" containing everything except the DNA to be amplified. Each PCR reaction has a volume of 25 µL, of which 5 µL is DNA extract. Reactions of 50 µL are also common; 100 µL reactions, because of the high cost, only when large amounts of DNA are needed.
Task: The master mix contains the following components:

Component            Stock conc.   Final conc.
DNA extract          5 ng/µL       1 ng/µL
PCR buffer (blue)    10x           1x
dNTP mix             2 mM          0.32 mM
MgCl2                25 mM         2 mM
Primer rbcLa_F       10 µM         0.4 µM
Primer rbcLa_R       10 µM         0.4 µM
Trehalose            100x          0.25 M
Taq polymerase       5 U/µL        2.5 U per reaction
H2O                  –             ad 25 µL

Calculate the volumes of the individual components for one 25 µL reaction and a master mix for x 25 µL reactions. Leave room for the 5 µL of DNA extract.
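The volume of each component follows from V = c_final × V_reaction / c_stock. A minimal sketch of this calculation (the concentration pairings shown and the number of samples are example values, not prescriptions):

```python
def component_volume(c_stock, c_final, v_reaction=25.0):
    """Volume (µL) of one stock component per reaction,
    from c_stock * V = c_final * v_reaction."""
    return c_final * v_reaction / c_stock

# Example pairings (stock -> final concentration):
v_mgcl2  = component_volume(25.0, 2.0)    # MgCl2: 25 mM -> 2 mM
v_primer = component_volume(10.0, 0.4)    # primer: 10 µM -> 0.4 µM
v_dntp   = component_volume(2.0, 0.32)    # dNTPs: 2 mM -> 0.32 mM

# Master mix: one reaction per sample plus one negative control
n = 4 + 1
print(v_mgcl2, v_primer, v_dntp)              # per-reaction volumes in µL
print(v_mgcl2 * n, v_primer * n, v_dntp * n)  # master-mix volumes
```

The water volume is whatever remains after all components and the 5 µL DNA extract are accounted for ("ad 25 µL").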
1. Calculate the amounts for the master mix. Allow for 1 reaction per PCR sample plus 1 additional reaction (= negative control).
2. Place an Eppendorf tube (1.5 mL) on ice and pipette first the water, then the other ingredients together in the order given. Use a fresh pipette tip for each reagent. Mix each reagent (except the polymerase, see below!) before pipetting by pipetting up and down while stirring!
3. Pipette the polymerase last. Stir the polymerase solution only very gently, but thoroughly, with the pipette tip. The solution is very viscous; make sure that nothing sticks to the outside of the pipette tip. This enzyme is extremely expensive (0.5 mL costs about 1000 euros).
4. Mix the master mix until you no longer see any streaks. Pipette 20 µL into each reaction tube.
5. Finally, pipette 5 µL DNA extract (or the corresponding amount of H2O for the negative control) into the respective tube and close it.
6. Spin down the PCR tubes briefly (a few revolutions) and place them back on ice immediately.
7. Switch on the thermocycler (model Arktik Thermal Cycler) and start the program rbcl_pwg:
8. >Run >Carmen Jung (select with the arrow keys) >Open >rbcl_pwg (select with the arrow keys) >Start >Lidtemp 099 °C >ok >TempMode >Sample >Vesseltype >Common >Sample Volume >25 µl.
9. Open the lid, insert the samples and close the lid >Accept >Yes.
PCR program rbcl_pwg:
initial denaturation:  95 °C   4 min
35 cycles of:
    denaturation:      94 °C   30 sec
    annealing:         55 °C   60 sec
    elongation:        72 °C   60 sec
final elongation:      72 °C   10 min
hold:                   8 °C
Three important notes:
1) Keep a record of each PCR by noting, for every tube, the DNA extract, the PCR number and the amounts of all components of the reaction (best in table form as follows):

DNA extract  Buffer  MgCl2  dNTPs  Primer1  Primer2  Enzyme
5 µL         x µL    x µL   x µL   x µL     x µL     x µL

This record is very important. At the start of a new series of experiments (new organisms, new gene regions, etc.), PCR reactions almost always have to be optimized. That is only possible if the experimental conditions of every reaction are recorded.
2) DNA extracts from different organisms can no longer be distinguished by eye. You must therefore guard scrupulously against mixing up reaction tubes or numbers. So label all Eppendorf tubes cleanly and permanently, and always note the date of every reaction! Only then do you have a chance of tracking down mix-ups after the fact.
3) Work with meticulous accuracy in the "post-PCR phase"; otherwise you will ruin your relationship with your colleagues. PCR products contain DNA at a concentration increased a million-fold, which competes in the lab with genomic DNA for polymerase, primers and dNTPs. If you are unlucky, you will then obtain beautiful PCR products derived from organisms you never wanted to know anything about. The entire lab, including furniture, equipment, glass bottles, etc., then has to be cleaned and contaminated consumables thrown away.
Detection and Densitometric Quantification of PCR Products
The exciting question now is: did the PCR work or not? To answer it, you will run an agarose gel electrophoresis. The gel is stained with SYBRgreen, which binds to DNA and forms a fluorescent complex. By viewing the gel on a UV or blue-light transilluminator you can find out (1) whether any PCR product was formed at all, (2) whether a single specific product or several products of different lengths were formed, and (3) approximately how long the amplified DNA fragment is, i.e. whether the right fragment was amplified. When working with ethidium bromide or SYBRgreen you must always wear blue nitrile gloves!
The quality of the sequences depends sensitively on the amount of DNA in the sequencing reaction. Below a certain threshold the sequencing reaction fails; above it the sequences often become messy. Such poor sequence chromatograms have to be edited in nerve-racking work. Even then, many positions of the sequence often remain uncertain, which complicates the data analysis. To spare yourself this trouble, it is better to quantify the DNA. We have already quantified the genomic DNA fluorometrically. The following method yields considerably more accurate values but is also more laborious. Normally only the concentration of the purified PCR products is determined. In this course, however, the aim is to assess the quality of the genomic DNA in differently stored plant samples. The key question is how well the material is suited for PCR. Since DNA is lost during purification to an extent that is difficult to control, we quantify the unpurified PCR product here.
1. Prepare one-percent agarose gels according to the total number of PCR products. Choose suitable chamber sizes and gel trays. In each row, one well must be left free for a size marker. Please use one glass bottle per gel.
2. EXAMPLE for a medium-sized gel: weigh 1 g of agarose into a glass bottle. Careful: agarose and cocaine hardly differ in appearance or price (though presumably in effect). Add 100 mL of 1 × TAE and a magnetic stir bar, and set a lid loosely on top.
3. Heat the mixture in the microwave for 1-2 min at the highest setting.
4. Stir once on the magnetic stirrer.
5. Now continue heating in the microwave in small steps (20-30 sec, stirring in between) until the solution is completely clear. Agarose boils over within seconds (an enormous mess!). Watch the bottle closely and stop the microwave as soon as the first bubbles appear. Hot agarose solution also foams up very easily on the magnetic stirrer.
6. Once the agarose solution is clear, let it cool on the magnetic stirrer to about 60 °C (with gloves, just below the pain threshold).
7. For each gel, place the gel tray into the electrophoresis chamber rotated by 90°, so that the rubber seals rest against the walls. Insert one or two combs.
8. Wear nitrile gloves during the following steps.
9. Add 8 µL of SYBRgreen per 100 mL of agarose solution. Avoid any direct contact with the dye and dispose of the pipette tip only in the container provided for it.
10. Carefully pour the mixture into the respective gel tray without creating bubbles. Manoeuvre any bubbles to the edge of the gel with a pipette tip. Wait about 20 minutes before starting the next steps.
11. When the gel looks milky, first pull the combs out of the gel very carefully. Then lift the gel tray out of the tank and rotate it by 90°. Fill 1 × TAE buffer into the electrophoresis tank until the gel is covered. Mind the correct orientation of the gel: the DNA migrates towards the anode (red connector).
12. Cut off a strip of Parafilm and place it directly on the bench.
13. Pipette one 7 µL drop of water for each PCR product, side by side.
14. Pipette 2 µl of 6× loading buffer into each drop.
15. Finally, add 3 µL of PCR product to each drop. Mix the PCR products thoroughly beforehand and change pipette tips! Note the names and order of the PCR products on a piece of paper.
16. Pipette each drop into one well of the gel.
17. Pipette 5 µL of size marker (“Easy Ladder”) into the last well of each row.
18. Put the lid on, check the correct orientation of the gel once more and start the electrophoresis at 85 V.
19. Once the blue marker has travelled three quarters of the available distance, the gel can be viewed on the transilluminator.
20. Put on blue nitrile gloves again. Carefully lift the gel tray out of the electrophoresis tank and dry it with paper.
21. Transfer the gel onto the transilluminator in the gel documentation chamber without dripping. Carefully pour the used buffer into the corresponding container.
22. Close the door of the chamber and switch on the device and the UV illumination (“Transillumination”). Work quickly now, as UV light damages the DNA and can prevent successful sequencing of the PCR product.
23. Start the program “Herolab EasyWin32” (on the desktop), accept “wie letzte Benutzung” (as last used) and click >Datei >Kamera aktivieren in the menu bar.
24. Zoom and focus the camera so that the whole gel and all bands are sharply visible.
25. Take a picture by right-clicking on the live image, save it, print a photo of the gel and glue it under your PCR record as documentation.
26. Once you have saved the image of the gel, select >Auswertung >Modul C Spot Einzelvolumenanalyse in the main menu.
27. A cross appears in place of the mouse pointer. Place the cross on the upper left corner of the reference band of the size marker (500 bp), drag the frame that appears to the lower right and fix it with a left mouse click. The frame should be slightly larger than the band.
28. Mark all other bands in the same way. A right mouse click ends the placing of frames.
29. A double left click inside any frame opens the table with the measurement results.
30. To quantify the individual DNA bands we use the 500 bp band of the size marker, whose concentration is known (50 ng of DNA in 5 µL of marker). Mark the row of the reference band in the table and open the context menu with the right mouse button: >Volumen zuweisen. Tick >Verwenden, enter 50 under >Neu, then >Bestätigen.
31. The column “Vol Kalib” of the table now contains the DNA concentration per 2 µL of PCR product.
32. Calculate the concentration per 1 µL and how much PCR product you would have to use in the sequencing reaction to obtain 20 ng of DNA.
33. Switch off the UV light and dispose of the gel in the appropriate container. Clean the surface of the transilluminator with distilled water and paper towels.
34. Dispose of your blue nitrile gloves in the hazardous waste.
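Steps 31 and 32 amount to a small unit conversion. A sketch (the “Vol Kalib” reading of 12.5 ng is an invented example value; the software reports ng per 2 µL of PCR product):

```python
# Convert a densitometric "Vol Kalib" reading (ng of DNA per 2 uL of
# PCR product) into a per-uL concentration and the volume needed to
# load 20 ng into a sequencing reaction. 12.5 ng is a made-up example.

vol_kalib_ng_per_2ul = 12.5
target_ng = 20.0                                   # DNA wanted in the reaction

conc_ng_per_ul = vol_kalib_ng_per_2ul / 2          # ng per 1 uL
volume_needed_ul = target_ng / conc_ng_per_ul      # uL of PCR product to use

print(f"{conc_ng_per_ul} ng/uL, use {volume_needed_ul} uL")  # -> 6.25 ng/uL, use 3.2 uL
```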
The amplified DNA of a successful reaction appears as a more or less broad band on the gel. These bands should be clearly visible and sharply delimited, and the fragments should match the expected product in length. The gel lane of the negative control should be black. If you find a band in the negative control that matches one of the bands in your PCR reactions, your reactions may be contaminated. Contaminated reagents are usually the cause; such impurities arise, for example, when pipette tips were not changed and genomic DNA was carried over. In this case the PCR has to be repeated. In the worst case you are dealing with lab contamination (see above).
Purification of PCR Products
We will not carry out the purification of the PCR products or the sequencing in this course. Residues of the PCR reaction, especially polymerase and unused primers, have to be removed before the actual sequencing reaction can be set up. In the most thorough procedure, the entire PCR product is separated on an agarose gel and the bands are cut out. The electrophoresis separates the PCR product from the other components of the PCR reaction. The excised band containing the desired DNA is released from the agarose in an enzymatic reaction. The further purification steps work on the same principle as DNA isolation kits: the DNA (PCR product) is bound to a membrane, washed and eluted with water or buffer.
DNA sequencing is nowadays routinely carried out by central facilities at universities or by commercial laboratories. Nevertheless, try to complete task 5.
The aim of the first PCR reaction was to amplify a specific region of the genomic DNA exponentially so that it could later be sequenced. In the sequencing reaction, this highly concentrated template DNA is no longer amplified exponentially. Only one primer is therefore added, so that the polymerase can synthesize only one of the two DNA strands. Besides ordinary dNTPs, the sequencing reaction contains a small proportion of dideoxyribonucleotides (ddNTPs). Usually, ready-made mixtures are used (so-called terminator kits, some with flowery names) offered by the manufacturers of automated sequencers. A different fluorescent dye is bound to each of the four different ddNTPs. Whenever such a dd-nucleotide happens to be incorporated, the DNA strand cannot be extended any further and the reaction stops. At the end of many reaction cycles one thus obtains a mixture of fluorescently labelled DNA strands of all different lengths, in which the last nucleotide reveals itself by its specific fluorescence.
In the last step, these many hundreds of different DNA molecules are separated by size by electrophoresis on a polyacrylamide gel. In the automated sequencers used for this purpose, two diode lasers at the end of the gel excite the passing DNA molecules to fluoresce. This fluorescence is recorded by a detector whose signals are stored as a computer file in the form of a sequence chromatogram.
Such chromatograms are readable only over a limited length, at best about 1000 base pairs (bp). The sequenced gene regions, however, are often longer, so for this reason alone both strands have to be sequenced. To be on the safe side, additional internal sequencing primers can be used.
4. Assembling Datasets
Warning: Phylogenetic analyses create a confusing number of computer files. Different programs require different input and create output files. Your tutor will show you where you can find
input files. Once you start working with these, you will be the only person in control. If you do
not want to get lost, store your data in separate folders for each analysis. Name your files and
folders so that the names tell you something about their content (not “Kevin1995/tree”). We are
leaving the cosy world of Apple and Microsoft. Avoid special characters and blank spaces in file
names. Many programs will return cryptic error messages instead of reading these files. Underscores are usually ok. If possible, restrict file names to eight characters (plus file extension).
If you want to be completely safe, write a log of your analyses with file names and short notes
on the analyses you carried out. You are right, we are pedantic bean counters. But we have
given this course 15 times now, and there was always at least one student who could not find
trees or other files at the end of the course.
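The naming rules above can be enforced with a small helper; `safe_name` and its truncation length are our own illustration, not part of any course software:

```python
import re

def safe_name(name, max_len=8):
    """Replace blanks and special characters with underscores and
    truncate to max_len characters (file extension excluded)."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    return cleaned[:max_len]

print(safe_name("rbcL tree #3"))  # -> rbcL_tre
```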
Before you can analyze your data, you have to assemble single sequences into data files that can be read by computer programs. We will carry out these steps using the program “Geneious”. To get started, we will familiarize ourselves with the graphical interface of Geneious.
1. Open the program by double-clicking on its symbol. If this symbol is not on your desktop, search for it in the Start menu. The user interface of Geneious resembles an email client. The header contains a number of drop-down menus. Below it you find four windows. The one on the left shows the file structure. On the right you see a help window. The upper half of the central window contains a few sequence files. If you mark one of these files, a graphic representation of the DNA or protein sequence appears in the lower window, together with additional menus that allow you to change and edit the graphic.
2. Double-clicking a file will open the same view in a separate window.
3. Select one of the files. You have 15 minutes to play around with it and make yourself familiar with different options. After that you will be asked to present one of these options to
your fellow students.
4. After the presentation we start by creating a number of folders in which you can store
your data. A right mouse-click on >Local>New Folder will create a new folder. Name this
folder “Phylo2016”.
5. Within this folder create two subfolders “Sequences” and “Alignments”.
6. Mark the subfolder “Sequences” with a left mouse-click and select >File>Import>From File.
7. Click on >go to directory. A window will open, in which you can navigate to the central
course folder (ask your tutor!). Sequence files carry the extension *.ab1. Mark the ones
reserved for you (ask your tutor!) and copy them into your folder by selecting >Import.
Editing Sequences
So far you have only checked whether PCR products of the expected length were formed in the reactions. We assumed that they represented the gene loci we wanted to sequence and hoped that the sequences would be readable. We must now check this assumption. Automatic DNA sequencers create so-called “trace files” that contain a succession of coloured peaks. Each colour stands for one nucleotide. Fig. 2 shows a detail of such a trace file.
Fig. 2: Perfect sequence chromatogram: clear peaks, almost no noise.
Small errors in the lab can lead to a loss of quality of the trace files (Fig. 3). The quality of chromatograms is often somewhere in between these extremes: single positions are unclear, but most are recognizable and can be checked and corrected by hand. This procedure is called “sequence editing”. Both ends of a DNA sequence file are often rather difficult to interpret. At the beginning, primer residues may cause a lot of noise; near the end, the signal strength decreases. Sequencing both strands of a DNA fragment allows one to complement or correct uncertain positions by comparison with the other strand.
Fig. 3: Almost unreadable sequence chromatogram: few clear peaks, lots of noise.
To enhance the data quality, long DNA fragments are often sequenced using additional internal primers. For each sequencing reaction we obtain one sequence file. The complete sequence is assembled by putting these files together, correcting errors and saving a so-called “consensus sequence”.
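The consensus idea can be illustrated with a toy majority-vote sketch; real assemblers such as the one in Geneious also weight bases by their quality scores, which this sketch ignores:

```python
from collections import Counter

def consensus(reads):
    """Majority-vote consensus of equal-length, already aligned reads.
    Positions without a clear majority are reported as 'N'."""
    out = []
    for col in zip(*reads):
        base, count = Counter(col).most_common(1)[0]
        out.append(base if count > len(col) / 2 else "N")
    return "".join(out)

reads = ["ACGTACGT",
         "ACGTACGA",
         "ACCTACGT"]
print(consensus(reads))  # -> ACGTACGT
```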
8. We first combine the different trace files in a so-called “contig”. The electropherograms
belonging to the same marker and species are marked by a mouse-click (press the shift
key to mark several files at the same time). These files appear in the bottom window as
light blue lines. Regions of poor quality at the start and end of the files are marked by a
red line.
9. In the header click >Assembly. A window with different options will open.
10. We keep the default settings, define a name for our contig under >Assembly name and click >OK. The chromatograms are assembled and appear as sorted, overlapping lines in the lower window. On top of the chromatograms you will find a bold black bar (“consensus”). This is the total sequence that has been assembled from the single sequence files. In the upper window a new file named “contig” or “contig [no.]” has been created.
11. Enlarge the view by clicking on the magnifier symbol on the right or by clicking the corresponding button. You can now see the letter symbols for the nucleotides in the consensus sequence and, below it, a green bar of unequal thickness that shows how many single sequence files support the consensus sequence (“coverage”). Below this you see the single sequence reads and their nucleotide sequence. The name of each chromatogram is annotated “FWD” if it is displayed in forward direction or “REV” if it has been reversed.
12. Alternatively you can select different options for the assembly of sequence reads, for example under “Sensitivity” >Highest Sensitivity, >Do not trim. >Create subfolder creates a new subfolder in which the contig is saved.
13. Remember that a double-click on the contig in the upper window opens the file in a new window.
14. The header of the new window allows you to change between different views of the sequence. The most important is the one that is currently open (“Contig View”). But you can also view the sequence reads and consensus as plain text or as a virtual gel picture.
15. Under “Colours” you can choose between different colour settings for nucleotides. “Quality” shows clearly readable bases in light blue; the poorer the quality, the darker the blue. “Clustal” shows nucleotides in different colours, which can be adjusted.
16. Under >Graphs>Options it is possible to remove the chromatograms by clicking on “Show Graphs”. >Sequence Logo brings forth a psychedelic succession of letters, in which the size of each letter is proportional to the quality of its base position. Play around a bit and explain one of the options to your fellow students afterwards.
After relaxing a bit in this way, we can now start with some real work, controlling and correcting
sequence reads and the consensus.
17. Under >Options (right of “Highlighting”) click >Highlighting and select “Disagreement to Consensus”, followed by >Go to next disagreement or Ctrl+D. The cursor will move to the next position with an ambiguous signal, if there are any.
18. Inspect this position. Which base is correct in this place?
19. In the second header of the window you find a button “Allow editing”. Selecting it allows you to write into the succession of nucleotides as in a text file. You can insert, delete or change letters. Each change to the original sequence is marked below the sequence: a yellow mark indicates a change, a red one a deletion and a green one an insertion.
20. Correct the whole contig in this way. If you cannot decide which nucleotide is correct in a
certain position write “N” into the file.
21. The sequencing machine has already trimmed the unclear regions at the beginning and
end of sequence reads. These parts of the sequence are still present but underlined with
a red bar. They were not used to calculate the consensus. If you are not satisfied with
this trimming, e.g. because the sequencer or Geneious trimmed too much of the sequence, you may change the settings. Select >Annotate & Predict>Trim Ends. A new
window will open.
22. You can now change the “Error Probability Limit” if you like. Values smaller than 0.05
(the default) cause Geneious to be more strict and trim larger parts of the sequence.
Larger values have the opposite effect. You may also perform this step after you have
created an alignment (see below).
23. To store the edited consensus sequence mark the word “consensus” left of the sequence.
24. Select >Extract from the top menu. A new window “Extract” will open. Enter a name into
the field “Extraction name” and click on >OK. The window with the contig will close and a
new file with this name will appear in the upper window.
The next step is to make sure that your sequences really belong to the organisms you want to
study (and not to you, your lab colleagues or their organisms).
BLAST-Search in Genbank
Countless (phylo)genetic projects have generated millions of DNA and protein sequences over the last decades. Most of these sequences are deposited in three large-scale databases run by the National Institutes of Health of the USA (NIH), the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan. These databases are linked to each other and updated every 24 hours. All well-respected scientific journals nowadays require their authors to submit their sequence data to one of these databases before publishing their results. Any other researcher may then use these data for their own research. Systematists use them in two different ways. They include already published sequences in their own data sets to save time and money. Or they use a search algorithm called BLAST (Basic Local Alignment Search Tool) to screen Genbank for sequences similar to their own. In this way you can find out whether your sequence data belong to the organisms you wish to study and not to contaminating bacteria, fungi or your colleague's organisms. The following BLAST search can be performed directly on the BLAST web page by copy-pasting the sequence into a search field. But you can also do this from within Geneious.
1. Mark one of your sequences and select >Sequence Search in the top menu. A new window opens.
2. Adjust the number under “Maximum Hits” to 20.
3. A new subfolder is created within your folder. The upper window now shows the progress
of the search.
4. After the search has finished a list of most similar sequences and some additional information appears in the window. Let us first check to which group of organisms our sequence belongs. This information is shown in the column “Organism“.
5. High similarity between our sequence and a sequence in Genbank usually means that
the organisms are closely related to each other. If the most similar sequences belong to
mouse or human, you have not sequenced a plant.
6. If you want to learn more about the most similar sequences you have to look at the full
Genbank entry. Mark one or more sequences from the list and click on >Download Full
Sequence(s) in the header of the lower window. No new files will appear, but the DNA
symbol in front of the Genbank accession number will take on a more vivid colour.
7. Mark one of the full Genbank entries and select >Text View in the header of the lower
window. You will see your own “query” sequence and Genbank sequence side by side.
The line between both rows of letters shows similarities between both sequences.
8. Scroll further down and you can look at the details of the Genbank entry. Before the description there is a link, which opens the Genbank entry in a web browser. The description starts with a unique accession number, the length of the sequence, the submission date, the gene locus, the name and systematic position of the organism, the name of the person who submitted the sequence and a literature reference. The gene locus is described in more detail under “Features”. Finally, the sequence itself follows.
9. The BLAST function in Geneious offers a lot of gimmicks. >Alignment View in the lower window shows a consensus sequence with all nucleotides occurring in both sequences. Below the two sequences you see one or more coloured bars with annotations of the Genbank sequence. This function can be helpful when you want to annotate your own sequences.
10. Just for fun, mark one of your sequences and select >Tools>Linnaeus Search... While
the search is progressing you may watch Carl von Linné making faces.
11. Finally a new file with Linnaeus Blast results will appear. Open it by double-clicking.
>Query Sequence View shows your sequence aligned (see below) with the most similar
Genbank sequences and a phylogenetic tree on the left side. What do you see when you
select >Linnaeus View?
The next step is to assemble the single sequences into a data set that can be used in phylogenetic analyses, a so-called “alignment”. Your task is to analyze the phylogenetic relationships of different groups of angiosperms and gymnosperms. You will receive DNA sequences of ca. 20 taxa. A single gene locus often contains too little information to resolve phylogenetic relationships sufficiently. We will therefore work with three gene loci: the two chloroplast genes maturase K (matK) and the large subunit of RuBisCO (rbcL), as well as the alpha-subunit of the mitochondrial ATP synthase (atp1). These data were compiled from Genbank.
Aligning your Sequences
Phylogenetic analyses are largely based on similarities between gene sequences. These similarities can be “measured“ in different ways. For example, one can count the number of nucleotides that differ between sequences. For this purpose you first have to find out which nucleotides in your sequences occupy similar positions. Put simply, you cannot just place the sequences side by side randomly and count the differences, you have to compare homologous
nucleotides (Box 2). The problem is that nucleotides are not only replaced by mutations but
sometimes also get lost (deletions) or are added (insertions). Before analysing your data you
need to bring all sequences to equal length and shift homologous nucleotides into their correct
position by introducing “gaps” into your dataset. This procedure is called “aligning”, the result is
an “alignment” that serves as the basis of your analyses.
The difficulty is to find the exact positions of insertions and deletions (“indels”) and to introduce gaps in the right places. Even a single pair of sequences can usually be aligned in several different ways. As a purely illustrative example, the two alignments

AAGA          AAGA
A-GA          -AGA

each contain three matching positions and one gap, so neither is obviously better than the other.
Data sets with many species and variable sequences in particular cannot be aligned by manually shifting sequences back and forth. This would be extremely tedious and lead to data sets that are no longer reproducible. If we want a computer to do the job, we must define a
criterion that allows the program to decide which alignment is the best. The most commonly
used programs treat indels and substitutions as two evolutionary events that are both “punished” with penalties. The program then tries to find the solution with the smallest total penalty.
The question then remains how to weight indels and substitutions in relation to each other. Different weighting schemes often result in different alignments. This is unsatisfactory because all
analyses are based on the assumption that the alignment positions are homologous and two
different positions of the same alignment cannot be homologous. Different solutions to the problem have been suggested. The most commonly used nowadays is to remove all positions with
uncertain alignment from the final data set. This may lead to substantial loss of data, but is the
only way to create reliable and reproducible results.
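The penalty idea can be made concrete with a minimal global-alignment score in the style of Needleman-Wunsch; the penalty values (match 0, mismatch 1, gap 2) are arbitrary example choices, and real aligners use tuned substitution matrices and affine gap costs:

```python
def alignment_score(a, b, match=0, mismatch=1, gap=2):
    """Smallest total penalty for aligning sequences a and b
    (dynamic programming; lower is better)."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # match or substitution
                           dp[i - 1][j] + gap,      # gap in b
                           dp[i][j - 1] + gap)      # gap in a
    return dp[-1][-1]

print(alignment_score("AAGA", "AGA"))  # one gap -> penalty 2
```

Note that with these penalties the two alternative gap placements for such sequences score equally; the program's weighting scheme decides which one is reported.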
We will offer two options in the course: (1) aligning sequences in Geneious without eliminating unreliable alignment positions, and (2) using the GUIDANCE web interface to obtain an optimal alignment together with reliability scores.
1. Mark the folder “Alignment” in the left window.
2. >File >Import >From File
3. Select the folder and import the sequence files that you are supposed to analyze during
the course (ask your tutor!) by marking them and pressing >OK.
4. The files will appear in the upper window.
5. Mark one of the files and select >Alignment in the top menu. A new window opens.
6. You can choose among different options, e.g. alignment methods. Select >MuscleAlignment.
Muscle is a fast program for aligning DNA sequences. The alignment is performed in several steps: (1) The program compares all sequences pairwise, but does not create pairwise alignments. Instead, it calculates pairwise differences between sequences based on the number of identical short sequence parts. This is much faster than calculating pairwise alignments. (2)
Based on the genetic differences a preliminary phylogenetic tree is calculated. (3) Muscle then
aligns the two most closely related sequences of this tree. From there it works itself through the
whole dataset aligning with more and more dissimilar sequences until all sequences are combined in a single alignment. (4) Based on this alignment the program calculates more exact
pairwise distances among sequences. (5) These are used to calculate a new tree which (6)
serves as a basis for a new alignment. Steps 4 to 6 are repeated until the alignment does not
change any more. Muscle is particularly well suited to align large data sets with up to a few
thousand sequences.
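Step (1) — estimating distances from shared short subsequences rather than full pairwise alignments — can be sketched as follows; the k-mer size and the distance formula are simplified illustrations, not Muscle's actual internals:

```python
def kmers(seq, k=3):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_distance(a, b, k=3):
    """Fraction of k-mers NOT shared between two sequences
    (0.0 = identical k-mer sets, 1.0 = nothing in common)."""
    ka, kb = kmers(a, k), kmers(b, k)
    return 1 - len(ka & kb) / len(ka | kb)

print(kmer_distance("ACGTACGT", "ACGTACGT"))  # -> 0.0
```

Comparing k-mer sets needs only a single pass over each sequence, which is why this is so much faster than computing a full dynamic-programming alignment for every pair.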
1. You may again select among different options. For example, “Maximum number of iterations” is set to “8”. This value is the maximum number of times that Muscle will repeat
steps 4-6 (above) and may be changed. Usually Muscle needs less than 8 iterations to
reach an optimal alignment.
2. You can access more options under >More Options (surprise!).
3. Click >OK to start the program.
4. A new file appears in the upper window named “Nucleotide alignment [no.]”. Assign a
new name to this file (works the same as re-naming files in Windows Explorer).
5. Open the alignment in a new window.
6. >Allow Editing allows you to change and “improve” the alignment manually. Only use
this option when you detect obvious errors in the alignment. This is rarely the case with
Muscle. It is, however, possible that you made a mistake while editing sequences. This is
the time to detect such errors and correct them by going back to your original sequences
and checking them once more! Of course this is not possible with Genbank sequences.
7. If you want to change something you can do this by simply writing into the alignment.
8. Watch out! Removing or inserting gaps shifts the whole sequence left or right and will
destroy the alignment. Check your alignment carefully at each step.
9. You can delete or change a whole column by writing into the consensus sequence.
10. Undo changes with Ctrl+Z.
11. After finishing your final editing you must apply two changes to the file. The beginning and end of alignments usually contain large numbers of gaps, because the sequences are of unequal length. These are not deletions but missing data. Replace these gaps with “N” (but not the gaps within the sequences!). You may also decide to delete the start and end of the alignment where it is covered by only a few (less than 50 %) of the sequences. Mark these parts in the consensus and use the Delete key.
12. Save the final alignment by >File >save.
13. Repeat these steps for the remaining two datasets.
14. Phylogeny programs use different input file formats. You must convert and save your files
in these formats. Select >File >Export >Selected Documents.
15. Next select “Nexus (*.nex)” in the pop-up window and press >OK.
16. In the next pop-up window (“Select export location”) choose a name for your file and save
it in one of your folders. Press >OK, ignore the warning and >Proceed.
17. Now export the alignments into PHYLIP and FASTA format.
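The replacement of terminal gaps by “N” described in step 11 can be sketched for a single aligned sequence like this (a toy helper, not a Geneious function):

```python
def terminal_gaps_to_n(seq):
    """Replace leading/trailing '-' (missing data) with 'N';
    internal gaps (real indels) are left untouched."""
    if set(seq) <= {"-"}:                  # sequence is empty or all gaps
        return "N" * len(seq)
    lead = len(seq) - len(seq.lstrip("-"))
    trail = len(seq) - len(seq.rstrip("-"))
    return "N" * lead + seq[lead:len(seq) - trail] + "N" * trail

print(terminal_gaps_to_n("--AC-GT--"))  # -> NNAC-GTNN
```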
Removing Unreliable Alignment Positions Using the GUIDANCE Web Server
Next we want to check how reliable our alignment is. There are different ways of detecting and removing unreliable data. For us, the easiest is the GUIDANCE web server. GUIDANCE uses a bootstrap method (see below) to create different guide trees. These trees are used as a basis for aligning the sequences. GUIDANCE then checks how different the resulting alignments are. The more similar a certain alignment position is across the different alignments, the more reliable it is. GUIDANCE calculates reliability scores for every alignment position and allows us to remove positions below a certain threshold.
1. Open the GUIDANCE web page in your web browser.
2. Upload the unaligned sequences in FASTA format.
3. Select a title for your job, tick the box “Ich bin kein Roboter” (“I am not a robot”) and then press “Submit”.
4. A new window will open telling you about the progress of the analysis. It is usually finished within 20-30 minutes.
5. Once the job is finished, the results and output files are displayed in the same window.
6. First look at “MSA colored according to the confidence score”. What does the graphic tell you?
7. Next look at the different output files. You have 15 minutes, after which you will be asked
to explain one of the output files to your fellow students.
8. Finally remove columns below a threshold of 0.93, copy and save the resulting alignment
as a FASTA file.
9. For a serious publication you should also save the list of removed columns to report under “Material and Methods”.
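GUIDANCE removes the columns for you, but the filtering step itself is easy to reproduce. A minimal Python sketch, assuming you already have one reliability score per alignment column (the sequences and scores below are invented):

```python
def drop_unreliable_columns(aligned_seqs, col_scores, threshold=0.93):
    """Keep only alignment columns whose reliability score reaches the
    threshold. aligned_seqs: dict mapping name -> aligned sequence."""
    keep = [i for i, score in enumerate(col_scores) if score >= threshold]
    return {name: ''.join(seq[i] for i in keep)
            for name, seq in aligned_seqs.items()}

aln = {'sp1': 'ATG-CA', 'sp2': 'ATGGCA'}
scores = [1.0, 0.99, 0.40, 0.95, 0.97, 0.88]  # invented per-column scores
print(drop_unreliable_columns(aln, scores))   # columns 3 and 6 removed
```

The list `keep` is also exactly what you would report under “Material and Methods” (step 9), only inverted: the removed columns are those not in it.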
Concatenating Multigene Data Sets in Geneious
The information contained in a single gene locus is often insufficient to reliably infer the phylogenetic relationships of species. This is particularly true for datasets that combine many distantly related species with closely related ones. Data from additional gene loci may increase the statistical support of your trees (see below). Moreover, the phylogeny of a single gene might not accurately reflect the phylogeny of species that have only recently evolved. Conclusions about the relationships among closely related species should therefore always be based on the analysis of several gene loci.
For the following analyses we will therefore combine the three alignments into a single data set.
We will “concatenate” the alignments. This is a simple procedure in Geneious if all taxa have
exactly the same name in all alignments. It can be a nightmare if working with large data sets in
which sequences of the same species have different names in different files.
1. Import the GUIDANCE alignments of matK, rbcL and atp1 into the folder “Alignment”
in the left window.
2. Make sure that the sequences from the same species have exactly the same name in
all three alignments. This is usually not the case if sequence names contain Genbank
accession numbers. If you remove numbers at this stage, you must keep a table with
Genbank accession numbers for the different species and loci to cite in your publication.
3. Mark the three alignments in the upper window and select >Tools>Concatenate Sequences or Alignments...
4. A new window opens in which you can define the order of gene loci in your final data
set. >OK produces a new file with the name “Concatenated Alignments [no.]”.
5. Open this file in a new window. Usually not all sequences were present in all data
sets, because some are missing in Genbank. You must fill the respective gaps with “N”s (missing data).
6. Re-name the file so that you can still recognize it after a few days. You will need it
frequently during the course.
7. Save it in NEXUS and PHYLIP format.
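The concatenation and padding logic of steps 3-5 can be sketched in a few lines of Python (alignments as dictionaries; sequences, loci and species names invented):

```python
def concatenate(alignments):
    """Concatenate several alignments (dicts mapping name -> aligned
    sequence). Species missing from one locus are padded with 'N'
    (missing data), as described in step 5 above."""
    names = sorted(set().union(*alignments))
    out = {n: '' for n in names}
    for aln in alignments:
        length = len(next(iter(aln.values())))  # all rows equally long
        for n in names:
            out[n] += aln.get(n, 'N' * length)
    return out

matk = {'spA': 'ATGC', 'spB': 'ATGG'}   # invented mini-alignments
rbcl = {'spA': 'TTAA', 'spC': 'TTAC'}
print(concatenate([matk, rbcl]))
```

Note that this only works if species names match exactly across loci, the same requirement Geneious imposes in step 2.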
5. Data Analysis: General Considerations
Using DNA sequence data we want to answer the question how closely or distantly related different species are to each other. Generating data is an important first step in the empirical sciences. However, analysing these data is equally important. It is only after data analysis that a
scientist knows whether a particular hypothesis is supported or rejected or which of the many
possible hypotheses fits the data best. Molecular systematists and evolutionary biologists use
analytical methods that are widespread in the sciences. However, the application of these
methods in systematics is often not easy to understand. The reason for this is not only the
widespread dread of mathematics among biologists. The questions put forward by systematists
are often more complicated to answer than those in physics or chemistry. Consider the following
example to illustrate this: In 1919 the astrophysicist Arthur Stanley Eddington travelled to the
island of Príncipe in the Gulf of Guinea. During a total solar eclipse he gathered data to test certain conclusions from the theory of relativity. Ultimately, he wanted to know whether these data
supported Newtonian or Einsteinian physics. Eddington only had to deal with two theories and
three possible outcomes. The data could support either the theory of relativity or Newtonian
physics or it could contradict both theories.
Which hypotheses does a systematist test with molecular data? The surprising answer is: the
phylogenetic trees. But are these trees not simply “calculated” from the data? Using the three
methods of tree reconstruction treated in this course the trees do not follow unambiguously from the data. Instead, we have to calculate how well the data fit a certain tree. Every possible phylogenetic tree is a hypothesis of phylogenetic relationships and we have to identify the tree that is
best supported by our data. Unfortunately, the number of possible phylogenetic trees gets unmanageably large even for small datasets with few species (see Box 1 on pages 6-7). Unlike
Eddington, we cannot just make a simple yes-no decision, we have to decide among a myriad
of possible hypotheses.
The way in which scientists test hypotheses and theories poses another difficulty. Usually, certain “expected” observations are deduced from a theory. The next step is to test whether the
experimental observations are in line with the “theoretical” ones. For example, according to the
theory of relativity the light does not follow an exactly straight path but is bent by gravity fields.
This is exactly what Eddington observed during the eclipse: the light of stars that were optically
close to the sun was deflected by the sun’s gravity field. But which theoretical and testable observations can be deduced from a phylogenetic tree? Even entirely absurd trees, in which for
example all flying animals are united in a single group, cannot be rejected entirely by our data,
even if they fit the data much worse than trees in which birds, bats and insects form different groups.
6. Phylogenetic Trees and Methods
The question most commonly asked by systematists is how organisms are related to each other. According to the theory of evolution all extant organisms have descended from a single
common ancestor in the distant past and are therefore related to each other. Because the DNA
is replicated and passed on from one generation to the next all homologous DNA sequences
are similarly related to each other. As are protein sequences because their amino acid sequence follows directly from the DNA sequence of their coding gene. The mechanism of DNA
replication is not error-free. Every now and then copying errors occur and one nucleotide is replaced by another, a process called mutation. If these mutations are not lethal they can be
passed on to the next generation. By the actions of genetic drift and selection, some mutations
will eventually occur in all individuals of a species and are then called substitutions. Mutation is
a random process. The longer two sequences have existed independently from each other the
more substitutions will have occurred in both of them. The property of DNA and protein sequences to accumulate genetic differences over time can be used to reconstruct the evolutionary history and relationships of organisms. As a rule of thumb one can say that two highly similar individuals or species are more closely related to each other than two very different ones.
In systematics and taxonomy, relationships are commonly illustrated by genealogical or phylogenetic trees. As stated above, such a tree is a model or a hypothesis of the relationships
among the studied species. Before we care about the methods with which we can turn DNA sequence data into evolutionary trees we will briefly discuss some properties of phylogenetic trees.
Box 1: Phylogenetic trees
A phylogenetic tree illustrates the evolutionary relationships among organisms. It is a graphical representation of the hypothesis: “The species in this data set could be related to each other in this particular way.”
A phylogenetic tree consists of branches (or “edges” in the language of mathematicians) which are
joined at nodes. In the following example two internal nodes are denoted as X and Y. The organisms
(or better their DNA sequences) are found at the tip of branches and are also called terminal nodes
(here called A-L). The particular way in which branches and nodes are tied together is called the topology of the tree. In this example K and L are more closely related to each other than H and L. The length
of the branches is proportional to the number of substitutions between one node and the next. The
scale bar below the tree shows the branch length for 0.1 substitutions per nucleotide position.
Most phylogenetic trees are dichotomously branched or bifurcating. This means that exactly two
branches arise from each ancestral node. If more than two branches arise from a single node this is
called a polytomy. The branches of a phylogenetic tree can be rotated around their respective nodes
without changing the information content of the tree, pretty much like the parts of a mobile. In the tree
above, D is not more closely related to E-F than C. You are free to rotate them around node Y, for example, if this makes it easier for you to describe the relationships. Phylogenetic trees may have a root or
not. A root confers additional information to a tree. It tells you from where the evolution of a group took
its origin. In principle, you can fix a root to any branch of an unrooted tree. In practice you usually include
species in your dataset that are closely related to the group of interest but not part of it. This “outgroup”
will automatically provide a root for the “ingroup”.
The number of possible trees increases with the number of taxa included in the dataset and can become
incredibly large. For n taxa it can be calculated as:
Un = (2n−5) · (2n−7) · … · 3 · 1 for unrooted trees and
Rn = (2n−3) · (2n−5) · … · 3 · 1 for rooted trees.
For 20 species, R20 = 8 200 794 532 637 891 559 375; for 100 species there are more possible trees
than atoms in the universe.
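Both formulas are easy to check with a few lines of Python; R20 below reproduces the number given above:

```python
def n_unrooted(n):
    """Number of unrooted bifurcating trees for n taxa:
    (2n-5) * (2n-7) * ... * 3 * 1."""
    total, k = 1, 2 * n - 5
    while k > 1:
        total *= k
        k -= 2
    return total

def n_rooted(n):
    # a rooted tree for n taxa corresponds to an unrooted tree for
    # n + 1 taxa: the root behaves like one extra tip
    return n_unrooted(n + 1)

print(n_unrooted(4))  # 3
print(n_rooted(20))   # 8200794532637891559375
```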
A convenient way to summarize trees for computer programs is the Newick format. In this format relationships are expressed as brackets around taxon names. In Newick format the tree above looks like
this: (((((I,(J,(K,L))),(G,H)),((E,F),(C,D))),B),A). Additional information, such as branch lengths, can be included after a colon behind taxon names. For example: (((((I:0.25,(J:0.11,(K:0.03,L:0.06))),(…
A group is called monophyletic when all its members can be traced back to the same common ancestor
and this ancestor has not produced any descendants that are not a member of this group (in this example the red group). All members of the yellow group also descend from a single ancestor, but in contrast
to the red group, this ancestor has a descendant that is not a member of the yellow group. Such groups
are called paraphyletic. A typical example for a paraphyly are reptiles, because birds and mammals
also evolved from reptilian ancestors. The members of the black group have different ancestors. This
group is called polyphyletic. Of course, if we go further back in time we will also find a common ancestor for all members of the black group (in this case the root of the tree). The distinction between para- and polyphyly is therefore somewhat blurred.
Which criteria shall we use to find the phylogenetic tree that best represents the evolution of our
group? In order to find the „best“ among all possible trees we first need a quantity that tells us
how „good“ a tree is in relation to others. We call this an optimization criterion that in one way or
the other tells us how well our data fits to a certain evolutionary hypothesis. In this course we
will learn about three such criteria: (1) the principle of maximum parsimony (MP), (2) the concept of maximum likelihood (ML) and (3) the concept of posterior probability that forms the core
of Bayesian statistics.
(1) The principle of maximum parsimony (MP) is the simplest of these criteria. Its philosophical
foundation dates back to the medieval friar William of Ockham. Ockham argued that the theory
that explains a phenomenon with the smallest number of prior assumptions should be preferred
over all other explanations. In the case of phylogenies this means that the tree that requires the
smallest number of substitutions to explain the genetic differences between the taxa is the pre-
ferred evolutionary hypothesis. MP is not so much a statistical method but rather founded on
philosophical principles.
In order to determine the minimal number of substitutions for a particular tree one has to insert
all possible combinations of nucleotides at the internal nodes, then count how many substitutions this would require and do this for all nucleotide positions of the dataset and for all possible
trees. Fortunately, Walter Fitch (1971) invented a method that greatly facilitates this process,
particularly for computers. Starting from the terminal nodes one moves to the next internal node
and forms the intersecting set of the nucleotides at the superior nodes. If this set is empty (because the two superior nodes have different nucleotides or sets of nucleotides like in the two
nodes “1” below) one forms the set union instead. Every time this happens one counts a substitution event. All one has to do now is to add these numbers over all nodes and nucleotide positions of the dataset, do this for all trees and choose the one with the smallest grand total of substitutions.
[Figure: example of Fitch’s algorithm, showing intersecting sets and set unions at the internal nodes of a tree.]
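Fitch’s counting scheme is short enough to sketch in Python. Trees are written as nested 2-tuples, and each tip carries the nucleotide of one alignment position (the quartet below is invented):

```python
def fitch(tree):
    """Return (state set, substitution count) for a tree given as nested
    2-tuples with single-nucleotide strings at the tips (Fitch 1971)."""
    if isinstance(tree, str):                  # terminal node
        return {tree}, 0
    (s1, c1), (s2, c2) = fitch(tree[0]), fitch(tree[1])
    inter = s1 & s2
    if inter:                                  # non-empty intersection: no step
        return inter, c1 + c2
    return s1 | s2, c1 + c2 + 1                # empty: take union, count one step

# one alignment position with states T, T, G, G on two rival topologies
print(fitch((('T', 'T'), ('G', 'G')))[1])  # 1 substitution on ((A,B),(C,D))
print(fitch((('T', 'G'), ('T', 'G')))[1])  # 2 substitutions on ((A,C),(B,D))
```

Summing these counts over all positions and comparing the totals across topologies is exactly the MP search described above.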
MP does not use all nucleotide positions of an alignment. Positions that are identical in all sequences are generally uninformative under MP. If a single sequence deviates from all others
at a certain position, the most parsimonious solution is always that the substitution occurred on
the terminal branch to this sequence. Such a position therefore does not allow any conclusions about the
phylogenetic relationships of a species. Only positions in which more than one taxon shows a
certain substitution are parsimony-informative. MP may have serious drawbacks because
substitutions are merely counted without modelling the substitution process. Datasets with many
homoplasies (see Box 2) can be difficult or impossible to analyse under MP.
Box 2: Homology and Homoplasy
Two or more organisms may share a certain trait or character state because they inherited it from a
common ancestor. A good example are the wings of birds. In this case the trait is called homologous.
Identical character states can also evolve independently as, for example, the wings of birds and bats.
Such a trait is called a homoplasy. In the case of complex organs such as wings it is fairly easy to distinguish homologous from homoplasious character states. In the case of nucleotides this is often impossible at first view. In most cases we can assume that identical nucleotides at a certain alignment position
are homologous. There are, however, two events that can lead to homoplasies at the molecular level:
parallel substitutions (also called parallelisms) and reversals, in which a nucleotide mutates
back to a previous character state. With only four possible character states, both events are relatively
likely to be observed in big datasets, particularly at rapidly evolving gene loci.
[Example alignment of sequences A–D; positions 3, 5 and 8 carry the homoplasious nucleotides discussed below.]
The example above illustrates homoplasy at the molecular level. If we assume the tree ((A,D)(B,C)) either A and B or C and D have independently acquired their identical nucleotide at position 8. If the true
topology is ((A,B)(C,D)) the same holds for the nucleotides at positions 3 and 5.
Many authors make a logical link between the principle of parsimony and the terms homoplasy and homology. On the background of common descent, identical character states of two or more species
should be regarded as homologous as long as nothing argues against this assumption. The assumption
that identical traits are the result of two or more independent evolutionary events is an ad hoc-hypothesis
to explain the data. The smaller the number of such ad hoc-hypotheses, the more parsimonious is the
tree. Unfortunately, the story is not as simple as it sounds (see Felsenstein 2004: 136-146).
Homoplasious positions in a dataset lead to conflicts in the choice of the optimal tree because they favour different tree topologies. A high number of homoplasies in a dataset make it difficult to decide between competing phylogenetic hypotheses. As a measure of the reliability of an MP tree homoplasy indices are often calculated.
The consistency index CI is calculated by forming the ratio
ci = mi / si
for every alignment position, where mi is the minimal number of substitutions in all possible trees and si
the observed number of substitutions for the tree under consideration.
CI is then calculated as the mean of the values for all positions:
CI = Σimi / Σisi
This index is not identical for all tree topologies and its lower limit is not 0. Its complement is called the homoplasy index (HI = 1 – CI) and is a measure of the frequency of positions with parallel or reverse mutations. The retention index
RI = (Σigi - Σisi) / (Σigi - Σimi),
where gi is the maximal number of substitutions in all possible trees, and the rescaled consistency index
RC = CI × RI
are identical for all topologies and range from 0 to 1.
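The indices can be computed directly from the per-position counts. A Python sketch with invented values of mi, si and gi for three positions:

```python
def homoplasy_indices(m, s, g):
    """CI, HI, RI and RC from per-position minimal (m), observed (s)
    and maximal (g) substitution counts, following the formulas above."""
    M, S, G = sum(m), sum(s), sum(g)
    CI = M / S
    RI = (G - S) / (G - M)
    return {'CI': CI, 'HI': 1 - CI, 'RI': RI, 'RC': CI * RI}

# invented counts for three alignment positions
print(homoplasy_indices(m=[1, 1, 2], s=[2, 1, 2], g=[3, 2, 4]))
```

With these numbers CI = 4/5 = 0.8, so HI = 0.2; RI = (9 − 5)/(9 − 4) = 0.8 and RC = 0.64.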
Homoplasies may positively mislead phylogenetic analyses. A notorious example is the so-called “long
branch attraction” under MP. Imagine a tree with the following topology and branch lengths:
The taxa A and C are most closely related to each other, as are B and D. Simulation studies have shown
that in most cases MP would recover a topology in which A and B are most closely related. The reason
is the high number of mutations on the long branches that lead to A and B. Some of these substitutions
will be reverse and parallel mutations creating the impression that A and B share a lot of homologies.
Instead of resolving the problem adding more data will make it ever more certain to end up with the
wrong topology in such a case. This is one of the reasons why model based methods such as ML and
Bayesian methods have more or less replaced MP analyses.
Maximum likelihood and Bayesian posterior probability are statistical methods that try to answer two different questions about data and hypotheses. A proponent of likelihood would ask:
“What does the experimental evidence (or the data) tell me about two competing hypotheses?”
A Bayesian statistician asks: “How does the available evidence change my confidence in a certain hypothesis?”
(2) The law of likelihood states: “The data D favour hypothesis H1 over hypothesis H2 if and only
if the probability to observe the data is higher under the assumption of hypothesis H1 than under H2.” A mathematical shorthand for the last part of this statement could look like this:
P(D| H1) > P(D| H2).
The probability of the data under the assumption of a certain hypothesis or model P(D| H) is
called the likelihood of the model or hypothesis. This likelihood tells us nothing about the probability of the hypothesis as Sober (2008) illustrates by a nice example. Imagine somebody sitting
at home and hearing a noise from above. A superstitious person might believe that some gremlins are bowling in his attic. The likelihood of this hypothesis is very high because the probability
is very high that in this case you would hear some kind of noise from above, although this is of
course complete nonsense. This is the reason why single likelihood values are of no use when
trying to evaluate different hypotheses. What is very useful is the so-called likelihood ratio of two
hypotheses defined as:
P(D| H1) / P(D| H2).
This value is larger than 1 if hypothesis H1 has a higher likelihood than H2 and smaller than 1 in
the opposite case. By calculating this ratio one can find out which of two hypotheses is more
strongly supported by the data and how strong this support is. As we already know, systematists have to decide between a very high number of competing hypotheses. Theoretically, this is
no problem. The topology with the highest likelihood is supported best and is to be preferred.
The computation of likelihood values for phylogenetic trees is only problematic because it involves many more arithmetic operations than an MP analysis.
To explain how the likelihood of a phylogenetic hypothesis is calculated, look at the following tree:
It consists of four taxa 1, 2, 3 and 4 and two internal nodes 5 and 6. The taxa and nodes are
connected by branches of different length l1, l2 … l6. The branch lengths correspond to the
product of substitution rate and the time between the origin of the branch and its split-up, or
simply: The longer a branch the higher the number of substitutions on it.
In order to compute the likelihood for a certain position k of the alignment we put random nucleotides on nodes 0, 5 and 6. The nucleotides at the terminal nodes 1-4 are given by our dataset.
Remember that the likelihood is the probability of observing these nucleotides (the data) given
the tree (the hypothesis). We start by putting a random nucleotide (say a G) at node 0 and call it
x0. We must now calculate Px0, the probability of observing this nucleotide there. At this point,
don’t worry about how this is done, we are currently only assembling the formula for calculating
the likelihood. Next, we need the probabilities that this nucleotide mutates to x5 along branch l5
(Px0→x5(l5)) and to x6 along branch l6 (Px0→x6(l6)). Writing the branch length in brackets tells us that these probabilities depend on the branch lengths: the longer a branch, the more opportunities to mutate. If we extend this to the final branches leading to the terminal nodes, we get the following long but not very
complicated formula:
lk = Px0 · Px0→x5(l5) · Px5→x1(l1) · … · Px6→x4(l4)
Because we do not know the nucleotides at nodes 0, 5 and 6, we must calculate this quantity
for all possible combinations of nucleotides at these nodes and then add up these values.
Lk = lk1 + lk2 + ...,
where the subscripts 1, 2 etc. denote the different combinations of nucleotides. This value Lk
(mind the capital L!) is called the likelihood function for position k of the alignment. Finally, this
calculation has to be done for every single position of the alignment. If we assume that all positions evolve independently the likelihood for the total dataset is simply the product of the values
for the single positions
L = Π Lk.
Unfortunately, probabilities always take values smaller than 1 and the product of many factors
smaller than 1 quickly approaches zero. Therefore, statisticians and computer programs prefer
the logarithm of the likelihood, because the logarithm of a product is the sum of the logarithms
of its factors, which turns the above product into
lnL = Σ lnLk.
If we have the topology and branch lengths of a tree and a dataset we only need one more
thing: a formula which allows us to calculate Px0 and all the probabilities with which nucleotides
mutate along the branches. This formula is called a substitution model (see Box 5).
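For the four-taxon tree above, the whole calculation fits into a short brute-force Python sketch. As substitution model we assume Jukes-Cantor with equal base frequencies (Px0 = 1/4); the nucleotide pattern and the branch lengths are invented:

```python
from itertools import product
from math import exp, log

BASES = 'ACGT'

def p_jc(i, j, l):
    """Jukes-Cantor transition probability from nucleotide i to j along a
    branch of length l (expected substitutions per site)."""
    same = 0.25 + 0.75 * exp(-4.0 * l / 3.0)
    return same if i == j else (1.0 - same) / 3.0

def site_likelihood(tips, lengths):
    """Lk for the tree ((1,2)5,(3,4)6)0: sum lk over all nucleotide
    combinations at the internal nodes 0, 5 and 6 (brute force)."""
    x1, x2, x3, x4 = tips
    l1, l2, l3, l4, l5, l6 = lengths
    total = 0.0
    for x0, x5, x6 in product(BASES, repeat=3):
        total += (0.25                               # Px0: equal base frequencies
                  * p_jc(x0, x5, l5) * p_jc(x5, x1, l1) * p_jc(x5, x2, l2)
                  * p_jc(x0, x6, l6) * p_jc(x6, x3, l3) * p_jc(x6, x4, l4))
    return total

Lk = site_likelihood('AAGG', [0.1] * 6)   # one alignment position, pattern AAGG
print(Lk, log(Lk))                        # likelihood and log-likelihood
```

Multiplying (or, on the log scale, adding) such values over all alignment positions gives L and lnL exactly as in the formulas above. Real programs avoid the brute-force sum by Felsenstein’s pruning algorithm, but the result is the same.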
(3) Bayesian posterior probability
Bayesian statistics also tries to evaluate hypotheses with the help of data, but asks an entirely
different question. Bayesian statisticians are not essentially interested in comparing different
hypotheses, they want to know whether the data increase or decrease our confidence in a certain hypothesis. Likelihood enables us to compare different hypotheses but allows no statements about a single hypothesis or model. With Bayesian methods we can directly infer the
credibility of a single hypothesis.
Bayesian statistics relies on Bayes’ rule which is easily derived from the definition of conditional
probabilities. Let us look at a simple example to explain conditional probability. A pack of 52
cards consists of four types of cards: 13 spades, 13 clubs (both black), 13 diamonds and 13
hearts (both red). What is the probability of randomly drawing a card of hearts (H) under the
condition that it is a red card (R)? Obviously ½, because half of the red cards are hearts. A
more general way of calculating this probability (which is also applicable to more complicated
questions that are not so easy to answer) uses a slightly different line of argumentation. The
probability P(R & H) that the card is red and hearts at the same time is ¼, the probability that is
is red, P(R), our precondition, is ½. Dividing P(R & H) by the probability that our condition is true
P(R) also gives ½. Hence the probability of an event (H) under the condition that event (R) has
happened equals the probability that both events happen divided by the probability of the condition:
P(H | R) = P(H & R) / P(R)
This formula is more general because it allows us to exchange the two events:
P(R | H) = P(R & H) / P(H)
Because P(H & R) is the same as P(R & H) – the probability that both events happen – we can
now express P(H & R) in two different ways:
P(H & R) = P(H | R) · P(R) = P(R | H) · P(H).
If we only look at the last two parts of this equation and solve for P(H | R), we get
P(H | R) = P(R | H) · P(H) / P(R).
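Applied to the card example, Bayes’ rule returns the expected ½:

```python
# 52 cards: 13 hearts, 26 red cards
P_H = 13 / 52                  # prior probability of drawing hearts
P_R = 26 / 52                  # probability that the card is red (the "data")
P_R_given_H = 1.0              # every heart is red: the likelihood
P_H_given_R = P_R_given_H * P_H / P_R   # Bayes' rule
print(P_H_given_R)             # 0.5
```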
This is Bayes’ rule, found by the reverend Thomas Bayes back in the 1740s, published posthumously by Richard Price in 1763 and left more or less unnoticed until Pierre-Simon Laplace
found it independently and published it once more in 1774. This theorem tells us how the probability of a hypothesis P(H) is changed when we gather new data (R). P(H) is the probability of
the hypothesis before we have seen the new data, the so-called prior probability. P(H | R) is
the probability of the same hypothesis after we have evaluated these data, the posterior probability. In contrast to the likelihood this is a true probability and behaves like one: P(H) + P(not
H) = 1. The likelihood of the hypothesis given the new data P(R | H) is also part of the formula.
Besides these rather harmless participants there is an unpleasant stranger P(R), which is the
total probability of the data under all possible hypotheses.
This total probability P(R), and to a lesser degree the prior probability P(H), can cause tremendous difficulties. The probability of most hypotheses is impossible to guess without having seen
any data. And the total probability of the data under all hypotheses can only be calculated for a
relatively small number of possible hypotheses. There is no doubt that Bayes’ rule is mathematically sound but, because of these difficulties, a lot of statisticians are not convinced of its application.
For phylogenetic reconstructions with zillions of trees the total probability of the data is impossible to compute. One can, however, come to an approximate solution by studying a random
sample of trees. Such a procedure would resemble the calculation of the average income of
citizens by evaluating a representative sample of them. Considering that there are so many different trees, the vast majority will be entirely absurd, some will be rather unlikely and only a few
will represent the evolution of the group in a relatively reasonable way. A representative sample
of trees would include trees not randomly but strictly in proportion to their probability or likelihood. One method to sample trees in this way was published by Ziheng Yang and Bruce Rannala in 1997.
It circumvents the problem of the total probability of the data in an elegant way. When the posterior probabilities of two trees are compared this total probability simply cancels out as is evident
from the following formula, in which H1 and H2 represent different trees (hypotheses) and D the data:

P(H1 | D) / P(H2 | D) = [P(D | H1) · P(H1) / P(D)] / [P(D | H2) · P(H2) / P(D)] = P(D | H1) · P(H1) / (P(D | H2) · P(H2))
The only minor problem now is the prior probability of the trees. Here a simplification can help
us. If we have not seen any data at all, all trees will actually have the same prior probability,
which is called a non-informative prior. This assumption is certainly not true, but as long as we
have not seen any data, we have no reason to believe that any tree is more probable than the
others. Under this somewhat idealized assumption the whole procedure reduces to a calculation
of likelihood ratios. Trees are then sampled in the following way.
A computer program calculates the likelihood of a randomly picked tree. Next, the program
changes a parameter of the phylogeny (a branch length, the position of a species on the tree
etc.) and calculates the likelihood for this new tree. If this is higher the program “accepts” or
stores the tree as a new starting point for calculations. If it is lower the tree is only accepted with
a probability that is inversely proportional to the likelihood difference between the trees. If the
difference is high, i. e. if the second tree is much worse than the first, the probability of accepting it is very small, if the difference between the two is low the program jumps to the new tree
with a higher probability. If a tree is not accepted, the program remains on its current tree. This
sequence (change of a parameter, likelihood calculation, comparison of likelihoods, jump to a
new tree) is repeated several million times. This way, the program jumps like a monkey from
tree to tree, preferably in an upward direction, but every now and then also taking a step downhill. The method is called a Markov Chain Monte Carlo process (often abbreviated MCMC) and
ensures that trees are collected in proportion to their likelihood. The Markov Chain gathers a
representative sample of trees. One of the great advantages of this method is that the probability of every parameter of the phylogeny (a certain branch of the topology, parameter of the substitution model etc.) is equal to its frequency in the sample and can be extracted with small effort. Branch lengths and many parameters of the substitution model are not just point estimates
but can be given with standard deviations.
A few adjustments have to be made to ensure that the sampling process is efficient and really
representative. Because the chain starts with a random tree the first trees are probably very unlikely. In order not to bias the sample towards trees with low likelihood, the first 10.000 to
100.000 steps of the Markov chain are discarded. This is called the “Burn-in” phase. It is also
important to ensure that the trees in the sample were gathered independently from each other.
Because only one parameter is changed at each step of the Markov chain, this is only achieved when we sample not every tree but, for example, every 100th tree. Finally, if the likelihood “landscape” is very rugged with several “peaks” of high likelihood divided by deep “valleys” of low
likelihood a single Markov chain might get stuck on one of the peaks. Therefore, one usually
runs several chains in parallel starting from different initial trees. This increases the chances to
sample the whole likelihood landscape evenly.
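The accept/reject rule, burn-in and thinning can be illustrated with a toy Metropolis sampler. Instead of real trees it samples a single invented parameter whose log-likelihood peaks at 5; everything else mirrors the procedure described above:

```python
import math
import random

def metropolis(log_lik, propose, start, steps, burnin, thin):
    """Toy Metropolis sampler: accept uphill moves always and downhill
    moves with a probability given by the likelihood ratio; discard the
    burn-in and keep only every `thin`-th state."""
    state, cur = start, log_lik(start)
    sample = []
    for i in range(steps):
        cand = propose(state)
        cand_ll = log_lik(cand)
        # likelihood ratio as acceptance probability (capped at 1)
        if random.random() < math.exp(min(0.0, cand_ll - cur)):
            state, cur = cand, cand_ll
        if i >= burnin and i % thin == 0:
            sample.append(state)
    return sample

random.seed(1)  # reproducible toy run
# invented one-parameter "phylogeny": log-likelihood peaks at 5
sample = metropolis(log_lik=lambda x: -0.5 * (x - 5.0) ** 2,
                    propose=lambda x: x + random.uniform(-1.0, 1.0),
                    start=0.0, steps=20000, burnin=2000, thin=100)
print(sum(sample) / len(sample))   # sample mean, close to the peak at 5
```

Just as described in the text, the frequency of a value in the thinned post-burn-in sample approximates its posterior probability, which is why means and standard deviations can be read off the sample directly.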
7. How to Calculate Phylogenetic Trees
After all these theoretical explanations and considerations, let us finally do it. The number of
different computer programs for phylogenetic analyses is so large that we will restrict this
course to a few basic ones that are commonly used or cited in the literature. Programs such as
PAUP, PHYLIP and MEGA can perform various analyses. But you will find loads of software for specialized purposes, for example Bayesian analyses (MrBayes) or ML computations (PHYML; RAxMLGUI). All of the programs above except PAUP are freely available.
Most phylogenetic programs will run on Mac OS and Windows. The most commonly used input formats for these programs (FASTA, PHYLIP and NEXUS) will be explained during the course.
Before we start our first phylogenetic analysis, we have to consider one more aspect, the difficult task of finding the optimal tree out of a number of possible trees that exceeds the number of
atoms in the universe. It is obvious that even the fastest computers cannot solve this problem in
reasonable time. Instead of picking a limited number of trees randomly and comparing them,
there are smarter search options that use simple strategies to reach a preliminary result and try
to optimize this solution. Such heuristic search strategies are explained in the following box.
Box 4: Heuristic search
A heuristic search for an optimal tree is a two-step procedure. In the first step an initial tree is constructed using a simple and fast method. A commonly used algorithm is stepwise addition. Three taxa are
chosen randomly from the data set and combined into an unrooted tree (how many unrooted trees are
there for 3 taxa?). Step by step more taxa are added to this tree by joining them to one of the branches
of the existing tree, so that the resulting tree is the most parsimonious one or the one with the highest
likelihood. Why is this tree not the overall best one? Simply because the order in which taxa were added has an impact on the final result. And there are about as many different orders in which you can
arrange the taxa as there are different phylogenetic trees for them (for 50 taxa there are about 3 × 10^64 orders).
It wouldn't help to try them all.
The second step starts when all taxa are added to the initial tree. Now the program starts to exchange
branches on the tree and calculates whether this improves the tree. There are different strategies for this
so-called branch swapping.
Nearest Neighbour Interchange (NNI) removes an internal branch of the tree and exchanges the four
resulting groups into all three possible arrangements before the branch is tied again. This is done for all internal
branches of the tree.
Subtree Pruning and Regrafting (SPR) cuts off a branch (like a gardener pruning a tree) and joins it to
all possible branches of the remaining tree. Again, this is done for all branches of the tree.
Tree Bisection and Reconnection (TBR) cuts the tree into two and reconnects the two subtrees at all
possible positions. This is repeated for all possible cutting points. The number of investigated trees is largest for TBR and smallest for NNI.
In order to further increase the number of trees the search is usually repeated several times, each time
adding the taxa in a different order to the initial tree. Although only a small fraction of trees is compared
this search algorithm works very well in most cases.
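The numbers mentioned in Box 4 are easy to check. The number of distinct unrooted binary trees for n taxa is the double factorial (2n − 5)!!, and the number of addition orders is essentially n!. A small sketch (the function name is our own):

```python
import math

def unrooted_trees(n):
    # Number of distinct unrooted binary trees for n taxa: (2n - 5)!!
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

print(unrooted_trees(3))              # 1 -- the answer to the question in Box 4
print(f"{unrooted_trees(50):.1e}")    # roughly 3e+74 distinct trees
print(f"{math.factorial(50):.1e}")    # roughly 3e+64 addition orders
```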
Maximum Parsimony and the NEXUS File Format
We will calculate the first phylogenetic trees using the Windows version of PAUP and maximum
parsimony, the default option in PAUP. The program can be operated either through a command line or through specific command blocks that are appended to the data. PAUP is a versatile program with lots of commands. We will therefore learn no more than a few basic functions of this program.
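Before the steps below, it helps to know what such a file looks like. A minimal, hypothetical NEXUS data block (taxon names borrowed from the dataset, sequences invented purely for illustration):

```text
#NEXUS
begin data;
  dimensions ntax=4 nchar=12;
  format datatype=dna missing=? gap=-;
  matrix
    Ginkgo       ACGTACGTACGT
    Cycas        ACGTACGAACGT
    Podocarpus   ACCTACGAACGT
    Metasequoia  ACCTACGAACTT
  ;
end;
```

The command blocks discussed later in this tutorial are appended after this data block.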
1. Open PAUP. A dialogue box will open automatically. Load the provided alignment in
NEXUS format by marking it and pressing “Open/Execute”. If the file does not contain typos and other mistakes (e. g. blank spaces, full stops or hyphens in taxon names) this
should prompt no error messages.
2. In the window above the command line you will see some information on your data set.
>Window>[name of the file] opens a second window with the actual data file. You can
toggle between these windows.
3. We will first try out a simple heuristic search for an optimal tree using the command line:
hsearch addseq=asis; <Enter>.
4. hsearch initiates the heuristic search, addseq=asis tells the program to add the taxa
to the initial tree (stepwise addition, see Box 4) in the order in which they appear in the data file.
5. The output tells you how many characters the dataset contains, how many of them are
parsimony-informative, and which branch swapping algorithm was used. The default in
PAUP is TBR. Under “score” in the table you can see how many evolutionary steps (substitutions) the best tree requires.
6. In order to view the tree, type:
describetrees 1/ brlens=yes; <Enter>
Now the program displays tree no. 1 and prints a table with branch lengths (brlens). In
addition different homoplasy indices are displayed.
7. In order to increase the number of studied trees and the probability of finding the optimal
tree, we will now increase the number of cycles in the heuristic search. This only makes
sense if we tell the program to randomly add taxa during the first step of each analysis.
hsearch addseq=random swap=NNI nreps=100; <Enter>
The heuristic search will now be repeated 100 times adding the sequences randomly.
This time branches will be swapped using nearest neighbor interchange.
8. A very convenient option of PAUP is that it remembers recently used commands. In order
to display the first tree click on the arrow to the right of the command line. You will see a
list of the previously used commands. Choose the correct one and press <Enter>.
9. PAUP produces unrooted trees. Let us now root the tree with an outgroup. First we have
to define the outgroup. Assuming you use the provided dataset the command for this is:
outgroup Podocarpus Ginkgo Metasequoia Cycas; <Enter>
If you use your own dataset you will probably know which taxon or taxa to choose as an outgroup.
10. To root the tree with this outgroup, type:
describetrees 1/ root=outgroup outroot=monophyl; <Enter>
Now the outgroup should appear as a monophyletic group outside the ingroup taxa.
11. In order to save the tree or trees with branch lengths, type:
savetrees file=[file name] brlens=yes; <Enter>
You have now saved the first tree as a result of the analysis. As you may have noticed, displaying trees is not PAUP's strong suit. In order to edit the tree for publication or before
printing it, we will now open it in FigTree.
1. Load the tree file by >File>Open.
2. If the branches are labelled (for example by statistical support values), you will be asked
to provide a name for these labels. This will not be the case now.
3. The tree appears in the window. The menu on the left side allows you to alter the appearance of the tree. Play around with the layout a bit.
4. Above the window you see a graphical menu, with which you can change the position
and order of clades. To re-root the tree with a different group mark the branch and
choose >Reroot.
5. You can also rotate clades around nodes (remember that this does not alter the topology
of the tree). Choose a branch and >Rotate. >Highlight marks a whole clade, >Colour
changes the colour of branches. Play around with the different options. Also change from
“Node” to “Clade” or “Taxa” above “Selection Mode” to find out about these options.
6. If you want to display statistical support values for branches (see below) mark the box
“Node Labels” or “Branch Labels” on the left and open the menu by clicking on the arrow.
7. Choose the appropriate label from the drop-down menu “Display”. At this point we have
not yet investigated the statistical support for branches. But remember this function for
later use.
8. Once the tree suits your taste you can export it into a graphical format by >File>Export
Graphic... and choosing the appropriate format and the folder in which you want to store
the file.
Choosing a Substitution Model for Likelihood Analyses
Likelihood analyses are not only more difficult to understand than parsimony. They are also a bit
more cumbersome to perform because, in order to calculate the likelihood of a tree, we need to
choose a substitution model. This is a set of formulas that allows us to calculate how probable it
is that a certain nucleotide is substituted by another along a branch of defined length. Before we
go into the practical details, Box 5 will explain substitution models in general.
Box 5: Substitution models
Two sequences that derive from a common ancestor will gradually accumulate substitutions. Because
substitutions are rare events that happen randomly we would expect that the number of observed differences between two sequences grows slowly and more or less linearly with time (black line in the illustration below).
However, the observed differences between sequences do not accurately reflect the substitutions that
have happened. A small number of substitutions will affect nucleotides that have already mutated before.
Instead of two substitution events we will only observe one or, if the mutation is back to the original nucleotide at this position, none at all. Usually, the number of observed differences between two sequences (red line) will be smaller than the actual number of substitution events. This effect is more pronounced
in rapidly evolving genes than at slowly evolving gene loci and is particularly problematic for MP where
differences among sequences are simply counted, which misrepresents the actual number of substitutions. Substitution models try to model the whole process including parallel and reverse mutations.
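Substitution models make this correction explicit. Under the Jukes-Cantor model described below, the observed proportion p of differing sites converts into an estimated number of substitutions per site as d = -(3/4) ln(1 - 4p/3), a standard result that follows from the model. A quick sketch:

```python
import math

def jc_distance(p):
    # Jukes-Cantor correction: estimated substitutions per site from the
    # observed fraction p of differing sites (only defined for p < 0.75).
    return -0.75 * math.log(1 - 4 * p / 3)

print(round(jc_distance(0.10), 3))   # 0.107: slightly above the observed 0.10
print(round(jc_distance(0.45), 3))   # 0.687: far above the observed 0.45
```

Note how the correction grows disproportionately for divergent sequences, where many substitutions are hidden by multiple hits at the same position.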
The simplest substitution model (the Jukes-Cantor model) assumes that all possible substitutions occur
with equal probability. This model is relatively easy to describe. Substitutions are random events. We do
not know exactly when the next substitution will hit a gene, but we know that eventually a nucleotide will
be substituted by another. It is like waiting for a notoriously unpunctual bus. As you wait the probability
that the bus has not yet arrived grows smaller and smaller until the damn vehicle finally arrives. If we call
the waiting time T and the time that has passed since you arrived at the bus stop t we can say: The
probability that the waiting time for the bus (or the next substitution) is longer than the time that has
passed is getting smaller and smaller as time passes. This probability is exponentially distributed.
P(T > t) = e^(−t).
If this is the probability that nothing has happened (no bus, no substitution), then the probability that
"something" has happened is 1 − e^(−t). There are four different nucleotides (A, C, G, T). If all substitutions
are equally probable, the probability of any single substitution is ¼ (1 − e^(−t)). This is correct if, for example, a
G is replaced by an A, C or T. But how big is the probability that after a certain time the G is still a G?
There are two possibilities for this case: either nothing has happened: e^(−t). Or the G was replaced by the
same nucleotide: ¼ (1 − e^(−t)). The total probability for this event is therefore e^(−t) + ¼ (1 − e^(−t)). If we generalize
this for all nucleotides we can describe the probability that, at time t, nucleotide x is replaced by nucleotide y, in the following way:
Px→y(t) = { ¼ (1 − e^(−t))             if x ≠ y
          { e^(−t) + ¼ (1 − e^(−t))    if x = y
The curly bracket here means “either – or”. This is the mathematical abbreviation of the Jukes-Cantor
model. Actually, most gene loci do not follow this simple substitution model. For example, due to their
different size, purines (G and A) and pyrimidines (C and T) are more often substituted by another nucleotide of the same group (transition) than by one of the other group (transversion). Moreover, nucleotides
do usually not occur at equal frequencies, which favours certain substitutions over others. All these facts
are considered in more complicated models. The following four matrices show commonly used substitution models. The upper row shows the nucleotides before, the left column the nucleotides after substitution. The Greek letters in the centre denote the different substitution rates, the symbols gA, gT etc. the frequencies of the respective nucleotides.
[Rate matrices of four commonly used models, ranging from the one-parameter Jukes-Cantor model to the general time-reversible (GTR) model. In the GTR matrix each entry is one of six substitution rates (α, β, γ, δ, ε, ζ) multiplied by the frequency g of the resulting nucleotide:]

         A      C      G      T
    A    -      αgC    βgG    γgT
    C    αgA    -      δgG    εgT
    G    βgA    δgC    -      ζgT
    T    γgA    εgC    ζgG    -
These models have different numbers of unknown parameters. The Jukes-Cantor model only has one
parameter α; the general time-reversible model has nine, because it assigns a separate rate to each of the six kinds of
substitutions. The more parameters a model has, the more difficult it becomes to compute the likelihood.
Position specific substitution rates
All these models assume that the substitution rates are identical for all positions of a DNA sequence.
This is often not true. Changes in the amino acid sequence of the active centers of enzymes are much
rarer than substitutions in side chains of a protein because they usually affect the function of the protein.
First, second and third positions of a codon also have different substitution rates because many mutations at third positions do not affect the amino acid sequence. Hence, the substitution rates may vary
along a gene sequence. This variation can be considered in models by assigning different overall rates
to different positions.
In principle one could assign a separate rate to each position. Unfortunately, the number of parameters
in a model cannot be higher than the number of datapoints in the dataset. You can easily understand this
if you think of a regression line through a scatter-plot of data points. The regression line is a model that
simplifies the information contained in the data and is defined by two parameters, the slope and the
intercept. If you only have one data point there is no unique regression line representing the data. We must therefore use another method to include position-specific rates in a substitution model. Instead of assigning a separate rate to each position we define a set of "substitution rate
classes", each with its own rate. The default is often 4, but any number of classes can be defined. How
much these rates differ from each other is defined by a certain probability distribution, the gamma distribution. Look at the following graph.
On the x-axis we have the substitution rates, low rates on the left close to the y-axis, higher rates on the
right. The area below the curve is then divided into 4 (or whatever number you prefer) parts so that each
part covers the same area below the curve. The four different rates are then determined as the median
of each part, in this case 0.32, 0.68, 1.11 and 1.88. The nice thing about this procedure is that the gamma distribution is defined by a single parameter α and can take very different shapes. In the figure above
α = 2. If α ≥ 1 all substitution rates are relatively close to 1. For α < 1 they differ more strongly. The rates
for α = 0.5 are, for example, 0.03, 0.28, 0.92 and 2.77. Once the different rates are defined, the likelihood for each position (given a tree with branch lengths) is calculated using the mean likelihood value for
all four rates. Such a model can represent the actual substitution process much better, because the
higher rates contribute a higher proportion to the overall likelihood for rapidly evolving and the low rates
for slowly evolving positions.
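The discretization described above can be reproduced numerically. The sketch below uses no statistics library; the gamma CDF is computed from its standard series expansion and inverted by bisection (all helper names are our own). It takes the medians of the equal-probability classes of a mean-1 gamma distribution and rescales them to an average rate of 1, which recovers values very close to those quoted above:

```python
import math

def gamma_cdf(x, shape):
    # Regularized lower incomplete gamma function P(shape, x), series expansion.
    if x <= 0:
        return 0.0
    term = 1.0 / shape
    total = term
    denom = shape
    for _ in range(300):
        denom += 1.0
        term *= x / denom
        total += term
    return total * math.exp(-x + shape * math.log(x) - math.lgamma(shape))

def gamma_quantile(p, shape, scale):
    # Invert the CDF by bisection (the CDF is monotonically increasing).
    lo, hi = 0.0, 100.0 * scale
    for _ in range(80):
        mid = (lo + hi) / 2.0
        if gamma_cdf(mid / scale, shape) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def discrete_gamma_rates(alpha, classes=4):
    # Median of each equal-probability class of a Gamma(alpha, 1/alpha)
    # distribution (mean 1), rescaled so that the class rates average to 1.
    medians = [gamma_quantile((2 * i + 1) / (2 * classes), alpha, 1.0 / alpha)
               for i in range(classes)]
    mean = sum(medians) / classes
    return [m / mean for m in medians]

print([round(r, 2) for r in discrete_gamma_rates(2.0)])  # the four class rates for alpha = 2
print([round(r, 2) for r in discrete_gamma_rates(0.5)])  # compare the alpha = 0.5 rates in the text
```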
Now that we know what a substitution model is and how it reflects the substitution process, how
can we select a model among the many different ones? Mathematical models do not perfectly
represent but only approximate natural processes. An ideal model would allow us to extract the
essential information from complex datasets. As outlined above a scatter plot in a coordinate
system may be reduced to a regression line a + bx that tells us how pressure depends upon
temperature in an autoclave. In such a model, a large dataset is reduced to two parameters a
and b.
The law of likelihood offers one way to choose among models. If you calculate the likelihood of
a tree based on the dataset and different substitution models you could pick the preferred model
by comparing the likelihood values for different models, just as you choose the best tree by
comparing likelihood values for different trees.
P (D | M1) / P (D | M2) (see page 10)
The problem is that one can always add further parameters to a model that increase the agreement between the model and the data. The more complex a model gets the more closely it reflects the actual dataset and the higher its likelihood. By just comparing the likelihood ratio we
would always end up with the most complex model. Because all datasets contain noise in addition to information complex models reflect more and more of that noise. This increases the
chances that a model misrepresents the natural relationships and will not fit very well to future
datasets. In the example above, instead of a simple line we could describe a curve that touches
every point of the scatter plot. This model would fit the data perfectly but probably does not correctly describe the relationship between temperature and pressure in the autoclave. Therefore
we need to avoid overly complicated models.
Introducing additional parameters into a very simple model will greatly increase its likelihood.
But an additional parameter in an already very complex model will make only a small difference.
In 1974, Hirotugu Akaike introduced a method that makes use of this observation by offsetting
the number of parameters with the likelihood value of a model. The formula looks like this:
AIC = 2k – 2log L(M)
where k is the number of parameters and log L(M) the logarithm of the likelihood of a certain
model. The more parameters you include the larger the values for 2k and log L. As long as the
increase in likelihood exceeds the increase in 2k the AIC value gets smaller. Once the model
has reached a certain degree of complexity, the increase in likelihood will no longer compensate
the increase in 2k, because additional parameters add only little to the likelihood value. From
then on the AIC value will increase again. Put simply, the model with the lowest AIC is the one
that explains the data best with a minimum number of parameters. There are a number of freely
available programs that calculate the AIC or similar quantities (e. g. jModeltest, MtGUI).
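As a toy illustration of this trade-off (the log-likelihoods and parameter counts below are invented for the example, not taken from any real analysis):

```python
def aic(log_likelihood, k):
    # Akaike information criterion: AIC = 2k - 2 log L(M); lower is better.
    return 2 * k - 2 * log_likelihood

# Hypothetical scores for three nested models of increasing complexity.
models = {"JC": (-5230.4, 1), "HKY": (-5105.9, 5), "GTR+G": (-5098.2, 10)}
for name, (logl, k) in models.items():
    print(name, round(aic(logl, k), 1))

best = min(models, key=lambda name: aic(*models[name]))
print(best)   # GTR+G: here the extra parameters still pay for themselves
```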
As we have already pointed out, different genes and even parts of a gene often follow different
substitution models. In many protein coding genes a quick look at the alignment already tells
you that first, second and third codon positions have different rates. For multigene datasets, this
would suggest using a separate model for each individual gene and each codon position. For
datasets with many genes this would lead to an unnecessarily complex model if some
genes within the dataset follow the same model. Therefore one should always test whether different genes follow the same model and whether they are better combined in a single data
partition with a single model. The following figure from the user manual of PartitionFinder (Lanfear et al. 2012) shows different options to partition a dataset of three coding genes.
In order to find the optimal model, one must consider all possible combinations of models and
partitioning schemes. The program PartitionFinder does this automatically. It needs a dataset in
PHYLIP format and a configuration file, in which the most complicated partitioning scheme to be
considered is defined by the user. In the case of the provided dataset this would include three
codon positions for every gene in the dataset (atp1, matK, rbcL). The configuration file is called
“partition_finder.cfg” and can be found in the folder “PartitionFinder”.
1. Copy the alignment in PHYLIP format into the folder “analyses/PartitionFinder”.
2. Open the file “partition_finder.cfg” in a text editor.
3. Add the name of your PHYLIP file behind alignment = .
4. Below ## DATA BLOCK ## you already find the nine possible partitions. You need to
define which parts of your alignment belong to which partition. The notation 1-xxxx\3
means that every 3rd position between 1 and xxxx belongs to one partition. Why does
the next partition begin with a 2 (2-xxxx\3)?
5. Save the file without changing its name after you have entered all the information.
6. PartitionFinder is written in the programming language Python. In order to start it open a
command prompt and type the command python and a space.
7. Open the file “Programm/PartitionFinder”, drag and drop the file “” into
the command prompt and type another space.
8. Now drag and drop the folder “Analyses/PartitionFinder” into the command prompt and
press <Enter>. If there are no errors in the cfg-file the program will start computing the
optimal combination of models and partitions. The progress is output on the screen.
9. After the analysis has been finished you will find a new folder “analysis” and a file “log.txt”
in the folder “Analyses/PartitionFinder”
10. The logfile documents the screen output.
11. The folder “analysis” contains the results of the analysis. The most important file for us is
“best_schemes.txt” which describes the optimal partitioning scheme. Open the file in
Word (output is difficult to read in some text editors).
12. Following some rather irrelevant lines with numbers you find the description of the
scheme. The partitions are numbered. For each you can see the optimal model, the
name of the gene loci (e. g. matK_1) and their position in the alignment. Study the table
carefully! Is there anything peculiar?
13. Below the table a short explanation tells you which of the original partitions (the ones
that you suggested) follow the same model.
14. At the end of the file there is a text block which we can later use in the ML analysis. Here,
the partitioning scheme is described so that we can just copy-paste it into the program RAxMLGUI.
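For orientation, the DATA BLOCK edited in steps 3-4 could look like the following for a hypothetical 1,500 bp alignment (the positions are invented; use the coordinates of your own alignment). Note how the \3 notation staggers the starting positions, so that each of the three blocks per gene picks up a different codon position:

```text
## DATA BLOCK ##
atp1_pos1 = 1-500\3;
atp1_pos2 = 2-500\3;
atp1_pos3 = 3-500\3;
matK_pos1 = 501-1000\3;
matK_pos2 = 502-1000\3;
matK_pos3 = 503-1000\3;
rbcL_pos1 = 1001-1500\3;
rbcL_pos2 = 1002-1500\3;
rbcL_pos3 = 1003-1500\3;
```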
Writing a PAUP Block
In our first attempt to calculate a phylogenetic tree we used the command line to tell PAUP what
to do. A much more convenient way is to write all necessary commands into a text block that is
appended to the data block in PAUP. The NEXUS format allows us to include a number of different blocks, for example an ASSUMPTIONS block, a TREES block with previously computed
trees or a PAUP block with details on the analysis. This is very convenient when we wish to run
different analyses successively or just want to save waiting time. Instead of waiting for results
for hours and then typing in the next commands, we can feed PAUP with all necessary commands on
Friday and pick up the results of the completed analysis on Monday. We start with a simple example, an ML analysis for which we have to define a substitution model.
1. The data block in a NEXUS file ends with end; Insert two returns. The command block
always starts with the following line:
begin PAUP;
2. Because the default optimality criterion in PAUP is MP we have to tell the program that
we now want to perform an ML analysis. We also want to tell the program to finish the
analysis independently so that we do not have to press the “close” button. Write the following into a new line below the previous command:
set autoclose=yes criterion=likelihood;
3. Next, we have to define the substitution model. PAUP does not accept multiple models
for a single dataset. The following parameters of the substitution model are taken from an
analysis of the dataset with the program MrModeltest 2.2.
lset base=(0.3445 0.1718 0.1697) nst=6 rmat=(0.8683 2.6043 0.2708
0.9152 2.5678) rates=gamma shape=0.9213 pinvar=0.3175;
The command lset initiates the definition of model parameters for an ML analysis.
base=(x y z) defines the frequency of nucleotides A, C and G. Why is the frequency
of T not defined?
The x in nst=x refers to the number of different substitution types in the model (1, 2 or 6
depending on the choice of model, see Box 5).
rmat=(x y z …) means the relative rate of the different substitution types (A→C,
A→G, etc.). The rate for G→T is set to 1.
rates=[equal, gamma]: Substitution rates are either identical for every position of the
alignment or variable among sites.
shape=x defines the shape of the Gamma function that is used to model site specific
substitution rates (see Box 5).
pinvar=x: most positions of an alignment are usually identical across species and hence
appear to be invariable. The value of x defines this proportion of invariable sites in the dataset.
4. To start the heuristic search for the best tree we include the following line after the line
starting with lset:
hsearch addseq=random swap=SPR nreps=10;
5. The next line contains the command, which you already know, to show the best tree with branch lengths. As an exercise write it yourself.
6. Finally, let PAUP save the tree with branch lengths under a file name of your own choice.
7. Save and close the NEXUS file.
8. Load the file once more. If everything is ok, the analysis will start without any further input.
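For reference, the complete command block assembled in steps 1-6 could look like this (the output file name is our own choice; the display command in step 5 is the describetrees command introduced earlier):

```text
begin PAUP;
  set autoclose=yes criterion=likelihood;
  lset base=(0.3445 0.1718 0.1697) nst=6
    rmat=(0.8683 2.6043 0.2708 0.9152 2.5678)
    rates=gamma shape=0.9213 pinvar=0.3175;
  hsearch addseq=random swap=SPR nreps=10;
  describetrees 1/ brlens=yes;
  savetrees file=mltree.tre brlens=yes;
end;
```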
Phylogenetic Uncertainty, the Non-Parametric Bootstrap
The results of different phylogenetic analyses on the same dataset (e. g. MP and ML analyses)
often differ from each other. MP analyses commonly come up with several equally parsimonious
trees. These differences may concern important details, for example the question whether a
taxon is monophyletic or not. How much confidence can we have in our results or in other
words, how far is our hypothesis (the tree) from the truth (the evolution as it actually happened)? This is an important question in all scientific studies. Physicists will usually repeat
measurements and report mean values and standard deviations to estimate the deviation of a
single measurement from the true value. But evolution cannot be repeated experimentally. Sequencing many genes and comparing the resulting trees might help, but we often find that the
information contained in a single gene is too small to allow meaningful interpretations. That is
the reason why most studies nowadays rely on data from multiple genes.
When the deviation of a measurement from the true value cannot be inferred, because the
measurement cannot be repeated, one can instead treat the measurement as the “truth” and
repeat the analysis by re-sampling the dataset. The deviation of the re-sampled results from the
original result then serves as an approximation of the true deviation. Referring to the American
proverbial expression "to pull yourself up by your own bootstraps" (a basically impossible task), statisticians call this method "bootstrapping". Which at least proves that they have a sense of humour.
In phylogenetic analysis you re-sample the dataset several hundred to a thousand times with replacement.
By randomly picking single alignment positions one produces datasets that are identical in size
to the original one. Random sampling with replacement ensures that the new datasets all differ
from each other, because some positions will be picked more than once while others will be left
out randomly. In a second step one calculates phylogenetic trees for each subsample and a
single consensus tree, which summarizes the information contained in the bootstrapped trees.
Every branch that was found in more than 50 % of the bootstrapped trees can be unambiguously displayed in the consensus tree. Finally, one determines, in how many of the bootstrapped
trees the branches were found. This “bootstrap support” (in percent) or similar measures of
support are a necessary part of all decent phylogenetic trees. Simulation studies (in which you
actually know the “truth”) show that bootstrap values tend to underestimate the confidence that
we can have in a branch. Although 95 % confidence is a frequently used threshold value in other fields of statistics, it has become customary to discuss branches with support values above
70 %. Less strongly supported branches are considered to be too unreliable to merit further discussion.
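The re-sampling step itself is simple to sketch (a toy example; each list entry stands for one alignment column):

```python
import random

def bootstrap_columns(columns, rng):
    # Draw len(columns) columns with replacement: the pseudoreplicate has the
    # same length as the original alignment, but some sites appear twice or
    # more while others are left out.
    return [rng.choice(columns) for _ in columns]

alignment = ["col%d" % i for i in range(10)]   # stand-ins for 10 alignment columns
replicate = bootstrap_columns(alignment, random.Random(42))
print(len(replicate) == len(alignment))        # True: same size as the original
```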
Bootstrap values are a measure of precision rather than accuracy. High precision means that a
measurement renders very similar results every time it is repeated. High accuracy, on the other
hand, means that these measurements are close to the true value. When I step on my bathroom scales ten times and they always show the same weight, the scales are very precise. Nevertheless
they can be poorly calibrated and under- or overestimate my true weight considerably. Precision
alone is not really sufficient to convey confidence in a measurement. Bootstrap values are very
similar in this respect. A node with high bootstrap support can be reliably reconstructed from the
data, but a bootstrap support of 85 % does not mean that there is an 85 % chance for this
branch to occur in the true evolutionary tree. We will carry out the first bootstrap analysis using PAUP.
1. Open PAUP and load the NEXUS file.
2. Should the program start to compute trees, stop it and change into the window that displays the NEXUS file. You can deactivate the PAUP block at the end by putting it in
square brackets. Save the file and execute it once more.
3. Type bootstrap nreps=500 search=heuristic; into the command line and
press <Enter>.
This will start a bootstrap analysis with 500 pseudoreplications (the smallest number
considered to give reliable results) and a heuristic search for the best tree in every replicate.
4. After the analysis is finished the display shows the consensus tree and a table with all
groups that were found in at least one bootstrap tree. The taxa are abbreviated as numbers above the columns. The respective groupings are marked by stars. The rightmost
column shows the frequency (in percent) with which the group occurred among the bootstrap trees. Groups appearing in less than 5 % of the trees are not listed. Compare the
list with the tree.
5. In order to save the consensus tree type the following commands:
savetrees from=1 to=1 file=[File name] savebootp=nodelabels; <Enter>
PAUP saves the bootstrap support as a label of each node. The command from=1
to=1 is necessary to circumvent a bug in PAUP. Without this command the file is empty.
The tree file can be opened and edited in FigTree.
Bootstrap Analyses Using Maximum Likelihood
As you have probably seen by now, ML analyses and bootstrap analyses under MP can take a
very long time with PAUP. Running an ML bootstrap analysis with a dataset like ours would soon exhaust your patience. Different methods have been developed to speed up the computation of
ML trees. One of these programs is RAxML (Randomized Axelerated Maximum Likelihood). In
contrast to PAUP it is also able to consider separate models for different dataset partitions.
However, it only offers the same type of model for all data partitions. The difference from a single
model for the whole dataset is that the parameter values of the model are estimated independently for each data partition. RAxML calculates bootstrap support values and an ML tree
separately and displays the bootstrap values on the ML tree, not the consensus tree. RAxML
needs a dataset in PHYLIP format.
1. Copy the alignment (PHYLIP format) into a new folder, in which you also want to save
the results of your RAxML analysis.
2. Start the program RAxMLGUI by double-clicking on the file “”.
3. Load the alignment by >Alignment file.
4. A window with the heading “Open alignment file” will open. Navigate to the folder with
your alignment, highlight it and >Open.
5. If the alignment contains identical sequences or positions that only consist of gaps a
warning will appear. You may delete these “double” sequences and gap positions by
choosing >yes. In our case we want to use all sequences and press >no. RAxMLGUI
recognizes whether the dataset consists of DNA or protein sequence data.
6. Your PHYLIP file will appear in the display window.
7. >Options>Set/Edit Partitions... opens a new window “Set partitions”. Here you can describe the different partitions of your data set.
8. RAxMLGUI can process many kinds of data. You can specify your data by pressing
>DNA in the upper left corner. “BIN” means binary data (0,1), “MULTI” multistate characters (e. g. blue, red, green), and “AA” amino acid sequences. Multistate characters may
have up to 32 possible character states (0–9, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P,
Q, R, S, T, U, V). Do not change any of the settings.
9. You can either define the limits of the data partitions “manually” or copy-paste the definition from PartitionFinder. Choose Options>Edit partitions.
10. Copy the last few lines from the file “best_schemes.txt” into the display of RAxMLGUI
and correct line breaks if necessary.
11. >Set will save the partitions.
12. [If you want to enter partitions manually follow these instructions: The start of a partition
cannot be changed. It is either 1 or follows from the end of the previous partition. Enter
the last position of the first partition into the upper right field.
13. >no codon means that all positions with these limits follow the same model. Alternatively, for protein coding sequences, you may choose >codon specific if you want to use
separate models for 1st, 2nd and 3rd positions or >3rd codon if 1st and 2nd positions
together should follow a model that differs from that for 3rd positions.
14. Press >Add. The larger window will now show a line describing the first partition as you
defined it. Repeat steps 12-14 until all positions of the dataset are assigned to a partition.
15. A message appears “All characters have been assigned to partitions. Set partitions for
the analysis?” If you do not want to change anything press >OK, otherwise >Abbrechen (Cancel).]
16. After all partitions have been defined a new option will appear below “Run RAxML” called
“per-partition brL”. Selecting this option will prompt the program to calculate branch
lengths for each partition. The topologies of these trees will be identical, but the branch
lengths will be optimized for each partition separately.
17. In the lower menu bar you see the command >ML + rapid bootstrap. Here you can
choose between a simple ML analysis or two different types of bootstrap analyses, “thorough” or “rapid”. Do not change the settings.
18. The number of bootstrap replicates may also be adjusted. If it is shortly before lunchtime
choose 1000, otherwise leave it at 100.
19. Finally you may choose among different substitution models. For the sake of speed,
RAxMLGUI offers a limited choice of alternatives. The GTRGAMMA model includes site
specific rates; GTRGAMMAI additionally includes a proportion of invariable sites. The models
GTRCAT and GTRCATI further speed up calculations but are relatively imprecise. This
should particularly affect branch lengths.
20. >Run RAxML starts the analyses and opens a command window in which the progress
of the analysis is reported.
21. RAxML produces a number of output files. The most important file
“RAxML_bipartitions..tre” shows the ML tree with branch lengths and bootstrap support
values. If you have chosen >per-partition brL before you will also get a separate tree
with branch lengths for each partition. These files have the endings “PARTITION.0”,
“PARTITION.1”, etc.
22. To finish your analysis, edit the tree in FigTree and save it as a graphic.
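The “rapid” and “thorough” bootstrap options in step 17 both rest on the same idea: resample the alignment columns with replacement and reanalyse each pseudo-replicate. A minimal Python sketch of the resampling step (toy alignment and made-up taxon names; this is the general principle, not RAxML's actual implementation):

```python
import random

def bootstrap_columns(alignment, seed=None):
    """Build one bootstrap pseudo-replicate by resampling alignment
    columns with replacement."""
    rng = random.Random(seed)
    length = len(next(iter(alignment.values())))
    # Draw as many column indices as the alignment has columns.
    cols = [rng.randrange(length) for _ in range(length)]
    return {taxon: "".join(seq[i] for i in cols)
            for taxon, seq in alignment.items()}

# Toy alignment: three taxa, eight columns.
aln = {"taxonA": "ACGTACGT", "taxonB": "ACGTTCGA", "taxonC": "ACCTACGA"}
rep = bootstrap_columns(aln, seed=1)
```

Each of the 100 (or 1000) replicates is then analysed like the original data; the support value of a clade is the percentage of replicate trees that contain it.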
Bayesian Analysis
For a number of reasons, the Bayesian method has become very popular among systematists.
Although millions of trees are calculated, the method is very fast.
This is mainly due to the fact that you can skip bootstrap analyses. The Bayesian analysis
computes the probability of the tree and all parameters of the analysis simultaneously.
From the tree sample you can also compute confidence limits and standard deviations for all
parameters of the model (i.e. the tree topology with branch lengths and the substitution model).
All methods that only search for a single optimal tree may fail under certain conditions. We
have already seen the example of long branch attraction under MP. Using an inappropriate
substitution model may also cause serious problems in ML. Bayesian methods are not free
from these error sources, but instead of focussing on a single tree topology, we may rather
consider the sample of trees as a whole and take phylogenetic uncertainty into account
when interpreting the trees.
For example, the consensus tree may show a certain taxon to be polyphyletic, but how high
are the chances that it is actually monophyletic, and can we dismiss this idea completely?
Bayesian tree samples offer good possibilities to test competing phylogenetic hypotheses
against each other.
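A clade's posterior probability is simply the fraction of sampled trees that contain it, so such tests amount to counting trees. A toy Python sketch (trees reduced to sets of clades; a real analysis would parse the sampled Newick trees):

```python
def clade_support(tree_sample, clade):
    """Posterior probability of `clade`: the fraction of sampled trees
    containing it. Each tree is represented simply as a set of clades
    (frozensets of taxon names)."""
    clade = frozenset(clade)
    hits = sum(1 for clades in tree_sample if clade in clades)
    return hits / len(tree_sample)

# Hypothetical sample of four trees: in three of them A+B form a clade.
sample = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("AC"), frozenset("ABC")},
]
print(clade_support(sample, "AB"))  # 0.75
```

Even if a clade is absent from the consensus tree, its support can still be read off the tree sample in exactly this way.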
The main point of criticism against Bayesian analyses is that we need to define prior expectations for model parameters. For example, a grossly wrong mean value for the prior distribution
of branch lengths may lead to absurdly long consensus trees when compared with ML trees.
The solution in this case is to choose the mean length of branches from an ML analysis as the
mean of the prior distribution in the Bayesian analysis. In the following, we will perform a simple
Bayesian analysis, which also allows us to apply our skills in writing command blocks.
1. Copy the NEXUS file into the folder in which MrBayes is stored.
2. Open the NEXUS file in a text editor or in PAUP and change to the file view window (stopping any running calculations first, if necessary). Put the PAUP block from the previous
analysis into square brackets.
3. If anything is written between “#NEXUS” and “begin data;” delete this text. The first lines
of the file should now look like this:
#NEXUS
begin data;
4. Make sure that the next line looks like this:
format datatype=dna interleave missing=N gap=-;
5. Move to the end of the file and add another line: begin mrbayes;
6. Just like the PAUP block that we have written before, this MrBayes block defines the settings and parameters for the analysis. Insert a line below this command that tells
MrBayes to close the program window after the analysis is finished, and add another line
specifying the name of a log file. This ensures that the screen output is also saved as a
file for future reference. This file will be replaced if you re-run the analysis. If you do not
want this, write append instead of replace.
set autoclose=yes;
log start filename=[file name]_log replace;
7. MrBayes allows the use of different substitution models for different gene loci or codon
positions. We will use the partitioning scheme and the models from PartitionFinder. For
each data partition add a line that starts with charset. Add a name for the data partition
and describe its position in the alignment as follows.
charset A = 1-xxxx\3; [matK 1st position]
charset B = yyyz-zzzz\3 2-zzzz\3; [rbcL and atp1 3rd positions]
To avoid typing errors keep the names as simple as possible and avoid blank spaces,
numbers and full stops. Different parts of the alignment may be combined into one partition. Use a blank space, not a comma, to separate them. Finish each line with a semicolon. Include more descriptive names of partitions in square brackets to aid your memory.
8. MrBayes may successively analyse several datasets, for example calculate trees first for
every partition separately and then a combined analysis. Defining the partitions is therefore not sufficient. We also have to tell MrBayes which partitions to use as the dataset for
the analysis. The command for this is partition followed by a name and the number
and names of the partitions to be used. Several different datasets may be defined.
partition threegenes=5:matK_1, rbcL_2_atp1_2 ...;
9. Now tell MrBayes which dataset to analyse:
set partition=threegenes;
10. Like in PAUP, the definition of substitution models is initiated by the command lset. You
have to define as many models as there are partitions in the dataset, and then tell
MrBayes for which partition the respective model is used by applyto=(number). Next
you define the number of different substitution rates (1, 2 or 6 as for the ML analysis).
The rates command allows you to distinguish between models with equal rates for all positions (equal), gamma-distributed site-specific rates (gamma), a proportion of invariable
sites (propinv) and both (invgamma). The command lines look like this:
lset applyto=(1) nst=[value] rates=[value];
lset applyto=(2) nst=[value] rates=[value];
11. Models also differ in the way in which they treat base frequencies. Some simple models
assume that all nucleotides are equally frequent in the dataset. As a default, MrBayes
assumes unequal base frequencies and tries to optimize the relative frequencies using a
Dirichlet prior distribution. If some of your models require equal base frequencies, you
have to tell MrBayes to use a different “state frequency prior”. Make sure that the prior applies to the correct data partition:
prset applyto=(1) statefreqpr=fixed(equal);
prset applyto=(4) statefreqpr=fixed(equal);
At this stage you may change the prior distributions of all possible parameters, including
the branch length prior (see page 26).
12. If different models contain the same type of parameters, e. g. two models that both use
gamma-distributed site specific rates, MrBayes will try to find a single value for these parameters which is then applied to all partitions. If you want MrBayes to infer the values for
these parameters independently, you have to unlink them across partitions. The
command is:
unlink [Parameter]=(all);
In our case we want to optimize the shape parameter alpha, the proportion of invariable
positions, the nucleotide frequencies and the substitution rates independently. This is
done by the following row of commands:
unlink shape=(all) pinvar=(all) statefreq=(all) revmat=(all);
Alternatively, you may only decouple a parameter between some of the partitions. This
command would look like: unlink shape=(number,number,...);
13. You have probably noticed that, in spite of these additional options, the definition of substitution models is less specific in MrBayes than in PAUP or Garli. We only have to
describe how many and which different parameters there are (e. g. the number of different substitution rates, the shape parameter of the gamma distribution etc.). The basic
model is specified, but we do not specify the parameter values as we have done for
PAUP or Garli. These values are optimized by the Markov chain while it is running.
14. Next you have to define settings for the Markov chain. The most important settings concern the number of generations to be run (ngen) and the frequency at which trees are
sampled (samplefreq). You may also determine how often the progress of the analysis
shall be printed to the screen (printfreq) and whether you want trees with branch
lengths or without (savebrlens). A special command concerns the question of how
many parallel Markov chains will be run (nchains). The following command line tells
MrBayes to run 1 million generations using four parallel Markov chains, sample and print
every 100th generation to the screen and save trees with branch lengths. Many more options may be specified.
mcmc ngen=1000000 printfreq=100 samplefreq=100 nchains=4 savebrlens=yes filename=[file name]_out;
15. Finally, you want MrBayes to not just sample 10,000 trees but also to summarize them in
a consensus tree (sumt) and also summarize the parameters of the substitution models
in a separate file (sump). The respective command lines look like this:
sump filename=[file name]_out relburnin=yes burninfrac=0.5;
sumt filename=[file name]_out relburnin=yes burninfrac=0.5 contype=allcompat;
The commands relburnin=yes and burninfrac=0.5 mean that MrBayes discards
the first 50 % of the sampled trees as burnin. contype=allcompat tells the program to
include clades in the consensus tree even if they have less than 50 % support, as long
as they are compatible with the more highly supported groups.
16. The end of the MrBayes block is defined by:
end;
17. Save the dataset in your MrBayes folder.
18. Open this folder and double click on the MrBayes executable. A command prompt will
open in which you have to type execute [file name] to start the analysis. The program will begin to report every single step of the analysis on the screen. If no error messages appear, every 100th generation of the Markov chain will then be shown with likelihood values of the sampled trees in brackets. In the right column the program tells you
how long it will still need to finish the analysis. After it has finished, the program will summarize the results on the screen and save them in a number of different files.
19. Among others we see a rather crude plot on the screen in which MrBayes illustrates the
development of the likelihood values during the run. After a steep increase these values
will reach a plateau. You have to discard all trees that were sampled before the plateau.
We have anticipated this decision by telling MrBayes to discard the first half of the trees.
Depending on the number of different models and parameters the screen shows a long
row of values for: tree length TL, transition/transversion rates (kappa), substitution rates
(r…), nucleotide frequencies (pi…), gamma shape parameters (alpha) and proportions of
invariable sites (pinvar). These data are also stored under the extension *.pstat.
20. After this, we get a list of taxa and clades (called “bipartitions”), another list with posterior
probabilities of clades (bipartitions), a list of branch lengths with standard deviations, a
tree with posterior clade probabilities (called “credibility values”) and another tree with
branch lengths. These data are stored in the log-file and a file with the extension *.parts.
21. A list of the different trees sampled by the Markov chain is stored under the extension
*.trprobs. This file also contains the probabilities of single trees and the cumulative probabilities of the trees. This allows us to not only concentrate on the consensus tree of the
analysis, but also evaluate sets of trees, for example those trees that together have a
posterior probability of 95 or 99 %.
22. The most important file is the consensus tree with branch lengths and posterior probabilities (*.con.tre), which we open and edit in FigTree as usual.
23. If you want to check the probability of a clade that is not part of the consensus tree, you
can do so by opening the *.parts file in a text editor such as Word.
24. The sampled trees are stored under *.t, and the sampled model parameters under *.p,
separately for each run.
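Two of the post-processing steps above — discarding the burnin (step 15) and reading a credible set of trees out of the *.trprobs probabilities (step 21) — can be sketched in a few lines of Python (toy numbers, not real MrBayes output):

```python
def discard_burnin(samples, burninfrac=0.5):
    """Drop the first fraction of MCMC samples (what relburnin/burninfrac do)."""
    return samples[int(len(samples) * burninfrac):]

def credible_set(tree_probs, level=0.95):
    """Smallest set of trees whose cumulative posterior probability
    reaches `level` (the information stored in the *.trprobs file)."""
    chosen, total = [], 0.0
    for tree, p in sorted(tree_probs.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(tree)
        total += p
        if total >= level:
            break
    return chosen

# Invented log likelihoods: a steep climb followed by a plateau.
lnl = [-900.0, -700.0, -550.0, -501.0, -500.0,
       -499.0, -500.0, -501.0, -500.0, -499.0]
kept = discard_burnin(lnl)            # only the plateau half of the chain
# Invented tree probabilities as they might appear in *.trprobs.
probs = {"tree1": 0.60, "tree2": 0.25, "tree3": 0.12, "tree4": 0.03}
print(credible_set(probs))            # ['tree1', 'tree2', 'tree3']
```

Here the three best trees together already exceed 95 % posterior probability, so any hypothesis incompatible with all three can be rejected at that level.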
The Molecular Clock
Systematists are often interested in more than just the phylogenetic relationships of their pet
organisms. For example, one might also want to date evolutionary events. Temporal information
might help to answer questions regarding biogeography and evolutionary mechanisms.
Zuckerkandl and Pauling (1965) proposed that the genome accumulates substitutions at a constant rate. This idea of a “molecular clock” was also part of the “neutral” theory of evolution by
Kimura (1968, 1983). According to this theory molecular evolution is a random process that depends mainly on time.
If the hypothesis of a molecular clock is true we would expect that the genetic distance among
taxa is proportional to the time that has passed since these taxa have descended from their
most recent common ancestor. In the tree below the divergence time between taxa A and O is
identical to that between taxa B and O. Hence, we would expect that the genetic distances dAO
and dBO are also equal.
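Under a strict clock, the expected distance between two taxa is twice the substitution rate multiplied by their divergence time, because both lineages accumulate changes independently after the split. With a calibrated rate this converts distances into ages; a sketch with hypothetical numbers:

```python
def divergence_time(distance, rate):
    """Divergence time under a strict clock: t = d / (2r).

    distance: genetic distance between two taxa (substitutions per site);
    rate: substitution rate per site per million years. The factor 2
    accounts for the two lineages evolving independently since the split.
    """
    return distance / (2 * rate)

# Hypothetical values: 10 % sequence divergence at 0.005 subst./site/Myr.
print(divergence_time(0.10, 0.005))  # 10.0 (million years)
```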
The following table shows the data that induced Zuckerkandl and Pauling (1965) to postulate
the molecular clock. It shows differences between amino acid sequences of α-hemoglobin in
shark, carp, newt, chicken, echidna, kangaroo, dog and humans. The paleontological evidence suggests that the shark is the outgroup for all other taxa. Accordingly, the genetic distances between shark and the other vertebrates should be equal. The same logic suggests that the genetic distances between human and the other taxa increase with the distance of the relationship. Indeed we see similar distances in each row and more or less decreasing distances in each column.
Strict molecular clocks
It is possible to compute phylogenetic trees in which the distance from the root to the tips is
identical for all taxa. In such an ultrametric tree the branch lengths are proportional to time. If we
know the age of a single node in such a tree, we can calibrate all other nodes and make inferences about the evolution of groups for which we might not have any fossil evidence. This
sounds like a great idea, but one has to be cautious. The substitution models that we have discussed above do not assume that substitution rates are constant over all times and taxonomic
groups. In fact we know that this is not the case. Rodents, for example, have much higher substitution rates than primates, partly because of their shorter generation times. Without the clock
assumption, the branch lengths of a tree are adjusted freely to optimize the likelihood of the
tree. They represent the product of time and (variable) substitution rate and can have very unequal lengths. Forcing the program to assume equal rates for all taxa and branches changes the
substitution model substantially. We can compare the likelihood of a given tree with and without
an enforced clock. The clock tree always has a lower likelihood than a tree in which branch
lengths can vary freely. If the difference is not too big, a strict molecular clock reconstruction
may be justified. Unfortunately, in most cases the difference in likelihood is so big that there is
simply no justification for using a strict clock.
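This comparison is commonly formalized as a likelihood-ratio test: twice the log-likelihood difference is compared with a chi-square distribution with n − 2 degrees of freedom for n taxa. A sketch with made-up likelihood values (the test itself is standard textbook practice, not something the programs above run for you):

```python
def clock_lrt(lnl_free, lnl_clock, n_taxa):
    """Likelihood-ratio statistic for a strict-clock test.

    LR = 2 * (lnL_free - lnL_clock), to be compared with a chi-square
    distribution with n_taxa - 2 degrees of freedom."""
    return 2.0 * (lnl_free - lnl_clock), n_taxa - 2

# Hypothetical log likelihoods for a 12-taxon dataset.
lr, df = clock_lrt(lnl_free=-5230.4, lnl_clock=-5244.9, n_taxa=12)
# lr is about 29.0 with df = 10; the chi-square critical value for
# df = 10 at alpha = 0.05 is about 18.31, so here the strict clock
# would be rejected.
```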
Relaxed molecular clocks
Instead of simply assuming equal rates it is possible to treat substitution rate and time as different parameters to be optimized during the computation of the tree. Such models are called “relaxed molecular clocks” to distinguish them from the strict clocks. A relaxed clock tree enables
us to date nodes while at the same time allowing substitution rates to vary across branches. The
most complex models assign a different rate to every branch.
One of the most frequently used models is the so-called “uncorrelated log-normal (ULN) relaxed
molecular clock” implemented in the computer program BEAST. The way in which BEAST assigns substitution rates to branches resembles the way in which site specific substitution rates
are defined. Rates are drawn randomly from a logarithmic normal distribution (see the figure
below) and assigned to branches of the tree. This is done by a Markov chain which also varies
the topology and branch lengths (and the parameters of the substitution model).
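The rate-drawing step at the core of this model can be sketched with Python's standard library (branch names, durations and lognormal parameters are invented; BEAST of course samples all of this jointly with the tree):

```python
import random

def relaxed_branch_lengths(durations, mu, sigma, seed=None):
    """Uncorrelated lognormal relaxed clock, reduced to its core idea:
    every branch gets an independent rate drawn from a lognormal
    distribution, and branch length = rate * time."""
    rng = random.Random(seed)
    return {branch: rng.lognormvariate(mu, sigma) * time
            for branch, time in durations.items()}

# Hypothetical branch durations in million years.
branches = {"root->A": 12.0, "root->B": 12.0, "B->C": 5.0}
lengths = relaxed_branch_lengths(branches, mu=-5.3, sigma=0.5, seed=42)
# The two 12-Myr branches get different lengths because their rates differ.
```

Because the rates are drawn independently, the result is no longer ultrametric in substitutions, even though the underlying node times are known — which is exactly why such a tree can be dated.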
Calibrating the molecular clock
A clock tree represents the relative ages of nodes in comparison to each other. Such a phylogeny is of little use when you cannot assign absolute ages (mostly in millions of years) to some of
the nodes. One may use very different sources of information from geology and paleontology to
date phylogenies. A dated fossil may be used if it shows at least one synapomorphy that is typical for a clade of the tree. This is taken as evidence that the respective clade already existed at
the time of this fossil, which is then taken as the minimal age for the respective clade. Geological events such as the origin of volcanic islands, the uplift of mountains or the split up of distributional ranges due to continental drift are also used to calibrate clock trees. Some of these
calibrations have to be treated with caution. Taking the age of an island as the age of an endemic taxon may underestimate the real age, because the taxon may have evolved much earlier
and colonized the island at a later stage.
The vast majority of molecular clocks are computed with Bayesian methods. One of the reasons
is the possibility to consider the age of certain nodes as prior information, which should increase
the accuracy of the reconstruction. This is one of the few cases where we can replace a noninformative prior with an informative one by restricting the age of certain nodes with prior distributions of a certain shape. The illustration below shows a few possible distributions.
A uniform distribution with an upper and a lower limit can be selected when a minimum and
maximum age of a fossil taxon is known. Lognormal and exponential distributions are suitable
for fossils with a minimum age. The normal distribution is appropriate for tectonic or biogeographical data. Nodes for which no fossil (or other) data is available receive a non-informative
prior age, i. e. their age is not restricted in any direction.
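For a fossil that provides only a minimum age, such a prior can be sketched as a lognormal distribution shifted by that minimum (hypothetical numbers; this is a simplified stand-in for BEAUti's offset parameterization, not its exact implementation):

```python
import random

def fossil_calibration_prior(min_age, mu, sigma, n, seed=None):
    """Sample node ages from a lognormal prior with a hard minimum:
    age = min_age + Lognormal(mu, sigma). The lognormal part places
    most of the probability shortly above the fossil's minimum age."""
    rng = random.Random(seed)
    return [min_age + rng.lognormvariate(mu, sigma) for _ in range(n)]

# Hypothetical fossil with a minimum age of 23 Myr.
ages = fossil_calibration_prior(min_age=23.0, mu=1.0, sigma=0.6, n=1000, seed=7)
# Every sampled age respects the minimum; older ages remain possible
# but become increasingly improbable.
```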
The following molecular clock analysis requires the use of a whole armada of programs:
BEAST, BEAUti, TreeAnnotator (from the BEAST package), Tracer and FigTree. BEAUti is
used to produce an input file for BEAST.
1. Open BEAUti by double-clicking on the executable. On top of the central window you see
a row of tabs. We will go through these tabs from left („Partitions“) to right („MCMC“) to
produce the input file. First load the NEXUS-file by pressing >+ on the lower left of the
screen and navigating to the Nexus file used in the Bayesian analysis.
2. BEAUti recognizes the character sets defined in the MrBayes block. For every data partition you should see a separate white row in the grey central window. If you only see a
single line with the file name in the first column you may have put the MrBayes block into
square brackets or chosen a wrong file. In this case, close the file by pressing >-, choose
the correct file or remove square brackets and load the file once more.
3. For each partition the number of taxa, sites (nucleotide positions) and a few descriptors
of the substitution model are listed. In order to use the substitution and partitioning model
calculated with PartitionFinder, mark all Partition Names and choose >Unlink Subst.
4. The next tab >Taxa allows us to define nodes for calibrating the molecular clock. Press
>+ and a taxon set “untitled” appears in the left field. You can see all taxa in the data set
in the central field “Excluded Taxa”. Mark those that you want to combine in a node (e. g.
all gymnosperms) and transfer them to the field “Included Taxa” by pressing the green
arrow in the center.
5. Double-clicking “untitled” allows you to change the name of the taxon set.
6. Choose >Mono? if you want the group to appear monophyletic in your tree.
7. In the last column you can insert a phylogenetic age for the group. This age may refer to
the base of the branch on which the group is sitting (the “stem age”) or to the node from
which the group diversifies (the “crown age”). >Stem? allows you to define the age as
the stem age of the group.
8. You may define as many taxa sets as you like by clicking >+.
9. Because we do not have ages for the tips of our tree or traits to map, we skip the tabs
“Tips” and “Traits” and go straight to >Sites.
10. Here we can define substitution models for each of our data partitions. Check the output
from PartitionFinder and choose the model parameters that come closest to the models suggested there.
11. Under >Clocks you may choose a >Model for the calculation of molecular clocks, e. g. a
>Strict Clock or a > Lognormal Relaxed Clock (the recommended option).
12. BEAST is also used for population genetic purposes. Genes are usually exchanged between members of a population but not between different taxa. Under >Trees we can
therefore choose among different mathematical models. Select >Tree Prior >Speciation: Yule Process. This option considers only speciation but not extinction, in contrast to
>Speciation: Birth-Death Process.
13. We skip the >States tab, because we do not want to reconstruct ancestral character
states on this tree.
14. The prior distributions of all model parameters have to be selected under >Priors. Luckily, BEAUti suggests default options for most of these. The default options are marked by
a * symbol and BEAUti alerts you to check whether they are appropriate. You may verify
this by clicking on the priors and selecting >ok in the pop-up window. For this course let
us just assume that BEAUti suggested meaningful priors. However, we must check priors that BEAUti regards as inappropriate and marks in red.
15. We skip >Operators and move on to >MCMC.
16. We define the length of the Markov chain under >Length of chain: (at least 10 million
generations for a good analysis). >Log parameters every: defines how often the chain
saves trees and parameters while it is running. This depends on the length of the chain.
For 10 million generations one should save every 1000th tree.
17. With >File name stem we can select a name for the BEAST input and all output files. Do
not change the name, because in the past this used to cause program crashes etc.
18. >Generate BEAST File finally generates the input file (*.xml). A window opens and
warns you about model priors that you have not yet checked. Press >Continue and
choose a folder in which the file is saved.
19. Running BEAST is now easy. Close BEAUti and double click on “BEAST v1.8.2.exe”.
20. >Choose File... opens another window. Load the *.xml input file generated by BEAUTi.
21. Because we have not installed the BEAGLE library we must deactivate >Use BEAGLE
library if available. Start the analysis by pressing >Run. The front window will close.
Similar to MrBayes the progress of the analysis will be printed to another window.
22. As soon as the analysis is finished a number of statistics appear in this window. BEAST
also suggests which parameters should be changed to improve the analysis.
23. The program produces two output files (*.trees and *.log). The first contains all sampled
phylogenies, the second all parameters that were sampled during the analysis.
24. The results of the analysis have to be summarized into a consensus tree. Open “TreeAnnotator” by double-clicking on the icon.
25. Insert the number of sampled trees that shall be discarded as burnin behind >Burnin
(as states):. Remember that this is not the number of generations but the number of
sampled trees to be discarded. Because our Markov chain was rather short, we discard
50% of the trees.
26. Behind >Input Tree File: select >Choose File... to open the tree file produced by
BEAST (*.trees) and select a name and location behind >Output File:. Both times a popup window appears. Select a file as input and insert a name for the output file. TreeAnnotator will produce a file with this name.
27. >Run starts the calculations in a different window. After the output file has been written,
open it in FigTree. The tree should contain node ages with 95% confidence intervals and
posterior probabilities for all nodes. Try to find out how to depict this information in FigTree.
Avery, O. T., MacLeod, C. M. & McCarty, M. 1944. Studies on the chemical nature of the substance inducing transformation of pneumococcal types: Induction of transformation by a desoxyribonucleic acid fraction isolated
from Pneumococcus type III. Journal of Experimental Medicine 79: 137-158.
Camin, J.H. & Sokal, R.R. 1965. A method for deducing branching sequences in phylogeny. Evolution 19: 311-326.
Cavalli-Sforza, L.L. & Edwards, A.W.F. 1967. Phylogenetic analysis: Models and estimation procedures. Evolution
21: 550-570.
Cavalli-Sforza, L. L., Menozzi, P. & Piazza, A. 1994. The History and Geography of Human Genes. Princeton:
University Press.
Chargaff, E., Lipshitz, R., Green, C. & Hodes, M. E. 1951. The composition of the desoxyribonucleic acid of salmon
sperm. Journal of Biological Chemistry 192: 223-230.
Darwin, C. 1859. On the origin of species by means of natural selection, or the preservation of favoured races in
the struggle for life. London: John Murray.
Edwards, A.W.F. & Cavalli-Sforza, L.L. 1963. The reconstruction of evolution. Heredity 18: 553.
Edwards, A.W.F. & Cavalli-Sforza, L.L. 1964. Reconstruction of evolutionary trees. In: Heywood, V.H. & McNeill, J.
(eds.) Phenetic and Phylogenetic Classification. London: Systematics Association Publ. 6, pp. 67-76.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution 17: 368-376.
Felsenstein, J. 2004. Inferring phylogenies. Sunderland, MA, Sinauer.
Fitch, W.M. 1971. Toward defining the course of evolution: Minimum change for a specified tree topology. Systematic Zoology 20: 406-416.
Fitch, W.M. & Margoliash, E. 1967. Construction of phylogenetic trees. Science 155: 279-284.
Fox, G. E., Stackebrandt, E., Hespell, R. B., Gibson, J., Maniloff, J., Dyer, T. A., Wolfe, R. S., Balch, W. E., Tanner,
R. S., Magrum, L. J., Zablen, L. B., Blakemore, R., Gupta, R., Bonen, L., Lewis, B. J., Stahl, D. A., Luehrsen,
K. R., Chen, K. N. & Woese, C. R. 1980. The phylogeny of prokaryotes. Science 209: 457-463.
Haeckel, E. 1866. Generelle Morphologie der Organismen. Allgemeine Grundzüge der organischen Formen-Wissenschaft. Berlin: Georg Reimer.
Jukes, T.H. & Cantor, C.R. 1969. Evolution of protein molecules. In: Munro, M.N. (ed.) Mammalian Protein Metabolism. Vol. 3. New York: Academic Press, pp. 21-132.
Kimura, M. 1968. Evolutionary rate at the molecular level. Nature 217: 624-626.
Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge, University Press.
Knoop, V., Müller, K. 2006. Gene und Stammbäume. Ein Handbuch zur molekularen Phylogenetik. Heidelberg,
Elsevier, Spektrum Akademischer Verlag.
Lanfear, R., Calcott, B., Ho, S. Y. W. & Guindon. S. 2012. PartitionFinder: combined selection of partitioning
schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution 29: 1695–
Mendel, J.G. 1866. Versuche über Pflanzen-Hybriden. Verhandlungen des naturforschenden Vereines in Brünn 4:
Michener, C.D. & Sokal, R.R. 1957. A quantitative approach to a problem in classification. Evolution 11: 130-162.
Miescher, F. 1874. Die Spermatozoen einiger Wirbelthiere. Ein Beitrag zur Histochemie. Verhandlungen der Naturforschenden Gesellschaft Basel 6: 138-208.
Mullis, K., Faloona, F., Scharf, S., Saiki, R., Horn, G. & Erlich, H. 1986. Specific enzymatic amplification of DNA in
vitro: the polymerase chain reaction. Cold Spring Harbour Symposia on Quantitative Biology 51: 263-273.
Mullis, K. & Faloona, F. 1987. Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction. Methods in Enzymology 155: 335-350.
Saitou, N. & Nei, M. 1987. The neighbour-joining method: A new method for reconstructing phylogenetic trees.
Molecular Biology and Evolution 4: 406-425.
Sanger, F., Nicklen, S. & Coulson, A.R. 1977. DNA sequencing with chain-terminating inhibitors. Proceedings of
the National Academy of Sciences of the USA 74: 5463-5467.
Schulte, K., Naduvilezhath, L., Nierbauer, K.U., Silvestro, D., Michalak, I., Metzler, D., Zizka, G. & Printzen, C.
2013. E-volution: Der E-Learning-Kurs zur phylogenetischen Analyse. Beta-Version. URL:
Sober, E. 2008. Evidence and Evolution. The logic behind the science. Cambridge University Press.
Sokal, R. R. & Rohlf F. J. 1995. Biometry. The principles and practice of statistics in biological research. 3 ed.
New York, Freeman & Co.
Stanier, R. Y., Doudoroff, M. & Adelberg, E. A. 1964. The Microbial world (2nd ed.). Englewood Cliffs N.J.: Prentice-Hall.
Watson, J. D. & Crick F. H. 1953. Molecular structure of nucleic acids: A structure for deoxyribose nucleic acid.
Nature 171: 737–738.
Wu, R. & Kaiser, A. D. 1968. Structure and base sequence in the cohesive ends of bacteriophage lambda DNA.
Journal of Molecular Biology 35: 523-537.
Yang, Z. & Rannala, B. 1997. Bayesian phylogenetic inference using DNA sequences: A Markov Chain Monte Carlo method. Molecular Biology and Evolution 14: 717-724.
Zar, J. H. 1999. Biostatistical analysis. 4 ed. Upper Saddle River, NJ, Prentice Hall.
Zhu, L. & Gigerenzer G. 2006. Children can solve Bayesian problems: the role of representation in mental computation. Cognition 98: 287-308.
Zuckerkandl, E. & Pauling, L. 1962. Molecular disease, evolution, and genetic heterogeneity. In: Kasha, M. & Pullman, B. (eds.). Horizons in Biochemistry. New York: Academic Press, pp. 189–225.
Zuckerkandl, E. & Pauling, L. 1965. Evolutionary divergence and convergence in proteins. In: Bryson, V. & Vogel,
H.J. (eds). Evolving genes and proteins. New York: Academic Press, pp 97–166.
Appendix 1: Commands to Define Substitution Models in MrBayes
Models with unequal base frequencies (estimated by MrBayes, the default):

GTR + I            lset applyto=() nst=6 rates=propinv;
GTR + gamma        lset applyto=() nst=6 rates=gamma;
GTR + I + gamma    lset applyto=() nst=6 rates=invgamma;

HKY + I            lset applyto=() nst=2 rates=propinv;
HKY + gamma        lset applyto=() nst=2 rates=gamma;
HKY + I + gamma    lset applyto=() nst=2 rates=invgamma;

F81 + I            lset applyto=() nst=1 rates=propinv;
F81 + gamma        lset applyto=() nst=1 rates=gamma;
F81 + I + gamma    lset applyto=() nst=1 rates=invgamma;

Models with equal base frequencies (add a prset line):

SYM                lset applyto=() nst=6;
                   prset applyto=() statefreqpr=fixed(equal);
SYM + I            lset applyto=() nst=6 rates=propinv;
                   prset applyto=() statefreqpr=fixed(equal);
SYM + gamma        lset applyto=() nst=6 rates=gamma;
                   prset applyto=() statefreqpr=fixed(equal);
SYM + I + gamma    lset applyto=() nst=6 rates=invgamma;
                   prset applyto=() statefreqpr=fixed(equal);

K2P (=K80)         lset applyto=() nst=2;
                   prset applyto=() statefreqpr=fixed(equal);
K2P + I            lset applyto=() nst=2 rates=propinv;
                   prset applyto=() statefreqpr=fixed(equal);
K2P + gamma        lset applyto=() nst=2 rates=gamma;
                   prset applyto=() statefreqpr=fixed(equal);
K2P + I + gamma    lset applyto=() nst=2 rates=invgamma;
                   prset applyto=() statefreqpr=fixed(equal);

JC (Jukes-Cantor)  lset applyto=() nst=1;
                   prset applyto=() statefreqpr=fixed(equal);
JC + I             lset applyto=() nst=1 rates=propinv;
                   prset applyto=() statefreqpr=fixed(equal);
JC + gamma         lset applyto=() nst=1 rates=gamma;
                   prset applyto=() statefreqpr=fixed(equal);
JC + I + gamma     lset applyto=() nst=1 rates=invgamma;
                   prset applyto=() statefreqpr=fixed(equal);
Appendix 2: Example MrBayes-block
log start filename=example_log append;
[MrBayes writes a “checkpoint” file in which the parameters of the analysis are stored every 100,000 generations. If the analysis is unexpectedly stopped, append allows you to restart it from the last “checkpoint”.]
set autoclose=yes;
delete [taxon name];
[deletes single taxa from the dataset]
outgroup [taxon name];
[specifies an outgroup]
charset A = 1-118;
charset B = 270-371;
charset C = 119-269;
charset D = 980-1881;
charset E = 1883-2581\3;
charset F = 1884-2581\3;
charset G = 2582-3661\3;
[character sets defining the seven partitions referred to by the partition command below]
partition PartFind = 7: A, B, C, D, E, F, G;
set partition = PartFind;
lset applyto=(all) nucmodel=4by4;
[other options for the nucleotide model are doublets for stem regions of ribosomal RNA (16 character states) and codon for protein
sequences (64 character states). These models are computationally intensive!]
lset applyto=(1) nst=6 rates=gamma ngammacat=6;
lset applyto=(2) nst=6 rates=invgamma ngammacat=6;
[SYM+I+G, equal nucleotide frequencies, see prior settings below]
lset applyto=(3) nst=2 rates=invgamma ngammacat=6;
lset applyto=(4) nst=2 rates=gamma ngammacat=6;
[K2P+G, equal nucleotide frequencies, see prior settings below]
lset applyto=(5) nst=2 rates=gamma ngammacat=6;
[K2P+G, equal nucleotide frequencies, see prior settings below]
lset applyto=(6) nst=6 rates=propinv;
lset applyto=(7) nst=2 rates=propinv;
prset applyto=(all) ratepr=variable brlenspr=unconstrained:exp(18) topologypr=uniform;
[brlenspr sets the prior distribution from which branch lengths are drawn. The options are unconstrained and clock (for a molecular clock analysis). exp(18) means an exponential distribution with mean 1/18. The default is exp(10), i.e. a mean of 1/10, but in some cases this may lead to grossly wrong branch lengths. A way to find a more suitable mean is to calculate an ML tree and then take the mean branch length of that tree as the mean for the branch-length prior. Remember that the number in parentheses is the reciprocal of this mean!]
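The arithmetic behind choosing this prior can be sketched in a few lines of Python (the mean branch length of 0.055 is a hypothetical value for illustration, not taken from a real tree):

```python
# Hedged sketch: converting a mean branch length into the rate of the
# exponential branch-length prior, brlenspr=unconstrained:exp(rate).
# The exponential distribution exp(rate) has mean 1/rate, so the number
# in parentheses is the reciprocal of the desired mean branch length.

def brlen_prior_rate(mean_branch_length: float) -> float:
    """Return the rate to pass to brlenspr=unconstrained:exp(rate)."""
    return 1.0 / mean_branch_length

# Hypothetical example: an ML tree whose mean branch length is 0.055
# substitutions per site suggests a prior of roughly exp(18).
print(round(brlen_prior_rate(0.055)))  # → 18
```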
prset applyto=(1) shapepr=exponential(5);
[shapepr sets the prior distribution for the shape parameter alpha of the gamma distribution.]
prset applyto=(2) statefreqpr=fixed(equal) shapepr=exponential(5);
prset applyto=(4) statefreqpr=fixed(equal) shapepr=exponential(5);
prset applyto=(5) statefreqpr=fixed(equal) shapepr=exponential(5);
unlink statefreq=(1,3,6,7) revmat=(1,2,6) tratio=(3,4,5,7) shape=(1,2,3,4,5) pinvar=(2,3,6,7);
[State frequencies are fixed for partitions 2, 4 and 5. They only have to be unlinked between the other partitions. A full rate matrix (revmat) is
only present for the GTR and SYM models (partitions 1, 2, 6). The HKY and K2P model (3, 4, 5, 7) only have a transition/transversion ratio.]
mcmc append=yes ngen=40000000 nruns=3 nchains=6 temp=0.15 printfreq=1000 samplefreq=200 mcmcdiagn=yes diagnfreq=100000 diagnstat=avgstddev minpartfreq=0.10 relburnin=yes burninfrac=0.5 stoprule=yes stopval=0.01 starttree=random filename=example_out;
[append=yes allows you to restart the analysis if it is unexpectedly stopped (power failure etc.). diagnfreq specifies how often the average standard deviation of split frequencies (avgstddev) is compared between runs; poorly supported branches will vary widely between trees. minpartfreq=0.10 specifies that only splits with a frequency of 0.1 or higher are compared. relburnin sets the burn-in proportion discarded before the comparison. stoprule=yes tells MrBayes to stop the analysis when the difference among runs drops below the value specified by stopval.]
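What this diagnostic measures can be sketched in a few lines of Python (the split names and frequencies below are invented for illustration, not output of a real MrBayes run):

```python
# Hedged sketch of the average standard deviation of split frequencies
# (avgstddev): for each split (branch) that reaches the minimum frequency
# in at least one run, take the standard deviation of its frequency
# across independent runs, then average over those splits.
from statistics import pstdev

def avg_stddev_split_freqs(runs, minpartfreq=0.10):
    """runs: one dict per independent run, mapping split -> frequency."""
    splits = set().union(*(run.keys() for run in runs))
    stddevs = []
    for split in splits:
        freqs = [run.get(split, 0.0) for run in runs]
        if max(freqs) >= minpartfreq:  # skip splits below the minimum frequency
            stddevs.append(pstdev(freqs))
    return sum(stddevs) / len(stddevs)

# Two hypothetical runs that sample the dominant split at similar frequencies:
run1 = {"AB|CD": 0.95, "AC|BD": 0.05}
run2 = {"AB|CD": 0.97, "AC|BD": 0.03}
print(avg_stddev_split_freqs([run1, run2]))  # roughly 0.01: runs agree closely
```

When this value drops below stopval (0.01 in the example block above), the independent runs are sampling essentially the same tree distribution and the analysis can stop.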
plot filename=example_out;
sumt filename=example_out relburnin=yes burninfrac=0.5 nruns=3 contype=allcompat;
sump filename=example_out relburnin=yes burninfrac=0.5 nruns=3;
log stop;