Mathematical simulations and Verhulst modeling of compositional changes in DNA sequences of

UNIVERSITY OF PRETORIA
Mathematical simulations and
Verhulst modeling of compositional
changes in DNA sequences of
acquired genomic islands due to
bacterial genome amelioration
Master Thesis
Xiaoyu Yu
11/13/2014
Submitted in partial fulfilment of the requirements for the degree MSc Bioinformatics,
Bioinformatics and Computational Biology Unit, Department of Biochemistry, School of
Biological Sciences, Faculty of Natural and Agricultural Sciences, University of Pretoria
Table of Contents
Declaration
Literature Review
1. Horizontal Gene Transfer
1.1. Background and Process
1.2. Advantages and Disadvantages
1.3. Genomic Islands
1.4. Evolution
2. Identification Tools
2.1. Phylogenetic Approach
2.2. Compositional Approach
2.3. Other Approaches
3. Amelioration
3.1. Concept of Genomic Amelioration
3.2. Project Aim
3.3. Project Objective
Introduction
Methods
Results
Discussions
Conclusion
Acknowledgement
References
Appendix
Declaration
I, Xiaoyu Yu, declare that the thesis, which I hereby submit for the degree MSc
Bioinformatics at the University of Pretoria, is my own work and has not previously
been submitted by me for a degree at this or any other tertiary institution.
SIGNATURE: ..…………………………..
DATE: ………………………………….
Literature Review
1) Horizontal Gene Transfer
1.1) Background and Process
Horizontal gene transfer (HGT), as the name states, is a process that transfers genetic
material from one organism to another by means other than genealogical reproduction
(the latter is known as vertical transfer of genetic material). The transfer can occur
between members of the same species or across different species; it occurs mainly among
prokaryotes and seldom in eukaryotes. There are at least three mechanisms of HGT:
conjugation, where a mobile element such as a plasmid actively replicates itself and is
transferred to another cell; transduction, where a gene of a host cell is packaged
within a virus and transferred to another host along with the virus genes; and
transformation, where fragments of DNA are picked up from the external environment
(Figure 1).
The first sign of HGT came from an experiment by Joshua Lederberg and Edward L. Tatum,
who observed a type of bacterial mating called conjugation. They observed that the
daughter cells were able to grow on a medium that could not support the growth of
either parent strain. Their experiments showed that this type of gene exchange requires
direct contact between bacteria (Lederberg and Tatum, 1946). In the early 1950s,
following the success of the study demonstrating genetic exchange in Escherichia coli,
the authors hypothesized further that all bacteria could undergo such a process, and
experiments on Salmonella typhimurium and other Salmonella serotypes began. The results
were positive, and the process observed, later named transduction, forms one of the
three main mechanisms of HGT (Zinder and Lederberg, 1962).
Fig 1. The process of HGT can happen in three different ways: conjugation, where a mobile element such as a plasmid
replicates itself and is transferred to the recipient; transformation, where DNA is picked up from the external
environment (e.g. from dead cells); and transduction, where genes are packaged in viruses and transferred when the
viruses change hosts.
In the mid-1980s, the phenomenon and impact of HGT were first described by Michael
Syvanen. His hypothesis was that cross-species gene transfer occurs in prokaryotes and
that it offers a new perspective on prokaryote evolution. The idea was that the
uniform genetic code shared across species allows the same transferred gene to be
exploited by different organisms. There were also other hints that genes can be
transferred between prokaryotes: plasmids, which carry DNA and replicate autonomously;
transfer by phages, as in the experiment by Freeman; and direct ingestion
(phagocytosis). By ingesting large amounts of genetic material, bacteria, which are the
main carrion eaters, are exposed to genes from most other species. All of these are
routes that lead to HGT, and they are the building blocks of the concept as we know it
today. The major point of Syvanen’s article was that lateral gene transfer is needed to
explain many observations that classical Darwinian theory cannot explain, and that it
opens up many new possibilities for how evolution proceeds (Syvanen, 1985).
1.2) Advantages and Disadvantages
From the bacterial point of view, HGT has two major advantages. The first is the gain
of a new function that is beneficial to the recipient through the transfer of a novel
gene from a foreign donor. This allows the species to evolve more rapidly than it
would by evolving independently. The second is that an organism that has lost a gene
through deletion or deleterious mutation can regain that gene through HGT from
another member of the population.
On the other side of the coin, there are also many disadvantages that can be very
detrimental to the host. When HGT occurs, the transferred genetic material is
somewhat random (see the complexity theory discussed later in the evolution section),
and the point of insertion is also random. Hence, the majority of the time HGT is not
beneficial, and one or more of the following can occur. Non-coding sequences could
lengthen the genome and hence increase its replication time. Genes with no function
(possibly because other interacting genes are not present in the recipient) or
duplicated genes could be expressed, increasing transcription and translation costs
without benefit to the host. Random insertion could render existing genes
non-functional or interfere with their function. Hence HGT is a high-risk,
high-reward process.
With numerous advantages and disadvantages, it is important to find how much HGT is
actually beneficial for an optimal rate of evolution. Vogan and Higgs modeled the
beneficial and detrimental effects of HGT (Vogan and Higgs, 2011). By testing different
values of the HGT rate and the gene loss rate within the model, they concluded that a
high HGT rate was favored when gene loss was high (owing to the second advantage above)
and vice versa. They further hypothesized that the earliest genomes, before the last
universal common ancestor, suffered high gene loss during replication and hence HGT
was favored. As genes spread rapidly, larger and fitter genomes were built, and these
genomes could then be passed down vertically with a lower gene loss rate. This is
consistent with modern prokaryotic genomes, for which the chance that a transferred
gene is beneficial is relatively low compared to earlier genomes. The probability of
detrimental effects from HGT is therefore higher, and a lower HGT rate is preferred.
When we look at the traditional portrait of the tree of life, we see that HGT is not
something that happens frequently. With the increasing number of complete genome
sequences available, we can see that some genes are highly conserved but slightly
discordant. If the rate of HGT were high, the majority of genome sequences from
different organisms would be similar and all organisms would function in the same
way. One explanation is that once the genome of an organism reaches a particular level
of complexity, gene transfer is no longer favorable, since the organism either already
has the gene or does not need it. In this way the rate of HGT decreases and reaches
the Darwinian Threshold (the time of the major transition of evolutionary mechanisms
from mostly horizontal to mostly vertical transfer). In this case HGT is a
disadvantage, and its rate is minimized in order to achieve better evolution, which
matches the conclusion that Vogan and Higgs drew from their model.
1.3) Genomic Islands
With the increasing importance of HGT within the bacterial world, a way is needed to
detect the actual transferred genetic region in order to better understand the effects
of HGT. A genomic island (GI) is a region of genetic material that is foreign to the
host organism and is thought to have been acquired by HGT. The idea came from
pathogenicity islands (PAI), which, as the name states, are regions of genetic material
that can cause a previously non-pathogenic bacterium to become pathogenic. PAIs were
named by Groisman and Ochman in 1996, in work where studies showed that viruses
transferred virulence traits from one strain to another and came out of a dormant state
after a certain amount of time. PAIs were then characterized as unstable regions with a
virulence-associated phenotype (Groisman and Ochman, 1996).
Other GI types include symbiosis (Sullivan et al., 2002), metabolic (Penn et al.,
2009), antibiotic synthesis and antibiotic resistance (Levings et al., 2005) and fitness
(Hacker and Carniel, 2001) islands. These GIs vary in size and are recognized by their
functional homology. GIs are normally between 10 and 200 kb long and can be recognized by
compositional means such as GC content, GC skew, and tetranucleotide and/or codon
frequency biases. Phylogenetic methods can also be used to find characteristics such as
16-20 bp direct repeats flanking both sides, which allow integration of GIs into the
target site. Other characteristics, such as cryptic genes encoding integrases and the
presence of insertion elements or transposons (Buchrieser et al., 1998;
Gal-Mor and Finlay, 2006), could also be present. These characteristics are all markers
for identifying GIs in target sequences.
Fig 2. The general structure of a genomic island, which has an irregular GC content compared to the rest of the
genome. Genomic islands are usually inserted after a tRNA gene and are flanked on both sides by direct repeats (DR).
They also contain an integrase as well as some insertion sequences and other genes. The type of genomic island is
determined by the type of genes it carries. Genomic islands can be described as virulence, symbiosis, metabolic,
antibiotic and fitness islands.
Even with current technology, HGT has never been easy to identify. Many methods have been
developed, but none has had perfect success in identifying all HGT events. Each method has
its own pros and cons, and there is no real gold standard for identifying HGT.
However, once a GI is located, further studies can investigate the
evolutionary effects of HGT in a retrospective manner by performing functional
annotation and analyzing the amelioration of the inserted DNA regions within bacterial
genomes.
1.4) Evolution
From the discovery of the concept of HGT to the modern understanding of the
process, it has become evident that HGT is one of the more important driving forces in
prokaryote evolution. In a recent study, Babic et al. obtained a direct visualization of
horizontal gene transfer. Using E. coli as the test subject and a fluorescent protein
fusion method, they found that DNA was transferred through the F pilus over considerable
distances between cells, and that the transferred DNA was integrated by recombination
in up to 96% of recipients. The transferred DNA also split and segregated to
different chromosomes through successive replication, so that different cell clusters
in later generations inherited it (Babic et al., 2008). This example can be viewed as
a section of history within an organism's evolutionary timeline.
HGT has been a rather controversial topic when it comes to explaining some parts of
evolution, but it is widely accepted in the prokaryotic world (Boucher et al., 2003).
Unlike natural selection, mutation and genetic drift, which are other evolutionary
processes, lateral gene transfer across species does not always deliver genes that are
appropriate for the recipient genome. With large amounts of genetic material
being transferred, can all of it be useful to the receiving organism? Inappropriate
gene transfers may lead to the destruction of organisms, since random insertion of
genetic material could change many protein functions. Novel functional gene transfer,
however, will improve the adaptability and survivability of the organism through the
new functions gained. Since the organism has gained a novel function that improves
survivability, natural selection will allow it to outlive the others and pass the newly
combined genetic material down to the next generation. Hence, HGT also fits in the
explanation of evolution.
There have also been problems regarding the lateral gene transfer model of
evolution. The main concern was that, by looking at the genetic material of different
organisms, one can hardly tell which gene has been transferred and which organism was
the donor. Today, with numerous whole genome sequences available, many
criteria have been set to check where such transfers have occurred. One such criterion is
based on codon biases and base compositions relative to the other genes in the DNA
sequence, but this criterion is itself controversial because the mutation
rate or selective pressure in the recipient differs from that in the donor (Koski et al., 2001).
Another problem in identifying HGT is that the DNA sequence itself mutates over time,
so that the composition of transferred genes ameliorates and becomes more like
that of the genes of the recipient. This makes it hard for researchers to identify the
donor of the gene, or even to establish that HGT happened. By the above criterion,
identification of gene transfer also becomes harder when the transfer happened a long
time ago. Another major concern is that when comparing genes, one cannot always tell
whether a gene has been lost or transferred.
As the technology in this field advances, so does the amount of
research put into HGT in order to get a clearer picture of its effect on evolution. Results
showed that 1.6 to 32.6 percent of genes, depending on the microbial genome, may have been
acquired by horizontal gene transfer (Koonin et al., 2001), and more recently, using network
analysis of shared genes, this estimate was raised to around 81 percent
(Dagan et al., 2008). The wide range arises because current HGT identification techniques
cannot identify all HGT events, and different techniques give different results with
little overlap. Current techniques also have high false positive rates; hence the estimates
of HGT-derived genes within different microbial genomes span a large range. Given
these results, current microbial genomes already share a large proportion (up to 81 percent)
of their genes, and hence rare appropriate genes have a lower probability
of being transferred. This would lead to a lower HGT rate, which matches the case of
modern microbial genomes.
In another study, Kanhere and Vingron showed that transfer across
species in the microbial world happens more often from bacteria to archaea (74% of the
transferred genes in the study moved in this direction) than the other way around.
Of the genes transferred from bacteria to archaea, the majority
were closely related to metabolic functions. Transfers from archaea to bacteria, on the
other hand, showed a preference for translation-related genes (Kanhere and Vingron,
2009).
While HGT is an important factor in bacterial evolution, it should be
noted that certain barriers limit it. The
complexity theory proposed by Jain et al. contains two points. The first is that
informational genes, such as those involved in DNA replication, transcription and
translation, are less prone to HGT, while operational genes are more likely to be
transferred (Jain et al., 1999). This point was supported by Nakamura et al., who
used Bayesian inference to show that operational genes were
more likely to be transferred (Nakamura et al., 2004). The second point is that
after genes have been transferred, post-transfer
maintenance occurs: genes with useful functions are preserved while
useless genes are removed. In this way, the organism rapidly gains new functions.
Another aspect to consider is the taxonomic distance between organisms,
which affects whether HGT occurs. Several studies show that gene transfers
occur more effectively if the two organisms participating in the transfer are closely
related in evolutionary terms (Nakamura et al., 2004; Ochman et al., 2000).
Building on the complexity theory, Wellner et al. proposed that once an organism
achieves a certain complexity, this complexity serves as a barrier to HGT (Wellner et al., 2007).
They further hypothesized that connectivity (the gene interaction network that forms protein
complexes) is also associated with HGT, where genes with lower connectivity have a higher
chance of being transferred. This makes sense, since a new gene with lower connectivity is
easier to incorporate into a genome. HGT is beneficial when it introduces a new gene to
the recipient genome to create a new function, but when the genome itself is complex
enough, the transferred gene is more likely to harm native genes or fail to be
incorporated into an existing network of genes. Hence a lower HGT rate is observed
between taxonomically distant organisms, which is caused by the taxonomic barrier. This
idea is also supported by the model of Vogan and Higgs (2011), in which HGT reaches a
threshold beyond which vertical transfer of genes is more beneficial than horizontal
transfer (this phenomenon is also known as the Darwinian Threshold). A study by Mozhayskiy and
Tagkopoulos (2011) also confirms the above points with another measure of fitness,
in which environmental complexity also affects the rate of horizontal gene transfer.
The tree of life for prokaryotic species is currently difficult to define. With numerous
cross transfers of genetic material, speciation itself is a complex process, and new
ways of defining species are needed (Thompson, 2013). Phylogenetic trees are no
longer a viable way to explain complex relationships between species, although new
methods are still in development to define such systems (Thiergart et al., 2014).
A phylogenetic tree (rooted or unrooted) takes into consideration the most
parsimonious (MP) connection (represented by branches, caused by speciation or
mutation) between the most recent common ancestor and the species (represented by nodes) after
a speciation event. This linear model is restrictive because most of the time there
is more than one MP connection between species; creating a single phylogenetic
tree from the data is often contradictory, and many results of interest could be lost
depending on the type of research being done (a phylogenetic tree takes into
consideration only one MP connection, hence other possible MP connections are lost and
not analyzed).
This brings us to a more generalized model of the phylogenetic tree, known as a
phylogenetic network, which considers not just a single MP connection but
multiple possible MP connections between the species under study (Huson and Bryant,
2006). This technique is less limiting and hence broadens the ability to do
research on topics such as complex relationships between bacterial species or the
characteristics of different networks under different evolutionary circumstances. The
downside is of course the computational cost of taking multiple MP
connections into account (as many MP connections as possible, since taking all MP
connections into consideration is sometimes impossible with current technology). There
are different types of phylogenetic networks depending on the research done (Figure 3).
Split networks are a generalization of the phylogenetic tree that combines
multiple MP connections between species and their most recent common ancestor into one
super tree. Reticulate networks display evolutionary data with events such as
hybridization, recombination and HGT, which fits bacteria very well. Other types of
phylogenetic networks also exist; these cover gene loss and duplication as well as host
and parasite co-evolution. Aside from MP-type analysis, other analysis methods also
exist, such as statistical parsimony (Templeton et al., 1992).
Pangenomics (a bacterial species can be described by its pan-genome, which is composed
of a "core genome" and a "dispensable genome") has become a more viable way to describe
bacterial species (Medini et al., 2005). The core genome of a bacterial species can be
considered as the housekeeping genes, and the dispensable genome as the genes that have
been acquired either through HGT or through mutation. Based on the idea of phylogenetic
networks, the core genome can also be seen as representing a shared ancestral history, while
dispensable genes represent reticulate events such as HGT, which form the multiple
branches within the network. With so many genes transferred and a large pool of
dispensable genes, identifying HGT events becomes increasingly difficult.
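The split of a pan-genome into core and dispensable parts can be illustrated with plain set operations. The following is a minimal sketch, not the procedure of Medini et al.: it assumes that genes have already been grouped into families, and the strain names and gene lists are hypothetical toy data.

```python
def pan_genome(strain_genes):
    """Split gene families into a core and a dispensable set.

    strain_genes maps a strain name to the set of gene families found in it
    (hypothetical input; real data would come from an orthology clustering step).
    """
    gene_sets = list(strain_genes.values())
    pan = set().union(*gene_sets)        # every family seen in any strain
    core = set.intersection(*gene_sets)  # families present in all strains
    dispensable = pan - core             # families missing from at least one strain
    return core, dispensable


# Toy data: three strains sharing three families and differing in accessory genes.
strains = {
    "strain_A": {"dnaA", "gyrB", "rpoB", "blaTEM"},
    "strain_B": {"dnaA", "gyrB", "rpoB", "stx2"},
    "strain_C": {"dnaA", "gyrB", "rpoB", "blaTEM", "intI1"},
}
core, dispensable = pan_genome(strains)
print(sorted(core))         # ['dnaA', 'gyrB', 'rpoB']
print(sorted(dispensable))  # ['blaTEM', 'intI1', 'stx2']
```

In this toy case the dispensable set is where candidate HGT-derived genes would be sought, although membership in it does not by itself show that a gene was horizontally acquired.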
Fig 3. Different types of phylogenetic networks, the phylogenetic tree being one type of network. Other networks include
split networks, reticulate networks and other phylogenetic networks. Each type of network is used for a different type
of research. Split networks include median networks (data from sequences), consensus networks
(data from trees), split decomposition and neighbor-net (both data from distances). Reticulate networks include
hybridization networks (data from trees), recombination networks (data from sequences) and ancestral recombination
graphs (data from genealogies). Lastly, other phylogenetic networks include any graphs explaining evolutionary data,
and augmented trees, where HGT is represented as additional edges inserted into the tree to create a network of not
just linear transfer but also horizontal transfer.
2) HGT Identification tools
2.1) Phylogenetic Approach
When it comes to identifying HGT, there is no single bioinformatic tool capable of
finding all HGT events within an entire genome. Currently, phylogenetic and compositional
methods are the two main groups with the most success. While each
group has its own strengths and weaknesses, the results from the two
approaches barely overlap, and hence it is hard to determine which method is better
or more correct (Figure 4). Phylogenetic methods search for conflicts between the
phylogeny inferred for a gene and the assumed organismal phylogeny, whereas
compositional methods search for regions that are atypical compared to the
rest of the genome. The reason for the non-overlapping sets of results is
amelioration, whereby transferred genes undergo directional mutation
and hence become harder to detect using compositional methods. Compositional methods
detect largely recent events and depend on the donor and recipient having different
compositional traits, while phylogenetic methods depend on homologous sequences being
present in other genomes separating donor and recipient, which allows these methods to
detect much more ancient transfers (Ragan et al., 2006).
The first step of any phylogenetic method is to collect a large number of sequences
from which to infer trees for the comparative analysis. A major drawback of this method
is that when there is insufficient phylogenetic information, many HGT events
cannot be detected. Suppose that the data set is sufficient and a phylogenetic tree is
built from the sequences (normally using ribosomal RNA or well conserved and
characterized protein sequences (Santos and Ochman, 2004)); a reference tree is also
made based on the true evolutionary history of the organisms under study. The second
drawback arises during the tree building phase, in which each of the two trees is a
consensus of many trees built. Because different genes mutate at different rates,
this process can be challenging and a consensus tree can be hard to
determine. With this drawback in mind, after both trees have been built they are
compared, and if an HGT event has occurred within the
data set there should be a disagreement between the two trees. There are many ways
to compare trees, but the optimal measure is the subtree prune and regraft
(SPR) distance (Hein et al., 1996), which is used to find HGT events within a tree.
Fig. 4. Graph indicating the detection of HGT by compositional and phylogenetic methods in terms of time and
phylogenetic distance. Since amelioration occurs at a faster rate than normal mutation through vertical
transfer, older HGT events are detected mainly by phylogenetic methods (beige), while compositional methods detect
more HGT events overall, since HGT happens more frequently between close taxa (grey), which is one of the limitations
of phylogenetic methods. Some HGT events are identified by both methods (brown), where the conditions needed by each
method are satisfied. The region indicated by (1), where HGT happened long ago between close taxa, is
difficult to identify by either method; hence none of the existing methods can identify all HGT events
within a genome.
An SPR operation on a tree is defined by cutting any edge, thereby pruning a
subtree “t”, and then regrafting “t” at a new vertex (Figure 5). In the context of HGT, the
regrafted edge corresponds to the donor and the cut edge corresponds to the recipient.
An edit path is a sequence of SPR operations applied to the reference tree in order to
obtain a tree congruent with the inferred one. Normally there are many edit
paths for a single comparison, and the optimal edit path is the
most parsimonious one for a given reference and inferred tree. The length of the optimal
edit path is then the minimum SPR distance between the trees, and the path accounts for all
HGT events within the data set. One advantage of the phylogenetic approach is that
the direction of transfer during an HGT event can be read from the optimal edit path
(Beiko and Hamilton, 2006). While this approach is very
powerful in detecting HGT events, it is still limited by the computational complexity
of calculating the SPR distance.
Fig. 5. An SPR operation is performed when there is a discrepancy between the reference tree and the inferred tree. Suppose
HGT occurred as indicated by the dashed arrow; edge E6 is then the donor and the transfer is received by edge E5. The SPR
operation starts by cutting edge E5 and then regrafting it under the new vertex E(5+6). New parent edges are formed by
incorporating the new edge to create E(2+5). The other edges are not affected by the operation since they were not
involved in the transfer event.
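A single SPR move of the kind shown in Figure 5 can be sketched on a small rooted binary tree. This is an illustrative sketch only: it assumes trees are encoded as nested two-element tuples with unique leaf names, the function names are hypothetical, and it performs one move rather than searching for the minimum SPR distance, which is the computationally hard part.

```python
def prune(tree, target):
    """Remove the subtree `target`; its sibling takes the place of their parent."""
    if tree == target:
        raise ValueError("cannot prune the whole tree")
    if isinstance(tree, str):  # a leaf that is not the target is left as is
        return tree
    left, right = tree
    if left == target:
        return right
    if right == target:
        return left
    return (prune(left, target), prune(right, target))


def regraft(tree, subtree, dest):
    """Attach `subtree` on the edge above `dest` by creating a new internal vertex."""
    if tree == dest:
        return (tree, subtree)
    if isinstance(tree, str):
        return tree
    left, right = tree
    return (regraft(left, subtree, dest), regraft(right, subtree, dest))


def spr(tree, subtree, dest):
    """One subtree prune and regraft move: cut `subtree`, re-attach it next to `dest`."""
    return regraft(prune(tree, subtree), subtree, dest)


# Reference tree with five taxa; the lineage of B (recipient) is cut and
# regrafted next to D (donor), mimicking a transfer from D to B.
reference = ((("A", "B"), "C"), ("D", "E"))
inferred = spr(reference, "B", "D")
print(inferred)  # (('A', 'C'), (('D', 'B'), 'E'))
```

Here the reference and the resulting tree are one SPR move apart, so a gene tree with the second topology would be explained by a single transfer event. The order of the two children within a tuple carries no meaning.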
Despite the many challenges faced by phylogenetic methods, they are still the methods of choice
for identifying HGT, especially for ancient genes. Aside from the SPR distance,
there are other statistical tests that boost the accuracy of phylogenetic
methods. The approximately unbiased (AU) test is one such measure, whereby for each tree
tested, a probability is calculated for the confidence that this tree is the true tree
describing the history of the data under consideration. The greater the P-value for the test
tree, the closer it is to the true tree, and a confidence set is made containing all test
trees with a P-value above a significance level alpha. If the confidence set does not agree
with the organismal phylogeny at that significance level, then there are possible HGT
events within the data set. The AU test is reliable with respect to false
positives, but false negative rates are still within an unacceptable range, especially with
stringent alpha values. Depending on the research, a less stringent alpha value could be used
to decrease false negative rates. A 5% significance level gives up to 90% power of
detection on average, which is reasonable but still inadequate in some cases.
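The decision rule built on the AU test can be sketched as follows. The sketch does not compute AU P-values, which require multiscale bootstrap resampling in dedicated software; it assumes that they are already available, and the tree labels and P-values below are invented for illustration.

```python
def confidence_set(p_values, alpha=0.05):
    """Return the trees whose AU P-value lies above the significance level alpha."""
    return {tree for tree, p in p_values.items() if p > alpha}


def suggests_hgt(p_values, reference_tree, alpha=0.05):
    """Flag possible HGT when the organismal (reference) tree is rejected."""
    return reference_tree not in confidence_set(p_values, alpha)


# Hypothetical AU P-values for one gene family, computed elsewhere.
au_p = {"reference": 0.012, "alt_1": 0.640, "alt_2": 0.210, "alt_3": 0.003}

print(sorted(confidence_set(au_p, alpha=0.05)))        # ['alt_1', 'alt_2']
print(suggests_hgt(au_p, "reference", alpha=0.05))     # True
print(suggests_hgt(au_p, "reference", alpha=0.01))     # False
```

With these invented numbers the reference topology is rejected at alpha = 0.05 but not at the more stringent alpha = 0.01, which shows how a stringent threshold can turn a candidate transfer into a false negative.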
Building on the traditional phylogenetic methods, genome-wide
prediction of HGT can be combined with a retrospective assessment of prediction reliability. With
current technology, genomic information such as annotations and new genome
sequences is rapidly increasing, and therefore an automated and computationally
efficient tool is needed. With these two additions, the
downsides of phylogenetic methods greatly decrease and a stronger tool can
be created. The DarkHorse algorithm created by Podell and Gaasterland combines a
probability-based, lineage-weighted selection method with a filtering approach that is
adjustable for wide variation in protein sequence conservation, to detect HGT on a
genome-wide basis (Podell and Gaasterland, 2007). The algorithm uses a unique measure,
namely the lineage probability index (LPI), which is calculated from BLAST results in a
relative context (relative to the query, on a percentage basis, to compensate for differences in
conservation between proteins) and then ranked by the lineage frequency of matches
over the entire genome (based on multiple databases, which increases statistical power).
The measure can characterize an organism's HGT history profile, the density of database
coverage for related species and the list of proteins least likely to have been inherited
vertically. The algorithm is designed to be efficient (only the LPI measure needs to be
computed) and automated, and is
therefore useful for quickly incorporating new information into existing analyses.
Positive results can then be prioritized for more in-depth analysis such as phylogenetic
tree building and nucleotide composition analysis.
2.2) Compositional Approach
Compositional methods, also known as parametric methods, use characteristics of the
genome to detect HGT events. These characteristics include GC content (overall as well
as at the first and third codon positions), oligonucleotide usage (OU) and codon bias.
Each of these parameters has its pros and cons, and each has had successes in detecting
HGT. GC content, the most basic compositional method, relies on the theory that each
species has its own characteristic GC content. This is due to a combination of
environmental factors (adaptation and survival in different habitats) and genetic factors
of the individual genome (Foerstner et al., 2005; Sueoka, 1988). Hence finding atypical
regions whose GC content differs strongly from that of the recipient genome allows
identification of HGT events.
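The GC content idea above can be sketched in a few lines of Python. This is a minimal
illustration and not any published tool: the function names, the window and step sizes
and the deviation threshold are all assumptions chosen for the example.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_gc_windows(genome, window=1000, step=500, threshold=0.10):
    """Return (start, gc) for every window whose GC content deviates from the
    genome-wide mean by more than `threshold` (an illustrative cutoff)."""
    genome_gc = gc_content(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_content(genome[start:start + window])
        if abs(gc - genome_gc) > threshold:
            hits.append((start, gc))
    return hits
```

A fixed threshold is the simplest possible criterion; as discussed in the following
paragraphs, GC content alone produces many false positives, so such a scan is at best a
first filter.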
Since GC content methods are the most basic, they have many flaws and need many
improvements. Nucleotide positions are known to mutate at different rates depending
on their region within the genome (for example conserved regions) or on their codon
position (third codon positions tend to mutate more than the other two). Therefore
relying on GC methods alone will create many false positives and is not sufficient to
detect HGT. Codon usage is the next addition to the toolbox; it reinforces the existing
GC method and addresses the above two problems. Aside from comparing the GC content of
the first and third codon positions with the genome mean, the codon adaptation index
(CAI) is also used as a measure to differentiate atypical genes within the recipient
(Sharp and Li, 1987). A chi-squared test is used to give the method statistical power and
so reduce the number of false positives.
Although the addition of codon usage techniques significantly increased the detection
power of the above two methods, codon bias and GC content are still poor indicators of
HGT, and many transferred genes still go undetected (Koski et al., 2001).
Oligonucleotide usage (OU) describes the frequencies of short nucleotide sequences, or
k-mers (two to fourteen nucleotides long), and is known to be a descriptive
characteristic of a genome. OU signatures date back to 1995, when Karlin and Burge used
dinucleotide composition bias to draw evolutionary implications (Karlin and Burge,
1995). Since then, statistical approaches have been used to reinforce existing OU
techniques (Deschavanne et al., 1999), along with higher-order k-mer analyses such as
tetranucleotide patterns combined with Markov models (Pride, 2003). The pattern of
deviation of OU frequencies from expectation (where all combinations of A, C, G and T of
the same length have the same frequency) was shown to be a genomic signature and hence
to contain phylogenetic characteristics that link microorganisms. The idea behind this
approach is therefore that OU composition is less variable within a genome than between
genomes. This allows simple criteria for identifying HGT events and regions (GIs)
regardless of which part of the genome is being analyzed, something codon bias
techniques lack. OU statistics also take into consideration interactions between
nucleotides such as base stacking energy, position preference and bendability, which can
affect the rate at which nucleotides mutate (Reva, 2004).
Like the other techniques, OU pattern analysis is powerful for detecting HGT but is still
far from a perfect approach. An overview by Bohlin et al., which analyzed the
effectiveness of di-, tetra- and hexamers in detecting HGT, concluded that none of the
three was superior to the others and that all are context dependent (Bohlin et al.,
2008). Thus the lack of a gold standard for choosing which approach to use in a given
context, so as to prevent false positive and false negative predictions, is a key problem
in GI identification. Therefore, new algorithms need to incorporate as much information
as possible while remaining flexible and simple to use. SeqWord Genomic Island Sniffer
(SWGIS), software developed by Bezuidt et al. (2009), uses OU statistics to identify
atypical regions within genomes. OU patterns of various lengths (di- to
hepta-nucleotides) can be used in combination within one algorithm, which gives
flexibility and an all-round analysis for identifying HGT (Reva, 2005). SWGIS belongs to
the larger SeqWord project by Reva, which also contains other software tools such as a
genome browser in which GIs can be visualized and differentiated into different types
(Ganesan et al., 2008).
SWGIS utilizes three parameters to detect and differentiate GIs: OU distance, pattern
skew (PS) and OU variance. Normalization can also be applied, which allows the frequency
counts of words to be normalized by the frequencies of words of other lengths or of
mononucleotides (e.g. if the analysis is done on a genome with very skewed or diverse GC
content, the word frequencies can be normalized by mononucleotide frequencies to take
that into account). Normalization has two options: internal normalization applies to the
genomic fragment currently under analysis, and external normalization applies to the
whole genome.
The program first calculates the frequencies of nucleotide words of the lengths chosen
by the user; the deviation of each frequency from its expected value is then calculated
and recorded as a matrix (pattern). The words are then ranked from the most deviated to
the least deviated. The distance between two patterns (e.g. a GI region compared to the
whole genome) is calculated as the absolute distance between the ranks of each
oligonucleotide word in the two patterns. The program automatically calculates the four
combinations of direct and reverse strands and takes the minimum value as the distance.
Since the genomic signature is somewhat unique to each organism, HGT events can be
identified from the distance value: a large distance between patterns indicates the
presence of foreign genetic material. Pattern skew is a particular case of the distance
measure which calculates the distance between the direct and inverse strands of the same
DNA. Since the PS value tends to be low for bacterial genomes, a high PS value could
imply insertion of phage elements (Reva, 2004). Lastly, OU variance (OUV) calculates the
variance of the deviations within a pattern. Depending on the normalization used, the
patterns are unique, and a large difference in variance between patterns is another
criterion for identifying HGT events. However, because the number of possible nucleotide
words is constrained, uncontrolled mutation (insertion) can raise OUV values and so cause
false positives, which is a downside of this algorithm.
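The rank-based distance described above can be illustrated with a short Python sketch.
It is a simplification under stated assumptions and not the SWGIS implementation: words
are ranked by raw count (equivalent to ranking by deviation when every word has the same
expected frequency), no normalization is applied, only one strand combination is used,
and the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_ranks(seq, k=4):
    """Rank every k-mer over ACGT by observed count, most frequent first.

    With a uniform expectation, ranking by count is the same as ranking by
    deviation from expectation. Ties are broken alphabetically.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    ordered = sorted(words, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered)}


def rank_distance(seq_a, seq_b, k=4):
    """Sum of absolute rank differences between two k-mer patterns."""
    ranks_a = kmer_ranks(seq_a, k)
    ranks_b = kmer_ranks(seq_b, k)
    return sum(abs(ranks_a[w] - ranks_b[w]) for w in ranks_a)
```

Comparing a candidate region with the whole genome in this way gives zero for identical
patterns and grows as the word rankings diverge; the real tool additionally takes the
minimum over the four strand combinations.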
There are other applications of compositional methods besides the standard analysis of
whole genome sequences. Tamames and Moya developed an algorithm which estimates the
extent of HGT in metagenomes (Tamames and Moya, 2008). Compositional methods require
comparison of a region of interest with the rest of the genome, which metagenomic data
cannot provide. An alternative approach is therefore used, which combines OU
(tetranucleotides in this case), computed in sliding windows (10 by default) along each
ORF, with Pearson's correlation between the windows. ORFs are compared to each other, and
low correlation implies dissimilarity between them. All values are then grouped into a
matrix and clustered into a tree. ORFs whose correlation values fall below a chosen
cutoff are then considered to be transferred genes. Though identifying HGT in
metagenomic sequences is a difficult task, the results are a good start in a new field
of study. Overall, compositional techniques are a powerful tool in many applications and
a simple yet efficient way of identifying HGT events.
2.3) Other Approaches
With modern technology, new techniques are being developed at great speed. Aside from
the standard phylogenetic and compositional approaches to identifying HGT, other methods
that build on similar ideas have emerged. Here we look at two of these techniques, the
new perspective they bring and their shortcomings compared to the main techniques.
The first technique branches off from the traditional compositional group by using the
nucleotide substitution rate matrix to detect HGT. Different species have their own
nucleotide compositions and hence should have their own rate matrices. This offers an
advantage over traditional compositional methods, because similar composition does not
imply a similar rate matrix. Hamady et al. hypothesized that HGT changes nucleotide
substitution dynamics because mutational processes differ between the old and new host
organisms (Hamady et al., 2006). Hence if a change in the rate matrix is detected, HGT is
likely to have occurred, since the rate matrix of the transferred gene differs from that
of the recipient and changes during amelioration. On the other hand, if a genome has not
undergone HGT, the rate matrix should stay roughly the same between genes of the same
organism. This gives a criterion for testing for putative HGT genes within different
genomes.
The rate matrix is derived from the Markov model of neutral sequence evolution. This
model is used because of its success in many other bioinformatic applications such as
sequence searching, alignment and phylogeny. The model typically represents the four
nucleotides at any given position within a DNA sequence. Each nucleotide has a rate of
change to each other nucleotide, and these rates are grouped into a 4x4 matrix in which
each entry represents the rate of change from one nucleotide to another (Figure 5a).
Each row of the rate matrix must sum to zero, because the diagonal entry is defined as
the negative of the total rate of change away from that state. Hence the diagonal values
must be negative while the off-diagonal values are positive (Figure 5b). This is one of
the criteria used to check whether the rate matrix is correctly derived. The rate matrix
for the genome is then derived empirically from the probability matrix through a
logarithmic conversion.
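The validity conditions stated above (non-negative off-diagonal rates, negative diagonal,
rows summing to zero) are easy to check programmatically. The sketch below is only an
illustration of those conditions; the function names and the example rates are
assumptions, and it does not reproduce the inference procedure of Hamady et al.

```python
def fill_diagonal(off_diagonal):
    """Build a rate matrix from off-diagonal rates by setting each diagonal
    entry to minus the sum of the other entries in its row."""
    q = [list(row) for row in off_diagonal]
    for i, row in enumerate(q):
        row[i] = -sum(rate for j, rate in enumerate(row) if j != i)
    return q


def is_valid_rate_matrix(q, tol=1e-9):
    """Check that the matrix is square, off-diagonal rates are non-negative,
    diagonal entries are not positive and every row sums to zero."""
    for i, row in enumerate(q):
        if len(row) != len(q):
            return False
        if any(rate < 0 for j, rate in enumerate(row) if j != i):
            return False
        if row[i] > 0 or abs(sum(row)) > tol:
            return False
    return True
```

For example, `fill_diagonal` applied to a table of made-up A, C, G, T rates yields a
matrix that passes `is_valid_rate_matrix`, whereas an identity matrix does not.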
Although the Markov model is useful, many of its assumptions do not always make
biological sense. The assumption that all sites are identical and independent is clearly
not true, since sites are often correlated, especially those that encode RNA (Smith et
al., 2004). Markov models also make the basic assumptions of time consistency and
time-reversibility, which are also not biologically correct (Lobry and Lobry, 1999). To
get closer to biological relevance, constraints must be added to the rate matrix. This
in turn limits the inference of the true rate matrix and therefore reduces the ability
to detect HGT, which is the major downfall of this approach.
For this model specifically, two improvements were added to increase biological
relevance without decreasing the accuracy of the model in determining HGT. Sequence
triples are used instead of pairs to remove the assumption of time-reversibility. This
increases the accuracy with which the rate matrix is determined and allows the direction
of change to be inferred, at the cost of somewhat more computational time. The model
also only considers the nucleotide at the third codon position, to minimize the
influence of selection.
Fig. 5a. Rate of change between different nucleotide states by Markov Model.
Fig. 5b. Rate of change matrix based on the Markov Model. The negative diagonal values allow the rows to sum to
zero, which is a criterion for a valid rate matrix.
To assess how well the model can detect HGT, inferring the rate matrix correctly is of
great importance. The rate matrix is inferred using three genomes of the same species at
approximately the same level of divergence. If the inferred rate matrix does not fit the
genes within each genome well (is uncorrelated with them), this implies that more than
one rate matrix is needed to fit the genome and hence that an HGT event should have
occurred. To discriminate whether two or more rate matrices are involved, phylogenetic
methods are used to build trees of 8 or 16 sequences based on the number of rate
matrices involved. Various statistics are then used to analyze the trees and reach a
conclusion. These statistics combine methods used by other researchers and include the
mean distance between sequences, the number of sequences used for inferring and
normalizing the rate matrix, the variance of the distances, etc.
This method of detecting HGT increases accuracy in discriminating between HGT and
non-HGT events up to threefold compared to standard GC content methods. GC content was
chosen as the baseline because it is consistent within bacterial genomes and because all
other compositional statistics are somewhat related to it. The improvement in accuracy
comes from combining different statistics in order to reduce the error rate. The random
forest algorithm (Breiman, 2001) was used to lend statistical support to the result. The
accuracy of this method is limited in that the rate matrices of the compared organisms
must differ at least slightly for a difference to be distinguished. While some
shortcomings remain, overall this approach offers a new tool for detecting HGT with
significant accuracy.
The second technique differs from the first in that it uses purely statistical analysis
to detect HGT, although it still branches off the standard compositional approach.
Chatterjee et al. proposed that any segment of a whole chromosomal sequence should lie
at a similar distance from the rest of the sequence (Chatterjee et al., 2008). The
measure can be based on GC content or oligonucleotide distribution, and the distance can
be either absolute or Euclidean. Alternatively, for annotated genomes, one may take into
account gene content, codon usage and amino acid usage biases.
The first phase of the test for HGT compares “s” with the rest of the chromosome
sequence (“s” being a segment within the DNA sequence under study). A vector of N
segments is sampled independently; each segment has the same length as “s” and does
not overlap with “s”. Another vector of N random pairs of segments is then
independently selected from the chromosome sequence, again without overlapping “s”.
Distances are then calculated for both vectors: D1 holds the distances between “s” and
each segment s' drawn from the complement of “s”, and D2 holds the distances between
the two members s1' and s2' of each random pair. If s belongs to a GI, the values in D1
should be larger than those in D2; otherwise HGT did not occur in segment s.
Since the elements of D1 and D2 are sampled independently, each vector consists of
independent and identically distributed values. A standard statistical test can
therefore be applied in which the null hypothesis is that the elements of D1 have the
same distribution as the elements of D2 (not a GI) and the alternative hypothesis is
that D1 is larger than D2 (GI). Mean and variance values are calculated for both vectors
and, by the central limit theorem (for large enough N), the standardized difference
between the two means is approximately normal: a value close to zero is consistent with
the null hypothesis, while a significantly positive value supports the alternative. If
computational constraints are an issue, a smaller N must be chosen, and a two-sample
Kolmogorov-Smirnov test or Wilcoxon-Mann-Whitney test can be used instead (Randles,
1979). These tests support the above criteria as a test for GI.
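The first phase of this test can be sketched as follows. This is an illustration under
stated assumptions rather than the implementation of Chatterjee et al.: the distance is
the absolute difference in GC content, the statistic is a plain two-sample z-score, and
the function names, N and the random seed are hypothetical choices. The chromosome is
assumed to be much longer than the segment, so that non-overlapping segments can always
be drawn.

```python
import math
import random
import statistics


def gc_fraction(seq):
    """Fraction of G and C bases in an upper-case sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def island_z_score(chrom, start, length, n=200, seed=0):
    """Standardized difference between mean(D1) and mean(D2) for the segment
    chrom[start:start + length]; large positive values suggest a GI."""
    rng = random.Random(seed)
    s_gc = gc_fraction(chrom[start:start + length])

    def random_segment_gc():
        # Redraw until the random segment does not overlap the tested segment.
        while True:
            pos = rng.randrange(len(chrom) - length + 1)
            if pos + length <= start or pos >= start + length:
                return gc_fraction(chrom[pos:pos + length])

    # D1: distances between s and random segments from the rest of the chromosome.
    d1 = [abs(s_gc - random_segment_gc()) for _ in range(n)]
    # D2: distances within random pairs of segments that exclude s.
    d2 = [abs(random_segment_gc() - random_segment_gc()) for _ in range(n)]

    diff = statistics.mean(d1) - statistics.mean(d2)
    se = math.sqrt((statistics.variance(d1) + statistics.variance(d2)) / n)
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se
```

A one-sided p-value can be obtained from the z-score through the normal distribution;
for small N the rank-based tests mentioned above would replace this statistic.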
Since GIs vary in size and location within the chromosome, the statistical test is
applied to s as a sliding window that moves across the chromosome and varies in size.
The sliding window should not be too small, as this would increase the computational
cost significantly. The p-value is recorded for each window and plotted, which allows
easier detection of GI location and size within the chromosome. A cutoff P0 (0 < P0 < 1)
then determines the putative GIs, which ends phase one of the test. Because the putative
GIs from phase one tend to be too large and may include false positives, a refinement
phase is then carried out to increase accuracy. This phase is the same as the first
except that the putative islands are removed from the chromosome sequence, so that the
random segments do not include any of these regions. This reduces the influence of
putative islands present in the chromosome sequence on the test.
The method works well and ranked highest in terms of sensitivity compared to some other
well-known methods such as Island-DB, W8 and HGT-DB, but not so well in terms of
specificity. This is because the method detects a much larger number of putative GIs
with the criteria used. The advantages of this method are that it does not require any
training set and that it has a strong statistical foundation, which some other methods
lack. On the other hand, possible downsides include the computational cost of varying
window sizes and a high number of false positives caused by the large number of putative
GIs identified. While setting strict parameters can reduce the false positive rate, a
lot of interesting information could be lost, so a clear cut-off is hard to define.
3) Amelioration Model
3.1) Concept of Genomic Amelioration
Amelioration is the process whereby the base composition of genes transferred from a
donor changes through nucleotide substitutions over time until it resembles the DNA
composition of the recipient genome. This is because the introduced genes are subjected
to the same mutational pressure (Sueoka, 1988) as the recipient genome and hence over
time become more similar to it in composition. This is directed mutation: the foreign
insert, placed in a new environment, comes under a new mutational pressure and hence
undergoes increased selection towards the recipient genome. Amelioration can hence be
thought of as an evolutionary process in which an organism acquires foreign genetic
material and adapts it for its own benefit; stress-induced mutagenesis is also a viable
mechanism (Maclean, 2013). The process of amelioration is more evident in large groups
of transferred genes, since there is a larger region of atypical composition to undergo
directional mutation. Furthermore, newly transferred genes are easier to identify since
they have only just started ameliorating, and comparison of the donor and the
transferred gene can aid the modeling process.
There have been very few models of amelioration since the idea was introduced, and
therefore there is no gold standard for the approach. The best known uses the rate of
nucleotide substitution between the transferred gene and the recipient genome to model
amelioration (Lawrence and Ochman, 1997). The rate and extent of amelioration, together
with an analysis of how long each gene has been under directional mutational pressure,
allow the time of HGT to be estimated. This model is based on the fact that the
nucleotide composition of a DNA sequence typically represents an equilibrium between
selection and directional mutational pressure (Sueoka, 1962; Sueoka, 1988). When a gene
is transferred, it will experience the same directional mutational pressure as the
recipient genome and its base composition will reach a new equilibrium. A mathematical
model has been developed (Sueoka, 1962) to express this change in DNA composition with
respect to directional mutational pressure. The amelioration model does not directly
quantify directional mutation pressure but rather represents it as a fraction of the net
change in DNA composition with regard to the nucleotide substitution rate. The model
consists of four parameters, namely the nucleotide substitution rate, the
transition/transversion ratio (IV ratio), the GC content at equilibrium and the GC
content of the HGT region. All parameters are easily calculated and the model itself is
very easy to use.
The rate of amelioration can be expressed as a function of the substitution rate. An
empirical substitution rate S can be expressed as the rate of change at sites occupied
by cytosine or guanine ($R_{GC}$) and the rate of change at sites occupied by adenine or
thymine ($R_{AT}$). $R_{GC}$ can be further broken down into the rate of change from G
or C to A or T ($R_{GC \to AT}$, which includes G -> A, G -> T, C -> A and C -> T) and
the rate of interchange between G and C ($R_{GC \to CG}$, which includes G -> C and
C -> G); $R_{AT}$ is broken down similarly. Since all transitions and half of the
transversions change the GC content of the DNA sequence, the equation can be simplified
into one rate of change together with the IV ratio. One assumption made here is that the
two types of transversion are equally frequent.
$$S = \frac{\mathrm{IV} + 1}{\mathrm{IV} + 0.5} \times \left( R_{GC \to AT} + R_{AT \to GC} \right) \qquad [1]$$
Equation [1] represents the total substitution rate in terms of both the directional
mutation rates and the transition/transversion (IV) ratio. The combined action of the
two directional rates leads to a GC equilibrium, as proposed by Sueoka (Sueoka, 1962;
Sueoka, 1988). The GC equilibrium ($GC_{EQ}$) can therefore be expressed as a ratio
between the directional mutation rates towards AT and towards GC.
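$$GC_{EQ} = \frac{R_{AT \to GC}}{R_{AT \to GC} + R_{GC \to AT}} \quad \text{and} \quad AT_{EQ} = \frac{R_{GC \to AT}}{R_{AT \to GC} + R_{GC \to AT}} \qquad [2]$$
Therefore, combining [1] and [2]:
$$R_{AT \to GC} = S \times GC_{EQ} \times \frac{\mathrm{IV} + 0.5}{\mathrm{IV} + 1} \qquad [3a]$$
$$R_{GC \to AT} = S \times AT_{EQ} \times \frac{\mathrm{IV} + 0.5}{\mathrm{IV} + 1} \qquad [3b]$$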
The change in GC content over time can therefore be expressed as the gain in GC content
minus the loss in GC content. Let $AT_{HGT}$ and $GC_{HGT}$ be the base composition of
the horizontally transferred DNA:
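$$\Delta GC_{HGT} = \left[ AT_{HGT} \times R_{AT \to GC} \right] - \left[ GC_{HGT} \times R_{GC \to AT} \right] \qquad [4]$$
Combining [3a], [3b] and [4]:
$$\Delta GC_{HGT} = S \times \frac{\mathrm{IV} + 0.5}{\mathrm{IV} + 1} \times \left[ GC_{EQ} - GC_{HGT} \right] \qquad [5]$$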
Equation [5] shows that the rate of change in the GC content of horizontally transferred
DNA can be expressed by three parameters. The change in GC content is proportional to
the substitution rate (S) as well as to the difference between the GC equilibrium and
the GC content of the horizontally transferred DNA. These two parameters, as well as the
IV ratio, can all be derived from comparative studies of nucleotide sequences. Different
codon positions also experience different selective pressures and mutate at different
rates. Hence different codon positions can be analyzed independently to create a more
accurate amelioration model for a specific DNA sequence.
Even when the different mutation rates of the codon positions are taken into account,
the amelioration model proposed above is still too simplistic. For a whole region, a
single mutation rate (S) might be insufficient to explain the amelioration process. The
three parameters within the equation are sufficient for the amelioration model to make
biological sense, but much information can still be added to improve the accuracy of the
existing model.
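As a worked illustration of equation [5], the following sketch iterates the GC content of
a transferred region towards the equilibrium value. The function name and the numerical
values in the usage note are assumptions made for the example; S is treated as the
substitution rate per time step, whatever unit of time is chosen.

```python
def ameliorate_gc(gc_hgt, gc_eq, s, iv_ratio, steps):
    """Iterate equation [5] and return the GC content after each time step.

    gc_hgt   -- initial GC content of the transferred region (0..1)
    gc_eq    -- equilibrium GC content of the recipient genome (0..1)
    s        -- substitution rate per time step
    iv_ratio -- transition/transversion (IV) ratio
    """
    factor = s * (iv_ratio + 0.5) / (iv_ratio + 1.0)
    trajectory = [gc_hgt]
    for _ in range(steps):
        # Delta GC_HGT = S * (IV + 0.5) / (IV + 1) * (GC_EQ - GC_HGT)
        gc_hgt += factor * (gc_eq - gc_hgt)
        trajectory.append(gc_hgt)
    return trajectory
```

With illustrative values such as `ameliorate_gc(0.40, 0.60, 0.01, 2.0, 500)`, the GC
content rises monotonically from 0.40 and approaches 0.60 without overshooting, because
each step closes a fixed fraction of the remaining gap.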
3.2) Project Aims
Based on the model described in the previous section, the use of a single nucleotide
substitution rate S and the GC content difference between equilibrium and the
transferred region relies on only the most basic assumptions needed to make biological
sense. Its simplicity makes it easy to use, at the cost of much lost information. As
with the GC content method among compositional techniques, considering only single
nucleotides in the analysis causes flaws. These flaws can be addressed by using
oligonucleotide patterns, which capture many of the effects that single nucleotide
analysis misses. Hence a model which uses oligonucleotide pattern data needs to be
derived for a more detailed analysis of amelioration in bacterial genomes.
OU statistics calculated by SWGIS are a very useful tool which summarizes important
characteristics of a genome. Considering the difference between the OU statistics of a
GI and those of the rest of the recipient genome gives a clearer picture of the
amelioration process. Amelioration is the process in which foreign genomic material
undergoes mutation to achieve a composition similar to that of the recipient genomic
sequence. Hence the OU pattern of the GI will tend towards that of the recipient
sequence, and the differences in OU distance and variance will decrease during
amelioration. Using this fact, the OU word distance between the GI and the recipient
sequence can be converted into a probability of mutation per iteration, and these
parameters can be used to create a simulation of the amelioration process.
The Verhulst model is well known in biology for modeling population growth (Horowitz et
al., 2010; Koseki and Nonaka, 2012). The sigmoid curve which defines the model is useful
for explaining the amelioration process: the exponential increase at the beginning,
followed by a slowing approach to the capacity value towards the end, reflects
directional mutation, which is the core assumption of the amelioration process. The
model is also simple to use (a simple logistic equation), and its parameters reflect
real-life situations (capacity, initial exponential increase, then deceleration towards
capacity). Therefore we aim to model the amelioration process using the Verhulst model
and oligonucleotide usage patterns within the sequences of the genomic island and the
recipient.
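For reference, the standard textbook form of the Verhulst (logistic) curve can be written
as a small function. This is only the generic equation; the parameter names are
assumptions, and how the quantities of the amelioration model map onto capacity, initial
value and rate is not specified in this section.

```python
import math


def verhulst(t, capacity, n0, rate):
    """Closed-form solution of dN/dt = rate * N * (1 - N / capacity)
    with N(0) = n0, evaluated at time t."""
    return capacity / (1.0 + ((capacity - n0) / n0) * math.exp(-rate * t))
```

The curve starts at `n0`, grows almost exponentially while far below `capacity`, and
flattens as it approaches `capacity`, which gives the sigmoid shape referred to above.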
3.3) Project Objectives
The objective of the project is to derive an algorithm which models genomic amelioration
in bacterial genomes using a combination of compositional methods (OU) and mathematical
modeling (the Verhulst equation). The algorithm should reflect the dynamics of the
amelioration process and produce parameters that can explain it over the whole course of
the process. These parameters must be biologically meaningful, in that they can explain
the trend of the amelioration process as well as estimate the time of insertion of
different horizontally transferred genomic islands in different recipient genomes. Hence
several different genomic islands (testers) and possible recipient genomes (targets)
were chosen so that the algorithm can be evaluated on its ability to explain
amelioration across all types of bacterial genome sequences. These tester-target
combinations should also help answer questions about the amelioration process. These
include comparisons of amelioration between different taxa (gram-positive,
proteobacteria, etc.), within the same taxon (is the amelioration process less extreme
than between different taxa?) and between tester and target sequences from the same
genome (does the sequence ameliorate at the average mutation rate, with no directional
mutation?).
The resulting model must be simple and easy to derive, in the sense that it uses the
least amount of core data and the most automated method possible. The model should also
allow conclusions to be drawn, such as estimates of the mutation rate of different
genomic loci and of the time of insertion of different genomic islands into their
recipients. To ensure the most accurate results, we use multiple genomic islands from
the same origin to compare the degree of amelioration with their time of insertion.
Vocabulary
GI – Genomic Island
OU – Oligonucleotide Usage
PLF – Probability Logistic Function
PAGI – Pseudomonas aeruginosa Genomic Island
Tester – Genomic Island insert sequence
Target – Recipient genome sequence
Introduction
The study of horizontal gene transfer (HGT) in bacteria dates back several decades and
has been well documented. Current studies continue to delve deeper into its effects on
phylogeny and evolution, alongside improvements in technology and techniques (Hamady et
al., 2006). These techniques have been improved to increase accuracy in determining HGT
events, with the further aim of creating a standardized tool which can determine all HGT
events. Currently there are two main methods of determining HGT, compositional and
phylogenetic, each of which has its own advantages and disadvantages (Vogan and Higgs,
2011). Amelioration, the process whereby the base composition of genes transferred from
a donor changes through nucleotide substitutions over time until it resembles the DNA
composition of the recipient genome, is one of the factors hindering the creation of a
standardized tool and a major downfall of compositional methods.
Although HGT is well understood, amelioration itself is understudied. Hence the study of
amelioration is vital to enhance the understanding of this process. With this insight,
many questions can be answered, such as how transferred material mutates (preference in
composition mutation, directional mutation, mutation rate) and how the base composition
of the recipient affects the amelioration process. Therefore we attempt to create a
logical yet practical mathematical model of amelioration.
To increase the understanding of amelioration within bacterial genomes, four foreign
inserts that are known genomic islands (GIs) were used to model their amelioration
towards the compositional profiles of genomes of organisms representing distant taxa and
different GC content, i.e. Bacillus subtilis 168, Pseudomonas aeruginosa PAO1,
Escherichia coli K12, Xylella fastidiosa 9a5c and Streptomyces griseus NBRC 13350.
These genome sequences were chosen as a small sample in an attempt to cover the vast
bacterial domain, with many major phyla represented (actinobacteria, gram-positive
bacteria, etc.). Hence the amelioration model created here can also be applied to other
bacterial organisms.
The amelioration process was simulated using compositional methods on each combination
of GI (tester) and recipient (target), whereby k-mer words (di-mers, tri-mers,
tetra-mers) were counted and ranked by their frequencies in descending order of
oligonucleotide usage (OU) (Bezuidt et al., 2009). A logistic probability function was
then used to convert the ranked frequencies into a probability which gives the
likelihood that the nucleotide at any given position will be substituted by another. A
Python program was designed to simulate the amelioration process over generations with a
given mutation rate, and was used to simulate the amelioration of the GIs in question.
An amelioration model was then derived by fitting the standard Verhulst model, which is
generally used in population dynamics. The standard Verhulst model suits the
amelioration process because its sigmoid curve fits the basic assumption of directional
mutation.
The parameters of the model were also well suited to the simulated data and gave a good
fit for the sample simulations. The program predicts a gradual merging of the insert's
OU profile with that of the host genome, which stabilizes at some level of pattern
similarity. The dynamics of this process and the level of stabilization depend on the
rate of mutation in the tester organism as well as on the composition of the tester and
target sequences. Using statistical methods, a regression model was built to simplify
the creation of the amelioration model from these three parameters, which can be used
for any bacterial organism.
The resulting amelioration model was also applied to 4 distinct GIs from the same origin
sequence (Klockgether, 2007) towards the same target. The time of insertion estimate was
reasonable for all four distinct GI sequences, which indicates that the algorithm is
suitable for estimating the time elapsed since GI acquisition by a bacterium. The
algorithm was also modified to estimate the mutation rates in different organisms and
genomic loci, though the results were inconclusive and the method needs further
improvement.
Methods
Flow Chart
Input Sequence: takes the tester and target sequences as input and converts them into
oligonucleotide statistics.
Simulation: takes the oligonucleotide statistics as input and uses the probability logistic
function to simulate the amelioration process.
Verhulst Modeling: takes the simulation results as input and fits a Verhulst model to the
data to create an amelioration model.
Time Estimation: uses the Verhulst model specific to a tester and target combination to
estimate the time of insertion of the tester sequence within the target and to calculate
the extent of amelioration.
Rate of Mutation Estimation: uses the Verhulst model fitting algorithm to estimate the
rate of mutation of genomic loci, with the locus sequence as tester and the whole
genome sequence from which the locus was obtained as target.
Sequence Data and Oligonucleotide Usage Statistics
Four known genomic islands (GIs), referred to as testers, were used to model their
amelioration towards the compositional profiles of genomes of organisms
representing distant taxa and different GC content, i.e. Bacillus subtilis 168 (BS),
Pseudomonas aeruginosa PA01 (PA), Escherichia coli K12 (EC), Xylella fastidiosa 9a5c (XF)
and Streptomyces griseus NBRC 13350 (SG). Five genomic islands, PAGI1, PAGI2, PAGI3,
PAGI4 and pKLC102, were used to estimate the time of insertion (Klockgether et al.,
2007). Detailed statistics are given in Table 1 below. The GIs used in this study were
identified and obtained from the SWGIS Pre_GI database within the SeqWord Project [1].
The sequence data of the five target organisms (FASTA format) were obtained in the
same manner. The genome sequences chosen were of similar size (in bp) so as to reflect
the true amelioration process, in which a GI is inserted into the recipient. The sequence
files were then used as input to another program written in Python, the Oligonucleotide
Pattern Evolution Project (OPEP).
Table 1. Detailed statistics of genome sequence used within study
Host
Bacillus subtilis 168
Escherichia coli CFT073
Escherichia coli K12 substr
MG1655
Streptomyces coelicolor A3(2)
Streptomyces griseus NBRC
13350
Xylella fastidiosa 9a5c
Pseudomonas aeruginosa
pathogenicity island PAGI 1
Pseudomonas aeruginosa PA01
PAGI_1
PAGI_2
PAGI_3
PAGI_4
pKLC102
Tester
NC
GI#
NC_000964
9
NC_004431
20
NC_000913
NC_003888
33
Start
2146000
3409389
7561923
End
2258655
3494936
7647787
Length
112655
85547
1
51300
Target
End
1100009
Length
100001
1
100001
100001
1
500000
100001
600001
100001
100001
1
106370
106369
85864
NC_010572
NC_002488
1
Start
1000009
51300
NC_002516
51300
158230
128136
34398
103609
OPEP transforms sequence data into OU statistics, which are essential for the simulation
step. Combinations of GI (tester) and recipient (target) are used as input (20
combinations of 4 testers x 5 targets) and the OU statistics are calculated. These include
deviation, distance (D), variance (V) and the compositional variance between tester and
target (V0), of which only deviation, V and V0 are used in the algorithm (Bezuidt
et al., 2011). Deviation measures the logarithmic deviation of the observed OU pattern
from the expected frequencies of OU words. The equation is as follows:
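$$\Delta_w = \Delta_{[\xi_1 \ldots \xi_N]} = 6 \times \frac{\ln\left(\dfrac{C^2_{[\xi_1 \ldots \xi_N]|obs}\left(C^2_{[\xi_1 \ldots \xi_N]|e} + C^2_{[\xi_1 \ldots \xi_N]|0}\right)}{C^2_{[\xi_1 \ldots \xi_N]|e}\left(C^2_{[\xi_1 \ldots \xi_N]|obs} + C^2_{[\xi_1 \ldots \xi_N]|0}\right)}\right)}{\ln\left(\dfrac{C^2_{[\xi_1 \ldots \xi_N]|0}}{C^2_{[\xi_1 \ldots \xi_N]|e}} + 1\right)}$$ [1]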
where ξn is any nucleotide A, T, G or C in the N-long word; C[ξ1…ξN]|obs is the observed
count of the word [ξ1…ξN]; C[ξ1…ξN]|e is its expected count; and C[ξ1…ξN]|0 is a standard
count estimated under the assumption of an equal distribution of words in the sequence
(C[ξ1…ξN]|0 = Lseq × 4^(-N)). Deviation gives a relative measure, in terms of a
log-normal distribution, of how abundant or rare a specific OU word is within the
sequence. A low deviation (negative value, fewer occurrences than expected) implies that
the OU word is rare within the sequence, while a high deviation (positive value, more
occurrences than expected) implies that the OU word is abundant.
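The observed and standard counts defined above can be illustrated with a minimal Python sketch. The function names `count_words` and `standard_count` are hypothetical and are not taken from OPEP; the expected count C|e is left out because its definition is not given in this section.

```python
from collections import Counter

def count_words(seq, n):
    """Observed counts C|obs of all overlapping n-long words in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def standard_count(seq_len, n):
    """Standard count C|0 = Lseq * 4^(-N), assuming equally distributed words."""
    return seq_len * 4 ** (-n)

# Example sequence fragment used later in the Methods section
seq = "TGGTGGGTCGTGTAGG"
tetra = count_words(seq, 4)
print(tetra["GGGT"], standard_count(len(seq), 4))
```

A sliding window of width N is used, so a sequence of length Lseq contributes Lseq - N + 1 words.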
Variance (V) is calculated from the deviation values of all words over the whole
sequence, as given in equation [2].
$$RV = \frac{\sum_{w}^{4^N} \Delta_w^2}{\left(4^N - 1\right) \times \sigma_0}$$ [2]
where σ0 is the expected standard deviation of the word distribution in a randomly
generated sequence, which depends on the sequence length and is given by:
$$\sigma_0 = 0.02 + \frac{4^N}{L_{seq}}$$ [3]
The compositional variance (V0) measures the variation of the deviation between two
OU patterns, those of tester and target. Equation [2] is used, the only difference being
that the deviation value for each word is calculated as DevTester - DevTarget.
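As a rough illustration of equation [2] and of the compositional variance, the sketch below works on dictionaries that map OU words to deviation values. The function names are hypothetical, σ0 is passed in by the caller rather than computed, and treating words absent from a dictionary as having zero deviation is an assumption made only for this example.

```python
def relative_variance(deviations, n, sigma0):
    """Equation [2]: sum of squared word deviations over (4^N - 1) * sigma0."""
    total = sum(d * d for d in deviations.values())
    return total / ((4 ** n - 1) * sigma0)

def compositional_variance(dev_tester, dev_target, n, sigma0):
    """V0: equation [2] applied to DevTester - DevTarget for every word."""
    words = set(dev_tester) | set(dev_target)
    # Assumption: a word missing from one pattern has deviation 0 there
    diff = {w: dev_tester.get(w, 0.0) - dev_target.get(w, 0.0) for w in words}
    return relative_variance(diff, n, sigma0)
```

Two identical patterns give V0 = 0, which is the state the tester approaches as amelioration proceeds.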
Figure 6 is a screenshot of the OPEP program. Two sequences are used as input and are
shown in the top left and middle blocks, while the difference between their OU patterns
is shown in the top right block. Each OU pattern is drawn as a block plot in which each
block represents a k-mer word and the colour shows how frequently the word occurs in
the sequence (red for words present more often than expected, blue for the opposite).
The intensity of the colour reflects how abundant or rare the word is. All OU statistics
for the specific sequence are shown underneath. These OU statistics are then stored as
variables to be used in the simulation step, whose Python script is part of the OPEP
program.
Fig. 6. Screenshot of the OPEP program displaying a combination of tester (E. coli) and target (S. griseus).
The left-hand side represents the tester, the middle the target and the right the difference between the two patterns.
The tetra-mer word pattern is shown above and the OU statistics are shown as text below each block plot. Dev
(deviation, equation 1) measures how far the frequency of a word in the sequence deviates from the expected
frequency. In the right text block, the dev parameter shows the deviation between the two patterns. The words are
listed in descending order, with the highest dev ranked first. Other parameters such as pattern skew (PS), internal
variance (Var, equation 2), variance (V) and distance (D) are also shown. The distance shown in the top right corner
is the absolute distance between the ranks of oligonucleotides in the two patterns. PS is a particular case of distance
that measures the distance between the direct and inverse strands of the same DNA. Lastly, Var shows the variability
of the pattern within a sequence and V the variability between sequences. The colour-coded block plot gives an
easier representation of the frequency of each k-mer word. Placing the cursor over a block shows the status of that
word in detail.
Probability Logistic Function
The simulation uses a specially derived function (the probability logistic function)
which converts a deviation measure into the probability that, at any given position, a
specific nucleotide is substituted by another. The deviation used here differs slightly
from equation 1: it is calculated per nucleotide rather than per OU word, and it is based
on the difference between the tester and target sequences (DevTester - DevTarget).
Consider the example sequence …TGGTGGGTCGTGTAGG…, in which the deviation at the
nucleotide T in GGGTCGT is to be calculated. The statistics of all OU words up to
tetra-mers that contain this T (GGGT, GGT, GT, GGTC, GTC, TC, GTCG, TCG, TCGT) are
calculated. A win score is then calculated for each possible substitution, e.g. for GGGT:
OU Word   Prior deviation    Posterior deviation   Win score (Posterior – Prior)
GGGT      -5.22 (DevGGGT)    -5.22 (DevGGGT)       0
GGGC      -5.22 (DevGGGT)    -4.71 (DevGGGC)       0.51
GGGA      -5.22 (DevGGGT)    -0.29 (DevGGGA)       4.93
GGGG      -5.22 (DevGGGT)    -2.39 (DevGGGG)       2.83
This is done for all related substitutions, and the deviation value for a specific nucleotide
is the sum of all win scores for that nucleotide, e.g. for nucleotide A in the sequence
above:
GGGT → GGGA * p4
GGT → GGA * p3
GT → GA * p2
GGTC → GGAC * p4
GTC → GAC * p3
TC → AC * p2
GTCG → GACG * p4
TCG → ACG * p3
TCGT → ACGT * p4
Here p2, p3 and p4 are the weighting parameters for di-mers, tri-mers and tetra-mers
respectively. By default, all weighting parameters are equal to one. This procedure is
carried out for every nucleotide (A, C, G, T) at each position of the sequence, and the
deviation value is a row vector of the form [DevA, DevC, DevG, DevT].
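The win-score summation described above can be sketched as follows. This is an illustrative reading of the procedure, not the OPEP implementation: the names `words_covering` and `position_deviation` are hypothetical, words missing from the deviation dictionary are assumed to have zero deviation, and any further normalisation of the resulting vector is not reproduced.

```python
BASES = "ACGT"

def words_covering(seq, pos, max_k=4):
    """Yield (start, k) for every k-mer (2 <= k <= max_k) that covers seq[pos]."""
    for k in range(2, max_k + 1):
        for start in range(max(0, pos - k + 1), min(pos, len(seq) - k) + 1):
            yield start, k

def position_deviation(seq, pos, dev, weights=None):
    """Return [DevA, DevC, DevG, DevT]: weighted sums of win scores at seq[pos].

    dev maps an OU word to DevTester - DevTarget (assumed 0 if the word is absent).
    """
    weights = weights or {2: 1.0, 3: 1.0, 4: 1.0}  # p2, p3, p4 default to one
    scores = []
    for base in BASES:
        total = 0.0
        for start, k in words_covering(seq, pos, max(weights)):
            prior = seq[start:start + k]
            offset = pos - start
            posterior = prior[:offset] + base + prior[offset + 1:]
            # Win score = posterior deviation - prior deviation
            total += weights[k] * (dev.get(posterior, 0.0) - dev.get(prior, 0.0))
        scores.append(total)
    return scores

# Deviation values from the GGGT example; all other words default to 0 here
dev = {"GGGT": -5.22, "GGGC": -4.71, "GGGA": -0.29, "GGGG": -2.39}
scores = position_deviation("GGGT", 3, dev)
```

For the position of T in GGGTCGT, `words_covering` yields the nine words listed in the text (four tetra-mers, three tri-mers and two di-mers).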
This deviation value (x) is then used in equation 4 to calculate the probability of
substitution:
$$P(x) = \frac{1}{3\left(1 + e^{-(a + bx)}\right)}$$ [4]
Equation 4 is derived from the statistical logistic function, which was chosen for two
reasons. Firstly, it maps any x value (from negative infinity to infinity) to a value
between 0 and 1 (a probability). Here the function is tailored to the probability range
[0; 0.33], because at any given position a nucleotide can be substituted by three other
nucleotides or not be substituted at all. As an example, suppose that the base at
position 10 of the sequence is G, that the deviation measure for position 10 is
[-4.3745; 1.2864; 5.2346; -2.1465] (deviation measures always sum to 0), where the
entries correspond to [A, C, G, T] respectively, and that μ equals 0.1. Using equation
[4], the probabilities become [0.002; 0.016; 0.779; 0.203]. Since the probabilities at a
position must sum to one, the probability for the base currently at that position is
adjusted: in this case the probability for G is 1 minus the sum of the substitution
probabilities for [A, C, T].
The second reason for using the logistic function is that its parameters “a” and “b” have
a biological interpretation. Parameter “a” is a function of μ such that at deviation = 0
the probability of substitution equals the average mutation rate μ. In a biological sense,
when there is no deviation the substitution rate should be as expected, i.e. equal to the
average mutation rate of the sequence. The equation for parameter “a” is as follows:
$$a = -\ln\left(\frac{1}{3\mu} - 1\right)$$
Parameter “b” represents the conservation of the sequence under study. Changing “b”
changes the shape of the function, i.e. how strongly a small change in deviation changes
the probability of substitution. In practice, if the sequence under study is conserved, a
large “b” value is used (b > 1), so that a small change in deviation causes a large
change in probability, as is appropriate for conserved sequences (Figure 7). The opposite
applies for low “b” values (0 < b < 1). In this study a “b” value of 1 is used for
uniformity.
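A logistic function with the properties described in this section (range between 0 and 1/3, value μ at zero deviation, steepness controlled by b) can be sketched as below. The closed form used here is an assumption chosen to satisfy those stated constraints, and the names `a_from_mu` and `substitution_probability` are hypothetical. With μ = 0.1 and b = 1 it returns 0.002, 0.016 and 0.203 (rounded to three decimals) for deviations of -4.3745, -2.1465 and 1.2864 respectively.

```python
import math

def a_from_mu(mu):
    """Offset a chosen so that the probability at zero deviation equals mu.

    Requires 0 < mu < 1/3.
    """
    return -math.log(1.0 / (3.0 * mu) - 1.0)

def substitution_probability(x, mu, b=1.0):
    """Map a deviation x to a substitution probability in the open range (0, 1/3)."""
    return 1.0 / (3.0 * (1.0 + math.exp(-(a_from_mu(mu) + b * x))))

for x in (-4.3745, -2.1465, 1.2864):
    print(x, round(substitution_probability(x, mu=0.1), 3))
```

Raising b above 1 makes the curve steeper around x = 0, so that small changes in deviation produce larger changes in probability; values between 0 and 1 flatten it.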
[Figure 7: four panels, each plotting probability (y-axis) against deviation (x-axis).]
Fig. 7. Shifts of the logistic probability function. Changing parameters “a” and “b” in equation [4] changes the shape of
the curve, giving the flexibility to reflect real-life situations. Parameter “a” is a function of the average mutation
rate μ. The function is tailored such that at x = 0 (the expected value, i.e. no deviation) the y intercept is equal to μ,
meaning that no abnormal mutation occurs and the mutation rate equals μ. A shift in parameter “a” moves the y
intercept up or down, as shown in the top left and right graphs. In a biological sense, this shift changes the range of
probabilities into which over- and under-represented words can be converted. Parameter “b” represents the
conservation of the sequence. A large “b” value compresses the graph, so that a small change in deviation causes a
large change in probability, while the opposite occurs for a small “b” value, as shown in the bottom left and right
graphs. The x value is the deviation value for each of [A, C, G, T] at a specific position.
Results
Simulation Results
To identify possible amelioration models, the amelioration process must first be
simulated to generate data for the fitting procedure. The simulation starts with
user-defined parameters, which are essential for controlling the results. These are the
number of iterations, the average mutation rate of the sequence under study and the
weighting parameters of the k-mer words (Figure 8). The number of iterations is the core
of the simulation: it determines how many generations of random nucleotide mutation
the sequence undergoes. The average mutation rate (μ) directly controls the amount of
mutation (Figure 7), and the k-mer weighting parameters determine the contribution of
each k-mer pattern to the deviation measure. The “save report” option writes the results
to a text file, with an advanced option to save the results after every N iterations. In
this study, μ values of 0.00001, 0.00004, 0.00008 and 0.0001 were used for all
combinations of tester and target, and the number of iterations was set to 2000. The
k-mer normalization was left unchanged and the report was saved after every 100
iterations.
The simulation first calculates the necessary OU statistics: the deviation of each OU
word (equation 1), the variance of the tester and target sequences based on the
deviation (equation 2) and the compositional variance (V0) between the two sequences
(equation 2 applied to DevTester - DevTarget). At each iteration, the deviation value at
each position is calculated as described for the probability logistic function, and the
probability of substitution for each nucleotide is obtained from equation 4. A random
number between zero and one is then drawn, and its value determines which nucleotide
the base is substituted by. For example, if the base at the position is G and the
mutation probabilities are [0.002; 0.016; 0.779; 0.203] (from the previous example), the
cumulative mutation probabilities are [0.002; 0.018; 0.797; 1] for [A, C, G, T]
respectively. If the random number falls between 0 and 0.002, G mutates to A; if it falls
between 0.002 and 0.018, G mutates to C, and so on for the other nucleotides. This is
done for every base position in the sequence, and at every new iteration the procedure
is repeated, starting from the calculation of the OU statistics (since these differ for the
new sequence). The simulation ends when all iterations specified by the user are done.
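The draw of a random number against the cumulative probabilities can be sketched as follows. The function name `mutate_base` and its signature are hypothetical; the sketch only covers the choice at a single position, not the recalculation of OU statistics between iterations.

```python
import random
from itertools import accumulate

BASES = "ACGT"

def mutate_base(current, sub_probs, rng=random):
    """Choose the base at one position for the next generation.

    sub_probs holds the substitution probabilities for A, C, G, T; the entry for
    the current base is replaced by 1 minus the sum of the other three.
    """
    probs = list(sub_probs)
    idx = BASES.index(current)
    probs[idx] = 1.0 - sum(p for i, p in enumerate(probs) if i != idx)
    cumulative = list(accumulate(probs))  # e.g. [0.002, 0.018, 0.797, 1.0]
    roll = rng.random()
    for base, bound in zip(BASES, cumulative):
        if roll < bound:
            return base
    return BASES[-1]  # guard against rounding leaving the last bound just under 1

# Example from the text: current base G, substitution probabilities for A, C and T
new_base = mutate_base("G", [0.002, 0.016, 0.0, 0.203])
```

Passing a seeded `random.Random` instance as `rng` makes a simulation run reproducible.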
Fig. 8. Parameter settings for the simulation process. μ is the average mutation rate of the sequence and is entered
by the researcher depending on the data; in this research, μ values of 0.00001, 0.00004, 0.00008 and 0.0001 were
used. The iteration parameter determines how many generations of mutation the sequence undergoes. Each
iteration calculates new substitution probabilities from the newly calculated deviation values of the simulated
sequence. The k-mer option (k ranging from 2 to 7) specifies the weighting of each k-mer pattern. The save report
option saves the final simulation report to a text file. The advanced option under save report specifies the intervals
at which results are reported; saving the sequence is also available as an option.
Initial simulation testing was done with a P. aeruginosa 01 GI as tester against a
P. aeruginosa genomic fragment, distinct from the GI sequence, as target. Different
combinations of iteration number (100, 300, 500, 1000, 5000, 10000) and μ (0.000001,
0.00001, 0.0001, 0.001, 0.01) were used to test the efficiency of the simulation program.
Computation time increases steeply with the number of iterations (100 iterations: 8
hours; 10000 iterations: 4 weeks), while μ makes no significant difference to computation
time. Another factor that influences computation time is the internal variance of the
tester sequence (6 hours per 100 iterations for E. coli, 10 hours for S. griseus):
sequences with lower internal variance tend to run faster. The μ parameter does,
however, affect the logic of the simulation, and two situations can occur. High values of
μ allow too many substitutions per iteration, so that the amelioration process proceeds
unnaturally fast (Figure 9) and becomes close to random, which can be seen as a
limitation of the program. Low values of μ have the opposite effect. Of the various μ
values tested, the range [0.00001; 0.0001] worked best.
[Figure 9: tetra-mer variance per iteration; variance (y-axis, 0 to 3.5) against iteration (x-axis, 0 to 600) for Mu 0.0001 and Mu 0.0008.]
Fig. 9. Effect of different μ values on the variance measure. A higher μ value increases the number of substitutions
per iteration (increasing μ increases the probability of substitution at each base position through the logistic
probability function), which explains why the red curve declines more steeply than the blue curve over the first 100
iterations. Because of the higher μ, the red curve reaches the minimum variance value much sooner, after which
random substitution occurs: once the tester sequence has reached a state similar to the target sequence (minimum
variance), unlikely substitutions become more likely at higher μ (upward slope from iteration 100 onwards), so that
the tester mutates away from the target. The blue curve is a better representation of the amelioration process than
the red curve: the process is smoother and better suited to biological applications.
After the initial simulation run, the four tester and five target combinations were run
with the different μ values for each combination. A total of eighty simulations were done
and each was analyzed individually. All simulation results shared a similar trend: a large
decrease in variance in the initial period (usually iterations 0 to 500), with the first 100
iterations showing the greatest change in variance. The decrease gradually diminishes
and the variance reaches a limit value, similar to the blue curve in figure 11. The results
differ between tester and target combinations (rate of variance decrease, limit value),
but a general trend can be seen: combinations with similar compositional patterns (low
V0) tend to have a lower limit value and a lower rate of variance decrease, while
combinations with very different compositions (high V0) tend to have a high limit value
and a high rate of variance decrease.
We found that the general trend of the simulation data follows a logistic curve, much
like a Verhulst model. The model may therefore be applicable to simulating the
amelioration of the DNA composition of horizontally acquired islands towards the host
genome pattern; however, we found that a high rate of substitution (μ) may make the
sequence deviate significantly from the Verhulst-based expectation. It was therefore
necessary to evaluate the appropriateness of the Verhulst model for different μ values
and tester-target combinations by investigating the residual values. These results are
presented in the next section.
Verhulst Model Fitting
The Verhulst model is well known in biology for modeling population growth (Horowitz
et al., 2010; Koseki and Nonaka, 2012). The sigmoid curve that defines the model is
useful here: the steep change at the beginning reflects the signature of the amelioration
process (directional mutation), and the gradual levelling off at the end reflects the
similarity of the resulting sequences. This shape closely resembles the results of the
simulation. Using the simulation output, we tested whether the amelioration process can
be described by a Verhulst-based model. The simplest way to do this, and the one used
in this study, is to derive an empirical differential equation from the simulated data and
check whether it matches the Verhulst model. The Verhulst differential equation [5] and
standard equation [6] are shown below:
$$\frac{dV}{dt} = gV\left(1 - \frac{V}{K}\right)$$ [5]
$$V(t) = \frac{K V_0 e^{gt}}{K + V_0\left(e^{gt} - 1\right)}$$ [6]
where g controls the slope of the differential equation and K is the maximum capacity of
the variable V. In equation [6], the additional parameters are V0 (not to be confused
with the compositional variance), which is the initial value at time t = 0, and t, the time
period.
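For reference, the textbook closed-form solution of the Verhulst differential equation can be evaluated with a few lines of Python. The function name `verhulst` and the parameter values in the example are illustrative only and are not fitted values from this study.

```python
import math

def verhulst(t, v0, g, k):
    """Solution of dV/dt = g*V*(1 - V/K) with initial value V(0) = v0."""
    growth = math.exp(g * t)
    return k * v0 * growth / (k + v0 * (growth - 1.0))

# Illustrative values: a variance starting above the capacity K decays towards K
for t in (0, 100, 500, 2000):
    print(t, verhulst(t, v0=6.2, g=0.002, k=3.0))
```

When the initial value lies above K, as for a compositional variance that decreases during amelioration, the curve falls monotonically towards K instead of rising to it.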
The compositional variance between tester and target from the simulation data of
di-mers, tri-mers and tetra-mers separately (columns 2V, 3V and 4V in Table 2) is taken
as the y values and the iteration number as the x values, and linear regression is used
to fit an equation. This can be viewed as the change in composition of the sequence
over time. Suppose the iteration dataset consists of {t0, t1, … , tn-1, tn} and the
variance dataset of {V0, V1, … , Vn-1, Vn} (V can be the di-mer, tri-mer or tetra-mer
variance). The empirical differential dataset is calculated by fitting a straight line
through the first “n” points and taking its derivative, with “n” taking the values
{2, 3, …, N, N+1}, where N is the cutoff value. The cutoff value is the point at which the
variance reaches equilibrium or no longer decreases (Vn – Vn-1 > 0 or very close to
zero). The reason for the cutoff is that we are trying to determine the limit value, which
is one of the parameters of the Verhulst equation (K): values beyond the cutoff skew
the differential dataset (the derivative is very close to zero because many points are
close to each other in value), so that the estimate of K becomes increasingly
inaccurate.
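The cutoff rule can be expressed as a short search over successive differences. This is only a sketch: the function name `cutoff_index` is hypothetical, and the tolerance that decides what counts as "very close to zero" is an assumed parameter, since no threshold is stated in the text.

```python
def cutoff_index(variance, tol=1e-3):
    """Index of the first point where the variance stops decreasing.

    A point qualifies when V[n] - V[n-1] is positive or within tol of zero.
    If the series never levels off, the last index is returned.
    """
    for n in range(1, len(variance)):
        if variance[n] - variance[n - 1] > -tol:
            return n
    return len(variance) - 1

# Di-mer variance column (2V) of Table 2: still decreasing at iteration 600
dimer_variance = [6.23378, 4.42955, 3.99178, 3.71179, 3.49119, 3.30584, 3.16697]
print(cutoff_index(dimer_variance))
```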
Table 2. Simulation Output
Iteration  Mutation  Delta  2D       2V       3D        3V       4D        4V
0          0         0      8.82352  6.23378  12.16346  5.58532  12.38448  5.31588
100        6750      6750   6.61764  4.42955  9.75961   4.2549   9.93433   4.30297
200        8908      2158   5.88235  3.99178  8.84615   3.93263  9.50571   4.07895
300        10498     1590   5.88235  3.71179  8.36538   3.71797  9.19868   3.93205
400        11837     1339   5.88235  3.49119  8.02884   3.54465  8.99805   3.80521
500        12963     1126   4.41176  3.30584  7.74038   3.39143  8.80654   3.69319
600        13955     992    4.41176  3.16697  7.45192   3.27523  8.6363    3.60998
Iter: Iteration; Mut: Cumulative mutation frequency; Delta: Mutation occurred during the iteration period; D:
Distance; V: Variance
For example, let N be equal to 20. The first row of the differential dataset is the
gradient of the line through {(t0, V0), (t1, V1)}, and the second row is the gradient of the
line through {(t0, V0), (t1, V1), (t2, V2)}. This continues until all N points have been
included. The step is done separately for the di-mer, tri-mer and tetra-mer datasets.
Figure 10 gives a visual representation of this step.
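The growing-window line fitting described above can be sketched in plain Python using the di-mer variance column (2V) of Table 2. The helper names `slope` and `growing_window_slopes` are hypothetical, and ordinary least squares is assumed for the line fit.

```python
def slope(xs, ys):
    """Least-squares gradient of the straight line fitted through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def growing_window_slopes(t, v):
    """Gradient of the line through the first n points, for n = 2 .. len(t)."""
    return [slope(t[:n], v[:n]) for n in range(2, len(t) + 1)]

iterations = [0, 100, 200, 300, 400, 500, 600]
dimer_variance = [6.23378, 4.42955, 3.99178, 3.71179, 3.49119, 3.30584, 3.16697]
derivatives = growing_window_slopes(iterations, dimer_variance)
```

Each entry of `derivatives` is one row of the empirical differential dataset; the first is simply the gradient between the first two points.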
Fig. 10. Creating the dataset for fitting the differential equation. The di-mer variance data points (variance being the
variable under study) are used as the points for line fitting. Initially a line is fitted through two points (top left); after
each fit another point is added (top right, bottom left) until no more points are available (bottom right). For each
fitted line, the derivative is used as a data point for the differential function.
Using the empirical differential dataset, a differential equation is fitted and the
parameters of the model are estimated (Figure 11). The fitted model has the same form
as the Verhulst differential equation [5]. To obtain the standard Verhulst equation [6],
the parameters g and K must be derived. The fitted differential equation is a quadratic
function of V with two parameters (g and K) to be estimated. Because of its form, it can
be linearized by dividing by V, so that equation [5] can be rewritten as:
$$\frac{1}{V}\frac{dV}{dt} = a + bV$$
Using this simplified equation, linear regression can be performed on the empirical
dataset and both parameters can be estimated (g = a, K = -a/b). A Python script was
written to perform the whole process, from building the dataset to estimating both
parameters. The program goes one step further and estimates the best g value (in terms
of least squared error) for the given K. Since the simulated data follow a Verhulst
differential equation, we can assume that the standard Verhulst equation can model the
amelioration process, where g describes the trend (gradual or abrupt) of the
amelioration process, K the maximum similarity the tester and target can share in terms
of compositional variance, t the time period of amelioration and V the compositional
variance between tester and target.
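The linearized regression can be sketched as below. This is not the thesis script: the function names are hypothetical, and the input is synthetic data generated from known, illustrative values of g and K so that the recovered parameters can be checked against them.

```python
def fit_line(xs, ys):
    """Ordinary least squares; returns (a, b) for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def estimate_verhulst_parameters(v, dv_dt):
    """Regress (dV/dt)/V on V, then map (a, b) to g = a, K = -a/b and m = g/K."""
    a, b = fit_line(v, [d / x for x, d in zip(v, dv_dt)])
    g = a
    k = -a / b
    return g, k, g / k

# Synthetic derivatives generated from dV/dt = g*V*(1 - V/K), illustrative values
true_g, true_k = 0.002, 3.0
v = [6.0, 5.0, 4.5, 4.0, 3.5, 3.2]
dv_dt = [true_g * x * (1.0 - x / true_k) for x in v]
g, k, m = estimate_verhulst_parameters(v, dv_dt)
```

Because the transformed relation is exactly linear in V, noise-free data return the generating values of g and K; with simulated data the quality of the fit depends on how closely the derivatives follow the Verhulst form.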
Fig. 11. Fitting of the Verhulst differential equation. The differential equation (right) is fitted to the dataset of
derivatives (left). A Python script was written both to build the dataset and to fit the equation. The right graph
compares the empirical derivative (blue) with the estimated derivative (red).
The Verhulst equation was then fitted to all combinations of tester, target and μ, giving
two parameters, g and m, where m is equal to g/K. The results were recorded in a table
listing each combination with its corresponding μ values (Table 3). The table was
analyzed independently for each parameter (g and m) and its relationship to the input
parameters (tester, target, μ). Residual values were also recorded for each combination
as the absolute difference between the estimated equation and the simulation data.
The residual dataset gives more insight into how well the Verhulst equation fits the
simulated amelioration process and into which factors influence the fit.
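The residual calculation amounts to evaluating the fitted curve at each recorded iteration and taking the absolute difference from the simulated variance. In the sketch below the function names are hypothetical, the simulated values are the di-mer variance column of Table 2, and the g and K values are placeholders rather than fitted parameters from this study.

```python
import math

def verhulst(t, v0, g, k):
    """Standard Verhulst equation with initial value v0, slope g and capacity k."""
    growth = math.exp(g * t)
    return k * v0 * growth / (k + v0 * (growth - 1.0))

def absolute_residuals(times, simulated, v0, g, k):
    """|estimated - simulated| at each recorded iteration."""
    return [abs(verhulst(t, v0, g, k) - v) for t, v in zip(times, simulated)]

times = [0, 100, 200, 300, 400, 500, 600]
dimer_variance = [6.23378, 4.42955, 3.99178, 3.71179, 3.49119, 3.30584, 3.16697]
# g and k are placeholders for illustration, not values estimated in the thesis
residuals = absolute_residuals(times, dimer_variance, dimer_variance[0], g=0.002, k=3.0)
```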
Table 3. Verhulst Model Fitting Parameters
Tester/Target    μ        Dimer m   Dimer g   Trimer m  Trimer g  Tetramer m  Tetramer g
PAGI/Ecoli       0.00001  0.001027  0.002034  0.001107  0.002628  0.000918    0.002566
PAGI/Ecoli       0.00004  0.001765  0.002859  0.001976  0.004129  0.001310    0.003145
PAGI/Ecoli       0.00008  0.001681  0.002236  0.002202  0.004293  0.001917    0.004506
PAGI/Ecoli       0.0001   0.001603  0.001915  0.003472  0.007150  0.002957    0.007223
PAGI/9a5c        0.00001  0.000160  0.000148  0.000413  0.000825  0.000551    0.001258
PAGI/9a5c        0.00004  0.000424  0.000314  0.000927  0.001558  0.001297    0.002711
PAGI/9a5c        0.00008  0.001004  0.001014  0.001600  0.002720  0.002633    0.005502
PAGI/9a5c        0.0001   0.001122  0.001133  0.001997  0.003489  0.002608    0.005492
PAGI/Subtilis    0.00001  0.001264  0.003804  0.001379  0.004922  0.001156    0.004392
PAGI/Subtilis    0.00004  0.001013  0.002147  0.002205  0.007278  0.001911    0.006652
PAGI/Subtilis    0.00008  0.001049  0.001573  0.002548  0.007849  0.002322    0.007661
PAGI/Subtilis    0.0001   0.001010  0.001271  0.002816  0.008538  0.002646    0.008640
PAGI/Aeruginosa  0.00001  0.000347  0.000346  0.000516  0.000795  0.000714    0.001428
PAGI/Aeruginosa  0.00004  0.000727  0.000371  0.000969  0.001095  0.001282    0.002269
PAGI/Aeruginosa  0.00008  0.001037  0.000539  0.001289  0.001353  0.001680    0.002889
PAGI/Aeruginosa  0.0001   0.001193  0.000585  0.001365  0.001436  0.002120    0.003604
PAGI/Griseus     0.00001  0.000696  0.001446  0.000734  0.001717  0.000807    0.002385
PAGI/Griseus     0.00004  0.000948  0.001299  0.000919  0.001488  0.000916    0.002024
PAGI/Griseus     0.00008  0.001449  0.001710  0.001390  0.002126  0.001297    0.002801
PAGI/Griseus     0.0001   0.001599  0.001837  0.001674  0.002506  0.001458    0.003053
Griseus          0.00004  0.001216  0.002758  0.001109  0.002897  0.001134    0.003408
Griseus          0.00008  0.001643  0.003024  0.001921  0.004434  0.002043    0.005856
Griseus          0.0001   0.001756  0.003011  0.002188  0.004981  0.002365    0.006774
PAGI: P. aeruginosa genomic island; Ecoli: E. coli K12; 9a5c: X. fastidiosa strain 9a5c; Subtilis: B. subtilis 168;
Aeruginosa: P. aeruginosa; Griseus: S. griseus. m and g are parameters of the Verhulst equation, where m = g/K
(see Methods). The B. subtilis, E. coli K12 and S. coelicolor genomic islands in combination with the five targets are
not displayed in this table (see Appendix Table 1).
Box and whisker plots were made from the 240 residual values obtained from the
combinations of 4 testers, 5 targets, 4 μ values and 3 k-mer sizes. The first plot shows the
influence of the different μ and k-mer combinations on the equation fitting step (Figure
12a). The plot shows a general trend: increasing μ decreases the residual of the fit. This
is seen for all k-mer sizes with respect to the mean residual value (represented by the
black line inside the blue box). It can be explained by the low substitution rate at low μ,
which leads to a higher estimate of the capacity value K from the simulation data. With a
higher K value, the g parameter, which depends on K, becomes lower (g determines the
slope of the graph, so with a higher K value the graph becomes less steep to
compensate).
Fig. 12a. Box and whisker plot of residual values by k-mer and μ combination. 2, 3 and 4 in the graph denote di-mer,
tri-mer and tetra-mer respectively, while the decimal values represent μ. The graph shows that lower μ values fit the
Verhulst equation poorly and that a gradual increase in μ decreases the residual value (calculated as simulation
variance – variance estimated by the Verhulst equation). Some outliers occur at higher μ values and correspond to
poorly matched tester and target combinations (see Figure 12b). There are more outlying residual values for tri-mers
and tetra-mers because their capacity value K is higher than that of di-mers (see Figure 13). The capacity value K is
therefore reached sooner in the tri-mer and tetra-mer cases, which allows more random substitution than for
di-mers and hence a higher residual value.
Figure 12a also shows that higher μ values caused more outliers among the tri-mer and
tetra-mer residual values. Given the small number of outliers in each set of data (at
most 3 outliers out of 20 data points), a likely explanation is that the compositions of
some tester and target combinations are very different from each other. In that case the
tester has difficulty converging on the target sequence, which leads to a higher K value
(the capacity at which tester and target are similar in compositional pattern). This in turn
produces a situation similar to that of the di-mer dataset, where a higher K value causes
a poor fit of the Verhulst equation and therefore an outlying residual value.
Fig. 12b. Box and whisker plot of residuals between the estimated Verhulst equation and the simulation data. Each
box represents a tester and target combination; the abbreviations are: BS – B. subtilis, EC – E. coli, XF –
X. fastidiosa, PA – P. aeruginosa, SC – S. coelicolor, SG – S. griseus. The two highest residual values were 9.136
(PA/BS combination) and 9.2025 (SC/BS combination), while the three lowest were 0.4003 (SC/PA combination),
0.6146 (EC/BS combination) and 0.6636 (PA/XF combination). The SC/PA combination had the smallest range of
residual values as well as the smallest mean. The orange circles represent the outliers of the box plots.
This explanation is supported by figure 12b, where some combinations of tester and
target show significantly larger residuals than the others. Combinations PA/BS and SC/BS
showed the highest residuals, followed by BS/SG and BS/PA. In these four combinations
(PA/BS, SC/BS, BS/SG and BS/PA, where the first abbreviation is the tester and the second
the target), the same organisms are involved, with tester and target merely swapped.
This implies that for specific pairs of organisms (PA and BS; SC or SG and BS, where SC
and SG are closely related) the Verhulst equation fits poorly, hence the large residual
values. From a different perspective, the combinations of PA and SC against BS differ
significantly from the rest: PA and SC share a similar internal variance (compositional
parameter), while BS is significantly different from the two. Conversely, the lowest
residual values were those of combinations SC/PA, PA/XF and EC/BS. All three
combinations have a small pattern variance, but PA and XF do not have similar internal
variance. A hypothesis can therefore be made that the compositional statistics of tester
and target (internal variance, variance) and μ are determining factors for the Verhulst
equation parameters.
To understand which factors influence the parameters of the Verhulst Equation, Figures
13 and 14 plot the relationship between the two parameters from the Verhulst fitting
and the different combinations of tester and target. The python script outputs six
parameters: di-mer K (2K) and g (2g), tri-mer K (3K) and g (3g), and tetra-mer K (4K)
and g (4g). Three additional parameters, 2m, 3m and 4m, were calculated as stated in the
methods section. Figure 13 first shows the differences between the capacity values of
the different K-mers as well as a comparison between different combinations. It is clear
from the graph that the tetra-mer K value for every combination is larger than the
tri-mer K value, which in turn is larger than the di-mer K value. This is biologically
plausible: tetra-mer patterns are much more complex than tri-mer and di-mer patterns and
should therefore show a larger variance difference (there are 256 different words of 4
nucleotides in length, compared with 64 tri-mers and 16 di-mers).
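The word counts quoted for each K-mer follow from the four-letter DNA alphabet: there are 4^k
distinct words of length k. A minimal Python sketch of this count is given here; the function
name is illustrative and is not taken from the thesis script.

```python
from itertools import product


def kmer_words(k, alphabet="ACGT"):
    """Return every word of length k over the given alphabet."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


# Number of distinct di-, tri- and tetra-mer words.
for k in (2, 3, 4):
    print(k, len(kmer_words(k)))
```

The list grows as 4^k, which is why the tetra-mer pattern has far more components than the
di-mer pattern.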
Another interesting point in figure 13 is that combinations SC/BS and PA/BS showed the
highest K values, which corresponds to figure 12b. Combinations EC/SG and SC/EC also
display a high K value, and since EC and SC or SG are highly different in composition,
this supports the hypothesis that compositional statistics influence the amelioration
model parameters. A less obvious point is that in the two combinations with the lowest K
values, PA/PA and EC/EC, tester and target are of the same organism. This points to
another potential factor: the relatedness of the species may affect the parameters of
the Verhulst Equation.
[Figure 13 chart: "Parameter K estimation by Verhulst Model for different K-mer". x-axis: Tester/Target Combinations; y-axis: K (0 to 4.5); series: 2K, 3K, 4K.]
Fig. 13. A combined graph showing the different tester/target combinations and their estimated K parameter based on
the Verhulst Equation. The x-axis shows the different tester/target combinations and the y-axis the estimated K
value. All estimations follow the same trend, with tetra-mer (green) having the highest K estimate and di-mer (blue)
the lowest. This is because the tetra-mer pattern is much more complex than the di-mer pattern (256 combinations
compared to 16, respectively), so the patterns of tester and target variance lie further apart (represented by K).
The combinations in which P. aeruginosa (PA) or E. coli (EC) serves as both tester and target achieved the lowest K
estimates. The common factor between these two combinations is that tester and target are of the same organism.
This is not conclusive, however, because the B. subtilis (BS)/BS and S. coelicolor (SC)/S. griseus (SG)
combinations do not follow this trend. The highest K values are achieved by combinations with BS as target and SC or
PA as tester. A likely explanation is that the internal variances of PA and SC differ strongly from that of BS. The
same can be seen for the combination of EC and SC; we therefore conclude that the internal variance of tester and
target influences the estimation of parameter K.
Aside from parameter K, parameter g represents the rate, or slope, of the Verhulst
Equation. A high g value implies a steep slope, while a low value gives a shallow curve.
Parameter g should therefore be directly linked to μ, a set constant which determines
the amount of substitution (see methods). According to the hypothesis, the parameters of
the Verhulst Equation should be affected by three independent variables, namely tester
internal variance, target internal variance and μ. If tester internal variance did not
affect the parameters, then the g and K values would be identical for combinations with
the same target and different testers; the same reasoning applies to target internal
variance. This is not the case, as shown in figure 14, where each tester and target
combination follows a different line (x-axis μ, y-axis g parameter). The graph shows
that parameter 3g is a function of tester internal variance, target internal variance
and μ. The same analysis was done for the other five parameters 2g, 4g, 2m, 3m and 4m
(see Appendix Figures 2, 3, 4, 5 and 6).
The python script which fits the equation to the data always uses the parameters that
produce the smallest residue. This sometimes results in extreme parameter values, as
seen in figure 14. In figure 13, SC/EC and SC/BS had a high K value, and to compensate
for this, their g values were significantly different from those of the other
combinations. This creates potential outliers which could affect the estimation of the
parameter function at a later stage.
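To illustrate how a least-residue fit can trade K against g, the following Python sketch fits a
Verhulst curve to data by an exhaustive grid search. It is not the thesis script: the closed-form
logistic solution, the residue defined as a cumulative absolute difference, the grid search itself,
the grids and the synthetic data are all assumptions made for the example.

```python
import math


def verhulst(t, v0, k, g):
    """Logistic curve starting at v0 and approaching capacity k at rate g."""
    return k / (1.0 + ((k - v0) / v0) * math.exp(-g * t))


def residue(ts, vs, v0, k, g):
    """Cumulative absolute difference between the curve and the data."""
    return sum(abs(verhulst(t, v0, k, g) - v) for t, v in zip(ts, vs))


def fit_grid(ts, vs, v0, k_grid, g_grid):
    """Return (residue, K, g) for the grid point with the smallest residue."""
    return min((residue(ts, vs, v0, k, g), k, g) for k in k_grid for g in g_grid)


# Synthetic stand-in for simulation output (not data from the thesis).
ts = list(range(0, 2000, 100))
v0 = 3.0
vs = [verhulst(t, v0, 1.5, 0.002) for t in ts]

k_grid = [i / 10 for i in range(5, 26)]    # candidate K values 0.5 .. 2.5
g_grid = [i / 1000 for i in range(1, 11)]  # candidate g values 0.001 .. 0.010
best_residue, k_hat, g_hat = fit_grid(ts, vs, v0, k_grid, g_grid)
print(best_residue, k_hat, g_hat)
```

Because only the residue is minimized, nothing in such a procedure prevents an unusually high K
from being paired with an unusual g when the data are poorly described by the curve.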
[Figure 14 chart: "Tri-mer g parameter for Different Tester/Target combinations". x-axis: mu (0 to 0.00012); y-axis: g (0 to 0.03); one line per combination: PA/EC, PA/XF, PA/BS, PA/PA, PA/SG, BS/EC, BS/XF, BS/BS, BS/PA, BS/SG, EC/EC, EC/XF, EC/BS, EC/PA, EC/SG, SC/EC, SC/XF, SC/BS, SC/PA, SC/SG.]
Fig. 14. Plot of the tri-mer g parameter estimate of the Verhulst Equation for all 20 combinations of tester and
target. The combinations of tester S. coelicolor (SC) with targets E. coli (EC) and B. subtilis (BS) are clearly the
outliers of the dataset. Due to the nature of the model fitting, the best estimates of K and g are determined from
the simulation dataset. Since the pattern composition (caused by internal variance) of the SC/EC and SC/BS
combinations is highly different, extreme g and K values are estimated (also seen in Figure 13). These values are
therefore eliminated to obtain a better linear fit for the rest of the data.
It was found that the Verhulst model in general fits the simulation of the DNA
amelioration process, as the cumulative error residues were within the range of 0.4003
to 9.2025 in absolute value. As expected (see the previous section), mismatches were
greater when higher μ values were set. However, contrary to our expectation, significant
deviations were observed in several tester/target pairs (see Fig. 12b), particularly
BS/PA, PA/BS, BS/SG and SG/BS. This indicates that the current model does not yet
account for all factors influencing the amelioration process. As the model worked
appropriately for the majority of tester/target combinations, we decided to use it as a
working model for further analysis, keeping in mind that the model should be improved in
future. In the next section we consider approaches to estimating the Verhulst model
parameters.
Parameter Function Estimation
From the figures of the previous section (12a, 12b, 13 and 14) we can assume that the
Verhulst Model parameters for each combination depend on the characteristics of the
tester and target of that combination. We tested this using the Verhulst Equation
parameters from all combinations of tester and target grouped together. The initial
dataset of tester and target combinations was chosen as a small sample representing the
vast bacterial kingdom (each target was very different in its characteristics from the
others, see the discussion section on Parameter Function Estimation). Hence we assume,
and test, that the general trend (dependency of the Verhulst Equation parameters on
tester and target characteristics) holds for all bacterial combinations.
The core of the Verhulst Equation lies in its two parameters (K and g), whereas the
others are user inputs (V0 and V(t)). The dataset with parameters g and K from the
Verhulst Equation was further reduced to parameters g and m (where K = g/m and therefore
m = g/K). Since K depends on m in the standard Verhulst Equation, it is more sensible to
fit an equation to m rather than to K. This also eliminates the dependence of K on g and
hence reduces unnecessary error within the regression process. Based on the model
fitting of the 80 combinations of different testers and targets, a general trend
(Figure 15) can be seen in the dataset of parameters g and m, where K = g/m. To verify
this relationship, a multivariate linear regression was carried out in SAS on parameters
g and m. The variables used to test for the linear relationship are tester internal
variance (V_Tester), target internal variance (V_Target), compositional variance between
tester and target patterns (V0), μ (Mu), the absolute difference between tester and
target internal variance (VT-VTe) and whether the tester and target are of similar
organisms (variable equal to 1 if yes, 0 if no). Different models were tested using SAS
Enterprise 4.3 on parameters g and m for the different k-mers (di-, tri-, tetra-). The
model with the best R-squared value was chosen using the stepwise selection method.
Fig. 15. Graph showing the relationship between the di-mer g parameter, μ, variance of target and variance of tester.
Each line shows the linear relationship of g to the other three parameters present in the model (μ, Vtarget, Vtester).
The majority of the tester/target combinations follow a general trend (linear relationship).
Figure 16 shows the multivariate linear regression analysis result for parameter 4g
(tetra-mer g parameter) with 80 data points, each representing one combination of tester
and target in association with a μ value. The adjusted R-squared value is 67.91%, which
shows that the variables used to model parameter 4g are not sufficient, as the model
explains only 68% of the variation in the dataset. The bottom table of figure 16 also
gives the significance of each variable within the model, where a high p value
(Pr > |t|) indicates that the variable is insignificant. Using a 5% significance cutoff,
three variables, namely VT-VTe, Related and Mu, are shown to be less useful in the model
than the remaining variables (V_Tester, V_Target and V0).
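For reference, the adjusted R-squared penalizes the ordinary R-squared for the number of
predictors in the model. The Python sketch below uses the standard textbook formula; the
function name and the R-squared value of 0.70 are illustrative only and are not taken from the
SAS output.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjusted R-squared for a linear model with an intercept."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof


# Same hypothetical R-squared, 80 observations, six versus three predictors.
print(adjusted_r_squared(0.70, 80, 6))
print(adjusted_r_squared(0.70, 80, 3))
```

For the same R-squared, the model with fewer predictors receives the higher adjusted value,
which is why removing variables that add little explanatory power can improve this measure.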
Considering figure 14, where extreme outliers were present, a new dataset was
constructed by filtering out these combinations. To speed up the process, selection
methods (forward, backward and stepwise) were used to eliminate insignificant variables,
which affect the adjusted R-squared value (the adjusted R-squared value is calculated
according to the number of variables within the model), thereby limiting the potential
variance increase caused by additional variables in the model (Figure 17). Through
selection, the g parameter function, chosen at a 99% confidence level for each variable,
is of the form:
And similarly, the m parameter function is of the form:
where a, b and c are constants estimated using multiple linear regression, VTester is
the internal variance of the tester sequence, VTarget is the internal variance of the
target sequence, V0 is the variance between the tester and target sequence and μ is the
average mutation rate of the tester.
The new fitted model achieved an adjusted R-squared value of 91.01%, a significant
increase over the previous model and an acceptable value for concluding that the fitted
model explains parameter 4g well. The other parameters (2g, 3g, 2m, 3m and 4m) were
modeled the same way using multivariate linear regression (Table 4). The general forms
of the two parameter estimation functions are:
Fig. 16. Multivariate regression analysis with all combinations of tester, target and μ. The adjusted R-squared value
(a coefficient of determination of how well the model explains the data), which takes into consideration the number
of variables used, is a better measure than the normal R-squared value. The adjusted R-squared value for the model
with regard to this dataset is poor: only 67.91% of the data is explained by the fitted model.
Table 4. Multivariate Linear Regression analysis results for different parameters

Parameter   Variables                 Adjusted R-squared
2g          V_Tester, Mu              76.26%
3g          V_Tester, Mu              79.91%
4g          V_Tester, V_Target, Mu    91.01%
2m          V_Tester, V0, Mu          81.40%
3m          V_Tester, V0, Mu          89.40%
4m          V_Tester, V0, Mu          93.66%

2g: Di-mer g parameter, 3g: Tri-mer g parameter, 4g: Tetra-mer g parameter, 2m: Di-mer m parameter, 3m: Tri-mer m
parameter, 4m: Tetra-mer m parameter, V_Tester: Tester internal variance, V_Target: Target internal variance, Mu: μ,
V0: Compositional variance between tester and target pattern. SAS output in Appendix Figure 7, 8, 9, 10 and 11.
All three selection methods arrived at the same model (Figure 17). Since the model is
identical in all three cases, we can assume that it is the best model for the dataset
used. Three variables were eliminated (VT-VTe, V0 and Related), as each contributes less
than a 0.5% increase in R-squared value and is therefore insignificant within the model.
The other three variables make up the simplest model with which parameter 4g can be
modeled with the highest accuracy.
Taking the analysis further: if more combinations of tester and target are added, will
the model still be sufficient to explain the parameters, or will more variables be
needed to compensate? Figure 18 displays the analysis of three datasets, each using a
different number of tester and target combinations: two testers with five targets, three
testers with five targets and four testers with five targets. The same multivariate
linear regression analysis was done on all three datasets and gave similar results. The
first dataset (two testers, five targets) shows a 90.39% adjusted R-squared value with
variables V0, V_Tester, V_Target and Mu being significant. Similarly, the model for the
second dataset (three testers, five targets) achieved a 90.93% adjusted R-squared with
the same variables. The only difference is that the model for the third dataset (four
testers, five targets) has one variable fewer (V0) but still achieved an adjusted
R-squared value of 91.01%. From these results, we can assume that the model is
sufficient to explain the estimated parameter for any number of tester and target
combinations, with the exception of the extreme outliers, which the parameter function
estimates poorly.
Fig. 17. Different variable selection methods based on a 5% significance level. Six variables were taken into
consideration from the hypothesis, namely tester internal variance (V_Tester), target internal variance (V_Target),
μ (Mu), Related (equal to one if the tester and target organisms are closely related, zero otherwise), V0 (variance
between tester and target composition pattern) and VT-VTe (internal variance of the target minus the internal
variance of the tester). The final best models fitted by the three selection methods (backward elimination, forward
selection and stepwise selection) are identical and are shown in the first table, with three variables (VT-VTe, V0
and Related) not being significant factors in explaining parameter 4g.
Fig. 18. Multivariate linear regression fitting of parameter 4g (tetra-mer g parameter). Top left: 40 observations
(combinations of two testers, five targets and four μ values); the fitted model achieved an adjusted R-squared value
of 90.39%. At a 5% significance level, only variables V_Tester (tester internal variance), V_Target (target internal
variance), V0 (compositional variance difference between tester and target) and μ play an important role in
explaining parameter 4g. Top right: 60 observations, with three testers instead of two; the adjusted R-squared value
is 90.93% and the significant variables remain the same. Bottom: 74 observations (6 outliers removed); the adjusted
R-squared value is 91.01%, even though variable V0 is no longer needed in the equation. Hence parameter 4g can be
sufficiently modeled by a function of V_Tester, V_Target and μ irrespective of an increasing number of testers, based
on the adjusted R-squared value.
Based on the results from the previous section, the Verhulst Equation fits the simulated
data well for different combinations of tester and target, with a few outliers as seen
in figure 14. From figures 12b and 13, we can also deduce that there is some form of
dependence between the characteristics of the tester and target sequences and the
Verhulst Equation parameters (Figure 18). In this section we showed that such a
dependence exists; by expressing it as an equation, we can estimate the Verhulst
Equation for any combination of tester and target sequences from bacterial genomes with
relative accuracy (Figure 17). This accuracy is affected by the outliers, as seen in
figures 16 and 17, where removing the outliers greatly increases the accuracy of the
Verhulst Equation parameter estimates. This Parameter Estimation Function can then be
used to estimate the Verhulst Equation, which in turn estimates the time of insertion
for any genomic island within any recipient genome.
Time of Insertion Estimation
We assume that for every genomic island insert there is a unique Verhulst Model
explaining the amelioration process of that genomic island. Based on the Verhulst Model,
the parameters for each combination of tester and target should therefore also be
unique. Hence, when any variable is made the subject of the formula and the other
variables are given, the estimated variable should have only one possible solution. The
Verhulst Equation uses parameters g and K to estimate the number of iterations
corresponding to a given composition variance between tester and target. Using the
parameter functions of g and m to derive K for di-mer, tri-mer and tetra-mer, a time
estimate can be made for different values of μ. Rearranging equation [6] such that t is
the unknown and the other variables are given, t is the following function:
[9]
In equation [9], two parameters are calculated from the GI sequence: V0, the
compositional pattern variance between the donor of the tester sequence and the target,
and V(t), the compositional pattern variance between the tester and the target sequence.
Equations [7] and [8] are used to calculate g and K according to the regression
parameters of the dataset. This is done for each K-mer pattern (di-, tri- and tetra-),
giving three Verhulst Equations with t as the subject of the formula and resulting in
three t values. The μ value which gives the lowest standard deviation between the three
t values is selected, and the time of insertion is equal to the mean of the three t
values.
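The selection rule can be sketched in Python as follows. The t values are the di-, tri- and
tetra-mer estimates reported for PAGI 1 in Table 5; the function name and the dictionary layout
are illustrative and not taken from the thesis script, and the sample standard deviation (n - 1
denominator) is an assumption that is consistent with the values reported in Table 5.

```python
from statistics import mean, stdev

# Di-, tri- and tetra-mer iteration estimates for PAGI 1, keyed by mu (from Table 5).
t_by_mu = {
    0.00001: (412.113728, 518.566167, 720.659124),
    0.00004: (156.979076, 195.483716, 263.047387),
    0.00008: (110.626395, 139.204404, 187.126677),
    0.0001: (94.9590908, 131.62521, 144.438802),
}


def select_insertion_time(t_values_by_mu):
    """Pick the mu whose three t estimates agree best and return their mean."""
    best_mu = min(t_values_by_mu, key=lambda mu: stdev(t_values_by_mu[mu]))
    t_values = t_values_by_mu[best_mu]
    return best_mu, mean(t_values), stdev(t_values)


best_mu, insertion_time, spread = select_insertion_time(t_by_mu)
print(best_mu, insertion_time, spread)
```

For these values the rule selects μ = 0.0001 with a mean of about 123.67 iterations and a
standard deviation of about 25.68, which agrees with the PAGI 1 summary entries in Table 5.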
The resulting t value represents how many iterations it took for the donor sequence
(origin of the GI sequence) to reach the state of the GI sequence. A python script was
written for this procedure, in which the μ values used to calculate g and K are rounded
off to 6 decimal places. The range of μ is determined by the user. For this project
specifically, four genomic islands (PAGI1, PAGI2, PAGI3 and PAGI4; PAGI: P. aeruginosa
Genomic Island) were used, with the origin sequence (PKLC 102) being the donor for all
four GIs and a μ range of [0.000001; 0.0001].
From parameter equations [7] and [8] in the methods section, we can estimate parameters
g, m and K (where K = g/m) when given the variables VTarget, VTester, V0 and μ. With g
and K calculated, an amelioration model can be built from the Verhulst Equation given V0
(compositional variance between the donor sequence of the GI and the target sequence),
g, t (number of iterations to achieve composition V(t)) and K. Changing the subject of
the formula so that t is the unknown and V(t) (compositional variance between tester and
target sequence at time t) is given, we can calculate the time of insertion for any
combination of tester and target for which V0 is known.
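As an illustration of changing the subject of the formula, the Python sketch below assumes that
the Verhulst Equation has the standard closed-form logistic solution
V(t) = K / (1 + ((K - V0)/V0) e^(-gt)). This form, the function names and the numerical values
are assumptions made for the example and may differ from the equations given in the methods
section.

```python
import math


def verhulst(t, v0, k, g):
    """Assumed closed form: V(t) = K / (1 + ((K - V0) / V0) * exp(-g * t))."""
    return k / (1.0 + ((k - v0) / v0) * math.exp(-g * t))


def time_from_variance(v_t, v0, k, g):
    """Solve the assumed closed form for t; return None when no real t exists."""
    if v_t == k:
        return None  # the curve only reaches K asymptotically
    ratio = (v_t * (k - v0)) / (v0 * (k - v_t))
    if ratio <= 0:
        return None  # K lies between V0 and V(t), so no solution
    return math.log(ratio) / g


# Round trip with hypothetical parameter values.
v_150 = verhulst(150, 3.0, 1.7, 0.003)
print(time_from_variance(v_150, 3.0, 1.7, 0.003))

# V(t) larger than V0 (both above K) gives a negative t.
print(time_from_variance(3.0, 2.0, 1.0, 0.003))
```

Under this assumed form, t is negative when V(t) exceeds V0, and no real solution exists when
K lies between V0 and V(t).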
Table 5 shows the estimated time of insertion for the four genomic island inserts
PAGI 1, PAGI 2, PAGI 3 and PAGI 4, with PKLC as the origin sequence for all four GIs.
The table was calculated in Excel using the g and K values (di-mer, tri-mer and
tetra-mer) from the Verhulst Equation fitting of the simulation data. The iteration
value (variable t) was calculated using equation [9] for the different K-mers (2, 3 and
4) for each GI insert. Four different μ values were used, each giving different time
estimates for each K-mer. The time of insertion estimate for each GI is shown in the
last part of the table under the label “T”, using the lowest standard deviation among
the four μ values as the selection criterion. The V0 value between PAGI 2 and the origin
sequence was the largest, followed by PAGI 1, 3 and 4, and this is reflected in the “T”
variable, where PAGI 2 has the longest time of insertion of all the GIs in the analysis.
PAGI 4, which has a higher V(t) than V0, yields a negative T value, which is correct in
the sense of vector signs (negative meaning the opposite direction): sequence PKLC
requires 42 iterations to reach the same composition as PAGI 4, or the reverse. The
corresponding μ value can also be regarded as the average substitution rate for the GI
sequence amelioration process.
Table 5. Time of Insertion Estimation Results

Verhulst Equation parameters for each Mu value:

Mu        2K        2g         3K        3g         4K        4g
0.00001   0.99684   0.000346   1.54217   0.000795   1.99905   0.001428
0.00004   0.51      0.000371   1.13      0.001095   1.77      0.002269
0.00008   0.52      0.000539   1.05      0.001353   1.72      0.002889
0.0001    0.4904    0.000585   1.0522    0.001436   1.7003    0.003604

Mu: 0.00001
          2t             3t             4t             Std Dev
PAGI 1    412.113728     518.566167     720.659124     156.72372
PAGI 2    8931.19813     -              -              -
PAGI 3    117.523322     99.9094962     48.4275954     35.9046245
PAGI 4    -156.295694    -142.671988    -172.120966    14.7381983

Mu: 0.00004
          2t             3t             4t             Std Dev
PAGI 1    156.979076     195.483716     263.047387     53.6934873
PAGI 2    1405.61648     1858.35832     -              320.136824
PAGI 3    46.1178622     40.6640126     21.0915241     13.1601949
PAGI 4    -63.0218118    -60.5114349    -78.462327     9.72064841

Mu: 0.00008
          2t             3t             4t             Std Dev
PAGI 1    110.626395     139.204404     187.126677     38.6556164
PAGI 2    997.119789     1100.68528     -              73.2318605
PAGI 3    32.4843438     29.2469093     15.3647134     9.09466957
PAGI 4    -44.3703468    -43.7840766    -57.5998967    7.81282691

Mu: 0.0001
          2t             3t             4t             Std Dev
PAGI 1    94.9590908     131.62521      144.438802     25.680194
PAGI 2    839.692797     1044.83987     -              145.060883
PAGI 3    27.9237968     27.6473342     11.9609959     9.13736521
PAGI 4    -38.193555     -41.3828649    -44.9685464    3.38942764

Time of insertion estimate:
          Min Std Dev    T              Mu
PAGI 1    25.680194      123.674367     0.0001
PAGI 2    73.2318605     1048.90253     0.00008
PAGI 3    9.09466957     25.6986555     0.00008
PAGI 4    3.38942764     -41.5149888    0.0001

2: Di-mer, 3: Tri-mer, 4: Tetra-mer, K and g: Verhulst Equation parameters, T: Time of insertion estimation in terms of
iteration, t: estimated iteration based on Verhulst Equation, Std Dev: standard deviation, Mu: μ, PAGI: P.aeruginosa
Genomic Island, -: no value reported.
Instead of using the parameters of the fitted Verhulst Equation, which may take a long
time to obtain because of the simulation process, g and m were estimated with a python
script by varying μ over the range [0.000001; 0.0001] (Figure 19). Using the parameter
functions with the coefficients estimated by regression, g and m were calculated, and
from them T, in the same way as in Table 5. The estimated times of insertion from the
python program clearly differ from the results in table 5, but the trend remains the
same (PAGI 4 being negative, and PAGI 1 being more distant than PAGI 3, hence its longer
time of insertion). This difference could be caused by the error of the multivariate
regression, since the models used to estimate the parameters did not reach an adjusted
R-squared of 100%, but they still serve as a good estimate of the time of insertion.
Another interesting point shared by the empirical and theoretical methods is the
contribution of the individual t values (2-mer, 3-mer and 4-mer t estimates) to the
standard deviation. The tri-mer and tetra-mer t estimates are very similar to each
other, whereas the di-mer t estimate differs considerably.
Fig. 19. Python script output for the estimation of time of insertion by minimizing standard deviation. The genomic
island sequences used were P. aeruginosa genomic islands (PAGI) 1, 2, 3 and 4, all with origin sequence PKLC. The
script output includes the name of the genomic island sequence, the time estimate (time of insertion based on
minimizing standard deviation), the di-mer, tri-mer and tetra-mer time estimates, the minimum standard deviation and
the mu value used to calculate the time of insertion. PAGI 2 is not displayed because the estimated capacity value
was higher than the V0 difference, so no time could be calculated.
There is a significant difference in the time of insertion estimate between the Verhulst
Equation inferred from simulation data and the one inferred from the Parameter
Estimation Function. This is caused by the imperfect parameter estimation function, as
shown in figure 16 in the previous section. Both methods nevertheless give a good
relative indication of how long ago the genomic island was inserted into the recipient
genome (Table 5 and Figure 19). Although the empirical method is more accurate in terms
of the lowest standard deviation criterion, it is very time consuming, because the
average mutation value used within the simulation process is unknown and a trial and
error approach must be used. The theoretical method, on the other hand, is much faster
but less accurate, and the average mutation value can be estimated.
Mutation Rate Estimation for Different Genomic Loci
If we view the amelioration process as a change in the mutation rate of the tester
sequence within the target over time, can we find a way to analyze the mutation rate of
any genomic locus within the genome sequence? By changing the current algorithm
slightly, letting the tester be the genomic locus sequence and the target the whole
genome sequence, we attempt to estimate the rate of mutation of the genomic locus
sequence in terms of a differential equation. By analyzing different genomic loci from
the same genome sequence according to their difference in OU composition, we can analyze
the trend between these combinations of tester and target. This trend can be measured as
a differential and can be viewed as the rate of mutation of the genomic locus sequence
relative to other genomic loci of the same genome sequence.
Using the different output parameters from the simulation process, the mutation rate of
different compositional patterns can be estimated. Assuming the tester sequence to be a
specific genomic locus and the target to be the rest of the genome sequence, a rate of
mutation can be estimated for different genomic loci using an approach similar to the
Verhulst Equation, with some slight variations to the algorithm. Using the genomic
sequences at different iterations of the simulation process (at each iteration, the
composition of the genomic sequence relative to the target differs) as an imitation of
different genomic loci, a differential equation representing the mutation rate at each
composition pattern variance can be estimated. For this study, three simulation
processes with different μ values (0.00005, 0.0001 and 0.0002) were used to create
sequences with different compositions.
The dataset consists of 60 sequences (each simulation ran for 2000 iterations with the
sequence saved every 100 iterations, using a P. aeruginosa genomic island as tester and
the P. aeruginosa whole genome as target). Each sequence represents a theoretical
genomic locus (tester) with a different composition relative to the same target.
Compositional statistics are then recorded for each combination of tester and target,
sorted in ascending order of internal variance and used for the estimation of the
differential equation. The variation lies in the differential equation estimation, where
the two datasets used are the composition variance between tester and target and a
measure called the forward mutation ratio. The forward mutation ratio is calculated as:
\[ \text{Forward Mutation Ratio} = \frac{\text{Distance Between Tester and Target}}{\text{Internal Variance of Tester}} \]
In this study, only the tetra-mer distance and variance are considered and used to
calculate this ratio. For each genomic locus, the forward mutation ratio is calculated
together with the corresponding composition pattern variance between locus and target.
Using multiple points from the genomic loci, a differential equation can be calculated
using the same method as in figure 10. However, the form of the differential equation
will differ from equation [5] depending on the dataset given. The resulting differential
equation is an estimate of the mutation rate for any given composition variance between
genomic locus and target as represented in equation. This mutation rate is a relative
measure: it indicates, in comparison with other genomic loci, how much more likely the
locus is to mutate relative to the target sequence.
The 60 sequences simulated with three different μ values were used as a representation
of different genomic loci (tester) with the same target (see appendix table 2). Five
randomly selected sequences are shown in table 6 together with the composition
statistics needed to calculate the mutation rate for each sequence. The two parameters
that determine the mutation rate (differential = mutation rate) are V0 and the forward
mutation ratio (FMR); FMR in turn is calculated from VTester and distance. With these
three compositional statistics, a mutation rate estimate can therefore be made for any
combination of tester and target.
Analyzing the function by which the mutation rate is estimated for different genomic
sequences (the mutation rate is a function of the three compositional statistics), we
can determine how the differential depends on the three parameters. Looking at table 6
in more detail, three trends can be seen that influence the resulting mutation rate. A
decrease in the differential can be caused by an increase in VTester, a decrease in
distance and a decrease in V0 between tester and target. In a biological sense, a
smaller distance and V0 between tester and target mean that the tester sequence is
compositionally similar to the target sequence. The tester sequence should therefore
undergo less mutation, because the sequence is approaching a more stable state (higher
selection). Hence the relationship between distance, VTester and mutation rate makes
sense here.
Table 6. Mutation rate estimation for 5 simulated sequences

Internal 4V   Distance   V0     FMR        Differential
5.17          7.67       2.97   1.483559   0.884918
5.63          6.12       2.33   1.087034   0.696544
5.83          5.53       2.1    0.948542   0.62334
6.06          5.5        1.91   0.907591   0.571682
6.28          6.02       1.86   0.958599   0.515892

Internal 4V (VTester): Tetra-mer internal variance of tester, Distance: Absolute distance measure between tester and
target (See section 2.2 in Literature Review), V0: Variation between composition pattern between tester and target,
FMR: Forward Mutation Ratio, Differential: Empirical differential calculated based on FMR and V0, also equivalent to
mutation rate, Number of Mutations: Differential multiplied by loci sequence size.
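The FMR column of Table 6 can be recomputed from the Distance and Internal 4V columns. The short
Python sketch below does this for the five rows shown; the function and variable names are
illustrative and not taken from the thesis script.

```python
# (Internal 4V, Distance, reported FMR) for the five rows of Table 6.
table6_rows = [
    (5.17, 7.67, 1.483559),
    (5.63, 6.12, 1.087034),
    (5.83, 5.53, 0.948542),
    (6.06, 5.5, 0.907591),
    (6.28, 6.02, 0.958599),
]


def forward_mutation_ratio(distance, internal_variance):
    """Distance between tester and target divided by tester internal variance."""
    return distance / internal_variance


for internal_variance, distance, reported in table6_rows:
    computed = forward_mutation_ratio(distance, internal_variance)
    print(f"{computed:.6f} (reported {reported})")
```

Each recomputed ratio agrees with the reported FMR to the precision shown in the table.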
Figure 20 displays the differential equation estimated from the 60 simulated sequences.
The linear differential equation also follows the trend stated above and can be used to
estimate the mutation rate from the V0 between tester and target. However, this estimate
is a relatively poor indicator because the relationship between V0 and mutation rate is
not one-to-one (the same V0 can correspond to multiple mutation rate estimates). As seen
at the lower left of the blue dot plot in figure 20, one V0 measure can result in two
different differential values. Hence the differential equation can serve as a trend
indicator but not as a good estimator of mutation rate. Empirical methods therefore
estimate the mutation rate more accurately than the differential equation.
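As a rough illustration of the linear trend discussed above, the sketch below fits an ordinary least-squares line to the five (V0, Differential) pairs listed in Table 6. It is not the fitting procedure used in the study (which used 60 simulated sequences); the helper name `fit_line` is hypothetical, and with only five points the slope and intercept should only be expected to land close to the values reported in figure 20.

```python
# Illustrative least-squares fit of Differential against V0 (Table 6 rows only).
v0 = [2.97, 2.33, 2.10, 1.91, 1.86]
differential = [0.884918, 0.696544, 0.623340, 0.571682, 0.515892]

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(v0, differential)

# Residuals show how far each locus lies from the fitted trend line.
predicted = [slope * x + intercept for x in v0]
residuals = [y - p for y, p in zip(differential, predicted)]
```

The residuals make the limitation explicit: a single line returns one differential per V0 value, so two loci with the same V0 but different forward mutation ratios cannot both be reproduced.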
By creating a new measure, FMR, and combining it with V0, the change of each with
respect to the other can be thought of as the change in composition (V0) with respect to
the individual characteristics of the genomic locus sequence (FMR). Hence this measure
can be regarded as the mutation rate estimate of the genomic locus sequence under study.
This algorithm is similar to the Verhulst Equation estimate, with the tester and target
sequences being the genomic locus sequence and the whole genome sequence respectively.
The mutation rate estimate for a genomic sequence can in this case only be measured
relative to other genomic loci sequences of the same genome, and an increase in the
sample number will also increase the accuracy of the estimate. Although in theory this
measure can be thought of as a mutation rate estimate, no biological application has
been used to confirm this; we therefore lack the core knowledge to make conclusive
statements regarding the use of this algorithm to estimate the rate of mutation for
genomic loci.
[Figure 20: Differential Equation for Different Genomic Loci. Scatter plot of Differential 4V with a fitted line (Linear Differential): y = 0.3193x - 0.0437, R² = 0.9681.]
Fig. 20. Empirical estimation of the differential equation based on the different genomic loci sequences represented
by simulated sequences. Each point on the graph represents a genomic locus (tester), with its V0 against the same
target on the x-axis and the differential on the y-axis. The estimated differential equation follows a linear function
from which, for any given V0 between tester and target, the differential is estimated. The differential in this case is
equivalent to the mutation rate of the genomic locus with the specific V0 and forward mutation ratio. The linear
function has a positive slope, so a decrease in V0 leads to a decrease in mutation rate. Although the R-squared
value is high, the linear function is not a good indicator of the relationship between V0 and mutation rate. This is
because the relationship is not one-to-one: one V0 measure can lead to two mutation rates (caused by different
forward mutation ratios).
Discussions
Choosing Combinations of Tester and Target
The primary aim of the project is to create an amelioration model that applies to all
bacterial genomes. To maximize the accuracy of the model, choosing the correct
combinations of tester and target is vital. Due to the time constraints of the project,
five targets and four testers were chosen, each carefully selected. The five targets were
selected first and were chosen to represent as wide a variety of bacteria as possible.
The targets are Bacillus subtilis 168 (BS), Pseudomonas aeruginosa PA01 (PA),
Escherichia coli K12 (EC), Xylella fastidiosa 9a5c (XF) and Streptomyces griseus
NBRC 13350 (SG). Each target originates from a different taxon, is found in a different
environment and has different compositional statistics (GC content, genome size). EC is
a gram-negative proteobacterium, the most well-studied bacterium, and is found in the
human gut. XF is also a proteobacterium and an important plant pathogen which causes
phoney peach disease. BS is a gram-positive bacterium of the class Bacilli which can live
in extreme conditions due to its structure. PA is another proteobacterium which causes
disease in both animals and humans and can be found in many environments throughout
the world. SG is a gram-positive actinobacterium commonly found in soil, with some
strains from the deep sea. The GC content varies widely between the targets (from BS
with 43.51% to SG with 72.2%), and the longest genome, SG, is about three times longer
than that of XF (Table 7).
Table 7. Target genome details (UCSC Genome Browser)
Targets                  GC Content (%)   Length (bp)   Gene Number
B. subtilis 168          43.51            4215606       4422
S. griseus NBRC 13350    72.2             8545929       2674
X. fastidiosa 9a5c       52.67            2679306       2838
E. coli K12              50.79            4639675       4466
P. aeruginosa PA01       66.56            6264404       5682
The four testers are P. aeruginosa, B. subtilis, E. coli and S. coelicolor (SC), which
were selected for three purposes. The first is to test whether a tester and target
combination belonging to the same organism affects the outcome of the Verhulst Equation
(e.g. PA and PA). The second is to analyze whether closely related organisms influence
the parameters of the model (SG and SC), and the last is to test whether changing the
order of the tester and target combination yields the same model. From the Verhulst
Equation fitting of all 80 combinations of tester and target, the first and second points
did not prove significant (the related variable was non-significant). For the third
point, different orders of the same tester and target combination did not yield the same
parameters (e.g. BS/PA and PA/BS). However, the trend in the results does show
similarities for these types of combinations, as shown in figures 12b and 14.
The parameter estimation results from SAS indicate that there is still room for
improvement. Based on figure 18, additional combinations of tester and target should
not significantly change the model itself or the accuracy with which it predicts the
parameters. Adding more combinations of tester and target should nevertheless increase
the accuracy of the model further. However, the criteria used in picking these
combinations (bacterial taxa, composition and other factors, e.g. environment, origin,
pathogenicity) will greatly influence the significance of the model, and may help
identify other potentially interesting factors which were not found in the current project.
Simulation Parameters
The simulation step takes the longest time to run and is also the most important step in
structuring the amelioration model. Careful planning is needed in order to get the most
information out of the simulation data without wasting too much time on computation.
Due to the computational complexity of the simulation program, the computation time
grows exponentially with the number of iterations run. Hence the choice of simulation
parameters must be balanced such that maximum information is retained for the minimum
amount of computational time.
Analyzing the 10000 iteration simulation run data with respect to the change in V0 per
iteration, we want to identify the smallest iteration cutoff point that still provides
sufficient information to estimate the parameters of the Verhulst Equation (Figure 21).
From the 10000 iteration simulation results, at 2000 iterations we can identify the
majority of the change in the curve (slope), which is used to estimate parameter g, and
the curve is also close to the minimum capacity V0 point, which is parameter K.
Considering four testers, five targets and four μ combinations, there will be at least
240 simulation runs. A 2000 iteration simulation run takes five days to complete and four
simulations can be run simultaneously (running any more simulations at the same time
affects computation time); it is therefore feasible to do all simulations within a one
year period (the estimated time taken for all simulations is 300 days).
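The 300-day estimate follows directly from the figures quoted in the paragraph above. The short calculation below reproduces it; the variable names are illustrative, and the run count of 240 is taken from the text as stated.

```python
import math

total_runs = 240      # number of simulation runs stated in the text
days_per_run = 5      # duration of one 2000-iteration run, from the text
parallel_runs = 4     # simulations that can run at the same time, from the text

# Runs are processed in batches of four; each batch takes five days.
batches = math.ceil(total_runs / parallel_runs)
total_days = batches * days_per_run
print(total_days)  # 300
```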
[Figure 21: K-mer Variance 0.00001 mu. Line plot of Variance (series 2V, 3V and 4V) against Iterations, from 0 to 10000.]
Fig. 21. 10000 iteration simulation V0 graph for different K-mers at μ = 0.00001. The majority of the decrease occurs
before 2000 iterations, after which the curve tends to equilibrium. Hence adding iterations beyond 2000 will not greatly
increase the information needed to estimate the parameters (g, the slope of the curve, and K, the minimum limit V0
value) for the amelioration model.
Four μ values were chosen such that the μ values within the specified range do not
follow a specific (linear) trend while still making it possible to determine whether μ is
a factor in estimating the parameters of the Verhulst Equation. From the Verhulst
modeling and parameter estimation results, the μ values used were significant in
determining the parameters but were poor in fitting the model to the simulation data.
From the box and whisker plot of the residue grouped by μ value (Figure 12a), the lowest
μ value used (0.00001) was not a good choice as a simulation parameter. This could,
however, be compensated for by increasing the number of iterations used within the
simulation to get more data for estimation, or by simply using a higher μ value. The
current model could thus possibly be improved by changing the simulation parameters
(increasing the iteration count, increasing the μ value).
Probability Logistic Function
In order to construct a sensible amelioration model, the simulation data must be as
close to a true amelioration process as possible. The Probability Logistic Function (PLF)
is specifically tailored to the simulation process for this purpose, with two assumptions
in mind that make the function biologically suitable. The first assumption is the most
important, as well as being the core idea of amelioration: directional mutation. As
discussed in literature review section 3.1, Lawrence and Ochman (1997) showed with
their amelioration model that directional mutation occurs and drives the amelioration
process. The simulation process utilizes this core concept and converts the
compositional statistics of tester and target into a measure of the probability of
substitution at each base position. This means that wherever the sequences (OU patterns)
of tester and target differ, a substitution at that point has a high probability. The
assumption of directional mutation is therefore the fundamental core of the simulation
process.
The second assumption is that every bacterial genome differs in some way, and hence
the way each ameliorates should differ. The PLF should therefore be able to adjust to
the biological characteristics of the sequence. This is done with parameters “a” and
“b”, where parameter “a” is a function of μ which directly controls the likelihood of
substitution, and “b” controls the conservation of the sequence. Although “a” in this
case is a constant mutation rate, which applies when no directed mutation or selection
occurs, the mutation probability (likelihood of mutation) is never constant for any base
position of the tester sequence, due to the mutation pressure of the target genome. This
is an important consideration, as some regions tend to have a different mutation rate
than others. Using different combinations of “a” and “b” gives a unique representation
of the amelioration process simulation for each combination of tester and target. This
also gives more flexibility, as the function can be adjusted to represent real-life
applications as closely as possible.
For the simulations of the 80 combinations of tester and target, the “a” parameter was
varied through the change of μ, with the “b” parameter always equal to 1. Based on the
SAS results in figure 16, the adjusted R-squared was very low, indicating a poor fit.
This poor fit was caused by outliers from specific combinations of tester and target.
From another perspective, these outliers could potentially be caused by an incorrect
choice of parameter “b”, which would create bias within the simulation step. The correct
choice of parameter “b” could therefore increase the accuracy of the amelioration model
as well as of the parameter estimation. In future research, investigating the connection
between parameter “b” and the Verhulst model could enhance the biological significance
of the model.
Verhulst Equation
Overall, the Verhulst Equation fitted the simulation data of the 80 combinations of
tester and target well on average. Due to the nature of the amelioration process
(directional mutation), the logistic curve fits the simulated data well. Some
combinations showed extremely good fits, where the residue value between the actual
simulation data and the estimated model was 0.4003 (absolute measure of difference).
However, some proved not to be compatible with the Verhulst Equation, for two reasons.
The first is the actual simulation data. When the compositions of the tester and target
differ significantly (e.g. BS/SG and SC/BS combinations), the simulation data becomes
extreme, which results in a high minimum V0 value and an extreme difference between
initial V0 and minimum V0 (causing an extreme slope and hence a high parameter g).
Therefore, when fitting a Verhulst Equation with irregular parameters K and g (as seen
in figure 15 with the outliers), which form the core of the shape of the equation, the
residue value will naturally be high, which indicates a poor fit.
The second reason is the estimation of the differential equation, on which the fitting
of the Verhulst Equation strictly depends. However, the differential equation is also
derived from the simulation data, so the first reason applies here as well. The
estimation of the differential equation works by transforming the differential dataset
into a linear function and then estimating both g and K (see methods). In order to
estimate a sensible K value from the linear function, all values which are close to
equilibrium are removed by setting a cutoff point. To determine this cutoff point, two
criteria were considered, of which only one was used in the Python script. The first
criterion sets the cutoff at the first point whose percentage change from the previous
iteration is lower than a certain threshold, while the second criterion, which was used
within the Python script, sets the cutoff at the point where the differential is positive.
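The linear transformation mentioned above can be sketched as follows. This is only an illustration under an assumption: it takes the standard Verhulst form dV/dt = g·V·(1 − V/K), which may differ in detail from the form given in the methods. Under that form the per-iteration relative change is linear in V, so a straight-line fit yields g from the intercept and K from the point where the line crosses zero. The function names and the synthetic curve are hypothetical and are not taken from the study's script.

```python
import math

def simulate_verhulst(v_start, g, k, steps):
    """Closed-form logistic curve V(t) for dV/dt = g*V*(1 - V/K) (assumed form)."""
    c = k / v_start - 1.0
    return [k / (1.0 + c * math.exp(-g * t)) for t in range(steps)]

def estimate_g_and_k(v):
    """Regress the relative change (V[t+1] - V[t]) / V[t] on V[t].

    Under the assumed form the relation is linear: rate = g - (g / K) * V,
    so the intercept estimates g and -intercept / slope estimates K.
    """
    xs = v[:-1]
    ys = [(v[i + 1] - v[i]) / v[i] for i in range(len(v) - 1)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, -intercept / slope

# Synthetic, noise-free curve decreasing from 3.5 towards K = 2.1.
curve = simulate_verhulst(v_start=3.5, g=0.005, k=2.1, steps=2000)
g_est, k_est = estimate_g_and_k(curve)
```

On noise-free data the estimates land close to the generating values; on real simulation output the points near equilibrium are dominated by noise, which is why a cutoff is needed before fitting.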
[Figure 22: PA/XF Variance per Iteration. Line plot of Variance (series 2V, 3V and 4V) against Iteration, from 0 to 2000.]
Fig. 22. Simulated amelioration process of the combination of P. aeruginosa tester and X. fastidiosa strain 9a5c
target. 2000 iterations and a μ value of 0.00001 were set as the input parameters. At 300 iterations a “mountain peak”
structure occurs, caused by a substitution away from the target instead of towards it. This shifts the cutoff value to
300 iterations instead of 2000, so the K value estimate will be around 2.9 instead of 2.1. Setting the count of
differential sign changes (a change from increase, positive, to decrease, negative, or vice versa) to 2 in this case
prevents the premature setting of the cutoff. Hence investigating the simulation data beforehand and manually
setting the criteria will improve the accuracy of the Verhulst Equation fitting to the simulation data.
Looking at both criteria, there are certain situations in which the simulation data will
cause flaws in the estimation of parameter K. For the first criterion, if the
compositions of tester and target are relatively close (e.g. the EC/EC combination), then
the percentage change per iteration will naturally be very low (V0 change per iteration
of less than 0.5%), which will cause the K estimate to be higher than it actually is
(cutoff set too early). For the second criterion, in rare circumstances some
combinations show an irregular change in their V0 per iteration, such as a “mountain
range” structure (Figure 22). This structure means that, at some iterations within the
amelioration process, V0 is moving away from the target instead of towards it. The
amelioration process will therefore not be a continuously decreasing graph but one that
alternately increases and decreases, like the shape of a mountain range. In this case,
the cutoff point will also be set too early and hence the estimate of parameter K will
be larger than it should be. To avoid these circumstances, manual inspection of the
simulation data takes priority, so that the type of data being dealt with is known. The
criteria can then be changed to suit the simulation data under study: in the second case
by increasing the count of iterations at which the differential is positive (a count of 2
was used within this study), and in the first case by using a smaller percentage change
to match the simulation data. Other changes which could improve the Verhulst Equation
fitting would be the use of different parameters “a” and “b” for different combinations
of tester and target. From figure 12a, a change in parameter “a” improved the fitting of
the Verhulst Equation for the same tester and target combination.
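The two cutoff criteria can be sketched as below. The study's own Python script is not reproduced here, so the function names, the threshold parameter and the short V0 trace are all hypothetical; the sketch only illustrates how a count greater than 1 lets the second criterion skip an early “mountain peak”.

```python
def cutoff_by_percent_change(v, threshold_pct):
    """First iteration whose relative change from the previous one is below threshold_pct."""
    for i in range(1, len(v)):
        change_pct = abs(v[i] - v[i - 1]) / abs(v[i - 1]) * 100.0
        if change_pct < threshold_pct:
            return i
    return len(v) - 1  # no point qualified: keep the whole trace

def cutoff_by_positive_differential(v, count=1):
    """Iteration at which the differential v[i] - v[i-1] is positive for the count-th time."""
    seen = 0
    for i in range(1, len(v)):
        if v[i] - v[i - 1] > 0:
            seen += 1
            if seen == count:
                return i
    return len(v) - 1  # fewer than `count` increases: keep the whole trace

# Hypothetical V0 trace with an early "mountain peak" at index 3.
trace = [3.5, 3.2, 3.0, 3.1, 2.8, 2.5, 2.3, 2.2, 2.15, 2.16]
early = cutoff_by_positive_differential(trace, count=1)   # stops at the peak
late = cutoff_by_positive_differential(trace, count=2)    # skips the peak
by_percent = cutoff_by_percent_change(trace, threshold_pct=0.5)
```

With `count=1` the cutoff lands on the peak, where V0 is still far from equilibrium, so K would be overestimated; with `count=2` the cutoff moves to the end of the trace, where V0 has levelled off.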
Parameter Estimation and Selection Methods
Taking into account the data of all combinations of tester and target and the possible
factors which contribute to the estimation of the parameters, the estimated model fitted
the data poorly, as seen in figure 16. The core philosophy of the amelioration model was
to create a simple equation from simulation data that applies to all bacterial genomes.
If data were eliminated to get a better fit, as in figure 17, then the model would be
biased and would no longer reflect the intended aim. Hence improvements are needed in
order to achieve the results of figure 17 without removing the extreme combinations
which cause the bad fit of the parameter function.
Two core problems can be identified here, the first being the cause of the extreme
outliers and the second the variable selection of the model. As discussed in the
“Verhulst Equation” and “Probability Logistic Function” sections, the first problem can
be reduced through control of the simulation and fitting process. The second problem can
be seen by comparing figures 16 and 17. If a variable is considered significant at the 5%
level of significance, then in figure 16 the significant variables are V0, V_Tester and
V_Target, while in figure 17 the variables V_Tester, V_Target and Mu are significant.
This example implies two points, the first being that the variables considered for the
model influence the effects of the other variables in the model (collinearity effect).
For instance, in figure 16 six variables are considered, and the effect of variable V0
in the model overtakes the effect of Mu, so that Mu is less significant than V0. In
figure 17, however, after the non-significant variables are removed, the effect of Mu is
more apparent than when there were six variables in figure 16. This problem can be
easily solved through different selection methods to choose the best model for the data
under study.
Three selection methods were used in the study, and the stepwise selection method was
chosen over the other two and used for all parameters, for two reasons. Stepwise
selection takes less computation time (though in this study computation time is not a
concern due to the small number of variables used) and is a combination of both the
forward and backward selection methods. By setting entry and exit levels of significance
for the variables under consideration, the stepwise selection method takes a “jumping”
approach. The method starts by analyzing each variable and accepts a variable into the
model according to the entry level of significance. While accepting each variable, it
also calculates the contribution of the variable to the model and selects the most
significant variable. At each step, the significance of the other variables is
recalculated given the already selected variables, and any non-significant variables are
removed according to the exit level of significance. In this way the first point raised
by the above example can be resolved.
Another issue raised by the above example is choosing the level of significance that a
variable must meet to be used within the model. From figure 16, the hypothesis was that
related tester and target should be a factor influencing the parameters of the Verhulst
Equation (the PA/PA combination had a very low K value). This idea is reinforced in
figure 17, where the level of significance of the related variable is 7.38%. For this
study specifically, a 5% level of significance is used as both the entry and exit level,
but with a difference of only 2.38%, can we disregard the related variable as
insignificant to the model? The choice of the level of significance is therefore
debatable. Nevertheless, the 5% level was still chosen as a strict criterion to reduce
bias from a large number of variables (adjusted R-squared value). In terms of the
related variable example, from the parameter estimation function estimated by SAS in
figure 17, a partial R-squared value of 0.37% was lost due to the elimination of the
related variable. The contribution of 0.37% relative to the whole model (91.37%
R-squared value) is very low, but it cannot be disregarded and can be considered context
dependent. Many other potential variables which could be of interest were not considered
in this study, such as sequence size, sequence region (Nakamura et al. 2004) and multiple
average mutation rates (Snir, 2014). These can be tested in future to further enhance
the current model for parameter estimation.
Time of insertion estimation
In order to get a sensible measure of the time of insertion, a criterion is set for two
reasons. For any genomic island (tester) inserted at any point in time, the tester
sequence can be expressed in terms of its compositional characteristics (OU patterns of
2-mers, 3-mers and 4-mers). With these statistics and the Verhulst Model, the time can be
estimated for the tester in question. In a perfect theoretical world where the
amelioration model is perfect, the estimated times for all three patterns (2-, 3- and
4-mers) should be exactly identical, and the time of insertion should therefore be that
estimated time value. In practice, however, there is a discrepancy between the three
estimated time values, for two reasons. The first is random noise substitution, where
biological data does not always follow the strict rules of a mathematical model. The
second is the imperfect model, where the model itself cannot accurately measure the
variable in question (R-squared regression fit below 100%, high residue during Verhulst
Model fitting). Therefore the lowest possible deviation between the three time estimates
is used as the criterion for estimating the time of insertion. However, analyzing the
standard deviation measure yielded an interesting result: the contributions of the three
time estimates to the standard deviation are quite different. The tri-mer and tetra-mer
time estimates were similar, whereas the di-mer estimate was significantly different.
This difference could be caused by the poor Verhulst Model fit on the di-mer dataset,
which resulted in a biased t estimate. Possible improvements to the current method are
therefore to apply a weighting model (giving the tetra-mer and tri-mer t estimates a
higher percentage weighting instead of the equal weighting currently used in this study)
or to take higher k-mers into consideration (more t estimates give a clearer idea of the
true time of insertion); with these, the time of insertion estimate might be more
accurate.
The second reason behind the criterion is to ensure that the estimated time value is in
fact the most correct estimate. To do this, the relationship between μ and the standard
deviation needs to be analyzed. In order to identify the most correct estimate, there
should be only one minimum standard deviation value across all μ values (Figure 23).
Consider the opposite case, where there are multiple low points of the standard
deviation: there is then no specific way of telling which time estimate is true for the
tester sequence in question, and the criterion of lowest standard deviation will not be
able to identify the most correct time of insertion. When a single minimum exists, the
criterion is therefore sufficient to identify the most accurate time of insertion based
on the Verhulst Model.
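The selection criterion can be illustrated with a small sketch: for each candidate μ, compute the spread of the three time estimates [2t, 3t, 4t] and keep the μ with the smallest spread. The time values below are invented purely for illustration, the function names are hypothetical, and the use of the population standard deviation is an assumption (the text does not say which form was used). The weighting function shows the suggested alternative to the equal-weight average.

```python
from statistics import pstdev

# Hypothetical time estimates [2t, 3t, 4t] for one tester at each candidate mu.
time_estimates = {
    0.00001: [900.0, 400.0, 380.0],
    0.00005: [520.0, 430.0, 410.0],
    0.0001: [460.0, 440.0, 435.0],
    0.00015: [300.0, 450.0, 470.0],
}

def best_mu(estimates):
    """Return (mu, spread) for the mu with the smallest spread of its time estimates."""
    spreads = {mu: pstdev(ts) for mu, ts in estimates.items()}
    mu = min(spreads, key=spreads.get)
    return mu, spreads[mu]

def weighted_time(ts, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of [2t, 3t, 4t]; equal weights reproduce the plain average."""
    return sum(w * t for w, t in zip(weights, ts)) / sum(weights)

mu, spread = best_mu(time_estimates)
t_equal = weighted_time(time_estimates[mu])
# Down-weighting the di-mer estimate, as suggested in the text.
t_weighted = weighted_time(time_estimates[mu], weights=(0.2, 0.4, 0.4))
```

If two values of μ gave nearly the same minimum spread, `min` would still return one of them, which is exactly the ambiguity the text warns about; inspecting the whole `spreads` curve is therefore advisable.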
[Figure 23: The Relationship Between Mu and Std Dev. Standard Deviation plotted against Mu (0 to 0.00015) for PAGI 1, PAGI 2, PAGI 3 and PAGI 4.]
Fig. 23. Relationship between the change in Mu and its effect on the standard deviation measure of the set [2t, 3t, 4t].
The red curve follows a parabola, whereas the other curves are closer to linearly decreasing curves. Based on the
trend, a change in mu varies the standard deviation value such that at some value of mu the standard deviation
reaches a minimum value (red curve).
Comparing the empirical estimation of the time of insertion with the theoretical
approach using the parameter estimation method (table 5 and figure 19), there is still a
significant difference between the estimates. The poor estimates of the theoretical
approach are largely due to the poor estimation of the parameters by the parameter
estimation functions from SAS (low R-squared values of the parameter estimation models).
This could, however, be improved if the parameter estimation were done individually for
each combination instead of over all possible combinations (specific combinations do not
follow the trend of the parameter equation, e.g. the PA/PA combination has much lower
empirical g and K values than the theoretical estimates). Although the empirical
estimation is better than the theoretical approach at estimating the time of insertion
for the simulated sequences, it is very time consuming in both the simulation and
calculation steps, whereas the theoretical approach is much faster, requiring only a few
input parameters to produce an output. Hence a better parameter function is needed, as
explained in the earlier discussion.
There are also two more limitations to the time of insertion estimation method. The
first is the need for a point of origin sequence for the calculations to work. Among the
parameters of the Verhulst equation, V0 is needed to calculate the time variable.
Although the origin sequence can be thought of as the donor of the tester sequence, this
is not always the case. The tester sequence could have been donated by any organism
carrying such a sequence, and hence the choice of origin sequence can largely affect the
time estimate. The other limitation is that in some cases no time estimate can be
calculated from the Verhulst Model, as seen in table 5 and figure 19 (PAGI 2). This
occurs when no simulation covers the amelioration process of the tester sequence, so
that the Verhulst Model cannot estimate the time of insertion for that tester sequence.
To solve this problem, the simulation step needs to be improved as discussed in the
earlier section. Ultimately, the time of insertion estimation method is a good approach
to obtaining an accurate time estimate, but more test data is needed to test its
practical use in real-life applications.
One point of interest is that, of the four GIs analyzed here, there are two different
rates of mutation even though they all come from the same origin within the same target.
This could potentially mean that there is more than one possible Mu value for any given
tester sequence, or that there is more than one model of selection within the target,
with mutagenesis at different stress levels (Maclean, 2013). With this in mind, the
amelioration model might contain two or more mu parameters, which also corresponds to
another study by Snir (2014).
Rate of Mutation Estimation
A study by Martincorena (2012) states that there is great variation in mutation rate
across the genome, which is non-random and depends on factors (DNA repair,
transcription) that reduce the risk of deleterious mutations. Codon positions also
undergo different mutation rates (Knight 2001); combining the two, there should be some
connection between sequence patterns and rate of mutation. In an attempt to find this
pattern, three parameters, distance (D), internal variance (Int V) and variance (V0),
were used to characterize a sequence uniquely, such that each sequence has a unique
measure of mutation rate corresponding to it. Analyzing each parameter individually,
internal variance has the least impact in determining the mutation rate, since the
internal variance measure applies only to the tester sequence. This parameter serves as
a normalizing constant for the different genomic loci within the study: it adds a third
dimension to the calculations and maintains the unique characterization of each sequence
(some genomic loci sequences have the same distance and V0 measures to the target but
completely different internal variance). The distance parameter is a measure between
tester and target calculated as the absolute distance between the ranks of OU in the two
patterns. Normalizing the distance parameter by the internal variance therefore gives a
relative ratio (the Forward Mutation Ratio) of how close, in terms of distance, the
sequence with that specific internal variance is to the target sequence.
The forward mutation ratio (FMR) makes biological sense in two ways. First, a lower
distance value implies that the compositions of the tester sequence and the target are
relatively similar. A lower distance gives a lower FMR value and hence a lower mutation
rate. Second, when two tester sequences have the same distance measure but different
internal variances, the sequence with the higher internal variance will have the lower
FMR value. In such a case, the sequence with the higher internal variance will always be
closer in composition to the target than the one with the lower internal variance.
Therefore each sequence has a unique FMR measure which is relative to the mutation rate
of that sequence (low FMR = low mutation rate).
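The normalization described above can be checked numerically. The sketch below computes FMR as distance divided by internal variance and compares the result with the FMR column reported in Table 6; the function name is hypothetical, and the data rows are copied from that table.

```python
def forward_mutation_ratio(distance, internal_variance):
    """Distance between tester and target normalized by the tester's internal variance."""
    return distance / internal_variance

# (Internal 4V, Distance, reported FMR) rows from Table 6.
rows = [
    (5.17, 7.67, 1.483559),
    (5.63, 6.12, 1.087034),
    (5.83, 5.53, 0.948542),
    (6.06, 5.50, 0.907591),
    (6.28, 6.02, 0.958599),
]
fmr_values = [forward_mutation_ratio(d, iv) for iv, d, _ in rows]

# Second case from the text: same distance, higher internal variance, lower FMR.
fmr_low_variance = forward_mutation_ratio(6.0, 5.5)
fmr_high_variance = forward_mutation_ratio(6.0, 6.5)
```

The computed values agree with the tabulated FMR column to the precision shown, which supports reading FMR as the plain ratio of the two measures.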
V0 is determined by the variability of word deviations between the tester sequence and
the target. It can be seen as a measure of how different the tester and target sequences
are in terms of their composition. Assuming that V0 is the difference between the tester
sequence and the target sequence and that the FMR value determines the change needed
between the two sequences, the differential between the two variables should equal the
mutation rate. Tetra-mer distance and V0 were chosen as the calculation parameters
within this study. Tetra-mers were chosen over di-mers and tri-mers because the greater
complexity of the tetra-mer pattern (256 word combinations, against 16 and 64
respectively) better reflects the sequence structure. This assumption was not proven,
however, due to the lack of knowledge needed to determine which mutation rate estimate
is more correct than the others. Hence a more in-depth study is needed on the different
K-mer parameters (D, V, Int V) to determine the best approach, as well as real data
(genomic loci sequences instead of simulated sequences) to test the practicality of the
method (comparing the number of mutations to biological sequence data).
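To make the difference in pattern complexity concrete, the sketch below enumerates all possible words of length k over the DNA alphabet (there are 4**k of them) and counts overlapping k-mers in a short made-up sequence. The function names and the example sequence are illustrative only and do not come from the study's software.

```python
from collections import Counter
from itertools import product

def possible_words(k, alphabet="ACGT"):
    """All k-letter words over the DNA alphabet (4**k of them)."""
    return ["".join(p) for p in product(alphabet, repeat=k)]

def count_kmers(sequence, k):
    """Count overlapping k-mers in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Number of distinct words for di-, tri- and tetra-mers.
word_counts = {k: len(possible_words(k)) for k in (2, 3, 4)}

# Overlapping tetra-mer counts in a short illustrative sequence.
example = count_kmers("ACGTACGTAC", 4)
```

A longer word length gives a finer-grained pattern, but each word is then observed fewer times in a sequence of fixed length, which is one reason the choice of k deserves the further study suggested above.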
The main downside of this approach to estimating the rate of mutation is that it is
relative. Multiple genomic loci sequences (preferably more than five) with the same
target (the whole genome sequence containing the genomic loci sequences) are required
for the method to work. Increasing the number of genomic loci sequences within the
calculation will increase the accuracy of the differential measure (Table 8). This is
because the differential is a function of the sample data, where the trend of the data
determines the differential measure. An advantage of increasing the number of genomic
loci sequences is that the differentials associated with the individual genomic loci
form a differential equation which can be used for comparison between organisms
(mutation rate function between organisms) or for in-depth analysis of the trend within
the organism (linear or non-linear relationship between mutation rate and sequence
composition). However, this causes unnecessary calculations if only one or two genomic
loci sequences and their mutation rates with regard to the target sequence are of
interest. Although the measure is calculated from compositional statistics and is
comparable across genomic loci sequences, the lack of biological data to back it up
means that the results remain inconclusive.
Table 8. Comparison of the differential measure for 5 genomic loci when calculated with different total numbers of
genomic loci

Internal 4V  Distance  V0    FMR       Differential N = 5  Differential N = 60
5.17         7.67      2.97  1.483559  0.884918069         0.884918069
5.63         6.12      2.33  1.087034  0.699732181         0.696544331
5.83         5.53      2.1   0.948542  0.677820406         0.623340175
6.06         5.5       1.91  0.907591  0.633470643         0.571681541
6.28         6.02      1.86  0.958599  0.590850028         0.5158925

Internal 4V: Tetra-mer internal variance of tester, Distance: Absolute distance measure between tester and target
(See section 2.2 in Literature Review), V0: Variation between composition pattern between tester and target, FMR:
Forward Mutation Ratio, Differential: Empirical differential calculated based on FMR and V0, also equivalent to
mutation rate, N: number of genomic loci sequences used for differential calculation
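The FMR values in Table 8 are numerically consistent with the ratio of the Distance value to the Internal 4V value. The sketch below checks that observation on the five tabulated rows. It is a reading of the table values rather than a restatement of the FMR definition given in the methods, and the function name is hypothetical.

```python
# (Internal 4V, Distance, FMR) triples copied from Table 8.
TABLE_8_ROWS = [
    (5.17, 7.67, 1.483559),
    (5.63, 6.12, 1.087034),
    (5.83, 5.53, 0.948542),
    (6.06, 5.50, 0.907591),
    (6.28, 6.02, 0.958599),
]


def forward_mutation_ratio(distance, internal_variance):
    """Assumed relation: FMR as the ratio of distance to tester internal variance."""
    return distance / internal_variance


def max_abs_deviation(rows):
    """Largest gap between the tabulated FMR and the recomputed ratio."""
    return max(
        abs(forward_mutation_ratio(distance, internal) - fmr)
        for internal, distance, fmr in rows
    )


if __name__ == "__main__":
    for internal, distance, fmr in TABLE_8_ROWS:
        print(internal, distance, fmr, forward_mutation_ratio(distance, internal))
```

The differential itself depends on the trend across all loci used (N), so it cannot be recomputed from a single row in the same way.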
Conclusion
The amelioration process of a genomic island insert (tester) in a recipient genome
(target) was successfully represented through simulation using the probability logistic
function (PLF) and the compositional characteristics of the tester and target. Using
input parameters that correctly characterize the tester and target genomes (PLF
parameters “a” and “b”, which control the mutability of the sequence and its
conservativeness) allows the simulation to better reflect the true amelioration process of
the tester within specific targets. The amelioration model, which takes the form of a
Verhulst Equation, fitted the simulation data sets in this study well. Hence, once
the parameters of the Verhulst Equation are known, the amelioration process can be
expressed as a simple equation that can be used for further analysis, such as time of
insert estimation or comparison of amelioration models between different tester and
target combinations.
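As an illustration of how a fitted equation can be reused, the sketch below evaluates the textbook Verhulst (logistic) form dy/dt = g*y - m*y^2 with m = g/K and inverts its closed-form solution to recover elapsed time. The function names, the symbol y and the example numbers are assumptions made for illustration only. The exact quantity modelled and the parameterization used in this study are those defined in the methods.

```python
import math


def verhulst(t, y0, g, m):
    """Closed-form solution of dy/dt = g*y - m*y**2 with y(0) = y0."""
    k = g / m  # limiting value K, from the relation m = g/K
    growth = math.exp(g * t)
    return k * y0 * growth / (k + y0 * (growth - 1.0))


def elapsed_time(y, y0, g, m):
    """Invert the closed form: time needed to move from y0 to y.

    Both y0 and y must lie on the same side of K (and be non-zero),
    otherwise the logarithm is undefined.
    """
    k = g / m
    return math.log(y * (k - y0) / (y0 * (k - y))) / g


if __name__ == "__main__":
    # Hypothetical parameter values, not taken from the fitting results.
    g, m, y0 = 0.002, 0.001, 0.1
    y_later = verhulst(500.0, y0, g, m)
    print(y_later, elapsed_time(y_later, y0, g, m))
```

Under these assumptions a time of insert estimate amounts to calling the inverse with the observed value and the fitted g and m, which is why knowing the two parameters is sufficient.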
A parameter estimation model, obtained by regression over 80 different
combinations of tester and target, was built in an attempt to estimate the parameters of
the Verhulst Model for any combination of tester and target. Although the estimated
parameter function is not 100% accurate, it is still within an acceptable range for
making sensible estimations (the parameter estimation models all have r-squared values
above 70%). The time of insert estimates obtained from the estimated parameters differed
significantly from those of the simulation and model fitting approach, but the trend in
the answers remained the same. Hence the parameter estimation method is a good way to
obtain an estimate in a short period of time (only the parameter equation is needed,
whereas the empirical method needs simulation data, which can take a long time to
generate). However, there is room for improvement at every stage of the method (e.g.
changing the input parameters of the PLF to better suit the tester and target for better
simulation results, or the model selection technique used during regression of the
parameter estimation function), and each improvement can increase the accuracy of the
resulting estimate.
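The quality of the parameter estimation models is reported through r-squared and adjusted r-squared values. A minimal sketch of the standard definitions is given below. The function names are hypothetical, and the counts used in the usage lines (80 combinations, 3 predictors) are an assumption taken from the description of the regression rather than from the regression output itself.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalise r-squared for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


if __name__ == "__main__":
    # Assumed setting: 80 tester/target/mu combinations, 3 predictors.
    print(adjusted_r_squared(0.90, n_samples=80, n_predictors=3))
```

The adjustment matters when comparing models with different numbers of variables, since adding a predictor can never lower the unadjusted value.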
Altering the methods for estimating the mutation rate of genomic loci sequences
did not produce significant results. Using the simulated sequences as genomic loci,
a relative approach was used in an attempt to estimate the mutation rate of each
sequence. Although sensible mutation rate measures were estimated (in terms of the
trend between sequence composition and estimated mutation rate), the measure had no
established biological significance, so no conclusions could be drawn from it.
Therefore more research is needed to link the result with biological data, which would
allow further inferences and could potentially lead to interesting outcomes.
Acknowledgement
I would like to thank all the members at the Bioinformatics and Computational
Biology Unit at the University of Pretoria for all their support, guidance and
assistance. In particular, I would like to thank my supervisor Dr. O. Reva and
Prof. F. Joubert for everything they have done. I would also like to thank the NRF
and the University of Pretoria for their funding, without which this work would not
have been possible.
References
1. SWGB mirror site at the University of Pretoria in South Africa.
2. BABIC, A., LINDNER, A., VULIC, M., STEWART, E. & RADMAN, M. et al. 2008.
Direct visualization of horizontal gene transfer. Science 319, 1533–1536.
3. BEIKO, R. & HAMILTON, N. 2006. Phylogenetic identification of lateral genetic
transfer events. BMC Evol Biol, 6, 15.
4. BEZUIDT, O., MENDEZ, G. & REVA, O. 2009. SEQWord Gene Island Sniffer: a
program to study the lateral genetic exchange among bacteria. World Academy of Science,
Engineering and Technology 58, 1169-11274
5. BEZUIDT, O., GANESAN, H., LABUSCHAGNE, P., EMMETT, W., PIERNEEF, R.
& REVA, O. 2011. Linguistic Approaches for Annotation, Visualization and Comparison
of Prokaryotic Genomes and Environmental Sequences. In N-Y Yang (Ed) Systems and
Computational Biology - Molecular and Cellular Experimental Systems Chapter 2, 27-52.
InTech, Croatia. (ISBN 978-953-307-280-7)
6. BOUCHER, Y., DOUADY, C., PAPKE, R., WALSH, D., BOUDREAU, M., NESBO,
C., CASE, R. & DOOLITTLE, W. 2003. Lateral gene transfer and the origins of
prokaryotic groups. Annu Rev Genet, 37, 283 - 328.
7. BREIMAN, L. 2001. Random Forests. Machine Learning, 45, 5 - 32.
8. BUCHRIESER, C., PRENTICE, M. & CARNIEL, E. 1998. The 102-kilobases
unstable region of Yersinia pestis comprises a high-pathogenicity island linked to a
pigmentation segment which undergoes internal rearrangement. J Bacteriol 180: 2321–
2329.
9. CHATTERJEE, R., CHAUDHURI, K. & CHAUDHURI, P. 2008. On detection and
assessment of statistical significance of Genomic Islands. BMC Genomics, 9, 150.
10. DAGAN, T., RANDRUP, Y. & MARTIN, W. 2008. Modular networks and
cumulative impact of lateral transfer in prokaryote genome evolution. Proc Natl Acad Sci
USA. 105, 10 039–10 044.
11. FOERSTNER, K., VON, M., HOOPER, S. & BORK, P. 2005. Environments shape
the nucleotide composition of genomes. EMBO Rep, 6, 1208 - 1213.
12. GAL-MOR, O. & FINLAY, B. 2006. Pathogenicity islands: a molecular toolbox for
bacterial virulence. Cell. Microbiol. 8: 1707–1719
13. GANESAN, H., RAKITIANSKAIA, A., DAVENPORT, C., TUMMLER, B. &
REVA, O. 2008. The SeqWord Genome Browser: an online tool for the identification and
visualization of atypical regions of bacterial genomes through oligonucleotide usage.
BMC Bioinformatics, 9, 333.
14. GROISMAN, EA. & OCHMAN, H. 1996. Pathogenicity islands: bacterial evolution
in quantum leaps. Cell. 1996;87:791–794.
15. HACKER, J. & CARNIEL, E. 2001. Ecological fitness, genomic islands and bacterial
pathogenicity. EMBO Rep. 2001 May 15; 2(5): 376-381
16. HAMADY, M., BETTERTON, M. & KNIGHT, R. 2006. Using the nucleotide
substitution rate matrix to detect horizontal gene transfer. BMC Bioinformatics, 7, 476.
17. HEIN, J., JIANG, T., WANG, L. & ZHANG, K. 1996. On the complexity of
comparing evolutionary trees. Discrete Applied Mathematics 1996,71:153-169.
18. HOROWITZ, J., NORMAND, M. D., CORRADINI, M. G. & PELEG, M. 2010.
Probabilistic Model of Microbial Cell Growth, Division, and Mortality. Applied and
Environmental Microbiology, 76, 230-242.
19. HUSON, DH. & BRYANT, D. 2006. Application of phylogenetic networks in
evolutionary studies. Mol Biol Evol. 2006;23:254–267.
20. JAIN, R., RIVERA, M. & LAKE, J. 1999. Horizontal gene transfer among genomes:
the complexity hypothesis. Proc Natl Acad Sci USA, 96, 3801 - 3806.
21. KANHERE, A. & VINGRON, M. 2009. Horizontal Gene Transfers in prokaryotes
show differential preferences for metabolic and translational genes. BMC Evol Biol, 9, 9.
22. KARLIN, S. & BURGE, C. 1995. Dinucleotide relative abundance extremes: a
genomic signature. Trends Genet, 11, 283 - 290.
23. KLOCKGETHER, J., WURDEMANN, D., REVA, O., WIEHLMANN, L. &
TUMMLER, B. 2007. Diversity of the abundant pKLC102/PAGI-2 family of genomic
islands in Pseudomonas aeruginosa. J Bacteriol, 2007, 189:2443-2459.
24. KNIGHT, RD., STEPHEN, J. & LANDWEBER, L., 2001. A simple model based on
mutation and selection explains trends in codon and amino-acid usage and GC
composition within and across genomes. Genome Biol, 2, R10.
25. KOONIN, E., MAKAROVA, KS. & ARAVIND, L. 2001. Horizontal gene transfer in
Prokaryotes. Quantification and classification. Annu. Rev. Microbiol. 55, 709–742.
26. KOSEKI, S. & NONAKA, J. 2012. Alternative Approach To Modeling Bacterial Lag
Time, Using Logistic Regression as a Function of Time, Temperature, pH, and Sodium
Chloride Concentration. Applied and Environmental Microbiology, 78, 6103-6112.
27. KOSKI, L. B., MORTON, R. A. & GOLDING, G. B. 2001. Codon bias and base
composition are poor indicators of horizontally transferred genes. Mol Biol Evol, 18, 404-12.
28. LAWRENCE, J. & OCHMAN, H. 1997. Amelioration of bacterial genomes: rates of
change and exchange. J Mol Evol, 44, 383 - 397.
29. LEDERBERG, J. & TATUM, EL. 1946. Gene recombination in Escherichia Coli.
Nature, 158, 558.
30. LEVINGS, RS., LIGHTFOOT, D., PARTRIDGE, S., HALL, R. & DJORDJEVIC, S.
2005. The Genomic Island SGI1, Containing the Multiple Antibiotic Resistance Region
of Salmonella enterica Serovar Typhimurium DT104 or Variants of It, Is Widely
Distributed in Other S. enterica Serovars. J. Bacteriol. vol. 187 no. 13 4401-4409
31. LOBRY, J. & LOBRY, C. 1999. Evolution of DNA base composition under
no-strand-bias conditions when the substitution rates are not constant. Mol Biol Evol, 16,
719 - 723.
32. MACLEAN, R.C., TORRES-BARCELO, C. & MOXON, R., 2013. Evaluating
evolutionary models of stress-induced mutagenesis in bacteria. Nature Reviews Genetics,
14(3), pp.221–227.
33. MARTINCORENA, I., SESHASAYEE, A. & LUSCOMBE, N. 2012. Evidence of
non-random mutation rates suggests an evolutionary risk management strategy. , 14,
pp.6–9.
34. MEDINI, D., DONATI, C., TETTELIN, H., MASIGNANI, V. & RAPPUOLI, R.
2005. The microbial pan-genome. Curr Opin Genet Dev. 2005 Dec;15(6):589-94.
35. MOZHAYSKIY, V. & TAGKOPOULOS, I., 2012. Horizontal gene transfer
dynamics and distribution of fitness effects during microbial in silico evolution. BMC
bioinformatics, 13(10), p.13.
36. NAKAMURA, Y., ITOH, T., MATSUDA, H. & GOJOBORI, T. 2004. Biased
biological functions of horizontally transferred genes in prokaryotic genomes. Nat Genet,
36, 760 - 766.
37. OCHMAN, H., LAWRENCE, J. & GROISMAN, E. 2000. Lateral gene transfer and
the nature of bacterial innovation. Nature, 405, 299 - 304.
38. PENN, K., JENKINS, C., NETT, M., UDWARY, DW., GONTANG, EA.,
MCGLINCHEY, RP., FOSTER, B., LAPIDUS, A., PODELL, S., ALLEN, EE., MOORE,
BS. & JENSEN, PR. 2009. Genomic islands link secondary metabolism to functional
adaptation in marine Actinobacteria. ISME J.2009;3:1193.
39. PODELL, S. & GAASTERLAND, T. 2007. DarkHorse: a method for genome-wide
prediction of horizontal gene transfer. Genome Biol, 8, R16.
40. RAGAN, M., HARLOW, T. & BEIKO, R. 2006. Do different surrogate methods
detect lateral genetic transfer events of different relative ages? Trends Microbiol, 14, 4 - 8.
41. RANDLES, R. 1979. Introduction to the Theory of Nonparametric Statistics.
42. SANTOS, S. & OCHMAN, H. 2004. Identification and phylogenetic sorting of
bacterial lineages with universally conserved genes and proteins. Environ Microbiol, 6,
754 - 759.
43. SMITH, A., LUI, T. & TILLIER, E. 2004. Empirical models for substitution in
ribosomal RNA. Mol Biol Evol, 21, 419 - 427.
44. SNIR, S. 2014. On the number of genomic pacemakers: a geometric approach.
Algorithms for Molecular Biology, 9(1)
45. SUEOKA, N. 1962. On the genetic basis of variation and heterogeneity of DNA base
composition. Proc Natl Acad Sci USA, 48, 582 - 592.
46. SUEOKA, N. 1988. Directional mutation pressure and neutral molecular evolution.
Proc Natl Acad Sci USA, 85, 2653 - 2657.
47. SYVANEN, M. 1985. Cross-Species gene transfer: implications for a new theory of
evolution. J. Theor. Biol. 112, 333-343.
48. TAMAMES, J. & MOYA, A. 2008. Estimating the extent of horizontal gene transfer
in metagenomic sequences. BMC Genomics, 9, 136.
49. TEMPLETON, AR., CRANDALL, KA. & SING, CF. 1992. A cladistic analysis of
phenotypic associations with haplotypes inferred from restriction endonuclease mapping
and DNA sequence data. III. Cladogram estimation. Genetics 132:619–633.
50. THIERGART, T., LANDAN, G. & MARTIN, WF. 2014. Concatenated alignments
and the case of the disappearing tree. BMC Evol Biol, 14(1), p.2624
51. THOMPSON, CC., CHIMETTO, L., EDWARDS, R., SWINGS, J.,
STACKEBRANDT, E. & THOMPSON, F. 2013. Microbial genomic taxonomy. BMC
genomics, 14, p.913
52. VOGAN, A. & HIGGS, P. 2011. The advantages and disadvantages of horizontal
gene transfer and the emergence of the first species. Biology Direct, 6, 1.
53. WELLNER, A., LURIE, M. & GOPHNA, U. 2007. Complexity, connectivity, and
duplicability as barriers to lateral gene transfer. Genome Biol, 8, R156.
54. ZINDER, N.D. & LEDERBERG, J. 1952. Genetic Exchange in Salmonella. J.
Bacteriol. 64, 678-699.
Appendix
Table 1. Verhulst Equation Fitting Results (ALL)

Tester  Target      Mu       Dimer m      Dimer g   Trimer m     Trimer g  Tetramer m   Tetramer g
PAGI    Ecoli       0.00001  0.001026842  0.002034  0.00110737   0.002628  0.000917608  0.002566
PAGI    Ecoli       0.00004  0.001764815  0.002859  0.001975598  0.004129  0.001310417  0.003145
PAGI    Ecoli       0.00008  0.001681203  0.002236  0.002201538  0.004293  0.001917447  0.004506
PAGI    Ecoli       0.0001   0.00160251   0.001915  0.003472222  0.00715   0.002956853  0.007223
PAGI    9a5c        0.00001  0.000159945  0.000148  0.000413169  0.000825  0.000551452  0.001258
PAGI    9a5c        0.00004  0.000424324  0.000314  0.000927381  0.001558  0.001297129  0.002711
PAGI    9a5c        0.00008  0.00100396   0.001014  0.0016       0.00272   0.002632536  0.005502
PAGI    9a5c        0.0001   0.001122227  0.001133  0.001997252  0.003489  0.002608159  0.005492
PAGI    Subtilis    0.00001  0.001263787  0.003804  0.001378711  0.004922  0.001155789  0.004392
PAGI    Subtilis    0.00004  0.001012736  0.002147  0.002205455  0.007278  0.001911494  0.006652
PAGI    Subtilis    0.00008  0.001048667  0.001573  0.002548377  0.007849  0.002321515  0.007661
PAGI    Subtilis    0.0001   0.001010093  0.001271  0.002816428  0.008538  0.002646329  0.00864
PAGI    Aeruginosa  0.00001  0.000347097  0.000346  0.000515507  0.000795  0.000714339  0.001428
PAGI    Aeruginosa  0.00004  0.000727451  0.000371  0.000969027  0.001095  0.001281921  0.002269
PAGI    Aeruginosa  0.00008  0.001036538  0.000539  0.001288571  0.001353  0.001679651  0.002889
PAGI    Aeruginosa  0.0001   0.001192904  0.000585  0.00136476   0.001436  0.002119626  0.003604
PAGI    Griseus     0.00001  0.000696129  0.001446  0.000734002  0.001717  0.000807158  0.002385
PAGI    Griseus     0.00004  0.000948175  0.001299  0.000918519  0.001488  0.000915837  0.002024
PAGI    Griseus     0.00008  0.001449153  0.00171   0.001389542  0.002126  0.001296759  0.002801
PAGI    Griseus     0.0001   0.001599199  0.001837  0.001674015  0.002506  0.001457906  0.003053
BSGI    Ecoli       0.00001  0.000776085  0.001202  0.000783963  0.001707  0.000769204  0.002172
BSGI    Ecoli       0.00004  0.001455773  0.001315  0.001005546  0.001378  0.000952272  0.002085
BSGI    Ecoli       0.00008  0.002756193  0.002192  0.001872621  0.002361  0.00187713   0.003966
BSGI    Ecoli       0.0001   0.003293114  0.002702  0.002353124  0.003032  0.002016665  0.004187
BSGI    9a5c        0.00001  0.000793818  0.002111  0.000781195  0.00221   0.000826046  0.00256
BSGI    9a5c        0.00004  0.00102073   0.001487  0.001012921  0.001803  0.001203159  0.002864
BSGI    9a5c        0.00008  0.001561995  0.001838  0.001653401  0.002659  0.002381293  0.005326
BSGI    9a5c        0.0001   0.001924911  0.00222   0.002204165  0.003472  0.002675843  0.005832
BSGI    Subtilis    0.00001  0.002406199  0.00354   0.001313106  0.002653  0.001140405  0.00279
BSGI    Subtilis    0.00004  0.005535173  0.007302  0.002238163  0.003947  0.001525763  0.003355
BSGI    Subtilis    0.00008  0.004352866  0.005445  0.003222746  0.005469  0.002196179  0.004713
BSGI    Subtilis    0.0001   0.003090641  0.003778  0.004004695  0.006995  0.003442809  0.0076
BSGI    Aeruginosa  0.00001  0.000706657  0.002243  0.000713649  0.002403  0.000799112  0.002808
BSGI    Aeruginosa  0.00004  0.001062861  0.001985  0.00102183   0.002186  0.001401433  0.003814
BSGI    Aeruginosa  0.00008  0.001402204  0.002074  0.00160984   0.003102  0.002371602  0.00616
BSGI    Aeruginosa  0.0001   0.001915596  0.002787  0.001901994  0.003501  0.002677713  0.006739
BSGI    Griseus     0.00001  0.000819463  0.002885  0.000814174  0.003226  0.000691756  0.002884
BSGI    Griseus     0.00004  0.001215942  0.002758  0.001108942  0.002897  0.001133506  0.003408
BSGI    Griseus     0.00008  0.001643389  0.003024  0.001920728  0.004434  0.002043195  0.005856
BSGI    Griseus     0.0001   0.0017563    0.003011  0.002188104  0.004981  0.002365471  0.006774
ECGI    Ecoli       0.00001  0.000411539  0.000408  0.000462691  0.000599  0.00057094   0.000969
ECGI    Ecoli       0.00004  0.00088627   0.000519  0.000988718  0.000964  0.001203063  0.001791
ECGI    Ecoli       0.00008  0.001211374  0.000639  0.001114535  0.001049  0.001436026  0.002064
ECGI    Ecoli       0.0001   0.001392143  0.000808  0.00123621   0.001143  0.001871695  0.00269
ECGI    9a5c        0.00001  0.000344612  0.000607  0.000490983  0.00104   0.000552343  0.001299
ECGI    9a5c        0.00004  0.000612934  0.000508  0.000775343  0.001044  0.000895195  0.001634
ECGI    9a5c        0.00008  0.000813492  0.000533  0.000994662  0.001267  0.001124544  0.001972
ECGI    9a5c        0.0001   0.000973532  0.000743  0.001244454  0.001683  0.001423138  0.002582
ECGI    Subtilis    0.00001  0.000498192  0.000937  0.000907299  0.001933  0.000731941  0.001836
ECGI    Subtilis    0.00004  0.001089876  0.001842  0.00143898   0.002844  0.001322625  0.002989
ECGI    Subtilis    0.00008  0.001262782  0.00205   0.00222532   0.00431   0.001866726  0.004216
ECGI    Subtilis    0.0001   0.001134042  0.001643  0.002560522  0.004823  0.002306108  0.00504
ECGI    Aeruginosa  0.00001  0.000504911  0.001825  0.000556753  0.001963  0.000595248  0.002182
ECGI    Aeruginosa  0.00004  0.000652377  0.001397  0.000747801  0.001734  0.000777888  0.002108
ECGI    Aeruginosa  0.00008  0.0008015    0.001282  0.001005982  0.002035  0.00111206   0.002734
ECGI    Aeruginosa  0.0001   0.000974784  0.001469  0.001240126  0.002449  0.001287705  0.003142
ECGI    Griseus     0.00001  0.000612225  0.002462  0.00055478   0.002265  0.00055153   0.002405
ECGI    Griseus     0.00004  0.00081185   0.002195  0.000820896  0.002442  0.000796819  0.002675
ECGI    Griseus     0.00008  0.001023211  0.002354  0.001056831  0.002821  0.001247257  0.003979
ECGI    Griseus     0.0001   0.001169973  0.002595  0.001424164  0.003831  0.001400519  0.004372
SCGI    Ecoli       0.00001  0.001434189  0.003866  0.004017142  0.015795  0.00161837   0.00678
SCGI    Ecoli       0.00004  0.001611431  0.003107  0.005829785  0.020673  0.002219329  0.008097
SCGI    Ecoli       0.00008  0.001804491  0.002732  0.00470301   0.013967  0.002295806  0.007117
SCGI    Ecoli       0.0001   0.001445747  0.001948  0.004844245  0.01482   0.002430858  0.007638
SCGI    9a5c        0.00001  0.001305809  0.002758  0.001641339  0.004385  0.001373618  0.004285
SCGI    9a5c        0.00004  0.003267109  0.005953  0.003219562  0.007584  0.001725225  0.004542
SCGI    9a5c        0.00008  0.00347529   0.005239  0.003054996  0.006216  0.002593665  0.006272
SCGI    9a5c        0.0001   0.003800621  0.005387  0.003553715  0.007165  0.002783452  0.006459
SCGI    Subtilis    0.00001  0.000845369  0.002394  0.005188847  0.025718  0.004160507  0.021333
SCGI    Subtilis    0.00004  0.001018693  0.001891  0.006014738  0.025793  0.00531558   0.023514
SCGI    Subtilis    0.00008  0.001121212  0.001591  0.006970816  0.027946  0.00688381   0.028764
SCGI    Subtilis    0.0001   0.001091804  0.00142   0.00587597   0.02166   0.005082207  0.019474
SCGI    Aeruginosa  0.00001  0.001606986  0.002199  0.001329329  0.002501  0.001882804  0.004373
SCGI    Aeruginosa  0.00004  0.002148197  0.00227   0.001939501  0.003084  0.003384353  0.007341
SCGI    Aeruginosa  0.00008  0.002336182  0.002132  0.002920371  0.004665  0.003820562  0.008521
SCGI    Aeruginosa  0.0001   0.002998412  0.002644  0.002957222  0.004369  0.003210351  0.006848
SCGI    Griseus     0.00001  0.000603656  0.0007    0.000815391  0.001121  0.001316295  0.002328
SCGI    Griseus     0.00004  0.001230469  0.00126   0.001585076  0.002137  0.004529634  0.008147
SCGI    Griseus     0.00008  0.001854088  0.001873  0.003258115  0.004346  5.30594E-05  0.000094
SCGI    Griseus     0.0001   0.003411514  0.00384   0.005952196  0.008442  3.60992E-05  0.000069

PAGI: P. aeruginosa Genomic Island, BSGI: B. subtilis Genomic Island, ECGI: E. coli Genomic Island, SCGI: S. coelicolor
Genomic Island, Ecoli: E. coli K12, 9a5c: X. fastidiosa 9a5c strain, Subtilis: B. subtilis sub 168, Aeruginosa: P. aeruginosa,
Griseus: S. griseus. m and g are both parameters of the Verhulst Equation where m = g/K (see methods). E. coli K12
genomic island and S. coelicolor genomic island in combination with five targets are not displayed in this table.
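Because the table note gives m = g/K, the limiting value K for any row can be recovered as g/m. The sketch below does this for the first row of the table (P. aeruginosa genomic island in E. coli K12, Mu = 0.00001). The function name is hypothetical, and the interpretation of K is the one given in the methods.

```python
def carrying_capacity(g, m):
    """Recover K from the relation m = g/K quoted in the table note."""
    if m == 0:
        raise ValueError("m must be non-zero to recover K")
    return g / m


# (m, g) pairs copied from the first row of Table 1, per K-mer length.
FIRST_ROW = {
    "dimer": (0.001026842, 0.002034),
    "trimer": (0.00110737, 0.002628),
    "tetramer": (0.000917608, 0.002566),
}

if __name__ == "__main__":
    for kmer, (m, g) in FIRST_ROW.items():
        print(kmer, carrying_capacity(g, m))
```

Rows in which m is very small relative to g give very large K values, which is one way to spot the extreme cases mentioned in the figure captions that follow.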
[Plot: Di-mer g parameter, different tester/target combinations. x-axis: mu (0 to 0.00012); y-axis: g (0 to 0.008).
Series: PA/EC, PA/XF, PA/BS, PA/PA, PA/SG, BS/EC, BS/XF, BS/BS, BS/PA, BS/SG, EC/EC, EC/XF, EC/BS, EC/PA, EC/SG,
SC/EC, SC/XF, SC/BS, SC/PA, SC/SG.]
Fig. 2. Plot of the di-mer g parameter estimates of the Verhulst Equation for all 20 combinations of tester and target.
From the graph, no clear relationship can be established between g and the other factors (μ, tester and target internal
variance). Combinations that differ markedly from the rest are BS/BS (B.subtilis), PA/SG
(P.aeruginosa/S.griseus) and SC/EC (S.coelicolor/E.coli). This could be caused by extreme K values, as stated in the
results section.
[Plot: Tetra-mer g parameter, different tester/target combinations. x-axis: mu (0 to 0.00012); y-axis: g (0 to 0.035).
Series: PA/EC, PA/XF, PA/BS, PA/PA, PA/SG, BS/EC, BS/XF, BS/BS, BS/PA, BS/SG, EC/EC, EC/XF, EC/BS, EC/PA, EC/SG,
SC/EC, SC/XF, SC/BS, SC/PA, SC/SG.]
Fig. 3. Plot of the tetra-mer g parameter estimates of the Verhulst Equation for all 20 combinations of tester and
target. The plot differs clearly from figure 2 and resembles the tri-mer g plot in figure 15 of the results section. There is a
clear linear trend between g and the three factors (μ, tester and target internal variance), with some outlier
combinations such as SC/BS (S.coelicolor/B.subtilis) and SC/SG (S.coelicolor/S.griseus).
[Plot: Di-mer m parameter, different tester/target combinations. x-axis: mu (0 to 0.00012); y-axis: m (0 to 0.006).
Series: PA/EC, PA/XF, PA/BS, PA/PA, PA/SG, BS/EC, BS/XF, BS/BS, BS/PA, BS/SG, EC/EC, EC/XF, EC/BS, EC/PA, EC/SG,
SC/EC, SC/XF, SC/BS, SC/PA, SC/SG.]
Fig. 4. Plot of the di-mer m parameter estimates of the Verhulst Equation for all 20 combinations of tester and target.
The plot is similar to that of parameter 2g in figure 2, with the same outliers, but here there is a clear linear
trend. Outliers include BS/BS (B.subtilis) and PA/SG (P.aeruginosa/S.griseus).
[Plot: Tri-mer m parameter, different tester/target combinations. x-axis: mu (0 to 0.00012); y-axis: m (0 to 0.008).
Series: PA/EC, PA/XF, PA/BS, PA/PA, PA/SG, BS/EC, BS/XF, BS/BS, BS/PA, BS/SG, EC/EC, EC/XF, EC/BS, EC/PA, EC/SG,
SC/EC, SC/XF, SC/BS, SC/PA, SC/SG.]
Fig. 5. Plot of the tri-mer m parameter estimates of the Verhulst Equation for all 20 combinations of tester and target.
The plot is similar to that of parameter 3g in figure 15 of the results section, with the same outliers and a definite
linear trend. Outliers include SC/BS (S.coelicolor/B.subtilis) and SC/EC (S.coelicolor/E.coli).
[Plot: Tetra-mer m parameter, different tester/target combinations. x-axis: mu (0 to 0.00012); y-axis: m (0 to 0.008).
Series: PA/EC, PA/XF, PA/BS, PA/PA, PA/SG, BS/EC, BS/XF, BS/BS, BS/PA, BS/SG, EC/EC, EC/XF, EC/BS, EC/PA, EC/SG,
SC/EC, SC/XF, SC/BS, SC/PA, SC/SG.]
Fig. 6. Plot of the tetra-mer m parameter estimates of the Verhulst Equation for all 20 combinations of tester and
target. The outliers are exactly the same as for parameter 4g, and the trend is the same as for parameters 2m and 3m.
Since the m parameter is calculated as g/K, the two parameters are in proportion to each other and share the same
outliers. Outliers include SC/BS (S.coelicolor/B.subtilis) and SC/SG (S.coelicolor/S.griseus).
Fig. 7. Multi-variate regression analysis with all combinations of tester, target and μ for parameter 2g. The adjusted
R-squared value is 76.26%, which shows that there is a linear relationship but that the model is poor at estimating parameter
g. The significant variables are V_Tester and μ, where μ is only significant at the 15% significance level.
Fig. 8. Multi-variate regression analysis with all combinations of tester, target and μ for parameter 2m. The adjusted
R-squared value is 81.40%, which makes this a better model for estimating 2m than the model for 2g. The variables are
V_Tester, Mu and V0, all three of which are significant at the 5% significance level.
Fig. 9. Multi-variate regression analysis with all combinations of tester, target and μ for parameter 3g. The adjusted
R-squared value is 79.97%, which is better than for parameter 2g but still indicates a poor model, since about 20% of the
variance in the data remains unexplained. The variables are the same as for parameter 2g, V_Tester and Mu, and both
are significant at the 1% significance level. In combination with the results for parameter 4g in figure 15 of the results
section, parameter g is a function of V_Tester, V_Target and Mu.
Fig. 10. Multi-variate regression analysis with all combinations of tester, target and μ for parameter 3m. The adjusted
R-squared value is 89.40%, which indicates a very good model for estimating parameter 3m. The three significant
variables are V_Tester, Mu and V0, all of which are significant at the 1% significance level.
Fig. 11. Multi-variate regression analysis with all combinations of tester, target and μ for parameter 4m. The adjusted
R-squared value is 93.66%, which is the best of all six parameter models. Based on the R-squared value, we can
assume that parameter 4m follows a multi-variate linear function. The significant variables are V_Tester, Mu and V0,
all of which are significant at the 1% significance level. Parameter m follows the same function of V_Tester,
Mu and V0 for all K-mers.
Table 2. Differential Estimate for different simulated sequences

Internal 4V  Distance  V0    FMR       Differential
5.17         7.67      2.97  1.483559  0.884918069
5.28         7.2       2.8   1.363636  0.833634975
5.27         6.97      2.77  1.322581  0.841290623
5.35         6.83      2.68  1.276636  0.817837767
5.38         6.44      2.59  1.197026  0.810451008
5.41         6.66      2.6   1.231054  0.792621145
5.42         6.49      2.57  1.197417  0.786001884
5.47         6.48      2.52  1.184644  0.768645337
5.53         6.35      2.46  1.148282  0.751016954
5.53         6.24      2.45  1.128391  0.7442572
5.55         6.21      2.41  1.118919  0.731611437
5.58         5.87      2.34  1.051971  0.726944162
5.59         6.13      2.37  1.096601  0.717473583
5.61         6.18      2.36  1.101604  0.707036492
5.63         6.12      2.33  1.087034  0.696544331
5.67         6.15      2.3   1.084656  0.683130736
5.68         5.99      2.27  1.054577  0.67405555
5.71         6.03      2.26  1.056042  0.664996225
5.72         5.59      2.19  0.977273  0.663905307
5.74         5.83      2.21  1.015679  0.65813006
5.75         6.02      2.22  1.046957  0.649327844
5.77         5.92      2.19  1.025997  0.641590694
5.79         5.73      2.16  0.989637  0.637581008
5.79         5.8       2.16  1.001727  0.632360139
5.82         5.81      2.14  0.998282  0.626294694
5.82         5.47      2.09  0.939863  0.624855445
5.83         5.53      2.1   0.948542  0.623340175
5.85         5.74      2.12  0.981197  0.619581157
5.86         5.66      2.1   0.96587   0.616483633
5.87         5.44      2.06  0.926746  0.615368196
5.88         5.53      2.08  0.940476  0.614299524
5.9          5.32      2.02  0.901695  0.613437464
5.9          5.52      2.06  0.935593  0.611780512
5.9          5.42      2.04  0.918644  0.610724923
5.92         5.48      2.05  0.925676  0.609730404
5.92         5.4       2.02  0.912162  0.608275093
5.93         5.35      1.99  0.902192  0.606008331
5.94         5.42      1.99  0.912458  0.602999982
5.97         5.36      1.96  0.897822  0.599743379
5.96         5.27      1.95  0.884228  0.597514959
5.99         5.36      1.95  0.894825  0.594559652
6            5.38      1.94  0.896667  0.591144048
6.01         5.22      1.9   0.868552  0.58816412
6.04         5.45      1.92  0.902318  0.583786152
6.04         5.45      1.92  0.902318  0.579836481
6.05         5.28      1.87  0.872727  0.575811842
6.06         5.5       1.91  0.907591  0.571681541
6.08         5.59      1.9   0.919408  0.566520698
6.09         5.52      1.89  0.906404  0.562199573
6.1          5.31      1.86  0.870492  0.559338375
6.14         5.31      1.84  0.864821  0.556181928
6.16         5.37      1.82  0.871753  0.551806326
6.2          5.48      1.82  0.883871  0.546960748
6.22         5.69      1.82  0.914791  0.540414614
6.24         5.82      1.82  0.932692  0.533239391
6.25         5.9       1.84  0.944     0.526948598
6.26         6         1.86  0.958466  0.521234098
6.28         6.02      1.86  0.958599  0.5158925
6.29         6.21      1.89  0.987281  0.510756658
6.29         6.34      1.9   1.007949  0.505413194

Internal 4V: Tetra-mer internal variance of tester, Distance: Absolute distance measure between tester and target
(See section 2.2 in Literature Review), V0: Variation between composition pattern between tester and target, FMR:
Forward Mutation Ratio, Differential: Empirical differential calculated based on FMR and V0, also equal to the
mutation rate.