Investigations into the evolution of biological networks SARA LIGHT

Investigations into the evolution of
biological networks
Stockholm Bioinformatics Center
Department of Biochemistry and Biophysics
Stockholm University
Stockholm 2006
Sara Light
Stockholm Bioinformatics Center
Stockholm University
Albanova University Center
SE-106 91 Stockholm
E-mail: [email protected]
Tel: +46-(0)8-5537 8562
Fax: +46-(0)8-5537 8214
Doctoral Thesis
Light, 2006
ISBN 91-7155-273-1
Printed by Universitetsservice US AB, Stockholm
To Maria, Mom, Dad and Bill.
Investigations into the evolution of biological networks
Sara Light, [email protected]
Stockholm Bioinformatics Center
Stockholm University, SE-106 91 Stockholm, Sweden
Individual proteins, and small collections of proteins, have been extensively studied
for at least two hundred years. Today, more than 350 genomes have been completely
sequenced and the proteomes of these genomes have been at least partially mapped.
The inventory of protein coding genes is the first step toward understanding the
cellular machinery. Recent studies have generated a comprehensive data set for the
physical interactions between the proteins of Saccharomyces cerevisiae, in addition
to some less extensive proteome interaction maps of higher eukaryotes. Hence, it
is now becoming feasible to investigate important questions regarding the evolution
of protein-protein networks. For instance, what is the evolutionary relationship
between proteins that interact, directly or indirectly? Do interacting proteins co-evolve? Are they often derived from each other? In order to perform such proteome-wide investigations, a top-down view is necessary. This is provided by network (or
graph) theory.
The proteins of the cell may be viewed as a community of individual molecules
which together form a society of proteins (nodes), a network, where the proteins
have various kinds of relationships (edges) to each other. There are several different types of protein networks, for instance the two networks studied here, namely
metabolic networks and protein-protein interaction networks. The metabolic network is a representation of metabolism, which is defined as the sum of the reactions
that take place inside the cell. These reactions often occur through the catalytic
activity of enzymes, representing the nodes, connected to each other through substrate/product edges. The indirect interactions of metabolic enzymes are clearly
different in nature from the direct physical interactions that constitute the edges in
protein-protein interaction networks and are fundamental to most biological processes.
This thesis describes three investigations into the evolution of metabolic and
protein-protein interaction networks. We present a comparative study of the importance of retrograde evolution, the scenario that pathways assemble backward
compared to the direction of the pathway, and patchwork evolution, where enzymes
evolve from a broad to narrow substrate specificity. Shifting focus toward network
topology, a suggested mechanism for the evolution of biological networks, preferential
attachment, is investigated in the context of metabolism. Early in the investigation
of biological networks it seemed clear that the networks often display a particular, ’scale-free’, topology. This topology is characterized by many nodes with few
interaction partners and a few nodes with a large number of interaction partners
(hubs). While the second paper describes the evidence for preferential attachment
in metabolic networks, the final paper describes the characteristics of the hubs in
the physical interaction network of S. cerevisiae.
1 Publications
1.1 Publications included in this thesis
1.2 Other publications
2 Introduction
3 The evolution of biological networks
3.1 Metabolic networks
3.1.1 Defining metabolic networks
3.1.2 The evolution of metabolism
3.2 Protein-protein interaction networks
3.2.1 Context-based protein-protein interaction prediction
3.2.2 Experimental determination of individual protein-protein interactions
3.2.3 Experimental determination of protein complexes
3.2.4 The need for validation of protein-protein interactions
3.2.5 Toward the human protein-protein interaction network
3.2.6 The evolution of protein-protein interactions
3.3 On the structure of biological networks
3.3.1 What is the advantage of scale-freeness?
3.3.2 How do biological scale-free networks evolve?
3.3.3 The relationship between protein-protein interaction network topology and evolution
3.3.4 Is scale-freeness of biological importance?
4 Techniques and databases
4.1 Representation of the metabolic network
4.2 Representation of the protein-protein interaction network
4.3 Domain assignment and protein age
4.4 Statistical tests
5 The present investigations
5.1 Retrograde vs Patchwork evolution (Paper I)
5.2 Preferential attachment in metabolic networks (Paper II)
5.3 Domain distance as an alternative homology measure (Paper III)
5.4 The characteristics of hubs in the protein-protein interaction network of Saccharomyces cerevisiae (Paper IV)
6 Conclusions and reflections
7 Acknowledgements
Publications included in this thesis
Paper I
Sara Light and Per Kraulis
Network analysis of metabolic enzyme evolution in Escherichia coli.
BMC Bioinformatics, 5:15, 2004.
Paper II
Sara Light, Per Kraulis and Arne Elofsson
Preferential attachment in the evolution of metabolic networks.
BMC Genomics, 6:159, 2005.
Paper III
Åsa K. Björklund†, Diana Ekman†, Sara Light, Johannes Frey-Skött and Arne Elofsson
† These authors contributed equally.
Domain rearrangements in protein evolution.
Journal of Molecular Biology, 353, 911-923, 2005.
Paper IV
Diana Ekman†, Sara Light†, Åsa K. Björklund and Arne Elofsson
† These authors contributed equally.
What properties characterize the hub proteins of the protein-protein interaction
network of Saccharomyces cerevisiae?
Submitted, 2006
Other publications
Christian Roth, Shruti Rastogi, Lars Arvestad, Katharina Dittmar, Sara Light, Diana Ekman and David A. Liberles
Evolution after gene duplication: Models, mechanisms, sequences, systems and organisms
Submitted, 2006
Papers I-III reproduced with permission.
Introduction
Since the first bacterial genome, Haemophilus influenzae, was sequenced in 1995,
over 350 genomes have been completely sequenced. Currently, most of the sequenced
genomes belong to the eubacterial domain of life, but the fraction of sequenced eukaryotic genomes is expected to grow in the next few years. Knowing the sequence in
which the nucleotides appear in genomes, and finding the genes encoded therein, are
necessary stepping stones towards a greater understanding of the cellular machinery
and biological diversity. For instance, until the human genome was sequenced, the
number of human genes was expected to be roughly 100,000, but was shown to be
less than 30,000 (International Human Genome Sequencing Consortium, 2001; Venter et al., 2001). Hence, the number of
human genes is around 5 times that of the pathogenic proteobacterium Pseudomonas
aeruginosa, which is perhaps surprising considering the difference in complexity, and
size alone, between these organisms. However, a relatively small difference in the
number of genes can result in a great number of variations of the co-expression patterns for each gene combination. Consequently, it is now generally believed that the
complexity of multicellular organisms probably originates in part from the combinatorial interactions between proteins (Claverie, 2001). In addition, multi-domain
proteins are more common in eukaryotes (Apic et al., 2001b; Ekman et al., 2005;
Gerstein & Levitt, 1998; Liu & Rost, 2004), which may, through a wider range of
functional combinations of domains, result in greater complexity. Furthermore, alternative splicing of eukaryotic transcripts is likely to increase the protein repertory
of eukaryotes threefold and, finally, the transcription machinery is more complex in
eukaryotes than in prokaryotes.
Life on Earth originated around 3.8 billion years ago. It is likely that the first
organisms were prokaryotes, i.e. unicellular organisms without well defined internal compartments (organelles). The diversity of life is derived through evolution;
the process of amplification and mutation of genetic material, followed by varying
proportions of chance and natural selection. Alterations of the genome sequence
occur, for example, as a result of exposure to detrimental agents (toxins, radiation,
viruses) and consist of, for instance, point mutations as well as insertions or deletions of nucleotides, in addition to larger chromosomal events. These mechanisms
alter the genetic material, sometimes increasing the likelihood of survival for its host
organism (positive selection), but more often decreasing it (negative selection) or
having negligible impact (neutral evolution). Changes in the genomic sequence
can lead to changes in the cellular networks, which may have consequences at the
tissue or organism level.
An important source of new genetic material is gene duplication (Ohno, 1970;
Lynch & Conery, 2000). Indeed, estimates show that 30-60% of the genes in eukaryotic genomes are duplicates (Ball & Cherry, 2001). Genes that are derived from
gene duplications are called paralogs, see Figure 1. When a gene is duplicated,
one gene still performs its original function, while the duplicate may be somewhat
altered and eventually lost from the genome, through deactivating mutations.
Figure 1: Paralogs and orthologs. Paralogs, pairs of proteins which have diverged through duplication, are often contrasted to orthologs, which have diverged through speciation. Here, A, B and
C represent three species, while X and Z are two genes, which have diverged after duplication.
Hence, the two genes are paralogs. Further, two branches, representing the evolution of each one
of these genes, respectively, are shown. The X-branch shows an example of genes which are all
orthologous to each other. In the Z-branch, a similar scenario has occurred, with additional lineage
specific duplication events in species B and C. Z1B and Z2B are therefore inparalogs with respect
to the radiation of A from B. Furthermore, Z1B and Z2B are both orthologous to ZA.
In some instances, however, such as when the increased copy number confers a selective advantage to the organism, the duplicate may be retained. Furthermore, by
fixing beneficial mutations, one of the genes may acquire a new functionality (neofunctionalization) (Ohno, 1970), or, alternatively, the genes may evolve through
complementary degenerative mutations towards two genes whose functions, taken
together, perform the function of the original gene (subfunctionalization) (Force
et al., 1999).
A rare form of duplication is whole genome duplication, which is believed to
have taken place at least once during the evolution of Saccharomyces cerevisiae
(Kellis et al., 2004), and could be an important source of new functional subnetworks
(modules) in the eukaryotic cell (Conant & Wolfe, 2006). In prokaryotes, transfer
of genetic material between unrelated organisms is an important source of novel
genetic material. Although the extent of the evolutionary impact of horizontal
gene transfer (HGT) is still under debate (Kurland et al., 2003; Snel et al., 2002;
Lawrence & Ochman, 1997), it is probably an important evolutionary process in
microbial species (Boucher et al., 2003).
Proteins, the gene products, are the building blocks of the cell and perform
certain functions one by one, as in the instance of some enzymes, or together with
other proteins. In yeast there are around 6,000 proteins which often act together,
either in small groups or in large complexes. Much like the individual human beings
in this world are, for some purposes, viewed collectively as a society, a network of
people, the proteins of the cell, often referred to as the proteome, can be seen as a
community of individual molecules which together form a network. Certainly, some
human behaviours cannot be properly understood outside the context of human
society, and similarly, some properties of proteins may not make sense unless viewed
from the top down. Therefore, a network perspective of the proteome is an important
tool, which may enable another level of understanding of living organisms. This
thesis describes investigations into the evolution of metabolic and protein-protein
interaction networks.
The evolution of biological networks
A network, or graph, consists of objects called nodes (vertices) connected by edges
(links). Frequently, a graph is depicted in diagrammatic form as a set of nodes,
joined by edges, see Figure 2. In Papers I and II, the nodes of the metabolic network
are enzymes connected to each other through compounds, while the nodes in the
protein-protein interaction network (PPIN) in Paper IV represent the proteins in
the network. In the latter network, the edges are the physical interactions between
the proteins.
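The two node-level parameters defined in the caption of Figure 2, connectivity (k) and clustering coefficient (c), can be illustrated with a minimal sketch. The toy network below is hypothetical and only mirrors the numbers quoted in the caption (a hub with five neighbors, two of whose ten possible pairs are linked); the function names are illustrative and not taken from the papers.

```python
from itertools import combinations


def build_graph(edges):
    """Return an undirected graph as a dict mapping node -> set of neighbors."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def connectivity(graph, node):
    """Connectivity k: the number of nodes directly linked to `node`."""
    return len(graph[node])


def clustering_coefficient(graph, node):
    """Fraction of the possible neighbor pairs of `node` that are linked."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no neighbor pairs exist; returning 0 is a convention
    linked = sum(1 for a, b in combinations(neighbors, 2) if b in graph[a])
    return linked / (k * (k - 1) / 2)


# Hypothetical toy network: a hub with five neighbors (A-E);
# only the neighbor pairs A-B and C-D are linked to each other.
edges = [("hub", n) for n in "ABCDE"] + [("A", "B"), ("C", "D")]
graph = build_graph(edges)
k_hub = connectivity(graph, "hub")
c_hub = clustering_coefficient(graph, "hub")
```

With five neighbors there are 5·4/2 = 10 possible pairs, so two linked pairs give c = 2/10 for the hub.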
Metabolic networks
Metabolism, in its broadest sense, is the sum of the reactions which take place inside
the cell. It includes the processes by which cells take up and convert compounds
from precursors to complex compounds, anabolism, and the processes by which
compounds are degraded, catabolism. The most common nutrients are probably
carbohydrates, e.g. sucrose. The conversion of sugars to complex compounds is facilitated by a set of proteins, enzymes, which accelerate the rates of most reactions
in the cell. Almost all living organisms contain the machinery to expediently degrade sugars, produce high energy compounds, such as adenosine triphosphate (ATP)
and reduced nicotinamide adenine dinucleotide (NADH) and produce precursors for
complex molecules, such as proteins.
Figure 2: Network parameters. a) The connectivity (k) of a node is defined as the number of nodes
it is connected to through an edge. Here, the edges are undirected, although directed graphs are
also common. The highest connected node in the network is the hub. b) The clustering coefficient
(c) is a measure of the interconnectivity of the neighbors of a node. For instance, the hub in this
network is connected to five nodes, which could form 10 pairs of nodes among each other, but there
are only 2 such pairs. The clustering coefficient is defined as the number of actual neighbor pairs
divided by the maximum number of pairs (here c=0.2).
Enzymes, which are almost always proteins, catalyze many reactions in the cell,
without being altered themselves through the process. Generally, the enzyme forms
an enzyme-substrate complex with the compound, often referred to as the substrate,
and optimizes the environment surrounding the substrate, thus drastically increasing
the probability that a reaction will take place and the substrate be transformed to
product. In one of the most cited databases for chemical reactions, LIGAND (Goto
et al., 2002), there are currently more than 7,000 biological reactions, involving over
15,000 different compounds, most of which are catalyzed by enzymes. The enzymes
may be classified into six primary classes according to the type of reaction which
they catalyze:
1. Oxidoreductases: transfer of electrons, e.g. alcohol dehydrogenase.
2. Transferases: transfer of functional groups, e.g. citrate synthase.
3. Hydrolases: using or producing water, e.g. glucose-6-phosphatase.
4. Lyases: cleavage of C-C, C-O, C-N bonds, e.g. fructose-bisphosphate aldolase.
5. Isomerases: transfer of groups within a molecule, e.g. triose-phosphate isomerase.
6. Ligases: NTP-coupled bond formation, e.g. acetate-CoA ligase.
The enzymes may be classified further using Enzyme Commission (EC) numbers. For instance, the enzyme alcohol dehydrogenase, which catalyzes the reaction
where an alcohol is oxidized to an aldehyde or ketone using NAD+ as an electron
acceptor, is classified as an oxidoreductase (EC 1) because its reaction is an oxidation, and
further, it is classified as EC 1.1 since it acts on the CH-OH group, and subsequently
as EC 1.1.1 because NAD+ (or NADP+) acts as the electron acceptor. Finally,
the enzyme is classified as EC 1.1.1.1 because it acts on primary or secondary
alcohols and hemiacetals. There is considerable variation in the specificity of enzymes. Some enzymes are quite specific and may act on one compound only. In
addition, the specificity of a certain enzyme often varies between species. Again,
alcohol dehydrogenase serves as an example since animal alcohol dehydrogenases
also use cyclic alcohols as substrates, while the yeast variant does not.
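The hierarchical nature of the EC classification can be made concrete with a short sketch. It assumes fully numeric, dot-separated EC numbers (partial numbers written with dashes are not handled), and the helper names `ec_levels` and `shared_ec_depth` are illustrative only. The EC numbers used are those of alcohol dehydrogenase (1.1.1.1), lactate dehydrogenase (1.1.1.27) and triose-phosphate isomerase (5.3.1.1).

```python
# The six primary enzyme classes as listed in the text.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def ec_levels(ec_number):
    """Split an EC number such as '1.1.1.1' into its nested prefixes."""
    parts = ec_number.split(".")
    if not 1 <= len(parts) <= 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a well-formed EC number: {ec_number!r}")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def shared_ec_depth(ec_a, ec_b):
    """Number of leading EC levels two enzymes have in common (0 to 4)."""
    depth = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b:
            break
        depth += 1
    return depth


levels = ec_levels("1.1.1.1")              # alcohol dehydrogenase
top_class = EC_CLASSES[int(levels[0])]
depth_same_subsubclass = shared_ec_depth("1.1.1.1", "1.1.1.27")
depth_unrelated = shared_ec_depth("1.1.1.1", "5.3.1.1")
```

Two enzymes sharing the first three levels catalyze the same type of reaction with the same kind of acceptor and differ only in the substrate, which is the level at which substrate specificity is compared.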
Naturally, not all enzymes are present in all organisms. For instance, animals
lack most of the enzymes necessary for alkaloid synthesis which are present in some
plants, and many intracellular prokaryotes lack the enzymes necessary for amino acid
synthesis. In recent years, it has become clear that there is substantial variation in
the architecture of metabolism in different organisms both between and within the
domains of life. For example, the citric-acid cycle, thought to be one of the oldest
parts of metabolism, seems to be absent or incomplete in most of the organisms
studied (Huynen et al., 1999). In addition, although orthologs, see Figure 1, of
half of the enzymes in Escherichia coli are present in all three domains of life,
there is considerable variation in the groupings of enzymes into metabolic pathways
(Peregrin-Alvarez et al., 2003).
Defining metabolic networks
The whole metabolic network of E. coli consists of all the metabolic reactions in E.
coli, such as the reactions of biotin metabolism, glycolysis and amino
acid metabolism. Traditionally, metabolic networks have been drawn with enzymes
as edges, see Figure 3, but with a network view, the reverse representation is often
more intuitive. In the latter representation, the connections between the enzyme
nodes in metabolic networks are the compounds involved in the reactions. If an
enzyme E1 catalyzes a reaction where a product P is produced, and P is used as a
substrate by an enzyme E2, the two enzymes are neighbors in the network. Although
pathways are biologically meaningful partitions of the full metabolic network, the
partitioning of the metabolic network into pathways is not always straightforward
(Schuster et al., 2000). Therefore, there could be correlations that are not visible
from a pathway perspective which will emerge from a whole-network oriented view.
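The enzyme-centred representation described above, in which E1 and E2 are neighbors when a product of E1 is a substrate of E2, can be sketched as follows. This is a minimal illustration under assumptions: the input format, the function name `enzyme_network` and the three-step toy pathway are hypothetical, and edge direction is kept here although it can be dropped if an undirected network is wanted.

```python
from collections import defaultdict


def enzyme_network(reactions):
    """Link enzymes E1 -> E2 when a product of E1 is a substrate of E2.

    `reactions` maps an enzyme name to a (substrates, products) pair of sets.
    Returns a dict mapping (E1, E2) -> set of compounds forming that edge.
    """
    consumers = defaultdict(set)  # compound -> enzymes using it as substrate
    for enzyme, (substrates, _) in reactions.items():
        for compound in substrates:
            consumers[compound].add(enzyme)

    edges = defaultdict(set)
    for e1, (_, products) in reactions.items():
        for compound in products:
            for e2 in consumers[compound]:
                if e1 != e2:
                    edges[(e1, e2)].add(compound)
    return dict(edges)


# Hypothetical three-step pathway: S -> P -> Q -> R.
reactions = {
    "E1": ({"S"}, {"P"}),
    "E2": ({"P"}, {"Q"}),
    "E3": ({"Q"}, {"R"}),
}
edges = enzyme_network(reactions)
```

In this toy pathway E1 and E2 are neighbors through P, and E2 and E3 through Q, while E1 and E3 are two steps apart.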
Figure 3: Common representation of the biotin metabolism. Here, the main reactants represent the
nodes of the network and the enzymes represent the edges. This drawing of the biotin metabolism
was redrawn from EcoCyc (Karp et al., 2002).
There is also an element of arbitrariness involved in which compounds are considered promiscuous (compounds involved in many reactions, e.g. ATP) and which
are not. A similar element of arbitrariness is involved in the determination of which
compounds are cofactors (e.g. metal ions or NAD/NADH) and which compounds
are the main substrates or products of the reactions. The promiscuous compounds
pose a potential problem to metabolic network studies since these compounds participate in a large number of reactions, and their exclusion or inclusion could change
the properties, and relevance, of the network representation considerably. There are
numerous ways to approach the problem of compound promiscuity, one of which is to
simply remove compounds which are generally considered non-essential in a certain
reaction, and another is to count the number of reactions a compound participates
in within the complete metabolic network of E. coli and exclude those which are
most promiscuous from the network.
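The second approach, counting how many reactions each compound participates in and excluding the most promiscuous ones, might look like the sketch below. The reaction list is a toy example and the cut-off `max_reactions` is a hypothetical parameter; the text does not state which threshold was used.

```python
from collections import Counter


def compound_usage(reactions):
    """Count, for each compound, the number of reactions it takes part in."""
    usage = Counter()
    for substrates, products in reactions:
        usage.update(set(substrates) | set(products))
    return usage


def remove_promiscuous(reactions, max_reactions):
    """Drop compounds that occur in more than `max_reactions` reactions."""
    usage = compound_usage(reactions)

    def keep(compounds):
        return {c for c in compounds if usage[c] <= max_reactions}

    return [(keep(s), keep(p)) for s, p in reactions]


# Toy reaction list; ATP and ADP take part in all three reactions.
reactions = [
    ({"glucose", "ATP"}, {"G6P", "ADP"}),
    ({"F6P", "ATP"}, {"FBP", "ADP"}),
    ({"glycerol", "ATP"}, {"G3P", "ADP"}),
]
usage = compound_usage(reactions)
filtered = remove_promiscuous(reactions, max_reactions=2)
```

Without the filter, all three enzymes would appear connected through ATP and ADP; after filtering, only the main substrates and products remain to define edges.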
The evolution of metabolism
In 1945 one of the first theories regarding the evolution of metabolic pathways, often
referred to as retrograde evolution, was proposed by Horowitz (Horowitz, 1945). The
theory was primarily addressing the problem of how one enzyme could provide a
selective advantage to an organism before the whole pathway had been completed. It
states that during evolution pathways assembled backward compared to the direction
of the pathway in response to depletion of substrates from the environment. As
an example consider the following scenario: Enzyme E catalyzes reaction A → B,
where B is essential to the organism. A is depleted from the environment, and
subsequently, an organism harboring an enzyme E’ with the ability to catalyze a
reaction producing A from some other substrate from the environment would be at
an advantage. Since E can already bind A there is a greater chance that E rather
than an enzyme without an affinity for A will be mutated into E’, if duplicated, see
Figure 4.
Figure 4: Retrograde evolution. Retrograde evolution describes how pathways evolve backwards
compared to the direction of the pathway. A is the substrate that is depleted from the environment.
E and E’ are the enzymes of the reactions, where E’ is a duplicate of E.
In 1976 Jensen described recruitment evolution (Jensen, 1976), more often referred to as the patchwork evolution model (Lazcano & Miller, 1996), as an alternative to retrograde evolution. The retrograde evolution model demands an accumulation of substrates in the environment. Since many intermediate substrates are
chemically unstable, the retrograde evolution model cannot explain the evolution of
reactions involving such substrates. The patchwork evolution model states that the
number of enzymes was initially small, and broad substrate specificities enabled as
many diverse functions as possible given the small enzymatic repertory. As evolution
proceeded, specialization took place by gene duplication. For example, consider the
following: An enzyme E catalyzes a reaction where either one of the substrates S1
and S2 is accepted. Through gene duplication and random mutation two different
versions of E evolve; E’, which only accepts S1 as a substrate, and E” which only
accepts S2 as a substrate. Jensen could support his theory by the observation that
several contemporary enzymes have quite broad substrate specificities. As a secondary theory to patchwork evolution, Jensen suggested that pathways may evolve
en bloc, i.e. chains of reactions within pathways evolve together as a result of broad
substrate specificities, see Figure 5. There is a certain amount of support for this
theory since several reaction chains in different pathways are highly analogous (for
instance reaction chains in the TCA cycle and lysine biosynthesis).
The retrograde evolution model was at first supported by the discovery of operons, where the functionally related genes in operons were thought to have evolved
through tandem duplications (Horowitz, 1965). It was later established that the
genes present in the same operon often do not show significant sequence similarity.
Figure 5: Pathway evolution en bloc. a) E1, E2 and E3 are ancient enzymes with very broad
substrate specificities. S11-17, S21-27 and S31-37 are substrates accepted by E1, E2 and E3,
respectively. The initial function of selective advantage is the product S41 from the substrate chain
S11-S21-S31-S41, but mutations of the broad specificity enzymes lead to the substrate chain S17-S27-S37-S47, where S47 gives a significant selective advantage. b) This may favour the duplication
of E1, E2 and E3, which allows specialization of the enzymes to efficiently produce both S41 and S47.
Instead, some gene clusters can be explained by the efficient transfer of functional
modules between unrelated organisms through horizontal gene transfer (Lawrence
& Roth, 1996).
There are a few known homologous genes coding for enzymes which catalyze
consecutive reactions and which therefore represent possible cases of retrograde evolution: trpC, trpA and trpB, which catalyze consecutive steps of the tryptophan
biosynthesis (Wilmanns et al., 1991), hisF and hisA in the histidine biosynthetic
pathway (Fani et al., 1995) and metB and metC in the methionine biosynthesis
(Belfaiza et al., 1986). Interestingly, a mechanism analogous to retrograde evolution can be successfully applied to the evolution of vertebrate steroid receptors.
According to Thornton’s ligand exploitation process, the terminal ligand in the
biosynthetic pathway of steroids in vertebrates is the first one for which a receptor
evolved, namely an estrogen receptor (Thornton, 2001). Furthermore, there are a
few studies which give some support for the retrograde evolution model. A superfamily has a general tendency to appear in one or two particular pathway(s) (Saqi &
Sternberg, 2001) and, additionally, homologous enzymes are found at close distances
within the (extended) pathways of E. coli (Rison et al., 2002). Finally, homologous
enzymes are also found close to each other in the whole metabolic network (Alves
et al., 2002).
The patchwork evolution model holds that there should be many pairs of homologous enzymes catalyzing analogous reactions, where one or more substrates are
non-identical but similar. Support for this theory is more substantial than for the
retrograde evolution model (Copley & Bork, 2000; Teichmann et al., 2001; Tsoka
& Ouzounis, 2001; Petsko et al., 1993). The TIM-barrel containing enzymes have
been found in many different pathways (Copley & Bork, 2000) and the homologous
pairs of small molecule metabolism enzymes of E. coli have been shown to be evenly
distributed within and across pathways (Teichmann et al., 2001; Tsoka & Ouzounis, 2001).
In conclusion, recent studies have shown that there is little support for the retrograde evolution model while Jensen’s patchwork evolution model has substantially
more support (Rison et al., 2002; Alves et al., 2002). However, the importance of
these mechanisms, and others depending on gene duplication, in the evolution of
the metabolic network of E. coli may be smaller than the impact of horizontal gene
transfer (Pal et al., 2005). A recent study shows that the metabolic network of E.
coli evolves by incorporating horizontally transferred genes which are predominantly
involved in transportation (Pal et al., 2005).
Protein-protein interaction networks
Although enzymes sometimes physically interact, the metabolic network described
above consists of enzyme nodes connected to each other merely by compound edges.
The indirect interactions of metabolic enzymes are clearly different from the direct
physical interactions between proteins. Such interactions are important to most
biological processes, since many proteins need to interact with other proteins to
perform their functions properly. Hence, knowledge about the interactions between
proteins is crucial for understanding biological functions. Furthermore, the functions
of many proteins are unknown and identification of the physical interactions in which
these proteins participate may give a first indication of their function.
Context-based protein-protein interaction prediction
The budding yeast S. cerevisiae was the first eukaryote whose genome was sequenced
(Goffeau et al., 1996) and, therefore, the first systematic genome-wide studies of protein interactions have been conducted on S. cerevisiae. After the publication of the S.
cerevisiae genome sequence, several computational methods based on genomic context were developed for protein-protein interaction (PPI) prediction. For instance,
if two proteins are fused in at least one genome, but appear as separate genes in
another genome, there is an increased chance that the proteins interact, or are otherwise functionally coupled (Marcotte et al., 1999; Enright et al., 1999). Furthermore,
although gene order is generally poorly conserved in prokaryotic genomes, the genes
which do display a conserved gene order across prokaryotic species are likely to physically interact (Dandekar et al., 1998). Finally, phylogenetic profiles have been used
to identify which proteins often occur together, which is another strong indication
of proteins with a greater likelihood of interacting (Pellegrini et al., 1999).
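The phylogenetic profile idea can be illustrated with a small sketch: each protein is represented by its presence or absence across a set of genomes, and proteins with similar profiles are candidates for functional coupling. The genome labels, protein names and occurrence data below are hypothetical, and the Jaccard index is used here only as one simple similarity measure; it is not claimed to be the measure used by Pellegrini et al.

```python
def profile(presence, genomes):
    """Binary phylogenetic profile: 1 if the protein has a homolog in a genome."""
    return tuple(1 if g in presence else 0 for g in genomes)


def jaccard(p, q):
    """Jaccard similarity of two binary profiles (shared / total presences)."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


# Hypothetical occurrence data: genomes in which each protein has a homolog.
genomes = ["g1", "g2", "g3", "g4", "g5"]
occurrence = {
    "protA": {"g1", "g2", "g4"},
    "protB": {"g1", "g2", "g4"},
    "protC": {"g3", "g5"},
}
profiles = {name: profile(found, genomes) for name, found in occurrence.items()}
sim_ab = jaccard(profiles["protA"], profiles["protB"])
sim_ac = jaccard(profiles["protA"], profiles["protC"])
```

Here protA and protB always occur together and would be predicted as functionally coupled, whereas protA and protC never co-occur.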
Experimental determination of individual protein-protein interactions
The known metabolic reactions have, for the most part, been identified through
laborious studies of individual enzymes. Furthermore, the function of a newly sequenced gene may be inferred from its homology to a protein of identified function.
Similarly, the physical interactions between proteins had to be detected by characterization of individual interactions. Recently, however, high-throughput (HT)
studies have been developed, by which protein-protein interactions may be identified through genome wide studies. Consequently, the past few years have seen a
great increase in the number of known protein-protein interactions. The two most
important methods are affinity purification followed by mass spectrometry, a
common technique for identifying protein complexes (Gavin et al., 2002; Ho
et al., 2002), and the yeast two-hybrid method, which is used for identifying individual
protein-protein interactions (Uetz et al., 2000; Ito et al., 2001; Fields & Song, 1989).
The yeast two-hybrid method (Y2H) (Fields & Song, 1989) can be used to determine if two particular proteins interact. In this method, the two proteins of interest,
A and B, are fused to two different kinds of proteins. A (often referred to as the bait)
is fused to a DNA-binding protein, which binds to a specific stretch of DNA slightly
upstream of the reporter gene, a gene encoding a protein that reports the presence
of an interaction between A and B. B (often referred to as the prey) is fused to an
activating domain (AD), which activates the transcription of the reporter gene. The
reporter gene will not be transcribed unless AD is recruited to it, and AD is only recruited
if A interacts with B. Therefore, there is a signal, such as growth on
histidine-free medium, only when A and B interact.
The two-hybrid method is tractable for systematic large-scale studies and has
been used to study the entire proteome of S. cerevisiae (Ito et al., 2001; Uetz
et al., 2000), Caenorhabditis elegans (Li et al., 2004) and Drosophila melanogaster
(Giot et al., 2003). Furthermore, Y2H is a sensitive method capable of detecting
transient as well as stable interactions. However, the method suffers from a few
serious problems and many of the interactions detected by Y2H, possibly as many
as 50-90% (Mrowka et al., 2001; Sprinzak et al., 2003), are probably erroneous (false
positives). Further, the detection of the interactions takes place in the nucleus,
which is not the natural environment for a large portion of the interactions. While
this could cause many interactions to be missed, it may also cause
indirect associations to be misconstrued as direct interactions when, in fact, these may
be secondary interactions mediated through a third protein, or an RNA molecule,
connecting the two proteins under investigation. Moreover, a large number of
known interactions between proteins are missed with Y2H (false negatives),
such as the known interaction between the α and β electron transfer
flavoproteins (Aloy & Russell, 2002). The large number of false negatives is likely to
arise because some proteins only interact when certain activation signals have induced
conformational changes in one or both of the interacting proteins (Ito et al., 2001).
In addition, the unnatural setting of fused proteins within a compartment, the
nucleus, where most of the proteins do not naturally interact, is a likely cause for
the absence of known protein-protein interactions in the two-hybrid screens. Finally,
membrane protein interactions are unlikely to be detectable by Y2H.
Experimental determination of protein complexes
The affinity purification methods, on the other hand, do not identify individual interactions between proteins, but are used to determine which proteins appear in
complexes together. In tandem affinity purification (TAP) (Rigaut et al., 1999),
some proteins are selected as baits which are used to fish for the proteins that form
a complex with the bait. The baits are fused with two affinity tags, typically Staphylococcus aureus protein A (ProtA) and calmodulin-binding peptide (CBP) separated
by a TEV protease cleavage site. The tags are used to attach the bait to an affinity chromatography column in two tandem steps. First, the mixture of proteins is
applied to the IgG affinity column and the ProtA tag of each bait protein binds to the IgG
matrix together with its complex of proteins. Second, TEV protease is added to
cleave the tag and release the complex. Third, the bait and complex are added to the second
column, where the CBP tag on the bait binds to the calmodulin of the matrix. The
matrix is thoroughly washed, and finally the purified complex is eluted with EGTA.
Clearly, the rigorous purification steps in TAP should disfavour the detection of
transient interactions within complexes, and, consequently, most complexes found
are likely to be stable. Furthermore, the exact interactions between the proteins in
the complexes detected by TAP have not been determined. Some of the proteins in
the complexes interact directly with each other, but others are at the outskirts of
the protein complex and are not in each other's proximity, although they are likely
to be functionally related.
Affinity purification methods have been successfully applied to large scale studies
on the proteome of S. cerevisiae (Gavin et al., 2002; Ho et al., 2002). The overlap
between the interaction pairs found with affinity purification methods and Y2H is
small, partly because the methods complement each other (Aloy & Russell, 2002).
However, the lack of overlap between similar methods shows that both Y2H and
affinity purification methods suffer from shortcomings.
The need for validation of protein-protein interactions
Although high-throughput methods are comprehensive and less likely to be subject
to the biased expectations of individual researchers, unlike low-throughput methods,
there is a small overlap between the interaction pairs produced by different methods
(Uetz & Finley, 2005). For instance, the two independent Y2H screens of S. cerevisiae merely showed a 17-20% overlap (Ito et al., 2001). It is not yet clear if this
lack of overlap is caused by a limited coverage of the vast yeast interactome, or if it
is caused by false positives. However, among the high-throughput (HT) methods, TAP is
the most reliable according to a recent analysis (Cornell et al., 2004).
Furthermore, it is clear that the protein interactions detected by HT Y2H are different in nature from those detected by low-throughput Y2H experiments, collected
from the Munich Information Center for Protein Sequences (MIPS) (Mewes et al.,
2002). In fact, either the fraction of false positives for HT Y2H is at least 40%,
or the low-throughput experiments are clearly biased (Mrowka et al., 2001). Since
there is a small overlap between the interaction pairs produced by these methods,
there is a need for reliable validation measures.
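To make the notion of overlap between two screens concrete, the following minimal Python sketch computes the fraction of interaction pairs that two data sets share. The function names and the toy protein identifiers are illustrative assumptions and are not taken from the cited studies; interactions are treated as unordered pairs, so that bait and prey roles are ignored.

```python
def normalize(pairs):
    """Treat each interaction as an unordered pair, so (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in pairs}


def overlap_fraction(screen_a, screen_b):
    """Return the shared pairs as a fraction of each screen, in order."""
    a, b = normalize(screen_a), normalize(screen_b)
    shared = a & b
    return len(shared) / len(a), len(shared) / len(b)


# Hypothetical toy screens; the identifiers are placeholders.
screen_1 = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5"), ("P1", "P5")]
screen_2 = [("P2", "P1"), ("P5", "P6"), ("P6", "P7"), ("P3", "P4")]

frac_1, frac_2 = overlap_fraction(screen_1, screen_2)
print(frac_1, frac_2)
```

In this toy case two pairs are shared, which corresponds to 40% of the first screen and 50% of the second. The asymmetry illustrates why an overlap figure should always be reported together with the data set it is measured against.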
Generally, a protein-protein interaction that has been observed more than once
is more reliable, although false positives may be highly reproducible (Fields, 2005).
Furthermore, computational methods are frequently utilized to indicate the reliability of protein-protein interactions. For instance, functional annotations and
subcellular localizations may provide an indication of the reliability of a particular
protein-protein interaction since proteins which are predicted to be active in the same
subcellular location and have related functions are more likely to interact (Sprinzak
et al., 2003). In addition, expression patterns for proteins in the same complex are
expected to be correlated, otherwise their expression would be unnecessarily energetically costly (Jansen et al., 2002), although less so for interactions of a transient
nature. Consistently, interacting proteins are often co-expressed (Grigoriev, 2001;
Jansen et al., 2002) and, in combination, expression analysis and protein-protein
interaction data can be used to initially characterize proteins of unknown function
(Kemmeren et al., 2002). Furthermore, structural information (Edwards et al., 2002;
Aloy & Russell, 2002) and functional annotation (Marcotte et al., 1999) can be used
to validate protein-protein interactions.
Interactions which have been identified in low-throughput experiments are, as a
rule, considered more reliable. Nevertheless, despite the problems associated with
HT protein-protein datasets, their expedience warrants a continued effort to develop
improvements. Furthermore, in the unlikely event that there will be no significant
advances in HT PPI studies in the coming years, the methods available today can still
present candidates for protein interactions (Fields, 2005). However, recent technical
improvements in pooling strategies indicate that the accuracy of HT Y2H could be
significantly increased, while the number of screens is simultaneously decreased (Jin
et al., 2006).
Toward the human protein-protein interaction network
Hopefully, many of the efforts concerning the yeast protein-protein interaction network may be transferred to Metazoa, and ultimately to the human protein-protein
interaction network (PPIN). The computational methods above can be applied to
newly sequenced genomes, and preliminary interaction networks for these organisms
can thereby be inferred (Date & Marcotte, 2003; von Mering et al., 2003). For instance, there are small conserved subnetworks which are shared between the PPINs
of organisms as diverse as Helicobacter pylori and S. cerevisiae (Kelley et al., 2003),
and some S. cerevisiae interactions were successfully modeled to the C. elegans interaction network (Li et al., 2004). Although such computational methods, based
on genomic context or homology, or both, are likely to contribute to the understanding of the PPINs of Metazoa, there is clearly a need to perform high-throughput
experiments on the higher organisms.
Current estimates of the number of protein-protein interactions in yeast indicate
that there are 10,000-30,000 interacting pairs while the corresponding estimate for
the human interactome is 40,000-200,000, excluding considerations for splice variants
or post-translational modification. Although the interactomes of the nematode C.
elegans (Li et al., 2004) and the fruit fly D. melanogaster (Giot et al., 2003) have
been studied by HT techniques, there are distinct networks in each organism that are
not expected to be transferable to vertebrates, and there are probably vertebrate
specific subnetworks. Presently, the human interactome is being studied through
revised yeast two-hybrid methods, where the expression levels of the fusion proteins
are low and auto-activators are eliminated, through usage of three different Gal4-inducible promoters (Rual et al., 2005) and further results are expected soon from
high-throughput Y2H screens and mass spectrometry experiments (Warner et al.,
The evolution of protein-protein interactions
The repertory of proteins is partially conserved between relatively closely related
organisms. Can the same be said of protein-protein interactions? Indeed, it seems
that some groups of proteins, which belong to the same pathway, are all present,
or all absent, from a genome (Pellegrini et al., 1999). Furthermore, many of the
interactions present in yeast appear to also be present in C. elegans, although the
PPIN of the eukaryotic intracellular parasite Plasmodium falciparum shows little
similarity with the other eukaryotes (Suthram et al., 2005). Clearly, an understanding of how the protein interactions evolve is likely to elucidate the evolution of new
Proteins which interact with each other in stable complexes evolve at similar
rates (Fraser et al., 2002). Indeed, residues in interfaces of stable complexes evolve
at a slow rate, and appear to co-evolve, i.e. a substitution in one protein leads to
a complementary alteration in its neighbor, in order to preserve the functionality
of the interaction (Mintseris & Weng, 2005). For instance, a number of ligand-receptor pairs, where the interactions between the ligands and their receptors are
often obligate, have been shown to co-evolve (Goh & Cohen, 2002). In contrast,
proteins which interact transiently show little evidence of co-evolution (Mintseris
& Weng, 2005). However, although there is little evidence for co-evolution at the
amino acid level between transiently interacting proteins, there is co-evolution of
expression levels of interacting proteins (Fraser et al., 2004).
We may surmise that the level of co-evolution of two interacting proteins varies
according to the stability of the interaction. Similarly, the evolutionary history of
proteins in complexes may be somewhat different from that of proteins involved
in transient interactions. Interestingly, a recent study (Pereira-Leal & Teichmann,
2005) indicates that complete or partial module duplication, i.e. duplication of
protein complexes, has occurred on several occasions during the evolution of S.
cerevisiae. For transient interaction pairs, it does not seem that the proteins evolve
in a similar manner. In fact, after gene duplication events, interactions between
proteins are often rapidly lost in an asymmetric manner, where one of the genes
loses most of the original interactions (Wagner, 2001). The asymmetry is prominent
in hubs, where divergence results in loss of sub-functions in the duplicated gene,
whereas duplicates of proteins with few interaction partners are more likely to gain
new functions (Zhang et al., 2005).
On the structure of biological networks
Metabolic networks and other complex networks such as the film actor collaboration network, the world wide web, protein domain networks and protein-protein
interaction networks are small-world networks with some properties which are consistent with scale-free networks (Jeong et al., 2000; Wagner & Fell, 2001; Wuchty,
2001; Apic et al., 2001a). The small-worldliness of the metabolic network of E. coli
has recently been contested for an alternative network representation where carbon
atomic traces in metabolic reactions were used (Arita, 2004). Furthermore, since
the coverage of the real PPIN is low, it has been questioned whether the topology
of the PPIN can currently be correctly identified (Han et al., 2005).
A small-world network is characterized by 1) short path lengths between any
two nodes in the network and 2) a high clustering coefficient, which means that
the neighbors of a certain node of the network are often connected to each other
thereby forming clusters, see Figure 2. A scale-free network, in this context, has
a power-law connectivity (degree, k) distribution, i.e. there are many nodes that
have very low connectivities and a handful of nodes with much higher connectivities
(hubs), see Figure 6. Scale-free networks are robust networks in the sense that they
often remain intact when a large fraction of randomly chosen nodes is eliminated
from the network (Albert et al., 2000). However, if a small fraction of the hubs of
the network is eliminated the network is likely to become fragmented into several
What is the advantage of scale-freeness?
There are at least two possible reasons for why scale-free topologies are observed
in many biological networks. First, the scale-free topology may give the organism
a selective advantage over organisms with other network topologies. It has been
suggested that the scale-free character of biological networks has evolved through
natural selection for the advantage of robustness and error-tolerance that the scale-free network topology confers to the organism (Jeong et al., 2001). Second, the
scale-free topology could be a side effect of an evolutionary process.
Figure 6: Scale-free networks. a) A schematic scale-free network. The hub of the network (striped)
has the highest connectivity (k=8) while most of the nodes have quite low connectivities. The
longest path (highlighted) between any two nodes in the network is 4 steps. b) A schematic
random network. This network has the same number of nodes and edges as the network in figure
a, but the connectivity distribution is different. The highest connectivity in the network (k=3) is
similar to the average connectivity (k=1.9), and the longest path is rather long (9 steps).
Two possible selective advantages that the scale-free network topology may confer to the organism have been brought forward. First, the scale-free topology of the
networks makes them robust against random mutations since most nodes in the network are only connected to one or two other nodes and the deletion of such a node
will not significantly affect the structure of the network. However, the integrity of
the scale-free network is susceptible to attacks on the hubs. In accordance with this
notion, proteins with high connectivities in the protein-protein interaction network
of S. cerevisiae are more likely than other proteins to be essential proteins (Jeong
et al., 2001) and, accordingly, the average connectivity of essential proteins is twice
as high as that of non-essential proteins (Bader & Hogue, 2002). Recently, however,
a substantial correlation between connectivity and essentiality could not be found
in the PPIN of either yeast or human (Gandhi et al., 2006; Coulomb et al., 2005).
It is possible that the discrepancy is due to disparate choices of protein-protein interaction databases. Second, scale-free networks have short average path lengths,
i.e. there usually exists a short path between any two nodes in the network, which
could enable the network to shift quickly from one state to another in response to
environmental changes (Jeong et al., 2000).
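The contrast between random failure and targeted attack on hubs can be illustrated with a small sketch. The toy graph, the node names and the helper functions below are assumptions made for illustration only; the size of the largest connected component is used here as a simple proxy for network integrity.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def largest_component(adj, removed=frozenset()):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen, best = set(removed), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


# Hypothetical toy network: H is a hub, the remaining nodes have low degree.
adj = build_adjacency([("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"),
                       ("A", "E"), ("B", "F")])

print(largest_component(adj))         # intact network
print(largest_component(adj, {"E"}))  # a peripheral node is deleted
print(largest_component(adj, {"H"}))  # the hub is deleted
```

Deleting the peripheral node E leaves six of the seven nodes connected, whereas deleting the hub H fragments the network so that the largest remaining component contains only two nodes.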
How do biological scale-free networks evolve?
Networks with scale-free properties have been shown to evolve when two simple rules
are applied: 1) The network grows by the addition of new nodes. 2) Preferential
attachment: New nodes are more likely to become connected to well connected
nodes in the network than to low connectivity nodes (Barabasi & Albert, 1999).
While preferential attachment is often at the root of scale-freeness, a network with a
power-law degree distribution might be produced through other mechanisms as well.
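The two rules can be sketched in a few lines of Python. This is a simplified illustration and not the exact model of the cited work: it assumes that every new node attaches with a single edge, and the function names and seed are arbitrary choices.

```python
import random
from collections import Counter


def grow_network(n_nodes, seed=0):
    """Grow a network in which each new node attaches to one existing node,
    chosen with probability proportional to its current degree."""
    rng = random.Random(seed)
    edges = [(1, 0)]  # (new node, target); start from a single edge
    # Each node appears in this list once per edge it has, so a uniform draw
    # from the list selects a node in proportion to its degree.
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges


def degree_counts(edges):
    """Count the number of edges attached to every node."""
    return Counter(node for edge in edges for node in edge)


edges = grow_network(200)
degree = degree_counts(edges)
print(degree.most_common(5))  # the best connected nodes in this run
```

Because early nodes have had more opportunities to accumulate edges, the best connected nodes in such a simulation tend to be the oldest ones.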
Preferential attachment in the context of genetic networks may take place partly
through gene duplication (Qian et al., 2001; Barabasi & Oltvai, 2004). In agreement
with preferential attachment, proteins which have homologs in all 3 domains of life,
which are likely to be of ancient origin, have high connectivities in the protein-protein interaction network of S. cerevisiae (Eisenberg & Levanon, 2003; Saeed &
Deane, 2006). In contrast, Kunin et al (Kunin et al., 2004) showed that the most
highly connected proteins date to after the evolution of primordial eukaryotes but
before the radiation of eukaryotes to Plants, Metazoa and Protista.
The relationship between protein-protein interaction network topology and evolution
Even if the exact nature of the degree-distribution of the PPIN has not yet been
firmly determined, and it is likely to vary depending on the definition of an interaction, it is clear that there are some highly connected proteins that are characterized
by certain properties. Hub proteins could be particularly interesting drug targets, for
instance in cancer therapy (Apic et al., 2005), where highly expressed hub proteins
in diseased tissues may be targeted.
The hubs of the PPIN of S. cerevisiae have been shown to evolve slowly, which
may be because larger portions of the lengths of these proteins are directly involved
in their interactions (Hirsh & Fraser, 2001; Fraser et al., 2002). Other studies
indicate that the proposed negative correlation between evolutionary rate and connectivity is only due to a small fraction of proteins with high numbers of interactions
which evolve slower than most proteins in the yeast network (Jordan et al., 2003).
The difference between some of these studies seems to be due to the nature of the
data sets. When complexes identified with mass spectrometry based methods are
included in the analysis the correlation between connectivity and evolutionary rate
is clear (Fraser et al., 2003).
Based on expression profiles, two different hub types in the PPIN of S. cerevisiae may be identified; static hubs (party hubs) and dynamic hubs (date hubs)
(Han et al., 2004). The party hubs reside in static complexes where they interact
with most of their partners at the same time, while date hubs bind their interac18
tion partners at different times and/or locations. Party hubs are probably central
parts of functional modules/complexes while date hubs act as the mediators between these semi-autonomous modules, see Figure 7. Thus, date hubs appear to be
more important than party hubs for the topology of the network (Han et al., 2004).
Interestingly, they are often younger than party hubs (Fraser, 2005), indicating that
there is a greater variability between species with regard to these connectors. Further, while there is no substantial difference between the proportion of essential
proteins among the party and date hubs, perturbation of the date hubs leads to
susceptibility of the genome to further perturbations (Han et al., 2004).
Estimates show that a quarter of the gene deletions in S. cerevisiae which are not
associated with a changed phenotype, are compensated by gene duplicates (Gu et al.,
2003). On the other hand, a large number of genes without duplicates have been
shown to have no detectable effect on the phenotype upon deletion (Wagner, 2000),
thus indicating that the lack of redundancy does not seem to result in susceptibility.
However, if these singletons are young proteins at the peripheries of the network,
perhaps on average less important than other proteins, it may not be surprising that
their deletion does not lead to detrimental effects. Since hub proteins are pivotal for
the robustness of the PPIN, it is conceivable that duplicates, or partial duplicates,
of the hubs, rather than other proteins, may be particularly beneficial. However,
such hub duplicates may be rare since gene duplications often cause an imbalance
in the concentration of the components of protein-protein complexes which might
be deleterious (Veitia, 2002; Papp et al., 2003).
Is scale-freeness of biological importance?
There are several reasons for believing that scale-freeness is of little biological relevance. First, the observed topology of the protein-protein interaction network can
evolve, without selective pressure on network topology, through gene duplication and
deletion of links between proteins (Wagner, 2003; Amoutzias et al., 2004; van Noort
et al., 2004). Second, since the metabolic networks studied in Jeong et al’s analysis
(Jeong et al., 2000) include ubiquitous compounds such as water, NADH and ATP,
it is dubious whether the similar average path lengths of various organisms means
anything else than that water, ATP and NADH are ubiquitous and therefore dominate the structure of the compound network in all these organisms. Indeed when
the most ubiquitous compounds are removed from the network different organisms
have quite different average path lengths (Ma & Zeng, 2003).
Finally, and perhaps most importantly, it is possible that all large chemical networks have scale-free topologies. To investigate this possibility Wagner (Wagner,
2004) used planetary reaction chemistry data compiled in another study (Gleiss
et al., 2001). This data includes the chemical reactions found in the atmospheres
of Earth, Venus and Jupiter. The resulting networks also have scale-free topologies, indicating that natural selection is not necessary to explain the development
of scale-free topologies. It is therefore highly arguable whether natural selection
molds the global structure of cellular networks because of selective benefits of the
scale-free topology. Instead, the scale-free distribution of compounds in the planetary atmosphere may lead to a scale-free distribution of connectivity in metabolic
networks, where the compounds are used in the biological system in proportion to
their prevalence.
Figure 7: Party hubs and date hubs. a) The nodes represent the subset of eukaryotic specific
proteins of S. cerevisiae and the edges are physical interactions as determined by the core dataset
from DIP (Salwinski et al., 2004). Party hubs (green nodes) are highly connected proteins (hubs)
which are co-expressed with their neighbors, while date hubs (yellow nodes) are hubs whose neighbors are expressed at different times. The grey nodes represent the proteins that are neither party
hubs nor date hubs. b) The date hub subnetwork from a is characterized by a high degree of
interconnectivity. c) The party hub subnetwork shows compartmentalization. These images were
drawn with BioLayout (Goldovsky et al., 2005).
Techniques and databases
Here, some of the techniques and databases from Papers I-IV are described. First,
the representation frameworks for the metabolic and protein-protein interaction networks are discussed. Second, the methods used for estimating protein age and domain assignments are explained.
Representation of the metabolic network
Metabolic pathway information is available in several different databases such as
EcoCyc (Karp et al., 2002), WIT (Overbeek et al., 2000), BRENDA (Schomburg
et al., 2002) and KEGG (Kanehisa et al., 2002). EcoCyc is an E. coli K-12 MG1655
specific database and contains the metabolic complement of E. coli. EcoCyc has
an advantage over the two other databases in that it contains information about
the directionality of the enzymatic reactions. Furthermore, some enzymes have
not yet been classified according to the Enzyme Commission (EC) system (NC-IUBMB, 1992). Whereas, for instance, KEGG does not contain clear documentation
of unclassified enzymatic reactions, EcoCyc contains both EC-classified enzymes and
enzymes that fall outside of the EC classification.
The classic representation of metabolic reactions depicts enzymes as edges, while
compounds represent the nodes, see Figure 3. In recent years, however, the reverse
representation, often referred to as ’protein-centric’ graphs (Gerrard et al., 2001) or
’reaction graphs’ (Wagner & Fell, 2001), of the network has become more common
since it is particularly well suited for comparative studies of the enzymes relative
to their connections and positions in the network. The work presented herein relies
on the protein-centric network representation. An edge represents one or more
compounds. There will be an edge leading from an enzyme E1 to an enzyme E2
if E1 catalyzes a reaction where compound A is produced and E2 takes A as a
substrate. Reversible reactions, such as A ↔ B, are separated into two reactions, A
→ B and B → A. There can be at most one edge in each direction between a pair
of enzymes, see Figure 8. In addition, some reactions are physiologically irreversible
and should be represented accordingly. Therefore, we used a directed graph to
represent the metabolic network of E. coli. As a result, although the network is a
fully connected graph, there are not necessarily paths from every enzyme node in
the network to every other enzyme.
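The construction of the protein-centric graph described above can be sketched as follows. The input format, the function name and the toy reactions are assumptions made for illustration, and the exclusion of edges from an enzyme to itself is a simplification of this sketch rather than a rule stated in the text.

```python
def enzyme_graph(reactions):
    """Build directed enzyme-to-enzyme edges labelled with shared compounds.

    `reactions` is a list of (enzyme, substrates, products, reversible).
    """
    directed = []
    for enzyme, substrates, products, reversible in reactions:
        directed.append((enzyme, set(substrates), set(products)))
        if reversible:
            # A reversible reaction is split into two directed reactions.
            directed.append((enzyme, set(products), set(substrates)))

    edges = {}
    for e1, _, produced in directed:
        for e2, consumed, _ in directed:
            shared = produced & consumed
            if shared and e1 != e2:  # assumption: self-edges are skipped
                # At most one edge per direction; it may carry several compounds.
                edges.setdefault((e1, e2), set()).update(shared)
    return edges


# Hypothetical toy pathway: E1 and E2 are reversible, E3 is irreversible.
reactions = [
    ("E1", ["A"], ["B"], True),
    ("E2", ["B"], ["C"], True),
    ("E3", ["C"], ["D"], False),
]
edges = enzyme_graph(reactions)
for (source, target), compounds in sorted(edges.items()):
    print(source, "->", target, sorted(compounds))
```

In this toy pathway E1 and E2 are connected in both directions through compound B, while E3 can be reached from E2 but has no outgoing edge, so there is no path from E3 to the other enzymes.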
Figure 8: Protein-centric representation of the biotin metabolism. In the protein-centric representation enzymes are nodes and reactants are edges. EC is a neighbor of EC
because there is an edge leading from EC to EC. The edges representing CO2,
between EC and EC, are eliminated when the 20 most promiscuous compounds
are removed from the network.
One of the most problematic aspects of metabolic network analysis is how
the promiscuous compounds should be dealt with. It could be argued that since
promiscuous compounds are usually not the limiting factors of reactions, the network
would become more biochemically meaningful if these compounds were removed (Ma
& Zeng, 2003). However, to our knowledge there is no generally accepted criterion
to determine the biological relevance of a compound in a reaction. Therefore, in
the studies presented here, a simple network-based criterion was applied. We count
the number of times a compound occurs as part of an edge in the network. The
most common compounds are then considered as promiscuous compounds and are
removed from the analysis (Alves et al., 2002).
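A minimal sketch of this network-based criterion is given below. It assumes that the network is stored as a mapping from enzyme pairs to the set of compounds carried by each edge; the function name, the data layout and the toy compounds are illustrative assumptions, and the number of compounds to remove is left as a parameter.

```python
from collections import Counter


def remove_promiscuous(edges, n_remove):
    """Drop the `n_remove` compounds that occur on the largest number of edges.

    Edges that carried only promiscuous compounds disappear from the network.
    """
    counts = Counter(c for compounds in edges.values() for c in compounds)
    promiscuous = {compound for compound, _ in counts.most_common(n_remove)}
    pruned = {}
    for pair, compounds in edges.items():
        kept = compounds - promiscuous
        if kept:
            pruned[pair] = kept
    return pruned, promiscuous


# Hypothetical toy network in which ATP occurs on three of four edges.
toy_edges = {
    ("E1", "E2"): {"ATP", "B"},
    ("E2", "E3"): {"ATP"},
    ("E3", "E4"): {"ATP", "C"},
    ("E1", "E4"): {"CO2"},
}
pruned, promiscuous = remove_promiscuous(toy_edges, n_remove=1)
print(promiscuous, pruned)
```

Here ATP is identified as the most common compound, and the edge between E2 and E3, which existed only because of ATP, is eliminated.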
Representation of the protein-protein interaction network
Protein-protein interaction networks are often represented with proteins as nodes
and the physical interactions between the proteins as edges. While most studies of
PPINs utilize this representation, the choice of database is less obvious. It
is intuitively appealing to build a PPIN representation from various resources, i.e.
low-throughput experiments, affinity purification methods and two-hybrid methods.
However, therein lies a risk of diluting the meaning of “physical interaction”. For
instance, it is often unclear which proteins interact with each other within a large
protein complex. Most investigations have made use of either the ’spoke’ model,
where the bait protein interacts with all the other proteins in the complex but there
are no interactions between the preys, or the ’matrix’ model, where all proteins in a
complex are assumed to interact with each other (Bader & Hogue, 2002). Of course,
these two models are diametrically opposed, and neither one of them is likely to
be representative of the real relationships between proteins in complexes, although
estimates show that the spoke model is threefold more accurate than the matrix
model (Bader & Hogue, 2002).
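The difference between the two models is easily seen when a purified complex is expanded into pairwise interactions. The sketch below uses hypothetical protein names and function names chosen for illustration.

```python
from itertools import combinations


def spoke(bait, preys):
    """Spoke model: the bait interacts with every prey, preys do not interact."""
    return {frozenset((bait, prey)) for prey in preys}


def matrix(bait, preys):
    """Matrix model: every protein in the complex interacts with every other."""
    return {frozenset(pair) for pair in combinations([bait, *preys], 2)}


# Hypothetical complex with one bait and three preys.
spoke_pairs = spoke("Bait", ["P1", "P2", "P3"])
matrix_pairs = matrix("Bait", ["P1", "P2", "P3"])
print(len(spoke_pairs), len(matrix_pairs))
```

For a complex of n proteins the spoke model yields n-1 pairs while the matrix model yields n(n-1)/2, here 3 and 6 respectively, and every spoke pair is also a matrix pair.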
In the study presented here the representation of the PPIN was built using
the ’core’ data set from the database of interacting proteins (DIP) (Deane et al.,
2002; Salwinski et al., 2004) downloaded in March 2005. DIP contains low- and high-throughput experimentally verified interactions, including two-hybrid and TAP
studies. The core data set consists of the interactions in DIP which are deemed reliable by computational methods. First, the expression profile reliability index
is derived by comparing the expression correlations of interacting pairs in DIP to those of known
interacting and non-interacting pairs. Second, the DIP data set is verified through
paralogous verification, whereby an interaction pair is considered more reliable if it
has mutually interacting paralogs within the yeast interactome.
The connectivity of a protein node is defined as the number of proteins it is connected to, see Figure 2. The proteins were grouped according to their connectivities
in the core interaction network. Hubs are defined as proteins with eight or more
interactions while proteins with fewer than four interactions are labeled non-hubs and
the rest are intermediately connected. As previously mentioned, the physical interactions of proteins vary a great deal, which is especially obvious when the hub
proteins are considered, since some of the hubs are at the cores of stable protein
complexes, whereas others are involved in quite varied interactions. In fact, hubs
may be classified into party hubs (PH), believed to be at the cores of modules, and
date hubs (DH), believed to be the connectors between these modules (Han et al.,
2004). Co-expression profiles from five different conditions (stress response (Gasch
& Werner-Washburne, 2002), cell cycle (Spellman et al., 1998), pheromone treatment (Roberts et al., 2000), sporulation (Chu et al., 1998) and unfolded protein
responses (Travers et al., 2000)) can be used to determine the average Pearson correlation coefficient (PCC) between each hub and its interaction partners. Party hubs
are defined as proteins that show high average PCC with their interaction partners.
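The classification can be sketched as follows. The connectivity thresholds are those given above, whereas the PCC cutoff of 0.5, the function names and the toy expression profiles are placeholders assumed for illustration, since no cutoff is stated in the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def connectivity_class(degree):
    """Hubs have eight or more interactions, non-hubs fewer than four."""
    if degree >= 8:
        return "hub"
    if degree < 4:
        return "non-hub"
    return "intermediate"


def hub_type(hub, partners, expression, pcc_cutoff=0.5):
    """Label a hub by its average PCC with its partners.

    The cutoff value is a placeholder assumption of this sketch.
    """
    average = sum(pearson(expression[hub], expression[p]) for p in partners)
    average /= len(partners)
    return "party" if average >= pcc_cutoff else "date"


# Hypothetical expression profiles over four conditions.
expression = {
    "H1": [1, 2, 3, 4], "A": [2, 4, 6, 8], "B": [1, 2, 3, 4],
    "H2": [1, 2, 3, 4], "C": [4, 3, 2, 1], "D": [1, 2, 3, 4],
}
print(hub_type("H1", ["A", "B"], expression))
print(hub_type("H2", ["C", "D"], expression))
```

In the toy data H1 is perfectly co-expressed with both partners and is labeled a party hub, while H2 has one correlated and one anti-correlated partner, giving an average PCC near zero and the date hub label.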
Domain assignment and protein age
Protein domains are the structural, functional and evolutionary units which together, or one by one, form proteins. For the purposes of Papers III and IV the
domains from Pfam-A (Sonnhammer et al., 1997) were assigned by HMMER-2.0
to various genomes. We determined the number of domains in proteins by adding
the number of Pfam-A, Pfam-B domains and unassigned regions longer than 100
residues (orphan domains). The proteins that only consist of orphan domains are
referred to as orphans (Rost, 2002). Orphan proteins have no homologs, or very
few homologs which are only present in closely related species, and do not contain
any known domains.
In Paper IV, protein age was estimated based on the domain contents. The
domains were then grouped into those present in i) eukaryotes and prokaryotes, ii)
eukaryotes only, iii) S. cerevisiae and/or Schizosaccharomyces pombe (yeast) and
iv) no other species (orphans). In order to avoid bias toward repeating domains,
which are common in eukaryotes, each domain family represented in the protein
contributed equally to the age class of the protein, i.e. repeating domains were only
counted once. For example, a six domain protein consisting of four ancient domains
A, one orphan domain B and one eukaryotic domain C is 1/3 ancient, 1/3 orphan
and 1/3 eukaryotic. As an alternative estimate of protein age, the S. cerevisiae
proteins were assigned to orthologous groups using clusters of orthologous groups
(COG) (Tatusov et al., 2000) for prokaryotic orthologs and KOG (Tatusov et al.,
2003) for eukaryotic orthologs.
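The domain-based age estimate can be written as a short function that reproduces the six-domain example above. The function name, the class labels and the mapping from domain family to age class are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction


def age_profile(domains, age_of):
    """Fraction of a protein assigned to each age class.

    Each domain family counts once, so repeated domains do not bias the result.
    """
    families = set(domains)
    counts = Counter(age_of[family] for family in families)
    return {age: Fraction(n, len(families)) for age, n in counts.items()}


# The worked example from the text: four copies of A, one B and one C.
age_of = {"A": "ancient", "B": "orphan", "C": "eukaryotic"}
profile = age_profile(["A", "A", "A", "A", "B", "C"], age_of)
print(profile)
```

The four copies of domain A collapse into a single family, so that the protein is one third ancient, one third orphan and one third eukaryotic.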
Furthermore, in Paper IV we sought to classify paralogous pairs according to
the approximate time of the duplication. For this purpose, KOG, complemented by
two-species groups (TWOGs) and lineage-specific extensions (LSEs), was used to
find inparalogs (Remm et al., 2001), see Figure 1. Inparalogs are defined here
as paralogous proteins that appear to have originated since the split between S.
cerevisiae and S. pombe, believed to have occurred roughly 1,100 Myr ago (Heckman
et al., 2001). Among the inparalogs, there is a subset of paralogous pairs that
originate from the whole genome duplication (WGD), which is believed to have taken
place in S. cerevisiae about 100 Myr ago. These paralogs, ohnologs, were identified
by the Yeast Gene Order Browser (Byrne & Wolfe, 2005). Although the inparalog
dataset probably includes paralogs which arose quite recently, it is reasonable to
assume that the ohnolog subset represents some of the youngest inparalogs. As an
alternative measure, paralogs were defined as sequences with an e-value below 10⁻⁵
and an alignment covering more than 40% of the longest of the compared sequences.
In addition, proteins of at least two domains and identical Pfam-A or Pfam-A+B
domain architectures were also considered paralogous if their lengths did not differ
by more than 30% of the longer sequence.
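The two alternative paralogy criteria can be expressed as simple predicates. The function and parameter names are assumptions of this sketch, and domain architectures are represented as ordered tuples of hypothetical domain identifiers.

```python
def is_paralog_by_alignment(e_value, aligned_length, len_a, len_b):
    """E-value below 1e-5 and alignment covering more than 40% of the longer sequence."""
    return e_value < 1e-5 and aligned_length > 0.4 * max(len_a, len_b)


def is_paralog_by_architecture(arch_a, arch_b, len_a, len_b):
    """At least two domains, identical architectures and lengths that differ
    by no more than 30% of the longer sequence."""
    longer = max(len_a, len_b)
    return (len(arch_a) >= 2
            and arch_a == arch_b
            and abs(len_a - len_b) <= 0.3 * longer)


# Hypothetical pairs of sequences of 500 and 400 residues.
print(is_paralog_by_alignment(1e-10, 250, 500, 400))
print(is_paralog_by_architecture(("D1", "D2"), ("D1", "D2"), 500, 400))
```

Both toy pairs satisfy their criterion: the alignment of 250 residues exceeds 40% of 500 residues, and the length difference of 100 residues is within 30% of the longer sequence.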
Statistical tests
For Papers I, II and IV randomized networks were generated in order to estimate the
statistical significance of the results. This was done through shuffling the distribution
of the property under investigation while preserving the network topology, see Figure
9. Subsequently, Z-scores were calculated. For example, consider Paper IV, where we
sought to determine the significance of the distribution of inparalogs between party
hubs and other proteins. The connectivities of the proteins were randomized, but
the topology preserved, and the number of inparalogs in the different connectivity
classes (party hubs, date hubs, intermediately connected, non-hubs) were calculated.
In this case, the randomization process was iterated 10,000 times and the average
and standard deviation from the randomizations were used to calculate the Z-score
$Z = (\bar{n} - \bar{r})/\mathrm{stdev}(r)$, where $n$ is the real number of inparalogs in a connectivity class
$k$ and $r$ is the corresponding number in a randomized network. The Z-score then expresses how
far the average number of inparalogs belonging to a certain connectivity group differs
from the average number of inparalogs of randomly sampled proteins, measured in
units of the random sampling distribution’s standard deviation. The larger the Z-score, the less likely it is that the difference between the real and random networks is by
chance.
Figure 9: Preservation of network topology. a) Original network. The figure shows the nodes and edges in the original
example graph. b) Shuffled network. The figure shows the nodes and edges for the randomized graph where the node
identities (A, B, C, D, E, F, G, H) have been shuffled. The topological properties of the original
graph are preserved in the randomized graph.
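The randomization test described in this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in the papers: each protein is assumed to carry a connectivity class label and a flag saying whether it is an inparalog, and the class labels are shuffled among the proteins while everything else is kept fixed. The function names, the `seed` parameter and the use of the sample standard deviation are choices made here.

```python
import random
from statistics import mean, stdev


def count_in_class(labels, is_inparalog, target_class):
    """Number of inparalogs whose connectivity class equals target_class."""
    return sum(1 for label, flag in zip(labels, is_inparalog)
               if flag and label == target_class)


def shuffled_z_score(labels, is_inparalog, target_class, n_iter=10_000, seed=0):
    """Z-score of the observed count against counts from shuffled class labels."""
    rng = random.Random(seed)
    observed = count_in_class(labels, is_inparalog, target_class)
    shuffled = list(labels)
    random_counts = []
    for _ in range(n_iter):
        rng.shuffle(shuffled)  # reassign the class labels, keep the proteins fixed
        random_counts.append(count_in_class(shuffled, is_inparalog, target_class))
    spread = stdev(random_counts)
    if spread == 0:
        raise ValueError("randomized counts do not vary; the Z-score is undefined")
    return (observed - mean(random_counts)) / spread
```

In this sketch the observed count is a single number, so the numerator is the observed count minus the mean of the randomized counts.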
The present investigations
The investigations in this thesis revolve around the evolution of biological networks.
The initial two papers concern the well studied metabolic networks, first with regard
to two long standing competing explanations for the evolution of metabolism (Paper
I) and, second, from a network topology perspective (Paper II). Furthermore, Paper
III describes a study estimating the proportion of different domain rearrangements
in proteins. Finally, Paper IV concerns the characteristics of the proteins with many
interaction partners in the protein-protein interaction network of S. cerevisiae, with
regard to their domain compositions, interaction partners and evolutionary history.
Retrograde vs Patchwork evolution (Paper I)
The two most common models for the evolution of metabolism are patchwork evolution (Jensen, 1976), where enzymes are thought to diverge from broad to narrow
substrate specificity (see Figure 5), and retrograde evolution (Horowitz, 1945), according to which enzymes evolve backwards, compared to the direction of the pathway, in response to substrate depletion, see Figure 4. Analysis of the distribution of
homologous enzyme pairs in the metabolic network can shed light on the respective
importance of the two models. In this study, we investigate the evolution of the
metabolism in E. coli K-12 MG1655 viewed as a single network using EcoCyc.
The first step in this investigation was to determine if there is an overrepresentation of homologous enzymes at short distances in the network. To accomplish this
task, sequence comparison between all enzyme pairs was performed and the minimal
path length (MPL) between all enzyme pairs was determined. In the real network
58 % of all homologous pairs are found at MPL=1, compared to 18 % in the randomized networks. However, in this network all compounds, including ATP, NADH
and NADPH (promiscuous compounds), were included. Although this network is
completely devoid of biochemical bias (or knowledge) about the ’importance’ of the
various reactants, one could certainly argue that the short distances in the network
between all enzymes that catalyze reactions involving ATP are not of substantial
biological relevance. Furthermore, from a purely methodological perspective, the
inclusion of promiscuous compounds confounds the study because it makes the network very dense. To remedy such
complications we removed the 20 most promiscuous compounds and all cofactors,
as defined by EcoCyc. However, the overrepresentation of homologous enzymes at
MPL=1 remained after these revisions, and is therefore unlikely to be the result of
common cofactor-binding domains alone.
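The MPL analysis can be illustrated with a small Python sketch. It is not the procedure of Paper I: it assumes, for simplicity, an undirected enzyme graph in which two enzymes are linked whenever they share a compound, and the function names and the toy compound names in the usage note are hypothetical. Excluding promiscuous compounds then amounts to ignoring them when the graph is built.

```python
from collections import deque
from itertools import combinations


def enzyme_graph(reactions, excluded=frozenset()):
    """Link two enzymes if they share at least one compound that is not excluded."""
    by_compound = {}
    for enzyme, compounds in reactions.items():
        for compound in set(compounds) - set(excluded):
            by_compound.setdefault(compound, set()).add(enzyme)
    adjacency = {enzyme: set() for enzyme in reactions}
    for enzymes in by_compound.values():
        for a, b in combinations(sorted(enzymes), 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def minimal_path_lengths(adjacency, source):
    """Breadth-first search: minimal path length from source to every reachable enzyme."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def fraction_at_mpl(adjacency, homologous_pairs, mpl=1):
    """Fraction of the given homologous enzyme pairs found at the given MPL."""
    hits = sum(1 for a, b in homologous_pairs
               if minimal_path_lengths(adjacency, a).get(b) == mpl)
    return hits / len(homologous_pairs)
```

For example, with `reactions = {"E1": {"glc", "ATP"}, "E2": {"g6p", "ATP"}, "E3": {"g6p", "f6p"}}`, the enzymes E1 and E2 are neighbours only through ATP, so passing `excluded={"ATP"}` disconnects them while E2 and E3 remain at MPL=1.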
The retrograde evolution model predicts that homologous enzyme pairs will be
found at close distances in the metabolic network. Like previous studies (Alves
et al., 2002; Rison et al., 2002) our study seems to lend at least some support to the
retrograde evolution model. To investigate this further we analyzed the homologous
pairs of enzymes in order to distinguish between on one hand cases of patchwork evolution (broad-to-narrow evolution of enzyme substrate specificity) and on the other
hand cases of retrograde evolution (evolution of a different reaction mechanism).
We found that the correlation between homology and network distance at MPL=1
is mostly due to homologous enzyme pairs of similar functions (same primary EC
number). These homologous enzyme pairs with similar functions have probably
evolved through patchwork evolution. However, there is a statistically significant
overrepresentation of functionally dissimilar (different primary EC numbers) homologous enzyme pairs at MPL=1, which cannot easily be explained by the patchwork
evolution model. In conclusion, our study indicates that while retrograde evolution
may have played a small part, patchwork evolution is the predominant process of
metabolic enzyme evolution.
Preferential attachment in metabolic networks (Paper II)
While the investigation described in Paper I sought to determine the influence of
two well established theories regarding the local evolution of metabolic networks,
this study brings the investigation to the network level. It has been suggested that
the scale-free properties that some biological networks may display could arise as
a result of preferential attachment of new nodes to highly connected nodes through
gene duplication (Qian et al., 2001; Barabasi & Oltvai, 2004), see Figure 10. Preferential attachment by gene duplication may take place according to the following
scenario: initially, the product of the duplicated gene has exactly the same function
and position in the network as its template. Since many proteins are connected to
the hub of the network, the duplicated protein is by chance likely to be connected to
the hub. Subsequently, the duplicate may evolve towards another functionality but
it could retain some of its original function. Indeed, it has been shown that gene
duplication and gene loss are sufficient to explain the scale-freeness of biological
networks (Wagner, 2003; Amoutzias et al., 2004; van Noort et al., 2004). Nonetheless, investigations of the predictions generated from preferential attachment on real
biological networks are of interest, since such results can support the notion that preferential attachment, most likely through gene duplication, is part of the explanation for the
power-law connectivity distribution seen in several biological networks. Such an investigation was conducted previously on the protein-protein interaction network of
S. cerevisiae (Eisenberg & Levanon, 2003). Similarly, we have investigated two predictions generated from the mechanism of preferential attachment in the evolution
of the metabolic network of E. coli K-12.
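The duplication scenario can be made concrete with a short Python sketch. It is a toy model under simple assumptions, not an analysis from Paper II: one gene is chosen uniformly at random, its copy inherits all edges of the template, and no edges are lost afterwards. Under these assumptions a node with k neighbours gains an edge with probability k/N, where N is the number of genes. The function names are illustrative.

```python
def duplicate(adjacency, template, copy_name):
    """Return a new network in which copy_name inherits every edge of template."""
    new = {node: set(neighbours) for node, neighbours in adjacency.items()}
    new[copy_name] = set(adjacency[template])
    for neighbour in adjacency[template]:
        new[neighbour].add(copy_name)
    return new


def expected_gain(adjacency):
    """Expected number of edges each node gains when one random gene is duplicated.

    A node gains an edge exactly when one of its neighbours is the duplicated
    gene, so the expectation is degree / number of genes.
    """
    n_genes = len(adjacency)
    return {node: len(neighbours) / n_genes
            for node, neighbours in adjacency.items()}
```

For a hub connected to four other genes in a network of five genes, the hub gains an edge in four out of five duplications, whereas each of the other genes gains one in only one out of five.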
First, if preferential attachment is of any significance in the evolution of the
metabolic network of E. coli, the older enzymes in the network should have a higher
average connectivity. We have found that E. coli enzymes which are represented
in all three domains of life, and those represented in eukaryotes but not Archaea, have a higher average
connectivity than expected by chance. Second, the hypothesis predicts
that highly connected nodes should gain more new edges than
nodes with low connectivities. To investigate this prediction we extracted the enzymes with representatives in all three domains of life and constructed a representation of
LUCA’s metabolic network. Indeed, we found a positive linear correlation between
connectivity in the ancient network and number of connections gained through evolution.
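The second prediction can be checked with a sketch along the following lines. It assumes that both the ancient and the present-day network are available as adjacency dictionaries and that every ancient enzyme is still present; the function names are illustrative and the Pearson coefficient is used here only as one common measure of linear correlation, not necessarily the statistic of Paper II.

```python
from math import sqrt


def gained_edges(ancient, modern):
    """Edges gained by each ancient node: present-day degree minus ancient degree."""
    return {node: len(modern[node]) - len(ancient[node]) for node in ancient}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def attachment_correlation(ancient, modern):
    """Correlation between ancient connectivity and the number of edges gained."""
    nodes = sorted(ancient)
    gains = gained_edges(ancient, modern)
    return pearson_r([len(ancient[n]) for n in nodes], [gains[n] for n in nodes])
```

A clearly positive coefficient is what preferential attachment predicts; a coefficient near zero would indicate that new edges attach independently of ancient connectivity.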
Further, we found that the E. coli enzymes which are believed to have undergone
horizontal gene transfer have a higher average connectivity than other enzymes.
This result suggests that the highly connected enzymes are often old in the sense
that they are likely to have originated in LUCA and been part of the bacterial
metabolic repertory for a long time, but are sometimes relatively recent additions
to the metabolic network of E. coli K-12. It is possible that bacteria such as E. coli
are adjusting their metabolic networks to a changing environment by replacing
relatively central, highly connected enzymes with better adapted orthologs
from other prokaryotic species.
Figure 10: Preferential attachment (panels: original sequence; after duplication of one gene). If one gene is duplicated by chance among the five genes shown
in the figure (by patterns), it is likely to be connected to the hub. Therefore, highly connected
proteins tend to gain new connections faster than other proteins.
Domain distance as an alternative homology measure (Paper III)
Here, the evolution of multi-domain proteins was studied in terms of domain fusions
and repetitions. First, we developed the novel measure domain distance, defined as
the number of domains which differ between two domain architectures (DAs). Second, we determined the proportion of indels, repetitions and exchanges of domains
in Pfam (Sonnhammer et al., 1997) and SCOP (Bairoch & Apweiler, 1996).
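Read literally, the definition of domain distance suggests an edit distance between two ordered lists of domains. The sketch below is one possible reading, not the procedure of Paper III: it assumes that inserting, deleting or exchanging a single domain each costs one, and the function name `domain_distance` and the example architectures are chosen for illustration.

```python
def domain_distance(da_a, da_b):
    """Edit distance between two domain architectures (sequences of domain names).

    Assumption of this sketch: insertion, deletion and exchange of one
    domain each count as one differing domain.
    """
    previous = list(range(len(da_b) + 1))
    for i, dom_a in enumerate(da_a, start=1):
        current = [i]
        for j, dom_b in enumerate(da_b, start=1):
            exchange = 0 if dom_a == dom_b else 1
            current.append(min(previous[j] + 1,             # delete dom_a
                               current[j - 1] + 1,          # insert dom_b
                               previous[j - 1] + exchange))  # match or exchange
        previous = current
    return previous[-1]
```

Under these assumptions the architectures SH3-SH2-Pkinase and SH2-Pkinase are at domain distance one (a single indel at the N-terminus), while three and five consecutive WD40 repeats are at distance two.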
In this study evolutionary events are categorized into three different classes:
repetitions, indels (insertions/deletions) and domain exchanges. While domains can
be deleted from a protein, it has been shown that more proteins are created from
fusion (insertion) than fission (deletion) (Snel et al., 2000) and that fusions are four
times more common than fissions (Kummerfeld & Teichmann, 2005). Therefore, although it is
unclear which of the two proteins appeared first, we assume that each multi-domain
architecture stems from another architecture which is shorter or of the same length.
We found that indels are the most common events, whereas exchanges
of domains are quite rare. Repetitions are less frequent than expected, given that 55%
of the multi-domain architectures in the eukaryotic dataset contain repeating domains.
It is therefore possible that our method overestimates the fraction of indels at
the expense of repetitions.
Next, we sought to determine the position of the domain rearrangements. Our
findings show that indels and repetitions seldom occur in the middle of a protein,
and arise equally often at the N- and C-termini. Two-domain architectures are often
created from two single-domain architectures and since 43-49% of the DAs in our
dataset are two-domain architectures, this could largely explain why we get
similar fractions at both ends, and few events in the middle of the protein. Further, a
neighbour with domain distance one could be found for 78-88% of the multi-domain
architectures. Hence, most events can be explained by additions of a single domain.
Finally, insertion/deletion of several domains at once is uncommon and most events
involving several domains are repetitions.
In conclusion, using domain distance we quantified the different evolutionary
events leading to complex domain architectures. From our investigations, it is clear
that indels are the most common rearrangement events, followed by repetitions.
Further, the majority of the events involve the addition of a single domain, except
in the case of long tandem repeats, where cassettes of a number of domains are
sometimes duplicated simultaneously.
The characteristics of hubs in the protein-protein interaction network of Saccharomyces cerevisiae (Paper IV)
In the final investigation presented in this thesis, we have determined some characteristics of highly connected (hub) proteins in the protein-protein interaction network
of S. cerevisiae. The connectivities of the proteins in the network were determined
using the ’core’ dataset of the Database of Interacting Proteins (see Methods). It is
possible to distinguish two different hub types in the PPIN of S. cerevisiae: static
hubs (party hubs) and dynamic hubs (date hubs) (Han et al., 2004). The party hubs
are found in static complexes, while the date hubs bind their interaction partners
at different times and/or locations.
First, we investigated the impact of domain composition on connectivity. One
reason for the higher complexity of eukaryotes compared to prokaryotes is the increased number of domain combinations found in eukaryotes, where for example
binding domains have been added to existing catalytic proteins (Ekman et al., 2005).
The idea that multi-domain proteins can bind many different proteins is intuitively
appealing. Indeed, our results show that the proportion of multi-domain proteins
in hubs is larger than the corresponding fraction in non-hubs, which are on average
shorter than hubs. Furthermore, hubs frequently contain repeated domains and, in
addition, date hubs contain more disordered regions than party hubs, which suggests
that disordered regions are particularly important for the flexible binding of date
hubs.
Second, we wanted to investigate how specialized the hubs are in their binding:
are these highly connected proteins hubs because they interact with many similar
proteins, or because they are able to interact with many different partners with
diverse domain compositions? If hubs gained interactions through duplication of
their neighbors, many neighbors would be paralogous to each other. Some complexes
have been formed by duplications (Pereira-Leal & Teichmann, 2005) but in general,
interactions are often lost by one of the paralogs soon after duplication (Wagner,
2001). Consistently, we have found that only a small fraction of the hub interactions
can be explained by interactions with paralogs.
A less stringent definition of homology is when two proteins share a domain
family. A domain which is recurring in all neighbor proteins could also provide a
necessary binding site, and might suggest at least a partial evolutionary relationship between the neighbors. Our findings indicate that the Pkinase and WD40
domains are frequent among the domains contained in the hub interaction partners.
Furthermore, in as many as 23% of the hubs, the most frequently shared domain
in the interactors is shared with the hub itself, a feature almost twice as frequent
in party hubs as in date hubs. These same domain interactions (SDI) are particularly frequent between proteins containing e.g. Pkinase, LSM, proteasome and AAA
domains. However, overall, our findings show that domain recurrence among hub
interaction partners can only explain part of the hub interactions.
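The notion of a most frequently shared domain, and of a hub that shares it, can be sketched as follows. This is an illustrative sketch rather than the analysis code of Paper IV: each protein is assumed to be described by a set of domain family names, each interaction partner is counted once per domain, and ties between equally frequent domains are not resolved in any particular way. The function names are hypothetical.

```python
from collections import Counter


def most_shared_domain(partner_domains):
    """Most frequent domain among the interaction partners, as (domain, n_partners).

    partner_domains holds one collection of domain names per interaction partner.
    Returns None if no partner has an assigned domain.
    """
    counts = Counter(domain
                     for domains in partner_domains
                     for domain in set(domains))
    if not counts:
        return None
    return counts.most_common(1)[0]


def shares_top_domain(hub_domains, partner_domains):
    """True if the most frequent partner domain also occurs in the hub itself."""
    top = most_shared_domain(partner_domains)
    return top is not None and top[0] in set(hub_domains)
```

The fraction of hubs for which `shares_top_domain` holds would correspond, under these assumptions, to the kind of proportion reported above for same domain interactions.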
Finally, we investigated the paralogs of the party and date hubs. The protein-protein interaction network is susceptible to targeted attacks on the hubs of the
network (Jeong et al., 2001; Albert et al., 2000). Since hub proteins are pivotal
for the robustness of the protein-protein network, it is conceivable that there is an
overrepresentation of hub duplicates. In other words, hub duplicates could provide
genetic redundancy where it is most crucially needed in the network, see Figure 11b.
On the other hand, gene duplications may cause an imbalance in the concentration
of the components of protein-protein complexes which might be deleterious (Veitia,
2002; Papp et al., 2003), see Figure 11a. The first mechanism predicts that the
hubs should have a higher fraction of paralogs than other proteins. In contrast,
the latter mechanism, which is sometimes referred to as dosage sensitivity, predicts
the opposite. We found that the duplicability of hub proteins is similar to that of
other proteins. However, there are very few static hub (party hub) paralogs which
originate from relatively recent duplications. We hypothesize that the number of
retained party hub duplicates has decreased relative to the duplicates of non-hubs
during the evolution of S. cerevisiae. Although other explanations are possible, we
speculate that the dosage sensitivity of party hubs has become more pronounced
compared to other proteins during the evolution of S. cerevisiae.
[Figure 11, panel labels: a) Dosage sensitivity. b) Hub redundancy: single hub, hub deletion, disintegrated network; duplicate hub, hub deletion, preserved topology.]
Figure 11: Dosage sensitivity and hub redundancy. a) Dosage sensitivity for a simple complex
consisting of the proteins A and B. When A is duplicated, its duplicate A’ forms a homodimer
with A. As a result, there is a decrease in the number of AB complexes, which, if this is an important
interaction in the cell, could be detrimental. b) Hub redundancy. The black node represents the
hub in this simple network. If the hub is deleted, the network will be disintegrated. However, if
there is a duplicate copy for this hub protein, the structure of the network will persist even if one
of the hubs is deleted.
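The hub redundancy argument of Figure 11b can be illustrated by counting connected components after a node is removed. The following Python sketch uses a hypothetical toy network of one hub and four partners; the function name and the node names in the usage note are assumptions made for illustration.

```python
def count_components(adjacency, removed=frozenset()):
    """Number of connected components after deleting the nodes in `removed`."""
    seen = set()
    components = 0
    for start in adjacency:
        if start in removed or start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in removed and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components
```

With a single hub `H` linked to four partners, deleting `H` leaves four isolated proteins. If a duplicate `H2` with the same partners is present, deleting `H` leaves one connected network, which is the sense in which the topology is preserved.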
Conclusions and reflections
Of the biological networks studied herein, the wealth of knowledge of metabolism
is likely to be more reliable than the information regarding protein-protein interaction networks. However, knowledge of metabolic networks originates from the
model organisms E. coli and S. cerevisiae to a great extent, and is therefore subject
to substantial bias. Therefore, it can be expected that much will be learnt about
metabolism of higher eukaryotes in the years to come. Clearly, an important step will
be to relate the more than 1,500 known enzymatic activities whose corresponding
enzymes have not yet been identified (Lespinet & Labedan, 2006). However, while
much is to be learnt about the metabolism, and its evolution, among the higher eukaryotes, it is clear that far more remains to be discovered regarding protein-protein
interaction networks. Certainly, if many of the orphans correspond to functions
which, like their sequences, are unique to their specific organisms, a steady stream
of novel biological functions will be revealed for years to come.
Here, we have investigated the evolution of the metabolic network of E. coli,
where we found that enzymes primarily evolve through gene duplication, from enzymes with
broad substrate specificities to enzymes with more specialized
substrate specificities. Further, we found some evidence supporting preferential
attachment in metabolic networks, in that enzymes which are older have a higher
connectivity and enzymes which have a higher connectivity in a representation of the
ancient network, consisting of the enzymes which occur in all domains of life, tend to
gain more new edges than other enzymes. Studying protein evolution with respect to
domain combinations, we found that the domain rearrangements which are the most
common in proteins are single indels taking place at the N- and C-termini. Finally,
we investigated the characteristics of the hub proteins of the yeast protein-protein
interaction network and found that repeating domains are quite common among
the hubs, while disorder is particularly common in the hubs involved in transient
interactions (date hubs). Furthermore, the hubs which are at the cores of complexes
(party hubs) have few recent paralogs compared to other proteins. This could, for
example, indicate that their paralogs diverge fast, although this is contradicted by the result
that party hubs have the slowest evolutionary rate of all proteins (Fraser, 2005), or
it could be explained by an increase in dosage sensitivity for party hubs compared to other proteins during
evolution.
There appears to be a possible contradiction in the most recent papers regarding the evolution of protein-protein networks. On one hand, dosage sensitivity has
become firmly established as a factor which restricts duplications of proteins which
are part of protein complexes (Veitia, 2002; Papp et al., 2003). On the other hand,
many protein complexes show similarities to each other (Pereira-Leal & Teichmann,
2005), which is hard to explain other than through gene, or genome, duplication.
Genome duplication could be the best explanation for the evolution of complexes,
since mutually paralogous complexes can evolve in this manner while dosage imbalance is avoided. However, our investigation in Paper IV shows that party hub
duplicates are underrepresented among the paralogous pairs originating from the
whole genome duplication of yeast, suggesting that for these party hubs, the whole
genome duplication did not lead to a particularly relaxed dosage sensitivity. Further
studies are needed to explain the disparate observations, but it is possible that non-hub duplicates are frequently retained since they are generally more recent proteins
where substitutions may lead to improvements.
Acknowledgements
I would like to thank Stiftelsen för strategisk forskning for financial support. Furthermore, I’d sincerely like to thank the following people:
Arne Elofsson for being an encouraging, inspiring and just, simply, wonderful advisor.
Per Kraulis for recruiting me as a PhD candidate, teaching me a lot about biology
and for invaluable lessons in scientific argumentation.
Karin Melén for being a great friend; for all the fun talks, always helping out and,
most of all, for your contagiously positive personality.
Diana Ekman and Åsa K. Björklund for being amazing co-workers in every way.
Olivia Eriksson and Ann-Charlotte Berglund Sonnhammer for all the dinner parties
and for many inspiring scientific discussions.
To everyone in Arne’s group and all the people at SBC, past and present, for making
SBC a great place to be.
Anna-Karin for your friendship - and all the fun pop culture talks, not to mention
those Buffy episodes.
My aunt Birgit for helping me when I most needed it.
My sister Maria for always being up for a heart-to-heart fashion photography chat.
You are my twin spirit.
Mamma Ulla och Pappa Krille. Thank you for all your support during these years.
I can safely say that this thesis would never have been completed without either one
of you - at least three times over. You are the best.
Bill, for always sharing your independent thoughts with me, teaching me so much I
didn’t know and for putting up with my long work days during the last year. I love
you.
References
Albert, R., Jeong, H. and Barabasi, A. L. 2000. Error and attack tolerance of
complex networks. Nature 406:378–382.
Aloy, P. and Russell, R. B. 2002. The third dimension for protein interactions and
complexes. Trends Biochem Sci 27:633–638.
Alves, R., Chaleil, R. A. and Sternberg, M. J. 2002. Evolution of enzymes in
metabolism: a network perspective. J Mol Biol 320:751–770.
Amoutzias, G. D., Robertson, D. L., Oliver, S. G. and Bornberg-Bauer, E. 2004.
Convergent evolution of gene networks by single-gene duplications in higher
eukaryotes. EMBO Rep 5:274–279.
Apic, G., Gough, J. and Teichmann, S. A. 2001a. An insight into domain combinations. Bioinformatics 17:S83–S89.
Apic, G., Gough, J. and Teichmann, S. A. 2001b. Domain combinations in archaeal,
eubacterial and eukaryotic proteomes. J Mol Biol 310:311–325.
Apic, G., Ignjatovic, T., Boyer, S. and Russell, R. B. 2005. Illuminating drug
discovery with biological pathways. FEBS Lett 579:1872–1877.
Arita, M. 2004. The metabolic world of Escherichia coli is not small. Proc Natl
Acad Sci USA 101:1543–1547.
Bader, G. D. and Hogue, C. W. 2002. Analyzing yeast protein-protein interaction
data obtained from different sources. Nat Biotechnol 20:991–997.
Bairoch, A. and Apweiler, R. 1996. The SWISS-PROT protein sequence data bank and
its new supplement TREMBL. Nucleic Acids Res 24:21–25.
Ball, C. A. and Cherry, J. M. 2001. Genome comparisons highlight similarity and
diversity within eukaryotic genomes. Curr Opin Chem Biol 5:86–89.
Barabasi, A. L. and Albert, R. 1999. Emergence of scaling in random networks.
Science 286:509–512.
Barabasi, A-L. and Oltvai, Z. N. 2004. Network biology: understanding the cell’s
functional organization. Nat Rev Genet 5:101–113.
Belfaiza, J., Parsot, C., Martel, A., de la Tour, C. B, Margarita, D., Cohen, G. N.
and Saint-Girons, I. 1986. Evolution in biosynthetic pathways: two enzymes
catalyzing consecutive steps in methionine biosynthesis originate from a common ancestor and possess a similar regulatory region. Proc Natl Acad Sci USA
Boucher, Y., Douady, C. J., Papke, R. T., Walsh, D. A., Boudreau, M. E., Nesbo,
C. L., Case, R. J. and Doolittle, W. F. 2003. Lateral gene transfer and the
origins of prokaryotic groups. Annu Rev Genet 37:283–328.
Byrne, K. P. and Wolfe, K. H. 2005. The yeast gene order browser: combining
curated homology and syntenic context reveals gene fate in polyploid species.
Genome Res 15:1456–1461.
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O. and
Herskowitz, I. 1998. The transcriptional program of sporulation in budding
yeast. Science 282:699–705.
Claverie, J. M. 2001. Gene number. What if there are only 30,000 human genes?
Science 291:1255–1257.
Conant, G. C. and Wolfe, K. H. 2006. Functional partitioning of yeast co-expression
networks after genome duplication. PLoS Biol 4:epub.
Consortium, International Human Genome Sequencing. 2001. Initial sequencing and
analysis of the human genome. Nature 409:860–921.
Copley, R. R. and Bork, P. 2000. Homology among (β/α)8 barrels: implications for
the evolution of metabolic pathways. J Mol Biol 303:627–641.
Cornell, M., Paton, N. W. and Oliver, S. G. 2004. A critical and integrated view of
the yeast interactome. Comp Func Genom 5:382–402.
Coulomb, S., Bauer, M., Bernard, D. and Marsolier-Kergoat, M. C. 2005. Gene
essentiality and the topology of protein interaction networks. Proc Biol Sci
Dandekar, T., Snel, B., Huynen, M. and Bork, P. 1998. Conservation of gene order: a
fingerprint of proteins that physically interact. Trends Biochem Sci 23:324–328.
Date, S. V. and Marcotte, E. M. 2003. Discovery of uncharacterized cellular systems
by genome-wide analysis of functional linkages. Nat Biotechnol 21:1055–1062.
Deane, C. M., Salwinski, L., Xenarios, I. and Eisenberg, D. 2002. Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 1:349–356.
Edwards, A. M., Kus, B., Jansen, R., Greenbaum, D., Greenblatt, J. and Gerstein,
M. 2002. Bridging structural biology and genomics: assessing protein interaction data with known complexes. Trends Genet 18:529–536.
Eisenberg, E. and Levanon, E. Y. 2003. Preferential attachment in the protein
network evolution. Phys Rev Lett 91:128701.
Ekman, D., Björklund, Å. K., Frey-Skött, J. and Elofsson, A. 2005. Multi–domain
proteins in the three kingdoms of life - orphan domains and other unassigned
regions. J Mol Biol 348:231–243.
Enright, A. J., Iliopoulos, I., Kyrpides, N. C. and Ouzounis, C. A. 1999. Protein
interaction maps for complete genomes based on gene fusion events. Nature
Fani, R., Lio, P. and Lazcano, A. 1995. Molecular evolution of the histidine biosynthetic pathway. J Mol Evol 41:760–774.
Fields, S. 2005. High-throughput two-hybrid analysis. The promise and the peril.
FEBS J 272:5391–5399.
Fields, S. and Song, O. 1989. A novel genetic system to detect protein-protein
interactions. Nature 340:245–246.
Force, A., Lynch, M. B., Pickett, B., Amores, A. and Yan, Y.-L. 1999. Preservation of
duplicate genes by complementary, degenerative mutations. Genetics 151:1531–
Fraser, H. B. 2005. Modularity and evolutionary constraint on proteins. Nat Genet
Fraser, H. B., Hirsh, A. E., Steinmetz, L. M., Scharfe, C. and Feldman, M. W. 2002.
Evolutionary rate in the protein interaction network. Science 296:750–752.
Fraser, H. B., Hirsh, A. E., Wall, D. P. and Eisen, M. B. 2004. Coevolution of gene
expression among interacting proteins. Proc Natl Acad Sci USA 101:9033–903.
Fraser, H. B., Wall, D. P. and Hirsh, A. E. 2003. A simple dependence between
protein evolution rate and the number of protein-protein interactions. BMC
Evol Biol 3.
Gandhi, T. K., Zhong, J., Mathivanan, S., Karthick, L., Chandrika, K. N., Mohan,
S. S., Sharma, S., Pinkert, S., Nagaraju, S., Periaswamy, B., Mishra, G., Nandakumar, K., Shen, B., Deshpande, N., Nayak, R., Sarker, M., Boeke, J. D.,
Parmigiani, G., Schultz, J., Bader, J. S. and Pandey, A. 2006. Analysis of the
human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat Genet 38:285–293.
Gasch, A. P. and Werner-Washburne, M. 2002. The genomics of yeast responses to
environmental stress and starvation. Funct Integr Genomics 2:181–182.
Gavin, A. C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz,
J., Rick, J. M., Michon, A. M., Cruciat, C. M., Remor, M., Hofert, C., Schelder,
M., Brajenovic, M., Ruffner, H., Merino, A., Klein, K., Hudak, M., Dickson, D.,
Rudi, T., Gnau, V., Bauch, A., Bastuck, S., Huhse, B., Leutwein, C., Heurtier,
M. A., Copley, R. R., Edelmann, A., Querfurth, E., Rybin, V., Drewes, G.,
Raida, M., Bouwmeester, T., Bork, P., Seraphin, B., Kuster, B., Neubauer, G.
and Superti-Furga, G. 2002. Functional organization of the yeast proteome by
systematic analysis of protein complexes. Nature 415:141–147.
Gerrard, J. A., Sparrow, A. D. and Wells, J. A. 2001. Metabolic databases - what
next? Trends Biochem Sci 26:137–140.
Gerstein, M. and Levitt, M. 1998. Comprehensive assessment of automatic structural
alignment against a manual standard, the SCOP classification of proteins. Protein
Sci 7:445–456.
Giot, L., Bader, J. S., Brouwer, C., Chaudhuri, A., Kuang, B., Li, Y., Hao, Y. L.,
Ooi, C. E., Godwin, B., Vitols, E., Vijayadamodar, G., Pochart, P., Machineni, H., Welsh, M., Kong, Y., Zerhusen, B., Malcolm, R., Varrone, Z., Collis,
A., Minto, M., Burgess, S., McDaniel, L., Stimpson, E., Spriggs, F., Williams,
J., Neurath, K., Ioime, N., Agee, M., Voss, E., Furtak, K., Renzulli, R., Aanensen, N., Carrolla, S., Bickelhaupt, E., Lazovatsky, Y., DaSilva, A., Zhong, J.,
Stanyon, C. A., Finley, R. L., White, K. P., Braverman, M., Jarvie, T., Gold,
S., Leach, M., Knight, J., Shimkets, R. A., McKenna, M. P., Chant, J. and
Rothberg, J. M. 2003. A protein interaction map of Drosophila melanogaster.
Science 302:1727–1736.
Gleiss, P. M., Stadler, P. F., Wagner, A. and Fell, D. A. 2001. Relevant cycles in
chemical reaction networks. Adv Complex Syst 4:207–226.
Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B., Feldmann, H.,
Galibert, F., Hoheisel, J. D., Jacq, C, Johnston, M., Louis, E. J., Mewes,
H. W., Murakami, Y., Philippsen, P., Tettelin, H. and Oliver, S. G. 1996. Life
with 6000 genes. Science 274:563–567.
Goh, C. S. and Cohen, F. E. 2002. Co-evolutionary analysis reveals insights into
protein-protein interactions. J Mol Biol 324:177–192.
Goldovsky, L., Cases, I., Enright, A. J. and Ouzounis, C. A. 2005. Biolayout(java):
versatile network visualisation of structural and functional relationships. Applied Bioinformatics 4:71–74.
Goto, S., Okuno, Y., Hattori, M., Nishioka, T. and Kanehisa, M. 2002. LIGAND:
database of chemical compounds and reactions in biological pathways. Nucleic
Acids Res 30:402–404.
Grigoriev, A. 2001. A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast
Saccharomyces cerevisiae. Nucleic Acids Res 29:3513–3519.
Gu, Z., Steinmetz, L. M., Gu, Z., Scharfe, C., Davis, R. W. and Li, W. H. 2003.
Role of duplicate genes in genetic robustness against null mutations. Nature
Han, J. D., Bertin, N., Hao, T., Goldberg, D. S., Berriz, G. F., Zhang, L. V., Dupuy,
D., Walhout, A. J., Cusick, M. E., Roth, F. P. and Vidal, M. 2004. Evidence
for dynamically organized modularity in the yeast protein-protein interaction
network. Nature 430:88–93.
Han, J. D., Dupuy, D., Bertin, N., Cusick, M. E. and Vidal, M. 2005. Effect of
sampling on topology predictions of protein-protein interaction networks. Nat
Biotechnol 23:839–844.
Heckman, D. S., Geiser, D. M., Eidell, B. R., Stauffer, R. L., Kardos, N. L. and
Hedges, S. B. 2001. Molecular evidence for the early colonization of land by
fungi and plants. Science 293:1129–1133.
Hirsh, A. E. and Fraser, H. B. 2001. Protein dispensability and rate of evolution.
Nature 411:1046–1049.
Ho, Y., Gruhler, A., Heilbut, A., Bader, G. D., Moore, L., Adams, S. L., Millar,
A., Taylor, P., Bennett, K., Boutilier, K., Yang, L., Wolting, C., Donaldson, I.,
Schandorff, S., Shewnarane, J., Vo, M., Taggart, J., Goudreault, M., Muskat,
B., Alfarano, C., Dewar, D., Lin, Z., Michalickova, K., Willems, A. R., Sassi,
H., Nielsen, P. A., Rasmussen, K. J., Andersen, J. R., Johansen, L. E., Hansen,
L. H., Jespersen, H., Podtelejnikov, A., Nielsen, P. A., Crawford, J., Poulsen,
V., Sorensen, B. D., Matthiesen, J., Hendrickson, R. C., Gleeson, F., Pawson,
T., Moran, M. F., Durocher, D., Mann, M., Hogue, C. W., Figeys, D. and
Tyers, M. 2002. Systematic identification of protein complexes in Saccharomyces
cerevisiae by mass spectrometry. Nature 415:180–183.
Horowitz, N. H. 1945. On the evolution of biochemical syntheses. Proc Natl Acad
Sci USA 31:153–157.
Horowitz, N. H. 1965. The evolution of biochemical syntheses - retrospect and
prospect. In Evolving genes and proteins, (Bryson, V. and Vogel, H. J., eds),
pp. 15–23, Academic Press, New York.
Huynen, M. A., Dandekar, T. and Bork, P. 1999. Variation and evolution of the
citric-acid cycle: a genomic perspective. Trends Microbiol 7:281–291.
Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M. and Sakaki, Y. 2001. A
comprehensive two-hybrid analysis to explore the yeast protein interactome.
Proc Natl Acad Sci USA 98:4569–4574.
Jansen, R., Greenbaum, D. and Gerstein, M. 2002. Relating whole-genome expression data with protein-protein interactions. Genome Res 12:37–46.
Jensen, R. A. 1976. Enzyme recruitment in evolution of new function. Annu Rev
Microbiol 30:409–425.
Jeong, H., Mason, S. P., Barabasi, A. L. and Oltvai, Z. N. 2001. Lethality and
centrality in protein networks. Nature 411:41–42.
Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. and Barabasi, A. L. 2000. The
large-scale organization of metabolic networks. Nature 407:651–654.
Jin, F., Hazbun, T., Michaud, G. A., Salcius, M., Predki, P. F., Fields, S. and Huang,
J. 2006. A pooling-deconvolution strategy for biological network elucidation.
Nat Methods 3:183–189.
Jordan, I. K., Wolf, Y. I. and Koonin, E. V. 2003. No simple dependence between
protein evolution rate and the number of protein-protein interactions: only the
most prolific interactors tend to evolve slowly. BMC Evol Biol 3.
Kanehisa, M., Goto, S., Kawashima, S. and Nakaya, A. 2002. The KEGG databases
at GenomeNet. Nucleic Acids Res 30:42–46.
Karp, P. D., Riley, M., Saier, M., Paulsen, I. T., Collado-Vides, J., Paley, S. M.,
Pellegrini-Toole, A., Bonavides, C. and Gama-Castro, S. 2002. The EcoCyc
database. Nucleic Acids Res 30:56–58.
Kelley, B. P., Sharan, R., Karp, R. M., Sittler, T., Root, D. E., Stockwell, B. R. and
Ideker, T. 2003. Conserved pathways within bacteria and yeast as revealed by
global protein network alignment. Proc Natl Acad Sci USA 100:11394–11399.
Kellis, M., Birren, B. W. and Lander, E. S. 2004. Proof and evolutionary analysis
of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature
Kemmeren, P., van Berkum, N. L., Vilo, J., Bijma, T., Donders, R., Brazma, A. and
Holstege, F. C. 2002. Protein interaction verification and functional annotation
by integrated analysis of genome-scale data. Mol Cell 9:1133–1143.
Kummerfeld, S. K. and Teichmann, S. A. 2005. Relative rates of gene fusion and
fission in multi-domain proteins. Trends Genet 21:25–30.
Kunin, V., Pereira-Leal, J. B. and Ouzounis, C. A. 2004. Functional evolution of
the yeast protein interaction network. Mol Biol Evol 21:1171–1176.
Kurland, C. G., Canback, B. and Berg, O. G. 2003. Horizontal gene transfer: a
critical view. Proc Natl Acad Sci USA 100:9658–9662.
Lawrence, J. G. and Ochman, H. 1997. Amelioration of bacterial genomes: rates of
change and exchange. J Mol Evol 44:383–397.
Lawrence, J. G. and Roth, J. R. 1996. Selfish operons: horizontal transfer may drive
the evolution of gene clusters. Genetics 143:1843–1860.
Lazcano, A. and Miller, S. L. 1996. The origin and early evolution of life: prebiotic
chemistry, the pre-RNA world, and time. Cell 85:793–798.
Lespinet, O. and Labedan, B. 2006. Puzzling over orphan enzymes. Cell Mol Life
Sci 63:517–523.
Li, S., Armstrong, C. M., Bertin, N., Ge, H., Milstein, S., Boxem, M., Vidalain,
P. O., Han, J. D., Chesneau, A., Hao, T., Goldberg, D. S., Li, S., Martinez,
M., Rual, J. F., Lamesch, P., Xu, L., Tewari, M., Wong, S. L., Zhang, L. V.,
Berriz, G. F., Jacotot, L., Vaglio, P., Reboul, J., Hirozane-Kishikawa, T., Li,
S., Gabel, H. W., Elewa, A., Baumgartner, B., Rose, D. J., Yu, H., Bosak,
S., Sequerra, R., Fraser, A., Mango, S. E., Saxton, W. M., Strome, S., van den
Heuvel, S., Piano, F., Vandenhaute, J., Sardet, C., Gerstein, M., Doucette-Stamm, L.,
Gunsalus, K. C., Harper, J. W., Cusick, M. E., Roth, F. P., Hill, D. E. and
Vidal, M. 2004. A map of the interactome network of the metazoan C. elegans.
Science 303:540–543.
Liu, J. and Rost, B. 2004. Chop proteins into structural domain-like fragments.
PROTEINS: Structure, Function and Bioinformatics 55:678–688.
Lynch, M. and Conery, J. S. 2000. The evolutionary fate and consequences of
duplicate genes. Science 290:1151–1155.
Ma, H. and Zeng, A. P. 2003. Reconstruction of metabolic networks from genome
data and analysis of their global structure for various organisms. Bioinformatics
Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W., Yeates, T. O. and Eisenberg, D. 1999. Detecting protein function and protein-protein interactions from
genome sequences. Science 285:751–753.
Mewes, H. W., Frishman, D., Guldener, U., Mannhaupt, G., Mayer, K., Mokrejs,
M., Morgenstern, B., Munsterkotter, M., Rudd, S. and Weil, B. 2002. MIPS: a
database for genomes and protein sequences. Nucleic Acids Res 30:31–34.
Mintseris, J. and Weng, Z. 2005. Structure, function, and evolution of transient
and obligate protein-protein interactions. Proc Natl Acad Sci USA 102:10930–
Mrowka, R., Patzak, A. and Herzel, H. 2001. Is there a bias in proteome research?
Genome Res 11:1971–1973.
NC-IUBMB 1992. Enzyme Nomenclature. Academic Press, San Diego, California.
Ohno, S. 1970. Evolution by gene duplication. Springer-Verlag, Berlin.
Overbeek, R., Larsen, N., Pusch, G. D., D’Souza, M., Selkov, E. Jr., Kyrpides,
N., Fonstein, M., Maltsev, N. and Selkov, E. 2000. WIT: integrated system
for high-throughput genome sequence analysis and metabolic reconstruction.
Nucleic Acids Res 28:123–125.
Pal, C., Papp, B. and Lercher, M. J. 2005. Adaptive evolution of bacterial metabolic
networks by horizontal gene transfer. Nature Genet 37:1372–1375.
Papp, B., Pal, C. and Hurst, L. D. 2003. Dosage sensitivity and the evolution of
gene families in yeast. Nature 424:194–197.
Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D. and Yeates, T. O.
1999. Assigning protein functions by comparative genome analysis: protein
phylogenetic profiles. Proc Natl Acad Sci USA 96:4285–4288.
Peregrin-Alvarez, J. M., Tsoka, S. and Ouzounis, C. A. 2003. The phylogenetic
extent of metabolic enzymes and pathways. Genome Res 13:422–427.
Pereira-Leal, J. B. and Teichmann, S. A. 2005. Novel specificities emerge by stepwise
duplication of functional modules. Genome Res 15:552–559.
Petsko, G. A., Kenyon, G. L., Gerlt, J. A., Ringe, D. and Kozarich, J. W. 1993. On
the origin of enzymatic species. Trends Biochem Sci 18:372–376.
Qian, J., Luscombe, N. M. and Gerstein, M. 2001. Protein family and fold occurrence
in genomes: power-law behaviour and evolutionary model. J Mol Biol 313:673–
Remm, M., Storm, C. E. and Sonnhammer, E. L. 2001. Automatic clustering
of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol
Rigaut, G., Shevchenko, A., Rutz, B., Wilm, M., Mann, M. and Seraphin, B. 1999.
A generic protein purification method for protein complex characterization and
proteome exploration. Nat Biotechnol 17:1030–1032.
Rison, S. C., Teichmann, S. A. and Thornton, J. M. 2002. Homology, pathway distance and chromosomal localisation of the small molecule metabolism enzymes
in Escherichia coli. J Mol Biol 318:911–932.
Roberts, C. J., Nelson, B., Marton, M. J., Stoughton, R., Meyer, M. R., Bennett,
H. A., He, Y. D., Dai, H., Walker, W. L., Hughes, T. R., Tyers, M., Boone,
C. and Friend, S. H. 2000. Signaling and circuitry of multiple MAPK pathways
revealed by a matrix of global gene expression profiles. Science 287:873–880.
Rost, B. 2002. Did evolution leap to create the protein universe? Curr Opin Struct
Biol 12:409–416.
Rual, J. F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., Li, N.,
Berriz, G. F., Gibbons, F. D., Dreze, M., Ayivi-Guedehoussou, N., Klitgord,
N., Simon, C., Boxem, M., Milstein, S., Rosenberg, J., Goldberg, D. S., Zhang,
L. V., Wong, S. L., Franklin, G., Li, N., Albala, J. S., Lim, J., Fraughton, C.,
Llamosas, E., Cevik, S., Bex, C., Lamesch, P., Sikorski, R. S., Vandenhaute,
J., Zoghbi, H. Y., Smolyar, A., Bosak, S., Sequerra, R., Doucette-Stamm, L.,
Cusick, M. E., Hill, D. E., Roth, F. P. and Vidal, M. 2005. Towards a proteome-scale map of the human protein-protein interaction network. Nature 437:1173–
Saeed, R. and Deane, C. M. 2006. Protein protein interactions, evolutionary rate,
abundance and age. BMC Bioinformatics 7:128.
Salwinski, L., Miller, C. S., Smith, A. J., Pettit, F. K., Bowie, J. U. and Eisenberg,
D. 2004. The database of interacting proteins: 2004 update. Nucleic Acids Res
Saqi, M. A. and Sternberg, M. J. 2001. A structural census of metabolic networks
for E. coli. J Mol Biol 313:1195–1206.
Schomburg, I., Chang, A. and Schomburg, D. 2002. BRENDA, enzyme data and
metabolic information. Nucleic Acids Res 30:47–49.
Schuster, S., Fell, D. A. and Dandekar, T. 2000. A general definition of metabolic
pathways useful for systematic organization and analysis of complex metabolic
networks. Nat Biotechnol 18:326–332.
Snel, B., Bork, P. and Huynen, M. 2000. Genome evolution. Gene fusion versus gene
fission. Trends Genet 16:9–11.
Snel, B., Bork, P. and Huynen, M. A. 2002. Genomes in flux: the evolution of
archaeal and proteobacterial gene content. Genome Res 12:17–25.
Sonnhammer, E. L., Eddy, S. R. and Durbin, R. 1997. Pfam: a comprehensive
database of protein families based on seed alignments. Proteins 28:405–420.
Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B.,
Brown, P. O., Botstein, D. and Futcher, B. 1998. Comprehensive identification
of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray
hybridization. Mol Biol Cell 9:3273–3297.
Sprinzak, E., Sattath, S. and Margalit, H. 2003. How reliable are experimental
protein-protein interaction data? J Mol Biol 327:919–923.
Suthram, S., Sittler, T. and Ideker, T. 2005. The Plasmodium protein network
diverges from those of other eukaryotes. Nature 438:108–112.
Tatusov, R., Fedorova, N. D., Jackson, J. D., Jacobs, A. R., Kiryutin, B., Koonin,
E. V., Krylov, D. M., Mazumder, R., Mekhedov, S. L., Nikolskaya, A. N., Rao,
B. S., Smirnov, S., Sverdlov, A. V., Vasudevan, S., Wolf, Y. I., Yin, J. J. and
Natale, D. A. 2003. The COG database: an updated version includes eukaryotes.
BMC Bioinformatics 4:41.
Tatusov, R. L., Galperin, M. Y., Natale, D. A. and Koonin, E. V. 2000. The COG
database: a tool for genome-scale analysis of protein functions and evolution.
Nucleic Acids Res 28:33–36.
Teichmann, S. A., Rison, S. C. G., Thornton, J. M., Riley, M., Gough, J. and
Chothia, C. 2001. The evolution and structural anatomy of the small molecule
metabolic pathways in Escherichia coli. J Mol Biol 311:693–708.
Thornton, J. W. 2001. Evolution of vertebrate steroid receptors from an ancestral
estrogen receptor by ligand exploitation and serial genome expansions. Proc
Natl Acad Sci USA 98:5671–5676.
Travers, K. J., Patil, C. K., Wodicka, L., Lockhart, D. J., Weissman, J. S. and
Walter, P. 2000. Functional and genomic analyses reveal an essential coordination between the unfolded protein response and ER-associated degradation. Cell
Tsoka, S. and Ouzounis, C. A. 2001. Functional versatility and molecular diversity
of the metabolic map of Escherichia coli. Genome Res 11:1503–1510.
Uetz, P. and Finley, R. L. Jr. 2005. From protein networks to biological systems.
FEBS Lett 579:1821–1827.
Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S., Knight, J. R.,
Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., Qureshi-Emili, A., Li,
Y., Godwin, B., Conover, D., Kalbfleisch, T., Vijayadamodar, G., Yang, M.,
Johnston, M., Fields, S. and Rothberg, J. M. 2000. A comprehensive analysis of
protein-protein interactions in Saccharomyces cerevisiae. Nature 403:623–627.
van Noort, V., Snel, B. and Huynen, M. A. 2004. The yeast coexpression network has
a small-world, scale-free architecture and can be explained by a simple model.
EMBO Rep 5:280–284.
Veitia, R. A. 2002. Exploring the etiology of haploinsufficiency. Bioessays 24:175–
Venter, J. C. et al. 2001. The sequence of the human genome. Science 291:1304–
von Mering, C., Zdobnov, E. M., Tsoka, S., Ciccarelli, F. D., Pereira-Leal, J. B.,
Ouzounis, C. A. and Bork, P. 2003. Genome evolution reveals biochemical
networks and functional modules. Proc Natl Acad Sci USA 100:15428–15433.
Wagner, A. 2000. Mutational robustness in genetic networks of yeast. Nature Gen
Wagner, A. 2001. The yeast protein interaction network evolves rapidly and contains
few redundant duplicate genes. Mol Biol Evol 18:1283–1292.
Wagner, A. 2003. How large protein interaction networks evolve. Proc R Soc Lond
B 270:457–466.
Wagner, A. 2004. The connectivity of large genetic networks: design, history, or
mere chemistry? To appear in ‘Power laws, scale-free networks and genome
biology’ by Karev, Wolf and Koonin.
Wagner, A. and Fell, D. A. 2001. The small world inside large metabolic networks.
Proc R Soc Lond B Biol Sci 268:1803–1810.
Warner, G. J., Adeleye, Y. A. and Ideker, T. 2006. Interactome networks: the state
of the science. Genome Biol 7:301.
Wilmanns, M., Hyde, C. C., Davies, D. R., Kirschner, K. and Jansonius, J. N.
1991. Structural conservation in parallel beta/alpha-barrel enzymes that catalyze three sequential reactions in the pathway of tryptophan biosynthesis. Biochemistry 30:9161–9169.
Wuchty, S. 2001. Scale-free behavior in protein domain networks. Mol Biol Evol
Zhang, Z., Luo, Z. W., Kishino, H. and Kearsey, M. J. 2005. Divergence pattern of
duplicate genes in protein-protein interactions follows the power law. Mol Biol
Evol 22:501–505.