Computational Analyses of Biological Sequences - Applications to Antibody-based
Proteomics and Gene Family Characterization
Mats Lindskog
Royal Institute of Technology
School of Biotechnology
Stockholm 2005
© Mats Lindskog, 2005
ISBN 91-7178-186-2
Royal Institute of Technology
School of Biotechnology
AlbaNova University Center
SE-106 91 Stockholm
Sweden
Printed at Universitetsservice US-AB
Drottning Kristinas väg 53B
SE-100 44 Stockholm
Sweden
Cover: Currently, the sequences of more than 300 different genomes are available.
Computer-based methods are used for analyses and visualization of these biological sequences.
To my Mother and Father
Abstract
Following the completion of the human genome sequence, post-genomic efforts
have shifted the focus towards the analysis of the encoded proteome. Several
different systematic proteomics approaches have emerged, for instance antibody-based proteomics initiatives, where antibodies are used to functionally explore
the human proteome. One such effort is HPR (the Swedish Human Proteome
Resource), where affinity-purified polyclonal antibodies are generated and
subsequently used for protein expression and localization studies in normal and
diseased tissues. The antibodies are directed towards protein fragments, PrESTs
(Protein Epitope Signature Tags), which are selected based on criteria favourable
in subsequent laboratory procedures.
This thesis describes the development of novel software (Bishop) to facilitate
the selection of suitable protein fragments and to ensure high-throughput
processing of selected target proteins. The majority of proteins were successfully
processed by this approach; however, the design strategy resulted in a number of
fall-outs. These proteins comprised alternative splice variants, as well as proteins
exhibiting high sequence similarities to other human proteins. Alternative
strategies were developed for processing of these proteins. The strategy for
handling of alternative splice variants included the development of additional
software and was validated by comparing the immunohistochemical staining
patterns obtained with antibodies generated towards the same target protein.
Processing of high sequence similarity proteins was enabled by assembling human
proteins into clusters according to their pairwise sequence identities. Each cluster
was represented by a single PrEST located in the region of the highest sequence
similarity among all cluster members, thereby representing the entire cluster.
This strategy was validated by identification of all proteins within a cluster using
antibodies directed to such cluster specific PrESTs using Western blot analysis.
In addition, the PrEST design success rates for more than 4,000 genes were
evaluated.
Genomes of several organisms other than human have also been finished; currently, more than 300
genomes are fully sequenced. Following the release of the genome sequence of the tree model organism
black cottonwood (Populus trichocarpa), a bioinformatic analysis identified previously unknown
cellulose synthases (CesAs) and revealed a total of 18 CesA family members. These
genes are thought to have arisen from several rounds of genome duplication.
This number is significantly higher than that reported in previous studies of other
plant genomes, which comprise only ten CesA family members.
Moreover, the identification of corresponding orthologous ESTs from the
closely related hybrid aspen (P. tremula x tremuloides) for two pairs of CesAs suggests
that they are actively transcribed. This indicates that a number of paralogs have
preserved their functionalities following extensive genome duplication events in
the tree’s evolutionary history.
Key words: antigen, antibodies, antibody-based proteomics, biocomputing, bioinformatics, Populus, PrEST, protein fragment, software, visualization
List of Publications
This thesis is based on the following publications, which in the text are
referred to by their Roman numerals:
I†
Lindskog, M., Rockberg, J., Uhlén, M., and Sterky, F. (2005)
Selection of Protein Epitopes for Antibody Production.
BioTechniques Vol. 38, No. 5: pp 723-727
II
Berglund, L., Lindskog, M., Persson, A., Sivertsson, Å., Uhlén,
M., and Al-Khalili Szigyarto, C. (2005) Design of Protein Epitope
Signature Tags for Antibody-based Proteomics, Manuscript
III
Lindskog, M., Berglund, L., Hamsten, C., Uhlén, M., and Sterky,
F. (2005) Processing of High Sequence Similarity Proteins in
Antibody-based Proteomics, Manuscript
IV†
Djerbi, S., Lindskog, M., Arvestad, L., Sterky, F., and Teeri, T. T.
(2005) The Genome Sequence of Black Cottonwood (Populus
trichocarpa) Reveals 18 Conserved Cellulose Synthase (CesA) Genes.
Planta Vol. 221, Issue 5, pp 739-46
In addition, previously unpublished data is presented.
Related publications:
A
Uhlén, M., Björling, E., Agaton, C., Al-Khalili Szigyarto, C.,
Amini, B., Andersen, E., Andersson, A-C., Asplund, A.,
Angelidou, P., Asplund, C., Berglund, L., Bergström, K., Cerjan,
D., Ekström, M., Elobeid, A., Eriksson, C., Fagerberg, L., Falk,
R., Fall, J., Forsberg, M., Gumbel, K., Halimi, A., Hallin, I.,
Hamsten, C., Hansson, M., Hedhammar, M., Hercules,
G., Kampf, C., Larsson, K., Lindskog, M., Lund, J., Lundeberg,
J., Magnusson, K., Malm, E., Nilsson, P., Oksvold, P., Olsson,
I., Ottosson, J., Paavilainen, L., Persson, A., Rimini, R.,
Rockberg, J., Runeson, M., Sivertsson, Å., Sköllermo, A., Steen,
J., Stenvall, M., Sterky, F., Strömberg, S., Sundberg, M., Tegel, H.,
Tourle, S., Wahlund, E., Wan, J., Wernérus, H., Westberg, J.,
Wester, K., Wrethagen, U., Xu, L.L., Ödling, J., Öster, E., Hober,
S., and Pontén, F. (2005) A Human Protein Atlas for Normal and
Cancer Tissues Based on Antibody Proteomics. Molecular and
Cellular Proteomics, 2005 Aug 27, Epub ahead of print
B
Pettersson, E., Lindskog, M., Lundeberg, J. and Ahmadian, A.
Tri-nucleotide Threading for Parallel Amplification of Minute Amounts
of Genomic DNA, Submitted
†Publications I and IV are reproduced with kind permission of BioTechniques and Springer Science
and Business Media.
Table of Contents
Abstract
List of Publications
INTRODUCTION
1. The Human Genome
1.1 Towards a Complete Sequence of the Human Genome
1.2 Genetic Evolution
1.2.1 Adaptive Evolution
1.2.2 Duplication Events
1.2.3 Other Evolutionary Factors
1.2.4 Genetic Variation
1.2.5 Defining Homology
1.3 Genome Size, Structure, and Content
1.4 Genes
2. The Proteome and Proteins
2.1 From Genome to Proteome
2.2 Proteome Diversity
2.3 Proteins
3. Proteomics from a Bioinformatic Perspective
3.1 Background
3.2 Protein Expression and Purification
3.3 Two Dimensional Gel Electrophoresis
3.4 Mass Spectrometry
3.5 Structural Proteomics
3.6 Antibodies in Proteomics
3.6.1 Antibodies and Other Affinity Ligands
3.6.2 Antigens and Epitopes
3.6.3 Protein Analysis on Microarrays
3.6.4 Immunohistochemistry
3.6.5 In Vitro Detection
3.7 Prediction of Subcellular Localization
3.8 Post Translational Modifications
3.9 Protein Interactions and Complexes
4. Biological Information Resources
4.1 Genomic Data
4.2 Proteome and Protein Data
4.2.1 Protein Sequence Data and Protein Information
4.2.2 Protein Structure Data
4.2.3 Protein Domains and Families
4.2.4 Protein Interactions
5. Data Searches, Sequence Comparisons, and Biocomputing
5.1 Comparative Genomics and Proteomics
5.2 Sequence Alignments
5.2.1 Pairwise Sequence Alignments
5.2.2 Practical Aspects of Pairwise Alignments
5.2.3 Multiple Sequence Alignments
5.3 Prediction of Signal Sequences and Transmembrane Regions
5.4 Biocomputing – a Practical Approach
PRESENT INVESTIGATION
6. Objectives
7. Design of Protein Epitope Signature Tags (PrESTs) for Antibody-based Proteomics
7.1 Background
7.2 Design of Protein Fragments for Recombinant Protein Production (publication I)
7.3 A Systematic Analysis of Designed PrESTs (publication II)
7.4 Specific PrEST Design Strategies
7.4.1 Antibody-based Proteomics on High Sequence Similarity Proteins (publication III)
7.4.2 Design of Isoform Specific Protein Fragments (unpublished data)
8. Identification and Characterization of Gene Family Members in a Fully Sequenced Tree Genome
8.1 Populus, a Tree Model System
8.2 Cellulose Synthases
8.3 Identification of Previously Unknown CesA Family Members (publication IV)
Concluding Remarks
List of Abbreviations
Acknowledgements
Populärvetenskaplig Sammanfattning (In Swedish)
References
APPENDICES
Publication I: Selection of Protein Epitopes for Antibody Production
Publication II: Design of Protein Epitope Signature Tags for Antibody-based Proteomics
Publication III: Processing of High Sequence Similarity Proteins in Antibody-based Proteomics
Publication IV: The Genome Sequence of Black Cottonwood (Populus trichocarpa) Reveals 18 Conserved Cellulose Synthase (CesA) Genes
Introduction
1. The Human Genome
1.1 Towards a Complete Sequence of the Human Genome
Within each cell of all living organisms on earth lies their genetic blueprint, which
is written in the form of DNA (DeoxyriboNucleic Acid). This blueprint is what
gives each organism its unique characteristic features. Up until the late 1800s,
little was known about the properties and attributes of DNA. Yet, in 1866 Gregor
Mendel set us on the path to unravel its secrets (see figure 1.1). He discovered that
many traits of an organism depend upon two distinct factors, one from the male
and one from the female parent. These factors were later termed genes, carriers
of the genetic information. In 1871 Friedrich Miescher successfully isolated
nuclein (nuclein contains nucleic acids), which he proposed to be the hereditary
material. This idea was refined in 1944 when Oswald Avery described DNA as
the carrier of hereditary information. Nine years later James Watson and Francis
Crick solved the now famous double-stranded helical structure of DNA (Watson
and Crick, 1953). Furthermore, Crick postulated the central dogma
of molecular biology (Crick, 1958), which describes the process in which DNA is transcribed
to mRNA (messenger RiboNucleic Acid), which is subsequently translated into
protein (see chapter 2.1).
The advent (Maxam and Gilbert, 1977; Sanger et al., 1977) and automation (Smith
et al., 1986) of DNA sequencing provided the foundation for deciphering the
information in whole genomes. The Human Genome Project (HGP) was initiated
in 1990, and its objectives were to sequence the human genome, together with four
model organism genomes, by the year 2005. These species included a bacterium
(Escherichia coli), a yeast (Saccharomyces cerevisiae), a nematode (Caenorhabditis
elegans) and a fruit fly (Drosophila melanogaster). The Institute for Genomic Research
(TIGR) finished sequencing of the first bacterial genome (Haemophilus influenzae)
in 1995 (Fleischmann et al., 1995), and in 1998, Celera Genomics introduced
a private genome project initiative. A first glimpse of the human genome was
provided in 1999 with the sequencing of chromosome 22 (Dunham et al., 1999).
Figure 1.1 Events prior to the completion of the human genome sequence. 1866: Gregor Mendel
discovered hereditary factors later termed genes; 1871: The discovery of the molecule nuclein by Friedrich
Miescher; 1944: DNA was described as the hereditary material by Oswald Avery; 1953: The structure of DNA
was revealed (Watson and Crick, 1953); 1958: Postulation of the central dogma of molecular biology (Crick,
1958); 1977: The introduction of DNA sequencing technology (Maxam and Gilbert, 1977; Sanger et al., 1977);
1986: The automation of DNA sequencing (Smith et al., 1986); 1990: The initiation of the HGP; 1995: First
bacterial genome sequence finished (Fleischmann et al., 1995); 1999: The sequence of human chromosome
22 released (Dunham et al., 1999); 2001: A first draft of the human genome sequence was provided (Lander et
al., 2001; Venter et al., 2001); 2003: The release of a high-quality human genome sequence (Stein, 2004).
A first draft of the human genome sequence was released in 2001, concurrently
by the HGP (Lander et al., 2001) and Celera Genomics (Venter et al., 2001).
However, this primary draft included a large number of sequence gaps. In the
spring of 2003, 50 years after the DNA structure was first solved, a high-quality
sequence map was released, covering 99% of the euchromatic (gene-rich) regions
and with the majority of the previously identified gaps bridged (Stein, 2004).
Finally, the largest internationally coordinated genomic research effort ever had
fulfilled its goals by releasing a publicly available human genome sequence that
will provide data for various avenues of research.
1.2 Genetic Evolution
The recent completions of genome sequences from a large number of different
organisms (see chapter 5.1) have allowed the exploration of the evolution of
genomic entities, such as genes and proteins. Genomes are not static, but dynamic,
complex units that continuously change, adapt, and evolve, and are affected by
several evolutionary factors.
1.2.1 Adaptive Evolution
“All living things have much in common, in their chemical composition, their germinal
vesicles, their cellular structure, and their laws of growth and reproduction. Therefore I should
infer that probably all the organic beings which have ever lived on earth have descended from
some one primordial form.”
Charles Darwin, The Origin of Species by Means of Natural Selection.
The theory of natural selection stated by Charles Darwin was highly controversial
in 1859. However, evidence supporting natural selection, or adaptive evolution, has
accumulated over time, and the theory is now widely accepted by the scientific community.
Adaptive evolution is the process by which genes (or organisms) adapt to their
surrounding environment, resulting in evolutionary changes if the adaptations
are beneficial and functional. Genetic regions subjected to adaptive evolution
are most likely functionally relevant. The identification of such regions might
therefore result in predictions concerning their function (Swanson, 2003). For
instance, adaptive evolution has promoted diversity in the antigen recognition
site in the major histocompatibility complex locus (Hughes and Nei, 1988).
1.2.2 Duplication Events
Genetic duplications are crucial for organismal evolution, due to the large
increase of genetic material following a duplication event (reviewed by Taylor
and Raes, 2004). Moreover, it has been stated that two consecutive rounds of
complete genome duplication have provided the genetic material necessary for
the evolution of the vertebrate genome (Ohno, 1970; Sidow, 1996).
Duplication events might arise at different levels: duplications of entire genomes
(polyploidization), of chromosomal segments (segmental duplications), of
individual genes, or of parts of genes. Nonduplicated genes must maintain their
function and accordingly their evolution is generally constrained by selective
pressure (Kent et al., 2003). In contrast, following a duplication event, one or both
duplicates may experience relaxed functional constraints due to redundancy. This
is in some cases followed by an increased evolutionary rate, which can provide novel
gene functions and regulation (Kondrashov et al., 2002). Genes may be altered by
the creation of a new function (neofunctionalization) and/or partitioning of the ancestral
gene function (subfunctionalization), or become silent (nonfunctionalization or
pseudogenization; see figure 1.2; Ohno, 1970; Force et al., 1999). However, the
majority of gene duplicates evolve towards silencing rather than preservation
(Lynch and Conery, 2000) and silenced genes are generally termed pseudogenes
(see chapter 1.4).
Rates of duplication vary among species and genes; shorter genes appear to be
subjected to duplication events more frequently than longer genes (Taylor and
Raes, 2004). In this context, gene duplication is a relatively common
occurrence (an estimated 0.01 duplications/gene/million years), and the average
half-life of duplicated genes is relatively short (around 4.0 million years; Lynch
and Conery, 2000; Gu et al., 2002; Lynch and Conery, 2003; Taylor and Raes,
2004).
Figure 1.2 A schematic view of the possible fates of duplicated genes. A: The classical model of duplicate
gene fates, which consists of non- and neofunctionalization in coding regions (Ohno, 1970). B: Recent
theories propose a more complex chain of events, which includes subfunctionalization as well as alterations
in regulatory regions. The probabilities of the events are proportional to the thickness of the arrows
(adapted from Moore and Purugganan, 2005).
1.2.3 Other Evolutionary Factors
Genome evolution is influenced by deletions and transpositions of genome
segments or genes. In addition, retrotranspositions are of importance; retrotransposed
sequences comprise about 17% of the human genome (Lander et al., 2001). Retrotransposition is
transposition by means of reverse transcription of RNA and subsequent insertion
into the genome.
Horizontal gene transfer, where genes are transferred between species, has also
influenced genome evolution. For instance, about 18% of the genes present in E.
coli appear to have arisen from horizontal gene transfer after the divergence from the
Salmonella lineage about 100 million years ago (Lawrence and Ochman, 1998).
1.2.4 Genetic Variation
Individual genomes within the same species differ as a result of chromosomal
alterations and point mutations, such as single base pair substitutions, insertions
and deletions. Variations, whether they involve single nucleotides or longer
stretches of DNA, are invaluable markers in the mapping of disease-related
genes, in diagnostics and human population studies, as well as in evolutionary
genetics (Ahmadian and Lundeberg, 2002).
The majority of genetic variations consist of single nucleotide polymorphisms
(SNPs), which account for around 85% of all genomic variations (Little, 1999).
A SNP is a genetic variation between individuals in a given population that
is limited to a single base pair and occurs at a frequency exceeding one
percent (Ahmadian and Lundeberg, 2002). SNPs are distributed non-randomly
throughout the human genome, occurring on average at every 1,250th base pair
(Lander et al., 2001). Furthermore, SNPs located in the protein coding regions
and regulatory segments of genes may give rise to phenotypic differences by
altering protein structure and/or function, as well as protein expression levels,
which subsequently might lead to disease. The majority of SNPs, however, are
situated in noncoding regions of the human genome and are commonly used as
convenient markers in population genetics and evolutionary studies (Syvanen,
2001).
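The SNP definition and the average spacing quoted above can be illustrated with a short calculation. The following is a minimal sketch only: the function names are hypothetical, the frequency in the definition is interpreted here as the frequency of the alternative allele among all sampled alleles (an assumption), and the genome size is the approximate three billion bases mentioned in chapter 1.3.

```python
GENOME_SIZE_BP = 3_000_000_000  # approximate haploid genome size (chapter 1.3)
SNP_SPACING_BP = 1_250          # average distance between SNPs (from the text)
MIN_FREQUENCY = 0.01            # frequency threshold in the SNP definition


def is_snp(ref_allele: str, alt_allele: str, alt_count: int, total_alleles: int) -> bool:
    """Return True for a single-base substitution seen in more than 1% of alleles."""
    single_base = (
        len(ref_allele) == 1 and len(alt_allele) == 1 and ref_allele != alt_allele
    )
    if not single_base or total_alleles <= 0:
        return False
    return alt_count / total_alleles > MIN_FREQUENCY


def expected_snp_count(genome_size_bp: int = GENOME_SIZE_BP,
                       spacing_bp: int = SNP_SPACING_BP) -> int:
    """Rough number of SNPs implied by one SNP per spacing_bp base pairs."""
    return genome_size_bp // spacing_bp
```

Under these assumptions, a spacing of one SNP per 1,250 base pairs corresponds to roughly 2.4 million SNPs across the haploid genome, while an insertion or a substitution seen in only 0.5% of alleles would not count as a SNP.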
Identification and amplification of relevant SNPs was facilitated by the introduction
of the polymerase chain reaction (PCR; Mullis and Faloona, 1987; Saiki et al.,
1989). For genome-wide studies, large numbers of SNPs need to be amplified
in parallel (Hardenbol et al., 2003; Fan et al., 2003;
related publication B).
1.2.5 Defining Homology
Evolutionary relationships among genes might be complex and difficult to
define; hence a set of descriptive terms has been coined (reviewed by Fitch,
2000). Homology is defined as the relationship of two characters (e.g. genes)
sharing a common evolutionary origin (Fitch, 1970). Note that homology is not
a measurement of similarity, but a strict statement describing a divergent relation
between sequences.
There are a number of different subtypes of homology. Orthology is defined
as the relationship of two homologous genes that diverged through speciation (Fitch,
1970). Paralogy is the relationship of two genes arising from a duplication event
(Fitch, 1970). However, paralogy falls short in specifying whether the genes described
are members of the same species. For instance, if duplication is subsequently
followed by speciation, two genes belonging to different species can be defined
as paralogs. As a consequence, two subtypes of paralogy have been proposed to
bring clarity to this issue: paralogs within the same species and paralogs residing
in different species, termed inparalogs and outparalogs, respectively (Sonnhammer
and Koonin, 2002). Xenology, the third subtype of homology, is defined as the
relationship of two homologous genes involving a horizontal (interspecies) gene
transfer (see chapter 1.2.3) for at least one of the genes (see figure 1.3; Gray and
Fitch, 1983). In contrast to homology, analogy is defined as the relationship of
two genes that have descended from unrelated ancestors by convergence (Fitch,
1970). Unfortunately, these definitions of homology have been, and still are,
inconsistently used throughout the research community (Fitch, 2000; Jensen,
2001; Koonin, 2001; Petsko, 2001).
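The definitions above can be summarized as a small decision rule. The sketch below is illustrative only: the function name and its arguments are hypothetical, and it follows the simplified reading used in this section, in which inparalogs and outparalogs are distinguished by whether the two genes reside in the same species.

```python
def classify_homology(divergence_event: str,
                      same_species: bool,
                      horizontal_transfer: bool = False) -> str:
    """Name the homology subtype of two homologous genes.

    divergence_event: "speciation" or "duplication", the event at which the
        two gene lineages separated from their common ancestor.
    same_species: True if both genes reside in the same species.
    horizontal_transfer: True if at least one of the genes has undergone a
        horizontal gene transfer.
    """
    if horizontal_transfer:
        return "xenologs"
    if divergence_event == "speciation":
        return "orthologs"
    if divergence_event == "duplication":
        return "inparalogs" if same_species else "outparalogs"
    raise ValueError(f"unknown divergence event: {divergence_event!r}")
```

For example, two genes separated by a duplication that was later followed by speciation, and that now reside in different species, would be classified as outparalogs under this rule.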
Figure 1.3 The three different subtypes of homology: orthology, paralogy, and xenology. A view of a
hypothetical evolution of genes derived from a common ancestor. Speciation events (1, 3) are depicted
by an upside down ‘Y’ and duplication events (2, 4) by a horizontal bar. Orthologs are two genes sharing a
common ancestor by a speciation event (e.g. B1-C1) and xenologs are genes among which at least one of
the genes has undergone a horizontal gene transfer (AB1 is xenologous to all other genes in the figure).
Paralogs are genes whose common ancestor resides at a duplication event (e.g. B1-C2 and C2-C3). C1-C3
are inparalogs, thus belonging to the same species, and B1-C2 are outparalogs, as they reside in different
species. The latter case arises because a duplication event (2) is followed by a speciation event (3; adapted
from Fitch, 2000).
1.3 Genome Size, Structure, and Content
Following the completion of the human genome (see chapter 1.1), in-depth
sequence analyses have to take into consideration the specific properties of the
genome. Genetic information is coded in DNA, which consists of four different
building blocks, nucleotides, which are designated A (Adenine), C (Cytosine),
G (Guanine) and T (Thymine). DNA predominantly exists in a double-stranded
configuration, and has a self-complementary property. This property facilitates the
recovery of the complementary strand, which is essential during cell division when
a cell needs to replicate its DNA (see figure 2.1). Moreover, DNA is tightly bundled
in the nucleus by several levels of organization. At the highest level, human DNA
is split into units called chromosomes. The haploid human genome consists of
around three billion bases which are distributed among 24 chromosomes (see
table 1.1).
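The self-complementary property described above means that one strand fully determines the other. A minimal sketch, assuming standard Watson-Crick pairing (A with T, C with G) and sequences written in the 5' to 3' direction; the function name is illustrative:

```python
# Watson-Crick base pairing
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read in its own 5' to 3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))
```

Applying the function twice returns the original strand, which reflects why either strand can serve as a template for recovering the other during replication.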
Only about 1.5% of the genetic material constitutes protein coding regions, so
called exons. About 48% of the genome consists of unique intergenic
DNA, for which no function has yet been identified. The majority of this unique
DNA is probably an evolutionary artifact, lacking present function. Furthermore,
49% of the human genome consists of repetitive elements (REs), and a small part,
about 0.5%, comprises nonfunctional pseudogenes (see chapter 1.4; Makalowski,
2000). There are more than 4.3 million REs in the human genome, and many
translated REs are found in proteins, thus having importance for protein
function (Li, 2001). Although the specific function of these repetitive elements
is still unknown, they do interact with the genome and influence its evolution
(Makalowski, 2000).
Chromosome   Size (Mbp)   % of genome   Number of genes   % of genes
1            245.5        7.98          2,281             10.12
2            243.0        7.90          1,482             6.58
3            199.5        6.48          1,168             5.18
4            191.4        6.22          866               3.84
5            180.9        5.88          970               4.31
6            171.0        5.56          1,152             5.11
7            158.6        5.16          1,116             4.95
8            146.3        4.75          794               3.52
9            138.4        4.50          919               4.08
10           135.4        4.40          862               3.83
11           134.5        4.37          1,426             6.33
12           132.4        4.30          1,104             4.90
13           114.1        3.71          399               1.77
14           106.4        3.46          733               3.25
15           100.3        3.26          766               3.40
16           88.8         2.89          957               4.25
17           78.8         2.56          1,257             5.58
18           76.1         2.47          322               1.43
19           63.8         2.07          1,468             6.52
20           62.4         2.03          631               2.80
21           46.9         1.53          271               1.20
22           49.5         1.61          552               2.45
X            154.8        5.03          931               4.13
Y            57.7         1.88          104               0.46
Table 1.1 Chromosomal data and gene content of the human genome (based on data from Ensembl v.
33.35d).
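The percentage columns of Table 1.1 follow directly from the size and gene count columns, and the same numbers give a simple gene density per chromosome. The sketch below recomputes these values from the table; the variable and function names are illustrative, and gene density (genes per Mbp) is a derived quantity that is not itself listed in the table.

```python
# (chromosome, size in Mbp, number of genes), copied from Table 1.1
TABLE_1_1 = [
    ("1", 245.5, 2281), ("2", 243.0, 1482), ("3", 199.5, 1168),
    ("4", 191.4, 866), ("5", 180.9, 970), ("6", 171.0, 1152),
    ("7", 158.6, 1116), ("8", 146.3, 794), ("9", 138.4, 919),
    ("10", 135.4, 862), ("11", 134.5, 1426), ("12", 132.4, 1104),
    ("13", 114.1, 399), ("14", 106.4, 733), ("15", 100.3, 766),
    ("16", 88.8, 957), ("17", 78.8, 1257), ("18", 76.1, 322),
    ("19", 63.8, 1468), ("20", 62.4, 631), ("21", 46.9, 271),
    ("22", 49.5, 552), ("X", 154.8, 931), ("Y", 57.7, 104),
]


def percent_share(value: float, total: float) -> float:
    """Share of a total, in percent, rounded to two decimals as in the table."""
    return round(100.0 * value / total, 2)


def gene_density(size_mbp: float, n_genes: int) -> float:
    """Number of genes per megabase pair."""
    return n_genes / size_mbp


TOTAL_SIZE_MBP = sum(size for _, size, _ in TABLE_1_1)
TOTAL_GENES = sum(genes for _, _, genes in TABLE_1_1)
DENSITY = {name: gene_density(size, genes) for name, size, genes in TABLE_1_1}
```

With these figures, chromosome 19 is the most gene-dense (about 23 genes per Mbp) and the Y chromosome the least (about 1.8 genes per Mbp).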
1.4 Genes
A gene is a stretch of DNA coding for a protein, and might be located on either
DNA strand. In eukaryotic organisms, genes are split into separate segments
where protein coding exons are interrupted by stretches of non-coding
intervening sequences (introns). In humans, the exons have an average length
of 150 nucleotides and the substantially longer introns generally span
around 3.5 kilobases (kb). Yet, intron lengths as large as 500 kb have been observed
(Rowen et al., 2002). In a process termed splicing, the introns are removed from
the transcribed RNA and the protein coding elements are assembled into mature
mRNAs which are subsequently translated into proteins. If alternative splicing
occurs, the protein coding exons are joined differently. This results in various
protein isoforms with distinct functions, and is an important source of protein
diversity (see chapter 2.2).
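Splicing, as described above, can be pictured as joining the exon segments of a transcript while discarding the introns. The sketch below is a toy illustration: the sequence, the exon coordinates (0-based, end-exclusive) and the function name are invented for the example, and exons are written in upper case and introns in lower case purely for readability.

```python
from typing import Iterable, Sequence, Tuple

# Toy pre-mRNA: three exons (upper case) separated by two introns (lower case)
PRE_MRNA = "AUGGCC" + "guaaguuuag" + "AAAUUU" + "guccccag" + "GGGUAA"
EXONS = [(0, 6), (16, 22), (30, 36)]  # hypothetical exon coordinates


def splice(pre_mrna: str,
           exons: Sequence[Tuple[int, int]],
           skipped: Iterable[int] = ()) -> str:
    """Join exon segments in order, leaving out exons whose index is in skipped."""
    skipped_set = set(skipped)
    return "".join(
        pre_mrna[start:end]
        for index, (start, end) in enumerate(exons)
        if index not in skipped_set
    )


full_length = splice(PRE_MRNA, EXONS)                      # all three exons joined
exon_skipped = splice(PRE_MRNA, EXONS, skipped=[1])        # middle exon left out
```

The two resulting mRNAs differ only in the presence of the middle exon, which is the simplest case of how one gene can give rise to several isoforms.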
Through duplication of DNA segments (see chapter 1.2.2), a single gene can give
rise to multiple copies of duplicated genes, which retain parts of their functionality.
These genes might be referred to as a gene family (see chapter 8.2). The exact
definition of a gene family differs, ranging from at least 35% sequence identity
between corresponding protein sequences (Orengo et al., 2002), to merely
sharing a common protein fold. However, members of a gene family are clearly
evolutionarily related, often having similar or related functions.
Genes with close similarity to their respective paralogs, but unable to code for a
functional protein, are termed pseudogenes (reviewed by Mighell et al., 2000).
Pseudogenization is the most common fate for duplicated genes (see chapter
1.2.2). According to the theory of positive selection, one or several of the duplicates are
unconstrained by selective pressure, therefore randomly accumulating mutations.
These mutations might subsequently silence gene function via disruptions of
the original reading frame (e.g. frame shifts or introduction of a stop codon).
Pseudogenes exist as two different types, processed and nonprocessed. Processed
pseudogenes are formed through retrotransposition of mature RNAs, and these
genes lack upstream promoters, as a result of random integration into the genome.
Nonprocessed pseudogenes most commonly arise after a duplication event,
followed by nonfunctionalization (see chapter 1.2.2). Pseudogenes are thought
to be responsible for misincorporations in gene collections, and as much as 20%
of predicted genes are thought to consist of nonfunctional pseudogenes (Lander
et al., 2001; Mounsey et al., 2002). However, the total number of pseudogenes and
their genomic locations are in many cases still unknown (Torrents et al., 2003).
2. The Proteome and Proteins
2.1 From Genome to Proteome
The majority of the genes in the human genome code for functional proteins, and
the path from gene to protein includes many turns before the final gene product
is reached. The process where DNA is transcribed into RNA and thereafter
translated into protein is termed the central dogma of molecular biology (see
figure 2.1). Furthermore, the dogma includes DNA replication, in which DNA is
copied during cell division. Following the transcription of DNA into mRNA, the
mRNA migrates from the nucleus to the cytoplasm and encounters ribosomes.
The ribosomes provide a framework for holding mRNA in position during protein
synthesis, a process known as translation. After translation, proteins might be
subjected to a wide variety of modifications which alter the protein properties by
adding a modifying group to one or several amino acids, or by proteolytic cleavage
(Mann and Jensen, 2003). There are hundreds of different post-translational
modifications (PTMs), for instance cleavage of signal sequences, glycosylations,
and phosphorylations (Garavelli, 2004). The PTMs might determine the localization,
activity, interactions and turnover of the proteins (Mann and Jensen, 2003).
Moreover, mRNA might be subjected to pre-translational editing, resulting in
amino acid changes not inferable from the DNA sequence.
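The two main steps described above, transcription and translation, can be sketched on a short sequence. This is a simplified illustration under stated assumptions: the input is taken to be the coding (sense) strand, so transcription reduces to replacing T with U; the codon table is a deliberately small subset of the standard genetic code, sufficient only for the example sequence; and splicing, RNA editing and PTMs are ignored. All names are illustrative.

```python
# Subset of the standard genetic code ("*" marks a stop codon)
CODON_TABLE = {
    "AUG": "M", "GCC": "A", "AAA": "K", "UUU": "F", "GGG": "G",
    "UAA": "*", "UAG": "*", "UGA": "*",
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA corresponding to a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon until a stop codon or the end is reached."""
    protein = []
    for position in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[position:position + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCAAATTTGGGTAA")
protein = translate(mrna)
```

A full implementation would need the complete 64-codon table; codons missing from the subset used here raise a KeyError rather than being silently skipped.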
Figure 2.1 The central dogma of molecular biology, which comprises transcription, translation, and
replication. The dogma is displayed here in combination with the process of reverse transcription as well as
with factors contributing to proteome diversity such as RNA editing and post-translational modifications.
Exons are colored in green and introns in grey; however, introns are usually much longer than exons (see
chapter 1.4). cDNA (complementary DNA) contains only exons and is therefore colored in green. mRNA
might be subjected to pre-translational editing and these edits are shown here in orange. Moreover,
proteins can be modified after translation (e.g. phosphorylations) and these modifications are shown
in red.
In addition, transcription can be reversed in vitro by using the enzyme reverse
transcriptase, in a process where mRNA functions as a template for conversion
into cDNA (complementary DNA). cDNA contains only exons and is often used
in studies when coding gene regions are of interest.
2.2 Proteome Diversity
In 1994, Marc Wilkins first coined the term proteome as the set of proteins
encoded by the genome (Wilkins et al., 1996). The proteome consists of a vast
array of proteins, among which diversity is generated by events such as alternative
splicing, somatic DNA rearrangements, and PTMs. At least one-third of the human
genes are subjected to alternative splicing and there are several different types of
alternative splicing events (see figure 2.2). Different selections of splice sites, such
as alternative 5’ or 3’ splice sites, give rise to alternative isoforms. In addition, although rarely
occurring, entire introns might be retained and transcribed (Garcia-Blanco et al.,
2004). Alternative splicing often results in additional protein domains, rather
than alternating domains (Mironov et al., 1999).
Figure 2.2 Different types of alternative splicing events. Constitutive exons are shown in light grey color,
alternative exons are dark grey, and the intron is dark grey with stripes. Exons can have multiple 5’- or 3’-splice sites (A, B). Fully contained exons that are alternatively used are called cassette exons. Single cassette
exons can reside between two constitutive exons and the alternative exon is either skipped or included
(C). Multiple cassette exons can be located between two constitutive exons, and the splicing machinery
must choose between them (D). Finally, a lack of splicing might result in intron retention (E; adapted from
Graveley, 2001).
Protein class            Description                                         Number of proteins
Nonredundant proteins    A protein representative from every gene locus      20,000-25,000
Protein isoforms         Variants obtained by alternative splicing           50,000-100,000
Combinatorial variants   Proteins generated by somatic DNA rearrangements    >10,000,000
Protein species          Proteins that differ as a result of PTMs            >100,000
Protein alleles          Differences in genetic variations resulting         75,000-150,000

Table 2.1 Estimated number of protein molecules in the human proteome, as a result of different
classifications (adapted from Uhlen and Ponten, 2005).
This extensive diversity of the proteome might explain why species that are
highly similar in their DNA differ substantially at the phenotypic level. Indeed,
recent studies imply that human and chimpanzee genomes differ by only 1.2%
in intergenic regions and 0.6% in protein coding regions (Ebersberger
et al., 2002; Sakate et al., 2003). Since the proteome is not fully characterized,
differences between proteomes can so far only be estimated.
Including all protein variants, the human proteome is estimated to contain millions
of different molecular species (O’Donovan et al., 2001; Bauer and Kuster, 2003;
Uhlen and Ponten, 2005). However, the total number clearly depends on how
the proteome is defined. The proteome might be divided into five sub-proteomic
classes, which all contribute to the total number of protein molecules
(see table 2.1; Uhlen and Ponten, 2005). One class constitutes a nonredundant
protein set, in which every gene locus contributes a single representative
protein. The current human gene count is 22,218 (Ensembl v33.35; Hubbard et
al., 2005), and although this number will change, the final number in this class is
most probably between 20,000 and 25,000. Another class comprises the large
number of proteins that result from alternative splicing events; this
category is estimated to contain between 50,000 and 100,000 protein isoforms. Somatic
rearrangements involved in the immune response give rise to a third class. For
instance, the estimated number of different IgG molecule rearrangements might
be as many as ten million. Many proteins undergo PTMs (see chapter 2.1), and
this category might account for approximately 100,000 different protein molecules.
However, this number is difficult to estimate since many modifications are
stepwise processes. Finally, there is a myriad of protein variants among which
differences are due to coding SNPs (see chapter 1.2.4).
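To illustrate how strongly the definition affects the total, the estimates in table 2.1 can be combined in a short calculation. The following Python sketch is only an illustration: it assumes that the five classes can simply be added, and it uses the lower bound of each range, with open-ended estimates (">") represented by their stated minimum. The names CLASS_ESTIMATES, lower_bound_total and share_of_lower_bound are introduced here for the example.

```python
# Estimated ranges from table 2.1 as (low, high); None marks an open-ended ">" estimate.
CLASS_ESTIMATES = {
    "nonredundant proteins": (20_000, 25_000),
    "protein isoforms": (50_000, 100_000),
    "combinatorial variants": (10_000_000, None),
    "protein species": (100_000, None),
    "protein alleles": (75_000, 150_000),
}


def lower_bound_total(estimates):
    """Sum the lower bounds of all classes (assumes the classes are additive)."""
    return sum(low for low, _high in estimates.values())


def share_of_lower_bound(estimates, name):
    """Fraction of the lower-bound total contributed by one class."""
    return estimates[name][0] / lower_bound_total(estimates)


if __name__ == "__main__":
    print(f"Lower-bound total: {lower_bound_total(CLASS_ESTIMATES):,}")
    for name in CLASS_ESTIMATES:
        print(f"{name}: {share_of_lower_bound(CLASS_ESTIMATES, name):.1%}")
```

Under these assumptions, the total is dominated by the combinatorial variants, which is why a count that excludes somatic rearrangements is orders of magnitude smaller.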
2.3 Proteins
Proteins display a wide range of different functions (e.g. as building blocks,
enzymes, cellular signal mediators, and receptors). They consist of 20 different
building blocks, called amino acids, which differ in size, shape, hydrophobicity,
and charge. Protein structure is complex and comprises four different
levels of structural organization. The order in which the amino acids are linked
together is the primary structure. The secondary structure of proteins refers to
the elementary structural patterns present in most known proteins, where the
two main structures are alpha helices and beta sheets. The tertiary structure is
the complete three-dimensional folded structure of the protein chain (see figure
2.3). The final structural level, the quaternary structure, is only present if the
protein contains more than one chain, and refers to the chain organization and
interconnections. For example, hemoglobin consists of four amino
acid chains.
Figure 2.3 The protein structure of the enzyme cellobiose dehydrogenase, shown here on a cellulose
surface. Secondary structural elements are colored; alpha helices are red and beta sheets are blue.
3.1 Background
Proteins are key functional and structural molecules. Therefore, the systematic and
extensive molecular characterization of the proteome, proteomics, is necessary
for understanding biological processes and systems (reviewed by de Hoog and
Mann, 2004). Proteomics is a broad field which encompasses identification,
quantification, and characterization of proteins expressed in various cells
and tissues. Moreover, it includes methods for structural determination and
characterization of PTMs, the use of antibodies as protein capture agents as well
as bioinformatic approaches. A large-scale single initiative similar to the HGP (see
chapter 1.1) is not possible due to the complex and dynamic nature of proteins.
In proteomics, many different approaches in parallel are necessary (Lesley, 2001)
and often used in combination to complement each other (see figure 3.1). As
most drug targets are proteins, it is clear that proteomics will have a great impact
on drug discovery and clinical diagnostics (Tyers and Mann, 2003).
Genomic and proteomic initiatives are producing an avalanche of data. Public
databases (see chapter 4) are continuously expanding and sequence repositories
grow at exponential rates (Benson et al., 2005). As a result of the enormous
amounts of data, computer systems have become essential to researchers.
Bioinformatics is frequently described as the organization and interpretation
of biological data using computational systems and methods (reviewed by Yu et
al., 2004). It is positioned at the interface of biology, computer science, statistics,
and mathematics (Benton, 1996; Luscombe et al., 2001). The bioinformatics
discipline aims at organizing and structuring biological research data in
a human-readable format. Moreover, it focuses on the development of software,
tools, and resources that facilitate data analysis, as well as their use
for visualization and interpretation of data in a relevant fashion. Bioinformatic
methods and systems, such as sequence analysis and comparative proteomics (see
chapter 5), might provide valuable aid and insights in different areas of
research. Proteomics approaches are currently highly dependent on computer-assisted systems and bioinformatic methods (see figure 3.1).
3. Proteomics from a Bioinformatic Perspective
[Figure 3.1 diagram labels: Mass spectrometry; Two-dimensional gel electrophoresis; Post-translational modifications; Protein-protein interactions; In silico predictions; Protein arrays; Structural proteomics; Antibody-based proteomics; Localization]
Figure 3.1 An overview of proteomic methods and their relations to each other. Moreover, a number of
approaches can be complemented with corresponding in silico prediction methods. ‘Antibody-based
proteomics’ includes methods such as ELISA (Enzyme-Linked ImmunoSorbent Assay), Western blot analysis
(see chapter 3.7.2), and immunoprecipitation, whereas ‘Structural proteomics’ comprises NMR (Nuclear
Magnetic Resonance) and x-ray crystallography. The dashed arrows from ‘Mass spectrometry’ and ‘Two-dimensional gel electrophoresis’ to ‘In silico predictions’ symbolize database screening against known data
(see chapter 3.3).
3.2 Protein Expression and Purification
A challenge when analyzing proteins from a proteome-wide point of view is the
expression and purification of human proteins in large numbers (Lesley, 2001).
There are both prokaryotic and eukaryotic protein expression hosts, where the
most frequently used bacterial host is E. coli. This bacterium grows rapidly,
generates high yields, and has a thoroughly characterized genome.
Furthermore, E. coli exists as a multitude of different mutant strains and is
compatible with a large number of cloning vectors (Lu et al., 1996; Baneyx,
1999). However, the approach of expressing eukaryotic proteins in prokaryotic
cell systems commonly encounters problems with proper folding and satisfactory
expression levels (Lesley, 2001). Therefore, some researchers use eukaryotic
expression hosts such as the predominantly used S. cerevisiae, which is available in
various different strains (Gellissen et al., 1992; Cregg et al., 2000).
A common procedure to analyse unknown human proteins begins with the
identification of gene function by querying publicly available datasets (see
chapter 4). This is followed by amplification of the gene from total RNA by reverse
transcriptase polymerase chain reaction (RT-PCR) to generate cDNA (see chapter
2.1). Amplified gene products are subsequently cloned into a suitable vector (Liu,
1999; Hartley et al., 2000; Graslund et al., 2002). The cells expressing the proteins
are subsequently cultivated, the proteins harvested and finally purified.
3.3 Two Dimensional Gel Electrophoresis
The widely used two-dimensional gel electrophoresis (2DE) technology was initially
described 30 years ago (O’Farrell, 1975) and allows for separation of proteins in a
polyacrylamide gel according to isoelectric point and molecular weight. As many
as 10,000 proteins can simultaneously be analyzed on a single gel (Jungblut et al.,
1996). Following separation, the proteins are visualized by staining with either
Coomassie blue, if the protein is in high abundance, or silver staining, which
is more sensitive (Shevchenko et al., 1996). Separation and detection of PTMs
of proteins is possible in 2DE; however, 2DE fails to analyse
certain types of proteins. For instance, proteins expressed in low copy number
are not detected by the staining (Santoni et al., 2000). 2DE is frequently used with
mass spectrometry (MS) and bioinformatics. A typical workflow includes protein solubilization of
tissue sample, followed by 2DE separation of the proteins, and computer analysis
of protein spot patterns. Subsequently, the isolated protein spots are subjected to
proteolytic digestion and characterization by MS, and the generated information
is screened against databases for identification of the proteins.
3.4 Mass Spectrometry
MS is a method that has been further developed due to the recent availability
of genome sequence databases (see chapter 4) and technical breakthroughs
(Aebersold and Mann, 2003) and has become an important technique for
protein identification. In MS, individual molecules are analyzed by ionization
and measurement of their trajectories in response to electric and/or magnetic
fields. Analyses by MS are capable of revealing two specific protein features, the
peptide-mass fingerprint and product-ion spectra. Peptide-mass fingerprints are
determined by MS and product-ion spectra by tandem MS (MS/MS) or multistage MS (MSn).
The two techniques are commonly used in combination, since they
generate complementary information on the target protein (Beranova-Giorgianni, 2003). MS is used in conjunction with several other applications (see
figure 3.1). For instance, MS and 2DE (see chapter 3.3) in combination with
antibodies (see chapter 3.6.1) recognizing phosphorylated proteins may facilitate
the identification and characterization of PTMs (Mann and Jensen, 2003).
The interpretation of large sets of acquired MS data is a cumbersome task and
underlines the need for MS spectra analysis software. Several different analysis
algorithms have been developed but need further refinements in order to finally
supersede the need of human intervention (Blueggel et al., 2004).
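The database screening behind a peptide-mass fingerprint can be sketched in a few lines of Python. The example below is a simplified illustration and not the algorithm of any particular MS software: it assumes an idealized tryptic digestion (cleavage C-terminal to K or R, but not before P, with no missed cleavages), uses standard monoisotopic residue masses, and ignores modifications and charge states. The function names are introduced here for the example, and splitting on an empty match requires Python 3.7 or later.

```python
import re

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(sequence):
    """Split after K or R unless the next residue is P (idealized trypsin rule)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fingerprint(sequence):
    """Theoretical peptide-mass fingerprint: sorted masses of all tryptic peptides."""
    return sorted(round(peptide_mass(p), 4) for p in tryptic_digest(sequence))


def count_matches(observed, theoretical, tolerance=0.01):
    """Number of observed masses lying within `tolerance` Da of a theoretical mass."""
    return sum(any(abs(o - t) <= tolerance for t in theoretical) for o in observed)
```

In a database search, count_matches would be evaluated against the theoretical fingerprint of every candidate sequence, and the candidates ranked by the number of matching masses. Real search engines use statistical scoring rather than a plain count.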
Inclusion bodies are frequently formed in E. coli, and these aggregates can often
be avoided by production of recombinant fusion proteins (Lu et al., 1996). Fusion
tags are also used to facilitate purification and handling of recombinant proteins.
For instance, the histidine-tag enables purification on immobilized metal ion
affinity chromatography (IMAC) columns (Lesley, 2001). Moreover, fusion
tags such as thioredoxin (Lu et al., 1996) provide a means to significantly raise
expression levels.
3.5 Structural Proteomics
The functions of proteins are related to their structure, and therefore three-dimensional (3D) protein structures are of great importance. The field of
structural proteomics (also known as structural genomics) aims at mapping protein
space, and determining 3D structures for all possible protein folds (Vitkup et
al., 2001). Following sequencing of previously uncharacterized genes, a structure
might be deduced for 10-40% of the corresponding protein sequences by comparison
to sequences of known structure (Rost et al., 2003). Protein structure is more
evolutionarily conserved than amino acid sequence (Blundell and Mizuguchi,
2000), and often provides more functional information than the sequence itself.
This structural conservation is the result of retained protein interaction sites
consisting of residues situated in close proximity in space, but possibly distant in
sequence location (Sanchez and Sali, 1998).
Structures for proteins that are easily expressed, crystallized and diffracted are
solved by x-ray crystallographic methods (Terwilliger and Berendzen, 1999).
Alternatively, proteins that fail to crystallize might be structurally determined by
nuclear magnetic resonance (NMR; Montelione and Anderson, 1999).
Various computer-based methods for prediction of protein structure are in use
(reviewed by Rost et al., 2003). Sequence-based large-scale structure predictions
generally consist of sequence similarity searches against sequences with known
3D-structure followed by model building from detected structural templates, and
finally a validation of the resulting models (Skolnick et al., 2000). Searching for
proteins sharing structural similarities, but lacking obvious sequence similarities,
is performed with fold recognition (threading) methods (Jones, 1999). Threading
compares amino acid sequences to known 3D structures, and
subsequently evaluates how well the sequences fit these structures (Rost et al., 1997).
Furthermore, fold recognition allows for identification of analogous (see chapter
1.2.5) pairs of proteins with no significant sequence similarities (Jones, 1999).
3.6 Antibodies in Proteomics
Antibodies have been used in protein studies for decades, owing to their strong
affinity and high specificity in target binding. Currently, antibodies are used in
several different applications such as immunolocalization studies (see chapter
3.6.4; reviewed by Uhlen and Ponten, 2005), in vitro detection (see chapter 3.6.5)
and analyses on arrays (see chapter 3.6.3). Furthermore, antibodies have been
modified by in vitro selection techniques (see chapter 3.6.1) to enhance their
properties and increase their specificities and affinities.
3.6.1 Antibodies and Other Affinity Ligands
Advances in purification methods and selection techniques (e.g. phage display)
have fueled the emergence of several different kinds of antibodies. As of today,
four major types of antibodies are commonly used: polyclonal antibodies (pAbs),
monoclonal antibodies (mAbs), mono-specific antibodies (msAbs), and other
classes of affinity reagents.
pAbs, derived from immunization of suitable animals (e.g. rabbits), were the
primary antibody source until 30 years ago. The introduction of antigens (see
chapter 3.6.2) into a host animal induces an immune response resulting in
antigen-specific antibodies. This pool of antibodies binds to multiple epitope sites
on the antigen. This multi-epitope binding contributes to a tolerance towards
minor structural changes in the antigen, such as partial denaturation or PTMs
3.6.2 Antigens and Epitopes
The generation of antibodies requires selection of target antigens and three
different types of protein antigens are generally produced: full-length proteins,
synthetically produced peptides, and recombinant protein fragments. Full-length
protein antigens are used for correct folding of the native protein, or when
studying PTMs (Miroux and Walker, 1996). The use of a correctly folded full-length protein might however be questionable in some applications. For instance,
if the protein is used to immunize animals, it is most likely slightly denatured
upon mixing with different adjuvants (Uhlen and Ponten, 2005). Synthetically
generated peptides are the shortest of the antigen-types, usually spanning less
than 40 residues, and therefore the peptides are unlikely to be correctly folded.
However, antibodies towards synthetic peptides have successfully been generated
(Meloen et al., 2003). Recombinant protein fragments have the great advantage
of being easy to scale up for whole-proteome studies (Agaton et al., 2003). These
protein fragments are shorter than full-length proteins, thereby facilitating the
cloning by usage of relatively short PCR fragments and the fusion to affinity-tags
(see chapter 3.2) allows for high-throughput purification.
(see chapter 2.1). This makes pAbs suitable for multi-platform efforts wherein
proteins often exist in different forms. However, pAbs are not renewable and may
exhibit high cross-reactivities (Nilsson et al., 2005).
The introduction and development of the mAb technology (Kohler and Milstein,
1975), allowed the use of antibodies as reproducible chemical entities. This was
made possible as a result of the development of immortal hybridoma cell lines
continuously producing the selected mAbs (Borrebaeck, 2000). Currently, mAbs
are the key immunoreagents for medical diagnostic applications (Borrebaeck,
2000). However, mAbs have shortcomings in large-scale efforts, as a result of the
need for time-consuming screening (Uhlen and Ponten, 2005). In contrast to
pAbs, mAbs recognize single epitopes and are therefore less versatile for multi-platform functional and analytical applications which often involve proteins in
denatured forms. For instance, proteins are often exposed to formalin in tissue
fixation for immunolocalization studies (see chapter 3.6.4).
msAbs are enriched from polyclonal antisera by high stringency affinity-purification.
Therefore, msAbs have the same advantages as pAbs, but display low cross-reactive
tendencies as a result of the purification (Uhlen and Ponten, 2005). Preliminary
studies also indicate that reimmunizations with the target antigen in combination
with the stringent immunoaffinity purification yield antibodies with very similar
binding characteristics (related publication A). The msAbs might therefore be
considered as a renewable resource, and they have been used in studies involving
extensive tissue profiling (Agaton et al., 2003; Kampf et al., 2004).
More recently, techniques have been developed that allow antibodies and other
affinity reagents to be developed without the involvement of laboratory animals.
These affinity ligands and antibodies are developed with in vitro selection
techniques, such as phage display (Bradbury and Marks, 2004) or ribosomal
display (Lipovsek and Pluckthun, 2004). They include antibody fragments, such as
scFvs (single chain fragment) and Fabs (Better et al., 1988; Liu and Marks, 2000)
and other protein-based reagents developed by recombinant protein techniques
(reviewed by Hey et al., 2005) such as affibodies (Nord et al., 1997).
Antibodies recognize different antigenic determinants, epitopes, which are either
linear or discontinuous. Linear epitopes consist of a continuous stretch of amino
acids, whereas an epitope with residues in close proximity in space, but distantly
located in sequence, is discontinuous (Blythe and Flower, 2005). The majority of
epitopes are discontinuous (Barlow et al., 1986); however, most in silico methods
focus on predicting linear epitopes. The majority of algorithms use amino
acid based scales to predict epitopes, e.g. to find local hydrophilicity maxima (Hopp
and Woods, 1981). However, the most recent and comprehensive study
shows no significant correlation between predicted and previously experimentally
mapped epitopes. Furthermore, current epitope predictions perform only slightly
better than random chance (Blythe and Flower, 2005).
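A scale-based prediction of the kind introduced by Hopp and Woods can be sketched as a sliding-window average. The Python example below is a minimal illustration rather than a reimplementation of any published tool: the scale values are those commonly tabulated for the Hopp-Woods hydrophilicity scale and should be verified against the original publication before use, the window length of six residues is a default chosen for the example, and the function names are hypothetical.

```python
# Hopp-Woods hydrophilicity values as commonly tabulated (verify before use).
HOPP_WOODS = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3, "N": 0.2, "Q": 0.2,
    "G": 0.0, "P": 0.0, "T": -0.4, "A": -0.5, "H": -0.5, "C": -1.0,
    "M": -1.3, "V": -1.5, "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5,
    "W": -3.4,
}


def hydrophilicity_profile(sequence, window=6):
    """Average hydrophilicity of every window of `window` consecutive residues."""
    if len(sequence) < window:
        return []
    scores = [HOPP_WOODS[aa] for aa in sequence]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def most_hydrophilic_window(sequence, window=6):
    """Return (start index, peptide, score) of the highest-scoring window, or None."""
    profile = hydrophilicity_profile(sequence, window)
    if not profile:
        return None
    start = max(range(len(profile)), key=profile.__getitem__)
    return start, sequence[start:start + window], profile[start]


if __name__ == "__main__":
    print(most_hydrophilic_window("LLLLLLKDEKRDLLLLLL"))
```

The highest-scoring window is only a candidate region. As noted above, such scale-based predictions correlate poorly with experimentally mapped epitopes and cannot capture discontinuous epitopes at all.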
3.6.3 Protein Analysis on Microarrays
Following in the footsteps of DNA microarrays (Schena et al., 1995), the development
of protein-based array technologies has progressed at a rapid rate. In contrast
to traditional experiments where antibodies are used in very limited numbers,
simultaneous analyses of large numbers of protein samples are now possible by
using antibodies as binding agents (Bradbury et al., 2004; Nilsson et al., 2005).
Protein arrays are highly specific and experiments have successfully identified a
single protein among 10,000 others (MacBeath and Schreiber, 2000). Microarrays
consisting of protein-based capture ligands as probes, or proteins themselves
being fixed to the surface of a silicon chip, are valuable protein analysis tools.
There are three standard types of protein-detecting arrays (see figure 3.2).
On antibody arrays, antibodies are fixed to a solid support and subsequently
incubated with pre-labelled antigens (Haab, 2001; Haab et al., 2001; Schweitzer
and Kingsmore, 2002), prior to the detection of antibody-antigen binding. On
sandwich arrays, antibodies are fixed on the slide and captured protein targets
are subsequently detected using an additional labeled antibody (Schweitzer and
Kingsmore, 2002). Finally, on reverse phase protein arrays, proteins or protein
mixtures are immobilized and detected upon binding to labeled antibodies. These
experiments might be performed with primary or secondary labeled antibodies
(Paweletz et al., 2001). It is important to keep in mind that the labeling of proteins
might result in chemical modifications, including epitope damage and affinity
changes (Barry and Soloviev, 2004). A major hurdle is the difficulty of detecting all
PTMs and isoforms of a protein (Lopez and Pluskal, 2003).
Figure 3.2 Different detection methods on protein/antibody arrays: direct antibody arrays, sandwich arrays,
and reverse phase protein arrays. In addition, reverse phase arrays might be performed with a secondary
labeled antibody, as well as with complex protein samples hybridized to the chip surface.
3.6.4 Immunohistochemistry
Immunohistochemistry (IHC) is the process of using antibodies for detection
of antigens in tissues, and might provide information of protein expression in
tissues at the subcellular level (reviewed by Warford et al., 2004). Determination
of the subcellular localization of a protein is usually performed by fusion of the
corresponding gene to a reporter or by epitope-tagging (Kumar et al., 2002),
where small tags are fused to the target protein.
Prior to analyses on tissue samples, the tissues are fixed, and thereafter embedded
in paraffin or alternatively flash frozen. Frozen tissues are of limited availability
and are therefore not as widely used as paraffin-embedded tissues. Tissue-fixation
is most commonly performed directly following collection by submersion in
formalin and thereafter insertion into paraffin blocks. Alternatively, the fixation
of tissues is performed in ethanol (Ahram et al., 2003). A number of methods for
simultaneous analysis of multiple tissue samples have been implemented. The
primary step towards high-throughput histology methods was the tissue sausage
method (Battifora, 1986), which allowed for the analysis of over 100 different
tumour tissues. However, the tissues were not oriented to allow linkage to their
clinical origin and the tissue sausage failed to represent the entire number of
tissues in a section (Warford et al., 2004). The tissue microarray (TMA; see figure
3.3) technology surmounted these limitations and dramatically increased the
number of simultaneously analysed tissue samples (Kononen et al., 1998). Following
incubation with antibodies and staining, images of the TMAs are captured with
conventional flatbed scanners. Subsequent storage of the high-resolution images
requires large amounts of storage space.
When using antibodies in TMA experiments, there is a risk of cross-reactivity.
Antibody recognition of related or unrelated epitopes in tissues where the target
protein is absent may result in false positives (Warford et al., 2004). Conversely,
false negatives might arise as a result of difficulties of the antibody
to identify the target, destruction of the epitope as a result of tissue-fixation, or
because the epitope might be unreachable due to cross-linking or protein-protein
interactions. Moreover, the detection system might fail to identify the target epitope
due to expression levels below the detection limit (Warford et al., 2004). Also, special
care should be taken to select tissue cores being representative of the pathology
or histology of standard sections (Simon et al., 2003).
Figure 3.3 A view of a TMA (Tissue MicroArray) containing immunohistochemically stained sections of
normal human tissues. Each spot has a diameter of 1 mm and consists of a different tissue sample. In this
example, the protein expression pattern is displayed in a spot representing part of a normal appendix.
The higher magnification view shows that glandular cells in the mucosa are positive. Surface epithelial
cells show only weak immunoreactivity, whereas the glandular cells lining the colonic crypts show strong
cytoplasmic positivity. The inflammatory cells surrounding the crypts are essentially negative. Brown color
represents positive staining. The blue color (hematoxylin) is used as non-specific counter-staining to
expose tissue structure and cell morphology.
3.6.5 In Vitro Detection
Tissue localization is also possible to perform using Western blot analysis, where
tissue lysates are separated by gel electrophoresis. Western blot is a commonly
employed technique for detection of protein antigens in complex mixtures (Dunn,
1999). Following electrophoresis, the proteins (or other biological samples) are
transferred to a membrane, usually composed of polyvinylidene difluoride (PVDF)
or nitrocellulose, for subsequent detection. Prevention of non-specific binding
of antibodies to the surface is performed by incubation of the membrane in
appropriate blocking solution. The detection in Western blot analysis is commonly
performed with either direct or indirect methods. In the direct method, the primary
antibody, which binds the antigen, is labeled by an enzyme or fluorescent dye.
The indirect method involves a secondary labeled antibody, which binds to the
primary antibody bound to the antigen. Labeling of the secondary antibody can
be done by the use of various compounds, including probes, biotin and enzyme
conjugates. Western blots also give valuable information about protein size and
antibody specificity.
The majority of eukaryotic genes are transcribed in the nucleus, and subsequent
translation occurs in the cytosol. Many proteins undergo further sorting into
various subcellular components by signals within the amino acid sequence (Schatz
and Dobberstein, 1996; Mattaj and Englmeier, 1998). Several methods are used to
predict these signal sequences and the resulting localization. By establishing homology
to a protein with known localization, the destination of an uncharacterized protein might be
inferred. Proteins may also be directed to different cellular compartments by the aid
of carrier proteins. These proteins identify specific sequence motifs, which target
the protein to a certain part of the cell. In silico based methods can successfully
predict signal peptide cleavage sites with greater than 80% accuracy (see chapter
5.3; Emanuelsson et al., 2001). However, methods predicting N-terminal signals
are biased as a result of errors in predicting start codons in genome projects
(Reinhardt and Hubbard, 1998). Ab initio methods predict the subcellular
localization from the amino acid composition alone, but these methods are less
successful than the sequence motif methods (Rost et al., 2003). Finally,
localization can be determined by identifying interactions with other proteins
that have known subcellular localization.
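The input to a composition-based (ab initio) localization predictor can be illustrated with a short Python sketch. It only computes the amino acid composition vector and a distance between two such vectors; how a real predictor maps compositions to compartments is not covered here, and the function names are introduced for the example. The call to math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Fraction of each of the 20 standard amino acids, in AMINO_ACIDS order."""
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def composition_distance(seq_a, seq_b):
    """Euclidean distance between the composition vectors of two sequences."""
    return math.dist(composition(seq_a), composition(seq_b))
```

A classifier would compare the composition of a query protein with compositions of proteins from known compartments. Because the order of residues is discarded, sorting signals such as N-terminal signal peptides are invisible to this representation, which is consistent with the lower accuracy noted above.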
3.8 Post Translational Modifications
Analysis of individual PTMs (see chapter 2.1) is most commonly performed by
comparing experimentally generated data to previously identified amino acid
sequences. The primary step is to identify the target protein, which can be done
by MS techniques or antibody recognition using Western blot analysis (Mann and
Jensen, 2003).
Modifications of many proteins are essentially determined by four different
approaches: 2DE, affinity-based enrichment of modified proteins, PTM
identification in complex mixtures, and affinity-based methods and derivatization.
In 2DE (see chapter 3.3), the modified proteins can be visualized on gels by,
for instance, anti-phosphoamino acid antibodies, and subsequently analyzed by
MS (Soskic et al., 1999). Alternatively, proteins with PTMs can be run twice on
2D gels with an enzymatic removal of the modifying group between runs. Spots
only present in the first run indicate modifications of the proteins (Yamagata et
al., 2002). When the mass of the modified protein is not sufficient to establish
the modification, MS/MS (see chapter 3.4) fragmentation is used to identify and
localize the PTMs (Mann and Jensen, 2003).
Bioinformatic methods to predict PTMs are promising and might be closely
integrated with experimental approaches. As proteomics approaches identify
larger numbers of modification-sites in vivo, prediction algorithms will benefit
and their performance will increase (Mann and Jensen, 2003). In silico prediction
of PTMs comprises the identification of local consensus sequence motifs of signal
sequence cleavage sites (Nielsen et al., 1997) and more complicated amino acid
correlation patterns such as secondary structure and surface accessibility (Hansen
et al., 1998; Blom et al., 1999).
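The simplest form of motif-based PTM prediction is a scan for a local consensus pattern. As an illustration, the Python sketch below searches for the well-known N-glycosylation sequon N-X-S/T, where X is any residue except proline. This motif is used here only as a familiar example and is not taken from the methods cited above; the function name is hypothetical.

```python
import re

# N-glycosylation sequon: N, then any residue except P, then S or T.
# The lookahead makes the scan report overlapping occurrences as well.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence):
    """Return 0-based positions of asparagines that start an N-X-[ST] sequon."""
    return [match.start() for match in SEQUON.finditer(sequence)]


if __name__ == "__main__":
    print(find_sequons("ANGTNPSANAS"))
```

A match only indicates a potential modification site. Whether the site is modified in vivo depends on factors such as localization and surface accessibility, which is why the more elaborate predictors mentioned above take additional sequence-derived features into account.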
3.7 Prediction of Subcellular Localization
3.9 Protein Interactions and Complexes
Proteins can interact with a myriad of other molecules to regulate and support
them in large interaction networks (Chen and Xu, 2003; Bader and Hogue, 2000).
Deciphering protein interactions might shed light on molecular mechanisms
and biological processes (Ito et al., 2000) and provide clues to the function of
uncharacterized proteins. Multi-protein complexes regulate various cellular
processes, of which some are associated with disease. Screening of these protein
interaction partners might lead to the identification of possible drug targets.
Since the development of the two-hybrid assay 15 years ago (Fields and Song,
1989), this powerful screening method has been extensively used in interaction
studies. In this assay, the ‘bait’ and ‘prey’ proteins are fused to a DNA-binding domain and a
transcriptional activation domain, respectively. Both fused proteins are
subsequently co-expressed in yeast, and if proper interaction occurs, the two parts
are brought in close proximity resulting in the expression of a reporter molecule
used for detection (Uetz et al., 2000).
In silico predicted protein-protein interactions are not relevant in the absence of
experimental validation. However, the two different approaches might be used in
combination, complementing and validating each other (Chen and Xu, 2003).
There are several interaction prediction methods, such as genetic neighborhood
conservation, gene fusion and phylogenetic profiling.
Prokaryotic gene clusters and operons (contiguous sets of genes that are co-regulated and co-transcribed) provide the basis for the genetic neighborhood
conservation method. Prokaryotic gene clusters typically consist of genes related
by function (Overbeek et al., 1999). Valuable information on which proteins form
complexes can be gained by identifying and comparing operons and gene clusters
from different organisms. The major drawback is that this method only applies to
bacteria.
A method called the gene fusion method predicts protein-protein interactions
on the basis of homologous fused proteins. For example, protein A and B are
separated proteins in organism Y, and expressed as a fused protein in organism
Z. The fusion (and therefore probable interaction in organism Z) implies that
protein A and B also interact in organism Y (Marcotte et al., 1999). One limitation
of this method is that many interactions have evolved through other mechanisms,
and therefore might be missed.
The phylogenetic profiling method is based on the assumption that proteins
functioning together also evolve in a correlated manner. A phylogenetic profile
is constructed for each protein, representing the protein's evolutionary
history, and links are established between proteins with similar profiles. The
method thereby allows for prediction of the functions of unknown proteins
(Pellegrini et al., 1999).
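The idea behind phylogenetic profiling can be illustrated with a minimal sketch. It assumes that a profile is a vector recording the presence (1) or absence (0) of a protein's homolog in each of a set of genomes, and that two proteins are linked when their profiles differ in at most a chosen number of genomes. The protein names, the profiles and the function names below are hypothetical and serve only as illustration.

```python
# Minimal sketch of phylogenetic profiling (hypothetical data and names).
# A profile holds 1 if a homolog is present in a genome and 0 if it is absent.

def hamming(a, b):
    """Number of genomes in which two profiles disagree."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(x != y for x, y in zip(a, b))


def linked_pairs(profiles, max_diff=0):
    """Return protein pairs whose profiles differ in at most max_diff genomes."""
    names = sorted(profiles)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if hamming(profiles[first], profiles[second]) <= max_diff:
                pairs.append((first, second))
    return pairs


# Hypothetical profiles over five genomes.
PROFILES = {
    "P1": (1, 0, 1, 1, 0),
    "P2": (1, 0, 1, 1, 0),
    "P3": (0, 1, 1, 0, 1),
    "P4": (1, 0, 1, 0, 0),
}

if __name__ == "__main__":
    print(linked_pairs(PROFILES))              # identical profiles only
    print(linked_pairs(PROFILES, max_diff=1))  # tolerate one disagreement
```

Allowing a small number of disagreements (max_diff) is one possible way of tolerating gene loss or incomplete annotation; the appropriate threshold would depend on the number of genomes compared.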
Databases of biological information are an important resource behind
biotechnology research (see table 4.1). Proteomic approaches depend on
complete, reliable and recently updated databases (Apweiler et al., 2004a). Some
researchers have shifted their focus from small-scale studies towards analyses
involving large data sets. This includes comparative studies between sets of
sequences or between organisms. The enormous quantities of gene and protein
information need to be stored in a relevant way, allowing for integration and
cross-references between different data sources. Besides functioning as storage
facilities, the data sources have to provide access, interpretation and visualization
of the data. The number of databases is steadily increasing, and currently there
are 719 publicly available molecular biology databases, an increase of 31% from
the previous year (Galperin, 2005).
Database Name | URL | Database Type
Ensembl | http://www.ensembl.org | Information around large automatically annotated genomes
The Vertebrate genome annotation (VEGA) | http://vega.sanger.ac.uk/ | Manually annotated vertebrate genome sequences
Swissprot | http://www.ebi.ac.uk/swissprot | Extensively annotated nonredundant protein sequences
TrEMBL | http://www.ebi.ac.uk/trembl | Computer-annotated protein sequences
Uniprot | http://www.expasy.uniprot.org/ | Centralized repository for protein information
Protein data bank | http://www.rcsb.org/pdb/ | Three dimensional protein structures
Pfam | http://www.sanger.ac.uk/Software/Pfam/ | Protein families and domains
Interpro | http://www.ebi.ac.uk/interpro/ | Unites domains, protein families, and functional sites from protein signature databases
Database of interacting proteins (DIP) | http://dip.doe-mbi.ucla.edu/ | Experimentally and literature search generated protein-protein interaction data
Biomolecular interaction network database (BIND) | http://binddb.org | Protein-protein, protein-DNA, protein-RNA, and protein-ligand interaction data
The Predictome database | http://predictome.bu.edu/ | In silico predicted protein interactions
Table 4.1 A summary of a number of databases for biological information.
The need to integrate different biological information resources, in
combination with the severe problem of defining protein function, emphasizes
the importance of a standardized vocabulary. The first attempt was a classification
of enzymatic activity using four-part EC numbers (Nomenclature committee,
1992). Every EC number is associated with a name for the respective enzyme,
and the four numbers represent a progressively finer classification of the enzyme
in a hierarchical fashion.
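The hierarchical nature of EC numbers can be made concrete with a short sketch. It assumes the usual dotted notation (for example 1.1.1.1) and counts how many leading levels two EC numbers share; the function names are hypothetical.

```python
# Minimal sketch: EC numbers as a four-level hierarchy (function names are hypothetical).

def parse_ec(ec):
    """Split an EC number such as 'EC 1.1.1.27' into its four levels."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError("an EC number has four dot-separated levels: %r" % ec)
    return tuple(parts)


def shared_levels(ec_a, ec_b):
    """Count the leading classification levels that two EC numbers have in common."""
    shared = 0
    for level_a, level_b in zip(parse_ec(ec_a), parse_ec(ec_b)):
        if level_a != level_b:
            break
        shared += 1
    return shared


if __name__ == "__main__":
    # Two dehydrogenases that differ only at the finest level.
    print(shared_levels("EC 1.1.1.1", "EC 1.1.1.27"))
    # An oxidoreductase and a hydrolase differ already at the first level.
    print(shared_levels("EC 1.1.1.1", "EC 3.4.21.4"))
```

The more leading levels two enzymes share, the more closely related their catalysed reactions are according to the classification.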
Currently, the gene ontology (GO) terms are the major set of terms used for
describing protein function. The GO consortium has defined a set of
structured classifications in three categories: i) molecular function (e.g. enzyme), ii)
biological process (e.g. cell growth and maintenance), and iii) cellular component
(e.g. nuclear membrane; G.O. Consortium, 2001). The GO terminology is
not entirely complete, although it is the most comprehensive set of definitions
currently available.
4. Biological Information Resources
4.1 Genomic Data
There are three main systems that provide organized information around large
automatically annotated genomes: Ensembl (Birney et al., 2004), the NCBI
(National Center for Biotechnology Information) genome resource (Wheeler
et al., 2004) and the UCSC (University of California Santa Cruz) genome
browser (Karolchik et al., 2004). All three resources are interlinked and updated
regularly.
Ensembl provides several tools for users who need to manipulate parts of the
genomic data. For instance, the EnsMart system (Kasprzyk et al., 2004), which is a
web-based data mining and retrieval system, allows for extraction of relevant data
in several output formats. Moreover, a series of downloadable standard data sets
is available (e.g. a protein FASTA file of all proteins). Over the last year,
seven new genomes have been added to Ensembl, which presently comprises
16 different genomes such as human, mouse, rat, dog, and chicken
(see table 5.1).
The vertebrate genome annotation (VEGA) database is a resource for browsing
manually annotated finished vertebrate genome sequences. It currently includes
human, mouse, dog, and zebrafish. All annotated genes are supported
by transcriptional evidence from protein sequences, cDNA, or expressed sequence
tags (ESTs; Ashurst et al., 2005). At present, 14 of the human chromosomes,
covering almost half of the genome, are annotated. Manual annotation is more
precise in recognizing splice variants and pseudogenes as compared to automated
annotation, as in Ensembl (Ashurst et al., 2005).
4.2 Proteome and Protein Data
4.2.1 Protein Sequence Data and Protein Information
SWISS-PROT (Bairoch and Apweiler, 2000) is maintained by the Swiss Institute
of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI). It
provides annotated data with minimum redundancy, and is integrated with other
databases. The entries in SWISS-PROT are extensively annotated and analysed by
biologists to ensure high quality. There are two classes of data in SWISS-PROT: the
core data and the annotation. The mandatory core information associated with
each SWISS-PROT entry consists of the protein name, the amino acid sequence,
citation information, and taxonomic data. The annotation includes descriptions
of protein function, PTMs, secondary structure elements, similarities to
other proteins, domains and sites, splice isoforms, etc. Redundancy is kept to a
minimum by merging separate entries that originate from different literature
reports. Furthermore, SWISS-PROT provides cross-references to many external
databases (O’Donovan et al., 2002).
TrEMBL comprises computer-annotated sequences derived from translation
of all coding sequences in the GenBank (Benson et al., 2005), DNA Data Bank of
Japan (DDBJ; Gasteiger et al., 2001), and European Molecular Biology Laboratory
(EMBL; Kanz et al., 2005) nucleotide sequence databases. The data include large
amounts of information that is not yet included in the SWISS-PROT dataset.
TrEMBL data generation is performed in three steps: i) translation of the
coding sequences mentioned above, ii) removal of redundancy by merging
of multiple entries (O’Donovan et al., 1999), and iii) automated information
enhancement through annotation transfer, using InterPro (Mulder et al., 2005),
from well-annotated proteins in SWISS-PROT to uncharacterized proteins in
TrEMBL that belong to well-defined groups.
4.2.2 Protein Structure Data
PDB (the Protein Data Bank; Bernstein et al., 1977) was initiated as early as
1971 and is the single repository for three-dimensional protein and nucleic acid
structural data. PDB archives, annotates, and distributes sets of atomic coordinates.
In addition to the structural data itself, PDB also stores the measurements from
which the data are derived (e.g. from X-ray structure determinations). Protein
structure (see chapter 3.5) is associated with protein function and is therefore of
importance.
4.2.3 Protein Domains and Families
Pfam (Protein families database) is a large repository of protein families and
domains, including 7,973 families (version 18). Pfam families are built around
multiple sequence alignments (see chapter 5.2.3) and are divided into two main
classes, Pfam-A and -B. Pfam-A families are based on manually curated multiple
alignments, whereas Pfam-B families are derived from an automatic clustering of
SWISS-PROT and TrEMBL and are therefore less reliable. Pfam families cover around
75% of all proteins present in SWISS-PROT and TrEMBL (Bateman et al., 2004),
and have been widely used when annotating the human genome sequence
(Lander et al., 2001).
Another resource is InterPro, which unites domains, protein families, and
functional site records from protein signature databases. Each InterPro entry
includes annotation, functional annotation in GO terms and literature references.
At present, eight major databases are included, e.g. Pfam (Bateman et al., 2004),
SMART (Simple Modular Architecture Research Tool; Letunic et al., 2004), and
PRINTS (Protein fingerprint database; Attwood et al., 2003). InterPro’s family,
domain, and functional site definitions will primarily be used in functional
classification and annotation of unknown sequences (Apweiler et al., 2001).
The Universal protein resource (UniProt; Apweiler et al., 2004b) has been formed
by uniting SWISS-PROT, TrEMBL, and the protein information resource (PIR). It is
a centralized repository for protein information and consists of three
database components. The UniProt archive (UniParc), the most comprehensive
publicly available protein sequence collection, provides nonredundant protein
sequences from all publicly accessible protein databases. The UniProt
knowledgebase (UniProt) contains a section of fully manually annotated sequences
and a section awaiting annotation. Finally, the UniProt NREF (UniRef) databases
provide nonredundant sequence collections, grouped by sequence identity.
Three different clustering thresholds (NREF100, NREF90, NREF50) provide full
sequence coverage while showing only the sequences that correspond to the
respective threshold (Bairoch et al., 2005).
Data Searches, Sequence Comparisons, and Biocomputing
4.2.4 Protein Interactions
The Database of interacting proteins (DIP) contains experimentally generated
data and results from literature searches, regarding protein-protein interactions
(Marcotte et al., 2001). It mostly includes data from human, yeast, and Helicobacter
pylori, and allows for comprehensive visualization and extraction of interaction
and network data (Xenarios et al., 2000).
BIND (the Biomolecular Interaction Network Database) stores protein-protein,
protein-DNA, protein-RNA, and protein-ligand interaction data (Bader and
Hogue, 2000). In addition, it includes information on molecular complexes and
pathways. BIND describes various conditions and data regarding the interaction
of interest such as experimental conditions, cellular localization, sequence data,
kinetics, thermodynamics, etc.
In silico predicted interactions are stored at the Predictome database. The putative
interactions are based on predictions using phylogenetic profiling, domain
fusion, and chromosomal proximity (see chapter 3.9), involving proteins from 44
different species. This database also provides comparison of different prediction
methods and their correlation with known pathway data (Mellor et al., 2002).
5. Data Searches, Sequence Comparisons, and Biocomputing
5.1 Comparative Genomics and Proteomics
Comparative genomics and proteomics are the analyses and comparisons of
genomes and proteomes from different organisms in order to investigate the
evolution of species and the functions of genes and proteins. Bioinformatic methods
may be applied to all organisms, as a result of the uniform genetic code. Due
to the availability of completely sequenced genomes from various organisms
(see table 5.1), comparative genomics and proteomics have rapidly emerged.
Currently, a total of 303 different genomes are finished
(www.genomesonline.org). By examining features such as genetic locations and
conserved regions, correlations between genes can be established, providing
valuable information. In closely related species (e.g. human and chimpanzee),
differences might point to candidate genes for adaptive evolution (see chapter
1.2.1). Conversely, in distantly related species (e.g. human and mouse), the basic
procedure involves screening for functional conservation (Ruvolo, 2004).
Comparative genomics often involves the use of pairwise or multiple sequence
alignments (see chapters 5.2.1 and 5.2.3).
Species | Number of genes | Genome size (Mbp) | Original reference
Human (Homo sapiens) | 22,218† | 3,272 | Lander et al., 2001; Venter et al., 2001
Chimpanzee (Pan troglodytes) | 22,475† | 2,733 | The Chimpanzee sequencing consortium, 2005
Dog (Canis familiaris) | 18,201† | 2,385 | Kirkness et al., 2003
Chicken (Gallus gallus) | 17,709† | 1,054 | Wallsi et al., 2004
Mouse (Mus musculus) | 25,613† | 2,267 | Waterston et al., 2002
Rat (Rattus norvegicus) | 21,952† | 2,718 | Gibbs et al., 2002
Fish (Danio rerio) | 22,877† | 1,688 | http://www.sanger.ac.uk/projects/d_rerio/ ¶
Fly (Drosophila melanogaster) | 14,359† | 132 | Adams et al., 2000
Worm (Caenorhabditis elegans) | 19,745† | 100 | The C. elegans sequencing consortium, 1998
Protozoan (Trypanosoma cruzi) | 22,570 | 60 | El-Sayed et al., 2005
Yeast (Saccharomyces cerevisiae) | 6,680 | 12 | Goffeau et al., 1996
Tree (Populus trichocarpa) | 58,036 | 473 | http://genome.jgi-psf.org/poptr1 ¶
Plant (Arabidopsis thaliana) | 31,270 | 115 | The Arabidopsis genome initiative, 2000
Table 5.1 The completely sequenced genomes of 13 different organisms. †Number of genes according to
Ensembl, October 2005. ¶ No reference yet available.
5.2 Sequence Alignments
Initial clues to function and/or structure of newly sequenced proteins are
commonly derived from amino acid sequence similarities to known proteins.
As a consequence, the focus of research might be directed towards unexpected
relations and correlations (Lipman and Pearson, 1985; Altschul and Lipman,
1990). Furthermore, discoveries of new protein families have been made by
using sequence searches alone (Pearson, 1990). Homologous (see chapter 1.2.5)
proteins have their fold and/or active site or binding domains in common, and
also share significant sequence similarities.
Alignments, unambiguous mappings of residues between sequences, are mainly
performed to infer homology between two (see chapter 5.2.1) or
multiple (see chapter 5.2.3) sequences. When performing sequence
alignments, there is a trade-off between sensitivity (the ability to identify
sequences with distant evolutionary relationships) and selectivity (the avoidance
of non-related matches with spuriously high similarity scores). Such spurious
matches might appear as a result of biased amino acid composition. Various
sequence alignment algorithms and scoring parameters are in use, and the choice
of algorithm depends upon the type of comparison problem to be solved.
5.2.1 Pairwise Sequence Alignments
Pairwise alignments are performed either as local or global alignments. Local
algorithms seek subsequence alignments to find the strongest region of similarity
between two sequences (thereby disregarding dissimilarities outside that region).
Local alignments are preferred when identifying shared conserved regions or
domains, as well as shuffled domains, as a result of complex evolutionary pedigree
(e.g. internal gene duplications). Such alignments are often used when screening
DNA or protein databases as well as in exon-identification, where mRNA is
compared to genomic DNA.
Global alignment algorithms compare the two aligned sequences from their
respective beginnings to ends. These algorithms are the most suitable choice
when the similarities span the full length of the sequences, or when homology
has already been established. Scores generated by global alignment are calculated
with or without gap-penalties at the ends of the sequences.
Alignment algorithms can be divided into two general classes. Rigorous comparison
algorithms calculate the optimal similarity score, whereas heuristic algorithms are
less computationally intense and therefore faster. Several programs exist that are
based on optimal algorithms. The first pairwise sequence similarity algorithm was
the optimal and global Needleman-Wunsch algorithm (Needleman and Wunsch,
1970), which is available as the implementation Needle (www.ebi.ac.uk/emboss).
In contrast, the application Water performs local, optimal alignments
using the Smith-Waterman algorithm (Smith and Waterman, 1981;
www.ebi.ac.uk/emboss).
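The difference between optimal global and optimal local alignment can be sketched with two small dynamic programming routines. The sketch below is not the Needle or Water implementation. It assumes a simple scoring scheme (match +1, mismatch -1, linear gap penalty -1) instead of a substitution matrix with affine gap penalties, and it returns only the optimal score, not the alignment itself. The function names are hypothetical.

```python
# Minimal sketch of optimal global (Needleman-Wunsch) and local (Smith-Waterman)
# alignment scores. Assumed scoring: match +1, mismatch -1, linear gap -1.

def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal score for aligning a and b from their beginnings to their ends."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal score of the best-matching pair of subsequences of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(global_score("ACGT", "ACT"))
    print(local_score("TTTACGTTT", "GGGACGGGG"))
```

The only structural differences between the two routines are the initialisation of the first row and column and the floor at zero, which is what allows the local variant to disregard dissimilar regions outside the best-matching segment.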
The heuristic algorithms do not necessarily find the optimal score for every single
sequence included in a search database. The two most commonly used heuristic
comparison algorithms are BLAST (Basic Local Alignment Search Tool; Altschul
et al., 1990) and FASTA (Pearson, 1990), which stands for Fast All (fast searches
for ‘all’, meaning both protein and nucleotide comparisons). They are typically
5-50 times faster than the Smith-Waterman algorithm, since they examine
only a portion of the potential alignments between sequences. Nevertheless, in many
cases FASTA and BLAST generate results similar to those of the more rigorous
algorithms.
BLASTP (P for protein) is the most widely used rapid protein sequence comparison
algorithm. It combines high sensitivity and selectivity, and seldom generates
spuriously high alignment scores for unrelated sequences. However, as many as 25% of
all amino acids in protein sequences are situated in regions with severely biased amino
acid composition (e.g. low complexity regions). Therefore, BLAST programs filter
low complexity regions from proteins by utilizing the SEG algorithm (Wootton,
1994). The algorithms assign scores to the generated alignments, which reflect a
minimum of insertions, deletions and substitutions. Accordingly, highly similar
sequences generate high scores and distant sequences low scores.
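The principle of masking regions with biased composition can be illustrated with a simplified sketch. This is not the SEG algorithm itself; it merely assumes that a sliding window with low Shannon entropy indicates low complexity and replaces such windows with the letter X. The window length, the entropy threshold and the function names are assumptions chosen for illustration.

```python
# Simplified low-complexity masking by windowed Shannon entropy.
# Not the SEG algorithm; window size and threshold are illustrative assumptions.
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every residue that lies in a low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical sequence with a serine-rich stretch in the middle.
    example = "MKTAYIAKQR" + "S" * 12 + "GLVPR"
    print(mask_low_complexity(example))
```

Masked residues are then ignored when seeding and scoring alignments, so that a compositionally biased stretch cannot produce a high score on its own.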
Protein sequence alignment methods determine similarity by using substitution
matrices, which assign a score to every possible replacement of one amino acid by
another. Matrices for generating similarity scores have three main properties and
can therefore differ in the following aspects: i) the method used to construct the
matrix, ii) the scoring scale, and iii) the information content. Thus, mutations
that are evolutionarily unlikely to occur are assigned large negative scores, whereas
positive scores are assigned to conservative changes (see figure 5.1; Pearson,
2000).
There are two types of commonly used matrices, PAM (Point Accepted Mutation)
and BLOSUM (BLOcks SUbstitution Matrix). PAM is a molecular evolutionary
model, which is constructed from observed residue replacements in closely related
proteins (Dayhoff et al., 1978). One PAM corresponds to an average change in 1%
of all residue positions. Homologous proteins might be recognized even though
they are separated by more than 100 PAMs, owing to the possibility of multiple
mutations at some residues and the absence of mutations at others. As evolutionary
rates differ among proteins, no clear correlation between evolutionary time and
PAM distance can be established.
The BLOSUMs are directly constructed from multiple sequence alignments (see
chapter 5.2.3) of distantly related proteins, and not derived from closely related
proteins (Henikoff and Henikoff, 1992). The PAM and BLOSUM matrices differ
not only in the way they are constructed, but also in how they are numbered. PAMs
are numbered based on the number of point accepted mutations (e.g. PAM120),
whereas BLOSUMs are numbered according to the sequence identity shared within
the multiple alignments used to construct them (e.g. BLOSUM80). PAM matrices
with high numbers and BLOSUM matrices with low numbers are constructed
for comparison of distantly related protein sequences. In conclusion, the choice
of matrix is dependent on the evolutionary distance of the relationship to be
established. In general, BLOSUM62 (see figure 5.1) performs better than PAM
matrices with BLASTP and FASTA (Henikoff and Henikoff, 1993).
     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
R   -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3
N   -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3
D   -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3
C    0  -3  -3  -3   9  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1
Q   -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
E   -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2
G    0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3
H   -2   0   1  -1  -3   0   0  -2   8  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3
I   -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3
L   -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1
K   -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
M   -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1
F   -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1
P   -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7  -1  -1  -4  -3  -2
S    1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0
W   -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3
Y   -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7  -1
V    0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
Figure 5.1 A substitution matrix, BLOSUM62, which is used in sequence alignments. The matrix is
derived from multiple sequence alignments of sequences with 62% sequence identity. Mutations that
are evolutionarily unlikely to occur are assigned negative scores, whereas positive scores are assigned to
conservative changes.
5.2.2 Practical Aspects of Pairwise Alignments
Compared to protein sequence comparisons, DNA sequence comparisons
are significantly less informative for several reasons. DNA sequences that
do not encode structural RNAs or proteins have high divergence rates. As a
consequence, there are difficulties in establishing reliable sequence homologies
between sequences with a common ancestor older than 200 million years. In
contrast, homology detection among protein sequences that diverged one billion
years ago is possible (Pearson, 2000). Protein sequences harbor biochemical
information within the amino acids themselves, which DNA sequences do not. For
instance, there are large differences in chemical properties between isoleucine and
glycine, and high similarities between lysine and arginine (i.e. size, hydrophobicity
and charge; see chapter 2.3). Finally, the degeneracy of the genetic code results in
many silent third-base codon changes, which do not change the encoded protein
(Pearson, 1990).
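The biochemical information carried by amino acids is what a substitution matrix encodes. As a small illustration, the sketch below scores ungapped alignments using a handful of BLOSUM62 entries taken from figure 5.1: the isoleucine-glycine pair is penalized, whereas the lysine-arginine pair is rewarded. Only a subset of the matrix is included, and the function names and example peptides are hypothetical.

```python
# Scoring ungapped alignments with a few BLOSUM62 entries from figure 5.1.
# Only a subset of the matrix is included; names and peptides are hypothetical.

BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("L", "L"): 4, ("G", "G"): 6,
    ("K", "K"): 5, ("R", "R"): 5,
    ("I", "L"): 2,   # similar hydrophobic residues
    ("K", "R"): 2,   # similar positively charged residues
    ("I", "G"): -4,  # residues with very different properties
}


def pair_score(x, y):
    """Look up a symmetric score; raises KeyError for pairs outside the subset."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]


def ungapped_score(a, b):
    """Sum of pair scores for two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


if __name__ == "__main__":
    print(ungapped_score("KIL", "RLL"))  # conservative replacements
    print(ungapped_score("KIL", "RGL"))  # isoleucine replaced by glycine
```

A comparison at the DNA level has no counterpart to these graded scores, since a nucleotide mismatch carries no information about how similar the encoded residues are.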
For local alignment searches, the correlation between sequence similarity score,
alignment length, and statistical significance, is complex. For instance, a database
screening might generate two distinct hits, one 15-25 amino acid short domain with
85% sequence identity to the query sequence and one longer domain consisting
of 50-80 residues sharing 30% identity with the query. Both hits generate identical
similarity scores. Yet, the latter is more prone to be of biological relevance (Pearson
and Miller, 1992).
Sequence-based searches are not suitable for identifying distant evolutionary
relationships. Difficulties arise as the sequence similarities diminish, and at around
20-30% identity, referred to as the twilight zone, relations are hard to establish (see
figure 5.2; Doolittle et al., 1986). However, remote relations can be detected
using methods such as fold recognition (see chapter 3.5; Jones, 1997), or iterated
BLAST searches (see chapter 5.2.3; Altschul et al., 1997).
[Figure 5.2 plot: observed percent identity (y-axis, 0 to 100) versus evolutionary distance in PAMs (x-axis, 0 to 350), with the twilight zone marked.]
Figure 5.2 Limits of sequence similarity searches. As the evolutionary distance between sequences
increases, the possibility of significantly establishing a relation decreases. In the so-called twilight zone
(displayed as a dashed rectangle), relations are hard to establish with sequence-based searches (adapted
from Pearson, 2000).
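The reason why observed identity levels off with increasing PAM distance, instead of falling linearly to zero, can be illustrated with a toy calculation. The sketch below is not the Dayhoff model: it assumes that every residue position changes with a probability of 1% per PAM step and that a change leads to any of the 19 other amino acids with equal probability. Under this assumption, multiple and reverse mutations at the same position make the expected identity approach 1/20. The function name and parameters are hypothetical.

```python
# Toy model of expected sequence identity versus PAM distance.
# Assumption: uniform replacement among 20 amino acids (not the Dayhoff model).

def expected_identity(pam, rate=0.01, alphabet=20):
    """Expected fraction of positions identical to the ancestor after `pam` steps."""
    same = 1.0
    for _ in range(pam):
        # A position stays identical if it does not change, or becomes identical
        # again if a changed position mutates back to the original residue.
        same = same * (1.0 - rate) + (1.0 - same) * rate / (alphabet - 1)
    return same


if __name__ == "__main__":
    for distance in (0, 1, 50, 100, 250):
        print(distance, round(expected_identity(distance), 3))
```

In this simplified model the identity decays geometrically towards the random-match level, which mirrors the flattening of the curve in the twilight zone; the exact values differ from those obtained with empirically derived PAM matrices.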
5.2.3 Multiple Sequence Alignments
Even though pairwise comparisons are essential when analyzing sequences,
analyses of protein families sharing conserved characteristics require alignment
and comparison of groups of sequences. This is done in multiple sequence
alignments, which form the foundation for methods such as phylogenetic analysis,
fold recognition (see chapter 3.5), and secondary structure prediction (Bateman
et al., 2002).
The majority of multiple alignment algorithms are based on a progressive approach,
which refers to the initial grouping of more similar sequences followed by the later
incorporation of more distantly related sequences (Boeckmann et al., 2003). The
most widely used multiple sequence alignment method is ClustalW (Thompson et
al., 1994), which exploits the evolutionary relationship between similar sequences.
The method starts by computing all-against-all pairwise comparisons, from which a
family tree is derived. The sequences are then aligned following the branching
order of this tree, so that closely related sequences are aligned first and more
distant ones thereafter. Consequently, ClustalW groups all the sequences
based on similarities and then produces the final multiple alignment. ClustalW
also generates the information needed to construct a phylogenetic tree.
The position-specific iterated BLAST (PSI-BLAST) is a hybrid algorithm, including
elements from pairwise as well as multiple sequence alignment methods. The
initial database search is followed by construction of position-specific sequence
profiles, which, upon iteration, might be refined, thereby raising the search
sensitivity (Altschul et al., 1997). Despite the many advantages of PSI-BLAST, false
incorporations of unrelated sequences and non-masked low complexity regions
might bias the profiles and incorporate additional false positives in the following
iterations (Holm, 1998).
5.3 Prediction of Signal Sequences and Transmembrane Regions
The identification of sequence motifs (see chapter 3.7) that direct proteins within
the cell has been the goal of programs predicting signal peptides. SignalP takes
into account that the residues at positions -3 and -1 relative to the signal peptide
cleavage site have to be neutral and small for cleavage to occur (von Heijne,
1983; von Heijne, 1985). SignalP can predict cleavage sites with an accuracy of
78% for eukaryotes and 89% for prokaryotes (Nielsen et al., 1997). Similar to
signal peptides are signal anchors, which have signal sequences that are not cleaved
by signal peptidases (von Heijne, 1988). A more recent version of SignalP
distinguishes between cleaved signal peptides and uncleaved signal
anchors with an accuracy of about 75% (Nielsen et al., 1999). Related to SignalP
is TargetP, which predicts the subcellular localization of proteins (Emanuelsson et al.,
2000). By using N-terminal protein sequence data, TargetP discriminates between
secretory, mitochondrial, and chloroplast proteins. It performs better than other
available similar tools, with an overall success rate of 85% for plant sequences and
90% for nonplant sequences.
Membrane proteins are an important class of proteins, for which it is difficult
to obtain atomic-resolution information about 3D structure. Accurate methods
for predicting the locations of transmembrane helices are available. For instance,
TMHMM (Trans Membrane Hidden Markov Model), a commonly used tool,
predicts transmembrane helices correctly for about 95% of all proteins in a test
data set (Krogh et al., 2001).
Present Investigation
5.4 Biocomputing – a Practical Approach
The development and construction of biocomputational software applications
is highly dependent on the type of problems that need to be solved. Different
programming languages and database solutions are available, each with its own
strengths and weaknesses. The two most commonly used programming languages
in biocomputing are Perl (perl.org) and Java (java.sun.com). Both have
open-source projects that provide frameworks for handling biological
data: BioPerl (bioperl.org) and BioJava (biojava.org). These frameworks include
modules, classes and interfaces for file parsing, manipulation of biological
sequences, and database client and server support. Perl is ideal for smaller scripts,
when less code needs to be written, and for scripts constructed without
an object-oriented approach. Java is more favourable when constructing
larger software solutions, owing to its more extensive visualization packages and
cross-platform capabilities.
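File parsing is one of the basic tasks that such frameworks take care of. As an illustration of what this involves, the sketch below reads sequences in FASTA format. It is written in Python for brevity and is not taken from BioPerl or BioJava; the function name and the example records are hypothetical.

```python
# Minimal FASTA parser (illustrative sketch, not part of BioPerl or BioJava).

def parse_fasta(text):
    """Return a dict mapping each record identifier to its sequence."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            fields = line[1:].split()
            # The identifier is the first word of the header line.
            header = fields[0] if fields else ""
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Hypothetical input with two records; sequences may span several lines.
EXAMPLE = """>seqA hypothetical protein
MKT
AYI

>seqB
GLVPR
"""

if __name__ == "__main__":
    for name, sequence in parse_fasta(EXAMPLE).items():
        print(name, len(sequence), sequence)
```

A production parser would additionally need to handle duplicate identifiers, very large files and malformed input, which is one reason for relying on an established framework.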
When designing and constructing novel software, every new problem is unique
and therefore a general solution is not available. However, the system has to
include general features such as proper storage and management of data, which
involves a relevant and stable database system and data-recovery facilities
(backups). Moreover, the system needs to present and visualize the data in ways that
facilitate interpretation by the end-user. The system functionalities must be
clearly defined, and continuous interaction between software developers and
end-users might be necessary.
In various large-scale, high-throughput approaches, the system needs to handle
a large amount of data at high speed. Therefore, well-established prediction
methods in combination with experimentally-gained knowledge might be useful
(related publication A, B). The overall high production rates might result in a
system, which focuses on high-throughput processing over extreme sensitivity. Yet,
the approach has to be sensitive enough not to miss any relevant data.
Present Investigation
Objectives
This thesis describes the use of biocomputational approaches to analyse and
characterize publicly available biological sequences in order to gain information
about gene and protein functions (see figure 6.1). The work can be divided into
three major research topics.
Firstly, novel software has been developed to facilitate selection of protein
fragments subsequently used in antibody-based proteomics (publication I).
Furthermore, the suggested design strategy, as well as experimental results, was
evaluated in a separate systematic study (publication II). Secondly, additional
software functionalities were developed to enable antibody recognition of splice
variants (unpublished data, see chapter 7.4.2), and proteins with high sequence
similarities, possibly evolved through duplication events (publication III). The
previously mentioned strategies facilitated the move from a one-by-one manual
gene analysis to processing genes in a high-throughput manner with a minimum
of human interference. Finally, the data mining of the genomic sequence of a
tree model organism, poplar (Populus trichocarpa), resulted in the identification
of previously unknown genes belonging to the cellulose synthase (CesA) family
(publication IV). These genes are thought to have evolved through duplication
events at the gene or genome level.
The studies described include the use of genomic and proteomic data from human
(publication I-III, unpublished data) as well as from P. trichocarpa (publication IV).
Furthermore, sequence analyses such as homology searches (publication I-IV),
PCR primer design (publication I, II), multiple alignments and phylogenetics
(publication IV) have been performed.
Figure 6.1 A schematic overview of the publications (and unpublished data) presented in this thesis.
Publications III and IV explore proteins that are thought to have evolved through duplication events (see
chapter 1.2.2). Publications I-III, as well as the unpublished data, comprise studies involving the center-positioned PrEST (Protein Epitope Signature Tag; see chapter 7.2).
Elements shown in the figure: Publication I, PrEST design software; Publication II, a systematic study of
designed PrESTs; Publication III, processing of high sequence similarity proteins; Publication IV, data
mining in Populus; unpublished data, PrEST design on splice variants.
7. Design of Protein Epitope Signature Tags (PrESTs) for Antibody-based Proteomics
7.1 Background
On the 15th of August 2003, the Swedish human proteome resource (HPR)
was initiated. It is an antibody-based proteomics approach, which aims for the
generation of affinity-purified polyclonal antibodies towards recombinant protein
fragments (see figure 7.1; related publication A). The antibodies are subsequently
used in TMA studies (see chapter 3.6.4) on normal and diseased tissues (Kampf et
al., 2004), as well as for in vitro detection (see chapter 3.6.5) and analysis on protein
microarrays (see chapter 3.6.3). Protein profiling is also performed on cell-lines,
which might provide a valuable complement to tissue localization studies. These
cell-lines have the advantage of being thoroughly controlled during cultivation.
Some cell-lines consist of cells from rare types of cancer, from which tissue
samples are difficult to obtain. Well-characterized cell-lines might also provide negative and
positive controls in the IHC assays (Andersson et al., manuscript in preparation).
The HPR approach is based on a pilot study of the human chromosome 21, which
comprised areas such as protein expression and purification, antibody staining,
as well as biocomputing (Agaton et al., 2003). Antibodies are generated towards
recombinant protein fragments, protein epitope signature tags (PrESTs), and are
given mono-specific properties (see chapter 3.6.1) by stringent purification of
polyclonal antisera using the PrESTs as affinity ligands.
An important part of the HPR strategy is the selection of recombinant protein
fragments, which typically span around 100-150 residues. These PrEST fragments
are designed by the use of in-house prediction software (Bishop; publication I)
and are selected on the basis of favorable sequence properties for subsequent
cloning, expression and purification.
There is a need for specific and quality-controlled antibodies that display a
minimum of cross-reactivity to proteins other than their intended targets. The
HPR approach deals with this issue in multiple ways. The PrEST design strategy
emphasizes locating PrESTs in regions of low sequence similarity to other
human proteins, which reduces the risk of the msAbs cross-reacting with similar
proteins. This is accomplished by the PrEST design software (publication I),
which selects recombinant protein fragments based on a local alignment scanning
procedure.
Figure 7.1 A schematic overview of the HPR project, which includes several computer-assisted modules.
The modules shown are PrEST selection, cloning, sequencing, production of PrEST, purification of PrEST,
antibody generation, affinity purification of antibodies, PrEST array, in situ localization, in vitro
detection, and public data.
PrEST fragments are determined by the use of specific design strategies, taking the properties
of the target proteins into account (publication I-III, unpublished data). Quality control by cycle sequencing (Blakesley,
1993) ensures the amplification of a correct PrEST fragment. Data belonging to the designed PrESTs and to
the generated antibodies are stored and displayed in a publicly available data repository (www.proteinatlas.org).
In addition, the entire project is integrated with a laboratory information management system (LIMS),
which ensures proper processing and tracking of all the data.
Several quality control steps are included in the HPR project. Sequence verification
of clones in the cloning procedure is obtained by sequencing and comparison
with the in silico generated PrEST. The antibodies are also used on reverse phase
protein fragment microarrays (see chapter 3.6.3), on which the absence of cross-reactivity indicates the presence of mono-specific antibodies. In addition, the
antibodies are verified by Western blot analysis (see chapter 3.6.5), and these
results are compared to the TMA data. Reimmunization with the respective PrESTs,
followed by comparison of the resulting protein expression patterns, provides a further means
of validating the results. Also, two msAbs directed towards different regions of the
same protein, designed with appropriate software (publication I), should provide
correlating staining patterns (see table 7.2). The generated antibodies have been
included in comparative studies together with well-characterized, commercial,
and widely used mAbs. The similar results provided by this study corroborate the
high specificity of the HPR-generated msAbs (Nilsson et al., 2005).
7.2 Design of Protein Fragments for Recombinant Protein Production
(publication I)
A novel PrEST selection software has been developed (Bishop; see figure
7.2), predicting protein fragments based on several empirically determined
requirements. Bishop allows for high-throughput selection of PrESTs with a
minimum of end-user intervention and is integrated with
several publicly available prediction tools (see below). In contrast, the previously
used design strategy (Agaton et al., 2003) was performed in a more manual gene-by-gene fashion and required the involvement of several different programs. The
selection requisites in Bishop facilitate the subsequent laboratory procedures
and promote selection of the most suitable PrEST fragment for any given gene
product.
Unwanted binding of the produced msAbs needed to be kept at a minimum. This
was accomplished by a local alignment scanning procedure based on BLASTP
(Altschul et al., 1990), which ensured the selection of fragments exhibiting lowest
possible sequence similarity to other human proteins. The procedure used a
sliding window approach that traversed the target protein sequence, moving one
amino acid per iteration. The size of the window could be individually determined
by the end-user, but the default size was 125 amino acids.
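The scanning procedure described above can be illustrated with a minimal sketch. It assumes a precomputed per-residue similarity profile (a hypothetical stand-in for the BLASTP results) and uses illustrative function names; it is not the Bishop implementation. Returning the whole protein when it is shorter than the window is likewise an assumption of the sketch.

```python
from typing import List, Tuple

def sliding_windows(length: int, window: int = 125) -> List[Tuple[int, int]]:
    """Return (start, end) pairs (0-based, end exclusive) for every window,
    moving one residue per iteration."""
    if length < window:
        return [(0, length)]  # assumption: use the whole protein if it is short
    return [(s, s + window) for s in range(length - window + 1)]

def lowest_similarity_window(profile: List[float],
                             window: int = 125) -> Tuple[int, int, float]:
    """Pick the window with the lowest mean per-residue similarity.

    `profile` is a hypothetical per-residue similarity score, e.g. derived
    from BLASTP hits against other human proteins.
    """
    best = None
    for start, end in sliding_windows(len(profile), window):
        score = sum(profile[start:end]) / (end - start)
        if best is None or score < best[2]:  # ties keep the first window
            best = (start, end, score)
    return best
```

With the default window of 125 residues, a protein of 130 residues yields six candidate windows.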
Transmembrane regions cause problems in the expression and purification
procedures in the subsequently used expression host E. coli (Miroux and
Walker, 1996), as well as being poorly accessible to antibodies in tissue profiling.
Transmembrane regions were therefore avoided by the use of the prediction
program TMHMM (Krogh et al., 2001), which was executed by the main
program.
Signal peptides (see chapter 3.7.3) are cleaved off from proteins during
translocation and are therefore inappropriate targets for antibody recognition.
Bishop allows for the exclusion of signal peptides by visualizing the predictions
produced by SignalP (Nielsen et al., 1997).
Finally, avoidance of restriction enzyme motifs used in subsequent cloning is a
necessity. Therefore, the corresponding cDNA sequence of the target protein was
screened for restriction enzyme sites. The software allows for searching of any
desired restriction site or basic sequence motif. Based on the four requirements
mentioned above, two non-overlapping PrEST fragments, if possible, were
automatically predicted for all analyzed target proteins.
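The restriction site screening can be sketched as a plain motif search on the cDNA. The enzymes below (EcoRI and NotI) are illustrative assumptions, since the text does not name the enzymes used in the cloning step, and the function names are hypothetical. Both recognition sequences are palindromic, so searching one strand is sufficient for them.

```python
from typing import Dict, List

# Illustrative recognition sequences; the enzymes actually used in the
# cloning step are not named in the text, so these are assumptions.
SITES: Dict[str, str] = {"EcoRI": "GAATTC", "NotI": "GCGGCCGC"}

def find_sites(cdna: str, sites: Dict[str, str] = SITES) -> Dict[str, List[int]]:
    """Return 0-based start positions of each motif (overlapping hits included)."""
    cdna = cdna.upper()
    hits: Dict[str, List[int]] = {}
    for name, motif in sites.items():
        positions = []
        pos = cdna.find(motif)
        while pos != -1:
            positions.append(pos)
            pos = cdna.find(motif, pos + 1)
        hits[name] = positions
    return hits

def fragment_is_clean(cdna: str, aa_start: int, aa_end: int,
                      sites: Dict[str, str] = SITES) -> bool:
    """True if the cDNA encoding residues [aa_start, aa_end) holds no listed site.

    Assumes the cDNA string starts at the first codon of the protein.
    """
    sub = cdna[3 * aa_start: 3 * aa_end]
    return all(not positions for positions in find_sites(sub, sites).values())
```

A candidate fragment would be rejected, or shifted, whenever `fragment_is_clean` returns `False` for its coordinates.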
Following TMA studies of antibodies, corresponding data and data regarding the
PrEST towards which the antibody has been directed, are stored in a publicly
available protein atlas (www.proteinatlas.org; related publication A). This
repository includes data such as high-resolution TMA images, target gene data, and
validation of the msAbs. A research project of this magnitude encompasses several
laboratory assays and generates vast amounts of data. A laboratory information
management system (LIMS) facilitates tracking and handling of the complete
laboratory procedures and of the sequence-data used for PrEST selection (see
chapter 7.4). The LIMS is integrated with the publicly available data repository, as
well as with current PrEST design software (publication I).
Figure 7.2 The five different tabs of the Bishop software (A-E): A: The sequence alignment tab, in which
the target protein sequence is aligned to all human proteins. The user also has the possibility to exclude
splice isoforms (see chapter 7.4.2, unpublished data) or high sequence identity proteins (see publication
II) from the subsequent sequence similarity search (displayed in tab C). B: The PrEST data tab, where
data concerning the predicted PrESTs are shown, for instance, PrEST locations, sequence identities and
amino acid sequences. C: The sequence feature tab, which displays several transcript and protein features
as horizontal bars. These are (from top to bottom): transcript sequence, protein sequence, and predicted
PrEST sequences. In addition (although not displayed in the figure) Bishop allows for the exclusion of signal
peptides, transmembrane regions, and restriction enzyme sites located within the PrEST sequences. D:
The PCR primer design tab, in which primers are selected based on non-stringent criteria. E: The result tab,
in which all previously generated data are displayed before committing the predicted PrEST, allowing for
subsequent ordering of the designed PCR primers.
As previously mentioned, Bishop allows for selection of fragments with minimum
sequence similarity to other proteins. However, in some cases there might be a
trade-off between locating the PrEST in a target region, and identifying the region
of lowest possible sequence similarity. Therefore, the region of lowest similarity
might not always be the optimal choice. For instance, when designing PrESTs
representing groups of proteins of high sequence similarity (publication III) or
PrESTs located in common or unique parts of multiple splice variants (unpublished
data, see chapter 7.4.2), other design criteria might be more relevant. It is also
possible to perform a manual PrEST selection in Bishop (see figure 7.2, tab B),
for instance, when designing PrESTs towards conserved active domains or extracellular parts of receptor proteins. Nevertheless, designing PrESTs in the region
of lowest similarity should be the primary choice when processing genes in a high-throughput manner.
Following PrEST design, PCR primers are designed in a module included in Bishop. An
appropriate melting temperature (Tm), as well as a suitable length of the primer,
can be set by the end-user. The primer design is based on non-stringent criteria
such as ending the primer sequence with a C or a G and avoiding continuous
stretches longer than three identical nucleotides. When compared to a
more stringent primer design, these non-stringent criteria showed only slightly lower
success rates in the subsequent RT-PCR and sequencing modules (publication
II). Following primer design, all data related to the target protein are transferred
to the LIMS (see chapter 7.1), from where the primer ordering is performed.
Bishop is fully integrated with the LIMS, performing data exchanges and overnight synchronization of information.
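The two non-stringent criteria named above can be checked in a few lines. This is a sketch with hypothetical function names. The text does not state how Tm is computed, so the Wallace rule used here is an assumption chosen for simplicity.

```python
import re

def wallace_tm(primer: str) -> int:
    """Rough melting temperature by the Wallace rule: 2*(A+T) + 4*(G+C).

    Assumption of this sketch; the Tm formula used in Bishop is not given.
    """
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))

def passes_nonstringent(primer: str) -> bool:
    """Check the two criteria from the text: the primer ends with C or G, and
    it has no stretch longer than three identical nucleotides."""
    p = primer.upper()
    ends_gc = p[-1:] in ("G", "C")
    # A run of four or more identical characters violates the second criterion.
    long_run = re.search(r"(.)\1{3,}", p) is not None
    return ends_gc and not long_run
```

A primer designer could extend a candidate primer one nucleotide at a time until `wallace_tm` reaches the user-defined target, and then keep it only if `passes_nonstringent` holds.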
Due to the existence of discontinuous and linear epitopes (see chapter 3.6.2), it
would be beneficial to obtain antibodies recognizing both types of epitopes. That
accomplished, the antibodies might be used in assays including proteins in a
native fold, as well as studies comprising slightly denatured proteins.
In order to explore the capabilities of the cloned and expressed PrESTs to form
discontinuous epitopes, designed protein fragments were evaluated by analyzing
their 3D structures. This was performed by marking the PrEST regions on known
available protein structures retrieved from the protein data bank (see figure 3 in
appendix I; Bernstein et al., 1977). As antigens located at the N- or C-termini are
commonly used for the expression of recombinant protein fragments for antibody
production (Agaton et al., 2003), these regions were also included in the analysis.
The C- and N-termini of the protein were marked in regions corresponding to
the PrEST lengths. This limited study suggested that the PrESTs designed by
the software package seem to have a relatively good potential to display some
conformational epitopes. However, this is not sufficient proof of the expressed
PrESTs' ability to form secondary structures resembling the native protein.
On the other hand, the PrEST fragments seem to be large enough to provide
surface exposure for a number of residues on the native protein. This is of great
importance for antibody recognition of the target.
The protein fragments are, in the majority of cases, shorter than full-length
proteins and, therefore, a native protein fold of the expressed PrESTs is not
obtained (see figure 7.3). Also, the immunized PrESTs are most likely denatured
upon immunization due to the exposure to adjuvants. Still, these protein-fold
modifications seem to resemble the denatured forms of the proteins located
in formalin-fixed tissues in TMAs, since the majority of the msAbs actually
recognize their intended targets.
Figure 7.3 A comparison of a full-length protein in relation to a PrEST, which is generated by the expression
of a recombinant protein fragment. In the majority of cases the PrEST is shorter than a full-length protein.
This, in combination with the use of adjuvants in the immunizations, results in a slightly modified fold of
the PrEST fragments. Elements shown in the figure: gene region, full-length protein, PrEST.
7.3 A Systematic Analysis of Designed PrESTs (publication II)
PrESTs designed in the previously described software (publication I), as well as
in a similar software named ProteinWeaver (Affibody, Bromma, Sweden), were
analysed. The success rates of the PrEST design strategy were investigated in this
study, with respect to target protein properties as well as PCR primer design
approaches. The analysis was performed on proteins encoded by genes from
human chromosomes 14, 22 and X together with target proteins requested from
several research collaborations (see chapter 7.4). The proteins submitted from
collaborators were identified by the use of several different proteomic methods, for
example, MS (see chapter 3.4) and methods involving protein-protein interactions
(see chapter 3.9). This systematic study included a total of 6,412 PrEST fragments
located on proteins encoded by 4,263 genes, corresponding to 16% of the genes
currently included in the Ensembl dataset. Chromosomes 14, 22, and X were over-represented, due to the focus on these chromosomes (see table 7.1).
The design of PCR primers for amplification of the PrESTs was performed
using a high-stringency and a low-stringency approach (previously described
in publication I). The high-stringency algorithm included additional criteria,
considering dimer and hairpin formation. Comparison of results from the two
primer design approaches revealed a slightly higher success rate for the
high-stringency algorithm than for the low-stringency approach,
82% and 74%, respectively. The success rates were calculated from subsequent
RT-PCR analysis and sequencing. However, the actual difference between the two
approaches is probably lower, since the high stringency approach was recently
added to the HPR project. Thus, continuous improvement and refinement of
laboratory procedures might have contributed to the higher success rates of the
stringent primer design approach.
Chromosome   Number of Ensembl genes   Genes with PrEST   Number of PrEST clones
1            2158                      246                204
2            1443                      163                135
3            1146                      162                142
4            850                       105                97
5            964                       125                103
6            1146                      141                117
7            1094                      145                120
8            785                       93                 84
9            904                       110                96
10           874                       84                 74
11           1397                      142                123
12           1085                      157                137
13           394                       37                 29
14           713                       639                476
15           702                       73                 68
16           947                       98                 73
17           1219                      145                130
18           316                       30                 25
19           1455                      216                149
20           622                       81                 57
21           265                       37                 36
22           535                       440                344
X            957                       718                523
Y            129                       77                 44
Table 7.1 The current status of the number of designed PrESTs and sequence verified clones compared to
the gene count on all human chromosomes (the gene count is according to Ensembl August 2005).
In order to increase the efficiency of the RT-PCR analysis, and properly validate
the generated antibodies (see chapter 7.1), the primary strategy was to design two
PrESTs for each protein. The design strategy successfully selected two PrESTs on
the same protein for approximately half of the investigated proteins. For shorter
proteins lacking sufficient length, a single PrEST was designed. In cases where
the protein was shorter than 100 amino acids, the entire protein sequence was
selected as antigen. Proteins shorter than 100 residues constituted 5.3% of the
total protein count in the Ensembl dataset (v26.35).
Failures in PrEST design were a result of interference by signal peptides, restriction
enzyme sites or transmembrane regions. Additionally, some proteins were too short
for design. Together, these failures constituted approximately one percent of
the total number of investigated proteins. Moreover, this study considered target
proteins with high sequence similarities (above 80% pairwise sequence identity)
to other human proteins as fall-out proteins, which in this case constituted around
13% of the total protein data set. Proteins encoded by genes located on human
chromosomes 22 and X constituted a larger part of these fall-outs. The number of
fall-out proteins encoded by genes on chromosome 14 was significantly lower
due to few duplicated regions (Cheung et al., 2003). In contrast, chromosomes 22
and X are known to have high numbers of duplicated segments. However, these
genes might be processed with a strategy developed specifically for handling such
high sequence similarity proteins (publication III).
7.4 Specific PrEST Design Strategies (publication III and unpublished data)
In the HPR, protein and cDNA sequences from Ensembl were the primary data
sources for PrEST design. After retrieval, the sequences were initially grouped
according to a basic classification scheme, resulting in three different protein
categories. Firstly, the nonredundant set comprised one representative protein
from each unique gene locus (see chapter 2.2). For genes with splice variants,
the longest isoform was selected. Secondly, a portion of the nonredundant set
encompassed high sequence similarity proteins, which were processed using a
specific design strategy (publication III). The final category contained additional
spliced isoforms (see chapter 2.2) not covered by the first category. In this case,
the PrEST design was directed towards specific splice variants and towards regions
shared by several or all of the protein isoforms (unpublished data, see chapter
7.4.2).
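The construction of the nonredundant set (one representative per gene locus, the longest isoform where splice variants exist) can be sketched as follows. The input format and the function names are assumptions made for illustration; the identifiers in the usage note are placeholders, not real Ensembl entries.

```python
from typing import Dict, List, Tuple

Protein = Tuple[str, str, str]  # (gene_id, protein_id, sequence); assumed format

def nonredundant_set(proteins: List[Protein]) -> Dict[str, Tuple[str, str]]:
    """Map each gene ID to its longest isoform as (protein_id, sequence).

    On equal length the isoform encountered first is kept.
    """
    best: Dict[str, Tuple[str, str]] = {}
    for gene_id, protein_id, seq in proteins:
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (protein_id, seq)
    return best

def additional_isoforms(proteins: List[Protein]) -> List[Protein]:
    """Return the isoforms not chosen as representatives (the final category)."""
    chosen = {pid for pid, _ in nonredundant_set(proteins).values()}
    return [p for p in proteins if p[1] not in chosen]
```

For example, given two isoforms `("G1", "P1", "MKT")` and `("G1", "P2", "MKTAYIA")`, the representative for `G1` is `P2`, and `P1` falls into the category of additional isoforms.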
Currently, the HPR is focusing on genes on human chromosomes 14, 22, and
X, as well as specific requests from research collaborators. These genes include
biomarkers for human cancers and stem cells, as well as genes from other species,
for instance, evolutionarily relevant chimpanzee genes and Populus (see chapter
8.1) genes involved in plant growth and development.
7.4.1 Antibody-based Proteomics on High Sequence Similarity Proteins (publication
III)
As previously mentioned (see chapter 7.4), three different protein classifications
have been established. Accordingly, different PrEST design strategies have
been developed for individual handling of the different categories. The second
category consisted of high sequence similarity proteins (HSSPs), exhibiting more
than 80% pairwise sequence identity to human proteins from different genes.
This sequence identity limit also corresponded to the fall-out proteins previously
described (publication II). A design strategy for processing HSSPs was developed,
where the proteins initially were grouped in clusters based on the 80%
sequence identity limit. The PrEST design was thereafter focused on locating PrESTs
in regions which were common to all cluster members. As a proof of principle,
antibodies generated towards such cluster specific PrESTs have been analyzed
by Western blot analysis using whole-cell and tissue protein extracts. The ability
to identify proteins within a high sequence similarity cluster by the use of a
single antibody was thus investigated. In addition, novel software was developed to
estimate the minimum number of PrESTs needed for coverage of the HSSPs.
Initially, the HSSPs were identified using an all-to-all sequence similarity search,
in which all human proteins predicted by Ensembl were included. Proteins from
different genes sharing more than 80% sequence identity over the shortest of
the aligned sequences were selected. This resulted in 3,250 identified HSSPs,
corresponding to 14.6% of the non-redundant set of proteins. The subsequent
PrEST design was facilitated using an enhanced version of Bishop (publication
I). This allowed for the simultaneous visualization of all proteins within a cluster,
and for location of PrESTs within aligned high sequence similarity regions. This
version of Bishop contained an additional sequence similarity scanning procedure,
in which the target proteins were aligned to all proteins within their respective
cluster. The similarities to all human proteins could additionally be visualized.
The optimal PrEST location within the aligned high sequence similarity region
was the region of highest local sequence similarity among all proteins in a cluster.
This local similarity was generated as previously described (publication I).
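The HSSP criterion (more than 80% identity over the shortest of the aligned sequences) can be made concrete with a small sketch. It takes an already computed gapped pairwise alignment as two equal-length strings. How the identity was normalised in the actual analysis is not specified beyond the phrase above, so the normalisation by the shorter ungapped sequence is an assumption.

```python
def identity_over_shortest(aln_a: str, aln_b: str) -> float:
    """Percent identity of a gapped pairwise alignment, normalised by the
    length of the shorter ungapped sequence (an assumed reading of the text)."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned strings must have equal length")
    # Count identical, non-gap columns.
    matches = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    shortest = min(len(aln_a.replace("-", "")), len(aln_b.replace("-", "")))
    return 100.0 * matches / shortest if shortest else 0.0

def is_hssp_pair(aln_a: str, aln_b: str, threshold: float = 80.0) -> bool:
    """True if the pair exceeds the identity limit (strictly more than 80%)."""
    return identity_over_shortest(aln_a, aln_b) > threshold
```

Under this normalisation a short protein that is fully contained in a longer one scores 100%, which is why the shorter sequence is used as the denominator.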
To explore the capacity of the generated antibodies in order to identify all members
of the HSSP clusters, an in vitro detection by Western blot analysis was performed
on antibodies towards cluster specific PrESTs. An example showed (see figure 3 in
appendix II) that all cluster members could be identified in one of the cell lines.
Although limited, this study implies the possibility to use this design strategy for
generation of antibodies capable of recognizing all proteins in a high sequence
similarity cluster, despite the fact that their sequences are not identical.
An estimation of the minimum number of PrESTs sufficient to cover all the
previously identified HSSPs was made possible by the development of novel
software. This program performed homology searches using BLASTP (Altschul
et al., 1990), which resulted in the identification of HSSPs. The proteins were
subsequently assembled in clusters based on common hit sequences. The protein
sequences in each cluster were then aligned and grouped into sub-clusters.
Proteins with common regions of sufficient size (in this case 100 residues) were
grouped in the same sub-cluster. Finally, in order to remove redundancies and
to cover as many proteins as possible with a minimum of PrESTs, the entire set
of sub-clusters was subjected to a sorting in which large sub-clusters were
favored. This resulted in a total of 1,124 sub-clusters and 186 single proteins
remaining after the final sorting. This implies that 1,310 PrESTs are necessary for
coverage of the complete set of HSSPs. On average, 2.73 proteins were covered by
a region large enough to accommodate a PrEST fragment.
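The final sorting step can be sketched as a greedy cover that repeatedly picks the sub-cluster covering the most proteins that are still uncovered. The greedy rule is an assumption standing in for the sorting described above, which is only characterised as favoring large sub-clusters; names and data structures are illustrative.

```python
from typing import Dict, List, Set, Tuple

def greedy_cover(sub_clusters: Dict[str, Set[str]],
                 all_proteins: Set[str]) -> Tuple[List[str], Set[str]]:
    """Select sub-clusters greedily, largest gain first.

    Returns the chosen sub-cluster ids and the proteins left as singles.
    A sub-cluster is only chosen if it covers at least two uncovered proteins.
    """
    uncovered = set(all_proteins)
    chosen: List[str] = []
    while True:
        best_id, best_gain = None, 1  # gain must exceed 1 to be useful
        for cid in sorted(sub_clusters):  # sorted ids make ties deterministic
            gain = len(sub_clusters[cid] & uncovered)
            if gain > best_gain:
                best_id, best_gain = cid, gain
        if best_id is None:
            break
        chosen.append(best_id)
        uncovered -= sub_clusters[best_id]
    return chosen, uncovered

def prests_needed(chosen: List[str], singles: Set[str]) -> int:
    """One PrEST per chosen sub-cluster plus one per remaining single protein."""
    return len(chosen) + len(singles)
```

The reported figures are consistent with this counting: 1,124 sub-clusters plus 186 single proteins give 1,310 PrESTs, and (3,250 - 186) / 1,124 is approximately 2.73 proteins per sub-cluster.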
In conclusion, this resulted in a total of 20,281 PrESTs, sufficient for coverage of
the total nonredundant set of proteins, corresponding to the entire set of human
genes (22,221 according to Ensembl v26.35). To facilitate the study of multiple
splice variants belonging to these genes, a specific approach was developed (see
below). In addition, antibodies generated by this HSSP strategy could potentially
be important for investigations of gene duplications and gene families.
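The total of 20,281 PrESTs is consistent with the tally below. This is one reading of the numbers quoted in the text (one PrEST per non-HSSP gene, plus the PrESTs estimated for the HSSPs), not a calculation spelled out in the source.

```python
genes_total = 22_221        # human genes, Ensembl v26.35 (from the text)
hssps = 3_250               # high sequence similarity proteins (from the text)
hssp_prests = 1_124 + 186   # sub-clusters plus remaining single proteins

# Assumed reading: every non-HSSP gene needs one PrEST of its own.
prests_total = (genes_total - hssps) + hssp_prests
print(prests_total)  # 20281
```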
7.4.2 Design of Isoform Specific Protein Fragments (unpublished data)
The final protein category, as defined by the HPR, was proteins with multiple
splice variants, which were not included in the nonredundant set of proteins.
According to Ensembl, about 32% of the human genes had multiple isoforms and
the majority (30.3%) of these genes had between two and four observed variants.
The primary design strategy (see chapter 7.4) focused on the nonredundant set
by selecting a single representative protein for genes with several splice variants.
However, designing PrESTs unique to a specific isoform or common to all of the
splice variants might be desirable. A strategy for designing PrESTs on genes with
several splice variants was therefore developed. The approach included novel
software (ExonViewer), which was used in combination with Bishop (publication
I). ExonViewer used exon boundaries from the Ensembl database and visualized
them in a user-friendly interface (see figure 7.4). A Perl-based script for data
retrieval and pre-processing of data was developed, whereas the main software was
written in Java. Both parts of the system used the BioPerl and BioJava frameworks
for handling of sequence data and for visualization of exons (see chapter 4.4).
This allowed for in-depth studies of different splicing events. Designing PrESTs
on unique splice variants, or regions common to several or all variants was also
possible.
Figure 7.4 Two screenshots of ExonViewer (A, B), in which exons are displayed as color-coded bars. The
uppermost bar represents a consensus view showing all exons belonging to the target gene.
The middle and lowermost bars display the assembled exons (shown in amino acid coordinates).
In both examples, the genes (the VWF gene in A and the TJP1 gene in B) have two different isoforms. In A,
two PrESTs are located in a region unique to one splice variant. B shows the placement of two PrESTs in a
region common to the two isoforms. The PrESTs are labelled PrEST A to PrEST D in the figure.
In this design strategy, the target genes were visualized in ExonViewer, and regions
common or unique to all or several of the isoforms were selected as candidate
PrEST design regions. These regions were at least 100 continuous amino acids
in length, although any desired length limit could be defined. Subsequently, these
regions were exported to Bishop (publication I) for convenient design of protein
fragments.
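The selection of candidate regions can be sketched by treating each isoform as an ordered list of exon identifiers with known lengths in amino acids. This is a simplification made for illustration: exon boundaries that split codons are ignored, and the data structures and names are hypothetical rather than those of ExonViewer.

```python
from typing import Dict, List, Set, Tuple

def common_and_unique(isoforms: Dict[str, List[str]]) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Return exons shared by all isoforms and exons unique to each isoform."""
    sets = {name: set(exons) for name, exons in isoforms.items()}
    common = set.intersection(*sets.values())
    unique = {
        name: exons - set().union(*(s for n, s in sets.items() if n != name))
        for name, exons in sets.items()
    }
    return common, unique

def candidate_regions(isoform_exons: List[str], exon_len_aa: Dict[str, int],
                      wanted: Set[str], min_len: int = 100) -> List[Tuple[int, int]]:
    """Amino acid ranges (start, end exclusive) on one isoform that consist of
    consecutive exons from `wanted` and span at least `min_len` residues."""
    regions: List[Tuple[int, int]] = []
    pos = 0
    run_start = None
    for exon in isoform_exons:
        if exon in wanted:
            if run_start is None:
                run_start = pos
        else:
            if run_start is not None and pos - run_start >= min_len:
                regions.append((run_start, pos))
            run_start = None
        pos += exon_len_aa[exon]
    if run_start is not None and pos - run_start >= min_len:
        regions.append((run_start, pos))
    return regions
```

Regions returned this way would then be handed to the fragment design step, in the same way that ExonViewer regions were exported to Bishop.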
TMA results from four antibodies generated against two pairs of PrESTs were used
as means to validate the strategy. Each pair was designed on the same target protein.
The first pair of PrESTs was located on a region unique to one of the isoforms on
the VWF gene (ENSG00000110799; see figure 7.4A). The second pair was placed
on a region common to both isoforms of the TJP1 gene (ENSG00000104067; see
figure 7.4B). A comparison of the protein expression patterns within the PrEST pairs in 43 different tissues revealed a clear correlation between PrESTs designed
on the same protein (see table 7.2).
Table 7.2 A comparison of the tissue distribution in 43 different tissues of the four investigated antibodies
(A, B, C and D). Red color indicates strong staining, orange color medium staining, and yellow color weak
staining. Blue color displays no staining. Tissues with no available pictures were excluded from the study.
Tissues: Adrenal gland, Appendix, Cerebellum, Cerebral cortex, Cervix (uterine), Colon, Duodenum,
Endometrium 1, Endometrium 2, Epididymis, Esophagus, Fallopian tube, Gall bladder, Heart muscle,
Hippocampus, Kidney, Lateral ventricle, Liver, Lung, Lymph node, Oral mucosa, Ovary, Pancreas,
Parathyroid gland, Placenta, Prostate, Rectum, Salivary gland, Skeletal muscle, Skin, Small intestine,
Smooth muscle, Soft tissue 1, Soft tissue 2, Spleen, Stomach 1, Stomach 2, Testis, Thyroid gland, Tonsil,
Urinary bladder, Vagina, Vulva/anal skin.
Cell types annotated: cortical cells, medullar cells, glandular cells, lymphoid tissue, cells in granular layer,
cells in molecular layer, purkinje cells, neuronal cells, non-neuronal cells, surface epithelial cells, cells in
endometrial stroma, cells in myometrium/ECM, myocytes, cells in glomeruli, cells in tubuli, bile duct cells,
hepatocytes, alveolar cells, macrophages, follicle cells (cortex), non-follicle cells (paracortex), follicle cells,
ovarian stromal cells, exocrine pancreas, islet cells, decidual cells, trophoblastic cells, adnexal cells,
epidermal cells, smooth muscle cells, mesenchymal cells, cells in red pulp, cells in white pulp, cells in
ductus seminiferus, leydig cells.
The different colors in the protein expression patterns indicated the presence of
the proteins in the tissues. Red color indicated strong staining, orange medium
staining, and yellow color indicated weak staining, whereas blue color indicated
no staining. The majority of tissues showed similar staining patterns for PrESTs
located on the same protein (antibody A-B and C-D in table 7.2). Similar staining
in the included tissues of two PrESTs located on the same protein could be seen as
a validation and confirmation of the design strategy (see chapter 7.1).
Yet, according to the applied three-level scheme, different staining intensities (weak,
intermediate, and strong staining) might reveal a similar protein expression upon
closer examination. Differences in staining patterns might be a result of variations
in the tissues, because non-consecutive tissue sections were used. The
tissue sampling was performed on tissues from different individuals, and the
tissues might therefore not be representative. Moreover, the annotation might
have been performed by different annotators. The antibodies used in the TMAs
might also have been diluted in different ways. Furthermore, the splice variants in
the data source might not have been fully curated, i.e. unknown isoforms might
have been present which interfered with the results.
The tissue distribution data used in this study serve merely as validation of the
strategy. These limited data indicate that the design strategy is promising. However,
more antibodies need to be generated for a large-scale evaluation. Furthermore,
PrESTs can be directed to unique regions of several isoforms, thereby revealing
clues about the different tissue distribution patterns.
8. Identification and Characterization of Gene Family Members in a Fully Sequenced Tree Genome
8.1 Populus, a Tree Model System
Trees belonging to the genus Populus (i.e. poplars, including cottonwoods and
aspens) have several attributes that have contributed to their emergence as
predominant model organisms for molecular tree biotechnology (reviewed by
Brunner et al., 2004). Populus members are widely distributed throughout the
Northern hemisphere. For instance, black cottonwood (Populus trichocarpa) grows
in habitats as diverse as those in Alaska and Mexico (Brunner et al., 2004). This
adaptability and genetic variation make Populus trees ideal for studies of adaptive
evolution (see chapter 1.2.1). Furthermore, their short generation time and easy
transformation enable functional genomic studies.
The recent sequencing of the genome of P. trichocarpa (genome.jgi-psf.org/poptr1)
provides means to further study and characterize the fairly small genome
of this organism (see table 5.1). In comparison, loblolly pine (Pinus taeda L.) has
a large and complex genome (about 20 times larger than Populus), and exhibits
a long generation time. Pine is therefore an unsuitable model organism for tree
genomics (Whetten et al., 2001).
8.2 Cellulose Synthases
The capability of forming wood is a specific feature distinguishing trees from
other plants. Cellulose, which is the primary component in wood, is synthesized by large
membrane-bound protein complexes (Doblin et al., 2002). These complexes
include catalytic subunits encoded by CesA gene family members. CesA enzymes
are glycosyl transferases (GTs). GTs form glycosidic bonds by catalyzing the
transfer of sugar moieties from appropriate sugar donors to acceptor molecules,
thus creating carbohydrates such as cellulose (Coutinho et al., 2003). Conserved
residues present in all GTs belonging to the same family as CesA enzymes include
three D residues as well as a QxxRW motif, proposed to be involved in substrate
and acceptor binding, as well as processivity (Saxena et al., 1995; Campbell et
al., 1997; Garinot-Schneider et al., 2000). In addition, CesA family members
harbor two zinc finger domains positioned near the N-terminus, as well as two
hypervariable regions (HVR-1 and HVR-2) specific to plant genes (see figures 8.1
and 8.2). CesA proteins are generally around 1,000 amino acids in length with eight
predicted transmembrane regions.
Figure 8.1 A view of the putative CesA protein integrated into the plasma membrane. The characteristic
features of the CesA protein are: zinc finger domain (green) consisting of two CxxC-motifs, the conserved
catalytic residues D, D, D (blue) and QxxRW (gold) forming the putative catalytic site, the plant-specific
hypervariable regions HVR-1 (red) and HVR-2 (magenta), and the eight transmembrane regions (black).
8.3 Identification of Previously Unknown CesA Family Members
(publication IV)
In this study, the genome sequence of P. trichocarpa was mined for unidentified
CesA family members. This was done by querying the genome with previously
identified CesA EST- and full-length gene sequences from the closely related hybrid
aspen (P. tremula x tremuloides; Djerbi et al., 2004). The screening was performed
with BLAST and resulted in the identification of 18 putative CesA genes based on
sequence similarities alone. Subsequently, the corresponding protein sequences
were subjected to an all-to-all global sequence alignment search using the software
Needle (www.ebi.ac.uk/emboss). This analysis revealed a clear sequence similarity
pattern, which distinctly grouped the CesA genes into seven pairs, one single
sequence, as well as a group of three sequences.
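The all-to-all comparison can be illustrated with a minimal sketch. Needle (EMBOSS) performs Needleman-Wunsch global alignment; the sketch below uses a simplified scoring scheme (match, mismatch and linear gap values chosen here purely for illustration) rather than the substitution matrix and gap penalties applied in the study, and the function names and input sequences are hypothetical.

```python
from itertools import combinations


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return the two gapped strings.

    Scoring values are illustrative assumptions, not those used by Needle.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical aligned positions as a percentage of the alignment length."""
    x, y = needleman_wunsch(a, b)
    if not x:
        return 0.0
    same = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * same / len(x)


def all_to_all(seqs):
    """Pairwise identities for a dict of name -> sequence."""
    return {(p, q): percent_identity(seqs[p], seqs[q])
            for p, q in combinations(sorted(seqs), 2)}


if __name__ == "__main__":
    # Hypothetical toy fragments, not real CesA proteins
    toy = {"seqA": "MKVLCQIC", "seqB": "MKVLCQVC", "seqC": "MRTLSQ"}
    for pair, identity in all_to_all(toy).items():
        print(pair, round(identity, 1))
```

Sorting the resulting pairs by identity would reveal the kind of pairing pattern described above, with each sequence scoring highest against its closest paralog.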
The 18 putative CesA gene sequences were further analyzed and conserved
sequence characteristics were verified for confirmation of true CesA family
membership (see chapter 8.2; figure 8.1). All CesA characteristics were identified
in the 18 putative CesA sequences using tools such as ClustalW (see figure 8.2;
Thompson et al., 1994), TMHMM (Krogh et al., 2001), as well as manual analysis.
This extensive sequence analysis implies that the collected genes are indeed true
CesA family members.
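A coarse version of such a feature check can be sketched with regular expressions. This is only an illustration under stated assumptions: the function name, the 150-residue N-terminal window and the rule of counting D residues upstream of the QxxRW motif are hypothetical simplifications, the input sequence is an invented toy string, and transmembrane prediction (done with TMHMM in the study) is not reproduced.

```python
import re


def cesa_feature_report(protein, n_terminal_window=150):
    """Report simple sequence hallmarks of a putative CesA protein.

    The window size and the D-residue rule are illustrative assumptions.
    """
    n_term = protein[:n_terminal_window]
    # A lookahead finds overlapping CxxC motifs (zinc finger hallmark)
    cxxc = [m.start() for m in re.finditer(r"(?=C..C)", n_term)]
    # First QxxRW motif anywhere in the sequence
    qxxrw = re.search(r"Q..RW", protein)
    upstream = protein[:qxxrw.start()] if qxxrw else ""
    return {
        "cxxc_positions": cxxc,
        "has_two_cxxc": len(cxxc) >= 2,
        "qxxrw_position": qxxrw.start() if qxxrw else None,
        "d_upstream_of_qxxrw": upstream.count("D"),
    }


if __name__ == "__main__":
    # Hypothetical toy sequence constructed for illustration only
    toy = "MCEICGDAVACNECAFPVCRPC" + "AAADAAADAAAD" + "QVLRWALG"
    print(cesa_feature_report(toy))
```

A check of this kind only flags candidates; motif positions would still need to be inspected in the context of a multiple alignment.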
[Figure 8.2: multiple sequence alignment of the 18 PtCesA protein sequences (PtCesA1,
PtCesA2-1, PtCesA2-2, PtCesA3-1, PtCesA3-2, PtCesA4-1, PtCesA4-2, PtCesA5-1, PtCesA5-2,
PtCesA6-1, PtCesA6-2, PtCesA7-1, PtCesA7-2, PtCesA8-1, PtCesA8-2, PtCesA8-3, PtCesA9-1,
PtCesA9-2). Annotated regions:
1A, 1B: Zinc finger domains (CxxC)
2: Hypervariable region I
3A, 3B, 3C: Conserved D residues in the catalytic site
4: Hypervariable region II
5: Conserved QxxRW motif in the catalytic site]
Figure 8.2 A multiple alignment showing the conserved sequence features common to cellulose synthase
family members (these features are also presented in figure 8.1).
A thorough phylogenetic analysis was performed, which included the 18 identified
CesA protein sequences together with CesA protein sequences and EST sequences
from other plant genomes, i.e. hybrid aspen (P. tremula x tremuloides), quaking
aspen (P. tremuloides), cotton (Gossypium hirsutum), rice (Oryza sativa), thale cress
(A. thaliana) and maize (Zea mays). This analysis was restricted to conserved regions
among CesA family members, thus the HVR-1 and -2 regions (see figures 8.1 and 8.2)
were excluded. According to the phylogenetic tree generated by the Bayesian method
(see figure 2 in appendix IV), PtCesA2, 4, and 6, as
well as PtCesA5, 7, and 8 (see figure 8.3), are most likely monophyletic (arising from
a single common ancestral gene). Previous studies involving PtrCesA4 and 5
(present in figure 8.3A) suggest their involvement in primary cell wall synthesis
(Kalluri and Joshi, 2003; Kalluri and Joshi, 2004). Furthermore, PttCesA2 and 4
(present in figure 8.3B) are constitutively expressed in several different tissues
(Djerbi et al., 2004). It is therefore tempting to speculate that PtCesA2, 4 and 6 are
active in synthesis of primary cell walls, and that PtCesA5, 7 and 8 are constitutively
expressed. However, more extensive studies need to be performed in order to fully
determine their function.
[Figure 8.3, panel A, leaf labels: PtrCesA7, PttCesA4, PtCesA4_1, PtCesA4_2, AtCesA2,
AtCesA9, AtCesA6, AtCesA5, PttCesA2, PtCesA2_1, PtCesA2_2, PtrCesA6, PtCesA6_1,
PtCesA6_2, OsCesA9 †, OsCesA6, ZmCesA8, ZmCesA6, ZmCesA7, OsCesA5, OsCesA3, HvCesA2 †]
[Figure 8.3, panel B, leaf labels: PtCesA8_3, PtCesA8_1, PtrCesA4, PtCesA8_, AtCesA1,
AtCesA10, ZmCesA2, ZmCesA1, HvCesA6, OsCesA1, ZmCesA3 †, ZmCesA5, OsCesA2, HvCesA3 †,
ZmCesA4, OsCesA8, ZmCesA9, HvCesA1, GhCesA3, AtCesA3, PtrCesA5, PtCesA5_1, PtCesA5_2,
PtCesA7_1, PtCesA7_2]
Figure 8.3 Phylogenetic relationships of CesA genes from Populus trichocarpa (Pt), Populus tremula x
tremuloides (Ptt), Populus tremuloides (Ptr), Arabidopsis thaliana (At), Gossypium hirsutum (Gh), Hordeum
vulgare (Hv), Oryza sativa (Os), and Zea mays (Zm). The analysis is based on predicted amino acid sequences
derived from CesA genes in fully sequenced genomes, full-length cDNA copies of putative CesA genes, and
selected partial cDNAs (denoted by †). The numbers along the edges show the a posteriori probabilities of
the partitions induced by the edges.
Sequence comparison alone does not shed light on whether genes might be
transcriptionally silent or not. In order to elucidate whether these newly discovered
CesA genes may be actively transcribed, corresponding EST sequences were
retrieved from PopulusDB (Sterky et al., 2004). The CesA sequences were mapped
to contigs in PopulusDB, which are representations of transcript sequences
consisting of sets of overlapping clones. Alignments exceeding the applied
limit of 94% identical nucleotides over 95% of the contig length were mapped.
The identification of individual EST sequences in PopulusDB was not satisfactory
due to the high sequence similarities between paralogous CesA gene pairs and the
location of several EST sequences in conserved regions of the gene sequences.
Yet, previous gene expression studies in hybrid aspen imply that both genes in at
least one pair, CesA3-1 and CesA3-2, are actively transcribed (Djerbi et al., 2004).
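The mapping criterion described above can be sketched as a simple filter. The record layout, the field and function names, and the use of inclusive thresholds are assumptions made for illustration; the numbers in the example are invented and do not come from PopulusDB.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One gene-to-contig alignment (hypothetical record layout)."""
    gene: str
    contig: str
    identical: int       # identical nucleotides in the alignment
    aligned: int         # contig positions covered by the alignment
    contig_length: int   # total length of the contig


def passes(hit, min_identity=0.94, min_coverage=0.95):
    """Apply the 94% identity and 95% contig coverage limits."""
    if hit.aligned == 0 or hit.contig_length == 0:
        return False
    identity = hit.identical / hit.aligned
    coverage = hit.aligned / hit.contig_length
    return identity >= min_identity and coverage >= min_coverage


def map_contigs(hits):
    """Return contig -> list of genes whose alignments pass the limits."""
    mapping = {}
    for hit in hits:
        if passes(hit):
            mapping.setdefault(hit.contig, []).append(hit.gene)
    return mapping


if __name__ == "__main__":
    # Invented example values
    hits = [
        Hit("PtCesA3-1", "contig1", 960, 1000, 1000),
        Hit("PtCesA3-2", "contig1", 950, 1000, 1000),
        Hit("PtCesA1", "contig2", 900, 1000, 1000),
    ]
    for contig, genes in map_contigs(hits).items():
        status = "ambiguous" if len(genes) > 1 else "unique"
        print(contig, genes, status)
```

A contig that passes the limits for both members of a paralogous pair is reported as ambiguous, which mirrors the difficulty noted above in assigning ESTs to individual genes.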
The full-length CesA gene sequences in hybrid aspen were further used to confirm
the number of identified CesA genes in fully sequenced genomes from other plant
taxa, thale cress (A. thaliana) and rice (Oryza sativa). The analysis corroborates
previous analyses, showing the presence of ten CesA family members in each
genome.
Concluding Remarks
The human genome project has provided large amounts of data that have truly
opened the field of sequence analysis, sequence comparisons, and biocomputing.
Knowledge encompassing computer programming and data analysis, as well as
biotechnology, will be increasingly important for the researchers of tomorrow.
The PrEST selection strategies described in this thesis (publications I, III,
unpublished data) provide a framework for designing PrESTs for all previously
defined categories of genes (see chapter 7.4). These strategies are built around
experimentally defined criteria. Specifically directed software and sequence
analysis tools developed to facilitate the move from a one-by-one sample analysis
to high-throughput handling of data (publications I-III, unpublished data, related
publication B) will be of great importance. Due to the wide range of unsolved
problems and the uniqueness of research projects, generally applied systems will
not be suitable in the majority of cases. Instead, most software solutions will need
to be custom-made for specific purposes.
Large-scale proteomic efforts such as the HPR (see chapter 7.1, related publication
A) will provide a vast amount of data and shed further light on the functions
and localization of human gene products. Different antibody-based approaches
need to be synchronized in order to avoid redundant target genes and allow
for comparative studies. Furthermore, as the specificity of antibodies is of great
importance, proper validations need to be performed by different assays in a
systematic fashion. Cross-reactivities can thus be kept at a minimum and the
quality-assured antibodies will be of great value to the proteomic research community. In
contrast, small-scale in-depth investigations of individual target genes or gene
families (publication IV) will still be performed, as these studies need to be done
by researchers having expert knowledge regarding these targets. Collaborations
between small-scale and large high-throughput projects will be advantageous for
both parties. For instance, in-depth knowledge from small-scale projects might
direct large-scale efforts towards identification of relevant target genes.
Currently, we only view the tip of the iceberg of functional genomic and proteomic
research, and the use of computational methods is now intrinsic. All future research
initiatives will incorporate and integrate computer-assisted methods into their
experimental setup. The level of complexity and the amount of the generated
proteomic data will increase, and with that more challenging biological questions
will need to be answered. As a consequence, bioinformatic approaches will be
more prominent than ever before.
List of Abbreviations
2DE         Two-Dimensional gel Electrophoresis
3D          Three-Dimensional
3’          Three prime
5’          Five prime
A           Adenine
BIND        Biomolecular Interaction Network Database
BLAST       Basic Local Alignment Search Tool
BLASTP      Basic Local Alignment Search Tool Protein
BLOSUM      BLOcks SUbstitution Matrix
C           Cytosine
cDNA        complementary DNA
CesA        cellulose synthase
DDBJ        DNA DataBank of Japan
DIP         Database of Interacting Proteins
DNA         DeoxyriboNucleic Acid
EBI         European Bioinformatics Institute
EC          Enzymatic activity Classification
EMBL        European Molecular Biology Laboratory
EST         Expressed Sequence Tag
FASTA       FAST All
G           Guanine
GO          Gene Ontology
GT          Glycosyl Transferases
HGP         Human Genome Project
HPR         Swedish Human Proteome Resource
HSSP        High Sequence Similarity Proteins
HVR         HyperVariable Region
IHC         ImmunoHistoChemistry
IMAC        Immobilized Metal ion Affinity Chromatography
kb          kilo bases
LIMS        Laboratory Information Management System
mAbs        monoclonal Antibodies
mRNA        messenger RiboNucleic Acid
MS          Mass Spectrometry
msAbs       mono-specific Antibodies
NCBI        National Center for Biotechnology Information
pAbs        polyclonal Antibodies
PAM         Point Accepted Mutation
PCR         Polymerase Chain Reaction
PDB         Protein Data Bank
Pfam        Protein families database
PIR         Protein Information Resource
PrEST       Protein Epitope Signature Tag
PSI-BLAST   Position-Specific Iterated BLAST
PTM         Post Translational Modification
PVDF        PolyVinylidene DiFluoride
RE          Repetitive Elements
RNA         RiboNucleic Acid
RT-PCR      Reverse Transcriptase Polymerase Chain Reaction
SIB         Swiss Institute of Bioinformatics
SMART       Simple Modular Architecture Research Tool
SNPs        Single Nucleotide Polymorphisms
T           Thymine
TIGR        The Institute for Genomic Research
TMA         Tissue MicroArray
TMHMM       Trans Membrane Hidden Markov Model
UniParc     UniProt archive
UniProt     Universal Protein resource
UNIPROT     UniProt knowledgebase
UniRef      UniProt NREF
UCSC        University of California Santa Cruz
VEGA        Vertebrate Genome Annotation
Acknowledgements
I am grateful to many people; as you all know, this is no one-man show. Without
the following people at work and in my life, I doubt the research over the last five
years and this thesis would have been possible.
People at the Institution of Biotechnology, Royal Institute of
Technology:
Prof. Mathias Uhlen (My very, very, very busy supervisor. You have given me
invaluable guidance in the field of antibody-based proteomics, as well as teaching
me the importance of presenting research data in a convenient way. Furthermore,
you have provided me with splendid ideas regarding our research)
Dr. Fredrik Sterky (You have helped me tremendously over the last years,
structuring our research and motivating me. I really appreciate that your door
is always open for me. At times when I tend to forget the focus in our research and
present drifted-away ideas, you always get me back on track)
Prof. Stefan Ståhl (My former diploma work supervisor, who guided me in the
field of biosynthesis of bioactive marine natural products, currently a sock-walking
hard-rocking Dean of the School of Biotechnology. Initially, you provided me with a
PhD position, and I am truly thankful to you)
Prof. Per-Åke Nygren (My current next-door neighbour, for always giving direct
and specific answers regarding my research), Assoc. Prof. Sophia Hober (For
directing the lab course in a structured and fun way), Prof. Joakim Lundeberg
(For valuable advice concerning my research and for lending me your room when
you were on paternity leave)
Dr. Anja Persson, Dr. Jenny Ottoson, Dr. Henrik Wernerus and Dr. Cristina Al-Khalili
Szigyarto (For bringing clarity to my questions)
Dr. Per Unneberg (For much advice and computational guidance during the old
days, when I was a novice in the field of computers and bioinformatics)
Dr. Afshin Ahmadian and Erik Petterson (For showing me the world of parallel
SNP amplification. Our meetings were the most laid-back I’ve ever encountered
during my PhD studies)
My old room-buddies in B3:1049 (Esther Edlund-Rose, Ronny Falk, Olof Nord,
Emma Lundberg, Tove Alm, and Cecilia Laurell. I still miss you, and the quiet
atmosphere we had. Tove and Esther, thank you for being the party-hardy girls
that you are), My intermediate room-buddies in B3:1001 (Lisa Berglund, Carl
Hamsten, Johan Rockberg, Dr. Åsa Sivertsson, Dr. Cristina Al-Khalili Szigyarto,
and Dr. Charlotta Agaton)
The entire HPR-crew at KTH and Uppsala University (For interesting, fun and
crazy happenings at the HPR-annuals at Krusenberg, and at HUPO in Munich)
My Hedhammar and Max Käller (My two partners in thesis writing, thank you
for discussions about references, figures and other things about writing a thesis
nobody else cares about)
And of course, all of the remaining hard-working associates at the divisions of
Proteomics, Gene Technology and Molecular Biotechnology at the School
of Biotechnology (You know who you are)
Moreover, the Knut and Alice Wallenberg Foundation is gratefully acknowledged
for financial support.
The following people have contributed tremendously to my thesis:
Prof. Mathias Uhlen, Dr. Fredrik Sterky, Dr. Henrik Wernerus, Dr. Soraya Djerbi,
and Dr. Cristina Al-Khalili Szigyarto (For critical reading and fruitful discussions)
Håkan Gill (My redwine-loving, former next-door-neighbour. Thank you for
bringing the third dimension to my thesis by providing me with awesome 3D
pictures. You are truly a master at 3D!)
Assoc. Prof. Harry Brumer III (For corecting maj innglissh, and teaching me that
N-terminals has nothing to do with biotechnology)
Cecilia Heyman at the library (For answering my many, many e-mails and providing
me with publications that I couldn’t get hold of myself)
Furthermore, I would like to thank the following for providing me with material
and insights: Johan Rockberg (the unpublished data section), Prof. Per-Åke
Nygren (discussions about affinity reagents), Dr. Cristina Divne (protein structure
pictures), Prof. Fredrik Pontén (TMA picture and the annotation of the staining),
Dr. Peter Nilsson (PrEST array picture), Dr. Anja Persson (raw material for the
PrEST statistics table), Marcus Johansson (helpful advice with InDesign),
Dr. Afshin Ahmadian (discussions about the SNP section) and Dr. Margareta
Ramström (MS spectra).
My Friends:
R&B: (It’s been, and still is, a BLAST to hang around with the crazy crew from
R&B, and every event has had its fair share of relaxation, fun and craziness. I was
going to write extremeness, but maybe we’ll add that to the list after some years.
Just to remind you: Stenskär, Gotland, Dalarna, Jönköping, Västerås, Florida
and Sandhamn)
R&B is: André (The wine-knower who likes to shake his stuff to latino rhythms
and sing Ronja Rumpdaughter songs), Ante (The old shrimp-farmer, who has a
Matrix-inspired running-form, and burned his upper-lip once when drinking),
Feffe (Knows his charter-destinations and luxury cars, he dances with one finger
up in the air), Jake (He drives nothing but Alfa Romeo and his tough body withstands
even several rounds of ammunition), Martin (Vice president in R&B. He is a
newlywed marathon runner, who has actually no taste in music), Niclas (The
designer/bonus-dad/carpenter, who likes to hang out at Knife-south and likes to
dance riverdance), Niklas (A.K.A. Mr. Handsome and Becks, has the sickest golf-swing
in history, and the wackiest kangaroo-dance, likes on-line poker and brand
clothes), Ola (The man with the broken arm, the proud owner of a pizza-racer,
which he once forgot where it was), Patric (co-founder of R&B, A.K.A. ‘The guy
in cruel condition’, drives a Mercedes, lives the high-life and hangs out close to
svampen), and Sparre (A.K.A. Goose and Spirre, who is still in the military service,
that is, a TopGun instructor)
I’m looking forward to new events, R&B forever!
Daniel, zlow-roc Johansen (The only time I’ve seen him pissed off is during gym
class when he wasn’t allowed to drink soda. Thanks for awesome Miami-trips,
boat-trips, parties, T2-review and dinners)
Magnus von Bahr (The second sickest party dancer, thanks for fishing trips,
dinners at Rörstrandsgatan and crazy party trips)
Tomas Eklund (A.K.A. Tjomme, Lance Armstrong, thank you for the opportunity
of having my party at your place. I’ll do my best to raise the roof!)
My Family:
The Djerbis, Taieb & Enikö, Samir & Emili (For nice dinners and get-togethers)
Mounira & Martin (For weekends at Stenskär, pleasant dinners and for X-box/
Singstar nights)
Jonas (My oldest brother, thank you for looking after me all these years, even if I
didn’t like that when I was younger), Maria (For moral support), Nora & Hanna
(For being who you are)
Fredrik (My older brother, thank you for your support the last years and for
watching Adam when I really needed to work), Eva (For nice get-togethers at
Stenskär and Åre, Good luck with your research), Hugo & Filippa (For bringing
joy and fun)
Mom Eva and Dad Rolf (My two heroes! The support I’ve had from you all
my life is more than a son can ask for. Thank you so much for taking care of Adam
when I wrote my thesis. Can you believe, even though I wasn’t the most promising
student in school, I still managed to accomplish this. I owe a lot to you)
Adam (A.K.A. Mats Jr., Knas-fnas, Klanton, Fjanton. When you say: ‘I love you on
all planets, even on the sun’ it makes me warm inside. I hope I haven’t neglected you
too much when writing my ‘blue book’.)
Sosso (My ideal-princess and soulmate. I sure hope our time together only has
started and that we will have many years together in the future. Thank you for all
your help with Adam, you are surely the best bonus-mom anyone can ask for in
the world. I love you)
Kristina Tunkrans (the attorney and DJ-extravaganza. You have supported me
tremendously over the last years!)
Popular Science Summary (Populärvetenskaplig sammanfattning)
In recent years, biotechnology research has made great progress because the
human genetic material (the genome) is now available as sequence. It was mapped
by a large international collaboration, and in 2003 a public high-quality map of
the human genome was released. That map included approximately 25,000 genes,
which contain the information about all hereditary human traits. These genes in
turn code for active components, proteins, which among other things are the
body's building blocks and enzymes. More than 90% of today's drugs target
proteins. Because of this importance of proteins, several proteomics studies
have been initiated. Proteomics research aims to systematically identify and
characterize all proteins. The research is carried out in several ways, for
example by using antibodies that bind to proteins, which can thereby be detected
in, for example, tissues.
During 2003, a collaborative project was started between the Royal Institute of
Technology (Kungliga Tekniska Högskolan) and Uppsala University. In this project,
called HPR (the Swedish Human Proteome Resource; see figure 7.1), the human
proteins and their expression are systematically mapped using tissue microarrays
(see figure 3.3). These arrays consist of samples from 48 different types of
tissue and 20 types of cancer. Proteins can thus be studied with respect to where
they are expressed.
In HPR, antibodies are generated against fragments of proteins, so-called PrESTs
(Protein Epitope Signature Tags; see figure 7.3). These PrESTs serve as signatures
or representatives of human proteins. The protein fragments are carefully selected
based on criteria that facilitate the subsequent laboratory steps. Among other
things, the protein fragments should have as low sequence similarity as possible
to other human proteins, i.e. be as unique as possible to the protein they
represent. The selection of protein fragments is made by a computer program,
Bishop (paper I). In addition, a study evaluating this selection strategy has been
carried out (paper II). That study shows that it is possible to design PrESTs for
the vast majority of the proteins examined. However, some consideration must be
given to two special categories of proteins: proteins with high sequence
similarity to other proteins, and multiple splice variants. Multiple splice
variants are several different proteins that originate from one and the same gene.
Two different strategies have therefore been developed to handle these proteins
(paper III and unpublished data).
In addition to the human genome, the genomes of several other organisms have been
mapped, among them poplar, which is a model organism for trees. What distinguishes
trees from other plants is their ability to form wood. Cellulose is one of the
main components of wood and also an important raw material in, for example, the
pulp and textile industries. In a bioinformatic (see below) study of black
cottonwood (Populus trichocarpa; paper IV), previously unknown cellulose synthase
genes were identified and their number was estimated at 18. Previous studies in
plants have indicated about ten cellulose synthase genes.
The work in this doctoral thesis has focused on studies based on publicly
available genome sequences, which were then processed and analysed using
computational methods. This methodology is called bioinformatics and comprises
principles drawn from mathematics, statistics, computer science and biology.
The work described in this thesis includes the development of analysis programs
and database implementations (papers I, III, unpublished data) as well as the
processing of DNA and protein sequences. The sequences analysed originate from
human (papers I-III, unpublished data) and black cottonwood (paper IV). These
analyses have comprised pairwise alignments (papers I, III, unpublished data)
and multiple alignments as well as phylogenetic studies (paper IV).
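The selection criterion described in the summary above, choosing a protein fragment with as low sequence similarity as possible to other human proteins, can be illustrated with a small sketch. This is not the Bishop algorithm; it is a simplified stand-in that assumes similarity can be approximated by the fraction of short subsequences (k-mers) a candidate window shares with the other proteins. The function names, the window length and the value of k are all hypothetical choices made for illustration.

```python
from typing import Dict, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def least_similar_window(target: str, others: Dict[str, str],
                         window: int, k: int = 3) -> Tuple[int, float]:
    """Find the window of `target` that shares the fewest k-mers with `others`.

    Returns (start index, shared fraction). Ties go to the leftmost window.
    NOTE: k-mer sharing is an illustrative proxy for sequence similarity,
    not the scoring used by the thesis software.
    """
    if window < k or window > len(target):
        raise ValueError("need k <= window <= len(target)")

    # Pool the k-mers of every other protein into one background set.
    background: Set[str] = set()
    for seq in others.values():
        background |= kmers(seq, k)

    n_kmers = window - k + 1
    best_start, best_score = 0, float("inf")
    for start in range(len(target) - window + 1):
        fragment = target[start:start + window]
        shared = sum(fragment[i:i + k] in background for i in range(n_kmers))
        score = shared / n_kmers
        if score < best_score:  # strict '<' keeps the leftmost best window
            best_start, best_score = start, score
    return best_start, best_score


# Toy example: the target's first ten residues also occur in another protein,
# so the least similar window lies in the G-rich tail.
print(least_similar_window("AAAAACCCCCGGGGG", {"p2": "AAAAACCCCC"}, window=5))
# prints (8, 0.0)
```

A real selection procedure would also have to weigh the other laboratory criteria mentioned in the summary; the sketch only covers the uniqueness aspect.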
Figure. A summary of the bioinformatic methods covered in this doctoral thesis.
Publicly available DNA and protein sequences formed the basis for the development
of bioinformatic analysis programs that are integrated with databases. In addition,
sequence comparisons were performed, including pairwise and multiple alignments as
well as phylogenetic studies. Panels: DNA and protein sequences; analysis programs;
database; pairwise alignments; multiple alignments; phylogeny.
67
Aebersold, R. and Mann, M.: Mass
spectrometry-based proteomics. Nature 422
(2003) 198-207.
Adams, M.D., Celniker, S.E., Holt, R.A.,
Evans, C.A., Gocayne, J.D., Amanatides,
P.G., Scherer, S.E., Li, P.W., Hoskins,
R.A., Galle, R.F., George, R.A., Lewis,
S.E., Richards, S., Ashburner, M.,
Henderson, S.N., Sutton, G.G.,
Wortman, J.R., Yandell, M.D., Zhang,
Q., Chen, L.X., Brandon, R.C., Rogers,
Y.H., Blazej, R.G., Champe, M., Pfeiffer,
B.D., Wan, K.H., Doyle, C., Baxter,
E.G., Helt, G., Nelson, C.R., Gabor,
G.L., Abril, J.F., Agbayani, A., An, H.J.,
Andrews-Pfannkoch, C., Baldwin, D.,
Ballew, R.M., Basu, A., Baxendale,
J., Bayraktaroglu, L., Beasley, E.M.,
Beeson, K.Y., Benos, P.V., Berman, B.P.,
Bhandari, D., Bolshakov, S., Borkova,
D., Botchan, M.R., Bouck, J., Brokstein,
P., Brottier, P., Burtis, K.C., Busam,
D.A., Butler, H., Cadieu, E., Center,
A., Chandra, I., Cherry, J.M., Cawley,
S., Dahlke, C., Davenport, L.B., Davies,
P., de Pablos, B., Delcher, A., Deng, Z.,
Mays, A.D., Dew, I., Dietz, S.M., Dodson,
K., Doup, L.E., Downes, M., DuganRocha, S., Dunkov, B.C., Dunn, P.,
Durbin, K.J., Evangelista, C.C., Ferraz,
C., Ferriera, S., Fleischmann, W.,
Fosler, C., Gabrielian, A.E., Garg, N.S.,
Gelbart, W.M., Glasser, K., Glodek, A.,
Gong, F., Gorrell, J.H., Gu, Z., Guan,
P., Harris, M., Harris, N.L., Harvey, D.,
Heiman, T.J., Hernandez, J.R., Houck,
J., Hostin, D., Houston, K.A., Howland,
T.J., Wei, M.H., Ibegwam, C., et al.: The
genome sequence of Drosophila melanogaster.
Science 287 (2000) 2185-95
Agaton, C., Galli, J., Hoiden
Guthenberg, I., Janzon, L., Hansson,
M., Asplund, A., Brundell, E., Lindberg,
S., Ruthberg, I., Wester, K., Wurtz,
D., Hoog, C., Lundeberg, J., Stahl,
S., Ponten, F. and Uhlen, M.: Affinity
proteomics for systematic protein profiling
of chromosome 21 gene products in human
tissues. Mol Cell Proteomics 2 (2003)
405-14.
Ahmadian, A. and Lundeberg, J.: A
brief history of genetic variation analysis.
Biotechniques 32 (2002) 1122-4, 1126,
1128 passim.
Ahram, M., Flaig, M.J., Gillespie, J.W.,
Duray, P.H., Linehan, W.M., Ornstein,
D.K., Niu, S., Zhao, Y., Petricoin, E.F.,
3rd and Emmert-Buck, M.R.: Evaluation
of ethanol-fixed, paraffin-embedded tissues
for proteomic applications. Proteomics 3
(2003) 413-21.
Altschul, S.F., Gish, W., Miller, W.,
Myers, E.W. and Lipman, D.J.: Basic
local alignment search tool. J Mol Biol 215
(1990) 403-10.
Altschul, S.F. and Lipman, D.J.: Protein
database searches for multiple alignments.
Proc Natl Acad Sci U S A 87 (1990)
5509-13.
Altschul, S.F., Madden, T.L., Schaffer,
A.A., Zhang, J., Zhang, Z., Miller, W.
and Lipman, D.J.: Gapped BLAST and
PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids
Res 25 (1997) 3389-402.
References
References
68
References
Apweiler, R., Attwood, T.K., Bairoch,
A., Bateman, A., Birney, E., Biswas,
M., Bucher, P., Cerutti, L., Corpet, F.,
Croning, M.D., Durbin, R., Falquet, L.,
Fleischmann, W., Gouzy, J., Hermjakob,
H., Hulo, N., Jonassen, I., Kahn, D.,
Kanapin, A., Karavidopoulou, Y., Lopez,
R., Marx, B., Mulder, N.J., Oinn, T.M.,
Pagni, M., Servant, F., Sigrist, C.J. and
Zdobnov, E.M.: The InterPro database,
an integrated documentation resource for
protein families, domains and functional
sites. Nucleic Acids Res 29 (2001) 3740.
Apweiler, R., Bairoch, A. and Wu, C.H.:
Protein sequence databases. Curr Opin
Chem Biol 8 (2004a) 76-80.
Apweiler, R., Bairoch, A., Wu, C.H.,
Barker, W.C., Boeckmann, B., Ferro,
S., Gasteiger, E., Huang, H., Lopez,
R., Magrane, M., Martin, M.J., Natale,
D.A., O’Donovan, C., Redaschi, N. and
Yeh, L.S.: UniProt: the Universal Protein
knowledgebase. Nucleic Acids Res 32
(2004b) D115-9.
The Arabidopsis genome initiative.
Analysis of the genome sequence of the
flowering plant Arabidopsis thaliana.
Nature 408 (2000) 796-815.
Ashurst, J.L., Chen, C.K., Gilbert, J.G.,
Jekosch, K., Keenan, S., Meidl, P., Searle,
S.M., Stalker, J., Storey, R., Trevanion,
S., Wilming, L. and Hubbard, T.: The
Vertebrate Genome Annotation (Vega)
database. Nucleic Acids Res 33 (2005)
D459-65.
Attwood, T.K., Bradley, P., Flower, D.R.,
Gaulton, A., Maudling, N., Mitchell,
A.L., Moulton, G., Nordle, A., Paine,
K., Taylor, P., Uddin, A. and Zygouri,
C.: PRINTS and its automatic supplement,
prePRINTS. Nucleic Acids Res 31 (2003)
400-2.
Bader, G.D. and Hogue, C.W.: BIND--a
data specification for storing and describing
biomolecular
interactions,
molecular
complexes and pathways. Bioinformatics
16 (2000) 465-77.
Bairoch, A. and Apweiler, R.: The SWISSPROT protein sequence database and its
supplement TrEMBL in 2000. Nucleic
Acids Res 28 (2000) 45-8.
Bairoch, A., Apweiler, R., Wu, C.H.,
Barker, W.C., Boeckmann, B., Ferro,
S., Gasteiger, E., Huang, H., Lopez,
R., Magrane, M., Martin, M.J., Natale,
D.A., O’Donovan, C., Redaschi, N. and
Yeh, L.S.: The Universal Protein Resource
(UniProt). Nucleic Acids Res 33 (2005)
D154-9.
Baneyx, F.: Recombinant protein expression
in Escherichia coli. Curr Opin Biotechnol
10 (1999) 411-21.
Barlow, D.J., Edwards, M.S. and
Thornton, J.M.: Continuous and
discontinuous
protein
antigenic
determinants. Nature 322 (1986) 747-8.
Barry, R. and Soloviev, M.: Quantitative
protein profiling using antibody arrays.
Proteomics 4 (2004) 3717-26.
Bateman, A., Birney, E., Cerruti, L.,
Durbin, R., Etwiller, L., Eddy, S.R.,
Griffiths-Jones, S., Howe, K.L., Marshall,
M. and Sonnhammer, E.L.: The Pfam
protein families database. Nucleic Acids
Res 30 (2002) 276-80.
Bateman, A., Coin, L., Durbin, R.,
Finn, R.D., Hollich, V., Griffiths-Jones,
S., Khanna, A., Marshall, M., Moxon,
S., Sonnhammer, E.L., Studholme,
D.J., Yeats, C. and Eddy, S.R.: The Pfam
protein families database. Nucleic Acids
Res 32 (2004) D138-41.
Battifora,
H.:
The
multitumor
(sausage) tissue block: novel method for
immunohistochemical antibody testing.
Lab Invest 55 (1986) 244-8.
Bauer, A. and Kuster, B.: Affinity
purification-mass spectrometry. Powerful
tools for the characterization of protein
complexes. Eur J Biochem 270 (2003)
570-8.
Benton, D.: Bioinformatics--principles and
potential of a new multidisciplinary tool.
Trends Biotechnol 14 (1996) 261-72.
Beranova-Giorgianni: Proteome analysis
by two-dimensional gel electrophoresis and
mass spectrometry: strengths and limitations.
Trends Anal Chem 22 (2003) 273-281.
Blueggel, M., Chamrad, D. and Meyer,
H.E.: Bioinformatics in proteomics. Curr
Pharm Biotechnol 5 (2004) 79-88.
Blundell, T.L. and Mizuguchi, K.:
Structural genomics: an overview. Prog
Biophys Mol Biol 73 (2000) 289-95.
Blythe, M.J. and Flower, D.R.:
Benchmarking B cell epitope prediction:
underperformance of existing methods.
Protein Sci 14 (2005) 246-8.
Bernstein,
F.C.,
Koetzle,
T.F.,
Williams, G.J., Meyer, E.F., Jr., Brice,
M.D., Rodgers, J.R., Kennard, O.,
Shimanouchi, T. and Tasumi, M.: The
Protein Data Bank. A computer-based
archival file for macromolecular structures.
Eur J Biochem 80 (1977) 319-24.
Boeckmann, B., Bairoch, A., Apweiler,
R., Blatter, M.C., Estreicher, A.,
Gasteiger, E., Martin, M.J., Michoud,
K., O’Donovan, C., Phan, I., Pilbout,
S. and Schneider, M.: The SWISS-PROT
protein knowledgebase and its supplement
TrEMBL in 2003. Nucleic Acids Res 31
(2003) 365-70.
Better, M., Chang, C.P., Robinson,
R.R. and Horwitz, A.H.: Escherichia coli
secretion of an active chimeric antibody
fragment. Science 240 (1988) 1041-3.
Borrebaeck, C.A.: Antibodies in
diagnostics - from immunoassays to protein
chips. Immunol Today 21 (2000) 37982.
Birney, E., Andrews, T.D., Bevan, P.,
Caccamo, M., Chen, Y., Clarke, L.,
Coates, G., Cuff, J., Curwen, V., Cutts,
T., Down, T., Eyras, E., FernandezSuarez, X.M., Gane, P., Gibbins, B.,
Gilbert, J., Hammond, M., Hotz,
H.R., Iyer, V., Jekosch, K., Kahari,
A., Kasprzyk, A., Keefe, D., Keenan,
S., Lehvaslaiho, H., McVicker, G.,
Melsopp, C., Meidl, P., Mongin, E.,
Pettett, R., Potter, S., Proctor, G., Rae,
M., Searle, S., Slater, G., Smedley, D.,
Smith, J., Spooner, W., Stabenau, A.,
Stalker, J., Storey, R., Ureta-Vidal, A.,
Woodwark, K.C., Cameron, G., Durbin,
R., Cox, A., Hubbard, T. and Clamp,
M.: An overview of Ensembl. Genome Res
14 (2004) 925-8.
Bradbury, A.R. and Marks, J.D.:
Antibodies from phage antibody libraries. J
Immunol Methods 290 (2004) 29-49.
Blakesley, R.W.: Cycle sequencing.
Methods Mol Biol 23 (1993) 209-17.
Blom, N., Gammeltoft, S. and Brunak,
S.: Sequence and structure-based prediction
of eukaryotic protein phosphorylation sites. J
Mol Biol 294 (1999) 1351-62.
Bradbury, A.R., Velappan, N., Verzillo,
V., Ovecka, M., Marzari, R., Sblattero,
D., Chasteen, L., Siegel, R. and Pavlik,
P.: Antibodies in proteomics. Methods Mol
Biol 248 (2004) 519-46.
Brunner, A.M., Busov, V.B. and Strauss,
S.H.: Poplar genome sequence: functional
genomics in an ecologically dominant plant
species. Trends Plant Sci 9 (2004) 4956.
The C. elegans consortium : Genome
sequence of the nematode C. elegans:
a platform for investigating biology.
Science 282 (1998) 2012-8.
Campbell, J.A., Davies, G.J., Bulone,
V. and Henrissat, B.: A classification of
nucleotide-diphospho-sugar
glycosyltransferases based on amino acid
sequence similarities. Biochem J 326 ( Pt
3) (1997) 929-39.
69
References
Benson, D.A., Karsch-Mizrachi, I.,
Lipman, D.J., Ostell, J. and Wheeler,
D.L.: GenBank. Nucleic Acids Res 33
(2005) D34-8.
70
References
Chen, Y. and Xu, D.: Computational
analyses of high-throughput protein-protein
interaction data. Curr Protein Pept Sci 4
(2003) 159-81.
Doblin, M.S., Kurek, I., Jacob-Wilk, D.
and Delmer, D.P.: Cellulose biosynthesis
in plants: from genes to rosettes. Plant Cell
Physiol 43 (2002) 1407-20.
Cheung, J., Estivill, X., Khaja, R.,
MacDonald, J.R., Lau, K., Tsui, L.C.
and Scherer, S.W.: Genome-wide detection
of segmental duplications and potential
assembly errors in the human genome
sequence. Genome Biol 4 (2003) R25.
Doolittle, R.F., Feng, D.F., Johnson,
M.S. and McClure, M.A.: Relationships
of human protein sequences to those of other
organisms. Cold Spring Harb Symp
Quant Biol 51 Pt 1 (1986) 447-55.
The
chimpanzee
sequencing
consortium: Initial sequence of the
chimpanzee genome and comparison with
the human genome. Nature 437 (2005)
69-87.
Coutinho, P.M., Deleury, E., Davies, G.J.
and Henrissat, B.: An evolving hierarchical
family classification for glycosyltransferases.
J Mol Biol 328 (2003) 307-17.
Cregg, J.M., Cereghino, J.L., Shi,
J. and Higgins, D.R.: Recombinant
protein expression in Pichia pastoris. Mol
Biotechnol 16 (2000) 23-52.
Crick, F.: The Biological Replication of
Macromolecules. Symp. Soc. Exp. Biol
XII, 138 (1958).
Dayhoff, M., Schwartz, R.M. and Orcutt,
B.C.: A model of evolutionary change in
proteins., In Atlas of Protein Sequence
and Structure. National Biomedical
Reseach Foundation, Silver Spring,
MD, 1978, pp. 345-352.
de Hoog, C.L. and Mann, M.: Proteomics.
Annu Rev Genomics Hum Genet 5
(2004) 267-93.
Djerbi, S., Aspeborg, H., Nilsson,
P.,
Sundberg,
B.,
Mellerowicz,
E., Blomqvist, K. and Teeri, T.T.:
Identification and expression analysis of
genes encoding putative cellulose synthases
(CesA) in the hybrid aspen, Populus tremula
(L.) x P. tremuloides (Michx.). Cellulose
11 (2004) 301-312.
Dunham, I., Shimizu, N., Roe, B.A.,
Chissoe, S., Hunt, A.R., Collins, J.E.,
Bruskiewich, R., Beare, D.M., Clamp,
M., Smink, L.J., Ainscough, R.,
Almeida, J.P., Babbage, A., Bagguley,
C., Bailey, J., Barlow, K., Bates, K.N.,
Beasley, O., Bird, C.P., Blakey, S.,
Bridgeman, A.M., Buck, D., Burgess, J.,
Burrill, W.D., O’Brien, K.P. and et al.:
The DNA sequence of human chromosome
22. Nature 402 (1999) 489-95.
Dunn, M.J.: Detection of total proteins on
western blots of 2-D polyacrylamide gels.
Methods Mol Biol 112 (1999) 319-29.
Ebersberger, I., Metzler, D., Schwarz,
C. and Paabo, S.: Genomewide comparison
of DNA sequences between humans and
chimpanzees. Am J Hum Genet 70
(2002) 1490-7.
El-Sayed,
N.M.,
Myler,
P.J.,
Bartholomeu, D.C., Nilsson, D.,
Aggarwal, G., Tran, A.N., Ghedin, E.,
Worthey, E.A., Delcher, A.L., Blandin,
G., Westenberger, S.J., Caler, E.,
Cerqueira, G.C., Branche, C., Haas,
B., Anupama, A., Arner, E., Aslund, L.,
Attipoe, P., Bontempi, E., Bringaud,
F., Burton, P., Cadag, E., Campbell,
D.A., Carrington, M., Crabtree, J.,
Darban, H., da Silveira, J.F., de Jong, P.,
Edwards, K., Englund, P.T., Fazelina, G.,
Feldblyum, T., Ferella, M., Frasch, A.C.,
Gull, K., Horn, D., Hou, L., Huang,
Y., Kindlund, E., Klingbeil, M., Kluge,
S., Koo, H., Lacerda, D., Levin, M.J.,
Lorenzi, H., Louie, T., Machado, C.R.,
McCulloch, R., McKenna, A., Mizuno,
Y., Mottram, J.C., Nelson, S., Ochaya,
S., Osoegawa, K., Pai, G., Parsons, M.,
Pentony, M., Pettersson, U., Pop, M.,
Emanuelsson, O., Nielsen, H., Brunak,
S. and von Heijne, G.: Predicting
subcellular localization of proteins based
on their N-terminal amino acid sequence. J
Mol Biol 300 (2000) 1005-16.
Fleischmann, R.D., Adams, M.D.,
White, O., Clayton, R.A., Kirkness, E.F.,
Kerlavage, A.R., Bult, C.J., Tomb, J.F.,
Dougherty, B.A., Merrick, J.M. and et
al.: Whole-genome random sequencing and
assembly of Haemophilus influenzae Rd.
Science 269 (1995) 496-512.
Force, A., Lynch, M., Pickett, F.B.,
Amores, A., Yan, Y.L. and Postlethwait,
J.: Preservation of duplicate genes by
complementary, degenerative mutations.
Genetics 151 (1999) 1531-45.
Galperin, M.Y.: The Molecular Biology
Database Collection: 2005 update. Nucleic
Acids Res 33 (2005) D5-24.
Emanuelsson, O., von Heijne, G. and
Schneider, G.: Analysis and prediction of
mitochondrial targeting peptides. Methods
Cell Biol 65 (2001) 175-87.
Garavelli, J.S.: The RESID Database of
Protein Modifications as a resource and
annotation tool. Proteomics 4 (2004)
1527-33.
Fan, J.B., Oliphant, A., Shen, R.,
Kermani, B.G., Garcia, F., Gunderson,
K.L., Hansen, M., Steemers, F., Butler,
S.L., Deloukas, P., Galver, L., Hunt, S.,
McBride, C., Bibikova, M., Rubano,
T., Chen, J., Wickham, E., Doucet,
D., Chang, W., Campbell, D., Zhang,
B., Kruglyak, S., Bentley, D., Haas,
J., Rigault, P., Zhou, L., Stuelpnagel,
J. and Chee, M.S.: Highly parallel SNP
genotyping. Cold Spring Harb Symp
Quant Biol 68 (2003) 69-78.
Garcia-Blanco, M.A., Baraniak, A.P.
and Lasda, E.L.: Alternative splicing in
disease and therapy. Nat Biotechnol 22
(2004) 535-46.
Garinot-Schneider,
C.,
Lellouch,
A.C. and Geremia, R.A.: Identification
of essential amino acid residues in the
Sinorhizobium meliloti glucosyltransferase
ExoM. J Biol Chem 275 (2000) 3140713.
Fields, S. and Song, O.: A novel genetic
system to detect protein-protein interactions.
Nature 340 (1989) 245-6.
Fitch, W.M.: Distinguishing homologous
from analogous proteins. Syst Zool 19
(1970) 99-113.
Fitch, W.M.: Homology a personal view
on some of the problems. Trends Genet 16
(2000) 227-31.
Gasteiger, E., Jung, E. and Bairoch,
A.: SWISS-PROT: connecting biomolecular
knowledge via a protein database. Curr
Issues Mol Biol 3 (2001) 47-55.
Gellissen, G., Melber, K., Janowicz,
Z.A., Dahlems, U.M., Weydemann,
U., Piontek, M., Strasser, A.W. and
Hollenberg, C.P.: Heterologous protein
production in yeast. Antonie Van
Leeuwenhoek 62 (1992) 79-93.
Gibbs, R.A., Weinstock, G.M., Metzker,
M.L., Muzny, D.M., Sodergren, E.J.,
Scherer, S., Scott, G., Steffen, D.,
Worley, K.C., Burch, P.E., Okwuonu,
G., Hines, S., Lewis, L., DeRamo, C.,
Delgado, O., Dugan-Rocha, S., Miner,
G., Morgan, M., Hawes, A., Gill, R.,
Celera, Holt, R.A., Adams, M.D.,
71
References
Ramirez, J.L., Rinta, J., Robertson, L.,
Salzberg, S.L., Sanchez, D.O., Seyler,
A., Sharma, R., Shetty, J., Simpson,
A.J., Sisk, E., Tammi, M.T., Tarleton,
R., Teixeira, S., Van Aken, S., Vogt, C.,
Ward, P.N., Wickstead, B., Wortman,
J., White, O., Fraser, C.M., Stuart, K.D.
and Andersson, B.: The genome sequence
of Trypanosoma cruzi, etiologic agent of
Chagas disease. Science 309 (2005) 40915.
72
References
Amanatides, P.G., Baden-Tillson, H.,
Barnstead, M., Chin, S., Evans, C.A.,
Ferriera, S., Fosler, C., Glodek, A., Gu,
Z., Jennings, D., Kraft, C.L., Nguyen,
T., Pfannkoch, C.M., Sitter, C., Sutton,
G.G., Venter, J.C., Woodage, T., Smith,
D., Lee, H.M., Gustafson, E., Cahill,
P., Kana, A., Doucette-Stamm, L.,
Weinstock, K., Fechtel, K., Weiss, R.B.,
Dunn, D.M., Green, E.D., Blakesley,
R.W., Bouffard, G.G., De Jong, P.J.,
Osoegawa, K., Zhu, B., Marra, M.,
Schein, J., Bosdet, I., Fjell, C., Jones,
S., Krzywinski, M., Mathewson, C.,
Siddiqui, A., Wye, N., McPherson,
J., Zhao, S., Fraser, C.M., Shetty, J.,
Shatsman, S., Geer, K., Chen, Y.,
Abramzon, S., Nierman, W.C., Havlak,
P.H., Chen, R., Durbin, K.J., Egan, A.,
Ren, Y., Song, X.Z., Li, B., Liu, Y., Qin,
X., Cawley, S., Cooney, A.J., D’Souza,
L.M., Martin, K., Wu, J.Q., GonzalezGaray, M.L., Jackson, A.R., Kalafus, K.J.,
McLeod, M.P., Milosavljevic, A., Virk,
D., Volkov, A., Wheeler, D.A., Zhang,
Z., Bailey, J.A., Eichler, E.E., Tuzun,
E., et al.: Genome sequence of the Brown
Norway rat yields insights into mammalian
evolution. Nature 428 (2004) 493-521.
G.O.
Consortium.:
Creating
the
gene ontology resource: design and
implementation, Genome Res, 2001, pp.
1425-33.
Goffeau, A., Barrell, B.G., Bussey, H.,
Davis, R.W., Dujon, B., Feldmann, H.,
Galibert, F., Hoheisel, J.D., Jacq, C.,
Johnston, M., Louis, E.J., Mewes, H.W.,
Murakami, Y., Philippsen, P., Tettelin,
H. and Oliver, S.G.: Life with 6000 genes.
Science 274 (1996) 546, 563-7.
Graslund, S., Larsson, M., Falk, R.,
Uhlen, M., Hoog, C. and Stahl,
S.: Single-vector three-frame expression
systems for affinity-tagged proteins. FEMS
Microbiol Lett 215 (2002) 139-47.
Graveley, B.R.: Alternative splicing:
increasing diversity in the proteomic world.
Trends Genet 17 (2001) 100-7.
Gray, G.S. and Fitch, W.M.: Evolution
of antibiotic resistance genes: the DNA
sequence of a kanamycin resistance gene
from Staphylococcus aureus. Mol Biol Evol
1 (1983) 57-66.
Gu, Z., Cavalcanti, A., Chen, F.C.,
Bouman, P. and Li, W.H.: Extent of gene
duplication in the genomes of Drosophila,
nematode, and yeast. Mol Biol Evol 19
(2002) 256-62.
Haab, B.B.: Advances in protein microarray
technology for protein expression and
interaction profiling. Curr Opin Drug
Discov Devel 4 (2001) 116-23.
Haab, B.B., Dunham, M.J. and Brown,
P.O.: Protein microarrays for highly
parallel detection and quantitation of
specific proteins and antibodies in complex
solutions. Genome Biol 2 (2001)
RESEARCH0004.
Hansen, J.E., Lund, O., Tolstrup,
N., Gooley, A.A., Williams, K.L. and
Brunak, S.: NetOglyc: prediction of mucin
type O-glycosylation sites based on sequence
context and surface accessibility. Glycoconj
J 15 (1998) 115-30.
Hardenbol, P., Baner, J., Jain, M.,
Nilsson, M., Namsaraev, E.A., Karlin-Neumann, G.A., Fakhrai-Rad, H.,
Ronaghi, M., Willis, T.D., Landegren,
U. and Davis, R.W.: Multiplexed
genotyping with sequence-tagged molecular
inversion probes. Nat Biotechnol 21
(2003) 673-8.
Hartley, J.L., Temple, G.F. and Brasch,
M.A.: DNA cloning using in vitro site-specific recombination. Genome Res 10
(2000) 1788-95.
Henikoff, S. and Henikoff, J.G.: Amino
acid substitution matrices from protein
blocks. Proc Natl Acad Sci U S A 89
(1992) 10915-9.
Henikoff, S. and Henikoff, J.G.:
Performance evaluation of amino acid
substitution matrices. Proteins 17 (1993)
49-61.
Hubbard, T., Andrews, D., Caccamo,
M., Cameron, G., Chen, Y., Clamp,
M., Clarke, L., Coates, G., Cox, T.,
Cunningham, F., Curwen, V., Cutts,
T., Down, T., Durbin, R., Fernandez-Suarez, X.M., Gilbert, J., Hammond,
M., Herrero, J., Hotz, H., Howe, K., Iyer,
V., Jekosch, K., Kahari, A., Kasprzyk,
A., Keefe, D., Keenan, S., Kokocinsci,
F., London, D., Longden, I., McVicker,
G., Melsopp, C., Meidl, P., Potter, S.,
Proctor, G., Rae, M., Rios, D., Schuster,
M., Searle, S., Severin, J., Slater, G.,
Smedley, D., Smith, J., Spooner, W.,
Stabenau, A., Stalker, J., Storey, R.,
Trevanion, S., Ureta-Vidal, A., Vogel,
J., White, S., Woodwark, C. and Birney,
E.: Ensembl 2005. Nucleic Acids Res 33
(2005) D447-53.
Hughes, A.L. and Nei, M.: Pattern
of nucleotide substitution at major
histocompatibility complex class I loci
reveals overdominant selection. Nature
335 (1988) 167-70.
Ito, T., Tashiro, K., Muta, S., Ozawa, R.,
Chiba, T., Nishizawa, M., Yamamoto,
K., Kuhara, S. and Sakaki, Y.: Toward
a protein-protein interaction map of the
budding yeast: A comprehensive system
to examine two-hybrid interactions in all
possible combinations between the yeast
proteins. Proc Natl Acad Sci U S A 97
(2000) 1143-7.
Jensen, R.A.: Orthologs and paralogs - we
need to get it right. Genome Biol 2 (2001)
1002.1-1002.3.
Jones, D.T.: Progress in protein structure
prediction. Curr Opin Struct Biol 7
(1997) 377-87.
Jones, D.T.: GenTHREADER: an efficient
and reliable protein fold recognition method
for genomic sequences. J Mol Biol 287
(1999) 797-815.
Jungblut, P., Thiede, B., Zimny-Arndt,
U., Muller, E.C., Scheler, C., Wittmann-Liebold, B. and Otto, A.: Resolution
power of two-dimensional electrophoresis
and identification of proteins from gels.
Electrophoresis 17 (1996) 839-47.
Kalluri, U.C. and Joshi, C.P.: Isolation
and characterization of a new, full-length
cellulose synthase cDNA, PtrCesA5 from
developing xylem of aspen trees. J Exp Bot
54 (2003) 2187-8.
Kalluri, U.C. and Joshi, C.P.:
Differential expression patterns of two
cellulose synthase genes are associated
with primary and secondary cell wall
development in aspen trees. Planta 220
(2004) 47-55.
Kampf, C., Andersson, A.-C., Wester,
K., Björling, E., Uhlen, M. and Ponten,
F.: Antibody-Based Tissue Profiling As
a Tool for Clinical Proteomics. Clinical
Proteomics 1 (2004) 285-300.
Kanz, C., Aldebert, P., Althorpe,
N., Baker, W., Baldwin, A., Bates,
K., Browne, P., van den Broek, A.,
Castro, M., Cochrane, G., Duggan, K.,
Eberhardt, R., Faruque, N., Gamble,
J., Diez, F.G., Harte, N., Kulikova,
T., Lin, Q., Lombard, V., Lopez, R.,
Mancuso, R., McHale, M., Nardone, F.,
Silventoinen, V., Sobhany, S., Stoehr,
P., Tuli, M.A., Tzouvara, K., Vaughan,
R., Wu, D., Zhu, W. and Apweiler, R.:
The EMBL Nucleotide Sequence Database.
Nucleic Acids Res 33 (2005) D29-33.
Karolchik, D., Hinrichs, A.S., Furey,
T.S., Roskin, K.M., Sugnet, C.W.,
Haussler, D. and Kent, W.J.: The UCSC
Table Browser data retrieval tool. Nucleic
Acids Res 32 (2004) D493-6.
Holm, L.: Unification of protein families.
Curr Opin Struct Biol 8 (1998) 372-9.
Hopp, T.P. and Woods, K.R.: Prediction
of protein antigenic determinants from
amino acid sequences. Proc Natl Acad Sci
U S A 78 (1981) 3824-8.
Kasprzyk, A., Keefe, D., Smedley, D.,
London, D., Spooner, W., Melsopp, C.,
Hammond, M., Rocca-Serra, P., Cox, T.
and Birney, E.: EnsMart: a generic system
for fast and flexible access to biological data.
Genome Res 14 (2004) 160-9.
Kent, W.J., Baertsch, R., Hinrichs, A.,
Miller, W. and Haussler, D.: Evolution’s
cauldron: duplication, deletion, and
rearrangement in the mouse and human
genomes. Proc Natl Acad Sci U S A 100
(2003) 11484-9.
Kirkness, E.F., Bafna, V., Halpern,
A.L., Levy, S., Remington, K., Rusch,
D.B., Delcher, A.L., Pop, M., Wang,
W., Fraser, C.M. and Venter, J.C.: The
dog genome: survey sequencing and
comparative analysis. Science 301
(2003) 1898-903.
Kohler, G. and Milstein, C.: Continuous
cultures of fused cells secreting antibody of
predefined specificity. Nature 256 (1975)
495-7.
Kondrashov, F.A., Rogozin, I.B., Wolf,
Y.I. and Koonin, E.V.: Selection in the
evolution of gene duplications. Genome
Biol 3 (2002) RESEARCH0008.
Kononen, J., Bubendorf, L.,
Kallioniemi, A., Barlund, M., Schraml,
P., Leighton, S., Torhorst, J., Mihatsch,
M.J., Sauter, G. and Kallioniemi, O.P.:
Tissue microarrays for high-throughput
molecular profiling of tumor specimens. Nat
Med 4 (1998) 844-7.
Koonin, E.V.: An apology for orthologs - or
brave new memes. Genome Biol 2 (2001)
COMMENT1005.
Krogh, A., Larsson, B., von Heijne,
G. and Sonnhammer, E.L.: Predicting
transmembrane protein topology with a
hidden Markov model: application to
complete genomes. J Mol Biol 305 (2001)
567-80.
Kumar, A., Agarwal, S., Heyman, J.A.,
Matson, S., Heidtman, M., Piccirillo,
S., Umansky, L., Drawid, A., Jansen,
R., Liu, Y., Cheung, K.H., Miller, P.,
Gerstein, M., Roeder, G.S. and Snyder,
M.: Subcellular localization of the yeast
proteome. Genes Dev 16 (2002) 707-19.
Lander, E.S., Linton, L.M., Birren, B.,
Nusbaum, C., Zody, M.C., Baldwin,
J., Devon, K., Dewar, K., Doyle, M.,
FitzHugh, W., Funke, R., Gage, D.,
Harris, K., Heaford, A., Howland, J.,
Kann, L., Lehoczky, J., LeVine, R.,
McEwan, P., McKernan, K., Meldrim,
J., Mesirov, J.P., Miranda, C., Morris,
W., Naylor, J., Raymond, C., Rosetti, M.,
Santos, R., Sheridan, A., Sougnez, C.,
Stange-Thomann, N., Stojanovic, N.,
Subramanian, A., Wyman, D., Rogers,
J., Sulston, J., Ainscough, R., Beck,
S., Bentley, D., Burton, J., Clee, C.,
Carter, N., Coulson, A., Deadman, R.,
Deloukas, P., Dunham, A., Dunham, I.,
Durbin, R., French, L., Grafham, D.,
Gregory, S., Hubbard, T., Humphray,
S., Hunt, A., Jones, M., Lloyd, C.,
McMurray, A., Matthews, L., Mercer,
S., Milne, S., Mullikin, J.C., Mungall,
A., Plumb, R., Ross, M., Shownkeen,
R., Sims, S., Waterston, R.H., Wilson,
R.K., Hillier, L.W., McPherson, J.D.,
Marra, M.A., Mardis, E.R., Fulton, L.A.,
Chinwalla, A.T., Pepin, K.H., Gish, W.R.,
Chissoe, S.L., Wendl, M.C., Delehaunty,
K.D., Miner, T.L., Delehaunty, A.,
Kramer, J.B., Cook, L.L., Fulton, R.S.,
Johnson, D.L., Minx, P.J., Clifton, S.W.,
Hawkins, T., Branscomb, E., Predki, P.,
Richardson, P., Wenning, S., Slezak,
T., Doggett, N., Cheng, J.F., Olsen, A.,
Lucas, S., Elkin, C., Uberbacher, E.,
Frazier, M., et al.: Initial sequencing and
analysis of the human genome. Nature 409
(2001) 860-921.
Lawrence, J.G. and Ochman, H.:
Molecular archaeology of the Escherichia
coli genome. Proc Natl Acad Sci U S A 95
(1998) 9413-7.
Letunic, I., Copley, R.R., Schmidt, S.,
Ciccarelli, F.D., Doerks, T., Schultz, J.,
Ponting, C.P. and Bork, P.: SMART 4.0:
towards genomic data integration. Nucleic
Acids Res 32 (2004) D142-4.
Lipman, D.J. and Pearson, W.R.: Rapid
and sensitive protein similarity searches.
Science 227 (1985) 1435-41.
Lipovsek, D. and Pluckthun, A.: In-vitro
protein evolution by ribosome display and
mRNA display. J Immunol Methods 290
(2004) 51-67.
Little, P.: The book of genes. Nature 402
(1999) 467-8.
Liu, B. and Marks, J.D.: Applying phage
antibodies to proteomics: selecting single
chain Fv antibodies to antigens blotted on
nitrocellulose. Anal Biochem 286 (2000)
119-28.
Luscombe, N.M., Greenbaum, D. and
Gerstein, M.: What is bioinformatics?
A proposed definition and overview of the
field. Methods Inf Med 40 (2001) 346-58.
Lynch, M. and Conery, J.S.: The
evolutionary fate and consequences of
duplicate genes. Science 290 (2000)
1151-5.
Lynch, M. and Conery, J.S.: The
evolutionary demography of duplicate genes.
J Struct Funct Genomics 3 (2003) 35-44.
MacBeath, G. and Schreiber, S.L.:
Printing proteins as microarrays for
high-throughput function determination.
Science 289 (2000) 1760-3.
Makalowski, W.: Genomic scrap yard: how
genomes utilize all that junk. Gene 259
(2000) 61-7.
Mann, M. and Jensen, O.N.: Proteomic
analysis of post-translational modifications.
Nat Biotechnol 21 (2003) 255-61.
Liu, Q.: The univector plasmid-fusion
system, a method for rapid construction
of recombinant DNA without restriction
enzymes. Curr Biol 8 (1999) 1300-1309.
Marcotte, E.M., Pellegrini, M., Ng, H.L.,
Rice, D.W., Yeates, T.O. and Eisenberg,
D.: Detecting protein function and protein-protein interactions from genome sequences.
Science 285 (1999) 751-3.
Lopez, M.F. and Pluskal, M.G.: Protein
micro- and macroarrays: digitizing the
proteome. J Chromatogr B Analyt
Technol Biomed Life Sci 787 (2003)
19-27.
Marcotte, E.M., Xenarios, I. and
Eisenberg, D.: Mining literature for protein-protein interactions. Bioinformatics 17
(2001) 359-63.
Lu, Z., DiBlasio-Smith, E.A., Grant,
K.L., Warne, N.W., LaVallie, E.R.,
Collins-Racie, L.A., Follettie, M.T.,
Williamson, M.J. and McCoy, J.M.:
Histidine patch thioredoxins. Mutant forms
of thioredoxin with metal chelating affinity
that provide for convenient purifications of
thioredoxin fusion proteins. J Biol Chem
271 (1996) 5059-65.
Mattaj, I.W. and Englmeier, L.:
Nucleocytoplasmic transport: the soluble
phase. Annu Rev Biochem 67 (1998)
265-306.
Maxam, A.M. and Gilbert, W.: A new
method for sequencing DNA. Proc Natl
Acad Sci U S A 74 (1977) 560-4.
Mellor, J.C., Yanai, I., Clodfelter, K.H.,
Mintseris, J. and DeLisi, C.: Predictome:
a database of putative functional links
between proteins. Nucleic Acids Res 30
(2002) 306-9.
Lesley, S.A.: High-throughput proteomics:
protein expression and purification in the
postgenomic world. Protein Expr Purif 22
(2001) 159-64.
Meloen, R.H., Puijk, W.C., Langeveld,
J.P., Langedijk, J.P. and Timmerman, P.:
Design of synthetic peptides for diagnostics.
Curr Protein Pept Sci 4 (2003) 253-60.
Mullis, K.B. and Faloona, F.A.: Specific
synthesis of DNA in vitro via a polymerase-catalyzed chain reaction. Methods
Enzymol 155 (1987) 335-50.
Mighell, A.J., Smith, N.R., Robinson,
P.A. and Markham, A.F.: Vertebrate
pseudogenes. FEBS Lett 468 (2000) 109-14.
Needleman, S.B. and Wunsch, C.D.: A
general method applicable to the search for
similarities in the amino acid sequence of
two proteins. J Mol Biol 48 (1970) 443-53.
Nielsen, H., Brunak, S. and von Heijne,
G.: Machine learning approaches for the
prediction of signal peptides and other
protein sorting signals. Protein Eng 12
(1999) 3-9.
Mironov, A.A., Fickett, J.W. and
Gelfand, M.S.: Frequent alternative
splicing of human genes. Genome Res 9
(1999) 1288-93.
Miroux, B. and Walker, J.E.: Overproduction of proteins in Escherichia coli:
mutant hosts that allow synthesis of some
membrane proteins and globular proteins at
high levels. J Mol Biol 260 (1996) 289-98.
Montelione, G.T. and Anderson, S.:
Structural genomics: keystone for a Human
Proteome Project. Nat Struct Biol 6 (1999)
11-2.
Moore, R.C. and Purugganan, M.D.:
The evolutionary dynamics of plant
duplicate genes. Curr Opin Plant Biol 8
(2005) 122-8.
Mounsey, A., Bauer, P. and Hope,
I.A.: Evidence suggesting that a fifth of
annotated Caenorhabditis elegans genes
may be pseudogenes. Genome Res 12
(2002) 770-5.
Mulder, N.J., Apweiler, R., Attwood,
T.K., Bairoch, A., Bateman, A., Binns,
D., Bradley, P., Bork, P., Bucher, P.,
Cerutti, L., Copley, R., Courcelle, E.,
Das, U., Durbin, R., Fleischmann, W.,
Gough, J., Haft, D., Harte, N., Hulo, N.,
Kahn, D., Kanapin, A., Krestyaninova,
M., Lonsdale, D., Lopez, R., Letunic,
I., Madera, M., Maslen, J., McDowall,
J., Mitchell, A., Nikolskaya, A.N.,
Orchard, S., Pagni, M., Ponting, C.P.,
Quevillon, E., Selengut, J., Sigrist,
C.J., Silventoinen, V., Studholme, D.J.,
Vaughan, R. and Wu, C.H.: InterPro,
progress and status in 2005. Nucleic
Acids Res 33 (2005) D201-5.
Nielsen, H., Engelbrecht, J., Brunak,
S. and von Heijne, G.: Identification of
prokaryotic and eukaryotic signal peptides
and prediction of their cleavage sites.
Protein Eng 10 (1997) 1-6.
Nilsson, P., Paavilainen, L., Larsson, K.,
Ödling, J., Sundberg, M., Andersson,
A.-C., Kampf, C., Persson, A., Al-Khalili
Szigartzo, C., Ottosson, J., Björling,
E., Hober, S., Wernérus, H., Wester,
K., Ponten, F. and Uhlen, M.: Towards
a human proteome atlas: high-throughput
generation of mono-specific antibodies for
tissue profiling. Proteomics, In press
(2005).
Nomenclature Committee:
Recommendations of the Nomenclature
Committee of the International Union of
Biochemistry and Molecular Biology on
the Nomenclature and Classification of
Enzymes, Academic, New York, 1992.
Nord, K., Gunneriusson, E., Ringdahl,
J., Stahl, S., Uhlen, M. and Nygren,
P.A.: Binding proteins selected from
combinatorial libraries of an alpha-helical
bacterial receptor domain. Nat Biotechnol
15 (1997) 772-7.
O’Donovan, C., Apweiler, R. and
Bairoch, A.: The human proteomics
initiative (HPI). Trends Biotechnol 19
(2001) 178-81.
O’Donovan, C., Martin, M.J., Glemet,
E., Codani, J.J. and Apweiler, R.:
Removing redundancy in SWISS-PROT
and TrEMBL. Bioinformatics 15 (1999)
258-9.
O’Farrell, P.H.: High resolution two-dimensional electrophoresis of proteins. J
Biol Chem 250 (1975) 4007-21.
Ohno, S.: Evolution by gene duplication.
Springer-Verlag (1970).
Orengo, C.A., Bray, J.E., Buchan, D.W.,
Harrison, A., Lee, D., Pearl, F.M.,
Sillitoe, I., Todd, A.E. and Thornton,
J.M.: The CATH protein family database:
a resource for structural and functional
annotation of genomes. Proteomics 2
(2002) 11-21.
Overbeek, R., Fonstein, M., D’Souza,
M., Pusch, G.D. and Maltsev, N.: The use
of gene clusters to infer functional coupling.
Proc Natl Acad Sci U S A 96 (1999)
2896-901.
Paweletz, C.P., Charboneau, L., Bichsel,
V.E., Simone, N.L., Chen, T., Gillespie,
J.W., Emmert-Buck, M.R., Roth, M.J.,
Petricoin, I.E. and Liotta, L.A.: Reverse
phase protein microarrays which capture
disease progression show activation of pro-survival pathways at the cancer invasion
front. Oncogene 20 (2001) 1981-9.
Pearson, W.R.: Rapid and sensitive
sequence comparison with FASTP and
FASTA. Methods Enzymol 183 (1990)
63-98.
Pearson, W.R.: Protein sequence
comparison and protein evolution. ISMB
2000 (2000).
Pearson, W.R. and Miller, W.: Dynamic
programming algorithms for biological
sequence comparison. Methods Enzymol
210 (1992) 575-601.
Pellegrini, M., Marcotte, E.M.,
Thompson, M.J., Eisenberg, D. and
Yeates, T.O.: Assigning protein functions
by comparative genome analysis: protein
phylogenetic profiles. Proc Natl Acad Sci
U S A 96 (1999) 4285-8.
Petsko, G.A.: Homologuephobia. Genome
Biol 2 (2001) COMMENT1002.
Reinhardt, A. and Hubbard, T.: Using
neural networks for prediction of the
subcellular location of proteins. Nucleic
Acids Res 26 (1998) 2230-6.
Rost, B., Liu, J., Nair, R., Wrzeszczynski,
K.O. and Ofran, Y.: Automatic prediction
of protein function. Cell Mol Life Sci 60
(2003) 2637-50.
Rost, B., Schneider, R. and Sander, C.:
Protein fold recognition by prediction-based
threading. J Mol Biol 270 (1997) 471-80.
Rowen, L., Young, J., Birditt, B., Kaur,
A., Madan, A., Philipps, D.L., Qin, S.,
Minx, P., Wilson, R.K., Hood, L. and
Graveley, B.R.: Analysis of the human
neurexin genes: alternative splicing and the
generation of protein diversity. Genomics
79 (2002) 587-97.
Ruvolo, M.: Comparative primate
genomics: the year of the chimpanzee. Curr
Opin Genet Dev 14 (2004) 650-6.
Saiki, R.K., Walsh, P.S., Levenson, C.H.
and Erlich, H.A.: Genetic analysis of
amplified DNA with immobilized sequence-specific oligonucleotide probes. Proc Natl
Acad Sci U S A 86 (1989) 6230-4.
Sakate, R., Osada, N., Hida, M.,
Sugano, S., Hayasaka, I., Shimohira,
N., Yanagi, S., Suto, Y., Hashimoto, K.
and Hirai, M.: Analysis of 5’-end sequences
of chimpanzee cDNAs. Genome Res 13
(2003) 1022-6.
O’Donovan, C., Martin, M.J., Gattiker,
A., Gasteiger, E., Bairoch, A. and
Apweiler, R.: High-quality protein
knowledge resource: SWISS-PROT and
TrEMBL. Brief Bioinform 3 (2002) 275-84.
Sanchez, R. and Sali, A.: Large-scale
protein structure modeling of the
Saccharomyces cerevisiae genome. Proc
Natl Acad Sci U S A 95 (1998) 13597-602.
Sanger, F., Nicklen, S. and Coulson,
A.R.: DNA sequencing with chain-terminating inhibitors. Proc Natl Acad
Sci U S A 74 (1977) 5463-7.
Santoni, V., Molloy, M. and Rabilloud,
T.: Membrane proteins and proteomics: un
amour impossible? Electrophoresis 21
(2000) 1054-70.
Saxena, I.M., Brown, R.M., Jr., Fevre,
M., Geremia, R.A. and Henrissat, B.:
Multidomain architecture of beta-glycosyl
transferases: implications for mechanism of
action. J Bacteriol 177 (1995) 1419-24.
Schatz, G. and Dobberstein, B.: Common
principles of protein translocation across
membranes. Science 271 (1996) 1519-26.
Schena, M., Shalon, D., Davis,
R.W. and Brown, P.O.: Quantitative
monitoring of gene expression patterns with
a complementary DNA microarray. Science
270 (1995) 467-70.
Schweitzer, B. and Kingsmore, S.F.:
Measuring proteins on microarrays. Curr
Opin Biotechnol 13 (2002) 14-9.
Shevchenko, A., Wilm, M., Vorm, O. and
Mann, M.: Mass spectrometric sequencing
of proteins silver-stained polyacrylamide
gels. Anal Chem 68 (1996) 850-8.
Sidow, A.: Gen(om)e duplications in the
evolution of early vertebrates. Curr Opin
Genet Dev 6 (1996) 715-22.
Simon, R., Mirlacher, M. and Sauter,
G.: Tissue microarrays in cancer diagnosis.
Expert Rev Mol Diagn 3 (2003) 421-30.
Skolnick, J., Fetrow, J.S. and Kolinski,
A.: Structural genomics and its importance
for gene function analysis. Nat Biotechnol
18 (2000) 283-7.
Smith, L.M., Sanders, J.Z., Kaiser, R.J.,
Hughes, P., Dodd, C., Connell, C.R.,
Heiner, C., Kent, S.B. and Hood, L.E.:
Fluorescence detection in automated DNA
sequence analysis. Nature 321 (1986)
674-9.
Smith, T.F. and Waterman, M.S.:
Identification of common molecular
subsequences. J Mol Biol 147 (1981) 195-7.
Sonnhammer, E.L. and Koonin,
E.V.: Orthology, paralogy and proposed
classification for paralog subtypes. Trends
Genet 18 (2002) 619-20.
Soskic, V., Gorlach, M., Poznanovic,
S., Boehmer, F.D. and Godovac-Zimmermann, J.: Functional proteomics
analysis of signal transduction pathways
of the platelet-derived growth factor beta
receptor. Biochemistry 38 (1999) 1757-64.
Stein, L.D.: Human genome: end of the
beginning. Nature 431 (2004) 915-6.
Sterky, F., Bhalerao, R.R., Unneberg,
P., Segerman, B., Nilsson, P., Brunner,
A.M., Charbonnel-Campaa, L.,
Lindvall, J.J., Tandre, K., Strauss,
S.H., Sundberg, B., Gustafsson, P.,
Uhlen, M., Bhalerao, R.P., Nilsson, O.,
Sandberg, G., Karlsson, J., Lundeberg,
J. and Jansson, S.: A Populus EST resource
for plant functional genomics. Proc Natl
Acad Sci U S A 101 (2004) 13951-6.
Swanson, W.J.: Adaptive evolution of genes
and gene families. Curr Opin Genet Dev
13 (2003) 617-22.
Syvanen, A.C.: Accessing genetic variation:
genotyping single nucleotide polymorphisms.
Nat Rev Genet 2 (2001) 930-42.
Terwilliger, T.C. and Berendzen, J.:
Automated MAD and MIR structure
solution. Acta Crystallogr D Biol
Crystallogr 55 (Pt 4) (1999) 849-61.
Thompson, J.D., Higgins, D.G. and
Gibson, T.J.: CLUSTAL W: improving the
sensitivity of progressive multiple sequence
alignment through sequence weighting,
position-specific gap penalties and weight
matrix choice. Nucleic Acids Res 22
(1994) 4673-80.
Torrents, D., Suyama, M., Zdobnov,
E. and Bork, P.: A genome-wide survey
of human pseudogenes. Genome Res 13
(2003) 2559-67.
Tyers, M. and Mann, M.: From genomics
to proteomics. Nature 422 (2003) 193-7.
Uetz, P., Giot, L., Cagney, G., Mansfield,
T.A., Judson, R.S., Knight, J.R.,
Lockshon, D., Narayan, V., Srinivasan,
M., Pochart, P., Qureshi-Emili, A., Li, Y.,
Godwin, B., Conover, D., Kalbfleisch, T.,
Vijayadamodar, G., Yang, M., Johnston,
M., Fields, S. and Rothberg, J.M.: A
comprehensive analysis of protein-protein
interactions in Saccharomyces cerevisiae.
Nature 403 (2000) 623-7.
Uhlen, M. and Ponten, F.: Antibody-based
proteomics for human tissue profiling.
Mol Cell Proteomics 4 (2005) 384-93.
Venter, J.C., Adams, M.D., Myers, E.W.,
Li, P.W., Mural, R.J., Sutton, G.G.,
Smith, H.O., Yandell, M., Evans, C.A.,
Holt, R.A., Gocayne, J.D., Amanatides,
P., Ballew, R.M., Huson, D.H.,
Wortman, J.R., Zhang, Q., Kodira,
C.D., Zheng, X.H., Chen, L., Skupski,
M., Subramanian, G., Thomas, P.D.,
Zhang, J., Gabor Miklos, G.L., Nelson,
C., Broder, S., Clark, A.G., Nadeau, J.,
McKusick, V.A., Zinder, N., Levine, A.J.,
Roberts, R.J., Simon, M., Slayman, C.,
Hunkapiller, M., Bolanos, R., Delcher,
A., Dew, I., Fasulo, D., Flanigan, M.,
Florea, L., Halpern, A., Hannenhalli,
S., Kravitz, S., Levy, S., Mobarry, C.,
Reinert, K., Remington, K., Abu-Threideh, J., Beasley, E., Biddick, K.,
Bonazzi, V., Brandon, R., Cargill, M.,
Chandramouliswaran, I., Charlab, R.,
Chaturvedi, K., Deng, Z., Di Francesco,
V., Dunn, P., Eilbeck, K., Evangelista,
C., Gabrielian, A.E., Gan, W., Ge, W.,
Gong, F., Gu, Z., Guan, P., Heiman, T.J.,
Higgins, M.E., Ji, R.R., Ke, Z., Ketchum,
K.A., Lai, Z., Lei, Y., Li, Z., Li, J., Liang,
Y., Lin, X., Lu, F., Merkulov, G.V.,
Milshina, N., Moore, H.M., Naik, A.K.,
Narayan, V.A., Neelam, B., Nusskern,
D., Rusch, D.B., Salzberg, S., Shao, W.,
Shue, B., Sun, J., Wang, Z., Wang, A.,
Wang, X., Wang, J., Wei, M., Wides, R.,
Xiao, C., Yan, C., et al.: The sequence of
the human genome. Science 291 (2001)
1304-51.
Vitkup, D., Melamud, E., Moult, J. and
Sander, C.: Completeness in structural
genomics. Nat Struct Biol 8 (2001) 559-66.
von Heijne, G.: Patterns of amino acids
near signal-sequence cleavage sites. Eur J
Biochem 133 (1983) 17-21.
von Heijne, G.: Signal sequences. The
limits of variation. J Mol Biol 184 (1985)
99-105.
von Heijne, G.: Transcending the
impenetrable: how proteins come to terms
with membranes. Biochim Biophys Acta
947 (1988) 307-33.
Taylor, J.S. and Raes, J.: Duplication and
divergence: the evolution of new genes and
old ideas. Annu Rev Genet 38 (2004)
615-43.
Wallis, J.W., Aerts, J., Groenen, M.A.,
Crooijmans, R.P., Layman, D., Graves,
T.A., Scheer, D.E., Kremitzki, C.,
Fedele, M.J., Mudd, N.K., Cardenas, M.,
Higginbotham, J., Carter, J., McGrane,
R., Gaige, T., Mead, K., Walker, J.,
Albracht, D., Davito, J., Yang, S.P.,
Leong, S., Chinwalla, A., Sekhon, M.,
Wylie, K., Dodgson, J., Romanov, M.N.,
Cheng, H., de Jong, P.J., Osoegawa, K.,
Nefedov, M., Zhang, H., McPherson,
J.D., Krzywinski, M., Schein, J., Hillier,
L., Mardis, E.R., Wilson, R.K. and
Warren, W.C.: A physical map of the
chicken genome. Nature 432 (2004) 761-4.
Warford, A., Howat, W. and McCafferty,
J.: Expression profiling by high-throughput
immunohistochemistry. J Immunol
Methods 290 (2004) 81-92.
Waterston, R.H., Lindblad-Toh,
K., Birney, E., Rogers, J., Abril, J.F.,
Agarwal, P., Agarwala, R., Ainscough, R.,
Alexandersson, M., An, P., Antonarakis,
S.E., Attwood, J., Baertsch, R., Bailey, J.,
Barlow, K., Beck, S., Berry, E., Birren,
B., Bloom, T., Bork, P., Botcherby, M.,
Bray, N., Brent, M.R., Brown, D.G.,
Brown, S.D., Bult, C., Burton, J., Butler,
J., Campbell, R.D., Carninci, P., Cawley,
S., Chiaromonte, F., Chinwalla, A.T.,
Church, D.M., Clamp, M., Clee, C.,
Collins, F.S., Cook, L.L., Copley, R.R.,
Coulson, A., Couronne, O., Cuff, J.,
Curwen, V., Cutts, T., Daly, M., David,
R., Davies, J., Delehaunty, K.D., Deri, J.,
Dermitzakis, E.T., Dewey, C., Dickens,
N.J., Diekhans, M., Dodge, S., Dubchak,
I., Dunn, D.M., Eddy, S.R., Elnitski,
L., Emes, R.D., Eswara, P., Eyras, E.,
Felsenfeld, A., Fewell, G.A., Flicek, P.,
Foley, K., Frankel, W.N., Fulton, L.A.,
Fulton, R.S., Furey, T.S., Gage, D.,
Gibbs, R.A., Glusman, G., Gnerre, S.,
Goldman, N., Goodstadt, L., Grafham,
D., Graves, T.A., Green, E.D., Gregory,
S., Guigo, R., Guyer, M., Hardison,
R.C., Haussler, D., Hayashizaki, Y.,
Hillier, L.W., Hinrichs, A., Hlavina, W.,
Holzer, T., Hsu, F., Hua, A., Hubbard,
T., Hunt, A., Jackson, I., Jaffe, D.B.,
Johnson, L.S., Jones, M., Jones, T.A.,
Joy, A., Kamal, M., Karlsson, E.K., et
al.: Initial sequencing and comparative
analysis of the mouse genome. Nature
420 (2002) 520-62.
Watson, J.D. and Crick, F.H.: The
structure of DNA. Cold Spring Harb
Symp Quant Biol 18 (1953) 123-31.
Wheeler, D.L., Church, D.M., Edgar, R.,
Federhen, S., Helmberg, W., Madden,
T.L., Pontius, J.U., Schuler, G.D.,
Schriml, L.M., Sequeira, E., Suzek,
T.O., Tatusova, T.A. and Wagner, L.:
Database resources of the National Center
for Biotechnology Information: update.
Nucleic Acids Res 32 (2004) D35-40.
Whetten, R., Sun, Y.H., Zhang, Y. and
Sederoff, R.: Functional genomics and
cell wall biosynthesis in loblolly pine. Plant
Mol Biol 47 (2001) 275-91.
Wilkins, M.R., Pasquali, C., Appel,
R.D., Ou, K., Golaz, O., Sanchez, J.C.,
Yan, J.X., Gooley, A.A., Hughes, G.,
Humphery-Smith, I., Williams, K.L.
and Hochstrasser, D.F.: From proteins to
proteomes: large scale protein identification
by two-dimensional electrophoresis and
amino acid analysis. Biotechnology (N
Y) 14 (1996) 61-5.
Wootton, J.C.: Non-globular domains in
protein sequences: automated segmentation
using complexity measures. Comput Chem
18 (1994) 269-85.
Xenarios, I., Rice, D.W., Salwinski,
L., Baron, M.K., Marcotte, E.M. and
Eisenberg, D.: DIP: the database of
interacting proteins. Nucleic Acids Res
28 (2000) 289-91.
Yamagata, A., Kristensen, D.B., Takeda,
Y., Miyamoto, Y., Okada, K., Inamatsu,
M. and Yoshizato, K.: Mapping of
phosphorylated proteins on two-dimensional
polyacrylamide gels using protein
phosphatase. Proteomics 2 (2002) 1267-76.
Yu, U., Lee, S.H., Kim, Y.J. and Kim,
S.: Bioinformatics in the post-genome era. J
Biochem Mol Biol 37 (2004) 75-82.