Computational Analyses of Biological Sequences - Applications to Antibody-based Proteomics and Gene Family Characterization

Mats Lindskog

Royal Institute of Technology
School of Biotechnology
Stockholm 2005

© Mats Lindskog, 2005
ISBN 91-7178-186-2

Royal Institute of Technology, School of Biotechnology, AlbaNova University Center, SE-106 91 Stockholm, Sweden

Printed at Universitetsservice US-AB, Drottning Kristinas väg 53B, SE-100 44 Stockholm, Sweden

Cover: Currently, the sequences of more than 300 different genomes are available. Computer-based methods are used for analyses and visualization of these biological sequences.

To my Mother and Father

Abstract

Following the completion of the human genome sequence, post-genomic efforts have shifted the focus towards the analysis of the encoded proteome. Several different systematic proteomics approaches have emerged, for instance, antibody-based proteomics initiatives, where antibodies are used to functionally explore the human proteome. One such effort is HPR (the Swedish Human Proteome Resource), where affinity-purified polyclonal antibodies are generated and subsequently used for protein expression and localization studies in normal and diseased tissues. The antibodies are directed towards protein fragments, PrESTs (Protein Epitope Signature Tags), which are selected based on criteria favourable in subsequent laboratory procedures. This thesis describes the development of novel software (Bishop) to facilitate the selection of suitable protein fragments and to ensure high-throughput processing of selected target proteins. The majority of proteins were successfully processed by this approach; however, the design strategy resulted in a number of fall-outs. These proteins comprised alternative splice variants, as well as proteins exhibiting high sequence similarities to other human proteins. Alternative strategies were developed for processing of these proteins.
The strategy for handling alternative splice variants included the development of additional software and was validated by comparing the immunohistochemical staining patterns obtained with antibodies generated towards the same target protein. Processing of high sequence similarity proteins was enabled by assembling human proteins into clusters according to their pairwise sequence identities. Each cluster was represented by a single PrEST located in the region of the highest sequence similarity among all cluster members, thereby representing the entire cluster. This strategy was validated by Western blot analysis, in which antibodies directed to such cluster-specific PrESTs were used to identify all proteins within a cluster. In addition, the PrEST design success rates for more than 4,000 genes were evaluated.

Several genomes other than the human genome have been finished; currently, more than 300 genomes are fully sequenced. Following the release of the genome of the tree model organism black cottonwood (Populus trichocarpa), a bioinformatic analysis identified previously unknown cellulose synthases (CesAs), and revealed a total of 18 CesA family members. These genes are thought to have arisen from several rounds of genome duplication. This number is significantly higher than that reported in previous studies of other plant genomes, which comprise only ten CesA family members. Moreover, identification of corresponding orthologous ESTs belonging to the closely related hybrid aspen (P. tremula x tremuloides) for two pairs of CesAs suggests that they are actively transcribed. This indicates that a number of paralogs have preserved their functionalities following extensive genome duplication events in the tree’s evolutionary history.
Key words: antigen, antibodies, antibody-based proteomics, biocomputing, bioinformatics, Populus, PrEST, protein fragment, software, visualization

List of Publications

This thesis is based on the following publications, which in the text are referred to by their Roman numerals:

I† Lindskog, M., Rockberg, J., Uhlén, M., and Sterky, F. (2005) Selection of Protein Epitopes for Antibody Production. BioTechniques Vol. 38, No. 5: pp 723-727

II Berglund, L., Lindskog, M., Persson, A., Sivertsson, Å., Uhlén, M., and Al-Khalili Szigyarto, C. (2005) Design of Protein Epitope Signature Tags for Antibody-based Proteomics, Manuscript

III Lindskog, M., Berglund, L., Hamsten, C., Uhlén, M., and Sterky, F. (2005) Processing of High Sequence Similarity Proteins in Antibody-based Proteomics, Manuscript

IV† Djerbi, S., Lindskog, M., Arvestad, L., Sterky, F., and Teeri, T. T. (2005) The Genome Sequence of Black Cottonwood (Populus trichocarpa) Reveals 18 Conserved Cellulose Synthase (CesA) Genes. Planta Vol. 221, Issue 5, pp 739-46

In addition, previously unpublished data is presented.

Related publications:

A Uhlén, M., Björling, E., Agaton, C., Al-Khalili Szigyarto, C., Amini, B., Andersen, E., Andersson, A-C., Asplund, A., Angelidou, P., Asplund, C., Berglund, L., Bergström, K., Cerjan, D., Ekström, M., Elobeid, A., Eriksson, C., Fagerberg, L., Falk, R., Fall, J., Forsberg, M., Gumbel, K., Halimi, A., Hallin, I., Hamsten, C., Hansson, M., Hedhammar, M., Hercules, G., Kampf, C., Larsson, K., Lindskog, M., Lund, J., Lundeberg, J., Magnusson, K., Malm, E., Nilsson, P., Oksvold, P., Olsson, I., Ottosson, J., Paavilainen, L., Persson, A., Rimini, R., Rockberg, J., Runeson, M., Sivertsson, Å., Sköllermo, A., Steen, J., Stenvall, M., Sterky, F., Strömberg, S., Sundberg, M., Tegel, H., Tourle, S., Wahlund, E., Wan, J., Wernérus, H., Westberg, J., Wester, K., Wrethagen, U., Xu, L.L., Ödling, J., Öster, E., Hober, S., and Pontén, F.
(2005) A Human Protein Atlas for Normal and Cancer Tissues Based on Antibody Proteomics. Molecular and Cellular Proteomics, 2005 Aug 27, Epub ahead of print

B Pettersson, E., Lindskog, M., Lundeberg, J. and Ahmadian, A. Tri-nucleotide Threading for Parallel Amplification of Minute Amounts of Genomic DNA, Submitted

†Publications I and IV are reproduced with kind permission of BioTechniques and Springer Science and Business Media.

Table of Contents

Abstract
List of Publications

INTRODUCTION

1. The Human Genome
1.1 Towards a Complete Sequence of the Human Genome
1.2 Genetic Evolution
1.2.1 Adaptive Evolution
1.2.2 Duplication Events
1.2.3 Other Evolutionary Factors
1.2.4 Genetic Variation
1.2.5 Defining Homology
1.3 Genome Size, Structure and Content
1.4 Genes

2. The Proteome and Proteins
2.1 From Genome to Proteome
2.2 Proteome Diversity
2.3 Proteins

3. Proteomics from a Bioinformatic Perspective
3.1 Background
3.2 Protein Expression and Purification
3.3 Two Dimensional Gel Electrophoresis
3.4 Mass Spectrometry
3.5 Structural Proteomics
3.6 Antibodies in Proteomics
3.6.1 Antibodies and Other Affinity Ligands
3.6.2 Antigens and Epitopes
3.6.3 Protein Analysis on Microarrays
3.6.4 Immunohistochemistry
3.6.5 In Vitro Detection
3.7 Prediction of Subcellular Localization
3.8 Post Translational Modifications
3.9 Protein Interactions and Complexes

4. Biological Information Resources
4.1 Genomic Data
4.2 Proteome and Protein Data
4.2.1 Protein Sequence Data and Protein Information
4.2.2 Protein Structure Data
4.2.3 Protein Domains and Families
4.2.4 Protein Interactions

5. Data Searches, Sequence Comparisons, and Biocomputing
5.1 Comparative Genomics and Proteomics
5.2 Sequence Alignments
5.2.1 Pairwise Sequence Alignments
5.2.2 Practical Aspects of Pairwise Alignments
5.2.3 Multiple Sequence Alignments
5.3 Prediction of Signal Sequences and Transmembrane Regions
5.4 Biocomputing – a Practical Approach

PRESENT INVESTIGATION

6. Objectives

7. Design of Protein Epitope Signature Tags (PrESTs) for Antibody-based Proteomics
7.1 Background
7.2 Design of Protein Fragments for Recombinant Protein Production (publication I)
7.3 A Systematic Analysis of Designed PrESTs (publication II)
7.4 Specific PrEST Design Strategies
7.4.1 Antibody-based Proteomics on High Sequence Similarity Proteins (publication III)
7.4.2 Design of Isoform Specific Protein Fragments (unpublished data)

8. Identification and Characterization of Gene Family Members in a Fully Sequenced Tree Genome
8.1 Populus, a Tree Model System
8.2 Cellulose Synthases
8.3 Identification of Previously Unknown CesA Family Members (publication IV)

Concluding Remarks
List of Abbreviations
Acknowledgements
Populärvetenskaplig Sammanfattning (In Swedish)
References

APPENDICES

Publication I: Selection of Protein Epitopes for Antibody Production
Publication II: Design of Protein Epitope Signature Tags for Antibody-based Proteomics
Publication III: Processing of High Sequence Similarity Proteins in Antibody-based Proteomics
Publication IV: The Genome Sequence of Black Cottonwood (Populus trichocarpa) Reveals 18 Conserved Cellulose Synthase (CesA) Genes

INTRODUCTION

1. The Human Genome

1.1 Towards a Complete Sequence of the Human Genome

Within each cell of all living organisms on earth lies their genetic blueprint, which is written in the form of DNA (DeoxyriboNucleic Acid).
This blueprint is what gives each organism its unique characteristic features. Up until the late 1800s, little was known about the properties and attributes of DNA. Yet, in 1866 Gregor Mendel set us on the path to unravel its secrets (see figure 1.1). He discovered that many traits of an organism depend upon two distinct factors, one from the male and one from the female parent. These factors were later termed genes, carriers of the genetic information. In 1871 Friedrich Miescher successfully isolated nuclein, a substance containing nucleic acids, which he proposed to be the hereditary material. This idea was refined in 1944 when Oswald Avery described DNA as the carrier of hereditary information. Nine years later James Watson and Francis Crick solved the now famous double-stranded helical structure of DNA (Watson and Crick, 1953). Furthermore, Crick postulated the central dogma of molecular biology (Crick, 1958), which describes the process in which DNA is transcribed to mRNA (messenger RiboNucleic Acid), which is subsequently translated into protein (see chapter 2.1).

The advent (Maxam and Gilbert, 1977; Sanger et al., 1977) and automation (Smith et al., 1986) of DNA sequencing provided the foundation for deciphering the information in whole genomes. The Human Genome Project (HGP) was initiated in 1990, and its objectives were to sequence the human genome, together with four model organism genomes, by the year 2005. These species included a bacterium (Escherichia coli), a yeast (Saccharomyces cerevisiae), a nematode (Caenorhabditis elegans) and a fruit fly (Drosophila melanogaster). The Institute for Genomic Research (TIGR) finished sequencing of the first bacterial genome (Haemophilus influenzae) in 1995 (Fleischmann et al., 1995), and in 1998, Celera Genomics introduced a private genome project initiative. A first glimpse of the human genome was provided in 1999 with the sequencing of chromosome 22 (Dunham et al., 1999).
Figure 1.1 Events prior to the completion of the human genome sequence. 1866: Gregor Mendel discovered hereditary factors later termed genes; 1871: The discovery of the molecule nuclein by Friedrich Miescher; 1944: DNA was described as the hereditary material by Oswald Avery; 1953: The structure of DNA was revealed (Watson and Crick, 1953); 1958: Postulation of the central dogma of molecular biology (Crick, 1958); 1977: The introduction of DNA sequencing technology (Maxam and Gilbert, 1977; Sanger et al., 1977); 1986: The automation of DNA sequencing (Smith et al., 1986); 1990: The initiation of the HGP; 1995: First bacterial genome sequence finished (Fleischmann et al., 1995); 1999: The sequence of human chromosome 22 released (Dunham et al., 1999); 2001: A first draft of the human genome sequence was provided (Lander et al., 2001; Venter et al., 2001); 2003: The release of a high-quality human genome sequence (Stein, 2004).

A first draft of the human genome sequence was released in 2001, concurrently by the HGP (Lander et al., 2001) and Celera Genomics (Venter et al., 2001). However, this primary draft included a large number of sequence gaps. In the spring of 2003, 50 years after the DNA structure was first solved, a high-quality sequence map was released, covering 99% of the euchromatic (gene-rich) regions and with the majority of the previously identified gaps bridged (Stein, 2004).
Finally, the largest internationally coordinated genomic research effort ever had fulfilled its goals by releasing a publicly available human genome sequence that will provide data for various avenues of research.

1.2 Genetic Evolution

The recent completions of genome sequences from a large number of different organisms (see chapter 5.1) have allowed the exploration of the evolution of genomic entities, such as genes and proteins. Genomes are not static, but dynamic, complex units that continuously change, adapt, and evolve, and are affected by several evolutionary factors.

1.2.1 Adaptive Evolution

“All living things have much in common, in their chemical composition, their germinal vesicles, their cellular structure, and their laws of growth and reproduction. Therefore I should infer that probably all the organic beings which have ever lived on earth have descended from some one primordial form.”
Charles Darwin, The Origin of Species by Means of Natural Selection.

The theory of natural selection stated by Charles Darwin was highly controversial in 1859. However, evidence supporting natural selection, or adaptive evolution, has accumulated over time, and the theory is now widely accepted by the scientific community. Adaptive evolution is the process by which genes (or organisms) adapt to their surrounding environment, resulting in evolutionary changes if the adaptations are beneficial and functional. Genetic regions subjected to adaptive evolution are most likely functionally relevant. The identification of such regions might therefore result in predictions concerning their function (Swanson, 2003). For instance, adaptive evolution has promoted diversity in the antigen recognition site in the major histocompatibility complex locus (Hughes and Nei, 1988).

1.2.2 Duplication Events

Genetic duplications are crucial for organismal evolution, due to the large increase of genetic material following a duplication event (reviewed by Taylor and Raes, 2004).
Moreover, it has been stated that two consecutive rounds of complete genome duplication have provided the genetic material necessary for the evolution of the vertebrate genome (Ohno, 1970; Sidow, 1996). Duplication events might arise at different levels: duplications of entire genomes (polyploidization), of chromosomal segments (segmental duplications), of individual genes, or of parts of genes. Nonduplicated genes must maintain their function and accordingly their evolution is generally constrained by selective pressure (Kent et al., 2003). In contrast, following a duplication event, one or both duplicates may experience relaxed functional constraints due to redundancy. This is in some cases followed by an increased evolutionary rate, which can provide novel gene functions and regulations (Kondrashov et al., 2002). Genes may be altered by creating new function (neofunctionalization) and/or partitioning of the ancestral gene function (subfunctionalization) or become silent (nonfunctionalization or pseudogenization; see figure 1.2; Ohno, 1970; Force et al., 1999). However, the majority of gene duplicates evolve towards silencing rather than preservation (Lynch and Conery, 2000) and silenced genes are generally termed pseudogenes (see chapter 1.4).

Rates of duplication vary among species and genes; shorter genes appear to be subjected to duplication events more frequently than longer genes (Taylor and Raes, 2004). In this context, gene duplication is a relatively common occurrence (an estimated 0.01 duplications/gene/million years), and the average half-life of duplicated genes is relatively short (around 4.0 million years; Lynch and Conery, 2000; Gu et al., 2002; Lynch and Conery, 2003; Taylor and Raes, 2004).

[Figure 1.2: panel A, small-scale duplication; panel B, segmental or genomic duplication. Legend: regulatory subfunction and coding function, each with gain of a new function or loss of function. Outcomes shown: redundancy, non-functionalization, neofunctionalization (coding or regulatory), and partial or complete subfunctionalization.]

Figure 1.2 A schematic view of the possible fates of duplicated genes. A: The classical model of duplicate gene fates, which consists of non- and neofunctionalization in coding regions (Ohno, 1970). B: Recent theories propose a more complex chain of events, which includes subfunctionalization as well as alterations in regulatory regions. The probabilities of the events are proportional to the thickness of the arrows (adapted from Moore and Purugganan, 2005).

1.2.3 Other Evolutionary Factors

Genome evolution is influenced by deletions and transpositions of genome segments or genes. In addition, retrotranspositions are of importance and comprise about 17% of the human genome (Lander et al., 2001). Retrotransposition is transposition by means of reverse transcription of RNA and subsequent insertion into the genome. Horizontal gene transfer, where genes are transferred between species, has also influenced genome evolution. For instance, about 18% of the genes present in E. coli appear to arise from horizontal gene transfer after the divergence from the Salmonella lineage about 100 million years ago (Lawrence and Ochman, 1998).
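The duplication rate and half-life quoted in section 1.2.2 can be turned into rough numbers. The following is a minimal sketch, not a method from the cited studies: it assumes that silencing of duplicates follows simple exponential decay, and it uses a round, hypothetical gene count of 22,000. The function names are illustrative only.

```python
import math

# Figures quoted in section 1.2.2.
DUPLICATION_RATE = 0.01  # duplications per gene per million years
HALF_LIFE_MY = 4.0       # average half-life of a duplicated gene, in million years

# Assumption for illustration only: a round gene count of 22,000.
GENE_COUNT = 22_000


def new_duplicates(genes: int, million_years: float) -> float:
    """Expected number of duplication events in the given time span."""
    return genes * DUPLICATION_RATE * million_years


def surviving_fraction(million_years: float) -> float:
    """Fraction of duplicates not yet silenced, assuming exponential decay."""
    return 0.5 ** (million_years / HALF_LIFE_MY)


def steady_state_duplicates(genes: int) -> float:
    """Duplicates present when births balance losses (birth rate / decay constant)."""
    decay_constant = math.log(2) / HALF_LIFE_MY
    return genes * DUPLICATION_RATE / decay_constant


if __name__ == "__main__":
    print(new_duplicates(GENE_COUNT, 1.0))        # events per million years
    print(surviving_fraction(8.0))                # fraction left after two half-lives
    print(round(steady_state_duplicates(GENE_COUNT)))
```

Under these assumptions, a genome of this size would gain on the order of a few hundred duplicates per million years, most of which are silenced within a few half-lives. This is consistent with the statement that silencing is the most common fate.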
1.2.4 Genetic Variation

Individual genomes within the same species differ as a result of chromosomal alterations and point mutations, such as single base pair substitutions, insertions and deletions. Variations, whether they affect single nucleotides or longer stretches of DNA, are invaluable markers in mapping of disease-related genes, in diagnostics and human population studies, as well as in evolutionary genetics (Ahmadian and Lundeberg, 2002). The majority of genetic variations are single nucleotide polymorphisms (SNPs), which account for around 85% of all genomic variations (Little, 1999). A SNP is a genetic variation between individuals in a given population, which is limited to a single base pair and occurs at a frequency exceeding one percent (Ahmadian and Lundeberg, 2002). SNPs are distributed non-randomly throughout the human genome, occurring on average at every 1,250th base pair (Lander et al., 2001). Furthermore, SNPs located in the protein coding regions and regulatory segments of genes may give rise to phenotypic differences by altering protein structure and/or function, as well as protein expression levels, which subsequently might lead to disease. The majority of SNPs, however, are situated in noncoding regions of the human genome and are commonly used as convenient markers in population genetics and evolutionary studies (Syvanen, 2001). Identification and amplification of relevant SNPs was facilitated by the introduction of the polymerase chain reaction (PCR; Mullis and Faloona, 1987; Saiki et al., 1989). For genome-wide studies, large numbers of SNPs need to be amplified in parallel (Hardenbol et al., 2003; Fan et al., 2003; related publication B).

1.2.5 Defining Homology

Evolutionary relationships among genes might be complex and difficult to define, hence a set of descriptive terms has been coined (reviewed by Fitch, 2000).
Homology is defined as the relationship of two characters (e.g. genes) sharing a common evolutionary origin (Fitch, 1970). Note that homology is not a measurement of similarity, but a strict statement describing a divergent relation between sequences. There are a number of different subtypes of homology. Orthology is defined as the relationship of two homologous genes that diverged from their common ancestor through a speciation event (Fitch, 1970). Paralogy is the relationship of two genes arising from a duplication event (Fitch, 1970). However, paralogy falls short in specifying whether the genes described are members of the same species. For instance, if duplication is subsequently followed by speciation, two genes belonging to different species can be defined as paralogs. As a consequence, two subtypes of paralogy have been proposed to bring clarity to this issue: paralogs within the same species (inparalogs) and paralogs residing in different species (outparalogs; Sonnhammer and Koonin, 2002). Xenology, the third subtype of homology, is defined as the relationship of two homologous genes involving a horizontal (interspecies) gene transfer (see chapter 1.2.3) for at least one of the genes (see figure 1.3; Gray and Fitch, 1983). Conversely, analogy is defined as the relationship of two genes that have descended from unrelated ancestors by convergence (Fitch, 1970). Unfortunately, these definitions of homology have been, and still are, inconsistently used throughout the research community (Fitch, 2000; Jensen, 2001; Koonin, 2001; Petsko, 2001).

[Figure 1.3: gene tree with events numbered 1-4 and genes labelled A1, AB1, B1, B2, C1, C2, C3.]

Figure 1.3 The three different subtypes of homology: orthology, paralogy, and xenology. A view of a hypothetical evolution of genes derived from a common ancestor. Speciation events (1, 3) are depicted by an upside down ‘Y’ and duplication events (2, 4) by a horizontal bar. Orthologs are two genes sharing a common ancestor by a speciation event (e.g.
B1-C1) and xenologs are genes among which at least one of the genes has undergone a horizontal gene transfer (AB1 is xenologous to all other genes in the figure). Paralogs are genes whose common ancestor resides at a duplication event (e.g. B1-C2 and C2-C3). C1-C3 are inparalogs, thus belonging to the same species, and B1-C2 are outparalogs, as they reside in different species. The latter case arises because a duplication event (2) is followed by a speciation event (3) (adapted from Fitch, 2000).

1.3 Genome Size, Structure, and Content

Following the completion of the human genome (see chapter 1.1), in-depth sequence analyses have to take into consideration the specific properties of the genome. Genetic information is coded in DNA, which consists of four different building blocks, nucleotides, which are designated A (Adenine), C (Cytosine), G (Guanine) and T (Thymine). DNA predominantly exists in a double-stranded configuration, and has a self-complementary property. This property facilitates the recovery of the complementary strand, which is essential during cell division when a cell needs to replicate its DNA (see figure 2.1). Moreover, DNA is tightly bundled in the nucleus by several levels of organization. At the highest level, human DNA is split into units called chromosomes.

The haploid human genome consists of around three billion bases which are distributed among 24 chromosomes (see table 1.1). Only about 1.5% of the genetic material constitutes protein coding regions, so-called exons. About 48% of the genome consists of unique intergenic DNA, for which no function has yet been identified. The majority of this unique DNA is probably an evolutionary artifact, lacking present function. Furthermore, 49% of the human genome consists of repetitive elements (REs), and a small part, about 0.5%, comprises nonfunctional pseudogenes (see chapter 1.4; Makalowski, 2000). There are more than 4.3 million REs in the human genome, and many
There are more than 4.3 million REs in the human genome, and many Genetic Evolution/Genome Size, Structure, and Content 1 8 The Human Genome translated REs are found in proteins, thus having importance for protein function (Li 2001). Although the specific function of these repetitive elements is still unknown, they do interact with the genome and influence its evolution (Makalowski 2000). Chromosome Size (Mbp) % of genome Number of genes % genes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y 245.5 243.0 199.5 191.4 180.9 171.0 158.6 146.3 138.4 135.4 134.5 132.4 114.1 106.4 100.3 88.8 78.8 76.1 63.8 62.4 46.9 49.5 154.8 57.7 7.98 7.90 6.48 6.22 5.88 5.56 5.16 4.75 4.50 4.40 4.37 4.30 3.71 3.46 3.26 2.89 2.56 2.47 2.07 2.03 1.53 1.61 5.03 1.88 2,281 1,482 1,168 866 970 1,152 1,116 794 919 862 1,426 1,104 399 733 766 957 1,257 322 1,468 631 271 552 931 104 10.12 6.58 5.18 3.84 4.31 5.11 4.95 3.52 4.08 3.83 6.33 4.90 1.77 3.25 3.40 4.25 5.58 1.43 6.52 2.80 1.20 2.45 4.13 0.46 Table 1.1 Chromosomal data and gene content of the human genome (based on data from Ensembl v. 33.35d). 1.4 Genes A gene is a stretch of DNA coding for a protein, and might be located on either DNA strand. In eukaryotic organisms, genes appear split into separated segments where regions of protein coding exons are interrupted by stretches of non-coding intervening sequences (introns). In humans, the exons have an average length of 150 nucleotides and the substantially longer introns are generally spanning around 3.5 kilo bases (kb). Yet, intron-lengths as large as 500 kb have been observed (Rowen et al., 2002). In a process entitled splicing, the introns are removed from the transcribed RNA and the protein coding elements are assembled into mature mRNAs which are subsequently translated into proteins. If alternative splicing occurs, the protein coding exons are united differently. 
This results in various protein isoforms with distinct functions, and is an important source of protein diversity (see chapter 2.2).

Through duplication of DNA segments (see chapter 1.2.2), a single gene can give rise to multiple copies of duplicated genes, which retain parts of their functionality. These genes might be referred to as a gene family (see chapter 8.2). The exact definition of a gene family differs, ranging from at least 35% sequence identity between corresponding protein sequences (Orengo et al., 2002), to merely sharing a common protein fold. However, members of a gene family are clearly evolutionarily related, often having similar or related functions.

Genes with close similarity to their respective paralogs, but unable to code for a functional protein, are termed pseudogenes (reviewed by Mighell et al., 2000). Pseudogenization is the most common fate for duplicated genes (see chapter 1.2.2). According to the theory of positive selection, one or several of the duplicates are unconstrained by selective pressure, and therefore randomly accumulate mutations. These mutations might subsequently silence gene function via disruptions of the original reading frame (e.g. frame shifts or introduction of a stop codon). Pseudogenes exist as two different types, processed and nonprocessed. Processed pseudogenes are formed through retrotransposition of mature RNAs, and these genes lack upstream promoters, as a result of random integration into the genome. Nonprocessed pseudogenes most commonly arise after a duplication event, followed by nonfunctionalization (see chapter 1.2.2). Pseudogenes are thought to be responsible for misincorporations in gene collections, and as much as 20% of predicted genes are thought to consist of nonfunctional pseudogenes (Lander et al., 2001; Mounsey et al., 2002). However, the total number of pseudogenes and their genomic locations are in many cases still unknown (Torrents et al., 2003).
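The processes described in section 1.4 (splicing, alternative splicing by exon skipping, and silencing through a premature stop codon) can be illustrated on a toy sequence. The sketch below is an illustration under stated assumptions: the gene, its exon coordinates, and the function names `splice` and `first_stop` are all hypothetical, sequences are written in the DNA alphabet (T rather than U), and the standard stop codons TAA, TAG, and TGA are used.

```python
from typing import List, Optional, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet

# Hypothetical toy gene: three short exons separated by two introns.
EXON_1, INTRON_1 = "ATGGCC", "GTAAGTTTTCAG"
EXON_2, INTRON_2 = "AAAGGT", "GTATGCCCAG"
EXON_3 = "TTCTGA"
GENE = EXON_1 + INTRON_1 + EXON_2 + INTRON_2 + EXON_3

# Exon coordinates on GENE (0-based, end-exclusive).
EXONS = [(0, 6), (18, 24), (34, 40)]


def splice(transcript: str, exons: List[Tuple[int, int]]) -> str:
    """Join the listed exons in order, discarding everything in between."""
    return "".join(transcript[start:end] for start, end in exons)


def first_stop(coding: str) -> Optional[int]:
    """Index of the first in-frame stop codon, or None if there is none."""
    for i in range(0, len(coding) - 2, 3):
        if coding[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


if __name__ == "__main__":
    full = splice(GENE, EXONS)                       # all three exons
    skipped = splice(GENE, [EXONS[0], EXONS[2]])     # cassette exon 2 skipped
    mutant = full[:6] + "T" + full[7:]               # AAA -> TAA substitution
    for name, seq in [("full", full), ("skipped", skipped), ("mutant", mutant)]:
        print(name, seq, first_stop(seq))
```

In this toy case the skipped exon has a length divisible by three, so the reading frame is preserved and a shorter isoform results. The single substitution, in contrast, introduces an early stop codon, which is one of the ways a duplicate can be silenced.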
2. The Proteome and Proteins

2.1 From Genome to Proteome

The majority of the genes in the human genome code for functional proteins, and the path from gene to protein includes many turns before the final gene product is reached. The process in which DNA is transcribed into RNA and thereafter translated into protein is termed the central dogma of molecular biology (see figure 2.1). Furthermore, the dogma includes DNA replication, in which DNA is copied during cell division. Following the transcription of DNA into mRNA, the mRNA migrates from the nucleus to the cytoplasm and encounters ribosomes. The ribosomes provide a framework for holding mRNA in position during protein synthesis, a process known as translation. After translation, proteins might be subjected to a wide variety of modifications which alter the protein properties by adding a modifying group to one or several amino acids, or by proteolytic cleavage (Mann and Jensen, 2003). There are hundreds of different post-translational modifications (PTMs), for instance cleavage of signal sequences, glycosylations, and phosphorylations (Garavelli, 2004). The PTMs might determine localization, activity, interaction and turnover of the proteins (Mann and Jensen, 2003). Moreover, mRNA might be subjected to pre-translational editing, resulting in amino acid changes not inferable from the DNA sequence.

[Figure 2.1: flow diagram. DNA (replication) is transcribed to mRNA; mRNA is reverse transcribed to cDNA; RNA editing gives edited mRNA; translation gives protein; post-translational modification gives modified protein.]

Figure 2.1 The central dogma of molecular biology, which comprises transcription, translation, and replication. The dogma is here displayed in combination with the process of reverse transcription as well as with factors contributing to proteome diversity such as RNA editing and post-translational modifications. Exons are colored in green and introns are grey; note, however, that introns are usually much longer than exons (see chapter 1.4).
cDNA (complementary DNA) contains only exons and is therefore colored in green. mRNA might be subjected to pre-translational editing and these edits are here shown in orange. Moreover, proteins can be modified after the translation (e.g. phosphorylations) and these modifications are shown in red.

In addition, transcription can be reversed in vitro by using the enzyme reverse transcriptase, in a process where mRNA functions as a template for conversion into cDNA (complementary DNA). cDNA contains only exons and is often used in studies when coding gene regions are of interest.

2.2 Proteome Diversity

In 1994, Marc Wilkins first coined the term proteome as the set of proteins encoded by the genome (Wilkins et al., 1996). The proteome consists of a vast array of proteins, among which diversity is generated by events such as alternative splicing, somatic DNA rearrangements, and PTMs. At least one-third of the human genes are subjected to alternative splicing and there are several different types of alternative splicing events (see figure 2.2). Different selections of splice sites, such as 5’ or 3’ splicing, give rise to alternative isoforms. In addition, although rarely occurring, entire introns might be retained and transcribed (Garcia-Blanco et al., 2004). Alternative splicing often results in additional protein domains, rather than alternating domains (Mironov et al., 1999).

[Figure 2.2: five panels labelled A to E.]

Figure 2.2 Different types of alternative splicing events. Constitutive exons are shown in light grey color, alternative exons are dark grey, and the intron is dark grey with stripes. Exons can have multiple 5’ or 3’ splice sites (A, B). Fully contained exons that are alternatively used are called cassette exons. Single cassette exons can reside between two constitutive exons and the alternative exon is either skipped or included (C).
Multiple cassette exons can be located between two constitutive exons, and the splicing machinery must choose between them (D). Finally, a lack of splicing might result in intron retention (E; adapted from Graveley, 2001).

Protein class | Description | Number of proteins
Nonredundant proteins | A protein representative from every gene locus | 20,000-25,000
Protein isoforms | Variants obtained by alternative splicing | 50,000-100,000
Combinatorial variants | Proteins generated by somatic DNA rearrangements | >10,000,000
Protein species | Proteins that differ as a result of PTMs | >100,000
Protein alleles | Differences in genetic variations resulting | 75,000-150,000

Table 2.1 Estimated number of protein molecules in the human proteome, as a result of different classifications (adapted from Uhlen and Ponten, 2005).

This extensive diversity of the proteome might explain why species that are highly similar in their DNA differ substantially at the phenotypic level. Indeed, recent studies imply that the human and chimpanzee genomes differ by only 1.2% in intergenic regions and 0.6% in protein-coding regions (Ebersberger et al., 2002; Sakate et al., 2003). Since the proteome is not fully characterized, differences between proteomes can currently only be estimated.

Including all protein variants, the human proteome is estimated to contain millions of different molecular species (O’Donovan et al., 2001; Bauer and Kuster, 2003; Uhlen and Ponten, 2005). However, the total number clearly depends on how the proteome is defined. The proteome might be divided into five sub-proteomic classes, which all contribute to the total number of protein molecules (see table 2.1; Uhlen and Ponten, 2005). One class constitutes a nonredundant protein set, in which every gene locus contributes a single representative protein.
The current human gene count is 22,218 (Ensembl v33.35; Hubbard et al., 2005), and although this number will change, the final number in this class is most probably between 20,000 and 25,000. Another class comprises the large number of proteins that result from alternative splicing events; this category is estimated to contain between 50,000 and 100,000 protein isoforms. Somatic rearrangements involved in the immune response give rise to a third class. For instance, the estimated number of different IgG molecule rearrangements might be as high as ten million. Many proteins undergo PTMs (see chapter 2.1), and this category might account for approximately 100,000 different protein molecules. However, this number is difficult to estimate, since many modifications are stepwise processes. Finally, there is a myriad of protein variants whose differences are due to coding SNPs (see chapter 1.2.4).

2.3 Proteins

Proteins display a wide range of functions (e.g. as building blocks, enzymes, cellular signal mediators, and receptors). They consist of 20 different building blocks, called amino acids, which differ in size, shape, hydrophobicity, and charge. Protein structure is complex and is described at four levels of structural organization. The order in which the amino acids are linked together is the primary structure. The secondary structure refers to the elementary structural patterns present in most known proteins, of which the two main types are alpha helices and beta sheets. The tertiary structure is the complete three-dimensional folded structure of the protein chain (see figure 2.3). The final structural level, the quaternary structure, is only present if the protein contains more than one chain, and refers to the organization and interconnections of the chains. For example, hemoglobin consists of four amino acid chains.
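As a minimal illustration of how a primary structure follows from a coding sequence (see chapter 2.1), the sketch below translates a short mRNA into one-letter amino acid codes using the standard genetic code. The function name `translate` and the example sequence are assumptions made for illustration only and are not part of the software described in this thesis; RNA editing, splicing, and PTMs are ignored.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order
# (first base varies slowest, third base fastest).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate a reading frame into one-letter amino acid codes.

    Hypothetical helper: translation starts at the first base and stops
    at the first stop codon or when fewer than three bases remain.
    """
    mrna = mrna.upper().replace("T", "U")  # accept cDNA-style input as well
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":  # a stop codon terminates translation
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    # Met-Ala-Lys followed by a stop codon
    print(translate("AUGGCCAAGUAA"))
```

The resulting string is the primary structure only; the higher structural levels described above cannot be read off from it directly.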
Figure 2.3 The protein structure of the enzyme cellobiose dehydrogenase, shown here on a cellulose surface. Secondary structural elements are colored; alpha helices are red and beta sheets are blue.

3. Proteomics from a Bioinformatic Perspective

3.1 Background

Proteins are key functional and structural molecules. Therefore, the systematic and extensive molecular characterization of the proteome, proteomics, is necessary for understanding biological processes and systems (reviewed by de Hoog and Mann, 2004). Proteomics is a broad field which encompasses identification, quantification, and characterization of proteins expressed in various cells and tissues. Moreover, it includes methods for structural determination and characterization of PTMs, the use of antibodies as protein capture agents, and bioinformatic approaches. A single large-scale initiative similar to the HGP (see chapter 1.1) is not possible due to the complex and dynamic nature of proteins. Instead, many different approaches are needed in parallel (Lesley, 2001), and they are often used in combination to complement each other (see figure 3.1). As most drug targets are proteins, it is clear that proteomics will have a great impact on drug discovery and clinical diagnostics (Tyers and Mann, 2003).

Genomic and proteomic initiatives are producing an avalanche of data. Public databases (see chapter 4) are continuously expanding, and sequence repositories grow at exponential rates (Benson et al., 2005). As a result of the enormous amounts of data, computer systems have become essential to researchers. Bioinformatics is frequently described as the organization and interpretation of biological data using computational systems and methods (reviewed by Yu et al., 2004). It is positioned at the interface of biology, computer science, statistics, and mathematics (Benton, 1996; Luscombe et al., 2001). The bioinformatics discipline aims at organizing and structuring biological research data in a human-readable format.
Moreover, it focuses on the development of software, tools, and resources that facilitate data analysis, as well as their use for visualization and interpretation of data in a relevant fashion. Bioinformatic methods and systems, such as sequence analysis and comparative proteomics (see chapter 5), might provide valuable aid and insights in different areas of research. Proteomics approaches are currently highly dependent on computer-assisted systems and bioinformatic methods (see figure 3.1).

[Figure 3.1: diagram with the labels Mass spectrometry, Two-dimensional gel electrophoresis, Post-translational modifications, Protein-protein interactions, In silico predictions, Protein arrays, Structural proteomics, Antibody-based proteomics, and Localization.]

Figure 3.1 An overview of proteomic methods and their relations to each other. A number of the approaches can be complemented with corresponding in silico prediction methods. ‘Antibody-based proteomics’ includes methods such as ELISA (Enzyme-Linked ImmunoSorbent Assay), Western blot analysis (see chapter 3.7.2), and immunoprecipitation, whereas ‘Structural proteomics’ comprises NMR (Nuclear Magnetic Resonance) and x-ray crystallography. The dashed arrows from ‘Mass spectrometry’ and ‘Two-dimensional gel electrophoresis’ to ‘In silico predictions’ symbolize database screening against known data (see chapter 3.3).

3.2 Protein Expression and Purification

A challenge when analyzing proteins from a proteome-wide point of view is the expression and purification of human proteins in large numbers (Lesley, 2001). There are both prokaryotic and eukaryotic protein expression hosts, of which the most frequently used bacterial host is E. coli. This bacterium grows rapidly, generates high yields, and has a thoroughly characterized genome. Furthermore, E.
coli exists as a multitude of different mutant strains and is compatible with a large number of cloning vectors (Lu et al., 1996; Baneyx, 1999). However, the expression of eukaryotic proteins in prokaryotic cell systems commonly encounters problems with proper folding and satisfactory expression levels (Lesley, 2001). Therefore, some researchers use eukaryotic expression hosts, such as the predominantly used S. cerevisiae, which is available in various strains (Gellissen et al., 1992; Cregg et al., 2000).

A common procedure for analysing unknown human proteins begins with the identification of gene function by querying publicly available datasets (see chapter 4). This is followed by amplification of the gene from total RNA by reverse transcriptase polymerase chain reaction (RT-PCR) to generate cDNA (see chapter 2.1). Amplified gene products are subsequently cloned into a suitable vector (Liu, 1999; Hartley et al., 2000; Graslund et al., 2002). The cells expressing the proteins are then cultivated, and the proteins are harvested and finally purified.

3.3 Two Dimensional Gel Electrophoresis

The widely used two-dimensional gel electrophoresis (2DE) technology was initially described 30 years ago (O’Farrell, 1975) and allows for separation of proteins in a polyacrylamide gel according to isoelectric point and molecular weight. As many as 10,000 proteins can be analyzed simultaneously on a single gel (Jungblut et al., 1996). Following separation, the proteins are visualized by staining, either with Coomassie blue, if the protein is highly abundant, or with silver, which is more sensitive (Shevchenko et al., 1996). Separation and detection of PTMs is possible with 2DE; however, 2DE fails to analyse certain types of proteins. For instance, proteins expressed in low copy number are not detected by the staining (Santoni et al., 2000).
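One of the two separation dimensions in 2DE, molecular weight, can be estimated from the amino acid sequence alone. The following is a minimal sketch under stated assumptions: the function name `molecular_weight` is hypothetical, the table holds commonly tabulated average residue masses in daltons, and the estimate applies to an unmodified chain, so PTMs and proteolytic processing, which shift the observed position on a gel, are not accounted for.

```python
# Commonly tabulated average residue masses in daltons
# (free amino acid minus one water molecule).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water accounts for the free N- and C-termini


def molecular_weight(sequence: str) -> float:
    """Estimate the average molecular weight (Da) of an unmodified chain.

    Hypothetical helper for illustration; raises KeyError for letters
    outside the 20 standard amino acids.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER_MASS


if __name__ == "__main__":
    # Example with an arbitrary short peptide
    print(round(molecular_weight("MAK"), 2))
```

A comparable calculation of the isoelectric point would additionally require a set of pKa values for the ionizable groups, and the choice of that set is itself an assumption.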
2DE is frequently used together with mass spectrometry (MS) and bioinformatics. The workflow includes protein solubilization of the tissue sample, followed by 2DE separation of the proteins and computer analysis of the protein spot patterns. Subsequently, the isolated protein spots are subjected to proteolytic digestion and characterization by MS, and the generated information is screened against databases for identification of the proteins.

3.4 Mass Spectrometry

MS has been developed further owing to the recent availability of genome sequence databases (see chapter 4) and technical breakthroughs (Aebersold and Mann, 2003), and has become an important technique for protein identification. In MS, individual molecules are analyzed by ionization and measurement of their trajectories in electric and/or magnetic fields. Analyses by MS are capable of revealing two specific protein features, the peptide-mass fingerprint and the product-ion spectra. Peptide-mass fingerprints are determined by MS, and product-ion spectra by tandem MS (MS/MS) or MSn. The two techniques are commonly used in combination, since they generate complementary information about the target protein (Beranova-Giorgianni, 2003). MS is used in conjunction with several other applications (see figure 3.1). For instance, MS and 2DE (see chapter 3.3) in combination with antibodies (see chapter 3.6.1) recognizing phosphorylated proteins may facilitate the identification and characterization of PTMs (Mann and Jensen, 2003). The interpretation of large sets of acquired MS data is a cumbersome task, which underlines the need for MS spectra analysis software. Several different analysis algorithms have been developed, but they need further refinement before they can supersede human intervention (Blueggel et al., 2004).

Inclusion bodies are frequently formed in E.
coli, and these aggregates can often be avoided by production of recombinant fusion proteins (Lu et al., 1996). Fusion tags are also used to facilitate purification and handling of recombinant proteins. For instance, the histidine tag enables purification on immobilized metal ion affinity chromatography (IMAC) columns (Lesley, 2001). Moreover, fusion tags such as thioredoxin (Lu et al., 1996) provide a means to significantly raise expression levels.

3.5 Structural Proteomics

The functions of proteins are related to their structure, and three-dimensional (3D) protein structures are therefore of great importance. The field of structural proteomics (also known as structural genomics) aims at mapping protein space and determining 3D structures for all possible protein folds (Vitkup et al., 2001). Following sequencing of previously uncharacterized genes, a structure might be deduced for 10-40% of the corresponding protein sequences by comparison to sequences of known structure (Rost et al., 2003). Protein structure is more evolutionarily conserved than amino acid sequence (Blundell and Mizuguchi, 2000), and often provides more functional information than the sequence itself. This structural conservation is the result of retained protein interaction sites, which consist of residues that are situated in close proximity in space but might be distant in sequence (Sanchez and Sali, 1998). Structures of proteins that are easily expressed, crystallized, and diffracted are solved by x-ray crystallographic methods (Terwilliger and Berendzen, 1999). Alternatively, the structures of proteins that fail to crystallize might be determined by nuclear magnetic resonance (NMR; Montelione and Anderson, 1999). Various computer-based methods for prediction of protein structure are in use (reviewed by Rost et al., 2003).
Sequence-based large-scale structure predictions generally consist of sequence similarity searches against sequences with known 3D structure, followed by model building from the detected structural templates, and finally validation of the resulting models (Skolnick et al., 2000). Searches for proteins that share structural similarities but lack obvious sequence similarities are performed with fold recognition (threading) methods (Jones, 1999). Threading compares amino acid sequences to known 3D structures and subsequently evaluates how well they fit these structures (Rost et al., 1997). Furthermore, fold recognition allows for identification of analogous (see chapter 1.2.5) pairs of proteins with no significant sequence similarity (Jones, 1999).

3.6 Antibodies in Proteomics

Antibodies have been used in protein studies for decades, owing to their strong affinity and high specificity in target binding. Currently, antibodies are used in several different applications, such as immunolocalization studies (see chapter 3.6.4; reviewed by Uhlen and Ponten, 2005), in vitro detection (see chapter 3.6.5), and analyses on arrays (see chapter 3.6.3). Furthermore, antibodies have been modified by in vitro selection techniques (see chapter 3.6.1) to enhance their properties and increase their specificities and affinities.

3.6.1 Antibodies and Other Affinity Ligands

Advances in purification methods and selection techniques (e.g. phage display) have fueled the emergence of several different kinds of antibodies. As of today, four major types of reagents are commonly used: polyclonal antibodies (pAbs), monoclonal antibodies (mAbs), mono-specific antibodies (msAbs), and other classes of affinity reagents. pAbs, derived from immunization of suitable animals (e.g. rabbits), were the primary antibody source until 30 years ago. The introduction of antigens (see chapter 3.6.2) into a host animal induces an immune response resulting in antigen-specific antibodies.
This pool of antibodies binds to multiple epitope sites on the antigen. This multi-epitope binding contributes to a tolerance towards minor structural changes in the antigen, such as partial denaturation or PTMs (see chapter 2.1). This makes pAbs suitable for multi-platform efforts in which proteins often exist in different forms. However, pAbs are not renewable and may exhibit high cross-reactivities (Nilsson et al., 2005).

The introduction and development of the mAb technology (Kohler and Milstein, 1975) allowed the use of antibodies as reproducible chemical entities. This was made possible by the development of immortal hybridoma cell lines that continuously produce the selected mAbs (Borrebaeck, 2000). Currently, mAbs are the key immunoreagents for medical diagnostic applications (Borrebaeck, 2000). However, mAbs have shortcomings in large-scale efforts, as a result of the need for time-consuming screening (Uhlen and Ponten, 2005). In contrast to pAbs, mAbs recognize single epitopes and are therefore less versatile in multi-platform functional and analytical applications, which often involve proteins in denatured forms. For instance, proteins are often exposed to formalin during tissue fixation for immunolocalization studies (see chapter 3.6.4).

msAbs are enriched from polyclonal antisera by high-stringency affinity purification. msAbs therefore have the same advantages as pAbs, but display low cross-reactivity as a result of the purification (Uhlen and Ponten, 2005). Preliminary studies also indicate that reimmunizations with the target antigen in combination with the stringent immunoaffinity purification yield antibodies with very similar binding characteristics (related publication A). The msAbs might therefore be considered a renewable resource, and they have been used in studies involving extensive tissue profiling (Agaton et al., 2003; Kampf et al., 2004).

More recently, techniques have been developed that allow antibodies and other affinity reagents to be generated without the involvement of laboratory animals. These affinity ligands and antibodies are developed with in vitro selection techniques, such as phage display (Bradbury and Marks, 2004) or ribosomal display (Lipovsek and Pluckthun, 2004). They include antibody fragments, such as scFvs (single-chain variable fragments) and Fabs (Better et al., 1988; Liu and Marks, 2000), and other protein-based reagents developed by recombinant protein techniques (reviewed by Hey et al., 2005), such as affibodies (Nord et al., 1997).

3.6.2 Antigens and Epitopes

The generation of antibodies requires selection of target antigens, and three different types of protein antigens are generally produced: full-length proteins, synthetically produced peptides, and recombinant protein fragments. Full-length protein antigens are used when correct folding of the native protein is required, or when studying PTMs (Miroux and Walker, 1996). The use of a correctly folded full-length protein might however be questionable in some applications. For instance, if the protein is used for immunization of animals, it is most likely slightly denatured upon mixing with different adjuvants (Uhlen and Ponten, 2005). Synthetically generated peptides are the shortest of the antigen types, usually spanning less than 40 residues, and are therefore unlikely to be correctly folded. However, antibodies towards synthetic peptides have successfully been generated (Meloen et al., 2003). Recombinant protein fragments have the great advantage of being easily scaled up for whole-proteome studies (Agaton et al., 2003). These protein fragments are shorter than full-length proteins, which facilitates cloning through the use of relatively short PCR fragments, and fusion to affinity tags (see chapter 3.2) allows for high-throughput purification.

Antibodies recognize different antigenic determinants, termed epitopes, which are either linear or discontinuous. Linear epitopes consist of a continuous stretch of amino acids, whereas an epitope whose residues are in close proximity in space but distantly located in sequence is discontinuous (Blythe and Flower, 2005). The majority of epitopes are discontinuous (Barlow et al., 1986); however, most in silico methods focus on predicting linear epitopes. The majority of algorithms use amino acid based scales to predict epitopes, e.g. to find local hydrophilicity maxima (Hopp and Woods, 1981). However, the most recent and comprehensive study shows no significant correlation between predicted and previously experimentally mapped epitopes. Furthermore, current epitope predictions perform only slightly better than random chance (Blythe and Flower, 2005).

3.6.3 Protein Analysis on Microarrays

Following in the footsteps of DNA microarrays (Schena et al., 1995), the development of protein-based array technologies has progressed at a rapid rate. In contrast to traditional experiments, where antibodies are used in very limited numbers, simultaneous analyses of large numbers of protein samples are now possible by using antibodies as binding agents (Bradbury et al., 2004; Nilsson et al., 2005). Protein arrays are highly specific, and experiments have successfully identified a single protein among 10,000 others (MacBeath and Schreiber, 2000). Microarrays consisting of protein-based capture ligands as probes, or of proteins themselves fixed to the surface of a silicon chip, are valuable protein analysis tools. There are three standard types of protein-detecting arrays (see figure 3.2). On antibody arrays, antibodies are fixed to a solid support and subsequently incubated with pre-labelled antigens (Haab, 2001; Haab et al., 2001; Schweitzer and Kingsmore, 2002), prior to the detection of antibody-antigen bindings.
On sandwich arrays, antibodies are fixed on the slide, and captured protein targets are subsequently detected using an additional labeled antibody (Schweitzer and Kingsmore, 2002). Finally, on reverse phase protein arrays, proteins or protein mixtures are immobilized and detected upon binding to labeled antibodies. These experiments might be performed with primary or secondary labeled antibodies (Paweletz et al., 2001). It is important to keep in mind that the labeling of proteins might result in chemical modifications, including epitope damage and affinity changes (Barry and Soloviev, 2004). A major hurdle is the difficulty of detecting all PTMs and isoforms of a protein (Lopez and Pluskal, 2003).

[Figure 3.2: schematic of the array formats; the recoverable panel labels are Sandwich array and Reverse phase protein array.]

Figure 3.2 Different detection methods on protein/antibody arrays: direct antibody arrays, sandwich arrays, and reverse phase protein arrays. In addition, reverse phase arrays might be performed with a secondary labeled antibody, as well as with complex protein samples hybridized to the chip surface.

3.6.4 Immunohistochemistry

Immunohistochemistry (IHC) is the process of using antibodies for detection of antigens in tissues, and might provide information on protein expression in tissues at the subcellular level (reviewed by Warford et al., 2004). Determination of the subcellular localization of a protein is usually performed by fusion of the corresponding gene to a reporter or by epitope-tagging (Kumar et al., 2002), where small tags are fused to the target protein. Prior to analyses on tissue samples, the tissues are fixed and thereafter embedded in paraffin or, alternatively, flash frozen. Frozen tissues are of limited availability and are therefore not as widely used as paraffin-embedded tissues. Tissue fixation is most commonly performed directly following collection by submersion in formalin, followed by embedding in paraffin blocks. Alternatively, the fixation of tissues is performed in ethanol (Ahram et al., 2003).
A number of methods for simultaneous analysis of multiple tissue samples have been implemented. The first step towards high-throughput histology methods was the tissue sausage method (Battifora, 1986), which allowed for the analysis of over 100 different tumour tissues. However, the tissues were not oriented in a way that allowed linkage to their clinical origin, and the tissue sausage failed to represent the entire number of tissues in a section (Warford et al., 2004). The tissue microarray (TMA; see figure 3.3) technology surmounted these limitations and dramatically increased the number of simultaneously analysed tissue samples (Kononen et al., 1998). Following incubation with antibodies and staining, images of the TMAs are captured using conventional flatbed scanners. Subsequent storage of the high-resolution images requires large amounts of storage space.

When using antibodies in TMA experiments, there is a risk of cross-reactivity. Antibody recognition of related or unrelated epitopes in tissues where the target protein is absent may result in false positives (Warford et al., 2004). Conversely, false negatives might arise as a result of difficulties of the antibody to identify the target, destruction of the epitope during tissue fixation, or because the epitope is unreachable due to cross-linking or protein-protein interactions. Moreover, the detection system might fail to identify the target epitope if expression is below the detection limit (Warford et al., 2004). Also, special care should be taken to select tissue cores that are representative of the pathology or histology of standard sections (Simon et al., 2003).

Figure 3.3 A view of a TMA (Tissue MicroArray) containing immunohistochemically stained sections of normal human tissues. Each spot has a diameter of 1 mm and consists of a different tissue sample.
In this example, the protein expression pattern is displayed in a spot representing part of a normal appendix. The higher magnification view shows that glandular cells in the mucosa are positive. Surface epithelial cells show only weak immunoreactivity, whereas the glandular cells lining the colonic crypts show strong cytoplasmic positivity. The inflammatory cells surrounding the crypts are essentially negative. Brown color represents positive staining. The blue color (hematoxylin) is used as non-specific counter-staining to expose tissue structure and cell morphology.

3.6.5 In Vitro Detection

Tissue localization can also be studied using Western blot analysis, where tissue lysates are separated by gel electrophoresis. Western blot is a commonly employed technique for detection of protein antigens in complex mixtures (Dunn, 1999). Following electrophoresis, the proteins (or other biological samples) are transferred to a membrane, usually composed of polyvinylidene difluoride (PVDF) or nitrocellulose, for subsequent detection. Non-specific binding of antibodies to the membrane surface is prevented by incubating the membrane in an appropriate blocking solution. The detection in Western blot analysis is commonly performed with direct or indirect methods. In the direct method, the primary antibody, which binds the antigen, is labeled with an enzyme or fluorescent dye. The indirect method involves a labeled secondary antibody, which binds to the primary antibody bound to the antigen. The secondary antibody can be labeled with various compounds, including probes, biotin, and enzyme conjugates. Western blots also give valuable information about protein size and antibody specificity.

3.7 Prediction of Subcellular Localization

The majority of eukaryotic genes are transcribed in the nucleus, and subsequent translation occurs in the cytosol. Many proteins undergo further sorting into various subcellular compartments, directed by signals within the amino acid sequence (Schatz and Dobberstein, 1996; Mattaj and Englmeier, 1998). Several methods are used to predict these signal sequences. By establishing homology to a protein with known localization, the localization of an uncharacterized protein might be inferred. Proteins may also be directed to different cellular compartments by the aid of carrier proteins. These proteins identify specific sequence motifs, which target the protein to a certain part of the cell. In silico based methods can predict signal peptide cleavage sites with greater than 80% accuracy (see chapter 5.3; Emanuelsson et al., 2001). However, methods predicting N-terminal signals are biased as a result of errors in predicting start codons in genome projects (Reinhardt and Hubbard, 1998). Ab initio methods predict the subcellular localization from the amino acid composition alone, but these methods are less successful than the sequence motif methods (Rost et al., 2003). Finally, localization can be determined by identifying interactions with other proteins that have known subcellular localization.

3.8 Post Translational Modifications

Analysis of individual PTMs (see chapter 2.1) is most commonly performed by comparing experimentally generated data to previously identified amino acid sequences. The first step is to identify the target protein, which can be done by MS techniques or by antibody recognition using Western blot analysis (Mann and Jensen, 2003). Modifications of many proteins are essentially determined by four different approaches: 2DE, affinity-based enrichment of modified proteins, PTM identification in complex mixtures, and affinity-based methods and derivatization. In 2DE (see chapter 3.3), the modified proteins can be visualized on gels by, for instance, anti-phosphoamino acid antibodies, and subsequently analyzed by MS (Soskic et al., 1999). Alternatively, proteins with PTMs can be run twice on 2D gels, with an enzymatic removal of the modifying group between runs. Spots present only in the first run indicate modifications of the proteins (Yamagata et al., 2002). When the mass of the modified protein is not sufficient to establish the modification, MS/MS (see chapter 3.4) fragmentation is used to identify and localize the PTMs (Mann and Jensen, 2003).

Bioinformatic methods to predict PTMs are promising and might be closely integrated with experimental approaches. As proteomics approaches identify larger numbers of modification sites in vivo, prediction algorithms will benefit and their performance will increase (Mann and Jensen, 2003). In silico prediction of PTMs comprises the identification of local consensus sequence motifs of signal sequence cleavage sites (Nielsen et al., 1997) and of more complicated amino acid correlation patterns, such as secondary structure and surface accessibility (Hansen et al., 1998; Blom et al., 1999).

3.9 Protein Interactions and Complexes

Proteins can interact with a myriad of other molecules to regulate and support them in large interaction networks (Bader and Hogue, 2000; Chen and Xu, 2003). Deciphering protein interactions might shed light on molecular mechanisms and biological processes (Ito et al., 2000) and provide clues to the function of uncharacterized proteins. Multi-protein complexes regulate various cellular processes, some of which are associated with disease. Screening of these protein interaction partners might lead to the identification of possible drug targets. Since the development of the two-hybrid assay 15 years ago (Fields and Song, 1989), this powerful screening method has been extensively used in interaction studies.
In this assay, the ‘bait’ and ‘prey’ proteins are fused to a DNA-binding domain and a transcriptional activation domain, respectively. Both fusion proteins are subsequently co-expressed in yeast, and if the proteins interact, the two domains are brought into close proximity, resulting in the expression of a reporter molecule used for detection (Uetz et al., 2000).

In silico predicted protein-protein interactions are not relevant in the absence of experimental validation. However, the two approaches might be used in combination, complementing and validating each other (Chen and Xu, 2003). There are several interaction prediction methods, such as genetic neighborhood conservation, gene fusion, and phylogenetic profiling. Prokaryotic gene clusters and operons (contiguous sets of genes that are co-regulated and co-transcribed) provide the basis for the genetic neighborhood conservation method. Prokaryotic gene clusters typically consist of genes related by function (Overbeek et al., 1999). Valuable information on which proteins form complexes can be gained by identifying and comparing operons and gene clusters from different organisms. The major drawback is that this method only applies to bacteria.

The gene fusion method predicts protein-protein interactions on the basis of homologous fused proteins. For example, proteins A and B are separate proteins in organism Y, but are expressed as a fused protein in organism Z. The fusion (and therefore probable interaction) in organism Z implies that proteins A and B also interact in organism Y (Marcotte et al., 1999). One limitation of this method is that many interactions have evolved through other mechanisms, and might therefore be missed.

The phylogenetic profiling method is based on the assumption that interacting proteins functioning together also have a similar evolution.
By constructing a phylogenetic profile for each protein that constitutes the protein's evolutionary history, linkage between proteins with similar profiles might be established. The method thereby allows for prediction of functions of unknown proteins (Pellegrini et al., 1999).

4. Biological Information Resources

Databases of biological information are an important resource behind biotechnology research (see table 4.1). Proteomic approaches depend on complete, reliable and recently updated databases (Apweiler et al., 2004a). Some researchers have shifted their focus from small-scale studies towards analyses involving large data sets. This includes comparative studies between sets of sequences or between organisms. The enormous quantities of gene and protein information need to be stored in a relevant way, allowing for integration and cross-references between different data sources. Besides functioning as storage facilities, the data sources have to provide access, interpretation and visualization of the data. The number of databases is steadily increasing and currently there are 719 publicly available molecular biology databases, an increase of 31% from the previous year (Galperin, 2005).

Database name                                      URL                                      Database type
Ensembl                                            http://www.ensembl.org                   Information around large automatically annotated genomes
The Vertebrate genome annotation (VEGA)            http://vega.sanger.ac.uk/                Manually annotated vertebrate genome sequences
SWISS-PROT                                         http://www.ebi.ac.uk/swissprot           Extensively annotated nonredundant protein sequences
TrEMBL                                             http://www.ebi.ac.uk/trembl              Computer-annotated protein sequences
UniProt                                            http://www.expasy.uniprot.org/           Centralized repository for protein information
Protein data bank                                  http://www.rcsb.org/pdb/                 Three dimensional protein structures
Pfam                                               http://www.sanger.ac.uk/Software/Pfam/   Protein families and domains
InterPro                                           http://www.ebi.ac.uk/interpro/           Unites domains, protein families, and functional site records from protein signature databases
Database of interacting proteins (DIP)             http://dip.doe-mbi.ucla.edu/             Experimentally and literature search generated protein-protein interaction data
Biomolecular interaction network database (BIND)   http://binddb.org                        Protein-protein, protein-DNA, protein-RNA, and protein-ligand interaction data
The Predictome database                            http://predictome.bu.edu/                In silico predicted protein interactions

Table 4.1 A summary of a number of databases for biological information.

The need to integrate different biological information resources, in combination with the severe problem of defining protein function, emphasizes the importance of a standardized vocabulary. The first attempt was an enzymatic activity classification using four digits, the EC numbers (Nomenclature committee, 1992). Every EC number is associated with a name for the respective enzyme, and the numbers represent a progressively finer classification of the enzyme, in a hierarchic fashion. Currently, the gene ontology (GO) terms are the major set of terms used for describing protein function. The GO consortium has defined a set of structured classifications in three levels: i) molecular function (e.g. enzyme), ii) biological process (e.g. cell growth and maintenance), and iii) cellular component (e.g. nuclear membrane; G.O. Consortium, 2001). The GO terminology is, however, not entirely complete, although it is the most comprehensive set of definitions currently available.

4.1 Genomic Data

There are three main systems that provide organized information around large automatically annotated genomes: Ensembl (Birney et al., 2004), the NCBI (National Center for Biotechnology Information) genome resource (Wheeler et al., 2004) and the UCSC (University of California Santa Cruz) genome browser (Karolchik et al., 2004). All three resources are interlinked and updated regularly. Ensembl provides several tools for users who need to manipulate parts of the genomic data. For instance, the EnsMart system (Kasprzyk et al., 2004), which is a web-based data mining and retrieval system, allows for extraction of relevant data in several output formats. Moreover, a series of downloadable standard data sets are available (e.g. a protein FASTA file of all proteins). Over the last year, a total of seven new genomes have been added to Ensembl, which presently comprises a total of 16 different genomes, such as human, mouse, rat, dog, and chicken (see table 5.1).

The vertebrate genome annotation (VEGA) database is a resource for browsing manually annotated finished vertebrate genome sequences. It currently includes human, mouse, dog, and zebrafish. The entire set of annotated genes is supported by transcriptional evidence from protein sequences, cDNA, or expressed sequence tags (ESTs; Ashurst et al., 2005). At present, 14 of the human chromosomes, covering almost half of the genome, are annotated. Manual annotation is more precise in recognizing splice variants and pseudogenes than automated annotation, as in Ensembl (Ashurst et al., 2005).
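Downloadable protein FASTA files, such as the standard data sets mentioned in this section, can be read with a few lines of code. The following is a minimal sketch in Python; the function name read_fasta and the two example records are hypothetical and are not taken from Ensembl.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input, used only for illustration.
example = StringIO(">protA hypothetical protein\nMKT\nLLV\n>protB\nMSS\n")
records = list(read_fasta(example))
```

For real files, the same function could be applied to an open file handle instead of the in-memory example.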
4.2 Proteome and Protein Data

4.2.1 Protein Sequence Data and Protein Information

SWISS-PROT (Bairoch and Apweiler, 2000) is maintained by the Swiss institute of bioinformatics (SIB) and the European bioinformatics institute (EBI). It provides annotated data with minimum redundancy, and is integrated with other databases. The entries in SWISS-PROT are extensively annotated and analysed by biologists to ensure high quality. There are two classes of data in SWISS-PROT, the core data and the annotation. The mandatory core information associated with each SWISS-PROT entry consists of the protein name, the amino acid sequence, citation information, and taxonomic data. The annotation includes descriptions of protein function, PTMs, secondary structure elements, similarities to other proteins, domains and sites, splice isoforms, etc. Redundancies are kept at a minimum by merging separate entries belonging to different literature reports. Furthermore, SWISS-PROT provides cross-references to many external databases (O’Donovan et al., 2002).

TrEMBL comprises computer-annotated sequences derived from translation of all coding sequences in the GenBank (Benson et al., 2005), DNA databank of Japan (DDBJ; Gasteiger et al., 2001), and European molecular biology laboratory (EMBL; Kanz et al., 2005) nucleotide sequence databases. The data include large amounts of information that is not yet included in the SWISS-PROT dataset. The TrEMBL data generation is performed in three steps: i) translation of the coding sequences previously mentioned, ii) removal of redundancy by merging of multiple entries (O’Donovan et al., 1999), and iii) automated information enhancement through annotation transfer from well-annotated proteins in SWISS-PROT to uncharacterized proteins in TrEMBL that belong to well-defined groups, by using InterPro (Mulder et al., 2005).
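The redundancy removal step described above can be illustrated with a deliberately simplified sketch. It assumes that two entries are redundant only when their sequences are exactly identical, which is an assumption made here for illustration; the actual merging rules used for SWISS-PROT and TrEMBL are not described in this text and are likely more elaborate. The function name and the accessions are hypothetical.

```python
def merge_redundant(entries):
    """Group (accession, sequence) entries that share an identical sequence.

    Assumption for this sketch: redundancy means exact sequence identity
    (case-insensitive). Returns a list of (accessions, sequence) tuples in
    order of first appearance.
    """
    merged = {}
    for accession, sequence in entries:
        key = sequence.upper()
        # Collect every accession that reports the same sequence.
        merged.setdefault(key, []).append(accession)
    return [(accessions, sequence) for sequence, accessions in merged.items()]


# Hypothetical accessions and sequences, used only for illustration.
entries = [("P1", "MKT"), ("Q9", "mkt"), ("P2", "MSS")]
nonredundant = merge_redundant(entries)
```

In this toy input, the first two entries collapse into one record that keeps both accessions, so the cross-references of the merged entries are not lost.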
4.2.2 Protein Structure Data

PDB (the Protein Data Bank; Bernstein et al., 1977) was initiated as early as 1971 and is the single repository for three dimensional protein and nucleic acid structural data. PDB archives, annotates, and distributes sets of atomic coordinates. In addition to the structural data itself, PDB also stores the measurements from which the data is derived (e.g. x-ray structure determinations). Protein structure (see chapter 3.5) is associated with the functions of proteins and is therefore of importance.

4.2.3 Protein Domains and Families

Pfam (Protein families database) is a large repository of protein families and domains, including 7,973 families (version 18). Pfam families are built around multiple sequence alignments (see chapter 5.2.3) and are divided into two main classes, Pfam-A and Pfam-B. Pfam-A families are based on manually curated multiple alignments, whereas Pfam-B families are derived from an automatic clustering of SWISS-PROT and TrEMBL and are therefore less reliable. Pfam families cover around 75% of all proteins present in SWISS-PROT and TrEMBL (Bateman et al., 2004), and have been widely used when annotating the human genome sequence (Lander et al., 2001).

Another resource is InterPro, which unites domains, protein families, and functional site records from protein signature databases. Each InterPro entry includes annotation, functional annotation in GO terms and literature references. At present, eight major databases are included, e.g. Pfam (Bateman et al., 2004), SMART (Simple Modular Architecture Research Tool; Letunic et al., 2004), and PRINTS (Protein fingerprint database; Attwood et al., 2003). InterPro’s family, domain, and functional site definitions will primarily be used in functional classification and annotation of unknown sequences (Apweiler et al., 2001).
The Universal protein resource (UniProt; Apweiler et al., 2004b) has been formed by uniting SWISS-PROT, TrEMBL, and the protein information resource (PIR). It is a centralized repository for protein information and consists of three database components. The UniProt archive (UniParc), the most comprehensive publicly available protein sequence collection, provides nonredundant protein sequences from all publicly accessible protein databases. The UniProt knowledgebase (UniProt) contains a section of fully manually annotated sequences and a section awaiting annotation. Finally, the UniProt NREF (UniRef) database provides nonredundant sequence collections, grouped by sequence identity. Three different clustering limits (NREF100, NREF90, NREF50) provide a full sequence coverage, while showing only sequences corresponding to the specific limitations (Bairoch et al., 2005).

4.2.4 Protein Interactions

The Database of interacting proteins (DIP) contains experimentally generated data and results from literature searches regarding protein-protein interactions (Marcotte et al., 2001). It mostly includes data from human, yeast, and Helicobacter pylori, and allows for comprehensive visualization and extraction of interaction and network data (Xenarios et al., 2000).

BIND (the Biomolecular Interaction Network Database) stores protein-protein, protein-DNA, protein-RNA, and protein-ligand interaction data (Bader and Hogue, 2000). In addition, it includes information on molecular complexes and pathways. BIND describes various conditions and data regarding the interaction of interest, such as experimental conditions, cellular localization, sequence data, kinetics, thermodynamics, etc.

In silico predicted interactions are stored at the Predictome database.
The putative interactions are based on predictions using phylogenetic profiling, domain fusion, and chromosomal proximity (see chapter 3.9), involving proteins from 44 different species. This database also provides comparison of different prediction methods and their correlation with known pathway data (Mellor et al., 2002).

5. Data Searches, Sequence Comparisons, and Biocomputing

5.1 Comparative Genomics and Proteomics

Comparative genomics and proteomics are the analyses and comparisons of genomes and proteomes from different organisms in order to investigate the evolution of species and functions of genes and proteins. Bioinformatic methods may be applied to all organisms, as a result of the uniform genetic code. Due to the availability of completely sequenced genomes from various organisms (see table 5.1), comparative genomics and proteomics have rapidly emerged. Currently, a total of 303 different genomes are finished (www.genomesonline.org). By examining features such as genetic locations and conserved regions, correlations between genes can be established, providing valuable information. In closely related species (e.g. human and chimpanzee) differences might result in candidate genes for adaptive evolution (see chapter 1.2.1). Conversely, in distantly related species (e.g. human and mouse) the basic procedure involves screening for functional conservation (Ruvolo, 2004). Comparative genomics often involves the use of pairwise or multiple sequence alignments (see chapter 5.2.1, 5.2.3).

Species                           Number of genes  Genome size (Mbp)  Original reference
Human (Homo sapiens)              22,218†          3,272              Lander et al., 2001, Venter et al., 2001
Chimpanzee (Pan troglodytes)      22,475†          2,733              The Chimpanzee sequencing consortium, 2005
Dog (Canis familiaris)            18,201†          2,385              Kirkness et al., 2003
Chicken (Gallus gallus)           17,709†          1,054              Wallis et al., 2004
Mouse (Mus musculus)              25,613†          2,267              Waterston et al., 2002
Rat (Rattus norvegicus)           21,952†          2,718              Gibbs et al., 2002
Fish (Danio rerio)                22,877†          1,688              http://www.sanger.ac.uk/projects/d_rerio/ ¶
Fly (Drosophila melanogaster)     14,359†          132                Adams et al., 2000
Worm (Caenorhabditis elegans)     19,745†          100                The C. elegans sequencing consortium, 1998
Protozoan (Trypanosoma cruzi)     22,570           60                 El-sayed et al., 2005
Yeast (Saccharomyces cerevisiae)  6,680            12                 Goffeau et al., 1996
Tree (Populus trichocarpa)        58,036           473                http://genome.jgi-psf.org/poptr1 ¶
Plant (Arabidopsis thaliana)      31,270           115                The Arabidopsis genome initiative, 2000

Table 5.1 The complete sequenced genomes of 13 different organisms. †Number of genes according to Ensembl, October 2005. ¶ No reference yet available.

5.2 Sequence Alignments

Initial clues to function and/or structure of newly sequenced proteins are commonly derived from amino acid sequence similarities to known proteins. As a consequence, the focus of research might be directed towards unexpected relations and correlations (Lipman and Pearson, 1985; Altschul and Lipman, 1990). Furthermore, discoveries of new protein families have been made by using sequence searches alone (Pearson, 1990). Homologous (see chapter 1.2.5) proteins have their fold and/or active site of binding domains in common, as well as sharing significant sequence similarities. Alignments, i.e. unambiguous mappings of residues between sequences, are mainly performed to infer homology between two (see chapter 5.2.1) or multiple (see chapter 5.2.3) sequences. When performing sequence alignments, there is a trade-off between sensitivity (the ability to identify sequences with distant evolutionary relationships) and selectivity (the avoidance of non-related matches with spuriously high similarity scores). The latter might appear as a result of biased amino acid composition. Various sequence alignment algorithms and scoring parameters are in use, and the choice of algorithm depends upon the type of comparison problem to be solved.

5.2.1 Pairwise Sequence Alignments

Pairwise alignments are performed either as local or global alignments. Local algorithms seek subsequence alignments to find the strongest region of similarity between two sequences (thereby disregarding dissimilarities outside that region). Local alignments are preferred when identifying shared conserved regions or domains, as well as shuffled domains resulting from a complex evolutionary pedigree (e.g. internal gene duplications). Such alignments are often used when screening DNA or protein databases as well as in exon identification, where mRNA is compared to genomic DNA.

Global alignment algorithms compare the two aligned sequences from their respective beginnings to ends. These algorithms are the most suitable choice when the similarities span the full length of the sequences, or when homology has already been established. Scores generated by global alignment are calculated with or without gap penalties at the ends of the sequences.

Alignment algorithms can be divided into two general classes. Rigorous comparison algorithms calculate the optimal similarity score, whereas heuristic algorithms are less computationally intense and are therefore faster. Several programs exist that are based on optimal algorithms.
The first pairwise sequence similarity algorithm was the optimal and global Needleman-Wunsch algorithm (Needleman and Wunsch, 1970), which is available as the implementation Needle (www.ebi.ac.uk/emboss). In contrast, the application Water performs local, optimal alignments using the Smith-Waterman algorithm (Smith and Waterman, 1981; www.ebi.ac.uk/emboss).

The heuristic algorithms do not necessarily find the optimal score for every single sequence included in a search database. The two most commonly used heuristic comparison algorithms are BLAST (Basic Local Alignment Search Tool; Altschul et al., 1990) and FASTA (Pearson, 1990), which stands for Fast All (fast searches for ‘all’, meaning both protein and nucleotide comparisons). They are typically 5-50 times faster than the Smith-Waterman algorithm, since they examine only a portion of the potential alignments between sequences. However, in many cases FASTA and BLAST generate results similar to those of the more rigorous algorithms.

BLASTP (P for protein) is the most widely used rapid protein sequence comparison algorithm. It combines high sensitivity and selectivity, and seldom generates spuriously high alignment scores for unrelated sequences. However, as many as 25% of all amino acids in protein sequences are situated in regions with severely biased amino acid composition (e.g. low complexity regions). Therefore, BLAST programs filter low complexity regions from proteins by utilizing the SEG algorithm (Wootton, 1994).

The algorithms assign scores to the generated alignments, which reflect a minimum of insertions, deletions and substitutions. Accordingly, highly similar sequences generate high scores and distant sequences low scores. Protein sequence alignment methods determine similarity by using substitution matrices, which assign scores to every possible mutation of one amino acid into another.
Matrices for generating similarity scores have three main properties and can therefore differ in the following aspects: i) the method used to construct the matrix; ii) the scoring scale, and iii) the information content. Thus, mutations that are evolutionarily unlikely to occur are assigned large negative scores, whereas positive scores are assigned to conservative changes (see figure 5.1; Pearson, 2000).

There are two types of commonly used matrices, PAM (Point Accepted Mutation) and BLOSUM (BLOcks SUbstitution Matrix). PAM is a molecular evolutionary model, which is constructed from observed residue replacements in closely related proteins (Dayhoff et al., 1978). An average change in 1% of all residue positions corresponds to one PAM. Homologous proteins might be recognized even though they are separated by more than 100 PAMs. This is due to the possibility of multiple mutations at some residues and the absence of mutations at others. As evolutionary rates differ among proteins, no clear correlation between evolutionary time and PAM distance can be established. The BLOSUMs are directly constructed from multiple sequence alignments (see chapter 5.2.3) of distantly related proteins, and not derived from closely related proteins (Henikoff and Henikoff, 1992).

The PAM and BLOSUM matrices differ not only in the way they are constructed, but also in how they are used. PAMs are numbered based on the number of point accepted mutations (e.g. PAM120), whereas BLOSUMs are numbered according to the shared similarities among the multiple alignments used to construct them (e.g. BLOSUM80). PAM matrices with high numbers and BLOSUM matrices with low numbers are constructed for comparison of distantly related protein sequences. In conclusion, the choice of matrix depends on how distant the relation to be established is. In general, BLOSUM62 (see figure 5.1) performs better than PAM matrices with BLASTP and FASTA (Henikoff and Henikoff, 1993).

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

Figure 5.1 A substitution matrix, BLOSUM62, which is used in sequence alignments. The matrix is derived from multiple sequence alignments of sequences with 62% sequence identity. Mutations that are evolutionarily unlikely to occur are assigned negative scores, whereas positive scores are assigned to conservative changes.
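The local alignment scoring described in this chapter can be sketched as a short dynamic programming routine in the spirit of the Smith-Waterman algorithm. This is a minimal illustration under stated assumptions, not the implementation used by Water or BLAST: it returns only the best local score, uses a linear gap penalty, and the default values for match, mismatch and gap are arbitrary choices made here. An optional dictionary of substitution scores, for instance a few entries taken from figure 5.1, may replace the simple match/mismatch scheme.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2,
                         substitution=None):
    """Return the best local alignment score of two sequences.

    Assumptions of this sketch: linear gap penalty, and a simple
    match/mismatch scheme unless a substitution dict keyed by residue
    pairs, e.g. {("A", "R"): -1}, is supplied.
    """
    def pair_score(a, b):
        if substitution is not None:
            if (a, b) in substitution:
                return substitution[(a, b)]
            if (b, a) in substitution:  # matrix is symmetric
                return substitution[(b, a)]
        return match if a == b else mismatch

    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # scores[i][j]: best score of a local alignment ending at
    # seq_a[i-1] and seq_b[j-1]; the floor of 0 makes the alignment local.
    scores = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            scores[i][j] = max(
                0,
                scores[i - 1][j - 1] + pair_score(seq_a[i - 1], seq_b[j - 1]),
                scores[i - 1][j] + gap,  # gap in seq_b
                scores[i][j - 1] + gap,  # gap in seq_a
            )
            best = max(best, scores[i][j])
    return best


# Three BLOSUM62 entries copied from figure 5.1 (A/A, R/R and A/R).
blosum_subset = {("A", "A"): 4, ("R", "R"): 5, ("A", "R"): -1}
shared_core = smith_waterman_score("WWACGTWW", "YYACGTYY")
tiny_protein = smith_waterman_score("AR", "AR", substitution=blosum_subset)
```

In the first call only the shared core ACGT contributes to the score, which illustrates how a local algorithm disregards the dissimilar flanking regions. Replacing the floor of zero with cumulative gap penalties along the matrix borders would turn the same recurrence into a global alignment.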
5.2.2 Practical Aspects of Pairwise Alignments

Compared to protein sequence comparisons, DNA sequence comparisons are significantly less informative, for several reasons. DNA sequences that do not encode structural RNAs or proteins have high divergence rates. As a consequence, it is difficult to establish reliable sequence homologies between sequences with a common ancestor older than 200 million years. In contrast, homology detection among protein sequences that diverged one billion years ago is possible (Pearson, 2000). Protein sequences harbor biochemical information within the amino acids themselves, which DNA sequences do not. For instance, there are large differences in chemical properties between isoleucine and glycine, and high similarities between lysine and arginine (i.e. size, hydrophobicity and charge; see chapter 2.3). Finally, the degeneracy of the genetic code results in many silent third-base codon changes, which do not change the encoded protein (Pearson, 1990).

For local alignment searches, the correlation between sequence similarity score, alignment length, and statistical significance is complex. For instance, a database screening might generate two distinct hits, one short domain of 15-25 amino acids with 85% sequence identity to the query sequence and one longer domain of 50-80 residues sharing 30% identity with the query. Both hits generate identical similarity scores. Yet, the latter is more likely to be of biological relevance (Pearson and Miller, 1992).

Sequence-based searches are not suitable for identifying distant evolutionary relationships. Difficulties arise as the sequence similarities diminish, and around 20-30% identity, referred to as the twilight zone, relations are hard to establish (see figure 5.2; Doolittle et al., 1986).
However, remote relations can be detected using methods such as fold recognition (see chapter 3.5; Jones, 1997), or iterated BLAST searches (see chapter 5.2.3; Altschul et al., 1997).

Figure 5.2 Limits of sequence similarity searches (observed percent identity plotted against evolutionary distance in PAMs). As the evolutionary distance between sequences increases, the possibility of significantly establishing a relation decreases. In the so-called twilight zone (displayed as a dashed rectangle), relations are hard to establish with sequence-based searches (adapted from Pearson, 2000).

5.2.3 Multiple Sequence Alignments

Even though pairwise comparisons are essential when analyzing sequences, analyses of protein families sharing conserved characteristics require alignment and comparison of groups of sequences. This is done in multiple sequence alignments, which form the foundation for methods such as phylogenetic analysis, fold recognition (see chapter 3.5), and secondary structure prediction (Bateman et al., 2002).

The majority of multiple alignment algorithms are based on a progressive approach, which refers to the initial grouping of more similar sequences followed by the later incorporation of more distantly related sequences (Boeckmann et al., 2003). The most widely used multiple sequence alignment method is ClustalW (Thompson et al., 1994), which exploits the evolutionary relationship between similar sequences. The method starts by computing an all-to-all pairwise comparison and then follows the branching order of a family tree, in which more closely related sequences are aligned first and more distant ones thereafter. Consequently, ClustalW groups all the sequences based on similarities and then produces the final multiple alignment. ClustalW also generates the information needed to construct a phylogenetic tree.

The position-specific iterated BLAST (PSI-BLAST) is a hybrid algorithm, including elements from pairwise as well as multiple sequence alignment methods. The initial database search is followed by construction of position-specific sequence profiles, which, upon iteration, might be refined, thereby raising the search sensitivity (Altschul et al., 1997). Despite the many advantages of PSI-BLAST, false incorporations of unrelated sequences and non-masked low complexity regions might bias the profiles and incorporate additional false positives in the following iterations (Holm, 1998).

5.3 Prediction of Signal Sequences and Transmembrane Regions

The identification of sequence motifs (see chapter 3.7) that direct proteins in the cell has been the goal for programs predicting signal peptides. SignalP takes into account that the residues at positions -3 and -1 relative to the signal peptide cleavage site have to be neutral and small for cleavage to occur (von Heijne, 1983; von Heijne, 1985). SignalP can predict cleavage sites with an accuracy of 78% for eukaryotes and 89% for prokaryotes (Nielsen et al., 1997). Similar to signal peptides are signal anchors, which have signal sequences not cleaved by signal peptidases (von Heijne, 1988). A more recent version of SignalP has been developed that distinguishes between cleaved signal peptides and uncleaved signal anchors with an accuracy of about 75% (Nielsen et al., 1999).

Similar to SignalP is TargetP, which predicts subcellular localization of proteins (Emanuelsson et al., 2000). By using N-terminal protein sequence data, TargetP discriminates between secretory, mitochondrial, and chloroplast proteins. It performs better than other available similar tools, with an overall success rate of 85% for plant sequences and 90% for non-plant sequences.

Membrane proteins are an important class of proteins, for which it is difficult to obtain atomic-resolution information about 3D structure. Accurate methods for predicting the locations of transmembrane helices are available. For instance, TMHMM (Trans Membrane Hidden Markov Model), a commonly used tool, predicts transmembrane helices correctly for about 95% of all proteins in a test data set (Krogh et al., 2001).

5.4 Biocomputing – a Practical Approach

The development and construction of biocomputational software applications is highly dependent on the type of problems that need to be solved. Different programming languages and database solutions are available, with respective strengths and weaknesses. The two most commonly used programming languages in biocomputing are Perl (perl.org) and Java (java.sun.com). They both have their own open-source projects which provide frameworks for handling biological data (BioPerl (bioperl.org) and BioJava (biojava.org)). These frameworks include modules, classes and interfaces for file parsing, manipulation of biological sequences, and database client and server support. Perl is ideal for smaller scripts where less code needs to be written and for construction of scripts without using an object-oriented approach. Java is more favourable when constructing larger software solutions, owing to its more extensive visualization packages and cross-platform capabilities.

When designing and constructing novel software, every new problem is unique and therefore a general solution is not available. However, the system has to include general features such as proper storage and management of data, which involves a relevant and stable database system and data-recovery facilities (backups). Moreover, the system needs to present and visualize the data in ways that facilitate the interpretation by the end-user.
The system functionalities must be cleary defined and continuous interaction between software developers and endusers might be necessary. In various large-scale, high-throughput approaches, the system needs to handle a large amount of data at high speed. Therefore, well-established prediction methods in combination with experimentally-gained knowledge might be useful (related publication A, B). The overall high production rates might result in a system, which focuses on high-throughput processing over extreme sensitivity. Yet, the approach has to be sensitive enough not to miss any relevant data. 35 Present Investigation Present Investigation 36 Objectives This thesis describes the use of biocomputational approaches to analyse and characterize publicly available biological sequences in order to gain information about gene and protein functions (see figure 6.1). The work can be divided in three major research topics. Firstly, novel software has been developed to facilitate selection of protein fragments subsequently used in antibody-based proteomics (publication I). Furthermore, the suggested design strategy, as well as experimental results, was evaluated in a separate systematic study (publication II). Secondly, additional software functionalities were developed to enable antibody recognition of splice variants (unpublished data, see chapter 7.4.2), and proteins with high sequence similarities, possibly evolved through duplication events (publication III). The previously mentioned strategies facilitated the move from a one-by-one manual gene analysis to processing genes in a high-throughput manner with a minimum of human interference. Finally, the data mining of the genomic sequence of a tree model organism, poplar (Populus trichocarpa), resulted in the identification of previously unknown genes belonging to the cellulose synthase (CesA) family (publication IV). These genes are thought to have evolved through duplication events at the gene or genome level. 
The studies described include the use of genomic and proteomic data from human (publication I-III, unpublished data) as well as from P. trichocarpa (publication IV). Furthermore, sequence analysis such as homology searches (publication I-IV), PCR primer design (publication I, II), multiple alignments and phylogenetics (publication IV), have been performed. Objectives 6. Objectives 37 38 Design of Protein Epitope Signature Tags (PrESTs) for Antibody-based Proteomics Publication III Processing of high sequence similarity proteins Publication I PrEST design software Publication IV Data mining in Populus PrEST Unpublished data PrEST design on splice variants Publication II A systematic study of designed PrESTs Figure 6.1 A schematic overview of the publications (and unpublished data) presented in this thesis. Publication III, IV explore proteins that are thought to have evolved through duplication events (see chapter 1.2.2). Publication I-III as well as the unpublished data comprise studies involving the centerpositioned PrEST (Protein Epitope Signature Tag; see chapter 7.2). 7.1 Background On the 15th of August 2003, the Swedish human proteome resource (HPR) was initiated. It is an antibody-based proteomics approach, which aims for the generation of affinity-purified polyclonal antibodies towards recombinant protein fragments (see figure 7.1; related publication A). The antibodies are subsequently used in TMA studies (see chapter 3.6.4) on normal and diseased tissues (Kampf et al., 2004), as well as for in vitro detection (see chapter 3.6.5) and analysis on protein microarrays (see chapter 3.6.3). Protein profiling is also performed on cell-lines, which might provide a valuable complement to tissue localization studies. These cell-lines have the advantage of being throughtly controlled during cultivation. Some cell-lines consist of cells from rare types of cancers, which are difficult to get tussie samples of. 
Well-characterized cell-lines might also provide negative and positive controls in the IHC assays (Andersson et al., manuscript in preparation). The HPR approach is based on a pilot study of the human chromosome 21, which comprised areas such as protein expression and purification, antibody staining, as well as biocomputing (Agaton et al., 2003). Antibodies are generated towards recombinant protein fragments, protein epitope signature tags (PrESTs), and are given mono-specific properties (see chapter 3.6.1) by stringent purification of polyclonal antisera using the PrESTs as affinity ligands. An important part of the HPR strategy is the selection of recombinant protein fragments, which typically span around 100-150 residues. These PrEST fragments are designed by the use of in-house prediction software (Bishop; publication I) and are selected on the basis of favorable sequence properties for subsequent cloning, expression and purification. There is a need for specific and quality controlled antibodies that display a minimum of cross-reactivity to proteins other than their intended targets. The HPR approach deals with this issue in multiple ways. The PrEST design strategy emphasizes locating PrESTs in regions of low sequence similarity to other human proteins. This reduces the risk of the msAbs cross-reacting with similar proteins. This is accomplished by the PrEST design software (publication I), which selects recombinant protein fragments based on a local alignment scanning procedure.

Figure 7.1 (box labels): PrEST selection; Cloning; Purification of PrEST; Antibody generation; Affinity purification of antibodies; Sequencing; Production of PrEST; PrEST array; In situ localization; Public data; In vitro detection.

Figure 7.1 A schematic overview of the HPR project, which includes several computer-assisted modules.
PrEST fragments are determined by the use of specific design strategies, taking into account the properties of the target proteins (publication I-III, unpublished data). Quality control by cycle sequencing (Blakesley, 1993) ensures the amplification of a correct PrEST fragment. Data belonging to the designed PrESTs and to the generated antibodies are stored and displayed in a publicly available data repository (www.proteinatlas.org). In addition, the entire project is integrated with a laboratory information management system (LIMS), which ensures proper processing and tracking of all the data.

Several quality control steps are included in the HPR project. Sequence verification of clones in the cloning procedure is obtained by sequencing and comparison with the in silico generated PrEST. The antibodies are also used on reverse phase protein fragment microarrays (see chapter 3.6.3), on which the absence of cross-reactivity indicates the presence of mono-specific antibodies. In addition, the antibodies are verified by Western blot analysis (see chapter 3.6.5), and these results are compared to the TMA data. Reimmunization with the respective PrESTs, followed by comparison of the resulting protein expression patterns, provides a further means of validating the results. Also, two msAbs directed towards different regions of the same protein, and designed in appropriate software (publication I), should provide correlating staining patterns (see table 7.2). The generated antibodies have been included in comparative studies together with well-characterized, commercial, and widely used mAbs. The similar results provided by this study corroborate the high specificity of the HPR-generated msAbs (Nilsson et al., 2005).

7.2 Design of Protein Fragments for Recombinant Protein Production (publication I)

A novel PrEST selection software has been developed (Bishop; see figure 7.2), predicting protein fragments based on several empirically determined requirements.
Bishop allows for high-throughput selection of PrESTs with a minimum of end-user interference, providing a system that is integrated with several publicly available prediction tools (see below). In contrast, the previously used design strategy (Agaton et al., 2003) was performed in a more manual gene-by-gene fashion and required the involvement of several different programs. The selection requisites in Bishop facilitate the subsequent laboratory procedures and promote selection of the most suitable PrEST fragment for any given gene product. Unwanted binding of the produced msAbs needed to be kept at a minimum. This was accomplished by a local alignment scanning procedure based on BLASTP (Altschul et al., 1990), which ensured the selection of fragments exhibiting the lowest possible sequence similarity to other human proteins. The procedure used a sliding window approach that traversed the target protein sequence, moving one amino acid per iteration. The size of the window could be individually determined by the end-user, but the default size was 125 amino acids. Transmembrane regions cause problems in the expression and purification procedures in the subsequently used expression host E. coli (Miroux and Walker, 1996), and are also poorly accessible to antibodies in tissue profiling. Transmembrane regions were therefore avoided by the use of the prediction program TMHMM (Krogh et al., 2001), which was executed by the main program. Signal peptides (see chapter 3.7.3) are cleaved off from proteins during translocation and are therefore inappropriate targets for antibody recognition. Bishop allows for the exclusion of signal peptides by visualizing the predictions produced by SignalP (Nielsen et al., 1997). Finally, avoidance of restriction enzyme motifs used in subsequent cloning is a necessity. Therefore, the corresponding cDNA sequence of the target protein was screened for restriction enzyme sites.
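The sliding window scan described above can be illustrated with a minimal sketch. It assumes that similarity to other human proteins has already been reduced to one hypothetical score per residue, whereas Bishop derives similarity from BLASTP alignments. The function name `lowest_similarity_window`, the score list, and the set of excluded positions (standing in for predicted transmembrane regions and signal peptides) are illustrative assumptions and are not taken from Bishop.

```python
from typing import List, Optional, Set, Tuple


def lowest_similarity_window(
    similarity: List[float],
    excluded: Set[int],
    window: int = 125,
) -> Optional[Tuple[int, float]]:
    """Return (start, mean score) of the window with the lowest mean score.

    similarity: hypothetical per-residue similarity scores (assumption).
    excluded:   0-based residue positions that a fragment must not cover,
                e.g. predicted transmembrane or signal peptide residues.
    window:     fragment length; 125 is the default size named in the text.
    """
    best: Optional[Tuple[int, float]] = None
    # Move the window one residue per iteration along the sequence.
    for start in range(0, len(similarity) - window + 1):
        # Skip windows that overlap any excluded position.
        if any(pos in excluded for pos in range(start, start + window)):
            continue
        mean = sum(similarity[start:start + window]) / window
        # Keep the first window with the strictly lowest mean score.
        if best is None or mean < best[1]:
            best = (start, mean)
    return best


if __name__ == "__main__":
    toy_scores = [0.9] * 10 + [0.1] * 10 + [0.5] * 10
    print(lowest_similarity_window(toy_scores, excluded={12}, window=5))
```

The function returns `None` when no admissible window exists, for example when the protein is shorter than the window, which loosely mirrors the design failures for very short proteins discussed in chapter 7.3.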
The software allows for searching of any desired restriction site or basic sequence motif. Based on the four requirements mentioned above, two non-overlapping PrEST fragments were, if possible, automatically predicted for all analyzed target proteins.

Following TMA studies of the antibodies, the resulting data, together with data on the PrEST towards which each antibody was directed, are stored in a publicly available protein atlas (www.proteinatlas.org; related publication A). This repository includes data such as high-resolution TMA images, target gene data, and validation of the msAbs. A research project of this magnitude encompasses several laboratory assays and generates vast amounts of data. A laboratory information management system (LIMS) facilitates tracking and handling of the complete laboratory procedures and of the sequence data used for PrEST selection (see chapter 7.4). The LIMS is integrated with the publicly available data repository, as well as with the current PrEST design software (publication I).

Figure 7.2 The five different tabs of the Bishop software (A-E): A: The sequence alignment tab, in which the target protein sequence is aligned to all human proteins. The user also has the possibility to exclude splice isoforms (see chapter 7.4.2, unpublished data) or high sequence identity proteins (see publication II) from the subsequent sequence similarity search (displayed in tab C). B: The PrEST data tab, where data concerning the predicted PrESTs are shown, for instance PrEST locations, sequence identities, and amino acid sequences. C: The sequence feature tab, which displays several transcript and protein features as horizontal bars. These are (from top to bottom): transcript sequence, protein sequence, and predicted PrEST sequences.
In addition (although not displayed in the figure), Bishop allows for the exclusion of signal peptides, transmembrane regions, and restriction enzyme sites located within the PrEST sequences. D: The PCR primer design tab, in which primers are selected based on non-stringent criteria. E: The result tab, in which all previously generated data are displayed before committing the predicted PrEST, allowing for subsequent ordering of the designed PCR primers.

As previously mentioned, Bishop allows for selection of fragments with minimum sequence similarity to other proteins. However, in some cases there might be a trade-off between locating the PrEST in a target region and identifying the region of lowest possible sequence similarity. Therefore, the region of lowest similarity might not always be the optimal choice. For instance, when designing PrESTs representing groups of proteins of high sequence similarity (publication III) or PrESTs located in common or unique parts of multiple splice variants (unpublished data, see chapter 7.4.2), other design criteria might be more relevant. It is also possible to perform a manual PrEST selection in Bishop (see figure 7.2, tab B), for instance when designing PrESTs towards conserved active domains or extracellular parts of receptor proteins. Nevertheless, designing PrESTs in the region of lowest similarity should be the primary choice when processing genes in a high-throughput manner. Bishop also includes a primer design module, which is used following PrEST design. An appropriate melting temperature (Tm), as well as a suitable length of the primer, can be set by the end-user. The primer design is based on non-stringent criteria such as ending the primer sequence with a C or a G and avoiding continuous stretches longer than three identical nucleotides.
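The two non-stringent criteria named above can be expressed as a short check. This is a sketch only: the function name `passes_nonstringent_criteria` is hypothetical, and the user-defined melting temperature and primer length settings are deliberately not modeled.

```python
import re


def passes_nonstringent_criteria(primer: str) -> bool:
    """Check a primer against the two non-stringent criteria from the text.

    1. The primer sequence ends with a C or a G.
    2. It contains no stretch longer than three identical nucleotides.
    """
    primer = primer.upper()
    ends_with_c_or_g = primer.endswith(("C", "G"))
    # Four or more identical characters in a row is a stretch longer than three.
    has_long_stretch = re.search(r"(.)\1{3,}", primer) is not None
    return ends_with_c_or_g and not has_long_stretch


if __name__ == "__main__":
    for candidate in ("ATGCCATTGC", "ATGAAAATGC", "ATGCCATTGA"):
        print(candidate, passes_nonstringent_criteria(candidate))
```

A stretch of exactly three identical nucleotides is accepted, since the criterion only rules out stretches longer than three.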
However, when compared to a more stringent primer design, these non-stringent criteria showed only slightly lower success rates in the subsequent RT-PCR and sequencing modules (publication II). Following primer design, all data related to the target protein are transferred to the LIMS (see chapter 7.1), from where the primer ordering is performed. Bishop is fully integrated with the LIMS, performing data exchanges and overnight synchronization of information. Due to the existence of discontinuous and linear epitopes (see chapter 3.6.2), it would be beneficial to obtain antibodies recognizing both types of epitopes. That accomplished, the antibodies might be used in assays including proteins in a native fold, as well as in studies comprising slightly denatured proteins. In order to explore the capability of the cloned and expressed PrESTs to form discontinuous epitopes, designed protein fragments were evaluated by analyzing their 3D structures. This was performed by marking the PrEST regions on known protein structures retrieved from the protein data bank (see figure 3 in appendix I; Bernstein et al., 1977). As antigens located at the N- or C-termini are commonly used for the expression of recombinant protein fragments for antibody production (Agaton et al., 2003), these regions were also included in the analysis. The C- and N-termini of the protein were marked in regions corresponding to the PrEST lengths. This limited study suggested that the PrESTs designed by the software package seem to have a relatively good potential to display some conformational epitopes. However, this is not sufficient proof of the expressed PrESTs' ability to form secondary structures resembling the native protein. On the other hand, the PrEST fragments seem to be large enough to provide surface exposure for a number of residues on the native protein. This is of great importance for antibody recognition of the target.
The protein fragments are, in the majority of cases, shorter than full-length proteins and, therefore, a native protein fold of the expressed PrESTs is not obtained (see figure 7.3). Also, the immunized PrESTs are most likely denatured upon immunization due to the exposure to adjuvants. Still, these protein-fold modifications seem to resemble the denatured forms of the proteins located in formalin-fixed tissues in TMAs, since the majority of the msAbs actually recognize their intended targets.

Figure 7.3 (labels: gene region; full-length protein; PrEST) A comparison of a full-length protein in relation to a PrEST, which is generated by the expression of a recombinant protein fragment. In the majority of cases the PrEST is shorter than a full-length protein. This, in combination with the use of adjuvants in the immunizations, results in a slightly modified fold of the PrEST fragments.

7.3 A Systematic Analysis of Designed PrESTs (publication II)

PrESTs designed in the previously described software (publication I), as well as in a similar software named ProteinWeaver (Affibody, Bromma, Sweden), were analysed. The success rates of the PrEST design strategy were investigated in this study, with respect to target protein properties as well as PCR primer design approaches. The analysis was performed on proteins encoded by genes from human chromosomes 14, 22 and X together with target proteins requested from several research collaborations (see chapter 7.4). The proteins submitted from collaborators were identified by the use of several different proteomic methods, for example, MS (see chapter 3.4) and methods involving protein-protein interactions (see chapter 3.9). This systematic study included a total of 6,412 PrEST fragments located on proteins encoded by 4,263 genes, corresponding to 16% of the genes currently included in the Ensembl dataset. Chromosomes 14, 22, and X were overrepresented, due to the focus on these chromosomes (see table 7.1).
The design of PCR primers for amplification of the PrESTs was performed using a high stringency and a low stringency approach (the latter previously described in publication I). The high stringency algorithm included additional criteria, considering dimer and hairpin formation. Comparison of results from the two primer design approaches revealed a slightly higher success rate for the high stringency algorithm than for the low stringency approach, 82% and 74%, respectively. The success rates were calculated from subsequent RT-PCR analysis and sequencing. However, the actual difference between the two approaches is probably lower, since the high stringency approach was recently added to the HPR project. Thus, continuous improvement and refinement of laboratory procedures might have contributed to the higher success rates of the stringent primer design approach.

Chromosome   Number of Ensembl genes   Genes with PrEST   Number of PrEST clones
1            2158                      246                204
2            1443                      163                135
3            1146                      162                142
4            850                       105                97
5            964                       125                103
6            1146                      141                117
7            1094                      145                120
8            785                       93                 84
9            904                       110                96
10           874                       84                 74
11           1397                      142                123
12           1085                      157                137
13           394                       37                 29
14           713                       639                476
15           702                       73                 68
16           947                       98                 73
17           1219                      145                130
18           316                       30                 25
19           1455                      216                149
20           622                       81                 57
21           265                       37                 36
22           535                       440                344
X            957                       718                523
Y            129                       77                 44

Table 7.1 The current status of the number of designed PrESTs and sequence verified clones compared to the gene count on all human chromosomes (the gene count is according to Ensembl, August 2005).

In order to increase the efficiency of the RT-PCR analysis, and to properly validate the generated antibodies (see chapter 7.1), the primary strategy was to design two PrESTs for each protein.
The design strategy successfully selected two PrESTs on the same protein for approximately half of the investigated proteins. For shorter proteins lacking sufficient length, a single PrEST was designed. In cases where the protein was shorter than 100 amino acids, the entire protein sequence was selected as antigen. Proteins shorter than 100 residues constituted 5.3% of the total protein count in the Ensembl dataset (v26.35). Failures in PrEST design were a result of interference by signal peptides, restriction enzyme sites, or transmembrane regions. Additionally, some proteins were too short for design. Together, these failures constituted approximately one percent of the total number of investigated proteins. Moreover, this study considered target proteins with high sequence similarities (above 80% pairwise sequence identity) to other human proteins as fall-out proteins, which in this case constituted around 13% of the total protein data set. Proteins encoded by genes located on human chromosomes 22 and X constituted a larger part of these fall-outs. The number of fall-out proteins encoded by genes on chromosome 14 was significantly lower due to few duplicated regions (Cheung et al., 2003). In contrast, chromosomes 22 and X are known to have high numbers of duplicated segments. However, these genes might be processed with a strategy developed solely for handling such high sequence similarity proteins (publication III).

7.4 Specific PrEST Design Strategies (publication III and unpublished data)

In the HPR, protein and cDNA sequences from Ensembl were the primary data sources for PrEST design. After retrieval, the sequences were initially grouped according to a basic classification scheme, resulting in three different protein categories. Firstly, the nonredundant set comprised one representative protein from each unique gene locus (see chapter 2.2). For genes with splice variants, the longest isoform was selected. Secondly, a portion of the nonredundant set encompassed high sequence similarity proteins, which were processed using a specific design strategy (publication III). The final category contained additional spliced isoforms (see chapter 2.2) not covered by the first category. In this case, the PrEST design was directed towards specific splice variants and towards regions shared by several or all of the protein isoforms (unpublished data, see chapter 7.4.2). Currently, the HPR is focusing on genes on human chromosomes 14, 22, and X, as well as specific requests from research collaborators. These genes include biomarkers for human cancers and stem cells, as well as genes from other species, for instance evolutionarily relevant chimpanzee genes and Populus (see chapter 8.1) genes involved in plant growth and development.

7.4.1 Antibody-based Proteomics on High Sequence Similarity Proteins (publication III)

As previously mentioned (see chapter 7.4), three different protein classifications have been established. Accordingly, different PrEST design strategies have been developed for individual handling of the different categories. The second category consisted of high sequence similarity proteins (HSSPs), exhibiting more than 80% pairwise sequence identity to human proteins from different genes. This sequence identity limit also corresponded to the fall-out proteins previously described (publication II). A design strategy for processing HSSPs was developed, where the proteins initially were grouped in clusters based on the 80% sequence identity threshold. The PrEST design was thereafter focused on locating PrESTs in regions that were common to all cluster members. As a proof of principle, antibodies generated towards such cluster specific PrESTs have been analyzed by Western blot analysis using whole-cell and tissue protein extracts. The ability to identify proteins within a high sequence similarity cluster by the use of a single antibody was thus investigated. In addition, novel software was developed to estimate the minimum number of PrESTs needed for coverage of the HSSPs. Initially, the HSSPs were identified using an all-to-all sequence similarity search, in which all human proteins predicted by Ensembl were included. Proteins from different genes sharing more than 80% sequence identity over the shortest of the aligned sequences were selected. This resulted in 3,250 identified HSSPs, corresponding to 14.6% of the nonredundant set of proteins. The subsequent PrEST design was facilitated using an enhanced version of Bishop (publication I). This allowed for the simultaneous visualization of all proteins within a cluster, and for location of PrESTs within aligned high sequence similarity regions. This version of Bishop contained an additional sequence similarity scanning procedure, in which the target proteins were aligned to all proteins within their respective cluster. The similarities to all human proteins could additionally be visualized. The optimal PrEST location within the aligned high sequence similarity region was the region of highest local sequence similarity among all proteins in a cluster. This local similarity was generated as previously described (publication I).

To explore the capacity of the generated antibodies to identify all members of the HSSP clusters, an in vitro detection by Western blot analysis was performed on antibodies towards cluster specific PrESTs. An example showed (see figure 3 in appendix II) that all cluster members could be identified in one of the cell lines.
Although limited, this study implies the possibility to use this design strategy for generation of antibodies capable of recognizing all proteins in a high sequence similarity cluster, despite the fact that their sequences are not identical. An estimation of the minimum number of PrESTs sufficient to cover all the previously identified HSSPs was made possible by the development of novel software. This program performed homology searches using BLASTP (Altschul et al., 1990), which resulted in the identification of HSSPs. The proteins were subsequently assembled in clusters based on common hit sequences. The protein sequences in each cluster were then aligned and grouped into sub-clusters. Proteins with common regions of sufficient size (in this case 100 residues) were grouped in the same sub-cluster. Finally, in order to remove redundancies and to cover as many proteins as possible with a minimum of PrESTs, the entire set of sub-clusters was subjected to a sorting in which large sub-clusters were favoured. This resulted in a total of 1,124 sub-clusters and in 186 single proteins remaining after the final sorting. This implies that 1,310 PrESTs are necessary for coverage of the complete set of HSSPs. On average, 2.73 proteins were covered by a region large enough to accommodate a PrEST fragment. In conclusion, this resulted in a total of 20,281 PrESTs, sufficient for coverage of the total nonredundant set of proteins, corresponding to the entire set of human genes (22,221 according to Ensembl v26.35). To facilitate the study of multiple splice variants belonging to these genes, a specific approach was developed (see below). In addition, antibodies generated by this HSSP strategy could potentially be important for investigations of gene duplications and gene families.
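The sorting step, in which large sub-clusters are favoured so that as many proteins as possible are covered by a minimum of PrESTs, can be sketched as a greedy selection. This is an assumed stand-in for the thesis software, whose exact sorting rule is not given here; the function name `select_prest_groups` and the toy protein identifiers are hypothetical.

```python
from typing import List, Set, Tuple


def select_prest_groups(
    subclusters: List[Set[str]],
    proteins: Set[str],
) -> Tuple[List[Set[str]], List[str]]:
    """Greedily pick sub-clusters, largest remaining coverage first.

    Returns the chosen groups (each would share one PrEST) and the
    proteins left over as singles (each would need its own PrEST).
    """
    remaining = set(proteins)
    groups: List[Set[str]] = []
    while remaining:
        # Favour the sub-cluster that covers the most uncovered proteins.
        best = max(
            subclusters,
            key=lambda members: len(members & remaining),
            default=None,
        )
        if best is None:
            break
        gain = best & remaining
        # A group is only worthwhile if it covers at least two proteins.
        if len(gain) < 2:
            break
        groups.append(gain)
        remaining -= gain
    return groups, sorted(remaining)


if __name__ == "__main__":
    toy_subclusters = [{"P1", "P2", "P3"}, {"P3", "P4"}, {"P4", "P5"}]
    toy_proteins = {"P1", "P2", "P3", "P4", "P5", "P6"}
    groups, singles = select_prest_groups(toy_subclusters, toy_proteins)
    print(len(groups) + len(singles), "PrESTs needed for the toy set")
```

The totals reported above are mutually consistent: 1,124 + 186 = 1,310 PrESTs for the HSSPs, and 22,221 - 3,250 + 1,310 = 20,281 PrESTs for the complete nonredundant set. The stated average is likewise consistent with (3,250 - 186) / 1,124, which is approximately 2.73 proteins per sub-cluster.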
7.4.2 Design of Isoform Specific Protein Fragments (unpublished data)

The final protein category, as defined by the HPR, was proteins with multiple splice variants, which were not included in the nonredundant set of proteins. According to Ensembl, about 32% of the human genes had multiple isoforms and the majority (30.3%) of these genes had between two and four observed variants. The primary design strategy (see chapter 7.4) focused on the nonredundant set by selecting a single representative protein for genes with several splice variants. However, designing PrESTs unique to a specific isoform or common to all of the splice variants might be desirable. A strategy for designing PrESTs on genes with several splice variants was therefore developed. The approach included novel software (ExonViewer), which was used in combination with Bishop (publication I). ExonViewer used exon boundaries from the Ensembl database and visualized them in a user-friendly interface (see figure 7.4). A Perl-based script for data retrieval and pre-processing of data was developed, whereas the main software was written in Java. The two parts of the system used the BioPerl and BioJava frameworks, respectively, for handling of sequence data and for visualization of exons (see chapter 4.4). This allowed for in-depth studies of different splicing events. Designing PrESTs on unique splice variants, or on regions common to several or all variants, was also possible.

Figure 7.4 (panel A: PrEST A, PrEST B; panel B: PrEST C, PrEST D) Two screenshots of ExonViewer, in which exons are displayed as color coded bars. The uppermost bar represents a consensus view showing all exons belonging to the target gene. The middle and lowermost bars display the assembled exons (shown in amino acid coordinates). In the examples shown (A, B), the VWF and TJP1 genes each have two different isoforms. In A, two PrESTs are located in a region unique to that splice variant.
B shows the placement of two PrESTs in a region common to the two isoforms.

In this design strategy, the target genes were visualized in ExonViewer, and regions common or unique to all or several of the isoforms were selected as candidate PrEST design regions. These regions were at least 100 contiguous amino acids in length. Yet, any desired length limit could be defined. Subsequently, these regions were exported to Bishop (publication I) for convenient design of protein fragments. TMA results from four antibodies generated against two pairs of PrESTs were used as a means to validate the strategy. Each pair was designed on the same target protein. The first pair of PrESTs was located in a region unique to one of the isoforms of the VWF gene (ENSG00000110799; see figure 7.4A). The second pair was placed in a region common to both isoforms of the TJP1 gene (ENSG00000104067; see figure 7.4B). A comparison of the protein expression patterns within the PrEST pairs in 43 different tissues revealed a clear correlation between PrESTs designed on the same protein (see table 7.2).
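The selection of candidate design regions described in this section can be sketched as follows. The sketch is a simplification under stated assumptions: each isoform is given as a list of hypothetical exon identifiers with their lengths in amino acids, every exon is treated on its own, and regions spanning several adjacent exons are not considered. The function name `candidate_exons` and the toy data are illustrative and do not come from ExonViewer.

```python
from typing import Dict, List, Set, Tuple

Exon = Tuple[str, int]  # (exon identifier, length in amino acids)


def candidate_exons(
    isoforms: Dict[str, List[Exon]],
    min_length: int = 100,
) -> Tuple[List[str], Dict[str, str]]:
    """Return exons common to all isoforms and exons unique to one isoform.

    Only exons of at least min_length amino acids are reported; 100 is
    the default limit named in the text, and any other limit can be used.
    """
    lengths: Dict[str, int] = {}
    users: Dict[str, Set[str]] = {}
    for name, exons in isoforms.items():
        for exon_id, length in exons:
            lengths[exon_id] = length
            users.setdefault(exon_id, set()).add(name)

    long_enough = [e for e in lengths if lengths[e] >= min_length]
    # Exons present in every isoform are candidates for common PrESTs.
    common = sorted(e for e in long_enough if len(users[e]) == len(isoforms))
    # Exons present in exactly one isoform are candidates for unique PrESTs.
    unique = {e: next(iter(users[e])) for e in long_enough if len(users[e]) == 1}
    return common, unique


if __name__ == "__main__":
    toy_isoforms = {
        "iso1": [("e1", 150), ("e2", 60), ("e3", 120)],
        "iso2": [("e1", 150), ("e3", 120), ("e4", 200)],
    }
    print(candidate_exons(toy_isoforms))
```

In the toy data, exon `e2` is unique to one isoform but is shorter than the length limit, so it is not reported as a candidate region.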
Tissue               Cell types
Adrenal gland        cortical cells; medullar cells
Appendix             glandular cells; lymphoid tissue
Cerebellum           cells in granular layer; cells in molecular layer; purkinje cells
Cerebral cortex      neuronal cells; non-neuronal cells
Cervix, uterine      glandular cells; surface epithelial cells
Colon                glandular cells
Duodenum             glandular cells
Endometrium 1        cells in endometrial stroma; cells in myometrium/ECM; glandular cells
Endometrium 2        cells in endometrial stroma; cells in myometrium/ECM; glandular cells
Epididymis           glandular cells
Esophagus            surface epithelial cells
Fallopian tube       glandular cells
Gall bladder         glandular cells
Heart muscle         myocytes
Hippocampus          neuronal cells; non-neuronal cells
Kidney               cells in glomeruli; cells in tubuli
Lateral ventricle    neuronal cells; non-neuronal cells
Liver                bile duct cells; hepatocytes
Lung                 alveolar cells; macrophages
Lymph node           follicle cells (cortex); non-follicle cells (paracortex)
Oral mucosa          surface epithelial cells
Ovary                follicle cells; ovarian stromal cells
Pancreas             exocrine pancreas; islet cells
Parathyroid gland    glandular cells
Placenta             decidual cells; trophoblastic cells
Prostate             glandular cells
Rectum               glandular cells
Salivary gland       glandular cells
Skeletal muscle      myocytes
Skin                 adnexal cells; epidermal cells
Small intestine      glandular cells
Smooth muscle        smooth muscle cells
Soft tissue 1        mesenchymal cells
Soft tissue 2        mesenchymal cells
Spleen               cells in red pulp; cells in white pulp
Stomach 1            glandular cells
Stomach 2            glandular cells
Testis               cells in ductus seminiferus; leydig cells
Thyroid gland        glandular cells
Tonsil               follicle cells (cortex); non-follicle cells (paracortex); surface epithelial cells
Urinary bladder      surface epithelial cells
Vagina               surface epithelial cells
Vulva/anal skin      surface epithelial cells

Columns: antibody A, B, C, D. Legend: strong staining; medium staining; weak staining; no staining.

Table 7.2 A comparison of the tissue distribution in 43 different tissues of the four investigated antibodies.
Red color indicates strong staining, orange color medium staining, and yellow color weak staining. Blue color displays no staining. Tissues with no available pictures were excluded from the study.

The different colors in the protein expression patterns indicated the presence of the proteins in the tissues. Red color indicated strong staining, orange medium staining, and yellow weak staining, whereas blue indicated no staining. The majority of tissues showed similar staining patterns for PrESTs located on the same protein (antibodies A-B and C-D in table 7.2). Similar staining in the included tissues of two PrESTs located on the same protein could be seen as a validation and confirmation of the design strategy (see chapter 7.1). Yet, according to the applied three-level scheme, different staining intensities (weak, medium, and strong staining) might reveal a similar protein expression upon closer examination. Differences in staining patterns might result from variations in the tissues, since nonconsecutive tissue sections were used. The tissue sampling was performed on tissues from different individuals, and the tissues might therefore not be representative. Moreover, the annotation might have been performed by different annotators. The antibodies used in the TMAs might also have been diluted in different ways. Furthermore, the splice variants in the data source might not have been fully curated, i.e. unknown isoforms might have been present which interfered with the results.

The tissue distribution data used in this study is merely for validation of the strategy. This limited data indicates that the design strategy is promising. However, more antibodies need to be generated for a large-scale evaluation. Furthermore, PrESTs can be directed to unique regions of several isoforms, thereby revealing clues about the different tissue distribution patterns.
8. Identification and Characterization of Gene Family Members in a Fully Sequenced Tree Genome

8.1 Populus, a Tree Model System

Trees belonging to the genus Populus (i.e. poplars, including cottonwoods and aspens) have several attributes that have contributed to their emergence as predominant model organisms for molecular tree biotechnology (reviewed by Brunner et al., 2004). Populus members are widely distributed throughout the Northern hemisphere. For instance, black cottonwood (Populus trichocarpa) grows in diverse habitats such as those in Alaska and Mexico (Brunner et al., 2004). This adaptivity and genetic variation make Populus trees ideal for studies of adaptive evolution (see chapter 1.2.1). Furthermore, their short generation time and easy transformation enable functional genomic studies. The recent sequencing of the genome of P. trichocarpa (genome.jgi-psf.org/poptr1) provides means to further study and characterize the fairly small genome of this organism (see table 5.1). In comparison, loblolly pine (Pinus taeda L.) has a large and complex genome (about 20 times larger than Populus), and exhibits a long generation time. Pine is therefore an unsuitable model organism for tree genomics (Whetten et al., 2001).

8.2 Cellulose Synthases

The capability of forming wood is a specific feature distinguishing trees from other plants. Cellulose, which is the primary component in wood, is synthesized by large membrane-bound protein complexes (Doblin et al., 2002). These complexes include catalytic subunits encoded by CesA gene family members. CesA enzymes are glycosyl transferases (GTs). GTs form glycosidic bonds by catalyzing the transfer of sugar moieties from appropriate sugar donors to acceptor molecules, thus creating carbohydrates such as cellulose (Coutinho et al., 2003).
Conserved residues present in all GTs belonging to the same family as CesA enzymes include three D residues as well as a QxxRW motif, proposed to be involved in substrate and acceptor binding, as well as processivity (Saxena et al., 1995; Campbell et al., 1997; Garinot-Schneider et al., 2000). In addition, CesA family members harbor two zinc finger domains positioned near the N-terminus, as well as two hypervariable regions (HVR-1 and HVR-2) specific to plant genes (see figure 8.1, 8.2). CesA proteins are generally around 1,000 amino acids in length, with eight predicted transmembrane regions.

Figure 8.1 A view of the putative CesA protein integrated into the plasma membrane. The characteristic features of the CesA protein are: zinc finger domain (green) consisting of two CxxC-motifs, the conserved catalytic residues D, D, D (blue) and QxxRW (gold) forming the putative catalytic site, the plant-specific hypervariable regions HVR-1 (red) and HVR-2 (magenta), and the eight transmembrane regions (black).

8.3 Identification of Previously Unknown CesA Family Members (publication IV)

In this study, the genome sequence of P. trichocarpa was mined for unidentified CesA family members. This was done by querying the genome with previously identified CesA EST and full-length gene sequences from the closely related hybrid aspen (P. tremula x tremuloides; Djerbi et al., 2004). The screening was performed with BLAST and resulted in the identification of 18 putative CesA genes based on sequence similarities alone. Subsequently, the corresponding protein sequences were subjected to an all-to-all global sequence alignment search using the software Needle (www.ebi.ac.uk/emboss).
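The all-to-all global alignment step described above can be illustrated with a minimal sketch. It is not the procedure used in the study: Needle scores alignments with a substitution matrix and affine gap penalties, whereas the sketch below assumes a plain match/mismatch score and a linear gap penalty to stay short. The function names (`global_identity`, `all_vs_all`, `best_partner`) and the toy sequences are hypothetical and serve only as illustration.

```python
from itertools import combinations


def global_identity(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> float:
    """Needleman-Wunsch global alignment of a and b (simplified scoring, an
    assumption of this sketch); returns percent identity over the alignment length."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back one optimal path, counting identical columns and total columns.
    i, j, identical, length = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        length += 1
    return 100.0 * identical / length if length else 0.0


def all_vs_all(seqs: dict) -> dict:
    """Percent identity for every unordered pair of named sequences."""
    return {
        (x, y): global_identity(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }


def best_partner(name: str, identities: dict) -> str:
    """Name of the sequence with the highest identity to `name`."""
    partners = {
        (set(pair) - {name}).pop(): pid
        for pair, pid in identities.items()
        if name in pair
    }
    return max(partners, key=partners.get)


if __name__ == "__main__":
    # Toy peptides with hypothetical names, not real CesA sequences.
    toy = {"a1": "MKTAYIAK", "a2": "MKTAYIAR", "b1": "GGSPLLWQ"}
    ids = all_vs_all(toy)
    for pair, pid in sorted(ids.items()):
        print(pair, round(pid, 1))
    print("closest to a1:", best_partner("a1", ids))
```

In such a scheme, sequences that are each other's best partner at high identity would be candidates for a paralogous pair; any identity threshold used for grouping is a choice that the sketch leaves open.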
This analysis revealed a clear sequence similarity pattern, which distinctly grouped the CesA genes into seven pairs, one single sequence, as well as a group of three sequences. The 18 putative CesA gene sequences were further analyzed and conserved sequence characteristics were verified for confirmation of true CesA family membership (see chapter 8.2; figure 8.1). All CesA characteristics were identified in the 18 putative CesA sequences using tools such as ClustalW (see figure 9.2; Thompson et al., 1994), TMHMM (Krogh et al., 2001), as well as manual analysis. This extensive sequence analysis implies that the collected genes are indeed true CesA family members.

[Figure 8.2: multiple sequence alignment of the 18 PtCesA protein sequences (PtCesA9-1, PtCesA9-2, PtCesA2-1, PtCesA2-2, PtCesA4-2, PtCesA4-1, PtCesA6-1, PtCesA6-2, PtCesA8-1, PtCesA8-3, PtCesA8-2, PtCesA5-2, PtCesA5-1, PtCesA7-2, PtCesA7-1, PtCesA1, PtCesA3-1, PtCesA3-2). Marked regions: 1A, 1B, zinc finger domains (CxxC); 2, hypervariable region I; 3A, 3B, 3C, conserved D residues in the catalytic site; 4, hypervariable region II; 5, conserved QxxRW motif in the catalytic site.]

Figure 8.2 A multiple alignment showing the conserved sequence features common to cellulose synthase family members (these features are also presented in figure 8.1).

A thorough phylogenetic analysis was performed, which included the 18 identified CesA protein sequences together with CesA protein sequences and EST sequences from other plant genomes, i.e. hybrid aspen (P. tremula x tremuloides), quaking aspen (P. tremuloides), cotton (Gossypium hirsutum), rice (Oryza sativa), thale cress (A. thaliana) and maize (Zea mays). This analysis was restricted to conserved regions among CesA family members, thus the HVR-1 and -2 regions (see figure 8.1, 8.2) were excluded.
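Restricting an analysis to conserved regions amounts to removing the alignment columns that fall inside the excluded regions before the tree is built. The following is a minimal sketch of that idea only; the function name `drop_columns`, the toy alignment, and the column coordinates are hypothetical and do not correspond to the actual HVR-1 and HVR-2 boundaries used in the study.

```python
def drop_columns(alignment: dict, excluded: list) -> dict:
    """Remove half-open [start, end) column ranges from every aligned sequence.

    `alignment` maps sequence names to equal-length aligned strings;
    `excluded` is a list of (start, end) column ranges (0-based), for
    example the columns spanned by the hypervariable regions.
    """
    if not alignment:
        return {}
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("aligned sequences must have equal length")
    width = lengths.pop()
    keep = [
        col
        for col in range(width)
        if not any(start <= col < end for start, end in excluded)
    ]
    return {
        name: "".join(seq[col] for col in keep)
        for name, seq in alignment.items()
    }


if __name__ == "__main__":
    # Toy alignment: lowercase columns stand in for a variable region.
    toy = {"seqX": "AAAbbbCCC", "seqY": "AAAdd-CCC"}
    print(drop_columns(toy, [(3, 6)]))
```

Working on column ranges of the finished alignment, rather than trimming each sequence separately, keeps the remaining columns aligned across all sequences.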
According to the phylogenetic tree generated by the Bayesian method (see figure 2 in appendix IV), PtCesA2, 4, and 6, as well as PtCesA5, 7, and 8 (see figure 8.3), are most likely monophyletic (i.e. they arise from a single common ancestral gene). Previous studies involving PtrCesA4 and 5 (present in figure 8.3A) suggest their involvement in primary cell wall synthesis (Kalluri and Joshi, 2003; Kalluri and Joshi, 2004). Furthermore, PttCesA2 and 4 (present in figure 8.3B) are constitutively expressed in several different tissues (Djerbi et al., 2004). It is therefore tempting to speculate that PtCesA2, 4 and 6 are active in synthesis of primary cell walls, and that PtCesA5, 7 and 8 are constitutively expressed. However, more extensive studies need to be performed in order to fully determine their function.

[Figure 8.3, panel A: PtrCesA7, PttCesA4, PtCesA4_1, PtCesA4_2, AtCesA2, AtCesA9, AtCesA6, AtCesA5, PttCesA2, PtCesA2_1, PtCesA2_2, PtrCesA6, PtCesA6_1, PtCesA6_2, OsCesA9 †, OsCesA6, ZmCesA8, ZmCesA6, ZmCesA7, OsCesA5, OsCesA3, HvCesA2 †. Panel B: PtCesA8_3, PtCesA8_1, PtrCesA4, PtCesA8_, AtCesA1, AtCesA10, ZmCesA2, ZmCesA1, HvCesA6, OsCesA1, ZmCesA3 †, ZmCesA5, OsCesA2, HvCesA3 †, ZmCesA4, OsCesA8, ZmCesA9, HvCesA1, GhCesA3, AtCesA3, PtrCesA5, PtCesA5_1, PtCesA5_2, PtCesA7_1, PtCesA7_2.]

Figure 8.3 Phylogenetic relationships of CesA genes from Populus trichocarpa (Pt), Populus tremula x tremuloides (Ptt), Populus tremuloides (Ptr), Arabidopsis thaliana (At), Gossypium hirsutum (Gh), Hordeum vulgare (Hv), Oryza sativa (Os), and Zea mays (Zm). The analysis is based on predicted amino acid sequences derived from CesA genes in fully sequenced genomes, full-length cDNA copies of putative CesA genes, and selected partial cDNAs (denoted by †). The numbers along the edges show the a posteriori probabilities of the partitions induced by the edges.

Sequence comparison alone does not shed light on whether genes might be transcriptionally silent or not. In order to elucidate whether these newly discovered CesA genes may be actively transcribed, corresponding EST sequences were retrieved from PopulusDB (Sterky et al., 2004). The CesA sequences were mapped to contigs in PopulusDB, which are representations of transcript sequences consisting of sets of overlapping clones. Alignments exceeding the applied limits of 94% identical nucleotides over 95% of the contig length were mapped. The identification of individual EST sequences in PopulusDB was not satisfactory, due to the high sequence similarities between paralogous CesA gene pairs and the location of several EST sequences in conserved regions of the gene sequences. Yet, previous gene expression studies in hybrid aspen imply that both genes in at least one pair, CesA3-1 and CesA3-2, are actively transcribed (Djerbi et al., 2004).

The full-length CesA gene sequences from hybrid aspen were further used to confirm the number of identified CesA genes in fully sequenced genomes from other plant taxa, thale cress (A. thaliana) and rice (Oryza sativa). The analysis corroborates previous analyses, showing the presence of ten CesA family members in each genome.

Concluding Remarks

The human genome project has provided large amounts of data that have truly opened the field of sequence analysis, sequence comparisons, and biocomputing. Knowledge encompassing computer programming and data analysis, as well as biotechnology, will be increasingly important for researchers of tomorrow. The PrEST selection strategies described in this thesis (publication I, III, unpublished data) provide a framework for designing PrESTs for all previously defined categories of genes (see chapter 7.4). These strategies are built around experimentally defined criteria.
Specifically directed software and sequence analysis tools developed to facilitate the move from a one-by-one sample analysis to high-throughput handling of data (publication I-III, unpublished data, related publication B) will be of great importance. Due to the wide range of unsolved problems and the uniqueness of research projects, generally applied systems will not be suitable in the majority of cases. Instead, most software solutions will need to be custom-made for specific purposes. Large-scale proteomic efforts such as the HPR (see chapter 7.1, related publication A) will provide a vast amount of data and shed further light on the functions and localization of human gene products. Different antibody-based approaches need to be synchronized in order to avoid redundant target genes and allow for comparative studies. Furthermore, as the specificity of antibodies is of great importance, proper validations need to be performed by different assays in a systematic fashion. Cross-reactivities can thus be kept at a minimum, and the quality-assured antibodies will be of great value to the proteomic research community. In contrast, small-scale, in-depth investigations of individual target genes or gene families (publication IV) will still be performed, as these studies need to be done by researchers with expert knowledge of these targets. Collaborations between small-scale and large high-throughput projects will be advantageous for both parties. For instance, in-depth knowledge from small-scale projects might direct large-scale efforts towards identification of relevant target genes. Currently, we only view the tip of the iceberg of functional genomic and proteomic research, and the use of computational methods is now intrinsic. All future research initiatives will incorporate and integrate computer-assisted methods into their experimental setup.
The level of complexity and the amount of the generated proteomic data will increase, and with that more challenging biological questions will need to be answered. As a consequence, bioinformatic approaches will be more prominent than ever before.

List of Abbreviations

2DE        Two-Dimensional gel Electrophoresis
3D         Three-Dimensional
3’         Three prime
5’         Five prime
A          Adenine
BIND       Biomolecular Interaction Network Database
BLAST      Basic Local Alignment Search Tool
BLASTP     Basic Local Alignment Search Tool Protein
BLOSUM     BLocks Of SUbstitution Matrices
C          Cytosine
cDNA       complementary DNA
CesA       cellulose synthase
DDBJ       DNA DataBank of Japan
DIP        Database of Interacting Proteins
DNA        DeoxyriboNucleic Acid
EBI        European Bioinformatics Institute
EC         Enzymatic activity Classification
EMBL       European Molecular Biology Laboratory
EST        Expressed Sequence Tag
FASTA      FAST All
G          Guanine
GO         Gene Ontology
GT         Glycosyl Transferases
HGP        Human Genome Project
HPR        Swedish Human Proteome Resource
HSSP       High Sequence Similarity Proteins
HVR        HyperVariable Region
IHC        ImmunoHistoChemistry
IMAC       Immobilized Metal ion Affinity Chromatography
kb         kilo bases
LIMS       Laboratory Information Management System
mAbs       monoclonal Antibodies
mRNA       messenger RiboNucleic Acid
MS         Mass Spectrometry
msAbs      mono-specific Antibodies
NCBI       National Center for Biotechnology Information
pAbs       polyclonal Antibodies
PAM        Point Accepted Mutation
PCR        Polymerase Chain Reaction
PDB        Protein Data Bank
Pfam       Protein families database
PIR        Protein Information Resource
PrEST      Protein Epitope Signature Tag
PSI-BLAST  Position-Specific Iterated BLAST
PTM        Post Translational Modification
PVDF       PolyVinylidene DiFluoride
RE         Repetitive Elements
RNA        RiboNucleic Acid
RT-PCR     Reverse Transcriptase Polymerase Chain Reaction
SIB        Swiss Institute of Bioinformatics
SMART      Simple Modular Architecture Research Tool
SNPs       Single Nucleotide Polymorphisms
T          Thymine
TIGR       The Institute of Genomic Research
TMA        Tissue MicroArray
TMHMM      Trans Membrane Hidden Markov Model
UniParc    UniProt archive
UniProt    Universal Protein resource
UNIPROT    UniProt knowledgebase
UniRef     UniProt NREF
UCSC       University of California Santa Cruz
VEGA       Vertebrate Genome Annotation

Acknowledgements

I am grateful to a lot of people; as you all know, this is no one-man show. Without the following people at work and in my life, I doubt the research over the last five years and this thesis would have been possible.

People at the Institution of Biotechnology, Royal Institute of Technology: Prof. Mathias Uhlen (My very, very, very busy supervisor. You have given me invaluable guidance in the field of antibody-based proteomics, as well as teaching me the importance of presenting research data in a convenient way. Furthermore, you have provided me with splendid ideas regarding our research) Dr. Fredrik Sterky (You have helped me tremendously over the last years, structuring our research and motivating me. I really appreciate that your door is always open for me. At times when I tend to lose focus in our research and present drifted-away ideas, you always get me back on track) Prof. Stefan Ståhl (My former diploma work supervisor, who guided me in the field of biosynthesis of bioactive marine natural products, currently a sock-walking hardrocking Dean of School of Biotechnology. Initially, you provided me with a PhD position, and I am truly thankful to you) Prof. Per-Åke Nygren (My current next-door neighbour, for always giving direct and specific answers regarding my research), Assoc. Prof. Sophia Hober (For directing the lab-course in a structured and fun way), Prof. Joakim Lundeberg (For valuable advice concerning my research and for lending me your room when you were on daddy-leave) Dr. Anja Persson, Dr. Jenny Ottoson, Dr. Henrik Wernerus and Dr. Cristina Al-Khalili Szigyarto (For bringing clarity to my questions) Dr.
Per Unneberg (For much advice and computational guidance during the old days, when I was a novice in the field of computers and bioinformatics) Dr. Afshin Ahmadian and Erik Petterson (For showing me the world of parallel SNP amplification. Our meetings were the most laid-back I’ve ever encountered during my PhD studies) My old room-buddies in B3:1049 (Esther Edlund-Rose, Ronny Falk, Olof Nord, Emma Lundberg, Tove Alm, and Cecilia Laurell. I still miss you, and the quiet atmosphere we had. Tove and Esther, thank you for being the party-hardy girls that you are), My intermediate room-buddies in B3:1001 (Lisa Berglund, Carl Hamsten, Johan Rockberg, Dr. Åsa Sivertsson, Dr. Cristina Al-Khalili Szigyarto, and Dr. Charlotta Agaton) The entire HPR-crew at KTH and Uppsala University (For interesting, fun and crazy happenings at the HPR-annuals at Krusenberg, and at HUPO in Munich) My Hedhammar and Max Käller (My two partners in thesis writing, thank you for discussions about references, figures and other things about writing a thesis nobody else cares about) And of course, all of the remaining hard-working associates at the divisions of Proteomics, Gene Technology and Molecular Biotechnology at the School of Biotechnology (You know who you are) Moreover, The Knut and Alice Wallenberg Foundation is gratefully acknowledged for financial support.

The following people have contributed tremendously to my thesis: Prof. Mathias Uhlen, Dr. Fredrik Sterky, Dr. Henrik Wernerus, Dr. Soraya Djerbi, and Dr. Cristina Al-Khalili Szigyarto (For critical reading and fruitful discussions) Håkan Gill (My redwine-loving, former next-door-neighbour. Thank you for bringing the third dimension to my thesis by providing me with awesome 3D pictures. You are truly a master at 3D!) Assoc. Prof.
Harry Brumer III (For corecting maj innglissh, and teaching me that N-terminals has nothing to do with biotechnology) Cecilia Heyman at the library (For answering my many, many e-mails and providing me with publications that I couldn’t get hold of myself) Furthermore, I would like to thank the following for providing me with material and insights: Johan Rockberg (the unpublished data section), Prof. Per-Åke Nygren (discussions about affinity reagents), Dr. Cristina Divne (protein structure pictures), Prof. Fredrik Pontén (TMA picture and the annotation of the staining), Dr. Peter Nilsson (PrEST array picture), Dr. Anja Persson (raw material for the PrEST statistics table), Marcus Johansson (For your helpful advice with InDesign), Dr. Afsin Ahmadian (discussions about the SNP section) and Dr. Margareta Ramström (MS spectra). My Friends: R&B: (It’s been, and still is, a BLAST to hang around with the crazy crew from R&B, and every event has had it’s fair share of relaxation, fun and crazyness. I was going to write extremeness, but maybe we’ll add that to the list after some years. Just to remained you: Stenskär, Gotland, Dalarna, Jönköping, Västerås, Florida and Sandhamn) R&B is: André (The wine-knower who likes to shake his stuff to latino rhythms and sing Ronja Rumpdaughter songs), Ante (The old shrimp-farmer, who has a Matrix-inspired running-form, and burned his upper-lip once when drinking), Feffe (Knows his charter-destinations and luxury cars, he dances with one finger up i the air). Jake (He drives nothing but Alfa Romeo and his tough body withstand even several rounds of ammunition), Martin (Vice president in R&B. He is a newlywed marathån runner, who has actually no taste in music), Niclas (The designer/bonus-dad/carpenter, who like to hang out at Knife-south and likes to dance riverdance), Niklas (A.K.A. Mr. 
Handsome and Becks, has the sickest golfswing in history, and the wackiest kangaroo-dance, likes on-line poker and brand clothes), Ola (The man with the broken arm, the proud owner of a pizza-racer, which he once forgot where it was), Patric (co-founder of R&B, A.K.A ‘The guy in cruel condition’, drives a Mercedes, lives the high-life and hangs out close to svampen), and Sparre (A.K.A. Goose and Spirre, who is still in the military service, that is, a TopGun instructor) I’m looking forward to new events, R&B forever! Daniel, zlow-roc Johansen (The only time I’ve seen him pissed off is during gym class when he wasn’t allowed to drink soda. Thanks for awesome Miami-trips, boattrips, parties, T2-review and dinners) Magnus von Bahr (The second sickest party dancer, thanks for fishing trips, dinners at Rörstrandsgatan and crazy party trips) Tomas Eklund (A.K.A Tjomme, Lance Armstrong, thank you for the opportunity of having my party at your place. I’ll do my best to raise the roof!) My Family: The Djerbis, Taieb & Enikö, Samir & Emili (For for nice dinners and gettogethers) Mounira & Martin (For weekends at Stenskär, pleasant dinners and for X-box/ Singstar nights) Jonas (My oldest brother, thank you for looking after me all these years, even if I didn’t like that when I was younger), Maria (For moral support), Nora & Hanna (For being who you are) Fredrik (My older brother, thank you for your support the last years and for watching Adam when I really needed to work), Eva (For nice get-togethers at Stenskär and Åre, Good luck with your research), Hugo & Filippa (For bringing joy and fun) Mom Eva and Dad Rolf (My two heroes! As much support I’ve had from you all my life is more than a son can ask for. Thank you so much for taking care of Adam when I wrote my thesis. Can you believe, even though I wasn’t the most promising student in school, I still managed to accomplish this. I owe a lot to you) Adam (A.K.A. Mats Jr., Knas-fnas, Klanton, Fjanton. 
When you say: ‘I love you on all planets, even on the sun’ it makes me warm inside. I hope I haven’t neglected you too much when writing my ‘blue book’.)

Sosso (My ideal-princess and soulmate. I sure hope our time together has only just started and that we will have many years together in the future. Thank you for all your help with Adam, you are surely the best bonus-mom anyone can ask for in the world. I love you)

Kristina Tunkrans (the attorney and DJ-extravaganza. You have supported me tremendously over the last years!)

Popular Science Summary (Populärvetenskaplig sammanfattning)

In recent years, biotechnology research has made great progress because the human genetic material (the genome) is now available as a sequence. It was mapped by a large international collaboration, and in 2003 a public, high-quality map of the human genome was released. That map included approximately 25,000 genes, which contain information about all of the hereditary traits of humans. These genes in turn code for active components, proteins, which are, among other things, the body’s building blocks and enzymes. More than 90% of today’s drugs target proteins. Because of this importance of proteins, a number of proteomics studies have been initiated. Proteomics research aims to systematically identify and characterize all proteins. The research is carried out in several ways, for example by using antibodies that bind to proteins, which can thereby be detected in, for example, tissues.

In 2003, a collaborative project was started between the Royal Institute of Technology (Kungliga Tekniska Högskolan) and Uppsala University. In this project, called HPR (the Swedish Human Proteome Resource; see Figure 7.1), the human proteins and their expression are systematically mapped using tissue microarrays (see Figure 3.3). These arrays consist of samples from 48 different types of tissue and 20 types of cancer. Proteins can thus be studied with respect to where they are expressed.

In HPR, antibodies are generated against fragments of proteins, so-called PrESTs (Protein Epitope Signature Tags; see Figure 7.3). These PrESTs serve as signatures, or representatives, of human proteins. The protein fragments are carefully selected based on criteria that facilitate the subsequent laboratory steps. Among other things, the protein fragments should have as low sequence similarity as possible to other human proteins, that is, be as unique as possible to the protein they represent. The selection of protein fragments is made by a computer program, Bishop (paper I). In addition, a study evaluating this selection strategy has been carried out (paper II). That study shows that it is possible to design PrESTs for the great majority of the proteins investigated. However, special consideration is needed for two categories of proteins: proteins with high sequence similarity to other proteins, and multiple splice variants. Multiple splice variants are several different proteins that originate from one and the same gene. Two different strategies have therefore been developed to handle these proteins (paper III and unpublished data).

In addition to the human genome, the genetic material of several other organisms has been mapped, among them poplar, which is a model organism for trees. What distinguishes trees from other plants is their ability to form wood. Cellulose is one of the main components of wood and also an important raw material in, for example, the pulp and textile industries. In a bioinformatic (see below) study of black cottonwood (Populus trichocarpa; paper IV), previously unknown cellulose synthase genes were identified, and their number was estimated at 18. Previous studies in plants have indicated approximately ten cellulose synthase genes.

The work in this doctoral thesis has focused on studies based on publicly available genome sequences, which have then been processed and analysed using computational methods. This methodology is called bioinformatics and draws on principles from mathematics, statistics, computer science and biology. The work described in this thesis comprises the development of analysis software and database implementations (papers I and III, unpublished data) as well as the processing of DNA and protein sequences. The sequences analysed originate from human (papers I-III, unpublished data) and black cottonwood (paper IV). The analyses have included pairwise alignments (papers I and III, unpublished data) as well as multiple alignments and phylogenetic studies (paper IV).

[Figure panels: DNA, Protein, DNA and protein sequences, Database, Analysis software, Pairwise alignments, Multiple alignments, Phylogeny]

Figure. A summary of the bioinformatic methods covered in this doctoral thesis. Publicly available DNA and protein sequences have formed the basis for the development of bioinformatic analysis software integrated with databases. In addition, sequence comparisons have been performed, including pairwise and multiple alignments as well as phylogenetic studies.
Science 287 (2000) 2185-95 Agaton, C., Galli, J., Hoiden Guthenberg, I., Janzon, L., Hansson, M., Asplund, A., Brundell, E., Lindberg, S., Ruthberg, I., Wester, K., Wurtz, D., Hoog, C., Lundeberg, J., Stahl, S., Ponten, F. and Uhlen, M.: Affinity proteomics for systematic protein profiling of chromosome 21 gene products in human tissues. Mol Cell Proteomics 2 (2003) 405-14. Ahmadian, A. and Lundeberg, J.: A brief history of genetic variation analysis. Biotechniques 32 (2002) 1122-4, 1126, 1128 passim. Ahram, M., Flaig, M.J., Gillespie, J.W., Duray, P.H., Linehan, W.M., Ornstein, D.K., Niu, S., Zhao, Y., Petricoin, E.F., 3rd and Emmert-Buck, M.R.: Evaluation of ethanol-fixed, paraffin-embedded tissues for proteomic applications. Proteomics 3 (2003) 413-21. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J.: Basic local alignment search tool. J Mol Biol 215 (1990) 403-10. Altschul, S.F. and Lipman, D.J.: Protein database searches for multiple alignments. Proc Natl Acad Sci U S A 87 (1990) 5509-13. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25 (1997) 3389-402. References References 68 References Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Hermjakob, H., Hulo, N., Jonassen, I., Kahn, D., Kanapin, A., Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N.J., Oinn, T.M., Pagni, M., Servant, F., Sigrist, C.J. and Zdobnov, E.M.: The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res 29 (2001) 3740. Apweiler, R., Bairoch, A. and Wu, C.H.: Protein sequence databases. Curr Opin Chem Biol 8 (2004a) 76-80. 
Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D.A., O’Donovan, C., Redaschi, N. and Yeh, L.S.: UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 32 (2004b) D115-9. The Arabidopsis genome initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408 (2000) 796-815. Ashurst, J.L., Chen, C.K., Gilbert, J.G., Jekosch, K., Keenan, S., Meidl, P., Searle, S.M., Stalker, J., Storey, R., Trevanion, S., Wilming, L. and Hubbard, T.: The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res 33 (2005) D459-65. Attwood, T.K., Bradley, P., Flower, D.R., Gaulton, A., Maudling, N., Mitchell, A.L., Moulton, G., Nordle, A., Paine, K., Taylor, P., Uddin, A. and Zygouri, C.: PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res 31 (2003) 400-2. Bader, G.D. and Hogue, C.W.: BIND--a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics 16 (2000) 465-77. Bairoch, A. and Apweiler, R.: The SWISSPROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 28 (2000) 45-8. Bairoch, A., Apweiler, R., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D.A., O’Donovan, C., Redaschi, N. and Yeh, L.S.: The Universal Protein Resource (UniProt). Nucleic Acids Res 33 (2005) D154-9. Baneyx, F.: Recombinant protein expression in Escherichia coli. Curr Opin Biotechnol 10 (1999) 411-21. Barlow, D.J., Edwards, M.S. and Thornton, J.M.: Continuous and discontinuous protein antigenic determinants. Nature 322 (1986) 747-8. Barry, R. and Soloviev, M.: Quantitative protein profiling using antibody arrays. Proteomics 4 (2004) 3717-26. Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M. 
and Sonnhammer, E.L.: The Pfam protein families database. Nucleic Acids Res 30 (2002) 276-80. Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., Studholme, D.J., Yeats, C. and Eddy, S.R.: The Pfam protein families database. Nucleic Acids Res 32 (2004) D138-41. Battifora, H.: The multitumor (sausage) tissue block: novel method for immunohistochemical antibody testing. Lab Invest 55 (1986) 244-8. Bauer, A. and Kuster, B.: Affinity purification-mass spectrometry. Powerful tools for the characterization of protein complexes. Eur J Biochem 270 (2003) 570-8. Benton, D.: Bioinformatics--principles and potential of a new multidisciplinary tool. Trends Biotechnol 14 (1996) 261-72. Beranova-Giorgianni: Proteome analysis by two-dimensional gel electrophoresis and mass spectrometry: strengths and limitations. Trends Anal Chem 22 (2003) 273-281. Blueggel, M., Chamrad, D. and Meyer, H.E.: Bioinformatics in proteomics. Curr Pharm Biotechnol 5 (2004) 79-88. Blundell, T.L. and Mizuguchi, K.: Structural genomics: an overview. Prog Biophys Mol Biol 73 (2000) 289-95. Blythe, M.J. and Flower, D.R.: Benchmarking B cell epitope prediction: underperformance of existing methods. Protein Sci 14 (2005) 246-8. Bernstein, F.C., Koetzle, T.F., Williams, G.J., Meyer, E.F., Jr., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T. and Tasumi, M.: The Protein Data Bank. A computer-based archival file for macromolecular structures. Eur J Biochem 80 (1977) 319-24. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O’Donovan, C., Phan, I., Pilbout, S. and Schneider, M.: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 31 (2003) 365-70. Better, M., Chang, C.P., Robinson, R.R. and Horwitz, A.H.: Escherichia coli secretion of an active chimeric antibody fragment. Science 240 (1988) 1041-3. 
Borrebaeck, C.A.: Antibodies in diagnostics - from immunoassays to protein chips. Immunol Today 21 (2000) 37982. Birney, E., Andrews, T.D., Bevan, P., Caccamo, M., Chen, Y., Clarke, L., Coates, G., Cuff, J., Curwen, V., Cutts, T., Down, T., Eyras, E., FernandezSuarez, X.M., Gane, P., Gibbins, B., Gilbert, J., Hammond, M., Hotz, H.R., Iyer, V., Jekosch, K., Kahari, A., Kasprzyk, A., Keefe, D., Keenan, S., Lehvaslaiho, H., McVicker, G., Melsopp, C., Meidl, P., Mongin, E., Pettett, R., Potter, S., Proctor, G., Rae, M., Searle, S., Slater, G., Smedley, D., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Storey, R., Ureta-Vidal, A., Woodwark, K.C., Cameron, G., Durbin, R., Cox, A., Hubbard, T. and Clamp, M.: An overview of Ensembl. Genome Res 14 (2004) 925-8. Bradbury, A.R. and Marks, J.D.: Antibodies from phage antibody libraries. J Immunol Methods 290 (2004) 29-49. Blakesley, R.W.: Cycle sequencing. Methods Mol Biol 23 (1993) 209-17. Blom, N., Gammeltoft, S. and Brunak, S.: Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol 294 (1999) 1351-62. Bradbury, A.R., Velappan, N., Verzillo, V., Ovecka, M., Marzari, R., Sblattero, D., Chasteen, L., Siegel, R. and Pavlik, P.: Antibodies in proteomics. Methods Mol Biol 248 (2004) 519-46. Brunner, A.M., Busov, V.B. and Strauss, S.H.: Poplar genome sequence: functional genomics in an ecologically dominant plant species. Trends Plant Sci 9 (2004) 4956. The C. elegans consortium : Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282 (1998) 2012-8. Campbell, J.A., Davies, G.J., Bulone, V. and Henrissat, B.: A classification of nucleotide-diphospho-sugar glycosyltransferases based on amino acid sequence similarities. Biochem J 326 ( Pt 3) (1997) 929-39. 69 References Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J. and Wheeler, D.L.: GenBank. Nucleic Acids Res 33 (2005) D34-8. 70 References Chen, Y. 
and Xu, D.: Computational analyses of high-throughput protein-protein interaction data. Curr Protein Pept Sci 4 (2003) 159-81. Doblin, M.S., Kurek, I., Jacob-Wilk, D. and Delmer, D.P.: Cellulose biosynthesis in plants: from genes to rosettes. Plant Cell Physiol 43 (2002) 1407-20. Cheung, J., Estivill, X., Khaja, R., MacDonald, J.R., Lau, K., Tsui, L.C. and Scherer, S.W.: Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence. Genome Biol 4 (2003) R25. Doolittle, R.F., Feng, D.F., Johnson, M.S. and McClure, M.A.: Relationships of human protein sequences to those of other organisms. Cold Spring Harb Symp Quant Biol 51 Pt 1 (1986) 447-55. The chimpanzee sequencing consortium: Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437 (2005) 69-87. Coutinho, P.M., Deleury, E., Davies, G.J. and Henrissat, B.: An evolving hierarchical family classification for glycosyltransferases. J Mol Biol 328 (2003) 307-17. Cregg, J.M., Cereghino, J.L., Shi, J. and Higgins, D.R.: Recombinant protein expression in Pichia pastoris. Mol Biotechnol 16 (2000) 23-52. Crick, F.: The Biological Replication of Macromolecules. Symp. Soc. Exp. Biol XII, 138 (1958). Dayhoff, M., Schwartz, R.M. and Orcutt, B.C.: A model of evolutionary change in proteins., In Atlas of Protein Sequence and Structure. National Biomedical Reseach Foundation, Silver Spring, MD, 1978, pp. 345-352. de Hoog, C.L. and Mann, M.: Proteomics. Annu Rev Genomics Hum Genet 5 (2004) 267-93. Djerbi, S., Aspeborg, H., Nilsson, P., Sundberg, B., Mellerowicz, E., Blomqvist, K. and Teeri, T.T.: Identification and expression analysis of genes encoding putative cellulose synthases (CesA) in the hybrid aspen, Populus tremula (L.) x P. tremuloides (Michx.). Cellulose 11 (2004) 301-312. 
Dunham, I., Shimizu, N., Roe, B.A., Chissoe, S., Hunt, A.R., Collins, J.E., Bruskiewich, R., Beare, D.M., Clamp, M., Smink, L.J., Ainscough, R., Almeida, J.P., Babbage, A., Bagguley, C., Bailey, J., Barlow, K., Bates, K.N., Beasley, O., Bird, C.P., Blakey, S., Bridgeman, A.M., Buck, D., Burgess, J., Burrill, W.D., O’Brien, K.P. and et al.: The DNA sequence of human chromosome 22. Nature 402 (1999) 489-95. Dunn, M.J.: Detection of total proteins on western blots of 2-D polyacrylamide gels. Methods Mol Biol 112 (1999) 319-29. Ebersberger, I., Metzler, D., Schwarz, C. and Paabo, S.: Genomewide comparison of DNA sequences between humans and chimpanzees. Am J Hum Genet 70 (2002) 1490-7. El-Sayed, N.M., Myler, P.J., Bartholomeu, D.C., Nilsson, D., Aggarwal, G., Tran, A.N., Ghedin, E., Worthey, E.A., Delcher, A.L., Blandin, G., Westenberger, S.J., Caler, E., Cerqueira, G.C., Branche, C., Haas, B., Anupama, A., Arner, E., Aslund, L., Attipoe, P., Bontempi, E., Bringaud, F., Burton, P., Cadag, E., Campbell, D.A., Carrington, M., Crabtree, J., Darban, H., da Silveira, J.F., de Jong, P., Edwards, K., Englund, P.T., Fazelina, G., Feldblyum, T., Ferella, M., Frasch, A.C., Gull, K., Horn, D., Hou, L., Huang, Y., Kindlund, E., Klingbeil, M., Kluge, S., Koo, H., Lacerda, D., Levin, M.J., Lorenzi, H., Louie, T., Machado, C.R., McCulloch, R., McKenna, A., Mizuno, Y., Mottram, J.C., Nelson, S., Ochaya, S., Osoegawa, K., Pai, G., Parsons, M., Pentony, M., Pettersson, U., Pop, M., Emanuelsson, O., Nielsen, H., Brunak, S. and von Heijne, G.: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 300 (2000) 1005-16. Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A., Merrick, J.M. and et al.: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269 (1995) 496-512. Force, A., Lynch, M., Pickett, F.B., Amores, A., Yan, Y.L. 
and Postlethwait, J.: Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151 (1999) 1531-45. Galperin, M.Y.: The Molecular Biology Database Collection: 2005 update. Nucleic Acids Res 33 (2005) D5-24. Emanuelsson, O., von Heijne, G. and Schneider, G.: Analysis and prediction of mitochondrial targeting peptides. Methods Cell Biol 65 (2001) 175-87. Garavelli, J.S.: The RESID Database of Protein Modifications as a resource and annotation tool. Proteomics 4 (2004) 1527-33. Fan, J.B., Oliphant, A., Shen, R., Kermani, B.G., Garcia, F., Gunderson, K.L., Hansen, M., Steemers, F., Butler, S.L., Deloukas, P., Galver, L., Hunt, S., McBride, C., Bibikova, M., Rubano, T., Chen, J., Wickham, E., Doucet, D., Chang, W., Campbell, D., Zhang, B., Kruglyak, S., Bentley, D., Haas, J., Rigault, P., Zhou, L., Stuelpnagel, J. and Chee, M.S.: Highly parallel SNP genotyping. Cold Spring Harb Symp Quant Biol 68 (2003) 69-78. Garcia-Blanco, M.A., Baraniak, A.P. and Lasda, E.L.: Alternative splicing in disease and therapy. Nat Biotechnol 22 (2004) 535-46. Garinot-Schneider, C., Lellouch, A.C. and Geremia, R.A.: Identification of essential amino acid residues in the Sinorhizobium meliloti glucosyltransferase ExoM. J Biol Chem 275 (2000) 3140713. Fields, S. and Song, O.: A novel genetic system to detect protein-protein interactions. Nature 340 (1989) 245-6. Fitch, W.M.: Distinguishing homologous from analogous proteins. Syst Zool 19 (1970) 99-113. Fitch, W.M.: Homology a personal view on some of the problems. Trends Genet 16 (2000) 227-31. Gasteiger, E., Jung, E. and Bairoch, A.: SWISS-PROT: connecting biomolecular knowledge via a protein database. Curr Issues Mol Biol 3 (2001) 47-55. Gellissen, G., Melber, K., Janowicz, Z.A., Dahlems, U.M., Weydemann, U., Piontek, M., Strasser, A.W. and Hollenberg, C.P.: Heterologous protein production in yeast. Antonie Van Leeuwenhoek 62 (1992) 79-93. 
Gibbs, R.A., Weinstock, G.M., Metzker, M.L., Muzny, D.M., Sodergren, E.J., Scherer, S., Scott, G., Steffen, D., Worley, K.C., Burch, P.E., Okwuonu, G., Hines, S., Lewis, L., DeRamo, C., Delgado, O., Dugan-Rocha, S., Miner, G., Morgan, M., Hawes, A., Gill, R., Celera, Holt, R.A., Adams, M.D., 71 References Ramirez, J.L., Rinta, J., Robertson, L., Salzberg, S.L., Sanchez, D.O., Seyler, A., Sharma, R., Shetty, J., Simpson, A.J., Sisk, E., Tammi, M.T., Tarleton, R., Teixeira, S., Van Aken, S., Vogt, C., Ward, P.N., Wickstead, B., Wortman, J., White, O., Fraser, C.M., Stuart, K.D. and Andersson, B.: The genome sequence of Trypanosoma cruzi, etiologic agent of Chagas disease. Science 309 (2005) 40915. 72 References Amanatides, P.G., Baden-Tillson, H., Barnstead, M., Chin, S., Evans, C.A., Ferriera, S., Fosler, C., Glodek, A., Gu, Z., Jennings, D., Kraft, C.L., Nguyen, T., Pfannkoch, C.M., Sitter, C., Sutton, G.G., Venter, J.C., Woodage, T., Smith, D., Lee, H.M., Gustafson, E., Cahill, P., Kana, A., Doucette-Stamm, L., Weinstock, K., Fechtel, K., Weiss, R.B., Dunn, D.M., Green, E.D., Blakesley, R.W., Bouffard, G.G., De Jong, P.J., Osoegawa, K., Zhu, B., Marra, M., Schein, J., Bosdet, I., Fjell, C., Jones, S., Krzywinski, M., Mathewson, C., Siddiqui, A., Wye, N., McPherson, J., Zhao, S., Fraser, C.M., Shetty, J., Shatsman, S., Geer, K., Chen, Y., Abramzon, S., Nierman, W.C., Havlak, P.H., Chen, R., Durbin, K.J., Egan, A., Ren, Y., Song, X.Z., Li, B., Liu, Y., Qin, X., Cawley, S., Cooney, A.J., D’Souza, L.M., Martin, K., Wu, J.Q., GonzalezGaray, M.L., Jackson, A.R., Kalafus, K.J., McLeod, M.P., Milosavljevic, A., Virk, D., Volkov, A., Wheeler, D.A., Zhang, Z., Bailey, J.A., Eichler, E.E., Tuzun, E., et al.: Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428 (2004) 493-521. G.O. Consortium.: Creating the gene ontology resource: design and implementation, Genome Res, 2001, pp. 1425-33. 
Goffeau, A., Barrell, B.G., Bussey, H., Davis, R.W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J.D., Jacq, C., Johnston, M., Louis, E.J., Mewes, H.W., Murakami, Y., Philippsen, P., Tettelin, H. and Oliver, S.G.: Life with 6000 genes. Science 274 (1996) 546, 563-7. Graslund, S., Larsson, M., Falk, R., Uhlen, M., Hoog, C. and Stahl, S.: Single-vector three-frame expression systems for affinity-tagged proteins. FEMS Microbiol Lett 215 (2002) 139-47. Graveley, B.R.: Alternative splicing: increasing diversity in the proteomic world. Trends Genet 17 (2001) 100-7. Gray, G.S. and Fitch, W.M.: Evolution of antibiotic resistance genes: the DNA sequence of a kanamycin resistance gene from Staphylococcus aureus. Mol Biol Evol 1 (1983) 57-66. Gu, Z., Cavalcanti, A., Chen, F.C., Bouman, P. and Li, W.H.: Extent of gene duplication in the genomes of Drosophila, nematode, and yeast. Mol Biol Evol 19 (2002) 256-62. Haab, B.B.: Advances in protein microarray technology for protein expression and interaction profiling. Curr Opin Drug Discov Devel 4 (2001) 116-23. Haab, B.B., Dunham, M.J. and Brown, P.O.: Protein microarrays for highly parallel detection and quantitation of specific proteins and antibodies in complex solutions. Genome Biol 2 (2001) RESEARCH0004. Hansen, J.E., Lund, O., Tolstrup, N., Gooley, A.A., Williams, K.L. and Brunak, S.: NetOglyc: prediction of mucin type O-glycosylation sites based on sequence context and surface accessibility. Glycoconj J 15 (1998) 115-30. Hardenbol, P., Baner, J., Jain, M., Nilsson, M., Namsaraev, E.A., KarlinNeumann, G.A., Fakhrai-Rad, H., Ronaghi, M., Willis, T.D., Landegren, U. and Davis, R.W.: Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat Biotechnol 21 (2003) 673-8. Hartley, J.L., Temple, G.F. and Brasch, M.A.: DNA cloning using in vitro sitespecific recombination. Genome Res 10 (2000) 1788-95. Henikoff, S. and Henikoff, J.G.: Amino acid substitution matrices from protein blocks. 
Proc Natl Acad Sci U S A 89 (1992) 10915-9. Henikoff, S. and Henikoff, J.G.: Performance evaluation of amino acid substitution matrices. Proteins 17 (1993) 49-61. Hubbard, T., Andrews, D., Caccamo, M., Cameron, G., Chen, Y., Clamp, M., Clarke, L., Coates, G., Cox, T., Cunningham, F., Curwen, V., Cutts, T., Down, T., Durbin, R., FernandezSuarez, X.M., Gilbert, J., Hammond, M., Herrero, J., Hotz, H., Howe, K., Iyer, V., Jekosch, K., Kahari, A., Kasprzyk, A., Keefe, D., Keenan, S., Kokocinsci, F., London, D., Longden, I., McVicker, G., Melsopp, C., Meidl, P., Potter, S., Proctor, G., Rae, M., Rios, D., Schuster, M., Searle, S., Severin, J., Slater, G., Smedley, D., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Storey, R., Trevanion, S., Ureta-Vidal, A., Vogel, J., White, S., Woodwark, C. and Birney, E.: Ensembl 2005. Nucleic Acids Res 33 (2005) D447-53. Hughes, A.L. and Nei, M.: Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature 335 (1988) 167-70. Ito, T., Tashiro, K., Muta, S., Ozawa, R., Chiba, T., Nishizawa, M., Yamamoto, K., Kuhara, S. and Sakaki, Y.: Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc Natl Acad Sci U S A 97 (2000) 1143-7. Jensen, R.A.: Orthologs an paralogs -we need to get it right. Genome Biol 2 (2001) 1002.1-1002.3. Jones, D.T.: Progress in protein structure prediction. Curr Opin Struct Biol 7 (1997) 377-87. Jones, D.T.: GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 287 (1999) 797-815. Jungblut, P., Thiede, B., Zimny-Arndt, U., Muller, E.C., Scheler, C., WittmannLiebold, B. and Otto, A.: Resolution power of two-dimensional electrophoresis and identification of proteins from gels. Electrophoresis 17 (1996) 839-47. Kalluri, U.C. 
and Joshi, C.P.: Isolation and characterization of a new, full-length cellulose synthase cDNA, PtrCesA5 from developing xylem of aspen trees. J Exp Bot 54 (2003) 2187-8. Kalluri, U.C. and Joshi, C.P.: Differential expression patterns of two cellulose synthase genes are associated with primary and secondary cell wall development in aspen trees. Planta 220 (2004) 47-55. Kampf, C., Andersson, A.-C., Wester, K., Björling, E., Uhlen, M. and Ponten, F.: Antibody-Based Tissue Profiling As a Tool for Clinical Proteomics. Clinical Proteomics 1 (2004) 285-300. Kanz, C., Aldebert, P., Althorpe, N., Baker, W., Baldwin, A., Bates, K., Browne, P., van den Broek, A., Castro, M., Cochrane, G., Duggan, K., Eberhardt, R., Faruque, N., Gamble, J., Diez, F.G., Harte, N., Kulikova, T., Lin, Q., Lombard, V., Lopez, R., Mancuso, R., McHale, M., Nardone, F., Silventoinen, V., Sobhany, S., Stoehr, P., Tuli, M.A., Tzouvara, K., Vaughan, R., Wu, D., Zhu, W. and Apweiler, R.: The EMBL Nucleotide Sequence Database. Nucleic Acids Res 33 (2005) D29-33. Karolchik, D., Hinrichs, A.S., Furey, T.S., Roskin, K.M., Sugnet, C.W., Haussler, D. and Kent, W.J.: The UCSC Table Browser data retrieval tool. Nucleic Acids Res 32 (2004) D493-6. Holm, L.: Unification of protein families. Curr Opin Struct Biol 8 (1998) 372-9. Hopp, T.P. and Woods, K.R.: Prediction of protein antigenic determinants from amino acid sequences. Proc Natl Acad Sci U S A 78 (1981) 3824-8. Kasprzyk, A., Keefe, D., Smedley, D., London, D., Spooner, W., Melsopp, C., Hammond, M., Rocca-Serra, P., Cox, T. and Birney, E.: EnsMart: a generic system for fast and flexible access to biological data. Genome Res 14 (2004) 160-9. Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W. and Haussler, D.: Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A 100 (2003) 11484-9. 
Kirkness, E.F., Bafna, V., Halpern, A.L., Levy, S., Remington, K., Rusch, D.B., Delcher, A.L., Pop, M., Wang, W., Fraser, C.M. and Venter, J.C.: The dog genome: survey sequencing and comparative analysis. Science 301 (2003) 1898-903. Kohler, G. and Milstein, C.: Continuous cultures of fused cells secreting antibody of predefined specificity. Nature 256 (1975) 495-7. Kondrashov, F.A., Rogozin, I.B., Wolf, Y.I. and Koonin, E.V.: Selection in the evolution of gene duplications. Genome Biol 3 (2002) RESEARCH0008. Kononen, J., Bubendorf, L., Kallioniemi, A., Barlund, M., Schraml, P., Leighton, S., Torhorst, J., Mihatsch, M.J., Sauter, G. and Kallioniemi, O.P.: Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med 4 (1998) 844-7. Koonin, E.V.: An apology for orthologs - or brave new memes. Genome Biol 2 (2001) COMMENT1005. Krogh, A., Larsson, B., von Heijne, G. and Sonnhammer, E.L.: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305 (2001) 567-80. Kumar, A., Agarwal, S., Heyman, J.A., Matson, S., Heidtman, M., Piccirillo, S., Umansky, L., Drawid, A., Jansen, R., Liu, Y., Cheung, K.H., Miller, P., Gerstein, M., Roeder, G.S. and Snyder, M.: Subcellular localization of the yeast proteome. Genes Dev 16 (2002) 707-19. 
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., Funke, R., Gage, D., Harris, K., Heaford, A., Howland, J., Kann, L., Lehoczky, J., LeVine, R., McEwan, P., McKernan, K., Meldrim, J., Mesirov, J.P., Miranda, C., Morris, W., Naylor, J., Raymond, C., Rosetti, M., Santos, R., Sheridan, A., Sougnez, C., Stange-Thomann, N., Stojanovic, N., Subramanian, A., Wyman, D., Rogers, J., Sulston, J., Ainscough, R., Beck, S., Bentley, D., Burton, J., Clee, C., Carter, N., Coulson, A., Deadman, R., Deloukas, P., Dunham, A., Dunham, I., Durbin, R., French, L., Grafham, D., Gregory, S., Hubbard, T., Humphray, S., Hunt, A., Jones, M., Lloyd, C., McMurray, A., Matthews, L., Mercer, S., Milne, S., Mullikin, J.C., Mungall, A., Plumb, R., Ross, M., Shownkeen, R., Sims, S., Waterston, R.H., Wilson, R.K., Hillier, L.W., McPherson, J.D., Marra, M.A., Mardis, E.R., Fulton, L.A., Chinwalla, A.T., Pepin, K.H., Gish, W.R., Chissoe, S.L., Wendl, M.C., Delehaunty, K.D., Miner, T.L., Delehaunty, A., Kramer, J.B., Cook, L.L., Fulton, R.S., Johnson, D.L., Minx, P.J., Clifton, S.W., Hawkins, T., Branscomb, E., Predki, P., Richardson, P., Wenning, S., Slezak, T., Doggett, N., Cheng, J.F., Olsen, A., Lucas, S., Elkin, C., Uberbacher, E., Frazier, M., et al.: Initial sequencing and analysis of the human genome. Nature 409 (2001) 860-921. Lawrence, J.G. and Ochman, H.: Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci U S A 95 (1998) 9413-7. Letunic, I., Copley, R.R., Schmidt, S., Ciccarelli, F.D., Doerks, T., Schultz, J., Ponting, C.P. and Bork, P.: SMART 4.0: towards genomic data integration. Nucleic Acids Res 32 (2004) D142-4. Lipman, D.J. and Pearson, W.R.: Rapid and sensitive protein similarity searches. Science 227 (1985) 1435-41. Lipovsek, D. and Pluckthun, A.: In-vitro protein evolution by ribosome display and mRNA display. J Immunol Methods 290 (2004) 51-67. Little, P.: The book of genes. 
Nature 402 (1999) 467-8. Liu, B. and Marks, J.D.: Applying phage antibodies to proteomics: selecting single chain Fv antibodies to antigens blotted on nitrocellulose. Anal Biochem 286 (2000) 119-28. Luscombe, N.M., Greenbaum, D. and Gerstein, M.: What is bioinformatics? A proposed definition and overview of the field. Methods Inf Med 40 (2001) 346-58. Lynch, M. and Conery, J.S.: The evolutionary fate and consequences of duplicate genes. Science 290 (2000) 1151-5. Lynch, M. and Conery, J.S.: The evolutionary demography of duplicate genes. J Struct Funct Genomics 3 (2003) 35-44. MacBeath, G. and Schreiber, S.L.: Printing proteins as microarrays for high-throughput function determination. Science 289 (2000) 1760-3. Makalowski, W.: Genomic scrap yard: how genomes utilize all that junk. Gene 259 (2000) 61-7. Mann, M. and Jensen, O.N.: Proteomic analysis of post-translational modifications. Nat Biotechnol 21 (2003) 255-61. Liu, Q.: The univector plasmid-fusion system, a method for rapid construction of recombinant DNA without restriction enzymes. Curr. Biol. 8 (1999) 1300-1309. Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O. and Eisenberg, D.: Detecting protein function and protein-protein interactions from genome sequences. Science 285 (1999) 751-3. Lopez, M.F. and Pluskal, M.G.: Protein micro- and macroarrays: digitizing the proteome. J Chromatogr B Analyt Technol Biomed Life Sci 787 (2003) 19-27. Marcotte, E.M., Xenarios, I. and Eisenberg, D.: Mining literature for protein-protein interactions. Bioinformatics 17 (2001) 359-63. Lu, Z., DiBlasio-Smith, E.A., Grant, K.L., Warne, N.W., LaVallie, E.R., Collins-Racie, L.A., Follettie, M.T., Williamson, M.J. and McCoy, J.M.: Histidine patch thioredoxins. Mutant forms of thioredoxin with metal chelating affinity that provide for convenient purifications of thioredoxin fusion proteins. J Biol Chem 271 (1996) 5059-65. Mattaj, I.W. and Englmeier, L.: Nucleocytoplasmic transport: the soluble phase. 
Annu Rev Biochem 67 (1998) 265-306. Maxam, A.M. and Gilbert, W.: A new method for sequencing DNA. Proc Natl Acad Sci U S A 74 (1977) 560-4. Mellor, J.C., Yanai, I., Clodfelter, K.H., Mintseris, J. and DeLisi, C.: Predictome: a database of putative functional links between proteins. Nucleic Acids Res 30 (2002) 306-9. Lesley, S.A.: High-throughput proteomics: protein expression and purification in the postgenomic world. Protein Expr Purif 22 (2001) 159-64. Meloen, R.H., Puijk, W.C., Langeveld, J.P., Langedijk, J.P. and Timmerman, P.: Design of synthetic peptides for diagnostics. Curr Protein Pept Sci 4 (2003) 253-60. Mullis, K.B. and Faloona, F.A.: Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction. Methods Enzymol 155 (1987) 335-50. Mighell, A.J., Smith, N.R., Robinson, P.A. and Markham, A.F.: Vertebrate pseudogenes. FEBS Lett 468 (2000) 109-14. Needleman, S.B. and Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48 (1970) 443-53. Nielsen, H., Brunak, S. and von Heijne, G.: Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng 12 (1999) 3-9. Mironov, A.A., Fickett, J.W. and Gelfand, M.S.: Frequent alternative splicing of human genes. Genome Res 9 (1999) 1288-93. Miroux, B. and Walker, J.E.: Overproduction of proteins in Escherichia coli: mutant hosts that allow synthesis of some membrane proteins and globular proteins at high levels. J Mol Biol 260 (1996) 289-98. Montelione, G.T. and Anderson, S.: Structural genomics: keystone for a Human Proteome Project. Nat Struct Biol 6 (1999) 11-2. Moore, R.C. and Purugganan, M.D.: The evolutionary dynamics of plant duplicate genes. Curr Opin Plant Biol 8 (2005) 122-8. Mounsey, A., Bauer, P. and Hope, I.A.: Evidence suggesting that a fifth of annotated Caenorhabditis elegans genes may be pseudogenes. Genome Res 12 (2002) 770-5. 
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bradley, P., Bork, P., Bucher, P., Cerutti, L., Copley, R., Courcelle, E., Das, U., Durbin, R., Fleischmann, W., Gough, J., Haft, D., Harte, N., Hulo, N., Kahn, D., Kanapin, A., Krestyaninova, M., Lonsdale, D., Lopez, R., Letunic, I., Madera, M., Maslen, J., McDowall, J., Mitchell, A., Nikolskaya, A.N., Orchard, S., Pagni, M., Ponting, C.P., Quevillon, E., Selengut, J., Sigrist, C.J., Silventoinen, V., Studholme, D.J., Vaughan, R. and Wu, C.H.: InterPro, progress and status in 2005. Nucleic Acids Res 33 (2005) D201-5. Nielsen, H., Engelbrecht, J., Brunak, S. and von Heijne, G.: Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng 10 (1997) 1-6. Nilsson, P., Paavilainen, L., Larsson, K., Ödling, J., Sundberg, M., Andersson, A.-C., Kampf, C., Persson, A., Al-Khalili Szigartzo, C., Ottosson, J., Björling, E., Hober, S., Wernérus, H., Wester, K., Ponten, F. and Uhlen, M.: Towards a human proteome atlas: high-throughput generation of mono-specific antibodies for tissue profiling. Proteomics, In press (2005). Nomenclature Committee: Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes, Academic, New York, 1992. Nord, K., Gunneriusson, E., Ringdahl, J., Stahl, S., Uhlen, M. and Nygren, P.A.: Binding proteins selected from combinatorial libraries of an alpha-helical bacterial receptor domain. Nat Biotechnol 15 (1997) 772-7. O’Donovan, C., Apweiler, R. and Bairoch, A.: The human proteomics initiative (HPI). Trends Biotechnol 19 (2001) 178-81. O’Donovan, C., Martin, M.J., Glemet, E., Codani, J.J. and Apweiler, R.: Removing redundancy in SWISS-PROT and TrEMBL. Bioinformatics 15 (1999) 258-9. O’Farrell, P.H.: High resolution two-dimensional electrophoresis of proteins. J Biol Chem 250 (1975) 4007-21. 
Ohno, S.: Evolution by gene duplication. Springer-Verlag (1970). Orengo, C.A., Bray, J.E., Buchan, D.W., Harrison, A., Lee, D., Pearl, F.M., Sillitoe, I., Todd, A.E. and Thornton, J.M.: The CATH protein family database: a resource for structural and functional annotation of genomes. Proteomics 2 (2002) 11-21. Overbeek, R., Fonstein, M., D’Souza, M., Pusch, G.D. and Maltsev, N.: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci U S A 96 (1999) 2896-901. Paweletz, C.P., Charboneau, L., Bichsel, V.E., Simone, N.L., Chen, T., Gillespie, J.W., Emmert-Buck, M.R., Roth, M.J., Petricoin, I.E. and Liotta, L.A.: Reverse phase protein microarrays which capture disease progression show activation of prosurvival pathways at the cancer invasion front. Oncogene 20 (2001) 1981-9. Pearson, W.R.: Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol 183 (1990) 63-98. Pearson, W.R.: Protein sequence comparison and protein evolution. ISMB 2000 (2000). Pearson, W.R. and Miller, W.: Dynamic programming algorithms for biological sequence comparison. Methods Enzymol 210 (1992) 575-601. Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D. and Yeates, T.O.: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 96 (1999) 4285-8. Petsko, G.A.: Homologuephobia. Genome Biol 2 (2001) COMMENT1002. Reinhardt, A. and Hubbard, T.: Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res 26 (1998) 2230-6. Rost, B., Liu, J., Nair, R., Wrzeszczynski, K.O. and Ofran, Y.: Automatic prediction of protein function. Cell Mol Life Sci 60 (2003) 2637-50. Rost, B., Schneider, R. and Sander, C.: Protein fold recognition by prediction-based threading. J Mol Biol 270 (1997) 471-80. Rowen, L., Young, J., Birditt, B., Kaur, A., Madan, A., Philipps, D.L., Qin, S., Minx, P., Wilson, R.K., Hood, L. 
and Graveley, B.R.: Analysis of the human neurexin genes: alternative splicing and the generation of protein diversity. Genomics 79 (2002) 587-97. Ruvolo, M.: Comparative primate genomics: the year of the chimpanzee. Curr Opin Genet Dev 14 (2004) 650-6. Saiki, R.K., Walsh, P.S., Levenson, C.H. and Erlich, H.A.: Genetic analysis of amplified DNA with immobilized sequence-specific oligonucleotide probes. Proc Natl Acad Sci U S A 86 (1989) 6230-4. Sakate, R., Osada, N., Hida, M., Sugano, S., Hayasaka, I., Shimohira, N., Yanagi, S., Suto, Y., Hashimoto, K. and Hirai, M.: Analysis of 5’-end sequences of chimpanzee cDNAs. Genome Res 13 (2003) 1022-6. O’Donovan, C., Martin, M.J., Gattiker, A., Gasteiger, E., Bairoch, A. and Apweiler, R.: High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief Bioinform 3 (2002) 275-84. Sanchez, R. and Sali, A.: Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc Natl Acad Sci U S A 95 (1998) 13597-602. Sanger, F., Nicklen, S. and Coulson, A.R.: DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74 (1977) 5463-7. Santoni, V., Molloy, M. and Rabilloud, T.: Membrane proteins and proteomics: un amour impossible? Electrophoresis 21 (2000) 1054-70. Saxena, I.M., Brown, R.M., Jr., Fevre, M., Geremia, R.A. and Henrissat, B.: Multidomain architecture of beta-glycosyl transferases: implications for mechanism of action. J Bacteriol 177 (1995) 1419-24. Schatz, G. and Dobberstein, B.: Common principles of protein translocation across membranes. Science 271 (1996) 1519-26. Schena, M., Shalon, D., Davis, R.W. and Brown, P.O.: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270 (1995) 467-70. Schweitzer, B. and Kingsmore, S.F.: Measuring proteins on microarrays. Curr Opin Biotechnol 13 (2002) 14-9. Shevchenko, A., Wilm, M., Vorm, O. 
and Mann, M.: Mass spectrometric sequencing of proteins silver-stained polyacrylamide gels. Anal Chem 68 (1996) 850-8. Sidow, A.: Gen(om)e duplications in the evolution of early vertebrates. Curr Opin Genet Dev 6 (1996) 715-22. Simon, R., Mirlacher, M. and Sauter, G.: Tissue microarrays in cancer diagnosis. Expert Rev Mol Diagn 3 (2003) 421-30. Skolnick, J., Fetrow, J.S. and Kolinski, A.: Structural genomics and its importance for gene function analysis. Nat Biotechnol 18 (2000) 283-7. Smith, L.M., Sanders, J.Z., Kaiser, R.J., Hughes, P., Dodd, C., Connell, C.R., Heiner, C., Kent, S.B. and Hood, L.E.: Fluorescence detection in automated DNA sequence analysis. Nature 321 (1986) 674-9. Smith, T.F. and Waterman, M.S.: Identification of common molecular subsequences. J Mol Biol 147 (1981) 195-7. Sonnhammer, E.L. and Koonin, E.V.: Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet 18 (2002) 619-20. Soskic, V., Gorlach, M., Poznanovic, S., Boehmer, F.D. and Godovac-Zimmermann, J.: Functional proteomics analysis of signal transduction pathways of the platelet-derived growth factor beta receptor. Biochemistry 38 (1999) 1757-64. Stein, L.D.: Human genome: end of the beginning. Nature 431 (2004) 915-6. Sterky, F., Bhalerao, R.R., Unneberg, P., Segerman, B., Nilsson, P., Brunner, A.M., Charbonnel-Campaa, L., Lindvall, J.J., Tandre, K., Strauss, S.H., Sundberg, B., Gustafsson, P., Uhlen, M., Bhalerao, R.P., Nilsson, O., Sandberg, G., Karlsson, J., Lundeberg, J. and Jansson, S.: A Populus EST resource for plant functional genomics. Proc Natl Acad Sci U S A 101 (2004) 13951-6. Swanson, W.J.: Adaptive evolution of genes and gene families. Curr Opin Genet Dev 13 (2003) 617-22. Syvanen, A.C.: Accessing genetic variation: genotyping single nucleotide polymorphisms. Nat Rev Genet 2 (2001) 930-42. Terwilliger, T.C. and Berendzen, J.: Automated MAD and MIR structure solution. Acta Crystallogr D Biol Crystallogr 55 ( Pt 4) (1999) 849-61. 
Taylor, J.S. and Raes, J.: Duplication and divergence: the evolution of new genes and old ideas. Annu Rev Genet 38 (2004) 615-43. Thompson, J.D., Higgins, D.G. and Gibson, T.J.: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22 (1994) 4673-80. Torrents, D., Suyama, M., Zdobnov, E. and Bork, P.: A genome-wide survey of human pseudogenes. Genome Res 13 (2003) 2559-67. Tyers, M. and Mann, M.: From genomics to proteomics. Nature 422 (2003) 193-7. Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., Qureshi-Emili, A., Li, Y., Godwin, B., Conover, D., Kalbfleisch, T., Vijayadamodar, G., Yang, M., Johnston, M., Fields, S. and Rothberg, J.M.: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403 (2000) 623-7. Uhlen, M. and Ponten, F.: Antibody-based proteomics for human tissue profiling. Mol Cell Proteomics 4 (2005) 384-93. Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., Gocayne, J.D., Amanatides, P., Ballew, R.M., Huson, D.H., Wortman, J.R., Zhang, Q., Kodira, C.D., Zheng, X.H., Chen, L., Skupski, M., Subramanian, G., Thomas, P.D., Zhang, J., Gabor Miklos, G.L., Nelson, C., Broder, S., Clark, A.G., Nadeau, J., McKusick, V.A., Zinder, N., Levine, A.J., Roberts, R.J., Simon, M., Slayman, C., Hunkapiller, M., Bolanos, R., Delcher, A., Dew, I., Fasulo, D., Flanigan, M., Florea, L., Halpern, A., Hannenhalli, S., Kravitz, S., Levy, S., Mobarry, C., Reinert, K., Remington, K., Abu-Threideh, J., Beasley, E., Biddick, K., Bonazzi, V., Brandon, R., Cargill, M., Chandramouliswaran, I., Charlab, R., Chaturvedi, K., Deng, Z., Di Francesco, V., Dunn, P., Eilbeck, K., Evangelista, C., Gabrielian, A.E., Gan, W., Ge, W., Gong, F., Gu, Z., Guan, P., Heiman, T.J., Higgins, M.E., Ji, R.R., Ke, Z., Ketchum, K.A., Lai, Z., Lei, Y., Li, Z., Li, J., Liang, Y., Lin, X., Lu, F., Merkulov, G.V., Milshina, N., Moore, H.M., Naik, A.K., Narayan, V.A., Neelam, B., Nusskern, D., Rusch, D.B., Salzberg, S., Shao, W., Shue, B., Sun, J., Wang, Z., Wang, A., Wang, X., Wang, J., Wei, M., Wides, R., Xiao, C., Yan, C., et al.: The sequence of the human genome. Science 291 (2001) 1304-51. Vitkup, D., Melamud, E., Moult, J. and Sander, C.: Completeness in structural genomics. Nat Struct Biol 8 (2001) 559-66. von Heijne, G.: Patterns of amino acids near signal-sequence cleavage sites. Eur J Biochem 133 (1983) 17-21. von Heijne, G.: Signal sequences. The limits of variation. J Mol Biol 184 (1985) 99-105. 
von Heijne, G.: Transcending the impenetrable: how proteins come to terms with membranes. Biochim Biophys Acta 947 (1988) 307-33. Wallis, J.W., Aerts, J., Groenen, M.A., Crooijmans, R.P., Layman, D., Graves, T.A., Scheer, D.E., Kremitzki, C., Fedele, M.J., Mudd, N.K., Cardenas, M., Higginbotham, J., Carter, J., McGrane, R., Gaige, T., Mead, K., Walker, J., Albracht, D., Davito, J., Yang, S.P., Leong, S., Chinwalla, A., Sekhon, M., Wylie, K., Dodgson, J., Romanov, M.N., Cheng, H., de Jong, P.J., Osoegawa, K., Nefedov, M., Zhang, H., McPherson, J.D., Krzywinski, M., Schein, J., Hillier, L., Mardis, E.R., Wilson, R.K. and Warren, W.C.: A physical map of the chicken genome. Nature 432 (2004) 761-4. Warford, A., Howat, W. and McCafferty, J.: Expression profiling by high-throughput immunohistochemistry. J Immunol Methods 290 (2004) 81-92. 
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., Antonarakis, S.E., Attwood, J., Baertsch, R., Bailey, J., Barlow, K., Beck, S., Berry, E., Birren, B., Bloom, T., Bork, P., Botcherby, M., Bray, N., Brent, M.R., Brown, D.G., Brown, S.D., Bult, C., Burton, J., Butler, J., Campbell, R.D., Carninci, P., Cawley, S., Chiaromonte, F., Chinwalla, A.T., Church, D.M., Clamp, M., Clee, C., Collins, F.S., Cook, L.L., Copley, R.R., Coulson, A., Couronne, O., Cuff, J., Curwen, V., Cutts, T., Daly, M., David, R., Davies, J., Delehaunty, K.D., Deri, J., Dermitzakis, E.T., Dewey, C., Dickens, N.J., Diekhans, M., Dodge, S., Dubchak, I., Dunn, D.M., Eddy, S.R., Elnitski, L., Emes, R.D., Eswara, P., Eyras, E., Felsenfeld, A., Fewell, G.A., Flicek, P., Foley, K., Frankel, W.N., Fulton, L.A., Fulton, R.S., Furey, T.S., Gage, D., Gibbs, R.A., Glusman, G., Gnerre, S., Goldman, N., Goodstadt, L., Grafham, D., Graves, T.A., Green, E.D., Gregory, S., Guigo, R., Guyer, M., Hardison, R.C., Haussler, D., Hayashizaki, Y., Hillier, L.W., Hinrichs, A., Hlavina, W., Holzer, T., Hsu, F., Hua, A., Hubbard, T., Hunt, A., Jackson, I., Jaffe, D.B., Johnson, L.S., Jones, M., Jones, T.A., Joy, A., Kamal, M., Karlsson, E.K., et al.: Initial sequencing and comparative analysis of the mouse genome. Nature 420 (2002) 520-62. Watson, J.D. and Crick, F.H.: The structure of DNA. Cold Spring Harb Symp Quant Biol 18 (1953) 123-31. Wheeler, D.L., Church, D.M., Edgar, R., Federhen, S., Helmberg, W., Madden, T.L., Pontius, J.U., Schuler, G.D., Schriml, L.M., Sequeira, E., Suzek, T.O., Tatusova, T.A. and Wagner, L.: Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res 32 (2004) D35-40. Whetten, R., Sun, Y.H., Zhang, Y. and Sederoff, R.: Functional genomics and cell wall biosynthesis in loblolly pine. Plant Mol Biol 47 (2001) 275-91. 
Wilkins, M.R., Pasquali, C., Appel, R.D., Ou, K., Golaz, O., Sanchez, J.C., Yan, J.X., Gooley, A.A., Hughes, G., Humphery-Smith, I., Williams, K.L. and Hochstrasser, D.F.: From proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Biotechnology (N Y) 14 (1996) 61-5. Wootton, J.C.: Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem 18 (1994) 269-85. Xenarios, I., Rice, D.W., Salwinski, L., Baron, M.K., Marcotte, E.M. and Eisenberg, D.: DIP: the database of interacting proteins. Nucleic Acids Res 28 (2000) 289-91. Yamagata, A., Kristensen, D.B., Takeda, Y., Miyamoto, Y., Okada, K., Inamatsu, M. and Yoshizato, K.: Mapping of phosphorylated proteins on two-dimensional polyacrylamide gels using protein phosphatase. Proteomics 2 (2002) 1267-76. Yu, U., Lee, S.H., Kim, Y.J. and Kim, S.: Bioinformatics in the post-genome era. J Biochem Mol Biol 37 (2004) 75-82.