Variation in length of proteins by repeats and disorder regions Rauan Sagit

ISBN 978-91-7447-670-5

Printed in Sweden by US-AB, Stockholm 2013

Distributor: Department of Biochemistry and Biophysics, Stockholm University

List of Papers

The following papers, referred to in the text by their Roman numerals, are included in this thesis.

PAPER I: Nebulin: A Study of Protein Repeat Evolution

Åsa K. Björklund, Sara Light*, Rauan Sagit* and Arne Elofsson. J. Mol. Biol. 402, 38-51 (2010).

PAPER II: The evolution of filamin - A protein domain repeat perspective

Sara Light, Rauan Sagit, Sujay S. Ithychanda, Jun Qin and Arne Elofsson. J. Struct. Biol. 179, 289-298 (2012).

PAPER III: Long indels are disordered: A study of disorder and indels in homologous eukaryotic proteins

Sara Light*, Rauan Sagit*, Diana Ekman and Arne Elofsson. Biochim. Biophys. Acta (2013).

PAPER IV: Protein expansion is primarily due to indels in intrinsically disordered regions

Rauan Sagit*, Sara Light*, Diana Ekman and Arne Elofsson. Manuscript (2013).

Reprints were made with permission from the publishers.

* Authors contributed equally

Abstract

Protein-coding genes evolve together with their genome and acquire changes, some of which affect the length of their protein products. This explains why equivalent proteins from different species can exhibit length differences. Variation in length of proteins during evolution arguably presents a large number of possibilities for improvement and innovation of protein structure and function. In order to contribute to an increased understanding of this process, we have studied variation caused by tandem domain duplications and insertions or deletions of intrinsically disordered residues.

The study of two proteins, Nebulin and Filamin, together with a broader study of long repeat proteins (>10 domain repeats), began by confirming that tandem domains evolve by internal duplications. Next, we show that vertebrate Nebulins evolved by duplications of a seven-domain unit, yet the most recent duplications utilized different gene parts as duplication units. Filamin, in contrast, exhibits a checkered duplication pattern, indicating that duplications were followed by similarity erosions that were hindered at particular domains due to the presence of equivalent binding motifs. For long repeat proteins, we found that human segmental duplications are over-represented in long repeat genes. Additionally, domains found in long repeat domain arrays achieved this primarily by duplications of two or more domains at a time.

The study of homologous protein pairs from the well-characterized eukaryotes nematode, fruit fly and several fungi demonstrated a link between variation in length and variation in the number of intrinsically disordered residues. Next, insertions and deletions (indels) estimated from HMM-HMM pairwise alignments showed that disordered residues are clearly more frequent among indel than non-indel residues. Additionally, a study of raw length differences showed that more than half of the variation in fungal proteins is composed of disordered residues. Finally, a model of indels and their immediate surroundings suggested that disordered indels occur in already disordered regions rather than in ordered regions.

Contents

List of Papers
Abstract
Abbreviations
1 Introduction
2 Evolution of proteins
    2.1 Protein-coding genes
    2.2 Evolution of proteins
    2.3 Protein length and microsatellite instability
3 Protein domain architectures
    3.1 Classification of domain families
    3.2 Domain assignment in a protein sequence
    3.3 Evolution of protein domain architectures
4 Variation in protein length by tandem domain duplications
    4.1 The study of Filamin
    4.2 The study of Nebulin
5 Variation in protein length at intrinsically disordered regions
    5.1 The study of indels in orthologous proteins of invertebrates and yeast
    5.2 The study of length differences in homologous proteins of yeast
6 Methodology
    6.1 Bioinformatics databases
    6.2 Sequence alignment
    6.3 Profile Hidden Markov Models
    6.4 Prediction of disorder regions
Sammanfattning
Acknowledgements
References

Abbreviations

A           Adenine
C           Cytosine
CATH        Class, architecture, topology and homologous superfamily
DNA         Deoxyribonucleic acid
G           Guanine
HMM         Hidden Markov Model
HR          Homologous recombination
mRNA        messenger RNA
NHEJ        Non-homologous end joining
Pfam        Protein families
pre-mRNA    Precursor mRNA
RNA         Ribonucleic acid
SCOP        Structural classification of proteins
SMART       Simple modular architecture research tool
T           Thymine
TE          Transposable element
U           Uracil

1. Introduction

During evolution, proteins undergo numerous modifications in sequence and structure, expanding the multiplicity of cell functions. The resulting variation in length among homologous proteins is an exciting area of research with many interesting unanswered questions. The work presented here is concerned with the variation in length of proteins with homologous domains in tandem and intrinsically disordered regions. The work resulted in papers I, II, III and IV. Below, there follows a brief introduction to the papers as well as to the forthcoming five chapters.

It is known that protein domain repeats are common in proteins central to the organization of a cell, particularly in eukaryotic cells. Additionally, it has been shown that domain repeats evolve through internal tandem duplications in their respective genes. The underlying mechanisms are however not fully understood. In order to reveal more details on domain repeat expansion mechanisms, we studied the gigantic muscle protein called Nebulin in paper I. It has a large number of actin-binding nebulin domains in tandem and is therefore an excellent candidate for this type of study.

Further, especially in higher eukaryotes, certain protein domains can be found in proteins as tandem repeats, where they perform a broad range of functions frequently related to cellular organization. Filamin is a clear example, since it interacts with many different proteins and has a crucial role for the cytoskeleton. While studying domains that form long repeats, it becomes evident that the properties of a long repeat protein are governed both by the specific properties of the individual repeat domain and the number of repeated copies. Paper II begins by studying domains found in long repeat proteins and continues by focusing on filamin domains present in Filamin proteins.

From an overall perspective, it is well known that proteins evolve in length by insertions and deletions (indels) in their corresponding genes. At the same time, a change in length can disrupt the folding, structure or function of the targeted protein. During the last decade, proteins with small or large regions that do not fold into well-defined three-dimensional structures have come to attention, in part because they turned out to be quite common. A change in length of an already unstructured protein region would arguably be less damaging and therefore more acceptable. Therefore, in paper III we investigated a possible link between variation in length and in disorder by studying indels from HMM-HMM pairwise alignments for two sets (nematode and fruit fly; yeast) of orthologous protein pairs.

Finally, attempts to quantify the link between variation in length and disorder between two proteomes are hampered by the difficulty of correctly aligning the highly diverged sequences of distantly related proteins. In paper IV, we circumvent the alignment problem by calculating raw length differences instead of estimating the variation in length from pairwise alignments. We then applied this strategy to a set of homologous protein pairs from yeast.

The remaining sections provide the necessary background and give an account of our four research papers, which are inserted at the end of the thesis. Sections 2 and 3 contain a general discussion on genes and proteins as well as protein domains. Section 4, devoted to tandem domain duplications, presents a background and a discussion of paper I, studying the evolution of a gigantic muscle protein called Nebulin, and paper II, investigating the duplication patterns of several domain families and the Filamin proteins of 38 eukaryotic species. Section 5 presents papers III and IV, where we analyzed the relationship between variations in protein length and in the number of intrinsically disordered residues in the same proteins. Finally, Section 6 describes some of the bioinformatics resources used in our analyses.

2. Evolution of proteins

The length of a protein is the count of amino acids that constitute its peptide chain. Eukaryotic proteins are, on average, longer than prokaryotic proteins. Brocchieri and Karlin estimated that the median protein length of five eukaryotes (Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae and Arabidopsis thaliana) is 361 residues compared to 267 residues for 44 bacterial species and 247 residues for 12 archaeal species [1].

Further, the role of a protein is to perform a function. In order to perform its function, a protein has to form its native three-dimensional structure. This process is called protein folding and is a journey within an energy landscape to reach the native structure [2]. To strengthen the process, chaperone proteins can assist folding and prevent misfolding and aggregation [3]. Thus, correctly folded proteins are ready to perform their function. Interestingly, a study of misfolded human proteins proposed that up to 30% of protein products are misfolded and subsequently degraded [4].

The hereditary information of a living cell is held in the genome, consisting of one or more large DNA molecules. The genome contains information that codes for proteins, called protein-coding genes. Intriguingly, the genome is not constant and undergoes modifications caused by a number of different mechanisms. In order to highlight the impact of such modifications on proteins and the impact of modified proteins on the cell, here follows a discussion of protein-coding genes, evolution of proteins, and protein length and microsatellite instability.

2.1 Protein-coding genes

As already mentioned, a protein-coding gene codes for a protein. The path from a gene to a protein involves two major steps: the transcription of the gene to synthesize a messenger RNA, followed by the translation of the messenger RNA to synthesize a protein. This is how information stored in DNA is used during protein synthesis. Information does not, however, flow from a protein back to either RNA or DNA. Francis Crick defined these rules of information flow as the central dogma of molecular biology [5].

A first assumption would be a one-to-one relationship between the number of genes and the number of different proteins that can be produced by a cell. Indeed, this assumption holds true for prokaryotic cells, but not for eukaryotic cells. Eukaryotic cells have an additional mechanism to control the final protein product, called alternative splicing. In turn, alternative splicing is possible due to the presence of non-coding DNA (introns) that is located between stretches of coding DNA (exons) [6]. The RNA transcribed from a gene with introns is called a pre-mRNA. The pre-mRNA goes through an RNA processing step where the introns are spliced out to obtain the mRNA molecule. In alternative splicing, a part of an exon or one or more entire exons are also spliced out from the pre-mRNA. The study of alternative splicing is therefore crucial [7–12] in order to achieve a complete mapping between genes and their protein products.

A key to understanding the relationship between DNA and proteins is how a DNA molecule codes for a protein molecule. Both DNA (A, T, G and C) and RNA (A, U, G and C) have a four-letter alphabet. Meanwhile, proteins have a 20-letter alphabet. The relationship between nucleotides and amino acids is three to one, where three consecutive nucleotides (codon) code for one amino acid (residue). Out of the 64 possible codons, 61 code for amino acids and 3 code for translation termination signals. As a consequence, all amino acids except Methionine (AUG) and Tryptophan (UGG) are coded by two or more different codons. This creates a tolerance for certain nucleotide substitutions. Additionally, it creates intolerance for insertions or deletions of one or two nucleotides that would shift the reading frame and radically change the protein sequence.
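
The codon arithmetic above can be checked with a short sketch. It builds the standard genetic code from its conventional UCAG ordering, counts sense and stop codons, and shows how a single-nucleotide insertion shifts the reading frame. The function name translate and the example mRNA strings are illustrative assumptions and do not come from the papers.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order (first base varies slowest, third fastest);
# '*' marks the three translation termination codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    residues = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":
            break
        residues.append(residue)
    return "".join(residues)


sense_codons = [c for c, r in CODON_TABLE.items() if r != "*"]
stop_codons = sorted(c for c, r in CODON_TABLE.items() if r == "*")
single_codon_residues = {
    r for r in set(AMINO_ACIDS) - {"*"} if AMINO_ACIDS.count(r) == 1
}

original = "AUGGCUUGGAAAUAA"        # hypothetical example mRNA
frameshifted = "AUGCGCUUGGAAAUAA"   # same mRNA with one C inserted after the start codon

print(len(sense_codons), stop_codons, single_codon_residues)
print(translate(original), translate(frameshifted))
```

In the frameshifted string every codon after the insertion is read differently and the original stop codon is no longer in frame, which illustrates why indels of one or two nucleotides are poorly tolerated.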

Finally, two genes within or between species can be related from a historical point of view, in terms of having a common ancestor. Thus, two proteins are homologous if their respective genes have a common ancestor. Homology of two proteins is a historical fact that can merely be estimated from available information [13]. According to accepted nomenclature, two proteins are orthologs when their genes diverged by speciation and paralogs when their genes diverged by duplication. Ohno [14] argued that gene duplications by polyploidy [15] and unequal crossing-over [16] are the major source for the emergence of novel genes and proteins. Moreover, Ohno, Wolf and Atkin proposed that gene duplication paved the way for the evolution from fish to mammals [17].

2.2 Evolution of proteins

The genome houses a considerable collection of protein-coding genes. The gene collection, in turn, specifies the proteins that can be produced from this genome. During reproduction, the progeny receives a copy of the genome and thereby a copy of the gene collection. This is how the progeny becomes capable of producing equivalent protein molecules to what their parents produced.

This system of information inheritance has an additional factor, namely change. A genome is not constant and undergoes changes through different mechanisms. A change can cause an error in a gene, e.g. a reading frame shift that causes a gene to produce a wrong protein. When the correct protein is no longer produced, the organism suffers since this protein’s function is no longer performed in its cells.

Charles Darwin formulated a line of thought that can be summarized as the survival of the fittest during evolution by means of natural selection [18]. That survival is not guaranteed becomes clear when considering that more individuals of every species are born compared to the number that can actually survive. Under these circumstances, a slight advantage in one individual can allow it to survive in the short run and generate successful progeny in the long run. The advantage, in turn, can originate from an advantageous change in a gene that makes it produce a protein that is superior to the corresponding protein in other individuals. Thus, changes in the gene collection have a direct influence on the ability to survive and to produce capable progeny.

Indeed, positive or negative changes in a gene have a direct influence on the fitness [19] of the host organism. Yet, Motoo Kimura proposed that the majority of changes in the genome, i.e. mutations, are neutral and therefore governed by random genetic drift [20]. Genetic drift, in turn, is a process where the frequencies of alleles (variants of the same gene) are randomly sampled in a finite population. The relative power of natural selection and genetic drift is proposed to depend on the effective size of a population [21]. Finally, genetic drift can allow even a slightly less functional allele to have a chance of survival, which could be interpreted as an exception to the rule of the survival of the fittest.
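
Random sampling of allele frequencies in a finite population can be illustrated with a minimal simulation. The sketch below assumes a simple Wright-Fisher style model with two neutral alleles and non-overlapping generations; this model choice, the function name wright_fisher and all parameter values are assumptions made for illustration and are not taken from the cited works.

```python
import random


def wright_fisher(pop_size, start_freq, generations, seed=0):
    """Track the frequency of one neutral allele under pure genetic drift.

    Each generation, pop_size gene copies are drawn at random according to
    the allele frequency of the previous generation.
    """
    rng = random.Random(seed)
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        copies = sum(rng.random() < freq for _ in range(pop_size))
        freq = copies / pop_size
        trajectory.append(freq)
    return trajectory


# Smaller populations are expected to fluctuate more strongly per generation.
small = wright_fisher(pop_size=20, start_freq=0.5, generations=100, seed=1)
large = wright_fisher(pop_size=2000, start_freq=0.5, generations=100, seed=1)
print(small[-1], large[-1])
```

Once the frequency reaches 0 or 1 it stays there, since no mutation is modelled; this corresponds to loss or fixation of the allele by chance alone.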

It would be natural to think of survival in terms of organisms striving to survive. An alternative view was presented by Richard Dawkins [22] where he argues that it is the genes that strive to survive, using organisms as their “survival machines”. To view the process of evolution from this angle is elegant, considering for instance that viruses could be viewed merely as survival machines for their genes, ending any debate whether a virus is living or non-living since this factor becomes irrelevant.

In summary, protein-coding genes are selected by the quality of their proteins as well as by a factor of chance. Genes are subject to changes, while changes create variants upon which selection processes can act, thereby continuing the cycle of change.

2.3 Protein length and microsatellite instability

Section 5 will discuss results from papers III and IV that present and investigate a considerable coupling between variation in length and disorder. A disorder region is a set of consecutive residues in a protein that exhibit a particular structural property, namely that they do not fold to a well-defined structure in the native state. Evidence for the existence of disorder regions in proteins that are involved in transcriptional activation was discussed by Paul Sigler in 1988 [23], where disorder regions were referred to as being ill-structured.

Since then, the study of disorder regions has continued [24–29]. Intrinsic disorder turned out to be considerably more abundant in eukaryotes compared to prokaryotes [25], most probably due to its involvement in regulatory and signaling functions. In addition, proteins with long (>30 residues) disorder regions are involved in a large variety of processes, including cell differentiation, transcription and transcriptional regulation [30]. Peter Tompa proposed that expansions of nucleotide repeat regions (microsatellites) can be involved in the expansion of disorder regions [31]. In connection to this proposal, below follows a brief description of three mechanisms that can cause expansions or contractions of microsatellites, i.e. microsatellite instability.

DNA replication slippage

Expansions of certain tri-nucleotide units are involved in genetic and somatic diseases, where Richards and Sutherland [32] mention CCG expansions to be involved in Fragile X syndrome [33] and Fragile X E syndrome [34], while CAG expansions are involved in Spinal and Bulbar muscular atrophy [35], Myotonic dystrophy [36], Huntington disease [37], Spinocerebellar ataxia (Type 1) [38], Dentatorubral pallidoluysian atrophy [39; 40], and nonobese diabetic (NOD) mouse gene interleukin-2 [41]. Because DNA is double-stranded, expansion of CCG units on one strand is equivalent to the expansion of CGG on the complementary strand, when considering the same direction, e.g. 5’ to 3’. Similarly, the expansion of CAG units is equivalent to the expansion of CTG on the complementary strand.
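
The strand equivalence described above follows from taking the reverse complement, which reads the complementary strand in its own 5' to 3' direction. A minimal sketch (the function name reverse_complement and the repeat count of five are illustrative assumptions):

```python
# Watson-Crick base pairing for DNA.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


print(reverse_complement("CCG"))      # unit on the complementary strand
print(reverse_complement("CAG"))
print(reverse_complement("CCG" * 5))  # an expanded repeat maps to an expanded repeat
```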

Short repetitions (units of 1 to 10 nucleotides), known as microsatellites, and their presence in protein-coding genes have been studied [42–44]. It has been proposed that the expansion and contraction of microsatellites can be caused by DNA replication slippage [45; 46]. A suggested mechanism for DNA replication slippage involves two direct repeats that flank a hairpin structure consisting of two adjacent inverted repeats [47]. The DNA polymerase is stalled at the base of the hairpin and dissociates from the DNA molecule. The direct repeat prior to the hairpin separates from the template strand and binds to the direct repeat beyond the hairpin. The DNA polymerase reloads and replication continues. The key idea is that deletions or expansions share replication pausing as a first step. Further, Viguera and co-workers point out that secondary structures formed in DNA are recombination hot spots [48–50] while pause sites in replication forks are deletion hot spots [51–54].

DNA double strand break repair

The genome is composed of one or more large DNA molecules. These suffer damage from time to time and need to be repaired. There are two general repair strategies in place, namely homologous recombination (HR) and non-homologous end joining (NHEJ) [55]. The HR strategy includes gene conversion, break-induced replication and single-strand annealing. During HR, the breakage of a double strand is followed by 5’ to 3’ resection, strand invasion and new DNA synthesis. NHEJ can repair DNA breaks with little or no base pairing at the junction. Paques and co-workers [56] propose that while slippage during DNA replication can account for a fraction of microsatellite instability, double strand break repair could be the major cause.

Transposable elements

Transposable elements (TEs) are regions in the genome that can transfer, either themselves or a copy of themselves, to a different location. Barbara McClintock was awarded the Nobel Prize in Physiology or Medicine in 1983 for the discovery of transposable elements [57] in the genome of maize. As more genomes were fully sequenced, it became clear that transposable elements are present in almost all known genomes [58; 59]. Thompson-Stewart and co-workers [60] proposed that transposable elements are involved in microsatellite instability. Moreover, the presence of transposable elements in spliceosomal introns can cause a duplication or deletion involving more than 10 nucleotides, as in Alu-mediated recombination [61].

3. Protein domain architectures

Proteins have modular units known as protein domains [62]. For instance, in 1973, Wetlaufer [63] reported distinct structural regions while examining known protein structures including Trypsin, Lysozyme and Immunoglobulin. Wetlaufer defined a structural region as consecutive residues that form a compact structure. Later, distinct structural regions were found in otherwise unrelated proteins, supporting the idea that nature uses available material to create new material [64], in this case by using available domains to create new proteins.

A domain can be defined from an evolutionary, structural or independent folding perspective [65]. The evolutionary perspective states that a domain “accepts” and “rejects” changes to its sequence, during evolution, based on its own criteria, while the structural perspective states that a domain has to maintain a particular structure. Finally, the independent folding perspective states that a domain can fold with or without the rest of the protein chain.

The more elusive and arguably more important notion in this context is function. A domain that contributes with a function to the protein will preferably conserve this function during evolution. The correct function requires the correct structure, which in turn requires an accurate folding process. Finally, the structure is defined by the sequence, thus changes in the sequence affect the folding, structure and function of a protein domain and thereby of the protein as a whole.

3.1 Classification of domain families

The ability to estimate the presence of a protein domain in a given protein makes it possible to transfer knowledge from well-studied proteins to poorly studied proteins when they have homologous domains. Homology, in turn, has to be estimated from sequence or structure similarity. Thus, it is possible to classify domains into domain families. The advantage of sequence-based classifications is a higher coverage of all existing proteins, while the advantage of structure-based classification is a domain border inference based on structure rather than local sequence similarity.

CATH [66] classifies domain families using a combination of manual and automated procedures. It has four levels of classification called Class, Architecture, Topology and Homologous Superfamily. Class is determined by secondary structure composition, e.g. mainly alpha, mainly beta or alpha-beta. Architecture is determined by orientations of secondary structure elements while ignoring the connectivity, e.g. barrel or 3-layer sandwich. Topology is determined by the fold of the protein domain core. Homologous Superfamily is determined by either sequence similarity or structure similarity (SSAP) [67].

SCOP [68] stands for Structural Classification of Proteins and is mainly built by manual inspection. It has three levels of classification, Family, Superfamily and Fold. A high level (>=30%) of sequence identity defines a family. Superfamily is determined by a decision that there is a high probability of a common evolutionary origin, despite low sequence identity, by considering structural and functional features. A fold is determined by the major secondary structures, their spatial arrangement and topological connections. Relevantly, a reversed protein sequence does not yield the same fold [69].

SMART [70] stands for Simple Modular Architecture Research Tool and is a collection of multiple sequence alignments, with one alignment per protein domain family. Full alignments are built from seed alignments, where seed alignments are created using an iterative semi-automatic procedure. The ambition is to create domain families with emphasis on having a common function.

Pfam [71] stands for Protein Families and utilizes profile HMMs to organize domains into families. The sequences originate from a database called Pfamseq, which in turn is based on Uniprot (Swiss-Prot and TrEMBL). Pfam release 26.0 has more than 13,000 manually curated protein domain families. The starting point for a domain family in Pfam is a set of manually selected sequences that represent a domain family.

Next, a multiple sequence alignment is built for these family members and this alignment is called a seed alignment. The seed alignment is then utilized to estimate the emission and transition probabilities of a profile HMM that will represent this family. The profile HMM is then used to evaluate additional family members. In addition to the manually curated Pfam-A families, Pfam also has automatically generated Pfam-B families. The Pfam-B families are calculated from clusters built by the ADDA algorithm [72] for sequence regions without Pfam-A annotations.
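
To give a feeling for how emission probabilities can be estimated from a seed alignment, the sketch below counts residues per alignment column and adds a uniform pseudocount. This is a deliberately simplified illustration: the uniform pseudocount, the treatment of every column as a match state, and the toy alignment are all assumptions, and the actual estimation performed when building Pfam profile HMMs is more elaborate.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_emissions(column, pseudocount=1.0):
    """Estimate emission probabilities for one alignment column.

    Gap characters are ignored; a uniform pseudocount (an assumption of
    this sketch) keeps unseen residues from getting probability zero.
    """
    counts = Counter(res for res in column if res in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def emissions_from_alignment(seed_alignment, pseudocount=1.0):
    """Return one emission distribution per column of an equal-length alignment."""
    return [column_emissions(col, pseudocount) for col in zip(*seed_alignment)]


toy_seed = ["ACD", "ACE", "A-D"]  # hypothetical three-sequence seed alignment
profile = emissions_from_alignment(toy_seed)
print(round(profile[0]["A"], 3), round(profile[1]["C"], 3))
```

Transition probabilities between match, insert and delete states would be estimated in a comparable way from the observed gap patterns, which this sketch leaves out.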

3.2 Domain assignment in a protein sequence

A protein domain family can be represented by a profile HMM (see subsection 6.3 for profile HMMs), as done by default in the Pfam database. A program can be used to scan a protein sequence against a profile HMM library. The HMMER package has a program called hmmscan for this very purpose [73]. Hmmscan takes a protein sequence as input and provides a list of domain family hits as output. Every hit is reported with start and stop positions in the query sequence, together with an expectation value that estimates the number of hits of equivalent strength that could be found by chance in the current search setup.

The list can contain hits that overlap in the query sequence. A manual or automatic decision has to be taken in order to determine which hit is the best. An automatic decision can be based on comparing expectation values. A manual decision can be based on inspecting proteins from the same family. Large-scale studies would require automatic decisions, whereas small-scale studies can benefit from adding manual inspection. However, there are also proteins without any detectable homology to other proteins. These are called orphan proteins [74] or ORFans [75]. There are also sequence regions of 50 or more unassigned residues called orphan domains [76]. Most orphan domains are found at protein termini [77], possibly due to mutations of translation start and stop codons.
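
One possible automatic rule for overlapping hits, based on comparing expectation values, is sketched below. It accepts hits greedily from the lowest expectation value upwards and discards any hit that overlaps an already accepted one. The Hit record, the family names and the greedy rule itself are assumptions made for illustration; the sketch does not parse hmmscan output and is not necessarily the procedure used in the papers.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    family: str    # domain family name (hypothetical values below)
    start: int     # 1-based start position in the query sequence
    end: int       # inclusive stop position in the query sequence
    evalue: float  # expectation value; lower means more significant


def overlaps(a, b):
    """Return True if two hits share at least one query position."""
    return a.start <= b.end and b.start <= a.end


def resolve_overlaps(hits):
    """Keep the most significant hit wherever hits overlap."""
    accepted = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if not any(overlaps(hit, kept) for kept in accepted):
            accepted.append(hit)
    return sorted(accepted, key=lambda h: h.start)


hits = [
    Hit("famA", 1, 100, 1e-30),
    Hit("famB", 80, 150, 1e-5),    # overlaps famA and is less significant
    Hit("famC", 160, 250, 1e-12),
]
for hit in resolve_overlaps(hits):
    print(hit.family, hit.start, hit.end, hit.evalue)
```

A stricter or more lenient rule, for example tolerating overlaps of a few residues, can be obtained by changing the overlaps function alone.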

3.3 Evolution of protein domain architectures

Eukaryotic proteins are, on average, longer than prokaryotic proteins. Accordingly, it has been estimated that 65% of eukaryotic proteins have multi-domain architectures (two or more domains per protein) compared to 40% for prokaryotes [76]. Further, it has been proposed that eukaryotes have acquired new multi-domain architectures at a higher rate than prokaryotes [78]. Domain insertions are estimated to have been three times more common than deletions. The most common event appears to have been insertion or deletion of a single domain at the N-terminus or C-terminus of a protein [79; 80].

Gene fusion, gene fission and exon shuffling have been proposed as causes of changes in protein domain architectures. It has been proposed that gene fusions and fissions can explain more than 80% of all existing multi-domain architectures [81]. Further, according to estimations, gene fusion occurred at least four times as often as gene fission [81; 82]. Meanwhile, exons that code for domains can be shuffled among eukaryotic genes by intronic recombinations [83–86]. Interestingly, it has been proposed that a protein-coding sequence can have an insertion of one or more introns [87]. Thus, theoretically, a new exon can be created by the insertion of two introns and possibly become suitable for future exon shuffling.

4. Variation in protein length by tandem domain duplications

From a protein sequence perspective, tandem duplication is where one or more consecutive residues are copied and inserted right next to their origin. An extreme example of tandem duplication in proteins is that of a single residue. The resulting consecutive stretch of identical amino acids is called a homorepeat [88]. A homorepeat has a remarkable possibility of forming interactions due to a high concentration of the properties of the repeated amino acid type. Additionally, it has been proposed that identical or nearly identical homorepeats tend to be intrinsically disordered [89], displaying an amino-acid composition preference similar to intrinsically disordered proteins and being rarely found among proteins with known native structures.
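
Perfect homorepeats, as defined above, can be located with a simple pattern search. The sketch below reports runs of one identical residue; the minimum run length of five residues, the function name find_homorepeats and the example sequence are assumptions chosen for illustration, since the text does not fix a length threshold, and nearly identical runs are not handled.

```python
import re


def find_homorepeats(sequence, min_length=5):
    """Return (start, end, residue) for each run of identical residues.

    Positions are 1-based and inclusive. min_length is an illustrative
    threshold, not a value taken from the literature cited here.
    """
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_length - 1))
    return [
        (match.start() + 1, match.end(), match.group(1))
        for match in pattern.finditer(sequence)
    ]


# Hypothetical sequence with a poly-glutamine run and a poly-serine run.
example = "MK" + "Q" * 7 + "APL" + "S" * 6 + "G"
print(find_homorepeats(example))
```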

A number of computational programs have been developed [90] to identify tandem duplications of one or more residues, such as T-REKS [91], REP [92], HHrep [93] and HHalign [94]. T-REKS performs ab initio identification of tandem repeats using a K-means algorithm, and performs better [91] than TRED [95], INTREP [96] and XSTREAM [97]. REP, in contrast, searches a protein sequence for repeats that belong to one of the repeat families implemented into the method. HHrep performs repeat identification in protein sequences by profile HMM-HMM comparisons, without relying on a pre-defined set of domain families. Finally, HHalign is part of the HHsearch package and when supplied with a single multiple sequence alignment, it will compare the alignment against itself.

Meanwhile, certain proteins have two or more homologous domains in tandem. Even long arrays (>10 homologous domains in tandem) are feasible due to their simple structure [98]. In general, domains found in tandem arrays are between 20 and 170 residues in length [99], and domain families that form tandem domain arrays are on average shorter than other families [79]. Additionally, a study of gene exon and intron structure proposed that 30% of genes coding for proteins with long arrays house the entire array in a single large exon [99]. In this respect, human C2H2 zinc finger and Leucine-rich repeat (LRR) domains were over-represented.

Arguably, a domain shorter than 50 residues is too short to be called a domain. Therefore, it can be considered as a “short modular unit”. For instance, the tetratricopeptide repeat (TPR) is a 34 residue modular unit found in both eukaryotes and prokaryotes [100]. TPRs are involved in molecular recognition and protein-protein interactions. This can be exemplified by the Hsp70/Hsp90 organizing protein (Hop) with its array of 9 TPR units [101]. The three TPR units at the N-terminus bind to Hsp70, while the three middle TPR units bind to Hsp90. The interacting peptides in Hsp70 and Hsp90 are 12 and 5 residues long, respectively. In addition, this can serve as an example of when more than one short modular unit is needed to perform a binding function.

Björklund and co-workers studied the expansion of domain arrays in proteins [99]. They found that the duplication of two or more domains at a time has been more common than that of a single domain. Additionally, they proposed that the mechanism that causes tandem domain duplications is likely to be independent of the size of the duplicated unit. Further, the majority of the detectable duplications are found in the middle of a protein and not at the termini. Thus, the evolution of homologous domain array architectures differs in this respect from the evolution of non-repeat architectures. Below follow our studies of Filamin (paper II) and Nebulin (paper I) proteins and the expansion of their domain arrays.

4.1 The study of Filamin

Human Filamin A (FLNA) is an anchor between membrane proteins and actin filaments. FLNA has an N-terminal actin-binding domain consisting of two calponin homology (CH) domains and an array of 24 filamin domains. A filamin domain is 95 residues long and belongs to the class of immunoglobulin-like domains. FLNA functions as a homo-dimer, formed through the interaction of the 24th filamin domains of the two FLNA molecules.

FLNA has a structure that is divided into three rods by two hinges. Rod 1 consists of the actin-binding domain followed by filamin domains 1 to 15. Rod 2 consists of filamin domains 16 to 23. Rod 3 consists of the dimerizing filamin domain 24. The actin-binding domain of rod 1 binds actin filaments. The filamin domains of rod 2 have a number of binding partners, including FilGAP and the cytoplasmic tail of beta-integrin [102]. FilGAP is a Rac GTPase-activating protein (GAP) implicated in the control of tumor cell migration [103]. Integrins are hetero-dimers of membrane-spanning alpha and beta subunits with extra-cellular domains involved in cell adhesion [104]. FLNA filamin domains 4, 9 and 12 in rod 1 and 17, 19, 21 and 23 in rod 2 share a conserved ligand-binding site. Among the human FLNA domains, the 21st filamin domain is the best binder of the cytoplasmic tail of beta 7 integrin [105].

Here, we focused on FLNA and studied Filamins from 38 eukaryotes [106].

There is some difference in the number of filamin domains in these proteins.

[Figure 4.1 labels: order in domain array (1 to 8); protein domain architecture]

Figure 4.1: A protein domain similarity matrix is created to visually detect and inspect domain duplications in a domain array (green). The opacity corresponds to the degree of similarity between two domains, the darker the more similar. First, scores are calculated from pairwise local alignments with the Smith-Waterman algorithm. The scores are then normalized by the sum of the lengths of the domains and then by the largest score observed off the diagonal. The opacities thus represent a relative similarity within this domain array.

While mouse and chicken have 24 filamin domains, sea squirt, fruit fly and medicinal leech have 22, 22 and 35 domains, respectively. A non-identical number of tandem domains suggests one or more duplication events since the last common ancestor of these species.

Duplication patterns were examined by calculating one domain similarity matrix for every Filamin protein. A domain similarity matrix is a domain-by-domain comparison of a tandem domain array against itself. A box in a matrix represents a domain-to-domain comparison. The box is colored in a gray scale ranging from white to black. The higher the sequence similarity, the darker the color. The color is normalized by dividing all sequence similarity scores by the highest score observed for a domain-to-domain comparison in the array.
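The construction of such a matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the simple match, mismatch and gap scores, the function names and the toy domain sequences are hypothetical and are not taken from paper II, where a substitution matrix may well have been used instead.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty).

    The scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Negative scores are reset to zero (local alignment).
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def domain_similarity_matrix(domains):
    """Relative similarity (0 to 1) for every domain pair in one array."""
    n = len(domains)
    raw = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            score = smith_waterman_score(domains[i], domains[j])
            # First normalization: sum of the two domain lengths.
            raw[i][j] = score / (len(domains[i]) + len(domains[j]))
    # Second normalization: largest value observed off the diagonal.
    largest = max(raw[i][j] for i in range(n) for j in range(n) if i != j)
    if largest == 0:
        return raw
    # Diagonal cells can exceed the off-diagonal maximum, so cap them at 1.
    return [[min(1.0, raw[i][j] / largest) for j in range(n)] for i in range(n)]


# Hypothetical toy "domains": the first two are identical (a recent duplication).
toy_domains = ["ACDEFGHIK", "ACDEFGHIK", "LMNPQRSTV"]
matrix = domain_similarity_matrix(toy_domains)
```

In this toy array the duplicated pair receives the darkest off-diagonal box, while the unrelated third domain stays white.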

We found that the clearest signs of duplication events are present in Trichoplax adhaerens, starlet sea anemone and medicinal leech. The duplications involved non-terminal domains. The duplications in starlet sea anemone and medicinal leech had a duplication unit of a single filamin domain. The duplication in T. adhaerens had a duplication unit of six filamin domains that duplicated twice in tandem.

4.2 The study of Nebulin

In paper I we studied Nebulin, a giant protein of 600 to 900 kDa that is part of the skeletal muscle thin filament [107]. Nebulin performs a wide range of functions, including regulation of muscle contraction. The domain architecture of Nebulin is an SH3 domain followed by a long array of nebulin domains. A single nebulin domain binds actin with high affinity [108], while a seven-nebulin array has been proposed to bind actin with an even higher affinity [109]. The SH3 domain, in turn, anchors the Nebulin protein to the Z-disk [110].

Each nebulin domain is 35 residues long and has a helix-turn-helix structure. It turned out that the exon boundaries of nebulin domains are shifted by one helix compared to the Pfam nebulin domain definition. Therefore, we decided to use the exon boundaries in place of domain boundaries, and a nebulin domain defined by its exon boundaries was called a nebulin evolutionary unit (NEU).

The array of NEUs in Nebulin mainly expanded by internal duplications of a seven-nebulin domain unit (super repeat) [99]. The human Nebulin gene contains 238 NEUs, 175 of which code for 25 super repeats. By identifying super repeats in 10 other vertebrates and manually inspecting a multiple sequence alignment of these 11 Nebulins, we concluded that the most parsimonious scenario is one in which the last common ancestor of these vertebrates had 16 super repeats. Therefore, we concluded that the majority of duplications involving super repeats occurred in an early vertebrate.


5. Variation in protein length at intrinsically disordered regions

Intrinsically disordered residues are characterized by not forming a well-defined structure in the native state of the protein. This makes some protein disorder regions suitable for transient binding [111], which allows these proteins to perform signaling functions [112]. While being ill-structured as a single protein molecule, some disorder regions become ordered upon binding to a partner molecule [113]. This could explain why some disorder regions contain preformed secondary structure elements [114].

Regarding variation in length of proteins, the effect of having an insertion or deletion of one or more residues depends on the region affected. For the structure of a globular protein, changes in the core are probably more disruptive than changes in long loops located on the surface. Curiously, it has been proposed that nested insertions in the same location of a protein can eventually change the structural character of that location [115]. The goal of papers III and IV was to study the extent and character of a possible link between variation in length and in disorder. Below follows an introduction and summary of our main findings.

5.1 The study of indels in orthologous proteins of invertebrates and yeast

In paper III, we performed a study on 3,736 orthologous protein pairs between Caenorhabditis elegans and Drosophila melanogaster. The insertions and deletions were estimated by building pairwise profile HMM-HMM alignments for every protein pair. Further, we decided to focus on long (>30 residues) disorder regions and to employ two predictors, Disopred2 and Iupred. In order to provide a more complete study, we performed the exact same analyses on 18,389 orthologous protein pairs between Saccharomyces cerevisiae and five other fungal species.

Firstly, we found that intrinsically disordered residues are clearly more frequent among indels than among aligned residues. We also report a stronger correlation between intrinsic disorder and indels than between coils (loops) and indels. Secondly, intrinsically disordered residues are more common in longer indels compared to shorter indels. Further, short and medium sized disordered indels are preferentially located at non-terminal protein regions, while long indels, ordered and disordered, are preferentially located at protein termini. Finally, while confirming that disordered regions diverged further than ordered regions, we encountered a few previously recognized protein families where the disordered regions are better conserved than the ordered ones. We found these proteins to be associated with information processes, including translation and RNA processing.

5.2 The study of length differences in homologous proteins of yeast

Finding the correct alignment between two highly diverged protein sequences is a difficult task. However, it is also possible to study variation in length by calculating raw length differences. In paper IV, we studied 19,664 homologous protein pairs between S. cerevisiae and five other fungal species. We found that more than half of the variation in length of yeast proteins could be accounted for by insertions or deletions of intrinsically disordered residues. The relation between variation in length and variation in disorder was summarized by the slope of a line fitted to the data points (protein pairs). For convenience, all slopes were presented as percentages, e.g. 0.6 is shown as 60%.
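As a rough illustration of how such a slope could be obtained, the sketch below fits an ordinary least-squares line to hypothetical data points, one per protein pair. The fitting procedure actually used in paper IV is not specified here, so the choice of ordinary least squares, the function name and the numbers are all assumptions.

```python
def fitted_slope(length_diffs, disorder_diffs):
    """Ordinary least-squares slope of disorder difference on length difference."""
    n = len(length_diffs)
    mean_x = sum(length_diffs) / n
    mean_y = sum(disorder_diffs) / n
    sxx = sum((x - mean_x) ** 2 for x in length_diffs)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(length_diffs, disorder_diffs))
    return sxy / sxx


# Hypothetical protein pairs: difference in length and difference in the
# number of disordered residues, both in residues.
length_diffs = [10, 20, 30, 40]
disorder_diffs = [6, 12, 18, 24]

slope = fitted_slope(length_diffs, disorder_diffs)
print(f"{slope:.0%}")  # slope expressed as a percentage
```

With these toy numbers, 60% of every additional residue of length difference is matched by a disordered residue.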

To estimate the significance, two steps were taken. First, three models were defined and called proteome, protein and proximity. The proteome model assumes the disorder content of the length difference should be equal to the disorder content of the entire set of protein pairs. The protein model assumes the content should be equal to the content of the pair in question. The proximity model assumes the content should be equal to the content calculated from insertions and deletions in a pairwise HMM-HMM alignment of the protein pair.

Second, additional predictors of disorder, low complexity, secondary structure elements and tandem nucleotide repeats (TNR) were employed for comparison.
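The proteome and protein models can be read as simple expectations for how many disordered residues a given length difference should contain. The following sketch follows that reading; the data layout, function names and numbers are hypothetical, and the proximity model is left out because it requires the HMM-HMM alignment of each pair.

```python
def proteome_fraction(pairs):
    """Disorder content of the entire set of protein pairs.

    pairs: list of (total_residues, disordered_residues), one tuple per pair.
    """
    total = sum(t for t, _ in pairs)
    disordered = sum(d for _, d in pairs)
    return disordered / total


def protein_fraction(pair):
    """Disorder content of the single pair in question."""
    total, disordered = pair
    return disordered / total


def expected_disorder(length_diff, disorder_fraction):
    """Disordered residues expected in a length difference under a model."""
    return abs(length_diff) * disorder_fraction


# Hypothetical data set with two protein pairs.
pairs = [(1000, 100), (500, 200)]
length_diff = 50  # length difference of the second pair, in residues

under_proteome = expected_disorder(length_diff, proteome_fraction(pairs))
under_protein = expected_disorder(length_diff, protein_fraction(pairs[1]))
```

An observed disorder difference well above both expectations would indicate that the length difference is enriched in disordered residues relative to these models.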

The links between low complexity and variation in length, as well as TNR and variation in length, fail to show a significant difference between the observations and the models. Next, when the significance of the difference between observations and models is considered along with the difference between corresponding slopes, Disopred2 and Iupred show a larger difference between observations and models than Psipred Coils. Finally, the proximity model consistently having a higher slope than the protein model suggests that disordered residues are inserted or deleted within an already existing disordered region.


6. Methodology

Bioinformatics is a research field that develops methods to store, organize and analyze biological data. The results from experiments, including whole genome sequencing, are stored and made accessible for further analysis. A common situation is where the accumulation of experimental data is slow and expensive. A widespread strategy to compensate for this is to create predictors. The creation of a predictor involves the implementation of an algorithm and the incorporation of experimental knowledge. A predictor thereby uses the available experimental data for certain proteins to make inferences for other proteins without the corresponding type of experimental data. Experimental as well as predicted results are deposited in databases that are readily available to the research community. Here follows a brief description of databases, machine learning algorithms, sequence alignment and sequence profiles that have been used in the work presented in papers I, II, III and IV.

6.1 Bioinformatics databases

Ensembl [116] provides genetic and genomic data for chordates. Ensembl release 69 (October 2012) supports 70 species including human, mouse and zebrafish. Ensembl provides both web graphical tools and application programming interfaces to extract and analyze data. Ensembl has data for 43 chordates with high coverage genome sequences. Ensembl provides information about protein-coding genes and their transcripts together with corresponding proteins. It is also possible to perform advanced searches and find, for instance, all genes that only have a single known transcript.

Universal Protein Resource (UniProt) [117] is a repository of protein sequences. The UniProt knowledgebase (UniProtKB) consists of Swiss-Prot and TrEMBL. Swiss-Prot has protein sequences that have been manually curated and reviewed, while TrEMBL has all the other protein sequences. The UniProt reference clusters (UniRef) present sequence sets built based on sequence identity. The majority of UniProt proteomes are translations of genome sequences from the International nucleotide sequence database consortium (INSDC). UniRef90 was used in papers III and IV for running PSI-BLAST on the proteins to build HMM-HMM alignments.

The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) [118] is a database that provides a non-redundant, cross-linked and annotated set of nucleotide and protein records for every included species. NCBI’s RefSeq also includes alternatively spliced protein products. The records in RefSeq are based on the nucleotide and protein information available from INSDC (NCBI’s GenBank, European Nucleotide Archive, and DNA Data Bank of Japan). Additionally, the records in RefSeq are curated by collaborating groups and by NCBI staff.

InParanoid [119] is a database of orthologs and in-paralogs computed between pairs of proteomes. An in-paralog is defined as a paralog that arose after speciation of the two species under comparison. Accordingly, an out-paralog is a paralog that arose before speciation. Thus, for two proteomes, InParanoid presents orthologous clusters. Additionally, to avoid redundancy and other problems due to alternative splice forms, InParanoid 7 only includes the longest protein for every gene. Regarding the selection of in-paralogs, consider the procedure between species A and B. Assume that proteins A1 and B1 are mutual best hits when A1 is compared against all proteins in B and vice versa. Then, A1 and B1 become the main orthologs for their ortholog cluster. Next, any protein from A more similar to B1 than to any other protein in B is added to the cluster as an in-paralog [119]. The same is done for B and A1. InParanoid version 7 [120] was released in 2009 and presents ortholog clusters between pairs of proteomes among 99 eukaryotic and 1 prokaryotic (Escherichia coli) species.
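The first step of the procedure above, finding mutual best hits that become the main orthologs, can be sketched as follows. This is not InParanoid's implementation: the function name, the dictionary layout and the similarity scores are hypothetical, and the subsequent in-paralog step is deliberately omitted.

```python
def main_orthologs(scores):
    """Return mutual best hits between species A and species B.

    scores: dict mapping (protein_in_A, protein_in_B) -> similarity score.
    """
    best_for_a, best_for_b = {}, {}
    for (a, b), s in scores.items():
        # Best hit in B for every protein in A.
        if a not in best_for_a or s > scores[(a, best_for_a[a])]:
            best_for_a[a] = b
        # Best hit in A for every protein in B.
        if b not in best_for_b or s > scores[(best_for_b[b], b)]:
            best_for_b[b] = a
    # Keep only pairs where each protein is the other's best hit.
    return [(a, b) for a, b in best_for_a.items() if best_for_b[b] == a]


# Hypothetical similarity scores between two tiny proteomes.
scores = {
    ("A1", "B1"): 90,
    ("A1", "B2"): 40,
    ("A2", "B1"): 60,
    ("A2", "B2"): 30,
}
orthologs = main_orthologs(scores)
```

Here A1 and B1 are each other's best hit and seed the cluster, while A2 has B1 as its best hit without being B1's best hit in return.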

6.2 Sequence alignment

The alignment between two biological sequences is a common and important procedure. The Needleman-Wunsch algorithm [121] can be used to find the best global alignment between two protein sequences, while the Smith-Waterman algorithm [122], a variation of the Needleman-Wunsch algorithm, can be used to find the best local pairwise alignment. Needleman-Wunsch is a dynamic programming algorithm and thus divides and solves the alignment problem as a set of smaller problems [123].

Example of a Needleman-Wunsch alignment

Consider the global alignment of two sequences “MAQ” and “MMAQ”. Assume that the match score is “+1”, the mismatch score is “-1” and the gap penalty is “-2” per gap. The Needleman-Wunsch algorithm then proceeds by filling in a 4 by 5 matrix as shown in Figure 6.1 A. First, the score at cell (1,1) is set to zero and the first row and the first column are filled in according to the gap penalty. Next, starting from cell (2,2), the score for every remaining cell is calculated according to the best score it can get from one of the three possible preceding cells.

A)
           -    M    A    Q
      -    0   -2   -4   -6
      M   -2    1   -1   -3
      M   -4   -1    0   -2
      A   -6   -3    0   -1
      Q   -8   -5   -2    1

B)
      - M A Q        M - A Q
      M M A Q        M M A Q

Figure 6.1: A) The score matrix for the Needleman-Wunsch algorithm. B) The two best scoring alignments.

Thus, cell (2,2) receives a score of “1” since this is the best score it can get by stepping from cell (1,1) to (2,2), while stepping from cells (1,2) or (2,1) would have yielded a score of “-4”. Additionally, the “pointer”, i.e. the fact that cell (2,2) originates from cell (1,1), is saved. After all the cells have been filled in, the score in the bottom right corner is the total score of the alignment. Further, the alignment itself is derived by a procedure known as backtracking, where the saved pointers are utilized to assign matches and gaps. In this case, there are two equally high scoring alignments, shown in Figure 6.1 B.
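The worked example can be reproduced with a short Python sketch. It uses the scores stated in the text (match +1, mismatch -1, gap -2); the function name is hypothetical and, as a simplification, the backtracking recomputes which predecessor cells are optimal instead of storing explicit pointers.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a (columns) and b (rows).

    Returns the filled score matrix and all optimal alignments.
    """
    rows, cols = len(b) + 1, len(a) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row are filled in according to the gap penalty.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if b[i - 1] == a[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # diagonal step
                              score[i - 1][j] + gap,      # gap in a
                              score[i][j - 1] + gap)      # gap in b

    def backtrack(i, j):
        """Collect every alignment that ends in cell (i, j)."""
        if i == 0 and j == 0:
            return [("", "")]
        found = []
        if i > 0 and j > 0:
            sub = match if b[i - 1] == a[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                found += [(x + a[j - 1], y + b[i - 1])
                          for x, y in backtrack(i - 1, j - 1)]
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            found += [(x + "-", y + b[i - 1]) for x, y in backtrack(i - 1, j)]
        if j > 0 and score[i][j] == score[i][j - 1] + gap:
            found += [(x + a[j - 1], y + "-") for x, y in backtrack(i, j - 1)]
        return found

    return score, backtrack(rows - 1, cols - 1)


score, alignments = needleman_wunsch("MAQ", "MMAQ")
total = score[-1][-1]  # bottom right corner: total score of the alignment
```

The matrix returned for “MAQ” against “MMAQ” matches Figure 6.1 A (zero-based indices in the code, one-based in the text), and the two alignments returned are those of Figure 6.1 B.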

The Smith-Waterman algorithm differs by setting cells with negative scores to zero, beginning the backtracking from the cell with the highest score and ending the backtracking at a cell with score zero. Additionally, the final cell with a zero score is excluded from the resulting local alignment.

Dayhoff and BLOSUM substitution matrices

The match scores between amino-acid types can be obtained from substitution matrices. Dayhoff et al. introduced a protein substitution matrix called the point accepted mutation (PAM) matrix [124]. The mutation rate of every amino acid type was estimated from 1,572 changes in 71 groups of closely related proteins. A PAM matrix corresponds to an evolutionary distance between two sequences. A distance of 1 PAM is one substitution per 100 residues, and the corresponding matrix is called the PAM1 matrix. Multiplying the PAM1 matrix by itself yields the PAM2 matrix. Kosiol et al. discuss a few of the Dayhoff matrices in use today [125].

While Dayhoff matrices are based on estimated mutation rates, BLOSUM matrices are based on observed substitution frequencies from blocks of aligned sequences [126]. BLOSUM stands for blocks substitution matrix and is derived from the Blocks alignment database [127]. The values in a BLOSUM matrix are calculated as the logarithm of the ratio of the observed and expected probabilities (log-odds ratio) of a substitution. A positive value means that the observed substitution probability is higher than the expected one. A negative value means that the observed probability is lower than the expected one.
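The log-odds calculation can be illustrated with a minimal sketch. The probabilities below are made up for the example, the function name is hypothetical, and the scaling to half-bit units (a factor of 2 with a base-2 logarithm, rounded to the nearest integer) follows the convention commonly used for BLOSUM matrices rather than anything stated in this text.

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Rounded log-odds score of a substitution, in units of 1/scale bits."""
    return round(scale * math.log2(observed / expected))


# Hypothetical substitution seen four times more often than expected by chance.
favoured = log_odds_score(observed=0.02, expected=0.005)

# Hypothetical substitution seen four times less often than expected.
disfavoured = log_odds_score(observed=0.001, expected=0.004)
```

The first pair receives a positive score and the second a negative one, in line with the interpretation given above.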

BLOSUM matrices are named according to the data they are estimated from. Thus, the BLOSUM62 matrix is derived from blocks in which sequences sharing 62% or more pairwise sequence identity are clustered and counted as a single sequence. Interestingly, certain miscalculations have been discovered in the way BLOSUM matrices were calculated, yet it has been proposed that they did not undermine the usability of BLOSUM matrices [128].

BLAST and PSI-BLAST

The sheer quantity of available biological sequence data requires efficient approaches to search a query sequence against a sequence database. Thus, a sequence database search tool has to trade accuracy for speed. FASTA [129] and BLAST [130] are both successful implementations.

The BLAST (Basic Local Alignment Search Tool) algorithm is based on the idea that a large-scale search becomes faster when it starts from exact matches of short (k residues long) words between the query sequence and the database sequences.

The original algorithm begins by defining words in the query sequence, starting at the first residue and moving one residue forward at a time. Consider the default value of 3 for the parameter k in a protein search and the query sequence “MMAQ”. This yields two k-words, “MMA” and “MAQ”. For every k-word, all matching words that score better than a threshold T are defined, where the score is calculated according to the BLOSUM62 [126] matrix. For “MMA”, M-M scores 5 and A-A scores 4, giving a total score of 14 for an exact match. With T set to 10 and MM kept as the first two residues, the third residue can only be either A or S. Acceptable matching words are then organized in an efficient search tree and the database sequences are searched for exact matches to one or more of these words. Sequences with such a match go to the next step, where the matching word is extended in both directions and simultaneously aligned to the corresponding query region to define high-scoring segment pairs (HSP). Finally, ungapped alignments between HSPs and the query are reported, together with expectation values.
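The word-defining step for the “MMAQ” example can be sketched as below. Only the BLOSUM62 entries needed for this example are included (the alanine row and the M-M score), the function names are hypothetical, and the strict comparison reflects the wording “score better than a threshold T”; this is an illustration, not BLAST's actual implementation.

```python
# BLOSUM62 scores for substituting alanine (A) with each amino acid type.
BLOSUM62_A_ROW = {
    "A": 4, "R": -1, "N": -2, "D": -2, "C": 0, "Q": -1, "E": -1, "G": 0,
    "H": -2, "I": -1, "L": -1, "K": -1, "M": -1, "F": -2, "P": -1, "S": 1,
    "T": 0, "W": -3, "Y": -2, "V": 0,
}
BLOSUM62_MM = 5  # BLOSUM62 score for an M-M match


def k_words(query, k=3):
    """All overlapping words of length k, moving one residue at a time."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def third_residue_candidates(threshold=10):
    """Residues X for which the word MMX scores above T against MMA."""
    prefix_score = 2 * BLOSUM62_MM  # the first two residues are kept as MM
    return sorted(residue for residue, s in BLOSUM62_A_ROW.items()
                  if prefix_score + s > threshold)


words = k_words("MMAQ")
candidates = third_residue_candidates(threshold=10)
```

With the first two residues fixed, the exact word MMA scores 14 and MMS scores 11, while every other third residue gives a score of 10 or lower.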


Figure 6.2: Example of a profile HMM architecture with 1 begin state, 3 match states, 4 insertion states, 3 deletion states and 1 end state. Transitions between states are shown as arrows. Additionally, an emitted sequence is shown together with a possible state path.

Later, the scientific community saw the arrival of gapped BLAST and PSI-BLAST [131]. PSI-BLAST (Position-Specific Iterated BLAST) is an extension of the BLAST program based on sequence profiles [132]. The PSI-BLAST profile, or position specific scoring matrix (PSSM), is a matrix that represents a multiple sequence alignment. In the first iteration, PSI-BLAST runs gapped BLAST with the query sequence. The output from this step is a query anchored multiple sequence alignment of a length equal to the query sequence. Next, this alignment is used to calculate a PSSM. In the second iteration, the PSSM becomes the query against the same database with a modified version of BLAST. Additional hits are included to construct the next PSSM. The iterations continue for a specified number of times or until convergence is reached. Thus, the PSSM allows discrimination according to amino acid preferences at every position of the query sequence.

6.3 Profile Hidden Markov Models

A hidden Markov model (HMM) [133] is a probabilistic model applicable to sequences of observations. An HMM has states that emit characters, connected by transitions from one state to another. Consider an HMM with states 1 and 2 that each can emit characters “A” and “B”. Assume that the emission probabilities for state 1 are 10% for A and 90% for B, while for state 2 they are 50% for A and 50% for B. Next, the transition probabilities for state 1 are 70% to return to state 1 and 30% to transfer to state 2, while for state 2 they are 90% to return to state 2 and 10% to transfer to state 1. This HMM then proceeds by generating a sequence of As and Bs. The observer can merely observe the sequence and thereby lacks knowledge of which states emitted which As and Bs. This is the “hidden” property of an HMM.
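The two-state example can be simulated directly. The emission and transition probabilities below are the ones given in the text; the start state, the random seed and the function name are assumptions made for this sketch.

```python
import random

# Probabilities from the example above.
EMISSION = {1: {"A": 0.1, "B": 0.9}, 2: {"A": 0.5, "B": 0.5}}
TRANSITION = {1: {1: 0.7, 2: 0.3}, 2: {1: 0.1, 2: 0.9}}


def sample_hmm(length, start_state=1, seed=0):
    """Generate a state path and the sequence it emits.

    start_state and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    state = start_state
    states, symbols = [], []
    for _ in range(length):
        states.append(state)
        emit = EMISSION[state]
        symbols.append(rng.choices(list(emit), weights=list(emit.values()))[0])
        move = TRANSITION[state]
        state = rng.choices(list(move), weights=list(move.values()))[0]
    return states, "".join(symbols)


hidden_states, observed = sample_hmm(20)
```

An observer is given only `observed`; the list `hidden_states` is the information that remains hidden.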

Further, profile HMMs [123; 134] have been successfully utilized to represent the information in a multiple sequence alignment. Profile HMMs have a linear and left to right design compared to a general HMM. The states in a profile HMM correspond to positions in a multiple sequence alignment. Thus, the profile HMM generates sequences that follow the amino acid, insertion and deletion preferences of the alignment that the profile is based upon. Analogous to an HMM, the probability to reach a state only depends on the previous state and no other states. A profile HMM has a begin state, match states, insertion states, deletion states and an end state, where only match and insertion states generate residues, see Figure 6.2.

The emission and transition probabilities of a profile HMM can be estimated from an already built alignment. Standard dynamic programming algorithms developed for HMMs can be used to score sequences against this model [123], i.e. estimate the probability that a given sequence could be emitted by this profile and thus be a member of those sequences that belong to the source alignment. When scoring a sequence against a profile HMM, the log-odds score [135] is widely applied, where the score compares the probability of the profile emitting a given sequence against a null model.

Profile HMM-HMM alignment

In papers III and IV, we used HHalign, which is part of the HHsearch package [94], to build global profile HMM-HMM alignments for pairs of homologous proteins. Consider the alignment of two protein sequences. Each sequence went through a PSI-BLAST run with 3 iterations against the UniRef90 database. For every sequence, the output from this step was a query anchored multiple sequence alignment. Next, the two alignments were submitted to HHalign, which built one profile HMM per alignment and aligned the two profiles. Terminal gaps were added with a script and, since the alignment from HHalign already contained residues from the two initial query proteins, the end result was a pairwise alignment between the two sequences.

Johannes Söding describes [94] the pairwise alignment of HMMs in the HHsearch implementation. The log-sum-of-odds score is derived to compare the probability, over all possible sequences that can be emitted by each of the two profile HMMs, that these sequences were co-emitted by these profile HMMs against a null model. Next, the Viterbi dynamic programming algorithm [123] is applied to determine the co-emission path, one per profile HMM, that yields the maximum log-sum-of-odds score. In general, the Viterbi algorithm finds the most probable state path along an HMM. By saving pointers backwards and filling in a score matrix, it is possible to determine the most probable path by backtracking.
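The general idea of the Viterbi algorithm can be sketched on a small two-state HMM with the same emission and transition probabilities as the example in section 6.3. This is not the HMM-HMM co-emission variant used in HHsearch; the uniform start probabilities and the function name are assumptions made for the illustration.

```python
import math

STATES = (1, 2)
START = {1: 0.5, 2: 0.5}  # assumed uniform start probabilities
EMISSION = {1: {"A": 0.1, "B": 0.9}, 2: {"A": 0.5, "B": 0.5}}
TRANSITION = {1: {1: 0.7, 2: 0.3}, 2: {1: 0.1, 2: 0.9}}


def viterbi(observed):
    """Most probable state path for a non-empty observed sequence.

    Returns the path and its log probability.
    """
    # Score matrix, one column per observed character (log probabilities).
    score = [{s: math.log(START[s]) + math.log(EMISSION[s][observed[0]])
              for s in STATES}]
    pointers = [{}]
    for symbol in observed[1:]:
        column, back = {}, {}
        for s in STATES:
            # Best preceding state for reaching state s at this position.
            prev = max(STATES,
                       key=lambda p: score[-1][p] + math.log(TRANSITION[p][s]))
            column[s] = (score[-1][prev] + math.log(TRANSITION[prev][s])
                         + math.log(EMISSION[s][symbol]))
            back[s] = prev  # save the pointer for backtracking
        score.append(column)
        pointers.append(back)
    # Backtracking from the best final state.
    last = max(STATES, key=lambda s: score[-1][s])
    path = [last]
    for back in reversed(pointers[1:]):
        path.append(back[path[-1]])
    path.reverse()
    return path, score[-1][last]


path_b, log_prob_b = viterbi("BBB")
path_a, log_prob_a = viterbi("AAA")
```

A run of Bs is best explained by state 1, which emits B with 90% probability, whereas a run of As is best explained by state 2.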

6.4 Prediction of disorder regions

The characteristics of disorder regions in terms of sequence and structure motivate the creation of dedicated predictors. As Dosztányi and co-authors discuss [136], disorder regions cannot be regarded as equivalent to low complexity regions in terms of sequence composition. They do, however, tend to avoid amino acids with hydrophobic side chains and prefer those with polar and charged side chains. Further, some disorder regions contain transient secondary structure elements that take shape upon binding, while some loops form a well-defined structure in the native state. Thus, disorder regions are not equivalent to loops, and vice versa, in terms of structure. Schlessinger and co-authors developed NORSnet to distinguish long unstructured loops from structured loops [137].

A number of different disorder predictors are currently available, see reviews [27; 136; 138]. In papers III and IV, we chose to predict intrinsic disorder primarily with Disopred2 and Iupred. Disopred2 uses evolutionary information by considering a PSI-BLAST profile instead of a single sequence, while Iupred is a rapid method that relies on a model based on known and ordered structures of globular proteins. Below follows a description of these two predictors.

Disopred2

Disopred2 [139] is trained on a set of non-redundant protein structures from the PDB determined by X-ray crystallography. The selected proteins have less than 25% pairwise sequence identity and a resolution better than 2.0 Å. The set consisted of 715 protein chains with 176,550 residues experimentally classified as ordered and 4,590 residues as disordered. A PSI-BLAST profile was built for every sequence in the set. The classifier was then trained on these PSI-BLAST profiles. Disopred2 is a cascaded classifier with a support vector machine whose output serves as input to an artificial neural network. The final output is a binary classification of every residue in the query sequence as either ordered or disordered.


Iupred

Iupred [140] is based on the idea that disordered residues form contacts that are energetically less favorable than ordered contacts. The first stage is to calculate a 20 by 20 interaction potential matrix (M) using beta carbon to beta carbon distances following Thomas and Dill [141]. This matrix is calculated from 785 globular protein structures with less than 25% pairwise sequence identity and an X-ray crystallography resolution of 2.5 Å or better. A sequence based energy predictor matrix (P) is then defined. This matrix uses the amino-acid composition of a given residue together with sequentially adjacent residues to estimate the interaction energy that it contributes to the protein structure. Next, a window of 100 residues was used to calculate an M matrix for the globular structures. The protein sequences of the same set were used to fit a P matrix to the M matrix by solving a set of linear equations for every amino-acid type. The border between order and disorder was estimated by energy calculations using the P matrix on two sets of proteins: an ordered set of 559 globular proteins with known structures and a disordered set of 129 proteins with experimentally verified IDRs. The final output is a binary classification of every residue in the query sequence as either ordered or disordered.


Summary

Proteins are large molecules found inside our cells, in our cell membranes and outside our cells, where they perform functions that are vital to us. The genetic material that we have inherited from our parents, the genome, contains genes with instructions for which proteins can be made in the cell. At the same time, the evolutionary history of a protein reaches further back in time than the first human who walked the earth. On its journey through the genome from generation to generation, a gene has been exposed to changes caused by various mechanisms. Some events changed the length of the protein that the gene encodes. Changes of this kind are what has been investigated in the work presented in this thesis.

Different species were subjected to individual changes in their genes and thereby also in their proteins. A change in an already well-functioning protein entails a risk. A protein needs to fold into its correct three-dimensional structure, the native structure, in order to perform its function. If the protein becomes, for example, shorter or longer, this can lead to incorrect folding, incorrect function or other negative consequences.

The fact that some changes in protein length have nevertheless been accepted during evolution is evident when corresponding proteins from different species are compared. The four papers on which this thesis is based have investigated length variation in two kinds of proteins, those with repeated domains and those with disordered regions. The aim has been to reach a better understanding of how proteins have changed in length, and thereby to improve the understanding of which regions of a protein might tolerate artificial changes in length.

We investigated how Nebulin and Filamin, two relatively large proteins with several repeated domains each, have undergone variations in length. In agreement with earlier results, both proteins have undergone several duplications of domains roughly in the middle of their respective genes. While the results of the duplications have been roughly the same in corresponding Nebulin proteins from different species, the part that took part in the duplication has differed.

Further, we showed that variation in length is linked to variation in the number of disordered amino acids when related proteins from different species are compared. We found that just over half of all variation in length in baker's yeast consists of disordered amino acids. Our analysis shows that proteins with a higher content of disordered amino acids exhibit a larger variation in length. Finally, the analysis shows that it is usually disordered regions that grow or shrink, rather than such regions being introduced anew.

Acknowledgements

First and foremost, many, many thanks to Professor Arne Elofsson for allowing me to join the group, guiding me through my PhD and helping me to produce a good body of work. It has been an honor to work in such a prestigious group.

Next, I am very grateful to Sara Light for a fruitful and inspiring cooperation.

I express my gratitude to Åsa Björklund, as Åsa played an instrumental role in my familiarization with the work at hand. I am also grateful to Diana Ekman for the work she did concerning variation in length coupled to disorder regions.

I am grateful to Marcin Skwark, for being a responsive colleague and having an inspiring attitude towards research questions. To Minttu Virkki, for formulating excellent questions and being ready to contribute her extensive experimental knowledge. To Sara, Walter Basile and Per Warholm, for being part of the first batch that moved to SciLifeLab and making the transition smooth and fun. To Sikander Hayat, for following my work and giving tips about relevant research papers. To Walter, for giving valuable feedback on multiple occasions.

Last but not least, many thanks to Håkan Viklund, Maria Berns, Björn Wallner, Arjun Ray, Aron Hennerdal, Kristoffer Illergård, Andreas Bernsel, Jenny Falk, Linnea Hedin, Daniel Nilsson, Nanjiang Shu, Wiktor Jurkowski, Christoph Peters, Konstantinos Tsirigos, Karolis Uziela and Michel Mirco for being great co-workers and creating an encouraging atmosphere.
