COMPUTATIONAL AND EXPERIMENTAL
APPROACHES TO REGULATORY
GENETIC VARIATION
Malin Andersen
Royal Institute of Technology,
School of Biotechnology
Stockholm, 2007
© Malin Andersen
School of Biotechnology
Royal Institute of Technology
AlbaNova University Center
SE-106 91 Stockholm
Sweden
Printed at Universitetsservice US AB
Box 700 14
Stockholm
ISBN: 978-91-7178-827-6
TRITA-BIO-Report 2007:12
ISSN 1654-2312
Malin Andersen (2007): Computational and experimental approaches to regulatory
genetic variation
School of Biotechnology, Royal Institute of Technology (KTH), Stockholm,
Sweden
ISBN 978-91-7178-827-6
ABSTRACT
Genetic variation is a strong risk factor for many human diseases, including
diabetes, cancer, cardiovascular disease, depression, autoimmunity and asthma.
Most of the disease genes identified so far alter the amino acid sequences of
encoded proteins. However, a significant number of genetic variants affecting
complex diseases may alter the regulation of gene transcription. The map of the
regulatory elements in the human genome is still to a large extent unknown, and it
remains a challenge to separate the functional regulatory genetic variations from
linked neutral variations.
The objective of this thesis was to develop methods for the identification
of genetic variation with a potential to affect the transcriptional regulation of
human genes, and to analyze potential regulatory polymorphisms in the CD36
glycoprotein, a candidate gene for cardiovascular disease.
An in silico tool for the prediction of regulatory polymorphisms in human
genes was implemented and is available at www.cisreg.ca/RAVEN. The tool was
evaluated using experimentally verified regulatory single nucleotide polymorphisms
(SNPs) collected from the scientific literature, and tested in combination with
experimental detection of allele specific expression of target genes (allelic
imbalance). Regulatory SNPs were shown to be located in evolutionarily conserved
regions more often than background SNPs, but predicted transcription factor
binding sites were unable to enrich for regulatory SNPs unless additional
information linking transcription factors with the target genes was available.
The in silico tool was applied to the CD36 glycoprotein, a candidate gene
for cardiovascular disease. Potential regulatory SNPs in the alternative promoters of
this gene were identified and evaluated in vitro and in vivo using a clinical study for
coronary artery disease. We observed association with the plasma concentrations of
inflammation markers (serum amyloid A protein and C-reactive protein) in
myocardial infarction patients, which highlights the need for further analyses of
potential regulatory polymorphisms in this gene.
Taken together, this thesis describes an in silico approach to identify
putative regulatory polymorphisms which can be useful for directing limited
laboratory resources to the polymorphisms most likely to have a phenotypic effect.
Keywords: single nucleotide polymorphism (SNP), regulatory SNP, transcription
factor binding site, phylogenetic footprinting, allelic imbalance, EMSA, CD36,
cardiovascular disease
LIST OF PUBLICATIONS
The results presented in this thesis are based on the following papers, which will be
referred to in the text by their Roman numerals:
I. Malin C. Andersen, Pär G. Engström, Stuart Lithwick, David Arenillas, Per
Eriksson, Boris Lenhard, Wyeth W. Wasserman and Jacob Odeberg (2007) In
silico detection of sequence variations modifying transcriptional regulation.
PLoS Computational Biology, In press.
II. Lili Milani, Manu Gupta, Malin Andersen, Sumeer Dhar, Mårten Fryknäs,
Anders Isaksson, Rolf Larsson, Ann-Christine Syvänen (2007) Allelic
imbalance in gene expression as a guide to cis-acting regulatory single
nucleotide polymorphisms in cancer cells. Nucleic Acids Res. 35(5):e34.
III. Malin Andersen, Boris Lenhard, Carl Whatling, Per Eriksson and Jacob
Odeberg (2006) Alternative promoter usage of the membrane glycoprotein
CD36. BMC Mol. Biol. Mar 3;7:8.
IV. Louisa Cheung, Malin Andersen, Carolina Gustavsson, Jacob Odeberg,
Leandro Fernández-Pérez, Gunnar Norstedt and Petra Tollet-Egnell (2007)
Hormonal and nutritional regulation of alternative CD36 transcripts in rat
liver--a role for growth hormone in alternative exon usage. BMC Mol Biol. Jul
17;8:60.
V. Malin C. Andersen, Rachel M. Fisher, Kristina Holmberg, Ann Samnegård,
Anders Hamsten, Per Eriksson and Jacob Odeberg (2007) In silico prediction
of regulatory SNPs in the CD36 gene and evaluation of their effect in a
clinical study for coronary artery disease. Manuscript.
RELATED PUBLICATIONS
Jorge Andrade, Malin Andersen, Anna Sillén, Caroline Graff and Jacob Odeberg
(2007) The use of grid computing to drive data-intensive genetic research. Eur J
Hum Genet. Jun;15(6):694-702.
TABLE OF CONTENTS
INTRODUCTION
  INTRODUCTION TO GENETIC VARIATION
    Molecular bases of genetic variation in the human genome
    Linkage disequilibrium and haplotype analysis
    Genetic variation and complex diseases
    Linkage analysis and association studies
    The importance of finding the functional polymorphism
  GENETIC VARIATION AFFECTING GENE REGULATION
    Genetic variation affecting transcriptional regulation
    Experimental methods for the analysis of regulatory SNPs
    In silico predictions of regulatory SNPs
    Other regulatory genetic variation
  INTRODUCTION TO CD36
    CD36 function
    Gene structure
    CD36 deficiency
    CD36 and cardiovascular disease
PRESENT INVESTIGATIONS
  AIMS
  RESULTS AND DISCUSSION
    In silico prediction of regulatory polymorphisms (Paper I and II)
    Alternative first exon usage of the CD36 gene (Paper III and IV)
    Identification of potential regulatory SNPs in the CD36 gene and their evaluation in relation to coronary artery disease (Paper V)
  CONCLUDING REMARKS
FUTURE PERSPECTIVES
ABBREVIATIONS
ACKNOWLEDGEMENTS
REFERENCES
ORIGINAL PAPERS (APPENDICES I-V)
INTRODUCTION
INTRODUCTION TO GENETIC VARIATION
Every living organism, from the smallest bacterium to the largest mammal, contains
in its genome the information for synthesizing nearly all the molecules needed to
create a functional being. The human genome consists of roughly 3 billion base pairs which are
inherited in two copies, one from the mother and one from the father. Two
unrelated individuals have 99.9 % identical genomes [1,2], the remaining fraction
being important as it contributes to the phenotypic differences between the
individuals. In comparison, the human genome is approximately 98.8 % identical to
the genome of our closest relative the chimpanzee [3].
Most of the sequence variants within a single human being have no
immediate effect on his or her health; however, some variations do have
phenotypic consequences. Genetic variation can affect how we look, how we
behave, how susceptible we are to various diseases and how we respond to drug
treatment. Some traits, such as the ABO blood type [4] and certain uncommon
monogenic disorders such as Huntington’s disease, are dependent on the
information encoded in one single location in the genome, but most traits have a
more complex pattern of inheritance.
Identification of the functional DNA sequence variant responsible for a
genetic disease is crucial for understanding the molecular mechanisms behind the
disease and for pinpointing therapeutic targets. Most functional variations known today
alter the properties of the encoded protein by changing its amino acid sequence.
However, genetic variation affecting the regulation of genes may be another major
source of phenotypic variation among humans [5,6]. The aim of this thesis was to
use computational and experimental methods to identify sequence variants with a
potential to alter the regulation of encoded genes, and to apply some of these
methods to the human membrane glycoprotein CD36, a candidate gene for
cardiovascular disease.
Molecular bases of genetic variation in the human genome
Mutations are introduced into the genome if the replication machinery makes errors
during cell division, by exposure to different toxic compounds, by irradiation or by
viruses. When a mutation occurs in the germ line of an individual it will be
transferred to its offspring, and if the mutation is not immediately harmful for the
bearer it can be passed on to subsequent generations and allowed to stabilize in the
population. Common genetic variations that have a minor allele frequency (MAF)
of more than 1% in a population are called polymorphisms; more unusual genome
variants are simply referred to as “mutations”.
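The MAF criterion above can be illustrated with a minimal Python sketch. The input format (one two-letter genotype string per diploid individual) and the function names are assumptions made for this illustration only; the 1% threshold is the one stated in the text.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return (minor_allele, frequency) for a biallelic site.

    `genotypes` holds one two-character string per diploid individual,
    e.g. "AG" (a hypothetical input format chosen for this sketch).
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) == 1:
        # Only one allele observed in the sample: no minor allele.
        return None, 0.0
    if len(counts) != 2:
        raise ValueError("this sketch handles biallelic sites only")
    total = sum(counts.values())
    minor, n_minor = min(counts.items(), key=lambda item: item[1])
    return minor, n_minor / total


def classify_variant(genotypes, threshold=0.01):
    """Label a site using the MAF > 1% convention described in the text."""
    _, maf = minor_allele_frequency(genotypes)
    if maf == 0.0:
        return "monomorphic"
    return "polymorphism" if maf > threshold else "rare variant"


if __name__ == "__main__":
    sample = ["AA"] * 90 + ["AG"] * 9 + ["GG"] * 1
    print(minor_allele_frequency(sample), classify_variant(sample))
```

Note that the frequency is estimated from the sample at hand, so a variant close to the threshold may be classified differently in another population or in a larger sample.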
The most common type of genome variation is single nucleotide polymorphisms
(SNPs), which have been introduced into the genome by the substitution of one
single base. In addition there are many common one base insertion and deletion
polymorphisms. SNPs account for 90% of all polymorphisms in the human
genome [7,8]. The sequencing of the human genome provided a vast amount of
potential SNPs [9], and continuous re-sequencing efforts have increased the
number further and enabled a better estimation of their allele frequencies. In
October 2007 the major repository for SNPs (dbSNP) [10] contained 5.7 million
unique validated SNP records. From in-depth sequencing of a representative
collection of genome fragments corresponding to 1 % of the human genome, it has
been estimated that there is on average one SNP per 279 base pairs [11]. Although
this number is preliminary, as it is based on only a fraction of the genome, the
selected fraction is likely to be representative of the entire human genome [12].
Short tandem repeats (STR), or microsatellites, consist of short units of
DNA, typically 1-5 bases in length, that are repeated 10 to 100 times. STR
polymorphisms often have several alleles corresponding to several possible
numbers of repeats, which makes them more informative compared with biallelic
SNPs. STR genotypes within pedigrees are therefore often fully informative so that
the progenitor of a particular allele can be identified.
There are also several types of larger, structural variations in the genome.
Some are large enough to be identified using a microscope, such as aneuploidies
and rearrangements that are typically more than 3 Mb in size. Small insertions,
inversions and duplications (usually less than 1 kb) have been observed through
DNA sequencing. Structural variations that range in size from approximately 1 kb
to 3 Mb have been more difficult to observe, but during the last few years new
strategies and tools for efficient assessment of such intermediate scale variations
have emerged. Hundreds of submicroscopic copy number variants (CNVs) [13,14]
and inversions [15,16] have now been described in the human genome, and
other types of variation including cryptic translocations and uniparental disomy
appear to be fairly common [17].
Linkage disequilibrium and haplotype analysis
The mutation rate in humans is considered to be relatively low in relation to the
number of generations since the most recent common ancestor for any two
humans [11]. Most polymorphisms that have been allowed to stabilize in the
genome are therefore likely to be the result of single historical mutation events
rather than having occurred independently in several individuals. Each new genetic
variant that has been introduced into the genome was therefore initially linked to
the particular chromosomal background on which it occurred. Since the alleles of
closely located polymorphisms are linked on the chromosomes, they are inherited
together as a unit or haplotype (Figure 1).
Figure 1: Illustration of the origin of haplotypes. Each new polymorphism is initially linked
to the chromosomal background on which it occurred and therefore certain combinations
of allelic variants are inherited as a unit (a haplotype).
The co-inheritance of neighboring loci leads to association between the loci in a
population, a phenomenon known as linkage disequilibrium (LD). As
recombination takes place between the maternal and paternal chromosomes during
the formation of germ cells (meiosis), the complete linkage between two recently
introduced mutations is diluted with time. The likelihood of recombination taking
place between two SNPs increases with the distance between them; thus, linkage
disequilibrium between two polymorphic sites on average declines with distance.
However, the recombination rate has been shown to vary dramatically across the
genome and therefore the extent of LD between nearby markers varies [18].
Recombination hotspots and long regions of low recombination divide the genome
into discrete haplotype blocks, with high LD between SNPs located within the
same block and low LD between SNPs located in different blocks [19].
The International HapMap Consortium has carried out a large scale
evaluation of the haplotype structure across the entire human genome using DNA
from 269 individuals from four population groups [20]. In phase I of the HapMap
project, at least one common SNP (in this context a SNP with minor allele
frequency > 0.05) has been genotyped every 5 kb across the genome. In addition,
the ENCODE project has selected representative regions corresponding to 1% of
the human genome (referred to as the ENCODE regions) [12] and these have been
sequenced in depth in the HapMap project.
The results from the HapMap project have given us a preliminary map of
how genetic variation is distributed across the genome and how it varies among
populations [11]. Most SNPs have a relatively low minor allele frequency, indicated
by the fact that 46% of all variants in the ENCODE regions have a minor allele
frequency below 0.05. On the other hand, 90 % of the heterozygous sites within
one single individual are due to common SNPs with a minor allele frequency above
0.05.
Because the recombination rates vary dramatically across the genome, the
lengths of the haplotype blocks vary from 1 to 100 kb. The average haplotype block
spans between 30 and 70 SNPs, but the number of haplotypes with an allele
frequency above 0.05 is on average 4 to 5.6 depending on the population. The
typical SNP is therefore highly correlated with many of its neighbors. Considering
only common SNPs (MAF>0.05), one in five has 20 or more completely correlated
neighboring SNPs and one in three has more than five perfectly correlated
neighbors. In contrast, one in five SNPs has no perfectly correlated neighbor.
Phase II of the HapMap project showed that approximately 0.5-1.0 % of all
common SNPs are untaggable, which means that no other SNP within 100 kb is in
LD with the SNP with an r² value above 0.2 [21].
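As an illustration of the pairwise LD measures referred to above, the following Python sketch computes D, D' and r² for two biallelic SNPs using the standard definitions. It assumes that phased haplotype counts are available; the function name and the argument names are hypothetical.

```python
def ld_statistics(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r_squared) for two biallelic loci.

    Arguments are counts of the four phased haplotypes, where A/a are the
    alleles at the first locus and B/b the alleles at the second locus.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n          # frequency of allele A
    p_B = (n_AB + n_aB) / n          # frequency of allele B
    D = n_AB / n - p_A * p_B         # deviation from independence

    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic in the sample")
    r_squared = D * D / denominator

    # D' scales D by its maximum possible value given the allele frequencies.
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return D, D / d_max, r_squared


if __name__ == "__main__":
    # Only three of four haplotypes observed: D' = 1 although r^2 < 1.
    print(ld_statistics(60, 0, 20, 20))
```

The example shows why r² rather than D' is used when asking whether one SNP can stand in for another: D' reaches 1 as soon as one haplotype is absent, whereas r² reaches 1 only when the two SNPs are perfectly correlated.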
Because of the strong correlation between alleles of SNPs within a
haplotype block, knowing the structure of the haplotype blocks in the population
makes it possible for researchers to genotype only a few representative SNPs in
each block and from the results infer the genotypes at linked loci. The results from
the HapMap project clearly show that a small set of highly informative tag SNPs
capture a large fraction of the variation in the genome, but also that it is very
important to select good representative SNPs from each haplotype in order to take
advantage of the LD structure in the human genome in genotyping studies.
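The idea of selecting representative SNPs can be sketched as a simple greedy procedure. This is an illustrative heuristic only and is not claimed to be the selection method used by the HapMap project; the r² threshold of 0.8, the function name and the SNP identifiers are assumptions.

```python
def greedy_tag_snps(snps, r2, threshold=0.8):
    """Pick tag SNPs so that every SNP is a tag or is in high LD with one.

    `snps` is an ordered list of SNP identifiers and `r2` a dict mapping
    unordered pairs (stored as tuples) to pairwise r^2 values. Pairs that
    are missing from `r2` are treated as uncorrelated.
    """
    def captures(a, b):
        if a == b:
            return True
        return r2.get((a, b), r2.get((b, a), 0.0)) >= threshold

    uncovered = set(snps)
    tags = []
    while uncovered:
        # Choose the SNP that captures the most still-uncovered SNPs;
        # ties are resolved by the order of `snps`.
        best = max(snps, key=lambda s: sum(captures(s, u) for u in uncovered))
        tags.append(best)
        uncovered -= {u for u in uncovered if captures(best, u)}
    return tags


if __name__ == "__main__":
    snps = ["snp1", "snp2", "snp3", "snp4", "snp5"]
    r2 = {("snp1", "snp2"): 1.0, ("snp1", "snp3"): 0.9,
          ("snp2", "snp3"): 0.9, ("snp4", "snp5"): 0.3}
    print(greedy_tag_snps(snps, r2))
```

In this toy data set, three SNPs in strong LD are represented by a single tag, whereas the two weakly correlated SNPs must each be genotyped directly.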
Genetic variation and complex diseases
Family history is a strong risk factor for many human diseases, including diabetes,
cancer, cardiovascular disease, depression, autoimmunity and asthma, suggesting
that inherited genetic factors are important in the pathogenesis of these. Several
thousands of illnesses are known to have a genetic component, and the influence of
genetic variance on many other phenotypes has been demonstrated for example by
twin studies. In January 2007 the database Online Mendelian Inheritance in Man
(OMIM) had records of 3345 phenotypes for which the molecular basis is
established, and 2048 unique disease genes were identified [22].
The majority of the identified disease-causing genes are related to rare, highly
heritable Mendelian disorders where variation in one single gene is both necessary
and sufficient to cause the disease. However, most common complex diseases
depend upon a combination of hereditary and environmental factors, where one
particular gene variant is typically responsible for only a modest increase in disease
susceptibility and many common genetic variants are thought to contribute to
disease onset and progression. Although the increased absolute risk of developing
such a disease in a single carrier of a susceptibility allele is low, the elevated risk
with respect to the population prevalence makes the identification of these factors
relevant for public health [23]. Finding these complex patterns of inheritance is
much more difficult than identifying rare mutations causing Mendelian diseases.
Out of the 3345 phenotypes in OMIM for which the molecular basis is established,
only 375 are susceptibility phenotypes. Many of the susceptibility alleles linked to
these phenotypes are common polymorphisms [22].
Linkage analysis and association studies
There are two main approaches to demonstrate that a particular genomic locus is
associated with a trait and thereby implicated in the trait etiology, namely linkage
analysis and association studies. Linkage refers to the physical proximity of loci
along a chromosome. Two loci, for example the causative trait locus and a proximal
marker locus, are linked if they are located sufficiently close together so that their
alleles tend to be inherited together. Linkage analyses therefore seek to identify
marker loci that co-segregate with the trait within families. Affected pairs of
relatives, for example affected siblings, are analyzed and alleles shared between the
pairs are identified (Figure 2a). The idea behind linkage tests is then to identify
marker alleles that are co-inherited in affected relatives more often than what would
be expected if the marker and trait alleles were unlinked. Since linkage analyses seek
to reveal how alleles segregate in pedigrees, it is critically important to determine
whether the shared alleles are identical by descent (IBD; i.e. copies of the same
parental allele) or only identical by state (i.e. appearing the same since they have the
same genotype but derived from two different copies of the allele). This is easier if
the marker polymorphisms are multiallelic, and microsatellites or other multiallelic
markers are therefore preferred over SNPs.
Association studies seek to identify particular variants that are associated
with the phenotype at the population level (Figure 2b). Such an association can
arise if the underlying functional sequence variant is measured directly or if a
marker variant is in linkage disequilibrium with the functional variant. The
case-control study has been the most widely applied strategy. Here patients who already
have a disease are compared with appropriately matched control subjects. Other
study designs are also possible, such as the prospective cohort study in which
individuals are collected before the onset of disease and followed under the same
experimental protocol. In this way there is no bias for the selection of a control
population, but the approach requires more resources and patience to allow a
sufficient number of cases to emerge before the association studies can proceed
[24]. The basic concept behind linkage analysis and association studies is reviewed
by Borecki and Suarez [25].
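A basic case-control comparison for a single biallelic marker can be sketched as an allelic test on a 2x2 table of allele counts. The sketch below uses the Pearson chi-square statistic with one degree of freedom and the odds ratio; it is one simple choice among many test designs, it ignores issues such as multiple testing and population stratification, and all names are hypothetical.

```python
import math


def allelic_association(case_a, case_b, ctrl_a, ctrl_b):
    """Return (odds_ratio, chi_square, p_value) for a 2x2 allele count table.

    case_a / case_b: counts of allele a and allele b among case chromosomes.
    ctrl_a / ctrl_b: the corresponding counts among control chromosomes.
    """
    if min(case_a, case_b, ctrl_a, ctrl_b) <= 0:
        raise ValueError("this sketch requires all four counts to be positive")
    n = case_a + case_b + ctrl_a + ctrl_b
    cross_difference = case_a * ctrl_b - case_b * ctrl_a
    chi_square = n * cross_difference ** 2 / (
        (case_a + case_b) * (ctrl_a + ctrl_b)
        * (case_a + ctrl_a) * (case_b + ctrl_b)
    )
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return odds_ratio, chi_square, p_value


if __name__ == "__main__":
    print(allelic_association(60, 40, 40, 60))
```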
Figure 2: Linkage analyses seek to identify alleles that are identical by descent between
affected pairs of relatives (for example siblings). Association studies seek to identify
association at population level, between polymorphic markers and a particular trait.
Both linkage analysis and association studies can be applied to candidate genes or to
a whole genome scan. The candidate gene approach is characterized as a
hypothesis-testing approach since the choice of gene to analyze is based on prior
knowledge about the gene and the disease. In a genome scan anonymous genetic
markers are selected from throughout the genome, and all are tested for the
presence of a linked trait locus. Since there is no bias in selecting the markers, the
genome scan is a hypothesis-generating approach.
Genome wide linkage studies have suggested many susceptibility loci for
complex traits, for example in coronary artery disease [26,27,28], type 2 diabetes
[29,30], asthma [31] and Crohn’s disease [32,33]. The recent technological progress
in genotyping methodologies has introduced new opportunities for performing
genome wide scans much more efficiently. Hundreds of thousands of SNPs can be
genotyped in one experiment in a short time. The sequencing of the human
genome [9,34] and the HapMap project [11] have made large panels of verified
polymorphisms in humans available. Together, this progress has made whole
genome association (WGA) studies feasible. Indeed, during the last few years several
new highly significant disease loci have been identified using WGA, for example in
type 2 diabetes [35,36], cardiovascular disease [37,38], prostate cancer [39,40] and
Crohn’s disease [41,42]. The recent explosion in the number of disease associations
identified using WGA shows that this is a powerful technique to detect disease
susceptibility loci.
There are numerous methodological considerations to take into account
before conducting a linkage analysis or association study to identify risk alleles for
complex diseases, including issues regarding the study design (such as selection of
the appropriate control population), parameters affecting the power of the test
(such as the penetrance of the disease allele), and statistical issues (such as multiple
testing). For a comprehensive review of the design of association studies and
statistical procedures, see for example the article by Cardon and Bell [24].
The importance of finding the functional polymorphism
Linkage and association studies often reveal linkage to a genomic region containing
several thousands of SNPs. However, understanding the fundamental molecular
mechanisms behind a common genetic disease is the key problem in human
genetics and it is therefore important to identify the causative genetic variants.
Knowledge of the disease-causing alleles is necessary in order to understand the
nature of their pathogenic functions and to generate accurate models of the disease
in vitro and in vivo. These are important steps towards developing new therapeutic
interventions [6]. Moreover, the functional polymorphism provides optimal power for
detecting linkage to the trait locus. If the functional variant is not included in the
analysis it can be difficult to replicate linkage findings in a second cohort, since
linkage disequilibrium between the functional variant and analyzed markers can
differ between populations.
While structural genomic variation can affect the phenotype through a gene dosage
effect [43], common polymorphisms in protein coding loci can, in principle,
influence phenotypes by changing either the quality or the quantity of the encoded
proteins. Non-synonymous polymorphisms, which occur when a codon in the open
reading frame of the mRNA sequence is altered, give rise to proteins with altered
amino acid sequences. The effects of such variations can vary from almost nothing
if an amino acid is replaced with another amino acid of very similar chemical
properties, to a protein with drastically altered function or even loss of function if
the polymorphism introduces a premature stop codon.
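The range of effects described above can be made concrete with a small Python sketch that classifies a single codon change using the standard genetic code. The function and variable names are hypothetical, and the sketch considers only the codon itself, not the chemical similarity of the exchanged amino acids or the position of the change in the protein.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_codon_change(ref_codon, alt_codon):
    """Classify a coding change by comparing the encoded amino acids."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"      # premature stop codon introduced
    if ref_aa == "*":
        return "stop-loss"     # original stop codon removed
    return "missense"          # amino acid replaced by another


if __name__ == "__main__":
    for ref, alt in [("GAA", "GAG"), ("GAG", "GTG"), ("TGG", "TGA")]:
        print(ref, alt, classify_codon_change(ref, alt))
```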
The quantity of a protein can, for example, be affected by variation in
genomic regulatory elements leading to altered transcriptional regulation of the
gene. Variation in the 5’ or 3’ untranslated regions (UTR) of a gene can alter the
stability of the encoded mRNA molecule and thereby affect the amount of
produced protein.
There are numerous examples of monogenic Mendelian disorders caused
by missense or nonsense mutations in the amino acid coding parts of genes, for
example the trinucleotide repeat expansion in the HD gene in Huntington’s disease
[44], point mutations in the amino acid coding sequence of the CFTR gene causing
cystic fibrosis [45] and the single base substitution in FV (Factor V Leiden) which
is associated with deep vein thrombosis [46]. However, in complex genetic diseases
the functional alleles often have more subtle effects and regulatory polymorphisms
are suggested to be important for such susceptibility phenotypes [47].
GENETIC VARIATION AFFECTING GENE REGULATION
Although most of the disease alleles known today alter the amino acid sequences of
encoded proteins, there remain disease-associated genes for which there is no
difference in protein-coding information between individuals of different
phenotypes. Examples of such cases include the phenotypic variation in plasma
dopamine-beta-hydroxylase [48] and serum angiotensin I converting enzyme levels
[49].
In vivo studies have demonstrated that allele-specific differences in gene
expression can be observed for about 40% to 50% of human genes [50-53] and
that a portion of the differences can be attributed to genetic variation in
non-coding regulatory regions [54]. Polymorphisms can affect gene regulation through
either cis- or trans-acting effects. Cis-acting regulatory polymorphisms are those
that act on genes located within (or near) their own locus, for example by altering
the binding sites of DNA binding proteins. Trans-acting effects are observed when
variation in one gene (either qualitative or quantitative) influences the expression
level of a second gene located far away on the same chromosome or on another
chromosome.
Genome-wide linkage mappings of gene expression levels in model
organisms have been used to study trans- and cis-acting regulatory polymorphisms.
It has been shown that there is a higher fraction of SNPs close to genes for which
altered mRNA expression between individuals has been linked to the locus of the
same gene (self-linkage) compared to genes with no self-linkage [55,56], an
observation that may highlight the importance of cis-acting regulatory SNPs. A
recently published whole genome linkage analysis of gene expression levels in
human lymphocytes suggested that approximately 7 % of all transcripts are affected
by cis-regulatory polymorphisms, and trans-acting effects were observed for less
than 0.1% of the transcripts, indicating that the regulators with the strongest effect
tend to be located in cis [57]. Approximately the same results were obtained from
the back-to-back published whole genome association study of gene expression in
lymphoblastoid cell lines from the 269 individuals sampled in the HapMap project
[58].
While SNPs that alter the amino acid sequences of encoded proteins are
often readily identifiable because of the knowledge of the rules of gene translation,
the map of regulatory elements in the human genome is still sparse and the
identification of genetic variation altering such elements often relies on de novo
characterization of previously unknown functional sites. New insights into the
functional complexity of the non-amino acid coding fraction of the human genome
(reviewed in [59]) suggest that genetic variation can influence gene regulation
through many different molecular mechanisms, and the identification of each
functional allele is therefore a complex task.
Genetic variation affecting transcriptional regulation
The most abundant type of regulatory variation known today interferes with the
binding of transcription factors (TFs) to genomic regulatory elements, thereby
altering the rate at which the genes are transcribed. Well over a hundred such
cis-acting sequence variants have been shown to alter transcription in vitro [60,61,62].
Experimental methods to pinpoint regulatory polymorphisms in individual genes
have been successful, but these methods are time consuming. High throughput
methods for measuring RNA abundance of two allelic variants (allelic imbalance)
can indicate genes with linked cis-acting regulatory polymorphisms, but these
methods do not pinpoint the functional allele. Computational methods to
distinguish functional from neutral genomic variants could prove useful to direct
limited laboratory resources to sites most likely to exhibit a phenotypic effect.
However, these approaches still have limited capacity to detect true regulatory
polymorphisms. A combination of the currently available in silico, in vitro and in vivo
tools is therefore likely to be most efficient in the search for functional regulatory
polymorphisms. Below is a description of some of the tools available for such
analyses today, some of which have been used in the context of this thesis. For
in-depth reviews of experimental approaches to regulatory SNP analysis, please refer
to the articles by Buckland [6] and Prokunina and Alarcon-Riquelme [63].
Experimental methods for the analysis of regulatory SNPs
Electrophoretic mobility shift assay (EMSA)
Altered DNA-protein binding interactions between allelic variants can be detected
using electrophoretic mobility shift assays (EMSA) [64,65,66]. Double stranded
oligonucleotides of 20 to 25 bases in length are labeled and incubated with nuclear
extracts to allow any DNA binding protein to bind to its recognition sequence if it
is present. The mixture is analyzed on a polyacrylamide gel, and the formation of at
least one band in addition to the band corresponding to the free probe indicates
formation of a DNA-protein complex. If oligonucleotides corresponding to the
two alleles of an SNP are analyzed separately it is possible to assess the relative
ability of each sequence to bind to the protein. The effect of excess amounts of
unlabeled oligonucleotide corresponding to one but not the other allele can be
studied in a competitive assay. A decrease in intensity of the shifted band on the gel
in the competitive assay indicates allele specific binding of a factor in the nuclear
extracts (Figure 3). Addition of antibodies against the protein involved in the
complex formation can cause additional retardation of the shifted band (called a
supershift), which indicates that this particular nuclear protein is involved in the
interaction.
Figure 3: Cartoon of a competitive EMSA with oligonucleotides corresponding to both
alleles of an SNP. The first four lanes contain, from left to right, labeled oligo
corresponding to allele 1, labeled oligo for allele 1 and transcription factor, labeled oligo
corresponding to allele 2, labeled oligo corresponding to allele 2 plus transcription factor.
The last six lanes contain labeled oligo corresponding to allele 1 and transcription factor,
and in addition lanes 5-7 contain increasing amounts of unlabeled oligo for allele 1, and lanes 8-10 contain increasing amounts of unlabeled oligo for allele 2. Since the upper band is
stronger in lane 2 than in lane 4, the DNA-protein complex is stronger for allele 1 than for
allele 2. Lanes 5-7 show that the unlabeled oligo corresponding to allele 1 competes for the
protein. Lanes 8-10 show that unlabeled oligonucleotide corresponding to allele 2 is a less
efficient competitor. A pattern like this would suggest allele specific binding between allele 1
and the transcription factor.
Chromatin immunoprecipitation
Although EMSA is a widely used method for analysis of binding events between
DNA and the transcription factor in vitro, the identified sites are not always
functional in vivo. In vivo binding events can be analyzed using chromatin
immunoprecipitation [67], where formaldehyde crosslinking is used to freeze DNA-protein and protein-protein contacts at a given moment in living cells or tissues.
The crosslinked DNA is broken into smaller fragments (enzymatically or by
sonication) and exposed to antibodies capable of recognizing the bound
transcription factor. The DNA attached to the antibody is then amplified by PCR
and either sequenced or studied by hybridization to a genomic microarray (a
method called ChIP-on-chip) [68,69].
Reporter vector assays
The effect of potential regulatory SNPs on gene transcription can be assessed in
vitro using reporter vector constructs [70]. Fragments of DNA carrying the alleles of
the SNP can be cloned into a vector containing a weak promoter, or the whole
endogenous promoter carrying the two alleles of the SNP can be cloned into a
promoter-less vector. The vector carries a reporter gene (for example the firefly-derived luciferase gene or green fluorescent protein) whose expression will depend
on the inserted regulatory sequence. The different allelic constructs and a control
plasmid are transiently transfected into the appropriate cell line. The level of
reporter gene expression is measured for each allele of the SNP and compared with
the expression of the empty vector, and the impact of the allelic variants on
transcription can thereby be evaluated.
Allelic imbalance
The effects on transcription of cis-acting regulatory SNPs can be monitored in vivo
by analyzing coding SNPs in cDNA samples from heterozygous individuals. The
relative amount of the two alleles of an SNP in a cDNA sample reflects the relative
expression of the gene from the two copies of the chromosome. If the ratio
between the two alleles deviates from unity, the maternal and paternal copies of the
gene are transcribed at different rates (allelic imbalance, AI), suggesting that
regulatory variation in linkage disequilibrium with the observed coding SNP affects
the transcription [50,52]. The relative amounts of the two alleles can be measured
by a DNA sequencing or genotyping technique. Since the allelic variants have been
exposed to the same environmental influences and trans-acting factors, the allelic
imbalance is likely to be due to cis-acting regulatory variation or epigenetic factors.
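The ratio calculation described above can be sketched as follows. This is a minimal illustration, not a procedure from the cited studies: the function names, the normalization against genomic DNA from the same heterozygous individual (to correct for assay bias between alleles), and the 1.5-fold threshold are all assumptions made for the example.

```python
def allelic_ratio(cdna_a1: float, cdna_a2: float,
                  gdna_a1: float, gdna_a2: float) -> float:
    """Return the cDNA allele ratio normalized by the genomic DNA ratio.

    Inputs are allele signals (for example peak heights or read counts)
    for the two alleles of a coding SNP in a heterozygous individual.
    The genomic DNA normalization is an assumption of this sketch.
    """
    if cdna_a1 < 0 or min(cdna_a2, gdna_a1, gdna_a2) <= 0:
        raise ValueError("signals must be positive")
    return (cdna_a1 / cdna_a2) / (gdna_a1 / gdna_a2)


def shows_imbalance(ratio: float, fold_threshold: float = 1.5) -> bool:
    """Flag a deviation from unity in either direction (hypothetical cutoff)."""
    return ratio >= fold_threshold or ratio <= 1.0 / fold_threshold


# Hypothetical signals: allele 1 is over-represented in cDNA only.
ratio = allelic_ratio(cdna_a1=60, cdna_a2=40, gdna_a1=50, gdna_a2=50)
print(ratio, shows_imbalance(ratio))
```

In practice the cutoff would have to be chosen from replicate measurements, since the technical variation of the genotyping method determines how small a deviation from unity can be trusted.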
HaploChIP
A limitation of using allelic imbalance in mRNA expression is that it requires
polymorphisms to be present in the RNA transcript (or at least in the pre-mRNA).
An alternative approach is to use chromatin immunoprecipitation against for
example Ser5-phosphorylated RNA polymerase II, and sequence the precipitated
DNA molecules. Phosphorylation of Ser5 on RNA polymerase II takes place when
the protein is released from the initiation complex and starts synthesizing the new
mRNA molecule, and is therefore a marker of transcriptional activity. If there are
SNPs within the precipitated genomic region, and if one allele or haplotype is
detected at a higher frequency than 50%, this is evidence of preferred transcription
of that allele or haplotype [71].
Linkage mapping
Cis- and trans-acting regulatory variation can be characterized by linkage mapping
of expression levels, where variation in expression levels between individuals is
treated as a quantitative trait and mapped (using pedigrees) to genomic loci.
Variation in mRNA expression levels that is linked to the locus of the same gene
(self-linkage) is likely to be dependent on variation in cis-regulatory elements, and
variation in mRNA expression levels linked to other loci is likely due to trans-acting
effects [55,56].
In silico predictions of regulatory SNPs
The detection of allelic variants with a potential to alter the binding affinity to
transcription factors is fundamental for in silico predictions of regulatory
polymorphisms. Eukaryotic transcription factors tolerate a considerable amount of
variation in their target binding sites. However, certain positions are highly
conserved between functional sites, and genetic variations in such positions are
therefore likely to affect the binding affinity between the DNA and the
corresponding transcription factor. A brief introduction to the field of transcription
factor binding site (TFBS) prediction research will be given in this section. For
more in-depth reviews please see the articles by Stormo [72], Wasserman and
Sandelin [73] and MacIsaac and Fraenkel [74].
Modeling and predicting transcription factor binding sites
To predict de novo transcription factor binding sites, a model that captures the
degenerate nature of the preferred binding sequences must first be derived based
on multiple examples of functional sites. Collections of binding sites can be
compiled from experimentally verified functional binding sites in the genome, or by
high throughput in vitro site selection assays [75]. The collected binding sites are
aligned to capture the preferred binding pattern of the transcription factor (Figure
4a). It is possible to describe the degenerate DNA pattern that makes up a TFBS
using a consensus sequence (Figure 4b), but a disadvantage is that one single
symbol cannot quantitatively describe the nucleotide distribution at a degenerate
position in the binding site. To allow for this, a binding site is often represented by
a position frequency matrix (PFM), describing the nucleotide frequencies in every
position of the site (Figure 4c).
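As a minimal sketch of the step from aligned binding sites to a position frequency matrix and a consensus sequence, the following example uses a hypothetical toy alignment. The function names and the tie-breaking rule are assumptions for illustration only, and a single most-frequent base is used instead of IUPAC degeneracy symbols.

```python
BASES = "ACGT"


def build_pfm(sites):
    """Count each nucleotide in every column of equally long, aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * length for base in BASES}
    for site in sites:
        for pos, base in enumerate(site.upper()):
            pfm[base][pos] += 1  # raises KeyError for symbols other than A, C, G, T
    return pfm


def consensus(pfm):
    """Most frequent base per column; ties are resolved in A, C, G, T order."""
    length = len(pfm["A"])
    return "".join(max(BASES, key=lambda b: pfm[b][pos]) for pos in range(length))


# Hypothetical aligned sites, not taken from any database.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pfm = build_pfm(sites)
print(pfm["C"], consensus(pfm))
```

The consensus string hides the fact that the fourth column is C in three sites and G in one, which is exactly the quantitative information the PFM retains.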
Figure 4: Overview of the construction of PWMs and sequence logos from a
collection of experimental transcription factor binding sites [79], and the use of
PWMs for scoring unknown DNA sequences.
The PFM can be viewed as a table of probabilities of observing each nucleotide in
every position of the TFBS. The preferred binding pattern of a transcription factor
can be visualized graphically by a sequence logo [76], enabling a fast and intuitive
visual assessment of the pattern (Figure 4d).
Once a TFBS model is built, it can be used to identify occurrences of the
binding pattern in DNA sequences. For efficient computational analysis, the PFM
is converted to a log-scale before being used to score new sequences. To eliminate
null values before log-conversion, and in part to correct for small samples of
binding sites, a pseudo count is also added to each cell. The final log-scale
converted matrix is called a position weight matrix, abbreviated PWM (Figure 4e)
(reviewed by Stormo [72]). Databases such as Transfac [77] and JASPAR [78]
contain collections of PWMs for the currently evaluated transcription factors. The
PWM can be used to score DNA sequences for the presence of potential binding
sites by summing the contributing score for each relevant nucleotide in the profile.
When evaluating longer sequences, the PWM is slid over the sequence in 1 bp
increments, evaluating every possible binding site in the sequence (Figure 4f).
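The conversion and scanning steps can be sketched as below. This is one common formulation, assumed here for illustration: the weights are log2-odds against a uniform background of 0.25 per base, and a total pseudocount of 1 is split evenly over the four bases. The toy matrix and the function names are hypothetical.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds weights after adding a pseudocount to each cell."""
    length = len(pfm["A"])
    n_sites = sum(pfm[base][0] for base in BASES)
    return {
        base: [
            math.log2(((pfm[base][i] + pseudocount / 4) / (n_sites + pseudocount))
                      / background)
            for i in range(length)
        ]
        for base in BASES
    }


def scan(sequence, pwm):
    """Slide the PWM over the sequence in 1 bp steps; return (start, score) pairs."""
    width = len(pwm["A"])
    seq = sequence.upper()
    return [
        (start, sum(pwm[seq[start + i]][i] for i in range(width)))
        for start in range(len(seq) - width + 1)
    ]


# Hypothetical 4 bp matrix built from four sites.
toy_pfm = {"A": [0, 0, 4, 1], "C": [0, 0, 0, 3], "G": [0, 4, 0, 0], "T": [4, 0, 0, 0]}
pwm = pfm_to_pwm(toy_pfm)
hits = scan("AATGACTT", pwm)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

Only the forward strand is scanned in this sketch; a complete search would also score the reverse complement of each window.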
The absolute score from a PWM has been shown to be directly
proportional to the binding energy of the DNA-protein interaction [80,81]. This
implies that matrix models of TFBSs can be used to score the two alleles of a
polymorphism in a regulatory region and from the score difference estimate the
effect on the TF binding affinity.
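A minimal sketch of this allele comparison is given below, assuming a hypothetical 3 bp PWM with made-up weights. For each allele, every window that overlaps the SNP position is scored and the best score is kept; taking the best window per allele, rather than comparing the same window for both alleles, is a design choice of this example.

```python
def window_score(window, pwm):
    """Sum the PWM weights for the bases in one window."""
    return sum(pwm[base][i] for i, base in enumerate(window))


def allele_score_delta(flank5, allele1, allele2, flank3, pwm):
    """Return (best score allele 1, best score allele 2, difference).

    Only windows overlapping the SNP are considered. Raises ValueError
    if the flanks are too short to fit a single window.
    """
    width = len(pwm["A"])
    snp = len(flank5)
    best = []
    for allele in (allele1, allele2):
        seq = (flank5 + allele + flank3).upper()
        first = max(0, snp - width + 1)
        last = min(snp, len(seq) - width)
        best.append(max(window_score(seq[s:s + width], pwm)
                        for s in range(first, last + 1)))
    return best[0], best[1], best[0] - best[1]


# Hypothetical weights favoring the sequence "TAC".
toy_pwm = {
    "A": [-2.0, 1.5, -2.0],
    "C": [-2.0, -2.0, 1.5],
    "G": [-2.0, -2.0, -2.0],
    "T": [1.5, -2.0, -2.0],
}
result = allele_score_delta("GGT", "A", "G", "CGG", toy_pwm)
print(result)
```

Here the A allele completes the preferred site and the G allele disrupts its central position, so the positive difference suggests weaker predicted binding to the G allele.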
One limitation of PWMs is that the nucleotide observed at one position in
the binding site is assumed to have no effect on the likelihood of observing a
certain nucleotide at an adjacent site [82]. A high throughput analysis of the binding
affinities between the MAX A and MAX B transcription factors and their target
binding sites recently showed that PWMs tend to overestimate differences in free
binding energies between sequence variants that differ by three or more bp [83].
However, most predictions for two bp deviations were correct, supporting the use
of PWMs for assessing the impact of SNPs on consensus and near-consensus
binding sites.
Discovery of TFBS by pattern recognition
Another limitation in using PWMs for the identification of potential transcription
factor binding sites is that the data for constructing the models is limited and
therefore high quality PWMs are missing for a large number of human transcription
factors. An alternative approach to identify preferred binding sites is to look for
overrepresented DNA motifs in the promoter regions of co-regulated genes, or in
genomic regions enriched for transcription factor binding using the ChIP-chip
assay [74].
PWMs are prone to produce false positive TFBS predictions
A significant proportion of the binding sites predicted by PWMs are capable of
binding the corresponding transcription factors in vitro [84]. However, the short and
degenerate nature of the TFBS means that PWMs are likely to generate many false
positive predictions. A PWM for the transcription factor MEF2 was for example
shown to give one predicted site per 1700 bp [85]. It is not biologically realistic that
all these sites are functional in vivo. Although PWMs are capable of describing the
DNA binding properties of transcription factors, additional information about the
regulatory region must therefore be incorporated for the detection of in vivo
biologically relevant sites.
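To make the scale of the problem concrete, the reported rate of one predicted site per 1700 bp can be turned into an expected number of predictions for a region of a given size. The sketch below assumes, as a simplification, that predictions are spread uniformly along the sequence; the region length is a hypothetical example.

```python
def expected_hits(region_length_bp: float, bp_per_hit: float = 1700) -> float:
    """Expected number of predicted sites, assuming a uniform prediction rate."""
    return region_length_bp / bp_per_hit


# A hypothetical 10 kb upstream region scanned with a single PWM.
print(round(expected_hits(10_000), 1))
```

Scanning the same region with many PWMs multiplies the number of predictions accordingly, which is why additional filters are needed before any candidate is taken to the laboratory.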
Phylogenetic footprinting
One type of additional information that is readily available is the evolutionary
conservation of functional non-coding genomic sequences. Under the hypothesis
that the regulatory mechanism of a gene is conserved between species, it is likely
that mutations in regulatory elements have been less tolerated during evolution and
that these sites therefore are more conserved between species than the background
sequence. This is the underlying principle behind phylogenetic footprinting, which
can be used to eliminate regions less likely to contain cis-acting regulatory sites and
thereby increase the specificity of predictions generated with PWMs [73,86,87].
Restricting the search for regulatory genetic variation to conserved regions can be
expected to increase the likelihood of identifying functional variants.
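The filtering idea can be sketched as follows, assuming that a per-base conservation score between 0 and 1 is available for the analyzed region. The scores, site coordinates, threshold and function name are all hypothetical, and using the mean score over each site is only one of several possible criteria.

```python
from statistics import mean


def conserved_sites(sites, conservation, threshold=0.5):
    """Keep (start, end, name) sites whose mean score over [start, end) reaches the threshold."""
    kept = []
    for start, end, name in sites:
        if mean(conservation[start:end]) >= threshold:
            kept.append((start, end, name))
    return kept


# Hypothetical per-base conservation scores for a 10 bp region.
conservation = [0.1, 0.1, 0.2, 0.9, 0.8, 0.9, 1.0, 0.2, 0.1, 0.0]
# Hypothetical predicted sites as half-open intervals.
sites = [(0, 3, "site_a"), (3, 7, "site_b"), (6, 10, "site_c")]
print(conserved_sites(sites, conservation))
```

Raising the threshold removes more background predictions but also discards functional sites that are not conserved, so the cutoff trades specificity against sensitivity.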
Moses et al showed that experimentally verified transcription factor
binding sites in the promoter regions of four yeast genomes were more conserved
than the background sequence in the analyzed promoters. They also showed that
the rate of evolution in TFBSs varied within the binding sites so that the positions
important for binding were more conserved than the more degenerate positions in
the TFBS [88]. Conserved regions in alignments between orthologous genomic
sequences in human and mouse have been widely used in the search for functional
regulatory elements, both because the appropriate data have been available for
several years and because the genomes have been shown to be particularly useful in
enriching for human regulatory elements [86,89]. For example, Wasserman et al
showed that 74 out of 75 experimentally defined sequence specific binding sites of
skeletal-muscle-specific transcription factors were confined to the 19% of the
human sequences that were most conserved in the orthologous mouse sequences
[86].
However, not all regulatory elements that are functional in vivo are
conserved between species, and the suitable evolutionary distance for comparison
varies depending on the analyzed gene. Dermitzakis et al showed that in a
collection of 64 functional TFBS in genomic sequences that were alignable between
human and rodent, 33 sites had shared function between the species, 17 sites were
specific for rodent and 14 sites were specific for human. From genome wide
mappings of the liver-specific transcription factors FOXA2, HNF1A, HNF4A and
HNF6 in human and mouse hepatocytes using the ChIP-chip technology, Odom et
al showed that 41% to 89% of the binding events appeared to be species specific
despite the conserved function of the transcription factors [90]. This indicates that
although evolutionary conservation provides help in finding functional sites, a
significant proportion of the functional sites will be eliminated.
Thanks to the recent explosion in the number of sequenced genomes, and
to the development of tools that allow construction and evaluation of alignments of
several whole genomes, it is possible to increase the specificity of phylogenetic
footprinting by comparing multiple species. When scoring multiple alignments for
evolutionarily conserved regions it is necessary to consider the phylogeny of the
represented species, as is done for example in the PhastCons program
[91]. From the functional analysis of 1% of the human genome in the ENCODE
pilot project, it was shown that 4.9% of the human genome is evolutionarily
constrained when compared to the orthologous sequences in 14 mammalian species
[92]. Out of these constrained sequences, 40% overlapped with protein coding
exons and their associated untranslated regions and 20% overlapped with
experimentally verified non-coding functional elements such as regulatory elements.
The ENCODE project has used ChIP-chip technology to identify functional
transcription factor binding regions in vivo and 55% of these regulatory regions
reside within the evolutionarily constrained sequences [92].
Cis-regulatory modules
The binding of transcription factors in vivo is not only a function of the binding
affinity between the protein and its DNA site. In higher eukaryotes TFs rarely
operate alone, but generally bind to DNA in cooperation with other DNA binding
factors, and their binding sites are often organized into clusters, so-called cis-regulatory modules (CRMs). Incorporation of the modularity of TFBSs in cis-regulatory regions can increase the signal to noise ratio of TFBS predictions, as was
shown for example in [93,94] and reviewed in [73]. Analysis of overlap between
potential regulatory SNPs and predicted cis-regulatory modules could be an
interesting approach to increase the specificity of predictions of regulatory SNPs
based on individual PWMs.
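One simple way to look for such clusters is sketched below. It is not the method of the cited studies: the window size, the minimum number of distinct factors, the factor names and the function name are assumptions, and the approach reports one candidate cluster per anchoring site, so overlapping clusters are returned.

```python
def find_clusters(sites, window=200, min_factors=2):
    """Report windows in which several distinct factors have predicted sites.

    sites: list of (position, factor_name) tuples.
    Returns (start, end, sorted factor names) for every site that anchors a
    window of `window` bp containing at least `min_factors` distinct factors.
    """
    ordered = sorted(sites)
    clusters = []
    for i, (start, _) in enumerate(ordered):
        members = [(pos, factor) for pos, factor in ordered[i:] if pos - start <= window]
        factors = {factor for _, factor in members}
        if len(factors) >= min_factors:
            clusters.append((start, members[-1][0], sorted(factors)))
    return clusters


# Hypothetical predicted sites for three placeholder factors.
predicted = [(100, "TF_A"), (150, "TF_B"), (180, "TF_C"), (2000, "TF_A"), (5000, "TF_C")]
print(find_clusters(predicted, min_factors=3))
```

A SNP could then be prioritized if it falls inside a predicted site that is also a member of such a cluster.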
Distance to the transcription start site is a relevant factor when selecting potential regulatory SNPs
It has been shown that there is a strong bias in the location of functional regulatory
SNPs towards the proximity of transcription start sites (TSSs) [62,95]. Although
this might reflect ascertainment bias in studies based on regulatory polymorphisms
from the literature since regulatory SNPs in the proximal promoters have generally
been more analyzed in vitro than more upstream variants [95], this cannot explain
the results by Buckland et al [62].
Buckland et al searched for regulatory polymorphisms in the regions from
approximately 700 bp upstream of the TSS to the TSS in 247 genes by cloning the
corresponding sequence from 16 individuals into luciferase reporter constructs. The
polymorphic sites responsible for differences in expression between individuals
were identified by analyzing the cloned sequences in denaturing high performance
liquid chromatography. They noticed that the polymorphisms identified as
functional in their assay were located closer to the TSS of the respective genes than
SNPs with no effect on reporter gene expression. Although the set of identified
regulatory polymorphisms was small (only 40 regulatory allelic variants were
detected), the results indicate that SNPs located in the first 200 bases upstream of
the TSS are more likely to have a functional effect than those further upstream.
In relation to this it is important to note that more than half of the human
protein coding genes have alternative first exons, utilized in a tissue specific fashion
[96-99], and that the transcription start site for one specific first exon can vary
[96,97]. Incorporation of information of alternative transcription start sites in the
selection of candidate regulatory SNPs is therefore highly relevant.
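A minimal sketch of how alternative transcription start sites could enter the selection is shown below. The coordinates are hypothetical, the function names are assumptions, and the 200 bp limit simply mirrors the region highlighted in the discussion above rather than a validated cutoff.

```python
def distance_to_nearest_tss(snp_pos, tss_positions, strand="+"):
    """Signed offset in bp to the nearest TSS; negative values are upstream."""
    sign = 1 if strand == "+" else -1
    offsets = [sign * (snp_pos - tss) for tss in tss_positions]
    return min(offsets, key=abs)


def is_proximal_upstream(offset, limit=200):
    """True if the SNP lies within `limit` bp upstream of the TSS."""
    return -limit <= offset < 0


# Hypothetical gene with two alternative first exons on the forward strand.
alternative_tss = [10_000, 25_000]
offset = distance_to_nearest_tss(24_850, alternative_tss)
print(offset, is_proximal_upstream(offset))
```

With only the first TSS annotated, the same SNP would appear to lie almost 15 kb downstream and would probably be ignored, which illustrates why alternative start sites matter for candidate selection.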
Tools for the in silico prediction of regulatory SNPs
Several resources for the identification of SNPs within predicted TFBSs are
available on the internet, both using TFBS predictions alone and in combination
with evolutionarily conserved regions. Stepanova et al presented a database of SNPs
affecting putative TFBSs based on predictions using Transfac PWMs [100]. Zhao et
al presented a database called PromoLign that contains pre-computed SNP analyses
of the 10 kb sequence upstream of more than 6400 human-mouse orthologous
gene pairs [101]. Paper I in this thesis describes RAVEN, which is a web-based tool
for the analysis of SNPs affecting putative transcription factor binding sites based
on PWMs in JASPAR [78] and evolutionary conservation based on phastCons
scores [91]. Montgomery et al have compiled a database of experimentally verified
regulatory SNPs [61].
Other regulatory genetic variation
The focus of this thesis is variation affecting transcriptional regulation.
Nevertheless, it should be noted that several other types of non-protein-coding
polymorphisms may affect the expression of proteins and thereby influence human
phenotypes. Polymorphisms in the 3’ or 5’ untranslated regions of genes can alter
expression both by affecting the stability of the mRNA molecule and by altering the
efficiency of translation, for example by introducing or removing upstream open
reading frames. Another class of functional regulatory polymorphisms interferes
with mRNA splicing. Mutations can alter splice donor sites, splice acceptor sites or
exonic splicing enhancers. This gives rise to alternative splicing of the pre-mRNA
and often insertion of premature stop codons, leading to a truncated protein. Given
the complexity of the non-protein-coding fraction of the human genome, and the
importance of non-coding regulatory RNA molecules in multi cellular organisms,
polymorphisms affecting non-coding RNA molecules are also likely to contribute
to human traits [59].
INTRODUCTION TO CD36
CD36 is a candidate gene for many traits involved in cardiovascular disease, and we
hypothesized that regulatory polymorphisms might influence the expression of the
gene. A brief introduction to the function and complex regulation of this protein is
needed at this point. For a more complete review of the role of CD36 in
cardiovascular disease, and of the molecular basis of the genetic variation in the
CD36 locus, please see the articles by Silverstein and Febbraio [102], Nicholson and
Hajjar [103], Collot-Teixeira et al [104] and Rac et al [105].
CD36 function
CD36 is an 88 kDa membrane glycoprotein expressed on the surface of many
tissues and cell types, including adipocytes, skeletal muscle, endothelial cells, liver,
monocytes and macrophages. On cells that rely on fat as an energy source, such as
heart, muscle and fat cells, CD36 is involved in the energy metabolism by binding
and transporting long chain fatty acids across the cell membrane into the cells
[106,107,108]. On phagocytic cells such as macrophages, CD36 functions as a
scavenger receptor by binding and internalizing oxidized low density lipoprotein
(oxLDL) [109]. The protein is also involved in induction of apoptosis together with
other membrane proteins, and it is an adhesion protein capable of interacting with
collagen and thrombospondin as well as thrombospondin-like peptides expressed
on the surface of malaria infected erythrocytes [110,111].
Gene structure
The CD36 gene is located on chromosome 7q11.2 in human [112] and it consists
of 15 exons, of which exons 1, 2 and 15 are untranslated. Exons 3 and 14 encode
the N-terminal and C-terminal domains of the CD36 protein, and these exons also
encode the two transmembrane regions. Exons 4 to 13 encode the extracellular
domain of the protein, containing the different binding sites for the interaction with
thrombospondin, long chain fatty acids, collagen, apoptotic cells and oxLDL
(reviewed in [105]).
The gene has several alternative first exons [113,114,115] (and Paper III in
this thesis), and it has a large 5’ UTR which folds into a structure that has been
shown to influence the translational efficiency of the mRNA sequence [116].
Several isoforms of the protein exist due to alternative splicing of the CD36 pre-mRNA.
CD36 deficiency
Two types of CD36 deficiency exist among humans: in type I deficiency the
protein is absent on the surface of all human cells, whereas in type II CD36
deficiency the protein is absent on the surface of platelets but expressed at nearly
normal levels on the surface of monocytes and macrophages. CD36 deficiency is
present in 2% to 3% of Japanese, Thais, and Africans, but in less than 0.3% of
Americans of European descent [117,118,119].
CD36 and cardiovascular disease
Animal models have suggested that CD36 has a significant role both in the
progression of atherosclerosis and in insulin resistance, which is a strong risk factor
for cardiovascular disease [120]. Spontaneously hypertensive rats show insulin
resistance, hypertension, hypertriglyceridaemia, reduced plasma concentrations of
high density lipoprotein (HDL) and metabolic defects in adipocytes, and the
phenotype has been linked to a defect in the CD36 gene [121,122,123]. In human
studies CD36 deficiency has been associated with insulin resistance, decreased fatty
acid uptake, increased plasma non-esterified fatty acid concentrations [124,125],
elevation in serum low-density lipoprotein (LDL) cholesterol [126] and a decreased
uptake of oxLDL [127] by macrophages.
Atherosclerosis is a chronic inflammatory response in the walls of arteries,
characterized by a gradual thickening and hardening of the arteries and the
formation of atheromatous plaques. If a plaque ruptures, a thrombus can form,
with myocardial infarction or stroke as clinical complications.
One of the initial steps in the development of atherosclerosis is that macrophages
are recruited to the plaque area, and by binding and internalizing oxLDL they
develop into lipid laden cells called foam cells that constitute the core of the plaque.
Since CD36 is one of the principal receptors responsible for the binding and uptake
of oxLDL in macrophages [109,128,129], it may play a crucial role in the
progression of macrophages to foam cells. In vitro, CD36 has been demonstrated to have a
role in the pro-inflammatory response of macrophages when exposed to oxLDL,
with a significant positive correlation between the expression of CD36 and pro-inflammatory genes [130].
Common polymorphisms in the CD36 locus have been associated with
type 2 diabetes, insulin resistance [131], increased plasma concentrations of non-esterified fatty acids and triglycerides, as well as increased risk for cardiovascular
disease in patients with type 2 diabetes [132]. The causative polymorphisms
underlying these associations have not yet been identified experimentally. Several
amino acid altering SNPs are reported for CD36 in dbSNP [10], but they appear to
be monomorphic in European populations, in which the above associations were
observed. Given the complex regulation of this gene we therefore hypothesized
that there might be regulatory polymorphisms affecting the expression of the gene
and ultimately some of the complex traits associated with it in relation to
cardiovascular disease.
PRESENT INVESTIGATIONS
AIMS
The overall aim of this thesis was to develop methods to detect genetic variation
with a potential to alter transcriptional regulation of human genes, and to analyze
potential regulatory polymorphisms in the CD36 glycoprotein, a candidate gene for
cardiovascular disease.
In particular, the aims were:
• To develop a sequence based in silico prediction tool for the detection of
polymorphisms with a potential to alter transcriptional regulation. The tool
should be available as a user friendly web-based application that enables
analysis of any gene of interest.
• To evaluate the bioinformatics driven approach for selecting potential
regulatory polymorphisms using experimentally verified regulatory SNPs
from the literature.
• To evaluate if the combination of experimentally determined allelic
imbalance in the expression of target genes and in silico predictions of
regulatory SNPs in these genes can aid in the identification of functional
variants.
• To analyze the alternative promoter usage of the CD36 gene in human and
rodent as a preparation for regulatory SNP identification, to expand
previous (incomplete) reports of the structure and expression pattern of
the gene.
• To use the in silico based approach to detect potential regulatory SNPs in the
CD36 gene and analyze whether these SNPs are associated with
phenotypes relevant for cardiovascular disease.
RESULTS AND DISCUSSION
In silico prediction of regulatory polymorphisms (Paper I and II)
Development of a web-based tool for prediction of regulatory SNPs (Paper I)
Position-specific weight matrices (PWMs) have been useful for predicting cis-regulatory elements, especially when phylogenetic footprinting is used to eliminate a
fraction of the false positive transcription factor binding site (TFBS) predictions
[72,86]. Since the scores produced by PWMs are proportional to the binding energy
between the DNA and the transcription factor [81], a score difference between two
alleles of an SNP in a predicted TFBS should, theoretically, reflect a difference in
binding energy between the two sequences. We therefore developed an algorithm
that combines phylogenetic footprinting with detection of putative TFBSs affected
by SNPs to identify polymorphisms with the potential to alter gene transcription.
To facilitate efficient analyses, computational methods and newly
implemented algorithms were developed as an integrated framework for regulatory
SNP analysis. The framework includes all the components for the location and
extraction of data from genome and SNP databases, pattern detection, phylogenetic
footprinting and SNP effect estimation. A web interface to the application, entitled
RAVEN (Regulatory Analysis of Variation in Enhancers), was developed to enable
analysis of almost any gene of interest. Genes are located directly using keywords or
identifiers and the user defines the genomic region to analyze (for example an
upstream, downstream or intronic region of the analyzed gene). The central screen
of RAVEN is a genome browser like view of the analyzed genomic region, with
SNPs, predicted TFBSs affected by SNPs, evolutionarily conserved regions, repeat
elements and mRNA transcripts mapped to it (Figure 5).
Resources for predicted regulatory SNPs have been presented before, for
example as databases of SNPs in predicted TFBS [100], and in combination with
conserved regions from human-mouse alignments [101]. RAVEN expands on the
functionalities of these databases by enabling users to explore regulatory elements
outside of the 5’ upstream proximal promoter, analyzing user-supplied
polymorphisms (not limited to SNPs) and by using phastCons scores from multiple
alignments for phylogenetic footprinting. Since the user defines the genomic
region, the analysis is not restricted to regulatory polymorphisms around one
annotated transcription start site, which is important if a gene has several alternative
promoters.
Figure 5: The graphical results view of the RAVEN application. The results view contains
(from the top) genomic coordinates, SNPs from major SNP databases, personal SNPs,
predicted transcription factor binding sites affected by SNPs, predicted transcription factor
binding sites affected by SNPs and in conserved regions, conserved sequence segments,
repeat sequences, the conservation profile, reference transcripts and coordinates relative to
the transcription start site of the analyzed gene.
Evaluation of the in silico tool using regulatory SNPs collected from the literature (Paper I)
In order to evaluate if the in silico approach could enrich for regulatory
polymorphisms we compiled a collection of experimentally verified regulatory
SNPs from published papers. We required that the two alleles of the SNPs should
show allele specific binding to nuclear extracts or purified transcription factors in
an EMSA, and that the two alleles should show altered expression levels of a
reporter gene. We managed to collect 104 examples of such single base
substitutions. For 20 of the verified regulatory SNPs the associated transcription
factor had been identified using supershift. For comparison we also collected a
background set of SNPs from the genomic regions from -10 kb to the TSS of all
human genes with a mouse ortholog (N=26044).
We tested all regulatory SNPs and 4000 background SNPs randomly selected from
the larger data set for overlap with putative TFBSs using PWMs from the JASPAR
database [78], and assigned a score delta to every SNP according to:
$$\text{score delta}(\text{SNP})_{\text{PWM}} = \text{score}(\text{allele 1})_{\text{PWM}} - \text{score}(\text{allele 2})_{\text{PWM}}$$
The result from this analysis suggested that nearly all SNPs overlapped and affected
potential TF binding sites, and there was no difference in the distribution of score
delta values between the regulatory and background SNPs. This is in line with
another recently published evaluation of properties of genomic regions containing
regulatory SNPs, which showed that PWM scores alone are poor predictors of
functional regulatory SNPs [95].
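The score delta for a single PWM can be illustrated with a minimal Python sketch. The count matrix, the uniform background, the pseudocount, the forward-strand-only scan and the choice of the best-scoring window overlapping the SNP are all assumptions made for illustration; they are not taken from RAVEN or from any JASPAR profile.

```python
import math

# Hypothetical 4-position count matrix (not a real JASPAR profile).
COUNTS = {
    "A": [8, 1, 0, 9],
    "C": [1, 0, 10, 0],
    "G": [0, 9, 0, 1],
    "T": [1, 0, 0, 0],
}
BACKGROUND = 0.25   # assumed uniform base frequencies
PSEUDOCOUNT = 1.0   # assumed, avoids log(0) for unseen bases


def log_odds_matrix(counts):
    """Convert a count matrix into a log2-odds position weight matrix."""
    width = len(counts["A"])
    pwm = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + 4 * PSEUDOCOUNT
        pwm.append({b: math.log2(((counts[b][i] + PSEUDOCOUNT) / total) / BACKGROUND)
                    for b in "ACGT"})
    return pwm


def best_score(seq, snp_pos, pwm):
    """Highest PWM score among forward-strand windows that overlap snp_pos."""
    width = len(pwm)
    starts = range(max(0, snp_pos - width + 1), min(snp_pos, len(seq) - width) + 1)
    return max(sum(pwm[i][seq[s + i]] for i in range(width)) for s in starts)


def score_delta(flank5, allele1, allele2, flank3, pwm):
    """score(allele 1) - score(allele 2) for one PWM."""
    pos = len(flank5)
    s1 = best_score(flank5 + allele1 + flank3, pos, pwm)
    s2 = best_score(flank5 + allele2 + flank3, pos, pwm)
    return s1 - s2


pwm = log_odds_matrix(COUNTS)
# Hypothetical SNP (G/T) with three flanking bases on each side.
print(round(score_delta("TTA", "G", "T", "CAT", pwm), 2))
```

Because almost any short sequence matches some matrix reasonably well, a non-zero delta of this kind is expected for most SNPs, which is consistent with the lack of discrimination described above.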
We next tested the application of phylogenetic footprinting to assess the
method’s capacity to enrich for functional regulatory SNPs. The results showed
that the SNPs with a documented effect on gene regulation were more frequently
located within evolutionarily conserved sequences than the background SNPs.
For example, when using a phastCons score threshold of 0.4 to define conserved
regions, approximately 28% of the regulatory SNPs were retained compared with
only 9% of the background SNPs. These results show that although
evolutionary conservation does help in enriching for regulatory SNPs, a significant
proportion of the regulatory SNPs will be eliminated when applying stringent
conservation constraints. Also, even after the application of phylogenetic
footprinting there will be some false positive predictions since nearly all SNPs
overlap with putative transcription factor binding sites, and since 9% of the
background SNPs are located in conserved regions. Using background SNPs is a
necessary simplification due to the difficulty in collecting a data set of documented
neutral SNPs, and a fraction of the falsely predicted background SNPs could in fact
be true regulatory SNPs.
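The trade-off described above can be made concrete with a short calculation. The sketch below uses only the figures quoted in the text (104 regulatory SNPs, 4000 sampled background SNPs, 28% and 9% retained at a phastCons threshold of 0.4). Because the percentages are approximate, the derived counts are approximate as well.

```python
N_REGULATORY = 104          # verified regulatory SNPs (from the text)
N_BACKGROUND = 4000         # sampled background SNPs (from the text)
FRAC_REG_CONSERVED = 0.28   # approximate fraction retained at phastCons >= 0.4
FRAC_BG_CONSERVED = 0.09    # approximate fraction retained at phastCons >= 0.4

# Fold enrichment of regulatory over background SNPs in conserved regions.
fold_enrichment = FRAC_REG_CONSERVED / FRAC_BG_CONSERVED

# Approximate counts implied by the rounded percentages.
retained_regulatory = round(FRAC_REG_CONSERVED * N_REGULATORY)
lost_regulatory = N_REGULATORY - retained_regulatory
retained_background = round(FRAC_BG_CONSERVED * N_BACKGROUND)

print(f"fold enrichment: {fold_enrichment:.1f}")
print(f"regulatory SNPs retained / lost: {retained_regulatory} / {lost_regulatory}")
print(f"background SNPs retained: {retained_background}")
```

The filter enriches roughly threefold, but it discards about three quarters of the verified regulatory SNPs.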
When we analyzed the regulatory SNPs for which the affected
transcription factor binding site was identified using supershift, we observed that
the number of false predictions was drastically decreased when the predictions were
limited to those corresponding to the PWMs of the verified TFs. This is perhaps
not surprising, but the observation suggests that prior information about which
transcription factor is involved in the regulation of an analyzed gene is necessary in
order to make meaningful predictions.
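Restricting predictions to the matrices of candidate transcription factors amounts to a simple filter. The following Python sketch illustrates the idea; the record layout, the SNP identifiers and the score values are hypothetical and do not reflect the RAVEN output format.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    snp_id: str         # hypothetical identifier
    tf_name: str        # transcription factor whose PWM matched
    score_delta: float  # score(allele 1) - score(allele 2)


def restrict_to_candidates(predictions, candidate_tfs):
    """Keep only predictions whose PWM belongs to a candidate transcription factor."""
    wanted = {name.upper() for name in candidate_tfs}
    return [p for p in predictions if p.tf_name.upper() in wanted]


# Hypothetical predictions for two SNPs.
predictions = [
    Prediction("snp_1", "SP1", 2.1),
    Prediction("snp_1", "REL", -1.4),
    Prediction("snp_2", "GATA1", 0.8),
]

# Prior knowledge (for example a supershift result) names a single factor.
for p in restrict_to_candidates(predictions, {"REL"}):
    print(p.snp_id, p.tf_name, p.score_delta)
```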
In reality, the transcription factor associated with a not-yet identified
regulatory region is seldom known, but suggestive prior data to motivate directed
analysis can be derived from many sources. In addition to the scientific literature,
candidate transcription factors can be selected based on associated Gene Ontology
terms [133] in common with a target gene. High throughput proteomics initiatives
such as the Human Protein Atlas program [134] can highlight transcription factors
expressed in the tissues relevant for the target gene and the studied disease.
In silico predictions of regulatory SNPs in genes with allelic imbalance (Paper II)
A second study facilitated evaluation of the in silico tool in combination with
experimental evidence of allelic imbalance (AI) of the target gene in cancer cell
lines. Using RNA from 13 human tumor cell lines, allelic imbalance was observed
in 41 out of 160 candidate genes involved in cancer progression and in response to
anticancer drugs. A previous version of RAVEN was used to select putative
regulatory SNPs in the upstream genomic regions of these 41 genes. In this version
of RAVEN the phylogenetic footprinting analysis was based on human-mouse
alignments. About 100 SNPs were selected that were both located in genomic
regions conserved between human and mouse and that overlapped and affected
putative TFBSs.
Allelic imbalance of a coding SNP in heterozygous individuals gives an
indication that there are cis-acting regulatory effects favoring the transcription of
one chromosome over the other. The idea is that genetic variation in cis-acting
regulatory elements should be in LD with the analyzed coding SNP, and that one
allele of the regulatory SNP is located on the same chromosome as the highly
expressed coding allele, the other allele of the regulatory SNP being linked to the
less abundantly expressed coding allele. The potential regulatory SNPs identified
using RAVEN were therefore genotyped in the cell lines, and those SNPs that were
heterozygous in the same cell lines in which the allelic imbalance was detected were
selected for evaluation using EMSA (N=15).
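The selection step can be sketched in Python as follows. The data structures, gene names and SNP identifiers are hypothetical, and the rule that one cell line with both allelic imbalance and a heterozygous genotype is sufficient is an assumption made for this sketch.

```python
def is_heterozygous(genotype):
    """True for a two-allele genotype string with different alleles, e.g. 'AG'."""
    return len(genotype) == 2 and genotype[0] != genotype[1]


def select_candidates(snp_gene, genotypes, ai_cell_lines):
    """Keep SNPs that are heterozygous in a cell line showing AI for their gene."""
    selected = []
    for snp, gene in snp_gene.items():
        lines = ai_cell_lines.get(gene, set())
        if any(is_heterozygous(genotypes.get((snp, line), "")) for line in lines):
            selected.append(snp)
    return sorted(selected)


# Hypothetical input: predicted regulatory SNPs and the gene each one lies upstream of.
snp_gene = {"snp_a": "GENE1", "snp_b": "GENE1", "snp_c": "GENE2"}
# Hypothetical genotypes per (SNP, cell line).
genotypes = {
    ("snp_a", "HeLa"): "AG",
    ("snp_b", "HeLa"): "CC",
    ("snp_c", "HeLa"): "CT",
    ("snp_c", "LineX"): "TT",
}
# Hypothetical cell lines in which allelic imbalance was detected for each gene.
ai_cell_lines = {"GENE1": {"HeLa"}, "GENE2": {"LineX"}}

print(select_candidates(snp_gene, genotypes, ai_cell_lines))
```

In this toy example snp_c is heterozygous in one cell line, but not in the line where its gene shows allelic imbalance, so it is not selected.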
Electrophoretic mobility shift assays were performed for these 15 SNPs,
using nuclear extracts from one of the analyzed cell lines (HeLa). Eight out of the
fifteen SNPs that were analyzed using EMSA showed reproducible evidence of
allele specific binding to proteins present in the nuclear extracts. This success rate is
comparable with the results from another study, in which potential regulatory SNPs
were identified and evaluated using EMSA [135]. EMSA reflects DNA-protein
binding interactions in highly artificial conditions, since the DNA-protein
interaction is studied completely out of its chromatin environment. However, the
fact that allelic imbalance was observed in vivo supports the results of the EMSA
study, and gives credibility to the interpretation that these SNPs in fact influence
the gene regulation.
The relatively high success rate in EMSA was encouraging, suggesting that
the combination of detection of allelic imbalance and in silico prediction of potential
regulatory SNPs is a good approach to pinpoint regulatory SNPs. However, it is
also possible that the selection of SNPs that were heterozygous in the same cell line
that showed evidence of AI contributed to the results. Allelic imbalance is thought
to be caused by non-coding regulatory SNPs that are heterozygous in the analyzed
individual (unless the observed AI is caused by epigenetic effects). If there is a
number of non-coding SNPs in the vicinity of a gene, the cell line in which AI was
detected is likely to be heterozygous for only a fraction of these non-coding SNPs,
eliminating some of the candidate regulatory SNPs in the locus.
Out of approximately 100 SNPs selected based on in silico predictions, only 15 were
shown to be heterozygous in the same cell lines that showed allelic imbalance,
suggesting that the other 85 cases could not be responsible for the observed AI.
Although the remaining 85 predicted regulatory SNPs may influence gene
expression in other tissues, cell types, or under different environmental conditions,
it is likely that some represent false positive predictions. This highlights the
difficulties of applying the in silico approach in a high throughput manner without
prior information about which transcription factor is involved in the altered
regulation of the gene.
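The numbers in this section form a simple selection funnel, summarized in the sketch below. The starting value of 100 is the approximate figure given in the text, so the derived fractions are approximate.

```python
predicted = 100      # "approximately 100" SNPs selected in silico (approximate)
heterozygous = 15    # heterozygous in the cell lines showing allelic imbalance
emsa_positive = 8    # showed allele specific binding in EMSA

not_heterozygous = predicted - heterozygous
frac_heterozygous = heterozygous / predicted
frac_emsa_positive = emsa_positive / heterozygous

print(f"excluded by the heterozygosity filter: {not_heterozygous}")
print(f"fraction passing the heterozygosity filter: {frac_heterozygous:.0%}")
print(f"fraction of tested SNPs with allele specific binding: {frac_emsa_positive:.0%}")
```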
Alternative first exon usage of the CD36 gene (Papers III and IV)
The CD36 gene has a strong tissue specific expression pattern, which is apparent in
type-II deficient patients who have lost the CD36 protein on the surface of
platelets but have nearly normal expression of the protein on the surface of
monocytes and macrophages. Tissue specific regulation of CD36 has also been
observed in rodents in response to two different anti-diabetic drugs [136], and a
female predominant expression of the gene has been observed in rat and human
liver [137].
Given the fact that the gene has at least five alternative first exons in
human, the tissue and gender specific regulation of the gene may be mediated
through the alternative promoters. In order to get an overview of how the
alternative first exons of CD36 are used in various tissues and cell types, and in
response to external stimuli, we investigated their expression patterns in human and
rodent tissues using real time RT-PCR. This prepared us for the subsequent in silico
identification of potential regulatory SNPs in the gene (Paper V). An overview of
the alternative first exons, with primers used for real time RT-PCR in human and
rat, is shown in Figure 6.
Figure 6: Alternative first exons and corresponding mRNA transcripts for the human and
rat CD36 genes. Forward and reverse primers used in the RT-PCR reactions are indicated
with arrows.
Alternative first exon usage in human tissues (Paper III)
We evaluated relative expression levels of the different first exons of CD36 in
human tissues with a central role in energy metabolism: liver, fat and muscle (heart
and skeletal muscle) where the role of the protein as a fatty acid transporter is
central, in monocytes where the protein is a scavenger receptor, as well as in
various other tissues and cell types. The relative expression levels of CD36 in the
different tissues varied between the analyzed alternative first exons, suggesting that
they are regulated independently and tissue specifically. For example, most
alternative first exons were relatively highly expressed in adipose tissue. Monocytes
on the other hand expressed relatively high levels of exons 1b, 1e, 1f and 3-4, but
no expression at all was detected of exon 1c.
A semi-quantitative analysis of the expression levels of the alternative first
exons within the samples suggested that exons 1a, 1b and 1c were the main
contributors to the total CD36 expression in the heart, skeletal muscle, adipose
tissue and placenta samples, and that exon 1b was the main contributor in the liver
and monocyte samples.
CD36 is one of the principal receptors for the binding and uptake of
oxidized low density lipoprotein (oxLDL) in macrophages [109,128,129], and since
the gene is upregulated by its own ligand in a positive feed-back loop [138,139] we
evaluated whether this upregulation was caused by activation of one particular
promoter. To this end we used cells from a human monocytic leukemia cell line,
THP-1. The cells were differentiated into macrophages by treatment with PMA,
and we measured the relative expression of the alternative first exons of CD36
before and after incubation with oxLDL. The results showed that all alternative
first exons were upregulated in response to oxLDL, except for exon 1c which was
not detected at all in the THP-1 macrophages even after treatment with oxLDL.
This suggests that the effect of oxLDL on CD36 expression in macrophages is not
mediated strictly through one of the alternative promoters, and we speculate that a
locus control mechanism may be involved.
Alternative first exon usage in rat and mouse (Paper IV)
The CD36 gene is located on the small arm of chromosome 4 in rat. The genomic
sequence corresponding to the alternative promoters of the gene was not yet
available in the rat genome, which made the analysis of the alternative promoters
problematic. Alternative first exons 1a and 1b have been described in mouse
[114,140], and there were rat EST sequences in GenBank with high sequence
similarity to the alternative mouse transcripts. PCR primers were therefore designed
based on a comparison between the rat EST sequences and the annotated mouse
transcripts. These primers were used to amplify the alternative transcripts
corresponding to exon 1a and 1b using cDNA from rat liver. By sequencing the
amplified DNA we observed two sequence species that differed only in their
5’ regions, where the part of the sequence that differed between the sequences
corresponded to exons 1a and 1b and the common sequence corresponded to
exons 2 and 3. The alternative first exons 1a and 1b therefore appeared to be
mutually exclusive in rat liver. The presence of alternative promoters of CD36 in
human, mouse and rat supports the hypothesis that the complex first exon
structure has a physiological relevance for regulating the expression of the gene.
Upon sequencing of the alternative first exons 1a and 1b, we evaluated the
expression of the two mRNA isoforms in male and female rat and mouse liver. A
female predominant expression of both alternative first exons was observed both in
rat and mouse, but the relative difference in expression was larger in rat, and exon
1a showed a greater fold difference than exon 1b.
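Fold differences of this kind are commonly derived from real time RT-PCR threshold cycles (Ct). The text does not state which quantification model was used, so the sketch below assumes the widely used 2^-ΔΔCt approximation, with roughly 100% amplification efficiency and normalization to a reference gene. All Ct values are hypothetical.

```python
def fold_change(ct_target_sample, ct_ref_sample, ct_target_calibrator, ct_ref_calibrator):
    """Relative expression of a target transcript by the 2^-ddCt approximation.

    Assumes roughly 100% amplification efficiency for both primer pairs.
    """
    delta_sample = ct_target_sample - ct_ref_sample
    delta_calibrator = ct_target_calibrator - ct_ref_calibrator
    return 2.0 ** -(delta_sample - delta_calibrator)


# Hypothetical Ct values for one first-exon transcript in female liver (sample)
# and male liver (calibrator), each normalized to a reference gene.
female_vs_male = fold_change(24.0, 18.0, 26.0, 18.0)
print(f"female / male expression: {female_vs_male:.1f}-fold")
```

A target transcript that crosses the threshold two cycles earlier in the sample than in the calibrator, at equal reference gene Ct, corresponds to a fourfold higher expression under these assumptions.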
The use of model organisms enabled us to study the alternative first exon
usage of CD36 in male and female individuals in response to specific environmental
conditions. Many sex-specific differences in gene expression in hepatic cells can be
explained by differences in the amounts and expression patterns of certain
hormones, for example estrogen and growth hormone (GH). In order to find
putative hormonal factors behind the female predominant CD36 expression in liver
we investigated if the expression of the gene in male rats could be “feminized” by
treatment with estrogen and continuous (female specific) administration of GH.
The expression was also analyzed in liver from old male rats. The results showed
that male rats treated with a female specific pattern of GH showed significantly
higher levels of the alternative first exons of CD36 compared with untreated male
rats, exon 1a showing a larger fold-difference. Male rats treated with estrogen
showed significantly higher levels of exon 1a in comparison with untreated male
rats, but no difference was observed for exon 1b. High age increased the expression
of both alternative first exons, again with a larger fold difference for exon 1a.
We also analyzed the expression of CD36 in male and female rat liver and
skeletal muscle in response to food deprivation. Animals that have been without
food for 12 hours should have an increased lipolysis in comparison with those that
have been without food for 4 hours. Since CD36 has a central role in transporting
free fatty acids into tissues that can use fat as an energy source, it was interesting to
analyze sex-specific differences in expression of the alternative first exons in
response to this condition. Both male and female rats showed increased expression
of CD36 in skeletal muscle after food deprivation, in line with the fact that muscle
cells have increased fatty acid metabolism upon starvation. In liver, on the other
hand, the higher basal expression of CD36 in females was reduced in
response to starvation, but no effect was observed in male rats.
Taken together, these results show that the alternative first exons of
CD36 are regulated both sex-specifically and tissue specifically under various
hormonal and nutritional conditions in rodents. The effects involve both alternative
first exons of the gene, but the higher fold differences observed for exon 1a in
some of the experiments may indicate that a promoter specific effect is involved.
Given this sex-specific regulation of CD36 in rodents and in humans [137], it is
possible that potential polymorphisms in regulatory regions of the gene can
influence males and females differently.
Identification of potential regulatory SNPs in the CD36 gene and
their evaluation in relation to coronary artery disease (Paper V)
As described in the introduction, CD36 may be involved in several processes of
relevance for cardiovascular disease, such as metabolic status, inflammation and
atherosclerosis. Given the complex regulation of the CD36 gene with at least five
alternative first exons, and the fact that the amino acid altering polymorphisms
reported for the gene in dbSNP appear to be monomorphic in European
populations, we hypothesized that regulatory SNPs could affect some of the
processes in which CD36 is involved. In Paper III we observed that alternative first
exons 1a, 1b and 1c are the main contributors to the total CD36 expression in
skeletal muscle, heart and adipose tissue, and that monocytes mainly express the
gene from exon 1b. We therefore used the in silico tool described in Paper I to
identify potential regulatory SNPs upstream of these alternative first exons, and
evaluated their effect in vitro using EMSA and in vivo in a clinical study of coronary
artery disease. This enabled us to evaluate whether the in silico approach for
predicting regulatory SNPs could be useful in a clinical study.
We identified one SNP located in an evolutionarily conserved region
upstream of exon 1c which affected a potential binding site for c-REL, a
transcription factor we considered relevant in relation to CD36 since it is involved
in immune regulation. We also identified one SNP located in a conserved region
upstream of exon 1a. No SNPs were located in evolutionarily conserved regions in
the proximal promoter of exon 1b, but one SNP was selected because it was
located only 2 base pairs away from the transcription start site of the alternative
first exon. In order to capture as much information as possible in the subsequent
association study, we also selected 5 marker SNPs distributed across the CD36
locus. A representation of the CD36 gene, with selected SNPs and haplotype
blocks observed in the studied population is shown in Figure 7.
Figure 7: Overview of the CD36 gene and SNPs selected for the association study. The
lines below the mRNA mappings represent haplotype blocks inferred from the genotypes in
the studied population.
The function of the SNPs upstream of exons 1a and 1c was tested in vitro using
EMSAs on nuclear extracts from THP-1 cells. The major C-allele of the SNP
upstream of exon 1c (rs1194182) showed stronger binding to protein(s) in the
nuclear extracts than did the minor G-allele. The same SNP affected a predicted
binding site for c-REL, but the in silico results were not in line with the in vitro
results since the PWM score was higher for the G-allele than for the C-allele.
Moreover, we were unable to detect a supershift using an antibody against c-REL,
indicating that the allele specific binding pattern observed using EMSA could be
caused by protein(s) other than c-REL.
The SNPs were genotyped in a clinical study for coronary artery disease,
selected because it contained data on several of the parameters and phenotypes that
CD36 may influence based on the literature. The cohort consisted of 387 patients
admitted to the hospital for myocardial infarction (MI) and 387 healthy age and sex
matched control subjects. In brief, 82% of the participants were men, the median
age was 54 years and median BMI was 26.8 and 25.6 in the MI patients and control
subjects respectively [141]. We observed significant differences in plasma
concentrations of the inflammatory markers C-reactive protein (CRP) and serum
amyloid A (SAA) protein between MI patients with different genotypes of the
SNPs in block 2, with lower levels in the heterozygous individuals. No differences
were observed between control subjects of different genotypes. We also observed
associations between the SNPs rs819454 in block 1 and rs7789369 in block 2 and
“mean % stenosis”, and between the rs819454 SNP and “plaque area /mm”. These
parameters reflect the severity and extension of coronary artery disease, analyzed by
quantitative coronary angiography.
The aim of this study was to test if the in silico approach for selecting
regulatory SNPs described in Paper I could be used to identify regulatory
polymorphisms in the CD36 gene with an effect in a clinical cohort. The observed
associations between the potential regulatory SNPs in the promoters of exons 1c
and 1a and inflammatory markers, and the observed allele specific binding of
nuclear extract protein(s) to the exon 1c SNP support the use of the bioinformatics
driven approach. As discussed previously, EMSA reflects binding events between
short DNA sequences and nuclear proteins out of the chromatin environment of
the analyzed gene, and further studies in vitro and in vivo must be carried out in order
to establish if the SNP upstream of exon 1c in fact alters the transcriptional
regulation of CD36, or if the observed associations are caused by strong LD
between the analyzed SNPs and unknown functional polymorphisms. An
interesting experiment would for example be to correlate the genotypes of the
promoter SNPs with the expression levels of the gene in relevant tissues and cell
types. However, given the relative ease and low cost of genotyping, the strategy of
testing predicted regulatory SNPs in a clinical cohort prior to more extensive
functional studies in vitro and in vivo may be cost effective when searching for
potential disease mechanisms.
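The strength of LD between a genotyped SNP and a possible unknown functional variant is often summarized by the r² statistic. The sketch below uses the standard definition from haplotype and allele frequencies. The frequencies are hypothetical and are not taken from the studied cohort.

```python
def ld_r_squared(p_ab, p_a, p_b):
    """r^2 between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B at SNP 2
    p_a, p_b: frequencies of allele A and allele B
    """
    for p in (p_a, p_b):
        if not 0.0 < p < 1.0:
            raise ValueError("allele frequencies must lie strictly between 0 and 1")
    d = p_ab - p_a * p_b   # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Hypothetical frequencies: the two alleles always occur together (complete LD).
print(round(ld_r_squared(0.30, 0.30, 0.30), 3))
# Hypothetical frequencies: the alleles are statistically independent (no LD).
print(round(ld_r_squared(0.06, 0.20, 0.30), 3))
```

When r² is close to 1, an association signal at the genotyped SNP cannot be distinguished statistically from one at the linked variant, which is why functional follow-up is needed.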
Clearly there are many unsolved issues regarding the associations observed
in Paper V. They appear to be caused by lower levels of inflammation in individuals
that are heterozygous for the associated SNPs, which does not fit the standard
dominant or allele dosage effect models. Also, due to the risk of obtaining positive
signals just by chance, the results from an association study must always be
interpreted with caution, especially when performing multiple comparisons.
Nevertheless, the observed associations between promoter polymorphisms in
CD36 and the inflammatory status of patients with an early onset myocardial
infarction highlight the need for further analyses of the potential regulatory SNPs in
this gene.
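The difference between the standard genetic models and a heterozygote-specific effect can be illustrated by how genotypes are coded before they are tested against a trait. The coding below is a generic sketch; the model names and the example alleles are illustrative and are not taken from Paper V.

```python
def encode(genotype, minor_allele, model):
    """Numeric coding of a two-allele genotype string under a given genetic model."""
    n_minor = genotype.count(minor_allele)
    if model == "additive":        # allele dosage: 0, 1 or 2 copies
        return n_minor
    if model == "dominant":        # at least one copy of the minor allele
        return int(n_minor >= 1)
    if model == "heterozygote":    # exactly one copy (heterozygote-specific effect)
        return int(n_minor == 1)
    raise ValueError(f"unknown model: {model}")


# Hypothetical C/G SNP with G as the minor allele.
for genotype in ("CC", "CG", "GG"):
    codes = {m: encode(genotype, "G", m) for m in ("additive", "dominant", "heterozygote")}
    print(genotype, codes)
```

Under the additive and dominant codings the minor-allele homozygote is always grouped with, or ranked beyond, the heterozygote, so neither can describe a trait that is lowest in heterozygous individuals.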
CONCLUDING REMARKS
Paper I: Position-specific weight matrices can be used to model a difference in
binding strengths between a transcription factor and the two alleles of an SNP, but
due to their low specificity they could not be used to distinguish regulatory from
background SNPs in the absence of additional data. Regulatory SNPs were shown to
be located within evolutionarily conserved regions more often than background
SNPs. However, even the combination of PWMs and phylogenetic footprinting
resulted in many false positive predictions, and it will therefore be important to
integrate prior knowledge of which transcription factor is likely to be involved in
the regulation of a gene in order to make meaningful predictions.
Paper II: Using the prediction tool described in Paper I, potential regulatory SNPs
were identified in genes showing allelic imbalance in tumor cell lines. Predicted
regulatory SNPs that were heterozygous in the cell line in which allelic imbalance
was observed were analyzed in vitro using EMSAs. Around half of the analyzed
SNPs showed evidence of allele specific binding of a nuclear protein. This suggests
that experimental detection of allelic imbalance in combination with in silico
predictions of regulatory SNPs is an efficient approach to identify the functional
variant, but it is likely that the filter for heterozygous upstream SNPs also
contributed to the enrichment.
Paper III: CD36 was shown to be expressed from at least five alternative first
exons in human. The first exons appeared to be regulated both individually and
collectively in different tissues and cell types and in response to various
environmental factors.
Paper IV: CD36 appeared to be transcribed from at least two alternative first
exons in rat. The alternative first exons were regulated sex-specifically in rat liver,
with a female predominant expression pattern that could be induced in male rats
by continuous infusion with growth hormone. The alternative first exons were
down-regulated in female rat liver in response to starvation, but in skeletal muscle
the expression in response to starvation was increased in both male and female rats.
Paper V: Potential regulatory SNPs were predicted in the proximal promoters of
exons 1a and 1c of the CD36 gene. Significant associations were observed between
these SNPs and plasma concentrations of the inflammatory markers C-reactive
protein and serum amyloid A, in patients with an early onset myocardial infarction.
This suggests that the in silico approach for predicting functional polymorphisms
could be useful when there is a reason to believe that regulatory SNPs are involved
in determining a disease trait.
FUTURE PERSPECTIVES
This thesis has described a bioinformatics driven approach to identify SNPs with a
potential to alter transcriptional regulation. The tool was evaluated using genes
from the literature containing experimentally verified regulatory SNPs, tested on
genes with experimental evidence of allelic imbalance and finally applied to the
CD36 glycoprotein, a candidate gene for cardiovascular disease.
Paper I clearly shows that in silico predictions of TFBSs in combination
with phylogenetic footprinting have a limited capacity to enrich for regulatory SNPs
in the absence of additional supportive data. Position-specific weight matrices are
capable of modeling preferred TFBSs and the effect of single base variation in such
sites, but the interface between the DNA and the transcription factor contains too
little information to make specific predictions. Although regulatory SNPs are
located in evolutionarily conserved regions more often than background SNPs, the
enrichment provided by applying phylogenetic footprinting is still not satisfactory.
Since the presence of an optimal binding sequence for a transcription
factor does not imply that the TF necessarily binds the site in vivo, other genomic
properties may distinguish regulatory elements from neutral DNA sequences. It has
for example been suggested that regulatory elements may have different DNA
stability, flexibility and curvature than other sequences, and that SNPs affecting
such properties may interfere with transcriptional regulation (reviewed in [6]).
In a recent publication, several genomic properties were assessed to
evaluate if they could pinpoint regulatory genetic variation [95]. Experimentally
verified regulatory SNPs from the literature were compared with background
variation located in the promoter regions of the same genes. The SNPs were for
example assessed for location in evolutionarily conserved regions, overlap with
repetitive sequences, overlap with CpG islands, allele specific differences in PWM
scores, changes in overrepresented motifs detected with various pattern recognition
tools, allele specific differences in DNA bendability and curvature, the minor allele
frequency of the SNP and the distance between the SNP and the TSS. The authors
concluded that some of these properties could significantly discriminate between
regulatory and background SNPs, but other methods, such as PWM scores and
DNA bendability and curvature, had little capacity to pinpoint regulatory SNPs.
Regulatory SNPs were located closer to the transcription start sites, were less likely
to be located in CpG islands or within repetitive elements and had higher minor
allele frequencies than the background SNPs [95].
These results are interesting and may help to improve future prediction
tools. However, some of the properties may be affected by ascertainment bias since
researchers may have preferred to study SNPs with relatively high minor allele
frequencies located in close proximity to the target genes, and therefore such
regulatory SNPs may be overrepresented in the literature. Nevertheless, an
unbiased approach to identify regulatory SNPs by cloning promoter regions from
different individuals into reporter vector constructs also revealed a strong bias in the
location of regulatory SNPs close to the TSSs of the genes [62], indicating that the
finding has biological relevance. Since a large fraction of the human genes are likely
to have alternative transcription start sites it might be relevant to include such
information in the analysis when selecting candidate regulatory SNPs.
In Paper I we observed that prior information about which transcription
factors are likely to be involved in the regulation of a gene helps eliminate a large
fraction of the false positive TFBS predictions, and that such information therefore
is likely to make the predictions biologically meaningful. In addition to the scientific
literature we suggested that common Gene Ontology terms [133] may recommend
candidate transcription factors for an analyzed gene. Analysis of co-expression of
target genes and transcription factors may also provide such information, and one
interesting filter to explore would be to look for transcription factors that are
present in the same tissues and cell types as the target gene, for example using the
Human Protein Atlas [134].
Transcription factors in higher eukaryotes usually bind to DNA in
cooperation with other DNA binding factors and co-factors [142]. In silico analysis
of cis-regulatory modules can provide more specific predictions of regulatory
elements compared with isolated PWMs [143]. In addition to prior information on
which transcription factors are likely to regulate a particular gene, it would also be
valuable to have prior information on which transcription factors work together to
regulate their target genes.
Even if prior knowledge is available for the analyzed gene, a fundamental
limitation in the application of PWMs is that models of preferred binding sites are
still missing for a large fraction of the human transcription factors. In order to
enable a hypothesis driven approach of the sort suggested in Paper I and above,
PWM models for a candidate TF must be available. Continuous experimental work
is therefore necessary, using chromatin immunoprecipitation, high throughput in
vitro site selection assays and standard gene-by-gene analysis.
In relation to this, it has been shown that DNA sequences specifically
bound by a TF in vivo do not always correspond to the high affinity binding sites
obtained by cyclic in vitro site selection assays. This was for example shown for
transcription factors in the ETS protein family. In ETS target genes with tissue
specific expression patterns, ETS1 was often found to bind in cooperation with the
transcription factor RUNX1. A binding pattern compiled from DNA sequences
showing strong binding to the ETS1-RUNX1 complex in vivo was shown to deviate
substantially from the in vitro selected ETS1 and RUNX1 binding sites [144]. This
may imply that it will be important to analyze preferred binding patterns of
cooperatively bound transcription factors in order to model many of the sites that
are active in vivo.
The recent advances in global analysis of gene regulation by array or
sequencing based technologies of chromatin immunoprecipitation can provide
information on which genomic regions are bound by certain transcription factors in
vivo. The ENCODE project has taken on a systematic analysis of regulatory
elements where for example modified histones that indicate the location of actively
transcribed DNA, DNase hypersensitive sites, binding sites for a number of
transcription factors and alternative transcription start sites are identified [12,92].
Such high throughput mappings of regulatory elements do not only provide a draft
map of the regulatory landscape of particular candidate genes, but may also provide
new insights into the global distribution of regulatory elements and into the genomic
properties of such sites [92]. This may lead to better predictions of regulatory
sequences and genetic variation affecting such sites, which will be valuable until the
complete regulatory landscape of the human genome has been characterized
experimentally.
ABBREVIATIONS
AI        Allelic Imbalance
ChIP      Chromatin Immunoprecipitation
CNV       Copy Number Variation
CRP       C-Reactive Protein
EMSA      Electrophoretic Mobility Shift Assay
ENCODE    ENCyclopedia Of DNA Elements
GH        Growth Hormone
HDL       High Density Lipoprotein
IBD       Identical By Descent
LD        Linkage Disequilibrium
LDL       Low Density Lipoprotein
MAF       Minor Allele Frequency
MI        Myocardial Infarction
OMIM      Online Mendelian Inheritance in Man
oxLDL     oxidized Low Density Lipoprotein
PFM       Position Frequency Matrix
PWM       Position-specific Weight Matrix
RAVEN     Regulatory Analysis of Variation in Enhancers
SAA       Serum Amyloid A
SNP       Single Nucleotide Polymorphism
STR       Short Tandem Repeat
TF        Transcription Factor
TFBS      Transcription Factor Binding Site
TSS       Transcription Start Site
UTR       Untranslated Region
WGA       Whole Genome Association
ACKNOWLEDGEMENTS
I would like to extend my warmest thanks to all of you who have helped me and made
the work on this thesis so enjoyable! In particular, I would like to thank:
My main supervisor Jacob Odeberg, for your great commitment to my projects and
your genuine, contagious interest in research. Thank you for taking me on
as a PhD student, for introducing me to so many exciting projects and
collaborations, and for letting me focus on the things I have been
interested in.
My co-supervisor Per Eriksson, for all your help, not least in finding the right focus for
my studies (which have had one foot in the world of bioinformatics and one in the
world of molecular biology), and for your positive attitude, which has made the work fun!
Wyeth Wasserman, co-supervisor in the RAVEN project. Thank you for your
enthusiasm, support and expert knowledge during the work with the RAVEN
paper.
Boris Lenhard, co-supervisor in the RAVEN project. Thank you for your hard
work with the web interface as well as with the manuscript, and for many
interesting discussions over at CGB.
Pär Engström, David Arenillas and Stuart Lithwick, thank you for your support and
help in the RAVEN project. Dave, thanks for making the new web interface work.
Pär, thanks for teaching me a lot of statistics and for critically evaluating the results.
Rachel Fisher, Ann Samnegård and Anders Hamsten, for all the discussions and for
your invaluable help with the CD36 association study.
Louisa Cheung, Nina Ståhlberg, Petra Tollet-Egnell, Gunnar Norstedt, Carolina
Gustavsson and Leandro Fernández-Pérez, for an enjoyable collaboration on CD36.
Lili Milani, Ann-Christine Syvänen and all co-authors of the "Allelic Imbalance
paper", for involving me in your project.
Kicki Holmberg, for all your help with clean-room techniques, genotyping and everything
you have taught me in the lab. Thank you also for all the lunches, dinners and, not least,
the company on horseback!
Barbro Burt, for showing me around the lab at GV and teaching me how to run
EMSA.
Jorge Andrade, for an enjoyable collaboration on SNPs and Grid-Allegro!
Jens Lagergren, Öjvind Johansson, Cristina Al-Khalili Szigyarto and Lisa Berglund,
for rewarding discussions about transcription factors and bioinformatics. Lisa, thank
you also for your valuable comments on this thesis and for being good
travel company to CSHL!
Joakim Lundeberg and the whole of Gene Technology, for interesting discussions,
seminars and instructive workshops!
Mathias Uhlén, for your enthusiasm and your ability to get to grips with everyone's
projects and offer valuable input.
All the PIs in DNA-corner, for contributing to the pleasant atmosphere at the department.
Thanks also to all current and former DNA-corner co-workers for making my years
at KTH so much fun! You can always take a stroll down the corridor and
have a few words with someone when you need a break after too many hours
in front of the computer!
I would also like to thank the co-workers of the Atherosclerosis Research Unit at
the Department of Medicine, KI, for always making me feel welcome at your
seminars and in the lab at GV.
Peter Nilsson, for reviewing the thesis and offering valuable
comments.
Eric Björkvall, for all the IT support that has made life so much easier!
Bahram Amini, for help with sequencing.
Monica Ruck, Mona Wallén, Marita Johnsson and Inger Åhlén, for help with
administrative matters.
Annika, Lasse, Torbjörn, Ylva, Mia and Cissela at the SNP Core Facility, for all the help
with genotyping and for enjoyable get-togethers!
The whole gang of girls in the lab (none named and none forgotten), for all the fun we
have had with theatre, Christmas crafts, dinners and more!
Marica, for all the lunches, the riding and the help with planning here at the end; it is
much more fun when there are two of us!
John, Erik P and Jorge again, for the help with all the practical matters at the end of
the thesis work; it is perfect to have a group of experts to ask about figure formats and
other things!!!
Former and current office mates, not least Valtteri, Marcus, Johan, Erik,
Jorge, Sebastian, Johanna and bonus-Daniel, for all the fun and rewarding discussions
about science and other things, and for your wonderful sense of humour!
My dear friends Tove, Torun and Esther, for your great generosity and for all the
pep talks. It is reassuring to know that I can turn to you if anything comes up.
I have also had a great deal of fun with you over the past five or six years! Esther,
thank you also for helping me review the thesis, both the English and
the content.
All my friends outside KTH! I look forward to seeing more of you now that
the thesis is finished. ☺
Sune and Helena, thank you for taking care of me in Söderköping and for always
making me feel welcome in the family. I am glad that we have become neighbours so that we
can meet so often. Helena, thank you also for helping me with the English!
Mum, Dad and David, thank you for always being there and getting involved in whatever
I do; you are fantastic!
Marcus, you have meant an incredible amount to my actually finishing this
thesis. Thank you for your support, for your ability to always come up with
constructive solutions to problems, and for being so understanding during this final
period when I have done nothing but write and write…
This work has been funded by research grants from Vetenskapsrådet (the Swedish
Research Council), Vinnova, Stiftelsen för Strategisk forskning (the Swedish Foundation
for Strategic Research) and Wallenberg Consortium North. I am also
grateful for a travel grant from Ragnar och Astrid Signeuls stiftelse.
REFERENCES
1. Reich DE, Schaffner SF, Daly MJ, McVean G, Mullikin JC, et al. (2002) Human
genome sequence variation and the influence of gene history, mutation and
recombination. Nat Genet 32: 135-142.
2. Przeworski M, Hudson RR, Di Rienzo A (2000) Adjusting the focus on human
variation. Trends Genet 16: 296-302.
3. Chen FC, Vallender EJ, Wang H, Tzeng CS, Li WH (2001) Genomic divergence
between human and chimpanzee estimated from large-scale alignments of
genomic sequences. J Hered 92: 481-489.
4. Dean L (2005) Blood Groups and Red Cell Antigens. Bethesda (MD): National
Library of Medicine (US), National Center for Biotechnology Information.
5. Knight JC (2005) Regulatory polymorphisms underlying complex disease traits. J
Mol Med 83: 97-109.
6. Buckland PR (2006) The importance and identification of regulatory
polymorphisms and their mechanisms of action. Biochim Biophys Acta
1762: 17-28.
7. Kruglyak L, Nickerson DA (2001) Variation is the spice of life. Nat Genet 27:
234-236.
8. Reich DE, Gabriel SB, Altshuler D (2003) Quality and completeness of SNP
databases. Nat Genet 33: 457-458.
9. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, et al. (2001) Initial
sequencing and analysis of the human genome. Nature 409: 860-921.
10. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, et al. (2001) dbSNP: the
NCBI database of genetic variation. Nucleic Acids Res 29: 308-311.
11. The International HapMap Consortium (2005) A haplotype map of the human
genome. Nature 437: 1299-1320.
12. ENCODE Project Consortium (2004) The ENCODE (ENCyclopedia Of
DNA Elements) Project. Science 306: 636-640.
13. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, et al. (2004) Large-scale copy
number polymorphism in the human genome. Science 305: 525-528.
14. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, et al. (2004)
Detection of large-scale variation in the human genome. Nat Genet 36:
949-951.
15. Stefansson H, Helgason A, Thorleifsson G, Steinthorsdottir V, Masson G, et al.
(2005) A common inversion under selection in Europeans. Nat Genet 37:
129-137.
16. Visser R, Shimokawa O, Harada N, Kinoshita A, Ohta T, et al. (2005)
Identification of a 3.0-kb major recombination hotspot in patients with
Sotos syndrome who carry a common 1.9-Mb microdeletion. Am J Hum
Genet 76: 52-67.
17. Feuk L, Carson AR, Scherer SW (2006) Structural variation in the human
genome. Nat Rev Genet 7: 85-97.
18. Reich DE, Cargill M, Bolk S, Ireland J, Sabeti PC, et al. (2001) Linkage
disequilibrium in the human genome. Nature 411: 199-204.
19. Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, et al. (2002) The
structure of haplotype blocks in the human genome. Science 296: 2225-2229.
20. The International HapMap Consortium (2003) The International HapMap
Project. Nature 426: 789-796.
21. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, et al. (2007) A second
generation human haplotype map of over 3.1 million SNPs. Nature 449:
851-861.
22. McKusick VA (2007) Mendelian Inheritance in Man and its online version,
OMIM. Am J Hum Genet 80: 588-604.
23. Merikangas KR, Risch N (2003) Genomic priorities and public health. Science
302: 599-601.
24. Cardon LR, Bell JI (2001) Association study designs for complex diseases. Nat
Rev Genet 2: 91-99.
25. Borecki IB, Suarez BK (2001) Linkage and association: basic concepts. Adv
Genet 42: 45-66.
26. Pajukanta P, Cargill M, Viitanen L, Nuotio I, Kareinen A, et al. (2000) Two loci
on chromosomes 2 and X for premature coronary heart disease identified
in early- and late-settlement populations of Finland. Am J Hum Genet 67:
1481-1493.
27. Helgadottir A, Manolescu A, Thorleifsson G, Gretarsdottir S, Jonsdottir H, et
al. (2004) The gene encoding 5-lipoxygenase activating protein confers risk
of myocardial infarction and stroke. Nat Genet 36: 233-239.
28. Farrall M, Green FR, Peden JF, Olsson PG, Clarke R, et al. (2006) Genome-wide
mapping of susceptibility to coronary artery disease identifies a novel
replicated locus on chromosome 17. PLoS Genet 2: e72.
29. Reynisdottir I, Thorleifsson G, Benediktsson R, Sigurdsson G, Emilsson V, et
al. (2003) Localization of a susceptibility gene for type 2 diabetes to
chromosome 5q34-q35.2. Am J Hum Genet 73: 323-335.
30. Grant SF, Thorleifsson G, Reynisdottir I, Benediktsson R, Manolescu A, et al.
(2006) Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk
of type 2 diabetes. Nat Genet 38: 320-323.
31. Xu J, Meyers DA, Ober C, Blumenthal MN, Mellen B, et al. (2001)
Genomewide screen and identification of gene-gene interactions for
asthma-susceptibility loci in three U.S. populations: collaborative study on
the genetics of asthma. Am J Hum Genet 68: 1437-1446.
32. Rioux JD, Silverberg MS, Daly MJ, Steinhart AH, McLeod RS, et al. (2000)
Genomewide search in Canadian families with inflammatory bowel disease
reveals two novel susceptibility loci. Am J Hum Genet 66: 1863-1870.
33. Rioux JD, Daly MJ, Silverberg MS, Lindblad K, Steinhart H, et al. (2001)
Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to
Crohn disease. Nat Genet 29: 223-228.
34. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. (2001) The sequence
of the human genome. Science 291: 1304-1351.
35. Saxena R, Voight BF, Lyssenko V, Burtt NP, de Bakker PI, et al. (2007)
Genome-wide association analysis identifies loci for type 2 diabetes and
triglyceride levels. Science 316: 1331-1336.
36. Zeggini E, Weedon MN, Lindgren CM, Frayling TM, Elliott KS, et al. (2007)
Replication of genome-wide association signals in UK samples reveals risk
loci for type 2 diabetes. Science 316: 1336-1341.
37. McPherson R, Pertsemlidis A, Kavaslar N, Stewart A, Roberts R, et al. (2007) A
common allele on chromosome 9 associated with coronary heart disease.
Science 316: 1488-1491.
38. Helgadottir A, Thorleifsson G, Manolescu A, Gretarsdottir S, Blondal T, et al.
(2007) A common variant on chromosome 9p21 affects the risk of
myocardial infarction. Science 316: 1491-1493.
39. Gudmundsson J, Sulem P, Manolescu A, Amundadottir LT, Gudbjartsson D, et
al. (2007) Genome-wide association study identifies a second prostate
cancer susceptibility variant at 8q24. Nat Genet 39: 631-637.
40. Yeager M, Orr N, Hayes RB, Jacobs KB, Kraft P, et al. (2007) Genome-wide
association study of prostate cancer identifies a second risk locus at 8q24.
Nat Genet 39: 645-649.
41. Duerr RH, Taylor KD, Brant SR, Rioux JD, Silverberg MS, et al. (2006) A
genome-wide association study identifies IL23R as an inflammatory bowel
disease gene. Science 314: 1461-1463.
42. Couzin J, Kaiser J (2007) Genome-wide association. Closing the net on
common disease genes. Science 316: 820-822.
43. Inoue K, Lupski JR (2002) Molecular mechanisms for genomic disorders. Annu
Rev Genomics Hum Genet 3: 199-242.
44. The Huntington's Disease Collaborative Research Group (1993) A novel gene
containing a trinucleotide repeat that is expanded and unstable on
Huntington's disease chromosomes. The Huntington's Disease
Collaborative Research Group. Cell 72: 971-983.
45. Riordan JR, Rommens JM, Kerem B, Alon N, Rozmahel R, et al. (1989)
Identification of the cystic fibrosis gene: cloning and characterization of
complementary DNA. Science 245: 1066-1073.
46. Bertina RM, Koeleman BP, Koster T, Rosendaal FR, Dirven RJ, et al. (1994)
Mutation in blood coagulation factor V associated with resistance to
activated protein C. Nature 369: 64-67.
47. Pastinen T, Hudson TJ (2004) Cis-acting regulatory variation in the human
genome. Science 306: 647-650.
48. Zabetian CP, Anderson GM, Buxbaum SG, Elston RC, Ichinose H, et al. (2001)
A quantitative-trait analysis of human plasma-dopamine beta-hydroxylase
activity: evidence for a major functional polymorphism at the DBH locus.
Am J Hum Genet 68: 515-522.
49. Rigat B, Hubert C, Alhenc-Gelas F, Cambien F, Corvol P, et al. (1990) An
insertion/deletion polymorphism in the angiotensin I-converting enzyme
gene accounting for half the variance of serum enzyme levels. J Clin Invest
86: 1343-1346.
50. Bray NJ, Buckland PR, Owen MJ, O'Donovan MC (2003) Cis-acting variation
in the expression of a high proportion of genes in human brain. Hum
Genet 113: 149-153.
51. Lo HS, Wang Z, Hu Y, Yang HH, Gere S, et al. (2003) Allelic variation in gene
expression is common in the human genome. Genome Res 13: 1855-1862.
52. Pastinen T, Sladek R, Gurd S, Sammak A, Ge B, et al. (2004) A survey of
genetic and epigenetic variation affecting human gene expression. Physiol
Genomics 16: 184-193.
53. Singer-Sam J, LeBon JM, Dai A, Riggs AD (1992) A sensitive, quantitative assay
for measurement of allele-specific transcripts differing by a single
nucleotide. PCR Methods Appl 1: 160-163.
54. Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, et al. (2004)
Genetic analysis of genome-wide variation in human gene expression.
Nature 430: 743-747.
55. Ronald J, Brem RB, Whittle J, Kruglyak L (2005) Local regulatory variation in
Saccharomyces cerevisiae. PLoS Genet 1: e25.
56. GuhaThakurta D, Xie T, Anand M, Edwards SW, Li G, et al. (2006)
Cis-regulatory variations: a study of SNPs around genes showing cis-linkage in
segregating mouse populations. BMC Genomics 7: 235.
57. Goring HH, Curran JE, Johnson MP, Dyer TD, Charlesworth J, et al. (2007)
Discovery of expression QTLs using large-scale transcriptional profiling in
human lymphocytes. Nat Genet 39: 1208-1216.
58. Stranger BE, Nica AC, Forrest MS, Dimas A, Bird CP, et al. (2007) Population
genomics of human gene expression. Nat Genet 39: 1217-1224.
59. Mattick JS (2004) RNA regulation: a new genetics? Nat Rev Genet 5: 316-323.
60. Rockman MV, Wray GA (2002) Abundant Raw Material for Cis-Regulatory
Evolution in Humans. Mol Biol Evol 19: 1991-2004.
61. Montgomery SB, Griffith OL, Sleumer MC, Bergman CM, Bilenky M, et al.
(2006) ORegAnno: an open access database and curation system for
literature-derived promoters, transcription factor binding sites and
regulatory variation. Bioinformatics 22: 637-640.
62. Buckland PR, Hoogendoorn B, Coleman SL, Guy CA, Smith SK, et al. (2005)
Strong bias in the location of functional promoter polymorphisms. Hum
Mutat 26: 214-223.
63. Prokunina L, Alarcon-Riquelme ME (2004) Regulatory SNPs in complex
diseases: their identification and functional validation. Expert Rev Mol
Med 6: 1-15.
64. Fried M, Crothers DM (1981) Equilibria and kinetics of lac repressor-operator
interactions by polyacrylamide gel electrophoresis. Nucleic Acids Res 9:
6505-6525.
65. Fried MG (1989) Measurement of protein-DNA interaction parameters by
electrophoresis mobility shift assay. Electrophoresis 10: 366-376.
66. Hellman LM, Fried MG (2007) Electrophoretic mobility shift assay (EMSA) for
detecting protein-nucleic acid interactions. Nat Protoc 2: 1849-1861.
67. Orlando V (2000) Mapping chromosomal proteins in vivo by
formaldehyde-crosslinked-chromatin immunoprecipitation. Trends Biochem Sci
25: 99-104.
68. Ren B, Cam H, Takahashi Y, Volkert T, Terragni J, et al. (2002) E2F integrates
cell cycle progression with DNA repair, replication, and G(2)/M
checkpoints. Genes Dev 16: 245-256.
69. Weinmann AS, Yan PS, Oberley MJ, Huang TH, Farnham PJ (2002) Isolating
human transcription factor targets by coupling chromatin
immunoprecipitation and CpG island microarray analysis. Genes Dev 16:
235-244.
70. Alam J, Cook JL (1990) Reporter genes: application to the study of mammalian
gene transcription. Anal Biochem 188: 245-254.
71. Knight JC, Keating BJ, Rockett KA, Kwiatkowski DP (2003) In vivo
characterization of regulatory polymorphisms by allele-specific
quantification of RNA polymerase loading. Nat Genet 33: 469-475.
72. Stormo GD (2000) DNA binding sites: representation and discovery.
Bioinformatics 16: 16-23.
73. Wasserman WW, Sandelin A (2004) Applied bioinformatics for the
identification of regulatory elements. Nat Rev Genet 5: 276-287.
74. MacIsaac KD, Fraenkel E (2006) Practical strategies for discovering regulatory
DNA sequence motifs. PLoS Comput Biol 2: e36.
75. Pollock R, Treisman R (1990) A sensitive method for the determination of
protein-DNA binding specificities. Nucleic Acids Res 18: 6197-6204.
76. Schneider TD, Stephens RM (1990) Sequence logos: a new way to display
consensus sequences. Nucleic Acids Res 18: 6097-6100.
77. Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, et al. (2003)
TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic
Acids Res 31: 374-378.
78. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B (2004)
JASPAR: an open-access database for eukaryotic transcription factor
binding profiles. Nucleic Acids Res 32 Database issue: D91-94.
79. Kunsch C, Ruben SM, Rosen CA (1992) Selection of optimal kappa B/Rel
DNA-binding motifs: interaction of both subunits of NF-kappa B with
DNA is required for transcriptional activation. Mol Cell Biol 12: 4412-4421.
80. Berg OG, von Hippel PH (1987) Selection of DNA binding sites by regulatory
proteins. Statistical-mechanical theory and application to operators and
promoters. J Mol Biol 193: 723-750.
81. Fields DS, He Y, Al-Uzri AY, Stormo GD (1997) Quantitative specificity of the
Mnt repressor. J Mol Biol 271: 178-194.
82. Benos PV, Bulyk ML, Stormo GD (2002) Additivity in protein-DNA
interactions: how good an approximation is it? Nucleic Acids Res 30: 4442-4451.
83. Maerkl SJ, Quake SR (2007) A systems approach to measuring the binding
energy landscapes of transcription factors. Science 315: 233-237.
84. Tronche F, Ringeisen F, Blumenfeld M, Yaniv M, Pontoglio M (1997) Analysis
of the distribution of binding sites for a tissue-specific transcription factor
in the vertebrate genome. J Mol Biol 266: 231-245.
85. Fickett JW (1996) Quantitative discrimination of MEF2 sites. Mol Cell Biol 16:
437-441.
86. Wasserman WW, Palumbo M, Thompson W, Fickett JW, Lawrence CE (2000)
Human-mouse genome comparisons to locate regulatory sites. Nat Genet
26: 225-228.
87. Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, et al. (2003)
Identification of conserved regulatory elements by comparative genome
analysis. J Biol 2: 13.
88. Moses AM, Chiang DY, Kellis M, Lander ES, Eisen MB (2003) Position
specific variation in the rate of evolution in transcription factor binding
sites. BMC Evol Biol 3: 19.
89. Flint J, Tufarelli C, Peden J, Clark K, Daniels RJ, et al. (2001) Comparative
genome analysis delimits a chromosomal domain and identifies key
regulatory elements in the alpha globin cluster. Hum Mol Genet 10: 371-382.
90. Odom DT, Dowell RD, Jacobsen ES, Gordon W, Danford TW, et al. (2007)
Tissue-specific transcriptional regulation has diverged significantly between
human and mouse. Nat Genet 39: 730-732.
91. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, et al. (2005)
Evolutionarily conserved elements in vertebrate, insect, worm, and yeast
genomes. Genome Res 15: 1034-1050.
92. Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, et al.
(2007) Identification and analysis of functional elements in 1% of the
human genome by the ENCODE pilot project. Nature 447: 799-816.
93. Blanchette M, Bataille AR, Chen X, Poitras C, Laganiere J, et al. (2006)
Genome-wide computational prediction of transcriptional regulatory
modules reveals new insights into human gene expression. Genome Res
16: 656-668.
94. Johansson O, Alkema W, Wasserman WW, Lagergren J (2003) Identification of
functional clusters of transcription factor binding motifs in genome
sequences: the MSCAN algorithm. Bioinformatics 19 Suppl 1: i169-176.
95. Montgomery SB, Griffith OL, Schuetz JM, Brooks-Wilson A, Jones SJ (2007) A
survey of genomic properties for the detection of regulatory
polymorphisms. PLoS Comput Biol 3: e106.
96. Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, et al. (2006)
Genome-wide analysis of mammalian promoter architecture and evolution.
Nat Genet 38: 626-635.
97. Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, et al. (2005) The
transcriptional landscape of the mammalian genome. Science 309: 1559-1563.
98. Denoeud F, Kapranov P, Ucla C, Frankish A, Castelo R, et al. (2007) Prominent
use of distal 5' transcription start sites and discovery of a large number of
additional exons in ENCODE regions. Genome Res 17: 746-759.
99. Landry JR, Mager DL, Wilhelm BT (2003) Complex controls: the role of
alternative promoters in mammalian genomes. Trends Genet 19: 640-648.
100. Stepanova M, Tiazhelova T, Skoblov M, Baranova A (2006) Potential
regulatory SNPs in promoters of human genes: a systematic approach. Mol
Cell Probes 20: 348-358.
101. Zhao T, Chang LW, McLeod HL, Stormo GD (2004) PromoLign: a database
for upstream region analysis and SNPs. Hum Mutat 23: 534-539.
102. Silverstein RL, Febbraio M (2000) CD36 and atherosclerosis. Curr Opin
Lipidol 11: 483-491.
103. Nicholson AC, Hajjar DP (2004) CD36, oxidized LDL and PPAR gamma:
pathological interactions in macrophages and atherosclerosis. Vascul
Pharmacol 41: 139-146.
104. Collot-Teixeira S, Martin J, McDermott-Roe C, Poston R, McGregor JL (2007)
CD36 and macrophages in atherosclerosis. Cardiovasc Res 75: 468-477.
105. Rac ME, Safranow K, Poncyljusz W (2007) Molecular Basis of Human Cd36
Gene Mutations. Mol Med.
106. Abumrad NA, el-Maghrabi MR, Amri EZ, Lopez E, Grimaldi PA (1993)
Cloning of a rat adipocyte membrane protein implicated in binding or
transport of long-chain fatty acids that is induced during preadipocyte
differentiation. Homology with human CD36. J Biol Chem 268: 17665-17668.
107. Ibrahimi A, Abumrad NA (2002) Role of CD36 in membrane transport of
long-chain fatty acids. Curr Opin Clin Nutr Metab Care 5: 139-145.
108. Nozaki S, Tanaka T, Yamashita S, Sohmiya K, Yoshizumi T, et al. (1999)
CD36 mediates long-chain fatty acid transport in human myocardium:
complete myocardial accumulation defect of radiolabeled long-chain fatty
acid analog in subjects with CD36 deficiency. Mol Cell Biochem 192: 129-135.
109. Endemann G, Stanton LW, Madden KS, Bryant CM, White RT, et al. (1993)
CD36 is a receptor for oxidized low density lipoprotein. J Biol Chem 268:
11811-11816.
110. Ockenhouse CF, Tandon NN, Magowan C, Jamieson GA, Chulay JD (1989)
Identification of a platelet membrane glycoprotein as a falciparum malaria
sequestration receptor. Science 243: 1469-1471.
111. Oquendo P, Hundt E, Lawler J, Seed B (1989) CD36 directly mediates
cytoadherence of Plasmodium falciparum parasitized erythrocytes. Cell 58:
95-101.
112. Fernandez-Ruiz E, Armesilla AL, Sanchez-Madrid F, Vega MA (1993) Gene
encoding the collagen type I and thrombospondin receptor CD36 is
located on chromosome 7q11.2. Genomics 17: 759-761.
113. Noushmehr H, D'Amico E, Farilla L, Hui H, Wawrowsky KA, et al. (2005)
Fatty acid translocase (FAT/CD36) is localized on insulin-containing
granules in human pancreatic beta-cells and mediates fatty acid effects on
insulin secretion. Diabetes 54: 472-481.
114. Sato O, Kuriki C, Fukui Y, Motojima K (2002) Dual promoter structure of
mouse and human fatty acid translocase/CD36 genes and unique
transcriptional activation by peroxisome proliferator-activated receptor
alpha and gamma ligands. J Biol Chem 277: 15703-15711.
115. Zingg JM, Ricciarelli R, Andorno E, Azzi A (2002) Novel 5' exon of scavenger
receptor CD36 is expressed in cultured human vascular smooth muscle
cells and atherosclerotic plaques. Arterioscler Thromb Vasc Biol 22: 412-417.
116. Griffin E, Re A, Hamel N, Fu C, Bush H, et al. (2001) A link between diabetes
and atherosclerosis: Glucose regulates expression of CD36 at the level of
translation. Nat Med 7: 840-846.
117. Curtis BR, Aster RH (1996) Incidence of the Nak(a)-negative platelet
phenotype in African Americans is similar to that of Asians. Transfusion
36: 331-334.
118. Urwijitaroon Y, Barusrux S, Romphruk A, Puapairoj C (1995) Frequency of
human platelet antigens among blood donors in northeastern Thailand.
Transfusion 35: 868-870.
119. Yamamoto N, Ikeda H, Tandon NN, Herman J, Tomiyama Y, et al. (1990) A
platelet membrane glycoprotein (GP) deficiency in healthy blood donors:
Naka- platelets lack detectable GPIV (CD36). Blood 76: 1698-1703.
120. Febbraio M, Podrez EA, Smith JD, Hajjar DP, Hazen SL, et al. (2000)
Targeted disruption of the class B scavenger receptor CD36 protects
against atherosclerotic lesion development in mice. J Clin Invest 105: 1049-1056.
121. Aitman TJ, Glazier AM, Wallace CA, Cooper LD, Norsworthy PJ, et al. (1999)
Identification of Cd36 (Fat) as an insulin-resistance gene causing defective
fatty acid and glucose metabolism in hypertensive rats. Nat Genet 21: 76-83.
122. Aitman TJ, Gotoda T, Evans AL, Imrie H, Heath KE, et al. (1997)
Quantitative trait loci for cellular defects in glucose and fatty acid
metabolism in hypertensive rats. Nat Genet 16: 197-201.
123. Pravenec M, Landa V, Zidek V, Musilova A, Kren V, et al. (2001) Transgenic
rescue of defective Cd36 ameliorates insulin resistance in spontaneously
hypertensive rats. Nat Genet 27: 156-158.
124. Kajihara S, Hisatomi A, Ogawa Y, Yasutake T, Yoshimura T, et al. (2001)
Association of the Pro90Ser CD36 mutation with elevated free fatty acid
concentrations but not with insulin resistance syndrome in Japanese. Clin
Chim Acta 314: 125-130.
125. Miyaoka K, Kuwasako T, Hirano K, Nozaki S, Yamashita S, et al. (2001)
CD36 deficiency associated with insulin resistance. Lancet 357: 686-687.
126. Yanai H, Chiba H, Morimoto M, Abe K, Fujiwara H, et al. (2000) Human
CD36 deficiency is associated with elevation in low-density
lipoprotein-cholesterol. Am J Med Genet 93: 299-304.
127. Nozaki S, Kashiwagi H, Yamashita S, Nakagawa T, Kostner B, et al. (1995)
Reduced uptake of oxidized low density lipoproteins in monocyte-derived
macrophages from CD36-deficient subjects. J Clin Invest 96: 1859-1865.
128. Kunjathoor VV, Febbraio M, Podrez EA, Moore KJ, Andersson L, et al.
(2002) Scavenger receptors class A-I/II and CD36 are the principal
receptors responsible for the uptake of modified low density lipoprotein
leading to lipid loading in macrophages. J Biol Chem 277: 49982-49988.
129. Rahaman SO, Lennon DJ, Febbraio M, Podrez EA, Hazen SL, et al. (2006) A
CD36-dependent signaling cascade is necessary for macrophage foam cell
formation. Cell Metab 4: 211-221.
130. Martin-Fuentes P, Civeira F, Recalde D, Garcia-Otin AL, Jarauta E, et al.
(2007) Individual variation of scavenger receptor expression in human
macrophages with oxidized low-density lipoprotein is associated with a
differential inflammatory response. J Immunol 179: 3242-3248.
131. Corpeleijn E, van der Kallen CJ, Kruijshoop M, Magagnin MG, de Bruin TW,
et al. (2006) Direct association of a promoter polymorphism in the
CD36/FAT fatty acid transporter gene with Type 2 diabetes mellitus and
insulin resistance. Diabet Med 23: 907-911.
132. Ma X, Bacci S, Mlynarski W, Gottardo L, Soccio T, et al. (2004) A common
haplotype at the CD36 locus is associated with high free fatty acid levels
and increased cardiovascular risk in Caucasians. Hum Mol Genet 13: 2197-2205.
133. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene
ontology: tool for the unification of biology. The Gene Ontology
Consortium. Nat Genet 25: 25-29.
134. Uhlen M, Bjorling E, Agaton C, Szigyarto CA, Amini B, et al. (2005) A human
protein atlas for normal and cancer tissues based on antibody proteomics.
Mol Cell Proteomics 4: 1920-1932.
135. Mottagui-Tabar S, Faghihi MA, Mizuno Y, Engstrom PG, Lenhard B, et al.
(2005) Identification of functional SNPs in the 5-prime flanking sequences
of human genes. BMC Genomics 6: 18.
136. Singh Ahuja H, Liu S, Crombie DL, Boehm M, Leibowitz MD, et al. (2001)
Differential effects of rexinoids and thiazolidinediones on metabolic gene
expression in diabetic rodents. Mol Pharmacol 59: 765-773.
137. Stahlberg N, Rico-Bautista E, Fisher RM, Wu X, Cheung L, et al. (2003)
Female-predominant expression of fatty acid translocase/CD36 in rat and
human liver. Endocrinology.
138. Andersson T, Borang S, Larsson M, Wirta V, Wennborg A, et al. (2002) Novel
candidate genes for atherosclerosis are identified by representational
difference analysis-based transcript profiling of cholesterol-loaded
macrophages. Pathobiology 69: 304-314.
139. Han J, Hajjar DP, Febbraio M, Nicholson AC (1997) Native and modified low
density lipoproteins increase the functional expression of the macrophage
class B scavenger receptor, CD36. J Biol Chem 272: 21654-21659.
140. Sato O, Takanashi N, Motojima K (2007) Third promoter and differential
regulation of mouse and human fatty acid translocase/CD36 genes. Mol
Cell Biochem 299: 37-43.
141. Samnegard A, Silveira A, Lundman P, Boquist S, Odeberg J, et al. (2005)
Serum matrix metalloproteinase-3 concentration is influenced by MMP-3
-1612 5A/6A promoter genotype and associated with myocardial infarction.
J Intern Med 258: 411-419.
142. Arnone MI, Davidson EH (1997) The hardwiring of development:
organization and function of genomic regulatory systems. Development
124: 1851-1864.
143. Wasserman WW, Fickett JW (1998) Identification of regulatory regions which
confer muscle-specific gene expression. J Mol Biol 278: 167-181.
144. Hollenhorst PC, Shah AA, Hopkins C, Graves BJ (2007) Genome-wide
analyses reveal properties of redundant and specific promoter occupancy
within the ETS gene family. Genes Dev 21: 1882-1894.