Gene Discovery Using Pareto Depth Sampling Distributions

Gilles Fleury⋄, Alfred Hero†, Sepideh Zareparsi∗ and Anand Swaroop∗
⋄ Ecole Supérieure d’Electricité, Service des Mesures, 91192 Gif-sur-Yvette, France
† Depts. of EECS, BioMedical Eng. and Statistics, University of Michigan, Ann Arbor MI 48109, USA
∗ Dept. of Ophthalmology and Visual Sciences, University of Michigan, Ann Arbor MI 48109, USA
January 19, 2004
Abstract
Most methods for finding interesting gene expression profiles from gene microarray data are based on a single discriminant, e.g. the classical paired t-test. Here a different approach is introduced based on gene ranking according to Pareto depth in multiple discriminants. The novelty of our approach, which is an extension of our previous work on Pareto front analysis (PFA), is that a gene’s relative rank is determined according to the ordinal theory of multiple objective optimization. Furthermore, the distribution of each gene’s rank, called Pareto depth, is determined by resampling over the microarray replicates. This distribution is called the Pareto depth sampling distribution (PDSD) and it is used to assess the stability of each ranking. We illustrate and compare the PDSD approach with both simulated and real gene microarray experiments.
Keywords: gene filtering, multi-objective optimization, false discovery rate, data depth analysis.
1 Introduction
Since Watson and Crick discovered the structure of DNA more than fifty years ago, the field of genomics has progressed from a speculative science starved for data and computation cycles to one of the most thriving areas of current biomedical research and development [26]. Spectacular advances in gene mapping of humans and other organisms [6, 22, 21] have been made over the past decade. Similar advances are being made in the discovery of within-species variation of transcript expression levels, e.g., due to factors such as growth, disease, and environment [27, 5, 1, 7]. Progress in understanding such gene expression mechanisms has been made possible by the gene microarray and its associated signal extraction and processing algorithms [20, 18, 19]. In this paper, we will present a new method for analyzing gene microarray data which we call the method of Pareto depth sampling distributions (PDSD).
The massive scale and variability of microarray gene data create new and challenging problems of clustering and data mining: the so-called gene filtering problem. This problem has two subproblems called gene screening and gene ranking. Gene screening is concerned with determining a list of gene probes whose expression levels are statistically and biologically significant with respect to some p-value or familywise error rate. Gene ranking is concerned with finding a fixed number of genes that are rank ordered according to one or more statistical and biological criteria. These two subproblems are closely related, but this paper focuses on gene ranking using multiple criteria. Multicriteria methods of gene screening with familywise error constraints have been presented elsewhere [14] and will not be discussed further in this paper.
Multicriteria gene filtering seeks to find genes whose expression profiles strike an optimal compromise between maximizing (or minimizing) several criteria. It is often easier for a molecular biologist to specify several criteria than a single criterion. For example the biologist might be interested in aging genes, which might be defined as those genes whose expression profiles are increasing over time, have low curvature over time, and whose total increase from initial time to final time is large. Or one may have to deal with two biologists who each have different criteria for what features constitute an interesting aging gene. In a well designed gene microarray experiment, multicriteria (or other) methods of screening will generally result in a large number of genes and the biologist must next face the problem of selecting a few of the most “promising genes” to investigate further. Resolution of this problem is of importance since validation of gene response requires running more sensitive amplification protocols, such as quantitative real-time reverse-transcription polymerase-chain-reaction (RT-PCR). As compared to microarray experiments, RT-PCR’s higher sensitivity is offset by its lower throughput and its higher cost-per-probe.
It is thus clear that some sort of rank ordering of the selected genes would help guide the biologist to a cost effective solution of the validation problem. As a linear ordering of multiple criteria does not generally exist, an absolute ranking of the selected genes is generally impossible. However a partial ordering is possible when formulated as a multicriterion optimization problem. This partial ordering groups genes into successive Pareto fronts of the multicriteria scattergram (see Section 3). It is this partial ordering which was used in our previous work [10, 11, 13] to obtain relative rankings of gene expression levels based on microarray experiments. We called our multiobjective approach to gene ranking Pareto front analysis (PFA). As pointed out in [13] the PFA approach is related to John Tukey’s notion of data depths and contours of depth in a multivariate sample [24, 8]. To highlight the contrast between PFA and the concept of data depths we will refer to the Pareto depth of a gene as the Pareto front on which the gene lies. It is to be noted that Pareto analysis has been adopted for many continuous and discrete optimization applications including evolutionary computing [23, 28].
Several variants of PFA were introduced in [10, 11, 13] including resistant PFA (RPFA), based on cross-validation, and posterior PFA (PPFA), based on Bayes posterior analysis, of gene rankings. These rankings were computed by rank ordering each gene’s probability of lying on the first two or three Pareto fronts of the multicriteria scattergram. This paper introduces a more powerful PFA gene discovery tool, the aforementioned Pareto depth sampling distribution method, into the PFA toolbox. The PDSD method generates an empirical distribution of the depth of the front, the Pareto depth, on which each gene lies. This distribution is computed by implementing a resampling method similar to the bootstrap. From this distribution many different attributes of the Pareto depth can be determined and used for ranking the genes. The PDSD approach is more general than our previous cross-validation PFA approach that used a special attribute, the cumulative PDSD, to rank the genes. Using simulations we compare our PFA methods to the standard paired t-test on the basis of correct discovery and false discovery rates. Our principal conclusion is that the PDSD approach, when formulated as a Pareto depth test, significantly outperforms previous PFA and paired t-test methods.
Our experience has shown that the PFA methodology can discover important genes that elude standard analysis such as paired t-test or other analysis of variance (ANOVA) methods. However, our objective here is limited to introducing and illustrating the PFA methodology and we do so using both controlled simulations and experimental data. These datasets are representative of real world microarray experiments for studying the genetics of retinal aging and disease. We report on more comprehensive comparisons, biological significance of genes discovered using PFA, and scientific significance of the experiments elsewhere.
The paper is organized as follows. In Sec. 2 we give some background on genomics and briefly review gene microarray technology. In Sec. 3 we motivate and describe the PFA multicriterion ranking approach and introduce the concept of PDSD’s. In Sec. 4 we report on quantitative comparisons between the Pareto depth test and other tests used for gene selection and ranking. Finally, in Sec. 5 we conclude with some general remarks.
2 Background
The ability to perform accurate genetic differentiation between two or more biological populations is a problem of great interest to researchers in genetics and related areas. For example, in a temporally sampled population of mice one is frequently interested in identifying genes that display a significant change in expression level between a pair of time points. Gene microarrays have revolutionized the field of experimental genetics by offering the experimenter the ability to measure thousands of gene sequences simultaneously. A gene microarray consists of a large number N of known DNA probes that are put in distinct locations on a small slide [16, 2, 9]. After hybridization of an unknown tissue sample to the microarray, the abundance of each probe present in the sample can be estimated from the measured levels of hybridization, called probe responses, of the sample to each probe.
Due to high response variability the study of differential gene expression between two or more populations or time points usually requires hybridizing several samples from each population. We assume that there are T populations each consisting of $M_t$ samples, $t = 1, \ldots, T$. For each of the $\sum_{t=1}^{T} M_t$ samples we assume an independent microarray hybridization experiment is performed yielding N gene probe responses extracted from the microarray. Define the measured response of the n-th probe on the m-th microarray acquired at time t
\[ y_{tm}(n), \quad n = 1, \ldots, N, \quad m = 1, \ldots, M_t, \quad t = 1, \ldots, T. \]
When several gene chip experiments are performed over time they can be combined in order to find genes with interesting expression profiles. This is a data mining problem for which many methods have been proposed including: multiple paired t-tests; linear discriminant analysis; self organizing (Kohonen) maps (SOM); principal components analysis (PCA); K-means clustering; hierarchical clustering (kdb trees, CART, gene shaving); and support vector machines (SVM) [12, 4]. Validation methods have been widely used and include [25, 17]: significance analysis of microarrays (SAM); bootstrapping cluster analysis; and leave-one-out cross-validation. Most of these methods are based on filtering out profiles that maximize some criterion such as: the ratio of between-population-variation to within-population-variation; or the temporal correlation between a measured profile and a profile template. As contrasted to maximizing such scalar criteria, multicriteria gene filtering seeks to find the best compromise between maximizing or minimizing several criteria. This method is closely related to multi-objective optimization which has been used in many applications [23, 28].
2.1 Data Sets
Data from two microarray experiments are used in the sequel to illustrate our analysis. These data were collected by collaborators in Anand Swaroop’s laboratory in the Dept. of Ophthalmology at the University of Michigan. We briefly describe these two experiments below.
2.2 Mouse Retinal Aging Study
The experiment consists of hybridizing 24 retinal tissue samples taken from each of 24 age-sorted mice at 6 ages (time points) with 4 replicates per time point. These 6 time points consisted of 2 early development (Pn2, Pn10) and 4 late development (M2, M6, M16, M21) samples. RNA from each sample of retinal tissue was amplified and hybridized to the 12,422 probes on one of 24 Affymetrix U74Av2 Mouse GeneChip microarrays. The data arrays from the GeneChips were processed by Affymetrix MAS5 software to yield log2 probe response data. Of interest to our biology collaborators is the effect of aging on retinal gene expression. For this purpose we compare two populations comprising the 8 tissue samples at the two extreme late development time points M2 and M21. Our objective was to find genes with a high level of differential expression between these points.
2.3 Human Retinal Aging Study
The experiment consists of hybridizing 16 retinal tissue samples taken from 8 young human donors and 8 old human donors. The ages of the young donors ranged from 16 to 21 years and the ages of the old donors ranged from 70 to 85 years. The 16 tissue samples were hybridized to 16 Affymetrix U95A Human GeneChip microarrays each containing N = 12,642 probes. Again MAS5 software was used to extract log2 probe response data and our objective was to find genes with a high level of differential expression between the young and old populations.
3 Gene Screening and Ranking
Consider the problem of finding a set of genes whose mean expression levels are significantly different between a pair of populations (T = 2). The measured probe responses from such genes should exhibit small within-population variability (intra-class dispersion) and large between-population variability (inter-class dispersion). Two natural measures of inter-class dispersion ξ2 and intra-class dispersion ξ1 are, respectively, the (scaled) absolute difference between sample means:
\[ \xi_2(n) = \frac{1}{\sqrt{\frac{1}{M_1} + \frac{1}{M_2}}} \left| \bar{y}_{1\cdot}(n) - \bar{y}_{2\cdot}(n) \right|, \tag{1} \]
where
\[ \bar{y}_{t\cdot}(n) = \frac{1}{M_t} \sum_{m=1}^{M_t} y_{tm}(n), \]
and the pooled sample standard deviation:
\[ \xi_1(n) = \sqrt{\frac{(M_1 - 1)\,\sigma_1^2(n) + (M_2 - 1)\,\sigma_2^2(n)}{(M_1 - 1) + (M_2 - 1)}}, \tag{2} \]
where
\[ \sigma_t^2(n) = \frac{1}{M_t - 1} \sum_{m=1}^{M_t} \left( y_{tm}(n) - \bar{y}_{t\cdot}(n) \right)^2 . \]
The simple paired t-test [3] can be used to separate the populations by thresholding the ratio of the two dispersion measures:
\[ T_{pt}(n) = \frac{\xi_2(n)}{\xi_1(n)} \; \underset{\bar{S}}{\overset{S}{\gtrless}} \; \eta_T, \tag{3} \]
where ηT is a user-specified threshold. Here S refers to selecting gene n while $\bar{S}$ refers to rejecting gene n. If the user wishes to constrain familywise error rate or false discovery rate then ηT is chosen as a function of the quantiles of the Student-t density with M1 + M2 − 2 degrees of freedom. Alternatively, if the user wishes to select a fixed number p of genes for further study, e.g., by RT-PCR, then ηT is data-dependent. Specifically, in this latter case one reduces ηT until the number card{n : Tpt(n) > ηT} is equal to p, i.e. exactly p genes have Tpt(n) values that exceed ηT.
The test statistic Tpt(n) in (3) is a scalar criterion that could be used to rank the genes in decreasing order of Tpt, or, equivalently, in increasing p-value.
3.1 Multiple Objective Ranking
Multiple objective optimization captures the intrinsic compromises among possibly conflicting objectives. To illustrate, in the present context we consider the pair of criteria ξ2(n) (1) and ξ1(n) (2). A gene that maximizes ξ2 and minimizes ξ1 over all genes would be a very attractive gene indeed. Unfortunately, such an extreme of optimality is seldom attained with multiple criteria. A more common case is illustrated in Fig. 1.a. It should be obvious to the reader that gene A is “better” than gene C because A is better than C in both criteria (lower ξ1 and higher ξ2). However it is not as straightforward to specify a preference between A, B and D. Multi-criteria ranking uses the “non-dominated” property as a way to establish such a preference relation. A and B are said to be non-dominated because improvement of one criterion in going from A to B corresponds to degradation of the other criterion. All the genes which are non-dominated constitute a curve which is called the Pareto front. A second Pareto front is obtained by stripping off points on the first front and computing the Pareto front of the remaining points. This process can be repeated to define a third front and so on. A gene that lies on the k-th Pareto front will be said to be at “Pareto depth” k.
Figure 1: a). A, B, D are non-dominated genes in the dual criteria plane where ξ1 is to be minimized and ξ2 is to be maximized. Genes A, B, and D are at Pareto depth 1 while gene C is at Pareto depth 2. b). Successive Pareto fronts in dual criteria plane (o : first Pareto front, * : second Pareto front, + : third Pareto front).
Figure 2: a). Pareto front contains a single gene, b). Pareto front contains all genes. In a) linear ordering exists and a single gene (optimum) dominates the others. In b) no non-trivial partial ordering exists and there is only one Pareto front.
In rare cases the Pareto front consists of a single gene (see Fig. 2.a). At the opposite extreme, there are cases where the Pareto front consists of the entire set of genes (see Fig. 2.b). It can be shown that as the number of criteria increases the Pareto front becomes less and less discriminatory, e.g. for an infinite number of criteria it consists of the entire set of genes. In practical cases where only a few criteria are used there are multiple Pareto fronts each consisting of many genes. We illustrate in Fig. 3 where we show the scatterplot of the criteria $\{(\xi_1(n), \xi_2(n))\}_{n=1}^{N}$ defined in (1) and (2) for all gene probe responses extracted from microarrays in the mouse retina aging experiment. As in [13] we call this scatterplot the multicriterion scattergram. For this set of data, Fig. 3 shows the first Pareto front as lying on the left-upper boundary of the multicriterion scattergram.
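
As a hedged illustration (a minimal sketch, not the authors' implementation), the criteria (1) and (2), the ratio statistic (3), and the front-peeling definition of Pareto depth could be written in Python with NumPy as follows. The function names `dispersion_criteria` and `pareto_depth`, and the array layout (genes in rows, replicates in columns), are assumptions made for this sketch.

```python
import numpy as np

def dispersion_criteria(y1, y2):
    """Return (xi1, xi2) per gene for two populations.

    y1 has shape (N, M1) and y2 has shape (N, M2): genes in rows,
    replicates in columns (an assumed layout for this sketch).
    """
    m1, m2 = y1.shape[1], y2.shape[1]
    # Eq. (1): scaled absolute difference between sample means.
    xi2 = np.abs(y1.mean(axis=1) - y2.mean(axis=1)) / np.sqrt(1.0 / m1 + 1.0 / m2)
    # Eq. (2): pooled sample standard deviation (ddof=1 gives 1/(M_t - 1)).
    v1 = y1.var(axis=1, ddof=1)
    v2 = y2.var(axis=1, ddof=1)
    xi1 = np.sqrt(((m1 - 1) * v1 + (m2 - 1) * v2) / ((m1 - 1) + (m2 - 1)))
    return xi1, xi2

def pareto_depth(xi1, xi2):
    """Pareto depth of each gene; xi1 is minimized and xi2 is maximized.

    Fronts are peeled one at a time: the non-dominated genes get depth k,
    are removed, and the process repeats on the remaining genes.
    Memory is quadratic in the number of remaining genes.
    """
    xi1, xi2 = np.asarray(xi1, float), np.asarray(xi2, float)
    depth = np.zeros(xi1.size, dtype=int)
    remaining = np.arange(xi1.size)
    k = 0
    while remaining.size:
        k += 1
        a, b = xi1[remaining], xi2[remaining]
        # Entry [i, j] is True when gene j is no worse than gene i in both criteria...
        no_worse = (a[None, :] <= a[:, None]) & (b[None, :] >= b[:, None])
        # ...and strictly better in at least one of them.
        better = (a[None, :] < a[:, None]) | (b[None, :] > b[:, None])
        dominated = (no_worse & better).any(axis=1)
        depth[remaining[~dominated]] = k
        remaining = remaining[dominated]
    return depth
```

With these helpers the statistic (3) is simply `xi2 / xi1`. The pairwise comparison is quadratic in the number of genes, so for arrays with more than ten thousand probes a sort-based front computation would be preferable; the quadratic form is kept here because it mirrors the definition of dominance directly.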
Figure 3: The multicriterion scattergram for the mouse retina aging experiment (horizontal axis: ξ1, intra class, to be minimized; vertical axis: ξ2, inter class, to be maximized). Each point in the scatterplot corresponds to the pair (ξ1(n), ξ2(n)) for a particular gene n. The first three Pareto fronts are indicated (◦, □ and ⋆).
For comparison, in Fig. 4 we show the multicriterion scattergram for the human retina aging data set with the same pair of criteria ξ1, ξ2 defined in (1) and (2). As the upper left boundary of this scatterplot is much shallower and denser than the scatterplot in Fig. 3 the first Pareto front of the human data contains many more genes than the first Pareto front of the mouse data. Since they would not render well in this densely populated boundary region, the Pareto fronts are not indicated in Fig. 4.
Figure 4: The multicriterion scattergram for the human retina aging experiment (horizontal axis: ξ1, intra class, to be minimized; vertical axis: ξ2, inter class, to be maximized). Each point in the scatterplot corresponds to the pair (ξ1(n), ξ2(n)) for a particular gene n.
3.2 Pareto Depth Sampling Distribution
Microarray data are strongly corrupted by biological variations and measurement variations. To account for this variation we applied a simple resampling procedure to robustify the Pareto analysis. This resampling is implemented as a bootstrap procedure and is equivalent to leave-one-out cross-validation [15]. Resampling proceeds as follows: for each time point a sample is omitted leaving 2M sets of $(M - 1)^2$ pairs to be tested (here we set Mt = M, corresponding to the two data sets presented above). For each of these resampled sets of genes the Pareto fronts are computed. The most resistant genes are those which remain on the top Pareto fronts throughout the resampling process. To quantify the movement of a given gene across the Pareto fronts we introduce the Pareto depth sampling distribution (PDSD). For each gene this distribution corresponds to the empirical distribution of the 2M Pareto front indexes visited during the resampling process:
\[ \mathrm{Pdsd}_n(k) = \frac{1}{M_{resamp}} \sum_{j=1}^{M_{resamp}} \mathbf{1}_n(j, k), \quad k = 1, \ldots, N, \]
where $M_{resamp} = 2M$ is the number of resampling trials, and $\mathbf{1}_n(j, k)$ is an indicator function of the event: “j-th resampling of n-th gene is on k-th Pareto front.” If K is the total number of Pareto fronts in the scattergram $\{(\xi_1(n), \xi_2(n))\}_{n=1}^{N}$ then, by convention, we define $\mathrm{Pdsd}_n(k) = 0$ for k > K. As the PDSD is a probability distribution, $\mathrm{Pdsd}_n(k) \geq 0$ and $\sum_k \mathrm{Pdsd}_n(k) = 1$.
Figure 5 corresponds to the (un-normalized) PDSDs over the first 40 Pareto depths for four different genes taken from the human data set under the dual criteria (ξ1, ξ2) of (1) and (2). The highly concentrated PDSD in the top-left panel indicates that this gene is very stable; it remains on the first front throughout the resampling process. At the opposite extreme, the highly dispersed PDSD on the bottom-right panel indicates a very unstable gene; its Pareto depth is highly sensitive to resampling. The two other panels depict PDSD’s of genes that lie within these two extremes. As the PDSD summarizes all of the empirical Pareto depth statistics it can be used to develop a wide array of gene ranking criteria. For example, in [10] we ranked genes in terms of the proportion of resampling trials for which a gene remained on one of the top 3 Pareto fronts. This ranking criterion is equivalent to the cumulative Pareto front test
\[ T_{cum}(n) = \sum_{k=1}^{3} \mathrm{Pdsd}_n(k) \; \underset{\bar{S}}{\overset{S}{\gtrless}} \; \eta_c. \tag{4} \]
Figure 5: Unnormalized PDSDs for four different genes taken from human retina experiment. These PDSDs are indexed by the Pareto depth, which is equivalent to Pareto front number.
In this paper we investigate a different PDSD ranking statistic for pulling out genes that are both highly stable and have low Pareto depth. Genes with these attributes can be captured by requiring that their Pareto depth variance σ²(n) and squared Pareto depth mean m²(n) be small. Equivalently, we define the Pareto depth test
\[ T_{pd}(n) = \sqrt{m^2(n) + \sigma^2(n)} \; \underset{S}{\overset{\bar{S}}{\gtrless}} \; \eta_d, \tag{5} \]
so that gene n is selected when Tpd(n) falls below ηd. Note that, since m²(n) + σ²(n) is the second moment of the Pareto depth, the test statistic is equivalently expressed as $T_{pd}(n) = \sqrt{\sum_k k^2\, \mathrm{Pdsd}_n(k)}$.
Figure 6 is the scatter plot of the pairs of moments $\{(m^2(n), \sigma^2(n))\}_{n=1}^{N}$ of the gene PDSDs for the human retina data. The best genes are those which have smallest mean and variance, i.e., the genes that lie on the lower left corner of the scatter plot. For a given threshold ηd the test (5) defines a quarter disk region in the plane of Fig. 6 centered at the origin (0, 0) with disk radius ηd. Genes whose moment pair (m²(n), σ²(n)) falls in this region will pass the test and be selected as having both low and stable Pareto depths. Bootstrap methods, implemented with random permutation and resampling, could be used in a straightforward way to determine the p-values of this test. However, in this paper we will focus on constraining the number of discovered genes as opposed to the level of significance of the test. When the number of discovered genes is constrained to be 50, the top ranked 50 genes fall into the acceptance region of the test (5). Figure 7 shows a gray-coded image of the PDSDs for each of these top 50 genes for the human retina data. The figure indicates that the Pareto depths of these 50 genes are tightly concentrated in the range 1 to 6.
Figure 6: The scatter plot of the squared mean (horizontal axis) and the variance (vertical axis) of the Pareto depth of each gene for human retinal data. Here CV refers to our resampling method consisting of leave-one-out cross-validation.
Figure 7: The PDSDs of the 50 top human genes discovered using the test (5) applied to the scatter plot of Fig. 6 with threshold ηd determined such that exactly 50 genes fall into the acceptance region. The magnitude of the PDSD is encoded in the false color range of black (PDSD = 1) to white (PDSD = 0).
For comparison, Fig. 8 shows the PDSDs obtained by applying exactly the same selection criterion (5) to the mouse aging experiment as we just presented for the human aging experiment. Notice that the PDSDs for the top 50 mouse genes are spread over 16 or more Pareto depths. This high spread reflects the facts that: (1) there are fewer stable Pareto dominant genes in the mouse aging experiment as compared to the human aging experiment; (2) the Pareto front for the human aging experiment is broader and contains more genes than for the mouse aging data.
4 Experimental Comparisons
Here we compare the paired t-test (3), the cumulative Pareto front test (4) used in [10], and the Pareto depth test (5) on the basis of their gene ranking performance for the retinal aging experimental data and for simulated data.
4.1 Experimental Data
Figures 9 and 10 show the number of genes discovered as a function of the paired t-test threshold ηT for the experimental human and mouse data, respectively. The shapes of the curves in these two figures are substantially different. Indeed the distinctive plateau at the right tail of Fig. 10 is due to the existence of several mouse genes whose best scores (ξ1(n), ξ2(n)) are well detached from the scores of the rest of the genes. There are no such highly detached human genes as can be seen by comparing the multicriteria scattergrams of Figs. 3 and 4. Figures 11 and 12 show the number of genes discovered as a function of the inverse Pareto depth test threshold 1/ηd for the experimental human and mouse data, respectively. Again the shapes of the curves in these two figures are substantially different. As compared to the paired t-test figures, Figs. 9 and 10, the increase in the number of genes discovered by the Pareto depth test is much more gradual as 1/ηd decreases, suggesting that the Pareto rankings are more stable to measurement variations as compared to the rankings induced by the t-test statistic.
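
The resampling procedure of Section 3.2 and the statistics (4) and (5) can be sketched in Python as below. This is a hedged illustration only: it assumes that each of the $M_{resamp} = 2M$ trials drops a single replicate from one of the two populations, which is one reading of the text and may differ from the authors' MATLAB/C implementation. The names `criteria`, `pareto_depth`, `pdsd`, `t_cum` and `t_pd` are hypothetical.

```python
import numpy as np

def criteria(y1, y2):
    """(xi1, xi2) of Eqs. (2) and (1); genes in rows, replicates in columns."""
    m1, m2 = y1.shape[1], y2.shape[1]
    xi2 = np.abs(y1.mean(axis=1) - y2.mean(axis=1)) / np.sqrt(1.0 / m1 + 1.0 / m2)
    pooled = ((m1 - 1) * y1.var(axis=1, ddof=1)
              + (m2 - 1) * y2.var(axis=1, ddof=1)) / (m1 + m2 - 2)
    return np.sqrt(pooled), xi2

def pareto_depth(xi1, xi2):
    """Depth by front peeling; xi1 is minimized, xi2 is maximized."""
    depth = np.zeros(xi1.size, dtype=int)
    remaining = np.arange(xi1.size)
    k = 0
    while remaining.size:
        k += 1
        a, b = xi1[remaining], xi2[remaining]
        no_worse = (a[None, :] <= a[:, None]) & (b[None, :] >= b[:, None])
        better = (a[None, :] < a[:, None]) | (b[None, :] > b[:, None])
        dominated = (no_worse & better).any(axis=1)
        depth[remaining[~dominated]] = k
        remaining = remaining[dominated]
    return depth

def pdsd(y1, y2):
    """Empirical Pareto depth distribution, one row per gene.

    ASSUMPTION: each trial omits one replicate from one population,
    giving M1 + M2 trials (2M when M1 = M2 = M). Needs at least 3
    replicates per population so the variances stay defined.
    """
    trials = [(np.delete(y1, j, axis=1), y2) for j in range(y1.shape[1])]
    trials += [(y1, np.delete(y2, j, axis=1)) for j in range(y2.shape[1])]
    depths = np.stack([pareto_depth(*criteria(a, b)) for a, b in trials])
    k_max = int(depths.max())
    dist = np.zeros((y1.shape[0], k_max))
    for k in range(1, k_max + 1):
        # Fraction of trials in which each gene sat on front k.
        dist[:, k - 1] = (depths == k).mean(axis=0)
    return dist

def t_cum(dist, top=3):
    """Eq. (4): probability mass on the first `top` fronts (large is good)."""
    return dist[:, :top].sum(axis=1)

def t_pd(dist):
    """Eq. (5): sqrt(m^2 + sigma^2) = sqrt(sum_k k^2 Pdsd(k)) (small is good)."""
    k = np.arange(1, dist.shape[1] + 1)
    return np.sqrt(dist @ (k ** 2))
```

A gene that stays on the first front in every trial has `t_pd` equal to 1, the smallest possible value, while a gene whose depth wanders over many fronts is penalized both through its mean depth and through its spread.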
Figure 8: The PDSDs of the 50 top mouse genes discovered using the test (5). As compared to the human genes of Fig. 7 there is much higher variability in the Pareto depths of the top 50.
Figure 9: Number of t-test-extracted genes as a function of threshold ηT for data in human retina aging study.
Figure 10: Number of t-test-extracted genes as a function of threshold ηT for data in mouse retina aging study.
Figure 11: Number of Pareto-depth-test-extracted genes as a function of inverse threshold 1/ηd for data in human retina aging study.
Figure 12: Number of Pareto-depth-test-extracted genes as a function of inverse threshold 1/ηd for data in mouse retina aging study.
4.2 Simulated Data
We performed a limited set of simulations to be able to compare estimated rankings to the “ground truth” true rankings. The simulations were designed to be representative of gene expressions in a typical gene microarray experiment. Three hundred (N = 300) different probe responses were simulated. Eight (M = 8) replicates of the n-th gene probe response were generated according to an i.i.d. Gaussian distribution with means and variances given by (m1(n), σ1²(n)) and (m2(n), σ2²(n)) for populations 1 and 2, respectively. The variances were made equal, σ1²(n) = σ2²(n) = σ²(n), over both populations. The means and variances were set by the following formula:
\[ \sigma(n) = \xi_2(n), \quad m_1(n) = 0, \quad m_2(n) = \xi_1(n)\,\xi_2(n)/2, \]
where the values of ξ1(n), ξ2(n) are indicated by the criteria structure illustrated in Fig. 13. The ground truth ranking of all genes is determined by this figure which can be viewed as the ensemble mean scattergram. We designate the 90 genes on the first 3 fronts of Fig. 13 (depth increasing along the −45° diagonal) as ground-truth-optimal genes.
Figure 13: Ensemble mean scattergram (ground truth) for simulation study. There are 10 groups of 30 genes represented by each of the 10 semicircles. Ground truth Pareto optimal genes lie on the outermost front (low ξ1 and high ξ2).
Figure 14 shows a realization of the empirical scattergram obtained from sample mean and variance estimates derived from the replicates. Figure 15 shows the three first Pareto fronts and the boundaries of two acceptance regions for the paired t-test applied to the empirical scattergram of Fig. 14. The first three Pareto fronts do not capture all of the ground-truth-optimal genes but they have a very low (0%) false discovery rate (proportion of genes found which are not ground-truth-optimal). The solid line boundary of the paired t-test corresponds to a threshold ηT which discovers the 90 genes with highest Tpt(n) value. Use of this acceptance region would result in discovery of more ground-truth-optimal genes than discovered by the first three Pareto fronts, but with a false discovery rate of approximately 15%. The dashed line boundary corresponds to a paired t-test threshold ηT which would lead to discovery of all of the 90 ground-truth-optimal genes. However, the false discovery rate of this acceptance region is quite high (> 40%). Fig. 16 gives another depiction of the two acceptance regions of the paired t-test. It is clear from this example that neither the paired t-test nor the cumulative Pareto front test (4) succeeds in extracting all the ground-truth-optimal genes with low false discovery rate. The next question we address is: would a Pareto depth test do better than the simple three front Pareto test and paired t-tests illustrated here?
Figure 14: Empirical scattergram constructed from estimating sample mean and variance from the M = 8 i.i.d. samples of each gene.
Figure 15: Three first Pareto fronts (◦, □ and ∗) and boundaries of paired t-test acceptance regions for the scattergram of Fig. 14.
To quantify the tradeoffs between the paired t-test and the Pareto tests for extracting the ground-truth-optimal genes we performed a representative simulation study to compute the average correct discovery rates and the average false discovery rates as a function of the number M of replicates. All tests were implemented with a data dependent threshold which selected the 90 top genes as ranked by the respective test statistics. For the range of M studied this threshold setting gave the paired t-test a nearly constant correct discovery rate of approximately 88%. The cumulative Pareto front test (4) and the Pareto depth test (5) were implemented for comparison.
In Figs. 17 and 18 we plot the correct discovery rate and the false discovery rate, respectively, for the paired t-test and the cumulative Pareto front test. From the figures it is clear that the cumulative Pareto front test has better performance than the paired t-test for large M. However, it suffers from lower correct discovery rate than the paired t-test for small M. In Figs. 19 and 20 the same error rates are compared for the paired t-test and the Pareto depth test. The Pareto depth test performed significantly better (higher correct discovery rate and lower false discovery rate) than the paired t-test for all M.
Figure 17: Correct discovery rate as a function of the number of replicates for paired t-test (solid) versus cumulative Pareto front test (dashed). Both tests have data-dependent thresholds that select the 90 top ranked genes according to their respective test statistics.
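
The data-dependent thresholding and the two error rates used in this comparison can be sketched as follows. This is a minimal illustration under stated assumptions: the false discovery rate follows the definition given above (proportion of selected genes that are not ground-truth-optimal), while the correct discovery rate is assumed here to be the proportion of ground-truth-optimal genes that are selected, which the text does not spell out. The function names `select_top` and `discovery_rates` are hypothetical.

```python
import numpy as np

def select_top(stat, p, larger_is_better=True):
    """Indices of the p top-ranked genes for a scalar test statistic.

    Use larger_is_better=True for T_pt (3) and T_cum (4), and
    larger_is_better=False for the Pareto depth statistic T_pd (5).
    """
    stat = np.asarray(stat, dtype=float)
    key = -stat if larger_is_better else stat
    # A stable sort keeps the original gene order among ties.
    return np.argsort(key, kind="stable")[:p]

def discovery_rates(selected, optimal):
    """Return (correct discovery rate, false discovery rate) as fractions.

    ASSUMPTION: correct rate = |selected and optimal| / |optimal|;
    false rate = |selected but not optimal| / |selected|.
    """
    selected = {int(i) for i in selected}
    optimal = {int(i) for i in optimal}
    hits = len(selected & optimal)
    correct = hits / len(optimal)
    false = (len(selected) - hits) / len(selected)
    return correct, false
```

When the number of selected genes equals the number of ground-truth-optimal genes, as with the 90 genes used here, the two rates defined this way sum to one within a single realization.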
Figure 16: Paired t-test statistic and thresholds corresponding to boundaries in Fig. 15. Genes are ordered from right to left by scanning successive fronts in the ensemble mean scattergram of Fig. 13.
Figure 18: False discovery rate as a function of the number of replicates for paired t-test (solid) versus cumulative Pareto front test (dashed). Both tests have data-dependent thresholds that select the 90 top ranked genes according to their respective test statistics.
Figure 19: Correct discovery rate as a function of the number of replicates for paired t-test (solid) versus Pareto depth test (dashed). Both tests have data-dependent thresholds that select the 90 top ranked genes according to their respective test statistics.
The alert reader will realize that our definition of ground-truth-optimal genes favors the Pareto methods of gene ranking and selection as compared to the paired t methods. Our definition of ground-truth-optimality was motivated by our several years of experience helping molecular biologists discover biologically interesting genes, in particular genes with weak but interesting transcription factors. A more comprehensive study would compare the performance of Pareto to paired t approaches when the ground-truth-optimal genes are defined differently. Due to space limitations we do not present the results of this study here.
5 Conclusion
DNA microarray technology allows one to evaluate the expression profile of thousands of genes simultaneously. However, to take full advantage of these powerful tools, we need to find new methods to handle large amounts of data and information [...] increasingly high dimensionality of genetic data sets. The developed method has been implemented in matlab and C and is sufficiently fast to be part of an interactive tool for gene screening, ranking, and clustering.
Acknowledgements
References
[1] A. A. Alizadeh and etal, “Distinct types of diffuse large B-cell lymphoma identified by gene
expression profiling,” Nature, vol. 403, pp.
503–511, 2000.
[2] D. Bassett, M. Eisen, and M. Boguski, “Gene
expression informatics–it’s all in your mine,”
Nature Genetics, vol. 21, no. 1 Suppl, pp. 51–
55, Jan 1999.
40
30
25
[3] P. J. Bickel and K. A. Doksum, Mathematical Statistics: Basic Ideas and Selected Topics,
Holden-Day, San Francisco, 1977.
20
15
[4] M. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. Sugent, T. Furey, M. Ares, and
D. Haussler, “Knowledge-based analysis of microarray gene expression data by using support
vector machines,” Proc. of Nat. Academy of
Sci. (PNAS), vol. 97, no. 1, pp. 262–267, 2000.
10
5
0
1.5
This research was partially supported by National
Institutes of Health grant NIH-EY11115 (including
microarray supplements), Macula Vision Research
Foundation, and Elmer and Sylvia Sramek Foundation.
2
2.5
3
3.5
4
log2(number of samples)
4.5
5
5.5
Figure 20: False discovery rate as a function of the
number of replicates for paired t-test (solid) versus
Pareto depth test (dashed). Both tests have datadependent thresholds that select the 90 top ranked
genes according to their respective test statistics.
without becoming overwhelmed by the potentially
large number of candidate genes. This paper has
presented a new method of Pareto analysis that can
identify and rank genes that have both stable and
low Pareto depths relative to the remaining genes.
Additional genes discovered using this algorithm are
now being validated by RT-PCR methods. Many
signal processing challenges remain due to the in11
[5] V. Cheung, L. Conlin, T. Weber, M. Arcaro, K. Jen, M. Morley, and R. Spielman, “Natural variation in human gene expression assessed in lymphoblastoid cells,” Nature Genetics, vol. 33, pp. 422–425, 2003.
[6] F. S. Collins, M. Morgan, and A. Patrinos, “The Human Genome Project: lessons from large-scale biology,” Science, vol. 300, pp. 286–290, April 11 2003.
[7] J. DeRisi, V. Iyer, and P. Brown, “Exploring the metabolic and genetic control of gene expression on a genomic scale,” Science, vol. 278, no. 5338, pp. 680–686, Oct 24 1997.
[8] D. Donoho and M. Gasko, “Breakdown properties of location estimates based on halfspace depth and projected outlyingness,” Annals of Statistics, vol. 4, pp. 1803–1827, 1992.
[9] P. Fitch and B. Sokhansanj, “Genomic engineering: moving beyond DNA sequence to function,” IEEE Proceedings, vol. 88, no. 12, pp. 1949–1971, Dec 2000.
[10] G. Fleury, A. O. Hero, S. Yosida, T. Carter, C. Barlow, and A. Swaroop, “Clustering gene expression signals from retinal microarray data,” in Proc. IEEE Int. Conf. Acoust., Speech, and Sig. Proc., volume IV, pp. 4024–4027, Orlando, FL, 2002.
[11] G. Fleury, A. O. Hero, S. Yosida, T. Carter, C. Barlow, and A. Swaroop, “Pareto analysis for gene filtering in microarray experiments,” in European Sig. Proc. Conf. (EUSIPCO), Toulouse, FRANCE, 2002.
[12] T. Hastie, R. Tibshirani, M. Eisen, P. Brown, D. Ross, U. Scherf, J. Weinstein, A. Alizadeh, L. Staudt, and D. Botstein, “Gene shaving: a new class of clustering methods for expression arrays,” Technical report, Stanford University, 2000.
[13] A. Hero and G. Fleury, “Pareto-optimal methods for gene analysis,” Journ. of VLSI Signal Processing, to appear, 2003. www.eecs.umich.edu/~hero/bioinfo.html.
[14] A. Hero, G. Fleury, A. Mears, and A. Swaroop, “Multicriteria gene screening for analysis of differential expression with DNA microarrays,” EURASIP Journ. of Applied Signal Processing, to appear, 2003. www.eecs.umich.edu/~hero/bioinfo.html.
[15] J. U. Hjorth, Computer intensive statistical methods, Chapman and Hall, London, 1994.
[16] K. Kadota, R. Miki, H. Bono, K. Shimizu, Y. Okazaki, and Y. Hayashizaki, “Preprocessing implementation for microarray (prim): an efficient method for processing cdna microarray data,” Physiol Genomics, vol. 4, no. 3, pp. 183–188, Jan 19 2001.
[17] K. Kerr and G. Churchill, “Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments,” Proc. of Nat. Academy of Sci. (PNAS), vol. 98, pp. 8961–8965, 2000. citeseer.nj.nec.com/414709.html.
[18] C. Lee, R. Klopp, R. Weindruch, and T. Prolla, “Gene expression profile of aging and its retardation by caloric restriction,” Science, vol. 285, no. 5432, pp. 1390–1393, Aug 27 1999.
[19] F. Livesey, T. Furukawa, M. Steffen, G. Church, and C. Cepko, “Microarray analysis of the transcriptional network controlled by the photoreceptor homeobox gene Crx,” Curr Biol, vol. 6, no. 10, pp. 301–10, Mar 23 2000.
[20] D. Lockhart, H. Dong, M. Byrne, M. Follettie, M. Gallo, M. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, and E. Brown, “Expression monitoring by hybridization to high-density oligonucleotide arrays,” Nat. Biotechnol., vol. 14, no. 13, pp. 1675–80, 1996.
[21] M. Marra et al., “The genome sequence of the SARS-associated coronavirus,” Science Express, vol. 10.1126, May 1 2003. www.scienceecpress.org.
[22] P. A. Rota et al., “Characterization of a novel coronavirus associated with severe acute respiratory syndrome,” Science, vol. 10.1126, May 1 2003. www.scienceecpress.org.
[23] R. E. Steuer, Multi criteria optimization: theory, computation, and application, Wiley, New York N.Y., 1986.
[24] J. Tukey, Exploratory Data Analysis, Wiley, NY NY, 1977.
[25] V. G. Tusher, R. Tibshirani, and G. Chu, “Significance analysis of microarrays applied to the ionizing radiation response,” Proc. of Nat. Academy of Sci. (PNAS), vol. 98, pp. 5116–5121, 2001.
[26] J. Watson and A. Berry, DNA: The secret of life, Alfred A. Knopf, 2003.
[27] J. Yu, A. Mears, S. Yoshida, R. Farjo, T. Carter, D. Ghosh, A. Hero, C. Barlow, and A. Swaroop, “From disease genes to cellular pathways: A progress report,” in Retinal dystrophies: functional genomics to gene therapy, pp. 147–164, Wiley, Novartis Foundation Symposium, Chichester, 2004.
[28] E. Zitzler and L. Thiele, “An evolutionary algorithm for multiobjective optimization: the strength Pareto approach,” Technical report, Swiss Federal Institute of Technology (ETH), May 1998.