Gene Discovery Using Pareto Depth Sampling Distributions

Gilles Fleury‡, Alfred Hero†, Sepideh Zareparsi∗ and Anand Swaroop∗
‡ Ecole Supérieure d’Electricité, Service des Mesures, 91192 Gif-sur-Yvette, France
† Depts. of EECS, BioMedical Eng. and Statistics, University of Michigan, Ann Arbor MI 48109, USA
∗ Dept. of Ophthalmology and Visual Sciences, University of Michigan, Ann Arbor MI 48109, USA

January 19, 2004

Abstract

Most methods for finding interesting gene expression profiles from gene microarray data are based on a single discriminant, e.g. the classical paired t-test. Here a different approach is introduced based on gene ranking according to Pareto depth in multiple discriminants. The novelty of our approach, which is an extension of our previous work on Pareto front analysis (PFA), is that a gene’s relative rank is determined according to the ordinal theory of multiple objective optimization. Furthermore, the distribution of each gene’s rank, called Pareto depth, is determined by resampling over the microarray replicates. This distribution is called the Pareto depth sampling distribution (PDSD) and it is used to assess the stability of each ranking. We illustrate and compare the PDSD approach with both simulated and real gene microarray experiments.

Keywords: gene filtering, multi-objective optimization, false discovery rate, data depth analysis.

1 Introduction

Since Watson and Crick discovered DNA more than fifty years ago, the field of genomics has progressed from a speculative science starved for data and computation cycles to one of the most thriving areas of current biomedical research and development [26]. Spectacular advances in gene mapping of humans and other organisms [6, 22, 21] have been made over the past decade. Similar advances are being made in the discovery of within-species variation of transcript expression levels, e.g., due to factors such as growth, disease, and environment [27, 5, 1, 7]. Progress in understanding such gene expression mechanisms has been made possible by the gene microarray and its associated signal extraction and processing algorithms [20, 18, 19]. In this paper, we present a new method for analyzing gene microarray data which we call the method of Pareto depth sampling distributions (PDSD).

The massive scale and variability of microarray gene data creates new and challenging problems of clustering and data mining: the so-called gene filtering problem. This problem has two subproblems called gene screening and gene ranking. Gene screening is concerned with determining a list of gene probes whose expression levels are statistically and biologically significant with respect to some p-value or familywise error rate. Gene ranking is concerned with finding a fixed number of genes that are rank ordered according to one or more statistical and biological criteria. These two subproblems are closely related, but this paper focuses on gene ranking using multiple criteria. Multicriteria methods of gene screening with familywise error constraints have been presented elsewhere [14] and will not be discussed further in this paper.

Multicriteria gene filtering seeks to find genes whose expression profiles strike an optimal compromise between maximizing (or minimizing) several criteria. It is often easier for a molecular biologist to specify several criteria than a single criterion. For example the biologist might be interested in aging genes, which might be defined as those genes having expression profiles that are increasing over time, have low curvature over time, and whose total increase from initial time to final time is large. Or one may have to deal with two biologists who each have different criteria for what features constitute an interesting aging gene. In a well-designed gene microarray experiment, multicriteria (or other) methods of screening will generally result in a large number of genes and the biologist must next face the problem of selecting a few of the most “promising genes” to investigate further. Resolution of this problem is of importance since validation of gene response requires running more sensitive amplification protocols, such as quantitative real-time reverse-transcription polymerase-chain-reaction (RT-PCR). As compared to microarray experiments, RT-PCR’s higher sensitivity is offset by its lower throughput and its higher cost-per-probe.

It is thus clear that some sort of rank ordering of the selected genes would help guide the biologist to a cost effective solution of the validation problem. As a linear ordering of multiple criteria does not generally exist, an absolute ranking of the selected genes is generally impossible. However a partial ordering is possible when formulated as a multicriterion optimization problem. This partial ordering groups genes into successive Pareto fronts of the multicriteria scattergram (see Section 3). It is this partial ordering which was used in our previous work [10, 11, 13] to obtain relative rankings of gene expression levels based on microarray experiments. We called our multiobjective approach to gene ranking Pareto front analysis (PFA). As pointed out in [13] the PFA approach is related to John Tukey’s notion of data depths and contours of depth in a multivariate sample [24, 8]. To highlight the contrast between PFA and the concept of data depths we will refer to the Pareto depth of a gene as the Pareto front on which the gene lies. It is to be noted that Pareto analysis has been adopted for many continuous and discrete optimization applications including evolutionary computing [23, 28].

Several variants of PFA were introduced in [10, 11, 13] including resistant PFA (RPFA), based on cross-validation, and posterior PFA (PPFA), based on Bayes posterior analysis, of gene rankings. These rankings were computed by rank ordering each gene’s probability of lying on the first two or three Pareto fronts of the multicriteria scattergram. This paper introduces a more powerful PFA gene discovery tool, the aforementioned Pareto depth sampling distribution method, into the PFA toolbox. The PDSD method generates an empirical distribution of the depth of the front, the Pareto depth, on which each gene lies. This distribution is computed by implementing a resampling method similar to the bootstrap. From this distribution many different attributes of the Pareto depth can be determined and used for ranking the genes. The PDSD approach is more general than our previous cross-validation PFA approach that used a special attribute, the cumulative PDSD, to rank the genes. Using simulations we compare our PFA methods to the standard paired t-test on the basis of correct discovery and false discovery rates. Our principal conclusion is that the PDSD approach, when formulated as a Pareto depth test, significantly outperforms previous PFA and paired t-test methods.

Our experience has shown that the PFA methodology can discover important genes that elude standard analysis such as paired t-test or other analysis of variance (ANOVA) methods. However, our objective here is limited to introducing and illustrating the PFA methodology and we do so using both controlled simulations and experimental data. These datasets are representative of real world microarray experiments for studying the genetics of retinal aging and disease. We report on more comprehensive comparisons, biological significance of genes discovered using PFA, and scientific significance of the experiments elsewhere.

The paper is organized as follows. In Sec. 2 we give some background on genomics and briefly review gene microarray technology. In Sec. 3 we motivate and describe the PFA multicriterion ranking approach and introduce the concept of PDSDs. In Sec. 4 we report on quantitative comparisons between the Pareto depth test and other tests used for gene selection and ranking. Finally, in Sec. 5 we conclude with some general remarks.
2 Background

The ability to perform accurate genetic differentiation between two or more biological populations is a problem of great interest to researchers in genetics and related areas. For example, in a temporally sampled population of mice one is frequently interested in identifying genes that display a significant change in expression level between a pair of time points. Gene microarrays have revolutionized the field of experimental genetics by offering the experimenter the ability to measure thousands of gene sequences simultaneously. A gene microarray consists of a large number N of known DNA probes that are put in distinct locations on a small slide [16, 2, 9]. After hybridization of an unknown tissue sample to the microarray, the abundance of each probe present in the sample can be estimated from the measured levels of hybridization, called probe responses, of the sample to each probe.

Due to high response variability the study of differential gene expression between two or more populations or time points usually requires hybridizing several samples from each population. We assume that there are $T$ populations each consisting of $M_t$ samples, $t = 1, \ldots, T$. For each of the $\sum_{t=1}^{T} M_t$ samples we assume an independent microarray hybridization experiment is performed yielding N gene probe responses extracted from the microarray. Define the measured response of the n-th probe on the m-th microarray acquired at time t as

\[ y_{tm}(n), \quad n = 1, \ldots, N, \quad m = 1, \ldots, M_t, \quad t = 1, \ldots, T. \]

When several gene chip experiments are performed over time they can be combined in order to find genes with interesting expression profiles. This is a data mining problem for which many methods have been proposed including: multiple paired t-tests; linear discriminant analysis; self organizing (Kohonen) maps (SOM); principal components analysis (PCA); K-means clustering; hierarchical clustering (kdb trees, CART, gene shaving); and support vector machines (SVM) [12, 4]. Validation methods have been widely used and include [25, 17]: significance analysis of microarrays (SAM); bootstrapping cluster analysis; and leave-one-out cross-validation. Most of these methods are based on filtering out profiles that maximize some criterion such as: the ratio of between-population-variation to within-population-variation; or the temporal correlation between a measured profile and a profile template. As contrasted to maximizing such scalar criteria, multicriteria gene filtering seeks to find the best compromise between maximizing or minimizing several criteria. This method is closely related to multi-objective optimization which has been used in many applications [23, 28].

2.1 Data Sets

Data from two microarray experiments are used in the sequel to illustrate our analysis. These data were collected by collaborators in Anand Swaroop’s laboratory in the Dept. of Ophthalmology at the University of Michigan. We briefly describe these two experiments below.

2.2 Mouse Retinal Aging Study

The experiment consists of hybridizing 24 retinal tissue samples, one taken from each of 24 age-sorted mice at 6 ages (time points) with 4 replicates per time point. These 6 time points consisted of 2 early development (Pn2, Pn10) and 4 late development (M2, M6, M16, M21) samples. RNA from each sample of retinal tissue was amplified and hybridized to the 12,422 probes on one of 24 Affymetrix U74Av2 Mouse GeneChip microarrays. The data arrays from the GeneChips were processed by Affymetrix MAS5 software to yield log2 probe response data. Of interest to our biology collaborators is the effect of aging on retinal gene expression. For this purpose we compare two populations comprising the 8 tissue samples at the two extreme late development time points M2 and M21. Our objective was to find genes with a high level of differential expression between these points.

2.3 Human Retinal Aging Study

The experiment consists of hybridizing 16 retinal tissue samples taken from 8 young human donors and 8 old human donors. The ages of the young donors ranged from 16 to 21 years and the ages of the old donors ranged from 70 to 85 years. The 16 tissue samples were hybridized to 16 Affymetrix U95A Human GeneChip microarrays each containing N = 12,642 probes. Again MAS5 software was used to extract log2 probe response data and our objective was to find genes with a high level of differential expression between the young and old populations.

3 Gene Screening and Ranking

Consider the problem of finding a set of genes whose mean expression levels are significantly different between a pair of populations (T = 2). The measured probe responses from such genes should exhibit small within-population variability (intra-class dispersion) and large between-population variability (inter-class dispersion). Two natural measures of inter-class dispersion $\xi_2$ and intra-class dispersion $\xi_1$ are, respectively, the (scaled) absolute difference between sample means:

\[ \xi_2(n) = \frac{1}{\sqrt{\frac{1}{M_1} + \frac{1}{M_2}}} \left| \bar{y}_{1\cdot}(n) - \bar{y}_{2\cdot}(n) \right|, \tag{1} \]

where

\[ \bar{y}_{t\cdot}(n) = \frac{1}{M_t} \sum_{m=1}^{M_t} y_{tm}(n), \]

and the pooled sample standard deviation:

\[ \xi_1(n) = \sqrt{\frac{(M_1 - 1)\,\sigma_1^2(n) + (M_2 - 1)\,\sigma_2^2(n)}{(M_1 - 1) + (M_2 - 1)}}, \tag{2} \]

where

\[ \sigma_t^2(n) = \frac{1}{M_t - 1} \sum_{m=1}^{M_t} \left( y_{tm}(n) - \bar{y}_{t\cdot}(n) \right)^2 . \]

The simple paired t-test [3] can be used to separate the populations by thresholding the ratio of the two dispersion measures:

\[ T_{pt}(n) = \frac{\xi_2(n)}{\xi_1(n)} \; \underset{\bar{S}}{\overset{S}{\gtrless}} \; \eta_T, \tag{3} \]

where $\eta_T$ is a user-specified threshold. Here $S$ refers to selecting gene n while $\bar{S}$ refers to rejecting gene n. If the user wishes to constrain familywise error rate or false discovery rate then $\eta_T$ is chosen as a function of the quantiles of the student-t density with $M_1 + M_2 - 2$ degrees of freedom. Alternatively, if the user wishes to select a fixed number p of genes for further study, e.g., by RT-PCR, then $\eta_T$ is data-dependent. Specifically, in this latter case one reduces $\eta_T$ until the number $\mathrm{card}\{n : T_{pt}(n) > \eta_T\}$ is equal to p, i.e. exactly p genes have $T_{pt}(n)$ values that exceed $\eta_T$. The test statistic $T_{pt}(n)$ in (3) is a scalar criterion that could be used to rank the genes in decreasing order of $T_{pt}$, or, equivalently, in increasing p-value.

3.1 Multiple Objective Ranking

Multiple objective optimization captures the intrinsic compromises among possibly conflicting objectives. To illustrate, in the present context we consider the pair of criteria $\xi_2(n)$ (1) and $\xi_1(n)$ (2). A gene that maximizes $\xi_2$ and minimizes $\xi_1$ over all genes would be a very attractive gene indeed. Unfortunately, such an extreme of optimality is seldom attained with multiple criteria. A more common case is illustrated in Fig. 1.a. It should be obvious to the reader that gene A is “better” than gene C because A improves on C in both criteria (lower $\xi_1$ and higher $\xi_2$). However it is not as straightforward to specify a preference between A, B and D. Multi-criteria ranking uses the “non-dominated” property as a way to establish such a preference relation. A and B are said to be non-dominated because improvement of one criterion in going from A to B corresponds to degradation of the other criterion. All the genes which are non-dominated constitute a curve which is called the Pareto front. A second Pareto front is obtained by stripping off points on the first front and computing the Pareto front of the remaining points. This process can be repeated to define a third front and so on. A gene that lies on the k-th Pareto front will be said to be at “Pareto depth” k.

Figure 1: a). A, B, D are non-dominated genes in the dual criteria plane where ξ1 is to be minimized and ξ2 is to be maximized. Genes A, B, and D are at Pareto depth 1 while gene C is at Pareto depth 2. b). Successive Pareto fronts in dual criteria plane (o : first Pareto front, * : second Pareto front, + : third Pareto front).

Figure 2: a). Pareto front contains a single gene, b). Pareto front contains all genes. In a) linear ordering exists and a single gene (optimum) dominates the others. In b) no non-trivial partial ordering exists and there is only one Pareto front.

In rare cases the Pareto front consists of a single gene (see Fig. 2.a). At the opposite extreme, there are cases where the Pareto front consists of the entire set of genes (see Fig. 2.b). It can be shown that as the number of criteria increases the Pareto front becomes less and less discriminatory, e.g. for an infinite number of criteria it consists of the entire set of genes. In practical cases where only a few criteria are used there are multiple Pareto fronts each consisting of many genes. We illustrate in Fig. 3 where we show the scatterplot of the criteria $\{(\xi_1(n), \xi_2(n))\}_{n=1}^{N}$ defined in (1) and (2) for all gene probe responses extracted from microarrays in the mouse retina aging experiment. As in [13] we call this scatterplot the multicriterion scattergram. For this set of data, Fig. 3 shows the first few Pareto fronts as lying on the upper-left boundary of the multicriterion scattergram.

Figure 3: The multicriterion scattergram for the mouse retina aging experiment (horizontal axis: ξ1, intra class, to be minimized; vertical axis: ξ2, inter class, to be maximized). Each point in the scatterplot corresponds to the pair (ξ1(n), ξ2(n)) for a particular gene n. The first three Pareto fronts are indicated by distinct markers.

Figure 4: The multicriterion scattergram for the human retina aging experiment (horizontal axis: ξ1, intra class, to be minimized; vertical axis: ξ2, inter class, to be maximized). Each point in the scatterplot corresponds to the pair (ξ1(n), ξ2(n)) for a particular gene n.
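The dispersion criteria (1)-(2) and the front-stripping procedure of this section can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' matlab/C implementation; the function names `dispersion_criteria` and `pareto_depth` and the array layout (one row per gene, one column per replicate) are assumptions made for the example.

```python
import numpy as np

def dispersion_criteria(y1, y2):
    """Return (xi1, xi2) of eqs. (2) and (1) for each gene.

    y1, y2: arrays of shape (N, M1) and (N, M2) holding the log2 probe
    responses of the two populations (assumed layout: genes x replicates).
    """
    m1, m2 = y1.shape[1], y2.shape[1]
    var1 = y1.var(axis=1, ddof=1)          # sigma_1^2(n), unbiased
    var2 = y2.var(axis=1, ddof=1)          # sigma_2^2(n), unbiased
    # Pooled sample standard deviation, eq. (2): to be minimized.
    xi1 = np.sqrt(((m1 - 1) * var1 + (m2 - 1) * var2) / (m1 + m2 - 2))
    # Scaled absolute difference of sample means, eq. (1): to be maximized.
    xi2 = np.abs(y1.mean(axis=1) - y2.mean(axis=1)) / np.sqrt(1 / m1 + 1 / m2)
    return xi1, xi2

def pareto_depth(xi1, xi2):
    """Pareto depth of each gene, with xi1 minimized and xi2 maximized.

    Fronts are peeled one at a time: the non-dominated points get the
    current depth and are removed, then the process repeats.
    The pairwise comparison is quadratic in the number of remaining genes.
    """
    xi1, xi2 = np.asarray(xi1, float), np.asarray(xi2, float)
    depth = np.zeros(len(xi1), dtype=int)
    remaining = np.arange(len(xi1))
    k = 0
    while remaining.size:
        k += 1
        a, b = xi1[remaining], xi2[remaining]
        # Entry [i, j] is True when gene j is no worse than gene i on both
        # criteria and strictly better on at least one (j dominates i).
        no_worse = (a[None, :] <= a[:, None]) & (b[None, :] >= b[:, None])
        better = (a[None, :] < a[:, None]) | (b[None, :] > b[:, None])
        dominated = (no_worse & better).any(axis=1)
        depth[remaining[~dominated]] = k
        remaining = remaining[dominated]
    return depth
```

With four hypothetical points placed like genes A, B, D and C of Fig. 1.a, the first three are mutually non-dominated and receive depth 1, while the fourth is dominated by the first and receives depth 2. The ratio `xi2 / xi1` of the two returned arrays is the statistic $T_{pt}(n)$ of (3).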
For comparison, in Fig. 4 we show the multicriterion scattergram for the human retina aging data set with the same pair of criteria ξ1, ξ2 defined in (1) and (2). As the upper left boundary of this scatterplot is much shallower and denser than the scatterplot in Fig. 3 the first Pareto front of the human data contains many more genes than the first Pareto front of the mouse data. Since they would not render well in this densely populated boundary region, the Pareto fronts are not indicated in Fig. 4.

3.2 Pareto Depth Sampling Distribution

Microarray data are strongly corrupted by biological variations and measurement variations. To account for this variation we applied a simple resampling procedure to robustify the Pareto analysis. This resampling is implemented as a bootstrap procedure and is equivalent to leave-one-out cross-validation [15]. Resampling proceeds as follows: for each time point a sample is omitted leaving $2M$ sets of $(M - 1)^2$ pairs to be tested (here we set $M_t = M$, corresponding to the two data sets presented above). For each of these resampled sets of genes the Pareto fronts are computed. The most resistant genes are those which remain on the top Pareto fronts throughout the resampling process. To quantify the movement of a given gene across the Pareto fronts we introduce the Pareto depth sampling distribution (PDSD). For each gene this distribution corresponds to the empirical distribution of the $2M$ Pareto front indexes visited during the resampling process:

\[ \mathrm{Pdsd}_n(k) = \frac{1}{M_{resamp}} \sum_{j=1}^{M_{resamp}} 1_n(j, k), \quad k = 1, \ldots, N, \]

where $M_{resamp} = 2M$ is the number of resampling trials, and $1_n(j, k)$ is an indicator function of the event: “j-th resampling of n-th gene is on k-th Pareto front.” If K is the total number of Pareto fronts in the scattergram $\{(\xi_1(n), \xi_2(n))\}_{n=1}^{N}$ then, by convention, we define $\mathrm{Pdsd}_n(k) = 0$ for $k > K$. As the PDSD is a probability distribution, $\mathrm{Pdsd}_n(k) \geq 0$ and $\sum_k \mathrm{Pdsd}_n(k) = 1$.

Figure 5 corresponds to the (un-normalized) PDSDs over the first 40 Pareto depths for four different genes taken from the human data set under the dual criteria (ξ1, ξ2) of (1) and (2). The highly concentrated PDSD in the top-left panel indicates that this gene is very stable; it remains on the first front throughout the resampling process. At the opposite extreme, the highly dispersed PDSD on the bottom-right panel indicates a very unstable gene; its Pareto depth is highly sensitive to resampling. The two other panels depict PDSDs of genes that lie within these two extremes.

Figure 5: Unnormalized PDSDs for four different genes taken from human retina experiment (four panels; horizontal axis: Pareto front number; vertical axis: PDSD). These PDSDs are indexed by the Pareto depth, which is equivalent to Pareto front number.

As the PDSD summarizes all of the empirical Pareto depth statistics it can be used to develop a wide array of gene ranking criteria. For example, in [10] we ranked genes in terms of the proportion of resampling trials for which a gene remained on one of the top 3 Pareto fronts. This ranking criterion is equivalent to the cumulative Pareto front test

\[ T_{cum}(n) = \sum_{k=1}^{3} \mathrm{Pdsd}_n(k) \; \underset{\bar{S}}{\overset{S}{\gtrless}} \; \eta_c. \tag{4} \]

In this paper we investigate a different PDSD ranking statistic for pulling out genes that are both highly stable and have low Pareto depth. Genes with these attributes can be captured by requiring that their Pareto depth variance $\sigma^2(n)$ and squared Pareto depth mean $m^2(n)$ be small, where $m(n)$ and $\sigma^2(n)$ denote the mean and variance of $\mathrm{Pdsd}_n$. Equivalently, we define the Pareto depth test, which selects gene n when the statistic falls below the threshold:

\[ T_{pd}(n) = \sqrt{m^2(n) + \sigma^2(n)} \; \underset{S}{\overset{\bar{S}}{\gtrless}} \; \eta_d. \tag{5} \]

Since $m^2(n) + \sigma^2(n)$ is the second moment of the Pareto depth, the test statistic is equivalently expressed as $T_{pd}(n) = \sqrt{\sum_k k^2 \, \mathrm{Pdsd}_n(k)}$.

Figure 6 is the scatter plot of the pairs of moments $\{(m^2(n), \sigma^2(n))\}_{n=1}^{N}$ of the gene PDSDs for the human retina data. The best genes are those which have smallest mean and variance, i.e., the genes that lie on the lower left corner of the scatter plot. For a given threshold $\eta_d$ the test (5) defines a quarter disk region in the plane of Fig. 6 centered at the origin (0, 0) with disk radius $\eta_d$. Genes whose moment pair $(m^2(n), \sigma^2(n))$ falls in this region will pass the test and be selected as having both low and stable Pareto depths. Bootstrap methods, implemented with random permutation and resampling, could be straightforwardly implemented to determine the p-values of this test. However, in this paper we will focus on constraining the number of discovered genes as opposed to the level of significance of the test. When the number of discovered genes is constrained to be 50, the top ranked 50 genes fall into the acceptance region of the test (5). Figure 7 shows a gray-coded image of the PDSDs for each of these top 50 genes for the human retina data. The figure indicates that the Pareto depths of these 50 genes are tightly concentrated in the range 1 to 6.

Figure 6: The scatter plot of the square mean (horizontal axis) and the variance (vertical axis) of the Pareto depth of each gene for human retinal data. Here CV refers to our resampling method consisting of leave-one-out cross-validation.

For comparison, Fig. 8 shows the PDSDs obtained by applying exactly the same selection criterion (5) to the mouse aging experiment as we just presented for the human aging experiment. Notice that the PDSDs for the top 50 mouse genes are spread over 16 or more Pareto depths. This high spread reflects the facts that: (1) there are fewer stable Pareto dominant genes in the mouse aging experiment as compared to the human aging experiment; (2) the Pareto front for the human aging experiment is broader and contains more genes than for the mouse aging data.
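The resampling, the PDSD, and the two test statistics (4) and (5) can be sketched in Python with NumPy as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: it takes one reading of the resampling scheme, in which each replicate of each population is left out once (giving $M_1 + M_2 = 2M$ trials), and the function names and array layout (genes x replicates) are hypothetical. Each population needs at least three replicates so that a variance can still be estimated after one is removed.

```python
import numpy as np

def criteria(y1, y2):
    """(xi1, xi2) of eqs. (2) and (1); rows are genes, columns replicates."""
    m1, m2 = y1.shape[1], y2.shape[1]
    pooled = ((m1 - 1) * y1.var(axis=1, ddof=1)
              + (m2 - 1) * y2.var(axis=1, ddof=1)) / (m1 + m2 - 2)
    diff = np.abs(y1.mean(axis=1) - y2.mean(axis=1))
    return np.sqrt(pooled), diff / np.sqrt(1 / m1 + 1 / m2)

def pareto_depth(xi1, xi2):
    """Peel successive Pareto fronts (xi1 minimized, xi2 maximized)."""
    depth = np.zeros(len(xi1), dtype=int)
    remaining = np.arange(len(xi1))
    k = 0
    while remaining.size:
        k += 1
        a, b = xi1[remaining], xi2[remaining]
        no_worse = (a[None, :] <= a[:, None]) & (b[None, :] >= b[:, None])
        better = (a[None, :] < a[:, None]) | (b[None, :] > b[:, None])
        dominated = (no_worse & better).any(axis=1)
        depth[remaining[~dominated]] = k
        remaining = remaining[dominated]
    return depth

def pdsd_leave_one_out(y1, y2):
    """Empirical PDSD: entry [n, k-1] is the fraction of trials in which
    gene n lies on the k-th Pareto front.

    Assumed scheme: every replicate of every population is omitted once.
    """
    trials = []
    for m in range(y1.shape[1]):
        trials.append(pareto_depth(*criteria(np.delete(y1, m, axis=1), y2)))
    for m in range(y2.shape[1]):
        trials.append(pareto_depth(*criteria(y1, np.delete(y2, m, axis=1))))
    depths = np.array(trials)                      # shape (M_resamp, N)
    k_max = int(depths.max())
    return np.stack([(depths == k).mean(axis=0)
                     for k in range(1, k_max + 1)], axis=1)

def depth_statistics(pdsd):
    """Return (T_cum, T_pd) of eqs. (4) and (5) for each row of a PDSD."""
    k = np.arange(1, pdsd.shape[1] + 1)
    t_cum = pdsd[:, :3].sum(axis=1)                # mass on the top 3 fronts
    mean = pdsd @ k                                # m(n)
    var = pdsd @ k**2 - mean**2                    # sigma^2(n)
    t_pd = np.sqrt(mean**2 + var)                  # small is good
    return t_cum, t_pd
```

A gene that stays on the first front in every trial has $T_{cum} = 1$ and $T_{pd} = 1$, the smallest possible value of the depth statistic. Ranking by increasing `t_pd` and keeping the first p genes gives the fixed-size selection used in the text.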
Figure 7: The PDSDs of the 50 top human genes discovered using the test (5) applied to the scatter plot of Fig. 6 with threshold ηd determined such that exactly 50 genes fall into the acceptance region (horizontal axis: Pareto front number; vertical axis: genes sorted by depth m² + σ²). The magnitude of the PDSD is encoded in the false color range of black (PDSD = 1) to white (PDSD = 0).

Figure 8: The PDSDs of the 50 top mouse genes discovered using the test (5) (horizontal axis: Pareto front number; vertical axis: genes sorted by depth m² + σ²). As compared to the human genes in Fig. 7 there is much higher variability in the Pareto depths of the top 50.

4 Experimental Comparisons

Here we compare the paired t-test (3), the cumulative Pareto front test (4) used in [10], and the Pareto depth test (5) on the basis of their gene ranking performance for the retinal aging experimental data and for simulated data.

4.1 Experimental Data

Figures 9 and 10 show the number of genes discovered as a function of the paired t-test threshold $\eta_T$ for the experimental human and mouse data, respectively. The shapes of the curves in these two figures are substantially different. Indeed the distinctive plateau at the right tail of Fig. 10 is due to the existence of several mouse genes whose best scores $(\xi_1(n), \xi_2(n))$ are well detached from the scores of the rest of the genes. There are no such highly detached human genes as can be seen by comparing the multicriteria scattergrams of Figs. 3 and 4.

Figures 11 and 12 show the number of genes discovered as a function of the inverse Pareto depth test threshold $1/\eta_d$ for the experimental human and mouse data, respectively. Again the shapes of the curves in these two figures are substantially different. As compared to the paired t-test figures, Figs. 9 and 10, the increase in the number of genes discovered by the Pareto depth test is much more gradual as $1/\eta_d$ decreases, suggesting that the Pareto rankings are more stable to measurement variations as compared to the rankings induced by the t-test statistic.

Figure 9: Number of t-test-extracted genes as a function of threshold ηT for data in human retina aging study.

Figure 10: Number of t-test-extracted genes as a function of threshold ηT for data in mouse retina aging study.

Figure 11: Number of Pareto-depth-test-extracted genes as a function of inverse threshold 1/ηd for data in human retina aging study.

Figure 12: Number of Pareto-depth-test-extracted genes as a function of inverse threshold 1/ηd for data in mouse retina aging study.

4.2 Simulated Data

We performed a limited set of simulations to be able to compare estimated rankings to the “ground truth” true rankings. The simulations were designed to be representative of gene expressions in a typical gene microarray experiment. Three hundred (N = 300) different probe responses were simulated. Eight (M = 8) replicates of the n-th gene probe response were generated according to an i.i.d. Gaussian distribution with means and variances given by $(m_1(n), \sigma_1^2(n))$ and $(m_2(n), \sigma_2^2(n))$ for populations 1 and 2, respectively. The variances were made equal, $\sigma_1^2(n) = \sigma_2^2(n) = \sigma^2(n)$, over both populations. The means and variances were set by the following formula:

\[ \sigma(n) = \xi_2(n), \quad m_1(n) = 0, \quad m_2(n) = \xi_1(n)\,\xi_2(n)/2, \]

where the values of $\xi_1(n), \xi_2(n)$ are indicated by the criteria structure illustrated in Fig. 13. The ground truth ranking of all genes is determined by this figure which can be viewed as the ensemble mean scattergram. We designate the 90 genes on the first 3 fronts of Fig. 13 (depth increasing along the −45° diagonal) as ground-truth-optimal genes.

Figure 13: Ensemble mean scattergram (ground truth) for simulation study. There are 10 groups of 30 genes represented by each of the 10 semicircles. Ground truth Pareto optimal genes lie on the outermost front (low ξ1 and high ξ2).

Figure 14 shows a realization of the empirical scattergram obtained from sample mean and variance estimates derived from the replicates. Figure 15 shows the three first Pareto fronts and the boundaries of two acceptance regions for the paired t-test applied to the empirical scattergram of Fig. 14. The first three Pareto fronts do not capture all of the ground-truth-optimal genes but they have a very low (0%) false discovery rate (proportion of genes found which are not ground-truth-optimal). The solid line boundary of the paired t-test corresponds to a threshold $\eta_T$ which discovers the 90 genes with highest $T_{pt}(n)$ value. Use of this acceptance region would result in discovery of more ground-truth-optimal genes than discovered by the first three Pareto fronts, but with a false discovery rate of approximately 15%. The dashed line boundary corresponds to a paired t-test threshold $\eta_T$ which would lead to discovery of all of the 90 ground-truth-optimal genes. However, the false discovery rate of this acceptance region is quite high (> 40%). Figure 16 gives another depiction of the two acceptance regions of the paired t-test. It is clear from this example that neither the paired t-test nor the cumulative Pareto front test (4) succeeds in extracting all the ground-truth-optimal genes with low false discovery rate. The next question we address is: would a Pareto depth test do better than the simple three front Pareto test and paired t-tests illustrated here?

Figure 14: Empirical scattergram constructed from estimating sample mean and variance from the M = 8 i.i.d. samples of each gene (horizontal axis: estimated ξ1, intra class, to be minimized; vertical axis: estimated ξ2, inter class, to be maximized).

Figure 15: Three first Pareto fronts (indicated by distinct markers) and boundaries of paired t-test acceptance regions for the scattergram of Fig. 14.

Figure 16: Paired t-test statistic and thresholds corresponding to boundaries in Fig. 15 (horizontal axis: gene index; vertical axis: T-test). Genes are ordered from right to left by scanning successive fronts in the ensemble mean scattergram of Fig. 13.

To quantify the tradeoffs between the paired t-test and the Pareto tests for extracting the ground-truth-optimal genes we performed a representative simulation study to compute the average correct discovery rates and the average false discovery rates as a function of the number M of replicates. All tests were implemented with a data dependent threshold which selected the 90 top genes as ranked by the respective test statistics. For the range of M studied this threshold setting gave the paired t-test a nearly constant correct discovery rate of approximately 88%. The cumulative Pareto front test (4) and the Pareto depth test (5) were implemented for comparison.

In Figs. 17 and 18 we plot the correct discovery rate and the false discovery rate, respectively, for the paired t-test and the cumulative Pareto front test. From the figures it is clear that the cumulative Pareto front test has better performance than the paired t-test for large M. However, it suffers from lower correct discovery rate than the paired t-test for small M. In Figs. 19 and 20 the same error rates are compared for the paired t-test and the Pareto depth test. The Pareto depth test performed significantly better (higher correct discovery rate and lower false discovery rate) than the paired t-test for all M.

Figure 17: Correct discovery rate as a function of the number of replicates for paired t-test (solid) versus cumulative Pareto front test (dashed) (horizontal axis: log2(number of samples); vertical axis: correct discovery rate in %). Both tests have data-dependent thresholds that select the 90 top ranked genes according to their respective test statistics.

Figure 18: False discovery rate as a function of the number of replicates for paired t-test (solid) versus cumulative Pareto front test (dashed) (horizontal axis: log2(number of samples); vertical axis: false discovery rate). Both tests have data-dependent thresholds that select the 90 top ranked genes according to their respective test statistics.
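The data-dependent thresholding and the two error rates used in this comparison can be sketched as follows, in Python with NumPy. The false discovery rate follows the definition given in the text (the proportion of selected genes that are not ground-truth-optimal). The correct discovery rate is assumed here to be the proportion of ground-truth-optimal genes that are selected, since the text does not spell it out. The function names `select_top` and `discovery_rates` are hypothetical.

```python
import numpy as np

def select_top(statistic, p, largest=True):
    """Indices of the p top-ranked genes for a test statistic.

    Use largest=True for statistics where large values are selected,
    such as T_pt in (3) or T_cum in (4), and largest=False for T_pd
    in (5), where small values are selected.
    """
    statistic = np.asarray(statistic, dtype=float)
    order = np.argsort(-statistic if largest else statistic, kind="stable")
    return order[:p]

def discovery_rates(selected, truth_optimal):
    """Return (correct discovery rate, false discovery rate) in percent.

    correct: share of ground-truth-optimal genes that were selected
             (assumed definition).
    false:   share of selected genes that are not ground-truth-optimal.
    """
    selected = {int(i) for i in selected}
    truth = {int(i) for i in truth_optimal}
    hits = len(selected & truth)
    correct = 100.0 * hits / len(truth)
    false = 100.0 * (len(selected) - hits) / len(selected)
    return correct, false
```

When the number of selected genes equals the number of ground-truth-optimal genes, as with the 90-gene setting above, the two rates under these definitions sum to 100%.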
truth-optimality was motivated by our several years of experience helping molecular biologists discover biologically interesting genes, in particular genes 5 Conclusion with weak but interesting transcription factors. A more comprehensive study would compare the per- DNA microarray technology allows one to evaluformance of Pareto to paired t approaches when the ate the expression profile of thousands of genes siground-truth-optimal genes are defined differently. multaneously. However, to take full advantage of Due to space limitations we do not present the re- these powerful tools, we need to find new methods to handle large amounts of data and information sults of this study here. 10 creasingly high dimensionality of genetic data sets. The developed method has been implemented in matlab and C and is sufficiently fast to be part of an interactive tool for gene screening, ranking, and clustering. 100 Correct Discovery Rate (%) 98 96 94 92 Acknowledgements 90 88 86 1.5 2 2.5 3 3.5 4 log2(number of samples) 4.5 5 5.5 Figure 19: Correct discovery rate as a function of the number of replicates for paired t-test (solid) versus Pareto depth test (dashed). Both tests have data-dependent thresholds that select the 90 top ranked genes according to their respective test statistics. 45 35 False Discovery Rate References [1] A. A. Alizadeh and etal, “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling,” Nature, vol. 403, pp. 503–511, 2000. [2] D. Bassett, M. Eisen, and M. Boguski, “Gene expression informatics–it’s all in your mine,” Nature Genetics, vol. 21, no. 1 Suppl, pp. 51– 55, Jan 1999. 40 30 25 [3] P. J. Bickel and K. A. Doksum, Mathematical Statistics: Basic Ideas and Selected Topics, Holden-Day, San Francisco, 1977. 20 15 [4] M. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. Sugent, T. Furey, M. Ares, and D. Haussler, “Knowledge-based analysis of microarray gene expression data by using support vector machines,” Proc. of Nat. 
Academy of Sci. (PNAS), vol. 97, no. 1, pp. 262–267, 2000. 10 5 0 1.5 This research was partially supported by National Institutes of Health grant NIH-EY11115 (including microarray supplements), Macula Vision Research Foundation, and Elmer and Sylvia Sramek Foundation. 2 2.5 3 3.5 4 log2(number of samples) 4.5 5 5.5 Figure 20: False discovery rate as a function of the number of replicates for paired t-test (solid) versus Pareto depth test (dashed). Both tests have datadependent thresholds that select the 90 top ranked genes according to their respective test statistics. without becoming overwhelmed by the potentially large number of candidate genes. This paper has presented a new method of Pareto analysis that can identify and rank genes that have both stable and low Pareto depths relative to the remaining genes. Additional genes discovered using this algorithm are now being validated by RT-PCR methods. Many signal processing challenges remain due to the in11 [5] V. Cheung, L. Conlin, T. Weber, M. Arcaro, K. Jen, M. Morley, and R. Spielman, “Natural variation in human gene expression assessed in lymphoblastoid cells,” Nature Genetics, vol. 33, pp. 422–425, 2003. [6] F. C. Collins, M. Morgan, and A. Patrinos, “The Human Genome Project: lessons from large-scale biology,” Science, vol. 300, pp. 286– 290, April 11 2003. [7] J. DeRisi, V. Iyer, and P. Brown, “Exploring the metabolic and genetic control of gene expression on a genomic scale,” Science, vol. 278, no. 5338, pp. 680–686, Oct 24 1997. [8] D. Donoho and M. Gasko, “Breakdown prop(PNAS), vol. 98, pp. 8961–8965, 2000. erties of location estimates based on halfspace citeseer.nj.nec.com/414709.html. depth and projected outlyingness,” Annals of [18] C. Lee, R. Klopp, R. Weindruch, and T. Prolla, Statistics, vol. 4, pp. 1803–1827, 1992. “Gene expression profile of aging and its retar[9] P. Fitch and B. Sokhansanj, “Genomic endation by caloric restriction,” Science, vol. 
285, gineering: moving beyond DNA sequence to no. 5432, pp. 1390–1393, Aug 27 1999. function,” IEEE Proceedings, vol. 88, no. 12, [19] F. Livesey, T. Furukawa, M. Steffen, pp. 1949–1971, Dec 2000. G. Church, and C. Cepko, “Microarray [10] G. Fleury, A. O. Hero, S. Yosida, T. Carter, analysis of the transcriptional network conC. Barlow, and A. Swaroop, “Clustering trolled by the photoreceptor homeobox gene,” gene expression signals from retinal microarCrx. Curr Biol, vol. 6, no. 10, pp. 301–10, Mar ray data,” in Proc. IEEE Int. Conf. Acoust., 23 2000. Speech, and Sig. Proc., volume IV, pp. 4024– 4027, Orlando, FL, 2002. [20] D. Lockhart, H. Dong, M. Byrne, M. Follettie, M. Gallo, M. Chee, M. Mittmann, C. Wang, [11] G. Fleury, A. O. Hero, S. Yosida, T. Carter, M. Kobayashi, H. Horton, and E. Brown, “ExC. Barlow, and A. Swaroop, “Pareto analpression monitoring by hybridization to highysis for gene filtering in microarray experidensity oligonucleotide arrays,” Nat. Biotechments,” in European Sig. Proc. Conf. (EUnol., vol. 14, no. 13, pp. 1675–80, 1996. SIPCO), Toulouse, FRANCE, 2002. [21] M. Marra and etal, “The genome sequence [12] T. Hastie, R. Tibshirani, M. Eisen, P. Brown, of the SARS-associated coronavirus,” SciD. Ross, U. Scherf, J. Weinstein, A. Alizadeh, ence Express, vol. 10.1126, , May 1 2003. L. Staudt, and D. Botstein, “Gene shaving: a www.scienceecpress.org. new class of clustering methods for expression arrays,” Technical report, Stanford University, [22] P. A. Rota and etal, “Characterization of a 2000. novel coronavirus associated with severe acute respiratory syndrome,” Science, vol. 10.1126, , [13] A. Hero and G. Fleury, “Pareto-optimal May 1 2003. www.scienceecpress.org. methods for gene analysis,” Journ. of VLSI Signal Processing, vol. to appear, , 2003. [23] R. E. Steuer, Multi criteria optimization: thewww.eecs.umich.edu/~hero/bioinfo.html. ory, computation, and application, Wiley, New York N.Y., 1986. [14] A. Hero, G. Fleury, A. Mears, and A. 
Swaroop, “Multicriteria gene screening for analysis of differential expression with DNA [24] J. Tukey, Exploratory Data Analysis, Wiley, NY NY, 1977. microarrays,” EURASIP Journ. of Applied Signal Processing, p. to appear, 2003. [25] V. G. Tusher, R. Tibshirani, and G. Chu, www.eecs.umich.edu/~hero/bioinfo.html. “Significance analysis of microarrays applied to the ionizing radiation response,” Proc. of Nat. [15] J. U. Hjorth, Computer intensive statistical Academy of Sci. (PNAS), vol. 98, pp. 5116– methods, CHapman and Hall, London, 1994. 5121, 2001. [16] K. Kadota, R. Miki, H. Bono, K. Shimizu, Y. Okazaki, and Y. Hayashizaki, “Preprocess- [26] J. Watson and A. Berry, DNA: The secret of life, Alfred A. Knopf, 2003. ing implementation for microarray (prim): an efficient method for processing cdna microarray data,” Physiol Genomics, vol. 4, no. 3, pp. [27] J. Yu, A. Mears, S. Yoshida, R. Farjo, T. Carter, D. Ghosh, A. Hero, C. Barlow, and 183–188, Jan 19 2001. A. Swaroop, “From disease genes to cellular pathways: A progress report,” in Retinal dys[17] K. Kerr and G. Churchill, “Bootstrapping cluster analysis: Assessing the relitrophies: functional genomics to gene therapy, ability of conclusions from microarray expp. 147–164, Wiley, Novartis Foundation Symperiments,” Proc. of Nat. Academy of Sci. posium, Chichester, 2004. 12 [28] E. Zitler and L. Thiele, “An evolutionary algorithm for multiobjective optimization: the strength Pareto approach,” Technical report, Swiss Federal Institute of Technology (ETH), May 1998. 13
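
Illustrative sketch. The simulation study compares tests by ranking genes on two criteria (ξ1 to be minimized, ξ2 to be maximized) and scoring the selected genes against a ground-truth set. The Python fragment below is a minimal sketch of that bookkeeping, not the authors' MATLAB/C implementation. The function names `pareto_depth` and `discovery_rates` are hypothetical, and the correct discovery rate is assumed here to be the fraction of ground-truth-optimal genes that are selected; the false discovery rate follows the definition in the text (proportion of genes found which are not ground-truth-optimal).

```python
import numpy as np

def pareto_depth(xi1, xi2):
    """Return the Pareto depth (1 = first front) of each gene.

    Assumed convention: xi1 is minimized and xi2 is maximized.
    Gene i dominates gene j if it is no worse on both criteria
    and strictly better on at least one.
    """
    xi1 = np.asarray(xi1, dtype=float)
    xi2 = np.asarray(xi2, dtype=float)
    depth = np.zeros(len(xi1), dtype=int)
    remaining = np.arange(len(xi1))
    front = 0
    while remaining.size > 0:
        front += 1
        a, b = xi1[remaining], xi2[remaining]
        # Entry [i, j] is True when remaining gene i dominates remaining gene j.
        no_worse = (a[:, None] <= a[None, :]) & (b[:, None] >= b[None, :])
        better = (a[:, None] < a[None, :]) | (b[:, None] > b[None, :])
        dominated = (no_worse & better).any(axis=0)
        # Non-dominated genes form the current front; peel them off and repeat.
        depth[remaining[~dominated]] = front
        remaining = remaining[dominated]
    return depth

def discovery_rates(selected, truth):
    """Return (correct discovery rate, false discovery rate) as fractions.

    Assumption: correct rate = selected ground-truth genes / all ground-truth genes.
    False rate = selected genes not in the ground truth / all selected genes.
    """
    selected, truth = set(selected), set(truth)
    hits = len(selected & truth)
    return hits / len(truth), (len(selected) - hits) / len(selected)
```

Under these assumptions, a cumulative front selection corresponds to keeping the genes with `pareto_depth(xi1, xi2) <= k` for some number of fronts k, and the resulting index list can be passed to `discovery_rates` together with the ground-truth indices.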
