Analysis of Gene Expression Data                    Spring Semester, 2007
Lecture 7: June 7, 2007
Lecturer: Ron Shamir                    Scribe: Hadas Zur and Roy Navon¹

7.1 Support Vector Machines

7.1.1 Introduction

The theory of support vector machines (SVM) has its origins in the late seventies, in the work of Vapnik [16] on statistical learning theory. Lately it has been receiving increasing attention, and many applications as well as important theoretical results are based on this theory. In fact, support vector machines are arguably among the most important developments in machine learning. The methods described below have been used mainly in the field of pattern recognition. By way of motivation, we summarize some recent applications and extensions of support vector machines.

For the pattern recognition case, SVMs have been used for [3]:
• Isolated handwritten digit recognition.
• Object recognition.
• Speaker identification.
• Face detection in images.
• Text categorization.

For the regression estimation case, SVMs have been compared on:
• Benchmark time series prediction tests.
• The PET operator inversion problem.

The main idea of support vector machines is to find a decision surface - a hyperplane in feature space (a line in the case of two features) - which separates the data into two classes. SVMs are extremely successful, robust, efficient, and versatile, and there are good theoretical indications as to why they generalize well. In most cases SVM generalization performance either matches or is significantly better than that of competing methods. In the next section we will describe in detail the usage of SVM in the analysis of microarray gene expression data. This exposition is based mainly on [3, 4, 9].

¹ Based in part on a scribe by Simon Kamenkovich and Erez Greenstein (May 2002) and on a scribe by Daniela Raijman and Igor Ulitsky (March 2005).

7.1.2 General motivation

One of the basic notions in the theory of SVM is the capacity of the machine, i.e. the ability of the machine to learn any training set of data without error. A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before. A machine with too little capacity is like the botanist's lazy brother, who declares that if it is green then it is a tree. Neither can generalize well. Roughly speaking, for a given learning task with a given amount of training data, the best generalization performance will be achieved if the right balance is found between the accuracy attained on the particular training set and the capacity of the machine.

The main idea of support vector machines is:
• Map the data to a predetermined very high-dimensional space via a kernel function.
• Find the hyperplane that maximizes the margin between the two classes, i.e. that separates the two classes.
• If the data are not separable, find the hyperplane that maximizes the margin and minimizes the (weighted average of the) misclassifications, i.e. perform a soft separation that allows errors.

Three derivatives of this idea are:
• Define what an optimal hyperplane is (taking into account that it needs to be computed efficiently): maximize the margin.
• Generalize to non-linearly separable problems: have a penalty term for misclassifications.
• Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space.

7.1.3 General mathematical setting

Suppose we are given l observations. Each observation consists of a pair: a vector $x_i \in R^n$, $1 \le i \le l$, and an associated label $y_i$, given to us by a trusted source. In the tree recognition problem, the vector $x_i$ might be some representation of an image, e.g., by its pixel values, and $y_i$ would be +1 if the image contains a tree, and -1 otherwise. Furthermore, it is assumed that there exists some unknown probability distribution $P(x, y)$, from which the above observations are drawn independently and identically distributed. Now suppose we have a deterministic machine whose task is to learn the mapping $x_i \mapsto y_i$. The machine is actually defined by a set of possible mappings $x \mapsto f(x, \alpha)$, where the function f depends on a parameter $\alpha$; i.e., a choice of the parameter $\alpha$ specifies one particular machine from the set $f(x, \alpha)$. For a particular choice of the parameter, the machine will be called a trained machine.

7.1.4 The VC dimension

The VC dimension is a property of a set of functions $\{f(\alpha)\}$. It can be defined in a more general manner, but we will assume families of functions that obtain binary values. If a given set of l points can be labeled in all possible $2^l$ ways, and for each labeling a member of the set $\{f(\alpha)\}$ can be found which correctly assigns those labels, we say that the set of points is shattered by the set of functions. The VC dimension of the set of functions $\{f(\alpha)\}$ is defined as the maximum size of a set of points that can be shattered by $\{f(\alpha)\}$. In other words, if the VC dimension is h, there exists at least one set of h points that can be shattered.

Example: Shattering points with oriented lines in $R^2$. Suppose that the data are points in $R^2$, and the set $\{f(\alpha)\}$ consists of oriented straight lines, such that for a given line all points on one side are assigned the value 1, and all points on the other side the value 0. It is possible to find a set of three points that can be shattered by the oriented lines (see Figure 7.1), but it is not possible to shatter a set of four points with the set of oriented lines. Thus, the VC dimension of the set of oriented lines in $R^2$ is 3. In the more general case, the VC dimension of the set of oriented hyperplanes in $R^n$ is n + 1.

Figure 7.1: Three points in the plane shattered by oriented lines.
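To make the shattering example concrete, here is a minimal sketch (Python with numpy and scipy; the point coordinates are arbitrary choices, not from the lecture) that tests every labeling of a point set for linear separability by solving a small feasibility linear program: we look for w, b with $y_i(w \cdot x_i + b) \ge 1$.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Check whether some line w.x + b = 0 realizes the given +/-1 labeling,
    by testing feasibility of y_i (w . x_i + b) >= 1 as a linear program."""
    n, d = points.shape
    # Variables: (w_1, ..., w_d, b). Constraints: -y_i (w . x_i + b) <= -1.
    A_ub = -labels[:, None] * np.hstack([points, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(points):
    """True iff every +/-1 labeling of the points is linearly separable."""
    n = len(points)
    return all(linearly_separable(points, np.array(lab))
               for lab in product([-1.0, 1.0], repeat=n))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # not collinear
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(shattered(three))  # True: all 8 labelings are separable
print(shattered(four))   # False: e.g. the XOR labeling fails
```

Running this confirms the claim above: three points in general position are shattered by lines, while four points (here, the corners of a square with the XOR labeling) are not.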
7.1.5 Support Vector Machines

We will start with the simplest case: linear machines trained on separable data. Given a training set $\{(x_i, y_i)\}$, $x_i \in R^n$, $y_i \in \{-1, 1\}$, we assume that the data are linearly separable, i.e., there exists a separating hyperplane which separates the positive examples ($y_i = 1$) from the negative ones ($y_i = -1$). The points x which lie on the hyperplane satisfy $w \cdot x + b = 0$, where w is a normal to the hyperplane and $|b|/\|w\|$ is the perpendicular distance from the hyperplane to the origin. Let $d_+$ ($d_-$) be the shortest distance from the separating hyperplane to the closest positive (negative) example. Define the margin of a separating hyperplane to be $d_+ + d_-$. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with the largest margin.

Thus the goal is to find the optimal linear classifier (a hyperplane) that classifies every training example correctly and maximizes the classification margin. This can be formulated in the following way. Suppose that all the training data satisfy the constraints:

    $x_i \cdot w + b \ge +1$ for $y_i = +1$    (7.1)
    $x_i \cdot w + b \le -1$ for $y_i = -1$    (7.2)

Now consider the points for which equality in (7.1) holds. These points lie on the hyperplane $H_1: x_i \cdot w + b = 1$. Similarly, the points for which equality in (7.2) holds lie on the hyperplane $H_2: x_i \cdot w + b = -1$. In this case $d_+ = d_- = 1/\|w\|$, and therefore the margin is $2/\|w\|$. Note that $H_1$ and $H_2$ are parallel (they have the same normal) and that no training points fall between them. Thus we can find the pair of hyperplanes which gives the maximum margin by minimizing $\|w\|^2$, subject to the above constraints, because keeping this norm small will also keep the VC dimension small.

    minimize $\frac{1}{2}\|w\|^2$    (7.3)
    s.t. $y_i(w \cdot x_i + b) \ge 1$, $i = 1, \ldots, l$    (7.4)

or, equivalently,

    maximize $\frac{2}{\|w\|}$    (7.5)
    s.t. $y_i(w \cdot x_i + b) \ge 1$, $i = 1, \ldots, l$    (7.6)

Linear, hard-margin SVM formulation: find w, b that solve (7.3) under constraint (7.4).

The above minimization problem is convex; therefore there exists a unique global minimum value (when feasible), and there is a unique minimizer, i.e. a weight vector w and bias b that attain the minimum (given that the data are indeed linearly separable). There is no solution if the data are not linearly separable. Since the problem is convex, it can be solved using standard quadratic programming (QP) optimization techniques, which is not very complex since the dimensionality is N + 1, where N is the dimension of the input space; the problem is therefore more or less tractable for real applications. Nevertheless, in order to easily explain the extension to nonlinear decision surfaces (described in Sections 7.1.10-7.1.11), we look at the dual problem, using the technique of Lagrange multipliers.

Lagrange formulation of the problem

We will now switch to a Lagrangian formulation of the problem. There are two reasons for doing this. The first is that the constraints (7.1),(7.2) will be replaced by constraints on the Lagrange multipliers themselves, which are much easier to handle. The second is that in this reformulation of the problem, the training data will only appear (in the actual training and test algorithms) in the form of dot products between vectors. This is a crucial property which will allow us to generalize the procedure to the nonlinear case. We introduce positive Lagrange multipliers $\alpha_i$, $i = 1, \ldots, l$, one for each of the inequality constraints, and form the Lagrangian:

    $L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} \alpha_i$

We have to minimize $L_P$ with respect to w, b, and simultaneously require that the derivatives of $L_P$ with respect to all the $\alpha_i$ vanish. This is a convex quadratic problem, which has a dual formulation: maximize $L_P$, subject to the constraints that the gradients of $L_P$ with respect to w and b vanish, and subject to the constraints $\alpha_i \ge 0$. This gives:

    maximize $L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$    (7.7)
    s.t. $\sum_{i=1}^{l} \alpha_i y_i = 0$, $\alpha_i \ge 0$    (7.8)

Support vector training (for the separable, linear case) therefore amounts to maximizing $L_D$ with respect to the $\alpha_i$, subject to the above constraints and the non-negativity of the $\alpha_i$, with the solution given by $w = \sum_{i=1}^{l} \alpha_i y_i x_i$. Notice that there is a Lagrange multiplier $\alpha_i$ for every training point; a small numerical sketch follows.
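As a minimal numerical sketch (assuming scikit-learn; a very large C approximates the hard margin, since the hard-margin machine is the C → ∞ limit of the soft-margin machine introduced below), we can fit a linear SVM on well-separated data and read off the dual coefficients $\alpha_i y_i$, the support vectors, and $w = \sum_i \alpha_i y_i x_i$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated point clouds in R^2.
X = np.vstack([rng.normal(loc=[-2, -2], size=(20, 2)),
               rng.normal(loc=[+2, +2], size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # huge C ~ hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors only.
w = clf.dual_coef_ @ clf.support_vectors_      # w = sum_i alpha_i y_i x_i
print("support vector indices:", clf.support_)
print("w recovered from the dual:", w, " vs coef_:", clf.coef_)
print("margin width 2/||w||:", 2 / np.linalg.norm(w))
```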
In the solution, those points for which $\alpha_i > 0$ are called support vectors, and they lie on one of the hyperplanes $H_1$ or $H_2$. All other training points have $\alpha_i = 0$ and lie on the side of $H_1$ or $H_2$ where the strict inequality holds. For these machines, the support vectors are the critical elements of the training set: they lie closest to the decision boundary. If all other training points were removed (or moved around, but so as not to cross $H_1$ or $H_2$) and training was repeated, the same separating hyperplane would be found. The dual remains a convex quadratic program, and therefore quadratic programming still applies.

Figure 7.2: Linear separating hyperplanes for the separable case. The support vectors are circled.

7.1.6 The soft margin hyperplane: the linear non-separable case

When applied to non-separable data, the above algorithm will find no feasible solution; this will be evidenced by the objective function (i.e. the dual Lagrangian) growing arbitrarily large. So how can we extend these ideas to handle non-separable data? We would like to relax the constraints:

    $x_i \cdot w + b \ge +1$ for $y_i = +1$    (7.9)
    $x_i \cdot w + b \le -1$ for $y_i = -1$    (7.10)

but only when necessary, that is, we would like to introduce a further cost (i.e. an increase in the primal objective function) for doing so. This can be done by introducing positive slack variables $\xi_i$, $i = 1, \ldots, l$ in the constraints (Cortes and Vapnik, 1995), which then become:

    $x_i \cdot w + b \ge +1 - \xi_i$ for $y_i = +1$    (7.11)
    $x_i \cdot w + b \le -1 + \xi_i$ for $y_i = -1$    (7.12)

Thus, for an error to occur, the corresponding $\xi_i$ must exceed unity, so $\sum_i \xi_i$ is an upper bound on the number of training errors. Hence a natural way to assign an extra cost for errors is to change the objective function to be minimized from $\frac{1}{2}\|w\|^2$ to $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l} \xi_i$, where C is a parameter to be chosen by the user; a larger C corresponds to assigning a higher penalty to errors. This is again a convex quadratic programming problem, and the dual formulation becomes:

    maximize $L_D = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$    (7.13)
    s.t. $0 \le \alpha_i \le C$, $\sum_{i=1}^{l} \alpha_i y_i = 0$    (7.14)

The only difference from the optimal hyperplane case is that the $\alpha_i$ now have an upper bound of C, the penalty for misclassification. The parameter C thus controls the range of the $\alpha_i$ and avoids over-emphasizing some examples; when C tends to infinity we return to the separable case. In the solution, complementary slackness gives $\xi_i(C - \alpha_i) = 0$. The parameter C can also be extended to be case dependent. If $\alpha_i < C$ then $\xi_i = 0$, meaning the i-th example is correctly classified and is not very important; if $\alpha_i = C$ then $\xi_i$ can be non-zero, i.e. the i-th training example may be misclassified, and it is very important. This algorithm tries to keep the $\xi_i$ at zero while maximizing the margin. It does not minimize the number of misclassifications (which is an NP-complete problem) but the sum of distances from the margin hyperplanes. There are also formulations which use $\xi_i^2$ instead.

Figure 7.3: Linear separating hyperplanes for the non-separable case.

7.1.7 Soft vs. hard margin SVM

Even when the data can be linearly separated, we might benefit from using a soft margin, allowing us to get a much wider margin at a small cost (see Figure 7.4). Using a soft margin we can always obtain a solution, and the method is more robust to outliers (smoother surfaces in the non-linear case). However, it requires us to guess the cost parameter C, as opposed to the hard-margin method, which does not require any parameters. The following sketch illustrates the effect of C.
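This is a minimal sketch (assuming scikit-learn; the data and the C values are arbitrary choices) comparing the margin width, the number of support vectors, and the training errors as C varies on data with a single outlier:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two compact clouds plus one positive-labeled outlier between them.
X = np.vstack([rng.normal([-2, -2], 0.5, (20, 2)),
               rng.normal([+2, +2], 0.5, (20, 2)),
               [[-0.5, -0.5]]])                  # outlier labeled +1
y = np.array([-1] * 20 + [+1] * 21)

for C in [0.1, 1.0, 1000.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>7}: margin={2 / np.linalg.norm(w):.3f}, "
          f"#support vectors={len(clf.support_)}, "
          f"training errors={np.sum(clf.predict(X) != y)}")
# With small C the outlier may be given slack in exchange for a wide margin;
# with large C the solution approaches the hard margin and the margin shrinks.
```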
Figure 7.4: Robustness of soft vs. hard margin SVMs.

7.1.8 Disadvantages of linear decision surfaces

Linear decision surfaces may sometimes be unable to separate the given data.

Figure 7.5: Disadvantages of linear decision surfaces - the two presented classes cannot be separated by a linear decision surface.

7.1.9 Advantages of non-linear decision surfaces

Non-linear decision surfaces are more flexible.

Figure 7.6: Advantages of non-linear decision surfaces - separation of non-linearly separable classes is enabled, due to the curved nature of non-linear decision boundaries.

7.1.10 The non-linear case

In some cases the data require a more complex, non-linear separation. To generalize the above ideas to the non-linear case, the idea is to use the same techniques as above for linear machines. Since finding a linear machine is not possible in the original space of the training set, we first map the training set to a Euclidean space of higher (even infinite) dimension. This higher-dimensional space is called the feature space, as opposed to the input space occupied by the training set. With an appropriately chosen feature space of sufficient dimensionality, any consistent training set can be made separable. However, translating the training set into a higher-dimensional space incurs both computational and learning-theoretic costs (see Figure 7.7).

Figure 7.7: Linear classifiers in high-dimensional spaces.

Suppose we first map the data to some other space H, using a mapping $\Phi: R^d \to H$. Then the SVM formulation becomes:

    minimize $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i$    (7.15)
    s.t. $y_i(w \cdot \Phi(x_i) + b) \ge 1 - \xi_i$, $\forall x_i$, $\xi_i \ge 0$    (7.16)

The data now appear as $\Phi(x_i)$, and the weights are also mapped to the new space. However, if H is very high dimensional, explicit mapping is very expensive. Therefore, we would like to solve the problem without explicitly mapping the data. The key idea is to notice that in the dual representation of the above problems, the training data appear only in the form of dot products. If we first map the data to H using $\Phi$, the training algorithm depends on the data only through dot products in H, i.e. through functions of the form $\Phi(x_i) \cdot \Phi(x_j)$.

7.1.11 The Kernel Trick

All we need in order to perform training in H is a function that satisfies $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, i.e., the kernel value of two data points equals the inner product of their images. This type of function is called a kernel function. The kernel function is used as a dot product in the higher-dimensional space, so we do not need to map the data into the high-dimensional space explicitly. Classification can also be done without explicitly mapping new instances to the higher dimension, as we take advantage of the fact that $\mathrm{sgn}(w \cdot \Phi(x) + b) = \mathrm{sgn}(\sum_i \alpha_i y_i K(x_i, x) + b)$, where b solves $\alpha_j (y_j (\sum_i \alpha_i y_i K(x_i, x_j) + b) - 1) = 0$ for any j with $\alpha_j \neq 0$. Examples of kernel functions are:

• $K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\sigma^2}$ - the radial basis (Gaussian) kernel.
• $K(x_i, x_j) = (x_i \cdot x_j + 1)^k$ - the polynomial kernel.
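As a minimal numerical check (plain numpy; the degree-2 explicit feature map is spelled out by hand), we can verify that the polynomial kernel $(x \cdot y + 1)^2$ on $R^2$ equals the dot product of explicitly mapped features:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2:
    (x.y + 1)^2 = phi(x) . phi(y)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(2)
x, z = rng.normal(size=2), rng.normal(size=2)

k_implicit = (np.dot(x, z) + 1.0) ** 2   # kernel in the input space
k_explicit = np.dot(phi(x), phi(z))      # dot product in feature space
print(k_implicit, k_explicit)            # identical up to rounding
```

The kernel evaluation touches only the two input vectors, whereas the explicit map for a degree-k polynomial kernel on d inputs has on the order of $d^k$ features; this gap is the computational saving referred to next.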
In general, using the kernel trick provides huge computational savings over explicit mapping.

7.1.12 The Mercer Condition [3, 4]

For which kernels does there exist a pair $(H, \Phi)$ with the properties described above, and for which does there not? The answer is given by Mercer's condition: there exists a mapping $\Phi$ and an expansion

    $K(x, y) = \sum_i \Phi_i(x) \Phi_i(y)$    (7.17)

if and only if, for any g(x) such that

    $\int g(x)^2 \, dx$ is finite    (7.18)

we have

    $\int K(x, y) g(x) g(y) \, dx \, dy \ge 0$    (7.19)

Note that for specific cases, it may not be easy to check whether Mercer's condition is satisfied: equation (7.19) must hold for every g with finite $L_2$ norm (i.e. which satisfies equation (7.18)). However, we can easily prove that the condition is satisfied for positive integral powers of the dot product, $K(x, y) = (x \cdot y)^p$. We must show that

    $\int \left(\sum_{i=1}^{d} x_i y_i\right)^p g(x) g(y) \, dx \, dy \ge 0$    (7.20)

A typical term in the multinomial expansion of $(\sum_{i=1}^{d} x_i y_i)^p$ contributes a term of the form

    $\frac{p!}{r_1! \, r_2! \cdots (p - r_1 - r_2 - \cdots)!} \int x_1^{r_1} x_2^{r_2} \cdots y_1^{r_1} y_2^{r_2} \cdots g(x) g(y) \, dx \, dy$    (7.21)

to the left hand side of (7.19), which factorizes:

    $\frac{p!}{r_1! \, r_2! \cdots (p - r_1 - r_2 - \cdots)!} \left(\int x_1^{r_1} x_2^{r_2} \cdots g(x) \, dx\right)^2 \ge 0$    (7.22)

One simple consequence is that any kernel which can be expressed as $K(x, y) = \sum_{p=0}^{\infty} c_p (x \cdot y)^p$, where the $c_p$ are positive real coefficients and the series is uniformly convergent, satisfies Mercer's condition, a fact also noted in the literature. A number of observations are in order:

• Vapnik (1995) uses the condition above to characterize the kernels that can be used in SVMs.
• There is another result, similar to Mercer's but more general: the kernel is positive definite if and only if

    $\int_\Omega K(x, y) g(x) g(y) \, dx \, dy \ge 0 \quad \forall g \in L_1(\Omega)$    (7.23)

• The kernels K that can be used to represent a scalar product in the feature space are closely related to the theory of Reproducing Kernel Hilbert Spaces (RKHS).

7.1.13 Other Types of Kernel Methods

• SVMs that perform regression.
• SVMs that perform clustering.
• ν-Support Vector Machines: maximize the margin while bounding the number of margin errors.
• Leave-One-Out Machines: minimize a bound on the leave-one-out error.
• SVM formulations that take into consideration differences in the cost of misclassification for the different classes.
• Kernels suitable for sequences of strings, or other specialized kernels.

7.1.14 Feature Selection with SVMs

Recursive Feature Elimination (RFE):
• Train a linear SVM.
• Remove the x% of variables with the lowest weights (those variables affect classification the least).
• Retrain the SVM with the remaining variables, and repeat until classification performance starts to degrade.

This method is very successful; a sketch appears below. Other formulations exist in which minimizing the number of variables is folded into the optimization problem. The algorithms for non-linear SVMs are similar, and are quite successful.
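A minimal sketch of recursive feature elimination (assuming scikit-learn and synthetic data; the 10% drop rate and the stopping size are arbitrary choices):

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, drop_frac=0.1, min_features=5):
    """Recursive feature elimination with a linear SVM:
    repeatedly drop the fraction of features with the smallest |w_j|."""
    active = np.arange(X.shape[1])
    while len(active) > min_features:
        clf = SVC(kernel="linear", C=1.0).fit(X[:, active], y)
        w = np.abs(clf.coef_[0])
        n_drop = max(1, int(drop_frac * len(active)))
        active = active[np.argsort(w)[n_drop:]]   # keep the largest weights
    return active

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 100))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # only 2 of 100 genes informative
print("selected features:", sorted(svm_rfe(X, y)))
```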
7.1.15 Multi-Class SVMs

In its basic form an SVM classifies test examples into only two classes, positive and negative: the SVM is a binary classifier. This means that the training examples must also be of only two kinds, positive and negative. Even though an SVM is binary, we can combine several such classifiers to form a multi-class variant of an SVM:

• One-versus-all: train n binary classifiers, one for each class against all other classes. The predicted class is the class of the most confident classifier.
• One-versus-one: train n(n-1)/2 classifiers, each discriminating between a pair of classes. Several strategies exist for selecting the final classification based on the outputs of the binary SVMs.
• Truly multi-class SVMs: generalize the SVM formulation to multiple categories.

7.1.16 Training SVMs: Problems and Heuristics

There are several heuristics for solving the problem of training an SVM. As we have seen above, training an SVM is a quadratic optimization problem. Zoutendijk's method [7] solves a linear programming problem: find the direction that minimizes the objective function, and make the largest move along this direction that still satisfies all the constraints. Another problem in training an SVM is that in the general case, computing the kernel function $K(x_i, x_j)$ for each pair of elements might be computationally expensive. The solution is to use only part of the training data, under the assumption that only part of the data contributes to the decision boundary. We then define a strategy to increase the objective function while updating the set of data points contributing to the formation of the decision boundary.

7.1.17 A heuristic for training an SVM with a large data set

1. Divide the training examples into two sets A, B.
2. Use the set A of training examples to find the optimal decision boundary.
3. Find an example $x_i$ in A with no contribution to the decision boundary, i.e. $\alpha_i = 0$.
4. Find another example $x_m$ in B that cannot be classified correctly by the current decision boundary.
5. Remove $x_i$ from A and add $x_m$ to A.
6. Repeat steps 2-5 until some stopping criterion is satisfied.

A minimal sketch of this working-set heuristic appears below.
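This is a minimal sketch of the heuristic (assuming scikit-learn; the initial set size, iteration cap and stopping rule are arbitrary choices):

```python
import numpy as np
from sklearn.svm import SVC

def chunked_train(X, y, n_start=50, max_iter=200, seed=4):
    """Working-set heuristic: train on a small active set A, and swap
    unused points of A (alpha_i = 0) for misclassified points of B."""
    rng = np.random.default_rng(seed)
    perm = list(rng.permutation(len(y)))       # mix both classes into A
    A, B = perm[:n_start], perm[n_start:]
    clf = SVC(kernel="linear", C=1.0)
    for _ in range(max_iter):
        clf.fit(X[A], y[A])
        sv = set(np.asarray(A)[clf.support_])  # support vectors of A
        unused = [i for i in A if i not in sv] # alpha_i = 0
        preds = clf.predict(X[B])
        wrong = [j for j, p in zip(B, preds) if p != y[j]]
        if not unused or not wrong:            # stopping criterion
            break
        A.remove(unused[0]); B.remove(wrong[0])
        A.append(wrong[0]); B.append(unused[0])
    return clf

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([-1, -1], 1.0, (500, 2)),
               rng.normal([+1, +1], 1.0, (500, 2))])
y = np.array([-1] * 500 + [+1] * 500)
clf = chunked_train(X, y)
print("training accuracy:", np.mean(clf.predict(X) == y))
```

Each quadratic program is solved over only |A| points, which is the point of the heuristic; only the cheap prediction step touches the full data set.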
7.2 Knowledge-based analysis of microarray gene expression data

7.2.1 Introduction

After laying the theoretical foundations of SVMs, we explore their application to the analysis of microarray gene expression data. The work described here is due to [8]. In this work the authors applied an SVM to the functional annotation of genes. The idea is to begin with a set of genes that have a common function: for example, genes coding for ribosomal proteins or genes coding for components of the proteasome. In addition, a separate set of genes that are known not to be members of the functional class is specified. These two sets of genes are combined to form a set of training examples in which the genes are labeled positively if they are in the functional class and negatively if they are known not to be in the functional class. A set of training examples can easily be assembled from literature and database sources. Using this training set, an SVM learns to discriminate between the members and non-members of a given functional class based on expression data. Having learned the expression features of the class, the SVM can recognize new genes as members or non-members of the class based on their expression data.

We describe here the use of SVMs to classify genes based on gene expression, analyzing expression data of 2,467 genes from the budding yeast Saccharomyces cerevisiae, measured in 79 different DNA microarray hybridization experiments [2]. From these data, the authors learn to recognize five functional classes from the Munich Information Center for Protein Sequences Yeast Genome Database (MYGD) (http://www.mips.biochem.mpg.de/proj/yeast). In addition to SVM classification, the authors subject the data to analyses by four competing machine learning techniques, including Fisher's linear discriminant, Parzen windows, and two decision tree learners. The SVM method outperformed all the other methods. The work described here experimented with several kernel functions:

    $K(X, Y) = (X \cdot Y + 1)^d$, $d = 1, 2, 3$,  and  $K(X, Y) = e^{-\|X - Y\|^2 / 2\sigma^2}$

7.2.2 Balancing positive and negative examples

The gene functional classes examined here contain very few members relative to the total number of genes in the data set. This leads to an imbalance in the number of positive and negative training examples which, in combination with noise in the data, is likely to cause the SVM to make incorrect classifications. When the magnitude of the noise in the negative examples outweighs the total number of positive examples, the optimal hyperplane located by the SVM will be uninformative, classifying all members of the training set as negative examples. The authors overcame this problem by modifying the matrix of kernel values computed during SVM optimization. Let $X^{(1)}, \ldots, X^{(n)}$ be the genes in the training set, and let K be the matrix defined by the kernel function K on this training set, i.e., $K_{ij} = K(X^{(i)}, X^{(j)})$. Each gene X is represented by a vector whose i-th entry is the logarithm of the ratio of the expression level $E_i$ of gene X in experiment i to the expression level $R_i$ of gene X in the reference state, normalized so that the expression vector $X = (X_1, \ldots, X_{79})$ has Euclidean length 1:

    $X_i = \frac{\log(E_i / R_i)}{\sqrt{\sum_{j=1}^{79} \log^2(E_j / R_j)}}$

By adding to the diagonal of the kernel matrix a constant whose magnitude depends on the class of the training example, one can control the fraction of misclassified points in the two classes. This technique ensures that the positive points are not regarded as noisy labels. For positive examples, the diagonal element is modified by $K_{ii} := K_{ii} + \lambda n_+ / N$, where $n_+$ is the number of positive training examples, N is the total number of training examples, and $\lambda$ is a scale factor. A similar formula is used for the negative examples, with $n_+$ replaced by $n_-$. In the experiments reported here, the scale factor is set to 0.1.
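A minimal sketch of this class-dependent diagonal modification (assuming scikit-learn's support for precomputed kernel matrices; the data are synthetic and λ follows the description above):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 79))                 # 100 genes x 79 experiments
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-length expression vectors
y = np.array([1] * 10 + [-1] * 90)             # few positives, many negatives

K = (X @ X.T + 1.0) ** 3                       # polynomial kernel matrix, d = 3

lam, N = 0.1, len(y)
n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
diag_shift = np.where(y == 1, lam * n_pos / N, lam * n_neg / N)
K_mod = K + np.diag(diag_shift)                # class-dependent diagonal boost

clf = SVC(kernel="precomputed", C=1.0).fit(K_mod, y)
print("training accuracy:", np.mean(clf.predict(K_mod) == y))
```

Since $n_+ \ll n_-$, the positive examples receive a much smaller diagonal boost, which effectively allows less slack on the rare positive class.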
7.3 Tissue Classification with Gene Expression Profiles [10]

Constantly improving gene expression profiling technologies are expected to provide understanding of, and insight into, cancer-related cellular processes. Gene expression data are also expected to significantly aid the development of efficient cancer diagnosis and classification platforms. In this work the authors examine three sets of gene expression data measured across sets of tumor and normal clinical samples: the first set consists of 2,000 genes, measured in 62 epithelial colon samples; the second consists of ≈100,000 clones, measured in 32 ovarian samples; the third set consists of ≈7,100 genes, measured in 72 bone marrow and peripheral blood samples (Golub99). They examine the use of scoring methods that measure the separation of tissue types (e.g., tumors from normals) using individual gene expression levels. These are then coupled with high-dimensional classification methods to assess the classification power of complete expression profiles. They present the results of leave-one-out cross validation (LOOCV) experiments on the three data sets, employing a nearest neighbor classifier, SVM, AdaBoost, and a novel clustering-based classification technique.

Figure 7.8: Summary of the classification performance of the different methods on the three data sets. The tables show the percent of samples that were correctly classified, incorrectly classified, and unclassified by each method in the LOOCV evaluation. Unclassified labels for the margin-based classifiers were decided by a fixed threshold on the classification margin: 0.25 for SVM and 0.05 for AdaBoost.

7.4 Molecular Classification of Cancer

7.4.1 Overview

Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). In [6] and [13] a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML)² and acute lymphoblastic leukemia (ALL)³ without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring, and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.

7.4.2 Introduction

Determining the cancer type helps in assigning the appropriate treatment to a patient. Cancer classification is based primarily on the location or the morphological appearance of the tumor, which requires an experienced biologist to distinguish the known forms. Morphological classification has the limitation that tumors with a similar histopathological⁴ appearance can have significantly different clinical courses and responses to therapy. In general, cancer classification has been difficult because it has historically relied on specific biological insights, rather than on systematic and unbiased approaches for recognizing tumor subtypes. In [6] and [13] an approach based on global gene expression analysis is described. The authors study the following three challenges:

1. Feature selection: identifying the genes most informative for prediction. Most genes are not relevant to cancer, so we want to choose only the best features (criteria) for class prediction.
2. Class prediction, or classification: assignment of particular tumor samples to already-defined classes, which could reflect current states or future outcomes.
3. Class discovery, or clustering: finding previously unrecognized tumor subtypes.

² AML affects various white blood cells including granulocytes, monocytes and platelets. Leukemic cells accumulate in the bone marrow, replace normal blood cells and spread to the liver, spleen, lymph nodes, central nervous system, kidneys and gonads.
³ ALL is a cancer of immature lymphocytes, called lymphoblasts (sometimes referred to as blast cells). Normally, white blood cells repair and reproduce themselves in an orderly and controlled manner, but in leukemia the process gets out of control and the cells continue to divide without maturing.
⁴ Histopathology is the science that studies pathologic tissues.

Thus, the difference between classification and clustering is that clustering is unsupervised - we do not know anything about the division in advance - whereas classification is a supervised learning process, where the division into subtypes is already known. The authors studied the problem of classifying acute leukemia. Classification of acute leukemia began with the observation of variability in clinical outcome and subtle differences in nuclear morphology.
Enzyme-based histochemical analysis, introduced in the 1960s, provided the first basis for classifying acute leukemias into acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Later, ALL was divided into two subcategories: T-lineage ALL and B-lineage ALL. Some particular subtypes of acute leukemia have been found to be associated with specific chromosomal translocations. Although the distinction between AML and ALL has been well established, no single test is currently sufficient to establish the diagnosis. Rather, current clinical practice involves an experienced hematopathologist's interpretation of the tumor morphology, histochemistry and immunophenotyping analysis, performed in a highly specialized laboratory. Although usually accurate, leukemia classification remains imperfect and errors do occur, for example when one type of cancer masquerades as another, or when a mix of cancers is accidentally identified as cancer of only one type. The goal is to develop a systematic approach to cancer classification based on gene expression data. Two data sets were taken:

• A learning set, containing 38 bone marrow samples (27 ALL and 11 AML) that were obtained at the same stage of the disease, but from different patients. On this set features will be learned and predictors will be developed, to be validated on the test set.
• A test set, containing 34 leukemia samples (20 ALL and 14 AML), consisting of 24 bone marrow and 10 peripheral blood samples.

RNA prepared from bone marrow mononuclear cells was hybridized to old-generation Affymetrix oligonucleotide microarrays with 6,817 human genes.

7.4.3 Feature Selection

The first goal was to find a set of predicting genes, whose typical expression patterns are strongly correlated with the class distinction to be predicted and have low variance within each class. Let c = (1,1,1,1,0,0,0,0) be a binary class vector, containing the class assigned to each sample (0 or 1), and let $gene_i = (e_1, e_2, \ldots, e_8)$ be the expression vector of $gene_i$, consisting of its expression levels in each of the tumor samples. The authors scored a gene as a distinctor by

    $P(gene_i, c) = \frac{\mu_1 - \mu_0}{\sigma_1 + \sigma_0}$

where $\mu_j$ is the mean expression level of the samples in class j and $\sigma_j$ is the standard deviation of the expression levels in these samples, j = 0, 1. The larger $|P(gene, c)|$, the better the gene separates the classes. Hence the genes with the highest $|P(g, c)|$ are chosen as the predictor set; a sketch of this score follows.
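A minimal sketch of this signal-to-noise score (plain numpy; the toy data are arbitrary):

```python
import numpy as np

def p_score(expr, c):
    """Golub-style signal-to-noise score P(g, c) = (mu1 - mu0) / (s1 + s0),
    computed for every gene (row of expr) against the binary class vector c."""
    g1, g0 = expr[:, c == 1], expr[:, c == 0]
    return (g1.mean(axis=1) - g0.mean(axis=1)) / (g1.std(axis=1) + g0.std(axis=1))

rng = np.random.default_rng(6)
c = np.array([1, 1, 1, 1, 0, 0, 0, 0])
expr = rng.normal(size=(100, 8))          # 100 genes x 8 samples
expr[0, c == 1] += 3.0                    # make gene 0 a strong distinctor

scores = p_score(expr, c)
top = np.argsort(-np.abs(scores))[:5]     # predictor set: highest |P(g, c)|
print("top genes:", top, "scores:", scores[top].round(2))
```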
Neighborhood analysis

The 6,817 genes were sorted by their degree of correlation with the class distinction. To establish whether the observed correlations were stronger than expected by chance, the researchers developed a method called "neighborhood analysis". Assume that the range of expression levels is [-1, 1], so the expression vector of an ideal class distinctor would be represented by an "idealized expression pattern" c, in which the expression level is uniformly high in class 1 and uniformly low in class 2: c = (1, 1, 1, 1, -1, -1, -1, -1). The idea of neighborhood analysis is simply to look at a neighborhood of a fixed size around c, count the number of gene expression vectors within it, and compare this count to the number of expression vectors within a neighborhood of the same size around a random permutation π(c) of c. Let N(c) be the number of genes g such that P(g, c) > α, for some constant α, and let R(π(c)) be the number of genes g such that P(g, π(c)) > α. By trying many random permutations we can determine whether the neighborhood around c holds more gene expression vectors than we would expect to see by chance: if we find that N(c) ≫ E[R(π(c))], we can conclude that the class distinction represented by c is likely to be predictable from the expression data. This analysis is illustrated in Figure 7.9. Note that permuting c preserves the class sizes; a sketch of the test follows the figure caption.

Figure 7.9: Schematic illustration of neighborhood analysis. The class distinction is represented by c. Each gene is represented by its expression levels in each of the tumor samples. In the figure, the data set is composed of six AMLs and six ALLs. Gene g1 is well correlated with the class distinction, whereas g2 is poorly correlated with c. The results are compared to the corresponding distribution obtained for random idealized vectors c*, obtained by randomly permuting the coordinates of c.

Another approach is to consider neighborhoods whose radii are a function of the correlation value, i.e., for each class vector c we count the genes with P(g, c) > x as a function of x. In this approach we do not aspire to get more genes into the same circle, as in the previous one, but to obtain small circles containing many genes. We can calculate this distribution for known class vectors and for random permutations. In Figure 7.10 we can see how many genes (y axis) have a score of at least the value P on the x axis. If our data were random, we would expect the observed data curve to be much closer to the median; but this is not so, and for P(g, c) > 0.3 it is located far even from the 1% significance level. In summary, there are about 1,000 more genes highly correlated with the AML-ALL class distinction than what we would expect at random.

Figure 7.10: Neighborhood analysis: ALL versus AML. For the 38 leukemia samples in the initial data set, the plot shows the number of genes within various "neighborhoods" of the ALL-AML class distinction, together with curves showing the 5% and 1% significance levels for the number of genes within corresponding neighborhoods of the randomly permuted class distinction. Genes more highly expressed in ALL compared to AML are shown in the left panel; those more highly expressed in AML compared to ALL are shown in the right panel. The large number of genes highly correlated with the class distinction is apparent. For example, in the left panel the number of genes with correlation P(g, c) > 0.3 was 709 for the AML-ALL distinction, but the median for random class distinctions was 173 genes. P(g, c) = 0.3 is the point where the observed data intersect the 1% significance level, meaning that 1% of random neighborhoods contain as many points as the observed neighborhood around the AML-ALL distinction.
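A minimal sketch of the permutation test (plain numpy, reusing the p_score function from the previous sketch; α and the number of permutations are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
c = np.array([1, 1, 1, 1, 0, 0, 0, 0])
expr = rng.normal(size=(1000, 8))
expr[:50, c == 1] += 2.0      # plant 50 genes correlated with c

alpha = 0.8
N_c = np.sum(p_score(expr, c) > alpha)          # genes near the real c

R = []                                          # same count for permuted c
for _ in range(1000):
    R.append(np.sum(p_score(expr, rng.permutation(c)) > alpha))
R = np.array(R)

print(f"N(c) = {N_c}, E[R] = {R.mean():.1f}, "
      f"95th percentile of R = {np.percentile(R, 95):.0f}")
# N(c) far above the permutation distribution indicates a real distinction.
```

Note that rng.permutation(c) shuffles the labels while preserving the class sizes, exactly as required above.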
Since the neighborhood analysis showed that there are genes significantly correlated with the class distinction c, the authors used the known samples to create a "class predictor" capable of assigning a new sample to one of the two classes. The goal is to choose the k genes most closely correlated with the AML-ALL distinction in the known samples.

Choosing a prediction set

We could simply choose the top k genes by the absolute value of P(g, c), but this allows the possibility that, for example, all chosen genes are expressed at a high level in class 1 and at a low level (or not at all) in class 2. Predictors do better when they include some genes expressed at high levels in each class, because to assign a new sample to a class, it must correlate with highly expressed genes in that class. We therefore choose the top k1 genes (highly expressed in class 1) and the bottom k2 genes (highly expressed in class 2) so that:

• k1 and k2 are roughly equal, to prevent a sample that is located somewhere between the two classes from being assigned to class 1 just because k1 > k2, or the opposite.
• The fewer genes we choose, the more statistically significant they will be.
• The more genes we choose, the more robust the obtained results will be. We know that gene expression differs between different people, different tissues, etc., so it is not reasonable to expect one gene to suffice for prediction. We also know that cancer is related to many biological processes, so we expect several genes to represent each of these processes.
• Too many genes are not helpful: if all of them are significantly correlated with the class distinction, it is unlikely that they all represent different biological mechanisms. Their expression patterns are probably dependent, so they are unlikely to add information not already provided by the others.

Thus, we generally pick a few tens of genes. Let S be the set of the chosen informative genes.

7.4.4 Class prediction

Prediction by weighted voting

We now describe the voting scheme presented in [6]. Each gene in S gets to cast a vote for exactly one class. The vote of a gene g on a new sample x is weighted by how closely its expression in the learning set correlates with c: w(g) = P(g, c). The vote is the product of this weight and a measure of how informative the gene appears to be for predicting the new sample. Intuitively, we expect the gene's level in x to look like that of either a typical class 1 sample or a typical class 2 sample in the learning set, so we compare the expression in the new sample to the class means in the learning set. We define a "decision boundary" as the halfway point between the two class means:

    $b_g = \frac{\mu_1 + \mu_2}{2}$

The vote corresponds to the distance between the decision boundary and the gene's expression $x_g$ in the new sample, so each gene casts a weighted vote $V = w(g)(x_g - b_g)$. The weights are defined so that positive votes count as votes for membership in class 1, and negative ones for membership in class 2. The votes of all genes in S are combined: $V_+$ is the sum of the positive votes and $V_-$ is the sum of the absolute values of the negative votes. The winner is simply the class receiving the larger total vote. Intuitively, if one class receives most of the votes, it seems reasonable to predict this majority. However, if the margin of victory is small, a prediction for the majority class seems somewhat arbitrary and can only be made with low confidence. The authors therefore define the "prediction strength", measuring the margin of victory, as

    $PS = \frac{V_{winner} - V_{loser}}{V_{winner} + V_{loser}}$

Since $V_{winner}$ is always greater than $V_{loser}$, PS varies between 0 and 1. Empirically, the researchers decided on a threshold of 0.3, i.e., if PS > 0.3 then x is assigned to the winning class; otherwise x is left undetermined. Figure 7.11 shows a graphic presentation of the scheme: each gene gi votes for either AML or ALL, depending on its expression level xi in the sample; summing the votes for AML and for ALL separately shows that ALL wins.

Figure 7.11: Class predictor. The prediction of a new sample is based on the weighted votes of a set of informative genes. Each such gene gi votes for either AML or ALL, depending on whether its expression level xi is closer to µALL or to µAML. The votes for each class are summed to obtain VALL and VAML. The sample is assigned to the class with the higher total vote, provided that the prediction strength exceeds a predetermined threshold.
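A minimal sketch of the weighted-voting predictor (plain numpy, reusing the p_score helper from the feature-selection sketch; class 1 vs. class 0 plays the role of ALL vs. AML, and the data are synthetic):

```python
import numpy as np

def weighted_vote(train, c, S, x, ps_threshold=0.3):
    """Golub-style weighted voting over the informative gene set S.
    Returns (predicted class or None if undetermined, prediction strength)."""
    mu1 = train[S][:, c == 1].mean(axis=1)
    mu0 = train[S][:, c == 0].mean(axis=1)
    w = p_score(train, c)[S]                 # gene weights w(g) = P(g, c)
    votes = w * (x[S] - (mu1 + mu0) / 2.0)   # V = w(g) (x_g - b_g)
    v_plus = votes[votes > 0].sum()          # total vote for class 1
    v_minus = -votes[votes < 0].sum()        # total vote for class 0
    winner, loser = max(v_plus, v_minus), min(v_plus, v_minus)
    ps = (winner - loser) / (winner + loser)
    label = (1 if v_plus > v_minus else 0) if ps > ps_threshold else None
    return label, ps

rng = np.random.default_rng(8)
c = np.array([1, 1, 1, 1, 0, 0, 0, 0])
train = rng.normal(size=(200, 8))
train[:20, c == 1] += 2.0                    # informative genes
S = np.argsort(-np.abs(p_score(train, c)))[:10]

x = rng.normal(size=200); x[:20] += 2.0      # new sample resembling class 1
print(weighted_vote(train, c, S, x))
```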
Testing class predictors

There are two ways to test the validity of class predictors:

1. Leave-one-out cross validation (LOOCV) on the initial data set: withhold a sample, choose informative features, build a predictor based on the remaining samples, and predict the class of the withheld sample. The process is repeated for each sample and the cumulative error rate is calculated.
2. Validation on an independent set: assess the accuracy on an independent set of samples.

Generally, the two procedures are carried out together. Testing on an independent set is better, but we are forced to rely on LOOCV when samples are very scarce; a sketch of the LOOCV protocol appears below. The authors applied this approach to the acute leukemia samples. The set of informative genes to be used as predictors was chosen to be the 50 genes most closely correlated with the AML-ALL distinction in the known samples. The following results were obtained:

1. On the learning set: the predictor assigned 36 of the 38 samples as either AML or ALL and the remaining two as uncertain (PS < 0.3); all predictions agreed with the patients' clinical diagnosis.
2. On the independent test set: the predictor assigned 29 of the 34 samples, with 100% accuracy. The success is notable, as the test set included samples from peripheral blood, from childhood AML patients, and from different reference laboratories that used different sample preparation protocols.

Overall, as shown in Figure 7.12, the prediction strength was quite high: the median in cross validation was PS = 0.77, and on the test set the median was PS = 0.73. The average prediction strength was lower for samples from one laboratory that used a very different protocol for sample preparation. This suggests that a clinical implementation of such an approach should include standardization of sample preparation.

Figure 7.12: Prediction strengths. The scatter plots show the prediction strengths (PS) for the samples in cross-validation (left) and on the independent set (right). The median PS is denoted by a horizontal line. Predictions with PS less than 0.3 are considered uncertain.

The choice of 50 informative genes for the predictor was somewhat arbitrary. In fact, the results were insensitive to the particular choice: predictors based on between 10 and 200 genes were all found to be 100% accurate, reflecting the strong correlation with the AML-ALL distinction.
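A minimal sketch of the LOOCV protocol (plain numpy, reusing the p_score and weighted_vote helpers from the previous sketches; note that the informative genes are re-chosen inside every fold, so the withheld sample never influences feature selection):

```python
import numpy as np

def loocv(expr, c, k=50):
    """Leave-one-out cross validation of the weighted-voting predictor."""
    errors, undecided = 0, 0
    for i in range(len(c)):
        keep = np.arange(len(c)) != i            # withhold sample i
        train, c_train = expr[:, keep], c[keep]
        S = np.argsort(-np.abs(p_score(train, c_train)))[:k]
        label, ps = weighted_vote(train, c_train, S, expr[:, i])
        if label is None:
            undecided += 1
        elif label != c[i]:
            errors += 1
    return errors, undecided

rng = np.random.default_rng(9)
c = np.array([1] * 10 + [0] * 10)
expr = rng.normal(size=(500, 20))
expr[:30, c == 1] += 1.5                         # class-correlated genes
print("LOOCV errors, undecided:", loocv(expr, c))
```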
Informative genes

The list of informative genes used in the AML versus ALL predictor is highly instructive. Figure 7.13 shows the list of informative genes used.

Figure 7.13: Genes distinguishing ALL from AML. The 50 genes most highly correlated with the ALL-AML class distinction are shown. Each row corresponds to a gene, with the columns corresponding to the expression levels in different samples. Expression levels for each gene are normalized across the samples such that the mean is 0 and the standard deviation (SD) is 1. Expression levels greater than the mean are shaded in red, and those below the mean are shaded in blue. The scale indicates SDs above or below the mean.

Some genes, including CD11c, CD33 and MB-1, encode cell surface proteins useful in distinguishing lymphoid from myeloid lineage cells. Others provide new markers of acute leukemia subtype; for example, the leptin receptor⁵, originally identified through its role in weight regulation, showed high relative expression in AML. These data suggest that genes useful for cancer class prediction may also provide insights into cancer pathogenesis and pharmacology.

The researchers also explored the ability to predict response to chemotherapy among the 15 adult AML patients who had been treated and for whom long-term clinical follow-up was available. Eight patients failed to achieve remission after induction chemotherapy, while the remaining seven remained in remission for 46 to 84 months. Neighborhood analysis found no striking excess of genes correlated with the response to chemotherapy, and class predictors that used 10 to 50 genes were not highly accurate in cross-validation. Thus, there was no evidence of a strong multigene expression signature correlated with clinical outcome, although this could reflect the relatively small sample size.

7.4.5 Class Discovery

Next the researchers turned to the question of class discovery: they explored whether cancer classes could be discovered automatically. For example, if the AML-ALL distinction were not already known, could we discover it simply on the basis of gene expression? The authors used a clustering algorithm for this part. To cluster tumors, Golub et al. used the SOM (self-organizing map) algorithm. First, a 2-cluster SOM was applied to cluster the 38 initial leukemia samples on the basis of the expression patterns of all 6,817 genes. The clusters were evaluated by comparing them to the known AML-ALL classes (Figure 7.14A). The results were as follows: cluster A1 contained mostly ALL samples (24 of 25) and cluster A2 contained mostly AML samples (10 of 13). On the basis of the clusters, the authors constructed predictors to assign new samples as "type A1" or "type A2" and tested their validity. Predictors using a wide range of numbers of informative genes performed well in cross-validation. The cross-validation not only showed high accuracy, but actually refined the SOM-defined classes: the subset of samples accurately classified in cross-validation were those perfectly subdivided by the SOM into ALL and AML. Then the class predictor of the A1-A2 distinction was tested on the independent test set. The prediction strengths were quite high: the median PS was 0.61 and 74% of the samples were above the threshold (Figure 7.14B), indicating that the structure seen in the initial set is also present in the test set. In contrast, random clusters consistently yielded predictors with poor accuracy in cross-validation and low PS on the independent data set. On the basis of such analysis, the A1-A2 distinction can be seen to be meaningful, rather than simply a statistical artifact of the initial data set. The results thus show that the AML-ALL distinction could have been automatically discovered and confirmed without previous biological knowledge.

Finally, the researchers tried to extend the class discovery by searching for finer subclasses of the leukemias. A 4-cluster SOM divided the samples into four classes, which largely corresponded to AML, T-lineage ALL, and two groups of B-lineage ALL (Figure 7.14C). When these classes were evaluated by constructing class predictors, all pairs could be distinguished from one another, with the exception of B3 versus B4 (Figure 7.14D).

⁵ The leptin receptor is the cell-surface receptor that binds the hormone leptin.
Figure 7.14: ALL-AML class discovery. (A) Schematic presentation of the 2-cluster SOM. A 2-cluster (2 by 1) SOM was generated from the 38 initial leukemia samples with a modification of the GENECLUSTER computer package. Cluster A1 contains the majority of the ALL samples (grey squares) and cluster A2 contains the majority of the AML samples (black circles). (B) Prediction strength distribution. The scatter plots show the distribution of PS scores for the class predictors. (C) Schematic presentation of the 4-cluster SOM. AML samples are shown as black circles, T-lineage ALL as open squares, and B-lineage ALL as grey squares. (D) Prediction strength distribution for pairwise comparisons among the classes. Cross-validation studies show that the four classes could be distinguished with high prediction scores, with the exception of classes B3 and B4.

The prediction tests thus confirmed the distinctions corresponding to AML, B-ALL and T-ALL, and suggested that it may be appropriate to merge classes B3 and B4, both composed primarily of B-lineage ALL.

7.4.6 Discovery of a new class

In the previous article the authors showed how classes which are already known can be rediscovered. In Bittner et al. [1] the authors discover previously unknown classes in melanoma. They used the CAST clustering algorithm to cluster the data into different groups and found a subset with a distinct gene signature. They later proved experimentally in vitro that this new subtype is related to invasive melanomas that form primitive tubular networks, a feature of some highly aggressive metastatic melanomas.

7.5 Breast Cancer Classification

Breast cancer is one of the most common cancers in women. Different breast cancer patients with the same stage of disease can have markedly different treatment responses and overall outcomes. Current methods (such as lymph node status and histological grade) fail to classify breast tumors accurately, and 70-80% of the patients receiving chemotherapy or hormonal treatment would have survived without it. In [15] the authors aim to find gene-expression-based classification methods that can predict the clinical outcome of breast cancer. The authors used a data set containing 98 primary breast cancers, divided into the following subgroups:

1. Sporadic patients:
   (a) 34 samples from patients who developed distant metastases within 5 years. These were called the poor prognosis group; the mean time to metastases was 2.5 years.
   (b) 44 samples from patients who remained disease-free for a period of at least 5 years. These were called the good prognosis group and had a mean follow-up time of 8.7 years.
   All sporadic patients were lymph node negative and under 55 years of age at diagnosis.
2. Carriers:
   (a) 18 samples from patients with BRCA1 germline mutations.
   (b) 2 samples from BRCA2 mutation carriers.

From each of the above patients, 5µg of RNA was isolated from the tumor sample and hybridized on Agilent microarrays containing approximately 25,000 human genes. Out of these 25,000 genes, 5,000 were significantly regulated across the group of samples, i.e., showed at least a twofold difference in more than five tumors.

7.5.1 Hierarchical clustering of the data

The first step performed by the authors was to use an unsupervised hierarchical clustering algorithm to cluster the 98 tumors on the basis of their similarities measured over these 5,000 significant genes; similarly, the 5,000 genes were clustered on the basis of their similarities measured over the group of 98 tumors. A sketch of such two-way clustering follows.
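This is a minimal sketch of two-way average-linkage hierarchical clustering (assuming scipy and synthetic data, with only 500 genes for brevity; correlation distance is a common choice for expression data, though the paper's exact similarity measure may differ):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(10)
expr = rng.normal(size=(98, 500))         # 98 tumors x 500 genes

# Cluster tumors over genes, and genes over tumors, with average linkage.
tumor_link = linkage(expr, method="average", metric="correlation")
gene_link = linkage(expr.T, method="average", metric="correlation")

# Reorder rows and columns by the dendrogram leaves to draw a heatmap.
heatmap = expr[leaves_list(tumor_link)][:, leaves_list(gene_link)]
print(heatmap.shape)
```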
The results of this two-way clustering can be seen in Figure 7.15. Looking at both parts of the figure, we can see a connection between the two main clusters and the clinical markers. For example, looking at the metastases column (the right-most column) we can see that in the upper group only 34% of the sporadic patients came from the group that developed distant metastases within 5 years, whereas in the lower group 70% of the sporadic patients had progressive disease. Thus, using unsupervised clustering we can already, to some extent, distinguish between good prognosis and poor prognosis tumors. Another connection concerns the ER (estrogen receptor) status and lymphocytic infiltration: the top group is enriched in ER-positive tumors and negative lymphocytic infiltration, and the bottom group is enriched in the opposite phenotypes. This is consistent with previous reports which grouped breast cancer into two subgroups that differ in ER status and lymphocytic infiltration. Part c of the figure zooms in on genes that co-regulate with the ER-α gene, and part d on genes connected to lymphocytic infiltration; the difference between the two groups of genes is easily visible.

Figure 7.15: Unsupervised two-dimensional clustering analysis of 98 breast tumors and 5,000 genes. a: Heatmap ordered according to the clustering. Each row represents one of the 98 tumors and each column represents one of the 5,000 significant genes. b: Selected clinical data for the 98 patients. c: Zoom-in on genes that co-regulate with the ER-α gene. d: Zoom-in on genes connected to lymphocytic infiltration.

7.5.2 Classification

The next step was to create a classifier that distinguishes the poor prognosis group from the good prognosis group using gene expression values. Approximately 5,000 genes (significantly regulated in more than 3 of the 78 sporadic tumors) were selected from the 25,000 genes on the microarray. The correlation coefficient of each gene's expression with the disease outcome was calculated, and 231 genes were found to be significantly associated with disease outcome (|CC| > 0.3). These genes were rank-ordered by |CC|. A classifier was built using the top 5 genes on this ordered list and evaluated using LOOCV. In each iteration, 5 more genes from the top of the list were added to the classifier input and LOOCV was applied again. The accuracy improved with each iteration until the optimal number of marker genes was reached at 70 genes. The expression pattern of the 70 genes in the 78 samples is shown in Figure 7.16, where the tumors are ordered according to the correlation coefficient of their profile with the average good prognosis profile. The classifier predicted the actual outcome of the disease correctly for 65 of the 78 patients (83%); 5 poor prognosis and 8 good prognosis patients were assigned to the opposite category. The optimal accuracy threshold is shown as a solid line in part b of Figure 7.16. Since misdiagnosing poor prognosis patients is the greater concern, a more sensitive threshold was also used (the dashed line in the figure). This optimized sensitivity threshold resulted in a total of 15 misclassifications: 3 poor prognosis tumors were classified as good prognosis (as opposed to 5 before), and 12 good prognosis tumors were classified as poor prognosis (as opposed to 8 with the previous threshold). The authors note that the functional annotations of the chosen genes provide insight into the underlying biological mechanisms leading to rapid metastasis: genes involved in the cell cycle, invasion and metastasis, angiogenesis and signal transduction are significantly up-regulated in the poor prognosis signature (for example, cyclin E2, MCM6, the metalloproteinases MMP9 and MP1, RAB6B, PK428, ESM1 and the VEGF receptor FLT1). A sketch of this classification scheme follows.
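This is a minimal sketch of the correlation-based classification scheme (plain numpy; the data, the threshold and the good-prognosis template are synthetic stand-ins for the paper's, and for brevity the signature size is fixed at 70 rather than grown in steps of 5):

```python
import numpy as np

rng = np.random.default_rng(11)
n_genes, n_tumors = 5000, 78
expr = rng.normal(size=(n_tumors, n_genes))
outcome = np.array([1] * 44 + [0] * 34)        # 1 = good, 0 = poor prognosis
expr[outcome == 1, :40] += 1.0                 # plant outcome-correlated genes

# Rank genes by |correlation with outcome|; keep the top 70 as the signature.
cc = np.array([np.corrcoef(expr[:, g], outcome)[0, 1] for g in range(n_genes)])
signature = np.argsort(-np.abs(cc))[:70]

# Classify by correlation with the mean good-prognosis profile.
template = expr[outcome == 1][:, signature].mean(axis=0)
score = np.array([np.corrcoef(expr[t, signature], template)[0, 1]
                  for t in range(n_tumors)])
threshold = 0.2                                 # placeholder for the tuned cutoff
pred = (score > threshold).astype(int)
print("training accuracy:", np.mean(pred == outcome))
```

Lowering the threshold plays the role of the 'optimized sensitivity' cutoff: fewer poor-prognosis tumors are called good, at the cost of more good-prognosis tumors being called poor.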
7.5.3 Validation

The classifier labels tumors whose gene expression profile has a correlation coefficient above the 'optimized sensitivity' threshold (dashed line) as having the good prognosis signature, and those below this threshold as having the poor prognosis signature. To validate the prognosis classifier, an additional independent set of primary tumors from 19 young, lymph-node-negative breast cancer patients was selected. This group consisted of 7 patients who remained metastasis-free for at least five years, and 12 patients who developed distant metastases within 5 years. The disease outcome was predicted by the 70-gene classifier, resulting in 2 incorrect classifications out of 19 using both the optimal accuracy threshold (solid line) and the optimized sensitivity threshold (dashed line). The results are shown in part c of Figure 7.16.

Figure 7.16: Supervised classification of prognosis signatures. a: The classifier structure. b: A heatmap of the expression values. Each row represents a tumor and each column a gene. The solid line is the optimal accuracy threshold and the dashed line the optimized sensitivity threshold. c: Same as b, but the expression matrix is for the tumors of the 19 additional breast cancer patients.

The odds ratio for a woman under 55 years of age, diagnosed with lymph-node-negative breast cancer carrying the poor prognosis signature, to develop a distant metastasis within 5 years, compared with those carrying the good prognosis signature, is 15-fold. This is compared to previous methods, which achieved only 2.4-6.4 fold.

7.5.4 Using the classifier to predict survival rate

In a later paper [14], the same group of authors re-validated their classifier on a much larger population. This time 295 samples taken from patients with breast cancer were used; 151 of the patients had lymph-node-negative disease and 144 had lymph-node-positive disease. The same classifier, using the previous 70-gene signature and threshold, was used to classify the patients. Among the 295 patients, 180 had a poor prognosis signature and 115 had a good prognosis signature. The 10-year survival rate was 54.6±4.4 percent for the poor prognosis group and 94.5±2.6 percent for the good prognosis group. At 10 years, the probability of remaining free of distant metastases was 50.6±4.5 percent in the group with a poor prognosis signature and 85.2±4.3 percent in the group with a good prognosis signature.

Figure 7.17: Pattern of expression of the 70 signature genes in 295 patients with breast cancer.

Figure 7.17 shows a heatmap for the 295 tumors and 70 genes. The tumors are ranked as in Figure 7.16. Notice that the tumors classified as good prognosis have a much lower incidence of metastases and death.
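Survival comparisons of this kind are typically drawn as Kaplan-Meier curves (as in Figures 7.18 and 7.19 below). This is a minimal sketch assuming the third-party lifelines library and entirely synthetic follow-up data; the paper's actual follow-up times are not reproduced here:

```python
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(12)
# Synthetic follow-up: time to metastasis (years) and event indicator
# (1 = metastasis observed, 0 = censored), for the two signature groups.
t_good = rng.exponential(30.0, 115).clip(max=12)
e_good = (t_good < 12).astype(int)
t_poor = rng.exponential(9.0, 180).clip(max=12)
e_poor = (t_poor < 12).astype(int)

kmf = KaplanMeierFitter()
ax = kmf.fit(t_good, e_good, label="good prognosis").plot_survival_function()
kmf.fit(t_poor, e_poor, label="poor prognosis").plot_survival_function(ax=ax)
ax.set_xlabel("years")
ax.set_ylabel("probability of remaining metastasis-free")
```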
Figure 7.18: Kaplan-Meier analysis of the probability that patients would remain free of distant metastases and the probability of overall survival, among all patients.

Figure 7.19: Kaplan-Meier analysis of the probability that patients would remain free of distant metastases and the probability of overall survival, among patients with lymph-node-negative disease.

The authors also present Kaplan-Meier plots of the probability that patients would remain metastasis free and the probability of overall survival. Figure 7.18 shows metastasis and survival rates among all patients. It is easily seen that, as the years go by, the chances of patients from the good prognosis group to survive and to remain metastasis free are significantly higher than those of patients in the poor prognosis group. Figure 7.19 shows the same plots for lymph-node-negative patients only.

7.5.5 Is the above gene set unique?

In 2005, Ein-Dor et al. [5] set out to check whether the 70-gene signature described above is unique in its ability to classify breast cancer; the gene set has little or no overlap with other published signatures. To test this, the authors used the same training set used in [15], consisting of 77 sporadic patients, and the same test set of 19 sporadic patients ([15] used 78 samples for their training set, but the authors of this paper removed one sample because it had more than 20% missing values). Similarly to [15], the authors first created a subset of about 5,000 highly regulated genes and ranked the genes in this subset according to their correlation with survival. The correlation p-value was calculated by comparison with the correlations obtained for 10^5 permuted survival vectors. Next, a series of classifiers was built using consecutive groups of 70 genes on the ranked list. For each classifier, the training and test errors were measured, and seven other sets of 70 genes were found to produce classifiers with the same prognostic capabilities as the classifier based on the top 70. The Kaplan-Meier plots for the classifier based on the top 70 genes and for the seven alternative classifiers are shown in Figure 7.20. The figure shows that the seven new classifiers, based on lower-ranking genes than the original classifier, produce similar results.

Figure 7.20: Kaplan-Meier analysis of the van't Veer classifier and of the seven alternative classifiers, as obtained from classifying all 96 samples. Upper curves describe the probability of remaining free of metastasis in the group of samples classified as having a good prognosis signature, while the lower curves describe the poor prognosis group.

After showing that the 70-gene signature is not unique in its ability to predict survival rates, the authors checked whether the gene list depends on the selection of the training set. Out of the 96 samples in total, 77 random patients were chosen as the training set, keeping the same good/poor prognosis ratio (33/44), and the selection was repeated (bootstrapping). Figure 7.21 shows the locations of the top 70 genes (ranked by correlation) for ten such training sets; the genes in the figure are ordered according to their ranks in the first training set. The figure shows that the ranks of the top 70 genes change significantly from one training set to another, so the gene list is not robust to the selection of the training set.

Figure 7.21: Ten sets of top 70 genes, identified in 10 randomly chosen training sets of N = 77 patients. Each row represents a gene and each column a training set. The genes were ordered according to their correlation rank in the first training set (leftmost column). For each training set, the 70 top-ranked genes are colored black; genes that were top-ranked in one training set can have a much lower rank when another training set is used. The two rightmost columns (columns 11 and 12) mark those of the 70 genes published by van't Veer et al. [15] and of the 128 genes of Ramaswamy et al. [12] that are among the top 1000 of the first training set.
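A minimal sketch of this rank-stability experiment follows, on synthetic stand-in data (X, y and the helper top70 are placeholders; the paper additionally fixes the 33/44 good/poor ratio in every training set, which is omitted here for brevity):

import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in data: 96 tumors x 5,000 genes, binary survival outcome
X = rng.normal(size=(96, 5000))
y = rng.choice([1, -1], size=96)

def top70(Xs, ys):
    """Indices of the 70 genes best correlated (in absolute value) with outcome."""
    cc = np.array([np.corrcoef(Xs[:, j], ys)[0, 1] for j in range(Xs.shape[1])])
    return set(np.argsort(-np.abs(cc))[:70])

# ten random training sets of 77 patients, re-ranking the genes each time
tops = []
for _ in range(10):
    idx = rng.choice(96, size=77, replace=False)
    tops.append(top70(X[idx], y[idx]))

# pairwise overlap between the ten top-70 lists; small overlap = unstable list
overlaps = [len(a & b) for i, a in enumerate(tops) for b in tops[i + 1:]]
print("mean pairwise overlap of top-70 lists:", np.mean(overlaps))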
In conclusion, the authors of [5] show that the 70-gene signature is neither unique in its ability to predict survival nor robust: it depends strongly on the selection of the training set. Therefore, one should not try to gain insight into the biological behavior of cancer from these particular 70 genes. Nevertheless, the 70-gene signature still produces good prognostic results and can be used in the clinical world.

7.6 Classification into multiple classes

7.6.1 Introduction

We now describe the work of Ramaswamy et al. [11], which deals with multiclass cancer diagnosis. To establish analytic methods capable of solving complex, multiclass, gene-expression-based classification problems, the researchers created a gene expression database containing the expression profiles of 218 tumor samples, representing 14 common human cancer classes, and 90 normal tissue samples. Hybridization targets were prepared with RNA from whole tumors and hybridized sequentially to oligonucleotide arrays containing a total of 16,063 probe sets. Expression values for each gene were calculated using Affymetrix GENECHIP analysis software. Two fundamentally different approaches to data analysis were explored: clustering (unsupervised learning) and classification (supervised learning).

7.6.2 Clustering

As we already know, this approach allows the dominant structure in a dataset to dictate the separation of samples into clusters, based on overall similarity in expression, without prior knowledge of sample identity. Of the 16,063 expression values considered, 11,322 passed a variation filter (see [11] for details) and were used for clustering. The dataset was normalized by standardizing each row (gene) to mean = 0 and variance = 1. Average-linkage hierarchical clustering was performed using the CLUSTER and TREEVIEW software, and self-organizing map analysis was performed using the GENECLUSTER analysis package. Figure 7.22 shows the results of both hierarchical and self-organizing map clustering of this dataset.

Figure 7.22: Clustering of tumor gene expression data and identification of tumor-specific molecular markers. Hierarchical clustering (a) and a 5 x 5 self-organizing map (SOM) (b) were used to cluster 144 tumors spanning 14 tumor classes according to their gene expression patterns. (c) Gene expression values for class-specific OVA markers are shown. Columns represent 190 primary human tumor samples ordered by class. Rows represent the 10 genes most highly correlated with each OVA distinction. Red indicates a high relative level of expression, and blue represents a low relative level of expression. The known cancer markers prostate-specific antigen (PSA), carcinoembryonic antigen (CEA), and estrogen receptor (ER) are identified. BR, breast adenocarcinoma; PR, prostate adenocarcinoma; LU, lung adenocarcinoma; CR, colorectal adenocarcinoma; LY, lymphoma; BL, bladder transitional cell carcinoma; ML, melanoma; UT, uterine adenocarcinoma; LE, leukemia; RE, renal cell carcinoma; PA, pancreatic adenocarcinoma; OV, ovarian adenocarcinoma; ME, pleural mesothelioma; CNS, central nervous system.

7.6.3 Classification

To make multiclass distinctions, the researchers devised the analytic scheme depicted in Figure 7.23.

One vs. All (OVA) SVM scheme

For each known class, a binary classifier is built. The classifier uses the SVM algorithm to define a hyperplane that best separates the training samples into two classes: samples from this class vs. all other samples. An unknown test sample's position relative to the hyperplane determines its class, and the confidence of each SVM prediction is based on the distance of the test sample from the hyperplane; that distance is calculated in the 16,063-dimensional gene space corresponding to the total number of expression values considered.
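A minimal sketch of the OVA scheme follows, using scikit-learn's linear SVM on synthetic stand-in data (the matrix X, the labels, the helper predict and the rejection threshold of 0 are placeholders, not the authors' data or software):

import numpy as np
from sklearn.svm import SVC

# synthetic stand-in data: 144 training tumors, 16,063 expression values each,
# class labels 0..13 for the 14 cancer classes
rng = np.random.default_rng(0)
X = rng.normal(size=(144, 16063)).astype(np.float32)
labels = rng.integers(0, 14, size=144)

# one linear SVM per class: samples of that class vs. all other samples
ova = []
for c in range(14):
    y = np.where(labels == c, 1, -1)
    ova.append(SVC(kernel="linear").fit(X, y))

def predict(x, reject_below=0.0):
    """Assign x to the class whose OVA classifier is most confident.

    Confidence is the SVM decision value (proportional to the distance of x
    from the hyperplane); below `reject_below` no call is made.
    """
    conf = np.array([clf.decision_function(x[None, :])[0] for clf in ova])
    best = int(np.argmax(conf))
    return (best if conf[best] > reject_below else None), conf[best]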
Recursive feature elimination

Given microarray data with n genes per sample, each OVA SVM classifier outputs a hyperplane w, which can be thought of as a vector with n elements, each corresponding to the expression of a particular gene. Assuming the expression values of each gene have similar ranges, the absolute magnitude of each element in w determines its importance in classifying a sample, since the class label is sign[f(x)]. Each OVA SVM classifier is first trained with all genes; then the 10% of the genes with the smallest |w_i| are removed, and each classifier is retrained with the smaller gene set. This procedure is repeated iteratively to study the prediction accuracy as a function of the number of genes.

Prediction

Each test sample is presented sequentially to the 14 OVA classifiers, each of which either claims or rejects that sample as belonging to its class, with an associated confidence. Each test sample is then assigned to the class with the highest OVA classifier confidence. If all confidences are low, no prediction is made.

Figure 7.23: Multiclass classification scheme. The multiclass cancer classification problem is divided into a series of 14 OVA problems, and each OVA problem is addressed by a different class-specific classifier (e.g., "breast cancer" vs. "not breast cancer"). Each classifier uses the SVM algorithm to define a hyperplane that best separates training samples into two classes. In the example shown, a test sample is sequentially presented to each of the 14 OVA classifiers and is predicted to be breast cancer, based on the breast OVA classifier having the highest confidence.

Testing and Results

As mentioned above, the number of genes contributing to the high accuracy of the SVM OVA classifier was also investigated. The SVM algorithm considers all 16,063 input genes and naturally utilizes all genes that contain information for each OVA distinction. Genes are assigned weights based on their relative contribution to the determination of each hyperplane, and genes that do not contribute to a distinction are weighted zero. Virtually all genes on the array were assigned weakly positive or negative weights in each OVA classifier, indicating that thousands of genes potentially carry information relevant for the 14 OVA class distinctions. To determine whether the inclusion of this large number of genes was actually required for the observed high-accuracy predictions, the authors examined the relationship between classification accuracy and gene number using recursive feature elimination.
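The following is a minimal sketch of this elimination loop for a single OVA classifier, on synthetic stand-in data (the function rfe_curve and the data sizes are hypothetical, not the authors' code):

import numpy as np
from sklearn.svm import SVC

def rfe_curve(X_tr, y_tr, X_te, y_te, drop=0.10):
    """Recursive feature elimination for one OVA classifier (a sketch).

    Repeatedly train a linear SVM, record the test accuracy, then discard the
    `drop` fraction of surviving genes with the smallest |w_i|.
    """
    genes = np.arange(X_tr.shape[1])
    curve = []
    while True:
        clf = SVC(kernel="linear").fit(X_tr[:, genes], y_tr)
        curve.append((len(genes), clf.score(X_te[:, genes], y_te)))
        if len(genes) == 1:
            break
        w = np.abs(clf.coef_[0])                 # one weight per surviving gene
        n_drop = max(1, int(drop * len(genes)))  # drop the 10% smallest |w_i|
        genes = genes[np.sort(np.argsort(w)[n_drop:])]
    return curve

# usage on synthetic data: accuracy as a function of the number of genes kept
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 500)), rng.choice([1, -1], size=100)
print(rfe_curve(X[:70], y[:70], X[70:], y[70:])[:5])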
As shown in Figure 7.25, maximal classification accuracy is achieved when the predictor utilizes all genes for each OVA distinction. Nevertheless, significant prediction accuracy can still be achieved using smaller numbers of genes.

The accuracy of the multiclass SVM-based classifier in cancer diagnosis was first evaluated by leave-one-out cross-validation on a set of 144 training samples. As shown in Figure 7.24, the majority (80%) of the 144 calls were high confidence (defined as confidence > 0), and these had an accuracy of 90%, using the patient's clinical diagnosis as the "gold standard". The remaining 20% of the tumors had low confidence calls (confidence ≤ 0), and these predictions had an accuracy of 28%. Overall, the multiclass prediction corresponded to the correct assignment for 78% of the tumors, far exceeding the accuracy of random classification (9%). For half of the errors, the correct classification corresponded to the second- or third-most confident OVA prediction. These results were confirmed by training the multiclass SVM classifier on the entire set of 144 samples and applying it to an independent set of 54 tumor samples; the overall prediction accuracy on this test set was 78%. Poorly differentiated samples yielded low-confidence predictions in cross-validation and could not be accurately classified according to tissue origin, indicating that they are molecularly distinct entities with gene expression patterns different from those of their well-differentiated counterparts. Overall, these results demonstrate the feasibility of accurate multiclass molecular cancer classification and suggest a strategy for future clinical implementation of molecular cancer diagnosis.

Figure 7.24: Multiclass classification results. (a) Results of multiclass classification using cross-validation on a training set (144 primary tumors) and independent testing with 2 test sets: Test (54 tumors; 46 primary and 8 metastatic) and PD (20 poorly differentiated tumors; 14 primary and 6 metastatic). (b) Scatter plot showing SVM OVA classifier confidence as a function of correct calls (blue) or errors (red) for the Training, Test, and PD samples. A, accuracy of prediction; %, percentage of the total sample number.

Figure 7.25: Multiclass classification as a function of gene number. Training and test datasets were combined (190 tumors; 14 classes) and then randomly split into 100 training and test sets of 144 and 46 samples (all primary tumors) in a class-proportional manner. SVM OVA prediction was performed, and the mean classification accuracy over the 100 splits was plotted as a function of the number of genes used by each of the 14 OVA classifiers, showing decreasing prediction accuracy with decreasing gene number. Results using other algorithms (k-NN, k-nearest neighbors; WV, weighted voting) and classification schemes (AP, all-pairs) are also shown.
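A minimal sketch of the 100 class-proportional splits behind Figure 7.25 follows, on synthetic stand-in data (sizes are reduced and LinearSVC is a stand-in classifier: scikit-learn's LinearSVC trains one-vs-rest linear SVMs, which is close in spirit to, but not identical with, the paper's OVA scheme):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# synthetic stand-in data: 190 tumors, 14 classes, 2,000 genes (reduced sizes)
X = rng.normal(size=(190, 2000))
labels = np.repeat(np.arange(14), 14)[:190]   # roughly balanced classes

accs = []
for seed in range(100):   # 100 class-proportional 144/46 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=46, stratify=labels, random_state=seed)
    clf = LinearSVC().fit(X_tr, y_tr)         # one-vs-rest linear SVMs
    accs.append(clf.score(X_te, y_te))

print("mean accuracy over the 100 splits:", np.mean(accs))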
Bibliography

[1] M. Bittner, P. Meltzer, Y. Chen, Y. Jiang, E. Seftor, M. Hendrix, M. Radmacher, R. Simon, Z. Yakhini, A. Ben-Dor, N. Sampas, E. Dougherty, E. Wang, F. Marincola, C. Gooden, J. Lueders, A. Glatfelter, P. Pollock, J. Carpten, E. Gillanders, D. Leja, K. Dietrich, C. Beaudry, M. Berens, D. Alberts, and V. Sondak. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature, 406(6795):536–540, 2000.

[2] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95(25):14863–14868, 1998.

[3] C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121–167, 1998.

[4] E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. Technical Report AIM-1602, MIT, 1996.

[5] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21(2):171–178, 2005.

[6] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.

[7] G. Zoutendijk. Methods of Feasible Directions: A Study in Linear and Non-linear Programming. Elsevier, 1970.

[8] M.P. Brown, W.N. Grundy, D. Lin, N. Cristianini, C.W. Sugnet, T.S. Furey, M. Ares, and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA, 97(1):262–267, 2000.

[9] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

[10] A. Ben-Dor, L. Bruhn, I. Nachman, M. Schummer, N. Friedman, and Z. Yakhini. Tissue classification with gene expression profiles. J. Computational Biology, 7:559–584, 2000.

[11] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J.P. Mesirov, T. Poggio, W. Gerald, M. Loda, E.S. Lander, and T.R. Golub. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA, 98(26):15149–15154, 2001.

[12] S. Ramaswamy, K.N. Ross, E.S. Lander, and T.R. Golub. A molecular signature of metastasis in primary solid tumors. Nature Genetics, 33:49–54, 2003.

[13] D.K. Slonim, P. Tamayo, J.P. Mesirov, T.R. Golub, and E.S. Lander. Class prediction and discovery using gene expression data. In RECOMB, pages 263–272, 2000.

[14] M.J. van de Vijver, Y.D. He, L.J. van 't Veer, H. Dai, A.A. Hart, D.W. Voskuil, G.J. Schreiber, J.L. Peterse, C. Roberts, M.J. Marton, M. Parrish, D. Atsma, A. Witteveen, A. Glas, L. Delahaye, T. van der Velde, H. Bartelink, S. Rodenhuis, E.T. Rutgers, S.H. Friend, and R. Bernards. A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med., 347(25):1999–2009, 2002.

[15] L.J. van 't Veer, H. Dai, M.J. van de Vijver, Y.D. He, A.A. Hart, M. Mao, H.L. Peterse, K. van der Kooy, M.J. Marton, A.T. Witteveen, G.J. Schreiber, R.M. Kerkhoven, C. Roberts, P.S. Linsley, R. Bernards, and S.H. Friend. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871):530–536, 2002.

[16] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1999.
