UNIVERSITÉ LIBRE DE BRUXELLES
Faculté des Sciences
Département d'Informatique

Use of Machine Learning in Bioinformatics to Identify Prognostic and Predictive Molecular Signatures in Human Breast Cancer

Supervisor: M. Gianluca Bontempi
Co-supervisor: M. Christos Sotiriou

Thesis submitted for the degree of DEA en Sciences by Benjamin Haibe-Kains
Academic Year 2004-2005

Contents

1 Introduction
  1.1 Bioinformatics Context
    1.1.1 Treatment Resistance in Breast Cancer
      1.1.1.1 Tamoxifen Resistance Project
  1.2 Contributions
  1.3 Glossary
  1.4 Abbreviations and Acronyms
  1.5 Notations

2 Survival Analysis
  2.1 Censoring Data
  2.2 Survival Distributions
    2.2.1 Cumulative Distribution Function
    2.2.2 Probability Density Function
    2.2.3 Hazard Function
    2.2.4 Simple Hazard Models
  2.3 Estimating Survival Curves
    2.3.1 Kaplan-Meier Method
  2.4 Estimating Regression Models
    2.4.1 Parametric Regression Models
    2.4.2 Semiparametric Regression Models
      2.4.2.1 The Proportional Hazards Model
      2.4.2.2 Hypothesis Test
  2.5 Testing for Differences in Survivor Functions
    2.5.1 Logrank Test
    2.5.2 Wilcoxon Test
    2.5.3 Hazard Ratio
  2.6 Feature Selection
    2.6.1 Variable Ranking
    2.6.2 Variable Subset Selection
      2.6.2.1 Wrappers and Embedded Methods
    2.6.3 Feature Construction and Space Dimensionality Reduction
      2.6.3.1 Hierarchical Clustering

3 Materials
  3.1 Populations
  3.2 Microarray Platform
    3.2.1 Affymetrix® Technology

4 Methods
  4.1 Quality Assessment
  4.2 Preprocessing Methods
    4.2.1 Read Data
    4.2.2 Get Expression Measures
      4.2.2.1 Background Correction
      4.2.2.2 Normalization
      4.2.2.3 Summarization
      4.2.2.4 Population Correction
    4.2.3 Prefiltering
  4.3 Feature Selection
    4.3.1 Variable Ranking
      4.3.1.1 Scoring Function Based on Univariate Cox Model
    4.3.2 Feature Construction
      4.3.2.1 Classifier Validation on Different Microarray Platforms
    4.3.3 Cox Model
  4.4 Final Model
    4.4.1 Final Cox Model
    4.4.2 Cutoff Selection
  4.5 Survival Statistics
    4.5.1 Hazard Ratio
    4.5.2 Logrank Test
    4.5.3 Proportion of DMFS
    4.5.4 Time-Dependent ROC Curve
      4.5.4.1 Sensitivity and Specificity
      4.5.4.2 Area Under the ROC Curve

5 Results
  5.1 Tamoxifen Resistance Project
    5.1.1 Quality Assessment
    5.1.2 Preprocessing Methods
    5.1.3 Variable Ranking
    5.1.4 Feature Construction
    5.1.5 Final Cox Model and Risk Score Computation
    5.1.6 Cutoff Selection
      5.1.6.1 Hazard Ratio
      5.1.6.2 Logrank Test
      5.1.6.3 Proportion of DMFS
      5.1.6.4 Time-Dependent ROC Curve
    5.1.7 Validation on Independent Test Set
      5.1.7.1 Risk Scores
      5.1.7.2 Hazard Ratio
      5.1.7.3 Logrank Test
      5.1.7.4 Proportion of DMFS
      5.1.7.5 Time-Dependent ROC Curve

6 Conclusion
  6.1 Future Works

A Semiparametric Regression Models: Additional Topics
  A.1 Tied Data
  A.2 Time-Dependent Covariate
  A.3 Nonproportional Hazards
  A.4 Estimating Survivor Functions

B Microarray Platforms

C Probeset Annotations

D Gene Ontology

Bibliography

Acknowledgments

I would like to thank so many people that it is impossible to find any order for their respective contributions. Let's start:

G. Bontempi, who has followed me since my licence in computer science and who initiated the collaboration with Christos Sotiriou. His help was invaluable for my research, and his risotto alla milanese has changed my life!

Christos Sotiriou, who allowed me to do my training in his lab and who believes in me enough to keep me. His passion for research is a great source of motivation.
And last but not least, I thank him for his confidence, which allows me to continue my research in spite of the huge amount of work in the lab.

All my colleagues of the Microarray Unit at the IJB for their enthusiasm and their efficiency. Christine Desmedt, Françoise Lallemand, Virginie Durbecq and Sherene Loi for their great discussions and their motivation.

All my colleagues of the Machine Learning Group at the ULB. I have had great moments with them during this first year of research. Special thanks to Yann-Aël Le Borgne for courageously reading the preliminary versions of my thesis.

Raymond Devillers for his careful reading.

My girlfriend, Olivia, for supporting me even when I live only for my work.

Finally, all my teachers, who have opened my mind to such interesting research fields.

I hope that my friends and my family have not suffered too much from my bad mood during intensive work. Their support was not only valuable but necessary.

Chapter 1

Introduction

Contents
1.1 Bioinformatics Context
  1.1.1 Treatment Resistance in Breast Cancer
1.2 Contributions
1.3 Glossary
1.4 Abbreviations and Acronyms
1.5 Notations

Thanks to the routine use of screening mammograms in developed countries, more and more women are diagnosed with early breast cancer (small tumors and no lymph node invasion). However, despite early detection, 20 to 30% of these women will relapse and die from their disease. The majority of these deaths are due to distant metastases. Locoregional treatment (surgery and radiotherapy) is always carried out, and a systemic adjuvant treatment (e.g.
chemotherapy and/or endocrine therapy) is proposed to all high-risk patients to prevent recurrence. The definition of such a risk is a central problem in the clinic and can have two different meanings. The risk can have a prognostic value, i.e. its power to predict survival independently of treatment. On the other hand, the risk can have a predictive value, i.e. its power to predict survival under treatment. Currently, the risk is defined from several histological criteria established during consensus conferences in Europe and the USA [Goldhirsh et al., 1998; Eifel et al., 2001; Goldhirsh et al., 2003], which attempt to define prognostic criteria for breast cancers (source: http://www.breastcancer.org):

• Invasive/non-invasive breast cancer:
  – Non-invasive (or "in situ") cancers confine themselves to the ducts or lobules and do not spread to the surrounding tissues in the breast or to other parts of the body. However, they can develop into invasive cancer or increase the risk of it.
  – Invasive (or infiltrating) cancers have started to break through normal breast tissue barriers and invade surrounding areas. Much more serious than non-invasive cancers, invasive cancers can spread to other parts of the body through the bloodstream and the lymphatic system.

• Number of involved lymph nodes: some breast cancers spread to the lymph nodes under the arm. When the lymph nodes are involved in the cancer, they are called "node positive"; when they are free of cancer, they are called "node negative". In large medical studies, there appears to be a correlation between the number of involved lymph nodes and the aggressiveness of the cancer. Knowing how many lymph nodes are affected by cancer can help to select a more aggressive treatment in the adjuvant setting.

• Tumor size: tumors of large size are considered of poor prognosis. Currently, breast tumors are diagnosed earlier and, consequently, their size is smaller.
• Tumor rate/grade:
  – Rate of cancer cell growth: the proportion of cancer cells growing and making new cells varies from tumor to tumor and may be helpful in predicting how aggressive a cancer is. If more than 6-10% of the cells are making new cells, the rate of growth is considered unfavorably high.
  – Grade of cancer cell growth: patterns of cell growth are rated on a scale from 1 to 3 (also referred to as low, medium, and high). Calm, well-organized growth with few cells reproducing is considered grade 1; disorganized, irregular growth patterns in which many cells are in the process of making new cells is called grade 3. The lower the grade, the more favorable the expected outcome. At the same time, the higher the grade, the more vulnerable the cancer is to treatments such as chemotherapy and radiation. Thus, the histological grade in breast cancer provides important prognostic information. However, its interobserver variability and poor reproducibility, especially for tumors of intermediate grade, have limited its clinical potential. A recent study [Sotiriou et al., 2005] has determined a refinement of the histological grade using gene expression profiling.
  – Dead cells within the tumor: it is tempting to think that the only good cancer cell is a dead cancer cell. However, necrosis (dead tumor cells) is one of several signs of excessive tumor growth.

• Hormone receptor status: estrogen and progesterone stimulate the growth of normal breast cells as well as of some breast cancer cells. If a tumor is estrogen-receptor positive (ER-positive), it is more likely to grow in a high-estrogen environment. ER-negative tumors are usually not affected by the levels of estrogen and progesterone in the body. ER-positive cancers are more likely to respond to anti-estrogen therapies (e.g. Tamoxifen, a drug that works by blocking the estrogen receptors on breast tissue cells, thereby slowing their estrogen-fueled growth).
• Oncogenes: depending on the oncogene, it is either the amplification of the gene, the increased amount of its protein, or its mutation that confers its properties in breast cancer. Over-expression happens when an oncogene (such as HER2/neu, EGFR, or p53) makes excessive amounts of normal or abnormal proteins and receptors. Cancers that result from over-expressed oncogenes tend to be more aggressive and are more likely to recur than other cancers. They may also respond to different types of treatment than other breast cancers.

• Margins of resection: the term "margins" or "margins of resection" refers to the distance between the tumor and the edge of the tissue removed by surgery. The margins are measured on all six sides: front and back, top and bottom, left and right.

According to these histological criteria, approximately 80% of young patients without lymph node invasion are candidates for adjuvant treatment. It is obvious that these patients are over-treated, because 70 to 80% of them would not develop distant metastases even without adjuvant treatment [EBCT Collaborative Group, 1998]. These results highlight the necessity to improve the risk evaluation based on traditional factors. During the last ten years, several prognostic factors (e.g. HER2 and p53 mutations) have been assessed and correlated with prognosis, but these genes, taken individually, have only a limited prognostic power. Moreover, intensive research targets specific markers of treatment response, but these markers have only a limited predictive power. This is probably due to the molecular complexity and heterogeneity of the tumors: the tumor phenotype is not determined by isolated aberrations but by a combination of anomalies in a genetic context. Currently, thanks to technological advances in genome sequencing, new tools are available to analyze biological materials at the molecular level.
The microarray technology (which will be introduced in Section 3.2) makes it possible to analyze the genetic identity of a specific tissue over the whole genome. In one microarray experiment, the expression of several thousands of genes can be measured from a tumor tissue. This technology can be used to study the molecular make-up of multiple breast tumors, in order to improve the risk evaluation and our understanding of this biological phenomenon.

1.1 Bioinformatics Context

The use of machine learning methods [Mitchell, 1997; Hastie et al., 2001] in the field of bioinformatics is increasing over time. Such methods seem to be good candidates to treat microarray data [Dudoit et al., 2002]. Many problems in genomics are analyzed with machine learning methods, including cancer prediction, gene finding, protein structures and functions, protein interactions, and gene regulation networks, among many others. Here is a definition of machine learning (from http://en.wikipedia.org/wiki/Machine_Learning):

Machine learning is a field of artificial intelligence related to data mining and statistics. It involves learning from data. The researcher feeds a set of training examples to a computer program that aims to learn the connection between features of the examples and a specified target concept.

In our problem, the expression values of the genes are the input and the target concept is the survival of the corresponding patients. An important example of the use of machine learning methods in human breast cancer is the prognosis of node-negative breast cancers using microarrays [van't Veer et al., 2002]. According to a common view, progression from a primary to a metastatic tumor is accompanied by the sequential acquisition of phenotype changes, thus allowing breast cancer cells to invade, disseminate, and colonize distant sites.
Nevertheless, most investigations have revealed that progression is not accompanied by major changes in marker expression or grade [Lacroix and Leclercq, 2004]. These observations suggest that the metastatic signature might already be present in the primary breast tumor, challenging the traditional model of metastasis, which specifies that most primary tumor cells have low metastatic potential, but that rare cells within large primary tumors acquire metastatic capacity through somatic mutations. From that perspective, [van't Veer et al., 2002], applying a machine learning method (supervised learning; see Figure 1.1), sought to identify whether there exists a gene expression signature strongly prognostic of a short interval to distant metastases in primary breast cancer tumors.

Figure 1.1: Supervised learning method used in microarray classification as in [van't Veer et al., 2002]. The learning method constructs a classifier on the basis of the microarray data (gene expressions) and survival information about the patients (i.e. a binary class representing the appearance of distant metastases in the first 5 years of follow-up). We can use this classifier to predict the class of new data (i.e. a tumor tissue from a new patient).

They found 231 genes significantly associated with disease outcome, as defined by the presence of distant metastasis at the 5-year mark. They could then collapse this list into a core set of 70 prognostic markers. Interestingly, the investigators tested the ability of this array-derived prognostic "expression profile" to correctly identify patients who would need adjuvant chemotherapy, and compared it to accepted guidelines for the treatment of node-negative breast cancer (the NIH [Eifel et al., 2001] and St. Gallen [Goldhirsh et al., 1998] consensus guidelines).
They found that the expression profile not only correctly identified the patients who would need adjuvant chemotherapy, but could also reduce by about 30% the fraction of women receiving adjuvant chemotherapy without needing it. The same group applied this signature to a larger test set of 295 node-negative and node-positive breast cancer patients from the same institution. This study confirmed that the 70-gene prognosis signature could clearly distinguish patients with an excellent 10-year survival from those with a high mortality rate [van de Vijver et al., 2002].

In this thesis, we propose a machine learning methodology (which will be described in Chapter 4). We perform an experimental validation on real microarray data concerning the prediction of treatment resistance for breast cancer patients. The microarray data come from the Microarray Unit of the Institut Jules Bordet.

1.1.1 Treatment Resistance in Breast Cancer

One of the most important advances in the treatment of breast cancer came from the understanding that most patients with breast cancer already have disseminated or "micrometastasized" tumors at the time of diagnosis. Therefore, in order to fight the disease efficiently, a local surgical operation should be combined with an effective simultaneous systemic treatment, such as radio-, hormonal or chemotherapy. While significant advances have been made with this so-called adjuvant therapy, optimal therapy has not yet been defined for any breast cancer patient. One of the hurdles in adjuvant therapy is that the tumor cells are either inherently resistant or develop resistance to such therapies. The underlying biochemical and genetic causes of drug resistance in metastatic breast cancer are not clear. Hence, many women are given such adjuvant therapy, but only a minority will benefit.
Most therapy drugs are thought to work by activating self-destructive mechanisms in cancer cells, which therefore "commit suicide" (apoptosis) in response to the therapy. It has been hypothesized that resistant cancer cells somehow refuse to commit suicide in response to therapeutic drugs. The microarray technology could be used to study the genetic context of treatment resistance in breast cancer, in order to improve both the choice of an adequate therapy and our understanding of this biological phenomenon.

1.1.1.1 Tamoxifen Resistance Project

This project concerns the prediction of early distant metastases on Tamoxifen in early-stage breast cancer. The majority of early-stage breast cancers express estrogen receptors (ER) and receive Tamoxifen in the adjuvant setting. Yet up to 40% of these patients will relapse on Tamoxifen and develop incurable metastatic disease. Recent evidence from three large randomized controlled trials [Howell and Cuzick, 2005; Coombes and Hall, 2004; Goss and Ingle, 2003] exploring the role of aromatase inhibitors (AI) in the adjuvant setting shows a benefit from this novel strategy. However, the optimal sequence and duration of Tamoxifen/AI treatment is unknown. Therefore, it is vital to learn to identify those women at higher risk of Tamoxifen resistance. The aim of this project is to identify genes that could predict for this subset of women.

In this thesis, we will focus on the analysis of gene expression profiles determined from 99 Tamoxifen-only treated, ER-positive, early-stage breast cancers using Affymetrix® hgu133a and hgu133b chips (see Chapter 3). Within this group, 30 (29%) patients developed a distant recurrence at a median time to relapse of 3.8 years (the median time is computed using the KM estimator; see Section 2.3.1), and 75 (71%) remained disease-free at a median of 10.7 years of follow-up. The independent validation set consisted of 69 ER+
Tamoxifen-only treated breast cancer patients from a different institution (Karolinska, Sweden). Another independent dataset (Guys hospital, UK) consisted of 87 ER+ Tamoxifen-only treated breast cancer patients. Using these data, a group of genes will be selected to identify breast cancer patients at risk of early distant relapse on Tamoxifen. These patients could be the ideal candidates for upfront AIs, while the others would be considered for a sequential Tamoxifen/AI treatment.

1.2 Contributions

This section describes the contributions presented in this thesis.

Methodology We propose a methodology based on machine learning methods (e.g. feature selection) and well-established survival statistics (e.g. statistical tests for the difference in survival between two groups). This methodology is sketched in Figure 1.2 and includes methods for data preprocessing, feature selection, classifier construction and performance assessment (these methods will be described in Chapter 4).

Preprocessing Methods We introduce three new concepts in microarray data preprocessing:

• Use of a normalization procedure (RMA [Irizarry et al., 2003a]) separately for each population of patients, in order to facilitate the further analysis (inclusion of new populations during the analysis and an easier way to test new samples). See Section 4.2.2.2.

• A new correction method, called population correction, in order to minimize the variability due to the population effect. See Section 4.2.2.4.

• A prefiltering based on detection calls, in order to discard noninformative probesets without using demographic data. Even if some measurements (MM probe intensities) are not taken into account by the normalization procedure (RMA), this information is used in the prefiltering based on detection calls. See Section 4.2.3.

Feature Selection We introduce a new feature selection method based on variable ranking, semi-supervised hierarchical clustering and cross-validation. See Section 4.3.

Classifier Validation on Different Microarray Platforms We propose a new method to facilitate classifier validation on different microarray platforms. This method is based on a specific feature construction. See Section 4.3.2.1.

Time-Dependent ROC Curve We use the recently introduced time-dependent ROC curves in breast cancer microarray studies in order to assess classifier performance. Moreover, we provide an implementation of this method based on the R statistical tool [R Development Core Team, 2005]. See Section 4.5.4.

Cutoff Selection We introduce a new, simple method to select a cutoff for the risk scores, based on the hazard ratio. The aim of this method is to identify a low-risk group including the smallest number of events before three years (early distant metastases). See Section 4.4.2.
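To make the last two steps of the methodology concrete (risk score computation and cutoff application), the sketch below shows how the linear predictor of a fitted Cox model can be turned into low/high-risk groups. It is an illustrative Python sketch only, with hypothetical coefficients and data; the thesis implementation is written in R.

```python
# Illustrative sketch (not the thesis code): computing a Cox-model risk
# score as the linear predictor beta . x, then splitting patients into
# low/high-risk groups with a cutoff, as in the methodology of Figure 1.2.
def risk_score(expression, coefficients):
    """Risk score of one sample: the Cox-model linear predictor."""
    return sum(b * x for b, x in zip(coefficients, expression))

def assign_groups(scores, cutoff):
    """G = 0 for the low-risk group, G = 1 for the high-risk group."""
    return [0 if s <= cutoff else 1 for s in scores]

# Hypothetical example: two selected features, three patients.
beta = [0.8, -0.5]                                  # estimated coefficients
patients = [[1.0, 2.0], [3.0, 0.5], [0.2, 1.5]]     # expression values
scores = [risk_score(x, beta) for x in patients]
groups = assign_groups(scores, cutoff=1.0)          # e.g. cutoff from Section 4.4.2
```

In the actual methodology, the cutoff is not fixed a priori but selected on the training set (based on the hazard ratio, Section 4.4.2) and then applied unchanged to the test set.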
Classifier Validation on different Microarray Platforms We propose a new method to facilitate the classifier validation on different microarray patforms . This method is based on a specific feature construction. See Section 4.3.2.1. Time-Dependent ROC Curve We use the recently introduced time-dependent ROC curves in breast cancer microarray studies in order to assess the classifier performance. Moreover, we provide an implementation of this method based on the R statistical tool [R Development Core Team, 2005]. See Section 4.5.4. Cutoff Selection We introduce a new simple method to select a cutoff, based on the hazard ratio, for the risk scores. The aim of this method is to classify specifically a low-risk group including the smallest number of events before three years (early distant metastases). See Section 4.4.2. 6 Training raw microarray data training set Test raw microarray data test set demographics training set quality assessment quality assessment data preprocessing data preprocessing survival model feature selection CLASSIFIER demographics test set risk score computation risk score computation apply cutoff low/high-risk groups cutoff cutoff selection survival statistics Figure 1.2: Machine learning methodology for survival analysis of microarray data. 7 1.3 Glossary Adjuvant therapy Treatment given after the primary treatment to increase the chances of a cure. Adjuvant therapy may include chemotherapy, radiation therapy, hormone therapy, or biological therapy. cDNA complementary DNA (cDNA) is single-stranded DNA synthesized from a mature mRNA template. Consistency The consistency of an estimator means that it converges in probability to the true values as the sample gets larger, implying that the estimator is unbiased in large samples. Covariate A covariate is a variable that is possibly predictive of the outcome under study. A covariate may be of direct interest or be a confounding variable or effect modifier. 
Cross-hybridization The hydrogen bonding of a single-stranded DNA sequence to a single-stranded substrate that is partially, but not entirely, complementary. Often, this involves hybridizing a DNA probe for a specific DNA sequence to the homologous sequences of different species.

Cross-validation The practice of partitioning a sample of data into subsets such that the analysis is initially performed on one subset, while the other subsets are retained "blind" for subsequent use in confirming and validating the initial analysis.

Dendrogram A representation of a hierarchy by a dichotomous diagram, in which the end of a branch corresponds to an element and the level of a junction corresponds to the taxonomic distance between the two elements or the two groups that it connects.

Distant metastasis Cancer cells may spread to lymph nodes (regional lymph nodes) near the primary tumor. This is called nodal involvement, positive nodes, or regional disease. Cancer cells may also spread to other parts of the body, distant from the primary tumor. If a new cancer grows in such sites, it is called a distant metastasis.

Expressed Sequence Tag A short strand of DNA that is part of a cDNA molecule and can act as an identifier of a gene.

GenBank The GenBank sequence database is an annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced at the National Center for Biotechnology Information (NCBI) as part of an international collaboration with the European Molecular Biology Laboratory (EMBL) Data Library from the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ).

Gene Expression Transcription of the information contained within the DNA into messenger RNA (mRNA) molecules that are then translated into proteins.

Hybridization Hybridization is the process of binding complementary pairs of DNA molecules.
A DNA molecule has a very strong preference for its sequence complement, so just mixing complementary sequences is enough to induce them to hybridize. Hybridization is temperature-dependent, so DNAs that hybridize strongly at low temperature can be temporarily separated (denatured) by heating.

Location parameter The location parameter simply shifts the distribution left or right on the horizontal axis.

Longitudinal data Observations collected over a period of time.

Lymph nodes Lymph nodes are components of the lymphatic system. Clusters of lymph nodes are found in the underarms, groin, neck, chest, and abdomen. Lymph nodes act as filters, with an internal honeycomb of connective tissue filled with lymphocytes that collect and destroy bacteria and viruses. When the body is fighting an infection, these lymphocytes multiply rapidly and produce a characteristic swelling of the lymph nodes.

Mer Monomeric unit. The largest constitutional unit contributed by a single monomer molecule in a polymerization process.

Meta-analysis Analysis involving several sources of microarray data (e.g. Affymetrix® and Agilent® data).

Microarray Ordered arrangement, on a miniaturized support of glass, silicon or polymer, of hundreds or thousands of molecular probes whose nucleotide sequence is known, and whose function is to recognize, in a mixture, their complementary nucleotide sequences.

Monotone function The function f is monotone if, whenever x ≤ y, then f(x) ≤ f(y). Stated differently, a monotone function is one that preserves the order.

Neo-adjuvant therapy Treatment also known as primary systemic therapy or primary medical therapy: chemotherapy given before the primary surgery.

Oligonucleotide Short fragment of single-stranded DNA.

Prognosis Prediction of survival independently of treatment.

Polymerase Chain Reaction (PCR) Exponential amplification of almost any region of a selected DNA molecule.
Probe Easily detectable molecule which has the property of localizing specifically either on another molecule or in a given cellular compartment. Various molecules can be used as probes, on condition that a marker (enzyme, radioactive or fluorescent compound) can be associated with the probe to allow its detection. Generally the probe is a nucleic acid fragment (RNA or DNA).

Probeset Set of probes used on the Affymetrix® microarray platform. Even if, generally, a probeset corresponds to one gene, the expression of one gene may be measured by several probesets.

Reverse Transcriptase Polymerase Chain Reaction (RT-PCR) Molecular technique which relies on reverse transcriptase to amplify a sequence of RNA and transform it into DNA.

Scale parameter The effect of a scale parameter greater than one is to stretch the PDF; the greater the magnitude, the greater the stretching. The effect of a scale parameter less than one is to compress the PDF; the compressed PDF approaches a spike as the scale parameter goes to zero. A scale parameter of 1 leaves the PDF unchanged, and non-positive scale parameters are not allowed.

Sensitivity The sensitivity of a binary classification test is a parameter that expresses something about the test's performance: the proportion of positive test results among all positive cases tested, TP/(TP + FN).

Shape parameter Many probability distributions are not a single distribution but in fact a family of distributions, due to the distribution having one or more shape parameters. Shape parameters allow a distribution to take on a variety of shapes, depending on the value of the shape parameter. These distributions are particularly useful in modeling applications since they are flexible enough to model a variety of datasets.
Skewness Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable. Roughly speaking, a distribution has positive skew (right-skewed) if the upper tail is longer, and negative skew (left-skewed) if the lower tail is longer.

Specificity The specificity of a binary classification test is a parameter that expresses something about the test's performance. The specificity of such a test is the proportion of all negative samples tested that yield a true negative result (TN/(TN + FP)).

Tamoxifen A drug (Nolvadex) used to treat breast cancer, and to prevent it in women who are at high risk of developing breast cancer. Tamoxifen blocks the effects of the hormone estrogen in the breast. It belongs to the family of drugs called antiestrogens.

1.4 Abbreviations and Acronyms

AUC Area Under the Curve.
AI Aromatase Inhibitor.
AFT Accelerated Failure Time.
BIG Breast International Group.
CEL CELl intensities.
CDF Cumulative Distribution Function.
DCIS Ductal Carcinoma In Situ.
DDBJ DNA Data Bank of Japan.
DF Degree of Freedom.
DMFS Distant Metastases Free Survival.
EBI European Bioinformatics Institute.
EMBL European Molecular Biology Laboratory.
EORTC-BCG Breast Cancer Group of the European Organization for Research and Treatment of Cancer.
ER Estrogen Receptor.
EST Expressed Sequence Tag.
FN False Negatives.
FP False Positives.
GCOS GeneChip Operating Software.
GO Gene Ontology.
GUYT Population of Tamoxifen-treated patients coming from the Guys hospital.
HR Hazard Ratio.
IJB Institut Jules Bordet.
KIT Population of Tamoxifen-treated patients coming from the Karolinska hospital.
KM Kaplan-Meier.
KNN K-Nearest Neighbours.
LASSO Least Absolute Shrinkage and Selection Operator.
LN Lymph Node.
LOO Leave-One-Out.
MAS Microarray Affymetrix Suite.
MGED Microarray Gene Expression Data Society.
MIAME Minimum Information About a Microarray Experiment.
MM Mis-Match.
NCBI National Center for Biotechnology Information.
OXFT Population of Tamoxifen-treated patients coming from the John Radcliffe hospital.
PDF Probability Density Function.
PM Perfect Match.
RMA Robust Multi-array Average.
ROC Receiver Operating Characteristic.
RT-PCR Reverse Transcriptase Polymerase Chain Reaction.
SIB Swiss Institute of Bioinformatics.
SVM Support Vector Machines.
TN True Negatives.
TP True Positives.

1.5 Notations

N Number of samples.
P Number of probes.
Q Number of probesets.
F Number of features.
n Number of input variables.
Pop Set of populations.
X Set of covariates.
G Indicator variable for group (G = 0 for the low-risk group and G = 1 for the high-risk group).
S Scoring function.
X, Y, . . . Upper-case letters represent random variables (except the previously defined "number of . . . ").
x, y, . . . Lower-case letters represent realizations of random variables (except n).
x, X, β, . . . Bold letters represent vectors or matrices.
Ti Time of occurrence/censoring for sample i (i ∈ {1, . . . , N}).
δi Indicator status for sample i (i ∈ {1, . . . , N}).
β Coefficients of a linear regression model.
β̂ Estimated coefficients.
DN Dataset of N samples {xi, yi} (i ∈ {1, . . . , N}).

Chapter 2

Survival Analysis

Contents
2.1 Censoring Data
2.2 Survival Distributions
    2.2.1 Cumulative Distribution Function
    2.2.2 Probability Density Function
    2.2.3 Hazard Function
    2.2.4 Simple Hazard Models
2.3 Estimating Survival Curves
    2.3.1 Kaplan-Meier Method
2.4 Estimating Regression Models
    2.4.1 Parametric Regression Models
    2.4.2 Semiparametric Regression Models
2.5 Testing for Differences in Survivor Functions
    2.5.1 Logrank Test
    2.5.2 Wilcoxon Test
    2.5.3 Hazard Ratio
2.6 Feature Selection
    2.6.1 Variable Ranking
    2.6.2 Variable Subset Selection
    2.6.3 Feature Construction and Space Dimensionality Reduction

Survival analysis is a class of statistical methods for studying the occurrence and timing of events. These methods are most often applied to the study of deaths, but they can handle many kinds of events, including the onset of disease, equipment failures, arrests, etc. Survival analysis was designed for longitudinal data on the occurrence of events. An event can be defined as a qualitative change¹ that can be situated in time. For instance, a disease consists of a transition from a healthy state to a diseased state. Moreover, the timing of the event is also considered for analysis. Ideally, the transitions occur virtually instantaneously and the exact time at which the event occurs is known. Some transitions may take a little time, however, and the exact time of onset may be unknown or ambiguous.

For survival analysis, the best observation plan is prospective. By prospective we mean that the observation of a set of individuals starts at some well-defined point in time and that they are followed for some substantial period of time, recording the times at which the events of interest occur.

¹ A qualitative change is defined as a transition from one discrete state to another.
In this thesis, survival analysis is used with retrospective data, looking back at patients' medical history. Such data present some potential limitations:

• the data are prone to errors, and some events may be forgotten;
• the sample of patients may be a biased subsample of the initial population of interest.

Survival data have two common features that are difficult to handle with conventional statistical methods: censoring and time-dependent covariates (sometimes called time-varying explanatory variables). Consider the following example, which illustrates both these problems. A sample of 432 inmates released from Maryland state prisons was followed for one year after release [Rossi et al., 1980]. The event of interest was the first arrest. The aim was to determine how the occurrence and timing of arrests depended on several covariates (predictor variables). Some of these covariates (like age at release and number of previous convictions) remained constant over the one-year interval. Others (like marital status and employment status) could change at any time during the follow-up period.

If we narrow our focus to a dichotomous dependent variable (arrested or not arrested), conventional methods that could analyze such data are logistic regression [McCullagh and Nelder, 1989], linear discriminant analysis or support vector machines, for instance (see [Duda et al., 2001] for a review of such classification methods). But this analysis ignores information on the timing of arrest. It is natural to suppose that people who are arrested one week after release have, on average, a higher propensity to be arrested than those who are not arrested until the 52nd week. At the very least, ignoring that information should reduce the precision of the estimates. One solution to this problem is to make the length of time between release and first arrest the dependent variable and then estimate it by a conventional linear regression [McCullagh and Nelder, 1989].
But a problem remains with the persons who were not arrested during the one-year follow-up. Such cases are referred to as censored. A couple of obvious ad-hoc methods exist for dealing with censored cases, but neither works well. One method is to discard the censored cases, but this proportion may be large and discarding them may result in large biases. Alternatively, the time of arrest could be set at one year for all those who were not arrested. That is clearly an underestimate, however, and some of those ex-convicts may never be arrested. Again, large biases may occur. Whichever method is used, it is not clear how a time-dependent variable like employment status can be appropriately incorporated into either the classification methods for the occurrence of arrests or the linear model for the timing of arrests. The methods of survival analysis allow for censoring, and many also allow for time-dependent covariates, by combining the information from the censored and the uncensored cases [Allison, 1995].

2.1 Censoring Data

An observation on a random variable T is right-censored if all you know about T is that it is greater than some value c. In survival analysis, T is typically the time of occurrence of some event, and cases are right-censored because observation is terminated before the event occurs. The simplest and most common situation is depicted in Figure 2.1. Suppose that this figure reports some of the data from a study in which all persons receive heart surgery at time 0 and are followed for 3 years thereafter. The horizontal axis represents time. Each of the horizontal lines labeled A through E represents a single person. An x indicates that a death occurred at that point in time. The vertical line at 3 is the point at which the follow-up of the patients is stopped. Any death occurring at time 3 or earlier is observed and, hence, those death times are uncensored. Any death occurring after 3 years is not observed, and those death times are censored at time 3.
Therefore, persons A, C and D have uncensored death times, while persons B and E have right-censored death times. Observations that are censored in this way are referred to as singly right-censored. Singly refers to the fact that all the observations had the same censoring time. Observations that are not censored are also said to have a censoring time, in this case three years; it is just that their death times did not exceed their censoring time. Moreover, the censoring time is fixed and is under the control of the investigator.

Figure 2.1: Singly right-censored data (x marks a death; horizontal axis: years since surgery).

Random censoring occurs when observations are terminated for reasons that are not under the control of the investigator. This situation can be illustrated by the following example: in a study of divorces, a sample of couples is followed for 10 years beginning with the marriage, and the timing of all divorces is recorded. Clearly, couples that are still married after ten years are censored by a mechanism identical to that applied to the singly right-censored data. But for some couples, either the husband or the wife may die before the ten years are up. Some couples may move away, and it may be impossible to contact them. Still other couples may refuse to participate after, say, five years. These kinds of censoring are depicted in Figure 2.2, where the o for couples B and C indicates that observation is censored at that point in time.

Figure 2.2: Randomly censored data (x marks a divorce, o a censored observation; horizontal axis: years since marriage).

Random censoring can also be produced when there is a single termination time, but entry times vary randomly across individuals. Consider again the example in which people are followed after heart surgery until death. A more likely scenario is one in which people receive heart surgery at various points in time, but the study has to be terminated on a single date.
All persons still alive on that date are considered censored, but their survival times from surgery will vary. This censoring is considered random because the entry times are typically not under the control of the investigator.

Standard methods of survival analysis treat right-censored data but require that random censoring be noninformative. Here is how this situation is described in [Cox and Oakes, 1984]:

A crucial condition is that, conditionally on the values of any explanatory variables, the prognosis for any individual who has survived to ci should not be affected if the individual is censored at ci. That is, an individual who is censored at c should be representative of all those subjects with the same values of the explanatory variables who survive to c (p. 5).

The best way to understand this condition is to think about possible violations. In the divorce example mentioned earlier, it is plausible that those couples who refuse to continue participating in the study are more likely to be experiencing marital difficulties and, hence, are at greater risk of divorce. The censoring is then informative, assuming that the measured covariates do not fully account for the association between drop-out and marital difficulty. Informative censoring can, at least in principle, lead to severe biases, but it is difficult in most situations to assess the magnitude or direction of those biases. In this thesis we will focus on the analysis of right-censored data.

2.2 Survival Distributions

The standard approaches to survival analysis are based on statistical modeling. The times at which events occur are assumed to be realizations of some random variable T. Three ways of describing the probability distribution of T are presented in this section:

1. the cumulative distribution function
2. the probability density function
3. the hazard function.
2.2.1 Cumulative Distribution Function

The cumulative distribution function (CDF) of a random variable T, denoted by F(t), is a function giving the probability that the variable will be less than or equal to any specific value t, i.e. F(t) = Pr{T ≤ t}. In survival analysis, it is more common to work with the survivor function, defined as S(t) = Pr{T > t} = 1 − F(t). If the event of interest is a death, the survivor function gives the probability of surviving beyond t. Because T cannot be negative, S(0) = 1.

2.2.2 Probability Density Function

When variables are continuous, another useful way of describing the probability distribution is the probability density function (PDF). This function is defined as

    f(t) = dF(t)/dt = −dS(t)/dt    (2.1)

2.2.3 Hazard Function

In the case of continuous survival data, the hazard function is actually used more often than the PDF to describe distributions. The hazard function is defined as

    h(t) = lim_{∆t→0} Pr{t ≤ T < t + ∆t | T ≥ t} / ∆t    (2.2)

The function h(t) quantifies the instantaneous risk² that an event will occur in the small interval between t and t + ∆t. The probability in the numerator of (2.2) is conditional on the individual surviving to time t, because individuals who have already experienced the event should not be considered. The definition of the hazard function in (2.2) is similar to an alternative definition of the PDF

    f(t) = lim_{∆t→0} Pr{t ≤ T < t + ∆t} / ∆t    (2.3)

The only difference is that the probability in the numerator of (2.3) is an unconditional probability, whereas the probability in (2.2) is conditional on T ≥ t. For this reason, the hazard function is sometimes described as a conditional density. The survivor function, the probability density function and the hazard function are equivalent ways of describing a continuous probability distribution. The relationship between the PDF and the survivor function is given directly by (2.1).
Another simple formula expresses the hazard function in terms of the PDF and the survivor function:

    h(t) = f(t)/S(t)    (2.4)

² Although it may be helpful to think of the hazard as the instantaneous probability of an event at time t, this quantity is not a probability and may be greater than 1, owing to the division by ∆t in (2.2). Although the hazard has no upper bound, it cannot be less than 0.

Together, (2.4) and (2.1) imply that

    h(t) = −d log S(t)/dt    (2.5)

By integrating both sides of (2.5), we obtain an expression of the survivor function in terms of the hazard function:

    S(t) = exp{−∫_0^t h(u) du}    (2.6)

Together with (2.4), this formula leads to

    f(t) = h(t) exp{−∫_0^t h(u) du}    (2.7)

The hazard is a dimensional quantity of the form number of events per interval of time. This is why the hazard is sometimes called a rate. The units in which time is measured must be known in order to interpret the value of the hazard. Suppose that the hazard of contracting influenza at some particular point in time is 0.015, with time measured in months. This means that if the hazard stays at that value for a period of one month, one expects a person to contract influenza 0.015 times during that month.

2.2.4 Simple Hazard Models

The hazard function is a useful way of describing the probability distribution for the time of event occurrence. Every hazard function has a corresponding probability distribution. This section examines some rather simple hazard functions and discusses their associated probability distributions. The simplest hazard function specifies that the hazard is constant over time, that is, h(t) = λ or, equivalently, log h(t) = µ. Substituting this hazard into (2.6) and carrying out the integration implies that the survivor function is S(t) = e^(−λt). From (2.1), we get the PDF f(t) = λe^(−λt). This is the PDF of the exponential distribution with parameter λ.
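The equivalence of these descriptions is easy to check numerically in the constant-hazard case. The Python sketch below is an illustration only; the hazard value 0.015 per month is borrowed from the influenza example above. It verifies that h(t) = f(t)/S(t) returns λ at every t:

```python
import math

lam = 0.015  # constant hazard, in events per month

def survivor(t):
    """Survivor function S(t) = exp(-lambda * t), from (2.6)."""
    return math.exp(-lam * t)

def pdf(t):
    """Density f(t) = lambda * exp(-lambda * t), from (2.1)."""
    return lam * math.exp(-lam * t)

def hazard(t):
    """Hazard h(t) = f(t) / S(t), from (2.4)."""
    return pdf(t) / survivor(t)

# A constant hazard implies an exponential distribution: h(t) = lambda for all t.
for t in (0.5, 1.0, 12.0, 60.0):
    assert abs(hazard(t) - lam) < 1e-12
```

The same check works for any λ > 0, since f(t)/S(t) reduces algebraically to λ.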
Thus, a constant hazard implies an exponential distribution for the time until an event occurs (or the time between events). Suppose now that the natural logarithm of the hazard is a linear function of time:

    ln h(t) = µ + αt

where µ and α are real constants. Taking the logarithm is a convenient way to ensure that h(t) is nonnegative, regardless of the values of µ, α and t. We can rewrite the equation as

    h(t) = λγ^t

where λ = e^µ and γ = e^α. This hazard function implies that the time of event occurrence has a Gompertz distribution (see Figure 2.3 for the Gompertz distribution, Figure 2.4 for the Gompertz hazard function and Table 2.2 for the Gompertz PDF formula). Alternatively, we can assume that

    ln h(t) = µ + α ln t

which can be rewritten as

    h(t) = λt^α

with λ = e^µ. This equation implies that the time of event occurrence follows a Weibull distribution (see Figure 2.5 for the Weibull distribution, Figure 2.6 for the Weibull hazard function and Table 2.2 for the Weibull PDF formula).

Figure 2.3: Gompertz distribution for the time of event occurrence. The probability density function is shown for different values of the shape parameter (the shape corresponds to the α parameter of the Gompertz hazard function, such that shape = α).

The Gompertz and the Weibull distributions coincide with the exponential distribution in the special case α = 0. When α is not zero, the hazard is either always decreasing or always increasing with time for both distributions. One difference between them is that, for the Weibull model, the hazard at t = 0 is either zero or infinite. With the Gompertz model, the initial value of the hazard is λ, which can be any nonnegative number. We can extend each of these models to allow for the influence of covariates.
For instance, a covariate for the situation reported in Figure 2.1 could be the age of the patient at the time of surgery, or his blood group. Thus, if we have covariates x1, x2, . . . , xk, we can write

    Exponential:  ln h(t) = µ + β1 x1 + β2 x2 + · · · + βk xk    (2.8)
    Gompertz:     ln h(t) = µ + αt + β1 x1 + β2 x2 + · · · + βk xk    (2.9)
    Weibull:      ln h(t) = µ + α ln t + β1 x1 + β2 x2 + · · · + βk xk    (2.10)

2.3 Estimating Survival Curves

Prior to 1970, the estimation of S(t) was the predominant method of survival analysis [Gross and Clark, 1975]. Nowadays, the workhorse of survival analysis is the Cox regression method [Cox, 1972]. Nevertheless, survival curves are still useful for preliminary examination of the data, for computing derived quantities from regression models (e.g. the median survival time or the five-year probability of survival) and for evaluating the fit of regression models.

Figure 2.4: Typical hazard functions (h(t) = λγ^t with λ = 1) for the Gompertz distribution.

Figure 2.5: Weibull distribution for the time of event occurrence. The probability density function is shown for different values of the shape parameter (the shape corresponds to the α parameter of the Weibull hazard function, such that shape = α + 1).

Figure 2.6: Typical hazard functions (h(t) = λt^α with λ = 1) for the Weibull distribution.

There exist two methods to estimate survivor functions: the Kaplan-Meier and the life-table methods. The Kaplan-Meier method is most suitable for small datasets with precisely measured event times.
The life-table method may be better for large datasets or when the measurement of event times is crude [Allison, 1995]. In this thesis, the number of samples is small (a high feature/sample ratio, as will be described in Section 2.6), which is why the life-table method will not be treated.

2.3.1 Kaplan-Meier Method

The Kaplan-Meier (KM) estimator is the most widely used method for estimating survivor functions. Also known as the product-limit estimator, it was shown by Kaplan and Meier in 1958 to be the nonparametric maximum likelihood estimator [Kaplan and Meier, 1958]. When there are no censored data, the KM estimator is simple and intuitive. We have seen in Section 2.2 that the survivor function S(t) is the probability that an event time is greater than t, where t can be any nonnegative number. In the case of no censoring, the KM estimator is just the sample proportion of observations with event time greater than t. If data are right-censored, the observed proportion of cases with event times greater than t can be biased downward, because cases that are censored before t may have experienced an event before t without our knowledge.

Suppose there are r distinct event times, t1 < t2 < · · · < tr. At each time tj, there are nj individuals who are said to be at risk of an event. At risk means they have experienced neither an event nor censoring prior to time tj. If any cases are censored at exactly tj, they are also considered to be at risk at tj. Let dj be the number of individuals who die at time tj. The KM estimator is then defined as

    Ŝ(t) = ∏_{j : tj ≤ t} (1 − dj/nj)    (2.11)

for t1 ≤ t ≤ tr. In words, the quantity in parentheses can be interpreted as the conditional probability of surviving to time tj+1, given that one has survived to time tj; Ŝ(t) is thus the probability of surviving to time t. For t < t1 (the smallest event time), Ŝ(t) is defined to be 1.
For t > tr (the largest observed event time), the definition of Ŝ(t) depends on the configuration of the censored observations. When there are no censored times greater than tr, Ŝ(t) is set to Ŝ(tr) for t > tr. When there are censored times greater than tr, Ŝ(t) is undefined for t greater than the largest censoring time.

Here is a small example concerning the survival of breast cancer patients (inspired by [Collett, 2003]). Consider the data in Table 2.1.

Patient id   Survival time (in months)   Event
1            5                           1
2            8                           1
3            10                          0
4            13                          1
5            18                          0

Table 2.1: Survival times for breast cancer patients.

The corresponding survival curve, estimated by the KM estimator, is given in Figure 2.7.

Figure 2.7: Survival curve estimated by the KM estimator from the data in Table 2.1. The "+" represents censoring.

An estimate of the standard error of the KM estimate can be obtained by Greenwood's formula [Greenwood, 1926; Collett, 2003]:

    σ̂²_G{Ŝ(t)} = {Ŝ(t)}² Σ_{j : tj ≤ t} dj / (nj(nj − dj))

    ŝe_G{Ŝ(t)} = Ŝ(t) √[ Σ_{j : tj ≤ t} dj / (nj(nj − dj)) ]    (2.12)

This is derived by estimating each term in the product expansion of Ŝ(t) separately. Moreover, the bootstrap method can be used to estimate the variance of Ŝ(t) [Akritas, 1986]. It can be shown that the KM estimator is asymptotically normal as the sample size grows, with mean S(t) and variance estimated by Greenwood's formula [Meier, 1975]. Confidence intervals around KM estimates can be computed using these results.

2.4 Estimating Regression Models

Survivor functions can be estimated by regression models. In survival analysis, there exist two categories of such regression models: the parametric and the semiparametric regression models.

2.4.1 Parametric Regression Models

The parametric regression models with censored data are estimated using the method of maximum likelihood. This class of regression models is known as the accelerated failure time (AFT) class.
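Returning to the Kaplan-Meier method of Section 2.3.1, the estimator (2.11) and Greenwood's formula (2.12) can be written in a few lines of Python. This is a self-contained sketch for the data of Table 2.1, not the tooling used in this thesis:

```python
import math

# (time in months, event): event = 1 for a death, 0 for a censored
# observation, as in Table 2.1.
data = [(5, 1), (8, 1), (10, 0), (13, 1), (18, 0)]

def kaplan_meier(data):
    """Return [(t_j, S_hat(t_j), se_G)] at each distinct event time t_j."""
    s_hat, gw_sum, curve = 1.0, 0.0, []
    for t in sorted({t for t, event in data if event == 1}):
        n_j = sum(1 for t_i, _ in data if t_i >= t)                # at risk at t_j
        d_j = sum(1 for t_i, event in data if t_i == t and event)  # deaths at t_j
        s_hat *= 1.0 - d_j / n_j                                   # equation (2.11)
        gw_sum += d_j / (n_j * (n_j - d_j))                        # Greenwood sum
        curve.append((t, s_hat, s_hat * math.sqrt(gw_sum)))        # equation (2.12)
    return curve

for t, s, se in kaplan_meier(data):
    print(f"t = {t:2d}  S_hat = {s:.3f}  se = {se:.3f}")
# S_hat drops to 0.800, 0.600 and 0.300 at months 5, 8 and 13,
# matching the step curve of Figure 2.7.
```

Note that the censored observations at months 10 and 18 never trigger a step, but they do shrink the risk sets nj of later event times, exactly as the text describes.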
In its most general form, the AFT model describes a relationship between the survivor functions of any two individuals. If Si(t) is the survivor function for individual i, then for any other individual j, the AFT model holds that Si(t) = Sj(φij t), where i, j ∈ {1, . . . , N} and φij is a constant specific to the pair (i, j). This model says that what differentiates one individual from another is the rate at which they age. A good example is the conventional wisdom that a year for a dog is equivalent to seven years for a human.

In practice, the models commonly used are a special case of the AFT model that is quite similar in form to an ordinary linear regression model. Let Ti be a random variable denoting the event time for the i-th individual in the sample, and let xi1, xi2, . . . , xin be the values of n covariates for that same individual. The model is then

    ln Ti = β0 + β1 xi1 + · · · + βn xin + εi    (2.13)

where εi is a random disturbance term, and β0, β1, . . . , βn are parameters to be estimated. In a linear regression model, it is typical to assume that εi has a normal distribution with mean and variance that are constant over i, and that the ε's are independent across observations. This is the case for one member of the AFT class, the log-normal model³. However, there exist several alternatives that allow distributions of ε other than the normal, while retaining the assumptions of constant mean and variance, as well as independence across observations. Some examples of these alternatives are given in Table 2.2.

³ This model is called the log-normal model because if ln T has a normal distribution, then T has a log-normal distribution.

Distribution of ε        Distribution of T   PDF of T
extreme value (2 par.)   Weibull             (a/b)((x − c)/b)^(a−1) e^(−((x − c)/b)^a)         x ≥ c; a, b > 0
extreme value (1 par.)   exponential         (1/b) e^(−(x − c)/b)                              x ≥ c; b > 0
log-gamma                gamma               ((x − c)/b)^(a−1) e^(−(x − c)/b) / (b Γ(a))       x ≥ c; a, b > 0
logistic                 log-logistic        (a/b)((x − c)/b)^(a−1) / [1 + ((x − c)/b)^a]²     x ≥ c; a, b > 0
normal                   log-normal          e^(−(ln((x − c)/m))²/(2a²)) / ((x − c) a √(2π))   x ≥ c; a > 0

Table 2.2: Alternatives for the distributions of ε and their corresponding distributions of T. Legend: a is the shape parameter, b is the scale parameter and c is the location parameter.

The main reason for using such alternatives is that they have different implications for the hazard functions, which may lead to different substantive interpretations. The parameters of such models are estimated by the maximum likelihood method (see Section 2.4.2).

Recently, the parametric regression models have been eclipsed by the semiparametric approach built around the famous Cox regression model, which is why this thesis will focus on that promising method.

2.4.2 Semiparametric Regression Models

The semiparametric regression models refer to the method first proposed in 1972 by the British statistician Sir David Cox in his famous paper "Regression Models and Life Tables" [Cox, 1972]. It is difficult to exaggerate the impact of this paper. In the 1992 Science Citation Index, it was cited over 800 times, making it the most highly cited journal article in the entire literature of statistics. In fact, [Garfield, 1990] reported that its cumulative citation count placed it among the top 100 papers in all of science. This enormous popularity can be explained by the fact that, unlike the parametric methods, Cox's method does not require the selection of some particular probability distribution to represent survival times. For this reason, the method is called semiparametric. Moreover, this method makes it relatively easy to incorporate time-dependent covariates⁴.

2.4.2.1 The Proportional Hazards Model

In his 1972 paper, Cox made two significant innovations. First, he proposed a model that is standardly referred to as the proportional hazards model⁵. Second, he proposed a new estimation
method that was later named maximum partial likelihood. The term Cox regression refers to the combination of the model and the estimation method.

⁴ Time-dependent covariates are covariates whose value may change over the course of the observation period.
⁵ It is important to mention that the model proposed by Cox can be generalized to allow for nonproportional hazards.

Model Let's start with the basic model that does not include time-dependent covariates or nonproportional hazards. The model is usually written as

    hi(t) = λ0(t) exp(β1 x1i + · · · + βn xni)    (2.14)

This equation says that the hazard for individual i at time t is the product of two factors:

• a baseline hazard function λ0(t) that is left unspecified, except that it cannot be negative;
• a linear function of a set of n fixed covariates, which is exponentiated.

The function λ0(t) can be regarded as the hazard function for an individual whose covariates all have values of zero. Taking the logarithm of both sides of (2.14), we can rewrite the model as

    ln hi(t) = α(t) + β1 x1i + · · · + βn xni    (2.15)

where α(t) = ln λ0(t). If we further specify α(t) = α, we get the exponential model with covariates (2.8). If we specify α(t) = αt, we get the Gompertz model. Finally, if we specify α(t) = α ln t, we have the Weibull model (see Section 2.2.4). As we will see, however, the great attraction of Cox regression is that such choices are unnecessary: the function α(t) can take any form whatever.

This model is called the proportional hazards model because the hazard for any individual is a fixed proportion of the hazard for any other individual. This can be shown by taking the ratio of the hazards for two individuals i and j, with i, j ∈ {1, . . . , N}, and applying (2.14):

    hi(t)/hj(t) = exp{β1(x1i − x1j) + · · · + βn(xni − xnj)}    (2.16)

We can see in (2.16) that λ0(t) cancels out of the numerator and denominator. As a result, the ratio of the hazards for any two individuals is constant over time.
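This cancellation is easy to verify numerically. In the Python sketch below, the baseline hazard and the coefficients are arbitrary choices made only for the demonstration; whatever λ0(t) is used, the hazard ratio (2.16) stays constant in t:

```python
import math

def baseline(t):
    """An arbitrary nonnegative baseline hazard lambda_0(t) (illustrative only)."""
    return 0.5 * t ** 0.5

beta = [0.8, -0.3]  # illustrative coefficients for two covariates

def hazard(t, x):
    """Proportional hazards model (2.14): h_i(t) = lambda_0(t) exp(beta . x_i)."""
    return baseline(t) * math.exp(sum(b * v for b, v in zip(beta, x)))

x_i, x_j = [1.0, 2.0], [0.0, 1.0]
ratios = [hazard(t, x_i) / hazard(t, x_j) for t in (0.5, 1.0, 5.0, 20.0)]

# Equation (2.16): the ratio depends only on the covariate differences.
expected = math.exp(beta[0] * (x_i[0] - x_j[0]) + beta[1] * (x_i[1] - x_j[1]))
assert all(math.isclose(r, expected) for r in ratios)
```

Replacing baseline with any other nonnegative function leaves the assertion true, which is exactly the point of (2.16).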
If we graph the log hazards for any two individuals, the proportional hazards property implies that the hazard functions should be strictly parallel, as depicted in Figure 2.8.

Figure 2.8: Parallel hazard functions from the proportional hazards model.

Estimation Fitting the proportional hazards model given in (2.14) to an observed set of survival data entails estimating the unknown coefficients β1, β2, . . . , βn of the covariates X1, X2, . . . , Xn in the linear component of the model. The baseline hazard function λ0(t) may also need to be estimated. It turns out that these two components of the model can be estimated separately. The β's are estimated first, and these estimates are then used to construct an estimate of the baseline hazard function (see [Collett, 2003] for details about the estimation of the baseline hazard function). This is an important result, since it means that in order to make inferences about the effect of the n covariates X1, X2, . . . , Xn on the relative hazard hi(t)/λ0(t), we do not need an estimate of λ0(t).

Since the estimation of the β's does not take the baseline hazard function into account, the resulting estimates are not fully efficient. This means that their standard errors are larger than they would be with the entire likelihood function. However, the loss of efficiency is quite small in most cases [Efron, 1977]. In return, the estimates have good properties regardless of the actual shape of the baseline hazard function. Partial likelihood estimates still have two of the three standard properties of maximum likelihood estimates: they are consistent and asymptotically normal⁶ [Cox, 1972]. Another interesting property of partial likelihood estimates is that they depend only on the ranks of the event times, not on their numerical values. This implies that any monotone transformation of the event times will leave the coefficient estimates unchanged.
Using the same notation as before, we have N independent individuals (i ∈ {1, . . . , N}). For each individual i, the data consist of three parts: tᵢ, δᵢ and xᵢ, where tᵢ is the time of the event or the time of censoring, δᵢ is an indicator variable with a value of 1 if tᵢ is uncensored or a value of 0 if tᵢ is censored, and xᵢ = [x₁ᵢ, x₂ᵢ, . . . , xₙᵢ] is a vector of n covariate values.

An ordinary likelihood function is typically written as a product of the likelihoods for all the individuals in the sample. The partial likelihood, on the other hand, can be written as a product of the likelihoods for all the events that are observed. So we can write

PL = ∏ᵢ₌₁ᴺ Lᵢ   (2.17)

where Lᵢ is the likelihood for the ith event. Next we need to know how the individual Lᵢ are constructed. This is best explained by way of an example. Consider the data in Table 2.1, to which we add a column for a covariate X. The covariate X has a value of 1 if the tumor had a positive marker for distant metastasis, 0 otherwise (see Table 2.3).

The first event occurred to patient 1 in month 5. To construct the partial likelihood L₁ for this event, we ask the following question: “Given that an event occurred in month 5, what is the probability that it happened to patient 1 rather than to any other patient?”. The answer is the hazard for patient 1 at month 5 divided by the sum of the hazards for all the patients who were at risk in that same month:

L₁ = h₁(5) / (h₁(5) + h₂(5) + · · · + h₅(5))   (2.18)

⁶ Partial likelihood estimates are approximately unbiased and their sampling distribution is approximately normal in large samples.

Patient id | Survival time (in months) | Event | X
1 | 5 | 1 | 1
2 | 8 | 1 | 1
3 | 10 | 0 | 1
4 | 13 | 1 | 0
5 | 18 | 0 | 0

Table 2.3: Survival times for breast cancer patients with the covariate X.

While this expression has considerable intuitive appeal, the derivation is actually rather involved and will not be presented here (see [Collett, 2003] for details).
The second event occurred to patient 2 in month 8. Patient 1 is no longer at risk because he has already had an event. So L₂ has the same form as L₁, but the hazard for patient 1 is removed from the denominator:

L₂ = h₂(8) / (h₂(8) + · · · + h₅(8))   (2.19)

The set of all individuals who are at risk at a given point in time is often referred to as the risk set. At month 8, the risk set consists of patients 2 through 5, inclusive. We continue in this way for each successive event in order to construct each individual Lᵢ. The general form is

Lᵢ = [ exp(βxᵢ) / Σⱼ₌₁ᴺ yᵢⱼ exp(βxⱼ) ]^δᵢ   (2.20)

where yᵢⱼ = 1 if tⱼ ≥ tᵢ and yᵢⱼ = 0 if tⱼ < tᵢ (the y's are just a convenient mechanism for excluding from the denominator those individuals who have already experienced the event and are not part of the risk set). Moreover, the censored observations are excluded because δᵢ = 0 for those cases. This expression is not valid for tied event times, but it does allow for ties between an event time and one or more censoring times. A general expression for the partial likelihood for data with fixed covariates from a proportional hazards model is

PL = ∏ᵢ₌₁ᴺ [ exp(βxᵢ) / Σⱼ₌₁ᴺ yᵢⱼ exp(βxⱼ) ]^δᵢ   (2.21)

Once the partial likelihood is constructed, it can be maximized with respect to β just like an ordinary likelihood function. It is convenient to maximize the logarithm of the likelihood, which is

ln PL = Σᵢ₌₁ᴺ δᵢ [ βxᵢ − ln ( Σⱼ₌₁ᴺ yᵢⱼ exp(βxⱼ) ) ]   (2.22)

Most partial likelihood programs use some version of the Newton-Raphson algorithm [Collett, 2003] to maximize this function with respect to β. The formulas for computing the standard error of the estimated parameter β̂ are given in Appendix A of [Collett, 2003]. These standard errors can be used to obtain confidence intervals for the β's.
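As a minimal sketch, the log partial likelihood (2.22) can be evaluated directly on the five patients of Table 2.3 (single covariate, no tied event times). The sketch also illustrates the rank-invariance property mentioned above: squaring all survival times, a monotone transformation, leaves the log partial likelihood unchanged.

```python
import math

# Data from Table 2.3: (time, event indicator delta, covariate x).
patients = [(5, 1, 1), (8, 1, 1), (10, 0, 1), (13, 1, 0), (18, 0, 0)]

def log_partial_likelihood(beta, data):
    """Log partial likelihood (2.22) for a single covariate (no tied event times)."""
    ll = 0.0
    for t_i, delta_i, x_i in data:
        if delta_i == 0:
            continue  # censored cases contribute a factor of 1 (delta_i = 0)
        # Risk set of the event at t_i: all individuals with t_j >= t_i.
        denom = sum(math.exp(beta * x_j) for t_j, _, x_j in data if t_j >= t_i)
        ll += beta * x_i - math.log(denom)
    return ll

# At beta = 0 every subject in a risk set is equally likely to fail, so for the
# three observed events ln PL = -(ln 5 + ln 4 + ln 2) = -ln 40.
assert abs(log_partial_likelihood(0.0, patients) + math.log(40)) < 1e-12

# The partial likelihood depends only on the ranks of the event times:
# any monotone transformation (here, squaring) leaves it unchanged.
squared = [(t * t, d, x) for t, d, x in patients]
assert log_partial_likelihood(1.3, patients) == log_partial_likelihood(1.3, squared)
```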
In particular, under the assumption that the estimated parameters β̂ follow a normal distribution, a 100(1 − α)% confidence interval for a parameter β is the interval with limits β̂ ± z₍α/2₎ se(β̂), where z₍α/2₎ is the upper α/2-point of the standard normal distribution.

Normalization of the Loglikelihood We introduce a normalization for the loglikelihood. Because the loglikelihood computed using (2.22) is directly proportional to the number of events, the normalized loglikelihood is simply the loglikelihood divided by the number of events in a dataset:

(normalized) ln PL = (1 / Σᵢ₌₁ᴺ δᵢ) Σᵢ₌₁ᴺ δᵢ [ βxᵢ − ln ( Σⱼ₌₁ᴺ yᵢⱼ exp(βxⱼ) ) ]   (2.23)

Such a normalization is useful when we want to compare the loglikelihood of a model tested on different datasets. Indeed, these datasets may not contain the same number of events, and the scales of the corresponding likelihoods may then be very different. We will see in Section 5.1.4 the utility of this normalization.

2.4.2.2 Hypothesis Test

There exist three hypothesis tests for the null hypothesis H₀ : β = β⁽⁰⁾, where β⁽⁰⁾ is the initial value for β̂, the coefficients estimated by the Cox model. Only the Wald and the likelihood ratio tests are described in this section⁷.

• The Wald test statistic is (β̂ − β⁽⁰⁾)′ Î (β̂ − β⁽⁰⁾), where Î = I(β̂) is the estimated information matrix⁸ at the solution. For a single variable, this reduces to the usual z-statistic β̂/se(β̂).

• The likelihood ratio test statistic is 2(l(β̂) − l(β⁽⁰⁾)), where l is the log partial likelihood at the final and initial estimates of β̂.

Under the null hypothesis, both the Wald and the likelihood ratio statistics follow a chi-square distribution on p degrees of freedom, where p is the number of coefficients. They are asymptotically equivalent, but in finite samples they may differ. The likelihood ratio test is generally considered to be more reliable than the Wald test.
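For a single coefficient, the likelihood ratio test is a short computation once the two log partial likelihoods are in hand. The values below are hypothetical (not from the thesis data); the only real ingredient is the identity that, for one degree of freedom, the chi-square upper tail probability equals erfc(√(x/2)).

```python
import math

# Hypothetical log partial likelihoods from a Cox fit (illustrative only):
# at the maximum partial likelihood estimate, and at the null value beta0 = 0.
l_beta_hat, l_beta0 = -185.2, -189.9

# Likelihood ratio statistic: 2 (l(beta_hat) - l(beta0)), chi-square with 1 DF.
lr_stat = 2 * (l_beta_hat - l_beta0)

# For 1 DF the chi-square upper tail is P(Z^2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2))
```

Here lr_stat = 9.4 and the p-value is about 0.002, so the null hypothesis β = 0 would be rejected at the usual 5% level.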
Such tests allow us to assess the likelihood that a coefficient or a set of coefficients in a Cox model is different from its initial value (typically 0).

We provide some additional topics on semiparametric regression models in Appendix A. These concern the treatment of tied data, time-dependent covariates, nonproportional hazards and the estimation of the survivor functions.

2.5 Testing for Differences in Survivor Functions

If a treatment has been applied to one group but not another, the obvious question to ask is “Did the treatment make a difference in the survival experience of the two groups?”. Since the survivor function gives a complete accounting of the survival experience of each group, a natural approach to answering this question is to test the null hypothesis that the survivor functions are the same in the two groups: S₁(t) = S₂(t) ∀t > 0, where the subscripts distinguish the two groups. There exist three alternative statistics for testing this null hypothesis: the logrank test (also known as the Mantel-Haenszel test), the Wilcoxon test and the hazard ratio.

⁷ Details about the third hypothesis test, the score test, are given in [Therneau and Grambsch, 2000].
⁸ The information matrix is the negative second derivative of the log partial likelihood with respect to β. Details are given in [Therneau and Grambsch, 2000].

2.5.1 Logrank Test

Suppose that there are r distinct event times, t₁ < t₂ < · · · < tᵣ, across the two groups, and that at time tⱼ, d₁ⱼ individuals in group 1 and d₂ⱼ individuals in group 2 have an event occurrence, for j = 1, 2, . . . , r. Suppose further that there are n₁ⱼ individuals at risk of event occurrence in the first group just before time tⱼ, and n₂ⱼ at risk in the second group. Consequently, at time tⱼ, there are dⱼ = d₁ⱼ + d₂ⱼ event occurrences in total out of nⱼ = n₁ⱼ + n₂ⱼ individuals at risk. The situation is summarized in Table 2.4.
Group | Number of events at tⱼ | Number surviving beyond tⱼ | Number at risk just before tⱼ
1 | d₁ⱼ | n₁ⱼ − d₁ⱼ | n₁ⱼ
2 | d₂ⱼ | n₂ⱼ − d₂ⱼ | n₂ⱼ
Total | dⱼ | nⱼ − dⱼ | nⱼ

Table 2.4: Number of events at the jth event time in each of the two groups of individuals.

Each statistic can be written as a function of the deviations of the observed numbers of events from the expected numbers. If the null hypothesis that survival is independent of group is true, we can regard d₁ⱼ, the number of events at tⱼ in group 1, as the realization of a random variable D₁ⱼ, which can take any value in the range from 0 to min(dⱼ, n₁ⱼ). In fact, D₁ⱼ has a distribution known as the hypergeometric distribution [Droesbeke, 1988], according to which the probability that D₁ⱼ takes the value d₁ⱼ is

P(D₁ⱼ = d₁ⱼ) = C(dⱼ, d₁ⱼ) C(nⱼ − dⱼ, n₁ⱼ − d₁ⱼ) / C(nⱼ, n₁ⱼ)   (2.24)

where C(n, k) denotes the binomial coefficient “n choose k”. The mean of the hypergeometric random variable D₁ⱼ is given by

e₁ⱼ = n₁ⱼ dⱼ / nⱼ   (2.25)

so that e₁ⱼ is the expected number of individuals who have an event at time tⱼ in group 1. For group 1, the logrank statistic can be written as

U_L = Σⱼ₌₁ʳ (d₁ⱼ − e₁ⱼ)   (2.26)

Since the event times are independent of one another, the variance of (2.26) is simply the sum of the variances of the D₁ⱼ. Since D₁ⱼ has a hypergeometric distribution, its variance is given by

var(D₁ⱼ) = n₁ⱼ (nⱼ − n₁ⱼ) dⱼ (nⱼ − dⱼ) / (nⱼ² (nⱼ − 1))   (2.27)

so that the variance of U_L is

var(U_L) = Σⱼ₌₁ʳ var(D₁ⱼ) = V_L   (2.28)

Furthermore, it can be shown that U_L has an approximate normal distribution when the number of event times is not too small [Droesbeke, 1988]. It then follows that U_L/√V_L has a normal distribution with zero mean and unit variance. The square of a standard normal random variable has a chi-squared distribution on one degree of freedom (DF), denoted χ²₁, so we have

U_L² / V_L ∼ χ²₁   (2.29)

The p-value of the logrank test is calculated from this chi-square statistic using a chi-square distribution with one DF.
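The whole construction (2.25)–(2.29) fits in a few lines of code. The dataset below is made up for illustration (it is not one of the thesis populations): each record is a (time, event indicator, group) triple, and the function accumulates U_L and V_L over the distinct event times.

```python
# Illustrative right-censored data: (time, event, group), group in {1, 2}.
data = [(5, 1, 1), (8, 1, 1), (10, 0, 1), (12, 1, 1),
        (9, 1, 2), (13, 1, 2), (18, 0, 2), (23, 1, 2)]

def logrank_chi2(data):
    """Logrank statistic U_L^2 / V_L built from (2.25)-(2.29)."""
    event_times = sorted({t for t, d, _ in data if d == 1})
    u_l, v_l = 0.0, 0.0
    for t in event_times:
        n1 = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk, group 1
        n = sum(1 for ti, _, _ in data if ti >= t)              # at risk, total
        d1 = sum(1 for ti, d, g in data if ti == t and d == 1 and g == 1)
        d = sum(1 for ti, d, _ in data if ti == t and d == 1)
        u_l += d1 - n1 * d / n                                  # d1j - e1j, (2.25)-(2.26)
        if n > 1:
            v_l += n1 * (n - n1) * d * (n - d) / (n**2 * (n - 1))  # (2.27)-(2.28)
    return u_l**2 / v_l                                         # (2.29)

chi2 = logrank_chi2(data)
```

For this toy dataset the statistic is about 2.45, below the 5% critical value of χ²₁ (3.84), so the two survivor functions would not be declared different.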
2.5.2 Wilcoxon Test

The Wilcoxon statistic, given by

U_W = Σⱼ₌₁ʳ nⱼ (d₁ⱼ − e₁ⱼ)   (2.30)

differs from the logrank statistic only by the presence of nⱼ, the total number at risk at each time point. Thus, it is a weighted sum of the deviations of the observed numbers of events from the expected numbers of events. As with the logrank statistic, the chi-square test is calculated by squaring the Wilcoxon statistic for either group and dividing by the estimated variance (see [Collett, 2003] for details).

Since the Wilcoxon test gives more weight to early times than to late times (nⱼ always decreases), it is less sensitive than the logrank test to differences between groups that occur at later points in time. Although both statistics test the same null hypothesis, they differ in their sensitivity to various kinds of departures from that hypothesis. In particular, the logrank test is more powerful for detecting differences of the form

S₁(t) = [S₂(t)]^γ

where γ is some positive number other than 1. This equation defines a proportional hazards model, which is discussed in detail in Section 2.4 (the logrank test is closely related to tests for differences between two groups that are performed within the framework of Cox's proportional hazards model). In contrast, the Wilcoxon test is more powerful than the logrank test in situations where event times have log-normal distributions with a common variance but with different means in the two groups. Neither test is particularly good when the survival curves cross [Allison, 1995].

The Wilcoxon and logrank tests readily generalize to three or more groups, with the null hypothesis that all groups have the same survivor function. If the null hypothesis is true, all the test statistics have chi-square distributions with DF equal to the number of groups minus 1.
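The relationship between the two numerators is easy to see in code: U_W is the same stream of deviations d₁ⱼ − e₁ⱼ as U_L, only weighted by nⱼ. The sketch below reuses the same made-up dataset as above and computes only the numerators (2.26) and (2.30); the variance needed for the chi-square test is omitted.

```python
# Illustrative right-censored data: (time, event, group), group in {1, 2}.
data = [(5, 1, 1), (8, 1, 1), (10, 0, 1), (12, 1, 1),
        (9, 1, 2), (13, 1, 2), (18, 0, 2), (23, 1, 2)]

def deviations(data):
    """Yield (n_j, d1j - e1j) for each distinct event time t_j."""
    for t in sorted({t for t, d, _ in data if d == 1}):
        n1 = sum(1 for ti, _, g in data if ti >= t and g == 1)
        n = sum(1 for ti, _, _ in data if ti >= t)
        d1 = sum(1 for ti, d, g in data if ti == t and d == 1 and g == 1)
        d = sum(1 for ti, d, _ in data if ti == t and d == 1)
        yield n, d1 - n1 * d / n

u_logrank = sum(dev for _, dev in deviations(data))       # (2.26)
u_wilcoxon = sum(n * dev for n, dev in deviations(data))  # (2.30)
```

Because the weights nⱼ shrink over time, early deviations dominate u_wilcoxon, which is exactly the "more weight to early times" behavior described above.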
2.5.3 Hazard Ratio

The hazard ratio is a summary of the difference between two survival curves, representing the reduction in the risk of event between two different conditions. It is a form of relative risk. The proportional hazards regression model assumes that the relative risk of event between the two conditions is constant over time.

Let G be an indicator variable that takes the value zero if an individual is in the first condition (e.g. the low-risk group) and unity if an individual is in the second condition (e.g. the high-risk group). If gᵢ is the value of G for the ith individual in the study, i ∈ {1, . . . , N}, the hazard function for this individual can be written as

hᵢ(t) = λ₀(t) exp(βgᵢ)   (2.31)

where gᵢ = 1 if the ith individual is in the second condition and zero otherwise. Because of the coding of the indicator variable G, λ₀(t) is the hazard function for an individual in the first condition. Moreover, the hazard function for any individual in the second condition is ψλ₀(t) (proportional hazards). ψ is the relative hazard or hazard ratio, with

ψ = exp(β)

This is the proportional hazards model for the comparison of two groups. In this thesis, the indicator variable G is unity for the high-risk group and zero for the low-risk group, so the hazard ratio makes it possible to assess whether the risk in the high-risk group is higher than in the low-risk group.

Confidence Interval Once the parameter β is estimated, giving β̂, the corresponding estimate of the hazard ratio is ψ̂ = exp(β̂), and the standard error of ψ̂ can be obtained from the standard error of β̂ (see Section 2.4.2.1):

se(ψ̂) = ψ̂ se(β̂)   (2.32)

A 100(1 − α)% confidence interval for the true hazard ratio ψ can be obtained by exponentiating the confidence limits for β, because the distribution of the logarithm of the estimated hazard ratio is more closely approximated by a normal distribution than that of the hazard ratio itself [Collett, 2003].
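Given β̂ and se(β̂) from a Cox fit, the hazard ratio and its interval are a one-liner each. The numbers fed in below are hypothetical stand-ins for a fitted high-risk vs. low-risk indicator; the code implements (2.32) and the exponentiated-limits interval described above, with z = 1.96 for a 95% interval.

```python
import math

def hazard_ratio_ci(beta_hat, se_beta, z=1.96):
    """Hazard ratio psi = exp(beta_hat) with a confidence interval obtained by
    exponentiating the confidence limits beta_hat +/- z * se(beta_hat)."""
    psi = math.exp(beta_hat)
    lower = math.exp(beta_hat - z * se_beta)
    upper = math.exp(beta_hat + z * se_beta)
    se_psi = psi * se_beta  # standard error of psi-hat, equation (2.32)
    return psi, se_psi, (lower, upper)

# Hypothetical Cox fit for the high-risk vs. low-risk indicator G.
psi, se_psi, (lo, hi) = hazard_ratio_ci(beta_hat=0.7, se_beta=0.25)
```

Here ψ̂ ≈ 2.01 with a 95% interval of roughly (1.23, 3.29); since the interval excludes 1, the high-risk group has a significantly higher hazard in this made-up example.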
2.6 Feature Selection

When the reviews on relevance of [Blum and Langley, 1997; Kohavi and John, 1997], which covered several papers on variable and feature selection, were published, few studies used more than 40 features. The situation has changed considerably in the past few years, and recent papers explore domains with hundreds to tens of thousands of variables or features. New techniques are proposed to address these challenging tasks, which involve many irrelevant and redundant variables and often comparably few training examples. Survival analysis of microarray data is such a new field, with several thousands of genes for several hundreds of samples. Two characteristics of microarray data highlight the utility of feature selection:

• High feature/sample ratio: microarray-based high-throughput technology generates a huge number of potential predictors (i.e. probes), while the number of patients or cell lines in a study is usually very small compared to the number of probes. Modeling such high-dimensional data is complex, and the problem becomes more difficult when the phenotypes, such as time to death or time to cancer recurrence, are subject to right-censoring. Additionally, microarray data often contain a great deal of noise. Because of the very high dimensional space of the predictors, the standard maximum Cox partial likelihood method cannot be applied directly to obtain the parameter estimates. Moreover, from a biological point of view, one should expect that only a small subset of the genes is relevant to predicting the phenotypes. Including all the genes in the predictive model increases its variance and is expected to lead to poor predictive performance.

• Highly correlated features: in microarray experiments, the expression levels of many probes may be highly correlated. This characteristic is explained by the co-regulation of many genes.
Indeed, similar patterns in gene expression profiles usually suggest relationships between the genes [Yu et al., 2003]; equivalently, genes targeted by the same transcription factors tend to show similar expression patterns.

There are many potential benefits of variable and feature selection: facilitating data visualization and data understanding, reducing the measurement and storage requirements, reducing training and utilization times, and defying the curse of dimensionality to improve prediction performance [Guyon and Elisseeff, 2003]. Different methods put more emphasis on one aspect or another. Some focus mainly on constructing and selecting subsets of features that are useful to build a good predictor. This contrasts with the problem of finding or ranking all potentially relevant variables. Selecting the most relevant variables is usually suboptimal for building a predictor, particularly if the variables are redundant. Conversely, a subset of useful variables may exclude many redundant, but relevant, variables. For a discussion of relevance versus usefulness and definitions of the various notions of relevance, see the review articles [Blum and Langley, 1997; Kohavi and John, 1997].

Three aspects of feature selection will be tackled: filters, which select variables by ranking them according to some statistic; subset selection methods, including wrapper/embedded methods, which assess subsets of variables according to their usefulness to a given predictor; and feature construction, which aims to increase predictor performance by building more compact feature subsets.

2.6.1 Variable Ranking

Many variable selection algorithms include variable ranking as a principal or auxiliary selection mechanism because of its simplicity, scalability, and good empirical success.
In many microarray studies [van't Veer et al., 2002; Chang et al., 2003; Jansen et al., 2005; Shipp et al., 2002], gene ranking is a common method to select the most promising genes to build a classifier and/or to be the subject of further biological experiments. According to the design of the survival analysis, the ranking criterion is defined for individual variables, independently of the context of the others.

Consider a set of N samples (xᵢ, yᵢ) (i ∈ {1, . . . , N}) consisting of input values xᵢⱼ (j ∈ {1, . . . , n}) and output value yᵢ. Variable ranking makes use of a scoring function S(j) associated with each input variable and computed from the values xᵢⱼ and yᵢ. By convention, we assume that a high score is indicative of a valuable variable and that we sort variables in decreasing order of S(j). To use variable ranking to build predictors, nested subsets incorporating progressively more and more variables of decreasing relevance are defined. A cross-validation procedure can be used to assess the optimal number of features [Vittinghof et al., 2005].

Following the classification of [Kohavi and John, 1997], variable ranking is a filter method: it is a preprocessing step, independent of the choice of the predictor. In practice, however, the scoring function is selected for its usefulness for a classifier that would use the most relevant input variables. So the choice of the scoring function is not always independent of the choice of the classifier used in further analysis.

Even if variable ranking is not optimal, it may be preferable to other variable subset selection methods because of its computational and statistical scalability. Computationally, it is efficient since it requires only the computation of n scores and the sorting of those scores. Statistically, it is robust against overfitting: although it introduces bias, it may have considerably less variance [Hastie et al., 2001].
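The ranking-then-nesting scheme can be sketched in a few lines. Here the scoring function S(j) is a stand-in, the absolute Pearson correlation with the output, rather than the univariate Cox score used later in the thesis, and the toy data matrix is invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between one variable's values and the output."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = math.sqrt(sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys))
    return num / den

# Toy data: 4 samples (rows), 3 variables (columns), one output per sample.
X = [[1.0, 5.0, 0.2], [2.0, 3.0, 0.1], [3.0, 6.0, 0.4], [4.0, 2.0, 0.3]]
y = [1.0, 2.0, 3.0, 4.0]

# S(j) for each variable, then variables sorted in decreasing order of S(j).
scores = {j: abs(pearson([row[j] for row in X], y)) for j in range(3)}
ranking = sorted(scores, key=scores.get, reverse=True)

# Nested subsets: top 1 variable, top 2 variables, ...
nested = [ranking[:k] for k in range(1, len(ranking) + 1)]
```

In a real analysis, each nested subset would then be fed to a predictor and the subset size chosen by cross-validation, as described above.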
Three examples that outline the usefulness and the limitations of variable ranking techniques are given in [Guyon and Elisseeff, 2003].

2.6.2 Variable Subset Selection

Variable subset selection methods allow the selection of subsets of variables that together have good predictive power, as opposed to variable ranking methods, which rank the variables according to their individual predictive power. We will focus on wrappers, which utilize the learning machine of interest as a black box to score subsets of variables according to their predictive power. Embedded methods, which perform variable selection in the process of training and are usually specific to given learning machines, are considered a promising approach for future work (see Section 6.1).

2.6.2.1 Wrappers and Embedded Methods

The wrapper methodology, recently popularized by [Kohavi and John, 1997], offers a simple and powerful way to address the problem of variable selection, regardless of the chosen learning machine. The learning machine is considered as a black box, so the method is applicable to any learning algorithm, including off-the-shelf machine learning software packages. In its most general formulation, the wrapper methodology consists in using the prediction performance of a given learning machine to assess the relative usefulness of subsets of variables. In practice, one needs to define: (i) how to search the space of all possible variable subsets; (ii) how to assess the prediction performance of a learning machine to guide the search and halt it; and (iii) which learning machine to use. An exhaustive search can conceivably be performed if the number of variables is not too large, but the problem is known to be NP-hard [Amaldi and Kann, 1998] and the search quickly becomes computationally intractable. A wide range of search strategies can be used instead, including best-first, branch-and-bound, simulated annealing and genetic algorithms (see [Kohavi and John, 1997] for a review).
Performance assessments are usually done using a validation set or by cross-validation. Wrappers are often criticized as a “brute force” method requiring massive amounts of computation, but this is not necessarily the case: efficient search strategies may be devised, and using them does not necessarily mean sacrificing prediction performance. In fact, the opposite holds in some cases: greedy search strategies seem to be particularly computationally advantageous and robust against overfitting⁹ [Guyon and Elisseeff, 2003]. Among such search strategies, we can mention two common methods: forward selection and backward elimination. In forward selection, variables are progressively incorporated into larger and larger subsets, whereas in backward elimination one starts with the set of all variables and progressively eliminates the least promising ones. Both methods yield nested subsets of variables.

⁹ The name “greedy” comes from the fact that one never revisits former decisions to include (or exclude) variables in light of new decisions.

By using the learning machine as a black box, wrappers are remarkably universal and simple. But embedded methods that incorporate variable selection as part of the training process may be more efficient in several respects: they make better use of the available data by not needing to split the training data into training and validation sets, and they reach a solution faster by avoiding retraining a predictor from scratch for every variable subset investigated. Recent articles highlight the promising results of embedded methods in Cox regression, such as the LASSO procedure for Cox regression [Tibshirani, 1997] and penalized Cox regression [Gui and Li, 2004].
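The greedy forward selection strategy described above can be sketched generically: the learner stays a black box behind a `score(subset)` callable, which in practice would return cross-validated prediction performance. Everything below, including the toy scoring function, is a hypothetical illustration rather than the thesis's actual procedure.

```python
def forward_selection(n_vars, score, max_size=None):
    """Greedy forward selection: at each step add the variable that most
    improves score(subset); stop when no addition improves it."""
    selected, best = [], float("-inf")
    max_size = max_size or n_vars
    while len(selected) < max_size:
        candidates = [j for j in range(n_vars) if j not in selected]
        if not candidates:
            break
        step = [(score(selected + [j]), j) for j in candidates]
        top_score, top_j = max(step)
        if top_score <= best:
            break  # greedy: former decisions are never revisited
        selected, best = selected + [top_j], top_score
    return selected

# Toy black-box score: variables 0 and 2 are useful, every extra variable
# pays a small complexity penalty (stand-in for a cross-validation estimate).
toy = lambda s: sum(1.0 for j in s if j in (0, 2)) - 0.1 * len(s)
subset = forward_selection(5, toy)
```

On this toy score the search picks exactly the two useful variables and then halts, since adding any third variable only incurs the penalty.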
2.6.3 Feature Construction and Space Dimensionality Reduction

In some applications, reducing the dimensionality of the data by selecting a subset of the original variables may be advantageous, for reasons including the expense of making, storing and processing measurements. If these considerations are not of concern, other means of space dimensionality reduction should also be considered.

The art of machine learning starts with the design of appropriate data representations. Better performance is often achieved using features derived from the original input. Building a feature representation is an opportunity to incorporate domain knowledge into the data and can be very application specific. Nonetheless, there are a number of generic feature construction methods, including clustering and basic linear transforms of the input variables (e.g. PCA/SVD, LDA); see [Dudoit et al., 2002] for a comparison of such methods.

Clustering has long been used for feature construction [Hartigan, 1975]. The idea is to replace a group of similar variables by a cluster centroid, which becomes a feature. The most popular algorithms include K-means and hierarchical clustering (see [Duda et al., 2001] for a review). Clustering is usually associated with the idea of unsupervised learning (no use of any additional information, such as demographic data), but it can be useful to introduce some supervision into the clustering procedure to obtain more discriminant features. This is the idea of semi-supervised clustering [Bair and Tibshirani, 2004]: first we rank the variables (see Section 2.6.1) using a supervised method (such as a Student t-test or a univariate Cox model), then we perform unsupervised hierarchical clustering in order to cluster similar variables, and finally we use these clusters to construct new features.

2.6.3.1 Hierarchical Clustering

Hierarchical clustering is one of the most common clustering methods [Hartigan, 1975].
[Eisen et al., 1998] introduced this method for the analysis of microarray data, organizing genes in a hierarchical tree structure (dendrogram) based on their degree of similarity. The basic idea is to assemble a set of items¹⁰ into a binary tree, where items are joined by very short branches if they are very similar to each other, and by increasingly longer branches as their similarity decreases. A small example of such a tree is given in Figure 2.9.

¹⁰ An item refers to a variable or a set of variables.

Figure 2.9: Example of a tree computed by hierarchical clustering of six variables. Variable 4 and variable 2 are highly similar (joined by short branches); the same holds for variable 3 and variable 6, but with a smaller similarity between them, etc.

Hierarchical clustering uses an agglomerative process consisting of repeated cycles in which the two closest remaining items (those with the highest similarity) are joined by a node/branch of a tree, with the length of the branch set to the similarity between the joined items. The two joined items are removed from the list of items being processed and replaced by an item that represents the new branch. The similarities between this new item and all other remaining items are computed, and the process is repeated until only one item remains. In order to apply this algorithm, we have to choose a metric of similarity and a way to compute the similarity between two items (called the linkage).

Metric of Similarity In this thesis, the uncentered Pearson correlation (sometimes called angular distance) is used as the metric of similarity.
The Pearson correlation r and the uncentered Pearson correlation rᵤ between two vectors xᵢ₁ and xᵢ₂ are given by (2.33) and (2.34) respectively:

r(xᵢ₁, xᵢ₂) = Σⱼ₌₁ⁿ (xᵢ₁ⱼ − x̄ᵢ₁)(xᵢ₂ⱼ − x̄ᵢ₂) / √( Σⱼ₌₁ⁿ (xᵢ₁ⱼ − x̄ᵢ₁)² Σⱼ₌₁ⁿ (xᵢ₂ⱼ − x̄ᵢ₂)² )   (2.33)

rᵤ(xᵢ₁, xᵢ₂) = Σⱼ₌₁ⁿ xᵢ₁ⱼ xᵢ₂ⱼ / √( Σⱼ₌₁ⁿ xᵢ₁ⱼ² Σⱼ₌₁ⁿ xᵢ₂ⱼ² )   (2.34)

The Pearson correlation coefficient is always between −1 and 1, with 1 meaning that the two gene expression profiles are identical, 0 meaning they are completely uncorrelated, and −1 meaning they are perfect opposites. The correlation coefficient is invariant under linear transformation of the data [Droesbeke, 1988]. The uncentered version of the Pearson correlation coefficient differs in that there is no centering by subtraction of x̄ᵢ₁ and x̄ᵢ₂ from the expression measurements. So two vectors differing only by an offset have r = 1 but rᵤ ≠ 1. rᵤ is also called the angular distance because it equals the cosine of the angle formed by the vectors xᵢ₁ and xᵢ₂.

Linkage There are a variety of ways to compute the similarity between items that are sets of variables: centroid linkage, single linkage, complete linkage and average linkage, for instance (see [Duda et al., 2001]). In this thesis, complete linkage is used: the similarity between two items c₁ and c₂ is the maximum of all pairwise similarities between the variables contained in c₁ and c₂. Figure 2.10 gives a small example of complete linkage.

Figure 2.10: Example of complete linkage used in hierarchical clustering, with two clusters of variables c₁ and c₂ (circles and squares respectively) and only two samples. The new variable (triangle) is assigned to cluster c₂ because s₂ > s₁ (a large similarity means a small distance between two probesets).

This method is computationally efficient.
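As a small aside, the metric (2.34) and the cluster-to-cluster similarity can be made concrete. The sketch below implements rᵤ and, following the text's definition, takes the largest pairwise similarity between two sets of variables; the offset example shows why rᵤ is sensitive to differences that centered r ignores.

```python
import math

def uncentered_pearson(x1, x2):
    """Uncentered Pearson correlation r_u (2.34): the cosine of the angle
    between the two expression vectors."""
    num = sum(a * b for a, b in zip(x1, x2))
    return num / math.sqrt(sum(a * a for a in x1) * sum(b * b for b in x2))

def linkage_similarity(c1, c2):
    """Similarity between two sets of variables, taken here (as in the text)
    as the largest pairwise similarity between their members."""
    return max(uncentered_pearson(a, b) for a in c1 for b in c2)

# Two profiles differing only by a constant offset: the centered Pearson
# correlation would be exactly 1, but r_u is strictly below 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [v + 10.0 for v in x]
sim = uncentered_pearson(x, y)
```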
Indeed, a hierarchical clustering using complete linkage computes the similarity matrix containing all the pairwise similarities between probesets only once, and uses this information to construct the dendrogram.

Chapter 3 Materials

Contents
3.1 Populations
3.2 Microarray Platform
3.2.1 Affymetrix® Technology

3.1 Populations

Three different populations of patients treated with Tamoxifen are studied in this thesis:

• OXFT: John Radcliffe Hospital (JRH), Oxford, UK (Dr Adrian Harris). 99 patients.
• KIT: Uppsala hospital (Karolinska), Sweden (Dr Jonas Bergh). Hybridized in Singapore (Lance Miller and Edison). 69 patients.
• GUYT: Guys Hospital, UK (Drs Paul Ellis and Cheryl Gillet). 87 patients.

All the patients had ER+ tumors, were under 88 years old and had been treated with Tamoxifen. Some patients had to be discarded because of insufficient follow-up or lack of RNA material. The Microarray Unit of the Institut Jules Bordet (IJB) carried out the hybridizations for the OXFT and GUYT populations. The hybridizations for the KIT population were performed by the laboratory of Lance Miller and Edison in Singapore. For the OXFT and KIT experiments, the hgu133a and hgu133b chips (see Section 3.2.1) were hybridized. The new high-density hgu133plus2 chip has been used for the GUYT population (see Section 3.2.1).

3.2 Microarray Platform

Microarray technology is a powerful tool for genetic research that uses nucleic acid hybridization techniques to evaluate the mRNA expression profile of thousands of genes within a single experiment. The Microarray Unit of the IJB uses the Affymetrix® platform¹, which is a short oligonucleotide platform (see Appendix B for an overview of the different microarray platforms). Affymetrix® devices are shown in Figure 3.1.

¹ http://www.affymetrix.com
Figure 3.1: Affymetrix fluidic station and scanner.

3.2.1 Affymetrix® Technology

Affymetrix® chips are short oligonucleotide (25-mer) arrays fabricated by direct synthesis of oligonucleotides on a silicon surface [Fodor et al., 1991]. Each chip contains from 400,000 up to one million different probes (see Figure 3.2). Since the oligonucleotide probes are synthesized at known locations on the chip, the hybridization pattern and signal intensities can be interpreted in terms of gene identity and relative expression levels by specific software².

Each gene is represented on the chip by a series of different pairs of oligonucleotide probes. Each probe pair consists of a perfect match (called PM) and a mismatch (called MM) oligonucleotide (see Figure 3.3). The perfect match has a sequence exactly complementary to a particular region of the gene, and thus the probeset measures the expression of the gene. The mismatch probe differs from the perfect match probe by a single base substitution at the center base position, disturbing the bonding of the target gene transcript. This helps to determine the background and nonspecific hybridization (also called cross-hybridization) that contributes to the signal measured for the perfect match oligo [Lockhart et al., 1996]. Probes are selected on the basis of current information from GenBank and other nucleotide repositories. The sequences are believed to recognize unique regions of the 3' end of the gene.

The entire design of Affymetrix® microarray experiments is depicted in Figure 3.4. Once the biological material under study is introduced onto the chip, the hybridization process (see Figure 3.5) enables the assessment of the levels of expression of the genes characterized by the probes on the chip (see Figure 3.6).

Data Hierarchy As described previously, there are different levels of data in the Affymetrix® technology:

1. The probes: the low-level measurements.
The probes are grouped in pairs of two short oligonucleotides, the PM and the MM.

Figure 3.2: Affymetrix® GeneChip probe array.

Figure 3.3: Oligonucleotide probe pair (Perfect Match and MisMatch).

Figure 3.4: Design of Affymetrix® microarray experiments.

Figure 3.5: Hybridization process on the Affymetrix® GeneChip array.

Figure 3.6: Measurement of the level of gene expression after the hybridization process on the Affymetrix® GeneChip array.

2. The probesets: one probeset is a set of 11 to 20 probes.

3. The genes: one gene is represented by one or several probesets. The number of probesets depends on the sequence of the gene under study.

An example of such a hierarchy, with a gene represented by two probesets, is depicted in Figure 3.7.

Figure 3.7: Data hierarchy on the Affymetrix® platform.

Affymetrix® Chips Several types of Affymetrix® chips for human samples are available: hgu95a, hgu95b, hgu133a, hgu133b, hgu133plus2, etc. For all the populations except the GUYT population, the samples were hybridized using the hgu133a (22,283 affy ids) and hgu133b (22,645 affy ids) chips. The majority of known genes are on the hgu133a chip, but the hgu133b chip is also used for completeness (entire human genome). For the GUYT population, the new hgu133plus2 chip (54,675 affy ids) is used. hgu133plus2 is the union of the hgu133a and hgu133b chips in a single high-density chip. Its use makes it possible to reduce the cost and duration of the experiments, and the data can be compared with data from the hgu133a and hgu133b chips. Note that the majority of affy ids represent human genes, but some are used for control purposes or represent large regions of transcribed DNA (ESTs).

Chapter 4

Methods
Contents
4.1 Quality Assessment
4.2 Preprocessing Methods
4.2.1 Read Data
4.2.2 Get Expression Measures
4.2.3 Prefiltering
4.3 Feature Selection
4.3.1 Variable Ranking
4.3.2 Feature Construction
4.3.3 Cox Model
4.4 Final Model
4.4.1 Final Cox Model
4.4.2 Cutoff Selection
4.5 Survival Statistics
4.5.1 Hazard Ratio
4.5.2 Logrank Test
4.5.3 Proportion of DMFS
4.5.4 Time-Dependent ROC Curve

A flow-chart of our machine learning methodology is sketched in Figure 4.1. This methodology consists of the following steps:

1. Partition of the dataset into training and test sets. Both sets are independent.

2. Training phase. This phase can be further decomposed into:

Quality Assessment Quality assessment in order to discard microarray experiments that could have failed (see Section 4.1).
Data Preprocessing The microarray data that have fulfilled the quality criteria are preprocessed in order to obtain data comparable across samples (reading of the data and computation of the expression measures) and to remove noninformative probesets (prefiltering) (see Section 4.2).

Figure 4.1: Design of the survival analysis.

Variable Ranking A ranking is performed on the probesets in order to select the most promising ones for the final classifier (see Section 4.3.1).

Feature Construction From this subset of probesets, the features are constructed using a hierarchical clustering. In order to select the best number of clusters, a 10-fold cross-validation is performed to assess the performance of the classifier using such features (see Section 4.3.2).

Final Model Once the best set of features is constructed, a multivariate Cox model is fitted using all the training data to obtain the final model.

Risk Score Computation The final model is used to compute the risk score for each patient in the training set.

Cutoff Selection In order to classify the patients into low and high-risk groups, a cutoff for the risk scores is selected based on survival statistics (see Section 4.4).

3. Test phase.
This phase can be further decomposed into:

Quality Assessment Quality assessment in order to discard microarray experiments that could have failed (see Section 4.1).

Data Preprocessing The microarray data that have fulfilled the quality criteria are preprocessed similarly to the training set.

Risk Score Computation The risk scores are computed for each patient in the test set using the final model fitted on the training set.

Apply Cutoff The same cutoff as selected in the training part is applied to classify the patients into low and high-risk groups.

Survival Statistics Assessment of the performance using the same survival statistics as those used in Section 4.4.

4.1 Quality Assessment

Depending on the microarray technology, specific criteria have been used in the literature to assess the quality of microarray experiments. For the Affymetrix® technology, two different sets of guidelines are commonly used in microarray studies: the Affymetrix® guidelines [Affymetrix, 2002] and the Bioconductor guidelines [Hartmann et al., 2003; Gautier et al., 2004] (see the affy and simpleaffy packages). A review of such guidelines is given in [Haibe-Kains, 2004].

4.2 Preprocessing Methods

We describe in this section the methods used in data preprocessing. This procedure consists in reading the raw microarray data, computing the expression measures, and performing a prefiltering of the probesets.

4.2.1 Read Data

The raw data are read using the functions of the affy package (see the Bioconductor website, http://www.bioconductor.org, for a description of the affy package).

4.2.2 Get Expression Measures

The procedure used to obtain the expression measure of each probeset can be divided into four steps:

1. Background correction (B).
2. Normalization (N).
3. Summarization (S).
4. Population correction (P).

Let x be the raw intensities of a probeset coming from the CEL files of multiple microarray experiments (see Section 4.2.1).
The expression measure sc of this probeset (called the corrected signal) is

sc = P(S(N(B(x)))).

The Robust Multi-array Average procedure (RMA) [Irizarry et al., 2003a], currently one of the most efficient procedures as shown in [Bolstad et al., 2003], performs the first three steps. We introduce the population correction step described in Section 4.2.2.4.

4.2.2.1 Background Correction

Let us define the background as a measurement of signal intensity caused by auto-fluorescence of the microarray surface and by cross-hybridization (see Section 3.2.1). Background correction is a method which does some or all of the following:

• Corrects for background noise and biological sample preparation effects.
• Adjusts for cross-hybridization.
• Adjusts the estimated expression values to fall on a proper scale.

The RMA background correction is performed by estimating the unknown quantity S in the following model

O = S + ε    (4.1)

where O is the observed PM intensity (see Section 3.2.1), S is the signal of interest and ε is noise. S is assumed to have an exponential distribution with parameter α and ε is assumed to have a normal distribution with mean µ and standard deviation σ. To avoid any possibility of negative values, the normal distribution is truncated at zero. Given o, the observed PM intensity, this leads to the adjustment

E(S | O = o) = a + b · [φ(a/b) − φ((o − a)/b)] / [Φ(a/b) + Φ((o − a)/b) − 1]    (4.2)

where a = o − µ − σ²α and b = σ. Note that φ and Φ are the standard normal PDF and CDF respectively. The parameters α, µ and σ are then estimated [Irizarry et al., 2003b], which leads to the expected value of the signal given the observed value of the intensity.

Note that the RMA procedure does not use the MM information (see Section 3.2.1) to correct the signal for background and cross-hybridization. Indeed, exploratory analyses presented in [Naef et al., 2001; Irizarry et al., 2003b] suggest that the MM may be detecting signal as well as cross-hybridization, and suggest using only the PM information.
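As a concrete illustration, the adjustment (4.2) can be evaluated directly once α, µ and σ are given. The following Python sketch (an illustrative translation of the formula, not the actual RMA implementation) computes E(S | O = o) using the standard normal PDF and CDF:

```python
import math

def norm_pdf(x):
    # standard normal probability density function
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    # standard normal cumulative distribution function via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rma_background_adjust(o, mu, sigma, alpha):
    """Expected signal E(S | O = o) under the convolution model O = S + eps,
    with S ~ Exp(alpha) and eps ~ N(mu, sigma) truncated at zero."""
    a = o - mu - sigma ** 2 * alpha
    b = sigma
    num = norm_pdf(a / b) - norm_pdf((o - a) / b)
    den = norm_cdf(a / b) + norm_cdf((o - a) / b) - 1.0
    return a + b * num / den
```

Because the adjustment is the conditional expectation of a positive quantity, it stays positive and increases with the observed intensity o, which is why this background correction never produces negative corrected values.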
So, only the PM intensities are corrected and used for further analysis.

4.2.2.2 Normalization

In microarray studies, biological sources of variation are referred to as interesting variation. However, many non-biological factors may contribute to the variability of the data. This means that the observed expression levels may also include variation introduced during the sample preparation, the manufacture of the microarrays, and the processing of the microarrays (labeling, hybridization, and scanning). These are referred to as sources of obscuring variation. See [Hartemink et al., 2001; Irizarry et al., 2003b] for a more detailed discussion.

The obscuring sources of variation can have many different effects on the data. Unless microarrays are appropriately normalized, comparing data from different microarrays can lead to misleading results. Normalization is a process of reducing undesired variation across microarray experiments and may use information from multiple experiments.

Various methods have been proposed for normalizing Affymetrix® GeneChip microarrays. [Bolstad et al., 2003] present a review of these methods and find quantile normalization to perform best. The aim of quantile normalization is to make the empirical distribution of probe intensities the same for all microarrays, so that the probe intensity distribution of sample i is identical to that of sample j for all i, j ∈ {1, . . . , N}. Let X_{N×P} be the matrix of the P probe intensities for the N samples under study (note that background correction and quantile normalization operate on probe intensities; there are P probes on one microarray, with P ≥ Q where Q is the number of probesets on one microarray, see Figure 3.7 for the data hierarchy). The quantile normalization algorithm is given in Algorithm 1.

Algorithm 1 Quantile normalization algorithm
1: Sort each row of X to give Xsort (save the original positions).
2: Take the mean across each column of Xsort and assign this mean to each element in the column to give X*sort.
3: Obtain Xnormalized by rearranging each row of X*sort into the same ordering as the original X (restore the original positions).

The quantile normalization method is a specific case of the transformation x*_i = F^{-1}(G(x_i)), where G is estimated by the empirical distribution of each microarray and F by the empirical distribution of the averaged sample quantiles.

4.2.2.3 Summarization

To obtain an expression measure of a probeset from the intensities of the probes that belong to this probeset, we assume that for each probeset q the background-adjusted, normalized, and log-transformed PM intensities, denoted y, follow a linear additive model

y_ipq = µ_iq + α_pq + ε_ipq    (4.3)

where i ∈ {1, . . . , N}, p ∈ {1, . . . , P}, q ∈ {1, . . . , Q}, α_pq is the affinity effect of probe p in probeset q, µ_iq represents the log-scale expression level of probeset q for array i, and ε_ipq represents an independent, identically distributed error term with mean 0 for probe p of probeset q on microarray i. For the estimation of the parameters, we assume that Σ_{p=1}^{P} α_pq = 0 for all q ∈ {1, . . . , Q}. This assumption says that Affymetrix® has chosen probes with intensities that are on average representative of the expression of the associated genes. The estimate of µ_iq gives the expression measure for probeset q on microarray i. To protect against outlier probes, a robust procedure such as median polish [Holder et al., 2001; Tukey, 1977] is used to estimate the model parameters.

4.2.2.4 Population Correction

In medical studies, it is common to study several distinct groups of patients, called populations. These populations come from different institutions (in the Tamoxifen resistance project, we have three different populations of patients, see Section 3.1) and their origin may be an important source of variability.
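To make the two preceding transformations concrete, Algorithm 1 (quantile normalization) and the median polish used for summarization can be sketched as follows (illustrative toy Python implementations, with samples/arrays as rows; the real analysis uses the rma function of the Bioconductor affy package):

```python
def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2.0

def quantile_normalize(X):
    """Algorithm 1: make all rows (samples) share the same empirical distribution."""
    n, p = len(X), len(X[0])
    # step 1: sort each row, remembering the original positions
    order = [sorted(range(p), key=lambda j: X[i][j]) for i in range(n)]
    Xsort = [[X[i][j] for j in order[i]] for i in range(n)]
    # step 2: mean of each rank position (column of Xsort) across samples
    means = [sum(Xsort[i][r] for i in range(n)) / n for r in range(p)]
    # step 3: put the rank means back in each row's original ordering
    Xn = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for r, j in enumerate(order[i]):
            Xn[i][j] = means[r]
    return Xn

def median_polish_expression(y, n_iter=10):
    """Fit y[i][p] = overall + row[i] + col[p] + resid[i][p] by alternately
    sweeping out row and column medians (Tukey's median polish).
    Rows are arrays i, columns are probes p of one probeset.
    Returns the expression measures mu_i = overall + row[i]."""
    n, m = len(y), len(y[0])
    resid = [r[:] for r in y]
    overall, row, col = 0.0, [0.0] * n, [0.0] * m
    for _ in range(n_iter):
        for i in range(n):                      # sweep row medians
            d = median(resid[i])
            row[i] += d
            resid[i] = [v - d for v in resid[i]]
        d = median(col); overall += d; col = [c - d for c in col]
        for p in range(m):                      # sweep column medians
            d = median([resid[i][p] for i in range(n)])
            col[p] += d
            for i in range(n):
                resid[i][p] -= d
        d = median(row); overall += d; row = [r - d for r in row]
    return [overall + r for r in row]
```

After quantile normalization, every sample has exactly the same sorted intensity vector; the median polish output is one robust expression value per array for the probeset.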
We have observed that some probesets are systematically over-expressed or under-expressed according to the population of patients; a hierarchical clustering (see Section 2.6.3.1) can highlight this population effect: if all the experiments are clustered and the experiments from the same population cluster together, this means that the population effect is stronger than the biological information. Unfortunately, the RMA procedure described above is not able to remove this source of variability without taking into account the origin of the samples (data not shown). Moreover, from a pragmatic point of view, it is easier to normalize each population separately and to integrate new samples when they are introduced into the analysis. In order to minimize the population effect, we introduce an additional transformation of the microarray data, called population correction.

Let X_{N×Q} be the matrix of probeset expressions after background correction, normalization and summarization performed for each population separately. Let Pop be the set of different populations, each population being a set of samples. The population correction method is given in Algorithm 2.

Algorithm 2 Population correction algorithm (median centering)
1: for each population k ∈ Pop do
2:   for each probeset q ∈ {1, . . . , Q} do
3:     xmedian ← median of probeset q across the samples in k
4:     for each sample i ∈ k do
5:       X*[i, q] ← X[i, q] − xmedian
6:     end for
7:   end for
8: end for
9: X* contains the corrected probeset intensities.

In other words, each probeset is centered by its median, taking into account the origin of the samples (population). After the population correction, we no longer observed the over/under-expression of the probesets of interest (a hierarchical clustering can no longer highlight the population effect) and the patients of different populations can be compared in further analyses. This correction has already been applied with success in [Sotiriou et al., 2005].

4.2.3 Prefiltering

The prefiltering consists in removing noninformative probesets without using demographic information. The prefiltering is made of two steps:

• Remove the Affymetrix® control probesets (see [Affymetrix, 2002]).
• Remove the probesets that have at least 95% of Absent calls among all the samples in the training set. The default parameters are used in the detection call method.

Detection Calls A detection algorithm [Affymetrix, 2002] uses the probe pair (PM/MM, see Section 3.2.1) intensities to generate a detection p-value and to assign a Present, Marginal, or Absent call to each probeset. Each probe pair in a probeset is considered as having a potential vote in determining whether the measured transcript is detected (Present) or not detected (Absent). The vote is described by a value called the discrimination score (R). The score is calculated for each probe pair and is compared to a predefined threshold τ. Probe pairs with scores higher than τ vote for the presence of the transcript, and inversely. The voting result is summarized as a p-value associated with the test of the difference between the scores and τ. The discrimination score is a basic property of a probe pair that describes its ability to detect its intended target (see Figure 4.2)

R_ip = (X_pm[i, p] − X_mm[i, p]) / (X_pm[i, p] + X_mm[i, p])    (4.4)

where i ∈ {1, . . . , N}, p ∈ {1, . . . , P}, and the pm and mm indices indicate the PM and MM intensities of the probe respectively.

Figure 4.2: Example of discrimination score [Affymetrix, 2002].
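The discrimination score (4.4) and the vote against the threshold τ can be illustrated as follows. This simplified Python sketch reports the fraction of "Present" votes; the real algorithm summarizes the votes with a one-sided Wilcoxon signed rank test on the scores rather than this plain vote fraction:

```python
def discrimination_scores(pm, mm):
    # R = (PM - MM) / (PM + MM) for each probe pair of one probeset
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]

def vote_fraction(pm, mm, tau=0.015):
    # fraction of probe pairs voting "Present" (score above tau);
    # a simplification of the Wilcoxon-based detection p-value
    scores = discrimination_scores(pm, mm)
    return sum(1 for r in scores if r > tau) / len(scores)
```

When MM is small relative to PM the score approaches 1 (strong evidence for the transcript); when MM ≈ PM the score is near 0 and the pair votes Absent.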
In Figure 4.2, the PM intensity is fixed to 80 and the MM intensity varies from 10 to 100; the y-axis represents the discrimination score and the x-axis the MM intensity. Each discrimination score is compared to the threshold τ, which is a small positive number (default value 0.015) that can be adjusted to increase or decrease the sensitivity and/or specificity of the analysis. The detection p-value is calculated by the one-sided Wilcoxon signed rank test [Droesbeke, 1988]. Finally, a detection call (Present/Marginal/Absent) is assigned to each probeset according to its detection p-value and two arbitrary thresholds α1 and α2 (see Figure 4.3).

Figure 4.3: Detection p-value [Affymetrix, 2002].

4.3 Feature Selection

The feature selection step aims to identify a set of features giving good generalization performance when used in the classifier. These methods include variable ranking, feature construction, and the Cox model as classifier. In the following sections, by "variables" we refer to the "raw" input variables (the probeset expressions obtained after the preprocessing step, see Section 4.2) and by "features" we refer to the variables constructed from the input variables.

4.3.1 Variable Ranking

For an efficient variable ranking (see Section 2.6), we have to choose a scoring function useful for the fitting of the classifier. Because the model used for the classification of patients is a multivariate Cox model, we have selected a scoring function based on a univariate Cox model. Many other alternatives could be considered, such as a variable ranking based on the Student t-test or the Pearson correlation.

4.3.1.1 Scoring Function Based on Univariate Cox Model

The scoring function S(j) is one minus the p-value computed by a likelihood ratio test (see Section 2.4.2.2) of a univariate Cox model.
The p-value for variable j is computed from the χ² distribution using the following statistic:

χ²statistic(j) = 2 [ l(β̂(j)) − l(β(0)) ]
             = 2 Σ_{i=1}^{N} δ_i [ β̂(j) x_ij − ln( Σ_{k=1}^{N} y_ik e^{β̂(j) x_kj} ) + ln( Σ_{k=1}^{N} y_ik ) ]

where y_ik = 1 if t_k ≥ t_i and y_ik = 0 if t_k < t_i, β̂(j) is the estimated coefficient of variable j and β(0) represents the null coefficient.

The p-value of the likelihood ratio test represents the significance of the difference between the partial loglikelihoods of the models with and without the considered variable, in other words, how valuable the variable is for the model.

4.3.2 Feature Construction

The feature construction step aims at constructing features derived from the original inputs (see Section 2.6.3). In the survival analysis design given in Figure 4.1, we use the following method to construct the features:

• We perform a variable ranking using the scoring function described in the previous section and we choose a threshold to select only the informative probesets (e.g. only the probesets that have a score > 0.9999).

• A hierarchical clustering (see Section 2.6.3.1) is performed in order to select clusters of highly correlated probesets.

• For each cluster, a new feature is constructed by computing the cluster centroid, i.e. the average of the intensities of all the probesets in the cluster.

This semi-supervised method (see Section 2.6.3) selects the probesets using demographic data (supervised) and uses a hierarchical clustering (unsupervised) to cluster highly correlated probesets and obtain the cluster centroids. Such a method has several advantages:

1. Variable ranking is less prone to overfitting (see Section 2.6.1) and is computationally efficient. Indeed, we have to deal with more than 30,000 probesets.

2. Clustering of highly correlated probesets allows us to identify interesting groups of co-regulated genes which will be the object of further biological experiments (a cluster of probesets can be reduced to a cluster of genes by consulting their biological annotations, see Appendix D).
3. The computation of the cluster centroids makes it possible (i) to reduce the variance of the features and (ii) to facilitate the validation of the classifier on another microarray platform (see Section 4.3.2.1).

We use a hierarchical clustering with the uncentered Pearson correlation as similarity metric and complete linkage (see Section 2.6.3.1).

4.3.2.1 Classifier Validation on Different Microarray Platforms

In this section, we propose a method to facilitate the validation of a classifier developed on a specific microarray platform (e.g. Affymetrix®) on a different one (e.g. Agilent®). Because of the heterogeneity of the existing microarray platforms, it is very hard to compare/validate new results between different microarray studies. However, given the cost and the scarcity of microarray experiments, it would be very interesting to be able to test the final classification model on other microarray data. There are two main difficulties with such comparisons/validations:

1. We have to find similar probes on the two microarray platforms under study (e.g. a probe representing the same gene on the Affymetrix® and Agilent® platforms).

2. We have to normalize the datasets coming from the different microarray platforms in order to analyze comparable data.

The first problem is partly solved by the design of the feature selection (see Section 4.3). Indeed, the final classification model is based on a multivariate Cox regression fitted with the constructed features. The set of features is composed of the cluster centroids constructed during the feature construction step. Because each feature is an average of several probesets in a cluster, the classifier is less sensitive to the absence of one or more probes, as may happen when analyzing data coming from different microarray platforms.
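The main computational ingredients of this feature selection, the univariate Cox scoring function of Section 4.3.1.1 and the centroid features built from correlated probesets, can be sketched as follows (illustrative Python; a real analysis would fit the Cox model by Newton-Raphson, e.g. with R's coxph, rather than the crude grid search used here):

```python
import math

def cox_partial_loglik(beta, x, t, delta):
    """Partial loglikelihood of a univariate Cox model: x covariate values,
    t survival times, delta event indicators (1 = event observed)."""
    ll = 0.0
    n = len(x)
    for i in range(n):
        if delta[i] == 1:
            # risk set of subject i: all k with t_k >= t_i (y_ik = 1)
            risk = sum(math.exp(beta * x[k]) for k in range(n) if t[k] >= t[i])
            ll += beta * x[i] - math.log(risk)
    return ll

def cox_score(x, t, delta):
    """Scoring function S(j) = 1 - p-value of the likelihood ratio test."""
    grid = [b / 50.0 for b in range(-200, 201)]     # crude maximisation over a grid
    beta_hat = max(grid, key=lambda b: cox_partial_loglik(b, x, t, delta))
    chi2 = 2.0 * (cox_partial_loglik(beta_hat, x, t, delta)
                  - cox_partial_loglik(0.0, x, t, delta))
    return 1.0 - math.erfc(math.sqrt(chi2 / 2.0))   # chi2(1) upper tail = erfc(sqrt(x/2))

def uncentered_correlation(u, v):
    """Uncentered Pearson correlation (cosine similarity, no mean subtraction)."""
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cluster_centroids(expr, clusters):
    """One feature per cluster: the average expression profile of its probesets.
    expr: dict probeset_id -> profile across samples; clusters: lists of ids."""
    feats = []
    for members in clusters:
        n_samples = len(expr[members[0]])
        feats.append([sum(expr[p][i] for p in members) / len(members)
                      for i in range(n_samples)])
    return feats
```

A probeset whose expression is strongly associated with survival yields a score close to 1, which is why the threshold used later (score > 0.9999) is so severe.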
The robustness of the classifier with respect to the loss of one or more probes will be analyzed in future work (see Section 6.1).

4.3.3 Cox Model

Once the features are constructed, the normalized loglikelihood (see Equation (2.23)) of a multivariate Cox model fitted using these features is estimated by a 10-fold cross-validation procedure. This procedure partitions the training data into ten couples {training subset, test subset}, where the training subset contains 90% of the training set and the test subset contains the remaining 10%. For each couple, a multivariate Cox model is fitted using the training subset and its loglikelihood is computed on the corresponding test subset and normalized. At the end, the ten normalized loglikelihoods are averaged to obtain an estimate of the classifier performance on independent data. This procedure is represented in Figure 4.4.

Figure 4.4: 10-fold cross-validation procedure to estimate the loglikelihood of the Cox model on independent data.

4.4 Final Model

The last part of the analysis concerns the fitting of the final Cox model and its use to classify the patients into low and high-risk groups. In order to assess the difference in survival between the two groups of patients, we compute several survival statistics.

4.4.1 Final Cox Model

Once the best set of constructed features is selected, a Cox model is fitted using the whole training set. This model is used to compute the risk scores

rs_i = Σ_{j=1}^{F} β̂_j f_ij    (4.5)

where F is the number of features, i ∈ {1, . . . , N}, j ∈ {1, . . . , F}, the β̂_j are the estimated coefficients of the final Cox model and f_ij is the feature j of sample i.
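The risk score (4.5) is a plain linear combination of the constructed features, and classifying a patient then amounts to comparing the score to the selected cutoff (Section 4.4.2). A minimal Python sketch:

```python
def risk_scores(features, beta):
    """rs_i = sum_j beta_j * f_ij (Equation 4.5): one score per patient.
    features: one row of F feature values per patient; beta: F coefficients."""
    return [sum(b * f for b, f in zip(beta, row)) for row in features]

def classify(scores, cutoff):
    """Split patients into low and high-risk groups with a given cutoff."""
    return ["high" if rs > cutoff else "low" for rs in scores]
```

The same two functions serve in the test phase: the β̂ coefficients and the cutoff are frozen from the training phase and simply applied to the test-set features.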
The construction of such features is described in Section 4.3.2.

4.4.2 Cutoff Selection

The risk score rs is a continuous variable representing the risk of an event for each patient. In order to classify the patients into two groups, a cutoff has to be selected. The aim is to obtain two groups of patients with a large difference in survival. Such a difference can be assessed with different kinds of survival statistics (see Section 4.5). The cutoff for the risk scores is selected on the basis of the hazard ratio (HR, see Section 2.5.3). The cutoff selection is given in Algorithm 3.

Algorithm 3 Cutoff selection based on the hazard ratio
1: Consider only the patients in the training set.
2: Keep only the cutoffs which leave at least 25% of the patients in the high-risk group.
3: Keep only the cutoffs whose HR does not have unity in its 95% confidence interval (see Section 4.5.1). (HR = 1 means no difference in survival between the low and high-risk groups.)
4: Select the cutoff which yields the lowest proportion of DMFS at 3 years (see Section 4.5.3) and the highest HR.

4.5 Survival Statistics

Several ways to assess the difference in survival between the low and high-risk groups are described in this section.

4.5.1 Hazard Ratio

The hazard ratio was introduced in Section 2.5.3. This statistic makes it possible to assess the reduction in the risk of event between two different groups.

4.5.2 Logrank Test

The logrank test was introduced in Section 2.5. This statistical test makes it possible to test the difference between two survivor functions, S1 and S2. They are estimated using the KM estimator (see Section 2.3.1) from the patients belonging to the low and high-risk groups.

4.5.3 Proportion of DMFS

Survival at three and five years is a common criterion in clinical practice. In this thesis, this criterion is called the distant metastasis free survival (DMFS) because we study the appearance of distant metastases for patients under a specific treatment.
It would be interesting to have a classifier especially efficient at discriminating the patients who will not have an event in the first three or five years (low-risk group), even if the high-risk group is less well discriminated (trade-off between sensitivity and specificity). The same reasoning can be made for the inverse. The proportion of events (distant metastases) during the first three or five years is therefore computed for the low and high-risk groups in order to assess such a performance criterion for the classifier.

4.5.4 Time-Dependent ROC Curve

ROC curves are a popular method for displaying the sensitivity and specificity of a continuous diagnostic marker R for a binary disease variable D. However, many disease outcomes are time-dependent, D(t), and ROC curves that vary as a function of time may be more appropriate. A common example of a time-dependent variable is vital status, where D(t) = 1 if a patient has died prior to time t and zero otherwise. In [Heagerty et al., 2000], the authors propose summarizing the discrimination potential of a marker R, measured at baseline t = 0, by calculating ROC curves for cumulative disease or death incidence by time t, denoted ROC(t). A typical complexity with survival data is that the observations may be censored. Two ROC curve estimators are proposed that can accommodate censored data. A simple estimator is based on using the Kaplan-Meier estimator for each possible subset R > c. However, this estimator does not guarantee the necessary condition that sensitivity and specificity are monotone in R. An alternative estimator that does guarantee monotonicity is based on a nearest neighbor estimator of the bivariate distribution function of (R, T), where T represents the survival time [Akritas, 1994]. Two interesting examples are given in [Heagerty et al., 2000]: the comparison of a standard and a modified flow cytometry measurement for predicting survival after the detection of breast cancer, and the use of the ROC(t) curve to display the impact of modifying eligibility criteria for sample size and power in HIV prevention trials.

4.5.4.1 Sensitivity and Specificity

The sensitivity and the specificity of a binary classification test or algorithm are parameters that express something about the test's performance.
The sensitivity of such a test is the proportion of positive test results among all positive cases tested, TP/(TP + FN), where TP, FN, TN and FP denote the numbers of true positives, false negatives, true negatives and false positives respectively. The specificity of such a test is the proportion of true negatives among all negative cases tested, TN/(TN + FP). Sensitivity and specificity are well established for simple binary variables with either discrete or continuous marker measurements. In [Heagerty et al., 2000], these concepts of sensitivity and specificity are extended to time-dependent binary variables such as vital status, allowing the characterization of diagnostic accuracy for censored data.

For test results defined on continuous scales, ROC curves are standard summaries of accuracy. If R denotes the diagnostic test or marker, with higher values more indicative of disease, and D is a binary indicator of disease status, then the ROC curve for R is a plot of the sensitivity associated with the dichotomized test R > c versus (1 − specificity) for all possible threshold values c, i.e. the ROC curve is the monotone increasing function in [0, 1] with coordinates

(Pr{R > c | D = 0}, Pr{R > c | D = 1})

where c ∈ (−∞, ∞). This function characterizes the diagnostic potential of a continuous test by summarizing all of the possible trade-offs between sensitivity and specificity. The higher the ROC curve is in the quadrant [0, 1] × [0, 1], the better its capacity for discriminating diseased from nondiseased subjects.

Definitions Let T_i denote the failure time and R_i the diagnostic marker for subject i.
Let C_i denote the censoring time, Z_i = min(T_i, C_i) the follow-up time, and δ_i a censoring indicator with δ_i = 1 if T_i ≤ C_i and δ_i = 0 if T_i > C_i. We use the counting process D_i(t) = 1 if T_i ≤ t and D_i(t) = 0 if T_i > t to denote the event (disease) status at any time t, with D_i(t) = 1 indicating that subject i has had an event prior to time t. Recall that ROC curves display the relationship between a diagnostic marker R and a binary disease variable D_i by plotting estimates of the sensitivity Pr{R > c | D = 1} and one minus the specificity, 1 − Pr{R ≤ c | D = 0}, for all possible values c. When the disease status is time-dependent, consider sensitivity and specificity as time-dependent functions and define them as

sensitivity(c, t) = Pr{R > c | D(t) = 1}
specificity(c, t) = Pr{R ≤ c | D(t) = 0}

Using these definitions, we can define the corresponding ROC curve for any time t, ROC(t).

Kaplan-Meier Estimator We can use Bayes' theorem to rewrite the sensitivity and the specificity as

Pr{R > c | D(t) = 1} = [1 − S(t | R > c)] Pr{R > c} / [1 − S(t)]
Pr{R ≤ c | D(t) = 0} = S(t | R ≤ c) Pr{R ≤ c} / S(t)

where S(t) is the survival function S(t) = Pr{T > t} and S(t | R > c) is the conditional survival function for the subset defined by R > c. A widely used nonparametric estimate of S(t) is given by the KM estimator [Kaplan and Meier, 1958] (see Section 2.3.1). The KM estimator uses all the information in the data, including the censored observations, to estimate the survival function. A simple estimator of the sensitivity and specificity at time t is then obtained by combining the KM estimator Ŝ_KM(t) and the empirical distribution function of the marker covariate R:

Pr̂_KM{R > c | D(t) = 1} = [1 − Ŝ_KM(t | R > c)] [1 − F̂_R(c)] / [1 − Ŝ_KM(t)]
Pr̂_KM{R ≤ c | D(t) = 0} = Ŝ_KM(t | R ≤ c) F̂_R(c) / Ŝ_KM(t)

where F̂_R(c) = (1/n) Σ_i 1(R_i ≤ c). Now, we can estimate the sensitivity and specificity for a linear predictor with censored data using the KM estimator.
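The KM-based estimator can be sketched directly from the formulas above. The following illustrative Python version ignores ties between events and censorings at the same time point:

```python
def km_survival(times, events, t):
    """Kaplan-Meier estimate of S(t) = Pr{T > t}.
    events[i] = 1 if the failure was observed, 0 if censored."""
    s = 1.0
    at_risk = len(times)
    for time, ev in sorted(zip(times, events)):
        if time > t:
            break
        if ev:
            s *= (at_risk - 1) / at_risk
        at_risk -= 1
    return s

def km_sensitivity(R, times, events, c, t):
    """Pr{R > c | D(t) = 1} estimated via Bayes' theorem with KM estimates."""
    hi = [i for i in range(len(R)) if R[i] > c]
    p_hi = len(hi) / len(R)                  # empirical Pr{R > c}
    s_all = km_survival(times, events, t)    # S_KM(t)
    s_hi = km_survival([times[i] for i in hi], [events[i] for i in hi], t)
    return (1.0 - s_hi) * p_hi / (1.0 - s_all)
```

With fully observed (uncensored) data, this reduces to the ordinary empirical sensitivity at time t, which provides a useful sanity check.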
However, this estimator has two problems:

1. This simple estimator does not guarantee that sensitivity or specificity is monotone. By definition, we must have Pr{R > c | D(t) = 1} ≥ Pr{R > c′ | D(t) = 1} for c′ > c. See a violation example in [Heagerty et al., 2000].

2. A potential problem with the KM-based ROC estimator is that the conditional KM estimator ŜKM(t | R > c) assumes that the censoring process does not depend on R. This assumption may be violated in practice when the intensity of follow-up efforts is influenced by the baseline diagnostic marker measurements.

For the moment, we have implemented only the KM-based ROC estimator.

4.5.4.2 Area Under the ROC Curve

The area under the ROC curve (AUC) can be interpreted as the probability that the test result from a randomly chosen diseased individual exceeds that from a randomly chosen nondiseased individual, and is often used to summarize the ROC curve.

Chapter 5

Results

Contents
5.1 Tamoxifen Resistance Project
5.1.1 Quality Assessment
5.1.2 Preprocessing Methods
5.1.3 Variable Ranking
5.1.4 Feature Construction
5.1.5 Final Cox Model and Risk Score Computation
5.1.6 Cutoff Selection
5.1.7 Validation on Independent Test Set

In order to study the effectiveness of the methods described in Chapter 4 on real data, we apply the design of the survival analysis given in Figure 4.1 to the data coming from the Tamoxifen resistance project (see Section 1.1.1.1). We use the OXFT population (99 patients) as training set and the KIT and GUYT populations (156 patients) as test set.
Table 5.1 summarizes the split of the data between training and test sets and their main characteristics.

Tamoxifen resistance project
                        Training set    Test set
Populations             OXFT            KIT and GUYT
Number of patients      99              156
Number of probesets     44928           44928

Table 5.1: Split of the data between training and test sets for the Tamoxifen resistance project.

5.1 Tamoxifen Resistance Project

5.1.1 Quality Assessment

The 255 microarrays coming from the OXFT, KIT and GUYT populations have passed the quality tests described in Section 4.1 (data not shown due to confidentiality).

5.1.2 Preprocessing Methods

We apply the preprocessing methods described in Section 4.2 to the data, resulting in 32,139 probesets (we discard 12,789 probesets at the prefiltering step, see Section 4.2.3).

5.1.3 Variable Ranking

The scoring function (see Section 4.3.1.1) is applied to all prefiltered probesets. The histogram in Figure 5.1 gives an approximation of the score distribution. We can see that there is a large number of very high scores (close to 1) in comparison with lower scores (0 to 0.9). In order to keep only a small subset of promising probesets, we have to choose a high threshold. This results in 213 probesets with a score > 0.9999.

Figure 5.1: Histogram of the score computed by the scoring function for all the probesets remaining after the prefiltering.

The annotations of the remaining probesets are given in Appendix C. These annotations are available using the annotation package of Bioconductor and the information publicly available on the Affymetrix® website¹.

5.1.4 Feature Construction

In this step, we carry out hierarchical clustering (see Section 2.6.3.1) on the probesets selected after variable ranking in order to cluster probesets according to a correlation metric. The resulting clustering is given in Figure 5.2. This figure includes the dendrogram (tree at the top) and the heatmap (below the dendrogram).
The heatmap is a graphical representation of the gene expressions, with down-regulation in green (negative gene expression), up-regulation in red (positive gene expression) and absence of expression in black (gene expression close to zero).

¹ http://www.affymetrix.com/analysis/index.affx

Figure 5.2: Hierarchical clustering of the 213 probesets selected according to their ranking scores. The y-axis represents the patients and the x-axis represents the probesets. Only the probesets are clustered (see dendrogram at the top of the figure). The patients are sorted by their risk score (see Section 5.1.5), the lowest risk being at the top.

In order to take advantage of the feature selection in multiple microarray platform comparisons (see Section 4.3.2.1), we chose a minimum cluster size of 5 probesets. So the number of constructed features is not equal to the number of clusters (see Section 4.3.2). When a large number of clusters is tested (the number of clusters can be as large as the number of probesets), no feature can be constructed because all the clusters are too small. The relation between the number of clusters and the number of constructed features (with the minimum cluster size set to 5) is given in Figure 5.3. If this parameter is set to zero, there is no constraint and the number of clusters equals the number of constructed features. In our case, we can see that the number of constructed features increases rapidly with the number of clusters and starts to decrease when the number of clusters is too large (the size of some clusters decreases below the limit).

Figure 5.3: Impact of a minimum cluster size set to 5 on the relation between the number of clusters and the number of constructed features.

The performance of the multivariate Cox model with a set of constructed features is assessed using 10-fold cross-validation (see Section 4.3.3).
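To make the construction step concrete, here is a minimal Python sketch, assuming the cluster labels have already been obtained from the hierarchical clustering. The function name and the toy matrix are illustrative; each retained feature is the cluster centroid, i.e. the mean expression of the cluster's probesets.

```python
import numpy as np

def construct_features(expr, labels, min_size=5):
    """Build one feature per sufficiently large cluster: the centroid
    (mean expression profile over the cluster's probesets).
    expr: (n_patients, n_probesets); labels: cluster label per probeset."""
    expr, labels = np.asarray(expr, float), np.asarray(labels)
    feats = []
    for lab in np.unique(labels):
        members = labels == lab
        if members.sum() >= min_size:       # discard clusters below the limit
            feats.append(expr[:, members].mean(axis=1))
    return np.column_stack(feats) if feats else np.empty((expr.shape[0], 0))
```

With a minimum size of 5, a cluster of two probesets yields no feature, which is exactly why the number of constructed features can drop below the number of clusters.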
The evolution of the training error (minus the normalized loglikelihood on the training subset) and the test error (minus the normalized loglikelihood on the test subset) is given in Figure 5.5 (see Section 2.4.2.1 for the description of the loglikelihood normalization). We can see that the number of clusters minimizing the test error is two (see Figure 5.4 for the two selected clusters). Even if the training error keeps decreasing up to fifty clusters (its minimum), we can see that the test error starts to increase from three clusters on. This is evidence of overfitting, the number of constructed features increasing with the number of clusters between zero and fifty (see Figure 5.3).

5.1.5 Final Cox Model and Risk Score Computation

Once the best set of features is selected (here we have only two features), such features are constructed and used to fit a Cox model on the whole training set (see Section 4.4.1).

Figure 5.4: Dendrogram with the two selected clusters in color (orange for cluster 1 and blue for cluster 2).

Figure 5.5: Evolution of the error with the number of clusters. The dashed line represents the training error and the solid line represents the test error. The vertical dashed line is the best number of clusters (2) w.r.t. the test error.

             coef    exp(coef)   se(coef)      z         p
pclust.1     1.53    4.59802     1.36       1.12    0.26000
pclust.2    -5.03    0.00653     1.50      -3.34    0.00082

Likelihood ratio test = 59.9 on 2 df, p = 9.77e-14

Note that the coefficient of the feature constructed from all the probesets of cluster 2 (called pclust.2), when adjusted for the other feature, is highly significant² (p-value = 8.2e-4), while this is not the case for the coefficient of cluster 1 (called pclust.1). However, according to the feature selection step (see Section 4.3) and the method used to construct features (see Section 4.3.2), this pair of features is the best one.
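The risk score that follows from such a fitted model is simply the Cox linear predictor. A minimal Python sketch using the coefficients reported in the table above; the two feature values per patient are made up for the example, and `risk_score` is our own name, not the thesis implementation:

```python
import numpy as np

beta = np.array([1.53, -5.03])        # fitted coefficients: pclust.1, pclust.2

def risk_score(features, beta):
    """Cox linear predictor beta' x, used as the patient risk score."""
    return np.asarray(features, float) @ beta

# two hypothetical patients described by their (pclust.1, pclust.2) features
x = np.array([[0.2, 0.1],
              [0.1, 0.8]])
scores = risk_score(x, beta)
high_risk = scores > 0.93             # dichotomization with the selected cutoff
```

Patients whose score exceeds the chosen cutoff (0.93, see Section 5.1.6) fall into the high-risk group.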
Moreover, the model is significantly different from a model without any features (p-value = 9.77e-14) according to the likelihood ratio test. Using this model, a risk score is computed for each patient of the training set (see Section 4.4.1). A histogram of risk scores for these patients is given in Figure 5.6. We can see that the approximate distribution given by the histogram is skewed toward low values, meaning that there are more patients with low risk (in agreement with current clinical observations).

Figure 5.6: Histogram of risk scores for all the patients in the training set. Common statistics are displayed below the histogram.

5.1.6 Cutoff Selection

The selection of a cutoff on the training set is based on the hazard ratio. We compute the other survival statistics described in Section 4.5 in order to assess the effectiveness of the selection procedure.

² The p-value is computed by the Wald test from the z statistic and a χ² distribution. The likelihood ratio test and the Wald test are described in Section 2.4.2.2.

5.1.6.1 Hazard Ratio

For each possible cutoff, we compute the hazard ratio (see Section 4.5.1). The evolution of the hazard ratio w.r.t. the cutoffs is given in Figure 5.7. The vertical dashed line represents the hazard ratio based cutoff (0.93) selected by Algorithm 3 described in Section 4.4.2. The hazard ratio based cutoff gives a hazard ratio of 60.42 with 95% confidence interval [7.99, 456.59] on the training set.

Figure 5.7: Evolution of the hazard ratio w.r.t. the cutoff. The thick line is the HR. The two thin lines delimit the 95% confidence interval around the HR. The horizontal dashed line represents HR = 1, which means no difference in survival between low and high-risk groups. The vertical dashed line represents the hazard ratio based cutoff selected by Algorithm 3 in Section 4.4.2.

5.1.6.2 Logrank Test

Figure 5.8 depicts the evolution of the log10 p-value w.r.t. the cutoffs.
We can see that the hazard ratio based cutoff gives two significantly different groups according to the logrank test (p-value = 8.37e-13). See Section 4.5.2 for details about the logrank test. The gaps in the line in Figure 5.8 are due to null p-values, giving −∞ in log10. Such p-values are not plotted.

5.1.6.3 Proportion of DMFS

Figure 5.9 depicts the evolution of the DMFS proportion (see Section 4.5.3) in low and high-risk groups w.r.t. the cutoffs. We can see that there is no event in the low-risk group, whereas there are 25% of events before three years in the high-risk group. It is interesting to mention that the two functions are quasi monotonically increasing, meaning that the patients are very well classified.

Figure 5.8: Evolution of the log10 p-value w.r.t. the cutoffs. The horizontal dashed line represents the minimum level of significance (log10 0.05). The vertical dashed line represents the hazard ratio based cutoff (0.93).

Figure 5.9: Evolution of the DMFS proportion in low (solid line) and high-risk (dashed line) groups w.r.t. the cutoffs. The vertical dashed line represents the hazard ratio based cutoff (0.93).

5.1.6.4 Time-Dependent ROC Curve

Figure 5.10 gives the time-dependent ROC curve at three years (see Section 4.5.4) in order to assess the performance of the classifier whatever the selected cutoff. The AUC of the classifier equals 0.96 and is different from that of a random classifier (whose AUC equals 0.5), represented by the diagonal³, showing very good classification performance for events before three years on the training set.

Figure 5.10: Time-dependent ROC curve at three years on the training set. The diagonal (dashed line) represents a random classifier (AUC = 0.5).

Moreover, the AUC is computed for each point in time to highlight the performance of the classifier w.r.t. time (see Figure 5.11).
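The AUC values reported here admit the Mann-Whitney interpretation given in Section 4.5.4.2: the probability that a randomly chosen diseased case outranks a randomly chosen nondiseased one. A minimal empirical sketch (the function name and toy data are illustrative assumptions):

```python
import numpy as np

def empirical_auc(marker, diseased):
    """AUC as Pr{R_diseased > R_nondiseased}, with ties counted as 1/2
    (the Mann-Whitney U statistic divided by n1 * n0)."""
    marker, diseased = np.asarray(marker, float), np.asarray(diseased, bool)
    pos, neg = marker[diseased], marker[~diseased]
    wins = (pos[:, None] > neg[None, :]).sum()    # all diseased/nondiseased pairs
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

A perfectly separating marker gives 1.0; a marker unrelated to disease status gives about 0.5, the diagonal of the ROC plot.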
We can see that whatever the point in time, the classifier shows very good performance with AUC > 0.95, especially at three years.

Figure 5.11: Evolution of the AUCs w.r.t. time on the training set. The horizontal dashed line represents the AUC of a random classifier (AUC = 0.5). The vertical dashed line represents the three years mark.

³ A p-value addressing the null hypothesis H0 that the area under the ROC curve of the classifier is 0.5, i.e. the AUC of a random classifier, can be computed using the U statistic of the Mann-Whitney test [Mason and Graham, 1982].

5.1.7 Validation on Independent Test Set

We compute each survival statistic as in the previous section using the independent test set (patients from KIT and GUYT). The cutoff tested in the next sections is the same as the one selected on the training set.

5.1.7.1 Risk Scores

Using the final model fitted in Section 5.1.5, we compute a risk score for each patient of the test set (see Section 4.4.1). A histogram of risk scores for these patients is given in Figure 5.12. We can see that the approximate distribution given by the histogram is skewed toward high values, meaning that there are more patients with high risk. This is not the case for the risk scores of the patients in the training set (see Section 5.1.5), where we find the contrary. This is evidence that the training set (OXFT population) and the test set (KIT and GUYT populations) are slightly different⁴. This can lead to poor validation performance due to differences between training and test sets.

5.1.7.2 Hazard Ratio

For each cutoff tested in the training set, the hazard ratio is computed on the test set. The evolution of the hazard ratio w.r.t. the cutoffs is given in Figure 5.13. The vertical dashed line represents the hazard ratio based cutoff (0.93) selected previously. This cutoff gives a hazard ratio of 2.44 with 95% confidence interval [1.38, 4.31] on the test set. The 95% confidence interval does not include unity.
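For reference, the two-group statistics used throughout this validation (the logrank χ² and a hazard ratio between risk groups) can be sketched in Python as follows. This is an illustrative implementation using the simple observed/expected (O/E) hazard ratio estimate and no tie corrections, not necessarily the estimators used to produce the numbers above:

```python
import numpy as np

def two_group_stats(time, event, group):
    """Two-sample logrank chi-square statistic (1 d.f.) and the O/E
    hazard ratio estimate (O1/E1)/(O0/E0) for high-risk (group=1)
    versus low-risk (group=0) patients."""
    time = np.asarray(time, float)
    event = np.asarray(event, int)
    group = np.asarray(group, int)
    o_minus_e, var, e1 = 0.0, 0.0, 0.0
    for u in np.unique(time[event == 1]):          # distinct event times
        at_risk = time >= u
        n, n1 = at_risk.sum(), (at_risk & (group == 1)).sum()
        d = ((time == u) & (event == 1)).sum()
        d1 = ((time == u) & (event == 1) & (group == 1)).sum()
        e1 += d * n1 / n                           # expected events in group 1
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    o1 = ((event == 1) & (group == 1)).sum()
    o0 = (event == 1).sum() - o1
    e0 = (event == 1).sum() - e1
    chi2 = o_minus_e ** 2 / var                    # compare to chi-square, 1 d.f.
    hr = (o1 / e1) / (o0 / e0)
    return chi2, hr
```

On a toy dataset where group 1 fails systematically earlier, the hazard ratio is well above 1 and the χ² statistic grows with the separation between the two survival curves.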
5.1.7.3 Logrank Test

Figure 5.14 depicts the evolution of the log10 p-value w.r.t. the cutoffs on the test set. We can see that the hazard ratio based cutoff gives two significantly different groups according to the logrank test (p-value = 1.47e-3). See Section 4.5.2 for details about the logrank test.

5.1.7.4 Proportion of DMFS

Figure 5.15 depicts the evolution of the DMFS proportion in low and high-risk groups w.r.t. the cutoffs on the test set. We can see that there are 8% of events in the low-risk group, whereas there are 20% of events in the high-risk group, before three years. The very high-risk patients are not well classified, as Figure 5.15 shows (if we choose a cutoff of 4, there are no more events before three years in the high-risk group).

⁴ A careful study of the demographic data shows that the KIT population contains a lot of early distant metastases (early events) in comparison with the OXFT and GUYT populations.

Figure 5.12: Histogram of risk scores for all the patients in the test set. Common statistics are displayed below the histogram.

Figure 5.13: Evolution of the hazard ratio w.r.t. the cutoffs. The thick line is the HR. The two thin lines delimit the 95% confidence interval around the HR. The horizontal dashed line represents HR = 1, which means no difference in survival between low and high-risk groups. The vertical dashed line represents the hazard ratio based cutoff selected by Algorithm 3 in Section 4.4.2.

Figure 5.14: Evolution of the log10 p-value w.r.t. the cutoffs. The horizontal dashed line represents the minimum level of significance (log10 0.05). The vertical dashed line represents the hazard ratio based cutoff (0.93).

5.1.7.5 Time-Dependent ROC Curve

The time-dependent ROC curve at three years on the test set is given in Figure 5.16. The AUC of the classifier equals 0.65, remaining different from that of a random classifier, but the difference is less clear-cut. Figure 5.17 depicts the evolution of the AUC w.r.t. time on the test set.
Interestingly, the classifier has poor performance on very early events (in the first year) but gives much better performance after three years. This needs to be investigated in further analyses.

Figure 5.15: Evolution of the DMFS proportion in low (solid line) and high-risk (dashed line) groups w.r.t. the cutoffs. The vertical dashed line represents the hazard ratio based cutoff (0.93).

Figure 5.16: Time-dependent ROC curve at three years on the test set. The diagonal (dashed line) represents a random classifier (AUC = 0.5).

Figure 5.17: Evolution of the AUCs w.r.t. time on the test set. The horizontal dashed line represents the AUC of a random classifier (AUC = 0.5). The vertical dashed line represents the three years mark.

Chapter 6

Conclusion

Contents
6.1 Future Works

We have proposed a methodology for microarray analysis combining machine learning and well-established survival methods. This methodology covers the whole range of such an analysis, starting from the raw data and their preprocessing and ending with high-level analyses such as feature selection, the construction of the classifier on the training set and its validation on an independent test set using several traditional survival statistics. We have chosen to use a classifier based on survival analysis instead of a binary classifier as in [van't Veer et al., 2002], for instance. Indeed, when we transform the survival data into binary classes, we lose information (see Chapter 2). We prefer to use all the information available in the survival data. Moreover, we have chosen to develop a methodology that keeps the classifier interpretable instead of using it as a black box. From a biological point of view, we can extract interesting biological information from the final classifier. Indeed, the risk computation is based on a linear combination of several weighted cluster centroids (see Section 4.3.2).
The cluster centroids are the averages of several probesets and the weights are the coefficients fitted by the Cox regression. So we can study in detail the biological information of such clusters and their contributions to the risk computation. We have tested the methodology on real data dealing with the Tamoxifen resistance of breast cancer patients. We have constructed a classifier based on a training set of 99 patients, able to correctly assess the risk of the patients in the test set (156 patients), and then classify them into low and high-risk groups. This classifier results in a hazard ratio of 2.44 with 95% confidence interval [1.38, 4.31] between the two groups (see Figure 5.13). This difference in survival is confirmed by the logrank test (p-value = 1.47e-3, see Figure 5.14). Moreover, there is a very low percentage of distant metastases in the low-risk group within the first three years (8% of 101 patients) whereas there are 20% (of 55 patients) of distant metastases before three years in the high-risk group (see Figure 5.15). The evolution of the AUC of time-dependent ROC curves shows us that the classifier has poor performance for very early distant metastases (within the first year) but good performance for later events (including the three years mark, see Figure 5.17). Even if the results seem very promising, there exist numerous alternatives in terms of methods used for the variable ranking, the feature construction and the classifier. We can test them and compare the different results. Moreover, we can test the classifier on different datasets, if publicly available, and assess the benefits of the feature construction described in Section 4.3.2.1.

6.1 Future Works

1. Study of the impact of the normalization methods on the results. Currently, new normalization methods are being introduced (e.g. GCRMA¹) and recent articles challenge the performance assessment of such methods [Ploner et al., 2005; Sheden et al., 2005; Gautier et al., 2005].

2.
Implementation of data preprocessing methods for specific computer architectures like computer clusters. Indeed, the data preprocessing step is computationally intensive and current tools are inefficient for larger datasets. These tools need to be adapted to specific computer architectures.

3. Study of the variance of variable ranking methods and the overlap of the results between different populations of patients. Variable ranking is commonly used in microarray studies without any consideration of its intrinsic variance within one population and between populations. The microarray data available at the Microarray Unit (IJB) give us the opportunity to carry out such an analysis.

4. Use of alternative methods for the feature construction:

(a) Use of alternative clustering methods like adaptive quality-based clustering [De Smet et al., 2002] instead of hierarchical clustering. Such a method does not have the constraint of clustering all the variables. So, only highly correlated variables will be clustered together whereas uncorrelated variables will be left alone.

(b) Use of methods of space dimensionality reduction and input variable transformation (e.g. PCA) to construct new features which are independent of each other.

5. Study of the classifier robustness with the loss of one or more probesets. As described in Section 4.3.2.1, we may lose some variables used to construct features when we test the classifier on another microarray platform. It would be interesting to study the performance of the classifier when removing one or more probesets to assess its robustness.

6. Use of penalized Cox models [Tibshirani, 1997; Gui and Li, 2004] in order to perform feature selection at the level of the model (see embedded methods described in Section 2.6.2.1).

7. Currently, we have studied the performance of the survival analysis design on one split of the data based on the populations.
Indeed, the training and test sets are composed of one or more populations without mixing patients between populations. A multiple random validation strategy [Michiels et al., 2005] is necessary to assess the performance whatever the training/test split.

¹ GCRMA is a software package introduced by Zhijin Wu in 2003 (http://www.bioconductor.org/repository/release1.4/package/html/gcrma.html).

8. Comparison with binary classification techniques [Brown et al., 1999; Duda et al., 2001; Dudoit et al., 2002].

9. Use of Gene Ontology (see Appendix D) to infer biological knowledge about the probesets selected to construct the features.

10. Comparison with traditional histological criteria and consensus (see Chapter 1).

11. Comparison with other molecular signatures using different technologies [Paik et al., 2004; Ma et al., 2004].

12. Test of the developed classifier on other datasets, if publicly available.

Appendix A

Semiparametric Regression Models: Additional Topics

In this chapter, a number of additional topics that arise in the practical application of semiparametric regression models are discussed.

A.1 Tied Data

The formula for the partial likelihood in (2.21) is valid only for data in which no two events occur at the same time. However, it is quite common for data to contain tied event times, so an alternative formula is needed to handle those situations. The most common alternative is the Breslow approximation, which works well with relatively few ties. When data are heavily tied, the approximation can be quite poor [Farewell and Prentice, 1980; Hsieh, 1995]. There exist better approximations, proposed by [Efron, 1977], as well as the exact and the discrete methods. More details are given in [Therneau and Grambsch, 2000].

A.2 Time-Dependent Covariates

Time-dependent covariates may change in value over the time of observation.
While it is simple to modify Cox's model to allow for time-dependent covariates, the computation of the resulting partial likelihood is much more time consuming. To modify the model in (2.15) to include time-dependent covariates, all we need to do is write (t) after the x's that are time dependent. For a model with one fixed covariate and one time-dependent covariate, we have

ln hi(t) = α(t) + β1 xi1 + β2 xi2(t)   (A.1)

The hazard at time t depends on the value of x1, and on the value of x2 at time t. x2(t) can be defined using any information about the individual prior to time t. The computation of each time-dependent covariate at each time t can be expensive.

A.3 Nonproportional Hazards

When time-dependent covariates are introduced in the Cox regression model, the assumption of proportional hazards is violated. Indeed, because the time-dependent covariates will change at different rates for different individuals, the ratios of their hazards cannot remain constant. However, we have seen in Section A.2 that the partial likelihood can handle such situations. Proportional hazards assumption violations for fixed covariates are equivalent to interactions between one or more covariates and time. The proportional hazards model assumes that the effect of each covariate is the same at all points in time. If the effect of a covariate varies with time, the proportional hazards assumption is violated for that variable.

Explicit Interaction Method

A common way of representing an interaction between two variables in a linear regression model is to include a new variable that is the product of the two variables in the equation (see Section A.2). To represent the interaction between a covariate x and time in a Cox model, we can write

ln h(t) = α(t) + β1 x + β2 xt   (A.2)

Factoring out x, we can write this as

ln h(t) = α(t) + (β1 + β2 t) x   (A.3)

In this equation the effect of x is β1 + β2 t.
If β2 is positive, the effect of x increases linearly with time; if it is negative, the effect of x decreases linearly with time. β1 can be interpreted as the effect of x at time 0, the origin of the process. This model can be easily estimated by defining a time-dependent covariate z = xt. It is also straightforward to include interactions between time and time-dependent covariates. Then, not only will the covariates change over time, but the effect of those covariates will also change over time. In order to test the proportional hazards assumption, a time-dependent covariate representing the interaction of the original covariate and time can be added to the model for any suspected covariate. If the interaction covariate has a significant coefficient, we have evidence for nonproportionality. Otherwise, we may conclude that the proportional hazards assumption is not violated.

Stratification Method

Another approach to nonproportionality is stratification, a technique that is most useful when the covariate that interacts with time is both categorical and not of direct interest. Let z be such a binary covariate and suppose that the effect of z varies with time. Alternatively, we can say that the shape of the hazard function is different according to z. Let x be another covariate of the model. The model can be written as

ln hi(t) = α0(t) + βxi   if z = 0
ln hi(t) = α1(t) + βxi   if z = 1

Notice that the coefficient of x is the same in both equations, but the arbitrary function of time is allowed to differ. We can combine the two equations into a single equation by writing

ln hi(t) = αz(t) + βxi

The model can be estimated by the method of partial likelihood using these steps:

1. Construct a separate partial likelihood function for each value of z.
2. Multiply those functions together.
3. Choose the values of β that maximize this function.
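The three steps above can be sketched numerically. The following is a minimal Python illustration of the stratified log partial likelihood with a single shared coefficient β, assuming no tied event times (Breslow-style risk sets); the function name and the toy data are our own:

```python
import numpy as np

def strat_log_partial_lik(beta, time, event, x, stratum):
    """Stratified Cox log partial likelihood: the per-stratum partial
    likelihoods are multiplied, i.e. their logs are summed, while the
    coefficient beta is shared across strata."""
    time, event = np.asarray(time, float), np.asarray(event)
    x, stratum = np.asarray(x, float), np.asarray(stratum)
    ll = 0.0
    for s in np.unique(stratum):               # step 1: one term per stratum
        keep = stratum == s
        t_s, e_s, x_s = time[keep], event[keep], x[keep]
        for i in np.flatnonzero(e_s == 1):
            risk = t_s >= t_s[i]               # risk set within the stratum only
            ll += beta * x_s[i] - np.log(np.exp(beta * x_s[risk]).sum())
    return ll                                  # step 2 done via the summed logs
```

Step 3 can then be carried out with any one-dimensional maximizer over β, for instance a coarse grid search followed by refinement. At β = 0 each event simply contributes minus the log of its within-stratum risk set size.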
If the coefficient of the covariate x is not significant in the model including x and z but is significant in the model including x and stratified by z, we can conclude that it is important to control for the effect of the covariate z. Compared to the explicit interaction method, the stratification method has two main advantages:

• The explicit interaction method requires choosing a particular form for the interaction, but stratification allows for any kind of change in the effect of a covariate over time.
• Stratification is easier to set up and is less expensive in computation time.

But there are also important disadvantages of stratification:

• There is no way to test for either the main effect of the stratifying covariate or its interaction with time. In particular, it is not legitimate to compare the log-likelihoods of models with and without a stratifying covariate [Allison, 1995].
• If the form of the interaction with time is correctly specified, the explicit interaction method should yield more efficient estimates of the coefficients of the other covariates.

A.4 Estimating Survivor Functions

The form of the dependence of the hazard on time is left unspecified in the proportional hazards model. Furthermore, the partial likelihood method discards the portion of the likelihood that contains information about the dependence of the hazard on time. Nevertheless, it is possible to get nonparametric estimates of the survivor function based on a fitted proportional hazards model. When there are no time-dependent covariates, the Cox model can be written as

S(t) = [S0(t)]^exp(βx)

where S(t) is the survival probability at time t for an individual with covariate values x, and S0(t) is the baseline survivor function, that is, the survivor function for an individual whose covariate values are all zero. After estimating β by partial likelihood, we can get an estimate of S0(t) by a nonparametric maximum likelihood method (see [Collett, 2003] for details).
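Given an estimate of the baseline survivor function on a grid of time points, the covariate-specific curve is then a simple power transform. A minimal sketch (names and numbers are illustrative):

```python
import numpy as np

def cox_survival(s0, beta, x):
    """Covariate-specific survivor curve S(t) = S0(t)^exp(beta'x)
    implied by a fitted proportional hazards model.
    s0: baseline survivor probabilities on a grid of times."""
    return np.asarray(s0, float) ** np.exp(np.dot(beta, x))
```

For example, when beta'x = ln 2, the relative hazard is doubled and each baseline survivor probability is simply squared.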
Appendix B Microarray Platforms

Overview of different microarray technologies.

Appendix C Probeset Annotations

Table C.1: Annotations for the 213 probesets selected after variable ranking (see Section 4.3.1). For each probeset, the table reports the probeset identifier, the accession number, the gene name, the gene symbol, the UniGene cluster and the probeset cluster (pclust.1 or pclust.2).

Appendix D Gene Ontology

The set of probesets used to construct the features (see Section 4.3.2) may contain tens or hundreds of genes. The common task is to translate this list of genes into a better understanding of the biological phenomena involved. Currently, this is done through a tedious combination of searches through the literature and a number of public databases. Fortunately, useful tools (e.g. [Draghici et al., 2003]) make it possible to annotate a list of genes automatically. To obtain biological information, all genes were annotated according to known function using the Gene Ontology Consortium categories [Ashburner et al., 2000]: biological process, cellular component and molecular function. The GO Consortium maintains a “dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing”. Such tools are used to obtain biological information about probesets (as in [Lacroix et al., 2004]) and, specifically in this thesis, about the probesets used in the classifier.

Bibliography

Affymetrix (2002). GeneChip Expression Analysis.

Akritas, M. G. (1986). Bootstrapping the Kaplan-Meier estimator. Journal of the American Statistical Association, 81:1032–1038.

Akritas, M. G. (1994). Nearest neighbor estimation of a bivariate distribution under random censoring. Annals of Statistics, 22:1299–1327.

Allison, P. D. (1995). Survival Analysis Using SAS: A Practical Guide. SAS Institute Inc.

Amaldi, E. and Kann, V. (1998).
On the approximation of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209:237–260.

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000). Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25:25–29.

Bair, E. and Tibshirani, R. (2004). Semi-supervised methods to predict patient survival from gene expression data. PLOS Biology, 2(4):511–522.

Blum, A. and Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245–271.

Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185–193.

Brown, M., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C., Ares, M., and Haussler, D. (1999). Support vector machine classification of microarray gene expression data. University of California, Santa Cruz and University of Bristol.

Chang, J. C., Wooten, E. C., Tsimelzon, A., Hilsenbeck, S. G., Gutierrez, M. C., Elledge, R., Mohsin, S., Osborne, C. K., Chamness, G. C., Allred, D. C., and O'Connell, P. (2003). Gene expression profiling predicts therapeutic response to docetaxel in breast cancer patients. Lancet, 362:362–369.

Collett, D. (2003). Modelling Survival Data in Medical Research. Chapman and Hall, second edition.

Coombes, R. C. and Hall, E. (2004). A randomized trial of exemestane after two to three years of tamoxifen therapy in postmenopausal women with primary breast cancer. New England Journal of Medicine, 350(11):1081–1092.

Cox, D. R. and Oakes, D. (1984). Analysis of Survival Data. Chapman and Hall (London).

Cox, D. R. (1972).
Regression models and life tables. Journal of the Royal Statistical Society Series B, 34:187–220.

De Smet, F., Mathys, J., Marchal, K., Thijs, G., De Moor, B., and Moreau, Y. (2002). Adaptive quality-based clustering of gene expression profiles. Bioinformatics, 18(5):735–746.

Draghici, S., Khatri, P., Bhavsar, P., Shah, A., Krawetz, S., and Tainsky, M. (2003). Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Research, 31(13):3775–3781.

Droesbeke, J. J. (1988). Elements de Statistique. Ellipses.

Duda, R. O., Hart, P. E., and Stork, D. G. (2001). Pattern Classification. John Wiley and Sons.

Dudoit, S., Fridlyand, J., and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77–87.

EBCT Collaborative Group (Early Breast Cancer Trialists' Collaborative Group) (1998). Polychemotherapy for early breast cancer: an overview of the randomized trials. Lancet, 352:930–942.

Efron, B. (1977). The efficiency of Cox's likelihood function for censored data. Journal of the American Statistical Association, 76:312–319.

Eifel, P., Axelson, J. A., Costa, J., Crowley, J., Curran, W. J., Deshler, A., Fulton, S., Hendricks, C. B., Kemeny, M., Kornblith, A. B., Louis, T. A., Markman, M., Mayer, R., and Roter, D. (2001). National Institutes of Health Consensus Development Conference statement: adjuvant therapy for breast cancer. J. Natl Cancer Inst., 93(13):979–989.

Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. PNAS, 95:14863–14868.

Farewell, V. T. and Prentice, R. L. (1980). The approximation of partial likelihood with emphasis on case-control studies. Biometrika, 67:273–278.

Fodor, S. P., Read, J. L., Pirrung, M. C., Stryer, L., Lu, A. T., and Solas, D. (1991).
Light-directed, spatially addressable parallel chemical synthesis. Science, 251:767–773.

Garfield, E. (1990). 100 most cited papers of all time. Current Contents.

Gautier, A., Moller, M., Friis-Hansen, L., and Knudsen, S. (2005). Alternative mapping of probes to genes for Affymetrix chips. BMC Bioinformatics, 5:111.

Gautier, L., Irizarry, R., Cope, L., and Bolstad, B. (2004). Description of affy. Technical report, Bioconductor.

Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J. Y., and Zhang, J. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biology, 5:R80.

Goldhirsch, A., Wood, W. C., Gelber, R. D., Coates, A. S., Thurlimann, B., and Senn, H. J. (2003). Meeting highlights: updated international expert consensus on the primary therapy of early breast cancer. J. Clin. Oncol., 21(17):3357–3365.

Goldhirsch, A., Glick, J. H., Gelber, R. D., and Senn, H. J. (1998). Meeting highlights: international consensus panel on the treatment of primary breast cancer. Journal of the National Cancer Institute, 90:1601–1608.

Goss, P. E. and Ingle, J. N. (2003). A randomized trial of letrozole in postmenopausal women after five years of tamoxifen therapy for early-stage breast cancer. New England Journal of Medicine, 349(19):1793–1802.

Greenwood, M. (1926). The errors of sampling of the survivorship tables. Reports on Public Health and Statistical Subjects, 33:1–26.

Gross, A. J. and Clark, V. A. (1975). Survival Distributions: Reliability Applications in the Biomedical Sciences. Wiley.

Gui, J. and Li, H. (2004). Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data.
Center for Bioinformatics and Molecular Biostatistics, paper L1Cox. http://repositories.cdlib.org/cbmb/L1Cox.

Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182.

Haibe-Kains, B. (2004). Breast cancer diagnosis using microarray. Master's thesis, ULB.

Hartemink, A. J., Gifford, D. K., Jaakkola, T. S., and Young, R. A. (2001). Maximum likelihood estimation of optimal scaling factors for expression array normalization. SPIE BiOS.

Hartigan, J. A. (1975). Clustering Algorithms. Wiley.

Hartmann, O., Samans, B., and Schafer, H. (2003). Low level analysis for Affymetrix GeneChips: normalization and quality control. Technical report, Institute of Medical Biometry and Epidemiology, Faculty of Medicine and Hospital, Philipps-University.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer.

Heagerty, P. J., Lumley, T., and Pepe, M. S. (2000). Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics, 56:337–344.

Holder, D., Rauberras, R. F., Pikounis, V. B., Svetnik, V., and Shoper, K. (2001). Statistical analysis of high density oligonucleotide arrays: a safer approach. Proceedings of the ASA Annual Meeting, Atlanta, GA.

Howell, A. and Cuzick, J. (2005). Results of the ATAC (Arimidex, Tamoxifen, Alone or in Combination) trial after completion of 5 years' adjuvant treatment for breast cancer. Lancet, 365(9453):60–62.

Hsieh, F. Y. (1995). A cautionary note on the analysis of extreme data with Cox regression. The American Statistician, 49:226–228.

Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., and Speed, T. P. (2003a). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31(4):e15.

Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., and Speed, T. P. (2003b).
Exploration, normalization, and summarization of high density oligonucleotide array probe level data. Biostatistics, 4(2):249–264.

Jansen, M., Foekens, J. A., van Staveren, I. L., Dirkzwager-Kiel, M. M., Ritstier, K., Look, M. P., van Gelder, M. E. M., Sieuwerts, A. M., Portengen, H., Dorssers, L. C., Klijn, J., and Berns, M. (2005). Molecular classification of tamoxifen-resistant breast carcinomas by gene expression profiling. Journal of Clinical Oncology, 23(4):732–740.

Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53:457–481.

Kohavi, R. and John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324.

Lacroix, M., Haibe-Kains, B., Laes, J. F., Hennuy, B., Lallemand, F., Gonze, I., Cardoso, F., Piccart, M., Leclercq, G., and Sotiriou, C. (2004). Gene regulation by phorbol 12-myristate 13-acetate (PMA) in two highly different breast cancer cell lines. Oncology Reports, 12(4):701–707.

Lacroix, M. and Leclercq, G. (2004). Relevance of breast cancer cell lines as models for breast tumors: an update. Breast Cancer Res and Treat, 415:530–536.

Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., and Brown, E. L. (1996). Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotech., 14:1675–1680.

Ma, X. J., Wang, Z., Ryan, P. D., Isakoff, S. J., Barmettler, A., Fuller, A., Muir, B., Mohapatra, G., Salunga, R., Tuggle, J. T., Tran, Y., Tran, D., Tassin, A., Amon, P., Wang, W., Wang, W., Enright, E., Stecker, K., Estepa-Sabal, E., Smith, B., Younger, J., Balis, U., Michaelson, J., Bhan, A., Habin, K., Baer, T. M., Brugge, J., Haber, D. A., Erlander, M. G., and Sgroi, D. S. (2004). A two-gene expression ratio predicts clinical outcome in breast cancer patients treated with tamoxifen. Cancer Cell, 5:607–616.

Mason, S. J. and Graham, N. E. (1982).
Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: statistical significance and interpretation. Q. J. R. Meteorol. Soc., 30:291–303.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman and Hall.

Meier, P. (1975). Estimation of a distribution function from incomplete observations. Perspectives in Probability and Statistics, pages 67–87.

Michiels, S., Koscielny, S., and Hill, C. (2005). Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, 365:488–492.

Mitchell, T. (1997). Machine Learning. McGraw-Hill.

Naef, F., Lim, D. A., and Magnasco, M. O. (2001). From features to expression: high density oligonucleotide array analysis revisited. Technical report, Institut fur Hydromechanik und Wasserwirtschaft.

Paik, S., Shak, S., Tang, G., Kim, C., Bakker, J., Cronin, M., Baehner, F. L., Walker, M. G., Watson, D., Park, T., Hiller, W., Fisher, E. R., Wickerham, D. L., Bryant, J., and Wolmark, N. (2004). A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. New England Journal of Medicine, 351:2817–2826.

Ploner, A., Miller, L. D., Hall, P., Bergh, J., and Pawitan, Y. (2005). Correlation test to assess low-level processing of high-density oligonucleotide microarray data. BMC Bioinformatics.

R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Rossi, P. H., Berk, R. A., and Lenihan, K. J. (1980). Money, Work and Crime: Some Experimental Results. New York: Academic Press Inc.

Shedden, K., Chen, W., Kuick, R., Ghosh, D., Macdonald, J., Cho, K. R., Giordano, T. J., Gruber, S. B., Fearon, E. R., Taylor, J. M., and Hanash, S. (2005). Comparison of seven methods for producing Affymetrix expression scores based on false discovery rate in disease profiling data. BMC Bioinformatics, 6:26.

Shipp, M. A., Ross, K. N., and Tamayo, P.
(2002). Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling. Nature Medicine, 8:68–74.

Sotiriou, C., Wirapati, P., Loi, S., Harris, A., Bergh, J., Smeds, J., Farmer, P., Praz, V., Haibe-Kains, B., Lallemand, F., Buyse, M., Piccart, M., and Delorenzi, M. (2005). Gene expression profiling in breast cancer challenges the existence of intermediate histological grade. Submitted.

Therneau, T. M. and Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model. Springer.

Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine, 16:385–395.

Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.

van de Vijver, M. J., He, Y. D., van't Veer, L., Dai, H., Hart, A. M., Voskuil, D. W., Schreiber, G. J., Peterse, J. L., Roberts, C., Marton, M. J., Parrish, M., Atsma, D., Witteveen, A., Glas, A., Delahaye, L., van der Velde, T., Bartelink, H., Rodenhuis, S., Rutgers, E. T., Friend, S. H., and Bernards, R. (2002). A gene expression signature as a predictor of survival in breast cancer. The New England Journal of Medicine, 347(25):1999–2009.

van't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415:530–536.

Vittinghoff, E., Glidden, D. V., Shiboski, S. C., and McCulloch, C. E. (2005). Regression Methods in Biostatistics: Linear, Logistic, Survival and Repeated Measures Models. Springer.

Yu, H., Luscombe, N. M., Qian, J., and Gerstein, M. (2003). Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends in Genetics, 19(8):422–427.
