ESCUELA TÉCNICA SUPERIOR DE INGENIEROS INFORMÁTICOS
UNIVERSIDAD POLITÉCNICA DE MADRID

PHD THESIS
ARTIFICIAL INTELLIGENCE

NEW DEFINITIONS OF MARGIN FOR MULTI-CLASS BOOSTING ALGORITHMS

AUTHOR: Antonio Fernández Baldera
ADVISORS: Luis Baumela Molina, Jose Miguel Buenaposada Biencinto

JULY, 2015

Acknowledgements

During the time it takes to complete doctoral studies one comes to feel the support of so many people that it is difficult to include them all in a few lines. In any case, I will try to do a good exercise of memory and gather them all in this dedication. First of all, I feel obliged to thank Luis Baumela for all the effort (and patience) he has devoted to my training, which has led to the completion of this thesis. Thank you for having been there helping me in so many ways: proposing ideas, polishing the writing, teaching me how to handle administrative procedures, providing material, listening to proposals, guiding objectives, etc. It is certainly impossible not to learn with a person like him working by your side. Truly, many thanks. Likewise, I cannot fail to thank Jose Miguel Buenaposada for all his comments, both theoretical and experimental. How easy it is to stay enthusiastic about an idea when you have a colleague like him. It goes without saying that part of the goals achieved in this document were reached thanks to his advice. Many, many thanks. My thanks also go to my labmates, Juan, Yadira, Pablo, David and Kendrick, for the good times we have shared. Outside the group, thanks also to Monse and Elena for brightening up the coffee breaks. Thanks likewise to Antonio Valdés for contributing his mathematical vision to our seminars, and to Aleix Martínez for making room for me in his extraordinary group.
Thanks to my whole family, of course, for always being there. Thank you for making me feel more confident than ever in my academic potential, especially after so many years of study. Thanks also to all the friends who during this time have shared so many good moments with me and have likewise known how to be there when I needed them. Thank you for your support. To the people of my home town: Rocha, Galea, Jose Luis and Charly; to my compadres from Badajoz: Felix and Javi; to my former colleagues at the INE: Alberto, Paz and Cristina; to my friends from San Sebastián de los Reyes: Álvaro, Sergio and Jose; to my former master's classmates: Arturo, Diego, Ernesto, Victor, Ghislain, Pablo and Raúl; to my people in Las Rozas: Cristina, Rosa and Carlos; and, of course, to my friends in Columbus: Fabián, Felipe and Adriana (and all her family). Likewise, thanks to the many other friends not listed in these lines who have also made me feel fortunate during these years. As could not be otherwise, my most affectionate thanks go to Aili Cutie Báez. Reasons abound. Finally, I also wish to thank project TIN2008-06815-C02-02 of the Ministerio de Economía y Competitividad; without it this research would not have been possible.

Abstract

The family of Boosting algorithms represents a type of classification and regression approach that has proven very effective in Computer Vision problems, such as the detection, tracking and recognition of faces, people, deformable objects and actions. The first and most popular algorithm, AdaBoost, was introduced in the context of binary classification. Since then, many works have been proposed to extend it to more general domains: multi-class, multi-label, cost-sensitive, etc. Our interest centers on extending AdaBoost to two problems in the multi-class field, considering this a first step towards upcoming generalizations.
In this dissertation we propose two Boosting algorithms for multi-class classification based on new generalizations of the concept of margin. The first of them, PIBoost, is conceived to tackle the multi-class problem by solving many binary sub-problems. We use a vectorial codification to represent class labels and a multi-class exponential loss function to evaluate classifier responses. This representation produces a set of margin values that provide a range of penalties for failures and rewards for successes. The stagewise optimization of this model introduces an asymmetric Boosting procedure whose costs depend on the number of classes separated by each weak learner. In this way the Boosting procedure takes class imbalances into account when building the ensemble. The resulting algorithm is a well-grounded method that canonically extends the original AdaBoost. The second algorithm proposed, BAdaCost, is conceived for multi-class problems endowed with a cost matrix. Motivated by the scarcity of cost-sensitive extensions of AdaBoost to the multi-class field, we propose a new margin that, in turn, yields a new loss function appropriate for evaluating costs. Since BAdaCost generalizes the SAMME, Cost-Sensitive AdaBoost and PIBoost algorithms, we consider it a canonical extension of AdaBoost to this kind of problem. We additionally suggest a simple procedure to compute cost matrices that improve the performance of Boosting in standard and unbalanced problems. A set of experiments is carried out to demonstrate the effectiveness of both methods against other relevant Boosting algorithms in their respective areas. In the experiments we resort to benchmark data sets used in the Machine Learning community, firstly to minimize classification errors and secondly to minimize costs. In addition, we successfully applied BAdaCost to a segmentation task, a representative problem involving imbalanced data.
We conclude the thesis by outlining the future improvements encompassed by our framework, given its applicability and theoretical flexibility.

Keywords: Machine Learning, AdaBoost, Multi-class Boosting, Margin-based classifiers, Cost-sensitive learning, Imbalanced data.

Resumen

The family of Boosting algorithms comprises classification and regression techniques that have proven very effective in Computer Vision problems, such as the detection, tracking and recognition of faces, people, deformable objects and actions. The first and most popular Boosting algorithm, AdaBoost, was conceived for binary problems. Since then, many proposals have appeared aiming to carry it to more general domains: multi-class, multi-label, cost-sensitive, etc. Our interest centers on extending AdaBoost to multi-class classification, considering this a first step towards further generalizations. In this thesis we propose two Boosting algorithms for multi-class problems based on new derivations of the concept of margin. The first of them, PIBoost, is conceived to address the problem by decomposing it into binary sub-problems. On the one hand, we use a vectorial encoding to represent labels; on the other, we use a multi-class exponential loss function to evaluate responses. This encoding produces a set of margin values that entail a range of penalties for failures and rewards for successes. The stagewise optimization of the model generates an asymmetric Boosting process whose costs depend on the number of labels separated by each weak learner. In this way our Boosting algorithm takes class imbalance into account when building the classifier. The result is a well-grounded method that canonically extends the original AdaBoost.
The second proposed algorithm, BAdaCost, is conceived for multi-class problems endowed with a cost matrix. Motivated by the scarcity of works devoted to generalizing AdaBoost to the cost-sensitive multi-class setting, we propose a new concept of margin which, in turn, yields a loss function suitable for evaluating costs. We regard our algorithm as the most canonical extension of AdaBoost to this kind of problem, since it generalizes the SAMME, Cost-Sensitive AdaBoost and PIBoost algorithms. In addition, we suggest a simple procedure for computing cost matrices that improve the performance of Boosting on standard problems and on problems with imbalanced data. A series of experiments demonstrates the effectiveness of both methods against other well-known multi-class Boosting algorithms in their respective areas. These experiments use benchmark data sets from the Machine Learning community, first to minimize classification errors and then to minimize costs. We also applied BAdaCost successfully to a segmentation task, a particular case of problem with imbalanced data. We conclude by justifying the future potential of the presented framework, both for its applicability and for its theoretical flexibility.

Keywords: Machine Learning, AdaBoost, Multi-class Boosting, Margin-based classifiers, Cost-sensitive learning, Imbalanced data.

Contents

1 Introduction
  1.1 Motivation
  1.2 Main Contributions
  1.3 Structure of the Thesis
2 Background on Boosting
  2.1 Background on Machine Learning
    2.1.1 Supervised Classification Problems
  2.2 Binary Boosting: AdaBoost
    2.2.1 Understanding AdaBoost
    2.2.2 Statistical View of Boosting
  2.3 Multi-class Boosting
    2.3.1 Boosting algorithms based on binary weak-learners
    2.3.2 Boosting algorithms based on vectorial encoding
  2.4 Cost-sensitive binary Boosting
  2.5 Other perspectives of Boosting
3 Partially Informative Boosting
  3.1 Multi-class margin extension
  3.2 PIBoost
    3.2.1 AdaBoost as a special case of PIBoost
    3.2.2 Asymmetric treatment of partial information
    3.2.3 Common sense pattern
  3.3 Related work
  3.4 Experiments
4 Multi-class Cost-sensitive Boosting
  4.1 Cost-sensitive multi-class Boosting
    4.1.1 Previous works
    4.1.2 New margin for cost-sensitive classification
  4.2 BAdaCost: Boosting Adapted for Cost-matrix
    4.2.1 Direct generalizations
  4.3 Experiments
    4.3.1 Cost matrix construction
    4.3.2 Minimizing costs: UCI repository
    4.3.3 Unbalanced Data: Synapse and Mitochondria segmentation
5 Conclusions
  5.1 Future work
    5.1.1 New theoretical scopes
    5.1.2 Other scopes of supervised learning
A Proofs
  A.1 Proof of expression (3.3)
  A.2 Proof of Lemma 1
  A.3 Proof of Lemma 2
  A.4 Proof of Corollary 1
  A.5 Proof of Corollary 2
  A.6 Proof of Corollary 3

List of Figures

3.1 Values of the Exponential Loss Function over margins, z, for a classification problem with 4 classes. Possible margin values are obtained taking into account expression (3.5) for s = 1 and s = 2.
3.2 Margin vectors for a problem with three classes. The left figure presents the set of vectors Y; the right plot presents the set Ŷ.
3.3 Plots comparing the performances of Boosting algorithms. The vertical axis displays the error rate; the horizontal axis displays the number of weak-learners fitted for each algorithm.
3.4 Diagram of the Nemenyi test. The average rank for each method is marked on the segment. Critical differences for both the α = 0.05 and α = 0.1 significance levels are shown at the top. Algorithms with no significantly different performance are grouped with a thick blue line.
3.5 Plot comparing the performances of Boosting algorithms for the Amazon data base. The vertical axis displays the error rate; the horizontal axis displays the number of weak-learners fitted for each algorithm.
4.1 Comparison of ranks through the Bonferroni-Dunn test. BAdaCost's average rank is taken as reference. Algorithms significantly worse than our method at a significance level of 0.10 are joined with a blue line.
4.2 Example of a segmented image. In b), green pixels belong to mitochondria while red ones belong to synapses. Figure c) shows an estimation.
4.3 Brain images experiment with a heavily unbalanced data set. Training and testing error rates along the iterations are shown for each algorithm.

List of Tables

3.1 Cost matrix associated to a PIBoost separator of a set S with s = |S| classes.
3.2 An example of encoding matrix for PIBoost's weak learners when K = 4 and G = {all single labels} ∪ {all pairs of labels}.
3.3 Comparison of the main properties of ECOC-based algorithms and PIBoost. µm(l) denotes the coloring function µm : L → {±1} at iteration m. R denotes the length of "code-words". In AdaBoost.OC, Īm(x) denotes I(hm(x) = µm(l)).
3.4 Summary of selected UCI data sets.
3.5 Number of iterations considered for each Boosting algorithm. The first column displays the data base name with the number of classes in parenthesis. Columns two to six display the number of iterations of each algorithm. For PIBoost(2) the number of separators per iteration appears inside brackets. The last column displays the number of weak-learners used for each data base.
3.6 Error rates of GentleBoost, AdaBoost.MH, SAMME, PIBoost(1) and PIBoost(2) algorithms for each data set in table 3.4. Standard deviations appear inside parentheses in 10^-4 scale.
Bold values represent the best result achieved for each data base.
3.7 P-values corresponding to Wilcoxon matched-pairs signed-ranks test.
4.1 Summary of selected UCI data sets.
4.2 Classification cost rates of Ada.C2M1, MultiBoost, Lp-CSB, and BAdaCost algorithms for each data set after 100 iterations. Standard deviations appear inside parentheses in 10^-4 scale. Bold values represent the best result achieved for each data base.
4.3 Error rates of the five algorithms after the last iteration.

Chapter 1

Introduction

The emergence of the personal computer and its ever-increasing computing power has brought a new range of applications into our lives. The more computational power available, the closer we come to transferring human reasoning to artificial environments. This pursuit inevitably gave rise to a field of science: Artificial Intelligence (AI). AI comprises many disciplines, such as Robotic Systems, Mechatronics, Ontology, Bio-Informatics, Computer Vision, Data Mining, Pattern Recognition, Machine Learning, etc. The last three have been driven by statistical advances and have received an enormous amount of attention in the last three decades. The contents of this dissertation are set within the context of Machine Learning and Pattern Recognition; both disciplines will be considered synonymous hereafter. When introducing Machine Learning a first question always arises: what should be understood by learning? This is an extraordinarily complex question, almost as difficult as defining the concept of intelligence. Such an endeavour could take up the whole thesis, so we will bear in mind the following simple definition: learning is the act of acquiring new knowledge by synthesizing information.
It is surely an incomplete definition, since for humans one should add the modification of existing knowledge; moreover, behaviours and skills are also part of the process of learning. In any case, we find this definition good enough to bridge the gap between human learning and Machine Learning. This field of AI encompasses situations where an automatic decision must be taken wisely under certain restrictions. In other words, its aim is "teaching machines" to make reasonable decisions. We will pay attention only to this perspective of learning: situations where one has to select the right option based on a set of restrictions. This understanding of choosing "right decisions" leads to the concept of a classification problem. Roughly speaking, the structure of a classification problem is as simple as this: choose the "best" answer from a finite set of discrete options, based on some input data and some sample choices. Several well-established techniques, grounded in different statistical foundations, have been proposed to solve it. Some examples are Support Vector Machines, Bayesian Networks, Neural Networks, and ensemble algorithms, the latter being the scope of Boosting, the topic of this thesis.

1.1 Motivation

Boosting algorithms are learning schemes that produce an accurate or strong classifier by combining a set of simple base prediction rules or weak learners. Their popularity rests not only on the fact that it is often much easier to devise a simple but inaccurate prediction rule than to build a highly accurate classifier, but also on the successful practical results and good theoretical properties of these algorithms. This philosophy of learning has found applications in many fields of science, especially in Computer Vision, where there is a plethora of works grounded in it [16, 90, 42, 11, 54].
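The appeal of combining weak rules can be quantified with a back-of-the-envelope calculation. Assuming, hypothetically, T independent classifiers, each correct with probability p > 0.5, a majority vote is correct with the probability given by a binomial tail. Real weak learners are correlated, so this sketch only conveys the intuition behind ensembles, not a property of Boosting itself:

```python
from math import comb

def majority_vote_accuracy(T, p):
    """Probability that a majority of T independent classifiers,
    each correct with probability p, votes for the right answer (T odd)."""
    k_min = T // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(T, k) * p**k * (1 - p)**(T - k)
               for k in range(k_min, T + 1))

# A single weak rule that is right 60% of the time...
single = 0.6
# ...becomes noticeably more reliable when 11 such rules vote.
ensemble = majority_vote_accuracy(11, single)
print(f"single: {single:.2f}, ensemble of 11: {ensemble:.4f}")  # about 0.7535
```

With more voters, or more accurate ones, the tail probability grows further; the independence assumption is what Boosting tries to approximate by forcing successive weak learners to focus on different examples.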
The most well-known Boosting algorithm, AdaBoost, was introduced in the context of two-class (binary) classification. It works in an iterative way. First, a weight distribution is defined over the training set. Then, at each iteration, the best weak learner according to the weight distribution is selected and combined with the previously selected weak learners to form the strong classifier. Weights are then updated to decrease the importance of correctly classified samples, so the algorithm tends to concentrate on the "difficult" examples. Classification problems with more than two possible labels (multi-class problems) are very common in practice. When the nature of such a task suggests decomposing it into binary subproblems, it may be quite convenient to use Boosting [80, 78, 17, 3, 35, 95]. Although many Boosting algorithms have been proposed to address this issue [80, 78, 35, 3], none of them can be considered a canonical generalization of the original AdaBoost [26] in the sense of evaluating binary information under a multi-class loss function. In this thesis we provide a canonical extension of AdaBoost based on the well-known statistical interpretation of Boosting [29]. The resulting method connects with a common-sense pattern, namely discarding options in order to select the correct one; a pattern that, in turn, brings this computational process closer to human reasoning. Furthermore, the multi-class classification problem at hand may present different costs for misclassifications. Usually this information can be encoded using a cost matrix. The inclusion of a cost matrix can be considered either the goal of the problem, if we are interested in optimizing a cost-based loss function, or a tool for other purposes. In the second case, it turns out to be helpful for addressing problems with label-dependent unbalanced data, ordinal problems (i.e., multi-class problems arising from the discretization of a continuous variable) or problems with hard-to-fit decision boundaries.
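The way a cost matrix steers decisions can be illustrated with the standard minimum-expected-cost rule: given a matrix C, where C[l, k] is the cost of predicting k when the true label is l, one predicts the label minimizing the posterior-weighted cost. The matrix and posterior below are made up for illustration; this is not the BAdaCost algorithm itself:

```python
import numpy as np

# Hypothetical 3-class cost matrix: C[l, k] = cost of predicting k
# when the true label is l. The diagonal (correct decisions) is cost-free.
C = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0],
              [8.0, 1.0, 0.0]])   # mistaking class 2 for class 0 is heavily penalized

def min_risk_prediction(posterior, C):
    """Pick the label with minimum expected cost given P(l|x)."""
    expected_cost = posterior @ C      # expected_cost[k] = sum_l P(l|x) * C[l, k]
    return int(np.argmin(expected_cost))

posterior = np.array([0.2, 0.2, 0.6])  # P(l|x) for classes 0, 1, 2
print(min_risk_prediction(posterior, C))  # cost-sensitive decision: 1
print(int(np.argmax(posterior)))          # plain 0/1-loss decision: 2
```

Note how the two rules disagree: the most probable label is 2, but the risk of confusing it with the heavily penalized class 0 pushes the cost-sensitive decision towards the safer label 1.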
Cost-sensitive classification has been exploited for problems with two labels [94, 93, 60, 23, 33, 50, 89, 86, 63], but not much research has been devoted to problems with more than two [58, 102, 86]. This lack of study motivated us to develop a Boosting algorithm that captures the essence of the most relevant generalizations of AdaBoost from both the multi-class and the cost-sensitive perspectives. The algorithms we introduce in this thesis, PIBoost and BAdaCost, address respectively the problems of multi-class Boosting and multi-class Boosting with costs. They may be used to solve many relevant problems such as multi-class and minimum-risk classification, object detection, image segmentation, etc.

1.2 Main Contributions

In the present thesis we focus on the concept of margin as the cornerstone of the exponential loss function. The core message of the dissertation is the following:

Hypothesis: The use of suitable margin vectors yields multi-class Boosting algorithms able to manage binary information or costs in a canonical way via the exponential loss function.

This idea will be developed throughout the chapters. Here we summarize our main contributions:

• We propose two definitions of margin vectors. Firstly, we introduce margin vectors suited for multi-class problems based on binary information. Secondly, our proposal deals with cost-sensitive problems: given a cost matrix for a multi-class classification problem we introduce another set of margin vectors. Both types of vectors are introduced jointly with an exponential loss function.

• Based on the first extension of margin, we introduce a new multi-class Boosting algorithm, PIBoost [24]. We decompose the classification problem into binary sub-problems that are evaluated with a multi-class loss function. The margin values obtained with the new encoding produce a broad range of penalties and rewards.
The stagewise optimization of this model introduces an asymmetric Boosting procedure whose costs depend on the number of classes separated by each weak learner. We regard PIBoost as a canonical extension of AdaBoost to the multi-class case using binary weak learners. With our proposal we lay out a framework that allows managing complex problems by translating them into the binary case in the purest Boosting fashion.

• Based on the second extension of margin, we introduce a cost-sensitive multi-class Boosting algorithm, BAdaCost. We consider BAdaCost a canonical generalization of AdaBoost to the multi-class cost-sensitive case using multi-class weak learners. In fact, we generalize SAMME [112] and Cost-sensitive AdaBoost [63], two direct extensions of AdaBoost for multi-class and binary cost-sensitive problems respectively. These results are discussed in Chapter 4.

1.3 Structure of the Thesis

The remainder of the thesis is organized as follows:

• Chapter 2 introduces Boosting. Firstly, we describe some basics of Machine Learning and then present the AdaBoost algorithm from the two most common perspectives, namely the error-bounding one and the statistical one. We include a summary of multi-class Boosting methods, highlighting those derived from the concept of margin, followed by a summary of binary cost-sensitive Boosting algorithms. Other interesting perspectives of Boosting are also included.

• Chapter 3 presents our algorithm PIBoost for multi-class problems. Here we introduce the new set of margin vectors on which we ground our algorithm. We compare PIBoost with previous Boosting algorithms based on binary information. Experiments showing the performance of PIBoost against other relevant Boosting algorithms are included.

• Chapter 4 describes our multi-class cost-sensitive algorithm BAdaCost. Again, we introduce a set of margin vectors from which we derive our algorithm.
We explain BAdaCost's structure in detail and compare it with previous relevant works in the area. Finally, a set of experiments shows the applicability of BAdaCost from different perspectives.

• Chapter 5 summarizes the most relevant conclusions of our work together with new lines of ongoing research.

• In the Appendix at the end of the document we provide the proofs of the results in the thesis.

Chapter 2

Background on Boosting

Most learning paradigms are based on acquiring a single model that yields a powerful classifier after training. This process usually requires a high computational cost to produce accurate results (SVM, Bayesian Networks, Deep Learning, etc.). With luck, the process can be parallelized and the computational burden distributed. When such capabilities are not at hand, it would be desirable to have some mechanism able to produce good classifications based on a combination of not-so-good, but usually faster, ones. In most cases we can build simple rules of thumb, called weak predictions, that partially solve the problem by analyzing a small group of predictive variables. Hence, we may wonder: is it possible to use weak predictions to build a powerful classifier in a stepwise fashion? Besides, can we improve it by learning from previous mistakes? If so, can we do it without losing previous gains in accuracy? The Boosting philosophy of learning answers these questions. This chapter serves as a guide to Boosting and some of its most important extensions. It is organized as follows. In the first section we review some basics of supervised learning; in particular, we describe our problems of interest and the concept of meta-classifier. Section 2.2 is devoted to the origin of Boosting and to AdaBoost, the most important algorithm in the field. A comprehensive description of the method is carried out, paying special attention to its statistical interpretation.
In section 2.3 we review the most relevant extensions of AdaBoost to multi-class problems, focusing on those developed upon the concept of margin. Section 2.4 reviews the most relevant generalizations proposed to tackle cost-sensitive problems. Finally, a short summary of other interesting topics covered by Boosting closes the chapter in section 2.5.

2.1 Background on Machine Learning

There is a set of basic concepts that should be known beforehand in order to frame Boosting and our algorithms conveniently. Let us describe these elements together with our terminology for classification problems. Firstly, we will assume without loss of generality that the domain, X, of predictive variables (features) of a problem is a subset of R^D. In general X is not required to be a vectorial subspace of R^D; in fact, depending on the nature of the problem, some operations may not even make sense. Nonetheless, X is usually endowed with the structure of a vectorial space. Once the variables are given, we call the number of them the dimensionality, i.e. the minimum value D for which X ⊆ R^D. The term class (objective variable) will refer to the qualitative or numeric feature that is the goal of the prediction. In the following we will consider problems where the class variable can be encoded with a finite number of values, called labels. Hence, regression problems are out of the scope of the thesis. The set of labels, L, should be exhaustive; in other words, each and every one of these values should be found in the data, and no other value is needed. We will assume that the mutually-exclusive property is satisfied, which implies that labels are well defined, i.e. no semantic overlapping arises. An instance, (x, l), is expected to have exactly one label. In subsection 2.1.1 we will discuss a classification problem that does not meet this property.
In the following, the concepts of class and label will sometimes be used interchangeably; in those situations the context will leave no ambiguity. Finally, a classifier (learner, hypothesis) will be any function H : X → L that assigns one label (and only one) to each instance x ∈ X. In some cases the set L could be replaced by P(L), the set of subsets of L, or by {S, S^C} for a subset S ⊂ L. The broad set of methods included in the world of Machine Learning can be divided into many categories according to different perspectives. Here we summarize some of them, emphasizing the area of our interest. Based on the availability of knowledge about the objective variable we can establish a first division: Unsupervised Learning (cluster analysis), when there is no knowledge about labels in the data; Semi-supervised Learning, when some instances are unlabelled; and Supervised Learning, when there is complete knowledge about the class in the data. The algorithms we present lie in the latter. A second important distinction comes from the necessity of estimating a priori probabilities to apply a method. Given an observation, x, the Bayes rule for predicting a label l is P(l|x) = P(x|l)P(l)P(x)^-1, which in turn is proportional to P(x|l)P(l). One can estimate these values either directly or using the above formula, which implies estimating P(x|l) and P(l). This second option needs to fit a probability distribution for each label and to estimate the a priori probabilities. Algorithms tackling the classification problem in this fashion are called generative, while those that fit P(l|x) directly are called discriminative, which is the case of Boosting algorithms. An additional classification can be established depending on the theoretical nature of the method. It is usual to call a paradigm any approach based on a different conception of the learning process. Paradigms have particular motivations: bio-inspired, probabilistic, geometrical, "case-based", etc.
Popular paradigms are: Neural Networks, Lazy Learners (K-NN and its variants), Bayesian Classifiers (Naive Bayes, TAN, etc.), Support Vector Machines, Logistic Regression, Classification Trees and Rule-based algorithms. Some of these methodologies are supported by theories about universal approximators, i.e. functions that can theoretically approach any objective function up to a specified degree of error. Boosting algorithms are neither a paradigm per se nor can they be included in any paradigm. Rather, they belong to the group of meta-classifiers or, as it is better known today, ensemble learning. Let us describe some essential ideas about meta-classifiers; for a complete introduction see L. Kuncheva's book [47]. Roughly speaking, we define a meta-classifier as any algorithm that builds a classification rule based on the results provided by various classifiers. Following an initial intuition, one should expect the classifiers to be heterogeneous in order to obtain a good final decision using an ensemble. It can be proved theoretically that this combination is especially suited to reduce either the bias of the general classification error or its associated variance [7]. This heterogeneity does not imply poor predictions for the compounding learners; moreover, it is desirable for each of them to have reasonably good accuracy. Since one may expect to combine a large number of classifiers, an efficient type of base algorithm should be used. In the same way, it would be desirable to guarantee, if possible, the management of learners specialized in sub-domains of the data. To build a meta-classifier one can use the same data in every base classifier (which is the usual situation) or develop a particular classifier for every single variable and then derive a global decision. The latter is often the case when working with magnitudes from different contexts: weight, length, pressure, temperature, sound intensity, etc.
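The per-variable strategy just mentioned can be sketched in a few lines; the features, thresholds and data below are hypothetical, chosen only to illustrate the idea of one rule per magnitude combined by vote:

```python
# A toy meta-classifier: one threshold rule per feature (say weight, length,
# temperature), combined by unweighted majority vote.

def make_stump(feature_idx, threshold):
    """Single-variable rule: predict +1 if the feature exceeds the threshold."""
    return lambda x: 1 if x[feature_idx] > threshold else -1

stumps = [make_stump(0, 50.0),   # rule on feature 0 (e.g. weight)
          make_stump(1, 1.60),   # rule on feature 1 (e.g. length)
          make_stump(2, 36.8)]   # rule on feature 2 (e.g. temperature)

def meta_classify(x):
    """Global decision: majority vote over the per-variable rules."""
    votes = sum(h(x) for h in stumps)
    return 1 if votes > 0 else -1

print(meta_classify([72.0, 1.75, 36.5]))   # two of three rules vote +1, so +1
```

Weighted votes, hierarchies or cascades refine this basic scheme, which is what the families of meta-classifiers discussed next do in different ways.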
Taking into account the structure of the meta-classifier, which will depend on the type of problem, we can organize the compounding classifiers in series, in parallel, or following a hierarchical structure. Finally, meta-classifiers can also be divided into two categories depending on whether they use the same paradigm or different ones to generate the base classifiers. We are interested in the first case. Important families of algorithms using the same base paradigm are: Bagging [8], algorithms derived from randomization (for instance, Random Forests [9]), cascade-shaped classifiers [100] and, finally, those derived by applying Boosting [28, 29]. We complete the section by describing the supervised problems of our interest. For more information about the above topics we recommend the books of J. Friedman et al. [30], R. Duda et al. [19], A. J. Izenman [41], C. M. Bishop [7], and I. H. Witten et al. [104].

2.1.1 Supervised Classification Problems

The supervised classification problems of our interest are divided into three groups: binary, multi-class and multi-label, each generalizing its predecessor. These classification problems will be assumed cost-free. Although the algorithms proposed in this thesis are conceived for multi-class problems, we include the multi-label case bearing in mind the strong connections it has with our multi-class proposal and with other previous works on multi-class classification. In the first place, binary classification problems are the simplest: the class variable presents two possible labels, usually denoted by L = {−1, 1}, {1, 0} or L = {1, 2}. Secondly, multi-class problems differ from the previous case in that L has more than two elements, usually denoted with natural numbers, L = {1, 2, . . . , K}. Although its name may suggest otherwise, in this kind of classification problem there is only one class variable. Finally, multi-label problems also involve a finite set of labels L = {1, 2, . . .
, K}, but every instance x has an associated subset Lx ⊆ L of labels, which justifies the name. In other words, instances may present more than a single label. Thus, this kind of classification is accomplished through a function H : X → P(L), where P(L) is the label powerset of the problem. There may be instances for which no label, or all labels, are assigned. Let us consider the following illustrative example:

INSTANCE  LABEL1  LABEL2  LABEL3  LABEL4
   1        X       X
   2                X               X
   3        X               X
   4                        X
   5        X                       X

With the aim of simplifying multi-label problems, we describe below a couple of widely used strategies. These procedures are also used to tackle multi-class problems, as we highlight in section 2.3.

1. Binary Relevance. The idea is to fit a binary classifier for each label (presence/absence) and then map the results onto P(L). This implies transforming the original data set into K theoretically independent data sets. For example, Label 3 in the above table would become:

INSTANCE  LABEL3
   1        0
   2        0
   3        1
   4        1
   5        0

2. Extending the data set. The idea behind this transformation is simple. Every instance is repeated K times and a new feature carrying each of the possible labels is appended, so that each repetition has the shape "feature-label". The new class variable then becomes binary, encoding the presence/absence of the label with +1/−1 respectively. Thus there will be N · K observations of the kind (xn, j, ±1), with 1 ≤ n ≤ N and 1 ≤ j ≤ K. The first instance in the example gives rise to:

INSTANCE  LABEL  PRESENCE
   1        1       +1
   1        2       +1
   1        3       −1
   1        4       −1

2.2 Binary Boosting: AdaBoost

Let us introduce this methodology starting from its origin. Boosting is a technique proposed by R. Schapire [77] at the end of the eighties. The main idea in his work was to combine the efficiency of many poor classifiers (weak learners) into a unique powerful classifier (strong learner).
Roughly speaking, we can define Boosting as a general methodology for converting rough rules of thumb into a high-quality prediction rule. Schapire's conception of the problem finds its basis in the work introduced by L. Valiant [97] in 1984 on the probably approximately correct model of learning (PAC learning). The implicit goal of this avenue of research was to combine the computational power available at the time with the most recent theories about learnability. Computational learning theory was then conceived as a complex branch of Computer Science where statistics, inductive inference and information theory were mixed with a novel ingredient, computational complexity. At the time, Valiant's work on PAC learning was considered an essential step towards an adequate mathematical framework where both the statistical and the computational views of learning had a place. Successive works of M. Kearns and L. Valiant [44, 45] kept open the question of whether a weak learning algorithm (say, one with accuracy slightly better than random guessing) can be "boosted" into a strong learning algorithm (one with arbitrarily high accuracy). The most important theorem on this topic is due to Schapire [77] and came to solve the problem: his result states the equivalence between both types of learnability. Specifically, Schapire proved that a class of target functions is strongly PAC-learnable if and only if it is weakly learnable. It is worth mentioning that in this genesis of Boosting there were no algorithms defined as such; rather, there were only theoretical results describing, under fixed conditions, how a classifier improves its accuracy after a reweighting of the instances (which is the essence of Boosting). A few years later, Y. Freund and R. Schapire [28] defined their seminal algorithm based on this reweighting technique: AdaBoost, which stands for Adaptive Boosting and whose pseudo-code is shown in Algorithm 1.
By far, this is the most important and referenced Boosting algorithm in the literature. Originally designed to tackle binary problems, AdaBoost has been a cornerstone for subsequent derivations reaching the multi-class, multi-label, semi-supervised and cost-sensitive domains.

Algorithm 1: AdaBoost
1- Initialize the weight vector W with uniform distribution w(n) = 1/N, n = 1, . . . , N.
2- For m = 1 to M:
   (a) Fit a classifier fm(x) to the training data using weights W.
   (b) Compute the weighted error: Errm = Σ_{n=1}^N w(n) I(fm(xn) ≠ yn).
   (c) Compute αm = (1/2) log((1 − Errm)/Errm).
   (d) Update weights: w(n) ← w(n) exp(−αm yn fm(xn)), n = 1, . . . , N.
   (e) Re-normalize W.
3- Output Final Classifier: H(x) = sign(Σ_{m=1}^M αm fm(x)).

Let X and L = {+1, −1} be, respectively, the domain and the set of labels of the problem at hand. Maintaining the classic notation for binary problems, we will denote real labels with y instead of l. AdaBoost creates a strong classifier, H, whose output is the sign of a linear combination of weak classifiers:

F(x) = Σ_{m=1}^M αm fm(x) .   (2.1)

In other words, H(x) = sign(F(x)) = sign(Σ_{m=1}^M αm fm(x)), where αm ∈ R+ and fm : X → L, ∀m. Usually all of these functions are expected to belong to a specific hypothesis space H (for instance, classification trees with a restriction on their depth). The shape of the final classifier is exactly a weighted majority vote over a set of M classifiers. Since F(x) ∈ R, the value of its modulus |F(x)| should be understood as a degree of confidence in classifying with sign(F(x)); that is why the values provided by F are usually called confidence-rated predictions. In the language of Boosting, every classifier fm included in expression (2.1) is called a weak learner, while the linear combination F is called the committee.

2.2.1 Understanding AdaBoost

Let us explain briefly how this iterative process works.
For each iteration m of AdaBoost, a pair (fm, αm) is calculated and added to the model in a greedy fashion (greedy in the sense that elements included in previous iterations remain unaffected). On the one hand, fm is fitted taking into account the weight associated to each instance, in such a way that those with the largest weights have classification priority. Therefore, every weak classifier focuses on the instances that are hard according to W. On the other hand, the constant αm is a positive value measuring the goodness of fm: the larger the value, the more reliable its associated classifier. Here appears the essential idea behind AdaBoost, the reweighting process carried out over the instances. Before defining it we must introduce the Exponential Loss Function:

L(y, F(x)) := exp(−yF(x)) .   (2.2)

This function is a key concept for the development of Boosting algorithms, as we will show in subsequent sections. For a labeled instance (x, y) and a given confidence-rated classifier F, it computes the value z := yF(x), usually called the margin, and then applies exp to −z. Hence the exponential loss function can be defined over margin values: L(z) = exp(−z). It is clear that the more negative the margin, the larger the loss incurred, while the more positive the margin, the closer the loss is to zero. AdaBoost's reweighting process computes the values of this loss function when the pair (αm, fm) is added to the current model Fm−1, i.e.

wm = exp(−yFm(x)) = exp(−y(Fm−1(x) + αm fm(x))) ,   (2.3)

which in turn can be written recursively as

wm = exp(−y Σ_{r=1}^m αr fr(x)) = wm−1 exp(−y αm fm(x)) ,   (2.4)

which is more compact and efficient. Additionally, the weight vector is normalized, W = W/Zm, where Zm = Σ_{n=1}^N wm(n) is the normalization constant. These weights will be taken into account in the next iteration. Specifically, the objective of the next classifier f is to minimize the weighted error, i.e. Err(f) = Σ_{n=1}^N w(n) I(f(xn) ≠ yn).
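Putting the pieces of Algorithm 1 together, the whole loop can be sketched in a few lines (a minimal, unoptimized illustration of our own; the stump learner and the toy data are hypothetical):

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: threshold one feature, labels in {-1,+1}."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(w[pred != y])          # weighted error Err(f)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    return err, (lambda Z: sign * np.where(Z[:, j] <= thr, 1, -1))

def adaboost(X, y, M=10):
    """AdaBoost (Algorithm 1): returns H(x) = sign(sum_m alpha_m f_m(x))."""
    N = len(y)
    w = np.full(N, 1.0 / N)                          # step 1: uniform weights
    committee = []
    for _ in range(M):
        err, f = fit_stump(X, y, w)                  # steps 2(a)-(b)
        err = max(err, 1e-12)                        # guard against a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)        # step 2(c)
        w = w * np.exp(-alpha * y * f(X))            # step 2(d), update (2.4)
        w = w / w.sum()                              # step 2(e)
        committee.append((alpha, f))
    return lambda Z: np.sign(sum(a * f(Z) for a, f in committee))

# Toy data: the label is the sign of the first feature; the second is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] > 0, 1, -1)
H = adaboost(X, y, M=5)
print(np.mean(H(X) == y))  # training accuracy
```

The exhaustive stump search stands in for step 2(a); any weak learner that accepts instance weights would do.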
So, when computing this weighted error, there are two ways of using W:

1. Directly, if the hypothesis space H allows weights in its computations.

2. By sampling: since W is a probability distribution over the data, it can be used to sample a data set from which a weak classifier from H is learned.

In both cases, every new weak learner is forced to focus on hard instances (those poorly classified in previous rounds). Once its weighted error Err is known, the value of α associated to this classifier is

α = (1/2) log((1 − Err)/Err) .   (2.5)

The process can be repeated as many times as the specified number of iterations, M. Schapire's proposal kept a theoretical justification based on a uniform bound of the classification error. Specifically, if we denote by Errm = 1/2 − γm the classification error of the weak learner at iteration m, then the training error of the globally fitted classifier, H, is bounded as follows:

Err(H) ≤ ∏_{m=1}^M 2 √(Errm (1 − Errm)) = ∏_{m=1}^M √(1 − 4γm²) ≤ exp(−2 Σ_{m=1}^M γm²) .   (2.6)

Thus, for a constant γ satisfying 0 < γ ≤ γm, ∀m, we get

Err(H) ≤ exp(−2M γ²) ,   (2.7)

which tends to zero as the number of iterations M increases. Expression (2.7) justifies the Adaptive part of AdaBoost, since γ and M do not need to be given in advance for the algorithm to adapt to the problem. Proving the first inequality in (2.6) is not difficult. One just has to realize that the 0|1 error loss defined in terms of the margin, Err0|1(z) = (1 − sign(z))/2, is bounded from above by the exponential loss L(z) = exp(−z). Therefore

Err(H) = (1/N) Σ_{n=1}^N I(yn ≠ H(xn)) = (1/N) Σ_{n=1}^N I(yn F(xn) < 0)
       ≤ (1/N) Σ_{n=1}^N exp(−yn F(xn)) = Σ_{n=1}^N ( wM(n) ∏_{m=1}^M Zm ) = ∏_{m=1}^M Zm ,

where the penultimate equality follows from substituting (2.5) into the normalization factor Zm = Σ_{n=1}^N wm−1(n) exp(−αm yn fm(xn)). A more complex reasoning has to be applied to describe a bound for the generalization error. This analysis was carried out by Y. Freund and R. Schapire [28].
They justified that

GeneralErr(H) ≤ P̂(z ≤ σ) + Õ( √(DVC/N) / σ ) ,   (2.8)

where P̂ is the empirical probability, N is the number of instances, and DVC is the Vapnik-Chervonenkis dimension of the problem. Later, Schapire et al. [79] proposed a better bound in terms of the concept of margin, independent of M. In this context, margins are exactly a "measure of confidence" in the prediction (their sign). Specifically, they proved that AdaBoost tends to increase the margins of the training data and, as a consequence, the generalization error decreases. It is worth mentioning that by the end of the nineties AdaBoost had become so popular, especially due to its surprising accuracy in high dimensions, that other famous meta-classifiers, like Leo Breiman's Bagging algorithm [8], took a second place. However, the beginning of the present century saw the emergence of another important meta-classifier also derived by L. Breiman, Random Forests [9], which has received a degree of research attention comparable to Boosting, especially for Computer Vision problems. AdaBoost allows an alternative interpretation that became quite useful for justifying its good properties. This different point of view has a statistical background that may be more accessible to newcomers; the next subsection is devoted to it.

2.2.2 Statistical View of Boosting

Since its beginnings, Boosting has been an object of study because of its apparent resistance to overfitting. It was never well understood how such an iterative process improves its generalization error iteration after iteration even when its training error is zero. At the end of the nineties the work of Friedman, Hastie and Tibshirani [29] came to shed some light on the matter, although their arguments were not completely satisfactory [66]. In their work the authors proved that AdaBoost can be obtained by fitting an additive model [37], i.e.
a model with the shape of expression (2.1), whose goal is to minimize the expected exponential loss. Specifically, they proved that Schapire's method builds an additive logistic regression via Newton-like updates for minimizing such expected loss. They also derived Real AdaBoost, an analogous version of AdaBoost for confidence-rated weak learners free of α constants. Furthermore, they proposed LogitBoost and GentleBoost, two algorithms that resort to Newton steps for updating the additive model with regard to the binomial log-likelihood and the exponential loss, respectively. Let us explain briefly how the pair (αm, fm) is estimated in the m-th iteration under this statistical point of view. Firstly, the optimal parameters have to satisfy

(αm, fm) = arg min_{α,f} Σ_{n=1}^N exp[−yn(Fm−1(xn) + α f(xn))] ,   (2.9)

which, again, can be written in terms of the weights wm−1(n) = exp(−yn Fm−1(xn)) as

(αm, fm) = arg min_{α,f} Σ_{n=1}^N wm−1(n) exp(−α yn f(xn)) .   (2.10)

Now we can calculate each parameter separately. Let α > 0 be a pre-specified value; to characterize f we just have to rewrite the objective function conveniently:

Σ_{n=1}^N wm−1(n) exp(−α yn f(xn)) = e^{−α} Σ_{f(xn)=yn} wm−1(n) + e^{α} Σ_{f(xn)≠yn} wm−1(n)
= (e^{α} − e^{−α}) Σ_{n=1}^N wm−1(n) I(f(xn) ≠ yn) + e^{−α} Σ_{n=1}^N wm−1(n) ,

where the last sum equals 1 because W is normalized. Therefore, independently of α, the optimal weak learner minimizes the weighted error. Now let us assume fm is known (and hence its weighted error Err too). The above expression becomes

(e^{α} − e^{−α}) Err + e^{−α} ,   (2.11)

which is a convex function of α. So, differentiating and equating to zero, one obtains

αm = (1/2) log((1 − Err)/Err) ,   (2.12)

exactly the same value as in (2.5). We will bear this derivation in mind: it will serve to introduce our definition of canonical extension for Boosting algorithms. Let P1 and P2 be two types of classification problems such that P1 becomes P2 when a set of restrictions is satisfied.
Assume that A and B are Boosting algorithms developed to solve P1 and P2, respectively. Assume also that both algorithms are derived from different loss functions following the statistical interpretation of Boosting. We define the concept of canonical extension (generalization) as follows:

"Let (A, B) be a pair of Boosting algorithms defined as above. A is said to canonically extend B if the loss function of A, restricted to the framework of B, leads to updating the additive model with the same elements (Gm(x), αm) fitted following B. In other words, the restrictions on A given by the second framework yield the algorithm B."

2.3 Multi-class Boosting

Almost jointly with the emergence of AdaBoost, new extensions to multi-class problems were proposed. There is a large number of Boosting algorithms dealing with this type of classification. For ease of cataloguing, we divide them into two groups: algorithms that decompose the problem into binary subproblems (thus using binary weak learners) and algorithms that work simultaneously with all the labels (using multi-class weak learners or computing a posteriori probabilities at the same iteration). The following subsections discuss each approach. The Boosting algorithm that we introduce in the next chapter may be positioned between both groups. Hereafter, we will maintain the notation of section 2.1.1 for multi-class problems: the set of labels will be L = {1, 2, . . . , K} and instances will again be denoted (x, l).

2.3.1 Boosting algorithms based on binary weak-learners

Complementing Freund and Schapire's contribution [28] for binary problems, the same authors proposed in [26] two extensions to multi-class problems. We start by discussing the second one, AdaBoost.M2. This algorithm proceeds by extending the data set K times, as discussed in section 2.1.1 for multi-label problems, and then computing binary weak classifiers of the shape h : X × L → {0, 1}.
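This K-fold extension is the one described in section 2.1.1 and is easy to express in code; the sketch below (our own toy illustration) builds the extended set of (xn, j, ±1) observations:

```python
import numpy as np

def extend_dataset(X, labels, K):
    """Replicate each instance K times, appending the candidate label as a
    feature; the new binary target is +1 if that label is the true one,
    -1 otherwise. Returns (X_ext, y_ext) with N*K rows."""
    rows, targets = [], []
    for n in range(X.shape[0]):
        for k in range(1, K + 1):
            rows.append(np.append(X[n], k))              # "feature-label" pair
            targets.append(1 if labels[n] == k else -1)  # presence/absence
    return np.array(rows), np.array(targets)

# Hypothetical toy data: two instances, three labels.
X = np.array([[0.2, 1.5], [3.0, -1.0]])
labels = np.array([1, 3])
X_ext, y_ext = extend_dataset(X, labels, K=3)
print(X_ext.shape)  # (6, 3)
print(y_ext)        # [ 1 -1 -1 -1 -1  1]
```

A single binary weak learner trained on (X_ext, y_ext) then plays the role of h : X × L → {0, 1} (up to the {0, 1} versus {−1, +1} encoding).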
Following AdaBoost's essence, a normalized distribution of weights, W, is used, but this time with a matrix shape, W ∈ [0, 1]^{N×K}. The main novelty of the method was the inclusion of errors based on a pseudo-loss function

Ērr(h) = (1/2) Σ_{(n,k)∈B} w(n, k)(1 − h(xn, ln) + h(xn, k)) ,   (2.13)

where B = {(n, k) | n = 1, . . . , N; k ≠ ln}. This type of loss simultaneously penalizes hard instances and hard labels. The pseudo-loss is combined with the binary exponential loss function to update the weights, while the α constants are computed by substituting Ērr into (2.5). Years later, the idea behind AdaBoost.M2 was brought back by Schapire and Singer in [80], where three algorithms for multi-label problems were proposed. The first of them, AdaBoost.MH, enables a direct application to multi-class problems, and became one of the most popular methods for this task among those proposed at the time. We show its pseudo-code in Algorithm 2. AdaBoost.MH addresses the multi-class problem by extending the data set just as introduced in section 2.1.1. The main difference with respect to AdaBoost.M2 is the use of the Hamming loss for measuring errors instead of a pseudo-loss. Analogously, AdaBoost.MR was also proposed, in this case using a ranking loss to measure multi-label accuracy. The third algorithm in [80], AdaBoost.MO, is more complex than AdaBoost.MH, since the set of labels, L, is mapped onto P(L̂) for an auxiliary set of labels, L̂, with R elements. Therefore, via λ : L → P(L̂), each label l has an associated "codeword" λl ∈ {±1}^R in such

Algorithm 2: AdaBoost.MH
1- Initialize the weight matrix W with uniform distribution w(n, k) = 1/KN, for n = 1, . . . , N; k = 1, . . . , K.
2- For m = 1 to M:
   (a) Fit a binary classifier hm : X × L → {−1, +1} using weights W.
   (b) Compute αm.
   (c) Update weights: w(n, l) ← w(n, l) exp(−αm yl hm(xn, l)), for n = 1, . . . , N; l = 1, . . . , K.
   (d) Re-normalize W.
3- Output Final Classifier:
   Multi-label: H(x, l) = sign(Σ_{m=1}^M αm hm(x, l)).
   Multi-class: H(x) = arg max_k Σ_{m=1}^M αm hm(x, k).

a way that a binary classification can be performed on the extended set {(xn, l̂) | l̂ = 1, . . . , R}, just as stated for AdaBoost.MH. AdaBoost.MO belongs to a new stream of multi-class Boosting based on the Error Correcting Output Codes (ECOC) philosophy. Let us briefly describe the main ideas behind this perspective of learning. The ECOC methodology was introduced by Dietterich and Bakiri [17] in the nineties. It served as an alternative approach to reduce a multi-class problem to a set of R binary ones. The key point of this approach lies in using a particular encoding to represent the subset of labels under classification. So, considering the r-th task, one generates a weak learner hr : X → {+1, −1}. The presence/absence of a group of labels in an instance is encoded by a column vector belonging to {−1, +1}^K, where +1 indicates presence of the target labels of hr. It is usual to use a coloring function, µ, to assign the presence or absence of a set of labels in the data; the resulting assignment becomes the set of real labels for the associated binary subproblem. The composition of all the column vectors generated in this fashion produces a (K × R) codification matrix, in which the l-th row serves as the codeword associated to label l. When classifying a new instance one has to compute all the answers of the weak learners and compare the resulting (1 × R) vector with each codeword. The decision rule consists in choosing the class nearest to the result (under a certain measure, like the Hamming distance). This solution justifies matrix encodings as a practical and intuitive tool for building strong multi-class classifiers using binary weak learners. Based on this idea, and jointly with AdaBoost.MO [80], two new Boosting algorithms appeared: AdaBoost.OC [78] and AdaBoost.ECC [35]. These two algorithms are probably the most relevant ones in this line of multi-class Boosting.
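The nearest-codeword decision rule just described can be sketched as follows (the 4-class, 5-bit code matrix below is a hypothetical example of our own, not one taken from [17]):

```python
import numpy as np

def ecoc_decode(code_matrix, responses):
    """Nearest-codeword decoding under the Hamming distance.

    `code_matrix` is (K x R) with entries in {-1,+1}; row l is the codeword
    of label l. `responses` is the (1 x R) vector of binary weak-learner
    outputs for one instance. Returns the 0-based index of the chosen label.
    """
    # For +/-1 vectors, the Hamming distance is the number of disagreements.
    dists = np.sum(code_matrix != responses, axis=1)
    return int(np.argmin(dists))

M = np.array([[+1, +1, +1, +1, +1],
              [-1, -1, +1, +1, -1],
              [+1, -1, -1, +1, -1],
              [-1, +1, -1, -1, +1]])
r = np.array([-1, -1, +1, -1, -1])   # weak learners' answers, one bit flipped
print(ecoc_decode(M, r))             # nearest codeword is row 1
```

The error-correcting behaviour is visible here: even with one weak learner answering wrongly, the decoded label is still the intended one.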
We show their respective pseudo-codes in Algorithms 3 and 4. We must point out that AdaBoost.OC is also grounded on the M2 version proposed by Schapire, but this time completed with an ECOC perspective.

Algorithm 3: AdaBoost.OC
1- Initialize the weight matrix W with uniform distribution w(n, k) = 1/N(K − 1) if k ≠ ln, and w(n, ln) = 0, n = 1, . . . , N.
2- For m = 1 to M:
   (a) Compute a coloring function µm : L → {0, 1}.
   (b) Compute Dm(n) = Zm^{−1} Σ_{k∈L} w(n, k) I(µm(ln) ≠ µm(k)), where Zm is a normalization constant.
   (c) Fit a classifier hm : X → {0, 1} to the data {(x, µm(l))} weighted according to Dm.
   (d) Define h̃m(xn) := {k ∈ L : hm(xn) = µm(k)}.
   (e) Compute Errm = (1/2) Σ_{n=1}^N Σ_{k=1}^K w(n, k)( I(ln ∉ h̃m(xn)) + I(k ∈ h̃m(xn)) ).
   (f) Compute αm = (1/2) log((1 − Errm)/Errm).
   (g) Update weights: w(n, k) ← w(n, k) exp[ αm( I(ln ∉ h̃m(xn)) + I(k ∈ h̃m(xn)) ) ].
   (h) Re-normalize W.
3- Output Final Classifier: H(x) = arg max_k Σ_{m=1}^M αm I(hm(x) = µm(k)).

Here we briefly compare these three ECOC-based methods:

• With regard to the instance weights, they share a normalized matrix W ∈ R^{N×R} used at each iteration. In particular, for AdaBoost.OC and AdaBoost.ECC, R = K and the actual weights, D ∈ R^N, used to fit the weak learners are computed from W and the m-th selected coloring, µm.

• Besides this, the three algorithms add just one voting constant, α, per iteration to accompany the decisions of the respective weak learner. In AdaBoost.ECC the constant gm(xn) supports the possibility of two instance-dependent values:

αm = (1/2) ln( Σ_{n: hm(xn)=µm(ln)=+1} Dm(n) / Σ_{n: hm(xn)=+1≠µm(ln)} Dm(n) ), if hm(xn) = +1 ;
βm = (1/2) ln( Σ_{n: hm(xn)=µm(ln)=−1} Dm(n) / Σ_{n: hm(xn)=−1≠µm(ln)} Dm(n) ), if hm(xn) = −1 .   (2.14)

• The loss function applied to update the weights in all three algorithms comes from a derivation of the binary exponential loss function with non-vectorial arguments (for instance, a pseudo-loss in the case of AdaBoost.OC).

• Another important issue is the shape of the final classifier. On the one hand, the final decision for AdaBoost.OC and AdaBoost.ECC admits a translation into a K-dimensional function f(x) = (f1(x), . . . , fK(x))^T whose maximum coordinate is selected as the response. On the other hand, AdaBoost.MO proposes two options based on the final response f(x) ∈ R^R and the (K × R) matrix of codewords: one can select the row closest to the response, or resort to a confidence-rated prediction such as

arg min_l Σ_{l̂∈L̂} exp(−λl(l̂) f(x, l̂)) ,   (2.15)

which is the option recommended by the authors.

• Finally, with regard to the number of weak learners computed at each iteration, the three algorithms compute just one weak learner (jointly with its voting constant α). AdaBoost.OC and AdaBoost.ECC train a weak classifier associated to the coloring µm, while AdaBoost.MO trains a weak learner for the binary problem generated by the extended data set {(xn, l̂) | l̂ = 1, . . . , R} with labels {λln(l̂) | l̂ = 1, . . . , R}.

We will come back to these issues in the next chapter, when introducing our multi-class algorithm.

Algorithm 4: AdaBoost.ECC
1- Initialize the weight matrix W with uniform distribution w(n, k) = 1/N(K − 1) if k ≠ ln, and w(n, ln) = 0, n = 1, . . . , N.
2- For m = 1 to M:
   (a) Compute a coloring function µm : L → {−1, +1}.
   (b) Compute Dm(n) = Zm^{−1} Σ_{k∈L} w(n, k) I(µm(ln) ≠ µm(k)), where Zm is a normalization constant.
   (c) Fit a binary classifier hm : X → {−1, +1} to the training data using weights Dm.
   (d) Compute αm and βm following (2.14).
   (e) Compute gm(x) = αm I(hm(x) = +1) − βm I(hm(x) = −1).
   (f) Update weights: w(n, k) ← w(n, k) exp( (1/2) gm(xn)(µm(k) − µm(ln)) ).
   (g) Re-normalize W.
3- Output Final Classifier: H(x) = arg max_k Σ_{m=1}^M gm(x) µm(k).

2.3.2 Boosting algorithms based on vectorial encoding

Grouped in a second block, we encompass those multi-class algorithms that manage all the labels simultaneously at each iteration (using multi-class weak learners or directly estimating a posteriori probabilities). Here we include the first algorithm proposed by Freund and Schapire in [26], AdaBoost.M1. The essence of this multi-class generalization of AdaBoost lies in using pure multi-class weak learners while maintaining the same structure of the original algorithm. The main drawback of AdaBoost.M1 is the need for "strong learners", i.e. hypotheses achieving an accuracy of at least 50%, a requirement that may be too demanding when the number of labels is high. Despite the lack of theory supporting this method, we must clarify that it is very common to consider AdaBoost.M1 the first multi-class Boosting algorithm, due to its direct translation into AdaBoost in the binary case. A second approach came with the multi-class version of LogitBoost, which appeared jointly with the binary one [29]. Like its two-label counterpart, it separately estimates the probability of belonging to each label based on a multi-logit parametrization. The most interesting works from our point of view are those grounded on a vectorial insight. A successful way to generalize the symmetry of the class-label representation in the binary case to the multi-class case is to use a set of vector-valued class codes representing the correspondence between the label set L = {1, . . . , K} and a collection of vectors Y = {y1, . . . , yK}, where the vector yl has the value 1 in its l-th coordinate and −1/(K − 1) elsewhere. So, if l = 1, the code vector representing class 1 is y1 = (1, −1/(K−1), . . . , −1/(K−1))^T.
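These vector-valued class codes are easy to generate for any K; the following sketch (our own illustration) builds them and checks the arithmetic fact that each code sums to zero:

```python
import numpy as np

def class_codes(K):
    """Vector-valued class codes y_1..y_K: code y_l has 1 in the l-th
    coordinate and -1/(K-1) elsewhere (rows of the returned K x K matrix)."""
    Y = np.full((K, K), -1.0 / (K - 1))
    np.fill_diagonal(Y, 1.0)
    return Y

Y = class_codes(4)
print(Y[0])                           # code vector of class 1 for K = 4
print(np.allclose(Y.sum(axis=1), 0))  # each code sums to zero
```

This sum-to-zero property is what makes the codes compatible with the margin-vector framework discussed next.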
It is immediate to see the equivalence between classifiers H(x) defined over L and classifiers f(x) defined over Y:

H(x) = l ∈ L ⇔ f(x) = yl ∈ Y .   (2.16)

In the remainder of the thesis we will use capital letters (H, G or T) to denote classifiers with target set L. On the other hand, classifiers having a set of vectors, like Y, as codomain will be denoted with small bold letters (f or g). This codification was first introduced by Lee, Lin and Wahba [51] to extend the binary Support Vector Machine to the multi-class case. More recently, H. Zou, J. Zhu and T. Hastie [114] generalized the concept of binary margin to the multi-class case using a related vectorial codification in which a K-vector y is said to be a margin vector if it satisfies the sum-to-zero condition Σ_{k=1}^K y(k) = 0; in other words, y^T 1 = 0, where 1 denotes a vector of ones. This sum-to-zero condition reflects the implicit nature of the response in classification problems, in which each yn takes one and only one value from a set of labels. Margin vectors are useful in multi-class classification problems for other reasons as well. One of them comes directly from the sum-to-zero property. It is known that, in general, every vectorial classifier f(x) = (f1(x), . . . , fK(x))^T has a direct translation into a posteriori probabilities P(l = k | x), ∀k ∈ L, via the Multi-class Logistic Regression Function (MLRF)

P(l = k | x) = exp(fk(x)) / Σ_{i=1}^K exp(fi(x)) .   (2.17)

It is clear that a function f(x) produces the same a posteriori probabilities as g(x) = f(x) + α(x)·1, where α(x) is a real-valued function and 1 is a K-dimensional vector of ones; such is the case when, for example, α(x) = −fK(x). Using margin vectors we do not have to worry about this ambiguity: over the set of functions F = {f : x ↦ R^K} we can define the equivalence relation f(x) ∼ g(x) ⇔ ∃α : x ↦ R such that g(x) = f(x) + α(x)1, and margin functions then become representatives of the equivalence classes. Using this codification, J. Zhu, H. Zou, S. Rosset and T. Hastie [112] generalized the original AdaBoost to multi-class problems from a statistical point of view.
This work has been a cornerstone for subsequent derivations. Let us describe the main elements upon which the algorithm is grounded. Firstly, the binary margin applied in AdaBoost, z = yf(x), is replaced by the multi-class vectorial margin, defined as the scalar product

z := y^T f(x) .   (2.18)

The essence of the margin approach resides in obtaining negative/positive values of the margin when a classifier has, respectively, a failure/success. That is, if y, f(x) ∈ Y, the margin z = y^T f(x) satisfies: z > 0 ⇔ y = f(x), and z < 0 ⇔ y ≠ f(x). Note that, again, this definition of margin serves as a compact way of specifying numerically the hits and mistakes of classification. It is straightforward that the only two possible values of the margin when y, f(x) ∈ Y are:

z = y^T f(x) = K/(K − 1) if f(x) = y ,  and  z = −K/(K − 1)² if f(x) ≠ y .   (2.19)

Bearing this definition in mind, the Multi-class Exponential Loss Function is

L(y, f(x)) := exp( −(1/K) y^T f(x) ) = exp( −z/K ) .   (2.20)

As the reader may guess, the presence of the constant 1/K is important but not determinant for the proper behaviour of the loss function; we will see later how it simplifies some calculations. An interesting property of this loss function (which justifies the addition of the constant 1/K) comes from the following result:

L(y, f(x)) = exp( −(1/K) Σ_{k=1}^K y(k) fk(x) ) = ( ∏_{k=1}^K exp(−y(k) fk(x)) )^{1/K} .   (2.21)

Hence this multi-class loss function is the geometric mean of the binary exponential loss function applied to each pair of coordinates of (y, f(x)) (i.e. to the component-wise margins). Moreover, this loss function is Fisher-consistent [114].
This property is defined as follows: "A loss function L is said to be Fisher-consistent if, ∀x ∈ X (a set of full measure), the optimization problem f̂ = arg min_f L(P_{L|X}, f(x)), with f belonging to the hyperplane of margin vectors, has a unique solution and, in addition, arg max_j f̂j(x) = arg max_j P(l = j | x)". Roughly speaking, the Fisher-consistency condition says that, if we were provided with infinite samples, we could recover the exact Bayes rule by minimizing losses of this kind. This fact makes them suitable for fitting multi-class classifiers with guarantees. H. Zou, J. Zhu and T. Hastie introduced a theoretical basis for margin vectors and Fisher-consistent loss functions [114]. Given a classification function expressed in terms of a margin vector f(x) = (f1(x), . . . , fK(x))^T, they defined the multi-class margin of an instance (x, l) as the coordinate fl(x). Consequently, a binary loss function can be used to evaluate multi-class decisions. Although this generalization is adequate for deriving algorithms like AdaBoost.ML and Multi-category GentleBoost, we believe this definition of margin does not exploit the utility of vectorial encodings for labels. In the case of the multi-class exponential loss (2.20), it can be proved that the population minimizer

arg min_f E_{Y|X=x}[ L(Y, f(x)) ]

corresponds to the multi-class Bayes optimal classification rule [112]:

arg max_k fk(x) = arg max_k P(Y = k | x) .

Other loss functions, such as the logit or the L2 loss, share this property and may also be used to build Boosting algorithms. Similarly, Saberian and Vasconcelos showed that other sets of margin vectors could have been used to represent the labels [74], and therefore to develop equivalent algorithms. Their work also proposes an interesting definition of a multi-class margin for label k, zk := (1/2)( yk^T f(x) − max_{j≠k} yj^T f(x) ). Using it, they justify the classification criterion H(x) = arg max_k zk.
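The two margin values in (2.19), together with this per-label margin, can be verified numerically for a small K (a sketch of our own):

```python
import numpy as np

K = 4
# Class codes: 1 in the own coordinate, -1/(K-1) elsewhere.
Y = np.full((K, K), -1.0 / (K - 1))
np.fill_diagonal(Y, 1.0)

z_hit = Y[0] @ Y[0]    # margin when f(x) agrees with y
z_miss = Y[0] @ Y[1]   # margin when f(x) picks another label

assert np.isclose(z_hit, K / (K - 1))         # K/(K-1), here 4/3
assert np.isclose(z_miss, -K / (K - 1) ** 2)  # -K/(K-1)^2, here -4/9

# Saberian-Vasconcelos margin z_k for the true label, built from the same
# dot products; it is positive exactly when the classification is correct.
z_k = 0.5 * (Y[0] @ Y[0] - max(Y[j] @ Y[0] for j in range(1, K)))
print(z_hit, z_miss, z_k)
```

The positive/negative split of these two values is what lets the exponential loss (2.20) penalize mistakes and reward hits in a single scalar.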
Having defined the above concepts, namely the multi-class vectorial margin and the multi-class exponential loss, we can come back to the work of Zhu et al. [112]. Their proposed algorithm, SAMME³ (Stage-wise Additive Modeling using a Multi-class Exponential loss function), resorts to the multi-class exponential loss for evaluating classifications encoded with margin vectors when real labels are encoded likewise. The expected loss is then minimized using a stage-wise additive gradient descent approach. Its pseudo-code is shown in Algorithm 5. It is quite interesting how the resulting algorithm only differs from AdaBoost (see both pseudo-codes) in step 2(c), which now is α_m = log((1 − Err_m)/Err_m) + log(K − 1). Step 3 is not essentially different either, since

    H(x) = arg max_k Σ_{m=1}^{M} α_m I(T_m(x) = k) = arg max_k f_k(x) ,

where f(x) = Σ_{m=1}^{M} α_m g_m(x). Moreover, it is an easy exercise to prove that the above classification rule is equivalent to assigning the maximum margin (2.18), H(x) = arg max_k y_k^T f(x), which links with the perspective defined in [74]. In the same way it is straightforward to verify that AdaBoost becomes a particular case when K = 2, which makes us think of SAMME as a canonical generalization of AdaBoost using multi-class weak-learners.

Footnote 3: A curious name for an algorithm that is essentially the same as the AdaBoost.ME proposed in a technical report [113] developed previously by the authors of [114] following a different margin-based insight.

Algorithm 5: SAMME
1- Initialize the weight vector W with uniform distribution w(n) = 1/N, n = 1, ..., N.
2- For m = 1 to M:
   (a) Fit a multi-class classifier T_m(x) to the training data using weights W.
   (b) Compute the weighted error: Err_m = Σ_{n=1}^{N} w(n) I(T_m(x_n) ≠ l_n).
   (c) Compute α_m = log((1 − Err_m)/Err_m) + log(K − 1).
   (d) Update the weight vector: w(n) ← w(n) exp(α_m I(T_m(x_n) ≠ l_n)), n = 1, ..., N.
   (e) Re-normalize W.
3- Output Final Classifier: H(x) = arg max_k Σ_{m=1}^{M} α_m I(T_m(x) = k).

It is also worth pointing out the impact of the above works [112, 114], jointly with [20], on the multi-class field of Boosting. For instance, J. Huang et al. [39] proposed GAMBLE (Gentle Adaptive Multi-class Boosting Learning), an algorithm that takes, on one side, the same vectorial labels and loss function of SAMME and, on the other, the same type of weak learners and structure of GentleBoost. The resulting multi-class Boosting algorithm is merged with an active learning methodology to scale up to large data sets.

To finish the section, let us describe the Multi-category GentleBoost algorithm introduced in [114]. This method is a prominent example of a multi-class algorithm conceived to address a coordinate-wise fit of the margin vector f(x). See its pseudo-code in Algorithm 6. Multi-category GentleBoost resorts to the exponential loss (applied to the margin z = f_l(x)) to build the vectorial additive model. It works as follows. A vectorial function h(x) ∈ R^K is initialized to zero in each of its coordinates. The iterative process updates h(x) = h(x) + g(x), aiming to find the direction, g(x), that makes the empirical risk

    (1/N) Σ_{n=1}^{N} exp(−h_{l_n}(x_n) + (1/K) Σ_{k=1}^{K} h_k(x_n))   (2.22)

decrease most. This is accomplished by expanding (2.22) to second order and then simplifying the Hessian, keeping only its diagonal. Doing so, it can be verified that the optimal j-th coordinate of g(x) minimizes

    − Σ_{n=1}^{N} g_j(x_n) z_nj exp(−f_{l_n}(x_n)) + (1/2) Σ_{n=1}^{N} g_j(x_n)^2 z_nj^2 exp(−f_{l_n}(x_n)) ,   (2.23)

where z_nj = −1/K + I(l_n = j) and f(x) is the margin vector corresponding to the model already fitted (i.e. h(x) mapped onto the margin hyperplane). A simple way to solve it is to calculate a regression function by weighted least-squares of the target variable z_nj^{−1} on x_n, with weights w(n, j) = z_nj^2 exp(−f_{l_n}(x_n)).
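The coordinate-wise bookkeeping just described can be sketched as follows. Here `fit_wls` is a hypothetical placeholder for any weighted least-squares regressor (tested below with a trivial constant model), so this illustrates the update, not the thesis' implementation:

```python
import numpy as np

def gentle_coordinate_update(X_feat, labels, h, j, K, fit_wls):
    """One coordinate update of Multi-category GentleBoost: minimize the
    diagonal-Hessian approximation (2.23) for coordinate j via a weighted
    least-squares fit of the target 1/z_nj with weights z_nj^2 exp(-f_{l_n})."""
    f = h - h.mean(axis=1, keepdims=True)      # project h onto the margin hyperplane
    fl = f[np.arange(len(labels)), labels]     # f_{l_n}(x_n)
    z = -1.0 / K + (labels == j)               # z_nj = -1/K + I(l_n = j), never zero
    w = z ** 2 * np.exp(-fl)                   # w(n, j) of the text
    g_j = fit_wls(X_feat, 1.0 / z, w)          # regress 1/z_nj on x_n with weights w
    h[:, j] += g_j                             # h_j(x) <- h_j(x) + g_j(x)
    return h

# tiny check with a constant regressor: the weighted LS optimum of a constant
# model is the weighted mean of the targets
fit_const = lambda x, t, w: np.full(len(t), np.average(t, weights=w))
labels = np.array([0, 0, 1, 2])
h = gentle_coordinate_update(np.zeros(4), labels, np.zeros((4, 3)), j=0, K=3,
                             fit_wls=fit_const)
print(h[:, 0])  # every sample gets the same constant update on coordinate 0
```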
Finally, h is projected onto the hyperplane of margin vectors in order to classify or to compute a posteriori probabilities if required. It is easy to verify that Multi-category GentleBoost canonically extends its binary counterpart GentleBoost [29].

Algorithm 6: Multi-category GentleBoost
1- Initialize the weight vector W with constant distribution w(n) = 1, n = 1, ..., N.
2- Initialize coordinates h_k(x) = 0, for k = 1, ..., K.
3- For m = 1 to M:
   (a) For k = 1 to K:
       * Let z_n := −1/K + I(l_n = k). Compute w*(n) = z_n^2 exp(−f_{l_n}(x_n)).
       * Fit a regressor g_k(x) by least-squares of the variable z_n^{−1} on x_n, weighted with w*(n).
       * Update h_k(x) = h_k(x) + g_k(x).
   (b) Compute f(x), the projection of h(x) onto the margin hyperplane.
   (c) Compute w(n) = exp(−f_{l_n}(x_n)), n = 1, ..., N.
4- Output Final Classifier: H(x) = arg max_k f_k(x).

2.4 Cost-sensitive binary Boosting

A second mandatory extension of AdaBoost comes with the chance of having different costs for misclassifications. Assume that the binary classification problem at hand is provided with a (2 × 2) cost matrix,

    C = [ C(1,1)  C(1,2) ]
        [ C(2,1)  C(2,2) ]   (2.24)

with no negative real values. Here, rows refer to real labels while columns indicate predicted labels. Since its beginning, AdaBoost has received a notable degree of attention in order to adapt its good properties to a cost-sensitive structure like the above. It is very common to set the diagonal equal to zero, i.e. no costs for correct classifications. We will justify this consideration in section 4.1. For ease of notation we will use C1 and C2 to denote the constants C(1, 2) and C(2, 1), respectively. Initial attempts to generalize AdaBoost in this fashion came essentially from heuristic changes to specific parts of the pseudo-code. Such is the case of CSB0, CSB1 and CSB2 [94, 93].
Their respective reweighting schemes for the n-th instance are given by:

    w_{m+1}(n) = w_m(n)            if f_m(x_n) = y_n ,
               = C_{y_n} w_m(n)    if f_m(x_n) ≠ y_n ,   (2.25)

    w_{m+1}(n) = w_m(n) exp(−y_n f_m(x_n))            if f_m(x_n) = y_n ,
               = C_{y_n} w_m(n) exp(−y_n f_m(x_n))    if f_m(x_n) ≠ y_n ,   (2.26)

    w_{m+1}(n) = w_m(n) exp(−α_m y_n f_m(x_n))            if f_m(x_n) = y_n ,
               = C_{y_n} w_m(n) exp(−α_m y_n f_m(x_n))    if f_m(x_n) ≠ y_n .   (2.27)

The common structure of the three algorithms is shown in Algorithm 7.

Algorithm 7: CSB
1- Initialize the weight vector W with w(n) = C_{y_n}/Z_0, n = 1, ..., N, where Z_0 is a normalization constant.
2- For m = 1 to M:
   (a) For f ∈ Pool:
       * Compute the weighted error of f, i.e. Err = Σ_{n=1}^{N} w(n) I(f(x_n) ≠ y_n).
   (b) Select f_m, the weak hypothesis with minimum weighted error.
   (c) Compute α_m = log((1 − Err_m)/Err_m).
   (d) Update the weight vector following either (2.25), (2.26) or (2.27).
   (e) Re-normalize W.
3- Output Final Classifier: H(x) = sign(Σ_{m=1}^{M} α_m f_m(x)(C1 I(f_m(x) = 1) + C2 I(f_m(x) = −1)))

Another example came with AdaCost, the algorithm proposed by W. Fan et al. [23]. It was developed around a weighting rule that includes a particular cost and a margin-dependent function, β(n), in the argument of the exponential loss function. So, for an instance (x_n, y_n), one computes

    w_{m+1}(n) = w_m(n) exp(−α_m y_n f_m(x_n) β(n)) .   (2.28)

In practice, the authors select β(n) = (1/2)(1 − z_n C_{y_n}), where C_{y_n} is the cost incurred for the real label y_n and z_n is the associated margin at iteration m (remember the equivalence between the label encodings {1, −1}, {1, 0} and {1, 2} for binary problems). Similarly, AsymBoost [101] is also based on a reweighting scheme:

    w_{m+1}(n) = w_m(n) exp(−α_m y_n f_m(x_n)) (C1/C2)^{y_n/(2m)} .   (2.29)

This choice seems to be non-optimal due to its dependence on the current iteration m, which departs from the "adaptive" property of the AdaBoost algorithm. In the same way, Y. Sun et al.
[89, 87] proposed three other ways to update weights in a cost-sensitive fashion:

1. w_{m+1}(n) = w_m(n) exp(−α_m C_{y_n} y_n f_m(x_n))
2. w_{m+1}(n) = w_m(n) C_{y_n} exp(−α_m y_n f_m(x_n))
3. w_{m+1}(n) = w_m(n) C_{y_n} exp(−α_m C_{y_n} y_n f_m(x_n))

No formal reason is given for including costs "inside" the exponential loss, "outside" the exponential loss, or "in both places" jointly. Each reweighting scheme yields, in turn, its cost-sensitive extension of AdaBoost, namely AdaC1, AdaC2 and AdaC3. Algorithm 8 shows AdaC2's pseudo-code. We will come back to it in chapter 4.

Algorithm 8: AdaC2
1- Initialize the weight vector W with uniform distribution w(n) = 1/N, n = 1, ..., N.
2- For m = 1 to M:
   (a) Fit a classifier f_m(x) to the training data using weights W.
   (b) Compute the weighted error: Err_m = Σ_{n=1}^{N} w(n) I(f_m(x_n) ≠ y_n).
   (c) Compute α_m = log((1 − Err_m)/Err_m).
   (d) Update the weight vector: w(n) ← w(n) C_{y_n} exp(α_m I(f_m(x_n) ≠ y_n)), n = 1, ..., N.
   (e) Re-normalize W.
3- Output Final Classifier: H(x) = sign(Σ_{m=1}^{M} α_m f_m(x))

More recently, two works came to shed light on the cost-sensitive capability of Boosting from a more formal point of view. On the one hand, Masnadi-Shirazi and Vasconcelos' work [63] (see also their previous paper [60]) may be considered a canonical extension of AdaBoost to this field. Besides three other methods for binary cost-sensitive problems, this paper introduces Cost-Sensitive AdaBoost (CS-AdaBoost). The core idea behind the algorithm is to substitute the original exponential loss function with a cost-dependent variant:

    L_{CS-Ada}(y, F(x)) = I(y = 1) exp(−y C1 F(x)) + I(y = −1) exp(−y C2 F(x)) .   (2.30)

It is clear that it becomes AdaBoost's loss when a 0|1-cost matrix is used. CS-AdaBoost is then derived by fitting an additive model whose objective is to minimize the expected loss. Algorithm 9 shows the pseudo-code.
Like previous approaches, the algorithm needs a pool of available weak learners from which to select the optimal one at each iteration, jointly with the optimal step β. For a candidate weak learner, g(x), they compute two constants summing up the weighted errors associated to instances with the same label: b for label 1 and d for label −1. Then they calculate β by finding the only real solution to the following equation:

    2 C1 b cosh(β C1) + 2 C2 d cosh(β C2) = T1 C1 e^{−β C1} + T2 C2 e^{−β C2} .   (2.31)

Finally, the pair (g(x), β) minimizing L_{CS-Ada}(y, F(x) + β g(x)) is added to the model. A particularity of CS-AdaBoost lies in the initial weighting of instances, which attempts to distribute the sum of positive and negative weights evenly (1/2 for each group of weights). See step 1 of Algorithm 9, where N_1 = Σ_{n=1}^{N} I(y_n = 1) and N_{−1} = Σ_{n=1}^{N} I(y_n = −1).

Algorithm 9: Cost-Sensitive AdaBoost
1- Initialize the weight vector W with distribution w(n) = 1/(2 N_{y_n}), n = 1, ..., N.
2- For m = 1 to M:
   (a) Calculate the constants: T1 = Σ_{y_n=1} w(n), T2 = Σ_{y_n=−1} w(n) = 1 − T1.
   (b) For f ∈ Pool:
       * Calculate: b = Σ_{y_n=1} w(n) I(y_n ≠ f(x_n)), d = Σ_{y_n=−1} w(n) I(y_n ≠ f(x_n)).
       * Calculate β, the solution to equation (2.31).
       * Compute the weighted error using Σ_{n=1}^{N} L_{CS-Ada}(y_n, F_{m−1}(x_n) + β f(x_n)).
   (c) Select the pair (f_m, β_m) with minimum weighted error.
   (d) Update weights: w(n) ← w(n) · L_{CS-Ada}(y_n, β_m f_m(x_n)), n = 1, ..., N.
   (e) Re-normalize W.
3- Output Final Classifier: H(x) = sign(Σ_{m=1}^{M} β_m f_m(x)).

On the other hand, Landesa-Vazquez and Alba-Castro's work [49] discusses the effect of an initial non-uniform weighting of instances to endow AdaBoost with a cost-sensitive behaviour. The resulting method, Cost-Generalized AdaBoost, keeps the rest of AdaBoost's original structure unchanged.
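Equation (2.31) has no closed form in general, but its left side grows with β while its right side decays, so the root can be bracketed and bisected. A numerical sketch (the bracket and tolerance are arbitrary choices; it assumes b + d > 0):

```python
import math

def cs_ada_beta(C1, C2, b, d, T1, T2, hi=10.0, tol=1e-10):
    """Solve eq. (2.31) for the CS-AdaBoost step beta by bisection; b, d are
    the candidate learner's weighted errors on the positive/negative class,
    T1, T2 the total weight per class."""
    def h(beta):
        lhs = 2 * C1 * b * math.cosh(beta * C1) + 2 * C2 * d * math.cosh(beta * C2)
        rhs = T1 * C1 * math.exp(-beta * C1) + T2 * C2 * math.exp(-beta * C2)
        return lhs - rhs
    lo = 0.0
    while h(hi) < 0:        # enlarge the bracket if needed
        hi *= 2
    while hi - lo > tol:    # h is increasing for beta > 0, so bisect
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

# sanity check: with unit costs CS-AdaBoost falls back to AdaBoost, whose
# step is (1/2) log((1 - Err)/Err) with Err = b + d
beta = cs_ada_beta(C1=1.0, C2=1.0, b=0.1, d=0.1, T1=0.5, T2=0.5)
print(beta, 0.5 * math.log(0.8 / 0.2))  # both ~ log 2
```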
Besides these approaches, we must point out that probably the most intuitive way of applying AdaBoost to cost-sensitive problems came with Viola and Jones' work on detection [100]. The authors basically maintained the original algorithm and introduced a threshold, θ, to bias the final response in favour of the most costly label. Hence, the resulting variation has the shape

    H(x) = sign(F(x) − θ) = sign(Σ_{m=1}^{M} α_m f_m(x) − θ) .   (2.32)

Obviously, the threshold is tuned on a validation set to meet the required quality of detection. If an actual cost matrix is given, then θ is easily calculated:

    F_C(x) = log[P(Y = 1|x) C1 / (P(Y = −1|x) C2)] = log[P(Y = 1|x)/P(Y = −1|x)] − log(C2/C1) = F(x) − θ .   (2.33)

It is also easy to verify that this way of proceeding is equivalent to changing the a priori probabilities from a generative point of view:

    F_C(x) = log[P(Y = 1|x) C1 / (P(Y = −1|x) C2)] = log[P(x|Y = 1) P(Y = 1) C1 / (P(x|Y = −1) P(Y = −1) C2)] ,   (2.34)

since the ratio of priors, P(Y = 1)/P(Y = −1), is directly corrected by C1/C2. Moving the cost-insensitive decision boundary is, however, a suboptimal strategy, because there are no guarantees of fitting the true cost-sensitive decision boundary, as was pointed out in [63].

2.5 Other perspectives of Boosting

So far we have only considered the extensions of AdaBoost most closely related to our work in this thesis, namely multi-class and binary cost-sensitive problems. We close the chapter with a short overview of other interesting perspectives driven by Boosting in the world of Machine Learning. The second part of the section is devoted to commenting on the presence of Boosting in some Computer Vision tasks. Here is a short list of topics where Boosting has opened new avenues of research:

• New loss functions. Some recent advances in Boosting came from developing a loss function that enforces a desired property. Such is the case of the SavageBoost [62] and TangentBoost [61] algorithms.
For both methods the authors derived a loss function specially designed to prevent the effect of outliers in the classification. Other examples can be found in [10, 80, 34, 71].

• Semi-supervised learning. Problems with unlabeled instances are also present in the scope of Boosting. Two recent works in this area are [59], where the SemiBoost algorithm is introduced, and [13], where a margin-based cost function is regularized in order to be optimized in a supervised way. For readers interested in this field we also recommend [85, 53, 75].

• Entropy projection. Kivinen and Warmuth's work [46] uncovered properties relating consecutive weight vectors {W_m, W_{m+1}}, understood as probability distributions. Specifically, the new distribution, W_{m+1}, is the closest one to the previous, via the relative entropy, among those belonging to the orthogonal hyperplane of weight vectors. See [84] for another perspective, in which the Lagrange dual problems of some Boosting derivations are proven to be entropy maximization problems.

• Regression. Obviously the world of regression has also received attention in the Boosting literature. We highlight GradientBoost [31], the algorithm developed by J. Friedman, as a reference in the area. Another relevant derivation is the above-mentioned GentleBoost [29, 114].

• Game theory. In [27] the authors explained the connection between Boosting and game theory. Specifically, they described AdaBoost in terms of a two-player game where one player fits a weak learner based on a given weight vector, while the second player receives the weighted error and computes the α value with which to derive the next weight vector for the first player.

• Mahalanobis distance. Informative distances, like Mahalanobis, have also been used in the stage-wise optimization performed by Boosting. The most representative works in the area were introduced by C. Shen et al. [82, 83].
The idea behind their proposal is the use of "differences between Mahalanobis distances", d²_M(i, j) − d²_M(i, k), as the argument (margin) of the exponential loss function. Specifically, the objective of the minimization is a sum of as many exponential losses as constraints d_M(i, j) > d_M(i, k) are required among point triplets (i, j, k). Following this insight they derived the MetricBoost algorithm.

• Conditions for Boostability. This terminology refers to works that address the conditions under which a Boosting algorithm has guarantees of convergence. We highlight the paper of I. Mukherjee and R. Schapire [67], in which both strong and weak conditions of "boostability" are given for multi-class Boosting algorithms. These conditions are evaluated on previous relevant works. An alternative analysis of the convergence of Boosting under several loss functions is due to M. Telgarsky [92].

• Relationship to other paradigms. Since Boosting is a particular case of meta-classifier, it was only a matter of time before its structure was combined with other relevant paradigms. Such is the case of DeepBoost [14], a recent work where strong learners are allowed as bases without losing quality of fit. The method is open to being combined with deep decision trees, Support Vector Machines or even Neural Networks. For an interesting comparison with Bagging with respect to robustness, see S. Rosset's work [73].

Leaving aside the above topics, Boosting has proven to be an excellent tool for many problems in the area of Computer Vision. Some of them require modifications of the original formulation of the algorithms to provide optimal results. That is the case of detection problems, where the strong learner calculated with AdaBoost is conveniently pruned to acquire a cascade-shaped structure. In the case of face detection we must highlight the above-mentioned works of P. Viola and M. Jones [101, 100].
Boosting has also been successfully used for recognizing text [12, 22], deriving object detectors efficiently [95, 99, 52], and labelling images [43, 108, 105]. Moreover, it has become a very useful strategy for feature selection in Computer Vision problems [96, 48]. Another interesting application of Boosting in Computer Vision is the identification of personal characteristics from low-resolution pictures of faces. Such is the case, for instance, of gender recognition. Here we must point out S. Baluja and H. Rowley's work [5], in which AdaBoost uses simple comparisons of gray level between pairs of pixels to obtain significantly good results. See [107] for another approach, in this case based on a second-order discriminant analysis updated iteration after iteration.

Chapter 3
Partially Informative Boosting

In section 2.3.1 we discussed some multi-class Boosting algorithms based on binary weak learners, which essentially separate the set of classes into two groups. None of them is a proper extension of AdaBoost in the sense of taking advantage of the exponential loss function in a pure multi-class fashion. This is exactly the root of our theoretical improvement. Can we transfer partial responses to the multi-class field while maintaining this property? So far we have discussed the important role of the margin for binary and multi-class Boosting. Here we extend this concept to manage binary sub-problems properly and, hence, to answer the above question. In this chapter we introduce a multi-class generalization of AdaBoost that uses ideas present in previous works. We use binary weak-learners to separate groups of classes, like [3, 78, 80], and a margin-based exponential loss function with a vectorial encoding, like [51, 112, 39]. However, the final result is new. To model the uncertainty in the classification provided by each weak-learner we use different vectorial encodings for representing class labels and classifier responses.
This codification yields an asymmetry in the evaluation of classifier performance that produces different margin values depending on the number of classes separated by each weak-learner. Thus, at each Boosting iteration, the sample weight distribution is updated as usual according to the performance of the weak-learner, but also depending on the number of classes in each group. In this way our Boosting approach takes into account both the uncertainty in the classification of a sample into a group of classes and the imbalance in the number of classes separated by the weak-learner [87, 38]. Specifically, we decompose the problem into binary sub-problems whose goal is to separate a set of labels from the rest, and then we encode every response using a new set of margin vectors in such a way that the multi-class exponential loss function can be applied. The resulting algorithm is called PIBoost, which stands for Partially Informative Boosting, reflecting the idea that the Boosting process collects partial information about the classification provided by each weak-learner. PIBoost is well grounded theoretically and provides significantly good results. We consider it, perhaps, the only canonical extension of AdaBoost based on binary weak learners.

The chapter is organized as follows. The next section is devoted to introducing our new set of margin vectors jointly with the loss function. Section 3.2 describes PIBoost in detail. There we pay attention to Lemma 1, the main result upon which the algorithm is based. Paragraphs showing PIBoost's relationships with AdaBoost and CS-AdaBoost are also included. We also point out how PIBoost follows a common-sense pattern when taking decisions. In section 3.3 we compare the main points of our algorithm with those of the ECOC-based algorithms commented on in section 2.3.1. Finally, we devote section 3.4 to discussing experiments where the accuracy of PIBoost is analyzed against other relevant multi-class Boosting methods.
3.1 Multi-class margin extension

We saw in section 2.3.2 how the use of margin vectors for encoding labels induces a natural generalization of binary classification, yielding margins from which multi-class algorithms are derived. In this section we introduce a new multi-class margin expansion. Similarly to [51, 114, 112, 74, 39], we use margin vectors to represent multi-class membership, i.e. real labels. However, in our proposal, data labels and those estimated by a classifier will not be defined on the same set of vectors. Our margin vectors will produce, at each iteration of the algorithm, different margin values for each sample, depending on the number of classes separated by the weak-learner. This fact is related to the asymmetry produced in the classification when the number of classes separated by a weak learner is different on each side, and to the "difficulty" or information content of that classification.

Remember that the essence of the margin approach resides in keeping the margin negative/positive when a classifier respectively fails/succeeds. That is, if y, f(x) ∈ Y, the margin z = y^T f(x) satisfies z > 0 ⇔ y = f(x) and z < 0 ⇔ y ≠ f(x). We extend the set Y by allowing each y_l to also take a negative value, which can be interpreted as a fair vote for any label but the l-th. This vector encodes the uncertainty in the response of the classifier by evenly dividing the evidence among all labels but the l-th. It provides the smallest amount of information about the classification of an instance, i.e. a negative classification: the instance does not belong to class l but to any other. Our goal is to build a Boosting algorithm that combines both positive and negative weak responses into a strong decision. Following this intuition we introduce new margin vectors by fixing a group of s labels, S ∈ P(L), and defining y^S in the following way:

    y^S = (y_1^S, ..., y_K^S)^T  with  y_i^S := 1/s if i ∈ S ,  −1/(K−s) if i ∉ S .   (3.1)

It is straightforward that any y^S is a margin vector [51, 114]. In addition, if S^c is the complementary set of S ∈ P(L), then y^{S^c} = −y^S. Let Ŷ be the whole set of vectors obtained in this fashion. We want to use Ŷ as target set, that is, f : X → Ŷ, but under a binary perspective. The difference with respect to other approaches using a similar codification [51, 114, 112] is that the correspondence defined in (2.16) is broken. In particular, weak-learners will take values in {y^S, −y^S} rather than in the whole set Ŷ. The combination of answers obtained by the Boosting algorithm will provide complete information over Ŷ. So now the correspondence for each weak-learner is binary,

    F^S(x) = ±1 ⇔ f^S(x) = ±y^S ,   (3.2)

where F^S : X → {+1, −1} is a classifier that recognizes the presence (+1) or absence (−1) of the group of labels S in the data. We propose a multi-class margin for evaluating the answer given by f^S(x). Data labels always belong to Y, but predicted ones, f^S(x), belong to Ŷ. In consequence, depending on s = |S|, we have four possible margin values,

    z = y^T f^S(x) = ±K/(s(K−1))        if y ∈ S ,
                   = ±K/((K−s)(K−1))    if y ∉ S ,   (3.3)

where the sign is positive/negative if the partial classification is correct/incorrect. The derivations of the above expressions are in Appendix A.1. We use the multi-class exponential loss just as it was introduced in section 2.3.2 to evaluate these margins (3.3):

    L(y, f^S(x)) = exp(−(1/K) y^T f^S(x)) .   (3.4)

In consequence, the above vectorial codification of labels will produce different degrees of punishment and reward depending on the number of classes separated by the weak-learner. Assume that we fix a set of classes, S, and an associated weak-learner that separates them from the rest, f^S(x). We may also assume that |S| ≤ K/2, since if |S| > K/2 we can choose S' = S^c and then |S'| ≤ K/2.
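A numerical sketch of the new encoding (0-based labels, names ours), checking (3.1) and the margin values of (3.3):

```python
import numpy as np

def label_vector(l, K):
    """Data label encoding from Y: 1 at l (0-based), -1/(K-1) elsewhere."""
    y = np.full(K, -1.0 / (K - 1))
    y[l] = 1.0
    return y

def separator_vector(S, K):
    """Margin vector y^S of eq. (3.1): 1/s on S, -1/(K-s) off S."""
    s = len(S)
    return np.array([1.0 / s if i in S else -1.0 / (K - s) for i in range(K)])

K, S = 5, {0, 1}                   # s = 2
s = len(S)
yS = separator_vector(S, K)
z_in = label_vector(0, K) @ yS     # true label inside S, separator answers +y^S
z_out = label_vector(3, K) @ yS    # true label outside S, separator answers +y^S
# eq. (3.3): z = K/(s(K-1)) if y in S, z = -K/((K-s)(K-1)) if y not in S
print(z_in, K / (s * (K - 1)))
print(z_out, -K / ((K - s) * (K - 1)))
```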
The failure or success of f^S(x) in classifying an instance x with label l ∈ S will have a larger margin than when classifying an instance with label l ∈ S^c. The margins in (3.3) provide the following rewards and punishments when used in conjunction with the exponential loss (3.4):

    L(y, f^S(x)) = exp(∓1/(s(K−1)))        if y ∈ S ,
                 = exp(∓1/((K−s)(K−1)))    if y ∉ S .   (3.5)

Regarding the class imbalance problem, the losses produced in (3.5) reflect the fact that the importance of instances in S is higher than that of those in S^c, since S is a smaller set. Hence, the cost of misclassifying an instance in S outweighs that of misclassifying one in S^c [87]. This fact may also be intuitively interpreted in terms of the "difficulty" or amount of information provided by a classification. Classifying a sample in S provides more information or, following the usual intuition behind Boosting, is more "difficult" than classifying an instance in S^c, since S^c has a broader range of possible labels. The smaller the set S, the more "difficult" or informative the result of classifying an instance in it will be. We can further illustrate this idea with an example. Assume that we work on a classification problem with K = 5 classes. We may select S1 = {1} and S2 = {1, 2} as two possible sets of labels to be learned by our weak-learners. Samples in S1 should be more important than those in S1^c or in S2, since S1 has the smallest class cardinality. Similarly, in general, it is easier to recognize data in S2 than in S1, since the latter is smaller; i.e. classifying a sample in S1 provides more information than classifying it in S2.
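The rewards and penalties that eq. (3.5) assigns in this 5-class example can be tabulated directly (values rounded to two decimals):

```python
import math

K = 5

def losses(s):
    """Success/failure losses of eq. (3.5) for a separator of s labels."""
    return ((math.exp(-1 / (s * (K - 1))), math.exp(1 / (s * (K - 1)))),              # y in S
            (math.exp(-1 / ((K - s) * (K - 1))), math.exp(1 / ((K - s) * (K - 1)))))  # y not in S

for s in (1, 2):  # S1 = {1} and S2 = {1, 2}
    (in_ok, in_bad), (out_ok, out_bad) = losses(s)
    print(s, [round(v, 2) for v in (in_ok, in_bad, out_ok, out_bad)])
# s=1 -> [0.78, 1.28, 0.94, 1.06];  s=2 -> [0.88, 1.13, 0.92, 1.09]
```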
Encoding labels with vectors from Y, we will have the following margin values and losses:

    z = y^T f^{S1}(x) = ±5/4   (y ∈ S1)    ⇒ L(y, f^{S1}) = e^{∓1/4}  = {0.78, 1.28} ,
                      = ±5/16  (y ∈ S1^c)  ⇒ L(y, f^{S1}) = e^{∓1/16} = {0.94, 1.06} ,

    z = y^T f^{S2}(x) = ±5/8   (y ∈ S2)    ⇒ L(y, f^{S2}) = e^{∓1/8}  = {0.88, 1.13} ,
                      = ±5/12  (y ∈ S2^c)  ⇒ L(y, f^{S2}) = e^{∓1/12} = {0.92, 1.09} .

Everything we say about instances in S1 will be the most rewarded or penalized in the problem, since S1 is the smallest class set. Set S2 is the second smallest; in consequence, classification in that set will produce the second largest rewards and penalties. Similarly, we "say more" excluding an instance from S2 = {1, 2} than from S1 = {1}, since S2^c is smaller than S1^c. In consequence, rewards and penalties for samples classified in S2^c will be slightly larger than those in S1^c. In Fig. 3.1 we display the loss values for the separators associated to the sets S1 and S2.

Figure 3.1: Values of the exponential loss function over margins, z, for the 5-class classification problem above. Possible margin values are obtained taking into account expression (3.5) for s = 1 and s = 2.

3.2 PIBoost

In this section we present the structure of PIBoost [24], whose pseudo-code is shown in Algorithm 10. At each Boosting iteration we fit as many weak-learners as groups of labels, G ⊂ P(L), are considered. The aim of each weak-learner is to separate its associated labels from the rest and to persevere in this task iteration after iteration. That is the reason why we call them separators. A weight vector W^S is associated to the separator of set S. For each set S ∈ G, with s = |S|, PIBoost builds a stage-wise additive model [37] of the form f_m(x) = f_{m−1}(x) + β_m g_m(x) (where the super-index S is omitted for ease of notation). In step 2 of the algorithm we estimate the constant β and the function g(x) for each group of labels and each iteration. The following Lemma solves the problem of finding these parameters.

Lemma 1.
Given an additive model f_m(x) = f_{m−1}(x) + β_m g_m(x) associated to a set of labels, S ∈ G, the solution to

    (β_m, g_m(x)) = arg min_{β, g(x)} Σ_{n=1}^{N} exp(−(1/K) y_n^T (f_{m−1}(x_n) + β g(x_n)))   (3.6)

is obtained in the following way:

• Given β > 0, the optimal weak learner is

    g_m = arg min_g  B1 Σ_{l_n ∈ S} w(n) I(y_n^T g(x_n) < 0) + B2 Σ_{l_n ∉ S} w(n) I(y_n^T g(x_n) < 0) ,

  with B1 = exp(β/(s(K−1))) − exp(−β/(s(K−1))) and B2 = exp(β/((K−s)(K−1))) − exp(−β/((K−s)(K−1))).

• Given a learner g(x), the optimal constant is β_m = s(K−s)(K−1) log R, where R is the only real positive root of the polynomial

    P_m(x) = E1(K−s) x^{2(K−s)} + s E2 x^K − s(A2 − E2) x^{K−2s} − (K−s)(A1 − E1) ,   (3.7)

  and the constants involved in both expressions are defined as follows: A1 = Σ_{l_n ∈ S} w(n), A2 = Σ_{l_n ∉ S} w(n) (i.e. A1 + A2 = 1), E1 = Σ_{l_n ∈ S} w(n) I(y_n^T g(x_n) < 0), E2 = Σ_{l_n ∉ S} w(n) I(y_n^T g(x_n) < 0), where W_{m−1} = {w(n)} is the weight vector of iteration m−1.

The proof of this result is in Appendix A.2. As can be seen in the Lemma, the optimizations of g_m(x) and β_m in (3.6) depend on each other. An efficient strategy to solve this problem is to iteratively optimize one of the variables assuming the other is known. We have considered two ways to proceed: 1) compute an initial g_m fixing an initial β_m (1, for simplicity); or 2) compute an initial g_m assuming B1 = B2. In both cases we have empirically confirmed that the results obtained with several iterations of this process are not significantly better than those of the first iteration. Hence, in Algorithm 10, we introduce the method assuming the second option (B1 = B2) and making no sub-iterations to obtain the optimal pair (g_m(x), β_m), which is the procedure selected for our experiments. Under this assumption, the optimal weak learner is calculated according to

    g_m = arg min_g Σ_{n=1}^{N} w(n) I(y_n^T g(x_n) < 0) ,   (3.8)

which is an efficient and practical criterion.
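Steps that depend on (3.7) need its positive root; a numerical sketch using `numpy.roots` (assuming K ≥ 2s, as guaranteed by the |S| ≤ K/2 convention; the K = 2 case, where the root is known in closed form, serves as a check):

```python
import numpy as np

def piboost_root(K, s, A1, A2, E1, E2):
    """Positive real root R of the polynomial P_m(x) of eq. (3.7);
    beta = s(K-s)(K-1) log R then follows from Lemma 1."""
    deg = 2 * (K - s)
    coeffs = np.zeros(deg + 1)                   # numpy.roots: highest degree first
    coeffs[0] += E1 * (K - s)                    # x^{2(K-s)} term
    coeffs[deg - K] += s * E2                    # x^K term
    coeffs[deg - (K - 2 * s)] -= s * (A2 - E2)   # x^{K-2s} term
    coeffs[deg] -= (K - s) * (A1 - E1)           # constant term
    r = np.roots(coeffs)
    r = r[(abs(r.imag) < 1e-9) & (r.real > 0)].real
    return float(r[0])

# K = 2, s = 1 sanity check: P(x) = Err x^2 - (1 - Err), so R = sqrt((1-Err)/Err)
R = piboost_root(K=2, s=1, A1=0.5, A2=0.5, E1=0.1, E2=0.1)
print(R, np.sqrt(0.8 / 0.2))  # both ~ 2.0
```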
Lemma 1 justifies steps 2(a), (c)¹, (d) and (e) in our pseudo-code. In case y ∈ S, the update rule 2(f) follows from:

    w^S(n) = w^S(n) exp(−(1/K) y_n^T β f^S(x_n))
           = w^S(n) exp(−(1/K) (±K/(s(K−1))) s(K−s)(K−1) log R_m)
           = w^S(n) exp(∓(K−s) log R_m) = w^S(n) R_m^{∓(K−s)} .

The case y ∉ S yields an analogous expression. Note in (3.7) that the root will be zero only if A1 = E1, which implies β = −∞. This possibility should be considered explicitly in the implementation. An interesting case with a closed-form solution for the polynomial occurs when K is an even number. Any separator of s = K/2 labels yields a simpler formula in (3.7):

    P(x) = (E1 + E2) x^K − (A1 − E1 + A2 − E2) = Err x^K − (1 − Err) ,

which can easily be solved in closed form, x = ((1 − Err)/Err)^{1/K}, which in turn provides the value

    β = (K(K−1)/4) log((1 − Err)/Err) = (s(K−1)/2) log((1 − Err)/Err) .   (3.9)

Footnote 1: In the expression l_n ∉ G_m^S(x), the set G_m^S(x) must be understood as G_m^S(x) = +1 ≡ S and G_m^S(x) = −1 ≡ S^c.

Algorithm 10: PIBoost
1- Initialize weight vectors w^S(n) = 1/N, with n = 1, ..., N and S ∈ G ⊂ P(L).
2- For m = 1 to the number of iterations M, and for each S ∈ G:
   (a) Fit a binary classifier G_m^S(x) over the training data with respect to its corresponding w^S.
   (b) Translate G_m^S(x) into g_m^S : X → Ŷ.
   (c) Compute the two types of errors associated with G_m^S(x):
       E1_{S,m} = Σ_{l_n ∈ S} w^S(n) I(l_n ∉ G_m^S(x_n)) ,   E2_{S,m} = Σ_{l_n ∉ S} w^S(n) I(l_n ∉ G_m^S(x_n))
   (d) Calculate R_m^S, the only positive root of the polynomial P_m^S(x) defined in (3.7).
   (e) Calculate β_m^S = s(K−s)(K−1) log R_m^S.
   (f) Update weights as follows (the sign +/− depends on the failure/hit of G_m^S):
       • If l_n ∈ S then w^S(n) = w^S(n) (R_m^S)^{±(K−s)} ,
       • If l_n ∉ S then w^S(n) = w^S(n) (R_m^S)^{±s} .
   (g) Re-normalize weight vectors.
3- Output Final Classifier: H(x) = arg max_k f_k(x), where f(x) = (f_1(x), ..., f_K(x))^T = Σ_{m=1}^{M} Σ_{S∈G} β_m^S g_m^S(x).

The shape of the final classifier is easy and intuitive to interpret.
The vectorial function built during the process collects in each k-th coordinate information that can be understood as a degree of confidence for classifying sample x into class k. The classification rule assigns the label with the highest value in its coordinate. This criterion has a geometrical interpretation provided by the codification of labels as K-dimensional vectors. Since the set Ŷ contains margin vectors, the process of selecting the most probable one is carried out on the hyperplane orthogonal to 1 = (1, . . . , 1)^⊤ (see Fig. 3.2). So, we build our decision on a subspace of R^K free of total indifference about labels. This means that the final vector f(x) built during the process will usually present a dominant coordinate representing the selected label. Ties between labels will only appear in degenerate cases. The plot on the right in Fig. 3.2 shows the set of pairs of vectors Ŷ defined by our extension, whereas the one on the left shows the set of vectors Y used in [51, 112]. Although the spanned gray hyperplane is the same, we exploit every binary answer in such a way that the negation of a class is directly translated into a new vector that provides positive evidence for the complementary set of classes in the final composition, f(x). The inner product of class labels y ∈ Y and classifier predictions in Ŷ, y^⊤ f(x), produces a set of asymmetric margin values in such a way that, as described in section 3.1, not all successes and failures have the same importance. Problems with four or more classes are harder to visualize but allow richer sets of margin vectors. The second key idea in PIBoost is that we can build a better classifier by collecting information from positive and negative classifications in Ŷ than by using only the positive classifications in the set Y. Each weak-learner, or separator, g^S, acts as a partial expert of the problem that provides us with a clue about the label of x.
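To make these encodings concrete, the following sketch (our own, with hypothetical function names) builds the margin vectors of both sets and evaluates the asymmetric margins y^⊤g for a three-class problem; the fractions reproduce the entries quoted later in Table 3.2 for K = 4.

```python
from fractions import Fraction

def label_vector(k, K):
    """Margin vector y_k in Y: +1 at the true class, -1/(K-1) elsewhere."""
    return [Fraction(1) if i == k else Fraction(-1, K - 1) for i in range(K)]

def separator_vector(S, K, answer):
    """Response vector g^S(x) in Y-hat: evidence +1/s spread evenly over the
    s classes of S when the binary answer is +1 (and -1/(K-s) over the rest);
    all signs flip when the answer is -1."""
    s = len(S)
    sign = 1 if answer == +1 else -1
    return [Fraction(sign, s) if i in S else Fraction(-sign, K - s)
            for i in range(K)]

def margin(y, g):
    return sum(a * b for a, b in zip(y, g))

# Three classes, separator of S = {0}: a hit on a class of S scores
# K/(s(K-1)) = 3/2, while a failure on a class outside S scores only
# -K/((K-s)(K-1)) = -3/4 -- successes and failures are weighted asymmetrically.
g = separator_vector({0}, 3, +1)
assert margin(label_vector(0, 3), g) == Fraction(3, 2)
assert margin(label_vector(1, 3), g) == Fraction(-3, 4)
assert sum(g) == 0  # every vector lies on the hyperplane orthogonal to 1
```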
Note here that when a weak-learner classifies x as belonging to a set of classes, the value of its associated step β, which depends on the success rate of the weak-learner, is evenly distributed among the classes in the set. In the same way, the bet will be used to evenly reduce the confidence on the coordinates corresponding to non-selected classes.

Figure 3.2: Margin vectors for a problem with three classes. The left figure presents the set of vectors Y. The right plot presents the set Ŷ.

This balance inside selected and discarded classes is reflected in a margin value with a sensible multi-class interpretation. In other words, every answer obtained by a separator is directly translated into multi-class information in a fair way.

3.2.1 AdaBoost as a special case of PIBoost

At this point we can verify that PIBoost applied to a two-class problem is equivalent to AdaBoost. In this case we only need to fit one classifier at each iteration². Thus there will be only one weight vector to be updated and only one group of β constants. It is also easy to match the expression of the parameter β computed in PIBoost with the value of α computed in AdaBoost just by realizing that, for a fixed iteration whose index we omit, the polynomial in step 2-(d) is

P(x) = (E1 + E2) x² − (A1 − E1 + A2 − E2) = Err · x² − (1 − Err) .

Solving this expression we get R = ((1 − Err)/Err)^{1/2}, thus β = (1/2) log((1 − Err)/Err), which is indeed the value of α in AdaBoost. It can also be verified by substituting K = 2 and s = 1 in expression (3.9). Finally, it is straightforward that the final decisions are equivalent. If we transform AdaBoost’s labels, L = {+1, −1}, into PIBoost’s, L' = {1, 2}, the classification rule H(x) = sign( Σ_{m=1}^M α_m h_m(x) ) turns into Ĥ(x) = arg max_k f_k(x), where

f(x) = (f_1(x), f_2(x))^⊤ = Σ_{m=1}^M β_m g_m(x) .

² Separating the first class from the second is equivalent to separating the second from the first and, of course, there are no more possibilities.
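A quick numeric sanity check of this equivalence (our own illustration, not part of the thesis experiments):

```python
import math

def piboost_beta_even(K, err):
    """Closed-form PIBoost step (3.9) for a separator of s = K/2 labels."""
    return K * (K - 1) / 4.0 * math.log((1 - err) / err)

def adaboost_alpha(err):
    """AdaBoost's classical weak-learner weight."""
    return 0.5 * math.log((1 - err) / err)

# For K = 2 (hence s = 1) the PIBoost step collapses to AdaBoost's alpha.
for err in (0.05, 0.2, 0.4):
    assert abs(piboost_beta_even(2, err) - adaboost_alpha(err)) < 1e-12
```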
3.2.2 Asymmetric treatment of partial information

Our codification of class labels and classifier responses produces different margin values. This asymmetry in evaluating successes and failures in the classification may also be interpreted as a form of asymmetric Boosting. As such it is directly related to the Cost-Sensitive AdaBoost in [63]. Using the cost matrix defined in Table 3.1, we can relate the PIBoost algorithm to the Cost-Sensitive AdaBoost [63]. If we denote b = E1^S, d = E2^S, T+ = A1, T− = A2, then the polynomial (3.7), P^S(x), solved at each PIBoost iteration to compute the optimal step β_m along the direction of largest descent g_m(x), is equivalent to the following cosh-dependent expression used in the Cost-Sensitive AdaBoost to estimate the same parameter [63]:

2·C1·b·cosh(C1 α) + 2·C2·d·cosh(C2 α) = C1·T+·e^{−C1 α} + C2·T−·e^{−C2 α} ,   (3.10)

where the costs {C1, C2} are the non-zero values in Table 3.1. In consequence, PIBoost is a Boosting algorithm that combines a set of cost-sensitive binary weak-learners whose costs depend on the number of classes separated by each weak-learner.

Real \ Pred. |        S         |       S^c
     S       |        0         |   1/(s(K−1))
     S^c     |  1/((K−1)(K−s))  |       0

Table 3.1: Cost matrix associated with a PIBoost separator of a set S with s = |S| classes.

See the work of I. Landesa-Vazquez and J.L. Alba-Castro [50] for a better understanding of equation (3.10). In their proposal a double base³ analysis of the asymmetries is discussed. Moreover, they also resort to a polynomial with the same shape as (3.7) to find the optimal constant β added at each iteration. We find quite interesting the way they decompose the equation to speed up the solution of the polynomial in their resulting algorithm, AdaBoostDB.

3.2.3 Common sense pattern

We must emphasize that PIBoost’s structure connects with a common-sense pattern. In fact we apply this philosophy in our everyday life when we try to guess something by discarding possibilities.
Let us illustrate this with an example. Assume that a boy knows that his favourite pen has been stolen in his classroom. Even though he probably thinks of a suspect, he also has the chance to ask each classmate what he or she knows about the issue. Perhaps doing so he will collect a pool of useful answers of the kind: “I think it was Jimmy”, “I am sure it was not a girl”, “I just know that it was neither me nor Victoria”, “I would suspect Martin and his group of friends”, etc. It is clear that none of these answers leads the boy to a final conclusion (at least, it should not) but together they form quite a useful set of clues. Supposing that no more information is available, an immediate strategy could be to sum up those answers weighted by the degree of confidence associated with each classmate questioned. Thus, combining all that information, our protagonist should have one suspect at the end of the working day. It is easy to find similarities between such a situation and the structure of PIBoost: the answer of each friend can be seen as a weak-learner G^S_m, the level of credibility (or trust) associated with each one is our β_m, while the iteration value m can be thought of as a measure of time in the relationship with the classmates.

³ Such double base comes from the two arguments of cosh appearing in (3.10).

3.3 Related work

Here we discuss some relationships between the ECOC-based algorithms discussed in section 2.3.1 and PIBoost. Table 3.3 summarizes the main properties of the four algorithms, extending the comparison presented in that section. Firstly, the loss function applied for updating weights in AdaBoost.OC relies on the exponential loss with a pseudo-loss as argument, while AdaBoost.MO and AdaBoost.ECC use an exponential loss function with binary arguments. In section 3.1 we have highlighted the importance of using a pure multi-class loss function for achieving different margin values, hence penalizing binary failures within a real multi-class context.
With our particular treatment of binary sub-problems we extend AdaBoost in a more natural way, because PIBoost can be seen as a group of binary AdaBoost classifiers tied together via the multi-class exponential loss function, where every partial answer is well suited to the original multi-class problem. It is not necessary to manage all instance weights linked as one when later a binary loss or a pseudo-loss (one that accounts for “failures” due to classes over-selected by the coloring function µ_m) will be used. Besides, the resulting structure of PIBoost is similar to the {±1}-matrix of ECOC algorithms, except for the presence of fractions. At each iteration of PIBoost there is a block of |G| response vectors that, grouped as columns, form a (K × |G|)-matrix similar to the |G| weak learners of ECOC-based algorithms. Table 3.2 shows the case of a problem with four labels when G consists of all single labels in union with all pairs of labels. However, in our approach, fractions let us make an even distribution of evidence among the classes in a set, whereas in the ECOC philosophy every binary sub-problem has the same importance for the final count.

        Label 1   Label 2   Label 3   Label 4   Label 1-2   Label 1-3   Label 1-4
1          1       -1/3      -1/3      -1/3       1/2         1/2         1/2
2        -1/3        1       -1/3      -1/3       1/2        -1/2        -1/2
3        -1/3      -1/3        1       -1/3      -1/2         1/2        -1/2
4        -1/3      -1/3      -1/3        1       -1/2        -1/2         1/2

Table 3.2: An example of encoding matrix for PIBoost’s weak learners when K = 4 and G = {all single labels} ∪ {all pairs of labels}.

With respect to margin values, E.L. Allwein, R. Schapire and Y. Singer [3] discussed the strong connections between ECOC-based algorithms and margin theory. The framework developed by the authors provides a new perception of multi-class algorithms based on binary subproblems, unifying the most relevant ones by using matrix encodings with values in {+1, 0, −1}. A deeper analysis of this fact for AdaBoost.MO, ECC and OC can be found in [91].
Although this development is broad enough to cover the most popular multi-class Boosting methods, we find no reason for using binary margin values, z(k,r) = l(k,r) f_r(x), to measure the quality of each bit f_r(x) in a predicted codeword f(x) = (f_1(x), . . . , f_R(x)), where l(k,r) is the real value associated with label k for the r-th binary subclassifier. The same binary loss function can be applied to the resulting value. Our conception of margin values for multi-class problems based on vectorial encodings is richer and provides a broader range of values, as was shown in section 3.1.

So far we have highlighted three essential points, namely: the use of different loss functions, the fact of handling the evidence for or against a set of labels evenly, and our conception of margin; but there are other features distinguishing PIBoost from ECOC-based Boosting algorithms. We describe them briefly. Since we develop independent blocks of weak learners, we endow each sub-model with its own weight vector, W^S, instead of using a whole weight matrix W ∈ R^{N×R} satisfying Σ_{n=1}^N Σ_{r=1}^R w(n, r) = 1 (with R = K for the .OC and .ECC variants). Our way of proceeding lets each separator evolve independently, thus providing responses uncorrelated with those of the rest of the separators. Another important aspect is the shape of the final decision rule. Both AdaBoost.OC and AdaBoost.ECC build a K-dimensional function f(x) = (f_1(x), . . . , f_K(x))^⊤ whose maximum coordinate is selected as response. We argued that this particularity is shared by our algorithm, but we gave a geometric meaning to this way of summarizing information. In fact, it is easy to prove that our rule is equivalent to choosing the vector in Y that provides maximum margin, i.e. arg max_k y_k^⊤ f(x).
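This equivalence is easy to check numerically: since y_k^⊤ f = (K/(K−1)) f_k − (1/(K−1)) Σ_i f_i, both criteria rank the labels identically. A quick check (our own illustration):

```python
from fractions import Fraction
import random

def label_vector(k, K):
    """Margin vector y_k: +1 at coordinate k and -1/(K-1) elsewhere."""
    return [Fraction(1) if i == k else Fraction(-1, K - 1) for i in range(K)]

# Both decision rules -- maximum coordinate of f, and maximum margin y_k^T f --
# pick the same label; we verify it on random integer-valued responses.
random.seed(0)
K = 5
for _ in range(200):
    f = [Fraction(random.randint(-50, 50)) for _ in range(K)]
    by_coordinate = max(range(K), key=lambda k: f[k])
    by_margin = max(range(K),
                    key=lambda k: sum(y * c for y, c in zip(label_vector(k, K), f)))
    assert by_coordinate == by_margin
```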
With regard to the number of weak learners computed at each iteration, we believe that by fitting blocks of separators at each iteration there is no need to compute coloring functions, which build a single separator based on random patterns [80] and, consequently, do not ensure a uniform label covering⁴. In the same way, we find the use of “code-words” for denoting labels interesting, but we think that, once the final hypothesis is committed, it is more intuitive to select the coordinate with maximum value than to resort to a metric measuring the closeness between labels and the obtained final row. Jointly with this last observation, remark that PIBoost, unlike ECOC-based algorithms, computes one value β^S_m per set S and iteration m. In other words, at each iteration |G| constants are calculated.

Issue                      | AdaBoost.MO               | AdaBoost.OC                           | AdaBoost.ECC                          | PIBoost
Weights for instances      | W ∈ R^{N×R} for training  | W ∈ R^{N×K} and D ∈ R^N for training  | W ∈ R^{N×K} and D ∈ R^N for training  | A vector W^S per set S ∈ G
Constants per iteration    | One α_m                   | One α_m                               | One g_m(x)                            | One β^S_m per set S ∈ G
Loss function              | Binary Exponential        | Pseudo-loss Exponential               | Binary Exponential                    | Multi-class Exponential
Final Classifier           | Expression (2.15)         | arg max_{l∈L} Σ_m α_m Ī_m(x)          | arg max_{l∈L} Σ_m g_m(x) µ_m(l)       | Max. coordinate of Σ_{m,S∈G} β^S_m g^S_m(x)
W. Learners per iteration  | 1                         | 1                                     | 1                                     | |G|

Table 3.3: Comparison of the main properties of ECOC-based algorithms and PIBoost. µ_m(l) denotes the coloring function µ_m : L → {±1} at iteration m. R denotes the length of “code-words”. In AdaBoost.OC, Ī_m(x) indicates I(h_m(x) = µ_m(l)).

At this point one may suspect that, independently of the selected group G, PIBoost requires too much computational load at each iteration. This is partially true because, as was said above,

⁴ We assume that, when working with PIBoost, the user will select a group G that contains, at least, all single labels {{k} | k ∈ L}.
separators receive a particular treatment, similar to several binary AdaBoost classifiers linked via the multi-class exponential loss function. Does such a schedule pay off when efficiency is required? We postpone our answer to the next section, where the quality of PIBoost is revealed even for few iterations.

We must emphasize that PIBoost is not the only multi-class Boosting algorithm using label separators asymmetrically. As we said in section 2.3.1, AdaBoost.ECC [35] presents different voting weights g_m(x) (= α_m or −β_m) depending on the class assigned to (x, l) by the weak learner h_m(x), see (2.14). That is, for a fixed iteration m, the algorithm groups the instances into two sets: G+ = {(x, l) | h_m(x) = +1} and G− = {(x, l) | h_m(x) = −1}. Then the two possible voting constants are independently computed using the same weight vector for instances, D_m. Namely:

α_m = (1/2) · ln( Σ_{Hits in G+} D_m(n) / Σ_{Failures in G+} D_m(n) ) ;   −β_m = (1/2) · ln( Σ_{Hits in G−} D_m(n) / Σ_{Failures in G−} D_m(n) ) .

Using these values, the global weights for instances, w_{m+1}(n, l), are updated via the binary exponential loss function. We provide a similar handling of each separator, but asymmetries between the numbers of labels to be separated are considered inherently in our margin vectors. For PIBoost, after training a new weak learner, we take into account two types of errors (and two constants) that will yield just one value β^S_m once its associated polynomial P^S(x) is solved.

To complete this section we find it convenient to briefly discuss the important work of A. Torralba, K.P. Murphy and W.T. Freeman [95] for multi-label problems, where the JointBoost algorithm is proposed, and its similarity to PIBoost’s structure. Their algorithm takes advantage of the information obtained from binary subproblems focused on separating groups of labels, which links to our point of view on the multi-class problem.
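The two AdaBoost.ECC voting constants above can be sketched directly from their definitions; this is our own simplified illustration (hypothetical names), where `mu` plays the role of the coloring function:

```python
import math

def ecc_voting_weights(D, labels, preds, mu):
    """The two AdaBoost.ECC voting constants: instances are split by the
    weak-learner's binary answer into G+ and G-, and each group receives its
    own half-log-odds weight computed from the shared weight vector D.
    `mu` maps each label to +1 or -1 (the coloring of the current iteration)."""
    def mass(group_sign, hit):
        return sum(d for d, l, p in zip(D, labels, preds)
                   if p == group_sign and (mu[l] == p) == hit)
    alpha = 0.5 * math.log(mass(+1, True) / mass(+1, False))       # vote on G+
    minus_beta = 0.5 * math.log(mass(-1, True) / mass(-1, False))  # vote on G-
    return alpha, minus_beta
```

For instance, with uniform weights over six samples and one failure on each side of the split, both constants come out as (1/2)·ln 2.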
JointBoost is designed to detect different possible objects in images by sharing features. This implies having a set of K labels representing the objects plus an extra label denoting the background. In consequence the problem is composed of K + 1 labels of unequal importance. A set of K confidence-rated predictors is fitted by adding separators in the following way: the optimal weak learners for separating groups of labels are computed, and then the best of them (after evaluating a variant of weighted error) is added to those predictors whose labels are separated by the chosen group of labels. For a 3-object problem, the fitted models have the shape:

H_1(x) = G_{1,2,3}(x) + G_{1,2}(x) + G_{1,3}(x) + G_1(x)
H_2(x) = G_{1,2,3}(x) + G_{1,2}(x) + G_{2,3}(x) + G_2(x)
H_3(x) = G_{1,2,3}(x) + G_{1,3}(x) + G_{2,3}(x) + G_3(x)   (3.11)

which is clearly different from PIBoost’s perspective. In addition, the way JointBoost computes weak learners and the use of weighted squared errors to compare separators are also elements that differentiate this algorithm from PIBoost. It is worth mentioning that tackling multi-label problems in this binary fashion is a well-known strategy for developing algorithms in the area, see [4, 43, 108, 105, 12, 22].

3.4 Experiments

Our goal in this section is to evaluate and compare the performance of PIBoost. We have selected fourteen data sets from the UCI repository: CarEvaluation, Chess, CNAE9, Isolet, Multifeatures, Nursery, OptDigits, PageBlocks, PenDigits, SatImage, Segmentation, Vehicle, Vowel and WaveForm. They have different numbers of input variables (6 to 856), classes (3 to 26) and instances (846 to 28056), and represent a wide spectrum of types of problems. Although some data sets have separate training and test sets, we use both of them together, so the performance of each algorithm can be evaluated using cross-validation. Table 3.4 shows a summary of the main features of the data sets⁵.
For comparison purposes we select three well-known multi-class Boosting algorithms.

Data set        Variables   Labels   Instances
CarEvaluation        6         4        1728
Chess                6        18       28056
CNAE9              856         9        1080
Isolet             617        26        7797
Multifeatures      649        10        2000
Nursery              8         5       12960
OptDigits           64        10        5620
PageBlocks          10         5        5473
PenDigits           16        10       10992
SatImage            36         7        6435
Segmentation        19         7        2310
Vehicle             18         4         846
Vowel               10        11         990
Waveform            21         3        5000

Table 3.4: Summary of selected UCI data sets.

AdaBoost.MH [80] is perhaps the most prominent example of a multi-class classifier with binary weak-learners. Similarly, SAMME [112] is, from our perspective, the best-known canonical multi-class algorithm with multi-class weak-learners. Finally, Multi-category GentleBoost [114] is probably the most accurate method that treats labels separately at each iteration. We display their respective pseudo-codes in Algorithms 2, 5 and 6.

Selecting a weak-learner that provides a fair comparison among different Boosting algorithms is important at this point. SAMME requires multi-class weak-learners while, on the other hand, AdaBoost.MH and PIBoost can use even simple stump-like classifiers. Besides, Multi-category GentleBoost requires the use of regression over continuous variables to compute its weak-learners. We select classification trees as weak-learners for the first three algorithms and regression trees for the latter. For classification trees the following growing schedule is adopted. Each tree grows by splitting impure nodes that hold more than N̄/K instances (where N̄ is the number of samples selected for fitting the tree), so this value is a lower bound for splitting. We find good results for the sample-size parameter when N̄ < 0.4·N, where N is the training data size. In particular we fix N̄ = 0.1·N for all data sets. In the case of regression trees the growing pattern is similar, but the bound of N̄/K instances for splitting produces poor results. Here more complex trees achieve better performance.
In particular, when the minimum bound for splitting is N̄/2K instances we obtained excellent error rates. A pruning process is also carried out for both types of trees.

⁵ The original Nursery data set has 2 instances labeled with the second class. We omit them in order to apply cross-validation properly.

We experiment with two variants of PIBoost. The first one takes all single labels, G = {{k} | k ∈ L}, as the group of sets to separate, while the second one, more complex, takes all single labels plus all pairs of labels, G' = G ∪ {{k, l} | k ≠ l; k, l ∈ L}. We must emphasize the importance of selecting a good group of separators for achieving the best performance. Depending on the number of classes, selecting an appropriate set G is a problem in itself. Knowledge of the dependencies among label sets will certainly help in designing a good set of separators. We leave this issue as future work.

For the experiments we fix a number of iterations that depends on the algorithm and the number of labels of each data set. Since the five algorithms considered in this section fit a different number of weak-learners at each iteration, we select the number of iterations for each one in such a way that all experiments have the same number of weak-learners (see Table 3.5). Remember that, when a data set presents K labels, PIBoost(2) fits C(K,2) + K separators per iteration while PIBoost(1) and GentleBoost fit only K. Besides, SAMME and AdaBoost.MH fit one weak-learner per iteration. In Fig. 3.3 we plot the performance of all five algorithms. The splitting criterion for classification trees is the Gini index.
Data set (classes)    GentleBoost   AdaBoost.MH   SAMME   PIBoost(1)   PIBoost(2)    #WL
CarEvaluation (4)          70            280        280        70        40 [7]       280
Chess (18)                 95           1710       1710        95        10 [171]    1710
CNAE9 (9)                 100            900        900       100        20 [45]      900
Isolet (26)               135           3510       3510       135        10 [351]    3510
Multifeatures (10)        110           1100       1100       110        20 [55]     1100
Nursery (5)               120            600        600       120        40 [15]      600
OptDigits (10)            110           1100       1100       110        20 [55]     1100
PageBlocks (5)            120            600        600       120        40 [15]      600
PenDigits (10)            110           1100       1100       110        20 [55]     1100
SatImage (7)               80            560        560        80        20 [28]      560
Segmentation (7)           80            560        560        80        20 [28]      560
Vehicle (4)                70            280        280        70        40 [7]       280
Vowel (11)                120           1320       1320       120        20 [66]     1320
Waveform (3)               40            120        120        40        40 [3]       120

Table 3.5: Number of iterations considered for each Boosting algorithm. The first column displays the data set name with the number of classes in parentheses. Columns two to six display the number of iterations of each algorithm. For PIBoost(2) the number of separators per iteration appears inside brackets. The last column displays the number of weak-learners used for each data set.

The performance of a classifier corresponds to that achieved at the last iteration, combining all learned weak classifiers. We evaluate the performance of the algorithms using 5-fold cross-validation. Table 3.6 shows these values and their standard deviations. As can be seen, PIBoost (in its two variants) outperforms the rest of the methods on many data sets. Once the algorithms are ranked by accuracy, we use the Friedman test to assess whether the performance differences are statistically significant [15]. As expected, the null hypothesis (all algorithms have the same quality) is rejected with a p-value < 0.01. Hence we carry out a post-hoc analysis. We use the Nemenyi test to group the algorithms that present insignificant differences [15]. Figure 3.4 shows the result of the test for both the α = 0.05 and α = 0.1 significance levels.
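For reference, the Friedman statistic used here can be computed directly from the (data sets × algorithms) table of error rates. The sketch below is our own minimal implementation of the procedure described in [15], not the code used in the experiments:

```python
def friedman_statistic(errors):
    """Friedman chi-square over a (data sets x algorithms) table of error
    rates; a lower error means a better (smaller) rank, and tied entries
    receive the mean of the tied ranks."""
    N, k = len(errors), len(errors[0])
    ranks = [[0.0] * k for _ in range(N)]
    for i, row in enumerate(errors):
        order = sorted(range(k), key=lambda j: row[j])
        pos = 0
        while pos < k:
            tie_end = pos
            while tie_end + 1 < k and row[order[tie_end + 1]] == row[order[pos]]:
                tie_end += 1
            mean_rank = (pos + tie_end) / 2.0 + 1.0
            for t in range(pos, tie_end + 1):
                ranks[i][order[t]] = mean_rank
            pos = tie_end + 1
    avg = [sum(ranks[i][j] for i in range(N)) / N for j in range(k)]
    chi2 = 12.0 * N / (k * (k + 1)) * (sum(r * r for r in avg) - k * (k + 1) ** 2 / 4.0)
    return chi2, avg

# Two algorithms over three data sets, the first always better: average ranks
# (1, 2) and chi-square 12*3/(2*3) * (5 - 4.5) = 3.
chi2, avg = friedman_statistic([[0.10, 0.20], [0.05, 0.30], [0.08, 0.09]])
assert avg == [1.0, 2.0] and abs(chi2 - 3.0) < 1e-12
```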
Summarizing, PIBoost(1) can be considered as good as PIBoost(2) and also as good as the rest of the algorithms, but PIBoost(2) is significantly better than the latter. In addition, we use the Wilcoxon matched-pairs signed-ranks test to assess the statistical significance of the performance in comparisons between pairs of algorithms [15]. Table 3.7 presents the p-values obtained after comparing PIBoost(1) and PIBoost(2) with the others. Again, it is clear that the latter is significantly better than the rest.

Figure 3.3: Plots comparing the performances of the Boosting algorithms (GentleBoost, AdaBoost.MH, SAMME, PIBoost(1) and PIBoost(2)), one panel per data set. In the vertical axis we display the error rate. In the horizontal axis we display the number of weak-learners fitted for each algorithm.

Additionally, we have also performed one more experiment with the Amazon data set to assess the performance of PIBoost in a problem with a very high-dimensional space and a large number of labels. This data set also belongs to the UCI repository. It has 1500 sample instances with 10000 features grouped in 50 classes. With this data set we follow the same experimental design as with the other data sets, but only use the PIBoost(1) algorithm. In Figure 3.5 we plot the evolution in the performance of each algorithm as the number of weak-learners increases.

Figure 3.4: Diagram of the Nemenyi test. The average rank for each method is marked on the segment. We show critical differences for both the α = 0.05 and α = 0.1 significance levels at the top. We group with a thick blue line algorithms with no significantly different performance.

At the last iteration, PIBoost(1) has an error rate and standard deviation of 0.4213 (±374 × 10⁻⁴), whereas Multi-category GentleBoost has 0.5107 (±337 × 10⁻⁴), SAMME 0.6267 (±215 × 10⁻⁴) and, finally, AdaBoost.MH 0.7908 (±118 × 10⁻⁴).

Discussion

The experimental results confirm our initial intuition that by increasing the range of margin values and considering the asymmetries in the class distribution generated by the weak-learners we can significantly improve the performance of Boosting algorithms. This is particularly evident in problems with a large number of classes and few training instances. In the same way, the performance gain is evident in high-dimensional spaces, see the results for CNAE9, Isolet and, of course, Amazon. Moreover, as can be observed in Table 3.5 and Table 3.6, our second variant of PIBoost produces good results even when few iterations are computed, see Isolet, OptDigits, SatImage, Segmentation or Vowel.
Data set        GentleBoost      AdaBoost.MH      SAMME            PIBoost(1)       PIBoost(2)
CarEvaluation   0.0852 (±121)    0.0713 (±168)    0.0487 (±111)    0.0325 (±74)     0.0377 (±59)
Chess           0.5136 (±61)     0.4240 (±34)     0.5576 (±63)     0.5260 (±118)    0.5187 (±74)
CNAE9           0.0870 (±239)    0.1028 (±184)    0.1111 (±77)     0.1472 (±193)    0.0824 (±171)
Isolet          0.1507 (±94)     0.5433 (±179)    0.0812 (±185)    0.1211 (±253)    0.0559 (±55)
Multifeatures   0.0460 (±128)    0.3670 (±822)    0.0135 (±44)     0.0340 (±96)     0.0145 (±82)
Nursery         0.1216 (±60)     0.0203 (±32)     0.0482 (±58)     0.0192 (±29)     0.0313 (±62)
OptDigits       0.0756 (±74)     0.0432 (±59)     0.0365 (±55)     0.0400 (±13)     0.0240 (±41)
PageBlocks      0.0291 (±52)     0.0276 (±46)     0.0386 (±87)     0.0364 (±47)     0.0302 (±50)
PenDigits       0.0221 (±11)     0.0113 (±29)     0.0484 (±62)     0.0358 (±40)     0.0192 (±25)
SatImage        0.1294 (±32)     0.1318 (±51)     0.3691 (±120)    0.1113 (±62)     0.0949 (±53)
Segmentation    0.0494 (±64)     0.0407 (±88)     0.0238 (±55)     0.0208 (±52)     0.0177 (±61)
Vehicle         0.2710 (±403)    0.3976 (±297)    0.2320 (±221)    0.2509 (±305)    0.2355 (±258)
Vowel           0.2818 (±322)    0.3525 (±324)    0.0667 (±114)    0.0646 (±183)    0.0606 (±160)
Waveform        0.1618 (±75)     0.1810 (±72)     0.1710 (±109)    0.1532 (±44)     0.1532 (±44)

Table 3.6: Error rates of the GentleBoost, AdaBoost.MH, SAMME, PIBoost(1) and PIBoost(2) algorithms for each data set in Table 3.4. Standard deviations appear inside parentheses in 10⁻⁴ scale. Bold values represent the best result achieved for each data set.

              GentleBoost   AdaBoost.MH   SAMME    PIBoost(1)
PIBoost(2)      0.0012        0.0203      0.0006     0.0081
PIBoost(1)      0.0580        0.1353      0.7148       —

Table 3.7: P-values corresponding to the Wilcoxon matched-pairs signed-ranks test.

Figure 3.5: Plot comparing the performances of the Boosting algorithms (GentleBoost, AdaBoost.MH, PIBoost(1) and SAMME) on the Amazon data set. In the vertical axis we display the error rate. In the horizontal axis we display the number of weak-learners fitted for each algorithm.
Chapter 4

Multi-class Cost-sensitive Boosting

In this chapter we take a second step in our multi-class classification problems by adding a cost matrix. Assigning values to penalize different types of errors is useful for addressing several relevant problems. One of them is the need to skew the decision boundaries to reduce the generalization error. This frequently occurs in unbalanced problems, in which the majority classes tend to be favoured by the regular classification rules. The addition of costs is useful to increase the importance of minority classes and hence to correct the decision boundaries. Another important type of problem comes with ordinal regression. It consists of a classification problem where the labels are obtained as the discretization of a continuous variable and preserving their order is essential. A proper cost matrix may be useful to avoid “distant” failures, which obviously must be penalized more than “closer” ones. Finally, costs can be viewed by themselves as the objective of the classification. This is particularly evident in problems where each decision involves a cost and minimizing it becomes the goal of the algorithm. Such is the case in insurance, banking, or diagnosis applications.

In this chapter we study the addition of a cost matrix to the Boosting framework through the well-known exponential loss function. To this aim some algorithms have been proposed, but none of them may be considered a canonical extension of AdaBoost to the multi-class cost-sensitive field using multi-class weak learners. We will discuss them in the following section. This topic has been widely studied for binary problems, as was summarized in section 2.4. Now we present an extension to the multi-class field based on a new concept of margin.

The remainder of the chapter is organized as follows. Section 4.1 describes previous theories on multi-class cost-sensitive Boosting. Here we define our new concept of cost-sensitive margin.
In section 4.2 we show in detail the structure of our algorithm. This section also presents the main result that supports BAdaCost, jointly with some corollaries describing direct generalizations. Finally, in Section 4.3 we show experiments confirming the efficiency of our algorithm.

4.1 Cost-sensitive multi-class Boosting

Let us assume the misclassification costs for our multi-class problem are encoded using a (K×K)-matrix C, where each entry C(i, j) ≥ 0 measures the cost of misclassifying an instance with real label i when the prediction is j,

C = ( C(1,1)  C(1,2)  ...  C(1,K)
      C(2,1)  C(2,2)  ...  C(2,K)
       ...     ...    ...   ...
      C(K,1)  C(K,2)  ...  C(K,K) ) .

We expect this matrix to have costs for correct assignments lower than those of any wrong classification, i.e. C(i, i) < C(i, j), ∀i ≠ j. More generally, multi-class problems may be affected by costs in an instance-dependent way. In these situations an exclusive row of costs, C_n ∈ R^K, is associated with each instance (x_n, l_n), n = 1, . . . , N [58]. Obviously, using a cost matrix is a special case, since all samples with the same label would share the same penalizing row. We will not address this kind of problem in our study due to the lack of real applications.

Let us introduce some intuitive notation for the remainder of the chapter. Hereafter, M(j,−) and M(−,j) will be used to refer to the j-th row and the j-th column vector of a matrix M. Again, I(·) will denote the indicator function (1 when the argument is true, 0 when false). For cost-sensitive problems the regular Bayes Decision Rule is not suitable, since the label with maximum a posteriori probability may present a high cost. Rather, a cost-dependent criterion is applied. If P(x) = (P(1|x), . . .
, P(K|x))^⊤ is the vector of a posteriori probabilities for a given x ∈ X, then the Cost-sensitive Bayes Decision Rule is

F(x) = arg min_{j∈L} P(x)^⊤ C(−, j) ,   (4.1)

which is but the minimizer of the risk function R(P(x), C(−,j)) := P(x)^⊤ C(−,j) with respect to j ∈ L. When dealing with multi-class cost-sensitive problems one has to understand how the addition of a cost matrix influences the decision boundaries. For this purpose, we recommend the work of O’Brien et al. [69]. It displays a concise glossary of linear algebra operations on a cost matrix and their respective effects on the decision boundaries. Let

Σ_{h=1}^K C(h, i) P(h|x) = Σ_{h=1}^K C(h, j) P(h|x)   (4.2)

be the decision boundary between classes i and j, with i ≠ j. Here we describe the properties we will use:

1. The decision boundaries are not affected when C is replaced by αC, for any α > 0.
2. Adding a constant to all costs for a particular true class does not affect the final decision. In other words, adding a positive value to a row C(i, −) leaves the result unaffected.
3. As a consequence of the previous property, any cost matrix C can be replaced by an equivalent Ĉ with Ĉ(i, i) = 0, ∀i.

Proving each of them is immediate just by plugging each variant into expression (4.2). Taking into account the last property, we will assume without loss of generality that C(i, i) = 0, ∀i ∈ L, i.e. the cost of correct classifications is null. We will call 0|1-matrix a matrix with zeros on its diagonal and ones elsewhere (in other words, a matrix representing a pure multi-class problem). Let us focus our attention on the meaning of a cost matrix depending on its symmetry. We may have:

• Symmetric matrix. Since symmetric values are equal, no additional information is provided by mistaking in one direction or the other. This means that the actual information lies in comparing the costs associated with different decision boundaries, which can be ranked according to their importance.
Hence this structure is recommended for problems where some boundaries are more important than others. In Graph Theory this kind of matrix would represent an undirected complete graph with different distances between nodes (labels).

• Asymmetric matrix. This general case is appropriate for situations where some labels are more important than others. It is useful when the problem at hand presents unbalanced data, or simply when we are interested in avoiding some types of mistakes (possibly the most frequent ones). In Graph Theory this case would represent a directed complete graph in which the two paths between a pair of nodes (labels) may have different lengths.

In the following we will essentially consider this last case.

4.1.1 Previous works

Cost-sensitive classification problems with more than two labels have never been easy to translate to the area of Boosting. Following an initial intuition, one may try to decompose the problem into binary ones and then apply any algorithm described in Section 2.4. However, this can be a bad choice, since the (2 × 2)-matrix associated to each subproblem may be undefined or even useless for the global problem. For instance, when separating one label from the rest there is no justified way to compose the associated binary cost matrix. Furthermore, if the global matrix is symmetric and the idea is to separate one label from another (One-vs-One strategy), then every subproblem would be equally important, i.e. every submatrix would be equivalent to a binary cost-free problem.

There are several works in the literature that address the cost-sensitiveness of a problem in a paradigm-independent framework [21, 18, 111, 56, 106]. We will not consider these cases, since we are interested in introducing costs in the multi-class Boosting context. The contributions conceived for this purpose are:

• AdaC2.M1 [86]. The algorithm developed by Y. Sun et al. is probably the first to include costs when using multi-class weak learners.
The idea behind it is to combine the multi-class structure of AdaBoost.M1 [26] with the weighting rule of AdaC2 [89] (see Algorithm 8), hence its name. As can be guessed, no theoretical derivation supports this method; rather, it is a heuristic procedure for merging both extensions of AdaBoost into one. AdaC2.M1's pseudo-code is shown in Algorithm 11. Like its multi-class counterpart, it fails in computing α-values that are only available for "not too weak" learners, see step 2-(c). Moreover, it compresses the information of the row associated to a real label l into a single value, C_l = Σ_{j=1}^{K} C(l, j), which misses the structure of the given cost matrix (there are infinitely many different cost matrices producing the same values C_l, and therefore representing the "same problem" for the algorithm).

Algorithm 11: AdaC2.M1
1- Initialize the weight vector W with uniform distribution w(n) = 1/N, n = 1, …, N.
2- For m = 1 to M:
   (a) Fit a multi-class classifier Gm(x) to the training data using weights W.
   (b) Compute the weighted error: Errm = Σ_{n=1}^{N} w(n) I(Gm(xn) ≠ ln).
   (c) Compute αm = (1/2) log((1 − Errm)/Errm).
   (d) Update the weight vector w(n) ← w(n) C_{ln} exp(−αm I(Gm(xn) = ln)), n = 1, …, N.
   (e) Re-normalize W.
3- Output Final Classifier: H(x) = arg max_k Σ_{m=1}^{M} αm I(Gm(x) = k).

• Lp-CSB [58]. This algorithm was originally conceived to solve instance-dependent cost-sensitive problems. See its pseudo-code in Algorithm 12. The authors, A. C. Lozano and N. Abe, provided a new insight for solving the cost-sensitive problem. They resorted to relational hypotheses, h : X × L → [0, 1], satisfying the stochastic condition Σ_{l∈L} h(l|x) = 1, ∀ x ∈ X, to solve the minimization

    arg min_h (1/N) Σ_{n=1}^{N} C(ln, arg max_k h(k|xn)),    (4.3)

which is precisely the goal of the problem (minimizing the expected cost for a uniform distribution of instances).
Since the above function is not convex, they remedied this drawback by translating every term in (4.3) into

    C(ln, arg max_k h(k|xn)) = Σ_{k=1}^{K} ( h(k|xn) / max_j h(j|xn) )^∞ C(ln, k),    (4.4)

which, in turn, can be approximated with a p-norm (p ≥ 1) as follows:

    C(ln, arg max_k h(k|xn)) ≃ Σ_{k=1}^{K} ( h(k|xn) / max_j h(j|xn) )^p C(ln, k).    (4.5)

Since max_j h(j|xn) is greater than or equal to 1/K, expression (4.3) can be approximated by the following convexification:

    arg min_h (1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} h(k|xn)^p C(ln, k),    (4.6)

which becomes the objective of the Boosting algorithm. Then an extended-data approach is followed for stochastic hypotheses, just as in AdaBoost.MH [80]. Equation (4.7) shows the reweighting scheme applied. This time the voting constants, α, form a convex linear combination of weak learners (see step 3-(d) of Algorithm 12). We must point out that Lp-CSB generalizes a previous work of N. Abe et al. [1], where a Data Space Extension is proposed jointly with an Iterative Weighting scheme to derive the GBSE algorithm (Gradient Boosting with Stochastic Ensembles). Specifically, the latter becomes a particular case of Lp-CSB when p = 1. A minor drawback when applying Lp-CSB is the selection of the optimal value p for the norm. No clear foundation has been argued for considering values like 3 or 4, as can be seen in the experiments of [58].

Algorithm 12: Lp-CSB
1- Initialize H0 with uniform distribution H0(k|xn) = 1/K, ∀ n, ∀ k.
2- Set the expanded labeled data S̄ = { ((xn, k), l̄n,k) | k ∈ L, l̄n,k := I(k = ln) }.
3- For m = 1 to M:
   (a) Set w(x, k) = Hm−1(k|x)^{p−1} C(x, k), ∀ (x, k) ∈ S̄.
   (b) For all (xn, k) ∈ S̄ compute:
       w̄(xn, k) = w(xn, k)/2                 if l̄n,k = 0
       w̄(xn, k) = ( Σ_{j≠ln} w(xn, j) )/2    if l̄n,k = 1    (4.7)
   (c) Compute the relational hypothesis hm on S̄ with regard to the weights W̄.
   (d) Choose αm ∈ [0, 1), for example αm = 1/m.
   (e) Set Hm(x) := (1 − αm) Hm−1(x) + αm hm(x).
4- Output Final Classifier: H(x) = arg max_k HM(k|x).
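As a numerical sanity check of the approximation (4.5), the p-norm surrogate approaches the true decision cost of (4.4) as p grows. A minimal sketch with a made-up 3-class cost matrix and a made-up stochastic hypothesis:

```python
import numpy as np

def cost_pnorm(C, l, h, p):
    """Convex surrogate (4.5): sum_k (h_k / max_j h_j)^p * C(l, k)."""
    return float(np.sum((h / h.max()) ** p * C[l]))

# Hypothetical values, for illustration only.
C = np.array([[0.0, 2.0, 3.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
h = np.array([0.2, 0.5, 0.3])      # stochastic hypothesis for one instance
l = 0                               # true label
true_cost = C[l, np.argmax(h)]      # cost of the actual decision: C(0, 1) = 2

# As p grows, ratios below 1 vanish and only the arg-max term survives.
for p in (1, 4, 16, 64):
    assert cost_pnorm(C, l, h, p) >= true_cost
assert abs(cost_pnorm(C, l, h, 64) - true_cost) < 1e-3
```

This also illustrates the drawback noted above: for small p (e.g. 1) the surrogate over-counts the non-maximal labels, and no principled rule fixes the best p.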
• MultiBoost [102]. The most recent derivation and, from our point of view, the closest to AdaBoost's essence. The algorithm proposed by J. Wang resorts to margin vectors, y, g(x) ∈ Y, and a specially derived exponential loss function,

    L(l, f(x)) := Σ_{k=1}^{K} C(l, k) exp(f_k(x)),    (4.8)

to carry out a gradient descent search. The structure of the additive model fitted in this fashion is similar to the one obtained with SAMME [112]. MultiBoost's pseudo-code is shown in Algorithm 13. It is clear that the loss defined in (4.8) does not coincide with any other loss function when a 0|1-cost matrix is used. The main result supporting MultiBoost's theory states that the optimal pair (Tm(x), βm) to add to the additive model¹ f_m(x) = f_{m−1}(x) + βm gm(x) is found by solving:

    Tm(x) = arg min_T Σ_{n=1}^{N} A(n, T(xn)),    (4.9)

    βm = ((K − 1)/K) ( log((1 − Err)/Err) − log(K − 1) ),    (4.10)

where Err = Σ_{n=1}^{N} A(n, T(xn)) and the constants A(n, k) compose an (N × K)-matrix, A, of values derived from the costs (see step 2-(d)). Obviously, each of the above optimal parameters depends on the other. Since there is no direct way to solve (4.9), it is convenient to have a pool of weak learners from which to select the optimal one. This becomes the best strategy to accomplish step 2-(a).

¹ Remember the equivalence between multi-class weak learners, T(x), defined on L, and those defined over margin values, g(x). See (2.16).

Algorithm 13: MultiBoost
1- Initialize the constants A(n, j) = C(ln, j), n = 1, …, N; j = 1, …, K, and normalize A.
2- For m = 1 to M:
   (a) Solve for (Tm(x), βm) in expressions (4.9) and (4.10).
   (b) Translate Tm(x) in terms of margin vectors gm(x).
   (c) Update the additive model: f_m(x) = f_{m−1}(x) + βm gm(x).
   (d) Update the constants A(n, j) ← A(n, j) exp(βm g_m^j(xn)), n = 1, …, N; j = 1, …, K.
   (e) Re-normalize A.
3- Output Final Classifier: H(x) = arg max_k f_k(x), for f(x) = Σ_{m=1}^{M} βm gm(x).

Let us go back to the most relevant binary cost-sensitive Boosting algorithms discussed in Section 2.4. On the one hand, as far as we know, Cost-Sensitive AdaBoost [63] has not found a direct generalization to the multi-class field. We will show in Section 4.2.1 how our new algorithm BAdaCost accomplishes it. On the other hand, Cost-Generalized AdaBoost [49] resorts to the initial reweighting discussed in Section 2.4 to induce the cost-sensitive property on the original AdaBoost. Extending this process to multi-class problems seems unclear, since it would require assigning a different initial weight to each possible kind of error, which is impossible with just one weight vector.

4.1.2 New margin for cost-sensitive classification

Here we introduce the essence of our second algorithm. We first define the concept of multi-class cost-sensitive margin, which serves as a link between multi-class margins and the values of the cost matrix, both of them considered as arguments of a loss function. With this in mind, we introduce an essential change in the multi-class exponential loss function. Let C* be the (K × K)-matrix defined in the following way:

    C*(i, j) = C(i, j)               if i ≠ j
    C*(i, j) = −Σ_{h=1}^{K} C(i, h)  if i = j,    ∀ i, j ∈ L,    (4.11)

i.e. C* is obtained from C by replacing the i-th zero on the diagonal with the sum of the elements in the i-th row, with negative sign. For our cost-sensitive classification problem each value C*(j, j) will represent a "minus cost" associated to a type of correct classification. In other words, the elements on the diagonal should be understood as prizes for successes on instances with the corresponding real label. Notice that, by definition, the j-th row of C* is a margin vector that encodes the cost structure of the j-th label. This motivates us to use these rows as the new set of vectors for encoding true labels, i.e. we will keep in mind the bijection l ↔ C*(l, −).
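The construction in (4.11) is easy to verify numerically: every row of C* sums to zero, so each row is indeed a margin vector. A minimal sketch with a made-up cost matrix:

```python
import numpy as np

def cost_margin_matrix(C):
    """Build C* from (4.11): keep the off-diagonal costs and put minus
    the row sum on the diagonal (C is assumed to have a zero diagonal)."""
    C_star = C.astype(float).copy()
    np.fill_diagonal(C_star, -C.sum(axis=1))
    return C_star

# Hypothetical 3-class cost matrix with zero diagonal.
C = np.array([[0.0, 1.0, 2.0],
              [3.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
C_star = cost_margin_matrix(C)

# Every row of C* is a margin vector: its entries sum to zero.
assert np.allclose(C_star.sum(axis=1), 0.0)
# The diagonal entries are "prizes" (negative costs) for correct labels.
assert np.all(np.diag(C_star) < 0)
```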
Notice also that our vectors are neither equidistant nor of equal norm. In other words, this codification reorients the space of labels following the structure of the cost matrix.

Based on the above codification, we define the Multi-class Cost-sensitive Margin value for an instance (x, l) with respect to the multi-class vectorial classifier f(x) as

    z_C := C*(l, −) · f(x).    (4.12)

Analogously to expression (2.18), it is easy to verify that if f(x) = y_j ∈ Y, for a certain j ∈ L, then C*(l, −) · f(x) = (K/(K − 1)) C*(l, j). Therefore the multi-class cost-sensitive margins obtained from a discrete classifier f : X → Y can be calculated directly using the label-valued analogue of f, i.e. F : X → L, through the formula

    z_C = C*(l, −) · f(x) = (K/(K − 1)) C*(l, F(x)).    (4.13)

As a consequence, when considering a linear combination of discrete classifiers f = Σ_{m=1}^{M} αm f_m, the following expression can be applied:

    z_C = Σ_{m=1}^{M} αm C*(l, −) · f_m(x) = (K/(K − 1)) Σ_{m=1}^{M} αm C*(l, Fm(x)).    (4.14)

Our aim is to use this value as the argument of the multi-class exponential loss function, obtaining the Cost-sensitive Multi-class Exponential Loss Function, defined as follows:

    L_C(l, f(x)) := exp(z_C) = exp(C*(l, −) · f(x)).    (4.15)

It will be the loss function for our problem. Note that z_C yields negative values when classifications are good from the cost-sensitive point of view, while positive values come from costly assignments. That is why L_C does not need a negative sign in the exponent, as was the case for previous exponential loss functions. This is a key point in our proposal.

Let us now see the suitability of our new loss function. Specifically, let us discuss how it extends the loss functions of CS-AdaBoost and SAMME, respectively:

A) L_{CS-AdaBoost}(l, f(x)) = I(l = 1) exp(−l C1 f(x)) + I(l = −1) exp(−l C2 f(x)),
B) L_{SAMME}(y, f(x)) = exp( −(1/K) yᵀ f(x) ).

Proving that each of them is a special case of ours is quite simple. On the one hand, L_{CS-AdaBoost} is exactly the result of (4.15) applied to a binary cost-sensitive problem, where the off-diagonal values are denoted C1 := C(1, 2) and C2 := C(2, 1). On the other hand, we obtain L_{SAMME} just by fixing a 0|1-cost matrix re-scaled by a factor λ > 0, i.e. λ C_{0|1}. Specifically, when λ = 1/(K(K − 1)) we get exactly the same values provided by SAMME's margin vectors (see (2.19)). Hence BAdaCost's loss function generalizes its multi-class counterpart. In Subsection 4.2.1 we state complete results about this capability of generalization.

4.2 BAdaCost: Boosting Adapted for Cost-matrix

In this section we introduce BAdaCost [25], which stands for Boosting Adapted for Cost-matrix. Having defined the cost-sensitive multi-class exponential loss function and given a training sample {(xn, ln)}, we minimize the empirical expected loss, Σ_{n=1}^{N} L_C(ln, f(xn)), to obtain the new Boosting algorithm. Once more, the minimization is carried out by fitting an additive model, f(x) = Σ_{m=1}^{M} βm gm(x). The weak learner selected at each iteration m consists of an optimal step of size βm along the direction gm of largest descent of the expected cost-sensitive multi-class exponential loss function. In Lemma 2 we show how to compute them.

Lemma 2. Let C be a cost matrix for a multi-class problem. Given the additive model f_m(x) = f_{m−1}(x) + βm gm(x), the solution to

    (βm, gm(x)) = arg min_{β,g} Σ_{n=1}^{N} exp( C*(ln, −) · (f_{m−1}(xn) + β g(xn)) )    (4.16)

is the same as the solution to

    (βm, gm(x)) = arg min_{β,g} Σ_{j=1}^{K} ( Sj exp(β C*(j, j)) + Σ_{k≠j} Ej,k exp(β C*(j, k)) ),    (4.17)

where Sj = Σ_{ {n : g(xn)=ln=j} } w(n), Ej,k = Σ_{ {n : ln=j, g(xn)=k} } w(n), and the weight of the n-th training instance is given by

    w(n) = exp( C*(ln, −) · Σ_{t=1}^{m−1} βt gt(xn) ).    (4.18)

Given a known direction g, the optimal step β can be obtained as the solution to

    Σ_{j=1}^{K} Σ_{k≠j} Ej,k C(j, k) A(j, k)^β = Σ_{j=1}^{K} Σ_{h=1}^{K} Sj C(j, h) A(j, j)^β,    (4.19)

where A(j, k) = exp(C*(j, k)), ∀ j, k ∈ L. Finally, given a known β, the optimal descent direction g, or equivalently G, is given by

    arg min_G Σ_{n=1}^{N} w(n) ( A(ln, ln)^β I(G(xn) = ln) + Σ_{k≠ln} A(ln, k)^β I(G(xn) = k) ).    (4.20)

The proof of this result is in Appendix A.3.

The BAdaCost pseudo-code is shown in Algorithm 14. Just like other Boosting algorithms, we start the weights with a uniform distribution. At each iteration we add to the additive model a new multi-class weak learner gm : X → Y, weighted by βm, a measure of the confidence in the prediction of gm. The optimal weak learner that minimizes (4.20) is a cost-sensitive multi-class classifier trained using the data weights, w(n), and a modified cost matrix, A^β = exp(βC*) (element-wise). Observe that training a cost-sensitive weak learner does not imply computational difficulties. For instance, when using classification trees one can proceed in three ways [38]: adjusting the decision thresholds, changing the impurity-based split criterion, or applying cost-sensitive pruning. We simply recommend computing the cost-sensitive counterparts of the Gini index or the entropy as splitting criteria when fitting trees.

Algorithm 14: BAdaCost
1- Initialize the weight vector W, w(n) = 1/N, for n = 1, …, N.
2- Compute the matrices C*, following equation (4.11), and A for the given C.
3- For m = 1 to M:
   (a) Obtain Gm by minimizing (4.20) for β = 1. Translate Gm into gm : X → Y.
   (b) Compute the constants Ej,k and Sj, ∀ j, k, as described in Lemma 2.
   (c) Compute βm by solving equation (4.19).
   (d) Update the weights w(n) ← w(n) exp(βm C*(ln, −) · gm(xn)), for n = 1, …, N.
   (e) Re-normalize the vector W.
4- Output Classifier: H(x) = arg min_k C*(k, −) f(x), where f(x) = Σ_{m=1}^{M} βm gm(x).
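As a quick numerical check of the margin formula (4.13) and of the decision rule in step 4- of Algorithm 14, the following sketch uses a made-up 3-class cost matrix; with a 0|1-cost matrix the rule reduces to the usual max rule, as argued in the next paragraphs:

```python
import numpy as np

K = 3
# Hypothetical cost matrix with zero diagonal.
C = np.array([[0.0, 1.0, 2.0],
              [3.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
C_star = C.copy()
np.fill_diagonal(C_star, -C.sum(axis=1))

def margin_vector(j, K):
    """Vector y_j in Y: 1 at position j and -1/(K-1) elsewhere."""
    y = np.full(K, -1.0 / (K - 1))
    y[j] = 1.0
    return y

# Cost-sensitive margin (4.12)-(4.13) for a discrete classifier.
l = 0
z_correct = C_star[l] @ margin_vector(0, K)   # predicts the true label
z_wrong = C_star[l] @ margin_vector(2, K)     # costly prediction
assert z_correct < 0 < z_wrong                # good decisions give z_C < 0
assert np.isclose(z_correct, K / (K - 1) * C_star[l, 0])   # formula (4.13)

# Decision rule of Algorithm 14: arg min_k C*(k, -) f(x).  With a 0|1
# cost matrix it coincides with the max rule arg max_k f_k(x).
f = np.array([0.2, 1.1, -1.3])
C01_star = 1.0 - np.eye(K)
np.fill_diagonal(C01_star, -(1.0 - np.eye(K)).sum(axis=1))
assert np.argmin(C01_star @ f) == np.argmax(f)
```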
Unlike in other Boosting algorithms [28, 112], here gm and βm cannot be optimized independently. As with PIBoost, we solve this by alternately fixing the value of one of them and optimizing the other. This is performed by specifying a seed value for βm (1, for simplicity) and then computing first gm and then βm, consecutively. Otherwise, a loop should be included encompassing steps 3-(a), (b) and (c), in order to ensure a desired improvement in the cost of classification; once the improvement condition is satisfied, the optimal pair (gm, βm) would be added to the model. As with PIBoost, we have empirically confirmed that there are no significant differences if one performs a single iteration of the loop instead of many. Thus we proceed as described in Algorithm 14, which is considerably more efficient.

Finally, let us justify the decision rule in step 4- by comparing it with multi-class approaches. It is well known that vectorial classifiers, f(x), provide a degree of confidence for classifying a sample x into every class. Hence the max rule, arg max_k f_k(x), can be used for label assignment [112, 24]. It is straightforward to see that this criterion is equivalent to assigning the label that maximizes the multi-class margin, arg max_k y_kᵀ f(x) = arg min_k −y_kᵀ f(x). Since −y_kᵀ f(x) is proportional to C*_{0|1}(k, −) f(x), we can extend the decision rule to the cost-sensitive field just by assigning arg min_k C*(k, −) f(x).

4.2.1 Direct generalizations

We want to remark that our multi-class algorithm is a canonical generalization of previous Boosting algorithms. Specifically, we consider three canonical generalizations of AdaBoost in different contexts. First, Cost-Sensitive AdaBoost [63], a canonical extension for cost-sensitive binary problems. Second, SAMME [112], the best-grounded extension of AdaBoost to the multi-class field using multi-class weak learners.
Finally, we also generalize PIBoost [24], our previously presented multi-class algorithm. The following corollaries of Lemma 2 prove these canonical extension results. The proofs are shown in Appendices A.4, A.5 and A.6, respectively.

Corollary 1. When C(i, j) = 1/(K(K − 1)), ∀ i ≠ j, the above result is equivalent to SAMME. The update for the additive model f_m(x) = f_{m−1}(x) + βm gm(x) is given by

    (βm, gm(x)) = arg min_{β,g} Σ_{n=1}^{N} exp( −y_nᵀ (f_{m−1}(xn) + β g(xn)) )

and both optimal parameters can be computed in the following way:

• gm = arg min_g Σ_{n=1}^{N} w(n) I(g(xn) ≠ yn),
• βm = ((K − 1)²/K) ( log((1 − E)/E) + log(K − 1) ),

where E is the sum of all weighted errors.

Corollary 2. When K = 2, Lemma 2 is equivalent to Cost-Sensitive AdaBoost. If we denote C(1, 2) = C1 and C(2, 1) = C2, the update (4.17) for the additive model Fm(x) = Fm−1(x) + βm Gm(x) becomes:

    (βm, Gm(x)) = arg min_{β,G} Σ_{ {ln=1} } w(n) exp(−C1 β G(xn)) + Σ_{ {ln=2} } w(n) exp(C2 β G(xn)).

For a given value β the optimal Gm(x) is given by

    arg min_G (e^{βC1} − e^{−βC1}) b + e^{−βC1} T1 + (e^{βC2} − e^{−βC2}) d + e^{−βC2} T2,

where²: T1 = Σ_{ {n : ln=1} } w(n), T2 = Σ_{ {n : ln=2} } w(n), b = Σ_{ {n : G(xn)≠ln=1} } w(n) and d = Σ_{ {n : G(xn)≠ln=2} } w(n). Given a known direction, G(x), the optimal step βm can be calculated as the solution to

    2 C1 b cosh(βC1) + 2 C2 d cosh(βC2) = T1 C1 e^{−βC1} + T2 C2 e^{−βC2}.

Corollary 3. When using margin vectors to separate a group of s labels, S ∈ P(L), from the rest, the result of Lemma 2 is equivalent to PIBoost. The update for each additive model built in this fashion, f_m(x) = f_{m−1}(x) + βm gm(x), becomes:

    (βm, gm(x)) = arg min_{β,g} Σ_{n=1}^{N} w(n) exp( −(β/K) y_nᵀ g(xn) ).

For a given value β the optimal direction gm(x) is given by

    arg min_g (e^{β/(s(K−1))} − e^{−β/(s(K−1))}) E1 + e^{−β/(s(K−1))} A1
            + (e^{β/((K−s)(K−1))} − e^{−β/((K−s)(K−1))}) E2 + e^{−β/((K−s)(K−1))} A2,

where: A1 = Σ_{ {n : ln∈S} } w(n), A2 = Σ_{ {n : ln∉S} } w(n), E1 = Σ_{ {n : G(xn)≠ln∈S} } w(n) and E2 = Σ_{ {n : G(xn)≠ln∉S} } w(n).
Moreover, given a known direction g(x), the optimal step βm can be calculated as βm = s(K − s)(K − 1) log R, where R is the only real positive root of the polynomial

    Pm(x) = E1(K − s) x^{2(K−s)} + E2 s x^{K} − s(A2 − E2) x^{K−2s} − (K − s)(A1 − E1).

² Here we adopt the notation used in [63].

4.3 Experiments

It is time to evaluate the performance of BAdaCost on two relevant kinds of problems. To this aim, we devote a first subsection to introducing our procedure for computing cost matrices. Subsequently, in 4.3.2, we apply our method to some real data sets belonging to the UCI repository; in this first case, minimizing costs is the goal of classification. Secondly, in 4.3.3, we evaluate BAdaCost on a complex Computer Vision problem, the detection of synapses and mitochondria in medical images. Here the addition of a cost matrix serves as a tool to deal with unbalanced data sets.

4.3.1 Cost matrix construction

When working with cost-sensitive problems one should have reliable information about the associated penalties for every pair of labels. In Decision Theory this is obtained by directly asking an expert in the area. This can be expensive, time-consuming or even impossible to do, which is why having a real cost matrix is rare. Unfortunately, as far as we know, no general procedure has been proposed in the literature to solve this problem. For instance, a typical way to evaluate cost-sensitive algorithms is to generate random cost matrices. A process like this can be misleading, since such costs do not follow a reasonable design and may therefore yield meaningless results from the point of view of the decision boundaries between labels.

Likewise, when using a cost-sensitive algorithm to solve an unbalanced problem, the selection of a suitable cost matrix, C, becomes essential. In [86] the authors encoded the cost matrix in a K-element vector and used a genetic algorithm to optimize classification performance.
The drawbacks of this approach are both the computational complexity and the loss of information incurred by "vectorizing" the cost matrix.

The procedure we introduce computes a set of costs that punishes errors on the minority classes more heavily. A straightforward solution would be to set the costs inversely proportional to the class unbalance ratios [33]. This solution is not satisfactory, since costs are related not only to the relative number of samples in each class but also to the complexity of the classification problem, i.e. the amount of class overlapping, within-class unbalance, etc. Rather, we find it reasonable to infer this information from the confusion matrix obtained with a standard cost-free classifier.

Bearing both types of problems in mind, we include a pre-processing step that measures the hardness of the problem with regard to each pair of labels and is also appropriate for unbalanced data. It consists in running a simple cost-insensitive multi-class algorithm on the training data (for instance, BAdaCost with a 0|1-cost matrix for a few iterations), and then computing the associated confusion³ matrix, F. This matrix is extraordinarily informative in describing the overall difficulty of the problem. More precisely, it serves to properly measure the degree of overlapping between labels, and consequently allows us to focus on the most relevant boundaries. Using this information, we proceed as follows.

Let F* be the matrix obtained by dividing the i-th row, F(i, −), of F by Σ_j F(i, j), i.e. by the number of samples in class i. Consequently, F*(i, j) is the proportion of data in class i classified as j. These values collect the degree of overlapping with regard to each real label i; hence, high coefficients will be assigned to hard decision boundaries. Secondly, we transform F* = F*/max_{i,j} F*(i, j) to obtain a matrix with maximum coefficient equal to 1. Then we set the diagonal values F*(i, i) = 0, ∀ i.
Note that on a complex and unbalanced data set a 0|1-loss classifier will tend to over-fit the majority classes, so the off-diagonal elements in rows F*(i, −) will have low scores for majority classes and high scores for minority ones. If needed, off-diagonal zero values should be replaced by a small ε > 0; doing so guarantees that only correct classifications have null costs. Finally, to improve numerical conditioning, we rescale the resulting matrix, C = λF*, for an appropriate λ > 0. We can apply this transformation because any cost matrix can be multiplied by a positive constant without affecting the output labels of the problem [69].

4.3.2 Minimizing costs: UCI repository

To assess the performance of BAdaCost on real problems we resort again to the UCI catalogue of Machine Learning problems. In this first comparison we are interested in measuring the capability to minimize costs. A group of multi-class supervised data sets is selected, just as we did in Section 3.4. This time we have to be sure that the selected data represent complex problems for the process to be meaningful. The chosen data bases are: CarEvaluation, Chess, CNAE9, ContraMethod, Isolet, Letter, Shuttle, OptDigits, PenDigits, SatImage, Segmentation, and Waveform. They cover a broad scope of classification problems with regard to the number of variables (6 to 856), labels (3 to 26) and instances (1080 to 58000). Table 4.1 shows a description of the selected sets.

Table 4.1: Summary of selected UCI data sets

Data set       Variables  Labels  Instances
CarEvaluation  6          4       1728
Chess          6          18      28056
CNAE9          856        9       1080
ContraMethod   9          3       1473
Isolet         617        26      7797
Letter         16         26      20000
Shuttle        9          7       58000
OptDigits      64         10      5620
PenDigits      16         10      10992
SatImage       36         7       6435
Segmentation   19         7       2310
Waveform       21         3       5000

³ This matrix is also known as contingency matrix.

We compare BAdaCost with the algorithms introduced in Section 4.1.1: AdaC2.M1 [86], Lp-CSB [58] and MultiBoost [102].
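Before turning to the comparison, the cost-matrix construction of Section 4.3.1 can be sketched as follows; the confusion matrix and the ε, λ values here are hypothetical:

```python
import numpy as np

def cost_matrix_from_confusion(F, eps=1e-3, lam=1.0):
    """Sketch of the procedure in 4.3.1: row-normalize the confusion
    matrix, rescale to maximum coefficient 1, zero the diagonal, patch
    zero off-diagonal entries with eps, and scale by lambda."""
    F = np.asarray(F, dtype=float)
    F_star = F / F.sum(axis=1, keepdims=True)   # row i: proportions of class i
    F_star = F_star / F_star.max()              # maximum coefficient equal to 1
    np.fill_diagonal(F_star, 0.0)
    F_star[F_star == 0.0] = eps                 # only correct labels cost 0 ...
    np.fill_diagonal(F_star, 0.0)               # ... so re-zero the diagonal
    return lam * F_star

# Hypothetical confusion matrix: class 2 is a minority class that a
# cost-free classifier often confuses with class 0.
F = np.array([[90,  8,  2],
              [10, 85,  5],
              [12,  3,  5]])
C = cost_matrix_from_confusion(F)
assert np.all(np.diag(C) == 0.0)
assert np.all(C[~np.eye(3, dtype=bool)] > 0.0)
# Errors on the minority class (row 2) get the heaviest penalties.
assert C[2, 0] == C.max()

# Property 1 of Section 4.1: rescaling by lambda > 0 leaves the
# cost-sensitive decisions unchanged.
P = np.array([0.5, 0.3, 0.2])
assert np.argmin(P @ C) == np.argmin(P @ (7.0 * C))
```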
Their respective pseudo-codes appear in Algorithms 11, 12 and 13. For each data set we proceed in the following way. First, if needed, we unify train and test data into a single set. Then we carry out a 5-fold cross-validation process, taking care to maintain the original proportion of labels in each fold. When training, we compute a cost matrix following the criteria described in 4.3.1. Then we run each algorithm for 100 iterations. We resort to classification trees as base learners. As discussed in 4.1.1, AdaC2.M1 and BAdaCost allow the use of multi-class weak learners. MultiBoost also uses multi-class weak learners, but it requires a pool of them to work properly (minimizing (4.9) directly is intractable); for this reason we create a pool of 6000 weak learners. To assess different weighting schemes in this group of hypotheses, we sample data from 30%, 45%, and 60% of the training data (2000 weak learners for each ratio). Finally, Lp-CSB translates the multi-class problem into a binary one.

The average misclassification cost, (1/N) Σ_{n=1}^{N} C(ln, H(xn)), is collected at the end of each process. Note that, after rescaling the cost matrix, the final costs may sum to a very small quantity. We show the results in Table 4.2. It is clear that BAdaCost outperforms the rest of the algorithms on most of the data bases. To assess the statistical significance of the performance differences among the four methods we use the Friedman test of average ranks. The statistic clearly supports the alternative hypothesis, i.e. the algorithms do not achieve equivalent results. A post-hoc analysis then complements our arguments. We carry out the Bonferroni-Dunn test for the significance levels α = 0.05 and α = 0.10. The critical distances⁴ for these tests are CD_{0.05} = 1.2617 and CD_{0.10} = 1.1216, respectively. Figure 4.1 shows the final result. We can conclude that BAdaCost is significantly better than the AdaC2.M1 and Lp-CSB algorithms for the above significance levels.
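The critical distances just quoted can be reproduced with the usual Bonferroni-Dunn formula; the q_α values below are taken from the standard two-tailed table (an assumption worth double-checking against [15]):

```python
import math

def critical_distance(q_alpha, k, N):
    """Bonferroni-Dunn critical distance: CD = q_alpha * sqrt(k(k+1)/(6N)),
    for k classifiers compared over N data sets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * N))

# k = 4 algorithms, N = 12 UCI data sets.
cd_05 = critical_distance(2.394, k=4, N=12)   # q for alpha = 0.05, 4 classifiers
cd_10 = critical_distance(2.128, k=4, N=12)   # q for alpha = 0.10, 4 classifiers
assert abs(cd_05 - 1.2617) < 1e-3             # matches CD_0.05 above
assert abs(cd_10 - 1.1216) < 1e-3             # matches CD_0.10 above
```

Two methods are declared significantly different when their average ranks differ by more than the corresponding CD.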
In the case of MultiBoost we can state the same conclusion for α = 0.10, but not for α = 0.05 (the difference between ranks is 1.25).

⁴ This value depends on the number of classifiers being compared and on the number of data sets over which the comparison is carried out. See [15].

Table 4.2: Classification cost rates of AdaC2.M1, MultiBoost, Lp-CSB, and BAdaCost for each data set after 100 iterations. Standard deviations appear inside parentheses, in 10⁻⁴ scale. Bold values represent the best result achieved for each data base.

Data set     AdaC2.M1        MultiBoost      Lp-CSB          BAdaCost
CarEval      0.0026 (±9)     0.0232 (±36)    0.0038 (±15)    0.0024 (±15)
Chess        0.0029 (±5)     0.0262 (±34)    0.0004 (±3)     0.0160 (±9)
Isolet       0.0289 (±48)    0.0140 (±15)    0.0149 (±18)    0.0066 (±14)
SatImage     0.0478 (±62)    0.0187 (±23)    0.0170 (±26)    0.0132 (±11)
Letter       0.0491 (±78)    0.0319 (±53)    0.0161 (±23)    0.0066 (±7)
Shuttle      2.1e−5 (±0.07)  8.9e−5 (±0.08)  3.5e−5 (±0.13)  3.9e−5 (±0.3)
ContraMeth.  0.0980 (±129)   0.1058 (±214)   0.0938 (±359)   0.0928 (±253)
CNAE9        0.0397 (±103)   0.0171 (±57)    0.0241 (±108)   0.0191 (±51)
OptDigits    0.0366 (±120)   0.0134 (±27)    0.0170 (±34)    0.0030 (±8)
PenDigits    0.0326 (±29)    0.0193 (±34)    0.0162 (±72)    0.0018 (±5)
Segmenta.    0.0242 (±123)   0.0154 (±48)    0.0094 (±18)    0.0050 (±30)
Waveform     0.0905 (±113)   0.0515 (±128)   0.0632 (±201)   0.0367 (±96)

Figure 4.1: Comparison of ranks through the Bonferroni-Dunn test. BAdaCost's average rank is taken as reference. Algorithms significantly worse than our method for a significance level of 0.10 are joined by a blue line.

4.3.3 Unbalanced Data: Synapse and Mitochondria segmentation

To complete the section, we show the performance of our algorithm in the domain of unbalanced classification problems. These problems are characterized by large differences in the number of samples in each class.
Such a situation frequently occurs in complex data sets, e.g. those involving class overlapping, small sample sizes, or within-class unbalance [38]. When working with unbalanced data, standard classifiers usually perform poorly, since they minimize the number of misclassified training samples while disregarding the minority classes. It is worth mentioning that this has become an important research area in Pattern Recognition [33, 38, 88], inasmuch as unbalanced classification problems frequently occur in relevant practical problems, such as Bio-medical Image Analysis [6], object detection in Computer Vision [100] or medical decision-making [65].

Solutions to the class unbalance problem may be coarsely organized into data-based approaches, which re-sample the data space to balance the classes, and algorithm-based approaches, which introduce new algorithms that bias the learning towards the minority class [38]. The symbiotic relation between data- and ensemble-based algorithms in the context of two-class unbalanced classification is highlighted in a recent survey [33]. Boosting algorithms have also been extensively used to address this kind of problem [33, 38, 86, 91]. However, with the exception of AdaC2.M1 [86], no previous work has addressed the problem of multi-class Boosting in the presence of unbalanced data by using a cost matrix⁵.

Figure 4.2: Example of a segmented image. In b), green pixels belong to mitochondria while red ones belong to synapses. Figure c) shows an estimation.

We compare BAdaCost with the multi-class cost-sensitive algorithms considered in Section 4.3.2: AdaC2.M1, Lp-CSB and MultiBoost. We also add the SAMME algorithm to our experiment, considering it a good example of a cost-free multi-class algorithm with multi-class weak learners.
Let us briefly describe our problem. In recent years we have seen advances in the automated acquisition of large series of images of brain tissue. The complexity of these images and the high number of neurons in a small section of the brain make the automated analysis of these images the only practical solution. Mitochondria and synapses are two cell structures of neurological interest that are suitable for automated processing. However, the classification problem behind the segmentation of these structures is highly unbalanced, since most pixels in these images belong to the background, few of them belong to mitochondria and a small minority belong to synapses. Figure 4.2 shows an example.

For this experiment we collect a training set composed of 10000 background, 4000 mitochondria and 1000 synapse samples, and a testing set with 20000 samples per class. We apply to each image in the stack a set of linear Gaussian filters at different scales to compute zero-, first- and second-order derivatives. For each pixel we get a vector of responses $S = (s_{00}, s_{10}, s_{01}, s_{02}, s_{11}, s_{20})$, obtained respectively by applying the filters $G_\sigma *$, $\sigma \cdot G_\sigma * \frac{\partial}{\partial x}$, $\sigma \cdot G_\sigma * \frac{\partial}{\partial y}$, $\sigma^2 \cdot G_\sigma * \frac{\partial^2}{\partial x^2}$, $\sigma^2 \cdot G_\sigma * \frac{\partial^2}{\partial x \partial y}$ and $\sigma^2 \cdot G_\sigma * \frac{\partial^2}{\partial y^2}$, where $G_\sigma$ is a zero-mean Gaussian with standard deviation $\sigma$. For a value of $\sigma$, the pixel feature vector is given by $f(\sigma) = (s_{00}, \sqrt{s_{10}^2 + s_{01}^2}, \lambda_1, \lambda_2)$, where $\lambda_1$ and $\lambda_2$ are the eigenvalues of the Hessian matrix at the pixel, which depend on $s_{20}$, $s_{02}$ and $s_{11}$. The final 16-dimensional feature vector for each pixel is given by the concatenation of the $f(\sigma)$ vectors at 4 scales (values of $\sigma$). Having described the features, our goal in this subsection is to use the BAdaCost algorithm to label pixels in these images as mitochondria, synapse or background.

We compare the five algorithms using a cost matrix obtained as described in section 4.3.1. It is computed from the confusion matrix obtained after applying 15 iterations of the BAdaCost algorithm (with the 0|1-cost matrix) to the training set. Once more, we use classification trees as base learners for every algorithm. In all the algorithms we use a re-sampling factor of r = 0.7. For SAMME, AdaC2.M1 and BAdaCost we use a shrinkage factor of s = 0.1 (this factor is not needed for the rest). For the Lp-CSB algorithm we select p = 4, and for MultiBoost we create a pool of 10000 learners following the mentioned re-sampling factor. We run the five algorithms for 150 iterations.

⁵ In [103] the authors propose AdaBoost.NC, a multi-class Boosting algorithm specialized in unbalanced data. They do not resort to cost matrices; rather, they use a different insight based on instance- and iteration-dependent penalties applied in the re-weighting scheme.

Figure 4.3: Brain images experiment with a heavily unbalanced data set. Training and testing error rates along the iterations for each algorithm (AdaC2.M1, BAdaCost, Lp-CSB, MultiBoost and SAMME).

In Fig. 4.3 we show the training and testing classification results for this experiment. Table 4.3 shows the error rates for each algorithm after the last iteration. The unbalance in the training data reflects the essence of this segmentation problem, in which synapses cover a very small area of the image, mitochondria a slightly larger one and, finally, the background class the largest area. A classifier unaware of this unbalance would overfit the largest class to achieve the lowest error rate on the unbalanced training data set. We can see this effect in the plots of Fig. 4.3. The SAMME algorithm achieves the lowest error rate on the training data set, but the poorest on the balanced testing set, clearly showing that it has overfitted the background class.
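As a side note, the filter-bank features described above can be sketched with standard tools. This is a minimal sketch assuming SciPy's `gaussian_filter`; the four scale values are illustrative, since the text does not list the exact $\sigma$'s:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def pixel_features(img, sigmas=(1.0, 2.0, 4.0, 8.0)):
    """Scale-normalized Gaussian-derivative features per pixel.

    For each scale sigma, compute s00 (smoothed image), first derivatives
    s10, s01 (scaled by sigma) and second derivatives s20, s11, s02
    (scaled by sigma^2), then keep
        f(sigma) = (s00, sqrt(s10^2 + s01^2), lambda1, lambda2),
    with lambda1 >= lambda2 the Hessian eigenvalues.  The concrete
    sigmas here are assumptions, not the thesis values.
    """
    maps = []
    for s in sigmas:
        s00 = gaussian_filter(img, s)
        s10 = s * gaussian_filter(img, s, order=(0, 1))        # d/dx
        s01 = s * gaussian_filter(img, s, order=(1, 0))        # d/dy
        s20 = s ** 2 * gaussian_filter(img, s, order=(0, 2))   # d2/dx2
        s11 = s ** 2 * gaussian_filter(img, s, order=(1, 1))   # d2/dxdy
        s02 = s ** 2 * gaussian_filter(img, s, order=(2, 0))   # d2/dy2
        # Closed-form eigenvalues of the 2x2 Hessian [[s20, s11], [s11, s02]].
        half_tr = (s20 + s02) / 2.0
        disc = np.sqrt(((s20 - s02) / 2.0) ** 2 + s11 ** 2)
        maps.extend([s00, np.hypot(s10, s01), half_tr + disc, half_tr - disc])
    return np.stack(maps, axis=-1)   # shape (H, W, 4 * len(sigmas))
```

Concatenating the four per-scale vectors yields the 16-dimensional descriptor used as input to the classifiers.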
Note here that, although the classes are unbalanced, the error rate is a meaningful classification measure on the testing data set because it is balanced. On the other hand, BAdaCost, since it is a cost-sensitive classifier, gets a much better testing error rate. Its training error rate is much higher than SAMME's. This is an expected result, since the cost matrix has effectively moved the class boundary towards the majority class. The MultiBoost and AdaC2.M1 algorithms perform better than SAMME on the test set, but clearly worse than BAdaCost. Although Lp-CSB, MultiBoost and AdaC2.M1 are cost-sensitive algorithms, their poor performance in this experiment shows that their results are far from optimal.

        SAMME    AdaC2.M1  Lp-CSB   MultiBoost  BAdaCost
Train   0.1723   0.1691    0.1843   0.2095      0.2515
Test    0.3327   0.3327    0.3667   0.3293      0.2255

Table 4.3: Error rates of the five algorithms after the last iteration.

Discussion

In this section we carry out two types of experiments. The first is conceived to minimize decision costs on real problems when a cost structure is given. To this aim, we have considered a set of UCI benchmark problems for which we compute cost matrices following the process described in section 4.3.1. In doing so, the cost-sensitive classification focuses on the difficult decision boundaries. We carry out a comparison with other multi-class cost-sensitive Boosting algorithms. In light of the results, we conclude that BAdaCost finds the most appropriate assignment of labels for this purpose. This reaffirms the use of the cost-sensitive multi-class exponential loss function for developing algorithms in this area.

In our second experiment we solve a complex problem in the presence of unbalanced data. Specifically, we address the segmentation of mitochondria and synapses in images of small sections of the brain.
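The cost-matrix estimation used in both experiments (the process of section 4.3.1, which starts from the confusion matrix of a preliminary cost-free run) can be sketched as follows. The exact rule is not reproduced in this chunk, so this is a hedged sketch that simply makes frequent confusions expensive, focusing learning on the difficult decision boundaries:

```python
import numpy as np

def cost_from_confusion(conf):
    """Hypothetical cost-matrix estimate from a confusion matrix.

    conf[i, j] counts class-i samples labelled j by a preliminary
    cost-free classifier.  This sketch makes frequent confusions
    expensive; the exact rule of section 4.3.1 may differ.
    """
    conf = np.asarray(conf, dtype=float)
    rates = conf / conf.sum(axis=1, keepdims=True)  # row-normalized rates
    C = rates.copy()
    np.fill_diagonal(C, 0.0)    # right decisions cost nothing
    C = C / C.max()             # scale the largest confusion to cost 1
    return C
```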
In this case, we endow the multi-class problem with a cost matrix that moves the decision boundaries towards the majority class. In this way the classifier does not neglect the minority classes. We compared BAdaCost with other multi-class cost-sensitive algorithms, using the process explained in section 4.3.1 to estimate a cost matrix for this problem. The results confirm that BAdaCost significantly outperforms the rest of the algorithms, since its testing error decreases the most. This is essentially due to our unbalance-aware learning. In other words, BAdaCost really takes advantage of a cost matrix to correct the unbalance in the data.

Chapter 5

Conclusions

In the present dissertation we have exploited two definitions of margin to develop new algorithms for multi-class Boosting. On the one hand, we have proposed a new multi-class Boosting algorithm called PIBoost. Its main contribution is the use of binary classifiers whose response is encoded in a multi-class vector and evaluated under an exponential loss function. Data labels and classifier responses are encoded in different vector domains in such a way that they produce a set of asymmetric margin values that depend on the distribution of classes separated by the weak learner. In this way PIBoost properly addresses the class unbalances that appear when binarizing the problem. The range of rewards and penalties provided by this multi-class loss function is also related to the amount of information yielded by each weak learner. The most informative weak learners are those that classify samples into the smallest class set and, consequently, their sample weight rewards and penalties are the largest. Here the codification produces a fair distribution of the vote, or evidence, among the classes in the group. It matches a common-sense pattern, namely learning to guess something by discarding possibilities in the proper way.
The resulting algorithm maintains the essence of AdaBoost, which is in fact a special case of PIBoost when the number of classes is two. Furthermore, the way it translates partial information about the problem into multi-class knowledge lets us think of our method as a canonical extension of AdaBoost using binary information. The experiments performed confirm that PIBoost significantly improves the performance of other well-known multi-class classification algorithms. We do not claim that PIBoost is the best multi-class Boosting algorithm in the literature. Rather, we emphasize that the multi-class margin expansion introduced improves existing binarization-based multi-class classification approaches and opens new research avenues for margin-based classification.

Following the insight behind PIBoost, we extended the original AdaBoost to multi-class cost-sensitive problems using multi-class weak learners. By extending the notion of multi-class margin to the cost-sensitive case we introduced a new method and also developed theoretical connections proving that it canonically generalizes SAMME [112], Cost-sensitive AdaBoost [63] and PIBoost [24]. The resulting algorithm, BAdaCost, stands for Boosting Adapted for Cost-matrix. We have shown experimentally that BAdaCost outperforms other relevant Boosting algorithms in the area. We perform this comparison on two types of tasks: minimizing costs on standard data sets, and improving the test accuracy on an unbalanced data problem.

When making an algorithm cost-sensitive, the costs are a new set of parameters. Estimating these parameters is a difficult problem per se. Most of the published cost-sensitive algorithms either do not provide a procedure to compute them [58, 102], resort to a computationally demanding search procedure [86], or, in the unbalanced case, set the costs inversely proportional to the class unbalance ratio [33].
In our experiments we used a simple procedure to estimate the costs from the confusion matrix obtained with a cost-free multi-class rule. This procedure has been shown to work well in practice. However, we are aware that this matrix is far from optimal. Finally, we must say that the horizon of applicability of both algorithms is vast. Frameworks based on variations of the multi-class margin yield flexible derivations for new paradigms or new types of problems. The following section is devoted to commenting on these insights as future work.

5.1 Future work

Here we discuss some interesting topics for future avenues of research. First we review theoretical aspects that we consider immediate goals. Then we comment on different areas of applicability in the field of supervised classification, especially in Computer Vision.

5.1.1 New theoretical scopes

Many Boosting algorithms assess their convergence through error bounds [28, 26, 80]. We derived our algorithms based on a statistical view of Boosting. In future research we will derive bounds on the training and test errors of PIBoost and BAdaCost. In the case of PIBoost, its weak learners are grouped in separators but their responses, encoded as margin vectors, are summed up jointly. Therefore, the demonstration of the error bound should take into account each partial classification as well as the result when they are merged into a final decision. Such an analysis is not immediate and would require thorough research. Likewise, as we stated in section 2.5, there is a plethora of works discussing different perspectives on Boosting. It would be quite interesting to link some of them to the multi-class margin-based theory. Three examples are the reduction of entropy as analyzed in [46, 84], the game-theoretic point of view described in [67] and the use of different margin-based loss functions [61, 10].
With regard to PIBoost's separators, in our experiments we did not take into account sets of labels larger than a pair. If the number of labels allows it, it is also possible to add new "clues" to the fitted additive model coming from trios or quartets of labels. The main disadvantage, obviously, is the associated computational load. As we mentioned in section 3.4, the number of separators could become prohibitive. For this reason we are interested in developing a well-grounded scheme for selecting the best subsets of labels for our separators.

Establishing the proper cost matrix is a difficult key step in many cost-sensitive problems. Aware of this fact, we have introduced a procedure for measuring costs on hard-to-fit decision boundaries and also for tackling asymmetries in the data. In future research we would like to devise a more efficient and optimal algorithm to estimate cost matrices for certain problems, such as unbalanced ones. Similarly, we also want to derive symmetric matrices suitable for problems where some decision boundaries between pairs of labels are more important than others, regardless of the importance of the labels in a pair.

5.1.2 Other scopes of supervised learning

Firstly, we must point out that the supervised classification addressed in this dissertation is concerned with a discrete set of values (in fact, a finite set of labels). Many real problems require predictions of continuous magnitudes, which invites us to extend both PIBoost and BAdaCost to regression problems. Labeling data is a time-consuming and costly process, hence the interest in developing semi-supervised algorithms. We would like to study the problem of semi-supervised multi-class Boosting with costs. Some problems present such a high dimensionality that a feature selection process is needed first. AdaBoost has been successfully used for selecting discriminative features while building a strong cascade classifier, e.g. [101, 100].
Sharing this perspective, we are also motivated to apply PIBoost to retrieve the best features for multi-class problems. In Computer Vision, detection problems usually suffer from unbalanced data, since positive classes have far fewer samples than the background one. The usual technique to deal with such unbalance is sub-sampling the background class, which entails an information loss. With BAdaCost the majority-class sub-sampling is not needed and the full training set can be used. Moreover, it is well known that training a single classifier for each object (a pure binary problem) is worse than building a multi-class object detector [95], because visual features can be shared among the weak learners. We are studying the application of PIBoost to multi-pose detection of objects as a practical and intuitive application. Last but not least, we would like to develop a cost-sensitive derivation of PIBoost. In other words, we plan to derive a canonical generalization of AdaBoost to tackle classification in the presence of a multi-class cost matrix by using binary weak learners.

Appendix A

Proofs

A.1 Proof of expression (3.3)

Assume, without loss of generality, that $\mathbf{x}$ belongs to the first class and that we try to separate the set $S$, containing the first $s$ labels, from the rest using $\mathbf{f}^S$. Assume also that the classification succeeds. In that case the value of the margin is

$\mathbf{y}^\top \mathbf{f}^S(\mathbf{x}) = \left(1, \frac{-1}{K-1}, \ldots, \frac{-1}{K-1}\right) \left(\frac{1}{s}, \ldots, \frac{1}{s}, \frac{-1}{K-s}, \ldots, \frac{-1}{K-s}\right)^\top = \frac{1}{s} - \frac{s-1}{s(K-1)} + \frac{K-s}{(K-1)(K-s)} = \frac{K}{s(K-1)}.$

If the separator fails, then $\mathbf{f}^S(\mathbf{x})$ has the opposite sign, and so does the result. Assume now that the real label of the instance is the same, but that we separate the last $s$ labels from the rest, and that this time $\mathbf{f}^S$ erroneously classifies the instance as belonging to those last labels. The value of the margin is

$\mathbf{y}^\top \mathbf{f}^S(\mathbf{x}) = \left(1, \frac{-1}{K-1}, \ldots, \frac{-1}{K-1}\right) \left(\frac{-1}{K-s}, \ldots, \frac{-1}{K-s}, \frac{1}{s}, \ldots, \frac{1}{s}\right)^\top = \frac{-1}{K-s} + \frac{K-s-1}{(K-1)(K-s)} - \frac{1}{K-1} = \frac{-K}{(K-1)(K-s)}.$

Again, the sign of the result would be the opposite if $\mathbf{f}^S$ excluded $\mathbf{x}$ from the first group of labels.

A.2 Proof of Lemma 1

Let us fix a subset of $s$ labels, $s = |S|$, and assume that in the $m$-th step we have fitted a separator $\mathbf{f}_m(\mathbf{x})$ (whose $S$-index we omit) as an additive model, $\mathbf{f}_{m+1}(\mathbf{x}) = \mathbf{f}_m(\mathbf{x}) + \beta\mathbf{g}(\mathbf{x})$. We fix $\beta > 0$ and rewrite the expression to look for the best $\mathbf{g}(\mathbf{x})$:

$\sum_{n=1}^N \exp\left(\frac{-1}{K} \mathbf{y}_n^\top (\mathbf{f}_m(\mathbf{x}_n) + \beta\mathbf{g}(\mathbf{x}_n))\right) = \sum_{n=1}^N w(n) \exp\left(\frac{-\beta}{K} \mathbf{y}_n^\top \mathbf{g}(\mathbf{x}_n)\right)$   (A.1)

$= \sum_{l_n \in S} w(n) \exp\left(\frac{\mp\beta}{s(K-1)}\right) + \sum_{l_n \notin S} w(n) \exp\left(\frac{\mp\beta}{(K-s)(K-1)}\right)$   (A.2)

$= \exp\left(\frac{-\beta}{s(K-1)}\right) \sum_{l_n \in S} w(n) + \exp\left(\frac{-\beta}{(K-s)(K-1)}\right) \sum_{l_n \notin S} w(n) + \left[\exp\left(\frac{\beta}{s(K-1)}\right) - \exp\left(\frac{-\beta}{s(K-1)}\right)\right] \sum_{l_n \in S} w(n)\, I(\mathbf{y}_n^\top\mathbf{g}(\mathbf{x}_n) < 0) + \left[\exp\left(\frac{\beta}{(K-s)(K-1)}\right) - \exp\left(\frac{-\beta}{(K-s)(K-1)}\right)\right] \sum_{l_n \notin S} w(n)\, I(\mathbf{y}_n^\top\mathbf{g}(\mathbf{x}_n) < 0).$   (A.3)

The last expression is a sum of four terms: the first and second are constant, while the third and fourth depend on $\mathbf{g}(\mathbf{x})$. The values in brackets are positive constants; let us denote them $B_1$ and $B_2$, respectively. Minimizing the above expression is then equivalent to minimizing

$B_1 \sum_{l_n \in S} w(n)\, I(\mathbf{y}_n^\top\mathbf{g}(\mathbf{x}_n) < 0) + B_2 \sum_{l_n \notin S} w(n)\, I(\mathbf{y}_n^\top\mathbf{g}(\mathbf{x}_n) < 0).$   (A.4)

Hence the first point of the Lemma follows. Now assume $\mathbf{g}(\mathbf{x})$ known, with error $E$ on the training data. The error can be decomposed into two parts:

$E = \sum_{l_n \in S} w(n)\, I(\mathbf{y}_n^\top\mathbf{g}(\mathbf{x}_n) < 0) + \sum_{l_n \notin S} w(n)\, I(\mathbf{y}_n^\top\mathbf{g}(\mathbf{x}_n) < 0) = E_1 + E_2.$   (A.5)

Expression (A.3) can now be written as

$A_1 \exp\left(\frac{-\beta}{s(K-1)}\right) + \left[\exp\left(\frac{\beta}{s(K-1)}\right) - \exp\left(\frac{-\beta}{s(K-1)}\right)\right] E_1 + A_2 \exp\left(\frac{-\beta}{(K-s)(K-1)}\right) + \left[\exp\left(\frac{\beta}{(K-s)(K-1)}\right) - \exp\left(\frac{-\beta}{(K-s)(K-1)}\right)\right] E_2,$   (A.6)

where $A_1 = \sum_{l_n \in S} w(n)$ and $A_2 = \sum_{l_n \notin S} w(n)$. It can be easily verified that the above expression is convex with respect to $\beta$. So, differentiating w.r.t. $\beta$, equating to zero and simplifying terms, we get:

$\frac{E_1}{s} \exp\left(\frac{\beta}{s(K-1)}\right) + \frac{E_2}{K-s} \exp\left(\frac{\beta}{(K-s)(K-1)}\right) = \frac{A_1 - E_1}{s} \exp\left(\frac{-\beta}{s(K-1)}\right) + \frac{A_2 - E_2}{K-s} \exp\left(\frac{-\beta}{(K-s)(K-1)}\right).$   (A.7)

There is no direct procedure to solve for $\beta$ here. We propose the change of variable $\beta = s(K-s)(K-1)\log(x)$ with $x > 0$. This change transforms the last equation into the polynomial equation

$(K-s)E_1 x^{(K-s)} + sE_2 x^{s} - s(A_2 - E_2) x^{-s} - (K-s)(A_1 - E_1) x^{-(K-s)} = 0,$   (A.8)

or, equivalently, multiplying by $x^{(K-s)}$,

$(K-s)E_1 x^{2(K-s)} + sE_2 x^{K} - s(A_2 - E_2) x^{(K-2s)} - (K-s)(A_1 - E_1) = 0.$   (A.9)

According to Descartes' rule of signs, the last polynomial has a single positive real root, which proves the second point of the Lemma.

A.3 Proof of Lemma 2

Assume that in the $m$-th iteration we have fitted a classifier $\mathbf{f}_m(\mathbf{x})$ as an additive model and we are searching for the parameters $(\beta, \mathbf{g})$ to add in the next step, $\mathbf{f}_{m+1}(\mathbf{x}) = \mathbf{f}_m(\mathbf{x}) + \beta\mathbf{g}(\mathbf{x})$:

$\sum_{n=1}^N \exp\left(\mathbf{C}^*(l_n, -)(\mathbf{f}_m(\mathbf{x}_n) + \beta\mathbf{g}(\mathbf{x}_n))\right)$   (A.10)

$= \sum_{n=1}^N w(n) \exp\left(\beta\,\mathbf{C}^*(l_n, -)\mathbf{g}(\mathbf{x}_n)\right)$   (A.11)

$= \sum_{j=1}^K \sum_{\{n: l_n = j\}} w(n) \exp\left(\beta\,\mathbf{C}^*(j, -)\mathbf{g}(\mathbf{x}_n)\right)$   (A.12)

$= \sum_{j=1}^K \sum_{\{n: l_n = j\}} w(n) \exp\left(\frac{\beta K}{K-1}\, C^*(j, G(\mathbf{x}_n))\right).$   (A.13)

In the last step we take into account the equivalence between vector-valued functions, $\mathbf{g}$, and label-valued functions, $G$. If $\mathcal{S}(j) = \{n : l_n = G(\mathbf{x}_n) = j\}$ denotes the set of indices of well classified instances with $l_n = j$, and $\mathcal{F}(j,k) = \{n : l_n = j, G(\mathbf{x}_n) = k\}$ denotes the indices where $G$ classifies as $k$ when the real label is $j$, then the above expression can be rewritten as

$\sum_{j=1}^K \left[\sum_{n \in \mathcal{S}(j)} w(n) \exp\left(\frac{\beta K C^*(j,j)}{K-1}\right) + \sum_{k \neq j} \sum_{n \in \mathcal{F}(j,k)} w(n) \exp\left(\frac{\beta K C^*(j,k)}{K-1}\right)\right]$   (A.14)

$= \sum_{j=1}^K \left(S_j \exp\left(\frac{\beta K C^*(j,j)}{K-1}\right) + \sum_{k \neq j} E_{j,k} \exp\left(\frac{\beta K C^*(j,k)}{K-1}\right)\right),$   (A.15)

where $S_j = \sum_{n \in \mathcal{S}(j)} w(n)$ and $E_{j,k} = \sum_{n \in \mathcal{F}(j,k)} w(n)$. Taking into account that these constants are positive and that $\exp(\beta C^*(i,j)) > 0$ ($\forall i,j \in L$), we can omit the factor $K/(K-1)$ appearing in the exponents when addressing the minimization. The objective function can therefore be written as

$\sum_{j=1}^K \left(S_j \exp(\beta C^*(j,j)) + \sum_{k \neq j} E_{j,k} \exp(\beta C^*(j,k))\right).$   (A.16)

Now, for a fixed value $\beta > 0$, the optimal step, $\mathbf{g}$, can be found by minimizing (A.16); writing $A(j,k) := \exp(C^*(j,k))$, so that $\exp(\beta C^*(j,k)) = A(j,k)^\beta$,

$\sum_{j=1}^K \left(S_j\, A(j,j)^\beta + \sum_{k \neq j} E_{j,k}\, A(j,k)^\beta\right)$   (A.17)

$= \sum_{j=1}^K \left(\sum_{n \in \mathcal{S}(j)} w(n)\, A(j,j)^\beta + \sum_{k \neq j} \sum_{n \in \mathcal{F}(j,k)} w(n)\, A(j,k)^\beta\right)$   (A.18)

$= \sum_{n=1}^N w(n) \left(A(l_n, l_n)^\beta\, I(G(\mathbf{x}_n) = l_n) + \sum_{k \neq l_n} A(l_n, k)^\beta\, I(G(\mathbf{x}_n) = k)\right).$   (A.19)

Finally, if we assume a direction $\mathbf{g}$ known, then its weighted errors, $E_{j,k}$, and successes, $S_j$, are computable. So, differentiating (A.16) with respect to $\beta$ (note that it is a convex function) and equating to zero, we get

$\sum_{j=1}^K \left(\sum_{k \neq j} E_{j,k}\, C^*(j,k) \exp(\beta C^*(j,k)) + S_j\, C^*(j,j) \exp(\beta C^*(j,j))\right) = 0$   (A.20)

$\sum_{j=1}^K \sum_{k \neq j} E_{j,k}\, C^*(j,k) \exp(\beta C^*(j,k)) = -\sum_{j=1}^K S_j\, C^*(j,j) \exp(\beta C^*(j,j))$   (A.21)

$\sum_{j=1}^K \sum_{k \neq j} E_{j,k}\, C^*(j,k)\, A(j,k)^\beta = -\sum_{j=1}^K S_j\, C^*(j,j)\, A(j,j)^\beta$   (A.22)

$\sum_{j=1}^K \sum_{k \neq j} E_{j,k}\, C(j,k)\, A(j,k)^\beta = \sum_{j=1}^K \sum_{h=1}^K S_j\, C(j,h)\, A(j,j)^\beta.$   (A.23)

There is no direct procedure to solve this equation, so we resort to iterative methods.

A.4 Proof of Corollary 1

As commented in section 4.1.2, it is easy to see that when $\mathbf{C}$ is defined in the following way,

$C(i,j) := \begin{cases} 0 & \text{for } i = j \\ \frac{1}{K(K-1)} & \text{for } i \neq j \end{cases} \quad \forall i,j \in L,$   (A.24)

then a discrete vector-valued weak learner, $\mathbf{f}$, yields $\mathbf{C}^*(l,-)\mathbf{f}(\mathbf{x}) = -1/(K-1)$ for right classifications and $\mathbf{C}^*(l,-)\mathbf{f}(\mathbf{x}) = 1/(K-1)^2$ for mistakes. Both quantities are in fact the values of $\frac{-1}{K}\mathbf{y}^\top\mathbf{f}(\mathbf{x})$ appearing in the exponent of the loss function in SAMME. Thus expression (A.16) can be written

$\sum_{j=1}^K \left(S_j \exp\left(\frac{-\beta}{K-1}\right) + \sum_{k \neq j} E_{j,k} \exp\left(\frac{\beta}{(K-1)^2}\right)\right)$   (A.25)

$= \left(\sum_{j=1}^K S_j\right) \exp\left(\frac{-\beta}{K-1}\right) + \left(\sum_{j=1}^K \sum_{k \neq j} E_{j,k}\right) \exp\left(\frac{\beta}{(K-1)^2}\right)$   (A.26)

$= S \exp\left(\frac{-\beta}{K-1}\right) + E \exp\left(\frac{\beta}{(K-1)^2}\right)$   (A.27)

$= (1-E) \exp\left(\frac{-\beta}{K-1}\right) + E \exp\left(\frac{\beta}{(K-1)^2}\right)$   (A.28)

$= \exp\left(\frac{-\beta}{K-1}\right) + E \left[\exp\left(\frac{\beta}{(K-1)^2}\right) - \exp\left(\frac{-\beta}{K-1}\right)\right].$   (A.29)

So, regardless of the value of $\beta$, the above expression is minimized when $E = \sum_{n=1}^N w(n)\, I(G(\mathbf{x}_n) \neq l_n)$ is minimum.

For the second point of the corollary we just need to consider the above expression as a function of $\beta$. Differentiating and equating to zero we get

$\frac{-1}{K-1} \exp\left(\frac{-\beta}{K-1}\right) + E \left[\frac{1}{(K-1)^2} \exp\left(\frac{\beta}{(K-1)^2}\right) + \frac{1}{K-1} \exp\left(\frac{-\beta}{K-1}\right)\right] = 0$   (A.30)

$\frac{E}{(K-1)^2} \exp\left(\frac{\beta}{(K-1)^2}\right) = \frac{1-E}{K-1} \exp\left(\frac{-\beta}{K-1}\right)$   (A.31)

$\exp\left(\frac{K\beta}{(K-1)^2}\right) = \frac{(K-1)(1-E)}{E}$   (A.32)

and, taking logarithms,

$\frac{K\beta}{(K-1)^2} = \log\left(\frac{1-E}{E}\right) + \log(K-1)$   (A.33)

$\beta = \frac{(K-1)^2}{K} \left(\log\left(\frac{1-E}{E}\right) + \log(K-1)\right).$   (A.34)

Hence the second point of the corollary follows.

A.5 Proof of Corollary 2

Given the $(2\times2)$ cost matrix, let $C_1 = C(1,2)$ and $C_2 = C(2,1)$ denote its off-diagonal values. Expression (A.16) becomes:

$\sum_{j=1}^2 \left(S_j \exp(-\beta C_j) + \sum_{k \neq j} E_{j,k} \exp(\beta C_j)\right)$   (A.35)

$= S_1 \exp(-\beta C_1) + E_{1,2} \exp(\beta C_1) + S_2 \exp(-\beta C_2) + E_{2,1} \exp(\beta C_2).$   (A.36)

Let us now assume that $\beta > 0$ is known. Using the Lemma, the optimal discrete weak learner minimizing the expected loss is

$\arg\min_{\mathbf{g}}\; e^{\beta C_1} E_{1,2} + e^{-\beta C_1} S_1 + e^{\beta C_2} E_{2,1} + e^{-\beta C_2} S_2$   (A.37)

and, changing the notation, $T_1 = \sum_{\{n: l_n = 1\}} w(n)$, $T_2 = \sum_{\{n: l_n = 2\}} w(n)$, $E_{1,2} = b = T_1 - S_1$ and $E_{2,1} = d = T_2 - S_2$,

$\arg\min_{\mathbf{g}}\; e^{\beta C_1} b + e^{-\beta C_1} (T_1 - b) + e^{\beta C_2} d + e^{-\beta C_2} (T_2 - d)$   (A.38)

$= \arg\min_{\mathbf{g}}\; \left(e^{\beta C_1} - e^{-\beta C_1}\right) b + e^{-\beta C_1} T_1 + \left(e^{\beta C_2} - e^{-\beta C_2}\right) d + e^{-\beta C_2} T_2.$   (A.39)

Besides, if we assume the optimal weak learner, $\mathbf{g}$, known, then its weighted success and error rates are computable, and we can find the best value of $\beta$ using the Lemma. In this binary case the following expression must be solved:

$E_{1,2} C_1 e^{\beta C_1} + E_{2,1} C_2 e^{\beta C_2} = S_1 C_1 e^{-\beta C_1} + S_2 C_2 e^{-\beta C_2}.$   (A.40)

Again, using the notation $T_1$, $T_2$, $b$ and $d$, we get

$b C_1 e^{\beta C_1} + d C_2 e^{\beta C_2} = (T_1 - b) C_1 e^{-\beta C_1} + (T_2 - d) C_2 e^{-\beta C_2}$   (A.41)

$b C_1 \left(e^{\beta C_1} + e^{-\beta C_1}\right) + d C_2 \left(e^{\beta C_2} + e^{-\beta C_2}\right) = T_1 C_1 e^{-\beta C_1} + T_2 C_2 e^{-\beta C_2}$   (A.42)

$2 C_1 b \cosh(\beta C_1) + 2 C_2 d \cosh(\beta C_2) = T_1 C_1 e^{-\beta C_1} + T_2 C_2 e^{-\beta C_2},$   (A.43)

which proves the equivalence between both algorithms for binary problems.

A.6 Proof of Corollary 3

Let $S$ denote a subset of $s$ labels of the problem.
We can simplify the notation by using 1 or 2 to denote the presence or absence of labels of $S$ in the data. The margin values are $\mathbf{y}^\top\mathbf{f}(\mathbf{x}) = \frac{\pm K}{s(K-1)}$, when $y \in S$, and $\mathbf{y}^\top\mathbf{f}(\mathbf{x}) = \frac{\pm K}{(K-s)(K-1)}$, when $y \notin S$. In both cases the sign is positive/negative in case of right/wrong classification. In turn, the exponential loss function, $\exp(-\mathbf{y}^\top\mathbf{f}(\mathbf{x})/K)$, yields $\exp\left(\frac{\mp 1}{s(K-1)}\right)$ and $\exp\left(\frac{\mp 1}{(K-s)(K-1)}\right)$, respectively.

Let $\mathbf{C}$ be the $(2\times2)$ cost matrix with off-diagonal values $C(1,2) = \frac{1}{sK}$ and $C(2,1) = \frac{1}{(K-s)K}$. This matrix produces cost-sensitive multi-class margins with the same values in the loss function. Thus we can apply our result to this binary cost-sensitive sub-problem; in particular, we can apply Corollary 2 directly. Let $\beta > 0$ be known. Substituting in expression (A.39), we get the optimal weak learner, $\mathbf{g}$, by solving

$\arg\min_{\mathbf{g}}\; \left[\exp\left(\frac{\beta}{s(K-1)}\right) - \exp\left(\frac{-\beta}{s(K-1)}\right)\right] E_1 + A_1 \exp\left(\frac{-\beta}{s(K-1)}\right) + \left[\exp\left(\frac{\beta}{(K-s)(K-1)}\right) - \exp\left(\frac{-\beta}{(K-s)(K-1)}\right)\right] E_2 + A_2 \exp\left(\frac{-\beta}{(K-s)(K-1)}\right),$   (A.44)

where $A_1 = \sum_{\{n: l_n = 1\}} w(n)$, $A_2 = \sum_{\{n: l_n = 2\}} w(n)$, $E_1 = \sum_{\{n: \mathbf{g}(\mathbf{x}_n) \neq l_n = 1\}} w(n)$ and $E_2 = \sum_{\{n: \mathbf{g}(\mathbf{x}_n) \neq l_n = 2\}} w(n)$. If we assume the optimal direction of classification, $\mathbf{g}$, known, and thus its weighted errors and successes, we can compute the optimal step $\beta$ using (A.43) as the solution to

$\frac{2E_1}{s} \cosh\left(\frac{\beta}{s(K-1)}\right) + \frac{2E_2}{K-s} \cosh\left(\frac{\beta}{(K-s)(K-1)}\right) = \frac{A_1}{s} \exp\left(\frac{-\beta}{s(K-1)}\right) + \frac{A_2}{K-s} \exp\left(\frac{-\beta}{(K-s)(K-1)}\right).$   (A.45)

Making the change of variable $\beta = s(K-s)(K-1)\log x$, we get

$\frac{E_1}{s}\left(x^{(K-s)} + x^{-(K-s)}\right) + \frac{E_2}{K-s}\left(x^{s} + x^{-s}\right) = \frac{A_1}{s}\, x^{-(K-s)} + \frac{A_2}{K-s}\, x^{-s},$   (A.46)

which is equivalent to finding the only positive real root (Descartes' rule of signs) of the following polynomial:

$P(x) = E_1(K-s) x^{2(K-s)} + E_2 s\, x^{K} - s(A_2 - E_2) x^{(K-2s)} - (K-s)(A_1 - E_1).$   (A.47)

Hence the Corollary follows.

Bibliography

[1] Naoki Abe, Bianca Zadrozny, and John Langford. An iterative method for multi-class cost-sensitive learning. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 3–11, 2004.
[2] Alekh Agarwal.
Selective sampling algorithms for cost-sensitive multiclass prediction. In Proc. International Conference on Machine Learning (ICML), volume 28, pages 1220–1228, 2013.
[3] Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2000.
[4] Yonatan Amit, Ofer Dekel, and Yoram Singer. A boosting algorithm for label covering in multilabel problems. Journal of Machine Learning Research, 2:27–34, 2007.
[5] Shumeet Baluja and Henry A. Rowley. Boosting sex identification performance. International Journal of Computer Vision, 71(1):111–119, 2007.
[6] C. Becker, K. Ali, G. Knott, and P. Fua. Learning context cues for synapse segmentation. IEEE Transactions on Medical Imaging, 32(10):1864–1877, 2013.
[7] Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer-Verlag New York, Inc., 2006.
[8] Leo Breiman. Bagging predictors. Machine Learning, pages 123–140, 1996.
[9] Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.
[10] Peter Bühlmann and Bin Yu. Boosting with the L2 loss: Regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003.
[11] Wen-Chung Chang and Chih-Wei Cho. Multi-class boosting with color-based Haar-like features. In Signal-Image Technologies and Internet-Based System (SITIS), pages 719–726, 2007.
[12] Junli Chen, Xuezhong Zhou, and Zhaohui Wu. A multi-label Chinese text categorization system based on boosting algorithm. In Computer and Information Technology (CIT), pages 1153–1158, 2004.
[13] Ke Chen and Shihai Wang. Semi-supervised learning via regularized boosting working on multiple semi-supervised assumptions. Transactions on Pattern Analysis and Machine Intelligence, 33(1):129–143, 2011.
[14] Corinna Cortes, Mehryar Mohri, and Umar Syed. Deep boosting. In Proc.
International Conference on Machine Learning (ICML), 2014.
[15] Janez Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
[16] Hongbo Deng, Jianke Zhu, Michael R. Lyu, and Irwin King. Two-stage multi-class AdaBoost for facial expression recognition. In International Joint Conference on Neural Networks (IJCNN), pages 3005–3010, 2007.
[17] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, pages 263–286, 1995.
[18] Pedro Domingos. MetaCost: A general method for making classifiers cost-sensitive. In Proc. International Conference on Knowledge Discovery and Data Mining (KDD), pages 155–164, 1999.
[19] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley-Interscience, 2nd edition, 2000.
[20] Gunther Eibl and Karl-Peter Pfeiffer. Multiclass boosting for weak classifiers. Journal of Machine Learning Research, 6:189–210, 2005.
[21] Charles Elkan. The foundations of cost-sensitive learning. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), pages 973–978, 2001.
[22] Andrea Esuli, Tiziano Fagni, and Fabrizio Sebastiani. TreeBoost.MH: A boosting algorithm for multi-label hierarchical text categorization. String Processing and Information Retrieval, pages 13–24, 2006.
[23] Wei Fan, Salvatore J. Stolfo, Junxin Zhang, and Philip K. Chan. AdaCost: Misclassification cost-sensitive boosting. In Proc. International Conference on Machine Learning (ICML), pages 97–105, 1999.
[24] Antonio Fernández-Baldera and Luis Baumela. Multi-class boosting with asymmetric weak-learners. Pattern Recognition, 47(5):2080–2090, 2014.
[25] Antonio Fernández-Baldera, José M. Buenaposada, and Luis Baumela. Multi-class boosting for imbalanced data. In Proc. Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA), pages 1–8, 2015.
[26] Yoav Freund and Robert E.
Schapire. Experiments with a new boosting algorithm. In Proc. International Conference on Machine Learning (ICML), pages 148–156, 1996.
[27] Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Conference on Computational Learning Theory, pages 325–332, 1996.
[28] Yoav Freund and Robert E. Schapire. A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[29] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.
[30] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2009.
[31] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2001.
[32] M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32:675–701, 1937.
[33] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera. A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(4):463–484, July 2012.
[34] Tianshi Gao and Daphne Koller. Multiclass boosting with hinge loss based on output coding. In Proc. International Conference on Machine Learning (ICML), 2011.
[35] Venkatesan Guruswami and Amit Sahai. Multiclass learning, boosting, and error-correcting codes. In Proc. Annual Conference on Computational Learning Theory (COLT), pages 145–155, 1999.
[36] Zhihui Hao, Chunhua Shen, Nick Barnes, and Bo Wang. Totally-corrective multi-class boosting. In Asian Conference on Computer Vision (ACCV), volume 6495, pages 269–280, 2010.
[37] Trevor Hastie and Robert Tibshirani.
Generalized Additive Models. Monographs on Statistics and Applied Probability. Chapman and Hall, 1990.
[38] Haibo He and Edwardo A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.
[39] Jian Huang, Seyda Ertekin, Yang Song, Hongyuan Zha, and C. Lee Giles. Efficient multiclass boosting classification with active learning. In SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2007.
[40] R. L. Iman and J. M. Davenport. Approximation of the critical region of the Friedman statistic. Communications in Statistics, pages 571–595, 1980.
[41] Alan Julian Izenman. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer Publishing Company, Inc., 1st edition, 2008.
[42] Wei Jiang, Shih-Fu Chang, and Alexander C. Loui. Kernel sharing with joint boosting for multi-class concept detection. In Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2007.
[43] Matt Johnson and Roberto Cipolla. Improved image annotation and labelling through multi-label boosting. In British Machine Vision Conference (BMVC), 2005.
[44] Michael Kearns and Leslie Valiant. Learning Boolean formulae or finite automata is as hard as factoring. Technical report, Harvard University, August 1988.
[45] Michael Kearns and Leslie Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM, 41(1):67–95, 1994.
[46] Jyrki Kivinen and Manfred K. Warmuth. Boosting as entropy projection. In Proc. Annual Conference on Computational Learning Theory (COLT), pages 134–144, 1999.
[47] Ludmila I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, 2004.
[48] Tae-Kyun Kim and Roberto Cipolla. MCBoost: Multiple classifier boosting for perceptual co-clustering of images and visual features. In Advances in Neural Information Processing Systems (NIPS), pages 841–848, 2009.
[49] Iago Landesa-Vázquez and José Luis Alba-Castro. Shedding light on the asymmetric learning capability of AdaBoost. Pattern Recognition Letters, 33(3):247–255, 2012.
[50] Iago Landesa-Vázquez and José Luis Alba-Castro. Double-base asymmetric AdaBoost. Neurocomputing, 118:101–114, 2013.
[51] Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99:67–81, 2004.
[52] Leonidas Lefakis and Francois Fleuret. Joint cascade optimization using a product of boosted classifiers. In Advances in Neural Information Processing Systems (NIPS), pages 1315–1323, 2010.
[53] Christian Leistner, Helmut Grabner, and Horst Bischof. Semi-supervised boosting using visual similarity learning. In Computer Vision and Pattern Recognition (CVPR), 2008.
[54] Yen-Yu Lin and Tyng-Luh Liu. Robust face detection with multi-class boosting. In Computer Vision and Pattern Recognition (CVPR), volume 1, pages 680–687, 2005.
[55] Li Liu, Ling Shao, and Peter Rockett. Boosted key-frame selection and correlated pyramidal motion-feature representation for human action recognition. Pattern Recognition, 46(7):1810–1818, 2013.
[56] Xu-Ying Liu and Zhi-Hua Zhou. Towards cost-sensitive learning for real-world applications. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) Workshops, volume 7104, pages 494–505, 2012.
[57] Hung-Yi Lo, Ju-Chiang Wang, Hsin-Min Wang, and Shou-De Lin. Cost-sensitive multi-label learning for audio tag annotation and retrieval. IEEE Transactions on Multimedia, 13(3):518–529, 2011.
[58] Aurelie C. Lozano and Naoki Abe. Multi-class cost-sensitive boosting with p-norm loss functions. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 506–514, 2008.
[59] Pavan Kumar Mallapragada, Rong Jin, Anil K. Jain, and Yi Liu. SemiBoost: Boosting for semi-supervised learning.
Transactions on Pattern Analysis and Machine Intelligence, 31(11):2000–2014, 2009.
[60] H. Masnadi-Shirazi and N. Vasconcelos. Asymmetric boosting. In Proc. International Conference on Machine Learning (ICML), pages 609–619, 2007.
[61] Hamed Masnadi-Shirazi, Vijay Mahadevan, and Nuno Vasconcelos. On the design of robust classifiers for computer vision. In Computer Vision and Pattern Recognition (CVPR), pages 779–786, 2010.
[62] Hamed Masnadi-Shirazi and Nuno Vasconcelos. On the design of loss functions for classification: theory, robustness to outliers, and SavageBoost. In Advances in Neural Information Processing Systems (NIPS), pages 1049–1056, 2008.
[63] Hamed Masnadi-Shirazi and Nuno Vasconcelos. Cost-sensitive boosting. Transactions on Pattern Analysis and Machine Intelligence, 33:294–309, 2011.
[64] Hamed Masnadi-Shirazi, Nuno Vasconcelos, and Vijay Mahadevan. On the design of robust classifiers for computer vision. In Computer Vision and Pattern Recognition (CVPR), pages 779–786, 2010.
[65] Maciej A. Mazurowski, Piotr A. Habas, Jacek M. Zurada, Joseph Y. Lo, Jay A. Baker, and Georgia D. Tourassi. Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks, 21(2):427–436, 2008.
[66] David Mease and Abraham Wyner. Evidence contrary to the statistical view of boosting. Journal of Machine Learning Research, 9:131–156, 2008.
[67] I. Mukherjee and R. E. Schapire. A theory of multiclass boosting. In Advances in Neural Information Processing Systems (NIPS), pages 1714–1722, 2010.
[68] P. B. Nemenyi. Distribution-free multiple comparisons. PhD thesis, Princeton University, 1963.
[69] Deirdre B. O'Brien, Maya R. Gupta, and Robert M. Gray. Cost-sensitive multi-class classification from probability estimates. In Proc. International Conference on Machine Learning (ICML), pages 712–719, 2008.
[70] A. Opelt, A. Pinz, and A. Zisserman.
Learning an alphabet of shape and appearance for multi-class object detection. International Journal of Computer Vision, 80(1):16–44, 2008.
[71] Mark D. Reid, Robert C. Williamson, and Peng Sun. The convexity and design of composite multiclass losses. In Proc. International Conference on Machine Learning (ICML), pages 687–694, 2012.
[72] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101–141, 2004.
[73] Saharon Rosset. Robust boosting and its relation to bagging. In Proc. International Conference on Knowledge Discovery in Data Mining (KDD), pages 249–255, 2005.
[74] Mohammad J. Saberian and Nuno Vasconcelos. Multiclass boosting: Theory and algorithms. In Advances in Neural Information Processing Systems (NIPS), 2011.
[75] Amir Saffari, Christian Leistner, and Horst Bischof. Regularized multi-class semi-supervised boosting. In Computer Vision and Pattern Recognition (CVPR), pages 967–974, 2009.
[76] Raúl Santos-Rodríguez, Alicia Guerrero-Curieses, Rocío Alaiz-Rodríguez, and Jesús Cid-Sueiro. Cost-sensitive learning based on Bregman divergences. Machine Learning, 76(2-3):271–285, 2009.
[77] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
[78] Robert E. Schapire. Using output codes to boost multiclass learning problems. In Proc. International Conference on Machine Learning (ICML), pages 313–321, 1997.
[79] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
[80] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999.
[81] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2-3):135–168, 2000.
[82] Chunhua Shen, Junae Kim, Lei Wang, and Anton van den Hengel.
Positive semidefinite metric learning with boosting. In Advances in Neural Information Processing Systems (NIPS), pages 1651–1659, 2009.
[83] Chunhua Shen, Junae Kim, Lei Wang, and Anton van den Hengel. Positive semidefinite metric learning using boosting-like algorithms. Journal of Machine Learning Research, 13:1007–1036, 2012.
[84] Chunhua Shen and Hanxi Li. On the dual formulation of boosting algorithms. Transactions on Pattern Analysis and Machine Intelligence, 32:2216–2231, 2010.
[85] Yifan Shi, Aaron F. Bobick, and Irfan A. Essa. Learning temporal sequence model from partially labeled data. In Computer Vision and Pattern Recognition (CVPR), pages 1631–1638, 2006.
[86] Yanmin Sun, Mohamed S. Kamel, and Yang Wang. Boosting for learning multiple classes with imbalanced class distribution. In Proc. International Conference on Data Mining (ICDM), ICDM '06, pages 592–602, 2006.
[87] Yanmin Sun, Mohamed S. Kamel, Andrew K. C. Wong, and Yang Wang. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12):3358–3378, 2007.
[88] Yanmin Sun, Andrew K. C. Wong, and Mohamed S. Kamel. Classification of imbalanced data: a review. International Journal of Pattern Recognition and Artificial Intelligence, 23(04):687–719, 2009.
[89] Yanmin Sun, Andrew K. C. Wong, and Yang Wang. Parameter inference of cost-sensitive boosting algorithms. In Proc. International Conference on Machine Learning and Data Mining (MLDM), pages 21–30, 2005.
[90] Yijun Sun, Zhipeng Liu, Sinisa Todorovic, and Jian Li. Adaptive boosting for SAR automatic target recognition. IEEE Transactions on Aerospace and Electronic Systems, 43(1):112–125, 2007.
[91] Yijun Sun, Sinisa Todorovic, and Jian Li. Unifying multi-class AdaBoost algorithms with binary base learners under the margin framework. Pattern Recognition Letters, 28:631–643, 2007.
[92] Matus J. Telgarsky. The fast convergence of boosting.
In Advances in Neural Information Processing Systems (NIPS), pages 1593–1601, 2011.
[93] Kai Ming Ting. A comparative study of cost-sensitive boosting algorithms. In Proc. International Conference on Machine Learning (ICML), pages 983–990, 2000.
[94] Kai Ming Ting and Zijian Zheng. Boosting cost-sensitive trees. In Proc. International Conference on Discovery Science (DS), pages 244–255, 1998.
[95] Antonio Torralba, Kevin P. Murphy, and William T. Freeman. Sharing features: Efficient boosting procedures for multiclass object detection. In Computer Vision and Pattern Recognition (CVPR), pages 762–769, 2004.
[96] Tomasz Trzcinski, Mario Christoudias, Vincent Lepetit, and Pascal Fua. Learning image descriptors with the boosting-trick. In Advances in Neural Information Processing Systems 25, pages 269–277, 2012.
[97] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[98] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[99] Nuno Vasconcelos and Mohammad J. Saberian. Boosting classifier cascades. In Advances in Neural Information Processing Systems (NIPS), pages 2047–2055, 2010.
[100] Paul Viola and Michael J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[101] Paul A. Viola and Michael J. Jones. Fast and robust classification using asymmetric AdaBoost and a detector cascade. In Advances in Neural Information Processing Systems (NIPS), pages 1311–1318, 2001.
[102] Junhui Wang. Boosting the generalized margin in cost-sensitive multiclass classification. Journal of Computational and Graphical Statistics, 22(1):178–192, 2013.
[103] Shuo Wang and Xin Yao. Multiclass imbalance problems: Analysis and potential solutions. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42(4):1119–1130, 2012.
[104] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann Series in Data Management Systems. Elsevier, 2005.
[105] Fei Wu, Yahong Han, Qi Tian, and Yueting Zhuang. Multi-label boosting for image annotation by structural grouping sparsity. In ACM International Conference on Multimedia (ACM-MM), pages 15–24, 2010.
[106] Fen Xia, Yan-wu Yang, Liang Zhou, Fuxin Li, Min Cai, and Daniel D. Zeng. A closed-form reduction of multi-class cost-sensitive learning to weighted multi-class learning. Pattern Recognition, 42(7):1572–1581, 2009.
[107] Xun Xu and Thomas S. Huang. SODA-boosting and its application to gender recognition. In Analysis and Modeling of Faces and Gestures (AMFG), pages 193–204, 2007.
[108] Rong Yan, Jelena Tesic, and John R. Smith. Model-shared subspace boosting for multi-label classification. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 834–843, 2007.
[109] Jieping Ye. Least squares linear discriminant analysis. In Proc. International Conference on Machine Learning (ICML), 2007.
[110] Yin Zhang and Zhi-Hua Zhou. Cost-sensitive face recognition. Transactions on Pattern Analysis and Machine Intelligence, 32(10):1758–1769, 2010.
[111] Zhi-Hua Zhou and Xu-Ying Liu. On multi-class cost-sensitive learning. Computational Intelligence, 26(3):232–257, 2010.
[112] Ji Zhu, Hui Zou, Saharon Rosset, and Trevor Hastie. Multi-class AdaBoost. Statistics and Its Interface, 2:349–360, 2009.
[113] Hui Zou, Ji Zhu, and Trevor Hastie. The margin vector, admissible loss and multi-class margin-based classifiers. Technical report, University of Minnesota, 2007.
[114] Hui Zou, Ji Zhu, and Trevor Hastie. New multicategory boosting algorithms based on multicategory Fisher-consistent losses. Annals of Applied Statistics, 2:1290–1306, 2008.