Escuela Técnica Superior de Ingenieros Informáticos
Universidad Politécnica de Madrid

PhD Thesis
Artificial Intelligence

New Definitions of Margin for Multi-class Boosting Algorithms

Author: Antonio Fernández Baldera
Advisors: Luis Baumela Molina, Jose Miguel Buenaposada Biencinto

July, 2015
Acknowledgements

Over the time it takes to complete doctoral studies one comes to feel the support of so many people that it is hard to include them all in a few lines. In any case, I will try to do a good exercise of memory and gather them all in this dedication.

First of all, I must thank Luis Baumela for all the effort (and patience) he has devoted to my training, which has led to the completion of this thesis. Thank you for having been there, helping me in so many ways: proposing ideas, polishing the writing, teaching me how to handle paperwork, providing material, listening to proposals, guiding objectives, and so on. It is certainly impossible not to learn with a person like him working at your side. Truly, thank you very much.

Likewise, I cannot fail to thank Jose Miguel Buenaposada for all his comments, both theoretical and experimental. How easy it is to stay enthusiastic about an idea when you have a colleague like him. It goes without saying that part of the goals achieved in this document were reached thanks to his advice. Thank you so much.

My thanks also go to my laboratory colleagues, Juan, Yadira, Pablo, David and Kendrick, for the good times we have shared. Outside the group, thanks also to Monse and Elena for brightening up the coffee breaks. Thanks as well to Antonio Valdés for bringing his mathematical insight to our seminars, and to Aleix Martínez for making room for me in his extraordinary group.

Thanks to all my family, of course, for always being there. Thank you for making me feel more confident than ever in my academic potential, especially after so many years of study.

I also thank all the friends who during this time have shared so many good moments with me and who have likewise been there when I needed them. Thank you for your support. To the people from my home town: Rocha, Galea, Jose Luis and Charly; to my close friends from Badajoz: Felix and Javi; to my former colleagues at the INE: Alberto, Paz and Cristina; to my friends from San Sebastián de los Reyes: Álvaro, Sergio and Jose; to my former master's classmates: Arturo, Diego, Ernesto, Victor, Ghislain, Pablo and Raúl; to my people in Las Rozas: Cristina, Rosa and Carlos; and, of course, to my friends in Columbus: Fabián, Felipe and Adriana (and her whole family). Likewise, thanks to the many other friends not listed in these lines who have also made me feel fortunate during these years.

As could not be otherwise, my most affectionate thanks go to Aili Cutie Báez. Reasons abound.

Finally, I believe it is more than fitting to also extend my gratitude to project TIN2008-06815-C02-02 of the Ministerio de Economía y Competitividad. Without that project this research could not have been carried out.
Abstract
The family of Boosting algorithms comprises classification and regression techniques that have proven very effective in Computer Vision problems, such as the detection, tracking and recognition of faces, people, deformable objects and actions. The first and most popular algorithm, AdaBoost, was introduced in the context of binary classification. Since then, many works have been proposed to extend it to more general domains: multi-class, multi-label, cost-sensitive, etc. Our interest centers on extending AdaBoost to two problems in the multi-class field, considering it a first step towards upcoming generalizations.
In this dissertation we propose two Boosting algorithms for multi-class classification based
on new generalizations of the concept of margin. The first of them, PIBoost, is conceived
to tackle the multi-class problem by solving many binary sub-problems. We use a vectorial
codification to represent class labels and a multi-class exponential loss function to evaluate
classifier responses. This representation produces a set of margin values that provide a range
of penalties for failures and rewards for successes. The stagewise optimization of this model
introduces an asymmetric Boosting procedure whose costs depend on the number of classes
separated by each weak-learner. In this way the Boosting procedure takes into account class
imbalances when building the ensemble. The resulting algorithm is a well-grounded method that canonically extends the original AdaBoost.
The second algorithm proposed, BAdaCost, is conceived for multi-class problems endowed with a cost matrix. Motivated by the few cost-sensitive extensions of AdaBoost to the
multi-class field, we propose a new margin that, in turn, yields a new loss function appropriate
for evaluating costs. Since BAdaCost generalizes the SAMME, Cost-Sensitive AdaBoost and PIBoost algorithms, we consider our algorithm a canonical extension of AdaBoost to this kind of problem. We additionally suggest a simple procedure to compute cost matrices that improve the performance of Boosting in standard and unbalanced problems.
A set of experiments demonstrates the effectiveness of both methods against other relevant Boosting algorithms in their respective areas. In the experiments we resort to benchmark data sets used in the Machine Learning community, first to minimize classification errors and then to minimize costs. In addition, we successfully applied BAdaCost to a segmentation task, a particular problem in the presence of imbalanced data. We conclude the thesis by justifying the horizon of future improvements enabled by our framework, given its applicability and theoretical flexibility.
Keywords:
Machine Learning, AdaBoost, Multi-class Boosting, Margin-based classifiers, Cost-sensitive
learning, Imbalanced data.
Resumen
The family of Boosting algorithms is a type of classification and regression technique that has proven very effective in Computer Vision problems, such as the detection, tracking and recognition of faces, people, deformable objects and actions. The first and most popular Boosting algorithm, AdaBoost, was conceived for binary problems. Since then, many proposals have appeared with the aim of carrying it over to other, more general domains: multi-class, multi-label, cost-sensitive, etc. Our interest centers on extending AdaBoost to the field of multi-class classification, regarding this as a first step towards later extensions.

In this thesis we propose two Boosting algorithms for multi-class problems based on new derivations of the concept of margin. The first of them, PIBoost, is conceived to tackle the problem by decomposing it into binary subproblems. On the one hand, we use a vectorial encoding to represent labels and, on the other, we use the multi-class exponential loss function to evaluate the responses. This encoding produces a set of margin values that entail a range of penalties in case of failure and rewards in case of success. The iterative optimization of the model generates an asymmetric Boosting process whose costs depend on the number of labels separated by each weak classifier. In this way our Boosting algorithm takes the class-induced imbalance into account when building the classifier. The result is a well-founded method that canonically extends the original AdaBoost.

The second proposed algorithm, BAdaCost, is conceived for multi-class problems endowed with a cost matrix. Motivated by the scarcity of works devoted to generalizing AdaBoost to the cost-sensitive multi-class field, we have proposed a new concept of margin that, in turn, allows deriving a loss function suited to evaluating costs. We consider our algorithm the most canonical extension of AdaBoost for this kind of problem, since it generalizes the SAMME, Cost-Sensitive AdaBoost and PIBoost algorithms. In addition, we suggest a simple procedure for computing cost matrices suited to improving the performance of Boosting when tackling standard problems and problems with imbalanced data.

A series of experiments demonstrates the effectiveness of both methods against other well-known multi-class Boosting algorithms in their respective areas. These experiments use benchmark data sets from the Machine Learning field, first to minimize errors and then to minimize costs. In addition, we have successfully applied BAdaCost to a segmentation task, a particular case of a problem with imbalanced data. We conclude by justifying the future prospects of the framework we present, both for its applicability and for its theoretical flexibility.
Keywords:
Machine Learning, AdaBoost, Multi-class Boosting, Margin-based classifiers, Cost-sensitive learning, Imbalanced data.
Contents

1 Introduction
  1.1 Motivation
  1.2 Main Contributions
  1.3 Structure of the Thesis

2 Background on Boosting
  2.1 Background on Machine Learning
      2.1.1 Supervised Classification Problems
  2.2 Binary Boosting: AdaBoost
      2.2.1 Understanding AdaBoost
      2.2.2 Statistical View of Boosting
  2.3 Multi-class Boosting
      2.3.1 Boosting algorithms based on binary weak-learners
      2.3.2 Boosting algorithms based on vectorial encoding
  2.4 Cost-sensitive binary Boosting
  2.5 Other perspectives of Boosting

3 Partially Informative Boosting
  3.1 Multi-class margin extension
  3.2 PIBoost
      3.2.1 AdaBoost as a special case of PIBoost
      3.2.2 Asymmetric treatment of partial information
      3.2.3 Common sense pattern
  3.3 Related work
  3.4 Experiments

4 Multi-class Cost-sensitive Boosting
  4.1 Cost-sensitive multi-class Boosting
      4.1.1 Previous works
      4.1.2 New margin for cost-sensitive classification
  4.2 BAdaCost: Boosting Adapted for Cost-matrix
      4.2.1 Direct generalizations
  4.3 Experiments
      4.3.1 Cost matrix construction
      4.3.2 Minimizing costs: UCI repository
      4.3.3 Unbalanced Data: Synapse and Mitochondria segmentation

5 Conclusions
  5.1 Future work
      5.1.1 New theoretical scopes
      5.1.2 Other scopes of supervised learning

A Proofs
  A.1 Proof of expression (3.3)
  A.2 Proof of Lemma 1
  A.3 Proof of Lemma 2
  A.4 Proof of Corollary 1
  A.5 Proof of Corollary 2
  A.6 Proof of Corollary 3
List of Figures

3.1 Values of the Exponential Loss Function over margins, z, for a classification problem with 4 classes. Possible margin values are obtained taking into account expression (3.5) for s = 1 and s = 2.
3.2 Margin vectors for a problem with three classes. The left plot presents the set of vectors Y. The right plot presents the set Ŷ.
3.3 Plots comparing the performance of Boosting algorithms. The vertical axis displays the error rate; the horizontal axis displays the number of weak-learners fitted by each algorithm.
3.4 Diagram of the Nemenyi test. The average rank of each method is marked on the segment. Critical differences for both the α = 0.05 and α = 0.1 significance levels are shown at the top. Algorithms with no significantly different performance are grouped with a thick blue line.
3.5 Plot comparing the performance of Boosting algorithms on the Amazon data base. The vertical axis displays the error rate; the horizontal axis displays the number of weak-learners fitted by each algorithm.
4.1 Comparison of ranks through the Bonferroni-Dunn test. BAdaCost's average rank is taken as reference. Algorithms significantly worse than our method at a significance level of 0.10 are joined with a blue line.
4.2 Example of a segmented image. In b), green pixels belong to mitochondria while red ones belong to synapses. Figure c) shows an estimation.
4.3 Brain-images experiment with a heavily unbalanced data set. Training and testing error rates along the iterations are shown for each algorithm.
List of Tables

3.1 Cost matrix associated to a PIBoost separator of a set S with s = |S| classes.
3.2 An example of encoding matrix for PIBoost's weak learners when K = 4 and G = {all single labels} ∪ {all pairs of labels}.
3.3 Comparison of the main properties of ECOC-based algorithms and PIBoost. µ_m(l) denotes the coloring function µ_m : L → {±1} at iteration m. R denotes the length of the "code-words". In AdaBoost.OC, Ī_m(x) indicates I(h_m(x) = µ_m(l)).
3.4 Summary of selected UCI data sets.
3.5 Number of iterations considered for each Boosting algorithm. The first column displays the data base name with the number of classes in parentheses. Columns two to six display the number of iterations of each algorithm. For PIBoost(2) the number of separators per iteration appears inside brackets. The last column displays the number of weak-learners used for each data base.
3.6 Error rates of the GentleBoost, AdaBoost.MH, SAMME, PIBoost(1) and PIBoost(2) algorithms for each data set in Table 3.4. Standard deviations appear inside parentheses in 10^-4 scale. Bold values represent the best result achieved for each data base.
3.7 P-values corresponding to the Wilcoxon matched-pairs signed-ranks test.
4.1 Summary of selected UCI data sets.
4.2 Classification cost rates of the Ada.C2M1, MultiBoost, Lp-CSB and BAdaCost algorithms for each data set after 100 iterations. Standard deviations appear inside parentheses in 10^-4 scale. Bold values represent the best result achieved for each data base.
4.3 Error rates of the five algorithms after the last iteration.
Chapter 1
Introduction
The emergence of the personal computer and its ever-increasing computing power has brought a new range of applications into our lives. The more computational power available, the closer we come to transferring human reasoning to artificial environments. This pursuit inevitably gave rise to a field of Science: Artificial Intelligence (AI). AI includes many disciplines, such as Robotic Systems, Mechatronics, Ontology, Bio-Informatics, Computer Vision, Data Mining, Pattern Recognition, Machine Learning, etc. The last three have been guided by advances in statistics and have received an enormous degree of attention in the last three decades. The contents of this dissertation are set within the context of Machine Learning and Pattern Recognition. Both disciplines will be considered synonymous hereafter.
When introducing Machine Learning a first question always arises: what should be understood by learning? This is an extraordinarily complex question, almost as difficult as defining the concept of intelligence. Such an endeavour could take up the whole thesis, so we will bear in mind the following simple definition: learning is the act of acquiring new knowledge by synthesizing information. It is surely an incomplete definition, since for humans one should add the modification of existing knowledge; moreover, behaviours and skills are also part of the process of learning. In any case, we find this definition good enough to bridge the gap between human learning and Machine Learning. This field of AI encompasses situations where an automatic decision must be taken wisely under some restrictions. In other words, its aim is "teaching machines" to make reasonable decisions. We will pay attention only to this perspective of learning: situations where one has to select the right option based on a set of restrictions. This understanding of choosing "right decisions" leads to the concept of a classification problem. Roughly speaking, the structure of a classification problem is as simple as this: choose the "best" answer from a finite set of discrete options, based on some input data and some sample choices.
Several well-established techniques, grounded in different statistical foundations, have been proposed to solve it. Some examples are Support Vector Machines, Bayesian Networks, Neural Networks, and ensemble algorithms. The latter is the scope of Boosting, the topic of this thesis.
1.1 Motivation
Boosting algorithms are learning schemes that produce an accurate or strong classifier by combining a set of simple base prediction rules or weak-learners. Their popularity rests not only on the fact that it is often much easier to devise a simple but inaccurate prediction rule than to build a highly accurate classifier, but also on the successful practical results and good theoretical properties of the algorithms. This philosophy of learning has found applications in many fields of Science, but especially in Computer Vision, where there is a plethora of works grounded in it [16, 90, 42, 11, 54]. The most well-known Boosting algorithm, AdaBoost, was introduced in the context of two-class (binary) classification. It works in an iterative way. First a weight distribution is defined over the training set. Then, at each iteration, the best weak-learner according to the weight distribution is selected and combined with the previously selected weak-learners to form the strong classifier. Weights are updated to decrease the importance of correctly classified samples, so the algorithm tends to concentrate on the "difficult" examples.
Classification problems with more than two possible labels (multi-class problems) are very common in practice. When the nature of such a task suggests decomposing it into binary subproblems, it may be quite convenient to use Boosting [80, 78, 17, 3, 35, 95]. Although many Boosting algorithms have been proposed to address this issue [80, 78, 35, 3], none of them can be considered a canonical generalization of the original AdaBoost [26] in the sense of evaluating binary information under a multi-class loss function. In this thesis we provide a canonical extension of AdaBoost based on the well-known statistical interpretation of Boosting [29]. The resulting method links to a common-sense pattern, namely "discarding options in order to select the correct one", a pattern that, in turn, brings this computational process closer to human reasoning.
Furthermore, the multi-class classification problem at hand may assign different costs to different misclassifications. Usually this information is encoded in a cost matrix. The inclusion of a cost matrix can be considered either the goal of the problem, if we are interested in optimizing a cost-based loss function, or a tool for other purposes. In the second case, it is helpful for addressing problems with label-dependent unbalanced data, ordinal problems (i.e., multi-class problems arising from the discretization of a continuous variable) or problems with hard-to-fit decision boundaries. Cost-sensitive classification has been exploited for problems with two labels [94, 93, 60, 23, 33, 50, 89, 86, 63], but not much research has been devoted to problems with more than two [58, 102, 86]. This lack of study motivated us to develop a Boosting algorithm that captures the essence of the most relevant generalizations of AdaBoost to both the multi-class and the cost-sensitive perspectives.
The algorithms we introduce in this thesis, PIBoost and BAdaCost, address respectively
the problems of multi-class Boosting and multi-class Boosting with costs. They may be used
to solve many relevant problems such as multi-class and minimum risk classification, object
detection, image segmentation, etc.
1.2 Main Contributions
In the present thesis we focus on the concept of margin as the cornerstone of the exponential loss function. The core message of the dissertation is described below:
Hypothesis: The use of suitable margin vectors yields multi-class Boosting algorithms able
to manage binary information or costs in a canonical way via the exponential loss function.
This idea will be developed properly throughout the chapters. Here we summarize our
main contributions:
• We propose two definitions of margin vectors. First, we introduce margin vectors suited for multi-class problems based on binary information. Second, our proposal deals with cost-sensitive problems: given a cost matrix for a multi-class classification problem, we introduce another set of margin vectors. Both types of vectors are introduced jointly with an exponential loss function.
• Based on the first extension of margin, we introduce a new multi-class Boosting algorithm, PIBoost [24]. We decompose the classification problem into binary sub-problems that are evaluated with a multi-class loss function. The margin values obtained with the new encoding produce a broad range of penalties and rewards. The stagewise optimization of this model introduces an asymmetric Boosting procedure whose costs depend on the number of classes separated by each weak-learner. We regard PIBoost as a canonical extension of AdaBoost to the multi-class case using binary weak learners. With our proposal we lay out a framework that allows managing complex problems by translating them into the binary case in the purest Boosting fashion.
• Based on the second extension of margin we introduce a cost-sensitive multi-class Boosting algorithm, BAdaCost. We consider BAdaCost a canonical generalization of AdaBoost
to the multi-class cost-sensitive case using multi-class weak learners. In fact we generalize SAMME [112] and Cost-sensitive AdaBoost [63], two direct extensions of AdaBoost
for multi-class and binary cost-sensitive problems respectively. These results are discussed in Chapter 4.
1.3 Structure of the Thesis
The remainder of the thesis is organized as follows:
• Chapter 2 introduces Boosting. Firstly we describe some basics on Machine Learning
and then we present the AdaBoost algorithm from its two most common perspectives, namely error-bounding and statistical. We include a summary of multi-class Boosting methods, highlighting those derived from the concept of margin, followed by a
summary of binary cost-sensitive Boosting algorithms. Other interesting perspectives of
Boosting are also included.
• Chapter 3 presents our algorithm PIBoost for multi-class problems. Here we introduce
the new set of margin vectors on which we ground our algorithm. We compare PIBoost
with previous Boosting algorithms based on binary information. Experiments showing
the performance of PIBoost against other relevant Boosting algorithms are included.
• Chapter 4 describes our multi-class cost-sensitive algorithm BAdaCost. Again, we introduce a set of margin vectors from which we derive our algorithm. We explain BAdaCost’s
structure in detail and compare it with previous relevant works in the area. Finally a set
of experiments shows the applicability of BAdaCost under different perspectives.
• Chapter 5 summarizes the most relevant conclusions of our work jointly with new lines
of ongoing research.
• In the Appendix at the end of the document we provide the proofs of the results in the
thesis.
Chapter 2
Background on Boosting
Most learning paradigms are based on fitting a single model that yields a powerful classifier after training. This process usually requires a high computational cost to produce accurate results (SVM, Bayesian Networks, Deep Learning, etc.). With luck, the process can be parallelized and the computational burden distributed.
When such capabilities are not at hand it would be desirable to have some mechanism able
to produce good classifications based on a combination of not so good, but usually faster, ones.
In most cases we can build simple rules of thumb, called weak predictions, that partially solve
the problem by analyzing a small group of predictive variables. Hence, we may wonder: is it
possible to use weak predictions to build a powerful classifier in a step-wise fashion? Besides,
can we improve it by learning from previous mistakes? If so, can we do it without losing
previous gains in accuracy? The Boosting philosophy of learning answers these questions.
This chapter serves as a guide to Boosting and some of its most important extensions. It is organized as follows. The first section reviews some basics related to supervised learning. In particular, we describe our problems of interest and the concept of a meta-classifier. Section 2.2 introduces the origin of Boosting together with AdaBoost, the most important algorithm in the field. A comprehensive description of the method is given, paying special attention to its statistical interpretation. In section 2.3 we review the most relevant extensions of AdaBoost to multi-class problems, focusing on those built upon the concept of margin. Section 2.4 reviews the most relevant generalizations proposed to tackle cost-sensitive problems. Finally, a short summary of other interesting topics covered by Boosting closes the chapter in section 2.5.
2.1 Background on Machine Learning
There is a set of basic concepts that should be known beforehand in order to frame Boosting and our algorithms conveniently. Let us describe these elements together with our terminology for classification problems. Firstly, we will assume without loss of generality that the domain, X, of predictive variables (features) of a problem is a subset of R^D. In general X is not required to be a vectorial subspace of R^D; in fact, depending on the nature of the problem, some operations may not even make sense. Nonetheless, it is usual for X to be endowed with the structure of a vector space. Once the variables are given, we call dimensionality the number of them, i.e. the minimum value D for which X ⊆ R^D. The term class (objective variable) will refer
to the qualitative or numeric feature that is the goal of the prediction. In the following we will consider problems where the class variable can be encoded with a finite number of values, called labels. Hence, regression problems are outside the scope of the thesis. The set of labels, L, should be exhaustive; in other words, each and every one of these values should be found in the data, and no other value is needed. We will assume that the mutually-exclusive property is satisfied, which implies that labels are well defined, i.e. no semantic overlap should arise. An instance, (x, l), is expected to have just one label. In subsection 2.1.1 we will comment on a classification problem that does not meet this property. In the following there will be occasions where the concepts of class and label are used interchangeably; in those situations the context will leave no ambiguity. Finally, a classifier (learner, hypothesis) will be any function H : X → L that assigns a label (and only one) to each instance x ∈ X. In some cases the set L could be replaced by P(L), the set of subsets of L, or by {S, S^C} for a subset S ⊂ L.
The broad set of methods included in the world of Machine Learning can be divided into many categories according to different perspectives. Here we summarize some of them, emphasizing the area of our interest. Based on the availability of knowledge about the objective variable we can establish a first division: Unsupervised Learning (cluster analysis), when there is no knowledge about labels in the data; Semi-supervised Learning, when some instances are unlabelled; and Supervised Learning, when there is complete knowledge about the class in the data. The algorithms we present belong to the latter.
A second important distinction comes from the need to estimate a priori probabilities in order to apply a method. Given an observation, x, Bayes' rule for predicting a label l is P(l|x) = P(x|l)P(l)P(x)^{-1}, which in turn is proportional to P(x|l)P(l). One can estimate these values either directly or using the above formula, which implies estimating P(x|l) and P(l). This second option requires fitting a probability distribution for each label and estimating the a priori probabilities. Algorithms tackling the classification problem in this fashion are called generative, while those that fit P(l|x) directly are called discriminative, which is the case of Boosting algorithms.
An additional classification can be established depending on the theoretical nature of the methods. It is usual to call a paradigm every approach based on a different conception of the learning process. Paradigms have particular motivations: bio-inspired, probabilistic, geometrical, "case-based", etc. Popular paradigms are: Neural Networks, Lazy Learners (K-NN and its variants), Bayesian Classifiers (Naive Bayes, TAN, etc.), Support Vector Machines, Logistic Regression, Classification Trees and Rule-based algorithms. Some of these methodologies are supported by theories about universal approximation, i.e. functions that can theoretically approach any objective function up to a specified degree of error. Boosting algorithms are neither a paradigm per se nor can they be included in any single paradigm. Rather, they belong to the group of meta-classifiers or, as they are better known today, ensemble learning.
Let us describe some essential ideas about meta-classifiers. For a complete introduction see L. Kuncheva's book [47]. Roughly speaking, we define a meta-classifier as any algorithm that builds a classification rule based on the results provided by several classifiers. Following an initial intuition, one should expect the classifiers to be heterogeneous in order to obtain a good final decision from the ensemble. It can be shown theoretically that this combination is especially suited to reducing either the bias of the general classification error or its associated variance [7]. This heterogeneity does not imply poor predictions from the component learners; on the contrary, it is desirable for each of them to have reasonably good accuracy. One may also expect to combine a large number of classifiers, so an efficient type of base algorithm should be used. In the same way it would be desirable to guarantee, if possible, the management
of learners specialized in sub-domains of the data.
To build a meta-classifier one can use the same data for every base classifier (which is the usual situation) or develop a particular classifier for every single variable and then derive a global decision. The latter is often the case when working with magnitudes from different contexts: weight, length, pressure, temperature, sound intensity, etc. Taking into account the structure of the meta-classifier, which will depend on the type of problem, we can organize the component classifiers serially, in parallel or following a hierarchical structure. Finally, meta-classifiers can also be divided into two categories depending on whether they use the same paradigm to generate base classifiers or different ones. We are interested in the first case. Important families of algorithms using the same base paradigm are: Bagging [8], randomness-derived algorithms (for instance, Random Forests [9]), cascade-shaped classifiers [100] and, finally, those derived by applying Boosting [28, 29].
We close this section by describing the supervised problems of our interest. For more
information about the above topics we recommend the books of J. Friedman et al. [30], R.
Duda et al. [19], A. J. Izenman [41], C. M. Bishop [7], and I. H. Witten et al. [104].
2.1.1 Supervised Classification Problems
The supervised classification problems of our interest are divided into three groups: binary, multi-class and multi-label, each group generalizing its predecessor. These classification problems will be assumed free of costs. Although the algorithms proposed in this thesis are conceived for multi-class problems, we include the multi-label case bearing in mind its strong connections with our multi-class proposal and with previous works on multi-class classification.
First, binary classification problems are the simplest. The class variable presents two possible labels, usually denoted by L = {−1, 1}, {1, 0} or L = {1, 2}.
Secondly, multi-class problems differ from the previous case in that L has more than two elements, usually denoted with natural numbers, L = {1, 2, . . . , K}. Although the name may suggest otherwise, in this kind of classification problem there is only one class variable.
Finally, multi-label problems also have a finite set of labels L = {1, 2, . . . , K}, but every instance x has an associated subset Lx ⊆ L of labels, which justifies the name. In other words, instances can present more than a single label. Thus, this kind of classification is accomplished through a function H : X → P(L), where P(L) is the label powerset of the problem. There may be instances for which no labels or all labels are assigned. Let us consider the following illustrative example.
INSTANCE   LABEL1   LABEL2   LABEL3   LABEL4
    1         X        X
    2                                     X
    3         X                 X
    4                  X        X
    5         X                           X
With the aim of simplifying multi-label problems, we explain below a couple of widely used strategies (a short code sketch of both follows the list). These processes are also used to tackle multi-class problems, as we highlight in section 2.3.
1. Binary Relevance. The idea is to fit a binary classifier for each label (presence/absence)
and then map the result on P(L). This implies transforming the original data set into
K theoretically independent data sets. For example, Label 3 in the above table would
become:
INSTANCE   LABEL3
    1         0
    2         0
    3         1
    4         1
    5         0
2. Extending the data base. The idea behind this transformation is simple. Every instance is repeated K times and a new feature, taking each of the possible labels, is added. In other words, each repetition has the shape "feature-label". The new class variable then becomes binary, simply by encoding the presence/absence of a label with +1/−1 respectively. Thus there will be a total of N · K observations of the kind (x_n, j, ±1), with 1 ≤ n ≤ N and 1 ≤ j ≤ K. The first instance in the example gives rise to:
INSTANCE   LABEL   PRESENCE
    1        1        +1
    1        2        +1
    1        3        −1
    1        4        −1
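Both transformations are mechanical enough to be captured in a few lines of code. Below is a minimal Python sketch of the two strategies, assuming the label information is stored as an N × K binary indicator matrix; the helper names binary_relevance and extend_dataset are ours and purely illustrative.

import numpy as np

def binary_relevance(X, Y):
    # Strategy 1: split the N x K indicator matrix Y into K independent
    # binary problems, one per label (1 = presence, 0 = absence).
    return [(X, Y[:, k].astype(int)) for k in range(Y.shape[1])]

def extend_dataset(X, Y):
    # Strategy 2: repeat every instance K times, append the candidate label
    # as an extra feature and encode its presence/absence with +1/-1.
    N, K = Y.shape
    rows, labels, targets = [], [], []
    for n in range(N):
        for k in range(K):
            rows.append(X[n])
            labels.append(k + 1)
            targets.append(1 if Y[n, k] else -1)
    X_ext = np.column_stack([np.asarray(rows), np.asarray(labels)])
    return X_ext, np.asarray(targets)

Applied to the toy table above, extend_dataset produces 5 · 4 = 20 observations of the form (x_n, j, ±1), exactly as described in strategy 2.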
2.2 Binary Boosting: AdaBoost
Let us introduce this methodology starting from its origin. Boosting is a technique proposed by R. Schapire [77] at the end of the eighties. The main idea in his work was to combine the efficiency of many poor classifiers (weak learners) into a single powerful classifier (strong learner). Roughly speaking, we can define Boosting as a general methodology for converting rough rules of thumb into a high-quality prediction rule.
Schapire’s conception of the problem finds its basis on the work of L. Valiant [97] introduced in 1984 about the probably approximately correct model of learning (PAC learning). The
implicit goal of this venue of research was to combine the computational power available at
the time with the most recent theories about learnability. Computational learning theory was
then conceived as a complex branch of Computer Science where statistics, inductive inference
and information theory were mixed with a novel ingredient, the computational complexity. At
that date, Valiant’s work on PAC learning was considered an essential step for developing an
adequate mathematical framework where both statistical and computational views of learning
had place.
Successive works of M. Kearns and L. Valiant [44, 45] kept open the question of whether a weak learning algorithm (say, one with accuracy slightly better than random guessing) can be "boosted" into a strong learning algorithm (one with arbitrarily high accuracy). The most important theorem on this topic is due to Schapire [77] and solved the problem. His result states the equivalence between both types of learnability. Specifically, Schapire proved that a class of target functions is strongly PAC-learnable if and only if it is weakly learnable. It
is worth mentioning that in this genesis of Boosting there were no algorithms defined as such. Rather, there were only theoretical results describing, under fixed conditions, how a classifier improves its accuracy after a reweighting of instances (which is the essence of Boosting).
A few years later Y. Freund and R. Schapire [28] defined their seminal algorithm based on this reweighting technique, AdaBoost, which stands for Adaptive Boosting and whose pseudo-code is shown in Algorithm 1. It is by far the most important and most referenced Boosting algorithm in the literature. Originally designed to tackle binary problems, AdaBoost has been a cornerstone for subsequent derivations covering multi-class, multi-label, semi-supervised and cost-sensitive problems.
Algorithm 1 : AdaBoost
1- Initialize the weight vector W with uniform distribution w(n) = 1/N, n = 1, . . . , N.
2- For m = 1 to M :
   (a) Fit a classifier f_m(x) to the training data using weights W.
   (b) Compute the weighted error: Err_m = ∑_{n=1}^{N} w(n) I(f_m(x_n) ≠ y_n).
   (c) Compute α_m = (1/2) log((1 − Err_m)/Err_m).
   (d) Update weights w(n) ← w(n) exp(−α_m y_n f_m(x_n)), n = 1, . . . , N.
   (e) Re-normalize W.
3- Output Final Classifier: H(x) = sign( ∑_{m=1}^{M} α_m f_m(x) ).
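The pseudo-code above translates almost line by line into code. The following is a minimal NumPy/scikit-learn sketch of Algorithm 1, using decision stumps as weak learners; it is only an illustrative implementation of the generic scheme, not the exact code used in this thesis.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=100):
    # y must take values in {-1, +1}.
    N = len(y)
    w = np.full(N, 1.0 / N)                      # step 1: uniform weights
    learners, alphas = [], []
    for _ in range(M):
        f = DecisionTreeClassifier(max_depth=1)  # (a) fit a weak learner
        f.fit(X, y, sample_weight=w)
        pred = f.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-12, 1 - 1e-12)  # (b)
        alpha = 0.5 * np.log((1 - err) / err)    # (c) voting weight
        w *= np.exp(-alpha * y * pred)           # (d) reweight instances
        w /= w.sum()                             # (e) re-normalize
        learners.append(f)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    # step 3: sign of the weighted committee F(x)
    F = sum(a * f.predict(X) for f, a in zip(learners, alphas))
    return np.sign(F)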
Let X and L = {+1, −1} be the domain and the set of labels, respectively, of the problem at hand. Maintaining the classic notation for binary problems, we will denote true labels with y instead of l. AdaBoost creates a strong classifier, H, whose output is the sign of a linear combination of weak classifiers:
F(x) = ∑_{m=1}^{M} α_m f_m(x) .   (2.1)
In other words, H(x) = sign(F(x)) = sign(∑_{m=1}^{M} α_m f_m(x)), where α_m ∈ R^+ and f_m : X → L, ∀ m. Usually all of these individual functions are expected to belong to a specific hypothesis class H (for instance, classification trees with a restriction on their depth). The final classifier is exactly a weighted majority vote over a set of M classifiers. Since F(x) ∈ R, the value of its modulus |F(x)| should be understood as a degree of confidence in classifying with sign(F(x)). That is why the values provided by F are usually called confidence-rated predictions. In the language of Boosting, every classifier f_m included in expression (2.1) is called a weak learner, while the linear combination F is called the committee.
2.2.1 Understanding AdaBoost
Let us explain briefly how this iterative process works. At each iteration m of AdaBoost, a pair (f_m, α_m) is calculated and added to the model in a greedy fashion (elements included in previous iterations remain unaffected). On the one hand, f_m is fitted taking into account the weight associated to each instance, in such a way that those with the largest weights have classification priority. Therefore, every weak classifier focuses on hard instances according to W. On the other hand, the constant α_m is a positive value that measures the goodness of f_m: the larger its value, the more reliable the associated classifier.
Here appears the essential idea behind AdaBoost: the reweighting process carried out over the instances. Before defining it we must introduce the Exponential Loss Function:

L(y, F(x)) := exp(−yF(x)) .   (2.2)
This function is a key concept for the development of Boosting algorithms, as we will show in subsequent sections. For a labeled instance (x, y) and a confidence-rated classifier F, it computes the value z := yF(x), usually called the margin, and then applies exp to −z. Hence the exponential loss function can be defined over margin values: L(z) = exp(−z). Clearly, the more negative the margin, the larger the loss incurred, while the more positive the margin, the closer the loss is to zero.
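As a quick numerical illustration of the asymmetry of this loss: a correct prediction with confidence |F(x)| = 2 yields margin z = 2 and loss exp(−2) ≈ 0.14, whereas the same confidence on a wrong prediction yields z = −2 and loss exp(2) ≈ 7.39; a zero margin always costs exp(0) = 1.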
AdaBoost’s reweighting process computes the values of this loss function when the pair
(αm , fm ) is added to the current model Fm−1 , i.e.
wm = exp(−yFm (x)) = exp(−y(Fm−1 (x) + αm fm (x))) ,
(2.3)
which in turn can be recursively depicted as follows:
!
m
X
wm = exp −y
αr fr (x) = wm−1 exp(−yαm fm (x)) ,
(2.4)
r=1
that is quite more compact
PN and efficient. Additionally, the weight vector is normalized, W =
W/Zm , where Zm = n=1 wm (n) is the normalization constant. These weights will be taken
into account in the next iteration. Specifically, the objective of the next classifier f is to minimize the weighted error, i.e. Err(f) = ∑_{n=1}^{N} w(n) I(f(x_n) ≠ y_n). When computing it, there are two ways of using W:
1. Directly: if the set of hypotheses H allows weights in its computations.
2. By sampling: since W is a probability distribution over the data, it can be used to sample data from which a weak classifier from H can be learned.
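As an illustration of the second option, a weighted resample can be drawn with one call to NumPy; this is only a sketch of the idea, with a helper name of our own.

import numpy as np

def resample_by_weight(X, y, w, seed=0):
    # Draw N indices with replacement according to the weight vector w,
    # so that heavily weighted ("hard") instances appear more often.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=len(y), replace=True, p=w / w.sum())
    return X[idx], y[idx]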
In both cases, every new weak learner is forced to focus on hard instances (those poorly classified in previous rounds). Once its weighted error Err is known, the value of α associated to this classifier is

α = (1/2) log((1 − Err)/Err) .   (2.5)

The process can be repeated for as many iterations M as specified.
Schapire’s proposal kept a theoretical justification based on an uniform bounding of the
classification error. Specifically, if we denote Errm = 1/2 − γm the classification error of the
weak learner in iteration m, then the training error of the classifier globally fitted, H, is bounded
in the following way:
Err(H) ≤ ∏_{m=1}^{M} 2√(Err_m (1 − Err_m)) = ∏_{m=1}^{M} √(1 − 4γ_m²) ≤ exp(−2 ∑_{m=1}^{M} γ_m²) .   (2.6)

Thus, for a constant γ satisfying 0 < γ ≤ γ_m, ∀ m, we get:

Err(H) ≤ exp(−2M γ²) ,   (2.7)
which tends to zero as the number of iterations M increases. Expression (2.7) justifies the Adaptive part of the name AdaBoost, since neither γ nor M needs to be given in advance for the algorithm to adapt to the problem.
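As a rough numerical illustration (with values chosen only for the example): if every weak learner is just slightly better than chance, say γ = 0.1, then after M = 500 iterations the bound gives Err(H) ≤ exp(−2 · 500 · 0.1²) = exp(−10) ≈ 4.5 · 10⁻⁵.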
Proving the first inequality in (2.6) is not difficult. One just has to realize that the 0|1 error loss defined in terms of the margin, Err_{0|1}(z) = (1 − sign(z))/2, is bounded from above by the exponential loss L(z) = exp(−z). Therefore
Err(H) = (1/N) ∑_{n=1}^{N} I(y_n ≠ H(x_n)) = (1/N) ∑_{n=1}^{N} I(y_n F(x_n) < 0)
       ≤ (1/N) ∑_{n=1}^{N} exp(−y_n F(x_n)) = ∑_{n=1}^{N} w_M(n) · ∏_{m=1}^{M} Z_m = ∏_{m=1}^{M} Z_m ,

where the second-to-last equality follows from substituting (2.5) into the normalization factor

Z_m = ∑_{n=1}^{N} w_{m−1}(n) exp(−α_m y_n f_m(x_n)) .
A more complex argument is needed to bound the generalization error. This analysis was carried out by Y. Freund and R. Schapire [28], who justified that

GeneralErr(H) ≤ P̂(z ≤ σ) + Õ( √(D_VC / N) / σ ) ;   (2.8)
where P̂ is the empirical probability, N is the number of instances, and D_VC is the Vapnik-Chervonenkis dimension of the problem. Later, Schapire et al. [79] proposed a better bound in terms of the concept of margin that is independent of M. In this context, margins are exactly a "measure of confidence" in the prediction (its sign). Specifically, they proved that AdaBoost tends to increase the margin of the training data and, as a consequence, the generalization error decreases.
It is worth mentioning that at the end of the nineties AdaBoost was becoming so popular, especially due to its surprising accuracy in high dimensions, that other famous meta-classifiers such as Leo Breiman's Bagging algorithm [8] took second place. However, the beginning of the present century saw the emergence of another important meta-classifier, also devised by L. Breiman, Random Forests [9], which has received a degree of research attention comparable to Boosting, especially for Computer Vision problems.
AdaBoost admits an alternative interpretation that has become quite useful for justifying its good properties. This different point of view has a statistical background that may be more accessible to newcomers. The next subsection is devoted to it.
2.2.2 Statistical View of Boosting
Since its beginning, Boosting has been an object of study because of its apparent resistance to overfitting. It was never well understood how such an iterative process improves its generalization error iteration after iteration even when its training error is zero. At the end of the nineties, the work of Friedman, Hastie and Tibshirani [29] came to shed some light on the matter, although their arguments were not completely satisfactory [66]. In their work the authors
proved that AdaBoost can be obtained by fitting an additive model [37], i.e. a model with the
shape of expression (2.1), whose goal is to minimize the expected exponential loss.
Specifically they proved that Schapire’s method builds an additive logistic regression via
Newton-like updates for minimizing such expected loss. They also derived the Real AdaBoost,
an analogous version of AdaBoost for confidence-rated weak learners free of constants α. Furthermore, they propose LogitBoost and GentleBoost, two algorithms that resort to Newton steps
for updating the additive model with regard to the binomial log-likelihood and the exponential
loss, respectively.
Let us explain briefly how the pair (αm , fm ) is estimated in the m-th iteration under this
statistical point of view. Firstly, the optimal parameters have to satisfy
(α_m, f_m) = arg min_{α,f} ∑_{n=1}^{N} exp[−y_n (F_{m−1}(x_n) + αf(x_n))] ;   (2.9)
which, again, can be written in terms of the weights w_{m−1}(n) = exp(−y_n F_{m−1}(x_n)) as follows:
(α_m, f_m) = arg min_{α,f} ∑_{n=1}^{N} w_{m−1}(n) exp(−α y_n f(x_n)) .   (2.10)
Now we can calculate each parameter separately. Let α > 0 be a pre-specified value; to characterize f we just have to rewrite the objective function conveniently:

∑_{n=1}^{N} w_{m−1}(n) exp(−α y_n f(x_n)) = e^{−α} ∑_{f(x_n)=y_n} w_{m−1}(n) + e^{α} ∑_{f(x_n)≠y_n} w_{m−1}(n)
   = (e^{α} − e^{−α}) ∑_{n=1}^{N} w_{m−1}(n) I(f(x_n) ≠ y_n) + e^{−α} ∑_{n=1}^{N} w_{m−1}(n) ,

where the last sum equals 1 because the weights are normalized.
Therefore, independently of α, the optimal weak learner minimizes the weighted error. Now, assume that f_m is known (and hence its weighted error Err too). The above expression becomes

(e^{α} − e^{−α}) Err + e^{−α} ,   (2.11)

which is a convex function of α. Differentiating and equating to zero one obtains

α_m = (1/2) log((1 − Err)/Err) ,   (2.12)

exactly the same value as in (2.5).
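As a numerical illustration: a weak learner with weighted error Err = 0.3 receives α = (1/2) log(0.7/0.3) ≈ 0.42, while one with Err = 0.45 receives only α ≈ 0.10; the voting weight vanishes as the weak learner approaches random guessing.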
We will bear this in mind for our derivations. It will serve to introduce our definition of canonical extension for Boosting algorithms. Let P1 and P2 be two types of classification problems such that P1 becomes P2 when a set of restrictions is satisfied. Assume that A and B are Boosting algorithms developed to solve P1 and P2, respectively. Assume also that both algorithms are derived from different loss functions following the statistical interpretation of Boosting. We define the concept of canonical extension (generalization) as follows:
“Let (A, B) be a pair of Boosting algorithms defined as above. A is said to canonically extend B if the loss function of A, restricted to the framework of B, leads to updating the additive model with the same elements (G_m(x), α_m) fitted following B. In other words, the restrictions on A imposed by the second framework yield algorithm B.”
2.3 Multi-class Boosting
Almost concurrently with the emergence of AdaBoost, new extensions to multi-class problems were proposed, and there is now a large number of Boosting algorithms dealing with this type of classification. For ease of cataloguing, we divide them into two groups: algorithms that decompose the problem into binary subproblems (thus using binary weak learners) and algorithms that work simultaneously with all the labels (using multi-class weak learners or computing a posteriori probabilities at the same iteration). The following subsections discuss each approach. The Boosting algorithm that we introduce in the next chapter may be positioned between both groups.
Hereafter we will maintain the notation of section 2.1.1 for multi-class problems. The set of labels will be L = {1, 2, . . . , K}, and instances will again be denoted (x, l).
2.3.1 Boosting algorithms based on binary weak-learners
Completing Freund and Schapire’s contribution [28] for binary problems, the authors proposed in [26] two extensions to multi-class problems. We start discussing the second one,
AdaBoost.M2. This algorithm proceeds extending the datasets K-times, as discussed in section 2.1.1 for multi-label problems, and then computing binary weak classifiers of the shape
h : X × L → {0, 1}. Following AdaBoost’s essence, a normalized distribution of weights, W,
is used, but this time with a matrix shape, W ∈ [0, 1]N ×K . The main novelty of the method was
the inclusion of errors based on a pseudo-loss function
Ērr(h) = (1/2) ∑_{(n,k)∈B} w(n, k) (1 − h(x_n, l_n) + h(x_n, k)) ,   (2.13)
where B = {(n, k) | n = 1, . . . , N ; k ≠ l_n}. This type of loss simultaneously penalizes hard instances and hard labels. The pseudo-loss is combined with the binary exponential loss function to update the weights, while the α constants are computed by substituting Ērr into (2.5).
Years later, the idea behind AdaBoost.M2 was brought back by Schapire and Singer in [80], where three algorithms for multi-label problems were proposed. The first of them, AdaBoost.MH, admits a direct application to multi-class problems and became one of the most popular methods for this task among those proposed to date. We show its pseudo-code in Algorithm 2. AdaBoost.MH addresses the multi-class problem by extending the data base just as introduced in section 2.1.1. The main difference between it and AdaBoost.M2 is the use of the Hamming loss for measuring errors instead of a pseudo-loss. Analogously, AdaBoost.MR was also proposed, in this case using a ranking loss to measure multi-label accuracy.
The third algorithm in [80], AdaBoost.MO, is more complex than AdaBoost.MH, since the set of labels, L, is mapped onto P(L̂) for an unspecified set of labels, L̂, with R elements. Therefore, via λ : L → P(L̂), each label l has an associated “codeword” λ_l ∈ {±1}^R.
Algorithm 2 : AdaBoost.MH
1- Initialize the weight matrix W with uniform distribution w(n, k) = 1/(KN), for n = 1, . . . , N; k = 1, . . . , K.
2- For m = 1 to M :
   (a) Fit a binary classifier h_m : X × L → {−1, +1} using weights W.
   (b) Compute α_m.
   (c) Update weights w(n, l) ← w(n, l) exp(−α_m y_l h_m(x_n, l)), for n = 1, . . . , N; l = 1, . . . , K.
   (d) Re-normalize W.
3- Output Final Classifier:
   Multi-label: H(x, l) = sign( ∑_{m=1}^{M} α_m h_m(x, l) ).
   Multi-class: H(x) = arg max_k ∑_{m=1}^{M} α_m h_m(x, k).
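For concreteness, the multi-class decision rule in step 3 of Algorithm 2 can be sketched as follows; it assumes the trained weak learners are available as functions h_m(x, k) ∈ {−1, +1} and is shown only as an illustration of the arg-max aggregation.

import numpy as np

def adaboost_mh_predict(x, weak_learners, alphas, K):
    # weak_learners[m](x, k) answers +1/-1 to "does instance x carry label k?"
    scores = [sum(a * h(x, k) for h, a in zip(weak_learners, alphas))
              for k in range(1, K + 1)]
    return int(np.argmax(scores)) + 1  # labels are numbered 1..K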
The mapping is such that a binary classification can be performed on the extended set {(x_n, l̂) | l̂ = 1, . . . , R}, just as stated for AdaBoost.MH. AdaBoost.MO belongs to a new stream of multi-class Boosting based on the Error Correcting Output Codes (ECOC) philosophy. Let us briefly describe the main ideas behind this perspective of learning.
The ECOC methodology was introduced by Dietterich and Bakiri [17] in the nineties. It serves as an alternative approach for reducing a multi-class problem to a set of R binary ones. The key point of this approach lies in using a particular encoding to represent the subset of labels under classification. So, for the r-th task, one generates a weak-learner h_r : X → {+1, −1}. The presence/absence of a group of labels in an instance is encoded by a column vector belonging to {−1, +1}^K, where +1 indicates presence of the objective labels of h_r. It is usual to use a coloring function, µ, for assigning the presence or absence of a set of labels in the data. The resulting assignment becomes the set of true labels for the associated binary subproblem. The composition of all column vectors generated in this fashion produces a (K × R) coding matrix, in which the l-th row serves as the codeword associated to label l. When classifying a new instance one has to compute all the answers of the weak learners and compare the resulting (1 × R) vector with each codeword. The decision rule consists in choosing the class whose codeword is nearest to the result (under some measure such as the Hamming distance). This solution establishes matrix encodings as a practical and intuitive tool for building strong multi-class classifiers using binary weak learners.
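The decoding step just described is easy to sketch in code, assuming a coding matrix code ∈ {−1, +1}^{K×R} and the R weak-learner answers collected in a vector; Hamming distance is used as the dissimilarity measure.

import numpy as np

def ecoc_decode(responses, code):
    # responses: length-R vector in {-1,+1} with the weak learners' answers.
    # code: (K x R) matrix whose l-th row is the codeword of label l.
    hamming = np.sum(responses != code, axis=1)  # disagreements per codeword
    return int(np.argmin(hamming)) + 1           # nearest codeword, labels 1..K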
Based on this idea, and together with AdaBoost.MO [80], two new Boosting algorithms appeared: AdaBoost.OC [78] and AdaBoost.ECC [35]. These two algorithms are probably the most relevant ones for this view of multi-class Boosting. We show their respective pseudo-codes in Algorithms 3 and 4. We must point out that AdaBoost.OC is also grounded on the M2 version proposed by Schapire, but this time completed with an ECOC perspective.
Here we briefly compare these three ECOC-based methods:
• With regard to the instance weights, they have in common a normalized matrix W ∈ R^{N×R} used at each iteration. In particular, for AdaBoost.OC and AdaBoost.ECC, R = K and the actual weights, D ∈ R^N, used for fitting weak learners are computed from W and the m-th selected coloring, µ_m.
• Besides this, the three algorithms add just one voting constant, α, per iteration to accompany the decisions of the corresponding weak learner.
Algorithm 3 : AdaBoost.OC
1- Initialize the weight matrix W with uniform distribution w(n, k) = 1/(N(K−1)), if k ≠ l_n, and w(n, l_n) = 0, n = 1, . . . , N.
2- For m = 1 to M :
   (a) Compute coloring function µ_m : L → {0, 1}.
   (b) Compute D_m(n) = Z_m^{−1} ∑_{k∈L} w(n, k) I(µ_m(l_n) ≠ µ_m(k)), where Z_m is a normalization constant.
   (c) Fit a classifier h_m : X → {0, 1} to the data {(x, µ_m(l))} weighted according to D_m.
   (d) Define h̃_m(x_n) := {k ∈ L : h_m(x_n) = µ_m(k)}.
   (e) Compute Err_m = (1/2) ∑_{n=1}^{N} ∑_{k=1}^{K} w(n, k) ( I(l_n ∉ h̃_m(x_n)) + I(k ∈ h̃_m(x_n)) ).
   (f) Compute α_m = (1/2) log((1 − Err_m)/Err_m).
   (g) Update weights w(n, k) ← w(n, k) exp[ α_m ( I(l_n ∉ h̃_m(x_n)) + I(k ∈ h̃_m(x_n)) ) ].
   (h) Re-normalize W.
3- Output Final Classifier: H(x) = arg max_k ∑_{m=1}^{M} α_m I(h_m(x) = µ_m(k)).
In AdaBoost.ECC, the constant g_m(x_n) can take two instance-dependent values:

α_m = (1/2) ln( ∑_{n: h_m(x_n)=µ_m(l_n)=+1} D_m(n) / ∑_{n: h_m(x_n)=+1≠µ_m(l_n)} D_m(n) ),  if h_m(x_n) = +1 ;
β_m = (1/2) ln( ∑_{n: h_m(x_n)=µ_m(l_n)=−1} D_m(n) / ∑_{n: h_m(x_n)=−1≠µ_m(l_n)} D_m(n) ),  if h_m(x_n) = −1 .   (2.14)
• The loss function applied for updating weights in all three algorithms comes from a
derivation of the binary exponential loss function with non-vectorial arguments (for instance, a pseudo-loss in the case of AdaBoost.OC).
• Another important issue is the shape of the final classifier. On the one hand, the final decision of AdaBoost.OC and AdaBoost.ECC admits a translation into a K-dimensional function f(x) = (f_1(x), . . . , f_K(x))^⊤ whose maximum coordinate is selected as the response. On the other hand, AdaBoost.MO proposes two options based on the final response f(x) ∈ R^R and the (K × R)-matrix of codewords: one can select the row that is closest to the response, or resort to a confidence-rated prediction such as

$$\arg\min_{l} \sum_{\hat l \in \hat L} \exp\left(-\lambda_l(\hat l)\, f(x, \hat l)\right), \tag{2.15}$$

which is the option recommended by the authors.
• Finally, with regard to the number of weak learners computed at each iteration, the three algorithms compute just one weak learner (jointly with its voting constant α). AdaBoost.OC and AdaBoost.ECC train a weak classifier associated to the coloring μ_m, while AdaBoost.MO trains a weak learner for the binary problem generated by the extended data base {(x_n, l̂) | l̂ = 1, . . . , R} with labels {λ_{l_n}(l̂) | l̂ = 1, . . . , R}.
We will come back to these issues in the next chapter, when introducing our multi-class algorithm.
Algorithm 4 : AdaBoost.ECC
1- Initialize the weight matrix W with uniform distribution w(n, k) = 1/(N(K−1)) if k ≠ l_n, and w(n, l_n) = 0, n = 1, . . . , N.
2- For m = 1 to M:
  (a) Compute coloring function μ_m : L → {−1, +1}.
  (b) Compute D_m(n) = Z_m^{−1} Σ_{k∈L} w(n, k) I(μ_m(l_n) ≠ μ_m(k)), where Z_m is a normalization constant.
  (c) Fit a binary classifier h_m : X → {−1, +1} to the training data using weights D_m.
  (d) Compute α_m and β_m following (2.14).
  (e) Compute g_m(x) = α_m I(h_m(x) = +1) − β_m I(h_m(x) = −1).
  (f) Update weights w(n, k) ← w(n, k) exp[ ½ g_m(x_n) (μ_m(k) − μ_m(l_n)) ].
  (g) Re-normalize W.
3- Output Final Classifier: H(x) = arg max_k Σ_{m=1}^M g_m(x) μ_m(k).
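Step (d) of Algorithm 4 relies on equation (2.14). The following sketch (our own minimal code, not the original implementation; h is assumed to be a fitted binary weak learner returning values in {−1, +1} and mu a dictionary giving the coloring of each label) shows how the two voting constants could be computed.

```python
import numpy as np

def ecc_voting_constants(h, X, labels, mu, D, eps=1e-12):
    """Compute (alpha_m, beta_m) of eq. (2.14) for an AdaBoost.ECC iteration.

    h      : fitted binary weak learner, h(X) in {-1, +1}
    X      : (N, d) training instances
    labels : (N,) original multi-class labels l_n
    mu     : dict mapping each label to its coloring in {-1, +1}
    D      : (N,) weight vector used to fit h
    """
    pred = h(X)                                   # predictions in {-1, +1}
    true = np.array([mu[l] for l in labels])      # colored ground truth in {-1, +1}
    plus, minus = (pred == +1), (pred == -1)
    alpha = 0.5 * np.log((D[plus & (true == +1)].sum() + eps) /
                         (D[plus & (true == -1)].sum() + eps))
    beta = 0.5 * np.log((D[minus & (true == -1)].sum() + eps) /
                        (D[minus & (true == +1)].sum() + eps))
    return alpha, beta
```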
2.3.2 Boosting algorithms based on vectorial encoding
Grouped in a second block, we encompass those multi-class algorithms that manage all labels simultaneously at each iteration (using multi-class weak learners or directly estimating a posteriori probabilities). Here we include the first algorithm proposed by Freund and Schapire in [26], AdaBoost.M1. The essence of this multi-class generalization of AdaBoost lies in using pure multi-class weak learners while maintaining the same structure as the original algorithm. The main drawback of AdaBoost.M1 is the need for "strong learners", i.e. hypotheses that achieve an accuracy of at least 50%, a requirement that may be too demanding when the number of labels is high. Despite the lack of theory supporting this method, we must clarify that it is very common to consider AdaBoost.M1 as the first Boosting algorithm. This is due to its direct translation into AdaBoost for the binary case. A second approach came with the multi-class version of LogitBoost, which appeared jointly with the binary one [29]. Like its two-label counterpart, it separately estimates the probability of belonging to each label based on a multi-logit parametrization.
The most interesting works from our point of view are those grounded on a vectorial insight. A successful way to generalize the symmetry of class-label representation in the binary case to the multi-class case is to use a set of vector-valued class codes that represent the correspondence between the label set L = {1, . . . , K} and a collection of vectors Y = {y_1, . . . , y_K}, where vector y_l has value 1 in the l-th coordinate and −1/(K−1) elsewhere. So, if l = 1, the code vector representing class 1 is y_1 = (1, −1/(K−1), . . . , −1/(K−1))^⊤. It is immediate to see the equivalence between classifiers H(x) defined over L and classifiers f(x) defined over Y:

$$H(x) = l \in L \;\Leftrightarrow\; \mathbf{f}(x) = \mathbf{y}_l \in Y. \tag{2.16}$$
In the remainder of the thesis we will use capital letters (H, G or T ) for denoting classifiers
with target set L. On the other hand, classifiers having as codomain a set of vectors, like Y , will
be denoted with small bold letters (f or g).
This codification was first introduced by Lee, Lin and Wahba [51] for extending the binary Support Vector Machine to the multi-class case. More recently H. Zou, J. Zhu and T. Hastie [114] generalized the concept of binary margin to the multi-class case using a related vectorial codification in which a K-vector y is said to be a margin vector if it satisfies the sum-to-zero condition Σ_{k=1}^K y(k) = 0; in other words, y^⊤ 1 = 0, where 1 denotes a vector of ones. This sum-to-zero condition reflects the implicit nature of the response in classification problems in which each y_n takes one and only one value from a set of labels. Margin vectors are useful for multi-class classification problems for other reasons. One of them comes directly from the sum-to-zero property. It is known that, in general, every vectorial classifier f(x) = (f_1(x), . . . , f_K(x))^⊤ has a direct translation into a posteriori probabilities P(l = k | x), ∀k ∈ L, via the Multi-class Logistic Regression Function (MLRF)

$$P(l = k \mid x) = \frac{\exp(f_k(x))}{\sum_{i=1}^{K} \exp(f_i(x))}. \tag{2.17}$$

It is clear that a function f(x) produces the same a posteriori probabilities as g(x) = f(x) + α(x)·1, where α(x) is a real-valued function and 1 is a K-dimensional vector of ones. Such is the case when, for example, α(x) = −f_K(x). Using margin vectors we do not have to worry about this issue².
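A quick numerical check of (2.17) and of the invariance under adding α(x)·1 (our own illustrative snippet, not part of the cited works):

```python
import numpy as np

def mlrf(f):
    """Multi-class Logistic Regression Function (2.17): softmax of the score vector f."""
    e = np.exp(f - f.max())              # subtracting the max is a standard numerical safeguard
    return e / e.sum()

f = np.array([1.3, -0.2, 0.7])
g = f + 5.0                              # g = f + alpha(x) * 1 with constant alpha(x) = 5
print(np.allclose(mlrf(f), mlrf(g)))     # True: both produce the same posterior probabilities

# Projecting onto the margin (sum-to-zero) hyperplane picks one representative per class.
f_margin = f - f.mean()
print(np.isclose(f_margin.sum(), 0.0), np.allclose(mlrf(f), mlrf(f_margin)))   # True True
```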
Using this codification, J. Zhu, H. Zou, S. Rosset and T. Hastie [112] generalized the original AdaBoost to multi-class problems from a statistical point of view. This work has been a cornerstone for subsequent derivations. Let us describe the main elements upon which the algorithm is grounded. Firstly, the binary margin applied in AdaBoost, z = yf(x), is replaced by the Multi-class Vectorial Margin, defined as the scalar product

$$z := \mathbf{y}^{\top}\mathbf{f}(x). \tag{2.18}$$

The essence of the margin approach resides in keeping the margin positive/negative when the classifier respectively succeeds/fails. That is, if y, f(x) ∈ Y, the margin z = y^⊤ f(x) satisfies: z > 0 ⇔ y = f(x), and z < 0 ⇔ y ≠ f(x). Note that, again, this definition of margin serves as a compact way to specify numerically the hits and mistakes of a classification. It is straightforward that the only two possible values of the margin when y, f(x) ∈ Y are

$$z = \mathbf{y}^{\top}\mathbf{f}(x) = \begin{cases} \dfrac{K}{K-1} & \text{if } \mathbf{f}(x) = \mathbf{y}, \\[2mm] \dfrac{-K}{(K-1)^2} & \text{if } \mathbf{f}(x) \neq \mathbf{y}. \end{cases} \tag{2.19}$$

Bearing this definition in mind, the Multi-class Exponential Loss Function is

$$L(\mathbf{y}, \mathbf{f}(x)) := \exp\left(-\frac{1}{K}\,\mathbf{y}^{\top}\mathbf{f}(x)\right) = \exp\left(-\frac{1}{K}\,z\right). \tag{2.20}$$
As the reader may guess, the presence of the constant 1/K is important but not determinant for the proper behaviour of the loss function. We will see later how it simplifies some calculations. An interesting property of this loss function (which justifies the addition of the constant 1/K) comes from the following result:

$$L(\mathbf{y}, \mathbf{f}(x)) = \exp\left(-\frac{1}{K}\sum_{k=1}^{K} y(k)\, f_k(x)\right) = \prod_{k=1}^{K}\left(\exp(-y(k)\, f_k(x))\right)^{1/K} = \sqrt[K]{\prod_{k=1}^{K}\exp(-y(k)\, f_k(x))}\,. \tag{2.21}$$

Hence this multi-class loss function is a geometric mean of the binary exponential loss function applied to each pair of coordinates of (y, f(x)) (i.e. component-wise margins).

² Over the set of functions F = {f : x ↦ R^K} we can define the equivalence relation f(x) ∼ g(x) ⇔ ∃ α : x ↦ R | g(x) = f(x) + α(x)·1. Then margin functions become representatives of the equivalence classes.
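The following snippet (our own illustration) builds the vector codes of Y, evaluates the margin (2.18)–(2.19) and the loss (2.20), and numerically verifies the geometric-mean identity (2.21) for a small K:

```python
import numpy as np

K = 4
# Vector codes y_l: value 1 at coordinate l and -1/(K-1) elsewhere (margin vectors).
Y = np.full((K, K), -1.0 / (K - 1)) + np.eye(K) * (1.0 + 1.0 / (K - 1))

def margin(y, f):
    return float(y @ f)                              # eq. (2.18)

def exp_loss(y, f):
    return float(np.exp(-margin(y, f) / K))          # eq. (2.20)

y, f_ok, f_bad = Y[0], Y[0], Y[2]
print(margin(y, f_ok), K / (K - 1))                  # correct prediction:  K/(K-1)
print(margin(y, f_bad), -K / (K - 1) ** 2)           # wrong prediction:   -K/(K-1)^2

# Geometric-mean identity (2.21): L(y, f) equals the K-th root of the product of the
# component-wise binary exponential losses exp(-y(k) f_k(x)).
lhs = exp_loss(y, f_bad)
rhs = np.prod(np.exp(-y * f_bad)) ** (1.0 / K)
print(np.isclose(lhs, rhs))                          # True
```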
Moreover, this loss function is Fisher-consistent [114]. This property is defined as follows:
"A loss function L is said to be Fisher-consistent if, ∀x ∈ X (set of full measure), the optimization problem f̂ = arg min_f L(P_{L|X}, f(x)), with f belonging to the hyperplane of margin vectors, has a unique solution and, in addition, arg max_j f̂_j(x) = arg max_j P(l = j | x)".
Roughly speaking, the Fisher-consistency condition says that, if we were provided with infinite samples, we could recover the exact Bayes rule by minimizing losses of this kind. This fact makes them suitable for fitting multi-class classifiers with guarantees. H. Zou, J. Zhu and T. Hastie introduced a theoretical basis for margin vectors and Fisher-consistent loss functions [114]. Given a classification function expressed in terms of a margin vector f(x) = (f_1(x), . . . , f_K(x))^⊤, they defined the multi-class margin for an instance (x, l) as the coordinate f_l(x). Consequently, a binary loss function can be used for evaluating multi-class decisions. Although this generalization is adequate to derive algorithms like AdaBoost.ML and Multi-category GentleBoost, we believe this definition of margin does not exploit the utility of vectorial encodings for labels.
In the case of the multi-class exponential loss (2.20), it can be proved that the population minimizer

$$\arg\min_{\mathbf{f}} \; E_{Y\mid X=x}\left[L(Y, \mathbf{f}(x))\right]$$

corresponds to the multi-class Bayes optimal classification rule [112]:

$$\arg\max_{k} f_k(x) = \arg\max_{k} P(Y = k \mid x).$$

Other loss functions, such as the logit or L2, share this property and may also be used for building Boosting algorithms. Similarly, Saberian and Vasconcelos showed that other sets of margin vectors could have been used for representing labels [74], and therefore to develop equivalent algorithms. Their work also proposes an interesting definition of a multi-class margin for label k, z_k := ½( y_k^⊤ f(x) − max_{j≠k} y_j^⊤ f(x) ). Using it, they justify the classification criterion H(x) = arg max_k z_k.
Having defined the above concepts, namely the multi-class vectorial margin and the multi-class exponential loss, we can come back to the work of Zhu et al. [112]. Their proposed algorithm, SAMME³ (Stage-wise Additive Modeling using a Multi-class Exponential loss function), resorts to the multi-class exponential loss for evaluating classifications encoded with margin vectors when the real labels are encoded likewise. The expected loss is then minimized using a stage-wise additive gradient descent approach. Its pseudo-code is shown in Algorithm 5. It is quite interesting that the resulting algorithm only differs from AdaBoost (see both pseudo-codes) in step 2.(c), which now is α_m = log((1 − Err_m)/Err_m) + log(K − 1). Step 3 is not especially different since

$$H(x) = \arg\max_{k} \sum_{m=1}^{M} \alpha_m\, I(T_m(x) = k) = \arg\max_{k} f_k(x),$$

where f(x) = Σ_{m=1}^M α_m g_m(x). Moreover, it is an easy exercise to prove that the above classification rule is equivalent to assigning the maximum margin (2.18), H(x) = arg max_k y_k^⊤ f(x), which links with the perspective defined in [74]. In the same way it is straightforward to verify that AdaBoost becomes a particular case when K = 2, which makes us regard SAMME as a canonical generalization of AdaBoost using multi-class weak-learners.

³ A curious name for an algorithm that is essentially the same as the AdaBoost.ME proposed in a technical report [113] developed previously by the authors of [114] following a different margin-based insight.
Algorithm 5 : SAMME
1- Initialize the weight vector W with uniform distribution w(n) = 1/N, n = 1, . . . , N.
2- For m = 1 to M:
  (a) Fit a multi-class classifier T_m(x) to the training data using weights W.
  (b) Compute weighted error: Err_m = Σ_{n=1}^N w(n) I(T_m(x_n) ≠ l_n).
  (c) Compute α_m = log((1 − Err_m)/Err_m) + log(K − 1).
  (d) Update weight vector w(n) ← w(n) exp(α_m I(T_m(x_n) ≠ l_n)), n = 1, . . . , N.
  (e) Re-normalize W.
3- Output Final Classifier: H(x) = arg max_k Σ_{m=1}^M α_m I(T_m(x) = k).
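A compact implementation of Algorithm 5 might look as follows (our own illustrative sketch, assuming scikit-learn is available, integer class labels 0, . . . , K−1 and shallow decision trees as multi-class weak learners):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def samme_fit(X, y, K, M=50, eps=1e-10):
    """Minimal SAMME trainer: returns a list of (alpha_m, T_m) pairs."""
    N = len(y)
    w = np.full(N, 1.0 / N)
    model = []
    for _ in range(M):
        T = DecisionTreeClassifier(max_depth=2).fit(X, y, sample_weight=w)
        miss = (T.predict(X) != y)
        err = np.clip(np.dot(w, miss), eps, 1 - eps)      # weighted error, step 2.(b)
        alpha = np.log((1 - err) / err) + np.log(K - 1)   # step 2.(c)
        if alpha <= 0:                                    # worse than random guessing: stop
            break
        w *= np.exp(alpha * miss)                         # step 2.(d)
        w /= w.sum()                                      # step 2.(e)
        model.append((alpha, T))
    return model

def samme_predict(model, X, K):
    votes = np.zeros((len(X), K))
    for alpha, T in model:
        votes[np.arange(len(X)), T.predict(X)] += alpha   # step 3: weighted votes per class
    return votes.argmax(axis=1)
```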
It is also worth pointing out the impact of the above works [112, 114], jointly with [20], on the multi-class field of Boosting. For instance, J. Huang et al. [39] proposed GAMBLE (Gentle Adaptive Multi-class Boosting Learning), an algorithm that takes advantage of the same vectorial labels and loss function as SAMME, on the one hand, and the same type of weak learners and structure as GentleBoost, on the other. The resulting multi-class Boosting algorithm is merged with an active learning methodology to scale up to large data sets.
To finish the section, let us describe the Multi-category GentleBoost algorithm introduced in [114]. This method is a prominent example of a multi-class algorithm conceived to perform a coordinate-wise fit of the margin vector f(x). See its pseudo-code in Algorithm 6. Multi-category GentleBoost resorts to the exponential loss (applied to the margin z = f_l(x)) to build the vectorial additive model. It works as follows. A vectorial function h(x) ∈ R^K is initialized to zero in each of its coordinates. The iterative process updates h(x) = h(x) + g(x) aiming to find the direction, g(x), that makes the empirical risk

$$\frac{1}{N}\sum_{n=1}^{N} \exp\left(-h_{l_n}(x_n) + \frac{1}{K}\sum_{k=1}^{K} h_k(x_n)\right) \tag{2.22}$$

decrease most. This is accomplished by expanding (2.22) to second order and then simplifying the Hessian, keeping only its diagonal. Doing so, it can be verified that the optimal j-th coordinate of g(x) minimizes

$$-\sum_{n=1}^{N} g_j(x_n)\, z_{nj} \exp\left(-f_{l_n}(x_n)\right) + \frac{1}{2}\sum_{n=1}^{N} g_j(x_n)^2\, z_{nj}^2 \exp\left(-f_{l_n}(x_n)\right), \tag{2.23}$$

where z_{nj} = −1/K + I(l_n = j) and f(x) is the margin vector corresponding to the model already fitted (i.e. h(x) mapped onto the margin hyperplane). A simple way to solve it is to fit a regression function by weighted least-squares of the target variable z_{nj}^{−1} to x_n with weights w(n, j) = z_{nj}² exp(−f_{l_n}(x_n)). Finally, h is projected onto the hyperplane of margin vectors in order to classify or, if required, compute a posteriori probabilities. It is easy to verify that Multi-category GentleBoost canonically extends its binary counterpart GentleBoost [29].
Algorithm 6 : Multi-category GentleBoost
1- Initialize the weight vector W with constant distribution w(n) = 1, n = 1, . . . , N.
2- Initialize coordinates h_k(x) = 0, for k = 1, . . . , K.
3- For m = 1 to M:
  (a) For k = 1 to K:
    * Let z_n := −1/K + I(l_n = k). Compute w*(n) = z_n² exp(−f_{l_n}(x_n)).
    * Fit a regressor g_k(x) by least-squares of the variable z_n^{−1} to x_n weighted with w*(n).
    * Update h_k(x) = h_k(x) + g_k(x).
  (b) Compute f(x), the projection of h(x) onto the margin hyperplane.
  (c) Compute w(n) = exp(−f_{l_n}(x_n)), n = 1, . . . , N.
4- Output Final Classifier: H(x) = arg max_k f_k(x).
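One inner coordinate update of step 3.(a) could be coded as follows (our own simplified sketch; a depth-2 regression tree from scikit-learn stands in for the generic weighted least-squares regressor):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gentleboost_coordinate_step(X, labels, h, f, k, K):
    """One inner step (for class k) of Multi-category GentleBoost, Algorithm 6, step 3.(a).

    X, labels : training data and integer labels in {0, ..., K-1}
    h         : (N, K) current unprojected additive model evaluated on X
    f         : (N, K) its projection onto the margin hyperplane
    k         : coordinate being updated
    Returns the fitted regressor g_k and the updated column h[:, k].
    """
    N = len(labels)
    z = -1.0 / K + (labels == k).astype(float)          # z_n = -1/K + I(l_n = k), never zero
    f_true = f[np.arange(N), labels]                    # f_{l_n}(x_n)
    w = z ** 2 * np.exp(-f_true)                        # w*(n) = z_n^2 exp(-f_{l_n}(x_n))
    g = DecisionTreeRegressor(max_depth=2)
    g.fit(X, 1.0 / z, sample_weight=w)                  # weighted LS fit of the target z_n^{-1}
    return g, h[:, k] + g.predict(X)
```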
2.4 Cost-sensitive binary Boosting
A second mandatory extension of AdaBoost comes with the chance of having different costs for misclassifications. Assume that the binary classification problem at hand is provided with a (2 × 2)-cost matrix

$$C = \begin{pmatrix} C(1,1) & C(1,2) \\ C(2,1) & C(2,2) \end{pmatrix} \tag{2.24}$$

with non-negative real values. Here, rows refer to real labels while columns indicate predicted labels.
Since its beginning, AdaBoost has received a notable degree of attention in order to adapt
its good properties to a cost-sensitive structure like the above. It is very common to set the
diagonal equal to zero, i.e. no costs for correct classifications. We will justify this consideration
in section 4.1. For ease of notation we will use C1 and C2 to denote the constants C(1, 2) and
C(2, 1), respectively.
Initial attempts to generalize AdaBoost in this fashion came essentially from heuristic changes to specific parts of the pseudo-code. Such is the case of CSB0, CSB1 and CSB2 [94, 93]. Their respective reweighting schemes for the n-th instance are given by:

$$w_{m+1}(n) = \begin{cases} w_m(n) & \text{if } f_m(x_n) = y_n \\ C_{y_n}\, w_m(n) & \text{if } f_m(x_n) \neq y_n \end{cases} \tag{2.25}$$

$$w_{m+1}(n) = \begin{cases} w_m(n) \exp(-y_n f_m(x_n)) & \text{if } f_m(x_n) = y_n \\ C_{y_n}\, w_m(n) \exp(-y_n f_m(x_n)) & \text{if } f_m(x_n) \neq y_n \end{cases} \tag{2.26}$$

$$w_{m+1}(n) = \begin{cases} w_m(n) \exp(-\alpha_m y_n f_m(x_n)) & \text{if } f_m(x_n) = y_n \\ C_{y_n}\, w_m(n) \exp(-\alpha_m y_n f_m(x_n)) & \text{if } f_m(x_n) \neq y_n \end{cases} \tag{2.27}$$

The common structure of the three algorithms is shown in Algorithm 7.
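For concreteness, the three reweighting rules (2.25)–(2.27) can be written as a single function (our own illustrative snippet; labels and weak-learner outputs are assumed to take values in {−1, +1} and C maps each label to its misclassification cost):

```python
import numpy as np

def csb_update(w, y, fx, C, alpha, variant):
    """Reweighting rules of CSB0 (2.25), CSB1 (2.26) and CSB2 (2.27).

    w  : (N,) current weights        y : (N,) labels in {-1, +1}
    fx : (N,) weak-learner outputs   C : dict {+1: C1, -1: C2} of misclassification costs
    alpha : voting constant of the current iteration
    """
    miss = (fx != y)
    cost = np.where(miss, np.array([C[v] for v in y]), 1.0)   # C_{y_n} only on mistakes
    if variant == "CSB0":
        upd = np.ones_like(w)                                 # (2.25)
    elif variant == "CSB1":
        upd = np.exp(-y * fx)                                 # (2.26)
    else:                                                     # CSB2, (2.27)
        upd = np.exp(-alpha * y * fx)
    w_new = cost * w * upd
    return w_new / w_new.sum()                                # re-normalization
```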
Another example came with AdaCost, the algorithm proposed by W. Fan et al. [23]. It was developed around a weighting rule that includes a particular cost and a margin-dependent function, β(n), in the argument of the exponential loss function. So, for an instance (x_n, y_n), one computes

$$w_{m+1}(n) = w_m(n) \exp\left(-\alpha_m y_n f_m(x_n)\,\beta(n)\right). \tag{2.28}$$

In practice, the authors select β(n) = ½(1 − z_n C_{y_n}), where C_{y_n} is the cost incurred for the real label y_n and z_n is the associated margin⁴ at iteration m.

⁴ Remember the equivalence between the label sets {1, −1}, {1, 0} and {1, 2} for binary problems.
Algorithm 7 : CSB
1- Initialize the weight vector W with w(n) = C_{y_n}/Z_0, n = 1, . . . , N; where Z_0 is a normalization constant.
2- For m = 1 to M:
  (a) For f ∈ Pool:
    * Compute the weighted error of f, i.e. Err = Σ_{n=1}^N w(n) I(f(x_n) ≠ y_n).
   end for
  (b) Select f_m, the weak hypothesis with minimum weighted error.
  (c) Compute α_m = log((1 − Err_m)/Err_m).
  (d) Update the weight vector following either (2.25), (2.26) or (2.27).
  (e) Re-normalize W.
3- Output Final Classifier:
  H(x) = sign( Σ_{m=1}^M α_m f_m(x) ( C_1 I(f_m(x) = 1) + C_2 I(f_m(x) = −1) ) )
Similarly, AsymBoost [101] is also based on a reweighting scheme:

$$w_{m+1}(n) = w_m(n) \exp\left(-\alpha_m y_n f_m(x_n)\right) \left(C_1/C_2\right)^{y_n/(2m)}. \tag{2.29}$$

This choice seems to be non-optimal because of its dependence on the current iteration m, which departs from the "adaptive" property of the AdaBoost algorithm.
In the same way, Y. Sun et al. [89, 87] proposed another three ways to update weights in a cost-sensitive fashion:

1. w_{m+1}(n) = w_m(n) exp(−α_m C_{y_n} y_n f_m(x_n))
2. w_{m+1}(n) = w_m(n) C_{y_n} exp(−α_m y_n f_m(x_n))
3. w_{m+1}(n) = w_m(n) C_{y_n} exp(−α_m C_{y_n} y_n f_m(x_n))

No formal reason is given for including the costs "inside" the exponential loss, "outside" the exponential loss, or "in both places" jointly. Each reweighting scheme yields, in turn, its own cost-sensitive extension of AdaBoost, namely AdaC1, AdaC2 and AdaC3. Algorithm 8 shows AdaC2's pseudo-code. We will come back to it in chapter 4.
Algorithm 8 : AdaC2
1- Initialize the weight vector W with uniform distribution w(n) = 1/N, n = 1, . . . , N.
2- For m = 1 to M:
  (a) Fit a classifier f_m(x) to the training data using weights W.
  (b) Compute weighted error: Err_m = Σ_{n=1}^N w(n) I(f_m(x_n) ≠ y_n).
  (c) Compute α_m = log((1 − Err_m)/Err_m).
  (d) Update weight vector w(n) ← w(n) C_{y_n} exp(α_m I(f_m(x_n) ≠ y_n)), n = 1, . . . , N.
  (e) Re-normalize W.
3- Output Final Classifier: H(x) = sign( Σ_{m=1}^M α_m f_m(x) ).
More recently, two works came to shed light on the cost-sensitive capability of Boosting from a more formal point of view. On the one hand, Masnadi-Shirazi and Vasconcelos' work [63] (see also their previous paper [60]) may be considered a canonical extension of AdaBoost to this field. Besides three other methods for binary cost-sensitive problems, this paper introduces Cost-Sensitive AdaBoost (CS-AdaBoost). The core idea behind the algorithm is substituting the original exponential loss function by a cost-dependent derivative:

$$L_{CS\text{-}Ada}(y, F(x)) = I(y = 1)\exp\left(-y C_1 F(x)\right) + I(y = -1)\exp\left(-y C_2 F(x)\right). \tag{2.30}$$

It is clear that it becomes AdaBoost when a 0|1-cost matrix is used. CS-AdaBoost is then derived by fitting an additive model whose objective is to minimize the expected loss. Algorithm 9 shows the pseudo-code. Like previous approaches, the algorithm needs a pool of available weak learners from which to select the optimal one at each iteration, jointly with the optimal step β. For a candidate weak learner, g(x), two constants are computed summing up the weighted errors associated to instances with the same label, b for label 1 and d for label −1. Then β is calculated by finding the only real solution to the following equation:

$$2 C_1 b \cosh(\beta C_1) + 2 C_2 d \cosh(\beta C_2) = T_1 C_1 e^{-\beta C_1} + T_2 C_2 e^{-\beta C_2}. \tag{2.31}$$

Finally, the pair (g(x), β) minimizing L_{CS-Ada}(y, F(x) + βg(x)) is added to the model. A particularity of CS-AdaBoost lies in the initial weighting of instances, which attempts to distribute the sum of positive and negative weights evenly (1/2 for each group). See step 1 of Algorithm 9, where N_1 = Σ_{n=1}^N I(y_n = 1) and N_{−1} = Σ_{n=1}^N I(y_n = −1).
Algorithm 9 : Cost-Sensitive AdaBoost
1- Initialize the weight vector W with distribution w(n) = 1/(2 N_{y_n}), n = 1, . . . , N.
2- For m = 1 to M:
  (a) Calculate constants: T_1 = Σ_{y_n = 1} w(n), T_2 = Σ_{y_n = −1} w(n) = 1 − T_1.
  (b) For f ∈ Pool:
    * Calculate: b = Σ_{y_n = 1} w(n) I(y_n ≠ f(x_n)), d = Σ_{y_n = −1} w(n) I(y_n ≠ f(x_n)).
    * Calculate β, the solution to equation (2.31).
    * Compute the weighted error using Σ_{n=1}^N L_{CS-Ada}(y_n, F_{m−1}(x_n) + β f(x_n)).
   end for
  (c) Select the pair (f_m, β_m) of minimum weighted error.
  (d) Update weights: w(n) ← w(n) · L_{CS-Ada}(y_n, β_m f_m(x_n)), n = 1, . . . , N.
  (e) Re-normalize W.
3- Output Final Classifier: H(x) = sign( Σ_{m=1}^M β_m f_m(x) ).
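Step 2.(b) of Algorithm 9 requires solving the scalar equation (2.31) for β. A simple numerical approach (our own sketch, assuming SciPy is available; the bracket is expanded until the sign of the residual changes) is:

```python
import numpy as np
from scipy.optimize import brentq

def cs_adaboost_beta(b, d, T1, T2, C1, C2):
    """Solve eq. (2.31): 2*C1*b*cosh(beta*C1) + 2*C2*d*cosh(beta*C2)
                       = T1*C1*exp(-beta*C1) + T2*C2*exp(-beta*C2)."""
    def g(beta):
        lhs = 2 * C1 * b * np.cosh(beta * C1) + 2 * C2 * d * np.cosh(beta * C2)
        rhs = T1 * C1 * np.exp(-beta * C1) + T2 * C2 * np.exp(-beta * C2)
        return lhs - rhs

    lo, hi = 0.0, 1.0
    while g(hi) < 0:            # expand upwards until the residual becomes positive
        hi *= 2.0
    while g(lo) > 0:            # allow a negative beta for a very poor weak learner
        lo -= 1.0
    return brentq(g, lo, hi)    # the real root inside the bracket

# Example with hypothetical weighted errors and costs:
print(round(cs_adaboost_beta(b=0.1, d=0.15, T1=0.5, T2=0.5, C1=2.0, C2=1.0), 3))
```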
On the other hand, Landesa-Vazquez and Alba-Castro's work [49] discusses the effect of an initial non-uniform weighting of instances to endow AdaBoost with a cost-sensitive behaviour. The resulting method, Cost-Generalized AdaBoost, keeps the rest of AdaBoost's original structure.
Besides these approaches, we must point out that probably the most intuitive way of applying AdaBoost to cost-sensitive problems came with Viola and Jones' work on detection [100]. The authors basically maintained the original algorithm and introduced a threshold, θ, to bias the committed response in favour of the most costly label. Hence, the resulting variation has the shape

$$H(x) = \operatorname{sign}\left(F(x) - \theta\right) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m f_m(x) - \theta\right). \tag{2.32}$$
Obviously, the threshold is tuned over a validation set to assess the required quality of detection. If an actual cost matrix is given then θ is easily calculated:

$$F_C(x) = \log\frac{P(Y = 1 \mid x)\, C_1}{P(Y = -1 \mid x)\, C_2} = \log\frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)} - \log\frac{C_2}{C_1} = F(x) - \theta. \tag{2.33}$$

It is also easy to verify that this way of proceeding is equivalent to changing the a priori probabilities from a generative point of view:

$$F_C(x) = \log\frac{P(Y = 1 \mid x)\, C_1}{P(Y = -1 \mid x)\, C_2} = \log\frac{P(x \mid Y = 1)\, P(Y = 1)\, C_1}{P(x \mid Y = -1)\, P(Y = -1)\, C_2}, \tag{2.34}$$

since the ratio of priors P(Y = 1)/P(Y = −1) is directly corrected by C_1/C_2. Moving the cost-insensitive decision boundary is a suboptimal strategy because there are no guarantees of fitting the true cost-sensitive decision boundary, as was pointed out in [63].
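In code, the thresholded decision (2.32) with θ taken from (2.33) reduces to shifting the strong classifier's score before taking the sign (our own minimal sketch):

```python
import numpy as np

def thresholded_decision(F_scores, C1, C2):
    """Viola-Jones style cost-sensitive decision: sign(F(x) - theta), eq. (2.32),
    with theta = log(C2 / C1) as suggested by eq. (2.33)."""
    theta = np.log(C2 / C1)
    return np.sign(F_scores - theta)

scores = np.array([-0.3, 0.1, 0.8])                 # hypothetical strong-classifier outputs F(x)
print(thresholded_decision(scores, C1=5.0, C2=1.0)) # theta < 0: decisions biased toward the costly label +1
```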
2.5 Other perspectives of Boosting
So far we have only considered the extensions of AdaBoost most closely related to our work in this thesis, namely multi-class and binary cost-sensitive problems. We close the chapter with a short overview of other interesting perspectives driven by Boosting in the world of Machine Learning. The second part of the section is devoted to commenting on the presence of Boosting in some Computer Vision tasks. Here is a short list of topics where Boosting has opened new avenues of research:
• New loss functions. Some recent advances in Boosting came from developing a loss function that enforces a desired property. Such is the case of the SavageBoost [62] and TangentBoost [61] algorithms. For both methods the authors derived a loss function specially designed to reduce the effect of outliers on the classification. Other examples can be found in [10, 80, 34, 71].
• Semi-supervised learning. Problems with unlabeled instances are also present in the
scope of Boosting. Two recent works in this area are [59], where the SemiBoost algorithm
is introduced, and [13], where a margin-based cost function is regularized in order to be
optimized in a supervised way. For readers interested in this field we also recommend
[85, 53, 75].
• Entropy projection. Kivinen and Warmuth's work [46] discovered properties relating consecutive weight vectors {W_m, W_{m+1}}, understood as probability distributions. Specifically, the new distribution, W_{m+1}, is the closest to the previous one, via the relative entropy, from those belonging to the orthogonal hyperplane of weight vectors. See [84] for another perspective in which the Lagrange dual problems of some Boosting derivations are proven to be entropy maximization problems.
• Regression. Obviously, the world of regression has also received attention in the Boosting literature. We highlight GradientBoost [31], the algorithm developed by J. Friedman, as a reference in the area. Another relevant derivation is the above-mentioned GentleBoost [29, 114].
• Game theory. In [27] the authors explained the connection between Boosting and game theory. Specifically, they described AdaBoost in terms of a two-player game where one player fits a weak learner based on a given weight vector, while the second player receives the weighted error and computes the α value with which to derive the next weight vector for the first player.
• Mahalanobis distance. Informative distances, like the Mahalanobis distance, have also been used in the stage-wise optimization performed by Boosting. The most representative works in the area were introduced by C. Shen et al. [82, 83]. The idea behind their proposal is the use of "differences between Mahalanobis distances", d²_M(i, j) − d²_M(i, k), as the argument (margin) of the exponential loss function. Specifically, the objective of the minimization is a sum of as many exponential losses as constraints d_M(i, j) > d_M(i, k) are required among point triplets, (i, j, k). Following this insight they derived the MetricBoost algorithm.
• Condition of Boostability. This terminology refers to works that address the conditions under which a Boosting algorithm has guarantees of convergence. We highlight the paper of I. Mukherjee and R. Schapire [67], in which both strong and weak conditions of "boostability" are given for multi-class Boosting algorithms. These conditions are evaluated on previous relevant works. An alternative analysis of the convergence of Boosting under several loss functions is due to M. Telgarsky [92].
• Relationship to other paradigms. Since Boosting is a particular case of meta-classifier, it was a matter of time before its structure was combined with other relevant paradigms. Such is the case of DeepBoost [14], a recent work where strong learners are allowed as base classifiers without losing quality of fit. The method is open to being combined with deep decision trees, Support Vector Machines or even Neural Networks. For an interesting comparison with Bagging with respect to robustness, see S. Rosset's work [73].
Setting aside the above topics, Boosting has proven to be an excellent tool for many problems in the area of Computer Vision. Some of them require modifications of the original formulation of the algorithms to provide optimal results. That is the case of detection problems, where the strong learner computed with AdaBoost is conveniently pruned to acquire a cascade-shaped structure. In the case of face detection we must highlight the above-mentioned works of P. Viola and M. Jones [101, 100]. Boosting has also been successfully used for recognizing text [12, 22], deriving object detectors efficiently [95, 99, 52], and labelling images [43, 108, 105]. Moreover, Boosting has become a very useful strategy for feature selection in Computer Vision problems [96, 48].
Another interesting application of Boosting in Computer Vision is identifying personal characteristics from low-resolution pictures of faces. Such is the case, for instance, of gender recognition. Here we must point out S. Baluja and H. Rowley's work [5], in which AdaBoost uses simple gray-level comparisons between pairs of pixels to obtain significantly good results. See [107] for another approach, in this case based on a second-order discriminant analysis updated iteration after iteration.
Chapter 3
Partially Informative Boosting
In section 2.3.1 we discussed some multi-class Boosting algorithms based on binary weak learners, which essentially separate the set of classes into two groups. None of them is a proper extension of AdaBoost in the sense of taking advantage of the exponential loss function in a pure multi-class fashion. This is exactly the root of our theoretical improvement. Can we transfer partial responses to the multi-class field while maintaining this property? So far we have discussed the important role of the margin in binary and multi-class Boosting. Here we extend this concept to manage binary sub-problems properly and, hence, to answer the above question.
In this chapter we introduce a multi-class generalization of AdaBoost that uses ideas present in previous works. We use binary weak-learners to separate groups of classes, like [3, 78, 80], and a margin-based exponential loss function with a vectorial encoding like [51, 112, 39]. However, the final result is new. To model the uncertainty in the classification provided by each weak-learner we use different vectorial encodings for representing class labels and classifier responses. This codification yields an asymmetry in the evaluation of classifier performance that produces different margin values depending on the number of classes separated by each weak-learner. Thus, at each Boosting iteration, the sample weight distribution is updated as usual according to the performance of the weak-learner, but also depending on the number of classes in each group. In this way our Boosting approach takes into account both the uncertainty in the classification of a sample into a group of classes and the imbalance in the number of classes separated by the weak-learner [87, 38]. Specifically, we decompose the problem into binary subproblems whose goal is to separate a set of labels from the rest, and then we encode every response using a new set of margin vectors in such a way that the multi-class exponential loss function can be applied. The resulting algorithm is called PIBoost, which stands for Partially Informative Boosting, reflecting the idea that the Boosting process collects partial information about the classification provided by each weak-learner. PIBoost is well grounded theoretically and provides significantly good results. We consider it, perhaps, the only canonical extension of AdaBoost based on binary weak learners.
The chapter is organized as follows. The next section is devoted to introducing our new set of margin vectors jointly with the loss function. Section 3.2 describes PIBoost in detail. There we pay attention to Lemma 1, the main result upon which the algorithm is based. Paragraphs showing PIBoost's relationships with AdaBoost and CS-AdaBoost are also included. We also point out how PIBoost follows a common-sense pattern when making decisions. In section 3.3 we compare the main points of our algorithm with those of the ECOC-based algorithms discussed in section 2.3.1. Finally, we devote section 3.4 to experiments in which the accuracy of PIBoost is compared against other relevant multi-class Boosting methods.
3.1 Multi-class margin extension
We saw in section 2.3.2 how the use of margin vectors for encoding labels induces a natural generalization of binary classification, yielding margins from which multi-class algorithms are derived. In this section we introduce a new multi-class margin extension. Similarly to [51, 114, 112, 74, 39], we use margin vectors to represent multi-class membership, i.e. real labels. However, in our proposal, data labels and those estimated by a classifier will not be defined on the same set of vectors. Our margin vectors will produce, at each iteration of the algorithm, different margin values for each sample, depending on the number of classes separated by the weak-learner. This fact is related to the asymmetry produced in the classification when the number of classes separated by a weak learner is different on each side, and to the "difficulty" or information content of that classification.
Remember that the essence of the margin approach resides in keeping the margin negative/positive when a classifier respectively fails/succeeds. That is, if y, f(x) ∈ Y, the margin z = y^⊤ f(x) satisfies: z > 0 ⇔ y = f(x) and z < 0 ⇔ y ≠ f(x). We extend the set Y by allowing each y_l to also take a negated form, which can be interpreted as a fair vote for any label but the l-th. This vector encodes the uncertainty in the response of the classifier by evenly dividing the evidence among all labels but the l-th. It provides the smallest amount of information about the classification of an instance, i.e. a negative classification: the instance does not belong to class l but to any other. Our goal is to build a Boosting algorithm that combines both positive and negative weak responses into a strong decision.
Following this intuition we introduce new margin vectors by fixing a group of s labels, S ∈ P(L), and defining y^S in the following way:

$$\mathbf{y}^{S} = \left(y^{S}_{1}, \ldots, y^{S}_{K}\right)^{\top} \quad \text{with} \quad y^{S}_{i} := \begin{cases} \dfrac{1}{s} & \text{if } i \in S \\[2mm] \dfrac{-1}{K-s} & \text{if } i \notin S \end{cases} \tag{3.1}$$
It is straightforward that any y^S is a margin vector [51, 114]. In addition, if S^c is the complementary set of S ∈ P(L), then y^{S^c} = −y^S. Let Ŷ be the whole set of vectors obtained in this fashion. We want to use Ŷ as target set, that is, f : X → Ŷ, but under a binary perspective. The difference with respect to other approaches using a similar codification [51, 114, 112] is that the correspondence defined in (2.16) is broken. In particular, weak-learners will take values in {y^S, −y^S} rather than in the whole set Ŷ. The combination of answers obtained by the Boosting algorithm will provide complete information over Ŷ. So now the correspondence for each weak-learner is binary:

$$F^{S}(x) = \pm 1 \;\Leftrightarrow\; \mathbf{f}^{S}(x) = \pm\mathbf{y}^{S}, \tag{3.2}$$

where F^S : X → {+1, −1} is a classifier that recognizes the presence (+1) or absence (−1) of the group of labels S in the data.
We propose a multi-class margin for evaluating the answer given by f^S(x). Data labels always belong to Y but predicted ones, f^S(x), belong to Ŷ. In consequence, depending on s = |S|, we have four possible margin values

$$z = \mathbf{y}^{\top}\mathbf{f}^{S}(x) = \begin{cases} \pm\dfrac{K}{s(K-1)} & \text{if } y \in S \\[2mm] \pm\dfrac{K}{(K-s)(K-1)} & \text{if } y \notin S \end{cases} \tag{3.3}$$

where the sign is positive/negative if the partial classification is correct/incorrect. The derivations of the above expressions are in Appendix A.1.
We use the multi-class exponential loss just as it was introduced in section 2.3.2 to evaluate these margins (3.3):

$$L\left(\mathbf{y}, \mathbf{f}^{S}(x)\right) = \exp\left(\frac{-1}{K}\,\mathbf{y}^{\top}\mathbf{f}^{S}(x)\right). \tag{3.4}$$
In consequence, the above vectorial codification of labels will produce different degrees of punishment and reward depending on the number of classes separated by the weak-learner. Assume that we fix a set of classes, S, and an associated weak-learner that separates them from the rest, f^S(x). We may also assume that |S| ≤ K/2, since if |S| > K/2 we can choose S' = S^c and then |S'| ≤ K/2. The failure or success of f^S(x) in classifying an instance x with label l ∈ S will have a larger margin than when classifying an instance with label l ∈ S^c. The margins in (3.3) provide the following rewards and punishments when used in conjunction with the exponential loss (3.4):

$$L\left(\mathbf{y}, \mathbf{f}^{S}(x)\right) = \begin{cases} \exp\left(\dfrac{\mp 1}{s(K-1)}\right) & \text{if } y \in S \\[2mm] \exp\left(\dfrac{\mp 1}{(K-s)(K-1)}\right) & \text{if } y \notin S. \end{cases} \tag{3.5}$$
In dealing with the class imbalance problem, the losses produced in (3.5) reflect the fact that the importance of instances in S is higher than that of those in S^c, since S is a smaller set. Hence, the cost of misclassifying an instance in S outweighs that of misclassifying one in S^c [87]. This fact may also be intuitively interpreted in terms of the "difficulty" or amount of information provided by a classification. Classifying a sample in S provides more information, or, following the usual intuition behind Boosting, is more "difficult" than the classification of an instance in S^c, since S^c has a broader range of possible labels. The smaller the set S, the more "difficult" or informative the result of classifying an instance in it will be.
We can further illustrate this idea with an example. Assume that we work on a classification problem with K = 5 classes. We may select S_1 = {1} and S_2 = {1, 2} as two possible sets of labels to be learned by our weak-learners. Samples in S_1 should be more important than those in S_1^c or in S_2, since S_1 has the smallest class cardinality. Similarly, in general, it is easier to recognize data in S_2 than in S_1, since the latter is smaller; i.e. classifying a sample in S_1 provides more information than classifying it in S_2. Encoding labels with vectors from Y we will have the following margin values and losses:

$$z = \mathbf{y}^{\top}\mathbf{f}^{S_1}(x) = \pm 5/4 \;\Rightarrow\; L(\mathbf{y}, \mathbf{f}^{S_1}) = e^{\mp 1/4} = \{0.77, 1.28\} \quad y \in S_1$$
$$z = \mathbf{y}^{\top}\mathbf{f}^{S_1}(x) = \pm 5/16 \;\Rightarrow\; L(\mathbf{y}, \mathbf{f}^{S_1}) = e^{\mp 1/16} = \{0.93, 1.06\} \quad y \in S_1^c$$
$$z = \mathbf{y}^{\top}\mathbf{f}^{S_2}(x) = \pm 5/8 \;\Rightarrow\; L(\mathbf{y}, \mathbf{f}^{S_2}) = e^{\mp 1/8} = \{0.88, 1.13\} \quad y \in S_2$$
$$z = \mathbf{y}^{\top}\mathbf{f}^{S_2}(x) = \pm 5/12 \;\Rightarrow\; L(\mathbf{y}, \mathbf{f}^{S_2}) = e^{\mp 1/12} = \{0.92, 1.08\} \quad y \in S_2^c$$

Everything we say about instances in S_1 will be the most rewarded or penalized in the problem, since S_1 is the smallest class set. Set S_2 is the second smallest; in consequence, classification in that set will produce the second largest rewards and penalties. Similarly, we "say more" when excluding an instance from S_2 = {1, 2} than from S_1 = {1}, since S_2^c is smaller than S_1^c. In consequence, rewards and penalties for samples classified in S_2^c will be slightly larger than those in S_1^c. In Fig. 3.1 we display the loss values for the separators associated to the sets S_1 and S_2.
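The margin and loss values of this K = 5 example can be reproduced with a few lines (our own snippet; labels are 0-based, so S_1 = {0} and S_2 = {0, 1}):

```python
import numpy as np

K = 5

def y_label(l, K):
    """Code vector in Y: 1 at coordinate l, -1/(K-1) elsewhere."""
    y = np.full(K, -1.0 / (K - 1)); y[l] = 1.0
    return y

def y_set(S, K):
    """Separator code vector in Y-hat, eq. (3.1): 1/|S| inside S, -1/(K-|S|) outside."""
    y = np.full(K, -1.0 / (K - len(S))); y[list(S)] = 1.0 / len(S)
    return y

def margin_and_loss(l, S, correct):
    """Margin (3.3) and loss (3.4) when the separator of S answers correctly or not."""
    pred = y_set(S, K) if (l in S) == correct else -y_set(S, K)
    z = y_label(l, K) @ pred
    return round(z, 4), round(np.exp(-z / K), 2)

for S in ({0}, {0, 1}):                          # S1 and S2 of the example above
    for l, correct in [(0, True), (0, False), (2, True), (2, False)]:
        print(S, l, correct, margin_and_loss(l, S, correct))
# Reproduces the margins +-5/4, +-5/16, +-5/8, +-5/12 and the losses quoted above (up to rounding).
```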
Figure 3.1: Values of the exponential loss function over margins, z, for a classification problem with 4 classes. Possible margin values are obtained taking into account expression (3.5) for s = 1 and s = 2.
3.2 PIBoost
In this section we present the structure of PIBoost [24], whose pseudo-code is shown in Algorithm 10. At each Boosting iteration we fit as many weak-learners as groups of labels, G ⊂ P(L), are considered. The aim of each weak-learner is to separate its associated labels from the rest and persevere in this task iteration after iteration. That is the reason why we call them separators. A weight vector W^S is associated to the separator of set S.
For each set S ∈ G, with s = |S|, PIBoost builds a stage-wise additive model [37] of the form f_m(x) = f_{m−1}(x) + β_m g_m(x) (where the super-index S is omitted for ease of notation). In step 2 of the algorithm we estimate the constant β and the function g(x) for each set and iteration. The following Lemma solves the problem of finding these parameters.
Lemma 1. Given an additive model f_m(x) = f_{m−1}(x) + β_m g_m(x) associated to a set of labels, S ∈ G, the solution to

$$(\beta_m, \mathbf{g}_m(x)) = \arg\min_{\beta, \mathbf{g}(x)} \sum_{n=1}^{N} \exp\left(\frac{-1}{K}\,\mathbf{y}_n^{\top}\left(\mathbf{f}_{m-1}(x_n) + \beta\,\mathbf{g}(x_n)\right)\right) \tag{3.6}$$

is obtained in the following way:

• Given β > 0, the optimal weak learner is

$$\mathbf{g}_m = \arg\min_{\mathbf{g}} \; B_1 \sum_{l_n \in S} w(n)\, I\left(\mathbf{y}_n^{\top}\mathbf{g}(x_n) < 0\right) + B_2 \sum_{l_n \notin S} w(n)\, I\left(\mathbf{y}_n^{\top}\mathbf{g}(x_n) < 0\right),$$

with B_1 = exp(β/(s(K−1))) − exp(−β/(s(K−1))) and B_2 = exp(β/((K−s)(K−1))) − exp(−β/((K−s)(K−1))).

• Given a learner g(x), the optimal constant is β_m = s(K − s)(K − 1) log R,
where R is the only real positive root of the polynomial

$$P_m(x) = E1\,(K-s)\,x^{2(K-s)} + s\,E2\, x^{K} - s\,(A2 - E2)\, x^{(K-2s)} - (K-s)\,(A1 - E1) \tag{3.7}$$

and the constants involved in both expressions are defined as follows: A1 = Σ_{l_n ∈ S} w(n), A2 = Σ_{l_n ∉ S} w(n) (i.e. A1 + A2 = 1), E1 = Σ_{l_n ∈ S} w(n) I(y_n^⊤ g(x_n) < 0), E2 = Σ_{l_n ∉ S} w(n) I(y_n^⊤ g(x_n) < 0), and W_{m−1} = {w(n)} is the weight vector of iteration m−1.
The proof of this result is in Appendix A.2. As can be seen in the Lemma, the optimization of g_m(x) and β_m in (3.6) depend on each other. An efficient strategy to solve this problem is to iteratively optimize one of the variables assuming the other is known. We have considered two ways to proceed: 1) compute an initial g_m fixing an initial β_m (1, for simplicity); and 2) compute an initial g_m assuming B_1 = B_2. In both cases we have empirically confirmed that the results obtained with several iterations of this process are not significantly better than those of the first iteration. Hence, in Algorithm 10, we introduce the method assuming the second option (B_1 = B_2) and making no sub-iterations to obtain the optimal pair (g_m(x), β_m), which is the procedure selected for our experiments. With this assumption, the optimal weak learner is calculated according to

$$\mathbf{g}_m = \arg\min_{\mathbf{g}} \sum_{n=1}^{N} w(n)\, I\left(\mathbf{y}_n^{\top}\mathbf{g}(x_n) < 0\right), \tag{3.8}$$

which is an efficient and practical criterion.
Lemma 1 justifies steps 2:(a), (c)¹, (d) and (e) in our pseudo-code. In case y ∈ S, the update rule 2:(f) follows from:

$$w^{S}(n) = w^{S}(n)\exp\left(\frac{-1}{K}\,\mathbf{y}_n^{\top}\beta\,\mathbf{f}^{S}(x_n)\right) = w^{S}(n)\exp\left(\frac{-1}{K}\,\frac{\pm K}{s(K-1)}\, s(K-s)(K-1)\log R_m^{S}\right) = w^{S}(n)\exp\left(\mp(K-s)\log R_m^{S}\right) = w^{S}(n)\left(R_m^{S}\right)^{\mp(K-s)}.$$

The case y ∉ S provides an analogous expression.
Note in (3.7) that the root will be zero only if A1 = E1, which implies β = −∞. This possibility should be considered explicitly in the implementation. An interesting case with a closed-form solution for the polynomial occurs when K is an even number. Any separator of s = K/2 labels yields a simpler formula in (3.7):

$$P(x) = (E1 + E2)\, x^{K} - (A1 - E1 + A2 - E2) = \mathrm{Err}\; x^{K} - (1 - \mathrm{Err}),$$

which has the closed-form solution x = ((1 − Err)/Err)^{1/K}, which in turn provides the value

$$\beta = \frac{K(K-1)}{4}\log\frac{1-\mathrm{Err}}{\mathrm{Err}} = s\,\frac{K-1}{2}\log\frac{1-\mathrm{Err}}{\mathrm{Err}}. \tag{3.9}$$
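Since (3.7) generally has no closed-form root, an implementation can find R numerically. The following sketch (our own, using numpy's polynomial root finder; the degenerate case A1 = E1, with a root at zero and β = −∞, is not handled) also checks the even-K closed form (3.9):

```python
import numpy as np

def piboost_beta(A1, A2, E1, E2, K, s):
    """Positive real root R of the polynomial (3.7) and beta = s(K-s)(K-1) log R."""
    deg = 2 * (K - s)
    coeffs = np.zeros(deg + 1)                    # coefficients by descending degree
    coeffs[0] = E1 * (K - s)                      # x^{2(K-s)}
    coeffs[deg - K] += s * E2                     # x^{K}
    coeffs[deg - (K - 2 * s)] += -s * (A2 - E2)   # x^{K-2s}
    coeffs[deg] += -(K - s) * (A1 - E1)           # constant term
    roots = np.roots(coeffs)
    R = min(r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 1e-12)
    return s * (K - s) * (K - 1) * np.log(R)

# Sanity check against the closed form (3.9) for even K and s = K/2:
K, s, Err = 4, 2, 0.3
beta = piboost_beta(A1=0.5, A2=0.5, E1=Err / 2, E2=Err / 2, K=K, s=s)
print(np.isclose(beta, K * (K - 1) / 4 * np.log((1 - Err) / Err)))   # True
```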
¹ In the expression I(l_n ∉ G_m^S(x_n)), the set G_m^S(x) must be understood as G_m^S(x) = +1 ≡ S and G_m^S(x) = −1 ≡ S^c.
Algorithm 10 : PIBoost
1- Initialize weight vectors w^S(n) = 1/N, with n = 1, . . . , N and S ∈ G ⊂ P(L).
2- For m = 1 until the number of iterations M, and for each S ∈ G:
  (a) Fit a binary classifier G_m^S(x) over the training data with respect to its corresponding w^S.
  (b) Translate G_m^S(x) into g_m^S : X → Ŷ.
  (c) Compute the two types of errors associated with G_m^S(x):
    E1_{S,m} = Σ_{l_n ∈ S} w_n^S I(l_n ∉ G_m^S(x_n)),  E2_{S,m} = Σ_{l_n ∉ S} w_n^S I(l_n ∉ G_m^S(x_n))
  (d) Calculate R_m^S, the only positive root of the polynomial P_m^S(x) defined in (3.7).
  (e) Calculate β_m^S = s(K − s)(K − 1) log R_m^S.
  (f) Update weights as follows (sign +/− depends on the failure/hit of G_m^S):
    • If l_n ∈ S then w^S(n) = w^S(n) (R_m^S)^{±(K−s)},
    • If l_n ∉ S then w^S(n) = w^S(n) (R_m^S)^{±s}.
  (g) Re-normalize weight vectors.
3- Output Final Classifier: H(x) = arg max_k f_k(x), where f(x) = (f_1(x), . . . , f_K(x))^⊤ = Σ_{m=1}^M Σ_{S∈G} β_m^S g_m^S(x).
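For instance, steps 2.(f)–(g) of Algorithm 10 can be written compactly as follows (our own fragment; in_S marks the instances whose label belongs to S and hit marks the correct answers of G_m^S):

```python
import numpy as np

def piboost_weight_update(w, in_S, hit, R, K, s):
    """Steps 2.(f)-(g) of Algorithm 10: the exponent's sign depends on failure/hit."""
    sign = np.where(hit, -1.0, 1.0)                  # hit -> negative exponent, failure -> positive
    expo = np.where(in_S, (K - s) * sign, s * sign)  # (K-s) for labels in S, s otherwise
    w = w * R ** expo
    return w / w.sum()                               # re-normalization
```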
The shape of the final classifier is easy and intuitive to interpret. The vectorial function built during the process collects in each k-th coordinate information that can be understood as a degree of confidence for classifying sample x into class k. The classification rule assigns the label with the highest value in its coordinate. This criterion has a geometrical interpretation provided by the codification of labels as K-dimensional vectors. Since the set Ŷ contains margin vectors, the process of selecting the most probable one is carried out on the hyperplane orthogonal to 1 = (1, . . . , 1)^⊤ (see Fig. 3.2). So, we build our decision on a subspace of R^K free of total indifference about labels. This means that the final vector f(x) built during the process will usually present a dominant coordinate that represents the selected label. Ties between labels will only appear in degenerate cases. The plot on the right of Fig. 3.2 shows the set of pairs of vectors Ŷ defined by our extension, whereas the plot on the left shows the set of vectors Y used in [51, 112]. Although the spanned gray hyperplane is the same, we exploit every binary answer in such a way that the negation of a class is directly translated into a new vector that provides positive evidence for the complementary set of classes in the final composition, f(x). The inner product of class labels y ∈ Y and classifier predictions f(x) ∈ Ŷ, y^⊤ f(x), produces a set of asymmetric margin values such that, as described in section 3.1, not all successes and failures have the same importance. Problems with four or more classes are more difficult to show graphically but allow richer sets of margin vectors.
The second key idea in PIBoost is that we can build a better classifier when collecting information from positive and negative classifications in Ŷ than when using only the positive classifications in the set Y. Each weak-learner, or separator, g^S, acts as a partial expert of the problem that provides us with a clue about the label of x. Note here that when a weak-learner classifies x as belonging to a set of classes, the value of its associated step β, which depends on the success rate of the weak-learner, is evenly distributed among the classes in the set. In the same way, the bet is used to evenly reduce the confidence on the coordinates corresponding to the non-selected classes. This balance inside the selected and discarded classes is reflected in a margin value with a sensible multi-class interpretation. In other words, every answer obtained by a separator is directly translated into multi-class information in a fair way.

Figure 3.2: Margin vectors for a problem with three classes. The left plot presents the set of vectors Y. The right plot presents the set Ŷ.
3.2.1 AdaBoost as a special case of PIBoost
At this point we can verify that PIBoost applied to a two-class problem is equivalent to AdaBoost. In this case we only need to fit one classifier at each iteration². Thus there will be only one weight vector to be updated and only one group of β constants.
It is also easy to match the expression of the parameter β computed in PIBoost with the value of α computed in AdaBoost just by realizing that, for a fixed iteration whose index we omit, the polynomial in step 2-(d) is

$$P(x) = (E1 + E2)\, x^{2} - (A1 - E1 + A2 - E2) = \mathrm{Err}\cdot x^{2} - (1 - \mathrm{Err}).$$

Solving this expression we get R = ((1 − Err)/Err)^{1/2}, thus β = ½ log((1 − Err)/Err), which indeed is the value of α in AdaBoost. It could also be verified by substituting K = 2 and s = 1 in expression (3.9).
Finally, it is straightforward that the final decisions are equivalent. If we transform AdaBoost's labels, L = {+1, −1}, into PIBoost's, L' = {1, 2}, the classification rule H(x) = sign( Σ_{m=1}^M α_m h_m(x) ) turns into Ĥ(x) = arg max_k f_k(x), where

$$\mathbf{f}(x) = (f_1(x), f_2(x)) = \sum_{m=1}^{M} \beta_m\, \mathbf{g}_m(x).$$

² Separating the first class from the second is equivalent to separating the second from the first and, of course, there are no more possibilities.
3.2.2 Asymmetric treatment of partial information
Our codification of class labels and classifier responses produces different margin values. This
asymmetry in evaluating successes and failures in the classification may also be interpreted as
a form of asymmetric Boosting. As such it is directly related to the Cost-Sensitive AdaBoost
in [63].
Using the cost matrix defined in Table 3.1, we can relate the PIBoost algorithm to the Cost-Sensitive AdaBoost [63]. If we denote b = E1^S, d = E2^S, T_+ = A1, T_− = A2, then the polynomial (3.7), P^S(x), solved at each PIBoost iteration to compute the optimal step, β_m, along the direction of largest descent, g_m(x), is equivalent to the following cosh-dependent expression used in Cost-Sensitive AdaBoost to estimate the same parameter [63]:

$$2 C_1 b \cosh(C_1 \alpha) + 2 C_2 d \cosh(C_2 \alpha) = C_1 T_+ e^{-C_1 \alpha} + C_2 T_- e^{-C_2 \alpha}, \tag{3.10}$$

where the costs {C_1, C_2} are the non-zero values in Table 3.1. In consequence, PIBoost is a Boosting algorithm that combines a set of cost-sensitive binary weak-learners whose costs depend on the number of classes separated by each weak-learner.

Real \ Pred.      S                      S^c
S                 0                      1/(s(K−1))
S^c               1/((K−1)(K−s))         0

Table 3.1: Cost matrix associated to a PIBoost separator of a set S with s = |S| classes.
See the work of I. Landesa-Vazquez and J. L. Alba-Castro [50] for a better understanding of equation (3.10). In their proposal a double-base³ analysis of the asymmetries is discussed. Moreover, they also resort to a polynomial with the same shape as (3.7) to find the optimal constant β added at each iteration. We find quite interesting the way they decompose the equation in order to speed up the solution of the polynomial in their resulting algorithm, AdaBoostDB.

³ Such a double base comes from the two arguments of cosh appearing in (3.10).
3.2.3 Common sense pattern
We must emphasize that PIBoost’s structure links with a pattern of common sense. In fact we
apply this philosophy in our everyday life when we try to guess something discarding possibilities.
Let us illustrate this with an example. Assume that a boy knows that his favourite pen
has been stolen in his classroom. Even though he probably thinks of a suspect he also has the
chance to ask each classmate what he knows about the issue. Perhaps doing so he will collect a
pool of useful answers of the kind: “I think it was Jimmy”, “I am sure it was not a girl”, “I just
know that it was neither me not Victoria”, “I would suspect of Martin and his group of friends”,
etc. It is clear that none of these answers drives the boy to a final conclusion (at least, they
should not) but they form a set of clues quite useful.
Supposing that no more information is available it seems that an immediate strategy could
be to sum up those answers weighted by the degree of confidence associated to those questioned.
3
Such doble base comes from the two arguments of cosh appearing in (3.10).
3.3. RELATED WORK
33
Thus combining all that information our protagonist should have one suspect at the end of the
working day. It’s easy to find similarities between such a situation and the structure of PIBoost:
the answer of each friend can be seen as a weak-learner GSm , the level of credibility (or trust)
associated to each is our βm , while the iteration value m can be thought as a measure of time in
the relationship with the classmates.
3.3 Related work
Here we discuss some relationships between PIBoost and the ECOC-based algorithms described in section 2.3.1. Table 3.3 summarizes the main properties of the four algorithms, extending the comparison presented in that section.
Firstly, the loss function applied for updating weights in AdaBoost.OC relies on the exponential loss with a pseudo-loss as argument, while AdaBoost.MO and AdaBoost.ECC use an exponential loss function with binary arguments. In section 3.1 we highlighted the importance of using a pure multi-class loss function to achieve different margin values, hence penalizing binary failures in a genuinely multi-class context. With our particular treatment of binary sub-problems we extend AdaBoost in a more natural way, because PIBoost can be seen as a group of several binary AdaBoost classifiers tied together via the multi-class exponential loss function, where every partial answer is well suited to the original multi-class problem. It is not necessary to manage all instance weights linked as one when a binary loss or a pseudo-loss (which accounts for "failures" due to classes over-selected by the coloring function μ_m) will be used later.
Besides, the resulting structure of PIBoost is similar to the {±1}-matrix of ECOC algorithms, except for the presence of fractions. At each iteration of PIBoost there is a block of |G| response vectors that, grouped as columns, form a K × |G| matrix similar to the |G| weak learners of ECOC-based algorithms. Table 3.2 shows the case of a problem with four labels when G consists of all single labels together with all pairs of labels. However, in our approach, fractions let us distribute the evidence evenly among the classes in a set, whereas in the ECOC philosophy every binary sub-problem has the same importance in the final count.
           Label 1   Label 2   Label 3   Label 4   Label 1-2   Label 1-3   Label 1-4
Class 1      1        -1/3      -1/3      -1/3       1/2         1/2         1/2
Class 2     -1/3       1        -1/3      -1/3       1/2        -1/2        -1/2
Class 3     -1/3      -1/3       1        -1/3      -1/2         1/2        -1/2
Class 4     -1/3      -1/3      -1/3       1        -1/2        -1/2         1/2

Table 3.2: An example of encoding matrix for PIBoost's weak learners when K = 4 and G = {all single labels} ∪ {all pairs of labels}.
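The columns of Table 3.2 can be generated mechanically from (3.1); the snippet below (ours) builds the K × |G| matrix for K = 4 using the singles and the three pairs shown (the remaining pairs are, up to sign, complements of these and add no new column):

```python
import numpy as np

K = 4

def y_set(S, K):
    """Separator code (3.1): 1/|S| on the classes in S, -1/(K-|S|) on the rest."""
    y = np.full(K, -1.0 / (K - len(S)))
    y[list(S)] = 1.0 / len(S)
    return y

G = [(k,) for k in range(K)] + [(0, j) for j in range(1, K)]   # singles plus pairs {1,2},{1,3},{1,4}
M = np.column_stack([y_set(S, K) for S in G])
print(np.round(M, 3))        # reproduces the entries 1, -1/3, 1/2, -1/2 of Table 3.2
```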
With respect to margin values, E. L. Allwein, R. Schapire and Y. Singer [3] discussed the strong connections between ECOC-based algorithms and margin theory. The framework developed by the authors provides a new perception of multi-class algorithms based on binary subproblems, unifying the most relevant ones by using matrix encodings with values in {+1, 0, −1}. A deeper analysis of this fact for AdaBoost.MO, ECC and OC can be found in [91]. Although this development is broad enough to cover the most popular multi-class Boosting methods, we find no reason for using binary margin values, z(k, r) = l(k, r) f_r(x), to measure the quality of each bit f_r(x) in a predicted codeword f(x) = (f_1(x), . . . , f_R(x)), where l(k, r) is the real value associated to label k for the r-th binary subclassifier. The same binary loss function can be applied to the resulting value. Our conception of margin values for multi-class problems based on vectorial encodings is richer and provides a broader range of values, as was shown in section 3.1.
So far we have highlighted three essential points, namely: the use of different loss functions, the fact of handling the evidence for or against a set of labels evenly, and our conception of margin; but there are other features distinguishing PIBoost from ECOC-based Boosting algorithms. We describe them briefly. Since we develop independent blocks of weak learners, we endow each sub-model with its own weight vector, W^S, instead of using a whole weight matrix W ∈ R^{N×R} satisfying Σ_{n=1}^N Σ_{r=1}^R w(n, r) = 1 (with R = K for the .OC and .ECC variants). Our way of proceeding lets each separator persevere independently, thus providing responses uncorrelated with those of the rest of the separators. Another important aspect is the shape of the final decision rule. Both AdaBoost.OC and AdaBoost.ECC build a K-dimensional function f(x) = (f_1(x), . . . , f_K(x))^⊤ whose maximum coordinate is selected as the response. We argued that this particularity is shared by our algorithm, but we gave a geometric meaning to this way of summarizing information. In fact, it is easy to prove that our rule is equivalent to choosing the vector in Y that provides maximum margin, i.e. arg max_k y_k^⊤ f(x). With regard to the number of weak learners computed at each iteration, we believe that by setting blocks of separators at each iteration there is no need to compute coloring functions, which build a single separator based on random patterns [80] and, consequently, do not ensure a uniform label covering⁴. In the same way, we find the use of "code-words" for denoting labels interesting, but we think that, once the final hypothesis is committed, it is more intuitive to select the coordinate with maximum value than to resort to a metric measuring the closeness between labels and the obtained final row. Jointly with this last observation, remark that PIBoost, unlike ECOC-based algorithms, computes one value β_m^S per set, S, and iteration, m. In other words, |G| constants are calculated at each iteration.

⁴ We assume that, when working with PIBoost, the user will select a group G that contains, at least, all the single labels {{k} | k ∈ L}.
Issue                          AdaBoost.MO           AdaBoost.OC                      AdaBoost.ECC                      PIBoost
Weights for instances          W ∈ R^{N×R}           W ∈ R^{N×K} and D ∈ R^N          W ∈ R^{N×K} and D ∈ R^N           A vector W^S per set S ∈ G
                               for training          for training                     for training
Constants α_m per iteration    One α_m               One α_m                          One g_m(x)                        One β_m^S per set S ∈ G
Loss function                  Binary Exponential    Pseudo-loss Exponential          Binary Exponential                Multi-class Exponential
Final Classifier               Expression (2.15)     arg max_{l∈L} Σ_m α_m Ī_m(x)     arg max_{l∈L} Σ_m g_m(x) μ_m(l)   Max. coordinate of Σ_{m, S∈G} β_m^S g_m^S(x)
W. Learners per iteration      1                     1                                1                                 |G|

Table 3.3: Comparison of the main properties of ECOC-based algorithms and PIBoost. μ_m(l) denotes the coloring function μ_m : L → {±1} at iteration m. R denotes the length of the "code-words". In AdaBoost.OC, Ī_m(x) indicates I(h_m(x) = μ_m(l)).
At this point one may suspect that, independently of the selected group G, PIBoost requires too much computational load at each iteration. This is partially true because, as was said above, separators receive a particular treatment similar to several binary AdaBoost classifiers linked via the multi-class exponential loss function. Does such a scheme pay its way when efficiency is required? We postpone our answer to the next section, where the quality of PIBoost is revealed even for few iterations.
We must emphasize that PIBoost is not the only multi-class Boosting algorithm using label separators asymmetrically. As we said in section 2.3.1, AdaBoost.ECC [35] presents different voting weights g_m(x) (= α_m or −β_m) depending on the class assigned to (x, l) by the weak learner h_m(x), see (2.14). That is, for a fixed iteration m, the algorithm groups the instances into two sets: G_+ = {(x, l) | h_m(x) = +1} and G_− = {(x, l) | h_m(x) = −1}. Then the two possible voting constants are independently computed using the same weight vector for instances, D_m. Namely:

$$\alpha_m = \frac{1}{2}\ln\left(\frac{\sum_{\{\text{Hits in } G_+\}} D_m(n)}{\sum_{\{\text{Failures in } G_+\}} D_m(n)}\right); \qquad -\beta_m = \frac{-1}{2}\ln\left(\frac{\sum_{\{\text{Hits in } G_-\}} D_m(n)}{\sum_{\{\text{Failures in } G_-\}} D_m(n)}\right).$$

Using these values, the global weights for instances, w_{m+1}(n, l), are updated via the binary exponential loss function. We provide a similar handling to each separator, but the asymmetries between the numbers of labels to be separated are considered inherently in our margin vectors. For PIBoost, after training a new weak learner, we take into account two types of errors (and two constants) that will yield just one value β_m^S once its associated polynomial P^S(x) is solved.
To complete this section we find it convenient to briefly discuss the important work of A. Torralba, K. P. Murphy and W. T. Freeman [95] for multi-label problems, where the JointBoost algorithm is proposed, and its similarity to PIBoost's structure. Their algorithm takes advantage of the information obtained from binary subproblems focused on separating groups of labels, which links to our point of view of the multi-class problem. JointBoost is designed to detect different possible objects in images by sharing features. This implies having a set of K labels representing the objects plus an extra label denoting the background. In consequence the problem is composed of K + 1 labels that are not equally important. A set of K confidence-rated predictors is fitted by adding separators in the following way: the optimal weak learners for separating groups of labels are computed, then the best of them (after evaluating a variant of weighted error) is added to those predictors whose labels are separated in the chosen group of labels. For a 3-object problem, the fitted models have the shape:

H1(x) = G1,2,3(x) + G1,2(x) + G1,3(x) + G1(x)
H2(x) = G1,2,3(x) + G1,2(x) + G2,3(x) + G2(x)        (3.11)
H3(x) = G1,2,3(x) + G1,3(x) + G2,3(x) + G3(x)

This perspective clearly differs from PIBoost's. In addition, the way JointBoost computes weak learners and its use of weighted squared errors to compare separators also differentiate this algorithm from PIBoost. It is worth mentioning that tackling multi-label problems in this binary fashion is a well-known strategy for developing algorithms in the area, see [4, 43, 108, 105, 12, 22].
3.4 Experiments

Our goal in this section is to evaluate and compare the performance of PIBoost. We have selected fourteen data sets from the UCI repository: CarEvaluation, Chess, CNAE9, Isolet, Multifeatures, Nursery, OptDigits, PageBlocks, PenDigits, SatImage, Segmentation, Vehicle, Vowel and WaveForm. They have different numbers of input variables (6 to 856), classes (3 to 26) and instances (846 to 28056), and represent a wide spectrum of problem types. Although some data sets have separate training and test sets, we use both of them together, so the performance of each algorithm can be evaluated using cross-validation. Table 3.4 shows a summary of the main features of the data bases5. For comparison purposes we select three well-known multi-class Boosting algorithms.
Data set      | Variables | Labels | Instances
CarEvaluation | 6         | 4      | 1728
Chess         | 6         | 18     | 28056
CNAE9         | 856       | 9      | 1080
Isolet        | 617       | 26     | 7797
Multifeatures | 649       | 10     | 2000
Nursery       | 8         | 5      | 12960
OptDigits     | 64        | 10     | 5620
PageBlocks    | 10        | 5      | 5473
PenDigits     | 16        | 10     | 10992
SatImage      | 36        | 7      | 6435
Segmentation  | 19        | 7      | 2310
Vehicle       | 18        | 4      | 846
Vowel         | 10        | 11     | 990
Waveform      | 21        | 3      | 5000

Table 3.4: Summary of the selected UCI data sets.
AdaBoost.MH [80] is perhaps the most prominent example of a multi-class classifier with binary weak-learners. Similarly, SAMME [112] is, from our perspective, the best known canonical multi-class algorithm with multi-class weak-learners. Finally, Multi-category GentleBoost [114] is probably the most accurate method that treats labels separately at each iteration. We display their respective pseudo-codes in Algorithms 2, 5 and 6.

Selecting a weak-learner that provides a fair comparison among different Boosting algorithms is important at this point. SAMME requires multi-class weak-learners while, on the other hand, AdaBoost.MH and PIBoost can use even simple stump-like classifiers. Besides, Multi-category GentleBoost requires regression over continuous variables to compute its weak-learners. We select classification trees as weak-learners for the first three algorithms and regression trees for the last one.

For classification trees the following growing schedule is adopted. Each tree grows by splitting impure nodes that contain more than N̄/K instances (where N̄ is the number of samples selected for fitting the tree), so this value is a lower bound for splitting. We find good results for the sample-size parameter when N̄ < 0.4·N, where N is the training data size. In particular, we fix N̄ = 0.1·N for all data sets. In the case of regression trees the growing pattern is similar, but the bound of N̄/K instances for splitting produces poor results; here more complex trees achieve better performance. In particular, with a minimum bound for splitting of N̄/2K instances we obtain excellent error rates. A pruning process is also carried out for both types of trees.

5 The original Nursery data set has 2 instances labeled with the second class. We omit them in order to apply cross-validation properly.
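For illustration only, the following scikit-learn sketch fits one classification-tree weak learner following the growing schedule just described (the function name and the random-subsample mechanism are ours; the thesis experiments use their own tree implementation and pruning):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_weak_learner(X, y, K, sample_ratio=0.1, seed=0):
    # Fit one tree on a random subsample of size N_bar = sample_ratio * N,
    # splitting only nodes with more than N_bar / K instances.
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    n_bar = int(sample_ratio * N)
    idx = rng.choice(N, size=n_bar, replace=False)
    min_split = max(2, int(n_bar / K))
    tree = DecisionTreeClassifier(criterion="gini", min_samples_split=min_split)
    return tree.fit(X[idx], y[idx])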
We experiment with two variants of PIBoost. The first one takes all single labels, G = {{k} | k ∈ L}, as the group of sets to separate, while the second one, more complex, takes all single labels plus all pairs of labels, G′ = G ∪ {{k, l} | k ≠ l; k, l ∈ L}. We must emphasize the importance of selecting a good group of separators to achieve the best performance. Depending on the number of classes, selecting an appropriate set G is a problem in itself. Knowledge of the dependencies among label sets will certainly help in designing a good set of separators. We leave this issue as future work.

For the experiments we fix a number of iterations that depends on the algorithm and the number of labels of each data set. Since the five algorithms considered in this section fit a different number of weak-learners at each iteration, we select the number of iterations for each one in such a way that all experiments have the same total number of weak-learners (see Table 3.5). Remember that, when a data set has K labels, PIBoost(2) fits K(K−1)/2 + K separators per iteration (fewer when complementary separators coincide, as the bracketed counts in Table 3.5 show), while PIBoost(1) and GentleBoost fit only K. Besides, SAMME and AdaBoost.MH fit one weak-learner per iteration. In Fig. 3.3 we plot the performance of all five algorithms. The splitting criterion for classification trees is the Gini index.
Data set (classes) | GentleBoost | AdaBoost.MH | SAMME | PIBoost(1) | PIBoost(2) | #WL
CarEvaluation (4)  | 70  | 280  | 280  | 70  | 40 [7]   | 280
Chess (18)         | 95  | 1710 | 1710 | 95  | 10 [171] | 1710
CNAE9 (9)          | 100 | 900  | 900  | 100 | 20 [45]  | 900
Isolet (26)        | 135 | 3510 | 3510 | 135 | 10 [351] | 3510
Multifeatures (10) | 110 | 1100 | 1100 | 110 | 20 [55]  | 1100
Nursery (5)        | 120 | 600  | 600  | 120 | 40 [15]  | 600
OptDigits (10)     | 110 | 1100 | 1100 | 110 | 20 [55]  | 1100
PageBlocks (5)     | 120 | 600  | 600  | 120 | 40 [15]  | 600
PenDigits (10)     | 110 | 1100 | 1100 | 110 | 20 [55]  | 1100
SatImage (7)       | 80  | 560  | 560  | 80  | 20 [28]  | 560
Segmentation (7)   | 80  | 560  | 560  | 80  | 20 [28]  | 560
Vehicle (4)        | 70  | 280  | 280  | 70  | 40 [7]   | 280
Vowel (11)         | 120 | 1320 | 1320 | 120 | 20 [66]  | 1320
Waveform (3)       | 40  | 120  | 120  | 40  | 40 [3]   | 120

Table 3.5: Number of iterations considered for each Boosting algorithm. The first column displays the data base name with the number of classes in parentheses. Columns two to six display the number of iterations of each algorithm. For PIBoost(2) the number of separators per iteration appears inside brackets. The last column displays the number of weak-learners used for each data base.
The performance of a classifier corresponds to that achieved at the last iteration, combining all learned weak classifiers. We evaluate the performance of the algorithms using 5-fold cross-validation. Table 3.6 shows these values and their standard deviations. As can be seen, PIBoost (with its two variants) outperforms the rest of the methods on many data sets. Once the algorithms are ranked by accuracy, we use the Friedman test to assess whether the performance differences are statistically significant [15]. As expected, the null hypothesis (all algorithms have the same quality) is rejected with a p-value < 0.01. Hence we carry out a post-hoc analysis. We use the Nemenyi test to group the algorithms that present no significant differences [15]. Figure 3.4 shows the result of the test for both the α = 0.05 and α = 0.1 significance levels. Summarizing, PIBoost(1) can be considered as good as PIBoost(2) and also as good as the rest of the algorithms, but PIBoost(2) is significantly better than the latter. In addition, we use the Wilcoxon matched-pairs signed-ranks test to assess the statistical significance of the performance in comparisons between pairs of algorithms [15].
[Figure 3.3 contains one error-rate panel per data set (Car Evaluation, Chess, CNAE9, Isolet, MultiFeature, Nursery, OptDigits, PageBlocks, PenDigits, SatImage, Segmentation, Vehicle, Vowel and WaveForm), with curves for GentleBoost, AdaBoost.MH, SAMME, PIBoost(1) and PIBoost(2).]

Figure 3.3: Plots comparing the performances of the Boosting algorithms. The vertical axis displays the error rate. The horizontal axis displays the number of weak-learners fitted by each algorithm.
Table 3.7 presents the p-values obtained after comparing PIBoost(1) and PIBoost(2) with the other algorithms. Again, it is clear that PIBoost(2) is significantly better than the rest.
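For reference, both statistical tests are available in SciPy. The sketch below (with illustrative error-rate arrays, one entry per data set, not the exact values of Table 3.6) shows how the Friedman omnibus test and the pairwise Wilcoxon matched-pairs signed-ranks test are applied:

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# One error rate per data set for each algorithm (illustrative values only).
errors = {
    "GentleBoost": np.array([0.085, 0.514, 0.087, 0.151, 0.046]),
    "AdaBoost.MH": np.array([0.071, 0.424, 0.103, 0.543, 0.367]),
    "SAMME":       np.array([0.049, 0.558, 0.111, 0.081, 0.014]),
    "PIBoost(1)":  np.array([0.033, 0.526, 0.147, 0.121, 0.034]),
    "PIBoost(2)":  np.array([0.038, 0.519, 0.082, 0.056, 0.015]),
}

# Friedman test: do all algorithms perform equivalently across data sets?
_, p_friedman = friedmanchisquare(*errors.values())
print("Friedman p-value:", p_friedman)

# Pairwise post-hoc comparison of PIBoost(2) against every other method.
for name, err in errors.items():
    if name != "PIBoost(2)":
        _, p = wilcoxon(errors["PIBoost(2)"], err)
        print("PIBoost(2) vs", name, "p =", p)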
Additionally, we have performed one more experiment, with the Amazon data base, to assess the performance of PIBoost in a problem with a very high-dimensional feature space and a large number of labels. This data base also belongs to the UCI repository. It has 1500 sample instances with 10000 features grouped into 50 classes. With this data base we follow the same experimental design as with the other data bases, but only use the PIBoost(1) algorithm. In Figure 3.5 we plot the evolution of the performance of each algorithm as the number of weak-learners increases.
Figure 3.4: Diagram of the Nemenyi test. The average rank of each method is marked on the segment. We show the critical differences for both the α = 0.05 and α = 0.1 significance levels at the top. Algorithms with no significantly different performance are grouped with a thick blue line.
At the last iteration, PIBoost(1) has an error rate of 0.4213 with a standard deviation of ±374 × 10^{-4}, whereas Multi-category GentleBoost has 0.5107 (±337 × 10^{-4}), SAMME 0.6267 (±215 × 10^{-4}) and, finally, AdaBoost.MH 0.7908 (±118 × 10^{-4}).
Discussion
The experimental results confirm our initial intuition that, by increasing the range of margin values and considering the asymmetries in the class distribution generated by the weak-learners, we can significantly improve the performance of Boosting algorithms. This is particularly evident in problems with a large number of classes and few training instances. In the same way, the performance gain is evident in high-dimensional spaces, see the results for CNAE9, Isolet and, of course, Amazon. Moreover, as can be observed in Table 3.5 and Table 3.6, our second variant of PIBoost produces good results even when only a few iterations are computed, see Isolet, OptDigits, SatImage, Segmentation or Vowel.
Data set      | GentleBoost   | AdaBoost.MH   | SAMME         | PIBoost(1)     | PIBoost(2)
CarEvaluation | 0.0852 (±121) | 0.0713 (±168) | 0.0487 (±111) | 0.0325 (±74)*  | 0.0377 (±59)
Chess         | 0.5136 (±61)  | 0.4240 (±34)* | 0.5576 (±63)  | 0.5260 (±118)  | 0.5187 (±74)
CNAE9         | 0.0870 (±239) | 0.1028 (±184) | 0.1111 (±77)  | 0.1472 (±193)  | 0.0824 (±171)*
Isolet        | 0.1507 (±94)  | 0.5433 (±179) | 0.0812 (±185) | 0.1211 (±253)  | 0.0559 (±55)*
Multifeatures | 0.0460 (±128) | 0.3670 (±822) | 0.0135 (±44)* | 0.0340 (±96)   | 0.0145 (±82)
Nursery       | 0.1216 (±60)  | 0.0203 (±32)  | 0.0482 (±58)  | 0.0192 (±29)*  | 0.0313 (±62)
OptDigits     | 0.0756 (±74)  | 0.0432 (±59)  | 0.0365 (±55)  | 0.0400 (±13)   | 0.0240 (±41)*
PageBlocks    | 0.0291 (±52)  | 0.0276 (±46)* | 0.0386 (±87)  | 0.0364 (±47)   | 0.0302 (±50)
PenDigits     | 0.0221 (±11)  | 0.0113 (±29)* | 0.0484 (±62)  | 0.0358 (±40)   | 0.0192 (±25)
SatImage      | 0.1294 (±32)  | 0.1318 (±51)  | 0.3691 (±120) | 0.1113 (±62)   | 0.0949 (±53)*
Segmentation  | 0.0494 (±64)  | 0.0407 (±88)  | 0.0238 (±55)  | 0.0208 (±52)   | 0.0177 (±61)*
Vehicle       | 0.2710 (±403) | 0.3976 (±297) | 0.2320 (±221)*| 0.2509 (±305)  | 0.2355 (±258)
Vowel         | 0.2818 (±322) | 0.3525 (±324) | 0.0667 (±114) | 0.0646 (±183)  | 0.0606 (±160)*
Waveform      | 0.1618 (±75)  | 0.1810 (±72)  | 0.1710 (±109) | 0.1532 (±44)*  | 0.1532 (±44)*

Table 3.6: Error rates of the GentleBoost, AdaBoost.MH, SAMME, PIBoost(1) and PIBoost(2) algorithms for each data set in Table 3.4. Standard deviations appear inside parentheses on a 10^{-4} scale. The best result for each data base is marked with an asterisk.
           | GentleBoost | AdaBoost.MH | SAMME  | PIBoost(1)
PIBoost(2) | 0.0012      | 0.0203      | 0.0006 | 0.0081
PIBoost(1) | 0.0580      | 0.1353      | 0.7148 | —

Table 3.7: p-values of the Wilcoxon matched-pairs signed-ranks test comparing PIBoost(2) and PIBoost(1) with the remaining algorithms.
[Figure 3.5 shows the error-rate curves of GentleBoost, AdaBoost.MH, SAMME and PIBoost(1) on the Amazon data base for up to 2000 weak-learners.]

Figure 3.5: Plot comparing the performances of the Boosting algorithms on the Amazon data base. The vertical axis displays the error rate. The horizontal axis displays the number of weak-learners fitted by each algorithm.
Chapter 4
Multi-class Cost-sensitive Boosting
In this chapter we take a second step in our study of multi-class classification problems by adding a cost matrix. Assigning values that penalize different types of errors is useful to address several relevant problems. One of them is the need to skew the decision boundaries to reduce the generalization error. It frequently arises in unbalanced problems, in which the majority classes tend to be favoured by the regular classification rules. The addition of costs is useful to increase the importance of minority classes and hence to correct the decision boundaries. Another important type of problem is ordinal regression. This is a classification problem where the labels are obtained as the discretization of a continuous variable and preserving their order is essential. A proper cost matrix may be useful to avoid "distant" failures, which obviously must be penalized more than "closer" ones. Finally, costs can be viewed by themselves as the objective of the classification. This is particularly evident in problems where each decision involves a cost and minimizing it becomes the goal of the algorithm. Such is the case of insurance, banking, or diagnosis applications.

In this chapter we study the addition of a cost matrix to the Boosting framework through the well-known exponential loss function. To this aim, some algorithms have been proposed, but none of them may be considered a canonical extension of AdaBoost to the multi-class cost-sensitive field using multi-class weak learners. We will discuss them in the following section. This topic has been widely studied for binary problems, as was summarized in section 2.4. Now we present an extension to the multi-class field based on a new concept of margin.

The remainder of the chapter is organized as follows. Section 4.1 describes previous theories on multi-class cost-sensitive Boosting. Here we define our new concept of cost-sensitive margin. In section 4.2 we show in detail the structure of our algorithm. This section also presents the main result that supports BAdaCost, together with some corollaries describing direct generalizations. Finally, in Section 4.3 we show experiments confirming the efficiency of our algorithm.
4.1 Cost-sensitive multi-class Boosting
Let us assume the misclassification costs for our multi-class problem are encoded using a (K×K)-matrix C, where each entry C(i, j) ≥ 0 measures the cost of misclassifying an instance with real label i when the prediction is j,

        | C(1, 1)  C(1, 2)  . . .  C(1, K) |
        | C(2, 1)  C(2, 2)  . . .  C(2, K) |
    C = |   ...      ...    . . .    ...   |
        | C(K, 1)  C(K, 2)  . . .  C(K, K) | .
We expect this matrix to have costs for correct assignments lower than those of any wrong classification, i.e. C(i, i) < C(i, j), ∀i ≠ j. More generally, multi-class problems may be affected by costs in an instance-dependent way. In these situations an exclusive row of costs, C_n ∈ R^K, is associated with each instance (x_n, l_n), n = 1, . . . , N [58]. Obviously, using a cost matrix is a special case, since all samples with the same label would share the same penalizing row. We will not address this kind of problem in our study due to the lack of real applications.
Let us introduce some intuitive notations for the remainder of the chapter. Hereafter,
M(j, −) and M(−, j) will be used for referring to the j-th row and column vector of a matrix M. Again I(·) will denote the indicator function (1 when argument is true, 0 when false).
For cost-sensitive problems the regular Bayes Decision Rule is not suitable, since the label with maximum a posteriori probability may have a high cost. Rather, a cost-dependent criterion is applied. If P(x) = (P(1|x), . . . , P(K|x))^⊤ is the vector of a posteriori probabilities for a given x ∈ X, then the Cost-sensitive Bayes Decision Rule is

F(x) = arg min_{j∈L} P(x)^⊤ C(−, j) ,      (4.1)

which is but the minimizer of the risk function R(P(x), C(−, j)) := P(x)^⊤ C(−, j) with regard to j ∈ L.
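Rule (4.1) is a one-line computation. In the NumPy sketch below (with an invented 3-class cost matrix and posterior vector) the MAP label and the cost-sensitive Bayes label differ:

import numpy as np

C = np.array([[0.0, 1.0, 1.0],
              [5.0, 0.0, 5.0],
              [1.0, 1.0, 0.0]])        # illustrative cost matrix: mistakes on the second class are expensive
p = np.array([0.45, 0.35, 0.20])       # posterior probabilities P(x) for a given x

risk = p @ C                           # risk of predicting each label j: P(x)^T C(-, j)
print(np.argmax(p), np.argmin(risk))   # MAP label (index 0) vs cost-sensitive Bayes label (index 1)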
When dealing with multi-class cost-sensitive problems one has to understand how the addition of a cost matrix influences the decision boundaries. For this purpose we recommend the work of O'Brien et al. [69], which provides a concise glossary of linear-algebra operations on a cost matrix and their respective effects on the decision boundaries. Let

Σ_{h=1}^{K} C(h, i) P(h|x) = Σ_{h=1}^{K} C(h, j) P(h|x)      (4.2)
be the decision boundary between classes i and j, with i ≠ j. Here we describe the properties we will use:

1. The decision boundaries are not affected when C is replaced by αC, for any α > 0.
2. Adding a constant to all the costs of a particular true class does not affect the final decision. In other words, adding a positive value to a row C(i, −) leaves the result unaffected.
3. As a consequence of the previous property, any cost matrix C can be replaced by an equivalent Ĉ with Ĉ(i, i) = 0, ∀i.

Proving each of them is immediate, just by plugging each variant into expression (4.2). Taking into account the last property, we will assume without loss of generality that C(i, i) = 0, ∀i ∈ L, i.e. the cost of correct classifications is null. We will call a 0|1-matrix one with zeros on its diagonal and ones elsewhere (in other words, a matrix representing a pure multi-class problem).
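These properties are easy to verify numerically. A short sketch (arbitrary random cost matrix and posterior vector) checks that rescaling C and adding a constant to one of its rows leave the decision of rule (4.1) unchanged:

import numpy as np

rng = np.random.default_rng(1)
K = 4
C = rng.uniform(1.0, 5.0, (K, K))
np.fill_diagonal(C, 0.0)                         # property 3: use an equivalent matrix with null diagonal
p = rng.dirichlet(np.ones(K))                    # a posterior vector P(x)

decision = lambda M: np.argmin(p @ M)            # cost-sensitive Bayes rule (4.1)
C_scaled = 3.0 * C                               # property 1: positive rescaling
C_shifted = C.copy(); C_shifted[0, :] += 2.0     # property 2: add a constant to row C(1, -)

print(decision(C) == decision(C_scaled) == decision(C_shifted))   # True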
Let us now focus our attention on the meaning of a cost matrix depending on its symmetry. We may have:

• Symmetric matrix. Since symmetric values are equal, no additional information is provided by mistaking in one direction or the other. This means that the actual information lies in comparing the costs associated with the different decision boundaries, which can be ranked according to their importance. Hence this structure is recommended for problems where some boundaries are more important than others. In Graph Theory this kind of matrix would represent an undirected complete graph with different distances between nodes (labels).

• Asymmetric matrix. This general case is appropriate for situations where some labels are more important than others. It is useful when the problem at hand presents unbalanced data or simply when we are interested in avoiding some types of mistakes (possibly the most usual ones). In Graph Theory this case would represent a directed complete graph in which the two edges between a pair of nodes (labels) may have different weights.
In the following we will consider essentially this last case.
4.1.1 Previous works
Cost-sensitive classification problems with more than two labels have never been easy to translate to the area of Boosting. Following an initial intuition, one may try to decompose the problem into binary ones and then apply any of the algorithms described in section 2.4. However, this can be a bad choice, since the (2 × 2)-matrix associated with each subproblem may be undefined or even useless for the global problem. For instance, when separating one label from the rest there is no justified way to compose the associated binary cost matrix. Furthermore, if the global matrix is symmetric and the idea is to separate one label from another (One-vs-One strategy), then every subproblem would be equally important, i.e. every submatrix would be equivalent to a binary cost-free problem.
There are several works in the literature that address the cost-sensitiveness of a problem in a
paradigm-independent framework [21, 18, 111, 56, 106]. We will not consider these cases since
we are interested in introducing costs in the multi-class Boosting context. The contributions
conceived for this purpose are:
• AdaC2.M1 [86]. The algorithm developed by Y. Sun et al. is probably the first to include costs when using multi-class weak learners. The idea behind it is to combine the multi-class structure of AdaBoost.M1 [26] with the weighting rule of AdaC2 [89] (see Algorithm 8), hence its name. As can be guessed, no theoretical derivation supports this method; rather, it is a heuristic procedure for merging both extensions of AdaBoost into one. AdaC2.M1's pseudo-code is shown in Algorithm 11. Like its multi-class counterpart, it can only compute α-values for "not too weak" learners, see step 2-(c). Moreover, it compresses the information of the row corresponding to a real label l into a single value, C_l = Σ_{j=1}^{K} C(l, j), which misses the structure of the given cost matrix (there are infinitely many different cost matrices producing the same values C_l and therefore representing the "same problem" for the algorithm).
• Lp-CSB [58]. This algorithm was originally conceived to solve instance-dependent cost-sensitive problems. See its pseudo-code in Algorithm 12. The authors, A. C. Lozano and N. Abe, provided a new insight for solving the cost-sensitive problem.
Algorithm 11: AdaC2.M1
1- Initialize the weight vector W with uniform distribution: w(n) = 1/N, n = 1, . . . , N.
2- For m = 1 to M:
   (a) Fit a multi-class classifier G_m(x) to the training data using the weights W.
   (b) Compute the weighted error: Err_m = Σ_{n=1}^{N} w(n) I(G_m(x_n) ≠ l_n).
   (c) Compute α_m = (1/2) log((1 − Err_m)/Err_m).
   (d) Update the weight vector: w(n) ← w(n) C_{l_n} exp(−α_m I(G_m(x_n) = l_n)), n = 1, . . . , N.
   (e) Re-normalize W.
3- Output final classifier: H(x) = arg max_k Σ_{m=1}^{M} α_m I(G_m(x) = k).
They resorted to relational hypotheses, h : X × L → [0, 1], satisfying the stochastic condition Σ_{l∈L} h(l|x) = 1, ∀x ∈ X, to solve the minimization

arg min_h (1/N) Σ_{n=1}^{N} C(l_n, arg max_k h(k|x_n)) ,      (4.3)

which obviously is the goal of the problem (minimizing the expected cost for a uniform distribution of instances). Since the above function is not convex, they remedied this drawback by translating every term in (4.3) into

C(l_n, arg max_k h(k|x_n)) = Σ_{k=1}^{K} ( h(k|x_n) / ‖h(·|x_n)‖_∞ )^∞ C(l_n, k)      (4.4)

that, in turn, can be approximated with a p-norm (p ≥ 1) as follows

C(l_n, arg max_k h(k|x_n)) ≃ Σ_{k=1}^{K} ( h(k|x_n) / max_j h(j|x_n) )^p C(l_n, k) .      (4.5)

Since max_j h(j|x_n) will be greater than or equal to 1/K, expression (4.3) can be approximated by the following convexification:

arg min_h (1/N) Σ_{n=1}^{N} Σ_{k=1}^{K} h(k|x_n)^p C(l_n, k) ,      (4.6)

which becomes the aim of the Boosting algorithm. Then an extended-data approach is followed, just like in AdaBoost.MH [80], for stochastic hypotheses. Equation (4.7) shows the reweighting scheme applied. This time the voting constants, α, form a convex linear combination of weak learners (see step 3-(d)). We must point out that Lp-CSB generalizes a previous work of N. Abe et al. [1], where a Data Space Extension is proposed jointly with an Iterative Weighting scheme to derive the GBSE algorithm (Gradient Boosting with Stochastic Ensembles). Specifically, the latter becomes a particular case of Lp-CSB when p = 1.

A minor drawback when applying Lp-CSB comes with the selection of the optimal value of p for the norm. No clear justification has been given for considering values like 3 or 4, as can be seen in the experiments of [58].
Algorithm 12: Lp-CSB
1- Initialize H_0 with uniform distribution: H_0(k|x_n) = 1/K, ∀n, ∀k.
2- Set the expanded labeled data S̄ = { ((x_n, k), l̄_{n,k}) | k ∈ L, l̄_{n,k} := I(k = l_n) }.
3- For m = 1 to M:
   (a) Set w(x, k) = H_{m−1}(k|x)^{p−1} C(x, k), ∀(x, k) ∈ S̄.
   (b) For all (x_n, k) ∈ S̄ compute:
       w̄(x_n, k) = w(x_n, k)/2                     if l̄_{n,k} = 0
       w̄(x_n, k) = ( Σ_{j≠l_n} w(x_n, j) )/2       if l̄_{n,k} = 1        (4.7)
   (c) Compute the relational hypothesis h_m on S̄ with regard to the weights W̄.
   (d) Choose α_m ∈ [0, 1), for example α_m = 1/m.
   (e) Set H_m(x) := (1 − α_m) H_{m−1}(x) + α_m h_m(x).
4- Output final classifier: H(x) = arg max_k H_M(k|x).
• MultiBoost [102]. The most recent derivation and, from our point of view, the closest to AdaBoost's essence. The algorithm proposed by J. Wang resorts to margin vectors, y, g(x) ∈ Y, and a special variant of the exponential loss function,

L(l, f(x)) := Σ_{k=1}^{K} C(l, k) exp(f_k(x)) ,      (4.8)

to carry out a gradient descent search. The structure of the additive model fitted in this fashion is similar to the one obtained with SAMME [112]. MultiBoost's pseudo-code is shown in Algorithm 13. It is clear that the loss defined in (4.8) does not coincide with any other loss function when a 0|1-cost matrix is used.
The main result supporting MultiBoost's theory states that the optimal pair (T_m(x), β_m) to add to the additive model1, f_m(x) = f_{m−1}(x) + β_m g_m(x), is found by solving:

T_m(x) = arg min_T Σ_{n=1}^{N} A(n, T(x_n)) ,      (4.9)

β_m = ((K−1)/K) log((1 − Err)/Err) − log(K − 1) ,      (4.10)

where Err = Σ_{n=1}^{N} A(n, T_m(x_n)) and the constants A(n, k) compose an (N × K)-matrix, A, of values derived from the costs (see step 2-(d)). Obviously, each of the above optimal parameters depends on the other. Since there is no direct way to solve (4.9), it is convenient to have a pool of weak learners from which to select the optimal one. This becomes the best strategy to accomplish step 2-(a).
Let us go back to the most relevant binary cost-sensitive Boosting algorithms discussed
in section 2.4. On the one hand, as far as we know, Cost-Sensitive AdaBoost [63] has not
found a direct generalization to the multi-class field. We will show in section 4.2.1 how our
new algorithm BAdaCost accomplishes it. On the other hand Cost-Generalized AdaBoost [49]
1
Remember the equivalence between multi-class weak learners, T (x), defined on L and those defined over
margin values, g(x). See (2.16)
Algorithm 13: MultiBoost
1- Initialize the constants A(n, j) = C(l_n, j), n = 1, . . . , N; j = 1, . . . , K, and normalize A.
2- For m = 1 to M:
   (a) Solve (T_m(x), β_m) in expressions (4.9) and (4.10).
   (b) Translate T_m(x) in terms of margin vectors g_m(x).
   (c) Update the additive model: f_m(x) = f_{m−1}(x) + β_m g_m(x).
   (d) Update the constants A(n, j) ← A(n, j) exp(β_m g_m^j(x_n)), n = 1, . . . , N; j = 1, . . . , K.
   (e) Re-normalize A.
3- Output final classifier: H(x) = arg max_k f_k(x), for f(x) = Σ_{m=1}^{M} β_m g_m(x).
resorts to the initial reweighting discussed in section 2.4 to endow the original AdaBoost with the cost-sensitive property. Extending this process to multi-class problems seems unclear, since it would require assigning a different initial weight to each possible kind of error, which is impossible with just one weight vector.
4.1.2 New margin for cost-sensitive classification
Here we introduce the essence of our second algorithm. We first define the concept of multi-class cost-sensitive margin, which serves as a link between multi-class margins and the values of the cost matrix, both of them considered as arguments of a loss function.

With this in mind, we introduce an essential change in the multi-class exponential loss function. Let C∗ be the (K × K)-matrix defined in the following way:

C∗(i, j) = C(i, j)                    if i ≠ j
C∗(i, j) = −Σ_{h=1}^{K} C(i, h)       if i = j ,        ∀i, j ∈ L ,      (4.11)

i.e. C∗ is obtained from C by replacing the j-th zero in the diagonal with the sum of the elements in the j-th row taken with negative sign. For our cost-sensitive classification problem each value C∗(j, j) will represent a "minus cost" associated with a type of correct classification. In other words, the elements in the diagonal should be understood as rewards for successes on instances with that real label.

Notice that, by definition, the j-th row of C∗ is a margin vector that encodes the cost structure of the j-th label. This motivates us to use these rows as the new set of vectors for encoding true labels, i.e. we will keep in mind the bijection l ↔ C∗(l, −). Notice also that our vectors are neither equidistant nor of equal norm. In other words, this codification reorients the space of labels following the structure of the cost matrix.
Based on the above codification, we define the Multi-class Cost-sensitive Margin value of an instance (x, l) with respect to the multi-class vectorial classifier f(x) as

z_C := C∗(l, −) · f(x) .      (4.12)

Analogously to expression (2.18), it is easy to verify that if f(x) = y_j ∈ Y, for a certain j ∈ L, then C∗(l, −) · f(x) = (K/(K−1)) C∗(l, j). Therefore the multi-class cost-sensitive margins obtained from a discrete classifier f : X → Y can be calculated directly using the label-valued analogue of f, i.e. F : X → L, through the formula

z_C = C∗(l, −) · f(x) = (K/(K−1)) C∗(l, F(x)) .      (4.13)
As a consequence, when considering a linear combination of discrete classifiers, f = Σ_{m=1}^{M} α_m f_m, the following expression can be applied:

z_C = Σ_{m=1}^{M} α_m C∗(l, −) · f_m(x) = (K/(K−1)) Σ_{m=1}^{M} α_m C∗(l, F_m(x)) .      (4.14)
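The codification in (4.11)–(4.13) is straightforward to implement. The NumPy sketch below (invented cost values) builds C∗ from C and checks that the margin of a discrete prediction matches the (K/(K−1)) C∗(l, F(x)) shortcut:

import numpy as np

C = np.array([[0.0, 1.0, 2.0],
              [3.0, 0.0, 1.0],
              [1.0, 2.0, 0.0]])                   # illustrative cost matrix with zero diagonal
K = C.shape[0]
Cstar = C.copy()
np.fill_diagonal(Cstar, -C.sum(axis=1))           # C*(i, i) = -sum_h C(i, h), as in (4.11)

Y = np.full((K, K), -1.0 / (K - 1)) + np.eye(K) * (K / (K - 1))   # row j is the margin vector y_j
l, j = 0, 2                                       # true label l, predicted label j
z = Cstar[l] @ Y[j]                               # cost-sensitive margin (4.12) for f(x) = y_j
print(np.isclose(z, K / (K - 1) * Cstar[l, j]))   # True, i.e. formula (4.13)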
Our aim is to use this value as the argument of the multi-class exponential loss function in order to obtain the Cost-sensitive Multi-class Exponential Loss Function, which is defined as follows:

L_C(l, f(x)) := exp(z_C) = exp( C∗(l, −) · f(x) ) .      (4.15)

It will be the loss function for our problem. Note that z_C takes negative values when classifications are good from the cost-sensitive point of view, while positive values come from costly assignments. That is why L_C does not need a negative sign in the exponent, which was the case in previous exponential loss functions. This is a key point in our proposal.
Let us now see the suitability of our new loss function. Specifically, let us discuss how it extends the loss functions of CS-AdaBoost and SAMME, respectively:

A) L_{CS-AdaBoost}(l, f(x)) = I(l = 1) exp(−l C_1 f(x)) + I(l = −1) exp(−l C_2 f(x))

B) L_{SAMME}(y, f(x)) = exp( (−1/K) y^⊤ f(x) ) .

Proving that each of them is a special case of ours is quite simple. On the one hand, L_{CS-AdaBoost} is exactly the result of (4.15) when applied to a binary cost-sensitive problem, where the off-diagonal values have been denoted C_1 := C(1, 2) and C_2 := C(2, 1). On the other hand, we can obtain L_{SAMME} just by fixing a 0|1-cost matrix re-scaled by a factor λ > 0, i.e. λC_{0|1}. Specifically, when λ = 1/(K(K − 1)) we get exactly the same values provided by SAMME's margin vectors (see 2.19). Hence BAdaCost's loss function generalizes its multi-class counterpart. In subsection 4.2.1 we will state complete results about this generalization capability.
4.2 BAdaCost: Boosting Adapted for Cost-matrix
In this section we introduce BAdaCost [25], which stands for Boosting Adapted for Cost-matrix. Having defined the cost-sensitive multi-class exponential loss function and given a training sample {(x_n, l_n)}, we minimize the empirical expected loss, Σ_{n=1}^{N} L_C(l_n, f(x_n)), to obtain the new Boosting algorithm. Once more, the minimization is carried out by fitting an additive model, f(x) = Σ_{m=1}^{M} β_m g_m(x). The weak learner selected at each iteration m will consist of an optimal step of size β_m along the direction g_m of largest descent of the expected cost-sensitive multi-class exponential loss function. In Lemma 2 we show how to compute them.

Lemma 2. Let C be a cost matrix for a multi-class problem. Given the additive model f_m(x) = f_{m−1}(x) + β_m g_m(x), the solution to

(β_m, g_m(x)) = arg min_{β,g} Σ_{n=1}^{N} exp( C∗(l_n, −) (f_{m−1}(x_n) + β g(x_n)) )      (4.16)
is the same as the solution to

(β_m, g_m(x)) = arg min_{β,g} Σ_{j=1}^{K} ( S_j exp(β C∗(j, j)) + Σ_{k≠j} E_{j,k} exp(β C∗(j, k)) ) ,      (4.17)

where S_j = Σ_{{n : g(x_n)=l_n=j}} w(n), E_{j,k} = Σ_{{n : l_n=j, g(x_n)=k}} w(n), and the weight of the n-th training instance is given by:

w(n) = exp( C∗(l_n, −) Σ_{t=1}^{m−1} β_t g_t(x_n) ) .      (4.18)

Given a known direction g, the optimal step β can be obtained as the solution to

Σ_{j=1}^{K} Σ_{k≠j} E_{j,k} C(j, k) A(j, k)^β = Σ_{j=1}^{K} Σ_{h=1}^{K} S_j C(j, h) A(j, j)^β ,      (4.19)

where A(j, k) = exp(C∗(j, k)), ∀j, k ∈ L. Finally, given a known β, the optimal descent direction g, or equivalently G, is given by

arg min_G Σ_{n=1}^{N} w(n) ( A(l_n, l_n)^β I(G(x_n) = l_n) + Σ_{k≠l_n} A(l_n, k)^β I(G(x_n) = k) ) .      (4.20)
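As a worked illustration of how the scalar equation (4.19) can be solved once a direction has been fixed, the sketch below uses Brent's method (the helper name is ours; it assumes a cost matrix with zero diagonal and that the bracketing interval contains a sign change):

import numpy as np
from scipy.optimize import brentq

def solve_beta(C, E, S, lo=1e-6, hi=10.0):
    # C: (K, K) cost matrix with zero diagonal; E[j, k]: weight of class-j samples predicted as k;
    # S[j]: weight of class-j samples predicted correctly. Returns the root of (4.19).
    K = C.shape[0]
    Cstar = C.astype(float).copy()
    np.fill_diagonal(Cstar, -C.sum(axis=1))          # C* as in (4.11)
    A = np.exp(Cstar)                                # A(j, k) = exp(C*(j, k))
    off = ~np.eye(K, dtype=bool)

    def h(beta):
        lhs = np.sum(E[off] * C[off] * A[off] ** beta)
        rhs = np.sum(S[:, None] * C * (np.diag(A) ** beta)[:, None])
        return lhs - rhs

    return brentq(h, lo, hi)                         # assumes h changes sign on [lo, hi]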
The proof of this result is in Appendix A.3. The BAdaCost pseudo-code is shown in Algorithm 14. Just like other Boosting algorithms, we initialize the weights with a uniform distribution. At each iteration we add a new multi-class weak learner g_m : X → Y to the additive model, weighted by β_m, a measure of the confidence in the prediction of g_m. The optimal weak learner that minimizes (4.20) is a cost-sensitive multi-class classifier trained using the data weights, w(n), and a modified cost matrix, A^β = exp(βC∗). Observe that training a cost-sensitive weak-learner does not imply computational difficulties. For instance, when using classification trees one can proceed in three ways [38]: adjusting the decision thresholds, changing the impurity-based split criteria, or applying a cost-sensitive pruning. We simply recommend computing the cost-sensitive counterparts of the Gini index or the entropy as splitting criteria when fitting trees.
Algorithm 14: BAdaCost
1- Initialize the weight vector W: w(n) = 1/N, for n = 1, . . . , N.
2- Compute the matrices C∗, following equation (4.11), and A for the given C.
3- For m = 1 to M:
   (a) Obtain G_m by minimizing (4.20) for β = 1. Translate G_m into g_m : X → Y.
   (b) Compute the constants E_{j,k} and S_j, ∀j, k, as described in Lemma 2.
   (c) Compute β_m by solving equation (4.19).
   (d) Update the weights: w(n) ← w(n) exp(β_m C∗(l_n, −) g_m(x_n)), for n = 1, . . . , N.
   (e) Re-normalize the vector W.
4- Output classifier: H(x) = arg min_k C∗(k, −) f(x), where f(x) = Σ_{m=1}^{M} β_m g_m(x).
Unlike in other Boosting algorithms [28, 112], here g_m and β_m cannot be optimized independently. As with PIBoost, we solve this by alternately fixing the value of one of them and optimizing the other. This is done by specifying a seed value for β_m (1 for simplicity) and then computing first g_m and then β_m, consecutively. Otherwise, a loop should be included to encompass steps 3-(a), (b) and (c), in order to ensure a desired improvement in the classification cost; once the improvement condition is satisfied, the optimal pair (g_m, β_m) would be added to the model. As with PIBoost, we have empirically confirmed that there are no significant differences if one performs a single iteration of the loop instead of many. Thus, we proceed as described in Algorithm 14, which is considerably more efficient.
Finally, let us justify the decision rule in step 4- by comparing it with multi-class approaches. It is well known that vectorial classifiers, f(x), provide a degree of confidence for classifying sample x into every class. Hence, the max rule, arg max_k f_k(x), can be used for label assignment [112, 24]. It is straightforward to see that this criterion is equivalent to assigning the label that maximizes the multi-class margin, arg max_k y_k^⊤ f(x) = arg min_k −y_k^⊤ f(x). Since −y_k^⊤ f(x) is proportional to C∗_{0|1}(k, −) f(x), we can extend the decision rule to the cost-sensitive field just by assigning arg min_k C∗(k, −) f(x).
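In code, the cost-sensitive assignment rule is a one-liner (a sketch; f is the accumulated vectorial response for a sample and Cstar the matrix of (4.11)):

import numpy as np

def badacost_predict(f, Cstar):
    # Cost-sensitive label assignment: arg min_k C*(k, -) f(x).
    return int(np.argmin(Cstar @ f))

With a 0|1-cost matrix this reduces to the usual arg max_k f_k(x) rule, in line with the proportionality argument above.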
4.2.1 Direct generalizations
We want to remark that our multi-class algorithm is a canonical generalization of previous Boosting algorithms. Specifically, we consider three canonical generalizations of AdaBoost in different contexts. First, Cost-Sensitive AdaBoost [63] is a canonical extension for cost-sensitive binary problems. Second, SAMME [112] is the best-grounded extension of AdaBoost to the multi-class field using multi-class weak learners. Finally, we also generalize PIBoost [24], our previously presented multi-class algorithm.

The following corollaries of Lemma 2 prove these canonical extension results. Their proofs are given in A.4, A.5 and A.6, respectively.
Corollary 1. When C(i, j) = 1/(K(K−1)), ∀i ≠ j, the above result is equivalent to SAMME. The update for the additive model f_m(x) = f_{m−1}(x) + β_m g_m(x) is given by

(β_m, g_m(x)) = arg min_{β,g} Σ_{n=1}^{N} exp( −y_n^⊤ (f_{m−1}(x_n) + β g(x_n)) )

and both optimal parameters can be computed in the following way:

• g_m = arg min_g Σ_{n=1}^{N} w(n) I(g(x_n) ≠ y_n) ,

• β_m = ((K−1)²/K) ( log((1−E)/E) + log(K − 1) ) ,

where E is the sum of all weighted errors.
Corollary 2. When K = 2, Lemma 2 is equivalent to Cost-sensitive AdaBoost. If we denote C(1, 2) = C_1 and C(2, 1) = C_2, the update (4.17) for the additive model F_m(x) = F_{m−1}(x) + β_m G_m(x) becomes:

(β_m, G_m(x)) = arg min_{β,G} Σ_{{l_n=1}} w(n) exp(−C_1 β G(x_n)) + Σ_{{l_n=2}} w(n) exp(C_2 β G(x_n)) .

For a certain value β the optimal G_m(x) is given by

arg min_G ( e^{βC_1} − e^{−βC_1} ) b + e^{−βC_1} T_1 + ( e^{βC_2} − e^{−βC_2} ) d + e^{−βC_2} T_2 ,

where2: T_1 = Σ_{{n : l_n=1}} w(n), T_2 = Σ_{{n : l_n=2}} w(n), b = Σ_{{n : G(x_n)≠l_n=1}} w(n) and d = Σ_{{n : G(x_n)≠l_n=2}} w(n). Given a known direction, G(x), the optimal step β_m can be calculated as the solution to

2 C_1 b cosh(βC_1) + 2 C_2 d cosh(βC_2) = T_1 C_1 e^{−βC_1} + T_2 C_2 e^{−βC_2} .
Corollary 3. When using margin vectors to separate a group of s labels, S ∈ P(L), from the rest, the result of Lemma 2 is equivalent to PIBoost. The update for each additive model built in this fashion, f_m(x) = f_{m−1}(x) + β_m g_m(x), becomes:

(β_m, g_m(x)) = arg min_{β,g} Σ_{n=1}^{N} w(n) exp( (−β/K) y_n^⊤ g(x_n) ) .

For a certain value β the optimal g_m(x) is given by

arg min_g ( e^{β/(s(K−1))} − e^{−β/(s(K−1))} ) E1 + e^{−β/(s(K−1))} A1 + ( e^{β/((K−s)(K−1))} − e^{−β/((K−s)(K−1))} ) E2 + e^{−β/((K−s)(K−1))} A2 ,

where: A1 = Σ_{{n : l_n ∈ S}} w(n), A2 = Σ_{{n : l_n ∉ S}} w(n), E1 = Σ_{{n : g(x_n)≠l_n ∈ S}} w(n) and E2 = Σ_{{n : g(x_n)≠l_n ∉ S}} w(n).

Besides, given a known direction g(x), the optimal step β_m can be calculated as β_m = s(K − s)(K − 1) log R, where R is the only real positive root of the polynomial

P_m(x) = E1(K − s) x^{2(K−s)} + E2 s x^{K} − s(A2 − E2) x^{K−2s} − (K − s)(A1 − E1) .
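For illustration, the positive root R of the above polynomial, and hence β_m, can be computed numerically (a hypothetical helper; it assumes 2s ≤ K so that all exponents are non-negative):

import numpy as np

def piboost_beta(K, s, A1, A2, E1, E2):
    # Build P_m(x) = E1(K-s) x^{2(K-s)} + E2 s x^K - s(A2-E2) x^{K-2s} - (K-s)(A1-E1)
    # and return beta_m = s(K-s)(K-1) log R, with R its only positive real root.
    deg = 2 * (K - s)
    coeffs = np.zeros(deg + 1)                  # coeffs[i] multiplies x^{deg - i} (np.roots convention)
    coeffs[0] += E1 * (K - s)                   # x^{2(K-s)} term
    coeffs[deg - K] += E2 * s                   # x^{K} term
    coeffs[deg - (K - 2 * s)] += -s * (A2 - E2) # x^{K-2s} term
    coeffs[deg] += -(K - s) * (A1 - E1)         # constant term
    roots = np.roots(coeffs)
    R = [r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0][0]
    return s * (K - s) * (K - 1) * np.log(R)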
4.3 Experiments
It is time to evaluate the performance of BAdaCost on two relevant kinds of problems. To this aim, we devote a first subsection to introducing our procedure for computing cost matrices. Subsequently, in 4.3.2, we apply our method to real data sets belonging to the UCI repository; in this first case, minimizing costs is the goal of the classification. Secondly, in 4.3.3 we compare BAdaCost on a complex Computer Vision problem, the detection of synapses and mitochondria in medical images. Here the addition of a cost matrix serves as a tool to deal with unbalanced data sets.
4.3.1 Cost matrix construction
When working with cost-sensitive problems one should have reliable information about the associated penalties for any pair of labels. In Decision Theory this is obtained by directly asking an expert in the area. This can be expensive, time-consuming or even impossible to do, which is why a real cost matrix is seldom available. Unfortunately, as far as we know, no general procedure has been proposed in the literature to solve this problem. For instance, a typical way to evaluate cost-sensitive algorithms is to generate random cost matrices. Such a process can be misleading, since these costs do not follow a reasonable design and may therefore yield meaningless results from the point of view of the decision boundaries between labels.
2
Here we adopt the notation used in [63].
In the same way, when using a cost-sensitive algorithm to solve an unbalanced problem, the selection of a suitable cost matrix, C, becomes essential. In [86] the authors encode the cost matrix in a K-element vector and use a genetic algorithm to optimize classification performance. The drawbacks of this approach are both the computational complexity and the loss of information incurred by "vectorizing" the cost matrix. The procedure we introduce computes a set of costs that punishes errors in the minority classes more heavily. A straightforward solution would be to set the costs inversely proportional to the class unbalance ratios [33]. This solution is not satisfactory, since costs are related not only to the relative number of samples in each class, but also to the complexity of the classification problem, i.e. the amount of class overlapping, within-class unbalance, etc. Rather, we find it reasonable to infer this information from the confusion matrix obtained with a standard cost-free classifier.

Bearing both types of problems in mind, we include a pre-processing step for measuring the hardness of the problem with regard to each pair of labels that is also appropriate for unbalanced data. It consists of running a simple cost-insensitive multi-class algorithm on the training data (for instance, BAdaCost with a 0|1-cost matrix for a few iterations) and then computing the associated confusion3 matrix, F. This matrix is extraordinarily informative for describing the overall difficulty of the problem. More precisely, it serves to properly measure the degree of overlapping between labels, and consequently allows us to focus on the most relevant boundaries. Using this information, we proceed as follows.
Let F∗ be the matrix obtained by dividing the i-th row, F(i, −), of F by Σ_j F(i, j), i.e. the number of samples in class i. Consequently, F∗(i, j) is the proportion of data in class i classified as j. These values capture the degree of overlapping with regard to each real label i. Hence, high coefficients will be assigned to hard decision boundaries. Secondly, we transform F∗ = F∗ / max_{i,j} F∗(i, j) to obtain a matrix whose maximum coefficient is equal to 1. Then we set the diagonal values F∗(i, i) = 0, ∀i. Note that in a complex and unbalanced data set, a 0|1-loss classifier will tend to over-fit the majority classes. So, the off-diagonal elements in rows F∗(i, −) for majority (respectively, minority) classes will have low (high) scores.

If needed, off-diagonal zero values should be replaced by a small ε > 0. Doing so, we guarantee that only correct classifications have null costs. Finally, to improve numerical conditioning, we rescale the resulting matrix, C = λF∗, for an appropriate λ > 0. We can apply this transformation because any cost matrix can be multiplied by a positive constant without affecting the output labels of the problem [69].
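The whole construction fits in a few lines. The sketch below follows the steps just described (the function name and the default values of ε and λ are ours, to be tuned per problem):

import numpy as np

def cost_matrix_from_confusion(F, eps=1e-3, lam=1.0):
    # F: confusion matrix of a cost-free classifier, rows indexed by the true label.
    Fstar = F / F.sum(axis=1, keepdims=True)    # F*(i, j): proportion of class-i data classified as j
    Fstar = Fstar / Fstar.max()                 # maximum coefficient equal to 1
    np.fill_diagonal(Fstar, 0.0)                # correct classifications get null cost
    off = ~np.eye(F.shape[0], dtype=bool)
    Fstar[off & (Fstar == 0.0)] = eps           # only correct classifications keep a null cost
    return lam * Fstar                          # C = lambda * F*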
4.3.2 Minimizing costs: UCI repository
To assess the performance of BAdaCost on real problems we resort again to the UCI catalogue of Machine Learning problems. In this first comparison we are interested in measuring the capability to minimize costs. A group of multi-class supervised data sets is selected, just as we did in section 3.4. This time we have to be sure that the selected data represent complex problems for the process to be meaningful. The chosen data bases are: CarEvaluation, Chess, CNAE9, ContraMethod, Isolet, Letter, Shuttle, OptDigits, PenDigits, SatImage, Segmentation, and Waveform. They cover a broad range of classification problems with regard to the number of variables (6 to 856), labels (3 to 26) and instances (1080 to 58000). Table 4.1 shows a description of the selected data sets.

We compare BAdaCost with the algorithms introduced in section 4.1.1: AdaC2.M1 [86], Lp-CSB [58] and MultiBoost [102]. See their respective pseudo-codes in Algorithms 11, 12 and 13.
3
This matrix is also known as contingency matrix.
Data set      | Variables | Labels | Instances
CarEvaluation | 6         | 4      | 1728
Chess         | 6         | 18     | 28056
CNAE9         | 856       | 9      | 1080
ContraMethod  | 9         | 3      | 1473
Isolet        | 617       | 26     | 7797
Letter        | 16        | 26     | 20000
Shuttle       | 9         | 7      | 58000
OptDigits     | 64        | 10     | 5620
PenDigits     | 16        | 10     | 10992
SatImage      | 36        | 7      | 6435
Segmentation  | 19        | 7      | 2310
Waveform      | 21        | 3      | 5000

Table 4.1: Summary of the selected UCI data sets.
For each data set we proceed in the following way. First, if needed, we unify the training and test data into a single set. Then we carry out a 5-fold cross-validation process, taking care to maintain the original proportion of labels in each fold. When training, we compute a cost matrix following the criteria described in 4.3.1. Then we run each algorithm for 100 iterations. We resort to classification trees as base learners. As discussed in 4.1.1, AdaC2.M1 and BAdaCost allow the use of multi-class weak learners. MultiBoost also uses multi-class weak learners, but it requires a pool of them to work properly (minimizing (4.9) directly is intractable). For this reason we create a pool of 6000 weak learners. To cover different weighting schemes in this group of hypotheses, we sample 30%, 45%, and 60% of the training data (2000 weak learners for each ratio). Finally, Lp-CSB translates the multi-class problem into a binary one.

The average misclassification cost, (1/N) Σ_{n=1}^{N} C(l_n, H(x_n)), is collected at the end of each process. Note that, after rescaling the cost matrix, the final costs may add up to a very small quantity. We show the results in Table 4.2.
It is clear that BAdaCost outperforms the rest of the algorithms on most of the data bases. To assess the statistical significance of the performance differences among the four methods we use the Friedman test of average ranks. The statistic clearly supports the alternative hypothesis, i.e. the algorithms do not achieve equivalent results. Then a post-hoc analysis complements our arguments. We carry out the Bonferroni-Dunn test for both significance levels, α = 0.05 and α = 0.10. The critical differences4 for these tests are CD_{0.05} = 1.2617 and CD_{0.10} = 1.1216, respectively. Figure 4.1 shows the final result. We can conclude that BAdaCost is significantly better than the AdaC2.M1 and Lp-CSB algorithms for the above levels of significance. In the case of MultiBoost we can state the same conclusion for α = 0.10, but not for α = 0.05 (the difference between ranks is 1.25).

4 This value depends on the number of classifiers being compared together with the number of data sets over which the comparison is carried out. See [15].
Data set    | AdaC2.M1        | MultiBoost      | Lp-CSB          | BAdaCost
CarEval     | 0.0026 (±9)     | 0.0232 (±36)    | 0.0038 (±15)    | 0.0024 (±15)*
Chess       | 0.0029 (±5)     | 0.0262 (±34)    | 0.0004 (±3)*    | 0.0160 (±9)
Isolet      | 0.0289 (±48)    | 0.0140 (±15)    | 0.0149 (±18)    | 0.0066 (±14)*
SatImage    | 0.0478 (±62)    | 0.0187 (±23)    | 0.0170 (±26)    | 0.0132 (±11)*
Letter      | 0.0491 (±78)    | 0.0319 (±53)    | 0.0161 (±23)    | 0.0066 (±7)*
Shuttle     | 2.1e−5 (±0.07)* | 8.9e−5 (±0.08)  | 3.5e−5 (±0.13)  | 3.9e−5 (±0.3)
ContraMeth. | 0.0980 (±129)   | 0.1058 (±214)   | 0.0938 (±359)   | 0.0928 (±253)*
CNAE9       | 0.0397 (±103)   | 0.0171 (±57)*   | 0.0241 (±108)   | 0.0191 (±51)
OptDigits   | 0.0366 (±120)   | 0.0134 (±27)    | 0.0170 (±34)    | 0.0030 (±8)*
PenDigits   | 0.0326 (±29)    | 0.0193 (±34)    | 0.0162 (±72)    | 0.0018 (±5)*
Segmenta.   | 0.0242 (±123)   | 0.0154 (±48)    | 0.0094 (±18)    | 0.0050 (±30)*
Waveform    | 0.0905 (±113)   | 0.0515 (±128)   | 0.0632 (±201)   | 0.0367 (±96)*

Table 4.2: Classification cost rates of the AdaC2.M1, MultiBoost, Lp-CSB, and BAdaCost algorithms for each data set after 100 iterations. Standard deviations appear inside parentheses on a 10^{-4} scale. The best result for each data base is marked with an asterisk.
Figure 4.1: Comparison of ranks through the Bonferroni-Dunn test. BAdaCost's average rank is taken as the reference. Algorithms significantly worse than our method at a significance level of 0.10 are joined with a blue line.
4.3.3 Unbalanced Data: Synapse and Mitochondria segmentation
To complete the section, we show the performance of our algorithm in the domain of unbalanced classification problems. These problems are characterized by large differences in the number of samples in each class. Such a situation frequently occurs in complex data sets, like those involving class overlapping, small sample size, or within-class unbalance [38]. When working with unbalanced data, standard classifiers usually perform poorly, since they minimize the number of misclassified training samples while disregarding minority classes. It is worth mentioning that this has become an important research area in Pattern Recognition [33, 38, 88], inasmuch as unbalanced classification problems frequently occur in relevant practical settings, such as Bio-medical Image Analysis [6], object detection in Computer Vision [100] or medical decision-making [65].
Figure 4.2: Example of a segmented image. In b), green pixels belong to mitochondria while red ones belong to synapses. Panel c) shows an estimation.

Solutions to the class unbalance problem may be coarsely organized into data-based approaches, which re-sample the data space to balance the classes, and algorithm-based approaches, which introduce
new algorithms that bias the learning towards the minority class [38]. The symbiotic relation between data- and ensemble-based algorithms in the context of two-class unbalanced classification is highlighted in a recent survey [33]. Boosting algorithms have also been extensively used to address this kind of problem [33, 38, 86, 91].

However, with the exception of AdaC2.M1 [86], no previous work has addressed the problem of multi-class Boosting in the presence of unbalanced data by using a cost matrix5. We compare BAdaCost with the multi-class cost-sensitive algorithms considered in section 4.3.2: AdaC2.M1, Lp-CSB and MultiBoost. We also add the SAMME algorithm to our experiment, considering it a good example of a cost-free multi-class algorithm with multi-class weak learners. Let us briefly describe our problem.
In recent years we have seen advances in the automated acquisition of large series of images of brain tissue. The complexity of these images and the high number of neurons in a small section of the brain make the automated analysis of these images the only practical solution. Mitochondria and synapses are two cell structures of neurological interest that are suitable for automated processing. However, the classification task required to segment these structures is highly unbalanced, since most pixels in these images belong to the background, few of them belong to mitochondria and a small minority belong to synapses. Figure 4.2 shows an example.
For this experiment we collect a training set composed of 10000 background, 4000 mitochondria and 1000 synapse samples, and a testing set with 20000 samples per class. We apply to each image in the stack a set of linear Gaussian filters at different scales to compute zero-, first- and second-order derivatives. For each pixel we get a vector of responses S = (s_00, s_10, s_01, s_02, s_11, s_20), where s_ij is the response to the filter σ^{i+j} · G_σ ∗ ∂^{i+j}/(∂x^i ∂y^j) and G_σ is a zero-mean Gaussian with standard deviation σ. For a value of σ the pixel feature vector is given by f(σ) = (s_00, sqrt(s_10² + s_01²), λ_1, λ_2), where λ_1 and λ_2 are the eigenvalues of the Hessian matrix at the pixel, which depend on s_20, s_02 and s_11. The final 16-dimensional feature vector for each pixel is given by the concatenation of the f(σ) vectors at 4 scales (values of σ).
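A possible implementation of these per-pixel features with SciPy's Gaussian derivative filters is sketched below (the four σ values are an assumption, not taken from the thesis):

import numpy as np
from scipy.ndimage import gaussian_filter

def pixel_features(img, sigmas=(1.0, 2.0, 4.0, 8.0)):
    # Returns an (H, W, 16) array: (s00, sqrt(s10^2 + s01^2), lambda1, lambda2) at each of 4 scales.
    img = np.asarray(img, dtype=float)
    feats = []
    for s in sigmas:
        d = lambda oy, ox: s ** (oy + ox) * gaussian_filter(img, s, order=(oy, ox))
        s00, s10, s01 = d(0, 0), d(0, 1), d(1, 0)
        s20, s11, s02 = d(0, 2), d(1, 1), d(2, 0)
        # Eigenvalues of the 2x2 Hessian [[s20, s11], [s11, s02]] at every pixel.
        tr, det = s20 + s02, s20 * s02 - s11 ** 2
        disc = np.sqrt(np.maximum(tr ** 2 / 4.0 - det, 0.0))
        feats += [s00, np.sqrt(s10 ** 2 + s01 ** 2), tr / 2.0 + disc, tr / 2.0 - disc]
    return np.stack(feats, axis=-1)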
Having described the features, our goal in this subsection is to use the BAdaCost algorithm
to label pixels in these images as mitochondria, synapse and background. We compare the five
5
In [103] the authors propose AdaBoost.NC, a multi-class Boosting algorithm specialized in unbalanced data.
They do not resort to cost matrices, rather they use a different insight based on instance- and iteration-dependent
penalties that are applied in the reweighting scheme.
algorithms using a cost matrix obtained as described in 4.3.1. This is done by obtaining the confusion matrix after applying 15 iterations of the BAdaCost algorithm (with a 0|1-cost matrix) to the training set. Once more, we use classification trees as base learners for every algorithm. In all the algorithms we use a re-sampling factor of r = 0.7. In the case of SAMME, AdaC2.M1 and BAdaCost we use a shrinkage factor of s = 0.1 (this factor is not needed for the rest). For the Lp-CSB algorithm we select p = 4, and for MultiBoost we create a pool of 10000 learners following the mentioned re-sampling factor. We run the five algorithms for 150 iterations.
Figure 4.3: Brain images experiment with a heavily unbalanced data set. Training and testing
error rates, along the iterations, for each algorithm are shown.
In Fig. 4.3 we show the training and testing classification results for this experiment. Table 4.3 shows the error rates for each algorithm after the last iteration. The unbalance in the training data reflects the essence of this segmentation problem, in which synapses cover a very small area of the image, mitochondria a slightly larger one and, finally, the background class the largest area. A classifier unaware of this unbalance would overfit the largest class to achieve the lowest error rate on the unbalanced training data set. We can see this effect in the plots of Fig. 4.3. The SAMME algorithm achieves the lowest error rate on the training data set, but the poorest on the balanced testing set, clearly showing that it has overfitted the background class. Note here that, although the training classes are unbalanced, the error rate is a meaningful classification measure on the testing data set because the latter is balanced. On the other hand, BAdaCost, being a cost-sensitive classifier, gets a much better testing error rate. Its training error rate is much higher than SAMME's. This is an expected result, since the cost matrix has effectively moved the class boundaries towards the majority class. The MultiBoost and AdaC2.M1 algorithms perform better than SAMME on the test set, but clearly worse than BAdaCost. Although Lp-CSB, MultiBoost and AdaC2.M1 are cost-sensitive algorithms, their poor performance in this experiment shows that their results are far from optimal.
      | SAMME  | AdaC2.M1 | Lp-CSB | MultiBoost | BAdaCost
Train | 0.1723 | 0.1691   | 0.1843 | 0.2095     | 0.2515
Test  | 0.3327 | 0.3327   | 0.3667 | 0.3293     | 0.2255

Table 4.3: Error rates of the five algorithms after the last iteration.
Discussion
In this section we carry out two types of experiments.
The first is conceived to minimize decision costs on real problems when a cost structure is given. To this aim, we have considered a set of UCI benchmark problems for which we compute cost matrices following the process described in 4.3.1. In doing so, the cost-sensitive classification focuses on the difficult decision boundaries. We carry out a comparison with other multi-class cost-sensitive Boosting algorithms. In light of the results, we conclude that BAdaCost finds significantly better label assignments for this purpose. This reaffirms the use of the cost-sensitive multi-class exponential loss function for developing algorithms in the area.

In our second experiment we solve a complex problem in the presence of unbalanced data. Specifically, we address the segmentation of mitochondria and synapses in images of small sections of the brain. In this case, we endow the multi-class problem with a cost matrix to move the decision boundaries towards the majority class. In this way the classifier does not neglect the minority classes. We compared BAdaCost with other multi-class cost-sensitive algorithms, using the process explained in section 4.3.1 to estimate a cost matrix for this problem. The results confirm that BAdaCost significantly outperforms the rest of the algorithms, since its testing error decreases the most. This is essentially due to our unbalance-aware learning. In other words, BAdaCost really takes advantage of a cost matrix to correct the unbalance in the data.
Chapter 5
Conclusions
In the present dissertation we have exploited two definitions of margin to develop new algorithms for multi-class Boosting.
On the one hand, we have proposed a new multi-class Boosting algorithm called PIBoost.
The main contribution in it is the use of binary classifiers whose response is encoded in a
multi-class vector and evaluated under an exponential loss function. Data labels and classifier
responses are encoded in different vector domains in such a way that they produce a set of
asymmetric margin values that depend on the distribution of classes separated by the weak
learner. In this way PIBoost properly addresses possible class unbalances appearing in the
problem binarization. The range of rewards and penalties provided by this multi-class loss
function is also related to the amount of information yielded by each weak-learner.
The most informative weak learners are those that classify samples into the smallest class
set and, consequently, their sample weight rewards and penalties are the largest. Here the codification produces a fair distribution of the vote, or evidence, among the classes in the group.
This matches a common-sense pattern, namely learning to guess something by discarding
possibilities in the proper way. The resulting algorithm maintains the essence of
AdaBoost, which, in fact, is a special case of PIBoost when the number of classes is two. Furthermore, the way it translates partial information about the problem into multi-class knowledge lets
us think of our method as a canonical extension of AdaBoost using binary information.
The experiments performed confirm that PIBoost significantly improves on the performance
of other well-known multi-class classification algorithms. We do not claim that PIBoost is the
best multi-class Boosting algorithm in the literature. Rather, we emphasize that the multi-class
margin expansion introduced improves on existing binary multi-class classification approaches and
opens new research avenues for margin-based classification.
Following the insight behind PIBoost, we extended the original AdaBoost to multi-class
cost-sensitive problems using multi-class weak learners. By extending the notion of multi-class margin to the cost-sensitive case we introduced a new method and also developed theoretical connections proving that it canonically generalizes SAMME [112], Cost-sensitive AdaBoost [63] and PIBoost [24]. The resulting algorithm, BAdaCost, stands for Boosting Adapted
for Cost-matrix. We have shown experimentally that BAdaCost outperforms other relevant
Boosting algorithms in the area. We performed this comparison on two types of tasks: minimizing
costs on standard data sets, and improving the test accuracy on an unbalanced data problem.
When making an algorithm cost-sensitive, the costs are a new set of parameters. Estimating
these parameters is a difficult problem per se. Most of the published cost-sensitive algorithms
either do not provide a procedure to compute them [58, 102], resort to a computationally
demanding search procedure [86], or, in the unbalanced case, set the costs inversely proportional
to the class unbalance ratio [33]. In our experiments, we used a simple procedure to
estimate the costs from the confusion matrix obtained with a simple cost-free multi-class rule.
This procedure has been shown to work well in practice. However, we are aware that this matrix is
far from optimal.
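To make the idea concrete, the following sketch shows one simple way of turning the confusion matrix of a cost-free classifier into a cost matrix that emphasizes frequently confused boundaries. It is only illustrative and is not necessarily the exact rule used in section 4.3.1; the function name and the normalization choices are our own, and it assumes every class appears in the data.

import numpy as np

def cost_matrix_from_confusion(conf):
    # conf[j, k]: number of class-j samples labelled as k by a cost-free classifier.
    conf = np.asarray(conf, dtype=float)
    rates = conf / conf.sum(axis=1, keepdims=True)   # per-class confusion rates
    C = rates.copy()
    np.fill_diagonal(C, 0.0)                         # correct decisions have zero cost
    if C.max() > 0:
        C /= C.max()                                 # scale off-diagonal costs to [0, 1]
    return C

The intuition is that pairs of classes that a cost-insensitive rule confuses often correspond to hard decision boundaries, so raising their costs makes the cost-sensitive learner concentrate on them.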
Finally, we must say that the range of applications of both algorithms is vast. Frameworks
based on variations of the multi-class margin can be flexibly adapted to new paradigms
or new types of problems. The following section is devoted to discussing these directions for
future work.
5.1
Future work
Here we discuss some interesting topics for future research. We first review theoretical aspects
that we consider immediate goals. Then we comment on different areas of applicability in the
field of supervised classification, especially in Computer Vision.
5.1.1
New theoretical scopes
Many Boosting algorithms assess their convergence through error bounds [28, 26, 80]. We derived our algorithms based on a statistical view of Boosting. In future research we will derive
bounds on the training and test errors of PIBoost and BAdaCost. In the case of PIBoost, its
weak learners are grouped into separators, but their responses, encoded as margin vectors, are
summed up jointly. Therefore, the derivation of the error bound should take into account each
partial classification and the result when they are merged into a final decision. Such an analysis
is not immediate and would require thorough research.
Likewise, as we stated in section 2.5 there is a plethora of works discussing different
perspectives of Boosting. It would be quite interesting to link some of them to multi-class
margin-based theory. Three examples are the reduction of entropy as analyzed in [46, 84], the
game-theory point of view described in [67] and the use of different margin-based loss functions
[61, 10].
With regard to PIBoost's separators, in our experiments we did not take into account sets of
labels larger than a pair. If the number of labels allows it, it is also possible to add new "clues"
to the fitted additive model coming from trios or quartets of labels. The main disadvantage,
obviously, is the associated computational load. As we mentioned in section 3.4, the number
of separators could become prohibitive. For this reason we are interested in developing a
well-grounded scheme for selecting the best subsets of labels for our separators.
Establishing the proper cost matrix is a key and difficult step in many cost-sensitive problems.
Aware of this fact, we have introduced a procedure for assigning costs to hard-to-fit decision
boundaries and also for tackling asymmetries in data. In future research we would like to devise
a more efficient algorithm to estimate cost matrices in certain problems, such as unbalanced
ones. Similarly, we also want to derive symmetric matrices suitable for problems where some
decision boundaries between pairs of labels are more important than others, regardless of the
importance of the labels in a pair.
5.1.2
Other scopes of supervised learning
Firstly, we must point out that the supervised classification addressed in the dissertation is concerned with a discrete set of values (in fact, a finite set of labels). Many real problems require
predictions on continuous magnitudes, which invites us to extend both PIBoost and BAdaCost
to regression problems.
Labeling data is a time consuming and costly process, hence the interest in developing
semi-supervised algorithms. We would like to study the problem of semi-supervised multi-class
Boosting with costs.
Some problems present such a high dimensionality that a feature selection process is needed
first. AdaBoost has been successfully used for selecting discriminative features while building a strong cascade classifier, e.g. [101, 100]. From this perspective, we are also motivated
to apply PIBoost to retrieve the best features for multi-class problems.
In Computer Vision, detection problems usually suffer from unbalanced data, since positive
classes have far fewer samples than the background one. The usual technique to deal
with such unbalance is sub-sampling the background class, which entails an information loss.
With BAdaCost the majority-class sub-sampling is not needed and the full training set can be
used. Moreover, it is well known that training a single classifier for each object (a pure binary
problem) is worse than building a multi-class object detector [95], because visual features can
be shared among the weak learners. We are studying the application of PIBoost to multi-pose
object detection as a practical and intuitive application.
Last but not least, we would like to develop a cost-sensitive version of PIBoost. In other
words, we plan to derive a canonical generalization of AdaBoost that tackles classification in
the presence of a multi-class cost matrix by using binary weak learners.
Appendix A
Proofs
A.1
Proof of expression (3.3)
Assume, without loss of generality, that x belongs to the first class and that we try to separate the
set S, formed by the first s labels, from the rest using f^S. Assume also that there is success when
classifying with it. In that case the value of the margin will be
classifying with it. In that case the value of the margin will be
>
−1
−1
1 −1
−1
1
y f (x) = 1,
,...,
,..., ,
,...,
K −1
K −1
s
s K −s
K −s
s−1
(K − s)
K
1
+
=
.
= −
s s(K − 1) (K − 1)(K − s)
s(K − 1)
> S
If the separator fails, then f^S(x) would have the opposite sign, and therefore so would the result.
Besides, assume now that the real label of the instance is the same but now we separate the
last s labels from the rest. Assume also that this time f^S erroneously classifies the instance as
belonging to those last labels. The value of the margin will be
\[
\mathbf{y}^\top \mathbf{f}^S(\mathbf{x})
= \left(1, \frac{-1}{K-1}, \ldots, \frac{-1}{K-1}\right)
  \left(\frac{-1}{K-s}, \ldots, \frac{-1}{K-s}, \frac{1}{s}, \ldots, \frac{1}{s}\right)^{\!\top}
= \frac{-1}{K-s} + \frac{K-s-1}{(K-1)(K-s)} - \frac{s}{(K-1)s}
= \frac{-K}{(K-1)(K-s)}.
\]
Again, the sign of the result would be the opposite if f^S correctly excluded x from that last group of labels.
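To double-check the two values just derived, the following short numeric sketch (our own, using NumPy and the example K = 5, s = 2) builds the label vector y and the two separator responses explicitly and verifies the margins K/(s(K-1)) and -K/((K-1)(K-s)).

import numpy as np

K, s = 5, 2                                    # example: 5 classes, |S| = 2
y = np.full(K, -1.0 / (K - 1))
y[0] = 1.0                                     # x belongs to the first class

# f^S when S holds the first s labels and x is (correctly) assigned to S
f_hit = np.concatenate([np.full(s, 1.0 / s), np.full(K - s, -1.0 / (K - s))])
# f^S when S holds the last s labels and x is (wrongly) assigned to S
f_miss = np.concatenate([np.full(K - s, -1.0 / (K - s)), np.full(s, 1.0 / s)])

assert np.isclose(y @ f_hit, K / (s * (K - 1)))
assert np.isclose(y @ f_miss, -K / ((K - 1) * (K - s)))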
A.2
Proof of Lemma 1
Let us fix a subset of s labels, s = |S|, and assume that at the m-th step we have fitted a separator
f_m(x) (whose S-index we omit) as an additive model, f_{m+1}(x) = f_m(x) + βg(x). We fix
β > 0 and rewrite the expression to look for the best g(x):
\[
\sum_{n=1}^{N} \exp\!\left(\frac{-1}{K}\,\mathbf{y}_n^\top\big(\mathbf{f}_m(\mathbf{x}_n)+\beta\,\mathbf{g}(\mathbf{x}_n)\big)\right)
= \sum_{n=1}^{N} w(n)\,\exp\!\left(\frac{-\beta}{K}\,\mathbf{y}_n^\top \mathbf{g}(\mathbf{x}_n)\right) = \tag{A.1}
\]
\[
= \sum_{l_n \in S} w(n)\,\exp\!\left(\frac{\mp\beta}{s(K-1)}\right)
+ \sum_{l_n \notin S} w(n)\,\exp\!\left(\frac{\mp\beta}{(K-s)(K-1)}\right) = \tag{A.2}
\]
\[
\begin{aligned}
= {} & \sum_{l_n \in S} w(n)\,\exp\!\left(\frac{-\beta}{s(K-1)}\right)
    + \sum_{l_n \notin S} w(n)\,\exp\!\left(\frac{-\beta}{(K-s)(K-1)}\right) \\
& + \left[\exp\!\left(\frac{\beta}{s(K-1)}\right) - \exp\!\left(\frac{-\beta}{s(K-1)}\right)\right]
    \sum_{l_n \in S} w(n)\,I\big(\mathbf{y}_n^\top \mathbf{g}(\mathbf{x}_n) < 0\big) \\
& + \left[\exp\!\left(\frac{\beta}{(K-s)(K-1)}\right) - \exp\!\left(\frac{-\beta}{(K-s)(K-1)}\right)\right]
    \sum_{l_n \notin S} w(n)\,I\big(\mathbf{y}_n^\top \mathbf{g}(\mathbf{x}_n) < 0\big).
\end{aligned} \tag{A.3}
\]
The last expression is a sum of four terms. As can be seen, the first and second are constants,
while the third and fourth are the ones depending on g(x). The values in brackets are positive
constants; let us denote them B_1 and B_2, respectively. So minimizing the above expression
is equivalent to minimizing
\[
B_1 \sum_{l_n \in S} w(n)\,I\big(\mathbf{y}_n^\top \mathbf{g}(\mathbf{x}_n) < 0\big)
+ B_2 \sum_{l_n \notin S} w(n)\,I\big(\mathbf{y}_n^\top \mathbf{g}(\mathbf{x}_n) < 0\big). \tag{A.4}
\]
Hence the first point of the Lemma follows.
Now assume that g(x) and its error E on the training data are known. The error can be decomposed
into two parts:
\[
E = \sum_{l_n \in S} w(n)\,I\big(\mathbf{y}_n^\top \mathbf{g}(\mathbf{x}_n) < 0\big)
  + \sum_{l_n \notin S} w(n)\,I\big(\mathbf{y}_n^\top \mathbf{g}(\mathbf{x}_n) < 0\big)
  = E_1 + E_2. \tag{A.5}
\]
Expression (A.3) can now be written as
\[
\begin{aligned}
& A_1 \exp\!\left(\frac{-\beta}{s(K-1)}\right)
+ \left[\exp\!\left(\frac{\beta}{s(K-1)}\right) - \exp\!\left(\frac{-\beta}{s(K-1)}\right)\right] E_1 \\
& + A_2 \exp\!\left(\frac{-\beta}{(K-s)(K-1)}\right)
+ \left[\exp\!\left(\frac{\beta}{(K-s)(K-1)}\right) - \exp\!\left(\frac{-\beta}{(K-s)(K-1)}\right)\right] E_2,
\end{aligned} \tag{A.6}
\]
where A_1 = Σ_{l_n ∈ S} w(n) and A_2 = Σ_{l_n ∉ S} w(n). It can be easily verified that the above
expression is convex with respect to β. So, differentiating w.r.t. β, equating to zero and simplifying
terms, we get:
\[
\begin{aligned}
\frac{E_1}{s} \exp\!\left(\frac{\beta}{s(K-1)}\right)
+ \frac{E_2}{K-s} \exp\!\left(\frac{\beta}{(K-s)(K-1)}\right)
= {} & \frac{A_1 - E_1}{s} \exp\!\left(\frac{-\beta}{s(K-1)}\right) \\
& + \frac{A_2 - E_2}{K-s} \exp\!\left(\frac{-\beta}{(K-s)(K-1)}\right).
\end{aligned} \tag{A.7}
\]
There is no direct procedure to solve for β here. We propose the change of variable β = s(K −
s)(K − 1) log(x), with x > 0. This change transforms the last equation into the polynomial equation
\[
(K-s)E_1\, x^{(K-s)} + sE_2\, x^{s} - s(A_2-E_2)\, x^{-s} - (K-s)(A_1-E_1)\, x^{-(K-s)} = 0, \tag{A.8}
\]
or, equivalently, multiplying by x^{(K-s)},
\[
(K-s)E_1\, x^{2(K-s)} + sE_2\, x^{K} - s(A_2-E_2)\, x^{(K-2s)} - (K-s)(A_1-E_1) = 0. \tag{A.9}
\]
According to Descartes' rule of signs, the last polynomial has a single positive real root,
which proves the second point of the Lemma.
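As an illustration of how this root can be obtained in practice, the sketch below finds the positive root of (A.9) numerically and undoes the change of variable. It is only a sketch under the assumption 2s ≤ K (so all exponents in (A.9) are non-negative); the function name and the use of numpy.roots are our own choices.

import numpy as np

def piboost_beta(K, s, E1, E2, A1, A2):
    # Coefficients of (A.9), highest degree first; assumes 2*s <= K.
    n = 2 * (K - s)
    coeffs = np.zeros(n + 1)
    coeffs[0] += (K - s) * E1                     # x^{2(K-s)}
    coeffs[n - K] += s * E2                       # x^{K}
    coeffs[n - (K - 2 * s)] += -s * (A2 - E2)     # x^{K-2s}
    coeffs[n] += -(K - s) * (A1 - E1)             # constant term
    roots = np.roots(coeffs)
    x = max(r.real for r in roots if abs(r.imag) < 1e-8 and r.real > 0)
    return s * (K - s) * (K - 1) * np.log(x)      # undo beta = s(K-s)(K-1) log(x)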
A.3
Proof of Lemma 2
Assume that in the m-th iteration we have fitted a classifier f m (x) as an additive model and we
are searching for the parameters (β, g) to add in the next step, f m+1 (x) = f m (x) + βg(x).
\[
\sum_{n=1}^{N} \exp\Big(\mathbf{C}^*(l_n,-)\big(\mathbf{f}_m(\mathbf{x}_n) + \beta\,\mathbf{g}(\mathbf{x}_n)\big)\Big) \tag{A.10}
\]
\[
= \sum_{n=1}^{N} w(n)\,\exp\big(\beta\,\mathbf{C}^*(l_n,-)\,\mathbf{g}(\mathbf{x}_n)\big) \tag{A.11}
\]
\[
= \sum_{j=1}^{K} \sum_{\{n:\, l_n=j\}} w(n)\,\exp\big(\beta\,\mathbf{C}^*(j,-)\,\mathbf{g}(\mathbf{x}_n)\big) \tag{A.12}
\]
\[
= \sum_{j=1}^{K} \sum_{\{n:\, l_n=j\}} w(n)\,\exp\!\left(\frac{\beta K}{K-1}\, C^*\big(j, G(\mathbf{x}_n)\big)\right). \tag{A.13}
\]
In the last step we take into account the equivalence between vector-valued functions, g, and
label-valued functions, G. If S(j) = {n : l_n = G(x_n) = j} denotes the set of indices of correctly
classified instances with l_n = j, and F(j, k) = {n : l_n = j, G(x_n) = k} denotes the set of indices
where G classifies as k when the real label is j, then the above expression can be rewritten as
\[
\sum_{j=1}^{K} \left[\;\sum_{n \in S(j)} w(n)\,\exp\!\left(\frac{\beta K\, C^*(j,j)}{K-1}\right)
+ \sum_{k \neq j} \sum_{n \in F(j,k)} w(n)\,\exp\!\left(\frac{\beta K\, C^*(j,k)}{K-1}\right)\right] \tag{A.14}
\]
\[
= \sum_{j=1}^{K} \left[\,S_j \exp\!\left(\frac{\beta K\, C^*(j,j)}{K-1}\right)
+ \sum_{k \neq j} E_{j,k} \exp\!\left(\frac{\beta K\, C^*(j,k)}{K-1}\right)\right], \tag{A.15}
\]
Where Sj = n∈S (j) w(n) and Ej,k = n∈F (j,k) w(n). Taking into account that these constants
are positive values and also exp(βC ∗ (i, j)) > 0 (∀i, j ∈ L), we can omit the term K/(K − 1)
appearing in the exponents in order to address the minimization. Subsequently, the objective
function can be written:
\[
\sum_{j=1}^{K} \left[\,S_j \exp\big(\beta\, C^*(j,j)\big) + \sum_{k \neq j} E_{j,k} \exp\big(\beta\, C^*(j,k)\big)\right]. \tag{A.16}
\]
Now, for a fixed value β > 0, the optimal direction, g, can be found by minimizing
\[
\sum_{j=1}^{K} \Bigg[\,S_j \underbrace{\exp\big(\beta C^*(j,j)\big)}_{A(j,j)^{\beta}}
+ \sum_{k \neq j} E_{j,k} \underbrace{\exp\big(\beta C^*(j,k)\big)}_{A(j,k)^{\beta}}\Bigg] \tag{A.17}
\]
\[
= \sum_{j=1}^{K} \Bigg[\sum_{n \in S(j)} w(n)\,A(j,j)^{\beta}
+ \sum_{k \neq j} \sum_{n \in F(j,k)} w(n)\,A(j,k)^{\beta}\Bigg] \tag{A.18}
\]
\[
= \sum_{n=1}^{N} w(n)\left[A(l_n,l_n)^{\beta}\, I\big(G(\mathbf{x}_n)=l_n\big)
+ \sum_{k \neq l_n} A(l_n,k)^{\beta}\, I\big(G(\mathbf{x}_n)=k\big)\right]. \tag{A.19}
\]
Finally, if we assume that a direction, g, is known, then its weighted errors, E_{j,k}, and successes, S_j,
can be computed. So, differentiating (A.16) with respect to β (note that it is a convex function)
and equating to zero, we get
\[
\sum_{j=1}^{K} \left[\sum_{k \neq j} E_{j,k}\, C^*(j,k)\, \exp\big(\beta C^*(j,k)\big)
+ S_j\, C^*(j,j)\, \exp\big(\beta C^*(j,j)\big)\right] = 0 \tag{A.20}
\]
\[
\sum_{j=1}^{K} \sum_{k \neq j} E_{j,k}\, C^*(j,k)\, \exp\big(\beta C^*(j,k)\big)
= -\sum_{j=1}^{K} S_j\, C^*(j,j)\, \exp\big(\beta C^*(j,j)\big) \tag{A.21}
\]
\[
\sum_{j=1}^{K} \sum_{k \neq j} E_{j,k}\, C^*(j,k)\, A(j,k)^{\beta}
= -\sum_{j=1}^{K} S_j\, C^*(j,j)\, A(j,j)^{\beta} \tag{A.22}
\]
\[
\sum_{j=1}^{K} \sum_{k \neq j} E_{j,k}\, C(j,k)\, A(j,k)^{\beta}
= \sum_{j=1}^{K} \sum_{h=1}^{K} S_j\, C(j,h)\, A(j,j)^{\beta}. \tag{A.23}
\]
There is no direct procedure to solve this equation so we resort to iterative methods.
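As a sketch of such an iterative computation, and since (A.16) is convex in β, one can simply minimize it numerically once S_j, E_{j,k} and C* are available. The snippet below is only illustrative; the function name, the use of scipy.optimize.minimize_scalar and the search bound beta_max are our own choices.

import numpy as np
from scipy.optimize import minimize_scalar

def badacost_beta(S, E, Cstar, beta_max=50.0):
    # S[j]: weighted successes, E[j, k]: weighted errors, Cstar: the matrix C*.
    K = len(S)
    def loss(beta):
        total = 0.0
        for j in range(K):
            total += S[j] * np.exp(beta * Cstar[j, j])
            for k in range(K):
                if k != j:
                    total += E[j, k] * np.exp(beta * Cstar[j, k])
        return total
    res = minimize_scalar(loss, bounds=(0.0, beta_max), method='bounded')
    return res.x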
A.4
Proof of Corollary 1
As was commented in section 4.1.2 it is easy to see that when C is defined in the following
way:
\[
C(i,j) :=
\begin{cases}
0 & \text{for } i = j \\
\dfrac{1}{K(K-1)} & \text{for } i \neq j
\end{cases}
\qquad \forall\, i,j \in L, \tag{A.24}
\]
then a discrete vectorial weak learner, f, yields C*(l, −)f(x) = −1/(K − 1) for right classifications and C*(l, −)f(x) = 1/(K − 1)^2 for mistakes. Both quantities are in fact the values of
−(1/K) y^⊤ f(x) appearing in the exponent of the loss function in SAMME. Thus expression (A.16)
can be written as
\[
\sum_{j=1}^{K} \left[\,S_j \exp\!\left(\frac{-\beta}{K-1}\right)
+ \sum_{k \neq j} E_{j,k} \exp\!\left(\frac{\beta}{(K-1)^2}\right)\right] \tag{A.25}
\]
\[
= \underbrace{\left(\sum_{j=1}^{K} S_j\right)}_{S} \exp\!\left(\frac{-\beta}{K-1}\right)
+ \underbrace{\left(\sum_{j=1}^{K}\sum_{k \neq j} E_{j,k}\right)}_{E} \exp\!\left(\frac{\beta}{(K-1)^2}\right) \tag{A.26}
\]
\[
= S \exp\!\left(\frac{-\beta}{K-1}\right) + E \exp\!\left(\frac{\beta}{(K-1)^2}\right) \tag{A.27}
\]
\[
= (1-E) \exp\!\left(\frac{-\beta}{K-1}\right) + E \exp\!\left(\frac{\beta}{(K-1)^2}\right) \tag{A.28}
\]
\[
= \exp\!\left(\frac{-\beta}{K-1}\right)
+ E \left[\exp\!\left(\frac{\beta}{(K-1)^2}\right) - \exp\!\left(\frac{-\beta}{K-1}\right)\right]. \tag{A.29}
\]
So, regardless of the value of β, the above expression is minimized when
E = Σ_{n=1}^{N} w(n) I(G(x_n) ≠ l_n) is minimum.
For the second point of the corollary we just need to consider the above expression as a
function of β. Differentiating and equating to zero we get
\[
-\exp\!\left(\frac{-\beta}{K-1}\right)
+ E \left[\frac{1}{K-1}\exp\!\left(\frac{\beta}{(K-1)^2}\right) + \exp\!\left(\frac{-\beta}{K-1}\right)\right] = 0 \tag{A.30}
\]
\[
\frac{E}{K-1}\exp\!\left(\frac{\beta}{(K-1)^2}\right) = (1-E)\exp\!\left(\frac{-\beta}{K-1}\right) \tag{A.31}
\]
\[
\exp\!\left(\frac{K\beta}{(K-1)^2}\right) = \frac{(K-1)(1-E)}{E} \tag{A.32}
\]
and taking logarithms
\[
\frac{K\beta}{(K-1)^2} = \log\!\left(\frac{1-E}{E}\right) + \log\,(K-1) \tag{A.33}
\]
\[
\beta = \frac{(K-1)^2}{K}\left[\log\!\left(\frac{1-E}{E}\right) + \log\,(K-1)\right]. \tag{A.34}
\]
Hence the second point of the corollary follows.
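The closed form (A.34) is straightforward to compute; the short sketch below does so (the function name is ours, and E is assumed to lie strictly between 0 and 1). For K = 2 this expression reduces to (1/2) log((1 − E)/E), the familiar AdaBoost step.

import numpy as np

def samme_beta(E, K):
    # Closed-form step of (A.34) for the SAMME special case.
    return ((K - 1) ** 2 / K) * (np.log((1.0 - E) / E) + np.log(K - 1.0))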
A.5
Proof of Corollary 2
Given the (2 × 2) cost matrix, let C_1 = C(1, 2) and C_2 = C(2, 1) denote the non-diagonal
values. Expression (A.16) becomes:
\[
\sum_{j=1}^{2} \Big[\,S_j \exp(-\beta C_j) + \underbrace{E_{j,k}}_{k \neq j} \exp(\beta C_j)\Big] = \tag{A.35}
\]
\[
S_1 \exp(-\beta C_1) + E_{1,2}\exp(\beta C_1) + S_2 \exp(-\beta C_2) + E_{2,1}\exp(\beta C_2). \tag{A.36}
\]
Let us assume now that β > 0 is known. Using the Lemma the optimal discrete weak learner
minimizing the expected loss is
\[
\arg\min_{\mathbf{g}} \; e^{\beta C_1} E_{1,2} + e^{-\beta C_1} S_1 + e^{\beta C_2} E_{2,1} + e^{-\beta C_2} S_2. \tag{A.37}
\]
Changing the notation to T_1 = Σ_{n: l_n=1} w(n), T_2 = Σ_{n: l_n=2} w(n), E_{1,2} = b = T_1 − S_1 and
E_{2,1} = d = T_2 − S_2, this becomes
\[
\arg\min_{\mathbf{g}} \; e^{\beta C_1} b + e^{-\beta C_1}(T_1-b) + e^{\beta C_2} d + e^{-\beta C_2}(T_2-d) = \tag{A.38}
\]
\[
\arg\min_{\mathbf{g}} \; \big(e^{\beta C_1} - e^{-\beta C_1}\big)\, b + e^{-\beta C_1}\, T_1
+ \big(e^{\beta C_2} - e^{-\beta C_2}\big)\, d + e^{-\beta C_2}\, T_2. \tag{A.39}
\]
Besides, if we assume that the optimal weak learner, g, is known, then its weighted success/error rates
can be computed. We can find the best value of β using the Lemma. In this binary case the
following expression must be solved:
\[
E_{1,2}\, C_1\, e^{\beta C_1} + E_{2,1}\, C_2\, e^{\beta C_2}
= S_1\, C_1\, e^{-\beta C_1} + S_2\, C_2\, e^{-\beta C_2}. \tag{A.40}
\]
Again, using the notation T_1, T_2, b and d we get
\[
b\,C_1\, e^{\beta C_1} + d\,C_2\, e^{\beta C_2}
= (T_1-b)\,C_1\, e^{-\beta C_1} + (T_2-d)\,C_2\, e^{-\beta C_2} \tag{A.41}
\]
\[
b\,C_1 \big(e^{\beta C_1} + e^{-\beta C_1}\big) + d\,C_2 \big(e^{\beta C_2} + e^{-\beta C_2}\big)
= T_1\, C_1\, e^{-\beta C_1} + T_2\, C_2\, e^{-\beta C_2} \tag{A.42}
\]
\[
2\,C_1\, b \cosh(\beta C_1) + 2\,C_2\, d \cosh(\beta C_2)
= T_1\, C_1\, e^{-\beta C_1} + T_2\, C_2\, e^{-\beta C_2}, \tag{A.43}
\]
which proves the equivalence between both algorithms for binary problems.
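As an illustrative sketch, (A.40) can be solved numerically for β with any one-dimensional root finder, since its left-hand side minus its right-hand side is increasing in β. The snippet below uses scipy.optimize.brentq after bracketing the root by doubling; the function name is ours, and it assumes the weak learner does better than chance with respect to the costs, so that the root is positive.

import numpy as np
from scipy.optimize import brentq

def binary_cost_beta(S1, S2, E12, E21, C1, C2):
    # Residual of (A.40); it is increasing in beta.
    def h(beta):
        return (E12 * C1 * np.exp(beta * C1) + E21 * C2 * np.exp(beta * C2)
                - S1 * C1 * np.exp(-beta * C1) - S2 * C2 * np.exp(-beta * C2))
    upper = 1.0
    while h(upper) < 0:          # bracket the root
        upper *= 2.0
    return brentq(h, 0.0, upper)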
A.6
Proof of Corollary 3
Let S denote a subset of s labels of the problem. We can simplify the notation by using 1 or 2
to denote the presence or absence in the data of labels from S. The margin values are
y^⊤ f(x) = ±K/(s(K − 1)) when y ∈ S, and y^⊤ f(x) = ±K/((K − s)(K − 1)) when y ∉ S. In both
cases the sign is positive/negative in case of right/wrong classification. In turn the exponential
loss function, exp(−y^⊤ f(x)/K), yields exp(∓1/(s(K − 1))) and exp(∓1/((K − s)(K − 1))),
respectively.
Let C be the (2 × 2) cost matrix with non-diagonal values C(1, 2) = 1/(sK) and C(2, 1) = 1/((K − s)K).
This matrix produces cost-sensitive multi-class margins with the same values of the loss function.
Thus we can apply our result to this binary cost-sensitive sub-problem; in particular, we can
apply Corollary 2 directly.
Let β > 0 be known. Substituting in expression (A.39), we get the optimal weak learner, g, by
solving
\[
\begin{aligned}
\arg\min_{\mathbf{g}} \;
& \left[\exp\!\left(\frac{\beta}{s(K-1)}\right) - \exp\!\left(\frac{-\beta}{s(K-1)}\right)\right] E_1
+ A_1 \exp\!\left(\frac{-\beta}{s(K-1)}\right) \\
& + \left[\exp\!\left(\frac{\beta}{(K-s)(K-1)}\right) - \exp\!\left(\frac{-\beta}{(K-s)(K-1)}\right)\right] E_2
+ A_2 \exp\!\left(\frac{-\beta}{(K-s)(K-1)}\right),
\end{aligned} \tag{A.44}
\]
where A_1 = Σ_{n: l_n=1} w(n), A_2 = Σ_{n: l_n=2} w(n), E_1 = Σ_{n: g(x_n)≠l_n=1} w(n) and
E_2 = Σ_{n: g(x_n)≠l_n=2} w(n).
If we assume that the optimal classification direction, g, is known, and hence its weighted errors
and successes, then we can compute the optimal step β using (A.43) as the solution to
\[
\begin{aligned}
\frac{2E_1}{s}\cosh\!\left(\frac{\beta}{s(K-1)}\right)
+ \frac{2E_2}{K-s}\cosh\!\left(\frac{\beta}{(K-s)(K-1)}\right)
= {} & \frac{A_1}{s}\exp\!\left(\frac{-\beta}{s(K-1)}\right) \\
& + \frac{A_2}{K-s}\exp\!\left(\frac{-\beta}{(K-s)(K-1)}\right).
\end{aligned} \tag{A.45}
\]
Making the change of variable β = s(K − s)(K − 1) log x we get
\[
\frac{E_1}{s}\left(x^{(K-s)} + x^{-(K-s)}\right) + \frac{E_2}{K-s}\left(x^{s} + x^{-s}\right)
= \frac{A_1}{s}\, x^{-(K-s)} + \frac{A_2}{K-s}\, x^{-s}, \tag{A.46}
\]
which is equivalent to finding the only positive real root (by Descartes' rule of signs) of the following
polynomial:
\[
P(x) = E_1(K-s)\, x^{2(K-s)} + E_2\, s\, x^{K} - s(A_2-E_2)\, x^{(K-2s)} - (K-s)(A_1-E_1). \tag{A.47}
\]
Hence the Corollary follows.
Bibliography
[1] Naoki Abe, Bianca Zadrozny, and John Langford. An iterative method for multi-class
cost-sensitive learning. In International Conference on Knowledge Discovery and Data
Mining (KDD), pages 3–11, 2004.
[2] Alekh Agarwal. Selective sampling algorithms for cost-sensitive multiclass prediction. In
Proc. International Conference on Machine Learning (ICML), volume 28, pages 1220–
1228, 2013.
[3] Erin L. Allwein, Robert E. Schapire, Yoram Singer, and Pack Kaelbling. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning
Research, 1:113–141, 2000.
[4] Yonatan Amit, Ofer Dekel, and Yoram Singer. A boosting algorithm for label covering
in multilabel problems. Journal of Machine Learning Research, 2:27–34, 2007.
[5] Shumeet Baluja and Henry A. Rowley. Boosting sex identification performance. International Journal of Computer Vision, 71(1):111–119, 2007.
[6] C. Becker, K. Ali, G. Knott, and P. Fua. Learning context cues for synapse segmentation.
IEEE Transactions on Medical Imaging, 32(10):1864–1877, 2013.
[7] Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science
and Statistics. Springer-Verlag New York, Inc., 2006.
[8] Leo Breiman. Bagging predictors. Machine Learning, pages 123–140, 1996.
[9] Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.
[10] Peter Bühlmann and Bin Yu. Boosting with the l2 loss: Regression and classification.
Journal of the American Statistical Association, 98(462):324–339, 2003.
[11] Wen-Chung Chang and Chih-Wei Cho. Multi-class boosting with color-based haar-like
features. In Signal-Image Technologies and Internet-Based System (SITIS), pages 719–
726, 2007.
[12] Junli Chen, Xuezhong Zhou, and Zhaohui Wu. A multi-label chinese text categorization
system based on boosting algorithm. In Computer and Information Technology (CIT),
pages 1153–1158, 2004.
[13] Ke Chen and Shihai Wang. Semi-supervised learning via regularized boosting working
on multiple semi-supervised assumptions. Transactions on Pattern Analysis and Machine
Intelligence, 33(1):129–143, 2011.
[14] Corinna Cortes, Mehryar Mohri, and Umar Syed. Deep boosting. In Proc. International
Conference on Machine Learning (ICML), 2014.
[15] Janez Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of
Machine Learning Research, 7:1–30, 2006.
[16] Hongbo Deng, Jianke Zhu, Michael R. Lyu, and Irwin King. Two-stage multi-class
adaboost for facial expression recognition. In International Joint Conference on Neural
Networks (IJCNN), pages 3005–3010, 2007.
[17] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via
error-correcting output codes. Journal of Artificial Intelligence Research, pages 263–
286, 1995.
[18] Pedro Domingos. Metacost: A general method for making classifiers cost-sensitive. In
Proc. International Conference on Knowledge Discovery and Data Mining (KDD), pages
155–164, 1999.
[19] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley-Interscience, 2nd edition, 2000.
[20] Gunther Eibl and Karl-Peter Pfeiffer. Multiclass boosting for weak classifiers. Journal
of Machine Learning Research, 6:189–210, 2005.
[21] Charles Elkan. The foundations of cost-sensitive learning. In Proc. International Joint
Conference on Artificial Intelligence (IJCAI), pages 973–978, 2001.
[22] Andrea Esuli, Tiziano Fagni, and Fabrizio Sebastiani. Treeboost.mh: A boosting algorithm for multi-label hierarchical text categorization. String Processing and Information
Retrieval, pages 13–24, 2006.
[23] Wei Fan, Salvatore J. Stolfo, Junxin Zhang, and Philip K. Chan. Adacost: Misclassification cost-sensitive boosting. In Proc. International Conference on Machine Learning
(ICML), pages 97–105, 1999.
[24] Antonio Fernández-Baldera and Luis Baumela. Multi-class boosting with asymmetric
weak-learners. Pattern Recognition, 47(5):2080–2090, 2014.
[25] Antonio Fernández-Baldera, José M. Buenaposada, and Luis Baumela. Multi-class
boosting for imbalanced data. In Proc. Iberian Conference on Pattern Recognition and
Image Analysis (IbPRIA), pages 1–8, 2015.
[26] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In
Proc. International Conference on Machine Learning (ICML), pages 148–156, 1996.
[27] Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In
Conference on Computational Learning Theory, pages 325–332, 1996.
[28] Yoav Freund and Robert E. Schapire. A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[29] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a
statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.
[30] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical
Learning: Data Mining. Springer series in statistics. Springer, 2009.
[31] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine.
Annals of Statistics, 29:1189–1232, 2000.
[32] M. Friedman. The use of ranks to avoid the assumption of normality implicit in the
analysis of variance. Journal of the American Statistical Association, 32:675–701, 1937.
[33] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera. A review on
ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications
and Reviews, 42(4):463–484, July 2012.
[34] Tianshi Gao and Daphne Koller. Multiclass boosting with hinge loss based on output
coding. In Proc. International Conference on Machine Learning (ICML), 2011.
[35] Venkatesan Guruswami and Amit Sahai. Multiclass learning, boosting, and error-correcting codes. In Proc. Annual Conference on Computational Learning Theory
(COLT), pages 145–155, 1999.
[36] Zhihui Hao, Chunhua Shen, Nick Barnes, and Bo Wang. Totally-corrective multi-class
boosting. In Asian Conference on Computer Vision (ACCV), volume 6495, pages 269–
280, 2010.
[37] Trevor Hastie and Robert Tibshirani. Generalized Additive Models. Monographs on
Statistics and Applied Probability. Chapman and Hall, 1990.
[38] Haibo He and Edwardo A. Garcia. Learning from imbalanced data. IEEE Transactions
on Knowledge and Data Engineering, 21(9):1263–1284, 2009.
[39] Jian Huang, Seyda Ertekin, Yang Song, Hongyuan Zha, and C. Lee Giles. Efficient
multiclass boosting classification with active learning. In SIAM International Conference
on Data Mining. Society for Industrial and Applied Mathematics, 2007.
[40] R. L. Iman and J. M. Davenport. Approximation of the critical region of the friedman
statistic. Communications in Statistics, pages 571–595, 1980.
[41] Alan Julian Izenman. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer Publishing Company, Inc., 1 edition, 2008.
[42] Wei Jiang, Shih-Fu Chang, and Alexander C Loui. Kernel sharing with joint boosting
for multi-class concept detection. In Computer Vision and Pattern Recognition (CVPR),
pages 1–8, 2007.
[43] Matt Johnson and Roberto Cipolla. Improved image annotation and labelling through
multi-label boosting. In British Machine Vision Conference (BMVC), 2005.
[44] Michael Kearns and Leslie Valiant. Learning Boolean formulae or finite automata is as hard as factoring. Technical report, Harvard University, August 1988.
[45] Michael Kearns and Leslie Valiant. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM, 41(1):67–95, 1994.
[46] Jyrki Kivinen and Manfred K. Warmuth. Boosting as entropy projection. In Proc. Annual
Conference on Computational Learning Theory (COLT), pages 134–144, 1999.
[47] Ludmila I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience, 2004.
[48] Tae kyun Kim and Roberto Cipolla. Mcboost: Multiple classifier boosting for perceptual
co-clustering of images and visual features. In Advances in Neural Information Processing Systems (NIPS), pages 841–848, 2009.
[49] Iago Landesa-Vázquez and José Luis Alba-Castro. Shedding light on the asymmetric
learning capability of adaboost. Pattern Recognition Letters, 33(3):247–255, 2012.
[50] Iago Landesa-Vázquez and José Luis Alba-Castro. Double-base asymmetric adaboost.
Neurocomputing, 118:101–114, 2013.
[51] Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector machines:
theory and application to the classification of microarray data and satellite radiance data.
Journal of the American Statistical Association, 99:67–81, 2004.
[52] Leonidas Lefakis and Francois Fleuret. Joint cascade optimization using a product of
boosted classifiers. In Advances in Neural Information Processing Systems (NIPS), pages
1315–1323, 2010.
[53] Christian Leistner, Helmut Grabner, and Horst Bischof. Semi-supervised boosting using
visual similarity learning. In Computer Vision and Pattern Recognition (CVPR), 2008.
[54] Yen-Yu Lin and Tyng-Luh Liu. Robust face detection with multi-class boosting. In
Computer Vision and Pattern Recognition (CVPR), volume 1, pages 680–687, 2005.
[55] Li Liu, Ling Shao, and Peter Rockett. Boosted key-frame selection and correlated pyramidal motion-feature representation for human action recognition. Pattern Recognition,
46(7):1810–1818, 2013.
[56] Xu-Ying Liu and Zhi-Hua Zhou. Towards cost-sensitive learning for real-world applications. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)
Workshops, volume 7104, pages 494–505, 2012.
[57] Hung-Yi Lo, Ju-Chiang Wang, Hsin-Min Wang, and Shou-De Lin. Cost-sensitive multilabel learning for audio tag annotation and retrieval. IEEE Transactions on Multimedia,
13(3):518–529, 2011.
[58] Aurelie C. Lozano and Naoki Abe. Multi-class cost-sensitive boosting with p-norm
loss functions. In International Conference on Knowledge Discovery and Data Mining
(KDD), pages 506–514, 2008.
[59] Pavan Kumar Mallapragada, Rong Jin, Anil K. Jain, and Yi Liu. Semiboost: Boosting for
semi-supervised learning. Transactions on Pattern Analysis and Machine Intelligence,
31(11):2000–2014, 2009.
[60] H. Masnadi-Shirazi and N. Vasconcelos. Asymmetric boosting. In Proc. International
Conference on Machine Learning (ICML), pages 609–619, 2007.
[61] Hamed Masnadi-Shirazi, Vijay Mahadevan, and Nuno Vasconcelos. On the design of
robust classifiers for computer vision. In Computer Vision and Pattern Recognition
(CVPR), pages 779–786, 2010.
[62] Hamed Masnadi-Shirazi and Nuno Vasconcelos. On the design of loss functions for
classification: theory, robustness to outliers, and savageboost. In Advances in Neural
Information Processing Systems (NIPS), pages 1049–1056, 2008.
[63] Hamed Masnadi-Shirazi and Nuno Vasconcelos. Cost-sensitive boosting. Transactions
on Pattern Analysis and Machine Intelligence, 33:294–309, 2011.
[64] Hamed Masnadi-Shirazi, Nuno Vasconcelos, and Vijay Mahadevan. On the design of
robust classifiers for computer vision. In Computer Vision and Pattern Recognition
(CVPR), pages 779–786, 2010.
[65] Maciej A Mazurowski, Piotr A Habas, Jacek M Zurada, Joseph Y Lo, Jay A Baker,
and Georgia D Tourassi. Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural networks,
21(2):427–436, 2008.
[66] David Mease and Abraham Wyner. Evidence contrary to the statistical view of boosting.
Journal of Machine Learning Research, 9:131–156, 2008.
[67] I. Mukherjee and R. E. Schapire. A theory of multiclass boosting. In Advances in Neural
Information Processing Systems (NIPS), pages 1714–1722, 2010.
[68] P. B. Nemenyi. Distribution-free multiple comparisons. PhD thesis, Princeton University,
1963.
[69] Deirdre B. O’Brien, Maya R. Gupta, and Robert M. Gray. Cost-sensitive multi-class
classification from probability estimates. In Proc. International Conference on Machine
Learning (ICML), pages 712–719, 2008.
[70] A. Opelt, A. Pinz, and A. Zisserman. Learning an alphabet of shape and appearance
for multi-class object detection. International Journal of Computer Vision, 80(1):16–44,
2008.
[71] Mark D. Reid, Robert C. Williamson, and Peng Sun. The convexity and design of
composite multiclass losses. In Proc. International Conference on Machine Learning
(ICML), pages 687–694, 2012.
[72] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine
Learning Research, 5:101–141, 2004.
[73] Saharon Rosset. Robust boosting and its relation to bagging. In Proc. International
Conference on Knowledge Discovery in Data Mining (KDD), pages 249–255, 2005.
[74] Mohammad J. Saberian and Nuno Vasconcelos. Multiclass boosting: Theory and algorithms. In Advances in Neural Information Processing Systems (NIPS), 2011.
[75] Amir Saffari, Christian Leistner, and Horst Bischof. Regularized multi-class semisupervised boosting. In Computer Vision and Pattern Recognition (CVPR), pages 967–
974, 2009.
[76] Raúl Santos-Rodríguez, Alicia Guerrero-Curieses, Rocío Alaiz-Rodríguez, and Jesús
Cid-Sueiro. Cost-sensitive learning based on bregman divergences. Machine Learning,
76(2-3):271–285, 2009.
[77] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227,
1990.
[78] Robert E. Schapire. Using output codes to boost multiclass learning problems. In Proc.
International Conference on Machine Learning (ICML), pages 313–321, 1997.
[79] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin:
A new explanation for the effectiveness of voting methods. The Annals of Statistics,
26(5):1651–1686, 1998.
[80] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999.
[81] Robert E Schapire and Yoram Singer. Boostexter: A boosting-based system for text
categorization. Machine Learning, 39(2-3):135–168, 2000.
[82] Chunhua Shen, Junae Kim, Lei Wang, and Anton Hengel. Positive semidefinite metric
learning with boosting. In Advances in Neural Information Processing Systems (NIPS),
pages 1651–1659, 2009.
[83] Chunhua Shen, Junae Kim, Lei Wang, and Anton Van Den Hengel. Positive semidefinite
metric learning using boosting-like algorithms. Journal of Machine Learning Research,
13:1007–1036, 2012.
[84] Chunhua Shen and Hanxi Li. On the dual formulation of boosting algorithms. Transactions on Pattern Analysis and Machine Intelligence, 32:2216–2231, 2010.
[85] Yifan Shi, Aaron F. Bobick, and Irfan A. Essa. Learning temporal sequence model from
partially labeled data. In Computer Vision and Pattern Recognition (CVPR), pages 1631–
1638, 2006.
[86] Yanmin Sun, Mohamed S. Kamel, and Yang Wang. Boosting for learning multiple
classes with imbalanced class distribution. In Proc. International Conference on Data
Mining (ICDM), ICDM ’06, pages 592–602, 2006.
[87] Yanmin Sun, Mohamed S. Kamel, Andrew K. C. Wong, and Yang Wang. Cost-sensitive
boosting for classification of imbalanced data. Pattern Recognition, 40(12):3358–3378,
2007.
[88] Yanmin Sun, Andrew K. C. Wong, and Mohamed S. Kamel. Classification of imbalanced
data: a review. International Journal of Pattern Recognition and Artificial Intelligence,
23(04):687–719, 2009.
[89] Yanmin Sun, Andrew K. C. Wong, and Yang Wang. Parameter inference of cost-sensitive
boosting algorithms. In Proc. International Conference on Machine Learning and Data
Mining (MLDM), pages 21–30, 2005.
[90] Yijun Sun, Zhipeng Liu, Sinisa Todorovic, and Jian Li. Adaptive boosting for sar automatic target recognition. IEEE Transactions on Aerospace and Electronic Systems,
43(1):112–125, 2007.
[91] Yijun Sun, Sinisa Todorovic, and Jian Li. Unifying multi-class adaboost algorithms with
binary base learners under the margin framework. Pattern Recognition Letters, 28:631–
643, 2007.
[92] Matus J. Telgarsky. The fast convergence of boosting. In Advances in Neural Information
Processing Systems (NIPS), pages 1593–1601, 2011.
[93] Kai Ming Ting. A comparative study of cost-sensitive boosting algorithms. In Proc.
International Conference on Machine Learning (ICML), pages 983–990, 2000.
[94] Kai Ming Ting and Zijian Zheng. Boosting cost-sensitive trees. In Proc. International
Conference on Discovery Science (DS), pages 244–255, 1998.
[95] Antonio Torralba, Kevin P. Murphy, and William T. Freeman. Sharing features: Efficient
boosting procedures for multiclass object detection. In Computer Vision and Pattern
Recognition (CVPR), pages 762–769, 2004.
[96] Tomasz Trzcinski, Mario Christoudias, Vincent Lepetit, and Pascal Fua. Learning image descriptors with the boosting-trick. In Advances in Neural Information Processing
Systems 25, pages 269–277, 2012.
[97] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[98] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[99] Nuno Vasconcelos and Mohammad J. Saberian. Boosting classifier cascades. In Advances in Neural Information Processing Systems (NIPS), pages 2047–2055, 2010.
[100] Paul Viola and Michael J. Jones. Robust real-time face detection. International Journal
of Computer Vision, 57(2):137–154, 2004.
[101] Paul A. Viola and Michael J. Jones. Fast and robust classification using asymmetric
adaboost and a detector cascade. In Advances in Neural Information Processing Systems
(NIPS), pages 1311–1318, 2001.
[102] Junhui Wang. Boosting the generalized margin in cost-sensitive multiclass classification.
Journal of Computational and Graphical Statistics, 22(1):178–192, 2013.
[103] Shuo Wang and Xin Yao. Multiclass imbalance problems: Analysis and potential solutions. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics,
42(4):1119–1130, 2012.
[104] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and
Techniques. Morgan Kaufmann Series in Data Management Systems. Elsevier, 2005.
[105] Fei Wu, Yahong Han, Qi Tian, and Yueting Zhuang. Multi-label boosting for image annotation by structural grouping sparsity. In ACM International Conference on Multimedia
(ACM-MM), pages 15–24, 2010.
[106] Fen Xia, Yan-wu Yang, Liang Zhou, Fuxin Li, Min Cai, and Daniel D. Zeng. A closed-form reduction of multi-class cost-sensitive learning to weighted multi-class learning.
Pattern Recognition, 42(7):1572–1581, 2009.
[107] Xun Xu and Thomas S. Huang. Soda-boosting and its application to gender recognition.
In Analysis and Modeling of Faces and Gestures (AMFG), pages 193–204, 2007.
[108] Rong Yan, Jelena Tesic, and John R. Smith. Model-shared subspace boosting for multilabel classification. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 834–843, 2007.
[109] Jieping Ye. Least squares linear discriminant analysis. In Proc. International Conference
on Machine Learning (ICML), 2007.
[110] Yin Zhang and Zhi-Hua Zhou. Cost-sensitive face recognition. Transactions on Pattern
Analysis and Machine Intelligence, 32(10):1758–1769, 2010.
[111] Zhi-Hua Zhou and Xu-Ying Liu. On multi-class cost-sensitive learning. Computational
Intelligence, 26(3):232–257, 2010.
[112] Ji Zhu, Hui Zou, Saharon Rosset, and Trevor Hastie. Multi-class adaboost. Statistics and
Its Interface, 2:349–360, 2009.
[113] Hui Zou, Ji Zhu, and Trevor Hastie. The margin vector, admissible loss and multi-class
margin-based classifiers. Technical report, University of Minnesota, 2007.
[114] Hui Zou, Ji Zhu, and Trevor Hastie. New multicategory boosting algorithms based on
multicategory fisher-consistent losses. Annals of Applied Statistics, 2:1290–1306, 2008.