Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series

Ensemble Methods: Foundations and Algorithms
Zhi-Hua Zhou

“Professor Zhou’s book is a comprehensive introduction to ensemble methods in machine learning. It reviews the latest research in this exciting area. I learned a lot reading it!”
—Thomas G. Dietterich, Oregon State University, ACM Fellow, and founding president of the International Machine Learning Society

“This is a timely book. Right time and right book … with an authoritative but inclusive style that will allow many readers to gain knowledge on the topic.”
—Fabio Roli, University of Cagliari

An up-to-date, self-contained introduction to a state-of-the-art machine learning approach, Ensemble Methods: Foundations and Algorithms shows how these accurate methods are used in real-world tasks. It gives you the necessary groundwork to carry out further research in this evolving field.

Features
• Supplies the basics for readers unfamiliar with machine learning and pattern recognition
• Covers nearly all aspects of ensemble techniques such as combination methods and diversity generation methods
• Presents the theoretical foundations and extensions of many ensemble methods, including Boosting, Bagging, Random Trees, and Stacking
• Introduces the use of ensemble methods in computer vision, computer security, medical imaging, and famous data mining competitions
• Highlights future research directions
• Provides additional reading sections in each chapter and references at the back of the book
Ensemble Methods
Foundations and Algorithms
Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series
SERIES EDITORS
Ralf Herbrich and Thore Graepel
Microsoft Research Ltd.
Cambridge, UK
AIMS AND SCOPE
This series reflects the latest advances and applications in machine learning
and pattern recognition through the publication of a broad range of reference
works, textbooks, and handbooks. The inclusion of concrete examples, applications, and methods is highly encouraged. The scope of the series includes,
but is not limited to, titles in the areas of machine learning, pattern recognition, computational intelligence, robotics, computational/statistical learning
theory, natural language processing, computer vision, game AI, game theory,
neural networks, computational neuroscience, and other relevant topics, such
as machine learning applied to bioinformatics or cognitive science, which
might be proposed by potential contributors.
PUBLISHED TITLES
MACHINE LEARNING: An Algorithmic Perspective
Stephen Marsland
HANDBOOK OF NATURAL LANGUAGE PROCESSING,
Second Edition
Nitin Indurkhya and Fred J. Damerau
UTILITY-BASED LEARNING FROM DATA
Craig Friedman and Sven Sandow
A FIRST COURSE IN MACHINE LEARNING
Simon Rogers and Mark Girolami
COST-SENSITIVE MACHINE LEARNING
Balaji Krishnapuram, Shipeng Yu, and Bharat Rao
ENSEMBLE METHODS: FOUNDATIONS AND ALGORITHMS
Zhi-Hua Zhou
Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series
Ensemble Methods
Foundations and Algorithms
Zhi-Hua Zhou
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2012 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20120501
International Standard Book Number-13: 978-1-4398-3005-5 (eBook - PDF)
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
To my parents, wife and son.
Z.-H. Zhou
Preface
Ensemble methods that train multiple learners and then combine them for
use, with Boosting and Bagging as representatives, are a kind of state-of-the-art learning approach. It is well known that an ensemble is usually significantly more accurate than a single learner, and ensemble methods have
already achieved great success in many real-world tasks.
It is difficult to trace the starting point of the history of ensemble methods since the basic idea of deploying multiple models has been in use in
human society for a long time; however, it is clear that ensemble methods have become a hot topic since the 1990s, and researchers from various
fields such as machine learning, pattern recognition, data mining, neural
networks and statistics have explored ensemble methods from different aspects.
This book provides researchers, students and practitioners with an introduction to ensemble methods. The book consists of eight chapters which
naturally constitute three parts.
Part I is composed of Chapter 1. Though this book is mainly written for
readers with a basic knowledge of machine learning and pattern recognition, to enable readers who are unfamiliar with these fields to access the
main contents, Chapter 1 presents some “background knowledge” of ensemble methods. It is impossible to provide a detailed introduction to all
backgrounds in one chapter, and therefore this chapter serves mainly as a
guide to further study. This chapter also serves to explain the terminology
used in this book, to avoid confusion caused by other terminologies used
in different but relevant fields.
Part II is composed of Chapters 2 to 5 and presents “core knowledge”
of ensemble methods. Chapters 2 and 3 introduce Boosting and Bagging,
respectively. In addition to algorithms and theories, Chapter 2 introduces
multi-class extension and noise tolerance, since classic Boosting algorithms
are designed for binary classification, and are usually hurt seriously by
noise. Bagging is naturally a multi-class method and less sensitive to noise,
and therefore, Chapter 3 does not discuss these issues; instead, Chapter 3
devotes a section to Random Forest and some other random tree ensembles that can be viewed as variants of Bagging. Chapter 4 introduces combination methods. In addition to various averaging and voting schemes, the
Stacking method and some other combination methods as well as relevant
methods such as mixture of experts are introduced. Chapter 5 focuses on ensemble diversity. After introducing the error-ambiguity and bias-variance
decompositions, many diversity measures are presented, followed by recent advances in information theoretic diversity and diversity generation
methods.
Part III is composed of Chapters 6 to 8, and presents “advanced knowledge” of ensemble methods. Chapter 6 introduces ensemble pruning,
which tries to prune a trained ensemble to get a better performance. Chapter 7 introduces clustering ensembles, which try to generate better clustering results by combining multiple clusterings. Chapter 8 presents some
developments of ensemble methods in semi-supervised learning, active
learning, cost-sensitive learning and class-imbalance learning, as well as
comprehensibility enhancement.
It is not the goal of the book to cover all relevant knowledge of ensemble
methods. Ambitious readers may be interested in Further Reading sections
for further information.
Two other books [Kuncheva, 2004, Rokach, 2010] on ensemble methods
have been published before this one. To reflect the fast development of this
field, I have attempted to present an updated and in-depth overview. However, when writing this book, I found this task more challenging than expected. Despite abundant research on ensemble methods, a thorough understanding of many essentials is still needed, and there is a lack of thorough empirical comparisons of many technical developments. As a consequence, several chapters of the book simply introduce a number of algorithms, while even for chapters with discussions on theoretical issues,
there are still important yet unclear problems. On one hand, this reflects
the still developing situation of the ensemble methods field; on the other
hand, such a situation provides a good opportunity for further research.
The book could not have been written, at least not in its current form,
without the help of many people. I am grateful to Tom Dietterich who has
carefully read the whole book and given very detailed and insightful comments and suggestions. I want to thank Songcan Chen, Nan Li, Xu-Ying
Liu, Fabio Roli, Jianxin Wu, Yang Yu and Min-Ling Zhang for helpful comments. I also want to thank Randi Cohen and her colleagues at Chapman &
Hall/CRC Press for cooperation.
Last, but definitely not least, I am indebted to my family, friends and students for their patience, support and encouragement.
Zhi-Hua Zhou
Nanjing, China
Notations

x                variable
x                vector
A                matrix
I                identity matrix
X, Y             input and output spaces
D                probability distribution
D                data sample (data set)
N                normal distribution
U                uniform distribution
H                hypothesis space
H                set of hypotheses
h(·)             hypothesis (learner)
L                learning algorithm
p(·)             probability density function
p(· | ·)         conditional probability density function
P(·)             probability mass function
P(· | ·)         conditional probability mass function
E_{·∼D}[f(·)]    mathematical expectation of function f(·) with respect to · under distribution D; D and/or · is omitted when the meaning is clear
var_{·∼D}[f(·)]  variance of function f(·) with respect to · under distribution D
I(·)             indicator function, which takes 1 if · is true and 0 otherwise
sign(·)          sign function, which takes −1, 1 and 0 when · < 0, · > 0 and · = 0, respectively
err(·)           error function
{. . .}          set
(. . .)          row vector
(. . .)          column vector
| · |            size of data set
‖ · ‖            L2-norm
Contents

Preface

Notations

1 Introduction
  1.1 Basic Concepts
  1.2 Popular Learning Algorithms
    1.2.1 Linear Discriminant Analysis
    1.2.2 Decision Trees
    1.2.3 Neural Networks
    1.2.4 Naïve Bayes Classifier
    1.2.5 k-Nearest Neighbor
    1.2.6 Support Vector Machines and Kernel Methods
  1.3 Evaluation and Comparison
  1.4 Ensemble Methods
  1.5 Applications of Ensemble Methods
  1.6 Further Readings

2 Boosting
  2.1 A General Boosting Procedure
  2.2 The AdaBoost Algorithm
  2.3 Illustrative Examples
  2.4 Theoretical Issues
    2.4.1 Initial Analysis
    2.4.2 Margin Explanation
    2.4.3 Statistical View
  2.5 Multiclass Extension
  2.6 Noise Tolerance
  2.7 Further Readings

3 Bagging
  3.1 Two Ensemble Paradigms
  3.2 The Bagging Algorithm
  3.3 Illustrative Examples
  3.4 Theoretical Issues
  3.5 Random Tree Ensembles
    3.5.1 Random Forest
    3.5.2 Spectrum of Randomization
    3.5.3 Random Tree Ensembles for Density Estimation
    3.5.4 Random Tree Ensembles for Anomaly Detection
  3.6 Further Readings

4 Combination Methods
  4.1 Benefits of Combination
  4.2 Averaging
    4.2.1 Simple Averaging
    4.2.2 Weighted Averaging
  4.3 Voting
    4.3.1 Majority Voting
    4.3.2 Plurality Voting
    4.3.3 Weighted Voting
    4.3.4 Soft Voting
    4.3.5 Theoretical Issues
  4.4 Combining by Learning
    4.4.1 Stacking
    4.4.2 Infinite Ensemble
  4.5 Other Combination Methods
    4.5.1 Algebraic Methods
    4.5.2 Behavior Knowledge Space Method
    4.5.3 Decision Template Method
  4.6 Relevant Methods
    4.6.1 Error-Correcting Output Codes
    4.6.2 Dynamic Classifier Selection
    4.6.3 Mixture of Experts
  4.7 Further Readings

5 Diversity
  5.1 Ensemble Diversity
  5.2 Error Decomposition
    5.2.1 Error-Ambiguity Decomposition
    5.2.2 Bias-Variance-Covariance Decomposition
  5.3 Diversity Measures
    5.3.1 Pairwise Measures
    5.3.2 Non-Pairwise Measures
    5.3.3 Summary and Visualization
    5.3.4 Limitation of Diversity Measures
  5.4 Information Theoretic Diversity
    5.4.1 Information Theory and Ensemble
    5.4.2 Interaction Information Diversity
    5.4.3 Multi-Information Diversity
    5.4.4 Estimation Method
  5.5 Diversity Generation
  5.6 Further Readings

6 Ensemble Pruning
  6.1 What Is Ensemble Pruning
  6.2 Many Could Be Better Than All
  6.3 Categorization of Pruning Methods
  6.4 Ordering-Based Pruning
  6.5 Clustering-Based Pruning
  6.6 Optimization-Based Pruning
    6.6.1 Heuristic Optimization Pruning
    6.6.2 Mathematical Programming Pruning
    6.6.3 Probabilistic Pruning
  6.7 Further Readings

7 Clustering Ensembles
  7.1 Clustering
    7.1.1 Clustering Methods
    7.1.2 Clustering Evaluation
    7.1.3 Why Clustering Ensembles
  7.2 Categorization of Clustering Ensemble Methods
  7.3 Similarity-Based Methods
  7.4 Graph-Based Methods
  7.5 Relabeling-Based Methods
  7.6 Transformation-Based Methods
  7.7 Further Readings

8 Advanced Topics
  8.1 Semi-Supervised Learning
    8.1.1 Usefulness of Unlabeled Data
    8.1.2 Semi-Supervised Learning with Ensembles
  8.2 Active Learning
    8.2.1 Usefulness of Human Intervention
    8.2.2 Active Learning with Ensembles
  8.3 Cost-Sensitive Learning
    8.3.1 Learning with Unequal Costs
    8.3.2 Ensemble Methods for Cost-Sensitive Learning
  8.4 Class-Imbalance Learning
    8.4.1 Learning with Class Imbalance
    8.4.2 Performance Evaluation with Class Imbalance
    8.4.3 Ensemble Methods for Class-Imbalance Learning
  8.5 Improving Comprehensibility
    8.5.1 Reduction of Ensemble to Single Model
    8.5.2 Rule Extraction from Ensembles
    8.5.3 Visualization of Ensembles
  8.6 Future Directions of Ensembles
  8.7 Further Readings

References

Index
1 Introduction
1.1 Basic Concepts
One major task of machine learning, pattern recognition and data mining
is to construct good models from data sets.
A “data set” generally consists of feature vectors, where each feature vector is a description of an object by using a set of features. For example,
take a look at the synthetic three-Gaussians data set as shown in Figure 1.1.
Here, each object is a data point described by the features x-coordinate, y-coordinate and shape, and a feature vector looks like (.5, .8, cross) or (.4, .5, circle). The number of features of a data set is called dimension or dimensionality; for example, the dimensionality of the above data set is three. Features
are also called attributes, a feature vector is also called an instance, and
sometimes a data set is called a sample.
FIGURE 1.1: The synthetic three-Gaussians data set.
A “model” is usually a predictive model or a model of the structure of
the data that we want to construct or discover from the data set, such as
a decision tree, a neural network, a support vector machine, etc. The process of generating models from data is called learning or training, which
is accomplished by a learning algorithm. The learned model can be called
a hypothesis, and in this book it is also called a learner. There are different learning settings, among which the most common ones are supervised
learning and unsupervised learning. In supervised learning, the goal is to
predict the value of a target feature on unseen instances, and the learned
model is also called a predictor. For example, if we want to predict the
shape of the three-Gaussians data points, we call “cross” and “circle” labels, and the predictor should be able to predict the label of an instance
for which the label information is unknown, e.g., (.2, .3). If the label is categorical, such as shape, the task is also called classification and the learner
is also called classifier; if the label is numerical, such as x-coordinate, the
task is also called regression and the learner is also called fitted regression
model. For both cases, the training process is conducted on data sets containing label information, and an instance with known label is also called
an example. In binary classification, generally we use “positive” and “negative” to denote the two class labels. Unsupervised learning does not rely on
label information, the goal of which is to discover some inherent distribution information in the data. A typical task is clustering, aiming to discover
the cluster structure of data points. In most of this book we will focus on supervised learning, especially classification. We will introduce some popular
learning algorithms briefly in Section 1.2.
Basically, whether a model is “good” depends on whether it can meet the
requirements of the user or not. Different users might have different expectations of the learning results, and it is difficult to know the “right expectation” before the concerned task has been tackled. A popular strategy is
to evaluate and estimate the performance of the models, and then let the
user decide whether a model is acceptable, or choose the best available
model from a set of candidates. Since the fundamental goal of learning is
generalization, i.e., being capable of generalizing the “knowledge” learned
from training data to unseen instances, a good learner should generalize
well, i.e., have a small generalization error, also called the prediction error. It is infeasible, however, to estimate the generalization error directly,
since that requires knowing the ground-truth label information which is
unknown for unseen instances. A typical empirical process is to let the predictor make predictions on test data of which the ground-truth labels are
known, and take the test error as an estimate of the generalization error.
The process of applying a learned model to unseen data is called testing.
Before testing, a learned model often needs to be configured, e.g., tuning
the parameters, and this process also involves the use of data with known
ground-truth labels to evaluate the learning performance; this is called validation and the data is validation data. Generally, the test data should not
overlap with the training and validation data; otherwise the estimated performance can be over-optimistic. More introduction on performance evaluation will be given in Section 1.3.
A formal formulation of the learning process is as follows: Denote X as
the instance space, D as a distribution over X , and f the ground-truth target function. Given a training data set D = {(x1 , y1 ), (x2 , y2 ), . . . , (xm , ym )},
where the instances xi are drawn i.i.d. (independently and identically distributed) from D and yi = f (xi ), taking classification as an example, the
goal is to construct a learner h which minimizes the generalization error
err(h) = E_{x∼D} [ I(h(x) ≠ f(x)) ].    (1.1)
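As a concrete illustration of the empirical procedure described above, the following minimal sketch (not code from the book) estimates err(h) by the test error on held-out data; scikit-learn, the iris data set and a decision tree learner are arbitrary choices assumed here purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A minimal sketch: learn h on the training split and estimate its
# generalization error (1.1) by the error rate on the held-out test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

h = DecisionTreeClassifier().fit(X_train, y_train)    # learn h from training data
test_error = np.mean(h.predict(X_test) != y_test)     # average of I(h(x) != ground truth) over test data
print("estimated generalization error:", test_error)
```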
1.2 Popular Learning Algorithms
1.2.1 Linear Discriminant Analysis
A linear classifier consists of a weight vector w and a bias b. Given an
instance x, the predicted class label y is obtained according to
y = sign(wᵀx + b).    (1.2)
The classification process is accomplished by two steps. First, the instance
space is mapped onto a one-dimensional space (i.e., a line) through the
weight vector w; then, a point on the line is identified to separate the positive instances from negative ones.
To find the best w and b for separating different classes, a classical linear
learning algorithm is Fisher’s linear discriminant analysis (LDA). Briefly, the
idea of LDA is to enable instances of different classes to be far away while
instances within the same class to be close; this can be accomplished by
making the distance between centers of different classes large while keeping the variance within each class small.
Given a two-class training set, we consider all the positive instances, and
obtain the mean μ+ and the covariance matrix Σ+ ; similarly, we consider all
the negative instances, and obtain the mean μ− and the covariance matrix
Σ− . The distance between the projected class centers is measured as
SB(w) = (wᵀμ+ − wᵀμ−)²,    (1.3)

and the variance within classes is measured as

SW(w) = wᵀΣ+w + wᵀΣ−w.    (1.4)

LDA combines these two measures by maximizing

J(w) = SB(w)/SW(w),    (1.5)

of which the optimal solution has a closed form

w∗ = (Σ+ + Σ−)⁻¹(μ+ − μ−).    (1.6)

After obtaining w, it is easy to calculate the bias b. The simplest way is to let b be the middle point between the projected centers, i.e.,

b∗ = wᵀ(μ+ + μ−)/2,    (1.7)

which is optimal when the two classes are from normal distributions sharing the same variance.
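The closed-form solution above is straightforward to implement. The following minimal NumPy sketch (not code from the book) computes w and the threshold b from (1.6) and (1.7); the function names and the use of sample covariance matrices are illustrative assumptions.

```python
import numpy as np

def lda_fit(X_pos, X_neg):
    # X_pos, X_neg: (n_samples, n_features) arrays of the two classes
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    cov_pos = np.cov(X_pos, rowvar=False)
    cov_neg = np.cov(X_neg, rowvar=False)
    # w = (Sigma_+ + Sigma_-)^{-1} (mu_+ - mu_-), solved without an explicit inverse
    w = np.linalg.solve(cov_pos + cov_neg, mu_pos - mu_neg)
    b = w @ (mu_pos + mu_neg) / 2.0          # projected midpoint, used as the threshold
    return w, b

def lda_predict(X, w, b):
    # +1 if the projection lies on the positive side of the threshold, else -1
    return np.where(X @ w - b > 0, 1, -1)
```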
Figure 1.2 illustrates the decision boundary of an LDA classifier.
FIGURE 1.2: Decision boundary of LDA on the three-Gaussians data set.
1.2.2 Decision Trees
A decision tree consists of a set of tree-structured decision tests working
in a divide-and-conquer way. Each non-leaf node is associated with a feature test also called a split; data falling into the node will be split into different subsets according to their different values on the feature test. Each leaf
node is associated with a label, which will be assigned to instances falling
into this node. In prediction, a series of feature tests is conducted starting
from the root node, and the result is obtained when a leaf node is reached.
Take Figure 1.3 as an example. The classification process starts by testing
whether the value of the feature y-coordinate is larger than 0.73; if so, the instance is classified as “cross”, and otherwise the tree tests whether the feature value of x-coordinate is larger than 0.64; if so, the instance is classified as
“cross” and otherwise is classified as “circle”.
Decision tree learning algorithms are generally recursive processes. In
each step, a data set is given and a split is selected, then this split is used to
divide the data set into subsets, and each subset is considered as the given
data set for the next step. The key of a decision tree algorithm is how to
select the splits.
FIGURE 1.3: An example of a decision tree.
In the ID3 algorithm [Quinlan, 1998], the information gain criterion is
employed for split selection. Given a training set D, the entropy of D is defined as
Ent(D) = − ∑_{y∈Y} P(y|D) log P(y|D).    (1.8)
If the training set D is divided into subsets D1 , . . . , Dk , the entropy may be
reduced, and the amount of the reduction is the information gain, i.e.,
G(D; D1, . . . , Dk) = Ent(D) − ∑_{i=1}^{k} (|Di|/|D|) Ent(Di).    (1.9)
Thus, the feature-value pair which will cause the largest information gain is
selected for the split.
One problem with the information gain criterion is that features with a
lot of possible values will be favored, disregarding their relevance to classification. For example, suppose we are dealing with binary classification
and each instance has a unique “id”, and if the “id” is considered as a feature, the information gain of taking this feature as split would be quite large
since this split will classify every training instance correctly; however, it cannot generalize and thus will be useless for making prediction on unseen instances.
This deficiency of the information gain criterion is addressed in C4.5
[Quinlan, 1993], the most famous decision tree algorithm. C4.5 employs the
gain ratio
P(D; D1, . . . , Dk) = G(D; D1, . . . , Dk) · ( − ∑_{i=1}^{k} (|Di|/|D|) log (|Di|/|D|) )⁻¹,    (1.10)
which is a variant of the information gain criterion, taking normalization
on the number of feature values. In practice, the feature with the highest
gain ratio, among features with better-than-average information gains, is
selected as the split.
CART [Breiman et al., 1984] is another famous decision tree algorithm,
which uses the Gini index for selecting the split, maximizing

Ggini(D; D1, . . . , Dk) = I(D) − ∑_{i=1}^{k} (|Di|/|D|) I(Di),    (1.11)

where

I(D) = 1 − ∑_{y∈Y} P(y | D)².    (1.12)
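The split criteria above can be computed directly from label counts. The following minimal sketch (not code from the book) evaluates the information gain (1.9) and the Gini gain (1.11) of a candidate split; base-2 entropy and the toy labels are illustrative choices.

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))                   # Ent(D), eq. (1.8)

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)                      # I(D), eq. (1.12)

def information_gain(y, subsets):
    # Ent(D) minus the size-weighted entropy of the subsets, eq. (1.9)
    weighted = sum(len(s) / len(y) * entropy(s) for s in subsets)
    return entropy(y) - weighted

def gini_gain(y, subsets):
    # I(D) minus the size-weighted Gini impurity of the subsets, eq. (1.11)
    weighted = sum(len(s) / len(y) * gini(s) for s in subsets)
    return gini(y) - weighted

# example: a binary split of ten labels
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
left, right = y[:4], y[4:]
print(information_gain(y, [left, right]), gini_gain(y, [left, right]))
```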
It is often observed that a decision tree, which is perfect on the training
set, will have a worse generalization ability than a tree which is not-so-good
on the training set; this is called overfitting which may be caused by the fact
that some peculiarities of the training data, such as those caused by noise in
collecting training examples, are misleadingly recognized by the learner as
the underlying truth. To reduce the risk of overfitting, a general strategy is to
employ pruning to cut off some tree branches caused by noise or peculiarities of the training set. Pre-pruning tries to prune branches when the tree
is being grown, while post-pruning re-examines fully grown trees to decide which branches should be removed. When a validation set is available,
the tree can be pruned according to the validation error: for pre-pruning,
a branch will not be grown if the validation error will increase by growing
the branch; for post-pruning, a branch will be removed if the removal will
decrease the validation error.
Early decision tree algorithms, such as ID3, could only deal with categorical features. Later ones, such as C4.5 and CART, are able to deal
with numerical features. The simplest way is to evaluate every possible split
point on the numerical feature that divides the training set into two subsets,
where one subset contains instances with the feature value smaller than the
split point while the other subset contains the remaining instances.
When the height of a decision tree is limited to 1, i.e., it takes only one
test to make every prediction, the tree is called a decision stump. While decision trees are nonlinear classifiers in general, decision stumps are a kind
of linear classifier.
Figure 1.4 illustrates the decision boundary of a typical decision tree.
1.2.3 Neural Networks
Neural networks, also called artificial neural networks, originated from
simulating biological neural networks. The function of a neural network is
determined by the model of neuron, the network structure, and the learning algorithm.
A neuron, also called a unit, is the basic computational component in neural networks. The most popular neuron model, i.e., the McCulloch-Pitts model (M-P model), is illustrated in Figure 1.5(a).

FIGURE 1.4: Decision boundary of a typical decision tree on the three-Gaussians data set.

In this model, input
signals are multiplied with corresponding connection weights at first, and
then signals are aggregated and compared with a threshold, also called bias
of the neuron. If the aggregated signal is larger than the bias, the neuron will
be activated and the output signal is generated by an activation function,
also called transfer function or squashing function.
Neurons are linked by weighted connections to form a network. There are
many possible network structures, among which the most popular one is
the multi-layer feed-forward network, as illustrated in Figure 1.5(b). Here
the neurons are connected layer-by-layer, and there are neither in-layer
connections nor cross-layer connections. There is an input layer which receives input feature vectors, where each neuron usually corresponds to one
element of the feature vector. The activation function for input neurons
is usually set as f (x) = x. There is an output layer which outputs labels,
where each neuron usually corresponds to a possible label, or an element of
a label vector. The layers between the input and output layers are called hidden layers. The hidden neurons and output neurons are functional units,
and a popular activation function for them is the sigmoid function
f(x) = 1 / (1 + e^{−x}).    (1.13)
Although one may use a network with many hidden layers, the most popular setting is to use one or two hidden layers, since it is known that a
feed-forward neural network with one hidden layer is already able to approximate any continuous function, and more complicated algorithms are
needed to prevent networks with many hidden layers from suffering from
problems such as divergence (i.e., the networks do not converge to a stable
state).
The goal of training a neural network is to determine the values of the
connection weights and the biases of the neurons.

FIGURE 1.5: Illustration of (a) a neuron, and (b) a neural network.

Once these values are
decided, the function computed by the neural network is decided. There
are many neural network learning algorithms. The most commonly applied
idea for training a multi-layer feed-forward neural network is that, as long
as the activation function is differentiable, the whole neural network can be
regarded as a differentiable function which can be optimized by gradient
descent method.
The most successful algorithm, Back-Propagation (BP) [Werbos, 1974,
Rumelhart et al., 1986], works as follows. At first, the inputs are feed-forwarded from the input layer via the hidden layer to the output layer, at
which the error is calculated by comparing the network output with the
ground-truth. Then, the error will be back propagated to the hidden layer
and the input layer, during which the connection weights and biases are adjusted to reduce the error. The adjustment is accomplished by tuning the weights and biases in the direction of the negative gradient. Such a process will be repeated in many
rounds, until the training error is minimized or the training process is terminated to avoid overfitting.
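For illustration, the following minimal sketch (not code from the book) performs one BP-style update for a network with a single hidden layer of sigmoid units, using the squared error on one example; the layer sizes, learning rate and random initialization are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                      # input feature vector
y = 1.0                                     # ground-truth target
W1 = rng.normal(scale=0.1, size=(4, 3))     # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(scale=0.1, size=4)          # hidden -> output weights
b2 = 0.0
eta = 0.5                                   # learning rate

# forward pass: propagate the input through hidden layer to the output
h = sigmoid(W1 @ x + b1)                    # hidden activations
o = sigmoid(W2 @ h + b2)                    # network output

# backward pass: propagate the error signals and adjust the weights
# along the negative gradient of the squared error (o - y)^2 / 2
delta_o = (o - y) * o * (1 - o)             # output error signal
delta_h = delta_o * W2 * h * (1 - h)        # hidden error signals (uses old W2)
W2 -= eta * delta_o * h
b2 -= eta * delta_o
W1 -= eta * np.outer(delta_h, x)
b1 -= eta * delta_h
```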
1.2.4 Naïve Bayes Classifier
To classify a test instance x, one approach is to formulate a probabilistic
model to estimate the posterior probability P (y | x) of different y’s, and
predict the one with the largest posterior probability; this is the maximum
a posterior (MAP) rule. By Bayes Theorem, we have
P(y | x) = P(x | y) P(y) / P(x),    (1.14)
where P (y) can be estimated by counting the proportion of class y in the
training set, and P (x) can be ignored since we are comparing different y’s
on the same x. Thus we only need to consider P (x | y). If we can get an
accurate estimate of P (x | y), we will get the best classifier in theory from
the given training data, that is, the Bayes optimal classifier with the Bayes
error rate, the smallest error rate in theory. However, estimating P (x | y)
is not straightforward, since it involves the estimation of exponential numbers of joint-probabilities of the features. To make the estimation tractable,
some assumptions are needed.
The naïve Bayes classifier assumes that, given the class label, the n features are independent of each other within each class. Thus, we have

P(x | y) = ∏_{i=1}^{n} P(xi | y),    (1.15)
which implies that we only need to estimate each feature value in each class
in order to estimate the conditional probability, and therefore the calculation of joint-probabilities is avoided.
In the training stage, the naïve Bayes classifier estimates the probabilities
P (y) for all classes y ∈ Y, and P (xi | y) for all features i = 1, . . . , n and all
feature values xi from the training set. In the test stage, a test instance x will
be predicted with label y if y leads to the largest value of
P(y | x) ∝ P(y) ∏_{i=1}^{n} P(xi | y)    (1.16)
among all the class labels.
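The counting-based estimation above is simple to implement for categorical features. The following minimal sketch (not code from the book) fits the priors P(y) and the conditionals P(xi | y) from a training set and predicts by (1.16); the function names, the use of log-probabilities and the Laplace smoothing are illustrative assumptions.

```python
import numpy as np
from collections import Counter, defaultdict

def nb_fit(X, y):
    # Estimate P(y) and the per-feature value counts within each class.
    classes = sorted(set(y))
    priors = {c: sum(1 for label in y if label == c) / len(y) for c in classes}
    counts = {c: defaultdict(Counter) for c in classes}   # counts[c][i][v]
    class_sizes = Counter(y)
    for xi, yi in zip(X, y):
        for i, v in enumerate(xi):
            counts[yi][i][v] += 1
    return classes, priors, counts, class_sizes

def nb_predict(x, classes, priors, counts, class_sizes):
    def cond(c, i, v):
        # P(xi = v | y = c), with Laplace smoothing to avoid zero probabilities
        n_values = len(counts[c][i]) + 1
        return (counts[c][i][v] + 1) / (class_sizes[c] + n_values)
    # pick the label maximizing log P(y) + sum_i log P(xi | y), i.e., eq. (1.16)
    scores = {c: np.log(priors[c]) + sum(np.log(cond(c, i, v)) for i, v in enumerate(x))
              for c in classes}
    return max(scores, key=scores.get)

# usage on a toy categorical data set
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
y = ["no", "no", "yes", "yes"]
model = nb_fit(X, y)
print(nb_predict(("rainy", "mild"), *model))   # -> "yes"
```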
1.2.5 k-Nearest Neighbor
The k-nearest neighbor (kNN) algorithm relies on the principle that objects similar in the input space are also similar in the output space. It is a
lazy learning approach since it does not have an explicit training process,
but simply stores the training set instead. For a test instance, a k-nearest
neighbor learner identifies the k instances from the training set that are
closest to the test instance. Then, for classification, the test instance will
be classified to the majority class among the k instances; while for regression, the test instance will be assigned the average value of the k instances.
Figure 1.6(a) illustrates how to classify an instance by a 3-nearest neighbor
classifier. Figure 1.6(b) shows the decision boundary of a 1-nearest neighbor classifier, also called the nearest neighbor classifier.
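The following minimal NumPy sketch (not code from the book) implements the classification rule just described: store the training set and predict by majority vote among the k closest instances; Euclidean distance and the toy data are illustrative choices.

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    dists = np.linalg.norm(X_train - x_test, axis=1)   # distance to every stored instance
    nearest = np.argsort(dists)[:k]                    # indices of the k closest instances
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                   # majority class among the k

# usage on toy data
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.85, 0.85]), k=3))  # -> 1
```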
1.2.6 Support Vector Machines and Kernel Methods
Support vector machines (SVMs) [Cristianini and Shawe-Taylor, 2000],
originally designed for binary classification, are large margin classifiers
FIGURE 1.6: Illustration of (a) how a k-nearest neighbor classifier predicts
on a test instance, and (b) the decision boundary of the nearest neighbor
classifier on the three-Gaussians data set.
that try to separate instances of different classes with the maximum margin hyperplane. The margin is defined as the minimum distance from instances of different classes to the classification hyperplane.
Considering a linear classifier y = sign(wᵀx + b), or abbreviated as (w, b), we can use the hinge loss to evaluate the fitness to the data:

∑_{i=1}^{m} max{0, 1 − yi(wᵀxi + b)}.    (1.17)
The Euclidean distance from an instance xi to the hyperplane wᵀx + b = 0 is

|wᵀxi + b| / ‖w‖.    (1.18)

If we restrict |wᵀxi + b| ≥ 1 for all instances, the minimum distance to the hyperplane is ‖w‖⁻¹. Therefore, SVMs maximize ‖w‖⁻¹.
Thus, SVMs solve the optimization problem
(w∗, b∗) = arg min_{w,b,ξi}  ‖w‖²/2 + C ∑_{i=1}^{m} ξi    (1.19)

    s.t.  yi(wᵀxi + b) ≥ 1 − ξi  (∀i = 1, . . . , m),
          ξi ≥ 0  (∀i = 1, . . . , m),
where C is a parameter and ξi ’s are slack variables introduced to enable the
learner to deal with data that could not be perfectly separated, such as data
with noise. An illustration of an SVM is shown in Figure 1.7.
FIGURE 1.7: Illustration of SVM.
(1.19) is called the primal form of the optimization. The dual form, which
gives the same optimal solution, is
α∗ = arg max_α  ∑_{i=1}^{m} αi − (1/2) ∑_{i=1}^{m} ∑_{j=1}^{m} αi αj yi yj ⟨xi, xj⟩    (1.20)

    s.t.  ∑_{i=1}^{m} αi yi = 0,
          αi ≥ 0  (∀i = 1, . . . , m),
where ⟨·, ·⟩ is the inner product. The solution w∗ of the primal form is now presented as

w∗ = ∑_{i=1}^{m} α∗i yi xi,    (1.21)

and the inner product between w∗ and an instance x can be calculated as

⟨w∗, x⟩ = ∑_{i=1}^{m} α∗i yi ⟨xi, x⟩.    (1.22)
A limitation of the linear classifiers is that, when the data is intrinsically
nonlinear, linear classifiers cannot separate the classes well. In such cases,
a general approach is to map the data points onto a higher-dimensional
feature space where the data linearly non-separable in the original feature
space become linearly separable. However, the learning process may become very slow and even intractable since the inner product will be difficult
to calculate in the high-dimensional space.
Fortunately, there is a class of functions, kernel functions (also called
kernels), which can help address the problem. The feature space derived by
kernel functions is called the Reproducing Kernel Hilbert Space (RKHS).
An inner product in the RKHS equals the kernel function evaluated on the corresponding instances in the original lower-dimensional feature space. In other words,

K(xi, xj) = ⟨φ(xi), φ(xj)⟩    (1.23)
for all xi’s, where φ is a mapping from the original feature space to a higher-dimensional space and K is a kernel. Thus, we can simply replace the inner
products in the dual form of the optimization by the kernel.
According to Mercer’s Theorem [Cristianini and Shawe-Taylor, 2000], every positive semi-definite symmetric function is a kernel. Popular kernels
include the linear kernel
K(xi, xj) = ⟨xi, xj⟩,    (1.24)

the polynomial kernel

K(xi, xj) = ⟨xi, xj⟩^d,    (1.25)

where d is the degree of the polynomial, and the Gaussian kernel (or called RBF kernel)

K(xi, xj) = exp( −‖xi − xj‖² / (2σ²) ),    (1.26)

where σ is the parameter of the Gaussian width.
The kernel trick, i.e., mapping the data points with a kernel and then accomplishing the learning task in the RKHS, is a general strategy that can be
incorporated into any learning algorithm that considers only inner products between the input feature vectors. Once the kernel trick is used, the
learning algorithms are called kernel methods. Indeed, SVMs are a special
kind of kernel method, i.e., linear classifiers facilitated with kernel trick.
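For illustration, the following minimal sketch (not code from the book) implements the kernels (1.24)-(1.26); the degree and width values are arbitrary assumptions, and the final comment indicates how such a kernel would replace the inner products in (1.20)-(1.22).

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj                                             # eq. (1.24)

def polynomial_kernel(xi, xj, d=2):
    return (xi @ xj) ** d                                      # eq. (1.25)

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))   # eq. (1.26)

# the kernel trick: wherever a learning algorithm uses <xi, xj>, plug in K(xi, xj)
xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(xi, xj), polynomial_kernel(xi, xj), gaussian_kernel(xi, xj))
```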
1.3 Evaluation and Comparison
Usually, we have multiple alternative learning algorithms to choose
among, and a number of parameters to tune. The task of choosing the best
algorithm and the settings of its parameters is known as model selection,
and for this purpose we need to estimate the performance of the learner. Empirically, this involves the design of experiments and statistical hypothesis tests for comparing the models.
It is unwise to estimate the generalization error of a learner by its training error, i.e., the error that the learner makes on the training data, since
training error prefers complex learners rather than learners that generalize well. Usually, a learner with very high complexity can have zero training
error, such as a fully grown decision tree; however, it is likely to perform
badly on unseen data due to overfitting. A proper process is to evaluate the
performance on a validation set. Note that the labels in the training set and
validation set are known a priori to the training process, and should be used
together to derive and tune the final learner once the model has been selected.
In fact, in most cases the training and validation sets are obtained by
splitting a given data set into two parts. While splitting, the properties of
the original data set should be kept as much as possible; otherwise the validation set may provide misleading estimates. In an extreme example, the
training set might contain only positive instances while the validation set
contains only negative instances. In classification, when the original data
set is split randomly, the class percentage should be maintained for both
training and validation sets; this is called stratification, or stratified sampling.
When there is not enough labeled data available to create a separate validation set, a commonly used validation method is cross-validation. In k-fold cross-validation, the original data set is partitioned by stratified split into k equal-size disjoint subsets, D1, . . . , Dk, and then k runs of training and testing are performed. In the ith run, Di is used as the validation set while the union of all the other subsets, i.e., ∪_{j≠i} Dj, is used as the training set. The average results of the k runs are taken as the results of the cross-validation. To reduce the influence of randomness introduced by data split, the k-fold cross-validation can be repeated t times, which is called t-times k-fold cross-validation. Usual configurations include 10-times 10-fold cross-validation, and 5-times 2-fold cross-validation suggested by Dietterich [1998]. In the extreme case where k equals the number of instances in the
original data set, there is only one instance in each validation set; this is
called leave-one-out (LOO) validation.
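For illustration, the following minimal sketch (not code from the book) runs 10-fold stratified cross-validation, assuming scikit-learn is available; the decision tree learner and the iris data set are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# 10-fold stratified cross-validation: each fold serves once as validation set.
X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
errors = []
for train_idx, val_idx in skf.split(X, y):
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    errors.append(np.mean(clf.predict(X[val_idx]) != y[val_idx]))
print("estimated error:", np.mean(errors))
```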
After obtaining the estimated errors, we can compare different learning
algorithms. A simple comparison on average errors, however, is not reliable
since the winning algorithm may occasionally perform well due to the randomness in data split. Hypothesis tests are usually employed for this purpose.
To compare learning algorithms that are efficient enough to run 10 times,
the 5 × 2 cv paired t-test is a good choice [Dietterich, 1998]. In this test, we
run 5-times 2-fold cross-validation. In each run of 2-fold cross-validation,
the data set D is randomly split into two subsets D1 and D2 of equal size.
Two algorithms a and b are trained on each set and tested on the other, resulting in four error estimates: err_a^(1) and err_b^(1) (trained on D1 and tested on D2) and err_a^(2) and err_b^(2) (trained on D2 and tested on D1). We have the error differences

d^(i) = err_a^(i) − err_b^(i)  (i = 1, 2)    (1.27)
with the mean and the variance, respectively:

μ = (d^(1) + d^(2))/2,    (1.28)

s² = (d^(1) − μ)² + (d^(2) − μ)².    (1.29)
Let s_i² denote the variance in the ith time 2-fold cross-validation, and d_1^(1) denote the error difference in the first time. Under the null hypothesis, the 5×2 cv t̃-statistic

t̃ = d_1^(1) / √( (1/5) ∑_{i=1}^{5} s_i² )  ∼  t5,    (1.30)
would be distributed according to the Student’s t-distribution with 5 degrees of freedom. We then choose a significance level α. If t̃ falls into the
interval [−t5 (α/2), t5 (α/2)], the null hypothesis is accepted, suggesting that
there is no significant difference between the two algorithms. Usually α is
set to 0.05 or 0.1.
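For illustration, the following minimal sketch (not code from the book) computes the 5×2 cv t̃-statistic (1.30) from the validation errors of two algorithms; the array layout (one row per 2-fold cross-validation) is an assumed convention.

```python
import numpy as np

def five_by_two_cv_t(err_a, err_b):
    # err_a, err_b: 5x2 arrays; row i holds the two fold errors of the
    # ith 2-fold cross-validation for algorithms a and b, respectively.
    d = err_a - err_b                       # error differences d^(i), per fold
    mu = d.mean(axis=1, keepdims=True)      # per-repetition mean, eq. (1.28)
    s2 = ((d - mu) ** 2).sum(axis=1)        # per-repetition variance, eq. (1.29)
    return d[0, 0] / np.sqrt(s2.mean())     # eq. (1.30): compare against t_5

# the null hypothesis (no difference) is rejected if |t| exceeds t_5(alpha/2)
```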
To compare learning algorithms that can be run only once, McNemar's test can be used instead [Dietterich, 1998]. Let err01 denote the number of instances on which the first algorithm makes a wrong prediction
while the second algorithm is correct, and err10 denotes the inverse. If the
two algorithms have the same performance, err01 is close to err10 , and
therefore, the quantity
(|err01 − err10| − 1)² / (err01 + err10)  ∼  χ²_1    (1.31)

would be distributed according to the χ² distribution.
Sometimes, we evaluate multiple learning algorithms on multiple data
sets. In this situation, we can conduct the Friedman test [Demšar, 2006].
First, we sort the algorithms on each data set according to their average
errors. On each data set, the best algorithm is assigned rank 1, the worse
algorithms are assigned increased ranks, and average ranks are assigned in
case of ties. Then, we average the ranks of each algorithm over all data sets,
and use the Nemenyi post-hoc test [Demšar, 2006] to calculate the critical
difference value
CD = qα √( k(k + 1) / (6N) ),    (1.32)
where k is the number of algorithms, N is the number of data sets and qα
is the critical value [Demšar, 2006]. A pair of algorithms are believed to be
significantly different if the difference of their average ranks is larger than
the critical difference.
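For illustration, the following minimal sketch (not code from the book) computes the critical difference value (1.32); the critical value qα is passed in because it must be taken from published tables [Demšar, 2006] rather than computed here.

```python
import numpy as np

def critical_difference(q_alpha, k, N):
    # k algorithms compared on N data sets; q_alpha from the Nemenyi table
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))

# two algorithms differ significantly if their average ranks differ by more than CD
```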
The Friedman test results can be visualized by plotting the critical difference diagram, as illustrated in Figure 1.8, where each algorithm corresponds to a bar centered at its average rank with a width equal to the critical difference value.

FIGURE 1.8: Illustration of critical difference diagram.

Figure 1.8 discloses that the algorithm A is significantly better
than all the other algorithms, the algorithm D is significantly worse than
all the other algorithms, and the algorithms B and C are not significantly
different, according to the given significance level.
1.4 Ensemble Methods
Ensemble methods train multiple learners to solve the same problem. In
contrast to ordinary learning approaches which try to construct one learner
from training data, ensemble methods try to construct a set of learners and
combine them. Ensemble learning is also called committee-based learning, or learning multiple classifier systems.
Figure 1.9 shows a common ensemble architecture. An ensemble contains a number of learners called base learners. Base learners are usually
generated from training data by a base learning algorithm which can be
decision tree, neural network or other kinds of learning algorithms. Most
ensemble methods use a single base learning algorithm to produce homogeneous base learners, i.e., learners of the same type, leading to homogeneous ensembles, but there are also some methods which use multiple
learning algorithms to produce heterogeneous learners, i.e., learners of different types, leading to heterogeneous ensembles. In the latter case there
is no single base learning algorithm and thus, some people prefer calling
the learners individual learners or component learners to base learners.
The generalization ability of an ensemble is often much stronger than
that of base learners. Actually, ensemble methods are appealing mainly because they are able to boost weak learners which are even just slightly better than random guess to strong learners which can make very accurate
predictions. So, base learners are also referred to as weak learners.
FIGURE 1.9: A common ensemble architecture: an input x is fed to learner 1, learner 2, . . . , learner n, and their outputs are combined to produce the final output y.
It is difficult to trace the starting point of the history of ensemble methods
since the basic idea of deploying multiple models has been in use in human
society for a long time. For example, even earlier than the introduction of
Occam’s razor, the common basic assumption of scientific research which
prefers simple hypotheses to complex ones when both fit empirical observations well, the Greek philosopher Epicurus (341 - 270 B.C.) introduced the
principle of multiple explanations [Asmis, 1984], which advocates keeping
all hypotheses that are consistent with empirical observations.
There are three threads of early contributions that led to the current area
of ensemble methods; that is, combining classifiers, ensembles of weak
learners and mixture of experts. Combining classifiers was mostly studied
in the pattern recognition community. In this thread, researchers generally
work on strong classifiers, and try to design powerful combining rules to get
stronger combined classifiers. As a consequence, this thread of work has
accumulated deep understanding on the design and use of different combining rules. Ensembles of weak learners was mostly studied in the machine
learning community. In this thread, researchers often work on weak learners and try to design powerful algorithms to boost the performance from
weak to strong. This thread of work has led to the birth of famous ensemble
methods such as AdaBoost, Bagging, etc., and theoretical understanding on
why and how weak learners can be boosted to strong ones. Mixture of experts was mostly studied in the neural networks community. In this thread,
researchers generally consider a divide-and-conquer strategy, try to learn
a mixture of parametric models jointly and use combining rules to get an
overall solution.
Ensemble methods have become a major learning paradigm since the
1990s, with great promotion by two pieces of pioneering work. One is empirical [Hansen and Salamon, 1990], in which it was found that predictions
made by the combination of a set of classifiers are often more accurate
than predictions made by the best single classifier. A simplified illustration
is shown in Figure 1.10. The other is theoretical [Schapire, 1990], in which
it was proved that weak learners can be boosted to strong learners. Since
strong learners are desirable yet difficult to get, while weak learners are easy
to obtain in real practice, this result opens a promising direction of generating strong learners by ensemble methods.

FIGURE 1.10: A simplified illustration of Hansen and Salamon [1990]’s observation: Ensemble is often better than the best single. (The plot shows error against noise level for the average of the individual classifiers, the best single classifier, and their combination.)
Generally, an ensemble is constructed in two steps, i.e., generating the
base learners, and then combining them. To get a good ensemble, it is generally believed that the base learners should be as accurate as possible, and
as diverse as possible.
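For illustration, the following minimal sketch (not code from the book) follows this two-step recipe: it generates base learners on bootstrap samples and combines them by majority voting; scikit-learn decision trees, the iris data and 25 base learners are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# step 1: generate base learners, here on bootstrap samples of the training set
rng = np.random.default_rng(0)
learners = []
for _ in range(25):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))   # bootstrap sample indices
    learners.append(DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx]))

# step 2: combine the base learners, here by majority voting on their predictions
votes = np.array([h.predict(X_te) for h in learners])            # shape (25, n_test)
majority = np.array([np.bincount(col).argmax() for col in votes.T])
print("ensemble error:", np.mean(majority != y_te))
```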
It is worth mentioning that generally, the computational cost of constructing an ensemble is not much larger than creating a single learner. This
is because when we want to use a single learner, we usually need to generate multiple versions of the learner for model selection or parameter tuning; this is comparable to generating base learners in ensembles, while the
computational cost for combining base learners is often small since most
combination strategies are simple.
1.5 Applications of Ensemble Methods
The KDD-Cup¹ is the most famous data mining competition. Since 1997,
it is held every year and attracts the interests of data mining teams all over
the world. The competition problems cover a large variety of practical tasks,
such as network intrusion detection (1999), molecular bioactivity & protein
locale prediction (2001), pulmonary embolisms detection (2006), customer
relationship management (2009), educational data mining (2010), music
recommendation (2011), etc. In the past KDD-Cup competitions, among
various techniques utilized in the solutions, ensemble methods have drawn
the most attention and won the competitions for the most times. For example, in KDD-Cups of the last three years (2009-2011), all the first-place and
second-place winners used ensemble methods.
¹ http://www.sigkdd.org/kddcup/.
Another famous competition, the Netflix Prize,² is held by the online
DVD-rental service Netflix and seeks to improve the accuracy of predictions about how much someone is going to enjoy a movie based on their
preferences; if one participating team improves Netflix’s own algorithm by
10% accuracy, they would win the grand prize of $1,000,000. On September 21, 2009, Netflix awarded the $1M grand prize to the team BellKor’s
Pragmatic Chaos, whose solution was based on combining various classifiers including asymmetric factor models, regression models, restricted
Boltzmann machines, matrix factorization, k-nearest neighbor, etc. Another team, which achieved the winning performance but was defeated because the result was submitted 20 minutes later, even used The Ensemble as
the team name.
In addition to the impressive results in competitions, ensemble methods have been successfully applied to diverse real-world tasks. Indeed, they
have been found useful in almost all places where learning techniques are
exploited. For example, computer vision has benefited much from ensemble methods in almost all branches such as object detection, recognition
and tracking.
Viola and Jones [2001, 2004] proposed a general object detection framework by combining AdaBoost with a cascade architecture. Viola and Jones
[2004] reported that, on a 466MHz machine, the face detector spent only
0.067 seconds for a 384×288 image; this is almost 15 times faster than state-of-the-art face detectors, while the detection accuracy is comparable. This
framework was recognized as one of the most exciting breakthroughs in
computer vision (especially, face detection) during the past decade.
Huang et al. [2000] designed an ensemble architecture for pose-invariant
face recognition, particularly for recognizing faces with in-depth rotations.
The basic idea is to combine a number of view-specific neural networks
with a specially designed combination module. In contrast to conventional
techniques which require pose information as input, this framework does
not need pose information and it can even output pose estimation in addition to the recognition result. Huang et al. [2000] reported that this framework even outperformed conventional techniques facilitated with perfect
pose information. A similar method was later applied to multi-view face
detection [Li et al., 2001].
Object tracking aims to assign consistent labels to the target objects in
consecutive frames of a video. By considering tracking as a binary classification problem, Avidan [2007] proposed ensemble tracking, which trains
an ensemble online to distinguish between the object and the background.
This framework constantly updates a set of weak classifiers, which can
be added or removed at any time to incorporate new information about
changes in object appearance and the background. Avidan [2007] showed
² http://www.netflixprize.com/.
that the ensemble tracking framework could work in a large variety of
videos with various object size, and it runs very efficiently, at a few frames
per second without optimization, hence can be used in online applications.
Ensemble methods have been found very appropriate to characterize
computer security problems because each activity performed on computer
systems can be observed at multiple abstraction levels, and the relevant
information may be collected from multiple information sources [Corona
et al., 2009].
Giacinto et al. [2003] applied ensemble methods to intrusion detection.
Considering that there are different types of features characterizing the
connection, they constructed an ensemble from each type of features independently, and then combined the outputs from these ensembles to produce the final decision. Giacinto et al. [2003] reported that, when detecting
known attacks, ensemble methods led to the best performance. Later, Giacinto et al. [2008] proposed an ensemble method for anomaly-based intrusion detection which is able to detect intrusions never seen before.
Malicious executables are programs designed to perform a malicious
function without the owner’s permission, and they generally fall into three
categories, i.e., viruses, worms, and Trojan horses. Schultz et al. [2001] proposed an ensemble method to detect previously unseen malicious executables automatically, based on representing the programs using binary profiling, string sequences and hex dumps. Kolter and Maloof [2006] represented
programs using n-grams of byte codes, and reported that boosted decision
trees achieved the best performance; they also suggested that this method
could be used as the basis for an operational system for detecting new malicious executables never seen before.
Ensemble methods have been found very useful in diverse tasks of computer aided medical diagnosis, particularly for increasing the diagnosis reliability.
Zhou et al. [2002a] designed a two-layered ensemble architecture for lung
cancer cell identification, where the first layer predicts benign cases if and
only if all component learners agree, and otherwise the case will be passed
to the second layer to make a further decision among benign and different
cancer types. Zhou et al. [2002a] reported that the two-layered ensemble
results in a high identification rate with a low false-negative identification
rate.
For early diagnosis of Alzheimer’s disease, previous methods generally
considered single channel data from the EEG (electroencephalogram). To
make use of multiple data channels, Polikar et al. [2008] proposed an ensemble method where the component learners are trained on different data
sources obtained from different electrodes in response to different stimuli
and in different frequency bands, and their outputs are combined for the
final diagnosis.
In addition to computer vision, computer security and computer aided
medical diagnosis, ensemble methods have also been applied to many
other domains and tasks such as credit card fraud detection [Chan et al.,
1999, Panigrahi et al., 2009], bankruptcy prediction [West et al., 2005],
protein structure classification [Tan et al., 2003, Shen and Chou, 2006],
species distributions forecasting [Araújo and New, 2007], weather forecasting [Maqsood et al., 2004, Gneiting and Raftery, 2005], electric load forecasting [Taylor and Buizza, 2002], aircraft engine fault diagnosis [Goebel
et al., 2000, Yan and Xue, 2008], musical genre and artist classification
[Bergstra et al., 2006], etc.
1.6 Further Readings
There are good textbooks on machine learning [Mitchell, 1997, Alpaydin,
2010, Bishop, 2006, Hastie et al., 2001], pattern recognition [Duda et al.,
2000, Theodoridis and Koutroumbas, 2009, Ripley, 1996, Bishop, 1995] and
data mining [Han and Kamber, 2006, Tan et al., 2006, Hand et al., 2001].
More introductory materials can be found in these books.
Linear discriminant analysis is closely related to principal component
analysis (PCA) [Jolliffe, 2002], both looking for linear combinations of features to represent the data. LDA is a supervised approach focusing on distinguishing between different classes, while PCA is an unsupervised approach generally used to identify the directions of largest variability. Decision trees can be mapped to a set of “if-then” rules [Quinlan, 1993]. Most decision trees use splits like “x ≥ 1” or “y ≥ 2”, leading to axis-parallel partitions of the instance space. There are also exceptions, e.g., oblique decision trees [Murthy et al., 1994], which use splits like “x + y ≥ 3”, leading to non-axis-parallel partitions. The BP algorithm is the most popular and most successful neural network learning algorithm. It has many variants, and can
also be used to train neural networks whose structures are different from
feed-forward networks, such as recurrent neural networks where there
are cross-layer connections. Haykin [1998] provides a good introduction to
neural networks. Though the nearest neighbor algorithm is very simple, it
works well in most cases. The error of the nearest neighbor classifier is guaranteed to be no worse than twice the Bayes error rate on infinite data [Cover and Hart, 1967], and kNN approaches the Bayes error rate for some k value which is related to the amount of data. The distances between instances need not be calculated by the Euclidean distance, and
the contributions from different neighbors can be weighted. More information on kNN can be found in [Dasarathy, 1991]. The naïve Bayes classifier based on the conditional independence assumption works well in most cases [Domingos and Pazzani, 1997]; however, it is believed that the performance can be improved further by relaxing the assumption, and therefore
many semi-naïve Bayes classifiers such as TAN [Friedman et al., 1997] and LBR [Zheng and Webb, 2000] have been developed. A particularly successful one is AODE [Webb et al., 2005], which incorporates an ensemble mechanism and often beats TAN and LBR, especially on intermediate-size data sets. SVMs are rooted in statistical learning theory [Vapnik, 1998].
More introductory materials on SVMs and kernel methods can be found in
[Cristianini and Shawe-Taylor, 2000, Schölkopf et al., 1999].
Introductory materials on hypothesis tests can be found in [Fleiss, 1981].
Different hypothesis tests are usually based on different assumptions, and
should be applied in different situations. The 10-fold cross-validation t-test
has been widely used; however, Dietterich [1998] showed that such a test underestimates the variability and is likely to detect a difference when no difference exists (i.e., commit a type I error), while the 5×2cv paired
t-test is recommended instead.
The No Free Lunch Theorem [Wolpert, 1996, Wolpert and Macready,
1997] implies that it is hopeless to look for a learning algorithm which is consistently better than all other learning algorithms. It is important to notice, however, that the No Free Lunch Theorem considers the whole problem space, that is, all possible learning tasks, while in real practice we are usually only interested in a given task; in such a situation, the effort of trying to find the best algorithm is worthwhile. From the experience of the author of this book, for many tasks the best off-the-shelf learning technique at present is an ensemble method such as Random Forest combined with feature engineering, which usually constructs a large number of new features rather than simply working on the original features.
[Kuncheva, 2004] and [Rokach, 2010] are books on ensemble methods.
Xu and Amari [2009] discuss the relation between combining classifiers and
mixture of experts. The MCS workshop (International Workshop on Multiple
Classifier Systems) is the major forum in this area. Abundant literature on
ensemble methods can also be found in various journals and conferences
on machine learning, pattern recognition and data mining.
2 Boosting
2.1 A General Boosting Procedure
The term boosting refers to a family of algorithms that are able to convert
weak learners to strong learners. Intuitively, a weak learner is just slightly
better than random guess, while a strong learner is very close to perfect performance. Boosting algorithms originated from the answer to an interesting theoretical question posed by Kearns and Valiant [1989], that is, whether the two complexity classes of weakly learnable and strongly learnable problems are equal. This question is of fundamental importance, since if
the answer is positive, any weak learner is potentially able to be boosted to a
strong learner, particularly if we note that in real practice it is generally very
easy to obtain weak learners but difficult to get strong learners. Schapire
[1990] proved that the answer is positive, and the proof is a construction,
i.e., boosting.
The general boosting procedure is quite simple. Suppose the weak
learner will work on any data distribution it is given, and take the binary
classification task as an example; that is, we are trying to classify instances
as positive and negative. The training instances in space X are drawn i.i.d.
from distribution D, and the ground-truth function is f . Suppose the space
X is composed of three parts X1, X2 and X3, each taking 1/3 of the distribution mass, and a learner working by random guessing has 50% classification
error on this problem. We want to get an accurate (e.g., zero error) classifier on the problem, but we are unlucky and only have a weak classifier at
hand, which only has correct classifications in spaces X1 and X2 and has
wrong classifications in X3 , thus has 1/3 classification error. Let’s denote
this weak classifier as h1 . It is obvious that h1 is not desired.
The idea of boosting is to correct the mistakes made by h1 . We can try to
derive a new distribution D′ from D, which makes the mistakes of h1 more evident, e.g., it focuses more on the instances in X3. Then, we can train a classifier h2 from D′. Again, suppose we are unlucky and h2 is also a weak
classifier, which has correct classifications in X1 and X3 and has wrong classifications in X2 . By combining h1 and h2 in an appropriate way (we will
explain how to combine them in the next section), the combined classifier
will have correct classifications in X1 , and maybe some errors in X2 and X3 .
Input: Sample distribution D;
       Base learning algorithm L;
       Number of learning rounds T.
Process:
  1. D1 = D.                                % Initialize distribution
  2. for t = 1, ..., T:
  3.     ht = L(Dt);                        % Train a weak learner from distribution Dt
  4.     εt = P_{x∼Dt}(ht(x) ≠ f(x));       % Evaluate the error of ht
  5.     Dt+1 = Adjust_Distribution(Dt, εt)
  6. end
Output: H(x) = Combine_Outputs({h1(x), ..., hT(x)})

FIGURE 2.1: A general boosting procedure
Again, we derive a new distribution D′′ to make the mistakes of the combined classifier more evident, and train a classifier h3 from the distribution,
so that h3 has correct classifications in X2 and X3 . Then, by combining h1 ,
h2 and h3 , we have a perfect classifier, since in each space of X1 , X2 and X3 ,
at least two classifiers make correct classifications.
Briefly, boosting works by training a set of learners sequentially and combining them for prediction, where the later learners focus more on the mistakes of the earlier learners. Figure 2.1 summarizes the general boosting
procedure.
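To make this skeleton concrete, here is a minimal Python sketch of the procedure in Figure 2.1. It is only an illustration under our own assumptions: the callables base_learner, adjust_distribution and combine_outputs are hypothetical placeholders that a concrete boosting algorithm (such as AdaBoost in the next section) must instantiate.

import numpy as np

def general_boosting(X, y, base_learner, adjust_distribution, combine_outputs, T=10):
    """A direct transcription of the skeleton in Figure 2.1 (binary labels y in {-1, +1})."""
    m = len(X)
    D = np.full(m, 1.0 / m)                        # D1 = D: start from the given (here uniform) distribution
    learners, errors = [], []
    for t in range(T):
        h = base_learner(X, y, D)                  # train a weak learner under distribution Dt
        pred = h(X)
        eps = D[pred != y].sum()                   # epsilon_t: weighted error of ht under Dt
        D = adjust_distribution(D, eps, pred, y)   # Figure 2.1 abstracts this as Adjust_Distribution(Dt, eps_t)
        learners.append(h)
        errors.append(eps)
    return combine_outputs(learners, errors)       # e.g., a (weighted) vote of h1, ..., hT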
2.2 The AdaBoost Algorithm
The general boosting procedure described in Figure 2.1 is not a real algorithm since there are some unspecified parts such as Adjust Distribution
and Combine Outputs. The AdaBoost algorithm [Freund and Schapire,
1997], which is the most influential boosting algorithm, can be viewed as
an instantiation of these parts as shown in Figure 2.2.
Consider binary classification on classes {−1, +1}. One derivation of AdaBoost [Friedman et al., 2000] is obtained by minimizing the exponential loss function

    ℓ_exp(h | D) = E_{x∼D}[ e^{−f(x)h(x)} ]                                (2.1)
Input: Data set D = {(x1, y1), (x2, y2), ..., (xm, ym)};
       Base learning algorithm L;
       Number of learning rounds T.
Process:
  1. D1(x) = 1/m.                           % Initialize the weight distribution
  2. for t = 1, ..., T:
  3.     ht = L(D, Dt);                     % Train a classifier ht from D under distribution Dt
  4.     εt = P_{x∼Dt}(ht(x) ≠ f(x));       % Evaluate the error of ht
  5.     if εt > 0.5 then break
  6.     αt = (1/2) ln( (1 − εt)/εt );      % Determine the weight of ht
  7.     Dt+1(x) = (Dt(x)/Zt) × exp(−αt) if ht(x) = f(x), or (Dt(x)/Zt) × exp(αt) if ht(x) ≠ f(x)
              = Dt(x) exp(−αt f(x) ht(x)) / Zt
                                            % Update the distribution, where Zt is a normalization
                                            % factor which enables Dt+1 to be a distribution
  8. end
Output: H(x) = sign( Σ_{t=1}^T αt ht(x) )

FIGURE 2.2: The AdaBoost algorithm
using an additive weighted combination of weak learners,

    H(x) = Σ_{t=1}^T αt ht(x) .                                            (2.2)
The exponential loss is used since it gives an elegant and simple update
formula, and it is consistent with the goal of minimizing classification error and can be justified by its relationship to the standard log likelihood.
When the exponential loss is minimized by H, the partial derivative of the exponential loss for every x is zero, i.e.,

    ∂e^{−f(x)H(x)} / ∂H(x) = −f(x) e^{−f(x)H(x)}
                           = −e^{−H(x)} P(f(x) = 1 | x) + e^{H(x)} P(f(x) = −1 | x)    (2.3)
                           = 0 .
Then, by solving (2.3), we have

    H(x) = (1/2) ln ( P(f(x) = 1 | x) / P(f(x) = −1 | x) ) ,               (2.4)

and hence,
    sign(H(x)) = sign( (1/2) ln ( P(f(x) = 1 | x) / P(f(x) = −1 | x) ) )
               = { +1,  if P(f(x) = 1 | x) > P(f(x) = −1 | x)
                 { −1,  if P(f(x) = 1 | x) < P(f(x) = −1 | x)
               = arg max_{y∈{−1,+1}} P(f(x) = y | x) ,                     (2.5)
which implies that sign(H(x)) achieves the Bayes error rate. Note that we
ignore the case P (f (x) = 1 | x) = P (f (x) = −1 | x). The above derivation
shows that when the exponential loss is minimized, the classification error
is also minimized, and thus the exponential loss is a proper optimization
target for replacing the non-differentiable classification error.
The H is produced by iteratively generating ht and αt . The first weak classifier h1 is generated by invoking the weak learning algorithm on the original distribution. When a classifier ht is generated under the distribution Dt ,
its weight αt is to be determined such that αt ht minimizes the exponential
loss
    ℓ_exp(αt ht | Dt) = E_{x∼Dt}[ e^{−f(x) αt ht(x)} ]
                      = E_{x∼Dt}[ e^{−αt} I(f(x) = ht(x)) + e^{αt} I(f(x) ≠ ht(x)) ]    (2.6)
                      = e^{−αt} P_{x∼Dt}(f(x) = ht(x)) + e^{αt} P_{x∼Dt}(f(x) ≠ ht(x))
                      = e^{−αt}(1 − εt) + e^{αt} εt ,
where εt = P_{x∼Dt}(ht(x) ≠ f(x)). To get the optimal αt, let the derivative of the exponential loss equal zero, that is,

    ∂ℓ_exp(αt ht | Dt) / ∂αt = −e^{−αt}(1 − εt) + e^{αt} εt = 0 ,          (2.7)

then the solution is

    αt = (1/2) ln( (1 − εt) / εt ) ,                                       (2.8)

as in line 6 of Figure 2.2.
Once a sequence of weak classifiers and their corresponding weights
have been generated, these classifiers are combined as Ht−1 . Then, AdaBoost adjusts the sample distribution such that in the next round, the base
learning algorithm will output a weak classifier ht that corrects some mistakes of Ht−1 . Considering the exponential loss again, the ideal classifier ht
that corrects all mistakes of Ht−1 should minimize the exponential loss
    ℓ_exp(Ht−1 + ht | D) = E_{x∼D}[ e^{−f(x)(Ht−1(x) + ht(x))} ]
                         = E_{x∼D}[ e^{−f(x)Ht−1(x)} e^{−f(x)ht(x)} ] .    (2.9)
Using the Taylor expansion of e^{−f(x)ht(x)}, the exponential loss is approximated by

    ℓ_exp(Ht−1 + ht | D) ≈ E_{x∼D}[ e^{−f(x)Ht−1(x)} ( 1 − f(x)ht(x) + f(x)²ht(x)²/2 ) ]
                         = E_{x∼D}[ e^{−f(x)Ht−1(x)} ( 1 − f(x)ht(x) + 1/2 ) ] ,    (2.10)

by noticing that f(x)² = 1 and ht(x)² = 1.
Thus, the ideal classifier ht is

    ht(x) = arg min_h ℓ_exp(Ht−1 + h | D)
          = arg min_h E_{x∼D}[ e^{−f(x)Ht−1(x)} ( 1 − f(x)h(x) + 1/2 ) ]
          = arg max_h E_{x∼D}[ e^{−f(x)Ht−1(x)} f(x)h(x) ]                 (2.11)
          = arg max_h E_{x∼D}[ ( e^{−f(x)Ht−1(x)} / E_{x∼D}[e^{−f(x)Ht−1(x)}] ) f(x)h(x) ] ,

by noticing that E_{x∼D}[e^{−f(x)Ht−1(x)}] is a constant.
Denote a distribution Dt as

    Dt(x) = D(x) e^{−f(x)Ht−1(x)} / E_{x∼D}[e^{−f(x)Ht−1(x)}] .            (2.12)

Then, by the definition of mathematical expectation, it is equivalent to write that

    ht(x) = arg max_h E_{x∼D}[ ( e^{−f(x)Ht−1(x)} / E_{x∼D}[e^{−f(x)Ht−1(x)}] ) f(x)h(x) ]
          = arg max_h E_{x∼Dt}[ f(x)h(x) ] .                               (2.13)
Further noticing that f(x)ht(x) = 1 − 2 I(f(x) ≠ ht(x)), the ideal classifier is

    ht(x) = arg min_h E_{x∼Dt}[ I(f(x) ≠ h(x)) ] .                         (2.14)

As can be seen, the ideal ht minimizes the classification error under the distribution Dt. Therefore, the weak learner is to be trained under Dt, and has less than 0.5 classification error according to Dt. Considering the relation between Dt and Dt+1, we have
    Dt+1(x) = D(x) e^{−f(x)Ht(x)} / E_{x∼D}[e^{−f(x)Ht(x)}]
             = D(x) e^{−f(x)Ht−1(x)} e^{−f(x)αt ht(x)} / E_{x∼D}[e^{−f(x)Ht(x)}]        (2.15)
             = Dt(x) · e^{−f(x)αt ht(x)} · E_{x∼D}[e^{−f(x)Ht−1(x)}] / E_{x∼D}[e^{−f(x)Ht(x)}] ,

which is the way AdaBoost updates the sample distribution, as in line 7 of Figure 2.2.
It is noteworthy that the AdaBoost algorithm described in Figure 2.2 requires the base learning algorithm to be able to learn with specified distributions. This is often accomplished by re-weighting, that is, weighting
training examples in each round according to the sample distribution. For
base learning algorithms that cannot handle weighted training examples,
re-sampling, that is, sampling training examples in each round according
to the desired distribution, can be applied.
For base learning algorithms which can be used with both re-weighting
and re-sampling, generally there is no clear performance difference between these two implementations. However, re-sampling provides an option for Boosting with restart [Kohavi and Wolpert, 1996]. In each round of
AdaBoost, there is a sanity check to ensure that the current base learner is
better than random guess (see line 5 of Figure 2.2). This sanity check might
be violated on some tasks when only a few weak learners are available, and the AdaBoost procedure will then terminate early, far before the specified number of rounds T. This occurs particularly often on multiclass tasks. When
re-sampling is used, the base learner that cannot pass the sanity check can
be removed, and a new data sample can be generated, on which a new base
learner will be trained; in this way, the AdaBoost procedure can avoid the
early-termination problem.
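As a concrete illustration of the re-weighting implementation, the following Python sketch mirrors Figure 2.2, using scikit-learn decision stumps as base learners. It is a minimal sketch under our own assumptions; the function names and the choice of scikit-learn are ours, not the book's.

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50, base=DecisionTreeClassifier(max_depth=1)):
    """AdaBoost by re-weighting; y must take values in {-1, +1}."""
    m = len(X)
    D = np.full(m, 1.0 / m)                          # line 1: initialize the weight distribution
    learners, alphas = [], []
    for t in range(T):
        h = clone(base).fit(X, y, sample_weight=D)   # line 3: re-weighting via sample_weight
        pred = h.predict(X)
        eps = D[pred != y].sum()                     # line 4: weighted error
        if eps > 0.5:                                # line 5: sanity check
            break
        eps = min(max(eps, 1e-12), 1 - 1e-12)        # numerical guard (not in Figure 2.2)
        alpha = 0.5 * np.log((1 - eps) / eps)        # line 6: weight of ht
        D = D * np.exp(-alpha * y * pred)            # line 7: update the distribution
        D = D / D.sum()                              #         normalize (the factor Zt)
        learners.append(h)
        alphas.append(alpha)
    return learners, np.array(alphas)

def adaboost_predict(learners, alphas, X):
    agg = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.where(agg >= 0, 1, -1)                 # sign of the weighted combination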
2.3 Illustrative Examples
It is helpful to gain intuitive understanding of AdaBoost by observing its
behavior. Consider an artificial data set in a two-dimensional space, plotted
in Figure 2.3(a). There are only four instances, i.e.,
    (z1 = (+1, 0), y1 = +1),
    (z2 = (−1, 0), y2 = +1),
    (z3 = (0, +1), y3 = −1),
    (z4 = (0, −1), y4 = −1),

where yi = f(zi) is the label of each instance. This is the XOR problem. As can be seen, there is no straight line that is able to separate the positive instances (i.e., z1 and z2) from the negative instances (i.e., z3 and z4); in other words, the two classes cannot be separated by a linear classifier.

FIGURE 2.3: AdaBoost on the XOR problem: (a) the XOR data; (b) 1st round; (c) 2nd round; (d) 3rd round.
Suppose we have a base learning algorithm which works as follows. It
evaluates eight basis functions h1 to h8 described in Figure 2.4 on the training data under a given distribution, and then outputs the one with the
smallest error. If there is more than one basis function with the smallest error, it selects one randomly. Notice that none of these eight basis functions
can separate the two classes.
FIGURE 2.4: The eight basis functions considered by the base learning algorithm (x1 and x2 are the values of x at the first and the second dimension, respectively):

    h1(x) = +1 if x1 > −0.5, −1 otherwise;      h2(x) = −1 if x1 > −0.5, +1 otherwise;
    h3(x) = +1 if x1 > +0.5, −1 otherwise;      h4(x) = −1 if x1 > +0.5, +1 otherwise;
    h5(x) = +1 if x2 > −0.5, −1 otherwise;      h6(x) = −1 if x2 > −0.5, +1 otherwise;
    h7(x) = +1 if x2 > +0.5, −1 otherwise;      h8(x) = −1 if x2 > +0.5, +1 otherwise.

Now we track how AdaBoost works:

1. The first step is to invoke the base learning algorithm on the original data. Since h2, h3, h5 and h8 all have the smallest classification error, 0.25, suppose the base learning algorithm outputs h2 as the classifier. After that, one instance, z1, is incorrectly classified, so the error is 1/4 = 0.25. The weight of h2 is 0.5 ln 3 ≈ 0.55. Figure 2.3(b) visualizes the classification, where the shaded area is classified as negative (−1) and the weights of the classification, 0.55 and −0.55, are displayed.

2. The weight of z1 is increased, and the base learning algorithm is invoked again. This time h3, h5 and h8 have the smallest error, and suppose h3 is picked, of which the weight is 0.80. Figure 2.3(c) shows the combined classification of h2 and h3 with their weights, where different gray levels are used for distinguishing the negative areas according to the combination weights.

3. The weight of z2 is increased. This time only h5 and h8 have the smallest error, and suppose h5 is picked, of which the weight is 1.10. Figure 2.3(d) shows the combined classification of h2, h3 and h5.

After the three steps, let us consider the sign of the classification weights in each area in Figure 2.3(d). It can be observed that the sign of the classification weights of z1 and z2 is “+”, while that of z3 and z4 is “−”. This means all the instances are correctly classified; thus, by combining the imperfect linear classifiers, AdaBoost has produced a non-linear classifier with zero error.

FIGURE 2.5: Decision boundaries of (a) a single decision tree, (b) AdaBoost and (c) the 10 decision trees used by AdaBoost, on the three-Gaussians data set.

FIGURE 2.6: Comparison of predictive errors of AdaBoost against single base learners (decision stump, pruned decision tree, unpruned decision tree) on 40 UCI data sets. Each point represents a data set and locates according to the predictive error of the two compared algorithms. The diagonal line indicates where the two compared algorithms have identical errors.
For a further understanding of AdaBoost, we visualize the decision
boundaries of a single decision tree, AdaBoost and its component decision
trees on the three-Gaussians data set, as shown in Figure 2.5. It can be observed that the decision boundary of AdaBoost is more flexible than that of
a single decision tree, and this helps to reduce the error from 9.4% of the
single decision tree to 8.3% of the boosted ensemble.
We also evaluate the AdaBoost algorithm on 40 data sets from the UCI Machine Learning Repository,1 which cover a broad range of real-world tasks. The Weka2 implementation of AdaBoost.M1 using re-weighting with 50 weak learners is evaluated. Almost all kinds of learning algorithms can be
1 http://www.ics.uci.edu/~mlearn/MLRepository.html
2 http://www.cs.waikato.ac.nz/ml/weka/
taken as base learning algorithms, such as decision trees, neural networks,
etc. Here, we have tried three base learning algorithms: decision stumps,
and pruned and unpruned J48 decision trees (Weka implementation of C4.5
decision trees). We plot the comparison results in Figure 2.6, from which it
can be observed that AdaBoost usually outperforms its base learning algorithm, with only a few exceptions on which it hurts performance.
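The experiment above uses Weka's AdaBoost.M1; readers working in Python can run a rough analogue of the stump-based comparison with scikit-learn, whose AdaBoostClassifier uses a depth-one decision tree (a decision stump) as its default base learner. This is a small usage sketch on a single illustrative data set, not a reproduction of the 40-data-set study.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

stump = DecisionTreeClassifier(max_depth=1)          # a single weak learner
boosted = AdaBoostClassifier(n_estimators=50)        # 50 boosted stumps (the default base learner)

err_stump = 1 - cross_val_score(stump, X, y, cv=10).mean()
err_boost = 1 - cross_val_score(boosted, X, y, cv=10).mean()
print(f"decision stump error: {err_stump:.3f}")
print(f"AdaBoost(50) error  : {err_boost:.3f}")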
2.4 Theoretical Issues
2.4.1 Initial Analysis
Freund and Schapire [1997] proved that, if the base learners of AdaBoost have errors ε1, ε2, ..., εT, the error of the final combined learner, ε, is upper bounded by

    ε = E_{x∼D}[ I(H(x) ≠ f(x)) ] ≤ 2^T ∏_{t=1}^T √( εt(1 − εt) ) ≤ e^{−2 Σ_{t=1}^T γt²} ,    (2.16)
where γt = 0.5 − εt is called the edge of ht. It can be seen that AdaBoost reduces the error exponentially fast. Also, it can be derived that, to achieve an error less than ε, the number of learning rounds T is upper bounded by

    T ≤ (1/(2γ²)) ln(1/ε) ,                                                (2.17)

where it is assumed that γ ≤ γt for all t.
In practice, however, all the estimates can only be carried out on the training data D, i.e., ε_D = E_{x∼D}[ I(H(x) ≠ f(x)) ], and thus the errors are training errors, while the generalization error ε_𝒟 over the underlying distribution 𝒟 is of more interest. Freund and Schapire [1997] showed that the generalization error of AdaBoost is upper bounded by

    ε_𝒟 ≤ ε_D + Õ( √(dT/m) )                                              (2.18)
with probability at least 1 − δ, where d is the VC-dimension of base learners,
m is the number of training instances, T is the number of learning rounds
and Õ(·) is used instead of O(·) to hide logarithmic terms and constant factors.
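As a quick numeric illustration of (2.16) and (2.17) (the edge value γ = 0.1 and the target error below are our own example choices, not from the book):

import numpy as np

gamma = 0.1                                  # assumed edge of every weak learner (illustrative)
for T in (10, 50, 100, 200):
    bound = np.exp(-2 * T * gamma ** 2)      # e^{-2 sum_t gamma_t^2} with gamma_t = gamma
    print(f"T = {T:3d}: training-error bound <= {bound:.4f}")

target = 0.01                                # desired training error
T_needed = int(np.ceil(np.log(1 / target) / (2 * gamma ** 2)))   # cf. bound (2.17)
print(f"rounds sufficient to push the bound below {target}: {T_needed}")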
2.4.2 Margin Explanation
The generalization bound (2.18) suggests that, in order to achieve a good
generalization, it is necessary to constrain the complexity of the base learners as well as the number of learning rounds; otherwise AdaBoost will overfit.

FIGURE 2.7: (a) Training and test error, and (b) margin distribution of AdaBoost on the UCI letter data set. (Plot based on a similar figure in [Schapire et al., 1998])
Empirical studies, however, show that AdaBoost often does not overfit;
that is, the test error often tends to decrease even after the training error
reaches zero, even after a large number of rounds such as 1,000. For example, Schapire et al. [1998] plotted the typical performance of AdaBoost
as shown in Figure 2.7(a). It can be observed that AdaBoost achieves zero
training error in less than 10 rounds but after that, the generalization error
keeps decreasing. This phenomenon seems to contradict Occam's Razor, which prefers simple hypotheses to complex ones when both fit the empirical observations well. So, it is not strange that explaining why AdaBoost seems resistant to overfitting has become one of the central theoretical issues
and has attracted much attention.
Schapire et al. [1998] introduced the margin-based explanation of AdaBoost. Formally, in the context of binary classification, i.e., f(x) ∈ {−1, +1}, the margin of the classifier h on the instance x, or in other words, the distance of x to the classification hyperplane of h, is defined as f(x)h(x); similarly, the margin of the ensemble H(x) = Σ_{t=1}^T αt ht(x) is f(x)H(x) = Σ_{t=1}^T αt f(x)ht(x), while the normalized margin of the ensemble is

    ( Σ_{t=1}^T αt f(x)ht(x) ) / ( Σ_{t=1}^T αt ) ,                        (2.19)

where the αt's are the weights of the base learners.
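A minimal sketch of how the normalized margins in (2.19) can be computed for a trained ensemble; the array names preds, alphas and y are assumed to be produced elsewhere (e.g., by the AdaBoost sketch in Section 2.2).

import numpy as np

def normalized_margins(alphas, preds, y):
    """alphas: (T,) base-learner weights; preds: (T, m) outputs in {-1, +1};
    y: (m,) labels in {-1, +1}. Returns the normalized margin of every instance."""
    alphas = np.asarray(alphas, dtype=float)
    agg = alphas @ np.asarray(preds)          # sum_t alpha_t h_t(x) for every x
    return y * agg / alphas.sum()             # f(x)H(x) / sum_t alpha_t

# Cumulative margin distribution, as plotted in Figure 2.7(b):
# margins = normalized_margins(alphas, preds, y)
# frac_below = [(margins <= theta).mean() for theta in np.linspace(-1, 1, 201)]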
Based on the concept of margin, Schapire et al. [1998] proved that, given any threshold θ > 0 of margin over the training sample D, with probability at least 1 − δ, the generalization error of the ensemble, ε_𝒟 = P_{x∼𝒟}(f(x) ≠ H(x)), can be bounded by

    ε_𝒟 ≤ P_{x∼D}( f(x)H(x) ≤ θ ) + Õ( √( d/(mθ²) + ln(1/δ) ) )
        ≤ 2^T ∏_{t=1}^T √( εt^{1−θ} (1 − εt)^{1+θ} ) + Õ( √( d/(mθ²) + ln(1/δ) ) ) ,    (2.20)
where d, m, T and Õ(·) are the same as those in (2.18), and εt is the training
error of the base learner ht . The bound (2.20) discloses that when other variables are fixed, the larger the margin over the training set, the smaller the
generalization error. Thus, Schapire et al. [1998] argued that AdaBoost tends
to be resistant to overfitting since it is able to increase the ensemble margin even after the training error reaches zero. Figure 2.7(b) illustrates the
margin distribution of AdaBoost at different numbers of learning rounds.
Notice that the bound (2.20) depends heavily on the smallest margin,
since the probability Px∼D (f (x)H(x) ≤ θ) will be small if the smallest margin is large. Based on this recognition, Breiman [1999] developed the arc-gv
algorithm, which is a variant of AdaBoost but directly maximizes the minimum margin

    ρ = min_{x∈D} f(x)H(x) .                                               (2.21)
In each round, arc-gv updates αt according to

    αt = (1/2) ln( (1 + γt)/(1 − γt) ) − (1/2) ln( (1 + ρt)/(1 − ρt) ) ,   (2.22)

where γt is the edge of ht, and ρt is the minimum margin of the combined classifier up to the current round.
Based on the minimum margin, Breiman [1999] proved a generalization
error bound tighter than (2.20). Since the minimum margin of arc-gv converges to the largest possible minimum margin, the margin theory would
appear to predict that arc-gv should perform better than AdaBoost. However, Breiman [1999] found in experiments that, though arc-gv does produce
uniformly larger minimum margin than AdaBoost, the test error of arc-gv
increases drastically in almost every case. Hence, Breiman [1999] convincingly concluded that the margin-based explanation for AdaBoost was in serious doubt and a new understanding is needed. This almost sentenced the
margin theory to death.
Seven years later, Reyzin and Schapire [2006] reported an interesting finding. The generalization error bound (2.20) depends on the margin, the number of learning rounds and the complexity of the base learners; to study the influence of the margin, the other factors should be fixed. When comparing arc-gv and AdaBoost, Breiman [1999] tried to control the complexity of the base learners by using decision trees with a fixed number of leaves. However, Reyzin and Schapire [2006] found that these trees have very different shapes: the trees generated by arc-gv tend to have larger depth, while those generated by AdaBoost tend to have larger width. Figure 2.8(a) depicts the difference in depth of typical trees generated by the two algorithms. Though the trees have the same number of leaves, it seems that a deeper tree makes more attribute tests than a wider tree, and therefore they are unlikely to have equal complexity. So, Reyzin and Schapire [2006] repeated Breiman's experiments using decision stumps, which have only two leaves and therefore a fixed complexity, and found that the margin distribution of AdaBoost is better than that of arc-gv, as illustrated in Figure 2.8(b).

FIGURE 2.8: (a) Tree depth and (b) margin distribution of AdaBoost against arc-gv on the UCI clean1 data set.
Thus, the margin distribution is believed to be crucial to the generalization performance of AdaBoost, and Reyzin and Schapire [2006] suggested considering the average margin or median margin as measures to compare margin
distributions.
2.4.3 Statistical View
Though the margin-based explanation to AdaBoost has a nice geometrical intuition and is attractive to the learning community, it is not that attractive to the statistics community, and statisticians have tried to understand AdaBoost from the perspective of statistical methods. A breakthrough
in this direction was made by Friedman et al. [2000] who showed that the
AdaBoost algorithm can be interpreted as a stagewise estimation procedure
for fitting an additive logistic regression model, which is exactly how we derived AdaBoost in Section 2.2.
Notice that (2.2) is a form of additive model. The exponential loss function (2.1) adopted by AdaBoost is a differentiable upper bound of the 0/1-loss function that is typically used for measuring misclassification error [Schapire and Singer, 1999].

Input: Data set D = {(x1, y1), (x2, y2), ..., (xm, ym)};
       Least square base learning algorithm L;
       Number of learning rounds T.
Process:
  1. y0(x) = f(x).                          % Initialize target
  2. H0(x) = 0.                             % Initialize function
  3. for t = 1, ..., T:
  4.     pt(x) = 1 / (1 + e^{−2Ht−1(x)});   % Calculate probability
  5.     yt(x) = (yt−1(x) − pt(x)) / (pt(x)(1 − pt(x)));   % Update target
  6.     Dt(x) = pt(x)(1 − pt(x));          % Update weight
  7.     ht = L(D, yt, Dt);                 % Train a least square classifier ht to fit yt
                                            % in data set D under distribution Dt
  8.     Ht(x) = Ht−1(x) + (1/2) ht(x);     % Update combined classifier
  9. end
Output: H(x) = sign( Σ_{t=1}^T ht(x) )

FIGURE 2.9: The LogitBoost algorithm

If we take a logistic function and estimate the probability via

    P(f(x) = 1 | x) = e^{H(x)} / (e^{H(x)} + e^{−H(x)}) ,                  (2.23)

we can find that the exponential loss function and the log loss function (negative log-likelihood)

    ℓ_log(h | D) = E_{x∼D}[ ln(1 + e^{−2f(x)h(x)}) ]                       (2.24)
are minimized by the same function (2.4). So, instead of taking the Newton-like updates in AdaBoost, Friedman et al. [2000] suggested fitting the additive logistic regression model by optimizing the log loss function via gradient descent with the base regression models, leading to the LogitBoost algorithm
shown in Figure 2.9.
According to the explanation of Friedman et al. [2000], AdaBoost is just
an optimization process that tries to fit an additive model based on a surrogate loss function. Ideally, a surrogate loss function should be consistent,
i.e., optimizing the surrogate loss should ultimately yield an optimal function with the Bayes error rate for the true loss function, while the optimization of the
surrogate loss is computationally more efficient. Many variants of AdaBoost
have been developed by considering different surrogate loss functions, e.g.,
the LogitBoost which considers the log loss, the L2Boost which considers the
l2 loss [Bühlmann and Yu, 2003], etc.
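The relation between these losses is easy to see numerically. The short sketch below tabulates the 0/1 loss, the exponential loss (2.1) and the log loss (2.24) as functions of the margin f(x)H(x); the grid of margin values is our own choice.

import numpy as np

margin = np.linspace(-2.0, 2.0, 9)               # f(x)H(x) on a small grid
zero_one = (margin <= 0).astype(float)           # non-differentiable 0/1 loss
exp_loss = np.exp(-margin)                       # exponential loss (2.1), an upper bound of 0/1
log_loss = np.log(1 + np.exp(-2 * margin))       # log loss (2.24)

for m, z, e, l in zip(margin, zero_one, exp_loss, log_loss):
    print(f"margin {m:+.1f}: 0/1 = {z:.0f}, exp = {e:.3f}, log = {l:.3f}")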
On the other hand, if we just regard a boosting procedure as an optimization of a loss function, an alternative way for this purpose is to use mathematical programming [Demiriz et al., 2002, Warmuth et al.,
2008] to solve
the weights of weak learners. Consider an additive model Σ_{h∈H} αh h of a
pool H of weak learners, and let ξi be the loss of the model on instance xi .
Demiriz et al. [2002] derived that, if the sum of coefficients and losses is
bounded such that
    Σ_{h∈H} αh + C Σ_{i=1}^m ξi ≤ B ,                                      (2.25)
which actually bounds the complexity (or covering number) of the model
[Zhang, 1999], the generalization error is therefore bounded as
    ε_𝒟 ≤ Õ( ( (ln m)² / m ) B ln(Bm) + (1/m) ln(1/δ) ) ,                  (2.26)
where C ≥ 1 and αh ≥ 0, and Õ hides other variables. It is evident that
minimizing B also minimizes this upper bound. Thus, considering T weak
learners, letting yi = f (xi ) be the label of training instance xi and Hi,j =
hj (xi ) be the output of weak learner hj on xi , we have the optimization task
    min_{αj, ξi}  Σ_{j=1}^T αj + C Σ_{i=1}^m ξi                            (2.27)
    s.t.  yi Σ_{j=1}^T Hi,j αj + ξi ≥ 1    (∀i = 1, ..., m)
          ξi ≥ 0                           (∀i = 1, ..., m)
          αj ≥ 0                           (∀j = 1, ..., T) ,

or equivalently,

    max_{αj, ξi, ρ}  ρ − C Σ_{i=1}^m ξi                                    (2.28)
    s.t.  yi Σ_{j=1}^T Hi,j αj + ξi ≥ ρ    (∀i = 1, ..., m)
          Σ_{j=1}^T αj = 1
          ξi ≥ 0                           (∀i = 1, ..., m)
          αj ≥ 0                           (∀j = 1, ..., T) ,

of which the dual form is

    min_{wi, β}  β                                                         (2.29)
    s.t.  Σ_{i=1}^m wi yi Hi,j ≤ β         (∀j = 1, ..., T)
          Σ_{i=1}^m wi = 1
          wi ∈ [0, C]                      (∀i = 1, ..., m) .
A difficulty for the optimization task is that T can be very large. Considering
the final solution of the first linear programming, some α will be zero. One
way to handle this problem is to find the smallest subset of all the columns;
this can be done by column generation [Nash and Sofer, 1996]. Using the
dual form, set wi = 1/m for the first column, and then find the jth column
that violates the constraint

    Σ_{i=1}^m wi yi Hi,j ≤ β                                               (2.30)

the most. This is equivalent to maximizing Σ_{i=1}^m wi yi Hi,j; in other words,
finding the weak learner hj with the smallest error under the weight distribution w. When the solved hj does not violate any constraint, optimality
is reached and the column generation process terminates. The whole procedure forms the LPBoost algorithm [Demiriz et al., 2002] summarized in
Figure 2.10. The performance advantage of LPBoost against AdaBoost is not
apparent [Demiriz et al., 2002], while it is observed that an improved version, entropy regularized LPBoost, often beats AdaBoost [Warmuth et al.,
2008].
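For readers who want to see the optimization concretely, the following sketch solves the primal LP (2.28) directly with SciPy for a small, randomly generated pool of weak-learner outputs. The toy data are purely illustrative, and full LPBoost would instead generate columns incrementally as in Figure 2.10.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, T, C = 30, 5, 10.0
H = rng.choice([-1.0, 1.0], size=(m, T))   # H[i, j] = h_j(x_i), toy weak-learner outputs
y = rng.choice([-1.0, 1.0], size=m)        # toy labels

# Variables: x = [alpha_1..alpha_T, xi_1..xi_m, rho]; maximize rho - C * sum(xi),
# i.e., minimize C * sum(xi) - rho.
c = np.concatenate([np.zeros(T), C * np.ones(m), [-1.0]])

# Margin constraints of (2.28): rho - y_i * sum_j H[i, j] alpha_j - xi_i <= 0 for every i.
A_ub = np.hstack([-(y[:, None] * H), -np.eye(m), np.ones((m, 1))])
b_ub = np.zeros(m)

# Normalization constraint: sum_j alpha_j = 1.
A_eq = np.concatenate([np.ones(T), np.zeros(m), [0.0]]).reshape(1, -1)
b_eq = [1.0]

bounds = [(0, None)] * (T + m) + [(None, None)]    # alpha, xi >= 0; rho is free
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
alpha, rho = res.x[:T], res.x[-1]
print("weak-learner weights:", np.round(alpha, 3), " achieved margin rho:", round(rho, 3))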
It is noteworthy that though the statistical view of boosting is well accepted by the statistics community, it does not answer the question why
AdaBoost seems resistant to overfitting. Moreover, the AdaBoost algorithm
was designed as a classification algorithm for minimizing the misclassification error, while the statistical view focuses on the minimization of the
surrogate loss function (or equivalently, probability estimation); these two
problems are often very different. As indicated by Mease and Wyner [2008],
in addition to the optimization aspect, a more comprehensive view should
also consider the stagewise nature of the algorithm as well as the empirical
variance reduction effect.
Input: Data set D = {(x1, y1), (x2, y2), ..., (xm, ym)};
       Base learning algorithm L;
       Parameter ν;
       Number of learning rounds T.
Process:
  1. w1,i = 1/m (∀i = 1, ..., m).
  2. β1 = 0.
  3. for t = 1, ..., T:
  4.     ht = L(D, w);                      % Train a learner ht from D under w
  5.     if Σ_{i=1}^m wt,i yi ht(xi) ≤ βt then T = t − 1; break;   % Check optimality
  6.     Hi,t = ht(xi) (i = 1, ..., m);     % Fill a column
  7.     (wt+1, βt+1) = arg min_{w,β} β
            s.t. Σ_{i=1}^m wi yi hj(xi) ≤ β (∀j ≤ t)
                 Σ_{i=1}^m wi = 1
                 wi ∈ [0, 1/(mν)] (∀i = 1, ..., m)
  8. end
  9. solve α from the dual solution (wT+1, βT+1);
Output: H(x) = sign( Σ_{t=1}^T αt ht(x) )

FIGURE 2.10: The LPBoost algorithm

2.5 Multiclass Extension

In the previous sections we focused on AdaBoost for binary classification, i.e., Y = {+1, −1}. In many classification tasks, however, an instance belongs to one of many classes instead of two. For example, a handwritten
digit belongs to one of the 10 classes, i.e., Y = {0, . . . , 9}. There are many
alternative ways to extend AdaBoost for multiclass classification.
AdaBoost.M1 [Freund and Schapire, 1997] is a very straightforward extension, which is the same as the algorithm shown in Figure 2.2 except that
the base learners now are multiclass learners instead of binary classifiers.
This algorithm cannot use binary classifiers, and has an overly strong
constraint that every base learner has less than 1/2 multiclass 0/1-loss.
SAMME [Zhu et al., 2006] is an improvement over AdaBoost.M1, which
replaces line 6 of AdaBoost.M1 in Figure 2.2 by
    αt = (1/2) ln( (1 − εt)/εt ) + ln(|Y| − 1) .                           (2.31)

This modification is derived from the minimization of the multiclass exponential loss, and it was proved that, similar to the case of binary classification, optimizing the multiclass exponential loss approaches the Bayes error rate, i.e.,

    sign(h*(x)) = arg max_{y∈Y} P(y | x) ,                                 (2.32)
where h* is the optimal solution to the multiclass exponential loss.

FIGURE 2.11: A directed acyclic graph that aggregates the one-versus-one decomposition for classes A, B and O.
A commonly used solution to the multiclass classification problem is to
decompose the task into multiple binary classification problems. Popular decomposition schemes include one-versus-rest and one-versus-one.
One-versus-rest decomposes a multiclass task of |Y| classes into |Y| binary
classification tasks, where the ith task is to classify whether an instance belongs to the ith class or not. One-versus-one decomposes a multiclass task
of |Y| classes into |Y|(|Y| − 1)/2 binary classification tasks, where each task is to classify whether an instance belongs to, say, the ith class or the jth class.
AdaBoost.MH [Schapire and Singer, 1999] follows the one-versus-rest
strategy. After training |Y| binary AdaBoost classifiers, the real-valued output Hy(x) = Σ_{t=1}^T αt ht(x) of each binary AdaBoost classifier, rather than its crisp classification, is used to identify the most probable class, that is,

    H(x) = arg max_{y∈Y} Hy(x) ,                                           (2.33)
where Hy is the binary AdaBoost classifier that classifies the yth class from
the rest.
AdaBoost.M2 [Freund and Schapire, 1997] follows the one-versus-one
strategy, which minimizes a pseudo-loss. This algorithm is later generalized
as AdaBoost.MR [Schapire and Singer, 1999] which minimizes a ranking loss
motivated by the fact that the highest ranked class is more likely to be the
correct class. Binary classifiers obtained by one-versus-one decomposition
can also be aggregated by voting, pairwise coupling, directed acyclic graph,
etc. [Hsu and Lin, 2002, Hastie and Tibshirani, 1998]. Voting and pairwise
coupling are well known, while Figure 2.11 illustrates the use of a directed
acyclic graph.
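As an illustration of the one-versus-rest strategy in code (in the spirit of AdaBoost.MH, though not the exact algorithm of Schapire and Singer [1999]), a binary boosted learner can be wrapped in scikit-learn's OneVsRestClassifier, which trains one classifier per class and predicts the class with the largest score; the data set choice below is merely illustrative.

from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier

X, y = load_digits(return_X_y=True)          # 10-class handwritten digits

ovr_boost = OneVsRestClassifier(AdaBoostClassifier(n_estimators=50))
acc = cross_val_score(ovr_boost, X, y, cv=5).mean()
print(f"one-versus-rest AdaBoost accuracy: {acc:.3f}")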
2.6 Noise Tolerance
Real-world data are often noisy. The AdaBoost algorithm, however, was
originally designed for clean data and has been observed to be very sensitive to noise. The noise sensitivity of AdaBoost is generally attributed to
the exponential loss function (2.1), which specifies that if an instance is not classified the same as its given label, the weight of the instance will increase drastically. Consequently, when a training instance is associated with a wrong label, AdaBoost still tries to make the prediction resemble the given label, and thus degrades performance.
MadaBoost [Domingo and Watanabe, 2000] improves AdaBoost by depressing large instance weights. It is almost the same as AdaBoost except
for the weight updating rule. Recall the weight updating rule of AdaBoost,
i.e.,
    Dt+1(x) = (Dt(x)/Zt) × { e^{−αt}, if ht(x) = f(x);  e^{αt}, if ht(x) ≠ f(x) }
             = (Dt(x)/Zt) × e^{−αt·ht(x)·f(x)}                             (2.34)
             = (D1(x)/Zt′) × ∏_{i=1}^t e^{−αi·hi(x)·f(x)} ,

where Zt and Zt′ are normalization terms. It can be seen that, if the
prediction on an instance is different from its given label for a number of rounds, the term ∏_{i=1}^t e^{−αi·hi(x)·f(x)} will grow very large, pushing the instance to be classified according to the given label in the next round. To reduce the undesired dramatic increase of instance weights caused by noise, MadaBoost sets an upper limit on the weights:

    Dt+1(x) = (D1(x)/Zt) × min( 1, ∏_{i=1}^t e^{−αi·hi(x)·f(x)} ) ,        (2.35)
where Zt is the normalization term. By using this weight updating rule, the
instance weights will not grow without bound.
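The effect of the cap in (2.35) is easy to see numerically. The small sketch below follows the unnormalized weight of a single instance over the rounds, under AdaBoost's rule (2.34) and under MadaBoost's capped rule (2.35); the α values and the always-misclassified instance are purely illustrative assumptions.

import numpy as np

def unnormalized_weight(alphas, agreements, capped=False):
    """agreements[i] = h_i(x) * f(x) in {-1, +1} for a single instance.
    Returns the running product prod_i exp(-alpha_i * h_i(x) * f(x)),
    optionally capped at 1 as in MadaBoost (2.35)."""
    prod = np.cumprod(np.exp(-np.asarray(alphas) * np.asarray(agreements)))
    return np.minimum(1.0, prod) if capped else prod

alphas = [0.5] * 8
agree = [-1] * 8     # an instance (e.g., with a noisy label) misclassified in every round

print("AdaBoost :", np.round(unnormalized_weight(alphas, agree), 2))
print("MadaBoost:", np.round(unnormalized_weight(alphas, agree, capped=True), 2))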
FilterBoost [Bradley and Schapire, 2008] does not employ the exponential loss function used in AdaBoost, but adopts the log loss function (2.24).
Similar to the derivation of AdaBoost in Section 2.2, we consider fitting an
additive model to minimize the log loss function. At round t, denote the
combined learner as Ht−1 and the classifier to be trained as ht . Using the
Taylor expansion of the loss function, we have
    ℓ_log(Ht−1 + ht | D) = E_{x∼D}[ −ln( 1 / (1 + e^{−f(x)(Ht−1(x)+ht(x))}) ) ]                (2.36)
                         ≈ E_{x∼D}[ ln(1 + e^{−f(x)Ht−1(x)}) − f(x)ht(x) / (1 + e^{f(x)Ht−1(x)})
                                    + e^{f(x)Ht−1(x)} / ( 2(1 + e^{f(x)Ht−1(x)})² ) ]
                         ≈ E_{x∼D}[ −f(x)ht(x) / (1 + e^{f(x)Ht−1(x)}) ] ,

by noticing that f(x)² = 1 and ht(x)² = 1. To minimize the loss function, ht
needs to satisfy
    ht = arg min_h ℓ_log(Ht−1 + h | D)
       = arg max_h E_{x∼D}[ f(x)h(x) / (1 + e^{f(x)Ht−1(x)}) ]
       = arg max_h E_{x∼D}[ f(x)h(x) / ( Zt (1 + e^{f(x)Ht−1(x)}) ) ]      (2.37)
       = arg max_h E_{x∼Dt}[ f(x)h(x) ] ,
where Zt = E_{x∼D}[ 1 / (1 + e^{f(x)Ht−1(x)}) ] is the normalization factor, and the weight updating rule is

    Dt(x) = (D(x)/Zt) · 1 / (1 + e^{f(x)Ht−1(x)}) .                        (2.38)
It is evident that with this updating rule, the increase of the instance weights
is upper bounded by 1, similar to the weight depressing in MadaBoost, but
smoother.
The BBM (Boosting-By-Majority) [Freund, 1995] algorithm was the first
iterative boosting algorithm. Though it is noise tolerant [Aslam and Decatur, 1993], it requires the user to specify mysterious parameters in advance; removing the requirement of unknown parameters motivated the development of AdaBoost. BrownBoost [Freund, 2001] is another adaptive version of BBM, which inherits BBM's noise tolerance property. Derived from the loss function of BBM, which is an accumulated binomial distribution, the loss function of BrownBoost corresponds to the Brownian motion process [Gardiner, 2004], i.e.,

    ℓ_bmp(Ht−1 + ht | D) = E_{x∼D}[ 1 − erf( (f(x)Ht−1(x) + f(x)ht(x) + c − t) / √c ) ] ,    (2.39)
where the parameter c specifies the total time for the boosting procedure, t
is the current time which starts from zero and increases in each round, and
erf(·) is the error function

    erf(a) = (1/√π) ∫_{−∞}^a e^{−x²} dx .                                  (2.40)
The loss function (2.39) can be expanded as

    ℓ_bmp(Ht−1 + ht | D) ≈ E_{x∼D}[ 1 − erf( (f(x)Ht−1(x) + c − t)/√c )
                                    − (2/√(cπ)) e^{−(f(x)Ht−1(x)+c−t)²/c} f(x)ht(x)
                                    − (4/(c²π)) e^{−(f(x)Ht−1(x)+c−t)²/c} f(x)²ht(x)² ]
                         ≈ −E_{x∼D}[ e^{−(f(x)Ht−1(x)+c−t)²/c} f(x)ht(x) ] ,    (2.41)
and thus, the learner which minimizes the loss function is

    ht = arg min_h ℓ_bmp(Ht−1 + h | D)
       = arg max_h E_{x∼D}[ e^{−(f(x)Ht−1(x)+c−t)²/c} f(x)h(x) ]           (2.42)
       = arg max_h E_{x∼Dt}[ f(x)h(x) ] ,

where the weight updating rule is

    Dt(x) = (D(x)/Zt) e^{−(f(x)Ht−1(x)+c−t)²/c} ,                          (2.43)
and Zt is the normalization term. Notice that the weighting function e^{−(f(x)Ht−1(x)+c−t)²/c} here is quite different from those used in the boosting algorithms introduced above. When the classification margin f(x)Ht−1(x) equals the negative remaining time −(c − t), the weight is the largest, and it decreases as the margin becomes either larger or smaller.
This implies that BrownBoost/BBM “gives up” on some very hard training
instances. With more learning rounds, −(c − t) approaches 0. This implies
that BrownBoost/BBM pushes the margin on most instances to be positive,
while leaving alone the remaining hard instances that could be noise.
RobustBoost [Freund, 2009] is an improvement of BrownBoost, aiming at
improving the noise tolerance ability by boosting the normalized classification margin, which is believed to be related to the generalization error
(see Section 2.4.2). In other words, instead of minimizing the classification
error, RobustBoost tries to minimize

    E_{x∼D}[ I( ( Σ_{t=1}^T αt f(x)ht(x) ) / ( Σ_{t=1}^T αt ) ≤ θ ) ] ,    (2.44)
where θ is the goal margin. For this purpose, the Brownian motion process
in BrownBoost is changed to the mean-reverting Ornstein-Uhlenbeck process [Gardiner, 2004], with the loss function
    ℓ_oup(Ht−1 + ht | D) = E_{x∼D}[ 1 − erf( ( m̃(Ht−1(x) + ht(x)) − μ(t/c) ) / σ(t/c) ) ] ,    (2.45)

where m̃(H) is the normalized margin of H, 0 ≤ t/c ≤ 1, μ(t/c) = (θ − 2ρ)e^{1−t/c} + 2ρ and σ(t/c)² = (σf² + 1)e^{2(1−t/c)} − 1 are respectively the mean and variance of the process, and ρ, σf as well as θ are parameters of the algorithm. By a similar derivation as that of BrownBoost, the weight updating rule of RobustBoost is

    Dt(x) = (D(x)/Zt) e^{−(f(x)Ht−1(x) − μ(t/c))² / (2σ(t/c)²)} .          (2.46)
A major difference between the weighting functions (2.46) of RobustBoost
and (2.43) of BrownBoost lies in the fact that μ(t/c) approaches θ as t approaches the total time c; thus, RobustBoost pushes the normalized classification margin to be larger than the goal margin θ, while BrownBoost just
pushes the classification margin to be larger than zero.
2.7 Further Readings
Computational learning theory studies some fundamental theoretical
issues of learning. First introduced by Valiant [1984], the PAC (Probably
Approximately Correct) framework models learning algorithms in a distribution-free manner. Roughly speaking, for binary classification, a problem is learnable or strongly learnable if there exists an algorithm that outputs a learner h in polynomial time such that for all 0 < δ, ε ≤ 0.5, P( E_{x∼D}[I(h(x) ≠ f(x))] < ε ) ≥ 1 − δ, and a problem is weakly learnable
if there exists an algorithm that outputs a learner with error 0.5 − 1/p where
p is a polynomial in problem size and other parameters. Anthony and Biggs
[1992], and Kearns and Vazirani [1994] provide good introductions to computational learning theory.
In 1990, Schapire [1990] proved that the strongly learnable problem class equals the weakly learnable problem class, answering an open problem raised by
Kearns and Valiant [1989]. The proof is a construction, which is the first
boosting algorithm. One year later, Freund developed the more efficient
BBM algorithm, which was later published in [Freund, 1995]. Both algorithms, however, suffered from the practical deficiency that the error
bounds of the base learners need to be known in advance. Later in 1995,
Boosting
45
Freund and Schapire [1995, 1997] developed the AdaBoost algorithm, which
avoids the requirement of unknown parameters, hence the name adaptive boosting. The AdaBoost paper [Freund and Schapire, 1997] won its authors the Gödel Prize in 2003.
Understanding why AdaBoost seems resistant to overfitting is one of the
most interesting open problems on boosting. Notice that the concern is
why AdaBoost often does not overfit, not that it never overfits; e.g.,
Grove and Schuurmans [1998] showed that overfitting eventually occurs after enough learning rounds. Many interesting discussions can be found in
the discussion part of [Friedman et al., 2000, Mease and Wyner, 2008].
Besides Breiman [1999], Harries [1999] also constructed an algorithm to
show that the minimum margin is not crucial. Wang et al. [2008] introduced the Emargin and proved a new generalization error bound tighter
than that based on the minimum margin. Gao and Zhou [2012] showed that
the minimum margin and Emargin are special cases of the kth margin; all
of them are single margins that cannot measure the margin distribution.
By considering exactly the same factors as Schapire et al. [1998], Gao and
Zhou [2012] proved a new generalization error bound based on the empirical Bernstein inequality [Maurer and Pontil, 2009]; this new generalization
error bound is uniformly tighter than both the bounds of Schapire et al.
[1998] and Breiman [1999], and thus defends the margin-based explanation against Breiman’s doubt. Furthermore, Gao and Zhou [2012] obtained
an even tighter generalization error bound by considering the empirical average margin and margin variance.
In addition to the two most popular theoretical explanations, i.e., the
margin explanation and the statistical view, there are also some other theoretical explanations to boosting. For example, Breiman [2004] proposed
the population theory for boosting, Bickel et al. [2006] considered boosting
as the Gauss-Southwell minimization of a loss function, etc. The stability
of AdaBoost has also been studied [Kutin and Niyogi, 2002, Gao and Zhou,
2010].
There are many empirical studies involving AdaBoost, e.g., [Bauer and Kohavi, 1999, Opitz and Maclin, 1999, Dietterich, 2000b]. The famous bias-variance decomposition [Geman et al., 1992] has been employed to empirically study why AdaBoost achieves excellent performance. This powerful
tool breaks the expected error of a learning algorithm into the sum of three
non-negative quantities, i.e., the intrinsic noise, the bias, and the variance.
The bias measures how closely the average estimate of the learning algorithm is able to approximate the target, and the variance measures how
much the estimate of the learning algorithm fluctuates for the different
training sets of the same size. It has been observed that AdaBoost primarily reduces the bias though it is also able to reduce the variance [Bauer and
Kohavi, 1999, Breiman, 1996a, Zhou et al., 2002b].
Ideal base learners for boosting are weak learners sufficiently strong to be
boostable, since it is easy to underfit if the base learners are too weak, yet
easy to overfit if the base learners are too strong. For binary classification,
it is well known that the exact requirement for weak learners is to be better
than random guess. While for multi-class problems, it remains a mystery
until the recent work by Mukherjee and Schapire [2010]. Notice that requiring base learners to be better than random guess is too weak for multi-class
problems, yet requiring better than 50% accuracy is too stringent. Recently,
Conditional Random Fields (CRFs) and latent variable models are also
utilized as base learners for boosting [Dietterich et al., 2008, Hutchinson
et al., 2011].
Error Correcting Output Codes (ECOC) [Dietterich and Bakiri, 1995] is
an important way to extend binary learners to multi-class learners, which
will be introduced in Section 4.6.1 of Chapter 4. Among the alternative ways
of characterizing noise in data and how a learning algorithm is resistant to
noise, the statistical query model [Kearns, 1998] is a PAC-compliant theoretical model, in which a learning algorithm learns from queries of noisy expectation values of hypotheses. We call a boosting algorithm an SQ Boosting if it efficiently boosts noise-tolerant weak learners to strong learners.
The noise tolerance of MadaBoost was proved by showing that it is a SQ
Boosting, by assuming monotonic errors for the weak learners [Domingo
and Watanabe, 2000]. Aslam and Decatur [1993] showed that BBM is also
a SQ Boosting. In addition to algorithms introduced in Section 2.6, there
are many other algorithms, such as GentleBoost [Friedman et al., 2000], trying to improve the robustness of AdaBoost. McDonald et al. [2003] reported
an empirical comparison of AdaBoost, LogitBoost and BrownBoost on noisy
data. A thorough comparison of robust AdaBoost variants is an important
issue to be explored.
3 Bagging
3.1 Two Ensemble Paradigms
According to how the base learners are generated, roughly speaking,
there are two paradigms of ensemble methods, that is, sequential ensemble methods where the base learners are generated sequentially, with AdaBoost as a representative, and parallel ensemble methods where the base
learners are generated in parallel, with Bagging [Breiman, 1996d] as a representative.
The basic motivation of sequential methods is to exploit the dependence
between the base learners, since the overall performance can be boosted
in a residual-decreasing way, as seen in Chapter 2. The basic motivation
of parallel ensemble methods is to exploit the independence between the
base learners, since the error can be reduced dramatically by combining
independent base learners.
Take binary classification on classes {−1, +1} as an example. Suppose
the ground-truth function is f , and each base classifier has an independent
generalization error ε, i.e., for base classifier hi,

    P( hi(x) ≠ f(x) ) = ε .                                                (3.1)

After combining T such base classifiers according to

    H(x) = sign( Σ_{i=1}^T hi(x) ) ,                                       (3.2)
the ensemble H makes an error only when at least half of its base classifiers
make errors. Therefore, by Hoeffding inequality, the generalization error of
the ensemble is
    P( H(x) ≠ f(x) ) = Σ_{k=0}^{⌊T/2⌋} C(T, k) (1 − ε)^k ε^{T−k} ≤ exp( −(1/2) T (2ε − 1)² ) .    (3.3)

Equation (3.3) clearly shows that the generalization error decreases exponentially with the ensemble size T, and ultimately approaches zero as T approaches
to infinity. Though it is practically impossible to get really independent
base learners since they are generated from the same training data set, base
learners with less dependence can be obtained by introducing randomness
in the learning process, and a good generalization ability can be expected from the ensemble.
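As a quick numeric check of (3.3) under the idealized independence assumption (the error value ε = 0.3 and the ensemble sizes below are our own illustrative choices):

from math import comb, exp

def majority_vote_error(eps, T):
    """Exact error of a majority vote of T independent classifiers, each with error eps,
    following (3.3): sum over the number k of correct classifiers, 0 <= k <= floor(T/2)."""
    return sum(comb(T, k) * (1 - eps) ** k * eps ** (T - k) for k in range(T // 2 + 1))

eps = 0.3
for T in (1, 11, 31, 61, 101):
    bound = exp(-0.5 * T * (2 * eps - 1) ** 2)        # the Hoeffding bound in (3.3)
    print(f"T = {T:3d}: exact error = {majority_vote_error(eps, T):.4f}, bound = {bound:.4f}")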
Another benefit of parallel ensemble methods is that they are inherently amenable to parallel computing, and the training speed can be easily
accelerated using multi-core computing processors or parallel computers.
This is attractive as multi-core processors are commonly available nowadays.
3.2 The Bagging Algorithm
The name Bagging came from the abbreviation of Bootstrap AGGregatING
[Breiman, 1996d]. As the name implies, the two key ingredients of Bagging
are bootstrap and aggregation.
We know that the combination of independent base learners will lead to
a dramatic decrease of errors and therefore, we want to get base learners
as independent as possible. Given a training data set, one possibility seems
to be sampling a number of non-overlapping data subsets and then training a base learner from each of the subsets. However, since we do not have
infinite training data, such a process will produce very small and unrepresentative samples, leading to poor performance of base learners.
Bagging adopts the bootstrap distribution for generating different base
learners. In other words, it applies bootstrap sampling [Efron and Tibshirani, 1993] to obtain the data subsets for training the base learners. In detail, given a training data set containing m training examples, a
sample of m training examples will be generated by sampling with replacement. Some original examples appear more than once, while some original
examples are not present in the sample. By applying the process T times,
T samples of m training examples are obtained. Then, from each sample a
base learner can be trained by applying the base learning algorithm.
Bagging adopts the most popular strategies for aggregating the outputs of
the base learners, that is, voting for classification and averaging for regression. To predict a test instance, taking classification for example, Bagging
feeds the instance to its base classifiers and collects all of their outputs, and
then votes over the labels and takes the winning label as the prediction, where
ties are broken arbitrarily. Notice that Bagging can deal with binary classification as well as multi-class classification. The Bagging algorithm is summarized in Figure 3.1.
It is worth mentioning that the bootstrap sampling also offers Bagging another advantage.
Input: Data set D = {(x_1, y_1), (x_2, y_2), . . . , (x_m, y_m)};
  Base learning algorithm L;
  Number of base learners T.
Process:
  1. for t = 1, . . . , T:
  2.   h_t = L(D, D_bs)   % D_bs is the bootstrap distribution
  3. end
Output: H(x) = arg max_{y ∈ Y} Σ_{t=1}^{T} I(h_t(x) = y)

FIGURE 3.1: The Bagging algorithm.
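As a companion to Figure 3.1, the following minimal from-scratch sketch (ours, not from the book) implements Bagging with scikit-learn decision trees as the base learning algorithm; it assumes X and y are NumPy arrays with class labels encoded as 0, 1, . . ., and the class name BaggingSketch is purely illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class BaggingSketch:
    def __init__(self, T=20, random_state=0):
        self.T = T                                   # number of base learners
        self.rng = np.random.default_rng(random_state)
        self.learners = []

    def fit(self, X, y):
        m = len(X)
        for _ in range(self.T):
            idx = self.rng.integers(0, m, size=m)    # bootstrap sample of size m
            self.learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return self

    def predict(self, X):
        # Collect the T crisp predictions and vote; ties are broken in favor
        # of the smallest label index, i.e., arbitrarily.
        preds = np.array([h.predict(X) for h in self.learners])   # shape (T, n)
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)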
As Breiman [1996d] indicated, given m training examples, the probability that the ith training example is selected 0, 1, 2, . . . times
is approximately Poisson distributed with λ = 1, and thus, the probability
that the ith example will occur at least once is 1 − (1/e) ≈ 0.632. In other
words, for each base learner in Bagging, about 36.8% of the original training examples have not been used in its training process. The goodness of the base learner can be estimated using these out-of-bag examples, and thereafter the generalization error of the bagged ensemble can
be estimated [Breiman, 1996c, Tibshirani, 1996b, Wolpert and Macready,
1999].
To get the out-of-bag estimate, we need to record the training examples
used for each base learner. Denote H^{oob}(x) as the out-of-bag prediction on x, where only the learners that have not been trained on x are involved, i.e.,

H^{oob}(x) = arg max_{y ∈ Y} Σ_{t=1}^{T} I(h_t(x) = y) · I(x ∉ D_t) .   (3.4)

Then, the out-of-bag estimate of the generalization error of Bagging is

err^{oob} = (1/|D|) Σ_{(x,y) ∈ D} I(H^{oob}(x) ≠ y) .   (3.5)
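A corresponding sketch of the out-of-bag estimate (3.4)-(3.5) is given below; it assumes that the bootstrap index set of each base learner was recorded during training, and, as a small simplification of (3.5), the error is averaged only over the examples that receive at least one out-of-bag vote.

import numpy as np

def oob_error(learners, index_sets, X, y, n_classes):
    """learners[t] was trained on the bootstrap indices stored in index_sets[t]."""
    m = len(X)
    votes = np.zeros((m, n_classes))
    for h, idx in zip(learners, index_sets):
        oob = np.ones(m, dtype=bool)
        oob[np.asarray(idx)] = False                 # x in D_t is excluded from H^oob(x)
        if oob.any():
            votes[np.where(oob)[0], h.predict(X[oob])] += 1
    covered = votes.sum(axis=1) > 0                  # examples with at least one OOB vote
    H_oob = votes[covered].argmax(axis=1)            # (3.4)
    return np.mean(H_oob != y[covered])              # (3.5), over the covered examples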
The out-of-bag examples can also be used for many other purposes.
For example, when decision trees are used as base classifiers, the posterior
probability of each node of each tree can be estimated using the out-of-bag
examples. If a node does not contain out-of-bag examples, it is marked “uncounted”. For a test instance, its posterior probability can be estimated by
averaging the posterior probabilities of non-uncounted nodes into which it
falls.
3.3 Illustrative Examples
To get an intuitive understanding of Bagging, we visualize the decision
boundaries of a single decision tree, Bagging and its component decision
trees on the three-Gaussians data set, as shown in Figure 3.2. It can be observed that the decision boundary of Bagging is more flexible than that of
a single decision tree, and this helps to reduce the error from 9.4% of the
single decision tree to 8.3% of the bagged ensemble.
FIGURE 3.2: Decision boundaries of (a) a single decision tree, (b) Bagging
and (c) the 10 decision trees used by Bagging, on the three-Gaussians data
set.
We also evaluate the Bagging algorithm on 40 data sets from the UCI Machine Learning Repository. The Weka implementation of Bagging with 20
base classifiers is tested. We have tried three base learning algorithms: decision stumps, and pruned and unpruned J48 decision trees. We plot the
FIGURE 3.3: Comparison of predictive errors of Bagging against single base learners on 40 UCI data sets. Each point represents a data set and is located according to the predictive errors of the two compared algorithms. The diagonal line indicates where the two compared algorithms have identical errors.
comparison results in Figure 3.3, and it can be observed that Bagging often outperforms its base learning algorithm, and rarely reduces performance. A closer look at Figure 3.3 shows that Bagging using decision stumps is less powerful than Bagging using decision trees. This is easier to see in Figure 3.4.
Remember that Bagging adopts bootstrap sampling to generate different data samples, while all the data samples have a large overlap, say, 63.2%, with the original data set. If a base learning algorithm is insensitive to perturbation of the training sample, the base learners trained from the data samples may be quite similar, and thus combining them will not help improve the generalization performance. Such learners are called stable learners. Decision trees are unstable learners, while decision stumps are closer to stable learners. On highly stable learners such as k-nearest neighbor classifiers, Bagging does not work. For example, Figure 3.5 shows the decision boundaries of a single 1-nearest neighbor classifier and of Bagging of such classifiers; the difference between the decision boundaries is hardly visible, and the predictive errors are both 9.1%.
FIGURE 3.4: Comparison of predictive errors of Bagging using decision stumps, pruned decision trees and unpruned decision trees. Each point represents a data set and is located according to the predictive errors of the two compared algorithms. The diagonal line indicates where the two compared algorithms have identical errors.
Indeed, it is well known that Bagging should be used with unstable learners; generally, the more unstable the base learner, the larger the performance improvement. This explains why in Figure 3.4 the performance of Bagging with unpruned decision trees is better than with pruned decision trees, since unpruned trees are more unstable than pruned ones. A useful practical implication is that when we use Bagging, we do not need to carry out time-consuming decision tree pruning.
With independent base learners, (3.3) shows that the generalization error decreases exponentially in the ensemble size T, and ultimately approaches zero as T approaches infinity. In practice we do not have infinite training data, and the base learners of Bagging are not independent since they are trained from bootstrap samples. However, it is worth mentioning that though the error might not drop to zero, the performance of Bagging converges as the ensemble size, i.e., the number of base learners, grows large, as illustrated in Figure 3.6.
FIGURE 3.5: Decision boundaries of (a) 1-nearest neighbor classifier, and
(b) Bagging of 1-nearest neighbor classifiers, on the three-Gaussians data
set.
FIGURE 3.6: Impact of ensemble size on Bagging on two UCI data sets: (a) credit-g and (b) soybean.
3.4 Theoretical Issues
Bagging has a tremendous variance reduction effect, and it is particularly effective with unstable base learners. Understanding these properties
is fundamental in the theoretical studies of Bagging.
Breiman [1996d] presented an explanation when he proposed Bagging.
Let’s consider regression at first. Let f denote the ground-truth function
and h(x) denote a learner trained from the bootstrap distribution Dbs . The
aggregated learner generated by Bagging is
H(x) = E_{D_bs}[h(x)] .   (3.6)

With simple algebra and the inequality (E[X])^2 ≤ E[X^2], we have

(f(x) − H(x))^2 ≤ E_{D_bs}[ (f(x) − h(x))^2 ] .   (3.7)
Thus, by integrating both sides over the distribution, we can get that the
mean-squared error of H(x) is smaller than that of h(x) averaged over the
bootstrap sampling distribution, and the difference depends on how unequal the following inequality is:
(E_{D_bs}[h(x)])^2 ≤ E_{D_bs}[ h(x)^2 ] .   (3.8)
This clearly discloses the importance of instability. That is, if h(x) does not
change much with different bootstrap samples, the aggregation will not
help; while if h(x) changes much, the improvement provided by the aggregation will be great. This explains why Bagging is effective with unstable
learners, and it reduces variance through the smoothing effect.
In the context of classification, suppose the classifier h(x) predicts the
class label y ∈ {y1 , y2 , . . . , yc }. Let P (y | x) denote the probability of y being the ground-truth class label of x. Then, the probability of h correctly
classifying x is
Σ_y P(h(x) = y) P(y | x) ,   (3.9)

and the overall correct classification probability of h is

∫ Σ_y P(h(x) = y) P(y | x) P(x) dx ,   (3.10)
where P (x) is the input probability distribution.
If the probability of x having the class label y is larger than that of any other class, and h predicts the class y for x more often than any other class, i.e.,

arg max_y P(h(x) = y) = arg max_y P(y | x) ,   (3.11)

the predictor h is called order-correct at the input x.
The aggregated classifier of Bagging is H(x) = arg max_y P(h(x) = y). Its probability of correct classification at x is

Σ_y I( arg max_z P(h(x) = z) = y ) P(y | x) .   (3.12)
If h is order-correct at the input x, the above probability equals max_y P(y | x). Thus, the correct classification probability of the aggregated classifier H is

∫_{x∈C} max_y P(y | x) P(x) dx + ∫_{x∈C'} [ Σ_y I(H(x) = y) P(y | x) ] P(x) dx ,   (3.13)

where C is the set of all inputs x at which h is order-correct, and C' is the set of inputs at which h is not order-correct. It always holds that

Σ_y P(h(x) = y) P(y | x) ≤ max_y P(y | x) .   (3.14)

Thus, the highest achievable accuracy of Bagging is

∫ max_y P(y | x) P(x) dx ,   (3.15)

which equals the accuracy of the Bayes optimal classifier.
Comparing (3.10) and (3.13), it can be found that if a predictor is order-correct at most instances, Bagging can transform it into a nearly optimal predictor. Notice that if the base learner is unstable, the h's generated from different samples will be quite different and will produce different predictions on x, leading to a low value of P(h(x) = y); according to (3.9), the probability of correctly predicting x will then be low. However, if h is order-correct at x, Bagging will correctly classify x with high probability. This suggests that the performance improvement brought by Bagging is large when the base learner is unstable but order-correct.
Friedman and Hall [2007] studied Bagging through a decomposition of
statistical predictors. They assumed that the learner h(x; γ) is parameterized by a parameter vector γ, which can be obtained by solving an estimation function
Σ_{i=1}^{n} g((x_i, y_i), γ) = 0 ,   (3.16)
where g is a smooth multivariate function, (x_i, y_i) is the ith training example and n is the size of the training set. Once γ is obtained, the learner h is determined.
Suppose γ* is the solution of E_{x,y}[g((x_i, y_i), γ)] = 0. Based on the Taylor expansion of g((x_i, y_i), γ) around γ*, (3.16) can be rewritten as

Σ_{i=1}^{n} [ g((x_i, y_i), γ*) + Σ_{k} g_k((x_i, y_i), γ*) (γ − γ*)_k
  + Σ_{k_1} Σ_{k_2} g_{k_1,k_2}((x_i, y_i), γ*) (γ − γ*)_{k_1} (γ − γ*)_{k_2} + . . . ] = 0 ,   (3.17)

where γ_k is the kth component of γ, and g_k is the partial derivative of g with respect to γ_k. Suppose γ̂ is a solution of (3.16); then from (3.17) it can be expressed as

γ̂ = Γ + Σ_{k_1} Σ_{k_2} α_{k_1 k_2} (Φ̄ − φ)_{k_1} (Φ̄ − φ)_{k_2}
  + Σ_{k_1} Σ_{k_2} Σ_{k_3} α_{k_1 k_2 k_3} (Φ̄ − φ)_{k_1} (Φ̄ − φ)_{k_2} (Φ̄ − φ)_{k_3} + . . .   (3.18)
with coefficients α_{k_1 k_2}, α_{k_1 k_2 k_3}, . . .,

Γ = γ* + M^{-1} (1/n) Σ_{i=1}^{n} g((x_i, y_i), γ*) ,   (3.19)

Φ̄ = (1/n) Σ_{i=1}^{n} Φ_i ,   φ = E[Φ_i] ,   (3.20)
where M is a matrix whose kth column is E_{x,y}[g_k((x, y), γ*)], and Φ_i is a vector of g_k((x_i, y_i), γ*), g_{k_1,k_2}((x_i, y_i), γ*), . . .. It is obvious that γ̂ can be decomposed into linear and higher-order parts.
Suppose the learner generated from a bootstrap sample of the training data set D is parameterized by γ̂, and the sample size is m (m ≤ n). According to (3.18), we have

γ̂ = Γ + Σ_{k_1} Σ_{k_2} α_{k_1 k_2} (Φ̄ − φ)_{k_1} (Φ̄ − φ)_{k_2}
  + Σ_{k_1} Σ_{k_2} Σ_{k_3} α_{k_1 k_2 k_3} (Φ̄ − φ)_{k_1} (Φ̄ − φ)_{k_2} (Φ̄ − φ)_{k_3} + . . . .   (3.21)
The aggregated learner of Bagging is parameterized by

γ̂_bag = E[γ̂ | D] .   (3.22)
If γ̂ is a linear function of the data, we have E[Γ | D] = Γ, and thus γ̂_bag = γ̂. This implies that Bagging does not improve the linear components of γ̂.
Now, let’s consider higher-order components. Let
ρ_m = n/m ,   (3.23)

σ̂_{k_1 k_2} = (1/n) Σ_{i=1}^{n} (Φ_i − Φ̄)_{k_1} (Φ_i − Φ̄)_{k_2} ,   (3.24)

S = Σ_{k_1} Σ_{k_2} α_{k_1 k_2} σ̂_{k_1 k_2} .   (3.25)
Friedman and Hall [2007] showed that if ρ_m → ρ (1 ≤ ρ ≤ ∞) when n → ∞, γ̂_bag can be expressed as

γ̂_bag = Γ + (1/n) ρ_m S + δ_bag ,   (3.26)
where δ_bag represents the terms with orders higher than quadratic. From (3.24) it is easy to see that the variance of σ̂_{k_1 k_2} decreases as the sample size n increases, and the dependence of the variance on the sample size is of the order O(n^{-1}). Since S is linear in the σ̂_{k_1 k_2}'s, the dependence of the variance of S on the sample size is also of the order O(n^{-1}). Thus, considering that ρ_m is asymptotic to a constant and the property of variance that var(aX) = a^2 var(X), the dependence of the variance of (1/n) ρ_m S on the sample size is of the order O(n^{-3}). If we rewrite (3.18) as γ̂ = Γ + Δ, then after a similar analysis, we can get that the dependence of the variance of Δ on the sample size is asymptotically of the order O(n^{-2}). Therefore, Bagging reduces the variance of the quadratic terms of γ̂ from O(n^{-2}) to O(n^{-3}). Similar effects can be found on terms with orders higher than quadratic.
Thus, Friedman and Hall [2007] concluded that Bagging can reduce the
variance of higher-order components, yet not affect the linear components.
This implies that Bagging is better applied with highly nonlinear learners.
Since highly nonlinear learners tend to be unstable, i.e., their performance
changes much with data sample perturbation, it is understandable that
Bagging is effective with unstable base learners.
It is easy to understand that Bagging converges as the ensemble size
grows. Given a training set, Bagging uses bootstrap sampling to generate
a set of random samples; on each a base learner is trained. This process is
equivalent to picking a set of base learners from a pool of all possible learners randomly according to the distribution implied by bootstrap sampling.
Thus, given a test instance, the output of a base learner on the instance can
be denoted as a random variable Y drawn from the distribution. Without
loss of generality, let’s consider binary classification where Y ∈ {−1, +1}.
Bagging generally employs voting to combine base classifiers, while we can
consider averaging first. Let Ȳ_T = (1/T) Σ_{i=1}^{T} Y_i denote the average of the outputs of T drawn classifiers, and E[Y] denote the expectation. By the law of large numbers, we have, for any ε > 0,

lim_{T→∞} P( |Ȳ_T − E[Y]| < ε ) = 1 .   (3.27)

Turning to voting, we then have

lim_{T→∞} P( sign(Ȳ_T) = sign(E[Y]) ) = 1 ,   (3.28)
unless E[Y] = 0. Therefore, Bagging will converge to a steady error rate as the ensemble size grows, except for the rare case where Bagging is no better than random guessing. In fact, this property is shared by all parallel ensemble methods.
3.5 Random Tree Ensembles
3.5.1 Random Forest
Random Forest (RF) [Breiman, 2001] is a representative state-of-the-art ensemble method. It is an extension of Bagging, where the major difference with Bagging is the incorporation of randomized feature selection.
Input: Data set D = {(x1 , y1 ), (x2 , y2 ), . . . , (xm , ym )};
Feature subset size K.
Process:
1. N ← create a tree node based on D;
2. if all instances in the same class then return N
3. F ← the set of features that can be split further;
4. if F is empty then return N
5. F̃ ← select K features from F randomly;
6. N.f ← the feature which has the best split point in F̃ ;
7. N.p ← the best split point on N.f ;
8. Dl ← subset of D with values on N.f smaller than N.p ;
9. Dr ← subset of D with values on N.f no smaller than N.p ;
10. Nl ← call the process with parameters (Dl , K);
11. Nr ← call the process with parameters (Dr , K);
12. return N
Output: A random decision tree
FIGURE 3.7: The random tree algorithm in RF.
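A minimal Python sketch (ours, not from [Breiman, 2001]) of the randomized split selection in steps 5-7 of Figure 3.7 is shown below; the Gini impurity criterion and the helper name best_random_split are illustrative choices rather than requirements of RF, and the boundary handling of the split point is simplified.

import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_random_split(X, y, K, rng):
    d = X.shape[1]
    candidates = rng.choice(d, size=min(K, d), replace=False)   # step 5: select K features
    best_f, best_p, best_score = None, None, np.inf
    for f in candidates:                                        # step 6: best split within the subset
        for p in np.unique(X[:, f])[:-1]:
            left, right = y[X[:, f] <= p], y[X[:, f] > p]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_f, best_p, best_score = f, p, score
    return best_f, best_p                                       # step 7: N.f and N.p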
During the construction of a component decision tree, at each step
of split selection, RF first randomly selects a subset of features, and then
carries out the conventional split selection procedure within the selected
feature subset.
Figure 3.7 shows the random decision tree algorithm used in RF. The parameter K controls the incorporation of randomness. When K equals the
total number of features, the constructed decision tree is identical to the
traditional deterministic decision tree; when K = 1, a feature will be selected randomly. The suggested value of K is the logarithm of the number
of features [Breiman, 2001]. Notice that randomness is only introduced into
the feature selection process, not into the choice of split points on the selected feature.
Figure 3.8 compares the decision boundaries of RF and Bagging as well
as their base classifiers. It can be observed that decision boundaries of RF
and its base classifiers are more flexible, leading to a better generalization
ability. On the three-Gaussians data set, the test error of RF is 7.85% while
that of Bagging is 8.3%. Figure 3.9 compares the test errors of RF and Bagging
on 40 UCI data sets. It is clear that RF is preferable no matter whether pruned or unpruned decision trees are used.
The convergence property of RF is similar to that of Bagging. As illustrated in Figure 3.10, RF usually has a worse starting point, particularly when the ensemble size is one, owing to the performance degradation of the individual base learners caused by the randomized feature selection; however, it usually converges to a lower test error.
FIGURE 3.8: Decision boundaries on the three-Gaussians data set: (a) the
10 base classifiers of Bagging; (b) the 10 base classifiers of RF; (c) Bagging;
(d) RF.
It is worth mentioning that the training stage of RF is generally more efficient than that of Bagging. This is because
in the tree construction process, Bagging uses deterministic decision trees
which need to evaluate all features for split selection, while RF uses random
decision trees which only need to evaluate a subset of features.
3.5.2 Spectrum of Randomization
RF generates random decision trees by selecting a feature subset randomly at each node, while the split selection within the selected feature
subset is still deterministic. Liu et al. [2008a] described the VR-Tree ensemble method, which generates random decision trees by randomizing both
the feature selection and split selection processes.
The base learners of VR-Tree ensembles are VR-trees. At each node of the tree, a coin with probability α of landing heads is tossed.
FIGURE 3.9: Comparison of predictive errors of RF against Bagging on 40 UCI data sets. Each point represents a data set and is located according to the predictive errors of the two compared algorithms. The diagonal line indicates where the two compared algorithms have identical errors.
FIGURE 3.10: Impact of ensemble size on RF and Bagging on two UCI data sets: (a) credit-g and (b) soybean.
If a head is obtained, a deterministic node is constructed, that is, the best split point among all possible split points is selected in the same way as in traditional decision trees; otherwise, a random node is constructed, that is, a feature is selected randomly and a split point on that feature is selected randomly. Figure 3.11 shows the VR-Tree algorithm.
The parameter α controls the degree of randomness. When α = 1, the
produced VR-trees are identical to deterministic decision trees, while when
α = 0, the produced VR-trees are completely random trees. By adjusting the
parameter value, we can observe a spectrum of randomization [Liu et al.,
2008a], as illustrated in Figure 3.12. This provides a way to study the influence of randomness on the ensemble performance.
Input: Data set D = {(x_1, y_1), (x_2, y_2), . . . , (x_m, y_m)};
  Probability of using deterministic split selection α.
Process:
  1. N ← create a tree node based on D;
  2. if all instances in the same class then return N
  3. F ← the set of features that can be split further;
  4. if F is empty then return N
  5. r ← a random number in interval [0, 1];
  6. if r < α
  7.   then N.f ← a feature selected from F deterministically;
  8.        N.p ← a split point selected on N.f deterministically;
  9.   else N.f ← a feature selected from F randomly;
 10.        N.p ← a split point selected on N.f randomly;
 11. D_l ← subset of D with values on N.f smaller than N.p;
 12. D_r ← subset of D with values on N.f no smaller than N.p;
 13. N_l ← call the process with parameters (D_l, α);
 14. N_r ← call the process with parameters (D_r, α);
 15. return N
Output: A VR-tree
FIGURE 3.11: The VR-Tree algorithm.
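In the same sketch style, the coin toss of steps 5-10 in Figure 3.11 can be written as follows (ours, not from [Liu et al., 2008a]); the deterministic branch is delegated to a caller-supplied function, for instance the Gini-based search sketched after Figure 3.7 with K equal to the total number of features, which is an assumption made here for brevity.

import numpy as np

def vr_split(X, y, alpha, rng, deterministic_split):
    """Return (feature, split point) for one VR-tree node."""
    d = X.shape[1]
    if rng.random() < alpha:                 # steps 6-8: head, deterministic node
        return deterministic_split(X, y)
    f = int(rng.integers(0, d))              # steps 9-10: tail, random feature ...
    lo, hi = X[:, f].min(), X[:, f].max()
    return f, rng.uniform(lo, hi)            # ... and a random split point on it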
The spectrum has two ends, i.e., the random end (α close to 0) and the deterministic end (α close to 1). At the random end, the trees are more diverse and of larger sizes; at the deterministic end, the trees have higher accuracy and smaller sizes. While the two ends have different characteristics, ensembles can be improved by shifting toward the middle part of the spectrum. In practice, it is generally difficult to know which middle point is a really good choice. Liu et al. [2008a] suggested the Coalescence method, which aggregates VR-trees with the parameter α randomly chosen from [0, 0.5]; it was observed in experiments that the performance of Coalescence is often superior to that of RF and of VR-Tree ensembles with fixed α.
3.5.3 Random Tree Ensembles for Density Estimation
Random tree ensembles can be used for density estimation [Fan et al.,
2003, Fan, 2004, Liu et al., 2005]. Since density estimation is an unsupervised task, there is no label information for the training instances, and thus,
completely random trees are used. A completely random tree does not test
whether the instances belong to the same class; instead, it grows until every
leaf node contains only one instance or indistinguishable instances. The
completely random decision tree construction algorithm can be obtained by replacing the condition “all instances in the same class” in the 2nd step of Figure 3.11 by “only one instance”, and removing the 5th to 8th steps.
FIGURE 3.12: Illustration of the spectrum of randomization [Liu et al.,
2008a]. The x-axis shows the α values, and the y-axis shows the predictive
error of VR-Tree ensembles averaged over 45 UCI data sets.
Figure 3.13 illustrates how completely random tree ensembles estimate
data density. Figure 3.13(a) plots five one-dimensional data points, labeled
as 1, 2, . . ., 5, respectively. The completely random tree grows until every instance falls into a sole leaf node. First, we randomly choose a split point between points 1 and 5 to divide the data into two groups. With high probability, the split point falls either between points 1 and 2 or between points 4 and 5, since the gaps between these pairs of points are large. Suppose the split point adopted is between points 1 and 2; then point 1 falls into a sole leaf node. The next split point will, with large probability, be picked between points 4 and 5, which puts point 5 into a sole leaf node. It is clear that points 1 and 5 are more likely to be put into “shallow” leaf nodes, while points 2 to 4 are more likely to be put into “deep” leaf nodes. Figure 3.13(b) plots
three completely random trees generated from the data. We can count the
average depth of each data point: 1.67 for the points 1 and 5, 3.33 for the
point 2, 3.67 for the point 3, and 3 for the point 4. Even though there are
just three trees, we can conclude that the points 1 and 5 are located in a
relatively sparse area, while the points 2 to 4 are located in relatively dense
areas. Figure 3.13(d) plots the density estimation result, where the density
values are calculated as, e.g., 1.67/(1.67 × 2 + 3.33 + 3.67 + 3) for the point
1.
The principle illustrated on one-dimensional data above also holds for higher-dimensional data and for more complex data distributions. It is also easy to extend to tasks where the ensembles are constructed incrementally, such as in online learning or on streaming data. Notice that the construction of a completely random tree is quite efficient, since all it has to do is to pick random numbers.
FIGURE 3.13: Illustration of density estimation by a random tree ensemble: (a) five one-dimensional data points; (b) three completely random trees; (c) the depth of the leaves on the one-dimensional data, each corresponding to a random tree in sub-figure (b); (d) the density estimation result.
Overall, the data density can be estimated by
generating an ensemble of completely random trees and then calculating
the average depth of each data point; this provides a practical and efficient
tool for density estimation.
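The following minimal sketch (ours, not from the cited papers) mimics the one-dimensional example above: completely random trees are grown until every leaf holds a single value, and the normalized average leaf depth of each point is reported as its density score.

import numpy as np

def leaf_depths(points, rng, depth=0):
    """One completely random tree; returns {point: leaf depth}."""
    if len(points) <= 1 or points.min() == points.max():
        return {float(p): depth for p in points}
    cut = rng.uniform(points.min(), points.max())    # uniformly random split point
    left, right = points[points <= cut], points[points > cut]
    if len(left) == 0 or len(right) == 0:            # degenerate cut: retry at same depth
        return leaf_depths(points, rng, depth)
    out = leaf_depths(left, rng, depth + 1)
    out.update(leaf_depths(right, rng, depth + 1))
    return out

def density_scores(points, n_trees=1000, seed=0):
    rng = np.random.default_rng(seed)
    avg = {float(p): 0.0 for p in points}
    for _ in range(n_trees):
        for p, d in leaf_depths(points, rng).items():
            avg[p] += d / n_trees                    # average depth over the ensemble
    total = sum(avg.values())
    return {p: v / total for p, v in avg.items()}    # normalized as in the example

print(density_scores(np.array([1.0, 4.0, 4.5, 5.0, 8.0])))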
3.5.4 Random Tree Ensembles for Anomaly Detection
Anomalies are data points which do not conform to the general or expected behavior of the majority of data, and the task of anomaly detection is
to separate anomaly points from normal ones in a given data set [Chandola
et al., 2009]. In general, the terms anomalies and outliers are used interchangeably, and anomalies are also referred to as discordant observations,
exceptions, peculiarities, etc.
There are many established approaches to anomaly detection [Hodge
and Austin, 2004, Chandola et al., 2009]. A typical one is to estimate the data
density, and then treat the data points with very low densities as anomalies. However, as Liu et al. [2008b] disclosed, density is not a good indicator of anomaly, because a small clustered group of anomaly points may have a high density, while bordering normal points may have a low density. Since the basic property of anomalies is that they are few and different, isolation is more indicative than density [Liu et al., 2008b]. Based on this recognition, random tree ensembles can serve well for anomaly detection, since random trees are simple yet effective for measuring the difficulty of isolating data points. Figure 3.14 illustrates the idea of anomaly detection via random trees. It can be observed that a normal point x generally requires more partitions to be isolated, while an anomaly point x* can be isolated with many fewer partitions.
Liu et al. [2008b] described the iForest (Isolation Forest) method for anomaly detection. For each random tree, the number of partitions required to isolate a data point can be measured by the path length from the root node to the leaf node containing the data point.
FIGURE 3.14: Illustration of anomaly detection by random trees: (a) a
normal point x requires 11 random tree partitions to be isolated; (b) an
anomaly point x∗ requires only four random tree partitions to be isolated.
The fewer the required partitions, the easier the data point is to isolate. It is obvious that only the
data points with short path lengths are of interest. Thus, to reduce unnecessary computation, the random trees used in iForest are set with a height
limit, that is, a limit on the tree depth. This random tree construction algorithm can be obtained by replacing the condition “all instances in the same
class” in the 2nd step of Figure 3.11 by “only one instance or height limit is
reached”, and removing the 5th to 8th steps. To improve the efficiency and
scalability on large data sets, the random trees are constructed from small-sized samples instead of the original data set. Given the data sample size
ψ, the height limit of iForest is set to log2 (ψ), which is approximately the
average tree height with ψ leaf nodes [Knuth, 1997].
To calculate the anomaly score s(xi ) for xi , the expected path length
E[h(xi )] is derived firstly by passing xi through every random tree in the
iForest ensemble. The path length obtained at each random tree is added
with an adjustment term c(n) to account for the ungrown branch beyond
the tree height limit. For a tree with n nodes, c(n) is set as the average path
length [Preiss, 1999]
2H(n − 1) − (2(n − 1)/n), n > 1
c(n) =
,
(3.29)
0,
n=1
where H(a) is the harmonic number that can be estimated by
H(a) ≈ ln(a) + 0.5772156649 (Euler’s constant).
Then, the anomaly score s(xi ) is calculated according to [Liu et al., 2008b]
s(x_i) = 2^{−E[h(x_i)] / c(ψ)} ,   (3.30)
where c(ψ) serves as a normalization factor corresponding to the average
path length of traversing a random tree constructed from a sub-sample
with size ψ. Finally, if s(xi ) is very close to 1, xi is definitely an anomaly;
if s(xi ) is much smaller than 0.5, xi is quite safe to be regarded as a normal
data point; while if s(xi ) ≈ 0.5 for all xi ’s, there is no distinct anomaly [Liu
et al., 2008b].
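A minimal sketch (ours, not the authors' implementation) of the scoring step (3.29)-(3.30) is given below; it assumes the path lengths h(x_i) have already been measured on each tree (the per-tree c(n) adjustment for unexpanded branches is left out for brevity), and the function names are illustrative.

import math

def c(n: int) -> float:
    """Average path length adjustment (3.29), using the harmonic-number approximation."""
    if n <= 1:
        return 0.0
    H = math.log(n - 1) + 0.5772156649          # H(a) ~ ln(a) + Euler's constant
    return 2.0 * H - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, psi: int) -> float:
    """s(x) in (3.30): path_lengths are the h(x) values measured on the random trees."""
    e_h = sum(path_lengths) / len(path_lengths)  # E[h(x)]
    return 2.0 ** (-e_h / c(psi))

# Short average paths (easy isolation) yield scores closer to 1:
print(anomaly_score([2, 3, 2, 4], psi=256), anomaly_score([9, 10, 8, 11], psi=256))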
Liu et al. [2010] presented the SCiForest (“SC” means “with Split selection
Criterion”), a variant of iForest. In contrast to iForest which considers only
axis-parallel splits of original features in the construction of random trees,
SCiForest tries to obtain smoother decision boundaries, similar to oblique decision trees, by considering hyper-plane splits derived from combinations of the original features. Furthermore, since considering hyper-planes makes the hypothesis space more complicated, SCiForest does not choose splits in a completely random manner; instead, a split selection criterion is defined to facilitate the selection of appropriate hyper-planes at each node and to reduce the risk of falling into poor sub-optimal solutions.
3.6 Further Readings
Bagging typically adopts majority voting for classification and simple averaging for regression. If the base learners are able to output confidence
values, weighted voting or weighted averaging are often used. Chapter 4 will
introduce combination methods.
Constructing ensembles of stable learners is difficult not only for Bagging,
but also for AdaBoost and other ensemble methods relying on data sample
manipulation. The reason is that pure data sample perturbation cannot provide the base learners with sufficiently large diversity. Chapter 5
will introduce diversity and discuss more on ensembles of stable learners.
Bühlmann and Yu [2002] theoretically showed that Bagging tends to
smooth crisp decisions, and the smoothing operation results in the variance reduction effect. Buja and Stuetzle [2000a,b, 2006] analyzed Bagging
by using U-statistics, and found that the leading effect of Bagging on variance is at the second order. They also extended Bagging from statistics to
statistical functional, and found that a bagged functional is also smooth.
The term forest was first used to refer to ensembles of decision trees by Ho
[1995]. There are many other random decision tree ensemble methods, in
addition to the ones introduced in this chapter, e.g., Dietterich [2000b], Cutler and Zhao [2001], Robnik-Šikonja [2004], Rodriguez et al. [2006], Geurts
et al. [2006].
4
Combination Methods
4.1 Benefits of Combination
After generating a set of base learners, rather than trying to find the best
single learner, ensemble methods resort to combination to achieve a strong
generalization ability, where the combination method plays a crucial role.
Dietterich [2000a] attributed the benefit from combination to the following
three fundamental reasons:
• Statistical issue: It is often the case that the hypothesis space is too
large to explore for limited training data, and that there may be several
different hypotheses giving the same accuracy on the training data. If
the learning algorithm chooses one of these hypotheses, there is a risk
that a mistakenly chosen hypothesis could not predict the future data
well. As shown in Figure 4.1(a), by combining the hypotheses, the risk
of choosing a wrong hypothesis can be reduced.
• Computational issue: Many learning algorithms perform some kind
of local search that may get stuck in local optima. Even if there are
enough training data, it may still be very difficult to find the best hypothesis. By running the local search from many different starting
points, the combination may provide a better approximation to the
true unknown hypothesis. As shown in Figure 4.1(b), by combining
the hypotheses, the risk of choosing a wrong local minimum can be
reduced.
• Representational issue: In many machine learning tasks, the true unknown hypothesis cannot be represented by any hypothesis in the hypothesis space. As shown in Figure 4.1(c), by combining the hypotheses, it may be possible to expand the space of representable functions, and thus the learning algorithm may be able to form a more accurate approximation to the true unknown hypothesis.
These three issues are among the most important reasons why traditional learning approaches fail. A learning algorithm that suffers from
the statistical issue is generally said to have a high “variance”, a learning
algorithm that suffers from the computational issue can be described as
FIGURE 4.1: Three fundamental reasons for combination: (a) the statistical
issue, (b) the computational issue, and (c) the representational issue. The
outer curve represents the hypothesis space, and the inner curve in (a) represents the hypotheses with the same accuracy on the training data. The
point labeled f is the true hypothesis, and the h_i's are the individual hypotheses.
(Plot based on a similar figure in [Dietterich, 2000a].)
having a high “computational variance”, and a learning algorithm that suffers from the representational issue is generally said to have a high “bias”.
Therefore, through combination, the variance as well as the bias of learning algorithms may be reduced; this has been confirmed by many empirical
studies [Xu et al., 1992, Bauer and Kohavi, 1999, Opitz and Maclin, 1999].
4.2 Averaging
Averaging is the most popular and fundamental combination method for
numeric outputs. In this section we take regression as an example to explain
how averaging works. Suppose we are given a set of T individual learners
{h_1, . . . , h_T}, and that the output of h_i for the instance x is h_i(x) ∈ ℝ; our task is to combine the h_i's to attain the final prediction of the real-valued variable.
4.2.1 Simple Averaging
Simple averaging obtains the combined output by averaging the outputs
of individual learners directly. Specifically, simple averaging gives the combined output H(x) as
H(x) = (1/T) Σ_{i=1}^{T} h_i(x) .   (4.1)
Suppose the underlying true function we try to learn is f (x), and x is
sampled according to a distribution p(x). The output of each learner can
be written as the true value plus an error item, i.e.,
h_i(x) = f(x) + ε_i(x) ,   i = 1, . . . , T.   (4.2)

Then, the mean squared error of h_i can be written as

∫ (h_i(x) − f(x))^2 p(x) dx = ∫ ε_i(x)^2 p(x) dx ,   (4.3)
and the averaged error made by the individual learners is
err(h) = (1/T) Σ_{i=1}^{T} ∫ ε_i(x)^2 p(x) dx .   (4.4)

Similarly, it is easy to derive that the expected error of the combined learner (i.e., the ensemble) is

err(H) = ∫ ( (1/T) Σ_{i=1}^{T} h_i(x) − f(x) )^2 p(x) dx = ∫ ( (1/T) Σ_{i=1}^{T} ε_i(x) )^2 p(x) dx .   (4.5)

It is easy to see that

err(H) ≤ err(h) .   (4.6)
That is, the expected ensemble error will be no larger than the averaged
error of the individual learners.
Moreover, if we assume that the errors ε_i's have zero mean and are uncorrelated, i.e.,

∫ ε_i(x) p(x) dx = 0   and   ∫ ε_i(x) ε_j(x) p(x) dx = 0   (for i ≠ j) ,   (4.7)

it is not difficult to get

err(H) = (1/T) err(h) ,   (4.8)
which suggests that the ensemble error is smaller by a factor of T than the
averaged error of the individual learners.
Owing to its simplicity and effectiveness, simple averaging is among the
most popularly used methods and represents the first choice in many real
applications. It is worth noting, however, that the error reduction shown in
(4.8) is derived based on the assumption that the errors of the individual
learners are uncorrelated, while in ensemble learning the errors are typically highly correlated since the individual learners are trained on the same
problem. Therefore, the error reduction shown in (4.8) is generally hard to
achieve.
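A small simulation (ours, for illustration only) makes this caveat concrete: with uncorrelated zero-mean errors the averaged learner attains roughly the 1/T reduction of (4.8), whereas with strongly correlated errors most of the benefit disappears. The way the correlated error processes are generated below is an arbitrary modeling choice, not something from the text.

import numpy as np

rng = np.random.default_rng(0)
T, n = 10, 100_000

def mse_of_average(rho: float) -> float:
    """Mean squared error of the simple average of T unit-variance error
    processes whose pairwise correlation is rho."""
    shared = rng.standard_normal(n)
    private = rng.standard_normal((T, n))
    errors = np.sqrt(rho) * shared + np.sqrt(1 - rho) * private   # each err_i has variance 1
    return float(np.mean(errors.mean(axis=0) ** 2))

print("uncorrelated:", mse_of_average(0.0))   # close to 1/T = 0.1
print("correlated:  ", mse_of_average(0.8))   # close to 0.8 + 0.2/T, far above 1/T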
4.2.2 Weighted Averaging
Weighted averaging obtains the combined output by averaging the outputs of individual learners with different weights implying different importance. Specifically, weighted averaging gives the combined output H(x) as
H(x) = Σ_{i=1}^{T} w_i h_i(x) ,   (4.9)

where w_i is the weight for h_i, and the weights w_i's are usually assumed to be constrained by

w_i ≥ 0   and   Σ_{i=1}^{T} w_i = 1 .   (4.10)
Similarly as in Section 4.2.1, suppose the underlying true function we try
to learn is f (x), and x is sampled according to a distribution p(x). The output of each learner can be written as (4.2). Then it is easy to write the ensemble error as [Perrone and Cooper, 1993]
err(H) = ∫ ( Σ_{i=1}^{T} w_i h_i(x) − f(x) )^2 p(x) dx
       = ∫ ( Σ_{i=1}^{T} w_i h_i(x) − f(x) ) ( Σ_{j=1}^{T} w_j h_j(x) − f(x) ) p(x) dx
       = Σ_{i=1}^{T} Σ_{j=1}^{T} w_i w_j C_{ij} ,   (4.11)

where

C_{ij} = ∫ (h_i(x) − f(x)) (h_j(x) − f(x)) p(x) dx .   (4.12)
It is evident that the optimal weights can be solved by

w = arg min_w Σ_{i=1}^{T} Σ_{j=1}^{T} w_i w_j C_{ij} .   (4.13)

By applying the famous Lagrange multiplier method, it can be obtained that [Perrone and Cooper, 1993]

w_i = ( Σ_{j=1}^{T} C^{-1}_{ij} ) / ( Σ_{k=1}^{T} Σ_{j=1}^{T} C^{-1}_{kj} ) .   (4.14)
(4.14) provides a closed-form solution to the optimal weights. It is worth
noticing, however, that this solution requires the correlation matrix C to
be invertible, yet in ensemble learning such a matrix is usually singular
or ill-conditioned, since the errors of the individual learners are typically
highly correlated and many individual learners may be similar since they
are trained on the same problem. Therefore, the solution shown in (4.14)
is generally infeasible, and moreover, it does not guarantee non-negative
solutions.
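For completeness, here is a minimal sketch (ours) of the closed form (4.14) applied to empirical error vectors: C is estimated as in (4.12) from errors measured on a sample, and a small ridge term is added because, as just noted, C is typically singular or ill-conditioned; the resulting weights may still be negative.

import numpy as np

def optimal_weights(errors: np.ndarray, ridge: float = 1e-6) -> np.ndarray:
    """errors has shape (T, n); row i holds h_i(x_k) - f(x_k) on a sample of size n."""
    T, n = errors.shape
    C = errors @ errors.T / n                       # empirical version of (4.12)
    C_inv = np.linalg.inv(C + ridge * np.eye(T))    # ridge added for numerical stability
    w = C_inv.sum(axis=1)                           # row sums of C^{-1}, as in (4.14)
    return w / w.sum()

errs = np.random.default_rng(1).standard_normal((3, 500)) * np.array([[0.5], [1.0], [2.0]])
print(optimal_weights(errs))                        # most weight goes to the low-error learner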
It is easy to see that simple averaging, which can be regarded as taking
equal weights for all individual learners, is a special case of weighted averaging. Other combination methods, such as voting, are also special cases
or variants of weighted averaging. Indeed, given a set of individual learners, the weighted averaging formulation [Perrone and Cooper, 1993] provides a fundamental motivation for ensemble methods, since any ensemble method can be regarded as a specific way of deciding the weights for combining the individual learners, and different ensemble methods can be regarded as different implementations of weighted averaging. From this viewpoint it is easy to see that no ensemble method is consistently the best, since deciding the weights is a computationally hard problem.
Notice that though simple averaging is a special case of weighted averaging, it does not mean that weighted averaging is definitely better than
simple averaging. In fact, experimental results reported in the literature do
not show that weighted averaging is clearly superior to simple averaging
[Xu et al., 1992, Ho et al., 1994, Kittler et al., 1998]. One important reason
is that the data in real tasks are usually noisy and insufficient, and thus the
estimated weights are often unreliable. In particular, with a large ensemble, there are a lot of weights to learn, and this can easily lead to overfitting;
simple averaging does not have to learn any weights, and so suffers little
from overfitting. In general, it is widely accepted that simple averaging is
appropriate for combining learners with similar performances, whereas if
the individual learners exhibit nonidentical strength, weighted averaging
with unequal weights may achieve a better performance.
4.3 Voting
Voting is the most popular and fundamental combination method for
nominal outputs. In this section we take classification as an example to explain how voting works. Suppose we are given a set of T individual classifiers {h1 , . . . , hT } and our task is to combine hi ’s to predict the class label from a set of l possible class labels {c1 , . . . , cl }. It is generally assumed
that for an instance x, the outputs of the classifier h_i are given as an l-dimensional label vector (h_i^1(x), . . . , h_i^l(x))^T, where h_i^j(x) is the output of h_i for the class label c_j. The h_i^j(x) can take different types of values according to the information provided by the individual classifiers, e.g.,
- Crisp label: h_i^j(x) ∈ {0, 1}, which takes the value one if h_i predicts c_j as the class label and zero otherwise.
- Class probability: h_i^j(x) ∈ [0, 1], which can be regarded as an estimate of the posterior probability P(c_j | x).
For classifiers that produce un-normalized margins, such as SVMs, calibration methods such as Platt scaling [Platt, 2000] or isotonic regression [Zadrozny and Elkan, 2001b] can be used to convert such an output into a probability. Notice that the class probabilities estimated by most classifiers are poor; however, combination methods based on class probabilities are often highly competitive with those based on crisp labels, especially after careful calibration.
4.3.1 Majority Voting
Majority voting is the most popular voting method. Here, every classifier
votes for one class label, and the final output class label is the one that receives more than half of the votes; if none of the class labels receives more
than half of the votes, a rejection option will be given and the combined
classifier makes no prediction. That is, the output class label of the ensemble is
H(x) = c_j if Σ_{i=1}^{T} h_i^j(x) > (1/2) Σ_{k=1}^{l} Σ_{i=1}^{T} h_i^k(x), and H(x) = rejection otherwise.   (4.15)
If there are a total of T classifiers for a binary classification problem, the ensemble decision will be correct if at least ⌊T/2⌋ + 1 classifiers choose the correct class label. Assume that the outputs of the classifiers are independent and each classifier has an accuracy p, implying that each classifier makes a correct classification with probability p. The probability of the ensemble making a correct decision can then be calculated using a binomial distribution; specifically, the probability of obtaining at least ⌊T/2⌋ + 1 correct classifiers out of T is [Hansen and Salamon, 1990]:
P_mv = Σ_{k=⌊T/2⌋+1}^{T} \binom{T}{k} p^k (1 − p)^{T−k} .   (4.16)
The accuracy of the ensemble with different values of p and T is illustrated
in Figure 4.2.
FIGURE 4.2: Ensemble accuracy of majority voting of T independent classifiers with accuracy p for binary classification.
Lam and Suen [1997] showed that
- If p > 0.5, then P_mv is monotonically increasing in T, and lim_{T→+∞} P_mv = 1;
- If p < 0.5, then P_mv is monotonically decreasing in T, and lim_{T→+∞} P_mv = 0;
- If p = 0.5, then P_mv = 0.5 for any T.
Notice that this result is obtained based on the assumption that the individual classifiers are statistically independent, yet in practice the classifiers are generally highly correlated since they are trained on the same problem. Therefore, it is impractical to expect the majority voting accuracy to converge to one as the number of individual classifiers increases.
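Equation (4.16) and the Lam-Suen limits are easy to check numerically; the following short sketch (ours) prints P_mv for a few odd ensemble sizes, with the function name p_mv chosen for illustration.

import math

def p_mv(T: int, p: float) -> float:
    """Majority-voting accuracy (4.16) for T independent classifiers of accuracy p."""
    k0 = T // 2 + 1                                  # floor(T/2) + 1 correct votes needed
    return sum(math.comb(T, k) * p ** k * (1 - p) ** (T - k) for k in range(k0, T + 1))

for p in (0.3, 0.5, 0.7):
    print(p, [round(p_mv(T, p), 3) for T in (1, 5, 25, 101)])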
4.3.2 Plurality Voting
In contrast to majority voting which requires the final winner to take at
least half of votes, plurality voting takes the class label which receives the
largest number of votes as the final winner. That is, the output class label of
the ensemble is
H(x) = c_{arg max_j Σ_{i=1}^{T} h_i^j(x)} ,   (4.17)
and ties are broken arbitrarily. It is obvious that plurality voting does not
have a reject option, since it can always find a label receiving the largest
number of votes. Moreover, in the case of binary classification, plurality voting indeed coincides with majority voting.
4.3.3 Weighted Voting
If the individual classifiers are with unequal performance, intuitively, it
is reasonable to give more power to the stronger classifiers in voting; this is
realized by weighted voting. The output class label of the ensemble is
H(x) = c_{arg max_j Σ_{i=1}^{T} w_i h_i^j(x)} ,   (4.18)

where w_i is the weight assigned to the classifier h_i. In practical applications, the weights are often normalized and constrained by w_i ≥ 0 and Σ_{i=1}^{T} w_i = 1, similar to weighted averaging.
Take a simple example to compare weighted voting and majority voting.
Suppose there are five independent individual classifiers with accuracies
{0.7, 0.7, 0.7, 0.9, 0.9}, respectively. Thus, the accuracy of majority voting
(i.e., at least three out of five classifiers are correct) is
P_mv = 0.7^3 + 3 × 0.7^2 × 0.3 × 0.9^2 + 2 × 3 × 0.7^2 × 0.3 × 0.9 × 0.1 + 3 × 0.7 × 0.3^2 × 0.9^2 ≈ 0.933 ,
which is better than the best individual classifier. For weighted voting, suppose that the weights given to the classifiers are {1/9, 1/9, 1/9, 1/3, 1/3}, respectively, and then the accuracy of weighted voting is
P_wv = 0.9^2 + 2 × 3 × 0.9 × 0.1 × 0.7^2 × 0.3 + 2 × 0.9 × 0.1 × 0.7^3 ≈ 0.951 .
This shows that, with adequate weight assignments, weighted voting can be
better than both the best individual classifier and majority voting. Similar
to weighted averaging, the key is how to obtain the weights.
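Both accuracies in this example can be reproduced by brute-force enumeration of the 2^5 correctness patterns of the five independent classifiers, as in the following sketch (ours, for illustration).

from itertools import product

accs = [0.7, 0.7, 0.7, 0.9, 0.9]
weights = [1/9, 1/9, 1/9, 1/3, 1/3]

def voting_accuracy(decision) -> float:
    """Sum the probabilities of all correctness patterns accepted by `decision`."""
    total = 0.0
    for pattern in product([0, 1], repeat=len(accs)):       # 1 means "classifier correct"
        prob = 1.0
        for correct, p in zip(pattern, accs):
            prob *= p if correct else (1 - p)
        if decision(pattern):
            total += prob
    return total

print(voting_accuracy(lambda b: sum(b) >= 3))                                    # ~ 0.933
print(voting_accuracy(lambda b: sum(w for w, c in zip(weights, b) if c) > 0.5))  # ~ 0.951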
Let ℓ = (ℓ_1, . . . , ℓ_T) denote the outputs of the individual classifiers, where ℓ_i is the class label predicted for the instance x by the classifier h_i, and let p_i denote the accuracy of h_i. There is a Bayesian optimal discriminant function for the combined output on the class label c_j, i.e.,

H^j(x) = log ( P(c_j) P(ℓ | c_j) ) .   (4.19)

Assuming that the outputs of the individual classifiers are conditionally independent, i.e., P(ℓ | c_j) = Π_{i=1}^{T} P(ℓ_i | c_j), it follows that
H^j(x) = log P(c_j) + Σ_{i=1}^{T} log P(ℓ_i | c_j)
       = log P(c_j) + log ( Π_{i: ℓ_i = c_j} P(ℓ_i | c_j) · Π_{i: ℓ_i ≠ c_j} P(ℓ_i | c_j) )
       = log P(c_j) + log ( Π_{i: ℓ_i = c_j} p_i · Π_{i: ℓ_i ≠ c_j} (1 − p_i) )
       = log P(c_j) + Σ_{i: ℓ_i = c_j} log ( p_i / (1 − p_i) ) + Σ_{i=1}^{T} log(1 − p_i) .   (4.20)
Since Σ_{i=1}^{T} log(1 − p_i) does not depend on the class label c_j, and ℓ_i = c_j can be expressed by the result of h_i^j(x), the discriminant function can be reduced to

H^j(x) = log P(c_j) + Σ_{i=1}^{T} h_i^j(x) log ( p_i / (1 − p_i) ) .   (4.21)
The first term at the right-hand side of (4.21) does not rely on the individual learners, while the second term discloses that the optimal weights for
weighted voting satisfy
w_i ∝ log ( p_i / (1 − p_i) ) ,   (4.22)
which shows that the weights should be in proportion to the performance
of the individual learners.
Notice that (4.22) is obtained by assuming independence among the outputs of the individual classifiers, yet this does not hold since all the individual classifiers are trained on the same problem and they are usually highly
correlated. Moreover, it requires the estimation of ground-truth accuracies
of individual classifiers, and does not take the class priors into account.
Therefore, in real practice, (4.22) does not always lead to a performance
better than majority voting.
4.3.4 Soft Voting
For individual classifiers which produce crisp class labels, majority voting, plurality voting and weighted voting can be used, while for individual
classifiers which produce class probability outputs, soft voting is generally
the choice. Here, the individual classifier h_i outputs an l-dimensional vector (h_i^1(x), . . . , h_i^l(x))^T for the instance x, where h_i^j(x) ∈ [0, 1] can be regarded as an estimate of the posterior probability P(c_j | x).
If all the individual classifiers are treated equally, the simple soft voting
method generates the combined output by simply averaging all the individual outputs, and the final output for class cj is given by
H^j(x) = (1/T) Σ_{i=1}^{T} h_i^j(x) .   (4.23)
If we consider combining the individual outputs with different weights, the
weighted soft voting method can be any of the following three forms:
- A classifier-specific weight is assigned to each classifier, and the combined output for class c_j is

  H^j(x) = Σ_{i=1}^{T} w_i h_i^j(x) ,   (4.24)

  where w_i is the weight assigned to the classifier h_i.

- A class-specific weight is assigned to each classifier per class, and the combined output for class c_j is

  H^j(x) = Σ_{i=1}^{T} w_i^j h_i^j(x) ,   (4.25)

  where w_i^j is the weight assigned to the classifier h_i for the class c_j.

- A weight is assigned to each example of each class for each classifier, and the combined output for class c_j is

  H^j(x) = Σ_{i=1}^{T} Σ_{k=1}^{m} w_{ik}^j h_i^j(x) ,   (4.26)

  where w_{ik}^j is the weight of the instance x_k of the class c_j for the classifier h_i.
In real practice, the third type is not often used since it may involve a large number of weight coefficients. The first type is similar to weighted averaging or weighted voting, and so, in the following we focus on the second type, i.e., class-specific weights. Since h_i^j(x) can be regarded as an estimate of P(c_j | x), it follows that

h_i^j(x) = P(c_j | x) + ε_i^j(x) ,   (4.27)
where ε_i^j(x) is the approximation error. In classification, the target output is given as a class label. If the estimation is unbiased, the combined output H^j(x) = Σ_{i=1}^{T} w_i^j h_i^j(x) is also unbiased, and we can obtain a variance-minimized unbiased estimation H^j(x) of P(c_j | x) by setting the weights. Minimizing the variance of the combined approximation error Σ_{i=1}^{T} w_i^j ε_i^j(x) under the constraints w_i^j ≥ 0 and Σ_{i=1}^{T} w_i^j = 1, we can get the optimization problem

w^j = arg min_{w^j} Σ_{k=1}^{m} ( Σ_{i=1}^{T} w_i^j h_i^j(x_k) − I(f(x_k) = c_j) )^2 ,   j = 1, . . . , l,   (4.28)
from which the weights can be solved.
Notice that soft voting is generally used for homogeneous ensembles.
For heterogeneous ensembles, the class probabilities generated by different types of learners usually cannot be compared directly without a careful
calibration. In such situations, the class probability outputs are often converted to class label outputs by setting h_i^j(x) to 1 if h_i^j(x) = max_j {h_i^j(x)} and 0 otherwise, and then the voting methods for crisp labels can be applied.
4.3.5 Theoretical Issues
4.3.5.1 Theoretical Bounds of Majority Voting
Narasimhamurthy [2003, 2005] analyzed the theoretical bounds of majority voting. In this section we focus on the introduction of this analysis.
Consider the binary classification problem. Given are a set of T classifiers, h_1, . . . , h_T, with accuracies p_1, . . . , p_T, respectively; for simplicity, T is assumed to be an odd number (similar results for an even number can be found in [Narasimhamurthy, 2005]). The joint statistics of the classifiers can
be represented by Venn diagrams, and an example of three classifiers is illustrated in Figure 4.3. Here, each classifier is represented by a bit (i.e., 1
or 0) with 1 indicating that the classifier is correct and 0 otherwise. The regions in the Venn diagram correspond to the bit combinations. For example, the region marked with x3 corresponds to the bit combination “110”,
i.e., it corresponds to the event that both h1 and h2 are correct while h3 is
incorrect, and x3 indicates the probability associated with this event. Now,
let x = [x_0, . . . , x_{2^T−1}]^T denote the vector of joint probabilities, bit(i, T) denote the T-bit binary representation of the integer i, and f_mv denote a bit vector of length 2^T whose entry at the ith position is

f_mv(i) = 1 if the number of 1's in bit(i, T) > T/2, and f_mv(i) = 0 otherwise.   (4.29)

Then, the probability of correct classification of majority voting can be represented as f_mv^T x. This is the objective to be maximized/minimized subject to certain constraints [Narasimhamurthy, 2003, 2005].
FIGURE 4.3: Venn diagram showing joint statistics of three classifiers. The
regions marked with different xi ’s correspond to different bit combinations,
and xi is the probability associated with the corresponding region. (Plot
based on a similar figure in [Narasimhamurthy, 2003].)
Notice that the accuracy of classifier hi is pi , that is, the probability of hi
being correct is pi . This can be represented as T constraints in the form of
b_i^T x = p_i   (1 ≤ i ≤ T) ,   (4.30)

and it is easy to find that

b_1 = (0, 1, . . . , 0, 1)^T
b_2 = (0, 0, 1, 1, . . . , 0, 0, 1, 1)^T
  ⋮
b_T = (0, 0, . . . , 0, 0, 1, 1, . . . , 1, 1)^T ,

where b_T consists of 2^{T−1} zeros followed by 2^{T−1} ones.
Let B = (b_1, . . . , b_T)^T and p = (p_1, . . . , p_T)^T. Considering the constraints Σ_{i=0}^{2^T−1} x_i = 1 and 0 ≤ x_i ≤ 1, the lower and upper bounds of majority voting can be solved from the following linear programming problem [Narasimhamurthy, 2003, 2005]:

min_x / max_x   f_mv^T x   (4.31)
s.t.   B x = p ,
       1^T x = 1 ,
       0 ≤ x_i ≤ 1   (∀ i = 0, 1, . . . , 2^T − 1).
The theoretical lower and upper bounds of majority voting for three and five classifiers are illustrated in Figure 4.4 (a) and (b), respectively.
FIGURE 4.4: Theoretical lower and upper bounds of majority voting, and
the accuracy of individual independent classifiers when there are (a)
three classifiers, and (b) five classifiers. (Plot based on a similar figure in
[Narasimhamurthy, 2003].)
For the purpose of illustration, all classifiers are assumed to have the same accuracy. The accuracy varies from 0 to 1, and the corresponding lower and upper bounds of majority voting are determined for each value. The accuracy of the individual classifiers is also plotted, and it is obvious that the accuracy curves lie inside the region bounded by the lower and upper bounds.
As for the role of the ensemble size T , we can find that the number of constraints is linear in T whereas the dimension of the vector x is exponential
in T . In other words, the “degrees of freedom” increase exponentially as T
increases. Hence, for a particular accuracy of the classifiers, increasing the
ensemble size will lead to a lower theoretical minimum (lower bound) and
a higher theoretical maximum (upper bound). This trend can be found by
comparing Figure 4.4 (a) and (b). It is worth noting, however, that this conclusion is drawn based on the assumption that the individual classifiers are
independent, while this assumption usually does not hold in real practice.
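For modest T, the linear program (4.31) can be solved directly; the sketch below (ours) uses scipy.optimize.linprog with all classifiers sharing a common accuracy p, and the helper name majority_voting_bounds is illustrative. For T = 3 and p = 0.6 it returns bounds close to (0.4, 0.9), which bracket the accuracy of three independent classifiers.

import numpy as np
from scipy.optimize import linprog

def majority_voting_bounds(T: int, p: float):
    n = 2 ** T
    bits = (np.arange(n)[:, None] >> np.arange(T)) & 1       # bit(i, T) for every i
    f_mv = (bits.sum(axis=1) > T / 2).astype(float)          # (4.29)
    B = bits.T.astype(float)                                  # row i of B is b_i
    A_eq = np.vstack([B, np.ones(n)])                         # Bx = p and 1'x = 1
    b_eq = np.append(np.full(T, p), 1.0)
    lower = linprog(c=f_mv,  A_eq=A_eq, b_eq=b_eq, bounds=(0, 1)).fun
    upper = -linprog(c=-f_mv, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1)).fun
    return lower, upper

print(majority_voting_bounds(T=3, p=0.6))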
4.3.5.2 Decision Boundary Analysis
Tumer and Ghosh [1996] developed the decision boundary analysis
framework. Later, Fumera and Roli [2005] analyzed the simple as well as
weighted soft voting methods based on this framework. In this section we
focus on the introduction of this framework and the main results in [Tumer
and Ghosh, 1996, Fumera and Roli, 2005].
For simplicity, we consider one-dimensional feature space as in [Tumer
and Ghosh, 1996, Fumera and Roli, 2005], while it is known that the same
results hold for multi-dimensional feature spaces [Tumer, 1996]. According
to Bayesian decision theory, an instance x should be assigned to class ci
for which the posterior probability P (ci | x) is the maximum. As shown in
Figure 4.5, the ideal decision boundary between the classes ci and cj is the
point x∗ such that
P(c_i | x*) = P(c_j | x*) > max_{k≠i,j} P(c_k | x*) .   (4.32)
In practice, the classifier can only provide an estimate hj (x) of the true posterior probability P (cj | x), that is,
h_j(x) = P(c_j | x) + ε_j(x) ,   (4.33)

where ε_j(x) denotes the error term, which is regarded as a random variable with mean β_j (named “bias”) and variance σ_j^2. Thus, when the Bayesian decision rule is applied, misclassification occurs if

arg max_i h_i(x) ≠ arg max_i P(c_i | x) .   (4.34)
The decision boundary obtained from the estimated posteriors, denoted as x_b, is characterized by

h_i(x_b) = h_j(x_b) ,   (4.35)

and it may differ from the ideal boundary by an offset

b = x_b − x* .   (4.36)
As shown in Figure 4.5, this leads to an additional misclassification error
term over Bayes error, named added error [Tumer and Ghosh, 1996].
In the following, we focus on the case of unbiased and uncorrelated errors (detailed analysis of the other cases can be found in [Tumer and Ghosh, 1996, Fumera and Roli, 2005]). Thus, for any given x, the mean of every error term ε_j(x) is zero, i.e., β_j = 0, and the error terms on different classes, ε_i(x) and ε_j(x), i ≠ j, are uncorrelated.
Without loss of generality, assume b > 0. From Figure 4.5 it is easy to see
that the added error depends on the offset b and is given by
∫_{x*}^{x*+b} [ P(c_j | x) − P(c_i | x) ] p(x) dx ,    (4.37)
where p(x) is the probability distribution of x. Assuming that the shift b =
xb − x∗ is small, a linear approximation can be used around x∗ , that is,
P(c_k | x* + b) = P(c_k | x*) + b P′(c_k | x*) .    (4.38)
Moreover, p(x) is approximated by p(x*). Therefore, the added error becomes (p(x*) t / 2) b², where t = P′(c_j | x*) − P′(c_i | x*). Based on (4.32) and (4.35), it is easy to get

b = ( ε_i(x_b) − ε_j(x_b) ) / t .    (4.39)
FIGURE 4.5: True posteriors P (ci | x) and P (cj | x) (solid lines) around
the boundary x∗ , and the estimated posteriors hi (x) and hj (x) (dashed
lines) leading to the boundary xb . Lightly and darkly shaded areas represent
Bayes error and added error, respectively. (Plot based on a similar figure in
[Fumera and Roli, 2005].)
Since ε_i(x) and ε_j(x) are unbiased and uncorrelated, the bias and variance of b are given by β_b = 0 and σ_b² = (σ_i² + σ_j²)/t². Consequently, the expected value of the added error with respect to b, denoted by err_add, is then given by [Tumer and Ghosh, 1996]

err_add(h) = (p(x*) t / 2) (β_b² + σ_b²) = (p(x*) / 2t) (σ_i² + σ_j²) .    (4.40)
Now, consider the simplest form of weighted soft voting, which assigns
a non-negative weight wi to each individual learner. Based on (4.33), the
averaged estimate for the jth class is
h_j^wsv(x) = Σ_{i=1}^T w_i h_j^i(x) = P(c_j | x) + ε_j^wsv(x) ,    (4.41)

where ε_j^wsv(x) = Σ_{i=1}^T w_i ε_j^i(x) and “wsv” denotes “weighted soft voting”.
Analogous to that in Figure 4.5, the decision boundary x_b^wsv is characterized by h_i^wsv(x) = h_j^wsv(x), and it has an offset b^wsv from x*. Following the same steps as above, the expected added error of weighted soft voting can be obtained as
err_add^wsv(H) = (p(x*) / 2t) Σ_{k=1}^T w_k² [ (σ_i^k)² + (σ_j^k)² ] = Σ_{k=1}^T w_k² err_add(h_k) ,    (4.42)

where, from (4.40),

err_add(h_k) = (p(x*) / 2t) [ (σ_i^k)² + (σ_j^k)² ]    (4.43)
is the expected added error of the individual classifier hk . Thus, the performance of soft voting methods can be analyzed based on the expected
added errors of individual classifiers, instead of the biases and
variances.
Minimizing err_add^wsv(H) under the constraints w_k ≥ 0 and Σ_{k=1}^T w_k = 1, the optimal weights can be obtained as

w_k = ( 1 / err_add(h_k) ) / ( Σ_{i=1}^T 1 / err_add(h_i) ) .    (4.44)
This shows that the optimal weights are inversely proportional to the expected added errors of the corresponding classifiers. This provides theoretical support for the argument that different weights should be used for classifiers of different strengths [Tumer and Ghosh, 1996].
Substituting (4.44) into (4.42), the value of err_add^wsv(H) corresponding to the optimal weights is

err_add^wsv(H) = 1 / ( 1/err_add(h_1) + . . . + 1/err_add(h_T) ) .    (4.45)
On the other hand, if simple soft voting is used, an equal weight wi = 1/T
is applied to (4.42), and the expected added error is
err_add^ssv(H) = (1/T²) Σ_{k=1}^T err_add(h_k) ,    (4.46)
where “ssv” denotes “simple soft voting”.
When the individual classifiers exhibit identical expected added errors, i.e., err_add(h_i) = err_add(h_j) (∀ i, j), it follows that err_add^wsv(H) = err_add^ssv(H), and the expected added error is reduced by a factor of T over each individual classifier.
When the individual classifiers exhibit nonidentical added errors, it follows that err_add^wsv(H) < err_add^ssv(H). Without loss of generality, denote the smallest and largest expected added errors as

err_add^best(H) = min_i { err_add(h_i) }   and   err_add^worst(H) = max_i { err_add(h_i) } ,    (4.47)
and the corresponding classifiers are called the “best” and “worst” classifiers, respectively. From (4.46), the error reduction achieved by simple soft voting over the kth classifier is

err_add^ssv(H) / err_add(h_k) = (1/T²) ( 1 + Σ_{i≠k} err_add(h_i) / err_add(h_k) ) .    (4.48)
It follows that the reduction factors over the best and worst classifiers are respectively in the ranges

err_add^ssv(H) / err_add^best(H) ∈ [1/T, +∞)   and   err_add^ssv(H) / err_add^worst(H) ∈ [1/T², 1/T] .    (4.49)
From (4.45), the reduction factor of weighted soft voting is

err_add^wsv(H) / err_add(h_k) = ( 1 + Σ_{i≠k} err_add(h_k) / err_add(h_i) )^{−1} ,    (4.50)
and consequently

err_add^wsv(H) / err_add^best(H) ∈ [1/T, 1)   and   err_add^wsv(H) / err_add^worst(H) ∈ (0, 1/T] .    (4.51)
From (4.49) and (4.51) it can be seen that, if the individual classifiers
have nonidentical added errors, both the simple and weighted soft voting
achieve an error reduction smaller than a factor of T over the best classifier,
and larger than a factor of T over the worst classifier. Moreover, weighted
soft voting always performs better than the best individual classifier, while
when the performances of individual classifiers are quite poor, the added
error of simple soft voting may become arbitrarily larger than that of the
best individual classifier. Furthermore, the reduction achieved by weighted
soft voting over the worst individual classifier can be arbitrarily large, while
the maximum reduction achievable by simple soft voting over the worst
individual classifier is 1/T². It is worth noting that all these conclusions
are obtained based on the assumptions that the individual classifiers are
uncorrelated and that the optimal weights for weighted soft voting can be
solved, yet real situations are more complicated.
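As a small illustration of (4.44)-(4.46), the following sketch (not the book's code) computes the optimal weights and the resulting expected added errors of weighted and simple soft voting from given estimates of the individual added errors; the three error values are arbitrary.

# Minimal sketch of the optimal soft-voting weights and the combined added errors.
import numpy as np

def soft_voting_added_errors(err_add):
    err_add = np.asarray(err_add, dtype=float)   # err_add[k] = err_add(h_k)
    T = len(err_add)
    w = (1.0 / err_add) / np.sum(1.0 / err_add)  # optimal weights, (4.44)
    err_wsv = 1.0 / np.sum(1.0 / err_add)        # weighted soft voting, (4.45)
    err_ssv = np.sum(err_add) / T**2             # simple soft voting, (4.46)
    return w, err_wsv, err_ssv

w, err_wsv, err_ssv = soft_voting_added_errors([0.02, 0.05, 0.20])
# err_wsv = 1/75 (below the best individual error 0.02);
# err_ssv = 0.03 (here worse than the best individual classifier)
print(w, err_wsv, err_ssv)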
4.4 Combining by Learning
4.4.1 Stacking
Stacking [Wolpert, 1992, Breiman, 1996b, Smyth and Wolpert, 1998] is
a general procedure where a learner is trained to combine the individual learners. Here, the individual learners are called the first-level learners,
while the combiner is called the second-level learner, or meta-learner.
The basic idea is to train the first-level learners using the original training data set, and then generate a new data set for training the second-level
learner, where the outputs of the first-level learners are regarded as input
features while the original labels are still regarded as labels of the new training data. The first-level learners are often generated by applying different
learning algorithms, and so, stacked ensembles are often heterogeneous,
though it is also possible to construct homogeneous stacked ensembles.
The pseudo-code of a general stacking procedure is summarized in Figure 4.6.
Input: Data set D = {(x1, y1), (x2, y2), . . . , (xm, ym)};
       First-level learning algorithms L1, . . . , LT;
       Second-level learning algorithm L.
Process:
 1. for t = 1, . . . , T:          % Train a first-level learner ht by applying the
 2.    ht = Lt(D);                 %   first-level learning algorithm Lt
 3. end
 4. D′ = ∅;                        % Generate a new data set
 5. for i = 1, . . . , m:
 6.    for t = 1, . . . , T:
 7.       zit = ht(xi);
 8.    end
 9.    D′ = D′ ∪ {((zi1, . . . , ziT), yi)};
10. end
11. h′ = L(D′);                    % Train the second-level learner h′ by applying
                                   %   the second-level learning algorithm L to the
                                   %   new data set D′
Output: H(x) = h′(h1(x), . . . , hT(x))

FIGURE 4.6: A general Stacking procedure.
On one hand, stacking is a general framework which can be viewed as
a generalization of many ensemble methods. On the other hand, it can be
viewed as a specific combination method which combines by learning, and
this is the reason why we introduce Stacking in this chapter.
In the training phase of stacking, a new data set needs to be generated
from the first-level classifiers. If the exact data that are used to train the
first-level learner are also used to generate the new data set for training
the second-level learner, there will be a high risk of overfitting. Hence, it
is suggested that the instances used for generating the new data set be excluded from the training examples for the first-level learners, and a cross-validation or leave-one-out procedure is often recommended.
Take k-fold cross-validation for example. Here, the original training data set D is randomly split into k almost equal parts D1, . . . , Dk. Define Dj and D(−j) = D \ Dj to be the test and training sets for the jth fold. Given T learning algorithms, a first-level learner h_t^(−j) is obtained by invoking the tth learning algorithm on D(−j). For each xi in Dj, the test set of the jth fold, let zit denote the output of the learner h_t^(−j) on xi. Then, at the end of the entire cross-validation process, the new data set is generated from the T individual learners as

D′ = {(zi1, . . . , ziT, yi)}_{i=1}^m ,    (4.52)

on which the second-level learning algorithm will be applied, and the resulting learner h′ is a function of (z1, . . . , zT) for y. After generating the new data set, generally, the final first-level learners are re-generated by training on the whole training data.
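A minimal sketch of this cross-validation procedure is given below (it is not the book's implementation); the choice of scikit-learn first-level learners (a decision tree and a k-nearest neighbor classifier) and of logistic regression as the second-level learner is an arbitrary illustration, and crisp predictions rather than class probabilities are used as meta-features.

# Minimal sketch of Stacking with k-fold cross-validation for generating D'.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def stacking_fit(X, y, first_level, second_level, k=5):
    m, T = len(X), len(first_level)
    Z = np.zeros((m, T))                                   # meta-features z_it
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        for t, learner in enumerate(first_level):
            h = clone(learner).fit(X[train_idx], y[train_idx])
            Z[test_idx, t] = h.predict(X[test_idx])        # out-of-fold predictions
    meta = clone(second_level).fit(Z, y)                   # second-level learner on D'
    firsts = [clone(l).fit(X, y) for l in first_level]     # re-train on the whole data
    return firsts, meta

def stacking_predict(X, firsts, meta):
    Z = np.column_stack([h.predict(X) for h in firsts])
    return meta.predict(Z)

X, y = make_classification(n_samples=200, random_state=0)
firsts, meta = stacking_fit(X, y,
                            first_level=[DecisionTreeClassifier(random_state=0),
                                         KNeighborsClassifier()],
                            second_level=LogisticRegression())
print(np.mean(stacking_predict(X, firsts, meta) == y))     # training accuracy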
Breiman [1996b] demonstrated the success of stacked regression. He
used regression trees of different sizes or linear regression models with different numbers of variables as the first-level learners, and least-square linear regression model as the second-level learner under the constraint that
all regression coefficients are non-negative. This non-negativity constraint
was found to be crucial to guarantee that the performance of the stacked
ensemble would be better than selecting the single best learner.
For stacked classification, Wolpert [1992] indicated that it is crucial to
consider the types of features for the new training data, and the types of
learning algorithms for the second-level learner.
Ting and Witten [1999] recommended to use class probabilities instead
of crisp class labels as features for the new data, since this makes it possible to take into account not only the predictions but also the confidences of the individual classifiers. For first-level classifiers that can output class probabilities, the output of the classifier hk on an instance x is
(hk1 (x), . . . , hkl (x)), which is a probability distribution over all the possible
class labels {c1 , . . . , cl }, and hkj (x) denotes the probability predicted by hk
for the instance x belonging to class cj . Though hk predicts only the class cj
with the largest class probability hkj (x) as the class label, the probabilities
it obtained for all classes contain helpful information. Thus, the class probabilities from all first-level classifiers on x can be used along with the true
class label of x to form a new training example for the second-level learner.
Ting and Witten [1999] also recommended to use multi-response linear regression (MLR), which is a variant of the least-square linear regression algorithm [Breiman, 1996b], as the second-level learning algorithm.
Any classification problem with real-valued features can be converted into
a multi-response regression problem. For a classification problem with l
classes {c1 , . . . , cl }, l separate regression problems are formed as follows:
for each class cj, a linear regression model is constructed to predict a binary variable, which equals one if cj is the correct class label and zero otherwise. The linear regression coefficients are determined based on the least-squares principle. In the prediction stage, given an instance x to classify,
all the trained linear regression models will be invoked, and the class label corresponding to the regression model with the largest value will be
output. It was found that the non-negativity constraints that are necessary
in regression [Breiman, 1996b] are irrelevant to the performance improvement in classification [Ting and Witten, 1999]. Later, Seewald [2002] suggested to use different sets of features for the l linear regression problems
in MLR. That is, only the probabilities of class cj predicted by the different classifiers, i.e., hkj (x) (k = 1, . . . , T ), are used to construct the linear
regression model corresponding to cj . Consequently, each of the linear regression problems has T instead of l×T features. The Weka implementation
of Stacking provides both the standard Stacking algorithm and the StackingC
algorithm which implements Seewald [2002]’s suggestion.
Clarke [2003] provided a comparison between Stacking and Bayesian
Model Averaging (BMA) which assigns weights to different models based
on posterior probabilities. In theory, if the correct data generating model is
among the models under consideration, and if the noise level is low, BMA is
never worse and often better than Stacking. However, in practice it is usually not the case since the correct data generating model is often not in the
models under consideration and may even be difficult to approximate
well by the considered models. Clarke [2003]’s empirical results showed that
stacking is more robust than BMA, and BMA is quite sensitive to model approximation error.
4.4.2 Infinite Ensemble
Most ensemble methods exploit only a small finite subset of hypotheses, while Lin and Li [2008] developed an infinite ensemble framework that
constructs ensembles with infinite hypotheses. This framework can be regarded as learning the combination weights for all possible hypotheses. It
is based on support vector machines, and by embedding infinitely many
hypotheses into a kernel, it can be found that the learning problem reduces
to an SVM training problem with specific kernels.
Let H = {hα : α ∈ C} denote the hypothesis space, where C is a measure
space. The kernel that embeds H is defined as
K_{H,r}(x_i, x_j) = ∫_C Φ_{x_i}(α) Φ_{x_j}(α) dα ,    (4.53)
where Φx (α) = r(α)hα (x), and r : C → R+ is chosen such that the integral
exists for all xi , xj . Here, α denotes the parameter of the hypothesis hα , and
Z(α) means that Z depends on α. In the following we denote KH,r by KH
when r is clear from the context. It is easy to prove that (4.53) defines a valid
kernel [Schölkopf and Smola, 2002].
Following SVM, the framework formulates the following (primal) problem:

min_{w ∈ L₂(C), b ∈ R, ξ ∈ R^m}   (1/2) ∫_C w²(α) dα + C Σ_{i=1}^m ξ_i    (4.54)

s.t.   y_i ( ∫_C w(α) r(α) h_α(x_i) dα + b ) ≥ 1 − ξ_i ,
       ξ_i ≥ 0   (∀ i = 1, . . . , m).
The final classifier obtained from this optimization problem is

g(x) = sign ( ∫_C w(α) r(α) h_α(x) dα + b ) .    (4.55)
Obviously, if C is uncountable, it is possible that each hypothesis hα takes an
infinitesimal weight w(α)r(α)dα in the ensemble. Thus, the obtained final
classifier is very different from those obtained with other ensemble methods. Suppose H is negation complete, that is, h ∈ H ⇔ −h ∈ H. Then, every
linear combination over H has an equivalent linear combination with only
non-negative weights. By treating b as a constant hypothesis, the classifier
in (4.55) can be seen as an ensemble of infinite hypotheses.
By using the Lagrangian multiplier method and the kernel trick, the dual problem of (4.54) can be obtained, and the final classifier can be written in terms of the kernel K_H as

g(x) = sign ( Σ_{i=1}^m y_i λ_i K_H(x_i, x) + b ) ,    (4.56)
where KH is the kernel embedding the hypothesis set H, and λi ’s are the
Lagrange multipliers. (4.56) is equivalent to (4.55) and hence it is also an
infinite ensemble over H. In practice, if a kernel KH can be constructed
according to (4.53) with a proper embedding function r, such as the stump
kernel and the perceptron kernel in [Lin and Li, 2008], the learning problem
can be reduced to solve an SVM with the kernel KH , and thus, the final
ensemble can be obtained by applying typical SVM solvers.
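As a rough illustration of how such a kernel can be used in practice (a sketch, not Lin and Li's implementation), the integral in (4.53) can be approximated by summing over a finite grid of decision stumps and feeding the resulting precomputed kernel to a standard SVM solver; the grid size and the constant choice of r(α) are assumptions made for illustration only.

# Minimal sketch: grid approximation of the embedding kernel plus a precomputed-kernel SVM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def make_stump_grid(X, n_thresholds=20):
    # finite grid over the stump parameters alpha = (feature d, threshold theta)
    return [(d, theta)
            for d in range(X.shape[1])
            for theta in np.linspace(X[:, d].min(), X[:, d].max(), n_thresholds)]

def stump_kernel(X1, X2, grid):
    # grid approximation of K(x, x') = integral of Phi_x(alpha) * Phi_x'(alpha)
    K = np.zeros((len(X1), len(X2)))
    for d, theta in grid:
        s1 = np.sign(X1[:, d] - theta)
        s2 = np.sign(X2[:, d] - theta)
        K += np.outer(s1, s2)
    return K / len(grid)

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
grid = make_stump_grid(X)
svm = SVC(kernel='precomputed', C=1.0).fit(stump_kernel(X, X, grid), y)
print(svm.predict(stump_kernel(X, X, grid))[:10])   # kernel rows: test vs. training data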
4.5 Other Combination Methods
There are many other combination methods in addition to averaging,
voting and combining by learning. In this section we briefly introduce the
algebraic methods, BKS method and decision template method.
4.5.1 Algebraic Methods
Since the class probabilities output from individual classifiers can be regarded as an estimate of the posterior probabilities, it is straightforward to
derive combination rules under the probabilistic framework [Kittler et al.,
1998].
Denote h_i^j(x), the class probability of c_j output from h_i, as h_i^j. According to Bayesian decision theory, given T classifiers, the instance x should be assigned to the class c_j which maximizes the posterior probability P(c_j | h_1^j, . . . , h_T^j). From the Bayes theorem, it follows that

P(c_j | h_1^j, . . . , h_T^j) = P(c_j) P(h_1^j, . . . , h_T^j | c_j) / Σ_{i=1}^l P(c_i) P(h_1^j, . . . , h_T^j | c_i) ,    (4.57)
where P(h_1^j, . . . , h_T^j | c_j) is the joint probability distribution of the outputs from the classifiers.
Assume that the outputs are conditionally independent, i.e.,

P(h_1^j, . . . , h_T^j | c_j) = ∏_{i=1}^T P(h_i^j | c_j) .    (4.58)
Then, from (4.57), it follows that

P(c_j | h_1^j, . . . , h_T^j) = P(c_j) ∏_{i=1}^T P(h_i^j | c_j) / ( Σ_{i=1}^l P(c_i) ∏_{k=1}^T P(h_k^j | c_i) ) ∝ P(c_j)^{−(T−1)} ∏_{i=1}^T P(c_j | h_i^j) .    (4.59)
Since h_i^j is the probability output, we have P(c_j | h_i^j) = h_i^j. Thus, if all classes have equal priors, we get the product rule [Kittler et al., 1998] for combination, i.e.,

H^j(x) = ∏_{i=1}^T h_i^j(x) .    (4.60)
Similarly, Kittler et al. [1998] derived the soft voting method, as well as the maximum/minimum/median rules. Briefly speaking, these rules choose the maximum/minimum/median of the individual outputs as the combined output. For example, the median rule generates the combined output according to

H^j(x) = med_i ( h_i^j(x) ) ,    (4.61)

where med(·) denotes the median statistic.
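A small sketch (not from the book) of these rules is given below: it takes the T × l matrix of class probabilities produced by the individual classifiers on a single instance and combines them with the product, median, maximum, minimum, or mean rule; the probability values in the usage line are arbitrary.

# Minimal sketch of the algebraic combination rules.
import numpy as np

def combine(probs, rule="product"):
    probs = np.asarray(probs)                 # probs[i, j] = h_i^j(x)
    rules = {
        "product": probs.prod(axis=0),        # product rule, (4.60)
        "median":  np.median(probs, axis=0),  # median rule, (4.61)
        "max":     probs.max(axis=0),
        "min":     probs.min(axis=0),
        "mean":    probs.mean(axis=0),        # simple soft voting
    }
    return int(np.argmax(rules[rule]))        # index of the predicted class

probs = [[0.7, 0.2, 0.1],
         [0.4, 0.5, 0.1],
         [0.6, 0.3, 0.1]]
print(combine(probs, "product"), combine(probs, "median"))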
4.5.2 Behavior Knowledge Space Method
The Behavior Knowledge Space (BKS) method was proposed by Huang
and Suen [1995]. Let 𝓁 = (𝓁_1, . . . , 𝓁_T) denote the class labels assigned by the individual classifiers h_1, . . . , h_T to the instance x, where 𝓁_i = h_i(x). If we consider 𝓁 as a T-dimensional random variable, the task can be reduced to estimating P(c_j | 𝓁). For this, every possible combination of class labels (i.e., every possible value of 𝓁) can be regarded as an index to a cell in the BKS table [Huang and Suen, 1995]. This table is filled using a data set D, and each
example (xi , yi ) is placed in the cell indexed by (h1 (xi ), . . . , hT (xi )). The
number of examples in each cell is counted; then, the most representative
class label is selected for this cell, where ties are broken arbitrarily and the
empty cells are labeled at random or by majority. In the testing stage, the
BKS method labels x to the class of the cell indexed by (h1 (x), . . . , hT (x)).
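The table-building step can be sketched as follows (a minimal illustration, not the original implementation); here empty cells simply fall back to a caller-supplied default label rather than being labeled at random or by majority.

# Minimal sketch of the BKS table: each prediction combination indexes a cell.
from collections import Counter, defaultdict

def bks_fit(predictions, y):
    # predictions[i] is the tuple (h_1(x_i), ..., h_T(x_i)); y[i] is the true label
    cells = defaultdict(Counter)
    for key, label in zip(predictions, y):
        cells[tuple(key)][label] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in cells.items()}

def bks_predict(table, key, default=None):
    # unseen (empty) cells fall back to a default label, e.g. the majority class
    return table.get(tuple(key), default)

table = bks_fit([(0, 1, 1), (0, 1, 1), (1, 1, 0)], [1, 1, 0])
print(bks_predict(table, (0, 1, 1)))   # -> 1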
The BKS method performs well if large and representative data sets are
available. It suffers from the small sample size problem, in which case overfitting may be serious. To deal with this problem, Raudys and Roli [2003]
analyzed the generalization error of the BKS method, and obtained an analytical model which relates error to sample size. Based on the model, they
proposed to use linear classifiers in “ambiguous” cells of the BKS table,
and this strategy was reported to strongly improve the BKS performance
[Raudys and Roli, 2003].
4.5.3 Decision Template Method
The Decision Template method was developed by Kuncheva et al. [2001].
In this method, the outputs of the classifiers on an instance x are organized
in a decision profile as the matrix
DP(x) = ⎡ h_1^1(x) · · · h_1^j(x) · · · h_1^l(x) ⎤
        ⎢    ⋮            ⋮            ⋮       ⎥
        ⎢ h_i^1(x) · · · h_i^j(x) · · · h_i^l(x) ⎥    (4.62)
        ⎢    ⋮            ⋮            ⋮       ⎥
        ⎣ h_T^1(x) · · · h_T^j(x) · · · h_T^l(x) ⎦
Based on the training data set D = {(x_1, y_1), . . . , (x_m, y_m)}, the decision templates are estimated as the expected DP(x), i.e.,

DT_k = (1/m_k) Σ_{i: y_i = c_k} DP(x_i) ,   k = 1, . . . , l ,    (4.63)

where m_k is the number of training examples in the class c_k.
The testing stage of this method works like a nearest neighbor algorithm.
That is, the similarity between DP (x), i.e., the decision profile of the test
instance x, and the decision templates DTk ’s are calculated based on some
similarity measure [Kuncheva et al., 2001], and then the class label of the
most similar decision template is assigned to x.
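A minimal sketch (not Kuncheva et al.'s implementation) is shown below; it uses the squared Euclidean distance between a decision profile and each template as the similarity measure, which is only one of the measures discussed above.

# Minimal sketch of the decision template method.
import numpy as np

def dt_fit(profiles, y, classes):
    # profiles[i] is the T x l decision profile DP(x_i); y[i] is the true label
    profiles, y = np.asarray(profiles), np.asarray(y)
    return {c: profiles[y == c].mean(axis=0) for c in classes}   # templates, (4.63)

def dt_predict(templates, profile):
    dists = {c: np.sum((profile - t) ** 2) for c, t in templates.items()}
    return min(dists, key=dists.get)      # class of the most similar template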
4.6 Relevant Methods
There are some methods which try to make use of multiple learners,
yet in a strict sense they can not be recognized as ensemble combination
methods. For example, some methods choose one learner to make the final prediction, though the learner is not a fixed one and thus, all individual
learners may be chosen upon receiving different test instances; some methods combine multiple learners trained on different sub-problems rather
than the exact same problem. This section provides a brief introduction to ECOC, dynamic classifier selection and mixture of experts.

FIGURE 4.7: ECOC examples. (a) Binary ECOC for a 4-class problem. An instance x is classified to class c3 using the Hamming or the Euclidean decoding; (b) Ternary ECOC example, where an instance x is classified to class c2 according to the Hamming or the Euclidean decoding.
4.6.1 Error-Correcting Output Codes
Error-Correcting Output Codes (ECOC) is a simple yet powerful approach
to deal with a multi-class problem based on the combination of binary
classifiers [Dietterich and Bakiri, 1995, Allwein et al., 2000]. In general, the
ECOC approach works in two steps:
1. The coding step. In this step, a set of B different bipartitions of the
class label set {c1 , . . . , cl } are constructed, and subsequently B binary
classifiers h1 , . . . , hB are trained over the partitions.
2. The decoding step. In this step, given an instance x, a codeword is
generated by using the outputs of B binary classifiers. Then, the codeword is compared to the base codeword of each class, and the instance
is assigned to the class with the most similar codeword.
Typically, the partitions of class set are specified by a coding matrix
M, which can appear in different forms, e.g., binary form [Dietterich and
Bakiri, 1995] and ternary form [Allwein et al., 2000].
In the binary form, M ∈ {−1, +1}l×B [Dietterich and Bakiri, 1995]. Figure 4.7(a) provides an example of a binary coding matrix, which transforms
a four-class problem into five binary classification problems. In the figure,
regions coded by +1 are considered as a class, while regions coded by −1 are
considered as the other class. Consequently, binary classifiers are trained
based on these bipartitions. For example, the binary classifier h2 is trained
to discriminate {c1, c3} against {c2, c4}, that is,

h_2(x) = +1 if x ∈ {c1, c3} ,   −1 if x ∈ {c2, c4} .    (4.64)
In the decoding step, by applying the five binary classifiers, a codeword can
be generated for the instance x. Then, the codeword is compared with the
base codewords defined in the rows of M. For example, in Figure 4.7(a),
the instance x is classified to c3 according to either Hamming distance or
Euclidean distance.
In the ternary form, M ∈ {−1, 0, 1}l×B [Allwein et al., 2000]. Figure 4.7(b)
provides an example of a ternary coding matrix, which transforms a four-class classification problem into seven binary problems. Here, “zero” indicates that the corresponding class is excluded from training the binary
classifier. For example, the classifier h4 is trained to discriminate c3 against
{c1 , c4 } without taking into account c2 . Notice that the codeword of a test
instance cannot contain zeros since the output of each binary classifier is
either −1 or +1. In Figure 4.7(b), the instance x is classified to c2 according
to either Hamming distance or Euclidean distance.
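The decoding step can be sketched as below (an illustration, not the book's code): given a coding matrix M and the codeword produced by the binary classifiers, the instance is assigned to the class whose base codeword is closest under Hamming or Euclidean decoding; the one-versus-rest matrix in the usage line is an arbitrary example.

# Minimal sketch of ECOC decoding.
import numpy as np

def ecoc_decode(M, codeword, measure="hamming"):
    M, v = np.asarray(M, dtype=float), np.asarray(codeword, dtype=float)
    if measure == "hamming":
        dists = np.sum((1 - np.sign(v * M)) / 2, axis=1)   # Hamming decoding
    else:
        dists = np.sqrt(np.sum((v - M) ** 2, axis=1))      # Euclidean decoding
    return int(np.argmin(dists))                           # index of the class

# one-versus-rest binary coding matrix for 4 classes (an illustrative choice)
M = -np.ones((4, 4)) + 2 * np.eye(4)
print(ecoc_decode(M, [-1, +1, -1, -1]))                    # -> 1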
Popular binary coding schemes mainly include the one-versus-rest
scheme and dense random scheme [Allwein et al., 2000]. In the one-versus-rest scheme, each binary classifier is trained to discriminate one class
against all the other classes. Obviously, the codeword is of length l if there
are l classes. In the dense random scheme, each element in the code is usually chosen with a probability of 1/2 for +1 and 1/2 for −1. Allwein et al.
[2000] suggested an optimal codeword length of 10 log l. Among a set of
dense random matrices, the optimal one is with the largest Hamming decoding distance among each pair of codewords.
Popular ternary coding schemes mainly include the one-versus-one
scheme and sparse random scheme [Allwein et al., 2000]. The one-versus-one scheme considers all pairs of classes, and each classifier is trained to
discriminate between two classes. Thus, the codeword length is l(l − 1)/2.
The sparse random scheme is similar to the dense random scheme, except
that it includes the zero value in addition to +1 and −1. Generally, each element is chosen with probability of 1/2 for 0, a probability of 1/4 for +1 or
−1, and the codeword length is set to 15 log l [Allwein et al., 2000].
The central task of decoding is to find the base codeword wi (corresponding to class ci ) which is the closest to the codeword v of the given test instance. Popular binary decoding schemes mainly include:
- Hamming decoder. This scheme is based on the assumption that the
learning task can be modeled as an error-correcting communication
problem [Nilsson, 1965]. The measure is given by
  HD(v, w_i) = Σ_j ( 1 − sign(v^j · w_i^j) ) / 2 .    (4.65)
- Euclidean decoder. This scheme is directly based on the Euclidean distance [Pujol et al., 2006]. The measure is given by

  ED(v, w_i) = √( Σ_j (v^j − w_i^j)² ) .    (4.66)
- Inverse Hamming decoder. This scheme is based on the matrix Δ
which is composed of the Hamming decoding measures between the
codewords of M, and Δij = HD(wi , wj ) [Windeatt and Ghaderi,
2003]. The measure is given by
  IHD(v, w_i) = max(Δ^{−1} D) ,    (4.67)
where D denotes the vector of Hamming decoder values of v for each
of the base codewords wi .
Popular ternary decoding schemes mainly include:
- Attenuated Euclidean decoder. This is a variant of the Euclidean decoder, redefined so that the measure is unaffected by the positions of the codeword w_i containing zeros [Pujol et al., 2008]. The measure is given by

  AED(v, w_i) = √( Σ_j |w_i^j| · (v^j − w_i^j)² ) .    (4.68)
- Loss-based decoder. This scheme chooses the class c_i that minimizes a particular loss function [Allwein et al., 2000]. The measure is given by

  LB(x, w_i) = Σ_j L(h_j(x), w_i^j) ,    (4.69)

  where h_j(x) is the real-valued prediction on x, and L is the loss function. In practice, the loss function has many choices, while two commonly used ones are L(h_j(x), w_i^j) = −h_j(x) · w_i^j and L(h_j(x), w_i^j) = exp(−h_j(x) · w_i^j).
- Probabilistic-based decoder. This is a probabilistic scheme based on the real-valued output of the binary classifiers [Passerini et al., 2004]. The measure is given by

  PD(v, w_i) = − log ( ∏_{j: w_i^j ≠ 0} P(v^j = w_i^j | h_j(x)) + C ) ,    (4.70)

  where C is a constant, and P(v^j = w_i^j | h_j(x)) is estimated by

  P(v^j = w_i^j | h_j(x)) = 1 / ( 1 + exp( w_i^j (a^j · h_j(x) + b^j) ) ) ,    (4.71)

  where a and b are obtained by solving an optimization problem [Passerini et al., 2004].
4.6.2 Dynamic Classifier Selection
Dynamic Classifier Selection (DCS) is a specific method for exploiting
multiple learners. After training multiple individual learners, DCS dynamically selects one learner for each test instance. In contrast to classic learning
methods which select the “best” individual learner and discard other individual learners, DCS needs to keep all the individual learners; in contrast
to typical ensemble methods which combine individual learners to make
predictions, DCS makes predictions by using one individual learner. Considering that DCS keeps all the individual learners for prediction, it can be
regarded as a “soft combination” method.
Ho et al. [1994] were the first to introduce DCS. They briefly outlined the
DCS procedure and proposed a selection method based on a partition of
training examples. The individual classifiers are evaluated on each partition so that the best-performing one for each partition is determined. In
the testing stage, the test instance will be categorized into a partition and
then classified by the corresponding best classifier.
Woods et al. [1997] proposed a DCS method called DCS-LA. The basic
idea is to estimate the accuracy of each individual classifier in local regions
surrounding the test instance, and then the most locally accurate classifier is selected to make the classification. In DCS-LA, the local regions are
specified in terms of k-nearest neighbors in the training data, and the local
accuracy can be estimated by overall local accuracy or local class accuracy.
The overall local accuracy is simply the percentage of local examples that
are correctly classified; the local class accuracy is the percentage of local
examples belonging to the same class that are correctly classified. Giacinto
and Roli [1997] developed a similar DCS method based on local accuracy.
They estimated the class posterior and calculated a “confidence” for the selection. Didaci et al. [2005] studied the performance bounds of DCS-LA and
showed that the upper bounds of DCS-LA are realistic and can be attained
by accurate parameter tuning in practice. Their experimental results clearly
showed the effectiveness of DCS based on local accuracy estimates.
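A minimal sketch of DCS-LA with overall local accuracy (not Woods et al.'s code) is given below; it assumes the individual classifiers have already been trained, and the neighborhood size k and the scikit-learn nearest-neighbor search are illustrative choices.

# Minimal sketch of DCS-LA using overall local accuracy.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcs_la_predict(x, classifiers, X_train, y_train, k=10):
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x.reshape(1, -1))
    local_X, local_y = X_train[idx[0]], y_train[idx[0]]
    # overall local accuracy of each classifier in the region around x
    accs = [np.mean(h.predict(local_X) == local_y) for h in classifiers]
    best = classifiers[int(np.argmax(accs))]
    return best.predict(x.reshape(1, -1))[0]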
Giacinto and Roli [2000a,b] placed DCS in the framework of Bayesian decision theory and found that, under the assumptions of decision regions
complementarity and decision boundaries complementarity, the optimal
Bayes classifier can be obtained by the selection of non-optimal classifiers.
This provides theoretical support for the power of DCS. Following the theoretical analysis, they proposed the a priori selection and a posteriori selection
methods which directly exploit probabilistic estimates.
4.6.3 Mixture of Experts
Mixture of experts (ME) [Jacobs et al., 1991, Xu et al., 1995] is an effective approach to exploit multiple learners. In contrast to typical ensemble
methods where individual learners are trained for the same problem, ME
works in a divide-and-conquer strategy where a complex task is broken up into several simpler and smaller subtasks, and individual learners (called experts) are trained for different subtasks. Gating is usually employed to combine the experts. Figure 4.8 illustrates an example of ME which consists of three experts.

FIGURE 4.8: An illustrative example of mixture of experts. (Plot based on a similar figure in [Jacobs et al., 1991].)
Notice that the keys of ME are different from those of typical ensemble
methods. In typical ensemble methods, since the individual learners are
trained for the same problem, particularly from the same training data set,
they are generally highly correlated and a key problem is how to make the
individual learners diverse; while in ME, the individual learners are generated for different subtasks and there is no need to devote to diversity. Typical ensemble methods do not divide the task into subtasks; while in ME, a
key problem is how to find the natural division of the task and then derive
the overall solution from sub-solutions. In the literature on ME, much emphasis has been given to making the experts local, and this is thought to be crucial to
the performance. A basic method for this purpose is to target each expert to
a distribution specified by the gating function, rather than the whole original training data distribution.
Without loss of generality, assume that an ME architecture is comprised
of T experts, and the output y is a discrete variable with possible values 0
and 1 for binary classification. Given an input x, each local expert hi tries
to approximate the distribution of y and obtains a local output hi (y | x; θi ),
where θi is the parameter of the ith expert hi . The gating function provides
a set of coefficients πi (x; α) that weigh the contributions of experts, and α
is the parameter of the gating function. Thus, the final output of the ME is
a weighted sum of all the local outputs produced by the experts, i.e.,
H(y | x; Ψ) = Σ_{i=1}^T π_i(x; α) · h_i(y | x; θ_i) ,    (4.72)
where Ψ includes all unknown parameters. The output of the gating function is often modeled by the softmax function as
π_i(x; α) = exp(v_i⊤ x) / Σ_{j=1}^T exp(v_j⊤ x) ,    (4.73)
where vi is the weight vector of the ith expert in the gating function, and
α contains all the elements in vi ’s. In the training stage, πi (x; α) states the
probability of the instance x appearing in the training set of the ith expert
hi ; while in the test stage, it defines the contribution of hi to the final prediction.
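A minimal sketch (not from the book) of the prediction rule (4.72) with softmax gating (4.73) is shown below; the experts are assumed to expose a predict_proba-style interface and to have been trained already, and the gating parameter matrix V is an illustrative stand-in for the learned v_i's.

# Minimal sketch of the mixture-of-experts forward pass.
import numpy as np

def softmax_gate(x, V):
    # V[i] is the gating weight vector v_i of the ith expert, as in (4.73)
    scores = V @ x
    e = np.exp(scores - scores.max())          # numerically stable softmax
    return e / e.sum()

def me_predict_proba(x, experts, V):
    # weighted sum of the experts' conditional distributions over y, as in (4.72)
    pi = softmax_gate(x, V)
    probs = np.array([h.predict_proba(x.reshape(1, -1))[0] for h in experts])
    return pi @ probs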
In general, the training procedure tries to achieve two goals: for given experts, to find the optimal gating function; for given gating function, to train
the experts on the distribution specified by the gating function. The unknown parameters are usually estimated by the Expectation Maximization
(EM) algorithm [Jordan and Xu, 1995, Xu et al., 1995, Xu and Jordan, 1996].
4.7 Further Readings
Weighted averaging was shown to be effective in ensemble learning by [Perrone and Cooper, 1993]. The method itself is quite basic and was used for combining multiple pieces of evidence long ago; e.g., (4.14) was well known in portfolio selection in the 1950s [Markowitz, 1952]. In its early formulation there was no constraint on the weights. Later, it was found that in practice the weights may take large negative and positive values, and hence give extreme predictions even when the individual learners provide reasonable predictions; moreover, since the training data are used both for training the individual learners and for estimating the weights, the process is prone to overfitting. So, Breiman [1996b] suggested considering the constraints shown in (4.10), which has become a standard setting.
The expression of majority voting accuracy, (4.16), was first shown by de Condorcet [1785] and later re-developed by many authors. The relation between the majority voting accuracy Pmv, the individual accuracy p and the ensemble size T was also first given by de Condorcet [1785], but for odd sizes only; Lam and Suen [1997] generalized the analysis to even cases, leading to the overall result shown in Section 4.3.1.
Though the effectiveness of plurality voting has been validated by empirical studies, theoretical analysis on plurality voting is somewhat difficult
and there are only a few works [Lin et al., 2003, Mu et al., 2009]. In particular, Lin et al. [2003] theoretically compared the recognition/error/rejection
rates of plurality voting and majority voting under different conditions, and
showed that plurality voting is more efficient in achieving a tradeoff between
the rejection and error rates.
Kittler et al. [1998], Kittler and Alkoot [2003] showed that voting can be
regarded as a special case of averaging, while the averaging rule is more
resilient to estimation errors than other combination methods. Kuncheva
[2002] theoretically studied six simple classifier combination methods under the assumption that the estimates are independent and identically distributed. Kuncheva et al. [2003] empirically studied majority voting, and
showed that dependent classifiers can offer improvement over independent classifiers for majority voting.
The Dempster-Shafer (DS) theory [Dempster, 1967, Shafer, 1976] is a
theory on evidence aggregation, which is able to represent uncertainties
and ignorance (lack of evidence). Several combination methods have been
inspired by the DS theory, e.g., [Xu et al., 1992, Rogova, 1994, Al-Ani and
Deriche, 2002, Ahmadzadeh and Petrou, 2003, Bi et al., 2008].
Utschick and Weichselberger [2004] proposed to improve the process
of binary coding by optimizing a maximum-likelihood objective function;
however, they found that the one-versus-rest scheme is still the optimal
choice for many multi-class problems. General coding schemes could not
guarantee that the coded problems are most suitable for a given task. Crammer and Singer [2002] were the first to design problem-dependent coding
schemes, and proved that the problem of finding the optimal discrete coding matrix is NP-complete. Later, several other problem-dependent designs
were developed based on exploiting the problem by finding representative
binary problems that increase the generalization performance while keeping the code length small. Discriminant ECOC (DECOC) [Pujol et al., 2006]
is based on the embedding of discriminant tree structures derived from
the problem. Forest-ECOC [Escalera et al., 2007] extends DECOC by including additional classifiers. ECOC-ONE [Pujol et al., 2008] uses a coding process that trains the binary problems guided by a validation set. For binary
decoding, Allwein et al. [2000] reported that the practical behavior of the
Inverse Hamming decoder is quite similar to the Hamming decoder. For
ternary decoding, Escalera et al. [2010b] found that the zero symbol introduces two kinds of biases, and to overcome these problems, they proposed
the Loss-Weighted decoder (LW) and the Pessimistic Beta Density Distribution decoder (β-DEN). An open source ECOC library developed by Escalera
et al. [2010a] can be found at http://mloss.org.
The application of Dynamic Classifier Selection is not limited to classification. For example, Zhu et al. [2004] showed that DCS has promising performance in mining data streams with concept drifting or with significant
noise. The idea of DCS has even been generalized to dynamic ensemble selection by Ko et al. [2008].
Hierarchical mixture of experts (HME) [Jordan and Jacobs, 1992] extends mixture of experts (ME) into a tree structure. In contrast to ME which
builds the experts on the input directly, in HME the experts are built from
multiple levels of experts and gating functions. The EM algorithm still can
be used to train HME. Waterhouse and Robinson [1996] described how to
grow the tree structure of HME gradually. Bayesian frameworks for inferring the parameters of ME and HME were developed by Waterhouse et al.
[1996] and Bishop and Svensén [2003], respectively.
5
Diversity
5.1 Ensemble Diversity
Ensemble diversity, that is, the difference among the individual learners,
is a fundamental issue in ensemble methods.
Intuitively it is easy to understand that to gain from combination, the
individual learners must be different; otherwise there would be no
performance improvement if identical individual learners were combined.
Tumer and Ghosh [1995] analyzed the performance of simple soft voting ensemble using the decision boundary analysis introduced in Section
4.3.5.2, by introducing a term θ to describe the overall correlation among
the individual learners. They showed that the expected added error of the
ensemble is

err_add^ssv(H) = ( (1 + θ(T − 1)) / T ) · err_add(h) ,    (5.1)
where err add (h) is the expected added error of the individual learners (for
simplicity, all individual learners were assumed to have equal error), and
T is the ensemble size. (5.1) discloses that if the learners are independent,
i.e., θ = 0, the ensemble achieves a factor of T error reduction over the
individual learners; if the learners are totally correlated, i.e., θ = 1, no gains
can be obtained from the combination. This analysis clearly shows that the
diversity is crucial to ensemble performance. A similar conclusion can be
obtained for other combination methods.
Generating diverse individual learners, however, is not easy. The major
obstacle lies in the fact that the individual learners are trained for the same
task from the same training data, and thus they are usually highly correlated. Many theoretically plausible approaches, e.g., the optimal solution of
weighted averaging (4.14), do not work in practice simply because they are
based on the assumption of independent or less correlated learners. The
real situation is even more difficult. For example, the derivation of (5.1),
though it considers high correlation between individual learners, is based
on the assumption that the individual learners produce independent estimates of the posterior probabilities; this is actually not the case in practice.
In fact, the problem of generating diverse individual learners is even
more challenging if we consider that the individual learners must not be
very poor; otherwise their combination would not improve and could even worsen the performance. For example, it can be seen from (4.49) that when the performance of individual classifiers is quite poor, the added error of the simple soft voted ensemble may become arbitrarily large; similar analytical results can also be obtained for other combination methods.
So, it is desired that the individual learners should be accurate and diverse. Combining only accurate learners is often worse than combining
some accurate ones together with some relatively weak ones, since complementarity is more important than pure accuracy. Ultimately, the success of
ensemble learning lies in achieving a good tradeoff between the individual
performance and diversity.
Unfortunately, though diversity is crucial, we still do not have a clear understanding of diversity; for example, currently there is no well-accepted
formal definition of diversity. There is no doubt that understanding diversity is the holy grail in the field of ensemble learning.
5.2 Error Decomposition
It is important to see that the generalization error of an ensemble depends on a term related to diversity. For this purpose, this section introduces two famous error decomposition schemes for ensemble methods,
that is, the error-ambiguity decomposition and the bias-variance decomposition.
5.2.1 Error-Ambiguity Decomposition
The error-ambiguity decomposition was proposed by Krogh and Vedelsby
[1995]. Assume that the task is to use an ensemble of T individual learners
h1 , . . . , hT to approximate a function f : Rd → R, and the final prediction
of the ensemble is obtained through weighted averaging (4.9), i.e.,
H(x) = Σ_{i=1}^T w_i h_i(x) ,

where w_i is the weight for the learner h_i, and the weights are constrained by w_i ≥ 0 and Σ_{i=1}^T w_i = 1.
Given an instance x, the ambiguity of the individual learner hi is defined
as [Krogh and Vedelsby, 1995]
ambi(h_i | x) = (h_i(x) − H(x))² ,    (5.2)

and the ambiguity of the ensemble is
ambi(h | x) = Σ_{i=1}^T w_i · ambi(h_i | x) = Σ_{i=1}^T w_i (h_i(x) − H(x))² .    (5.3)
Obviously, the ambiguity term measures the disagreement among the individual learners on instance x. If we use the squared error to measure the
performance, then the error of the individual learner hi and the ensemble
H are respectively
err(hi | x) = (f (x) − hi (x))2 ,
(5.4)
err(H | x) = (f (x) − H(x))2 .
(5.5)
Then, it is easy to get
ambi(h | x) = Σ_{i=1}^T w_i err(h_i | x) − err(H | x) = err(h | x) − err(H | x) ,    (5.6)

where err(h | x) = Σ_{i=1}^T w_i · err(h_i | x) is the weighted average of the individual errors. Since (5.6) holds for every instance x, after averaging over the input distribution it still holds that
Σ_{i=1}^T w_i ∫ ambi(h_i | x) p(x) dx = Σ_{i=1}^T w_i ∫ err(h_i | x) p(x) dx − ∫ err(H | x) p(x) dx ,    (5.7)
where p(x) is the input distribution from which the instances are sampled.
The generalization error and the ambiguity of the individual learner hi can
be written respectively as
err(h_i) = ∫ err(h_i | x) p(x) dx ,    (5.8)

ambi(h_i) = ∫ ambi(h_i | x) p(x) dx .    (5.9)

Similarly, the generalization error of the ensemble can be written as

err(H) = ∫ err(H | x) p(x) dx .    (5.10)
Based on the above notations and (5.6), we can get the error-ambiguity decomposition [Krogh and Vedelsby, 1995]
err(H) = err(h) − ambi(h) ,    (5.11)

where err(h) = Σ_{i=1}^T w_i · err(h_i) is the weighted average of the individual generalization errors, and ambi(h) = Σ_{i=1}^T w_i · ambi(h_i) is the weighted average of the ambiguities, which is also referred to as the ensemble ambiguity.
On the right-hand side of (5.11), the first item err(h) is the average error
of the individual learners, depending on the generalization ability of individual learners; the second item ambi(h) is the ambiguity, which measures
the variability among the predictions of individual learners, depending on
the ensemble diversity. Since the second term is always positive, and it is
subtracted from the first term, it is clear that the error of the ensemble will
never be larger than the average error of the individual learners. More importantly, (5.11) shows that the more accurate and the more diverse the individual learners, the better the ensemble.
Notice that (5.11) was derived for the regression setting. It is difficult to
get similar results for classification. Furthermore, it is difficult to estimate
ambi empirically. Usually, the estimate of ambi is obtained by subtracting the estimated value of err(H) from the estimated value of err(h), and thus this
estimated value just shows the difference between the ensemble error and
individual error, not really showing the physical meaning of diversity; moreover, such an estimate often violates the constraint that ambi should be positive. Thus, (5.11) does not provide a unified formal formulation of ensemble diversity, though it does offer some important insights.
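Despite these caveats, the identity (5.11) itself is easy to verify numerically; the following sketch (not from the book) checks it for a toy regression ensemble with uniform weights and synthetic predictions.

# Minimal numerical check of the error-ambiguity decomposition (5.11).
import numpy as np

rng = np.random.default_rng(0)
m, T = 500, 5
f = rng.normal(size=m)                               # target values f(x)
h = f + rng.normal(scale=0.5, size=(T, m))           # individual predictions
w = np.ones(T) / T                                   # uniform weights
H = w @ h                                            # ensemble prediction

err_bar  = np.mean(w @ (h - f) ** 2)                 # weighted average error
ambi_bar = np.mean(w @ (h - H) ** 2)                 # weighted average ambiguity
err_H    = np.mean((H - f) ** 2)                     # ensemble error

print(np.isclose(err_H, err_bar - ambi_bar))         # (5.11): True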
5.2.2 Bias-Variance-Covariance Decomposition
The bias-variance-covariance decomposition [Geman et al., 1992], or
popularly called the bias-variance decomposition, is an important general
tool for analyzing the performance of learning algorithms. Given a learning target and the size of training set, it divides the generalization error of a
learner into three components, i.e., intrinsic noise, bias and variance. The
intrinsic noise is a lower bound on the expected error of any learning algorithm on the target; the bias measures how closely the average estimate of
the learning algorithm is able to approximate the target; the variance measures how much the estimate of the learning approach fluctuates for different training sets of the same size.
Since the intrinsic noise is difficult to estimate, it is often subsumed into
the bias term. Thus, the generalization error is broken into the bias term
which describes the error of the learner in expectation, and the variance
term which reflects the sensitivity of the learner to variations in the training
samples.
Let f denote the target and h denote the learner. For squared loss, the
decomposition is
err(h) = E[(h − f)²] = (E[h] − f)² + E[(h − E[h])²] = bias(h)² + variance(h) ,    (5.12)

where the bias and variance of the learner h are respectively

bias(h) = E[h] − f ,    (5.13)

variance(h) = E[(h − E[h])²] .    (5.14)
The key of estimating the bias and variance terms empirically lies in how
to simulate the variation of training samples with the same size. Kohavi and
Wolpert [1996]’s method, for example, works in a two-fold cross validation
style, where the original data set is split into a training set D1 and a test set
D2 . Then, T training sets are sampled from D1 ; the size of these training
sets is roughly half of that of D1 for ensuring that there are not many duplicate training sets in these T training sets even for small D. After that, the
learning algorithm is trained on each of those training sets and tested on
D2 , from which the bias and variance are estimated. The whole process can
be repeated several times to improve the estimates.
For an ensemble of T learners h1 , . . . , hT , the decomposition of (5.12) can
be further expanded, yielding the bias-variance-covariance decomposition
[Ueda and Nakano, 1996]. Without loss of generality, suppose that the individual learners are combined with equal weights. The averaged bias, averaged variance, and averaged covariance of the individual learners are defined respectively as
\overline{bias}(H) = (1/T) Σ_{i=1}^T (E[h_i] − f) ,    (5.15)

\overline{variance}(H) = (1/T) Σ_{i=1}^T E[(h_i − E[h_i])²] ,    (5.16)

\overline{covariance}(H) = ( 1/(T(T−1)) ) Σ_{i=1}^T Σ_{j=1, j≠i}^T E[(h_i − E[h_i])(h_j − E[h_j])] .    (5.17)
Then, the bias-variance-covariance decomposition of squared error of
ensemble is
err(H) = \overline{bias}(H)² + (1/T) \overline{variance}(H) + (1 − 1/T) \overline{covariance}(H) .    (5.18)
(5.18) shows that the squared error of the ensemble depends heavily on
the covariance term, which models the correlation between the individual
learners. The smaller the covariance, the better the ensemble. It is obvious
that if all the learners make similar errors, the covariance will be large, and
therefore it is preferred that the individual learners make different errors.
Thus, through the covariance term, (5.18) shows that the diversity is important for ensemble performance. Notice that the bias and variance terms are
constrained to be positive, while the covariance term can be negative. Also,
(5.18) was derived under regression setting, and it is difficult to obtain similar results for classification. So, (5.18) does not provide a formal formulation
of ensemble diversity either.
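The decomposition (5.18) can likewise be checked numerically; the sketch below (not from the book) simulates repeated training sets by redrawing noise for T correlated, biased learners at a single fixed instance, with all constants chosen arbitrarily.

# Minimal numerical check of the bias-variance-covariance decomposition (5.18).
import numpy as np

rng = np.random.default_rng(1)
T, trials, f = 5, 20000, 1.0                     # target value f at one fixed x
noise = rng.normal(scale=0.3, size=(trials, T))
h = f + 0.1 + noise + 0.5 * noise[:, [0]]        # biased, correlated learners
H = h.mean(axis=1)                               # equally weighted ensemble

err_H = np.mean((H - f) ** 2)
bias_ = np.mean(h.mean(axis=0) - f)              # averaged bias, (5.15)
var_  = np.mean(h.var(axis=0))                   # averaged variance, (5.16)
cov_m = np.cov(h, rowvar=False)
cov_  = (cov_m.sum() - np.trace(cov_m)) / (T * (T - 1))   # averaged covariance, (5.17)

print(err_H, bias_**2 + var_ / T + (1 - 1/T) * cov_)      # approximately equal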
Brown et al. [2005a,b] disclosed the connection between the errorambiguity decomposition and the bias-variance-covariance decomposition. For simplicity, assume that the individual learners are combined with
equal weights. Considering that the left-hand side of (5.11) is the same as
the left-hand side of (5.18), by putting the right-hand sides of (5.11) and
(5.18) together, it follows that
\overline{err}(H) − \overline{ambi}(H) = E[ (1/T) Σ_{i=1}^T (h_i − f)² − (1/T) Σ_{i=1}^T (h_i − H)² ]    (5.19)
    = \overline{bias}(H)² + (1/T) \overline{variance}(H) + (1 − 1/T) \overline{covariance}(H) .

After some derivations [Brown et al., 2005b,a], we get

\overline{err}(H) = E[ (1/T) Σ_{i=1}^T (h_i − f)² ] = \overline{bias}(H)² + \overline{variance}(H) ,    (5.20)

\overline{ambi}(H) = E[ (1/T) Σ_{i=1}^T (h_i − H)² ]    (5.21)
    = \overline{variance}(H) − variance(H)
    = \overline{variance}(H) − (1/T) \overline{variance}(H) − (1 − 1/T) \overline{covariance}(H) .
Thus, we can see that the term variance appears in both the averaged
squared error term and the average ambiguity term, and it cancels out if
we subtract the ambiguity from the error term. Moreover, the fact that the
term variance appears in both err and ambi terms indicates that it is hard
to maximize the ambiguity term without affecting the bias term, implying
that generating diverse learners is a challenging problem.
5.3 Diversity Measures
5.3.1 Pairwise Measures
To measure ensemble diversity, a classical approach is to measure the
pairwise similarity/dissimilarity between two learners, and then average all
the pairwise measurements for the overall diversity.
Given a data set D = {(x1 , y1 ), . . . , (xm , ym )}, for binary classification
(i.e., yi ∈ {−1, +1}), we have the following contingency table for two classifiers hi and hj , where a + b + c + d = m are non-negative variables showing
the numbers of examples satisfying the conditions specified by the corresponding rows and columns. We will introduce some representative pairwise measures based on these variables.
             h_i = +1    h_i = −1
h_j = +1         a           c
h_j = −1         b           d
Disagreement Measure [Skalak, 1996, Ho, 1998] between hi and hj is defined as the proportion of examples on which two classifiers make different
predictions, i.e.,
dis_ij = (b + c) / m .    (5.22)
The value disij is in [0, 1]; the larger the value, the larger the diversity.
Q-Statistic [Yule, 1900] of hi and hj is defined as
Q_ij = (ad − bc) / (ad + bc) .    (5.23)
It can be seen that Qij takes value in the range of [−1, 1]. Qij is zero if hi and
hj are independent; Qij is positive if hi and hj make similar predictions; Qij
is negative if hi and hj make different predictions.
Correlation Coefficient [Sneath and Sokal, 1973] of hi and hj is defined as
ρ_ij = (ad − bc) / √( (a + b)(a + c)(c + d)(b + d) ) .    (5.24)
This is a classic statistic for measuring the correlation between two binary
vectors. It is easy to see that ρ_ij and Q_ij have the same sign, and |ρ_ij| ≤ |Q_ij|.
Kappa-Statistic [Cohen, 1960] is also a classical measure in statistical literature, and it was first used to measure the diversity between two classifiers
by [Margineantu and Dietterich, 1997, Dietterich, 2000b]. It is defined as¹

κ_p = (Θ_1 − Θ_2) / (1 − Θ_2) ,    (5.25)
where Θ1 and Θ2 are the probabilities that the two classifiers agree and
agree by chance, respectively. The probabilities for hi and hj can be estimated on the data set D according to
Θ_1 = (a + d) / m ,    (5.26)

Θ_2 = ( (a + b)(a + c) + (c + d)(b + d) ) / m² .    (5.27)
κp = 1 if the two classifiers totally agree on D; κp = 0 if the two classifiers
agree by chance; κp < 0 is a rare case where the agreement is even less than
what is expected by chance.
The above measures do not require to know the classification correctness. In cases where the correctness of classification is known, the following
measure can be used:
Double-Fault Measure [Giacinto and Roli, 2001] is defined as the proportion of examples that have been misclassified by both the classifiers hi and
hj , i.e.,
df_ij = e / m ,    (5.28)

where e = Σ_{k=1}^m I( h_i(x_k) ≠ y_k ∧ h_j(x_k) ≠ y_k ).
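The pairwise measures above are straightforward to compute from the contingency table; the following sketch (not from the book) does so for two classifiers' predictions on a binary data set with labels in {−1, +1}, without guarding against degenerate tables.

# Minimal sketch of the pairwise diversity measures.
import numpy as np

def pairwise_diversity(pred_i, pred_j, y):
    pred_i, pred_j, y = map(np.asarray, (pred_i, pred_j, y))
    m = len(y)
    a = np.sum((pred_i == +1) & (pred_j == +1))
    b = np.sum((pred_i == +1) & (pred_j == -1))
    c = np.sum((pred_i == -1) & (pred_j == +1))
    d = np.sum((pred_i == -1) & (pred_j == -1))
    dis = (b + c) / m                                                   # (5.22)
    Q = (a * d - b * c) / (a * d + b * c)                               # (5.23)
    rho = (a * d - b * c) / np.sqrt((a + b) * (a + c) * (c + d) * (b + d))  # (5.24)
    theta1 = (a + d) / m                                                # (5.26)
    theta2 = ((a + b) * (a + c) + (c + d) * (b + d)) / m**2             # (5.27)
    kappa_p = (theta1 - theta2) / (1 - theta2)                          # (5.25)
    df = np.mean((pred_i != y) & (pred_j != y))                         # (5.28)
    return {"dis": dis, "Q": Q, "rho": rho, "kappa_p": kappa_p, "df": df}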
5.3.2 Non-Pairwise Measures
Non-pairwise measures try to assess the ensemble diversity directly,
rather than by averaging pairwise measurements. Given a set of individual
classifiers {h1 , . . . , hT } and a data set D = {(x1 , y1 ), . . . , (xm , ym )} where
xi is an instance and yi ∈ {−1, +1} is class label, in the following we will
introduce some representative non-pairwise measures.
Kohavi-Wolpert Variance was proposed by Kohavi and Wolpert [1996], and
originated from the bias-variance decomposition of the error of a classifier.
On an instance x, the variability of the predicted class label y is defined as
var_x = (1/2) ( 1 − Σ_{y∈{−1,+1}} P(y | x)² ) .    (5.29)
¹ The notation κ_p is used for the pairwise kappa-statistic, and the interrater agreement measure κ (also called the non-pairwise kappa-statistic) will be introduced later.
Kuncheva and Whitaker [2003] modified the variability to measure diversity by considering two classifier outputs: correct (denoted by ỹ = +1)
and incorrect (denoted by ỹ = −1), and estimated P (ỹ = +1 | x) and
P (ỹ = −1 | x) over individual classifiers, that is,
P̂(ỹ = +1 | x) = ρ(x)/T   and   P̂(ỹ = −1 | x) = 1 − ρ(x)/T ,    (5.30)
where ρ(x) is the number of individual classifiers that classify x correctly.
By substituting (5.30) into (5.29) and averaging over the data set D, the following kw measure is obtained:
kw = ( 1/(mT²) ) Σ_{k=1}^m ρ(x_k) (T − ρ(x_k)) .    (5.31)
It is easy to see that the larger the kw measurement, the larger the diversity.
Interrater agreement is a measure of interrater (inter-classifier) reliability
[Fleiss, 1981]. Kuncheva and Whitaker [2003] used it to measure the level of
agreement within a set of classifiers. This measure is defined as
κ = 1 − ( (1/T) Σ_{k=1}^m ρ(x_k)(T − ρ(x_k)) ) / ( m(T − 1) p̄ (1 − p̄) ) ,    (5.32)
where ρ(xk ) is the number of classifiers that classify xk correctly, and
p̄ = ( 1/(mT) ) Σ_{i=1}^T Σ_{k=1}^m I( h_i(x_k) = y_k )    (5.33)
is the average accuracy of individual classifiers. Similarly with κp , κ = 1 if
the classifiers totally agree on D, and κ ≤ 0 if the agreement is even less
than what is expected by chance.
Entropy is motivated by the fact that for an instance xk , the disagreement
will be maximized if a tie occurs in the votes of individual classifiers. Cunningham and Carney [2000] directly calculated the Shannon’s entropy on
every instance and averaged them over D for measuring diversity, that is,
Ent_cc = (1/m) Σ_{k=1}^m Σ_{y∈{−1,+1}} −P(y | x_k) log P(y | x_k) ,    (5.34)
where P(y | x_k) = (1/T) Σ_{i=1}^T I( h_i(x_k) = y ) can be estimated by the proportion
of individual classifiers that predict y as the label of xk . It is evident that the
calculation of Entcc does not require to know the correctness of individual
classifiers.
Shipp and Kuncheva [2002] assumed to know the correctness of the classifiers, and defined their entropy measure as
Ent_sk = (1/m) Σ_{k=1}^m min( ρ(x_k), T − ρ(x_k) ) / ( T − ⌈T/2⌉ ) ,    (5.35)
where ρ(x) is the number of individual classifiers that classify x correctly.
The Entsk value is in the range of [0, 1], where 0 indicates no diversity and
1 indicates the largest diversity. Notice that (5.35) is not a classical entropy,
since it does not use the logarithm function. Though it can be transformed
into classical form by using a nonlinear transformation, (5.35) is preferred
in practice since it is easier to handle and faster to calculate [Shipp and
Kuncheva, 2002].
Difficulty was originally proposed by Hansen and Salamon [1990] and explicitly formulated by Kuncheva and Whitaker [2003]. Let X be a random variable taking values in {0, 1/T, 2/T, . . . , 1} that denotes the proportion of classifiers correctly classifying a randomly drawn instance x. The probability mass function of X can be estimated by running the T classifiers on the data set D.
Consider the distribution shape: if some instances are difficult for all classifiers while the other instances are easy for all classifiers, the distribution has two well-separated peaks; if the instances that are difficult for some classifiers are easy for the other classifiers, the distribution has one off-centered peak; and if all instances are equally difficult for all classifiers, the distribution has no clear peak. So, by using the variance of X to capture the distribution shape, the difficulty measure is defined as
θ = variance(X).
(5.36)
It is obvious that the smaller the θ value, the larger the diversity.
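A minimal sketch (not from the book) of the difficulty measure, again from a T × m 0/1 correctness matrix:

import numpy as np

def difficulty(correct):
    # X: fraction of classifiers that are correct on each instance
    T, m = correct.shape
    X = correct.sum(axis=0) / T
    return float(np.var(X))                        # theta = variance(X), see (5.36)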
Generalized Diversity [Partridge and Krzanowski, 1997] was motivated by
the argument that the diversity is maximized when the failure of one classifier is accompanied by the correct prediction of the other. The measure is
defined as
gd = 1 - \frac{p(2)}{p(1)}, \quad (5.37)
where

p(1) = \sum_{i=1}^{T} \frac{i}{T}\, p_i, \quad (5.38)

p(2) = \sum_{i=1}^{T} \frac{i}{T}\,\frac{i-1}{T-1}\, p_i, \quad (5.39)
and pi denotes the probability of i randomly chosen classifiers failing on a
randomly drawn instance x. The gd value is in the range of [0, 1], and the
diversity is minimized when gd = 0.
Table 5.1: Summary of ensemble diversity measures, where ↑ (↓) indicates that the larger (smaller) the measurement, the larger the diversity ("Known" indicates whether it requires knowing the correctness of individual classifiers).

Diversity Measure          Symbol   ↑/↓   Pairwise   Known   Symmetric
Disagreement               dis      ↑     Yes        No      Yes
Q-statistic                Q        ↓     Yes        No      Yes
Correlation coefficient    ρ        ↓     Yes        No      Yes
Kappa-statistic            κp       ↓     Yes        No      Yes
Double-fault               df       ↓     Yes        Yes     No
Interrater agreement       κ        ↓     No         Yes     Yes
Kohavi-Wolpert variance    kw       ↑     No         Yes     Yes
Entropy (C&C's)            Entcc    ↑     No         No      Yes
Entropy (S&K's)            Entsk    ↑     No         Yes     Yes
Difficulty                 θ        ↓     No         Yes     No
Generalized diversity      gd       ↑     No         Yes     No
Coincident failure         cfd      ↑     No         Yes     No
Coincident Failure [Partridge and Krzanowski, 1997] is a modified version
of the generalized diversity, defined as
cfd = \begin{cases} 0, & p_0 = 1 \\ \frac{1}{1-p_0} \sum_{i=1}^{T} \frac{T-i}{T-1}\, p_i, & p_0 < 1 . \end{cases} \quad (5.40)

cfd = 0 if all classifiers always give the same predictions simultaneously, and cfd = 1 if each classifier makes mistakes on unique instances.
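A minimal sketch (not from the book) covering both gd (5.37) and cfd (5.40); p[i] is estimated here as the fraction of instances on which exactly i of the T classifiers fail:

import numpy as np

def failure_distribution(correct):
    correct = np.asarray(correct, dtype=int)        # T x m 0/1 correctness matrix
    T, m = correct.shape
    fails = (T - correct.sum(axis=0)).astype(int)   # number of failing classifiers per instance
    return np.bincount(fails, minlength=T + 1) / m  # p[0], ..., p[T]

def generalized_diversity(correct):
    T = np.asarray(correct).shape[0]
    p = failure_distribution(correct)
    i = np.arange(T + 1)
    p1 = np.sum(i / T * p)
    p2 = np.sum(i / T * (i - 1) / (T - 1) * p)
    return float(1.0 - p2 / p1)

def coincident_failure(correct):
    T = np.asarray(correct).shape[0]
    p = failure_distribution(correct)
    if p[0] == 1.0:
        return 0.0
    i = np.arange(1, T + 1)
    return float(np.sum((T - i) / (T - 1) * p[1:]) / (1.0 - p[0]))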
5.3.3 Summary and Visualization
Table 5.1 provides a summary of the 12 diversity measures introduced
above. The table shows whether a measure is pairwise or non-pairwise,
whether it requires knowing the correctness of classifiers, and whether it is symmetric or non-symmetric. A symmetric measure remains unchanged when the values of 0 (incorrect) and 1 (correct) in binary classification are swapped [Ruta and Gabrys, 2001].
Kuncheva and Whitaker [2003] showed that the Kohavi-Wolpert variance
(kw), the averaged disagreement (disavg ) and the kappa-statistic (κ) are
closely related as
kw = \frac{T-1}{2T}\, dis_{avg}, \quad (5.41)

\kappa = 1 - \frac{T}{(T-1)\,\bar{p}(1-\bar{p})}\, kw, \quad (5.42)
[Figure 5.1 comprises two kappa-error scatter plots (x-axis: kappa; y-axis: error rate): (a) AdaBoost and (b) Bagging.]
FIGURE 5.1: Examples of kappa-error diagrams on credit-g data set, where
each ensemble comprises 50 C4.5 decision trees.
where p̄ is as defined in (5.33). Moreover, the empirical study of Kuncheva and Whitaker [2003] also disclosed that these diversity measures exhibit reasonably strong relationships with one another.
One advantage of pairwise measures is that they can be visualized in two-dimensional plots. This was first shown by Margineantu and Dietterich [1997]'s kappa-error diagram, which is a scatter plot where each point corresponds to a pair of classifiers, with the x-axis denoting the value of κp for the two classifiers and the y-axis denoting their average error rate. Figure 5.1 shows examples of the kappa-error diagram. It can be seen that the kappa-error diagram visualizes the accuracy-diversity tradeoff of different ensemble methods: the higher a point cloud lies, the less accurate the individual classifiers; the farther to the right it lies, the less diverse the individual classifiers. It is evident that other pairwise diversity measures can be visualized in a similar way.
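A small plotting sketch (assumptions: NumPy and matplotlib; the pairwise kappa used here is the standard two-rater kappa statistic, which is one common reading of the κp from Section 5.3.1):

import itertools
import numpy as np
import matplotlib.pyplot as plt

def pairwise_kappa(a, b):
    labels = np.unique(np.concatenate([a, b]))
    theta1 = np.mean(a == b)                                           # observed agreement
    theta2 = sum(np.mean(a == c) * np.mean(b == c) for c in labels)    # chance agreement
    return (theta1 - theta2) / (1.0 - theta2)

def kappa_error_diagram(preds, y):
    # preds: T x m matrix of predicted labels, y: ground-truth labels
    T = preds.shape[0]
    xs, ys = [], []
    for i, j in itertools.combinations(range(T), 2):
        xs.append(pairwise_kappa(preds[i], preds[j]))
        ys.append(0.5 * (np.mean(preds[i] != y) + np.mean(preds[j] != y)))
    plt.scatter(xs, ys, s=8)
    plt.xlabel("kappa")
    plt.ylabel("average error rate")
    plt.show()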
5.3.4 Limitation of Diversity Measures
Kuncheva and Whitaker [2003] raised possibly the first doubts about diversity measures. Through a broad range of experiments, they showed that the effectiveness of existing diversity measures is discouraging, since there seems to be no clear relation between those diversity measurements and the ensemble performance.
Tang et al. [2006] theoretically analyzed six diversity measures and
showed that if the average accuracy of individual learners is fixed and the
maximum diversity is achievable, maximizing the diversity among the individual learners is equivalent to maximizing the minimum margin of the
ensemble on the training examples. They showed empirically, however, that
the maximum diversity is usually not achievable, and the minimum margin
of an ensemble is not monotonically increasing with respect to existing diversity measures.
In particular, Tang et al. [2006] showed that, compared to algorithms that
seek diversity implicitly, exploiting the above diversity measures explicitly is
ineffective in constructing consistently stronger ensembles. On one hand, changes in existing diversity measurements do not provide consistent guidance on whether an ensemble achieves good generalization performance. On the other hand, the measurements are closely related to the average individual accuracies, which is undesirable since a diversity measure should not simply become another estimate of accuracy.
Notice that it is still well accepted that the motivation of generating diverse individual learners is right. What Kuncheva and Whitaker [2003] and Tang et al. [2006] disclosed is that, though many diversity measures have been developed, the right formulation and measurement of diversity remain unsolved, and understanding ensemble diversity remains a holy grail problem.
5.4 Information Theoretic Diversity
Information theoretic diversity [Brown, 2009, Zhou and Li, 2010b] provides a promising recent direction for understanding ensemble diversity.
This section will introduce the connection between information theory and
ensemble methods first and then introduce two formulations of information theoretic diversity and an estimation method.
5.4.1 Information Theory and Ensemble
The fundamental concept of information theory is the entropy, which is
a measure of uncertainty. The entropy of a random variable X is defined as
Ent(X) = \sum_{x} -p(x) \log p(x), \quad (5.43)
where x is the value of X and p(x) is the probability distribution. Based on
the concept of entropy, the dependence between two variables X1 and X2
can be measured by the mutual information [Cover and Thomas, 1991]
I(X_1; X_2) = \sum_{x_1, x_2} p(x_1, x_2) \log \frac{p(x_1, x_2)}{p(x_1)\,p(x_2)}, \quad (5.44)
or if given another variable Y , measured by the conditional mutual information [Cover and Thomas, 1991]
I(X_1; X_2 \mid Y) = \sum_{y, x_1, x_2} p(y)\, p(x_1, x_2 \mid y) \log \frac{p(x_1, x_2 \mid y)}{p(x_1 \mid y)\, p(x_2 \mid y)}. \quad (5.45)
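For discrete prediction variables, these quantities can be estimated with simple plug-in (frequency) estimators; a minimal sketch (not from the book), with illustrative function names:

from collections import Counter
import math

def entropy(xs):
    n = len(xs)
    return -sum(c / n * math.log(c / n) for c in Counter(xs).values())

def mutual_information(x1, x2):
    # I(X1; X2) = Ent(X1) + Ent(X2) - Ent(X1, X2)
    return entropy(x1) + entropy(x2) - entropy(list(zip(x1, x2)))

def conditional_mutual_information(x1, x2, y):
    # I(X1; X2 | Y) = sum_y p(y) * I(X1; X2 | Y = y)
    n = len(y)
    total = 0.0
    for value, count in Counter(y).items():
        idx = [i for i in range(n) if y[i] == value]
        total += count / n * mutual_information([x1[i] for i in idx],
                                                [x2[i] for i in idx])
    return total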
In the context of information theory, suppose a message Y is sent through a communication channel and the value X is received; the goal is to recover the correct Y by decoding the received value X, that is, a decoding operation Ŷ = g(X) is needed. In machine learning, Y is the ground-truth class label, X is the input, and g is the predictor. For ensemble methods, the goal is to recover Y from a set of T classifiers {X_1, . . . , X_T} by a combination function g, and the objective is to minimize the probability of prediction error p(g(X_{1:T}) ≠ Y), where X_{1:T} denotes the T variables X_1, . . . , X_T. Based on information theory, Brown [2009] bounded the probability of error by two inequalities [Fano, 1961, Hellman and Raviv, 1970]
as
\frac{Ent(Y) - I(X_{1:T}; Y) - 1}{\log(|Y|)} \le p\big(g(X_{1:T}) \ne Y\big) \le \frac{Ent(Y) - I(X_{1:T}; Y)}{2}. \quad (5.46)
Thus, to minimize the prediction error, the mutual information I(X1:T ; Y )
should be maximized. By considering different expansions of the mutual
information term, different formulations of information theoretic diversity
can be obtained, as will be introduced in the next sections.
5.4.2 Interaction Information Diversity
Interaction information [McGill, 1954] is a multivariate generalization
of mutual information for measuring the dependence among multiple variables. The interaction information I(X1:n ) and the conditional interaction
information I({X1:n } | Y ) are respectively defined as
I(\{X_{1:n}\}) = \begin{cases} I(X_1; X_2) & \text{for } n = 2 \\ I(\{X_{1:n-1}\} \mid X_n) - I(\{X_{1:n-1}\}) & \text{for } n \ge 3, \end{cases} \quad (5.47)

I(\{X_{1:n}\} \mid Y) = \mathbb{E}_Y\big[I(\{X_{1:n}\}) \mid Y\big]. \quad (5.48)
Based on interaction information, Brown [2009] presented an expansion
of I(X1:T ; Y ) as
I(X_{1:T}; Y) = \underbrace{\sum_{i=1}^{T} I(X_i; Y)}_{\text{relevancy}} + \underbrace{\sum_{k=2}^{T} \sum_{S_k \subseteq S} I(\{S_k \cup Y\})}_{\text{interaction information diversity}} \quad (5.49)

= \underbrace{\sum_{i=1}^{T} I(X_i; Y)}_{\text{relevancy}} - \underbrace{\sum_{k=2}^{T} \sum_{S_k \subseteq S} I(\{S_k\})}_{\text{redundancy}} + \underbrace{\sum_{k=2}^{T} \sum_{S_k \subseteq S} I(\{S_k\} \mid Y)}_{\text{conditional redundancy}}, \quad (5.50)
where Sk is a set of size k. (5.50) shows that the mutual information
I(X1:T ; Y ) can be expanded into three terms.
The first term, \sum_{i=1}^{T} I(X_i; Y), is the sum of the mutual information between each classifier and the target. It is referred to as relevancy, and it actually gives a bound on the accuracy of the individual classifiers. Since it is additive to the mutual information, a large relevancy is preferred.

The second term, \sum_{k=2}^{T} \sum_{S_k \subseteq S} I(\{S_k\}), measures the dependency among all possible subsets of classifiers, and it is independent of the class label Y. This term is referred to as redundancy. Notice that it is subtractive to the mutual information. A large I(\{S_k\}) indicates strong correlations among classifiers without considering the target Y, which reduces the value of I(X_{1:T}; Y), and hence a small value is preferred.

The third term, \sum_{k=2}^{T} \sum_{S_k \subseteq S} I(\{S_k\} \mid Y), measures the dependency among the classifiers given the class label. It is referred to as conditional redundancy. Notice that it is additive to the mutual information, and a large conditional redundancy is preferred.
It is evident that the relevancy term corresponds to the accuracy, while
both the redundancy and the conditional redundancy describe the correlations among classifiers. Thus, the interaction information diversity naturally emerges as (5.49). The interaction information diversity discloses that
the correlations among classifiers are not necessarily helpful to ensemble
performance, since there are different kinds of correlations and the helpful ones are those that take the learning target into account. It is also easy to see that the diversity exists at multiple orders of correlations, not simply pairwise ones.
One limitation of the interaction information diversity is that the expressions of the diversity terms, especially the interaction information involved, are quite complicated, and there is no effective procedure for estimating them at multiple orders in practice.
5.4.3 Multi-Information Diversity
Multi-information [Watanabe, 1960, Studeny and Vejnarova, 1998,
Slonim et al., 2006] is another multivariate generalization of mutual information. The multi-information I(X1:n ) and conditional multi-information
I(X1:n | Y ) are respectively defined as
I(X_{1:n}) = \sum_{x_{1:n}} p(x_1, \cdots, x_n) \log \frac{p(x_1, \cdots, x_n)}{p(x_1)\,p(x_2) \cdots p(x_n)}, \quad (5.51)

I(X_{1:n} \mid Y) = \sum_{y, x_{1:n}} p(y)\, p(x_{1:n} \mid y) \log \frac{p(x_{1:n} \mid y)}{p(x_1 \mid y) \cdots p(x_n \mid y)}. \quad (5.52)
It is easy to see that, when n = 2 the (conditional) multi-information is
reduced to (conditional) mutual information. Moreover,
I(X_{1:n}) = \sum_{i=2}^{n} I(X_i; X_{1:i-1}); \quad (5.53)

I(X_{1:n} \mid Y) = \sum_{i=2}^{n} I(X_i; X_{1:i-1} \mid Y). \quad (5.54)
Based on multi-information and conditional multi-information, Zhou
and Li [2010b] presented an expansion of I(X1:T ; Y ) as
I(X1:T ; Y ) =
T
i=1
1
I(Xi ; Y ) + I(X1:T | Y ) − I(X1:T )
1
2/
0
2/
0 multi-information diversity
(5.55)
relevance
=
T
i=1
1
I(Xi ; Y ) −
2/
T
i=2
1
0
relevance
T
I(Xi ; X1:i−1 ) +
2/
redundancy
I(Xi ; X1:i−1 | Y ) .
i=2
1
0
2/
0
conditional redundancy
(5.56)
Zhou and Li [2010b] proved that (5.49) and (5.55) are mathematically
equivalent, though the formulation of (5.55) is much simpler. One advantage of (5.55) is that its terms are decomposable over individual classifiers.
Take the redundancy term for example. Given an ensemble of size k, its redundancy is I(X_{1:k}) = \sum_{i=2}^{k} I(X_i; X_{1:i-1}). Then, if a new classifier X_{k+1} is added, the new redundancy becomes I(X_{1:k+1}) = \sum_{i=2}^{k+1} I(X_i; X_{1:i-1}), and the only difference is the mutual information I(X_{k+1}; X_{1:k}).
5.4.4 Estimation Method
For the interaction information diversity (5.49), it is obvious that this diversity consists of low-order and high-order components. If we only consider the pairwise components, the following can be obtained:
I(X_{1:T}; Y) \approx \sum_{i=1}^{T} I(X_i; Y) - \sum_{i=1}^{T}\sum_{j=i+1}^{T} I(X_i; X_j) + \sum_{i=1}^{T}\sum_{j=i+1}^{T} I(X_i; X_j \mid Y). \quad (5.57)
This estimation would not be accurate since it omits higher-order components. If we want to consider higher-order components, however, we need
to estimate higher-order interaction information, which is quite difficult
and currently there is no effective approach available.
For the multi-information diversity (5.55), Zhou and Li [2010b] presented
an approximate estimation approach. Take the redundancy term in (5.55)
I(X4; X3, X2, X1) = e + h + k + l + m + n + o,
I(X4; X2, X1) = h + k + l + m + n + o,
I(X4; X3, X1) = e + h + k + l + m + n,
I(X4; X3, X2) = e + h + k + m + n + o.
FIGURE 5.2: Venn diagram of an illustrative example of Zhou and Li
[2010b]’s approximation method.
for example: we need to estimate I(X_i; X_{1:i-1}) for all i. Rather than calculating it directly, I(X_i; X_{1:i-1}) is approximated by

I(X_i; X_{1:i-1}) \approx \max_{\Omega_k \subseteq \Omega} I(X_i; \Omega_k), \quad (5.58)
where Ω = {Xi−1 , . . . , X1 }, and Ωk is a subset of size k (1 ≤ k ≤ i − 1). As an
illustrative example, Figure 5.2 depicts a Venn diagram for four variables,
where the ellipses represent the entropies of variables, while the mutual
information can be represented by the combination of regions in the diagram. As shown in the right side of the figure, it can be found that the high-order component I(X_4; X_3, X_2, X_1) shares a large intersection with the low-order component I(X_4; X_2, X_1), where the only difference is region e. Notice that if X_1, X_2 and X_3 are strongly correlated, it is very likely that the uncertainty of X_3 is covered by X_1 and X_2; that is, the regions c and e are very
small. Thus, I(X4 ; X2 , X1 ) provides an approximation to I(X4 ; X3 , X2 , X1 ).
Such a scenario often happens in ensemble construction, since the individual classifiers generally have strong correlations.
Similarly, the conditional redundancy term can be approximated as
I(X_i; X_{1:i-1} \mid Y) \approx \max_{\Omega_k \subseteq \Omega} I(X_i; \Omega_k \mid Y). \quad (5.59)
Thus, the multi-information diversity can be estimated by
I(X_i; X_{1:i-1} \mid Y) - I(X_i; X_{1:i-1}) \approx \max_{\Omega_k \subseteq \Omega} \big[I(X_i; \Omega_k \mid Y) - I(X_i; \Omega_k)\big]. \quad (5.60)
It can be proved that this estimation provides a lower-bound of the information theoretic diversity.
To accomplish the estimation, an enumeration over all the Ω_k's is desired. In this way, however, for every i we need to estimate I(X_i; Ω_k) and I(X_i; Ω_k | Y) for C_{i-1}^{k} different Ω_k's. When k is near (i − 1)/2, the number will be large, and the estimation of I(X_i; Ω_k) and I(X_i; Ω_k | Y)
will become difficult. Hence, a trade-off is needed, and Zhou and Li [2010b]
showed that a good estimation can be achieved even when k is restricted to
be small values such as 1 or 2.
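A minimal sketch (not from the book) of this lower-bound estimation, restricting the subsets Ω_k to size at most k_max; it assumes the mutual_information and conditional_mutual_information helpers sketched in Section 5.4.1 are in scope, and all names are illustrative.

from itertools import combinations

def diversity_term(preds, y, i, k_max=2):
    # approximates I(X_i; X_{1:i-1} | Y) - I(X_i; X_{1:i-1}) as in (5.60)
    best = float("-inf")
    candidates = range(i)                                   # Omega = {X_1, ..., X_{i-1}}
    for k in range(1, min(k_max, i) + 1):
        for subset in combinations(candidates, k):
            joint = list(zip(*[preds[j] for j in subset]))  # treat the subset as one joint variable
            gain = (conditional_mutual_information(preds[i], joint, y)
                    - mutual_information(preds[i], joint))
            best = max(best, gain)
    return best

def multi_information_diversity(preds, y, k_max=2):
    # preds: list of T per-classifier prediction arrays, y: class labels
    return sum(diversity_term(preds, y, i, k_max) for i in range(1, len(preds)))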
5.5 Diversity Generation
Though there is no generally accepted formal formulation and measures
for ensemble diversity, there are effective heuristic mechanisms for diversity generation in ensemble construction. The common basic idea is to inject some randomness into the learning process. Popular mechanisms include manipulating the data samples, input features, learning parameters,
and output representations.
Data Sample Manipulation. This is the most popular mechanism. Given
a data set, multiple different data samples can be generated, and then the
individual learners are trained from different data samples. Generally, the
data sample manipulation is based on sampling approaches, e.g., Bagging
adopts bootstrap sampling [Efron and Tibshirani, 1993], AdaBoost adopts
sequential sampling, etc.
Input Feature Manipulation. The training data is usually described by a
set of features. Different subsets of features, also called subspaces, provide
different views on the data. Therefore, individual learners trained from different subspaces are usually diverse. The Random Subspace method [Ho,
1998] shown in Figure 5.3 is a famous ensemble method which employs this
mechanism. For data with a lot of redundant features, training a learner in
a subspace will be not only effective but also efficient. It is noteworthy that
Random Subspace is not suitable for data with only a few features. Moreover,
if there are lots of irrelevant features, it is usually better to filter out most
irrelevant features before generating the subspaces.
Learning Parameter Manipulation. This mechanism tries to generate diverse individual learners by using different parameter settings for the base
learning algorithm. For example, different initial weights can be assigned to
individual neural networks [Kolen and Pollack, 1991], different split selections can be applied to individual decision trees [Kwok and Carter, 1988, Liu
et al., 2008a], different candidate rule conditions can be applied to individual FOIL rule inducers [Ali and Pazzani, 1996], etc. The Negative Correlation
method [Liu and Yao, 1999] explicitly constrains the parameters of individual neural networks to be different by a regularization term.
Input: Data set D = {(x1 , y1 ), (x2 , y2 ), · · · , (xm , ym )};
Base learning algorithm L;
Number of base learners T ;
Dimension of subspaces d.
Process:
1. for t = 1, . . . , T :
2. Ft = RS(D, d)
% Ft is a set of d randomly selected features;
3. Dt = MapFt (D) % Dt keeps only the features in Ft
4. ht = L(Dt )
% Train a learner
5. end
Output: H(x) = arg max_{y ∈ Y} Σ_{t=1}^{T} I(h_t(MapFt(x)) = y)

FIGURE 5.3: The Random Subspace algorithm
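A minimal Python sketch of the procedure in Figure 5.3 (assumptions: NumPy and a scikit-learn decision tree as the base learner L; class labels are non-negative integers):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class RandomSubspace:
    def __init__(self, T=50, d=5, random_state=0):
        self.T, self.d = T, d
        self.rng = np.random.RandomState(random_state)

    def fit(self, X, y):
        self.members = []
        for _ in range(self.T):
            F_t = self.rng.choice(X.shape[1], self.d, replace=False)   # F_t = RS(D, d)
            h_t = DecisionTreeClassifier().fit(X[:, F_t], y)           # h_t = L(D_t)
            self.members.append((F_t, h_t))
        return self

    def predict(self, X):
        votes = np.array([h.predict(X[:, F]) for F, h in self.members])
        # majority vote over the T members
        return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])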
Output Representation Manipulation. This mechanism tries to generate
diverse individual learners by using different output representations. For
example, the ECOC approach [Dietterich and Bakiri, 1995] employs error-correcting output codes, the Flipping Output method [Breiman, 2000] randomly changes the labels of some training instances, the Output Smearing
method [Breiman, 2000] converts multi-class outputs to multivariate regression outputs to construct individual learners, etc.
In addition to the above popular mechanisms, there are some other attempts. For example, Melville and Mooney [2005] tried to encourage diversity by using artificial training data. They constructed an ensemble in an iterative way. In each round, a number of artificial instances were generated
based on the model of the data distribution. These artificial instances were
then assigned the labels that are different maximally from the predictions
of the current ensemble. After that, a new learner is trained from the original training data together with the artificial training data. If adding the new
learner to the current ensemble increases training error, the new learner
will be discarded and another learner will be generated with another set
of artificial examples; otherwise, the new learner will be accepted into the
current ensemble.
Notice that different mechanisms for diversity generation can be used together. For example, Random Forest [Breiman, 2001] adopts both the mechanisms of data sample manipulation and input feature manipulation.
5.6 Further Readings
In addition to [Kohavi and Wolpert, 1996], there are a number of practically effective bias-variance decomposition approaches, e.g., [Kong and
Dietterich, 1995, Breiman, 1996a]. Most approaches focus solely on 0-1 loss
and produce quite different definitions. James [2003] proposed a framework which accommodates the essential characteristics of bias and variance, and their decomposition can be generalized to any symmetric loss
function.
A comprehensive survey on diversity generation approaches can be
found in [Brown et al., 2005a]. Current ensemble methods generally try to
generate diverse individual learners from labeled training data. Zhou [2009]
advocated exploiting unlabeled training data to enhance diversity, and
an effective method was proposed recently by Zhang and Zhou [2010].
Stable learners, e.g., naı̈ve Bayesian and k-nearest neighbor classifiers,
which are insensitive to small perturbations on training data, are usually
difficult to improve through typical ensemble methods. Zhou and Yu [2005]
proposed the FASBIR approach and showed that multimodal perturbation, which combines multiple mechanisms of diversity generation, provides a practical way to construct ensembles of stable learners.
6
Ensemble Pruning
6.1 What Is Ensemble Pruning
Given a set of trained individual learners, rather than combining all of
them, ensemble pruning tries to select a subset of individual learners to
comprise the ensemble.
An apparent advantage of ensemble pruning is to obtain ensembles with
smaller sizes; this reduces the storage resources required for storing the ensembles and the computational resources required for calculating outputs
of individual learners, and thus improves efficiency. There is another benefit, that is, the generalization performance of the pruned ensemble may be
even better than the ensemble consisting of all the given individual learners.
The first study on ensemble pruning is possibly [Margineantu and Dietterich, 1997] which tried to prune boosted ensembles. Tamon and Xiang
[2000], however, showed that boosting pruning is intractable even to approximate. Instead of pruning ensembles generated by sequential methods,
Zhou et al. [2002b] tried to prune ensembles generated by parallel methods
such as Bagging, and showed that the pruning can lead to smaller ensembles with better generalization performance. Later, most ensemble pruning
studies were devoted to parallel ensemble methods. Caruana et al. [2004]
showed that pruning parallel heterogeneous ensembles comprising different types of individual learners is better than taking the original heterogeneous ensembles. In [Zhou et al., 2002b] the pruning of parallel ensembles
was called selective ensemble; while in [Caruana et al., 2004], the pruning
of parallel heterogeneous ensembles was called ensemble selection. In this
chapter we put all of them under the umbrella of ensemble pruning.
Originally, ensemble pruning was defined for the setting where the individual learners have already been generated, and no more individual learners will be generated from training data during the pruning process. Notice
that traditional sequential ensemble methods will discard some individual learners during their training process, but that is not ensemble pruning.
These methods typically generate individual learners one by one; once an
individual learner is generated, a sanity check is applied and the individual
learner will be discarded if it cannot pass the check. Such a sanity check is
important to ensure the validity of sequential ensembles and prevent them
from growing infinitely. For example, in AdaBoost an individual learner will
be discarded if its accuracy is below 0.5; however, AdaBoost is not an ensemble pruning method, and boosting pruning [Margineantu and Dietterich,
1997] tries to reduce the number of individual learners after the boosting
procedure has stopped and no more individual learners will be generated.
It is noteworthy that some recent studies have extended ensemble pruning to all steps of ensemble construction, and individual learners may be
pruned even before all individual learners have been generated. Nevertheless, an essential difference between ensemble pruning and sequential ensemble methods remains: for sequential ensemble methods, an individual
learner would not be excluded once it is added into the ensemble; while for
ensemble pruning methods, any individual learner may be excluded, even one that has been kept in the ensemble for a long time.
Ensemble pruning can be viewed as a special kind of Stacking. As introduced in Chapter 4, Stacking tries to apply a meta-learner to combine the
individual learners, while the ensemble pruning procedure can be viewed
as a special meta-learner. Also, recall that as mentioned in Chapter 4, if we
do not worry about how the individual learners are generated, then different ensemble methods can be regarded as different implementations
of weighted combination; from this aspect, ensemble pruning can be regarded as a procedure which sets the weights on some learners to zero.
6.2 Many Could Be Better Than All
In order to show that it is possible to get a smaller yet better ensemble
through ensemble pruning, this section introduces Zhou et al. [2002b]’s
analyses.
We start from the regression setting on which the analysis is easier. Suppose there are N individual learners h1 , . . . , hN available, and thus the final
ensemble size T ≤ N . Without loss of generality, assume that the learners
are combined via weighted averaging according to (4.9) and the weights are
constrained by (4.10). For simplicity, assume that equal weights are used,
and thus, from (4.11) we have the generalization error of the ensemble as
err = \sum_{i=1}^{N}\sum_{j=1}^{N} C_{ij}/N^2, \quad (6.1)
where Cij is defined in (4.12) and measures the correlation between hi and
h_j. If the kth individual learner is excluded from the ensemble, the generalization error of the pruned ensemble is
err' = \sum_{i=1, i \ne k}^{N}\ \sum_{j=1, j \ne k}^{N} C_{ij}/(N-1)^2. \quad (6.2)
By comparing (6.1) and (6.2), we get the condition under which err is not smaller than err', implying that the pruned ensemble is better than the all-member ensemble, that is,
(2N - 1) \sum_{i=1}^{N}\sum_{j=1}^{N} C_{ij} \le 2N^2 \sum_{i=1, i \ne k}^{N} C_{ik} + N^2 C_{kk}. \quad (6.3)
(6.3) usually holds in practice since the individual learners are often highly
correlated. For an extreme example, when all the individual learners are
duplicates, (6.3) indicates that the ensemble size can be reduced without
sacrificing generalization ability. The simple analysis above shows that in
regression, given a number of individual learners, ensembling some instead
of all of them may be better.
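As a small numeric illustration (not from the book), condition (6.3) can be checked directly once the correlation matrix C of (4.12) has been estimated; the function name is illustrative.

import numpy as np

def pruning_improves(C, k):
    # True if excluding learner h_k satisfies condition (6.3)
    N = C.shape[0]
    lhs = (2 * N - 1) * C.sum()
    rhs = 2 * N ** 2 * (C[:, k].sum() - C[k, k]) + N ** 2 * C[k, k]
    return lhs <= rhs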
It is interesting to study the difference between ensemble pruning and sequential ensemble methods based on (6.3). As above, let N denote the upper
bound of the final ensemble size. Suppose the sequential ensemble method
employs the sanity check that the new individual learner hk (1 < k ≤ N ) will
be kept if the ensemble consisting of h1 , . . . , hk is better than the ensemble
consisting of h1 , . . . , hk−1 on mean squared error. Then, hk will be discarded
if [Perrone and Cooper, 1993]
(2k - 1) \sum_{i=1}^{k-1}\sum_{j=1}^{k-1} C_{ij} \le 2(k-1)^2 \sum_{i=1}^{k-1} C_{ik} + (k-1)^2 C_{kk}. \quad (6.4)
Comparing (6.3) and (6.4) it is easy to see the following. Firstly, ensemble
pruning methods consider the correlation among all the individual learners while sequential ensemble methods consider only the correlation between the new individual learner and previously generated ones. For example, assume N = 100 and k = 10; sequential ensemble methods consider
only the correlation between h1 , . . . , h10 , while ensemble pruning methods
consider the correlations between h1 , . . . , h100 . Secondly, when (6.4) holds,
sequential ensemble methods will only discard hk , but h1 , . . . , hk−1 won’t be
discarded; while any classifier in h1 , . . . , hN may be discarded by ensemble
pruning methods when (6.3) holds.
Notice that the analysis from (6.1) to (6.4) only applies to regression.
Since supervised learning includes regression and classification, analysis
of classification setting is needed for a unified result. Again let N denote
the number of available individual classifiers, and thus the final ensemble
size T ≤ N . Without loss of generality, consider binary classification with
labels {−1, +1}, and assume that the learners are combined via majority
voting introduced in Section 4.3.1 and ties are broken arbitrarily. Given m
training instances, the expected output on these instances is (f1 , . . . , fm )
where fj is the ground-truth of the jth instance, and the prediction made
by the ith classifier hi on these instances is (hi1 , . . . , him ) where hij is the
prediction on the jth instance. Since fj , hij ∈ {−1, +1}, it is obvious that hi
correctly classifies the jth instance when hij fj = +1. Thus, the error of the
ith classifier on these m instances is
err(h_i) = \frac{1}{m} \sum_{j=1}^{m} \eta(h_{ij} f_j), \quad (6.5)

where η(·) is a function defined as

\eta(x) = \begin{cases} 1 & \text{if } x = -1 \\ 0.5 & \text{if } x = 0 \\ 0 & \text{if } x = 1 . \end{cases} \quad (6.6)

Let s = (s_1, \ldots, s_m) where s_j = \sum_{i=1}^{N} h_{ij}. The output of the all-member ensemble on the jth instance is

H_j = \mathrm{sign}(s_j). \quad (6.7)
It is obvious that Hj ∈ {−1, 0, +1}. The prediction of the all-member ensemble on the jth instance is correct when Hj fj = +1 and wrong when
H_j f_j = −1, while H_j f_j = 0 corresponds to a tie. Thus, the error of the all-member ensemble is

err = \frac{1}{m} \sum_{j=1}^{m} \eta(H_j f_j). \quad (6.8)
Now, suppose the kth individual classifier is excluded from the ensemble.
The prediction made by the pruned ensemble on the jth instance is
H'_j = \mathrm{sign}(s_j - h_{kj}), \quad (6.9)

and the error of the pruned ensemble is

err' = \frac{1}{m} \sum_{j=1}^{m} \eta(H'_j f_j). \quad (6.10)
Then, by comparing (6.8) and (6.10), we get the condition under which err is not smaller than err', implying that the pruned ensemble is better than the all-member ensemble; that is,

\sum_{j=1}^{m} \big[\eta(\mathrm{sign}(s_j) f_j) - \eta(\mathrm{sign}(s_j - h_{kj}) f_j)\big] \ge 0. \quad (6.11)
Since the exclusion of the kth individual classifier will not change the output of the ensemble if |sj | > 1, and based on the property that
\eta(\mathrm{sign}(x)) - \eta(\mathrm{sign}(x - y)) = -\frac{1}{2}\,\mathrm{sign}(x + y), \quad (6.12)

the condition for the kth individual classifier to be pruned is

\sum_{j \in \{\, j \,:\, |s_j| \le 1 \,\}} \mathrm{sign}\big((s_j + h_{kj}) f_j\big) \le 0. \quad (6.13)
(6.13) usually holds in practice since the individual classifiers are often
highly correlated. For an extreme example, when all the individual classifiers are duplicates, (6.13) indicates that the ensemble size can be reduced
without sacrificing generalization ability.
Through combining the analyses on both regression and classification
(i.e., (6.3) and (6.13)), we get the theorem of MCBTA (“many could be better
than all”) [Zhou et al., 2002b], which indicates that for supervised learning,
given a set of individual learners, it may be better to ensemble some instead
of all of these individual learners.
6.3 Categorization of Pruning Methods
Notice that simply pruning individual learners with poor performance
may not lead to a good pruned ensemble. Generally, it is better to keep
some accurate individuals together with some not-that-good but complementary individuals. Furthermore, notice that neither (6.3) nor (6.13) provides practical solutions to ensemble pruning since the required computation is usually intractable even when there is only one output in regression
and two classes in classification. Indeed, the central problem of ensemble
pruning research is how to design practical algorithms leading to smaller
ensembles without sacrificing, or even while improving, the generalization performance compared to all-member ensembles.
During the past decade, many effective ensemble pruning methods have
been proposed. Roughly speaking, those methods can be classified into
three categories [Tsoumakas et al., 2009]:
• Ordering-based pruning. Those methods try to order the individual learners according to some criterion, and only the learners in the
front-part will be put into the final ensemble. Though they work in a
sequential style, it is noteworthy that they are quite different from sequential ensemble methods (e.g., AdaBoost) since all the available individual learners are given in advance and no more individual learners will be generated in the pruning process; moreover, any individual
learner, not just the latest generated one, may be pruned.
• Clustering-based pruning. Those methods try to identify a number
of representative prototype individual learners to constitute the final
ensemble. Usually, a clustering process is employed to partition the
individual learners into a number of groups, where individual learners in the same group behave similarly while different groups have
large diversity. Then, the prototypes of clusters are put into the final
ensemble.
• Optimization-based pruning. Those methods formulate the ensemble pruning problem as an optimization problem which aims to find
the subset of individual learners that maximizes or minimizes an objective related to the generalization ability of the final ensemble. Many
optimization techniques have been used, e.g., heuristic optimization
methods, mathematical programming methods, etc.
It is obvious that the boundaries between different categories are not
crisp, and there are methods that can be put into more than one category.
In particular, though there are many early studies on pure ordering-based
or clustering-based pruning methods, along with the explosively increasing exploitation of optimization techniques in machine learning, recent
ordering-based and clustering-based pruning methods become closer to
optimization-based methods.
6.4 Ordering-Based Pruning
Ordering-based pruning originated from Margineantu and Dietterich’s
[1997] work on boosting pruning. Later, most efforts were devoted to pruning ensembles generated by parallel ensemble methods.
Given N individual learners h_1, . . . , h_N, suppose they are combined sequentially in a random order; the generalization error of the ensemble generally decreases monotonically as the ensemble size increases, and approaches an asymptotic constant error. It has been found [Martı́nez-Muñoz and Suárez, 2006], however, that if an appropriate ordering is devised,
the ensemble error generally reaches a minimum with intermediate ensemble size and this minimum is often lower than the asymptotic error, as
shown in Figure 6.1. Hence, ensemble pruning can be realized by ordering
the N individual learners and then putting the front T individual learners
into the final ensemble.
It is generally hard to decide the best T value, but fortunately there are
usually many T values that will lead to better performance than the all-member ensemble, and at least the T value can be tuned on training data.
A more crucial problem is how to order the individual learners appropriately.
[Plot of ensemble error versus the number of individual learners, comparing the original ensemble with the ordered ensemble.]
FIGURE 6.1: Illustration of error curves of the original ensemble (aggregated in random order) and ordered ensemble.
During the past decade, many ordering strategies have been proposed. Most of them consider both the accuracy and diversity of individual learners, and a validation data set V with size |V| is usually used (when
there are not sufficient data, the training data set D or its sub-samples can
be used as validation data). In the following we introduce some representative ordering-based pruning methods.
Reduce-Error Pruning [Margineantu and Dietterich, 1997]. This method
starts with the individual learner whose validation error is the smallest.
Then, the remaining individual learners are sequentially put into the ensemble, such that the validation error of the resulting ensemble is as small
as possible in each round. This procedure is greedy, and therefore, after obtaining the top T individual learners, Margineantu and Dietterich [1997]
used Backfitting [Friedman and Stuetzle, 1981] search to improve the ensemble. In each round, it tries to replace one of the already selected individual learners with an unselected individual learner which could reduce
the ensemble error. This process repeats until none of the individual learners can be replaced, or the pre-set maximum number of learning rounds
is reached. It is easy to see that Backfitting is time-consuming. Moreover, it
was reported [Martı́nez-Muñoz and Suárez, 2006] that Backfitting could not
improve the generalization ability significantly for parallel ensemble methods such as Bagging.
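A minimal sketch (not from the book) of the greedy forward selection step of reduce-error pruning (without the Backfitting refinement), assuming NumPy, majority-vote combination, and {−1, +1} labels:

import numpy as np

def reduce_error_pruning(preds, y, T):
    # preds: N x |V| matrix of validation predictions, y: validation labels
    N = preds.shape[0]
    selected = [int(np.argmax([(p == y).mean() for p in preds]))]   # most accurate single learner
    while len(selected) < T:
        best, best_err = None, np.inf
        for k in set(range(N)) - set(selected):
            H = np.sign(preds[selected + [k]].sum(axis=0))          # majority vote of the candidate ensemble
            err = np.mean(H != y)
            if err < best_err:
                best, best_err = k, err
        selected.append(best)
    return selected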
Kappa Pruning [Margineantu and Dietterich, 1997]. This method assumes
that all the individual learners have similar performance, and uses the κp
diversity measure introduced in Section 5.3.1 to calculate the diversity of
every pair of individual learners on the validation set. It starts with the pair
with the smallest κp (i.e., the largest diversity among the given individual
learners), and then selects pairs of learners in ascending order of κp . Finally,
the top T individual learners are put into the ensemble. A variant method
was proposed by Martı́nez-Muñoz et al. [2009] later, through replacing the
pairwise κp diversity measure by the interrater agreement diversity measure κ introduced in Section 5.3.2. The variant method still starts with the
pair of individual learners that are with the smallest κ value. Then, at the tth
round, it calculates the κ value between each unselected individual learner
and the current ensemble Ht−1 , and takes the individual learner with the
smallest κ value to construct the ensemble Ht of size t. The variant method
often leads to smaller ensemble error. However, it is computationally much
more expensive than the original Kappa pruning. Banfield et al. [2005] proposed another variant method that starts with the all-member ensemble
and iteratively removes the individual learner with the largest average κ
value.
Kappa-Error Diagram Pruning [Margineantu and Dietterich, 1997]. This
method is based on the kappa-error diagram introduced in Section 5.3.3.
It constructs the convex hull of the points in the diagram, which can be
regarded as a summary of the entire diagram and includes both the most
accurate and the most diverse pairs of individual learners. The pruned ensemble consists of any individual learner that appears in a pair corresponding to a point on the convex hull. From the definition of the kappa-error diagram it is easy to see that this pruning method simultaneously considers
the accuracy as well as diversity of individual learners.
Complementariness Pruning [Martı́nez-Muñoz and Suárez, 2004]. This
method favors the inclusion of individual learners that are complementary
to the current ensemble. It starts with the individual learner whose validation error is the smallest. Then, at the tth round, given the ensemble Ht−1
of size t − 1, the complementariness pruning method adds the individual
learner ht which satisfies
h_t = \arg\max_{h_k} \sum_{(x,y) \in V} I\big(h_k(x) = y \ \text{and}\ H_{t-1}(x) \ne y\big), \quad (6.14)
where V is the validation data set, and hk is picked up from unselected individual learners.
Margin Distance Pruning [Martı́nez-Muñoz and Suárez, 2004]. This
method defines a signature vector for each individual learner. For example, the signature vector c(k) of the kth individual learner hk is a |V |dimensional vector where the ith element is
c_i^{(k)} = 2\,I\big(h_k(x_i) = y_i\big) - 1, \quad (6.15)

where (x_i, y_i) ∈ V. Obviously, c_i^{(k)} = 1 if and only if h_k classifies x_i correctly, and −1 otherwise. The performance of the ensemble can be characterized by the average of the c^{(k)}'s, i.e., \bar{c} = \frac{1}{N} \sum_{k=1}^{N} c^{(k)}. The ith instance is correctly classified by the ensemble if the ith element of \bar{c} is positive, and
the value of |c̄i | is the margin on the ith instance. If an ensemble correctly
classifies all the instances in V , the vector c̄ will lie in the first-quadrant of
the |V |-dimensional hyperspace, that is, every element of c̄ is positive. Consequently, the goal is to select the ensemble whose signature vector is near
an objective position in the first-quadrant. Suppose the objective position
is the point o with equal elements, i.e., oi = p (i = 1, . . . , |V |; 0 < p < 1).
In practice, the value of p is usually a small value (e.g., p ∈ (0.05, 0.25)). The
individual learner to be selected is the one that reduces the distance between c̄ and o the most.
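One greedy reading of this procedure, as a small Python sketch (not from the book; NumPy assumed, names illustrative):

import numpy as np

def margin_distance_pruning(preds, y, T, p=0.1):
    # preds: N x |V| matrix of validation predictions, y: labels, p: target margin value
    N, V = preds.shape
    C = 2 * (preds == y).astype(float) - 1          # signature vectors c^(k), one per row
    o = np.full(V, p)                               # objective position o with equal elements
    selected, c_sum = [], np.zeros(V)
    for _ in range(T):
        remaining = [k for k in range(N) if k not in selected]
        # pick the learner whose inclusion brings the average signature closest to o
        dists = [np.linalg.norm((c_sum + C[k]) / (len(selected) + 1) - o) for k in remaining]
        best = remaining[int(np.argmin(dists))]
        selected.append(best)
        c_sum += C[best]
    return selected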
Orientation Pruning [Martı́nez-Muñoz and Suárez, 2006]. This method
uses the signature vector defined as above. It orders the individual learners
increasingly according to the angles between the corresponding signature
vectors and the reference direction, denoted as cref , which is the projection
of the first-quadrant diagonal onto the hyperplane defined by the signature
vector c̄ of the all-member ensemble.
Boosting-Based Pruning [Martı́nez-Muñoz and Suárez, 2007]. This method
uses AdaBoost to determine the order of the individual learners. It is similar
to the AdaBoost algorithm except that in each round, rather than generating
a base learner from the training data, the individual learner with the lowest weighted validation error is selected from the given individual learners.
When the weighted error is larger than 0.5, the Boosting with restart strategy is used, that is, the weights are reset and another individual learner is
selected. Notice that the weights are used in the ordering process, while
Martı́nez-Muñoz and Suárez [2007] reported that there is no significant
difference for the pruned ensemble to make prediction with/without the
weights.
Reinforcement Learning Pruning [Partalas et al., 2009]. This method models the ensemble pruning problem as an episodic task. Given N individual
learners, it assumes that there is an agent which takes N sequential actions
each corresponding to either including the individual learner hk in the final
ensemble or not. Then, the Q-learning algorithm [Watkins and Dayan, 1992],
a famous reinforcement learning technique, is applied to learn an optimal policy for choosing the individual learners.
6.5 Clustering-Based Pruning
An intuitive idea to ensemble pruning is to identify some prototype individual learners that are representative yet diverse among the given individual learners, and then use only these prototypes to constitute the ensemble.
This category of methods is known as clustering-based pruning because
the most straightforward way to identify the prototypes is to use clustering
techniques.
Generally, clustering-based pruning methods work in two steps. In the
first step, the individual learners are grouped into a number of clusters.
Different clustering techniques have been exploited for this purpose. For
example, Giacinto et al. [2000] used hierarchical agglomerative clustering
and regarded the probability that the individual learners do not make coincident validation errors as the distance; Lazarevic and Obradovic [2001]
used k-means clustering based on Euclidean distance; Bakker and Heskes
[2003] used deterministic annealing for clustering; etc.
In the second step, prototype individual learners are selected from the
clusters. Different strategies have been developed. For example, Giacinto
et al. [2000] selected from each cluster the learner which is the most distant
to other clusters; Lazarevic and Obradovic [2001] iteratively removed individual learners from the least to the most accurate inside each cluster until
the accuracy of the ensemble starts to decrease; Bakker and Heskes [2003]
selected the centroid of each cluster; etc.
6.6 Optimization-Based Pruning
Optimization-based pruning originated from [Zhou et al., 2002b] which
employs a genetic algorithm [Goldberg, 1989] to select individual learners for the pruned ensemble. Later, many other optimization techniques,
including heuristic optimization, mathematical programming and probabilistic methods have been exploited. This section introduces several representative methods.
6.6.1 Heuristic Optimization Pruning
Recognizing that the theoretically optimal solution to weighted combination in (4.14) is infeasible in practice, Zhou et al. [2002b] regarded the
ensemble pruning problem as an optimization task and proposed a practical method GASEN.
The basic idea is to associate each individual learner with a weight that
could characterize the goodness of including the individual learner in the
final ensemble. Given N individual learners, the weights can be organized
as an N -dimensional weight vector, where small elements in the weight
vector suggest that the corresponding individual learners should be excluded. Thus, one weight vector corresponds to one solution to ensemble
pruning. In GASEN, a set of weight vectors are randomly initialized at first.
Then, a genetic algorithm is applied to the population of weight vectors,
where the fitness of each weight vector is calculated based on the corresponding ensemble performance on validation data. The pruned ensemble
is obtained by decoding the optimal weight vector evolved from the genetic
algorithm, and excluding individual learners associated with small weights.
There are different GASEN implementations, by using different coding
schemes or different genetic operators. For example, Zhou et al. [2002b]
used a floating coding scheme, while Zhou and Tang [2003] used a bit coding scheme which directly takes 0-1 weights and avoids the problem of setting an appropriate threshold to decide which individual learner should be
excluded.
In addition to genetic algorithms [Coelho et al., 2003], many other heuristic optimization techniques have been used in ensemble pruning; for example, greedy hill-climbing [Caruana et al., 2004], artificial immune algorithms [Castro et al., 2005, Zhang et al., 2005], case similarity search [Coyle
and Smyth, 2006], etc.
6.6.2 Mathematical Programming Pruning
One deficiency of heuristic optimization is the lack of solid theoretical
foundations. Along with the great success of using mathematical programming in machine learning, ensemble pruning methods based on mathematical programming optimization have been proposed.
6.6.2.1 SDP Relaxation
Zhang et al. [2006] formulated ensemble pruning as a quadratic integer
programming problem. Since finding the optimal solution is computationally infeasible, they provided an approximate solution by Semi-Definite
Programming (SDP).
First, given N individual classifiers and m training instances, Zhang et al.
[2006] recorded the errors in the matrix P as
P_{ij} = \begin{cases} 0 & \text{if } h_j \text{ classifies } x_i \text{ correctly} \\ 1 & \text{otherwise.} \end{cases} \quad (6.16)
Let G = PᵀP. Then, the diagonal element G_ii is the number of mistakes made by h_i, and the off-diagonal element G_ij is the number of mistakes co-occurring for h_i and h_j. The matrix elements are normalized according to
\tilde{G}_{ij} = \begin{cases} \frac{G_{ij}}{m} & i = j \\ \frac{1}{2}\Big(\frac{G_{ij}}{G_{ii}} + \frac{G_{ji}}{G_{jj}}\Big) & i \ne j . \end{cases} \quad (6.17)

Thus, \sum_{i=1}^{N} \tilde{G}_{ii} measures the overall performance of the individual classifiers, \sum_{i,j=1;\, i \ne j}^{N} \tilde{G}_{ij} measures the diversity, and the combination of these two terms, \sum_{i,j=1}^{N} \tilde{G}_{ij}, is a good approximation of the ensemble error.
Consequently, the ensemble pruning problem is formulated as the
quadratic integer programming problem
\min_{z} \ z^\top \tilde{G} z \quad \text{s.t.} \ \sum_{i=1}^{N} z_i = T, \ z_i \in \{0, 1\}, \quad (6.18)
where the binary variable zi represents whether the ith classifier hi is included in the ensemble, and T is the size of the pruned ensemble.
(6.18) is a standard 0-1 optimization problem, which is generally NP-hard. However, let v_i = 2z_i - 1 \in \{-1, 1\}, v = (1, v_1, \ldots, v_N)^\top,

V = vv^\top, \quad H = \begin{pmatrix} 1^\top \tilde{G} 1 & 1^\top \tilde{G} \\ \tilde{G} 1 & \tilde{G} \end{pmatrix}, \quad \text{and} \quad D = \begin{pmatrix} N & 1^\top \\ 1 & I \end{pmatrix}, \quad (6.19)
where 1 is the all-one column vector and I is the identity matrix; then (6.18) can be rewritten as the equivalent formulation [Zhang et al., 2006]

\min_{V} \ H \otimes V \quad \text{s.t.} \ D \otimes V = 4T, \ \mathrm{diag}(V) = 1, \ V \succeq 0, \ \mathrm{rank}(V) = 1, \quad (6.20)
where A \otimes B = \sum_{ij} A_{ij} B_{ij}. Then, by dropping the rank constraint, it is relaxed to the following convex SDP problem, which can be solved in polynomial time [Zhang et al., 2006]:

\min_{V} \ H \otimes V \quad \text{s.t.} \ D \otimes V = 4T, \ \mathrm{diag}(V) = 1, \ V \succeq 0. \quad (6.21)
6.6.2.2 ℓ1-Norm Regularization
Li and Zhou [2009] proposed a regularized selective ensemble method
RSE which reduces the ensemble pruning task to a Quadratic Programming (QP) problem.
Given N individual classifiers and considering weighted combination,
RSE determines the weight vector w = [w1 , . . . , wN ] by minimizing the
regularized risk function
R(w) = λV (w) + Ω(w) ,
(6.22)
where V (w) is the empirical loss which measures the misclassification on
training data D = {(x1 , y1 ), . . . , (xm , ym )}, Ω(w) is the regularization term
which tries to make the final classifier smooth and simple, and λ is a regularization parameter which trades off the minimization of V (w) and Ω(w).
By using the hinge loss and graph Laplacian regularizer as the empirical
loss and regularization term, respectively, the problem is formulated as [Li
Ensemble Pruning
131
and Zhou, 2009]
\min_{w} \ w^\top P L P^\top w + \lambda \sum_{i=1}^{m} \max\big(0,\ 1 - y_i\, p_i^\top w\big) \quad \text{s.t.} \ 1^\top w = 1, \ w \ge 0, \quad (6.23)
where p_i = (h_1(x_i), \ldots, h_N(x_i))^\top encodes the predictions of the individual classifiers on x_i, and P \in \{-1, +1\}^{N \times m} is the prediction matrix which collects the predictions of all individual classifiers on all training instances, with P_{ij} = h_i(x_j). L is the normalized graph Laplacian of the neighborhood graph G of the training data: denote the weighted adjacency matrix of G by W, and let D be the diagonal matrix with D_{ii} = \sum_{j=1}^{m} W_{ij}; then L = D^{-1/2}(D - W)D^{-1/2}.
By introducing slack variables \xi = (\xi_1, \ldots, \xi_m)^\top, (6.23) can be rewritten as

\min_{w} \ w^\top P L P^\top w + \lambda\, 1^\top \xi \quad \text{s.t.} \ y_i\, p_i^\top w + \xi_i \ge 1 \ (\forall\, i = 1, \ldots, m), \ \ 1^\top w = 1, \ w \ge 0, \ \xi \ge 0. \quad (6.24)
Obviously, (6.24) is a standard QP problem that can be efficiently solved by
existing optimization packages.
Notice that 1ᵀw = 1, w ≥ 0 is an ℓ1-norm constraint on the weights w. The ℓ1-norm is a sparsity-inducing constraint which will force some w_i's to be zero; thus, RSE favors ensembles of small size, and only a subset of the given individual learners will be included in the final ensemble.
Another advantage of RSE is that it naturally fits the semi-supervised
learning setting due to the use of the graph Laplacian regularizer, hence
it can exploit unlabeled data to improve ensemble performance. More information on semi-supervised learning will be introduced in Chapter 8.
6.6.3 Probabilistic Pruning
Chen et al. [2006, 2009] proposed a probabilistic pruning method under
the Bayesian framework by introducing a sparsity-inducing prior over the
combination weights, where the maximum a posteriori (MAP) estimation
of the weights is obtained by Expectation Maximization (EM) [Chen et al.,
2006] and Expectation Propagation (EP) [Chen et al., 2009], respectively.
Due to the sparsity-inducing prior, many of the posteriors of the weights are
sharply distributed at zero, and thus many individual learners are excluded
from the final ensemble.
Given N individual learners h_1, . . . , h_N, the output vector of the individual learners on the instance x is h(x) = (h_1(x), . . . , h_N(x))^\top. The output of the all-member ensemble is H(x) = w^\top h(x), where w = [w_1, . . . , w_N]^\top is a non-negative weight vector, w_i ≥ 0.
To make the weight vector w sparse and non-negative, a left-truncated
Gaussian prior is introduced to each weight wi [Chen et al., 2006], that is,
p(w \mid \alpha) = \prod_{i=1}^{N} p(w_i \mid \alpha_i) = \prod_{i=1}^{N} N_t(w_i \mid 0, \alpha_i^{-1}), \quad (6.25)

where α = [α_1, . . . , α_N] is the inverse variance of the weight vector w and N_t(w_i \mid 0, \alpha_i^{-1}) is a left-truncated Gaussian distribution defined as

N_t(w_i \mid 0, \alpha_i^{-1}) = \begin{cases} 2N(w_i \mid 0, \alpha_i^{-1}) & \text{if } w_i \ge 0, \\ 0 & \text{otherwise}. \end{cases} \quad (6.26)
For regression, it is assumed that the ensemble output is corrupted by a Gaussian noise ε_i ∼ N(0, σ²) with mean zero and variance σ². That is, for each training instance (x_i, y_i), it holds that

y_i = w^\top h(x_i) + ε_i. \quad (6.27)
Assuming i.i.d. training data, the likelihood can be expressed as

p(y \mid w, X, \sigma^2) = (2\pi\sigma^2)^{-m/2} \exp\Big(-\frac{1}{2\sigma^2}\, \big\|y - H^\top w\big\|^2\Big), \quad (6.28)
where y = [y1 , . . . , ym ] is the ground-truth output vector, and H = [h(x1 ),
. . . , h(xm )] is an N ×m matrix which collects all the predictions of individual
learners on all the training instances. Consequently, the posterior of w can
be written as
p(w \mid X, y, \alpha) \propto \prod_{i=1}^{N} p(w_i \mid \alpha_i) \prod_{i=1}^{m} p(y_i \mid x_i, w). \quad (6.29)
As defined in (6.26), the prior over w is a left-truncated Gaussian, and therefore, exact Bayesian inference is intractable. However, the EM algorithm or
EP algorithm can be employed to generate an MAP solution, leading to an
approximation of the sparse weight vector [Chen et al., 2006, 2009].
For classification, the ensemble output is formulated as
H(x) = \Phi\big(w^\top h(x)\big), \quad (6.30)

where \Phi(x) = \int_{-\infty}^{x} N(t \mid 0, 1)\, dt is the Gaussian cumulative distribution function. The class label of x is +1 if H(x) ≥ 1/2 and 0 otherwise. As above,
the posterior of w can be derived as
p(w \mid X, y, \alpha) \propto \prod_{i=1}^{N} p(w_i \mid \alpha_i) \prod_{i=1}^{m} \Phi\big(y_i\, w^\top h(x_i)\big), \quad (6.31)
where both the prior p(w_i | α_i) and the likelihood Φ(y_i wᵀh(x_i)) are non-Gaussian, and thus the EM algorithm or the EP algorithm is used to obtain an MAP estimation of the sparse weight vector [Chen et al., 2006, 2009].
6.7 Further Readings
Tsoumakas et al. [2009] provided a brief review on ensemble pruning
methods. Hernández-Lobato et al. [2011] reported a recent empirical study
which shows that optimization-based and ordering-based pruning methods, at least for pruning parallel regression ensembles, generally outperform ensembles generated by AdaBoost.R2, Negative Correlation and several
other approaches.
In addition to clustering, there are also other approaches for selecting the prototype individual learners, e.g., Tsoumakas et al. [2004, 2005]
picked prototype individual learners by using statistical tests to compare
their individual performance. Hernández-Lobato et al. [2009] proposed the
instance-based pruning method, where the individual learners selected for
making prediction are determined for each instance separately. Soto et al.
[2010] applied the instance-based pruning to pruned ensembles generated
by other ensemble pruning methods, yielding the double pruning method.
A similar idea has been described by Fan et al. [2002].
If each individual learner is viewed as a fancy feature extractor
[Kuncheva, 2008, Brown, 2010], it is obvious that ensemble pruning has
close relation to feature selection [Guyon and Elisseeff, 2003] and new ensemble pruning methods can get inspiration from feature selection techniques. It is worth noting, however, that the different natures of ensemble
pruning and feature selection must be considered. For example, in ensemble pruning the individual learners predict the same target and thus have
the same physical meaning, while in feature selection the features usually
have different physical meanings; the individual learners are usually highly
correlated, while this may not be the case for features in feature selection.
A breakthrough in computer vision of the last decade, i.e., the Viola-Jones
face detector [Viola and Jones, 2004], can actually be viewed as a pruning of a Haar-feature-based decision stump ensemble, or a selection of Haar features
by AdaBoost with a cascade architecture.
7
Clustering Ensembles
7.1 Clustering
Clustering aims to find the inherent structure of the unlabeled data by
grouping them into clusters of objects [Jain et al., 1999]. A good clustering will produce high quality clusters where the intra-cluster similarity is
maximized while the inter-cluster similarity is minimized. Clustering can
be used as a stand-alone exploratory tool to gain insights on the nature
of the data, and it can also be used as a preprocessing stage to facilitate
subsequent learning tasks. Formally, given the data D = {x1 , x2 , . . . , xm }
where the ith instance xi = (xi1 , xi2 , . . . , xid ) ∈ Rd is a d-dimensional
feature vector, the task of clustering is to group D into k disjoint clusters
{C_j | j = 1, . . . , k} with \bigcup_{j=1}^{k} C_j = D and C_i \cap C_j = \emptyset for i \ne j. The clustering
results returned by a clustering algorithm L can be represented as a label
vector λ ∈ N m , with the ith element λi ∈ {1, . . . , k} indicating the cluster
assignment of xi .
7.1.1 Clustering Methods
A lot of clustering methods have been developed and various taxonomies
can be defined from different perspectives, such as different data types the
algorithms can deal with, different assumptions the methods have adopted,
etc. Here, we adopt Han and Kamber [2006]’s taxonomy, which roughly divides clustering methods into the following five categories.
Partitioning Methods. A partitioning method organizes D into k partitions
by optimizing an objective partitioning criterion. The most well-known
partitioning method is k-means clustering [Lloyd, 1982], which optimizes
the square-error criterion
    err = Σ_{j=1}^{k} Σ_{x∈Cj} dis(x, x̄j)²,    (7.1)

where x̄j = (1/|Cj|) Σ_{x∈Cj} x is the mean of the partition Cj, and dis(·, ·) measures the distance between two instances (e.g., Euclidean distance). Notice
that finding the optimal partitioning which minimizes err would require
exhaustive search of all the possible solutions and is obviously computationally prohibitive due to the combinatorial nature of the search space.
To circumvent this difficulty, k-means adopts an iterative relocation technique to find the desired solution heuristically. First, it randomly selects k
instances from D as the initial cluster centers. Then, every instance in D
is assigned to the cluster whose center is the nearest. After that, the cluster centers are updated and the instances are re-assigned to their nearest
clusters. The above process will be repeated until convergence.
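To make the relocation procedure concrete, the following is a minimal NumPy sketch of k-means; the function name, the random initialization scheme and the convergence test are illustrative choices rather than part of any particular implementation.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """A minimal sketch of k-means following the relocation scheme above."""
    rng = np.random.default_rng(seed)
    # 1. randomly pick k instances as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. assign every instance to its nearest center (squared Euclidean distance)
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # 3. update each center as the mean of the instances assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 4. stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# usage: labels, centers = kmeans(np.random.rand(100, 2), k=3)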
Hierarchical Methods. A hierarchical method creates a hierarchy of clusterings on D at various levels of granularity, where a specific clustering can be obtained by thresholding the hierarchy at a specified granularity. An
early attempt toward hierarchical clustering is the SAHN method [Anderberg, 1973, Day and Edelsbrunner, 1984], which forms the hierarchy of clusterings in a bottom-up manner. Initially, each data point is placed into a
cluster of its own, and an m × m dissimilarity matrix D among clusters is set
with elements D(i, j) = dis(xi , xj ). Then, two closest clusters Ci and Cj are
identified based on D and replaced by the agglomerated cluster Ch . The dissimilarity matrix D is updated to reflect the deletion of Ci and Cj , as well as
the new dissimilarities between Ch and all remaining clusters Ck (k ≠ i, j):
D(h, k) = αi D(i, k) + αj D(j, k) + βD(i, j) + γ|D(i, k) − D(j, k)|,
(7.2)
where αi , αj , β and γ are coefficients characterizing different SAHN implementations. The above merging process is repeated until all the data points
fall into a single cluster. Typical implementations of SAHN are named as
single-linkage (αi = 1/2; αj = 1/2; β = 0; γ = −1/2), complete-linkage
(αi = 1/2; αj = 1/2; β = 0; γ = 1/2) and average-linkage (αi = |Ci |/(|Ci | +
|Cj |); αj = |Cj |/(|Ci | + |Cj |); β = 0; γ = 0).
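As an illustration of how (7.2) drives a SAHN implementation, the sketch below performs a single agglomeration step under the Lance-Williams-style coefficients listed above; the helper name, the size bookkeeping and the use of infinity to mask merged clusters are assumptions of this sketch, not part of the original formulation.

import numpy as np

# coefficients (alpha_i, alpha_j, beta, gamma) for the SAHN variants named above
LINKAGES = {
    "single":   lambda ci, cj: (0.5, 0.5, 0.0, -0.5),
    "complete": lambda ci, cj: (0.5, 0.5, 0.0,  0.5),
    "average":  lambda ci, cj: (ci / (ci + cj), cj / (ci + cj), 0.0, 0.0),
}

def sahn_merge_step(D, i, j, sizes, linkage="average"):
    """One SAHN merge: agglomerate clusters i and j and update row/column i of the
    (float) dissimilarity matrix D in place according to (7.2)."""
    ai, aj, beta, gamma = LINKAGES[linkage](sizes[i], sizes[j])
    for k in range(len(D)):
        if k in (i, j):
            continue
        D[i, k] = D[k, i] = (ai * D[i, k] + aj * D[j, k]
                             + beta * D[i, j]
                             + gamma * abs(D[i, k] - D[j, k]))
    sizes[i] += sizes[j]
    # mark cluster j as removed by setting its dissimilarities to infinity
    D[j, :] = np.inf
    D[:, j] = np.inf
    return D, sizes

Repeating this step until a single cluster remains reproduces the bottom-up SAHN hierarchy.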
Density-Based Methods. A density-based method constructs clusters on D
based on the notion of density, where regions of instances with high density are regarded as clusters which are separated by regions of low density. DBSCAN [Ester et al., 1996] is a representative density-based clustering method, which characterizes the density of the data space with a pair
of parameters (ε, MinPts). Given an instance x, its neighborhood within a radius ε is called the ε-neighborhood of x. x is called a core object if its ε-neighborhood contains at least MinPts instances. An instance p is directly density-reachable to x if p is within the ε-neighborhood of x and x is a core object. First, DBSCAN identifies core objects which satisfy the requirement imposed by the (ε, MinPts) parameters. Then, it forms clusters
by iteratively connecting the directly density-reachable instances starting
from those core objects. The connecting process terminates when no new
data point can be added to any cluster.
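A compact sketch of the DBSCAN procedure just described is given below, assuming Euclidean distance and labeling noise points with -1; the function signature is illustrative only.

import numpy as np

def dbscan(X, eps, min_pts):
    """A minimal DBSCAN sketch: -1 marks noise, clusters are numbered from 0."""
    m = len(X)
    # precompute the eps-neighborhood of every instance
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(m)]
    core = [len(neighbors[i]) >= min_pts for i in range(m)]

    labels = np.full(m, -1)
    cluster = 0
    for i in range(m):
        if labels[i] != -1 or not core[i]:
            continue
        # grow a new cluster from core object i by density-reachability
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            p = frontier.pop()
            if labels[p] == -1:
                labels[p] = cluster
                if core[p]:                 # only core objects spread the cluster
                    frontier.extend(neighbors[p])
        cluster += 1
    return labels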
Grid-Based Methods. A grid-based method quantizes D into a finite number of cells forming a grid-structure, where the quantization process is usually performed in a multi-resolution style. STING [Wang et al., 1997] is a representative grid-based method, which divides the data space into a number
of rectangular cells. Each cell stores statistical information of the instances
falling into this cell, such as count, mean, standard deviation, minimum,
maximum, type of distribution, etc. There are several levels of rectangular cells, each corresponding to a different level of resolution. Here, each
cell at a higher level is partitioned into a number of cells at the next lower
level, and statistical information of higher-level cells can be easily inferred
from their lower-level cells with simple operations such as elementary algebraic calculations.
Model-Based Methods. A model-based method assumes a mathematical
model characterizing the properties of D, where the clusters are formed to
optimize the fit between the data and the underlying model. The most famous model-based method is GMM-based clustering [Redner and Walker,
1984], which works by utilizing the Gaussian Mixture Model (GMM)
    p(x | Θ) = Σ_{j=1}^{k} αj N(x | μj, Σj),    (7.3)
where each mixture component N(x | μj, Σj) (j = 1, . . . , k) employs a Gaussian distribution with mean μj and covariance Σj, and participates in constituting the whole distribution p(x | Θ) with non-negative coefficient αj. In addition, Σ_{j=1}^{k} αj = 1 and Θ = {αj, μj, Σj | j = 1, . . . , k}. The cluster assignment λi for each instance xi ∈ D is specified according to the rule
    λi = arg max_{1≤l≤k}  αl N(xi | μl, Σl) / Σ_{j=1}^{k} αj N(xi | μj, Σj).    (7.4)
The GMM parameters Θ are learned from D by employing the popular EM
procedure [Dempster et al., 1977] to maximize the following log-likelihood
function in an iterative manner:
    p(D | Θ) = Σ_{i=1}^{m} ln ( Σ_{j=1}^{k} αj N(xi | μj, Σj) ).    (7.5)
Details on the iterative optimization procedure can be easily found in the classical literature [Jain and Dubes, 1988, Bilmes, 1998, Jain et al., 1999, Duda
et al., 2000].
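The sketch below illustrates GMM-based clustering with a bare-bones EM loop implementing (7.3)-(7.5) and the assignment rule (7.4); the initialization, the small ridge added to the covariances and the fixed iteration count are assumptions made for this sketch.

import numpy as np

def gaussian_pdf(X, mu, sigma):
    # density of N(mu, sigma) evaluated at every row of X
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(sigma)
    expo = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)
    return np.exp(expo) / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(sigma))

def gmm_cluster(X, k, n_iter=100, seed=0):
    # a bare-bones EM loop for the mixture (7.3); assignment follows (7.4)
    rng = np.random.default_rng(seed)
    m, d = X.shape
    alpha = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, size=k, replace=False)]
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each instance
        dens = np.column_stack([alpha[j] * gaussian_pdf(X, mu[j], sigma[j]) for j in range(k)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing coefficients, means and covariances
        nj = resp.sum(axis=0)
        alpha = nj / m
        mu = (resp.T @ X) / nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (resp[:, j, None] * diff).T @ diff / nj[j] + 1e-6 * np.eye(d)
    dens = np.column_stack([alpha[j] * gaussian_pdf(X, mu[j], sigma[j]) for j in range(k)])
    return dens.argmax(axis=1)   # rule (7.4): component with the largest posterior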
7.1.2 Clustering Evaluation
The task of evaluating the quality of clustering results is commonly referred to as cluster validity analysis [Jain and Dubes, 1988, Halkidi et al.,
2001]. Existing cluster validity indices for clustering quality assessment can
be roughly categorized into two types: external indices and internal indices.
The external indices evaluate the clustering results by comparing the
identified clusters to a pre-specified structure, e.g., the ground-truth clustering. Given the data set D = {x1 , . . . , xm }, let C = {C1 , . . . , Ck } denote the
identified clusters with label vector λ ∈ N m . Suppose C ∗ = {C1∗ , . . . , Cs∗ } is
the pre-specified clustering structure with label vector λ∗ . Then, four complementary terms can be defined to reflect the relationship between C and
C ∗:
    a = |SS|,  SS = {(xi, xj) | λi = λj, λi* = λj*, i < j},
    b = |SD|,  SD = {(xi, xj) | λi = λj, λi* ≠ λj*, i < j},
    c = |DS|,  DS = {(xi, xj) | λi ≠ λj, λi* = λj*, i < j},
    d = |DD|,  DD = {(xi, xj) | λi ≠ λj, λi* ≠ λj*, i < j},    (7.6)
where SS contains pairs of instances which belong to the same cluster in
both C and C ∗ ; the meanings of SD, DS and DD can be inferred similarly
based on the above definitions. It is evident that a + b + c + d = m(m − 1)/2.
A number of popular external cluster validity indices are defined as follows [Jain and Dubes, 1988, Halkidi et al., 2001]:
- Jaccard Coefficient (JC):

    JC = a / (a + b + c),    (7.7)

- Fowlkes and Mallows Index (FMI):

    FMI = √( (a / (a + b)) · (a / (a + c)) ),    (7.8)

- Rand Index (RI):

    RI = 2(a + d) / (m(m − 1)).    (7.9)
All these cluster validity indices take values between 0 and 1, and the
larger the index value, the better the clustering quality.
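A direct transcription of (7.6)-(7.9) into code looks as follows; the function name is hypothetical and no care is taken for degenerate cases such as empty pair sets.

import numpy as np
from itertools import combinations

def external_indices(labels, labels_ref):
    """Compute a, b, c, d of (7.6) and the JC, FMI and RI indices (7.7)-(7.9)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels)), 2):
        same = labels[i] == labels[j]
        same_ref = labels_ref[i] == labels_ref[j]
        if same and same_ref:
            a += 1
        elif same and not same_ref:
            b += 1
        elif not same and same_ref:
            c += 1
        else:
            d += 1
    jc = a / (a + b + c)
    fmi = np.sqrt((a / (a + b)) * (a / (a + c)))
    ri = 2 * (a + d) / (len(labels) * (len(labels) - 1))
    return jc, fmi, ri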
The internal indices evaluate the clustering results by investigating the
inherent properties of the identified clusters without resorting to a reference structure. Given the data set D = {x1 , . . . , xm }, let C = {C1 , . . . , Ck }
denote the identified clusters. The following terms are usually employed:
    f(C) = (2 / (|C|(|C| − 1))) Σ_{i=1}^{|C|−1} Σ_{j=i+1}^{|C|} dis(xi, xj),    (7.10)

    diam(C) = max_{xi,xj∈C} dis(xi, xj),    (7.11)

    dmin(Ci, Cj) = min_{xi∈Ci, xj∈Cj} dis(xi, xj),    (7.12)

    dcen(Ci, Cj) = dis(ci, cj),    (7.13)
where dis(·, ·) measures the distance between two data points and ci denotes the centroid of cluster Ci . Therefore, f (C) is the average distance
between the instances in cluster C, diam(C) is the diameter of cluster C,
dmin (Ci , Cj ) measures the distance between the two nearest instances in
Ci and Cj , and dcen (Ci , Cj ) measures the distance between the centroids of
Ci and Cj .
A number of popular internal cluster validity indices are defined as follows [Jain and Dubes, 1988, Halkidi et al., 2001]:
- Davies-Bouldin Index (DBI):

    DBI = (1/k) Σ_{i=1}^{k} max_{1≤j≤k, j≠i} ( (f(Ci) + f(Cj)) / dcen(Ci, Cj) ),    (7.14)
- Dunn Index (DI):

    DI = min_{1≤i≤k} { min_{j≠i} ( dmin(Ci, Cj) / max_{1≤l≤k} diam(Cl) ) },    (7.15)
- Silhouette Index (SI):

    SI = (1/k) Σ_{i=1}^{k} ( (1/|Ci|) Σ_{p=1}^{|Ci|} S_p^i ),    (7.16)

  where

    S_p^i = (a_p^i − b_p^i) / max{a_p^i, b_p^i},    (7.17)

    a_p^i = min_{j≠i} { (1/|Cj|) Σ_{q=1}^{|Cj|} dis(xp, xq) },    (7.18)

    b_p^i = (1/(|Ci| − 1)) Σ_{q≠p} dis(xp, xq).    (7.19)
For DBI, the smaller the index value, the better the clustering quality; for
DI and SI, the larger the index value, the better the clustering quality.
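As one worked example of an internal index, the sketch below computes the Davies-Bouldin Index (7.14) from the within-cluster scatter f(C) of (7.10) and the centroid distance dcen of (7.13); Euclidean distance and the helper names are assumptions of the sketch.

import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin Index (7.14) built from f(C) of (7.10) and d_cen of (7.13)."""
    ks = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ks])

    def f(c):
        # average pairwise distance inside cluster c, as in (7.10)
        P = X[labels == c]
        if len(P) < 2:
            return 0.0
        d = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=2))
        return d.sum() / (len(P) * (len(P) - 1))

    scatter = np.array([f(c) for c in ks])
    dbi = 0.0
    for i in range(len(ks)):
        ratios = [(scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(len(ks)) if j != i]
        dbi += max(ratios)
    return dbi / len(ks)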
7.1.3 Why Clustering Ensembles
Clustering ensembles, also called clusterer ensembles or consensus
clustering, are a kind of ensemble whose base learners are clusterings, also
called clusterers, generated by clustering methods.
There are several general motivations for investigating clustering ensembles [Fred and Jain, 2002, Strehl and Ghosh, 2002]:
To Improve Clustering Quality. As we have seen in previous chapters,
strong generalization ability can be obtained with ensemble methods for
supervised learning tasks, as long as the base learners in the ensemble are
accurate and diverse. Therefore, it is not surprising that better clustering
quality can be anticipated if ensemble methods are also applied in the unsupervised learning scenario.
For this purpose, a diverse ensemble of good base clusterings should be
generated. It is interesting to notice that in clustering, it is less difficult to
generate diverse clusterings, since clustering methods have inherent randomness. Diverse clusterings can be obtained by, for example, running
clustering methods with different parameter configurations, with different
initial data points, or with different data samples, etc. An ensemble is then
derived by combining the outputs of the base clusterings, such that useful
information encoded in each base clustering is fully leveraged to identify
the final clustering with high quality.
To Improve Clustering Robustness. As introduced in Section 7.1.1, a clustering method groups the instances into clusters by assuming a specific
structure on the data. Therefore, no single clustering method is guaranteed
to be robust across all clustering tasks as the ground-truth structures of different data may vary significantly. Furthermore, due to the inherent randomness of many clustering methods, the clustering results may also be
unstable if a single clustering method is applied to the same clustering task
several times.
Therefore, it is intuitive to utilize clustering ensemble techniques to generate robust clustering results. Given any data set for clustering analysis,
multiple base clusterings can be generated by running diverse clustering
methods to accommodate various clustering assumptions, or invoking the
same clustering method with different settings to compensate for the inherent randomness. Then, the derived ensemble may behave more stably than a
single clustering method.
To Enable Knowledge Reuse and Distributed Computing. In many applications, a variety of legacy clusterings for the data may already exist and can
serve as the knowledge bases to be reused for future data exploration. It is
also a common practice that the data are gathered and stored in distributed
locations as a result of organizational or operational constraints, while performing clustering analysis by merging them into a centralized location is
usually infeasible due to communication, computational and storage costs.
In such situations, it is rather natural to apply clustering ensemble techniques to exploit the multiple base clusterings. The legacy clusterings can
directly serve as the base clusterings for further combination. In the distributed setting, a base clustering can be generated on each distributively stored part of the data, and then the base clusterings rather than the original data can be sent to a centralized location for further exploitation.
7.2 Categorization of Clustering Ensemble Methods
Given the data D = {x1 , x2 , . . . , xm } where the ith instance xi = (xi1 , xi2 ,
. . . , xid ) ∈ Rd is a d-dimensional feature vector, like ensemble methods in
supervised learning setting, clustering ensemble methods also work in two
steps:
1. Clustering generation: In this step, each base clusterer L(q) (1 ≤ q ≤ r) groups D into k(q) clusters {Cj^(q) | j = 1, 2, . . . , k(q)}. Equivalently, the clustering results returned by L(q) can be represented by a label vector λ(q) ∈ N^m, where the ith element λi^(q) ∈ {1, 2, . . . , k(q)} indicates the cluster assignment of xi.
2. Clustering combination: In this step, given the r base clusterings
{λ(1) , λ(2) , . . . , λ(r) }, a combination function Γ(·) is used to consolidate them into the final clustering λ = Γ({λ(1) , λ(2) , . . . , λ(r) }) ∈ N m
with k clusters, where λi ∈ {1, . . . , k} indicates the cluster assignment
of xi in the final clustering. For example, suppose four base clusterings of seven instances have been generated as follows,
λ(1) = (1, 1, 2, 2, 2, 3, 3)
λ(2) = (2, 3, 3, 2, 2, 1, 1)
λ(3) = (3, 3, 1, 1, 1, 2, 2)
λ(4) = (1, 3, 3, 4, 4, 2, 2)
where λ(1) , λ(2) and λ(3) each groups the seven instances into
three clusters, while λ(4) results in a clustering with four clusters. Furthermore, though λ(1) and λ(3) look very different at first glance, they actually yield identical clustering results, i.e.,
{{x1 , x2 }, {x3 , x4 , x5 }, {x6 , x7 }}. Then, a reasonable consensus (with
three clusters) could be (1, 1, 1, 2, 2, 3, 3), or any of its six equivalent
labelings such as (2, 2, 2, 1, 1, 3, 3), which shares as much information as possible with the four base clusterings in the ensemble [Strehl
and Ghosh, 2002].
Generally speaking, clustering generation is relatively easier since any
data partition generates a clustering, while the major difficulty of clustering
ensembles lies in clustering combination. Specifically, for m instances with k clusters, the number of possible clusterings is (1/k!) Σ_{j=1}^{k} C_j^k (−1)^(k−j) j^m, or approximately k^m/k! for m ≫ k [Jain and Dubes, 1988]. For example, there
will be 171,798,901 ways to form four groups of only 16 instances [Strehl
and Ghosh, 2002]. Therefore, a brute-force search over all the possible clusterings to find the optimal combined clustering is apparently infeasible and
smart strategies are needed.
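The count quoted above is the Stirling number of the second kind; a short check, with a hypothetical function name, reproduces the figure for 16 instances and 4 groups.

from math import comb, factorial

def num_clusterings(m, k):
    """Number of ways to partition m instances into k non-empty groups,
    i.e. (1/k!) * sum_j C(k, j) * (-1)^(k-j) * j^m."""
    return sum(comb(k, j) * (-1) ** (k - j) * j ** m for j in range(1, k + 1)) // factorial(k)

print(num_clusterings(16, 4))   # 171798901, matching the number quoted above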
Most studies on clustering ensembles focus on the complicated clustering combination part. To successfully derive the ensemble clustering, the
key lies in how the information embodied in each base clustering is expressed and aggregated. During the past decade, many clustering ensemble methods have been proposed. Roughly speaking, these methods can be
classified into the following four categories:
• Similarity-Based Methods: A similarity-based method expresses the
base clustering information as similarity matrices and then aggregates multiple clusterings via matrix averaging. Examples include
[Fred and Jain, 2002, 2005, Strehl and Ghosh, 2002, Fern and Brodley,
2003].
• Graph-Based Methods: A graph-based method expresses the base
clustering information as an undirected graph and then derives the
ensemble clustering via graph partitioning. Examples include [Ayad
and Kamel, 2003, Fern and Brodley, 2004, Strehl and Ghosh, 2002].
• Relabeling-Based Methods: A relabeling-based method expresses the
base clustering information as label vectors and then aggregates via
label alignment. Examples include [Long et al., 2005, Zhou and Tang,
2006].
• Transformation-Based Methods: A transformation-based method
expresses the base clustering information as features for re-representation and then derives the ensemble clustering via meta-clustering. Examples include [Topchy et al., 2003, 2004a].
7.3 Similarity-Based Methods
The basic idea of similarity-based clustering ensemble methods is to exploit the base clusterings to form an m × m consensus similarity matrix M,
and then generate the final clustering result based on the consensus similarity matrix. Intuitively, the matrix element M(i, j) characterizes the similarity (or closeness) between the pair of instances xi and xj . The general
procedure of similarity-based methods is shown in Figure 7.1.
A total of r base similarity matrices M(q) (q = 1, . . . , r) are first obtained
based on the clustering results of each base clusterer L(q) and then averaged to form the consensus similarity matrix. Generally, the base similarity
matrix M(q) can be instantiated in two different ways according to how L(q)
returns the clustering results, i.e., crisp clustering and soft clustering.
Crisp Clustering. In this setting, L(q) works by partitioning the data set D into k(q) crisp clusters, such as k-means [Strehl and Ghosh, 2002, Fred and Jain, 2002, 2005].
Input: Data set D = {x1 , x2 , . . . , xm };
Base clusterer L(q) (q = 1, . . . , r);
Consensus clusterer L on similarity matrix.
Process:
1. for q = 1, . . . , r:
2. λ(q) = L(q) (D); % Form a base clustering from D with k (q) clusters
3. Derive an m × m base similarity matrix M(q) based on λ(q) ;
4. end
5. M = (1/r) Σ_{q=1}^{r} M(q); % Form the consensus similarity matrix
6. λ = L(M); % Form the ensemble clustering based on consensus
% similarity matrix M
Output: Ensemble clustering λ
FIGURE 7.1: The general procedure of similarity-based clustering ensemble
methods.
Here, each instance belongs to exactly one cluster. The base similarity matrix M(q) can be set as M(q)(i, j) = 1 if λi^(q) = λj^(q) and 0 otherwise. In other words, M(q) corresponds to a binary matrix specifying whether each pair of instances co-occurs in the same cluster.
Soft Clustering. In this setting, L(q) works by grouping the data set D into
k (q) soft clusters, such as GMM-based clustering [Fern and Brodley, 2003].
Here, the probability of xi belonging to the lth cluster can be modeled as P(l | i) with Σ_{l=1}^{k(q)} P(l | i) = 1. The base similarity matrix M(q) can be set as M(q)(i, j) = Σ_{l=1}^{k(q)} P(l | i) · P(l | j). In other words, M(q) corresponds to a real-valued matrix specifying the probability for each pair of instances to co-occur in any of the clusters.
After obtaining the consensus similarity matrix M, the ensemble clustering λ can be derived from M by L in a number of ways, such as running the single-linkage [Fred and Jain, 2002, 2005], complete-linkage [Fern
and Brodley, 2003] or average-linkage [Fred and Jain, 2005] agglomerative
clustering over D by taking 1 − M(i, j) as the distance between xi and xj ,
or invoking a partitioning clustering method [Strehl and Ghosh, 2002] over a
similarity graph with xi being the vertex and M(i, j) being the edge weight
between vertices.
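Under the crisp setting, the whole similarity-based procedure of Figure 7.1 can be sketched as below; the use of SciPy's average-linkage routine on the distances 1 − M(i, j) is one of the options mentioned above, and the function names are illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_matrix(base_labels):
    """Average the crisp co-occurrence matrices M^(q) of the base clusterings."""
    base_labels = np.asarray(base_labels)          # shape (r, m)
    r, m = base_labels.shape
    M = np.zeros((m, m))
    for lam in base_labels:
        M += (lam[:, None] == lam[None, :]).astype(float)
    return M / r

def similarity_based_ensemble(base_labels, k):
    """Derive the ensemble clustering by average-linkage over 1 - M(i, j)."""
    M = consensus_matrix(base_labels)
    D = 1.0 - M
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")   # labels 1..k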
The most prominent advantage of similarity-based methods lies in their
conceptual simplicity, since the similarity matrices are easy to instantiate and aggregate. The consensus similarity matrix also offers much
flexibility for subsequent analysis, where many existing clustering methods
which operate on the similarity matrix can be applied to produce the final
ensemble clustering.
The major disadvantage of similarity-based methods lies in their efficiency. The computational and storage complexities are both quadratic in
m, i.e., the number of instances. Therefore, similarity-based methods can
only deal with small or medium-scale problems, and will encounter difficulties in dealing with large-scale data.
7.4 Graph-Based Methods
The basic idea of graph-based clustering ensemble methods is to construct a graph G = (V, E) to integrate the clustering information conveyed
by the base clusterings, and then identify the ensemble clustering by partitioning the graph. Intuitively, the intrinsic grouping
characteristics among all the instances are implicitly encoded in G.
Given an ensemble of r base clusterings {λ(q) | 1 ≤ q ≤ r}, where each λ(q) imposes k(q) clusters over the data set D, let C = {Cl^(q) | 1 ≤ q ≤ r, 1 ≤ l ≤ k(q)} denote the set consisting of all the clusters in the base clusterings. Furthermore, let k* = |C| = Σ_{q=1}^{r} k(q) denote the size of C, i.e., the total number of clusters in all base clusterings. Without loss of generality, clusters in C can be re-indexed as {Cj | 1 ≤ j ≤ k*}. There are three alternative ways to construct the graph G = (V, E) based on how the vertex set V is configured, that is, V = D, V = C and V = D ∪ C.
V = D. In this setting, each vertex in V corresponds to a single data
point xi ∈ D [Ayad and Kamel, 2003, Strehl and Ghosh, 2002]. HGPA
(HyperGraph-Partitioning Algorithm) [Strehl and Ghosh, 2002] is a representative method within this category, whose pseudo-code is given in Figure 7.2.
Here, G is a hypergraph with equally weighted vertices. Given C, HGPA
regards each cluster C ∈ C as a hyperedge (connecting a set of vertices)
and adds it into E. In this way, high-order (≥ 3) rather than only pairwise
relationships between instances are incorporated in the hypergraph G. The
ensemble clustering λ is obtained by applying the HMETIS hypergraph partitioning package [Karypis et al., 1997] on G, where a cut over a hyperedge C
is counted if and only if the vertices in C fall into two or more groups as the
partitioning process terminates, and the hyperedge-cut is minimized subject to the constraint that comparable-sized partitioned groups are favored.
V = C. In this setting, each vertex in V corresponds to a set of data points
C ∈ C, i.e., one cluster in the base clusterings. Each edge in E is an ordinary
edge connecting two vertices from different base clusterings.
Input: Data set D = {x1 , x2 , . . . , xm };
Clusters in all the base clusterings C = {Cj | 1 ≤ j ≤ k ∗ }.
Process:
1. V = D; % Set vertices vi as instances xi in D
2. E = ∅;
3. for j = 1, . . . , k*:
4.   E = E ∪ {Cj};
5. end
6. G = (V, E);
7. λ = HMETIS(G); % Invoke HMETIS package [Karypis et al., 1997] on G
Output: Ensemble clustering λ
FIGURE 7.2: The HGPA algorithm.
MCLA (Meta-CLustering Algorithm) [Strehl and Ghosh, 2002] is a representative method within this category, whose pseudo-code is given in Figure 7.3.
Here, MCLA constructs G as an r-partite graph, where r is the number of base clusterings. Each edge is assigned a weight wij specifying the degree of overlap between the two connecting clusters. The METIS package [Karypis and Kumar, 1998] is used to partition G into k balanced meta-clusters Cp^(M) (p = 1, . . . , k), each characterized by an m-dimensional indicator vector hp^(M) = (hp1^(M), hp2^(M), . . . , hpm^(M)) expressing the level of association between instances and the meta-cluster. The ensemble clustering λ is
then formed by assigning each instance to the meta-cluster mostly associated with it. Notice that it is not guaranteed that every meta-cluster can win
for at least one instance, and ties are broken arbitrarily [Strehl and Ghosh,
2002].
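The r-partite meta-graph of MCLA is determined by the cluster-overlap weights of step 7 in Figure 7.3; the sketch below computes this weight matrix for crisp base clusterings, leaving the actual METIS partitioning to an external package. The function name is hypothetical.

import numpy as np

def mcla_edge_weights(base_labels):
    """Jaccard overlap w_ij = |Ci ∩ Cj| / |Ci ∪ Cj| between clusters drawn from
    different base clusterings (step 7 of the MCLA construction)."""
    # represent each cluster as a boolean membership vector over the m instances
    clusters, owner = [], []
    for q, lam in enumerate(np.asarray(base_labels)):
        for l in np.unique(lam):
            clusters.append(lam == l)
            owner.append(q)
    k_star = len(clusters)
    W = np.zeros((k_star, k_star))
    for i in range(k_star):
        for j in range(k_star):
            if owner[i] != owner[j]:        # only connect clusters of different clusterings
                inter = np.logical_and(clusters[i], clusters[j]).sum()
                union = np.logical_or(clusters[i], clusters[j]).sum()
                W[i, j] = inter / union if union else 0.0
    return W

The resulting weight matrix can then be handed to a graph partitioner such as METIS to induce the k meta-clusters.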
V = D ∪ C. In this setting, each vertex in V corresponds to either a single data point xi ∈ D or a set of data points C ∈ C. Each edge in E is an
ordinary edge connecting two vertices with one from D and another from
C. HBGF (Hybrid Bipartite Graph Formulation) [Fern and Brodley, 2004] is a
representative method within this category, whose pseudo-code is given in
Figure 7.4.
Here, HBGF constructs G as a bi-partite graph with equally weighted
edges. The ensemble clustering λ is obtained by applying the SPEC [Shi
and Malik, 2000] or the METIS [Karypis and Kumar, 1998] graph partitioning
package [Karypis et al., 1997] on G. Here, the partitioning of the bi-partite
graph groups the instance vertices as well as the cluster vertices simultaneously. Therefore, the partitions of the individual instances are returned as
the final clustering results.
Input: Data set D = {x1, x2, . . . , xm};
       Clusters in all the base clusterings C = {Cj | 1 ≤ j ≤ k*}.
Process:
 1. V = C; % Set vertices vi as clusters Ci in C
 2. E = ∅;
 3. for i = 1, . . . , k*:
 4.   for j = 1, . . . , k*:
 5.     if Ci and Cj belong to different base clusterings
 6.       then E = E ∪ {eij}; % Add edge eij = (vi, vj)
 7.            wij = |Ci ∩ Cj| / (|Ci| + |Cj| − |Ci ∩ Cj|); % Set weight for eij
 8.   end
 9. end
10. G = (V, E);
11. {C1^(M), C2^(M), . . . , Ck^(M)} = METIS(G);
    % Invoke METIS package [Karypis and Kumar, 1998] on G to induce
    % meta-clusters Cp^(M) (p = 1, . . . , k)
12. for p = 1, . . . , k:
13.   for i = 1, . . . , m:
14.     hpi^(M) = Σ_{C∈Cp^(M)} I(xi ∈ C) / |Cp^(M)|;
15.   end
16. end
17. for i = 1, . . . , m:
18.   λi = arg max_{p∈{1,...,k}} hpi^(M);
19. end
Output: Ensemble clustering λ

FIGURE 7.3: The MCLA algorithm.
An appealing advantage of graph-based methods lies in their linear computational complexity in m, the number of instances. Thus, this category of
methods provides a practical choice for clustering analysis on large-scale
data. In addition, graph-based methods are able to handle more complicated interactions between instances beyond pairwise relationships, e.g.,
the high-order relationship encoded by hyperedges in HGPA.
The major deficiency of graph-based methods is that the performance
heavily relies on the graph partitioning method that is used to produce
the ensemble clustering. Since graph partitioning techniques are not designed for clustering tasks and the partitioned clusters are just by-products
of the graph partitioning process, the quality of the ensemble clustering can
be impaired. Moreover, most graph partitioning methods such as HMETIS
[Karypis et al., 1997] have the constraint that each cluster contains approximately the same number of instances, and thus, the final ensemble clustering would become inappropriate if the intrinsic data clusters are highly imbalanced.
Input: Data set D = {x1, x2, . . . , xm};
       Clusters in all the base clusterings C = {Cj | 1 ≤ j ≤ k*};
       Graph partitioning package L (SPEC [Shi and Malik, 2000] or
       METIS [Karypis and Kumar, 1998]).
Process:
 1. V = D ∪ C; % Set vertices vi as instances xi in D or clusters Ci in C
 2. E = ∅;
 3. for i = 1, . . . , m:
 4.   for j = 1, . . . , k*:
 5.     if vi ∈ vj % vi being an instance in D; vj being a cluster in C
 6.       then E = E ∪ {eij}; % Add edge eij = (vi, vj)
 7.            wij = 1; % Set equal weight for eij
 8.   end
 9. end
10. G = (V, E);
11. λ = L(G); % Invoke the specified graph partitioning package on G
Output: Ensemble clustering λ

FIGURE 7.4: The HBGF algorithm.
7.5 Relabeling-Based Methods
The basic idea of relabeling-based clustering ensemble methods is to
align or relabel the cluster labels of all base clusterings, such that the same
label denotes similar clusters across the base clusterings, and then derive
the final ensemble clustering based on the aligned labels.
Notice that unlike supervised learning where the class labels represent
specific classes, in unsupervised learning the cluster labels only express
grouping characteristics of the data and are not directly comparable across
different clusterings. For example, given two clusterings λ(1) = (1, 1, 2, 2, 3,
3, 1) and λ(2) = (2, 2, 3, 3, 1, 1, 2), though the cluster labels for each instance differ across the two clusterings, λ(1) and λ(2) are in fact identical.
It is obvious that the labels of different clusterings should be aligned, or
relabeled, based on label correspondence. Relabeling-based methods have
two alternative settings according to the type of label correspondence to be
established, i.e., crisp label correspondence and soft label correspondence.
Input: Data set D = {x1, x2, . . . , xm};
       Base clusterings Λ = {λ(1), λ(2), . . . , λ(r)} each with k clusters.
Process:
 1. Randomly select λ(b) = {Cl^(b) | l = 1, . . . , k} in Λ as the reference clustering;
 2. Λ = Λ − {λ(b)};
 3. repeat
 4.   Randomly select λ(q) = {Cl^(q) | l = 1, . . . , k} in Λ to align with λ(b);
 5.   Initialize k × k matrix O with O(u, v) = |Cu^(b) ∩ Cv^(q)| (1 ≤ u, v ≤ k);
      % Count instances shared by clusters in λ(b) and λ(q)
 6.   I = {(u, v) | 1 ≤ u, v ≤ k};
 7.   repeat
 8.     (u′, v′) = arg max_{(u,v)∈I} O(u, v);
 9.     Relabel Cv′^(q) as Cu′^(q);
10.     I = I − ({(u′, w) | (u′, w) ∈ I} ∪ {(w, v′) | (w, v′) ∈ I});
11.   until I = ∅
12.   Λ = Λ − {λ(q)};
13. until Λ = ∅;
Output: Relabeled clusterings {λ(q) | 1 ≤ q ≤ r} with aligned cluster labels

FIGURE 7.5: The relabeling process for crisp label correspondence [Zhou and Tang, 2006].
Crisp Label Correspondence. In this setting, each base clustering is assumed to group the data set D = {x1, x2, . . . , xm} into an equal number of clusters, i.e., k(q) = k (q = 1, . . . , r). As a representative, the method described in [Zhou and Tang, 2006] aligns cluster labels as shown in Figure 7.5.
In [Zhou and Tang, 2006], clusters in different clusterings are iteratively aligned based on the recognition that similar clusters should contain similar instances. The task of matching two clusterings, e.g., λ(q) and λ(b), can also be accomplished by formulating it as a standard assignment problem [Kuhn, 1955], where the cost of assigning cluster Cv^(q) ∈ λ(q) to cluster Cu^(b) ∈ λ(b) can be set as m − |Cu^(b) ∩ Cv^(q)|. Then, the minimum cost one-to-one assignment problem can be solved by the popular Hungarian algorithm [Topchy et al., 2004b, Hore et al., 2009].
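A sketch of this assignment-problem formulation, using SciPy's Hungarian-algorithm solver and assuming both clusterings use labels 1, . . . , k, is given below; the function name and label convention are illustrative.

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(lam_q, lam_b, k):
    """Relabel lam_q so that its clusters match the reference clustering lam_b.
    Cost of mapping cluster v of lam_q to cluster u of lam_b is m - |C_u^(b) ∩ C_v^(q)|."""
    lam_q, lam_b = np.asarray(lam_q), np.asarray(lam_b)
    m = len(lam_q)
    overlap = np.zeros((k, k), dtype=int)
    for u in range(1, k + 1):
        for v in range(1, k + 1):
            overlap[u - 1, v - 1] = np.sum((lam_b == u) & (lam_q == v))
    rows, cols = linear_sum_assignment(m - overlap)   # minimum-cost one-to-one assignment
    mapping = {int(cols[i]) + 1: int(rows[i]) + 1 for i in range(k)}   # old label -> new label
    return np.array([mapping[int(l)] for l in lam_q])

# usage: aligned = align_labels(lam2, lam1, k=3)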
After the labels of different base clusterings have been relabeled, strategies for combining classifiers can be applied to derive the final ensemble clustering λ. Letting λi^(q) ∈ {1, . . . , k} denote the cluster label of xi (i = 1, . . . , m) in the aligned base clustering λ(q) (q = 1, . . . , r), four strategies are described in [Zhou and Tang, 2006] to derive λ:
- Simple Voting: The ensemble clustering label λi of xi is simply determined by

    λi = arg max_{l∈{1,...,k}} Σ_{q=1}^{r} I(λi^(q) = l).    (7.20)
- Weighted Voting: The mutual information between a pair of clusterings [Strehl et al., 2000] is employed to derive the weight for each λ(q). Given two base clusterings λ(p) and λ(q), let mu = |Cu^(p)|, mv = |Cv^(q)| and muv = |Cu^(p) ∩ Cv^(q)|. The [0,1]-normalized mutual information Φ^NMI between λ(p) and λ(q) can be defined as

    Φ^NMI(λ(p), λ(q)) = (2/m) Σ_{u=1}^{k} Σ_{v=1}^{k} muv log_{k²}( (muv · m) / (mu · mv) ).    (7.21)
Other kinds of definitions can be found in [Strehl and Ghosh, 2002,
Fred and Jain, 2005]. Then, for each base clustering, the average mutual information can be calculated as
    β(q) = (1/(r − 1)) Σ_{p=1, p≠q}^{r} Φ^NMI(λ(p), λ(q))  (q = 1, . . . , r).    (7.22)
Intuitively, the larger the β(q) value, the less statistical information is contained in λ(q) that is not contained in the other base clusterings [Zhou and Tang, 2006]. Thus, the weight for λ(q) can be defined as

    w(q) = 1 / (Z · β(q))  (q = 1, . . . , r),    (7.23)

where Z is a normalizing factor such that Σ_{q=1}^{r} w(q) = 1. Finally, the ensemble clustering label λi of xi is determined by

    λi = arg max_{l∈{1,...,k}} Σ_{q=1}^{r} w(q) · I(λi^(q) = l).    (7.24)
- Selective Voting: This is a strategy which incorporates ensemble pruning. In [Zhou and Tang, 2006], the mutual information weights {w(q) | q = 1, . . . , r} are used to select the base clusterings for combination, where the base clusterings with weights smaller than a threshold wthr are excluded from the ensemble. Zhou and Tang [2006] simply set wthr = 1/r. Let Q = {q | w(q) ≥ 1/r, 1 ≤ q ≤ r}; then the ensemble clustering label λi of xi is determined by

    λi = arg max_{l∈{1,...,k}} Σ_{q∈Q} I(λi^(q) = l).    (7.25)
- Selective Weighted Voting: This is a weighted version of selective voting, where the ensemble clustering label λi of xi is determined by

    λi = arg max_{l∈{1,...,k}} Σ_{q∈Q} w(q) · I(λi^(q) = l).    (7.26)
It was reported in [Zhou and Tang, 2006] that the selective weighted voting
leads to the best empirical results, where the weighted voting and selective
voting both contribute to performance improvement.
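The sketch below illustrates how the weighted and selective weighted voting strategies of (7.21)-(7.26) might be coded for aligned base clusterings; simple voting corresponds to using equal weights. The small constant added to β(q) and the function names are assumptions of the sketch.

import numpy as np

def nmi(lam_p, lam_q, k):
    # [0,1]-normalized mutual information of (7.21) for two k-cluster label vectors
    m = len(lam_p)
    val = 0.0
    for u in range(1, k + 1):
        for v in range(1, k + 1):
            muv = np.sum((lam_p == u) & (lam_q == v))
            if muv > 0:
                mu, mv = np.sum(lam_p == u), np.sum(lam_q == v)
                val += muv * np.log(muv * m / (mu * mv)) / np.log(k ** 2)
    return 2.0 * val / m

def weighted_voting(aligned, k, selective=False):
    # combine aligned base clusterings by (selective) weighted voting, (7.22)-(7.26)
    aligned = np.asarray(aligned)                     # shape (r, m), labels in 1..k
    r, m = aligned.shape
    beta = np.array([np.mean([nmi(aligned[p], aligned[q], k)
                              for p in range(r) if p != q]) for q in range(r)])
    w = 1.0 / (beta + 1e-12)                          # weights of (7.23), before normalization
    w /= w.sum()
    keep = (w >= 1.0 / r) if selective else np.ones(r, dtype=bool)
    votes = np.zeros((m, k))
    for q in range(r):
        if keep[q]:
            for l in range(1, k + 1):
                votes[:, l - 1] += w[q] * (aligned[q] == l)
    return votes.argmax(axis=1) + 1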
Soft Label Correspondence. In this setting, each base clustering is assumed to group the data set D into an arbitrary number of clusters, i.e., k(q) ∈ N (q = 1, . . . , r). Each base clustering λ(q) = {Cl^(q) | l = 1, 2, . . . , k(q)} can be represented as an m × k(q) matrix A(q), where A(q)(i, l) = 1 if xi ∈ Cl^(q) and 0 otherwise. Given two base clusterings λ(p) and λ(q), a k(p) × k(q) soft correspondence matrix S is assumed to model the correspondence relationship between clusters of each clustering. Here, S ≥ 0 and Σ_{v=1}^{k(q)} S(u, v) = 1 (u = 1, 2, . . . , k(p)). Intuitively, with the help of S, the membership matrix A(p) for λ(p) can be mapped to the membership matrix A(q) for λ(q) by A(p)S. The quality of this mapping can be measured by the Frobenius matrix norm between A(q) and A(p)S, i.e., ||A(q) − A(p)S||²_F. The smaller the Frobenius norm, the more precisely the soft correspondence matrix S captures the relation between A(p) and A(q).
Given r base clusterings with membership matrices A(1) ∈ R^{m×k(1)}, . . . , A(r) ∈ R^{m×k(r)} and the number of clusters k, as a representative, the SCEC (Soft Correspondence Ensemble Clustering) method [Long et al., 2005] aims to find the final ensemble clustering A ∈ R^{m×k} together with r soft correspondence matrices S(1) ∈ R^{k(1)×k}, . . . , S(r) ∈ R^{k(r)×k} by minimizing the objective function

    min Σ_{q=1}^{r} ||A − A(q)S(q)||²_F    (7.27)
    s.t. S(q)(u, v) ≥ 0 and Σ_{v=1}^{k} S(q)(u, v) = 1  ∀ q, u, v.
This optimization problem can be solved by the alternating optimization
strategy, i.e., optimizing A and each S(q) one at a time by fixing the others.
Rather than directly optimizing (7.27), SCEC chooses to make two modifications to the above objective function. First, as the minimizer of (7.27) may
converge to a final ensemble clustering A with an unreasonably small number of clusters (i.e., resulting in many all-zero columns in A), a column-sparseness constraint is enforced on each S(q) to help produce an A with as
many clusters as possible. Specifically, the sum of the variation of each column of S(q) is a good measure of its column-sparseness [Long et al., 2005],
Input: Data set D = {x1, x2, . . . , xm};
       Base clusterings Λ = {λ(1), λ(2), . . . , λ(r)} each with k(q) clusters;
       Integer k, coefficients α, β, small positive constant ε.
Process:
 1. for q = 1, . . . , r:
 2.   Form an m × k(q) membership matrix A(q), where A(q)(i, l) = 1 if xi ∈ Cl^(q)
      and 0 otherwise; % λ(q) = {Cl^(q) | l = 1, 2, . . . , k(q)}
 3.   Randomly initialize a k(q) × k soft correspondence matrix S(q) with S(q) ≥ 0;
 4. end
 5. repeat
 6.   A = (1/r) Σ_{q=1}^{r} A(q)S(q); % By setting ∂f/∂A = 0 with f being the
      % objective function in (7.28)
 7.   for q = 1, . . . , r:
 8.     S(q) = S(q) ⊙ [ (A(q))⊤A + βk 1_{k(q)k} ] / [ B + ε·1_{k(q)k} ],
        where B = (A(q))⊤A(q)S(q) − αS(q) + (α/k(q)) 1_{k(q)k(q)} S(q) + βk S(q)1_{kk}
 9.   end
10. until convergence;
Output: Membership matrix A for ensemble clustering

FIGURE 7.6: The SCEC method.
i.e., the larger the value of ||S(q) − (1/k(q)) 1_{k(q)k(q)} S(q)||²_F, the more column-sparse the S(q). Here, 1_{k(q)k(q)} is a k(q) × k(q) matrix of all ones. Second, as it is hard to handle the normalization constraint Σ_{v=1}^{k} S(q)(u, v) = 1 efficiently, it is transformed into a soft constraint by adding the penalty term Σ_{q=1}^{r} ||S(q)1_{kk} − 1_{k(q)k}||²_F to (7.27). Now, the objective function of SCEC becomes
    min Σ_{q=1}^{r} ( ||A − A(q)S(q)||²_F − α||S(q) − (1/k(q)) 1_{k(q)k(q)} S(q)||²_F + β||S(q)1_{kk} − 1_{k(q)k}||²_F )    (7.28)
    s.t. S(q)(u, v) ≥ 0  ∀ q, u, v,
where α and β are coefficients balancing the different terms. Like (7.27), the modified objective function (7.28) can be solved by the alternating optimization process [Long et al., 2005] as shown in Figure 7.6. Specifically (step 8), the division between two matrices is performed in an element-wise manner, and ⊙ denotes the Hadamard product of two matrices. It has been proven that (7.28) is guaranteed to reach a local minimum based on the given alternating optimization process [Long et al., 2005].
An advantage of relabeling-based methods is that they offer the possibility of investigating the connections between different base clusterings,
which may be helpful in studying the implications of the clustering results.
In particular, in crisp label correspondence, the reference clustering can be
viewed as a profiling structure of the data set; while in soft label correspondence, the learned correspondence matrices provide intuitive interpretations to the relations between the ensemble clustering and each base clustering.
A deficiency of relabeling-based methods is that if there is no reasonable
correspondence among the base clusterings, they may not work well. Moreover, the crisp label correspondence methods require each base clustering
to have an identical number of clusters, and it may result in a final ensemble clustering with fewer clusters than the base clusterings. The soft label
correspondence methods need to solve an optimization problem involving
numerous variables, and this is prone to get stuck in a local minimum far
from the optimal solution.
7.6 Transformation-Based Methods
The basic idea of transformation-based clustering ensemble methods is
to re-represent each instance as an r-tuple, where r is the number of base
clusterings and the qth element indicates its cluster assignment given by
the qth base clustering, and then derive the final ensemble clustering by
performing clustering analysis over the transformed r-tuples.
For example, suppose there are four base clusterings over five instances,
e.g., λ(1) = {1, 1, 2, 2, 3}, λ(2) = {1, 2, 2, 2, 3}, λ(3) = {2, 2, 3, 1, 3} and
λ(4) = {3, 1, 3, 2, 3}. Then, based on the transformation process, xi will be
transformed into the r-tuple ϕ(xi ) (r = 4) as: ϕ(x1 ) = (ϕ1 (x1 ), ϕ2 (x1 ),
ϕ3 (x1 ), ϕ4 (x1 )) = (1, 1, 2, 3), and similarly, ϕ(x2 ) = (1, 2, 2, 1) , ϕ(x3 ) =
(2, 2, 3, 3) , ϕ(x4 ) = (2, 2, 1, 2) and ϕ(x5 ) = (3, 3, 3, 3).
Each transformed r-tuple ϕ(x) = (ϕ1 (x), ϕ2 (x), . . . , ϕr (x)) can be regarded as a categorical vector, where ϕq (x) ∈ K(q) = {1, 2, . . . , k (q) } (q =
1, . . . , r). Any categorical clustering technique can then be applied to group
the transformed r-tuples to identify the final ensemble clustering. For example, one can define a similarity function sim(·, ·) between the transformed r-tuples, e.g.,
    sim(ϕ(xi), ϕ(xj)) = Σ_{q=1}^{r} I(ϕq(xi) = ϕq(xj)),    (7.29)
and then use traditional clustering methods such as k-means to identify the
final ensemble clustering [Topchy et al., 2003].
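The transformation and the similarity function (7.29) are straightforward to code; the sketch below reproduces the five-instance example given earlier, with illustrative function names.

import numpy as np

def to_r_tuples(base_labels):
    """Re-represent each instance x_i as the r-tuple of its base cluster labels."""
    return np.asarray(base_labels).T               # shape (m, r)

def tuple_similarity(phi):
    """Pairwise similarity of (7.29): number of base clusterings agreeing on each pair."""
    m, r = phi.shape
    S = np.zeros((m, m), dtype=int)
    for q in range(r):
        S += (phi[:, q, None] == phi[None, :, q])
    return S

# the four base clusterings of the five instances from the example above
base = [[1, 1, 2, 2, 3], [1, 2, 2, 2, 3], [2, 2, 3, 1, 3], [3, 1, 3, 2, 3]]
phi = to_r_tuples(base)       # phi[0] == (1, 1, 2, 3), as in the text
print(tuple_similarity(phi))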
The task of clustering categorical data can also be equivalently transformed into the task of creating a clustering ensemble, where the qth categorical feature with k (q) possible values can naturally give rise to a base
clustering with k (q) clusters [He et al., 2005].
Besides resorting to categorical clustering techniques, the task of clustering the transformed r-tuples can also be tackled directly in a probabilistic
framework [Topchy et al., 2004a], as introduced in the following.
Given r base clusterings λ(q) (q = 1, . . . , r) over the data set D, let y = (y1, y2, . . . , yr) ∈ K(1) × K(2) × · · · × K(r) denote the r-dimensional random vector, and yi = ϕ(xi) = (y1^i, y2^i, . . . , yr^i) denote the transformed r-tuple for xi. The random vector y is modeled by a mixture of multinomial distributions, i.e.,

    P(y | Θ) = Σ_{j=1}^{k} αj Pj(y | θj),    (7.30)
where k is the number of mixture components which also corresponds to
the number of clusters in the final ensemble clustering. Each mixture component is parameterized by θj and Θ = {αj , θj | j = 1, . . . , k}. Assume that
the components of y are conditionally independent, i.e.,
    Pj(y | θj) = Π_{q=1}^{r} Pj^(q)(yq | θj^(q))  (1 ≤ j ≤ k).    (7.31)
Moreover, the conditional probability Pj^(q)(yq | θj^(q)) is viewed as the outcome of one multinomial try, i.e.,

    Pj^(q)(yq | θj^(q)) = Π_{l=1}^{k(q)} ϑqj(l)^{δ(yq, l)},    (7.32)
l=1
where k (q) is the number of clusters in the qth base clustering and δ(·, ·)
represents the Kronecker delta function. The probabilities of the k (q) multi
k(q)
nomial outcomes are defined as ϑqj (l) with l=1 ϑqj (l) = 1, and thus,
θj = {ϑqj (l) | 1 ≤ q ≤ r, 1 ≤ l ≤ k (q) }.
Based on the above assumptions, the optimal parameter Θ∗ is found by
maximizing the log-likelihood function with regard to the m transformed
r-tuples Y = {yi | 1 ≤ i ≤ m}, i.e.,
    Θ* = arg max_Θ log L(Y | Θ) = arg max_Θ log Π_{i=1}^{m} P(yi | Θ)
       = arg max_Θ Σ_{i=1}^{m} log ( Σ_{j=1}^{k} αj Pj(yi | θj) ).    (7.33)
Input: Data set D = {x1, x2, . . . , xm};
       Base clusterings Λ = {λ(1), λ(2), . . . , λ(r)} each with k(q) clusters;
       Integer k.
Process:
 1. for i = 1, . . . , m:
 2.   for q = 1, . . . , r:
 3.     yq^i = λi^(q);
 4.   end
 5.   ϕ(xi) = (y1^i, y2^i, . . . , yr^i); % Set the transformed r-tuple for xi
 6. end
 7. Initialize αj (1 ≤ j ≤ k) with αj ≥ 0 and Σ_{j=1}^{k} αj = 1;
 8. for j = 1, . . . , k:
 9.   for q = 1, . . . , r:
10.     Initialize ϑqj(l) (1 ≤ l ≤ k(q)) with ϑqj(l) ≥ 0 and Σ_{l=1}^{k(q)} ϑqj(l) = 1;
11.   end
12. end
13. repeat
14.   E[zij] = αj Π_{q=1}^{r} Π_{l=1}^{k(q)} ϑqj(l)^{δ(yq^i, l)}
              / Σ_{j=1}^{k} αj Π_{q=1}^{r} Π_{l=1}^{k(q)} ϑqj(l)^{δ(yq^i, l)};   % E-step
15.   αj = Σ_{i=1}^{m} E[zij] / Σ_{i=1}^{m} Σ_{j=1}^{k} E[zij];
      ϑqj(l) = Σ_{i=1}^{m} δ(yq^i, l) E[zij] / Σ_{i=1}^{m} Σ_{l=1}^{k(q)} δ(yq^i, l) E[zij];   % M-step
16. until convergence;
17. λi = arg max_{1≤j≤k} αj Pj(ϕ(xi) | θj);   % c.f.: (7.30)−(7.32)
Output: Ensemble clustering λ

FIGURE 7.7: The EM procedure for the transformation-based method [Topchy et al., 2004a] within the probabilistic framework.
The EM algorithm is used to solve (7.33). To facilitate the EM procedure, the
hidden variables Z = {zij | 1 ≤ i ≤ m, 1 ≤ j ≤ k} are introduced, where
zij = 1 if yi belongs to the jth mixture component and 0 otherwise. Figure 7.7 illustrates the detailed EM procedure given in [Topchy et al., 2004a].
An advantage of the transformation-based methods is that they are usually easy to implement, since the re-representation of the instances using
the base clustering information is rather direct, and any off-the-shelf categorical clustering techniques can be applied to the transformed tuples to
compute the final ensemble clustering.
A deficiency of these methods lies in that when re-representing each instance into a categorical tuple, it is possible that the transformed data could
not fully encode the information embodied in the original data representation. Therefore, it is by no means guaranteed that the clustering results obtained from the transformed data resemble exactly the desired ensemble
clustering from the original base clusterings.
7.7 Further Readings
A lot of clustering methods have been developed. In addition to k-means,
famous partitioning methods include k-medoids [Kaufman and Rousseeuw,
1990] whose cluster centers are exactly training instances, k-modes [Huang,
1998] for categorical data, CLARANS [Ng and Han, 1994] for large-scale
data, etc. In addition to SAHN, famous hierarchical clustering methods
include AGNES [Kaufman and Rousseeuw, 1990] which can be regarded
as a particular version of SAHN, DIANA [Kaufman and Rousseeuw, 1990]
which forms the hierarchy in a top-down manner, BIRCH [Zhang et al.,
1996] which integrates hierarchical clustering with other clustering methods, ROCK [Guha et al., 1999] which was designed for categorical data, etc.
In addition to DBSCAN, famous density-based methods include OPTICS
[Ankerst et al., 1999] which augments DBSCAN with an ordering of clusters, DENCLUE [Hinneburg and Keim, 1998] which utilizes density distribution functions, etc. In addition to STING, famous grid-based methods
include WaveCluster [Sheikholeslami et al., 1998] which exploits wavelet
transformation, CLIQUE [Agrawal et al., 1998] which was designed for
high-dimensional data, etc. In addition to GMM-based clustering, famous
model-based methods include SOM [Kohonen, 1989] which forms clusters
by mapping from high-dimensional space into lower-dimensional (2d or
3d) space with the neural network model of self-organizing maps, COBWEB
[Fisher, 1987] which clusters categorical data incrementally, etc. There are
so many clustering methods partially because users may have very different
motivations to cluster even the same data, where there is no unique objective, and therefore, once a new criterion is given, a new clustering method
can be proposed [Estivill-Castro, 2002].
In addition to the cluster quality indices introduced in Section 7.1.2, Jain and Dubes [1988] and Halkidi et al. [2001] also provide introductions to many other indices, such as the external indices adjusted Rand index and Hubert's Γ statistic, and the internal indices C index and Hartigan index.
Clustering ensemble techniques have already been applied to many
tasks, such as image segmentation [Zhang et al., 2008], gene expression
data analysis [Avogadri and Valentini, 2009, Hu et al., 2009, Yu and Wong,
2009], etc. Though there are many works on developing clustering ensemble methods, only a few studies have been devoted to the theoretical aspects. Topchy et al. [2004c] provided a theoretical justification for the use-
156
Ensemble Methods: Foundations and Algorithms
fulness of clustering ensemble under strong assumptions. Kuncheva and
Vetrov [2006] studied the stability issue of clustering ensemble with kmeans.
In contrast to supervised learning where the “accuracy” has a clear meaning, in unsupervised learning there is no unique equivalent concept. Therefore, the study of the accuracy-diversity relation of clustering ensemble
is rather difficult. Hadjitodorov and Kuncheva [2007], Hadjitodorov et al.
[2006], Kuncheva and Hadjitodorov [2004], Kuncheva et al. [2006] presented
some attempts towards this direction. There are some recent studies on
other advanced topics such as clustering ensemble pruning [Fern and Lin,
2008, Hong et al., 2009], scalable clustering ensemble [Hore et al., 2006,
2009], etc.
8 Advanced Topics
8.1 Semi-Supervised Learning
8.1.1 Usefulness of Unlabeled Data
The great advances in data collection and storage technology enable the
accumulation of a large amount of data in many real-world applications.
Assigning labels to these data, however, is expensive because the labeling
process requires human effort and expertise. For example, in computer-aided medical diagnosis, a large number of x-ray images can be obtained from routine examination, yet it is difficult to ask physicians to mark all foci of infection in all images. If we use traditional supervised learning
techniques to construct a diagnosis system, then only a small portion of
training data, on which the focuses have been marked, are useful. Due to
the limited amount of labeled training examples, it may be difficult to attain a strong diagnosis system. Thus, a question naturally arises: Can we
leverage the abundant unlabeled data with a few labeled training examples
to construct a strong learning system?
Semi-supervised learning deals with methods for exploiting unlabeled
data in addition to labeled data automatically to improve learning performance. Suppose the data are drawn from an unknown distribution D over
the instance space X and the label space Y. In semi-supervised learning, a
labeled data set L = {(x1 , y1 ), (x2 , y2 ), . . . , (xl , yl )} and an unlabeled data
set U = {xl+1, xl+2, . . . , xm} are given, where xi ∈ X and yi ∈ Y and generally l ≪ m, and the task is to learn H : X → Y. For simplicity, consider
binary classification tasks where Y = {−1, +1} .
It is interesting to know why unlabeled data, which do not contain labels,
can be helpful to supervised learning. Figure 8.1 provides an illustration.
It can be seen that though both the classification boundaries are perfectly
consistent with the labeled data points, the boundary obtained by considering unlabeled data is better in generalization. In fact, since both the unlabeled data U and the labeled data L are drawn from the same distribution D, unlabeled data can disclose some information on data distribution
which is helpful for constructing a model with good generalization ability.
FIGURE 8.1: Illustration of the usefulness of unlabeled data: (a) without unlabeled data; (b) with unlabeled data. The optimal classification boundaries without/with considering unlabeled data are plotted, respectively.
Indeed, semi-supervised learning approaches work by taking assumptions on how the distribution information disclosed by unlabeled data is
connected with the label information. There are two basic assumptions,
i.e., the cluster assumption and the manifold assumption. The former assumes that data with similar inputs should have similar class labels; the
latter assumes that the data live in a low-dimensional manifold while the
unlabeled data can help to identify that manifold. The cluster assumption concerns classification, while the manifold assumption can also be
applied to tasks other than classification. In some sense, the manifold assumption is a generalization of the cluster assumption, since it is usually
assumed that the cluster structure of the data will be more easily found
in the lower-dimensional manifold. These assumptions are closely related
to low-density separation, which specifies that the boundary should not
go across high-density regions in the instance space. This assumption has
been adopted by many semi-supervised learning approaches. It is evident
that unlabeled data can help, at least, to identify the similarity, and thus
contribute to the construction of prediction models.
Transductive learning is a concept closely related to semi-supervised
learning. The main difference between them lies in the different assumptions on the test data. Transductive learning takes a closed-world assumption, i.e., the test data is known in advance and the unlabeled data are exactly the test data. The goal of transductive learning is to optimize the generalization ability on this test data. Semi-supervised learning takes an open-world assumption, i.e., the test data is not known and the unlabeled data
are not necessarily test data. Transductive learning can be viewed as a special setting of semi-supervised learning, and we do not distinguish them in
the following.
8.1.2 Semi-Supervised Learning with Ensembles
This section briefly introduces some semi-supervised ensemble methods. Most semi-supervised ensemble methods work by training learners
using the initial labeled data at first, and then using the learners to assign pseudo-labels to unlabeled data. After that, new learners are trained by
using both the initial labeled data and the pseudo-labeled data. The procedure of training learners and assigning pseudo-labels are repeated until
some stopping condition is reached. Based on the categorization of sequential and parallel ensemble methods (see Section 3.1), the introduction to
common semi-supervised ensemble methods is separated into the following two subsections.
8.1.2.1 Semi-Supervised Sequential Ensemble Methods
Semi-supervised sequential ensemble methods mainly include Boosting-style methods, such as SSMBoost, ASSEMBLE and SemiBoost.
SSMBoost [d’Alché-Buc et al., 2002]. This method extends the margin definition to unlabeled data and employs gradient descent to construct an ensemble which minimizes the margin loss function on both labeled and unlabeled data. Here, Boosting is generalized as a linear combination of hypotheses, that is,
    H(x) = Σ_{i=1}^{T} βi hi(x),    (8.1)
where the output of each base learner hi is in [−1, 1]. The overall loss function is defined with any decreasing function ℓ of the margin γ as

    ℓ(H) = Σ_{i=1}^{l} ℓ(γ(H(xi), yi)),    (8.2)
where γ(H(xi ), yi ) = yi H(xi ) is the margin of the hypothesis H on the labeled example (xi , yi ). Apparently, the margin measures the confidence of
the classification for labeled data. For unlabeled data, however, the margin
cannot be calculated, since we do not know the ground-truth labels. One
alternative is to use the expected margin
)
*
γu (H (x)) = Ey γ(H(x), y) .
Using the output (H(x) + 1)/2 as threshold for an estimate of the posterior
probability P (y = +1 | x), the expected margin for unlabeled data in U
becomes
H (x) + 1
H(x) + 1
γu (H (x)) =
H(x) + 1 −
(−H (x))
2
2
)
*2
= H(x) .
(8.3)
Another way is to use the maximum a posteriori probability of y directly,
and thus,
    γu(H(x)) = H(x) sign[H(x)] = |H(x)|.    (8.4)
Notice that the margins in both (8.3) and (8.4) require the outputs of the
learner on unlabeled data; this is the pseudo-label assigned by the ensemble. With the definition of margin for unlabeled data, the overall loss function of SSMBoost at the tth round is defined as
    ℓ(Ht) = Σ_{i=1}^{l} ℓ(γ(Ht(xi), yi)) + Σ_{i=l+1}^{m} ℓ(γu(Ht(xi))).    (8.5)
Then, in the tth round, SSMBoost tries to create a new base learner ht+1 and the corresponding weight βt+1 to minimize ℓ(Ht). The final ensemble H is obtained when the number of rounds T is reached. Notice that rather than standard AdaBoost, SSMBoost uses MarginBoost, a variant of AnyBoost [Mason et al., 2000], to attain base learners in each round.
ASSEMBLE [Bennett et al., 2002]. This method is similar to SSMBoost. It also
constructs ensembles in the form of (8.1), and alternates between assigning
pseudo-labels to unlabeled data using the existing ensemble and generating the next base classifier to maximize the margin on both labeled and unlabeled data, where the margin on unlabeled data is calculated according to
(8.4). The main difference between SSMBoost and ASSEMBLE lies in the fact
that SSMBoost requires the base learning algorithm be a semi-supervised
learning method, while Bennett et al. [2002] enabled ASSEMBLE to work
with any weight-sensitive learner for both binary and multi-class problems.
ASSEMBLE using decision trees as base classifiers won the NIPS 2001 Unlabeled Data Competition.
SemiBoost [Mallapragada et al., 2009]. Recall that in SSMBoost and ASSEMBLE, in each round the pseudo-labels are assigned to some unlabeled data
with high confidence, and the pseudo-labeled data along with the labeled
data are used together to train a new base learner in the next round. In this
way, the pseudo-labeled data may be only helpful to increase the classification margin, yet provide little novel information about the learning task,
since these pseudo-labeled data can be classified by the existing ensemble
with high confidence. To overcome this problem, Mallapragada et al. [2009]
proposed the SemiBoost method, which uses pairwise similarity measurements to guide the selection of unlabeled data to assign pseudo-labels.
They imposed the constraint that similar unlabeled instances must be assigned the same label, and if an unlabeled instance is similar to a labeled
instance then it must be assigned the label of the labeled instance. With
these constraints, the SemiBoost method is closely related to graph-based
semi-supervised learning approaches exploiting the manifold assumption.
SemiBoost was generalized for multi-class problems by Valizadegan et al.
[2008].
8.1.2.2 Semi-Supervised Parallel Ensemble Methods
Semi-supervised parallel ensemble methods are usually disagreement-based semi-supervised learning approaches, such as Tri-Training and Co-Forest.
Tri-Training [Zhou and Li, 2005]. This method can be viewed as an extension of the Co-Training method [Blum and Mitchell, 1998]. Co-Training trains
two classifiers from different feature sets, and in each round, each classifier labels some unlabeled data for the other learner to refine. Co-Training
works well on data with two independent feature sets both containing sufficient information for constructing a strong learner. Most data sets contain
only a single feature set, and it is difficult to judge which learner should be
trusted when they disagree. To address this issue, Zhou and Li [2005] proposed to train three learners, and in each round the unlabeled data are used
in a majority teach minority way; that is, for an unlabeled instance, if the
predictions of two learners agree yet the third learner disagrees, then the
unlabeled instance will be labeled by two learners for the third learner. To
reduce the risk of “correct minority” being misled by “incorrect majority”,
a sanity check mechanism was designed in [Zhou and Li, 2005], which is
examined in each round. In the testing phase, the prediction is obtained by
majority voting. The Tri-Training method can work with any base learners
and is easy to implement. Notice that, like ensembles in supervised learning, the three learners need to be diverse. Zhou and Li [2005] generated
the initial learners using bootstrap sampling, similar to the strategy used in
Bagging. Other strategies for augmenting diversity can also be applied, and
Tri-Training can naturally work well with multiple views, since
different views provide natural diversity.
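The sketch below illustrates the core "majority teach minority" step under simplifying assumptions: scikit-learn-style classifiers, initial learners built on bootstrap samples as in Bagging, and no sanity-check mechanism (which the full Tri-Training procedure of Zhou and Li [2005] requires). All function names are illustrative, not from the original paper.

```python
# A minimal sketch of one Tri-Training "majority teach minority" round,
# assuming scikit-learn-style classifiers; the sanity-check mechanism is omitted.
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def tri_training_round(learners, X_lab, y_lab, X_unlab):
    """One round: each learner is refined with unlabeled instances on which
    the other two learners agree."""
    new_learners = []
    for i, h in enumerate(learners):
        j, k = [t for t in range(3) if t != i]
        pred_j = learners[j].predict(X_unlab)
        pred_k = learners[k].predict(X_unlab)
        agree = pred_j == pred_k                      # the "majority" agrees here
        X_new = np.vstack([X_lab, X_unlab[agree]])
        y_new = np.concatenate([y_lab, pred_j[agree]])
        new_learners.append(clone(h).fit(X_new, y_new))
    return new_learners

def tri_training_predict(learners, X):
    """Majority vote of the three learners (labels assumed in {-1, +1})."""
    votes = np.stack([h.predict(X) for h in learners])
    return np.sign(votes.sum(axis=0))

# Usage: initial learners from bootstrap samples of the labeled data.
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(30, 2))
y_lab = np.where(X_lab[:, 0] > 0, 1.0, -1.0)
X_unlab = rng.normal(size=(200, 2))
learners = []
for _ in range(3):
    idx = rng.integers(0, len(X_lab), len(X_lab))     # bootstrap sample
    learners.append(DecisionTreeClassifier(max_depth=3).fit(X_lab[idx], y_lab[idx]))
learners = tri_training_round(learners, X_lab, y_lab, X_unlab)
```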
Co-Forest [Li and Zhou, 2007]. This method is an extension of Tri-Training
to include more base learners. In each round, each learner is refined with
unlabeled instances labeled by its concomitant ensemble, which comprises
all the other learners. The concomitant ensembles used in Co-Forest are usually more accurate than the two learners used in Tri-Training. Notice, however, that with more learners, the behaviors of the learners become more and
more similar during the "majority teach minority" procedure, and thus the diversity among the learners decreases rapidly. This
problem can be reduced to some extent by injecting randomness into the
learning process. In [Li and Zhou, 2007], a random forest was used to realize
the ensemble, and in each round, different subsets of unlabeled instances
were sampled from the unlabeled data for different learners; this strategy is
not only helpful for augmenting diversity, but also helpful for reducing the
risk of being trapped into poor local minima.
8.1.2.3 Augmenting Ensemble Diversity with Unlabeled Data
Conventional ensemble methods work under the supervised setting, trying to achieve a high accuracy and high diversity for individual learners
by using the labeled training data. It is noteworthy, however, that pursuing high accuracy and high diversity on the same labeled training data can
suffer from a dilemma; that is, the increase of diversity may require a sacrifice of individual accuracy. For an extreme example, if all learners are nearly
perfect on training data, to increase the diversity, the training accuracy of
most of the learners needs to be reduced.
From the aspect of diversity augmentation, using unlabeled data makes
a big difference. For example, given two sets of classifiers, H = {h1 , . . . , hn }
and G = {g1 , . . . , gn }, if we know that all of the classifiers are 100% accurate
on labeled training data, there is no basis for choosing between ensemble
H and ensemble G. However, if we find that the gi ’s make the same predictions on unlabeled data while the hi ’s make different predictions on some
unlabeled data, we know that the ensemble H would have a good chance of being better than G because it is more diverse while still being equally accurate
on the training data.
Notice that most semi-supervised ensemble methods, as introduced in
Section 8.1.2, exploit unlabeled data to improve the individual accuracy
by assigning pseudo-labels to unlabeled data and then using the pseudo-labeled examples together with the original labeled examples to train the
individual learners. Recently, Zhou [2009] indicated that it is possible to design new ensemble methods by using unlabeled data to help augment diversity, and Zhang and Zhou [2010] proposed the UDEED method along this
direction.
Let X = R^d denote the d-dimensional input space, and suppose we are given labeled training examples L = {(x_1, y_1), (x_2, y_2), . . . , (x_l, y_l)} and unlabeled instances U = {x_{l+1}, x_{l+2}, . . . , x_m}, where x_i ∈ X. For simplicity, consider a binary classification problem, that is, y_i ∈ {−1, +1}. Let L_u = {x_1, x_2, . . . , x_l} denote the unlabeled data set derived from L by neglecting the label information. Assume that the ensemble comprises T component classifiers {h_1, h_2, . . . , h_T}, each taking the form h_k : X → [−1, +1], and that the value of |h_k(x)| can be regarded as the confidence of x being positive or negative. As before, (h_k(x) + 1)/2 can be used as an estimate of the posterior probability P(y = +1 | x).
The basic idea of the UDEED method is to maximize the fit of the classifiers on the labeled data, while maximizing the diversity of the classifiers on the unlabeled data. Therefore, UDEED generates the ensemble
h = (h_1, h_2, · · ·, h_T) by minimizing the loss function

    V(h, L, D) = V_emp(h, L) + α · V_div(h, D),    (8.6)
where Vemp (h, L) corresponds to the empirical loss of h on L, Vdiv (h, D)
corresponds to the diversity loss of h on a data set D (e.g., D = U ) and α is a
parameter which trades off these two terms.
Indeed, (8.6) provides a general framework which can be realized
with different choices of the loss functions. In [Zhang and Zhou, 2010],
V_emp(h, L) and V_div(h, D) are realized by

    V_emp(h, L) = (1/T) Σ_{k=1}^{T} ℓ(h_k, L),    (8.7)

    V_div(h, D) = (2/(T(T − 1))) Σ_{p=1}^{T−1} Σ_{q=p+1}^{T} d(h_p, h_q, D),    (8.8)

respectively, where ℓ(h_k, L) measures the empirical loss of h_k on L, and

    d(h_p, h_q, D) = (1/|D|) Σ_{x∈D} h_p(x) h_q(x),    (8.9)

where d(h_p, h_q, D) represents the prediction difference between individual
classifiers h_p and h_q on D. Notice that the prediction difference is calculated
based on the real-valued output h(x) instead of the signed output sign(h(x)).
Thus, UDEED aims to find the target model h∗ that minimizes the loss
function (8.6), that is,
    h* = arg min_h V(h, L, D).    (8.10)
In [Zhang and Zhou, 2010], logistic regression learners were used as the
component learners, and the minimization of the loss function was realized
by gradient descent optimization. By studying the ensembles with D = U
and D = Lu in (8.10), respectively, it was reported that using unlabeled
data in the way of UDEED is quite helpful. Moreover, various ensemble diversity measures were evaluated in [Zhang and Zhou, 2010], and the results
verified that the use of unlabeled data in the way of UDEED significantly
augmented the ensemble diversity and improved the prediction accuracy.
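The loss (8.6)-(8.9) is simple to evaluate. The sketch below assumes real-valued classifier outputs in [−1, +1] and uses a squared loss as the empirical term, which is only one possible realization and not necessarily the exact choice of Zhang and Zhou [2010]; the gradient descent optimization of the classifiers themselves is omitted.

```python
# A minimal sketch of the UDEED loss (8.6)-(8.9): average empirical loss on the
# labeled data plus alpha times the average pairwise agreement on a data set D
# (e.g., D = U). The squared empirical loss here is an assumption for illustration.
import numpy as np

def udeed_loss(outputs_L, y, outputs_D, alpha=1.0):
    """outputs_L: (T, l) real-valued outputs of the T classifiers on L;
       y: (l,) labels in {-1, +1};
       outputs_D: (T, u) outputs on the diversity data set D."""
    T = outputs_L.shape[0]
    # V_emp: average empirical loss of the individual classifiers on L, as in (8.7)
    v_emp = np.mean([np.square(h - y).mean() for h in outputs_L])
    # V_div: average pairwise agreement d(h_p, h_q, D) over all pairs, (8.8)-(8.9)
    pair_sum = 0.0
    for p in range(T - 1):
        for q in range(p + 1, T):
            pair_sum += np.mean(outputs_D[p] * outputs_D[q])
    v_div = 2.0 * pair_sum / (T * (T - 1))
    return v_emp + alpha * v_div                     # (8.6)

# Toy usage with random outputs of T = 5 classifiers
rng = np.random.default_rng(0)
out_L = np.tanh(rng.normal(size=(5, 20)))
out_D = np.tanh(rng.normal(size=(5, 100)))
y = rng.choice([-1.0, 1.0], size=20)
print(udeed_loss(out_L, y, out_D, alpha=0.5))
```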
8.2 Active Learning
8.2.1 Usefulness of Human Intervention
Active learning deals with methods that assume that the learner has
some control over the data space, and the goal is to minimize the number
of queries on ground-truth labels from an oracle, usually a human expert, for
generating a good learner. In other words, in contrast to passive learning
where the learner passively waits for people to give labels to instances, an
active learner will actively select some instances to query for their labels,
and the central problem in active learning is to achieve a good learner by
using the smallest number of queries.
There are two kinds of active learning. In reservoir-based active learning,
the queries posed by the learner must be drawn from the observed unlabeled instances; while in synthetic active learning, the learner is permitted to synthesize new instances and pose them as queries. The observed
unlabeled data is helpful to disclose distribution information, like its role
in semi-supervised learning, while a synthesized instance might be an instance that does not really exist and the corresponding query might be difficult to answer. Here, we focus on reservoir-based active learning.
It is evident that both active learning and semi-supervised learning try to
exploit unlabeled data to improve learning performance, while the major
difference is that active learning involves human intervention. So, it is interesting to understand how useful human intervention could be. For this
purpose, we can study the sample complexity of active learning, that is,
how many queries are needed for obtaining a good learner.
Generally, the sample complexity of active learning can be studied in two
settings, that is, the realizable case and the non-realizable case. The former assumes that there exists a hypothesis perfectly separating the data in the hypothesis class, while the latter assumes that the data cannot be perfectly
separated by any hypothesis in the hypothesis class because of noise. It is
obvious that the latter case is more difficult yet more practical.
During the past decade, many theoretical bounds on the sample complexity of active learning have been proved. In the realizable case, for example, by assuming that the hypothesis class is linear separators through
the origin and that the data is distributed uniformly over the unit sphere
in R^d, it has been proved that the sample complexity of active learning
is Õ(log(1/ε)), where ε is the desired error bound achieved with confidence
(1 − δ) [Freund et al., 1997, Dasgupta, 2005, 2006, Dasgupta et al., 2005,
Balcan et al., 2007]. Here the Õ notation is used to hide logarithmic factors log log(1/ε), log(d) and log(1/δ). Notice that the assumed conditions can be
satisfied in many situations, and therefore, this theoretical bound implies
that in the realizable case, there are many situations where active learning can offer exponential improvement in sample complexity compared to
passive learning. In the non-realizable case, generally the result is not that
optimistic. However, recently Wang and Zhou [2010] proved that there are
some situations where active learning can offer exponential improvement
in the sample complexity compared to passive learning.
Overall, it is well recognized that by using human intervention, active
learning can offer significant advantages over passive learning.
8.2.2 Active Learning with Ensembles
One of the major active learning paradigms, query-by-committee, also
called committee-based sampling, is based on ensembles. This paradigm
was proposed by Seung et al. [1992] and then implemented by many researchers for different tasks, e.g., [Dagan and Engelson, 1995, Liere and
Tadepalli, 1997, McCallum and Nigam, 1998]. In this paradigm, multiple
learners are generated, and then the unlabeled instance on which the learners disagree the most is selected to query. For example, suppose there are
five learners, among which three learners predict positive and two learners predict negative for an instance xi , while four learners predict positive
and one learner predicts negative for the instance xj , then these learners
disagree more on xi than on xj , and therefore xi will be selected for query
rather than xj .
One of the key issues of query-by-committee is how to generate the multiple learners in the committee. Freund et al. [1997] proved that when the
Gibbs algorithm, a randomized learning algorithm which picks a hypothesis from a given hypothesis class according to the posterior distribution,
is used to generate the learners, query-by-committee can exponentially
improve the sample complexity compared to passive learning. The Gibbs
algorithm, however, is computationally intractable. Abe and Mamitsuka
[1998] showed that popular ensemble methods can be used to generate the
committee. As with other ensemble methods, the learners in the committee should be diverse. Abe and Mamitsuka [1998] developed the Query-by-Bagging and Query-by-Boosting methods. Query-by-Bagging employs Bagging
to generate the committee. In each round, it re-samples the labeled training data by bootstrap sampling and trains a learner on each sample; then,
the unlabeled instance on which the learners disagree the most is queried.
Query-by-Boosting uses AdaBoost to generate the committee. In each round,
it constructs a boosted ensemble; then, the unlabeled instance on which
the margin predicted by the boosted ensemble is the minimum is queried.
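A minimal sketch of the Query-by-Bagging selection step is given below, assuming scikit-learn decision trees as committee members and vote entropy as the disagreement measure; the names and the entropy-based criterion are illustrative choices, not prescribed by Abe and Mamitsuka [1998].

```python
# A minimal sketch of Query-by-Bagging: train a committee on bootstrap samples
# of the labeled data, then query the pool instance with the largest committee
# disagreement (measured here by vote entropy).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def query_by_bagging(X_lab, y_lab, X_pool, n_members=11, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X_lab), len(X_lab))      # bootstrap sample
        members.append(DecisionTreeClassifier(max_depth=3).fit(X_lab[idx], y_lab[idx]))
    votes = np.stack([m.predict(X_pool) for m in members])  # (n_members, n_pool)
    classes = np.unique(y_lab)
    # vote entropy per pool instance: higher means more disagreement
    probs = np.stack([(votes == c).mean(axis=0) for c in classes])
    entropy = -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=0)
    return int(np.argmax(entropy))                          # index to query

# Usage
rng = np.random.default_rng(1)
X_lab = rng.normal(size=(20, 2))
y_lab = (X_lab[:, 0] > 0).astype(int)
X_pool = rng.normal(size=(500, 2))
print("query instance:", query_by_bagging(X_lab, y_lab, X_pool, rng=rng))
```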
Notice that the use of the ensemble provides the feasibility of combining active learning with semi-supervised learning. With multiple learners,
given a set of unlabeled instances, in each round the unlabeled instance on
which the learners disagree the most can be selected to query, while some
other unlabeled instances can be exploited by the "majority teach minority" strategy as in semi-supervised parallel ensemble methods such as Tri-Training and Co-Forest (see Section 8.1.2). Zhou et al. [2006] proposed the
SSAIRA method based on such an idea, and applied this method to improve
the performance of relevance feedback in image retrieval. Later, Wang and
Zhou [2008] theoretically analyzed the sample complexity of the combination of active learning and semi-supervised learning, and proved that the
combination further improves the sample complexity compared to using
only semi-supervised learning or only active learning.
8.3 Cost-Sensitive Learning
8.3.1 Learning with Unequal Costs
Conventional learners generally try to minimize the number of mistakes
they will make in predicting unseen instances. This makes sense when the
costs of different types of mistakes are equal. In real-world tasks, however,
many problems have unequal costs. For example, in medical diagnosis, the
cost of mistakenly diagnosing a patient to be healthy may be far larger than
that of mistakenly diagnosing a healthy person as being sick, because the
former type of mistake may threaten a life. In such situations, minimizing
the number of mistakes may not provide the optimal decision because, for
example, three instances that each cost 20 dollars are less important than
one instance that costs 120 dollars. So, rather than minimizing the number
of mistakes, it is more meaningful to minimize the total cost. Accordingly,
the total cost, rather than accuracy and error rate, should be used for evaluating cost-sensitive learning performance.
Cost-sensitive learning deals with methods that work on unequal costs
in the learning process, where the goal is to minimize the total cost. A learning process may involve various costs such as the test cost, teacher cost, intervention cost, etc. [Turney, 2000], while the most often encountered one is
the misclassification cost.
Generally, there are two types of misclassification cost, that is, example-dependent cost and class-dependent cost. The former assumes that the costs
are associated with examples, that is, every example has its own misclassification cost; the latter assumes that the costs are associated with classes,
that is, every class has its own misclassification cost. Notice that, in most
real-world applications, it is feasible to ask a domain expert to specify the
cost of misclassifying a class as another class, yet only in some special situations is it convenient to get the cost for every training example. The following will focus on class-dependent costs and hereafter class-dependent
will not be mentioned explicitly.
The most popular cost-sensitive learning approach is Rescaling, which
tries to rebalance the classes such that the influence of each class in the
learning process is in proportion to its cost. For binary classification, suppose the cost of misclassifying the ith class to the jth class is costij , then the
optimal rescaling ratio of the ith class against the jth class is
    τ_ij = cost_ij / cost_ji,    (8.11)

which implies that after rescaling, the influence of the 1st class should be
cost_12/cost_21 times the influence of the 2nd class. Notice that (8.11) is
optimal for cost-sensitive learning, and it can be derived from the Bayes
risk theory as shown in [Elkan, 2001].
Rescaling is a general framework which can be implemented in different ways. For example, it can be implemented by re-weighting, i.e., assigning different weights to training examples of different classes, and
then passing the re-weighted training examples to any cost-blind learning algorithms that can handle weighted examples; or by re-sampling,
i.e., extracting a sample from the training data according to the proportion specified by (8.11), and then passing the re-sampled data to any cost-blind learning algorithms; or by threshold-moving, i.e., moving the decision threshold toward the cheaper class according to (8.11). In particular,
the threshold-moving strategy has been incorporated into many cost-blind
learning methods to generate their cost-sensitive variants. For example, for
decision trees, the tree splits can be selected based on a moved decision
threshold [Schiffers, 1997], and the tree pruning can be executed based
on a moved decision threshold [Drummond and Holte, 2000]; for neural
networks, the learning objective can be biased towards the high-cost class
based on a moved decision threshold [Kukar and Kononenko, 1998]; for
support vector machines, the corresponding optimization problem can be
written as [Lin et al., 2002]
    min_{w,b,ξ}  (1/2) ||w||²_H + C Σ_{i=1}^{m} cost(x_i) ξ_i    (8.12)

    s.t.  y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,
          ξ_i ≥ 0  (∀ i = 1, . . . , m),
where φ is the feature mapping induced by a kernel function and cost(x_i) is the
example-dependent cost for misclassifying xi . It is clear that the classification boundary is moved according to the rescaling ratio specified by the
cost terms.
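The sketch below illustrates two of these Rescaling implementations for the binary case, re-weighting and threshold-moving, assuming scikit-learn classifiers; cost_fn and cost_fp are illustrative names standing for cost_12 and cost_21 in (8.11).

```python
# A minimal sketch of Rescaling for binary cost-sensitive learning:
# (i) re-weighting the training examples by their class costs and passing the
# weights to a cost-blind learner, and (ii) threshold-moving on estimated posteriors.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_reweighted(X, y, cost_fn, cost_fp):
    """y in {+1, -1}; misclassifying a positive costs cost_fn, a negative cost_fp."""
    w = np.where(y == 1, cost_fn, cost_fp)     # influence proportional to cost
    return DecisionTreeClassifier(max_depth=4).fit(X, y, sample_weight=w)

def predict_threshold_moving(prob_pos, cost_fn, cost_fp):
    """Predict positive iff p * cost_fn >= (1 - p) * cost_fp,
       i.e. p >= cost_fp / (cost_fn + cost_fp)."""
    threshold = cost_fp / (cost_fn + cost_fp)
    return np.where(prob_pos >= threshold, 1, -1)

# Usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=200) > 0, 1, -1)
clf = fit_reweighted(X, y, cost_fn=5.0, cost_fp=1.0)
p = clf.predict_proba(X)[:, list(clf.classes_).index(1)]
print(predict_threshold_moving(p, 5.0, 1.0)[:10])
```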
8.3.2 Ensemble Methods for Cost-Sensitive Learning
Many ensemble methods for cost-sensitive learning have been developed. Representative ones mainly include MetaCost and Asymmetric Boosting.
MetaCost [Domingos, 1999]. This method constructs a decision tree ensemble by Bagging to estimate the posterior probability p(y|x). Then, it relabels each training example to the class with the minimum expected risk
according to the moved decision threshold. Finally, the relabeled data are
used to train a learner to minimize the error rate. The MetaCost algorithm is
summarized in Figure 8.2.
Notice that the probability estimates generated by different learning
methods are usually different, and therefore, it might be more reliable to
use the same learner in both steps of MetaCost. However, the probability estimates produced by classifiers are usually poor, since they are by-products
Input: Training data set D = {(x_1, y_1), (x_2, y_2), . . . , (x_m, y_m)};
       Base learning algorithm L;
       Cost matrix cost, where cost_ij is the cost of misclassifying examples of the ith class to the jth class;
       Number of subsamples in Bagging T_b;
       Number of examples in each subsample m;
       pb is True iff L produces class probabilities;
       all is True iff all subsamples are to be used for each example.
Process:
 1. for i = 1, . . . , T_b:
 2.   D_i is a subsample of D with m examples;
 3.   M_i = L(D_i);
 4. end
 5. for each example x in D:
 6.   for each class j:
 7.     p(j|x) = (1 / Σ_i 1) Σ_i p(j|x, M_i),
 8.     where
 9.       if pb then
10.         p(j|x, M_i) is produced by M_i;
11.       else
12.         p(j|x, M_i) = 1 for the class predicted by M_i for x, and 0 for all other classes;
13.       end
14.       if all then
15.         i ranges over all M_i's;
16.       else
17.         i ranges over all M_i's such that x ∉ D_i;
18.       end
19.   end
20.   Assign x's class label to be arg min_i Σ_j p(j|x) cost_ji;
21. end
22. Build a model M by applying L on data set D with new labels.
Output: M

FIGURE 8.2: The MetaCost algorithm.
of classification, and many classifiers do not provide such estimates. Therefore, Bagging is used in MetaCost, while other ensemble methods can also be
used to obtain the probability estimates.
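A compact sketch of the MetaCost idea is shown below; it leans on scikit-learn's BaggingClassifier for the probability estimates rather than reimplementing steps 1-19 of Figure 8.2, and it omits the out-of-bag option controlled by the `all` flag.

```python
# A minimal sketch of MetaCost: estimate p(j|x) with a Bagged tree ensemble,
# relabel each training example to the class with minimum expected cost, and
# train a single tree on the relabeled data.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def metacost(X, y, cost, n_estimators=10, random_state=0):
    """cost[i][j]: cost of misclassifying an example of class i as class j,
    with classes ordered as in np.unique(y)."""
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=n_estimators,
                            random_state=random_state).fit(X, y)
    P = bag.predict_proba(X)                   # p(j|x), columns follow bag.classes_
    cost = np.asarray(cost, dtype=float)
    # expected cost of assigning class i: sum_j p(j|x) * cost[j][i]   (step 20)
    expected = P @ cost
    y_new = bag.classes_[np.argmin(expected, axis=1)]
    return DecisionTreeClassifier().fit(X, y_new)   # step 22

# Usage with a toy 2-class cost matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0.8).astype(int)
model = metacost(X, y, cost=[[0.0, 1.0], [10.0, 0.0]])
```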
Asymmetric Boosting [Masnadi-Shirazi and Vasconcelos, 2007]. This
method directly modifies the AdaBoost algorithm such that the cost-sensitive solution is consistent with the Bayes optimal risk. Recalling the exponential loss function (2.1), the solution (2.4) and its property (2.5), Asymmetric
Input: Training data set D = {(x_1, y_1), (x_2, y_2), . . . , (x_m, y_m)};
       Base learning algorithm L;
       Cost of misclassifying positive/negative examples cost_+ / cost_−;
       Number of learning trials T;
       Number of iterations in gradient descent N_gd.
Process:
 1. I_+ = {i | y_i = +1} and I_− = {i | y_i = −1};
 2. Initialize weights as w_i = 1/(2|I_+|), ∀ i ∈ I_+ and w_i = 1/(2|I_−|), ∀ i ∈ I_−;
 3. for t = 1 to T:
 4.   k = 1;
 5.   Initialize β_k as a random number in [0, 1];
 6.   Using gradient descent, solve f_k from
        f_k(x) = arg min_f [ (e^{cost_+ β_k} − e^{−cost_+ β_k}) · b + e^{−cost_+ β_k} T_+
                           + (e^{cost_− β_k} − e^{−cost_− β_k}) · d + e^{−cost_− β_k} T_− ],
      where T_+ = Σ_{i∈I_+} w_t(i), T_− = Σ_{i∈I_−} w_t(i),
            b = Σ_{i∈I_+} w_t(i) I(y_i ≠ f_{k−1}(x_i)),
            d = Σ_{i∈I_−} w_t(i) I(y_i ≠ f_{k−1}(x_i));
 7.   for k = 2 to N_gd:
 8.     Solve β_k from
          2 · cost_+ · b · cosh(cost_+ β_k) + 2 · cost_− · d · cosh(cost_− β_k)
            − cost_+ T_+ e^{−cost_+ β_k} − cost_− T_− e^{−cost_− β_k} = 0;
 9.     Using gradient descent, solve f_k from
          f_k(x) = arg min_f [ (e^{cost_+ β_k} − e^{−cost_+ β_k}) · b + e^{−cost_+ β_k} T_+
                             + (e^{cost_− β_k} − e^{−cost_− β_k}) · d + e^{−cost_− β_k} T_− ];
10.   end
11.   Let (h_t, α_t) be the (f_k, β_k) with the smallest loss;
12.   Update weights as w_{t+1}(i) = w_t(i) e^{−cost_{y_i} α_t y_i h_t(x_i)};
13. end
Output: H(x) = sign(Σ_{t=1}^{T} α_t h_t(x)).

FIGURE 8.3: The Asymmetric Boosting algorithm.
Boosting directly minimizes the loss function
    ℓ_cost(h | D) = E_{x∼D} [ e^{−y h(x) cost(x)} ]    (8.13)
                  ≈ (1/m) Σ_{i=1}^{m} [ I(y_i = +1) e^{−y_i h(x_i) cost_+} + I(y_i = −1) e^{−y_i h(x_i) cost_−} ],
where yi ∈ {−1, +1} is the ground-truth label of xi , cost+ (cost− ) denotes
the cost of mistakenly classifying a positive (negative) example to the negative (positive) class, and h is the learner. The optimal solution minimizing
Table 8.1: Summary of major modifications on AdaBoost made by cost-sensitive Boosting methods.

AdaCost
  Weight update rule: w_{t+1}(i) = w_t(i) e^{−α_t y_i h_t(x_i) β_{δ_i}}, where β_{+1} = 0.5 − 0.5 cost_i and β_{−1} = 0.5 + 0.5 cost_i
  α: α = (1/2) ln((1 + e_t)/(1 − e_t)), where e_t = Σ_{i=1}^{m} w_t(i) y_i h_t(x_i) β_{δ_i}

CSB0
  Weight update rule: w_{t+1}(i) = c_{δ_i}(i) w_t(i), where c_{−1}(i) = cost_i and c_{+1}(i) = 1
  α: unchanged

CSB1
  Weight update rule: w_{t+1}(i) = c_{δ_i}(i) w_t(i) e^{−y_i h_t(x_i)}
  α: unchanged

CSB2
  Weight update rule: w_{t+1}(i) = c_{δ_i}(i) w_t(i) e^{−α_t y_i h_t(x_i)}
  α: unchanged

Asymmetric AdaBoost
  Weight update rule: w_{t+1}(i) = w_t(i) e^{−α_t y_i h_t(x_i)} e^{y_i log √K}, where K = cost_+ / cost_− is the cost ratio
  α: unchanged

AdaC1
  Weight update rule: w_{t+1}(i) = w_t(i) e^{−α_t y_i h_t(x_i) cost_i}
  α: α = (1/2) ln((1 + Σ_{i=1}^{m} cost_i w_t(i) δ_i) / (1 − Σ_{i=1}^{m} cost_i w_t(i) δ_i))

AdaC2
  Weight update rule: w_{t+1}(i) = cost_i w_t(i) e^{−α_t y_i h_t(x_i)}
  α: α = (1/2) ln(Σ_{y_i = h_t(x_i)} cost_i w_t(i) / Σ_{y_i ≠ h_t(x_i)} cost_i w_t(i))

AdaC3
  Weight update rule: w_{t+1}(i) = cost_i w_t(i) e^{−α_t y_i h_t(x_i) cost_i}
  α: α = (1/2) ln(Σ_{i=1}^{m} w_t(i)(cost_i + cost_i² δ_i) / Σ_{i=1}^{m} w_t(i)(cost_i − cost_i² δ_i))

‡ In the table, δ_i = +1 if h_t(x_i) = y_i and −1 otherwise; cost_i is the misclassification
cost of x_i; cost_+ (cost_−) denotes the cost of mistakenly classifying a positive (negative) example to the negative (positive) class. For clarity, in weight update rules, we
omit the normalization factor Z_t, which is used to make w_{t+1} a distribution.
the exponential loss ℓ_cost is

    h* = (1/(cost_+ + cost_−)) ln( p(y = +1|x) cost_+ / (p(y = −1|x) cost_−) ),    (8.14)

which is consistent with the Bayes optimal solution because

    sign(h*(x)) = arg max_{y∈{+1,−1}} p(y|x) cost(y).    (8.15)
Notice that it is difficult to minimize ℓ_cost directly by fitting an additive
model, and therefore, as is the general principle for minimizing a convex loss
with AdaBoost, Asymmetric Boosting uses gradient descent optimization instead. Figure 8.3 shows the Asymmetric Boosting algorithm.
There are a number of other cost-sensitive Boosting methods trying to
minimize the expected cost. Different from Asymmetric Boosting which is
derived directly from the Bayes risk theory, most of those cost-sensitive
Boosting methods use heuristics to achieve cost sensitivity, and therefore,
their solutions cannot be guaranteed to be consistent with the Bayes
optimal solution. Some of them change the weight update rule of AdaBoost
by increasing the weights of high-cost examples, such as CSB0, CSB1, CSB2
[Ting, 2000] and Asymmetric AdaBoost [Viola and Jones, 2002]. Some of them
change the weight update rule as well as α, the weight of base learners, by
associating a cost with the weighted error rate of each class, such as AdaC1,
AdaC2, AdaC3 [Sun et al., 2005] and AdaCost [Fan et al., 1999]. Table 8.1
summarizes the major modifications made by these methods on AdaBoost.
A thorough comparison of those methods is an important issue to be explored.
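As a concrete illustration of how such modifications look in code, the sketch below implements one AdaC2-style round following the weight update rule and α in Table 8.1; the function and variable names are illustrative and the sketch is not taken from Sun et al. [2005].

```python
# A minimal sketch of one AdaC2-style round: per-example costs enter the weight
# update multiplicatively, and alpha is computed from the cost-weighted correct
# and incorrect mass (delta_i = +1 iff the example is classified correctly).
import numpy as np

def adac2_round(w, cost, y, pred):
    """w: current weights; cost: per-example costs; y, pred in {-1, +1}."""
    correct = (y == pred)
    alpha = 0.5 * np.log(np.sum(cost[correct] * w[correct]) /
                         np.sum(cost[~correct] * w[~correct]))
    w_new = cost * w * np.exp(-alpha * y * pred)    # AdaC2 weight update rule
    w_new /= w_new.sum()                            # normalization factor Z_t
    return w_new, alpha

# Toy usage with a weak learner that errs on a few costly examples
rng = np.random.default_rng(0)
m = 10
y = rng.choice([-1.0, 1.0], size=m)
pred = y.copy()
pred[:3] *= -1                                      # three mistakes
cost = np.where(y == 1, 5.0, 1.0)                   # positives cost more
w = np.full(m, 1.0 / m)
w, alpha = adac2_round(w, cost, y, pred)
print(alpha, w.round(3))
```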
8.4 Class-Imbalance Learning
8.4.1 Learning with Class Imbalance
In many real-world tasks, e.g., fraud or failure detection, the data are usually imbalanced; that is, some classes have far more examples than other
classes. Consider binary classification for simplicity. The class with more
data is called the majority class and the other class is called the minority
class. The level of imbalance, i.e., the number of majority class examples divided by that of minority class examples, can be as large as 10^6 [Wu et al.,
2008].
It is often meaningless to achieve high accuracy when there is class imbalance, because the minority class would be dominated by the majority
class. For example, even when the level of imbalance is just 1,000, which is
very common in fraud detection tasks, a trivial solution which simply predicts all unseen instances to belong to the majority class will achieve an
accuracy of 99.9%; though the accuracy seems high, the solution is useless
since no fraud will be detected.
Notice that an imbalanced data set does not necessarily mean that the
learning task must suffer from class-imbalance. If the majority class is more
important than the minority class, it is not a problem for the majority class
to dominate the learning process. Only when the minority class is more important, or cannot be sacrificed, does the dominance of the majority class become a
disaster that calls for class-imbalance learning. In other words, there is
always an implicit assumption in class-imbalance learning that the minority class has higher cost than the majority class.
Cost-sensitive learning methods are often used in class-imbalance learning. In particular, the Rescaling approach can be adapted to class-imbalance
learning by replacing the right-hand side of (8.11) by the ratio of the size of
the jth class against that of the ith class; that is, to rebalance the classes
according to the level of imbalance, such that the influences of the minority class and the majority class become equal. Notice that, however, the
ground-truth level of imbalance is usually unknown, and rescaling according to the level of imbalance in training data does not always work well.
Notice that re-sampling strategies can be further categorized into under-sampling which decreases the majority class examples, and over-sampling
which increases the minority class examples. Either method can be implemented by random sampling with or without replacement. However, randomly duplicating the minority class examples may increase the risk of
overfitting, while randomly removing the majority class examples may lose
useful information. To relax those problems, many advanced re-sampling
methods have been developed.
To improve under-sampling, some methods selectively remove the majority class examples such that more informative examples are kept. For
example, the one-sided sampling method [Kubat and Matwin, 1997] tries to
find a consistent subset D′ of the original data D in the sense that the 1-NN
rule learned from D′ can correctly classify all examples in D. Initially, D′
contains all the minority class examples and one randomly selected majority class example. Then, a 1-NN classifier is constructed on D′ to classify
the examples in D. The misclassified majority class examples are added into D′.
After that, the Tomek Link [Tomek, 1976] is employed to remove borderline or noisy examples in the majority class in D′. Let d(x_i, x_j) denote the
distance between xi and xj . A pair of examples (xi , xj ) is called a Tomek
link if their class labels are different, and no example xk exists such that
d(xi , xk ) < d(xi , xj ) or d(xj , xk ) < d(xj , xi ). Examples participating in
Tomek links are usually either borderline or noisy.
To improve over-sampling, some methods use synthetic examples instead of exact copies to reduce the risk of overfitting. For example, SMOTE
[Chawla et al., 2002] generates synthetic examples by randomly interpolating between a minority class example and one of its neighbors from the
same class. Data cleaning techniques such as the Tomek link can be applied further to remove the possible noise introduced in the interpolation
process.
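A minimal sketch of the SMOTE-style interpolation step is given below, assuming the minority class examples are available as a NumPy array; the implementation details (neighbor count, random interpolation coefficient) follow the description above rather than any particular library.

```python
# A minimal sketch of SMOTE-style interpolation: each synthetic example lies on
# the line segment between a minority example and one of its k nearest
# minority-class neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_synthetic, k=5, rng=None):
    """X_min: minority class examples, shape (n, d). Returns synthetic examples."""
    if rng is None:
        rng = np.random.default_rng(0)
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1: each point is its own NN
    neighbors = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), n_synthetic)       # minority example to start from
    picked = neighbors[base, rng.integers(0, k, n_synthetic)]
    lam = rng.random((n_synthetic, 1))                    # interpolation coefficient
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Usage
rng = np.random.default_rng(1)
X_min = rng.normal(loc=2.0, size=(20, 2))
X_syn = smote(X_min, n_synthetic=50, rng=rng)
print(X_syn.shape)    # (50, 2)
```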
8.4.2 Performance Evaluation with Class Imbalance
Given data set D = {(x1 , y1 ), . . . , (xm , ym )}, for simplicity, consider binary classification where y ∈ {−1, +1}, and suppose the positive class has
m+ examples and negative class has m− examples, m+ + m− = m. Assume
that the positive class is the minority class, that is, m+ < m− . The confusion
matrix of a classifier h is in the form of:
                     Ground-truth "+"        Ground-truth "−"
 Predicted as "+"    TP (true positive)      FP (false positive)
 Predicted as "−"    FN (false negative)     TN (true negative)
    TP = Σ_{i=1}^{m} I(y_i = +1) I(h(x_i) = +1),
    FP = Σ_{i=1}^{m} I(y_i = −1) I(h(x_i) = +1),
    TN = Σ_{i=1}^{m} I(y_i = −1) I(h(x_i) = −1),
    FN = Σ_{i=1}^{m} I(y_i = +1) I(h(x_i) = −1),    (8.16)

where

    TP + FN = m_+,    (8.17)
    TN + FP = m_−,    (8.18)
    TP + FN + TN + FP = m.    (8.19)
With these variables, the accuracy and error rate can be written as

    acc = P(h(x) = y) = (TP + TN)/m,    (8.20)

    err = P(h(x) ≠ y) = (FP + FN)/m,    (8.21)

respectively, and

    acc + err = 1.    (8.22)
It is evident that accuracy and error rate are not adequate for evaluating class-imbalance learning performance, since more attention should be
paid to the minority class.
8.4.2.1 ROC Curve and AUC
ROC curve [Green and Swets, 1966, Spackman, 1989] can be used to evaluate learning performance under unknown class distributions or misclassification costs. “ROC” is the abbreviation for Receiver Operating Characteristic, which was originally used in radar signal detection in World War II. As
illustrated in Figure 8.4, the ROC curve plots how the true positive rate tpr
on the y-axis changes with the false positive rate fpr on the x-axis, where
    tpr = TP/(TP + FN) = TP/m_+,    (8.23)

    fpr = FP/(FP + TN) = FP/m_−.    (8.24)
FIGURE 8.4: Illustration of ROC curve (true positive rate against false positive rate). AUC is the area of the dark region.
A classifier corresponds to a point in the ROC space. If it classifies all examples as positive, tpr = 1 and f pr = 1; if it classifies all examples as negative, tpr = 0 and f pr = 0; if it classifies all examples correctly, tpr = 1 and
f pr = 0. When the tpr increases, the f pr will be unchanged or increase. If
two classifiers are compared, the one located to the upper left is better. A
functional hypothesis h : X × Y → R corresponds to a curve with (0, 0)
being the start point and (1, 1) being the end point, on which a series of
(f pr, tpr) points can be generated by applying different thresholds on the
outputs of h to separate different classes.
The AUC (Area Under ROC Curve) [Metz, 1978, Hanley and McNeil, 1983]
is defined as the area under the ROC curve, as shown in Figure 8.4. This
criterion integrates the performances of a classifier over all possible values
of f pr to represent the overall performance. The statistical interpretation of
AUC is the probability of the functional hypothesis h : X × Y → R assigning
a higher score to a positive example than to a negative example, i.e.,
    AUC(h) = P(h(x_+) > h(x_−)).    (8.25)
The normalized Wilcoxon-Mann-Whitney statistic gives the maximum likelihood estimate of the true AUC as [Yan et al., 2003]
    W = ( Σ_{x_+} Σ_{x_−} I(h(x_+) > h(x_−)) ) / (m_+ m_−).    (8.26)
Therefore, the AUC measures the ranking quality of h. Maximizing the AUC
is equivalent to maximizing the number of the pairs satisfying h(x+ ) >
h(x− ).
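The estimate (8.26) is straightforward to compute directly; the sketch below counts the positive-negative pairs ranked correctly and, for simplicity, ignores ties.

```python
# A minimal sketch of the Wilcoxon-Mann-Whitney estimate of AUC in (8.26):
# the fraction of positive-negative pairs on which the positive example
# receives the higher score.
import numpy as np

def auc_wmw(scores, y):
    """scores: real-valued outputs h(x); y: labels in {-1, +1}."""
    pos = scores[y == 1]
    neg = scores[y == -1]
    wins = (pos[:, None] > neg[None, :]).sum()     # compare every (x+, x-) pair
    return wins / (len(pos) * len(neg))

# Usage: a score that mostly ranks positives above negatives
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=200, p=[0.8, 0.2])
scores = y + rng.normal(scale=1.5, size=200)
print(round(auc_wmw(scores, y), 3))
```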
8.4.2.2 G-mean, F-measure and Precision-Recall Curve
G-mean, or Geometric mean, is the geometric mean of the accuracy of
each class, i.e.,
    G-mean = √( (TP/m_+) × (TN/m_−) ),    (8.27)
where the sizes of different classes have already been considered, and
therefore, it is a good candidate for evaluating class-imbalance learning
performance.
Precision measures how many examples classified as positive are really
positive, while recall measures how many positive examples are correctly
classified as positive. That is,
    Precision = TP/(TP + FP),    (8.28)

    Recall = TP/(TP + FN) = TP/m_+.    (8.29)
By these definitions, the precision does not contain any information about
F N , and the recall does not contain any information about F P . Therefore,
neither provides a complete evaluation of learning performance, while they
are complementary to each other.
Though a high precision and a high recall are both desired, it is often difficult to achieve the two goals together, since FP usually becomes
larger when T P increases. Being a tradeoff, the F-measure is defined as the
harmonic mean of precision and recall as [van Rijsbergen, 1979]
    F_α = ( α · (1/Recall) + (1 − α) · (1/Precision) )^{−1},    (8.30)
where α is a parameter to weight the relative importance of precision and
recall. By default, α is set to 0.5 to regard the precision and recall as equally
important.
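The sketch below computes G-mean (8.27) and F_α (8.30) from the confusion-matrix counts defined earlier; it assumes labels in {−1, +1} and does not guard against empty classes.

```python
# A minimal sketch computing G-mean and F_alpha from confusion-matrix counts.
import numpy as np

def gmean_fmeasure(y_true, y_pred, alpha=0.5):
    """y_true, y_pred in {-1, +1}; returns (G-mean, F_alpha)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    m_pos, m_neg = tp + fn, tn + fp
    gmean = np.sqrt((tp / m_pos) * (tn / m_neg))          # (8.27)
    precision = tp / (tp + fp)
    recall = tp / m_pos
    f_alpha = 1.0 / (alpha / recall + (1 - alpha) / precision)   # (8.30)
    return gmean, f_alpha

# Usage
y_true = np.array([1, 1, 1, -1, -1, -1, -1, -1])
y_pred = np.array([1, 1, -1, -1, -1, -1, 1, -1])
print(gmean_fmeasure(y_true, y_pred))
```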
To evaluate a learning method in various situations, e.g., with different
class distributions, a single pair of (Precision, Recall) or a single choice of
α for the F-measure is not enough. For this purpose, the Precision-Recall
(PR) curve can be used. It plots recall on the x-axis and precision on the
y-axis, as illustrated in Figure 8.5.
A classifier corresponds to a point in the PR space. If it classifies all examples correctly, Precision = 1 and Recall = 1; if it classifies all examples
as positive, Recall = 1 and Precision = m_+/m; if it classifies all examples as negative, Recall = 0 and Precision = 1. If two classifiers are compared, the one located on the upper right is better. A functional hypothesis
h : X ×Y → R corresponds to a curve on which a series of (Precision, Recall)
points can be generated by applying different thresholds on the outputs of
h to separate different classes. Discussion on the relationship between the
PR and ROC curves can be found in [Davis and Goadrich, 2006].
FIGURE 8.5: Illustration of PR curve (recall on the x-axis, precision on the y-axis).
8.4.3 Ensemble Methods for Class-Imbalance Learning
In class-imbalance learning, the ground-truth level of imbalance is usually unknown, and the ground-truth relative importance of the minority
class against the majority class is also usually unknown. There are many
potential variations, and therefore, it is not surprising that ensemble methods
have been applied to obtain a more effective and robust performance. This
section mainly introduces SMOTEBoost, EasyEnsemble and BalanceCascade.
SMOTEBoost [Chawla et al., 2003]. This method improves the oversampling method SMOTE [Chawla et al., 2002] by combining it with AdaBoost.M2. The basic idea is to let the base learners focus more and more
on difficult yet rare class examples. In each round, the weights for minority class examples are increased. The SMOTEBoost algorithm is shown in
Figure 8.6.
EasyEnsemble [Liu et al., 2009]. The motivation of this method is to keep
the high efficiency of under-sampling but reduce the possibility of ignoring potentially useful information contained in the majority class examples.
EasyEnsemble adopts a very simple strategy. It randomly generates multiple
subsamples {N1 , N2 , . . . , NT } from the majority class N . The size of each
sample is the same as that of the minority class P , i.e., |Ni | = |P |. Then,
the union of each pair of Ni and P is used to train an AdaBoost ensemble.
The final ensemble is formed by combining all the base learners in all the
AdaBoost ensembles. The EasyEnsemble algorithm is shown in Figure 8.7.
EasyEnsemble actually generates a Bagged ensemble whose base learners
are Boosted ensembles. Such a strategy of combining AdaBoost with Bagging has been adopted in MultiBoosting [Webb and Zheng, 2004], which effectively leverages the power of AdaBoost in reducing bias and the power of
Bagging in reducing variance.
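A minimal sketch of EasyEnsemble using scikit-learn's AdaBoostClassifier is given below; the final prediction pools the AdaBoost decision functions by averaging, which plays the role of combining all base learners as described above, and the class labeled +1 is assumed to be the minority.

```python
# A minimal sketch of EasyEnsemble: T balanced subsamples of the majority
# class, one AdaBoost ensemble per subsample, and a pooled final prediction.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def easy_ensemble(X, y, T=4, n_rounds=10, rng=None):
    """y in {-1, +1}; the positive class is assumed to be the minority."""
    if rng is None:
        rng = np.random.default_rng(0)
    pos, neg = np.where(y == 1)[0], np.where(y == -1)[0]
    ensembles = []
    for _ in range(T):
        sub = rng.choice(neg, size=len(pos), replace=False)    # |N_i| = |P|
        idx = np.concatenate([pos, sub])
        ensembles.append(AdaBoostClassifier(n_estimators=n_rounds).fit(X[idx], y[idx]))
    return ensembles

def easy_ensemble_predict(ensembles, X):
    score = np.mean([e.decision_function(X) for e in ensembles], axis=0)
    return np.where(score >= 0, 1, -1)

# Usage on an imbalanced toy problem
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = np.where(X[:, 0] > 1.5, 1, -1)              # roughly 7% positives
ens = easy_ensemble(X, y, rng=rng)
print((easy_ensemble_predict(ens, X) == y).mean())
```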
BalanceCascade [Liu et al., 2009]. This method tries to use guided deletion rather than random deletion of majority class examples. In contrast to
Input: Training data set D = {(x_1, y_1), (x_2, y_2), . . . , (x_m, y_m)};
       Minority class examples P ⊆ D;
       Base learning algorithm L;
       Number of synthetic examples to be generated S;
       Number of iterations T.
Process:
 1. B = {(i, y) : i = 1, . . . , m, y ≠ y_i};
 2. Initialize distribution as w_1(i, y) = 1/|B| for (i, y) ∈ B;
 3. for t = 1 to T:
 4.   Modify the distribution w_t by creating S synthetic examples from P using the SMOTE algorithm;
 5.   Train a weak learner by using L and w_t;
 6.   Compute weak hypothesis h_t : X × Y → [0, 1];
 7.   Compute the pseudo-loss of hypothesis h_t:
        e_t = Σ_{(i,y)∈B} w_t(i, y) (1 − h_t(x_i, y_i) + h_t(x_i, y));
 8.   α_t = ln((1 − e_t)/e_t);
 9.   d_t = (1/2)(1 − h_t(x_i, y) + h_t(x_i, y_i));
10.   Update the distribution w_{t+1}(i, y) = (1/Z_t) w_t(i, y) e^{−α_t d_t};
11. end
Output: H(x) = arg max_{y∈Y} Σ_{t=1}^{T} α_t h_t(x, y).

FIGURE 8.6: The SMOTEBoost algorithm.
EasyEnsemble which generates subsamples of the majority class in an unsupervised parallel manner, BalanceCascade works in a supervised sequential
manner. In the ith round, a subsample Ni is generated from the current majority class data set N , with sample size |Ni | = |P |. Then, an ensemble Hi
is trained from the union of Ni and P by AdaBoost. After that, the majority class examples that are correctly classified by Hi are removed from N .
The final ensemble is formed by combining all the base learners in all the
AdaBoost ensembles. The BalanceCascade algorithm is shown in Figure 8.8.
BalanceCascade actually works in a cascading style, which has been used
by Viola and Jones [2002] to improve the efficiency of face detection. Notice that both EasyEnsemble and BalanceCascade combine all base learners
instead of combining the outputs of the AdaBoost ensembles directly. This
strategy is adopted for exploiting the detailed information provided by the
base learners. Here, the base learners can actually be viewed as features exposing different aspects of the data.
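The sketch below illustrates a single BalanceCascade iteration under simplifying assumptions: scikit-learn's AdaBoostClassifier stands in for H_i, and the adjustment of the threshold θ_i to a fixed false positive rate f (as in Figure 8.8) is omitted, so only the guided removal of correctly classified majority examples is shown.

```python
# A minimal sketch of one BalanceCascade iteration: train AdaBoost on the
# minority class plus a balanced subsample of the current majority set, then
# drop the majority examples that the ensemble already classifies correctly.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def balance_cascade_step(X_min, X_maj, n_rounds=10, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    n_sub = min(len(X_min), len(X_maj))                 # |N_i| = |P| when possible
    sub = X_maj[rng.choice(len(X_maj), size=n_sub, replace=False)]
    X = np.vstack([X_min, sub])
    y = np.concatenate([np.ones(len(X_min)), -np.ones(len(sub))])
    H = AdaBoostClassifier(n_estimators=n_rounds).fit(X, y)
    keep = H.predict(X_maj) != -1.0                     # keep only the still-misclassified majority examples
    return H, X_maj[keep]

# Usage: two cascade steps shrink the majority pool
rng = np.random.default_rng(0)
X_min = rng.normal(loc=1.0, size=(40, 2))
X_maj = rng.normal(size=(400, 2))
H1, X_maj = balance_cascade_step(X_min, X_maj, rng=rng)
H2, X_maj = balance_cascade_step(X_min, X_maj, rng=rng)
print(len(X_maj))
```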
There are many other ensemble methods for improving over-sampling
and under-sampling. For example, the DataBoost-IM method [Guo and Viktor, 2004] identifies hard examples in each boosting round and creates
synthetic examples according to the level of imbalance of hard examples;
Chan and Stolfo [1998]’s method simply partitions the majority class into
Input: Training data set D = {(x_1, y_1), (x_2, y_2), . . . , (x_m, y_m)};
       Minority class examples P ⊆ D;
       Majority class examples N ⊆ D;
       Number of subsets T to sample from N;
       Number of iterations s_i to train an AdaBoost ensemble H_i.
Process:
 1. for i = 1 to T:
 2.   Randomly sample a subset N_i from N with |N_i| = |P|;
 3.   Use P and N_i to learn an AdaBoost ensemble H_i with s_i weak classifiers h_{i,j} and corresponding weights α_{i,j}:
        H_i(x) = sign( Σ_{j=1}^{s_i} α_{i,j} h_{i,j}(x) );
 4. end
Output: H(x) = sign( Σ_{i=1}^{T} Σ_{j=1}^{s_i} α_{i,j} h_{i,j}(x) ).

FIGURE 8.7: The EasyEnsemble algorithm.
Input: Training data set D = {(x_1, y_1), (x_2, y_2), . . . , (x_m, y_m)};
       Minority class examples P ⊆ D;
       Majority class examples N ⊆ D;
       Number of subsets T to sample from N;
       Number of iterations s_i to train an AdaBoost ensemble H_i.
Process:
 1. f ⇐ (|P|/|N|)^{1/(T−1)}; f is the false positive rate (the error rate of misclassifying a majority class example to the minority class) that each H_i should achieve;
 2. for i = 1 to T:
 3.   Randomly sample a subset N_i from N with |N_i| = |P|;
 4.   Use P and N_i to learn an AdaBoost ensemble H_i with s_i weak classifiers h_{i,j} and corresponding weights α_{i,j}, and adjust θ_i such that H_i's false positive rate is f:
        H_i(x) = sign( Σ_{j=1}^{s_i} α_{i,j} h_{i,j}(x) − θ_i );
 5.   Remove from N all examples that are correctly classified by H_i;
 6. end
Output: H(x) = sign( Σ_{i=1}^{T} Σ_{j=1}^{s_i} α_{i,j} h_{i,j}(x) − Σ_{i=1}^{T} θ_i ).

FIGURE 8.8: The BalanceCascade algorithm.
non-overlapping subsets with the size of the minority class and then trains
a base learner based on each pair of the subsets and the minority class.
There are also ensemble methods that combine over-sampling with under-sampling [Estabrooks et al., 2004] or even combine them with other strategies such as threshold-moving [Zhou and Liu, 2006]. A thorough comparison of those methods is an important issue to be explored. According to
incomplete comparisons available currently, the EasyEnsemble method is a
good choice in many situations.
8.5 Improving Comprehensibility
In many real-world tasks, in addition to attaining strong generalization
ability, the comprehensibility of the learned model is also important. It
is usually required that the learned model and its predictions are understandable and interpretable. Symbolic rules and decision trees are usually
deemed as comprehensible models. For example, every decision made by a
decision tree can be explained by the tree branches it goes through.
Lack of comprehensibility is an inherent deficiency of ensemble methods. Even
when comprehensible models such as decision trees are used as base learners, the ensemble still lacks comprehensibility, since it aggregates multiple
models. This section introduces some techniques for improving ensemble
comprehensibility.
8.5.1 Reduction of Ensemble to Single Model
Considering that the comprehensibility of an ensemble is lost mainly because it aggregates multiple models, one possible approach to improving
the comprehensibility is to reduce the ensemble to a single model.
CMM [Domingos, 1998]. This method uses the ensemble to label some artificially generated instances, and then applies the base learning algorithm,
which was used to train the base learners for the ensemble, on the artificial
data together with the original training data to generate a single learner.
By adding the artificial data, it is expected that the final single learner
can mimic the behavior of ensemble. Notice that the final single learner
is trained using the same base learning algorithm. Though this avoids the
conflict of different biases of different types of learners, the performance
of the final single learner has a high risk of overfitting because peculiar
patterns in the training data that affect the base learning algorithm can be
strengthened. Also, if the base learners of the ensemble are not comprehensible, CMM will not improve comprehensibility.
Archetype Selection [Ferri et al., 2002]. This method calculates the similarity between each base learner and the ensemble by comparing their predictions on an artificial data set, and then selects the single base learner that is
the most similar to the ensemble. Notice that, in many cases there may not
exist a base learner that is very close to the ensemble, or the base learners
themselves are not comprehensible. In these cases this method will fail.
NeC4.5 [Zhou and Jiang, 2004]. This method is similar to CMM in using the
ensemble to generate an artificial data set and then using the artificial data
set together with the original training set to generate a single learner. The
major difference is that in NeC4.5, the learning algorithm for the final single
learner is different from the base learning algorithm of the ensemble. This
difference reduces the risk of overfitting, and the final single learner may
even be more accurate than the ensemble itself. However, by using different
types of learners, the different biases need to be treated well. To obtain an
accurate and comprehensible model, NeC4.5 uses a neural network ensemble and a C4.5 decision tree, where the neural networks are targeted to accuracy while the decision tree is used for comprehensibility. It was derived
in [Zhou and Jiang, 2004] that when the original training data set does not
capture the whole distribution or contains noise, and the first-stage learner
(e.g., the ensemble) is more accurate than the second-stage learner (e.g., a
C4.5 decision tree trained from the original training data), the procedure of
NeC4.5 will be beneficial. Later, such a procedure of accomplishing two objectives in two stages with different types of learners was called twice learning
[Zhou, 2005].
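A minimal sketch of the twice-learning procedure is given below; a random forest stands in for the neural network ensemble of NeC4.5, and the artificial instances are generated by simply perturbing training examples, which is a simplification of the generation strategy in [Zhou and Jiang, 2004].

```python
# A minimal sketch of twice learning: an accurate but opaque first-stage
# ensemble labels artificial instances, and a single comprehensible decision
# tree is then trained on the original plus artificial data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def twice_learning(X, y, n_artificial=500, noise=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    ensemble = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    # generate artificial instances near the training data and label them
    # with the ensemble's predictions
    base = X[rng.integers(0, len(X), n_artificial)]
    X_art = base + noise * rng.normal(size=base.shape)
    y_art = ensemble.predict(X_art)
    X_all = np.vstack([X, X_art])
    y_all = np.concatenate([y, y_art])
    return DecisionTreeClassifier(max_depth=5).fit(X_all, y_all)

# Usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
tree = twice_learning(X, y, rng=rng)
print(tree.get_depth())
```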
ISM [Assche and Blockeel, 2007]. Considering that it is difficult to generate
artificial data in some domains such as those involving relational data, this
method tries to learn a single decision tree from a tree ensemble without
generating an artificial data set. The basic idea is to construct a single tree
where each split is decided by considering the utility of this split in similar paths of the trees in the ensemble. Roughly speaking, for each candidate
split of a node, a path of feature tests can be obtained from root to the node.
Then, similar paths in the trees of the ensemble will be identified, and the
utility of the split in each path can be calculated, e.g., according to information gain. The utility values obtained from all the similar paths are aggregated, and finally the candidate split with the largest aggregated utility
is selected for the current node.
8.5.2 Rule Extraction from Ensembles
Improving ensemble comprehensibility by rule extraction was inspired
by studies on rule extraction from neural networks [Andrews et al., 1995,
Tickle et al., 1998, Zhou, 2004], with the goal of using a set of symbolic if-then rules to represent the ensemble.
REFNE [Zhou et al., 2003]. This method uses the ensemble to generate an
artificial data set. Then, it tries to identify a feature-value pair such as “color
= red” which is able to make a correct prediction on some artificial examples. If there exists such a feature-value pair, a rule with one antecedent
Advanced Topics
181
is generated, e.g., if “color = red” then positive, and the artificial examples
that are classified correctly by the rule are removed. REFNE searches for
other one-antecedent rules on the remaining artificial data, and if there
is no more, it starts to search for two-antecedent rules such as if “color =
blue” and “shape = round” then positive; and so on. Numeric features are
discretized adaptively, and sanity checks based on statistical tests are executed before each rule is accepted. Notice that REFNE generates priority rules, also called decision list, which must be applied in the order that
the earlier generated, the earlier applied. This method suffers from low efficiency and does not work on large-scale data sets.
C4.5 Rule-PANE [Zhou and Jiang, 2003]. This method improves REFNE by
using C4.5 Rule [Quinlan, 1993] to replace the complicated rule generation
procedure in REFNE. Though it was named as C4.5 Rule Preceded by Artificial Neural Ensemble, similar to REFNE, this method can be applied to extract rules from any type of ensembles comprising any types of base learners.
Notice that, though in most cases the ensemble is more accurate than
the extracted rules, there are also cases where the extracted rules are even
more accurate than the ensemble. In such cases, there is a conflict between
attaining a high accuracy and high fidelity. If the goal is to explain the ensemble or mimic behaviors of the ensemble, then the higher accuracy of the
extracted rules has to be sacrificed. This is the fidelity-accuracy dilemma
[Zhou, 2004]. However, if the goal is to achieve an accurate and comprehensible model, then it is not needed to care about whether the behaviors
of the ensemble can be correctly mimicked; this recognition motivated the
twice learning paradigm.
8.5.3 Visualization of Ensembles
Visualization is an important approach to help people understand the
behaviors of learning methods. Obviously, one of the most straightforward
ways is to plot the decision boundary of an ensemble after each learning
round. In such a plot, the x-axis and y-axis correspond to any pair of features, while each point corresponds to an instance. For example, Figures 8.9
and 8.10 provide illustrations of the visualization results of Boosted decision stumps and Bagged decision stumps on the three-Gaussians data set,
respectively.
Notice that visualization with dimensions higher than three is quite difficult. For data with more than three dimensions, dimension reduction may
be needed for visualization. In practice, however, the intrinsic dimension
of the data is often much larger than two or three, hence visualization can
only be performed on some important feature combinations.
FIGURE 8.9: Boosted decision stumps on three-Gaussians after the 1st, 3rd, 7th, 11th, 15th and 20th rounds (panels (a)-(f)), where circles/stars denote positive/negative examples, and solid/empty mark correct/incorrect classified examples, respectively.
8.6 Future Directions of Ensembles
There are many interesting future research directions for ensembles.
Here, we highlight two directions, that is, understanding ensembles and ensembles in the internet world.
There are two important aspects of understanding ensembles. The first
aspect is on diversity. It is well accepted that understanding diversity is the
holy grail problem in ensemble research. Though some recent advances
have been attained [Brown, 2009, Zhou and Li, 2010b], we are still a long
way from a complete understanding of diversity. It is not yet known whether
diversity is really a driving force, or actually a trap, since it might be just another appearance of accuracy. Moreover, if it is really helpful, can we do
better through exploiting it explicitly, for example, by using it as a regularizer for optimization? Recently there is some exploration in this direction
[Yu et al., 2011].
The second important aspect is on the loss view of ensemble methods.
From the view of statistical learning, every learning method is optimizing some kind of loss function. The understanding of the learning method
can get help from the understanding of the properties of the loss function.
There are some studies on the loss behind AdaBoost [Demiriz et al., 2002,
Warmuth et al., 2008]. Though there might exist some deviations, they provided some insight about AdaBoost. Can we conjecture what are the loss
FIGURE 8.10: Bagged decision stumps on three-Gaussians after the 1st, 3rd, 7th, 11th, 15th and 20th rounds (panels (a)-(f)), where circles/stars denote positive/negative examples, and solid/empty mark correct/incorrect classified examples, respectively.
functions optimized by other ensemble methods such as Bagging, Random
Subspace, etc.? If we know them, can we have practically more effective and
efficient ways, rather than ensembles, to optimize them?
There are also two important aspects of ensembles in the internet world.
The first one is on-site ensemble, which tries to use ensemble methods as
tools for exploiting resources scattered over internet. For example, for most
tasks there are relevant and helpful data existing at different places of the
internet. Merging the data to a single site may suffer from unaffordable
communication costs, not to mention other issues such as data ownership
and privacy which prevent the data from being exposed to other sites. In
such scenarios, each site can maintain its local model. Once the task is being executed, the site can send out a service request, such as the request of
making a prediction on an instance, to other sites. The other sites, if they
accept the request, will send back only the predictions made by their local
models. Finally the initial site can aggregate the remote predictions with
its local prediction to get a better result. Li et al. [2010] reported a work in
this line of research. There are many issues to be studied, for example, how
to make the decision process robust against requests lost on internet, how
to identify trustworthy remote predictions, how to make a better combined
result based on remote predictions with a little additional information such
as simple statistics, how to prevent the initial site from sending so many requests that it can reconstruct the remote models with high confidence and
therefore lead to privacy breaches, etc.
The second important aspect is on reusable ensembles. This name was
inspired by software reuse. Reusable components are functional components of software that can be shared as plug-in packages. Ideally, when
a user wants to develop a new program, there is no need to start from
scratch; instead, the user can search for reusable components and put them
together, and only write those functional components that could not be
found. Similarly, in the internet world, many sites may like to share their
learning models. Thus, when a user wants to construct a learning system,
s/he can search for useful reusable models and put them together. For this
purpose, there are many issues to be studied, for example, how to establish
the specification of the functions and usage of reusable learning models,
how to match them with the user requirement, how to put together different learning models trained from different data, etc. Though at present this
is just a vision without the support of research results, it is well worth
pursuing since it will help learning methods become much easier for common people to use, rather than being like an art that can only be deployed
by researchers.
It is also worth highlighting that the utility of variance reduction of ensemble methods would enable them to be helpful to many modern applications, especially those involving dynamic data and environment. For example, for social network analysis, a more robust or trustworthy result may be
achieved by ensembling the results discovered from different data sources,
or even from multiple perturbations of the same data source.
8.7 Further Readings
Many semi-supervised learning approaches have been developed during the past decade. Roughly speaking, they can be categorized into
four classes, i.e., generative approaches, S3VMs (Semi-Supervised Support
Vector Machines), graph-based approaches and disagreement-based approaches. Generative approaches use a generative model and typically employ EM to model the label estimation and parameter estimation process
[Miller and Uyar, 1997, Nigam et al., 2000]. S3VMs use unlabeled data to adjust the SVM decision boundary learned from labeled examples such that
it goes through the less dense region while keeping the labeled data correctly classified [Joachims, 1999, Chapelle and Zien, 2005]. Graph-based
approaches define a graph on the training data and then enforce the label smoothness over the graph as a regularization term [Zhu et al., 2003,
Zhou et al., 2004]. Disagreement-based approaches generate more than
one learner which collaborate to exploit unlabeled instances, and a large
disagreement between the learners is maintained to enable the learning
procedure to continue [Blum and Mitchell, 1998, Zhou and Li, 2007]. More
introduction on semi-supervised learning can be found in [Chapelle et al.,
2006, Zhu, 2006, Zhou and Li, 2010a].
Both ensemble and semi-supervised learning try to achieve strong generalization ability, however, they have almost developed separately and only a
few studies have tried to leverage their advantages. This phenomenon was
attributed by Zhou [2009] to the fact that the two communities have different philosophies. Zhou [2011] discussed the reasons for combining both.
In addition to query-by-committee, uncertainty sampling is another active learning paradigm which tries to query the most informative unlabeled
instance. In uncertainty sampling, a single learner is trained and then, the
unlabeled instance on which the learner is the least confident is selected
to query [Lewis and Gale, 1994, Tong and Koller, 2000]. There is another
school of active learning that tries to query the most representative unlabeled instance, usually by exploiting the cluster structure of data [Nguyen
and Smeulders, 2004, Dasgupta and Hsu, 2008]. Recently there are some
studies on querying informative and representative unlabeled instances
[Huang et al., 2010]. Most active learning algorithms focus on querying one
instance in each round, while batch mode active learning extends the classical setup by selecting multiple unlabeled instances to query in each trial
[Guo and Schuurmans, 2008, Hoi et al., 2009]. More introduction on active
learning can be found in [Settles, 2009].
Though Rescaling is effective for two-class cost-sensitive learning, its direct extension to multi-class tasks does not work well [Ting, 2002]. Zhou and Liu [2010] disclosed that Rescaling can be applied directly to multi-class tasks only when the cost coefficients are consistent; otherwise the problem should be decomposed into a series of two-class problems to which Rescaling is then applied. In addition to class-dependent costs, there are many studies on example-dependent costs [Zadrozny and Elkan, 2001a, Zadrozny et al., 2003, Brefeld et al., 2003], where some representative methods, e.g., the costing method [Zadrozny et al., 2003], are also ensemble methods. It is worth noting that traditional studies generally assume that precise cost values are given in advance, while a recent work handles imprecise cost information given in the form of cost intervals [Liu and Zhou, 2010].
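For the two-class case, one common way to realize the Rescaling idea is to move the decision threshold according to the cost ratio, as in the minimal sketch below; cost_fn and cost_fp are illustrative cost parameters, and prob_positive is the classifier's estimated probability of the positive class.

def rescaled_predict(prob_positive, cost_fn, cost_fp):
    # Predict positive iff the expected cost of predicting negative is larger:
    # prob_positive * cost_fn > (1 - prob_positive) * cost_fp.
    threshold = cost_fp / (cost_fp + cost_fn)
    return 1 if prob_positive > threshold else 0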
The ROC curve and AUC can be used to study class-imbalance learning as well as cost-sensitive learning, and can be extended to multi-class cases [Hand and Till, 2001, Fawcett, 2006]. The cost curve [Drummond and Holte, 2006] is equivalent to the ROC curve but makes it easier to visualize cost-sensitive learning performance. It is noteworthy that some recent studies disclose that AUC has significant problems when used for model selection [Lobo et al., 2008, Hand, 2009].
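As a reminder of what AUC measures, the sketch below computes it as the probability that a randomly drawn positive instance is scored above a randomly drawn negative one, the rank-statistic view that Hand and Till [2001] generalize to multiple classes. The inputs scores and labels (with labels in {0, 1} and both classes present) are assumptions for the example.

import numpy as np

def auc(scores, labels):
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # count correctly ranked positive-negative pairs; ties count one half
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))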
Some learning methods were developed by directly optimizing a criterion that takes the unequal sizes of the different classes into account. For example, Brefeld and Scheffer [2005] proposed an SVM method that maximizes AUC, while Joachims [2005] proposed an SVM method that optimizes the F-measure. These methods can also be used for class-imbalance learning.
There are also some class-imbalance learning methods designed based on one-class learning or anomaly detection. More detailed introductions to class-imbalance learning can be found in [Chawla, 2006, He and Garcia, 2009]. Notice that, though class imbalance generally occurs together with unequal costs, most studies do not consider them jointly, and even for the well-studied Rescaling approach it is not yet known how best to handle such scenarios [Liu and Zhou, 2006].
Frank and Hall [2003] presented a method to provide a two-dimensional
visualization of class probability estimates. Though this method was not
specially designed for ensemble methods, there is no doubt that it can be
applied to ensembles.
This book does not attempt to cover all topics relevant to ensemble methods. For example, stochastic discrimination [Kleinberg, 2000], which works by sampling from the space of all subsets of the underlying feature space, and multiple kernel learning [Bach et al., 2004], which can be viewed as an ensemble of kernels, are not included in this version. The same applies to stability selection [Meinshausen and Bühlmann, 2010], a recent advance in model selection for LASSO [Tibshirani, 1996a], which can be viewed as a Bagging-style, ensemble-based feature ranking method.
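To see the Bagging flavor of stability selection, the sketch below repeatedly fits a LASSO on random half-samples and ranks features by how often they receive nonzero coefficients. It is illustrative only and omits the error-control analysis of Meinshausen and Bühlmann [2010]; alpha and n_rounds are assumed parameters.

import numpy as np
from sklearn.linear_model import Lasso

def selection_frequencies(X, y, alpha=0.1, n_rounds=100, seed=None):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    counts = np.zeros(d)
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n // 2, replace=False)    # random half-sample
        coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
        counts += (coef != 0)                              # features selected this round
    return counts / n_rounds         # selection frequency serves as a ranking score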
MCS’2010, the 10th MCS Workshop, held a panel¹ on reviewing the past and foreseeing the future of ensemble research. The content of Section 8.6 was presented at that panel by the author of this book.

¹ http://www.diee.unica.it/mcs/mcs2010/panel%20discussion.html
References
N. Abe and H. Mamitsuka. Query learning strategies using Boosting and
Bagging. In Proceedings of the 15th International Conference on Machine
Learning, pages 1–9, Madison, WI, 1998.
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD International Conference on Management
of Data, pages 94–105, Seattle, WA, 1998.
M. R. Ahmadzadeh and M. Petrou. Use of Dempster-Shafer theory to combine classifiers which use different class boundaries. Pattern Analysis and
Application, 6(1):41–46, 2003.
A. Al-Ani and M. Deriche. A new technique for combining multiple classifiers using the Dempster-Shafer theory of evidence. Journal of Artificial
Intelligence Research, 17(1):333–361, 2002.
K. M. Ali and M. J. Pazzani. Error reduction through learning multiple descriptions. Machine Learning, 24(3):173–202, 1996.
E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary:
A unifying approach for margin classifiers. Journal of Machine Learning
Research, 1:113–141, 2000.
E. Alpaydin. Introduction to Machine Learning. MIT Press, Cambridge, MA,
2nd edition, 2010.
M. R. Anderberg. Cluster Analysis for Applications. Academic, New York,
NY, 1973.
R. Andrews, J. Diederich, and A. B. Tickle. Survey and critique of techniques
for extracting rules from trained artificial neural networks. Knowledge-Based Systems, 8(6):373–389, 1995.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering
points to identify the clustering structure. In Proceedings of the ACM
SIGMOD International Conference on Management of Data, pages 49–60,
Philadelphia, PA, 1999.
M. Anthony and N. Biggs. Computational Learning Theory. Cambridge
University Press, Cambridge, UK, 1992.
M. B. Araújo and M. New. Ensemble forecasting of species distributions.
Trends in Ecology & Evolution, 22(1):42–47, 2007.
J. A. Aslam and S. E. Decatur. General bounds on statistical query learning
and PAC learning with noise via hypothesis boosting. In Proceedings of
the 35th IEEE Annual Symposium on Foundations of Computer Science,
pages 282–291, Palo Alto, CA, 1993.
E. Asmis. Epicurus’ Scientific Method. Cornell University Press, Ithaca, NY,
1984.
A. V. Assche and H. Blockeel. Seeing the forest through the trees: Learning a
comprehensible model from an ensemble. In Proceedings of the 18th European Conference on Machine Learning, pages 418–429, Warsaw, Poland,
2007.
S. Avidan. Ensemble tracking. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 29(2):261–271, 2007.
R. Avogadri and G. Valentini. Fuzzy ensemble clustering based on random
projections for DNA microarray data analysis. Artificial Intelligence in
Medicine, 45(2-3):173–183, 2009.
H. Ayad and M. Kamel. Finding natural clusters using multi-clusterer combiner based on shared nearest neighbors. In Proceedings of the 4th International Workshop on Multiple Classifier Systems, pages 166–175, Surrey,
UK, 2003.
F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning,
conic duality, and the SMO algorithm. In Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.
B. Bakker and T. Heskes. Clustering ensembles of neural network models.
Neural Networks, 16(2):261–269, 2003.
M.-F. Balcan, A. Z. Broder, and T. Zhang. Margin based active learning. In
Proceedings of the 20th Annual Conference on Learning Theory, pages 35–
50, San Diego, CA, 2007.
R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. Ensemble
diversity measures and their application to thinning. Information Fusion,
6(1):49–62, 2005.
E. Bauer and R. Kohavi. An empirical comparison of voting classification
algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):
105–139, 1999.
K. Bennett, A. Demiriz, and R. Maclin. Exploiting unlabeled data in ensemble methods. In Proceedings of the 8th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 289–296, Edmonton, Canada, 2002.
J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl. Aggregate features
and AdaBoost for music classification. Machine Learning, 65(2-3):473–
484, 2006.
Y. Bi, J. Guan, and D. Bell. The combination of multiple classifiers using an
evidential reasoning approach. Artificial Intelligence, 172(15):1731–1751,
2008.
P. J. Bickel, Y. Ritov, and A. Zakai. Some theory for generalized boosting
algorithms. Journal of Machine Learning Research, 7:705–732, 2006.
J. A. Bilmes. A gentle tutorial of the EM algorithm and its applications to
parameter estimation for Gaussian mixture and hidden Markov models.
Technical Report TR-97-021, Department of Electrical Engineering and
Computer Science, University of California, Berkeley, CA, 1998.
C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University
Press, New York, NY, 1995.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New
York, NY, 2006.
C. M. Bishop and M. Svensén. Bayesian hierarchical mixtures of experts. In
Proceedings of the 19th Conference in Uncertainty in Artificial Intelligence,
pages 57–64, Acapulco, Mexico, 2003.
A. Blum and T. Mitchell. Combining labeled and unlabeled data with cotraining. In Proceedings of the 11th Annual Conference on Computational
Learning Theory, pages 92–100, Madison, WI, 1998.
J. K. Bradley and R. E. Schapire. FilterBoost: Regression and classification
on large datasets. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 185–
192. MIT Press, Cambridge, MA, 2008.
U. Brefeld and T. Scheffer. AUC maximizing support vector learning. In Proceedings of the ICML 2005 Workshop on ROC Analysis in Machine Learning, Bonn, Germany, 2005.
U. Brefeld, P. Geibel, and F. Wysotzki. Support vector machines with example dependent costs. In Proceedings of the 14th European Conference on
Machine Learning, pages 23–34, Cavtat-Dubrovnik, Croatia, 2003.
L. Breiman. Bias, variance, and arcing classifiers. Technical Report 460,
Statistics Department, University of California, Berkeley, CA, 1996a.
L. Breiman. Stacked regressions. Machine Learning, 24(1):49–64, 1996b.
L. Breiman. Out-of-bag estimation. Technical report, Department of Statistics, University of California, 1996c.
L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996d.
L. Breiman. Prediction games and arcing algorithms. Neural Computation,
11(7):1493–1517, 1999.
L. Breiman. Randomizing outputs to increase prediction accuracy. Machine
Learning, 40(3):113–120, 2000.
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
L. Breiman. Population theory for boosting ensembles. Annals of Statistics,
32(1):1–11, 2004.
L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and
Regression Trees. Chapman and Hall/CRC, Boca Raton, FL, 1984.
G. Brown. An information theoretic perspective on multiple classifier systems. In Proceedings of the 8th International Workshop on Multiple Classifier Systems, pages 344–353, Reykjavik, Iceland, 2009.
G. Brown. Some thoughts at the interface of ensemble methods and feature selection. Keynote at the 9th International Workshop on Multiple
Classifier Systems, Cairo, Egypt, 2010.
G. Brown, J. L. Wyatt, R. Harris, and X. Yao. Diversity creation methods: A
survey and categorisation. Information Fusion, 6(1):5–20, 2005a.
G. Brown, J. L. Wyatt, and P. Tino. Managing diversity in regression ensembles. Journal of Machine Learning Research, 6:1621–1650, 2005b.
P. Bühlmann and B. Yu. Analyzing bagging. Annals of Statistics, 30(4):927–
961, 2002.
P. Bühlmann and B. Yu. Boosting with the l2 loss: Regression and classification. Journal of the American Statistical Association, 98(462):324–339,
2003.
A. Buja and W. Stuetzle. The effect of bagging on variance, bias, and mean
squared error. Technical report, AT&T Labs-Research, 2000a.
A. Buja and W. Stuetzle. Smoothing effects of bagging. Technical report,
AT&T Labs-Research, 2000b.
A. Buja and W. Stuetzle. Observations on bagging. Statistica Sinica, 16(2):
323–351, 2006.
R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes. Ensemble selection
from libraries of models. In Proceedings of the 21st International Conference on Machine Learning, pages 18–23, Banff, Canada, 2004.
P. D. Castro, G. P. Coelho, M. F. Caetano, and F. J. V. Zuben. Designing ensembles of fuzzy classification systems: An immune-inspired approach.
In Proceedings of the 4th International Conference on Artificial Immune
Systems, pages 469–482, Banff, Canada, 2005.
P. Chan and S. Stolfo. Toward scalable learning with non-uniform class and
cost distributions: A case study in credit card fraud detection. In Proceeding of the 4th International Conference on Knowledge Discovery and Data
Mining, pages 164–168, New York, NY, 1998.
P. K. Chan, W. Fan, A. L. Prodromidis, and S. J. Stolfo. Distributed data mining in credit card fraud detection. IEEE Intelligent Systems, 14(6):67–74,
1999.
V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM
Computing Surveys, 41(3):1–58, 2009.
O. Chapelle and A. Zien. Semi-supervised learning by low density separation. In Proceedings of the 10th International Workshop on Artificial
Intelligence and Statistics, pages 57–64, Barbados, 2005.
O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning.
MIT Press, Cambridge, MA, 2006.
N. V. Chawla. Data mining for imbalanced datasets: An overview. In O. Maimon and L. Rokach, editors, The Data Mining and Knowledge Discovery
Handbook, pages 853–867. Springer, New York, NY, 2006.
N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority oversampling technique. Journal of Artificial Intelligence
Research, 16:321–357, 2002.
N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer. SMOTEBoost: Improving prediction of the minority class in boosting. In Proceedings of the
7th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 107–119, Cavtat-Dubrovnik, Croatia, 2003.
H. Chen, P. Tiño, and X. Yao. A probabilistic ensemble pruning algorithm.
In Working Notes of ICDM’06 Workshop on Optimization-Based Data
Mining Techniques with Applications, pages 878–882, Hong Kong, China,
2006.
H. Chen, P. Tiňo, and X. Yao. Predictive ensemble pruning by expectation
propagation. IEEE Transactions on Knowledge and Data Engineering, 21
(7):999–1013, 2009.
B. Clarke. Comparing Bayes model averaging and stacking when model
approximation error cannot be ignored. Journal of Machine Learning Research, 4:683–712, 2003.
A. L. V. Coelho, C. A. M. Lima, and F. J. V. Zuben. GA-based selection of
components for heterogeneous ensembles of support vector machines.
In Proceedings of the Congress on Evolutionary Computation, pages 2238–
2244, Canberra, Australia, 2003.
J. Cohen. A coefficient of agreement for nominal scales. Educational and
Psychological Measurement, 20(1):37–46, 1960.
I. Corona, G. Giacinto, C. Mazzariello, F. Roli, and C. Sansone. Information
fusion for computer security: State of the art and open issues. Information Fusion, 10(4):274–284, 2009.
T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE
Transactions on Information Theory, 13(1):21–27, 1967.
T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, New
York, NY, 1991.
M. Coyle and B. Smyth. On the use of selective ensembles for relevance classification in case-based web search. In Proceedings of the 8th European
Conference on Case-Based Reasoning, pages 370–384, Fethiye, Turkey,
2006.
K. Crammer and Y. Singer. On the learnability and design of output codes
for multiclass problems. Machine Learning, 47(2-3):201–233, 2002.
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University
Press, Cambridge, UK, 2000.
P. Cunningham and J. Carney. Diversity versus quality in classification ensembles based on feature selection. Technical Report TCD-CS-2000-02,
Department of Computer Science, Trinity College Dublin, 2000.
A. Cutler and G. Zhao. PERT - perfect random tree ensembles. In Proceedings of the 33rd Symposium on the Interface of Computing Science and
Statistics, pages 490–497, Costa Mesa, CA, 2001.
I. Dagan and S. P. Engelson. Committee-based sampling for training probabilistic classifiers. In Proceedings of the 12th International Conference on
Machine Learning, pages 150–157, San Francisco, CA, 1995.
F. d’Alché-Buc, Y. Grandvalet, and C. Ambroise. Semi-supervised MarginBoost. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances
in Neural Information Processing Systems 14, pages 553–560. MIT Press,
Cambridge, MA, 2002.
B. V. Dasarathy, editor. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, CA, 1991.
S. Dasgupta. Analysis of a greedy active learning strategy. In L. Saul, Y. Weiss,
and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 337–344. MIT Press, Cambridge, MA, 2005.
S. Dasgupta. Coarse sample complexity bounds for active learning. In
Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 235–242. MIT Press, Cambridge, MA,
2006.
S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In Proceedings of the 25th International Conference on Machine Learning, pages
208–215, Helsinki, Finland, 2008.
S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based
active learning. In Proceedings of the 18th Annual Conference on Learning
Theory, pages 249–263, Bertinoro, Italy, 2005.
J. Davis and M. Goadrich. The relationship between precision-recall and
ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pages 233–240, Pittsburgh, PA, 2006.
W. H. E. Day and H. Edelsbrunner. Efficient algorithms for agglomerative
hierarchical clustering methods. Journal of Classification, 1:7–24, 1984.
N. C. de Condorcet. Essai sur l’Application de l’Analyse à la Probabilité
des Décisions Rendues à la Pluralité des Voix. Imprimerie Royale, Paris,
France, 1785.
A. Demiriz, K. P. Bennett, and J. Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1-3):225–254, 2002.
A. P. Dempster. Upper and lower probabilities induced by a multivalued
mapping. Annals of Mathematical Statistics, 38(2):325–339, 1967.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical
Society, Series B, 39(1):1–38, 1977.
J. Demšar. Statistical comparisons of classifiers over multiple data sets.
Journal of Machine Learning Research, 7:1–30, 2006.
L. Didaci, G. Giacinto, F. Roli, and G. L. Marcialis. A study on the performances of dynamic classifier selection based on local accuracy estimation. Pattern Recognition, 38(11):2188–2191, 2005.
T. G. Dietterich. Approximate statistical tests for comparing supervised
classification learning algorithms. Neural Computation, 10(7):1895–
1923, 1998.
T. G. Dietterich. Ensemble methods in machine learning. In Proceedings of
the 1st International Workshop on Multiple Classifier Systems, pages 1–15,
Sardinia, Italy, 2000a.
T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000b.
T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via
error-correcting output codes. Journal of Artificial Intelligence Research,
2:263–286, 1995.
T. G. Dietterich, G. Hao, and A. Ashenfelter. Gradient tree boosting for training conditional random fields. Journal of Machine Learning Research, 9:
2113–2139, 2008.
C. Domingo and O. Watanabe. Madaboost: A modification of AdaBoost. In
Proceedings of the 13th Annual Conference on Computational Learning
Theory, pages 180–189, Palo Alto, CA, 2000.
P. Domingos. Knowledge discovery via multiple models. Intelligent Data
Analysis, 2(1-4):187–202, 1998.
P. Domingos. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 155–164, San Diego,
CA, 1999.
P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103–137, 1997.
C. Drummond and R. C. Holte. Exploiting the cost (in)sensitivity of decision tree splitting criteria. In Proceedings of the 17th International Conference on Machine Learning, pages 239–246, San Francisco, CA, 2000.
C. Drummond and R. C. Holte. Cost curves: An improved method for visualizing classifier performance. Machine Learning, 65(1):95–130, 2006.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New
York, NY, 2nd edition, 2000.
B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman &
Hall, New York, NY, 1993.
C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the
17th International Joint Conference on Artificial Intelligence, pages 973–
978, Seattle, WA, 2001.
S. Escalera, O. Pujol, and P. Radeva. Boosted landmarks of contextual descriptors and Forest-ECOC: A novel framework to detect and classify objects in clutter scenes. Pattern Recognition Letters, 28(13):1759–1768,
2007.
S. Escalera, O. Pujol, and P. Radeva. Error-correcting output codes library.
Journal of Machine Learning Research, 11:661–664, 2010a.
S. Escalera, O. Pujol, and P. Radeva. On the decoding process in ternary
error-correcting output codes. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 32(1):120–134, 2010b.
A. Estabrooks, T. Jo, and N. Japkowicz. A multiple resampling method for
learning from imbalanced data sets. Computational Intelligence, 20(1):
18–36, 2004.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm
for discovering clusters in large spatial databases. In Proceedings of the
2nd International Conference on Knowledge Discovery and Data Mining,
pages 226–231, Portland, OR, 1996.
V. Estivill-Castro. Why so many clustering algorithms - A position paper.
SIGKDD Explorations, 4(1):65–75, 2002.
W. Fan. On the optimality of probability estimation by random decision
trees. In Proceedings of the 19th National Conference on Artificial Intelligence, pages 336–341, San Jose, CA, 2004.
W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan. AdaCost: Misclassification costsensitive boosting. In Proceedings of the 16th International Conference on
Machine Learning, pages 97–105, Bled, Slovenia, 1999.
W. Fan, F. Chu, H. Wang, and P. S. Yu. Pruning and dynamic scheduling of
cost-sensitive ensembles. In Proceedings of the 18th National Conference
on Artificial Intelligence, pages 146–151, Edmonton, Canada, 2002.
W. Fan, H. Wang, P. S. Yu, and S. Ma. Is random model better? On its accuracy and efficiency. In Proceedings of the 3rd IEEE International Conference on Data Mining, pages 51–58, Melbourne, FL, 2003.
R. Fano. Transmission of Information: Statistical Theory of Communications. MIT Press, Cambridge, MA, 1961.
T. Fawcett. ROC graphs with instance varying costs. Pattern Recognition
Letters, 27(8):882–891, 2006.
X. Z. Fern and C. E. Brodley. Random projection for high dimensional data
clustering: A cluster ensemble approach. In Proceedings of the 20th International Conference on Machine Learning, pages 186–193, Washington,
DC, 2003.
X. Z. Fern and C. E. Brodley. Solving cluster ensemble problems by bipartite
graph partitioning. In Proceedings of the 21st International Conference on
Machine Learning, Banff, Canada, 2004.
X. Z. Fern and W. Lin. Cluster ensemble selection. In Proceedings of the 8th
SIAM International Conference on Data Mining, pages 787–797, Atlanta,
GA, 2008.
C. Ferri, J. Hernández-Orallo, and M. J. Ramı́rez-Quintana. From ensemble
methods to comprehensible models. In Proceedings of the 5th International Conference on Discovery Science, pages 165–177, Lübeck, Germany,
2002.
D. Fisher. Improving inference through conceptual clustering. In Proceedings of the 6th National Conference on Artificial Intelligence, pages 461–
465, Seattle, WA, 1987.
J. L. Fleiss. Statistical Methods for Rates and Proportions. John Wiley & Sons,
New York, NY, 2nd edition, 1981.
E. Frank and M. Hall. Visualizing class probability estimators. In Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 168–179, Cavtat-Dubrovnik, Croatia,
2003.
A. Fred and A. K. Jain. Data clustering using evidence accumulation. In
Proceedings of the 16th International Conference on Pattern Recognition,
pages 276–280, Quebec, Canada, 2002.
A. Fred and A. K. Jain. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):835–850, 2005.
Y. Freund. Boosting a weak learning algorithm by majority. Information
and Computation, 121(2):256–285, 1995.
Y. Freund. An adaptive version of the boost by majority algorithm. Machine
Learning, 43(3):293–318, 2001.
Y. Freund. A more robust boosting algorithm. CoRR abs/0905.2138, 2009.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of online learning and an application to boosting. In Proceedings of the 2nd
European Conference on Computational Learning Theory, pages 23–37,
Barcelona, Spain, 1995.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of Computer and System
Sciences, 55(1):119–139, 1997.
Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using
the query by committee algorithm. Machine Learning, 28(2-3):133–168,
1997.
J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting (with discussions). Annals of Statistics, 28(2):337–
407, 2000.
J. H. Friedman and P. Hall. On bagging and nonlinear estimation. Journal
of Statistical Planning and Inference, 137(3):669–683, 2007.
J. H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of
American Statistical Association, 76(376):817–823, 1981.
N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers.
Machine Learning, 29(2):131–163, 1997.
G. Fumera and F. Roli. A theoretical and experimental analysis of linear
combiners for multiple classifier systems. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 27(6):942–956, 2005.
W. Gao and Z.-H. Zhou. Approximation stability and boosting. In Proceedings of the 21st International Conference on Algorithmic Learning Theory,
pages 59–73, Canberra, Australia, 2010a.
W. Gao and Z.-H. Zhou. On the doubt about margin explanation of boosting. CoRR abs/1009.3613, 2012.
C. W. Gardiner. Handbook of Stochastic Methods. Springer, New York, NY,
3rd edition, 2004.
S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the
bias/variance dilemma. Neural Computation, 4(1):1–58, 1992.
P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine
Learning, 63(1):3–42, 2006.
G. Giacinto and F. Roli. Adaptive selection of image classifiers. In Proceedings of the 9th International Conference on Image Analysis and Processing,
pages 38–45, Florence, Italy, 1997.
G. Giacinto and F. Roli. A theoretical framework for dynamic classifier selection. In Proceedings of the 15th International Conference on Pattern
Recognition, pages 2008–2011, Barcelona, Spain, 2000a.
G. Giacinto and F. Roli. Dynamic classifier selection. In Proceedings of the
1st International Workshop on Multiple Classifier Systems, pages 177–189,
Cagliari, Italy, 2000b.
G. Giacinto and F. Roli. Design of effective neural network ensembles for
image classification purposes. Image and Vision Computing, 19(9-10):
699–707, 2001.
G. Giacinto, F. Roli, and G. Fumera. Design of effective multiple classifier systems by clustering of classifiers. In Proceedings of the 15th International Conference on Pattern Recognition, pages 160–163, Barcelona,
Spain, 2000.
G. Giacinto, F. Roli, and L. Didaci. Fusion of multiple classifiers for intrusion detection in computer networks. Pattern Recognition Letters, 24(12):
1795–1803, 2003.
G. Giacinto, R. Perdisci, M. D. Rio, and F. Roli. Intrusion detection in computer networks by a modular ensemble of one-class classifiers. Information Fusion, 9(1):69–82, 2008.
T. Gneiting and A. E. Raftery. Atmospheric science: Weather forecasting
with ensemble methods. Science, 310(5746):248–249, 2005.
K. Goebel, M. Krok, and H. Sutherland. Diagnostic information fusion: Requirements flowdown and interface issues. In Proceedings of the IEEE
Aerospace Conference, volume 6, pages 155–162, Big Sky, MT, 2000.
D. E. Goldberg. Genetic Algorithm in Search, Optimization and Machine
Learning. Addison-Wesley, Boston, MA, 1989.
D. M. Green and J. M. Swets. Signal Detection Theory and Psychophysics.
John Wiley & Sons, New York, NY, 1966.
A. J. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the 15th National Conference
on Artificial Intelligence, pages 692–699, Madison, WI, 1998.
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for
categorical attributes. In Proceedings of the 15th International Conference
on Data Engineering, pages 512–521, Sydney, Australia, 1999.
H. Guo and H. L. Viktor. Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. SIGKDD Explorations, 6(1):30–39, 2004.
Y. Guo and D. Schuurmans. Discriminative batch mode active learning. In
J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 593–600. MIT Press, Cambridge, MA, 2008.
I. Guyon and A. Elisseeff. An introduction to variable and feature selection.
Journal of Machine Learning Research, 3:1157–1182, 2003.
S. T. Hadjitodorov and L. I. Kuncheva. Selecting diversifying heuristics for
cluster ensembles. In Proceedings of the 7th International Workshop on
Multiple Classifier Systems, pages 200–209, Prague, Czech Republic, 2007.
S. T. Hadjitodorov, L. I. Kuncheva, and L. P. Todorova. Moderate diversity for
better cluster ensembles. Information Fusion, 7(3):264–275, 2006.
M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation
techniques. Journal of Intelligent Information Systems, 17(2-3):107–145,
2001.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan
Kaufmann, San Francisco, CA, 2nd edition, 2006.
D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press,
Cambridge, MA, 2001.
D. J. Hand. Measuring classifier performance: A coherent alternative to the
area under the ROC curve. Machine Learning, 77(1):103–123, 2009.
D. J. Hand and R. J. Till. A simple generalization of the area under the ROC
curve to multiple classification problems. Machine Learning, 45(2):171–
186, 2001.
J. A. Hanley and B. J. McNeil. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148(3):839–843, 1983.
L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 12(10):993–1001, 1990.
M. Harries. Boosting a strong learner: Evidence against the minimum margin. In Proceedings of the 16th International Conference on Machine
Learning, pages 171–179, Bled, Slovenia, 1999.
T. Hastie and R. Tibshirani. Classification by pairwise coupling. Annals of
Statistics, 26(2):451–471, 1998.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, NY, 2001.
S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice-Hall,
Upper Saddle River, NJ, 2nd edition, 1998.
H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions
on Knowledge and Data Engineering, 21(9):1263–1284, 2009.
Z. He, X. Xu, and S. Deng. A cluster ensemble method for clustering categorical data. Information Fusion, 6(2):143–151, 2005.
M. Hellman and J. Raviv. Probability of error, equivocation, and the Chernoff bound. IEEE Transactions on Information Theory, 16(4):368–372,
1970.
D. Hernández-Lobato, G. Martı́nez-Muñoz, and A. Suárez. Statistical
instance-based pruning in ensembles of independent classifiers. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 31(2):364–369,
2009.
D. Hernández-Lobato, G. Martı́nez-Muñoz, and A. Suárez. Empirical analysis and evaluation of approximate techniques for pruning regression bagging ensembles. Neurocomputing, 74(12-13):2250–2264, 2011.
A. Hinneburg and D. A. Keim. An efficient approach to clustering in large
multimedia databases with noise. In Proceedings of the 4th International
Conference on Knowledge Discovery and Data Mining, pages 58–65, New
York, NY, 1998.
T. K. Ho. Random decision forests. In Proceedings of the 3rd International
Conference on Document Analysis and Recognition, pages 278–282, Montreal, Canada, 1995.
T. K. Ho. The random subspace method for constructing decision forests.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):
832–844, 1998.
T. K. Ho, J. J. Hull, and S. N. Srihari. Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1):66–75, 1994.
V. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.
S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Semisupervised SVM batch mode
active learning with applications to image retrieval. ACM Transactions on
Information Systems, 27(3):1–29, 2009.
Y. Hong, S. Kwong, H. Wang, and Q. Ren. Resampling-based selective clustering ensembles. Pattern Recognition Letters, 41(9):2742–2756, 2009.
P. Hore, L. Hall, and D. Goldgof. A cluster ensemble framework for large data
sets. In Proceedings of the IEEE International Conference on Systems, Man
and Cybernetics, pages 3342–3347, Taipei, Taiwan, ROC, 2006.
P. Hore, L. O. Hall, and D. B. Goldgof. A scalable framework for cluster ensembles. Pattern Recognition, 42(5):676–688, 2009.
C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support
vector machines. IEEE Transactions on Neural Networks, 13(2):415–425,
2002.
X. Hu, E. K. Park, and X. Zhang. Microarray gene cluster identification and
annotation through cluster ensemble and EM-based informative textual summarization. IEEE Transactions on Information Technology in
Biomedicine, 13(5):832–840, 2009.
F.-J. Huang, Z.-H. Zhou, H.-J. Zhang, and T. Chen. Pose invariant face recognition. In Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition, pages 245–250, Grenoble, France,
2000.
S.-J. Huang, R. Jin, and Z.-H. Zhou. Active learning by querying informative
and representative examples. In J. Lafferty, C. K. I. Williams, J. ShaweTaylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Informa-
tion Processing Systems 23, pages 892–900. MIT Press, Cambridge, MA,
2010.
Y. S. Huang and C. Y. Suen. A method of combining multiple experts for the
recognition of unconstrained handwritten numerals. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 17(1):90–94, 1995.
Z. Huang. Extensions to the k-means algorithm for clustering large data
sets with categorical values. Data Mining and Knowledge Discovery, 2(3):
283–304, 1998.
R. A. Hutchinson, L.-P. Liu, and T. G. Dietterich. Incorporating boosted regression trees into ecological latent variable models. In Proceedings of
the 25th AAAI Conference on Artificial Intelligence, pages 1343–1348, San
Francisco, CA, 2011.
R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures
of local experts. Neural Computation, 3(1):79–87, 1991.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall,
Upper Saddle River, NJ, 1988.
A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM
Computing Surveys, 31(3):264–323, 1999.
G. M. James. Variance and bias for general loss functions. Machine Learning, 51(2):115–135, 2003.
T. Joachims. Transductive inference for text classification using support
vector machines. In Proceedings of the 16th International Conference on
Machine Learning, pages 200–209, Bled, Slovenia, 1999.
T. Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine
Learning, pages 384–391, Bonn, Germany, 2005.
I. T. Jolliffe. Principal Component Analysis. Springer, New York, NY, 2nd
edition, 2002.
M. I. Jordan and R. A. Jacobs. Hierarchies of adaptive experts. In J. E. Moody,
S. J. Hanson, and R. Lippmann, editors, Advances in Neural Information
Processing Systems 4, pages 985–992. Morgan Kaufmann, San Francisco,
CA, 1992.
M. I. Jordan and L. Xu. Convergence results for the EM approach to mixtures
of experts architectures. Neural Networks, 8(9):1409–1431, 1995.
G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):
359–392, 1998.
G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. Multilevel hypergraph
partitioning: Application in VLSI domain. In Proceedings of the 34th Annual Design Automation Conference, pages 526–529, Anaheim, CA, 1997.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction
to Cluster Analysis. John Wiley & Sons, New York, NY, 1990.
M. Kearns. Efficient noise tolerant learning from statistical queries. Journal
of the ACM, 45(6):983–1006, 1998.
M. Kearns and L. G. Valiant. Cryptographic limitations on learning Boolean
formulae and finite automata. In Proceedings of the 21st Annual ACM
Symposium on Theory of Computing, pages 433–444, Seattle, WA, 1989.
M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning
Theory. MIT Press, Cambridge, MA, 1994.
J. Kittler and F. M. Alkoot. Sum versus vote fusion in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25
(1):110–115, 2003.
J. Kittler, M. Hatef, R. Duin, and J. Matas. On combining classifiers. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239,
1998.
E. M. Kleinberg. On the algorithmic implementation of stochastic discrimination. IEEE Transactions on Pattern Analysis and Machine Intelligence,
22(5):473–490, 2000.
D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and
Searching. Addison-Wesley, Reading, MA, 2nd edition, 1997.
A. H. Ko, R. Sabourin, and J. A. S. Britto. From dynamic classifier selection
to dynamic ensemble selection. Pattern Recognition, 41(5):1718–1731,
2008.
R. Kohavi and D. H. Wolpert. Bias plus variance decomposition for zeroone loss functions. In Proceedings of the 13th International Conference
on Machine Learning, pages 275–283, Bari, Italy, 1996.
T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag,
Berlin, 3rd edition, 1989.
J. F. Kolen and J. B. Pollack. Back propagation is sensitive to initial conditions. In R. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances
in Neural Information Processing Systems 3, pages 860–867. Morgan Kaufmann, San Francisco, CA, 1991.
J. Z. Kolter and M. A. Maloof. Learning to detect and classify malicious executables in the wild. Journal of Machine Learning Research, 7:2721–2744,
2006.
E. B. Kong and T. G. Dietterich. Error-correcting output coding corrects
bias and variance. In Proceedings of the 12th International Conference on
Machine Learning, pages 313–321, Tahoe City, CA, 1995.
A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and
active learning. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors,
Advances in Neural Information Processing Systems 7, pages 231–238. MIT
Press, Cambridge, MA, 1995.
M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets:
One-sided selection. In Proceedings of the 14th International Conference
on Machine Learning, pages 179–186, Nashville, TN, 1997.
H. W. Kuhn. The Hungarian method for the assignment problem. Naval
Research Logistics Quarterly, 2:83–97, 1955.
M. Kukar and I. Kononenko. Cost-sensitive learning with neural networks.
In Proceedings of the 13th European Conference on Artificial Intelligence,
pages 445–449, Brighton, UK, 1998.
L. I. Kuncheva. A theoretical study on six classifier fusion strategies. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 24(2):281–286,
2002.
L. I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms.
John Wiley & Sons, Hoboken, NJ, 2004.
L. I. Kuncheva. Classifier ensembles: Facts, fiction, faults and future, 2008.
Plenary Talk at the 19th International Conference on Pattern Recognition.
L. I. Kuncheva and S. T. Hadjitodorov. Using diversity in cluster ensembles.
In Proceedings of the IEEE International Conference on Systems, Man and
Cybernetics, pages 1214–1219, Hague, The Netherlands, 2004.
L. I. Kuncheva and D. P. Vetrov. Evaluation of stability of k-means cluster
ensembles with respect to random initialization. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 28(11):1798–1808, 2006.
L. I. Kuncheva and C. J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51(2):181–207, 2003.
L. I. Kuncheva, J. C. Bezdek, and R. P. Duin. Decision templates for multiple
classifier fusion: An experimental comparison. Pattern Recognition, 34
(2):299–314, 2001.
L. I. Kuncheva, C. J. Whitaker, C. Shipp, and R. Duin. Limits on the majority
vote accuracy in classifier fusion. Pattern Analysis and Applications, 6(1):
22–31, 2003.
L. I. Kuncheva, S. T. Hadjitodorov, and L. P. Todorova. Experimental comparison of cluster ensemble methods. In Proceedings of the 9th International
Conference on Information Fusion, pages 1–7, Florence, Italy, 2006.
S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proceedings of the 18th Conference on Uncertainty in
Artificial Intelligence, pages 275–282, Edmonton, Canada, 2002.
S. W. Kwok and C. Carter. Multiple decision trees. In Proceedings of the 4th
International Conference on Uncertainty in Artificial Intelligence, pages
327–338, New York, NY, 1988.
L. Lam and S. Y. Suen. Application of majority voting to pattern recognition:
An analysis of its behavior and performance. IEEE Transactions on Systems, Man and Cybernetics - Part A: Systems and Humans, 27(5):553–568,
1997.
A. Lazarevic and Z. Obradovic. Effective pruning of neural network classifier ensembles. In Proceedings of the IEEE/INNS International Joint Conference on Neural Networks, pages 796–801, Washington, DC, 2001.
D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In
Proceedings of the 17th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pages 3–12, Dublin,
Ireland, 1994.
M. Li and Z.-H. Zhou. Improve computer-aided diagnosis with machine
learning techniques using undiagnosed samples. IEEE Transactions on
Systems, Man and Cybernetics - Part A: Systems and Humans, 37(6):1088–
1098, 2007.
M. Li, W. Wang, and Z.-H. Zhou. Exploiting remote learners in internet environment with agents. Science China: Information Sciences, 53(1):64–76,
2010.
N. Li and Z.-H. Zhou. Selective ensemble under regularization framework.
In Proceedings of the 8th International Workshop Multiple Classifier Systems, pages 293–303, Reykjavik, Iceland, 2009.
S. Z. Li, Q. Fu, L. Gu, B. Schölkopf, and H. J. Zhang. Kernel machine based
learning for multi-view face detection and pose estimation. In Proceedings of the 8th International Conference on Computer Vision, pages 674–
679, Vancouver, Canada, 2001.
R. Liere and P. Tadepalli. Active learning with committees for text categorization. In Proceedings of the 14th National Conference on Artificial Intelligence, pages 591–596, Providence, RI, 1997.
H.-T. Lin and L. Li. Support vector machinery for infinite ensemble learning. Journal of Machine Learning Research, 9:285–312, 2008.
X. Lin, S. Yacoub, J. Burns, and S. Simske. Performance analysis of pattern
classifier combination by plurality voting. Pattern Recognition Letters, 24
(12):1959–1969, 2003.
Y. M. Lin, Y. Lee, and G. Wahba. Support vector machines for classification
in nonstandard situations. Machine Learning, 46(1):191–202, 2002.
F. T. Liu, K. M. Ting, and W. Fan. Maximizing tree diversity by building
complete-random decision trees. In Proceedings of the 9th Pacific-Asia
Conference on Knowledge Discovery and Data Mining, pages 605–610,
Hanoi, Vietnam, 2005.
F. T. Liu, K. M. Ting, Y. Yu, and Z.-H. Zhou. Spectrum of variable-random
trees. Journal of Artificial Intelligence Research, 32(1):355–384, 2008a.
F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation forest. In Proceedings of the
8th IEEE International Conference on Data Mining, pages 413–422, Pisa,
Italy, 2008b.
F. T. Liu, K. M. Ting, and Z.-H. Zhou. On detecting clustered anomalies using
SCiForest. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases,
pages 274–290, Barcelona, Spain, 2010.
X.-Y. Liu and Z.-H. Zhou. The influence of class imbalance on cost-sensitive
learning: An empirical study. In Proceedings of the 6th IEEE International Conference on Data Mining, pages 970–974, Hong Kong, China,
2006.
X.-Y. Liu and Z.-H. Zhou. Learning with cost intervals. In Proceedings of the
16th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pages 403–412, Washington, DC, 2010.
X.-Y. Liu, J. Wu, and Z.-H. Zhou. Exploratory undersampling for classimbalance learning. IEEE Transactions on Systems, Man, and Cybernetics
- Part B: Cybernetics, 39(2):539–550, 2009.
Y. Liu and X. Yao. Ensemble learning via negative correlation. Neural Networks, 12(10):1399–1404, 1999.
S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):128–137, 1982.
J. M. Lobo, A. Jiménez-Valverde, and R. Real. AUC: A misleading measure of
the performance of predictive distribution models. Global Ecology and
Biogeography, 17(2):145–151, 2008.
B. Long, Z. Zhang, and P. S. Yu. Combining multiple clusterings by soft correspondence. In Proceedings of the 4th IEEE International Conference on
Data Mining, pages 282–289, Brighton, UK, 2005.
P. K. Mallapragada, R. Jin, A. K. Jain, and Y. Liu. Semiboost: Boosting for
semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):2000–2014, 2009.
I. Maqsood, M. R. Khan, and A. Abraham. An ensemble of neural networks
for weather forecasting. Neural Computing & Applications, 13(2):112–
122, 2004.
D. D. Margineantu and T. G. Dietterich. Pruning adaptive boosting. In Proceedings of the 14th International Conference on Machine Learning, pages
211–218, Nashville, TN, 1997.
H. Markowitz. Portfolio selection. Journal of Finance, 7(1):77–91, 1952.
G. Martı́nez-Muñoz and A. Suárez. Aggregation ordering in bagging. In Proceedings of the IASTED International Conference on Artifical Intelligence
and Applications, pages 258–263, Innsbruck, Austria, 2004.
G. Martı́nez-Muñoz and A. Suárez. Pruning in ordered bagging ensembles.
In Proceedings of the 23rd International Conference on Machine Learning,
pages 609–616, Pittsburgh, PA, 2006.
G. Martı́nez-Muñoz and A. Suárez. Using boosting to prune bagging ensembles. Pattern Recognition Letters, 28(1):156–165, 2007.
G. Martı́nez-Muñoz, D. Hernández-Lobato, and A. Suárez. An analysis of
ensemble pruning techniques based on ordered aggregation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):245–259, 2009.
H. Masnadi-Shirazi and N. Vasconcelos. Asymmetric Boosting. In Proceedings of the 24th International Conference on Machine Learning, pages
609–616, Corvallis, OR, 2007.
L. Mason, J. Baxter, P. L. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In P. J. Bartlett, B. Schölkopf, D. Schuurmans, and A. J. Smola, editors, Advances in Large-Margin Classifiers,
pages 221–246. MIT Press, Cambridge, MA, 2000.
A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance
penalization. In Proceedings of the 22nd Conference on Learning Theory,
Montreal, Canada, 2009.
A. McCallum and K. Nigam. Employing EM and pool-based active learning
for text classification. In Proceedings of the 15th International Conference
on Machine Learning, pages 350–358, Madison, WI, 1998.
R. A. McDonald, D. J. Hand, and I. A. Eckley. An empirical comparison of
three boosting algorithms on real data sets with artificial class noise. In
Proceedings of the 4th International Workshop on Multiple Classifier Systems, pages 35–44, Guilford, UK, 2003.
W. McGill. Multivariate information transmission. IEEE Transactions on
Information Theory, 4(4):93–111, 1954.
D. Mease and A. Wyner. Evidence contrary to the statistical view of boosting (with discussions). Journal of Machine Learning Research, 9:131–201,
2008.
N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal
Statistical Society: Series B, 72(4):417–473, 2010.
P. Melville and R. J. Mooney. Creating diversity in ensembles using artificial
data. Information Fusion, 6(1):99–111, 2005.
C. E. Metz. Basic principles of ROC analysis. Seminars in Nuclear Medicine,
8(4):283–298, 1978.
D. J. Miller and H. S. Uyar. A mixture of experts classifier with learning
based on both labelled and unlabelled data. In M. Mozer, M. I. Jordan,
and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 571–577. MIT Press, Cambridge, MA, 1997.
T. M. Mitchell. Machine Learning. McGraw-Hill, New York, NY, 1997.
X. Mu, P. Watta, and M. H. Hassoun. Analysis of a plurality voting-based
combination of classifiers. Neural Processing Letters, 29(2):89–107, 2009.
I. Mukherjee and R. Schapire. A theory of multiclass boosting. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors,
Advances in Neural Information Processing Systems 23, pages 1714–1722.
MIT Press, Cambridge, MA, 2010.
S. K. Murthy, S. Kasif, and S. Salzberg. A system for the induction of oblique
decision trees. Journal of Artificial Intelligence Research, 2:1–33, 1994.
A.M. Narasimhamurthy. A framework for the analysis of majority voting.
In Proceedings of the 13th Scandinavian Conference on Image Analysis,
pages 268–274, Halmstad, Sweden, 2003.
A. Narasimhamurthy. Theoretical bounds of majority voting performance
for a binary classification problem. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 27(12):1988–1995, 2005.
S. Nash and A. Sofer. Linear and Nonlinear Programming. McGraw-Hill,
New York, NY, 1996.
R. Ng and J. Han. Efficient and effective clustering method for spatial data
mining. In Proceedings of the 20th International Conference on Very Large
Data Bases, pages 144–155, Santiago, Chile, 1994.
H. T. Nguyen and A. W. M. Smeulders. Active learning using pre-clustering.
In Proceedings of the 21st International Conference on Machine Learning,
pages 623–630, Banff, Canada, 2004.
K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from
labeled and unlabeled documents using EM. Machine Learning, 39(2-3):
103–134, 2000.
N. J. Nilsson. Learning Machines: Foundations of Trainable Pattern-Classifying Systems. McGraw-Hill, New York, NY, 1965.
D. Opitz and R. Maclin. Popular ensemble methods: An empirical study.
Journal of Artificial Intelligence Research, 11:169–198, 1999.
S. Panigrahi, A. Kundu, S. Sural, and A. K. Majumdar. Credit card fraud detection: A fusion approach using Dempster-Shafer theory and Bayesian
learning. Information Fusion, 10(4):354–363, 2009.
I. Partalas, G. Tsoumakas, and I. Vlahavas. Pruning an ensemble of classifiers via reinforcement learning. Neurocomputing, 72(7-9):1900–1909,
2009.
D. Partridge and W. J. Krzanowski. Software diversity: Practical statistics for
its measurement and exploitation. Information & Software Technology,
39(10):707–717, 1997.
A. Passerini, M. Pontil, and P. Frasconi. New results on error correcting output codes of kernel machines. IEEE Transactions on Neural Networks, 15
(1):45–54, 2004.
M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble method
for neural networks. In R. J. Mammone, editor, Artificial Neural Networks
for Speech and Vision, pages 126–142. Chapman & Hall, New York, NY,
1993.
J. C. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, Cambridge, MA, 2000.
R. Polikar, A. Topalis, D. Parikh, D. Green, J. Frymiare, J. Kounios, and C. M.
Clark. An ensemble based data fusion approach for early diagnosis of
Alzheimer’s disease. Information Fusion, 9(1):83–95, 2008.
B. R. Preiss. Data Structures and Algorithms with Object-Oriented Design
Patterns in Java. Wiley, Hoboken, NJ, 1999.
O. Pujol, P. Radeva, and J. Vitrià. Discriminant ECOC: A heuristic method
for application dependent design of error correcting output codes. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 28(6):1007–
1012, 2006.
O. Pujol, S. Escalera, and P. Radeva. An incremental node embedding technique for error correcting output codes. Pattern Recognition, 41(2):713–
725, 2008.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann,
San Francisco, CA, 1993.
J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106,
1986.
Š. Raudys and F. Roli. The behavior knowledge space fusion method: Analysis of generalization error and strategies for performance improvement.
In Proceedings of the 4th International Workshop on Multiple Classifier
Systems, pages 55–64, Guildford, UK, 2003.
R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and
the EM algorithm. SIAM Review, 26(2):195–239, 1984.
L. Reyzin and R. E. Schapire. How boosting the margin can also boost classifier complexity. In Proceedings of the 23rd International Conference on
Machine Learning, pages 753–760, Pittsburgh, PA, 2006.
B. Ripley. Pattern Recognition and Neural Networks. Cambridge University
Press, Cambridge, UK, 1996.
M. Robnik-Šikonja. Improving random forests. In Proceedings of the 15th
European Conference on Machine Learning, pages 359–370, Pisa, Italy,
2004.
J. J. Rodriguez, L. I. Kuncheva, and C. J. Alonso. Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1619–1630, 2006.
G. Rogova. Combining the results of several neural network classifiers. Neural Networks, 7(5):777–781, 1994.
L. Rokach. Pattern Classification Using Ensemble Methods. World Scientific,
Singapore, 2010.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland,
editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1, pages 318–362. MIT Press, Cambridge, MA,
1986.
D. Ruta and B. Gabrys. Application of the evolutionary algorithms for classifier selection in multiple classifier systems with majority voting. In Proceedings of the 2nd International Workshop on Multiple Classifier Systems,
pages 399–408, Cambridge, UK, 2001.
R. E. Schapire. The strength of weak learnability. Machine Learning, 5(2):
197–227, 1990.
R. E. Schapire and Y. Singer.
Improved boosting algorithms using
confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.
R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A
new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651–1686, 1998.
J. Schiffers. A classification approach incorporating misclassification costs.
Intelligent Data Analysis, 1(1):59–68, 1997.
B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge,
MA, 2002.
B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors. Advances in Kernel
Methods: Support Vector Learning. MIT Press, Cambridge, MA, 1999.
M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo. Data mining methods
for detection of new malicious executables. In Proceedings of the IEEE
Symposium on Security and Privacy, pages 38–49, Oakland, CA, 2001.
A. K. Seewald. How to make stacking better and faster while also taking
care of an unknown weakness. In Proceedings of the 19th International
Conference on Machine Learning, pages 554–561, Sydney, Australia, 2002.
B. Settles. Active learning literature survey. Technical Report 1648, Department of Computer Sciences, University of Wisconsin at Madison, Madison, WI, 2009.
H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the 5th Annual ACM Conference on Computational Learning
Theory, pages 287–294, Pittsburgh, PA, 1992.
G. Shafer. A Mathematical Theory of Evidence. Princeton University Press,
Princeton, NJ, 1976.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multiresolution clustering approach for very large spatial databases. In Proceedings of the 24th International Conference on Very Large Data Bases,
pages 428–439, New York, NY, 1998.
H. B. Shen and K. C. Chou. Ensemble classifier for protein fold pattern
recognition. Bioinformatics, 22(14):1717–1722, 2006.
J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
C. A. Shipp and L. I. Kuncheva. Relationships between combination methods and measures of diversity in combining classifiers. Information Fusion, 3(2):135–148, 2002.
D. B. Skalak. The sources of increased accuracy for two proposed boosting algorithms. In Working Notes of the AAAI’96 Workshop on Integrating
Multiple Learned Models, Portland, OR, 1996.
N. Slonim, N. Friedman, and N. Tishby. Multivariate information bottleneck. Neural Computation, 18(8):1739–1789, 2006.
P. Smyth and D. Wolpert. Stacked density estimation. In M. I. Jordan, M. J.
Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 668–674. MIT Press, Cambridge, MA, 1998.
P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy: The Principles and
Practice of Numerical Classification. W. H. Freeman, San Francisco, CA,
1973.
V. Soto, G. Martı́nez-Muñoz, D. Hernández-Lobato, and A. Suárez. A double
pruning algorithm for classification ensembles. In Proceedings of 9th International Workshop Multiple Classifier Systems, pages 104–113, Cairo,
Egypt, 2010.
K. A. Spackman. Signal detection theory: Valuable tools for evaluating inductive learning. In Proceedings of the 6th International Workshop on
Machine Learning, pages 160–163, Ithaca, NY, 1989.
A. Strehl and J. Ghosh. Cluster ensembles - A knowledge reuse framework
for combining multiple partitions. Journal of Machine Learning Research,
3:583–617, 2002.
A. Strehl, J. Ghosh, and R. J. Mooney. Impact of similarity measures on webpage clustering. In Proceedings of the AAAI’2000 Workshop on AI for Web
Search, pages 58–64, Austin, TX, 2000.
M. Studeny and J. Vejnarova. The multi-information function as a tool for
measuring stochastic dependence. In M. I. Jordan, editor, Learning in
Graphical Models, pages 261–298. Kluwer, Norwell, MA, 1998.
Y. Sun, A. K. C. Wong, and Y. Wang. Parameter inference of cost-sensitive
boosting algorithms. In Proceedings of the 4th International Conference
on Machine Learning and Data Mining in Pattern Recognition, pages 21–
30, Leipzig, Germany, 2005.
C. Tamon and J. Xiang. On the boosting pruning problem. In Proceedings
of the 11th European Conference on Machine Learning, pages 404–412,
Barcelona, Spain, 2000.
A. C. Tan, D. Gilbert, and Y. Deville. Multi-class protein fold classification
using a new ensemble machine learning approach. Genome Informatics,
14:206–217, 2003.
P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining.
Addison-Wesley, Upper Saddle River, NJ, 2006.
E. K. Tang, P. N. Suganthan, and X. Yao. An analysis of diversity measures.
Machine Learning, 65(1):247–271, 2006.
212
References
J. W. Taylor and R. Buizza. Neural network load forecasting with weather
ensemble predictions. IEEE Transactions on Power Systems, 17(3):626–
632, 2002.
S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press,
New York, NY, 4th edition, 2009.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society: Series B, 58(1):267–288, 1996a.
R. Tibshirani. Bias, variance and prediction error for classification rules.
Technical report, Department of Statistics, University of Toronto, 1996b.
A. B. Tickle, R. Andrews, M. Golea, and J. Diederich. The truth will come to
light: Directions and challenges in extracting the knowledge embedded
within trained artificial neural networks. IEEE Transactions on Neural
Networks, 9(6):1057–1067, 1998.
K. M. Ting. A comparative study of cost-sensitive boosting algorithms. In
Proceedings of the 17th International Conference on Machine Learning,
pages 983–990, San Francisco, CA, 2000.
K. M. Ting. An instance-weighting method to induce cost-sensitive trees.
IEEE Transactions on Knowledge and Data Engineering, 14(3):659–665,
2002.
K. M. Ting and I. H. Witten. Issues in stacked generalization. Journal of
Artificial Intelligence Research, 10:271–289, 1999.
I. Tomek. Two modifications of CNN. IEEE Transactions on Systems, Man
and Cybernetics, 6(11):769–772, 1976.
S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In Proceedings of the 17th International Conference on Machine Learning, pages 999–1006, San Francisco, CA, 2000.
A. Topchy, A. K. Jain, and W. Punch. Combining multiple weak clusterings.
In Proceedings of the 3rd IEEE International Conference on Data Mining,
pages 331–338, Melbourne, FL, 2003.
A. Topchy, A. K. Jain, and W. Punch. A mixture model for clustering ensembles. In Proceedings of the 4th SIAM International Conference on Data
Mining, pages 379–390, Lake Buena Vista, FL, 2004a.
A. Topchy, B. Minaei-Bidgoli, A. K. Jain, and W. F. Punch. Adaptive clustering
ensembles. In Proceedings of the 17th International Conference on Pattern
Recognition, pages 272–275, Cambridge, UK, 2004b.
A. P. Topchy, M. H. C. Law, A. K. Jain, and A. L. Fred. Analysis of consensus
partition in cluster ensemble. In Proceedings of the 4th IEEE International
Conference on Data Mining, pages 225–232, Brighton, UK, 2004c.
References
213
G. Tsoumakas, I. Katakis, and I. Vlahavas. Effective voting of heterogeneous
classifiers. In Proceedings of the 15th European Conference on Machine
Learning, pages 465–476, Pisa, Italy, 2004.
G. Tsoumakas, L. Angelis, and I. P. Vlahavas. Selective fusion of heterogeneous classifiers. Intelligent Data Analysis, 9(6):511–525, 2005.
G. Tsoumakas, I. Partalas, and I. Vlahavas. An ensemble pruning primer. In
O. Okun and G. Valentini, editors, Applications of Supervised and Unsupervised Ensemble Methods, pages 155–165. Springer, Berlin, 2009.
K. Tumer. Linear and Order Statistics Combiners for Reliable Pattern Classification. PhD thesis, The University of Texas at Austin, 1996.
K. Tumer and J. Ghosh. Theoretical foundations of linear and order statistics combiners for neural pattern classifiers. Technical Report TR-95-0298, Computer and Vision Research Center, University of Texas, Austin,
1995.
K. Tumer and J. Ghosh. Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition, 29(2):341–348, 1996.
P. D. Turney. Types of cost in inductive concept learning. In Proceedings
of the ICML’2000 Workshop on Cost-Sensitive Learning, pages 15–21, San
Francisco, CA, 2000.
N. Ueda and R. Nakano. Generalization error of ensemble estimators. In
Proceedings of the IEEE International Conference on Neural Networks,
pages 90–95, Washington, DC, 1996.
W. Utschick and W. Weichselberger. Stochastic organization of output codes
in multiclass learning problems. Neural Computation, 13(5):1065–1102,
2004.
L. G. Valiant. A theory of the learnable. Communications of the ACM, 27
(11):1134–1142, 1984.
H. Valizadegan, R. Jin, and A. K. Jain. Semi-supervised boosting for multiclass classification. In Proceedings of the 19th European Conference on
Machine Learning, pages 522–537, Antwerp, Belgium, 2008.
C. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.
V. N. Vapnik. Statistical Learning Theory. Wiley, New York, NY, 1998.
P. Viola and M. Jones. Rapid object detection using a boosted cascade of
simple features. In Proceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, pages 511–518, Kauai, HI,
2001.
P. Viola and M. Jones. Fast and robust classification using asymmetric
Adaboost and a detector cascade. In T. G. Dietterich, S. Becker, and
214
References
Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1311–1318. MIT Press, Cambridge, MA, 2002.
P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 57(2):137–154, 2004.
L. Wang, M. Sugiyama, C. Yang, Z.-H. Zhou, and J. Feng. On the margin explanation of boosting algorithm. In Proceedings of the 21st Annual Conference on Learning Theory, pages 479–490, Helsinki, Finland, 2008.
W. Wang and Z.-H. Zhou. On multi-view active learning and the combination with semi-supervised learning. In Proceedings of the 25th International Conference on Machine Learning, pages 1152–1159, Helsinki, Finland, 2008.
W. Wang and Z.-H. Zhou. Multi-view active learning in the non-realizable
case. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and
A. Culotta, editors, Advances in Neural Information Processing Systems 23,
pages 2388–2396. MIT Press, Cambridge, MA, 2010.
W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd International
Conference on Very Large Data Bases, pages 186–195, Athens, Greece,
1997.
M. K. Warmuth, K. A. Glocer, and S. V. Vishwanathan. Entropy regularized
LPBoost. In Proceedings of the 19th International Conference on Algorithmic Learning Theory, pages 256–271, Budapest, Hungary, 2008.
S. Watanabe. Information theoretical analysis of multivariate correlation.
IBM Journal of Research and Development, 4(1):66–82, 1960.
S. Waterhouse, D. Mackay, and T. Robinson. Bayesian methods for mixtures
of experts. In D. S. Touretzky, M. Mozer, and M. E. Hasselmo, editors,
Advances in Neural Information Processing Systems 8, pages 351–357. MIT
Press, Cambridge, MA, 1996.
S. R. Waterhouse and A. J. Robinson. Constructive algorithms for hierarchical mixtures of experts. In D. S. Touretzky, M. Mozer, and M. E. Hasselmo,
editors, Advances in Neural Information Processing Systems 8, pages 584–
590. MIT Press, Cambridge, MA, 1996.
C. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.
G. I. Webb and Z. Zheng. Multistrategy ensemble learning: Reducing error by combining ensemble learning techniques. IEEE Transactions on
Knowledge and Data Engineering, 16(8):980–991, 2004.
G. I. Webb, J. R. Boughton, and Z. Wang. Not so naı̈ve Bayes: Aggregating
one-dependence estimators. Machine Learning, 58(1):5–24, 2005.
References
215
P. Werbos. Beyond regression: New tools for prediction and analysis in
the behavior science. PhD thesis, Harvard University, Cambridge, MA,
1974.
D. West, S. Dellana, and J. Qian. Neural network ensemble strategies for financial decision applications. Computers & Operations Research, 32(10):
2543–2559, 2005.
T. Windeatt and R. Ghaderi. Coding and decoding strategies for multi-class
learning problems. Information Fusion, 4(1):11–21, 2003.
D. H. Wolpert. Stacked generalization. Neural Networks, 5(2):241–260, 1992.
D. H. Wolpert. The lack of a priori distinctions between learning algorithms.
Neural Computation, 8(7):1341–1390, 1996.
D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization.
IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.
D. H. Wolpert and W. G. Macready. An efficient method to estimate bagging’s generalization error. Machine Learning, 35(1):41–55, 1999.
K. Woods, W. P. Kegelmeyer, and K. Bowyer. Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):405–410, 1997.
J. Wu, S. C. Brubaker, M. D. Mullin, and J. M. Rehg. Fast asymmetric learning
for cascade face detection. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 30(3):369–382, 2008.
L. Xu and S. Amari. Combining classifiers and learning mixture-of-experts.
In J. R. R. Dopico, J. Dorado, and A. Pazos, editors, Encyclopedia of Artificial Intelligence, pages 318–326. IGI, Berlin, 2009.
L. Xu and M. I. Jordan. On convergence properties of the EM algorithm for
Gaussian mixtures. Neural Computation, 8(1):129–151, 1996.
L. Xu, A. Krzyzak, and C. Y. Suen. Methods of combining multiple classifiers
and their applications to handwriting recognition. IEEE Transactions on
Systems Man and Cybernetics, 22(3):418–435, 1992.
L. Xu, M. I. Jordan, and G. E. Hinton. An alternative model for mixtures of
experts. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances
in Neural Information Processing Systems 7, pages 633–640. MIT Press,
Cambridge, MA, 1995.
L. Yan, R. H. Dodier, M. Mozer, and R. H. Wolniewicz. Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney
statistic. In Proceedings of the 20th International Conference on Machine
Learning, pages 848–855, Washington, DC, 2003.
216
References
W. Yan and F. Xue. Jet engine gas path fault diagnosis using dynamic fusion
of multiple classifiers. In Proceedings of the International Joint Conference
on Neural Networks, pages 1585–1591, Hong Kong, China, 2008.
Y. Yu, Y.-F. Li, and Z.-H. Zhou. Diversity regularized machine. In Proceedings
of the 22nd International Joint Conference on Artificial Intelligence, pages
1603–1608, Barcelona, Spain, 2011.
Z. Yu and H.-S. Wong. Class discovery from gene expression data based
on perturbation and cluster ensemble. IEEE Transactions on NanoBioscience, 18(2):147–160, 2009.
G. Yule. On the association of attributes in statistics. Philosophical Transactions of the Royal Society of London, 194:257–319, 1900.
B. Zadrozny and C. Elkan. Learning and making decisions when costs
and probabilities are both unknown. In Proceedings of the 7th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 204–213, San Francisco, CA, 2001a.
B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates
from decision trees and naive Bayesian classifiers. In Proceedings of
the 18th International Conference on Machine Learning, pages 609–616,
Williamstown, MA, 2001b.
B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by costproportionate example weighting. In Proceedings of the 3rd IEEE International Conference on Data Mining, pages 435–442, Melbourne, FL,
2003.
M.-L. Zhang and Z.-H. Zhou. Exploiting unlabeled data to enhance ensemble diversity. In Proceedings of the 9th IEEE International Conference on
Data Mining, pages 609–618, Sydney, Australia, 2010.
T. Zhang. Analysis of regularized linear functions for classification problems. Technical Report RC-21572, IBM, 1999.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD
International Conference on Management of Data, pages 103–114, Montreal, Canada, 1996.
X. Zhang, S. Wang, T. Shan, and L. Jiao. Selective SVMs ensemble driven
by immune clonal algorithm. In Proceedings of the EvoWorkshops, pages
325–333, Lausanne, Switzerland, 2005.
X. Zhang, L. Jiao, F. Liu, L. Bo, and M. Gong. Spectral clustering ensemble
applied to SAR image segmentation. IEEE Transactions on Geoscience and
Remote Sensing, 46(7):2126–2136, 2008.
References
217
Y. Zhang, S. Burer, and W. N. Street. Ensemble pruning via semi-definite
programming. Journal of Machine Learning Research, 7:1315–1338, 2006.
Z. Zheng and G. I. Webb. Laze learning of Bayesian rules. Machine Learning,
41(1):53–84, 2000.
D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with
local and global consistency. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press,
Cambridge, MA, 2004.
Z.-H. Zhou. Rule extraction: Using neural networks or for neural networks?
Journal of Computer Science and Technology, 19(2):249–253, 2004.
Z.-H. Zhou. Comprehensibility of data mining algorithms. In J. Wang, editor, Encyclopedia of Data Warehousing and Mining, pages 190–195. IGI,
Hershey, PA, 2005.
Z.-H. Zhou. When semi-supervised learning meets ensemble learning. In
Proceedings of the 8th International Workshop on Multiple Classifier Systems, pages 529–538, Reykjavik, Iceland, 2009.
Z.-H. Zhou. When semi-supervised learning meets ensemble learning.
Frontiers of Electrical and Electronic Engineering in China, 6(1):6–16,
2011.
Z.-H. Zhou and Y. Jiang. Medical diagnosis with C4.5 rule preceded by artificial neural network ensemble. IEEE Transactions on Information Technology in Biomedicine, 7(1):37–42, 2003.
Z.-H. Zhou and Y. Jiang. NeC4.5: Neural ensemble based C4.5. IEEE Transactions on Knowledge and Data Engineering, 16(6):770–773, 2004.
Z.-H. Zhou and M. Li. Tri-training: Exploiting unlabeled data using three
classifiers. IEEE Transactions on Knowledge and Data Engineering, 17
(11):1529–1541, 2005.
Z.-H. Zhou and M. Li. Semi-supervised regression with co-training style
algorithms. IEEE Transactions on Knowledge and Data Engineering, 19
(11):1479–1493, 2007.
Z.-H. Zhou and M. Li. Semi-supervised learning by disagreement. Knowledge and Information Systems, 24(3):415–439, 2010a.
Z.-H. Zhou and N. Li. Multi-information ensemble diversity. In Proceedings
of the 9th International Workshop on Multiple Classifier Systems, pages
134–144, Cairo, Egypt, 2010b.
Z.-H. Zhou and X.-Y. Liu. Training cost-sensitive neural networks with
methods addressing the class imbalance problem. IEEE Transactions on
Knowledge and Data Engineering, 18(1):63–77, 2006.
218
References
Z.-H. Zhou and X.-Y. Liu. On multi-class cost-sensitive learning. Computational Intelligence, 26(3):232–257, 2010.
Z.-H. Zhou and W. Tang. Selective ensemble of decision trees. In Proceedings of the 9th International Conference on Rough Sets, Fuzzy Sets, Data
Mining and Granular Computing, pages 476–483, Chongqing, China,
2003.
Z.-H. Zhou and W. Tang. Clusterer ensemble. Knowledge-Based Systems, 19
(1):77–83, 2006.
Z.-H. Zhou and Y. Yu. Ensembling local learners through multimodal perturbation. IEEE Transactions on Systems, Man, and Cybernetics - Part B:
Cybernetics, 35(4):725–735, 2005.
Z.-H. Zhou, Y. Jiang, Y.-B. Yang, and S.-F. Chen. Lung cancer cell identification based on artificial neural network ensembles. Artificial Intelligence
in Medicine, 24(1):25–36, 2002a.
Z.-H. Zhou, J. Wu, and W. Tang. Ensembling neural networks: Many could
be better than all. Artificial Intelligence, 137(1-2):239–263, 2002b.
Z.-H. Zhou, Y. Jiang, and S.-F. Chen. Extracting symbolic rules from trained
neural network ensembles. AI Communications, 16(1):3–15, 2003.
Z.-H. Zhou, K.-J. Chen, and H.-B. Dai. Enhancing relevance feedback in
image retrieval using unlabeled data. ACM Transactions on Information
Systems, 24(2):219–244, 2006.
J. Zhu, S. Rosset, H. Zou, and T. Hastie. Multi-class AdaBoost. Technical
report, Department of Statistics, University of Michigan, Ann Arbor, MI,
2006.
X. Zhu. Semi-supervised learning literature survey. Technical Report
1530, Department of Computer Sciences, University of Wisconsin at
Madison, Madison, WI, 2006. http://www.cs.wisc.edu/∼jerryzhu/pub/
ssl survey.pdf.
X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using
Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning, pages 912–919, Washington,
DC, 2003.
X. Zhu, X. Wu, and Y. Yang. Dynamic classifier selection for effective mining
from noisy data streams. In Proceedings of the 14th IEEE International
Conference on Data Mining, pages 305–312, Brighton, UK, 2004.