Building Reliable Metaclassifiers for Text Learning
Paul N. Bennett
May 2006
CMU-CS-06-121
School of Computer Science
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy.
Thesis Committee:
Jaime Carbonell, Chair
John Lafferty, Co-chair
Tom Mitchell
Susan Dumais, Microsoft Research
Eric Horvitz, Microsoft Research
Copyright © 2006 Paul N. Bennett
This research was sponsored by the Defense Advanced Research Projects Agency (DARPA) under Contract No.
NBCHC030029, SRI International under Subcontract No. 55000691, the National Imagery & Mapping
Agency (NIMA) under Contract No. NMA401-02-C0033, and the National Science Foundation under Grant
Nos. IIS-9982226, IIS-9873009, and IIS-9988084. The views and conclusions contained in this document
are those of the author and should not be interpreted as representing the official policies, either express or
implied, of any sponsoring institution, the U.S. government, or any other entity.
Keywords: text classification, classifier combination, metaclassifiers, reliability indicators, reliability, multiple models, locality, combining estimates, calibration
This is dedicated first to my parents whose pride in my work has been as uplifting as their
supporting words, but most importantly, this is dedicated to my loving wife, Tina. Your
support made my long nights bearable; I hope my absence didn’t make yours too
unbearable.
Abstract
Appropriately combining information sources to form a more effective output than any of the individual sources is a broad topic that has been researched
in many forms. It can be considered to contain sensor fusion, distributed
data-mining, regression combination, classifier combination, and even the basic classification problem. After all, the hypothesis a classifier emits is just a
specification of how the information in the basic features should be combined.
This dissertation addresses one subfield of this domain: leveraging locality
when combining classifiers for text classification. Classifier combination is
useful, in part, as an engineering aid that enables machine learning scientists
to understand the differences among base classifiers in terms of their local reliability,
dependence, and variance — much as higher-level languages are an abstraction
that improves upon assembly language without extending its computational
power. Additionally, using such abstraction, we introduce a combination model
that uses inductive transfer to extend the amount of labeled data that can be
brought to bear when building a text classifier combination model.
We begin by discussing the role calibrated probabilities play when combining classifiers. After reviewing calibration, we present arguments and empirical evidence that the distribution of posterior probabilities from a classifier tends to be asymmetric. Since the standard methods for recalibrating classifiers have an underlying assumption of symmetry, we present asymmetric
distributions that can be fit efficiently and produce recalibrated probabilities of
higher quality than the symmetrical methods. The resulting improved probabilities can either be used directly for a single base classifier or used as part of
a classifier combination model.
Reflecting on the lessons learned from the study of calibration, we go on
to define local calibration, dependence, and variance and discuss the roles they
play in classifier combination. Using these insights as motivation, we introduce
a series of reliability-indicator variables which serve as an intuitive abstraction
of the input domain to capture the local context related to a classifier’s reliability.
We then introduce the main methodology of our work, STRIVE, which
uses metaclassifiers and reliability indicators to produce improved classification performance. A key difference from standard metaclassification approaches is that reliability indicators enable the metaclassifier to weigh each
classifier according to its local reliability in the neighborhood of the current
prediction point. Furthermore, this approach empirically outperforms state-of-the-art metaclassification approaches that do not use locality. We then analyze
the contributions of the various reliability indicators to the combination model
and suggest promising features to consider when redesigning the base classifiers or designing new combination approaches. Additionally, we show how inductive
transfer methods can be extended to increase the amount of labeled training
data available for learning a combination model by collapsing data traditionally viewed as coming from different learning tasks.
Next, we briefly review online-learning classifier combination algorithms
that have theoretical performance guarantees in the online setting and consider
adaptations of these to the batch setting as alternative metaclassifiers. We then
present empirical evidence that they are weaker in the offline setting than methods which employ standard classification algorithms as metaclassifiers, and we
suggest future improvements likely to yield more competitive algorithms.
Finally, the combination approaches discussed are broadly applicable to
classification problems other than topic classification, and we emphasize this
with experiments that demonstrate STRIVE improves the performance of action-item detectors in e-mail — a task where both the semantics and base classifier performance are significantly different from topic classification.
Acknowledgments
First, I would like to thank the five most influential forces in my academic
life — Drs. Ron Richards, Bob Causey, Ray Mooney, Sue Dumais, and Jaime
Carbonell — for crafting my view of scientific inquiry, instilling me with computational curiosity, and equipping me with the tools to apply the former to the
latter. Additionally, I wish to acknowledge the efforts of each of my committee members, Drs. Jaime Carbonell, John Lafferty, Tom Mitchell, Sue Dumais,
and Eric Horvitz, for their helpful feedback and corrections. Their contributions have been invaluable, and any remaining errors are mine alone. Finally,
I most heartily thank Francisco Pereira and Kevyn Collins-Thompson whose
ability to find spare CPU cycles is only surpassed by their friendship.
Contents

1 Overview
  1.1 Introduction
  1.2 Worse Performance Does Not Mean Obsolete
      1.2.1 Examples from Synthetic Data
      1.2.2 Diversity and Accuracy in Real Datasets
  1.3 Thesis Statement
  1.4 Criteria for Evaluation
  1.5 Roadmap to This Document

2 Related Work
  2.1 Homogeneous Ensembles
  2.2 Heterogeneous Ensembles
      2.2.1 Other Relevant Approaches
  2.3 Related Work Using Locality
  2.4 Previous Applications to Text Problems
  2.5 No Free Lunch and Its Implications

3 Calibration
  3.1 Calibration and Related Concepts
  3.2 Recalibrating Classifiers
      3.2.1 The Need for Calibrated Probabilities in Other Applications
      3.2.2 Recalibration Problem Definition & Approach
      3.2.3 Estimating the Parameters of the Asymmetric Distributions
      3.2.4 Experimental Analysis
      3.2.5 Related Work
      3.2.6 Summary of Recalibration Methods

4 Locality
  4.1 "True" Posteriors, Log-odds, and Confidences
  4.2 Calibration & Locality
  4.3 Dependence & Locality
  4.4 Variance, Sensitivity, & Locality
  4.5 Local Reliability, Variance, and Dependence

5 Reliability Indicators
  5.1 Model-Specific Reliability Indicators
      5.1.1 Variables Based on the Unigram Classifier (Multinomial naïve Bayes)
      5.1.2 Variables Based on the naïve Bayes Classifier (Multivariate Bernoulli naïve Bayes)
      5.1.3 Variables Based on the kNN Classifier
      5.1.4 Variables Based on the Decision Tree Classifier
      5.1.5 Variables Based on the SVM Classifier
  5.2 Inputs for STRIVE (Document Dependent)
      5.2.1 Outputs of Base Classifiers
      5.2.2 Reliability Indicator Variables
  5.3 Task-Dependent Variables

6 Background for Empirical Analysis
  6.1 Classifier Performance Measures
      6.1.1 Classification Measures
      6.1.2 Probability Loss Functions
      6.1.3 Ranking Measures
      6.1.4 Summarizing Performance Scores
  6.2 Data
      6.2.1 Chronological Split vs. Cross-Validation
      6.2.2 MSN Web Directory
      6.2.3 Reuters (21578)
      6.2.4 TREC-AP
      6.2.5 RCV1-v2 (Reuters 2000)
  6.3 Base Classifiers
      6.3.1 Decision Trees
      6.3.2 SVMs
      6.3.3 Naïve Bayes (multivariate Bernoulli)
      6.3.4 Unigram (multinomial naïve Bayes)
      6.3.5 k-Nearest Neighbor
      6.3.6 Classifier Outputs
  6.4 Chapter Summary

7 Combining Classifiers using Reliability Indicators
  7.1 Introduction
      7.1.1 STRIVE: Metaclassifier with Reliability Indicators
  7.2 Experimental Analysis
      7.2.1 Performance Measures
      7.2.2 Experimental Methodology
      7.2.3 Results
      7.2.4 Discussion
  7.3 An Analysis of Reliability Indicator Usefulness
  7.4 RCV1-v2
  7.5 Summary and Conclusions

8 Inductive Transfer for Classifier Combination
  8.1 Introduction
  8.2 Applying Inductive Transfer to Combination
      8.2.1 STRIVE
      8.2.2 LABEL: Layered Abstraction-Based Ensemble Learning
  8.3 Experimental Analysis
      8.3.1 Base Classifiers
      8.3.2 Metaclassifiers
      8.3.3 Data
      8.3.4 Performance Measures
  8.4 Experimental Results
  8.5 Summary of Basic LABEL Approach
  8.6 Future Work
  8.7 Summary

9 Online Methods and Regret
  9.1 Online Learning
  9.2 Regret and Combining Classifiers
  9.3 Combination Algorithms with Regret Guarantees
  9.4 Empirical Analysis
      9.4.1 Combination Implementations
      9.4.2 Results and Discussion
  9.5 Reconciling Theory and Practice
  9.6 Chapter Summary

10 Action-Item Detection in E-mail
  10.1 Why Action-Item Detection?
  10.2 Related Work
  10.3 Problem Definition & Approach
      10.3.1 Problem Definition
      10.3.2 Approach
  10.4 Experimental Analysis for Action-Item Detection
      10.4.1 The Data
      10.4.2 Classifiers
      10.4.3 Performance Measures
      10.4.4 Experimental Methodology
      10.4.5 Baseline Results for Action-Item Detection
      10.4.6 Discussion
  10.5 Action-Item Detection vs. Topic Classification
  10.6 Classifier Combination for Action-Item Detection
  10.7 Reliability Indicators for Action-Item Detection
  10.8 Experimental Analysis of Combining Action-Item Detectors
      10.8.1 Classifiers
      10.8.2 Performance Measures
      10.8.3 Experimental Methodology
      10.8.4 Results for Combining Action-Item Detectors
  10.9 Summary

11 Summary and Future Work
  11.1 Key Contributions
  11.2 Directions for Future Work
  11.3 Summary
List of Figures

1.1 A typical text classification problem. A text classification algorithm takes as input a set of example documents. Each document is labeled by an authority with a set of classes (here the topics). The algorithm uses these examples to construct a model that with high accuracy can predict the topics the authority would have assigned to future documents. This particular type of text classification problem is called topic classification.

1.2 Schematic characterization of reliability-indicator methodology. The output of the classifiers is a graphical representation of a distribution over possible class labels.

1.3 Influence diagram for classifiers built using two conditionally independent views of the data.

1.4 (a) The pdf for X1 as well as the decision boundary used by optimal classifier Ŷ1; (b) the pdf for X2 as well as the decision boundary used by optimal classifier Ŷ2.

1.5 The correlation of conditionally-independent classifiers is largely determined by their error rates and the class prior. This graph is generated with two classifiers whose error rates are equal.

3.1 For a well-calibrated classifier, all points in a reliability diagram fall on the diagonal. In the long run, 0.6 (generally πi) of the items the classifier predicts to have 0.6 probability (generally probability πi) of belonging to the class actually do belong to the class. Additionally, a reliability diagram often has annotations indicating the frequency with which a certain value is predicted.

3.2 We are concerned with how to perform the box highlighted in grey. The internals are for one type of approach.

3.3 Typical view of discrimination based on Gaussians.

3.4 Gaussians vs. asymmetric Gaussians: a shortcoming of symmetric distributions. The vertical lines show the modes as estimated nonparametrically.

3.5 The empirical distribution of classifier scores for documents in the training and the test set for class Earn in Reuters. Also shown is the fit of the asymmetric Laplace distribution to the training score distribution. The positive class (i.e., Earn) is the distribution on the right in each graph, and the negative class (i.e., ¬Earn) is that on the left in each graph.

3.6 The fit produced by various methods compared to the empirical log-odds of the training data for class Earn in Reuters.

4.1 Classifier combination can be thought of as combining each classifier's estimate of the log-odds, λ̂i, via the latent variable representing the true log-odds, λ, to improve the prediction of the class c. That is, via p(λ̂1, …, λ̂n | c) = ∫ p(λ̂1, …, λ̂n | λ) p(λ | c) dλ.

4.2 A few examples of the distribution of p(λ̂ | c) for various choices of the prior on the true log-odds, p(λ), when the classifier's predictions are distributed normally around the true log-odds, λ̂ ∼ N(λ, 1). The prior used is a single Gaussian (left), a mixture of two Gaussians (middle), and a mixture of three Gaussians (right). 100K samples were drawn from each distribution. The asymmetry of the resulting distributions is very reminiscent of those seen in practice as shown in Section 3.2.

4.3 An influence diagram for two classifiers whose optimal combination is to allow the output of each (Ŷ1 and Ŷ2) to contribute independently to the final prediction. The input dimensions X1, …, Xk are independent of dimensions Xk+1, …, Xn given the class variable, though the interactions within the two feature sets may be arbitrarily complex (which is why they are depicted as within one box). One classifier's predictions (Ŷ1) depend only on the values the first feature set takes (X1, …, Xk), while the other classifier's predictions (Ŷ2) depend only on the values the second feature set takes (Xk+1, …, Xn).

4.4 A simple example where the input space has a single dimension, to illustrate the role of the ratio of standard deviations in a = (σλ/σλ̂) ρλ,λ̂. In the example, p(x) is uniform over [1, 10]. In this example, the initial predictions are correct on average: E[λ] = E[λ̂]. The predicted log-odds, λ̂, are perfectly correlated with the true log-odds, λ. That is, ρλ,λ̂ = 1, but a = 0.5 and b = 1.5. As can be seen from the correction using just a, the coefficient forces the variation/slope of the predictions to behave on average like the true variation. The resulting correction by b must take into account the compression and rotation caused by a.

4.5 The class-conditional distribution of feature values for two synthetic examples and their estimated forms using 100 training examples. The first (left) example constrains the class-conditional variances to be equal and uses LDA to train the model. The second (right) example has class-specific variances and uses QDA to train the model.

4.6 The posterior (left) and log-odds (right) for the example constrained to equal class-conditional variance.

4.7 The coefficient a (left) and additive correction term b to perform linear correction, estimated globally and locally using hold-out data for the example constrained to equal class-conditional variance. For this case, where both the true and estimated log-odds are linear, a single value of a and b is sufficient to perform perfect correction. The local estimation deviates from this at the edges because of data sparsity.

4.8 The posterior (left) and log-odds (right) for a 2-class example with class-specific variances.

4.9 The coefficient a (left) and additive correction term b to perform linear correction, estimated globally and locally using hold-out data for the example with class-specific variances. For this case, where both the true and estimated log-odds are non-linear, a global value of a and b is not adequate to perform perfect correction.

4.10 The locally linear and global corrections of the log-odds for the equal class-conditional variance example (left) and the non-equal example (right). As shown on the left, when both the true and estimated models are linear, global weights suffice to perform perfect correction. However, when either the true or estimated models are not linear, a locally linear model has the potential to perform far better correction, as shown on the right.

5.1 An example in Euclidean space of the kNN shifted instances produced for a query instance x using the other points shown as its neighborhood. The shifts are illustrated using cyan lines from the original instance. The nearness of neighbor 5 prevents the shifts toward neighbors 1–3 from being larger. In contrast, the shift toward neighbor 4 is fully half the distance since it is away from the other neighbors. Since a shift toward each neighbor is weighted equally, the net effect is that a shift toward a dense area is more likely.

5.2 The SVMlight solution with default C for an almost linearly separable problem. The decision boundary is shown with a solid line. The dashed lines show the limits of the margin. The support vectors are highlighted in black.

5.3 The contours for the score function, f(x), of the SVMlight solution with default C. The labels "A" and "B" fall at the same distance to the separator, but would we have equal confidence at predicting "red circle" at both points?

5.4 The same data as Figure 5.2 but with a large non-separable mass added. The set of support vectors (in green) has changed, but the decision boundary is close to the same. Is it still reasonable to assume the true log-odds is a (piecewise) linear transform of f(x)?

5.5 The contour plots of meanGoodSVProximity (left) and stdDevGoodSVProximity (right) appear to capture some of the motivating intuition. Note the negative values in the left plot near the nonseparable mass. In the right plot, the goodness variance rises in the nonseparable mass as well as in the regions to the side where it is unclear which mass examples belong to. Meanwhile, variance in the nicely separated region remains low and stable.

6.1 (a) At left, an example ROC curve using the conditionally independent classifier example of Section 1.2.1. (b) At right, the optimal combination of Classifiers 1 and 2 dominates both. The optimal combination has an error rate approximately half of Classifier 1 and a sixth of Classifier 2, but as the classifiers get closer to perfect classification, the graphical difference can appear deceptively small.

6.2 (a) At left, the isolines (or contours) connecting equal values of the F1 score in a Precision-Recall graph. The best performance is in the top right corner (red lines). (b) At right, the isolines connecting equal values of F1 in an ROC graph. The ROC graph has a free parameter of P(+) that must be specified to draw the contours. For this graph, P(+) = 0.10.

6.3 The effects of varying P(+) from 0.05 to 0.40 on the isolines of F1 in ROC space.

6.4 The effects of varying P(+) from 0.05 to 0.40 on the isolines of Error in ROC space.

6.5 The effects of varying P(+) from 0.05 to 0.40 on the isolines of Cost(FP = 10, FN = 1) in ROC space. Note that varying the costs of a linear utility function is exactly equivalent to varying the prior for the error scoring function.

6.6 The effects of varying P(+) from 0.05 to 0.40 on the isolines of Cost(FP = 1, FN = 10) in ROC space. Note that varying the costs of a linear utility function is exactly equivalent to varying the prior for the error scoring function.

7.1 Schematic characterization of reliability-indicator methodology. The methodology formalizes the intuition shown here that document-specific context can be used to improve the performance of a set of base classifiers. The output of the classifiers is a graphical representation of a distribution over possible class labels.

7.2 Portion of a decision tree, learned by STRIVE-D (norm) for the Business & Finance class in the MSN Web Directory corpus, representing a combination policy at the metalevel that considers scores output by classifiers (dark nodes) and values of indicator variables (lighter nodes). Higher in the same path, the decision tree also makes use of OutputOfUnigram and OutputOfSVMLight, as well as other indicator variables.

7.3 Typical application of a classifier to a text problem. In traditional text classification, a word-based representation of a document is extracted (along with the class label during the learning phase), and the classifiers (here an SVM and a Unigram classifier) learn to output scores for the possible class labels. The shaded boxes represent a distribution over class labels.

7.4 Architecture of STRIVE. In STRIVE, an additional layer of learning is added where the metaclassifier can use the context established by the reliability indicators and the output of the base classifiers to make an improved decision. The reliability indicators are functions of the document and/or the output of the base classifiers.

7.5 The ROC curve for the Home & Family class in the MSN Web Directory corpus from [0, 0.2].

7.6 The full ROC curve for the Home & Family class in the MSN Web Directory corpus.

7.7 For Stack-S (norm) and STRIVE-S (norm), change relative to the best base classifier — the SVM classifier. On the left, we show the relative change using thresholds optimized for F1, and on the right, we show the relative change using thresholds optimized for error. In both figures, we display the changes in the three components that determine F1: true positives, false positives, and false negatives. Not only does STRIVE-S (norm) achieve considerable reductions in error of 8-18% (left) and 5-16% (right), but in all but one case, it also increases by a fair margin the improvement attained by Stack-S (norm).

7.8 Each point presents the performance for a single class in the RCV1-v2 corpus. Improvement in F1 over the baseline SVM is shown on the left, while improvement in error is shown on the right. As is typical, both axes are given in the log-domain. In case of a zero denominator or numerator, the log-ratio is defined as 10 or −10, respectively. On the left we see that Stack-S (norm) severely decreases the F1 performance on several classes. On the right we see that (when performance differs from the baseline) both methods show a larger increase in performance according to error over the baseline as the class becomes more prevalent. Striving appears to require slightly more positive examples than stacking, which is expected given the higher dimensionality. The regression fits shown are fit only to the classes where the metaclassifier's performance differs from the baseline.

7.9 For Stack-S (norm) and STRIVE-S (norm), change relative to the best base classifier — the SVM classifier — over all the topic classification corpora. On the left, we show the relative change using thresholds optimized for F1, and on the right, we show the relative change using thresholds optimized for error. In both figures, we display the changes in the three components that determine F1: true positives, false positives, and false negatives. Not only does STRIVE-S (norm) achieve considerable reductions in error of 4-18% (left) and 3-16% (right), but in all but one case, it also increases by a fair margin the improvement attained by Stack-S (norm). Furthermore, STRIVE-S (norm) never hurts performance relative to the SVM on these performance measures, as Stack-S (norm) does over RCV1-v2 on the far left.

10.1 An e-mail with emphasized action-item, an explicit request that requires the recipient's attention or action.

10.2 The histogram (left) and distribution (right) of message length. A bin size of 20 words was used. Only tokens in the body after hand-stripping were counted. After stripping, the majority of words left are usually actual message content.

10.3 Both n-grams and a small prediction window lead to consistent improvements over the standard approach.

10.4 Users find action-items more quickly when assisted by a classification system.

10.5 ROC curves without (left) and with (right) error bars for the action-item corpus of two of the most competitive base classifiers versus Stacking and Striving. We see that Striving dominates the base classifiers and only loses for a small portion of the curve to Stacking. As expected, the variance of all of the classifiers drops as we move to the right. However, the variance for Striving drops far more quickly than the others. Both argue that Striving presents the most robust ranking of the documents.
List of Tables

3.1 Displayed is an example of the output distribution of two well-calibrated classifiers, π1 and π2, and some sample combination rules: normalized product (πP), average (πA), and the optimal combination given only the predictions, π* = P(c | π1, π2). Although both πP and πA improve over the base classifiers, neither is well-calibrated.

3.2 (a) Results for naïve Bayes (left) and (b) SVM (right). The best entry for a corpus is in bold. Entries that are statistically significantly better than all other entries are underlined. A † denotes the method is significantly better than all other methods except for naïve Bayes. A ‡ denotes the entry is significantly better than all other methods except for A. Gauss (and naïve Bayes for the table on the left). The reason for this distinction in significance tests is described in the text.

5.1 Various quantities for the example in Euclidean space illustrated in Figure 5.1. α is the amount example d is shifted toward each neighbor to produce di. Each row lists the Euclidean distances between the shifted point di and the original point d as well as each neighbor nj. The nearness of neighbor n5 prevents the shifted instances d1, d2, and d3 from shifting closer to neighbors n1, n2, and n3, respectively. Thus α for these shifted points is less than 0.5.

5.2 Effect on running time of computing the kNN reliability indicators for the Reuters 21578 corpus (9603 training examples, 3299 testing examples, 900 features used). The naïve algorithm scans all training examples each time. The sparse algorithm uses speed-ups based on sparsity and just performs basic prediction; we show one version using the standard number of neighbors and one using twice that. The final version also computes and writes the reliability indicators — using a neighborhood of k = 29 for prediction but 2k to compute the reliability indicators. For these comparisons, r-cut with r = 1 is used for prediction [Yan99].

7.1 Performance on MSN Web Directory Corpus. The best performance (omitting the oracle BestSelect) in each column is given in bold. A notation of 'B', 'D', 'S', or 'R' indicates a method significantly outperforms all (other) Base classifiers, Default combiners, Stacking methods, or Reliability-indicator based Striving methods at the p = 0.05 level. A blackboard (hollow) font is used to indicate significance for the macro-sign test and micro-sign test. A normal font indicates significance for the macro t-test. For the macro-averages (i.e., excluding micro F1), when both tests are significant, it is indicated with a bold, italicized font.

7.2 Performance on Reuters Corpus. (Bolding and significance notation as in Table 7.1.)

7.3 Performance on TREC-AP Corpus. (Bolding and significance notation as in Table 7.1.)

7.4 STRIVE-S Local (norm) uses a local product kernel of K(xi, xj) = [⟨ρ(xi), ρ(xj)⟩ + 1][⟨Π(xi), Π(xj)⟩ + 1], where Π(x) is the projection into the subspace consisting of the base classifier outputs and ρ is the identity function. The resulting kernel has a subset of the terms in a quadratic kernel. Restricting ρ to the subset of features included in STRIVE-D (norm) leads to substantially less overfitting and positive gains in one corpus.

7.5 In backward selection over the MSN Web testing set, deleting the variable that most improved the average logscore of the model allows us to rank the variables in rough order of impact by the average round in which a feature was deleted. A higher average rank means a feature has greater impact on the model.

7.6 In backward selection over the Reuters testing set, deleting the variable that most improved the average logscore of the model allows us to rank the variables in rough order of impact by the average round in which a feature was deleted. A higher average rank means a feature has greater impact on the model.

7.7 The average reduction in logscore across classes caused by deleting each variable individually from the final models in the MSN Web testing set. A negative score indicates that deleting the variable negatively impacts the models, since deleting it reduces the logscore. A score of zero indicates the variable has no impact on the models, while a positive score indicates the variable is included in the models but hurts them on average.

7.8 As Table 7.7, but for the Reuters testing set.

7.9 As Table 7.7, but for the TREC-AP testing set.

7.10 Performance on RCV1-v2 Corpus. (Bolding and significance notation as in Table 7.1.)

7.11 All results for the MSN Web Corpus discussed in this chapter. Ignoring BestSelect, the overall best in each column is shown in red bold and the overall worst is shown in blue italics.

7.12 All results for the Reuters Corpus discussed in this chapter. Ignoring BestSelect, the overall best in each column is shown in red bold and the overall worst is shown in blue italics.

7.13 All results for the TREC-AP Corpus discussed in this chapter. Ignoring BestSelect, the overall best in each column is shown in red bold and the overall worst is shown in blue italics.

7.14 All results for the RCV1-v2 Corpus discussed in this chapter. Ignoring BestSelect, the overall best in each column is shown in red bold and the overall worst is shown in blue italics.

8.1 Inductive Transfer Performance Summary over all Tasks.

9.1 Comparison of the Online Combiners over the MSN Web Corpus. The best performance (omitting the oracle BestSelect) in each column is given in bold. A notation of 'B', 'D', 'S', 'R', 'O', or 'I' indicates a method significantly outperforms all (other) Base classifiers, Default combiners, Stacking methods, Reliability-indicator based Striving methods, Online basic methods, or Indicator-based online methods at the p = 0.05 level. A blackboard (hollow) font is used to indicate significance for the macro-sign test and micro-sign test. A normal font indicates significance for the macro t-test. For the macro-averages (i.e., excluding micro F1), when both tests are significant, it is indicated with a bold, italicized font.

9.2 Comparison of the Online Combiners over Reuters. (Bolding and significance notation as in Table 9.1.)

10.1 Agreement of Human Annotators at Document Level.

10.2 Agreement of Human Annotators at Sentence Level.

10.3 Average Document-Detection Performance during Cross-Validation for Each Method and the Sample Standard Deviation (Sn−1) in italics. The best performance for each classifier is shown in bold.

10.4 Significance results for n-grams versus a bag-of-words representation for document detection using document-level and sentence-level classifiers. When the F1 result is statistically significant, it is shown in bold. When the accuracy result is significant, it is shown with a †. This table supports the hypothesis that n-grams, or a "bag of words and phrases," outperform a simple "bag of words."

10.5 Significance results for sentence-level classifiers vs. document-level classifiers for the document detection problem. When the result is statistically significant, it is shown in bold. This table supports the hypothesis that a sentence-level classifier outperforms a document-level classifier.

10.6 Performance of the Sentence-Level Classifiers at Sentence Detection.

10.7 Average base classifier and classifier combination performance during cross-validation over the Action-Item Detection Corpus. The best performance (omitting the oracle BestSelect) in each column is given in bold. The worst performance is given in italics. A notation of 'B', 'D', 'S', or 'R' indicates a method significantly outperforms all (other) Base classifiers, Default combiners, Stacking methods, or Reliability-indicator based Striving methods at the p = 0.05 level using a two-tailed t-test.

10.8 Summary of performance on the action-item detection task. The columns show the group names for which the row method is better (restricted to just those shown here). "Better" here means having a better average across cross-validation runs. When statistically significantly better (by two-sided t-test, p = 0.05), results are printed in a red bold italic font.
Notation

F — The set of all features in a classification problem.

X — The input domain of the training examples. Typically, X ⊆ R^|F|.

Y — The set of classes for a classification problem. Generally, we will refer to binary classification tasks where Y = {0, 1} or Y = {−1, 1}.

C — The set of all base classifiers.

Ci — A particular classifier i.

⟨xi, yi⟩ — A labeled example where xi ∈ X and yi ∈ Y.

c(x) — The class of example x, where x ∈ X.

D — The true distribution of examples.

P(c(x) = y | x) — The "true" posterior or conditional distribution of the class of the example, c(x), given x. This will often be abbreviated P(y | x).

p(x) — The "true" density of unlabeled examples.

p(x, y) — The "true" joint density of labeled examples.

P̂A, p̂A — Model A's estimate of a probability distribution P or density p, respectively.

ŷi — Classifier Ci's prediction about the class of an example, where the particular example is clear in context.

πi — Classifier Ci's posterior probability distribution over the classes of an example. That is, πi(c, x) = P̂Ci(c(x) = y | x). In the context of a binary classification problem, we will often use πi to denote the posterior of the positive class, i.e., P̂Ci(c(x) = 1 | x).

λi — Classifier Ci's posterior log odds of the class of an example. That is, λi(c, x) = log [ P̂Ci(c(x) = y | x) / (1 − P̂Ci(c(x) = y | x)) ]. In the context of a binary classification problem, we will often use λi(x) or simply λi to denote the log odds of the positive class, i.e., log [ P̂Ci(c(x) = 1 | x) / P̂Ci(c(x) ≠ 1 | x) ].
Other Definitions

Logit — The (natural) log of a probability divided by its complement: log [p / (1 − p)].

Odds — The quotient of the probability of an event and its complement: p / (1 − p).

Log Odds — The log of the odds of an event: log [p / (1 − p)]. This is equivalent to the logit for a two-class problem. Also used for the log of the probabilities of two mutually exclusive events, log (pi / pj) (see note below).

Odds Ratio — The quotient of the odds of two different events: [p (1 − r)] / [(1 − p) r].

Log Odds Ratio — The log of an odds ratio: log { [p (1 − r)] / [(1 − p) r] }. Sometimes called log odds when there is no risk of confusion with the above definition.

A note on Log Odds: Given n probabilities pi for n mutually exclusive and exhaustive events such that Σ_{i=1}^{n} pi = 1, it is unclear what the established terminology is for the quantity log (pi / pj). When n = 2, this is just the log of the odds of event i, but for n > 2 it does not reduce. Some (p. 96, [HTF01]) refer to this quantity as log odds, log odds ratios, and logits even though it does not reduce to any of the above forms. Furthermore, when n = 2, many parts of the literature say "Log Odds Ratio" when meaning "Log Odds" as defined above. We keep with the looser terminology, which is more prevalent in the literature. When clarity is necessary, we specify the intended meaning.
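As a small numerical illustration of these definitions (our example, not from the original text; Python with only the standard library):

```python
# Small numerical illustration of the definitions above (not from the thesis).
import math

p, r = 0.8, 0.4
odds_p = p / (1 - p)                          # odds of p: 4.0
logit_p = math.log(odds_p)                    # logit / log odds of p: ~1.386
odds_ratio = (p * (1 - r)) / ((1 - p) * r)    # odds ratio of p vs. r: 6.0
log_odds_ratio = math.log(odds_ratio)         # log odds ratio: ~1.792
```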
Chapter 1

Overview

1.1 Introduction
A text classification algorithm uses example documents that have been tagged with classes by an authority¹ to learn a model that, with high accuracy, can automatically predict the class the authority would have assigned to future documents. In cases like topic classification (Figure 1.1), where each example can belong to multiple topics, the problem is usually reduced to a series of binary classification tasks: Corporate Acquisitions vs. not Corporate Acquisitions, Earnings vs. not Earnings, etc.
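For the reader who wants to see this reduction concretely, here is a minimal sketch in modern terms (our illustration with invented toy documents, not code or data from this thesis), using scikit-learn's one-vs-rest wrapper to train one binary classifier per topic:

```python
# Minimal sketch (illustration only; toy data invented for the example):
# reducing a multi-topic problem to one binary classification task per topic.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = ["oil futures rallied today",
        "quarterly earnings rose sharply",
        "corn futures and earnings news"]
topics = [["Corn Futures"], ["Earnings"], ["Corn Futures", "Earnings"]]

Y = MultiLabelBinarizer().fit_transform(topics)  # one 0/1 column per topic
X = TfidfVectorizer().fit_transform(docs)
# One independent binary classifier per topic: "Earnings vs. not Earnings", etc.
model = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
```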
With the surge in digital text media, text classification has become increasingly important. Text classification techniques can assist in junk e-mail detection [SDHH98], allow
medical doctors to more rapidly find relevant research [HBLH94], aid in patent searches
[Lar99], improve web searches [CD00], and serve as a backend in a multitude of other
applications. The interested reader should see Sebastiani [Seb02] for a survey of recent
applications of machine learning to text classification.
Decision Trees, kNN, SVMs, language models, and naı̈ve Bayes are a few of the classification algorithms that have been developed [HMS66, Fri77, BFOS84, CH67, Vap00,
CV95, MK60, Abr63] and later used by researchers to address the problem of text classification [LR94, ADW94b, MLW92, Joa02, MN98, Lew92b]. Each of these models is generally designed using a different set of assumptions regarding the data. However, none of these algorithms dominates all text classification problems.² Furthermore, even when one classification algorithm significantly outperforms another for a given classification problem, it is rarely the case that the worse classifier's errors are a superset of the better classifier's. This fact has long motivated the desire to combine models in order to obtain better, or more robust, overall performance. Schemes to do this have varied widely, from simple voting to methods for including unlabeled examples.

¹ It doesn't matter for our purposes whether this "authority" is a single person, a committee, or any other entity. The only stipulation is that the labeling is consistent in the sense that the same authority labels the training documents and future documents.

² Although SVMs show perhaps the most robust behavior across a span of text classification problems.

Figure 1.1: A typical text classification problem. A text classification algorithm takes as input a set of example documents. Each document is labeled by an authority with a set of classes (here the topics). The algorithm uses these examples to construct a model that with high accuracy can predict the topics the authority would have assigned to future documents. This particular type of text classification problem is called topic classification.
Appropriately combining information sources to form a more effective output than any
of the individual sources is a broad field that has been researched in many forms. It can
be considered to contain sensor fusion, distributed data-mining, regression combination,
classifier combination, and even the basic classification problem; after all, the hypothesis a
classifier emits is just a specification of how the information in the basic features should be
combined.
Figure 1.2: Schematic characterization of reliability-indicator methodology. [Schematic: base classifiers (Decision Tree, SVM, Naive Bayes, Unigram) feed, together with document-specific context, into a metaclassifier.] The output of the classifiers is a graphical representation of a distribution over possible class labels.

Problems that arise in several situations motivate combining multiple learners. For example, it may not be possible to train using all the data because data privacy and security concerns prevent sharing the data; however, classifiers can be trained over different data subsets, and the predictions they issue may be shared. In other cases, the computational burden of the base classifier may motivate classifier combination: when a classifier with a nonlinear training or prediction cost is used, computational gains can be realized by partitioning the data and applying an instance of the classifier to each subset. In other situations, combining classifiers can be seen as a way of extending the hypothesis space or relaxing
the bias of the original base classifier. Boosting decision stumps [SFBL98, BK99], cascade
generalization [Gam98a, Gam98b], and stacking [Wol92] can all be seen as methods where
the hypothesis space of the combiner can fit a more general class of functions than the input
base classifiers. Finally, in many different situations, classifier combination can be used as
a way to balance the strengths and weaknesses of a set of classifiers in order to achieve
increased generalization performance.
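For readers who want a concrete reference point, the following is a minimal sketch of generic stacking [Wol92] in modern scikit-learn terms (our illustration on synthetic data, not the combination methods developed in this thesis): a metaclassifier is trained on the outputs of heterogeneous base classifiers.

```python
# Minimal sketch (illustration only): generic stacking [Wol92] -- a
# metaclassifier learns to combine the outputs of heterogeneous base models.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
stack = StackingClassifier(
    estimators=[("svm", LinearSVC()),
                ("nb", GaussianNB()),
                ("tree", DecisionTreeClassifier(max_depth=5))],
    final_estimator=LogisticRegression(),  # the metaclassifier
    cv=5,  # metatraining uses out-of-fold base predictions to avoid leakage
)
print(stack.fit(X, y).score(X, y))
```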
This work approaches classifier combination with increased performance as the primary
motivation, but the same methods are applicable for any of the above purposes. In order to
do so, we attempt to exploit the fact that the models learned by different classification algorithms have different error profiles. This is done by defining data-dependent characteristics
that can be tied to the likelihood a model predicts well in the context established by the current example. We focus on the local reliability, variance, and dependence of the classifiers
as the key data-dependent characteristics for classifier combination. Therefore, this work
is similar in flavor to Kahn’s [Kah04] without making his distributional assumptions for
classifier outputs, assumptions which often do not hold in practice. In order to capture these characteristics, we define a set of reliability indicators that we argue are tied to these characteristics
for text classification problems. The general approach is depicted in Figure 1.2.
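As a toy sketch of that idea (the indicator features below are hypothetical stand-ins invented for illustration; the actual reliability indicators are defined in Chapter 5), the metaclassifier sees the base classifiers' scores augmented with document-specific context features:

```python
# Minimal sketch (hypothetical feature names; the real indicators are defined
# in Chapter 5): augment base-classifier scores with document-specific context.
import numpy as np

def metaclassifier_input(doc_term_counts, base_scores):
    indicators = np.array([
        doc_term_counts.sum(),   # e.g., document length
        np.ptp(base_scores),     # e.g., spread/disagreement among base scores
    ])
    # The metaclassifier is trained on [scores, indicators] so it can weigh
    # each base classifier by its reliability in this local context.
    return np.concatenate([base_scores, indicators])

x = metaclassifier_input(np.array([3, 0, 1, 2]), np.array([0.9, 0.4, 0.7]))
```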
A secondary goal of this work is to provide a basis for understanding the interactions of a set of classifiers. We argue this enables machine learning practitioners to more readily weigh the trade-offs when choosing which methods to apply to a problem. This abstraction is key when using classifier combination to mitigate data privacy and security problems: concerned parties need only verify that they can share the predictions and indicators rather than the data itself.
Classifier combination is ultimately a hard problem where, as in other problems, the
optimal combination is rarely computable. Even weaker results, such as a combination
scheme that always outperforms its base input classifiers, have been shown to be theoretically unattainable [DHS01, Wol95]. Therefore, our goal will be to make substantial gains
when possible and to be within a small performance difference at other times.
Finally, our work shows that not only can these data-dependent characteristics be used
to construct a more effective classifier, but since they behave similarly across a set of related
problems, the data-scarcity problem can be somewhat alleviated during meta-learning by
sharing data via inductive transfer.
In order to situate the reader, the remainder of this chapter gives key examples to motivate the gains and elucidate the challenges of classifier combination before finally making
a more formal thesis statement and stating our evaluation criteria.
1.2 Worse Performance Does Not Mean Obsolete
Although theoretical results [DHS01, Wol95] indicate there is no a priori choice of algorithm that will perform best over all problems, experience has shown that some
algorithms can dominate large classes of problems. For example, most acknowledge that
SVMs with a linear kernel will perform at least as well as and usually outperform most
known methods in topic classification. As a result, Machine Learning researchers often
attempt to understand how well an algorithm (e.g. SVM) fits a set of problems (e.g. topic
classification) by empirically evaluating algorithms.
In contrast, even when an algorithm outperforms another algorithm across a problem
set, combining the algorithms can lead to better results than either alone. Thus, while
good empirical performance gives evidence that an algorithm’s assumptions match the underlying domain structure well, improvement from combination methods provides weak
evidence that the base classifiers are only capturing subdomain structure and failing to entirely capture the learnable structure of the problem. There are many concrete situations
where weak classifiers can help improve the performance of a strong classifier. In the following, we construct several examples which illustrate the potential gains from classifier
combination.
Figure 1.3: Influence diagram for classifiers built using two conditionally independent views of the data.
1.2.1 Examples from Synthetic Data
First, consider the case where there are two views of the data. For example, when classifying a web page, we might represent it as the text contained on the web page itself, or
we could view it as the text on the pages linked to by the web page. In the ideal case,
these two views would be independent given the class label. Figure 1.3 shows an influence
diagram for this simplified case. The generative process for this diagram can be thought of
as follows. First, the class, c, is chosen according to some prior, P (C). Then a distribution,
V1,c , which governs the first view generates the features X1 , . . . , Xk . A second distribution,
V2,c , then generates the features Xk+1 , . . . , Xn . The first classifier, Ŷ1 , uses only the first
feature set to make its prediction while the second classifier Ŷ2 uses the second feature set.
To make this concrete, consider a binary classification task where the classes are {−1, 1}
and we have only one feature per view. Each feature will be generated by a normal distribution. Let V1,c = N(5c, 3) and V2,c = N(c, 1), where the second argument denotes the standard deviation, and let Ŷ1 = argmax_{c∈C} P(c | X1) and Ŷ2 = argmax_{c∈C} P(c | X2). Assuming that each class is equally likely, P(−1) = P(1) = 0.5.
Then the first classifier will have an error of 4.78% and the second an error of 15.87%.
Even though the first classifier far outperforms the second, it is obvious in this case that
each classifier has information which can lead to a better combined decision. Now, how
good can the combination of these classifiers be?
At this point, the astute reader will have already taken note that Ŷ1 and Ŷ2 are making
predictions in accordance with the correct posterior over their respective feature sets, but
this does not mean the classifiers have access to the actual probability functions, P(c | X1)
and P (c|X2 ), just that they are correct with respect to the decision threshold. Assuming
that we have two such classifiers and the combination function only has access to Ŷ1 and
Ŷ2 , then it is well-known that the best we can do can be expressed as a linear combination
of the classifier predictions,³ specifically $\operatorname{sign}\!\left( \log \frac{P(+)}{P(-)} + \sum_i w_{i,\hat{y}_i}\, \hat{y}_i \right)$, where the weights are a function of the classifiers' prediction accuracies:

\[
w_{i,\hat{y}_i} =
\begin{cases}
\log \dfrac{P(\mathrm{Correct\ Positive}_i)}{P(\mathrm{False\ Positive}_i)} + \log \dfrac{P(-)}{P(+)} & \text{if } \hat{y}_i = +1 \\[2ex]
\log \dfrac{P(\mathrm{Correct\ Negative}_i)}{P(\mathrm{False\ Negative}_i)} + \log \dfrac{P(+)}{P(-)} & \text{if } \hat{y}_i = -1.
\end{cases}
\tag{1.1}
\]

Figure 1.4: (a) The pdf for X1 as well as the decision boundary used by optimal classifier Ŷ1 (the decision boundary at 0.0 has an error of 4.78%); (b) the pdf for X2 as well as the decision boundary used by optimal classifier Ŷ2 (the decision boundary at 0.0 has an error of 15.87%).
With a class prior of 0.5, our example classifiers have a symmetric distribution of false positives and false negatives. Therefore, the best combination of these two classifiers would still
have an error rate of 4.78%, equal to that of the best classifier. While the error rate would
improve if we had more conditionally-independent classifiers, the problem here is that the
weaker classifier cannot overpower the stronger classifier because there is no information
about whether an example lies near the stronger classifier’s decision threshold.
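To make the rule concrete, the following is a minimal sketch of Eq. (1.1) applied to our running example (the helper names are illustrative). The weights are written with class-conditional rates, under the reading that Eq. (1.1)'s quantities are joint probabilities, so the prior-correction terms cancel:

    import numpy as np

    def vote_weights(tpr, fpr, tnr, fnr):
        # weights of Eq. (1.1) written with class-conditional rates,
        # so the prior-correction terms cancel out
        return {+1: np.log(tpr / fpr), -1: np.log(tnr / fnr)}

    def combine(preds, weights, log_prior_odds=0.0):
        # sign( log P(+)/P(-) + sum_i w_{i, yhat_i} * yhat_i )
        total = log_prior_odds + sum(w[p] * p for p, w in zip(preds, weights))
        return 1 if total >= 0 else -1

    # symmetric error rates from the synthetic example
    e1, e2 = 0.0478, 0.1587
    w1 = vote_weights(1 - e1, e1, 1 - e1, e1)   # ~2.99 for either prediction
    w2 = vote_weights(1 - e2, e2, 1 - e2, e2)   # ~1.67 for either prediction

    # since w1 > w2, the combined vote always agrees with the stronger
    # classifier, so the combination's error remains 4.78%
    print(combine([+1, -1], [w1, w2]))   # -> 1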
On the other hand, if the classifiers actually issue the true probability estimates, P (c|X 1 )
and P (c|X2 ), along with their class predictions, then the optimal combination now has an
error rate of 2.63%. This gives nearly a 50% reduction in error over the best classifier
by combining it with a classifier that is "3 times worse" according to error! For this example, the gains from the optimal combination in other settings fall between these two extremes. For example, a more reasonable case is when the classifiers produce probability estimates that
lie on the correct side of the decision threshold with respect to their feature set, but the
probabilities themselves are not correct. Of course, in the most common case in practice, the classifiers produce probabilities that are not accurate, nor do they always yield a consistent decision.

³ For the two-class case, it is possible to write a single weight for each classifier rather than making the weight a function of the classifier's prediction. The form is equivalent but more tedious to write out, and the interpretation is less obvious.

Figure 1.5: The correlation of conditionally-independent classifiers is largely determined by their error rates and the class prior. This graph is generated with two classifiers whose error rates are equal.
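The error rates quoted in this example can be checked analytically; with equal priors, the optimal rule thresholds the combined log-odds (10/9)x1 + 2x2 at zero. A quick verification sketch:

    from math import sqrt
    from scipy.stats import norm

    e1 = norm.cdf(-5 / 3)    # P(X1 on the wrong side of 0 | class) ~ 0.0478
    e2 = norm.cdf(-1.0)      # ~ 0.1587

    # conditioned on c = +1, the statistic (10/9)*X1 + 2*X2 is Gaussian
    # with mean (10/9)*5 + 2 and variance (10/9)**2 * 9 + 4
    mu = (10 / 9) * 5 + 2
    sd = sqrt((10 / 9) ** 2 * 9 + 4)
    e_opt = norm.cdf(-mu / sd)   # ~0.026, the quoted ~2.63% up to rounding

    print(round(e1, 4), round(e2, 4), round(e_opt, 4))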
Taken together, this discussion highlights that one of the challenges in classifier combination lies in dealing with the output types of the classifiers, whether probabilities, scores, or predictions, as well as their quality.
Finally, it is worth noting the subtle point that analyzing the correlation alone will not
indicate whether classifiers have conditionally-independent predictions. In our particular
example, the Pearson correlation coefficient between the two classifiers’ predictions shows
a borderline strong correlation of 0.6174 simply because both classifiers are relatively accurate and therefore tend to agree. If we change the accuracy of Ŷ2 by changing the standard
deviation of the second feature to 0.6, the correlation coefficient jumps to 0.8180. However, regardless of the correlation, in both cases the classifiers provide distinctly different
information because they are conditionally-independent given the class.
In these examples, it is the class prior and the error rates of the classifiers that largely determine how strongly correlated the predictions are. To remind the reader, the Pearson correlation coefficient is ρ(Ŷ1, Ŷ2) = Cov(Ŷ1, Ŷ2) / (σ_Ŷ1 σ_Ŷ2). Assuming the
classifiers are relatively accurate, a class prior close to 0.5 will result in the predictions
having a large standard deviation (close to the max of 1) which will require a large number
of disagreements in the classifiers’ predictions to achieve a low correlation. This is unlikely
by chance if they are both accurate. As the class prior goes to zero or one, the number of
disagreements needed to drive the correlation to zero shrinks rather rapidly. Figure 1.5 will
help the reader visualize this for the case where Ŷ1 and Ŷ2 are constrained to have the same
accuracy.
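These correlation values follow from a short calculation: with equal priors and symmetric errors, each prediction in {−1, +1} has mean zero and unit variance, so under conditional independence ρ(Ŷ1, Ŷ2) = E[Ŷ1 Ŷ2] = (1 − 2e1)(1 − 2e2). A quick check:

    from scipy.stats import norm

    e1 = norm.cdf(-5 / 3)                 # ~0.0478
    e2 = norm.cdf(-1.0)                   # ~0.1587
    print((1 - 2 * e1) * (1 - 2 * e2))    # ~0.6174

    e2b = norm.cdf(-1 / 0.6)              # std dev of X2 changed to 0.6
    print((1 - 2 * e1) * (1 - 2 * e2b))   # ~0.8180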
At this point, the reader should be convinced that even strongly correlated predictions
between classifiers are not sufficient to dismiss the hypothesis that a combination will outperform either classifier.
1.2.2 Diversity and Accuracy in Real Datasets
The idealized examples presented in section 1.2.1 highlight two necessary conditions for
benefitting from classifier combination. First, the base classifiers must be diverse in that
they give different predictions. In the idealized example, this diversity came because the
classifier predictions were conditionally-independent given the class. Second, the classifiers must be fairly accurate in addition to being diverse. Clearly, diversity can be easily
obtained by random predictions unless this second constraint is enforced.
In the context of this dissertation, we use standard classification algorithms to obtain
both diversity and accuracy. These algorithms include Decision Trees, kNN, SVMs, simple
language models, and naïve Bayes. These algorithms have been empirically evaluated
in a variety of text classification problems, and it is known that they are anywhere from
reasonably accurate to state-of-the-art. It is less clear that the predictions they issue should
be diverse especially when trained over the same data.
We argue that the fundamentally different assumptions underlying the algorithms give
rise to models that fit different parts of the data more or less well. Finding the conditions
under which these classifiers can be combined to obtain better performance is therefore
a precursory step to determining how and why any single classifier’s assumptions are not
sufficient to fully model the classification task at hand. We have chosen these algorithms
not only because they are known to perform well but also because they are different types of
classifiers along several different dimensions. The SVM, language model, and naïve Bayes algorithms we investigate produce a linear decision boundary, whereas the kNN and Decision Tree classifiers are non-linear. Along another dimension, the language model and naïve Bayes algorithms we employ are generative classifiers, whereas the SVM, kNN, and Decision Tree classifiers are discriminative.
In addition to these foundational motivations for our choice in base classifiers, empirical
evidence bears out the fact that their error profiles differ in a number of ways. When
viewed using ROC curves, there are typically cross-overs in the curve indicating that no
single model is appropriate for all cost functions. Finally, an optimistic estimate of the
performance ceiling for combination, such as reducing error to only those examples where
all of the base classifiers are wrong, results in an error rate far better than that of the best classifier.
1.3 Thesis Statement
While classifier combination techniques comprise a broad field, this dissertation will focus solely on how context-sensitive metaclassification techniques can be used to improve
generalization performance.
An initial attempt to address this issue might start by requiring the classifiers to output a probability distribution over possible class labels and then combining these estimates via a simple strategy,
such as a constant-weighted linear sum. While this seems like an inviting avenue to pursue,
obtaining “good” probability estimates can be problematic. We review our early investigations [Ben00, Ben02] that demonstrated how to convert a classifier’s scores to probability
estimates that are of quality better than or comparable to other methods for calibrating
classifier scores. While this is a step in the right direction, it does not address one of the
primary challenges of classifier combination — namely estimating the dependencies of the
classifier outputs. Furthermore, it fails to take advantage of the fact that the reliability of
a classifier’s predictions can vary across the input space. Additionally, the correlation between two classifiers’ predictions may vary locally as well — in some areas, showing high
dependence and in others being largely independent.
Of course, the characterization of context is the operative factor in the distinctions the
metalevel classifier can make. Other methods that have tried to leverage context have typically used only the classifier outputs or the outputs and all of the base features as context
[TW99, Gam98a, Gam98b, TT95, MP99]. In order to make finer local discriminations, we
wish to use more context than the classifier scores, but because of the high dimensionality
of text, using all the base features as a representation of context is a poor choice because
of the amount of data needed to accurately learn such a model. Following the work that
inspired this model [TH00], we introduce a set of reliability-indicator variables that are a
low-dimensional rich abstraction of the discriminatory context provided by a document for
learning.
In contrast to Toyama & Horvitz’s work [TH00], we give formal definitions for the local dependence, reliability, and variance of a set of classifiers, and then we define indicator
variables that are either direct or indirect approximations of these statistics. Other combination work [Kah04] relates the combination weights to these quantities when the classifier’s
log-odds predictions are assumed to follow a certain generative form. Our work can be seen
as generalizing this framework. Even when a generative form is not assumed, these quantities are at least necessary, although, in practice, we can rarely compute them directly and accurately. Therefore, the fundamental assumptions underlying the metaclassifier approach
are: (1) Since we know these quantities are necessary for classifier combination, we will
gain by reducing the dimension of the metaclassifier space from the original input space to
approximations of these quantities; (2) Assuming these quantities are sufficient for classifier combination, it is not necessary to explicitly know the "locality"⁴ of an example —
documents that are similar in the metaclassifier space will have similar combination rules
since these statistics are assumed sufficient.
Furthermore, we generalize the reliability-indicator characterization of context in a way
that enables using labeled data from separate learning tasks to learn an improved combination policy across all tasks. This can be done if we treat the metaclassifier as an abstraction
from discriminating a specific topic (e.g., Corporate Acquisitions vs. not Corporate Acquisitions) to the problem of discriminating topic membership in general (i.e., In-Topic vs.
Out-of-Topic). The base-level classifiers that are trained on a particular topic are used as
the representation of topic-specific knowledge, while the metaclassifier provides information about how to leverage context across topic-classification in general. Such an extension
is only possible if we generalize the reliability indicators away from linkages to the precise words in a document. Consider when “shares” occurs in a document in the Corporate
Acquisitions discrimination task and "corn" occurs in a document in the Corn Futures discrimination task; one simple task-invariant representation of context at the metalevel might
transform both of these to: Is the word with maximum mutual information for the current
task present in this document? This representation enables the metaclassifier to use information about how document-specific context influences topic discrimination across a wide
variety of text classification tasks. The success of this abstracted model depends critically
upon our ability to find a “normalized” representation that captures how topic classification
decision boundaries co-vary with the statistical properties of language. We present empirical evidence that this approach can, in fact, succeed — increasing the amount of labeled
data available to build the metaclassifier by pooling it. We discuss possible issues that
might arise when conglomerating the data and offer solutions to these practical problems.
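As an illustration of how such a task-invariant indicator might be computed, the sketch below is hypothetical (the matrix, label, and vocabulary names are illustrative, not the dissertation's implementation):

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def max_mi_indicator(X_presence, y, vocab):
        """X_presence: (n_docs, n_words) binary word-presence matrix for one
        task; y: binary topic labels; vocab: list of words. Returns the word
        with maximum mutual information with the label and, per document,
        the task-invariant meta-feature 'is that word present?'."""
        mi = np.array([mutual_info_score(y, X_presence[:, j])
                       for j in range(X_presence.shape[1])])
        j_star = int(mi.argmax())
        return vocab[j_star], X_presence[:, j_star]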
Finally, since the empirical evaluations of this model have focused primarily upon how
topic classification tasks relate to language-use distributions, we demonstrate how it can be
extended to other domains of text classification as well, specifically action-item detection in e-mail documents.

⁴ We refer to a combination method as using "locality" if the algorithm can induce a model that cannot be expressed as a constant-weighted linear combination of a classifier's predictions, probabilities, scores, or log odds. Thus both the representation and the algorithm determine whether an approach uses locality. For example, a linear SVM metaclassifier applied to the log-odds of a classifier's predictions does not use locality, but a decision tree algorithm, which can learn a non-linear function of the classifier's outputs, uses locality.
Throughout all of the work, we argue the following thesis: Context-dependent combination procedures provide an effective way of combining classifiers and are generally superior to constant-weighted linear combinations of the classifiers' estimates of the posterior or log-odds. Furthermore, context can be leveraged in text classifier combination via an abstraction of the local reliability, dependence, and variance of the base classifier outputs. Finally, these abstractions help identify opportunities for data re-use that can be employed to significantly improve classification performance.
1.4 Criteria for Evaluation
As discussed earlier, it is not possible to construct a metaclassifier that always outperforms its base classifiers. As a demonstration of the suitability of these methods for text
classification though, we set the goal of statistically significantly outperforming the current
state-of-the-art base classification methods over several standard text classification corpora.
In addition, to prevent overtuning to specific corpora, we have chosen the corpora for their
breadth and have completely withheld performing experiments on one corpus until the final
stages of this dissertation. Furthermore, since we argue that our representation of context
is key, we will empirically demonstrate that these methods outperform simple constant-weighted combinations of the classifier outputs in some corpora and, in the remaining ones,
achieve a statistically negligible difference.
1.5 Roadmap to This Document
The remainder of the document is laid out as follows. First, we describe related work. From
there, we discuss why obtaining quality probability estimates from a classifier based solely
on its output is problematic and introduce improved methods that rely on asymmetry. Using
the same framework of analysis, we motivate and define the local reliability, dependence,
and variance of a classifier. Based on these insights, we lay out the set of reliability indicators we use, show situations where they can be helpful, and discuss the computational and
practical implications for computing them. Before delving into the key contributions of this
dissertation, we briefly review our evaluation methodology, the implementation details of
the base classifiers, and a variety of performance measures that can be used to evaluate the
effectiveness of text classifiers. Then, we describe the STRIVE framework, which uses a
standard classifier as the metaclassifier and the reliability-indicator representation to build
a more effective classifier. We follow this with a description of how the characterization of
context used in STRIVE can be generalized to build a domain-level metaclassifier that increases the pool of data that can be used for building a single model. In the next chapter we
consider alternative metaclassifiers based on online learning with performance guarantees,
conduct an empirical analysis of them, and discuss why these methods do not currently
yield performance comparable to offline metaclassification approaches. In the following
chapter, we demonstrate that these variables and representations are applicable to other text
classification problems such as e-mail action-item detection. Finally, we summarize our
contributions and highlight important directions for future work.
Chapter 2
Related Work
Appropriately combining information sources to form a more effective output than that of
any of the individual sources is a broad topic that has been researched in many forms. The
challenges of integrating information have gone under the labels of diagnosis [HBH88],
pattern recognition [DHS01], sensor fusion [Kle99], multistrategy learning [MT94], distributed data mining [KC00], and a variety of ensemble methods [Die00]. Diagnosis centers on identifying disorders from multiple pieces of evidence, such as reasoning about
probability distributions over a patient’s diseases from a set of symptoms and test results.
Pattern recognition and sensor fusion typically address challenges with integrating information from multiple modalities (e.g., auditory and visual) while distributed data mining
addresses how results retrieved from distinct training data sets can be unified to provide one
coherent view to the user. Multistrategy learning methods have focused primarily on combining methods from different paradigms (e.g., abductive and inductive methods). Ensemble methods are methods that first solve a classification or regression problem by creating
multiple learners that each attempt to solve the task independently, then use a procedure
specified by the particular ensemble method for selecting or weighting the individual learners. Ensemble methods include such examples as Bayesian averaging [Lea78, HMRV98],
bagging [Bre96], boosting [Sch90, Fre95, FS97], stacking [Wol92], cascade generalization
[Gam98a, Gam98b], hierarchical mixture of experts [JJ94], and this dissertation.
This chapter presents a sampling of the key works in the literature that are related to this
work.1 First, we review the two primary types of ensemble methods: those that combine
different models obtained from the same classification algorithm (homogeneous ensembles) and those that combine models obtained from different classification algorithms (heterogeneous ensembles).

¹ Additional work that is relevant to particular sections but not to the core focus of this work is discussed in the appropriate chapter.

Our proposed work focuses on heterogeneous ensembles. We then
specifically highlight combination methods that have employed some notion of locality as
well as specific applications of combination methods to text problems. Since calibration is
central to our exposition of the roles of local reliability, dependence, and variance in classifier combination, we also conduct a brief survey of previous attempts to improve probability
estimates or obtain calibrated estimates from a single classifier. Finally, we conclude with
a discussion of the No Free Lunch theorem — a well-known negative theoretical result
regarding classification and classifier combination.
2.1 Homogeneous Ensembles
Homogeneous ensembles are combination methods that combine different models obtained
from multiple runs of the same classification algorithm. The models may differ for a variety of reasons. For example, when combining neural networks, each model instance in the
ensemble could be the result of training after a random initialization of the model weights
[HS90]2 ; thus, each training run might settle on a different local error minimum. Alternatively, for decision trees, each tree could be obtained by training over a different sample
obtained by sampling randomly with replacement from the training data [Bre96]. Homogeneous ensemble methods include Bayesian averaging, bagging, boosting, and hierarchical
mixture of experts.
The primary focus of many homogeneous ensembles is to account for model parameter
uncertainty that results from noise in the data and having estimated the model parameters
with finite data. Bayesian averaging works directly with this concept and weights each hypothesis (possible model) by its probability of being the correct hypothesis according to a
chosen prior distribution. While Bayesian averaging is one of the oldest developed combination methodologies [Lea78], it has only recently become computationally feasible to deal
with large or possibly infinite hypothesis spaces via sampling or other techniques. Hoeting
et al. [HMRV98] study and give a survey of recent approaches to Bayesian averaging.
Model uncertainty can also be thought of as the variance in the model parameters if
different training sets of the same size were drawn from the same underlying distribution an
infinite number of times. Breiman [Bre96] introduces bagging (bootstrap aggregation) from
this approach to model uncertainty. A bootstrap sample of the same size as the training set is drawn by sampling the training set with replacement. Then, the bootstrap sample is given to the classification algorithm to obtain a model. This process is repeated N times (for some large N), and the resulting ensemble is combined via majority vote.

² Merz's [Mer98] definition of "homogeneous" and "heterogeneous" differs slightly. In his terminology, "heterogeneous" refers to any difference in the learning algorithms. Thus, different neural network models obtained by different random initializations of network weights would be termed "heterogeneous" by his approach.
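A minimal sketch of the bagging procedure just described, assuming labels in {0, 1} and a decision tree as the base learner (the function names are illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging(X, y, n_models=50, seed=0):
        rng = np.random.default_rng(seed)
        n = len(y)
        models = []
        for _ in range(n_models):
            idx = rng.integers(0, n, size=n)   # bootstrap: sample with replacement
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def majority_vote(models, X):
        votes = np.stack([m.predict(X) for m in models])   # (n_models, n_examples)
        # per-example majority over the ensemble (labels assumed in {0, 1})
        return (votes.mean(axis=0) >= 0.5).astype(int)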
Boosting, as it has typically been applied, also constructs an ensemble of models by
obtaining each model from a training run over a different distribution on the training set.
However, each model is produced sequentially and the weight each training example is
given is a function of the number of previously trained models that predicted the example’s
class incorrectly. In this way, each successive model focuses more on the “hard examples”
that earlier models mispredicted. Boosting can often be better viewed as a feature-selection method rather than a variance-reduction method like the previous two. The reason is that boosting's primary empirical success has been with classification algorithms in which each model does not attempt to fully solve the problem. One such example is boosting decision stumps [SFBL98]; a decision stump is a decision tree of depth one. In this sense, boosting can be seen as an attempt to avoid "data-fracturing" by choosing a set of features such that every example places at least some weight on a chosen predictor.
For those interested in alternative methods of using “all the data”, Domingos [Dom94] uses
a heuristically guided hill-climbing search to induce a series of classification rules over all
the data and weights the final combination of rules according to a combined measure of
coverage and precision.
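To make the sequential reweighting concrete, the following is an AdaBoost-style sketch with decision stumps, one simple instantiation of the idea rather than any of the specific variants cited above; labels are assumed to be in {−1, +1}:

    import numpy as np

    def boost_stumps(X, y, rounds=50):
        """AdaBoost with depth-one stumps; y must be in {-1, +1}."""
        n, d = X.shape
        w = np.full(n, 1.0 / n)          # example weights
        ensemble = []                    # (alpha, feature, threshold, polarity)
        for _ in range(rounds):
            best = None
            for f in range(d):           # exhaustive search over stumps
                for t in np.unique(X[:, f]):
                    for pol in (+1, -1):
                        pred = np.where(pol * (X[:, f] - t) >= 0, 1, -1)
                        err = w[pred != y].sum()
                        if best is None or err < best[0]:
                            best = (err, f, t, pol, pred)
            err, f, t, pol, pred = best
            err = min(max(err, 1e-10), 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)
            w *= np.exp(-alpha * y * pred)     # upweight mispredicted examples
            w /= w.sum()
            ensemble.append((alpha, f, t, pol))
        return ensemble

    def boost_predict(ensemble, X):
        score = sum(a * np.where(p * (X[:, f] - t) >= 0, 1, -1)
                    for a, f, t, p in ensemble)
        return np.sign(score)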
Hierarchical mixture of experts (HME) [JJNH91, JJ94] has a quite different flavor from
the other homogeneous ensembles discussed. This method can be thought of as a tree
with classification models at the leaves and weighting functions at the internal nodes. The
models at the leaves make a prediction, and the predictions are “blended” by the weighting
functions, which act as gates as the predictions propagate up the tree. The weighting functions
are functions of the input as well. Therefore, the weighting functions can provide a soft
partitioning of the input space. The authors first motivated the use of HME as another
method that avoids fracturing the data as divide-and-conquer methods do, and instead use
all of the data in combination with simple (high-bias) estimators at the leaves to strike
a more favorable bias-variance tradeoff. Because the weighting functions are functions
of the input, HME is the first method we have discussed that has a notion of locality.
The weighting functions typically used are generalized linear functions (a fixed non-linear function of a linear transform of the input). Typically, the model parameters and the
weighting functions are trained in conjunction. Therefore, depending on the type of experts
used at the leaves, estimating the parameters may be computationally intensive. As each
example is seen, the gating function gives increased weight to the experts that perform well
on it, and when the experts commit errors during training, they are updated according to
the amount of weight that the gating network placed on them. Therefore all experts are not
generally trained over the same data. Thus, this approach shares similarity with boosting
in that it addresses how a complex task can be broken down into the combination of simple
classifiers trained over altered data distributions. The authors do not specifically address how fixed models can be combined, although the scheme could be altered to do so; in this case, it would become an instance of local cascade generalization (discussed later) where the gating function is essentially the metaclassifier being trained.
In contrast, our approach focuses on richer definitions of locality and uses these in
combining the outputs of models from different inductive algorithms. Thus, as opposed
to homogeneous methods where the individual models result from the same algorithm, we
seek to also draw insights into how the strengths of the various classification algorithms are
effectively employed by the combination algorithm.
2.2 Heterogeneous Ensembles
Heterogeneous ensembles are combination methods that combine different models obtained from different classification algorithms. While technically a heterogeneous ensemble could be applied to different models obtained from the same algorithm, they are different in that they typically stem from one of the following two motivations: (1) if the errors
of a set of classifiers are independent,³ then the error rate of an appropriate combination of those classifiers drops exponentially fast with the number of classifiers [Die00]; (2) each
classifier has a bias, or a restriction on the set of functions it can learn, and by combining
different classification algorithms it is possible to relax the bias and learn more expressive
models (as always at a cost in terms of variance) [Gam98a, Gam98b].
Since a classification algorithm often outputs a model that performs well but disagrees
on some examples with a model obtained from a different classification algorithm, researchers have often simply assumed they provide independent sources of information and
combined them accordingly. Majority vote and constant-weighted sums of outputs have
operated under this assumption with varying empirical success; we discuss particular applications of these rules to text problems below.
³ At this point, a common misunderstanding must be pointed out. The phrase "errors of a set of classifiers are independent", often used in the literature, is sometimes mistakenly confused with "the outputs of the classifiers are independent". Obviously, we do not expect the outputs of the classifiers to be independent; they are learning the same function. Instead, a more precise statement would refer to the classifiers' outputs being conditionally-independent given the class label, as discussed in Section 1.2.1.

An alternative to assuming the classifier outputs provide independent information about the class is to take into account the fact that they may be partially dependent. Merz and
Pazzani [Mer98, MP99, Mer99] do this by introducing methods which use an intermediate representation obtained via singular value decomposition. They introduce a method for
combining regression estimates (PCR∗ ) and one for combining classifier outputs (SCANN).
The first uses principal components analysis to factor the covariance matrix of the regression estimates, followed by a validation step to determine the number of principal vectors to
retain, and then performs regression using the retained components to determine how heavily to weight each method. The final form of the learned function uses a constant-weighted
linear combination and thus is not a local method in our terminology. For classification,
the class predictions of the classifiers are obtained and correspondence analysis is applied
to the prediction matrix to determine the canonical variates. Similar to the regression approach, a validation step is used to determine the number of variates to keep and the final
set of weights is determined using a nearest neighbor approach in the space of the remaining variates. The final form of the learned function uses a single weight per class. Thus
the classification approach has some use of locality, although, since we decompose problems into binary prediction tasks, we can rewrite such a model for binary classification as a model that uses a single set of weights.
Another approach to taking dependency into account is taken by those models that use
a metaclassifier (as in Figure 1.2). Two such approaches are stacking and cascade generalization. Wolpert introduced stacking in [Wol92], and more recently, others have studied
its effectiveness more thoroughly [TW99]. Stacking trains a metaclassifier over the outputs of the lower level classifiers. Stacking can thus account for dependencies between the
classifiers by observing them in the training data for the metaclassifier model. Because
the metaclassifier can also generalize according to how close the examples are in terms
of the base classifier outputs, stacking⁴ also has a sense of locality, but one restricted to
nearness in base classifier outputs. Cascade generalization [Gam98a] is similar to stacking,
but where stacking performs cross-validation to obtain outputs from the base classifiers to
train the metaclassifier, cascade generalization simply trains the base classifiers once and
gives the metaclassifier the base classifiers’ predictions over the training set. Thus, Gama
argues that cascade generalization is more appropriately seen as relaxing the bias (or extending the expressiveness) of the base models. Local cascade generalization [Gam98b]
also includes all of the base features of the example when training the metaclassifier. As
mentioned above, using all of the base features is often not feasible for text problems because of the high dimensionality of text. Thus, our work attempts to find a low-dimensional
representation of locality. Additionally, the representation aids in the interpretability of the
resulting models.
⁴ Depending on which metaclassifier model is used.
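A minimal stacking sketch in the spirit of the description above, assuming binary labels and base models that expose a decision_function (the particular model choices are illustrative):

    import numpy as np
    from sklearn.model_selection import cross_val_predict
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    def stack(X, y, base_models, meta=None):
        # cross-validated outputs keep the metaclassifier's training data
        # honest: each base-level score is produced on a held-out fold
        Z = np.column_stack([
            cross_val_predict(m, X, y, cv=5, method="decision_function")
            for m in base_models
        ])
        meta = meta or LogisticRegression()
        meta.fit(Z, y)
        for m in base_models:            # refit the bases on all data for use
            m.fit(X, y)
        return base_models, meta

    # example usage: bases, meta = stack(X, y, [LinearSVC(), LogisticRegression()])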
In recent work, Caruana et al. [CNM04] present a method for performing simple globally weighted combinations of classifiers. Their approach builds a large library of classification models (several thousand) by varying the parameters of several algorithms and then uses a simple hill-climbing approach to select, with replacement, a set of models to vote equally. Correlation is an even greater challenge for their framework since the majority of models result from varying a parameter smoothly for a base classification algorithm. Their empirical investigation includes several oddities, however. They restrict the data available to the metaclassifier by using a small validation set. Because of the highly correlated models, this puts many of the meta-algorithms at a disadvantage. They probably would have performed better if only given a small set of base models using default parameters. Additionally, they
do not compare their methods to methods explicitly targeted at handling correlation such
as Merz’s work discussed above.
2.2.1 Other Relevant Approaches
Another approach, sometimes termed "metaselection", "metalearning", or "metaclassification", attempts to pick one single (but potentially different) classifier for each
discrimination task. For example, one might learn a rule of the form, “If there are more
than 200 examples of each class, use the kNN model, else use the SVM model.” We prefer
the term “metaselection” for this approach since only one single model is actually used
for predictions. In contrast, most of the other heterogeneous methods (and ours) vote or
blend the predictions of each in some way. An example of metaselection is [LL01] where
the authors define features that they then use to choose the classifier to apply to a particular
problem. Our approach is more general than this since we can blend classifiers based on the
specific documents. For features that may be relevant to choosing a classifier for a problem
but static across all documents within that problem (e.g., number of training examples),
our work on Inductive Transfer demonstrates how to extend our combination framework to
obtain the benefits of both metaselection and combination.
Inductive Transfer is one of several methods such as multitask learning that attempt to
overcome the typical scarcity of labeled data by building predictive models using knowledge transferred from another task. In multitask learning, additional information for building models comes in the form of labels for related functions which can be learned over the
same input. Although such additional labels are typically unavailable at prediction time,
results have demonstrated that generalization performance can be improved on the primary
task by learning to predict the new variables in addition to the output variable of interest.
Caruana [Car97] presents an approach to and analysis of multitask learning when the
n function-approximation tasks are over the same input (i.e., a labeled example consists of
x1 , . . . , xm data attributes and the values for this example of the n functions to be learned
f1 (~x), . . . , fn (~x)). In this analysis, the main concern is generalization performance for
one particular fi , the primary problem. Likewise, the Curds & Whey approach proposed
by Breiman & Friedman [BF95] solves a similarly formulated problem but attempts to
minimize the squared error across all of the n functions instead of placing emphasis on one
task.
In contrast to multitask learning, we show how to leverage labeled data from related
problems over examples in different input spaces to enhance the final model used in prediction. Problems related to this challenge have been termed classifier re-use [BG98] or
knowledge transfer [CK97]. We introduce a new approach to the challenge that hinges on
mapping the original feature space, targeted at predicting membership in a specific topic,
to a new feature space aimed at modeling the reliability of an ensemble of text classifiers.
Thrun & O’Sullivan [TO96] present methods for identifying related tasks and sequentially transferring knowledge when using a nearest-neighbor classifier. These methods
are applicable when the input has the same representation across tasks. Both Thrun &
O’Sullivan’s and Breiman & Friedman’s work could be applied to the inductive transfer
problem we lay out after transforming the data to our representation.
Cohen & Kudenko [CK97] perform an analysis of classifier re-use and sequential knowledge transfer in information filters for text documents. Their work showed that significant
improvements could be introduced when the classifiers were constructed to primarily model
features positively correlated with the topic (i.e., word presence that is positively correlated
with being In-Topic). However, the method also relies on the new task and the old task
sharing significant overlap in the underlying concept to be learned.
Finally, Bollacker & Ghosh [BG98] present a novel mechanism for classifier re-use
where a classifier is constructed for each of a set of support tasks that are later used in predictions for a primary task. The final classification is selected by predicting the same class
as the training data item (from the primary task data) that has the most similar prediction
pattern using the support classifiers. Since each support classifier is applied to examples
from every task, the input representation for each of the related tasks must be the same.
Additionally, unlike other re-use mechanisms that specifically bias models or build representations of domain knowledge, the scheme, like error-correcting output coding [Die00], relies on an assumption that the extra-task labels will serve as a natural encoding for the data.
2.3 Related Work Using Locality
We refer to a combination method as using locality if the algorithm can induce a model
which cannot be expressed as a constant-weighted linear combination of a classifier’s predictions, probabilities, scores, or log odds. Thus both the representation and the algorithm
determine whether an approach uses locality. For example, a linear SVM metaclassifier
applied to the log-odds of a classifier’s predictions does not use locality, but a decision tree
algorithm, which can learn a non-linear function of the classifier’s outputs, uses locality.
Locally weighted combinations have received far less attention in the research community. As we have mentioned above, some of the more prominent examples that include some notion of locality are stacking, cascade generalization, and HME. These methods⁵
use either only the classifier outputs or the classifier outputs and base features to set a local
weight. In contrast, we combine the outputs based on properties of the classifier output that
capture their local variance and accuracy.
Merz [Mer95] also uses locality in his approach to dynamic selection and combination: a nearest-neighbor method in which two examples are considered similar if the base classifiers have a similar prediction pattern for both of them. In the selection case,
the classifier with the highest accuracy in the retrieved neighbors is used, and in the combination approach, the classifiers are voted according to their accuracies. However, he was not able to obtain promising results in the empirical evaluation. Woods, Kegelmeyer, and Bowyer [WJB97] present a similar method, but the local accuracy of the classification methods is estimated over a neighborhood found in the original input space. Very minor improvements
were demonstrated for several problems that had low-dimensional input spaces. Additionally, while they tried different neighborhood sizes, they do not report how they decided on
the size they used in their final results.
Finally, Tresp & Taniguchi [TT95] investigate a variety of locally weighted combination
rules and apply them to a low-dimensional data set (13 features). While they do investigate
rules that take advantage of local variance, density (number of samples around the point),
and reliability, their rules are sometimes not practical for high-dimensional data. More
importantly, their approach relies on generative assumptions about the data.
Additionally, our approach allows for richer definitions of locality than simple nearness
in the (Euclidean) feature representation. For example, the reliability indicator UnigramVariance that we discussed briefly above and in more detail below is essentially the variance
of a unigram model’s output with the deletion of a single word occurrence in the document.
We could also consider a more expressive generalization such as the variance with deletion of a sentence or paragraph. Since whole contiguous sections may be deleted, this allows the covariance of the words, as determined by the structure of language, to play a role in how locality is effectively defined for documents.

⁵ The reader is referred to Sections 2.1 and 2.2 above for more details on each of these methods.
2.4 Previous Applications to Text Problems
The combination of multiple methodologies or representations has been employed in several text-related areas outside of text classification. For example, previous research in information retrieval has demonstrated that retrieval effectiveness can be improved by using
multiple, distinct representations [BCB94, KMT+ 82, RC95], or by using multiple queries
or search strategies [BCCC93, SF95]. Freitag [Fre98] presents a study of combining inductive learners for information extraction where several simple rules for combination are
employed after the classifier outputs are normalized.
In the realm of text classification, several researchers have achieved improvements in
classification accuracy via the combination of different classifiers [HPS96, LC96, LJ98,
YAP00]. Other investigators have reported that combined classifiers work well compared
to some particular approach [AKTV+01], but they have not reported results that compare the accuracy of the combined classifier with the accuracies of the individual contributing classifiers. Thus, it is difficult to draw insights from their work about how the reliabilities of the
contributing classifiers vary over the input space. Similarly, systems that seek to enhance
classification performance by applying many instances of the same classifier, such as in
boosting procedures [SS00, WAD+ 99], typically leverage weaker component learners that
would not be directly examined as stand-alone classifiers.
Much of the previous work on combining text classifiers has used relatively simple
policies for selecting the best classifier or for combining the output of multiple classifiers.
As some examples, Larkey and Croft [LC96] used weighted linear combinations of system
ranks or scores; Hull et al. [HPS96] used linear combinations of probabilities or log odds
scores; Yang et al. [YAP00] used a linear combination of normalized scores; and Li and
Jain [LJ98] used voting and classifier selection techniques. As discussed in detail in Section
2.2, Lam and Lai [LL01] use category-averaged features to perform metaselection. Ruiz
and Srinivasan’s [RS02] study on applying HME to hierarchical text classification is an
example of a more complicated combination rule that has been applied to text. In order
to make the approach computationally feasible, they performed significant dimensionality
reduction using feature selection. Despite this, they only obtained performance comparable
to a version of the Rocchio algorithm. In contrast, our work has demonstrated a significant
improvement over competitive text classification algorithms. This indicates that making
the most effective use of locality when combining text classifiers is still an unanswered
question.
2.5 No Free Lunch and Its Implications
Finally, no discussion of work related to classifier combination would be complete without
a discussion of the No Free Lunch Theorem [DHS01, Wol95] and its implications for classifier combination. Wolpert, who also introduced stacking [Wol92], derived it as a hardness
result, which in short demonstrates that there exists no classifier superior to every other
classifier unless we make assumptions about the example distribution. That is, even given
substantial training data, for any specific classifier there will always be some example distributions on which it is outperformed by another classifier.⁶ Additionally, there will also
exist some distributions where random guessing outperforms the classifier.
As a result, the performance of a classifier on a set of problems can be seen to be
more an issue of the appropriateness of the fit between the classifier’s assumptions and the
true underlying distribution. Since classifier combination seeks to create a more effective
classifier from the individual input classifiers, some researchers have mistakenly believed
that the No Free Lunch Theorem implies any attempt at classifier combination is futile.
However, this is clearly no more true than saying that classification is futile. Instead it
must be understood that the empirical performance of the combination method will be
dependent on the fit between the assumptions the combination method makes about the
base classifiers, the example distribution, and reality. When these assumptions are a good
fit for the problems seen in practice, the combination model will perform well. Thus, just as
some classifiers (e.g., SVMs) dominate a large number of problems (e.g., text classification)
seen in practice, the challenge is to develop a metaclassification algorithm that captures the
common interactions among base classifiers seen in practice.
In the context of this dissertation then, we will highlight the conditions under which a
particular metaclassifier will perform well (provide a good fit) based on the interaction of
the base classifiers and the training data available (characteristics of the task). For example, when the other base classifiers do not provide information beyond that of the best base classifier, we might expect some algorithms to use the data more efficiently and perform better. In contrast, when the dominant classifier differs across regions of the input space, another metaclassifier might outperform the previous algorithm. Likewise, the best choice might vary again in cases where the optimal combination is linear. Since the No Free Lunch theorem provides us with a proof that there is no
data-independent best choice or even a choice that will always outperform random, we aim to elucidate, through data properties and empirical evaluation, the trade-offs involved.

⁶ One common assumption made to prove such results is that the examples are drawn i.i.d.
Chapter 3
Calibration
In this chapter we first review the concept of calibration, a measure of how good a set of probability estimates is. If we consider the extreme case of applying a metaclassifier
to the outputs of a single base classifier, then the metaclassifier is either implicitly or explicitly recalibrating the base classifier. In fact, given only the base classifier’s probability
estimates, the metaclassifier cannot improve on those estimates if they are well-calibrated.
As a result, understanding how any combination method works in the case of a single
base classifier can give important insight into its behavior — especially since it is possible
to easily examine the empirical behavior of a base classifier’s probabilities. For example,
we will show that Kahn’s [Kah04] assumption of normally distributed class-conditional
log-odds rarely empirically holds for a single base classifier, and we will provide an explanation why it is unlikely to hold for accurate classifiers. Additionally, since not all classifiers directly estimate probabilities, this study also aids in producing probability estimates
for combination methods that require them.
The remainder of the chapter is devoted to investigating the empirical behavior of probability estimates obtained from various classifiers, explaining this behavior, and developing
new methods to improve the calibration of probability estimates obtained from classifiers
based on their observed empirical behavior.
3.1 Calibration and Related Concepts
An obvious way to approach classifier combination is to treat each classifier, Ci, as a black box that outputs an estimate of the probability distribution over class labels for each datapoint, P̂Ci(c | d), and then to combine these estimates with
a linear or a weighted (normalized) multiplicative combination. A slightly more general
approach assumes each classifier outputs an unnormalized score, scoreCi(d, c), which is the
score assigned to class c for document d. This score is then converted to a probability
P̂Ci (c | d) using some other method.1
However, there is no guarantee that the estimates from different classifiers adhere to
a fixed standard. That is, for one classifier, 0.8 of the items assigned 0.6 probability for
the class under consideration may actually belong to the class; while for another classifier,
0.5 of the items assigned 0.6 probability may belong to the class. In some sense then, a
prediction of 0.6 from each classifier “means” different things.
DeGroot and Fienberg [DF83] review the concept of calibration, a candidate to use as a
fixed standard that addresses such inconsistencies. We say a classifier is well-calibrated if
as the number of predictions goes to infinity, the predicted probability goes to the empirical
probability. That is, for all unique πi , such that πi = P̂Ci (c | d), the empirical relative
frequency2 equals πi , i.e. P̃ (c | πi ) = πi . This can be best envisioned graphically with
the aid of a reliability diagram. Consider a two-class discrimination problem where Y =
{0, 1}. Then, we can plot the classifier’s predictions for one of the classes on the x-axis
and the empirical relative frequency on the y-axis. Figure 3.1 demonstrates this.
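Concretely, the points of a reliability diagram can be estimated by binning the predictions; the sketch below uses equal-width bins, which is one common choice rather than anything prescribed by [DF83]:

    import numpy as np

    def reliability_points(p_hat, y, n_bins=10):
        """Bin predicted probabilities for class 1 and compare each bin's mean
        prediction with the empirical relative frequency of class 1; for a
        well-calibrated classifier the two coincide (the diagonal)."""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        which = np.clip(np.digitize(p_hat, edges) - 1, 0, n_bins - 1)
        points = []
        for b in range(n_bins):
            mask = which == b
            if mask.any():
                points.append((p_hat[mask].mean(),   # mean predicted probability
                               y[mask].mean(),       # empirical relative frequency
                               int(mask.sum())))     # annotation: bin count
        return points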
The above describes calibration as a frequentist concept, but the reader should note that
it can also be viewed from a Bayesian viewpoint [GCSR95, GZ86]. From the Bayesian
view, an outside observer is actually stating his belief about a classifier’s behavior. The
classifier is then well-calibrated if the observer cannot improve upon the classifier’s forecasts given only the output of the classifier. Thus, a well-calibrated forecaster has, in a
sense, summarized all of its information in the probabilities it is emitting.
When evaluating the usefulness of predictions, we also must consider the frequency
with which a classifier outputs a particular prediction. For example, suppose for our problem the actual prior is P (c = 1) = 0.7. A classifier could simply predict P̂Ci (c = 1 |
d) = 0.7 all the time, and it would be well-calibrated. However, it is clearly less useful
than another classifier that is also well-calibrated and outputs P̂Cj (c = 1 | d) = 0.9 half the
time and P̂Cj (c = 1 | d) = 0.5 the other half of the time. DeGroot and Fienberg [DF83]
introduce the concept of refinement to compare two well-calibrated classifiers. Essentially,
one well-calibrated classifier Ci is at least as refined as another, Cj, if the predictions of Ci can be passed through a noisy channel to produce a well-calibrated classifier whose characterization in terms of calibration (the discrete set of prediction values it outputs and their frequencies) is the same as Cj's. Formally, this holds if there exist stochastic functions (i.e., distributions) h such that the following equations are satisfied for all πj ∈ Π:

\[
P(\pi_j) = \sum_{\pi_i \in \Pi} h(\pi_j \mid \pi_i) P(\pi_i)
\tag{3.1}
\]

\[
\pi_j P(\pi_j) = \sum_{\pi_i \in \Pi} \pi_i h(\pi_j \mid \pi_i) P(\pi_i)
\tag{3.2}
\]

¹ An assumption often made during this conversion is that P̂Ci(c | d) is monotonic in scoreCi(d, c).
² There is an assumption that the space of probability estimates has been discretized to form a finite set of possible values, e.g., {0, 0.1, 0.2, . . . , 1}.

Figure 3.1: An example of a well-calibrated classifier's reliability diagram, plotting classifier Ci's probability estimate πi = P̂Ci(c = 1 | d) against the empirical relative frequency P̃(c = 1 | πi). For a well-calibrated classifier, all points in a reliability diagram fall on the diagonal: in the long run, 0.6 (generally πi) of the items the classifier predicts to have 0.6 probability (generally probability πi) of belonging to the class actually do belong to the class. Additionally, a reliability diagram often has annotations indicating the frequency with which a certain value is predicted.
Additionally, DeGroot and Fienberg give a simple statistical test that is necessary and sufficient to determine if one classifier is more refined than another. The most refined classifier
is the one that only outputs predictions 0 and 1 and is always correct.
An implication for classifier selection is that if we must choose only one of two well-calibrated predictors, then it is always better to use the most refined predictor regardless of
how the predictions will be used. Refinement is a partial-ordering, and thus it is possible
that neither classifier is more refined than the other. Furthermore, it is unclear what the
implication is for predictors that are not calibrated — which is the case far more often than
not in practice.
Therefore, DeGroot and Fienberg continue by generalizing the notion of refinement to
a related concept they call sufficiency, which can be used to compare any two classifiers
regardless of whether or not they are well-calibrated. A classifier Ci is sufficient for Cj if the distribution of Cj's predictions can be characterized as a stochastic function of Ci's.
Formally, Ci is sufficient for Cj if there exists a set of distributions h such that the following equalities are satisfied:³

\[
P(\pi_j \mid y) = \sum_{\pi_i \in \Pi} h(\pi_j \mid \pi_i) P(\pi_i \mid y) \quad \text{for all } \pi_j \in \Pi \text{ and } y \in Y.
\tag{3.3}
\]
Like refinement, sufficiency is a partial ordering, and if we must select only one of
two classifiers and Ci is sufficient for classifier Cj , it is always better to choose classifier
Ci . However, we must ask whether we can still gain from combining these two classifiers.
DeGroot and Fienberg demonstrate that it is possible for classifier i to be sufficient for
classifier j but information can always be gained from their combination except when
P (c | πi , πj ) = P (c | πi ) for all πi , πj . To rephrase, we can gain by combining the two
classifiers unless the class is independent of the output of Cj given the output of Ci .
Since this has implications for classifier combination, it is worth considering further how, even when one classifier can be characterized as a stochastic function of another, it can sometimes be used to gain improvement in combination. The argument proving this is constructive but not exhaustive. That is, we show one set, but not the only set, of conditions where this holds. Suppose we are given Ci. We can easily choose an arbitrary set of h(πj | πi) to probabilistically generate πj from πi. Thus h, P(πj | c), and P(πi | c) satisfy the conditions in Eq. 3.3. Now, one valid class-conditional joint is the one in which the classifiers are conditionally independent: P(πi, πj | c) = P(πi | c)P(πj | c). When that distribution governs the data, the combination of πj and πi can surpass either classifier. In essence, even though knowing πi gives us information about what values πj takes, it does so only indirectly via the common class variable they are predicting.
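To make the gain concrete, the following sketch (our own illustration, using only Bayes' rule and the conditional-independence assumption just stated) shows how the optimal combination factors over both outputs:

$$P(c \mid \pi_i, \pi_j) = \frac{P(\pi_i \mid c)\, P(\pi_j \mid c)\, P(c)}{\sum_{c'} P(\pi_i \mid c')\, P(\pi_j \mid c')\, P(c')}$$

Both classifier outputs enter the numerator, so unless P(c | πi, πj) = P(c | πi), the second classifier shifts the posterior and the combination can beat either input alone.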
Finally, we may not retain the property of calibration even when we are combining well-calibrated classifiers with simple rules such as arithmetically averaging their predictions or using a normalized product (naïve Bayes combination). Table 3.1 shows an example where the simple combinations improve their predictive power but fail to maintain calibration. However, if we were to perform perfect post-processing recalibration on πP and πA, both would perform as well as the optimal combination π∗. While recalibration plays only a small role in our combination framework, it is sometimes necessary for methods that assume their inputs are calibrated or when gauging whether a combination method has been more successful than is apparent (by recalibrating the combination method). The next section discusses the new recalibration techniques we designed and the improvement they lead to in practice.
³This equation does not indicate how to find such an h, however.
P(x)      Class   π1        π2        πP        πA        π∗
0.125     1       0.25      0.75      0.5       0.5       0.5
0.125     0       0.25      0.75      0.5       0.5       0.5
0.125     1       0.75      0.75      0.9       0.75      1
0.125     0       0.25      0.25      0.1       0.25      0
0.125     1       0.75      0.75      0.9       0.75      1
0.125     0       0.25      0.25      0.1       0.25      0
0.125     1       0.75      0.25      0.5       0.5       0.5
0.125     0       0.75      0.25      0.5       0.5       0.5
MSE               0.1875    0.1875    0.1300    0.1562    0.1250
E[ln P̂]          -0.5623   -0.5623   -0.3933   -0.4904   -0.3466
Table 3.1: Displayed is an example of the output distribution of two well-calibrated classifiers, π1 and π2, and some sample combination rules: normalized product (πP), average (πA), and the optimal combination given only the predictions, π∗ = P(c | π1, π2). Although both πP and πA improve over the base classifiers, neither is well-calibrated.
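The entries in Table 3.1 can be checked mechanically. The short script below (an illustrative verification we supply, not part of the original experiments) recomputes πP and πA from π1 and π2 and confirms that the base classifiers are calibrated while the combinations are not:

```python
from collections import defaultdict

# The eight equiprobable outcomes of Table 3.1: (class, pi_1, pi_2)
rows = [(1, .25, .75), (0, .25, .75), (1, .75, .75), (0, .25, .25),
        (1, .75, .75), (0, .25, .25), (1, .75, .25), (0, .75, .25)]

def calibration(preds):
    """For each predicted value, the empirical frequency of class 1."""
    hits, counts = defaultdict(float), defaultdict(int)
    for (c, _, _), p in zip(rows, preds):
        hits[p] += c
        counts[p] += 1
    return {p: hits[p] / counts[p] for p in counts}

pi_1 = [p1 for _, p1, _ in rows]
pi_P = [p1 * p2 / (p1 * p2 + (1 - p1) * (1 - p2)) for _, p1, p2 in rows]
pi_A = [(p1 + p2) / 2 for _, p1, p2 in rows]

print(calibration(pi_1))  # {0.25: 0.25, 0.75: 0.75} -> calibrated
print(calibration(pi_P))  # {0.5: 0.5, 0.9: 1.0, 0.1: 0.0} -> not calibrated
print(calibration(pi_A))  # {0.5: 0.5, 0.75: 1.0, 0.25: 0.0} -> not calibrated
```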
3.2 Recalibrating Classifiers
As mentioned above, calibration can play a crucial role in classifier combination. While the majority of our work has explored combination methods that implicitly account for differing levels of calibration among the classifiers, some combination algorithms rely on the input classifiers being calibrated in order to perform well. Therefore, this section presents new methods we derived for recalibrating classifiers and mapping scores to probability estimates. The primary contribution is the use of an asymmetric Laplace distribution to achieve as good or better results than competing parametric models.
The connections between recalibration and combining classifiers also run deeper. If we consider giving a metaclassifier a single classifier as input, the metaclassifier is simply recalibrating the base classifier. Both the empirical methods presented in this section and the theoretically-based arguments show that classifiers will typically demonstrate asymmetric behavior. Therefore, we see that Kahn's [Kah04] assumption of normally distributed class-conditional log-odds when combining classifiers does not hold even for a single classifier.
3.2.1 The Need for Calibrated Probabilities in Other Applications
In addition to the role calibration plays in combination, recalibrating classifiers and mapping scores to probability estimates is an important problem in its own right. This is because text classifiers that give probability estimates are more flexible in practice than those that give only a simple classification or even a ranking. For example, rather than choosing one fixed decision threshold, they can be used in a Bayesian risk model [DHS01] to issue a run-time decision which minimizes the expected cost of a user-specified cost function dynamically chosen at prediction time. This can be used to minimize a linear utility cost function for filtering tasks where pre-specified costs of relevant/irrelevant are not available during training but are specified at prediction time. Furthermore, the costs can be changed without retraining the model. Additionally, probability estimates are often used as the basis for deciding which document's label to request next during active learning [LG94, STP01]. Effective active learning can be key in many information retrieval tasks where obtaining labeled data is costly, severely reducing the amount of labeled data needed to reach the same performance as when new labels are requested randomly [LG94]. Finally, probability estimates are also amenable to making other types of cost-sensitive decisions [ZE01]. However, in all of these tasks, the quality of the probability estimates is crucial.
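As an illustration of the risk-minimizing decision just described (a minimal sketch; the cost values and function name are hypothetical, chosen only for this example):

```python
def bayes_risk_decision(p_pos, cost_fp, cost_fn):
    """Choose the class minimizing expected cost given P(+|d).

    Expected cost of predicting '+' is (1 - p_pos) * cost_fp;
    expected cost of predicting '-' is p_pos * cost_fn.
    The costs may be supplied at prediction time, so no retraining
    is needed when they change.
    """
    return '+' if (1 - p_pos) * cost_fp <= p_pos * cost_fn else '-'

# A document with P(+|d) = 0.3 is still routed to '+' when missing a
# relevant document is ten times as costly as a false alarm.
print(bayes_risk_decision(0.3, cost_fp=1.0, cost_fn=10.0))  # '+'
```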
Parametric models generally use assumptions that the data conform to the model in order to trade off flexibility for the ability to estimate the model parameters accurately with little training data. Since many text classification tasks have very little training data, we focus on parametric methods. However, most of the existing parametric methods that have been applied to this task carry an assumption we find empirically undesirable. While some of these methods allow the distributions of the documents relevant and irrelevant to the topic to have different variances, they typically enforce the unnecessary constraint that the documents are symmetrically distributed around their respective modes. We introduce several asymmetric parametric models that allow us to relax this assumption without significantly increasing the number of parameters and demonstrate how we can efficiently fit the models. Additionally, these models can be interpreted as assuming the scores produced by the text classifier have three basic types of empirical behavior, one corresponding to each of the "extremely irrelevant", "hard to discriminate", and "obviously relevant" items.
First, we discuss in further detail the need for asymmetric models. After this, we describe two specific asymmetric models and, using two standard text classifiers, naïve Bayes and SVMs, demonstrate how they can be efficiently used to recalibrate poor probability estimates or produce high-quality probability estimates from raw scores. We then review experiments using previously proposed methods and the asymmetric methods over several text classification corpora to demonstrate the strengths and weaknesses of the various methods. Finally, we review related work on improving probability estimates and summarize our contributions.
3.2.2 Recalibration Problem Definition & Approach
Our work differs from earlier approaches primarily in two points: (1) we provide asymmetric parametric models suitable for use when little training data is available; (2) we explicitly analyze the quality of the probability estimates these and competing methods produce and provide significance tests for these results.
Problem Definition
[Figure 3.2: a document d enters a classifier, which predicts class c(d) ∈ {+, −} and gives an unnormalized confidence s(d) that c(d) = +; the highlighted grey box converts s(d) to P(+ | s(d)), internally by estimating the class-conditional densities p(s|+) and p(s|−) and applying Bayes' rule with the priors P(+) and P(−).]

Figure 3.2: We are concerned with how to perform the box highlighted in grey. The internals are for one type of approach.

The general problem we are concerned with is highlighted in Figure 3.2. A text classifier
produces a prediction about a document and gives a score s(d) indicating the strength of its
decision that the document belongs to the positive class (relevant to the topic). We assume
throughout there are only two classes: the positive and the negative (or irrelevant) class (’+’
and ’-’ respectively).
There are two general types of parametric approaches. The first of these tries to fit
the posterior function directly, i.e. there is one function estimator that performs a direct
mapping of the score s to the probability P (+|s(d)). The second type of approach breaks
the problem down as shown in the grey box of Figure 3.2. An estimator for each of the
class-conditional densities (i.e. p(s|+) and p(s|−)) is produced, then Bayes’ rule and the
class priors are used to obtain the estimate for P (+|s(d)).
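A minimal sketch of the second, class-conditional approach (the density arguments stand in for whichever parametric fits are used; the function itself is our illustration, not from the original):

```python
def posterior_from_densities(s, p_s_pos, p_s_neg, prior_pos):
    """Invert class-conditional score densities with Bayes' rule:
    P(+|s) = p(s|+)P(+) / (p(s|+)P(+) + p(s|-)P(-))."""
    num = p_s_pos(s) * prior_pos
    den = num + p_s_neg(s) * (1.0 - prior_pos)
    return num / den if den > 0 else prior_pos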
Motivation for Asymmetric Distributions

Most of the previous parametric approaches to this problem either directly or indirectly (when fitting only the posterior) correspond to fitting Gaussians to the class-conditional densities; they differ only in the criterion used to estimate the parameters. We can visualize this as depicted in Figure 3.3. Since increasing s usually indicates increased likelihood of belonging to the positive class, the rightmost distribution usually corresponds to p(s|+).
[Figure 3.3 plots two overlapping Gaussian class-conditional densities, p(s | Class = +) and p(s | Class = −), against the unnormalized confidence score s; label B marks the region between the modes, and labels A and C mark the regions outside them.]
Figure 3.3: Typical View of Discrimination based on Gaussians
However, using standard Gaussians fails to capitalize on a basic characteristic commonly seen. Namely, if we have a raw output score that can be used for discrimination, then the empirical behavior between the modes (label B in Figure 3.3) is often very different from that outside of the modes (labels A and C in Figure 3.3). Intuitively, the area between the modes corresponds to the hard examples, which are difficult for this classifier to distinguish, while the areas outside the modes are the extreme examples that are usually easily distinguished. This suggests that we may want to uncouple the scale of the outside and inside segments of the distribution (as depicted by the curve denoted as A-Gaussian in Figure 3.4). As a result, an asymmetric distribution may be a more appropriate choice for application to the raw output score of a classifier.
Ideally (i.e. with perfect classification) there will exist scores θ− and θ+ such that all examples with score greater than θ+ are relevant, and all examples with scores less than θ− are irrelevant. Furthermore, no examples fall between θ− and θ+. The distance |θ− − θ+| corresponds to the margin in some classifiers, and an attempt is often made to maximize this quantity. Because text classifiers have training data to use to separate the classes, the final behavior of the score distributions is primarily a function of the amount of training data and the consequent separation in the classes achieved. This is in contrast to search engine retrieval, where the distribution of scores is more a function of the language distribution across documents, the similarity function, and the length and type of query.
Perfect classification corresponds to using two very asymmetric distributions, but in this
case, the probabilities are actually one and zero, and many methods will work for typical
purposes.
Practically, some examples will fall between θ− and θ+, and it is often important to estimate the probabilities of these examples well (since they correspond to the "hard" examples). Justifications can be given both for why you may find more and for why you may find fewer examples between θ− and θ+ than outside of them, but there are few empirical reasons to believe that the distributions should be symmetric.
A natural first candidate for an asymmetric distribution is a generalization of a common symmetric distribution, e.g. the Laplace or the Gaussian. An asymmetric Laplace distribution can be achieved by placing two exponentials around the mode in the following manner:
$$p(x \mid \theta, \beta, \gamma) = \begin{cases} \dfrac{\beta\gamma}{\beta+\gamma}\, \exp\left[-\beta\,(\theta - x)\right] & x \le \theta \ \text{(left of mode)} \\[1ex] \dfrac{\beta\gamma}{\beta+\gamma}\, \exp\left[-\gamma\,(x - \theta)\right] & x > \theta \ \text{(right of mode)} \end{cases} \qquad (\beta, \gamma > 0) \qquad (3.4)$$
where θ, β, and γ are the model parameters. θ is the mode of the distribution, β is the inverse scale of the exponential to the left of the mode, and γ is the inverse scale of the exponential to the right. We will use the notation Λ(X | θ, β, γ) to refer to this distribution.

[Figure 3.4 plots a fitted Gaussian and an asymmetric Gaussian (A-Gaussian) density, p(s | Class = {+,−}), against the unnormalized confidence score s.]
Figure 3.4: Gaussians vs. Asymmetric Gaussians: a shortcoming of symmetric distributions. The vertical lines show the modes as estimated nonparametrically.
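For concreteness, Eq. 3.4 translates directly into code; this small evaluator is a sketch we provide for illustration:

```python
import math

def asym_laplace_pdf(x, theta, beta, gamma):
    """Density of the asymmetric Laplace Lambda(x | theta, beta, gamma), Eq. 3.4."""
    coef = beta * gamma / (beta + gamma)
    if x <= theta:
        return coef * math.exp(-beta * (theta - x))   # left of mode
    return coef * math.exp(-gamma * (x - theta))      # right of mode
```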
We can create an asymmetric Gaussian in the same manner:

$$p(x \mid \theta, \sigma_l, \sigma_r) = \begin{cases} \dfrac{2}{\sqrt{2\pi}\,(\sigma_l+\sigma_r)}\, \exp\left[-\dfrac{(x-\theta)^2}{2\sigma_l^2}\right] & x \le \theta \ \text{(left of mode)} \\[1ex] \dfrac{2}{\sqrt{2\pi}\,(\sigma_l+\sigma_r)}\, \exp\left[-\dfrac{(x-\theta)^2}{2\sigma_r^2}\right] & x > \theta \ \text{(right of mode)} \end{cases} \qquad (\sigma_l, \sigma_r > 0) \qquad (3.5)$$
where θ, σl , and σr are the model parameters. σl and σr are the scale parameters to the
left and to the right of the mode, respectively; when σl = σr , a symmetric Gaussian with
standard deviation σl is obtained. To refer to this asymmetric Gaussian, we use the notation
Γ(X | θ, σl , σr ). While these distributions are composed of “halves”, the resulting function
is a single continuous distribution.
These distributions allow us to fit our data with much greater flexibility at the cost
of only fitting six parameters. We could instead try mixture models for each component
or other extensions, but most other extensions require at least as many parameters (and
can often be more computationally expensive). In addition, the motivation above should
provide significant cause to believe the underlying distributions actually behave in this way.
Furthermore, this family of distributions can still fit a symmetric distribution, and finally, in
the empirical evaluation, evidence is presented that demonstrates this asymmetric behavior
(see Figure 3.5).
To our knowledge, neither family of distributions has been previously used in machine
learning or information retrieval. For the interested reader, statistical properties relevant to
these distributions are discussed in great detail in [KKP01].
3.2.3 Estimating the Parameters of the Asymmetric Distributions
This section develops the method for finding maximum likelihood estimates (MLEs) of the parameters for the above asymmetric distributions. In order to find the MLEs, we have two choices: (1) use numerical estimation to estimate all three parameters at once; or (2) fix the value of θ, estimate the other two (β and γ, or σl and σr) given our choice of θ, and then consider alternate values of θ. Because of the simplicity of analysis in the latter alternative, we choose this method.
Asymmetric Laplace MLEs
For D = {x1, x2, . . . , xN} where the xi are i.i.d. and X ∼ Λ(X | θ, β, γ), the likelihood is:

$$\prod_{i=1}^{N} \Lambda(x_i \mid \theta, \beta, \gamma). \qquad (3.6)$$
We desire to find the maximum likelihood estimates for β, γ and θ. To do so, we fix θ
and compute the maximum likelihood for that choice of θ. Then, we can simply consider
all choices of θ and choose the one with the maximum likelihood (or equivalently the
loglikelihood) over all choices of θ.
The loglikelihood we must compute then is:
$$\log \prod_{i=1}^{N} \Lambda(x_i \mid \theta, \beta, \gamma) = \sum_{i=1}^{N} \log \Lambda(x_i \mid \theta, \beta, \gamma) \qquad (3.7)$$

$$= \sum_{x \in D \mid x \le \theta} \log \Lambda(x \mid \theta, \beta, \gamma) + \sum_{x \in D \mid x > \theta} \log \Lambda(x \mid \theta, \beta, \gamma) \qquad (3.8)$$

$$= \sum_{x \in D \mid x \le \theta} \left[ \log \frac{\beta\gamma}{\beta+\gamma} - \beta(\theta - x) \right] + \sum_{x \in D \mid x > \theta} \left[ \log \frac{\beta\gamma}{\beta+\gamma} - \gamma(x - \theta) \right] \qquad (3.9)$$

$$= N \log \frac{\beta\gamma}{\beta+\gamma} + \sum_{x \in D \mid x \le \theta} \left[-\beta(\theta - x)\right] + \sum_{x \in D \mid x > \theta} \left[-\gamma(x - \theta)\right] \qquad (3.10)$$

Let $N_l = |\{x \in D \mid x \le \theta\}|$, $N_r = |\{x \in D \mid x > \theta\}|$, $S_l = \sum_{x \in D \mid x \le \theta} x$, and $S_r = \sum_{x \in D \mid x > \theta} x$. Then

$$= N \log \frac{\beta\gamma}{\beta+\gamma} - N_l \beta\theta + \beta S_l + N_r \gamma\theta - \gamma S_r \qquad (3.11)$$

Let $D_l = N_l\theta - S_l$ and $D_r = S_r - N_r\theta$. Then

$$= N \log \frac{\beta\gamma}{\beta+\gamma} - \beta D_l - \gamma D_r \qquad (3.12)$$
Note that Dl and Dr are the sums of the absolute differences between the x belonging to the left and right halves of the distribution (respectively) and θ. The partial derivatives are

$$\frac{\partial l}{\partial \beta} = \frac{N\gamma}{\beta(\beta+\gamma)} - D_l \quad \text{and} \quad \frac{\partial l}{\partial \gamma} = \frac{N\beta}{\gamma(\beta+\gamma)} - D_r.$$

We can set the derivatives to zero and solve them
analytically to find that the MLEs for β and γ for a fixed θ are:
$$\beta_{MLE} = \frac{N}{D_l + \sqrt{D_r D_l}} \qquad \gamma_{MLE} = \frac{N}{D_r + \sqrt{D_r D_l}}. \qquad (3.13)$$
These estimates are not wholly unexpected, since we would obtain Nl/Dl if we were to estimate β independently of γ. The elegance of the formulae is that the estimates will tend to be symmetric only insofar as the data dictate it (i.e. the closer Dl and Dr are to being equal, the closer the resulting inverse scales).
By continuity arguments, when N = 0, we assign β = γ = ε₀, where ε₀ is a small constant that acts to disperse the distribution to a uniform. Similarly, when N ≠ 0 and Dl = 0, we assign β = εinf, where εinf is a very large constant that corresponds to an extremely sharp distribution (i.e. almost all mass at θ for that half). Dr = 0 is handled similarly.
Assuming that θ falls in some range [φ, ψ] dependent only upon the observed documents, this alternative is also easily computable. Given Nl, Sl, Nr, Sr, we can compute the posterior and the MLEs in constant time. In addition, if the scores are sorted, then we can perform the whole process quite efficiently. Starting with the minimum θ = φ we would like to try, we loop through the scores once and set Nl, Sl, Nr, Sr appropriately. Then we increase θ and just step past the scores that have shifted from the right side of the distribution to the left. Assuming the number of candidate θs is O(n), this process is O(n), and the overall process is dominated by sorting the scores, O(n log n) (or expected linear time).
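The sketch below illustrates this sweep. For simplicity it only tests the observed scores as candidate θs (the text also tests points between adjacent scores), and the εinf stand-in constant is our assumption:

```python
import math

def fit_asym_laplace(scores):
    """Illustrative MLE sweep for Lambda(x | theta, beta, gamma), Eqs. 3.12-3.13."""
    xs = sorted(scores)
    n, total = len(xs), sum(xs)
    eps_inf = 1e9  # assumed stand-in for the "very large constant" in the text
    best_ll, best_params = -math.inf, None
    n_l, s_l = 0, 0.0  # count and sum of the scores with x <= theta
    for theta in xs:
        # Step past scores that shift from the right half to the left half.
        while n_l < n and xs[n_l] <= theta:
            s_l += xs[n_l]
            n_l += 1
        n_r, s_r = n - n_l, total - s_l
        d_l = n_l * theta - s_l   # sum of (theta - x) over the left half
        d_r = s_r - n_r * theta   # sum of (x - theta) over the right half
        root = math.sqrt(d_l * d_r)
        beta = n / (d_l + root) if d_l > 0 else eps_inf    # Eq. 3.13
        gamma = n / (d_r + root) if d_r > 0 else eps_inf   # Eq. 3.13
        ll = n * math.log(beta * gamma / (beta + gamma)) - beta * d_l - gamma * d_r  # Eq. 3.12
        if ll > best_ll:
            best_ll, best_params = ll, (theta, beta, gamma)
    return best_params
```

Each score is stepped past exactly once across the whole sweep, so after sorting the loop is linear, matching the complexity argument above.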
Asymmetric Gaussian MLEs
For D = {x1, x2, . . . , xN} where the xi are i.i.d. and X ∼ Γ(X | θ, σl, σr), the likelihood is:

$$\prod_{i=1}^{N} \Gamma(x_i \mid \theta, \sigma_l, \sigma_r) \qquad (3.14)$$
We desire to find the maximum likelihood estimates for σl , σr and θ. Similar to the
above, we fix θ and compute the maximum likelihood for that choice of θ. Then, we
can simply consider all choices of θ and choose the one with the maximum likelihood (or
equivalently the loglikelihood) over all choices of θ. The derivation is very similar to that
for the Asymmetric Laplace.
The loglikelihood we must compute then is:
$$\log \prod_{i=1}^{N} \Gamma(x_i \mid \theta, \sigma_l, \sigma_r) = \sum_{i=1}^{N} \log \Gamma(x_i \mid \theta, \sigma_l, \sigma_r) \qquad (3.15)$$

$$= \sum_{x \in D \mid x \le \theta} \log \Gamma(x \mid \theta, \sigma_l, \sigma_r) + \sum_{x \in D \mid x > \theta} \log \Gamma(x \mid \theta, \sigma_l, \sigma_r) \qquad (3.16)$$

$$= \sum_{x \in D \mid x \le \theta} \left[ \log \frac{2}{\sqrt{2\pi}(\sigma_l + \sigma_r)} - \frac{(x-\theta)^2}{2\sigma_l^2} \right] + \sum_{x \in D \mid x > \theta} \left[ \log \frac{2}{\sqrt{2\pi}(\sigma_l + \sigma_r)} - \frac{(x-\theta)^2}{2\sigma_r^2} \right] \qquad (3.17)$$

$$= N \log \frac{2}{\sqrt{2\pi}(\sigma_l + \sigma_r)} - \frac{1}{2\sigma_l^2} \sum_{x \in D \mid x \le \theta} (x-\theta)^2 - \frac{1}{2\sigma_r^2} \sum_{x \in D \mid x > \theta} (x-\theta)^2 \qquad (3.18)$$

Let $N_l = |\{x \in D \mid x \le \theta\}|$, $N_r = |\{x \in D \mid x > \theta\}|$, $S_l = \sum_{x \in D \mid x \le \theta} x$, $S_r = \sum_{x \in D \mid x > \theta} x$, $S_{l^2} = \sum_{x \in D \mid x \le \theta} x^2$, and $S_{r^2} = \sum_{x \in D \mid x > \theta} x^2$. Then

$$= N \log \frac{2}{\sqrt{2\pi}(\sigma_l + \sigma_r)} - \frac{1}{2\sigma_l^2}\left[S_{l^2} - 2 S_l \theta + N_l \theta^2\right] - \frac{1}{2\sigma_r^2}\left[S_{r^2} - 2 S_r \theta + N_r \theta^2\right] \qquad (3.19)$$

Let $D_{l^2} = S_{l^2} - 2 S_l \theta + \theta^2 N_l$ and $D_{r^2} = S_{r^2} - 2 S_r \theta + \theta^2 N_r$. Then

$$= N \log \frac{2}{\sqrt{2\pi}(\sigma_l + \sigma_r)} - \frac{1}{2\sigma_l^2} D_{l^2} - \frac{1}{2\sigma_r^2} D_{r^2} \qquad (3.20)$$
The partial derivatives are

$$\frac{\partial l}{\partial \sigma_l} = \frac{D_{l^2}}{\sigma_l^3} - \frac{N}{\sigma_l + \sigma_r} \quad \text{and} \quad \frac{\partial l}{\partial \sigma_r} = \frac{D_{r^2}}{\sigma_r^3} - \frac{N}{\sigma_l + \sigma_r}.$$

We can set the derivatives to zero and solve them analytically to find, for a fixed θ, only one feasible solution:
$$\sigma_{l,MLE} = \sqrt{\frac{D_{l^2} + D_{l^2}^{2/3} D_{r^2}^{1/3}}{N}} \qquad (3.21)$$

$$\sigma_{r,MLE} = \sqrt{\frac{D_{r^2} + D_{r^2}^{2/3} D_{l^2}^{1/3}}{N}}. \qquad (3.22)$$
By continuity arguments, when N = 0, we assign σr = σl = εinf, and when N ≠ 0 and Dl² = 0 (resp. Dr² = 0), we assign σl = ε₀ (resp. σr = ε₀). Again, the same computational complexity analysis applies to estimating these parameters.
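The closed forms in Eqs. 3.21-3.22 are cheap to evaluate once the statistics for a candidate θ are known; the helper below is an illustrative sketch (the ε edge cases from the text are omitted for brevity):

```python
import math

def asym_gauss_mles(theta, xs):
    """MLEs of sigma_l and sigma_r for a fixed mode theta (Eqs. 3.21-3.22)."""
    n = len(xs)
    d_l2 = sum((x - theta) ** 2 for x in xs if x <= theta)
    d_r2 = sum((x - theta) ** 2 for x in xs if x > theta)
    sigma_l = math.sqrt((d_l2 + d_l2 ** (2 / 3) * d_r2 ** (1 / 3)) / n)
    sigma_r = math.sqrt((d_r2 + d_r2 ** (2 / 3) * d_l2 ** (1 / 3)) / n)
    return sigma_l, sigma_r
```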
3.2.4 Experimental Analysis
Methods
For each of the methods that use a class prior, we use a smoothed add-one estimate, i.e. P(c) = (|c| + 1)/(N + 2), where N is the number of documents. For methods that fit the class-conditional densities, p(s|+) and p(s|−), the resulting densities are inverted using Bayes' rule as described above. All of the methods below are fit using maximum likelihood estimates.
For recalibrating a classifier (i.e. correcting poor probability estimates output by the classifier), it is usual to use the log-odds of the classifier's estimate as s(d). The log-odds are defined to be log [P(+|d)/P(−|d)]. The normal decision threshold (minimizing error) in terms of log-odds is at zero (i.e. P(+|d) = P(−|d) = 0.5).
Since they scale the outputs to the space [−∞, ∞], the log-odds make the normal (and similar distributions) applicable [LTB79]. Lewis & Gale [LG94] give a more motivating viewpoint: fitting the log-odds has a dampening effect for the inaccurate independence assumption and provides a bias correction for inaccurate estimates of the priors. In general, fitting the log-odds can serve to boost or dampen the signal from the original classifier as the data dictate.
Gaussians
A Gaussian is fit to each of the class-conditional densities, using the usual maximum likelihood estimates. That is, for the class-conditional mean we use $\mu_c = \frac{1}{N_c} \sum_{c(d)=c} s(d)$, and for the class-conditional variance⁴ we use $\sigma_c^2 = \frac{1}{N_c} \sum_{c(d)=c} [s(d) - \mu_c]^2$, where $N_c$ is the number of documents in class c. This method is denoted in the tables below as Gauss.

⁴For the data we evaluated here, N was large enough that there was little difference between using the MLE for variance given here (which is biased) or the unbiased version, which multiplies by 1/(N − 1) instead.
Asymmetric Gaussians
An asymmetric Gaussian is fit to each of the class-conditional densities using the maximum likelihood estimation procedure described in Section 3.2.3 above. Intervals between adjacent scores are divided into 10 pieces for testing candidate θs, i.e. eight points between actual scores occurring in the data set are tested. This method is denoted as A. Gauss.
Laplace Distributions
Even though Laplace distributions are not typically applied to this task, we also tried this method to isolate why benefit is gained from the asymmetric form. The usual MLEs were used for estimating the location and scale of a classical symmetric Laplace distribution as described in [KKP01].
That is, let ri, i = 1, . . . , N denote the ith score after the scores have been ranked by s(d). The location parameter θ is essentially the median of the datapoints. For N odd, $\theta = r_{(N+1)/2}$, and for N even, $\theta = \frac{1}{2}\left(r_{N/2} + r_{N/2+1}\right)$. The inverse scale parameter is given by:

$$\beta = \frac{N}{\sum_d |s(d) - \theta|}.$$
We denote this method as Laplace below.
Asymmetric Laplace Distributions
An asymmetric Laplace is fit to each of the class-conditional densities using the maximum likelihood estimation procedure described in Section 3.2.3 above. As with the asymmetric Gaussian, intervals between adjacent scores are divided into 10 pieces for testing
candidate θs. This method is denoted as A. Laplace below.
Logistic Regression
This method is the first of two methods we evaluated that directly fit the posterior,
P (+|s(d)). Both methods restrict the set of families to a two-parameter sigmoid family;
they differ primarily in their model of class labels. As opposed to the above methods, one
can argue that an additional boon of these methods is they completely preserve the ranking
given by the classifier. When this is desired, these methods may be more appropriate.
The previous methods will mostly preserve the rankings, but they can deviate if the data
dictate it. Thus, they may model the data behavior better at the cost of departing from a
monotonicity constraint in the output of the classifier.
Lewis & Gale [LG94] use logistic regression to recalibrate naïve Bayes for subsequent use in active learning. The model they use is:

$$P(+ \mid s(d)) = \frac{\exp(a + b\, s(d))}{1 + \exp(a + b\, s(d))}. \qquad (3.23)$$
Instead of using the probabilities directly output by the classifier, they use the loglikelihood ratio of the probabilities, log [P(d|+)/P(d|−)], as the score s(d). Instead of using this below, we will use the log-odds ratio. This does not affect the model, as it simply shifts all of the scores by a constant determined by the priors. We refer to this method as LogReg below.
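A minimal sketch of fitting Eq. 3.23 with an off-the-shelf solver (scikit-learn is our choice for illustration; any maximum likelihood fit of a and b would do):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_logreg_recalibrator(scores, labels):
    """Fit P(+|s) = exp(a + b*s) / (1 + exp(a + b*s)) by maximum likelihood.
    scores: 1-D sequence of s(d); labels: 1 for '+', 0 for '-'."""
    model = LogisticRegression(C=1e6)  # large C approximates an unregularized MLE
    model.fit(np.asarray(scores).reshape(-1, 1), labels)
    return lambda s: model.predict_proba([[s]])[0, 1]
```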
Logistic Regression with Noisy Class Labels
Platt [Pla99] proposes a framework that extends the logistic regression model above
to incorporate noisy class labels and uses it to produce probability estimates from the raw
output of an SVM.
This model differs from the LogReg model only in how the parameters are estimated.
The parameters are still fit using maximum likelihood estimation, but a model of noisy
class labels is used in addition to allow for the possibility that the class was mislabeled.
The noise is modeled by assuming there is a finite probability of mislabeling a positive
example and of mislabeling a negative example; these two noise estimates are determined
by the number of positive examples and the number of negative examples (using Bayes’
rule to infer the probability of incorrect label).
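For reference, Platt's construction (as given in [Pla99], restated here for the reader's convenience) replaces the hard 0/1 targets in the maximum likelihood fit with softened targets derived from the counts of positive and negative training examples, N+ and N−:

$$t_+ = \frac{N_+ + 1}{N_+ + 2}, \qquad t_- = \frac{1}{N_- + 2}.$$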
Even though the performance of this model would not be expected to deviate much from
LogReg, we evaluate it for completeness. We refer to this method below as LR+Noise.
Data
We examined several corpora, including the MSN Web Directory (13 classes), Reuters (10
classes), and TREC-AP (20 classes). More details about the data are given in Section 6.2.
Classifiers
We selected two classifiers for evaluation: a linear SVM classifier, which is a discriminative classifier that does not normally output probability values, and a naïve Bayes classifier, whose probability outputs are often poor [Ben00, DP96] but can be improved [Ben00, ZE01, ZE02].
SVM
For linear SVMs, we use the Smox toolkit, which is based on Platt's Sequential Minimal Optimization algorithm. The features were represented as continuous values. We used the raw output score of the SVM as s(d), since it has been shown to be appropriate before [Pla99]. The normal decision threshold (assuming we are seeking to minimize errors) for this classifier is at zero.
Naïve Bayes
The naïve Bayes classifier model is a multinomial model [MN98]. We smoothed word and class probabilities using a Bayesian estimate (with the word prior) and a Laplace m-estimate, respectively. For more details, see Section 6.3.4. We use the log-odds estimated by the classifier as s(d). The normal decision threshold is at zero.
Performance Measures
We use log-loss [Goo52] and squared error [Bri50, DF86] to evaluate the quality of the
probability estimates. Both of these are proper scoring rules [DF83, DF86] in the sense
that a classifier’s view of its expected performance is maximized when the classifier actually
issues a probability of p̂ when it assesses the probability to be p̂, i.e. the classifier cannot
expect to gain from “hedging its bets”.
For a document d with class c(d) ∈ {+, −} (i.e. the data have known labels and not probabilities), log-loss is defined as:

$$\delta(c(d), +) \log P(+|d) + \delta(c(d), -) \log P(-|d) \qquad (3.24)$$

where δ(a, b) = 1 if a = b and 0 otherwise. The squared error is:

$$\delta(c(d), +)\,(1 - P(+|d))^2 + \delta(c(d), -)\,(1 - P(-|d))^2. \qquad (3.25)$$
When the class of a document is correctly predicted with a probability of one, log-loss is
zero and squared error is zero. When the class of a document is incorrectly predicted with
a probability of one, log-loss is −∞ and squared error is one. Thus, both measures assess
how close an estimate comes to correctly predicting the item’s class but vary in how harshly
incorrect predictions are penalized.
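As a sketch, both per-document measures translate directly into code (the base-2 log convention matches the note later in this section; function names are ours):

```python
import math

def log_loss(c, p_pos):
    """Eq. 3.24 in base 2: 0 when correct with probability one, -inf when wrong."""
    p = p_pos if c == '+' else 1.0 - p_pos
    return math.log2(p) if p > 0 else -math.inf

def squared_error(c, p_pos):
    """Eq. 3.25: 0 when correct with probability one, 1 when wrong."""
    p = p_pos if c == '+' else 1.0 - p_pos
    return (1.0 - p) ** 2
```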
We report only the sum of these measures and omit the averages for space. Their
averages, average log-loss and mean squared error (MSE), can be computed from these
totals by dividing by the number of binary decisions in a corpus. Note that the log-loss
numbers given in this chapter are given as log base 2.
In addition, we also compare the error of the classifiers at their default thresholds and
with the probabilities. This evaluates how the probability estimates have improved with
respect to the decision threshold P (+|d) = 0.5. Thus, error only indicates how the methods
would perform if a false positive was penalized the same as a false negative and not the
general quality of the probability estimates. It is presented simply to provide the reader
with a more complete understanding of the empirical tendencies of the methods.
We use a standard paired micro sign test [YL99] to determine statistical significance in the difference of all measures. Only pairs that the methods disagree on are used in the sign test. This test compares pairs of scores from two systems with the null hypothesis that the number of wins among the items they disagree on follows a binomial distribution. We use a significance level of p = 0.01.
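A sketch of the test on the disagreement pairs (using SciPy's exact binomial test; the function choice is ours, not from [YL99]):

```python
from scipy.stats import binomtest

def paired_sign_test(wins_a, wins_b, alpha=0.01):
    """Two-sided sign test on the items systems A and B disagree on.
    Under the null, A's wins among the n disagreements ~ Binomial(n, 0.5)."""
    n = wins_a + wins_b
    p_value = binomtest(wins_a, n, p=0.5).pvalue
    return p_value < alpha, p_value
```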
(a) Naïve Bayes

MSN Web          Log-loss      Error²      Errors
Gauss            -60656.41     10503.30    10754
A. Gauss         -57262.26      8727.47     9675
Laplace          -45363.84      8617.59    10927
A. Laplace       -36765.88      6407.84†    8350
LogReg           -36470.99      6525.47     8540
LR+Noise         -36468.18      6534.61     8563
naïve Bayes    -1098900.83     17117.50    17834

Reuters          Log-loss      Error²      Errors
Gauss             -5523.14      1124.17     1654
A. Gauss          -4929.12       652.67      888
Laplace           -5677.68      1157.33     1416
A. Laplace        -3106.95‡      554.37‡     726
LogReg            -3375.63       603.20      786
LR+Noise          -3374.15       604.80      785
naïve Bayes     -52184.52      1969.41     2121

TREC-AP          Log-loss      Error²      Errors
Gauss            -57872.57      8431.89     9705
A. Gauss         -66009.43      7826.99     8865
Laplace          -61548.42      9571.29    11442
A. Laplace       -48711.55      7251.87‡    8642
LogReg           -48250.81      7540.60     8797
LR+Noise         -48251.51      7544.84     8801
naïve Bayes   -1903487.10     41770.21    43661

(b) SVM

MSN Web          Log-loss      Error²      Errors
Gauss            -54463.32      9090.57    10555
A. Gauss         -44363.70      6907.79     8375
Laplace          -42429.25      7669.75    10201
A. Laplace       -31133.83      5003.32     6170
LogReg           -30209.36      5158.74     6480
LR+Noise         -30294.01      5209.80     6551
Linear SVM            N/A          N/A      6602

Reuters          Log-loss      Error²      Errors
Gauss             -3955.33       589.25      735
A. Gauss          -4580.46       428.21      532
Laplace           -3569.36       640.19      770
A. Laplace        -2599.28       412.75      505
LogReg            -2575.85       407.48      509
LR+Noise          -2567.68       408.82      516
Linear SVM            N/A          N/A       516

TREC-AP          Log-loss      Error²      Errors
Gauss            -54620.94      6525.71     7321
A. Gauss         -77729.49      6062.64     6639
Laplace          -54543.19      7508.37     9033
A. Laplace       -48414.39      5761.25‡    6572‡
LogReg           -48285.56      5914.04     6791
LR+Noise         -48214.96      5919.25     6794
Linear SVM            N/A          N/A      6718
Table 3.2: (a) Results for naïve Bayes and (b) SVM. The best entry for a corpus is in bold. Entries that are statistically significantly better than all other entries are underlined. A † denotes the method is significantly better than all other methods except for naïve Bayes. A ‡ denotes the entry is significantly better than all other methods except for A. Gauss (and naïve Bayes for table (a)). The reason for this distinction in significance tests is described in the text.
Experimental Methodology
As the categories under consideration in the experiments are not mutually exclusive, the classification was done by training n binary classifiers, where n is the number of classes. In order to generate the scores that each method uses to fit its probability estimates, we use five-fold cross-validation on the training data. We note that even though it is computationally efficient to perform leave-one-out cross-validation for the naïve Bayes classifier, this may not be desirable since the distribution of scores can be skewed as a result. Of course, as with any application of k-fold cross-validation, it is also possible to bias the results by holding k too low and underestimating the performance of the final classifier.
Results & Discussion
The results for recalibrating naïve Bayes are given in Table 3.2a. Table 3.2b gives results for producing probabilistic outputs for SVMs.
We start with general observations that result from examining the performance of these methods over the various corpora. The first is that A. Laplace, LR+Noise, and LogReg quite clearly outperform the other methods. There is usually little difference between the performance of LR+Noise and LogReg (both as shown here and on a decision-by-decision basis), but this is unsurprising since LR+Noise just adds noisy class labels to the LogReg model. With respect to the three different measures, LR+Noise and LogReg tend to perform slightly better (but never significantly) than A. Laplace at some tasks with respect to log-loss and squared error. However, A. Laplace always produces the fewest errors for all of the tasks, though at times the degree of improvement is not significant.
In order to give the reader a better sense of the behavior of these methods, Figures 3.5-3.6 show the fits produced by the most competitive of these methods versus the actual data behavior (as estimated nonparametrically by binning) for class Earn in Reuters. Figure 3.5 shows the class-conditional densities, and thus only A. Laplace is shown since LogReg fits the posterior directly. Figure 3.6 shows the estimations of the log-odds, i.e. log [P(Earn | s(d)) / P(¬Earn | s(d))]. Viewing the log-odds (rather than the posterior) usually enables errors in estimation to be detected by the eye more easily.
We can break things down as the sign test does and just look at wins and losses on the items that the methods disagree on. Looked at in this way, only two methods (naïve Bayes and A. Gauss) ever have more pairwise wins than A. Laplace; those two sometimes have more pairwise wins on log-loss and squared error even though their totals never win (i.e. they are dragged down by heavy penalties).
In addition, this comparison of pairwise wins means that for those cases where LogReg and LR+Noise have better scores than A. Laplace, it would not be deemed significant by the sign test at any level, since they do not have more wins. For example, of the 130K binary decisions over the MSN Web dataset, A. Laplace had approximately 101K pairwise wins versus LogReg and LR+Noise. No method ever had more pairwise wins than A. Laplace for the error comparison, nor did any method ever achieve a better total.
[Figure 3.5: two panels plot the empirical score densities p(s(d) | Class = {+,−}) for the training (Train) and test (Test) sets together with the fitted A.Laplace curves; the left panel uses s(d) = naive Bayes log-odds, and the right panel uses s(d) = linear SVM raw score.]
Figure 3.5: The empirical distribution of classifier scores for documents in the training and the
test set for class Earn in Reuters. Also shown is the fit of the asymmetric Laplace distribution to
the training score distribution. The positive class (i.e. Earn) is the distribution on the right in each
graph, and the negative class (i.e. ¬Earn) is that on the left in each graph.
The basic observation made about naïve Bayes in previous work is that it tends to produce estimates very close to zero and one [Ben00, LG94]. This means that if it tends to be right enough of the time, it will produce results that do not appear significant in a sign test that ignores the size of the difference (as the one used here does). The totals of the squared error and log-loss bear out the previous observation that "when it's wrong it's really wrong".
There are several interesting points about the performance of the asymmetric distributions as well. First, A. Gauss performs poorly because (similar to naïve Bayes) there are some examples where it is penalized a large amount. This behavior results from a general tendency to perform like the picture shown in Figure 3.4 (note the crossover at the tails). While the asymmetric Gaussian tends to place the mode much more accurately than a symmetric Gaussian, its asymmetric flexibility combined with its distance function causes it to distribute too much mass to the outside tails while failing to fit around the mode accurately enough to compensate. Figure 3.4 is actually a result of fitting the two distributions to real data. As a result, at the tails there can be a large discrepancy between the likelihood of belonging to each class. Thus when there are no outliers A. Gauss can perform quite
competitively, but when there is an outlier A. Gauss is penalized quite heavily. There are enough such cases overall that it seems clearly inferior to the top three methods.

[Figure 3.6: two panels compare the A.Laplace and LogReg fits to the empirical log-odds, log P(+ | s(d)) − log P(− | s(d)), of the training (Train) and test (Test) data; the left panel uses s(d) = naive Bayes log-odds, and the right panel uses s(d) = linear SVM raw score.]

Figure 3.6: The fit produced by various methods compared to the empirical log-odds of the training data for class Earn in Reuters.
However, the asymmetric Laplace places much more emphasis around the mode (Figure 3.5) because of the different distance function (think of the "sharp peak" of an exponential). As a result, most of the mass stays centered around the mode, while the asymmetric parameters still allow more flexibility than the standard Laplace. Since the standard Laplace also corresponds to a piecewise fit in the log-odds space, this highlights that part of the power of the asymmetric methods is their sensitivity in placing the knots at the actual modes, rather than relying on the symmetric assumption that the means correspond to the modes. Additionally, the asymmetric methods have greater flexibility in fitting the slopes of the line segments as well. Even in cases where the test distribution differs from the training distribution (Figure 3.5), A. Laplace still yields a solution that gives a better fit than LogReg (Figure 3.6), the next best competitor.
Finally, we can make a few observations about the usefulness of the various performance metrics. First, log-loss only awards a finite amount of credit as the degree to which something is correct improves (i.e. there are diminishing returns as it approaches zero), but it can infinitely penalize a wrong estimate. Thus, it is possible for one outlier to skew the totals, yet misclassifying this example may not matter for any but a handful of the utility functions actually used in practice. Secondly, squared error has a weakness in the other direction. That is, its penalty and reward are bounded in [0, 1], but if the number of errors is small enough, it is possible for a method to appear better when it is producing what we generally consider unhelpful probability estimates. For example, consider a method that only estimates probabilities as zero or one (which naïve Bayes tends toward but doesn't quite reach if you use smoothing). This method could win according to squared error, but with just one error it would never perform better on log-loss than any method that assigns some non-zero probability to each outcome. For these reasons, we recommend that neither of these measures be used in isolation, as they each give slightly different insights into the quality of the estimates produced. These observations are straightforward from the definitions but are underscored by the evaluation.
3.2.5 Related Work
Parametric models have been employed to obtain probability estimates in several areas relevant to text classification. Lewis & Gale [LG94] use logistic regression to recalibrate naïve Bayes, though the quality of the probability estimates is not directly evaluated; it is simply an intermediate step in active learning. Manmatha et al. [MRF01] introduced appropriate models for producing probability estimates from relevance scores returned from search engines and demonstrated how the resulting probability estimates could be subsequently employed to combine the outputs of several search engines. They use a different parametric distribution for the relevant and irrelevant classes but do not pursue two-sided asymmetric distributions for a single class, as described here. They also survey the long history of modeling the relevance scores of search engines. Our work is similar in flavor to these previous attempts to model search engine scores, but we target text classifier outputs, which we have found demonstrate a different type of score distribution behavior because of the role of training data.
Zadrozny & Elkan [ZE01] provide what can be thought of as a type of pruning targeted at improving the reliability of probability estimates obtained from decision trees (termed curtailment) and a non-parametric method for recalibrating naïve Bayes. In more recent work [ZE02], they investigate a semi-parametric method that uses a monotonic piecewise-constant fit to the data and apply the method to naïve Bayes and a linear SVM. While they compared their methods to other parametric methods based on symmetry, they fail to provide significance test results. Our work provides asymmetric parametric methods that complement the non-parametric and semi-parametric methods they propose when data scarcity is an issue. In addition, their methods reduce the resolution of the scores output by the classifier (the number of distinct values output), but the methods here do not have such a weakness since they are continuous functions.
Just as logistic regression allows the log-odds of the posterior distribution to be fit
directly with a line, we could directly fit the log-odds of the posterior with a piecewise
linear function (a spline) instead of indirectly doing the same thing by fitting the asymmetric
Laplace. In a follow-up to our work, Zhang and Yang [ZY04] did just that and obtained an
approach with even more power while retaining asymmetry.
There is a variety of other work that this section of the dissertation extends. Platt [Pla99]
uses a logistic regression framework that models noisy class labels to produce probabilities
from the raw output of an SVM. His work showed that this post-processing method not only
can produce probability estimates of similar quality to SVMs directly trained to produce
probabilities (regularized likelihood kernel methods), but it also tends to produce sparser
kernels (which generalize better). Finally, recalibrating poorly calibrated classifiers is not
a new problem. Lindley et al. [LTB79] first proposed the idea of recalibrating classifiers,
and DeGroot & Fienberg [DF83, DF86] gave the now accepted standard formalization for
the problem of assessing calibration initiated by others [Bri50, Win69].
3.2.6 Summary of Recalibration Methods
We have reviewed a wide variety of parametric methods for producing probability estimates from the raw scores of a discriminative classifier and for recalibrating an uncalibrated probabilistic classifier. In addition, we have introduced two new families that attempt to capitalize on the asymmetric behavior that tends to arise from learning a discrimination function. We have given an efficient way to estimate the parameters of these distributions.
While these distributions attempt to strike a balance between the generalization power of parametric distributions and the flexibility that the added asymmetric parameters give, the asymmetric Gaussian appears to place too great an emphasis away from the modes. In striking contrast, the asymmetric Laplace distribution appears preferable to the primary competing parametric methods over several large text domains and a variety of performance measures, though comparable performance is sometimes achieved with one of the two varieties of logistic regression.
Given the ease of estimating the parameters of this distribution, it is a good first choice
for producing quality probability estimates when training data is scarce. When training
data is plentiful, isotonic regression methods [ZE02] may yield better results. These estimates can then be used as input to a classifier combination algorithm. Additionally, the
work in this section has demonstrated both empirically and from a theoretical standpoint,
that classifiers tend to yield asymmetric score distributions and are unlikely to give rise to
normal distributions as assumed in combination work by Kahn [Kah04].
Chapter 4
Locality
We remind the reader that our goal is ultimately to outperform a global linear combination
of classifier log-odds, probabilities, or scores. Thus a simple definition of locality is any
combination function where the weight on such classifier outputs cannot be expressed as a
global weight. However, this section seeks to motivate why we may want to vary the weight
we place on a classifier, the implications this carries, and, in the ideal case, what statistics
should be most influential in determining the weight.
To accomplish this, we approach the problem from a series of simplified views. For
example, we can consider when there is a single base classifier, when the base classifiers
are both calibrated and conditionally independent given the class, when the base classifiers
are uncalibrated but remain conditionally independent, and finally when the base classifiers
are neither calibrated nor conditionally independent. Likewise, we can also simplify the
formulation by considering the form the solution would take for a particular combination
model if we were given not only the class labels for each training example, but the “true”
posterior P(c | x) as well. Examining the combination from these simplified viewpoints
will allow us to separate what terms are hard to estimate, due to sparse data or missing
information, and how the information should be used.
To enable clarity in the rest of the chapter, we start by discussing the interpretation
of the “true” posterior for classification. Following this, we briefly return to calibration,
and demonstrate that while it is a key characteristic as discussed in Chapter 3, it does not
address one of the primary challenges of classifier combination — namely estimating the
dependencies of the classifier outputs.
Furthermore, it does not fully take advantage of the fact that the reliability of a classifier’s predictions can vary across the input space. In this chapter, we argue that considering
the locally changing interactions among classifiers is key to improving classifier combination performance. Therefore, we turn our focus to these issues by motivating and then
defining the concepts of local dependence, reliability, and variance.
4.1 "True" Posteriors, Log-odds, and Confidences
For clarity, we can use a common situation researchers encounter to discuss concepts key
to the rest of this chapter in a simplified manner. Consider the following tasks often faced
during peer review¹:
• “Make a recommendation accept/reject.”
• “Rate this paper from 0 to 5, where 0 is definitely reject and 5 is definitely accept.”
• “State your confidence on a 0 to 5 scale in your review.”
When a reviewer answers the first question, she is making a classification prediction
regarding the paper. The answer to the second question is the posterior probability after
reading the paper that the reviewer assesses regarding “Accept/Reject”. In fact, if we were
to center the 0 to 5 scales on 0 so that they run from −2.5 to 2.5, then the score would be
similar to log-odds. A negative value would indicate that “Reject” has a higher posterior
probability and a positive value would indicate that “Accept” has a higher posterior probability. Presumably, if the reviewer is acting consistently her recommendation will behave
similarly.
Next, the reviewer states her confidence — which intuitively is a self-assessment of her
expertise and mathematically is a statement about how strongly she believes the posterior
she gave is correct. In other words it is a second-order summary of the uncertainty the
reviewer has about her classification. Consider if instead of making the reviewer specify her
uncertainty, we allowed her to specify an entire distribution expressing her belief p(P(c |
x) = z | x). Then when she is first forced to summarize her uncertainty via the rating,
the typical and self-consistent approach is to predict the expected value of the distribution: $\hat{P}(c \mid x) = \int z\, p(P(c \mid x) = z \mid x)\, dz$. However, as the reader is well aware, the mean of a
distribution does not fully summarize the distribution. Presumably, as the reviewer receives
more information or perceives she has all necessary information because of her expertise,
her confidence that the expected value fully summarizes her uncertainty will become quite
high. Therefore a reasonable measure for confidence is to treat it as an (inverse) measure
of the variance of p(P(c | x) = z | x), or the spread of the distribution from its mean value.

¹We assume that the reviewer assigns these outcomes conditionally on the paper reviewed but independent of the content of other papers.
While researchers commonly perform peer reviews and understand the intuitive notions involved, they are often perplexed when trying to mechanically produce such estimates from a classifier. This occurs for many reasons, some of which can be highlighted in the same example. What does it mean to say a paper has a "true class" of "Reject"? Furthermore, what would it mean to say that the probability of reject after reading the paper is 0.8? In part, the confusion results from conflating a Bayesian notion of subjective probabilities with a frequentist notion of empirical probabilities. In the Bayesian scheme (including the preceding paragraph), probability theory is simply a useful tool to convey uncertainty in a manner that follows certain rules of self-coherence.
Whereas, in the frequentist notion, we must have at least an imaginary concept of a
repeatable experiment. How would one sample “similar” papers? A slightly more tenable viewpoint would be to sample reject/accept opinions from “similar” people. In our
example, “similar” people would amount to something like “people with expertise like
those on the editorial board". In this view, P(accept | paper) is the empirical frequency as we sample more opinions, i.e. $\lim_{N \to \infty} \frac{|R_i = \text{accept}|}{N}$, and the belief distribution
N
p(P(c | x) = z | x) is an estimate of the probability the limit will take on each of these values given the evidence. In other words, a statement that P̂ (accept | paper) = 0.80 means
that the estimator believes 80% of a “similar” population would decide to accept this paper.
The frequentist explanation is not necessary, but it makes it easier to conceive of the
type of uncertainty we may want to capture. For our applications, the task is often to
predict topic and the population being sampled can be thought of as all potential users of
the application. Some documents are less clearly on one topic, and therefore, there will
be higher disagreement on those documents. Thus documents will rarely ever have a true
posterior of 1 or 0. It is simply often treated as such because our training data typically
only consists of a single class label for each point.
4.2 Calibration & Locality
Similar to other works, we assume we can obtain from each classifier Ci a conditional probability estimate, π̂i(c) = P̂Ci(c|x), and a log-odds like score, λ̂i(c) = log [π̂i(c) / (1 − π̂i(c))], either directly or by postprocessing as discussed in Section 3.2. Thus, our approach can either be viewed as a method of combining probability forecasters or as combining classifiers where
we make assumptions that allow us to model the "internal probabilities" the classifiers are implicitly utilizing in making their decisions.
Viewing this as a problem of combining log-odds, the problem is equivalent to performing inference in the model depicted in Figure 4.1. Here we have simply replaced the difficult problem of estimating p(λ̂1, . . . , λ̂n | c) by what seems to be the equally hard problem of estimating p(λ̂1, . . . , λ̂n | λ).² However, when we start to make simplifying assumptions, the difference becomes more obvious. As we noted, Kahn [Kah04] worked with a model where the class-conditional classifier outputs were assumed to be Gaussian, but even in the case of a single classifier this model cannot support the type of asymmetric behavior that we see in practice, as shown in Section 3.2.
As mentioned in Chapter 3, if we restrict the number of base classifiers to a single classifier, the problem becomes equivalent to recalibrating that classifier. Let's assume we have a single classifier whose log-odds estimates, λ̂, are distributed normally around the true log-odds, λ̂ ∼ N(λ, 1). As shown in the left of Figure 4.2, even when we assume the prior on the true log-odds is a simple Gaussian, the resulting class-conditional distribution p(λ̂ | c) has asymmetric properties similar to what is seen in practice (see Chapter 3).

[Figure 4.1: a graphical model in which the class c generates the true log-odds λ, which in turn generates each classifier's estimate λ̂1, . . . , λ̂n.]
Figure 4.1: Classifier combination can be thought of as combining estimates of each classifier's estimate of the log-odds, λ̂i, via the latent variable representing the true log-odds, λ, to improve the prediction of the class c. That is, via $p(\hat{\lambda}_1, \ldots, \hat{\lambda}_n \mid c) = \int_{-\infty}^{\infty} p(\hat{\lambda}_1, \ldots, \hat{\lambda}_n \mid \lambda)\, p(\lambda \mid c)\, d\lambda$.
If we were to continue to work with this formal model, different formulations would
focus on different characteristics. For example, the classifier’s predictions may not be
centered on the true log-odds, but instead the predictions may show a bias to the left or
right. Thus we could easily include a factor that allowed us to model a systematic shift
in each classifier indicating overconfidence or underconfidence. Likewise, where we used
λ̂ ∼ N (λ, 1) in the example above, we might instead choose to use a different standard
²Note that by definition P(c | λ) = (1 + exp{−cλ})⁻¹, and we can use Bayes' rule to invert it.
deviation in various parts of the space. For example, areas where we have large amounts of
training data could use a small standard deviation as we achieve a tighter confidence around
our prediction. Note that this approach implies that the calibration of the classifier varies
locally according to the estimated confidence bound on the true log-odds.
[Figure 4.2: three panels plot p(λ̂ | c) against λ̂, one for each choice of prior described in the caption below.]
Figure 4.2: A few examples of the distribution of p(λ̂ | c) for various choices of the prior on the true log-odds, p(λ), when the classifier's predictions are distributed normally around the true log-odds, λ̂ ∼ N(λ, 1). The prior used is a single Gaussian (left), a mixture of two Gaussians (middle), and a mixture of three Gaussians (right). 100K samples were drawn from each distribution. The asymmetry of the resulting distributions is very reminiscent of those seen in practice as shown in Section 3.2.
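The sampling scheme behind Figure 4.2 is easy to reproduce; the sketch below uses a single Gaussian prior with an assumed scale (the caption does not give the exact prior parameters, so the values here are illustrative):

```python
import numpy as np
from scipy.special import expit  # numerically stable (1 + exp(-x))**-1

rng = np.random.default_rng(0)
n = 100_000

lam = rng.normal(0.0, 25.0, size=n)   # prior p(lambda); the scale is assumed
lam_hat = rng.normal(lam, 1.0)        # classifier estimate ~ N(lambda, 1)

# Sample the class using P(c = + | lambda) = (1 + exp(-lambda))**-1 (footnote 2)
c_pos = rng.random(n) < expit(lam)

# Histogramming lam_hat[c_pos] and lam_hat[~c_pos] separately recovers the
# asymmetric class-conditional shapes p(lambda-hat | c) shown in the figure.
pos, neg = lam_hat[c_pos], lam_hat[~c_pos]
```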
We do not directly work with such a graphical model, but instead use it to point out the
recurring themes that we incorporate into our work. These properties can be formulated in
terms of the probability estimates the classifiers emit or in terms of the log-odds. Readers
interested in pursuing this model may also wish to consider implications discussed later in
this chapter. For example, a simplified version of this model would assume the classifiers’
log-odds estimates are independent given the true log-odds. Finally, rather than use such an
independence assumption globally, we could posit that there exists a mixture of different
regions where the parameters are specific to the region.
4.3 Dependence & Locality
Returning to our example (see Section 1.2.1) of a feature space where two mutually exclusive and exhaustive subsets are independent given the class, consider if we now have one classifier based on each subset and each outputs the log-odds using the true posterior based only on its subset (i.e., P(c | x_{i,1}, ..., x_{i,k}) and P(c | x_{i,k+1}, ..., x_{i,n})). It can easily be shown by factorizing the joint that the optimal combination of these classifiers' log-odds is simply their sum with the extra prior removed.³

³If given probability estimates, it is the normalized product.
Figure 4.3: An influence diagram for two classifiers whose optimal combination is to allow the output of each (Ŷ_1 and Ŷ_2) to contribute independently to the final prediction. The input dimensions X_1, ..., X_k are independent of dimensions X_{k+1}, ..., X_n given the class variable; though, the interactions within the two feature sets may be arbitrarily complex (which is why they are depicted as within one box). One classifier's predictions (Ŷ_1) depend only on the values the first feature set takes (X_1, ..., X_k) while the other classifier's predictions (Ŷ_2) depend only on the values the second feature set takes (X_{k+1}, ..., X_n).
This situation is depicted with an influence diagram in Figure 4.3. However, if the first classifier additionally modeled one of the attributes from the second subset, P(c | x_{i,1}, ..., x_{i,k+1}), this would no longer be the optimal combination. The reason is the well-known fact that their common dependence must be factored out. In general, then, we will have to consider the dependence of the outputs. This is not surprising given that the theoretical proof of improvement reviewed in Section 3.1 required we estimate the joint distribution, and it is necessary to consider dependence when estimating the joint.
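The "sum with the extra prior removed" claim above is easy to check numerically. The sketch below is our own toy construction (arbitrary Dirichlet-sampled conditionals): it builds a joint in which two discrete features are independent given the class and verifies that the true posterior log-odds equal the sum of the two per-feature posterior log-odds minus one copy of the prior log-odds.

import numpy as np

rng = np.random.default_rng(0)
prior = np.array([0.3, 0.7])                 # [P(-), P(+)]
px1 = rng.dirichlet(np.ones(4), size=2)      # P(x1 | c); row 0: class -, row 1: class +
px2 = rng.dirichlet(np.ones(5), size=2)      # P(x2 | c)
rho = np.log(prior[1] / prior[0])            # prior log-odds

for x1 in range(4):
    for x2 in range(5):
        joint = prior * px1[:, x1] * px2[:, x2]    # x1 and x2 independent given c
        lam = np.log(joint[1] / joint[0])          # true posterior log-odds
        lam1 = np.log((prior[1] * px1[1, x1]) / (prior[0] * px1[0, x1]))
        lam2 = np.log((prior[1] * px2[1, x2]) / (prior[0] * px2[0, x2]))
        assert np.isclose(lam, lam1 + lam2 - rho)  # sum with the extra prior removed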
We can make this more explicit by considering a case where the features are partitioned into 3 subsets: X_1, X_2, and S. Assume Classifier 1 uses X_1 and S and outputs an uncalibrated estimate based on these variables of the form λ̂_1 = a_1 log [P(+ | x_1, s) / P(− | x_1, s)] + b_1. Likewise, Classifier 2 outputs λ̂_2 = a_2 log [P(+ | x_2, s) / P(− | x_2, s)] + b_2. Consider the optimal combination of these two classifiers. Given the prior log-ratio ρ = log [P(+) / P(−)] and the measure of the shared information λ_S = log [P(+ | s) / P(− | s)], the optimal combination can be written as a linear combination with a non-constant bias term dependent only on the shared information:
λ(x) = a_1⁻¹ [λ̂_1(x) − b_1] + a_2⁻¹ [λ̂_2(x) − b_2] − λ_S(x) − ρ      (4.1)
     = w_1 λ̂_1 + w_2 λ̂_2 + b*(x)      (4.2)

where w_1 = a_1⁻¹, w_2 = a_2⁻¹, and b*(x) = −λ_S(x) − ρ − b_1/a_1 − b_2/a_2.      (4.3)
Notice that the non-constant part of the bias term will correct the sign of the combined decision whenever the amount of double-counted information swamps that presented by the conditionally independent information each classifier contributes. This example illustrates that, rather than trying to directly estimate terms such as λ_S, we can simply combine the log-odds with a linear combination where all the weights except the bias are constant. Then the weights can be interpreted as implicitly representing these interactions. Thus, by introducing non-constant weights we can capture a range of dependency interactions.
4.4 Variance, Sensitivity, & Locality
Now we turn to the issue of variance. As this section will demonstrate, the variance of several quantities will be of interest to us: the sensitivity of the model, the covariance of the model's estimates with the true outputs, and the variance of the error in prediction. In brief, we will have to consider the sensitivity or variance of the model's output when considering how to normalize the current outputs, the covariance with the true outputs to determine how to rescale the normalized estimates, and the variance in error prediction both as a term to minimize and through its connection to the other two via the variance of the true outputs. We elaborate on each of these as we proceed.
We start off by considering what would happen if our classifier had total information.
In this case, all of the predictions would be based on the true posterior, and the classifier
would suffer only the Bayes error. This notion of variance then is the generalization of
refinement from the calibration literature. This is essentially a measure of the amount of
information on average that was lacking to fully explain the deterministic portion of the
output value. How might we extend this to be a reasonable definition of variance around a
point in the reliability diagram?
From our discussion of the model incorporating the latent variable of the true log-odds in Figure 4.1, the answer is clear: the prediction error variance is the spread around the average deviation from the true log-odds. For some simulated data, this term is computable. For real data, however, we must simply base our estimates on whether the example was labeled as belonging to the class or not. This has led some authors to claim that variance is not well-defined for classification, but it would be more appropriate to say it is not computable. If we work with a formal probabilistic model, then clearly our assumptions about the variance in the model can have an impact via inference over the latent true log-odds. Additionally, we can attempt to model this factor by assuming that if the estimates of the classifiers are changing rapidly with small changes in the input data (sensitivity), then most likely the true log-odds are not changing rapidly in a like fashion, and therefore the variance in prediction error is high.
To clarify this further, consider the following setup. We are given a single base classifier's predictions, λ̂, and we would like to find parameters a and b for a linear transformation such that λ̂*(x) = a λ̂(x) + b minimizes the expected squared error with respect to the true log-odds. That is, we seek argmin_{a,b} E[(a λ̂(x) + b − λ(x))²]. Note that there is no assumption that the base classifier or the true log-odds are linear; we simply want the best linear transformation. If the base classifier outputs non-linear predictions, then after the transformation the transformed outputs will also be non-linear. We can solve this following the solution for the standard regression problem.
E[(λ̂*(x) − λ(x))²] = ∫ p(x) [λ̂*(x) − λ(x)]² dx      (4.4)
= ∫ p(x) [a λ̂(x) + b − λ(x)]² dx      (4.5)
= a² ∫ p(x) λ̂²(x) dx + b² ∫ p(x) dx + ∫ p(x) λ²(x) dx + 2ab ∫ p(x) λ̂(x) dx − 2a ∫ p(x) λ̂(x) λ(x) dx − 2b ∫ p(x) λ(x) dx      (4.6)

By replacement and since ∫ p(x) dx = 1,

= a² E[λ̂²(x)] + b² + E[λ²(x)] + 2ab E[λ̂(x)] − 2a E[λ̂(x)λ(x)] − 2b E[λ(x)].      (4.7)

Which yields

∂E[(λ̂*(x) − λ(x))²] / ∂a = 2a E[λ̂²(x)] + 2b E[λ̂(x)] − 2 E[λ̂(x)λ(x)]      (4.8)
∂E[(λ̂*(x) − λ(x))²] / ∂b = 2b + 2a E[λ̂(x)] − 2 E[λ(x)]      (4.9)
Setting to zero and rearranging terms assuming E[λ̂²(x)] ≠ 0 gives the system

a = ( E[λ̂(x)λ(x)] − b E[λ̂(x)] ) / E[λ̂²(x)]      (4.10)
b = E[λ(x)] − a E[λ̂(x)]      (4.11)

Assuming VAR[λ̂(x)] ≠ 0 and solving gives

a = ( E[λ̂(x)λ(x)] − E[λ̂(x)] E[λ(x)] ) / VAR[λ̂(x)]      (4.12)
  = COV[λ̂(x), λ(x)] / VAR[λ̂(x)]      (4.13)
b = E[λ(x)] − ( COV[λ̂(x), λ(x)] / VAR[λ̂(x)] ) E[λ̂(x)]      (4.14)
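As a quick sanity check on the closed form, the sketch below (our own toy example; the linear relationships are arranged so that, as in Figure 4.4 further on, ρ_{λ,λ̂} = 1, a = 0.5, and b = 1.5) computes a and b from Eqs. 4.13 and 4.14 and confirms that they match an ordinary least-squares fit of λ on λ̂.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=100_000)    # p(x) uniform over [1, 10]
lam = 2.0 * x - 8.0                     # true log-odds
lam_hat = 2.0 * lam - 3.0               # perfectly correlated but mis-scaled

a = np.cov(lam_hat, lam, bias=True)[0, 1] / np.var(lam_hat)   # Eq. 4.13
b = lam.mean() - a * lam_hat.mean()                           # Eq. 4.14
assert np.isclose(a, 0.5) and np.isclose(b, 1.5)

a_ls, b_ls = np.polyfit(lam_hat, lam, deg=1)   # least-squares fit of lam on lam_hat
assert np.allclose([a, b], [a_ls, b_ls])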
We can rewrite the final solution in a variety of forms to gain insight. For example, we can rewrite the solutions using the correlation coefficient between the true log-odds and estimated log-odds, ρ_{λ,λ̂}, to write: a = (σ_λ / σ_λ̂) ρ_{λ,λ̂}, b = E[λ(x)] − a E[λ̂(x)]. As can be seen from the example in Figure 4.4, the coefficient a corrects the predictions to be both correlated with and have a variance similar to the correct log-odds. Since the expected value of the corrected predictions will be E[aλ̂ + b] = aE[λ̂] + b = aE[λ̂] + E[λ] − aE[λ̂] = E[λ], the additive term b ensures the new predictions are weakly calibrated on average⁴: the average difference is zero, but that does not imply E[λ̂ − λ | λ̂] = 0 for all λ̂. Also, note that if the predictions are independent of the true log-odds, then ρ_{λ,λ̂} = 0 and therefore a = 0, and the only correction that can be made is by predicting b = E[λ] at all points.

Continuing in this vein, there are a variety of other facts that can be demonstrated. For example, the squared error of the corrected predictions is E[(λ̂* − λ)²] = VAR[λ](1 − ρ²_{λ,λ̂}), which is 0 iff ρ_{λ,λ̂} = ±1; therefore, the squared error of the corrected predictions goes to zero as the magnitude of the correlation between the original predictions and the true log-odds approaches 1.
However, rather than continuing along this line, we would like to examine the quantities used in deriving the solutions. As we have already mentioned, b is related to a measure of calibration weighted by a. Now, we turn to a = COV[λ, λ̂] / VAR[λ̂].

⁴By linearity of expectation, we have E[λ̂* − λ] = E[aλ̂ + b] − E[λ] = E[λ] − E[λ] = 0.
Figure 4.4: A simple example where the input space has a single dimension to illustrate the role of the ratio of standard deviations in a = (σ_λ / σ_λ̂) ρ_{λ,λ̂}. In the example, p(x) is uniform over [1, 10]. In this example, the initial predictions are correct on average: E[λ] = E[λ̂]. The predicted log-odds, λ̂, are perfectly correlated with the true log-odds, λ. That is, ρ_{λ,λ̂} = 1, but a = 0.5 and b = 1.5. As can be seen from the correction using just a, the coefficient forces the variation/slope of the predictions to behave on average like the true variation. The resulting correction by b must take into account the compression and rotation caused by a.
First, we note that the denominator, VAR[λ̂], is a measure of the sensitivity of the model. That is, it captures how widely the estimates of the model vary regardless of the value of the true log-odds. Thus, given that model sensitivity is important for even this subcase of the combination problem, we expect it will play a role in the larger combination problem. Next, the covariance, COV[λ, λ̂], captures whether the predictions vary in the same way as the true log-odds. The covariance is related to the spread around the average deviation by VAR[λ̂ − λ] = VAR[λ̂] + VAR[λ] − 2 COV[λ, λ̂]. Thus, although the covariance is related to the variance of the error in prediction, it is unclear whether more can be gained in general in combination schemes from attempting to directly estimate the variance in error or the covariance.⁵
Next, as mentioned above, the average difference is only a weak measure of calibration in that E[λ̂ − λ] = 0 does not imply E[λ̂ − λ | λ̂] = 0, whereas the second condition is typically what is meant by well-calibrated as discussed earlier. However, we can consider more local measures of calibration. In particular, we can consider using a linear correction where the parameters are determined locally. In order to do so, we need only define a distribution over the domain conditional on the current prediction point, x_0. If we denote the locality or the neighborhood of the prediction point as N(x_0), we can denote the local distribution as p(x | x ∈ N(x_0)). Since this distribution integrates to unity, the derivation for the weights of the linear correction remains the same; the expectations, variance, and covariance are now computed using the local distribution. Likewise, the parameters are now locally determined. It is up to the modeler how to represent locality. For example, if we define locality as a small window around the predicted value, λ̂, then a locally linear correction will now be well-calibrated since each local correction will ensure E[λ̂ − λ | λ̂] = 0.

⁵As the reader is no doubt aware, the expected squared error can be broken down into the variance in error and the square of the expected error as E[(λ̂ − λ)²] = VAR[λ̂ − λ] + (E[λ̂ − λ])². Therefore, even though the variance in error and average error determine the squared error, they are only indirectly related to the parameter values for a linear correction.
Applying the linear correction method locally can be demonstrated concretely by generating some simple examples. We consider an input space of one dimension where p(x | c) is Gaussian and P(c) is fixed. To draw a point, we first draw its class with probability P(c) and then draw from the class-conditional distribution. 100 training points and 10,000 hold-out points are drawn. We then fit a prediction model over the training data using either linear discriminant analysis (LDA) or quadratic discriminant analysis (QDA) and add-one estimates for the priors [HTF01]. Locality is defined in these cases by choosing a fixed-width window around each prediction point in the feature space. Finally, the hold-out data is used to estimate the correction factors globally or locally.
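The following sketch reproduces the flavor of the equal-variance experiment described below. It is our own minimal version, under assumed details (treating the quoted N(μ, ·) parameters as variances, a unit-width window, and a simple pooled-variance "LDA" fit), so its exact numbers will differ from those reported.

import numpy as np

rng = np.random.default_rng(7)
MU_NEG, MU_POS, SD = -0.4854, 0.3306, np.sqrt(1.6068)   # treating 1.6068 as a variance

def draw(n):
    c = rng.random(n) < 0.5                 # P(+) = 0.5
    x = np.where(c, rng.normal(MU_POS, SD, n), rng.normal(MU_NEG, SD, n))
    return x, c

def log_odds(x, p_pos, mu_n, mu_p, sd_n, sd_p):
    def logpdf(x, mu, sd):
        return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd)
    return (np.log(p_pos / (1.0 - p_pos))
            + logpdf(x, mu_p, sd_p) - logpdf(x, mu_n, sd_n))

x_tr, c_tr = draw(100)                      # training sample
x_ho, _ = draw(10_000)                      # hold-out sample

# "LDA" fit: class means with a pooled standard deviation, add-one priors.
p_hat = (c_tr.sum() + 1) / (len(c_tr) + 2)
mu_n, mu_p = x_tr[~c_tr].mean(), x_tr[c_tr].mean()
sd = np.sqrt(np.mean((x_tr - np.where(c_tr, mu_p, mu_n)) ** 2))

lam = log_odds(x_ho, 0.5, MU_NEG, MU_POS, SD, SD)     # true log-odds
lam_hat = log_odds(x_ho, p_hat, mu_n, mu_p, sd, sd)   # estimated log-odds

# Global correction factors (Eqs. 4.13-4.14) from the hold-out sample.
a = np.cov(lam_hat, lam, bias=True)[0, 1] / np.var(lam_hat)
b = lam.mean() - a * lam_hat.mean()

# Local version: the same estimator restricted to a fixed-width window.
def local_ab(x0, width=1.0):
    m = np.abs(x_ho - x0) < width / 2
    a_l = np.cov(lam_hat[m], lam[m], bias=True)[0, 1] / np.var(lam_hat[m])
    return a_l, lam[m].mean() - a_l * lam_hat[m].mean()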
The first example generates the data with the same class-conditional variance and with P(+) = 0.5. The means were chosen randomly to obtain p(x|−) ∼ N(−0.4854, 1.6068) and p(x|+) ∼ N(0.3306, 1.6068). Using LDA, the parameters are estimated to be P̂(+) = 0.4804, p̂(x|−) ∼ N(−0.2375, 1.7941), and p̂(x|+) ∼ N(0.5184, 1.7941). We can visualize the true and estimated distributions as shown in Figure 4.5(a). The true and estimated posterior and log-odds are given in Figure 4.6.⁶ Using a single global correction, we find that a = 0.8295 and b = 0.1296. In Figure 4.7, we see that local estimation finds the same correction factors except at the edges where the hold-out data is sparse. However, since the factors compensate for each other, even in this case the poor estimation does not hurt local correction (Figure 4.10a). Note that this also demonstrates how density in an area is inversely related to variance. Tresp & Taniguchi [TT95] exploit this fact when combining classifiers.
⁶The reader can note from this example that, by working with log-odds instead of in probability space, a global linear correction can capture far more of the typical behavior seen, since in probability space the functions are non-linear.
Figure 4.5: The class-conditional distribution of feature values for two synthetic examples and their estimated forms using 100 training examples. The first (left) example constrains the class-conditional variances to be equal and uses LDA to train the model. The second (right) example has class-specific variances and uses QDA to train the model.
The second example generates the data with different class-conditional variances and with P(+) = 0.5. The means and variances were chosen randomly to obtain p(x|−) ∼ N(−1.0912, 2.3114) and p(x|+) ∼ N(1.4935, 2.0829). Using QDA, the parameters are estimated to be P̂(+) = 0.5000, p̂(x|−) ∼ N(−1.1280, 2.1314), and p̂(x|+) ∼ N(1.5791, 2.8136) (see Figure 4.5(b)). The true and estimated posterior and log-odds are given in Figure 4.8. Using a single global correction, we find that a = 1.0430 and b = −0.2412. In Figure 4.10(b), we see that global correction is not in general sufficient when either the true or estimated model is non-linear.⁷ In contrast, the locally linear model performs quite well.
In practical terms, what does this mean for us, since we never have the true log-odds? Quite clearly, the direct generalization of this is logistic regression, which calculates the weights directly. Therefore, it does not require that we specify intermediate distributions to calculate the expectations needed for the parameters. Similarly, other linear combination models have similar interpretations of the weights even when they do not explicitly introduce distributional assumptions. Likewise, locality can be introduced into these models by introducing approximations to the weight factors as part of the input to the combination model; depending on the form of the combination model, these approximations could then either be used as non-constant bias correction factors, as coefficients, or implicitly by changing the combination function based on the values they take.

⁷The globally corrected model has corrected the high-density area of examples rather than the low-density edges.
Figure 4.6: The posterior (left) and log-odds (right) for the example constrained to equal class-conditional variance.
In cases where we would like to work with generative models, though, simplifications are sometimes possible. For example, Kahn [Kah04] works with a generative model where p(λ̂ | c) is assumed to be Gaussian with equal class-conditional covariance. The resulting combination is a linear combination of the log-odds. The benefit of the model is a clean form of dealing with classifier interactions, but as we have pointed out in several places, the empirical behavior demonstrated by even a single base classifier is typically piecewise linear rather than globally linear.
More importantly, these examples have helped highlight the important quantities and
what their roles would be in a locally linear correction. The next section summarizes the
properties discussed throughout the chapter and lays the remaining foundation for our approach which will implicitly capture behavior by creating combination rules that use approximations to these quantities.
4.5 Local Reliability, Variance, and Dependence
This section summarizes and collects the observations made in this chapter. Throughout
the chapter, we offered arguments about the importance of various quantities and why they
could also be characterized in terms of locality. Locality is roughly defined as nearness as
a function of the example. One simple definition would be to characterize input nearness
in terms of having a similar base classifier prediction value.
Figure 4.7: The coefficient a (left) and additive correction term b (right) to perform linear correction, estimated globally and locally using hold-out data for the example constrained to equal class-conditional variance. For this case, where both the true and estimated log-odds are linear, a single value of a and b is sufficient to perform perfect correction. The local estimation deviates from this at the edges because of data sparsity.
Although it is clear from the above that the prediction error variance around a log-odds prediction of −10 may be different than that around a prediction of 0, it is sometimes advantageous to consider other functions of the input features in addition to focusing on the classifier outputs.
A classifier may use data in different parts of the input space differently. Therefore, the
reliability, variance, and dependence may vary locally depending on how well sampled the
region of the input space is, the number of features that are locally relevant (i.e., basically
the complexity of the decision surface locally versus globally), and how the classifiers
employ the data. Many combination schemes fail to account for locality and place only
global weights on the classifier outputs. This reduces the ultimate expressive power of any
combination method.
Any discussion of “local” implies that for each datapoint, x_0, there is some neighborhood, N(x_0), that defines what is local to that point. In simple cases, such as defining locality only in terms of the classifier estimates, the neighborhoods may be explicitly definable by binning around estimated log-odds values. In other cases, the neighborhoods are implicit and purely motivational. For the application of these concepts globally, it can just be assumed that the neighborhood covers the entire domain. Therefore, to make all of the above issues concrete, we need only give a mathematical definition for the local version assuming that we have some definition of neighborhood.
Figure 4.8: The posterior (left) and log-odds (right) for a 2-class example with class-specific variances.
Since the work in this dissertation uses a binary classification approach, we typically only need to estimate the quantities below for y = +, but we give a formulation for the multiclass problem. We now summarize the above issues as the following eight points and give mathematical definitions for those involving locality in terms of both log-odds and probability estimates:
1. Calibration or reliability of classifier outputs;⁸

2. Variance of estimate with available (or total) information, that is, the variance of the error in prediction;

3. Dependence of classifier outputs;
4. Local reliability of classifier outputs;
The average of predictions from a reliable classifier would equal the average actual value (in all neighborhoods):
∀y  E_{p(x|x∈N(x_0))}[P(c(x) = y | x)] = E_{p(x|x∈N(x_0))}[P̂_{C_i}(c(x) = y | x)],
or ∀y  E_{p(x|x∈N(x_0))}[λ(y)] = E_{p(x|x∈N(x_0))}[λ̂_i(y)].
Therefore, a reasonable measure of deviance from reliability within a region would be a measure of deviance from this, such as
Σ_y ( E_{p(x|x∈N(x_0))}[P(c(x) = y | x)] − E_{p(x|x∈N(x_0))}[P̂_{C_i}(c(x) = y | x)] )²
= Σ_y ( E_{p(x|x∈N(x_0))}[P(c(x) = y | x) − P̂_{C_i}(c(x) = y | x)] )²,
or Σ_y ( E_{p(x|x∈N(x_0))}[λ(y)] − E_{p(x|x∈N(x_0))}[λ̂_i(y)] )²
= Σ_y ( E_{p(x|x∈N(x_0))}[λ(y) − λ̂_i(y)] )².

⁸In the remainder of this work, we use “reliability” loosely to mean either the items explicitly discussed as reliability here or possibly touching on issues related to any of the types of variance discussed here. When it is necessary to make the distinction explicit, we do so.
Figure 4.9: The coefficient a (left) and additive correction term b (right) to perform linear correction, estimated globally and locally using hold-out data for the example with class-specific variances. For this case, where both the true and estimated log-odds are non-linear, a global value of a and b is not adequate to perform perfect correction.
5. Local variance of estimate with total information;
A classifier that acted with total information would, of course, always predict the posterior. Within a region, the prediction error variance is then:
Σ_y VAR_{p(x|x∈N(x_0))}[P(c(x) = y | x) − P̂_{C_i}(c(x) = y | x)],
or Σ_y VAR_{p(x|x∈N(x_0))}[λ(y) − λ̂_i(y)].
6. Local dependence of classifier outputs;
The ideal situation would be when the classifiers are independent given the class of the data they are predicting upon (as in Figure 4.3 above). If this were true, then the joint distribution over the n classifier predictions (denoted as ŷ_1, ..., ŷ_n here) could be factored into their separate predictions:
∀y  P(ŷ_1, ..., ŷ_n | c(x) = y, x ∈ N(x_0)) = Π_i P(ŷ_i | c(x) = y, x ∈ N(x_0)).
Therefore, a reasonable measure of deviance from independence would be the KL-divergence of the left distribution from the right distribution.
Figure 4.10: The locally linear and global corrections of the log-odds for the equal class-conditional variance example (left) and the non-equal example (right). As shown on the left, when both the true and estimated model are linear, global weights suffice to perform perfect correction. However, when either the true or estimated models are not linear, a locally linear model has the potential to perform far better correction, as shown on the right.
7. Noise sensitivity of a method;
This is just a measure of variation within the region for a specified model (i.e., deviance of the estimates for other datapoints belonging to the neighborhood from the mean estimate of the neighborhood). This term is:
Σ_y VAR_{p(x|x∈N(x_0))}[P̂_{C_i}(c(x) = y | x)],
or Σ_y VAR_{p(x|x∈N(x_0))}[λ̂_i(y | x)]. It can be expedient, though, to consider a related term which is deviation from the query point, x_0, instead of the mean prediction. This would be:
Σ_y VAR_{p(x|x∈N(x_0))}[P̂_{C_i}(c(x_0) = y | x_0) − P̂_{C_i}(c(x) = y | x)],
or Σ_y VAR_{p(x|x∈N(x_0))}[λ̂_i(y | x_0) − λ̂_i(y | x)].
8. Covariance of a method with complete information;
Given the connection between error prediction variance, noise sensitivity, and variance via VAR[X − Y] = VAR[X] + VAR[Y] − 2 COV[X, Y], it is unclear if covariance need be separately estimated. However, we list it for completeness:
Σ_y COV_{p(x|x∈N(x_0))}[P̂_{C_i}(c(x) = y | x), P(c(x) = y | x)],
or Σ_y COV_{p(x|x∈N(x_0))}[λ̂_i(y), λ(y)].
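To make the local definitions concrete in code, here is a minimal sketch (our own helper names; in practice, the true posterior is unknown and must itself be approximated) of empirical versions of items 4, 5, and 7 computed from points sampled in N(x_0):

import numpy as np

# p_true and p_hat have shape (n_points_in_neighborhood, n_classes) and hold
# P(c(x)=y | x) and the classifier's estimate for each sampled x in N(x0).
def local_reliability_gap(p_true, p_hat):
    # Item 4: squared gap between neighborhood-mean true and predicted
    # posteriors, summed over classes.
    return float(np.sum((p_true.mean(axis=0) - p_hat.mean(axis=0)) ** 2))

def local_error_variance(p_true, p_hat):
    # Item 5: variance of the prediction error within the neighborhood.
    return float(np.sum(np.var(p_true - p_hat, axis=0)))

def noise_sensitivity(p_hat):
    # Item 7: spread of the model's own estimates within the neighborhood.
    return float(np.sum(np.var(p_hat, axis=0)))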
Given these quantities of interest, Chapter 5 motivates and defines variables that are specific approximations to these quantities or variables intuitively tied to the neighborhood around a prediction point for the domain. These are then used in methods which can either directly use the approximated variables or use the neighborhood characterizations to define weights and combination functions dependent on the neighborhood.
Chapter 5
Reliability Indicators
This chapter describes the reliability indicators in detail. There are a number of reliability indicators that arise out of the internal workings of the models themselves. Since these variables play a more central role, we devote a significant portion of the chapter to their motivation and formulation. The remainder of the variables focus primarily on the difference between the original representation of the document and the representation after feature selection. These latter variables are presented with a brief motivation and description at the end of the chapter.
The reader should note that identifying and defining variables tied to the reliability of
classifiers is both challenging and an open research problem. Even though we have made
considerable progress in this arena, it remains an attractive area of future research.
5.1 Model-Specific Reliability Indicators
This section motivates a series of indicators based on the inner workings of each classification model. The variables all share a commonality in that they are related to the shift in the
model’s output relative to a slight shift in the input. Additionally, the computational complexity of producing each statistic relative to producing a single test prediction is analyzed.
5.1.1 Variables Based on the Unigram Classifier (Multinomial naïve Bayes)
For a two-class problem of discriminating class c from ¬c, the log-odds of the unigram classifier can be written as:

log [ P̂(c | d) / P̂(¬c | d) ] = log [ P̂(c) / P̂(¬c) ] + Σ_{w∈d} #(w, d) log [ P̂(w | c) / P̂(w | ¬c) ]      (5.1)
where #(w, d) denotes the number of times word w occurs in document d. Therefore, each occurrence of a word w contributes log [P̂(w | c) / P̂(w | ¬c)] = log P̂(w | c) − log P̂(w | ¬c) to the overall classification. Furthermore, if this quantity is positive, the word occurrence moves the decision toward the positive class c, and if this quantity is negative, the word occurrence moves the decision toward the negative class ¬c. In light of this, it seems natural to consider possible interpretations of functions of the log-likelihood ratio with respect to w, log [P̂(w | c) / P̂(w | ¬c)], and the log-likelihood with respect to w, log P̂(w | c), as indicator variables.

First, consider the mean per-word log-likelihood ratio: (1/|d|) Σ_{w∈d} #(w, d) log [P̂(w | c) / P̂(w | ¬c)].
Why we may want to concern ourselves with this quantity can be motivated from several viewpoints. For example, consider how the output of the unigram classifier would change if we changed the document by (uniformly) randomly choosing a word in the document and eliminating it. More formally, let w_i refer to the ith unique vocabulary word that occurs in the bag-of-words representation of the document d, and let d^{-i} denote the document obtained by removing a single occurrence of w_i from d. Let ∆ denote a distribution over these altered documents such that the probability of generating d^{-i} is #(w_i, d)/|d|, where |d| denotes the total number of words in the document. Let λ̂_U(d) denote the log odds of the unigram model. Then we wish to know how much it will change our estimate if we remove a single word, or E_∆[λ̂_U(d) − λ̂_U(d^{-i})] where d^{-i} ∼ ∆.
This expectation reduces to the average per-word log-likelihood ratio:

E_∆[λ̂_U(d) − λ̂_U(d^{-i})]      (5.2)
= Σ_{d^{-i}} [ #(w_i, d)/|d| ] ( λ̂_U(d) − λ̂_U(d^{-i}) )      (5.3)
= Σ_{d^{-i}} [ #(w_i, d)/|d| ] log [ P̂(w_i | c) / P̂(w_i | ¬c) ]      (5.4)
= (1/|d|) Σ_{d^{-i}} #(w_i, d) log [ P̂(w_i | c) / P̂(w_i | ¬c) ]      (5.5)
= (1/|d|) Σ_{w∈d} #(w, d) log [ P̂(w | c) / P̂(w | ¬c) ].      (5.6)
By inspection of the formula, it is also obvious that this quantity is tied to the prediction of the unigram classifier itself, being equal to the prediction minus the log prior odds, divided by the document length. At the same time, it is a measure of the change around the prediction with a slight shift.
Continuing along the same vein, it is natural to consider the variance of the statistic:

VAR_∆[λ̂_U(d) − λ̂_U(d^{-i})] = E_∆[ ( λ̂_U(d) − λ̂_U(d^{-i}) )² ] − ( E_∆[λ̂_U(d) − λ̂_U(d^{-i})] )².

Following a derivation similar to the above, we can show that:

E_∆[ ( λ̂_U(d) − λ̂_U(d^{-i}) )² ] = (1/|d|) Σ_{w∈d} #(w, d) ( log [ P̂(w | c) / P̂(w | ¬c) ] )².      (5.7)
From these two terms, we can then compute the variance. Note that the variance will be near zero when all the words are pointing with approximately the same strength at the same class. As the variance increases, the disagreement among the individual words is higher. Thus, regardless of the conditional independence assumption the model makes, we would expect that the predictions will be more reliable when the variance is low than when it is high.¹
The previous two statistics were motivated in terms of the decision boundary of the unigram classifier, but it seems prudent to also consider statistics of the primary components of the classifier. That is, we can also consider the mean per-word log-likelihood, i.e., (1/|d|) Σ_{w∈d} #(w, d) log P̂(w | c) and (1/|d|) Σ_{w∈d} #(w, d) log P̂(w | ¬c). Similarly, we can consider the variance of the per-word log-likelihood. Now, these statistics will measure the average strength for each class and the spread of how strongly each word is voting.²
Finally, note that computing each of these statistics requires the same run-time as making the original prediction, O(|d|). Therefore, computing these statistics does not significantly impact our computational burden.
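As an illustration, here is a minimal implementation of the mean and variance of the per-word log-likelihood ratio (a sketch of our own; p_w_pos and p_w_neg are hypothetical names for smoothed tables of P̂(w | c) and P̂(w | ¬c)):

import math
from collections import Counter

def unigram_indicators(doc_words, p_w_pos, p_w_neg):
    # Mean and variance of the per-word log-likelihood ratio (Eqs. 5.6-5.7).
    counts = Counter(doc_words)
    n = sum(counts.values())                                     # |d|
    llr = {w: math.log(p_w_pos[w] / p_w_neg[w]) for w in counts}
    mean = sum(k * llr[w] for w, k in counts.items()) / n        # E[shift]
    second = sum(k * llr[w] ** 2 for w, k in counts.items()) / n
    return mean, second - mean ** 2                              # mean, variance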
¹To apply this to polychotomous learning problems, we believe it would be most beneficial to have one value per class and consider statistics based on either log [ P̂(w|c) / (1 − P̂(w|c)) ] or log [ P̂(w|c) / max_{c′≠c} P̂(w|c′) ]. Alternatively, one could consider all class pairs.

²An interesting future direction is to consider similar statistics that do not weigh each word equally but instead use the model's estimate of the word's probability, P̂(w).
5.1.2 Variables Based on the naïve Bayes Classifier (Multivariate Bernoulli naïve Bayes)
The variables motivated by the naïve Bayes classifier are directly analogous to those discussed above for the unigram classifier, as would be expected since the models are highly related. However, the difference in the event space employed by each model dictates subtle changes that we need to address.

First, note the difference between these two models' event spaces. The unigram probability model can be thought of generatively as drawing a class according to a class prior, then drawing a document length, and finally drawing the words according to the class-conditional word distribution. In contrast, the naïve Bayes model draws a class according to a class prior and then, for each word in the vocabulary, draws whether or not the word occurs in the document according to a class-conditional distribution. Thus, the multivariate Bernoulli naïve Bayes classifier models a binary (Bernoulli) variable of whether or not a word occurs conditioned on the class for every word in the vocabulary. Let V − d denote the set of features that do not take a value of “present” or 1 in d. Then, for a two-class problem of discriminating class c from ¬c, the log-odds of the naïve Bayes classifier can be written as:
log [ P̂(c | d) / P̂(¬c | d) ]
= log [ P̂(c) / P̂(¬c) ] + Σ_{w∈d} log [ P̂(w=1 | c) / P̂(w=1 | ¬c) ] + Σ_{w∈V−d} log [ P̂(w=0 | c) / P̂(w=0 | ¬c) ]      (5.8)
= log [ P̂(c) / P̂(¬c) ] + Σ_{w∈V} log [ P̂(w=0 | c) / P̂(w=0 | ¬c) ]
  + Σ_{w∈d} log [ P̂(w=1 | c) / P̂(w=1 | ¬c) ] − Σ_{w∈d} log [ P̂(w=0 | c) / P̂(w=0 | ¬c) ].      (5.9)
The log odds are often formulated as given in Line 5.9 for efficiency. The prior term together with the first summation in the line is often termed the log odds of the null document, since it is the log odds of a document that contains no word occurrences. This term is costly to compute since it includes every word in the (typically large) vocabulary. Let |d| denote the number of “present” words in the document. Then, the costly summation must only be computed once during training, and at test time, each document's prediction can be made in O(|d|) time instead of O(|V|) by subtracting off the “absent” contributions of the words that are present and adding in their “present” contributions.
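A sketch of this bookkeeping (our own helper names; p1_pos[w] and p1_neg[w] stand for the smoothed estimates P̂(w=1 | c) and P̂(w=1 | ¬c), so the "absent" probabilities are their complements):

import math

def train_null_log_odds(log_prior_odds, p1_pos, p1_neg):
    # One O(|V|) pass at training time: the log odds of a document with no
    # word occurrences (the priors plus every "absent" contribution).
    return log_prior_odds + sum(
        math.log((1.0 - p1_pos[w]) / (1.0 - p1_neg[w])) for w in p1_pos)

def score(doc_words, null_log_odds, p1_pos, p1_neg):
    # O(|d|) at test time: swap each present word's "absent" contribution
    # for its "present" contribution.
    s = null_log_odds
    for w in set(doc_words):
        s += math.log(p1_pos[w] / p1_neg[w])
        s -= math.log((1.0 - p1_pos[w]) / (1.0 - p1_neg[w]))
    return s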
Now, instead of restricting our set of slight changes to the input to only the words present in the document, we will continue in the style of the model and consider how its output would change if we altered a single feature value for all possible values in the domain. Now, we will let d^{-i} denote the document identical to d except that feature i has been “flipped”, and ∆ is now the distribution over documents such that P_∆(d^{-i}) = 1/|V| for i = 1, ..., |V|. Let λ̂_B(d) be the estimate of the multivariate Bernoulli naïve Bayes model for document d. We desire to determine the shift in the model output caused by changing a bit, E_∆[λ̂_B(d) − λ̂_B(d^{-i})] where d^{-i} ∼ ∆. Let w_i(d) denote the presence/absence value word i takes in document d, and let w_i′(d) denote its complement. It is quite easy to see that this expectation reduces as follows:
E_∆[λ̂_B(d) − λ̂_B(d^{-i})]      (5.10)
= Σ_{d^{-i}} (1/|V|) ( λ̂_B(d) − λ̂_B(d^{-i}) )      (5.11)
= (1/|V|) Σ_{d^{-i}} log [ P̂(w_i(d) | c) P̂(w_i′(d) | ¬c) / ( P̂(w_i(d) | ¬c) P̂(w_i′(d) | c) ) ]      (5.12)
= (1/|V|) Σ_{w∈d} log [ P̂(w=1 | c) P̂(w=0 | ¬c) / ( P̂(w=1 | ¬c) P̂(w=0 | c) ) ]
  + (1/|V|) Σ_{w∈V−d} log [ P̂(w=0 | c) P̂(w=1 | ¬c) / ( P̂(w=0 | ¬c) P̂(w=1 | c) ) ]      (5.13)
= (1/|V|) Σ_{w∈V} log [ P̂(w=0 | c) P̂(w=1 | ¬c) / ( P̂(w=0 | ¬c) P̂(w=1 | c) ) ]
  + (1/|V|) Σ_{w∈d} log [ P̂(w=1 | c) P̂(w=0 | ¬c) / ( P̂(w=1 | ¬c) P̂(w=0 | c) ) ]
  − (1/|V|) Σ_{w∈d} log [ P̂(w=0 | c) P̂(w=1 | ¬c) / ( P̂(w=0 | ¬c) P̂(w=1 | c) ) ]      (5.14)
= (1/|V|) Σ_{w∈V} log [ P̂(w=0 | c) P̂(w=1 | ¬c) / ( P̂(w=0 | ¬c) P̂(w=1 | c) ) ]
  + (2/|V|) Σ_{w∈d} log [ P̂(w=1 | c) P̂(w=0 | ¬c) / ( P̂(w=1 | ¬c) P̂(w=0 | c) ) ].      (5.15)
Again, the formulation given in Line 5.14 allows this statistic to be efficiently computed in O(|d|) time during prediction by computing the first summation at training time. In the unigram model, each word present in the document contributed according to the log-likelihood with respect to it, and the corresponding statistic was the mean over those terms. Likewise, in the multivariate Bernoulli model, each word in the vocabulary contributes according to the log odds ratio of the likelihoods, and the resulting statistic is the mean over the per-feature contributions.
The derivation for the variance,

VAR_∆[λ̂_B(d) − λ̂_B(d^{-i})] = E_∆[ ( λ̂_B(d) − λ̂_B(d^{-i}) )² ] − ( E_∆[λ̂_B(d) − λ̂_B(d^{-i})] )²,

follows similar lines. To compute the first term, following the pattern above, we can easily derive:

E_∆[ ( λ̂_B(d) − λ̂_B(d^{-i}) )² ]
= (1/|V|) Σ_{w∈V} ( log [ P̂(w=0 | c) P̂(w=1 | ¬c) / ( P̂(w=0 | ¬c) P̂(w=1 | c) ) ] )²
  + (1/|V|) Σ_{w∈d} ( log [ P̂(w=1 | c) P̂(w=0 | ¬c) / ( P̂(w=1 | ¬c) P̂(w=0 | c) ) ] )²
  − (1/|V|) Σ_{w∈d} ( log [ P̂(w=0 | c) P̂(w=1 | ¬c) / ( P̂(w=0 | ¬c) P̂(w=1 | c) ) ] )²      (5.16)
= (1/|V|) Σ_{w∈V} ( log [ P̂(w=0 | c) P̂(w=1 | ¬c) / ( P̂(w=0 | ¬c) P̂(w=1 | c) ) ] )²
  + (1/|V|) Σ_{w∈d} ( −log [ P̂(w=0 | c) P̂(w=1 | ¬c) / ( P̂(w=0 | ¬c) P̂(w=1 | c) ) ] )²
  − (1/|V|) Σ_{w∈d} ( log [ P̂(w=0 | c) P̂(w=1 | ¬c) / ( P̂(w=0 | ¬c) P̂(w=1 | c) ) ] )²      (5.17)
= (1/|V|) Σ_{w∈V} ( log [ P̂(w=0 | c) P̂(w=1 | ¬c) / ( P̂(w=0 | ¬c) P̂(w=1 | c) ) ] )²
  + (1/|V|) Σ_{w∈d} ( log [ P̂(w=0 | c) P̂(w=1 | ¬c) / ( P̂(w=0 | ¬c) P̂(w=1 | c) ) ] )²
  − (1/|V|) Σ_{w∈d} ( log [ P̂(w=0 | c) P̂(w=1 | ¬c) / ( P̂(w=0 | ¬c) P̂(w=1 | c) ) ] )²      (5.18)
= (1/|V|) Σ_{w∈V} ( log [ P̂(w=0 | c) P̂(w=1 | ¬c) / ( P̂(w=0 | ¬c) P̂(w=1 | c) ) ] )².      (5.19)

Since the negation vanishes under the square, the two document summations in Line 5.18 cancel, leaving only the vocabulary summation.
This term need only be computed during training and is then used in combination with the previous statistic to compute the variance in O(|d|) time. Finally, as for the unigram classifier, we also consider the mean and variance of the class-strength of each feature. That is, the mean and variance of log [ P̂(w(d) | c) / P̂(w′(d) | c) ] and log [ P̂(w(d) | ¬c) / P̂(w′(d) | ¬c) ].
5.1.3 Variables Based on the kNN Classifier
The variables based on the kNN classifier stem from two basic motivations: (1) the long-standing intuition that the proximity of the neighbors has an impact on the classifier's reliability; (2) measuring sensitivity to a change in the input in a way that is connected to the internal mechanisms of the classifier.
The kNN classifier is one of the oldest classification algorithms and many variants have
been employed [CH67, Yan99]. In its most basic version, the nearest neighbor classifier, a
similarity measure is specified by the user and a test or query point is classified by finding
the most similar training point and using the class label of that point to predict for the
test point. The natural generalization is to find the k most similar or closest neighbors in
the training set and take the majority vote of their class labels. A further generalization
that has been shown to be quite competitive for text classification is to use some form of
distance-weighted voting [Yan99] in order to allow closer neighbors to carry more weight.
While the following variables can be derived in any kNN framework, it is necessary to understand the role of the reliability indicators within our particular kNN classifier implementation. In our implementation, k is set to be 2⌈log₂ N⌉ + 1 where N is the number of training instances. This rule for choosing k is theoretically motivated by results which show such a rule converges to the optimal classifier as the number of training points increases [DGL96]. In practice, we have also found it to be a computational convenience that frequently leads to results comparable with numerically optimizing k via a cross-validation procedure. As is quite common in text classification, we use the cosine similarity, cos(x⃗₁, x⃗₂), with higher values indicating greater similarity (closer neighbors). For those less familiar with the cosine of two vectors³, it is equivalent to the inner product of the two vectors if all of the vectors have been normalized to the unit N-sphere. The score used for a class y is:
s_x⃗(y) = Σ_{n⃗ ∈ kNN(x⃗) : c(n⃗) = y} cos(x⃗, n⃗) − Σ_{n⃗ ∈ kNN(x⃗) : c(n⃗) ≠ y} cos(x⃗, n⃗)      (5.20)
The class with the highest score is predicted as the class of the example. Since we apply the classifiers as binary classifiers, there are only two classes and s_x⃗(−) is the additive inverse of s_x⃗(+). Note that we can treat s_x⃗(+) as an approximation of the log-odds of an example, λ̂_kNN(x⃗); like log-odds, by default the sign of the score function gives the predicted class, and increasing magnitude indicates increased confidence. Finally, we apply a threshold method referred to as s-cut in [Yan99] where, instead of using the default threshold of 0, we learn a threshold by cross-validation over the training set.⁴

³We also use a tfidf weighting where the tf factor is the standard term frequency and the idf factor is log₂(N/df).

⁴Similar to the recalibration techniques discussed in Section 3.2, this essentially acts as a recalibration device where the estimated log-odds are shifted by the learned threshold.
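In the binary case, Eq. 5.20 collapses to a label-weighted sum; a one-line restatement of our own:

import numpy as np

def knn_score(sims, labels):
    # sims: cosine similarities to the k neighbors; labels: their classes in
    # {+1, -1}. Same-class similarities count positively and different-class
    # similarities negatively, which is exactly Eq. 5.20.
    return float(np.dot(np.asarray(sims), np.asarray(labels)))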
Some forms of distance-weighted voting include a decay factor on the distance between the query point and the neighbor such that, for some values of the decay factor, the nearest neighbor will dominate the overall vote. While we could choose other forms of distance-weighted voting, they all attempt to balance issues like how much the absolute distance of the nearest neighbor influences the final prediction versus how much the relative difference in neighbors determines the prediction. Instead, we take the approach of choosing the common score function given in Eq. 5.20 and taking other factors into account through our definition and use of reliability indicators.
The first such issue is that not all areas of the input space are well-sampled. The convergence theorem cited above and other guarantees regarding kNN’s performance [DHS01]
essentially rely on the fact that, as long as the test distribution is the same as the training
distribution, we are likely to have training examples near where testing examples are likely
to occur. Therefore, if the decision function is smooth, we will get good estimates quickly.
If it is not smooth, convergence still occurs but is slower. The simplest of the reliability
indicators try to detect when an area is not well-sampled by introducing functions of the
distances of neighbors. NeighborhoodRadius is simply the distance from the query point to
the farthest neighbor included in the neighborhood. Thus, we would expect a small radius
when the area is well-sampled and large when it is not. The next two reliability indicators,
MeanNeighborDistance and SigmaNeighborDistance, take into account the distribution of
neighbors by computing the mean and standard deviation of the distance from the query
point to a neighbor. Thus, a low mean could indicate that the neighborhood is consistently
well-sampled even though the farthest neighbor might be distant. Likewise, a high variance
could indicate that the sampling quality is not as consistent since the neighbors are at very
different distances.
Next, we consider how we might detect when the decision function is not smooth in a neighborhood. If the decision function is not smooth, then we would also expect that, as we move out from the query point in different directions, the overall prevalence of a class in the training set would also change. Thus, we can also introduce a sensitivity-based variable that computes how much the output would change as we change the query point slightly. Since we are concerned with how the decision function would change as we move toward our neighbors, it is natural to consider a set of shifts defined in terms of the neighbors.
Figure 5.1: An example in Euclidean space of the kNN shifted instances produced for a query
instance x using the other points shown as its neighborhood. The shifts are illustrated using cyan
lines from the original instance. The nearness of neighbor 5 prevents the shifts toward neighbors
1-3 from being larger. In contrast, the shift toward neighbor 4 is fully half the distance since it
is away from the other neighbors. Since a shift toward each neighbor is weighted equally, the net
effect is that a shift toward a dense area is more likely.
In particular, let d_i denote the document that has been shifted by a factor α toward the ith neighbor, i.e., d_i = d + α(n_i − d). To determine α, we choose the largest α such that the closest neighbor to the new point is the original document. Clearly, α will not exceed 0.5, and we can find it very efficiently using a simple bisection algorithm. Let ∆ be the uniform distribution over the shifted points. That is, if d_i ∼ ∆, then P_∆(d_i) = 1/k.⁵ Likewise, ∆′ will indicate the uniform distribution over all k shifted points and the original query. Figure 5.1 illustrates these shifts graphically for an example in Euclidean space, and Table 5.1 gives the corresponding values of α for the shift toward each neighbor.
If the neighborhood is not smooth, then we would also expect to see that the class prediction is not constant across the shifts. To measure this, we compute the average prediction E_{∆′}[s_d̂(+) ≥ 0] where d̂ ∼ ∆′, and denote it kNNShiftMeanPred. However, in addition to computing how rapidly the prediction is changing, we can also measure how much the confidence score differs from the prediction for the query. Therefore, we also compute E_∆[s_{d_i}(+) − s_d(+)] and Var_∆^{1/2}[s_{d_i}(+) − s_d(+)] and denote them kNNShiftMeanConfDiff and kNNShiftStdDevConfDiff.
⁵Notice that ∆ implicitly weights shifts toward more common changes in the document, since more common change vectors will also have more neighbors on that side.
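The bisection step for α is simple; below is a sketch under our own assumed details (Euclidean distance, as in the Figure 5.1 example, and a fixed tolerance; the text's implementation works with cosine similarity instead):

import numpy as np

def shift_alpha(d, n_i, neighbors, tol=1e-6):
    # Largest alpha in [0, 0.5] such that the shifted point
    # d + alpha * (n_i - d) remains closer to d than to any neighbor.
    def closest_is_d(alpha):
        s = d + alpha * (n_i - d)
        dist_d = np.linalg.norm(s - d)
        return all(np.linalg.norm(s - n) >= dist_d for n in neighbors)

    lo, hi = 0.0, 0.5
    if closest_is_d(hi):
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if closest_is_d(mid):
            lo = mid
        else:
            hi = mid
    return lo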
Shifted Point    α      d      n1     n2     n3     n4     n5     n6
d1               0.27   0.36   0.97   1.46   1.51   1.73   0.36   0.91
d2               0.19   0.35   0.99   1.48   1.52   1.71   0.35   0.91
d3               0.20   0.38   0.96   1.45   1.49   1.75   0.38   0.91
d4               0.50   0.71   1.97   2.44   2.51   0.71   1.03   0.90
d5               0.50   0.28   1.13   1.61   1.67   1.53   0.28   0.92
d6               0.50   0.32   1.55   2.05   2.07   1.42   0.87   0.32

Table 5.1: Various quantities for the example in Euclidean space illustrated in Figure 5.1. α is the amount example d is shifted toward each neighbor to produce d_i. Each row lists the Euclidean distances between the shifted point d_i and the original point d as well as each neighbor n_j. The nearness of neighbor n5 prevents the shifted instances d1, d2, and d3 from shifting closer to neighbors n1, n2, and n3, respectively. Thus α for these shifted points is less than 0.5.
Analyzing the run-time of the kNN algorithm is rather tricky. The naïve implementation would scan all N training points every time a new query is seen. Let l denote the average number of non-zero features; then on average computing the cosine similarity between two points will be O(l). Thus, if we are classifying M points total, the classification computation cost is O(M[Nl + k log k]) for a naïve implementation of kNN.⁶ Since we are applying the classifier to text documents, a typical speed-up exploits the fact that each document has very few non-zero features. We build an inverted table that indexes the training set by feature, storing for each feature a list of the training documents that have a non-zero value. When we receive a new document, we sort the non-zero features in the document by their tfidf score and then proceed through each inverted list. After every feature's inverted list, we can bound the theoretically closest possible neighbor that we have not examined yet. When the bound gets tighter than the distance to the farthest point in the neighborhood, we can terminate early. Because we proceed through the features in tfidf order, we tend to find the closest neighbors right away, and the majority of time is spent ensuring there are no closer neighbors.⁷ Since we are interested in performing exact kNN, our performance gain is less than what it could be. Examining Table 5.2, we see for one small dataset that we have to examine less than half as many points on average with the sparse algorithm and have a run time that is a quarter of the time used by the naïve approach. The exact speed-up depends on characteristics of the dataset, but achieving at least a factor of 4 seems to be common.

⁶The k log k factor comes from tracking the top neighbors. Technically, this is u log k where u is the number of (update) times we find a neighbor closer than the worst, but u empirically tends to be polynomial in k.

⁷For the “Sparse (k)” entry in Table 5.2, we had to examine the inverted lists of 2.06 features on average to reach all the neighbors, but we had to examine 13.16 features on average to be sure there were no closer neighbors. Thus an approximation algorithm that halts after a preset number of features can be accurate and efficient. Additionally, feature selection methods that favor rare features can introduce even more speed-ups.
Method               Avg # Dist Ops to    Ratio to    Total Run    Ratio to
                     Find Neighbors       Baseline    Time (s)     Baseline
Naïve (k)            9603                 2.39        285.38       4.13
Sparse (k)           4016.40              1           69.1         1
Sparse (2k)          4698.15              1.17        80.07        1.16
Sparse w/RIVS (2k)   4698.15              1.17        196.37       2.84

Table 5.2: Effect on running time of computing the kNN reliability indicators for the Reuters 21578 corpus (9603 training examples, 3299 testing examples, 900 features used). The naïve algorithm scans all training examples each time. The sparse algorithm uses speed-ups based on sparsity and just performs basic prediction; we show one version using the standard number of neighbors and one using twice that. The final version also computes and writes the reliability indicators, using a neighborhood of k = 29 for prediction but 2k to compute the reliability indicators. For these comparisons, r-cut with r = 1 is used for prediction [Yan99].
To compute the reliability indicators, it is clear that some of the variables can be computed with essentially no extra cost while performing the prediction, including NeighborhoodRadius, MeanNeighborDistance, Mean{Class}NeighborDistance, SigmaNeighborDistance, and Sigma{Class}NeighborDistance. These simply require updating a corresponding variable's state as each neighbor is found and voted toward the final classification. The primary difficulty comes in computing the variables that involve shifting the query point and classifying the shifted point. If we treated each shifted point as a query, the run-time would become quite unreasonable. Instead, we take an approximation approach for these variables: instead of finding the closest k neighbors, we find the closest 2k neighbors. We still classify the query using its closest k neighbors and create a shifted query for each neighbor, but we find the k closest points to each shifted query among only the 2k points. Thus we avoid the cost of finding the actual neighborhood of the shifted point by using an approximation. For each test example, we have to perform an extra O(lk² log k) operations to evaluate the predictions for the shifted points. Table 5.2 shows that finding the larger neighborhood only penalizes us 16% in run time (“Sparse (k)” vs. “Sparse (2k)”), but that the extra overhead for computing the reliability indicators surrenders half of the speed-up we gained over the naïve algorithm (“Sparse w/RIVS (2k)”).⁸

⁸Some of this is I/O overhead and could be optimized to be closer to the performance of “Sparse (2k)”.
5.1.4 Variables Based on the Decision Tree Classifier
It is well-known that a standard decision tree corresponds to regions of hyper-rectangles in
the R^D space and that two problems which occur frequently in practice are oversplitting and
boundary sensitivity. The first results in using the estimate at a child node when a parent
node would have been a better estimate. The second results in grossly misestimating the
value to predict for examples that fall near the edges of the hyper-rectangles. Thus when
considering how to capture the sensitivity of a decision tree model, we would like to favor
shifts in the input to nearby leaves or to branches with similar values to an example.
Following the pattern set forth for the other classifiers, we again consider shifting the input toward other examples. However, in this case, the shift is somewhat more implicit. Assume we are given a document d and the path of nodes the document follows down the decision tree, n_0, n_1, ..., n_l, where n_0 is the root node, n_1 is the child of the root node, and n_l is the leaf node where document d ended up. Let n_i^{(j)} denote the jth branch left untaken at the ith node. For a binary tree, there will be exactly one, but for a multi-way tree there can be many. Let B denote this set of B “local untaken” branches that lie along the path of the document through the decision tree. We will denote by d_i a document just like d except that it has been altered to go down branch i of B. Note it would be possible to obtain such a document by changing only one feature. However, to reflect that we are more likely to end up in nearby leaves and that examples close to the boundary are more likely to shift, we will use the distance (after normalizing to the unit sphere) between the document d and the centroid of documents at the node, d̄(n_{b_i}), where n_{b_i} denotes the node reached by taking the untaken branch b_i.
Since each step lower in the decision tree implies the examples have more features in common, this distance will naturally be lower for nodes that are close together in the tree. Likewise, if an example is near the decision boundary, then this will naturally account for it as well. Let ε be the minimum distance to an untaken centroid, ε = min_{b_i∈B} ‖d̄(n_{b_i}) − d‖. Then we set the probability of drawing such a similar document d_i ∼ ∆′ to be P_{∆′}(d_i) ∝ exp(−‖d̄(n_{b_i}) − d‖ + ε) such that Σ_{b_i∈B} P_{∆′}(d_i) = 1. Once we have taken a branch, we will assume that the probability of ending up in some particular terminal leaf descendant of i, t_i^j, is determined by the relative frequencies in the training set, c(t_i^j)/c(n_{b_i}), where c(·) gives the training count at a node. Finally, every terminal leaf except the one that classified d in the tree is a descendant of some untaken branch. Let λ̂_D(t) denote the decision tree's estimate of the log-odds of the positive class at a leaf t. Let ∆ denote the distribution such that P_∆(t_i^j) = [c(t_i^j)/c(n_{b_i})] P_{∆′}(d_i). Then, we compute the mean change in output, E_∆[λ̂_D(t_i^j) − λ̂_D(n_l)], and the standard deviation, Var_∆^{1/2}[λ̂_D(t_i^j) − λ̂_D(n_l)]. We refer to these variables as DTreeShiftMeanConfDiff and DTreeShiftStdDevConfDiff, respectively.
As far as the running time is concerned, we can precompute the centroids to store at all internal nodes of the tree during training. Likewise, since the mass below an untaken branch is apportioned according to the relative frequencies, we can store the intermediate sums and multiply them by the untaken branch probabilities determined at prediction time.
Thus we only need to make three passes along the prediction path: (1) find the closest
untaken branch centroid; (2) compute the probability normalization factor; (3) sum the
intermediate sums stored at the highest untaken branch nodes. Therefore, computing these
variables has an expected running time of E[ml], where m is the average path length to the
leaf of a tree and l is the average length of a document.
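To make this concrete, the following is a minimal sketch, not the implementation used in our experiments, of how DTreeShiftMeanConfDiff and DTreeShiftStdDevConfDiff could be computed given the precomputed per-node centroids, counts, and leaf log-odds described above; the attribute names (untaken_branches, centroid, count, leaf_stats) are hypothetical and chosen only for illustration.

import math
import numpy as np

def dtree_shift_stats(path, d, leaf_log_odds_d):
    # path: the nodes from the root to the leaf that classified document d
    # (d is assumed unit-normalized). Each untaken branch b along the path
    # is assumed to carry:
    #   b.centroid   -- centroid of training documents under b (unit sphere)
    #   b.count      -- training count c(n_b) below b
    #   b.leaf_stats -- list of (c(t), log-odds at t) for terminal leaves t under b
    untaken = [b for node in path for b in node.untaken_branches]
    if not untaken:
        return 0.0, 0.0
    dists = [np.linalg.norm(b.centroid - d) for b in untaken]
    eps = min(dists)
    # P(d_i) is proportional to exp(-||centroid - d|| + eps), normalized over branches
    w = np.exp([eps - dist for dist in dists])
    w = w / w.sum()
    mean, second = 0.0, 0.0
    for b, p_branch in zip(untaken, w):
        for leaf_count, leaf_log_odds in b.leaf_stats:
            p = p_branch * leaf_count / b.count   # P(t_ij) = (c(t_ij)/c(n_bi)) P(d_i)
            delta = leaf_log_odds - leaf_log_odds_d
            mean += p * delta
            second += p * delta * delta
    return mean, math.sqrt(max(second - mean * mean, 0.0))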
5.1.5 Variables Based on the SVM Classifier
Similar to the other classifiers, the variables motivated by the SVM classifier try to capture
the sensitivity of the model to changes in the input and intuitive notions of when an example
falls into an area where the SVM solution is less reliable.
Overview of SVMs
There are two ideas key to the SVM classifier: the kernel and maximizing the margin. The
easiest way to think of a kernel is as a special type of similarity function between two
examples. The special stipulations a function must meet to be a kernel can be phrased in
many ways, but the basic definition follows [CST00]. A function $K(x, y) : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ is a kernel iff there exists a function $\phi$ which maps an example x from the input space $\mathcal{X}$ to a feature space $\mathcal{F}$ such that for any two examples the kernel is the inner product of the transformed examples, $K(x, y) = \langle \phi(x) \cdot \phi(y) \rangle$. With respect to the kernel and the transformed feature space, the SVM classifier outputs a linear separator although it may be nonlinear with respect to the original input space. Kernel-based methods have gained
popularity for many reasons. Among these reasons are: (1) generalization results can be
obtained without reference to the dimensionality of the feature space; (2) since only the
kernel output value is needed, it is possible to work in intractably large feature spaces
when the result of the inner product can still be efficiently computed directly; (3) it is not
necessary to explicitly define the feature space since any symmetric positive semi-definite
matrix M defines a valid kernel where $M_{ij} = K(x_i, x_j)$; (4) simple algorithms can easily be
adapted to a new task by choosing an appropriate kernel; (5) it is easier to prove things about
simple algorithms. Since we only apply a linear kernel to the text classification problem, we
derive results in the standard Euclidean space. As mentioned above, if the examples have
been normalized to the unit sphere (L2-norm) then the inner product between two examples
is equivalent to the cosine similarity function. By defining what it means to project from
one example to another through the kernel space, all of the following variables could be
adapted to any kernel.
Given a set of training data $D = \{x_1, \ldots, x_N\}$ with corresponding class labels $y_i \in \{-1, +1\}$, the score function $f(q)$ for an SVM can be written in one of two equivalent ways: in terms of the training examples, or in terms of the feature space. The first is written as:

$$f(q) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, q) + b. \qquad (5.21)$$

The second can be written as:

$$f(q) = \langle w \cdot \phi(q) \rangle + b \quad \text{where} \quad w = \sum_{i=1}^{N} \alpha_i y_i \phi(x_i). \qquad (5.22)$$
The decision rule is then simply $\mathrm{sign}(f(q))$. Only the points that have non-zero $\alpha_i$ values
change the scoring function and are therefore termed support vectors. When the feature
space is tractable, such as in the linear case, the second formula gives a way of computing
the normal to the separating hyperplane during training and quickly classifying examples at
test time. When dealing with the feature space is not tractable (e.g., high-order polynomial
kernels), the first formula gives a way of classifying examples only in terms of the kernel
values using a total of V kernel evaluations where V is the number of support vectors.
Likewise the problem of training an SVM can be formulated as solving for the α’s and b or
as solving for a w and b.
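Since both forms appear throughout this section, it may help to see the equivalence numerically for the linear kernel; the following is a small self-contained sketch with made-up data (it is not tied to SVMlight or to our experimental setup).

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                   # training examples x_i
y = np.array([1, -1, 1, 1, -1])               # labels y_i
alpha = np.array([0.5, 0.5, 0.0, 0.2, 0.2])   # dual weights; x_3 is not a support vector
b = 0.1
q = rng.normal(size=3)                        # query example

# Eq. 5.21 (dual form): f(q) = sum_i alpha_i y_i K(x_i, q) + b, with K the dot product
f_dual = np.sum(alpha * y * (X @ q)) + b

# Eq. 5.22 (primal form): w = sum_i alpha_i y_i x_i, then f(q) = <w . q> + b
w = (alpha * y) @ X
f_primal = w @ q + b

assert np.isclose(f_dual, f_primal)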
In looking at how this solution is chosen, we come upon the second key concept for
SVMs — choosing the maximum margin solution. When there is a perfect separator, this
simply says choose the separator that has the largest minimum distance to any of the training points. When there is no perfect separator, the conditions must be generalized. Conceptually, we can think of the amount we must move each point to be on the correct side
of the boundary. Additionally, we may want to penalize points that are on the correct side
but too close to the boundary. Then a reasonable solution would be to choose the separator
that minimizes the sum of these terms. The formulation implemented in SVMlight that incorporates these ideas is the 1-norm soft margin SVM [Joa02, CST00]:

$$\begin{aligned} \min_{\xi, w, b}\ & \frac{1}{2}\langle w \cdot w \rangle + C \sum_{i=1}^{N} \xi_i & (5.23)\\ \text{subject to}\ & y_i(\langle w \cdot x_i \rangle + b) \ge 1 - \xi_i, \quad i = 1, \ldots, N & (5.24)\\ & \xi_i \ge 0, \quad i = 1, \ldots, N & (5.25) \end{aligned}$$
Figure 5.2: The SVMlight solution with default C for an almost linearly separable problem. The
decision boundary is shown with a solid line. The dashed lines show the limits of the margin. The
support vectors are highlighted in black.
Figure 5.2 illustrates the solution for a nearly separable dataset. When the dataset has a perfect separator, the only support vectors will lie at the boundaries of the margin on the $f(x) = 1$ and $f(x) = -1$ lines. In the general case, the support vectors will include all training points $x_i$ such that $y_i f(x_i) \le 1$, i.e., all points within the margin and all of the points on the “wrong side” of the hyperplane. One of the KKT conditions that the optimal solution must satisfy is $\forall_{i=1}^{N}\ \xi_i(\alpha_i - C) = 0$ [CST00] (p. 107). The $\xi_i$ are referred to as slack variables since, by the condition in 5.24 together with the minimization, they are the amount that a point falls on the wrong side of the margin. Thus the sum $\sum_{i=1}^{N} \xi_i$ is the total distance points must be moved in order to be left with a margin free of training points. As labeled in Figure 5.2, the margin has a width of $\frac{2}{\|w\|}$; thus the optimization problem of Eq. 5.23 is a tradeoff between the margin size and the total slack, controlled by the parameter C. Referring back to the KKT condition above and since $0 \le \alpha_i \le C$, one can easily see that any training point that has non-zero slack will have an $\alpha$ of C; for those training points that fall on the margin boundary and thus have a slack of zero, their respective $\alpha$'s may range subject to other conditions not discussed here. We have only touched on the understanding of SVMs necessary to motivate the following discussion. The interested reader should see [Joa02, CST00] for much more detailed information related to the theory and implementation of SVMs.
Figure 5.3: The contours for the score function, f(x), of the SVMlight solution with default C. The labels “A” and “B” fall at the same distance to the separator, but would we have equal confidence at predicting “red circle” at both points?
Motivating SVM Failure Modes
As we mentioned in Section 3.2, the SVM's score function f(q) has empirically been shown to behave like a linear transform of the log-odds of an example. Figure 5.3 shows that this function takes equal values parallel to the decision boundary. This seems generally
acceptable for cases where there is good separation except that when comparing the points
labeled “A” and “B”, most people find they are more uncertain about the label of “B” than
“A”. How should this be quantified? Intuitively, we are not more certain that point “B” is a
“blue cross”. Instead, we simply have less certainty about our probability, or in other words
our estimate of variance is higher. Thus, we would like a reliability indicator that captures
this intuitive observation mechanically.
We can easily construct an example that has the same decision boundary but a less trustworthy solution by increasing the nonseparability of the data. Figure 5.4 shows an extreme of one such nonseparable case. While the decision boundary is still about the best we can do for a linear separator, it seems clear that we actually want to decrease our confidence near the nonseparable mass; as we move further away along the isolines, it becomes increasingly unclear which class is best to predict. In practice, if the SVM solution has good generalization we will not find large clumps such as this, but we may find scattered points.
We would like to define reliability indicators that will assist us in automatically detecting
such behavior.
Figure 5.4: The same data as Figure 5.2 but with a large non-separable mass added. The set of
support vectors (in green) has changed but the decision boundary is close to the same. Is it still
reasonable to assume the true log-odds is a (piecewise) linear transform of f (x)?
SVM Variable Details
Now, similar to how we proceeded for the other classification algorithms, we will define a
small set of shifts to the input example and measure statistics of the resulting change in the
model’s score function. In the derivation, we hope to both develop a reasonable estimate of
the model’s sensitivity and automatically capture the types of failure modes we discussed
above.
Since we derive these for the linear kernel only, $\phi(\cdot)$ is the identity function and we drop it. Given an SVM model $f(x)$ and a set of data D, let $\mathcal{V} = \{x_i \in D \mid y_i f(x_i) \le 1\}$. As discussed previously, if D is the training data, then for the SVM optimization problem above, $\mathcal{V}$ is the set of support vectors. We may also choose to estimate these variables based on some other set of data D. Let $V = |\mathcal{V}|$. Now, we will consider shifting the document d toward each of the elements of $\mathcal{V}$.

In particular, let $d_i$ denote the document that has been shifted by a factor $\beta_i$ toward the ith element of $\mathcal{V}$, i.e., $d_i = d + \beta_i(v_i - d)$. Similar to the derivation for the kNN classifier,
to determine $\beta_i$ we define it in terms of the closest point in $\mathcal{V}$ to d. Let $\epsilon$ be half the distance to the nearest point in $\mathcal{V}$, i.e., $\epsilon = \frac{1}{2} \min_{v \in \mathcal{V}} \|v - d\|$. Then $\beta_i = \frac{\epsilon}{\|v_i - d\|}$.^9 Thus the shift vectors are all rescaled to have the same length. Now, we must define a probability for the shift. We use a simple exponential based on the relative distance from the document to the point and the closest point in $\mathcal{V}$. Let $d_i \sim \Delta$ where $P_\Delta(d_i) \propto \exp(-\|v_i - d\| + 2\epsilon)$ and $\sum_{i=1}^{V} P_\Delta(d_i) = 1$.^10 Our first two variables are $E_\Delta[f(d_i) - f(d)]$ and $\mathrm{Var}_\Delta^{1/2}[f(d_i) - f(d)]$; we denote them as SVMShiftMeanConfDiff and SVMShiftStdDevConfDiff.
^9 We assume that the minimum distance is not zero. If it is zero, then we return zero for all of the variables.
Note that to compute the first of these, we can rewrite it as:
$$\begin{aligned}
\sum_{i=1}^{V} P(d_i)\,[f(d_i) - f(d)] \qquad & (5.26)\\
= \sum_{i=1}^{V} P(d_i)\,[\langle w \cdot d_i \rangle + b - f(d)] \qquad & (5.27)\\
= \sum_{i=1}^{V} P(d_i)\,[\langle w \cdot (d + \beta_i(v_i - d)) \rangle + b - f(d)] \qquad & (5.28)\\
= \sum_{i=1}^{V} P(d_i)\,[\langle w \cdot d \rangle + \langle w \cdot \beta_i(v_i - d) \rangle + b - f(d)] \qquad & (5.29)\\
= \sum_{i=1}^{V} P(d_i)\,\langle w \cdot \beta_i(v_i - d) \rangle \qquad & (5.30)\\
= \sum_{i=1}^{V} P(d_i)\,[\beta_i \langle w \cdot v_i \rangle - \beta_i \langle w \cdot d \rangle] \qquad & (5.31)\\
= \sum_{i=1}^{V} P(d_i)\,\beta_i\,[\langle w \cdot v_i \rangle - \langle w \cdot d \rangle] \qquad & (5.32)\\
= \sum_{i=1}^{V} P(d_i)\,\beta_i\,[\langle w \cdot v_i \rangle + b - \langle w \cdot d \rangle - b] \qquad & (5.33)\\
= \sum_{i=1}^{V} P(d_i)\,\beta_i\,[f(v_i) - f(d)] \qquad & (5.34)
\end{aligned}$$
Similarly, to compute SVMShiftStdDevConfDiff, we can use

$$\sqrt{\sum_{i=1}^{V} P(d_i)\,\beta_i^2\,[f(v_i) - f(d)]^2 - \left[ \sum_{i=1}^{V} P(d_i)\,\beta_i\,[f(v_i) - f(d)] \right]^2}. \qquad (5.35)$$
Since the scores for each of the vectors vi can be computed during training, we do not incur
any additional time to compute those at test time. Instead, we have one pass over the V
vectors to find the nearest neighbor, one pass to compute the probability function, and a
final pass to compute the sum. Therefore, we can compute the variables in 3V or O(V )
time.
^10 As is standard to handle different document lengths, we take the distance between documents after they have been normalized to the unit sphere. This is not the case for the example figures with generated data we present here.
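As a sketch of the O(V) procedure just described, the following computes SVMShiftMeanConfDiff and SVMShiftStdDevConfDiff from the final forms of Eqs. 5.34 and 5.35; it assumes the scores f(v_i) were cached at training time and that all vectors are unit-normalized, and the function and argument names are illustrative only.

import numpy as np

def svm_shift_stats(d, f_d, sv, f_sv):
    # d: query document; f_d = f(d); sv: support vectors as a (V x D) array;
    # f_sv: precomputed scores f(v_i). Returns (mean, std) of f(d_i) - f(d).
    dists = np.linalg.norm(sv - d, axis=1)   # pass 1: distances to all of V
    eps = 0.5 * dists.min()
    if eps == 0.0:                           # degenerate case (see footnote 9)
        return 0.0, 0.0
    beta = eps / dists                       # beta_i = eps / ||v_i - d||
    p = np.exp(-dists + 2.0 * eps)           # pass 2: P(d_i) prop. to exp(-||v_i - d|| + 2 eps)
    p = p / p.sum()
    deltas = beta * (f_sv - f_d)             # Eq. 5.34 terms: beta_i [f(v_i) - f(d)]
    mean = np.sum(p * deltas)                # pass 3: the expectation
    var = np.sum(p * deltas**2) - mean**2    # Eq. 5.35 under the square root
    return mean, np.sqrt(max(var, 0.0))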
Figure 5.5: The contour plots of meanGoodSVProximity (left) and stdDevGoodSVProximity (right) appear to capture some of the motivating intuition. Note the negative values in the left plot near the nonseparable mass. In the right plot we see the goodness variance rises in the nonseparable mass as well as in the regions to the side where it is unclear to which mass examples belong. Meanwhile, variance in the nicely separated region remains low and stable.
Inspired by the final form in Eq. 5.34 and by the idea that the total slack is the amount required to move all the points to the right side of the margin, we considered $E_\Delta[G(d_i)]$ and $\mathrm{Var}_\Delta^{1/2}[G(d_i)]$ where $G(d_i) = y_i \beta_i f(v_i)$. The idea is that $G(\cdot)$ is a “goodness” function. Note that $G(d_i)$ will be positive when $v_i$ is a support vector on the correct side of
the hyperplane, and it will be negative when vi is a support vector on the wrong side of
the hyperplane. We receive a credit proportional to how far the support vector is on the
correct side and the size of the shift we defined earlier. Thus if the expectation is positive, it means that, on average, the document is closer to “good” support vectors than to
“bad” ones. Likewise, the variance captures whether there is high fluctuation between nearness to good and to bad support vectors. We call these variables meanGoodSVProximity
and stdDevGoodSVProximity. Figure 5.5 shows an example of the values they take on and
demonstrates that they capture our intuition to a certain extent. Like the variables above
they are also O(V ) to compute.
Finally, we argued that distance to known points is intuitively important, as in Figure 5.3. Since we already find the nearest support vector to the document for the other variables, we also output a signed distance-weighted function of the nearest support vector, specifically $y_n \exp(-\|v_n - d\|)$ where $v_n = \operatorname{argmin}_{v \in \mathcal{V}} \|v - d\|$. We call this variable signedNNSV.
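The goodness variables and signedNNSV fall out of the same single pass over the support vectors; below is a minimal sketch under the same assumptions as before (cached scores f_sv, unit-normalized vectors, and y_sv holding the support vectors' labels).

import numpy as np

def goodness_stats(d, sv, y_sv, f_sv):
    # Returns (meanGoodSVProximity, stdDevGoodSVProximity, signedNNSV).
    dists = np.linalg.norm(sv - d, axis=1)
    eps = 0.5 * dists.min()
    if eps == 0.0:
        return 0.0, 0.0, 0.0
    beta = eps / dists
    p = np.exp(-dists + 2.0 * eps)
    p = p / p.sum()
    g = y_sv * beta * f_sv                    # G(d_i) = y_i beta_i f(v_i)
    mean = np.sum(p * g)
    std = np.sqrt(max(np.sum(p * g**2) - mean**2, 0.0))
    n = np.argmin(dists)                      # nearest support vector v_n
    signed_nn = y_sv[n] * np.exp(-dists[n])   # y_n exp(-||v_n - d||)
    return mean, std, signed_nn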
5.2 Inputs for STRIVE (Document Dependent)
The remainder of this chapter summarizes all of the inputs to the metaclassifiers — including the base classifier outputs and all the reliability-indicator variables. A brief motivation
is provided for each reliability indicator that has not been previously discussed. As was the case with the earlier variables, these variables do not exhaustively define the space of reliability indicators, nor are they always crucial. Instead, they simply attempt to mathematically capture the intuitions discussed through any or all of the following: approximations, convenient modeling choices, and algorithm-specific key quantities.
5.2.1 Outputs of Base Classifiers
We considered the outputs of five base classifiers as inputs to STRIVE.
• OutputOfDnet
This is the output of the decision tree built using the Dnet classifier (available as the WinMine toolkit [Mic01]). Its value is the smoothed log-odds of the estimated posterior probability at the leaf node of belonging to the class.
• OutputOfSVMLight
This is the output of a linear SVM model built using the SVMlight package [Joa99]. It is often termed the margin score or distance to the normal. If K is the kernel function, $\alpha_i$ is the non-negative weight that the SVM places on each training example (i.e., support vectors have non-zero weight), and $\beta$ is the bias or threshold, then this score is $\beta + \sum_i \alpha_i y_i K(\vec{x}_i, \vec{x})$.
• OutputOfNaïveBayes
This is the output of the naïve Bayes model built using a multivariate Bernoulli representation (i.e., only feature presence/absence in an example is modeled) [MN98]. It is the log-odds (or logit) of the model's probability estimate of class membership, i.e., $\log \frac{P(c|d)}{1 - P(c|d)}$. Because of machine floating-point precision issues, it is necessary that the implementation compute the log-odds directly.
• OutputOfUnigram
This is the output of the unigram model (also referred to as a multinomial model [MN98]). It is the log-odds (or logit) of the model's probability estimate of class membership, i.e., $\log \frac{P(c|d)}{1 - P(c|d)}$. Because of machine floating-point precision issues, it is necessary that the implementation compute the log-odds directly.
• OutputOfkNN
This is the output of the kNN classifier using distance-weighted voting [Yan99]. Unless otherwise mentioned, the value of k is set to be $2\lceil \log_2 N \rceil + 1$ where N is the number of training points. This rule for choosing k is theoretically motivated by results which show such a rule converges to the optimal classifier as the number of training points increases [DGL96]. In practice, we have also found it to be a computational convenience that frequently leads to results comparable with numerically optimizing k via a cross-validation procedure. As a distance measure for neighbors, we use $\cos(\vec{x}_1, \vec{x}_2)$ with higher values indicating greater similarity (closer neighbors). The score used for a class y is:

$$s(y) = \sum_{\vec{n} \in kNN(\vec{x}) \,\mid\, c(\vec{n}) = y} \cos(\vec{x}, \vec{n}) \;-\; \sum_{\vec{n} \in kNN(\vec{x}) \,\mid\, c(\vec{n}) \neq y} \cos(\vec{x}, \vec{n}) \qquad (5.36)$$
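A brute-force sketch of this scoring rule, including the k = 2⌈log₂ N⌉ + 1 choice, follows; the actual experiments use the sparse inverted-index machinery discussed in Section 5.1.3 rather than this dense computation.

import math
import numpy as np

def knn_score(x, X_train, y_train, target_class):
    # x and the rows of X_train are unit-normalized, so the dot product is the
    # cosine similarity; y_train holds class labels. Returns s(target_class)
    # as in Eq. 5.36.
    N = len(X_train)
    k = 2 * math.ceil(math.log2(N)) + 1
    sims = X_train @ x
    nn = np.argsort(-sims)[:k]               # indices of the k nearest neighbors
    in_class = sum(sims[i] for i in nn if y_train[i] == target_class)
    out_class = sum(sims[i] for i in nn if y_train[i] != target_class)
    return in_class - out_class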
5.2.2 Reliability Indicator Variables
Indicator variables are currently roughly broken into one of four types:
• Amount of information present in the original document;
• Information loss or mismatch between representations;
• Sensitivity of the decision to evidence shift;
• Basic voting statistics.
We group the reliability indicators into their primary type based on the main reasons
why we would expect to see a link to classifier reliability. We note that this is only a soft
clustering; some reliability indicators provide context information in more than one way.
Several of the variables listed below have an instantiation for each class in a learning
problem. The variable counts we report tally each instance separately. For the variables
below we list only one entry and use “{Class}” in the name of the variable to denote that
this variable has one instantiation per class. Since our methodology built a binary classifier for each topic, our experiments have a Positive and a Negative class version. In a two-class problem, the values of the two instantiations may be redundant. We have, however, retained each since in polychotomous (3 or more classes) discrimination they are more distinct.
After each bullet below, a number is given in parentheses, indicating the number of
variables that this description includes. There are currently 70 document-dependent reliability indicators. After we discuss how to use these indicators, we conduct an analysis of
the empirical impact of each indicator in Section 7.3.
Type 1: Amount of information present in the original document
• (1) DocumentLength
The number of words in a document before feature selection. Presumably longer
documents provide more information to base a decision upon. Therefore, longer
documents will lead to more reliable decisions when DocumentLength is correctly
modeled. Alternatively, models that do not correctly normalize for document length
may be less reliable for extreme lengths (short or long) of documents.
• (1) EffectiveDocumentLength
DocumentLength minus the number of out-of-vocabulary words in the document.
Since a model cannot generalize strongly, other than by smoothing, for features that
were not seen in the training set, this variable may be a better indicator of information
present in the document than DocumentLength.
• (1) NumUniqueWords
Number of distinct tokens in a document, i.e. |{w|w ∈ document}|, as opposed
to length which counts repeats of a token in a document. The motivation is similar
to DocumentLength, but here the variable is only counting each new word as an
indicator of new information.
• (1) EffectiveUniqueWords
NumUniqueWords minus the number of unique out-of-vocabulary words. This is the
analogue of EffectiveDocumentLength and is included for similar reasons.
• (1) PercentUnique
This is a measure of the variety in word choice in a document. It is equal to NumUniqueWords / DocumentLength. This can also be seen as 1 / (average number of times a word is repeated in a document). Close to 1 means very few words (if any) are repeated in the document; close to 0 means the document consists of very few unique words (possibly repeated many times). This is essentially a normalized version of NumUniqueWords; however, this variable will show high variance for short documents. The intuition here is that more complex documents, while providing more information, also might be more difficult to classify since they may have many features, each carrying some small weight.
• (1) PercentOOV
The percentage of the words in a document which weren't seen in the training set. It is equal to the number of out-of-vocabulary words divided by DocumentLength. Similar to PercentUnique, this variable can show high variance for short documents. The intuition here is that the more novel words a document contains, the more likely a classifier is to incorrectly classify the document into the a priori prevalent class. Typically unseen words slightly favor minority classes, since we have fewer samples from them. This is a variable that essentially allows a global smoothing model to be induced, and its range is [0, 1]. Therefore, as it approaches 1, we would expect minority classes to be more likely than our base models might estimate.
• (1) PercentUniqueOOV
The percentage of the words in a document, not counting duplicates, which weren’t
seen in the training set. This is the distinct token analogue for PercentOOV. Again,
the motivation is similar, just using a different information model.
• (2) PercentIn{Class}BeforeFS
Of all words occurring in the training set (i.e. out-of-vocabulary words are ignored),
the percentage of words in a document that occurred at least once in examples belonging to the class. It is equal to the number of words that occurred in the class
before feature selection divided by EffectiveDocumentLength. Similar to PercentOOV, this can be used to inductively learn smoothing behavior. The assumption is
that if this variable is high, predictions that the example belongs to the class are more
reliable. For the binary case with a negative class that effectively groups many classes
together, this isn't quite expected with respect to PercentInNegBeforeFS (since predictions of “negative” would almost always be expected to be more reliable under that assumption).
• (2) UpercentIn{Class}BeforeFS
Of all words occurring in the training set (i.e. out-of-vocabulary words are ignored),
the percentage of unique words in a document that occurred at least once in examples belonging to the class. This is the analogue to PercentIn{Class}BeforeFS using
unique tokens as the basis for the information model.
• (2) %Favoring{Class}BeforeFS
Of all words occurring in the training set, the percentage of words in a document that
occurred more times in examples belonging to the class than in examples not belonging to the class. This is essentially a rough statistic for an unnormalized unigram
model (tied slightly into the smoothing related variables discussed above) that gives
a very rough sense of the evidential weight of the original document.
• (2) U%Favoring{Class}BeforeFS
Of all words occurring in the training set, the percentage of unique words in a document that occurred more times in examples belonging to the class than in examples
not belonging to the class. This is the analogue to %Favoring{Class}BeforeFS.
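As an illustration, here is a minimal sketch computing several of the Type 1 indicators above from a tokenized document; vocab (the training vocabulary, as a set) and class_words (a map from class name to the set of words seen at least once in that class) are hypothetical structures standing in for whatever the indexing code provides.

def type1_indicators(tokens, vocab, class_words):
    doc_len = len(tokens)                               # DocumentLength
    in_vocab = [w for w in tokens if w in vocab]
    eff_len = len(in_vocab)                             # EffectiveDocumentLength
    uniq = set(tokens)
    num_uniq = len(uniq)                                # NumUniqueWords
    eff_uniq = len(uniq & vocab)                        # EffectiveUniqueWords
    pct_uniq = num_uniq / doc_len if doc_len else 0.0   # PercentUnique
    pct_oov = (doc_len - eff_len) / doc_len if doc_len else 0.0  # PercentOOV
    pct_in_class = {                                    # PercentIn{Class}BeforeFS
        c: (sum(1 for w in in_vocab if w in words) / eff_len if eff_len else 0.0)
        for c, words in class_words.items()
    }
    return dict(DocumentLength=doc_len, EffectiveDocumentLength=eff_len,
                NumUniqueWords=num_uniq, EffectiveUniqueWords=eff_uniq,
                PercentUnique=pct_uniq, PercentOOV=pct_oov,
                PercentInClassBeforeFS=pct_in_class)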
Type 2: Information loss or mismatch between representations
While each of these variables is a measure of loss of information, each generally has a paired variable of Type 1; together the pair gives a more direct measure of information loss.
• (1) DocumentLengthAfterFS
The number of words in a document after out-of-vocabulary words have been removed and feature selection was performed. Similar to DocumentLength, this is the
measure of information that the classifier actually sees with respect to this document.
• (1) UniqueAfterFS
The number of unique words remaining in a document after out-of-vocabulary words
have been removed and feature selection was performed. This is the distinct token
analogue of DocumentLengthAfterFS and is similarly expected to be used in conjunction with NumUniqueWords as a gauge of information loss.
• (1) PercentRemoved
The percentage of a document that was discarded because it was out-of-vocabulary
or removed by feature selection. It can have high variance for short documents. The
intuition is that reliability of a classifier is higher for low values of PercentRemoved.
• (1) UniquePercentRemoved
The percentage of unique words in a document that were discarded because they
were out-of-vocabulary or removed by feature selection. The distinct token analogue
of PercentRemoved where the information model is unique words.
• (2) PercentIn{Class}AfterFS
Of all words occurring in the training set, the percentage of words remaining in a
document after feature selection that occurred at least once in examples in the class.
Together with PercentIn{Class}BeforeFS, this allows the model to represent shift in
information content because of feature selection.
• (2) UpercentIn{Class}AfterFS
Of all words occurring in the training set, the percentage of unique words remaining
in a document (after feature selection) that occurred at least once in the class. Again,
this is expected to be used in conjunction with UpercentIn{Class}BeforeFS to model
information loss.
• (2) %Favoring{Class}AfterFS
Of all words occurring in the training set, the percentage of words remaining in a
document (after feature selection) that occurred more times in examples in the class
than in examples not in the class. Like its BeforeFS counterpart, it is essentially like
an unnormalized unigram model. We expect that it can be used in conjunction with
%Favoring{Class}BeforeFS to measure how a feature selection method may have
biased the information for a given document toward a particular class.
• (2) U%Favoring{Class}AfterFS
Of all words occurring in the training set, the percentage of distinct words remaining
in a document after feature selection that occurred more times in examples in the
class than examples not in the class.
• (12) FSInformationChange
The change in information according to some measure of information. There is one
instantiation per measure of information and per measure of class. The difference
produces a variable related to information loss due to feature selection. We would
expect a large negative difference might give rise to false positives while a large
positive difference might give rise to false negatives with respect to the class. The
following are grouped under this heading:
(1) NumWordsDiscarded
= DocumentLength - DocumentLengthAfterFS
(1) NumTrainingWordsDiscarded
= EffectiveDocumentLength - DocumentLengthAfterFS
(1) NumFeaturesDiscarded
= NumUniqueWords - UniqueAfterFS
(1) NumTrainingFeaturesDiscarded
= EffectiveUniqueWords - UniqueAfterFS
(2) WordsSeenIn{Class}Delta
= PercentIn{Class}BeforeFS - PercentIn{Class}AfterFS
(2) FeaturesSeenIn{Class}Delta
= UpercentIn{Class}BeforeFS - UpercentIn{Class}AfterFS
(2) PercentWordsPointingTo{Class}Delta
= %Favoring{Class}BeforeFS - %Favoring{Class}AfterFS
(2) UPercentWordsPointingTo{Class}Delta
= U%Favoring{Class}BeforeFS - U%Favoring{Class}AfterFS
Type 3: Sensitivity of the decision to evidence shift
• (2) UnigramStdDeviation, NaïveBayesStdDeviation
In a binary class problem, the weight each word contributes to the unigram model's decision is $\log \frac{P(w|c)}{P(w|\neg c)}$. Similarly, each word's presence/absence contributes a weight of $\log \left[ \frac{P(w(d)=\{\text{present},\text{absent}\} \mid c)}{P(w(d)=\{\text{present},\text{absent}\} \mid \neg c)} \cdot \frac{1 - P(w(d) \mid \neg c)}{1 - P(w(d) \mid c)} \right]$ to the naïve Bayes model. If this term is greater than 0, the word gives evidence to the positive class (c), and if it is less than zero, the word gives evidence to the negative class (¬c). The reliability indicators are the standard deviation of these weights for the feature values (word occurrences or presence/absence) in a specific document. If the variance is close to zero, that means all of the words tended to point toward one class. As the variance increases, this means there was a large skew in the amount of evidence presented by the various words (possibly strong words pulling toward two classes). The intuition is that the reliability of naïve Bayes related classifiers will tend to decrease as this variable increases. The motivation behind these variables is described more fully in Sections 5.1.1 and 5.1.2.
• (4) UnigramMeanLogOfStrengthGiven{Class}, NaïveBayesMeanLogOfStrengthGiven{Class}
These variables represent the mean of the conditional contributions of a word given a class. For example, UnigramMeanLogOfStrengthGivenPositive is the mean of $\log P(w \mid Class = Positive)$ over the words in the document, and NaïveBayesMeanLogOfStrengthGivenPositive is the mean of $\log \frac{P(w(d) \mid Class = Positive)}{P(w'(d) \mid Class = Positive)}$ over the words in the vocabulary. The motivation behind these variables is described more fully in Sections 5.1.1 and 5.1.2.
• (4) UnigramStdDeviationLogOfStrengthGiven{Class}, NaïveBayesStdDeviationLogOfStrengthGiven{Class}
These variables represent the standard deviation of the conditional contributions of a word given a class. For example, UnigramStdDeviationLogOfStrengthGiven{Positive} is the standard deviation of $\log P(w \mid Class = Positive)$ over the words in the document, and NaïveBayesStdDeviationLogOfStrengthGiven{Positive} is the standard deviation of $\log \frac{P(w(d) \mid Class = Positive)}{P(w'(d) \mid Class = Positive)}$ over the words in the vocabulary. The motivation behind these variables is described more fully in Sections 5.1.1 and 5.1.2.
• (2) UnigramMeanShift, NaïveBayesMeanShift
In a binary class problem, the weight each word contributes to the unigram model's decision is $\log \frac{P(w|c)}{P(w|\neg c)}$. Similarly, each word's presence/absence contributes a weight of $\log \left[ \frac{P(w(d)=\{\text{present},\text{absent}\} \mid c)}{P(w(d)=\{\text{present},\text{absent}\} \mid \neg c)} \cdot \frac{1 - P(w(d) \mid \neg c)}{1 - P(w(d) \mid c)} \right]$ to the naïve Bayes model. If this term is greater than 0, the word gives evidence to the positive class (c), and if it is less than zero, the word gives evidence to the negative class (¬c). The reliability indicators are the mean of these weights for the feature values (word occurrences or presence/absence) in a specific document. As discussed in Sections 5.1.1 and 5.1.2, this quantity is the mean change in the model's output if we uniformly randomly chose one word to delete (for the unigram classifier) or uniformly randomly chose one feature's value to change (for the naïve Bayes classifier).
• (1) NeighborhoodRadius
The radius required to include all k points in the neighborhood for the kNN classifier.
The radius is the Euclidean distance between the query point and its neighbors after
the points are normalized to the unit sphere. If the radius around a point is comparatively small, the implication is that this portion of the space is extremely well sampled, and higher reliability is expected there.
• (3 = 1 [overall] + 2 [one per class] ) MeanNeighborDistance
Mean{Class}NeighborDistance
The mean distance of the points in the neighborhood from the kNN classifier. The
distance used is Euclidean distance between the query point and its neighbors after the points are normalized to the unit sphere. This has a similar motivation to
NeighborhoodRadius — a smaller neighborhood implies better sampling in the area
and higher reliability. Depending on the particular type of distance weighting used
in the kNN classifier this variable may play a more or less important role. There is
one instantiation that averages over all neighbors and then one per class that is the
average of points belonging to that class.
• (3 = 1 [overall] + 2 [one per class] ) SigmaNeighborDistance
Sigma{Class}NeighborDistance
The standard deviation of the distance of the points in the neighborhood from the
kNN classifier. The distance used is Euclidean distance between the query point and
its neighbors after the points are normalized to the unit sphere. The assumption is that
query points that fall in neighborhoods with high variance in neighbor distance will
be less reliable than those with a similar mean in distance but less variance. There
is one instantiation that measures variance over all neighbors and then one per class
that is based on points belonging to that class.
• (1) kNNShiftMeanPred
Similar to variables related to other classifiers, this variable considers how the output
of the kNN classifier changes with slight changes to the prediction point. Since the
kNN classifier uses a neighborhood for prediction, we also use it to define the nature
of the perturbations to the input. k new queries are created by shifting the document
as far toward each neighbor as it can go while the original document remains its
own closest neighbor. This variable is the average class prediction over the original
document and the k new queries. If the neighborhood is smooth we expect to see
values near 0 and 1. The motivation for this variable as well as the computational approximations used to compute it more efficiently are described more fully in Section
5.1.3.
• (1) kNNShiftMeanConfDiff
Using the k new queries defined for kNNShiftMeanPred, this variable measures the
average change in the kNN classifier’s confidence score from the confidence assigned
to the original prediction. A value near zero indicates that the query falls into a
neighborhood that is, on average, smooth with respect to the confidence function.
The motivation for this variable as well as the computational approximations used to
compute it more efficiently are described more fully in Section 5.1.3.
• (1) kNNShiftStdDevConfDiff
Using the k new queries defined for kNNShiftMeanPred, this variable measures the
standard deviation of the difference between the kNN classifier’s confidence score for
the new query and the confidence assigned to the original prediction. A low variance
indicates that the neighborhood is smooth with respect to the confidence function.
The motivation for this variable as well as the computational approximations used to
compute it more efficiently are described more fully in Section 5.1.3.
• (1) SVMShiftMeanConfDiff
Similar to variables related to other classifiers, this variable considers how the output of the SVM classifier changes with slight changes to the prediction point. Since the SVM classifier can be seen as a function of the support vectors, we use the support vectors to define the nature of the perturbations to the input. Let V be the number of support vectors. Then, V new documents $d'$ are created by shifting the document toward each support vector for a total distance equal to half the distance to the nearest support vector. A probability distribution $\Delta$ over shifts is then defined based on the distance to the support vector. Finally, this variable is the expected difference between the new output and the original output according to the probability function, $E_\Delta[f(d') - f(d)]$. The motivation for this variable and the mathematical details are described more fully in Section 5.1.5.
• (1) SVMShiftStdDevConfDiff
This variable is just like SVMShiftMeanConfDiff except that it is the standard deviation, $\mathrm{Var}_\Delta^{1/2}[f(d') - f(d)]$. Again, more details are in Section 5.1.5.
• (1) meanGoodSVProximity
This variable uses the shifts defined for SVMShiftMeanConfDiff, but instead computes the expectation $E_\Delta[y_i \beta_i f(v_i)]$ where $v_i$ is the support vector that we are shifting toward. Thus, the variable will take an overall positive value if more “good” support vectors (on the correct side of the margin) are closer to the document than “bad” support vectors. See Section 5.1.5 for more details.
• (1) stdDevGoodSVProximity
This variable is just like meanGoodSVProximity except that it is the standard deviation, $\mathrm{Var}_\Delta^{1/2}[y_i \beta_i f(v_i)]$. See Section 5.1.5 for more details.
• (1) signedNNSV
For documents that are farther from any data seen during training, we intuitively have less certainty about their class labels. This variable attempts to capture that by computing a simple signed function of the nearest support vector, $y_n \exp(-\|v_n - d\|)$ where $v_n = \operatorname{argmin}_{v \in \mathcal{V}} \|v - d\|$.
• (1) DTreeShiftMeanConfDiff
This is a sensitivity variable related to the decision tree model. Assume that there is
a probability for drawing a document similar to the current one but different enough
to branch down an alternate branch along the path. This variable defines a model
for such a deviation and then captures the expected change in the model’s log-odds
output. See Section 5.1.4 for more details.
• (1) DTreeShiftStdDevConfDiff
This variable is identical to DTreeShiftMeanConfDiff except that it captures the variance in the model’s log-odds output according to the probability model over similar
documents. Again see Section 5.1.4 for more details.
Type 4: Basic voting statistics
There are two reliability indicator variables whose primary type is this one. Both were introduced mainly to reduce the data required to learn m-of-n rules in the decision tree metaclassifier.
• (1) PercentPredictingPositive
We refer to this in the main text as NumVotingForClass. This variable is the percentage of base classifiers (out of all base classifiers) that vote for membership in the
class. In our experimental evaluation, we only used one instantiation of this variable.
This was added to ease the search, since learning this m-of-n type of feature can require significant data for a decision tree learning algorithm (unless the algorithm is specifically altered for this).
• (1) PercentAgreeWBest
This variable is referred to as PercentAgreement in the main text. For polychotomous
problems, PercentAgreement can be used to indicate among how many classes the
classifiers fracture their votes. Since there are only two classes here, we altered
it to indicate the percent agreement with the best base classifier (the classifier that
performed best over the training data).
5.3 Task-Dependent Variables
Here are a few examples of variables which could be included in the models that are pooled
across tasks. These are variables that do not vary from document to document within a
collection or classification problem but vary across them.
After each bullet below, a number is given in parentheses, indicating the number of
variables that this description includes.
• (1) NumTrainingPoints
Different learning algorithms often have different learning curves. Including this variable allows the metaclassifier to inductively learn about crossovers in the learning curves of the models.
• (1) %TrainingPointsIn{Positive}
This has the same motivation as NumTrainingPoints, except that, since positive examples are rare in text, there is reason to suspect the learning curve might be more closely correlated with the number of positive training points.
• (1) NumberOfSupportVectors
The generalization bounds for an SVM are tied to the number of support vectors in
the solution. Therefore, this is an obvious candidate for predicting the usefulness of,
at least, the SVM model.
Chapter 6
Background for Empirical Analysis
Most of the remaining chapters contain an accompanying empirical analysis for the approaches they discuss. This chapter collects many of the common key elements of these experiments and presents them together for the ease of the reader. We start with a description of the various performance measures used to evaluate the quality of a classification model. In addition, we present the motivation behind each measure, a description of why it is relevant to the task, and discuss the implications that improvement according to a particular measure has beyond the obvious. Next, we give an overview of the primary text collections used throughout the dissertation and describe common processing steps that influence the learned models, e.g., feature selection. Finally, we present implementation details of the base classifiers and document particular parameter settings that influence the learned models.
6.1 Classifier Performance Measures
When selecting among classifier performance measures which have previously been used,
we are faced with choosing from rank-based measures, probability loss functions, and functions such as accuracy that simply measure the membership predictions without regard to
the classifier’s confidence score. A variety of researchers have used rank-based [LC96]
measures of performance because they were interested in interactive systems in which a ranked list of codes for each document would be displayed to users. Many other applications such as automatic routing or tagging require that binary class membership decisions
be made for each document as it is processed. When binary decisions must be made, the
nature of the application often requires a different penalty for false positives than false
negatives. In order to perform well for a particular cost function, one approach ranks the
documents according to a classification confidence score and then optimizes a score threshold for the particular cost function. A classifier that effectively ranks examples can thus be
used for a range of different cost functions. Additionally, producing a stable and accurate
ranking is sufficient to show the existence of a monotone transformation of the classifier’s
scores to accurate probability estimates [LZ05]. However, producing good rankings does
not ensure that the optimal threshold can be optimally selected during training nor does
it guarantee we will choose the best transformation to probability spaces. For example,
Hull et al. [HPS96] found that, although combination techniques were able to improve
document ranking, they did considerably less well at estimating probabilities.
Therefore, at various points in this dissertation, we examine ranking, probability loss
functions, and common classification measures in order to obtain a holistic picture of the
performance trade-offs involved for a particular learning approach.
6.1.1 Classification Measures
The $F_\beta$ measure, and particularly F1 [vR79, YL99], is one of the more commonly used ways to assess text classifier performance. This measure attempts to balance a classifier's precision and recall. Precision measures the ability of a classifier to accurately find items that belong in a class:

$$\text{Precision} = \frac{\text{Correct Positives}}{\text{Predicted Positives}}. \qquad (6.1)$$

Therefore a classifier that has low precision tends to produce false positives. Recall measures the ability of a classifier to accurately find all the items belonging in the class:

$$\text{Recall} = \frac{\text{Correct Positives}}{\text{Actual Positives}}. \qquad (6.2)$$
A classifier with low recall tends to produce false negatives — not correctly finding all
of the actual positives. Depending on the nature of the task, precision and recall can have
varying roles. For example, if we are building classifiers that will automatically sort a user’s
e-mail into subject-oriented folders then errors in precision would route a message to the
wrong folder whereas an error in recall would leave the message in the catch-all INBOX
folder. Errors in recall may produce lower savings for a user, since now the user must sort the remaining items, but are unlikely to cause errors such as not responding to an important mail that has been misfiled. The $F_\beta$ measure is a weighted harmonic mean of precision and recall:
$$F_\beta = \frac{(\beta + 1) \cdot \text{Precision} \cdot \text{Recall}}{\beta \cdot \text{Precision} + \text{Recall}}. \qquad (6.3)$$
The most common choice of β, 1, weights precision and recall equally and yields F1:

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. \qquad (6.4)$$

Generally, the optimal F1 is obtained where precision equals recall, but as in other threshold optimization, finding this trade-off point may be difficult in practice.
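A small sketch of Eqs. 6.1-6.4 computed from hard predictions, using the (β + 1)/(β · Precision + Recall) parameterization given above:

def f_beta(y_true, y_pred, beta=1.0):
    # y_true, y_pred: sequences over {+1, -1}. Returns (precision, recall, F_beta).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(1 for p in y_pred if p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    precision = tp / pred_pos if pred_pos else 0.0      # Eq. 6.1
    recall = tp / actual_pos if actual_pos else 0.0     # Eq. 6.2
    denom = beta * precision + recall
    return precision, recall, ((beta + 1) * precision * recall / denom
                               if denom else 0.0)       # Eqs. 6.3/6.4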
Occasionally, the user of a system can specify at training time an actual cost associated with a false positive, $F_P$, and a false negative, $F_N$. When this is possible, we can write the expected cost (or utility) of a classification algorithm as a linear utility function,

$$C(F_P, F_N) = F_P \cdot P(\text{FalsePositive}) + F_N \cdot P(\text{FalseNegative}) \qquad (6.5)$$

where $P(\text{FalsePositive})$ and $P(\text{FalseNegative})$ are the probabilities under the underlying distribution^1 that the classifier emits a false positive/negative. The most commonly used function in the literature is the error rate, which is $F_P = F_N = 1$. However, the importance
of varying cost functions has been recognized by many researchers because applications
rarely have equal costs for different types of errors [PF01]. The text classification community has been particularly aware of their importance and has included linear utility functions
for some time in such evaluations as TREC (the text retrieval conference) [HR99, RH00].
From a decision theory point of view [DHS01], it is only the ratio of costs that influences a decision and not the strict numbers. So, a decision-maker would make the same policy decisions for C(10, 1) and C(100, 10). More precisely, if we are given probability estimates $\hat{P}(+\mid x)$ and $\hat{P}(-\mid x)$, then the optimal decision [DHS01] is to predict positive whenever:

$$\frac{\hat{P}(+\mid x)}{\hat{P}(-\mid x)} \ge \frac{F_P}{F_N}. \qquad (6.6)$$

For log-odds, the optimal decision is calculated by taking the log of both sides in whatever base desired. In order to assess how sensitive performance is to the utility measure in a typical range of cost ratios, we explicitly consider results for C(10, 1), C(1, 10), and the error function.
6.1.2 Probability Loss Functions
Often the costs of a false positive/negative are not known during training time or may be
dynamic — changing according to a user-specified setting. In this case, if we assume all
linear utility cost functions are equally likely during prediction, then an effective classifier
is equivalent to accurately assessing the probability of each example. Given a probability
^1 Or in-sample, when estimating.
estimate, it is straightforward to make a hard classification based on dynamic costs by applying standard Bayesian decision theory:

$$\hat{Y}(X) = \begin{cases} + & \text{if } \hat{P}(+ \mid X) \cdot F_N \ge \hat{P}(- \mid X) \cdot F_P \\ - & \text{otherwise} \end{cases} \qquad (6.7)$$
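For example, a sketch of the decision rule of Eq. 6.7 together with an in-sample estimate of the linear utility of Eq. 6.5:

def decide(p_pos, fp_cost, fn_cost):
    # Eq. 6.7: predict + when P(+|x) * F_N >= P(-|x) * F_P.
    return 1 if p_pos * fn_cost >= (1.0 - p_pos) * fp_cost else -1

def empirical_cost(y_true, p_pos_list, fp_cost, fn_cost):
    # Eq. 6.5 estimated in-sample from the probability estimates.
    n = len(y_true)
    fp = sum(1 for t, p in zip(y_true, p_pos_list)
             if decide(p, fp_cost, fn_cost) == 1 and t == -1)
    fn = sum(1 for t, p in zip(y_true, p_pos_list)
             if decide(p, fp_cost, fn_cost) == -1 and t == 1)
    return fp_cost * fp / n + fn_cost * fn / n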
Therefore, the question becomes one of obtaining high-quality probability estimates from
a classifier. We discuss how such estimates can be obtained in Chapter 3. Here we turn
to the question of what loss functions are appropriate for judging the quality of probability
estimates.
Two standard scoring functions for probabilities are log-loss and squared error. Both
of these are proper scoring rules [DF83, DF86] in the sense that a classifier’s view of
its expected performance is maximized when the classifier actually issues a probability of
p̂ when it assesses the probability to be p̂, i.e., the classifier cannot expect to gain from
“hedging its bets”. Log-loss is defined as:
$$\text{Log-loss}(x, c(x)) = \log \hat{P}(c(x) \mid x). \qquad (6.8)$$
Unless specified, we will report base e or the natural logarithm. Thus log-loss falls in
[−∞, 0] and penalizes heavily when the classifier assigns low probability to the true class.
When given in this form, the desire is to maximize the log-loss. When minimization is
more convenient, one simply works with the negative of this quantity.
Squared error is defined as:
$$\text{Error}^2(x, c(x)) = [1 - \hat{P}(c(x) \mid x)]^2. \qquad (6.9)$$
Unlike log-loss, squared error falls in a bounded interval, [0, 1], and the aim is to minimize this quantity. Both scoring measures have their strengths and weaknesses. The large penalties that can be inflicted by log-loss sometimes lead to solutions that overfit a single outlier simply to avoid an infinite penalty, whereas under decision theory seeing a cost-ratio that inflicts unbounded penalties is unlikely. On the other hand, since squared error is bounded by one, minimizing it can lead to making predictions that appear confident but in actuality are very wrong. Our use of these measures is primarily restricted to when we are explicitly concerned with the calibration of the systems in question.
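A sketch of both proper scoring rules (Eqs. 6.8 and 6.9), reporting log-loss base e as in the text; the comparison at the bottom illustrates the bounded/unbounded penalty difference discussed above.

import math

def log_loss(p_true_class):
    # Eq. 6.8: log P(c(x)|x); to be maximized, falls in [-inf, 0]
    return math.log(p_true_class) if p_true_class > 0 else float("-inf")

def squared_error(p_true_class):
    # Eq. 6.9: [1 - P(c(x)|x)]^2; to be minimized, falls in [0, 1]
    return (1.0 - p_true_class) ** 2

# A confident wrong prediction, P(true class | x) = 0.01:
# log_loss(0.01) is about -4.6 (a heavy penalty), while squared_error(0.01)
# is about 0.98 (bounded by one no matter how wrong the prediction is).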
6.1.3 Ranking Measures
In addition to the performance scoring measures discussed, we can also examine the benefits that a classifier has to offer over a range of cost functions by examining receiver operating characteristic (ROC) curves. An ROC curve plots true positive rate versus false positive rate for a classifier. If a classifier generates a score that can be used to rank the test set, then an ROC curve is produced by setting a classification threshold between pairs of adjacent examples that have distinguishable scores and then plotting a point corresponding to the false positive and true positive rate that are obtained using that threshold. For a given classifier, $C_i$, the true positive rate and false positive rate are $P(C_i(x) = + \mid c(x) = +)$ and $P(C_i(x) = + \mid c(x) = -)$.

Figure 6.1a highlights several aspects of ROC curves using our earlier example of two class-conditionally independent classifiers where Classifier 1 was based on feature $X_1$ and Classifier 2 was based on feature $X_2$. Recall that if each classifier perfectly predicts the posterior conditioned only on its respective feature, then Classifier 1 does far better than Classifier 2. By choosing the standard threshold of 0.5, we obtain a single true positive and false positive rate for each classifier; these are shown by the single points in the graph. By varying the thresholds $\theta_1$ and $\theta_2$ that the posterior must exceed in order to predict positive, a range of different true positive and false positive rates are produced.
fˆM (X)
a range of different true positive and false positive rates are produced.
p1 (E)
p1 (E)
p2 (E)
p2 (E)
Example ROC Curve
Optimal Combination vs. Base Classifiers
p3 (E)
p3 (E)
1
1
pM (E)
pM (E)
p(E1 | E)
p(E1 | E)
p(E2 | E) 0.8
p(E2 | E) 0.8
p(E3 | E)
p(E3 | E)
p(EM | E) 0.6
p(EM | E) 0.6
p(Êi | E)
p(Êi | E)
p(Ei | Êi )
p(Ei | Êi )
0.4
0.2
0.4
P (+ | X1 ) ≥ θ1
P (+ | X2 ) ≥ θ2
P (+|X1 , X2 )≥ θ3
P (+ | X1 ) ≥ 0.5
P (+ | X2 ) ≥ 0.5
P (+| X1 , X2 ) ≥ 0.5
0.2
P (+ | X1 ) ≥ θ1
P (+ | X2 ) ≥ θ2
P (+ | X1 ) ≥ 0.5
P (+ | X2 ) ≥ 0.5
Random
P (+| X1 , X2 ) ≥ 0.5
0
0
0
P (+|X1 , X2 )≥ θ3
0.2
0.4
0.6
False Positive Rate
0.8
1
0
0.2
0.4
0.6
0.8
1
False Positive Rate
Figure 6.1: (a) At left, an example ROC curve using the conditionally independent classifier example of Section 1.2.1. (b) At right, the optimal combination of Classifiers 1 and 2 dominates both. The optimal combination has an error rate approximately half that of Classifier 1 and a sixth that of Classifier 2, but as the classifiers get closer to perfect classification, the graphical difference can appear deceptively small.
Perfect classification corresponds to a point in the top left (northwest) corner of the
graph. Random performance falls on the $y = x$ line, and therefore anything above it performs better than the random baseline. As we know from the discussion above, the particular threshold that is optimal for a linear utility function depends on the relative weight of
false positive to false negative. The ROC curve actually presents a summary of each classifier’s performance under any linear utility function. For a given linear utility function, the
error weights^2 define the slope of isolines that connect points in the graph with equal performance under that utility function (see Figures 6.4-6.6). Conceptually, if a line with that
slope is moved down in a parallel fashion from the northwest corner, then the first curve
that touches a line is the optimal point for that linear utility function [PF01]. Therefore, for
a set of classifiers, the optimal performance is defined by the convex hull of their curves.
In cases like this example, where one curve is above all the others, that classifier is said to
dominate the other classifiers, and if we are limited to selecting a single classifier to use for
each linear utility function, the best choice is to always use the dominating classifier.
Of course, since we are interested in combining and not simply selecting, the optimal combination improves over both classifiers despite the fact that Classifier 1 dominates
Classifier 2. This can be seen in Figure 6.1b. The optimal combination of Classifier 1 and 2
dominates both. The optimal combination has an error rate approximately half of Classifier
1 and a sixth of Classifier 2, but as the classifiers get closer to perfect classification, the
graphical difference can appear deceptively small.
We can also visualize other metrics in ROC space [Fla03]. For example, Figure 6.2 contrasts the isolines of F1 in a precision-recall graph versus an ROC curve. In the precision-recall graph, the isolines do not change when the class prior is changed; however, as demonstrated by Figure 6.3, when positives become rare, F1 is extremely sensitive to any change
in the upper left part of the ROC curve. In contrast, the isolines for error and linear utility
functions are evenly spaced (Figures 6.4-6.6). These figures also demonstrate how varying
the class prior is equivalent to varying the costs of a false positive/negative.
By examining the whole space of linear utility functions, we seek to understand how
robust (in terms of sensitivity to the specific cost function) a combination method is. In
order to attempt to summarize the linear utility space of functions as a single scalar, we use the area under the ROC curve.^3
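A minimal sketch of producing ROC points by sweeping a threshold down the score-sorted test set, summarized by the trapezoidal area under the curve; ties between scores are ignored here for simplicity, and both classes are assumed present.

import numpy as np

def roc_points(scores, labels):
    # labels over {+1, -1}. Returns (false positive rates, true positive rates).
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = (labels == 1).cumsum()
    fp = (labels == -1).cumsum()
    tpr = np.concatenate(([0.0], tp / tp[-1]))
    fpr = np.concatenate(([0.0], fp / fp[-1]))
    return fpr, tpr

def auc(scores, labels):
    fpr, tpr = roc_points(scores, labels)
    return np.trapz(tpr, fpr)   # area under the ROC curve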
6.1.4 Summarizing Performance Scores
For each performance measure, we can either macro-average or micro-average. In macro-averaging, the score is computed separately for each class and then arithmetically averaged.
^2 Different class priors can actually be equivalently treated as a change in error weight ratio.
^3 Noting that the area under the curve is not a precise summary of the linear utility space.
Figure 6.2: (a) At left, the isolines (or contours) connecting equal values of the F1 score in a precision-recall graph. The best performance is in the top right corner (red lines). (b) At right, the isolines connecting equal values of F1 in an ROC graph. The ROC graph has a free parameter of P(+) that must be specified to draw the contours. For this graph, P(+) = 0.10.
Although the classes are weighted equally in the macro-average, for some performance measures, in particular F1, predictions on instances of rare classes effectively carry more weight, since the scores for rare classes are more sensitive to the changes of a few predictions over the positive instances; the net effect is that improvement in prediction over the rare classes tends to influence the macro-F1 score more than improvement over the common classes. Micro-averaged values are computed directly from the binary decisions over all classes; since all instances are given equal weight, this places more weight on the common classes. For precision, recall, and F1, where micro-averaging is well-defined, we generally report both types of averages. For the other measures, there is no accepted definition of micro-averaging, and some definitions (e.g., linear utility functions) lead to identical macro- and micro-averages. Therefore, for the other measures, we typically report only macro-averages.
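As a minimal illustration of the difference between the two averaging schemes (the per-class contingency counts below are hypothetical), consider:

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when the denominator is empty."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical per-class counts (tp, fp, fn): one common class, one rare class.
counts = {"common": (900, 100, 100), "rare": (5, 2, 15)}

# Macro-average: compute F1 per class, then take the arithmetic mean;
# each class carries equal weight, so changes on the rare class dominate.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro-average: pool the counts over all classes first; each *instance*
# carries equal weight, so the common classes dominate.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
# -> macro F1 = 0.635, micro F1 = 0.893
```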
6.2 Data
Here we briefly review the characteristics of the primary corpora we use for empirical
evaluation throughout the dissertation. For other corpora and synthetic data only used in
specific experiments, we describe them in the relevant section. Our primary empirical
focus deals with several topic-classification corpora including: the MSN Web Directory;
two corpora drawn from the Reuters newswire; and the TREC-AP corpus. We have selected these corpora because they offer a wide array of topic-classification problems, from very broad topics over web pages to very narrow topics over financial news stories.

Figure 6.3: The effects of varying P(+) from 0.05 to 0.40 on the isolines of F1 in ROC space.
6.2.1 Chronological Split vs. Cross-Validation

Most text classification corpora are drawn from a source that has a natural time ordering. For example, every news story is published at a specific time (though we may only know the day), each e-mail is received at a given time, and a newsgroup posting happens at some specific instant. Because of this, text corpora can sometimes exhibit topic drift, where the nature of the discussion or of news events shifts focus, and therefore the associated language distribution also shifts. As a result, many text classification researchers are proponents of splitting a corpus chronologically into a single training and test set [Lew92b]. This allows one to test a text classifier in a way similar to how it will be used. In many applications a chronological split (train on the past, predict the future) is the only possible one, even if there is a gradual shift in the underlying sampling distribution.

On the other hand, machine-learning researchers typically use cross-validation for evaluation because it allows one to average out the variance introduced by a single "unlucky/lucky" split. In general, for datasets where there is an established standard chronological split, we follow that split in our empirical evaluation in order to compare with other researchers, and we use appropriate statistical significance tests. For other datasets (e.g., the
action-item corpus in Chapter 10), we use cross-validation techniques, which allow us to make judgments of statistical significance with less training data.

Figure 6.4: The effects of varying P(+) from 0.05 to 0.40 on the isolines of Error in ROC space.
6.2.2 MSN Web Directory
The MSN Web Directory is a large collection of heterogeneous web pages (from a May
1999 web snapshot) that have been hierarchically classified. We use the same chronological
train/test split of 50078/10024 documents as that reported in [DC00].
The MSN Web hierarchy is a seven-level hierarchy; we use all 13 of the top-level categories. The class proportions in the training set vary from 1.15% to 22.29%. In the testing
set, they range from 1.14% to 21.54%. The classes are general subject categories such
as Health & Fitness and Travel & Vacation. Human indexers have assigned the documents
to zero or more categories. Approximately 195K words appear in at least three training
documents.
6.2.3 Reuters (21578)
The Reuters 21578 corpus [Lew97] contains Reuters news articles from 1987. For this data
set, we use the ModApte standard chronological train/test split of 9603/3299 documents
(8676 unused documents). The classes are economic subjects (e.g., “acq” for acquisitions,
“earn” for earnings, etc.) that human taggers applied to the document; a document may
have multiple subjects. There are actually 135 classes in this domain (only 90 of which occur in both the training and testing sets); however, we examine only the ten most frequent classes, since the small number of testing examples makes estimates of some performance measures, primarily area under the ROC curve, unreliable due to high variance. Limiting the topic set to the ten largest classes also allows us to compare our results to previously published results [DPHS98, Joa98, MN98, Pla99].

Figure 6.5: The effects of varying P(+) from 0.05 to 0.40 on the isolines of Cost(FP = 10, FN = 1) in ROC space. Note that varying the costs of a linear utility function is exactly equivalent to varying the prior for the error scoring function.
The class proportions in the training set vary from 1.88% to 29.96%. In the testing
set, they range from 1.7% to 32.95%. Approximately 15K words appear in at least three
training documents.
6.2.4 TREC-AP
The TREC-AP corpus is a collection of AP news stories from 1988 to 1990. We use the
same chronological train/test split of 142791/66992 documents that was used in [LSCP96].
As described in [LG94] (see also [Lew95]), the categories are defined by keywords in a
keyword field. The title and body fields are used in the experiments below. There are
twenty categories in total.
The frequencies of the twenty classes are the same as those reported in [LSCP96]. The class proportions in the training set vary from 0.06% to 2.03%. In the testing set, they range from 0.03% to 4.32%. Approximately 123K words appear in at least three training documents.
Figure 6.6: The effects of varying P(+) from 0.05 to 0.40 on the isolines of Cost(FP = 1, FN = 10) in ROC space. Note that varying the costs of a linear utility function is exactly equivalent to varying the prior for the error scoring function.
6.2.5 RCV1-v2 (Reuters 2000)

To demonstrate that we have not overfit our methods to the primary corpora used throughout this research, and to further diversify the granularity and type of documents among the corpora, we included this corpus only after all of the combination methods and reliability indicators were fully developed. The Reuters Corpus Volume 1 contains Reuters news articles from 1996-1997, which were released as a corpus by Reuters, Ltd. in 2000 for research purposes [RSW02]. RCV1-v2 is a modified version of the corpus that was extensively documented by [LYRL04] after they corrected various inconsistencies in the original corpus. We use the same chronological train/test split of 23149/781265 documents that was used in [LYRL04]. Unlike for the other large corpora, note that the training set is
kept relatively small. There are several different types of codes that RCV1-v2 documents
are labeled with (regions, industries, topics). We use only topics and restrict our attention to
the 101 topics that have at least one labeled document in the training set. These topics are
hierarchically organized and vary in their granularity. Topics dealing with financial aspects tend to be very fine-grained, while those such as politics are coarse-grained. Further details of the semantics are available in [LYRL04].
To ensure comparability with other published results, we have started with the version
of the documents provided in Appendix 12 of [LYRL04]. These documents have already
been stemmed, stopworded, and tokenized. We treat these tokenized documents as if they
were the original documents.
The frequencies of the 101 classes are the same as those reported in [LYRL04]. The
class proportions in the training set vary from 2 documents (0.0086%) to 10786 documents
(46.59%) with a median of 233 documents (1.007%). In the testing set, they range from 38
documents (0.0049%) to 370541 documents (47.43%) with a median of 8266 documents
(1.058%). Approximately 21K words appear in at least three training documents.
6.3 Base Classifiers
We have selected five classifiers traditionally used for text classification for examination: decision trees, linear SVMs, naïve Bayes, a unigram classifier, and a kNN classifier. We have chosen these algorithms not only because they are known to perform well but also because they differ along several dimensions. The SVM, unigram, and naïve Bayes algorithms we investigate produce a linear decision boundary, whereas the kNN and decision-tree classifiers are non-linear. Furthermore, the unigram and naïve Bayes algorithms we employ are generative classifiers and similar to each other, while the SVM, kNN, and decision-tree classifiers are discriminative.
6.3.1 Decision Trees

For the decision-tree implementation, we have employed the WinMine decision networks toolkit and refer to it as Dnet below [Mic01]. (We thank Max Chickering and Robert Rounthwaite for their special support of the WinMine toolkit.) Dnet builds decision trees using a Bayesian machine learning algorithm [CHM97, HCM+00]. While this toolkit is targeted primarily at building models that provide probability estimates, we found that Dnet models usually perform acceptably for the goal of minimizing error rate. However, we found that the performance of Dnet with regard to other measures is sometimes poor.

The probability that the model predicts is a Laplace correction of the empirical probability at a leaf node (see Eq. 6.10). As noted in [PD03], using a Laplace correction is often critical when a high-quality probability-based ranking is desired. We have not observed any of the other problems noted in [PD03] with regard to obtaining probability estimates. This is likely because the Dnet algorithm attempts to optimize a likelihood-based criterion instead of accuracy. As pointed out by Provost & Domingos [PD03], optimizing accuracy can lead to early stopping or overly aggressive pruning that harms probability estimates. The interested reader applying an alternate decision-tree algorithm should consult [PD03] for more information.
6.3.2 SVMs

For SVMs, we have used the implementation in the SVMlight package, v6.01. Unless otherwise noted, we used a linear kernel and default settings for all other parameters. A more thorough discussion of SVMs is included in Section 5.1.5.
6.3.3 Naïve Bayes (multivariate Bernoulli)

The naïve Bayes classifier has also been referred to as a multivariate Bernoulli model [MN98]. As has been noted elsewhere, the probability estimates used by the classifier must be smoothed to avoid zero probability estimates [Mit97].

The simplest smoothing method, which has been called any of "add-one", Laplace correction, or standard Laplace correction [Sim95, KBS97], is

$$\hat{P}(C = c \mid z) = \frac{n_c + 1}{n + |C|}. \qquad (6.10)$$

Here $n$ is the number of observations for which condition $z$ holds, and $n_c$ is the number of observations for which both $z$ holds and variable $C$ takes value $c$.
Kohavi, Becker, and Sommerfield [KBS97] characterize as Laplace approaches any method which uses

$$\hat{P}(C = c \mid z) = \frac{n_c + f}{n + |C|\,f}, \qquad (6.11)$$

where $f > 0$ is a parameter. They also introduce the Laplace m-estimate, which sets $f = \frac{1}{N}$, where $N$ is the total number of observations and not just those matching condition $z$ (they use $m$ instead of $N$ to refer to the total number of observations; we reserve $m$ for the effective sample size, as in the Bayesian estimate). Thus the effect of smoothing decreases as the total number of observations goes to infinity.
Finally, a more general method, sometimes referred to as a Bayesian estimate or Bayesian m-estimate [Mit97], is

$$\hat{P}(C = c \mid z) = \frac{n_c + m\,p_c}{n + m}, \qquad (6.12)$$

where $p_c$ is a user-specified prior on class $c$ (with $\sum_c p_c = 1$) and $m$ is the "effective sample size", i.e., the weight the prior will carry as measured in number of observations. The Laplace m-estimate is a special case of the Bayesian m-estimate where $p_c = \frac{1}{|C|}$ and $m = \frac{|C|}{N}$. In many applications, $z$ is the null condition and $N = n$.
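A small sketch of the three estimators of Eqs. 6.10-6.12 (the function names are ours); the final assertion checks the special-case relationship stated above:

```python
def laplace(n_c, n, num_classes):
    """'Add-one' Laplace correction, Eq. 6.10."""
    return (n_c + 1) / (n + num_classes)

def laplace_f(n_c, n, num_classes, f):
    """Generalized Laplace approach with parameter f > 0, Eq. 6.11."""
    return (n_c + f) / (n + num_classes * f)

def bayesian_m_estimate(n_c, n, p_c, m):
    """Bayesian m-estimate with prior p_c and effective sample size m, Eq. 6.12."""
    return (n_c + m * p_c) / (n + m)

# The Laplace m-estimate (f = 1/N) equals the Bayesian m-estimate with
# p_c = 1/|C| and m = |C|/N; the smoothing vanishes as N grows.
n_c, n, N, C = 3, 10, 1000, 2
assert abs(laplace_f(n_c, n, C, 1 / N)
           - bayesian_m_estimate(n_c, n, 1 / C, C / N)) < 1e-12
```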
In using the multivariate Bernoulli classifier, the conditional word probabilities use a Bayesian estimate with an effective sample size of 1 and a prior equal to the empirical word frequency (i.e., $p = \tilde{P}(w)$ and $m = 1$). The class estimates are smoothed using a Laplace m-estimate. Thus, we have:

$$\hat{P}(w = \{\text{present, absent}\} \mid c) = \frac{n_{c,w} + \frac{n_w}{N}}{n_c + 1} \qquad (6.13)$$

$$\hat{P}(c) = \frac{n_c + \frac{1}{N}}{N + \frac{|C|}{N}}. \qquad (6.14)$$

Here $n_w$ is the number of documents with the word present/absent, $n_{c,w}$ is the number of documents in class $c$ with the word present/absent, $n_c$ is the number of documents in class $c$, and $N$ is the total number of documents. Words that did not occur in the training set are ignored.
6.3.4 Unigram (multinomial naïve Bayes)

The unigram classifier uses probability estimates from a unigram language model [MN98]. This classifier has also been referred to as a multinomial naïve Bayes classifier. Probability estimates are smoothed in a fashion similar to the smoothing in the naïve Bayes classifier. The resulting estimates are:

$$\hat{P}(w \mid c) = \frac{n_{c,w} + \frac{n_w}{N}}{n_c + 1} \qquad (6.15)$$

$$\hat{P}(c) = \frac{n_c + \frac{1}{N}}{N + \frac{|C|}{N}}. \qquad (6.16)$$

Here $n_w$ is the number of times word $w$ occurred, $n_{c,w}$ is the number of times word $w$ occurred in documents in class $c$, $n_c$ is the total number of word occurrences in class $c$, and $N$ is the total number of word occurrences. As a reminder to the reader, the multinomial model counts repetitions of the same word in the document. Words that did not occur in the training set are ignored.
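The following sketch implements the smoothed unigram estimates of Eqs. 6.15-6.16 as reconstructed above; the data layout (a mapping from class label to tokenized documents) is our assumption for illustration:

```python
import math
from collections import Counter

def train_unigram(docs_by_class):
    """Estimate the smoothed unigram model of Eqs. 6.15-6.16.

    docs_by_class maps a class label to a list of token lists.
    Returns smoothed log P(w|c) and log P(c); unseen words are ignored
    at classification time, matching the text.
    """
    n_w = Counter()                      # total occurrences of each word
    n_cw = {c: Counter() for c in docs_by_class}
    n_c = {}                             # total word occurrences per class
    for c, docs in docs_by_class.items():
        for doc in docs:
            n_cw[c].update(doc)
        n_c[c] = sum(n_cw[c].values())
        n_w.update(n_cw[c])
    N = sum(n_c.values())                # total word occurrences overall
    num_classes = len(docs_by_class)
    log_p_w = {c: {w: math.log((n_cw[c][w] + n_w[w] / N) / (n_c[c] + 1))
                   for w in n_w} for c in docs_by_class}
    log_p_c = {c: math.log((n_c[c] + 1 / N) / (N + num_classes / N))
               for c in docs_by_class}
    return log_p_w, log_p_c

def score(doc, log_p_w, log_p_c, c):
    # Multinomial model: repeated words contribute once per occurrence;
    # words absent from the training vocabulary are skipped.
    return log_p_c[c] + sum(log_p_w[c][w] for w in doc if w in log_p_w[c])
```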
6.3.5 k-Nearest Neighbor

We employ a standard variant of the k-nearest neighbor algorithm used in text classification, kNN with s-cut score thresholding [Yan99]. We use a tfidf weighting of the terms with a distance-weighted vote of the neighbors to compute the score before thresholding it. The score for class $y$ on example $x$ is:

$$s_x(y) = \sum_{n \in kNN(x)\,:\,c(n) = y} \cos(x, n) \;-\; \sum_{n \in kNN(x)\,:\,c(n) \neq y} \cos(x, n). \qquad (6.17)$$
In order to choose the threshold value for $s$, we perform cross-validation over the training set. Unless noted elsewhere, the value of $k$ is set to be $2(\lceil \log_2 N \rceil + 1)$, where $N$ is the number of training points. This rule for choosing $k$ is theoretically motivated by results which show that such a rule converges to the optimal classifier as the number of training points increases [DGL96]. In practice, we have also found it to be a computational convenience that frequently leads to results comparable to numerically optimizing $k$ via a cross-validation procedure.
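A sketch of the scoring rule of Eq. 6.17 and the default choice of k (the helper names are ours; neighbor retrieval itself is omitted):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(u * v for u, v in zip(a, b))
    na = math.sqrt(sum(u * u for u in a))
    nb = math.sqrt(sum(v * v for v in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_score(x, neighbors, y):
    """Distance-weighted kNN score of Eq. 6.17.

    neighbors: (vector, label) pairs already restricted to the k nearest
    neighbors of x; cosine similarity acts as the vote weight.
    """
    return (sum(cosine(x, n) for n, c in neighbors if c == y)
            - sum(cosine(x, n) for n, c in neighbors if c != y))

def default_k(num_training_points):
    # k = 2 * (ceil(log2 N) + 1), the theoretically motivated rule [DGL96].
    return 2 * (math.ceil(math.log2(num_training_points)) + 1)

# A class is predicted positive when knn_score exceeds the s-cut threshold
# chosen by cross-validation over the training set.
```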
6.3.6 Classifier Outputs

Finally, we must address the question of what kind of outputs we expect from the classifiers. Again, recalling our example of class-conditionally independent classifiers from Section 1.2.1, if our combination procedure simply uses the class predictions of the base classifiers, then we often cannot improve, or can improve only slightly, even when the classifiers are based on distinct information sources. Additionally, experiments with stacking [Wol92, TW99] have shown that using some kind of partially-ordered score leads to significantly better results. Therefore, we desire that each classifier output a confidence of some kind, such that greater values of the score imply a greater confidence that the example belongs to the positive class.
However, working with arbitrary scores can be problematic since the inputs can be on vastly different scales. One alternative is to require probabilities. We can either produce probabilities from a score as described in Chapter 3 or use techniques as described in [LZ05] that produce probability estimates from models which only predict class membership.

Consider our idealized example again. If we are given completely correct posterior estimates by class-conditionally independent classifiers (Figure 4.3 and Section 1.2.1), the optimal combination is obtained by multiplying the probabilities and renormalizing. Since multiplicative models are sometimes more difficult to work with, we prefer additive models and therefore work with the log-odds from the classifiers. In fact, because we will use a bias term in our additive models, we can often work with a score that is "similar to log-odds" but not scaled correctly, allowing our bias term to implicitly correct the misestimation of the base models.
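To illustrate why the product rule corresponds to an additive model in log-odds space, the sketch below (the function names are ours) combines calibrated posteriors from class-conditionally independent classifiers; for two classes, Bayes' rule gives odds(+ | all) = prod_i odds_i / odds(prior)^(K-1), i.e., a sum of log-odds plus a bias term:

```python
import math

def log_odds(p, eps=1e-12):
    p = min(max(p, eps), 1 - eps)   # clip to avoid infinite log-odds
    return math.log(p / (1 - p))

def combine_product(posteriors, prior):
    """Multiply class-conditionally independent posteriors and renormalize.

    In log space this is a sum of the classifiers' log-odds minus a
    bias term of (K - 1) times the prior log-odds.
    """
    k = len(posteriors)
    combined = (sum(log_odds(p) for p in posteriors)
                - (k - 1) * log_odds(prior))
    return 1 / (1 + math.exp(-combined))  # back to a probability

# Two classifiers, each moderately confident, with P(+) = 0.5:
print(combine_product([0.8, 0.7], prior=0.5))  # -> ~0.903
```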
6.4 Chapter Summary
This chapter presented an overview of the various performance measures used to compare the effectiveness of a combined model with that of the base classifiers or alternative models. In the remainder of the dissertation, we will focus on F1, linear utility functions, and area under the ROC curve when evaluating the classification methods. In addition, this chapter presented the characteristics of the key datasets to be used in experimentation and the implementation details of the classifiers.
Chapter 7

Combining Classifiers using Reliability Indicators
From Chapter 3, we know that a classifier can improve a combination of classifiers as long as the class is not independent of the output of that classifier given the outputs of the other classifiers. This is the motivation behind stacking classifiers. However, we would like to account for more locality (see Chapter 4 for more on locality) than stacking allows, since stacking can only use functions of the classifier outputs to set non-constant weights on the classifiers. Most other methods (e.g., local cascade generalization, which additionally uses all of the original features) are not suited for high-dimensional problems such as text. This part takes the insights gained from our work on calibration and introduces a new combination model for text classification that uses reliability indicator variables to create a low-dimensional abstraction of the properties likely to influence the local reliability, dependence, and variance of the classifier outputs. Since we then build a classifier using this representation, we are implicitly using the joint distribution over this space to combine the classifiers.

Our work is distinguished from earlier combination approaches for text classification by (1) the use of expressive probabilistic dependency models to combine lower-level classifiers, leveraging special signaling variables, referred to as reliability indicators, and (2) a focus on measures of classification performance rather than the more common consideration of ranking.
Figure 7.1: Schematic characterization of the reliability-indicator methodology. The methodology formalizes the intuition shown here that document-specific context can be used to improve the performance of a set of base classifiers. The outputs of the classifiers are depicted graphically as distributions over possible class labels.
7.1 Introduction
Previous approaches to classifier combination have typically limited the information considered at the metalevel to the output of the classifiers [TW99] and/or the original feature space [Gam98a]. Since a classifier is rarely the best choice across a whole domain, an intuitive alternative is to identify the document-specific context that differentiates between regions where a base classifier has higher or lower reliability.
Returning to the example from Chapter 1, Figure 7.1 shows an example using four base
classifiers: decision tree, SVM, naı̈ve Bayes, and unigram. When given a test document
as input, each of the four base classifiers outputs a probability distribution over possible
class labels (depicted graphically as a histogram in the figure). The metaclassifier uses this
information along with document context (to be described in more detail) to produce a final
classification of the document.
We address the challenge of learning about the reliability of different classifiers in different neighborhoods of the classification domain by introducing variables referred to as
reliability indicators which represent the analytic “context” of a specific document. A reliability indicator is an evidential distinction with states that are linked probabilistically to
regions of a classification problem where a classifier performs relatively strongly or poorly.
The reliability-indicator methodology was introduced by Toyama and Horvitz [TH00]
and applied initially to the task of combining, in a probabilistically coherent manner, several distinct machine-vision analyses in a system for tracking the head and pose of computer users. The researchers found that different visual processing modalities had distinct
context-sensitive reliabilities that depended on dynamically changing details of lighting,
color, and the overall configuration of the visual scene. The authors introduced reliability
indicators to capture properties of the vision analyses, and of the scenes being analyzed, that
provided probabilistic indications of the reliability of the output of each of the modalities.
To learn probabilistic models for combining the multiple modalities, data were collected
about ground truth, the observed states of indicator variables, and the outputs from the concurrent vision analyses. The data were used to construct a Bayesian network model with the ability to appropriately integrate the outputs from each of the visual modalities in real time, providing an overall higher-accuracy composite visual analysis.
The value of the indicator-variable methodology in machine vision stimulated us to explore the approach for representing and learning about reliability-dependent contexts in text
classification problems. For the task of combining classifiers, we formulate and include sets
of variables that hold promise as being related to the performance of the underlying classifiers. We consider the states of reliability indicators and the scores of classifiers directly
and, thus, bypass the need to make ad hoc modifications to the base classifiers. This allows
the metaclassifier to harness the reliability variables if they contain useful discriminatory
information and, if they do not, to fall back in a graceful manner to using the output of the
base classifiers.
As an example, consider three types of documents where: (1) the words in the document
are either uninformative or strongly associated with one class; (2) the words in the document are weakly associated with several disjoint classes; or (3) the words in the document
are strongly associated with several disjoint classes. Classifiers (e.g., a unigram model) will
sometimes demonstrate different patterns of error on these different document types. If we
can characterize a document as belonging to one of these model-specific failure types, then
we can assign the appropriate weight to the classifier’s output for this kind of document. We
have pursued the formulation of reliability indicators that capture different association patterns among words in documents and the structure of classes under consideration. We seek
indicator variables that would allow us to learn context-sensitive reliabilities of classifiers,
conditioned on the observed states of the variable in different settings.
To highlight the approach with a concrete example, Figure 7.2 shows a portion of the
type of combination function we can capture with the reliability-indicator methodology.
The nodes on different branches of a decision tree include the values output by base classifiers, as well as the values of reliability indicators for the document being classified. The
decision tree provides a probabilistic, context-sensitive combination rule indicated by the
particular relevant branching of values of classifier scores and indicator variables. In this
case, the portion of the tree displayed shows a classifier-combination function that considers thresholds on scores provided by a kNN classifier (OutputOfkNN) in conjunction with the context established by reliability-indicator variables (PercentInPosBeforeFS and UnigramStdDeviationLogOfStrengthGivenNeg) to make a final decision about a classification. Higher in the path to these nodes, the decision tree has also made use of the outputs of an SVM classifier and a unigram classifier, as well as other indicator variables. The annotations in the figure show the threshold tests that are being performed, the number of examples in the training set that satisfy each test, and a graphical representation of the probability distribution at the leaves. The likelihood of class membership is indicated by the length of the bars at the leaves of the tree.

Figure 7.2: Portion of a decision tree, learned by STRIVE-D (norm) for the Business & Finance class in the MSN Web Directory corpus, representing a combination policy at the metalevel that considers scores output by classifiers (dark nodes) and values of indicator variables (lighter nodes). Higher in the same path, the decision tree also makes use of OutputOfUnigram and OutputOfSVMLight, as well as other indicator variables.
The variable UnigramStdDeviationLogOfStrengthGivenNeg represents the variance of
unigram class-conditional weights for the negative class for words present in the current
document. The intuition behind the formulation of this reliability-indicator variable is that
examples are more likely to be negative when there is low variance in weights. The variable
PercentInPosBeforeFS is the percentage of words in the document that occurred in a positive training example before feature selection. This can indicate an example is likely to be
positive and can help mitigate overly aggressive feature selection by considering the state
before feature selection. Notice that examples with the highest kNN scores (lowest leaf)
are actually given a lower posterior probability of membership than those with lower kNN
scores (highest leaf) because of the context established by the indicators and other classifier
outputs. Chapter 5 gives further details about the motivation, derivation, and computation
of reliability indicators used in these experiments.
The indicator variables used in our studies represent an attempt to formulate states that capture influential contexts. We constructed variables to represent a variety of contexts that held promise as being predictive of accuracy. These include such variables as the number of features present in a document before and after feature selection, the distribution of features across the positive vs. negative classes, and the mean and variance of classifier-specific weights.

We can broadly group reliability-indicator variables into four types: variables that measure (1) the amount of information present in the original document, (2) the information loss or mismatch between the representation used by a classifier and the original document, (3) the sensitivity of the decision to evidence shift, and (4) basic voting statistics.
DocumentLength is an example of a reliability-indicator variable of type 1. The performance of classifiers is sometimes correlated with document length, because longer documents give more information to use in making a classification. DocumentLength can also
be informative because some classifiers will perform poorly over longer documents as they
do not model the influence of document length on classification performance (e.g., they
double count evidence and longer documents are more likely to deviate from a correct
determination).
PercentRemoved serves as an example of type 2. This variable represents the percent
of features removed in the process of feature selection. If most of the document was not
represented by the feature set employed by a classifier, then some classifiers may be unreliable. Other classifiers (e.g., decision trees that model missing attributes) may continue to
be reliable. When the base classifiers are allowed to use different representations, type 2
features can play an even more important role.
An example of type 3 is the UnigramStdDeviation variable. In a binary class problem, the weight each word contributes to the unigram model's decision is $\log \frac{P(w \mid c)}{P(w \mid \neg c)}$. UnigramStdDeviation is the standard deviation of this per-word weight over the words in the document. Low variance means the decision of the classifier is unlikely to change with a small change in the document content; high variance increases the chances that the decision would change with only a small change in the document.
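A sketch of how such an indicator could be computed from a trained unigram model (the dictionary-based interface is our assumption, not the dissertation's implementation):

```python
import math

def unigram_std_deviation(doc_words, log_p_w_pos, log_p_w_neg):
    """Standard deviation of per-word unigram weights over a document.

    The weight of word w is log P(w|+) - log P(w|-); low variance suggests
    the unigram decision is stable under small edits to the document.
    The two dictionaries of smoothed log-probabilities are assumed to come
    from a trained unigram model (hypothetical interface).
    """
    weights = [log_p_w_pos[w] - log_p_w_neg[w]
               for w in doc_words if w in log_p_w_pos and w in log_p_w_neg]
    if not weights:
        return 0.0
    mean = sum(weights) / len(weights)
    return math.sqrt(sum((v - mean) ** 2 for v in weights) / len(weights))
```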
Finally, NumVotingForClass and PercentAgreement are examples of type 4 reliability indicators. These simple voting statistics improve the metaclassifier's search space, even though the metaclassifier is also given the base-classifier decisions as input. For a two-class case the PercentAgreement variable may provide little extra information, but for a greater number of classes it can be used to determine whether the base classifiers have fractured their votes among a small number of classes or across a wide array. At the end of this chapter, we discuss which reliability indicators were most useful in the final combination scheme.
Beyond the key difference in the semantics of their usage, reliability-indicator variables
differ qualitatively from variables representing the output of classifiers in several ways. For
one, we do not assume that the reliability indicators have some threshold point that classifies
the examples better than random. We also do not assume that classification confidence
shows monotonicity trends as in classifiers.
7.1.1 STRIVE: Metaclassifier with Reliability Indicators

We refer to our classifier-combination learning and inference framework as STRIVE, for Stacked Reliability Indicator Variable Ensemble. We select this name because the approach can be viewed as essentially extending the stacking framework by introducing reliability indicators at the metalevel. The STRIVE architecture is depicted graphically in Figure 7.4.
Our methodology maps the original classification task into a new learning problem. In the original learning problem (Figure 7.3), the base classifiers simply predict the class from a word-based representation of the document; more generally, each base classifier outputs a distribution (possibly unnormalized) over class labels. STRIVE adds another layer of learning to the base problem. A set of reliability-indicator functions use the words in the document and the classifier outputs to generate the reliability-indicator values, $r_i$, for a particular document. This process can be viewed as yielding a new representation of the document that consists of the values of the reliability indicators, as well as the outputs of the base classifiers. The metaclassifier uses this new representation for learning and classification. This enables the metaclassifier to employ a model that uses the output of the base classifiers as well as the context established by the reliability indicators to make a final classification.

We require the outputs of the base classifiers to train the metaclassifier. Thus, we perform cross-validation over the training data and use the resulting base-classifier predictions, obtained when an example serves as a validation item, as training inputs for the metaclassifier. We note that, in the case where the set of reliability indicators is restricted to the identity function over the original data, the resulting scheme can be viewed as a variant of cascade generalization [Gam98a].
Figure 7.3: Typical application of a classifier to a text problem. In traditional text classification, a word-based representation of a document is extracted (along with the class label during the learning phase), and the classifiers (here an SVM and a unigram classifier) learn to output scores for the possible class labels. The shaded boxes represent a distribution over class labels.
Figure 7.4: Architecture of STRIVE. In STRIVE, an additional layer of learning is added where the
metaclassifier can use the context established by the reliability indicators and the output of the base
classifiers to make an improved decision. The reliability indicators are functions of the document
and/or the output of the base classifiers.
7.2 Experimental Analysis
We performed a large number of experiments to test the value of probabilistic classifier combination with reliability-indicator variables. For the experiments below, for all base classifiers except the kNN classifier, we used only the top 1000 words with highest mutual information for the MSN Web Directory and TREC-AP corpora and the top 300 words for Reuters. Note that performance of the base classifiers using feature selection is generally as good if not better than performance using all of the features. Since the kNN classifier is computationally expensive, we desired to use the same feature representation across binary classification tasks within a corpus: once neighbors are retrieved, the kNN classifier can make all class decisions quickly. As is commonly done (e.g., [LYRL04]), for each word we assigned a score equal to the max of its mutual information scores across the binary tasks; the top features were then taken across these max scores. Since the same feature set was being used for all classes within a corpus, we used 3× the number of features: 3000 words for MSN Web and TREC-AP and 900 for Reuters. For the reliability indicators that compare representations before and after feature selection, we only added instantiations for the non-kNN representation. The corpora are described in further detail in Section 6.2. We now describe the methodology and results.
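A sketch of this feature-selection scheme (presence-based mutual information with a max over one-vs-rest tasks); the interfaces are ours, and the actual implementation details may differ:

```python
import math
from collections import Counter

def mutual_information(docs, labels, word):
    """MI (in nats) between word presence and a binary label."""
    n = len(docs)
    joint = Counter((word in doc, lab) for doc, lab in zip(docs, labels))
    p_w = Counter(word in doc for doc in docs)   # presence marginal
    p_l = Counter(labels)                        # label marginal
    mi = 0.0
    for (w, lab), c in joint.items():
        p_joint = c / n
        mi += p_joint * math.log(p_joint / ((p_w[w] / n) * (p_l[lab] / n)))
    return mi

def select_features(docs, label_sets, vocabulary, num_features):
    """Score each word by its max MI across the binary tasks, take the top.

    label_sets: one binary label list per class (one-vs-rest tasks),
    mirroring the commonly used max-over-tasks scheme (e.g., [LYRL04]).
    """
    scores = {w: max(mutual_information(docs, labels, w)
                     for labels in label_sets) for w in vocabulary}
    return sorted(scores, key=scores.get, reverse=True)[:num_features]
```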
Base Classifiers
In an attempt to isolate the benefits gained from the probabilistic combination of classifiers with reliability indicators, we worked to keep the representations for the base classifiers in our experiments nearly identical. We would expect that varying the representations (i.e., using different feature-selection methods or document representations) would only improve the performance, as this would likely decorrelate the performance of the base classifiers. One notable deviation is that a tfidf representation was used for the kNN classifier, since that is standard in the text classification literature. As described in more detail in Section 6.3, we selected five classifiers as base classifiers: kNN, decision trees, linear SVMs, naïve Bayes, and a unigram classifier. We denote these below as kNN, Dnet, SVM, naïve Bayes, and Unigram.
Basic Combination Methods
We perform experiments to explore a variety of classifier-combination methods and consider several different combination procedures. The first combination method is based on selecting one classifier for each binary class problem: the one that performed best on a validation set. We refer to this method as the Best By Class method.

Another combination method centers on taking a majority vote of the base classifiers. This approach is perhaps the most popular methodology used for the combination of text classifiers. Because we have five base classifiers, we do not have to address the issue of breaking ties. (Ties can be broken in a variety of ways, e.g., by always voting for in-class; in earlier work [BDH02, BDH05], we experimented with several variants of these methods, and the most successful broke ties by voting with the Best By Class classifier.) We refer to this method as the Majority method.
Hierarchical Combination Methods
Stacking
Finally, we investigate several variants of the hierarchical models described earlier. As mentioned above, omitting the reliability-indicator variables transforms STRIVE into a stacking methodology [TW99, Wol92]. We refer to these classifiers below as Stack-X, where X is replaced by the first letter of the classifier that performs the metaclassification. Therefore, Stack-D uses a decision tree as the metaclassifier, and Stack-S uses a linear SVM as the metaclassifier. We note that Stack-S is also a weighted linear combination method, since it is based on a linear SVM and uses only the classifier outputs. Therefore, demonstrating that STRIVE either outperforms the Stack-S variants or is not statistically different from them is a key challenge of this dissertation.
We found it was difficult to learn the weights for an SVM when the inputs have vastly different scales; at times, it is not possible to identify good weights. To address the problem of handling inputs with greatly varying scales, we use an input normalization procedure: we normalize the inputs to the metaclassifiers to have zero mean and reduce the scale of the standard deviation (see the note below). In order to perform consistent comparisons, we perform the same
alteration for the metaclassifiers using Dnet. Furthermore, as we will see in Chapter 8, this normalization is useful for additional reasons.

A note on this normalization: our original intention was to normalize the inputs to zero mean and unit standard deviation. However, we found shortly before press that a line of code had been commented out in our normalization code. The end result was that instead of using the normalization that produces zero mean and unit standard deviation, $f' = \frac{f - E[f]}{\sqrt{E[f^2] - E^2[f]}}$, we instead used $f' = \frac{f - E[f]}{\sqrt{E[f^2]}}$. The resulting $f'$ has zero mean and a standard deviation of $\sigma_{f'} = \sqrt{1 - \frac{E^2[f]}{E[f^2]}}$. Since $E[f^2] \ge E^2[f]$, we have $\sigma_{f'} \le 1$, with $\sigma_{f'} = 1$ when $E^2[f] = 0$. Thus, the scale of the feature variance has been reduced. Furthermore, since the classifier outputs are similar to log-odds, for stacking $E^2[f]$, and more importantly $\frac{E^2[f]}{E[f^2]}$, tend toward zero, and thus the resulting normalization differs little from standard normalization. We are primarily concerned with whether standard normalization would improve stacking further (if standard normalization hurt striving, we could use this normalization instead). Because the use of log-odds drives this normalization to behave similarly to standard normalization for stacking, and since we also compare to non-normalized metaclassifier inputs, the expected result is a negligible difference. Finally, in a small sample of experiments, this normalization had negligible impact on the stacking methods compared to normalizing to unit standard deviation. Thus, while not the normalization we intended to use, it does not change the conclusions drawn from our experiments.
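A sketch contrasting the intended normalization with the variant actually used (the flag-based interface is ours):

```python
import math

def normalize(values, intended=False):
    """Normalize metaclassifier inputs to zero mean.

    intended=True divides by the standard deviation sqrt(E[f^2] - E^2[f]),
    giving zero mean and unit variance; intended=False divides by
    sqrt(E[f^2]), the variant actually used, which only reduces the scale
    of the standard deviation (sigma' = sqrt(1 - E^2[f]/E[f^2]) <= 1).
    """
    n = len(values)
    e_f = sum(values) / n                       # E[f]
    e_f2 = sum(v * v for v in values) / n       # E[f^2]
    denom = math.sqrt(e_f2 - e_f ** 2) if intended else math.sqrt(e_f2)
    return [(v - e_f) / denom for v in values]
```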
As might be expected, the impact of normalization for decision-tree learners is relatively minimal and has both positive and negative influences. Because of this, we present only the versions using normalized inputs in the main text; the non-normalized versions are available in a complete listing in Tables 7.11-7.13 at the end of the chapter. To denote the metaclassifiers whose inputs have been normalized in this manner, we append "(norm)" to their names.
STRIVE
Similar to the notation described above, we add a letter to STRIVE to denote the particular metaclassifier being used. So, STRIVE-D is the STRIVE framework using Dnet as a metaclassifier. For comparison to the stacking methods, we evaluate STRIVE-D and STRIVE-S. Normalization, as above, is again noted by appending "(norm)" to the system names.
The experiments reported here use a total of 70 reliability indicators, including the specific examples given in Section 7.1. The full list of reliability indicators and the motivation for each is given in Chapter 5. These reliability indicators were formulated by hand as an attempt at representing potentially valuable contexts. Identifying new reliability indicators remains an open and challenging research problem, both for this methodology and in general.
BestSelect Classifier
To study the effectiveness of the STRIVE methodology, we formulated a simple optimal
combination approach as a point of reference. Such an upper bound can be useful as a
benchmark in experiments with classifier combination procedures. This bound follows
quite naturally when classifier combination is formulated as the process of selecting the
best base classifier, on a per-example basis.
To classify a given document, if any of the classifiers correctly predict that document’s
class, the best combination would select any of the correct classifiers. Thus, such a classification combination errs only when all of the base classifiers are incorrect. We refer to
this classifier as the BestSelect classifier. If all of the base classifiers are better than random, BestSelect is the theoretical upper bound on performance when combining a set of classifiers in a selection framework.

We note that we are not using a pure selection approach, as our framework allows the possibility of choosing a class that none of the base classifiers predicted. In cases where the classifiers are not better than random (or are logically dependent), such an upper bound may be uninformatively loose. Even though we are not working in a pure selection framework, we found it is rarely the case that the metaclassifier outputs a prediction which none of the base classifiers made. Therefore, we have employed the BestSelect bound to assist in understanding the performance of STRIVE.
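A sketch of the BestSelect oracle (the list-based interface is ours):

```python
def best_select(base_predictions, true_labels):
    """Per-example oracle: correct whenever any base classifier is correct.

    base_predictions: list of prediction lists, one per base classifier.
    Returns the oracle's predictions; it errs only when every base
    classifier errs on that example.
    """
    oracle = []
    for i, truth in enumerate(true_labels):
        preds = [p[i] for p in base_predictions]
        # Select any correct classifier's prediction if one exists;
        # otherwise all are wrong and the oracle inherits the error.
        oracle.append(truth if truth in preds else preds[0])
    return oracle
```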
7.2.1 Performance Measures

To compare the performance of the classification methods, we look at a set of standard performance measures: macro-averaged F1, micro-averaged F1, error, two linear utility functions, C(10, 1) and C(1, 10), and area under the ROC curve. In addition, we computed and displayed receiver operating characteristic (ROC) curves, which represent the performance of a classifier under any linear utility function [PF01]. These measures are described in more detail in Section 6.1.
7.2.2 Experimental Methodology

As the categories under consideration in the experiments are not mutually exclusive, the classification was carried out by training n binary classifiers, where n is the number of classes. Decision thresholds for each classifier were set by optimizing them for each performance measure over the validation data. That is, a classifier could have different decision thresholds for each of the separate performance measures (and for each class). This ensures that the base classifiers are as competitive as possible across the various measures. For the micro performance measures, obtaining truly optimal performance requires optimizing all the thresholds in a corpus in conjunction; we have taken the more computationally efficient approach of using the macro-optimized thresholds (i.e., the threshold for each class is set independently of the thresholds for the other classes).

To generate the data for training the metaclassifier (i.e., reliability indicators, classifier outputs, and class labels), we used five-fold cross-validation on the training data from each of the corpora. The dataset obtained through this process was then used to train the metaclassifiers. As with the base classifiers, cross-validation over the meta-training set was used to set thresholds for each performance measure. Finally, the resulting metaclassifiers were applied to the separate testing data described above.
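A sketch of this meta-training data generation (the learner and indicator interfaces are ours, and the simple modulo-based fold assignment stands in for whatever folding was actually used):

```python
def make_meta_training_data(examples, labels, base_learners, indicator_fns,
                            num_folds=5):
    """Build metaclassifier training rows via cross-validation.

    Each example's base-classifier outputs come from models trained on the
    other folds, so the meta-level inputs reflect held-out behavior.
    base_learners: callables (train_X, train_y) -> predict_fn.
    indicator_fns: callables (example, base_outputs) -> indicator value.
    """
    n = len(examples)
    folds = [list(range(i, n, num_folds)) for i in range(num_folds)]
    meta_rows = []
    for held_out in folds:
        held = set(held_out)
        train_X = [x for i, x in enumerate(examples) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        models = [learn(train_X, train_y) for learn in base_learners]
        for i in held_out:
            outputs = [model(examples[i]) for model in models]
            indicators = [f(examples[i], outputs) for f in indicator_fns]
            # Meta-level representation: classifier outputs + indicators.
            meta_rows.append((outputs + indicators, labels[i]))
    return meta_rows
```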
7.2.3 Results
Method              MacroF1       MicroF1       Error         C(1,10)       C(10,1)     ROC Area
Dnet                0.5477        0.5813        0.0584        0.3012        0.0772      0.8802
Unigram             0.5982        0.6116        0.0594        0.2589        0.0812      0.9003
naive Bayes         0.5527        0.5619        0.0649        0.2853        0.0798      0.8915
SVM                 0.6727 B      0.7016 B      0.0455        0.2250 B      0.0794      0.9123
kNN                 0.6480        0.6866        0.0464        0.2524        0.0733      0.8873
Best By Class       0.6727 D      0.7016        0.0452 D      0.2235        0.0729 D    N/A
Majority            0.6643        0.6902        0.0479        0.2133 BD     0.0765      N/A
Stack-D (norm)      0.6924 BD     0.7233 BD     0.0423 BD     0.1950 BD     0.0708 D    0.9361 BS
Stack-S (norm)      0.6939 BD     0.7250 BD     0.0423 BD     0.1971 BD     0.0705 D    0.9334 B
STRIVE-D (norm)     0.6988 BD     0.7327 BDS    0.0413 BD     0.1846 BDS    0.0697 D    0.9454 BSD
STRIVE-S (norm)     0.7173 BDSR   0.7437 BDSR   0.0392 BDSR   0.1835 BDS    0.0682      0.9260 B
BestSelect          0.8719        0.8924        0.0223        0.0642        0.0565      N/A

Table 7.1: Performance on the MSN Web Directory corpus. The best performance (omitting the oracle BestSelect) in each column is given in bold. A notation of 'B', 'D', 'S', or 'R' indicates a method significantly outperforms all (other) Base classifiers, Default combiners, Stacking methods, or Reliability-indicator based Striving methods at the p = 0.05 level. A blackboard (hollow) font is used to indicate significance for the macro-sign test and micro-sign test. A normal font indicates significance for the macro t-test. For the macro-averages (i.e., excluding micro F1), when both tests are significant it is indicated with a bold, italicized font.
Tables 7.1, 7.2, and 7.3 present the main performance results over the three corpora. In terms of the various performance measures, better performance is indicated by larger F1 or ROC-area values or by smaller C(FP, FN) values. The best performance (ignoring BestSelect) in each column is given in bold.
To determine statistical significance for the macro-averaged measures, a one-sided macro
sign test and two-sided macro t-test were performed [YL99]. For micro-F1, a two-sided
micro sign test was performed [YL99]. Differences with a p-level above 0.05 were not
considered statistically significant. The macro sign test uses the null hypothesis that the
number of classes in which we improve versus the number in which we decrease is randomly distributed binomially with probability one half. The macro t-test compares whether the average difference across classes can be explained by the variance of drawing two samples from a t-distribution. Thus, the macro sign test can detect when we are very likely to improve in an extremely high proportion of classes, while the macro t-test detects when the amount of difference cannot be explained by random variation. Viewing the results of both tests generally indicates whether we can always expect to gain and how much.

Method              MacroF1       MicroF1       Error         C(1,10)       C(10,1)     ROC Area
Dnet                0.7846        0.8541        0.0242        0.0799        0.0537      0.9804
Unigram             0.7645        0.8674        0.0234        0.0713        0.0476      0.9877
naive Bayes         0.6574        0.7908        0.0320        0.1423        0.0527      0.9703
SVM                 0.8545 B      0.9122 B      0.0145 B      0.0499        0.0389      0.9893
kNN                 0.8097        0.8963        0.0170        0.0737        0.0336      0.9803
Best By Class       0.8608 D      0.9149        0.0144        0.0496        0.0342      N/A
Majority            0.8498        0.9102        0.0155        0.0438        0.0437      N/A
Stack-D (norm)      0.8680        0.9197 B      0.0136        0.0410        0.0366      0.9912
Stack-S (norm)      0.8908 BDS    0.9307 BDS    0.0125 BD     0.0372 BD     0.0331 S    0.9956 BS
STRIVE-D (norm)     0.8555        0.9172        0.0144        0.0488        0.0364      0.9913
STRIVE-S (norm)     0.8835 BDR    0.9287 BDR    0.0121 BDR    0.0352 BD     0.0343      0.9948 BR
BestSelect          0.9611        0.9789        0.0036        0.0073        0.0173      N/A

Table 7.2: Performance on the Reuters corpus. The best performance (omitting the oracle BestSelect) in each column is given in bold. A notation of 'B', 'D', 'S', or 'R' indicates a method significantly outperforms all (other) Base classifiers, Default combiners, Stacking methods, or Reliability-indicator based Striving methods at the p = 0.05 level. A blackboard (hollow) font is used to indicate significance for the macro-sign test and micro-sign test. A normal font indicates significance for the macro t-test. For the macro-averages (i.e., excluding micro F1), when both tests are significant it is indicated with a bold, italicized font.
The micro sign test is similar to the macro sign test but compares the decisions at an
example-level instead of the performance at a class-level. The null hypothesis is that over
the examples where two methods make different classification decisions the distribution of
right/wrong can be explained by a binomial distribution with probability one half. Thus,
the micro sign test indicates whether a method makes significantly different decisions than
another method cumulatively across all classes.
A notation of ‘B’, ‘D’, ‘S’, or ‘R’ indicates a method significantly outperforms all
(other) Base classifiers, Default combiners, Stacking methods, or Reliability-indicator based
Striving methods at the p = 0.05 level. A blackboard (hollow) font is used to indicate significance for the macro-sign test and micro-sign test. A normal font indicates significance for the macro t-test. For the macro-averages (i.e., excluding micro F1), when both tests are significant it is indicated with a bold, italicized font. When a method is part of a group, the letter of that group indicates the method significantly beats all other methods in that group. For example, a B attached to the SVM classifier's macro F1 would indicate that the SVM classifier significantly outperforms all other base classifiers on macro F1 according to the macro sign test.

Method              MacroF1       MicroF1       Error          C(1,10)      C(10,1)     ROC Area
Dnet                0.6007        0.5706        0.0064         0.0346       0.0081      0.9767
Unigram             0.6001        0.5695        0.0064         0.0347       0.0079      0.9819
naive Bayes         0.5676        0.5349        0.0065         0.0455       0.0078      0.9755
SVM                 0.7361 B      0.6926 B      0.0049 B       0.0282 B     0.0077      0.9715
kNN                 0.6793        0.6533        0.0053         0.0371       0.0074      0.9238
Best By Class       0.7356 D      0.6925 D      0.0049 D       0.0302       0.0073      N/A
Majority            0.7031        0.6534        0.0056         0.0307       0.0075      N/A
Stack-D (norm)      0.7331        0.7007 BD     0.0050         0.0251       0.0073      0.9886
Stack-S (norm)      0.7486 BD     0.7011 BD     0.0048 BDS     0.0263 BD    0.0072      0.9834 B
STRIVE-D (norm)     0.7246        0.6991 BD     0.0051         0.0268       0.0073      0.9870
STRIVE-S (norm)     0.7532 BDR    0.7148 BDSR   0.0047 BDSR    0.0277 D     0.0071      0.9771
BestSelect          0.8986        0.8356        0.0031         0.0133       0.0058      N/A

Table 7.3: Performance on the TREC-AP corpus. The best performance (omitting the oracle BestSelect) in each column is given in bold. A notation of 'B', 'D', 'S', or 'R' indicates a method significantly outperforms all (other) Base classifiers, Default combiners, Stacking methods, or Reliability-indicator based Striving methods at the p = 0.05 level. A blackboard (hollow) font is used to indicate significance for the macro-sign test and micro-sign test. A normal font indicates significance for the macro t-test. For the macro-averages (i.e., excluding micro F1), when both tests are significant it is indicated with a bold, italicized font.
7.2.4 Discussion
First, we note that the base classifiers are competitive and consistent with the previously reported results over these corpora [ZO01, DC00, DPHS98, Joa98, Lew95, LG94, MN98]. (While the results reported for Reuters are not directly comparable to those reported by Yang & Liu [YL99], as these investigators report results over all 90 classes and do not give a breakdown for the ten most frequent categories, others [ZO01, DPHS98, Joa98, MN98, Pla99] provide published baselines over the ten largest classes.) Furthermore, the fact that the linear SVM tends to be the best base classifier is consistent with the literature [DPHS98, Joa98, YL99].
MSN Web Directory
Examining the main results for the MSN Web Directory corpus in Table 7.1 highlights
several points. First, the basic combiners have only one significant win over the base classifiers, C(1,10) for the Majority vote approach. The results directly support the idea that
the performance of a very good learner (SVM) tends to be diminished when combined via
a majority vote scheme with weak learners; in addition, the win most likely results from
the fact that the base learners (other than SVM) have a tendency to predict positively for a
class. When false negatives are weighed more heavily, the shift toward predicting positive
helps reduce the number of false negatives.
Both variants of Stacking and Striving often outperform the base classifiers; with few exceptions, Stack-S (norm) and STRIVE-S (norm) outperform the base classifiers on nearly all the performance measures, often significantly. In fact, this remains true in all of the corpora. The only consistent exception is the C(10,1) performance measure. This measure places a much higher penalty on false positives; methods are therefore pushed toward achieving correct negatives. Given the small number of positives in text classification corpora, achieving any gains under this measure is very challenging.
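On our reading of the C(a, b) measures, which matches the penalties described here but is an assumption about the exact normalization used in the thesis, they can be computed as an average misclassification cost:

    # Assumed form of the linear utility measures: C(a, b) averages a cost
    # of `a` per false positive and `b` per false negative over all
    # decisions, so C(10,1) punishes false positives heavily. The
    # normalization by the number of decisions is our guess.
    def linear_utility_cost(false_pos, false_neg, n_decisions, fp_cost, fn_cost):
        return (fp_cost * false_pos + fn_cost * false_neg) / n_decisions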
Both STRIVE-D (norm) and STRIVE-S (norm) show advantages that are robust across a variety of performance measures, and each shows a consistent improvement over the state-of-the-art SVM classifier. For the thresholds optimized
for error, Stack-S (norm) achieves a relative reduction in error over the SVM of 7% while
STRIVE-S (norm) further improves to achieve a relative reduction in error of 14% over the
SVM — twice the improvement that stacking yields (see Figure 7.7). When compared to
the best theoretical performance that could be achieved by a per-example selection model
using these base classifiers (as established by the BestSelect model), the error reduction
provided by the STRIVE combination methods is an even greater portion of the total possible reduction.

Figure 7.5: The ROC curve for the Home & Family class in the MSN Web Directory corpus over the false-positive range [0, 0.2] (true positive rate vs. false positive rate; curves shown for STRIVE-D (norm), STRIVE-S (norm), SVM, Dnet, Unigram, naive Bayes, and kNN).
As can be inferred from the sign tests, these results are very consistent across classes.
For example, by the ROC area measure of performance, STRIVE-D (norm) beats the base
classifiers on 13/13 classes. The notable exception is the performance of STRIVE-S (norm) on ROC area; graphical inspection of the ROC curves suggests this result arises because STRIVE-S (norm) places emphasis on the classifier that performs well in the early part of the curve.
Often, there is a crossover in the ROC curve between two of the base classifiers further out on the false-positives axis. In practice, most utility measures correspond to the early part of the curve (though this depends on the particular features of the given curve). The SVM metaclassifier sometimes seems to lock onto the classifier that is strong in the early portion of the curve and loses out on the later part. Since the latter portion of the curve rarely matters, one could consider using an abbreviated version of curve area to assess systems. In Tables 7.11-7.13, we present an additional measure of ROC area that only measures the area under the curve for the portion of the x-axis from [0, 0.1]. By this measure, it is possible to see that STRIVE-S (norm)'s performance is markedly better in the early part of the ROC curve.
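A minimal sketch of such a truncated area measure, assuming binary labels and real-valued scores; whether the thesis normalizes the truncated area by the width of the interval is our assumption.

    # Sketch of ROC area restricted to false-positive rates in [0, max_fpr].
    import numpy as np
    from sklearn.metrics import roc_curve

    def truncated_roc_area(y_true, y_score, max_fpr=0.1):
        fpr, tpr, _ = roc_curve(y_true, y_score)
        tpr_at_cut = np.interp(max_fpr, fpr, tpr)   # close the last trapezoid
        keep = fpr <= max_fpr
        fpr_part = np.append(fpr[keep], max_fpr)
        tpr_part = np.append(tpr[keep], tpr_at_cut)
        # divide by max_fpr so a perfect classifier scores 1.0 (our convention)
        return np.trapz(tpr_part, fpr_part) / max_fpr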
Figure 7.6: The full ROC curve for the Home & Family class in the MSN Web Directory corpus (true positive rate vs. false positive rate; curves shown for STRIVE-D (norm), STRIVE-S (norm), SVM, Dnet, Unigram, naive Bayes, and kNN).
In Figures 7.5 and 7.6, we can see that the two STRIVE variants dominate the five base classifiers over much of the ROC space. In fact, STRIVE-D dominates (i.e., its curve is higher than any other at every point) in most classes of the MSN Web Directory corpus. We can also see (note the truncated scale) the base classifiers catching up with STRIVE-S (norm) on the right side of the curve; the base classifiers do, in fact, surpass STRIVE-S (norm) at points. As a result, STRIVE-D may be a more appropriate choice if the utility function penalizes false negatives significantly more heavily than false positives. However, as shown by its performance on the C(1,10) measure, STRIVE-S (norm) retains some robustness in this dimension due to its superior performance early in the curve.
In some cases, we can develop an understanding of why the decision tree is more appropriate for tracking crossovers. In the case portrayed in Figure 7.2, the tree appears to establish two separate score regions for kNN where the reliability indicators give further information about how to classify an example. Since a linear SVM is a weighted sum over the inputs, it cannot represent crossovers that depend on breaking a single variable into multiple regions (such as this one); it has to use the information present in other variables to try to distinguish these regions. Higher-order polynomial kernels are one way to allow an SVM to represent this type of information.
We attempted to do so, but the SVM had difficulty converging with a quadratic kernel, so we chose an alternate localized kernel. In Table 7.4, STRIVE-S Local (norm) uses a local product kernel K(x_i, x_j) = [⟨ρ(x_i), ρ(x_j)⟩ + 1] [⟨Π(x_i), Π(x_j)⟩ + 1], where Π(x) is the projection into the subspace consisting of the base classifier outputs and ρ is the identity function. The resulting kernel has a subset of the terms in a quadratic kernel (which would use ρ in both factors). While the results are advantageous for the MSN Web corpus, this kernel seems to lead to overfitting on the other two corpora; nor does it achieve exactly the merger we desire by the ROC area performance measure. In order to prevent overfitting, we then attempted to merge the models by introducing STRIVE-S LSelect (norm). This model restricts ρ to the subset of features included in the model learned by STRIVE-D (norm). While this leads to substantially less overfitting while retaining some positive gains in MSN Web, its overall change does not seem to make it preferable to STRIVE-S (norm).
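As a concrete illustration, the local product kernel above can be supplied to an SVM as a callable. Here is a minimal sketch assuming the first m columns of the metafeature matrix hold the base-classifier outputs; that layout convention is ours, not necessarily the thesis's.

    # Sketch of K(x, y) = (<x, y> + 1)(<Pi(x), Pi(y)> + 1), where Pi projects
    # onto the base-classifier-output subspace and rho is the identity (so
    # the first factor uses the full metafeature vector).
    import numpy as np
    from sklearn.svm import SVC

    N_BASE = 5  # hypothetical number of base classifiers

    def local_product_kernel(X, Y):
        full = X @ Y.T + 1.0                          # <rho(x), rho(y)> + 1
        proj = X[:, :N_BASE] @ Y[:, :N_BASE].T + 1.0  # <Pi(x), Pi(y)> + 1
        return full * proj

    meta_svm = SVC(kernel=local_product_kernel)       # fit on the metalevel data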
MSN Web                    MacroF1   MicroF1   Error    C(1,10)  C(10,1)  ROC Area
Stack-S (norm)             0.6939    0.7250    0.0423   0.1971   0.0705   0.9334
STRIVE-S (norm)            0.7173    0.7437    0.0392   0.1835   0.0682   0.9260
STRIVE-S Local (norm)      0.7251    0.7530    0.0386   0.1810   0.0656   0.9150
STRIVE-S LSelect (norm)    0.7197    0.7496    0.0388   0.1860   0.0664   0.9097

Reuters                    MacroF1   MicroF1   Error    C(1,10)  C(10,1)  ROC Area
Stack-S (norm)             0.8908    0.9307    0.0125   0.0372   0.0331   0.9956
STRIVE-S (norm)            0.8835    0.9287    0.0121   0.0352   0.0343   0.9948
STRIVE-S Local (norm)      0.8751    0.9261    0.0125   0.0382   0.0344   0.9890
STRIVE-S LSelect (norm)    0.8825    0.9280    0.0128   0.0389   0.0341   0.9921

TREC-AP                    MacroF1   MicroF1   Error    C(1,10)  C(10,1)  ROC Area
Stack-S (norm)             0.7486    0.7011    0.0048   0.0263   0.0072   0.9834
STRIVE-S (norm)            0.7532    0.7148    0.0047   0.0277   0.0071   0.9771
STRIVE-S Local (norm)      0.7439    0.7084    0.0048   0.0231   0.0071   0.9725
STRIVE-S LSelect (norm)    0.7507    0.7104    0.0047   0.0273   0.0071   0.9731

Table 7.4: STRIVE-S Local (norm) uses a local product kernel K(x_i, x_j) = [⟨ρ(x_i), ρ(x_j)⟩ + 1] [⟨Π(x_i), Π(x_j)⟩ + 1], where Π(x) is the projection into the subspace consisting of the base classifier outputs and ρ is the identity function; the resulting kernel has a subset of the terms in a quadratic kernel. STRIVE-S LSelect (norm) restricts ρ to the subset of features included in STRIVE-D (norm), which leads to substantially less overfitting and positive gains in one corpus.
Note that in earlier work [BDH05], Stack-S (norm) did not consistently outperform the base classifiers. Two changes were key to this improvement. First, we added another strong base classifier, kNN. Second, in the earlier work, the outputs of some classifiers were probabilities while others were log-odds; in the current form, all outputs are either log-odds or scores that behave similarly to log-odds. As discussed in Chapter 4, a linear combination of log-odds can capture many desirable types of combination and recalibration interactions that are not attainable through a linear combination of probabilities. However, even with the increase in Stack-S (norm)'s performance, STRIVE-S (norm) improves enough beyond it to be statistically significant. In fact, across most measures STRIVE-S (norm) is now clearly the superior choice over STRIVE-D (norm).
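A toy sketch of the point about log-odds (the names are ours): a constant-weighted linear combination applied in log-odds space and then squashed through a sigmoid can simultaneously reweight and recalibrate each base classifier, which a linear combination of raw probabilities cannot.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def stack_log_odds(log_odds, weights, bias):
        """log_odds: (n_examples, n_classifiers) base outputs in log-odds form.
        A weight above 1 sharpens a classifier's probabilities and a weight
        between 0 and 1 flattens (recalibrates) them, on top of the usual
        reweighting across classifiers."""
        return sigmoid(log_odds @ weights + bias)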
Reuters and TREC-AP
The results for Reuters and TREC-AP in Tables 7.2 and 7.3 are consistent with the above
analysis. We note that the level of improvement over stacking (particularly in Reuters) is
less pronounced for these corpora.
The decision tree meta-models show less consistency than the SVM meta-models. This
is due in part to the nature of the models. While the SVM model cannot threshold regions
of the space well, it can smoothly combine the various model outputs, whereas the decision
tree meta-model fractures the data as it does so and cannot place weights on the outputs.
This further emphasizes the need for a meta-model that provides the advantages of both
models.
In both of these corpora, STRIVE-S (norm) continues to outperform the base classifiers. In the TREC-AP corpus, STRIVE-S (norm) also outperforms the stacking methods significantly on several measures. In the Reuters corpus, however, STRIVE-S (norm) is slightly outperformed by its counterpart Stack-S (norm), although not significantly. Recall the central claim of this thesis: since a linear combination is sometimes optimal, we aim to significantly outperform it in some cases and, in the remaining cases, achieve nearly the same performance. This empirical behavior upholds that claim.
In Figure 7.7, we display the performance changes for Stack-S (norm) and STRIVE-S (norm) relative to the best base classifier — the SVM. Since F1 can also be computed as

    F1 = (2 · TruePos) / (2 · TruePos + FalsePos + FalseNeg),

we display the changes in the three components that determine F1: true positives, false positives, and false negatives. Not only does STRIVE-S (norm) achieve considerable reductions in error of 8-18% (using F1-optimized thresholds) and 5-16% (using error-optimized thresholds), but in all but one case, it also increases by a fair margin the improvement attained by Stack-S (norm). Furthermore, since STRIVE-S (norm) increases the number of true positives while decreasing both components of error, STRIVE-S (norm) improves both precision and recall.

Figure 7.7: For Stack-S (norm) and STRIVE-S (norm), change relative to the best base classifier — the SVM classifier — on the MSN Web, Reuters, and TREC-AP corpora. On the left, we show the relative change using thresholds optimized for F1, and on the right, the relative change using thresholds optimized for error. In both panels, we display the changes in the three components that determine F1: true positives, false positives, and false negatives.
Additional Experiments
In earlier work, we investigated whether the reliability indicators could be directly incorporated into the base classifiers. That is, we wanted to understand to what extent their
information can be directly used to improve classification and to what extent it is conditional on the presence of classifier outputs. To examine these issues, we performed an
experiment where we added all of the reliability indicators to the standard document representation and built a model using Dnet. Because the current work uses an expanded set
of reliability-indicators, we do not directly include those tables here. However, the reader
should note that, while including the reliability-indicators directly at the base level led to
improvements over the base model, the STRIVE method was still superior.
7.3 An Analysis of Reliability Indicator Usefulness
In order to determine which variables show the most promise for future investigation, we
would like to measure to what extent each reliability indicator contributes to the final model.
Since we do not want the analysis to be sensitive to linear relationships, we use the STRIVE-D (norm) model for analysis. We start with the final model and perform backward selection over the testing set by deleting the variable that results in the greatest increase in average logscore and retraining a new model. The average logscore for model M is:

    LS(M) = N^{-1} \sum_{i=1}^{N} \log \hat{P}_M(c(x_i) | x_i).        (7.1)
The increase in average logscore for variable v is LS(M − v) − LS(M). Thus, if this quantity is negative, the variable must have participated in the model, and its deletion led to a degradation in model quality. If the change is zero, then the variable either contributed no predictive power to the model or (more likely) did not participate in the model — either because the variable was correlated with another variable present or because it was judged to have too little predictive power. In either case, it is not a good candidate for future study. Finally, if the change is positive, the variable must have participated in the model and deleting it resulted in an improvement in model quality.
Backward selection continued until each of the 70 reliability indicators was removed from the model; thus the last variable deleted is the one that contributed the most to the model. The classifier outputs were always available to be included in the model. If multiple variables tied in a round to be deleted, then all tied variables were deleted. Each variable was assigned a rank from 1 to 70 according to the number of variables deleted previously plus one. Thus, if one variable was deleted in each of the first seven rounds and then 5 tied variables were deleted in round 8, the variable deleted in the next round would be assigned a rank of 13. The final ranking roughly indicates the importance of the variables, with 70 being best. Since a binary model is built for each class, this procedure was repeated for each binary classification model in a corpus.
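A schematic sketch of this ranking procedure; train_without and avg_logscore are placeholders for retraining the metaclassifier on a variable subset and evaluating Equation 7.1 on the test set.

    # Backward selection with ties: all variables tied for the best deletion
    # are removed in the same round and share a rank; ranks run 1..70, with
    # higher ranks deleted later (greater impact on the model).
    def rank_by_backward_selection(variables, train_without, avg_logscore):
        remaining = set(variables)
        ranks, next_rank = {}, 1
        while remaining:
            # score each candidate deletion by the retrained model's logscore
            score = {v: avg_logscore(train_without(remaining - {v}))
                     for v in remaining}
            best = max(score.values())
            tied = [v for v, s in score.items() if s == best]
            for v in tied:
                ranks[v] = next_rank
                remaining.remove(v)
            next_rank += len(tied)
        return ranks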
Tables 7.5-7.6 present the average ranking of variables across classes in the MSN Web and Reuters corpora, respectively. Unfortunately, this procedure is too computationally intensive to perform for the TREC-AP corpus (one round for 70 variables across 20 classes takes approximately four days). We also present the average across binary class models for all three corpora for the first round in Tables 7.7-7.9.

We note that interpreting the average logscores in Tables 7.7-7.9 directly as importance can be misleading. For example, note that SigmaPositiveNeighborDistance is worst according to average change in logscore in Table 7.7 but best according to average ranking in Table 7.5. This variable often contributes to the models, but by average logscore its effect is overall negative in the first round of backward selection. The reason for this is twofold. First, a negative score for deleting an important variable is often (artificially) near zero initially, because many other variables provide some amount of redundant information. Secondly, the initial negative impact on the model can be a result of how the variable is used in conjunction with another variable. In fact, when another variable or two has been deleted, this variable's score often turns positive.
Thus, we believe the rankings in Tables 7.5-7.6 present a more accurate picture of variable importance than the logscores in Tables 7.7-7.9. The logscores provide a valuable sign that addressing this behavior may improve how effectively the variables are used, and this bears further investigation. Additionally, when a variable's average logscore in the first round is zero, it is fairly safe to assume that the variable can be ignored, because it is correlated with another variable or truly provides no information.
Interestingly, the Unigram-based variables do not heavily influence the models according to either the rankings or the logscores (where they often score zero). This was not the case in our earlier work [BDH02, BDH05]; in fact, it was the success of these variables that prompted the creation of the other classifier-based variables. This suggests that either the inclusion of the other variables has made the Unigram-based variables obsolete or, more likely, the inclusion of the kNN classifier has decreased the importance of the Unigram classifier and the variables that are indicative of its performance.
Next we note that, for the Reuters corpus, most of the variables in Table 7.6 have a ranking of at most 4. Given the small size of the corpus, this and the numerous zero scores in Table 7.8 indicate that there simply is not enough training data in the Reuters corpus to make effective use of many reliability indicators.
Comparing the remaining highly ranked variables in the Reuters corpus with the MSN Web ranking, we see many similarities. The classifier-vote based variable PercentPredictingPositive is, unsurprisingly, very high in both. More interestingly, many of the kNN-based (e.g., SigmaPositiveNeighborDistance, kNNShiftMeanPred), decision-tree-based (e.g., DtreeShiftStdDevConfDiff), and SVM-based (e.g., signedNNSV, SVMShiftMeanConfDiff) variables are ranked highly in both corpora. This verifies the importance of these variables and justifies their continued development. Given the strong ties of these variables to particular classification models, they are the best candidates for understanding how the information in these variables could be more directly integrated into the model during training.
Additionally, we see that many of the feature-selection variables are also present, though it is primarily the "before" variants or the "delta" variants that show up, indicating that these variables are useful primarily as an indication of the information that has been lost. In addition to providing a starting point for future analysis, we feel the similarities across these corpora provide hope for models that use the training data from multiple corpora jointly to effectively leverage information across them.
Variable                           Avg. Rank | Variable                           Avg. Rank
SigmaPositiveNeighborDistance      57.31     | UPercentInPosBeforeFS              18.77
MeanPositiveNeighborDistance       53.08     | NB MeanLogOfStrengthGivenNeg       18.54
PercentPredictingPositive          48.38     | FeaturesSeenInNegDelta             18.31
U%FavoringNegBeforeFS              43.08     | NumFeaturesDiscarded               18.15
%FavoringNegBeforeFS               41.38     | NB StdDeviation                    18.00
WordsSeenInPosDelta                39.62     | MeanNeighborDistance               17.92
U%FavoringPosBeforeFS              39.23     | NumUniqueWords                     17.77
%FavoringPosBeforeFS               38.15     | NB StdDevLogOfStrengthGivenNeg     17.69
SigmaNegativeNeighborDistance      37.31     | NumWordsDiscarded                  17.31
kNNShiftMeanConfDiff               36.54     | PercentOOV                         17.00
UpercentInNegativeBeforeFS         36.31     | NeighborhoodRadius                 17.00
signedNNSV                         35.23     | PercentRemoved                     16.15
PercentInNegativeBeforeFS          33.92     | U%FavoringPosAfterFS               15.62
kNNShiftMeanPred                   33.15     | %FavoringNegAfterFS                15.31
PercentInPosBeforeFS               31.85     | NB MeanShift                       15.23
FeaturesSeenInPosDelta             31.31     | UniqueAfterFS                      15.00
PercentWordsPointingToPosDelta     30.31     | UPercentWordsPointingToPosDelta    14.92
kNNShiftStdDevConfDiff             28.15     | DocumentLengthAfterFS              14.62
UPercentWordsPointingToNegDelta    27.38     | EffectiveDocumentLength            14.62
SVMShiftMeanConfDiff               26.31     | stdDevGoodSVProximity              13.46
SVMShiftStdDevConfDiff             25.23     | PercentAgreeWBest                  13.15
PercentUnique                      25.08     | U%FavoringNegAfterFS               11.85
WordsSeenInNegDelta                24.46     | UnigramStdDeviation                11.85
SigmaNeighborDistance              23.00     | PercentInNegAfterFS                11.85
DtreeShiftStdDevConfDiff           22.46     | UPercentInPosAfterFS               11.85
DtreeShiftMeanConfDiff             21.62     | UniStdDevLogOfStrengthGivenNeg     11.85
NumTrainingWordsDiscarded          21.31     | UPercentInNegAfterFS               11.85
meanGoodSVProximity                21.31     | UniMeanLogOfStrengthGivenNeg       11.85
UniquePercentRemoved               21.15     | UnigramMeanShift                   11.85
PercentWordsPointingToNegDelta     21.15     | UniStdDevLogOfStrengthGivenPos     11.85
EffectiveUniqueWords               20.77     | PercentInPosAfterFS                11.85
PercentUniqueOOV                   20.54     | UniMeanLogOfStrengthGivenPos       11.85
NumTrainingFeaturesDiscarded       19.23     | %FavoringPosAfterFS                11.77
NB MeanLogOfStrengthGivenPos       19.00     | NB StdDevLogOfStrengthGivenPos     10.92
MeanNegativeNeighborDistance       18.85     | DocumentLength                     10.46

Table 7.5: In backward selection over the MSN Web testing set, deleting the variable that most improved the average logscore of the model allows us to rank the variables in rough order of impact by the average round a feature was deleted in. A higher average rank means a feature has greater impact on the model.
Variable                           Avg. Rank | Variable                           Avg. Rank
UPercentInPosBeforeFS              29.20     | stdDevGoodSVProximity              4.00
PercentPredictingPositive          29.10     | PercentUniqueOOV                   4.00
SigmaPositiveNeighborDistance      23.00     | UniStdDevLogOfStrengthGivenNeg     4.00
kNNShiftMeanPred                   22.90     | PercentRemoved                     4.00
U%FavoringPosBeforeFS              22.30     | FeaturesSeenInNegDelta             4.00
%FavoringPosBeforeFS               22.20     | NumFeaturesDiscarded               4.00
PercentInPosBeforeFS               16.00     | UniqueAfterFS                      4.00
DtreeShiftStdDevConfDiff           15.70     | UPercentInNegAfterFS               4.00
SVMShiftMeanConfDiff               15.50     | MeanNegativeNeighborDistance       4.00
signedNNSV                         10.80     | UniMeanLogOfStrengthGivenNeg       4.00
NumTrainingFeaturesDiscarded       10.70     | NB MeanLogOfStrengthGivenNeg       4.00
PercentInNegAfterFS                10.60     | NeighborhoodRadius                 4.00
%FavoringPosAfterFS                10.50     | EffectiveUniqueWords               4.00
PercentAgreeWBest                  10.20     | NB StdDevLogOfStrengthGivenPos     4.00
PercentInNegativeBeforeFS          10.10     | UnigramMeanShift                   4.00
WordsSeenInPosDelta                9.90      | NumUniqueWords                     4.00
PercentUnique                      9.80      | DtreeShiftMeanConfDiff             4.00
MeanPositiveNeighborDistance       9.80      | UniStdDevLogOfStrengthGivenPos     4.00
FeaturesSeenInPosDelta             9.70      | UPercentWordsPointingToPosDelta    4.00
PercentOOV                         9.60      | DocumentLengthAfterFS              4.00
UPercentWordsPointingToNegDelta    9.60      | PercentInPosAfterFS                4.00
%FavoringNegBeforeFS               9.30      | kNNShiftStdDevConfDiff             4.00
SigmaNegativeNeighborDistance      8.90      | UniMeanLogOfStrengthGivenPos       4.00
U%FavoringPosAfterFS               4.00      | EffectiveDocumentLength            4.00
U%FavoringNegAfterFS               4.00      | NumWordsDiscarded                  3.90
UnigramStdDeviation                4.00      | NB StdDeviation                    3.90
NB MeanShift                       4.00      | NB MeanLogOfStrengthGivenPos       3.90
DocumentLength                     4.00      | SigmaNeighborDistance              3.90
kNNShiftMeanConfDiff               4.00      | UpercentInNegativeBeforeFS         3.90
NB StdDevLogOfStrengthGivenNeg     4.00      | MeanNeighborDistance               3.80
NumTrainingWordsDiscarded          4.00      | SVMShiftStdDevConfDiff             3.70
UniquePercentRemoved               4.00      | WordsSeenInNegDelta                3.60
meanGoodSVProximity                4.00      | PercentWordsPointingToPosDelta     3.60
UPercentInPosAfterFS               4.00      | U%FavoringNegBeforeFS              3.50
%FavoringNegAfterFS                4.00      | PercentWordsPointingToNegDelta     3.50

Table 7.6: In backward selection over the Reuters testing set, deleting the variable that most improved the average logscore of the model allows us to rank the variables in rough order of impact by the average round a feature was deleted in. A higher average rank means a feature has greater impact on the model.
Variable                           Change      | Variable                           Change
PercentPredictingPositive          -0.00062201 | UniStdDevLogOfStrengthGivenPos     0
kNNShiftMeanPred                   -0.00042239 | PercentInPosAfterFS                0
MeanPositiveNeighborDistance       -0.00026129 | UniMeanLogOfStrengthGivenPos       0
PercentInNegativeBeforeFS          -0.00018542 | SigmaNeighborDistance              0.00000242
signedNNSV                         -0.00015635 | DocumentLength                     0.00000354
NumTrainingFeaturesDiscarded       -0.00009238 | EffectiveDocumentLength            0.00000425
NB MeanLogOfStrengthGivenPos       -0.00007011 | NumFeaturesDiscarded               0.00001003
UniquePercentRemoved               -0.00006485 | PercentUnique                      0.00001192
PercentWordsPointingToPosDelta     -0.00006022 | WordsSeenInPosDelta                0.00001328
NB MeanShift                       -0.00005192 | UPercentWordsPointingToNegDelta    0.00001446
FeaturesSeenInPosDelta             -0.00004302 | DocumentLengthAfterFS              0.00001699
UniqueAfterFS                      -0.00004212 | SigmaNegativeNeighborDistance      0.00002219
NumUniqueWords                     -0.00002368 | DtreeShiftMeanConfDiff             0.00003575
MeanNeighborDistance               -0.00001981 | FeaturesSeenInNegDelta             0.00003596
%FavoringPosAfterFS                -0.00001544 | PercentOOV                         0.00004415
WordsSeenInNegDelta                -0.00001418 | stdDevGoodSVProximity              0.00005600
U%FavoringPosAfterFS               -0.00001300 | PercentWordsPointingToNegDelta     0.00007022
MeanNegativeNeighborDistance       -0.00001187 | NB StdDevLogOfStrengthGivenNeg     0.00007023
EffectiveUniqueWords               -0.00001042 | UPercentWordsPointingToPosDelta    0.00007758
SVMShiftStdDevConfDiff             -0.00000790 | kNNShiftStdDevConfDiff             0.00008055
%FavoringNegAfterFS                -0.00000621 | U%FavoringPosBeforeFS              0.00008772
NumTrainingWordsDiscarded          -0.00000546 | PercentInPosBeforeFS               0.00008842
meanGoodSVProximity                -0.00000530 | kNNShiftMeanConfDiff               0.00009565
PercentUniqueOOV                   -0.00000423 | UpercentInNegativeBeforeFS         0.00010505
NumWordsDiscarded                  -0.00000116 | PercentRemoved                     0.00011062
U%FavoringNegAfterFS               0           | PercentAgreeWBest                  0.00011245
UnigramStdDeviation                0           | SVMShiftMeanConfDiff               0.00011400
PercentInNegAfterFS                0           | DtreeShiftStdDevConfDiff           0.00011990
UPercentInPosAfterFS               0           | U%FavoringNegBeforeFS              0.00013617
NB StdDeviation                    0           | NeighborhoodRadius                 0.00014646
UniStdDevLogOfStrengthGivenNeg     0           | UPercentInPosBeforeFS              0.00016584
UPercentInNegAfterFS               0           | %FavoringPosBeforeFS               0.00019315
UniMeanLogOfStrengthGivenNeg       0           | NB MeanLogOfStrengthGivenNeg       0.00020146
NB StdDevLogOfStrengthGivenPos     0           | %FavoringNegBeforeFS               0.00020262
UnigramMeanShift                   0           | SigmaPositiveNeighborDistance      0.00040607

Table 7.7: This table shows the average change in logscore across classes caused by deleting each variable individually from the final models in the MSN Web testing set. A negative score indicates that deleting the variable negatively impacts the models, since deleting it reduces the logscore. A score of zero indicates the variable has no impact on the models, while a positive score indicates the variable is included in the models but hurts them on average.
Variable                           Change      | Variable                           Change
PercentPredictingPositive          -0.00145601 | UniqueAfterFS                      0
SigmaPositiveNeighborDistance      -0.00052718 | NumFeaturesDiscarded               0
PercentInPosBeforeFS               -0.00040701 | UPercentInNegAfterFS               0
UPercentInPosBeforeFS              -0.00022327 | UniMeanLogOfStrengthGivenNeg       0
U%FavoringPosBeforeFS              -0.00020924 | PercentInNegativeBeforeFS          0
PercentUnique                      -0.00016828 | NB MeanLogOfStrengthGivenNeg       0
DtreeShiftStdDevConfDiff           -0.00014711 | EffectiveUniqueWords               0
%FavoringPosBeforeFS               -0.00014086 | NB StdDevLogOfStrengthGivenPos     0
U%FavoringNegBeforeFS              -0.00014017 | PercentAgreeWBest                  0
kNNShiftMeanPred                   -0.00012356 | UnigramMeanShift                   0
SVMShiftStdDevConfDiff             -0.00009941 | NumUniqueWords                     0
WordsSeenInPosDelta                -0.00006886 | UniStdDevLogOfStrengthGivenPos     0
%FavoringPosAfterFS                -0.00005240 | UPercentWordsPointingToPosDelta    0
MeanNeighborDistance               -0.00004794 | PercentInPosAfterFS                0
DocumentLengthAfterFS              -0.00002364 | kNNShiftStdDevConfDiff             0
meanGoodSVProximity                -0.00001837 | EffectiveDocumentLength            0
NumWordsDiscarded                  -0.00001776 | UniMeanLogOfStrengthGivenPos       0
U%FavoringPosAfterFS               0           | WordsSeenInNegDelta                0.00001705
U%FavoringNegAfterFS               0           | NumTrainingWordsDiscarded          0.00001747
PercentOOV                         0           | NeighborhoodRadius                 0.00001917
UnigramStdDeviation                0           | NB StdDeviation                    0.00002063
NB MeanShift                       0           | PercentWordsPointingToNegDelta     0.00003801
DocumentLength                     0           | %FavoringNegBeforeFS               0.00004508
PercentInNegAfterFS                0           | SigmaNegativeNeighborDistance      0.00005758
NB StdDevLogOfStrengthGivenNeg     0           | FeaturesSeenInPosDelta             0.00005871
UniquePercentRemoved               0           | PercentUniqueOOV                   0.00005899
UPercentInPosAfterFS               0           | MeanNegativeNeighborDistance       0.00005920
%FavoringNegAfterFS                0           | UpercentInNegativeBeforeFS         0.00006334
stdDevGoodSVProximity              0           | SVMShiftMeanConfDiff               0.00007120
NumTrainingFeaturesDiscarded       0           | kNNShiftMeanConfDiff               0.00007483
SigmaNeighborDistance              0           | MeanPositiveNeighborDistance       0.00008386
UPercentWordsPointingToNegDelta    0           | PercentWordsPointingToPosDelta     0.00009385
UniStdDevLogOfStrengthGivenNeg     0           | NB MeanLogOfStrengthGivenPos       0.00012559
PercentRemoved                     0           | signedNNSV                         0.00018006
FeaturesSeenInNegDelta             0           | DtreeShiftMeanConfDiff             0.00043114

Table 7.8: This table shows the average change in logscore across classes caused by deleting each variable individually from the final models in the Reuters testing set. A negative score indicates that deleting the variable negatively impacts the models, since deleting it reduces the logscore. A score of zero indicates the variable has no impact on the models, while a positive score indicates the variable is included in the models but hurts them on average.
Variable                           Change      | Variable                           Change
%FavoringPosBeforeFS               -0.00059795 | PercentUniqueOOV                   0
SigmaPositiveNeighborDistance      -0.00047683 | UniStdDevLogOfStrengthGivenNeg     0
kNNShiftStdDevConfDiff             -0.00025752 | FeaturesSeenInNegDelta             0
SVMShiftMeanConfDiff               -0.00011884 | UPercentInNegAfterFS               0
NB MeanShift                       -0.00004756 | UniMeanLogOfStrengthGivenNeg       0
SigmaNegativeNeighborDistance      -0.00003388 | %FavoringPosAfterFS                0
MeanNegativeNeighborDistance       -0.00001862 | NeighborhoodRadius                 0
SigmaNeighborDistance              -0.00001632 | PercentAgreeWBest                  0
%FavoringNegBeforeFS               -0.00001630 | UnigramMeanShift                   0
NumFeaturesDiscarded               -0.00001608 | UniStdDevLogOfStrengthGivenPos     0
PercentUnique                      -0.00001241 | PercentInPosAfterFS                0
stdDevGoodSVProximity              -0.00000973 | EffectiveUniqueWords               0.00000010
MeanPositiveNeighborDistance       -0.00000893 | NumTrainingFeaturesDiscarded       0.00000020
PercentInNegativeBeforeFS          -0.00000813 | WordsSeenInNegDelta                0.00000045
PercentWordsPointingToNegDelta     -0.00000577 | NumUniqueWords                     0.00000066
U%FavoringPosBeforeFS              -0.00000517 | PercentRemoved                     0.00000072
DtreeShiftStdDevConfDiff           -0.00000219 | PercentInNegAfterFS                0.00000136
DocumentLengthAfterFS              -0.00000196 | PercentWordsPointingToPosDelta     0.00000145
UniMeanLogOfStrengthGivenPos       -0.00000185 | EffectiveDocumentLength            0.00000179
%FavoringNegAfterFS                -0.00000175 | NB MeanLogOfStrengthGivenNeg       0.00000198
NumWordsDiscarded                  -0.00000155 | NB StdDevLogOfStrengthGivenPos     0.00000233
MeanNeighborDistance               -0.00000126 | kNNShiftMeanConfDiff               0.00000421
FeaturesSeenInPosDelta             -0.00000119 | UPercentInPosBeforeFS              0.00000447
PercentOOV                         -0.00000095 | meanGoodSVProximity                0.00000448
UPercentWordsPointingToPosDelta    -0.00000083 | NumTrainingWordsDiscarded          0.00000449
DocumentLength                     -0.00000076 | UpercentInNegativeBeforeFS         0.00000641
UniqueAfterFS                      -0.00000060 | DtreeShiftMeanConfDiff             0.00000821
UPercentWordsPointingToNegDelta    -0.00000050 | kNNShiftMeanPred                   0.00001059
U%FavoringPosAfterFS               0           | SVMShiftStdDevConfDiff             0.00001137
U%FavoringNegAfterFS               0           | U%FavoringNegBeforeFS              0.00001328
UnigramStdDeviation                0           | WordsSeenInPosDelta                0.00002368
NB StdDevLogOfStrengthGivenNeg     0           | UniquePercentRemoved               0.00002740
NB StdDeviation                    0           | PercentInPosBeforeFS               0.00004043
UPercentInPosAfterFS               0           | PercentPredictingPositive          0.00009150
NB MeanLogOfStrengthGivenPos       0           | signedNNSV                         0.00026558

Table 7.9: This table shows the average change in logscore across classes caused by deleting each variable individually from the final models in the TREC-AP testing set. A negative score indicates that deleting the variable negatively impacts the models, since deleting it reduces the logscore. A score of zero indicates the variable has no impact on the models, while a positive score indicates the variable is included in the models but hurts them on average.
7.4 RCV1-v2
RCV1-v2               MacroF1      MicroF1       Error        C(1,10)     C(10,1)     ROC Area
Dnet                  0.3999       0.6516        0.0185       0.0900      0.0282      0.8983
Unigram               0.4651       0.7119        0.0170       0.0714      0.0267      0.9055
naïve Bayes           0.4255       0.6641        0.0185       0.0857      0.0250      0.9343
SVM                   0.6213 BS    0.8170 BS     0.0109 B     0.0496 BD   0.0194 BD   0.9703 BDSR
kNN                   0.5024       0.7496        0.0143       0.0817      0.0224      0.8828
Best By Class         0.6187 SD    0.8170 S      0.0109 D     0.0499 D    0.0197 D    N/A
Majority              0.5429       0.7910        0.0131       0.0555      0.0219      N/A
Stack-D (norm)        0.6064       0.8135 S      0.0109       0.0491      0.0195      0.9350
Stack-S (norm)        0.5952       0.7354        0.0105 BDSR  0.0525      0.0185 BDS  0.9229
STRIVE-D (norm)       0.5953       0.8128        0.0112       0.0501      0.0195      0.9332
STRIVE-S (norm)       0.6111 R     0.8235 BDSR   0.0106 BDR   0.0485 DR   0.0184 BDR  0.9473 R
BestSelect            0.8193       0.9389        0.0054       0.0138      0.0127      N/A

Table 7.10: Performance on RCV1-v2 Corpus. The best performance (omitting the oracle BestSelect) in each column is given in bold. A notation of 'B', 'D', 'S', or 'R' indicates a method significantly outperforms all (other) Base classifiers, Default combiners, Stacking methods, or Reliability-indicator based Striving methods at the p = 0.05 level. A blackboard (hollow) font is used to indicate significance for the macro-sign test and micro-sign test. A normal font indicates significance for the macro t-test. For the macro-averages (i.e., excluding micro F1), when both tests are significant it is indicated with a bold, italicized font.
As discussed in Section 6.2.5, we performed experiments with the RCV1-v2 corpus after all methods were fully developed, to demonstrate that we had not overtuned these results on the earlier corpora. As discussed earlier, the standard chronological split has 23149 training documents and 781265 testing documents, with 101 topics that have at least one training document.

For all base classifiers except the SVM, the top 1000 features by mutual information were used on a per-class basis, and the same representation and settings were used as in the other experiments. In order to reproduce the settings given for the baseline in [LYRL04], the SVM used a normed tfidf representation with all features that occurred in at least 3 training documents.
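A minimal sketch of the per-class feature selection described above; mutual_info_classif is a stand-in for the thesis's mutual-information computation, and corpus loading is elided.

    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    def select_features_for_topic(X_counts, y_topic, k=1000):
        """Keep the k terms with the highest estimated mutual information
        with this topic's binary label; repeated independently per topic."""
        selector = SelectKBest(mutual_info_classif, k=k)
        X_reduced = selector.fit_transform(X_counts, y_topic)
        return X_reduced, selector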
Table 7.10 presents the summary for the primary methods discussed in this chapter. The results for STRIVE-S (norm) are generally consistent with the results presented earlier. The only notable exception is that neither STRIVE-S (norm) nor any other combination method is able to outperform the base SVM on MacroF1. However, according to every other performance measure except ROC area, STRIVE-S (norm) outperforms all the base classifiers — usually with statistical significance. In fact, on those same measures STRIVE-S (norm) outperforms all other methods except for Stack-S (norm) on error. The mean difference in error from Stack-S (norm) is not statistically significant according to the macro t-test; however, as demonstrated by the macro sign test, STRIVE-S (norm) is outperformed on a statistically significant number of the classes.
A closer examination of the results for Stack-S (norm) shows that, while it slightly outperforms STRIVE-S (norm) on error, Stack-S (norm) decreases performance on MacroF1 and greatly decreases performance on MicroF1. This appears to be largely because of several classes where Stack-S (norm) greatly decreases F1 performance. Apparently, the linear combination of classifier outputs is much more sensitive to the particular threshold than the corresponding STRIVE model. While there is a chance that this could be due to topic drift occurring after the chronological split, STRIVE and the metaclassifiers using decision trees appear able to mitigate such effects.
Figure 7.8 presents a by-class analysis of the change in F1 and error for STRIVE-S (norm) and Stack-S (norm). The lines shown are the least-squares fit where each topic receives the same weight. These figures clearly show how the stacking method severely hurts performance on several classes and drags down F1 performance, while the striving method has little impact on macroF1 even on a per-class basis. In terms of error, we see that both methods perform worse over smaller topics but gain enough over larger topics to create a net positive effect. Striving seems to have fewer negative effects on a per-class basis but also requires slightly more positive examples to consistently boost baseline performance. Because of the smaller number of topics in the other corpora, it is difficult to tell whether these trends exist on a per-topic basis there as well.
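For concreteness, the clipped log-ratio used on the axes of Figure 7.8 can be computed as below; the treatment of a zero numerator or denominator follows the figure caption, while handling both being zero is our own choice.

    import math

    def clipped_log2_ratio(method_value, baseline_value):
        """log2(method/baseline); a zero numerator maps to -10 and a zero
        denominator to +10, per the Figure 7.8 convention."""
        if method_value == 0:
            return -10.0
        if baseline_value == 0:
            return 10.0
        return math.log2(method_value / baseline_value)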
To give a sense of the improvement when using the thresholded predictions for microF1, STRIVE-S (norm) commits 501400 false negative predictions versus the SVM's 513511 false negatives, and 369291 false positive predictions versus the SVM's 391038. In total, STRIVE-S (norm) commits 12111 fewer false negatives, 21747 fewer false positives, and 33858 fewer errors in prediction than the SVM. To put this in context, Figure 7.9 again displays the changes in performance relative to the SVM classifier.
Figure 7.8: Each point presents the performance for a single class in the RCV1-v2 corpus, plotted against log2 of the number of positive training examples. Improvement in F1 over the baseline SVM is shown on the left, while improvement in error is shown on the right. As is typical, both axes are given in the log domain; in case of a zero denominator or numerator, the log-ratio is defined as 10 or -10, respectively. On the left, we see that Stack-S (norm) severely decreases the F1 performance on several classes. On the right, we see that (when performance differs from the baseline) both methods show a larger increase in performance according to error over the baseline as the class becomes more prevalent. Striving appears to require slightly more positive examples than stacking, which is expected given the higher dimensionality. The regression fits shown are fit only to the classes where the metaclassifier's performance differs from the baseline.

Figure 7.9: For Stack-S (norm) and STRIVE-S (norm), change relative to the best base classifier — the SVM classifier — over all the topic classification corpora (MSN Web, Reuters, TREC-AP, and RCV1-v2). On the left, we show the relative change using thresholds optimized for F1, and on the right, the relative change using thresholds optimized for error. In both figures, we display the changes in the three components that determine F1: true positives, false positives, and false negatives. Not only does STRIVE-S (norm) achieve considerable reductions in error of 4-18% (left) and 3-16% (right), but in all but one case, it also increases by a fair margin the improvement attained by Stack-S (norm). Furthermore, STRIVE-S (norm) never hurts performance relative to the SVM on these performance measures, as Stack-S (norm) does over RCV1-v2 on the far left.

In summary, key lessons from the RCV1-v2 corpus are that, while striving may not always increase F1 performance on a per-class basis, the net reduction in the total number of errors produces both a significant increase in microF1 performance and a significant reduction in overall error. Also, we note that, in comparison to the other corpora, the SVM dominates the base classifiers more here. By optimizing k using cross-validation, we expect the performance of kNN to increase to be closer to the SVM and thus increase the potential for classifier combination. Additionally, both striving and stacking seem to require between 64 and 256 (2^6 to 2^8) positive examples to begin to have a net effect over baseline performance in error and are thus more effective for larger classes than smaller ones. In all, STRIVE-S (norm) continues to be the superior choice, having less negative impact on macroF1 than stacking and increasing performance over the baseline according to microF1 and all of the linear utility performance measures.
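A minimal sketch of the cross-validated selection of k for kNN suggested above; the candidate grid and scoring choice are illustrative, not the thesis's configuration.

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    def tune_knn(X_train, y_train, candidate_ks=(1, 5, 15, 45, 85)):
        search = GridSearchCV(KNeighborsClassifier(),
                              {"n_neighbors": list(candidate_ks)},
                              scoring="f1", cv=5)
        search.fit(X_train, y_train)
        return search.best_params_["n_neighbors"]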
7.5 Summary and Conclusions
In this chapter, we reviewed a methodology for building a metaclassifier for text documents that centers on combining multiple distinct classifiers with probabilistic learning and inference that leverages reliability-indicator variables. Reliability indicators provide information about the context-sensitive nature of classifier reliability, informing a metaclassifier about the best way to integrate the outputs from base-level classifiers. We introduced the STRIVE methodology, which uses reliability indicators in a hierarchical combination model, and reviewed comparative studies of STRIVE and other combination mechanisms.

We conducted experimental evaluations over four text-classification corpora (MSN Web, Reuters 21578, TREC-AP, and RCV1-v2) with a variety of performance measures. These measures were selected to determine the robustness of the classification procedures under different misclassification penalties. The empirical evaluations support the conclusion that a simple majority vote can weaken the best classifier's performance in situations where one of the classifiers performs strongly. In contrast, in all of these corpora across all measures, the STRIVE methodology was competitive. STRIVE using an SVM metaclassifier produced the top performer in nearly every category except ROC area and, outside of that, was never beaten with statistical significance. Furthermore, on a class-by-class basis, the STRIVE methodology using a meta-decision tree produced receiver-operating-characteristic curves that dominated the other classifiers in nearly every class of the MSN Web corpus, which demonstrates that it provides the best choice for any possible linear utility function in this corpus.

In conclusion, the experiments show that stacking and STRIVE provide robust combination schemes across a variety of performance measures. In addition, the experiments support the central claim of this thesis: context-dependent combination procedures (STRIVE-S (norm)) provide an effective way of combining classifiers that is generally superior to constant-weighted linear combinations of the classifiers' estimates of the posterior or log-odds (Stack-S (norm)). In the remaining chapters, we turn to how we can alleviate the need for training data and how to apply combination outside of topic classification.
MSN Web                  MacroF1  MicroF1  Error   C(1,10)  C(10,1)  ROC Area  ROC[0,0.1]
Dnet                     0.5477   0.5813   0.0584  0.3012   0.0772   0.8802    0.5638
Unigram                  0.5982   0.6116   0.0594  0.2589   0.0812   0.9003    0.6114
naïve Bayes              0.5527   0.5619   0.0649  0.2853   0.0798   0.8915    0.5516
SVM                      0.6727   0.7016   0.0455  0.2250   0.0794   0.9123    0.6960
kNN                      0.6480   0.6866   0.0464  0.2524   0.0733   0.8873    0.6541
Best By Class            0.6727   0.7016   0.0452  0.2235   0.0729   N/A       N/A
Majority                 0.6643   0.6902   0.0479  0.2133   0.0765   N/A       N/A
Stack-D                  0.6924   0.7233   0.0423  0.1950   0.0708   0.9361    0.7356
Stack-S                  0.6801   0.7118   0.0438  0.2076   0.0701   0.9286    0.7196
Stack-D (norm)           0.6924   0.7233   0.0423  0.1950   0.0708   0.9361    0.7356
Stack-S (norm)           0.6939   0.7250   0.0423  0.1971   0.0705   0.9334    0.7349
STRIVE-D                 0.6975   0.7304   0.0413  0.1863   0.0703   0.9459    0.7460
STRIVE-S                 0.6527   0.6767   0.0498  0.2251   0.0800   0.9159    0.6819
STRIVE-D (norm)          0.6988   0.7327   0.0413  0.1846   0.0697   0.9454    0.7457
STRIVE-S (norm)          0.7173   0.7437   0.0392  0.1835   0.0682   0.9260    0.7547
STRIVE-S Local (norm)    0.7251   0.7530   0.0386  0.1810   0.0656   0.9150    0.7600
STRIVE-S LSelect (norm)  0.7197   0.7496   0.0388  0.1860   0.0664   0.9097    0.7529
BestSelect               0.8719   0.8924   0.0223  0.0642   0.0565   N/A       N/A

Table 7.11: All results for the MSN Web Corpus discussed in this chapter. Ignoring BestSelect, the overall best in each column is shown in red bold and the overall worst is shown in blue italics.
Reuters                  MacroF1  MicroF1  Error   C(1,10)  C(10,1)  ROC Area  ROC[0,0.1]
Dnet                     0.7846   0.8541   0.0242  0.0799   0.0537   0.9804    0.8844
Unigram                  0.7645   0.8674   0.0234  0.0713   0.0476   0.9877    0.9086
naïve Bayes              0.6574   0.7908   0.0320  0.1423   0.0527   0.9703    0.7841
SVM                      0.8545   0.9122   0.0145  0.0499   0.0389   0.9893    0.9429
kNN                      0.8097   0.8963   0.0170  0.0737   0.0336   0.9803    0.9043
Best By Class            0.8608   0.9149   0.0144  0.0496   0.0342   N/A       N/A
Majority                 0.8498   0.9102   0.0155  0.0438   0.0437   N/A       N/A
Stack-D                  0.8680   0.9197   0.0136  0.0410   0.0366   0.9912    0.9453
Stack-S                  0.8611   0.9174   0.0141  0.0398   0.0362   0.9952    0.9576
Stack-D (norm)           0.8680   0.9197   0.0136  0.0410   0.0366   0.9912    0.9453
Stack-S (norm)           0.8908   0.9307   0.0125  0.0372   0.0331   0.9956    0.9628
STRIVE-D                 0.8551   0.9162   0.0141  0.0461   0.0364   0.9913    0.9509
STRIVE-S                 0.8289   0.9061   0.0161  0.0515   0.0420   0.9908    0.9332
STRIVE-D (norm)          0.8555   0.9172   0.0144  0.0488   0.0364   0.9913    0.9507
STRIVE-S (norm)          0.8835   0.9287   0.0121  0.0352   0.0343   0.9948    0.9616
STRIVE-S Local (norm)    0.8751   0.9261   0.0125  0.0382   0.0344   0.9890    0.9534
STRIVE-S LSelect (norm)  0.8825   0.9280   0.0128  0.0389   0.0341   0.9921    0.9553
BestSelect               0.9611   0.9789   0.0036  0.0073   0.0173   N/A       N/A

Table 7.12: All results for the Reuters Corpus discussed in this chapter. Ignoring BestSelect, the overall best in each column is shown in red bold and the overall worst is shown in blue italics.
TREC-AP                  MacroF1  MicroF1  Error   C(1,10)  C(10,1)  ROC Area  ROC[0,0.1]
Dnet                     0.6007   0.5706   0.0064  0.0346   0.0081   0.9767    0.9086
Unigram                  0.6001   0.5695   0.0064  0.0347   0.0079   0.9819    0.9034
naïve Bayes              0.5676   0.5349   0.0065  0.0455   0.0078   0.9755    0.8454
SVM                      0.7361   0.6926   0.0049  0.0282   0.0077   0.9715    0.9143
kNN                      0.6793   0.6533   0.0053  0.0371   0.0074   0.9238    0.8088
Best By Class            0.7356   0.6925   0.0049  0.0302   0.0073   N/A       N/A
Majority                 0.7031   0.6534   0.0056  0.0307   0.0075   N/A       N/A
Stack-D                  0.7331   0.7007   0.0050  0.0251   0.0073   0.9886    0.9565
Stack-S                  0.7213   0.6796   0.0051  0.0280   0.0073   0.9834    0.9304
Stack-D (norm)           0.7331   0.7007   0.0050  0.0251   0.0073   0.9886    0.9565
Stack-S (norm)           0.7486   0.7011   0.0048  0.0263   0.0072   0.9834    0.9411
STRIVE-D                 0.7283   0.6958   0.0051  0.0267   0.0073   0.9870    0.9512
STRIVE-S                 0.7075   0.6718   0.0052  0.0310   0.0073   0.9742    0.8959
STRIVE-D (norm)          0.7246   0.6991   0.0051  0.0268   0.0073   0.9870    0.9515
STRIVE-S (norm)          0.7532   0.7148   0.0047  0.0277   0.0071   0.9771    0.9239
STRIVE-S Local (norm)    0.7439   0.7084   0.0048  0.0231   0.0071   0.9725    0.9285
STRIVE-S LSelect (norm)  0.7507   0.7104   0.0047  0.0273   0.0071   0.9731    0.9150
BestSelect               0.8986   0.8356   0.0031  0.0133   0.0058   N/A       N/A

Table 7.13: All results for the TREC-AP Corpus discussed in this chapter. Ignoring BestSelect, the overall best in each column is shown in red bold and the overall worst is shown in blue italics.
RCV1-v2                  MacroF1  MicroF1  Error   C(1,10)  C(10,1)  ROC Area  ROC[0,0.1]
Dnet                     0.3999   0.6516   0.0185  0.0900   0.0282   0.8983    0.6691
Unigram                  0.4651   0.7119   0.0170  0.0714   0.0267   0.9055    0.6880
naïve Bayes              0.4255   0.6641   0.0185  0.0857   0.0250   0.9343    0.7094
SVM                      0.6213   0.8170   0.0109  0.0496   0.0194   0.9703    0.8538
kNN                      0.5024   0.7496   0.0143  0.0817   0.0224   0.8828    0.6154
Best By Class            0.6187   0.8170   0.0109  0.0499   0.0197   N/A       N/A
Majority                 0.5429   0.7910   0.0131  0.0555   0.0219   N/A       N/A
Stack-D (norm)           0.6064   0.8135   0.0109  0.0491   0.0195   0.9350    0.8078
Stack-S (norm)           0.5952   0.7354   0.0105  0.0525   0.0185   0.9229    0.7976
STRIVE-D (norm)          0.5953   0.8128   0.0112  0.0501   0.0195   0.9332    0.8033
STRIVE-S (norm)          0.6111   0.8235   0.0106  0.0485   0.0184   0.9473    0.8131
BestSelect               0.8193   0.9389   0.0054  0.0138   0.0127   N/A       N/A

Table 7.14: All results for the RCV1-v2 Corpus discussed in this chapter. Ignoring BestSelect, the overall best in each column is shown in red bold and the overall worst is shown in blue italics.
Chapter 8
Inductive Transfer for Classifier Combination
From Chapter 7, we have seen that STRIVE can use a set of reliability indicators in conjunction with the outputs of various heterogeneous classifiers to produce a combination model that consistently outperforms a linear combination of classifier outputs. However, we also saw in Section 7.4 that, while the overall combination is successful, both STRIVE and stacking seem to be less successful, and sometimes hurt performance versus the best base classifier, on classes with few positive examples.

In this chapter, we turn our attention to the problem of scarce data. We introduce a generalization of STRIVE, called LABEL (Layered Abstraction-Based Ensemble Learning), that shows how data from one dataset can be used together with data from other datasets to build an inductive model of classifier combination that transfers across all the datasets and improves performance jointly across the tasks. In addition to the general framework, we demonstrate empirically how this approach can be used in conjunction with a decision tree to increase the average performance of STRIVE-D (norm). Finally, we summarize interesting directions for future work in the same vein.
8.1 Introduction
Given the typical scarcity of labeled data for building predictive models, the Machine
Learning community has pursued methods which make use of information sources beyond the labeled data associated with a pure supervised-learning framework. An example
of research in this arena is multitask learning [Car97]. In multitask learning, additional
information for building models comes in the form of labels for related functions which
can be learned over the same input. Although such additional labels are typically unavailable at prediction time, results have demonstrated that generalization performance can be
improved on the primary task by learning to predict the new variables in addition to the
output variable of interest.
We are interested in improving the performance of predictive models for cases where
we have inadequate amounts of labeled training data. In contrast to multitask learning, we
seek to leverage labeled data from related problems over different examples to enhance
the final model used in prediction. Problems related to this challenge have been termed
classifier re-use [BG98] or knowledge transfer [CK97]. We introduce a new approach
to the challenge that hinges on mapping the original feature space, targeted at predicting
membership in a specific topic, to a new feature space aimed at modeling the reliability of
an ensemble of text classifiers.
The approach, which we call Layered Abstraction-Based Ensemble Learning (LABEL),
has two subcomponents. First, a set of classifiers is trained on each task according to
the standard supervised learning framework; a problem or task consists of determining
binary membership in a specific topic. Then, we build a context-sensitive ensemble model
using these classifier outputs and a set of reliability indicators (see Chapter 7) that provide
an abstraction of discriminatory context appropriate for modeling classifier reliability. We
thus abstract away the problem of predicting specific class membership to that of predicting
the reliability of a set of classifiers for a given class. As a result, both the input features
and their relationship to the class variable are the same at the metalevel; this enables the
simultaneous use of all the data as a model bias across the entire set of tasks.
For a review of related work, the reader should consult Section 2.2.1. First, we describe
in detail how the LABEL methodology generalizes the STRIVE model by providing a means
for using data across tasks. Then, we present an empirical analysis of this methodology
applied to text classification and summarize the strengths and weaknesses of the approach.
Finally, we discuss promising paths for future work.
8.2 Applying Inductive Transfer to Combination
In contrast to prior efforts, we introduce a representation that is semantically coherent across tasks. Such semantic coherence facilitates the use of standard methods of inductive transfer.
We note that the experiments in this chapter use 49 reliability indicators, a subset of
the variables used in Chapter 7. See Bennett, Dumais, & Horvitz [BDH02, BDH05] for
additional discussion of these reliability-indicators.
8.2.1 STRIVE
As discussed more extensively in Chapter 7, the STRIVE methodology transforms the original learning problem into a new learning problem. In the initial problem, the base classifiers simply predict the class from a word-based representation of the document. More generally, in the original problem, each base classifier outputs a distribution (possibly unnormalized) over class labels. STRIVE adds another layer of learning to the base problem. Reliability-indicator functions consider the words in the document and the classifier outputs to generate the reliability-indicator values, r_i, for a particular document. This approach yields a new representation of the document that consists of the values of the reliability indicators, as well as the outputs of the base classifiers. The metaclassifier exploits this new representation for learning and classification. This enables the metaclassifier to employ a model that uses the output of the base classifiers as well as the context established by the reliability indicators to make a final classification.
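A schematic sketch of the metalevel representation just described; the classifier and indicator interfaces are hypothetical stand-ins for the thesis's implementation.

    import numpy as np

    def metalevel_representation(doc, base_classifiers, indicator_fns):
        """New document representation for the metaclassifier: the base
        classifiers' outputs (scores/log-odds) plus the reliability-indicator
        values r_i computed from the document and those outputs."""
        outputs = [clf.score(doc) for clf in base_classifiers]
        indicators = [r(doc, outputs) for r in indicator_fns]
        return np.array(outputs + indicators, dtype=float)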
8.2.2 LABEL: Layered Abstraction-Based Ensemble Learning
Intuitively, regardless of the particular topic or source (e.g., news feed, web page, etc.),
topic discrimination tasks share some common structure. For example, longer documents
tend to provide more information for identifying topics. Furthermore, documents containing words strongly correlated with a single topic are more likely to belong to that topic than
documents containing words strongly correlated with several topics. Additionally, these
conditions may interact with each other based on their particular values. Researchers in the
field may often make similar observations after studying multiple classification problems.
We seek to design a system capable of both inducing such generalizations automatically
and applying them to improve the predictive performance of models.
A training corpus in text classification consists of a set of example documents labeled
with each of their proper topics from a prespecified corpus-specific topic list (a document
may have more than one topic). When the same representation is used for each of the binary discrimination tasks in a corpus, standard multitask learning can be used to perform
classification for all of the topics in the corpus’ topic list. However, standard multitask
learning cannot leverage information across corpora since it would typically require knowing whether a document belongs to each of the topics from all of the corpora (where we
only have in-corpus information). Additionally, the basic feature space is quite different in
documents from different corpora as particular language usage varies widely. Therefore,
we desire a standard representation that has the same semantics across separate tasks from
both the same and different corpora.
Although STRIVE uses data from each task separately to build a metaclassifier for that specific task, it is straightforward to extend it to make use of labeled data across tasks. The key point is that the reliability indicators we chose carefully abstract away from a document’s task-specific statistical regularities of word usage while maintaining the discriminatory relationship of the document’s context to the task. For example, documents that come from a general topic corpus where we are trying to distinguish Health & Fitness from not Health & Fitness tend to behave very differently at the level of word-usage distributions than documents from a narrow financial corpus where we are trying to distinguish Corporate Acquisitions from not Corporate Acquisitions. However, in terms of the abstraction that the reliability indicator UnigramStdDeviation provides, we expect a unigram classifier to show poor reliability for a particular document from either task when UnigramStdDeviation is high.
With this approach, we treat the metaclassifier as an abstraction, moving the focus
of the analysis from discriminating a specific topic (e.g., Corporate Acquisitions vs. not
Corporate Acquisitions) to the problem of discriminating topic membership (i.e., In-Topic
vs. Out-of-Topic). The base-level classifiers trained on a particular topic are used as the
representation of topic-specific knowledge, while the metaclassifier provides information
about how to leverage context across topic-classification in general.
Therefore, LABEL, like STRIVE, constructs models with the same type of combination rules as those shown in Figure 7.2. The differences from STRIVE lie in the model-construction procedure. After generating the metalevel data, the metafeatures are normalized to have zero mean and a reduced standard deviation scale within their particular task. (This is not necessary for STRIVE-D, but for LABEL it helps to deal with spurious statistical variance that arises from the tasks having different numbers of training examples; also see Footnote 3 on Page 123 regarding feature normalization.) At this point, STRIVE would use the data from each task to separately build a metaclassifier for each task. LABEL departs from this by pooling all of the data together and building a single metaclassifier (with the class variable taking the value 1 if the document is In-Topic for the particular task and −1 otherwise).
We now give a more formal definition of the problem. For our purposes, a task is the approximation of a single binary function, f_i(X_i) ∈ {−1, 1}. The input domain of each of these tasks may differ; thus, X_i denotes an input example from the ith task’s domain. The labeled data for each task, L_i, consists of a set of tuples ⟨x_{i,j}, f_i(x_{i,j})⟩ (where j = 1, …, |L_i|). Given N tasks and a performance measure perf, we would like an inductive learning procedure, Train(i, L_1, …, L_N), that produces a model to generate predictions for the ith task. Furthermore, we desire that our performance using all the data exceed the performance using the data for each task separately:

    Σ_{i=1}^N E_{P_i}[perf(Train(i, L_1, …, L_N))] > Σ_{i=1}^N E_{P_i}[perf(Train(i, L_i))]

where P_i is the probability distribution on the ith task. To be of practical use, the performance achieved using only labeled data from the task, E_{P_i}[perf(Train(i, L_i))], should be competitive with the best base classifiers for this task; otherwise the solution is trivial (simply ensure the models using only labeled data from the task perform as poorly as possible).
Before applying the resulting model for prediction, it is desirable to specialize this single metaclassifier for each task in two ways. First, each task may have different priors;
so these priors should be taken into account at prediction time. This can be directly accomplished by obtaining probability predictions from the metaclassifier or simply setting
a different threshold for each classification task. Secondly, tasks may diverge from the average case in different ways. Thus, we may want to only retain part of the general model.
The best way to address this question depends, in part, upon the choice of classification
algorithm for the metaclassifier. We discuss our particular choices in Section 8.3.2 below.
8.3 Experimental Analysis
We performed an empirical analysis over standard text classification corpora to explore the effectiveness of LABEL. We also performed ablation experiments to elucidate how LABEL achieves an improvement in generalization performance. Each of the classification models uses a decision threshold specific to each task; the threshold for each model and task was empirically determined over the training data.
8.3.1 Base Classifiers
We selected for our experiments four classifiers that have been used traditionally for text
classification: decision trees, linear SVMs, naı̈ve Bayes, and a unigram classifier. Note that
this is a subset of the classifiers used in Chapter 7. In particular, we did not make use of the
kNN classifier in these experiments.
For each base classifier, the settings and implementations are the same as discussed in Section 6.3. The exception is that, as an implementation of linear SVMs, we used the Smox toolkit, which is based on Platt’s Sequential Minimal Optimization algorithm [Pla98]. Since Smox is the best base classifier in the experiments below, it is the only base classifier we report in summarizing our experimental results.
8.3.2 Metaclassifiers
As mentioned above, the inputs to the metaclassifiers are normalized to zero mean and a reduced standard deviation scale, as estimated during the training phase (see Footnote 3 on Page 123 regarding feature normalization). The experiments reported here use a total of 49 reliability indicators, which were formulated by hand in an attempt to represent potentially valuable contexts (additional detail can be found in [BDH02, BDH05]).
For these experiments, we used only a decision-tree algorithm (using Dnet) as a metaclassifier. For this reason, we refer to the primary metaclassifier implementations below as
STRIVE-D and LABEL-D. Since both systems use normalized inputs, we drop the “(norm)”
at the end of the system names used in Chapter 7. We note that by comparing these two
systems directly, we see the effects of separately building a metaclassifier per task versus
building them in conjunction.
Here, we introduce one way to specialize the single metaclassifier model learned by LABEL-D for decision trees. Given a single metaclassifier decision-tree model, instead of using the prediction at each leaf node as the aggregate distribution across tasks of In-Topic vs. Out-of-Topic, when predicting for a particular task i we use the estimate:

    P(In-Topic_i | leaf = l) = (In-Topic_{i,l} + m p_l) / (m + In-Topic_{i,l} + Out-of-Topic_{i,l})    (8.1)

For the particular binary classification task i, In-Topic_{i,l} and Out-of-Topic_{i,l} are, respectively, the numbers of in-topic and out-of-topic training examples that fall in leaf node l. p_l is the prior of In-Topic at the leaf node obtained from using all of the data across tasks. m is the effective sample size, which determines how much evidential weight, measured in “number of observed datapoints,” the prior carries.
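To make Equation 8.1 concrete, the following is a minimal sketch (our illustrative naming, not the dissertation’s implementation) of the task-specialized leaf estimate:

def leaf_in_topic_prob(in_topic_i, out_of_topic_i, p_l, m):
    # in_topic_i, out_of_topic_i: task i's training counts falling in leaf l
    # p_l: In-Topic prior at leaf l, estimated from all tasks pooled
    # m:   effective sample size, the evidential weight of the cross-task prior
    return (in_topic_i + m * p_l) / (m + in_topic_i + out_of_topic_i)

# m = 0 keeps only the task-specific counts (leaves with no task data then
# need a fallback, such as the task's overall prior), while as m grows the
# estimate approaches the pooled cross-task prior p_l.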
We sampled the two extremes of this spectrum, m = 0 and m = ∞. By choosing
m = 0, we specialize the LABEL model to a particular task by placing all of the weight
on the task-specific data. This allows some leaves to effectively have no data in them; for
those leaves, we use the overall prior of in-topic according to the task-specific data. We
refer to this system as LABEL-D (repop) since this acts as if we completely repopulated the
decision tree with task-specific data.
We also present the results obtained by making the prediction at a leaf node using all
of the data across tasks equally (i.e., m = ∞, in which case the right side of Equation 8.1 simply becomes p_l). We refer to this as LABEL-D (general) since the metaclassifier is not specialized for each task beyond the decision threshold. Comparing these specific instantiations allows us to determine whether we are simply coincidentally finding a better tree structure using all the data or whether the actual predictions based on all the data aid us as well.
8.3.3 Data
For the experiments, we again use the MSN Web Directory (13 classes), the Reuters 21578
corpus (10 classes), and the TREC-AP corpus (20 classes). For a detailed description of
the corpora, see Section 6.2.
8.3.4 Performance Measures
To compare the performance of the classification methods, we look at a set of standard accuracy measures. The F1 measure [vR79, YL99] is the harmonic mean of precision and recall, where Precision = Correct Positives / Predicted Positives and Recall = Correct Positives / Actual Positives. Additionally, we report error, emphasizing the normalized error score. Normalized error divides the error in each task by the error that would have been achieved by random guessing (i.e., always predicting the a priori prevalent class). A normalized error less than one indicates the method outperforms this baseline. The scores reported here are the arithmetic averages of the values across all tasks (for F1 this is termed macroF1 in the text classification literature).
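As a concrete illustration of the normalized error score, the following short sketch (our variable names) computes it for one task:

def normalized_error(errors, n, n_positive):
    # errors: misclassified test documents; n: total test documents;
    # n_positive: in-topic test documents for this task
    baseline = min(n_positive, n - n_positive) / n  # error of always predicting the prevalent class
    return (errors / n) / baseline

# e.g., 30 errors on 1000 documents with 50 positives: the baseline error is
# 0.05, so normalized error = 0.03 / 0.05 = 0.6, which beats the baseline.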
8.4 Experimental Results
Table 8.1 summarizes the performance of the systems over all 43 classification tasks. Better performance is indicated by larger F1 and by smaller error or normalized error values. The best performance in each column is given in bold. To determine statistical significance for the macro-averaged measures, a one-sided macro sign test was performed [YL99]. When comparing system A and system B, the null hypothesis is that system A performs better on approximately half the tasks for which they differ in performance. The results for LABEL-D (general) are significantly better than those of the other systems at the p = 0.01 level, with the exception of the difference between the error metrics of LABEL-D (general) and STRIVE-D, which is significant at the p = 0.05 level.
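For readers who want the test spelled out, here is a hedged sketch (our naming) of the one-sided sign test, which computes the probability under the null of seeing at least the observed number of wins:

from math import comb

def one_sided_sign_test(wins, differing):
    # wins: tasks where system A beats system B;
    # differing: tasks where the two systems differ in performance.
    # Under the null, each differing task is a fair coin flip.
    return sum(comb(differing, k) for k in range(wins, differing + 1)) / 2 ** differing

# e.g., winning 33 of 43 differing tasks gives a p-value well below 0.01.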
8.5 Summary of Basic LABEL Approach
First, we note that the base classifiers are competitive; in particular, the results for Smox are consistent with the best reported results over these corpora [DC00, DPHS98, Joa98]. Thus, we are challenged with an extremely competitive baseline.
Method               Macro F1   Error    norm Error
Smox                 0.7411     0.0197   0.4789
STRIVE-D             0.7457     0.0191   0.4716
LABEL-D (repop)      0.7431     0.0188   0.4758
LABEL-D (general)    0.7545     0.0181   0.4512

Table 8.1: Inductive Transfer Performance Summary over all Tasks
In spite of this, LABEL-D (general) shows dominance on each of the performance measures. Additionally, comparing it directly to the most comparable version of STRIVE-D reported in Bennett, Dumais, & Horvitz [BDH02, BDH05], we see improvement over the same system using data from each task in isolation. Furthermore, by comparing LABEL-D (general) to LABEL-D (repop), we see that it is not simply the structure of the resulting decision trees, but the predicted probabilities induced across the entire set of tasks, that is key to improving generalization performance. While the percentage improvement is small, we believe these results are very encouraging for the future use of inductive transfer to improve models of classifier reliability.
Similar results are observed for each corpus individually but are more pronounced in the Reuters corpus (which has the classes with the fewest positive examples) than in the MSN Web or TREC-AP corpora.
8.6 Future Work
There are several interesting avenues for future work. First, one could conduct experiments that explore parametric variations of m to control how much weight task-specific data is given versus data from across all tasks. Secondly, while we have only empirically investigated the decision-tree metaclassifier, one could pursue other classifiers as the metaclassifier. For example, using an SVM, as done for STRIVE, would be of considerable interest. The analogue of m for an SVM would be the c parameter, which controls the amount of regularization for a problem. In fact, we hypothesize not only that c would play an important role in specializing the metaclassifier for a task, but that setting c while building the general model will be crucial because of how c acts as a regularizer. For implementations like SVMLight that have methods for automatically setting c given a set of data, a rule for setting c for the general model can be determined in terms of the c that would be used for each individual dataset (e.g., the sum of the auto-determined c’s for the individual problems).
While we have demonstrated here that data can be used in conjunction across all tasks
to improve average performance, another interesting question is whether data can be used
in isolation to improve performance on a separate class. That is, if we left one corpus out
and trained a general model on the remaining corpora to test on the left-out corpus, how
effectively would that model transfer compared to the one trained in conjunction?
Finally, a potentially promising area to pursue is the inclusion of task-dependent reliability-indicator variables while building the general model. We discuss a few examples in Section 5.3, including NumTrainingPoints, %TrainingPointsIn{Positive}, and NumberOfSupportVectors. In addition, we can include task identifiers with each example. Together, these variables allow the general model both to model a task separately when doing so improves model fit and to model behavior that varies across tasks, such as how likely a given base classifier is to perform well given a certain number of positive training examples.
8.7 Summary
In this chapter, we demonstrated how Layered Abstraction-Based Ensemble Learning can be used to reduce the demand for training data by using data in conjunction across tasks. Furthermore, we conducted a small empirical analysis to demonstrate the validity of the approach. In particular, we demonstrated how we could improve upon a STRIVE model using a decision-tree metaclassifier by using all the data across the tasks. Finally, this area contains many interesting directions for future work, and we discussed some of the more promising avenues.
Chapter 9
Online Methods and Regret
Researchers have been able to prove a variety of performance guarantees for online prediction methods. While our primary concern is offline or batch prediction, a natural question
is whether these online methods can also achieve good empirical performance on batch
prediction with little modification.
This chapter presents a brief overview of key existing theoretical results for combining
classifiers in an online framework while maintaining (regret) performance guarantees. In
this chapter, we evaluate the empirical performance of several of these online algorithms in
a batch setting and consider whether they are an attractive alternative to the metaclassifiers
we have already discussed. Finally, in light of our empirical results, we also analyze what
types of regret guarantees are most desirable to yield a combination that performs well in
practice.
9.1 Online Learning
In the batch setting, there is a training phase over a set of labeled training examples and then a testing phase in which the learner receives no further feedback about the labels of the examples. In contrast, an online learner receives constant feedback after every example. This feedback can either be full information, giving the cost of all alternatives including those not actually explored, or partial information, specifying only the loss of the action the learner took.
For classification where a loss function is specified, giving the learner the true class label is an example of the full-information model. In a multiclass problem or for example-specific cost-sensitive learning, specifying the loss of the predicted class but not what the loss would have been for predicting the other class labels is an example of the partial-information model. We will assume we are operating in the full-information model with a specified loss function and that our feedback comes as the true class label. Thus, at each timepoint, an example is presented to the learner for prediction; after the learner receives the true label of the example, it incurs some loss and updates its model.
More formally, for each timestep t = 1, …, T, the learning algorithm A receives an example x_t. Then, the learner makes a prediction, ŷ_{A,t}. After predicting, the learner is notified of the correct class y_t and suffers loss L(ŷ_{A,t}, y_t) for some specified loss function L. When the particular learning algorithm is clear in context, we will use ŷ_t instead of ŷ_{A,t}. The cumulative loss the algorithm suffers is Σ_{t=1}^T L(ŷ_t, y_t), and we will be concerned with particular algorithms that can bound the cumulative loss relative to some other algorithm. The simplicity of the online learning framework has allowed results to be derived both in simple cases where the sequence of examples is assumed to be i.i.d. from a fixed distribution and when the examples are chosen from a non-fixed distribution by an adversary. Given the strength of these results, it is tempting to see how well they fare in the context of our problem.
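A minimal sketch of this protocol, assuming a learner object with predict and update methods (our naming), makes the prediction-then-feedback ordering explicit:

def run_online(learner, stream, loss):
    # Full-information protocol: predict, then observe the true label,
    # suffer loss, and update. Returns the cumulative loss.
    cumulative = 0.0
    for x_t, y_t in stream:
        y_hat = learner.predict(x_t)   # commit to a prediction first
        cumulative += loss(y_hat, y_t)
        learner.update(x_t, y_t)       # feedback: the true class label
    return cumulative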
9.2 Regret and Combining Classifiers
In addition to the focus of empirically based researchers, the problem of combining expert advice or predictions has long drawn the attention of the theoretical side of the machine learning community. Quite often, combination algorithms can be theoretically justified with loss bounds and, furthermore, include in their design indicator variables that signal a particular expert is abstaining or lacking confidence in some way. However, with few exceptions [FMS04], they seldom address when a classifier should abstain in practice. A simple lack of confidence is not sufficient since that classifier may still provide more information than any other classifier. Instead, it is preferable for the metaclassifier or decision maker to determine on an example-by-example basis how strongly each base classifier will contribute to the final decision. In this view, the decision maker simply treats each indicator variable as a possibly noisy hint to use a different combination function than what may be best overall, but she is free to ignore the indicator variable if it appears to be of little value.
As a result, there is often a gap between methods that have theoretical guarantees and
methods that perform the best in practice. In this section, we review some of the more
relevant and common combination algorithms that have performance guarantees. We use
the discussion to motivate important criteria to consider when employing a combination
algorithm, and discuss why the presented algorithms often underperform traditional metaclassifier approaches such as stacking.
Two basic questions commonly drive the design of a metaclassification algorithm: (1)
Can the combiner ever win big? (2) Can we prevent the combiner from hurting performance
too badly on any single problem?
An affirmative answer to the first question ensures that some users of the algorithm will gain far more by using our algorithm instead of simply using a base classifier, while an affirmative answer to the second question ensures that those who do use our algorithm will not regret it too much. Bounds for regret are formalized mathematically relative to a specific class of alternative algorithms. External regret compares performance to a fixed policy that does not depend on the choices the combiner makes, while internal regret considers alternatives that may be slight modifications of the combiner’s choices [BM05]. For example, typical external regret bounds limit the combiner’s loss relative to the loss of using the single best expert to predict. A tight external regret bound guarantees the user of the algorithm that, even though the best base classifier cannot always be determined reliably, the user cannot do much worse than if she had known the base classifier performance a priori. Internal regret bounds guarantee small loss relative to simple changes to the algorithm, for example: “When the distance to the SVM’s normal was less than 1, I should have placed the weight I placed on the SVM on the Decision Tree classifier instead.”
Our desire to examine regret stems from two basic types of practical combination problems. In the first, as has been the case throughout this dissertation, a machine learning
practitioner has models from different learning algorithms (e.g., Decision Trees, SVMs,
kNN), and while aware of their relative performance, he is unsure whether they offer distinct information that can be effectively combined by a metaclassifier into a model which
outperforms the base components. Furthermore, the practitioner has a set of candidate indicator variables such as the size of the neighborhood around the prediction point for kNN
or the variance in naı̈ve Bayes confidence that may indicate when classifiers are more or
less reliable. In this case, it is possible many of the indicators are not actually tied to the
base classifier’s performance and the metaclassifier should generalize appropriately.
In the second motivating application, the classifiers are trained over disjoint training sets that cannot be shared because of proprietary or data-privacy concerns, but the predictions of all the classifiers can be obtained for a common validation and test set. (These training sets might be drawn from different distributions, but we assume the labeling is consistent; that is, p(x, y) may vary according to training set, but presumably P(y | x) does not.) The indicators in this case may capture properties such as the similarity score of the most similar example in each training set. A good bound with respect to the weighted combination of these classifiers is equivalent to saying we are taking effective advantage of the information in the proprietary data while maintaining the privacy constraints.
In both cases, we would like to have some guarantee that the classifiers are being combined well and that the indicators are being effectively used. It is with these goals in mind
that we turn to a discussion of a few key algorithms with performance guarantees.
9.3 Combination Algorithms with Regret Guarantees
The majority of algorithms with performance guarantees have been developed in an online setting but can also be applied in a batch setting. Although the guarantees may not directly apply in the batch setting, they form a foundation for understanding the impact of the algorithms.
One of the oldest algorithms in this category is the Weighted Majority Algorithm (WMA). Littlestone and Warmuth [LW94] present several variants and theoretical results for them. WMA is related to the halving algorithm, which at every point throws out the half of the remaining consistent hypotheses that errs on the current example; but instead of requiring an expert or hypothesis to be perfect, WMA maintains a weight on each expert that is modified based on its performance. In its most basic form, the weight on each expert is initialized to 1. To update the weights, whenever the WMA algorithm makes a mistake, the experts that were incorrect have their weights multiplied by β, where β is a parameter such that 0 < β < 1. Let ε_H be the number of errors committed by WMA so far, ε_i the number of errors by expert i, and n the number of experts. Then, for an online prediction setting, it can be shown [LW94]:

    ε_H ≤ (log n + ε_i log(1/β)) / log(2/(1+β))    (9.1)

For a value of β = 0.5, this yields ε_H ≤ 2.41(ε_i + lg n). In other words, the combiner’s errors are within a small constant factor of the best expert’s errors plus a term logarithmic in the number of experts.
A very closely related algorithm to WMA is the Winnow algorithm. The Winnow algorithm [Lit88] is one of the most well-known algorithms for learning a threshold function of boolean inputs, and mistake bounds have been derived for several of its variants. Blum [Blu97] introduced a variant of the algorithm, Winnow-Specialists, specifically designed for combining experts. Instead of requiring an expert to always make a prediction, experts are allowed to abstain or “sleep.” The algorithm predicts using a weighted combination of the experts that do not abstain. When updating the weights, only the weights of the experts that did not abstain are changed. When the combiner is incorrect, the algorithm increases the weight on the experts voting correctly by a factor of 3/2 and decreases the weight on the incorrect experts by a factor of 1/2. (Blum points out it is possible to also penalize the experts that are incorrect when the combiner is correct, but for the proof to go through they can only be rewarded when the combiner is wrong.) If the n experts contain a subset of r infallible experts of which at least one (possibly different) does not abstain on each example, then the online combiner’s mistakes are limited as:

    ε_H ≤ 2r log_{3/2}(3n).    (9.2)
The key difference of the sleeping experts formulation is that experts are not penalized if
they know they are unsure and can abstain. Likewise, it makes working with an extremely
large number of experts computationally efficient as long as only a small subset are awake
for any particular example. Cohen & Singer [CS99] exploited this when they applied the
algorithm as a base classifier to topic classification where each expert was a word or phrase;
thus there were a large number of experts but only a small number (those present in the
document) made predictions for any given document. Additionally, Blum performs an
empirical evaluation using variants of WMA and Winnow-Specialist.
Freund et al. [FSSW97] introduce a generalization of the sleeping-experts framework that demonstrates how to convert an expert-combination algorithm that uses all awake experts into a sleeping-expert combination that allows abstaining. The results they derive remove the restriction that a subset of the n experts be infallible, and they derive a variety of theoretical results for various algorithms. We note this only to point out that the specialist algorithms are more robust than the conditions on Blum’s result might suggest.
Of course, returning to our motivation, the problem is that we don’t know when we should put our experts to “sleep.” We have predictions over all examples, and we have indicators that we suspect might indicate that a different weighting of the experts is preferable. Thus, an algorithm that specifies how to use indicator variables would be preferable. Blum & Mansour [BM05] introduce such an algorithm for the case where the indicators are in [0, 1]; it generalizes the sleeping-experts setting. The essential idea is that an indicator value of zero indicates abstaining and a value of one indicates fully voting. It is then up to the algorithm to learn which experts are best when a given indicator is on.
First, we present the basic framework. Let I be the set of indicators and I ∈ I be a particular indicator. Again working in an online adversarial setting, I(t) will denote the value that indicator I takes at time t. Let l_e^t ∈ [0, 1] be the loss of expert e at time t. The combiner H will operate by specifying a probability distribution over the experts at time t, p^t, and the combiner suffers a loss equal to the expected loss: l_H^t = Σ_{e=1}^n p_e^t l_e^t.
Their algorithm is based on the idea that we keep a weight for each indicator/base-classifier pair and then compute the combiner’s probability vector over experts by normalizing the linear combination of these weights and the indicator variables. First, each indicator-expert pair weight is initialized to one, w_{e,I}^1 = 1. Then, to compute the weight given to each expert, we take the linear combination of the current weights according to how “on” each indicator is: w_e^t = Σ_{I∈I} I(t) w_{e,I}^t. Next, we compute a normalizing term, W^t = Σ_{e=1}^n w_e^t, and produce a probability vector over the experts by normalizing the weight distribution: p_e^t = w_e^t / W^t. Prediction can either be performed by randomly choosing an expert according to the distribution p_e^t or by mixing the experts’ predictions using this probability vector. (Blum & Mansour’s analysis proceeds by randomly choosing an expert according to the vector since this allows for application to problems outside of classification; for our purposes, either can be done, and a further discussion of this issue is below.) To update the weights, we perform a multiplicative update based on how much better the expert was than the combiner and how “on” the particular indicator is: w_{e,I}^{t+1} = w_{e,I}^t β^{I(t)(l_e^t − β l_H^t)}, where β ∈ (0, 1) is a parameter of the algorithm.
If we define the cumulative loss of an algorithm a under an indicator I as L_{a,I} = Σ_{t=1}^T I(t) l_a^t, then this algorithm achieves good external regret bounds with respect to all of the indicators. In particular, where m is the number of indicator variables:

    L_{H,I} ≤ (L_{e,I} + (log nm) / log(1/β)) / β    (9.3)
Note that by substituting in an “always on” indicator, typical regret bounds are achieved for
the best expert problem.
Since this seems a promising comparison point, we examine this algorithm to determine
its empirical performance and appropriateness for use throughout the rest of the dissertation. We refer to this algorithm as BMX since Blum & Mansour introduce it and prove
external regret bounds using it. They also introduce a general method for converting external regret guarantees to internal regret guarantees and present a particular algorithm with
internal regret guarantees [BM05]. We note this for the interested reader but do not present
these as comparisons because we feel they still do not address the issues most pertinent to
us.
For classification, the experts can give us estimates of the posterior probabilities or class predictions. We will use π_{t,e} to denote expert e’s vector of posterior probability estimates over classes at time t. Likewise, we will use Π_t to denote the matrix composed of all the experts’ probability vectors π_{t,e}. Finally, ŷ_{t,e} is the class prediction of expert e at time t. When a class prediction is taken from a posterior estimate, we assume ŷ_{t,e} = argmax_c π_{t,e}[c].
9.4 Empirical Analysis
As done throughout this dissertation, and as described in more detail in Sections 6.3 and 7.2, we selected five classifiers as base classifiers or input experts for the combination algorithms: kNN, decision trees, linear SVMs, naïve Bayes, and a unigram classifier. We denote these below as kNN, Dnet, SVM, naïve Bayes, and Unigram.
We require the outputs of the base classifiers to train the metaclassifiers. Thus, we
perform cross-validation over the training data and use the resulting base classifier predictions, obtained when an example serves as a validation item, as training inputs for the
metaclassifier. As described in Section 6.3, each of the classifiers outputs a score that can
be interpreted as an estimate of the log-odds for the example. We convert these scores to a
probability by treating them as if they were the log-odds.
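This conversion is just the inverse of the log-odds (the logistic function); a one-function sketch:

from math import exp

def score_to_prob(score):
    # If score = log(p / (1 - p)), then p = 1 / (1 + exp(-score)).
    return 1.0 / (1.0 + exp(-score))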
We present results here for the algorithms applied to two corpora, the Reuters corpus
and the MSN Web Directory. As a reminder from Chapter 7, the Reuters corpus yielded
less improvement for the combiners than was achieved in the MSN Web corpus. Thus, in
judging whether these online algorithms are indeed attractive alternatives to using a linear SVM as a metaclassifier, these corpora form two useful sample points. We present only the results for Stack-S (norm) and STRIVE-S (norm) here for comparison since those were the most competitive. (For some of the ROC measures, the STRIVE-D variants were the most competitive; since that is not pertinent to this discussion, we omit them here for brevity and refer the reader with further interest back to Chapter 7.)
As in our earlier experiments, we used only the top 1000 words with highest mutual information for the MSN Web Directory and the top 300 words for Reuters for all base classifiers except the kNN classifier. Since the kNN classifier is computationally expensive, we desired to use the same feature representation across binary classification tasks within a corpus; once neighbors are retrieved, the kNN classifier can make all class decisions quickly. As is commonly done (e.g., [LYRL04]), for each word we assigned a score equal to the maximum of its mutual information scores across binary tasks. The top features were then taken across these maximum scores. Since the same feature set was being used for all classes within a corpus, we used three times the number of features: 3000 words for MSN Web and 900 for Reuters. These settings exactly match our earlier experiments so as to permit a fair comparison. The corpora are described in further detail in Section 6.2.
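The max-over-tasks scoring just described can be sketched as follows (our naming; the actual implementation details are not specified in the text):

def select_shared_features(mi_scores_per_task, k):
    # mi_scores_per_task: one {word: mutual_information} dict per binary task.
    # Score each word by its maximum MI across tasks, then keep the top k.
    best = {}
    for scores in mi_scores_per_task:
        for word, mi in scores.items():
            best[word] = max(best.get(word, 0.0), mi)
    return sorted(best, key=best.get, reverse=True)[:k]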
To compare the performance of the classification methods, we look at a set of standard performance measures: the macro-averaged F1, micro-averaged F1, error, two linear utility functions, C(10, 1) and C(1, 10), the area under the ROC curve, and the abbreviated area under the ROC curve. These measures are described in more detail in Section 6.1 and in Chapter 7.
9.4.1 Combination Implementations
For most of the online algorithms, there are several common variants, both in the online learning case and when applied for batch learning. In addition, each algorithm has several parameters. In order to enable reproducibility, we give pseudocode for each algorithm and specify what parameter settings we use.

One common variant for converting an online algorithm to a batch setting is to make multiple passes through the training data and relax the award/penalty parameter β after each pass by driving its value toward 1. We perform this relaxation for all of the algorithms and present results for a single pass through the data and for 10 passes through the data.
WMA
The implementation of WMA we use is essentially the WMG variant from Littlestone & Warmuth [LW94], but we use the squared difference in posterior probability for the weight updates instead of the absolute difference. As done in [Blu97], we use β = 0.5; Blum notes that the algorithm showed little empirical sensitivity to the particular value of β. For those runs using more than one iteration over the training set, we use η = 0.25. It is common to include |C| − 1 experts that predict class c with probability 1 to allow the algorithm to automatically adjust a threshold. In addition to the base classifiers, we include these “constant experts” as well. Pseudocode for the prediction and training algorithms is given in Algorithms 9.4.1 and 9.4.2.
Algorithm 9.4.1: WMA predict(n, w, Π)
  // Preconditions: π_e[c] ∈ [0, 1]; Σ_c π_e[c] = 1; for some e, w_e ≠ 0.
  p_e ← w_e / Σ_{e′=1}^n w_{e′}   // weight the combiner places on expert e
  π_H ← Σ_{e=1}^n p_e π_e
  ŷ_H ← argmax_c π_H[c]
  output (π_H, ŷ_H)
Algorithm 9.4.2: batch train WMA(R, n, ⟨⟨Π_1, y_1⟩, …, ⟨Π_T, y_T⟩⟩, η, β)
  // Preconditions: π_{t,e}[c] ∈ [0, 1]; Σ_c π_{t,e}[c] = 1; β, η ∈ (0, 1).
  w_e ← 1 for e = 1, …, n
  for r ← 1 to R
    for t ← 1 to T
      (π_{t,H}, ŷ_{t,H}) ← WMA predict(n, w, Π_t)
      P(c = y_t | x_t) ← 1
      P(c ≠ y_t | x_t) ← 0
      for e ← 1 to n
        w_e ← w_e · β^{(π_{t,e}[y_t] − P(y_t | x_t))²}
    β ← β + η(1 − β)   // relax β by driving it toward 1
  output (w)
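For readers who prefer executable code, here is a compact Python sketch of Algorithms 9.4.1 and 9.4.2 (our naming; the constant experts are omitted for brevity):

import numpy as np

def wma_predict(w, Pi):
    # Pi: (n_experts, n_classes) matrix of posterior estimates.
    p = w / w.sum()                    # weight placed on each expert
    pi_H = p @ Pi                      # mixture of the experts' posteriors
    return pi_H, int(np.argmax(pi_H))

def batch_train_wma(R, examples, eta=0.25, beta=0.5):
    # examples: list of (Pi, y) pairs with y the true class index.
    n = examples[0][0].shape[0]
    w = np.ones(n)
    for _ in range(R):
        for Pi, y in examples:
            # Squared-difference multiplicative penalty on each expert:
            w = w * beta ** ((Pi[:, y] - 1.0) ** 2)
        beta = beta + eta * (1.0 - beta)  # relax beta toward 1 after each pass
    return w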
Winnow
The implementation of Winnow we use is essentially the same as that empirically explored by Blum [Blu97] in his variant of the original Winnow2 algorithm [LW94] for sleeping experts. Since by default all of the experts are awake for our problem, the main differences from WMA are that Winnow only updates the weights on the experts when the combination algorithm makes a mistake and that the amount of weight change is based on 0/1 loss.

There are two minor differences from Blum’s implementation. First, Blum promotes by (1 + β) instead of β^{−1} for theoretical reasons but notes that empirical results show no significant impact from this change. Second, Blum’s original formulation did not change the weights on the sleeping experts, but it also did not ensure that the probability placed on those experts stayed unchanged, as required in [FSSW97]. The formulation we give follows [FSSW97] in that the sum of the weights on the awake specialists is constant before and after an update. Since we only apply this when all experts are awake, it does not change our results; we simply note it for the reader interested in reapplying the algorithm.

Finally, as done in [Blu97], we use β = 0.5; again, Blum notes that little empirical sensitivity was shown to the particular value of β. For those runs using more than one iteration over the training set, we use η = 0.25. We predict the most likely class instead of using a threshold parameter and again include the use of a set of |C| − 1 constant experts. Since we apply this for binary classification, this means we use a single constant always-awake expert that predicts class 1.

Pseudocode for the prediction and training algorithms for an implementation that performs n-way classification is given in Algorithms 9.4.3 and 9.4.4.
Algorithm 9.4.3: Winnow Specialists predict(n, w, Π, A)
  // Preconditions: A_e ∈ {0, 1}; for some e, A_e ≠ 0 and w_e ≠ 0;
  //                π_e[c] ∈ [0, 1]; Σ_c π_e[c] = 1.
  p_e ← A_e w_e / Σ_{e′=1}^n A_{e′} w_{e′}   // weight the combiner places on expert e
  π_H ← Σ_{e=1}^n p_e π_e
  ŷ_H ← argmax_c π_H[c]
  output (π_H, ŷ_H)
Algorithm 9.4.4: batch train Winnow Specialists(R, n, ⟨⟨Π_1, y_1⟩, …, ⟨Π_T, y_T⟩⟩, η, β)
  // Preconditions: π_{t,e}[c] ∈ [0, 1]; Σ_c π_{t,e}[c] = 1; β, η ∈ (0, 1).
  w_e ← 1 for e = 1, …, n
  for r ← 1 to R
    for t ← 1 to T
      A_e ← 0 for e = 1, …, n
      for e ← 1 to n
        if e is awake then A_e ← 1
      (π_{t,H}, ŷ_{t,H}) ← Winnow Specialists predict(n, w, Π_t, A)
      if y_t ≠ ŷ_{t,H} then
        for e ← 1 to n
          if e is awake then
            w_e ← w_e β^{2·1(ŷ_{t,e} ≠ y_t) − 1} · (Σ_{e′} A_{e′} w_{e′}) / (Σ_{e′} A_{e′} w_{e′} β^{2·1(ŷ_{t,e′} ≠ y_t) − 1})
    β ← β + η(1 − β)   // relax β by driving it toward 1
  output (w)
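The sum-preserving update at the heart of Algorithm 9.4.4 can be sketched in Python as follows (our naming); it would be invoked only when the combiner errs:

import numpy as np

def winnow_specialists_update(w, awake, correct, beta=0.5):
    # awake, correct: boolean masks over experts. Promote awake experts that
    # were right by 1/beta, demote the wrong ones by beta (i.e.,
    # beta**(2*[wrong] - 1)), then rescale so the total awake weight is
    # unchanged, following [FSSW97].
    old_awake_mass = w[awake].sum()
    factors = np.where(correct, 1.0 / beta, beta)
    w = np.where(awake, w * factors, w)
    w[awake] *= old_awake_mass / w[awake].sum()
    return w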
BMX
For the BMX algorithm, we need indicators in the [0, 1] interval. For this purpose, we use the set of reliability indicators described in Chapter 5. To map these indicators to the zero-one interval, we range-normalize each reliability indicator: features that take a value less than the minimum observed value in the training set (min_i for feature i) are mapped to zero, and those greater than the maximum (max_i for feature i) are mapped to one. Each remaining value v_i of feature i is mapped to (v_i − min_i) / (max_i − min_i). Additionally, we include the probabilities of “in-class” from the classifiers as indicators.
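This range normalization amounts to a clipped min-max rescaling; a tiny sketch:

def range_normalize(v, lo, hi):
    # lo, hi: the feature's min and max observed over the training set.
    if hi == lo:
        return 0.0
    return min(1.0, max(0.0, (v - lo) / (hi - lo)))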
Our implementation of the BMX algorithm is exactly as described in [BM05] except for one change in prediction. In the original, BMX uses p_e to randomly draw an expert and predicts with that expert’s prediction. We use p_e to mix the posterior probabilities of the experts. This does not affect the training of the model; it only alters the predictions over the test set. In our experience, this was always a better alternative to randomly drawing an expert.
For comparability to the other algorithms, we use β = 0.5. For those runs using more
than one iteration over the training set, we use η = 0.25. As a loss function, we use
squared error. We predict the most likely class instead of using a threshold parameter
and again include the use of a set of |C| − 1 constant experts. Since we apply this for
binary classification, this means we use a single constant expert that predicts class 1 with
probability 1.
Since we are primarily interested in the suitability of the BMX algorithm as an alternative to other approaches to using reliability indicators, we provide greater detail in its pseudocode than for the other algorithms. Pseudocode for the prediction and training algorithms for an implementation that performs n-way classification is given in Algorithms 9.4.5–9.4.7.
We also present results for a modified version of the BMX algorithm, which we refer to as BMXmod and which differs from BMX in two respects. We observed over holdout data that the indicators in BMX can occasionally act more like “importance functions” than “reliability indicators.” For example, when the sum across all of the indicators is high, that training example can cause a very large shift in the weights, while another example with a low sum causes very little change. To help prevent this, BMXmod L1-normalizes the indicator variables so that they sum to 1. Second, the BMX algorithm computes the loss of the combination algorithm as the weighted combination (using p_e) of the losses of the experts. BMXmod directly computes the loss of the combiner by setting it to be the squared error of the combiner. Because of this change, using p_e to mix the experts’ estimates instead of randomly drawing an expert affects the training of the model in BMXmod as well as the prediction.
Finally, the BMX algorithm updates the weights for nearly every training example. An alternative approach to consider is, like Winnow, to only update the weights when the combiner makes an error in prediction. To do so, we introduce variants of BMX and BMXmod, which we refer to as WinBMX and WinBMXmod, respectively. These algorithms only update the weights when the combiner’s log-odds of the correct class is less than one. This threshold thus acts as a margin: for examples that are only slightly correct, we continue to update the weights.
Algorithm 9.4.5: BMX predict(n, w, I, Π)
  // Preconditions: I ∈ [0, 1]; w_{e,I} ≥ 0; for some (I, e), I ≠ 0 and w_{e,I} ≠ 0;
  //                π_e[c] ∈ [0, 1]; Σ_c π_e[c] = 1.
  w′_e ← 0 for e = 1, …, n
  for I ∈ I
    for e ← 1 to n
      w′_e ← w′_e + I · w_{e,I}
  W ← Σ_{e=1}^n w′_e
  for e ← 1 to n
    p_e ← w′_e / W   // weight the combiner places on expert e
  π_H ← Σ_{e=1}^n p_e π_e
  ŷ_H ← argmax_c π_H[c]
  output (π_H, ŷ_H, p)
Algorithm 9.4.6: BMX update weights(n, w, I, L, L_H, norm)
  // Preconditions: I ∈ [0, 1]; w_{e,I} ≥ 0; L, L_H ∈ [0, 1]; norm ∈ {0, 1}.
  c ← 1
  if norm then
    old ← 0; new ← 0
    for e ← 1 to n
      for I ∈ I
        if I ≠ 0 then
          old ← old + w_{e,I}
          new ← new + w_{e,I} β^{I(L_e − β L_H)}
    if old ≠ 0 then c ← old / new
  for e ← 1 to n
    for I ∈ I
      if I ≠ 0 then
        w_{e,I} ← c · w_{e,I} β^{I(L_e − β L_H)}
  output (w)
Algorithm 9.4.7: batch train BMX(R, n, L, ⟨⟨Π_1, I_1, y_1⟩, …, ⟨Π_T, I_T, y_T⟩⟩, η, β)
  // Preconditions: π_{t,e}[c], I_t ∈ [0, 1]; Σ_c π_{t,e}[c] = 1; β, η ∈ (0, 1);
  //                L : R^n × R^n × Y × Y → [0, 1].
  w_{e,I} ← 1 for e = 1, …, n and I ∈ I
  for r ← 1 to R
    for t ← 1 to T
      (π_{t,H}, ŷ_{t,H}, p) ← BMX predict(n, w, I_t, Π_t)
      P(c = y_t | x_t) ← 1
      P(c ≠ y_t | x_t) ← 0
      L_{t,e} ← 0 for e = 1, …, n; L_{t,H} ← 0
      for e ← 1 to n
        L_{t,e} ← L(P(c | x_t), π_{t,e}, y_t, ŷ_{t,e})   // get loss of expert e
        L_{t,H} ← L_{t,H} + p_e L_{t,e}   // combiner’s loss depends on e’s weight
      w ← BMX update weights(n, w, I_t, L_t, L_{t,H}, norm)   // multiplicative update (Algorithm 9.4.6)
    β ← β + η(1 − β)   // relax β by driving it toward 1
  output (w)
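To connect the pieces, here is a condensed Python sketch of BMX prediction and the multiplicative update (our naming; the optional sum-preserving rescaling of Algorithm 9.4.6 is omitted):

import numpy as np

def bmx_predict(W, indicators, Pi):
    # W: (n_experts, n_indicators) weights; indicators: values in [0, 1];
    # Pi: (n_experts, n_classes) expert posteriors.
    w_expert = W @ indicators          # w_e = sum_I I * w_{e,I}
    p = w_expert / w_expert.sum()      # probability placed on each expert
    pi_H = p @ Pi                      # mix the experts' posteriors
    return pi_H, int(np.argmax(pi_H)), p

def bmx_update(W, indicators, expert_losses, combiner_loss, beta=0.5):
    # Each indicator-expert weight moves by beta**(I * (l_e - beta * l_H)).
    exponents = np.outer(expert_losses - beta * combiner_loss, indicators)
    return W * beta ** exponents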
9.4.2 Results and Discussion
The results for the MSN Web corpus are presented in Table 9.1, and the results for the Reuters corpus in Table 9.2. We note that the Winnow and WMA variants use only the classifier outputs, and thus we consider them as alternatives to Stack-S (norm) as the stacking metaclassifier. In contrast, the BMX variants use the indicators, and thus we consider them as alternatives to STRIVE-S (norm) as the metaclassifier in the striving framework.
First, we note that, unlike Stack-S (norm), WMA and Winnow often do not outperform even the base classifiers, and rarely do so significantly. Of these options, the clear best choice in both corpora is Winnow (R = 10), i.e., Winnow using a relaxed β over multiple iterations. We also note that relaxation with multiple passes is clearly important to the implementation of Winnow here. There are two possible reasons why Winnow requires multiple passes. The first is that, since updates are only performed when the combiner makes a mistake, convergence to an optimal weight set is slow. The second possibility is that convergence is slow because Winnow’s update is based on 0/1 loss instead of squared loss. However, comparing Winnow (R = 10) to Stack-S (norm) over both corpora, even this variant is not a superior, or even comparable, alternative to Stack-S (norm).
In contrast, the indicator-based online methods, and in particular the WinBMX variants, often outperform the base classifiers: relatively frequently in the Reuters corpus and very often in the MSN Web corpus. Relaxation with multiple iterations does not seem to be critical to any of the indicator-based online methods. Since WinBMX also updates only on mistakes but uses a squared-loss-based weight update, this makes it more likely that the slow convergence observed above for Winnow was a result of the form of the weight update and not the number of updates.

Next, while each WinBMX variant does not always beat its BMX pairing, each does so consistently enough that WinBMX is clearly the superior choice to BMX. While the modifications in the mod variants have some impact, this impact does not appear to be consistent across all performance measures, and thus these modifications are probably not desirable in all cases.
In comparing the indicator-based online alternatives to STRIVE-S (norm), we see that WinBMX stays competitive with STRIVE-S (norm) in Reuters, where striving does not yield as large an improvement. However, in the MSN Web corpus, where striving obtains a much larger improvement, WinBMX and all of the indicator-based online methods are significantly beaten by STRIVE-S (norm) according to nearly every performance measure. Thus, it seems that the indicator-based online methods are not able to utilize the indicators as effectively.
However, comparing the indicator-based online methods, and in particular WinBMX, to the basic methods, we see that the indicator-based methods do generally improve over the basic online methods. Thus, even among these methods that do not improve as much relative to the base classifiers as is desirable, the reliability indicators have an impact. Interestingly, BMXmod achieves the highest ROC area in the MSN Web corpus, although on the abbreviated area it does much worse. The fact that these methods are using the indicators, but not as well as desired, leaves some hope for future modifications. In the next section, we discuss the modifications we believe most likely to have an impact.
Finally, for the reader considering using the online methods: compared to the BMX variants, the WinBMX variants can be significantly more computationally efficient since the number of examples triggering weight updates is extremely small. Since every example for BMX has a prediction and update of roughly the same complexity, the WinBMX variants experience at least a 2× speed-up.
Method              MacroF1       MicroF1       Error         C(1,10)       C(10,1)   ROC Area    ROC [0,0.1]
Dnet                0.5477        0.5813        0.0584        0.3012        0.0772    0.8802      0.5638
Unigram             0.5982        0.6116        0.0594        0.2589        0.0812    0.9003      0.6114
naïve Bayes         0.5527        0.5619        0.0649        0.2853        0.0798    0.8915      0.5516
SVM                 0.6727^B      0.7016^B      0.0455        0.2250^B      0.0794    0.9123      0.6960^B
kNN                 0.6480        0.6866        0.0464        0.2524        0.0733    0.8873      0.6541
Best By Class       0.6727^D      0.7016        0.0452^D      0.2235        0.0729^D  N/A         N/A
Majority            0.6643        0.6902        0.0479        0.2133^BDO    0.0765    N/A         N/A
WMA                 0.6324        0.6756        0.0472        0.2560        0.0746    0.8884      0.6465
WMA (R = 10)        0.6317        0.6757        0.0472        0.2563        0.0746    0.8865      0.6461
Winnow              0.5918        0.6369        0.0479        0.3109        0.0724    0.9303^BO   0.7204^BO
Winnow (R = 10)     0.6668^O      0.7033^BDO    0.0453        0.2439^O      0.0780    0.9166^B    0.7068^B
BMX                 0.6657        0.7069^BDO    0.0440^BDO    0.2194^O      0.0731    0.9287^B    0.7053
BMX (R = 10)        0.6603        0.6993^BD     0.0453        0.2351        0.0743    0.9227^B    0.7038
BMXmod              0.6727        0.7031        0.0435^BDO    0.2158^BO     0.0717    0.9355^BO   0.7122^B
BMXmod (R = 10)     0.6710        0.7049^BD     0.0436^BDO    0.2125^BO     0.0717    0.9347^BO   0.7115^B
WinBMX              0.6811^BDO    0.7115^BDO    0.0428^BDO    0.2084^BO     0.0729    0.9304^B    0.7195^B
WinBMX (R = 10)     0.6841^BDO    0.7114^BDO    0.0434^BDO    0.2088^BO     0.0754    0.9267^B    0.7223^B
WinBMXmod           0.6810^BDO    0.7082        0.0434^BDO    0.2111^BO     0.0725    0.9348^BO   0.7190^B
WinBMXmod (R = 10)  0.6827^BDO    0.7103^BD     0.0432^BDO    0.2082^BO     0.0731    0.9324^BO   0.7201^B
Stack-S (norm)      0.6939^BDO    0.7250^BDOI   0.0423^BDO    0.1971^BDOI   0.0705^D  0.9334^B    0.7349^BOI
STRIVE-S (norm)     0.7173^BDSOI  0.7437^BDSOI  0.0392^BDSOI  0.1835^BDSOI  0.0682    0.9260^B    0.7547^BSOI
BestSelect          0.8719        0.8924        0.0223        0.0642        0.0565    N/A         N/A

Table 9.1: Comparison of the Online Combiners over the MSN Web Corpus. The best performance (omitting the oracle BestSelect) in each column is given in bold. A notation of ‘B’, ‘D’, ‘S’, ‘R’, ‘O’, or ‘I’ indicates a method significantly outperforms all (other) Base classifiers, Default combiners, Stacking methods, Reliability-indicator based Striving methods, Online basic methods, or Indicator-based online methods at the p = 0.05 level. A blackboard (hollow) font is used to indicate significance for the macro sign test and micro sign test. A normal font indicates significance for the macro t-test. For the macro-averages (i.e., excluding micro F1), when both tests are significant, it is indicated with a bold, italicized font.
Method              MacroF1     MicroF1     Error      C(1,10)     C(10,1)   ROC Area   ROC [0,0.1]
Dnet                0.7846      0.8541      0.0242     0.0799      0.0537    0.9804     0.8844
Unigram             0.7645      0.8674      0.0234     0.0713      0.0476    0.9877     0.9086
naïve Bayes         0.6574      0.7908      0.0320     0.1423      0.0527    0.9703     0.7841
SVM                 0.8545^B    0.9122^B    0.0145^B   0.0499      0.0389    0.9893     0.9429^B
kNN                 0.8097      0.8963      0.0170     0.0737      0.0336    0.9803     0.9043
Best By Class       0.8608      0.9149      0.0144     0.0496      0.0342    N/A        N/A
Majority            0.8498      0.9102      0.0155     0.0438      0.0437    N/A        N/A
WMA                 0.8432      0.9064      0.0155     0.0595      0.0376    0.9890     0.9370
WMA (R = 10)        0.8443      0.9056      0.0154     0.0618      0.0365    0.9846     0.9235
Winnow              0.8398      0.9074      0.0158     0.0698      0.0405    0.9946^B   0.9549^B
Winnow (R = 10)     0.8727^B    0.9208^BDO  0.0141     0.0439^BO   0.0339    0.9937^B   0.9537^B
BMX                 0.8640      0.9132      0.0144     0.0553      0.0365    0.9941     0.9487
BMX (R = 10)        0.8581      0.9109      0.0146     0.0576      0.0403    0.9928     0.9408
BMXmod              0.8616      0.9166      0.0141     0.0402^B    0.0334    0.9952^B   0.9562^B
BMXmod (R = 10)     0.8643      0.9166      0.0140     0.0403^B    0.0324    0.9951^B   0.9565^B
WinBMX              0.8843^BD   0.9232^BD   0.0134^BD  0.0426^B    0.0363    0.9933^B   0.9545^B
WinBMX (R = 10)     0.8606      0.9122      0.0147     0.0467^B    0.0379    0.9896     0.9453
WinBMXmod           0.8691      0.9225^BD   0.0132^BD  0.0386^BDO  0.0318    0.9955^B   0.9598^B
WinBMXmod (R = 10)  0.8747      0.9232^BD   0.0125^BD  0.0388^B    0.0332    0.9955^BO  0.9612^BO
Stack-S (norm)      0.8908^BD   0.9307^BDOI 0.0125^BD  0.0372^BDO  0.0331    0.9956^BO  0.9628^B
STRIVE-S (norm)     0.8835^BD   0.9287^BDO  0.0121^BD  0.0352^BDO  0.0343    0.9948^B   0.9616^B
BestSelect          0.9611      0.9789      0.0036     0.0073      0.0173    N/A        N/A

Table 9.2: Comparison of the Online Combiners over Reuters. The best performance (omitting the oracle BestSelect) in each column is given in bold. A notation of ‘B’, ‘D’, ‘S’, ‘R’, ‘O’, or ‘I’ indicates a method significantly outperforms all (other) Base classifiers, Default combiners, Stacking methods, Reliability-indicator based Striving methods, Online basic methods, or Indicator-based online methods at the p = 0.05 level. A blackboard (hollow) font is used to indicate significance for the macro sign test and micro sign test. A normal font indicates significance for the macro t-test. For the macro-averages (i.e., excluding micro F1), when both tests are significant, it is indicated with a bold, italicized font.
9.5 Reconciling Theory and Practice
While the preceding online algorithms are well-motivated and have attractive theoretical
guarantees, in practice their performance was somewhat disappointing. From the practical
standpoint, perhaps a guarantee relative to the best base classifier or the best base classifier
weighted by indicator is too weak. One way of addressing this might be to broaden the
class of regret alternatives to those that have internal regret guarantees.
While this is a step in the correct direction, we can gain insight from examining how metaclassification algorithms, in contrast to combinations of human experts, have actually achieved gains in practice when applied to base classifiers. The majority of schemes can be viewed as a weighted combination of classifier predictions or as selecting the best classifier in a context.
We will focus on the weighted combination scheme and then turn to selecting the best classifier. Continuing with online notation, consider a two-class problem where the class y_t ∈ {−1, 1}. Suppose base classifier e has a confidence score π̂_{t,e} ∈ [0, 1] and predicts the class ŷ_{t,e} = sign(π̂_{t,e} − 0.5). Then, as reviewed in Section 3.2, it is quite common in practice to have a machine learning classification model that ranks the examples well when sorted by π̂_{t,e} but whose classification decisions ŷ_{t,e} are not optimal. Essentially, the model has not learned the correct threshold or bias term: instead of predicting sign(π̂_{t,e} − 0.5), it should output sign(π̂_{t,e} − b), where b is a constant in [0, 1]. A natural way to consider such a classifier in the combination framework is to combine it with a classifier that always outputs π̂_{t,e} = 1 and learn a set of combination weights over the π̂_{t,e} (recasting the sign issue if necessary). Essentially, we want to learn a set of weights whose loss is close to the loss of the best averaged prediction. In contrast, most regret guarantees would give us bounds with respect to the expected loss of selecting an expert. In this case, the default classifier performs no better than the prior, and the poor threshold of the best classifier is exactly what we would like to fix; thus, performing as well as the best expected loss is not a great gain. While this is an exaggerated case, it is easy to construct real situations where these methods will not give enough weight to the default classifier and will generalize worse than models without explicit guarantees.
If we now consider weighted averages of multiple classifiers, the optimal weight vector can be seen as trading off the accuracy of the classifiers, their co-dependencies, and a cumulative noise or bias term (when a constant prediction is included). By considering methods that apply well to this class, we may have looser or no explicit bounds, but because the model has more flexibility, we can achieve better generalization with sufficient training data.
The majority of bounds are with respect to expected loss; for example, see the definition of l_H^t in the BMX algorithm above. As pointed out in [FSSW97], achieving a bound with respect to averaged prediction is typically harder, and since most loss functions are convex, a tight bound on averaged prediction loss implies a tight bound on expected loss. Since seeing why we do not obtain a good bound in the opposite direction is a key point, it is worth going through in detail.
Let's deal with an online algorithm and restrict our attention to convex linear combinations
of the $n$ experts, i.e., $C = \{u \mid u \in \mathbb{R}^n, \sum_i u_i = 1, u_i \geq 0\}$. Let
$\hat{s}_t \in \mathbb{R}^n$ be the output of the experts, which may be log-odds, probabilities,
or binary classes.
A good algorithm with respect to expected loss would output $w_t$ such that

$$\sum_{t=1}^{T} w_t \cdot L(\hat{s}_t, y_t) \;\leq\; \inf_{u \in C} \sum_{t=1}^{T} u \cdot L(\hat{s}_t, y_t) + Z_I \qquad (9.4)$$

where $Z_I$ is an appropriately defined "small" term (typically relative to $T$, the number of experts, and the infimum).
A good algorithm with respect to the loss of an averaged prediction would output $\omega_t$ such that:

$$\sum_{t=1}^{T} L(\omega_t \cdot \hat{s}_t, y_t) \;\leq\; \inf_{u \in C} \sum_{t=1}^{T} L(u \cdot \hat{s}_t, y_t) + Z_{II}. \qquad (9.5)$$
Assuming we have a good algorithm with respect to the second of these, averaged
prediction loss, we can easily derive a good bound for expected loss. In the following, let
$w = \operatorname{argmin}_{u \in C} \sum_{t=1}^{T} u \cdot L(\hat{s}_t, y_t)$.⁵

$$\sum_{t=1}^{T} L(\omega_t \cdot \hat{s}_t, y_t) \;\leq\; \inf_{u \in C} \sum_{t=1}^{T} L(u \cdot \hat{s}_t, y_t) + Z_{II} \qquad (9.6)$$

By definition of inf,

$$\leq \sum_{t=1}^{T} L(w \cdot \hat{s}_t, y_t) + Z_{II} \qquad (9.7)$$

By Jensen's inequality,

$$\leq \sum_{t=1}^{T} w \cdot L(\hat{s}_t, y_t) + Z_{II} \qquad (9.8)$$

By definition of $w$,

$$\leq \inf_{u \in C} \sum_{t=1}^{T} u \cdot L(\hat{s}_t, y_t) + Z_{II} \qquad (9.9)$$

⁵ In the case where the minimum is not well-defined, the infimum will be defined and we are still guaranteed the result of the derivation since the inequality holds for all w. We omit the details of this.
Finally, because $Z_{II}$ is guaranteed to be small relative to the smaller term, the averaged
prediction loss, it will also be sufficiently small relative to the expected loss.
Now, what about reversing the result? Let $\omega = \operatorname{argmin}_{u \in C} \sum_{t=1}^{T} L(u \cdot \hat{s}_t, y_t)$.

$$\sum_{t=1}^{T} w_t \cdot L(\hat{s}_t, y_t) \;\leq\; \inf_{u \in C} \sum_{t=1}^{T} u \cdot L(\hat{s}_t, y_t) + Z_I \qquad (9.10)$$

By definition of inf,

$$\leq \sum_{t=1}^{T} \omega \cdot L(\hat{s}_t, y_t) + Z_I \qquad (9.11)$$

Now we're stuck.
From Jensen's inequality, we obviously have $\sum_{t=1}^{T} L(\omega \cdot \hat{s}_t, y_t) \leq \sum_{t=1}^{T} \omega \cdot L(\hat{s}_t, y_t) + Z_I$,
but what we want is to say that the combiner's loss is close to $\sum_{t=1}^{T} L(\omega \cdot \hat{s}_t, y_t)$, which
would require Jensen's inequality to be flipped to continue the derivation.
From a graphical point of view, choosing a solution with a loss close to the minimum of
expected loss does not mean that using that same vector will bring us close to the minimum of the
averaged prediction loss. Some vectors may have a large divergence between the two measures.
We still have a bound on our loss if we use the expected-loss solution,
$w_t$, to average the predictions, but we are not guaranteed to be in the neighborhood of the
minimum for averaged prediction loss. That is, the bound may be extremely loose with
respect to that minimum.
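The gap can be made concrete with a small construction of our own (not from the text): under squared loss, two experts that always bracket the target give every convex weight vector the same expected loss, while the averaged prediction loss ranges from zero up to that value.

```python
import numpy as np

def expected_loss(u, S, y):
    # sum_t u . L(s_t, y_t): weighted average of the experts' losses
    return float(np.sum(((S - y[:, None]) ** 2) @ u))

def averaged_prediction_loss(u, S, y):
    # sum_t L(u . s_t, y_t): loss of the averaged prediction
    return float(np.sum((S @ u - y) ** 2))

T = 100
y = np.ones(T)                                      # target is always 1
S = np.column_stack([np.zeros(T), 2 * np.ones(T)])  # expert outputs: 0 and 2

u_corner = np.array([1.0, 0.0])   # an expected-loss minimizer (tied with [0, 1])
u_mid = np.array([0.5, 0.5])      # the averaged-prediction-loss minimizer

print(expected_loss(u_corner, S, y), averaged_prediction_loss(u_corner, S, y))  # 100.0 100.0
print(expected_loss(u_mid, S, y), averaged_prediction_loss(u_mid, S, y))        # 100.0 0.0
```

Every convex u has expected loss exactly 100 here, so a regret bound relative to the best expected loss says nothing about how close the combiner comes to the averaged-prediction optimum of 0.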
Because of this and the closely related issues when combining automatically produced
experts, metaclassifiers without explicit guarantees (e.g., SVMs) may generalize better
since their models allow for the expressivity to optimize for averaged prediction loss. While
we do not pursue such a vein here, the interested reader may be able to construct algorithms
that both have guarantees and work well for combination by generalizing some of the basic
combination results with averaged prediction guarantees (e.g., exponentiated gradient methods
[FSSW97]) along the indicator-variable lines presented in [BM05].
9.6 Chapter Summary
In this chapter, we provided an overview of some key approaches to combining classifiers in
an online classification framework and empirically analyzed their suitability for application
to our batch prediction framework. While the methods did not suffer any large losses as
guaranteed, our empirical analysis highlighted the lack of significant wins relative to using
a linear SVM as a batch metaclassifier as explored earlier. In our analysis, we discussed
how guarantees with respect to the average loss of the experts are far weaker than the type of
guarantee with respect to loss of averaged prediction that is needed. Finally, we suggested
literature of interest and future directions that the interested reader can pursue to continue
this line of research in a way that is likely to show higher empirical performance as well as
theoretical guarantees.
Chapter 10
Action-Item Detection in E-mail
In this chapter, we demonstrate that classifier combination methods are applicable to text
classification problems outside of topic classification. In doing so, the aim is to demonstrate the flexibility and applicability of these methods to a range of problems. As such, we
have chosen a problem, action-item detection in e-mail documents, that not only presents
different challenges as a learning problem than those present in topic classification but is
also focused on a different performance goal — improving the ranking of e-mails.
E-mail users have an increasingly difficult time managing their inboxes in the face of
challenges that result from rising e-mail usage. These challenges include prioritizing e-mails
from a range of sources spanning business partners to family members, filtering and reducing
junk e-mail, and quickly managing requests that demand the receiver's attention or action.
Automated action-item detection targets the third of these problems by attempting to detect
which e-mails require an action or response with information, and within those e-mails,
attempting to highlight the sentence or passage containing the action request.
Such a detection system can be used as one part of an e-mail agent which would assist a
user in processing important e-mails more quickly than would have been possible without
the agent. We view action-item detection as one necessary component of a successful e-mail agent which would perform spam detection, action-item detection, topic classification
and priority ranking, among other functions. The utility of such a detector can manifest as
a method of prioritizing e-mails according to task-oriented criteria other than the standard
ones of topic and sender, or as a means of ensuring that the e-mail user hasn't dropped the
proverbial ball by forgetting to address an action request.
In the context of this dissertation, action-item detection forms a very different type of
text classification problem for empirical study than topic classification. In particular, while
the cues indicating topic are spread throughout a document, the action-item(s) in an e-mail
are typically localized, often in the context of a single sentence. Thus, the intuition is that
the class of the document will interact differently with the document feature representation
than in topic classification. Therefore, the natural question is whether this changes either
the effectiveness of the base classifiers or the combination approaches.
Secondly, while it can be challenging to identify exactly which portions of a document make it about a specific topic during topic classification, identifying the sentences
that determine the action-item status of an e-mail is relatively straightforward. Given such
information, we would like to know whether it can be used effectively in the basic classification task, and if so, how we can incorporate this information into a reliability-indicator
based approach during classifier combination.
We first continue to lay out and describe the basic problem of action-item detection.
Then, we review related work for similar text classification problems such as e-mail priority ranking and speech act identification. Next, we more formally define the action-item
detection problem, discuss the aspects that distinguish it from more common problems like
topic classification, and highlight the challenges in constructing systems that can perform
well at the sentence and document level. From there, we move to a discussion of feature
representation and selection techniques appropriate for this problem and how standard text
classification approaches can be adapted to move smoothly from the sentence-level detection problem to the document-level classification problem. We then conduct an empirical
analysis that helps us determine the effectiveness of our feature extraction procedures as
well as establish baselines for a number of classification algorithms on this task. Next, we
turn to the question of how we can combine a set of action-item detection systems and how
both sentence-level and document-level classifiers can be combined. Finally, we summarize the implications for applying classifier combination techniques to other domains.
10.1 Why Action-Item Detection?
Action-item detection differs from standard text classification in two important ways. First,
the user is interested both in detecting whether an e-mail contains action-items and in locating exactly where these action-item requests are contained within the e-mail body. In
contrast, standard text categorization merely assigns a topic label to each text, whether that
label corresponds to an e-mail folder or a controlled indexing vocabulary [Lar99, YZCJ02,
LYJC04]. Second, action-item detection attempts to recover the e-mail sender's intent —
whether she means to elicit response or action on the part of the receiver; note that for
this task, classifiers using only a bag-of-words representation do not perform optimally, as
evidenced in our results below. Instead we find that we need more information-laden features such as higher-order n-grams. Text categorization by topic, on the other hand, works
From: Henry Hutchins <[email protected]>
To: Sara Smith; Joe Johnson; William Woolings
Subject: meeting with prospective customers
Sent: Fri 12/10/2005 8:08 AM
Hi All,
I’d like to remind all of you that the group from GRTY will be visiting
us next Friday at 4:30 p.m. The current schedule looks like this:
+ 9:30 a.m. Informal Breakfast and Discussion in Cafeteria
+ 10:30 a.m. Company Overview
+ 11:00 a.m. Individual Meetings (Continue Over Lunch)
+ 2:00 p.m. Tour of Facilities
+ 3:00 p.m. Sales Pitch
In order to have this go off smoothly, I would like to practice the
presentation well in advance. As a result, I will need each of your
parts by Wednesday.
Keep up the good work!
–Henry
Figure 10.1: An E-mail with emphasized Action-Item, an explicit request that requires the recipient’s attention or action.
very well using just individual words as features [Lew92a, ADW94a, DPHS98, Seb02]. In
fact, genre classification, which one would think may require more than a bag-of-words
approach, also works quite well using just unigram features [LCJ03]. Topic detection and
tracking (TDT) also works well with unigram feature sets [YCB+99, ACD+98]. We believe that action-item detection is one of the first clear instances of a text classification
related task where we must move beyond bag-of-words to achieve high performance — albeit
not too far, as bag-of-n-grams seems to suffice, given state-of-the-art classifiers.
10.2 Related Work
Several other researchers have considered very similar text classification tasks. Cohen et al.
[CCM04] describe an ontology of “speech acts”, such as “Propose a Meeting”, and attempt
to predict when an e-mail contains one of these speech acts. While their ontology mostly
focused on types of speech acts that are specific kinds of action-item requests, we consider
action-items in general to be an important specific type of speech act that falls within a
much broader ontology of speech acts. Furthermore, while they provide results for several
classification methods, their methods only make use of human judgments at the document-level. In contrast, we consider whether accuracy can be increased by using finer-grained
human judgments that mark the specific sentences and phrases of interest.
Corston-Oliver et al. [CORGC04] consider detecting items in e-mail to “Put on a To-Do
List”. This classification task is very similar to ours except they do not consider “simple
factual questions” to belong to this category. We include questions, but note that not all
questions are action-items — some are rhetorical or simply social convention, e.g., “How
are you?". From a learning perspective, while they make use of judgments at the sentence-level, they do not explicitly compare what, if any, benefits finer-grained judgments offer.
Additionally, they do not study alternative choices or approaches to the classification task.
Instead, they simply apply a standard SVM at the sentence-level and focus primarily on
a linguistic analysis of how the sentence can be logically reformulated before adding it to
the task list. In this study, we examine several alternative classification methods, compare
document-level and sentence-level approaches and analyze the machine learning issues
implicit in these problems. We are also the first to examine in detail the gains from classifier
combination in this problem.
For those who wish to pursue further reading on this topic, a variety of learning tasks
related to e-mail has been growing rapidly in the recent literature. For example, in a forum
dedicated to e-mail learning tasks, Culotta et al. [CBM04] presented methods for learning
social networks from e-mail. We do not use peer relationship information in building our
classifiers; however, such methods could complement those here since peer relationships
often influence word choice when requesting an action.
10.3 Problem Definition & Approach
In contrast to previous work, we explicitly focus on the benefits that finer-grained, more
costly, sentence-level human judgments offer over coarse-grained document-level judgments. Additionally, we consider multiple standard text classification approaches and analyze both the quantitative and qualitative differences that arise from taking a document-level vs. a sentence-level approach to classification. We also focus on the representation
necessary to achieve the most competitive performance. Finally, after demonstrating the
difference in document-level and sentence-level approaches to the document classification
task, we examine what can be gained from classifier combination in this problem.
10.3.1 Problem Definition
In order to provide the most benefit to the user, a system would not only detect the relevant documents, but it would also indicate the specific sentences in the e-mail which contain the
action-items. Therefore, there are three basic action-item detection problems:
1. Document detection: Classify a document as to whether or not it contains an action-item.
2. Document ranking: Rank the documents such that all documents containing action-items occur as high as possible in the ranking.
3. Sentence detection: Classify each sentence in a document as to whether or not it is
an action-item.
As in most Information Retrieval tasks, the weight the evaluation metric should give
to precision and recall depends on the nature of the application. In situations where a user
will eventually read all received messages, ranking (e.g., via precision at recall of 1) may be
most important since this will help encourage shorter delays in communications between
users. In contrast, high-precision detection at low recall will be of increasing importance
when the user is under severe time-pressure and therefore will likely not read all mail. This
can be the case for crisis managers during disaster management. Finally, sentence detection
plays a role in both time-pressure situations and simply to alleviate the user’s required time
to gist the message. In the first part of this chapter, we focus on standard performance
measures such as F1 and accuracy. Then, as we return to classifier combination, we will
discuss ranking measures such as ROC curves and area under the curve.
10.3.2 Approach
As mentioned above, the labeled data can come in one of two forms: a document-labeling
provides a yes/no label for each document as to whether it contains an action-item; a
phrase-labeling provides only a yes label for the specific items of interest. We term the
human judgments a phrase-labeling since the user’s view of the action-item may not correspond with actual sentence boundaries or predicted sentence boundaries. Obviously, it
is straightforward to generate a document-labeling consistent with a phrase-labeling by
labeling a document “yes” if and only if it contains at least one phrase labeled “yes”.
To train classifiers for this task, we can take several viewpoints related to both the basic
problems we have enumerated and the form of the labeled data. The document-level view
treats each e-mail as a learning instance with an associated class-label. Then, the document
can be converted to a feature-value vector and learning progresses as usual. Applying a
document-level classifier to document detection and ranking is straightforward. In order
to apply it to sentence detection, one must make additional steps. For example, if the
classifier predicts a document contains an action-item, then areas of the document that
contain a high concentration of words which the model weights heavily in favor of action-items can be indicated. The obvious benefit of the document-level approach is that training
set collection costs are lower since the user only has to specify whether or not an e-mail
contains an action-item and not the specific sentences.
In the sentence-level view, each e-mail is automatically segmented into sentences, and
each sentence is treated as a learning instance with an associated class-label. Since the
phrase-labeling provided by the user may not coincide with the automatic segmentation,
we must determine what label to assign a partially overlapping sentence when converting
it to a learning instance. Once trained, applying the resulting classifiers to sentence detection is now straightforward, but in order to apply the classifiers to document detection
and document ranking, the individual predictions over each sentence must be aggregated in
order to make a document-level prediction. This approach has the potential to benefit from
more-specific labels that enable the learner to focus attention on the key sentences, instead
of having to learn based on data for which the majority of the words in the e-mail provide
little or no information about class membership.
Features
Consider some of the phrases that might constitute part of an action item: “would like to
know”, “let me know”, “as soon as possible”, “have you”. Each of these phrases consists
of common words that occur in many e-mails. However, when they occur as a phrase in
the same sentence, they are far more indicative of an action-item. Additionally, order can
be important: consider "have you" versus "you have". Because of this, we posit that n-grams play a larger role in this problem than is typical of problems like topic classification.
Therefore, we consider all n-grams up to size 4. Thus, we compare using a “bag of phrases
and words” to simply using a “bag of words”.
When using n-grams, if we find an n-gram of size 4 in a segment of text, we can
represent the text as just one occurrence of the n-gram or as one occurrence of the n-gram
and an occurrence of each smaller n-gram contained by it. We choose the second of these
alternatives since this will allow the algorithm itself to smoothly back off in terms
of recall. Methods such as naïve Bayes may be hurt by such a representation because of
double-counting.
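A minimal sketch of this "bag of phrases and words" extraction (our code; the function name is illustrative):

```python
def bag_of_ngrams(tokens, max_n=4):
    """Count all n-grams of sizes 1..max_n; a size-4 window also contributes
    every smaller n-gram it contains, so the learner can smoothly back off
    in recall (at the cost of double-counting for naive Bayes-style models)."""
    counts = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            counts[gram] = counts.get(gram, 0) + 1
    return counts

# bag_of_ngrams(["let", "me", "know"]) yields ("let", "me", "know"),
# ("let", "me"), ("me", "know"), plus the three unigrams.
```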
Since sentence-ending punctuation can provide information, we retain the terminating
punctuation token when it is identifiable. Additionally, we add a beginning-of-sentence and
end-of-sentence token in order to capture patterns that are often indicators at the beginning
or end of a sentence. Assuming proper punctuation, these extra tokens are unnecessary,
but often e-mail lacks proper punctuation. In addition, for the sentence-level classifiers
that use n-grams, we additionally code for each sentence a binary encoding of the position
of the sentence relative to the document. This encoding has eight associated features that
represent which octile (the first eighth, second eighth, etc.) contains the sentence.
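In code, the boundary tokens and the octile encoding amount to only a few lines; a sketch following the description above (ours; names are illustrative):

```python
def sentence_representation(tokens, sentence_index, num_sentences):
    """Wrap a sentence in boundary tokens and compute the eight binary
    features marking which octile of the document contains it."""
    wrapped = ["<bos>"] + tokens + ["<eos>"]                # boundary markers
    octile = min(8 * sentence_index // max(num_sentences, 1), 7)
    position = [1 if j == octile else 0 for j in range(8)]  # one-hot octile
    return wrapped, position
```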
Implementation Details
In order to compare the document-level to the sentence-level approach, we compare predictions at the document-level. We do not address how to use a document-level classifier
to make predictions at the sentence-level.
In order to automatically segment the text of the e-mail, we use the RASP statistical
parser [Car02]. Since the automatically segmented sentences may not correspond directly
with the phrase-level boundaries, we treat any sentence that contains at least 30% of a
marked action-item segment as an action-item. When evaluating sentence-detection for
the sentence-level system, we use these class labels as ground truth. Since we are not
evaluating multiple segmentation approaches, this does not bias any of the methods. If
multiple segmentation systems were under evaluation, one would need to use a metric that
matched predicted positive sentences to phrases labeled positive. The metric would need
to punish overly long positive predictions as well as too-short predictions. Our criterion for
converting to labeled instances implicitly includes both considerations. Since the segmentation is
fixed, an overly long prediction would be predicting “yes” for many “no” instances since
presumably the extra length corresponds to additional segmented sentences all of which
do not contain 30% of an action-item. Likewise, a too short prediction must correspond
to a small sentence included in the action-item but not constituting all of the action-item.
Therefore, in order to consider the prediction to be too short, there will be an additional
preceding/following sentence that is an action-item where we incorrectly predicted “no”.
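The 30% rule for converting a phrase-labeling into sentence labels is easy to state over character spans; a hedged sketch (our code, with illustrative names):

```python
def span_overlap(a, b):
    """Length of the intersection of two [start, end) character spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def label_sentences(sentence_spans, item_spans, min_frac=0.30):
    """A segmented sentence is labeled positive when it contains at least
    min_frac of some human-marked action-item segment."""
    return [
        int(any(span_overlap(sent, item) >= min_frac * (item[1] - item[0])
                for item in item_spans))
        for sent in sentence_spans
    ]
```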
Once a sentence-level classifier has made a prediction for each sentence, we must combine these predictions to make both a document-level prediction and a document-level
score.¹ We use the simple policy of predicting positive when any of the sentences is predicted positive. In order to produce a document score for ranking, the confidence that the
document contains an action-item is:
$$\psi(d) = \begin{cases} \frac{1}{n(d)} \sum_{s \in d \,\mid\, \pi(s)=1} \psi(s) & \text{if for any } s \in d,\ \pi(s) = 1 \\ \frac{1}{n(d)} \max_{s \in d} \psi(s) & \text{o.w.} \end{cases}$$
where s is a sentence in document d, π is the classifier's 1/0 prediction, ψ(s) is the score the
classifier assigns as its confidence that π(s) = 1, and n(d) is the greater of 1 and the number
of (unigram) tokens in the document. In other words, when any sentence is predicted
positive, the document score is the length normalized sum of the sentence scores above
threshold. When no sentence is predicted positive, the document score is the maximum
sentence score normalized by length. As in other text problems, we are more likely to emit
false positives for documents with more words or sentences. Thus we include a length
normalization factor.

¹ This combination problem differs in nature from those discussed previously in this dissertation. In this problem the set of experts (sentences) changes from document to document. We go into further detail on this point later in the chapter.
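The scoring rule above transcribes directly into code; a small sketch (ours; names are illustrative):

```python
def document_score(preds, scores, num_tokens):
    """psi(d): the length-normalized sum of the scores of sentences
    predicted positive; when no sentence is predicted positive, the
    length-normalized maximum sentence score."""
    n = max(1, num_tokens)                                 # n(d)
    positive = [s for p, s in zip(preds, scores) if p == 1]
    if positive:
        return sum(positive) / n
    return max(scores) / n
```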
10.4 Experimental Analysis for Action-Item Detection
10.4.1 The Data
Our corpus consists of e-mails obtained from volunteers at Carnegie Mellon University and
covers subjects such as organizing a research workshop, arranging for job-candidate interviews, publishing proceedings, and talk announcements. The messages were anonymized
by replacing the names of each individual and institution with a pseudonym. After attempting to identify and eliminate duplicate e-mails, the corpus contains 744 e-mail messages.
After identity anonymization, the corpus has three basic versions. Quoted material
refers to the text of a previous e-mail that an author often leaves in an e-mail message when
responding to the e-mail. Quoted material can act as noise when learning since it may
include action-items from previous messages that are no longer relevant. To isolate the
effects of quoted material, we have three versions of the corpora. The raw form contains
the basic messages. The auto-stripped version contains the messages after quoted material
has been automatically removed. The hand-stripped version contains the messages after
quoted material has been removed by a human. Additionally, the hand-stripped version has
had any XML content and e-mail signatures removed — leaving only the essential content of
the message. The studies reported here are performed with the hand-stripped version. This
allows us to balance the cognitive load in terms of number of tokens that must be read in the
user-studies we report — including quoted material would complicate the user studies since
some users might skip the material while others read it. Additionally, ensuring all quoted
material is removed prevents tainting the cross-validation since otherwise a test item could
occur as quoted material in a training document.
Data Labeling
Two human annotators labeled each message as to whether or not it contained an action-item. In addition, they identified each segment of the e-mail which contained an action-item. A segment is a contiguous section of text selected by the human annotators and may
span several sentences or a complete phrase contained in a sentence. They were instructed
that an action item is “an explicit request for information that requires the recipient’s attention or a required action” and told to “highlight the phrases or sentences that make up the
request”.
                 Annotator 2
Annotator 1      No     Yes
         No      391    26
         Yes     29     298

Table 10.1: Agreement of Human Annotators at Document Level
Annotator One labeled 324 messages as containing action items. Annotator Two labeled
327 messages as containing action items. The agreement of the human annotators is shown
in Tables 10.1 and 10.2. The annotators are said to agree at the document-level when
both marked the same document as containing no action-items or both marked at least one
action-item regardless of whether the text segments were the same. At the document-level,
the annotators agreed 93% of the time. The kappa statistic [Car96, CCM04] is often used
to evaluate inter-annotator agreement:
$$\kappa = \frac{A - R}{1 - R}$$
A is the empirical estimate of the probability of agreement. R is the empirical estimate of
the probability of random agreement given the empirical class priors. A value close to −1
implies the annotators agree far less often than would be expected randomly, while a value
close to 1 means they agree more often than randomly expected.
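As a check, κ can be computed directly from the counts in Table 10.1; the following sketch (our code) reproduces the document-level value reported next.

```python
def kappa(table):
    """Kappa from a 2x2 agreement table [[nn, ny], [yn, yy]]."""
    total = sum(sum(row) for row in table)
    a = (table[0][0] + table[1][1]) / total      # observed agreement A
    p1 = (table[1][0] + table[1][1]) / total     # annotator 1 "yes" rate
    p2 = (table[0][1] + table[1][1]) / total     # annotator 2 "yes" rate
    r = p1 * p2 + (1 - p1) * (1 - p2)            # chance agreement R
    return (a - r) / (1 - r)

print(round(kappa([[391, 26], [29, 298]]), 2))   # 0.85
```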
At the document-level, the kappa statistic for inter-annotator agreement is 0.85. This
value is both strong enough to expect the problem to be learnable and is comparable with
results for similar tasks [CCM04, CORGC04].
In order to determine the sentence-level agreement, we use each judgment to create a
sentence-corpus with labels as described in Section 10.3.2, then consider the agreement
over these sentences. This allows us to compare agreement over “no judgments”. We
perform this comparison over the hand-stripped corpus since that eliminates spurious “no”
judgments that would come from including quoted material, etc. Both annotators were free
to label the subject as an action-item, but since neither did, we omit the subject line of
the message as well. This only reduces the number of “no” agreements. This leaves 6301
automatically segmented sentences. At the sentence-level, the annotators agreed 98% of
the time, and the kappa statistic for inter-annotator agreement is 0.82.
                 Annotator 2
Annotator 1      No      Yes
         No      5810    65
         Yes     74      352

Table 10.2: Agreement of Human Annotators at Sentence Level
[Figure 10.2 plots the number of messages (left panel) and the percentage of messages (right panel) against the number of tokens, for all messages and for action-item messages.]

Figure 10.2: The Histogram (left) and Distribution (right) of Message Length. A bin size of 20 words was used. Only tokens in the body after hand-stripping were counted. After stripping, the majority of words left are usually actual message content.
In order to produce a single set of judgments, the human annotators went through
each annotation where there was disagreement and came to a consensus opinion. The
annotators did not collect statistics during this process but anecdotally reported that the
majority of disagreements were either cases of clear annotator oversight or different interpretations of conditional statements. For example, “If you would like to keep your job,
come to tomorrow’s meeting” implies a required action where “If you would like to join
the football betting pool, come to tomorrow’s meeting” does not. The first would be an
action-item in most contexts while the second would not. Of course, many conditional
statements are not so clearly interpretable. After reconciling the judgments there are 416
e-mails with no action-items and 328 e-mails containing action-items. Of the 328 e-mails
containing action-items, 259 messages have one action-item segment; 55 messages have
two action-item segments; 11 messages have three action-item segments. Two messages
have four action-item segments, and one message has six action-item segments. Computing
the sentence-level agreement using the reconciled “gold standard” judgments with each of
the annotators’ individual judgments gives a kappa of 0.89 for Annotator One and a kappa
of 0.92 for Annotator Two.
In terms of message characteristics, there were on average 132 content tokens in the
body after stripping. For action-item messages, there were 115. However, by examining
Figure 10.2 we see the length distributions are nearly identical. As would be expected for
e-mail, it is a long-tailed distribution with about half the messages having more than 60
tokens in the body (this paragraph has 65 tokens).
10.4.2 Classifiers
In order to establish baselines, we have selected a variety of standard text classification
algorithms. Later in this chapter, we return to exactly the same set of base classifiers used
throughout the dissertation, but because action-item detection is relatively unstudied, in this
section we choose a slightly different set of algorithms to study the features of action-item
detection as a learning problem. In selecting algorithms, we have chosen algorithms that
are not only known to work well but which differ along such lines as discriminative vs.
generative and lazy vs. eager. We have done this in order to provide both a competitive and
thorough sampling of learning methods for the task at hand. This is important since it is
easy to improve a strawman classifier by introducing a new representation. By thoroughly
sampling alternative classifier choices we demonstrate that representation improvements
over bag-of-words are not due to using the information in the bag-of-words poorly.
kNN
As done throughout the dissertation, we employ a standard variant of the k-nearest neighbor
algorithm used in text classification, kNN with s-cut score thresholding [Yan99]. We use
a tfidf-weighting of the terms with a distance-weighted vote of the neighbors to compute
the score before thresholding it. In order to choose the value of s for thresholding, we
perform leave-one-out cross-validation over the training set. The value of k is set to be
2(⌈log₂ N⌉ + 1) where N is the number of training points. This rule for choosing k is
theoretically motivated by results which show such a rule converges to the optimal classifier
as the number of training points increases [DGL96]. In practice, we have also found it to be
a computational convenience that frequently leads to comparable results with numerically
optimizing k via a cross-validation procedure.
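A sketch of the scoring side of this kNN variant (ours, not the dissertation's implementation; the threshold s would be tuned by the leave-one-out procedure described above):

```python
import math
import numpy as np

def choose_k(n_train):
    # k = 2(ceil(log2 N) + 1), the theoretically motivated rule above
    return 2 * (math.ceil(math.log2(n_train)) + 1)

def knn_score(x, X_train, y_train):
    """Distance-weighted vote over the k nearest neighbors; the example is
    classified positive when this score exceeds the tuned threshold s."""
    k = choose_k(len(X_train))
    sims = X_train @ x                     # cosine similarity for L2-normed tfidf rows
    nearest = np.argsort(-sims)[:k]
    # positive neighbors add their similarity; negative neighbors subtract it
    return float(sum(sims[i] if y_train[i] == 1 else -sims[i] for i in nearest))
```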
Unigram (multinomial Naïve Bayes)
We use a standard multinomial naïve Bayes classifier [MN98]. As done throughout the dissertation, in using this classifier, we smoothed word and class probabilities using a Bayesian
estimate (with the word prior) and a Laplace m-estimate, respectively. To use terminology consistent with the rest of the dissertation, this classifier is referred to as a Unigram
classifier. Note that this is distinct from a unigram or bag-of-words representation as also
discussed here.
SVM
We have used a linear SVM with a tfidf feature representation and L2-norm as implemented
in the SVMlight package v6.01 [Joa99]. All default settings were used.
Voted Perceptron
Like the SVM, the Voted Perceptron is a kernel-based learning method. We use the same
feature representation and kernel as we have for the SVM, a linear kernel with tfidf-weighting and an L2-norm. The Voted Perceptron is an online-learning method that keeps
a history of past perceptrons used, as well as a weight signifying how often that perceptron
was correct. With each new training example, a correct classification increases the weight
on the current perceptron and an incorrect classification updates the perceptron. The output
of the classifier uses the weights on the perceptrons to make a final "voted" classification.
When used in an offline manner, multiple passes can be made through the training data.
Furthermore, it is well-known that the Voted Perceptron increases the margin of the solution after each pass through the training data [FS99]. Since Cohen et al. [CCM04] obtain
worse results using an SVM than a Voted Perceptron with one training iteration, they conclude that the best solution for detecting speech acts may not lie in an area with a large
margin. Because their tasks are highly similar to ours, we employ both classifiers to ensure we are not overlooking a competitive alternative classifier to the SVM for the basic
bag-of-words representation.
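For reference, a compact sketch of the Voted Perceptron as described above and in [FS99] (our code; labels are in {−1, +1}):

```python
import numpy as np

def voted_perceptron_train(X, y, epochs=1):
    """Keep every intermediate perceptron together with a count of how many
    examples it survived; prediction is a count-weighted vote."""
    w, c, history = np.zeros(X.shape[1]), 0, []
    for _ in range(epochs):                  # multiple passes in offline use
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:         # mistake: retire the current perceptron
                history.append((w.copy(), c))
                w, c = w + y_i * x_i, 1
            else:
                c += 1                       # correct: increase its weight
    history.append((w, c))
    return history

def voted_perceptron_predict(history, x):
    return np.sign(sum(c * np.sign(w @ x) for w, c in history))
```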
10.4.3 Performance Measures
To compare the performance of the classification methods for this task, we look at two
standard performance measures, F1 and accuracy.
10.4.4 Experimental Methodology
We perform standard 10-fold cross-validation on the set of documents. For the sentence-level approach, all sentences in a document are either entirely in the training set or entirely
in the test set for each fold. For significance tests, we use a two-tailed t-test [YL99] to
compare the values obtained during each cross-validation fold with a p-value of 0.05.
Feature selection was performed using the chi-squared statistic. Different levels of feature selection were considered for each classifier. Each of the following numbers of features
was tried: 10, 25, 50, 100, 250, 750, 1000, 2000, 4000. There are approximately 4700 unigram tokens without feature selection. In order to choose the number of features to use for
each classifier, we perform nested cross-validation and choose the settings that yield the
optimal document-level F1 for that classifier. For this study, only the body of each e-mail
message was used. Feature selection is always applied to all candidate features. That is, for
the n-gram representation, the n-grams and position features are also subject to removal by
the feature selection method.
The document-level classifiers below that use a bag-of-words representation use all
unigram tokens as the feature pool including sentence-ending markers and punctuation
such as “.”, “!”, and “?”. These classifiers are denoted as Document BoW. A second set
of document-level classifiers that also include n-grams in the feature pool are denoted as
Document Ngram. The sentence-level classifiers that use a bag-of-words representation
also use all unigram tokens in the feature pool including the sentence-ending punctuation.
These classifiers are denoted as Sentence BoW. Finally, the sentence-level classifiers denoted as Sentence Ngram additionally include n-grams and the encoding of the position of
the sentence within the document in the feature pool.
10.4.5 Baseline Results for Action-Item Detection
The results for document-level classification are given in Table 10.3. The primary hypothesis we are concerned with is that n-grams are critical for this task; if this is true, we expect
to see a significant gap in performance between the document-level classifiers that use
n-grams (Document Ngram) and those using the bag-of-words representation (Document
BoW). Examining Table 10.3, we observe that this is indeed the case for every classifier
except the Unigram classifier. This difference in performance produced by the n-gram representation is statistically significant for each classifier except for the Unigram classifier
and the accuracy metric for kNN (see Table 10.4). The Unigram classifier’s poor performance with the n-gram representation is not surprising since the bag-of-n-grams causes
excessive double-counting as mentioned in Section 10.3.2; however, the Unigram classifier
is not hurt at the sentence-level because the sparse examples provide few chances for agglomerative effects of double counting. In either case, when a language-modeling approach
is desired, modeling the n-grams directly may be preferable to using a multinomial naïve
Bayes model. More importantly for the n-gram hypothesis, the n-grams lead to the best
document-level classifier performance as well.

As would be expected, the difference between the sentence-level n-gram representation
and bag-of-words representation is small. This is because the window of text is so small
that the bag-of-words representation, when done at the sentence-level, implicitly picks up
on the power of the n-grams. Further improvement would signify that the order of the
words matters even when only considering a small sentence-size window. Therefore, the
finer-grained sentence-level judgments allow a bag-of-words representation to succeed —
but only when performed in a small window — behaving as an n-gram representation for
all practical purposes.

[Figure: precision-recall curves for the SVM, titled "Action-Item Detection SVM Performance (Post Model Selection)", comparing Document Unigram and Sentence Ngram.]

Figure 10.3: Both n-grams and a small prediction window lead to consistent improvements over
the standard approach.
Further highlighting the improvement from finer-grained judgments and n-grams, Figure 10.3 graphically depicts the edge the SVM sentence-level classifier has over the standard bag-of-words approach with a precision-recall curve. In the high precision area of the
graph, the consistent edge of the sentence-level classifier is rather impressive — continuing
at precision 1 out to 0.1 recall. This would mean that a tenth of the user’s action-items
would be placed at the top of their action-item sorted inbox. Additionally, the large separation at the top right of the curves corresponds to the area where the optimal F1 occurs
for each classifier, agreeing with the large improvement from 0.6904 to 0.7682 in F1 score.
Considering the relatively unexplored nature of classification at the sentence-level, this
gives great hope for further increases in performance.
Although Cohen et al. [CCM04] observed that the Voted Perceptron with a single
training iteration outperformed the SVM in a set of similar tasks, we see no such behavior
here. This further strengthens the evidence that an alternate classifier with the bag-of-words
representation could not reach the same level of performance. The Voted Perceptron
classifier does improve when the number of training iterations is increased, but it is still
lower than the SVM classifier.

Sentence detection results are presented in Table 10.6. With regard to the sentence
detection problem, we note that the F1 measure gives a better feel for the remaining room
for improvement in this difficult problem. That is, unlike document detection where action-item
documents are fairly common, action-item sentences are very rare. Thus, as in other
text problems, the accuracy numbers are deceptively high solely because of the default
accuracy attainable by always predicting "no". Although the results here are significantly
above random, it is unclear what level of performance is necessary for sentence detection to
be useful in and of itself and not simply as a means to document ranking and classification.

Figure 10.4: Users find action-items more quickly when assisted by a classification system.
Finally, when considering a new type of classification task, one of the most basic questions is whether an accurate classifier built for the task can have an impact on the end-user.
In order to demonstrate the impact this task can have on e-mail users, we conducted a user
study using an earlier less-accurate version of the sentence classifier — where instead of
using just a single sentence, a three-sentence windowed-approach was used. There were
three distinct sets of e-mail in which users had to find action-items. These sets were either
presented in a random order (Unordered), ordered by the classifier (Ordered), or ordered
by the classifier and with the center sentence in the highest confidence window highlighted
(Order+help). In order to perform fair comparisons between conditions, the overall number
of tokens in each message set should be approximately equal; that is, the cognitive reading load should be approximately the same before the classifier’s reordering. Additionally,
users typically show “practice effects” by improving at the overall task and thus performing
better at later message sets. This is typically handled by varying the ordering of the sets
across users so that the means are comparable. While omitting further detail, we note the
sets were balanced for the total number of tokens and a Latin square design was used to
balance practice effects.
Figure 10.4 shows that at intervals of 5, 10, and 15 minutes, users consistently found
significantly more action-items when assisted by the classifier, but were most critically
aided in the first five minutes. Although the classifier consistently aids the users, we did
not gain an additional end-user impact by highlighting. As mentioned above, this might
be a result of the large room for improvement that still exists for sentence detection, but
anecdotal evidence suggests this might also be a result of how the information is presented
to the user rather than the accuracy of sentence detection. For example, highlighting the
wrong sentence near an actual action-item hurts the user’s trust, but if a vague indicator
(e.g., an arrow) points to the approximate area, the user is not aware of the near-miss. Since
the user studies used a three sentence window, we believe this played a role as well as
sentence detection accuracy.
10.4.6 Discussion
In contrast to problems where n-grams have yielded little difference, we believe their power
here stems from the fact that many of the meaningful n-grams for action-items consist of
common words, e.g., “let me know”. Therefore, the document-level bag-of-words representation cannot gain much leverage, even when modeling their joint probability correctly,
since these words will often co-occur in the document but not necessarily in a phrase. Additionally, action-item detection is distinct from many text classification tasks in that a single
sentence can change the class label of the document. As a result, good classifiers cannot rely on aggregating evidence from a large number of weak indicators across the entire
document.
Even though we discarded the header information, examining the top-ranked features at
the document-level reveals that many of the features are names or parts of e-mail addresses
that occurred in the body and are highly associated with e-mails that tend to contain many
or no action-items. A few examples are terms such as “org”, “bob”, and “gov”. We note
that these features will be sensitive to the particular distribution (senders/receivers) and thus
the document-level approach may produce classifiers that transfer less readily to alternate
contexts and users at different institutions. This points out that part of the problem of
going beyond bag-of-words may be the methodology, and investigating such properties as
learning curves and how well a model transfers may highlight differences in models which
appear to have similar performance when tested on the distributions they were trained on.
Finally, the effectiveness of sentence-level detection argues that labeling at the sentence-level provides significant value. Further experiments would be required to see how this
interacts with the amount of training data available. Sentence detection that is then agglomerated to document-level detection works surprisingly well given the low recall that
would be expected with sentence-level items.
The baseline results have established how action-items can be effectively detected in
e-mails. Our empirical analysis has demonstrated that, in contrast to topic classification,
n-grams are of key importance to making the most of document-level judgments. When
finer-grained judgments are available, then standard bag-of-words approaches using a small
(sentence) window size can produce results almost as good as the n-gram based approaches.
10.5 Action-Item Detection vs. Topic Classification
Before we return to classifier combination, we highlight the differences that make action-item detection a different type of task than topic classification and thus suitable for demonstrating the applicability of our combination algorithms to other types of problems.
First, where topic classification has topical cues spread throughout the document, action-item detection often has a single sentence of interest and thus detecting that sentence is of
critical importance. While we have seen that this can be exploited using sentence-level
classifiers, it is of interest to see how these sentence-level classifiers can be integrated in a
reliability-indicator framework.
From a machine learning perspective, the action-item corpus also has an entirely different balance of positives and negatives than is typically seen for topic classification: action-item e-mails constitute 44% of the corpus. Thus, while striving may sometimes demand more positive training examples to work well (see Section 7.4), the more
balanced nature of this problem may reduce the demand for training data.
Next, unlike topic-classification we also have classifiers built from different “views” —
the document-level and the sentence-level. The next natural question is whether classifier
combination can make use of the different information that the document-models and the
sentence-models provide. We now turn to examine these issues of classifier combination.
10.6 Classifier Combination for Action-Item Detection
When considering classifier combination for action-item detection, there are several different challenges that present themselves. Key questions are how sentence-level classifier judgments can be combined into document-level judgments, how document-level and
sentence-level models can be combined, whether gains can be achieved from combining
different classification algorithms, and whether any changes to the reliability-indicators
used for topic classification are necessary in the Strive methodology.
Combining sentence-level judgments into document-level judgments provides a unique
challenge distinct from the primary combination problem discussed in this dissertation.
When we combine the models from different classification algorithms, we obtain a prediction from each model for every example we are considering. This is analogous to consulting an expert panel consisting of a fixed set of experts for each example. However,
when combining the sentence predictions to make a document-level prediction, there is no
consistency from document to document. The prediction on the first sentence in the first
document is generally not related to the prediction on any sentence in the second document.
The analogy here is that we consult a different set of experts for each example. This problem is beyond the scope of this dissertation, and we do not directly study the issues related
to alternative methods for combining sentence predictions into a document prediction. Instead, we will continue to use the default method of combining sentence-level predictions
to obtain a document-level prediction given in Section 10.3.2. Our focus in this section
will be solely on how we can combine the document predictions from the document-level
and sentence-level models to yield a more reliable document ranking and what reliabilityindicator changes are necessary for this task.
In particular, since a typical user will eventually process all received mail, we assume
that producing a quality ranking will more directly measure the impact on the user than
accuracy or F1. Therefore, in the remainder of the chapter our focus will be on ROC
curves and area under the curve since both reflect the quality of the ranking produced.
Additionally, while the previous section provided many different options in terms of representation and base classifiers, a typical question that comes up in classifier combination
is how well a combination method deals with a large set of classifiers. Therefore, rather
than pre-selecting among the best models for action-item detection, we will simply use all
five base classifiers we used in Chapter 7 with both the bag-of-words and bag-of-n-grams
representations at both the sentence-level and the document-level. This gives us a total of
20 different base classifiers. With only 744 e-mails in the action-item detection corpus,
we are interested in seeing whether Strive can not only improve but also avoid harming
performance because of correlated inputs. Furthermore, since it was STRIVE-S (norm) that
was the primary competitor, we are primarily interested in investigating the behavior of this
variant although we also provide results for the decision-tree metaclassifier for completeness. Finally, part of the value in any method is how much it must be tuned to the problem
at hand. Therefore, our focus is not only on applying Strive but doing so in as much of an
“out-of-the-box” manner as possible. To this end, we next address modifications needed for
the reliability indicators before presenting and analyzing results for classifier combination
for action-item detection.
10.7 Reliability Indicators for Action-Item Detection
First, for each of the document-level base classifiers, we can clearly still use the classifier-based reliability indicators introduced in Chapter 5. There were 6 unigram-model based
variables, 6 naïve Bayes variables, 10 kNN variables, 5 SVM variables, and 2 decision-tree
variables. Since we are constructing base classifiers for both the bag-of-words and bag-of-n-grams representations, this gives 58 reliability indicators (58 = 2 representations × [6 +
6 + 10 + 5 + 2]).
Although the classifier-based reliability indicators are defined for each sentence prediction, in order to use them at the document-level we must somehow combine the reliability
indicators over each sentence. The simplest method would be to take the average over each
classifier-based indicator across the sentences in the document. We do so and thus obtain
another 58 reliability indicators.
While combining the sentence-level predictions into a document-level prediction is outside the scope of our model, our model can benefit from some of the structure a sentence-level classifier offers when combining document predictions. Analogous to considering
the variance of feature weights in the naïve Bayes model, we can consider such indicators as the mean and standard deviation of the classifier confidences over each sentence
within the document. For each sentence-level base classifier, these then become two more
indicators (mean and standard deviation) which we can benefit from when combining document predictions. Since we are using the same set of 5 base classifiers as elsewhere in
the Strive experiments and we have base classifiers for the sentence-level bag-of-words
and sentence-level bag-of-n-grams, this introduces 20 variables (20 = 2 representations ∗
2(mean, stdev) ∗ 5 base classifiers).
Next, we could also extrapolate the feature-selection-based reliability indicators discussed in Chapter 5 to this problem. However, we do not do so for two reasons. First, while the
feature selection variables are easily defined for this problem, they are more problematic to
extrapolate to non-text classification problems since they generally rely on the sparse nature
of text. Thus, it is of interest to see how well the combination methods will perform using
only the classifier-based variables, which easily extrapolate to any classification problem.
Secondly, the dimensionality of the meta-problem is already quite high and given the small
amount of training data we have available, we seek to keep the dimensionality somewhat
manageable at the meta-level.
Finally, we include the same 2 basic voting statistics reliability-indicators (PercentPredictingPositive and PercentAgreeWBest) as discussed in Chapter 5. For the action-item
problem, this yields a total of 138 reliability-indicators (138 = 58 + 20 + 58 + 2). With the
20 base classifier outputs, there are a total of 158 features for the Strive combiner to handle.
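To keep the bookkeeping straight, the following sketch (ours; the names and the exact form of PercentAgreeWBest are illustrative assumptions) assembles the 158-feature meta-example:

```python
import numpy as np

def strive_meta_example(doc_indicators,    # the 58 document-level indicator values
                        sent_indicators,   # per-sentence lists of the same 58 indicators
                        sent_confidences,  # 10 sentence-level classifiers -> per-sentence confidences
                        base_outputs,      # outputs of the 20 base classifiers
                        best_index=0):     # index of the "best" base classifier (assumed)
    parts = [np.asarray(doc_indicators)]                        # 58
    parts.append(np.mean(np.asarray(sent_indicators), axis=0))  # 58: sentence averages
    for conf in sent_confidences:                               # 10 x (mean, stdev) = 20
        parts.append([np.mean(conf), np.std(conf)])
    votes = np.sign(np.asarray(base_outputs))
    parts.append([np.mean(votes > 0),                           # PercentPredictingPositive
                  np.mean(votes == votes[best_index])])         # PercentAgreeWBest (sketched)
    parts.append(np.asarray(base_outputs))                      # the 20 base outputs
    return np.concatenate([np.ravel(np.asarray(p, dtype=float)) for p in parts])  # 158 features
```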
10.8 Experimental Analysis of Combining Action-Item Detectors
10.8.1 Classifiers
As mentioned above, the base classifiers use the same set of 5 classification algorithms used
for the main Strive experiments in Chapter 7 and discussed in detail in Section 6.3. Namely,
we use a decision-tree via a dependency network implementation, a unigram (multinomial
naïve Bayes) classifier, a naïve Bayes (multivariate Bernoulli) classifier, a kNN classifier,
and a linear SVM classifier. As done with the earlier action-item detection experiments,
the kNN and SVM classifiers use a normed tfidf representation. Since we apply the classifiers at both the sentence-level and document-level to a bag-of-words and bag-of-n-grams
representation, we have a total of 20 base classifiers.
Also as done in Chapter 7, we try variants of both stacking and Striving using a linear
SVM and a decision tree. We also present the results of the oracle BestSelect classifier.
10.8.2 Performance Measures
As mentioned above, the primary performance measure we are concerned with is area under
the ROC curve as a measure of ranking performance. However, we present all of the
performance measures used in Chapter 7 to give a complete picture.
10.8.3 Experimental Methodology
As done in Section 10.4, we perform standard 10-fold cross-validation on the set of documents. We note that the 10 folds are a new random draw and not identical to the experiments
above. For the sentence-level approach, all sentences in a document are either entirely in
the training set or entirely in the test set for each fold. For significance tests, we use a
two-tailed t-test [YL99] to compare the values obtained during each cross-validation fold
with a p-value of 0.05.
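One way to realize this protocol (a sketch under our assumptions: scikit-learn's GroupKFold enforces the document grouping, though it does not itself randomize the draw, and model is any classifier with fit/score methods):

    import numpy as np
    from scipy.stats import ttest_rel
    from sklearn.model_selection import GroupKFold

    def fold_scores(model, X, y, doc_ids, n_folds=10):
        # 10-fold CV in which all sentences of a document fall entirely
        # inside one fold (grouping by doc_ids enforces this).
        scores = []
        for train, test in GroupKFold(n_splits=n_folds).split(X, y, groups=doc_ids):
            model.fit(X[train], y[train])
            scores.append(model.score(X[test], y[test]))
        return np.array(scores)

    def significantly_different(scores_a, scores_b, p_threshold=0.05):
        # Two-tailed paired t-test over the per-fold values, as in [YL99].
        _, p = ttest_rel(scores_a, scores_b)
        return p < p_threshold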
While the experiments in Section 10.4 performed nested cross-validation to automatically select the number of features used for each classification model, here we instead simply use the top 300 features by the chi-squared statistic at both the document-level and sentence-level for both the bag-of-words and bag-of-n-grams representations. Since there are approximately 4700 unigram tokens in total, this is roughly in line with the level of feature reduction performed for the topic classification corpora in Chapter 7.
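A sketch of this selection step (using scikit-learn as illustrative tooling; the toy corpus and small k below are ours):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    train_texts = ["please send the report", "lunch was great",
                   "can you review this by friday"]
    train_labels = [1, 0, 1]  # 1 = contains an action item

    # Keep the top-k features by the chi-squared statistic; the experiments
    # above use k = 300 (k = 3 here only because the toy corpus is tiny).
    X_counts = CountVectorizer().fit_transform(train_texts)
    selector = SelectKBest(chi2, k=3).fit(X_counts, train_labels)
    X_selected = selector.transform(X_counts)
    print(X_selected.shape)  # (3, 3)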
10.8.4 Results for Combining Action-Item Detectors
Table 10.7 presents the main summary of results. With regard to the earlier set of action-item detection experiments, several observations are worth noting. First, the Unigram classifier results are much higher for the document-level n-gram representation than in the earlier experiments. The reason is that the results in Section 10.7 used nested cross-validation to automatically select the number of features. In those experiments, the number of features selected for n-grams was much higher (an average of 1360, with some folds as low as 25 and some as high as 4000). Nested selection over the number of n-grams has much higher variance with regard to the Unigram model, and in the earlier experiments it often incorrectly overpicked the number of features. This allowed more opportunities for the Unigram model to double-count evidence (thus depressing performance).
In contrast, many of the results for kNN are much lower here than in the earlier experiments. The reason again lies in the feature selection. In the earlier experiments, the nested
cross-validation correctly selected an appropriate number of features and helped kNN attain
peak performance. The remaining models perform as would be expected from the earlier
experiments. Altogether, the overall maximum performance of the base classifiers is still in
line with the earlier experiments. Thus, we are still comparing to overall peak performance.
Now, we turn to the primary concern of whether Striving, and in particular STRIVE-S (norm), can improve the ranking of the documents. Examining the results in Table 10.7, we see that STRIVE-S (norm) statistically significantly beats every other classifier according to ROC area. If we restrict our attention to just the early part of the curve, we see that STRIVE-S (norm) still wins, but no longer significantly over all the stacking models and base classifiers. This behavior in the early part of the curve is why Stack-S (norm) attains better performance than STRIVE-S (norm) over the linear utility functions, although not significantly.
Furthermore, we can see that no method significantly beats the striving methods according to any measure. We can see this more clearly in Table 10.8 by restricting our attention
to the most competitive base classifiers (the sentence-level n-gram), the default combiners,
Stack-S (norm), and STRIVE-S (norm). While the success of the default combiners on error
and F1 suggests we might be able to further exploit the potential of the underlying base
models for combination, STRIVE-S (norm) still remains the clear best choice for ranking.
While the different balance of positives and negatives appears to allow STRIVE-S (norm) to give more acceptable ROC performance than in topic classification, we see that the ROC area performance of STRIVE-D (norm) is relatively lower. This is primarily because the decision-tree method stops growing the tree too early given the small number of training examples. As a result, the overall ranking has too coarse a granularity and poor performance.
Finally, we graphically compare the ROC performance of STRIVE-S (norm), Stack-S (norm), and two of the most competitive base classifiers in Figure 10.5. We see that STRIVE-S (norm) loses by a very slight amount to Stack-S (norm) for the very early part of the curve but still beats the base classifiers. Later in the curve, it dominates all the classifiers. If we examine the curves using error bars (standard deviation across cross-validation runs), we also see that the variance of STRIVE-S (norm) drops much faster than that of the other classifiers as we move to the right of the curve. Thus, across the runs STRIVE-S (norm) achieves a much more consistent quality of ranking than the other classifiers.
10.9 Summary
In this chapter, we first established the action-item detection problem as a text classification
problem that is different from topic classification from both a semantic point of view and
a machine learning perspective. We conducted experiments to demonstrate competitive
baselines and to demonstrate how both n-grams and differing sentence-level and document-level views could be used to build more effective classifiers.
Next, we demonstrated that the Strive classifier combination approach is applicable to text classifier combination outside of topic classification by combining the various base action-item detectors. We demonstrated Strive’s flexibility and applicability to a range of problems by using it in a very “out-of-the-box” manner that required nearly no changes from the earlier experiments. Furthermore, rather than pre-selecting the competitive base classifiers, we allowed the combination algorithm to automatically determine the
weights. STRIVE-S (norm) generated document rankings with a higher ROC area and with
less variation across folds than the other classifiers and combination methods.
Finally, since all of the reliability-indicators in this section are defined in terms of the
base classification models, we have demonstrated that the Strive methodology is readily
applicable to classification problems outside of text.
[Two ROC plots appear here (true positive rate vs. false positive rate, both axes 0 to 1), each comparing naïve Bayes (sent,ngram), SVM (sent,ngram), Stack-S (norm), and STRIVE-S (norm); the left plot omits and the right plot includes error bars.]
Figure 10.5: ROC curves without (left) and with (right) error bars over the action-item corpus for two of the most competitive base classifiers versus Stacking and Striving. We see that Striving dominates the base classifiers and loses only for a small portion of the curve to Stacking. As expected, the variance of all of the classifiers drops as we move to the right. However, the variance for Striving drops far quicker than the others. Both observations argue that Striving presents the most robust ranking of the documents.
Acknowledgments
The material in this chapter is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. NBCHD030010. Any opinions,
findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects
Agency (DARPA), or the Department of Interior-National Business Center (DOI-NBC).
We would like to extend our sincerest thanks to Jill Lehman, whose efforts in data collection were essential in constructing the corpus, and to both Jill and Aaron Steinfeld for their direction of the HCI experiments. We would also like to thank Django Wexler for constructing and supporting the corpus labeling tools and Curtis Huttenhower for his support of the text preprocessing package. Finally, we gratefully acknowledge Scott Fahlman for his encouragement and useful discussions on this topic.
F1
                   Document BoW     Document Ngram   Sentence BoW     Sentence Ngram
kNN                0.6670 ± 0.0288  0.7108 ± 0.0699  0.7615 ± 0.0504  *0.7790 ± 0.0460
Unigram            0.6572 ± 0.0749  0.6484 ± 0.0513  0.7715 ± 0.0597  *0.7777 ± 0.0426
SVM                0.6904 ± 0.0347  0.7428 ± 0.0422  0.7282 ± 0.0698  *0.7682 ± 0.0451
Voted Perceptron   0.6288 ± 0.0395  0.6774 ± 0.0422  0.6511 ± 0.0506  *0.6798 ± 0.0913

Accuracy
kNN                0.7029 ± 0.0659  0.7486 ± 0.0505  0.7972 ± 0.0435  *0.8092 ± 0.0352
Unigram            0.6074 ± 0.0651  0.5816 ± 0.1075  0.7863 ± 0.0553  *0.8145 ± 0.0268
SVM                0.7595 ± 0.0309  0.7904 ± 0.0349  0.7958 ± 0.0551  *0.8173 ± 0.0258
Voted Perceptron   0.6531 ± 0.0390  *0.7164 ± 0.0376 0.6413 ± 0.0833  0.7082 ± 0.1032

Table 10.3: Average document-detection performance during cross-validation for each method, with the sample standard deviation (S_{n-1}) given after the ± sign. The best performance for each classifier is marked with *.
                     Document Winner   Sentence Winner
kNN                  Ngram             Ngram
Unigram Classifier   BoW               Ngram
SVM                  Ngram†            Ngram
Voted Perceptron     Ngram†            Ngram

Table 10.4: Significance results for n-grams versus a bag-of-words representation for document detection using document-level and sentence-level classifiers. When the F1 result is statistically significant, it is shown in bold. When the accuracy result is significant, it is shown with a †. This table shows that the hypothesis that n-grams, or a “bag of words and phrases,” outperform a simple “bag of words” does, in fact, hold.
                     F1 Winner   Accuracy Winner
kNN                  Sentence    Sentence
Unigram Classifier   Sentence    Sentence
SVM                  Sentence    Sentence
Voted Perceptron     Sentence    Document

Table 10.5: Significance results for sentence-level classifiers vs. document-level classifiers for the document detection problem. When the result is statistically significant, it is shown in bold. This table shows that the hypothesis that a sentence-level classifier outperforms a document-level classifier does, in fact, hold.
                     Accuracy           F1
                     BoW      Ngram     BoW      Ngram
kNN                  0.9519   0.9536    0.6540   0.6686
Unigram Classifier   0.9419   0.9550    0.6176   0.6676
SVM                  0.9559   0.9579    0.6271   0.6672
Voted Perceptron     0.8895   0.9247    0.3744   0.5164

Table 10.6: Performance of the sentence-level classifiers at sentence detection.
                 F1       Error    C(1,10)    C(10,1)  ROC Area     ROC Area [0,0.1]

Document-Level, Bag-of-Words Representation
Dnet             0.7398   0.2244   0.5593     0.3856   0.8423       0.4974
Unigram          0.6905   0.3091   0.5513     0.4300   0.7537       †0.1729
naïve Bayes      0.6729   0.2688   0.5432     0.4692   0.7745       0.2773
SVM              0.6918   0.2472   0.5392     0.3507   0.8367       0.4959
kNN              0.6695   0.3467   0.5660     0.4166   0.7669       0.2822

Document-Level, Ngram Representation
Dnet             0.7412   0.2110   0.6228     0.4005   0.8473       0.5458
Unigram          0.7361   0.2729   0.5581     0.4719   0.8114       0.2787
naïve Bayes      0.7534   0.1896   0.5405     0.4424   0.8537       0.5069
SVM              0.7392   0.2124   0.5780     0.3361   0.8640       0.5503
kNN              0.7021   0.2539   0.5251     0.4452   0.8244       0.4607

Sentence-Level, Bag-of-Words Representation
Dnet             0.7793   0.1894   0.5227     0.3308   0.8885       0.6164
Unigram          0.7731   0.2136   0.5815     †0.4746  0.8645       0.4637
naïve Bayes      0.7888   0.1893   0.5653     0.4353   0.8699       0.4869
SVM              0.6985   0.1988   0.5496     0.4031   0.8548       0.5822
kNN              †0.6328  †0.3803  0.5629     0.4098   †0.6823      0.2065

Sentence-Level, Ngram Representation
Dnet             0.7521   0.1841   0.5335     0.3577   0.8723       0.5974
Unigram          0.8012   0.1868   0.6126     0.4503   0.8723       0.5362
naïve Bayes      0.8010   0.1747   0.5857     0.4018   0.8777       0.5413
SVM              0.7842   *0.1693  0.5768     0.4073   0.8620       0.5963
kNN              0.6811   0.2647   0.5615     0.3872   0.8078       0.4424

Default Combiners
Majority         *0.8038  0.1761   0.5511     0.3844   N/A          N/A
Best By Class    0.8006   0.1734   †0.6235    0.3535   N/A          N/A

Stacking
Stack-D (norm)   0.7885   0.1828   0.6046     0.3644   0.8752       0.5444
Stack-S (norm)   0.7765   0.1814   *0.4797^S  *0.2970  0.8996^S     0.6400^S

Striving
STRIVE-D (norm)  0.7718   0.1949   0.5512     0.3724   0.8724       0.5333
STRIVE-S (norm)  0.7813   0.1868   0.5056     0.3134   *0.9145^BSR  *0.6436^R

Oracle
BestSelect       0.9894   0.0134   0.2486     0.1411   N/A          N/A

Table 10.7: Average base classifier and classifier combination performance during cross-validation over the Action-Item Detection Corpus. The best performance (omitting the oracle BestSelect) in each column is marked with a leading *; the worst with a leading †. A superscript of ‘B’, ‘D’, ‘S’, or ‘R’ indicates a method significantly outperforms all (other) Base classifiers, Default combiners, Stacking methods, or Reliability-indicator based Striving methods at the p = 0.05 level using a two-tailed t-test.
                             F1        Error    C(1,10)  C(10,1)  ROC Area  ROC Area [0,0.1]
B: Dnet (sent,ngram)         –         R        B,D      B        –         B
B: Unigram (sent,ngram)      B,R,S     –        –        –        –         –
B: naïve Bayes (sent,ngram)  R,S       R,S      –        –        B         –
B: SVM (sent,ngram)          R,S       B,D,R,S  –        –        –         –
B: kNN (sent,ngram)          –         –        –        –        –         –
D: Majority                  B,D,R,S   R,S      D        –        N/A       N/A
D: Best By Class             R,S       D,R,S    –        B,D      N/A       N/A
S: Stack-S (norm)            –         R        B,D,R    B,D,R    B         B
R: STRIVE-S (norm)           S         –        B,D      B,D      B,R,S     B,S

Table 10.8: Summary of performance on the action-item detection task. The columns show the group names for which the row method is better (restricted to just those shown here). “Better” here means a better average across cross-validation runs. When statistically significantly better (by two-sided t-test, p = 0.05), results are printed in a red bold italic font.
Chapter 11
Summary and Future Work
A classification algorithm uses datapoints that have been labeled with classes by an authority or human to learn a model that, with high accuracy, can automatically predict the class
the authority would have assigned to future instances. For example, the datapoint might
be a particular stock and its class either “up” or “down” depending on how its value will
change over the next 24 hours, or the datapoint might be a particular e-mail whose class
could be the folder in which the receiver will place that e-mail. In the first case, we would like to predict which way a stock will move before the move occurs, while in the second our goal is to automatically sort e-mails to save the user time. In both cases, these challenges can be
approached by using machine learning algorithms to build a statistical model which will
predict the class of an example based on past observations.
Decision trees, kNN, SVMs, language models, and naïve Bayes are a few of the classification algorithms that have been developed over the years. Generally, each of these models is designed using a different set of assumptions regarding the data. For example, they can be dichotomized as to whether they are linear vs. non-linear models, generative vs. discriminative, or high-bias vs. low-bias. While some classification algorithms often work well, none of them dominates all classification problems. Furthermore, even when one classification algorithm significantly outperforms another for a given classification problem, it is rarely the case that the worse classifier’s errors are a superset of the better classifier’s.
This fact has long motivated the desire to combine models in order to obtain better, or more
robust, overall performance. Schemes to do this have varied widely from simple voting
to metaclassifiers that model how the base classifiers interact. However, we are also faced
with the result that obtaining an optimal combination or meta-algorithm over all problems
is not possible [Wol95]. As a result, the key is to construct a combination algorithm that
performs well with respect to many of the commonly observed behaviors within a domain.
In this dissertation, we focused on text classification, which plays a key role in a variety
of applications. Furthermore, with the surge in digital text media, text classification has
become increasingly important. Text classification techniques can assist in junk e-mail detection [SDHH98], allow medical doctors to more rapidly find relevant research [HBLH94],
aid in patent searches [Lar99], improve web searches [CD00], and serve as a backend in
a multitude of other applications. In order to effectively combine predictions within this
domain, it is necessary to first understand the typical behavior of classification methods
within the domain.
Therefore, in Chapter 3 we demonstrated that when recalibrating the probabilities or
log-odds estimates of classification systems, an asymmetric or piecewise linear model is
preferable to Gaussian or linear systems. Furthermore, we showed why recalibration is
necessary for specific classifiers as well as why classifiers can be expected to behave asymmetrically in general. Finally, we demonstrated how an appropriate statistical family of
asymmetric distributions could be efficiently fit to the data to yield improved probability
estimates. Recalibration is an important subcase of combination, because, in addition to
helping understand the nature of the classifier outputs, a metaclassifier applied to a single
input base classifier is equivalent in many senses to recalibrating the base classifier.
In Chapter 4 we expanded upon the view of a metaclassifier applied to a single base
classifier as recalibration. Extending this analysis to a series of canonical examples, we
showed how calibration, dependence, and variance play a role in classifier combination. We
formalized these concepts and extended them to both global and local settings. We further
emphasized through synthesized data how these quantities can be precisely computed and
how to make use of them in classifier combination.
Because these values cannot be computed without the true-class and posterior information, in Chapter 5 we motivated and developed a series of reliability-indicators whose
goal is to capture quantities related to the local reliability, dependence, and variance of
the classifiers. For example, kNNShiftStdDevConfDiff captures how much the output of the
kNN score function varies as the example being classified moves toward each of its neighbors. We go on to define 70 such variables tied to feature selection and various aspects of
classification models, such as when decision trees and linear SVMs are unlikely to work
well for a particular example.
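In the spirit of that example, such an indicator might be computed roughly as follows (a hypothetical sketch; score_fn, the shift fraction alpha, and the aggregation are our assumptions rather than the definition from Chapter 5):

    import numpy as np

    def knn_shift_stddev(score_fn, x, neighbors, alpha=0.5):
        # Shift the query point x part of the way toward each of its k
        # nearest neighbors (all numpy vectors) and measure how much the
        # kNN score varies relative to the unshifted score.
        base = score_fn(x)
        shifted = [score_fn(x + alpha * (nb - x)) for nb in neighbors]
        return float(np.std([s - base for s in shifted]))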
We then delved into the key contribution of this dissertation in Chapter 7 where we
demonstrated a metaclassifier approach, STRIVE, which builds a stacked, reliability-indicator-variable-based ensemble using the classifier outputs and reliability indicators. The resulting
model performs combination based on the characteristics of the particular example, and because of the reliability indicators, the model can take into account the local behavior of the
base classifiers when weighting their combination. STRIVE both extends the known ceiling
of performance by 3-18% across various performance measures in a variety of text classification corpora and also outperforms standard approaches such as a constant-weighted linear
combination of the classifier outputs. In addition to improving prediction performance, this
work also points the way for how the base classifiers can be changed to directly account for
these instabilities. Furthermore, since the majority of these variables are defined in terms
of model properties instead of domain properties, the approach is reusable outside of text
classification.
Since many of the reliability-indicators interact with the classifier outputs in similar
ways across different text classification problems, a natural question is whether the STRIVE
combination model from one dataset can be transferred to another. Since labeled training
data is at a premium and even more crucial for meta-models, such work can alleviate the
need for training data. Using LABEL (Layered Abstraction-Based Ensemble Learning) in Chapter 8, we showed how to do just that, and the resulting model further improved performance.
In Chapter 9, we considered adapting online methods to the problem of classifier combination. While the methods did not suffer any large losses, as guaranteed, our empirical analysis highlighted their lack of competitiveness compared to standard batch classification algorithms used as metaclassifiers. In our analysis, we also discussed how a regret guarantee with respect to the average loss of the experts is far weaker than the needed type of guarantee with respect to the loss of the averaged prediction.
Since the majority of the corpora used for experimentation have been topic classification datasets, in Chapter 10 we turned to less standard text classification problems such
as finding e-mails containing “action-items” and identifying the particular sentences of interest within them. We demonstrated how labeled data at the sentence-level can be used
to create sentence-level classifiers that are combined into more effective document-level
classifiers, how feature representation trade-offs differ in this task from topic classification,
and how users aided by an action-item detection system find action-items more quickly. We
then applied STRIVE to this problem in an out-of-the-box manner and also achieved performance gains for this task. The resulting combination consistently led to improved rankings with less performance variance over the training splits than the alternative methods; this provided evidence that the STRIVE system is a widely applicable approach.
Consider our thesis statement:
Context-dependent combination procedures provide an effective way of combining classifiers that are generally superior to constant-weighted linear combinations of the classifiers’ estimates of the posterior or log-odds. Furthermore,
context can be leveraged in text classifier combination via an abstraction of the
local reliability, dependence, and variance of the base classifier outputs. Finally, these abstractions help identify opportunities for data re-use that can be
employed to significantly improve classification performance.
and our primary criteria for evaluation:
As a demonstration of the suitability of these methods for text classification
though, we set the goal of statistically significantly outperforming the current
state-of-the-art base classification methods over several standard text classification corpora. Furthermore, since we argue that our representation of context
is key, we will empirically demonstrate that these methods outperform simple
constant-weighted combinations of the classifier outputs in some corpora and,
in the remaining ones, achieve a statistically negligible difference.
We have clearly established that STRIVE, a context-dependent combination approach, provides an effective way of combining classifiers that is generally superior to constant-weighted linear combinations of the classifiers’ estimates. We also showed that it significantly outperforms the base classifiers in a variety of corpora and usually significantly
outperforms linear combinations of the classifier outputs. Besides presenting arguments of
the importance of the local reliability, dependence, and variance of the base classifier outputs, we also introduced reliability-indicators that capture aspects of these quantities. With
our introduction of the LABEL extension of STRIVE to an inductive transfer framework,
we showed how these abstractions help identify opportunities for data re-use and further
improve performance. Thus, we have achieved all aspects of our key claims.
11.1 Key Contributions
Other combination approaches have used some form of non-constant weights on the classifiers — such as local cascade generalization [Gam98a, Gam98b], hierarchical mixture
of experts [JJNH91, JJ94], and stacking using an appropriate non-linear metaclassifier
[Wol92]. In contrast to these methods, our work makes two key contributions. First, we
demonstrate improvement relative to state-of-the-art base classifiers that have been tuned to
be as competitive as possible and not simple strawmen. Second, our approach focused on
richer definitions of locality. This can be crucial since using all of the base features within
the metaclassifier is often not feasible for text problems because of the high dimensionality
of text. Furthermore, the low-dimensional representation of locality enables us to begin
to understand why the classifiers fail, in addition to benefiting in performance from detecting such failures. Additionally, in contrast to using the base features at the meta-level, the reliability indicators provide a representation which enables us to extend the methods of inductive transfer to classifier combination.
Many previous combination approaches (e.g., metaselection [LL01]) define features
related to classifier performance, but then use those features only to select a single classifier
to apply to the problem. In contrast, our approach gives a more general way to blend
classifiers based on the specific documents. For features that may be relevant to choosing
a classifier for a problem but static across all documents within that problem (e.g., number
of training examples), our work on Inductive Transfer demonstrates how to extend our
combination framework to obtain the benefits of both metaselection and combination.
Typical multitask learning and inductive transfer approaches rely on the input space
being the same [TO96, CK97, Car97]. We have identified how classifier combination can
be transformed to be within this framework and demonstrated that it leads to performance
gains when using data in conjunction across corpora. This helps alleviate the need for
training data when training metaclassification models. More finely tuned inductive transfer
methods may be able to offer even more improvement.
In addition to these key contributions there are a variety of other contributions throughout the dissertation. Of those, some of the more important ones follow. To our knowledge,
we were the first to show the empirical asymmetry of classifier probability estimates, explain why these tend to occur, and develop parametric recalibration methods that can exploit this fact. Also, in establishing the differences between action-item detection and topic
classification, we were the first to analyze in detail the trade-off between sentence-level
and document-level judgments as well as analyze the types of models that result. Finally,
integrating sentence-level and document-level judgments in the STRIVE framework is also
one of the first examples of a combination of sentence-level and document-level models.
11.2 Directions for Future Work
Many immediate directions for future work are discussed throughout the dissertation. In this section, we highlight the more promising and larger-scope problems.
An interesting problem is to generalize the recalibration framework discussed in Chapter 3 to include abstaining. That is, the goal would be to recalibrate a classifier’s probability
estimates so that the post-processing either issues an improved estimate or “abstains”. Such
a system would be of use in a variety of contexts — including being more readily applicable to online frameworks like sleeping experts and BMX discussed in Chapter 9. One obvious path to pursue would be to use a parametric model where points can be omitted during the training phase according to a penalty function that increases with the number of points omitted. Thus, learning the model would be a trade-off between the fit of the model and the omission penalty. Additionally, it could include the fit of the model that predicts when to omit if the rule for omission is data-driven.
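As a sketch of this idea (the hooks fit_params and neg_log_lik stand in for whatever parametric recalibration family is chosen; the greedy loop and per-point penalty lam are our illustrative assumptions, not a worked-out method):

    import numpy as np

    def fit_with_abstention(fit_params, neg_log_lik, scores, labels, lam=1.0):
        # Greedily drop the worst-fit point as long as the gain in fit
        # exceeds the per-point omission penalty lam; the dropped points
        # define the region where the recalibrator would abstain.
        scores, labels = np.asarray(scores), np.asarray(labels)
        keep = np.ones(len(scores), dtype=bool)
        params = fit_params(scores, labels)
        while keep.sum() > 1:
            losses = np.array([neg_log_lik(params, s, y)
                               for s, y in zip(scores, labels)])
            losses[~keep] = -np.inf          # already omitted
            worst = int(np.argmax(losses))
            if losses[worst] <= lam:         # fit gain no longer beats penalty
                break
            keep[worst] = False              # omit this point ("abstain")
            params = fit_params(scores[keep], labels[keep])
        return params, ~keep                 # model plus abstention points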
A second promising area related to recalibration is deriving a graphical model similar
to the simple model given in Figure 4.1 that treats the true log-odds as a latent variable. Because even straightforward parameterizations for this simple framework lead to the type of
asymmetry seen in practice (Figure 4.2), we are hopeful that a similar graphical model will
also better capture the recalibration process. After finding a formulation that is successful
for recalibration, it could be extended to combination by replicating the model.
A challenging but potentially very rewarding problem is to study how the base classifier
algorithms can be directly modified to capture the information some of the more useful reliability indicators are currently capturing. In particular, in Tables 7.5 and 7.6, we see that the
kNN-based variables and, to a lesser extent, the SVM variables seem to play an important
role. If the classifiers can be directly modified, it may directly provide for more reliable
predictions. Depending on what modifications are necessary, there is also a potential for a
reduction in the computational cost of estimating these and training the metaclassification
models.
While we have demonstrated that inductive transfer can be successfully applied to classifier combination, we have merely scratched the surface of how much further improvement
can be gained via inductive transfer. Investigating whether new inductive transfer methods
(e.g., [Zha05]) are applicable and can lead to further improvements in this area is a problem with
large potential.
Finally, continuing to identify and define reliability-indicator variables tied to the reliability of classifiers remains a challenging, open research problem. While much of our attention has been focused on the development of these variables, it remains an attractive area of future research.
11.3 Summary
This dissertation focused on developing a metaclassification scheme, STRIVE, which used the outputs of classifiers (decision trees, kNN, linear SVMs, and two variants of naïve Bayes) in conjunction with a set of reliability-indicators we defined. The resulting model
used the training data to automatically detect regions of poor classifier reliability and generate a more reliable combined prediction. STRIVE extended the known ceiling of performance over state-of-the-art classifiers by 3-18% across various performance measures
in a variety of text classification corpora. More importantly, we achieved our key goal of
empirically validating the thesis that locality-based metaclassifiers generally outperform
constant-weighted linear combinations of classifier outputs. Furthermore, we improved
over alternative approaches such as stacking using decision trees. In addition, since labeled
training data is at a premium, we demonstrated how labeled data from one problem can be
inductively transferred to improve the combination model used for another problem. Finally, after illustrating how text classification tasks such as finding “action-items” in e-mail
differ from more common text classification tasks like topic classification, we developed
methods to effectively classify such e-mails. Using these methods as base classifiers, we
then applied STRIVE in an out-of-the-box fashion to improve overall performance on this
task as well. In conclusion, the STRIVE system is a widely applicable approach that has
already yielded the best known performance in a number of problems and holds promise
for others.
Bibliography
[Abr63]
Norman Abramson. Information Theory and Coding. McGraw-Hill, New
York, 1963.
[ACD+ 98]
James Allan, Jaime Carbonell, George Doddington, Jonathan Yamron, and
Yiming Yang. Topic detection and tracking pilot study: Final report. In Proceedings of the DARPA Broadcast News Transcription and Understanding
Workshop, Washington, D.C., 1998.
[ADW94a]
Chidanand Apte, Fred Damerau, and Sholom M. Weiss. Automated learning
of decision rules for text categorization. ACM Transactions on Information
Systems, 12(3):233–251, July 1994.
[ADW94b]
Chidanand Apte, Fred Damerau, and Sholom M. Weiss. Towards language
independent automated learning of text categorization models. In SIGIR ’94,
pages 23–30, 1994.
[AKTV+ 01] Khalid Al-Kofahi, Alex Tyrrell, Arun Vacher, Tim Travers, and Peter Jackson. Combining multiple classifiers for text categorization. In CIKM ’01,
Proceedings of the 10th ACM Conference on Information and Knowledge
Management, pages 97–104, November 2001.
[BCB94]
Brian T. Bartell, Garrison W. Cottrell, and Richard K. Belew. Automatic
combination of multiple ranked retrieval systems. In SIGIR ’94, Proceedings
of the 17th Annual International ACM Conference on Research and Development in Information Retrieval, pages 173–181, 1994.
[BCCC93]
N. Belkin, C. Cool, W.B. Croft, and J.P. Callan. The effect of multiple query
representations on information retrieval system performance. In SIGIR ’93,
Proceedings of the 16th Annual International ACM Conference on Research
and Development in Information Retrieval, pages 339–346, 1993.
[BDH02]
Paul N. Bennett, Susan T. Dumais, and Eric Horvitz. Probabilistic combination of text classifiers using reliability indicators: Models and results. In
SIGIR ’02, Proceedings of the 25th Annual International ACM Conference on
Research and Development in Information Retrieval, pages 207–214, August
2002.
[BDH05]
Paul N. Bennett, Susan T. Dumais, and Eric Horvitz. The combination of
text classifiers using reliability indicators. Information Retrieval, 8:67–100,
2005.
[Ben00]
Paul N. Bennett. Assessing the calibration of naive bayes’ posterior estimates.
Technical Report CMU-CS-00-155, Carnegie Mellon, School of Computer
Science, 2000.
[Ben02]
Paul N. Bennett. Using asymmetric distributions to improve classifier probabilities: A comparison of new and standard parametric methods. Technical Report CMU-CS-02-126, Carnegie Mellon, School of Computer Science,
2002.
[BF95]
Leo Breiman and Jerome H. Friedman. Predicting multivariate responses
in multiple linear regression. Technical report, November 1995. ftp://ftp.stat.berkeley.edu/pub/users/breiman/curds-whey-all.ps.Z.
[BFOS84]
L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and
Regression Trees. Wadsworth International Group, Belmont, CA, 1984.
[BG98]
Kurt D. Bollacker and Joydeep Ghosh. A supra-classifier architecture for
scalable knowledge reuse. In ICML ’98, pages 64–72, 1998.
[BK99]
Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105–139, July 1999.
[Blu97]
Avrim Blum. Empirical support for winnow and weighted-majority algorithms: Results on a calendar scheduling domain. Machine Learning,
26(1):5–23, 1997.
[BM05]
Avrim Blum and Yishay Mansour. From external to internal regret. In Conference on Computational Learning Theory, 2005.
[Bre96]
Leo Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
[Bri50]
G.W. Brier. Verification of forecasts expressed in terms of probability.
Monthly Weather Review, 78:1–3, 1950.
[Car96]
Jean Carletta. Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249–254, 1996.
[Car97]
Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, July
1997.
[Car02]
John Carroll. High precision extraction of grammatical relations. In Proceedings of the 19th International Conference on Computational Linguistics
(COLING), pages 134–140, 2002.
[CBM04]
Aron Culotta, Ron Bekkerman, and Andrew McCallum. Extracting social
networks and contact information from email and the web. In CEAS-2004
(Conference on Email and Anti-Spam), Mountain View, CA, July 2004.
[CCM04]
William W. Cohen, Vitor R. Carvalho, and Tom M. Mitchell. Learning to
classify email into “speech acts”. In EMNLP-2004 (Conference on Empirical
Methods in Natural Language Processing), pages 309–316, 2004.
[CD00]
Hao Chen and Susan T. Dumais. Bringing order to the web: Automatically
categorizing search results. In CHI ’00, pages 145–152, 2000.
[CH67]
T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21–27, 1967.
[CHM97]
D.M. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to
learning Bayesian networks with local structure. In UAI ’97, Proceedings of
the 13th Conference on Uncertainty in Artificial Intelligence, pages 80–89,
1997.
[CK97]
William W. Cohen and Daniel Kudenko. Transferring and retraining learned
information filters. In AAAI ’97, pages 583–590, 1997.
[CNM04]
Rich Caruana and Alexandru Niculescu-Mizil. Ensemble selection from libraries of models. In International Conference on Machine Learning (ICML
2004), 2004.
[CORGC04] Simon Corston-Oliver, Eric Ringger, Michael Gamon, and Richard Campbell. Task-focused summarization of email. In Text Summarization Branches
Out: Proceedings of the ACL-04 Workshop, pages 43–50, 2004.
[CS99]
William W. Cohen and Yoram Singer. Context-sensitive learning methods for
text categorization. ACM Transactions on Information Systems, 17(2):141–
173, 1999.
[CST00]
Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector
Machines and other kernel-based learning methods. Cambridge University
Press, Cambridge, UK, 2000.
[CV95]
C. Cortes and V. Vapnik. Support-vector networks. Machine Learning,
20:273–297, November 1995.
[DC00]
Susan T. Dumais and Hao Chen. Hierarchical classification of web content. In
SIGIR ’00, Proceedings of the 23rd Annual International ACM Conference on
Research and Development in Information Retrieval, pages 256–263, 2000.
[DF83]
Morris H. DeGroot and Stephen E. Fienberg. The comparison and evaluation
of forecasters. Statistician, 32:12–22, 1983.
[DF86]
Morris H. DeGroot and Stephen E. Fienberg. Comparing probability forecasters: Basic binary concepts and multivariate extensions. In P. Goel and
A. Zellner, editors, Bayesian Inference and Decision Techniques. Elsevier
Science Publishers B.V., 1986.
[DGL96]
Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of
Pattern Recognition. Springer-Verlag, New York, NY, 1996.
[DHS01]
Richard Duda, Peter Hart, and David Stork. Pattern Classification. John
Wiley & Sons, Inc., New York, NY, 2001.
[Die00]
Thomas Dietterich. Ensemble methods. In Josef Kittler and Fabio Roli,
editors, MCS ’00, Proceedings of the 1st International Workshop on Multiple Classifier Systems, number 1857 in Lecture Notes in Computer Science,
pages 1–15. Springer, 2000.
[Dom94]
Pedro Domingos. The RISE system: Conquering without separating. In Proceedings of the Sixth IEEE International Conference on Tools with Artificial
Intelligence, pages 704–707. IEEE Computer Society Press, 1994.
[DP96]
Pedro Domingos and Michael Pazzani. Beyond independence: Conditions
for the optimality of the simple bayesian classifier. In ICML ’96, 1996.
[DPHS98]
S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning
algorithms and representations for text categorization. In CIKM ’98, Proceedings of the 7th ACM Conference on Information and Knowledge Management, pages 148–155, 1998.
[Fla03]
Peter Flach. The geometry of ROC space: Understanding machine learning
metrics through ROC isometrics. In ICML ’03, pages 194–201, 2003.
[FMS04]
Yoav Freund, Yishay Mansour, and Robert E. Schapire. Generalization
bounds for averaged classifiers (how to be a Bayesian without believing).
The Annals of Statistics, 32(4):1698–1722, August 2004.
[Fre95]
Yoav Freund. Boosting a weak learning algorithm by majority. Information
and Computation, 121(2):256–285, 1995.
[Fre98]
Dayne Freitag. Multistrategy learning for information extraction. In ICML
’98, 1998.
[Fri77]
J.H. Friedman. A recursive partitioning decision rule for non-parametric
classification. IEEE Transactions on Computers, pages 404–408, 1977.
[FS97]
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of
on-line learning and an application to boosting. Journal of Computer and
System Sciences, 55(1):119–139, August 1997.
[FS99]
Yoav Freund and Robert Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296, 1999.
[FSSW97]
Yoav Freund, Robert E. Schapire, Yoram Singer, and Manfred K. Warmuth.
Using and combining predictors that specialize. In STOC ’97, Proceedings
of the Twenty-Ninth Annual ACM Symposium on the Theory of Computing,
pages 334–343, 1997.
[Gam98a]
João Gama. Combining classifiers by constructive induction. In ECML ’98,
Proceedings of the 10th European Conference on Machine Learning, pages
178–189, 1998.
[Gam98b]
João Gama. Local cascade generalization. In ICML ’98, Proceedings of the
15th International Conference on Machine Learning, pages 206–214, 1998.
[GCSR95]
Andrew B. Gelman, John S. Carlin, Hal S. Stern, and Donald B. Rubin.
Bayesian Data Analysis. Chapman & Hall/CRC, 1995.
[Goo52]
I.J. Good. Rational decisions. Journal of the Royal Statistical Society, Series
B, 1952.
[GZ86]
Prem K. Goel and Arnold Zellner, editors. Bayesian Inference and Decision
Techniques: Essays in Honor Of Bruno De Finetti. Elsevier, 1986.
[HBH88]
E.J. Horvitz, J.S. Breese, and M. Henrion. Decision theory in expert systems
and artificial intelligence. International Journal of Approximate Reasoning,
Special Issue on Uncertain Reasoning, 2:247–302, 1988.
[HBLH94]
W. Hersh, C. Buckley, T. Leone, and D. Hickam. OHSUMED: An interactive
retrieval evaluation and new large test collection for research. In SIGIR ’94,
Proceedings of the 17th Annual International ACM Conference on Research
and Development in Information Retrieval, pages 192–201, 1994.
[HCM+ 00] D. Heckerman, D.M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie.
Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1:49–75, 2000.
[HMRV98] Jennifer A. Hoeting, David Madigan, Adrian E. Raftery, and Chris T. Volinsky. Bayesian model averaging. Technical Report 9814, Department of Statistics, Colorado State University, May 1998.
[HMS66]
E.B. Hunt, J. Marin, and P.J. Stone. Experiments in Induction. Academic
Press, New York, 1966.
[HPS96]
David Hull, Jan Pedersen, and Hinrich Schuetze. Method combination for
document filtering. In SIGIR ’96, Proceedings of the 19th Annual International ACM Conference on Research and Development in Information Retrieval, pages 279–287, 1996.
[HR99]
David A. Hull and Stephen Robertson. The TREC-8 filtering track final report. In E. M. Voorhees and D. K. Harman, editors, NIST Special Publication 500-246: The Eighth Text REtrieval Conference (TREC-8), pages 35–56.
Department of Commerce, National Institute of Standards and Technology,
1999.
[HS90]
Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001, October 1990.
[HTF01]
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag,
2001.
[JJ94]
Michael I. Jordan and Robert A. Jacobs. Hierarchical mixtures of experts and
the EM algorithm. Neural Computation, 6:181–214, 1994.
[JJNH91]
Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79–87, 1991.
[Joa98]
Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In ECML ’98, Proceedings of the 10th European Conference on Machine Learning, pages 137–142, 1998.
[Joa99]
Thorsten Joachims. Making large-scale svm learning practical. In Bernhard
Schölkopf, Christopher J. Burges, and Alexander J. Smola, editors, Advances
in Kernel Methods - Support Vector Learning, pages 41–56. MIT Press, 1999.
[Joa02]
Thorsten Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms. Kluwer, 2002.
[Kah04]
Joseph M. Kahn. Bayesian Aggregation of Probability Forecasts on Categorical Events. PhD thesis, Stanford University, June 2004. Department of
Engineering-Economic Systems.
[KBS97]
Ron Kohavi, Barry Becker, and Dan Sommerfield. Improving simple bayes.
In ECML ’97 (poster), Proceedings of the 10th European Conference on Machine Learning, pages 78–87, 1997.
[KC00]
Hillol Kargupta and Philip Chan, editors. Advances in Distributed and
Parallel Knowledge Discovery. AAAI Press/MIT Press, Cambridge, Massachusetts, 2000.
[KKP01]
Samuel Kotz, Tomasz J. Kozubowski, and Krzysztof Podgorski. The Laplace
Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering, and Finance. Birkhäuser, 2001.
[Kle99]
Lawrence A. Klein. Sensor and Data Fusion Concepts and Applications.
Society of Photo-optical Instrumentation Engineers, 2nd edition, 1999.
[KMT+ 82]
J. Katzer, M. McGill, J. Tessier, W. Frakes, and P. DasGupta. A study of the
overlap among document representations. Information Technology: Research
and Development, 1:261–274, 1982.
[Lar99]
Leah S. Larkey. A patent search and classification system. In Proceedings of
the Fourth ACM Conference on Digital Libraries, pages 179–187, 1999.
[LC96]
Leah S. Larkey and W. Bruce Croft. Combining classifiers in text categorization. In SIGIR ’96, Proceedings of the 19th Annual International ACM
Conference on Research and Development in Information Retrieval, pages
289–297, 1996.
[LCJ03]
Yan Liu, Jaime Carbonell, and Rong Jin. A pairwise ensemble approach for
accurate genre classification. In Proceedings of the European Conference on
Machine Learning (ICML), 2003.
[Lea78]
Edward E. Leamer. Specification Searches: Ad Hoc Inference with Nonexperimental Data. John Wiley & Sons, USA, 1978.
[Lew92a]
David D. Lewis. An evaluation of phrasal and clustered representations on
a text categorization task. In SIGIR ’92, Proceedings of the 15th Annual
International ACM Conference on Research and Development in Information
Retrieval, pages 37–50, 1992.
[Lew92b]
David D. Lewis. Representation and Learning in Information Retrieval. PhD
thesis, University of Massachusetts, February 1992. COINS TR 91-93.
[Lew95]
David D. Lewis. A sequential algorithm for training text classifiers: Corrigendum and additional data. ACM SIGIR Forum, 29(2):13–19, Fall 1995.
[Lew97]
David D. Lewis. Reuters-21578, distribution 1.0.
http://www.daviddlewis.com/resources/testcollections/reuters21578, January
1997.
[LG94]
David D. Lewis and William A. Gale. A sequential algorithm for training text
classifiers. In SIGIR ’94, Proceedings of the 17th Annual International ACM
Conference on Research and Development in Information Retrieval, pages
3–12, 1994.
[Lit88]
Nick Littlestone. Learning quickly when irrelevant attributes abound: A new
linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.
[LJ98]
Y.H. Li and A.K. Jain. Classification of text documents. The Computer
Journal, 41(8):537–546, 1998.
[LL01]
Wai Lam and Kwok-Yin Lai. A meta-learning approach for text categorization. In SIGIR ’01, Proceedings of the 24th Annual International ACM Conference on Research and Development in Information Retrieval, pages 303–
309, 2001.
[LR94]
David D. Lewis and M. Ringuette. Comparison of two learning algorithms
for text categorization. In Proceedings of the Third Annual Symposium on
Document Analysis and Information Retrieval, 1994.
[LSCP96]
David D. Lewis, Robert E. Schapire, James P. Callan, and Ron Papka. Training algorithms for linear text classifiers. In SIGIR ’96, Proceedings of the
19th Annual International ACM Conference on Research and Development
in Information Retrieval, pages 298–306, 1996.
[LTB79]
D.V. Lindley, A. Tversky, and R.V. Brown. On the reconciliation of probability assessments. Journal of the Royal Statistical Society, 142(2):146–180,
1979.
[LW94]
Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm.
Information and Computation, 108(2):212–261, 1994.
[LYJC04]
Y. Liu, R. Yan, R. Jin, and J. Carbonell. A comparison study of kernels for
multi-label text classification using category association. In The Twenty-first
International Conference on Machine Learning (ICML), 2004.
[LYRL04]
David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. Rcv1: A new
benchmark collection for text categorization research. Journal of Machine
Learning Research, 5:361–397, 2004.
http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.
[LZ05]
John Langford and Bianca Zadrozny. Estimating class membership probabilities using classifier learners. In AI & Statistics, 2005.
[Mer95]
Christopher J. Merz. Dynamical selection of learning algorithms. In D. Fisher
and H. Lenz, editors, Learning from Data: Artificial Intelligence and Statistics, 5. Springer-Verlag, 1995.
[Mer98]
Christoper J. Merz. Classification and Regression by Combining Models.
PhD thesis, University California Irvine, Information and Computer Science,
1998. http://www.ics.uci.edu/~pazzani/merz.ps.
[Mer99]
Christopher J. Merz. Using correspondence analysis to combine classifiers.
Machine Learning, 36(1-2):33–58, July 1999.
[Mic01]
Microsoft Corporation. WinMine Toolkit v1.0.
http://research.microsoft.com/~dmax/WinMine/ContactInfo.html, 2001.
[Mit97]
Tom M. Mitchell. Machine Learning. McGraw-Hill Companies, Inc., 1997.
[MK60]
M.E. Maron and J.L. Kuhns. On relevance, probabilistic indexing, and information retrieval. Journal of the ACM, 7(3):216–244, July 1960.
[MLW92]
Brij Masand, Gordon Linoff, and David Waltz. Classifying news stories using
memory based reasoning. In SIGIR ’92, pages 59–65, 1992.
[MN98]
Andrew McCallum and Kamal Nigam. A comparison of event models for
naive bayes text classification. In Working Notes of AAAI ’98 (The 15th National Conference on Artificial Intelligence), Workshop on Learning for Text
Categorization, pages 41–48, 1998. TR WS-98-05.
[MP99]
Christopher J. Merz and Michael J. Pazzani. A principal components approach to combining regression estimates. Machine Learning, 36(1-2):9–32,
July 1999.
[MRF01]
R. Manmatha, T. Rath, and F. Feng. Modeling score distributions for combining the outputs of search engines. In Sigir ’01, 2001.
[MT94]
Ryszard Michalski and Gheorghe Tecuci, editors. Machine Learning: A Multistrategy Approach, volume IV. Morgan Kaufmann Publishers, Inc., 1994.
[PD03]
Foster Provost and Pedro Domingos. Tree induction for probability-based
rankings. Machine Learning, 52(3):199–215, September 2003.
[PF01]
Foster Provost and T. Fawcett. Robust classification for imprecise environments. Machine Learning, 42:203–231, 2001.
[Pla98]
John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Bernhard Schölkopf, Christopher J.C. Burges, and
Alexander J. Smola, editors, Advances in Kernel Methods – Support Vector
Learning. MIT Press, 1998.
[Pla99]
John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Alexander J. Smola, Peter
Bartlett, Bernhard Schölkopf, and Dale Schuurmans, editors, Advances in
Large Margin Classifiers, pages 61–74. MIT Press, 1999.
[RC95]
T.B. Rajashekar and W.B. Croft. Combining automatic and manual index
representations in probabilistic retrieval. Journal of the American Society for
Information Science, 6(4):272–283, 1995.
[RH00]
Stephen Robertson and David A. Hull. The TREC-9 filtering track final report. In E. M. Voorhees and D. K. Harman, editors, NIST Special Publication 500-249: The Ninth Text REtrieval Conference (TREC-9), pages 25–40.
Department of Commerce, National Institute of Standards and Technology,
2000.
[RS02]
Miguel E. Ruiz and Padmini Srinivasan. Hierarchical text categorization using neural networks. Information Retrieval, 5(1):87–118, 2002.
[RSW02]
T. Rose, M. Stevenson, and M. Whitehead. The Reuters Corpus Volume
1 – from Yesterday’s News to Tomorrow’s Language Resources. In Proceedings of the Third International Conference on Language Resources and
Evaluation, 2002. http://about.reuters.com/researchandstandards/corpus/LREC_camera_ready.pdf.
[Sch90]
Robert Schapire. The strength of weak learnability. Machine Learning,
5(2):197–227, 1990.
[SDHH98]
Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. A
bayesian approach to filtering junk e-mail. In Working Notes of AAAI ’98 (The
15th National Conference on Artificial Intelligence), Workshop on Learning
for Text Categorization, pages 55–62, 1998. TR WS-98-05.
[Seb02]
Fabrizio Sebastiani. Machine learning in automated text categorization. ACM
Computing Surveys, 34(1):1–47, March 2002.
[SF95]
J.A. Shaw and E.A. Fox. Combination of multiple searches. In D. K. Harman,
editor, TREC-3, Proceedings of the 3rd Text REtrieval Conference, number
500-225 in NIST Special Publication, pages 105–108, 1995.
[SFBL98]
Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting
the margin: A new explanation for the effectiveness of voting methods. The
Annals of Statistics, 26(5):1651–1686, 1998.
[Sim95]
Jeffrey S. Simonoff. Smoothing categorical data. Journal of Statistical Planning and Inference, 47(1-2):41–69, 1995.
[SS00]
Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system
for text categorization. Machine Learning, 39:135–168, 2000.
[STP01]
Maytal Saar-Tsechansky and Foster Provost. Active learning for class probability estimation and ranking. In IJCAI ’01, 2001.
[TH00]
Kentaro Toyama and Eric Horvitz. Bayesian modality fusion: Probabilistic
integration of multiple vision algorithms for head tracking. In ACCV 2000,
Proceedings of the 4th Asian Conference on Computer Vision, 2000.
[TO96]
Sebastian Thrun and Joseph O’Sullivan. Discovering structure in multiple
learning tasks: The tc algorithm. In ICML ’96, pages 489–497, 1996.
[TT95]
Volker Tresp and Michiaki Taniguchi. Combining estimators using nonconstant weighting functions. In NIPS ’94, 1995.
[TW99]
K.M. Ting and I.H. Witten. Issues in stacked generalization. Journal of
Artificial Intelligence Research, 10:271–289, 1999.
[Vap00]
Vladimir Vapnik. The Nature of Statistical Learning. Springer, New York,
2nd edition, 2000.
[vR79]
C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.
[WAD+ 99] Sholom Weiss, Chidanand Apte, Fred Damerau, David Johnson, Frank Oles,
Thilo Goetz, and Thomas Hampp. Maximizing text-mining performance.
IEEE Intelligent Systems, 14(4):63–69, 1999.
[Win69]
Robert L. Winkler. Scoring rules and the evaluation of probability assessors.
Journal of the American Statistical Association, 1969.
[WJB97]
Kevin Woods, W. Philip Kegelmeyer Jr., and Keven Bowyer. Combination
of multiple classifiers using local accuracy estimates. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 19(4):405–410, 1997.
[Wol92]
David H. Wolpert. Stacked generalization. Neural Networks, 5:241–259,
1992.
[Wol95]
David H. Wolpert. The relationship between PAC, the statistical physics
framework, the bayesian framework, and the VC framework. In David H.
Wolpert, editor, The Mathematics of Generalization, pages 117–214.
Addison-Wesley, Reading, MA, 1995.
[Yan99]
Yiming Yang. An evaluation of statistical approaches to text categorization.
Information Retrieval, 1(1/2):67–88, 1999.
[YAP00]
Yiming Yang, Thomas Ault, and Thomas Pierce. Combining multiple learning strategies for effective cross validation. In ICML ’00, Proceedings of
the 17th International Conference on Machine Learning, pages 1167–1182,
2000.
[YCB+ 99]
Y. Yang, J.G. Carbonell, R. Brown, Thomas Pierce, Brian T. Archibald, and
Xin Liu. Learning approaches to topic detection and tracking. IEEE EXPERT,
Special Issue on Applications of Intelligent Information Retrieval, 1999.
[YL99]
Yiming Yang and Xin Liu. A re-examination of text categorization methods.
In SIGIR ’99, Proceedings of the 22nd Annual International ACM Conference
on Research and Development in Information Retrieval, pages 42–49, 1999.
[YZCJ02]
Y. Yang, J. Zhang, J. Carbonell, and C. Jin. Topic-conditioned novelty detection. In Proceedings of the ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, July 2002.
[ZE01]
Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In ICML ’01, 2001.
[ZE02]
Bianca Zadrozny and Charles Elkan. Reducing multiclass to binary by coupling probability estimates. In KDD ’02, 2002.
[Zha05]
Jian Zhang. Sparsity models for multi-task learning. In NIPS ’05, 2005.
[ZO01]
Tong Zhang and Frank J. Oles. Text categorization based on regularized
linear classification methods. Information Retrieval, 4:5–31, 2001.
[ZY04]
Jian Zhang and Yiming Yang. Probabilistic score estimation with piecewise
logistic regression. In International Conference on Machine Learning (ICML
2004), 2004.