# Ensemble methods

## Learning theory

- Probability distribution P over X × {−1, +1}; let (X, Y) ∼ P.
- We get S := {(x_i, y_i)}_{i=1}^n, an iid sample from P.
- Goal: Fix ε, δ ∈ (0, 1). With probability at least 1 − δ (over the random choice of S), learn a classifier fˆ : X → {−1, +1} with low error rate err(fˆ) = P(fˆ(X) ≠ Y) ≤ ε.

Basic question: When is this possible?

- Suppose I even promise you that there is a perfect classifier from a particular function class F (e.g., F = linear classifiers or F = decision trees).
- Default: Empirical Risk Minimization (i.e., pick the classifier from F with the lowest training error rate), but this might be computationally difficult (e.g., for decision trees).

Another question: Is it easier to learn just non-trivial classifiers in F (i.e., better than random guessing)?
## Boosting

Motivation: It is easy to construct classification rules that are correct more often than not (e.g., "If ≥5% of the e-mail characters are dollar signs, then it's spam."), but it seems hard to find a single rule that is almost always correct.

Basic idea:
- Input: training data S.
- For t = 1, 2, ..., T:
  1. Choose a subset of examples S_t ⊆ S (or a distribution over S).
  2. Use a "weak learning" algorithm to get a classifier: f_t := WL(S_t).
- Return an "ensemble classifier" based on f_1, f_2, ..., f_T.

### Boosting: history

- 1984: Valiant and Kearns ask whether "boosting" is theoretically possible (formalized in the PAC learning model). Boosting: using a learning algorithm that provides "rough rules-of-thumb" to construct a very accurate predictor.
- 1989: Schapire creates the first boosting algorithm, solving the open problem of Valiant and Kearns.
- 1990: Freund creates an optimal boosting algorithm (Boost-by-majority).
- 1992: Drucker, Schapire, and Simard empirically observe practical limitations of early boosting algorithms.
- 1995: Freund and Schapire create AdaBoost, a boosting algorithm with practical advantages over early boosting algorithms.

Freund and Schapire won the 2004 ACM Paris Kanellakis Award: for their "seminal work and distinguished contributions [...] to the development of the theory and practice of boosting, a general and provably effective method of producing arbitrarily accurate prediction rules by combining weak learning rules"; specifically, for AdaBoost, which "can be used to significantly reduce the error of algorithms used in statistical analysis, spam filtering, fraud detection, optical character recognition, and market segmentation, among other applications".
## AdaBoost

Input: training data {(x_i, y_i)}_{i=1}^n from X × {−1, +1}.

1. Initialize D_1(i) := 1/n for each i = 1, 2, ..., n (a probability distribution).
2. For t = 1, 2, ..., T:
   1. Give D_t-weighted examples to WL; get back f_t : X → {−1, +1}.
   2. Compute the weighted edge and the weight of f_t:

      z_t := Σ_{i=1}^n D_t(i) · y_i f_t(x_i) ∈ [−1, +1],

      α_t := (1/2) ln((1 + z_t)/(1 − z_t)) ∈ R (weight of f_t).

   3. Update weights:

      D_{t+1}(i) := D_t(i) exp(−α_t · y_i f_t(x_i)) / Z_t for each i = 1, 2, ..., n,

      where Z_t > 0 is the normalizer that makes D_{t+1} a probability distribution.
3. Return the final classifier fˆ(x) := sign(Σ_{t=1}^T α_t · f_t(x)).

(Let sign(z) := 1 if z > 0 and sign(z) := −1 if z ≤ 0.)

### Interpreting z_t

Suppose (X, Y) ∼ D_t. If

P(f(X) = Y) = 1/2 + γ_t,

then

z_t = Σ_{i=1}^n D_t(i) · y_i f(x_i) = 2γ_t ∈ [−1, +1].

- z_t = 0 ⟺ random guessing w.r.t. D_t.
- z_t > 0 ⟺ better than random guessing w.r.t. D_t.
- z_t < 0 ⟺ better off using the opposite of f's predictions.
## Interpretation

Classifier weights: α_t = (1/2) ln((1 + z_t)/(1 − z_t)).

[Plot: α_t as a function of z_t ∈ (−1, 1); α_t = 0 at z_t = 0, and α_t → ±∞ as z_t → ±1.]

Example weights: D_{t+1}(i) ∝ D_t(i) · exp(−α_t · y_i f_t(x_i)).

Weak learning algorithm WL: ERM with F = "decision stumps" on R² (i.e., axis-aligned threshold functions x ↦ sign(v x_i − t)). It is straightforward to handle importance weights in ERM.

(Example from Figures 1.1 and 1.2 of the Schapire & Freund text.)
### Example: three rounds of AdaBoost

- Round 1: distribution D_1, stump f_1, with z_1 = 0.40, α_1 = 0.42.
- Round 2: distribution D_2, stump f_2, with z_2 = 0.58, α_2 = 0.65.
- Round 3: distribution D_3, stump f_3, with z_3 = 0.72, α_3 = 0.92.

Final classifier:

fˆ(x) = sign(0.42 f_1(x) + 0.65 f_2(x) + 0.92 f_3(x))

(Zero training error rate!)
## Training error rate of final classifier

Recall γ_t := P(f_t(X) = Y) − 1/2 = z_t/2 when (X, Y) ∼ D_t.

Training error rate of the final classifier from AdaBoost:

err(fˆ, {(x_i, y_i)}_{i=1}^n) ≤ exp(−2 Σ_{t=1}^T γ_t²).

If the average γ̄² := (1/T) Σ_{t=1}^T γ_t² > 0, then the training error rate is ≤ exp(−2 γ̄² T).

Some γ_t could be small, even negative; only the overall average γ̄² matters.

## Empirical results

Test error rates of C4.5 and AdaBoost on several classification problems. Each point represents a single classification problem/dataset from the UCI repository.

[Scatter plots: test error of C4.5 (vertical axis, 0–30%) vs. boosting stumps and vs. boosting C4.5 (horizontal axes, 0–30%).]

C4.5 = popular algorithm for learning decision trees.

(Figure 1.3 from the Schapire & Freund text.)
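Plugging the edges from the three-round example above (z_t = 0.40, 0.58, 0.72) into the bound is a useful sanity check; a quick sketch:

```python
import math

z = [0.40, 0.58, 0.72]            # edges z_t from the example run
gammas = [zt / 2 for zt in z]     # gamma_t = z_t / 2
bound = math.exp(-2 * sum(g * g for g in gammas))
print(bound)  # about 0.602
```

With only three rounds the bound is loose (the observed training error is already 0), but it shrinks geometrically as more rounds with positive edges are added.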
## Combining classifiers

Let F be the function class used by the weak learning algorithm WL. The function class used by AdaBoost is

F_T := { x ↦ sign(Σ_{t=1}^T α_t f_t(x)) : f_1, f_2, ..., f_T ∈ F, α_1, α_2, ..., α_T ∈ R },

i.e., linear combinations of T functions from F. The complexity of F_T grows linearly with T.

## A typical run of boosting

[Figure 1.7 from the Schapire & Freund text: training and test percent error rates obtained using boosting on an OCR dataset with C4.5 as the base learner, over 10 to 1000 rounds. The top and bottom curves are test and training error, respectively; the top horizontal line shows the test error using just C4.5, and the bottom line shows the final test error of AdaBoost after 1000 rounds.]

The training error rate is zero after just five rounds, but the test error rate continues to decrease, even up to 1000 rounds! (The number of nodes across all decision trees in fˆ is more than 2 × 10⁶.)

Theory suggests danger of overfitting when T is very large. Indeed, this does happen sometimes . . . but often not!
## Boosting the margin

The training error tells only part of the story: it reports just the number of examples that are correctly or incorrectly classified. To understand AdaBoost, we also need to consider how confident its predictions are, which can be measured by a quantity called the margin.

Regard the function class F used by the weak learning algorithm as "feature functions":

x ↦ φ(x) := (f(x) : f ∈ F) ∈ {−1, +1}^F

(possibly infinite dimensional!). The final classifier is a linear classifier in {−1, +1}^F:

fˆ(x) = sign(Σ_{t=1}^T α_t f_t(x)) = sign(Σ_{f ∈ F} w_f f(x)) = sign(⟨w, φ(x)⟩),

where w_f := Σ_{t=1}^T α_t · 1{f_t = f} for all f ∈ F.

Normalizing,

fˆ(x) = sign(g(x)), where g(x) := (Σ_{t=1}^T α_t f_t(x)) / (Σ_{t=1}^T |α_t|) ∈ [−1, +1].

Call y · g(x) ∈ [−1, +1] the margin achieved on example (x, y).

New theory [Schapire, Freund, Bartlett, and Lee, 1998]:
- Larger margins ⇒ better resistance to overfitting, independent of T.
- AdaBoost tends to increase margins on training examples.
- (Similar but not the same as SVM margins.)

On the "letters" dataset:

|                     | T = 5 | T = 100 | T = 1000 |
|---------------------|-------|---------|----------|
| training error rate | 0.0%  | 0.0%    | 0.0%     |
| test error rate     | 8.4%  | 3.3%    | 3.1%     |
| % margins ≤ 0.5     | 7.7%  | 0.0%    | 0.0%     |
| min. margin         | 0.14  | 0.52    | 0.55     |

Theoretical guarantee (e.g., when F = decision stumps in R^d): with high probability (over the random choice of the training sample),

err(fˆ) ≤ exp(−2 γ̄² T) + O(√(T log d / n)),

where the first term bounds the training error rate and the second term is the error due to the finite sample.
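For the three-stump example earlier (α = 0.42, 0.65, 0.92), the normalized margin is easy to compute by hand; a small sketch, where the vote pattern on the example is a hypothetical choice for illustration:

```python
alphas = [0.42, 0.65, 0.92]
votes = [+1, +1, -1]   # hypothetical predictions f_t(x) on one example
y = +1                 # its label

# g(x) = sum_t alpha_t f_t(x) / sum_t |alpha_t|, margin = y * g(x)
g = sum(a * v for a, v in zip(alphas, votes)) / sum(abs(a) for a in alphas)
margin = y * g
print(round(margin, 4))  # -> 0.0754
```

The prediction is correct (the margin is positive) but barely confident: the heaviest stump dissents, so y · g(x) sits close to 0 rather than near 1.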
## Exponential loss

AdaBoost is a particular "coordinate descent" algorithm (similar to, but not the same as, gradient descent) for the objective

min_{w ∈ R^F} (1/n) Σ_{i=1}^n exp(−y_i ⟨w, φ(x_i)⟩).

[Plot: the exponential loss exp(−x) and the logistic loss log(1 + exp(−x)) as functions of x.]

## More on boosting

Many variants of boosting:
- Boosted decision trees = boosting + decision trees.
- Boosting algorithms for ranking and multi-class.
- Boosting algorithms that are robust to certain kinds of noise.
- Boosting for online learning algorithms (very new!).
- ...

Many connections between boosting and other subjects:
- Game theory, online learning.
- "Geometry" of information (replace ‖·‖₂² with relative entropy divergence).
- Computational complexity.
- ...
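The coordinate-descent view makes α_t concrete: on round t, the D_t-weighted exponential loss as a function of α is L(α) = ((1+z)/2)·e^(−α) + ((1−z)/2)·e^(α), since the correctly classified weighted mass is (1+z)/2. Setting L′(α) = 0 gives exactly AdaBoost's α = ½ ln((1+z)/(1−z)), with minimum value √(1 − z²). A quick numeric check of this derivation:

```python
import math

z = 0.4                                          # D_t-weighted edge of the weak classifier
alpha_star = 0.5 * math.log((1 + z) / (1 - z))   # AdaBoost's alpha_t

def round_loss(alpha, z=z):
    # D_t-weighted exponential loss after adding alpha * f_t:
    # correct mass (1+z)/2 pays e^{-alpha}, incorrect mass (1-z)/2 pays e^{alpha}
    return (1 + z) / 2 * math.exp(-alpha) + (1 - z) / 2 * math.exp(alpha)

grid_min = min(round_loss(i / 1000) for i in range(-2000, 2001))
print(round_loss(alpha_star), grid_min)          # both about sqrt(1 - z^2) = 0.9165...
```

So each round of AdaBoost exactly minimizes the exponential loss along the coordinate of the newly chosen weak classifier.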
## Application: face detection

Problem: Given an image, locate all of the faces.

As a classification problem:
- Divide up images into patches (at varying scales, e.g., 24×24, 48×48).
- Learn a classifier f : X → Y, where Y = {face, not face}.

Many other things are built on top of face detectors (e.g., face tracking, face recognizers); now in every digital camera and iPhoto/Picasa-like software.

Main problem: how to make this very fast.

## Viola & Jones face detector

Face detector architecture by Viola & Jones (2001): a major achievement in computer vision; the detector is actually usable in real-time.

- Think of each image patch (d × d-pixel gray-scale) as a vector x ∈ [0, 1]^{d²}.
- Use a weak learning algorithm that picks linear classifiers f_{w,t}(x) = sign(⟨w, x⟩ − t), where w has a very particular form (a rectangle feature), so that ⟨w, x⟩ = average pixel value in a black box − average pixel value in a white box.
- AdaBoost combines many "rules-of-thumb" of this form. They are very simple and extremely fast to evaluate via pre-computation.

(From Figure 5 of Viola & Jones: the first two features selected by AdaBoost. The first measures the difference in intensity between the region of the eyes and a region across the upper cheeks, capitalizing on the observation that the eye region is often darker than the cheeks. The second compares the intensities in the eye regions to the intensity across the bridge of the nose.)

### "Integral image" trick

For every image, pre-compute (in a single pass through the image)

s(r, c) = sum of pixel values in the rectangle from (0, 0) to (r, c).

To compute the inner product ⟨w, x⟩ = average pixel value in black box − average pixel value in white box, just add and subtract a few s(r, c) values. Each sub-window is then processed with very few operations:
1. Evaluate the rectangle features (requires between 6 and 9 array references per feature).
2. Compute the weak classifier for each feature (requires one threshold operation per feature).
3. Combine the weak classifiers (requires one multiply per feature, an addition, and finally a threshold).

A two-feature classifier amounts to about 60 microprocessor instructions; by comparison, scanning a simple image template would require at least 20 times as many operations per sub-window.

### Cascade

Problem: severe class imbalance (most patches don't contain a face).

Solution: Train several classifiers f^(1), f^(2), f^(3), ... and arrange them in a special kind of decision list called a "cascade": a patch x is passed to f^(1); a prediction of −1 at any stage leads to immediate rejection of the patch, while a prediction of +1 passes the patch on to the next stage. Smaller, and therefore more efficient, boosted classifiers reject many of the negative sub-windows while detecting almost all positive instances; simpler classifiers reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates.
### Cascade training and example results

- Each stage f^(ℓ) is trained using AdaBoost, and the threshold inside the sign is adjusted to minimize the false negative rate. The initial AdaBoost threshold, (1/2) Σ_{t=1}^T α_t, is designed to yield a low error rate on the training data; a lower threshold yields higher detection rates and higher false positive rates. Based on performance measured on a validation set, a two-feature first stage can be adjusted to detect 100% of the faces with a false positive rate of about 50%.
- Stages f^(ℓ) in later positions can be more complex than those in earlier positions, since most examples don't make it to the end: the cascade rejects as many negatives as possible at the earliest stage possible, so evaluating the cascade is extremely fast.

(Figure 10 of Viola & Jones shows the output of the detector on test images from the MIT + CMU test set.)
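The "integral image" trick above can be sketched in a few lines; a minimal stdlib-only version (plain nested lists stand in for an image array):

```python
def integral_image(img):
    """Pre-compute s[r][c] = sum of img over the rectangle (0,0)..(r,c), in one pass."""
    R, C = len(img), len(img[0])
    s = [[0.0] * C for _ in range(R)]
    for r in range(R):
        row = 0.0
        for c in range(C):
            row += img[r][c]                         # running sum of the current row
            s[r][c] = row + (s[r - 1][c] if r > 0 else 0.0)
    return s

def rect_sum(s, r1, c1, r2, c2):
    """Sum of pixels in the rectangle (r1,c1)..(r2,c2), via at most 4 lookups into s."""
    total = s[r2][c2]
    if r1 > 0: total -= s[r1 - 1][c2]
    if c1 > 0: total -= s[r2][c1 - 1]
    if r1 > 0 and c1 > 0: total += s[r1 - 1][c1 - 1]
    return total

img = [[1, 2], [3, 4]]
s = integral_image(img)
print(rect_sum(s, 1, 1, 1, 1))  # -> 4 (the bottom-right pixel alone)
```

A two-rectangle feature is then `rect_sum(A)/area(A) - rect_sum(B)/area(B)`: constant-time per feature regardless of rectangle size, which is what makes the detector fast.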
### Summary (face detection)

Two key points:
- AdaBoost effectively combines many fast-to-evaluate "weak classifiers".
- The cascade structure optimizes speed for the common case.

## Bagging

Bagging = Bootstrap aggregating (Leo Breiman, 1994).

Input: training data {(x_i, y_i)}_{i=1}^n from X × {−1, +1}.

For t = 1, 2, ..., T:
1. Randomly pick n examples with replacement from the training data → {(x_i^(t), y_i^(t))}_{i=1}^n (a bootstrap sample).
2. Run the learning algorithm on {(x_i^(t), y_i^(t))}_{i=1}^n → classifier f_t.

Return a majority vote classifier over f_1, f_2, ..., f_T.

### Aside: sampling with replacement

Question: if n individuals are picked from a population of size n uniformly at random with replacement, what is the probability that a given individual is not picked? Answer: (1 − 1/n)^n. For large n:

lim_{n→∞} (1 − 1/n)^n = 1/e ≈ 0.3679.

Implications for bagging:
- Each bootstrap sample contains about 63% of the data set.
- The remaining 37% can be used to estimate the error rate of the classifier trained on the bootstrap sample.
- Can average across bootstrap samples to get an estimate of the bagged classifier's error rate (sort of).
## Random Forests

Random Forests (Leo Breiman, 2001).

Input: training data {(x_i, y_i)}_{i=1}^n from R^d × {−1, +1}.

For t = 1, 2, ..., T:
1. Randomly pick n examples with replacement from the training data → {(x_i^(t), y_i^(t))}_{i=1}^n (a bootstrap sample).
2. Run a variant of the decision tree learning algorithm on {(x_i^(t), y_i^(t))}_{i=1}^n, where each split is chosen by only considering a random subset of √d features (rather than all d features) → decision tree classifier f_t.

Return a majority vote classifier over f_1, f_2, ..., f_T.
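The per-split feature subsampling in step 2 is a one-liner; a small sketch (the feature count and seed are arbitrary demo values):

```python
import math
import random

d = 64                                         # total number of features (demo value)
k = round(math.sqrt(d))                        # consider only ~sqrt(d) features per split
random.seed(0)
split_candidates = random.sample(range(d), k)  # drawn fresh at every split of every tree
print(sorted(split_candidates))
```

Drawing a fresh subset at every split decorrelates the trees, which is what makes the majority vote more effective than bagging alone.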
## Holdout method

1. Randomly split the data into three sets: Training, Holdout, and Test data.
2. Train some classifiers on the Training data; evaluate them on the Holdout data.
3. Use the results from the previous step to pick the final classifier.

### Breaking the holdout

1. Randomly split the data into three sets: Training, Holdout, and Test data.
2. Loop:
   - Train some classifiers on the Training data.
   - Evaluate on the Holdout data.
   - Exit the loop at some point.
3. Use the results from the previous step to pick the final classifier.

Why is this different from the basic holdout method? Steps after the first loop iteration might not be independent of the Holdout data.
## Extreme case of overfitting

Cautionary tale: nytimes.com/2015/06/04/technology/computer-scientists-are-astir-after-baidu-team-is-barred-from-ai-competition.html

- Construct completely random classifiers f_1, f_2, ..., f_K : X → {−1, +1}.
- Compute Holdout error rates err(f_i, Holdout), i = 1, 2, ..., K.

[Histogram: the holdout error rates of the random classifiers are spread around 1/2, roughly between 0.46 and 0.54.]

- Select the better-than-random classifiers: G := {i : err(f_i, Holdout) < 1/2}.
- Construct fˆ := majority vote classifier over {f_i : i ∈ G}.
- Compute the Holdout error rate err(fˆ, Holdout).

Is err(fˆ, Holdout) a good estimate of err(fˆ)?
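A quick simulation shows why the answer is no: the selected classifiers collectively look far better than chance on the holdout set, yet the vote carries no information about fresh data. All sizes and the seed below are arbitrary demo choices.

```python
import random

random.seed(0)
K, m = 2000, 100                        # number of random classifiers, holdout size
holdout_y = [random.choice((-1, 1)) for _ in range(m)]

# each "classifier" is just a random +/-1 prediction for each holdout point
preds = [[random.choice((-1, 1)) for _ in range(m)] for _ in range(K)]

def err(p, y):
    return sum(pi != yi for pi, yi in zip(p, y)) / len(y)

# keep only classifiers that look better than random on the holdout
G = [p for p in preds if err(p, holdout_y) < 0.5]

# majority vote of the selected classifiers
vote = [1 if sum(p[j] for p in G) > 0 else -1 for j in range(m)]
print(err(vote, holdout_y))             # far below 1/2: looks great...

# ...but on fresh labels the vote is no better than coin flipping
fresh_y = [random.choice((-1, 1)) for _ in range(m)]
print(err(vote, fresh_y))               # close to 1/2
```

Selection conditioned on the holdout makes every selected classifier slightly "right" about each holdout label, and the majority vote amplifies these chance agreements into a near-zero holdout error, even though the true error of fˆ is exactly 1/2.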
## What can we do?

- Avoid excessive adaptation to the Holdout data if you can help it. (Think carefully about what to evaluate; use Training data when possible.)
- Guard against inadvertent "information leakage" from the Holdout data. Example of leakage: using both Training and Holdout data to determine standardization or other preprocessing.
- Use "blurred vision" when looking at the Holdout data: e.g., report err(f, Holdout) + noise, where the noise is on the order of the expected standard deviation.
- Use explicit safeguards for adaptive data analysis: Dwork et al. The reusable holdout: preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015.
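The "blurred vision" idea can be sketched directly. The noise scale σ = √(err(1 − err)/m) for a holdout of size m is a natural choice matching "the expected standard deviation", and the Gaussian noise here is an illustrative assumption, not the Dwork et al. mechanism:

```python
import math
import random

def noisy_holdout_err(raw_err, m, rng):
    """Report the holdout error only up to noise on the order of its std. dev."""
    sigma = math.sqrt(max(raw_err * (1 - raw_err), 1e-12) / m)
    return raw_err + rng.gauss(0.0, sigma)

rng = random.Random(0)
blurred = noisy_holdout_err(0.20, m=500, rng=rng)
print(blurred)   # near 0.20; comparisons below the noise level are hidden
```

Since differences smaller than the noise are invisible, an analyst cannot chase the kind of chance fluctuations that the random-classifier example above exploits.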