Massachusetts Institute of Technology — Computer Science and Artificial Intelligence Laboratory

Regularization Through Feature Knockout

Lior Wolf and Ian Martin

AI Memo 2004-025
CBCL Memo 242
November 2004

© 2004 Massachusetts Institute of Technology, Cambridge, MA 02139 USA — www.csail.mit.edu
Abstract
In this paper, we present and analyze a novel regularization technique based on enhancing our dataset with corrupted
copies of the original data. The motivation is that since the learning algorithm lacks information about which parts of the
data are reliable, it has to produce more robust classification functions. We then demonstrate how this regularization leads
to redundancy in the resulting classifiers, which is somewhat in contrast to the common interpretations of the Occam’s razor
principle. Using this framework, we propose a simple addition to the gentle boosting algorithm which enables it to work with
only a few examples. We test this new algorithm on a variety of datasets and show convincing results.
© Copyright Massachusetts Institute of Technology, 2004
This report describes research done at the Center for Biological & Computational Learning, which is in the McGovern Institute for Brain Research at MIT,
as well as in the Dept. of Brain & Cognitive Sciences, and which is affiliated with the Computer Sciences & Artificial Intelligence Laboratory (CSAIL).
This research was sponsored by grants from: Office of Naval Research (DARPA) Contract No. MDA972-04-1-0037, Office of Naval Research (DARPA)
Contract No. N00014-02-1-0915, National Science Foundation (ITR/IM) Contract No. IIS-0085836, National Science Foundation (ITR/SYS) Contract No.
IIS-0112991, National Science Foundation (ITR) Contract No. IIS-0209289, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218693, National Science Foundation-NIH (CRCNS) Contract No. EIA-0218506, and National Institutes of Health (Conte) Contract No. 1 P20 MH66239-01A1. Additional support was provided by: Central Research Institute of Electric Power Industry, Center for e-Business (MIT), Daimler-Chrysler AG, Compaq/Digital
Equipment Corporation, Eastman Kodak Company, Honda R& D Co., Ltd., ITRI, Komatsu Ltd., Eugene McDermott Foundation, Merrill-Lynch, Mitsubishi
Corporation, NEC Fund, Nippon Telegraph & Telephone, Oxygen, Siemens Corporate Research, Inc., Sony MOU, Sumitomo Metal Industries, Toyota
Motor Corporation, and WatchVision Co., Ltd.
1. Introduction
of overfitting (learning to deal only with the training error).
Therefore, while the empirical error is small, the expected
error is large. In other words, the generalization error (the
difference of empirical error from expected error) is large.
Overfitting can be avoided by using any one of several regularization techniques.
Overfitting is usually the result of allowing too much
freedom in the selection of the function f . Thus, the most
basic regularization technique is to limit the number of free
parameters we use while fitting the function f . For example, in binary classification we may limit ourselves to learning functions of the form f (x) = (h⊤ x > 0) (we assume
X = ℜn . h is a vector of free parameters). Using such
functions, we reduce the risk of overfitting, but may never
optimally learn the target function (i.e., the “true function”
f (x) = y that is behind the distribution P) of other forms,
e.g., we will not be able to learn f(x) = (x(1)² − x(2) > 0).
Another regularization technique is to minimize the empirical error subject to constraints on the learned functions.
For example, we can require that the norm of the vector of
free parameters h be less than one. A related but different
regularization technique is to minimize the empirical error
together with a penalty term on the complexity of the function we fit. The most popular penalty term, Tikhonov regularization, has a quadratic form. Using the linear model above, an appropriate penalty function would be ||h||₂², and we would minimize Σ_{i=1}^n ((h⊤x_i > 0) ≠ y_i) + ||h||₂².
Sometimes, adding a regularization term to the optimization problem solved by the algorithm is not trivial. In the
most extreme case, the algorithm is a black box we cannot
alter at all. Still, a simple form of regularization called noise
injection can be employed. In the noise injection technique,
the training dataset is enriched by multiple copies of each
training data point xi . A zero-mean, low-variance Gaussian
noise (independent for each coordinate) is added to each
copy, and the original label yi is preserved. The motivation
is that if two data points x, x′ are close (i.e., ||x − x′ || is
small), we would like f (x) and f (x′ ) to have similar values. By introducing many examples with similar x values,
and identical y values we teach the classifier to have this stability property. Hence, the learned function is encouraged
to be smooth (at least around the training points).
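For concreteness, here is a minimal sketch of noise injection as a data-augmentation step (ours, not from the memo); the number of copies and the noise standard deviation below are illustrative choices.

```python
import numpy as np

def noise_injection(X, y, copies=10, sigma=0.1, rng=None):
    """Augment (X, y) with `copies` noisy duplicates of every example.

    X: array of shape (n_examples, n_features); y: labels of shape (n_examples,).
    Zero-mean Gaussian noise with std `sigma` is added independently to every
    coordinate of every copy; the original labels are preserved.
    """
    rng = np.random.default_rng(rng)
    noisy = [X + sigma * rng.standard_normal(X.shape) for _ in range(copies)]
    X_aug = np.vstack([X] + noisy)
    y_aug = np.concatenate([y] * (copies + 1))
    return X_aug, y_aug
```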
The study of the noise injection technique, which blossomed in the mid 90’s, established the following results on
noise injection: (1) It is an effective way to reduce generalization error. (2) It has a similar effect on shrinkage (the
statistical term for regularization) of the parameters in some
simple models (e.g., [3]). (3) It is equivalent to Tikhonov
regularization [1]. Note that this does not mean that we can
always use Tikhonov regularization instead of noise injection, as for some learning algorithms it is not possible to
create a regularized version.
The technique we introduce next is similar in spirit to
Boosting - the iterative combination of classifiers to build a
strong classifier - is a popular learning technique. The algorithms based on it, such as AdaBoost and GentleBoost,
are easy to implement, work reasonably fast, and in general
produce classifiers with good generalization properties for
large enough datasets. If the dataset is not large enough and
there are many features, these algorithms tend to overfit and
perform much worse than the popular Support Vector Machine (SVM) algorithm. SVM can be successfully applied
to all datasets, from small to very large.
The major drawback of SVM is that it uses, at run time,
when classifying a new example x, all the measurements
(features) of x. This poses a problem because, while we
would like to cover many promising features during training, computing all the features at run-time might be too
costly. This is especially true for object detection problems
in vision, where we often need to search the whole image
in several scales over thousands of possible locations, each
location producing one such vector x. While several approaches for combining feature selection with SVM have
been suggested in the past (e.g., [17]), they are rarely used.
The complexity of the feature vector can be controlled
more easily by using boosting techniques over weak classifiers based on single features (e.g., regression stumps), such
as in the highly successful system of [16]. In this case, the
number of features used is bounded by the number of iterations in the boosting process. However, since boosting tends
to overfit on small datasets, there is a bit of a dilemma here.
An ideal algorithm would enable good control over the total number of features, while being able to learn from only
a few examples. Such an algorithm is presented in Sec. 5.
This new algorithm is based on GentleBoost. Within it
we implemented a regularization technique based on a simple idea: add corrupted copies of your training dataset to the
original one, and the algorithm will not be able to overfit. A
general background on fitting and regularization is given in
the next section.
2 Background
We are given a set of n examples z_i = (x_i, y_i), i = 1..n, x ∈ X, y ∈ Y, drawn from a joint distribution P on X × Y. The ultimate goal of the learning algorithm is to produce a function f : X → Y such that the expected error of f, given by the expression E_{(x,y)∼P}(f(x) ≠ y), is minimized. The boolean expression inside the parentheses evaluates to one if it holds, zero otherwise.
Since we do not know the distribution P, we are tempted to minimize the empirical error, given by Σ_{i=1}^n (f(x_i) ≠ y_i). The problem is that if the space of functions from which the learning algorithm selects f is too large, we are at risk
cessity” - is often understood as suggesting the selection
of the simplest possible model that fits the data well. Thus,
given a dataset, we are encouraged to prefer classifier A that
uses 70 features over classifier B that uses 85 if they have
the same expected error, because the “simpler” classifier is
believed to generalize better.
In an apparent contrast to this belief, we know from our
daily experience that simpler is not always better. When a
good teacher explains something to a class, he will use a
lot of repetition. The teacher ensures that the students will
understand the idea, even if some of the explanations were
unclear. Since the idea could be expressed in a simple form
without repetition, the explanation is more complex.
It is also generally accepted that biological systems use
redundancy in their computations. Thus, even if several
computational units break down (e.g., when neurons die)
the result of the computation is largely unaffected. We expect the learned model to be interpreted with some random
error. In these cases, we should not train a single model, but
instead train to optimize a distribution of models.
Redundancy in boosted classifiers. Boosting over a
weak classifier increases the weights of those examples it
does not classify well. The inclusion of a weak classifier
in the strong classifier therefore inhibits future use of similar weak classifiers. In boosting over regression stumps,
each weak classifier is based on one feature. Hence, from
a group of similar features, we may expect to see only one
participating in the strong classifier. Our boosting over regression stumps algorithm, presented in Sec. 5, modulates
this effect, by creating a new example for which relying on
the selected feature might lead to a mistake. However, it
does not change the values of the other features, making the
similar features suitable for classifying the new example.
Such a process yields a larger classifier, which uses more
features. This effect is clear in our experiments, and might
also be interpreted as “a more complex model is needed to
deal with more complex data.” However, using more features does not necessarily mean that the classifier will lead
to a worse generalization error.¹
Koltchinskii and Panchenko have derived bounds on the
generalization error of ensemble (voting) classifiers, such
as boosting, which take redundancy into consideration [10].
A precise description of their results would require the introduction of more notation, and will not be presented here.
Informally speaking, they show that one can refine the measures of complexity used for voting classifiers such that
it would encourage ensembles that can be grouped into
a small number of compact clusters, each including base
(“weak”) classifiers that are similar to one another.
Input: (x1 ,y1 ), ..., (xm ,ym ) where xi ∈ ℜn , yi ∈ Y .
Output: one synthesized pair (x̂, ŷ).
1. Select two examples xa , xb at random.
2. Select a random feature k ∈ [1..n].
3. Set x̂ ← xa and ŷ ← ya .
4. Replace feature k of x̂: x̂(k) ← xb (k).
Figure 1: The Feature Knockout Procedure
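A direct transcription of Fig. 1 into code might look as follows (our sketch, assuming the data is held in NumPy arrays):

```python
import numpy as np

def feature_knockout(X, y, rng=None):
    """Synthesize one example by the Feature Knockout procedure of Fig. 1.

    X: array of shape (n_examples, n_features); y: labels.
    Returns (x_hat, y_hat): a copy of a random example x_a whose k-th feature
    has been replaced by the k-th feature of another random example x_b,
    keeping the label of x_a.
    """
    rng = np.random.default_rng(rng)
    a, b = rng.integers(len(X), size=2)      # step 1: two random examples
    k = rng.integers(X.shape[1])             # step 2: a random feature
    x_hat, y_hat = X[a].copy(), y[a]         # step 3: copy x_a and its label
    x_hat[k] = X[b, k]                       # step 4: knock out feature k
    return x_hat, y_hat
```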
noise injection. However, it is different enough that the results obtained for noise injection will not hold for it. For example, the results of [1] use a Taylor expansion around the
original data points. Such an approximation will not hold
for our new technique, since the “noise” is too large (i.e.,
the new datapoint is too different). Other important properties that might not hold are the independence of noise across
coordinates, and the zero mean of the noise.
Our regularization technique is based on creating corrupted copies of the dataset. Each new data point is a
copy of one original training point, picked at random, where
one random coordinate (feature) is replaced with a different
value–usually the value of the same coordinate in another
random training example. The basic procedure used to generate the new example is illustrated in Fig. 1. We call it the
Feature Knockout (KO) procedure, since one feature value
is being altered dramatically. It is repeated many times to
create new examples. It can be used with any learning algorithm, and we use it in the analysis presented in Sec. 4.
However, as we focus our application emphasis on boosting, we use the specialized version in Fig. 2.
The KO regularization technique is especially suited for
use when learning from only a few examples. The robustness we demand from the selected classification function
is much more than local smoothness around the classification points (c.f. noise injection). This kind of smoothness is easy to achieve when example points are far from
one another. Our regularization, however, is less restrictive
than demanding uniform smoothness (Tikhonov) or requiring the reduction of as many parameters as possible. Both
of these approaches might not be ideal when only a few examples are available because there is nothing to balance a
large amount of uniform smoothness, and it is easy to fit a
model that uses very few parameters. Instead, we encourage
redundancy in the classifier since, in contrast to the shortage
of training examples, there is an abundance of features.
3 Motivation
1 The terms “simple” and “complex” are not trivial to define, and their
definition usually depends on what one tries to claim. In the next section
we will use more rigorous definitions for the cases we analyze.
Intuition. It is a common belief that simpler is better. Occam's razor - "entities should not be multiplied beyond ne-
4 Analysis
Solving this system using Tikhonov regularization is termed "scaled Tikhonov regularization." If D is unknown, a natural choice is the diagonal matrix with the entries D_kk = √((AA⊤)_kk) [13]. We will now show that using the knockout procedure to add many new examples is equivalent to scaled Tikhonov regularization, using the weight matrix above.
The effect of adding noise to the training data depends on
the learning algorithm used, and is highly complex. Even
for the case of adding a zero-mean, low-variance Gaussian
noise (noise injection) this effect was studied only for simple algorithms (e.g. [3]) or the square loss function [1].
In Sec. 4.1 we study the effect of Feature Knockout on
the well known linear least square regression problem. We
show that it leads to a scaled version of Tikhonov regularization; compare this to Bishop's result (obtained using a Taylor expansion) that noise injection is equivalent to Tikhonov regularization. Then, in Sec. 4.2, we analyze how feature KO affects the variance of the learned classifier.
Lemma 1 When using the linear model with a least
squares fit, applying the knockout procedure in Fig. 1 to
generate many examples is equivalent to applying scaled Tikhonov regularization where D_kk = √((AA⊤)_kk).
Proof For simplicity of the proof, we will make the following two assumptions: (1) All features have an expectation of 0. Without this assumption, our derivations below
will be cluttered with many more elements. (2) Instead of
sampling using the knockout procedure, we will just use an
augmented data matrix containing all of the points that the
knockout procedure can create. This is equivalent to studying the limit of infinite new examples, and allows us to ignore all the elements that go to zero as the number of virtual
examples increases.
The vector ĥ, which is fitted by means of a scaled Tikhonov regularization technique with a parameter λ, is given by:
ĥ = (ÂÂ⊤ + λI)^{-1}Ây = (D^{-1}AA⊤D^{-1} + λD^{-1}D²D^{-1})^{-1}D^{-1}Ay = D(AA⊤ + λD²)^{-1}DD^{-1}Ay = D(AA⊤ + λD²)^{-1}Ay.
Therefore, h = D^{-1}ĥ = (AA⊤ + λD²)^{-1}Ay.
Now consider Ã, the matrix whose columns contain all the possible KO examples, together with the original data points. Let ỹ be the corresponding labels. By applying a least squares linear fit to these inputs, we find: h̃ = (ÃÃ⊤)^{-1}Ãỹ. There are nm² total examples created. Since we assumed all features have a mean of zero, all the knockout values of each feature cancel out. What remains is m(n − 1) exact copies of each variable, and Ãỹ = m(n − 1)Ay.
We now examine the elements of the matrix ÃÃ⊤. The off-diagonal elements represent the dot product between two different variables. Each variable holds either its original value or a different value, but it may never happen that both contain knockout values at once. For each input example, both features hold their original values in m(n − 2) of the cases. The rest of the cases average out to zero, because while one value is fixed the other value runs over all values, which have a mean of zero. For the diagonal, because of symmetry, each value appears nm times, making the diagonal nm times the diagonal of the original matrix.
Putting it all together: h̃ = (ÃÃ⊤)^{-1}Ãỹ = (m(n − 2)AA⊤ + 2mD²)^{-1} m(n − 1)Ay = ((n − 1)/(n − 2)) (AA⊤ + λD²)^{-1}Ay, where λ = 2/(n − 2). The leading fraction does not change the sign of the results, and is close to one. Ignoring it, we obtain the results of a scaled Tikhonov regularization. The parameter λ can be controlled by allowing the knockout procedure to perturb more than one element.
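The identity can be checked numerically. The sketch below is our own (not from the memo): it enumerates all nm² knockout examples for a small problem, with the features empirically centered to match assumption (1) and the original points left out so that the equality is exact rather than asymptotic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 30                                   # features, examples
A = rng.standard_normal((n, m))
A -= A.mean(axis=1, keepdims=True)             # assumption (1): zero-mean features
y = rng.standard_normal(m)

# Enumerate every knockout example (n * m * m of them).
cols, labels = [], []
for i in range(m):                             # source example
    for k in range(n):                         # knocked-out feature
        for b in range(m):                     # donor example
            x = A[:, i].copy()
            x[k] = A[k, b]
            cols.append(x)
            labels.append(y[i])
A_ko = np.array(cols).T                        # n x (n m^2)
y_ko = np.array(labels)

# Least-squares fit on the knockout-augmented data.
h_ko, *_ = np.linalg.lstsq(A_ko.T, y_ko, rcond=None)

# Scaled Tikhonov regularization with D_kk = sqrt((A A^T)_kk), lambda = 2/(n-2).
D2 = np.diag(np.diag(A @ A.T))
lam = 2.0 / (n - 2)
h_tik = (n - 1) / (n - 2) * np.linalg.solve(A @ A.T + lam * D2, A @ y)

print(np.allclose(h_ko, h_tik))                # True (up to numerical error)
```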
4.1 Effect of feature KO on linear regression
One of the most basic models we can apply to the data is
the linear model. In this model, the input examples xi ∈
ℜn , i = 1..m are organized as the columns of the matrix
A ∈ ℜn×m ; the corresponding yi values are stacked in one
vector y ∈ ℜm . The prediction made by the model is given
by A⊤ h, where h is the vector of free parameters we have
to fit to the data. In the common least squares case, ||y −
A⊤h||² is minimized.
In the case that the matrix A is full rank and overdetermined, it is well known that the optimal solution is h =
A+ y, where A+ = (AA⊤ )−1 A is known as the pseudo inverse of the transpose of A (our definition of A is the transpose of the common text book definition). If A is not full
rank, the matrix inverse (AA⊤ )−1 is not well defined. However, as an operator in the range of A it is well defined, and
the above expression still holds, i.e., even if there is an ambiguity in selecting the inverse matrix, there is no ambiguity
in the operation of all possible matrices on the range of the
columns of A, which is what we care about.
Even so, if the covariance matrix (AA⊤ ) has a large condition number (i.e., it is close to being singular), small perturbations of the data result in large changes to h, and the
system is unstable. The solution fits the data A well, but
does not fit data which is very close to A, hence there is
overfitting. To stabilize the system, we apply regularization.
Tikhonov regularization is based on minimizing ||y − A⊤h||² + λ||h||². This is equivalent to using a regularized pseudo-inverse: A⁺_λ = (AA⊤ + λI)^{-1}A, where I is the n × n identity matrix, and λ is the regularization parameter.
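As a small illustration (ours; the data sizes and λ are arbitrary), the regularized pseudo-inverse can be applied directly. Following the text, examples are stored as the columns of A.

```python
import numpy as np

def tikhonov_fit(A, y, lam):
    """Solve min_h ||y - A^T h||^2 + lam * ||h||^2.

    A: n x m matrix whose columns are the training examples.
    y: length-m vector of targets.
    Returns h = (A A^T + lam * I)^{-1} A y.
    """
    n = A.shape[0]
    return np.linalg.solve(A @ A.T + lam * np.eye(n), A @ y)

# Example: an ill-conditioned system is stabilized by lam > 0.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))
A[4] = A[3] + 1e-6 * rng.standard_normal(20)   # two nearly identical features
y = rng.standard_normal(20)
h_plain = tikhonov_fit(A, y, lam=0.0)          # close to the plain pseudo-inverse
h_ridge = tikhonov_fit(A, y, lam=1e-2)
print(np.linalg.norm(h_plain), np.linalg.norm(h_ridge))
```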
In many applications, the linear system we need to solve
is badly scaled, e.g., one variable is much larger in magnitude than the other variables. In order to rectify this, we may
apply a transformation to the data that weights each variable
differently, or equivalently weight the vector h by applying
a diagonal matrix D, such that h becomes ĥ = Dh.
Instead of solving the original system A⊤h = y, we now solve the system Â⊤ĥ = y, where Â = D^{-1}A.
Similar to the work done on noise injection, we examined the effect of our procedure on a simple regression technique. We saw that feature knockout resembles the effect of
scaled Tikhonov regularization, i.e., high norm features are
penalized by the knockout procedure. However, boosting
over regression stumps seems to be scale invariant. Multiplying all the values of a feature by some constant does
not change the resulting classifier, since the process that fits
the regression stumps (see Sec. 5) uses the values of each
feature to determine the thresholds that it uses. However, a
closer look reveals the connection between scaling and the
effect of the knockout procedure on boosting.
Boosting over stumps (e.g., [16]) chooses at each round
one out of n features, and one threshold for this feature.
The thresholds are picked from the m possible values that
exist in between every two sorted feature values. The feature and the threshold define a “weak classifier” (the basic
building blocks of the ensemble classifier built by the boosting procedure [14]), which predicts -1 or +1 according to
the threshold. Equivalently, we can say that boosting over
stumps chooses from a set of nm binary features – these
features are exactly the values returned by the weak classifiers. These nm features have different norms, and are not
scale invariant. Let us call each such feature an nm-feature.
Using the intuitions of the linear least squares case, we
would like to inhibit features of high magnitude. All nm-features have the same norm (√m), but different entropies (a measure which is highly related to the norm). These
entropies depend only on the ratio of positive values in each
nm-feature - call this ratio p.
Creating new examples using the Feature Knockout procedure does not change the number of possible thresholds,
and therefore the number of features remains the same. The
values of the new example in the nm feature space will be
the same for all features originating from the n − 1 features that were not changed in the knockout procedure. The
value for a knocked-out feature (feature k in Fig. 1), will
change if the new value is on the other side of the threshold as compared to the old value. This will happen with
probability 2p(1 − p). If this sign flip happens then the feature is inhibited because it gives two different classifications
to two examples with the same label (KO leaves labels unchanged). Note that the entropy of a feature with a positive
ratio of p and the probability 2p(1 − p) behave similarly:
both rise monotonically for 0 ≤ p ≤ 1/2 and then drop
symmetrically. Hence, we obtain the following result:
To get a better understanding of the way Feature Knockout works, we study the behavior of scaled Tikhonov regularization. As mentioned in Sec. 3, in the boosting case, the
knockout procedure is expected to produce solutions which
make use of more features. Are these models more complex? This is hard to define in the general case, but easy to
answer in the linear least square case study.
In linear models, the predictions y̌ on the training data
take the form: y̌ = P y. For example, in the unregularized
pseudo inverse case we have y̌ = A⊤ h = A⊤ (AA⊤ )−1 Ay,
and therefore P = A⊤ (AA⊤ )−1 A. There is a simple measure of complexity called the effective degrees of freedom
[8], which is just Tr(P) for linear models. A model with P = I (the identity matrix) has zero training error, but may overfit. In the full rank case, it has as many effective degrees of freedom as the number of features (Tr(P) = n).
Lemma 2 The linear model obtained using scaled
Tikhonov regularization has a lower effective degree of
freedom than the linear model obtained using unregularized
least squares.
Proof This claim is very standard for Tikhonov regularization. Here we present a slightly more elaborate proof.
Using the same rules we can prove other claims, such as
the claim that the condition number of the matrix we invert
using scaled Tikhonov regularization is lower than the one
achieved without regularization. A lower condition number
is known to lead to better generalization.
For scaled Tikhonov regularization we have y̌ = P y, where P = A⊤(AA⊤ + λD²)^{-1}A. Therefore,
Tr(P) = Tr(A⊤(AA⊤ + λD²)^{-1}A) = Tr((DD^{-1}AA⊤D^{-1}D + λD²)^{-1}AA⊤) = Tr((D(D^{-1}AA⊤D^{-1} + λI)D)^{-1}AA⊤) = Tr(D^{-1}(D^{-1}AA⊤D^{-1} + λI)^{-1}D^{-1}AA⊤) = Tr((D^{-1}AA⊤D^{-1} + λI)^{-1}D^{-1}AA⊤D^{-1}) = Tr((EE⊤ + λI)^{-1}EE⊤), where E = D^{-1}A.
Let USV⊤ = E be the Singular Value Decomposition of E, where S is a diagonal matrix, and U and V are orthonormal matrices. The above trace is exactly Tr((US²U⊤ + λI)^{-1}US²U⊤). Let S* be the diagonal matrix with elements S*_kk = S²_kk + λ; then (US²U⊤ + λI)^{-1} = (US*U⊤)^{-1} = US*^{-1}U⊤. The above trace becomes Tr(US*^{-1}U⊤US²U⊤) = Tr(S*^{-1}S²U⊤U) = Tr(S*^{-1}S²) = Σ_k S²_kk / (S²_kk + λ) < rank(E) = rank(A). Compare this value with the effective degrees of freedom of the unregularized least squares solution: Tr(A⊤(AA⊤)^{-1}A) = Tr((AA⊤)^{-1}AA⊤) = rank(A). The last equality also holds in the case where A is not full rank, in which case (AA⊤)^{-1} is only defined on the range of AA⊤.
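As a numerical illustration of Lemma 2 (our sketch; the sizes and λ are arbitrary), Tr(P) can be compared for the two estimators:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, lam = 8, 50, 1.0
A = rng.standard_normal((n, m))                 # examples are columns
D2 = np.diag(np.diag(A @ A.T))                  # D_kk^2 = (A A^T)_kk

# Hat matrices y_check = P y for the two estimators.
P_plain = A.T @ np.linalg.solve(A @ A.T, A)
P_scaled = A.T @ np.linalg.solve(A @ A.T + lam * D2, A)

print(np.trace(P_plain))    # = rank(A) = n (here 8)
print(np.trace(P_scaled))   # strictly smaller: fewer effective degrees of freedom
```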
Lemma 3 Let t be a single nm-feature created by combining a single input feature with a threshold. The amount
of inhibition t undergoes, as the result of applying feature
knockout, grows monotonically with the entropy of p.
Hence, similarly to the scaling in the linear case, the
knockout procedure inhibits high magnitude features (here
optimal and main predictions: B(x) = (f(x) ≠ f_*(x)). The variance V(x) is defined to be the expected loss of the prediction with regard to the main prediction: V(x) = E_{x̂∼C_X(x)}(f(x) ≠ f(x̂)). These definitions allow us to present the following observation:
the magnitude is measured by the entropy). Note that in the
algorithm presented in Sec. 5, a feature is used for knockout
only after it was selected to be a part of the output classifier.
Still, KO inhibits weak classifiers that are based on higher-entropy features more strongly, making them less likely to be picked again. It is possible to perform this higher-entropy preferential inhibition directly on all features, thereby simulating the full knockout procedure. The implementation of this is left for future experiments.
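The monotone relation invoked in Lemma 3 and the discussion above is easy to check numerically; the following small script (ours) verifies that the binary entropy of p and the flip probability 2p(1 − p) both increase on [0, 1/2] and fall symmetrically afterwards.

```python
import numpy as np

p = np.linspace(0.01, 0.99, 99)
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))   # binary entropy of the positive ratio p
flip_prob = 2 * p * (1 - p)                               # chance the knocked-out value crosses the threshold

# Both curves peak at p = 1/2 and are symmetric around it,
# so the inhibition grows monotonically with the entropy of p.
assert np.all(np.diff(entropy[p <= 0.5]) > 0) and np.all(np.diff(flip_prob[p <= 0.5]) > 0)
assert np.allclose(entropy, entropy[::-1]) and np.allclose(flip_prob, flip_prob[::-1])
```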
Observation 1 Let B₀ be the set of all training-example indices for which the bias B(x_i) is zero (the unbiased set). Let B₁ be the set for which B(x_i) = 1 (the biased set). Then,
Σ_{i=1}^n E_{x̂∼C_X(x_i)}(f(x̂) ≠ y_i) = Σ_{i=1}^n B(x_i) + Σ_{i∈B₀} V(x_i) − Σ_{i∈B₁} V(x_i).
4.2 Bias/variance decompositions
In the unbiased case (B(x) = 0), the variance (V (x)) increases the training error. In the biased case (B(x) = 1), the
variance at point x decreases the error. A function f , which
minimizes the training cost function that was obtained using Feature Knockout, has to deal with these two types of
variance directly while training. Define the net variance to be the biased variance minus the unbiased variance;
a function trained using the Feature Knockout procedure is
then expected to have a higher net variance than a function
trained without this procedure. If we assume our corruption process CX is a reasonable model of the robustness
expected from our classifier, a good classifier would have
a high net variance on the testing data.² The net variance
measured in our experiments shows the effect of the Feature Knockout approach. An interesting application that is
not explored in this paper is the exploitation of net variance
to derive confidence bars at a point (i.e., a measure of certainty in our prediction). Since Feature Knockout emphasizes these differences, it yields narrower confidence bars.
Many training algorithms can be interpreted as trying to minimize a cost function of the form Σ_{i=1}^n L(f(x_i), y_i), where L is a loss function. For example, with the 0/1 loss function L(f(x), y) = (f(x) ≠ y), we pay 1 if the labels are different, 0 otherwise. By applying the knockout procedure to generate more training data, an algorithm that minimizes such a cost function will actually minimize Σ_{i=1}^n E_{x̂∼C_X(x_i)} L(f(x̂), y_i), where C_X(x) represents the distribution of all knocked-out examples created from x.
In the case of the square loss function L(f(x), y) = (y − f(x))², the cost function can be decomposed (similarly to [9]) into bias and variance terms, respectively:
Σ_{i=1}^n E_{x̂∼C_X(x_i)} L(f(x̂), y_i) = Σ_{i=1}^n L(E_{x̂∼C_X(x_i)} f(x̂), y_i) + Σ_{i=1}^n E_{ŵ∼C_X(x_i)} L(f(ŵ), E_{x̂∼C_X(x_i)} f(x̂)).
Consider the variance term. Since Feature Knockout dramatically changes the value of one of the features in the vector x_i, one can show that if the term Σ_{i=1}^n E_{ŵ∼C_X(x_i)} L(f(ŵ), E_{x̂∼C_X(x_i)} f(x̂)) is bounded, then our learning algorithm has a bounded differences property [12], which is equivalent to saying that by removing one of the features, the value of the learned function f will not change by more than a bounded amount. Consider a situation (which exists in our object recognition experiments) where our features are pulled independently from a pool of many possible features. The bounded difference property guarantees that with high probability the testing error we get is close to the expectation of the testing error with regard to selecting another set of random features of the same size. The formal discussion of these results, which follows the lines of [2], is omitted.
Let us now turn our attention to another “bias-variance”
decomposition. Consider one based on the 0/1 loss function, as analyzed in [4]. We follow the terminology of [4]
with a somewhat different derivation, and for the presentation below we include a simplified version. Assume for
simplicity that each training example occurs in our dataset
with only one label, i.e., if xi = xj then yi = yj . Define
the optimal prediction f∗ to be the “true” label f∗ (xi ) = yi .
Define the main prediction of a function f to be just the prediction f (x). The bias is defined to be the loss between the
5 The GentleBoostKO algorithm
While our regularization procedure can be applied, in principle, to any learning algorithm, using it directly when the
number of features n is high might be computationally demanding. This is because for each one of the m training examples, as many as n(m − 1) new examples can be created.
Covering even a small portion of this space might require
the creation of many synthesized examples.
The randomized procedure in Fig. 1 samples this large
space of synthesized training examples. Still, if there are
many features, the sampling would probably be too sparse.
However, for some algorithms our regularization technique can be applied with very little overhead. For boosting over regression stumps, it is sufficient to modify those
features that participate in the trained ensemble (i.e., those
features that actually participate in the classification).
The basic algorithm used in our experiments is specified
in Fig. 2. It is a modified version of the GentleBoost algo-
² We omit the formal discussion on the relation between variance on training examples and variance on testing examples.
the update of the weights of all examples (including the new
one), and a new round of boosting begins. This iterative process finishes when the weights of the examples converge,
or after a fixed number of iterations. In our experiments,
we stopped the boosting after 100 rounds–enough to ensure
convergence in all cases.
Input: (x_1, y_1), ..., (x_m, y_m) where x_i ∈ ℜⁿ, y_i ∈ Y = ±1.
Output: Composite classifier H(x).
1. Initialize weights w_i ← 1/m.
2. For t = 1, 2, 3, ..., T:
(a) For each feature k, fit a regression function f_t^{(k)}(x) by weighted least squares of y_i to x_i with weights w_i, i = 1..m + t − 1.
(b) Let k_min be the index of the feature with the minimal associated weighted least squares error.

6 Experiments
UCI repository experiments. We evaluated our methods
on 10 UCI repository datasets that are suitable for binary
classification, either by thresholding the value of the target
function (e.g. the price in the housing dataset) at its median, or by picking a label to be the positive class. These 10
datasets were: arrhythmia, dermatology, e-coli, glass, heart,
housing, letters, segmentation, wine, and yeast.
We split each dataset randomly into 10% training and 90% testing, and ran each of the following classifiers: GentleBoost, GentleBoostKO (Fig. 2), AdaBoost, AdaBoost
with a knockout procedure (similar to GentleBoostKO), and
linear SVM. We also ran linear SVM on a dataset that
contained 100 examples generated in accordance with the
knockout procedure of Fig. 1. SVMs with different nonlinear kernels produced either similar or worse results. In addition, we report results for GentleBoost combined with noise
injection (the algorithm that adds gaussian noise to the examples), with the best noise variance we found. By selecting results according to the performance on the testing data,
the noise injection results were biased, and should only be
taken as an upper bound for the performance of noise injection. Tab. 1 shows the results, averaged over 10 independent
runs.
In this table we measured the mean error, the standard
deviation of the error, and the number of features used by
the classifiers (SVM always uses the maximal number of
features). We also measured the variance over a distribution of knockout examples for correct classifications (unbiased variance) and for incorrect classifications (biased variance) (Sec. 4.2). This variance was computed in the following way: for each testing example we generated 50 knockout examples (Fig. 1) and computed the variance over these
50 examples. We averaged the variance over all biased and
unbiased testing examples. A good classifier produces more
variance for incorrectly classified examples, and only a little
variance for correctly classified ones.
It is apparent from the results that: (1) In general the
knockout procedure (KO) helps GentleBoost, raising it to
the same level of performance as SVM. (2) KO seems to
help AdaBoost as well, but not always. It is not clear
whether knockout helps SVM. (3) KO seems to help increase the net variance (which is good, see Sec. 4.2). (4) As
expected KO produces classifiers that tend to use more features. (5) KO shows different, mostly better, performance
(c) Update the classifier H(x) ← H(x) + f_t^{(k_min)}(x).
(d) Use Feature KO to create a new example x_{m+t}:
    Select two random indices 1 ≤ a, b ≤ m
    x_{m+t} ← x_a
    x_{m+t}(k_min) ← x_b(k_min)
    y_{m+t} ← y_a
(e) Set the new example weight w_{m+t} to that of its source: w_{m+t} ← w_a
(f) Update the weights and normalize:
    w_i ← w_i e^{−y_i f_t^{(k_min)}(x_i)}, i = 1..m + t
    w_i ← w_i / Σ_{j=1}^{m+t} w_j
3. Output the final classifier H(x)
Figure 2: The GentleBoostKO Algorithm. Steps d and e
constitute the differences from the original GentleBoost.
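Fig. 2 translates directly into code. The sketch below is our own simplified rendering (plain NumPy, exhaustive threshold search over the unique feature values, knockout indices drawn from the original m examples); it is meant to illustrate the control flow rather than reproduce the authors' implementation.

```python
import numpy as np

def fit_stump(x, y, w):
    """Weighted least-squares regression stump f(v) = a*(v > th) + b on one feature."""
    best = None
    for th in np.unique(x)[:-1]:                       # candidate thresholds
        mask = x > th
        if not mask.any() or mask.all():
            continue
        b = np.average(y[~mask], weights=w[~mask])     # weighted mean on each side
        a = np.average(y[mask], weights=w[mask]) - b
        err = np.sum(w * (y - (a * mask + b)) ** 2)
        if best is None or err < best[0]:
            best = (err, a, b, th)
    if best is None:                                   # constant feature: predict the weighted mean
        b = np.average(y, weights=w)
        best = (np.sum(w * (y - b) ** 2), 0.0, b, np.inf)
    return best                                        # (error, a, b, threshold)

def gentleboost_ko(X, y, T=100, rng=None):
    """Sketch of GentleBoostKO (Fig. 2). X: (m, n) array, y in {-1, +1}."""
    rng = np.random.default_rng(rng)
    X, y = X.copy(), y.astype(float).copy()
    m0 = len(y)                                        # knockout indices come from the original examples
    w = np.ones(m0) / m0
    H = []                                             # list of (feature, a, b, threshold)
    for _ in range(T):
        # (a)-(b): fit a stump per feature, keep the one with minimal weighted error
        fits = [fit_stump(X[:, k], y, w) for k in range(X.shape[1])]
        k_min = int(np.argmin([f[0] for f in fits]))
        _, a, b, th = fits[k_min]
        H.append((k_min, a, b, th))                    # (c): update the additive classifier
        # (d): Feature Knockout on the selected feature k_min
        ia, ib = rng.integers(m0, size=2)
        x_new = X[ia].copy()
        x_new[k_min] = X[ib, k_min]
        X = np.vstack([X, x_new])
        y = np.append(y, y[ia])
        w = np.append(w, w[ia])                        # (e): copy the source example's weight
        # (f): reweight all examples (including the new one) and normalize
        f_vals = a * (X[:, k_min] > th) + b
        w = w * np.exp(-y * f_vals)
        w = w / w.sum()
    return H

def predict(H, X):
    """Sign of the sum of the selected stumps."""
    scores = sum(a * (X[:, k] > th) + b for k, a, b, th in H)
    return np.sign(scores)
```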
rithm [7]. GentleBoost seems to converge faster than AdaBoost, and performs better for object detection problems
[15]. At each boosting round, a regression function is fitted (by weighted least-squares error) to each feature in the
training set. We used linear regression for our experiments,
fitting parameters a, b and th so that our regression functions are of the form f (x) = a(x > th) + b. The regression
function with the least weighted squared error is added to
the total classifier H(x) and its associated feature (kmin ) is
used for Feature Knockout (step d).
In the Feature Knockout step, a new example is created
using the class of a randomly selected example xa and all
of its feature values except for the value at kmin . The value
for this feature is taken from a second randomly-selected
example xb . The new example xm+t is then appended to the
training set. In order to quantify the importance of the new
example in the boosting process, a weight has to be assigned
to it. The weight wm+t of the new example is estimated by
copying the weight of the example from which most of the
features are taken (xa ). Alternatively, a more precise weight
can be determined by applying the total classifier H(x) to
the new example.
As with any boosting procedure, each iteration ends with
than noise injection (GentleBoostNI).
Visual recognition using the Caltech datasets. We tested our GentleBoostKO algorithm on several Caltech object recognition datasets that were presented in [6]. In each experiment we had to distinguish between images containing an object and background images that do not contain the object. The datasets (Airplanes, Cars, Faces, Leaves, and Motorbikes), as well as the background images, were downloaded from http://www.vision.caltech.edu/. For the experiments we used the predefined splits (available for all the datasets except the Leaves dataset). For Leaves, we used a random split of 50% training and 50% testing. Note that since our methods are discriminative, we needed a negative training set. To this end, we removed 30 random examples from the negative testing set and used them for training.
To turn each image into a feature vector we used 500 C2 features. These extremely successful features allow us to learn to recognize objects using few training images, and the results seem to be comparable to or better than the results reported in [5]. The results are shown in Fig. 3.
To compare with previous work, we used the error at the
equilibrium-point between false and true positives as our
error-measure. It is clear that for a few dozen examples,
SVM, GentleBoost and GentleBoostKO have the same performance level. However, for only a few training examples,
GentleBoost does not perform as well as SVM, while GentleBoostKO achieves the same level of performance.
We also tried to apply Lowe's SIFT features [11] to the same datasets, although these features were designed for a different task. For each image, we used Lowe's binaries to compute the SIFT description of each key point. We then sampled 1000 random keypoints k_1, ..., k_1000 from the training set. Let {k_i^I} be the set of all keypoints associated with image I. We represented each training and testing image I by a vector of 1000 elements [v^I(1) ... v^I(1000)], such that v^I(j) = min_i ||k_j − k_i^I||. Note that in [11] the use of the ratio of distances between the closest and the next closest points was encouraged (rather than just the minimum distance). For our application, which disregards all geometric information, we found that using the minimum gives much better results. For the testing and training splits reported in [6] we got the following results (ME = mean error, EqE = error at equilibrium):
Algorithm     | Planes | Cars  | Faces | Leaves | Motor.
Lin. SVM ME   | 0.104  | 0.019 | 0.107 | 0.118  | 0.033
gentleB ME    | 0.118  | 0.036 | 0.168 | 0.137  | 0.026
gentleBKO ME  | 0.100  | 0.033 | 0.119 | 0.114  | 0.023
Lin. SVM EqE  | 0.108  | 0.018 | 0.111 | 0.126  | 0.007
gentleB EqE   | 0.120  | 0.037 | 0.166 | 0.132  | 0.003
gentleBKO EqE | 0.111  | 0.030 | 0.136 | 0.120  | 0.008
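A minimal sketch (ours) of the min-distance representation described above, assuming each image has already been converted into an array of 128-dimensional SIFT descriptors and that the 1000 sampled training keypoints are stacked in a `prototypes` array:

```python
import numpy as np

def min_distance_representation(image_descriptors, prototypes):
    """Represent an image by v(j) = min_i ||k_j - k_i^I|| for each prototype k_j.

    image_descriptors: (p, 128) array of SIFT descriptors for image I.
    prototypes: (1000, 128) array of descriptors sampled from the training set.
    Returns a length-1000 vector.
    """
    # Pairwise distances between the prototypes and the image's keypoints.
    diff = prototypes[:, None, :] - image_descriptors[None, :, :]
    dists = np.linalg.norm(diff, axis=2)          # shape (1000, p)
    return dists.min(axis=1)
```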
Table 1: Results for datasets from the UCI repository. Each dataset was split into 10% training, 90% testing. The mean error, its standard deviation, the mean unbiased, biased, and net variance, as well as the mean number of features used, are shown for 10 independent experiments.

Dataset  | Algorithm  | Mean Error  | Unbias. Var. | Biased Var. | Net Var. | Feat. Used
ARRHY.   | AdaB       | 47.0%±4.0   | 0.011 | 0.010 | -0.001 | 6.2
ARRHY.   | AdaBKO     | 41.4%±4.3   | 0.022 | 0.026 | 0.004  | 45.0
ARRHY.   | GentleB    | 40.0%±7.0   | 0.100 | 0.130 | 0.030  | 43.8
ARRHY.   | GentleBKO  | 37.3%±3.7   | 0.042 | 0.068 | 0.026  | 55.2
ARRHY.   | GentleBNI  | 35.2%±1.2   | 0.093 | 0.137 | 0.044  | 45.2
ARRHY.   | LinSVM     | 38.5%±3.5   | 0.035 | 0.046 | 0.011  | 279.0
ARRHY.   | LinSVM KO  | 39.4%±4.9   | 0.035 | 0.050 | 0.015  | 279.0
DERM     | AdaB       | 4.7%±2.3    | 0.042 | 0.040 | -0.002 | 1.6
DERM     | AdaBKO     | 21.8%±43.8  | 0.015 | 0.165 | 0.187  | 17.4
DERM     | GentleB    | 3.6%±2.5    | 0.651 | 0.712 | 0.061  | 3.8
DERM     | GentleBKO  | 1.7%±1.5    | 0.018 | 0.135 | 0.117  | 18.9
DERM     | GentleBNI  | 2.4%±0.4    | 0.534 | 0.607 | 0.073  | 6.2
DERM     | LinSVM     | 0.8%±0.7    | 0.013 | 0.128 | 0.115  | 34.0
DERM     | LinSVM KO  | 1.2%±1.0    | 0.012 | 0.302 | 0.290  | 34.0
ECOLI    | AdaB       | 10.4%±5.8   | 0.291 | 0.451 | 0.161  | 4.0
ECOLI    | AdaBKO     | 11.7%±6.1   | 0.243 | 0.511 | 0.269  | 6.2
ECOLI    | GentleB    | 10.9%±5.9   | 0.546 | 0.625 | 0.079  | 3.2
ECOLI    | GentleBKO  | 5.8%±1.4    | 0.236 | 0.540 | 0.304  | 6.2
ECOLI    | GentleBNI  | 7.0%±0.9    | 0.485 | 0.714 | 0.229  | 3.6
ECOLI    | LinSVM     | 8.3%±3.3    | 0.235 | 0.483 | 0.248  | 7.0
ECOLI    | LinSVM KO  | 6.8%±1.8    | 0.253 | 0.542 | 0.289  | 7.0
GLASS    | AdaB       | 37.6%±7.4   | 0.272 | 0.296 | 0.024  | 5.4
GLASS    | AdaBKO     | 33.3%±7.0   | 0.302 | 0.337 | 0.035  | 8.0
GLASS    | GentleB    | 34.8%±5.2   | 0.420 | 0.431 | 0.011  | 6.0
GLASS    | GentleBKO  | 30.6%±6.9   | 0.287 | 0.320 | 0.033  | 8.0
GLASS    | GentleBNI  | 33.6%±3.3   | 0.320 | 0.343 | 0.023  | 7.8
GLASS    | LinSVM     | 40.8%±2.8   | 0.238 | 0.250 | 0.013  | 8.0
GLASS    | LinSVM KO  | 37.9%±6.9   | 0.262 | 0.286 | 0.024  | 8.0
HEART    | AdaB       | 24.5%±3.6   | 0.139 | 0.217 | 0.078  | 6.2
HEART    | AdaBKO     | 21.9%±4.4   | 0.148 | 0.319 | 0.171  | 10.4
HEART    | GentleB    | 26.7%±5.2   | 0.223 | 0.342 | 0.119  | 9.8
HEART    | GentleBKO  | 23.9%±3.4   | 0.191 | 0.351 | 0.160  | 12.6
HEART    | GentleBNI  | 26.7%±2.4   | 0.215 | 0.334 | 0.119  | 11.4
HEART    | LinSVM     | 23.8%±5.6   | 0.205 | 0.344 | 0.138  | 13.0
HEART    | LinSVM KO  | 24.2%±4.5   | 0.193 | 0.344 | 0.151  | 13.0
HOUSING  | AdaB       | 16.9%±1.8   | 0.138 | 0.192 | 0.054  | 4.6
HOUSING  | AdaBKO     | 16.7%±1.4   | 0.156 | 0.302 | 0.146  | 10.6
HOUSING  | GentleB    | 20.0%±3.9   | 0.243 | 0.358 | 0.115  | 10.0
HOUSING  | GentleBKO  | 17.6%±0.9   | 0.175 | 0.380 | 0.205  | 12.7
HOUSING  | GentleBNI  | 20.0%±1.1   | 0.204 | 0.361 | 0.157  | 10.6
HOUSING  | LinSVM     | 21.2%±4.6   | 0.297 | 0.454 | 0.157  | 13.0
HOUSING  | LinSVM KO  | 17.1%±1.8   | 0.254 | 0.430 | 0.176  | 13.0
LETTERS  | AdaB       | 14.0%±0.8   | 0.018 | 0.062 | 0.044  | 4.6
LETTERS  | AdaBKO     | 11.4%±2.8   | 0.040 | 0.147 | 0.107  | 9.4
LETTERS  | GentleB    | 4.6%±0.9    | 0.117 | 0.577 | 0.460  | 15.0
LETTERS  | GentleBKO  | 3.6%±0.6    | 0.094 | 0.551 | 0.457  | 15.5
LETTERS  | GentleBNI  | 2.8%±0.1    | 0.096 | 0.510 | 0.414  | 15.6
LETTERS  | LinSVM     | 4.2%±0.4    | 0.186 | 0.751 | 0.565  | 16.0
LETTERS  | LinSVM KO  | 4.1%±0.5    | 0.172 | 0.749 | 0.578  | 16.0
SEGM.    | AdaB       | 7.0%±0.9    | 0.054 | 0.121 | 0.068  | 3.4
SEGM.    | AdaBKO     | 9.4%±3.0    | 0.041 | 0.219 | 0.178  | 11.4
SEGM.    | GentleB    | 6.7%±1.4    | 0.251 | 0.399 | 0.148  | 4.8
SEGM.    | GentleBKO  | 7.8%±2.7    | 0.031 | 0.302 | 0.271  | 14.0
SEGM.    | GentleBNI  | 6.7%±1.1    | 0.268 | 0.445 | 0.177  | 4.0
SEGM.    | LinSVM     | 2.9%±1.6    | 0.146 | 0.409 | 0.263  | 19.0
SEGM.    | LinSVM KO  | 3.4%±2.3    | 0.181 | 0.546 | 0.366  | 19.0
WINE     | AdaB       | 15.9%±8.2   | 0.128 | 0.155 | 0.027  | 3.8
WINE     | AdaBKO     | 12.9%±6.1   | 0.124 | 0.318 | 0.194  | 11.2
WINE     | GentleB    | 17.1%±6.9   | 0.605 | 0.723 | 0.119  | 4.2
WINE     | GentleBKO  | 12.2%±4.9   | 0.117 | 0.324 | 0.207  | 11.9
WINE     | GentleBNI  | 17.1%±2.6   | 0.216 | 0.288 | 0.072  | 9.0
WINE     | LinSVM     | 12.5%±9.8   | 0.149 | 0.411 | 0.261  | 13.0
WINE     | LinSVM KO  | 15.7%±8.5   | 0.166 | 0.379 | 0.213  | 13.0
YEAST    | AdaB       | 31.1%±1.0   | 0.020 | 0.025 | 0.006  | 2.4
YEAST    | AdaBKO     | 31.3%±1.1   | 0.023 | 0.034 | 0.011  | 4.6
YEAST    | GentleB    | 33.9%±5.7   | 0.329 | 0.443 | 0.114  | 6.4
YEAST    | GentleBKO  | 32.7%±2.6   | 0.280 | 0.438 | 0.158  | 8.0
YEAST    | GentleBNI  | 32.4%±0.8   | 0.326 | 0.495 | 0.169  | 7.2
YEAST    | LinSVM     | 31.2%±0.3   | 0.000 | 0.001 | 0.001  | 8.0
YEAST    | LinSVM KO  | 31.2%±0.3   | 0.000 | 0.000 | 0.000  | 8.0
Car type identification. This dataset consists of 480 images of private cars and 248 images of mid-sized vehicles (such as SUVs). All images are 20 × 20 pixels, and
[Figure 3 graphs: six panels (Airplanes, Cars, Faces, Leaves, Motorbikes, and Car Types), each comparing Linear SVM, GentleBoost, and GentleBoostKO. The first five panels plot the equilibrium error against the total number of positive training examples; the Car Types panel plots the mean error against the percentage of the data used for training.]
Figure 3: A comparison using the C2 features between GentleBoost, GentleBoostKO (Fig. 2), and linear SVM on the five Caltech datasets: Airplanes, Cars, Faces, Leaves, and Motorbikes. The graphs show the equilibrium error rate vs. the number of training examples used from the class we want to detect. In each experiment, the test set was fixed to be the same as those described in [6]. Lower right corner: the results of applying the three algorithms to the car types dataset, together with example images. The results shown are the mean and standard error of 30 independent experiments versus the percentage of images used for training.
were collected using the car detector of Mobileye Corp. on a video stream taken from the front window of a moving car. The task is to learn to distinguish private cars from mid-sized vehicles, which has some safety applications. Taking into account the low resolution and the variability in the two classes, this is a difficult task. The results are shown in the bottom right corner of Fig. 3. Each point of the graph shows the mean error when applying the algorithms to training sets of different sizes (between 5 and 40 percent of the data). The rest of the examples were used for testing. It is evident that for this dataset GentleBoost outperforms SVM. Still, GentleBoostKO does even better.

7. Summary and Conclusions

Boosting algorithms, and especially the use of GentleBoost over regression stumps, are gaining a lot of popularity in the computer vision community. However, GentleBoost does not show good performance, compared to SVM, when learning from a small dataset. In this work we propose an enhancement to the GentleBoost algorithm that brings its level of performance up to that of SVM.

References

[1] C.M. Bishop. Training with Noise is Equivalent to Tikhonov Regularization. Neural Computation, 1995.
[2] O. Bousquet and A. Elisseeff. Stability and Generalization. JMLR, 2002.
[3] L. Breiman. Heuristics of Instability and Stabilization in Model Selection. Ann. Statist., 1996.
[4] P. Domingos. A Unified Bias-Variance Decomposition for Zero-One and Squared Loss. Proc. Int. Conf. AI, 2000.
[5] L. Fei-Fei, R. Fergus, and P. Perona. A Bayesian approach to unsupervised one-shot learning of object categories. ICCV, 2003.
[6] R. Fergus, P. Perona, and A. Zisserman. Object Class Recognition by Unsupervised Scale-Invariant Learning. CVPR, 2003.
[7] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Ann. Statist., 2000.
[8] T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning. Springer, 2001.
[9] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 1992.
[10] V. Koltchinskii and D. Panchenko. Complexities of convex
combinations and bounding the generalization error in classification. to appear in Ann. Statist.
[11] D.G. Lowe. Distinctive image features from scale-invariant
keypoints. IJCV, 2004.
[12] G. Lugosi. Concentration-of-measure inequalities
[13] A. Neumaier. Solving ill-conditioned and singular linear systems: A tutorial on regularization. SIAM Review 1998.
[14] R.E. Schapire. A brief introduction to boosting. Proc. Int.
Joint Conf. AI, 1999.
[15] A. Torralba, K.P. Murphy and W.T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. CVPR, 2004.
[16] P. Viola and M. Jones. Rapid object detection using a boosted
cascade of simple features. CVPR, 2001.
[17] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio
and V. Vapnik. Feature Selection for SVMs. NIPS, 2001.