Minimal Correlation Classification
Noga Levy, Lior Wolf
The Blavatnik School of Computer Science, Tel Aviv University, Israel
Abstract. When the description of the visual data is rich and consists
of many features, a classification based on a single model can often be enhanced using an ensemble of models. We suggest a new ensemble learning
method that encourages the base classifiers to learn different aspects of
the data. Initially, a binary classification algorithm such as Support Vector Machine is applied and its confidence values on the training set are
considered. Following the idea that ensemble methods work best when
the classification errors of the base classifiers are not related, we serially
learn additional classifiers whose output confidences on the training examples are minimally correlated. Finally, these uncorrelated classifiers
are assembled using the GentleBoost algorithm. Presented experiments
in various visual recognition domains demonstrate the effectiveness of
the method.
1 Introduction
In classification problems, one aims to learn a classifier that generalizes well on
future data from a limited number of training examples. Given a labeled training set
$\{(x_i, y_i)\}_{i=1}^m$, a classification algorithm learns a predictor from a predefined
family of hypotheses that accurately labels new examples. An ensemble of classifiers is a
set of classifiers whose individual decisions are combined in some way, typically by
weighted voting, to classify new examples. Effective ensemble learning obtains better
predictive performance than any of the individual classifiers [14], but this is not always
the case [7].
As described in [6], a necessary and sufficient condition for an ensemble of
classifiers to be more accurate than any of its individual members is that the
classifiers are accurate and diverse. Classifiers are diverse if their classification
errors on the same data differ. As long as each classifier is reasonably accurate,
the diversity promotes the error correcting ability of the majority vote.
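As a small worked illustration of the error-correcting effect of the majority vote, the sketch below computes the accuracy of a majority vote over n classifiers under the idealized assumption that their errors are fully independent and that each is correct with the same probability p. The independence assumption and the function name `majority_vote_accuracy` are ours and serve only as an illustration.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent classifiers are correct.

    Assumes n is odd and that every classifier is correct with probability p,
    independently of the others (an idealization used only for illustration).
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Three independent classifiers, each 70% accurate:
# 0.7^3 + 3 * 0.7^2 * 0.3 = 0.343 + 0.441 = 0.784
print(majority_vote_accuracy(0.7, 3))
```

When the errors are strongly correlated the gain disappears, since the classifiers then tend to be wrong on the same examples.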
Liu and Yao proposed Negative Correlation Learning (NCL), in which individual neural
networks are trained simultaneously [18]. They added a penalty term to produce learners
whose errors tend to be negatively correlated. The cost function used to train the $l$th
neural network is
$$e_l = \sum_{k=1}^{m} \left(f_l(x_k) - y_k\right)^2 - \lambda \sum_{k=1}^{m} \left(f_l(x_k) - f_{ensemble}(x_k)\right)^2 ,$$
where $f_l$ is the classification function of the $l$th neural network, and $f_{ensemble}$
is the combined one. That is, NCL encourages each learner to differ from the combined vote
on every training sample. The NCL principle has been adapted to various settings; for
example, Hu and Mao designed an NCL-based ensemble of SVMs for regression problems [12].
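The NCL cost of a single learner can be sketched directly from its definition. The snippet below is a minimal NumPy illustration, not the implementation of [18]; the function name `ncl_cost` and the array-based interface are assumptions made for the example.

```python
import numpy as np


def ncl_cost(f_l, f_ensemble, y, lam):
    """NCL cost of learner l: squared error minus lam times the squared
    deviation of the learner from the combined ensemble output."""
    f_l, f_ensemble, y = (np.asarray(a, dtype=float) for a in (f_l, f_ensemble, y))
    fit_term = np.sum((f_l - y) ** 2)
    diversity_term = np.sum((f_l - f_ensemble) ** 2)
    return float(fit_term - lam * diversity_term)


# Two samples: squared error is 1, deviation from the ensemble is 1,
# so the cost is 1 - 0.5 * 1 = 0.5.
print(ncl_cost([1.0, 0.0], [0.0, 0.0], [1.0, 1.0], lam=0.5))
```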
ECCV-12 submission ID 1335
MMDA is a dimensionality reduction algorithm that also relies on the minimal correlation principle [15]. SVM is repeatedly solved such that at each iteration the separating hyperplane is constrained to be orthogonal to the previous
ones. In our tests it became apparent that such an approach is too constraining
and does not take into account the need to treat each class separately.
Our approach measures the correlation between every pair of classifiers for each
class of the dataset separately and employs a different decorrelation criterion.
The minimal correlation method suggested here learns the base classifiers successively
using the same training set and classification algorithm, but every classifier
is required to output predictions that are uncorrelated with the predictions of
the former classifiers. The underlying assumption is that demanding the predictions of
one classifier to be uncorrelated with the predictions of another (in
addition to the accuracy demand) leads to distinct yet complementary models,
even though the classifiers are not conditionally independent.
2 Minimal Correlation Ensemble
The training data $S = \{(x_i, y_i)\}_{i=1}^m$ consists of examples $x_i \in \mathbb{R}^d$ and binary
labels $y_i \in \{\pm 1\}$. Our first classifier, denoted $w_1$, is the solution of a regular SVM
optimization problem:
$$\min_w \; \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^m \xi_i \quad \text{s.t.} \quad \forall i.\; y_i (w^T x_i) \ge 1 - \xi_i ,\; \xi_i \ge 0 .$$
Let $X_p$ be the set of all positive examples and $X_n$ the set of all negative examples
in $S$, arranged as matrices. We now look for a second classifier, $w_2$, with a
similar objective function, but add the correlation demand: the predictions of
$w_2$ on the examples of each class (separately) should be uncorrelated with the
predictions of $w_1$ on these examples. The separation into classes is necessary since
all accurate classifiers are expected to be correlated, as they provide comparable
labeling. Employing Pearson's sample correlation, this demand translates into
minimizing
$$r_p = \frac{\sum_{i \in I_p} \left(w_1^T x_i - \overline{w_1^T X_p}\right)\left(w_2^T x_i - \overline{w_2^T X_p}\right)}{(n-1)\, s_1 s_2} = \frac{\left\langle w_1^T (X_p - \bar{X}_p),\; w_2^T (X_p - \bar{X}_p) \right\rangle}{\|w_1^T (X_p - \bar{X}_p)\| \, \|w_2^T (X_p - \bar{X}_p)\|} ,$$
where $I_p$ is the set of indices of the positive examples, $\bar{X}_p$ is a matrix whose
columns are the mean vector of the positive class, and $\overline{w_i^T X_p}$ and $s_i$ for $i \in \{1, 2\}$
are the mean and standard deviation of $w_i^T X_p$, respectively.
Let $\hat{y}_{p_i}$ be the normalized predictions of $w_i$ on $X_p$, that is,
$$\hat{y}_{p_i} = \frac{w_i^T (X_p - \bar{X}_p)}{\|w_i^T (X_p - \bar{X}_p)\|} .$$
To maintain convexity, $s_2$ is omitted and the additional expression becomes the
covariance between the normalized output of the existing classifiers and the
output of the new classifier,
$$r_p = \left\langle \hat{y}_{p_1},\; w_2^T (X_p - \bar{X}_p) \right\rangle .$$
Denote $v_p = \hat{y}_{p_1} (X_p - \bar{X}_p)^T$ and $v_n = \hat{y}_{n_1} (X_n - \bar{X}_n)^T$. Since we are interested
in the correlation with the minimal magnitude, we choose to add the terms
$r_p^2 + r_n^2 = \|w^T v_p\|^2 + \|w^T v_n\|^2$ to the objective function.
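The per-class decorrelation vector and its relation to Pearson's correlation can be checked numerically. The following is a minimal NumPy sketch under our own conventions: examples of one class are stored as the columns of a d x n matrix, the vector is returned as a plain d-dimensional array (the transpose of the row form written above), and the helper names `decorrelation_vector` and `pearson` are hypothetical.

```python
import numpy as np


def decorrelation_vector(w_prev, X_class):
    """Return v = (X - X_bar) y_hat^T for one class.

    X_class is d x n with one example per column; y_hat holds the centered,
    unit-norm predictions of the previous classifier w_prev on that class.
    """
    centered = X_class - X_class.mean(axis=1, keepdims=True)
    preds = w_prev @ centered                    # centered predictions, length n
    y_hat = preds / np.linalg.norm(preds)
    return centered @ y_hat                      # d-dimensional vector


def pearson(a, b):
    """Pearson sample correlation of two prediction vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
X_p = rng.normal(size=(5, 30))                   # synthetic positive class
w1, w2 = rng.normal(size=5), rng.normal(size=5)
v_p = decorrelation_vector(w1, X_p)

# w2 @ v_p is the convex surrogate r_p; dividing by the norm of the centered
# predictions of w2 recovers the Pearson correlation of the two classifiers.
centered = X_p - X_p.mean(axis=1, keepdims=True)
print(w2 @ v_p / np.linalg.norm(w2 @ centered), pearson(w1 @ X_p, w2 @ X_p))
```

The surrogate differs from the true correlation only by the omitted normalization of the new classifier's predictions, which is what keeps the added term quadratic in $w$.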
The tradeoff between the correlation and the other expressions in the objective
function is controlled by a tradeoff parameter $\eta$. The new optimization problem
whose solution results in a second classifier is
$$\min_w \; \frac{\lambda}{2}\|w\|^2 + \frac{\eta}{2}\left(\|w^T v_p\|^2 + \|w^T v_n\|^2\right) + \sum_{i=1}^m \xi_i \quad \text{s.t.} \quad \forall i.\; y_i \langle w, x_i \rangle \ge 1 - \xi_i ,\; \xi_i \ge 0 .$$
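The optimization problem constructed to learn the $k$th classifier contains $2(k-1)$
correlation expressions, one for each preceding classifier and each class,
$$\min_w \; \frac{\lambda}{2}\|w\|^2 + \sum_{j=1}^{k-1} \frac{\eta_j}{2}\left(\|w^T v_{p_j}\|^2 + \|w^T v_{n_j}\|^2\right) + \sum_{i=1}^m \xi_i \quad \text{s.t.} \quad \forall i.\; y_i \langle w, x_i \rangle \ge 1 - \xi_i ,\; \xi_i \ge 0 . \qquad (1)$$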
The computational cost of solving all optimization problems is linear in the
number of classifiers.
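For a fixed $w$, the smallest feasible slack is the hinge loss $\xi_i = \max(0, 1 - y_i \langle w, x_i \rangle)$, so the value of problem (1) can be evaluated without the constraints. The sketch below does this with NumPy; the function name `mincorr_objective`, the row-wise layout of the examples, and the list-of-pairs representation of the decorrelation vectors are assumptions made for the illustration.

```python
import numpy as np


def mincorr_objective(w, X, y, lam, v_pairs, etas):
    """Value of problem (1) at w, with each slack set to its hinge loss.

    X is m x d (one example per row), y holds labels in {-1, +1},
    v_pairs is a list of (v_p, v_n) vectors, one pair per preceding
    classifier, and etas holds the matching tradeoff parameters.
    """
    hinge = np.maximum(0.0, 1.0 - y * (X @ w)).sum()
    corr = sum(
        0.5 * eta * ((w @ v_p) ** 2 + (w @ v_n) ** 2)
        for eta, (v_p, v_n) in zip(etas, v_pairs)
    )
    return float(0.5 * lam * (w @ w) + corr + hinge)


w = np.array([1.0, 0.0])
X = np.array([[2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, -1.0])
v_pairs = [(np.array([1.0, 0.0]), np.array([0.0, 1.0]))]
# Regularizer 1.0 + correlation penalty 2.0 + hinge loss 1.5 = 4.5
print(mincorr_objective(w, X, y, lam=2.0, v_pairs=v_pairs, etas=[4.0]))
```

Each additional classifier adds two quadratic terms, which is consistent with the number of correlation expressions stated above.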
2.1 The Dual Problem
We derive the dual problem to get a kernelized version of our method. Let $\alpha_i$
be the dual variable of the margin constraint of example $(x_i, y_i)$ and $\beta_i$ the dual
variable of the non-negativity constraint of $\xi_i$. The Lagrangian of the optimization
problem (Eq. 1) is
$$L(w, \alpha, \beta) = \frac{\lambda}{2}\|w\|^2 + \sum_{j=1}^{k-1} \frac{\eta_j}{2}\left(\|w^T v_{p_j}\|^2 + \|w^T v_{n_j}\|^2\right) + \sum_{i=1}^m \xi_i + \sum_{i=1}^m \alpha_i \left(1 - \xi_i - y_i w^T x_i\right) - \sum_{i=1}^m \beta_i \xi_i .$$
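The $w$ that minimizes the primal problem can be expressed as a function of the
dual variables $\alpha$ and $\beta$ by solving $\partial L / \partial w = 0$,
$$w = \left(\lambda I + \sum_{j=1}^{k-1} \eta_j \left(v_{p_j} v_{p_j}^T + v_{n_j} v_{n_j}^T\right)\right)^{-1} \sum_{i=1}^m \alpha_i y_i x_i .$$
Denote by $V$ the matrix whose columns are the vectors $v_{p_j}$ and $v_{n_j}$ multiplied
by the square root of the matching tradeoff parameter $\eta_j$,
$$V = \left[\sqrt{\eta_1}\, v_{p_1},\; \sqrt{\eta_1}\, v_{n_1},\; \ldots,\; \sqrt{\eta_{k-1}}\, v_{p_{k-1}},\; \sqrt{\eta_{k-1}}\, v_{n_{k-1}}\right] , \qquad (2)$$
then
$$w = \left(\lambda I + V V^T\right)^{-1} \sum_{i=1}^m \alpha_i y_i x_i . \qquad (3)$$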
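Eq. 3 can be evaluated either by solving a d x d linear system or, since $V$ has only $2(k-1)$ columns, by solving a much smaller system through the standard Woodbury identity $(\lambda I + V V^T)^{-1} = \frac{1}{\lambda}\left(I - V (\lambda I + V^T V)^{-1} V^T\right)$. The NumPy sketch below compares the two routes on synthetic data; the function names and the row-wise layout of the examples are our assumptions, and the dual variables are random placeholders rather than the solution of an actual dual problem.

```python
import numpy as np


def primal_from_dual(alpha, y, X, V, lam):
    """Eq. 3 via a direct d x d solve. X is m x d, V is d x c."""
    d = X.shape[1]
    rhs = X.T @ (alpha * y)                      # sum_i alpha_i y_i x_i
    return np.linalg.solve(lam * np.eye(d) + V @ V.T, rhs)


def primal_from_dual_woodbury(alpha, y, X, V, lam):
    """Same w, using the Woodbury identity and only a c x c solve."""
    c = V.shape[1]
    rhs = X.T @ (alpha * y)
    small = np.linalg.solve(lam * np.eye(c) + V.T @ V, V.T @ rhs)
    return (rhs - V @ small) / lam


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
alpha = rng.uniform(size=6)                      # placeholder dual variables
V = rng.normal(size=(4, 2))                      # placeholder decorrelation matrix
print(np.allclose(primal_from_dual(alpha, y, X, V, 0.5),
                  primal_from_dual_woodbury(alpha, y, X, V, 0.5)))
```

The second form touches the examples only through inner products with the columns of $V$ and with each other, which is the property that makes a kernelized version possible.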
Substituting the inverse matrix in Eq. 3 using the Woodbury identity, and then
using the fact that the columns of $V$ are linear combinations of training examples,
classifying an example $x$ can be kernelized using any kernel function $k(x_i, x_j)$ so
that $w^T x$ is
$$\lambda \sum_{i=1}^m \alpha_i y_i \left[ k(x_i, x) - \sqrt{\eta}^{\,T} k(x_i, V) \left(I + V^T V\right)^{-1} k(x, V)^T \sqrt{\eta} \right] ,$$
where $k(x, V)$ is the column vector whose $i$th coordinate is $k(x, V_i)$.
As in SVM, $w$ is a weighted combination of the training examples, but in
our solution this combination is projected onto a linear space that is determined
by the correlation expressions. Using Eq. 3, the dual objective function can be
kernelized as well. Denote $u = \sum_{i=1}^m \alpha_i y_i k(V, x_i) \sqrt{\eta}$, then the objective function
becomes
$$\max_\alpha \; \sum_{i=1}^m \alpha_i - u^T \left(I + V^T V\right)^{-1} u .$$
This formulation implies that the minimal correlation classifiers can be trained
on high dimensional data using a kernel function.
2.2 Online Version
Following Pegasos (Primal Estimated sub-Gradient SOlver for SVM) [21], we
derive an online sub-gradient descent algorithm for solving the optimization
problem at hand. In iteration $t$ of the algorithm, Pegasos chooses a random
training example $(x_{i_t}, y_{i_t})$ by picking an index $i_t \in \{1, \ldots, m\}$ uniformly at random.
The objective function of the SVM is replaced with an approximation based on
the training example $(x_{i_t}, y_{i_t})$, yielding
$$f(w, i_t) = \frac{\lambda}{2}\|w\|^2 + l(w, (x_{i_t}, y_{i_t})) ,$$
where $l(w, (x, y)) = \max(0, 1 - y\, w^T x)$ is the hinge-loss function, and the update
rule is a sub-gradient descent step on the approximated function:
$$w_{t+1} = \left(1 - \frac{1}{t}\right) w_t + \beta\, \mathbb{1}\left[y_{i_t} \langle w_t, x_{i_t} \rangle < 1\right] y_{i_t} x_{i_t} ,$$
where $\beta$ is the step size ($1/(\lambda t)$ in Pegasos [21]).
We follow a similar path with our objective function. Unlike the batch mode, where
the entire training set is available, we receive one example at a time. The hinge
loss function suits this setting since it is the sum of $m$ expressions, each
depending on one example. The correlation, on the other hand,
depends on all the examples combined, and therefore cannot be separated into
single-example loss functions. That is, optimizing $\langle w_1^T x_i, w_2^T x_i \rangle$ is meaningless,
and we need to calculate the outputs of all training examples (or at least a large
enough subset) to estimate the correlation.
Nevertheless, we can serially calculate the correlation of the already seen
examples in a time- and space-efficient manner, as described below. Denote by
$X_{p,t}$ the matrix of positive training examples and by $X_{n,t}$ the matrix of the
negative ones chosen in iterations $1, \ldots, t$ (note that during the run of each
iteration only one training example is held). We define the correlation of $w_1$ and $w$
over $X_{p,t}$ as
$$r_{p,t} = \frac{\left\langle w^T (X_{p,t} - \bar{X}_{p,t}),\; w_1^T (X_{p,t} - \bar{X}_{p,t}) \right\rangle}{\|w_1^T (X_{p,t} - \bar{X}_{p,t})\|} = w^T\, \frac{X_{p,t} X_{p,t}^T w_1 - \bar{X}_{p,t} \bar{X}_{p,t}^T w_1}{\|w_1^T (X_{p,t} - \bar{X}_{p,t})\|} .$$
Denote
$$v_{p,t} = (X_{p,t} - \bar{X}_{p,t})\, \frac{\left(w_1^T (X_{p,t} - \bar{X}_{p,t})\right)^T}{\|w_1^T (X_{p,t} - \bar{X}_{p,t})\|} .$$
By maintaining the mean vector $\bar{X}_{p,t}$, the sum of squares $\sum (w_1^T X_{p,t})^2$, and the
product $X_{p,t} X_{p,t}^T w_1$, the calculation of $v_{p,t+1}$ is time efficient. Denote $v_{n,t}$
similarly for the negative examples. If $y_{i_t} = 1$ then $v_{p,t}$ is updated while
$v_{n,t} = v_{n,t-1}$; otherwise $v_{n,t}$ is calculated and $v_{p,t} = v_{p,t-1}$. Our approximation
in iteration $t$ is
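The running quantities mentioned above (the mean vector, the sum of squared predictions of $w_1$, and the product $X X^T w_1$) are enough to rebuild the decorrelation vector of one class in $O(d)$ time per example. The class below is a minimal NumPy sketch of that bookkeeping; the class name, its methods, and the use of the number of examples seen so far in the class as the count are our assumptions.

```python
import numpy as np


class RunningDecorrelationVector:
    """Maintains the statistics needed to rebuild v for one class online."""

    def __init__(self, w1):
        self.w1 = np.asarray(w1, dtype=float)
        d = self.w1.shape[0]
        self.n = 0                      # examples of this class seen so far
        self.sum_x = np.zeros(d)        # yields the mean vector
        self.sum_sq = 0.0               # sum of (w1^T x)^2
        self.xxt_w1 = np.zeros(d)       # X X^T w1 = sum of x * (x^T w1)

    def update(self, x):
        x = np.asarray(x, dtype=float)
        p = float(self.w1 @ x)
        self.n += 1
        self.sum_x += x
        self.sum_sq += p * p
        self.xxt_w1 += p * x

    def vector(self):
        """v = (X - X_bar)(X - X_bar)^T w1 / ||w1^T (X - X_bar)||.

        Requires at least two examples with distinct predictions,
        otherwise the normalizer is zero.
        """
        mean = self.sum_x / self.n
        mean_pred = float(self.w1 @ mean)
        numer = self.xxt_w1 - self.n * mean_pred * mean
        norm_sq = self.sum_sq - self.n * mean_pred**2
        return numer / np.sqrt(norm_sq)


rng = np.random.default_rng(1)
X = rng.normal(size=(4, 25))            # columns are positive examples
w1 = rng.normal(size=4)
running = RunningDecorrelationVector(w1)
for x in X.T:
    running.update(x)

# Batch computation for comparison.
centered = X - X.mean(axis=1, keepdims=True)
preds = w1 @ centered
print(np.allclose(running.vector(), centered @ (preds / np.linalg.norm(preds))))
```

Only three accumulators of size at most d are stored per class, so memory does not grow with the number of examples seen.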
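$$f(w, i_t) = \frac{\lambda}{2}\|w\|^2 + \frac{\eta}{2}\left(\|w^T v_{p,t}\|^2 + \|w^T v_{n,t}\|^2\right) + l(w; (x_{i_t}, y_{i_t})) ,$$
and the update step is computed in $O(d)$ time as
$$w_{t+1} = \left(I - \frac{1}{\lambda t}\left(\lambda I + \eta \left(v_{p,t} v_{p,t}^T + v_{n,t} v_{n,t}^T\right)\right)\right) w_t + \beta\, \mathbb{1}\left[y_{i_t} \langle w_t, x_{i_t} \rangle < 1\right] y_{i_t} x_{i_t} .$$
After a predetermined number $T$ of iterations, we output the last iterate $w_{T+1}$.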
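The update can be carried out without ever forming a d x d matrix, because each rank-one term acts on $w_t$ through a single inner product. The following NumPy sketch shows one such step; the function name `online_step` is hypothetical, and defaulting $\beta$ to the Pegasos step size $1/(\lambda t)$ is an assumption, since the text leaves $\beta$ unspecified.

```python
import numpy as np


def online_step(w, x, y, t, lam, eta, v_p, v_n, beta=None):
    """One sub-gradient step of the online minimal correlation update, in O(d)."""
    if beta is None:
        beta = 1.0 / (lam * t)          # assumed: Pegasos step size
    # (lam*I + eta*(v_p v_p^T + v_n v_n^T)) w, computed with inner products only
    reg_grad = lam * w + eta * (v_p * (v_p @ w) + v_n * (v_n @ w))
    w_next = w - reg_grad / (lam * t)
    if y * (w @ x) < 1:                 # margin violated: add the hinge term
        w_next = w_next + beta * y * x
    return w_next


rng = np.random.default_rng(2)
d = 5
w, x = rng.normal(size=d), rng.normal(size=d)
v_p, v_n = rng.normal(size=d), rng.normal(size=d)
print(online_step(w, x, y=1.0, t=7, lam=0.1, eta=2.0, v_p=v_p, v_n=v_n))
```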
2.3 Ensemble Method
After learning multiple classifiers, the base classifiers are assembled into one
strong classifier by using a form of stacking [23]. We found that GentleBoost
shows the best improvement in performance. Assume we have $k$ classifiers,
$w_1, \ldots, w_k$, and a validation set $S = \{(x_i, y_i)\}_{i=1}^m$, and we want to combine these $k$
classifiers into one strong classifier. First, we classify the validation set $S$ using
the $k$ base classifiers. Every sample in $S$ can now be represented as a vector in
$\mathbb{R}^k$ whose coordinates are the $k$ outputs, $t_i = (w_1^T x_i, \ldots, w_k^T x_i)$.
We regard these vectors $t_i$ as new $k$-dimensional feature vectors, and learn
a new classifier based on $S' = \{(t_i, y_i)\}_{i=1}^m$. As this second-stage classifier we use
GentleBoost [9] over decision stumps. GentleBoost is a version of AdaBoost that
uses a more conservative update of the weights assigned to the weak classifiers.
The classification of an unknown example $x$ is as follows: we first classify $x$
with $w_1, \ldots, w_k$ to construct the GentleBoost input vector $t$ and then apply the
ensemble classifier on $t$.
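The stacking stage can be sketched as follows: the base classifier outputs form the feature vectors $t_i$, and GentleBoost fits one weighted least-squares regression stump per round, in the general form described in [9]. This is a compact illustration rather than the implementation used in the experiments; all function names, the exhaustive threshold search, and the default number of rounds are assumptions.

```python
import numpy as np


def stack_features(W, X):
    """Rows of W are the k base classifiers; X is m x d. Returns the m x k matrix of t_i."""
    return X @ W.T


def fit_stump(T, y, w):
    """Weighted least-squares stump f(t) = a if t[j] > thr else b."""
    best = None
    for j in range(T.shape[1]):
        for thr in np.unique(T[:, j]):
            right = T[:, j] > thr
            wr, wl = w[right].sum(), w[~right].sum()
            a = (w[right] @ y[right]) / wr if wr > 0 else 0.0
            b = (w[~right] @ y[~right]) / wl if wl > 0 else 0.0
            err = w @ (y - np.where(right, a, b)) ** 2
            if best is None or err < best[0]:
                best = (err, j, thr, a, b)
    return best[1:]


def stump_predict(stump, T):
    j, thr, a, b = stump
    return np.where(T[:, j] > thr, a, b)


def gentleboost_fit(T, y, rounds=20):
    """Additive model of regression stumps with exponential re-weighting."""
    w = np.full(len(y), 1.0 / len(y))
    stumps = []
    for _ in range(rounds):
        stump = fit_stump(T, y, w)
        w = w * np.exp(-y * stump_predict(stump, T))
        w /= w.sum()
        stumps.append(stump)
    return stumps


def gentleboost_predict(stumps, T):
    return np.sign(sum(stump_predict(s, T) for s in stumps))


# Toy usage: one base classifier whose output separates the validation set.
W = np.array([[1.0, 0.0]])
X_val = np.array([[-2.0, 5.0], [-1.0, 3.0], [1.0, -4.0], [2.0, 0.0]])
y_val = np.array([-1.0, -1.0, 1.0, 1.0])
T_val = stack_features(W, X_val)
model = gentleboost_fit(T_val, y_val)
print(gentleboost_predict(model, T_val))
```

Because each stump outputs bounded class-conditional weighted means rather than an unbounded log-ratio, the per-round weight updates stay moderate.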
3 Generalization Bounds
Since every classifier is constrained by the preceding ones, a natural concern is
that the performance of subsequent classifiers could diminish. We prove that
the performance of the added classifiers is comparable to that of the preceding ones. The
optimization problem solved by the minimal correlation ensemble, after the first
round, minimizes the norm of $w$ as well as the norm of the correlation expressions
that depend on the training data and the preceding classifiers. Using the matrix
$V$ defined in Eq. 2, whose columns are the vectors multiplying $w$ in all the
correlation expressions that appear in the objective function, this norm is $\|w^T V\|$.
A column $V_i$ of $V$ is the product of $\hat{y}_{\{p/n\}_j}$ (the normalized predictions of
classifier $j$ on the training examples in one of the classes) and $\sqrt{\eta_j}\, X_{\{p/n\}}$. The
dependency on the former classifiers is derived from $\hat{y}_{\{p/n\}_j}$ and the dependency
on the training data is derived from both $\hat{y}_{\{p/n\}_j}$ and $X_{\{p/n\}}$.
Deriving a generalization bound using Rademacher complexity is not applicable
since the iid assumption, required to bound the true Rademacher complexity
with the empirical Rademacher complexity, does not hold. We use the technique
suggested by Shivaswamy and Jebara [22] to overcome this obstacle by introducing
a set of landmark examples $U = \{u_1, \ldots, u_m\}$ for the regularization. Denote
the function class considered by the minimal correlation optimization problem
as
$$G^S_{E,\lambda,\eta} := \left\{ x \to w^T x : \frac{\lambda}{2}\|w\|^2 + \|w^T V\|^2 \le E \right\} ,$$
where $E$ emerges from the value of problem (1) when $w = 0$. Let $U$ be the set
of landmark examples, which is essentially a validation set used to calculate
the correlation expressions and does not intersect with the training set. Denote
by $V_U$ the matrix constructed similarly to the matrix $V$ in Eq. 2 with
the landmark examples instead of the training set examples. The function class
considered when adding the landmark examples is
$$G^U_{B,\lambda,\eta} := \left\{ x \to w^T x : \frac{\lambda}{2}\|w\|^2 + \|w^T V_U\|^2 \le B \right\} .$$
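The following bound on learning the function class $G^S_{E,\lambda,\eta}$ holds:
Theorem 1. Fix $\gamma > 0$ and let the training set $S$ be drawn iid from a probability
distribution $D$. For any $g \in G^S_{E,\lambda,\eta}$, the following bound holds for
$B = E + O\left(\frac{1}{\sqrt{n}}\right)$ with probability at least $1 - \delta$,
$$\Pr_D\left[y \ne \mathrm{sign}(g(x))\right] \le \frac{1}{m\gamma} \sum_{i=1}^m \xi_i + 3\sqrt{\frac{\ln(8/\delta)}{2m}} + O\left(\frac{1}{\sqrt{m}\sqrt{\mathrm{tr} K}}\right) + \frac{4\sqrt{2B}}{m\gamma}\, E_U\left[\left(\sum_{i=1}^m x_i^T \left(\lambda I + V_U V_U^T\right)^{-1} x_i\right)^{0.5}\right] .$$
The proof follows from Theorem 18(iii) in [22].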
4 Experiments
We have conducted experiments on several data sets to evaluate the performance
of our algorithm in comparison with other classification algorithms. Specifically,
for each dataset, multiple binary problems were derived and used to compare
the performance of various classifiers. To measure the performance enhancement
arising from the classifiers' diversity while eliminating the effects of the ensemble,
we compared our method to boosting over decision stumps and over SVM as
weak classifiers, and to an ensemble of bagged SVMs [3].
The regularization parameter of SVM (λ) is determined using cross validation on the training set of each experiment. The same parameter is then used as
the regularization parameter of the minimal correlation ensemble. The kernel parameters for non-linear SVM are determined similarly for the baseline classifier,
and used for the entire ensemble.
The minimal correlation balancing parameters $\eta_j$ are calibrated as follows:
first, all $\eta_j$ for $j = 1, \ldots, k-1$ are set to 1. When the optimal solution is found,
the correlation between the new classifier and each of the former $k-1$ classifiers
is evaluated for both classes. Let $\beta \in (0, 1)$ be a predetermined upper bound on
the magnitude of each of the correlations. If the magnitude of the correlation of
the currently learned classifier with classifier $j$ over any of the classes is higher
than $\beta$, then $\eta_j$ is enlarged to $2\eta_j$. If any of the tradeoffs is updated, we solve
the optimization problem with the new tradeoffs. This process is repeated until
all correlations are below $\beta$. In all of our experiments we fix $\beta = 0.8$.
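The calibration procedure amounts to a doubling loop around the solver. The sketch below expresses it with two caller-supplied callables, `solve` and `correlations`, both of which are hypothetical placeholders for the optimization of Eq. 1 and for the per-class correlation measurement; the cap `max_rounds` is an added safeguard that the text does not mention.

```python
def calibrate_etas(solve, correlations, k, beta=0.8, max_rounds=50):
    """Double eta_j until every per-class correlation magnitude is at most beta.

    solve(etas) returns a classifier trained with the given tradeoffs;
    correlations(w) returns one (r_p, r_n) pair per preceding classifier.
    """
    etas = [1.0] * (k - 1)
    for _ in range(max_rounds):
        w = solve(etas)
        updated = False
        for j, (r_p, r_n) in enumerate(correlations(w)):
            if max(abs(r_p), abs(r_n)) > beta:
                etas[j] *= 2.0
                updated = True
        if not updated:
            return w, etas
    # Safeguard reached: return the classifier for the last tradeoffs tried.
    return solve(etas), etas


# Toy stand-ins: the "classifier" is just the list of tradeoffs, and the
# correlation with the first preceding classifier shrinks as eta_1 grows.
toy_solve = lambda etas: list(etas)
toy_corr = lambda w: [(min(1.0, 2.0 / w[0]), 0.0), (0.0, 0.1)]
print(calibrate_etas(toy_solve, toy_corr, k=3))
```

In the toy run the first tradeoff is doubled twice (1, 2, 4) before its correlation drops to 0.5, while the second is never changed.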
We train ensembles of $k = 4$ classifiers by default, and demonstrate performance
as a function of this parameter in Figure 2. For learning the ensembles,
the training set is split into 80% for training the base classifiers and 20% “validation”
for training the ensemble classifier. For the baseline classifiers, the entire
training set is used.
Letter Recognition The dataset, taken from the UCI Machine Learning Repository [20],
consists of 20,000 example images of the 26 upper-case letters of the English
alphabet. The letters are derived from 20 fonts that are randomly distorted
to form black-and-white images. Each image is analyzed to produce a feature
vector of 16 numerical attributes.
We derive several sets of binary problems with varying complexity. In the
first set, we perform 26 one-vs-all classification experiments. The results, shown
in Figure 1(a), indicate that the minimal correlation ensemble outperforms both
linear and Gaussian SVM. In a second set of experiments, we create composite
classes by combining pairs of letters. For each of the $\binom{26}{2}$ possibilities, we create
one positive class that is a union of the examples of two letters. As depicted
in Figure 1(b), the gap in performance in this case is even larger in favor of
the minimal correlation ensemble. The third set of experiments is similar and
contains positive classes that consist of the union of three random letters, see
Figure 1(c). The experiment is repeated for the Gaussian kernel, see Figure 1(j-l).
The comparison to GentleBoost over SVMs is shown in Figure 1(d-f). Figure
1(g-i) shows the comparison to a different ensemble method, based on multiple SVMs
trained on multiple random subsets of the training data and assembled using GentleBoost
with decision stumps. The minimal correlation ensemble surpasses both methods.
Finally, GentleBoost with decision stumps was applied over the original feature
space and obtained significantly lower performance levels.
Fig. 1. Experiments on the Letter Recognition Dataset (panels a-l): minimal correlation
ensemble vs. other classification methods. The first row (a-c) compares linear SVM to the
linear-based minimal correlation ensemble on one-vs-all, two-vs-all and three-vs-all binary
problems, from left to right respectively; the second row (d-f) compares the minimal
correlation ensembles to boosting with SVM as the weak classifier on the same
sets of problems. The third row (g-i) compares boosting over bagged SVMs to the
linear-based minimal correlation ensemble, and the fourth row (j-l) compares Gaussian SVM
to the Gaussian-based minimal correlation ensemble. In all the graphs, the x axis shows the
baseline classification accuracy and the y axis shows the minimal correlation ensemble
classification accuracy. Every '+' represents the results of one experiment.
Fig. 2. Average classification accuracy as a function of the number of ensemble
classifiers for one-vs-all, two-vs-all and three-vs-all classification problems on the letter
data set. The x-axis indicates the number of learned classifiers; the y-axis is the obtained
performance level for each ensemble size. (a-c) Linear SVM. (d-f) Gaussian kernel
SVM. Note that one classifier means the baseline SVM classifier.
To measure the impact of the number of learned classifiers, we learned for
every problem an ensemble of two to six classifiers. Figure 2 demonstrates the
performance improvement as a function of the number of base classifiers for the
one-vs-all, two-vs-all and three-vs-all experiments. In graphs (a)-(c) linear SVM
is used as the base classifier, and it is clear that each of the sequential classifiers
contributes to the performance. In graphs (d)-(f) the Gaussian kernel SVM is
used, and it seems that in this case the third and subsequent classifiers are not necessary.
Multi-modal pedestrian detection A large dataset made available to us
by a private research lab. The underlying problem consists of template-based
detection in surveillance cameras, where multiple templates are used due to the
large variability in the view angle. The dataset contains detections and false
detections obtained with over 30 different models. The template based detection
is a preliminary stage, in which for each of the pedestrian templates a separate
set of detections is found. It is our task to filter out the false detections for
each of these detectors. Each possible detection is characterized by different
features including (a) the intensities of the detected region (900 features), (b) the
coefficients of the affine transformation each model template undergoes to match
the detected region (6 features), (c) the SIFT [19] descriptor of the detected
region (128 features), and (d) the HOG [5] descriptor (2048 features) of the
detected region.
We have compared the performance of linear SVM to the performance of
the online version of the minimal correlation ensemble (given the number of
experiments needed, the batch version was not practical). For each of the four
feature types, and for each of the tens of models, we record the performance of
the two methods using a standard cross-validation scheme, where two thirds of
the samples were used for training and one third for testing. The presented rates
are averaged over 20 cross-validation runs.
The results are shown in Figure 3. For the gray-value-based features of Figure 3(a),
which are relatively weak, both methods perform similarly, with the baseline
method outperforming our method on 48% of the models. However, for the
more sophisticated representations, shown in Figure 3(b-d), the minimal correlation
ensemble outperforms the baseline method in over 70% of the experiments,
showing a typical average increase in performance of about 10%.
Fig. 3. The graphs present the results obtained for recognizing true detections vs. false
detections on tens of template-based models. Each of the four graphs presents results
for one feature type. (a) the gray values of the detected region; (b) the coefficients of the
model to detected region affine transformation; (c,d) SIFT and HOG edge histograms
of the found patch. Each point in each graph represents one model for which accuracy
of differentiating true from false detections is measured for both the baseline SVM
method (x-axis) and the minimal correlation ensemble method (y).
Facial attributes The first 1,000 images of the 'Labeled Faces in the Wild'
dataset [13] are labeled by a variety of attributes proposed to be useful for face
recognition in [16], available at [1]. Examples of such attributes include 'Blond
Hair', 'Bags under eyes', 'Frowning', 'Mouth Closed' and 'Middle Aged'. For
each attribute we conduct one binary experiment recognizing the presence of
this feature. Each face image is automatically aligned using automatically detected
feature points [8] and represented by a histogram of LBP features [2].
Cross-validation experiments (50% train; 50% test; average of 20 repeats) are
performed, and the results, comparing the minimal correlation ensemble with
linear SVM (often used for such tasks; Gaussian SVM, not shown, performs
worse), are presented in Figure 4(a). See Figure 4(b) for the top detections of
each of the first four minimally correlated classifiers learned for the attribute
'Eyebrow Thick'. As can be seen, these detections are not independent, demonstrating
that reducing correlation differs from grouping the samples and learning
multiple models or other means of strongly enforcing decorrelation.
Fig. 4. Experiments on the facial attributes of the 'Labeled Faces in the Wild' dataset.
(a) A comparison of the performance of linear SVM to the linear minimal correlation
ensemble for various attributes. The x axis shows the SVM performance and the y axis
shows the minimal correlation ensemble performance. Every '+' represents the result
on one facial attribute. (b) Highest-ranked images for the 'Eyebrow Thick' attribute.
The images in the first row are the examples ranked highest by the linear SVM; the
images in the rows underneath were ranked highest by the second, third and fourth
classifiers of the minimal correlation ensemble.
Wound Segmentation The analysis of biological assays such as wound
healing and scatter assay requires separating multi-cellular regions from background
regions in cellular bright field images. This is a challenging task due to
the heterogeneous nature of both the foreground and the background (see Figure 6(a,b)).
Automating this process would save time and effort in future studies,
especially considering the large amount of data currently available. Several
algorithms and tools for automatic analysis have recently been proposed to deal
with this task [11, 17]. To the best of our knowledge, the only freely available
software for automatic analysis of wound healing that performs reasonably well
on bright field images without specific parameter setting is TScratch [11].
TScratch uses fast discrete curvelet transform [4] to segment and measure
the area occupied by cells in an image. The curvelet transform extracts gradient
information in many scales, orientations and positions in a given image, and
encodes it as curvelet coefficients. It selects two scale levels to fit the gradient
details found in cells’ contours, and generates a curvelet magnitude image by
combining the two scale levels, which incorporates the details of the original
image in the selected scales. Morphological operators are applied to refine the
curvelet magnitude image. Finally, the curvelet magnitude image is partitioned
into occupied and free regions using a threshold. This approach was first applied
for edge detection in microscopy images [10].
We apply the minimal correlation ensemble to the segmentation solution described
in [24], based on learning the local appearance of cellular versus background
(non-cell) small image patches in wound healing assay images. A given
image is partitioned into small patches of 20x20 pixels. From each patch, the
extracted features are the image and patch gray-level mean and standard deviation,
the patch's gray-level histogram, a histogram of the patch's gradient intensity, a histogram
of the patch's spatial smoothness (the difference between the intensity of a pixel
and its neighborhood), and similar features from a broader area surrounding the
patch. All features are concatenated to produce combined feature vectors of
length 137.
The datasets, taken from [24], are: (a) 24 images available from the TScratch
website (http://chaton.inf.ethz.ch/software/). (b) 20 images of cell populations
of brain metastatic melanoma that were acquired using an inverted microscope.
(c) 28 DIC images of DA3 cell lines acquired using an LSM-410 microscope
(Zeiss, Germany). (d) 54 DIC images of similar cell lines acquired using
an LSM-510 microscope (Zeiss, Germany). In order to create a generic model,
the training set consists of twenty images collected at random from all
data sets. The model is tested on the remaining images of each of the four data
sets separately.
Figure 6(c-f) depicts the obtained performance, comparing vanilla linear
SVM to the proposed minimal correlation ensemble and to TScratch's results (we used
the continuous output of TScratch and not a single decision in order to plot
ROC curves). As can be seen, the patch classification method implemented here
outperforms TScratch in all datasets, and the proposed method significantly
outperforms SVM in three out of four data sets. Figure 5 compares the area
under the curve (AUC) of TScratch, linear SVM and the minimal correlation ensemble
for all datasets.
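For reference, the area under the ROC curve of a continuous-output classifier equals the fraction of (positive, negative) pairs that it ranks correctly, with ties counted as half. The sketch below computes it this way with NumPy; it is a generic illustration, not the evaluation code behind the reported numbers, and the function name `auc` is ours.

```python
import numpy as np


def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking (ties count as 0.5).

    labels equal to 1 mark positives; every other label is treated as negative.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels != 1]
    diff = pos[:, None] - neg[None, :]           # all positive/negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


# Three of the four positive/negative pairs are ordered correctly: AUC = 0.75
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```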
5 Summary
We employ a variant of SVM that learns multiple classifiers that are especially
suited for combination in an ensemble. The optimization is done efficiently using
QP, and we also present a kernelized version as well as an efficient online version.
Experiments on a variety of visual domains demonstrate the effectiveness of
the proposed method. It is shown that the method is especially effective on
compound classes containing several sub-classes and that the results are stable
with respect to the number of employed base classifiers. It is also demonstrated
that using the minimal correlation principle is not the same as learning several
classifiers on different subsets of each class.

                      Tscratch data   Melanoma   DA3 (LSM-410)   DA3 (LSM-510)
Minimal Correlation   96.6            94.1       95.6            95.5
Tscratch algo.        93.4            84         94.4            94.9
Boosted SVM           95.9            94.3       95.5            96.6
Bagging               95.7            92.6       95.4            95.4

Fig. 5. A comparison of the Area under curve (AUC) of linear SVM vs. linear minimal
correlation ensemble with two to six classifiers over the four wound healing datasets

Fig. 6. (a,b) Cell foreground segmentation examples. (c-f) Comparison of ROC curves
on four datasets of wound healing images. The minimal correlation ensemble (solid
blue line) is compared to the Tscratch algorithm (red dash-dot), the Boosted SVM
(dashed cyan) and Bagging (dotted green). The x-axis is the false positive rate and
the y-axis is the true positive rate. (c) The Tscratch dataset, (d) the Melanoma dataset,
(e) and (f) DA3 cell lines acquired with two different microscopes.
6 Acknowledgments
This research was supported by the I-CORE Program of the Planning and Budgeting Committee and The Israel Science Foundation (grant No. 4/11). The
research was carried out in partial fulfillment of the requirements for the Ph.D.
degree of Noga Levy.
References
1. www.cs.columbia.edu/CAVE/databases/pubfig/download/lfw_attributes.txt
2. Ahonen, T., Hadid, A., Pietikainen, M.: Face description with local binary patterns:
Application to face recognition. PAMI 28(12), 2037–2041 (Dec 2006)
3. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
4. Candes, E., Demanet, L., Donoho, D., Ying, L.: Fast discrete curvelet transforms.
Multiscale Modeling and Simulation (2006)
5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In:
CVPR (2005)
6. Dietterich, T.G.: Ensemble methods in machine learning. In: Multiple Classifier
Systems. pp. 1–15 (2000)
7. Dzeroski, S., Zenko, B.: Is combining classifiers with stacking better than selecting
the best one? Machine Learning 54(3), 255–273 (2004)
8. Everingham, M., Sivic, J., Zisserman, A.: “Hello! My name is... Buffy” – automatic
naming of characters in TV video. In: BMVC (2006)
9. Friedman, J., Hastie, T., Tibshirani, R.: Additive Logistic Regression: a Statistical
View of Boosting. The Annals of Statistics 28(2) (2000)
10. Geback, T., Koumoutsakos, P.: Edge detection in microscopy images using
curvelets. BMC Bioinformatics 10(75) (2009)
11. Geback, T., Schulz, M., Koumoutsakos, P., Detmar, M.: Tscratch: a novel and
simple software tool for automated analysis of monolayer wound healing assays.
Biotechniques 46, 265–274 (2009)
12. Hu, G., Mao, Z.: Bagging ensemble of svm based on negative correlation learning.
In: ICIS 2009. IEEE International Conference on. vol. 1, pp. 279 –283 (2009)
13. Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild:
A database for studying face recognition in unconstrained environments. University
of Massachusetts, Amherst, TR 07-49 (Oct 2007)
14. Kim, H.C., Pang, S., Je, H.M., Kim, D., Bang, S.Y.: Constructing support vector
machine ensemble. Pattern Recognition 36(12), 2757–2767 (2003)
15. Kocsor, A., Kovács, K., Szepesvári, C.: Margin maximizing discriminant analysis. In:
European Conference on Machine Learning (2004)
16. Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Attribute and simile classifiers for face verification. In: ICCV. pp. 365–372 (2009)
17. Lamprecht, M., Sabatini, D., Carpenter, A.: Cellprofiler: free, versatile software for
automated biological image analysis. Biotechniques 42, 71–75 (2007)
18. Liu, Y., Yao, X.: Simultaneous training of negatively correlated neural networks
in an ensemble. IEEE Transactions on Systems, Man, and Cybernetics, Part B:
Cybernetics 29, 716–725 (1999)
19. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV 60(2),
91–110 (2004)
20. Murphy, P., Aha, D.: UCI Repository of machine learning databases. Tech. rep.,
U. California, Dept. of Information and Computer Science, CA, US. (1994)
21. Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal estimated sub-gradient
solver for svm. In: ICML (2007)
22. Shivaswamy, P.K., Jebara, T.: Maximum relative margin and data-dependent regularization. Journal of Machine Learning Research (2010)
23. Wolpert, D.H.: Stacked generalization. Neural Networks 5(2), 241–259 (1992)
24. Zaritsky, A., Natan, S., Horev, J., Hecht, I., Wolf, L., Ben-Jacob, E., Tsarfaty, I.:
Cell motility dynamics: A novel segmentation algorithm to quantify multi-cellular
bright field microscopy images. PLoS ONE 6 (2011)