Journal of Machine Learning Research 44 (2015) xx-xx
Submitted 10/15; Published 10/00
Theory for the Localized Setting of Learning Kernels
Yunwen Lei
YUNWELEI@CITYU.EDU.HK
Department of Mathematics
City University of Hong Kong
Alexander Binder
ALEXANDER_BINDER@SUTD.EDU.SG
Machine Learning Group, TU Berlin
ISTD Pillar, Singapore University of Technology and Design
Ürün Dogan
UDOGAN@MICROSOFT.COM
Microsoft Research
Cambridge CB1 2FB, UK
Marius Kloft
KLOFT@HU-BERLIN.DE
Department of Computer Science
Humboldt University of Berlin
Editor: Dmitry Storcheus
Abstract
We analyze the localized setting of learning kernels, also known as localized multiple kernel learning. This problem has been addressed in the past using rather heuristic approaches that approximately optimize non-convex problem formulations, for which no theoretical learning bounds are known to date. In this paper, we show generalization error bounds for learning localized kernel classes where the localities are coupled using graph-based regularization. We propose a novel algorithm for learning localized kernels that is based on this hypothesis class and formulated as a convex optimization problem using a pre-obtained cluster structure of the data. We derive dual representations using Fenchel conjugation theory, based on which we give a simple yet efficient wrapper-based optimization algorithm. We apply the method to problems involving multiple heterogeneous data sources, taken from the domains of computational biology and computer vision. The results show that the proposed convex approach to learning localized kernels can achieve higher prediction accuracies than its global and non-convex local counterparts.
1. Introduction
Kernel-based learning algorithms (e.g., Schölkopf and Smola, 2002) including support vector machines (Cortes and Vapnik, 1995) have found diverse applications due to their distinct merits such as
decent computational complexity, high prediction accuracy (Delgado et al., 2014), and solid mathematical foundation (e.g., Mohri et al., 2012). Since the learning and data representation processes
are decoupled in a modular fashion, one can obtain non-linear kernel machines from simpler linear
ones in a canonical way. The performance of such algorithms, however, is fundamentally limited
through the choice of the involved kernel function as it intrinsically specifies the feature space where
the learning process is implemented. This choice is typically left to the user. A substantial step toward the complete automatization of kernel-based machine learning is achieved in Lanckriet et al.
(2004), who introduce the multiple kernel learning (MKL) or learning kernels framework (Gönen
and Alpaydin, 2011). Being formulated in terms of a single convex optimization criterion, MKL
offers a theoretically sound way (Wu et al., 2007; Ying and Campbell, 2009; Cortes et al., 2010;
Kloft and Blanchard, 2011, 2012; Cortes et al., 2013; Lei and Ding, 2014) of encoding complementary information with distinct base kernels and automatically learning an optimal combination of
those (Ben-Hur et al., 2008; Gehler and Nowozin, 2008) using efficient numerical algorithms (Bach
et al., 2004; Sonnenburg et al., 2006; Rakotomamonjy et al., 2008). This is particularly significant
in the application domains of bioinformatics and computer vision (Ben-Hur et al., 2008; Gehler and
Nowozin, 2008; Kloft, 2011), where data can be obtained from multiple heterogeneous sources,
describing different properties of one and the same object (e.g., genome or image). While early
sparsity-inducing approaches failed to live up to expectations in terms of improvements over uniform combinations of kernels (cf. Cortes, 2009, and references therein), it was shown that improved
predictive accuracy can be achieved by employing appropriate regularization (Kloft et al., 2011).
Currently, most of the existing algorithms fall into the global setting of MKL, in the sense
that the kernel combination is not varied over the input space. This ignores the fact that different
regions of the input space might require individual kernel weights. For instance, in the figures to the right, the images exhibit very distinct color distributions. While a kernel based on global color histograms may be effective to detect the horse object in the image to the left, it may fail in the image to the right, as the image fore- and backgrounds exhibit very similar color distributions. This motivates us to study localized approaches to learning kernels (Gönen
and Alpaydin, 2008). The existing algorithms (reviewed in the subsequent section), however, optimize non-convex objective functions using ad-hoc optimization heuristics, which hampers reproducibility. Whether or not these algorithms are protected against overfitting is still
an open research question as no theoretical guarantees—neither generalization error nor excess risk
bounds—are known.
In this paper, we show generalization error bounds for a localized setting of learning kernels,
where we assume a pre-specified cluster structure of the data. We show that empirical risk minimization over this class amounts to a convex optimization problem, for which we derive partial and complete dual representations using Fenchel conjugation theory and an efficient convex wrapper-based optimization algorithm. We apply the method to problems involving multiple
heterogeneous data sources, taken from domains of computational biology and computer vision.
The results show that the proposed convex approach to learning localized kernels can achieve higher
prediction accuracies than its global and non-convex local counterparts.
The remainder of this paper is structured as follows. In Section 2 we review related work; in
Section 3 our convex and localized formulation of learning kernels is introduced, a partial dual
representation of which is derived in Section 4, where we also present an efficient optimization
algorithm. We report on theoretical results including generalization error bounds in Section 5.
Empirical results for the application domains of visual image recognition and protein fold class
prediction are presented in Section 6; Section 7 concludes.
2. Related work
Gönen and Alpaydin (2008) initiated the work on localized MKL by using a discriminant function $f(x) = \sum_{k=1}^{M} \eta_k(x|V)\,\langle w_k, \phi_k(x)\rangle + b$, where $M$ is the number of kernels, $\eta_k(x|V)$ is a parametric gating model assigning a weight to $\phi_k(x)$ as a function of $x$, and $V$ encodes the parameters of
the gating model. The gating function is used to divide the input space into different regions,
each of which is assigned its own kernel weights. The joint optimization of the gating model and the
kernel-based prediction function is carried out by alternating optimization. This problem is nonconvex due to the non-linearity introduced by the gating function. Yang et al. (2009) develop a
group-sensitive variant of MKL tailored to object categorization. Their approach is non-convex
but, in contrast to Gönen and Alpaydin (2008), examples within a group share the same kernel
weights while examples from different groups employ distinct sets of kernel weights. Han and
Liu (2012) modify the approach of Gönen and Alpaydin (2008) by complementing the spatial-similarity-based kernels with probability confidence kernels that reflect the degree of confidence to
which the involved examples belong to the same class. Song et al. (2011) present a localized MKL
algorithm for realistic human action recognition in videos. However, the involved local models
are constructed in an independent fashion. Therefore, they ignore the coupling among different
localities, and may produce a suboptimal classifier even when these localities are moderately
correlated. Recently, a localized MKL formulation has been studied as a computational means to
study non-linear SVMs (Jose et al., 2013).
All these approaches are based on non-convex optimization criteria and lack a supporting learning theory.
To our knowledge, the only theoretically sound approach in the context of the localized setting
of learning kernels is by Cortes, Kloft, and Mohri (2013). They present an MKL approach based
on controlling the local Rademacher complexity of the resulting kernel combination. Note that the
meaning of locality is different here, however: while in the present work we perform assignments of
kernel weights locally with respect to the input space, Cortes, Kloft, and Mohri (2013) localize the
hypothesis class, which leads to sharper generalization bounds (Kloft and Blanchard, 2011, 2012).
3. Learning methodology
In this paper, we study a convex formulation of localized MKL (CLMKL). For simplicity, we present
our approach for binary classification, although the approach is general and can be extended to
regression, multi-class classification, and structured output prediction.
3.1 Localized Problem Setting of Learning Kernels
Suppose we are given $M$ base kernels $k_1, \ldots, k_M$, with $\phi_m$ being the kernel feature map of the $m$-th kernel, i.e., $k_m(x, \tilde{x}) = \langle \phi_m(x), \phi_m(\tilde{x})\rangle_{k_m}$. Let $H_m$ be the reproducing kernel Hilbert space corresponding to the kernel $k_m$, with inner product $\langle\cdot,\cdot\rangle_{k_m}$ and induced norm $\|\cdot\|_{k_m}$. For clarity, we frequently use the shorthand $\langle\cdot,\cdot\rangle := \langle\cdot,\cdot\rangle_{k_m}$ and $\|\cdot\|_2 := \|\cdot\|_{k_m}$. For any $d \in \mathbb{N}_+$, we introduce the notation $\mathbb{N}_d = \{1, \ldots, d\}$. Suppose that the training examples $(x_1, y_1), \ldots, (x_n, y_n)$ are partitioned into $l$ disjoint clusters $S_1, \ldots, S_l$. For each cluster $S_j$, we learn a distinct linear combined kernel $\tilde{k}_j = \sum_{m \in \mathbb{N}_M} \beta_{jm} k_m$ and a distinct weight vector $w_j = (w_j^{(1)}, \ldots, w_j^{(M)})$. This results, for each cluster $S_j$, in a linear model
$$f_j(x) = \langle w_j, \phi(x)\rangle + b = \sum_{m \in \mathbb{N}_M} \langle w_j^{(m)}, \phi_m(x)\rangle + b,$$
where $\phi = (\phi_1, \ldots, \phi_M)$ is the concatenated feature map.
3.2 Notation
For a Hilbert space $H$ with inner product $\langle\cdot,\cdot\rangle$ and $l$ elements $w_1, \ldots, w_l \in H$, we define the $\Sigma$ semi-norm of $(w_1, \ldots, w_l)$ by
$$\|(w_1, \ldots, w_l)\|_\Sigma := \Big(\sum_{j, \tilde{j} \in \mathbb{N}_l} \Sigma_{j\tilde{j}} \langle w_j, w_{\tilde{j}}\rangle\Big)^{1/2}, \qquad (1)$$
where $\Sigma$ is a positive semi-definite $l \times l$ matrix. For any $\beta = (\beta_{jm})_{j \in \mathbb{N}_l, m \in \mathbb{N}_M}$ and any $m \in \mathbb{N}_M$, we write $Q_m^\beta := Q_m^{\beta, \frac{1}{\mu}\Sigma} = (q_{mj\tilde{j}}^{(\beta)})_{j,\tilde{j} \in \mathbb{N}_l} = [\mathrm{diag}(\beta_{1m}^{-1}, \ldots, \beta_{lm}^{-1}) + \mu \Sigma^{-1}]^{-1}$, where $\mathrm{diag}(a_1, \ldots, a_l)$ is the $l \times l$ diagonal matrix with $a_1, \ldots, a_l$ on the main diagonal. For any $x \in \mathcal{X}$, we use $\tau(x)$ to denote the index of the cluster to which the point $x$ belongs, i.e., $\tau(x) = j \iff x \in S_j$. For brevity, we write $\tau(i) := \tau(x_i)$ for all $i$ and $a_+ = \max(a, 0)$ for all $a \in \mathbb{R}$. We introduce the notation $w^{(m)} = (w_1^{(m)}, \ldots, w_l^{(m)})$. For any $p \geq 1$, we denote by $p^*$ its conjugate exponent, satisfying $\frac{1}{p} + \frac{1}{p^*} = 1$. For $w_j = (w_j^{(1)}, \ldots, w_j^{(M)})$, we define the $\ell_{2,p}$-norm by $\|w_j\|_{2,p} := \big(\sum_{m \in \mathbb{N}_M} \|w_j^{(m)}\|_{k_m}^p\big)^{1/p}$.
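To make the role of the matrices $Q_m^\beta$ introduced above concrete, here is a minimal NumPy sketch (our own illustration, not part of the paper; the names `compute_Q`, `beta`, `Sigma_inv` and `mu` are ours) of how they can be formed for all kernels at once.

```python
import numpy as np

def compute_Q(beta, Sigma_inv, mu):
    """Form Q_m^beta = [diag(beta_{1m}^{-1}, ..., beta_{lm}^{-1}) + mu * Sigma^{-1}]^{-1}
    for every kernel m, following the notation of Section 3.2.

    beta      : array of shape (l, M), kernel weights beta_{jm} (assumed > 0)
    Sigma_inv : array of shape (l, l), the positive semi-definite matrix Sigma^{-1}
    mu        : regularization parameter
    returns   : array of shape (M, l, l) with Q[m] = Q_m^beta
    """
    l, M = beta.shape
    Q = np.empty((M, l, l))
    for m in range(M):
        Q[m] = np.linalg.inv(np.diag(1.0 / beta[:, m]) + mu * Sigma_inv)
    return Q

# toy usage: l = 3 clusters, M = 2 kernels
rng = np.random.RandomState(0)
beta = np.full((3, 2), 0.5)          # uniform initialization (1/M)^(1/p) with M = 2, p = 1
A = rng.randn(3, 3)
Sigma_inv = A @ A.T + np.eye(3)      # some positive definite coupling matrix
Q = compute_Q(beta, Sigma_inv, mu=1.0)
print(Q.shape)                       # (2, 3, 3)
```

The entries $q_{mj\tilde j}^{(\beta)}$ of these matrices reappear below in the composite kernel of Eq. (3) and in the weight-vector representation of Eq. (5).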
3.3 Convex localized multiple kernel learning (CLMKL)
The proposed convex formulation for localized MKL is given as follows (for simplicity presented
in terms of the hinge loss function; for a general presentation, see Appendix B.2):
Problem 1 (CONVEX LOCALIZED MULTIPLE KERNEL LEARNING (CLMKL))
$$\begin{aligned}
\min_{w,\xi,\beta,b}\quad & \sum_{j\in\mathbb{N}_l,\,m\in\mathbb{N}_M}\frac{\|w_j^{(m)}\|_2^2}{2\beta_{jm}}+\frac{\mu}{2}\sum_{m\in\mathbb{N}_M}\|w^{(m)}\|_{\Sigma^{-1}}^2+C\sum_{i\in\mathbb{N}_n}\xi_i\\
\text{s.t.}\quad & y_i\Big(\sum_{m\in\mathbb{N}_M}\langle w_j^{(m)},\phi_m(x_i)\rangle+b\Big)\geq 1-\xi_i,\ \ \xi_i\geq0,\quad\forall i\in S_j,\,j\in\mathbb{N}_l\\
& \sum_{m\in\mathbb{N}_M}\beta_{jm}^p\leq1,\ \forall j\in\mathbb{N}_l,\qquad\beta_{jm}\geq0,\ \forall j\in\mathbb{N}_l,\,m\in\mathbb{N}_M,
\end{aligned}\qquad(2)$$
where $\xi_i$ are slack variables, $C$ and $\mu$ are regularization parameters, and $\Sigma^{-1}$ is a positive semi-definite matrix (note that we never need to compute the inverse of $\Sigma^{-1}$ in the implementation).
Note that we impose, for each cluster $S_j$, $j\in\mathbb{N}_l$, a separate $\ell_p$-norm constraint (Kloft et al., 2011) on the combination coefficients $\beta_j = (\beta_{j1}, \ldots, \beta_{jM})$. However, unlike training a local model independently at each locality, these $l$ local models are optimized jointly in our formulation, exploiting that examples in nearby localities may convey complementary information for the learning task. The regularizer defined in (1) encodes the relationship among different clusters and imposes a soft constraint on how these local models shall be correlated. Note that, if $\Sigma^{-1}$ is the graph Laplacian of an adjacency matrix $W$ (i.e., $\Sigma^{-1} = D - W$ with $D_{j\tilde j} = \delta_{j\tilde j}\sum_{k\in\mathbb{N}_l} W_{jk}$), the regularizer (1) coincides with the graph regularizer employed also in Evgeniou et al. (2005): $\|w^{(m)}\|_{\Sigma^{-1}}^2 = \frac{1}{2}\sum_{j,\tilde j\in\mathbb{N}_l} W_{j\tilde j}\|w_j^{(m)} - w_{\tilde j}^{(m)}\|_2^2$. Recall that a quadratic function over a linear function is convex (e.g., Boyd and Vandenberghe, 2004, p. 89), so all summands occurring in formulation (2) are convex, and (2) is hence a convex optimization problem. Note that Slater's condition can be checked directly, and thus strong duality holds. To the best of our knowledge, problem (2) is the first convex formulation in the localized setting of learning kernels.
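To illustrate the graph-Laplacian choice of $\Sigma^{-1}$ discussed above, the following NumPy sketch (our own toy example; the adjacency values in `W` are made up) builds $\Sigma^{-1} = D - W$, checks that it is positive semi-definite, and numerically verifies the graph-regularizer identity for random cluster-wise weight vectors.

```python
import numpy as np

# Hypothetical symmetric, non-negative cluster adjacency matrix W for 4 clusters.
W = np.array([[0.0, 0.8, 0.1, 0.0],
              [0.8, 0.0, 0.5, 0.2],
              [0.1, 0.5, 0.0, 0.9],
              [0.0, 0.2, 0.9, 0.0]])

D = np.diag(W.sum(axis=1))
Sigma_inv = D - W                      # graph Laplacian used as the coupling matrix

# The Laplacian is positive semi-definite, as required for the regularizer in (1).
assert np.all(np.linalg.eigvalsh(Sigma_inv) >= -1e-10)

# Verify ||w^{(m)}||^2_{Sigma^{-1}} = 1/2 * sum_{j,k} W_{jk} ||w_j - w_k||^2
# for random cluster-wise weight vectors w_j^{(m)} in R^5 (one row per cluster).
rng = np.random.RandomState(0)
w = rng.randn(4, 5)

quad_form = np.einsum('jk,jd,kd->', Sigma_inv, w, w)
pairwise = 0.5 * sum(W[j, k] * np.sum((w[j] - w[k]) ** 2)
                     for j in range(4) for k in range(4))
assert np.isclose(quad_form, pairwise)
```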
4. Optimization Algorithms
As pioneered in Sonnenburg et al. (2006), we consider here a two-layer optimization procedure to solve problem (2), where the variables are divided into two groups: the group of kernel weights $\{\beta_{jm}\}_{j\in\mathbb{N}_l,m\in\mathbb{N}_M}$ and the group of weight vectors $\{w_j^{(m)}\}_{j\in\mathbb{N}_l,m\in\mathbb{N}_M}$. In each iteration, we alternatingly optimize one group of variables while fixing the other group. These iterations are repeated until some optimality condition is satisfied. To this aim, we need to find efficient strategies for solving the two subproblems. The following proposition shows that the subproblem of optimizing the objective of (2) with respect to $\{w_j^{(m)}\}_{j\in\mathbb{N}_l,m\in\mathbb{N}_M}$ for fixed kernel weights can be cast as a standard SVM problem with a delicately defined kernel.
Proposition 2 (CLMKL (PARTIAL) DUAL PROBLEM) Introduce the kernel
$$\tilde{k}(x_i, x_{\tilde{i}}) := \sum_{m \in \mathbb{N}_M} q_{m\tau(i)\tau(\tilde{i})}^{(\beta)}\, k_m(x_i, x_{\tilde{i}}). \qquad (3)$$
The partial Lagrangian dual of (2) with fixed kernel weights $\beta_{jm}$ is given by
$$\begin{aligned}
\max_{\alpha}\quad & \sum_{i\in\mathbb{N}_n}\alpha_i-\frac{1}{2}\sum_{i,\tilde i\in\mathbb{N}_n}y_iy_{\tilde i}\alpha_i\alpha_{\tilde i}\,\tilde k(x_i,x_{\tilde i})\\
\text{s.t.}\quad & \sum_{i\in\mathbb{N}_n}\alpha_iy_i=0,\qquad 0\leq\alpha_i\leq C,\ \forall i\in\mathbb{N}_n.
\end{aligned}\qquad(4)$$
Further, the optimal weight vector can be represented by
$$w_j^{(m)}=\sum_{\tilde j\in\mathbb{N}_l}q^{(\beta)}_{mj\tilde j}\sum_{i\in S_{\tilde j}}y_i\alpha_i\phi_m(x_i),\qquad\forall j\in\mathbb{N}_l,\,m\in\mathbb{N}_M.\qquad(5)$$
Next, we show that the subproblem of optimizing the kernel weights for fixed $w_j^{(m)}$ and $b$ has a closed-form solution. We defer the detailed proofs of Propositions 2 and 3 to the appendix due to lack of space.
Proposition 3 (SOLVING THE SUBPROBLEM WITH RESPECT TO THE KERNEL WEIGHTS) Given fixed $w_j^{(m)}$ and $b$, the minimal $\beta_{jm}$ in optimization problem (2) is attained for
$$\beta_{jm}=\|w_j^{(m)}\|_2^{\frac{2}{p+1}}\Big(\sum_{k\in\mathbb{N}_M}\|w_j^{(k)}\|_2^{\frac{2p}{p+1}}\Big)^{-\frac{1}{p}}.\qquad(6)$$
To apply Proposition 3 for updating $\beta_{jm}$, we need to compute the norm of $w_j^{(m)}$, which can be accomplished by recalling the representation given in Eq. (5):
$$\|w_j^{(m)}\|_2^2=\Big\|\sum_{i\in\mathbb{N}_n}y_i\alpha_i q^{(\beta)}_{mj\tau(i)}\phi_m(x_i)\Big\|_2^2=\sum_{i\in\mathbb{N}_n}\sum_{\tilde i\in\mathbb{N}_n}y_iy_{\tilde i}\alpha_i\alpha_{\tilde i}\,q^{(\beta)}_{mj\tau(i)}q^{(\beta)}_{mj\tau(\tilde i)}k_m(x_i,x_{\tilde i}).\qquad(7)$$
Furthermore, note that the prediction function becomes
$$f(x)=\sum_{m\in\mathbb{N}_M}\langle w_{\tau(x)}^{(m)},\phi_m(x)\rangle+b=\sum_{i\in\mathbb{N}_n}y_i\alpha_i\sum_{m\in\mathbb{N}_M}q^{(\beta)}_{m\tau(x)\tau(i)}k_m(x_i,x)+b.\qquad(8)$$
The resulting optimization algorithm for convex localized multiple kernel learning is shown
in Algorithm 1. The algorithm alternates between solving an SVM subproblem for fixed kernel
weights (Line 4) and updating the kernel weights in a closed-form manner (Line 6). Note that the
proposed optimization approach can be potentially extended to an interleaved optimization strategy
where the optimization of the MKL step is directly integrated into the SVM solver. It has been
shown (Sonnenburg et al., 2006; Kloft et al., 2011) that such a strategy can increase the computational efficiency by up to 1-2 orders of magnitude (cf. Figure 7 in Kloft et al., 2011).
Algorithm 1: Training algorithm for convex localized multiple kernel learning (CLMKL).
input: examples $\{(x_i, y_i)\}_{i=1}^n \subset \mathcal{X}\times\{-1,1\}$ together with cluster indices $\{\tau(i)\}_{i=1}^n$, $M$ base kernels $k_1,\ldots,k_M$, and a positive semi-definite matrix $\Sigma^{-1}$.
1: initialize $\beta_{jm} = \sqrt[p]{1/M}$ for all $j\in\mathbb{N}_l$, $m\in\mathbb{N}_M$
2: while optimality conditions are not satisfied do
3:     calculate the kernel matrix $\tilde{k}$ by Eq. (3)
4:     compute $\alpha$ by solving a canonical SVM with $\tilde{k}$
5:     compute $\|w_j^{(m)}\|_2^2$ for all $j, m$ by Eq. (7)
6:     update $\beta_{jm}$ for all $j, m$ according to Eq. (6)
7: end while
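The following is a hedged Python sketch of the wrapper loop of Algorithm 1, assuming precomputed base kernel matrices and using scikit-learn's SVC with a precomputed kernel as the canonical SVM solver of Line 4; it is our own illustration (the function name `clmkl_train` and its arguments are ours), replaces the stopping criterion of Line 2 by a fixed iteration budget, and leaves out the numerical safeguards of a production implementation.

```python
import numpy as np
from sklearn.svm import SVC

def clmkl_train(Ks, y, tau, Sigma_inv, C=1.0, mu=1.0, p=1.0, n_iter=20):
    """Sketch of Algorithm 1 (wrapper optimization for CLMKL).

    Ks        : array (M, n, n) of precomputed base kernel matrices k_m
    y         : array (n,) of labels in {-1, +1}
    tau       : integer array (n,) of cluster indices in {0, ..., l-1}
    Sigma_inv : array (l, l), positive semi-definite coupling matrix
    Returns the kernel weights beta (l, M) and the signed dual variables y_i * alpha_i.
    """
    M, n, _ = Ks.shape
    l = Sigma_inv.shape[0]
    beta = np.full((l, M), (1.0 / M) ** (1.0 / p))        # line 1 of Algorithm 1

    for _ in range(n_iter):
        # Q_m^beta = [diag(1/beta_{.m}) + mu * Sigma^{-1}]^{-1}   (Section 3.2)
        Q = np.stack([np.linalg.inv(np.diag(1.0 / beta[:, m]) + mu * Sigma_inv)
                      for m in range(M)])

        # line 3: composite kernel of Eq. (3)
        K_tilde = sum(Q[m][np.ix_(tau, tau)] * Ks[m] for m in range(M))

        # line 4: canonical SVM with the composite kernel
        svc = SVC(C=C, kernel='precomputed').fit(K_tilde, y)
        ay = np.zeros(n)
        ay[svc.support_] = svc.dual_coef_.ravel()          # y_i * alpha_i

        # line 5: ||w_j^{(m)}||_2^2 via Eq. (7)
        norms_sq = np.empty((l, M))
        for j in range(l):
            for m in range(M):
                c = ay * Q[m][j, tau]                      # y_i alpha_i q^{(beta)}_{m j tau(i)}
                norms_sq[j, m] = c @ Ks[m] @ c

        # line 6: closed-form beta update of Eq. (6)
        num = norms_sq ** (1.0 / (p + 1.0))
        den = (norms_sq ** (p / (p + 1.0))).sum(axis=1, keepdims=True) ** (1.0 / p)
        beta = np.maximum(num / np.maximum(den, 1e-12), 1e-12)

    return beta, ay
```

Prediction for a test point $x$ then follows Eq. (8), with the row index $\tau(i)$ replaced by the cluster index $\tau(x)$ of the test point and the bias taken from the SVM solver.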
We remark that we also derive a complete dual problem, removing the dependency on $\beta_{jm}$. Due to lack of space, we defer the detailed proof to the appendix:
Proposition 4 (CLMKL (COMPLETE) DUAL PROBLEM) If $\Sigma^{-1}$ is positive definite, then the completely dualized Lagrangian dual (dualized with respect to all variables) of Problem (2) becomes:
$$\sup_{\substack{0\le\alpha_i\le C\\ \sum_{i\in\mathbb{N}_n}\alpha_iy_i=0}}\ \sup_{\gamma_{mj\tilde j},\,m\in\mathbb{N}_M,\,j,\tilde j\in\mathbb{N}_l}\ -\frac{1}{2}\sum_{j\in\mathbb{N}_l}\Big(\sum_{m\in\mathbb{N}_M}\Big\|\sum_{i\in S_j}\alpha_iy_i\phi_m(x_i)-\sum_{i\in\mathbb{N}_n}\alpha_iy_i\gamma_{mj\tau(i)}\phi_m(x_i)\Big\|_2^{\frac{2p}{p-1}}\Big)^{\frac{p-1}{p}}-\frac{1}{2\mu}\sum_{m\in\mathbb{N}_M}\Big\|\Big(\sum_{i\in\mathbb{N}_n}\alpha_iy_i\gamma_{mj\tau(i)}\phi_m(x_i)\Big)_{j\in\mathbb{N}_l}\Big\|_\Sigma^2+\sum_{i\in\mathbb{N}_n}\alpha_i.$$
The above dual sheds further light onto the CLMKL optimization problem, and potentially can
be exploited for the development of alternative optimization strategies that directly optimize the dual
criterion (without the need for a two-step wrapper approach); such an approach has been taken in
Sun et al. (2010) in the context of `p -norm MKL. Furthermore, solving the dual enables computing
the duality gap, which can be used as a sound evaluation criterion for the optimization precision.
5. Rademacher complexity bounds
This section presents a theoretical analysis, showing, for the first time, that a localized approach
to learning kernels can generalize to new and unseen data. In particular, we give a purely data-dependent bound on the generalization error. Our basic strategy is to plug the optimal $\beta_{jm}$ established in Eq. (6) into the primal problem (2), thereby writing (2) as the following equivalent
block-norm regularization problem:
$$\min_{w,b}\ \frac{1}{2}\sum_{j\in\mathbb{N}_l}\Big(\sum_{m\in\mathbb{N}_M}\|w_j^{(m)}\|_2^{\frac{2p}{p+1}}\Big)^{\frac{p+1}{p}}+\frac{\mu}{2}\sum_{m\in\mathbb{N}_M}\|w^{(m)}\|_{\Sigma^{-1}}^2+C\sum_{i\in\mathbb{N}_n}\Big(1-y_i\sum_{m\in\mathbb{N}_M}\langle w_{\tau(i)}^{(m)},\phi_m(x_i)\rangle-y_ib\Big)_+.\qquad(9)$$
Solving Eq. (9) amounts to performing empirical risk minimization in the hypothesis space
$$H_{p,\mu,D}:=H_{p,\mu,D,M}=\Big\{f_w:x\mapsto\langle w_{\tau(x)},\phi(x)\rangle\ :\ \sum_{j\in\mathbb{N}_l}\|w_j\|_{2,\frac{2p}{p+1}}^2+\mu\sum_{m\in\mathbb{N}_M}\|w^{(m)}\|_{\Sigma^{-1}}^2\leq D\Big\}.$$
The following theorem establishes a generalization error bound for CLMKL.
Theorem 5 (CLMKL GENERALIZATION ERROR BOUNDS) Suppose that $\Sigma^{-1}$ is positive definite and $n$ is the sample size. Then, for any $0<\delta<1$, with probability at least $1-\delta$, the expected risk $\mathcal{E}(h):=\mathbb{E}[yh(x)\le 0]$ of any classifier $h\in H_{p,\mu,D}$ can be upper bounded by:
$$\mathcal{E}(h)\le\mathcal{E}_z(h)+3\sqrt{\frac{\log(2/\delta)}{2n}}+\frac{2\sqrt{D}}{n}\inf_{\substack{0\le\theta\le1\\2\le t\le\bar p^*}}\bigg(\theta^2t\sum_{j\in\mathbb{N}_l}\Big\|\Big(\sum_{i\in S_j}k_m(x_i,x_i)\Big)_{m=1}^{M}\Big\|_{\frac{t}{2}}+\frac{(1-\theta)^2}{\mu}\sum_{m\in\mathbb{N}_M}\sum_{j\in\mathbb{N}_l}\Sigma_{jj}\sum_{i\in S_j}k_m(x_i,x_i)\bigg)^{1/2},$$
where $\mathcal{E}_z(h):=\frac{1}{n}\sum_{i=1}^n(1-y_ih(x_i))_+$ is the empirical risk w.r.t. the hinge loss and $\bar p:=\frac{2p}{p+1}$.
Remark 6 (Interpretation and Tightness) The above error bound enjoys a mild dependence on the number of kernels. One can show (cf. Section C) that the dependence is $O(\log M)$ for $p\leq(\log M-1)^{-1}\log M$ and $O(M^{\frac{p-1}{2p}})$ otherwise, which recovers the best known results for global MKL algorithms in Cortes et al. (2010); Kloft and Blanchard (2011); Kloft et al. (2011).
Theorem 5 also suggests that the generalization performance of CLMKL is controlled by a weighted sum of the diagonal elements of the matrix $\Sigma$, with weights proportional to the trace of the Gram matrix on the associated clusters.
6. Empirical Studies
6.1 Experimental Setup
We implement the proposed convex localized MKL (CLMKL) algorithm in MATLAB and solve
the involved canonical SVM problem with LIBSVM (Chang and Lin, 2011). When the clusters
{S1 , . . . , Sl } are not known in advance, they are computed through kernel k-means (e.g., Dhillon
et al., 2004). To diminish the potential fluctuations of k-means due to random initialization of the cluster means, we repeat kernel k-means several times, and either select the run with the minimal clustering error (the sum of squared distances between the examples and their nearest cluster centers) as the final partition, or train a single CLMKL model for each partition and then combine
the resulting CLMKL models by performing majority voting on the binary predictions. We compare
the performance attained by the proposed CLMKL to regular localized MKL (LMKL) (Gönen and
Alpaydin, 2008), the SVM using a uniform kernel combination (UNIF) (Cortes, 2009), and `p -norm
MKL (Kloft et al., 2011), which includes classical MKL (Lanckriet et al., 2004) as a special case.
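When the clusters have to be computed, the preprocessing described above can be sketched as follows; this is a rough NumPy implementation of kernel k-means with restarts selected by minimal clustering error (our own simplification of the setup, not the authors' MATLAB code; the function name `kernel_kmeans` is ours).

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_init=10, n_iter=100, seed=0):
    """Sketch of kernel k-means (cf. Dhillon et al., 2004) with random restarts.

    Returns the assignment with the smallest clustering error (sum of squared
    feature-space distances to the assigned cluster means) over n_init runs.
    """
    rng = np.random.RandomState(seed)
    n = K.shape[0]
    diag = np.diag(K)
    best_labels, best_err = None, np.inf
    for _ in range(n_init):
        labels = rng.randint(n_clusters, size=n)
        for _ in range(n_iter):
            dist = np.empty((n, n_clusters))
            for c in range(n_clusters):
                idx = np.flatnonzero(labels == c)
                if idx.size == 0:                    # re-seed empty clusters
                    idx = rng.randint(n, size=1)
                # ||phi(x_i) - mean_c||^2 = K_ii - 2/|S_c| sum_j K_ij + 1/|S_c|^2 sum_{j,j'} K_jj'
                dist[:, c] = (diag
                              - 2.0 * K[:, idx].mean(axis=1)
                              + K[np.ix_(idx, idx)].mean())
            new_labels = dist.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        err = dist[np.arange(n), labels].sum()
        if err < best_err:
            best_labels, best_err = labels, err
    return best_labels
```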
            CLMKL        LMKL         MKL          UNIF
σ = 0.2     98.3 ± 0.8   94.7 ± 1.4   94.8 ± 1.6   94.5 ± 1.6
σ = 0.3     91.4 ± 1.9   89.5 ± 1.8   89.2 ± 2.0   89.3 ± 1.7

Table 1: Performances achieved by LMKL, UNIF, MKL and the proposed CLMKL on the synthetic dataset. Here, σ is the standard deviation of the noise. The underlying parameter p is 1.
6.2 Controlled Experiments on Synthetic Data
We first experiment on a two-class synthetic dataset with positive and negative points lying on a disconnected hexagon with radius equal to 6 and 5, respectively, with additional corruption by Gaussian noise with standard deviation σ. An example of such a synthetic dataset with 1000 examples and σ = 0.2 is shown in the accompanying figure. This dataset is interesting to us since the optimal combination of the features associated with the first and second coordinates varies along the six sides of the hexagon. We choose the linear kernels on the first and second coordinates as two base kernels for CLMKL, and apply k-means with 6 clusters to generate the data partition. The correlation matrix $\Sigma^{-1}$ is chosen as the graph Laplacian of an adjacency matrix $W$, where we set $W_{j\tilde j}=\exp(-\gamma d^2(S_j, S_{\tilde j}))$ with $d(S_j, S_{\tilde j})$ being the Euclidean distance between clusters $S_j$ and $S_{\tilde j}$. The parameter $\gamma$ is set to the reciprocal of the average distance among different clusters. We use one half of the dataset as the training set, and each half of the remainder as the validation set and test set. We tune the parameter $C$ from the set $10^{\{-2,-1,\ldots,2\}}$, and $\mu$ from the set $2^{\{2,4,6,8\}}$, based on the prediction accuracies on the validation set. For CLMKL and MKL, we simply set p = 1 in this experiment. For the baseline methods (LMKL, MKL, UNIF), we complement the linear features by adding the quadratic kernel $k(x,\tilde x)=\langle x,\tilde x\rangle^2$ as the third base kernel, which is a useful feature for this dataset since a circle (a quadratic function) with appropriate radius is expected to serve as a good predictor. Thus the addition of the quadratic kernel gives the baseline methods a potential advantage and serves as an additional sanity check of the robustness of the proposed algorithm.
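For concreteness, the construction of $\Sigma^{-1}$ used in this synthetic experiment can be sketched as follows; it is our own reading of the description above, in particular we take $d(S_j, S_{\tilde j})$ to be the Euclidean distance between cluster centroids, which the text leaves implicit.

```python
import numpy as np

def build_coupling_matrix(X, labels):
    """Sketch of the Sigma^{-1} construction described above: the graph Laplacian of
    W_{jk} = exp(-gamma * d^2(S_j, S_k)), where d(S_j, S_k) is taken to be the
    Euclidean distance between cluster centroids and gamma is the reciprocal of
    the average inter-cluster distance.
    """
    clusters = np.unique(labels)
    centroids = np.stack([X[labels == c].mean(axis=0) for c in clusters])
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    off_diag = d[~np.eye(len(clusters), dtype=bool)]
    gamma = 1.0 / off_diag.mean()
    W = np.exp(-gamma * d ** 2)
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W      # Sigma^{-1} = D - W
```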
Table 1 shows the performance of the proposed CLMKL as well as the competitors. We can
observe that the proposed CLMKL consistently achieves the best prediction accuracies, with accuracy gains of up to 3.6%. Note that this improvement is achieved when the baseline methods are
supplied with an additional quadratic kernel encoding valuable information for this synthetic data.
MKL     LMKL    CLMKL holdout   CLMKL Oracle   Li and Fei-Fei (2007)   Bo et al. (2011)   Liu et al. (2014)
90.23   87.36   90.80           91.38          73.8                    85.7               89.95

Table 2: Prediction accuracies achieved by regular ℓp-norm MKL and the proposed CLMKL on the UIUC Sports Event dataset. The columns "holdout" and "Oracle" show the prediction accuracies for the selected and optimal parameters, respectively. Liu et al. (2014) is the best known result from the literature.
                        p = 1        p = 1.14     p = 1.2      p = 1.33
CLMKL, l = 4  holdout   71.7 ± 0.4   74.8 ± 0.4   74.9 ± 0.5   74.5 ± 0.4
              Oracle    72.8 ± 0.9   75.2 ± 0.4   75.0 ± 0.6   75.0 ± 0.4
CLMKL, l = 8  holdout   71.9 ± 0.4   75.1 ± 0.5   74.7 ± 0.3   74.5 ± 0.4
              Oracle    73.7 ± 0.9   75.4 ± 0.3   75.5 ± 0.6   74.7 ± 0.3
LMKL                    64.3
MKL                     68.7         73.4         74.2         73.1
UNIF                    68.4

Table 3: Prediction accuracies achieved by UNIF, LMKL, MKL and CLMKL on the protein folding class prediction task. The columns "holdout" and "Oracle" show the prediction accuracies for the selected and optimal parameters, respectively. l indicates the number of clusters in CLMKL, and p indicates the type of regularizer on the kernel combination coefficients.
6.3 Visual Image Categorization—An Application from the Domain of Computer Vision
We experiment on the UIUC Sports event dataset (Li and Fei-Fei, 2007) consisting of 1574 images,
each associated with one of 8 image classes (each class corresponding to a sport activity). We
compute 12 bag-of-words features, each with a dictionary size of 512, resulting in 12 χ²-kernels
(Zhang et al., 2007). The first 6 bag-of-words features are computed over SIFT features (Lowe,
2004) at three different scales and the two color channel sets RGB and opponent colors (van de
Sande et al., 2010). The remaining 6 bag-of-words features are quantiles of color values at the same
three scales and the same two color channel sets. For each channel within a set of color channels,
the quantiles are concatenated. Local features are extracted at a grid of step size 5 on images that
were down-scaled to 600 pixels in the largest dimension. Assignment of local features to visual
words is done using rank-mapping (Binder et al., 2013). The kernel width of the kernels is set to the mean of the χ²-distances. All kernels are multiplicatively normalized.
Following the setup of Liu et al. (2014) the dataset is split into 11 parts. One part is withheld
to obtain the final performance measurements, and on the remaining 10 parts we perform 10-fold
cross-validation for finding the optimal parameters. For CLMKL we employ kernel k-means with
3 clusters on the cross-validation parts. For CLMKL we apply majority voting over 8 separate
clusterings, for each of which a separate predictor is trained for fixed parameters. The matrix Σ−1 is
computed as $[(\exp(-\gamma d(S_j, S_{\tilde j})))_{j\tilde j}]^{-1}$, where the distance $d(S_j, S_{\tilde j})$ is the χ²-distance averaged over the cluster assignments. The two involved parameters $\gamma$ and $\mu$ are determined by cross-validation. We compare CLMKL to regular $\ell_p$-norm MKL (Kloft et al., 2011), for which we employ a one-versus-all setup, running over $\ell_p$-norms in $\{1.0625, 1.125, 1.333, 2\}$ and regularization constants in $\{10^{k/2}\}_{k=-2}^{5}$ (optima attained inside the respective grids). CLMKL uses the same set of $\ell_p$-norms, regularization constants from $\{10^{k/2}\}_{k=0,\ldots,5}$ and, due to time constraints, a subset of 18 combinations of the two parameters $(\gamma,\mu)\in\{10^{i/2}\}_{i=-4}^{0}\times\{2^i\}_{i=-4}^{4}$ is used to compute $\Sigma^{-1}$.
Performance is measured through multi-class classification accuracy. Table 2 shows the results. The
column “holdout” shows the prediction accuracy achieved by taking majority voting over predictors
constructed based on different applications of kernel k-means with random initializations, while the
column “Oracle” indicates the best prediction accuracy achieved by these models built on the output
of kernel k-means with random initializations. We observe that CLMKL achieves a performance
improvement of 0.5-1.2% over the $\ell_p$-norm MKL baseline. Comparing this to the best known results from the literature (Liu et al., 2014), we observe that this is, to the best of our knowledge, the highest result reported on the UIUC dataset.
6.4 Protein Fold Prediction—An Application from the Domain of Computational Biology
Protein fold prediction is a key step towards understanding the function of proteins, as the folding
class of a protein is closely linked with its function; thus it is crucial for drug design. We experiment
on the protein folding class prediction dataset by Ding and Dubchak (2001), which was also used
in Campbell and Ying (2011); Kloft (2011); Kloft and Blanchard (2011). This dataset consists of
27 fold classes with 311 proteins used for training and 383 proteins reserved for testing. We use
exactly the same 12 kernels used also in Campbell and Ying (2011); Kloft (2011); Kloft and Blanchard (2011) reflecting different features, e.g., van der Waals volume, polarity and hydrophobicity,
relevant to the fold class prediction as base kernels. This is a non-sparse scenario for which Kloft
(2011) achieved 74.4% accuracy using `1.14 -norm MKL.
To be in line with the previous experiments by Campbell and Ying (2011); Kloft (2011); Kloft
and Blanchard (2011), we precisely replicate their experimental setup: we use the train/test split
supplied by Campbell and Ying (2011) and perform CLMKL via one-versus-all strategy to tackle
multiple classes. The correlation matrix Σ−1 is constructed in the same way as that in Section 6.2.
The parameters are chosen by cross-validation over $C\in10^{\{-2,-1,\ldots,2\}}$ and $\mu\in2^{\{5,6,7\}}$. We consider
`p -norm CLMKL models with p ∈ {1, 1.14, 1.2, 1.33} and l ∈ {4, 8}. We repeat the experiment 10
times and report the mean prediction accuracies, as well as standard deviations in Table 3.
From the table, we observe that CLMKL has the potential to largely surpass its global counterpart `p -norm MKL. Note that we do not achieve the accuracy 74.4% for `1.14 -norm MKL reported
in Kloft (2011), which is possibly due to different implementations of the `p -norm MKL algorithms. Nevertheless, CLMKL achieves accuracies more than 0.8% higher than the one reported in
Kloft (2011), which is also higher than the one initially reported in Campbell and Ying (2011). For
example, CLMKL with l = 8, p = 1.14 achieves an impressive accuracy of 75.1%.
7. Conclusion
We proposed a localized approach to learning kernels that admits generalization error bounds and
can be phrased as a convex optimization problem over a given or pre-obtained cluster structure. A
key ingredient is the use of a graph-regularizer to couple the different local models. The theoretical analysis based on Rademacher complexity theory resulted in large deviation inequalities that
connect the spectrum of the graph regularizer with the generalization capability of the learning algorithm. The proposed method is well suited for deployment in the domains of computer vision and
computational biology: computational experiments showed that the proposed approach can achieve
prediction accuracies higher than its global and non-convex local counterparts.
In future work, we will investigate alternative clustering strategies (including convex ones and
soft clustering), and how to include the data partitioning into our framework in a principled manner, for instance, by constructing partitions that capture the local variation of the predictive importance of different features, by solving the clustering step and the MKL optimization problem in a joint manner, or
by automatically learning the graph Laplacian using appropriate matrix regularization. Another research direction is to directly integrate the MKL step into the SVM solver, as pioneered by Sonnenburg et al. (2006). We expect that such an implementation would lead to a speed-up in computational
efficiency by up to 1-2 orders of magnitude. We will also investigate extensions to other learning
settings (Kloft et al., 2009; Storcheus et al., 2015) and further applications (Kloft and Laskov, 2007;
Nakajima et al., 2009; Binder et al., 2012; Kloft and Laskov, 2012; Kloft et al., 2014).
Acknowledgments
This work was partly funded by the German Research Foundation (DFG) award KL 2698/2-1.
Appendix A. Proofs on subproblems in Algorithm 1
A.1 Proof of Proposition 2
Proof of Proposition 2 The Lagrangian of the partial optimization problem w.r.t. $w_j^{(m)}$ and $b$ is
$$L:=\sum_{j\in\mathbb{N}_l}\sum_{m\in\mathbb{N}_M}\frac{\|w_j^{(m)}\|_2^2}{2\beta_{jm}}+\frac{\mu}{2}\sum_{m\in\mathbb{N}_M}\|w^{(m)}\|_{\Sigma^{-1}}^2+\sum_{j\in\mathbb{N}_l}\sum_{i\in S_j}\alpha_i\Big(1-\xi_i-y_i\sum_{m\in\mathbb{N}_M}\langle w_j^{(m)},\phi_m(x_i)\rangle-y_ib\Big)+C\sum_{i\in\mathbb{N}_n}\xi_i-\sum_{i\in\mathbb{N}_n}v_i\xi_i,\qquad\text{(A.1)}$$
where $\alpha_i\geq0$ and $v_i\geq0$ are the Lagrangian multipliers of the constraints.
Setting to zero the gradient of the Lagrangian w.r.t. the primal variables, we get
$$\frac{\partial L}{\partial w_j^{(m)}}=0\ \Rightarrow\ \frac{w_j^{(m)}}{\beta_{jm}}+\mu\sum_{\tilde j\in\mathbb{N}_l}\Sigma^{-1}_{j\tilde j}w_{\tilde j}^{(m)}-\sum_{i\in S_j}y_i\alpha_i\phi_m(x_i)=0,\qquad\text{(A.2)}$$
$$\frac{\partial L}{\partial b}=0\ \Rightarrow\ \sum_{i\in\mathbb{N}_n}\alpha_iy_i=0,\qquad\text{(A.3)}$$
$$\frac{\partial L}{\partial\xi_i}=0\ \Rightarrow\ C=\alpha_i+v_i,\quad\forall i\in\mathbb{N}_n.\qquad\text{(A.4)}$$
Eq. (A.2) implies that
$$\sum_{j\in\mathbb{N}_l}\frac{\|w_j^{(m)}\|_2^2}{\beta_{jm}}+\mu\|w^{(m)}\|_{\Sigma^{-1}}^2=\sum_{j\in\mathbb{N}_l}\sum_{i\in S_j}\alpha_iy_i\langle w_j^{(m)},\phi_m(x_i)\rangle,\qquad\text{(A.5)}$$
$$w_j^{(m)}=\sum_{\tilde j\in\mathbb{N}_l}q^{(\beta)}_{mj\tilde j}\sum_{i\in S_{\tilde j}}y_i\alpha_i\phi_m(x_i),\qquad\forall j,m.\qquad\text{(A.6)}$$
Plugging Eqs. (A.3), (A.4) into Eq. (A.1), the Lagrangian can be simplified as follows:
$$\begin{aligned}
L&=\sum_{i\in\mathbb{N}_n}\alpha_i+\sum_{m\in\mathbb{N}_M}\sum_{j\in\mathbb{N}_l}\frac{\|w_j^{(m)}\|_2^2}{2\beta_{jm}}+\sum_{m\in\mathbb{N}_M}\frac{\mu}{2}\|w^{(m)}\|_{\Sigma^{-1}}^2-\sum_{m\in\mathbb{N}_M}\sum_{j\in\mathbb{N}_l}\sum_{i\in S_j}\alpha_iy_i\langle w_j^{(m)},\phi_m(x_i)\rangle\\
&\overset{\text{(A.5)}}{=}\sum_{i\in\mathbb{N}_n}\alpha_i-\frac{1}{2}\sum_{m\in\mathbb{N}_M}\sum_{j\in\mathbb{N}_l}\Big\langle w_j^{(m)},\sum_{i\in S_j}\alpha_iy_i\phi_m(x_i)\Big\rangle\\
&\overset{\text{(A.6)}}{=}\sum_{i\in\mathbb{N}_n}\alpha_i-\frac{1}{2}\sum_{m\in\mathbb{N}_M}\sum_{j,\tilde j\in\mathbb{N}_l}q^{(\beta)}_{mj\tilde j}\sum_{i\in S_j,\,\tilde i\in S_{\tilde j}}y_iy_{\tilde i}\alpha_i\alpha_{\tilde i}\langle\phi_m(x_i),\phi_m(x_{\tilde i})\rangle\\
&=\sum_{i\in\mathbb{N}_n}\alpha_i-\frac{1}{2}\sum_{m\in\mathbb{N}_M}\sum_{i,\tilde i\in\mathbb{N}_n}y_iy_{\tilde i}\alpha_i\alpha_{\tilde i}\,q^{(\beta)}_{m\tau(i)\tau(\tilde i)}k_m(x_i,x_{\tilde i})\\
&\overset{\text{(3)}}{=}\sum_{i\in\mathbb{N}_n}\alpha_i-\frac{1}{2}\sum_{i,\tilde i\in\mathbb{N}_n}y_iy_{\tilde i}\alpha_i\alpha_{\tilde i}\,\tilde k(x_i,x_{\tilde i}).
\end{aligned}$$
The proof is complete if we note the constraints established in Eqs. (A.3), (A.4).
A.2 Proof of Proposition 3
Proposition 3 in the main text gives a closed-form solution for updating the kernel weights; a detailed proof is given in this appendix. Our discussion is largely based on the following lemma by Micchelli and Pontil (2005).

Lemma A.1 (Micchelli and Pontil, 2005, Lemma 26) Let $a_i\geq0$, $i\in\mathbb{N}_d$, and $1\leq r<\infty$. Then
$$\min_{\eta:\ \eta_i\geq0,\ \sum_{i\in\mathbb{N}_d}\eta_i^r\leq1}\ \sum_{i\in\mathbb{N}_d}\frac{a_i}{\eta_i}=\Big(\sum_{i\in\mathbb{N}_d}a_i^{\frac{r}{r+1}}\Big)^{1+\frac{1}{r}}$$
and the minimum is attained at $\eta_i=a_i^{\frac{1}{r+1}}\big(\sum_{k\in\mathbb{N}_d}a_k^{\frac{r}{r+1}}\big)^{-\frac{1}{r}}$.

We are now ready to prove Proposition 3 as follows.

Proof of Proposition 3 Fixing the variables $w_j^{(m)}$ and $b$, the optimization problem (2) reduces to
$$\min_\beta\ \frac{1}{2}\sum_{j\in\mathbb{N}_l,\,m\in\mathbb{N}_M}\beta_{jm}^{-1}\|w_j^{(m)}\|_2^2\quad\text{s.t.}\ \sum_{m\in\mathbb{N}_M}\beta_{jm}^p\leq1,\ \forall j\in\mathbb{N}_l,\qquad\beta_{jm}\geq0,\ \forall j\in\mathbb{N}_l,\,m\in\mathbb{N}_M.$$
This problem can be decomposed into $l$ independent subproblems, one at each locality. For example, the subproblem at the $j$-th locality is as follows:
$$\min_\beta\ \frac{1}{2}\sum_{m\in\mathbb{N}_M}\beta_{jm}^{-1}\|w_j^{(m)}\|_2^2\quad\text{s.t.}\ \sum_{m\in\mathbb{N}_M}\beta_{jm}^p\leq1,\qquad\beta_{jm}\geq0,\ \forall m\in\mathbb{N}_M.$$
Applying Lemma A.1 with $a_m=\|w_j^{(m)}\|_2^2$, $\eta_m = \beta_{jm}$ and $r = p$ completes the proof.
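As a quick sanity check of Lemma A.1, and hence of the closed-form update in Eq. (6), the following small numerical experiment (our own addition, not part of the original appendix) compares the closed-form minimizer against randomly drawn feasible points.

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.rand(5)            # a_i >= 0
r = 2.0                    # any r >= 1

# closed-form minimizer and minimum value from Lemma A.1
eta_star = a ** (1.0 / (r + 1.0)) / (a ** (r / (r + 1.0))).sum() ** (1.0 / r)
min_value = (a ** (r / (r + 1.0))).sum() ** (1.0 + 1.0 / r)

assert np.isclose((a / eta_star).sum(), min_value)     # attains the stated minimum
assert (eta_star ** r).sum() <= 1.0 + 1e-12            # feasibility

# random feasible points should never do better
for _ in range(10000):
    eta = rng.rand(5)
    eta = eta / max(1.0, (eta ** r).sum() ** (1.0 / r))   # project onto the l_r ball
    assert (a / eta).sum() >= min_value - 1e-9
```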
Appendix B. Completely dualized problems
Proposition 2 gives a partial dual of the primal optimization problem (2). Alternatively, we derive
here a complete dual problem removing the dependency on the kernel weights βjm . This completes
the analysis of the primal problem and can be potentially exploited (in future work) to access the
duality gap of computed solutions or to derive an alternative optimization strategy (cf. Sun et al.,
2010). We always assume that Σ is positive definite in this section. We consider a general loss function to give a unifying viewpoint, and our analysis is based on the notions of the Fenchel-Legendre
conjugate (Boyd and Vandenberghe, 2004) and the infimal convolution (Rockafellar, 1997).
B.1 Lemmata used for complete dualization
For a function $h$, we denote by $h^*(x)=\sup_\mu\,[x^\top\mu-h(\mu)]$ its Fenchel-Legendre conjugate. The infimal convolution (short: inf-convolution) of two functions $f$ and $g$ is defined by
$$(f\oplus g)(x):=\inf_y\,[f(x-y)+g(y)].$$
Lemma B.1 gives a relationship between the Fenchel-Legendre conjugate and the Inf-convolution.
Lemma B.1 (Rockafellar, 1997) For any two functions $f_1, f_2$, we have $(f_1+f_2)^*(x)=(f_1^*\oplus f_2^*)(x)$. Moreover, if $f$ has a decomposable structure in the sense that $f(x_1,x_2)=f_1(x_1)+f_2(x_2)$, i.e., $f_1$ and $f_2$ are functions of uncorrelated variables, then $f^*(x_1,x_2)=f_1^*(x_1)+f_2^*(x_2)$.
For any norm $\|\cdot\|$, we denote by $\|\cdot\|_*$ its dual norm, defined by $\|x\|_*=\sup_{\|\mu\|=1}\langle x,\mu\rangle$. The Fenchel-Legendre conjugate of the squared norm takes the following form (Rockafellar, 1997):
$$\Big(\frac{1}{2}\|\cdot\|^2\Big)^*=\frac{1}{2}\|\cdot\|_*^2.\qquad\text{(B.1)}$$
Lemma B.2 establishes the dual norm of a $\Sigma$-norm. The result is well known if $H$ is the 1-dimensional Euclidean space.
Lemma B.2 Let $H$ be a Hilbert space and $\Sigma$ be an $l\times l$ positive definite matrix. The dual norm of the $\Sigma$-norm defined by $\|(w_1,\ldots,w_l)\|_\Sigma=\big(\sum_{j,\tilde j\in\mathbb{N}_l}\Sigma_{j\tilde j}\langle w_j,w_{\tilde j}\rangle\big)^{1/2}$ is the $\Sigma^{-1}$-norm.
Proof For any two elements $w=(w_1,\ldots,w_l)$, $v=(v_1,\ldots,v_l)\in\underbrace{H\times\cdots\times H}_{l}$, we first establish the following inequality:
$$\langle(v_1,\ldots,v_l),(w_1,\ldots,w_l)\rangle\leq\|(v_1,\ldots,v_l)\|_{\Sigma^{-1}}\,\|(w_1,\ldots,w_l)\|_\Sigma.\qquad\text{(B.2)}$$
Let $\mu_1,\ldots,\mu_l\in\mathbb{R}^l$ be the eigenvectors of $\Sigma$ with $\lambda_1,\ldots,\lambda_l$ being the corresponding eigenvalues. According to the spectral decomposition, we have
$$\Sigma=\sum_{k\in\mathbb{N}_l}\lambda_k\mu_k\mu_k^\top,\qquad\Sigma^{-1}=\sum_{k\in\mathbb{N}_l}\lambda_k^{-1}\mu_k\mu_k^\top,$$
from which we know
$$\|w\|_\Sigma^2=\sum_{k\in\mathbb{N}_l}\lambda_k\|w\|_{\mu_k\mu_k^\top}^2=\sum_{k\in\mathbb{N}_l}\lambda_k\Big\langle\sum_{j\in\mathbb{N}_l}\mu_{kj}w_j,\sum_{j\in\mathbb{N}_l}\mu_{kj}w_j\Big\rangle.$$
Therefore,
$$\begin{aligned}
\|(w_1,\ldots,w_l)\|_\Sigma\,\|(v_1,\ldots,v_l)\|_{\Sigma^{-1}}&=\Big(\sum_{k\in\mathbb{N}_l}\lambda_k\Big\|\sum_{j\in\mathbb{N}_l}\mu_{kj}w_j\Big\|_2^2\Big)^{1/2}\Big(\sum_{k\in\mathbb{N}_l}\lambda_k^{-1}\Big\|\sum_{j\in\mathbb{N}_l}\mu_{kj}v_j\Big\|_2^2\Big)^{1/2}\\
&\overset{\text{C.-S.}}{\geq}\sum_{k\in\mathbb{N}_l}\Big\|\sum_{j\in\mathbb{N}_l}\mu_{kj}w_j\Big\|_2\,\Big\|\sum_{j\in\mathbb{N}_l}\mu_{kj}v_j\Big\|_2\\
&\geq\sum_{k\in\mathbb{N}_l}\Big\langle\sum_{j\in\mathbb{N}_l}\mu_{kj}w_j,\sum_{j\in\mathbb{N}_l}\mu_{kj}v_j\Big\rangle=\sum_{j,\tilde j\in\mathbb{N}_l}\langle w_j,v_{\tilde j}\rangle\sum_{k\in\mathbb{N}_l}\mu_{kj}\mu_{k\tilde j}.\qquad\text{(B.3)}
\end{aligned}$$
Since $\sum_{k\in\mathbb{N}_l}\mu_k\mu_k^\top$ is the identity matrix, we know that $\sum_{k\in\mathbb{N}_l}\mu_{kj}\mu_{k\tilde j}=\delta_{j\tilde j}$. Plugging this identity into the above inequality yields Eq. (B.2).

Next, we need to show that for any $w=(w_1,\ldots,w_l)$ there exists a $v=(v_1,\ldots,v_l)$ for which Eq. (B.2) holds as an equality. Introduce the invertible matrix $B=\big(\frac{1}{\lambda_1}\mu_1,\ldots,\frac{1}{\lambda_l}\mu_l\big)^\top$ and denote by $B^{-1}$ its inverse. Then, we have
$$\frac{1}{\lambda_k}\sum_{j\in\mathbb{N}_l}\mu_{kj}B^{-1}_{j\tilde k}=\delta_{k\tilde k}.\qquad\text{(B.4)}$$
Introduce
$$v_k:=\sum_{j\in\mathbb{N}_l}B^{-1}_{kj}\sum_{\tilde j\in\mathbb{N}_l}\mu_{j\tilde j}w_{\tilde j},\qquad\forall k\in\mathbb{N}_l.$$
Then, it follows from Eq. (B.4) that
$$\frac{1}{\lambda_k}\sum_{j\in\mathbb{N}_l}\mu_{kj}v_j=\frac{1}{\lambda_k}\sum_{j\in\mathbb{N}_l}\mu_{kj}\sum_{\tilde k\in\mathbb{N}_l}B^{-1}_{j\tilde k}\sum_{\tilde j\in\mathbb{N}_l}\mu_{\tilde k\tilde j}w_{\tilde j}=\sum_{\tilde k\in\mathbb{N}_l}\Big(\frac{1}{\lambda_k}\sum_{j\in\mathbb{N}_l}\mu_{kj}B^{-1}_{j\tilde k}\Big)\sum_{\tilde j\in\mathbb{N}_l}\mu_{\tilde k\tilde j}w_{\tilde j}\overset{\text{(B.4)}}{=}\sum_{\tilde k\in\mathbb{N}_l}\delta_{k\tilde k}\sum_{\tilde j\in\mathbb{N}_l}\mu_{\tilde k\tilde j}w_{\tilde j}=\sum_{\tilde j\in\mathbb{N}_l}\mu_{k\tilde j}w_{\tilde j}.$$
For any $w, v$ satisfying the above relation, we have
$$\lambda_k^2\Big\|\sum_{j\in\mathbb{N}_l}\mu_{kj}w_j\Big\|_2^2=\Big\|\sum_{j\in\mathbb{N}_l}\mu_{kj}v_j\Big\|_2^2,\qquad\Big\|\sum_{j\in\mathbb{N}_l}\mu_{kj}w_j\Big\|_2\Big\|\sum_{j\in\mathbb{N}_l}\mu_{kj}v_j\Big\|_2=\Big\langle\sum_{j\in\mathbb{N}_l}\mu_{kj}w_j,\sum_{j\in\mathbb{N}_l}\mu_{kj}v_j\Big\rangle,$$
and therefore the inequality (B.3) indeed holds as an equality. The proof is complete.
B.2 Proofs on complete dualization problems
The convex localized MKL model given in (2) can be extended to a general convex loss function:
$$\begin{aligned}
\min_{w,t,\beta,b}\quad & \sum_{j\in\mathbb{N}_l,\,m\in\mathbb{N}_M}\frac{\|w_j^{(m)}\|_2^2}{2\beta_{jm}}+\frac{\mu}{2}\sum_{m\in\mathbb{N}_M}\|w^{(m)}\|_{\Sigma^{-1}}^2+C\sum_{i\in\mathbb{N}_n}\ell(t_i,y_i)\\
\text{s.t.}\quad & \sum_{m\in\mathbb{N}_M}\beta_{jm}^p\leq1,\ \forall j\in\mathbb{N}_l,\qquad\beta_{jm}\geq0,\ \forall j\in\mathbb{N}_l,\,m\in\mathbb{N}_M\\
& \sum_{m\in\mathbb{N}_M}\langle w_j^{(m)},\phi_m(x_i)\rangle+b=t_i,\quad\forall i\in S_j,\,j\in\mathbb{N}_l.
\end{aligned}\qquad\text{(B.5)}$$
Here $\ell(t_i,y_i)$ is a general loss function measuring the error incurred from using $t_i$ to predict $y_i$.
The following problem gives the complete dual for the above convex localized MKL.

Problem B.3 (COMPLETELY DUALIZED DUAL PROBLEM FOR GENERAL LOSS FUNCTIONS) Let $\ell(t,y):\mathbb{R}\times Y\to\mathbb{R}$ be a convex function w.r.t. $t$ for any $y$. Assume that $\Sigma^{-1}$ is positive definite. Then we have the following complete dual problem for the formulation (B.5):
$$\sup_{\sum_{i\in\mathbb{N}_n}\alpha_i=0}\bigg\{-C\sum_{i\in\mathbb{N}_n}\ell^*\Big(-\frac{\alpha_i}{C},y_i\Big)-\bigg[\frac{1}{2}\sum_{j\in\mathbb{N}_l}\Big(\sum_{m\in\mathbb{N}_M}\Big\|\sum_{i\in S_j}\alpha_i\phi_m(x_i)\Big\|_2^{\frac{2p}{p-1}}\Big)^{\frac{p-1}{p}}\oplus\frac{1}{2\mu}\sum_{m\in\mathbb{N}_M}\Big\|\Big(\sum_{i\in S_j}\alpha_i\phi_m(x_i)\Big)_{j\in\mathbb{N}_l}\Big\|_\Sigma^2\bigg]\bigg\}.$$
Proof Using Proposition 3 to get the optimal $\beta_{jm}$, the problem (B.5) is equivalent to
$$\inf_{w,b,t}\ \frac{1}{2}\sum_{j\in\mathbb{N}_l}\Big(\sum_{m\in\mathbb{N}_M}\|w_j^{(m)}\|_2^{\frac{2p}{p+1}}\Big)^{\frac{p+1}{p}}+\frac{\mu}{2}\sum_{m\in\mathbb{N}_M}\|w^{(m)}\|_{\Sigma^{-1}}^2+C\sum_{i\in\mathbb{N}_n}\ell(t_i,y_i)$$
$$\text{s.t.}\quad\sum_{m\in\mathbb{N}_M}\langle w_j^{(m)},\phi_m(x_i)\rangle+b=t_i,\quad\forall i\in S_j,\,j\in\mathbb{N}_l.$$
According to the definition of the Fenchel-Legendre conjugate and its relationship to the inf-convolution established in Lemma B.1, the Lagrangian saddle point problem translates to
$$\begin{aligned}
&\sup_{\alpha}\ \inf_{w,b,t}\ \frac{1}{2}\sum_{j\in\mathbb{N}_l}\Big(\sum_{m\in\mathbb{N}_M}\|w_j^{(m)}\|_2^{\frac{2p}{p+1}}\Big)^{\frac{p+1}{p}}+\frac{\mu}{2}\sum_{m\in\mathbb{N}_M}\|w^{(m)}\|_{\Sigma^{-1}}^2+C\sum_{i\in\mathbb{N}_n}\ell(t_i,y_i)-\sum_{j\in\mathbb{N}_l}\sum_{i\in S_j}\alpha_i\Big(\sum_{m\in\mathbb{N}_M}\langle w_j^{(m)},\phi_m(x_i)\rangle+b-t_i\Big)\\
&=\sup_{\alpha}\ -C\sum_{i\in\mathbb{N}_n}\sup_{t_i}\Big[-\ell(t_i,y_i)-\frac{1}{C}\alpha_it_i\Big]-\sup_{b}\sum_{i\in\mathbb{N}_n}\alpha_ib\\
&\qquad-\sup_{w}\bigg[\sum_{j\in\mathbb{N}_l}\sum_{m\in\mathbb{N}_M}\Big\langle w_j^{(m)},\sum_{i\in S_j}\alpha_i\phi_m(x_i)\Big\rangle-\frac{1}{2}\sum_{j\in\mathbb{N}_l}\Big(\sum_{m\in\mathbb{N}_M}\|w_j^{(m)}\|_2^{\frac{2p}{p+1}}\Big)^{\frac{p+1}{p}}-\frac{\mu}{2}\sum_{m\in\mathbb{N}_M}\|w^{(m)}\|_{\Sigma^{-1}}^2\bigg]\\
&\overset{\text{Def.}}{=}\sup_{\sum_i\alpha_i=0}\ -C\sum_{i\in\mathbb{N}_n}\ell^*\Big(-\frac{\alpha_i}{C},y_i\Big)-\bigg[\frac{1}{2}\sum_{j\in\mathbb{N}_l}\Big(\sum_{m\in\mathbb{N}_M}\Big\|\sum_{i\in S_j}\alpha_i\phi_m(x_i)\Big\|_2^{\frac{2p}{p+1}}\Big)^{\frac{p+1}{p}}+\frac{\mu}{2}\sum_{m\in\mathbb{N}_M}\Big\|\Big(\sum_{i\in S_j}\alpha_i\phi_m(x_i)\Big)_{j\in\mathbb{N}_l}\Big\|_{\Sigma^{-1}}^2\bigg]^*\\
&\overset{\text{Lem. B.1}}{=}\sup_{\sum_i\alpha_i=0}\ -C\sum_{i\in\mathbb{N}_n}\ell^*\Big(-\frac{\alpha_i}{C},y_i\Big)-\bigg[\frac{1}{2}\sum_{j\in\mathbb{N}_l}\Big(\sum_{m\in\mathbb{N}_M}\Big\|\sum_{i\in S_j}\alpha_i\phi_m(x_i)\Big\|_2^{\frac{2p}{p+1}}\Big)^{\frac{p+1}{p}}\bigg]^*\oplus\bigg[\frac{\mu}{2}\sum_{m\in\mathbb{N}_M}\Big\|\Big(\sum_{i\in S_j}\alpha_i\phi_m(x_i)\Big)_{j\in\mathbb{N}_l}\Big\|_{\Sigma^{-1}}^2\bigg]^*\\
&\overset{\text{Lem. B.1}}{=}\sup_{\sum_i\alpha_i=0}\ -C\sum_{i\in\mathbb{N}_n}\ell^*\Big(-\frac{\alpha_i}{C},y_i\Big)-\sum_{j\in\mathbb{N}_l}\bigg[\frac{1}{2}\Big\|\Big(\sum_{i\in S_j}\alpha_i\phi_m(x_i)\Big)_{m\in\mathbb{N}_M}\Big\|_{2,\frac{2p}{p+1}}^2\bigg]^*\oplus\sum_{m\in\mathbb{N}_M}\bigg[\frac{\mu}{2}\Big\|\Big(\sum_{i\in S_j}\alpha_i\phi_m(x_i)\Big)_{j\in\mathbb{N}_l}\Big\|_{\Sigma^{-1}}^2\bigg]^*\\
&\overset{\text{(B.1)}}{=}\sup_{\sum_i\alpha_i=0}\bigg\{-C\sum_{i\in\mathbb{N}_n}\ell^*\Big(-\frac{\alpha_i}{C},y_i\Big)-\bigg[\frac{1}{2}\sum_{j\in\mathbb{N}_l}\Big(\sum_{m\in\mathbb{N}_M}\Big\|\sum_{i\in S_j}\alpha_i\phi_m(x_i)\Big\|_2^{\frac{2p}{p-1}}\Big)^{\frac{p-1}{p}}\oplus\frac{1}{2\mu}\sum_{m\in\mathbb{N}_M}\Big\|\Big(\sum_{i\in S_j}\alpha_i\phi_m(x_i)\Big)_{j\in\mathbb{N}_l}\Big\|_\Sigma^2\bigg]\bigg\}.
\end{aligned}$$
In the last step of the above deduction, we have used the fact that the $\Sigma^{-1}$-norm and the $\Sigma$-norm, as well as the $\ell_p$-norm and the $\ell_{\frac{p}{p-1}}$-norm, are dual-norm pairs.
We can now prove the complete dual problem established in Proposition 4 by plugging the Fenchel conjugate of the hinge loss into Problem B.3.

Proof of Proposition 4 Note that the Fenchel-Legendre conjugate of the hinge loss is $\ell^*(t,y)=\frac{t}{y}$ (as a function of $t$) if $-1\leq\frac{t}{y}\leq0$ and $\infty$ elsewise (Rifkin and Lippert, 2007). Recall the identity $(\eta f)^*(x)=\eta f^*(x/\eta)$. Hence, for each $i$, the term $\ell^*(-\frac{\alpha_i}{C},y_i)$ translates to $-\frac{\alpha_i}{Cy_i}$, provided that $0\leq\frac{\alpha_i}{y_i}\leq C$. With a variable substitution of the form $\alpha_i^{\text{new}}=\frac{\alpha_i}{y_i}$, the complete dual problem established in Problem B.3 now becomes
$$\begin{aligned}
&\sup_{\substack{0\le\alpha_i\le C\\ \sum_{i\in\mathbb{N}_n}\alpha_iy_i=0}}\ \sum_{i\in\mathbb{N}_n}\alpha_i-\bigg[\frac{1}{2}\sum_{j\in\mathbb{N}_l}\Big(\sum_{m\in\mathbb{N}_M}\Big\|\sum_{i\in S_j}\alpha_iy_i\phi_m(x_i)\Big\|_2^{\frac{2p}{p-1}}\Big)^{\frac{p-1}{p}}\oplus\frac{1}{2\mu}\sum_{m\in\mathbb{N}_M}\Big\|\Big(\sum_{i\in S_j}\alpha_iy_i\phi_m(x_i)\Big)_{j\in\mathbb{N}_l}\Big\|_\Sigma^2\bigg]\\
&\overset{\text{Def.}}{=}\sup_{\substack{0\le\alpha_i\le C\\ \sum_{i\in\mathbb{N}_n}\alpha_iy_i=0}}\ \sup_{\theta_j^{(m)}}\ \sum_{i\in\mathbb{N}_n}\alpha_i-\frac{1}{2\mu}\sum_{m\in\mathbb{N}_M}\Big\|\big(\theta_j^{(m)}\big)_{j\in\mathbb{N}_l}\Big\|_\Sigma^2-\frac{1}{2}\sum_{j\in\mathbb{N}_l}\Big(\sum_{m\in\mathbb{N}_M}\Big\|\sum_{i\in S_j}\alpha_iy_i\phi_m(x_i)-\theta_j^{(m)}\Big\|_2^{\frac{2p}{p-1}}\Big)^{\frac{p-1}{p}}.
\end{aligned}$$
The optimal $\theta_j^{(m)}$ satisfies the following K.K.T. condition
$$\Big(\sum_{m\in\mathbb{N}_M}\Big\|\sum_{i\in S_j}\alpha_iy_i\phi_m(x_i)-\theta_j^{(m)}\Big\|_2^{\frac{2p}{p-1}}\Big)^{-\frac{1}{p}}\Big\|\sum_{i\in S_j}\alpha_iy_i\phi_m(x_i)-\theta_j^{(m)}\Big\|_2^{\frac{2}{p-1}}\Big(\theta_j^{(m)}-\sum_{i\in S_j}\alpha_iy_i\phi_m(x_i)\Big)=-\frac{1}{\mu}\sum_{\tilde j\in\mathbb{N}_l}\Sigma_{j\tilde j}\,\theta_{\tilde j}^{(m)}.$$
Solving the above equation shows that the optimal $\theta_j^{(m)}$ takes the following form
$$\theta_j^{(m)}=\sum_{i\in\mathbb{N}_n}\alpha_iy_i\gamma_{mj\tau(i)}\phi_m(x_i),\qquad\forall j\in\mathbb{N}_l,\,m\in\mathbb{N}_M.$$
Plugging the above identity back into the Lagrangian saddle point problem, we derive the complete dual problem in the proposition.
Appendix C. Proof of Generalization Error Bounds (Theorem 5)
This section presents the proof for the generalization error bounds provided in Section 5. Our
basic tool is the data-dependent complexity measure called the Rademacher complexity (Bartlett
and Mendelson, 2002).
Definition C.1 (Rademacher complexity) For a fixed sample $S=(x_1,\ldots,x_n)$, the empirical Rademacher complexity of a hypothesis space $H$ is defined as
$$\hat R_n(H):=\mathbb{E}_\sigma\sup_{f\in H}\frac{1}{n}\sum_{i\in\mathbb{N}_n}\sigma_if(x_i),$$
where the expectation is taken w.r.t. $\sigma=(\sigma_1,\ldots,\sigma_n)^\top$ with $\sigma_i$, $i\in\mathbb{N}_n$, being a sequence of independent uniform $\{\pm1\}$-valued random variables.
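For intuition, the empirical Rademacher complexity of Definition C.1 can be estimated by Monte Carlo sampling of $\sigma$ whenever the supremum over the hypothesis class is computable. The sketch below (our own toy illustration; the norm-bounded linear class used here is not the CLMKL class $H_{p,\mu,D}$) does this for a class whose supremum has a closed form.

```python
import numpy as np

def empirical_rademacher_linear(X, radius=1.0, n_mc=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    norm-bounded linear class {x -> <w, x> : ||w||_2 <= radius}, for which
        sup_{||w|| <= radius} (1/n) sum_i sigma_i <w, x_i>
            = (radius / n) * || sum_i sigma_i x_i ||_2.
    """
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    vals = np.empty(n_mc)
    for t in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=n)
        vals[t] = radius / n * np.linalg.norm(sigma @ X)
    return vals.mean()

# usage: the estimate decays roughly like O(1/sqrt(n)) for bounded features
X = np.random.RandomState(1).randn(200, 10)
print(empirical_rademacher_linear(X))
```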
The following theorem establishes the Rademacher complexity bounds for CLMKL machines. Denote $\bar p=\frac{2p}{p+1}$ for any $p\geq1$ and observe that $\bar p\leq2$, which implies $\bar p^*\geq2$.
Theorem C.2 (CLMKL RADEMACHER COMPLEXITY BOUNDS) If $\Sigma^{-1}$ is positive definite, then the empirical Rademacher complexity of $H_{p,\mu,D}$ can be controlled by
$$\hat R_n(H_{p,\mu,D})\leq\frac{\sqrt{D}}{n}\inf_{\substack{0\le\theta\le1\\2\le t\le\bar p^*}}\bigg(\theta^2t\sum_{j\in\mathbb{N}_l}\Big\|\Big(\sum_{i\in S_j}k_m(x_i,x_i)\Big)_{m=1}^{M}\Big\|_{\frac{t}{2}}+\frac{(1-\theta)^2}{\mu}\sum_{m\in\mathbb{N}_M}\sum_{j\in\mathbb{N}_l}\Sigma_{jj}\sum_{i\in S_j}k_m(x_i,x_i)\bigg)^{1/2}.$$
If, additionally, $k_m(x,x)\leq B$ for any $x\in\mathcal{X}$ and any $m\in\mathbb{N}_M$, then we have
$$\hat R_n(H_{p,\mu,D})\leq\sqrt{\frac{DB}{n}}\inf_{\substack{0\le\theta\le1\\2\le t\le\bar p^*}}\Big(\theta^2tM^{\frac{2}{t}}+\frac{(1-\theta)^2}{\mu}M\max_{j\in\mathbb{N}_l}\Sigma_{jj}\Big)^{1/2}.$$
Tightness of the bound. It can be checked that the function $x\mapsto xM^{2/x}$ is decreasing on the interval $(0,2\log M)$ and increasing on the interval $(2\log M,\infty)$. Therefore, under the boundedness assumption $k_m(x,x)\leq B$, the Rademacher complexity can be further controlled by
$$\hat R_n(H_{p,\mu,D})\leq\sqrt{\frac{DB}{n}}\times\begin{cases}\min\Big((2e\log M)^{\frac{1}{2}},\ \big(M\mu^{-1}\max_{j\in\mathbb{N}_l}\Sigma_{jj}\big)^{\frac{1}{2}}\Big), & \text{if } p\leq\frac{\log M}{\log M-1},\\[4pt]\min\Big(\big(\tfrac{2p}{p-1}M^{\frac{p-1}{p}}\big)^{\frac{1}{2}},\ \big(M\mu^{-1}\max_{j\in\mathbb{N}_l}\Sigma_{jj}\big)^{\frac{1}{2}}\Big), & \text{otherwise},\end{cases}$$
from which it is clear that our Rademacher complexity bounds enjoy a mild dependence on the number of kernels. The dependence is $O(\log M)$ for $p\leq(\log M-1)^{-1}\log M$ and $O(M^{\frac{p-1}{2p}})$ otherwise. These dependencies recover the best known results for global MKL algorithms in Cortes et al. (2010); Kloft and Blanchard (2011); Kloft et al. (2011).
The proof of Theorem C.2 is based on the following lemmata.

Lemma C.3 (Khintchine-Kahane inequality (Kahane, 1985)) Let $v_1,\ldots,v_n\in H$. Then, for any $q\geq1$, it holds
$$\mathbb{E}_\sigma\Big\|\sum_{i\in\mathbb{N}_n}\sigma_iv_i\Big\|_2^q\leq\Big(q\sum_{i\in\mathbb{N}_n}\|v_i\|_2^2\Big)^{\frac{q}{2}}.$$

Lemma C.4 (Block-structured Hölder inequality (Kloft and Blanchard, 2012)) Let $x=(x^{(1)},\ldots,x^{(n)})$, $y=(y^{(1)},\ldots,y^{(n)})\in H=H_1\times\cdots\times H_n$. Then, for any $p\geq1$, it holds $\langle x,y\rangle\leq\|x\|_{2,p}\|y\|_{2,p^*}$.
Proof of Theorem C.2 Firstly, for any $\bar t\geq1$ we can apply a block-structured version of Hölder's inequality to bound $\sum_{i\in\mathbb{N}_n}\sigma_if_w(x_i)$ by
$$\sum_{i\in\mathbb{N}_n}\sigma_if_w(x_i)=\sum_{i\in\mathbb{N}_n}\sigma_i\langle w_{\tau(i)},\phi(x_i)\rangle=\sum_{j\in\mathbb{N}_l}\sum_{i\in S_j}\sigma_i\langle w_j,\phi(x_i)\rangle=\sum_{j\in\mathbb{N}_l}\Big\langle w_j,\sum_{i\in S_j}\sigma_i\phi(x_i)\Big\rangle\overset{\text{Hölder}}{\leq}\sum_{j\in\mathbb{N}_l}\|w_j\|_{2,\bar t}\Big\|\sum_{i\in S_j}\sigma_i\phi(x_i)\Big\|_{2,\bar t^*}.$$
Alternatively, we can also control $\sum_{i\in\mathbb{N}_n}\sigma_if_w(x_i)$ by
$$\sum_{i\in\mathbb{N}_n}\sigma_if_w(x_i)=\sum_{j\in\mathbb{N}_l}\Big\langle w_j,\sum_{i\in S_j}\sigma_i\phi(x_i)\Big\rangle=\sum_{m\in\mathbb{N}_M}\sum_{j\in\mathbb{N}_l}\Big\langle w_j^{(m)},\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big\rangle=\sum_{m\in\mathbb{N}_M}\Big\langle w^{(m)},\Big(\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big)_{j=1}^{l}\Big\rangle\leq\sum_{m\in\mathbb{N}_M}\|w^{(m)}\|_{\Sigma^{-1}}\Big\|\Big(\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big)_{j=1}^{l}\Big\|_\Sigma,$$
where in the last step of the above deduction we have used the fact that the $\Sigma$-norm is the dual norm of the $\Sigma^{-1}$-norm (Lemma B.2).

Combining the above two inequalities together and using the trivial identity $\sum_{i\in\mathbb{N}_n}\sigma_if_w(x_i)=\theta\sum_{i\in\mathbb{N}_n}\sigma_if_w(x_i)+(1-\theta)\sum_{i\in\mathbb{N}_n}\sigma_if_w(x_i)$, for any $0\leq\theta\leq1$ and any $t\geq1$ we have
$$\begin{aligned}
\mathbb{E}_\sigma\sup_{f_w\in H_{t,\mu,D}}\sum_{i\in\mathbb{N}_n}\sigma_if_w(x_i)&\leq\mathbb{E}_\sigma\sup_{f_w\in H_{t,\mu,D}}\bigg[\theta\sum_{j\in\mathbb{N}_l}\|w_j\|_{2,\bar t}\Big\|\sum_{i\in S_j}\sigma_i\phi(x_i)\Big\|_{2,\bar t^*}+(1-\theta)\sum_{m\in\mathbb{N}_M}\|w^{(m)}\|_{\Sigma^{-1}}\Big\|\Big(\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big)_{j=1}^{l}\Big\|_\Sigma\bigg]\\
&\overset{\text{C.-S.}}{\leq}\mathbb{E}_\sigma\sup_{f_w\in H_{t,\mu,D}}\Big(\sum_{j\in\mathbb{N}_l}\|w_j\|_{2,\bar t}^2+\mu\sum_{m\in\mathbb{N}_M}\|w^{(m)}\|_{\Sigma^{-1}}^2\Big)^{1/2}\Big(\theta^2\sum_{j\in\mathbb{N}_l}\Big\|\sum_{i\in S_j}\sigma_i\phi(x_i)\Big\|_{2,\bar t^*}^2+\frac{(1-\theta)^2}{\mu}\sum_{m\in\mathbb{N}_M}\Big\|\Big(\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big)_{j=1}^{l}\Big\|_\Sigma^2\Big)^{1/2}\\
&\overset{\text{Jensen}}{\leq}\sqrt{D}\bigg(\mathbb{E}_\sigma\bigg[\theta^2\sum_{j\in\mathbb{N}_l}\Big\|\sum_{i\in S_j}\sigma_i\phi(x_i)\Big\|_{2,\bar t^*}^2+\frac{(1-\theta)^2}{\mu}\sum_{m\in\mathbb{N}_M}\Big\|\Big(\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big)_{j=1}^{l}\Big\|_\Sigma^2\bigg]\bigg)^{1/2}.\qquad\text{(C.1)}
\end{aligned}$$
For any $j\in\mathbb{N}_l$, the Khintchine-Kahane (K.-K.) inequality and the Jensen inequality (since $\bar t^*\geq2$) permit us to bound $\mathbb{E}_\sigma\big\|\sum_{i\in S_j}\sigma_i\phi(x_i)\big\|_{2,\bar t^*}^2$ by
$$\mathbb{E}_\sigma\Big\|\sum_{i\in S_j}\sigma_i\phi(x_i)\Big\|_{2,\bar t^*}^2=\mathbb{E}_\sigma\Big(\sum_{m\in\mathbb{N}_M}\Big\|\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big\|_2^{\bar t^*}\Big)^{\frac{2}{\bar t^*}}\overset{\text{Jensen}}{\leq}\Big(\sum_{m\in\mathbb{N}_M}\mathbb{E}_\sigma\Big\|\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big\|_2^{\bar t^*}\Big)^{\frac{2}{\bar t^*}}\overset{\text{K.-K.}}{\leq}\Big(\sum_{m\in\mathbb{N}_M}\Big(\bar t^*\sum_{i\in S_j}\|\phi_m(x_i)\|_2^2\Big)^{\frac{\bar t^*}{2}}\Big)^{\frac{2}{\bar t^*}}\overset{\text{Def.}}{=}\bar t^*\Big(\sum_{m\in\mathbb{N}_M}\Big(\sum_{i\in S_j}k_m(x_i,x_i)\Big)^{\frac{\bar t^*}{2}}\Big)^{\frac{2}{\bar t^*}}=\bar t^*\Big\|\Big(\sum_{i\in S_j}k_m(x_i,x_i)\Big)_{m=1}^{M}\Big\|_{\frac{\bar t^*}{2}}.$$
For any $m\in\mathbb{N}_M$, we also have
$$\mathbb{E}_\sigma\Big\|\Big(\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big)_{j=1}^{l}\Big\|_\Sigma^2=\sum_{j,\tilde j\in\mathbb{N}_l}\Sigma_{j\tilde j}\,\mathbb{E}_\sigma\Big\langle\sum_{i\in S_j}\sigma_i\phi_m(x_i),\sum_{\tilde i\in S_{\tilde j}}\sigma_{\tilde i}\phi_m(x_{\tilde i})\Big\rangle=\sum_{j\in\mathbb{N}_l}\Sigma_{jj}\,\mathbb{E}_\sigma\Big\langle\sum_{i\in S_j}\sigma_i\phi_m(x_i),\sum_{\tilde i\in S_j}\sigma_{\tilde i}\phi_m(x_{\tilde i})\Big\rangle=\sum_{j\in\mathbb{N}_l}\Sigma_{jj}\sum_{i\in S_j}k_m(x_i,x_i).$$
Plugging the above two inequalities into Eq. (C.1) and noticing the trivial inequality $\|w_j\|_{2,\bar t}\leq\|w_j\|_{2,\bar p}$ for all $t\geq p\geq1$, we get the following bound for any $0\leq\theta\leq1$:
$$\hat R_n(H_{p,\mu,D})\leq\inf_{t\geq p}\hat R_n(H_{t,\mu,D})\leq\frac{\sqrt{D}}{n}\inf_{t\geq p}\bigg(\theta^2\bar t^*\sum_{j\in\mathbb{N}_l}\Big\|\Big(\sum_{i\in S_j}k_m(x_i,x_i)\Big)_{m=1}^{M}\Big\|_{\frac{\bar t^*}{2}}+\frac{(1-\theta)^2}{\mu}\sum_{m\in\mathbb{N}_M}\sum_{j\in\mathbb{N}_l}\Sigma_{jj}\sum_{i\in S_j}k_m(x_i,x_i)\bigg)^{1/2}.$$
The above inequality can be equivalently written as the first inequality of the theorem. The second inequality follows directly from the boundedness assumption and the fact that
$$\sum_{j\in\mathbb{N}_l}\Sigma_{jj}|S_j|\leq\max_{j\in\mathbb{N}_l}\Sigma_{jj}\,n.$$
Proof of Theorem 5 The proof now simply follows by plugging in the bound of Theorem C.2 into
Theorem 7 of Bartlett and Mendelson (2002).
Appendix D. Parameter sets for the CLMKL on the UIUC Sports event dataset
We have chosen the following pairs of the two parameters (µ, γ): (0.0612, 0.0100), (0.1250, 0.0100),
(0.2500, 0.0100), (0.0612, 0.0316), (0.1250, 0.0316), (0.0612, 0.1000), (0.1250, 0.1000),(2.0000, 0.1000),
(0.0612, 0.3162), (0.2500, 0.3162), (1.0000, 0.3162), (16.0000, 0.3162), (0.0612, 1.0000), (0.1250, 1.0000),
(0.2500, 1.0000), (0.5000, 1.0000), (1.0000, 1.0000), (2.0000, 1.0000), (8.0000, 1.0000). The parameters selected by 10-fold cross-validation for CLMKL were: $\ell_p$-norm $p = 1.333$, $\mu = 2.0$, $\gamma = 1.0$.
References
Francis R Bach, Gert RG Lanckriet, and Michael I Jordan. Multiple kernel learning, conic duality,
and the smo algorithm. In Proceedings of the twenty-first international conference on Machine
learning, page 6. ACM, 2004.
Peter Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and
structural results. Journal of Machine Learning Research, 3:463–482, 2002.
Asa Ben-Hur, Cheng Soon Ong, Sören Sonnenburg, Bernhard Schölkopf, and Gunnar Rätsch. Support vector machines and kernels for computational biology. PLoS Computational Biology, 4,
2008. URL http://svmcompbio.tuebingen.mpg.de.
Alexander Binder, Shinichi Nakajima, Marius Kloft, Christina Müller, Wojciech Samek, Ulf
Brefeld, Klaus-Robert Müller, and Motoaki Kawanabe. Insights from classifying visual concepts
with multiple kernel learning. PloS one, 7(8):e38897, 2012.
Alexander Binder, Wojciech Samek, Klaus-Robert Müller, and Motoaki Kawanabe. Enhanced representation and multi-task learning for image annotation. Computer Vision and Image Understanding, 2013.
Liefeng Bo, Xiaofeng Ren, and Dieter Fox. Hierarchical matching pursuit for image classification:
Architecture and fast algorithms. Advances in Neural Information Processing Systems (NIPS),
2011.
Stephen Poythress Boyd and Lieven Vandenberghe. Convex optimization. Cambridge Univ. Press,
New York, 2004.
Colin Campbell and Yiming Ying. Learning with support vector machines. Synthesis Lectures on
Artificial Intelligence and Machine Learning, 5(1):1–95, 2011.
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
Corinna Cortes. Invited talk: Can learning kernels help performance? In Proceedings of the 26th
Annual International Conference on Machine Learning, ICML ’09, pages 1:1–1:1, New York,
NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. Video http://videolectures.net/
icml09_cortes_clkh/.
Corinna Cortes and Vladimir Vapnik. Support vector networks. Machine Learning, 20:273–297,
1995.
Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for learning
kernels. In Proceedings of the 28th International Conference on Machine Learning, ICML’10,
2010.
Corinna Cortes, Marius Kloft, and Mehryar Mohri. Learning kernels using local rademacher complexity. In Advances in Neural Information Processing Systems, pages 2760–2768, 2013.
Manuel Fernández Delgado, Eva Cernadas, Senén Barro, and Dinani Gomes Amorim. Do we need
hundreds of classifiers to solve real world classification problems? Journal of Machine Learning
Research, 15(1):3133–3181, 2014.
Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 551–556. ACM, 2004.
Chris HQ Ding and Inna Dubchak. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17(4):349–358, 2001.
Theodoros Evgeniou, Charles A Micchelli, and Massimiliano Pontil. Learning multiple tasks with
kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.
P.V. Gehler and S. Nowozin. Infinite kernel learning. In Proceedings of the NIPS 2008 Workshop
on Kernel Learning: Automatic Selection of Optimal Kernels, 2008.
Mehmet Gönen and Ethem Alpaydin. Localized multiple kernel learning. In Proceedings of the
25th international conference on Machine learning, pages 352–359. ACM, 2008.
Mehmet Gönen and Ethem Alpaydin. Multiple kernel learning algorithms. J. Mach. Learn. Res.,
12:2211–2268, July 2011. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?
id=1953048.2021071.
Yina Han and Guizhong Liu. Probability-confidence-kernel-based localized multiple kernel learning with $\ell_p$ norm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42(3):827–837, 2012.
Cijo Jose, Prasoon Goyal, Parv Aggrwal, and Manik Varma. Local deep kernel learning for efficient
non-linear svm prediction. In Proceedings of the 30th International Conference on Machine
Learning (ICML-13), pages 486–494, 2013.
Jean-Pierre Kahane. Some random series of functions, volume 5 of Cambridge Studies in Advanced Mathematics, 1985.
Marius Kloft. `p -norm multiple kernel learning. PhD thesis, Berlin Institute of Technology, Berlin,
Germany, 2011.
Marius Kloft and Gilles Blanchard. The local Rademacher complexity of `p -norm multiple kernel
learning. In J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F. Pereira, and K.Q. Weinberger, editors,
Advances in Neural Information Processing Systems 24, pages 2438–2446. MIT Press, 2011.
Marius Kloft and Gilles Blanchard. On the convergence rate of lp-norm multiple kernel learning.
Journal of Machine Learning Research, 13(1):2465–2502, 2012.
Marius Kloft and Pavel Laskov. A poisoning attack against online anomaly detection. NIPS Workshop on Machine Learning in Adversarial Environments for Computer Security, 2007.
Marius Kloft and Pavel Laskov. Security analysis of online centroid anomaly detection. The Journal
of Machine Learning Research, 13(1):3681–3724, 2012.
Marius Kloft, Shinichi Nakajima, and Ulf Brefeld. Feature selection for density level-sets. In Machine Learning and Knowledge Discovery in Databases, pages 692–704. Springer Berlin Heidelberg, 2009.
Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. Lp-norm multiple kernel learning. The Journal of Machine Learning Research, 12:953–997, 2011.
Marius Kloft, Felix Stiehler, Zhilin Zheng, and Niels Pinkwart. Predicting mooc dropout over weeks
using machine learning methods. EMNLP 2014, page 60, 2014.
Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan.
Learning the kernel matrix with semidefinite programming. The Journal of Machine Learning
Research, 5:27–72, 2004.
Yunwen Lei and Lixin Ding. Refined Rademacher chaos complexity bounds with applications to
the multikernel learning problem. Neural. Comput., 26(4):739–760, 2014.
Li-Jia Li and Li Fei-Fei. What, where and who? classifying events by scene and object recognition.
In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE,
2007.
Bao-Di Liu, Yu-Xiong Wang, Bin Shen, Yu-Jin Zhang, and Martial Hebert. Self-explanatory
sparse representation for image classification. In Computer Vision–ECCV 2014, pages 600–616.
Springer International Publishing, 2014.
David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal
of Computer Vision, 60(2):91–110, 2004. URL http://dx.doi.org/10.1023/B:VISI.
0000029664.99615.94.
Charles A Micchelli and Massimiliano Pontil. Learning the kernel function via regularization.
Journal of Machine Learning Research, pages 1099–1125, 2005.
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning.
MIT press, 2012.
Shinichi Nakajima, Alexander Binder, Christina Müller, Wojciech Wojcikiewicz, Marius Kloft, Ulf
Brefeld, Klaus-Robert Müller, and Motoaki Kawanabe. Multiple kernel learning for object classification. Proceedings of the 12th Workshop on Information-based Induction Sciences, 24, 2009.
Alain Rakotomamonjy, Francis Bach, Stéphane Canu, and Yves Grandvalet. SimpleMKL. Journal
of Machine Learning Research, 9:2491–2521, 2008.
Ryan M Rifkin and Ross A Lippert. Value regularization and fenchel duality. The Journal of
Machine Learning Research, 8:441–479, 2007.
R Tyrrell Rockafellar. Convex analysis. Princeton university press, 1997.
Bernhard Schölkopf and Alexander J Smola. Learning with Kernels. MIT Press, Cambridge, MA,
2002.
Yan Song, Yan-Tao Zheng, Sheng Tang, Xiangdong Zhou, Yongdong Zhang, Shouxun Lin, and
T-S Chua. Localized multiple kernel learning for realistic human action recognition in videos.
Circuits and Systems for Video Technology, IEEE Transactions on, 21(9):1193–1202, 2011.
Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, and Bernhard Schölkopf. Large scale multiple
kernel learning. The Journal of Machine Learning Research, 7:1531–1565, 2006.
Dmitry Storcheus, Mehryar Mohri, and Afshin Rostamizadeh. Foundations of coupled nonlinear
dimensionality reduction. arXiv preprint arXiv:1509.08880, 2015.
Zhaonan Sun, Nawanol Ampornpunt, Manik Varma, and Svn Vishwanathan. Multiple kernel learning and the smo algorithm. In Advances in neural information processing systems, pages 2361–
2369, 2010.
Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek. Evaluating color descriptors for
object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1582–1596, 2010.
URL http://doi.ieeecomputersociety.org/10.1109/TPAMI.2009.154.
Qiang Wu, Yiming Ying, and Ding-Xuan Zhou. Multi-kernel regularized classifiers. Journal of
Complexity, 23(1):108–134, 2007.
Jingjing Yang, Yuanning Li, Yonghong Tian, Lingyu Duan, and Wen Gao. Group-sensitive multiple kernel learning for object categorization. In 2009 IEEE 12th International Conference on
Computer Vision, pages 436–443. IEEE, 2009.
Yiming Ying and Colin Campbell. Generalization bounds for learning the kernel. In S. Dasgupta
and A. Klivans, editors, Proceedings of the 22nd Annual Conference on Learning Theory, COLT
’09, Montreal, Quebec, Canada, 2009.
Jianguo Zhang, Marcin Marszalek, Svetlana Lazebnik, and Cordelia Schmid. Local features and
kernels for classification of texture and object categories: A comprehensive study. International
Journal of Computer Vision, 73(2):213–238, 2007. URL http://dx.doi.org/10.1007/
s11263-006-9794-4.