# Theory for the Localized Setting of Learning Kernels

Journal of Machine Learning Research 44 (2015) xx-xx. Submitted 10/15; Published 10/00.

Yunwen Lei (YUNWELEI@CITYU.EDU.HK), Department of Mathematics, City University of Hong Kong
Alexander Binder (ALEXANDER_BINDER@SUTD.EDU.SG), Machine Learning Group, TU Berlin; ISTD Pillar, Singapore University of Technology and Design
Ürün Dogan (UDOGAN@MICROSOFT.COM), Microsoft Research, Cambridge CB1 2FB, UK
Marius Kloft (KLOFT@HU-BERLIN.DE), Department of Computer Science, Humboldt University of Berlin

Editor: Dmitry Storcheus

© 2015 Yunwen Lei, Alexander Binder, Ürün Dogan and Marius Kloft.

**Abstract**

We analyze the localized setting of learning kernels, also known as localized multiple kernel learning. This problem has so far been addressed by rather heuristic approaches that approximately optimize non-convex problem formulations, for which no theoretical learning bounds are known. In this paper, we show generalization error bounds for learning localized kernel classes in which the localities are coupled using graph-based regularization. Based on this hypothesis class, we propose a novel algorithm for learning localized kernels that is formulated as a convex optimization problem over a pre-obtained cluster structure of the data. We derive dual representations using Fenchel conjugation theory, based on which we give a simple yet efficient wrapper-based optimization algorithm. We apply the method to problems involving multiple heterogeneous data sources, taken from the domains of computational biology and computer vision. The results show that the proposed convex approach to learning localized kernels can achieve higher prediction accuracies than its global and non-convex local counterparts.

## 1. Introduction

Kernel-based learning algorithms (e.g., Schölkopf and Smola, 2002), including support vector machines (Cortes and Vapnik, 1995), have found diverse applications due to their distinct merits such as decent computational complexity, high prediction accuracy (Delgado et al., 2014), and a solid mathematical foundation (e.g., Mohri et al., 2012). Since the learning and data representation processes are decoupled in a modular fashion, one can obtain non-linear kernel machines from simpler linear ones in a canonical way. The performance of such algorithms, however, is fundamentally limited by the choice of the kernel function, as it intrinsically specifies the feature space where the learning process is implemented. This choice is typically left to the user. A substantial step toward the complete automation of kernel-based machine learning was achieved by Lanckriet et al. (2004), who introduced the multiple kernel learning (MKL) or learning-kernels framework (Gönen and Alpaydin, 2011). Being formulated in terms of a single convex optimization criterion, MKL offers a theoretically sound way (Wu et al., 2007; Ying and Campbell, 2009; Cortes et al., 2010; Kloft and Blanchard, 2011, 2012; Cortes et al., 2013; Lei and Ding, 2014) of encoding complementary information with distinct base kernels and of automatically learning an optimal combination of those (Ben-Hur et al., 2008; Gehler and Nowozin, 2008) using efficient numerical algorithms (Bach et al., 2004; Sonnenburg et al., 2006; Rakotomamonjy et al., 2008). This is particularly significant in the application domains of bioinformatics and computer vision (Ben-Hur et al., 2008; Gehler and Nowozin, 2008; Kloft, 2011), where data can be obtained from multiple heterogeneous sources describing different properties of one and the same object (e.g., a genome or an image).
While early sparsity-inducing approaches failed to live up to their expectations in terms of improvement over uniform combinations of kernels (cf. Cortes, 2009, and references therein), it was shown that improved predictive accuracy can be achieved by employing appropriate regularization (Kloft et al., 2011). Currently, most existing algorithms fall into the global setting of MKL, in the sense that the kernel combination is not varied over the input space. This ignores the fact that different regions of the input space might require individual kernel weights. For instance, in the figures to the right, the images exhibit very distinct color distributions. While a kernel based on global color histograms may be effective to detect the horse object in the left image, it may fail in the right image, as there the image fore- and background exhibit very similar color distributions. This motivates us to study localized approaches to learning kernels (Gönen and Alpaydin, 2008). The existing algorithms (reviewed in the subsequent section), however, optimize non-convex objective functions using ad-hoc optimization heuristics, which hampers reproducibility. Whether these algorithms are protected against overfitting is still an open research question, as no theoretical guarantees (neither generalization error nor excess risk bounds) are known. In this paper, we show generalization error bounds for a localized setting of learning kernels in which we assume a pre-specified cluster structure of the data. We show that empirical risk minimization over this class amounts to a convex optimization problem, for which we derive partial and complete dual representations using Fenchel conjugation theory, and we give an efficient convex wrapper-based optimization algorithm. We apply the method to problems involving multiple heterogeneous data sources, taken from the domains of computational biology and computer vision.
The results show that the proposed convex approach to learning localized kernels can achieve higher prediction accuracies than its global and non-convex local counterparts.

The remainder of this paper is structured as follows. In Section 2 we review related work. In Section 3 our convex and localized formulation of learning kernels is introduced; a partial dual representation of it is derived in Section 4, where we also present an efficient optimization algorithm. We report on theoretical results, including generalization error bounds, in Section 5. Empirical results for the application domains of visual image recognition and protein fold class prediction are presented in Section 6; Section 7 concludes.

## 2. Related work

Gönen and Alpaydin (2008) initiated the work on localized MKL by using a discriminant function $f(x) = \sum_{k=1}^{M} \eta_k(x|V)\,\langle w_k, \phi_k(x)\rangle + b$, where $M$ is the number of kernels, $\eta_k(x|V)$ is a parametric gating model assigning a weight to $\phi_k(x)$ as a function of $x$, and $V$ encodes the parameters of the gating model. The gating function is used to divide the input space into different regions, each of which is assigned its own kernel weights. The joint optimization of the gating model and the kernel-based prediction function is carried out by alternating optimization. This problem is non-convex due to the non-linearity introduced by the gating function. Yang et al. (2009) develop a group-sensitive variant of MKL tailored to object categorization. Their approach is also non-convex but, in contrast to Gönen and Alpaydin (2008), examples within a group share the same kernel weights, while examples from different groups employ distinct sets of kernel weights. Han and Liu (2012) modify the approach of Gönen and Alpaydin (2008) by complementing the spatial-similarity-based kernels with probability confidence kernels that reflect the degree of confidence to which the involved examples belong to the same class.
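The gating model of Gönen and Alpaydin (2008) is commonly parameterized as a softmax over linear score functions of the input. A minimal Python sketch of this construction (the softmax parameterization, the explicit feature maps `phis`, and all names are illustrative assumptions for this example, not the authors' code):

```python
import numpy as np

def gating_weights(x, V, v0):
    """Softmax gating eta_k(x|V): one non-negative weight per kernel, summing to one.

    V  : (M, d) gating parameters, one row per kernel
    v0 : (M,)   gating biases
    """
    scores = V @ x + v0          # linear score per kernel
    scores -= scores.max()       # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def discriminant(x, V, v0, W, phis, b):
    """f(x) = sum_k eta_k(x|V) <w_k, phi_k(x)> + b for explicit feature maps phi_k."""
    eta = gating_weights(x, V, v0)
    return sum(eta[k] * W[k] @ phis[k](x) for k in range(len(phis))) + b
```

Because the weights `eta` depend non-linearly on the gating parameters `V`, the resulting training objective is non-convex in `V`, which is exactly the difficulty the convex formulation of Section 3 avoids.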
Song et al. (2011) present a localized MKL algorithm for realistic human action recognition in videos. However, the involved local models are constructed independently of each other. They thus ignore the coupling among different localities, and may produce a suboptimal classifier already when these localities are moderately correlated. Recently, a localized MKL formulation has been studied as a computational means to study non-linear SVMs (Jose et al., 2013). All these approaches are based on non-convex optimization criteria and lack learning-theoretical guarantees. To our knowledge, the only theoretically sound approach in the context of the localized setting of learning kernels is by Cortes, Kloft, and Mohri (2013). They present an MKL approach based on controlling the local Rademacher complexity of the resulting kernel combination. Note, however, that the meaning of locality is different there: while in the present work we assign kernel weights locally with respect to the input space, Cortes, Kloft, and Mohri (2013) localize the hypothesis class, which leads to sharper generalization bounds (Kloft and Blanchard, 2011, 2012).

## 3. Learning methodology

In this paper, we study a convex formulation of localized MKL (CLMKL). For simplicity, we present our approach for binary classification, although it is general and can be extended to regression, multi-class classification, and structured output prediction.

### 3.1 Localized Problem Setting of Learning Kernels

Suppose we are given $M$ base kernels $k_1, \ldots, k_M$, with $\phi_m$ being the kernel feature map of the $m$-th kernel, i.e., $k_m(x, \tilde x) = \langle \phi_m(x), \phi_m(\tilde x)\rangle_{k_m}$. Let $H_m$ be the reproducing kernel Hilbert space corresponding to the kernel $k_m$, with inner product $\langle\cdot,\cdot\rangle_{k_m}$ and induced norm $\|\cdot\|_{k_m}$. For clarity, we frequently use the shorthand $\langle\cdot,\cdot\rangle := \langle\cdot,\cdot\rangle_{k_m}$ and $\|\cdot\|_2 := \|\cdot\|_{k_m}$. For any $d \in \mathbb{N}_+$, introduce the notation $N_d = \{1, \ldots, d\}$. Suppose that the training examples $(x_1, y_1), \ldots$
$, (x_n, y_n)$ are partitioned into $l$ disjoint clusters $S_1, \ldots, S_l$. For each cluster $S_j$, we learn a distinct linear combined kernel $\tilde k_j = \sum_{m\in N_M} \beta_{jm} k_m$ and a distinct weight vector $w_j = (w_j^{(1)}, \ldots, w_j^{(M)})$. This results, for each cluster $S_j$, in a linear model

$$f_j(x) = \langle w_j, \phi(x)\rangle + b = \sum_{m\in N_M} \langle w_j^{(m)}, \phi_m(x)\rangle + b,$$

where $\phi = (\phi_1, \ldots, \phi_M)$ is the concatenated feature map.

### 3.2 Notation

For a Hilbert space $H$ with inner product $\langle\cdot,\cdot\rangle$ and $l$ elements $w_1, \ldots, w_l \in H$, we define the $\Sigma$ semi-norm of $(w_1, \ldots, w_l)$ by

$$\|(w_1, \ldots, w_l)\|_{\Sigma} := \Big(\sum_{j,\tilde j\in N_l} \Sigma_{j\tilde j}\,\langle w_j, w_{\tilde j}\rangle\Big)^{1/2}, \qquad (1)$$

where $\Sigma$ is a positive semi-definite $l\times l$ matrix. For any $\beta = (\beta_{jm})_{j\in N_l, m\in N_M}$ and any $m\in N_M$, we write

$$Q_m^{\beta} := Q_m^{\beta,\mu,\Sigma} = \big(q^{(\beta)}_{mj\tilde j}\big)_{j,\tilde j\in N_l} = \big[\mathrm{diag}(\beta_{1m}^{-1}, \ldots, \beta_{lm}^{-1}) + \mu\Sigma^{-1}\big]^{-1},$$

where $\mathrm{diag}(a_1, \ldots, a_l)$ is the $l\times l$ diagonal matrix with $a_1, \ldots, a_l$ on the main diagonal. For any $x\in\mathcal X$, we use $\tau(x)$ to denote the index of the cluster to which the point $x$ belongs, i.e., $\tau(x) = j \iff x\in S_j$. For brevity, we write $\tau(i) := \tau(x_i)$ for all $i$, and $a_+ = \max(a, 0)$ for all $a\in\mathbb R$. Introduce the notation $w^{(m)} = (w_1^{(m)}, \ldots, w_l^{(m)})$. For any $p\ge 1$, we denote by $p^*$ its conjugate exponent, satisfying $\frac1p + \frac1{p^*} = 1$. For $w_j = (w_j^{(1)}, \ldots, w_j^{(M)})$, we define the $\ell_{2,p}$-norm by $\|w_j\|_{2,p} := \big(\sum_{m\in N_M} \|w_j^{(m)}\|_{k_m}^p\big)^{1/p}$.

### 3.3 Convex localized multiple kernel learning (CLMKL)

The proposed convex formulation for localized MKL is given as follows (for simplicity presented in terms of the hinge loss; for a general presentation, see Appendix B.2):

**Problem 1 (Convex localized multiple kernel learning (CLMKL))**

$$\min_{w,\xi,\beta,b}\;\; \sum_{j\in N_l,\, m\in N_M} \frac{\|w_j^{(m)}\|_2^2}{2\beta_{jm}} + \frac{\mu}{2}\sum_{m\in N_M} \|w^{(m)}\|_{\Sigma^{-1}}^2 + C\sum_{i\in N_n}\xi_i \qquad (2)$$
$$\text{s.t.}\;\; y_i\Big(\sum_{m\in N_M}\langle w_j^{(m)}, \phi_m(x_i)\rangle + b\Big) \ge 1 - \xi_i,\;\; \xi_i \ge 0, \quad \forall i\in S_j,\; j\in N_l,$$
$$\sum_{m\in N_M}\beta_{jm}^p \le 1,\;\; \forall j\in N_l; \qquad \beta_{jm}\ge 0,\;\; \forall j\in N_l,\; m\in N_M,$$

where $\xi_i$ are slack variables, $C$ and $\mu$ are regularization parameters, and $\Sigma^{-1}$ is a positive semi-definite matrix (note that we never need to compute the inverse of $\Sigma^{-1}$ in the implementation).

Note that we impose, for each cluster $S_j$, $j\in N_l$, a separate $\ell_p$-norm constraint (Kloft et al., 2011) on the combination coefficients $\beta_j = (\beta_{j1}, \ldots, \beta_{jM})$. However, unlike training a local model independently at each locality, these $l$ local models are optimized jointly in our formulation, exploiting that examples in localities close by may convey complementary information to the learning task. The regularizer defined in (1) encodes the relationship among different clusters and imposes a soft constraint on how these local models shall be correlated. Note that if $\Sigma^{-1}$ is the graph Laplacian of an adjacency matrix $W$ (i.e., $\Sigma^{-1} = D - W$ with $D_{j\tilde j} = \delta_{j\tilde j}\sum_{k\in N_l} W_{jk}$), the regularizer in (1) coincides with the graph regularizer also employed in Evgeniou et al. (2005): $\|w^{(m)}\|_{\Sigma^{-1}}^2 = \frac12\sum_{j,\tilde j\in N_l} W_{j\tilde j}\,\|w_j^{(m)} - w_{\tilde j}^{(m)}\|_2^2$.

Recall that a quadratic function over a linear function is convex (e.g., Boyd and Vandenberghe, 2004, p. 89), so all summands in formulation (2) are convex, and hence (2) is a convex optimization problem. Slater's condition can be checked directly, and thus strong duality holds. To the best of our knowledge, problem (2) is the first convex formulation in the localized setting of learning kernels.

## 4. Optimization Algorithms

As pioneered by Sonnenburg et al. (2006), we consider a two-layer optimization procedure to solve problem (2), where the variables are divided into two groups: the kernel weights $\{\beta_{jm}\}_{j\in N_l, m\in N_M}$ and the weight vectors $\{w_j^{(m)}\}_{j\in N_l, m\in N_M}$.
In each iteration, we alternately optimize one group of variables while fixing the other. These iterations are repeated until some optimality conditions are satisfied. To this end, we need efficient strategies for solving the two subproblems. The following proposition shows that the subproblem of optimizing the objective of (2) with respect to $\{w_j^{(m)}\}_{j\in N_l, m\in N_M}$ for fixed kernel weights can be cast as a standard SVM problem with a delicately defined kernel.

**Proposition 2 (CLMKL (partial) dual problem)** Introduce the kernel

$$\tilde k(x_i, x_{\tilde i}) := \sum_{m\in N_M} q^{(\beta)}_{m\tau(i)\tau(\tilde i)}\, k_m(x_i, x_{\tilde i}). \qquad (3)$$

The partial Lagrangian dual of (2) with fixed kernel weights $\beta_{jm}$ is given by

$$\max_{\alpha}\;\; \sum_{i\in N_n}\alpha_i - \frac12\sum_{i,\tilde i\in N_n} y_i y_{\tilde i}\,\alpha_i\alpha_{\tilde i}\,\tilde k(x_i, x_{\tilde i}) \qquad (4)$$
$$\text{s.t.}\;\; \sum_{i\in N_n}\alpha_i y_i = 0, \qquad 0\le\alpha_i\le C,\;\forall i\in N_n.$$

Further, the optimal weight vectors can be represented as

$$w_j^{(m)} = \sum_{\tilde j\in N_l} q^{(\beta)}_{mj\tilde j}\sum_{i\in S_{\tilde j}} y_i\alpha_i\,\phi_m(x_i), \qquad \forall j\in N_l,\; m\in N_M. \qquad (5)$$

Next, we show that the subproblem of optimizing the kernel weights for fixed $w_j^{(m)}$ and $b$ has a closed-form solution. We defer the detailed proofs of Propositions 2 and 3 to the appendix due to lack of space.

**Proposition 3 (Solving the subproblem with respect to the kernel weights)** Given fixed $w_j^{(m)}$ and $b$, the minimal $\beta_{jm}$ in optimization problem (2) is attained at

$$\beta_{jm} = \|w_j^{(m)}\|_2^{\frac{2}{p+1}}\,\Big(\sum_{k\in N_M}\|w_j^{(k)}\|_2^{\frac{2p}{p+1}}\Big)^{-\frac1p}. \qquad (6)$$

To apply Proposition 3 for updating $\beta_{jm}$, we need the norm of $w_j^{(m)}$, which can be computed via the representation given in Eq. (5):

$$\|w_j^{(m)}\|_2^2 = \Big\|\sum_{i\in N_n} y_i\alpha_i\, q^{(\beta)}_{mj\tau(i)}\,\phi_m(x_i)\Big\|_2^2 = \sum_{i,\tilde i\in N_n} y_i y_{\tilde i}\,\alpha_i\alpha_{\tilde i}\, q^{(\beta)}_{mj\tau(i)}\, q^{(\beta)}_{mj\tau(\tilde i)}\, k_m(x_i, x_{\tilde i}). \qquad (7)$$

Furthermore, the prediction function becomes

$$f(x) = \sum_{m\in N_M}\langle w_{\tau(x)}^{(m)}, \phi_m(x)\rangle + b = \sum_{m\in N_M}\sum_{i\in N_n} y_i\alpha_i\, q^{(\beta)}_{m\tau(x)\tau(i)}\, k_m(x_i, x) + b. \qquad (8)$$

The resulting optimization algorithm for convex localized multiple kernel learning is shown in Algorithm 1.
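Propositions 2 and 3 together yield the wrapper scheme of Algorithm 1. A minimal numerical sketch, assuming dense precomputed kernel matrices and delegating the SVM subproblem of Eq. (4) to scikit-learn's `SVC` with a precomputed kernel (function and variable names are illustrative; the paper's own implementation uses MATLAB/LIBSVM):

```python
import numpy as np
from sklearn.svm import SVC

def clmkl_fit(Ks, y, tau, Sigma_inv, C=1.0, mu=1.0, p=1.0, n_iter=20):
    """Wrapper-based CLMKL sketch.

    Ks        : (M, n, n) base kernel matrices
    y         : (n,) labels in {-1, +1}
    tau       : (n,) cluster index of each example, values in {0, ..., l-1}
    Sigma_inv : (l, l) positive semi-definite coupling matrix
    """
    M, n, _ = Ks.shape
    l = Sigma_inv.shape[0]
    beta = np.full((l, M), M ** (-1.0 / p))      # feasible start: sum_m beta^p = 1
    for _ in range(n_iter):
        # Q_m = [diag(beta_{.m}^{-1}) + mu * Sigma^{-1}]^{-1}, one matrix per kernel
        Q = np.stack([np.linalg.inv(np.diag(1.0 / beta[:, m]) + mu * Sigma_inv)
                      for m in range(M)])        # (M, l, l)
        # effective kernel of Eq. (3): sum_m q_{m, tau(i), tau(k)} k_m(x_i, x_k)
        K_eff = sum(Q[m][np.ix_(tau, tau)] * Ks[m] for m in range(M))
        svc = SVC(C=C, kernel="precomputed").fit(K_eff, y)
        alpha_y = np.zeros(n)                    # alpha_i * y_i, zero off-support
        alpha_y[svc.support_] = svc.dual_coef_[0]
        # ||w_j^{(m)}||_2^2 via Eq. (7)
        w_norm2 = np.empty((l, M))
        for m in range(M):
            A = Q[m][:, tau] * alpha_y           # A[j, i] = q_{m,j,tau(i)} alpha_i y_i
            w_norm2[:, m] = np.einsum('ji,ik,jk->j', A, Ks[m], A)
        w_norm2 = np.maximum(w_norm2, 1e-12)     # guard against exact zeros
        # closed-form update of Eq. (6)
        num = w_norm2 ** (1.0 / (p + 1))
        beta = num * (w_norm2 ** (p / (p + 1))).sum(axis=1,
                                                    keepdims=True) ** (-1.0 / p)
    return beta, alpha_y, Q
```

By construction the returned `beta` satisfies the per-cluster constraint $\sum_m \beta_{jm}^p = 1$, and predictions on new points would follow Eq. (8).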
The algorithm alternates between solving an SVM subproblem for fixed kernel weights (line 4) and updating the kernel weights in closed form (line 6). Note that the proposed optimization approach can potentially be extended to an interleaved optimization strategy, where the optimization of the MKL step is directly integrated into the SVM solver. It has been shown (Sonnenburg et al., 2006; Kloft et al., 2011) that such a strategy can increase computational efficiency by up to one to two orders of magnitude (cf. Figure 7 in Kloft et al., 2011).

**Algorithm 1:** Training algorithm for convex localized multiple kernel learning (CLMKL).

```
input: examples {(x_i, y_i)}_{i=1}^n ⊂ X × {−1, 1} together with cluster indices
       {τ(i)}_{i=1}^n, M base kernels k_1, ..., k_M, and a positive semi-definite
       matrix Σ^{-1}
1: initialize β_{jm} = (1/M)^{1/p} for all j ∈ N_l, m ∈ N_M
2: while optimality conditions are not satisfied do
3:     calculate the kernel matrix k̃ by Eq. (3)
4:     compute α by solving a canonical SVM with k̃
5:     compute ||w_j^{(m)}||_2^2 for all j, m by Eq. (7)
6:     update β_{jm} for all j, m according to Eq. (6)
7: end while
```

We remark that we also derive a complete dual problem, removing the dependency on $\beta_{jm}$. Due to lack of space, we defer the detailed proof to the appendix.

**Proposition 4 (CLMKL (complete) dual problem)** If $\Sigma^{-1}$ is positive definite, then the complete Lagrangian dual of Problem (2) (dualized with respect to all variables) becomes:

$$\sup_{\substack{0\le\alpha_i\le C\\ \sum_i\alpha_i y_i=0}}\;\sup_{\gamma_{mj\tilde j},\, m\in N_M,\, j,\tilde j\in N_l}\; -\frac12\sum_{j\in N_l}\Bigg(\sum_{m\in N_M}\Big\|\sum_{i\in S_j}\alpha_i y_i\phi_m(x_i) - \sum_{i\in N_n}\alpha_i y_i\gamma_{mj\tau(i)}\phi_m(x_i)\Big\|_2^{\frac{2p}{p-1}}\Bigg)^{\frac{p-1}{p}}$$
$$-\,\frac{1}{2\mu}\sum_{m\in N_M}\Big\|\Big(\sum_{i\in N_n}\alpha_i y_i\gamma_{mj\tau(i)}\phi_m(x_i)\Big)_{j\in N_l}\Big\|_{\Sigma}^2 + \sum_{i\in N_n}\alpha_i.$$

The above dual sheds further light on the CLMKL optimization problem, and can potentially be exploited for the development of alternative optimization strategies that directly optimize the dual criterion (without the need for a two-step wrapper approach); such an approach has been taken by Sun et al.
(2010) in the context of $\ell_p$-norm MKL. Furthermore, solving the dual enables computing the duality gap, which can be used as a sound evaluation criterion for the optimization precision.

## 5. Rademacher complexity bounds

This section presents a theoretical analysis, showing for the first time that a localized approach to learning kernels can generalize to new and unseen data. In particular, we give a purely data-dependent bound on the generalization error. Our basic strategy is to plug the optimal $\beta_{jm}$ established in Eq. (6) into the primal problem (2), thereby writing (2) as the following equivalent block-norm regularization problem:

$$\min_{w,b}\;\frac12\sum_{j\in N_l}\Big(\sum_{m\in N_M}\|w_j^{(m)}\|_2^{\frac{2p}{p+1}}\Big)^{\frac{p+1}{p}} + \frac{\mu}{2}\sum_{m\in N_M}\|w^{(m)}\|_{\Sigma^{-1}}^2 + C\sum_{i\in N_n}\Big(1 - y_i\sum_{m\in N_M}\langle w_{\tau(i)}^{(m)}, \phi_m(x_i)\rangle - y_i b\Big)_+. \qquad (9)$$

Solving Eq. (9) amounts to performing empirical risk minimization over the hypothesis space

$$H_{p,\mu,D} := H_{p,\mu,D,M} = \Big\{f_w : x\mapsto \langle w_{\tau(x)}, \phi(x)\rangle \;:\; \sum_{j\in N_l}\|w_j\|_{2,\frac{2p}{p+1}}^2 + \mu\sum_{m\in N_M}\|w^{(m)}\|_{\Sigma^{-1}}^2 \le D\Big\}.$$

The following theorem establishes a generalization error bound for CLMKL.

**Theorem 5 (CLMKL generalization error bounds)** Suppose that $\Sigma^{-1}$ is positive definite and $n$ is the sample size. Then, for any $0<\delta<1$, with probability at least $1-\delta$ the expected risk $\mathcal E(h) := \mathbb E[\mathbf 1_{yh(x)\le 0}]$ of any classifier $h\in H_{p,\mu,D}$ can be upper bounded by

$$\mathcal E(h) \le \mathcal E_z(h) + 3\sqrt{\frac{\log(2/\delta)}{2n}} + \frac{\sqrt{2D}}{n}\inf_{\substack{0\le\theta\le 1\\ 2\le t\le \bar p^*}}\Bigg(\theta^2 t\,\Big\|\Big(\sum_{j\in N_l}\sum_{i\in S_j} k_m(x_i, x_i)\Big)_{m=1}^{M}\Big\|_{\frac t2} + \frac{(1-\theta)^2}{\mu}\sum_{j\in N_l}\Sigma_{jj}\sum_{i\in S_j}\sum_{m\in N_M} k_m(x_i, x_i)\Bigg)^{1/2},$$

where $\mathcal E_z(h) := \frac1n\sum_{i=1}^n (1 - y_i h(x_i))_+$ is the empirical risk w.r.t. the hinge loss.

**Remark 6 (Interpretation and Tightness)** The above error bound enjoys a mild dependence on the number of kernels. One can show (cf. Section C) that the dependence is $O(\log M)$ for $p \le (\log M - 1)^{-1}\log M$ and $O\big(M^{\frac{p-1}{2p}}\big)$ otherwise, which recovers the best known results for global MKL algorithms in Cortes et al. (2010); Kloft and Blanchard (2011); Kloft et al. (2011).
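The equivalence between Problem (2) and the block-norm form (9), used above, rests on the variational step of eliminating $\beta$; with $a_m = \|w_j^{(m)}\|_2^2$ and $r = p$ (this is Lemma A.1 of the appendix), it reads:

```latex
\min_{\beta_{jm}\ge 0,\ \sum_{m\in N_M}\beta_{jm}^p\le 1}\;
  \sum_{m\in N_M}\frac{\|w_j^{(m)}\|_2^2}{\beta_{jm}}
  \;=\; \Big(\sum_{m\in N_M}\|w_j^{(m)}\|_2^{\frac{2p}{p+1}}\Big)^{\frac{p+1}{p}}
  \;=\; \|w_j\|_{2,\frac{2p}{p+1}}^2 .
```

Summing over $j\in N_l$ and multiplying by $\frac12$ yields the first term of Eq. (9); the graph regularizer and the hinge-loss term are untouched by the minimization over $\beta$.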
Theorem 5 also suggests that the generalization performance of CLMKL is controlled by a weighted sum of the diagonal elements of the matrix $\Sigma$, with weights proportional to the trace of the Gram matrix on the associated clusters.

## 6. Empirical Studies

### 6.1 Experimental Setup

We implement the proposed convex localized MKL (CLMKL) algorithm in MATLAB and solve the involved canonical SVM problem with LIBSVM (Chang and Lin, 2011). When the clusters $\{S_1, \ldots, S_l\}$ are not known in advance, they are computed through kernel k-means (e.g., Dhillon et al., 2004). To diminish the potential fluctuations of kernel k-means due to the random initialization of the cluster means, we repeat kernel k-means several times, and either select the run with the minimal clustering error (the sum of the squared distances between the examples and the associated nearest cluster) as the final partition, or train a single CLMKL model for each partition and then combine the resulting CLMKL models by majority voting on the binary predictions. We compare the performance attained by the proposed CLMKL to regular localized MKL (LMKL) (Gönen and Alpaydin, 2008), the SVM using a uniform kernel combination (UNIF) (Cortes, 2009), and $\ell_p$-norm MKL (Kloft et al., 2011), which includes classical MKL (Lanckriet et al., 2004) as a special case.

|         | CLMKL      | LMKL       | MKL        | UNIF       |
|---------|------------|------------|------------|------------|
| σ = 0.2 | 98.3 ± 0.8 | 94.7 ± 1.4 | 94.8 ± 1.6 | 94.5 ± 1.6 |
| σ = 0.3 | 91.4 ± 1.9 | 89.5 ± 1.8 | 89.2 ± 2.0 | 89.3 ± 1.7 |

Table 1: Performances achieved by LMKL, UNIF, MKL and the proposed CLMKL on the synthetic dataset. Here, σ is the standard deviation of the noise. The underlying parameter p is 1.

### 6.2 Controlled Experiments on Synthetic Data

We first experiment on a two-class synthetic dataset with positive and negative points lying on a disconnected hexagon with radius 6 and 5, respectively, additionally corrupted by Gaussian noise with standard deviation σ.
The figure to the right shows an example of such a synthetic dataset with 1000 examples and σ = 0.2. [Figure: scatter plot of the hexagon dataset; both axes range from −6 to 6.] This dataset is interesting to us since the optimal combination of the features associated with the first and second coordinates varies along the six sides of the hexagon. We choose the linear kernels on the first and second coordinates as the two base kernels for CLMKL, and apply k-means with 6 clusters to generate the data partition. The correlation matrix $\Sigma^{-1}$ is chosen as the graph Laplacian of an adjacency matrix $W$, where we set $W_{j\tilde j} = \exp(-\gamma d^2(S_j, S_{\tilde j}))$, with $d(S_j, S_{\tilde j})$ being the Euclidean distance between clusters $S_j$ and $S_{\tilde j}$. The parameter γ is set to the reciprocal of the average distance between different clusters. We use one half of the dataset as the training set, and split the remaining half evenly into a validation set and a test set. We tune the parameter $C$ over the set $10^{\{-2,-1,\ldots,2\}}$ and $\mu$ over the set $2^{\{2,4,6,8\}}$, based on the prediction accuracies on the validation set. For CLMKL and MKL, we simply set p = 1 in this experiment. For the baseline methods (LMKL, MKL, UNIF), we complement the linear features by adding the quadratic kernel $k(x, \tilde x) = \langle x, \tilde x\rangle^2$ as a third base kernel, which is a useful feature for this dataset since a circle (a quadratic function) with appropriate radius is expected to serve as a good predictor. The addition of the quadratic kernel thus gives the baseline methods a potential advantage and serves as an additional sanity check of the robustness of the proposed algorithm. Table 1 shows the performance of the proposed CLMKL as well as of the competitors. We observe that CLMKL consistently achieves the best prediction accuracies, with accuracy gains of up to 3.6%. Note that this improvement is achieved although the baseline methods are supplied with an additional quadratic kernel encoding valuable information for this synthetic dataset.
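The partitioning and coupling steps of this experiment can be sketched as follows. This is a minimal sketch under two stated assumptions: the kernel k-means step (which the paper delegates to Dhillon et al., 2004) is implemented directly from a precomputed kernel matrix, and the unspecified cluster distance $d(S_j, S_{\tilde j})$ is approximated by the distance between cluster means:

```python
import numpy as np

def kernel_kmeans(K, l, n_restarts=10, n_iter=50, seed=0):
    """Kernel k-means with restarts, keeping the partition with the smallest
    clustering error (sum of squared feature-space distances to the nearest
    cluster mean)."""
    rng = np.random.default_rng(seed)
    n, diag = K.shape[0], np.diag(K)
    best_labels, best_err = None, np.inf
    for _ in range(n_restarts):
        labels = rng.integers(0, l, size=n)
        for _ in range(n_iter):
            dist = np.empty((n, l))
            for j in range(l):
                idx = np.flatnonzero(labels == j)
                if idx.size == 0:                    # re-seed empty clusters
                    idx = np.array([rng.integers(n)])
                # ||phi(x_i) - c_j||^2 expanded in kernel evaluations
                dist[:, j] = (diag - 2 * K[:, idx].mean(axis=1)
                              + K[np.ix_(idx, idx)].mean())
            new = dist.argmin(axis=1)
            if np.array_equal(new, labels):
                break
            labels = new
        err = dist[np.arange(n), labels].sum()
        if err < best_err:
            best_labels, best_err = labels, err
    return best_labels

def laplacian_coupling(X, labels, gamma=None):
    """Sigma^{-1} as the graph Laplacian L = D - W of
    W_{jk} = exp(-gamma * d^2(S_j, S_k)), with gamma defaulting to the
    reciprocal of the average inter-cluster distance."""
    l = labels.max() + 1
    centers = np.stack([X[labels == j].mean(axis=0) for j in range(l)])
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    if gamma is None:
        gamma = 1.0 / (d.sum() / (l * (l - 1)))      # mean off-diagonal distance
    W = np.exp(-gamma * d ** 2)
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W                # graph Laplacian
```

With $L$ built this way, $\|w\|_L^2 = \frac12\sum_{j,k} W_{jk}\|w_j - w_k\|^2$, i.e., the graph regularizer of Section 3.3. Note that $L$ is only positive semi-definite; where Theorem 5 requires positive definiteness, a small ridge can be added.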
| MKL | LMKL | CLMKL, holdout | CLMKL, oracle | Li and Fei-Fei (2007) | Bo et al. (2011) | Liu et al. (2014) |
|-----|------|----------------|---------------|-----------------------|-----------------|-------------------|
| 90.23 | 87.36 | 90.80 | 91.38 | 73.8 | 85.7 | 89.95 |

Table 2: Prediction accuracies achieved by regular $\ell_p$-norm MKL, LMKL and the proposed CLMKL on the UIUC Sports Event dataset. The columns "holdout" and "oracle" show the prediction accuracies for the selected and the optimal parameters, respectively. Liu et al. (2014) is the best known result from the literature.

|                        | p = 1      | p = 1.14   | p = 1.2    | p = 1.33   |
|------------------------|------------|------------|------------|------------|
| CLMKL (l = 4), holdout | 71.7 ± 0.4 | 74.8 ± 0.4 | 74.9 ± 0.5 | 74.5 ± 0.4 |
| CLMKL (l = 4), oracle  | 72.8 ± 0.9 | 75.2 ± 0.4 | 75.0 ± 0.6 | 75.0 ± 0.4 |
| CLMKL (l = 8), holdout | 71.9 ± 0.4 | 75.1 ± 0.5 | 74.7 ± 0.3 | 74.5 ± 0.4 |
| CLMKL (l = 8), oracle  | 73.7 ± 0.9 | 75.4 ± 0.3 | 75.5 ± 0.6 | 74.7 ± 0.3 |
| MKL                    | 68.7       | 73.4       | 74.2       | 73.1       |
| LMKL                   | 64.3       |            |            |            |
| UNIF                   | 68.4       |            |            |            |

Table 3: Prediction accuracies achieved by UNIF, LMKL, MKL and CLMKL on the protein folding class prediction task. The rows "holdout" and "oracle" show the prediction accuracies for the selected and the optimal parameters, respectively. l indicates the number of clusters in CLMKL, and p indicates the $\ell_p$-norm regularizer on the kernel combination coefficients; LMKL and UNIF do not involve such a parameter.

### 6.3 Visual Image Categorization: An Application from the Domain of Computer Vision

We experiment on the UIUC Sports Event dataset (Li and Fei-Fei, 2007), consisting of 1574 images, each associated with one of 8 image classes (each class corresponding to a sport activity). We compute 12 bag-of-words features, each with a dictionary size of 512, resulting in 12 χ²-kernels (Zhang et al., 2007). The first 6 bag-of-words features are computed over SIFT features (Lowe, 2004) at three different scales and for the two color channel sets RGB and opponent colors (van de Sande et al., 2010). The remaining 6 bag-of-words features are quantiles of color values at the same three scales and for the same two color channel sets. For each channel within a set of color channels, the quantiles are concatenated.
Local features are extracted on a grid of step size 5 from images that were down-scaled to 600 pixels in the largest dimension. The assignment of local features to visual words is done using rank-mapping (Binder et al., 2013). The kernel width of the kernels is set to the mean of the χ²-distances. All kernels are multiplicatively normalized. Following the setup of Liu et al. (2014), the dataset is split into 11 parts. One part is withheld to obtain the final performance measurements, and on the remaining 10 parts we perform 10-fold cross-validation for finding the optimal parameters. For CLMKL we employ kernel k-means with 3 clusters on the cross-validation parts, and we apply majority voting over 8 separate clusterings, for each of which a separate predictor is trained for fixed parameters. The matrix $\Sigma^{-1}$ is computed as $[(\exp(-\gamma d(S_j, S_{\tilde j})))_{j\tilde j}]^{-1}$, where as distance the χ²-distances averaged over the cluster assignments are used. The two involved parameters γ and µ are determined by cross-validation. We compare CLMKL to regular $\ell_p$-norm MKL (Kloft et al., 2011), for which we employ a one-versus-all setup, running over $\ell_p$-norms in {1.0625, 1.125, 1.333, 2} and regularization constants in $\{10^{k/2}\}_{k=-2}^{5}$ (the optima are attained inside the respective grids). CLMKL uses the same set of $\ell_p$-norms and regularization constants from $\{10^{k/2}\}_{k=0,\ldots,5}$; due to time constraints, a subset of 18 combinations of the two parameters $(\gamma, \mu) \in \{10^{i/2}\}_{i=-4}^{0} \times \{2^{i}\}_{i=-4}^{4}$ is used to compute $\Sigma^{-1}$. Performance is measured through multi-class classification accuracy. Table 2 shows the results. The column "holdout" shows the prediction accuracy achieved by taking a majority vote over predictors constructed from different applications of kernel k-means with random initializations, while the column "oracle" indicates the best prediction accuracy achieved by these models built on the output of kernel k-means with random initializations.
We observe that CLMKL achieves a performance improvement of 0.5-1.2% over the $\ell_p$-norm MKL baseline. Comparing this to the best known results from the literature (Liu et al., 2014), we observe that this is, to the best of our knowledge, the highest accuracy reported on the UIUC Sports dataset so far.

### 6.4 Protein Fold Prediction: An Application from the Domain of Computational Biology

Protein fold prediction is a key step towards understanding the function of proteins, as the folding class of a protein is closely linked with its function; it is thus crucial for drug design. We experiment on the protein folding class prediction dataset by Ding and Dubchak (2001), which was also used in Campbell and Ying (2011); Kloft (2011); Kloft and Blanchard (2011). This dataset consists of 27 fold classes, with 311 proteins used for training and 383 proteins reserved for testing. As base kernels, we use exactly the same 12 kernels as Campbell and Ying (2011); Kloft (2011); Kloft and Blanchard (2011), reflecting different features relevant to fold class prediction, e.g., van der Waals volume, polarity and hydrophobicity. This is a non-sparse scenario, for which Kloft (2011) achieved 74.4% accuracy using $\ell_{1.14}$-norm MKL. To be in line with the previous experiments by Campbell and Ying (2011); Kloft (2011); Kloft and Blanchard (2011), we precisely replicate their experimental setup: we use the train/test split supplied by Campbell and Ying (2011) and run CLMKL with a one-versus-all strategy to tackle the multiple classes. The correlation matrix $\Sigma^{-1}$ is constructed in the same way as in Section 6.2. The parameters are chosen by cross-validation over $C \in 10^{\{-2,-1,\ldots,2\}}$ and $\mu \in 2^{\{5,6,7\}}$. We consider $\ell_p$-norm CLMKL models with $p \in \{1, 1.14, 1.2, 1.33\}$ and $l \in \{4, 8\}$. We repeat the experiment 10 times and report the mean prediction accuracies, as well as the standard deviations, in Table 3.
From the table, we observe that CLMKL has the potential to largely surpass its global counterpart, $\ell_p$-norm MKL. Note that we do not reach the accuracy of 74.4% for $\ell_{1.14}$-norm MKL reported in Kloft (2011), which is possibly due to different implementations of the $\ell_p$-norm MKL algorithms. Nevertheless, CLMKL achieves accuracies more than 0.8% higher than the one we obtain for MKL, which is also higher than the one initially reported in Campbell and Ying (2011). For example, CLMKL with l = 8, p = 1.14 achieves an accuracy of 75.1%.

## 7. Conclusion

We proposed a localized approach to learning kernels that admits generalization error bounds and can be phrased as a convex optimization problem over a given or pre-obtained cluster structure. A key ingredient is the use of a graph regularizer to couple the different local models. The theoretical analysis, based on Rademacher complexity theory, resulted in large deviation inequalities that connect the spectrum of the graph regularizer with the generalization capability of the learning algorithm. The proposed method is well suited for deployment in the domains of computer vision and computational biology: computational experiments showed that the proposed approach can achieve prediction accuracies higher than its global and non-convex local counterparts. In future work, we will investigate alternative clustering strategies (including convex ones and soft clustering), as well as ways to incorporate the data partitioning into our framework in a principled manner, for instance by constructing partitions that capture the local variation of the prediction importance of different features, by solving the clustering step and the MKL optimization problem jointly, or by automatically learning the graph Laplacian using appropriate matrix regularization. Another research direction is to directly integrate the MKL step into the SVM solver, as pioneered by Sonnenburg et al. (2006).
We expect that such an implementation would lead to a speed-up by up to one to two orders of magnitude. We will also investigate extensions to other learning settings (Kloft et al., 2009; Storcheus et al., 2015) and further applications (Kloft and Laskov, 2007; Nakajima et al., 2009; Binder et al., 2012; Kloft and Laskov, 2012; Kloft et al., 2014).

**Acknowledgments**

This work was partly funded by the German Research Foundation (DFG) award KL 2698/2-1.

## Appendix A. Proofs on subproblems in Algorithm 1

### A.1 Proof of Proposition 2

**Proof of Proposition 2** The Lagrangian of the partial optimization problem w.r.t. $w_j^{(m)}$ and $b$ is

$$L := \sum_{j\in N_l}\sum_{m\in N_M}\frac{\|w_j^{(m)}\|_2^2}{2\beta_{jm}} + \frac{\mu}{2}\sum_{m\in N_M}\|w^{(m)}\|_{\Sigma^{-1}}^2 + C\sum_{i\in N_n}\xi_i + \sum_{j\in N_l}\sum_{i\in S_j}\alpha_i\Big(1 - \xi_i - y_i\sum_{m\in N_M}\langle w_j^{(m)}, \phi_m(x_i)\rangle - y_i b\Big) - \sum_{i\in N_n} v_i\xi_i, \qquad (A.1)$$

where $\alpha_i\ge 0$ and $v_i\ge 0$ are the Lagrange multipliers of the constraints. Setting the gradient of the Lagrangian w.r.t. the primal variables to zero, we get

$$\frac{\partial L}{\partial w_j^{(m)}} = 0 \;\Rightarrow\; \frac{w_j^{(m)}}{\beta_{jm}} + \mu\sum_{\tilde j\in N_l}\Sigma^{-1}_{j\tilde j}\, w_{\tilde j}^{(m)} - \sum_{i\in S_j} y_i\alpha_i\phi_m(x_i) = 0, \qquad (A.2)$$
$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i\in N_n}\alpha_i y_i = 0, \qquad (A.3)$$
$$\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; C = \alpha_i + v_i, \quad \forall i\in N_n. \qquad (A.4)$$

Eq. (A.2) implies that, for every $m\in N_M$,

$$\sum_{j\in N_l}\frac{\|w_j^{(m)}\|_2^2}{\beta_{jm}} + \mu\|w^{(m)}\|_{\Sigma^{-1}}^2 = \sum_{j\in N_l}\sum_{i\in S_j}\alpha_i y_i\langle w_j^{(m)}, \phi_m(x_i)\rangle, \qquad (A.5)$$
$$w_j^{(m)} = \sum_{\tilde j\in N_l} q^{(\beta)}_{mj\tilde j}\sum_{i\in S_{\tilde j}} y_i\alpha_i\phi_m(x_i), \qquad \forall j, m. \qquad (A.6)$$

Plugging Eqs. (A.3) and (A.4) into Eq. (A.1), the Lagrangian simplifies as follows:

$$L = \sum_{i\in N_n}\alpha_i + \sum_{j\in N_l}\sum_{m\in N_M}\frac{\|w_j^{(m)}\|_2^2}{2\beta_{jm}} + \frac{\mu}{2}\sum_{m\in N_M}\|w^{(m)}\|_{\Sigma^{-1}}^2 - \sum_{m\in N_M}\sum_{j\in N_l}\sum_{i\in S_j}\alpha_i y_i\langle w_j^{(m)}, \phi_m(x_i)\rangle$$
$$\overset{(A.5)}{=} \sum_{i\in N_n}\alpha_i - \frac12\sum_{m\in N_M}\sum_{j\in N_l}\Big\langle w_j^{(m)},\ \sum_{i\in S_j}\alpha_i y_i\phi_m(x_i)\Big\rangle$$
$$\overset{(A.6)}{=} \sum_{i\in N_n}\alpha_i - \frac12\sum_{m\in N_M}\sum_{j,\tilde j\in N_l} q^{(\beta)}_{mj\tilde j}\sum_{i\in S_j,\ \tilde i\in S_{\tilde j}} y_i y_{\tilde i}\,\alpha_i\alpha_{\tilde i}\,\langle\phi_m(x_i), \phi_m(x_{\tilde i})\rangle$$
$$= \sum_{i\in N_n}\alpha_i - \frac12\sum_{m\in N_M}\sum_{i,\tilde i\in N_n} y_i y_{\tilde i}\,\alpha_i\alpha_{\tilde i}\, q^{(\beta)}_{m\tau(i)\tau(\tilde i)}\, k_m(x_i, x_{\tilde i})$$
$$\overset{(3)}{=} \sum_{i\in N_n}\alpha_i - \frac12\sum_{i,\tilde i\in N_n} y_i y_{\tilde i}\,\alpha_i\alpha_{\tilde i}\,\tilde k(x_i, x_{\tilde i}).$$

The proof is complete if we note the constraints established in Eqs. (A.3) and (A.4). ∎
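The representation (A.6) can be sanity-checked numerically: with $Q = [\mathrm{diag}(\beta^{-1}) + \mu\Sigma^{-1}]^{-1}$ and $s_j = \sum_{i\in S_j} y_i\alpha_i\phi_m(x_i)$, the vector $w = Qs$ satisfies the stationarity condition (A.2). A sketch with a one-dimensional stand-in for the feature space (all concrete values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
l, mu = 5, 0.7
beta = rng.uniform(0.2, 1.0, size=l)        # beta_{jm} for one fixed kernel m
A = rng.normal(size=(l, l))
Sigma_inv = A @ A.T + np.eye(l)             # positive definite coupling matrix
s = rng.normal(size=l)                      # stand-in for sum_{i in S_j} y_i a_i phi_m(x_i)

# Eq. (A.6): w_j = sum_{jt} q_{m j jt} s_{jt}, i.e. w = Q s
Q = np.linalg.inv(np.diag(1.0 / beta) + mu * Sigma_inv)
w = Q @ s

# stationarity (A.2): w_j / beta_j + mu * (Sigma^{-1} w)_j - s_j = 0
residual = w / beta + mu * Sigma_inv @ w - s
assert np.abs(residual).max() < 1e-10
```

The check is exact up to floating-point error, since $Q$ is by definition the inverse of the linear operator appearing in (A.2).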
A.2 Proof of Proposition 3

Proposition 3 in the main text gives a closed-form solution for updating the kernel weights; a detailed proof is given in this appendix. Our discussion is largely based on the following lemma by Micchelli and Pontil (2005).

Lemma A.1 (Micchelli and Pontil, 2005, Lemma 26) Let $a_i\ge 0$, $i\in N_d$, and $1\le r<\infty$. Then

$$\min_{\eta:\,\eta_i\ge 0,\,\sum_{i\in N_d}\eta_i^r\le 1}\;\sum_{i\in N_d}\frac{a_i}{\eta_i} = \Big(\sum_{i\in N_d} a_i^{\frac{r}{r+1}}\Big)^{\frac{r+1}{r}}$$

and the minimum is attained at $\eta_i = a_i^{\frac{1}{r+1}}\big(\sum_{k\in N_d} a_k^{\frac{r}{r+1}}\big)^{-\frac{1}{r}}$.

We are now ready to prove Proposition 3.

Proof of Proposition 3 Fixing the variables $w_j^{(m)}$ and $b$, the optimization problem (2) reduces to

$$\min_\beta\;\frac{1}{2}\sum_{j\in N_l,\,m\in N_M}\beta_{jm}^{-1}\|w_j^{(m)}\|_2^2\quad\text{s.t.}\quad\sum_{m\in N_M}\beta_{jm}^p\le 1,\;\forall j\in N_l,\qquad \beta_{jm}\ge 0,\;\forall j\in N_l,\, m\in N_M.$$

This problem decomposes into $l$ independent subproblems, one for each locality. The subproblem at the $j$-th locality is

$$\min_\beta\;\frac{1}{2}\sum_{m\in N_M}\beta_{jm}^{-1}\|w_j^{(m)}\|_2^2\quad\text{s.t.}\quad\sum_{m\in N_M}\beta_{jm}^p\le 1,\qquad\beta_{jm}\ge 0,\;\forall m\in N_M.$$

Applying Lemma A.1 with $a_m=\|w_j^{(m)}\|_2^2$, $\eta_m=\beta_{jm}$ and $r=p$ completes the proof.

Appendix B. Completely dualized problems

Proposition 2 gives a partial dual of the primal optimization problem (2). Alternatively, we derive here a complete dual problem, removing the dependency on the kernel weights $\beta_{jm}$. This completes the analysis of the primal problem and can potentially be exploited (in future work) to assess the duality gap of computed solutions or to derive an alternative optimization strategy (cf. Sun et al., 2010). We always assume that $\Sigma$ is positive definite in this section. We consider a general loss function to give a unifying viewpoint; our analysis is based on the notions of the Fenchel-Legendre conjugate (Boyd and Vandenberghe, 2004) and the infimal convolution (Rockafellar, 1997).

B.1 Lemmata used for complete dualization

For a function $h$, we denote by $h^*(x) = \sup_\mu\,[x^\top\mu - h(\mu)]$ its Fenchel-Legendre conjugate.
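The conjugate just defined can be exercised on a concrete function. The following sketch, on made-up data, numerically checks the standard dual-norm fact (used repeatedly below) that the conjugate of the scaled squared $\ell_1$-norm $h(\mu)=\frac{1}{2}\|\mu\|_1^2$ is the scaled squared $\ell_\infty$-norm:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
x = rng.normal(size=d)

def h(mu):
    """h(mu) = 0.5 * ||mu||_1^2"""
    return 0.5 * np.sum(np.abs(mu)) ** 2

# claimed conjugate value: h*(x) = 0.5 * ||x||_inf^2 (dual norm of l1 is l_inf)
claimed = 0.5 * np.max(np.abs(x)) ** 2

# a random search for sup_mu [<x, mu> - h(mu)] never exceeds the claimed value ...
best = max(x @ mu - h(mu) for mu in rng.normal(scale=3.0, size=(5000, d)))
assert best <= claimed + 1e-9

# ... and the supremum is attained at mu* = ||x||_inf * sign(x_j) e_j, j = argmax |x_j|
j = np.argmax(np.abs(x))
mu_star = np.zeros(d)
mu_star[j] = np.sign(x[j]) * np.max(np.abs(x))
assert np.isclose(x @ mu_star - h(mu_star), claimed)
```

The same pattern (squared norm conjugates to squared dual norm) is what Eq. (B.1) below states in general.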
The infimal convolution (short: inf-convolution) of two functions $f$ and $g$ is defined by

$$(f\oplus g)(x) := \inf_y\,[f(x-y)+g(y)].$$

Lemma B.1 gives a relationship between the Fenchel-Legendre conjugate and the inf-convolution.

Lemma B.1 (Rockafellar, 1997) For any two functions $f_1, f_2$, we have $(f_1+f_2)^*(x) = (f_1^*\oplus f_2^*)(x)$. Moreover, if $f$ has a decomposable structure in the sense that $f(x_1,x_2)=f_1(x_1)+f_2(x_2)$, i.e., $f_1$ and $f_2$ are functions of uncorrelated variables, then $(f_1+f_2)^*(x) = (f_1^*+f_2^*)(x)$.

For any norm $\|\cdot\|$, we denote by $\|\cdot\|_*$ its dual norm, defined by $\|x\|_* = \sup_{\|\mu\|=1}\langle x,\mu\rangle$. The Fenchel-Legendre conjugate of a squared norm takes the following form (Rockafellar, 1997):

$$\Big(\frac{1}{2}\|\cdot\|^2\Big)^* = \frac{1}{2}\|\cdot\|_*^2.\qquad(\text{B.1})$$

Lemma B.2 establishes the dual norm for a $\Sigma$-norm. The result is well known if $H$ is the one-dimensional Euclidean space.

Lemma B.2 Let $H$ be a Hilbert space and $\Sigma$ an $l\times l$ positive definite matrix. The dual norm of the $\Sigma$-norm defined by $\|(w_1,\ldots,w_l)\|_\Sigma = \big(\sum_{j,\tilde j\in N_l}\Sigma_{j\tilde j}\langle w_j, w_{\tilde j}\rangle\big)^{1/2}$ is the $\Sigma^{-1}$-norm.

Proof For any two elements $w=(w_1,\ldots,w_l)$, $v=(v_1,\ldots,v_l)\in \underbrace{H\times\cdots\times H}_{l}$, we first establish the following inequality:

$$\langle (v_1,\ldots,v_l), (w_1,\ldots,w_l)\rangle \le \|(v_1,\ldots,v_l)\|_{\Sigma^{-1}}\,\|(w_1,\ldots,w_l)\|_\Sigma.\qquad(\text{B.2})$$

Let $\mu_1,\ldots,\mu_l\in\mathbb{R}^l$ be the eigenvectors of $\Sigma$ with $\lambda_1,\ldots,\lambda_l$ the corresponding eigenvalues. By the spectral decomposition, we have

$$\Sigma = \sum_{k\in N_l}\lambda_k\mu_k\mu_k^\top,\qquad \Sigma^{-1} = \sum_{k\in N_l}\lambda_k^{-1}\mu_k\mu_k^\top,$$

from which we know

$$\|w\|_\Sigma^2 = \sum_{k\in N_l}\lambda_k\Big\langle\sum_{j\in N_l}\mu_{kj}w_j,\; \sum_{j\in N_l}\mu_{kj}w_j\Big\rangle.$$

Therefore, by the Cauchy-Schwarz inequality (applied twice),

$$\begin{aligned}
\|(w_1,\ldots,w_l)\|_\Sigma\,\|(v_1,\ldots,v_l)\|_{\Sigma^{-1}}
&= \Big(\sum_{k\in N_l}\lambda_k\Big\|\sum_{j\in N_l}\mu_{kj}w_j\Big\|_2^2\Big)^{1/2}\Big(\sum_{k\in N_l}\lambda_k^{-1}\Big\|\sum_{j\in N_l}\mu_{kj}v_j\Big\|_2^2\Big)^{1/2}\\
&\ge \sum_{k\in N_l}\Big\|\sum_{j\in N_l}\mu_{kj}w_j\Big\|_2\,\Big\|\sum_{j\in N_l}\mu_{kj}v_j\Big\|_2\\
&\ge \sum_{k\in N_l}\Big\langle\sum_{j\in N_l}\mu_{kj}w_j,\; \sum_{j\in N_l}\mu_{kj}v_j\Big\rangle\\
&= \sum_{j,\tilde j\in N_l}\langle w_j, v_{\tilde j}\rangle\sum_{k\in N_l}\mu_{kj}\mu_{k\tilde j}.
\end{aligned}\qquad(\text{B.3})$$

Since $\sum_{k\in N_l}\mu_k\mu_k^\top$ is the identity matrix, we know that $\sum_{k\in N_l}\mu_{kj}\mu_{k\tilde j} = \delta_{j\tilde j}$. Plugging this identity into the above inequality yields Eq. (B.2).

Next, we need to show that for any $w=(w_1,\ldots,w_l)$ there exists a $v=(v_1,\ldots,v_l)$ for which Eq. (B.2) holds as an equality. Introduce the invertible matrix $B = \big(\frac{1}{\lambda_1}\mu_1,\ldots,\frac{1}{\lambda_l}\mu_l\big)^\top$ and denote by $B^{-1}$ its inverse. Then we have

$$\frac{1}{\lambda_k}\sum_{j\in N_l}\mu_{kj}B^{-1}_{j\tilde k} = \delta_{k\tilde k}.\qquad(\text{B.4})$$

Introduce

$$v_k := \sum_{j\in N_l}B^{-1}_{kj}\sum_{\tilde j\in N_l}\mu_{j\tilde j}\,w_{\tilde j},\quad\forall k\in N_l.$$

Then it follows from Eq. (B.4) that

$$\frac{1}{\lambda_k}\sum_{j\in N_l}\mu_{kj}v_j = \frac{1}{\lambda_k}\sum_{j\in N_l}\mu_{kj}\sum_{\tilde k\in N_l}B^{-1}_{j\tilde k}\sum_{\tilde j\in N_l}\mu_{\tilde k\tilde j}\,w_{\tilde j} \overset{(\text{B.4})}{=} \sum_{\tilde k\in N_l}\delta_{k\tilde k}\sum_{\tilde j\in N_l}\mu_{\tilde k\tilde j}\,w_{\tilde j} = \sum_{\tilde j\in N_l}\mu_{k\tilde j}\,w_{\tilde j}.$$

For any $w, v$ satisfying the above relation, we have

$$\lambda_k^2\Big\|\sum_{j\in N_l}\mu_{kj}w_j\Big\|_2^2 = \Big\|\sum_{j\in N_l}\mu_{kj}v_j\Big\|_2^2,\qquad \Big\|\sum_{j\in N_l}\mu_{kj}w_j\Big\|_2\,\Big\|\sum_{j\in N_l}\mu_{kj}v_j\Big\|_2 = \Big\langle\sum_{j\in N_l}\mu_{kj}w_j,\;\sum_{j\in N_l}\mu_{kj}v_j\Big\rangle,$$

and therefore both inequalities in (B.3) hold as equalities. The proof is complete.

B.2 Proofs on complete dualization problems

The convex localized MKL model given in (2) can be extended to a general convex loss function:

$$\begin{aligned}
\min_{w,t_i,\beta,b}\;&\sum_{j\in N_l,\,m\in N_M}\frac{\|w_j^{(m)}\|_2^2}{2\beta_{jm}} + \frac{\mu}{2}\sum_{m\in N_M}\|w^{(m)}\|_{\Sigma^{-1}}^2 + C\sum_{i\in N_n}\ell(t_i,y_i)\\
\text{s.t.}\;&\sum_{m\in N_M}\beta_{jm}^p\le 1,\;\forall j\in N_l,\qquad \beta_{jm}\ge 0,\;\forall j\in N_l,\, m\in N_M,\\
&\sum_{m\in N_M}\langle w_j^{(m)},\phi_m(x_i)\rangle + b = t_i,\;\forall i\in S_j,\, j\in N_l.
\end{aligned}\qquad(\text{B.5})$$

Here $\ell(t_i,y_i)$ is a general loss function measuring the error incurred from using $t_i$ to predict $y_i$. The following problem gives the complete dual of the above convex localized MKL.

Problem B.3 (Completely dualized problem for general loss functions) Let $\ell(t,y):\mathbb{R}\times Y\to\mathbb{R}$ be convex w.r.t. $t$ for any $y$. Assume that $\Sigma^{-1}$ is positive definite. Then the complete dual problem of the formulation (B.5) is

$$\sup_{\sum_{i\in N_n}\alpha_i=0}\Bigg\{-C\sum_{i\in N_n}\ell^*\Big(-\frac{\alpha_i}{C},y_i\Big) - \bigg[\frac{1}{2}\sum_{j\in N_l}\Big(\sum_{m\in N_M}\Big\|\sum_{i\in S_j}\alpha_i\phi_m(x_i)\Big\|_2^{\frac{2p}{p-1}}\Big)^{\frac{p-1}{p}} \oplus \frac{1}{2\mu}\sum_{m\in N_M}\Big\|\Big(\sum_{i\in S_j}\alpha_i\phi_m(x_i)\Big)_{j\in N_l}\Big\|_\Sigma^2\bigg]\Bigg\}.$$

Proof Using Proposition 3 to determine the optimal $\beta_{jm}$, the problem (B.5) is equivalent to

$$\begin{aligned}
\inf_{w,b,t_i}\;&\frac{1}{2}\sum_{j\in N_l}\Big(\sum_{m\in N_M}\|w_j^{(m)}\|_2^{\frac{2p}{p+1}}\Big)^{\frac{p+1}{p}} + \frac{\mu}{2}\sum_{m\in N_M}\|w^{(m)}\|_{\Sigma^{-1}}^2 + C\sum_{i\in N_n}\ell(t_i,y_i)\\
\text{s.t.}\;&\sum_{m\in N_M}\langle w_j^{(m)},\phi_m(x_i)\rangle + b = t_i,\;\forall i\in S_j,\, j\in N_l.
\end{aligned}$$

Introducing a multiplier $\alpha_i$ for each equality constraint, the Lagrangian saddle point problem is

$$\sup_\alpha\inf_{w,b,t}\;\frac{1}{2}\sum_{j\in N_l}\Big(\sum_{m\in N_M}\|w_j^{(m)}\|_2^{\frac{2p}{p+1}}\Big)^{\frac{p+1}{p}} + \frac{\mu}{2}\sum_{m\in N_M}\|w^{(m)}\|_{\Sigma^{-1}}^2 + C\sum_{i\in N_n}\ell(t_i,y_i) - \sum_{j\in N_l}\sum_{i\in S_j}\alpha_i\Big(\sum_{m\in N_M}\langle w_j^{(m)},\phi_m(x_i)\rangle + b - t_i\Big).$$

The infimum over $b$ enforces $\sum_{i\in N_n}\alpha_i=0$; the infimum over $t_i$ yields $-C\,\ell^*(-\frac{\alpha_i}{C},y_i)$; and the infimum over $w$ yields $-(f_1+f_2)^*(\theta)$, where $f_1(w) := \frac{1}{2}\sum_{j\in N_l}\big(\sum_{m\in N_M}\|w_j^{(m)}\|_2^{\frac{2p}{p+1}}\big)^{\frac{p+1}{p}}$, $f_2(w) := \frac{\mu}{2}\sum_{m\in N_M}\|w^{(m)}\|_{\Sigma^{-1}}^2$ and $\theta := \big(\sum_{i\in S_j}\alpha_i\phi_m(x_i)\big)_{j\in N_l,\,m\in N_M}$. Hence, by the definition of the conjugate,

$$\overset{\text{Def.}}{=}\;\sup_{\sum_{i\in N_n}\alpha_i=0}\;-C\sum_{i\in N_n}\ell^*\Big(-\frac{\alpha_i}{C},y_i\Big) - (f_1+f_2)^*(\theta) \;\overset{\text{Lem. B.1}}{=}\; \sup_{\sum_{i\in N_n}\alpha_i=0}\;-C\sum_{i\in N_n}\ell^*\Big(-\frac{\alpha_i}{C},y_i\Big) - (f_1^*\oplus f_2^*)(\theta).$$

By the decomposability part of Lemma B.1 (over $j$ for $f_1$ and over $m$ for $f_2$) together with Eq. (B.1),

$$f_1^*(\theta) = \frac{1}{2}\sum_{j\in N_l}\Big(\sum_{m\in N_M}\|\theta_j^{(m)}\|_2^{\frac{2p}{p-1}}\Big)^{\frac{p-1}{p}},\qquad f_2^*(\theta) = \frac{1}{2\mu}\sum_{m\in N_M}\big\|(\theta_j^{(m)})_{j\in N_l}\big\|_\Sigma^2,$$

which yields the claimed dual problem. In the last step of the above deduction, we used the fact that the $\Sigma^{-1}$-norm and the $\Sigma$-norm, and the block norms with exponents $\frac{2p}{p+1}$ and $\frac{2p}{p-1}$, are dual-norm pairs.

We can now prove the complete dual problem established in Proposition 4 by plugging the Fenchel conjugate of the hinge loss into Problem B.3.
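Lemma B.2 above can be probed numerically in a finite-dimensional instance $H=\mathbb{R}^d$. The sketch below (random data, hypothetical sizes) checks that $\|v\|_{\Sigma^{-1}} = \sup_{\|w\|_\Sigma\le 1}\langle v,w\rangle$: a random search never exceeds the claimed dual-norm value, and the value is attained at $w^*\propto$ the $\Sigma^{-1}$-image of $v$:

```python
import numpy as np

rng = np.random.default_rng(1)
l, d = 4, 3

A = rng.normal(size=(l, l))
Sigma = A @ A.T + l * np.eye(l)          # positive definite coupling matrix
Sigma_inv = np.linalg.inv(Sigma)

def norm_M(W, M):
    """||(w_1,...,w_l)||_M = sqrt(sum_{j,j~} M_{j j~} <w_j, w_{j~}>), rows of W as w_j."""
    return np.sqrt(np.trace(W.T @ M @ W))

V = rng.normal(size=(l, d))
dual_value = norm_M(V, Sigma_inv)        # claimed dual norm of ||.||_Sigma at V

# sup_{||W||_Sigma = 1} <V, W> lower-bounded by random search ...
best = max(
    np.sum(V * (W / norm_M(W, Sigma)))
    for W in rng.normal(size=(2000, l, d))
)
assert best <= dual_value + 1e-9

# ... and attained exactly at W* = Sigma^{-1} V / ||Sigma^{-1} V||_Sigma
W_star = Sigma_inv @ V
attained = np.sum(V * W_star) / norm_M(W_star, Sigma)
assert np.isclose(attained, dual_value)
```

This mirrors the two halves of the proof: inequality (B.2) and the explicit construction of the equality case.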
Proof of Proposition 4 Note that the Fenchel-Legendre conjugate of the hinge loss is $\ell^*(t,y) = \frac{t}{y}$ (as a function of $t$) if $-1\le \frac{t}{y}\le 0$, and $\infty$ otherwise (Rifkin and Lippert, 2007). Recall the identity $(\eta f)^*(x) = \eta f^*(x/\eta)$. Hence, for each $i$, the term $\ell^*(-\frac{\alpha_i}{C}, y_i)$ translates to $-\frac{\alpha_i}{Cy_i}$, provided that $0\le\frac{\alpha_i}{y_i}\le C$. With a variable substitution of the form $\alpha_i^{\text{new}} = \frac{\alpha_i}{y_i}$, the complete dual problem established in Problem B.3 becomes

$$\sup_{\substack{0\le\alpha_i\le C\\ \sum_{i\in N_n}\alpha_i y_i=0}}\;\sum_{i\in N_n}\alpha_i - \bigg[\frac{1}{2}\sum_{j\in N_l}\Big(\sum_{m\in N_M}\Big\|\sum_{i\in S_j}\alpha_i y_i\phi_m(x_i)\Big\|_2^{\frac{2p}{p-1}}\Big)^{\frac{p-1}{p}} \oplus \frac{1}{2\mu}\sum_{m\in N_M}\Big\|\Big(\sum_{i\in S_j}\alpha_i y_i\phi_m(x_i)\Big)_{j\in N_l}\Big\|_\Sigma^2\bigg]$$

$$\overset{\text{Def.}}{=}\;\sup_{\substack{0\le\alpha_i\le C\\ \sum_{i\in N_n}\alpha_i y_i=0}}\;\sup_{\theta_j^{(m)}}\;\sum_{i\in N_n}\alpha_i - \frac{1}{2\mu}\sum_{m\in N_M}\Big\|\big(\theta_j^{(m)}\big)_{j\in N_l}\Big\|_\Sigma^2 - \frac{1}{2}\sum_{j\in N_l}\Big(\sum_{m\in N_M}\Big\|\sum_{i\in S_j}\alpha_i y_i\phi_m(x_i) - \theta_j^{(m)}\Big\|_2^{\frac{2p}{p-1}}\Big)^{\frac{p-1}{p}}.$$

The optimal $\theta_j^{(m)}$ satisfies the following KKT condition:

$$\Big(\sum_{m\in N_M}\Big\|\sum_{i\in S_j}\alpha_i y_i\phi_m(x_i)-\theta_j^{(m)}\Big\|_2^{\frac{2p}{p-1}}\Big)^{-\frac{1}{p}}\;\Big\|\sum_{i\in S_j}\alpha_i y_i\phi_m(x_i)-\theta_j^{(m)}\Big\|_2^{\frac{2}{p-1}}\;\Big(\theta_j^{(m)}-\sum_{i\in S_j}\alpha_i y_i\phi_m(x_i)\Big) = -\frac{1}{\mu}\sum_{\tilde j\in N_l}\Sigma_{j\tilde j}\,\theta_{\tilde j}^{(m)}.$$

Solving the above equation shows that the optimal $\theta_j^{(m)}$ takes the following form:

$$\theta_j^{(m)} = \sum_{i\in N_n}\alpha_i y_i\,\gamma_{mj\tau(i)}\,\phi_m(x_i),\quad\forall j\in N_l,\, m\in N_M.$$

Plugging the above identity back into the Lagrangian saddle point problem, we derive the complete dual problem in the proposition.

Appendix C. Proof of Generalization Error Bounds (Theorem 5)

This section presents the proof of the generalization error bounds provided in Section 5. Our basic tool is the data-dependent complexity measure known as the Rademacher complexity (Bartlett and Mendelson, 2002).

Definition C.1 (Rademacher complexity) For a fixed sample $S=(x_1,\ldots,x_n)$, the empirical Rademacher complexity of a hypothesis space $H$ is defined as

$$\hat R_n(H) := \mathbb{E}_\sigma\sup_{f\in H}\frac{1}{n}\sum_{i\in N_n}\sigma_i f(x_i),$$

where the expectation is taken w.r.t. $\sigma = (\sigma_1,\ldots,\sigma_n)^\top$, with $\sigma_i$, $i\in N_n$, being a sequence of independent uniform $\{\pm 1\}$-valued random variables.
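The expectation in Definition C.1 can be estimated by Monte Carlo whenever the inner supremum is tractable. As a hedged illustration (not the CLMKL class itself), for the unit-ball linear class $H=\{x\mapsto\langle w,x\rangle:\|w\|_2\le 1\}$ the supremum has the closed form $\frac{1}{n}\|\sum_i\sigma_i x_i\|_2$, and the classical bound $\hat R_n(H)\le\frac{1}{n}\sqrt{\sum_i\|x_i\|_2^2}$ (Jensen plus $\mathbb{E}[\sigma_i\sigma_j]=\delta_{ij}$) can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.normal(size=(n, d))            # a fixed sample, as in Definition C.1

# Monte Carlo over sign vectors sigma; the sup over ||w||_2 <= 1 is attained in
# closed form, giving sup_f (1/n) sum_i sigma_i f(x_i) = (1/n) ||sum_i sigma_i x_i||_2
sigmas = rng.choice([-1.0, 1.0], size=(5000, n))
rademacher_hat = np.mean(np.linalg.norm(sigmas @ X, axis=1)) / n

# classical bound for this class
bound = np.sqrt(np.sum(X ** 2)) / n
assert 0.0 < rademacher_hat <= bound
```

For the CLMKL classes $H_{p,\mu,D}$ the supremum is not available in closed form; Theorem C.2 below replaces the Monte Carlo step by explicit upper bounds.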
The following theorem establishes Rademacher complexity bounds for CLMKL machines. Denote $\bar p = \frac{2p}{p+1}$ for any $p\ge 1$ and observe that $\bar p\le 2$, which implies $\bar p^*\ge 2$.

Theorem C.2 (CLMKL Rademacher complexity bounds) If $\Sigma^{-1}$ is positive definite, then the empirical Rademacher complexity of $H_{p,\mu,D}$ can be controlled by

$$\hat R_n(H_{p,\mu,D}) \le \frac{\sqrt D}{n}\inf_{\substack{0\le\theta\le 1\\ 2\le t\le\bar p^*}}\bigg(\theta^2 t\sum_{j\in N_l}\Big\|\Big(\sum_{i\in S_j}k_m(x_i,x_i)\Big)_{m=1}^M\Big\|_{\frac{t}{2}} + \frac{(1-\theta)^2}{\mu}\sum_{m\in N_M}\sum_{j\in N_l}\Sigma_{jj}\sum_{i\in S_j}k_m(x_i,x_i)\bigg)^{1/2}.$$

If, additionally, $k_m(x,x)\le B$ for any $x\in X$ and any $m\in N_M$, then we have

$$\hat R_n(H_{p,\mu,D}) \le \sqrt{\frac{DB}{n}}\;\inf_{\substack{0\le\theta\le 1\\ 2\le t\le\bar p^*}}\bigg(\theta^2 tM^{\frac{2}{t}} + \frac{(1-\theta)^2}{\mu}M\max_{j\in N_l}\Sigma_{jj}\bigg)^{1/2}.$$

Tightness of the bound. It can be checked that the function $x\mapsto xM^{2/x}$ is decreasing on the interval $(0, 2\log M)$ and increasing on $(2\log M,\infty)$. Therefore, under the boundedness assumption $k_m(x,x)\le B$, the Rademacher complexity can be further controlled by

$$\hat R_n(H_{p,\mu,D}) \le \sqrt{\frac{DB}{n}}\times\begin{cases}\min\Big(2e\log M,\;\mu^{-1}M\max_{j\in N_l}\Sigma_{jj}\Big)^{\frac{1}{2}}, & \text{if } p\le\frac{\log M}{\log M-1},\\[4pt] \min\Big(\frac{2p}{p-1}M^{\frac{p-1}{p}},\;\mu^{-1}M\max_{j\in N_l}\Sigma_{jj}\Big)^{\frac{1}{2}}, & \text{otherwise},\end{cases}$$

from which it is clear that our Rademacher complexity bounds enjoy a mild dependence on the number of kernels. The dependence is $O(\log M)$ for $p\le(\log M - 1)^{-1}\log M$ and $O(M^{\frac{p-1}{2p}})$ otherwise. These dependencies recover the best known results for global MKL algorithms in Cortes et al. (2010); Kloft and Blanchard (2011); Kloft et al. (2011).

The proof of Theorem C.2 is based on the following lemmata.

Lemma C.3 (Khintchine-Kahane inequality (Kahane, 1985)) Let $v_1,\ldots,v_n\in H$. Then, for any $q\ge 1$, it holds

$$\bigg(\mathbb{E}_\sigma\Big\|\sum_{i\in N_n}\sigma_i v_i\Big\|_2^q\bigg)^{\frac{2}{q}} \le q\sum_{i\in N_n}\|v_i\|_2^2.$$

Lemma C.4 (Block-structured Hölder inequality (Kloft and Blanchard, 2012)) Let $x=(x^{(1)},\ldots,x^{(n)})$, $y=(y^{(1)},\ldots,y^{(n)})\in H = H_1\times\cdots\times H_n$. Then, for any $p\ge 1$, it holds $\langle x,y\rangle\le\|x\|_{2,p}\,\|y\|_{2,p^*}$.
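For small $n$, the Khintchine-Kahane inequality of Lemma C.3 can be checked exactly by enumerating all $2^n$ sign patterns. The sketch below uses made-up vectors in $\mathbb{R}^d$ and $q=4$:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, d, q = 8, 3, 4.0
V = rng.normal(size=(n, d))            # v_1, ..., v_n in R^d

# exact expectation over all 2^n sign vectors sigma
signs = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
lhs = np.mean(np.linalg.norm(signs @ V, axis=1) ** q) ** (2.0 / q)

# right-hand side of Lemma C.3
rhs = q * np.sum(V ** 2)
assert lhs <= rhs
```

In the proof below the lemma is applied with $q=\bar t^*$, which is where the factor $\bar t^*$ in the bound of Theorem C.2 originates.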
Proof of Theorem C.2 Firstly, for any $\bar t\ge 1$ we can apply the block-structured Hölder inequality to bound $\sum_{i\in N_n}\sigma_i f_w(x_i)$ by

$$\sum_{i\in N_n}\sigma_i f_w(x_i) = \sum_{i\in N_n}\sigma_i\langle w_{\tau(i)},\phi(x_i)\rangle = \sum_{j\in N_l}\Big\langle w_j,\sum_{i\in S_j}\sigma_i\phi(x_i)\Big\rangle \overset{\text{Hölder}}{\le} \sum_{j\in N_l}\|w_j\|_{2,\bar t}\,\Big\|\sum_{i\in S_j}\sigma_i\phi(x_i)\Big\|_{2,\bar t^*}.$$

Alternatively, we can also control $\sum_{i\in N_n}\sigma_i f_w(x_i)$ by

$$\sum_{i\in N_n}\sigma_i f_w(x_i) = \sum_{j\in N_l}\sum_{m\in N_M}\Big\langle w_j^{(m)},\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big\rangle = \sum_{m\in N_M}\Big\langle w^{(m)},\Big(\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big)_{j=1}^l\Big\rangle \le \sum_{m\in N_M}\|w^{(m)}\|_{\Sigma^{-1}}\,\Big\|\Big(\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big)_{j=1}^l\Big\|_\Sigma,$$

where in the last step of the above deduction we used the fact that the $\Sigma$-norm is the dual norm of the $\Sigma^{-1}$-norm (Lemma B.2).

Combining the above two inequalities and using the trivial identity $\sum_{i\in N_n}\sigma_i f_w(x_i) = \theta\sum_{i\in N_n}\sigma_i f_w(x_i) + (1-\theta)\sum_{i\in N_n}\sigma_i f_w(x_i)$, for any $0\le\theta\le 1$ and any $t\ge 1$ we have

$$\begin{aligned}
\mathbb{E}_\sigma\sup_{f_w\in H_{t,\mu,D}}\sum_{i\in N_n}\sigma_i f_w(x_i)
&\le \mathbb{E}_\sigma\sup_{f_w\in H_{t,\mu,D}}\bigg[\theta\sum_{j\in N_l}\|w_j\|_{2,\bar t}\,\Big\|\sum_{i\in S_j}\sigma_i\phi(x_i)\Big\|_{2,\bar t^*} + (1-\theta)\sum_{m\in N_M}\|w^{(m)}\|_{\Sigma^{-1}}\,\Big\|\Big(\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big)_{j=1}^l\Big\|_\Sigma\bigg]\\
&\overset{\text{C.-S.}}{\le} \mathbb{E}_\sigma\sup_{f_w\in H_{t,\mu,D}}\bigg(\sum_{j\in N_l}\|w_j\|_{2,\bar t}^2 + \mu\sum_{m\in N_M}\|w^{(m)}\|_{\Sigma^{-1}}^2\bigg)^{1/2}\bigg(\theta^2\sum_{j\in N_l}\Big\|\sum_{i\in S_j}\sigma_i\phi(x_i)\Big\|_{2,\bar t^*}^2 + \frac{(1-\theta)^2}{\mu}\sum_{m\in N_M}\Big\|\Big(\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big)_{j=1}^l\Big\|_\Sigma^2\bigg)^{1/2}\\
&\overset{\text{Jensen}}{\le} \sqrt D\,\bigg(\mathbb{E}_\sigma\bigg[\theta^2\sum_{j\in N_l}\Big\|\sum_{i\in S_j}\sigma_i\phi(x_i)\Big\|_{2,\bar t^*}^2 + \frac{(1-\theta)^2}{\mu}\sum_{m\in N_M}\Big\|\Big(\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big)_{j=1}^l\Big\|_\Sigma^2\bigg]\bigg)^{1/2}.\qquad(\text{C.1})
\end{aligned}$$

For any $j\in N_l$, the Khintchine-Kahane (K.-K.) inequality and the Jensen inequality (since $\bar t^*\ge 2$) permit us to bound $\mathbb{E}_\sigma\big\|\sum_{i\in S_j}\sigma_i\phi(x_i)\big\|_{2,\bar t^*}^2$ by

$$\mathbb{E}_\sigma\Big\|\sum_{i\in S_j}\sigma_i\phi(x_i)\Big\|_{2,\bar t^*}^2 = \mathbb{E}_\sigma\bigg(\sum_{m\in N_M}\Big\|\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big\|_2^{\bar t^*}\bigg)^{\frac{2}{\bar t^*}} \overset{\text{Jensen}}{\le} \bigg(\sum_{m\in N_M}\mathbb{E}_\sigma\Big\|\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big\|_2^{\bar t^*}\bigg)^{\frac{2}{\bar t^*}} \overset{\text{K.-K.}}{\le} \bar t^*\bigg(\sum_{m\in N_M}\Big(\sum_{i\in S_j}\|\phi_m(x_i)\|_2^2\Big)^{\frac{\bar t^*}{2}}\bigg)^{\frac{2}{\bar t^*}} \overset{\text{Def.}}{=} \bar t^*\,\Big\|\Big(\sum_{i\in S_j}k_m(x_i,x_i)\Big)_{m=1}^M\Big\|_{\frac{\bar t^*}{2}}.$$

For any $m\in N_M$, we also have

$$\mathbb{E}_\sigma\Big\|\Big(\sum_{i\in S_j}\sigma_i\phi_m(x_i)\Big)_{j=1}^l\Big\|_\Sigma^2 = \sum_{j,\tilde j\in N_l}\Sigma_{j\tilde j}\,\mathbb{E}_\sigma\Big\langle\sum_{i\in S_j}\sigma_i\phi_m(x_i),\sum_{\tilde i\in S_{\tilde j}}\sigma_{\tilde i}\phi_m(x_{\tilde i})\Big\rangle = \sum_{j\in N_l}\Sigma_{jj}\sum_{i\in S_j}k_m(x_i,x_i),$$

since the clusters $S_j$ are disjoint and the cross terms vanish in expectation. Plugging the above two inequalities into Eq. (C.1) and noticing the trivial inequality $\|w_j\|_{2,\bar t}\le\|w_j\|_{2,\bar p}$ for all $t\ge p\ge 1$, we get the following bound for any $0\le\theta\le 1$:

$$\hat R_n(H_{p,\mu,D}) \le \inf_{t\ge p}\hat R_n(H_{t,\mu,D}) \le \frac{\sqrt D}{n}\inf_{t\ge p}\bigg(\theta^2\,\bar t^*\sum_{j\in N_l}\Big\|\Big(\sum_{i\in S_j}k_m(x_i,x_i)\Big)_{m=1}^M\Big\|_{\frac{\bar t^*}{2}} + \frac{(1-\theta)^2}{\mu}\sum_{m\in N_M}\sum_{j\in N_l}\Sigma_{jj}\sum_{i\in S_j}k_m(x_i,x_i)\bigg)^{1/2}.$$

The above inequality can be equivalently written as the first inequality of the theorem. The second inequality follows directly from the boundedness assumption and the fact that $\sum_{j\in N_l}\Sigma_{jj}|S_j|\le n\max_{j\in N_l}\Sigma_{jj}$.

Proof of Theorem 5 The proof now simply follows by plugging the bound of Theorem C.2 into Theorem 7 of Bartlett and Mendelson (2002).

Appendix D. Parameter sets for CLMKL on the UIUC Sports event dataset

We have chosen the following pairs of the two parameters $(\mu,\gamma)$: (0.0612, 0.0100), (0.1250, 0.0100), (0.2500, 0.0100), (0.0612, 0.0316), (0.1250, 0.0316), (0.0612, 0.1000), (0.1250, 0.1000), (2.0000, 0.1000), (0.0612, 0.3162), (0.2500, 0.3162), (1.0000, 0.3162), (16.0000, 0.3162), (0.0612, 1.0000), (0.1250, 1.0000), (0.2500, 1.0000), (0.5000, 1.0000), (1.0000, 1.0000), (2.0000, 1.0000), (8.0000, 1.0000).

The parameters selected by 10-fold cross-validation for CLMKL were: $\ell_p$-norm with $p = 1.333$, $\mu = 2.0$, $\gamma = 1.0$.

References

Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the Twenty-First International Conference on Machine Learning, page 6. ACM, 2004.

Peter Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
Asa Ben-Hur, Cheng Soon Ong, Sören Sonnenburg, Bernhard Schölkopf, and Gunnar Rätsch. Support vector machines and kernels for computational biology. PLoS Computational Biology, 4, 2008. URL http://svmcompbio.tuebingen.mpg.de.

Alexander Binder, Shinichi Nakajima, Marius Kloft, Christina Müller, Wojciech Samek, Ulf Brefeld, Klaus-Robert Müller, and Motoaki Kawanabe. Insights from classifying visual concepts with multiple kernel learning. PLoS ONE, 7(8):e38897, 2012.

Alexander Binder, Wojciech Samek, Klaus-Robert Müller, and Motoaki Kawanabe. Enhanced representation and multi-task learning for image annotation. Computer Vision and Image Understanding, 2013.

Liefeng Bo, Xiaofeng Ren, and Dieter Fox. Hierarchical matching pursuit for image classification: Architecture and fast algorithms. Advances in Neural Information Processing Systems (NIPS), 2011.

Stephen Poythress Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, 2004.

Colin Campbell and Yiming Ying. Learning with support vector machines. Synthesis Lectures on Artificial Intelligence and Machine Learning, 5(1):1–95, 2011.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

Corinna Cortes. Invited talk: Can learning kernels help performance? In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 1:1–1:1, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. Video http://videolectures.net/icml09_cortes_clkh/.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.

Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for learning kernels. In Proceedings of the 27th International Conference on Machine Learning, ICML '10, 2010.

Corinna Cortes, Marius Kloft, and Mehryar Mohri. Learning kernels using local Rademacher complexity. In Advances in Neural Information Processing Systems, pages 2760–2768, 2013.

Manuel Fernández Delgado, Eva Cernadas, Senén Barro, and Dinani Gomes Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1):3133–3181, 2014.

Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 551–556. ACM, 2004.

Chris H. Q. Ding and Inna Dubchak. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17(4):349–358, 2001.

Theodoros Evgeniou, Charles A. Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.

P. V. Gehler and S. Nowozin. Infinite kernel learning. In Proceedings of the NIPS 2008 Workshop on Kernel Learning: Automatic Selection of Optimal Kernels, 2008.

Mehmet Gönen and Ethem Alpaydin. Localized multiple kernel learning. In Proceedings of the 25th International Conference on Machine Learning, pages 352–359. ACM, 2008.

Mehmet Gönen and Ethem Alpaydin. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268, July 2011. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1953048.2021071.

Yina Han and Guizhong Liu. Probability-confidence-kernel-based localized multiple kernel learning with norm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42(3):827–837, 2012.

Cijo Jose, Prasoon Goyal, Parv Aggrwal, and Manik Varma. Local deep kernel learning for efficient non-linear SVM prediction. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 486–494, 2013.

Jean-Pierre Kahane. Some Random Series of Functions, volume 5 of Cambridge Studies in Advanced Mathematics, 1985.

Marius Kloft. $\ell_p$-norm multiple kernel learning. PhD thesis, Berlin Institute of Technology, Berlin, Germany, 2011.

Marius Kloft and Gilles Blanchard. The local Rademacher complexity of $\ell_p$-norm multiple kernel learning. In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2438–2446. MIT Press, 2011.

Marius Kloft and Gilles Blanchard. On the convergence rate of $\ell_p$-norm multiple kernel learning. Journal of Machine Learning Research, 13(1):2465–2502, 2012.

Marius Kloft and Pavel Laskov. A poisoning attack against online anomaly detection. NIPS Workshop on Machine Learning in Adversarial Environments for Computer Security, 2007.

Marius Kloft and Pavel Laskov. Security analysis of online centroid anomaly detection. Journal of Machine Learning Research, 13(1):3681–3724, 2012.

Marius Kloft, Shinichi Nakajima, and Ulf Brefeld. Feature selection for density level-sets. In Machine Learning and Knowledge Discovery in Databases, pages 692–704. Springer Berlin Heidelberg, 2009.

Marius Kloft, Ulf Brefeld, Sören Sonnenburg, and Alexander Zien. $\ell_p$-norm multiple kernel learning. Journal of Machine Learning Research, 12:953–997, 2011.

Marius Kloft, Felix Stiehler, Zhilin Zheng, and Niels Pinkwart. Predicting MOOC dropout over weeks using machine learning methods. EMNLP 2014, page 60, 2014.

Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.

Yunwen Lei and Lixin Ding. Refined Rademacher chaos complexity bounds with applications to the multikernel learning problem. Neural Computation, 26(4):739–760, 2014.

Li-Jia Li and Li Fei-Fei. What, where and who? Classifying events by scene and object recognition. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
Bao-Di Liu, Yu-Xiong Wang, Bin Shen, Yu-Jin Zhang, and Martial Hebert. Self-explanatory sparse representation for image classification. In Computer Vision–ECCV 2014, pages 600–616. Springer International Publishing, 2014.

David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004. URL http://dx.doi.org/10.1023/B:VISI.0000029664.99615.94.

Charles A. Micchelli and Massimiliano Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

Shinichi Nakajima, Alexander Binder, Christina Müller, Wojciech Wojcikiewicz, Marius Kloft, Ulf Brefeld, Klaus-Robert Müller, and Motoaki Kawanabe. Multiple kernel learning for object classification. Proceedings of the 12th Workshop on Information-based Induction Sciences, 24, 2009.

Alain Rakotomamonjy, Francis Bach, Stéphane Canu, Yves Grandvalet, et al. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.

Ryan M. Rifkin and Ross A. Lippert. Value regularization and Fenchel duality. Journal of Machine Learning Research, 8:441–479, 2007.

R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1997.

Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

Yan Song, Yan-Tao Zheng, Sheng Tang, Xiangdong Zhou, Yongdong Zhang, Shouxun Lin, and T.-S. Chua. Localized multiple kernel learning for realistic human action recognition in videos. Circuits and Systems for Video Technology, IEEE Transactions on, 21(9):1193–1202, 2011.

Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, and Bernhard Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

Dmitry Storcheus, Mehryar Mohri, and Afshin Rostamizadeh. Foundations of coupled nonlinear dimensionality reduction. arXiv preprint arXiv:1509.08880, 2015.

Zhaonan Sun, Nawanol Ampornpunt, Manik Varma, and Svn Vishwanathan. Multiple kernel learning and the SMO algorithm. In Advances in Neural Information Processing Systems, pages 2361–2369, 2010.

Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1582–1596, 2010. URL http://doi.ieeecomputersociety.org/10.1109/TPAMI.2009.154.

Qiang Wu, Yiming Ying, and Ding-Xuan Zhou. Multi-kernel regularized classifiers. Journal of Complexity, 23(1):108–134, 2007.

Jingjing Yang, Yuanning Li, Yonghong Tian, Lingyu Duan, and Wen Gao. Group-sensitive multiple kernel learning for object categorization. In 2009 IEEE 12th International Conference on Computer Vision, pages 436–443. IEEE, 2009.

Yiming Ying and Colin Campbell. Generalization bounds for learning the kernel. In S. Dasgupta and A. Klivans, editors, Proceedings of the 22nd Annual Conference on Learning Theory, COLT '09, Montreal, Quebec, Canada, 2009.

Jianguo Zhang, Marcin Marszalek, Svetlana Lazebnik, and Cordelia Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2):213–238, 2007. URL http://dx.doi.org/10.1007/s11263-006-9794-4.
