Support Vector Machines and Kernel Based Learning
Johan Suykens
K.U. Leuven, ESAT-SCD/SISTA
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70
Email: [email protected]
http://www.esat.kuleuven.be/scd/
Tutorial ICANN 2007
Porto Portugal, Sept. 2007
Living in a data world
[Diagram: application domains - biomedical, process industry, energy, traffic, multimedia, bio-informatics]
Kernel based learning: interdisciplinary challenges
[Diagram: SVM & kernel methods at the intersection of neural networks, data mining, pattern recognition, machine learning, mathematics, linear algebra, optimization, statistics, signal processing, and systems and control theory]
• Understanding the essential concepts and different facets of problems
• Providing systematic approaches, engineering kernel machines
• Integrative design, bridging the gaps between theory and practice
Part I contents
• neural networks and support vector machines
  feature map and kernels
  primal and dual problem
• classification, regression
  convex problem, robustness, sparseness
• wider use of the kernel trick
  least squares support vector machines as core problems
  kernel principal component analysis
• large scale fixed-size method
  nonlinear modelling
Classical MLPs
[Figure: single neuron with inputs x_1, ..., x_n, weights w_1, ..., w_n, bias b, activation h(·) and output y]
Multilayer Perceptron (MLP) properties:
• Universal approximation of continuous nonlinear functions
• Learning from input-output patterns: off-line/on-line
• Parallel network architecture, multiple inputs and outputs
+ Flexible and widely applicable: feedforward/recurrent networks, supervised/unsupervised learning
− Many local minima, trial and error for determining the number of neurons
Support Vector Machines
[Figure: cost function versus weights - many local minima for the MLP, a single global minimum for the SVM]
• Nonlinear classification and function estimation by convex optimization
with a unique solution and primal-dual interpretations.
• Number of neurons automatically follows from a convex program.
• Learning and generalization in high dimensional input spaces
(coping with the curse of dimensionality).
• Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, kernels from
graphical models, ... ), application-specific kernels (e.g. bioinformatics)
Classifier with maximal margin
• Training set {(x_i, y_i)}_{i=1}^N: inputs x_i ∈ R^n; class labels y_i ∈ {−1, +1}
• Classifier:
  y(x) = sign[w^T ϕ(x) + b]
  with ϕ(·): R^n → R^{n_h} the mapping to a high dimensional feature space
  (which can be infinite dimensional!)
• Maximize the margin for good generalization ability (margin = 2/||w||_2)
  (VC theory: the linear SVM classifier dates back to the sixties)
[Figure: two classes (x, o) separated with maximal margin]
SVM classifier: primal and dual problem
• Primal problem: [Vapnik, 1995]
  min_{w,b,ξ} J(w, ξ) = (1/2) w^T w + c Σ_{i=1}^N ξ_i
  s.t.  y_i [w^T ϕ(x_i) + b] ≥ 1 − ξ_i
        ξ_i ≥ 0, i = 1, ..., N
  Trade-off between margin maximization and tolerating misclassifications
• Conditions for optimality from Lagrangian.
  Express the solution in the Lagrange multipliers.
• Dual problem: QP problem (convex problem)
  max_α Q(α) = −(1/2) Σ_{i,j=1}^N y_i y_j K(x_i, x_j) α_i α_j + Σ_{j=1}^N α_j
  s.t.  Σ_{i=1}^N α_i y_i = 0
        0 ≤ α_i ≤ c, ∀i
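The dual problem above is a standard QP. As a purely illustrative sketch (not part of the tutorial), it could be solved with the CVXOPT package, assuming an RBF kernel; the kernel choice and the values of c and σ are placeholders:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_qp(X, y, c=1.0, sigma=1.0):
    # Dual: max_a  -(1/2) a^T H a + 1^T a   s.t.  y^T a = 0,  0 <= a_i <= c,
    # with H_ij = y_i y_j K(x_i, x_j); an RBF kernel is assumed here.
    N = X.shape[0]
    sq = np.sum(X**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / sigma**2)
    H = np.outer(y, y) * K
    P = matrix(H)                                    # CVXOPT minimizes (1/2) a^T P a + q^T a
    q = matrix(-np.ones(N))
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))   # encodes 0 <= a_i <= c
    h = matrix(np.hstack([np.zeros(N), c * np.ones(N)]))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()                # Lagrange multipliers alpha_i
```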
Obtaining solution via Lagrangian
• Lagrangian:
  L(w, b, ξ; α, ν) = J(w, ξ) − Σ_{i=1}^N α_i {y_i [w^T ϕ(x_i) + b] − 1 + ξ_i} − Σ_{i=1}^N ν_i ξ_i
• Find saddle point: max_{α,ν} min_{w,b,ξ} L(w, b, ξ; α, ν), one obtains
  ∂L/∂w = 0  →  w = Σ_{i=1}^N α_i y_i ϕ(x_i)
  ∂L/∂b = 0  →  Σ_{i=1}^N α_i y_i = 0
  ∂L/∂ξ_i = 0  →  0 ≤ α_i ≤ c, i = 1, ..., N
Finally, write the solution in terms of α (Lagrange multipliers).
SVM classifier model representations
• Classifier: primal representation
  y(x) = sign[w^T ϕ(x) + b]
  Kernel trick (Mercer theorem): K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j)
• Dual representation:
  y(x) = sign[Σ_i α_i y_i K(x, x_i) + b]
• Some possible kernels K(·, ·):
  K(x, x_i) = x_i^T x (linear)
  K(x, x_i) = (x_i^T x + τ)^d (polynomial)
  K(x, x_i) = exp(−||x − x_i||_2^2 / σ^2) (RBF)
  K(x, x_i) = tanh(κ x_i^T x + θ) (MLP)
• Sparseness property (many α_i = 0)
[Figure: nonlinear decision boundary of the SVM classifier in the (x(1), x(2)) plane]
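As a small illustration (not from the tutorial), the kernels listed above can be evaluated on data matrices in a few lines of numpy; the default parameter values are arbitrary:

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def poly_kernel(X, Z, tau=1.0, d=3):
    return (X @ Z.T + tau) ** d

def rbf_kernel(X, Z, sigma=1.0):
    # ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x^T z, evaluated for all pairs at once
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / sigma**2)

def mlp_kernel(X, Z, kappa=1.0, theta=0.0):
    # tanh "MLP" kernel; it is not positive definite for every (kappa, theta)
    return np.tanh(kappa * X @ Z.T + theta)
```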
SVMs: living in two worlds ...
Primal space:
  y(x) = sign[w^T ϕ(x) + b]
  [Network: inputs x through ϕ_1(x), ..., ϕ_{n_h}(x) with weights w_1, ..., w_{n_h} to output y(x)]
Dual space:
  y(x) = sign[Σ_{i=1}^{#sv} α_i y_i K(x, x_i) + b]
  [Network: kernel units K(x, x_1), ..., K(x, x_{#sv}) with weights α_1, ..., α_{#sv} to output y(x)]
K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j) (“kernel trick”)
[Figure: two-class data in the input space versus the feature space]
Reproducing Kernel Hilbert Space (RKHS) view
• Variational problem: [Wahba, 1990; Poggio & Girosi, 1990; Evgeniou et al., 2000]
find function f such that
  min_{f∈H} (1/N) Σ_{i=1}^N L(y_i, f(x_i)) + λ ||f||_K^2
with L(·, ·) the loss function; ||f||_K is the norm in the RKHS H defined by K.
• Representer theorem: for a convex loss function, the solution is of the form
  f(x) = Σ_{i=1}^N α_i K(x, x_i)
  Reproducing property: f(x) = ⟨f, K_x⟩_K with K_x(·) = K(x, ·)
• Some special cases:
  L(y, f(x)) = (y − f(x))^2: regularization network
  L(y, f(x)) = |y − f(x)|_ε: SVM regression with ε-insensitive loss function
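For the squared-loss case (the regularization network), the representer theorem reduces the variational problem to a single linear system in α; a minimal numpy sketch, assuming an RBF kernel and arbitrary values of λ and σ:

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / sigma**2)

def regularization_network_fit(X, y, lam=1e-2, sigma=1.0):
    # (1/N)||y - K alpha||^2 + lam * alpha^T K alpha  =>  (K + N*lam*I) alpha = y
    N = X.shape[0]
    return np.linalg.solve(rbf_kernel(X, X, sigma) + N * lam * np.eye(N), y)

def regularization_network_predict(Xtest, Xtrain, alpha, sigma=1.0):
    # f(x) = sum_i alpha_i K(x, x_i)
    return rbf_kernel(Xtest, Xtrain, sigma) @ alpha
```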
Different views on kernel machines
[Diagram: complementary views on kernel machines - SVM, LS-SVM, Kriging, RKHS, Gaussian processes]
Some early history on RKHS:
  1910-1920: Moore
  1940: Aronszajn
  1951: Krige
  1970: Parzen
  1971: Kimeldorf & Wahba
Obtaining complementary insights from different perspectives: kernels are used in different methodologies
  Support vector machines (SVM): optimization approach (primal/dual)
  Reproducing kernel Hilbert spaces (RKHS): variational problem, functional analysis
  Gaussian processes (GP): probabilistic/Bayesian approach
Wider use of the kernel trick
• Angle between vectors (e.g. correlation analysis):
  Input space:   cos θ_{x,z} = x^T z / (||x||_2 ||z||_2)
  Feature space: cos θ_{ϕ(x),ϕ(z)} = ϕ(x)^T ϕ(z) / (||ϕ(x)||_2 ||ϕ(z)||_2) = K(x, z) / √(K(x, x) K(z, z))
• Distance between vectors (e.g. for “kernelized” clustering methods):
  Input space:   ||x − z||_2^2 = (x − z)^T (x − z) = x^T x + z^T z − 2 x^T z
  Feature space: ||ϕ(x) − ϕ(z)||_2^2 = K(x, x) + K(z, z) − 2 K(x, z)
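Both quantities require only kernel evaluations; a tiny sketch for two individual points (RBF kernel as an assumed example):

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    # any positive definite kernel can be plugged in; RBF is just an example
    return np.exp(-np.sum((x - z) ** 2) / sigma**2)

def feature_space_cosine(x, z, k=rbf):
    # cos of the angle between phi(x) and phi(z): K(x,z) / sqrt(K(x,x) K(z,z))
    return k(x, z) / np.sqrt(k(x, x) * k(z, z))

def feature_space_distance_sq(x, z, k=rbf):
    # ||phi(x) - phi(z)||^2 = K(x,x) + K(z,z) - 2 K(x,z)
    return k(x, x) + k(z, z) - 2.0 * k(x, z)
```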
Least Squares Support Vector Machines: “core problems”
• Regression (RR)
  min_{w,b,e} w^T w + γ Σ_i e_i^2   s.t.  y_i = w^T ϕ(x_i) + b + e_i, ∀i
• Classification (FDA)
  min_{w,b,e} w^T w + γ Σ_i e_i^2   s.t.  y_i (w^T ϕ(x_i) + b) = 1 − e_i, ∀i
• Principal component analysis (PCA)
  min_{w,b,e} −w^T w + γ Σ_i e_i^2   s.t.  e_i = w^T ϕ(x_i) + b, ∀i
• Canonical correlation analysis/partial least squares (CCA/PLS)
  min_{w,v,b,d,e,r} w^T w + v^T v + ν_1 Σ_i e_i^2 + ν_2 Σ_i r_i^2 − γ Σ_i e_i r_i
  s.t.  e_i = w^T ϕ_1(x_i) + b,  r_i = v^T ϕ_2(y_i) + d
• partially linear models, spectral clustering, subspace algorithms, ...
LS-SVM classifier
• Preserve support vector machine methodology, but simplify via least
squares and equality constraints [Suykens, 1999]
• Primal problem:
  min_{w,b,e} J(w, e) = (1/2) w^T w + (γ/2) Σ_{i=1}^N e_i^2   s.t.  y_i [w^T ϕ(x_i) + b] = 1 − e_i, ∀i
• Dual problem:
  [ 0    y^T     ] [ b ]   [ 0   ]
  [ y    Ω + I/γ ] [ α ] = [ 1_N ]
  where Ω_ij = y_i y_j ϕ(x_i)^T ϕ(x_j) = y_i y_j K(x_i, x_j) and y = [y_1; ...; y_N].
• LS-SVM classifiers perform very well on 20 UCI data sets [Van Gestel et al., ML 2004]
  Winning results in the WCCI 2006 competition by [Cawley, 2006]
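Training the LS-SVM classifier therefore amounts to solving one linear system; a minimal numpy sketch (RBF kernel assumed, γ and σ not tuned):

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / sigma**2)

def lssvm_classifier_fit(X, y, gamma=1.0, sigma=1.0):
    # Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1_N]
    N = X.shape[0]
    Omega = np.outer(y, y) * rbf(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.ones(N))))
    return sol[0], sol[1:]                      # b, alpha

def lssvm_classifier_predict(Xtest, Xtrain, y, alpha, b, sigma=1.0):
    # y(x) = sign( sum_i alpha_i y_i K(x, x_i) + b )
    return np.sign(rbf(Xtest, Xtrain, sigma) @ (alpha * y) + b)
```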
Obtaining solution from Lagrangian
• Lagrangian:
  L(w, b, e; α) = J(w, e) − Σ_{i=1}^N α_i {y_i [w^T ϕ(x_i) + b] − 1 + e_i}
  with Lagrange multipliers α_i (support values).
• Conditions for optimality:
  ∂L/∂w = 0   →  w = Σ_{i=1}^N α_i y_i ϕ(x_i)
  ∂L/∂b = 0   →  Σ_{i=1}^N α_i y_i = 0
  ∂L/∂e_i = 0  →  α_i = γ e_i,  i = 1, ..., N
  ∂L/∂α_i = 0  →  y_i [w^T ϕ(x_i) + b] − 1 + e_i = 0,  i = 1, ..., N
Eliminate w, e and write the solution in α, b.
Microarray data analysis
FDA / LS-SVM classifier (linear, RBF) / kernel PCA + FDA
(unsupervised or supervised selection of PCs)
Use regularization for linear classifiers
Systematic benchmarking study in [Pochet et al., Bioinformatics 2004]
Webservice: http://www.esat.kuleuven.ac.be/MACBETH/
Efficient computational methods for feature selection by rank-one updates [Ojeda et al., 2007]
Weighted versions and robustness
[Diagram: a convex cost function is handled by convex optimization, giving the SVM solution; via robust statistics this corresponds to a weighted LS-SVM, i.e. a weighted version with a modified cost function, giving the LS-SVM solution]
• Weighted LS-SVM:
  min_{w,b,e} (1/2) w^T w + (γ/2) Σ_{i=1}^N v_i e_i^2   s.t.  y_i = w^T ϕ(x_i) + b + e_i, ∀i
  with v_i determined from {e_i}_{i=1}^N of the unweighted LS-SVM [Suykens et al., 2002].
  Robustness and stability of reweighted kernel based regression [Debruyne et al., 2006].
• SVM solution by applying iteratively weighted LS [Perez-Cruz et al., 2005]
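A rough sketch of the reweighting idea for LS-SVM regression. The Huber-type weight function, the robust scale estimate and all constants below are my own illustrative choices, not necessarily the scheme of [Suykens et al., 2002]:

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / sigma**2)

def lssvm_regression(X, y, gamma=10.0, sigma=1.0, v=None):
    # Dual system: [[0, 1^T], [1, Omega + diag(1/(gamma*v_i))]] [b; alpha] = [0; y]
    N = X.shape[0]
    v = np.ones(N) if v is None else v
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf(X, X, sigma) + np.diag(1.0 / (gamma * v))
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                                     # b, alpha

def weighted_lssvm_regression(X, y, gamma=10.0, sigma=1.0, cut=2.5):
    b, alpha = lssvm_regression(X, y, gamma, sigma)            # unweighted pass
    e = alpha / gamma                                          # residuals, since alpha_i = gamma * e_i
    s = np.median(np.abs(e - np.median(e))) / 0.6745 + 1e-12   # robust scale estimate
    v = np.minimum(1.0, cut / np.abs(e / s))                   # downweight large residuals
    return lssvm_regression(X, y, gamma, sigma, v)             # reweighted pass
```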
Kernel principal component analysis (KPCA)
[Figure: toy data set - linear PCA versus kernel PCA (RBF kernel)]
Kernel PCA [Schölkopf et al., 1998]:
take eigenvalue decomposition of the kernel matrix
  [ K(x_1, x_1)  ...  K(x_1, x_N) ]
  [      ...     ...      ...     ]
  [ K(x_N, x_1)  ...  K(x_N, x_N) ]
(applications in dimensionality reduction and denoising)
Where is the regularization?
Kernel PCA: primal and dual problem
• Underlying primal problem with regularization term [Suykens et al., 2003]
• Primal problem:
  min_{w,b,e} −(1/2) w^T w + (γ/2) Σ_{i=1}^N e_i^2   s.t.  e_i = w^T ϕ(x_i) + b, i = 1, ..., N
  (or alternatively min_{w,b,e} (1/2) w^T w − (1/(2γ)) Σ_{i=1}^N e_i^2)
• Dual problem = kernel PCA:
  Ω_c α = λ α with λ = 1/γ
  with Ω_{c,ij} = (ϕ(x_i) − µ̂_ϕ)^T (ϕ(x_j) − µ̂_ϕ) the centered kernel matrix.
• Score variables (allowing also out-of-sample extensions):
  z(x) = w^T (ϕ(x) − µ̂_ϕ)
       = Σ_j α_j ( K(x_j, x) − (1/N) Σ_r K(x_r, x) − (1/N) Σ_r K(x_r, x_j) + (1/N^2) Σ_r Σ_s K(x_r, x_s) )
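A compact numpy sketch of the dual computation and of the out-of-sample score variable above (RBF kernel assumed; normalization of the eigenvectors is omitted for brevity):

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / sigma**2)

def kernel_pca(X, n_components=2, sigma=1.0):
    N = X.shape[0]
    K = rbf(X, X, sigma)
    C = np.eye(N) - np.ones((N, N)) / N            # centering matrix
    lam, V = np.linalg.eigh(C @ K @ C)             # Omega_c alpha = lambda alpha
    lam, V = lam[::-1], V[:, ::-1]                 # largest eigenvalues first
    return lam[:n_components], V[:, :n_components], K

def out_of_sample_scores(Xtest, Xtrain, alpha, Ktrain, sigma=1.0):
    # z(x) = sum_j alpha_j * (kernel of x with x_j, centered w.r.t. the training set)
    k = rbf(Xtest, Xtrain, sigma)
    kc = (k - k.mean(axis=1, keepdims=True)
            - Ktrain.mean(axis=0, keepdims=True)
            + Ktrain.mean())
    return kc @ alpha
```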
Primal versus dual problems
Example 1: microarray data (10,000 genes & 50 training data)
Classifier model:
  sign(w^T x + b) (primal)
  sign(Σ_i α_i y_i x_i^T x + b) (dual)
primal: w ∈ R^{10,000} (only 50 training data!)
dual: α ∈ R^{50}
Example 2: data mining problem (1,000,000 training data & 20 inputs)
primal: w ∈ R^{20}
dual: α ∈ R^{1,000,000} (kernel matrix: 1,000,000 × 1,000,000!)
Fixed-size LS-SVM: primal-dual kernel machines
[Diagram: primal and dual space linked through the Nyström method, kernel PCA, density estimation, entropy criteria, eigenfunctions, SV selection and regression]
Link: Nyström approximation (GP) - kernel PCA - density estimation
[Girolami, 2002; Williams & Seeger, 2001]
Modelling in view of primal-dual representations
[Suykens et al., 2002]: primal space estimation, sparse, large scale
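A much-simplified sketch of the fixed-size pipeline (my own illustration: the subset of support vectors is taken as given instead of being selected with the entropy criterion, and the primal model is a plain ridge regression on the Nyström features, with the bias regularized as well):

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / sigma**2)

def nystrom_features(X, Xsub, sigma=1.0):
    # Approximate feature map from the kernel eigendecomposition on the subset
    lam, U = np.linalg.eigh(rbf(Xsub, Xsub, sigma))
    keep = lam > 1e-10
    return rbf(X, Xsub, sigma) @ U[:, keep] / np.sqrt(lam[keep])

def fixed_size_lssvm_fit(X, y, Xsub, gamma=1.0, sigma=1.0):
    # Estimate the model in the primal on the approximate feature map
    Phi = nystrom_features(X, Xsub, sigma)
    Phi = np.hstack([Phi, np.ones((Phi.shape[0], 1))])      # append bias column
    return np.linalg.solve(Phi.T @ Phi + np.eye(Phi.shape[1]) / gamma, Phi.T @ y)

def fixed_size_lssvm_predict(Xtest, Xsub, wb, sigma=1.0):
    Phi = nystrom_features(Xtest, Xsub, sigma)
    return np.hstack([Phi, np.ones((Phi.shape[0], 1))]) @ wb
```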
Fixed-size LS-SVM: toy examples
[Figures: 1D regression and 2D classification toy examples solved with fixed-size LS-SVM]
Sparse representations with estimation in primal space
Large scale problems: Fixed-size LS-SVM
Estimate in primal (approximate feature map from KPCA on subset)
Santa Fe laser data
[Figure: laser time series y_k and iterated predictions ŷ_k versus discrete time k]
Training: ŷ_{k+1} = f(y_k, y_{k−1}, ..., y_{k−p})
Iterative prediction: ŷ_{k+1} = f(ŷ_k, ŷ_{k−1}, ..., ŷ_{k−p})
(works well for p large, e.g. p = 50) [Espinoza et al., 2003]
Partially linear models: nonlinear system identification
[Figure: Silver box input and output signals over the discrete time index, with the residuals of the estimated models on the training, validation and test parts]
Silver box benchmark study (physical system with cubic nonlinearity):
(top-right) full black-box, (bottom-right) partially linear
Related application: power load forecasting [Espinoza et al., 2005]
Part II contents
• generalizations to KPCA
  weighted kernel PCA
  spectral clustering
  kernel canonical correlation analysis
• model selection
  structure detection
  kernel design
  semi-supervised learning
  incorporation of constraints
• kernel maps with reference point
  dimensionality reduction and data visualization
Core models + additional constraints
• Monotonicity constraints: [Pelckmans et al., 2005]
  min_{w,b,e} w^T w + γ Σ_{i=1}^N e_i^2
  s.t.  y_i = w^T ϕ(x_i) + b + e_i,  (i = 1, ..., N)
        w^T ϕ(x_i) ≤ w^T ϕ(x_{i+1}),  (i = 1, ..., N − 1)
• Structure detection: [Pelckmans et al., 2005; Tibshirani, 1996]
  min_{w,e,t} ρ Σ_{p=1}^P t_p + Σ_{p=1}^P w^{(p)T} w^{(p)} + γ Σ_{i=1}^N e_i^2
  s.t.  y_i = Σ_{p=1}^P w^{(p)T} ϕ^{(p)}(x_i^{(p)}) + e_i,  (∀i)
        −t_p ≤ w^{(p)T} ϕ^{(p)}(x_i^{(p)}) ≤ t_p,  (∀i, ∀p)
• Autocorrelated errors: [Espinoza et al., 2006]
  min_{w,b,r,e} w^T w + γ Σ_{i=1}^N r_i^2
  s.t.  y_i = w^T ϕ(x_i) + b + e_i,  (i = 1, ..., N)
        e_i = ρ e_{i−1} + r_i,  (i = 2, ..., N)
• Spectral clustering: [Alzate & Suykens, 2006; Chung, 1997; Shi & Malik, 2000]
  min_{w,b,e} −w^T w + γ e^T D^{−1} e   s.t.  e_i = w^T ϕ(x_i) + b,  (i = 1, ..., N)
Generalizations to Kernel PCA: other loss functions
• Consider general loss function L (L2 case = KPCA):
  min_{w,b,e} −(1/2) w^T w + (γ/2) Σ_{i=1}^N L(e_i)   s.t.  e_i = w^T ϕ(x_i) + b, i = 1, ..., N.
  Generalizations of KPCA that lead to robustness and sparseness, e.g.
  Vapnik ε-insensitive loss, Huber loss function [Alzate & Suykens, 2006].
• Weighted least squares versions and incorporation of constraints:
  min_{w,b,e} −(1/2) w^T w + (γ/2) Σ_{i=1}^N v_i e_i^2
  s.t.  e_i = w^T ϕ(x_i) + b, i = 1, ..., N
        Σ_{i=1}^N e_i e_i^{(1)} = 0
        ...
        Σ_{i=1}^N e_i e_i^{(i−1)} = 0
  Find the i-th PC w.r.t. i − 1 orthogonality constraints (previous PCs e^{(j)}).
  The solution is given by a generalized eigenvalue problem.
Generalizations to Kernel PCA: robust denoising
Test set images, corrupted with Gaussian noise and outliers
Classical Kernel PCA
Robust method
bottom rows: application of different pre-image algorithms
Robust method: improved results and fewer components needed
Generalizations to Kernel PCA: sparseness
Sparse kernel PCA using ε-insensitive loss [Alzate & Suykens, 2006]
[Figures: top - denoising result; bottom - the different support vectors (in black) per principal component vector PC1, PC2, PC3]
Spectral clustering: weighted KPCA
[Figure: small graph with six nodes - a cut of size 2 versus a cut of size 1 (the minimal cut)]
• Spectral graph clustering [Chung, 1997; Shi & Malik, 2000; Ng et al., 2002]
• Normalized cut problem
  L q = λ D q
  with L = D − W the Laplacian of the graph.
  Cluster membership indicators are given by q.
• Weighted LS-SVM (KPCA) formulation of the normalized cut:
  min_{w,b,e} −(1/2) w^T w + (γ/2) e^T V e   s.t.  e_i = w^T ϕ(x_i) + b, ∀i = 1, ..., N
  with V = D^{−1} the inverse degree matrix [Alzate & Suykens, 2006].
  Allows for out-of-sample extensions on test data.
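A small sketch of the normalized-cut eigenvalue problem for a two-cluster case; the RBF affinities and the sign-based cluster assignment are illustrative choices:

```python
import numpy as np
from scipy.linalg import eigh

def normalized_cut_bipartition(X, sigma=1.0):
    # Affinity matrix W from RBF similarities (no self-loops), degree matrix D
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    W = np.exp(-sq / sigma**2)
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # graph Laplacian
    lam, Q = eigh(L, D)                        # generalized problem L q = lambda D q
    q = Q[:, 1]                                # skip the trivial constant eigenvector
    return (q > 0).astype(int)                 # cluster membership indicators
```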
Application to image segmentation
Given image (240 × 160)
Image segmentation
Large scale image: out-of-sample extension [Alzate & Suykens, 2006]
Kernel Canonical Correlation Analysis
Correlation: min_{w,v} Σ_i ||z_{x_i} − z_{y_i}||_2^2
with score variables z_x = w^T ϕ_1(x), z_y = v^T ϕ_2(y)
[Figure: data in space X and space Y are mapped by the feature maps ϕ_1(·), ϕ_2(·) and correlated in the target spaces]
Applications of kernel CCA [Suykens et al., 2002; Bach & Jordan, 2002] e.g. in:
- bioinformatics (correlation gene network - gene expression profiles) [Vert et al., 2003]
- information retrieval, fMRI [Shawe-Taylor et al., 2004]
- state estimation of dynamical systems, subspace algorithms [Goethals et al., 2005]
LS-SVM formulation to Kernel CCA
• Score variables: z_x = w^T (ϕ_1(x) − µ̂_{ϕ1}), z_y = v^T (ϕ_2(y) − µ̂_{ϕ2})
  Feature maps ϕ_1, ϕ_2; kernels K_1(x_i, x_j) = ϕ_1(x_i)^T ϕ_1(x_j), K_2(y_i, y_j) = ϕ_2(y_i)^T ϕ_2(y_j)
• Primal problem: (kernel PLS case: ν_1 = 0, ν_2 = 0 [Hoegaerts et al., 2004])
  max_{w,v,e,r} γ Σ_{i=1}^N e_i r_i − ν_1 (1/2) Σ_{i=1}^N e_i^2 − ν_2 (1/2) Σ_{i=1}^N r_i^2 − (1/2) w^T w − (1/2) v^T v
  such that e_i = w^T (ϕ_1(x_i) − µ̂_{ϕ1}), r_i = v^T (ϕ_2(y_i) − µ̂_{ϕ2}), ∀i
  with µ̂_{ϕ1} = (1/N) Σ_{i=1}^N ϕ_1(x_i), µ̂_{ϕ2} = (1/N) Σ_{i=1}^N ϕ_2(y_i).
• Dual problem: generalized eigenvalue problem [Suykens et al., 2002]
  [ 0        Ω_{c,2} ] [ α ]       [ ν_1 Ω_{c,1} + I         0         ] [ α ]
  [ Ω_{c,1}    0     ] [ β ]  = λ  [        0         ν_2 Ω_{c,2} + I  ] [ β ] ,   λ = 1/γ
  with Ω_{c,1,ij} = (ϕ_1(x_i) − µ̂_{ϕ1})^T (ϕ_1(x_j) − µ̂_{ϕ1}) and Ω_{c,2,ij} = (ϕ_2(y_i) − µ̂_{ϕ2})^T (ϕ_2(y_j) − µ̂_{ϕ2})
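A direct numerical sketch of this generalized eigenvalue problem, assuming centered RBF kernels on both views (X and Y are N×d arrays; ν_1, ν_2 and σ are placeholders):

```python
import numpy as np
from scipy.linalg import eig

def centered_rbf(X, sigma=1.0):
    N = X.shape[0]
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    K = np.exp(-sq / sigma**2)
    C = np.eye(N) - np.ones((N, N)) / N
    return C @ K @ C

def kernel_cca(X, Y, nu1=1.0, nu2=1.0, sigma=1.0):
    N = X.shape[0]
    O1, O2 = centered_rbf(X, sigma), centered_rbf(Y, sigma)
    Z = np.zeros((N, N))
    A = np.block([[Z, O2], [O1, Z]])
    B = np.block([[nu1 * O1 + np.eye(N), Z], [Z, nu2 * O2 + np.eye(N)]])
    lam, V = eig(A, B)                         # A u = lambda B u
    top = np.argmax(lam.real)                  # most correlated direction
    alpha, beta = V[:N, top].real, V[N:, top].real
    return lam.real[top], alpha, beta
```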
System identification of Hammerstein systems
• Hammerstein system:
  x_{t+1} = A x_t + B f(u_t) + ν_t
  y_t = C x_t + D f(u_t) + v_t
  with E{ [ν_p; v_p] [ν_q^T  v_q^T] } = [Q  S; S^T  R] δ_{pq}.
  [Block diagram: u → static nonlinearity f(·) → linear system [A, B, C, D] → y]
• System identification problem:
  given {(u_t, y_t)}_{t=0}^{N−1}, estimate A, B, C, D, f.
• Subspace algorithms [Goethals et al., IEEE-AC 2005]:
  first estimate the state vector sequence (can also be done by KCCA)
  (for linear systems equivalent to Kalman filtering)
• Related problems: linear non-Gaussian models, links with ICA
  Kernels for linear systems and gait recognition [Bissacco et al., 2007]
Bayesian inference
Level 1 (parameters):        max Posterior = Likelihood × Prior / Evidence
Level 2 (hyperparameters):   max Posterior = Likelihood × Prior / Evidence
Level 3 (model comparison):  max Posterior = Likelihood × Prior / Evidence
Automatic relevance determination (ARD) [MacKay, 1998]: infer the elements of the diagonal
matrix S in K(x_i, x_j) = exp(−(x_i − x_j)^T S (x_i − x_j)), which indicate how relevant the
input variables are (but: many local minima, computationally expensive).
Classification of brain tumors using ARD
Bayesian learning (automatic relevance determination) of most relevant frequencies [Lu, 2005]
Hierarchical Kernel Machines
[Diagram: conceptually - Level 3: model selection; Level 2: sparseness, structure detection; Level 1: LS-SVM substrate. Computationally - the hierarchical kernel machine is handled as one convex optimization]
Hierarchical modelling approach leading to a convex optimization problem
Computationally fusing training, hyperparameter and model selection
Optimization modelling: sparseness, input/structure selection, stability, ...
[Pelckmans et al., ML 2006]
Additive regularization trade-off
• Traditional Tikhonov regularization scheme:
  min_{w,e} w^T w + γ Σ_i e_i^2   s.t.  e_i = y_i − w^T ϕ(x_i), ∀i = 1, ..., N
  Training solution for a fixed value of γ:  (K + I/γ) α = y
  → Selection of γ via validation set: non-convex problem
• Additive regularization trade-off [Pelckmans et al., 2005]:
  min_{w,e} w^T w + Σ_i (e_i − c_i)^2   s.t.  e_i = y_i − w^T ϕ(x_i), ∀i = 1, ..., N
  Training solution for a fixed value of c = [c_1; ...; c_N]:  (K + I) α = y − c
  → Selection of c via validation set: can be a convex problem
• Convex relaxation to Tikhonov regularization [Pelckmans et al., IEEE-TNN 2007]
Sparse models
• SVM classically: sparse solution from QP problem at training level
• Hierarchical kernel machine: fused problem with sparseness obtained at
the validation level [Pelckmans et al., 2005]
[Figure: LS-SVM classification result (RBF kernel, γ = 5.3667, σ^2 = 0.90784) on a two-class toy problem in the (X1, X2) plane]
Additive models and structure detection
• Additive models: ŷ(x) = Σ_{p=1}^P w^{(p)T} ϕ^{(p)}(x^{(p)}) with x^{(p)} the p-th input.
  Kernel K(x_i, x_j) = Σ_{p=1}^P K^{(p)}(x_i^{(p)}, x_j^{(p)}).
• Structure detection [Pelckmans et al., 2005]:
  min_{w,e,t} ρ Σ_{p=1}^P t_p + Σ_{p=1}^P w^{(p)T} w^{(p)} + γ Σ_{i=1}^N e_i^2
  s.t.  y_i = Σ_{p=1}^P w^{(p)T} ϕ^{(p)}(x_i^{(p)}) + e_i,  ∀i = 1, ..., N
        −t_p ≤ w^{(p)T} ϕ^{(p)}(x_i^{(p)}) ≤ t_p,  ∀i = 1, ..., N, ∀p = 1, ..., P
Study how the solution with maximal variation varies for different values of ρ.
[Figure: maximal variation per input versus ρ - the 4 relevant input variables stand out from the 21 irrelevant input variables]
Incorporation of prior knowledge
• Example: LS-SVM regression with a monotonicity constraint
  min_{w,b,e} (1/2) w^T w + (γ/2) Σ_{i=1}^N e_i^2
  s.t.  y_i = w^T ϕ(x_i) + b + e_i,  ∀i = 1, ..., N
        w^T ϕ(x_i) ≤ w^T ϕ(x_{i+1}),  ∀i = 1, ..., N − 1
• Application: estimation of a cdf [Pelckmans et al., 2005]
[Figure: empirical cdf versus true cdf P(X), and the monotone LS-SVM estimate (mLS-SVM) compared with a Chebychev-marker approach]
Equivalent kernels from constraints
Regression with autocorrelated errors:
  min_{w,b,r,e} w^T w + γ Σ_i r_i^2
  s.t.  y_i = w^T ϕ(x_i) + b + e_i  (i = 1, ..., N)
        e_i = ρ e_{i−1} + r_i  (i = 2, ..., N)
leads to
  f̂(x) = Σ_{j=2}^N α_{j−1} K_eq(x_j, x) + b
with “equivalent kernel”
  K_eq(x_j, x_i) = K(x_j, x_i) − ρ K(x_{j−1}, x_i)
where K(x_j, x_i) = ϕ(x_j)^T ϕ(x_i) [Espinoza et al., 2006].
[Diagram: modular definition of the model structure - LS-SVM regression with a partially linear structure, imposed symmetry and autocorrelated residuals]
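The equivalent kernel is a one-line transformation of an ordinary kernel matrix; a tiny sketch (RBF kernel assumed):

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / sigma**2)

def equivalent_kernel(Xtrain, Xeval, rho, sigma=1.0):
    # K_eq(x_j, x) = K(x_j, x) - rho * K(x_{j-1}, x), defined for j = 2, ..., N
    K = rbf(Xtrain, Xeval, sigma)
    return K[1:, :] - rho * K[:-1, :]
```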
Application: electric load forecasting
Short-term load forecasting (1-24 hours)
Important for power generation decisions
Hourly load values from substations in Belgian grid
Seasonal/weekly/intra-daily patterns
[Figure: normalized load versus hour - 1-hour ahead (a, b) and 24-hours ahead (c, d) forecasts; fixed-size LS-SVM versus a linear ARX model, each compared with the actual load]
[Espinoza et al., 2007]
Semi-supervised learning
[Figure: toy data in the (x1, x2) plane illustrating the labeled and unlabeled points]
Semi-supervised learning: part labeled and part unlabeled data
Assumptions for semi-supervised learning to work [Chapelle et al., 2006]:
• Smoothness assumption: if two points x1, x2 in a high density region are close, then so are the corresponding outputs y1, y2
• Cluster assumption: points from the same cluster are likely of the same class
• Low density separation: the decision boundary should be in a low density region
• Manifold assumption: the data lie on a low-dimensional manifold
Semi-supervised learning in RKHS
• Learning in RKHS [Belkin & Niyogi, 2004]:
  min_{f∈H} (1/N) Σ_{i=1}^N V(y_i, f(x_i)) + λ ||f||_K^2 + η f^T L f
  with V(·, ·) the loss function, L the Laplacian matrix, ||f||_K the norm in the RKHS H,
  and f = [f(x_1); ...; f(x_{N_l+N_u})] (N_l, N_u the number of labeled and unlabeled data)
• Laplacian term: discretization of the Laplace-Beltrami operator
• Representer theorem: f(x) = Σ_{i=1}^{N_l+N_u} α_i K(x, x_i)
• Least squares solution case: the Laplacian acts on the kernel matrix
• Problem: the true labels of the unlabeled data are assumed to be zero.
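One way to make the least squares case concrete is the sketch below, built entirely on my own assumptions: squared loss on the labeled points only, the representer expansion f = Kα substituted directly, an RBF similarity graph for the Laplacian, and a simple λ, η scaling; it is not claimed to match the reference's exact formulation:

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
    return np.exp(-sq / sigma**2)

def laplacian_rls(X, y, labeled, lam=1e-2, eta=1e-2, sigma=1.0):
    # X: all labeled + unlabeled inputs; labeled: boolean mask; y is treated
    # as zero on the unlabeled points (they are masked out of the loss).
    N = X.shape[0]
    K = rbf(X, X, sigma)
    W = rbf(X, X, sigma)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W              # graph Laplacian
    J = np.diag(labeled.astype(float))          # selects the labeled residuals
    # Stationarity of (1/N)||J(y - K a)||^2 + lam a^T K a + eta (K a)^T L (K a)
    A = J @ K + N * lam * np.eye(N) + N * eta * L @ K
    alpha = np.linalg.solve(A, J @ np.where(labeled, y, 0.0))
    return alpha, lambda Xt: rbf(Xt, X, sigma) @ alpha
```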
Formulation by adding constraints
• Semi-supervised LS-SVM model [Luts et al., 2007]:
  min_{w,e,b,ŷ} (1/2) w^T w + (γ/2) Σ_{i=1}^N e_i^2 + (η/2) Σ_{i,j=1}^N v_{ij} (ŷ_i − ŷ_j)^2
  s.t.  ŷ_i = w^T ϕ(x_i) + b,  i = 1, ..., N
        ŷ_i = ν_i y_i − e_i,  ν_i ∈ {0, 1},  i = 1, ..., N
  where ν_i = 0 for unlabeled data, ν_i = 1 for labeled data.
• MRI image: healthy tissue versus tumor classification [Luts et al., 2007]
[Figure: nosologic images of the classified tissue]
[etumour FP6-2002-lifescihealth503094, healthagents FP6-2005-IST027213]
Learning combination of kernels
• Take the combination K = Σ_{i=1}^m µ_i K_i (µ_i ≥ 0), e.g. for data fusion.
  Learn the µ_i as a convex problem [Lanckriet et al., JMLR 2004]
• The QP problem of the SVM:
  max_α 2 α^T 1 − α^T diag(y) K diag(y) α   s.t.  0 ≤ α ≤ C, α^T y = 0
  is replaced by
  min_{µ_i} max_α 2 α^T 1 − α^T diag(y) (Σ_{i=1}^m µ_i K_i) diag(y) α
  s.t.  0 ≤ α ≤ C,  α^T y = 0,  trace(Σ_{i=1}^m µ_i K_i) = c,  Σ_{i=1}^m µ_i K_i ⪰ 0.
  Can be solved as a semidefinite program (SDP problem) [Boyd & Vandenberghe, 2004]
  (LMI constraint for a positive definite kernel)
Kernel design
- Probability product kernel:
  K(p_1, p_2) = ∫ p_1(x)^ρ p_2(x)^ρ dx
- Prior knowledge incorporation
  [Diagram: Bayesian network over A, B, C, D, E with P(A,B,C,D,E) = P(A|B) P(B) P(C|B) P(D|C) P(E|B)]
Kernels from graphical models, Bayesian networks, HMMs
Kernels tailored to data types (DNA sequence, text, chemoinformatics)
[Tsuda et al., Bioinformatics 2002; Jebara et al., JMLR 2004; Ralaivola et al., 2005]
Dimensionality reduction and data visualization
• Traditionally:
commonly used techniques are e.g. principal component analysis, multidimensional scaling, self-organizing maps
• More recently:
isomap, locally linear embedding, Hessian locally linear embedding,
diffusion maps, Laplacian eigenmaps
(“kernel eigenmap methods and manifold learning”)
[Roweis & Saul, 2000; Coifman et al., 2005; Belkin et al., 2006]
• Relevant issues:
- learning and generalization [Cucker & Smale, 2002; Poggio et al., 2004]
- model representations and out-of-sample extensions
- convex/non-convex problems, computational complexity [Smale, 1997]
• Kernel maps with reference point (KMref) [Suykens, 2007]:
data visualization and dimensionality reduction by solving linear system
Kernel maps with reference point: problem statement
• Kernel maps with reference point [Suykens, 2007]:
- LS-SVM core part: realize the dimensionality reduction x ↦ z
- reference point q (e.g. first point; sacrificed in the visualization)
• Example: d = 2
  min_{z,w_1,w_2,b_1,b_2,e_{i,1},e_{i,2}}  (ν/2) (z − P_D z)^T (z − P_D z) + (η/2) (w_1^T w_1 + w_2^T w_2) + (1/2) Σ_{i=1}^N (e_{i,1}^2 + e_{i,2}^2)
  such that  c_{1,1}^T z = q_1 + e_{1,1}
             c_{1,2}^T z = q_2 + e_{1,2}
             c_{i,1}^T z = w_1^T ϕ_1(x_i) + b_1 + e_{i,1},  ∀i = 2, ..., N
             c_{i,2}^T z = w_2^T ϕ_2(x_i) + b_2 + e_{i,2},  ∀i = 2, ..., N
  Coordinates in the low dimensional space: z = [z_1; z_2; ...; z_N] ∈ R^{dN}
  Regularization term: (z − P_D z)^T (z − P_D z) = Σ_{i=1}^N ||z_i − Σ_{j=1}^N s_{ij} D z_j||_2^2
  with D a diagonal matrix and s_{ij} = exp(−||x_i − x_j||_2^2 / σ^2).
Kernel maps with reference point: solution
• The unique solution to the problem is given by the linear system

  [ U                    −V_1 M_1^{−1} 1    −V_2 M_2^{−1} 1 ] [ z   ]   [ η(q_1 c_{1,1} + q_2 c_{1,2}) ]
  [ −1^T M_1^{−1} V_1^T   1^T M_1^{−1} 1          0         ] [ b_1 ] = [ 0                            ]
  [ −1^T M_2^{−1} V_2^T         0          1^T M_2^{−1} 1   ] [ b_2 ]   [ 0                            ]
with matrices
  U = (I − P_D)^T (I − P_D) − γ I + V_1 M_1^{−1} V_1^T + V_2 M_2^{−1} V_2^T + η c_{1,1} c_{1,1}^T + η c_{1,2} c_{1,2}^T
  M_1 = (1/ν) Ω_1 + (1/η) I,  M_2 = (1/ν) Ω_2 + (1/η) I
  V_1 = [c_{2,1} ... c_{N,1}],  V_2 = [c_{2,2} ... c_{N,2}]
kernel matrices Ω_1, Ω_2 ∈ R^{(N−1)×(N−1)}:
  Ω_{1,ij} = K_1(x_i, x_j) = ϕ_1(x_i)^T ϕ_1(x_j),  Ω_{2,ij} = K_2(x_i, x_j) = ϕ_2(x_i)^T ϕ_2(x_j)
positive definite kernel functions K_1(·, ·), K_2(·, ·).
Kernel maps with reference point: model representations
• The primal and dual model representations allow making out-of-sample extensions. Evaluation at a point x_* ∈ R^p:
  ẑ_{*,1} = w_1^T ϕ_1(x_*) + b_1 = (1/ν) Σ_{i=2}^N α_{i,1} K_1(x_i, x_*) + b_1
  ẑ_{*,2} = w_2^T ϕ_2(x_*) + b_2 = (1/ν) Σ_{i=2}^N α_{i,2} K_2(x_i, x_*) + b_2
  Estimated coordinates for visualization: ẑ_* = [ẑ_{*,1}; ẑ_{*,2}].
• α_1, α_2 ∈ R^{N−1} are the unique solutions to the linear systems
  M_1 α_1 = V_1^T z − b_1 1_{N−1}  and  M_2 α_2 = V_2^T z − b_2 1_{N−1}
  with α_1 = [α_{2,1}; ...; α_{N,1}], α_2 = [α_{2,2}; ...; α_{N,2}], 1_{N−1} = [1; 1; ...; 1].
KMref: spiral example
[Figure: 3D spiral data (x_1, x_2, x_3) and its 2D projection (z_1, z_2)]
training data (blue *), validation data (magenta o), test data (red +)
Model selection:  min Σ_{i,j} ( ẑ_i^T ẑ_j / (||ẑ_i||_2 ||ẑ_j||_2) − x_i^T x_j / (||x_i||_2 ||x_j||_2) )^2
KMref: swiss roll example
[Figure: given 3D swiss roll data (left) and the KMref result, a 2D projection (z_1, z_2) (right)]
600 training data, 100 validation data
KMref: visualizing gene distributions
[Figure: KMref 3D projection (z_1, z_2, z_3) of the Alon colon cancer microarray data set]
Dimension input space: 62
Number of genes: 1500 (training: 500, validation: 500, test: 500)
Model selection: σ^2 = 10^4, σ_1^2 = 10^3, σ_2^2 = 0.5 σ_1^2, σ_3^2 = 0.1 σ_1^2,
η = 1, ν = 100, D = diag{10, 5, 1}, q = [+1; −1; −1].
Nonlinear dynamical systems control
[Figure: nonlinear system with angle θ, position x_p and control input u, together with the controlled trajectory]
  min          control objective + LS-SVM objective
  subject to   system dynamics (time k = 1, 2, ..., N)
               LS-SVM controller (time k = 1, 2, ..., N)
Merging optimal control and support vector machine optimization problems:
approximate solutions to optimal control problems [Suykens et al., NN 2001]
Conclusions and future challenges
• Integrative understanding and systematic design for supervised, semi-supervised, unsupervised learning and beyond
• Kernel methods: complementary views (LS-)SVM, RKHS, GP
• Least squares support vector machines as “core problems”:
provides methodology for “optimization modelling”
• Bridging gaps between fundamental theory, algorithms and applications
• Reliable methods: numerically, computationally, statistically
Websites:
http://www.kernel-machines.org/
http://www.esat.kuleuven.be/sista/lssvmlab/
Books
• Boyd S., Vandenberghe L., Convex Optimization, Cambridge University Press, 2004.
• Chapelle O., Schölkopf B., Zien A. (Eds.), Semi-Supervised Learning, MIT Press, 2006.
• Cristianini N., Shawe-Taylor J., An Introduction to Support Vector Machines, Cambridge University Press, 2000.
• Cucker F., Zhou D.-X., Learning Theory: an Approximation Theory Viewpoint, Cambridge University Press, 2007.
• Rasmussen C.E., Williams C.K.I., Gaussian Processes for Machine Learning, MIT Press, 2006.
• Schölkopf B., Smola A., Learning with Kernels, MIT Press, 2002.
• Schölkopf B., Tsuda K., Vert J.P. (Eds.), Kernel Methods in Computational Biology, MIT Press, 2004.
• Shawe-Taylor J., Cristianini N., Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
• Suykens J.A.K., Van Gestel T., De Brabanter J., De Moor B., Vandewalle J., Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
• Suykens J.A.K., Horvath G., Basu S., Micchelli C., Vandewalle J. (Eds.), Advances in Learning Theory: Methods, Models and Applications, vol. 190 NATO-ASI Series III: Computer and Systems Sciences, IOS Press, 2003.
• Vapnik V., Statistical Learning Theory, John Wiley & Sons, 1998.
• Wahba G., Spline Models for Observational Data, Series Appl. Math., 59, SIAM, 1990.