SCCB Lectures
Johan Suykens
K.U. Leuven, ESAT-SCD/SISTA
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70
Email: [email protected]
http://www.esat.kuleuven.be/scd/
SCCB 2006, Modena Italy, Sept. 2006
SCCB 2006 ⋄ Johan Suykens
Main objectives of the lectures
• Lectures organization:
1. Support vector machines and kernel based learning (2 x 90 min.)
2. Case studies (45 min.)
3. Topics in complex networks, synchronization and cooperative
behaviour (90 min.)
• Emphasis on mathematical engineering in multi-disciplinary problems
• Essential concepts
• Providing systematic approaches
• Bridging the gap between theory and practice (coping with fragmentation
in science and different fields)
• Understanding different facets of problems
SCCB 2006 ⋄ Johan Suykens
1
Growth of the “omics”
(From: Human Genome Program, Genomics and its impact on medicine and society, U.S. Department of Energy, 2001)
Genomics (DNAs), Transcriptomics (RNAs and gene expression),
Proteomics (protein expression and interactions), Metabolomics
(metabolic networks), ...
SCCB 2006 ⋄ Johan Suykens
2
Systems biology
• Molecular biology: decompose system into parts (reductionist approach)
• Systems biology: integration of parts into a whole (holistic approach)
High-throughput techniques to study genomes and proteomes:
Microarrays to measure changes in mRNAs
Mass spectrometry to identify proteins, quantify protein levels
SCCB 2006 ⋄ Johan Suykens
3
Integrative biology
SCCB 2006 ⋄ Johan Suykens
4
Systems biology (bioinformatics)
SCCB 2006 ⋄ Johan Suykens
5
DNA double helix
SCCB 2006 ⋄ Johan Suykens
6
Central dogma of molecular biology
(Source: National Human Genome Research Institute)
DNA → RNA → PROTEIN
SCCB 2006 ⋄ Johan Suykens
7
Some genomes numbers
SCCB 2006 ⋄ Johan Suykens
8
cDNA-microarrays
cDNA-microarrays (spotted arrays) (relative measurements, differential hybridization):
glass slides on which cDNA is deposited.
SCCB 2006 ⋄ Johan Suykens
9
Oligonucleotide microarrays
Oligonucleotide microarrays (DNA chips, Affymetrix) (absolute measurements):
produced by the synthesis of oligonucleotides on silicon chips.
SCCB 2006 ⋄ Johan Suykens
10
Gene expression data matrix
SCCB 2006 ⋄ Johan Suykens
11
Proteomics (1)
MALDI-TOF mass spectrometer
Mass Spectrometry: measure the molecular masses of molecules or molecule fragments:
mass analysis of complex organic mixtures, identification of proteins and peptides.
Structural proteomics: X-ray crystallography and NMR spectroscopy (high-throughput
determination of protein structures)
SCCB 2006 ⋄ Johan Suykens
12
Proteomics (2)
(K.U. Leuven Prometa facility www.prometa.kuleuven.be & biomacs.kuleuven.be)
SCCB 2006 ⋄ Johan Suykens
13
Nuclear Magnetic Resonance
brain tumors, multiple sclerosis, Alzheimer's disease, epilepsy, prostate cancer
SCCB 2006 ⋄ Johan Suykens
14
MRI and MRS
SCCB 2006 ⋄ Johan Suykens
15
Support Vector Machines and Kernel Based
Learning
Johan Suykens
K.U. Leuven, ESAT-SCD/SISTA
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70
Email: [email protected]
http://www.esat.kuleuven.be/scd/
SCCB 2006, Modena Italy, Sept. 2006
SCCB 2006 ⋄ Johan Suykens
16
Contents - Part I: Basics
• Motivation
• Basics of support vector machines
• Use of the “kernel trick”
• Kernel-based learning
• Learning and generalization
• Least squares support vector machines
• Primal and dual representations
• Robustness
SCCB 2006 ⋄ Johan Suykens
17
Why support vector machines and kernel methods?
• With new technologies (e.g. in microarrays, proteomics) massive data
sets become available that are high dimensional.
• Tasks and objectives: predictive modelling, knowledge discovery and
integration, data fusion (classification, feature selection, prior knowledge
incorporation, correlation analysis, robustness).
• Supervised, unsupervised or semi-supervised learning depending on the
given data and problem.
• Need for modelling techniques that are able to operate on different data
types (sequences, graphs, numerical, categorical, ...)
• Linear as well as nonlinear models
• Reliable methods: numerically, computationally, statistically
SCCB 2006 ⋄ Johan Suykens
18
Learning: unsupervised, supervised, semi-supervised
[Figure: scatter plots of example data illustrating the unsupervised (all points unlabeled), supervised (all points labeled) and semi-supervised (partially labeled) settings]
Given data can be labeled, unlabeled or partially labeled
Typically: clustering = unsupervised, classification = supervised
SCCB 2006 ⋄ Johan Suykens
19
Classical MLPs
[Figure: single hidden layer perceptron with inputs x1, ..., xn, weights w1, ..., wn, bias b, activation function h(·) and output y]
Multilayer Perceptron (MLP) properties:
• Universal approximation of continuous nonlinear functions
• Learning from input-output patterns; either off-line or on-line learning
• Parallel network architecture, multiple inputs and outputs
Use in feedforward and recurrent networks
Use in supervised and unsupervised learning applications
Problems: Existence of many local minima!
How many neurons needed for a given task?
SCCB 2006 ⋄ Johan Suykens
20
Classically: need for dimensionality reduction
Gene expression matrix (10,000 genes × 50 patients):
an MLP with 3 hidden units would require estimating more than 30,000 weights
→ traditionally: dimensionality reduction is needed first (e.g. PCA)
(the MLP model is not suitable for very high dimensional input vectors)
SCCB 2006 ⋄ Johan Suykens
21
Support Vector Machines
[Figure: MLP cost function with many local minima versus convex SVM cost function, plotted as a function of the weights]
• Nonlinear classification and function estimation by convex optimization
with a unique solution and primal-dual interpretations.
• Number of neurons automatically follows from a convex program.
• Learning and generalization in large dimensional input spaces
(coping with the curse of dimensionality).
• Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, ... ).
Application-specific kernels possible (e.g. textmining, bioinformatics)
SCCB 2006 ⋄ Johan Suykens
22
SVMs: living in two worlds ...
Primal space: (→ large data sets)
Parametric: estimate w ∈ Rnh
y(x) = sign[wT ϕ(x) + b]
Dual space: (→ high dimensional inputs)
Non-parametric: estimate α ∈ RN
y(x) = sign[Σ_{i=1}^{#sv} αi yi K(x, xi) + b]
“Kernel trick”: K(xi, xj) = ϕ(xi)T ϕ(xj)
[Figure: feature map ϕ(·) from input space to feature space; primal network with hidden units ϕ1(x), ..., ϕnh(x) and weights w1, ..., wnh; dual network with units K(x, x1), ..., K(x, x#sv) and weights α1, ..., α#sv]
SCCB 2006 ⋄ Johan Suykens
23
Classifier with maximal margin
• Training set {(xi, yi)}_{i=1}^N: inputs xi ∈ Rn; class labels yi ∈ {−1, +1}
• Classifier:
y(x) = sign[wT ϕ(x) + b]
with ϕ(·) : Rn → Rnh the mapping to a high dimensional feature space
(which can be infinite dimensional!)
• Maximize the margin for good generalization ability (margin = 2/‖w‖2)
(VC theory: the linear SVM classifier dates back to the sixties)
[Figure: two separating hyperplanes for classes x and o; the one with the larger margin is preferred]
SCCB 2006 ⋄ Johan Suykens
24
SVM classifier: primal and dual problem
• Primal problem: [Vapnik, 1995]
min_{w,b,ξ} J(w, ξ) = (1/2) wT w + c Σ_{i=1}^N ξi
s.t. yi[wT ϕ(xi) + b] ≥ 1 − ξi, ξi ≥ 0, i = 1, ..., N
Trade-off between margin maximization and tolerating misclassifications
• Conditions for optimality from Lagrangian.
Express the solution in the Lagrange multipliers.
• Dual problem: QP problem (convex problem)
max_α Q(α) = −(1/2) Σ_{i,j=1}^N yi yj K(xi, xj) αi αj + Σ_{j=1}^N αj
s.t. Σ_{i=1}^N αi yi = 0, 0 ≤ αi ≤ c, ∀i
SCCB 2006 ⋄ Johan Suykens
25
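A minimal numerical sketch (not from the original slides) of the dual QP above, using a general-purpose solver from scipy and an assumed RBF kernel; the parameter names c and sigma are illustrative choices, not values prescribed by the lecture.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X, Z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / sigma^2)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def svm_dual_fit(X, y, c=1.0, sigma=1.0):
    """Solve max Q(alpha) s.t. 0 <= alpha_i <= c, sum_i alpha_i y_i = 0."""
    N = len(y)
    Q = np.outer(y, y) * rbf_kernel(X, X, sigma)
    fun = lambda a: 0.5 * a @ Q @ a - a.sum()      # minimize -Q(alpha)
    jac = lambda a: Q @ a - np.ones(N)
    res = minimize(fun, np.zeros(N), jac=jac, method="SLSQP",
                   bounds=[(0.0, c)] * N,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])
    alpha = res.x
    # bias from a support vector strictly inside the box (0 < alpha_i < c)
    i = int(np.argmax((alpha > 1e-6) & (alpha < c - 1e-6)))
    b = y[i] - (alpha * y) @ rbf_kernel(X, X[i:i+1], sigma).ravel()
    return alpha, b   # classify with y(x) = sign(sum_i alpha_i y_i K(x, x_i) + b)
```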
Obtaining solution via Lagrangian
• Lagrangian:
L(w, b, ξ; α, ν) = J(w, ξ) − Σ_{i=1}^N αi {yi[wT ϕ(xi) + b] − 1 + ξi} − Σ_{i=1}^N νi ξi
• Find saddle point: max_{α,ν} min_{w,b,ξ} L(w, b, ξ; α, ν), one obtains
∂L/∂w = 0 → w = Σ_{i=1}^N αi yi ϕ(xi)
∂L/∂b = 0 → Σ_{i=1}^N αi yi = 0
∂L/∂ξi = 0 → 0 ≤ αi ≤ c, i = 1, ..., N
Finally, write the solution in terms of α (Lagrange multipliers).
SCCB 2006 ⋄ Johan Suykens
26
SVM classifier model representations
• Classifier: primal representation
y(x) = sign[wT ϕ(x) + b]
Kernel trick (Mercer Theorem): K(xi, xj ) = ϕ(xi)T ϕ(xj )
• Dual representation:
y(x) = sign[Σ_i αi yi K(x, xi) + b]
Some possible kernels K(·, ·):
K(x, xi) = xiT x (linear SVM)
K(x, xi) = (xiT x + τ)d (polynomial SVM of degree d)
K(x, xi) = exp(−‖x − xi‖22/σ2) (RBF kernel)
K(x, xi) = tanh(κ xiT x + θ) (MLP kernel)
• Sparseness property (many αi = 0)
SCCB 2006 ⋄ Johan Suykens
27
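As an illustration only (not part of the slides), the kernels listed above and the dual classifier translate directly into a few lines of code; alpha and b are assumed to come from a trained SVM and are not computed here.

```python
import numpy as np

def linear_kernel(x, xi):
    return xi @ x

def poly_kernel(x, xi, tau=1.0, d=3):
    return (xi @ x + tau) ** d

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / sigma**2)

def svm_predict(x, sv_x, sv_y, alpha, b, kernel=rbf_kernel):
    # y(x) = sign(sum_i alpha_i y_i K(x, x_i) + b), sum over support vectors only
    s = sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha, sv_y, sv_x))
    return np.sign(s + b)
```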
SVM: support vectors
[Figure: two binary classification examples with SVM decision boundaries in the (x(1), x(2)) plane; only a subset of the training points (the support vectors) determines the boundary]
• Decision boundary can be expressed in terms of a limited number of
support vectors (subset of given training data); sparseness property
• Classifier follows from the solution to a convex QP problem.
SCCB 2006 ⋄ Johan Suykens
28
Reproducing Kernel Hilbert Space (RKHS) view
• Variational problem: [Wahba, 1990; Poggio & Girosi, 1990; Evgeniou et al., 2000]
find function f such that
min_{f∈H} (1/N) Σ_{i=1}^N L(yi, f(xi)) + λ‖f‖²K
with L(·, ·) the loss function. ‖f‖K is the norm in the RKHS H defined by K.
• Representer theorem: for any convex loss function the solution is of the form
f(x) = Σ_{i=1}^N αi K(x, xi)
• Some special cases:
L(y, f(x)) = (y − f(x))2 : regularization network
L(y, f(x)) = |y − f(x)|ε : SVM regression with ε-insensitive loss function
SCCB 2006 ⋄ Johan Suykens
29
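A small sketch of the squared-loss special case (regularization network): with L(y, f(x)) = (y − f(x))² the representer theorem reduces training to one linear system, written here as (K + λN I)α = y under the assumption of an RBF kernel.

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def fit_regularization_network(X, y, lam=1e-3, sigma=1.0):
    N = len(y)
    K = rbf_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + lam * N * np.eye(N), y)   # (K + lam*N*I) alpha = y
    return alpha

def predict(Xnew, X, alpha, sigma=1.0):
    # f(x) = sum_i alpha_i K(x, x_i)
    return rbf_kernel(Xnew, X, sigma) @ alpha
```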
Different views on kernel machines
[Diagram: complementary views on kernel machines — SVM, LS-SVM, RKHS, Gaussian processes, kriging]
Some early history on RKHS:
1910-1920: Moore
1940: Aronszajn
1951: Krige
1970: Parzen
1971: Kimeldorf & Wahba
Obtaining complementary insights from different perspectives:
kernels are used in different settings (try to get the big picture)
Support vector machines (SVM):
optimization approach (primal/dual)
Reproducing kernel Hilbert spaces (RKHS): variational problem, functional analysis
Gaussian processes (GP):
probabilistic/Bayesian approach
SCCB 2006 ⋄ Johan Suykens
30
Interdisciplinary challenges
neural networks
data mining
linear algebra
pattern recognition
mathematics
SVM & kernel methods
machine learning
optimization
statistics
signal processing
systems and control theory
NATO Advanced Study Institute on Learning Theory and Practice (Leuven, 2002)
http://www.esat.kuleuven.ac.be/sista/natoasi/ltp2002.html
SCCB 2006 ⋄ Johan Suykens
31
Wider use of the kernel trick
[Figure: angle θ between vectors x and z]
• Angle between vectors: (e.g. correlation analysis)
Input space: cos θxz = xT z / (‖x‖2 ‖z‖2)
Feature space: cos θϕ(x),ϕ(z) = ϕ(x)T ϕ(z) / (‖ϕ(x)‖2 ‖ϕ(z)‖2) = K(x, z) / √(K(x, x) K(z, z))
• Distance between vectors: (e.g. for “kernelized” clustering methods)
Input space: ‖x − z‖22 = (x − z)T (x − z) = xT x + zT z − 2 xT z
Feature space: ‖ϕ(x) − ϕ(z)‖22 = K(x, x) + K(z, z) − 2 K(x, z)
SCCB 2006 ⋄ Johan Suykens
32
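The two identities above translate directly into code; this small sketch (an illustration, not from the slides) computes feature-space angles and distances from kernel evaluations only, without forming ϕ explicitly.

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma**2)

def feature_cosine(x, z, K=rbf):
    # cos(theta) = K(x, z) / sqrt(K(x, x) K(z, z))
    return K(x, z) / np.sqrt(K(x, x) * K(z, z))

def feature_sq_distance(x, z, K=rbf):
    # ||phi(x) - phi(z)||^2 = K(x, x) + K(z, z) - 2 K(x, z)
    return K(x, x) + K(z, z) - 2.0 * K(x, z)
```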
Training, validation, test set
• Simplest procedure:
Selection of tuning parameters (regularization constants, kernel tuning
parameters) on a validation set, such that one may hope for a good
generalization on test data.
Train
Test
Validation
• Better: leave-one-out cross-validation, 10-fold cross-validation
SCCB 2006 ⋄ Johan Suykens
33
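A toy sketch of the simplest procedure described above: split the data into training, validation and test parts, pick the tuning parameter with the best validation error, and only then evaluate on the test set. The functions train and error are assumed to be supplied by the user (e.g. an (LS-)SVM fit and its misclassification rate); they are not defined here.

```python
import numpy as np

def select_tuning_parameter(X, y, candidates, train, error, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    n_tr, n_va = int(0.6 * len(y)), int(0.8 * len(y))
    tr, va, te = idx[:n_tr], idx[n_tr:n_va], idx[n_va:]
    # choose the candidate with the smallest validation error
    best = min(candidates, key=lambda p: error(train(X[tr], y[tr], p), X[va], y[va]))
    model = train(X[tr], y[tr], best)
    return best, error(model, X[te], y[te])   # report the test error once, at the end
```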
Important goal: good generalization
• In fact one would like to minimize the generalization error
E[f] = ∫_{X×Y} L(y, f(x)) dP(x, y)
instead of the empirical error (training data) EN[f] = (1/N) Σ_{i=1}^N L(yi, f(xi))
(with loss function L(y, f(x)), i.i.d. samples from a fixed but unknown probability distribution P(x, y) and random variables x ∈ X, y ∈ Y)
• Generalization error bounds, suitable for model selection (if sharp!)
• General message: avoid overfitting by taking a good choice of the model
complexity (depending on the framework: size of hypothesis space, VC
dimension, effective number of parameters, degrees of freedom)
• Occam’s razor principle:
“Entia non sunt multiplicanda praeter necessitatem”
SCCB 2006 ⋄ Johan Suykens
34
Generalization: different mathematical frameworks
• Vapnik et al.:
Predictive learning problem (inductive inference)
Estimating values of functions at given points (transductive inference)
Vapnik V. (1998) Statistical Learning Theory, John Wiley & Sons, New York.
• Poggio et al., Smale:
Estimate true function f with analysis of approximation error and sample
error (e.g. in RKHS space, Sobolev space)
Cucker F., Smale S. (2002) “On the mathematical foundations of learning theory”, Bulletin of the AMS,
39, 1–49.
Poggio T., Rifkin R., Mukherjee S., Niyogi P. (2004) “General conditions for predictivity in learning
theory”, Nature, 428 (6981): 419-422.
Other: Rademacher complexity, stability of learning machines, ...
SCCB 2006 ⋄ Johan Suykens
35
Least Squares Support Vector Machines: “core problems”
• Regression
min_{w,b,e} wT w + γ Σ_i ei2 s.t. yi = wT ϕ(xi) + b + ei, ∀i
• Classification
min_{w,b,e} wT w + γ Σ_i ei2 s.t. yi(wT ϕ(xi) + b) = 1 − ei, ∀i
• Principal component analysis
min_{w,b,e} wT w − γ Σ_i ei2 s.t. ei = wT ϕ(xi) + b, ∀i
• Canonical correlation analysis/partial least squares
min_{w,v,b,d,e,r} wT w + vT v + ν1 Σ_i ei2 + ν2 Σ_i ri2 − γ Σ_i ei ri s.t. ei = wT ϕ1(xi) + b, ri = vT ϕ2(yi) + d
• partially linear models, spectral clustering, subspace algorithms, ...
SCCB 2006 ⋄ Johan Suykens
36
LS-SVM classifier
• Preserve support vector machine methodology, but simplify via least
squares and equality constraints [Suykens, 1999]
• Primal problem:
min_{w,b,e} J(w, e) = (1/2) wT w + (γ/2) Σ_{i=1}^N ei2 s.t. yi[wT ϕ(xi) + b] = 1 − ei, ∀i
• Dual problem: linear system
[ 0    yT       ] [ b ]   [ 0  ]
[ y    Ω + I/γ  ] [ α ] = [ 1N ]
where Ωij = yi yj ϕ(xi)T ϕ(xj) = yi yj K(xi, xj) for i, j = 1, ..., N
and y = [y1; ...; yN].
SCCB 2006 ⋄ Johan Suykens
37
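Because the LS-SVM dual is a single linear system, training takes only a few lines of linear algebra; the sketch below (an illustration under the assumption of an RBF kernel, not code from the lecture) builds the block matrix of the system above and solves for (b, α).

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def lssvm_classifier_fit(X, y, gamma=1.0, sigma=1.0):
    N = len(y)
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma)      # Omega_ij = y_i y_j K(x_i, x_j)
    A = np.block([[np.zeros((1, 1)), y[None, :]],
                  [y[:, None], Omega + np.eye(N) / gamma]])
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                                # b, alpha

def lssvm_classifier_predict(Xnew, X, y, b, alpha, sigma=1.0):
    # y(x) = sign(sum_i alpha_i y_i K(x, x_i) + b)
    return np.sign(rbf_kernel(Xnew, X, sigma) @ (alpha * y) + b)
```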
Obtaining solution from Lagrangian
• Lagrangian:
L(w, b, e; α) = J(w, e) − Σ_{i=1}^N αi {yi[wT ϕ(xi) + b] − 1 + ei}
with Lagrange multipliers αi (support values).
• Conditions for optimality:
∂L/∂w = 0 → w = Σ_{i=1}^N αi yi ϕ(xi)
∂L/∂b = 0 → Σ_{i=1}^N αi yi = 0
∂L/∂ei = 0 → αi = γ ei, i = 1, ..., N
∂L/∂αi = 0 → yi[wT ϕ(xi) + b] − 1 + ei = 0, i = 1, ..., N
Eliminate w, e and write solution in α, b.
SCCB 2006 ⋄ Johan Suykens
38
LS-SVM classifiers benchmarking
• LS-SVM classifiers perform very well on 20 UCI benchmark data sets
(10 binary, 10 multiclass) in comparison with many other methods.
but: be aware of the “No free lunch theorem”
            bal    cmc    ims    iri    led    thy    usp    veh    wav    win
NCV         416    982    1540   100    2000   4800   6000   564    2400   118
Ntest       209    491    770    50     1000   2400   3298   282    1200   60
N           625    1473   2310   150    3000   7200   9298   846    3600   178
nnum        4      2      18     4      0      6      256    18     19     13
ncat        0      7      0      0      7      15     0      0      0      0
n           4      9      18     4      7      21     256    18     19     13
M           3      3      7      3      10     3      10     4      3      3
ny,MOC      2      2      3      2      4      2      4      2      2      2
ny,1vs1     3      3      21     3      45     3      45     6      2      3
[Van Gestel et al., Machine Learning 2004]
• Winning results in competition WCCI 2006 by [Cawley, 2006]
SCCB 2006 ⋄ Johan Suykens
39
Benchmarking SVM & LS-SVM classifiers
              acr        bld        gcr        hea        ion        pid        snr        ttt        wbc        adu          AA    AR    PST
Ntest         230        115        334        90         117        256        70         320        228        12222
n             14         6          20         13         33         8          60         9          9          14
RBF LS-SVM    87.0(2.1)  70.2(4.1)  76.3(1.4)  84.7(4.8)  96.0(2.1)  76.8(1.7)  73.1(4.2)  99.0(0.3)  96.4(1.0)  84.7(0.3)   84.4   3.5   0.727
RBF LS-SVMF   86.4(1.9)  65.1(2.9)  70.8(2.4)  83.2(5.0)  93.4(2.7)  72.9(2.0)  73.6(4.6)  97.9(0.7)  96.8(0.7)  77.6(1.3)   81.8   8.8   0.109
Lin LS-SVM    86.8(2.2)  65.6(3.2)  75.4(2.3)  84.9(4.5)  87.9(2.0)  76.8(1.8)  72.6(3.7)  66.8(3.9)  95.8(1.0)  81.8(0.3)   79.4   7.7   0.109
Lin LS-SVMF   86.5(2.1)  61.8(3.3)  68.6(2.3)  82.8(4.4)  85.0(3.5)  73.1(1.7)  73.3(3.4)  57.6(1.9)  96.9(0.7)  71.3(0.3)   75.7   12.1  0.109
Pol LS-SVM    86.5(2.2)  70.4(3.7)  76.3(1.4)  83.7(3.9)  91.0(2.5)  77.0(1.8)  76.9(4.7)  99.5(0.5)  96.4(0.9)  84.6(0.3)   84.2   4.1   0.727
Pol LS-SVMF   86.6(2.2)  65.3(2.9)  70.3(2.3)  82.4(4.6)  91.7(2.6)  73.0(1.8)  77.3(2.6)  98.1(0.8)  96.9(0.7)  77.9(0.2)   82.0   8.2   0.344
RBF SVM       86.3(1.8)  70.4(3.2)  75.9(1.4)  84.7(4.8)  95.4(1.7)  77.3(2.2)  75.0(6.6)  98.6(0.5)  96.4(1.0)  84.4(0.3)   84.4   4.0   1.000
Lin SVM       86.7(2.4)  67.7(2.6)  75.4(1.7)  83.2(4.2)  87.1(3.4)  77.0(2.4)  74.1(4.2)  66.2(3.6)  96.3(1.0)  83.9(0.2)   79.8   7.5   0.021
LDA           85.9(2.2)  65.4(3.2)  75.9(2.0)  83.9(4.3)  87.1(2.3)  76.7(2.0)  67.9(4.9)  68.0(3.0)  95.6(1.1)  82.2(0.3)   78.9   9.6   0.004
QDA           80.1(1.9)  62.2(3.6)  72.5(1.4)  78.4(4.0)  90.6(2.2)  74.2(3.3)  53.6(7.4)  75.1(4.0)  94.5(0.6)  80.7(0.3)   76.2   12.6  0.002
Logit         86.8(2.4)  66.3(3.1)  76.3(2.1)  82.9(4.0)  86.2(3.5)  77.2(1.8)  68.4(5.2)  68.3(2.9)  96.1(1.0)  83.7(0.2)   79.2   7.8   0.109
C4.5          85.5(2.1)  63.1(3.8)  71.4(2.0)  78.0(4.2)  90.6(2.2)  73.5(3.0)  72.1(2.5)  84.2(1.6)  94.7(1.0)  85.6(0.3)   79.9   10.2  0.021
oneR          85.4(2.1)  56.3(4.4)  66.0(3.0)  71.7(3.6)  83.6(4.8)  71.3(2.7)  62.6(5.5)  70.7(1.5)  91.8(1.4)  80.4(0.3)   74.0   15.5  0.002
IB1           81.1(1.9)  61.3(6.2)  69.3(2.6)  74.3(4.2)  87.2(2.8)  69.6(2.4)  77.7(4.4)  82.3(3.3)  95.3(1.1)  78.9(0.2)   77.7   12.5  0.021
IB10          86.4(1.3)  60.5(4.4)  72.6(1.7)  80.0(4.3)  85.9(2.5)  73.6(2.4)  69.4(4.3)  94.8(2.0)  96.4(1.2)  82.7(0.3)   80.2   10.4  0.039
NBk           81.4(1.9)  63.7(4.5)  74.7(2.1)  83.9(4.5)  92.1(2.5)  75.5(1.7)  71.6(3.5)  71.7(3.1)  97.1(0.9)  84.8(0.2)   79.7   7.3   0.109
NBn           76.9(1.7)  56.0(6.9)  74.6(2.8)  83.8(4.5)  82.8(3.8)  75.1(2.1)  66.6(3.2)  71.7(3.1)  95.5(0.5)  82.7(0.2)   76.6   12.3  0.002
Maj. Rule     56.2(2.0)  56.5(3.1)  69.7(2.3)  56.3(3.8)  64.4(2.9)  66.8(2.1)  54.4(4.7)  66.2(3.6)  66.2(2.4)  75.3(0.3)   63.2   17.1  0.002
[Van Gestel et al., Machine Learning 2004]
SCCB 2006 ⋄ Johan Suykens
40
Classification of brain tumors from MRS data (1)
SCCB 2006 ⋄ Johan Suykens
41
Classification of brain tumors from MRS data (2)
SCCB 2006 ⋄ Johan Suykens
42
Classification of brain tumors from MRS data (3)
SCCB 2006 ⋄ Johan Suykens
43
Classification of brain tumors from MRS data (4)
[Figure: mean magnitude intensity spectra (mean + std) versus frequency (Hz) for class 1 (glioblastomas), class 2 (meningiomas) and class 3 (metastases); labelled resonances: Cho (choline), Cr (creatine), NAc (N-acetyl), Lac (lactate), Ala (alanine), lipids/Lac]
LS-SVM classifiers with lin/pol/rbf kernel [Devos et al., JMR 2004]
SCCB 2006 ⋄ Johan Suykens
44
Classification of brain tumors from MRS data (5)
Classifiers (in order): RBF12, RBF13, RBF23, Lin12 (γ = 1), Lin13 (γ = 1), Lin23 (γ = 1)
etrain ± std(etrain): 0.0800 ± 0.2727, 0.0600 ± 0.2387, 1.6700 ± 1.1106, 1.7900 ± 1.0473, 0 ± 0, 0 ± 0
mean % correct: 99.8621, 99.8966, 96.7255, 96.4902, 100, 100
error ± std: 6.2000 ± 1.3333, 6.1300 ± 1.4679, 15.6400 ± 1.7952, 15.3700 ± 1.8127, 4.0100 ± 1.3219, 4.0000 ± 1.1976
mean % correct: 89.3100, 89.4310, 69.333, 69.8627, 90.452, 90.4762
etest ± std(etest): 2.8500 ± 1.9968, 2.6800 ± 1.6198, 8.1200 ± 1.2814, 7.7900 ± 1.2815, 2.0000 ± 1.1976, 2.0200 ± 1.2632
mean % correct: 90.1724, 90.7586, 67.5200, 68.8400, 90.4762, 90.3810
error ± std: 3.8900 ± 1.8472, 3.6800 ± 1.7746, 7.6800 ± 0.8863, 7.9200 ± 1.0316, 3.4400 ± 1.2253, 2.9600 ± 1.3478
mean % correct: 86.586, 87.3103, 69.280, 68.3200, 83.619, 85.9048
Comparison of LS-SVM classification with LOO using RBF and linear kernel, with additional
bias term correction (N1 = 50, N2 = 37, N3 = 26).
Be careful not to overfit the data (especially with small data sets and nonlinear classifiers)
SCCB 2006 ⋄ Johan Suykens
45
Bayesian inference: classification
[Figure: Ripley data set with LS-SVM decision boundary in the (x(1), x(2)) plane; surface plot of the posterior class probability]
• Probabilistic interpretation with moderated output
• Bias term correction for unbalanced and/or small data sets
• Bayesian approaches to kernel methods: Gaussian processes
SCCB 2006 ⋄ Johan Suykens
46
Classical PCA analysis
• Given zero mean data {xi}_{i=1}^N with xi ∈ Rn
[Figure: scatter of data points with the principal direction]
• Find projected variables wT xi with maximal variance
max_w Var(wT x) = Cov(wT x, wT x) ≃ (1/N) Σ_{i=1}^N (wT xi)2 = wT C w
where C = (1/N) Σ_{i=1}^N xi xiT. Consider the additional constraint wT w = 1.
• Resulting eigenvalue problem:
C w = λ w
with C = CT ≥ 0, obtained from the Lagrangian L(w; λ) = (1/2) wT C w − λ(wT w − 1) and setting ∂L/∂w = 0, ∂L/∂λ = 0.
SCCB 2006 ⋄ Johan Suykens
47
Kernel PCA
[Figure: toy data set projected with linear PCA (left) versus kernel PCA with RBF kernel (right)]
[Schölkopf et al., 1998]
SCCB 2006 ⋄ Johan Suykens
48
Kernel PCA: primal and dual
• Eigenvalue decomposition of the kernel matrix [Schölkopf et al., 1998]
• Primal problem: [Suykens et al., 2003]
max_{w,b,e} J(w, e) = γ (1/2) Σ_{i=1}^N ei2 − (1/2) wT w s.t. ei = wT ϕ(xi) + b, i = 1, ..., N.
• Dual problem = kernel PCA:
Ωc α = λ α with λ = 1/γ
with Ωc,ij = (ϕ(xi) − µ̂ϕ)T (ϕ(xj) − µ̂ϕ) the centered kernel matrix and µ̂ϕ = (1/N) Σ_{i=1}^N ϕ(xi).
• Score variables (allowing also out-of-sample extensions):
z(x) = wT (ϕ(x) − µ̂ϕ) = Σ_j αj ( K(xj, x) − (1/N) Σ_r K(xr, x) − (1/N) Σ_r K(xr, xj) + (1/N2) Σ_r Σ_s K(xr, xs) )
SCCB 2006 ⋄ Johan Suykens
49
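A compact sketch of the dual computation above (an illustration assuming an RBF kernel): center the kernel matrix, take its leading eigenvectors as the α's, and use centered kernel evaluations for the out-of-sample score variables.

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def kernel_pca_fit(X, n_components=2, sigma=1.0):
    N = len(X)
    K = rbf_kernel(X, X, sigma)
    C = np.eye(N) - np.ones((N, N)) / N        # centering matrix
    Kc = C @ K @ C                             # centered kernel matrix Omega_c
    lam, A = np.linalg.eigh(Kc)                # Omega_c alpha = lambda alpha
    A = A[:, ::-1][:, :n_components]           # leading eigenvectors (alpha's)
    return A, K

def kernel_pca_scores(Xnew, X, A, K, sigma=1.0):
    k = rbf_kernel(Xnew, X, sigma)                              # rows of K(x_j, x)
    kc = k - k.mean(1, keepdims=True) - K.mean(0) + K.mean()    # centered evaluations
    return kc @ A                                               # score variables z(x)
```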
Obtaining solution from Lagrangian
• Lagrangian (here case b = 0):
L(w, e; α) = γ (1/2) Σ_{i=1}^N ei2 − (1/2) wT w − Σ_{i=1}^N αi (ei − wT ϕ(xi))
• Conditions for optimality:
∂L/∂w = 0 → w = Σ_{i=1}^N αi ϕ(xi)
∂L/∂ei = 0 → αi = γ ei, i = 1, ..., N
∂L/∂αi = 0 → ei − wT ϕ(xi) = 0, i = 1, ..., N
Eliminate w, e and write solution in α.
SCCB 2006 ⋄ Johan Suykens
50
Microarray data analysis
FDA
LS-SVM classifier (linear, RBF)
Kernel PCA + FDA (unsupervised selection of PCs)
Kernel PCA + FDA (supervised selection of PCs)
Use regularization for linear classifiers
Systematic benchmarking study in [Pochet et al., Bioinformatics 2004]
Webservice: http://www.esat.kuleuven.ac.be/MACBETH/
SCCB 2006 ⋄ Johan Suykens
51
Primal versus dual problems
Example 1: microarray data (10,000 genes & 50 training data)
Classifier model:
sign(wT x + b) (primal)
sign(Σ_i αi yi xiT x + b) (dual)
linear FDA primal: w ∈ R10,000 (only 50 training data!)
linear FDA dual: α ∈ R50
Example 2: datamining problem (1,000,000 training data & 20 inputs)
linear FDA primal: w ∈ R20
linear FDA dual: α ∈ R1,000,000 (kernel matrix: 1,000,000 × 1,000,000!)
SCCB 2006 ⋄ Johan Suykens
52
Fixed-size LS-SVM: primal-dual kernel machines
Primal space
Dual space
Nyström method
Kernel PCA
Density estimate
Entropy criteria
Regression
Eigenfunctions
SV selection
Modelling in view of primal-dual representations
Link Nyström approximation (GP) - kernel PCA - density estimation
[Suykens et al., 2002]: primal space estimation, sparse, large scale
SCCB 2006 ⋄ Johan Suykens
53
Nyström method (Gaussian processes)
[Williams, 2001 Nyström method; Girolami, 2002 KPCA, density estimation]
• “big” matrix: Ω(N,N ) ∈ RN ×N , “small” matrix: Ω(M,M ) ∈ RM ×M
(based on random subsample, in practice often M ≪ N )
• Eigenvalue decompositions: Ω(N,N ) Ũ = Ũ Λ̃ and Ω(M,M ) U = U Λ
• Relation to eigenvalues and eigenfunctions of the integral equation
∫ K(x, x′) φi(x) p(x) dx = λi φi(x′)
with λ̂i = (1/M) λi, φ̂i(xk) = √M uki, φ̂i(x′) = (√M/λi) Σ_{k=1}^M uki K(xk, x′)
SCCB 2006 ⋄ Johan Suykens
54
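A rough sketch of the Nyström approximation above (illustrative, with an assumed RBF kernel): eigendecompose the small M × M kernel matrix on a random subsample and extend its eigenvectors to arbitrary points via φ̂i(x′) = (√M/λi) Σ_k uki K(xk, x′).

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def nystrom(X, M=100, sigma=1.0, seed=0):
    # random subsample ("small" matrix); in practice the tiny trailing
    # eigenvalues should be truncated before dividing by them
    sub = X[np.random.default_rng(seed).choice(len(X), size=M, replace=False)]
    lam, U = np.linalg.eigh(rbf_kernel(sub, sub, sigma))
    lam, U = lam[::-1], U[:, ::-1]                       # descending eigenvalues
    def eigenfunctions(Xnew):
        # phi_hat_i(x') = (sqrt(M) / lam_i) * sum_k u_ki K(x_k, x')
        return np.sqrt(M) * (rbf_kernel(Xnew, sub, sigma) @ U) / lam
    return eigenfunctions, lam / M                       # lambda_hat_i = lam_i / M
```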
Fixed-size LS-SVM: toy examples
[Figure: regression and classification toy problems; the fixed-size LS-SVM estimate is shown together with the selected subset of support vectors]
Sparse representations with estimation in primal space
SCCB 2006 ⋄ Johan Suykens
55
Robustness
[Diagram: outliers in the data raise the question of breakdown; convex cost function → SVM solution → SVM; weighted version with modified cost function → LS-SVM solution → weighted LS-SVM; trade-off between convex optimization and robust statistics]
Robust statistics: Bounded derivative of loss function, bounded kernel
Linear parametric models: do not start from LS
Kernel based regression: starting from LS is allowed under certain conditions
[Suykens et al., 2002; Debruyne et al., 2006]
SCCB 2006 ⋄ Johan Suykens
56
Contents - Part II: Advanced topics
• Spectral graph clustering
• Semi-supervised learning
• Integration of data sources
• Kernels from graphical models
• Kernel canonical correlation analysis
• Sparseness, feature selection, relevance determination
• Prior knowledge incorporation, convex optimization
SCCB 2006 ⋄ Johan Suykens
57
Graph representation
• Graph representing e.g. a set of proteins:
Graph G = (V, E)
V = {x1, x2, ..., xN }: set of vertices (nodes)
E: set of edges
W = [wij ]: affinity matrix with similarity values wij ≥ 0
wij : the more similar xi and xj , the larger the value wij
• Examples: wij ∈ {0, 1}, wij = exp(−‖xi − xj‖22/(2σ2))
[Figure: example graph with six nodes (e.g. w12 = 1, w13 = 1, w14 = 0)]
SCCB 2006 ⋄ Johan Suykens
58
Spectral graph clustering (1)
• Discover two clusters A, B in the graph G with minimal cut:
min_{q∈{−1,+1}^N} (1/2) Σ_{i,j} wij (qi − qj)2
with cluster membership indicator qi = 1 if i ∈ A, qi = −1 if i ∈ B.
[Figure: example graph with a cut of size 2 and a minimal cut of size 1]
SCCB 2006 ⋄ Johan Suykens
59
Spectral graph clustering (2)
• The min-cut spectral clustering problem can be written as
min_{q∈{−1,+1}^N} qT (D − W) q
with degree matrix D = diag(d1, ..., dN) and degrees di = Σ_j wij.
• Relax the combinatorial problem: q T q = 1 instead of q ∈ {−1, +1}N .
This gives the eigenvalue problem Lq̃ = λq̃
with L = D − W the Laplacian of the graph (like kernel PCA on L)
[Shi & Malik, 2000; Ng et al. 2002; Chung, 1997].
• Cluster member indicators: qi = sign(q̃i − θ) with threshold θ.
• Normalized cut: Lq̃ = λDq̃
(avoids isolated points)
• Note: diffusion kernel K = exp(−βL) [Kondor, 2002]
SCCB 2006 ⋄ Johan Suykens
60
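A small sketch of the relaxed problem above (illustration only): build L = D − W, take the relevant eigenvector of Lq̃ = λq̃ (or of Lq̃ = λDq̃ for the normalized cut) and threshold it; a connected graph with strictly positive degrees is assumed.

```python
import numpy as np
from scipy.linalg import eigh

def spectral_bipartition(W, normalized=True, theta=0.0):
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W                                   # graph Laplacian
    lam, Q = eigh(L, D) if normalized else eigh(L)
    q = Q[:, 1]                                 # eigenvector of the second smallest eigenvalue
    return np.sign(q - theta)                   # cluster membership indicators in {-1, +1}
```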
Underlying primal problems
• Min-cut: LS-SVM primal problem for kernel PCA
max_{w,b,e} γ (1/2) eT e − (1/2) wT w such that ei = wT ϕ(xi) + b, ∀i = 1, ..., N
• Normalized: Weighted LS-SVM primal problem
max_{w,b,e} γ (1/2) eT V e − (1/2) wT w such that ei = wT ϕ(xi) + b, ∀i = 1, ..., N
with V = D−1 the inverse degree matrix [Alzate & Suykens, 2006]
• Bias term leads to optimal centering.
• Allows for out-of-sample extensions on test data and evaluation on
validation sets (for tuning parameter selection).
SCCB 2006 ⋄ Johan Suykens
61
Semi-supervised learning
[Figure: partially labeled data set (semi-supervised setting)]
Semi-supervised learning: part labeled and part unlabeled
Assumptions for semi-supervised learning to work:
[Chapelle, Schölkopf, Zien, 2006]
• Smoothness assumption: if two points x1, x2 in a high density region are close, then the corresponding outputs y1, y2 should also be close
• Cluster assumption: points from the same cluster are likely to be of the
same class
• Low density separation: decision boundary should be in low density region
• Manifold assumption: data lie on a low-dimensional manifold
SCCB 2006 ⋄ Johan Suykens
62
Estimation of labels at unlabeled nodes
Functional class prediction on a protein network [Tsuda et al., 2005]
[Figure: protein network with labeled nodes (+1, −1) and an unlabeled node (?)]
• Total number of nodes: N = Nl + Nu:
Nl labeled nodes with given values y1, ..., yNl ∈ {−1, +1}
Nu unlabeled nodes with values yNl+1, ..., yNl+Nu (assumed 0)
• Goal: find estimated values ŷ = [ŷ1; ...; ŷN] from
min_ŷ Σ_{i=1}^{Nl} (ŷi − yi)2 + µ Σ_{i=Nl+1}^{N} ŷi2 + c Σ_{i,j=1}^{N} wij (ŷi − ŷj)2
Solution:
ŷ = (I + cL)−1 y
with L the Laplacian matrix and y = [y1; ...; yN].
SCCB 2006 ⋄ Johan Suykens
63
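The closed-form estimate above is a single linear solve; the sketch below (illustrative, not the authors' code) takes an affinity matrix W and a label vector with zeros at the unlabeled nodes and returns ŷ = (I + cL)−1 y.

```python
import numpy as np

def estimate_labels(W, y_partial, c=1.0):
    # y_partial: +1/-1 for labeled nodes, 0 for unlabeled nodes (as assumed above)
    L = np.diag(W.sum(axis=1)) - W               # graph Laplacian
    y_hat = np.linalg.solve(np.eye(len(y_partial)) + c * L, y_partial)
    return np.sign(y_hat), y_hat                 # predicted classes and smoothed values
```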
Integration of data sources
Different graphs, each containing a part of information [Tsuda et al., 2005]
[Figure: three graphs over the same proteins, each containing labeled nodes (+1, −1) and an unlabeled node (?)]
Consider a linear combination of Laplacians L = Σ_{j=1}^{Ng} βj Lj and solve
min_{ŷ,β} (ŷ − y)T (ŷ − y) + c ŷT L ŷ
SCCB 2006 ⋄ Johan Suykens
64
Semi-supervised learning in RKHS
• Learning in RKHS [Belkin & Niyogi, 2004]:
min_{f∈H} (1/N) Σ_{i=1}^N V(yi, f(xi)) + λ‖f‖²K + η fT L f
with V(·, ·) the loss function, L the Laplacian matrix, ‖f‖K the norm in the RKHS H and f = [f(x1); ...; f(xNl+Nu)] (Nl, Nu number of labeled and unlabeled data)
• Laplacian term: discretization of the Laplace-Beltrami operator
• Representer theorem: f(x) = Σ_{i=1}^{Nl+Nu} αi K(x, xi)
• Least squares solution case: Laplacian acts on kernel matrix
SCCB 2006 ⋄ Johan Suykens
65
Growth of databases
Hence: computational complexity is important (e.g. exploit sparse matrices)
SCCB 2006 ⋄ Johan Suykens
66
Large scale interaction data sets in yeast
[Figure: yeast protein interaction network; highly connected complexes labelled at the perimeter include the APC complex, mRNA splicing, 26S and 20S core proteasome, histone deacetylase complex, tubulin-binding complex, Arp 2/3 complex, TRAPP complex, RNase complex, v-SNARE, COP II vesicle coat, TAF IID complex, mRNA processing, SCF, eukaryotic translation initiation factor 3 complex, ribosomal/DNA replication factor C complex, mitochondrial large ribosomal subunit, RNA biogenesis and Pol II mediator complex]
Figure 2 Visualization of combined, large-scale interaction data sets in yeast. A total of 14,000 physical interactions obtained from the GRID database were represented with the Osprey
network visualization system (see http://biodata.mshri.on.ca/grid). Each edge in the graph represents an interaction between nodes, which are coloured according to Gene Ontology
(GO) functional annotation. Highly connected complexes within the data set, shown at the perimeter of the central mass, are built from nodes that share at least three interactions within
other complex members. The complete graph contains 4,543 nodes of ~6,000 proteins encoded by the yeast genome, 12,843 interactions and an average connectivity of 2.82 per
node. The 20 highly connected complexes contain 340 genes, 1,835 connections and an average connectivity of 5.39.
[Tyers & Mann, From genomics to proteomics, Nature 2003]
SCCB 2006 ⋄ Johan Suykens
67
Function class prediction of yeast proteins (1)
• Dataset [Tsuda et al., 2005; Lanckriet et al., 2004]: 3588 proteins
Function of proteins labelled according to MIPS Comprehensive Yeast
Genome Database
Focus on 13 highest-level categories of functional hierarchy
• Functional classes:
1. metabolism
3. cell cycle and DNA processing
5. protein synthesis
7. cellular transportation
9. interaction with cell environment
11. control of cell organization
13. others
SCCB 2006 ⋄ Johan Suykens
2. energy
4. transcription
6. protein fate
8. cell rescue and defense
10. cell fate
12. transport facilitation
68
Function class prediction of yeast proteins (2)
• Improved results with combined Laplacians [Tsuda et al., 2005]
• Choice of matrices Wi related to the graphs Gi (i = 1, 2, .., 5):
W1 :
W2 :
W3 :
W4 :
W5 :
Network from Pfam domain structure
Co-participation in a protein complex
Protein-protein interactions (MIPS physical interactions)
Genetic interactions (MIPS genetic interactions)
Network created from cell gene expression measurements
(Note: Pfam is a large collection of multiple sequence alignments and
hidden Markov models covering many common protein domains and
families - www.sanger.ac.uk/Software/Pfam/)
SCCB 2006 ⋄ Johan Suykens
69
Data fusion with kernels
• Consider a combination of kernel matrices K = Σ_{i=1}^m µi Ki (µi ≥ 0) with
Kernel   Data                    Similarity measure
K1       protein sequences       Smith-Waterman
K2       protein sequences       BLAST
K3       protein sequences       Pfam HMM
K4       hydropathy profile      FFT
K5       protein interactions    linear kernel
K6       protein interactions    diffusion kernel
K7       gene expression         RBF kernel
K8       random numbers          linear kernel
• Improved results by combining kernels [Lanckriet et al., 2004]
SCCB 2006 ⋄ Johan Suykens
70
Learning the optimal combination
• Learn optimal combination of µi together with SVM classifier by solving
a single convex problem [Lanckriet et al., JMLR 2004]
• QP problem of SVM:
max_α 2αT 1 − αT diag(y) K diag(y) α s.t. 0 ≤ α ≤ C, αT y = 0
is replaced by
min_{µi} max_α 2αT 1 − αT diag(y)(Σ_{i=1}^m µi Ki) diag(y) α
s.t. 0 ≤ α ≤ C, αT y = 0, trace(Σ_{i=1}^m µi Ki) = c, Σ_{i=1}^m µi Ki ⪰ 0.
Can be solved as semidefinite program (SDP problem) [Boyd &
Vandenberghe, 2004] (LMI constraint for positive definite kernel)
SCCB 2006 ⋄ Johan Suykens
71
Gene prioritization through genomic data fusion
• [Aerts et al., Nature Biotechnology 2006]
Integrating multiple heterogeneous data sources (microarray, BIND,
BLAST, cis-regulatory modules, EST, GO, InterPro, KEGG, transcription
motifs, literature)
Overall prioritization obtained by data fusion with a global ranking using
order statistics.
• The approach successfully identified a novel gene in DiGeorge syndrome
(in vivo validation, zebrafish)
• Potential for such methodologies towards kernel methods
SCCB 2006 ⋄ Johan Suykens
72
Combined MRI and MRS classification system (1)
[Luts et al., 2006]
SCCB 2006 ⋄ Johan Suykens
73
Combined MRI and MRS classification system (2)
SCCB 2006 ⋄ Johan Suykens
74
Combined MRI and MRS classification system (3)
[Devos et al., 2005]
SCCB 2006 ⋄ Johan Suykens
75
Kernel design
[Figure: Bayesian network over variables A, B, C, D, E with P(A,B,C,D,E) = P(A|B) P(B) P(C|B) P(D|C) P(E|B)]
- Probability product kernel:
K(p1, p2) = ∫_X p1(x)ρ p2(x)ρ dx
- Prior knowledge incorporation
Kernels from graphical models, Bayesian networks, HMMs
Kernels tailored to data types (DNA sequence, text, chemoinformatics)
[Tsuda et al., Bioinformatics 2002; Jebara et al., JMLR 2004; Ralaivola et al., 2005]
SCCB 2006 ⋄ Johan Suykens
76
Kernels and graphical models
• Probability product kernel [Jebara et al., 2004]
K(p1, p2) = ∫_X p1(x)ρ p2(x)ρ dx
• Case ρ = 1/2: Bhattacharyya kernel
K(p1, p2) = ∫_X √p1(x) √p2(x) dx
(related to the Hellinger distance H(p1, p2) = (1/2) ∫_X (√p1(x) − √p2(x))2 dx by H(p1, p2) = √(2 − 2K(p1, p2)), which is a symmetric approximation to the Kullback-Leibler divergence).
• Case ρ = 1: expected likelihood kernel
K(p1, p2) = ∫_X p1(x) p2(x) dx = Ep1[p2(x)] = Ep2[p1(x)]
SCCB 2006 ⋄ Johan Suykens
77
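A numerical sketch (illustration only): for two densities given on a one-dimensional grid, the probability product kernel can be approximated with the trapezoidal rule; ρ = 1/2 gives the Bhattacharyya kernel and ρ = 1 the expected likelihood kernel. The Gaussian densities below are an arbitrary example, not data from the lectures.

```python
import numpy as np

def probability_product_kernel(p1, p2, x, rho=0.5):
    # K(p1, p2) ~ integral of p1(x)^rho * p2(x)^rho dx on the grid x
    return np.trapz(p1(x) ** rho * p2(x) ** rho, x)

def gaussian(m, s):
    return lambda x: np.exp(-(x - m) ** 2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

x = np.linspace(-12, 12, 4001)
k_bhattacharyya = probability_product_kernel(gaussian(0, 1), gaussian(1, 2), x, rho=0.5)
k_expected_lik  = probability_product_kernel(gaussian(0, 1), gaussian(1, 2), x, rho=1.0)
```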
Bayesian networks
Example of a Bayesian network, used in the study of ovarian cancer:
Bayesian network approaches make it possible to incorporate medical expert knowledge [Fannes et al., 2004]
SCCB 2006 ⋄ Johan Suykens
78
Canonical Correlation Analysis
• CCA analysis has applications e.g. in system identification,
signal processing, and recently in bioinformatics and textmining.
• Objective: find a maximal correlation between the projected variables
zx = wT x and zy = v T y where x ∈ Rnx , y ∈ Rny (zero mean).
• Maximize the correlation coefficient
max_{w,v} ρ = E[zx zy] / (√E[zx zx] √E[zy zy]) = wT Cxy v / (√(wT Cxx w) √(vT Cyy v))
with Cxx = E[xxT], Cyy = E[yyT], Cxy = E[xyT]. This is formulated as the constrained optimization problem
max_{w,v} wT Cxy v s.t. wT Cxx w = 1 and vT Cyy v = 1
which leads to the generalized eigenvalue problem
Cxy v = η Cxx w, Cyx w = ν Cyy v.
SCCB 2006 ⋄ Johan Suykens
79
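A linear CCA sketch (illustrative): the two coupled equations above can be stacked into one symmetric generalized eigenvalue problem and solved with scipy; the small ridge term is an added numerical safeguard, not part of the slide.

```python
import numpy as np
from scipy.linalg import eigh

def cca(X, Y, reg=1e-8):
    X = X - X.mean(0); Y = Y - Y.mean(0)                 # zero mean data
    N, nx = X.shape; ny = Y.shape[1]
    Cxx = X.T @ X / N + reg * np.eye(nx)
    Cyy = Y.T @ Y / N + reg * np.eye(ny)
    Cxy = X.T @ Y / N
    # block form: [[0, Cxy],[Cyx, 0]] [w; v] = rho [[Cxx, 0],[0, Cyy]] [w; v]
    A = np.block([[np.zeros((nx, nx)), Cxy], [Cxy.T, np.zeros((ny, ny))]])
    B = np.block([[Cxx, np.zeros((nx, ny))], [np.zeros((ny, nx)), Cyy]])
    rho, WV = eigh(A, B)                                 # generalized eigenvalue problem
    w, v = WV[:nx, -1], WV[nx:, -1]                      # leading canonical directions
    return w, v, rho[-1]                                 # rho[-1] is the largest correlation
```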
Kernel CCA
Correlation: min_{w,v} Σ_i ‖zxi − zyi‖22 with zx = wT ϕ1(x), zy = vT ϕ2(y)
[Figure: data in spaces X and Y mapped by feature maps ϕ1(·), ϕ2(·) to feature spaces; the score variables are projected onto correlated target spaces]
SCCB 2006 ⋄ Johan Suykens
[Suykens et al. 2002, Bach & Jordan, JMLR 2002]
80
LS-SVM formulation to Kernel CCA
• Score variables: zx = wT (ϕ1(x) − µ̂ϕ1), zy = vT (ϕ2(y) − µ̂ϕ2)
Feature maps ϕ1, ϕ2, kernels K1(xi, xj) = ϕ1(xi)T ϕ1(xj), K2(yi, yj) = ϕ2(yi)T ϕ2(yj)
• Primal problem: (kernel PLS case: ν1 = 0, ν2 = 0 [Hoegaerts et al., 2004])
max_{w,v,e,r} γ Σ_{i=1}^N ei ri − ν1 (1/2) Σ_{i=1}^N ei2 − ν2 (1/2) Σ_{i=1}^N ri2 − (1/2) wT w − (1/2) vT v
such that ei = wT (ϕ1(xi) − µ̂ϕ1), ri = vT (ϕ2(yi) − µ̂ϕ2), ∀i
with µ̂ϕ1 = (1/N) Σ_{i=1}^N ϕ1(xi), µ̂ϕ2 = (1/N) Σ_{i=1}^N ϕ2(yi).
• Dual problem: generalized eigenvalue problem [Suykens et al., 2002]
[ 0      Ωc,1 ] [ α ]       [ ν1 Ωc,1 + I    0            ] [ α ]
[ Ωc,2   0    ] [ β ]  = λ  [ 0              ν2 Ωc,2 + I  ] [ β ] ,   λ = 1/γ
with Ωc,1ij = (ϕ1(xi) − µ̂ϕ1)T (ϕ1(xj) − µ̂ϕ1), Ωc,2ij = (ϕ2(yi) − µ̂ϕ2)T (ϕ2(yj) − µ̂ϕ2)
SCCB 2006 ⋄ Johan Suykens
81
Obtaining solution from Lagrangian
• Lagrangian:
L(w, v, e, r; α, β) = γ Σ_{i=1}^N ei ri − ν1 (1/2) Σ_{i=1}^N ei2 − ν2 (1/2) Σ_{i=1}^N ri2 − (1/2) wT w − (1/2) vT v − Σ_{i=1}^N αi [ei − wT (ϕ1(xi) − µ̂ϕ1)] − Σ_{i=1}^N βi [ri − vT (ϕ2(yi) − µ̂ϕ2)]
• Conditions for optimality (eliminate w, v, e, r):
∂L/∂w = 0 → w = Σ_{i=1}^N αi (ϕ1(xi) − µ̂ϕ1)
∂L/∂v = 0 → v = Σ_{i=1}^N βi (ϕ2(yi) − µ̂ϕ2)
∂L/∂ei = 0 → γ vT (ϕ2(yi) − µ̂ϕ2) = ν1 wT (ϕ1(xi) − µ̂ϕ1) + αi, i = 1, ..., N
∂L/∂ri = 0 → γ wT (ϕ1(xi) − µ̂ϕ1) = ν2 vT (ϕ2(yi) − µ̂ϕ2) + βi, i = 1, ..., N
∂L/∂αi = 0 → ei = wT (ϕ1(xi) − µ̂ϕ1), i = 1, ..., N
∂L/∂βi = 0 → ri = vT (ϕ2(yi) − µ̂ϕ2), i = 1, ..., N
SCCB 2006 ⋄ Johan Suykens
82
Kernel CCA applications
• [Vert & Kanehisa, Bioinformatics 2003]:
For kernels related to spaces X and Y
K1 : graph from gene network
K2 : gene expression profiles
Study correlation between gene network and set of profiles
Able to extract biologically relevant expression patterns and pathways
with related activity.
• [Yamanishi et al., Bioinformatics 2003]:
Extract correlated gene clusters from multiple genomic data. Successfully
tested on the ability to recognize operons in the Escherichia coli genome,
from the comparison of three data sets:
1. functional relationships between genes in metabolic pathways
2. geometrical relationships along the chromosome
3. co-expression relationships as observed by gene expression data
SCCB 2006 ⋄ Johan Suykens
83
Classification of brain tumors using ARD
Bayesian learning (automatic relevance determination) of most relevant frequencies [Lu et al.]
SCCB 2006 ⋄ Johan Suykens
84
Bayesian inference
Level 1 (parameters): Posterior = Likelihood × Prior / Evidence → maximize
Level 2 (hyperparameters): Posterior = Likelihood × Prior / Evidence → maximize
Level 3 (model comparison): Posterior = Likelihood × Prior / Evidence → maximize
Automatic relevance determination (ARD) [MacKay, 1998]: infer elements of diagonal
matrix S in K(xi, xj ) = exp(−(xi − xj )T S(xi − xj )) which indicate how relevant
input variables are (but: many local minima, computationally expensive).
SCCB 2006 ⋄ Johan Suykens
85
Additive regularization trade-off
• Traditional Tikhonov regularization scheme:
min_{w,e} wT w + γ Σ_i ei2 s.t. ei = yi − wT ϕ(xi), ∀i = 1, ..., N
Training solution for fixed value of γ: (K + I/γ) α = y
→ Selection of γ via validation set: non-convex problem
• Additive regularization trade-off [Pelckmans et al., 2005]:
min_{w,e} wT w + Σ_i (ei − ci)2 s.t. ei = yi − wT ϕ(xi), ∀i = 1, ..., N
Training solution for fixed value of c = [c1; ...; cN]: (K + I) α = y − c
→ Selection of c via validation set: can be convex problem
SCCB 2006 ⋄ Johan Suykens
86
Hierarchical Kernel Machines
Conceptually
Computationally
Level 3
Model selection
Level 2
Sparseness
Structure detection
Level 1
LS−SVM substrate
Hierarchical kernel machine
Convex optimization
Hierarchical modelling approach leading to convex optimization problem
Computationally fusing training, hyperparameter and model selection
Optimization modelling: sparseness, input/structure selection, stability ...
[Pelckmans et al., Machine Learning 2006]
SCCB 2006 ⋄ Johan Suykens
87
Issues about sparseness
• Sparse approximation (zeros in solution vector) - two main approaches
in general:
1. by choice of loss function (e.g. epsilon-insensitive loss function)
2. by choice of the regularization term (e.g. 1-norm instead of 2-norm)
• For linear models (or parameterized models): both options possible
• For support vector machines: rely on 2-norm for regularization term
(for kernel based model representation in dual) and w can be infinite
dimensional (e.g. in RBF kernel case).
• Interpretability helps with additive models [Hastie & Tibshirani, 1986]
and componentwise models (also suitable in high dimensions)
SCCB 2006 ⋄ Johan Suykens
88
Additive models and structure detection
• Additive models: ŷ(x) = Σ_{p=1}^P w(p)T ϕ(p)(x(p)) with x(p) the p-th input variable and a feature map ϕ(p) for each variable. This leads to the kernel K(xi, xj) = Σ_{p=1}^P K(p)(xi(p), xj(p)).
• Structure detection [Pelckmans et al., 2005]:
min_{w,e,t} ρ Σ_{p=1}^P tp + Σ_{p=1}^P w(p)T w(p) + γ Σ_{i=1}^N ei2
s.t. yi = Σ_{p=1}^P w(p)T ϕ(p)(xi(p)) + ei, ∀i = 1, ..., N
−tp ≤ w(p)T ϕ(p)(xi(p)) ≤ tp, ∀i = 1, ..., N, ∀p = 1, ..., P
Study how the solution with maximal variation varies for different values of ρ
[Figure: maximal variation of each component versus ρ; the 4 relevant input variables separate clearly from the 21 irrelevant input variables]
SCCB 2006 ⋄ Johan Suykens
89
Incorporation of prior knowledge (1)
• Support vector machine formulations allow one to incorporate additional constraints that express prior knowledge about the problem, e.g. monotonicity, symmetry, positivity, ...
• Especially, LS-SVM as simple core models for which one can take into
account additional regularization terms and/or constraints. Systematic
and straightforward (dual) solution from Lagrangian.
• Large potential of convex optimization techniques
[Boyd & Vandenberghe, 2004]
SCCB 2006 ⋄ Johan Suykens
90
Incorporation of prior knowledge (2)
• Example: LS-SVM regression with monotonicity constraint
min_{w,b,e} (1/2) wT w + γ (1/2) Σ_{i=1}^N ei2 s.t. yi = wT ϕ(xi) + b + ei, ∀i = 1, ..., N
wT ϕ(xi) ≤ wT ϕ(xi+1), ∀i = 1, ..., N − 1
• Application: estimation of cdf [Pelckmans et al., 2005]
[Figure: true cdf and empirical cdf P(X) versus X; estimates by the Chebychev marker and the monotone LS-SVM (mLS-SVM)]
SCCB 2006 ⋄ Johan Suykens
91
Acknowledgements (1)
• K.U. Leuven, ESAT-SCD: research teams SMC, BIOI, Biomed
Prof. B. De Moor, Prof. S. Van Huffel, Prof. K. Marchal, Prof. Y.
Moreau, Prof. J. Vandewalle
Current and former postdocs and PhD candidates:
C. Alzate, Dr. L. Ameye, Dr. T. De Bie, Dr. J. De Brabanter, Dr. A.
Devos, Dr. M. Espinoza, Dr. F. De Smet, O. Gevaert, Dr. I. Goethals,
Dr. B. Hamers, F. Janssens, Dr. L. Hoegaerts, P. Karsmakers, Dr. G.
Lanckriet, Dr. C. Lu, Dr. L. Lukas, J. Luts, F. Ojeda, Dr. K. Pelckmans,
Dr. N. Pochet, B. Van Calster, R. Van de Plas, Dr. T. Van Gestel, V.
Van Belle, Dr. L. Vanhamme
• Many people for joint work, discussions, invitations, joint organization of
meetings.
SCCB 2006 ⋄ Johan Suykens
92
Acknowledgements (2)
• Support from GOA-Ambiorics (Algorithms for Medical and Biological
Research, Integration, Computation and Software), COE Optimization in
Engineering, COE Symbiosys, IAP V, FWO projects, IWT.
• Biomed MRS/MRSI research in collaboration with biomedical NMR unit,
Dept. of Radiology, Univ. Hospitals Leuven, Belgium, partners of
EU projects INTERPRET (IST-1999-10310), eTUMOUR (FP6-2002-LIFESCIHEALTH 503094), BIOPATTERN (FP6-2002-IST 508803),
HEALTHagents (IST-2004-27214)
SCCB 2006 ⋄ Johan Suykens
93
Acknowledgements (3)
SCCB 2006 ⋄ Johan Suykens
94
References: books
• Boyd S., Vandenberghe L., Convex Optimization, Cambridge University Press, 2004.
• Chapelle O., Schölkopf B., Zien A. (Eds.), Semi-Supervised Learning, MIT Press, Cambridge, MA, (in
press) 2006.
• Cristianini N., Shawe-Taylor J., An Introduction to Support Vector Machines, Cambridge University
Press, 2000.
• Rasmussen C.E., Williams C.K.I., Gaussian Processes for Machine Learning, MIT Press, Cambridge,
MA 2006.
• Schölkopf B., Smola A., Learning with Kernels, MIT Press, Cambridge, MA, 2002.
• Schölkopf B., Tsuda K., Vert J.P. (Eds.) Kernel Methods in Computational Biology 400, MIT Press,
Cambridge, MA (2004)
• Shawe-Taylor J., Cristianini N., Kernel Methods for Pattern Analysis, Cambridge University Press,
2004.
• Suykens J.A.K., Van Gestel T., De Brabanter J., De Moor B., Vandewalle J., Least Squares Support
Vector Machines, World Scientific, Singapore, 2002.
• Suykens J.A.K., Horvath G., Basu S., Micchelli C., Vandewalle J. (Eds.), Advances in Learning Theory :
Methods, Models and Applications, vol. 190 of NATO-ASI Series III : Computer and Systems Sciences,
IOS Press (Amsterdam, The Netherlands) 2003.
• Vapnik V., Statistical Learning Theory, John Wiley & Sons, New York, 1998.
• Wahba G., Spline Models for Observational Data, Series in Applied Mathematics, 59, SIAM,
Philadelphia, 1990.
SCCB 2006 ⋄ Johan Suykens
95
Related references - methods
• Alzate C., Suykens J. A. K., “A Weighted Kernel PCA Formulation with Out-of-Sample Extensions for
Spectral Clustering Methods”, WCCI-IJCNN 2006, Vancouver, 138-144.
• Bach F.R., Jordan M.I., “Kernel independent component analysis”, Journal of Machine Learning
Research, 3, 1-48, 2002.
• Belkin M., Niyogi P., “Semi-supervised learning on Riemannian manifolds”, Machine Learning, Vol. 56,
pp. 209-239, 2004.
• Burges C.J.C., “A tutorial on support vector machines for pattern recognition”, Knowledge Discovery
and Data Mining, 2(2), 121-167, 1998.
• Cawley G.C., Talbot N.L.C., “Fast exact leave-one-out cross-validation of sparse least-squares support
vector machines”, Neural Networks, Vol. 17, 10, pp. 1467-1475, 2004.
• Cawley G.C., “Leave-one-out Cross-validation Based Model Selection Criteria for Weighted LS-SVMs”,
WCCI-IJCNN 2006 Vancouver.
• Chung F.R.K. , “Spectral graph theory”, Regional Conference Series in Mathematics 92, Amer. Math.
Soc., Providence, 1997.
• Cortes C., Vapnik V., “Support vector networks”, Machine Learning, 20, 273–297, 1995.
• Cucker F., Smale S., “On the mathematical foundations of learning theory”, Bulletin of the AMS, 39,
1-49. 2002
• Debruyne M., Christmann A., Hubert M., Suykens J.A.K., “Robustness and stability of reweighted kernel
based regression”, Internal Report 06-150, ESAT-SISTA, K.U.Leuven (Leuven, Belgium), 2006.
• Espinoza M., Suykens J.A.K., De Moor B., “Kernel Based Partially Linear Models and Nonlinear
Identification”, IEEE Transactions on Automatic control, special issue (System identification : linear
vs nonlinear), vol. 50, no. 10, Oct. 2005, pp. 1509- 1519.
SCCB 2006 ⋄ Johan Suykens
96
• Espinoza M., Suykens J.A.K., De Moor B., “LS-SVM Regression with Autocorrelated Errors”, in Proc.
of the 14th IFAC Symposium on System Identification (SYSID), Newcastle, Australia, Mar. 2006, pp.
582-587.
• Evgeniou T., Pontil M., Poggio T., “Regularization networks and support vector machines”, Advances
in Computational Mathematics, 13(1): 1-50, 2000.
• Girolami M., “Orthogonal series density estimation and the kernel eigenvalue problem”, Neural
Computation, 14(3), 669–688, 2002.
• Girosi F., “An equivalence between sparse approximation and support vector machines”, Neural
Computation, 10(6), 1455–1480, 1998.
• Hastie T., Tibshirani R., “Generalized Additive Models (with discussion)”, Statistical Science, Vol 1, No
3, 297-318, 1986.
• Hoegaerts L., Suykens J.A.K., Vandewalle J., De Moor B., “Primal space sparse kernel partial least
squares regression for large scale problems”, IJCNN 2004, Hungary, Budapest, Jul. 2004, pp. 561-566.
• Hoegaerts L., Suykens J.A.K., Vandewalle J., De Moor B., “Subset based least squares subspace
regression in RKHS”, Neurocomputing, vol. 63, Jan. 2005, pp. 293-323.
• Jebara T., Kondor R., Howard A., “Probability Product Kernels”, Journal of Machine Learning
Research, 5(Jul):819–844, 2004.
• Kondor R., Lafferty J., “Diffusion Kernels on Graphs and Other Discrete Input Spaces”. ICML 2002.
• Kwok J.T., “The evidence framework applied to support vector machines”, IEEE Transactions on
Neural Networks, 10, 1018–1031, 2000.
• Lanckriet, G.R.G., Cristianini, N., Bartlett, P., El Ghaoui, L., Jordan, M.I., “Learning the Kernel Matrix
with Semidefinite Programming”, Journal of Machine Learning Research, 5, 27-72, 2004.
• Lin C.-J., “On the convergence of the decomposition method for support vector machines”, IEEE
Transactions on Neural Networks. 12, 1288–1298, 2001.
SCCB 2006 ⋄ Johan Suykens
97
• MacKay D.J.C., “Bayesian interpolation”, Neural Computation, 4(3), 415–447, 1992.
• MacKay D.J.C., “Introduction to Gaussian processes”, In C.M. Bishop, editor, Neural Networks and
Machine Learning, volume 168 of NATO ASI Series, pages 133-165. Springer, Berlin, 1998.
• Mercer J., “Functions of positive and negative type and their connection with the theory of integral
equations”, Philos. Trans. Roy. Soc. London, 209, 415–446, 1909.
• Mika S., Rätsch G., Weston J., Schölkopf B., Müller K.-R., “Fisher discriminant analysis with kernels”,
In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing IX,
41-48. IEEE, 1999.
• Müller K.R., Mika S., Ratsch G., Tsuda K., Schölkopf B., “An introduction to kernel-based learning
algorithms”, IEEE Transactions on Neural Networks, 2001. 12(2): 181-201, 2001.
• Ng A.Y., Jordan M.I., Weiss Y., “On spectral clustering: Analysis and an algorithm”, In T. Dietterich,
S. Becker and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems (NIPS) 14,
2002.
• Pelckmans K., Suykens J.A.K., De Moor B., “Building Sparse Representations and Structure
Determination on LS-SVM Substrates”, Neurocomputing, vol. 64, Mar. 2005, pp. 137-159.
• Pelckmans K., Suykens J.A.K., De Moor B., “Additive regularization trade-off: fusion of training and
validation levels in kernel methods”, Machine Learning, vol. 62, no. 3, Mar. 2006, pp. 217-252.
• Pelckmans K., De Brabanter J., Suykens J.A.K., De Moor B., “Handling Missing Values in Support
Vector Machine Classifiers”, Neural Networks, vol. 18, 2005, pp. 684-692.
• Pelckmans K., Espinoza M., De Brabanter J., Suykens J.A.K., De Moor B., “Primal-Dual Monotone
Kernel Regression”, Neural Processing Letters, vol. 22, no. 2, Oct. 2005, pp. 171-182.
• Pelckmans K., De Brabanter J. Suykens J.A.K., De Moor B., “Convex Clustering Shrinkage”, in
Workshop on Statistics and Optimization of Clustering Workshop (PASCAL), London, U.K., Jul. 2005
• Perez-Cruz F., Bousono-Calzon C., Artes-Rodriguez A., “Convergence of the IRWLS Procedure to the
Support Vector Machine Solution”, Neural Computation, 17: 7-18, 2005.
SCCB 2006 ⋄ Johan Suykens
98
• Platt J., “Fast training of support vector machines using sequential minimal optimization”, In Schölkopf
B., Burges C.J.C., Smola A.J. (Eds.) Advances in Kernel methods - Support Vector Learning, 185–208,
MIT Press, 1999.
• Poggio T., Girosi F., “Networks for approximation and learning”, Proceedings of the IEEE, 78(9),
1481–1497, 1990.
• Poggio T., Rifkin R., Mukherjee S., Niyogi P., “General conditions for predictivity in learning theory”,
Nature, 428 (6981): 419-422, 2004.
• Principe J., Fisher III, Xu D., “Information theoretic learning”, in S. Haykin (Ed.), Unsupervised
Adaptive Filtering, John Wiley & Sons, New York, 2000.
• Rosipal R., Trejo L.J., “Kernel partial least squares regression in reproducing kernel Hilbert space”,
Journal of Machine Learning Research, 2, 97–123, 2001.
• Saunders C., Gammerman A., Vovk V., “Ridge regression learning algorithm in dual variables”, Proc. of
the 15th Int. Conf. on Machine Learning (ICML-98), Madison-Wisconsin, 515–521, 1998.
• Schölkopf B., Smola A., Müller K.-R., “Nonlinear component analysis as a kernel eigenvalue problem”,
Neural Computation, 10, 1299–1319, 1998.
• Schölkopf B., Mika S., Burges C., Knirsch P., Müller K.-R., Rätsch G., Smola A., “Input space vs.
feature space in kernel-based methods”, IEEE Transactions on Neural Networks, 10(5), 1000–1017,
1999.
• Shi J., Malik J., “Normalized Cuts and Image Segmentation”, IEEE Transactions on Pattern Analysis
and Machine Intelligence, 22(8), 888-905, 2000.
• Suykens J.A.K., Vandewalle J., “Least squares support vector machine classifiers”, Neural Processing
Letters, vol. 9, no. 3, Jun. 1999, pp. 293-300.
• Suykens J.A.K., De Brabanter J., Lukas L., Vandewalle J., “Weighted least squares support vector
machines : robustness and sparse approximation”, Neurocomputing, Special issue on fundamental and
information processing aspects of neurocomputing, vol. 48, no. 1-4, Oct. 2002, pp. 85-105.
SCCB 2006 ⋄ Johan Suykens
99
• Suykens J.A.K., Van Gestel T., Vandewalle J., De Moor B., “A support vector machine formulation to
PCA analysis and its kernel version”, IEEE Transactions on Neural Networks, vol. 14, no. 2, Mar.
2003, pp. 447-450.
• Van Gestel T., Suykens J.A.K., Baesens B., Viaene S., Vanthienen J., Dedene G., De Moor B.,
Vandewalle J., “Benchmarking Least Squares Support Vector Machine Classifiers”, Machine Learning,
vol. 54, no. 1, Jan. 2004, pp. 5-32.
• Van Gestel T., Suykens J.A.K., Lanckriet G., Lambrechts A., De Moor B., Vandewalle J., “Bayesian
Framework for Least Squares Support Vector Machine Classifiers, Gaussian Processes and Kernel Fisher
Discriminant Analysis”, Neural Computation, vol. 15, no. 5, May 2002, pp. 1115-1148.
• Williams C.K.I., Rasmussen C.E., “Gaussian processes for regression”, In D.S. Touretzky, M.C. Mozer,
and M.E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8, 514–520. MIT
Press, 1996.
• Williams C.K.I., Seeger M., “Using the Nyström method to speed up kernel machines”, In T.K. Leen,
T.G. Dietterich, and V. Tresp (Eds.), Advances in neural information processing systems, 13, 682–688,
MIT Press, 2001.
• Zanni L., Serafini T., Zanghirati G., “Parallel Software for Training Large Scale Support Vector Machines
on Multiprocessor Systems”, Journal of Machine Learning Research, 7:1467-1492, 2006.
SCCB 2006 ⋄ Johan Suykens
100
Related references - applications
• Aerts S., Lambrechts D., Maity S., Van Loo P., Coessens B., De Smet F., Tranchevent L.C., De Moor
B., Marynen P., Hassan B., Carmeliet P., Moreau Y., “Gene prioritization through genomic data fusion”,
Nature Biotechnology, 24 (5): 537-544, 2006.
• Antal P., Fannes G., Timmerman D., Moreau Y., De Moor B., “Using literature and data to learn
Bayesian Networks as clinical models of ovarian tumors”, Artificial Intelligence in Medicine, vol. 30,
2004, pp. 257-281.
• Brown M., Grundy W., Lin D., Cristianini N., Sugnet C., Furey T., Ares M., Haussler D., “Knowledge-based analysis of microarray gene expression data using support vector machines”, Proceedings of the
National Academy of Science, 97(1), 262–267, 2000.
• Devos A., Lukas L., Suykens J.A.K., Vanhamme L., Tate A.R., Howe F.A., Majos C., Moreno-Torres A.,
Van der Graaf M., Arus C., Van Huffel S., “Classification of brain tumours using short echo time 1H
MRS spectra”, Journal of Magnetic Resonance, vol. 170 , no. 1, Sep. 2004, pp. 164-175.
• Devos A., Simonetti A.W., van der Graaf M., Lukas L., Suykens J.A.K., Vanhamme L., Buydens L.M.C.,
Heerschap A., Van Huffel S. “The use of multivariate MR imaging intensities versus metabolic data from
MR spectroscopic imaging for brain tumour classification”, Journal of Magnetic Resonance, 173 (2):
218-228 April 2005.
• Guyon I., Weston J., Barnhill S., Vapnik V., “Gene selection for cancer classification using support vector
machines”, Machine Learning, 46, 389–422, 2002.
• Lanckriet, G.R.G., De Bie, T., Cristianini, N. , Jordan, M.I., Noble, W.S., “A statistical framework for
genomic data fusion”, Bioinformatics, 20, 2626-2635, 2004.
• Lu C., Van Gestel T., Suykens J.A.K., Van Huffel S., Vergote I., Timmerman D., “Preoperative prediction
of malignancy of ovarium tumor using least squares support vector machines”, Artificial Intelligence in
Medicine, vol. 28, no. 3, Jul. 2003, pp. 281-306.
SCCB 2006 ⋄ Johan Suykens
101
• Luts J., Heerschap A., Suykens J.A.K., Van Huffel S., “A combined MRI and MRSI based Multiclass
System for Brain Tumour Recognition using LS-SVMs with Class Probabilities and Feature Selection”,
Internal Report 06-143, ESAT-SISTA, K.U.Leuven (Leuven, Belgium), 2006.
• Pochet N., De Smet F., Suykens J.A.K., De Moor B., “Systematic benchmarking of microarray data
classification : assessing the role of nonlinearity and dimensionality reduction”, Bioinformatics, vol. 20,
no. 17, Nov. 2004, pp. 3185-3195.
• Pochet N.L.M.M., Janssens F.A.L., De Smet F., Marchal K., Suykens J.A.K., De Moor B.L.R.,
“M@CBETH: a microarray classification benchmarking tool”, Bioinformatics, vol. 21, no. 14, Jul. 2005,
pp. 3185-3186.
• Ralaivola L., Swamidass S., Saigo H., Baldi P., “Graph Kernels for Chemical Informatics”, Neural
Networks, 18(8): 1093-1110, 2005.
• Tyers M., Mann M., “From genomics to proteomics”, Nature, Vol. 422, 193-197, 2003.
• Tsuda K., Kin T., Asai K. “Marginalized kernels for biological sequences”, Bioinformatics, 18(Suppl.1):
S268-S275, 2002.
• Tsuda K., Shin H.J., Schölkopf B., “Fast protein classification with multiple networks”, Bioinformatics
(ECCB’05), 21(Suppl. 2):ii59–ii65, 2005.
• Vert J.-P., Kanehisa M., “Extracting active pathways from gene expression data”, Bioinformatics, vol.
19, p. 238ii-244ii, 2003.
• Yamanishi Y., Vert J.-P., Nakaya A., Kanehisa M., “Extraction of Correlated Gene Clusters from
Multiple Genomic Data by Generalized Kernel Canonical Correlation Analysis”, Bioinformatics, vol. 19,
p. 323i-330i, 2003. (Proceedings of ISMB 2003).
SCCB 2006 ⋄ Johan Suykens
102