Regularized Vector Field Learning with Sparse Approximation for Mismatch Removal

Jiayi Ma^a, Ji Zhao^a, Jinwen Tian^a, Xiang Bai^{b,d}, Zhuowen Tu^{c,d}

^a Institute for Pattern Recognition and Artificial Intelligence, Huazhong University of Science and Technology, Wuhan, 430074, China.
^b Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China.
^c Lab of Neuro Imaging, University of California, Los Angeles, CA, 90095, USA.
^d Microsoft Research Asia, Beijing, 100080, China.
Abstract
In vector field learning, regularized kernel methods such as regularized least-squares require the number of basis functions to be equal to the training sample size, N. The learning process thus has O(N^3) time and O(N^2) space complexity, which poses a significant burden on vector field learning for large datasets. In this paper, we propose a sparse approximation to a robust vector field learning method, called sparse vector field consensus (SparseVFC), and derive a statistical learning bound on the speed of convergence. We apply SparseVFC to the mismatch removal problem. The quantitative
results on benchmark datasets demonstrate the significant speed advantage of SparseVFC
over the original VFC algorithm (two orders of magnitude faster) without much performance
degradation; we also demonstrate the large improvement by SparseVFC over traditional
methods like RANSAC. Moreover, the proposed method is general and it can be applied to
other applications in vector field learning.
Keywords: Vector field learning, sparse approximation, regularization, reproducing kernel
Hilbert space, outlier, mismatch removal
Email addresses: [email protected] (Jiayi Ma), [email protected] (Ji Zhao),
[email protected] (Jinwen Tian), [email protected] (Xiang Bai), [email protected]
(Zhuowen Tu)
1. Introduction
For a given set of data $S = \{(x_n, y_n) \in \mathcal{X} \times \mathcal{Y}\}_{n=1}^{N}$, one can learn a function $f : \mathcal{X} \to \mathcal{Y}$
so that it approximates a mapping to output yn for an input xn . In this process one can fit
a function to the given training data with a smoothness constraint, which typically achieves
some form of capacity control of the function space. The learning process aims to balance the
data fitting and model complexity, and thus, produces a robust algorithm that generalizes
well to unseen data [1]. One way to achieve this is to formulate an optimization problem with a certain choice of regularization [2, 3], which typically operates in a Reproducing Kernel Hilbert Space (RKHS) [4] associated with a particular kernel. Two well-known
methods along the line of learning with regularization are regularized least-squares (RLS)
and support vector machines (SVM) [2]. These regularized kernel methods have drawn
much attention due to their computational simplicity and strong generalization power in
traditional machine learning problems such as regression and classification.
Regularized kernel methods over RKHS often lead to solving a convex optimization
problem. For example, RLS directly solves a linear system. However, the computational
complexity associated with applying kernel matrices is relatively high. Given a set of N
training samples, the kernel matrix is of size N × N. This suggests a space complexity of O(N^2) and a time complexity of at least O(N^2); in fact, most regularized kernel methods have core operations, such as the matrix inversion in RLS, which is O(N^3). It is therefore computationally
demanding if N is large. This situation is more troublesome in vector-valued cases where we
focus on the vector-valued regularized least-squares (vector-valued RLS). There are several
different names for the RLS model [5] such as regularization networks (RN) [3], kernel ridge
regression (KRR) [6], least squares support vector machines (LS-SVM) [7], etc. In this paper, we use the widely adopted name RLS.
A vector field is a map that assigns to each position $x \in \mathbb{R}^P$ a vector $y \in \mathbb{R}^D$, defined by a vector-valued function. Examples of vector fields range from the velocity fields of fluid
particles to the optical flow fields associated with visual motion. The problem of vector
field learning is tied with functional estimation and supervised learning. We may learn a
vector field by using regularized kernel methods and, in particular, vector-valued RLS [8, 9].
According to the representer theorem in RKHS, the optimal solution is a linear combination
of N basis functions, one for each training data point [10]. The corresponding kernel matrix is of size DN × DN. Compared to the scalar case, the vector-valued outputs pose
more challenges on large datasets. One possible solution is to learn sparse basis functions; i.e.
we could pick a subset of M basis functions, making the kernel matrix of size DM × DM. Thus, the computational complexity can be greatly reduced when M ≪ N. Moreover, the
solution based on sparse bases also enables faster prediction. Therefore, when we learn a
vector field via vector-valued regularized kernel methods, using a sparse representation may
be of particular advantage.
Motivated by the recent sparse representation literature [11], here we investigate a suboptimal solution to the vector-valued RLS problem. We derive an upper bound for the approximation accuracy based on the representer theorem. The upper bound has a rate of convergence of $O(\sqrt{1/M})$, indicating a fast convergence rate. We develop a new algorithm, SparseVFC, by using the sparse approximation in a robust vector field learning method, vector field consensus (VFC) [9]. As in [9], the SparseVFC algorithm is applied to the mismatch removal problem. The experimental results on various 2D and 3D datasets show the
clear advantages of SparseVFC over the competing methods in efficiency and robustness.
1.1. Related work
Sparse representations [11] have recently gained a considerable amount of interest for learning kernel machines [12, 13, 14, 15]. To scale up kernel methods to deal with large data, a wealth of efficient sparse approximations has been suggested in the scalar case, including the Nyström approximation [16], sparse greedy approximations [17], reduced
case, including the Nyström approximation [16], sparse greedy approximations [17], reduced
support vector machines [18], and incomplete Cholesky decomposition [19]. Common to all
these approximation schemes is that only a subset of the latent variables are treated exactly,
with the remaining variables being approximated. These methods differ in the mechanism
for choosing the subset, and in the matrix form used to represent the hypothesis space. In
this paper, we generalize this idea to the vector-valued setting.
Another approach to achieving a sparse representation is to add a regularization term penalizing the $\ell_1$ norm of the expansion coefficients [20, 21]. However, this does not alleviate the problem, since it still has to compute (and invert) the kernel matrix of size N × N, and hence it is not suitable for large datasets. To overcome this problem, Kim et al. [22] described a specialized interior-point method for solving large-scale $\ell_1$-regularized least-squares problems that uses preconditioned conjugate gradients to compute the search direction.
The large-scale kernel learning problem can also be addressed by iterative optimization such as conjugate gradient [23] and more efficient decomposition-based methods [24, 25]. Specifically, for the RLS model with the $\ell_2$ loss, fast optimization algorithms such as accelerated gradient descent in conjunction with stochastic methods [26, 27, 28] can also be very fast. However, a main drawback of these methods is that they have to store a kernel matrix of size N × N, which is not realistic for large datasets.
The main contributions of our work include: i) we present a sparse approximation algorithm for vector-valued RLS which has linear time and space complexity w.r.t. the training data size; ii) we derive an upper bound for the approximation accuracy of the proposed sparse approximation algorithm in terms of the regularized risk functional; iii) based on the sparse approximation and the vector field learning method VFC, we give a new algorithm, SparseVFC, which significantly speeds up the original VFC algorithm without sacrificing accuracy; iv) we apply SparseVFC to mismatch removal, which is a fundamental problem in computer vision, and the results demonstrate its superiority over the original VFC algorithm and many other state-of-the-art methods.
The rest of the paper is organized as follows. In Section 2, we briefly review the vector-valued Tikhonov regularization and lay out a sparse approximation algorithm for solving
vector-valued RLS. In Section 3, we apply the sparse approximation to robust vector field
learning and mismatch removal problem. The datasets and evaluation criteria used in the
experiments are presented in Section 4. In Section 5, we evaluate the performance of our
algorithm on both synthetic data and real-world images. Finally, we conclude in Section 6.
2. Sparse approximation algorithm
We start by defining the vector-valued RKHS and recalling the vector-valued Tikhonov
regularization, and then lay out a sparse approximation algorithm for solving the vector-valued RLS problem. Finally, we derive a statistical learning bound for the sparse approximation.
2.1. Vector-valued reproducing kernel Hilbert spaces
We are interested in a class of vector-valued kernel methods, where the hypotheses space
is chosen to be a reproducing kernel Hilbert space (RKHS). This motivates reviewing the
basic theory of vector-valued RKHS. The development of the theory in the vector case
is essentially the same as in the scalar case. We refer to [10, 29] for further details and
references.
Let $\mathcal{Y}$ be a real Hilbert space with inner product (norm) $\langle\cdot,\cdot\rangle_{\mathcal{Y}}$ ($\|\cdot\|_{\mathcal{Y}}$), for example, $\mathcal{Y} \subseteq \mathbb{R}^D$, $\mathcal{X}$ a set, for example, $\mathcal{X} \subseteq \mathbb{R}^P$, and $\mathcal{H}$ a Hilbert space with inner product (norm) $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ ($\|\cdot\|_{\mathcal{H}}$). A norm can be defined via an inner product, for example, $\|f\|_{\mathcal{H}} = \langle f, f\rangle_{\mathcal{H}}^{1/2}$. Next, we recall the definition of an RKHS as well as some properties of it which we will use in this paper.
Definition 1. A Hilbert space $\mathcal{H}$ is an RKHS if the evaluation maps $ev_x : \mathcal{H} \to \mathcal{Y}$ are bounded, i.e. if $\forall x \in \mathcal{X}$ there exists a positive constant $C_x$ such that

$\|ev_x(f)\|_{\mathcal{Y}} = \|f(x)\|_{\mathcal{Y}} \le C_x \|f\|_{\mathcal{H}}, \quad \forall f \in \mathcal{H}.$    (1)

A reproducing kernel $\Gamma : \mathcal{X} \times \mathcal{X} \to \mathcal{B}(\mathcal{Y})$ is then defined as

$\Gamma(x, x') := ev_x\, ev_{x'}^{*},$    (2)

where $\mathcal{B}(\mathcal{Y})$ is the space of bounded operators on $\mathcal{Y}$, for example, $\mathcal{B}(\mathcal{Y}) \subseteq \mathbb{R}^{D\times D}$, and $ev_x^{*}$ is the adjoint of $ev_x$.
From the definition of the RKHS, we can derive the following properties:
(i) For each $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, the kernel $\Gamma$ has the following reproducing property:

$\langle f(x), y\rangle_{\mathcal{Y}} = \langle f, \Gamma(\cdot, x) y\rangle_{\mathcal{H}}, \quad \forall f \in \mathcal{H}.$    (3)

(ii) For every $x \in \mathcal{X}$ and $f \in \mathcal{H}$, we have that

$\|f(x)\|_{\mathcal{Y}} \le \|\Gamma(x, x)\|_{\mathcal{Y},\mathcal{Y}}^{1/2}\, \|f\|_{\mathcal{H}},$    (4)

where $\|\cdot\|_{\mathcal{Y},\mathcal{Y}}$ is the operator norm.

Property (i) can be easily derived from the equation $ev_x^{*} y = \Gamma(\cdot, x) y$. In property (ii), we assume that $\sup_{x\in\mathcal{X}} \|\Gamma(x, x)\|_{\mathcal{Y},\mathcal{Y}}^{1/2} = s_\Gamma < \infty$.
Similar to the scalar case, for any $N \in \mathbb{N}$, $\{x_n : n \in \mathbb{N}_N\} \subseteq \mathcal{X}$, and a reproducing kernel $\Gamma$, a unique RKHS can be defined by considering the completion of the space

$\mathcal{H}_N = \left\{ \sum_{n=1}^{N} \Gamma(\cdot, x_n) c_n : c_n \in \mathcal{Y} \right\},$    (5)

with respect to the norm induced by the inner product

$\langle f, g\rangle_{\mathcal{H}} = \sum_{i,j=1}^{N} \langle \Gamma(x_j, x_i) c_i, d_j\rangle_{\mathcal{Y}},$    (6)

for any $f, g \in \mathcal{H}_N$ with $f = \sum_{i=1}^{N} \Gamma(\cdot, x_i) c_i$ and $g = \sum_{j=1}^{N} \Gamma(\cdot, x_j) d_j$.
2.2. Vector-valued Tikhonov regularization
Regularization aims to stabilize the solution of an ill-conditioned problem. Given an
N -sample S = {(xn , yn ) ∈ X × Y : n ∈ IINN } of patterns, where X ⊆ IRP and Y ⊆ IRD
are input space and output space respectively, in order to learn a mapping f : X → Y, the
vector-valued Tikhonov regularization in an RKHS H with kernel Γ minimizes a regularized
risk functional [10]
$\Phi(f) = \sum_{n=1}^{N} V(f(x_n), y_n) + \lambda \|f\|_{\mathcal{H}}^2,$    (7)

where the first term is called the empirical error, with the loss function $V : \mathcal{Y}\times\mathcal{Y} \to [0, \infty)$ satisfying $V(y, y) = 0$, and the second term is a stabilizer with a regularization parameter $\lambda$ controlling the trade-off between the two terms.
Theorem 1 (Vector-valued Representer Theorem [10]). The optimal solution of the regularized risk functional (7) has the form

$f^o(x) = \sum_{n=1}^{N} \Gamma(x, x_n) c_n, \quad c_n \in \mathcal{Y}.$    (8)

Hence, minimizing over the (possibly) infinite dimensional Hilbert space boils down to finding a finite set of coefficients $\{c_n : n \in \mathbb{N}_N\}$.
A number of loss functions have been discussed in the literature [1]. In this paper, we
focus on vector-valued RLS, which is a vector-valued Tikhonov regularization with an $\ell_2$ loss function, i.e.,

$V(f(x_n), y_n) = p_n \|y_n - f(x_n)\|^2,$    (9)

where $p_n \ge 0$ is the weight and $\|\cdot\|$ denotes the $\ell_2$ norm. Other common loss functions are the absolute value loss $V(f(x), y) = |f(x) - y|$ and Vapnik's $\epsilon$-insensitive loss $V(f(x), y) = \max(|f(x) - y| - \epsilon, 0)$.

Using the $\ell_2$ loss, the coefficients $c_n$ of the optimal solution, i.e. equation (8), are then determined by a linear system [10, 9]:

$(\tilde{\Gamma} + \lambda \tilde{P}^{-1})\, C = Y,$    (10)

where the kernel matrix $\tilde{\Gamma}$ is called the Gram matrix, which is an $N \times N$ block matrix with the $(i,j)$-th block $\Gamma(x_i, x_j)$; $\tilde{P} = P \otimes I_{D\times D}$ is a $DN \times DN$ diagonal matrix, where $P = \mathrm{diag}(p_1, \ldots, p_N)$ and $\otimes$ denotes the Kronecker product; and $C = (c_1^T, \cdots, c_N^T)^T$ and $Y = (y_1^T, \cdots, y_N^T)^T$ are $DN \times 1$ dimensional vectors.
Note that when the loss function V is not quadratic, the solution of the regularized risk functional (7) still has the form (8), but the coefficients $c_n$ can no longer be found by solving a linear system.
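As a concrete illustration of solving (10), the following NumPy sketch assembles the block Gram matrix for a simple decomposable Gaussian kernel $\Gamma(x_i, x_j) = e^{-\beta\|x_i - x_j\|^2} I$ and solves for the coefficients. The kernel choice, the values of β and λ, the uniform weights and the function names are illustrative assumptions, not settings prescribed by the paper.

```python
import numpy as np

def full_rls_coefficients(X, Y, p, beta=0.1, lam=3.0):
    """Solve (Gram + lam * P^{-1}) C = Y, i.e. equation (10), for a
    decomposable kernel Gamma(x_i, x_j) = exp(-beta * ||x_i - x_j||^2) * I."""
    N, D = Y.shape
    # Scalar N x N Gram matrix; the block Gram matrix is its Kronecker product with I_D.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    gram = np.exp(-beta * sq_dists)                          # N x N
    big_gram = np.kron(gram, np.eye(D))                      # DN x DN block Gram matrix
    P_inv = np.kron(np.diag(1.0 / p), np.eye(D))             # DN x DN diagonal weight inverse
    C = np.linalg.solve(big_gram + lam * P_inv, Y.reshape(-1))
    return C.reshape(N, D)                                   # one D-vector c_n per sample

# Tiny usage example with random data (N = 50 samples in 2D).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
Y = np.sin(X)                                                # an arbitrary smooth field
C = full_rls_coefficients(X, Y, p=np.ones(50))
print(C.shape)                                               # (50, 2)
```

Even this toy example makes the scaling issue visible: the DN × DN system grows quadratically in memory and cubically in solve time with N.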
2.3. Sparse approximation in vector-valued regularized least-squares
Under the Representer Theorem, the optimal solution $f^o$ comes from an RKHS $\mathcal{H}_N$ defined as in equation (5). Finding the coefficients $c_n$ of the optimal solution $f^o$ in vector-valued RLS merely requires solving the linear system (10). However, for large values of N, this may pose a serious problem due to heavy computational (scaling as O(N^3)) or memory (scaling as O(N^2)) requirements, and even when it is implementable, one may prefer a suboptimal but simpler method. In this section, we propose an algorithm that is
based on a similar kind of idea as the subset of regressors method [30, 31] for the standard
vector-valued RLS problem.
Rather than searching for the optimal solution $f^o$ in $\mathcal{H}_N$, we use a sparse approximation and search for a suboptimal solution $f^s$ in a space $\mathcal{H}_M$ ($M \ll N$) with far fewer basis functions, defined as

$\mathcal{H}_M = \left\{ \sum_{m=1}^{M} \Gamma(\cdot, \tilde{x}_m) c_m : c_m \in \mathcal{Y} \right\},$    (11)

with $\{\tilde{x}_m : m \in \mathbb{N}_M\} \subseteq \mathcal{X}$ (note that this point set may not be a subset of the input training data $\{x_n : n \in \mathbb{N}_N\}$), and then minimize the loss over all the training data. The problem that remains is how to choose the point set $\{\tilde{x}_m : m \in \mathbb{N}_M\}$ and accordingly find a set of coefficients $\{c_m : m \in \mathbb{N}_M\}$. In the scalar case, different approaches for selecting this point set are discussed, for example, in [5]. There, it was found that simply selecting an arbitrary subset of the training inputs performs no worse than more sophisticated methods. Recent progress in compressed sensing [11] has also demonstrated the power of sparse random basis representation. Therefore, in the interest of computational efficiency, we simply use random sampling to choose the sparse basis functions in the vector case. We also compare the influence of different choices of sparse basis functions in the experiment section.
Under the sparse approximation, the unique solution of the vector-valued RLS problem in $\mathcal{H}_M$ has the form

$f^s(x) = \sum_{m=1}^{M} \Gamma(x, \tilde{x}_m) c_m.$    (12)
To solve for the coefficients $c_m$, we now consider the Hilbertian norm induced by the inner product (6):

$\|f\|_{\mathcal{H}}^2 = \sum_{i=1}^{M}\sum_{j=1}^{M} \langle \Gamma(\tilde{x}_j, \tilde{x}_i) c_i, c_j\rangle_{\mathcal{Y}} = C^T \tilde{\Gamma} C,$    (13)

where $C = (c_1^T, \cdots, c_M^T)^T$ is a $DM \times 1$ dimensional vector and the kernel matrix $\tilde{\Gamma}$ is an $M \times M$ block matrix with the $(i,j)$-th block $\Gamma(\tilde{x}_i, \tilde{x}_j)$. Thus, the minimization of the regularized
risk functional (7) becomes

$\min_{f\in\mathcal{H}} \Phi(f) = \min_{f\in\mathcal{H}} \left\{ \sum_{n=1}^{N} p_n \|y_n - f(x_n)\|^2 + \lambda \|f\|_{\mathcal{H}}^2 \right\} = \min_{C} \left\{ \|\tilde{P}^{1/2}(Y - \tilde{U}C)\|^2 + \lambda\, C^T \tilde{\Gamma} C \right\},$    (14)

where $\tilde{U}$ is an $N \times M$ block matrix with the $(i,j)$-th block $\Gamma(x_i, \tilde{x}_j)$:

$\tilde{U} = \begin{pmatrix} \Gamma(x_1, \tilde{x}_1) & \cdots & \Gamma(x_1, \tilde{x}_M) \\ \vdots & \ddots & \vdots \\ \Gamma(x_N, \tilde{x}_1) & \cdots & \Gamma(x_N, \tilde{x}_M) \end{pmatrix}.$    (15)

Taking the derivative of the right-hand side of equation (14) with respect to the coefficient vector $C$ and setting it to zero, we can then compute $C$ from the following linear system:

$(\tilde{U}^T \tilde{P} \tilde{U} + \lambda \tilde{\Gamma})\, C = \tilde{U}^T \tilde{P}\, Y.$    (16)
In contrast to the optimal solution f o given by the Representer Theorem, which is a
linear combination of the basis functions Γ(·, x1 ), · · · , Γ(·, xN ) determined by the inputs
x1 , · · · , xN of training samples, the suboptimal solution f s is formed by a linear combination
of arbitrary M -tuples of the basis functions. Generally, this sparse approximation will yield
a vast increase in speed and decrease in memory requirements with negligible decrease in
accuracy.
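The suboptimal solution can be sketched in the same style. Assuming a decomposable kernel $\Gamma(x_i, x_j) = k(x_i, x_j) I$, for which equation (16) collapses to a scalar-kernel linear system, the snippet below picks M random basis points, builds U and the M × M Gram matrix, and solves the much smaller system; the Gaussian scalar kernel, the parameter values and the function names are our own illustrative choices.

```python
import numpy as np

def sparse_rls_coefficients(X, Y, p, M=15, beta=0.1, lam=3.0, rng=None):
    """Sparse approximation of vector-valued RLS with a decomposable Gaussian kernel.
    Solves (U^T P U + lam * G) C = U^T P Y, the scalar-kernel form of equation (16)."""
    rng = np.random.default_rng() if rng is None else rng
    N, D = Y.shape
    idx = rng.choice(N, size=M, replace=False)               # random subset of basis points
    Xb = X[idx]                                              # M x P basis locations
    U = np.exp(-beta * np.sum((X[:, None, :] - Xb[None, :, :]) ** 2, axis=2))   # N x M
    G = np.exp(-beta * np.sum((Xb[:, None, :] - Xb[None, :, :]) ** 2, axis=2))  # M x M
    PU = p[:, None] * U                                      # P is diagonal: scale rows of U
    C = np.linalg.solve(U.T @ PU + lam * G, U.T @ (p[:, None] * Y))             # M x D
    return Xb, C

def predict(x, Xb, C, beta=0.1):
    """Evaluate f^s(x) = sum_m Gamma(x, x~_m) c_m at query points x (Q x P)."""
    K = np.exp(-beta * np.sum((x[:, None, :] - Xb[None, :, :]) ** 2, axis=2))   # Q x M
    return K @ C                                             # Q x D predicted vectors
```

The cost of the solve is now governed by M rather than N, which is the source of the speedup discussed above.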
Note that the sparse approximation is somewhat related to SVM since SVM’s predictive
function also depends on a few samples (i.e. support vectors). In fact, under certain conditions, sparsity leads to SVM, which is related to the Structural Risk Minimization principle
[1]. As derived in [20], when the data is noiseless and the coefficients $C$ are chosen to minimize the following cost function, the sparse approximation gives the same solution as SVM:

$Q(C) = \left\| f(x) - \sum_{n=1}^{N} c_n \Gamma(x, x_n) \right\|_{\mathcal{H}}^2 + \lambda \|C\|_{\ell_1},$    (17)

where $\|\cdot\|_{\ell_1}$ is the usual $\ell_1$ norm. If N is very large and $\Gamma(\cdot, x_n)$ is not an orthonormal basis, it is possible that many different sets of coefficients will achieve the same error on a
given data set. Among all the approximating functions that achieve the same error, using
the `1 norm in the second term of equation (17) favors the one with the smallest number of
non-zero coefficients. Our approach follows this basic idea. The difference is that in the cost
function (17) the sparse basis functions are chosen automatically during the optimization
process, while in our approach the basis functions are chosen randomly for the purpose of
reducing both time and space complexities.
2.4. Bounds of the sparse approximation
To derive upper bounds on the error of approximating the optimal solution $f^o$ by the suboptimal one $f^s$, we employ Maurey-Jones-Barron's theorem [32, 33] reformulated in terms of G-variation [34]: for a Hilbert space $X$ with norm $\|\cdot\|$, $f \in X$, and a bounded subset $G$ of $X$, the following upper bound holds:

$\|f - \mathrm{span}_n G\| \le \sqrt{\frac{(s_G \|f\|_G)^2 - \|f\|^2}{n}},$    (18)

where $\mathrm{span}_n G$ is the set of linear combinations of $n$ arbitrary basis functions in $G$, $s_G = \sup_{g\in G}\|g\|$, and $\|\cdot\|_G$ denotes the G-variation, which is defined as

$\|f\|_G = \inf\{c > 0 : f/c \in \mathrm{cl\,conv}(G \cup -G)\},$    (19)

with $\mathrm{cl\,conv}(\cdot)$ being the closure of the convex hull of a set and $-G = \{-g : g \in G\}$. Here the closure of a set $A$ is defined as $\mathrm{cl}\,A = \{f \in X : (\forall \epsilon > 0)(\exists g \in A)\, \|f - g\| < \epsilon\}$. For properties of G-variation, we refer the reader to [34, 35, 36].
Taking advantage of this upper bound, we derive the following proposition which compares the optimal solution f o with a suboptimal solution f s in terms of regularized risk
functional Φ(f ).
Proposition 1. Let $\{(x_n, y_n) \in \mathcal{X}\times\mathcal{Y}\}_{n=1}^{N}$ be a finite set of input-output pairs of data, $\Gamma : \mathcal{X}\times\mathcal{X} \to \mathcal{B}(\mathcal{Y})$ a matrix-valued kernel, $s_\Gamma = \sup_{x\in\mathcal{X}} \|\Gamma(x,x)\|_{\mathcal{Y},\mathcal{Y}}^{1/2}$, $f^o(x) = \sum_{n=1}^{N} \Gamma(x, x_n) c_n$ the optimal solution of the regularized risk functional (7), and $f^s(x) = \sum_{m=1}^{M} \Gamma(x, \tilde{x}_m) c_m$ a suboptimal solution. Suppose $\forall n \in \mathbb{N}_N$, $\|f^s(x_n) + f^o(x_n) - 2y_n\|_{\mathcal{Y}} \le \sup_{x\in\mathcal{X}} \|f^s(x) + f^o(x)\|_{\mathcal{Y}}$; then we have the following upper bound:

$\Phi(f^s) - \Phi(f^o) \le (N s_\Gamma^2 + \lambda)\left( \frac{\alpha}{M} + 2\|f^o\|_{\mathcal{H}} \sqrt{\frac{\alpha}{M}} \right),$    (20)

where $\alpha = (s_\Gamma \|f^o\|_G)^2 - \|f^o\|_{\mathcal{H}}^2$, $\mathcal{H}$ is the RKHS corresponding to the reproducing kernel $\Gamma$, and $\lambda$ is the regularization parameter.
Proof. According to property (ii) of an RKHS in Section 2.1, for every $f \in \mathcal{H}$ and $x \in \mathcal{X}$ we have

$\|f(x)\|_{\mathcal{Y}} \le \|\Gamma(x,x)\|_{\mathcal{Y},\mathcal{Y}}^{1/2} \|f\|_{\mathcal{H}} \le s_\Gamma \|f\|_{\mathcal{H}},$    (21)

where $s_\Gamma = \sup_{x\in\mathcal{X}} \|\Gamma(x,x)\|_{\mathcal{Y},\mathcal{Y}}^{1/2}$. Thus we obtain

$\sup_{x\in\mathcal{X}} \|f(x)\|_{\mathcal{Y}} \le s_\Gamma \|f\|_{\mathcal{H}}.$    (22)

By the last inequality and equation (18), we obtain

$\Phi(f^s) - \Phi(f^o) = \sum_{n=1}^{N} \left( \|f^s(x_n) - y_n\|_{\mathcal{Y}}^2 - \|f^o(x_n) - y_n\|_{\mathcal{Y}}^2 \right) + \lambda \left( \|f^s\|_{\mathcal{H}}^2 - \|f^o\|_{\mathcal{H}}^2 \right)$
$= \sum_{n=1}^{N} \langle f^s(x_n) - f^o(x_n),\; f^s(x_n) + f^o(x_n) - 2y_n \rangle_{\mathcal{Y}} + \lambda \left( \|f^s\|_{\mathcal{H}} - \|f^o\|_{\mathcal{H}} \right)\left( \|f^s\|_{\mathcal{H}} + \|f^o\|_{\mathcal{H}} \right)$
$\le \sum_{n=1}^{N} \|f^s(x_n) - f^o(x_n)\|_{\mathcal{Y}}\, \|f^s(x_n) + f^o(x_n) - 2y_n\|_{\mathcal{Y}} + \lambda \|f^s - f^o\|_{\mathcal{H}} \left( \|f^s\|_{\mathcal{H}} + \|f^o\|_{\mathcal{H}} \right)$
$\le N \sup_{x\in\mathcal{X}} \|f^s(x) - f^o(x)\|_{\mathcal{Y}}\, \sup_{x\in\mathcal{X}} \|f^s(x) + f^o(x)\|_{\mathcal{Y}} + \lambda \|f^s - f^o\|_{\mathcal{H}} \left( \|f^s\|_{\mathcal{H}} + \|f^o\|_{\mathcal{H}} \right)$
$\le N s_\Gamma \|f^o - f^s\|_{\mathcal{H}}\; s_\Gamma \|f^o + f^s\|_{\mathcal{H}} + \lambda \|f^o - f^s\|_{\mathcal{H}} \left( \|f^o\|_{\mathcal{H}} + \|f^s\|_{\mathcal{H}} \right)$
$\le N s_\Gamma \sqrt{\frac{\alpha}{M}} \left( 2 s_\Gamma \|f^o\|_{\mathcal{H}} + s_\Gamma \sqrt{\frac{\alpha}{M}} \right) + \lambda \sqrt{\frac{\alpha}{M}} \left( 2\|f^o\|_{\mathcal{H}} + \sqrt{\frac{\alpha}{M}} \right)$
$= (N s_\Gamma^2 + \lambda) \left( \frac{\alpha}{M} + 2\|f^o\|_{\mathcal{H}} \sqrt{\frac{\alpha}{M}} \right),$    (23)

where $\alpha = (s_\Gamma \|f^o\|_G)^2 - \|f^o\|_{\mathcal{H}}^2$, $\|\cdot\|_G$ corresponds to the G-variation, and $G$ corresponds to $\mathcal{H}_N$ in our problem.
From this upper bound, we can easily derive that to achieve an approximation accuracy $\epsilon$, i.e. $\Phi(f^s) - \Phi(f^o) \le \epsilon$, the needed minimal number of basis functions satisfies

$M \le \alpha \left[ \frac{\epsilon}{N s_\Gamma^2 + \lambda} - 2\|f^o\|_{\mathcal{H}}^2 \left( \sqrt{1 + \frac{\epsilon}{(N s_\Gamma^2 + \lambda)\|f^o\|_{\mathcal{H}}^2}} - 1 \right) \right]^{-1}.$    (24)
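For completeness, the step from (20) to (24) is a quadratic inequality in $\sqrt{\alpha/M}$; writing $u = \frac{\epsilon}{(N s_\Gamma^2 + \lambda)\|f^o\|_{\mathcal{H}}^2}$ as a shorthand of our own, the derivation reads:

$(N s_\Gamma^2 + \lambda)\left(\frac{\alpha}{M} + 2\|f^o\|_{\mathcal{H}}\sqrt{\frac{\alpha}{M}}\right) \le \epsilon \;\Longleftrightarrow\; \left(\sqrt{\frac{\alpha}{M}} + \|f^o\|_{\mathcal{H}}\right)^2 \le \|f^o\|_{\mathcal{H}}^2 (1 + u) \;\Longleftrightarrow\; \sqrt{\frac{\alpha}{M}} \le \|f^o\|_{\mathcal{H}}\left(\sqrt{1+u} - 1\right).$

Squaring the last inequality and using the identity $\|f^o\|_{\mathcal{H}}^2\left(\sqrt{1+u}-1\right)^2 = \frac{\epsilon}{N s_\Gamma^2+\lambda} - 2\|f^o\|_{\mathcal{H}}^2\left(\sqrt{1+u}-1\right)$ gives the form of (24).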
3. Sparse approximation for robust vector field learning
A vector field is also a vector-valued function. The Tikhonov regularization treats all samples as inliers, which ignores the issue of robustness: real-world data may often contain some unknown outliers. Recently, Zhao et al. [9] presented a robust vector-valued RLS method named Vector Field Consensus (VFC) for vector field learning, in which each sample is associated with a latent variable indicating whether it is an inlier or an outlier, and the EM algorithm is adopted for optimization. Besides, robust vector field learning has also been adopted in Gaussian processes, basically by using the so-called t-processes [37, 38]. In this section, we present a sparse approximation algorithm for VFC and apply it to the mismatch removal problem.
3.1. Sparse vector field consensus
Given a set of observed input-output pairs S = {(xn , yn ) : n ∈ IINN } as samples randomly
drawn from a vector field which may contain some unknown outliers, the goal is to learn
a mapping f to fit the inliers well. Due to the existence of outliers, it is desirable to have
a robust estimate of the mapping f . There are two choices: (i) to build a more complex
model that includes the outliers – which involves modeling the outlier process using extra
(hidden) variables which enable us to identify and reject outliers, or (ii) to use an estimator
which is less sensitive to outliers, as described in Huber’s robust statistics [39]. In this
paper, we use the first scenario. In the following we make an assumption that, the vector
field samples in the inlier class have Gaussian noise with zero mean and uniform standard
deviation σ, while the ones in the outlier class are uniformly distributed
12
1
a
with a being the
volume of the output domain. Let γ be the percentage of inliers which we do not know in
T T
T T
T
advance, X = (xT
1 , · · · , xN ) and Y = (y1 , · · · , yN ) be DN × 1 dimensional vectors, and
θ = {f , σ 2 , γ} the unknown parameter set. The likelihood is then a mixture model as:
N Y
ky −f (xn )k2
γ
1−γ
− n
2
2σ
p(Y|X, θ) =
.
(25)
e
+
(2πσ 2 )D/2
a
n=1
We model the mapping $f$ in a vector-valued RKHS $\mathcal{H}$ with reproducing kernel $\Gamma$, and impose a smoothness constraint on it, i.e. $p(f) \propto e^{-\frac{\lambda}{2}\|f\|_{\mathcal{H}}^2}$. Therefore, we can estimate a MAP solution of $\theta$ by using the Bayes rule as $\theta^* = \arg\max_\theta p(Y|X,\theta)\, p(f)$. An iterative EM algorithm can be used to solve this problem. We associate sample $n$ with a latent variable $z_n \in \{0, 1\}$, where $z_n = 1$ indicates the Gaussian distribution and $z_n = 0$ the uniform distribution. We follow the standard notation [40] and omit terms that are independent of $\theta$. The complete-data log posterior is then given by

$Q(\theta, \theta^{old}) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} P(z_n = 1 | x_n, y_n, \theta^{old})\, \|y_n - f(x_n)\|^2 - \frac{D}{2} \ln\sigma^2 \sum_{n=1}^{N} P(z_n = 1 | x_n, y_n, \theta^{old})$
$\quad + \ln(1-\gamma) \sum_{n=1}^{N} P(z_n = 0 | x_n, y_n, \theta^{old}) + \ln\gamma \sum_{n=1}^{N} P(z_n = 1 | x_n, y_n, \theta^{old}) - \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2.$    (26)
E-step: Denote P = diag(p1 , . . . , pN ), where the probability pn = P (zn = 1|xn , yn , θ old )
can be computed by applying Bayes rule:
$p_n = \frac{\gamma\, e^{-\frac{\|y_n - f(x_n)\|^2}{2\sigma^2}}}{\gamma\, e^{-\frac{\|y_n - f(x_n)\|^2}{2\sigma^2}} + (1-\gamma)\, \frac{(2\pi\sigma^2)^{D/2}}{a}}.$    (27)
The posterior probability pn is a soft decision, which indicates to what degree the sample n
agrees with the current estimated vector field f .
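As a concrete illustration of the E-step, the following sketch evaluates the posterior $p_n$ of equation (27) for all samples at once; the function name and the vectorized NumPy formulation are our own choices.

```python
import numpy as np

def e_step_posteriors(Y, V, sigma2, gamma, a):
    """Posterior probability p_n = P(z_n = 1 | x_n, y_n) from equation (27).
    Y, V: N x D arrays of observed outputs y_n and current predictions f(x_n)."""
    N, D = Y.shape
    sq_res = np.sum((Y - V) ** 2, axis=1)                    # ||y_n - f(x_n)||^2
    inlier = gamma * np.exp(-sq_res / (2.0 * sigma2))        # unnormalized Gaussian term
    outlier = (1.0 - gamma) * (2.0 * np.pi * sigma2) ** (D / 2.0) / a
    return inlier / (inlier + outlier)                       # N-vector of posteriors
```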
M-step: We determine the revised parameter estimate $\theta^{new} = \arg\max_\theta Q(\theta, \theta^{old})$. Taking the derivatives of $Q(\theta)$ with respect to $\sigma^2$ and $\gamma$ and setting them to zero, we obtain

$\sigma^2 = \frac{(Y - V)^T \tilde{P}\, (Y - V)}{D \cdot \mathrm{tr}(P)},$    (28)

$\gamma = \mathrm{tr}(P)/N,$    (29)

where $V = (f(x_1)^T, \cdots, f(x_N)^T)^T$ is a $DN \times 1$ dimensional vector and $\mathrm{tr}(\cdot)$ denotes the trace.
Considering the terms of the objective function $Q$ in equation (26) that are related to $f$, the vector field can be estimated by minimizing an energy function

$E(f) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} p_n \|y_n - f(x_n)\|^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2.$    (30)

This is a regularized risk functional (7) with the $\ell_2$ loss function (9). To estimate the vector field $f$, one needs to solve a linear system similar to equation (10), which takes up most of the run-time and memory requirements of the algorithm. Obviously, the sparse approximation can be used here to reduce the time and space complexity.

Using the sparse approximation, we search for a suboptimal $f^s$ which has the form of equation (12), with the coefficients $c_m$ determined by a linear system similar to equation (16):

$(\tilde{U}^T \tilde{P} \tilde{U} + \lambda\sigma^2 \tilde{\Gamma})\, C = \tilde{U}^T \tilde{P}\, Y.$    (31)

Since it is a sparse approximation to the vector field consensus algorithm, we name our method SparseVFC. In summary, compared with the original VFC algorithm, we estimate the vector field $f$ by solving the linear system (31) in SparseVFC rather than equation (10) in VFC. Our SparseVFC algorithm is outlined in Algorithm 1.
The original paper [9] also presented a fast implementation of VFC named FastVFC. To reduce the time complexity, it first uses a low-rank matrix approximation by computing the singular value decomposition (SVD) of the kernel matrix $\tilde{\Gamma}$ of size $DN \times DN$, and then applies the Woodbury matrix identity to invert the coefficient matrix in the linear system for solving $C$ [41].
Algorithm 1: The SparseVFC Algorithm

Input: Training set $S = \{(x_n, y_n) : n \in \mathbb{N}_N\}$, matrix kernel $\Gamma$, regularization constant $\lambda$, basis function number $M$
Output: Vector field $f$, inlier set $I$

1.  Initialize $a$, $\gamma$, $V = 0_{DN\times 1}$, $P = I_{N\times N}$, and $\sigma^2$ by equation (28);
2.  Randomly choose $M$ basis functions from the training inputs and construct the kernel matrix $\tilde{\Gamma}$;
3.  repeat
4.    E-step:
5.      Update $P = \mathrm{diag}(p_1, \ldots, p_N)$ by equation (27);
6.    M-step:
7.      Update $C$ by solving the linear system (31);
8.      Update $V$ by using equation (12);
9.      Update $\sigma^2$ and $\gamma$ by equations (28) and (29);
10. until $Q$ converges;
11. The vector field $f$ is determined by equation (12);
12. The inlier set is $I = \{n : p_n > \tau,\; n \in \mathbb{N}_N\}$, with $\tau$ being a predefined threshold.

Computational complexity. Notice that $\tilde{P}$ is a diagonal matrix, so we can compute $\tilde{U}^T \tilde{P}$ in the linear system (31) by multiplying the $n$-th diagonal element of $\tilde{P}$ with the $n$-th column of $\tilde{U}^T$; thus the time complexity of SparseVFC for mismatch removal is reduced to $O(mD^3M^2N + mD^3M^3)$, where $m$ is the number of EM iterations. The space complexity of SparseVFC scales like $O(D^2MN + D^2M^2)$ due to the memory requirements for storing the matrix $\tilde{U}$ and kernel $\tilde{\Gamma}$.

Typically, the required number of basis functions for the sparse approximation is much smaller than the number of data points, i.e. $M \ll N$. In this paper, we apply SparseVFC to the mismatch removal problem on 2D and 3D images, in which the number of point matches $N$ is typically on the order of $10^3$, and the required number of basis functions $M$ is on the order of $10^1$. Therefore, both the time and space complexities of SparseVFC can simply be written as $O(N)$. Table 1 summarizes the time and space complexities of VFC, FastVFC and SparseVFC. Compared to VFC and FastVFC, the time and space complexities are reduced from $O(N^3)$ and $O(N^2)$ to both $O(N)$. This is significant for large training sets. Note that the time complexity of FastVFC is still $O(N^3)$, because the SVD of a matrix of size $N \times N$ has time complexity $O(N^3)$.

Table 1: Comparison of computational complexity.

        VFC [9]   FastVFC [9]   SparseVFC
time    O(N^3)    O(N^3)        O(N)
space   O(N^2)    O(N^2)        O(N)
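To make the overall flow of Algorithm 1 concrete, here is a compact NumPy sketch of the EM loop with a diagonal Gaussian kernel $\Gamma(x_i, x_j) = e^{-\beta\|x_i - x_j\|^2} I$ (the choice adopted later in Section 3.2). It is a simplified reading rather than a faithful reimplementation: the crude initialization of $a$, the fixed iteration budget in place of the convergence test on $Q$, and the absence of a duplicate-basis check are our own simplifications.

```python
import numpy as np

def sparse_vfc(X, Y, M=15, beta=0.1, lam=3.0, gamma=0.9, tau=0.75,
               max_iter=50, rng=None):
    """Sketch of SparseVFC (Algorithm 1) for motion-field samples (X: N x P, Y: N x D).
    Returns the basis points, the coefficients and the estimated inlier indices."""
    rng = np.random.default_rng() if rng is None else rng
    N, D = Y.shape
    a = np.prod(Y.max(axis=0) - Y.min(axis=0))      # crude volume of the output domain
    Xb = X[rng.choice(N, size=M, replace=False)]    # random basis points (step 2)
    U = np.exp(-beta * np.sum((X[:, None, :] - Xb[None, :, :]) ** 2, axis=2))    # N x M
    G = np.exp(-beta * np.sum((Xb[:, None, :] - Xb[None, :, :]) ** 2, axis=2))   # M x M
    V = np.zeros_like(Y)                            # initial field f = 0 (step 1)
    sigma2 = np.sum((Y - V) ** 2) / (D * N)         # equation (28) with P = I
    p = np.ones(N)
    for _ in range(max_iter):                       # fixed budget instead of the Q test (step 10)
        # E-step: posteriors p_n from equation (27).
        sq_res = np.sum((Y - V) ** 2, axis=1)
        num = gamma * np.exp(-sq_res / (2 * sigma2))
        p = num / (num + (1 - gamma) * (2 * np.pi * sigma2) ** (D / 2) / a)
        # M-step: coefficients via the scalar-kernel linear system, then field, sigma^2, gamma.
        A = U.T @ (p[:, None] * U) + lam * sigma2 * G
        C = np.linalg.solve(A, U.T @ (p[:, None] * Y))          # M x D coefficients
        V = U @ C                                               # f(x_n) at the training inputs
        sigma2 = np.sum(p * np.sum((Y - V) ** 2, axis=1)) / (D * p.sum())   # equation (28)
        gamma = p.sum() / N                                     # equation (29)
    inliers = np.flatnonzero(p > tau)               # step 12 of Algorithm 1
    return Xb, C, inliers
```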
Relation to FastVFC. There is some relationship between our SparseVFC algorithm and the FastVFC algorithm. On the one hand, both algorithms use an approximation related to the matrix-valued kernel $\Gamma$ to search for a suboptimal solution rather than the optimal solution, and thereby reduce the time complexity. On the other hand, our algorithm is clearly superior to FastVFC, as can be seen from both the time and space complexities shown in Table 1. FastVFC makes a low-rank matrix approximation of the kernel matrix $\tilde{\Gamma}$ itself, which does not change the size of $\tilde{\Gamma}$. Therefore, the memory requirement is not decreased, and it is still not implementable for large training sets. Moreover, the FastVFC algorithm has the same time complexity $O(N^3)$ as VFC, due to the SVD operation in FastVFC and the kernel matrix inversion in VFC, respectively. The difference is that the SVD in FastVFC needs to be performed only once, while the matrix inversion in VFC needs to be performed in each EM iteration, and hence an acceleration is achieved in FastVFC. In contrast, SparseVFC uses a sparse representation and chooses far fewer basis functions to approximate the function space, leading to a significant reduction of the size of the corresponding reproducing kernel matrix. This sparse approximation (even with randomly chosen basis functions) not only significantly reduces both the time and space complexities, but also does not sacrifice accuracy; in some situations it even achieves slightly better performance than the original VFC algorithm (as shown in our experiments).
3.2. Application to mismatch removal
In this section, we focus on establishing accurate point correspondences between two
images of the same scene. Many of the computer vision tasks such as building 3D models, registration, object recognition, tracking, and structure and motion recovery start by
assuming that the point correspondences have been successfully recovered [42].
Point correspondences between two images are in general established by first detecting interest points and then matching the detected points based on local descriptors [43].
This may result in a number of mismatches (outliers) due to viewpoint changes, occlusions, repeated structures, etc. The existence of mismatches is usually enough to ruin the
traditional estimation methods. In this case, a robust estimator is desirable to remove mismatches [44, 45, 46, 47, 48, 49]. In our SparseVFC, as shown in the last line of Algorithm 1, whether a sample is an inlier or an outlier can be determined by its posterior probability after the EM algorithm converges. Using this property, we apply SparseVFC to the mismatch removal problem.
Next, we point out some key issues.
Vector field introduced by image pairs. We first make a linear re-scaling of the point
correspondences so that the positions of feature points in the first and second images both
have zero mean and unit variance. Let the input x ∈ IRP be the location of a normalized
point in the first image, and the output y ∈ IRD be the corresponding displacement of that
point in the second image; the matches can then be converted into a motion-field training set. For 2D images $P = D = 2$; for 3D surfaces $P = D = 3$. Fig. 1 illustrates the 2D image case schematically.

Figure 1: Schematic illustration of the motion field introduced by image pairs. Left: an image pair and its putative matches; right: motion field samples introduced by the point matches in the left figure. ◦ and × indicate feature points in the first and second images, respectively.
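A minimal sketch of the conversion from putative matches to motion-field samples is given below; interpreting the linear re-scaling as a per-axis normalization to zero mean and unit variance is our reading of the description above, and the function name is hypothetical.

```python
import numpy as np

def matches_to_field_samples(pts1, pts2):
    """Convert putative matches (pts1[i] <-> pts2[i]) into motion-field samples (x_n, y_n).
    Points in each image are re-scaled to zero mean and unit variance per axis;
    the output y_n is the displacement of the normalized point between the images."""
    def normalize(pts):
        pts = np.asarray(pts, dtype=float)
        return (pts - pts.mean(axis=0)) / pts.std(axis=0)
    x = normalize(pts1)                      # normalized positions in the first image
    y = normalize(pts2) - x                  # displacement towards the second image
    return x, y
```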
Kernel selection. The kernel plays a central role in regularization theory, as it provides a flexible and computationally feasible way to choose an RKHS. Usually, for the mismatch removal problem, the structure of the generated vector field is relatively simple. We simply choose a diagonal decomposable kernel [8, 9]: $\Gamma(x_i, x_j) = e^{-\beta\|x_i - x_j\|^2} I$. Then we can solve a more efficient linear system instead of equation (31):

$(U^T P U + \lambda\sigma^2 \Gamma)\,\tilde{C} = U^T P\, \tilde{Y},$    (32)

where the kernel matrix $\Gamma \in \mathbb{R}^{M\times M}$ with $\Gamma_{ij} = e^{-\beta\|\tilde{x}_i - \tilde{x}_j\|^2}$, $U \in \mathbb{R}^{N\times M}$ with $U_{ij} = e^{-\beta\|x_i - \tilde{x}_j\|^2}$, and $\tilde{C} = (c_1, \cdots, c_M)^T$ and $\tilde{Y} = (y_1, \cdots, y_N)^T$ are $M \times D$ and $N \times D$ matrices, respectively. Here $N$ is the number of putative matches, and $M$ is the number of bases. It should be noted that solving the vector field learning problem with this diagonal decomposable kernel is not equivalent to solving $D$ independent scalar problems, since the updates of the posterior probability $p_n$ and variance $\sigma^2$ in equations (27) and (28) are determined by all components of the output $y_n$.

When SparseVFC is applied to mismatch removal, one issue deserves attention. We must ensure that the point set $\{\tilde{x}_m : m \in \mathbb{N}_M\}$ used to construct the basis functions does not contain duplicate points, since in that case the coefficient matrix in the linear system (32), i.e. $(U^T P U + \lambda\sigma^2\Gamma)$, will be singular. Obviously, this may happen in the mismatch removal problem, since in the putative match set one point in the first image may be matched to several points in the second image.
3.3. Implementation Details
There are mainly four parameters in the SparseVFC algorithm: γ, λ, τ and M . Parameter γ reflects our initial assumption on the amount of inliers in the correspondence sets.
Parameter λ reflects the amount of the smoothness constraint which controls the trade-off
between the closeness to the data and the smoothness of the solution. Parameter τ is a
threshold, which is used for deciding the correctness of a match. In general, our method
is very robust to these parameters. We set γ = 0.9, λ = 3, and τ = 0.75 according to the
original VFC algorithm throughout this paper. Parameter M is the number of the basis
functions used for sparse approximation. The choice of M depends on both the data (i.e.,
the true vector field) and the assumed function space (i.e., the reproducing kernel), as shown
in equation (24). We will discuss its choice in the experiments for each specific application.
It should be noted that in practice we do not determine the value of M according to equation (24). On the one hand, it is derived in the context of providing a theoretical upper bound, and in practice the value of M required to achieve a good approximation may be
much smaller than this bound. On the other hand, the upper bound in equation (24) is hard
to compute, and it is costly to derive a value of M for each sample set.
4. Experimental setup
In our evaluation we consider synthetic 2D vector field estimation, mismatch removal on
2D real images, and 3D surfaces. All the experiments are performed on an Intel Core2 2.5 GHz PC using Matlab code. Next, we discuss the datasets and evaluation criteria.
Synthetic 2D vector field: The synthetic vector field is constructed from a scalar function defined by a mixture of five Gaussians, which have the same covariance 0.25I and are centered at
(0, 0), (1, 0), (0, 1), (−1, 0) and (0, −1) respectively, as in [8]. Its gradient and perpendicular
gradient indicate a divergence-free and a curl-free field respectively. The synthetic data is
then constructed by taking a convex combination of these two vector fields. In our evaluation
the combination coefficient is set to 0.5, and then we get the synthetic field as shown in Fig. 2.
The field is computed on a 70 × 70 grid over the square [−2, 2] × [−2, 2].
The training inliers are uniformly sampled points from the grid, and we add Gaussian
noise with zero mean and uniform standard deviation 0.1 on the outputs. The outliers are
generated as follows: the input x is chosen randomly from the grid; the output y is generated
randomly from a uniform distribution on the square [−2, 2] × [−2, 2]. The performance for
vector field learning is measured by an angular measure of error [50] between the learned
vector of VFC and the ground truth. If $v_g = (v_g^1, v_g^2)$ and $v_e = (v_e^1, v_e^2)$ are the ground-truth and estimated fields, we consider the transformation $v \to \tilde{v} = \frac{1}{\|(v^1, v^2, 1)\|}(v^1, v^2, 1)$. The error measure is defined as $err = \arccos(\langle \tilde{v}_e, \tilde{v}_g\rangle)$.
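The error measure can be written directly as code; clipping the inner product before the arccos is a numerical safeguard we add, not part of the definition.

```python
import numpy as np

def angular_error(v_est, v_gt):
    """Angular error between estimated and ground-truth 2D vectors (arrays of shape N x 2),
    following the transformation v -> (v1, v2, 1) / ||(v1, v2, 1)||."""
    def lift(v):
        h = np.concatenate([v, np.ones((v.shape[0], 1))], axis=1)
        return h / np.linalg.norm(h, axis=1, keepdims=True)
    cos = np.sum(lift(v_est) * lift(v_gt), axis=1)
    return np.arccos(np.clip(cos, -1.0, 1.0))        # per-sample error in radians
```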
Figure 2: Visualization of a synthetic 2D field without noise and its 500 random samples used in experiment.
2D image datasets: We tested our method on the dataset of Mikolajczyk et al. [51]
and Tuytelaars et al. [52], and several image pairs of non-rigid objects. The images in the
first dataset are either of planar scenes or the camera position was fixed during acquisition.
Therefore, the images are always related by a homography. The test data of Tuytelaars et
al. contains several wide baseline image pairs. The image pairs of non-rigid objects were made by ourselves for this paper.
The open source VLFeat toolbox [53] is used to determine the initial SIFT correspondences [43]. All parameters are set to their default values except for the distance ratio threshold t. Usually, the larger t is, the fewer matches are obtained but with a higher correct match percentage. The match correctness is determined by computing an
overlap error [51], as in [9].
3D surfaces datasets: For 3D case, we consider two datasets used in [54]: the Dino and
Temple datasets. Each surface pair represents the same rigid object, which can be aligned using a rotation, a translation and a scale.
We determine the initial correspondences by using the method of Zaharescu et al. [54].
The feature point detector is called MeshDOG, which is a generalization of the difference
of Gaussian (DOG) operator [43]. The feature descriptor is called MeshHOG, and it is a
generalization of the histogram of oriented gradients (HOG) descriptor [55]. The match
correctness is determined as follows. For these two datasets, the correspondences between
the two surfaces can be formulated as $y = sRx + t$, where $R \in \mathbb{R}^{3\times 3}$ is a rotation matrix, $s$ is a scaling parameter, and $t \in \mathbb{R}^{3\times 1}$ is a translation vector. We can use a robust rigid point registration method such as Coherent Point Drift (CPD) [41] to solve for these three parameters, and the match correctness can then be determined accordingly.
5. Experimental results
To test the performance of the sparse approximation algorithm, we perform experiments
on the proposed SparseVFC algorithm, and compare its approximation accuracy and time efficiency to those of VFC and FastVFC on both synthetic data and real-world images. The experiments are conducted from two aspects: (i) vector field learning performance comparison on a synthetic vector field; (ii) mismatch removal performance comparison on real image datasets.
5.1. Vector field learning on synthetic vector field
We perform a representative experiment on a synthetic 2D vector field shown in Fig. 2.
For each cardinality of the training set, we repeat the experimental process 10 times with different randomly drawn examples. After the vector field is learned, we use it to predict
the outputs on the whole grid and compare them to the ground truth. The experimental
results are evaluated by means of test error (i.e. error between prediction and ground truth
on the whole grid) and run-time speedup factor (i.e. ratio of run-time of VFC to run-time
of FastVFC or SparseVFC) for clarity. The matrix-valued kernel is chosen to be a convex
combination of the divergence-free kernel $\Gamma_{df}$ and curl-free kernel $\Gamma_{cf}$ [56] with width $\tilde{\sigma} = 0.8$, and the combination coefficient is set to 0.5. The divergence-free and curl-free kernels have the following forms:

$\Gamma_{df}(x, x') = \frac{1}{\tilde{\sigma}^2}\, e^{-\frac{\|x-x'\|^2}{2\tilde{\sigma}^2}} \left[ \frac{x-x'}{\tilde{\sigma}} \left(\frac{x-x'}{\tilde{\sigma}}\right)^T + \left((D-1) - \frac{\|x-x'\|^2}{\tilde{\sigma}^2}\right) I \right],$    (33)

$\Gamma_{cf}(x, x') = \frac{1}{\tilde{\sigma}^2}\, e^{-\frac{\|x-x'\|^2}{2\tilde{\sigma}^2}} \left[ I - \frac{x-x'}{\tilde{\sigma}} \left(\frac{x-x'}{\tilde{\sigma}}\right)^T \right].$    (34)
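For reference, the two matrix-valued kernels (33) and (34) and their convex combination can be evaluated as below for a single pair of points; the function name is our own, while the default width and combination weight mirror the settings above.

```python
import numpy as np

def df_cf_kernel(x, xp, sigma=0.8, w=0.5):
    """Convex combination w * Gamma_df + (1 - w) * Gamma_cf of the divergence-free
    and curl-free kernels in equations (33) and (34), for a single pair (x, x')."""
    x, xp = np.asarray(x, float), np.asarray(xp, float)
    D = x.size
    d = (x - xp) / sigma                                 # (x - x') / sigma_tilde
    g = np.exp(-np.dot(d, d) / 2.0) / sigma ** 2         # common scalar factor
    outer = np.outer(d, d)
    gamma_df = g * (outer + ((D - 1) - np.dot(d, d)) * np.eye(D))
    gamma_cf = g * (np.eye(D) - outer)
    return w * gamma_df + (1.0 - w) * gamma_cf
```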
Figure 3: Experiments for choosing the number of sample points M used for the sparse approximation. (a) The standard deviation of the estimated inliers σ* with respect to M. (b) The test error with respect to M. The inlier number in the training set is set to 500 and 800, and the inlier percentage is varied among 20%, 50% and 80%.
In order to choose an appropriate value of M on this synthetic field, we assess the
performance of SparseVFC with respect to M under different experimental setups. Rather than using the ground truth to assess performance, i.e. the mean test error, we use the standard deviation of the estimated inliers [44]:

$\sigma^* = \left( \frac{1}{|I|} \sum_{i\in I} \|y_i - f(x_i)\|^2 \right)^{1/2},$    (35)
where I is the estimated inlier set in the training samples, and | · | denotes the cardinality
of a set. Generally speaking, a smaller value of σ* indicates a better estimation. The result is presented in Fig. 3a; we can see that M = 60 reaches a good compromise between accuracy and computational complexity. For comparison, we also present the mean test error in Fig. 3b. We can see that these two criteria tend to produce similar choices of M. For FastVFC, the 150 largest eigenvalues are used for calculating the low-rank matrix approximation.
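The criterion (35) is straightforward to compute once the field has been estimated; the following helper, with hypothetical argument names, evaluates it from the predicted field values and the estimated inlier indices.

```python
import numpy as np

def inlier_std(Y, V, inlier_idx):
    """Standard deviation of the estimated inliers, sigma* of equation (35).
    Y, V: N x D observed outputs and predicted field values; inlier_idx: inlier indices."""
    res = Y[inlier_idx] - V[inlier_idx]
    return np.sqrt(np.mean(np.sum(res ** 2, axis=1)))
```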
We first give an intuitive impression of the performance of our method, as shown in Fig. 4. We see that SparseVFC can successfully recover the vector field from sparse training samples, and the performance gets better as the number of inliers in the training set increases.

Figure 4: Reconstruction of the field shown in Fig. 2a via SparseVFC. Left: 200 inliers contained in the training set; right: 500 inliers contained in the training set. The inlier ratios are both 0.5. The mean test errors are about 0.072 and 0.044, respectively.

Fig. 5 and Fig. 6 show the vector field learning performance of VFC, FastVFC and
SparseVFC. We consider two scenarios for performance comparison: (i) fix the inlier number
and change the inlier percentage; (ii) fix the inlier percentage and change the inlier number.
Fig. 5 shows the performance comparison on test error under different experimental setups; we see that both SparseVFC and FastVFC perform much the same as VFC. Fig. 6 shows a performance comparison on the average run-time. From Fig. 6a, we see that the computational cost of SparseVFC is approximately linear in the training set size. The run-time speedup factors of FastVFC and SparseVFC with respect to VFC are presented in Fig. 6b. We see that the speedup factor of FastVFC is roughly constant, since its time complexity is the same as that of VFC, both being O(N^3). However, compared to FastVFC, SparseVFC leads to a substantial speedup, and this advantage is significantly magnified for larger training sets.
In conclusion, when an appropriate number of basis functions is chosen, the SparseVFC algorithm approximates the original VFC algorithm quite well, with a significant run-time speedup.
5.2. Mismatch removal on real images
We notice that the algorithm demonstrates a strong ability for mismatch removal. Thus we apply it to the mismatch removal problem and perform experiments on a wide range of real images. The performance is characterized by precision and recall.

Figure 5: Performance comparison on test error. (a) Comparison of test error when changing the inlier ratio in the training sets; the inlier number is fixed to 500. (b) Comparison of test error when changing the inlier number in the training sets; the inlier ratio is fixed to 0.2.

Figure 6: Performance comparison on run-time under the experimental setup of Fig. 5b. (a) Run-time of SparseVFC. (b) Run-time speedup of FastVFC and SparseVFC with respect to VFC.

Besides VFC and
FastVFC, we use three additional mismatch removal methods for comparison: RANdom
SAmple Consensus (RANSAC) [57], Maximum Likelihood Estimation SAmple Consensus
(MLESAC) [44] and Identifying point correspondences by Correspondence Function (ICF)
[47].

Table 2: The numbers of matches and initial inlier percentages in the six image pairs.

             Tree     Valbonne   Mex      DogCat   Peacock   T-shirt
#matches     167      126        158      113      236       300
inlier pct.  56.29%   54.76%     51.90%   82.30%   71.61%    60.67%

RANSAC tries to obtain as small an outlier-free subset as feasible to estimate a given parametric model by resampling, while its variant MLESAC adopts a new cost function using a weighted voting strategy based on an M-estimator, and chooses the solution that maximizes the likelihood rather than the inlier count as in RANSAC. The ICF method uses support
vector regression to learn a correspondence function pair which maps points in one image to
their corresponding points in another, and then rejects the mismatches by checking whether
they are consistent with the estimated correspondence functions.
The structure of the generated motion field is relatively simple in the mismatch removal
problem. We simply choose a diagonal decomposable kernel with Gaussian form in the scalar
part, and we set β = 0.1 according to the original VFC algorithm. Moreover, the number of
basis functions M used for sparse approximation is fixed in our evaluation, since choosing
M adaptively would require some pre-processing which would increase the computational
cost. We tune the value of M using a small test set, and find that using 15 basis functions for the sparse approximation is accurate enough. So we set M = 15 throughout the experiments. For FastVFC, the 10 largest eigenvalues are used for calculating the low-rank matrix approximation.
5.2.1. Results on several 2D image pairs
Our first experiment involves mismatch removal on several image pairs, including three wide baseline image pairs (Tree, Valbonne and Mex) and three image pairs of non-rigid objects (DogCat, Peacock and T-shirt). Table 2 presents the numbers of matches as well as the initial correct match percentages in these six image pairs.
The whole mismatch removal progress on these image pairs is illustrated schematically in Fig. 7. The columns show the iterative progress; the level of the blue color indicates to what degree a sample is an inlier, which is also the posterior probability $p_n$ in equation (30).

Figure 7: Results on image pairs of Tree, Valbonne, Mex, DogCat, Peacock and T-shirt. The first three rows are wide baseline image pairs, and the remaining three are image pairs of non-rigid objects. The columns (initial correspondences, initialization, iteration 1, iteration 5, convergence, final correspondences) show the iterative mismatch removal progress, and the level of the blue color indicates to what degree a sample is an inlier. For visibility, only 50 randomly selected matches are presented in the first column.
Table 3: Performance comparison with different mismatch removal algorithms. The pairs in the table are precision-recall pairs (%).

           RANSAC [57]       MLESAC [44]       ICF [47]         VFC [9]            FastVFC [9]        SparseVFC
Tree       (94.68, 94.68)    (98.82, 89.36)    (92.75, 68.09)   (94.85, 97.87)     (94.79, 96.81)     (94.85, 97.87)
Valbonne   (94.52, 100.00)   (94.44, 98.55)    (91.67, 63.77)   (98.33, 85.51)     (98.33, 85.51)     (98.33, 85.51)
Mex        (91.76, 95.12)    (93.83, 92.68)    (96.15, 60.98)   (96.47, 100.00)    (96.47, 100.00)    (96.47, 100.00)
DogCat     -                 -                 (92.19, 63.44)   (100.00, 100.00)   (100.00, 100.00)   (100.00, 100.00)
Peacock    -                 -                 (99.12, 66.86)   (99.40, 98.82)     (99.40, 98.22)     (99.40, 98.82)
T-shirt    -                 -                 (99.07, 58.79)   (98.88, 96.70)     (98.84, 93.41)     (98.88, 96.70)
In the beginning, all the SIFT matches in the first column are assumed to be inliers. We convert them into motion field training sets, which are shown in the 2nd column. As the EM iteration proceeds, progressively more refined matches are shown in the 3rd, 4th and 5th columns. The 5th column shows that the algorithm converges to a nearly binary decision on the match correctness. The SIFT matches retained by the algorithm are presented in the last column. It should be noted that there is an underlying assumption in
(13) should be small. However, for wide baseline image pairs such as Tree, Valbonne and
Mex, the related motion fields are in general not continuous, and our method is still effective
for mismatch removal. We give an explanation as follows: under the sparsity assumptions
of the training data, it is not hard to seek a smooth vector field which can fit nearly all
the inliers (sparse) well; if the goal is just to remove mismatches in the training data, the
smoothness constraint can work well.
Next, we give a performance comparison with the other five methods in Table 3. The
geometry model used in RANSAC and MLESAC is epipolar geometry. We see that MLESAC
has slightly better precision than RANSAC at the cost of a slightly lower recall. The recall of ICF is quite low, although it has a satisfactory precision. However, VFC, FastVFC and SparseVFC can successfully distinguish inliers from outliers, and they have the best trade-off between precision and recall. The low-rank matrix approximation used in FastVFC may slightly hurt the performance, while in SparseVFC the sparse approximation does not degrade performance; on the contrary, it makes the algorithm more efficient.
Notice that we did not compare to RANSAC and MLESAC on the image pairs of non-rigid objects. RANSAC and MLESAC depend on a parametric model, for example, a fundamental matrix. If some objects in the image pairs undergo deformation, these methods can no longer work, since the point pairs will no longer obey the epipolar geometry.
5.2.2. Results on a 2D image dataset
We test our method on the dataset of Mikolajczyk et al., which contains image transformations of viewpoint change, scale change, rotation, image blur, JPEG compression, and illumination change. We use all the 40 image pairs, and for each pair, we set the SIFT distance ratio
threshold t to 1.5, 1.3 and 1.0 respectively as in [9]. The cumulative distribution function
of original correct match percentage is shown in Fig. 8a. The initial average precision of
all image pairs is 69.58%, and nearly 30 percent of the training sets have correct match
percentage below 50%. Fig. 8b presents the cumulative distribution of the number of point
matches contained in the experimental image pairs. We see that most of the image pairs contain a large number of point matches (i.e. on the order of 1000).
The precision-recall pairs on this dataset are summarized in Fig. 8. The average precision-recall pairs are (95.49%, 97.55%), (97.95%, 96.93%), (93.95%, 62.69%) and (98.57%, 97.78%) for RANSAC, MLESAC, ICF and SparseVFC, respectively. Here we choose homography
as the geometry model in RANSAC and MLESAC. Note that the performances of VFC,
FastVFC and SparseVFC are quite close, thus we omit the results of VFC and FastVFC in
the figure for clarity. From the result, we see that ICF usually has high precision or recall, but
not simultaneously. MLESAC performs a little better than RANSAC, and they both achieve
quite satisfactory performance. This can be explained by the lack of complex constraints
between the elements of the homography matrix. Our method has good performance in
dealing with the mismatch removal problem, and it has the best trade-off between precision
and recall compared to the other three methods.

Figure 8: Experimental results on the dataset of Mikolajczyk et al. (a) Cumulative distribution function of the original correct match percentage. (b) Cumulative distribution function of the number of point matches in the image pairs. (c) Precision-recall statistics for RANSAC, MLESAC, ICF and SparseVFC on the dataset of Mikolajczyk. Our method (red circles, upper right corner) has the best precision and recall overall.

Table 4: Average precision-recall and run-time comparison of RANSAC, VFC, FastVFC and SparseVFC on the dataset of Mikolajczyk.

          RANSAC [57]      VFC [9]          FastVFC [9]      SparseVFC
(p, r)    (95.49, 97.55)   (98.57, 97.75)   (98.75, 96.71)   (98.57, 97.78)
t (ms)    3784             6085             402              21
We compare the approximation accuracy and time efficiency of SparseVFC to VFC and FastVFC in Table 4. As shown in the table, both FastVFC and SparseVFC approximate VFC quite
well, especially our method SparseVFC. Moreover, SparseVFC achieves a significant speedup
with respect to the original VFC algorithm, of about 300 times on average. The run-time
of RANSAC is also presented in Table 4, and we see that SparseVFC is much more efficient
than RANSAC. To prevent the efficiency of RANSAC from decreasing drastically, usually
a maximum sampling number is preset in the literature. We set the maximum sampling
number to 5000.
Table 5: Performance comparison using different matrix-valued kernels for robust learning. DK: decomposable kernel; DFK+CFK: combination of divergence-free and curl-free kernels.

             DK, ω = 0        DK, ω ≠ 0        DFK+CFK
VFC [9]      (98.57, 97.75)   (98.66, 97.91)   (98.69, 97.65)
SparseVFC    (98.57, 97.78)   (98.66, 97.93)   (98.67, 97.71)

The influence of different choices of basis functions for sparse approximation is also tested on this dataset. Besides random sampling, we consider three other methods: (i) simply use $\sqrt{M} \times \sqrt{M}$ uniform grid points over the bounded input space; (ii) find M cluster centers of the training inputs via the k-means clustering algorithm; (iii) pick M basis functions minimizing the residuals via sparse greedy matrix approximation [17].
The average precision-recall pairs of these three methods are (98.58%, 97.79%), (98.57%,
97.75%) and (98.58%, 97.79%) respectively. We see that all the three approximation methods
produce almost the same result as the random sampling method. Therefore, for the mismatch
removal problem, it does not seem that selecting the “optimal” subset using sophisticated
methods improves the performance compared to a random subset. Hence, in the interest of computational efficiency, we may be better off simply choosing a random subset of the training data.
Next, we study the use of different kernel functions for robust learning. Two additional matrix-valued kernels are tested: a decomposable kernel [8], $\Gamma(x_i, x_j) = e^{-\beta\|x_i - x_j\|^2} A$ with $A = \omega 1_{D\times D} + (1 - \omega D) I_{D\times D}$, and a convex combination of the divergence-free kernel $\Gamma_{df}$ and curl-free kernel $\Gamma_{cf}$. Note that the former kernel degenerates into a diagonal kernel when we set $\omega = 0$. In our evaluation, the combination coefficients in these two kernels are selected via cross-validation. The results are summarized in Table 5. We see that SparseVFC approximates VFC quite well under different matrix-valued kernels. Moreover, using a diagonal decomposable kernel is adequate for the mismatch removal problem; it reduces the complexity of the linear system, i.e. equation (32), with only a negligible decrease in accuracy.
Table 6: Performance comparison by choosing different numbers of basis functions.

M         5                10               15               20
(p, r)    (98.10, 96.88)   (98.57, 97.73)   (98.57, 97.78)   (98.57, 97.79)
t (ms)    12               15               21               49
Finally, we use this dataset to investigate the influence of the choice of M, the number of
basis functions. Besides the default value 15, three additional values of M (5, 10 and 20)
are tested, and the results are summarized in Table 6. We see that SparseVFC is not very
sensitive to the choice of M, and even M = 5 achieves satisfactory performance. It should
be noted that better approximation accuracy can be achieved by choosing M adaptively: for
example, first compute the standard deviation of the estimated inliers σ* (i.e. equation
(35)) under different choices of M, and then choose the M with the smallest value of σ*.
However, such a scheme would significantly increase the computational cost, since it requires
running the algorithm once for each candidate value of M. Therefore, fixing M to a sufficiently
large constant, e.g. 15 in this paper, achieves a good trade-off between approximation
accuracy and computational efficiency.
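The adaptive strategy described above can be written as a simple loop over candidate values; in the sketch below, fit_fn stands in for any SparseVFC-style solver that reports the estimated inlier standard deviation σ* of equation (35), which is an assumed interface rather than the released code.

    def choose_M_adaptively(fit_fn, matches, candidates=(5, 10, 15, 20)):
        # fit_fn(matches, M) is assumed to return an object with a .sigma
        # attribute (the sigma* of equation (35)); keep the model with the
        # smallest estimated inlier standard deviation.
        # Note: this multiplies the runtime by len(candidates).
        best = None
        for M in candidates:
            model = fit_fn(matches, M)
            if best is None or model.sigma < best.sigma:
                best = model
        return best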
5.2.3. Results on 3D surface pairs
Since VFC is not influenced by the dimension of the input data, we now test the mismatch
removal performance of SparseVFC on 3D surface pairs and compare it to the original
algorithm. Here the parameters are set to the same values as in the 2D case.

Figure 9: Mismatch removal results of SparseVFC on two 3D surface pairs: Dino and Temple. There are
325 and 239 initial matches in these two pairs respectively. For each group of results, the left pair denotes
the identified putative correct matches (Dino 263, Temple 216), and the right pair denotes the removed
putative mismatches (Dino 62, Temple 23). For visibility, at most 50 randomly selected correspondences
are presented.

The two surface pairs, Dino and Temple, are shown in Fig. 9. As shown, the capability
of SparseVFC is not weakened in this case. For the Dino dataset, there are 325 initial
correspondences with 264 correct matches and 61 mismatches; after using SparseVFC to
remove mismatches, 263 matches are preserved, all of which are correct. That is to say,
all the false matches are eliminated while discarding only 1 correct match. For the Temple
dataset, there are 239 initial correspondences with 215 correct matches and 24
mismatches; after using SparseVFC to remove mismatches, 216 matches are preserved, of
which 214 are correct; that is, 22 of the 24 false matches are eliminated while discarding
only 1 correct match. Performance comparisons are presented in Table 7, where we see that
both FastVFC and SparseVFC approximate the original VFC algorithm well. Therefore, the
sparse approximation used in our method works quite well, not only in the 2D case but also
in the 3D case.
Table 7: Performance comparison of VFC, FastVFC and SparseVFC on two 3D surface pairs: Dino and
Temple. The initial correct match percentages are about 81.23% and 89.96%, respectively.

             VFC [9]          FastVFC [9]       SparseVFC
Dino         (98.87, 99.62)   (100.00, 99.62)   (100.00, 99.62)
Temple       (99.07, 99.53)   (99.07, 99.53)    (99.07, 99.53)

6. Conclusion
In this paper, we study a sparse approximation algorithm for vector-valued regularized
least-squares. It searches for a suboptimal solution under the assumption that the solution
space can be represented sparsely with many fewer basis functions. The time and space
complexities of our algorithm are both linear in the number of training samples, and the
number of basis functions is manually assigned. We also present a new robust vector field
learning method called SparseVFC, which is a sparse approximation to VFC, and apply it to
the mismatch removal problem. The quantitative results on various experimental data
demonstrate that the sparse approximation leads to a vast increase in speed with a negligible
decrease in performance, and that SparseVFC outperforms several state-of-the-art methods
such as RANSAC on mismatch removal.
Our sparse approximation addresses a generic problem in vector field learning, and it
can be applied to other methods that can be cast as vector field learning problems, for
example the Coherent Point Drift algorithm [41] designed for point registration. The sparse
approximation of the Coherent Point Drift algorithm is described in detail in the appendix.
Our approach also has some limitations; for example, the sparse approximation is validated
only in the case of the ℓ2 loss. Our future work shall focus on: (i) determining the number
of basis functions automatically and efficiently, and (ii) validating the sparse approximation
under different types of loss functions.
Appendix: Sparse approximation for Coherent Point Drift
The Coherent Point Drift (CPD) algorithm [41] considers the alignment of two point sets as
a probability density estimation problem. Given two point sets X_{LD×1} = (x_1^T, · · ·, x_L^T)^T
and Y_{KD×1} = (y_1^T, · · ·, y_K^T)^T, with D being the dimension of the data points, the
algorithm assumes the points in Y are the Gaussian mixture model (GMM) centroids and the
points in X are the data points generated by the GMM. It then estimates the transformation T
which yields the best alignment between the transformed GMM centroids and the data points by
maximizing a likelihood. Here the transformation T is defined as an initial position plus
a displacement function f: T(y) = y + f(y), where f is assumed to come from an RKHS H
and hence has the form of equation (8).
Similar to our SparseVFC algorithm, the CPD algorithm adopts an iterative EM procedure to
alternately recover the spatial transformation and update the point correspondences. In the
M-step, recovering the displacement function f requires solving a linear system similar to
equation (10), which accounts for most of the run-time and memory requirements of the
algorithm. The sparse approximation can again be used here to reduce the time and space
complexity.
In each iteration of the CPD algorithm, the displacement function f can be estimated by
minimizing the energy function

    E(f) = 1/(2σ²) Σ_{l=1}^{L} Σ_{k=1}^{K} p_{kl} ‖x_l − y_k − f(y_k)‖² + (λ/2) ‖f‖²_H,        (36)

where p_{kl} is the posterior probability of the correspondence between the two points y_k
and x_l, and σ is the standard deviation of the GMM components.
Using the sparse approximation, we search for a suboptimal f^s of the form of equation (12).
Due to the choice of a diagonal decomposable kernel Γ(y_i, y_j) = e^{−β‖y_i − y_j‖²} I,
equation (36) becomes

    E(C) = 1/(2σ²) Σ_{l=1}^{L} ‖(diag(P_{·,l}) ⊗ I_{D×D})^{1/2} (X_l^K − Y − UC)‖² + (λ/2) C^T Γ̃ C,        (37)

where P = {p_{kl}} and P_{·,l} denotes the l-th column of P, the kernel matrix Γ̃ is an
M × M block matrix with (i, j)-th block Γ(ỹ_i, ỹ_j), U is a K × M block matrix with
(i, j)-th block Γ(y_i, ỹ_j), X_l^K = (x_l; · · ·; x_l) is a KD × 1 dimensional vector, and
C = (c_1; · · ·; c_M) is the coefficient vector.
Taking the derivative of equation (37) with respect to the coefficient vector C and setting
it to zero, we obtain a linear system

    (U^T diag(P1) U + λσ² Γ) C̃ = U^T P X̃ − U^T diag(P1) Ỹ,        (38)

where the kernel matrix Γ ∈ ℝ^{M×M} with Γ_{ij} = e^{−β‖ỹ_i − ỹ_j‖²}, U ∈ ℝ^{K×M} with
U_{ij} = e^{−β‖y_i − ỹ_j‖²}, 1 is a column vector of all ones, C̃ = (c_1, · · ·, c_M)^T,
X̃ = (x_1, · · ·, x_L)^T and Ỹ = (y_1, · · ·, y_K)^T. Here M is the number of bases.
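For concreteness, the M-step update of equation (38) could be organized as in the NumPy sketch below; the arrays Xm, Ym and Ytil hold the rows of X̃, Ỹ and the basis points ỹ_m, and the routine is a sketch under these assumptions rather than the authors' implementation.

    import numpy as np

    def sparse_cpd_m_step(Xm, Ym, Ytil, P, beta, lam, sigma2):
        # Solve equation (38) for the M x D coefficient matrix C_tilde.
        # Xm: L x D data points, Ym: K x D GMM centroids, Ytil: M x D basis
        # points, P: K x L posterior matrix, lam: regularization weight lambda,
        # sigma2: current GMM variance sigma^2.
        sqdist = lambda A, B: np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        Gamma = np.exp(-beta * sqdist(Ytil, Ytil))      # M x M Gram matrix on the bases
        U = np.exp(-beta * sqdist(Ym, Ytil))            # K x M
        dP1 = np.diag(P.sum(axis=1))                    # diag(P 1), K x K
        lhs = U.T @ dP1 @ U + lam * sigma2 * Gamma      # M x M
        rhs = U.T @ (P @ Xm) - U.T @ dP1 @ Ym           # M x D
        return np.linalg.solve(lhs, rhs)                # C_tilde, M x D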
Thus we obtain a suboptimal solution from the coefficient matrix C̃. This corresponds
to the optimal solution f^o, i.e. equation (8), with the coefficient matrix C̃ determined by
the linear system [41]

    (Γ + λσ² diag(P1)^{−1}) C̃ = diag(P1)^{−1} P X̃ − Ỹ,        (39)

where Γ ∈ ℝ^{K×K} with Γ_{ij} = e^{−β‖y_i − y_j‖²}, and C̃ = (c_1, · · ·, c_K)^T.
Since it is a sparse approximation of the CPD algorithm, we name this approach SparseCPD and summarize it in Algorithm 2.
Algorithm 2: The SparseCPD Algorithm
Input: Two point sets X and Y
Output: Transformation T
1  Parameter initialization, including M and all the parameters in CPD;
2  repeat
3      E-step:
4          Update the point correspondence;
5      M-step:
6          Update the coefficient vector C̃ by solving linear system (38);
7          Update the other parameters in CPD;
8  until convergence;
9  The transformation T is determined according to the coefficient vector C̃.
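After convergence, line 9 of the algorithm amounts to evaluating T(y) = y + f(y) with the sparse displacement field f(y) = Σ_m e^{−β‖y − ỹ_m‖²} c_m; a small sketch reusing the assumed array layout of the M-step snippet above is:

    import numpy as np

    def sparse_cpd_transform(Ym, Ytil, C_tilde, beta):
        # T(y_k) = y_k + sum_m exp(-beta * ||y_k - y~_m||^2) * c_m (diagonal kernel).
        # Ym: K x D points to transform, Ytil: M x D basis points,
        # C_tilde: M x D coefficient matrix from the M-step above.
        sq = np.sum((Ym[:, None, :] - Ytil[None, :, :]) ** 2, axis=-1)  # K x M
        return Ym + np.exp(-beta * sq) @ C_tilde                        # K x D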
References
[1] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, 1995.
[2] T. Evgeniou, M. Pontil, T. Poggio, Regularization Networks and Support Vector Machines, Advances in Computational Mathematics 13 (2000) 1–50.
[3] F. Girosi, M. Jones, T. Poggio, Regularization Theory and Neural Networks Architectures, Neural
Computation 7 (1995) 219–269.
[4] N. Aronszajn, Theory of Reproducing Kernels, Transactions of the American Mathematical Society 68
(1950) 337–404.
[5] R. Rifkin, G. Yeo, T. Poggio, Regularized Least-Squares Classification, in: Advances in Learning
Theory: Methods, Models and Applications, 2003.
[6] C. Saunders, A. Gammermann, V. Vovk, Ridge Regression Learning Algorithm in Dual Variables, in:
Proceedings of International Conference on Machine Learning, 1998, pp. 515–521.
[7] J. A. K. Suykens, J. Vandewalle, Least Squares Support Vector Machine Classifiers, Neural Processing
Letters 9 (1999) 293–300.
[8] L. Baldassarre, L. Rosasco, A. Barla, A. Verri, Multi-Output Learning via Spectral Filtering, Technical
Report, MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, 2011.
[9] J. Zhao, J. Ma, J. Tian, J. Ma, D. Zhang, A Robust Method for Vector Field Learning with Application to Mismatch Removing, in: Proceedings of IEEE conference on Computer Vision and Pattern
Recognition, 2011, pp. 2977–2984.
[10] C. A. Micchelli, M. Pontil, On Learning Vector-Valued Functions, Neural Computation 17 (2005)
177–204.
[11] E. Candes, T. Tao, Near-Optimal Signal Recovery from Random Projections: Universal Encoding
Strategies, IEEE Transactions on Information Theory 52 (2005) 5406–5425.
[12] S. S. Keerthi, O. Chapelle, D. DeCoste, Building Support Vector Machines with Reduced Classifier
Complexity, Journal of Machine Learning Research 7 (2006) 1493–1515.
[13] M. Wu, B. Scholkopf, G. Bakir, A Direct Method for Building Sparse Kernel Learning Algorithms,
Journal of Machine Learning Research 7 (2006) 603–624.
[14] P. Sun, X. Yao, Sparse Approximation Through Boosting for Learning Large Scale Kernel Machines,
IEEE Transactions on Neural Networks 21 (2010) 883–894.
[15] E. Tsivtsivadze, T. Pahikkala, A. Airola, J. Boberg, T. Salakoski, A Sparse Regularized Least-Squares
Preference Learning Algorithm, in: Proceedings of the Conference on Tenth Scandinavian Conference
on Artificial Intelligence, 2008, pp. 76–83.
[16] C. K. I. Williams, M. Seeger, Using the Nyström Method to Speed Up Kernel Machines, in: Advances
in Neural Information Processing Systems, 2000, pp. 682–688.
[17] A. J. Smola, B. Schölkopf, Sparse Greedy Matrix Approximation for Machine Learning, in: Proceedings
of International Conference on Machine Learning, 2000, pp. 911–918.
[18] Y.-J. Lee, O. Mangasarian, RSVM: Reduced Support Vector Machines, in: Proceedings of the SIAM
International Conference on Data Mining, 2001.
[19] S. Fine, K. Scheinberg, Efficient SVM Training Using Low-Rank Kernel Representations, Journal of
Machine Learning Research 2 (2001) 243–264.
[20] F. Girosi, An Equivalence Between Sparse Approximation and Support Vector Machines, Neural
Computation 10 (1998) 1455–1480.
[21] S. Chen, D. Donoho, M. Saunders, Atomic Decomposition by Basis Pursuit, Siam Journal of Scientific
Computing 20 (1999) 33–61.
[22] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, D. Gorinevsky, An Interior-Point Method for Large-Scale
ℓ1-Regularized Least Squares, IEEE Journal of Selected Topics in Signal Processing 1 (2007) 606–617.
[23] M. N. Gibbs, D. J. C. MacKay, Efficient Implementation of Gaussian Processes, Technical Report,
Department of Physics, Cavendish Laboratory, Cambridge University, Cambridge, 1997.
[24] J. C. Platt, Fast Training of Support Vector Machines Using Sequential Minimal Optimization, in:
Advances in Kernel Methods: Support Vector Learning, 1999, pp. 185–208.
[25] S. S. Keerthi, K. Duan, S. Shevade, A. Poo, A Fast Dual Algorithm for Kernel Logistic Regression,
Machine Learning 61 (2005) 151–165.
[26] L. Bottou, O. Bousquet, The Tradeoffs of Large Scale Learning, in: Advances in Neural Information
Processing Systems, 2008, pp. 161–168.
[27] M. Zinkevich, M. Weimer, A. Smola, L. Li, Parallelized Stochastic Gradient Descent, in: Advances in
Neural Information Processing Systems, 2010, pp. 2595–2603.
[28] Y. Tsuruoka, J. Tsujii, S. Ananiadou, Stochastic Gradient Descent Training for L1-regularized Loglinear Models with Cumulative Penalty, in: Proceedings of the Joint Conference of the 47th Annual
Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the
AFNLP, 2009, pp. 477–485.
[29] C. Carmeli, E. De Vito, A. Toigo, Vector Valued Reproducing Kernel Hilbert Spaces of Integrable
Functions and Mercer Theorem, Analysis and Applications 4 (2006) 377–408.
[30] T. Poggio, F. Girosi, Networks for Approximation and Learning, Proceedings of the IEEE 78 (1990)
1481–1497.
[31] J. Quiñonero-Candela, C. E. Ramussen, C. K. I. Williams, Approximation Methods for Gaussian
Process Regression, in: Large-Scale Kernel Machines, 2007, pp. 203–223.
[32] A. R. Barron, Universal Approximation Bounds for Superpositions of A Sigmoidal Function, IEEE
Transactions on Information Theory 39 (1993) 930–945.
[33] L. K. Jones, A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for
Projection Pursuit Regression and Neural Network Training, Annals of Statistics 20 (1992) 608–613.
[34] V. Kůrková, High-Dimensional Approximation by Neural Networks, in: Advances in Learning Theory:
Methods, Models and Applications, 2003, pp. 69–88.
[35] V. Kůrková, M. Sanguineti, Comparison of Worst Case Errors in Linear and Neural Network Approximation, IEEE Transactions on Information Theory 48 (2002) 264–275.
[36] V. Kůrková, M. Sanguineti, Learning with Generalization Capability by Kernel Methods of Bounded
Complexity, Journal of Complexity 21 (2005) 350–367.
[37] S. Yu, V. Tresp, K. Yu, Robust Multi-Task Learning with t-Processes, in: Proceedings of International
Conference on Machine Learning, 2007, pp. 1103–1110.
[38] S. Zhu, K. Yu, Y. Gong, Predictive Matrix-Variate t Models, in: Advances in Neural Information
Processing Systems, 2008, pp. 1721–1728.
[39] P. J. Huber, Robust Statistics, John Wiley & Sons, New York, 1981.
[40] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[41] A. Myronenko, X. Song, Point Set Registration: Coherent Point Drift, IEEE Transactions on Pattern
Analysis and Machine Intelligence 32 (2010) 2262–2275.
[42] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision (2nd ed.), Cambridge University
Press, Cambridge, 2003.
[43] D. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer
Vision 60 (2004) 91–110.
[44] P. H. S. Torr, A. Zisserman, MLESAC: A New Robust Estimator with Application to Estimating Image
Geometry, Computer Vision and Image Understanding 78 (2000) 138–156.
[45] J. H. Kim, J. H. Han, Outlier Correction from Uncalibrated Image Sequence Using the Triangulation
Method, Pattern Recognition 39 (2006) 394–404.
[46] L. Goshen, I. Shimshoni, Guided Sampling via Weak Motion Models and Outlier Sample Generation
for Epipolar Geometry Estimation, International Journal of Computer Vision 80 (2008) 275–288.
[47] X. Li, Z. Hu, Rejecting Mismatches by Correspondence Function, International Journal of Computer
Vision 89 (2010) 1–17.
[48] J. Ma, J. Zhao, Y. Zhou, J. Tian, Mismatch Removal via Coherent Spatial Mapping, in: Proceedings
of IEEE International Conference on Image Processing, 2012, pp. 1–4.
[49] J. Ma, J. Zhao, J. Tian, Z. Tu, A. Yuille, Robust Estimation of Nonrigid Transformation for Point Set
Registration, in: Proceedings of IEEE conference on Computer Vision and Pattern Recognition, 2013.
[50] J. Barron, D. Fleet, S. Beauchemin, Performance of Optical Flow Techniques, International Journal
of Computer Vision 12 (1994) 43–77.
[51] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, L. van
Gool, A Comparison of Affine Region Detectors, International Journal of Computer Vision 65 (2005)
43–72.
[52] T. Tuytelaars, L. van Gool, Matching Widely Separated Views Based on Affine Invariant Regions,
International Journal of Computer Vision 59 (2004) 61–85.
[53] A. Vedaldi, B. Fulkerson, VLFeat - An Open and Portable Library of Computer Vision Algorithms, in:
Proceedings of the 18th annual ACM international conference on Multimedia, 2010, pp. 1469–1472.
[54] A. Zaharescu, E. Boyer, K. Varanasi, R. Horaud, Surface Feature Detection and Description with
Applications to Mesh Matching, in: Proceedings of IEEE conference on Computer Vision and Pattern
Recognition, 2009, pp. 373–380.
[55] N. Dalal, B. Triggs, Histograms of Oriented Gradients for Human Detection, in: Proceedings of IEEE
conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.
[56] I. Macêdo, R. Castro, Learning Divergence-free and Curl-free Vector Fields with Matrix-valued Kernels,
Technical Report, Instituto Nacional de Matematica Pura e Aplicada, Brasil, 2008.
[57] M. A. Fischler, R. C. Bolles, Random Sample Consensus: A Paradigm for Model Fitting with Application to Image Analysis and Automated Cartography, Communications of the ACM 24 (1981)
381–395.