Regularized Vector Field Learning with Sparse Approximation for Mismatch Removal

Jiayi Ma^a, Ji Zhao^a, Jinwen Tian^a, Xiang Bai^{b,d}, Zhuowen Tu^{c,d}

a Institute for Pattern Recognition and Artificial Intelligence, Huazhong University of Science and Technology, Wuhan, 430074, China.
b Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China.
c Lab of Neuro Imaging, University of California, Los Angeles, CA, 90095, USA.
d Microsoft Research Asia, Beijing, 100080, China.

Abstract

In vector field learning, regularized kernel methods such as regularized least-squares require the number of basis functions to equal the training sample size, N. The learning process thus has O(N^3) time and O(N^2) space complexity, which poses a significant burden on vector field learning for large datasets. In this paper, we propose a sparse approximation to a robust vector field learning method, sparse vector field consensus (SparseVFC), and derive a statistical learning bound on the speed of convergence. We apply SparseVFC to the mismatch removal problem. Quantitative results on benchmark datasets demonstrate the significant speed advantage of SparseVFC over the original VFC algorithm (two orders of magnitude faster) without much performance degradation; we also demonstrate its large improvement over traditional methods such as RANSAC. Moreover, the proposed method is general and can be applied to other applications in vector field learning.

Keywords: Vector field learning, sparse approximation, regularization, reproducing kernel Hilbert space, outlier, mismatch removal

Email addresses: [email protected] (Jiayi Ma), [email protected] (Ji Zhao), [email protected] (Jinwen Tian), [email protected] (Xiang Bai), [email protected] (Zhuowen Tu)

Preprint submitted to Pattern Recognition, February 27, 2013

1.
Introduction

For a given set of data S = {(x_n, y_n) ∈ X × Y}_{n=1}^N, one can learn a function f : X → Y that approximates the mapping from input x_n to output y_n. In this process one fits a function to the given training data under a smoothness constraint, which typically achieves some form of capacity control over the function space. The learning process aims to balance data fitting against model complexity, and thus produces a robust algorithm that generalizes well to unseen data [1]. One way to formulate this is as an optimization problem with a certain choice of regularization [2, 3], typically operating in a Reproducing Kernel Hilbert Space (RKHS) [4] associated with a particular kernel. Two well-known methods along this line of learning with regularization are regularized least-squares (RLS) and support vector machines (SVM) [2]. These regularized kernel methods have drawn much attention due to their computational simplicity and strong generalization power in traditional machine learning problems such as regression and classification.

Regularized kernel methods over an RKHS often lead to solving a convex optimization problem; for example, RLS directly solves a linear system. However, the computational cost associated with the kernel matrices is relatively high. Given a set of N training samples, the kernel matrix is of size N × N. This implies a space complexity of O(N^2) and a time complexity of at least O(N^2); in fact, most regularized kernel methods involve core operations such as the matrix inversion in RLS, which is O(N^3). The computation is therefore demanding when N is large. This situation is more troublesome in the vector-valued case, where we focus on vector-valued regularized least-squares (vector-valued RLS). The RLS model appears under several different names in the literature [5], such as regularization networks (RN) [3], kernel ridge regression (KRR) [6], and least squares support vector machines (LS-SVM) [7].
We use a widely adopted name, RLS, in this paper.

A vector field is a map that assigns to each position x ∈ IR^P a vector y ∈ IR^D; it is defined by a vector-valued function. Examples of vector fields range from the velocity fields of fluid particles to the optical flow fields associated with visual motion. The problem of vector field learning is tied to functional estimation and supervised learning. We may learn a vector field using regularized kernel methods and, in particular, vector-valued RLS [8, 9]. According to the representer theorem in an RKHS, the optimal solution is a linear combination of a number of basis functions equal to the size of the training set, N [10]. The corresponding kernel matrix is of size DN × DN. Compared to the scalar case, the vector-valued setting poses even greater challenges on large datasets. One possible solution is to learn sparse basis functions; i.e., we could pick a subset of M basis functions, making the kernel matrix of size DM × DM. The computational complexity can then be greatly reduced when M ≪ N. Moreover, the solution based on sparse bases also enables faster prediction. Therefore, when we learn a vector field via vector-valued regularized kernel methods, a sparse representation is of particular advantage.

Motivated by the recent sparse representation literature [11], here we investigate a suboptimal solution to the vector-valued RLS problem. We derive an upper bound on the approximation accuracy based on the representer theorem. The upper bound converges at rate $O(\sqrt{1/M})$, indicating fast convergence. We develop a new algorithm, SparseVFC, by using the sparse approximation in a robust vector field learning method, vector field consensus (VFC) [9]. As in [9], the SparseVFC algorithm is applied to the mismatch removal problem. Experimental results on various 2D and 3D datasets show the clear advantages of SparseVFC over competing methods in efficiency and robustness.

1.1.
Related work

Sparse representations [11] have recently gained considerable interest for learning kernel machines [12, 13, 14, 15]. To scale kernel methods up to large data, a wealth of efficient sparse approximations have been suggested in the scalar case, including the Nyström approximation [16], sparse greedy approximations [17], reduced support vector machines [18], and incomplete Cholesky decomposition [19]. Common to all these approximation schemes is that only a subset of the latent variables is treated exactly, with the remaining variables approximated. The methods differ in the mechanism for choosing the subset and in the matrix form used to represent the hypothesis space. In this paper, we generalize this idea to the vector-valued setting.

Another approach to achieving a sparse representation is to add a regularization term penalizing the $\ell_1$ norm of the expansion coefficients [20, 21]. However, this does not alleviate the problem, since the kernel matrix of size N × N must still be computed (and inverted), so it is not suitable for large datasets. To overcome this problem, Kim et al. [22] described a specialized interior-point method for large-scale $\ell_1$-regularized problems that uses a preconditioned conjugate gradient algorithm to compute the search direction. The large-scale kernel learning problem can also be addressed by iterative optimization such as conjugate gradients [23] and more efficient decomposition-based methods [24, 25]. Specifically, for the RLS model with the $\ell_2$ loss, fast optimization algorithms such as accelerated gradient descent in conjunction with stochastic methods [26, 27, 28] can also perform very fast. However, a main drawback of these methods is that they must store a kernel matrix of size N × N, which is not realistic for large datasets.
The main contributions of our work include: i) we present a sparse approximation algorithm for vector-valued RLS which has linear time and space complexity w.r.t. the training data size; ii) we derive an upper bound on the approximation accuracy of the proposed sparse approximation algorithm in terms of the regularized risk functional; iii) based on the sparse approximation and the vector field learning method VFC, we give a new algorithm, SparseVFC, which significantly speeds up the original VFC algorithm without sacrificing accuracy; iv) we apply SparseVFC to mismatch removal, a fundamental problem in computer vision, and the results demonstrate its superiority over the original VFC algorithm and many other state-of-the-art methods.

The rest of the paper is organized as follows. In Section 2, we briefly review vector-valued Tikhonov regularization and lay out a sparse approximation algorithm for solving vector-valued RLS. In Section 3, we apply the sparse approximation to robust vector field learning and the mismatch removal problem. The datasets and evaluation criteria used in the experiments are presented in Section 4. In Section 5, we evaluate the performance of our algorithm on both synthetic data and real-world images. Finally, we conclude in Section 6.

2. Sparse approximation algorithm

We start by defining the vector-valued RKHS and recalling vector-valued Tikhonov regularization, then lay out a sparse approximation algorithm for solving the vector-valued RLS problem. Finally, we derive a statistical learning bound for the sparse approximation.

2.1. Vector-valued reproducing kernel Hilbert spaces

We are interested in a class of vector-valued kernel methods where the hypothesis space is chosen to be a reproducing kernel Hilbert space (RKHS). This motivates reviewing the basic theory of vector-valued RKHS. The development of the theory in the vector case is essentially the same as in the scalar case.
We refer to [10, 29] for further details and references.

Let Y be a real Hilbert space with inner product (norm) $\langle\cdot,\cdot\rangle_Y$ ($\|\cdot\|_Y$), for example Y ⊆ IR^D; let X be a set, for example X ⊆ IR^P; and let H be a Hilbert space with inner product (norm) $\langle\cdot,\cdot\rangle_H$ ($\|\cdot\|_H$). A norm can be defined via an inner product, for example $\|f\|_H = \langle f, f\rangle_H^{1/2}$. Next, we recall the definition of an RKHS as well as some of its properties which we will use in this paper.

Definition 1. A Hilbert space H is an RKHS if the evaluation maps $ev_x : H \to Y$ are bounded, i.e., for every x ∈ X there exists a positive constant $C_x$ such that

  $\|ev_x(f)\|_Y = \|f(x)\|_Y \le C_x \|f\|_H, \quad \forall f \in H.$  (1)

A reproducing kernel $\Gamma : X \times X \to B(Y)$ is then defined as

  $\Gamma(x, x') := ev_x\, ev_{x'}^*,$  (2)

where B(Y) is the space of bounded operators on Y, for example B(Y) ⊆ IR^{D×D}, and $ev_x^*$ is the adjoint of $ev_x$.

From the definition of the RKHS, we can derive the following properties:

(i) For each x ∈ X and y ∈ Y, the kernel Γ has the reproducing property

  $\langle f(x), y\rangle_Y = \langle f, \Gamma(\cdot, x)y\rangle_H, \quad \forall f \in H.$  (3)

(ii) For every x ∈ X and f ∈ H, we have

  $\|f(x)\|_Y \le \|\Gamma(x, x)\|_{Y,Y}^{1/2} \|f\|_H,$  (4)

where $\|\cdot\|_{Y,Y}$ is the operator norm.

Property (i) can be derived directly from the equation $ev_x^* y = \Gamma(\cdot, x)y$. In property (ii), we assume that $\sup_{x\in X} \|\Gamma(x, x)\|_{Y,Y}^{1/2} = s_\Gamma < \infty$.

Similar to the scalar case, for any N ∈ IIN, $\{x_n : n \in IIN_N\} \subseteq X$, and a reproducing kernel Γ, a unique RKHS can be defined by considering the completion of the space

  $H_N = \left\{ \sum_{n=1}^N \Gamma(\cdot, x_n) c_n : c_n \in Y \right\},$  (5)

with respect to the norm induced by the inner product

  $\langle f, g\rangle_H = \sum_{i,j=1}^N \langle \Gamma(x_j, x_i) c_i, d_j\rangle_Y,$  (6)

for any f, g ∈ H_N with $f = \sum_{i=1}^N \Gamma(\cdot, x_i) c_i$ and $g = \sum_{j=1}^N \Gamma(\cdot, x_j) d_j$.

2.2. Vector-valued Tikhonov regularization

Regularization aims to stabilize the solution of an ill-conditioned problem.
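Definition (5) says that any field in H_N is a finite combination $f(x) = \sum_n \Gamma(x, x_n) c_n$. The following minimal numerical sketch evaluates such a field for a decomposable Gaussian kernel, i.e. a scalar Gaussian kernel times the D × D identity; the kernel choice, the width β, and the sample points are illustrative assumptions, not values fixed by the theory.

```python
import numpy as np

def gamma(x, xp, beta=0.1, D=2):
    # Decomposable matrix-valued kernel: scalar Gaussian kernel times I_D.
    # beta and D are illustrative choices.
    return np.exp(-beta * np.sum((x - xp) ** 2)) * np.eye(D)

def field(x, centers, coeffs, beta=0.1):
    # A field in H_N, equation (5): f(x) = sum_n Gamma(x, x_n) c_n.
    return sum(gamma(x, xn, beta, len(c)) @ c for xn, c in zip(centers, coeffs))

centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
coeffs = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
y = field(np.array([0.5, 0.5]), centers, coeffs)   # a D-dimensional output vector
```

Because the kernel is scalar-times-identity, each output component of f is a weighted sum of the same scalar kernel evaluations, which is what later allows the smaller linear system (32).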
Given an N-sample S = {(x_n, y_n) ∈ X × Y : n ∈ IIN_N} of patterns, where X ⊆ IR^P and Y ⊆ IR^D are the input and output spaces respectively, in order to learn a mapping f : X → Y, vector-valued Tikhonov regularization in an RKHS H with kernel Γ minimizes a regularized risk functional [10]

  $\Phi(f) = \sum_{n=1}^N V(f(x_n), y_n) + \lambda \|f\|_H^2,$  (7)

where the first term is the empirical error with a loss function $V : Y \times Y \to [0, \infty)$ satisfying V(y, y) = 0, and the second term is a stabilizer with a regularization parameter λ controlling the trade-off between the two terms.

Theorem 1 (Vector-valued Representer Theorem [10]). The optimal solution of the regularized risk functional (7) has the form

  $f^o(x) = \sum_{n=1}^N \Gamma(x, x_n) c_n, \quad c_n \in Y.$  (8)

Hence, minimizing over the (possibly) infinite-dimensional Hilbert space boils down to finding a finite set of coefficients {c_n : n ∈ IIN_N}. A number of loss functions have been discussed in the literature [1]. In this paper, we focus on vector-valued RLS, i.e., vector-valued Tikhonov regularization with an $\ell_2$ loss function

  $V(f(x_n), y_n) = p_n \|y_n - f(x_n)\|^2,$  (9)

where $p_n \ge 0$ is a weight and $\|\cdot\|$ denotes the $\ell_2$ norm. Other common loss functions are the absolute value loss $V(f(x), y) = |f(x) - y|$ and Vapnik's ε-insensitive loss $V(f(x), y) = \max(|f(x) - y| - \varepsilon, 0)$. Using the $\ell_2$ loss, the coefficients c_n of the optimal solution (8) are determined by a linear system [10, 9]

  $(\tilde{\Gamma} + \lambda \tilde{P}^{-1}) C = Y,$  (10)

where the kernel matrix $\tilde{\Gamma}$, called the Gram matrix, is an N × N block matrix with (i, j)-th block $\Gamma(x_i, x_j)$; $\tilde{P} = P \otimes I_{D\times D}$ is a DN × DN diagonal matrix, with $P = \mathrm{diag}(p_1, \ldots, p_N)$ and ⊗ denoting the Kronecker product; and $C = (c_1^T, \cdots, c_N^T)^T$ and $Y = (y_1^T, \cdots, y_N^T)^T$ are DN × 1 dimensional vectors.

Note that when the loss function V is no longer quadratic, the solution of the regularized risk functional (7) still has the form (8), but the coefficients c_n can no longer be found by solving a linear system.

2.3. Sparse approximation in vector-valued regularized least-squares

By the Representer Theorem, the optimal solution f^o comes from the RKHS H_N defined in equation (5). Finding the coefficients c_n of the optimal solution f^o in vector-valued RLS merely requires solving the linear system (10). However, for large values of N this may pose a serious problem due to heavy computational (O(N^3)) or memory (O(N^2)) requirements, and even when it is implementable, one may prefer a suboptimal but simpler method.

In this section, we propose an algorithm based on a similar idea to the subset of regressors method [30, 31] for the standard vector-valued RLS problem. Rather than searching for the optimal solution f^o in H_N, we use a sparse approximation and search for a suboptimal solution f^s in a space H_M (M ≪ N) with far fewer basis functions, defined as

  $H_M = \left\{ \sum_{m=1}^M \Gamma(\cdot, \tilde{x}_m) c_m : c_m \in Y \right\},$  (11)

with $\{\tilde{x}_m : m \in IIN_M\} \subseteq X$, and then minimize the loss over all the training data. The remaining problem is how to choose the point set $\{\tilde{x}_m : m \in IIN_M\}$ and, accordingly, find a set of coefficients {c_m : m ∈ IIN_M}. In the scalar case, different approaches for selecting this point set are discussed, for example, in [5]; there, it was found that simply selecting an arbitrary subset of the training inputs performs no worse than more sophisticated methods. Recent progress in compressed sensing [11] has also demonstrated the power of sparse random basis representations. Therefore, in the interest of computational efficiency, we use simple random sampling to choose sparse basis functions in the vector case.
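As an illustration of the sparse approximation, the following sketch fits a toy 2D → 2D field with M random basis points using the scalar-kernel form of the linear system derived below (a decomposable Gaussian kernel, so each block reduces to a scalar). All sizes, the kernel width `beta`, and the regularizer `lam` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_gram(A, B, beta):
    # Gram matrix K_ij = exp(-beta * |a_i - b_j|^2) of a scalar Gaussian kernel.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-beta * d2)

# Toy 2D -> 2D field: N samples, M random basis points with M << N.
N, M, beta, lam = 200, 15, 0.5, 0.01
X = rng.uniform(-1, 1, (N, 2))
Y = np.stack([np.sin(X[:, 0]), np.cos(X[:, 1])], axis=1)   # smooth target field
Xb = X[rng.choice(N, size=M, replace=False)]               # random subset as bases

U = gaussian_gram(X, Xb, beta)      # N x M matrix (U-tilde for this kernel)
G = gaussian_gram(Xb, Xb, beta)     # M x M kernel matrix (Gamma-tilde)
P = np.ones(N)                      # uniform weights p_n = 1
UP = U * P[:, None]                 # U^T P applied column by column
C = np.linalg.solve(UP.T @ U + lam * G, UP.T @ Y)          # sparse RLS system
err = np.abs(U @ C - Y).mean()      # training error of the suboptimal f^s
```

Only an M × M system is solved instead of an N × N one, which is the source of the complexity reduction discussed below.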
We also compare the influence of different choices of sparse basis functions in the experiments section. Note that the point set $\{\tilde{x}_m : m \in IIN_M\}$ need not be a subset of the input training data {x_n : n ∈ IIN_N}.

With the sparse approximation, the unique solution of the vector-valued RLS in H_M has the form

  $f^s(x) = \sum_{m=1}^M \Gamma(x, \tilde{x}_m) c_m.$  (12)

To solve for the coefficients c_m, we consider the Hilbertian norm and the corresponding inner product (6):

  $\|f\|_H^2 = \sum_{i=1}^M \sum_{j=1}^M \langle \Gamma(\tilde{x}_j, \tilde{x}_i) c_i, c_j\rangle_Y = C^T \tilde{\Gamma} C,$  (13)

where $C = (c_1^T, \cdots, c_M^T)^T$ is a DM × 1 dimensional vector and the kernel matrix $\tilde{\Gamma}$ is an M × M block matrix with (i, j)-th block $\Gamma(\tilde{x}_i, \tilde{x}_j)$. Thus, the minimization of the regularized risk functional (7) becomes

  $\min_{f\in H} \Phi(f) = \min_{f\in H} \left\{ \sum_{n=1}^N p_n \|y_n - f(x_n)\|^2 + \lambda \|f\|_H^2 \right\} = \min_C \left\{ \|\tilde{P}^{1/2}(Y - \tilde{U}C)\|^2 + \lambda C^T \tilde{\Gamma} C \right\},$  (14)

where $\tilde{U}$ is an N × M block matrix with (i, j)-th block $\Gamma(x_i, \tilde{x}_j)$:

  $\tilde{U} = \begin{pmatrix} \Gamma(x_1, \tilde{x}_1) & \cdots & \Gamma(x_1, \tilde{x}_M) \\ \vdots & \ddots & \vdots \\ \Gamma(x_N, \tilde{x}_1) & \cdots & \Gamma(x_N, \tilde{x}_M) \end{pmatrix}.$  (15)

Taking the derivative of the right-hand side of equation (14) with respect to the coefficient matrix C and setting it to zero, we can compute C from the linear system

  $(\tilde{U}^T \tilde{P} \tilde{U} + \lambda \tilde{\Gamma}) C = \tilde{U}^T \tilde{P} Y.$  (16)

In contrast to the optimal solution f^o given by the Representer Theorem, which is a linear combination of the basis functions $\Gamma(\cdot, x_1), \cdots, \Gamma(\cdot, x_N)$ determined by the training inputs $x_1, \cdots, x_N$, the suboptimal solution f^s is formed by a linear combination of an arbitrary M-tuple of basis functions. Generally, this sparse approximation yields a vast increase in speed and decrease in memory requirements with a negligible decrease in accuracy. Note that the sparse approximation is somewhat related to SVM, since SVM's predictive function also depends on a few samples (i.e., support vectors).
In fact, under certain conditions, sparsity leads to SVM, which is related to the Structural Risk Minimization principle [1]. As derived in [20], when the data is noiseless and the coefficient matrix C is chosen to minimize the cost function

  $Q(C) = \left\| f - \sum_{n=1}^N c_n \Gamma(\cdot, x_n) \right\|_H^2 + \lambda \|C\|_{\ell_1},$  (17)

where $\|\cdot\|_{\ell_1}$ is the usual $\ell_1$ norm, the sparse approximation gives the same solution as SVM. If N is very large and $\{\Gamma(\cdot, x_n)\}$ is not an orthonormal basis, many different sets of coefficients may achieve the same error on a given data set. Among all the approximating functions that achieve the same error, the $\ell_1$ norm in the second term of equation (17) favors the one with the smallest number of non-zero coefficients. Our approach follows this basic idea. The difference is that in the cost function (17) the sparse basis functions are chosen automatically during the optimization process, while in our approach the basis functions are chosen randomly for the purpose of reducing both time and space complexity.

2.4. Bounds for the sparse approximation

To derive upper bounds on the error of approximating the optimal solution f^o by the suboptimal one f^s, we employ Maurey-Jones-Barron's theorem [32, 33] reformulated in terms of G-variation [34]: for a Hilbert space X with norm $\|\cdot\|$, f ∈ X, and a bounded subset G of X, the following upper bound holds:

  $\|f - \mathrm{span}_n G\| \le \sqrt{\frac{(s_G \|f\|_G)^2 - \|f\|^2}{n}},$  (18)

where $\mathrm{span}_n G$ is a linear combination of n arbitrary basis functions in G, $s_G = \sup_{g\in G} \|g\|$, and $\|\cdot\|_G$ denotes the G-variation, defined as

  $\|f\|_G = \inf\{ c > 0 : f/c \in \mathrm{cl\,conv}(G \cup -G) \},$  (19)

with cl conv(·) the closure of the convex hull of a set and −G = {−g : g ∈ G}. Here the closure of a set A is defined as $\mathrm{cl}\, A = \{f \in X : (\forall \varepsilon > 0)(\exists g \in A)\, \|f - g\| < \varepsilon\}$. For properties of G-variation, we refer the reader to [34, 35, 36].
Taking advantage of this upper bound, we derive the following proposition, which compares the optimal solution f^o with a suboptimal solution f^s in terms of the regularized risk functional Φ(f).

Proposition 1. Let $\{(x_n, y_n) \in X \times Y\}_{n=1}^N$ be a finite set of input-output pairs, $\Gamma : X \times X \to B(Y)$ a matrix-valued kernel, $s_\Gamma = \sup_{x\in X} \|\Gamma(x, x)\|_{Y,Y}^{1/2}$, $f^o(x) = \sum_{n=1}^N \Gamma(x, x_n) c_n$ the optimal solution of the regularized risk functional (7), and $f^s(x) = \sum_{m=1}^M \Gamma(x, \tilde{x}_m) c_m$ a suboptimal solution. Suppose that for all n ∈ IIN_N, $\|f^s(x_n) + f^o(x_n) - 2y_n\|_Y \le \sup_{x\in X} \|f^s(x) + f^o(x)\|_Y$. Then the following upper bound holds:

  $\Phi(f^s) - \Phi(f^o) \le (N s_\Gamma^2 + \lambda) \left[ \frac{\alpha}{M} + 2\|f^o\|_H \sqrt{\frac{\alpha}{M}} \right],$  (20)

where $\alpha = (s_\Gamma \|f^o\|_G)^2 - \|f^o\|_H^2$, H is the RKHS corresponding to the reproducing kernel Γ, and λ is the regularization parameter.

Proof. By property (ii) of an RKHS in Section 2.1, for every f ∈ H and x ∈ X we have

  $\|f(x)\|_Y \le \|\Gamma(x, x)\|_{Y,Y}^{1/2} \|f\|_H \le s_\Gamma \|f\|_H,$  (21)

where $s_\Gamma = \sup_{x\in X} \|\Gamma(x, x)\|_{Y,Y}^{1/2}$. Thus we obtain

  $\sup_{x\in X} \|f(x)\|_Y \le s_\Gamma \|f\|_H.$  (22)

Moreover, by Maurey-Jones-Barron's bound (18) applied with G corresponding to H_N, the suboptimal solution can be chosen so that $\|f^o - f^s\|_H \le \sqrt{\alpha/M}$, and by the triangle inequality $\|f^o + f^s\|_H \le 2\|f^o\|_H + \sqrt{\alpha/M}$. Using these inequalities together with the assumption of the proposition, we obtain

  $\Phi(f^s) - \Phi(f^o) = \sum_{n=1}^N \left( \|f^s(x_n) - y_n\|_Y^2 - \|f^o(x_n) - y_n\|_Y^2 \right) + \lambda (\|f^s\|_H^2 - \|f^o\|_H^2)$
  $= \sum_{n=1}^N \langle f^s(x_n) - f^o(x_n),\, f^s(x_n) + f^o(x_n) - 2y_n \rangle_Y + \lambda (\|f^s\|_H - \|f^o\|_H)(\|f^s\|_H + \|f^o\|_H)$
  $\le \sum_{n=1}^N \|f^s(x_n) - f^o(x_n)\|_Y \|f^s(x_n) + f^o(x_n) - 2y_n\|_Y + \lambda \|f^s - f^o\|_H (\|f^s\|_H + \|f^o\|_H)$
  $\le N \sup_{x\in X} \|f^s(x) - f^o(x)\|_Y \sup_{x\in X} \|f^s(x) + f^o(x)\|_Y + \lambda \|f^s - f^o\|_H (\|f^s\|_H + \|f^o\|_H)$
  $\le N s_\Gamma \|f^o - f^s\|_H \, s_\Gamma \|f^o + f^s\|_H + \lambda \|f^o - f^s\|_H (\|f^o\|_H + \|f^s\|_H)$
  $\le N s_\Gamma^2 \sqrt{\frac{\alpha}{M}} \left( 2\|f^o\|_H + \sqrt{\frac{\alpha}{M}} \right) + \lambda \sqrt{\frac{\alpha}{M}} \left( 2\|f^o\|_H + \sqrt{\frac{\alpha}{M}} \right)$
  $= (N s_\Gamma^2 + \lambda) \left[ \frac{\alpha}{M} + 2\|f^o\|_H \sqrt{\frac{\alpha}{M}} \right],$  (23)

where $\alpha = (s_\Gamma \|f^o\|_G)^2 - \|f^o\|_H^2$, $\|\cdot\|_G$ is the G-variation, and G corresponds to H_N in our problem.

From this upper bound we can easily derive that, to achieve an approximation accuracy ε, i.e. $\Phi(f^s) - \Phi(f^o) \le \varepsilon$, the minimal number of basis functions needed satisfies

  $M \le \alpha \left[ \frac{\varepsilon}{N s_\Gamma^2 + \lambda} - 2\|f^o\|_H^2 \left( \sqrt{1 + \frac{\varepsilon}{(N s_\Gamma^2 + \lambda)\|f^o\|_H^2}} - 1 \right) \right]^{-1}.$  (24)

3. Sparse approximation for robust vector field learning

A vector field is also a vector-valued function. Tikhonov regularization treats all samples as inliers, which ignores the issue of robustness: real-world data may contain unknown outliers. Recently, Zhao et al. [9] presented a robust vector-valued RLS method named Vector Field Consensus (VFC) for vector field learning, in which each sample is associated with a latent variable indicating whether it is an inlier or an outlier, and the EM algorithm is adopted for optimization. The technique of robust vector field learning has also been adopted in Gaussian processes, basically by using so-called t-processes [37, 38]. In this section, we present a sparse approximation algorithm for VFC and apply it to the mismatch removal problem.

3.1. Sparse vector field consensus

Given a set of observed input-output pairs S = {(x_n, y_n) : n ∈ IIN_N} as samples randomly drawn from a vector field, possibly contaminated by unknown outliers, the goal is to learn a mapping f that fits the inliers well. Due to the existence of outliers, a robust estimate of the mapping f is desirable. There are two choices: (i) build a more complex model that includes the outliers, which involves modeling the outlier process using extra (hidden) variables that enable us to identify and reject outliers, or (ii) use an estimator that is less sensitive to outliers, as described in Huber's robust statistics [39]. In this paper, we adopt the first scenario. In the following we assume that the vector field samples in the inlier class have Gaussian noise with zero mean and uniform standard deviation σ, while those in the outlier class are uniformly distributed with density 1/a, a being the volume of the output domain.
Let γ be the percentage of inliers, which is unknown in advance, let $X = (x_1^T, \cdots, x_N^T)^T$ and $Y = (y_1^T, \cdots, y_N^T)^T$ be DN × 1 dimensional vectors, and let θ = {f, σ², γ} be the unknown parameter set. The likelihood is then a mixture model:

  $p(Y|X, \theta) = \prod_{n=1}^N \left[ \frac{\gamma}{(2\pi\sigma^2)^{D/2}} e^{-\frac{\|y_n - f(x_n)\|^2}{2\sigma^2}} + \frac{1-\gamma}{a} \right].$  (25)

We model the mapping f in a vector-valued RKHS H with reproducing kernel Γ and impose a smoothness constraint on it, i.e., $p(f) \propto e^{-\frac{\lambda}{2}\|f\|_H^2}$. Therefore, we can estimate a MAP solution of θ by the Bayes rule: $\theta^* = \arg\max_\theta p(Y|X, \theta)\, p(f)$. An iterative EM algorithm can be used to solve this problem. We associate sample n with a latent variable $z_n \in \{0, 1\}$, where z_n = 1 indicates the Gaussian distribution and z_n = 0 the uniform distribution. We follow standard notation [40] and omit terms independent of θ. The complete-data log posterior is then given by

  $Q(\theta, \theta^{old}) = -\frac{1}{2\sigma^2} \sum_{n=1}^N P(z_n = 1|x_n, y_n, \theta^{old}) \|y_n - f(x_n)\|^2 - \frac{D}{2} \ln \sigma^2 \sum_{n=1}^N P(z_n = 1|x_n, y_n, \theta^{old}) + \ln(1-\gamma) \sum_{n=1}^N P(z_n = 0|x_n, y_n, \theta^{old}) + \ln \gamma \sum_{n=1}^N P(z_n = 1|x_n, y_n, \theta^{old}) - \frac{\lambda}{2}\|f\|_H^2.$  (26)

E-step: Denote $P = \mathrm{diag}(p_1, \ldots, p_N)$, where the probability $p_n = P(z_n = 1|x_n, y_n, \theta^{old})$ can be computed by the Bayes rule:

  $p_n = \frac{\gamma e^{-\frac{\|y_n - f(x_n)\|^2}{2\sigma^2}}}{\gamma e^{-\frac{\|y_n - f(x_n)\|^2}{2\sigma^2}} + (1-\gamma)\frac{(2\pi\sigma^2)^{D/2}}{a}}.$  (27)

The posterior probability p_n is a soft decision indicating to what degree sample n agrees with the currently estimated vector field f.

M-step: We determine the revised parameter estimate $\theta^{new} = \arg\max_\theta Q(\theta, \theta^{old})$. Taking the derivatives of Q with respect to σ² and γ and setting them to zero, we obtain

  $\sigma^2 = \frac{(Y - V)^T \tilde{P} (Y - V)}{D \cdot \mathrm{tr}(P)},$  (28)

  $\gamma = \mathrm{tr}(P)/N,$  (29)

where $V = (f(x_1)^T, \cdots, f(x_N)^T)^T$ is a DN × 1 dimensional vector and tr(·) denotes the trace.
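The E-step (27) and the closed-form noise updates (28)-(29) can be written compactly; the sketch below assumes the field predictions f(x_n) are stored row-wise in an array V, and the toy numbers are purely illustrative.

```python
import numpy as np

def e_step(Y, V, gamma, sigma2, a):
    # Posterior inlier probability p_n of equation (27); V holds f(x_n) row-wise.
    D = Y.shape[1]
    r2 = ((Y - V) ** 2).sum(axis=1)
    num = gamma * np.exp(-r2 / (2 * sigma2)) / (2 * np.pi * sigma2) ** (D / 2)
    return num / (num + (1 - gamma) / a)

def m_step_noise(Y, V, p):
    # Closed-form updates of equations (28) and (29).
    D = Y.shape[1]
    sigma2 = (p * ((Y - V) ** 2).sum(axis=1)).sum() / (D * p.sum())
    return sigma2, p.mean()

Y = np.array([[0.0, 0.0], [5.0, 5.0]])   # second sample lies far from the field
V = np.zeros_like(Y)                      # current field prediction f(x_n) = 0
p = e_step(Y, V, gamma=0.9, sigma2=0.1, a=10.0)
```

A sample agreeing with the field gets p_n near 1, while a far-away sample gets p_n near 0, which is exactly the soft inlier/outlier decision described above.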
Considering the terms of the objective function Q in equation (26) that are related to f, the vector field can be estimated by minimizing an energy function

  $E(f) = \frac{1}{2\sigma^2} \sum_{n=1}^N p_n \|y_n - f(x_n)\|^2 + \frac{\lambda}{2}\|f\|_H^2.$  (30)

This is a regularized risk functional (7) with the $\ell_2$ loss function (9). Estimating the vector field f requires solving a linear system similar to equation (10), which takes up most of the run time and memory of the algorithm. Obviously, the sparse approximation can be used here to reduce the time and space complexity. Using the sparse approximation, we search for a suboptimal f^s of the form (12), with the coefficients determined by a linear system similar to equation (16):

  $(\tilde{U}^T \tilde{P} \tilde{U} + \lambda\sigma^2 \tilde{\Gamma}) C = \tilde{U}^T \tilde{P} Y.$  (31)

Since it is a sparse approximation to the vector field consensus algorithm, we name our method SparseVFC. In summary, compared with the original VFC algorithm, SparseVFC estimates the vector field f by solving the linear system (31) rather than equation (10). Our SparseVFC algorithm is outlined in Algorithm 1.

The original paper [9] also presented a fast implementation of VFC, named FastVFC. To reduce the time complexity, it first computes a low-rank approximation of the kernel matrix $\tilde{\Gamma}$ of size DN × DN via the singular value decomposition (SVD), and then uses the Woodbury matrix identity to invert the coefficient matrix in the linear system for C [41].

Computational complexity. Notice that $\tilde{P}$ is a diagonal matrix, so we can compute $\tilde{U}^T \tilde{P}$ in linear system (31) by multiplying the n-th column of $\tilde{U}^T$ by the n-th diagonal element of $\tilde{P}$. The time complexity of SparseVFC for mismatch removal is thus reduced to $O(mD^3M^2N + mD^3M^3)$, where m is the number of EM iterations.
The space complexity of SparseVFC scales as $O(D^2MN + D^2M^2)$, due to the memory required to store the matrix $\tilde{U}$ and the kernel $\tilde{\Gamma}$.

Algorithm 1: The SparseVFC Algorithm
Input: Training set S = {(x_n, y_n) : n ∈ IIN_N}, matrix kernel Γ, regularization constant λ, basis function number M
Output: Vector field f, inlier set I
1: Initialize a, γ, V = 0_{DN×1}, P = I_{N×N}, and σ² by equation (28);
2: Randomly choose M basis functions from the training inputs and construct the kernel matrix $\tilde{\Gamma}$;
3: repeat
4:   E-step:
5:     Update P = diag(p_1, ..., p_N) by equation (27);
6:   M-step:
7:     Update C by solving linear system (31);
8:     Update V by using equation (12);
9:     Update σ² and γ by equations (28) and (29);
10: until Q converges;
11: The vector field f is determined by equation (12);
12: The inlier set is I = {n : p_n > τ, n ∈ IIN_N}, with τ a predefined threshold.

Typically, the required number of basis functions for the sparse approximation is much smaller than the number of data points, i.e., M ≪ N. In this paper, we apply SparseVFC to the mismatch removal problem on 2D and 3D images, in which the number of point matches N is typically of order 10^3 and the required number of basis functions M is of order 10^1. Therefore, both the time and space complexities of SparseVFC can be written simply as O(N). Table 1 summarizes the time and space complexities of VFC, FastVFC and SparseVFC. Compared to VFC and FastVFC, the time and space complexities are reduced from O(N^3) and O(N^2), respectively, to O(N). This is significant for large training sets. Note that the time complexity of FastVFC is still O(N^3), because the SVD of a matrix of size N × N has time complexity O(N^3).

Table 1: Comparison of computational complexity.
          VFC [9]   FastVFC [9]   SparseVFC
  time    O(N^3)    O(N^3)        O(N)
  space   O(N^2)    O(N^2)        O(N)

Relation to FastVFC. There is a close relationship between our SparseVFC algorithm and the FastVFC algorithm.
On the one hand, both algorithms use an approximation to the matrix-valued kernel Γ to search for a suboptimal solution rather than the optimal solution, and thereby reduce the time complexity. On the other hand, our algorithm is clearly superior to FastVFC, as can be seen from the time and space complexities in Table 1. FastVFC makes a low-rank matrix approximation of the kernel matrix $\tilde{\Gamma}$ itself, which does not change the size of $\tilde{\Gamma}$. The memory requirement is therefore not decreased, and the method remains impractical for large training sets. Moreover, FastVFC has the same time complexity O(N^3) as VFC, due to the SVD operation in FastVFC and the kernel matrix inversion in VFC, respectively. The difference is that the SVD in FastVFC needs to be performed only once, while the matrix inversion in VFC must be performed in each EM iteration; hence FastVFC achieves an acceleration. In contrast, SparseVFC uses a sparse representation and chooses far fewer basis functions to approximate the function space, leading to a significant reduction in the size of the corresponding reproducing kernel. This sparse approximation (even with randomly chosen basis functions) not only significantly reduces both the time and space complexities, but also does not sacrifice accuracy; in some situations it even performs slightly better than the original VFC algorithm (as shown in our experiments).

3.2. Application to mismatch removal

In this section, we focus on establishing accurate point correspondences between two images of the same scene. Many computer vision tasks, such as building 3D models, registration, object recognition, tracking, and structure and motion recovery, start by assuming that the point correspondences have been successfully recovered [42].
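As a concrete reference for Algorithm 1, the following sketch runs the full EM loop with the decomposable Gaussian kernel used later for mismatch removal. The settings γ = 0.9, λ = 3 and τ = 0.75 follow the paper; M, β, the fixed iteration count, the data-driven estimate of the volume a, and the clipping of γ are illustrative simplifications.

```python
import numpy as np

def sparse_vfc(X, Y, M=20, beta=0.5, lam=3.0, gamma=0.9, tau=0.75, iters=30, seed=0):
    # Compact sketch of Algorithm 1 (SparseVFC) for a decomposable Gaussian kernel.
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    a = float(np.prod(Y.max(0) - Y.min(0)))   # volume of output domain (estimated)
    # Line 2: randomly choose M basis points from the training inputs.
    Xb = X[rng.choice(N, size=min(M, N), replace=False)]
    d2 = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    U = np.exp(-beta * d2(X, Xb))             # N x M, U_ij = exp(-beta|x_i - xb_j|^2)
    G = np.exp(-beta * d2(Xb, Xb))            # M x M kernel matrix
    V = np.zeros_like(Y)                      # current field values f(x_n)
    sigma2 = ((Y - V) ** 2).sum() / (D * N)   # line 1: initialize sigma^2
    for _ in range(iters):
        # E-step: posterior inlier probabilities, equation (27).
        r2 = ((Y - V) ** 2).sum(1)
        num = gamma * np.exp(-r2 / (2 * sigma2)) / (2 * np.pi * sigma2) ** (D / 2)
        p = num / (num + (1 - gamma) / a)
        # M-step: coefficients from linear system (31), then sigma^2 and gamma
        # from equations (28) and (29).
        UP = U * p[:, None]
        C = np.linalg.solve(UP.T @ U + lam * sigma2 * G, UP.T @ Y)
        V = U @ C
        r2 = ((Y - V) ** 2).sum(1)
        sigma2 = max((p * r2).sum() / (D * p.sum()), 1e-10)
        gamma = min(max(p.mean(), 0.05), 0.95)
    return p > tau                            # inlier set, last line of Algorithm 1

# Toy demonstration: 80 inliers on a smooth linear field plus 20 uniform outliers.
rng = np.random.default_rng(7)
Xin = rng.uniform(-1, 1, (80, 2))
Yin = 0.5 * Xin + 0.05 * rng.standard_normal((80, 2))
Xout = rng.uniform(-1, 1, (20, 2))
Yout = rng.uniform(-1, 1, (20, 2))
X, Y = np.vstack([Xin, Xout]), np.vstack([Yin, Yout])
truth = np.array([True] * 80 + [False] * 20)
mask = sparse_vfc(X, Y)
accuracy = (mask == truth).mean()
```

Per iteration the dominant cost is forming the M × M system, in line with the O(mD^3M^2N + mD^3M^3) time complexity stated above.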
Point correspondences between two images are in general established by first detecting interest points and then matching the detected points based on local descriptors [43]. This may result in a number of mismatches (outliers) due to viewpoint changes, occlusions, repeated structures, etc. The existence of mismatches is usually enough to ruin traditional estimation methods. In this case, a robust estimator is desirable to remove mismatches [44, 45, 46, 47, 48, 49]. In our SparseVFC, as shown in the last line of Algorithm 1, whether a sample is an inlier or an outlier can be determined by its posterior probability after EM converges. Using this property, we apply SparseVFC to the mismatch removal problem. Next, we point out some key issues.

Vector field introduced by image pairs. We first apply a linear re-scaling to the point correspondences so that the positions of the feature points in the first and second images both have zero mean and unit variance. Let the input x ∈ IR^P be the location of a normalized point in the first image, and the output y ∈ IR^D be the corresponding displacement of that point in the second image; then the matches can be converted into a motion field training set. For 2D images P = D = 2; for 3D surfaces P = D = 3. Fig. 1 illustrates schematically the 2D image case.

Kernel selection. The kernel plays a central role in regularization theory as it provides a flexible and computationally feasible way to choose an RKHS. For the mismatch removal problem, the structure of the generated vector field is usually relatively simple. We simply choose a diagonal decomposable kernel [8, 9]: Γ(x_i, x_j) = e^{−β‖x_i − x_j‖²} I. Then we can solve a more efficient linear system instead of equation (31):

(UᵀPU + λσ²Γ)C̃ = UᵀPỸ,    (32)

where the kernel matrix Γ ∈ IR^{M×M} with Γ_ij = e^{−β‖x̃_i − x̃_j‖²}, U ∈ IR^{N×M} with U_ij = e^{−β‖x_i − x̃_j‖²}, and C̃ = (c_1, · · · , c_M)ᵀ and Ỹ = (y_1, · · · , y_N)ᵀ are M × D and N × D matrices respectively. Here N is the number of putative matches, and M is the number of bases.

Figure 1: Schematic illustration of the motion field introduced by an image pair. Left: an image pair and its putative matches; right: motion field samples introduced by the point matches in the left figure. ◦ and × indicate feature points in the first and second images respectively.

It should be noted that solving the vector field learning problem with this diagonal decomposable kernel is not equivalent to solving D independent scalar problems, since the updates of the posterior probability p_n and variance σ² in equations (27) and (28) are determined by all components of the output y_n. When the method is applied to mismatch removal, one issue deserves attention. We must ensure that the point set {x̃_m : m ∈ IN_M} used to construct the basis functions does not contain duplicate points, since in that case the coefficient matrix of the linear system (32), i.e. (UᵀPU + λσ²Γ), would be singular. This may well occur in the mismatch removal problem, since in the putative match set one point in the first image may be matched to several points in the second image.

3.3. Implementation Details

There are mainly four parameters in the SparseVFC algorithm: γ, λ, τ and M. Parameter γ reflects our initial assumption on the proportion of inliers in the correspondence set. Parameter λ controls the trade-off between closeness to the data and smoothness of the solution. Parameter τ is a threshold used for deciding the correctness of a match. In general, our method is very robust to these parameters. We set γ = 0.9, λ = 3, and τ = 0.75 according to the original VFC algorithm throughout this paper. Parameter M is the number of basis functions used for sparse approximation.
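For concreteness, the linear system (32) can be solved in a few lines of NumPy. The sketch below is illustrative (array names are ours, and the basis set {x̃_m} is taken as a random subset of the inputs, one of the choices discussed later):

```python
import numpy as np

def solve_sparse_system(X, Ytil, P, Xb, beta=0.1, lam=3.0, sigma2=1.0):
    """Solve (U^T P U + lam * sigma2 * Gamma) C = U^T P Ytil  -- equation (32).

    X    : (N, D) normalized points in the first image
    Ytil : (N, D) displacement outputs
    P    : (N,)   posterior inlier probabilities (diagonal of P)
    Xb   : (M, D) basis point set {x_m}
    """
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-beta * d2)

    U = gram(X, Xb)                       # (N, M), U_ij = exp(-beta ||x_i - xb_j||^2)
    G = gram(Xb, Xb)                      # (M, M) kernel matrix Gamma
    UtP = U.T * P                         # U^T diag(P) without forming diag(P)
    C = np.linalg.solve(UtP @ U + lam * sigma2 * G, UtP @ Ytil)
    return C                              # (M, D) coefficient matrix

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
Ytil = rng.standard_normal((200, 2))
P = rng.uniform(size=200)
Xb = X[rng.choice(200, 15, replace=False)]   # M = 15 random basis points
C = solve_sparse_system(X, Ytil, P, Xb)
print(C.shape)                               # (15, 2)
```

Only the M × M system is factorized, which is what makes each M-step cheap.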
The choice of M depends on both the data (i.e., the true vector field) and the assumed function space (i.e., the reproducing kernel), as shown in equation (24). We will discuss it in the experiments according to the specific application. It should be noted that in practice we do not determine the value of M according to equation (24). On the one hand, that bound is derived in the context of providing a theoretical guarantee, and in practice a good approximation may require a value of M much smaller than the bound. On the other hand, the upper bound in equation (24) is hard to compute, and it would be costly to derive a value of M for each sample set.

4. Experimental setup

In our evaluation we consider synthetic 2D vector field estimation, and mismatch removal on 2D real images and 3D surfaces. All the experiments are performed on an Intel Core2 2.5GHz PC with Matlab code. Next, we discuss the datasets and evaluation criteria.

Synthetic 2D vector field: The synthetic vector field is constructed from a scalar function defined by a mixture of five Gaussians, all with covariance 0.25I and centered at (0, 0), (1, 0), (0, 1), (−1, 0) and (0, −1) respectively, as in [8]. Its gradient yields a curl-free field, and its perpendicular (rotated) gradient yields a divergence-free field. The synthetic data is then constructed by taking a convex combination of these two vector fields. In our evaluation the combination coefficient is set to 0.5, which gives the synthetic field shown in Fig. 2. The field is computed on a 70 × 70 grid over the square [−2, 2] × [−2, 2]. The training inliers are points uniformly sampled from the grid, with zero-mean Gaussian noise of standard deviation 0.1 added to the outputs. The outliers are generated as follows: the input x is chosen randomly from the grid, and the output y is drawn from a uniform distribution on the square [−2, 2] × [−2, 2].
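The construction above can be sketched directly in NumPy, an illustrative re-implementation under the stated settings (five Gaussian bumps, 70 × 70 grid, noise level 0.1):

```python
import numpy as np

def potential_grad(P):
    """Gradient of a mixture of five Gaussian bumps (covariance 0.25*I)."""
    centers = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]], float)
    grad = np.zeros_like(P)
    for c in centers:
        diff = P - c                                   # (n, 2)
        w = np.exp(-(diff ** 2).sum(-1) / (2 * 0.25))  # Gaussian weight
        grad += -diff / 0.25 * w[:, None]              # gradient of each bump
    return grad

# 70 x 70 grid over [-2, 2] x [-2, 2]
g = np.linspace(-2, 2, 70)
P = np.stack(np.meshgrid(g, g), -1).reshape(-1, 2)

gcf = potential_grad(P)                     # curl-free: gradient field
gdf = gcf[:, ::-1] * np.array([-1.0, 1.0])  # div-free: 90-degree rotation
field = 0.5 * gdf + 0.5 * gcf               # convex combination, coefficient 0.5

# training set: inliers = noisy field samples, outliers = uniform random vectors
rng = np.random.default_rng(0)
idx = rng.choice(len(P), 500, replace=False)
X_in, Y_in = P[idx], field[idx] + 0.1 * rng.standard_normal((500, 2))
X_out = P[rng.choice(len(P), 500)]
Y_out = rng.uniform(-2, 2, (500, 2))
```

Rotating the gradient by 90° ((vx, vy) → (−vy, vx)) is what turns the curl-free field into a divergence-free one.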
The performance of vector field learning is measured by an angular measure of error [50] between the learned vector field and the ground truth. If v_g = (v_g¹, v_g²) and v_e = (v_e¹, v_e²) are the ground truth and estimated fields, we consider the transformation v → ṽ = (v¹, v², 1)/‖(v¹, v², 1)‖. The error measure is then defined as err = arccos(ṽ_e · ṽ_g).

Figure 2: Visualization of a synthetic 2D field without noise and its 500 random samples used in the experiment.

2D image datasets: We tested our method on the datasets of Mikolajczyk et al. [51] and Tuytelaars et al. [52], as well as several image pairs of non-rigid objects. The images in the first dataset are either of planar scenes or acquired with a fixed camera position; the images are therefore always related by a homography. The test data of Tuytelaars et al. contains several wide baseline image pairs. The image pairs of non-rigid objects were collected by ourselves. The open source VLFeat toolbox [53] is used to determine the initial SIFT correspondences [43]. All parameters are set to their default values except for the distance ratio threshold t. In general, a greater value of t yields fewer matches with a higher correct match percentage. The match correctness is determined by computing an overlap error [51], as in [9].

3D surface datasets: For the 3D case, we consider two datasets used in [54]: the Dino and Temple datasets. Each surface pair consists of representations of the same rigid object, which can be aligned using a rotation, translation and scale. We determine the initial correspondences using the method of Zaharescu et al. [54]. The feature point detector, MeshDOG, is a generalization of the difference of Gaussians (DoG) operator [43]; the feature descriptor, MeshHOG, is a generalization of the histogram of oriented gradients (HOG) descriptor [55]. The match correctness is determined as follows.
For these two datasets, the correspondence between the two surfaces can be formulated as y = sRx + t, where R ∈ IR^{3×3} is a rotation matrix, s is a scaling parameter, and t ∈ IR^{3×1} is a translation vector. We can use a robust rigid point registration method such as Coherent Point Drift (CPD) [41] to solve for these three parameters, after which the match correctness can be determined accordingly.

5. Experimental results

To test the performance of the sparse approximation algorithm, we perform experiments on the proposed SparseVFC algorithm, and compare its approximation accuracy and time efficiency to VFC and FastVFC on both synthetic data and real-world images. The experiments are conducted from two aspects: (i) vector field learning performance comparison on a synthetic vector field; (ii) mismatch removal performance comparison on real image datasets.

5.1. Vector field learning on a synthetic vector field

We perform a representative experiment on the synthetic 2D vector field shown in Fig. 2. For each cardinality of the training set, we repeat the experimental process 10 times with different randomly drawn examples. After the vector field is learned, we use it to predict the outputs on the whole grid and compare them to the ground truth. The experimental results are evaluated by means of test error (i.e. error between prediction and ground truth on the whole grid) and run-time speedup factor (i.e. ratio of the run-time of VFC to that of FastVFC or SparseVFC). The matrix-valued kernel is chosen to be a convex combination of the divergence-free kernel Γdf and the curl-free kernel Γcf [56] with width σ̃ = 0.8, and the combination coefficient is set to 0.5. The divergence-free and curl-free kernels have the following forms:

Γdf(x, x′) = (1/σ̃²) e^{−‖x−x′‖²/(2σ̃²)} [ ((x−x′)/σ̃)((x−x′)/σ̃)ᵀ + ((D − 1) − ‖x−x′‖²/σ̃²) I ],    (33)

Γcf(x, x′) = (1/σ̃²) e^{−‖x−x′‖²/(2σ̃²)} [ I − ((x−x′)/σ̃)((x−x′)/σ̃)ᵀ ].    (34)

Figure 3: Experiments for choosing the sample point number M used for sparse approximation. (a) The standard deviation of the estimated inliers σ∗ with respect to M. (b) The test error with respect to M. The inlier number in the training set is set to 500 and 800, and the inlier percentage is varied among 20%, 50% and 80%.

In order to choose an appropriate value of M on this synthetic field, we assess the performance of SparseVFC with respect to M under different experimental setups. Rather than using the ground truth to assess performance, i.e. the mean test error, we use the standard deviation of the estimated inliers [44]:

σ∗ = ( (1/|I|) Σ_{i∈I} ‖y_i − f(x_i)‖² )^{1/2},    (35)

where I is the estimated inlier set in the training samples, and | · | denotes the cardinality of a set. Generally speaking, a smaller value of σ∗ indicates a better estimation. The results are presented in Fig. 3a; we can see that M = 60 reaches a good compromise between accuracy and computational complexity. For comparison, we also present the mean test error in Fig. 3b. We can see that these two criteria tend to produce similar choices of M. For FastVFC, the 150 largest eigenvalues are used for calculating the low rank matrix approximation. We first give some intuitive impression of the performance of our method, as shown in Fig. 4.
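For reference, the kernels in equations (33) and (34) translate directly into code. A minimal NumPy sketch evaluating the combined matrix-valued kernel used in this experiment at one pair of points (function names are ours):

```python
import numpy as np

def gamma_df(x, xp, s=0.8):
    """Divergence-free matrix-valued kernel, equation (33)."""
    d = (x - xp) / s
    r2 = d @ d                                   # ||x - x'||^2 / s^2
    D = len(x)
    return np.exp(-r2 / 2) / s**2 * (np.outer(d, d) + ((D - 1) - r2) * np.eye(D))

def gamma_cf(x, xp, s=0.8):
    """Curl-free matrix-valued kernel, equation (34)."""
    d = (x - xp) / s
    r2 = d @ d
    return np.exp(-r2 / 2) / s**2 * (np.eye(len(x)) - np.outer(d, d))

x, xp = np.array([0.3, -0.1]), np.array([0.5, 0.2])
K = 0.5 * gamma_df(x, xp) + 0.5 * gamma_cf(x, xp)   # convex combination, width 0.8
print(K.shape)   # (2, 2)
```

Each evaluation returns a D × D block; the full matrix-valued Gram matrix stacks one such block per pair of training points.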
We see that SparseVFC can successfully recover the vector field from sparse training samples, and the performance improves as the number of inliers in the training set increases.

Figure 4: Reconstruction of the field shown in Fig. 2a via SparseVFC. Left: 200 inliers contained in the training set; right: 500 inliers contained in the training set. The inlier ratios are both 0.5. The mean test errors are about 0.072 and 0.044 respectively.

Fig. 5 and Fig. 6 show the vector field learning performance of VFC, FastVFC and SparseVFC. We consider two scenarios for performance comparison: (i) fix the inlier number and change the inlier percentage; (ii) fix the inlier percentage and change the inlier number. Fig. 5 shows the comparison of test error under different experimental setups; we see that both SparseVFC and FastVFC perform much the same as VFC. Fig. 6 shows a comparison of the average run-time. From Fig. 6a, we see that the computational cost of SparseVFC is approximately linear in the training set size. The run-time speedup factors of FastVFC and SparseVFC with respect to VFC are presented in Fig. 6b. We see that the speedup factor of FastVFC is roughly constant, since its time complexity is the same as that of VFC, both O(N³). In contrast, SparseVFC yields an essential speedup, and this advantage is magnified for larger training sets. In conclusion, when an appropriate number of basis functions is chosen, the SparseVFC algorithm approximates the original VFC algorithm quite well, with a significant run-time speedup.

5.2. Mismatch removal on real images

We notice that the algorithm demonstrates strong ability on mismatch removal.
Thus we apply it to the mismatch removal problem and perform experiments on a wide range of real images. The performance is characterized by precision and recall. Besides VFC and FastVFC, we use three additional mismatch removal methods for comparison: RANdom SAmple Consensus (RANSAC) [57], Maximum Likelihood Estimation SAmple Consensus (MLESAC) [44] and Identifying point correspondences by Correspondence Function (ICF) [47]. RANSAC tries to obtain as small an outlier-free subset as feasible to estimate a given parametric model by resampling, while its variant MLESAC adopts a new cost function using a weighted voting strategy based on an M-estimator, and chooses the solution which maximizes the likelihood rather than the inlier count as in RANSAC.

Figure 5: Performance comparison on test error. (a) Comparison of test error when changing the inlier ratio in the training sets; the inlier number is fixed to 500. (b) Comparison of test error when changing the inlier number in the training sets; the inlier ratio is fixed to 0.2.

Figure 6: Performance comparison on run-time under the experimental setup in Fig. 5b. (a) Run-time of SparseVFC. (b) Run-time speedup of FastVFC and SparseVFC with respect to VFC.

Table 2: The numbers of matches and initial inlier percentages in the six image pairs.

               Tree     Valbonne  Mex      DogCat   Peacock  T-shirt
  #matches     167      126       158      113      236      300
  inlier pct.  56.29%   54.76%    51.90%   82.30%   71.61%   60.67%
The ICF method uses support vector regression to learn a correspondence function pair which maps points in one image to their corresponding points in the other, and then rejects mismatches by checking whether they are consistent with the estimated correspondence functions.

The structure of the generated motion field is relatively simple in the mismatch removal problem. We simply choose a diagonal decomposable kernel with a Gaussian scalar part, and set β = 0.1 according to the original VFC algorithm. Moreover, the number of basis functions M used for sparse approximation is fixed in our evaluation, since choosing M adaptively would require pre-processing that would increase the computational cost. We tuned the value of M on a small test set and found that 15 basis functions are accurate enough for the sparse approximation; we therefore set M = 15 throughout the experiments. For FastVFC, the 10 largest eigenvalues are used for calculating the low rank matrix approximation.

5.2.1. Results on several 2D image pairs

Our first experiment involves mismatch removal on several image pairs, including three wide baseline image pairs (Tree, Valbonne and Mex) and three image pairs of non-rigid objects (DogCat, Peacock and T-shirt). Table 2 presents the numbers of matches as well as the initial correct match percentages of these six image pairs. The whole mismatch removal progress on these image pairs is illustrated schematically in Fig. 7, where the columns show the iterative progress, and the level of the blue color indicates to what degree a sample belongs to the inlier class, i.e. the posterior probability p_n in equation (30).

Figure 7: Results on the image pairs Tree, Valbonne, Mex, DogCat, Peacock and T-shirt. The first three rows are wide baseline image pairs, and the remaining three are image pairs of non-rigid objects. The columns (initial correspondences, initialization, iteration 1, iteration 5, convergence, final correspondences) show the iterative mismatch removal progress, and the level of the blue color indicates to what degree a sample belongs to the inlier class. For visibility, only 50 randomly selected matches are presented in the first column.

Table 3: Performance comparison with different mismatch removal algorithms. The pairs in the table are precision-recall pairs (%).

            RANSAC [57]      MLESAC [44]      ICF [47]        VFC [9]           FastVFC [9]       SparseVFC
  Tree      (94.68, 94.68)   (98.82, 89.36)   (92.75, 68.09)  (94.85, 97.87)    (94.79, 96.81)    (94.85, 97.87)
  Valbonne  (94.52, 100.00)  (94.44, 98.55)   (91.67, 63.77)  (98.33, 85.51)    (98.33, 85.51)    (98.33, 85.51)
  Mex       (91.76, 95.12)   (93.83, 92.68)   (96.15, 60.98)  (96.47, 100.00)   (96.47, 100.00)   (96.47, 100.00)
  DogCat    -                -                (92.19, 63.44)  (100.00, 100.00)  (100.00, 100.00)  (100.00, 100.00)
  Peacock   -                -                (99.12, 66.86)  (99.40, 98.82)    (99.40, 98.22)    (99.40, 98.82)
  T-shirt   -                -                (99.07, 58.79)  (98.88, 96.70)    (98.84, 93.41)    (98.88, 96.70)

In the beginning, all the SIFT matches in the first column are assumed to be inliers. We convert them into motion field training sets, shown in the 2nd column. As the EM iteration proceeds, progressively refined matches are shown in the 3rd, 4th and 5th columns. The 5th column shows that the algorithm converges to a nearly binary decision on the match correctness. The SIFT matches retained by the algorithm are presented in the last column. It should be noted that there is an underlying assumption in our method that the vector field should be smooth, i.e. the norm of the field f in equation (13) should be small. However, for wide baseline image pairs such as Tree, Valbonne and Mex, the related motion fields are in general not continuous, and our method is still effective for mismatch removal.
We give an explanation as follows: since the training data are sparse, it is not hard to find a smooth vector field which fits nearly all the (sparse) inliers well; if the goal is just to remove mismatches from the training data, the smoothness constraint works well.

Next, we give a performance comparison with the other five methods in Table 3. The geometry model used in RANSAC and MLESAC is the epipolar geometry. We see that MLESAC has slightly better precision than RANSAC at the cost of a slightly lower recall. The recall of ICF is quite low, although its precision is satisfactory. In contrast, VFC, FastVFC and SparseVFC successfully distinguish inliers from outliers, and they achieve the best trade-off between precision and recall. The low rank matrix approximation used in FastVFC may slightly hurt the performance, while in SparseVFC the sparse approximation does not lead to degraded performance; on the contrary, it makes the algorithm more efficient. Notice that we did not compare to RANSAC and MLESAC on the image pairs of non-rigid objects: RANSAC and MLESAC depend on a parametric model, e.g. the fundamental matrix, and when some objects in the image pair deform, the point pairs no longer obey the epipolar geometry, so these methods no longer work.

5.2.2. Results on a 2D image dataset

We test our method on the dataset of Mikolajczyk et al., which contains image transformations of viewpoint change, scale change, rotation, image blur, JPEG compression, and illumination. We use all 40 image pairs, and for each pair we set the SIFT distance ratio threshold t to 1.5, 1.3 and 1.0 respectively, as in [9]. The cumulative distribution function of the original correct match percentage is shown in Fig. 8a. The initial average precision over all image pairs is 69.58%, and nearly 30 percent of the training sets have a correct match percentage below 50%. Fig.
8b presents the cumulative distribution of the number of point matches contained in the experimental image pairs. We see that most of the image pairs have a large number of point matches (i.e. on the order of 1000s). The precision-recall pairs on this dataset are summarized in Fig. 8c. The average precision-recall pairs are (95.49%, 97.55%), (97.95%, 96.93%), (93.95%, 62.69%) and (98.57%, 97.78%) for RANSAC, MLESAC, ICF and SparseVFC respectively. Here we choose the homography as the geometry model in RANSAC and MLESAC. Note that the performances of VFC, FastVFC and SparseVFC are quite close; we thus omit the results of VFC and FastVFC in the figure for clarity. From the results, we see that ICF usually has high precision or recall, but not simultaneously. MLESAC performs a little better than RANSAC, and both achieve quite satisfactory performance. This can be explained by the lack of complex constraints between the elements of the homography matrix. Our method deals well with the mismatch removal problem, and it has the best trade-off between precision and recall compared to the other three methods.

Figure 8: Experimental results on the dataset of Mikolajczyk et al. (a) Cumulative distribution function of the original correct match percentage. (b) Cumulative distribution function of the number of point matches in the image pairs. (c) Precision-recall statistics for RANSAC, MLESAC, ICF and SparseVFC on the dataset of Mikolajczyk. Our method (red circles, upper right corner) has the best precision and recall overall.

Table 4: Average precision-recall and run-time comparison of RANSAC, VFC, FastVFC and SparseVFC on the dataset of Mikolajczyk.

            RANSAC [57]     VFC [9]         FastVFC [9]     SparseVFC
  (p, r)    (95.49, 97.55)  (98.57, 97.75)  (98.75, 96.71)  (98.57, 97.78)
  t (ms)    3784            6085            402             21

We compare the approximation accuracy and time efficiency of SparseVFC to VFC and FastVFC in Table 4. As shown there, both FastVFC and SparseVFC approximate VFC quite well, especially our method SparseVFC. Moreover, SparseVFC achieves a significant speedup with respect to the original VFC algorithm, of about 300 times on average. The run-time of RANSAC is also presented in Table 4, and we see that SparseVFC is much more efficient than RANSAC. To prevent the efficiency of RANSAC from decreasing drastically, a maximum sampling number is usually preset in the literature; we set it to 5000.

Table 5: Performance comparison using different matrix-valued kernels for robust learning. DK: decomposable kernel; DFK+CFK: combination of divergence-free and curl-free kernels.

              DK, ω = 0       DK, ω ≠ 0       DFK+CFK
  VFC [9]     (98.57, 97.75)  (98.66, 97.91)  (98.69, 97.65)
  SparseVFC   (98.57, 97.78)  (98.66, 97.93)  (98.67, 97.71)

The influence of different choices of basis functions for sparse approximation is also tested on this dataset. Besides random sampling, we consider three other methods: (i) simply use √M × √M uniform grid points over the bounded input space; (ii) find M cluster centers of the training inputs via the k-means clustering algorithm; (iii) pick M basis functions minimizing the residuals via sparse greedy matrix approximation [17]. The average precision-recall pairs of these three methods are (98.58%, 97.79%), (98.57%, 97.75%) and (98.58%, 97.79%) respectively. We see that all three approximation methods produce almost the same result as the random sampling method.
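Selection schemes (i) and (ii), along with the default random sampling, are easy to sketch in NumPy (2D inputs assumed for illustration; the greedy scheme (iii) follows [17] and is omitted here):

```python
import numpy as np

def random_basis(X, M, rng):
    """(Default) pick M basis points uniformly at random from the inputs."""
    return X[rng.choice(len(X), M, replace=False)]

def grid_basis(X, M):
    """(i) sqrt(M) x sqrt(M) uniform grid over the bounding box of the inputs
    (assumes M is a perfect square; otherwise fewer points are returned)."""
    m = int(np.sqrt(M))
    gx = np.linspace(X[:, 0].min(), X[:, 0].max(), m)
    gy = np.linspace(X[:, 1].min(), X[:, 1].max(), m)
    return np.stack(np.meshgrid(gx, gy), -1).reshape(-1, 2)

def kmeans_basis(X, M, rng, iters=20):
    """(ii) M cluster centers of the inputs via plain k-means (Lloyd's algorithm)."""
    centers = random_basis(X, M, rng)
    for _ in range(iters):
        lbl = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(M):
            if (lbl == j).any():
                centers[j] = X[lbl == j].mean(0)
    return centers

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
print(grid_basis(X, 16).shape, kmeans_basis(X, 16, rng).shape)  # (16, 2) (16, 2)
```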
Therefore, for the mismatch removal problem, it does not seem that selecting an "optimal" subset using sophisticated methods improves the performance over a random subset. In the interests of computational efficiency, we may thus be better off simply choosing a random subset of the training data.

Next, we study using different kernel functions for robust learning. Two additional matrix-valued kernels are tested: a decomposable kernel [8], Γ(x_i, x_j) = e^{−β‖x_i − x_j‖²} A with A = ω1_{D×D} + (1 − ωD)I_{D×D}, and a convex combination of the divergence-free kernel Γdf and the curl-free kernel Γcf. Note that the former kernel degenerates into a diagonal kernel when we set ω = 0. In our evaluation, the combination coefficients in these two kernels are selected via cross-validation. The results are summarized in Table 5. We see that SparseVFC approximates VFC quite well under different matrix-valued kernels. Moreover, using a diagonal decomposable kernel is adequate for the mismatch removal problem; it reduces the complexity of the linear system, i.e. equation (32), with only a negligible decrease in accuracy.

Table 6: Performance comparison when choosing different numbers of basis functions.

            M = 5           M = 10          M = 15          M = 20
  (p, r)    (98.10, 96.88)  (98.57, 97.73)  (98.57, 97.78)  (98.57, 97.79)
  t (ms)    12              15              21              49

Finally, we use this dataset to investigate the influence of the choice of M, the number of basis functions. Besides the default value 15, three additional values of M, namely 5, 10 and 20, are tested, and the results are summarized in Table 6. We see that SparseVFC is not very sensitive to the choice of M, and even M = 5 achieves a satisfactory performance. It should be noted that better approximation accuracy can be achieved by choosing M adaptively: for example, first compute the standard deviation of the estimated inliers σ∗ (i.e. equation (35)) under different choices of M, and then choose the one with the smallest value of σ∗.
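Such an adaptive choice of M could be sketched as follows, where `fit(M)` is a hypothetical wrapper (our name, not part of the algorithm's interface) returning the residual norms and the estimated inlier mask of one SparseVFC run:

```python
import numpy as np

def sigma_star(residuals, inlier_mask):
    """Standard deviation of the estimated inliers, equation (35)."""
    r = residuals[inlier_mask]          # ||y_i - f(x_i)|| over the inlier set I
    return np.sqrt((r ** 2).mean())

def choose_M(fit, candidates=(40, 50, 60, 70, 80)):
    """Adaptive choice of M: run the algorithm once per candidate and keep
    the value with the smallest sigma*."""
    scores = {M: sigma_star(*fit(M)) for M in candidates}
    return min(scores, key=scores.get)
```

The loop over candidates is exactly why this scheme multiplies the run-time by the number of values tried.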
However, such a scheme would significantly increase the computational cost, since it requires running the algorithm once for each choice of M. Therefore, fixing M to a sufficiently large constant, i.e. 15 in this paper, achieves a good trade-off between approximation accuracy and computational efficiency.

5.2.3. Results on 3D surface pairs

Since VFC is not influenced by the dimension of the input data, we now test the mismatch removal performance of SparseVFC on 3D surface pairs and compare it to the original algorithm. Here the parameters are set to the same values as in the 2D case. The two surface pairs Dino and Temple are shown in Fig. 9. As shown, the capability of SparseVFC is not weakened in this case. For the Dino dataset, there are 325 initial correspondences with 264 correct matches and 61 mismatches; after using SparseVFC to remove mismatches, 263 matches are preserved, all of which are correct. That is to say, all the false matches are eliminated while only 1 correct match is discarded. For the Temple dataset, there are 239 initial correspondences with 215 correct matches and 24 mismatches; after using SparseVFC to remove mismatches, 216 matches are preserved, of which 214 are correct – that is, 22 of the 24 false matches are eliminated while only 1 correct match is discarded.

Figure 9: Mismatch removal results of SparseVFC on two 3D surface pairs: Dino and Temple. There are 325 and 239 initial matches in these two pairs respectively. For each group of results, the left pair denotes the identified putative correct matches (Dino 263, Temple 216), and the right pair denotes the removed putative mismatches (Dino 62, Temple 23). For visibility, at most 50 randomly selected correspondences are presented.

Performance comparisons are presented in Table 7; we see that both FastVFC and SparseVFC approximate the original VFC algorithm well.
Therefore, the sparse approximation used in our method works quite well, not only in the 2D case but also in the 3D case.

Table 7: Performance comparison of VFC, FastVFC and SparseVFC on two 3D surface pairs: Dino and Temple. The initial correct match percentages are about 81.23% and 89.96% respectively.

            VFC [9]         FastVFC [9]      SparseVFC
  Dino      (98.87, 99.62)  (100.00, 99.62)  (100.00, 99.62)
  Temple    (99.07, 99.53)  (99.07, 99.53)   (99.07, 99.53)

6. Conclusion

In this paper, we study a sparse approximation algorithm for vector-valued regularized least-squares. It searches for a suboptimal solution under the assumption that the solution space can be represented sparsely with far fewer basis functions. The time and space complexities of our algorithm are both linear in the number of training samples, and the number of basis functions is manually assigned. We also present a new robust vector field learning method called SparseVFC, which is a sparse approximation to VFC, and apply it to the mismatch removal problem. The quantitative results on various experimental data demonstrate that the sparse approximation leads to a vast increase in speed with a negligible decrease in performance, and that it outperforms several state-of-the-art methods such as RANSAC on mismatch removal. Our sparse approximation addresses a generic problem in vector field learning, and it can be applied to other methods which can be converted into a vector field learning problem, for example, the Coherent Point Drift algorithm [41] designed for point registration. The sparse approximation of the Coherent Point Drift algorithm is described in detail in the appendix. Our approach also has some limitations; for example, the sparse approximation is validated only in the case of the ℓ2 loss. Our future work will focus on: (i) determining the basis number automatically and efficiently, and (ii) validating the sparse approximation under different types of loss functions.
Appendix: Sparse approximation for Coherent Point Drift

The Coherent Point Drift (CPD) algorithm [41] treats the alignment of two point sets as a probability density estimation problem. Given two point sets X_{LD×1} = (x_1ᵀ, · · · , x_Lᵀ)ᵀ and Y_{KD×1} = (y_1ᵀ, · · · , y_Kᵀ)ᵀ with D being the dimension of the data points, the algorithm assumes that the points in Y are the Gaussian mixture model (GMM) centroids, and that the points in X are the data points generated by the GMM. It then estimates the transformation T which yields the best alignment between the transformed GMM centroids and the data points by maximizing a likelihood. The transformation T is defined as an initial position plus a displacement function f: T(y) = y + f(y), where f is assumed to come from an RKHS H and hence has the form of equation (8). Similar to our SparseVFC algorithm, the CPD algorithm adopts an iterative EM algorithm to alternately recover the spatial transformation and update the point correspondences. In the M-step, recovering the displacement function f requires solving a linear system similar to equation (10), which takes up most of the run-time and memory requirements of the algorithm. The sparse approximation can again be used here to reduce the time and space complexity.

In each iteration of the CPD algorithm, the displacement function f can be estimated by minimizing the energy function

E(f) = (1/(2σ²)) Σ_{l=1}^{L} Σ_{k=1}^{K} p_{kl} ‖x_l − y_k − f(y_k)‖² + (λ/2) ‖f‖²_H,    (36)

where p_{kl} is the posterior probability of the correspondence between the points y_k and x_l, and σ is the standard deviation of the GMM components. Using the sparse approximation, we search for a suboptimal f^s of the form of equation (12).
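As an illustration, the energy of equation (36) can be evaluated numerically under the sparse representation f(y) = Σ_m Γ(y, ỹ_m)c_m with a Gaussian scalar kernel (the kernel choice made below); all array names here are ours:

```python
import numpy as np

def energy(C, X, Y, Yb, P, beta=0.1, lam=3.0, sigma2=1.0):
    """Evaluate the CPD energy of equation (36) for a sparse displacement
    f(y) = sum_m Gamma(y, yb_m) c_m with a Gaussian scalar kernel.

    C  : (M, D) coefficients     X  : (L, D) data points
    Y  : (K, D) GMM centroids    Yb : (M, D) basis points
    P  : (K, L) posteriors p_kl
    """
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-beta * d2)

    F = gram(Y, Yb) @ C                        # (K, D): displacements f(y_k)
    R = X[None, :, :] - (Y + F)[:, None, :]    # (K, L, D): x_l - y_k - f(y_k)
    data = (P * (R ** 2).sum(-1)).sum() / (2 * sigma2)
    reg = lam / 2 * np.trace(C.T @ gram(Yb, Yb) @ C)   # ||f||_H^2 = C^T Gamma C
    return data + reg
```

The regularizer uses trace(Cᵀ Γ̃ C), which is the RKHS norm of the sparse representation.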
Due to the choice of a diagonal decomposable kernel Γ(y_i, y_j) = e^{−β‖y_i − y_j‖²} I, equation (36) becomes

E(C) = (1/(2σ²)) Σ_{l=1}^{L} ‖(diag(P_{·,l}) ⊗ I_{D×D})^{1/2} (X^K_l − Y − ŨC)‖² + (λ/2) Cᵀ Γ̃ C,    (37)

where P = {p_{kl}} and P_{·,l} denotes the l-th column of P, the kernel matrix Γ̃ is an M × M block matrix with (i, j)-th block Γ(ỹ_i, ỹ_j), Ũ is a K × M block matrix with (i, j)-th block Γ(y_i, ỹ_j), X^K_l = (x_l; · · · ; x_l) is a KD × 1 dimensional vector, and C = (c_1; · · · ; c_M) is the coefficient vector. Taking the derivative of equation (37) with respect to the coefficient vector C and setting it to zero, we obtain the linear system

(Uᵀ diag(P1) U + λσ² Γ) C̃ = Uᵀ P X̃ − Uᵀ diag(P1) Ỹ,    (38)

where the kernel matrix Γ ∈ IR^{M×M} with Γ_ij = e^{−β‖ỹ_i − ỹ_j‖²}, U ∈ IR^{K×M} with U_ij = e^{−β‖y_i − ỹ_j‖²}, 1 is a column vector of all ones, and C̃ = (c_1, · · · , c_M)ᵀ, X̃ = (x_1, · · · , x_L)ᵀ and Ỹ = (y_1, · · · , y_K)ᵀ. Here M is the number of bases.

Thus we obtain a suboptimal solution from the coefficient matrix C̃. This corresponds to the optimal solution f^o, i.e. equation (8), with the coefficient matrix C̃ determined by the linear system [41]

(Γ + λσ² diag(P1)^{−1}) C̃ = diag(P1)^{−1} P X̃ − Ỹ,    (39)

where Γ ∈ IR^{K×K} with Γ_ij = e^{−β‖y_i − y_j‖²}, and C̃ = (c_1, · · · , c_K)ᵀ.

Since it is a sparse approximation of the CPD algorithm, we name this approach SparseCPD, and summarize it in Algorithm 2.

Algorithm 2: The SparseCPD Algorithm
Input: Two point sets X and Y
Output: Transformation T
1  Parameter initialization, including M and all the parameters in CPD;
2  repeat
3      E-step:
4          Update the point correspondence;
5      M-step:
6          Update the coefficient vector C̃ by solving linear system (38);
7          Update other parameters in CPD;
8  until convergence;
9  The transformation T is determined according to the coefficient vector C̃.

References

[1] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, 1995.
[2] T. Evgeniou, M. Pontil, T.
Poggio, Regularization Networks and Support Vector Machines, Computational Mathematics 13 (2000) 1–50. 35 [3] F. Girosi, M. Jones, T. Poggio, Regularization Theory and Neural Networks Architectures, Neural Computation 7 (1995) 219–269. [4] N. Aronszajn, Theory of Reproducing Kernels, Transactions of the American Mathematical Society 68 (1950) 337–404. [5] R. Rifkin, G. Yeo, T. Poggio, Regularized Least-Squares Classification, in: Advances in Learning Theory: Methods, Model and Applications, 2003. [6] C. Saunders, A. Gammermann, V. Vovk, Ridge Regression Learning Algorithm in Dual Variables, in: Proceedings of International Conference on Machine Learning, 1998, pp. 515–521. [7] J. A. K. Suykens, J. Vandewalle, Least Squares Support Vector Machine Classifiers, Neural Processing Letters 9 (1999) 293–300. [8] L. Baldassarre, L. Rosasco, A. Barla, A. Verri, Multi-Output Learning via Spectral Filtering, Technical Report, MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, 2011. [9] J. Zhao, J. Ma, J. Tian, J. Ma, D. Zhang, A Robust Method for Vector Field Learning with Application to Mismatch Removing, in: Proceedings of IEEE conference on Computer Vision and Pattern Recognition, 2011, pp. 2977–2984. [10] C. A. Micchelli, M. Pontil, On Learning Vector-Valued Functions, Neural Computation 17 (2005) 177–204. [11] E. Candes, T. Tao, Near-Optimal Signal Recovery from Random Projections: Universal Encoding Strategies, IEEE Transactions on Inform. Theory 52 (2005) 5406–5425. [12] S. S. Keerthi, O. Chapelle, D. DeCoste, Building Support Vector Machines with Reduced Classifier Complexity, Journal of Machine Learning Research 7 (2006) 1493–1515. [13] M. Wu, B. Scholkopf, G. Bakir, A Direct Method for Building Sparse Kernel Learning Algorithms, Journal of Machine Learning Research 7 (2006) 603–624. [14] P. Sun, X. Yao, Sparse Approximation Through Boosting for Learning Large Scale Kernel Machines, IEEE Transactions on Neural Networks 21 (2010) 883–894. [15] E. 
Tsivtsivadze, T. Pahikkala, A. Airola, J. Boberg, T. Salakoski, A Sparse Regularized Least-Squares Preference Learning Algorithm, in: Proceedings of the Conference on Tenth Scandinavian Conference on Artificial Intelligence, 2008, pp. 76–83. [16] C. K. I. Williams, M. Seeger, Using the Nyström Method to Speed Up Kernel Machines, in: Advances in Neural Information Processing Systems, 2000, pp. 682–688. [17] A. J. Smola, B. Schölkopf, Sparse Greedy Matrix Approximation for Machine Learning, in: Proceedings of International Conference on Machine Learning, 2000, pp. 911–918. [18] Y.-J. Lee, O. Mangasarian, RSVM: Reduced Support Vector Machines, in: Proceedings of the SIAM International Conference on Data Mining, 2001. 36 [19] S. Fine, K. Scheinberg, Efficient SVM Training Using Low-Rank Kernel Representations, Journal of Machine Learning Research 2 (2001) 243–264. [20] F. Girosi, An Equivalence Between Sparse Approximation and Support Vector Machines, Neural Computation 10 (1998) 1455–1480. [21] S. Chen, D. Donoho, M. Saunders, Atomic Decomposition by Basis Pursuit, Siam Journal of Scientific Computing 20 (1999) 33–61. [22] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, D. Gorinevsky, An Interior-Point Method for Large-Scale `1 -Regularized Least Squares, IEEE Journal of Selected Topics in Signal Processing 1 (2007) 606–617. [23] M. N. Gibbs, D. J. C. MacKay, Efficient Implementation of Gaussian Processes, Technical Report, Department of Physics, Cavendish Laboratory, Cambridge University, Cambridge, 1997. [24] J. C. Platt, Fast Training of Support Vector Machines Using Sequential Minimal Optimization, in: Advances in Kernel Methods: Support Vector Learning, 1999, pp. 185–208. [25] S. S. Keerthi, K. Duan, S. Shevade, A. Poo, A Fast Dual Algorithm for Kernel Logistic Regression, Machine Learning 61 (2005) 151–165. [26] L. Bottou, O. Bousquet, The Tradeoffs of Large Scale Learning, in: Advances in Neural Information Processing Systems, 2008, pp. 161–168. [27] M. Zinkevich, M. 
Weimer, A. Smola, L. Li, Parallelized Stochastic Gradient Descent, in: Advances in Neural Information Processing Systems, 2010, pp. 2595–2603. [28] Y. Tsuruoka, J. Tsujii, S. Ananiadou, Stochastic Gradient Descent Training for L1-regularized Loglinear Models with Cumulative Penalty, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009, pp. 477–485. [29] C. Carmeli, E. De Vito, A. Toigo, Vector Valued Reproducing Kernel Hilbert Spaces of Integrable Functions and Mercer Theorem, Analysis and Applications 4 (2006) 377–408. [30] T. Poggio, F. Girosi, Networks for Approximation and Learning, Proceedings of the IEEE 78 (1990) 1481–1497. [31] J. Quiñonero-Candela, C. E. Ramussen, C. K. I. Williams, Approximation Methods for Gaussian Process Regression, in: Large-Scale Kernel Machines, 2007, pp. 203–223. [32] A. R. Barron, Universal Approximation Bounds for Superpositions of A Sigmoidal Function, IEEE Transactions on Information Theory 39 (1993) 930–945. [33] L. K. Jones, A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for Projection Pursuit Regression and Neural Network Training, Annals of Statistics 20 (1992) 608–613. [34] V. Kůrková, High-Dimensional Approximation by Neural Networks, in: Advances in Learning Theory: Methods, Models and Applications, 2003, pp. 69–88. 37 [35] V. Kůrková, M. Sanguineti, Comparison of Worst Case Errors in Linear and Neural Network Approximation, IEEE Transactions on Information Theory 48 (2002) 264–275. [36] V. Kůrková, M. Sanguineti, Learning with Generalization Capability by Kernel Methods of Bounded Complexity, Journal of Complexity 21 (2005) 350–367. [37] S. Yu, V. Tresp, K. Yu, Robust Multi-Task Learning with t-Processes, in: Proceedings of International Conference on Machine Learning, 2007, pp. 1103–1110. [38] S. Zhu, K. Yu, Y. 
Gong, Predictive Matrix-Variate t Models, in: Advances in Neural Information Processing Systems, 2008, pp. 1721–1728. [39] P. J. Huber, Robust Statistics, John Wiley & Sons, New York, 1981. [40] C. M. Bishop, Pattern Rcognition and Machine Learning, Springer, 2006. [41] A. Myronenko, X. Song, Point Set Registration: Coherent Point Drift, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010) 2262–2275. [42] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision (2nd ed.), Cambridge University Press, Cambridge, 2003. [43] D. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision 60 (2004) 91–110. [44] P. H. S. Torr, A. Zisserman, MLESAC: A New Robust Estimator with Application to Estimating Image Geometry, Computer Vision and Image Understanding 78 (2000) 138–156. [45] J. H. Kim, J. H. Han, Outlier Correction from Uncalibrated Image Sequence Using the Triangulation Method, Pattern Recognition 39 (2006) 394–404. [46] L. Goshen, I. Shimshoni, Guided Sampling via Weak Motion Models and Outlier Sample Generation for Epipolar Geometry Estimation, International Journal of Computer Vision 80 (2008) 275–288. [47] X. Li, Z. Hu, Rejecting Mismatches by Correspondence Function, International Journal of Computer Vision 89 (2010) 1–17. [48] J. Ma, J. Zhao, Y. Zhou, J. Tian, Mismatch Removal via Coherent Spatial Mapping, in: Proceedings of IEEE International Conference on Image Processing, 2012, pp. 1–4. [49] J. Ma, J. Zhao, J. Tian, Z. Tu, A. Yuille, Robust Estimation of Nonrigid Transformation for Point Set Registration, in: Proceedings of IEEE conference on Computer Vision and Pattern Recognition, 2013. [50] J. Barron, D. Fleet, S. Beauchemin, Performance of Optical Flow Techniques, International Journal of Computer Vision 12 (1994) 43–77. [51] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, L. 
van Gool, A Comparison of Affine Region Detectors, International Journal of Computer Vision 65 (2005) 43–72. [52] T. Tuytelaars, L. van Gool, Matching Widely Separated Views Based on Affine Invariant Regions, 38 International Journal of Computer Vision 59 (2004) 61–85. [53] A. Vedaldi, B. Fulkerson, VLFeat - An Open and Portable Library of Computer Vision Algorithms, in: Proceedings of the 18th annual ACM international conference on Multimedia, 2010, pp. 1469–1472. [54] A. Zaharescu, E. Boyer, K. Varanasi, R. Horaud, Surface Feature Detection and Description with Applications to Mesh Matching, in: Proceedings of IEEE conference on Computer Vision and Pattern Recognition, 2009, pp. 373–380. [55] N. Dalal, B. Triggs, Histograms of Oriented Gradients for Human Detection, in: Proceedings of IEEE conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893. [56] I. Macêdo, R. Castro, Learning Divergence-free and Curl-free Vector Fields with Matrix-valued Kernels, Technical Report, Instituto Nacional de Matematica Pura e Aplicada, Brasil, 2008. [57] M. A. Fischler, R. C. Bolles, Random Sample Consensus: A Paradigm for Model Fitting with Application to Image Analysis and Automated Cartography, Communications of the ACM 24 (1981) 381–395. 39
