Alma Mater Studiorum – Università di Bologna
DEIS – Dipartimento di Informatica, Elettronica e Sistemistica

ON-LINE ADAPTIVE VISUAL TRACKING

Samuele Salti

TUTOR: Professor Tullio Salmon Cinotti
CO-TUTOR: Professor Luigi Di Stefano
COORDINATOR: Professor Paola Mello

PhD Thesis
January 2008 – December 2010
PhD program in Electronic, Computer Science and Telecommunications Engineering
cycle XXIII – ING-INF/05
Contents

Introduction

1 Adaptive Transition Model
  1.1 Motivation
  1.2 Previous work
  1.3 On-line adaptation of the transition model
    1.3.1 SVMs in ε-regression mode
    1.3.2 SVRs for transition model estimation
  1.4 Support Vector Kalman
    1.4.1 Adaptive process noise model
  1.5 Experimental results
    1.5.1 Simulation of linear motion
    1.5.2 Simulation of non-linear motion
    1.5.3 3D camera tracking
    1.5.4 Mean-shift tracking through occlusions

2 Adaptive Appearance Model
  2.1 Additional definitions
    2.1.1 Confidence map
    2.1.2 Generative vs. Discriminative Trackers
  2.2 Elements of Adaptive Modeling in Visual Tracking
    2.2.1 Sampling and Labeling
    2.2.2 Feature Extraction
    2.2.3 Feature Set Refinement
    2.2.4 Feature selection
    2.2.5 Model Estimation
    2.2.6 Model Update
  2.3 Adaptive modeling with Particle Filtering
  2.4 Experimental Results
    2.4.1 Methodology
    2.4.2 Dollar sequence
    2.4.3 Faceocc2 sequence
    2.4.4 Coke sequence

3 Synergistic Change Detection and Tracking
  3.1 Related Works
  3.2 Models and assumptions
    3.2.1 RBE model
    3.2.2 Bayesian change detection model
    3.2.3 Bayesian loop models
  3.3 Cognitive Feedback
  3.4 Bayesian change detection
    3.4.1 On-line learning for change detection
  3.5 Reasoning probabilistically on change maps
  3.6 Experimental Results

4 3D Surface Matching and Object Categorization
  4.1 SHOT descriptor
    4.1.1 Analysis of Previous Work
    4.1.2 On the traits and importance of the local RF
    4.1.3 Disambiguated EVD for a repeatable RF
    4.1.4 Description by Signatures of Histograms
    4.1.5 Experimental results
  4.2 Color SHOT
    4.2.1 A combined texture-shape 3D descriptor
    4.2.2 Experimental Results
  4.3 Object Category Recognition with 3D ISM
    4.3.1 3D Implicit Shape Model
    4.3.2 Codebook
    4.3.3 Codeword Activation Strategy
    4.3.4 Votes Weighting Strategy
    4.3.5 Experimental Results
    4.3.6 Discussion

Conclusions

Bibliography

Publications related to this work
Introduction
Visual tracking is the problem of estimating some variables related to a
target given a video sequence depicting the target. In its simplest form,
it consists of estimating the position of the target while it wanders in
the scene, i.e. its trajectory in the image plane. Depending on the final
application and the tracker complexity, additional target variables can
be estimated, such as scale, orientation, joint angles between its parts,
velocity, etc. These variables form the target state, i.e. the set of hidden
variables that the tracker tries to recover from noisy observations of it,
i.e. the video frames.
Visual tracking is a fundamental feature for the automation of many
tasks, such as visual surveillance, robot or vehicle autonomous navigation, automatic video indexing in multimedia databases, etc. It is also a
basic enabling factor for making machines able to interpret human motion and deliver a whole new branch of services and applications, such
as natural human-computer interfaces, smart homes, offices or urban environments and computer-aided diagnosis or rehabilitation.
Visual tracking is difficult because of the classical nuisances that
computer vision faces, such as scene illumination changes, loss of information due to perspective projection, sensor noise, etc., and because of
peculiar difficulties, such as complex motion patterns of the target, non-rigid or appearance-changing targets, partial and full target occlusions.
Despite many years of research, long term tracking in real world
scenarios for generic targets is still an unsolved problem. The main contribution of this thesis is the definition of effective algorithms that can bring
visual tracking closer to a solution by letting the tracker adapt to mutating working conditions. In particular, we propose to adapt two crucial
components of visual trackers: the transition model and the appearance
model. The adaptation is performed on-line, i.e. frame-by-frame while
the tracker runs. To better contextualize our contributions, we first introduce the standard formulation of the tracking problem and the tools
typically used to solve it.
As noted in [17], two major components can be distinguished in a
typical visual tracker: Filtering and Data Association is mostly a top-down process dealing with the dynamics of the tracked object and evaluation of different hypotheses; Target Representation and Localization
is mostly a bottom-up process which has to cope with changes in the appearance of the target and has to provide an effective description of it in
presence of similar objects (distractors). The way the two components
are combined and weighted is application dependent and plays a decisive role in the robustness and efficiency of the tracker. Nevertheless, for
a general tracker both components are key to tracker success.
As far as the Filtering and Data Association component is concerned,
to deal with all the nuisances and to take into account the uncertainty
they introduce in the final estimation, one widespread approach is to
formulate tracking as a probabilistic inference problem on the space of
all possible states. The probabilistic formulation and the requirement for
the updating of state estimation on receipt of new measurements naturally lead to the Bayesian approach. It provides a rigorous general framework for dynamic state estimation.
In the Bayesian approach the output is the posterior probability density function (PDF) of the state, based on all available information, i.e.
the sequence of previous states and received measurements. Since the
posterior PDF encompasses all available statistical information, an optimal estimation of the state with respect to any criterion may be obtained
from it.
In this thesis we deal only with causal trackers, i.e. we do not take
into account visual trackers using future frames and states to estimate
the state at a given time. In a causal tracker an estimate of the state
is computed every time a measurement is received, i.e. a new frame is
available in the frame buffer, using only past states and measures. A
recursive filter is the natural solution in this case. Hence, Recursive
Bayesian Estimation (RBE) [3, 79] is the standard tool to tackle the state
estimation in causal visual trackers.

Figure 1: The first order Markov chain structure assumed for the target state.
RBE is solved, at least from a theoretical point of view, under the
standard assumption that the system can be modeled as a first order
Markov model (Fig. 1), i.e.
• the state at time k, xk ∈ RN , depends only on the previous state
xk−1 ;
• the measure at time k, zk ∈ RM , depends only on xk .
In the case of visual tracking, the measure zk typically coincides with
the current frame Ik , hence the two terms and symbols will be used interchangeably.
From the first order Markovian assumption it follows that the system
is completely specified by:
• a law of evolution of the state,
x_k = f_k(x_{k-1}, \nu_k)    (1)
where νk is an i.i.d. process noise sequence and fk is a possibly
non-linear function relating the state at time k with the previous
one;
• a measurement process,
z_k = h_k(x_k, \eta_k)    (2)
where ηk is an i.i.d. measurement noise sequence and hk is a possibly non-linear function relating the measurement at time k with
the current state;
• an initial state x0 .
Process noise takes into account any modeling errors or unforeseen disturbances in the state evolution model.
In a Bayesian probabilistic approach, given the noise affecting the
law of evolution of the state and the measurement process, the entities
comprising the system are defined by PDFs, i.e.
• the transition model,
p(x_k | x_{k-1})    (3)
defined by (1) and the statistics of νk ;
• the observation likelihood,
p(z_k | x_k)    (4)
defined by (2) and the statistics of ηk ;
• the initial target PDF p(x0 ).
These PDFs are generally assumed to be known a priori and never
updated.
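As a purely illustrative example (not taken from the original text), for a constant velocity model of a 1D point, with state x_k = [p_k, v_k]^T collecting position and velocity and additive Gaussian noise, the transition model (3) is the Gaussian centered on the deterministic prediction:

x_k = F x_{k-1} + \nu_k, \quad F = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}, \quad \nu_k \sim \mathcal{N}(0, Q) \;\Rightarrow\; p(x_k | x_{k-1}) = \mathcal{N}(x_k;\, F x_{k-1},\, Q).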
Given this characterization of the target, a general but conceptual
solution can be obtained in two steps: prediction and update. In the prediction stage, the Chapman-Kolmogorov equation is used to propagate
the belief on the state at time k − 1 to time k:

p(x_k | z_{1:k-1}) = \int p(x_k | x_{k-1}) \, p(x_{k-1} | z_{1:k-1}) \, dx_{k-1}    (5)

where z_{1:k-1} = \{z_1, \ldots, z_{k-1}\} is the set of all measurements up to frame k − 1.
This usually corresponds to a spreading of the belief on the state, due to
the increasing distance in time from the last measurement. In the update
stage, the PDF is sharpened again by using the current measure zk and
the Bayes rule
p(x_k | z_{1:k}) \propto p(z_k | x_k) \, p(x_k | z_{1:k-1})    (6)
This conceptual solution is analytically solvable only in few cases.
A notable one is when the law of evolution of the state and the measurement equations are linear and the noises are Gaussian. In this situation,
the optimal solution is provided by the Kalman filter [42]. The RBE
framework for this case becomes:
x_k = f_k(x_{k-1}, \nu_k) \;\Rightarrow\; x_k = F_k x_{k-1} + \nu_k, \quad E[\nu_k \nu_k^T] = Q_k    (7)
z_k = h_k(x_k, \eta_k) \;\Rightarrow\; z_k = H_k x_k + \eta_k, \quad E[\eta_k \eta_k^T] = R_k    (8)
and the mean and covariance matrix of the Gaussian posterior can be
optimally estimated using the Kalman filter equations:
• prediction,
x_k^- = F_k x_{k-1}    (9)
P_k^- = F_k P_{k-1} F_k^T + Q_k    (10)
where xk−1 and Pk−1 are the previous estimates of, respectively, the
mean vector and the covariance matrix, and x_k^- and P_k^- are, following the typical Kalman notation, the estimates of, respectively, the
mean vector and the covariance matrix for the current frame before
a new measure is available;
• update,
S_k = H_k P_k^- H_k^T + R_k    (11)
K_k = P_k^- H_k^T S_k^{-1}    (12)
x_k = x_k^- + K_k (z_k - H_k x_k^-)    (13)
P_k = (I - K_k H_k) P_k^-    (14)
where xk and Pk are the optimal estimates of, respectively, the
mean vector and the covariance matrix.
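To make equations (9)-(14) concrete, the following minimal Python/NumPy sketch (not part of the original text) implements one predict-update cycle; variable names simply mirror the symbols above.

import numpy as np

def kalman_step(x_prev, P_prev, z, F, H, Q, R):
    # Prediction, Eq. (9)-(10)
    x_pred = F @ x_prev
    P_pred = F @ P_prev @ F.T + Q
    # Update, Eq. (11)-(14)
    S = H @ P_pred @ H.T + R                           # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)                # Kalman gain
    x = x_pred + K @ (z - H @ x_pred)                  # corrected mean
    P = (np.eye(P_pred.shape[0]) - K @ H) @ P_pred     # corrected covariance
    return x, P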
When the assumptions made by the Kalman filter do not hold, a suboptimal solution to the RBE problem can be obtained with particle filters
[79]. Particle filters perform sequential Monte Carlo estimation. Given
the posterior p(xk | z1:k), we want to obtain an estimate of the state from
it:
\hat{x}_k = \int_{\mathbb{R}^n} f(x_k) \, p(x_k | z_{1:k}) \, dx_k    (15)
The Monte Carlo solution is a numerical evaluation of the integral,
which requires drawing L samples x_k^i from the posterior and then computing
the estimate as the sample mean
\hat{x}_k = \frac{1}{L} \sum_{i=1}^{L} f(x_k^i)    (16)
Unfortunately, it is impossible to sample from the posterior in the
general non-Gaussian/non-linear case, since it has a non-standard form
and it is usually known only up to a proportionality constant. However,
if it is possible to generate samples from a density q(xk ) that is similar
to the posterior (i.e., it is not 0 when the posterior is not 0), then we can
still use the Monte Carlo method to approximate the integral in (15) by
drawing samples from q(x_k) and weighting them accordingly:

\hat{x}_k = \frac{1}{L} \sum_{i=1}^{L} f(x_k^i) \, w(x_k^i), \quad \text{with} \quad w(x_k^i) = \frac{p(x_k^i | z_{1:k})}{q(x_k^i)}    (17)
This technique is known as importance sampling and the PDF q is referred to as the importance or proposal density.
Particle filters are based on sequential importance sampling. The key
idea is to represent the posterior by a set of random samples with associated weights, the particles. The posterior PDF can then be approximated
as

p(x_k | z_{1:k}) \approx \sum_{i=1}^{L} w(x_k^i) \, \delta(x_k - x_k^i)    (18)
where samples are obtained at each time step from the proposal density
q(xk | xk−1 , zk ), weights are updated at each time step as [79]
w(x_k^i) \propto \frac{p(x_k | z_{1:k})}{q(x_k | x_{1:k-1}, z_{1:k})} \propto w(x_{k-1}^i) \, \frac{p(z_k | x_k^i) \, p(x_k^i | x_{k-1}^i)}{q(x_k^i | x_{k-1}^i, z_k)}    (19)
and then normalized to sum to one. It can be shown that as L → ∞ the
approximation in (18) converges to the true posterior density.
The main problem with sequential importance sampling is particle degeneracy: the variance of the particle weights can only increase with sequential importance sampling. In
practice, this means that after a certain number of recursive steps, all but
one particle will have negligible weights. To counteract this effect, resampling algorithms are introduced, leading to so-called sequential importance resampling algorithms. Resampling eliminates samples with
low weights and multiplies samples with high importance weights. This
corresponds to computing a less accurate approximation of the posterior
that concentrates on salient regions of the state space and avoids wasting
computational power by propagating particles that carry negligible
contributions to the posterior approximation. The new set of particles is
generated by resampling with replacement L times from the cumulative
sum of normalized weights of the particles [79].
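As an illustration only (again not from the original text), a bare-bones sequential importance resampling step can be sketched as follows; transition_sample and likelihood are hypothetical placeholders for the problem-specific transition model and observation likelihood, and the transition prior is used as proposal density, so that the weight update of (19) reduces to a multiplication by the likelihood.

import numpy as np

def sir_step(particles, weights, z, transition_sample, likelihood):
    L = len(particles)
    # Propagate each particle through the proposal (here, the transition model)
    particles = np.array([transition_sample(p) for p in particles])
    # Re-weight with the observation likelihood and normalize
    weights = weights * np.array([likelihood(z, p) for p in particles])
    weights = weights / weights.sum()
    # Resample with replacement according to the normalized weights
    idx = np.random.choice(L, size=L, p=weights)
    particles = particles[idx]
    weights = np.full(L, 1.0 / L)
    return particles, weights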
Within the RBE framework, our major contribution, described in
Chapter 1, is an algorithm to effectively and efficiently estimate the transition model p(xk | xk−1 ) on-line from the tracker output in the Gaussian
and linear case. This reduces the number of parameters to be set by
the user, in particular the process noise covariance, which is typically
hard to estimate but plays a significant role in the filter performance.
Our algorithm also allows for obtaining a time-variant estimation of the
transition model, and therefore results in a more adaptive filter.
As far as Target Representation and Localization is concerned, two
main ingredients constitute it, namely the choice of the features space
and the target appearance model.
The regions of the current frame Ik analyzed by the recursive Bayesian
filter are generally projected in some feature space. For instance, in a
standard approach for tracking with particle filters [78], the samples of
the state generated by the importance density are then represented as
color histograms [17].
The feature representation usually is:
• more compact than the corresponding region of Ik ;
• invariant to some (geometric or photometric) variations.
A variety of features has been used to describe the target, e.g. motion
vectors, change detection, object classification, low-level features such
as pixel colors or gradients, or mid-level features such as edges and interest points (see [104] for a survey). A main distinguishing characteristic
among features is their spatial extent:
• Part-wise features. Features are extracted from small patches or
even single pixels (e.g. 5 × 5 HoGs [20]). It is relatively easy to
deal with partial occlusions but these features are hard to match if
the target undergoes deformation or rigid transformations such as
rotations and scalings.
• Target-wise features. The feature represents the whole target
appearance (e.g. color histograms [17]). This kind of feature can
typically tolerate target deformations and rigid transformations.
Correct handling of occlusions represents the most serious limitation of these representations.
The link between the Filtering and the Representation stage of a
tracker is represented by the observation likelihood p(zk | xik ) defined
in (4). To evaluate it, in order to update a particle weight, the appearance model of the target, that we indicate with A, is compared with the
features extracted from the state candidate xik . The target model lives, of
course, in the same feature space used to describe the current candidates.
The target model is usually learned once, either offline from training data
or online from the first frame(s), and then used throughout the sequence.
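For instance, with color histograms as features, a common choice (used e.g. in color-based particle filtering along the lines of [78]; the exponential form and the λ parameter below are illustrative assumptions, not a prescription of this thesis) is to derive the likelihood from the Bhattacharyya coefficient between the model histogram A and the candidate histogram:

import numpy as np

def color_likelihood(candidate_hist, model_hist, lam=20.0):
    # Bhattacharyya coefficient between two normalized color histograms
    rho = np.sum(np.sqrt(candidate_hist * model_hist))
    # The likelihood decreases with the Bhattacharyya distance d^2 = 1 - rho
    return np.exp(-lam * (1.0 - rho))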
The use of a fixed model for locating the target makes it difficult to
cope with illumination changes and deformable targets. Hence, recently,
the idea of appearance model update has been proposed by several researchers to aim at successful long-term tracking despite these difficulties. By letting the model evolve across frames to include and adapt
to the potential geometric and photometric changes of the target, these
methods are inherently able to cope with target deformations and lighting variations. On the other hand, they expose the tracker to the risk
of drift, i.e. the inclusion of background appearance in the appearance
model that can eventually lead to loss of track.
In chapter 2, we analyze the recent advances in target model update
and present our proposal, which is based on the deployment of the Recursive Bayesian Estimation framework to tackle target model update,
too. This allows for exploiting the robustness of this framework also
in the crucial step of target model update and introduces a probabilistic
treatment as an interesting solution for this open problem.
Chapter 3 deals with adaptive tracking with a static camera. Our
contribution in this context concerns both Target Representation and Filtering. As for the former, we introduce a novel, efficient change detection algorithm, robust to sudden illumination changes, based on the joint
histogram of background and foreground intensities and on Bayesian
inference. As for the latter, we propose a sound way to obtain an adaptive observation likelihood from the output of the change detection and a
method to obtain a proper prior for the change detection from the prediction step of the recursive Bayesian filter employed as tracker. The two
flows of information realize a full adaptive Bayesian loop encompassing
tracking and change detection.
Finally, in Chapter 4 we present our work on the detection of categories in 3D data. In a real automatic deployment a visual tracker is
usually initialized with the output of a detector for the category of interest (e.g. humans, cars, faces). While detection in images has reached
a high level of maturity [20, 50, 100], data coming from 3D sensors
have not been fully exploited yet. Moreover, we have recently seen an
increasing interest in the automatic analysis of such data due to the release of cheap modern sensors such as the Kinect device by Microsoft,
which foreshadows a ubiquitous presence of 3D data for human-computer
interaction. In our work we adapt the well-known Implicit Shape Models
[50], proposed for images, to the detection of categories in 3D data. This
extension is based on our novel descriptor for 3D data, dubbed SHOT,
that obtains state of the art performance in various experiments of shape
matching, also presented in the chapter. Finally, the extension of SHOT
for the description of textured 3D data like those provided by the Kinect
sensor is described and compared with another texture-aware descriptor
[106].
All the tracking results for the first three chapters are available as
videos at the author's website, www.vision.deis.unibo.it/ssalti.
Chapter 1
Adaptive Transition Model
Recursive Bayesian Estimation (RBE) is a widespread solution for visual tracking as well as for applications in other domains where a hidden state is estimated recursively from noisy measurements. Although
theoretically sound and unquestionably powerful, from a practical point
of view RBE suffers from the assumption of complete a priori knowledge of the transition model, that is typically unknown. The use of a
wrong a priori transition model may lead to large estimation errors or
even to divergence. We propose to prevent these problems, in case of
fully observable systems, by learning the transition model on-line via Support Vector Regression [86]. An application of this general framework
is proposed in the context of linear/Gaussian systems, where we dub it
Support Vector Kalman (SVK), and is shown to be superior to a standard,
non-adaptive solution.
1.1 Motivation
The difficulty of identifying a proper transition model for a specific application typically leads to empirical and suboptimal tuning of the estimator parameters. The most widespread solutions to specify a transition model for tracking are to empirically select it among a restricted
set of standard ones (such as constant position, i.e. Brownian motion, [1, 4, 16], or constant velocity [17, 32, 34, 38]) or learn it off-line from representative training data [78]. Besides the availability of these training sequences, which depends on the particular application, the major shortcoming of these solutions is that they do not allow the transition model to change through time, although this can be beneficial and neither the conceptual solution nor the solving algorithms require it to be fixed.

Figure 1.1: The effect of the use of a wrong transition model: the Kalman estimation diverges from the true velocity.
Approximate tuning of a recursive Bayesian filter may seriously degrade its performance, which could be optimal (e.g., when the assumptions of a Kalman filter are met) or sub-optimal (e.g., in all the other
cases where a particle filter is used) in case of correct system identification. In Fig. 1.1 we present a simple experiment to highlight the strong,
detrimental impact of a wrong transition model on an otherwise optimal
and correctly tuned recursive Bayesian filter. In this simulation a point is
moving along a line with constant acceleration and we try to estimate its
position and velocity by a Kalman filter from measurements corrupted
with Gaussian noise, whose constant covariance matrix is known and
used as the measurement noise covariance matrix of the filter, Rk . Hence,
we are using the optimal estimator for the experimental setup. The only
parameter that is wrongly tuned is the transition model, in particular we
are using a constant velocity matrix Fk instead of a constant acceleration
one. The process covariance matrix, Qk , was set very high, in order to
compensate for the wrong transition matrix. Despite this, the estimation
and the true value of the velocity diverge. In other words, the estimation
of an otherwise optimal estimator like the Kalman filter can be arbitrarily
wrong when an incorrect transition model is assumed. This is the main
motivation behind our work.
1.2 Previous work
Closely related to our work are the efforts devoted to the derivation of
adaptive Kalman filters, that have been studied since the introduction
of this filtering technique. In fact, our proposal can be seen as a new
approach to build an adaptive Kalman filter. The main idea behind adaptive filtering schemes is that the basic source of uncertainty is due to
the unknown noise covariances, and the proposed solution is to estimate
them on-line from observed data. One of the most comprehensive contributions is given by Mehra [58]. He reviews proposed approaches and
classifies them according to four categories:
1. Bayesian Estimation (BE)
2. Maximum Likelihood Estimation (MLE)
3. Correlation Methods (CM)
4. Covariance-Matching Techniques (CMT).
Methods in the first category imply integration over a large dimensional
space and can be solved only with special assumptions on the PDF of
the noise parameters. MLE requires the solution of a non-linear equation that, in turn, is solvable only under the assumptions that the system
is time invariant and completely observable and the filter has reached a
steady state. Under these assumptions, however, only a time invariant
estimation of the parameters of the noise PDF can be obtained. Correlation Methods, too, are applicable only to time invariant and completely observable systems. Finally, Covariance-Matching Techniques
can estimate either process or measurement noise parameters and turn
out to provide good and time-varying approximations for the measurement noise when the process noise is known.
In the work of Oussalah and De Schutter [70], an improved correlation method is proposed, but the requirement on the stationarity of the
system is not dropped. In the context of visual tracking, Weng et al.
[101] present the application of an adaptive Kalman filter. The process
and measurement errors are modified in every frame taking into account
the degree of occlusion of the target: greater occlusion corresponds to
greater value of measurement noise and vice versa. The two noises always sum up to one. In the extreme case of total occlusion, measurement
noise is set to infinity and process noise to 0. Zhang et al. [109] use the
term Adaptive to refer to an adaptive forgetting factor, that is used to
trade off the contribution to the covariance estimate for the current time
step of the covariance estimate for the previous time step and the process
noise. This is done in order to improve the responsiveness of the filter in
case of abrupt state changes.
Compared to all these proposals, our method makes fewer assumptions
on the system, the only one being its complete observability. This allows
it to be more generally applicable and, in particular, to fit better the usual
working conditions of visual trackers. Moreover, unlike BE, MLE and
CM techniques our proposal provides a time-varying noise statistics estimation. This is extremely important to allow the filter to dynamically
weight the prediction on the state and the noisy measurement it has to
fuse at each frame, e.g. to cope with occlusions when the measurement
can be totally wrong and the prediction on the state is the only reliable
source of information to keep on tracking the target. Unlike the work of
Weng et al. [101], our proposal is not specifically conceived for visual
tracking and, hence, generally applicable. Finally, it is worth pointing
out that, unlike all reviewed approaches, our proposal is adaptive in a
broader sense, for it identifies on-line not only the process noise covariance matrix but also the transition matrix.
1.3 On-line adaptation of the transition model
We propose to overcome the difficulties and the shortcomings due to the
empirical tuning of the transition model by adapting it on-line. If the
state is completely observable, as is the case in most practical applications, i.e. the hk function just adds measurement noise to the state,
the transition model is directly related to the dynamics exhibited by the
measurements. Hence, it is possible to exploit their temporal evolution
in order to learn the function fk , and, implicitly, the PDF p(xk |xk−1 ). That
is, we can avoid defining p(xk |xk−1 ) and instead use in its place a learned
PDF p̃z1:k−1 (xk |xk−1 ), derived from a learned f˜z1:k−1 . Here, p̃z1:k−1 formally
indicates that the PDF is learned using as training data the relationships
between all the consecutive measures from 1 to k − 1.
Furthermore, we propose to learn the motion model using Support
Vector Machines [99] in ε-regression mode (SVR) [86]. SVMs are well
known and effective tools in pattern recognition based on the statistical learning theory developed by Vapnik and Chervonenkis. Their
widespread use is due to their solid theoretical bases which guarantee
their ability to generalize from training data minimizing the over-fitting
problem. Their use as regressors is probably less popular but even in this
field they provide excellent performance [86]. In the case of linear and
Gaussian systems, there is another important reason to use SVR in combination with Kalman filters (the optimal RBE filter in such a case). The
noise model assumed by an SVR is Gaussian, with mean and covariance
being random variables whose distributions depend on two of its parameters, C and ε, as discussed in the very interesting work of Pontil et al.
[76]. The mean, in particular, is uniformly distributed between −ε and
ε. Therefore, the SVR noise model is a superset of that assumed by the
Kalman filter, i.e. a zero-mean Gaussian. In other words, the SVR is
a theoretically sound regressor to apply in all the situations where the
Kalman is the optimal filter.
1.3.1 SVMs in ε-regression mode
To introduce SVMs as regressors, and in particular in ε-regression mode,
let us have a quick look at the regression of a linear model given a series
of data (x_i, y_i). In ε-regression mode the SVR tries to estimate a function
of x that deviates from the training data y_i by at most ε and is at the same time
as flat as possible. The requirement of flatness comes from the theory
of complexity developed by Vapnik [99] and ensures that we will get
a solution with minimal complexity (hence, with better generalization
abilities). In the linear case, the model we fit on the data is
f(x) = \langle w, x \rangle + b    (1.1)
and the solution with minimal complexity is given by the one and only
solution of the following convex optimization problem
\min \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*) \quad \text{subject to} \quad \begin{cases} y_i - \langle w, x_i \rangle - b \le \epsilon + \xi_i \\ y_i - \langle w, x_i \rangle - b \ge -\epsilon - \xi_i^* \end{cases}    (1.2)
The constant C is an algorithm parameter and weights the deviations
from the model greater than ε. The problem is then usually solved using
its dual form, which is easier to solve and to extend to the estimation of non-linear functions [99].
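As a usage illustration (assuming the scikit-learn library is available; this snippet is not part of the original work), a linear ε-SVR can be fit to noisy samples of a line as follows, with C and epsilon playing the roles discussed above:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0 + rng.normal(0.0, 0.1, 50)   # noisy affine data

# epsilon-SVR with a linear kernel: deviations larger than epsilon are penalized by C
svr = SVR(kernel="linear", C=1.0, epsilon=0.1)
svr.fit(x, y)
w, b = svr.coef_[0][0], svr.intercept_[0]               # regressed slope and offset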
1.3.2 SVRs for transition model estimation
In the context of RBE, given the first order Markovian assumption, one
is left with two options to regress fk :
1. to learn it from measures, that is to provide to the SVR as training
data at time k the tuples
\langle \hat{x}_1, z_2 \rangle, \ldots, \langle \hat{x}_{k-2}, z_{k-1} \rangle    (1.3)
where x̂k is the state vector estimate obtained from the recursive
Bayesian filter at time k;
2. to learn it from states, that is to provide to the SVR as training data
at time k the tuples
\langle \hat{x}_1, \hat{x}_2 \rangle, \ldots, \langle \hat{x}_{k-2}, \hat{x}_{k-1} \rangle    (1.4)
Generally speaking, to learn the transition model from the relation
between consecutive filtered states may cause the filter to repeatedly
confirm itself, i.e. to regress the transition model that the filter itself is
imposing on the training data. While this effect may guarantee a certain
level of smoothness of the output, if this loop degenerates the filter trusts
too much the learned model and diverges from the real state of the system by ignoring subsequent measures. On the other hand, learning form
measures avoids this risk and results in a more responsive filter; yet, for
the same reasons, it produces a filter more sensitive to noise, whose effects on the output of the filter or on the quality of the learned transition
model cannot easily be mitigated. Therefore, we advocate the use of the
learning from states strategy and will introduce a specific mechanism to
avoid the degeneracy of the learning loop.
Since the SVR can only regress functions f : Rn → R, if the state
vector has dimension n, n SVRs are used, and each one is fed with tuples of the form ⟨x̂_{k-2}, x̂^i_{k-1}⟩, where the superscript i indicates the i-th
component of a vector.
Another important design choice is the nature and length of the temporal window used to select states (or measures) for training. It does
not make sense to use all the state transitions since the beginning of observations to learn the transition model for the current time slot, or, at
least, it does not make sense during regression to equally weight their
contributions. A solution that may be used to address this problem is dynamic SVR for time series regression, introduced by Cao and Gu [11].
While we believe that this may be beneficial, and can be an interesting
investigation to carry on in the future, so far we have relied on a simpler
solution, namely a sliding window of fixed length, to prevent too old
samples from polluting the current estimate.
Finally, the influence of the time variable must be considered during
regression. To understand this, consider the circular motion on the unit
circle depicted in the leftmost chart of Fig.1.2. Assuming for clarity of
the graphical explanation the state vector to be composed only by the x
position of the point, some of the samples from which the SVR has to
regress the transition model of this point are depicted in the second chart.
As can be seen, without taking into account the evolution of the state
through time, even with a perfect regression (represented by the dotted
line in the second chart), it is impossible to have a correct prediction of
the state at time t, given the state at time t − 1: for example, at time t = 4
and t = 6 the previous state, xt−1 , is equal for the two positions, but the
output of the regression should be different, namely x4 = −1 and x6 = 0.
This situation can be disambiguated adding time as an input variable to
the function to be regressed, as shown by the last chart.
To summarize, n SVRs are used, where n is the dimension of the
state vector xk . The i-th SVR is trained at frame k by using the following
training set
\{\langle k-1-W,\, \hat{x}_{k-2-W},\, \hat{x}^i_{k-1-W} \rangle, \ldots, \langle k-1,\, \hat{x}_{k-2},\, \hat{x}^i_{k-1} \rangle\}    (1.5)
where W is the length of the sliding window. We always use W = 10 in
our experiments.
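A possible way to assemble the n per-component training sets of (1.5) is sketched below (illustrative Python with hypothetical names; states is assumed to hold the filtered estimates x̂_t indexed by time):

import numpy as np

def build_training_sets(states, k, W=10):
    # One input vector per transition in the window: (time, previous state estimate)
    first = max(1, k - 1 - W)
    inputs = np.array([np.concatenate(([t], states[t - 1])) for t in range(first, k)])
    # One target sequence per state component: the i-th component of the next state
    n = len(states[k - 1])
    targets = [np.array([states[t][i] for t in range(first, k)]) for i in range(n)]
    return inputs, targets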
In the following section we address in detail the linear-Gaussian case,
when the Kalman filter is the optimal solution, and show how our framework can be instantiated to successfully and advantageously adapt the
transition matrix and the associated noise covariance matrix on-line.
Figure 1.2: An example showing the importance of the inclusion of the
temporal variable among those used for regression.
1.4 Support Vector Kalman
In the case of linear process and measurement functions, of Gaussian
zero-mean noise and of Gaussian PDF for the initial state, all the subsequent PDFs of the state are (multivariate) Gaussians as well. Therefore,
they are completely specified by their mean vector, that is usually considered also the estimation of the state, and their covariance matrix. The
Kalman filter is the optimal estimator for this case.
Since the linearity of fk is among the hypotheses of the Kalman filter, two consequences immediately arise:
1. we must use a linear kernel, i.e. the SVR formulation introduced
in 1.3.1;
2. we must modify it in order to regress a linear function.
In fact, the standard function learned by an SVR is (1.1), i.e. an affine
mapping. As discussed by Poggio et al. [75], a linear mapping can be
learned without harming the general theory underneath SVM algorithms,
since the linear kernel is a positive definite kernel. Moreover, a solving
algorithm for the linear mapping was also proposed in the paper of Platt
[74] that introduced the standard and widespread solution for the affine
case, i.e. the Sequential Minimal Optimization (SMO) algorithm.
Using this flavor of SVRs, it is possible, given the training data in the
considered temporal window, to obtain an estimate of Fk. Each vector
of weights wik regressed by the i-th SVR at time k can be almost directly
used as the i-th row of the estimated transition matrix F̂ k . The last but
not least issue to be solved in order to deploy the SVR weights as rows
of the Kalman transition matrix is the problem of normalization.
Typical implementations of SVMs require the input and output to be
normalized within the range [0, 1] or [−1, +1]. While this normalization is a neutral preprocessing as far as the SVR output is concerned, it
has subtle consequences when the weight vectors of the SVR are used
within our proposal. To illustrate this, let us consider a simple example
where a mapping from a scalar x to y is regressed, and the variables are
normalized to the range [−1, +1]. Then
\tilde{x} = \frac{2x - x_{max} - x_{min}}{x_{max} - x_{min}}, \qquad \tilde{y} = \frac{2y - y_{max} - y_{min}}{y_{max} - y_{min}}    (1.6)
where the superscript ˜ denotes the normalized variables and xmax , xmin
are the maximum and minimum value of the variable within the considered temporal window. Hence, the function of x that gives the unnormalized y is
\tilde{y} = w\tilde{x} \;\Rightarrow\; y = ax + b, \quad a = \frac{(y_{max} - y_{min})\,w}{x_{max} - x_{min}}, \quad b = \frac{y_{max} + y_{min}}{2} - \frac{(y_{max} - y_{min})(x_{max} + x_{min})\,w}{2\,(x_{max} - x_{min})}    (1.7)
i.e., again an affine mapping. Therefore, using the unnormalized coefficient a as an entry of the transition matrix F̂ k results in poor prediction,
since the constant term is not taken into account. In order to obtain a
linear mapping, that fits directly into the transition matrix of a Kalman
filter, a two-step normalization must be carried out. Given a sequence
of training data, a first normalization is applied,
\bar{x} = x - \frac{x_{max} + x_{min}}{2}, \qquad \bar{y} = y - \frac{y_{max} + y_{min}}{2}    (1.8)
These are the data on which the Kalman filter has to work. In other
words, at every time step, the output of the previous time step must be
renormalized if its value changes the minimum or maximum within the
temporal window. This is equivalent to a translation of the origin of the
state space and does not affect the Kalman filter itself. No normalization
is required for the covariance matrix. After this normalization, the data
can be scaled in the range [−1, +1], as required by the SVR, according
to

\tilde{x} = \frac{2\,\bar{x}}{\bar{x}_{max} - \bar{x}_{min}}, \qquad \tilde{y} = \frac{2\,\bar{y}}{\bar{y}_{max} - \bar{y}_{min}}    (1.9)
where the subscripts have the same meaning as in (1.6). Using this two-step normalization, the unnormalized function of the Kalman data is

\tilde{y} = w\tilde{x} \;\Rightarrow\; \hat{y} = \frac{\bar{y}_{max} - \bar{y}_{min}}{\bar{x}_{max} - \bar{x}_{min}}\, w\, \bar{x}    (1.10)

i.e. the required linear mapping.
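The two-step normalization can be sketched in Python as follows (an illustration, not the original code); the last function returns the un-normalized coefficient of (1.10), which can be used directly as an entry of the transition matrix:

import numpy as np

def centered(v):
    # First step, Eq. (1.8): subtract the mid-range so the data are symmetric around 0
    return v - (v.max() + v.min()) / 2.0

def scaled(v_bar):
    # Second step, Eq. (1.9): scale the centered data into [-1, +1]
    return 2.0 * v_bar / (v_bar.max() - v_bar.min())

def transition_entry(w, x_bar, y_bar):
    # Un-normalize the SVR weight as in Eq. (1.10): a purely linear mapping, no offset
    return (y_bar.max() - y_bar.min()) / (x_bar.max() - x_bar.min()) * w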
1.4.1 Adaptive process noise model
As discussed in Sec. 1.2, the classical definition of an adaptive Kalman
filter is more concerned with dynamic adjustment of Qk than with the
adaptation of the transition model [70, 109]. Our proposal makes it easy
to learn on-line the value of Fk, but also provides an effective and efficient way to dynamically adjust the value of the process noise. The
value of Qk, in fact, is crucial for the performance of the Kalman filter.
In particular, the ratio between the uncertainties on the transition model
and on the measurements tunes the filter to be either more responsive but
more sensitive to noise or smoother but with a greater latency in reacting
to sharp changes in the dynamics of the observed system.
Within our framework, a probabilistic interpretation of the output
of the SVR makes it possible to dynamically quantify the degree of belief in the
regressed transition model and, consequently, the value of Qk. Some
works have already addressed the probabilistic interpretation of the output of an SVR [13, 28, 51]. All of them estimate error bars on the prediction, i.e. the variance of the prediction. Therefore they are all suitable
for estimating the Gaussian covariance matrix of the regression output.
We choose to use [51] since it is the simplest method and also turned out
to be the most effective in the comparison proposed in [51].
Given a training set, this method performs k-fold cross validation
on it and considers the histogram of the residuals, i.e. the difference
between the known function value at xi and the value of the function
regressed using only the training data not in the xi fold. Then it fits a
Gaussian or a Laplace PDF to the histogram, using a robust statistical test
to select between the two PDFs. In our implementation, in accordance
with the hypothesis of the Kalman filter, we avoid the test and always fit a
Gaussian, i.e. we estimate the covariance as the mean squared residual.
We also keep Qk diagonal for simplicity. Hence, every SVR provides
only the value of the diagonal entry of its row of Qk . As discussed before,
however, learning from states is prone to degeneration of the learning
loop into a filter unaffected by measurements. To avoid this, we prevent
the variance estimated by every SVR from falling below a predetermined percentage
of the corresponding entry of R (10% in our implementation). This has
experimentally proved to be effective enough to avoid the coalescence
of the filter while at the same time preserving its ability to dynamically
adapt the values of Q.
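A minimal sketch of this estimate (illustrative; residuals stands for the k-fold cross-validation residuals of one SVR, computed as in [51]) is:

import numpy as np

def process_noise_entry(residuals, r_entry, floor_ratio=0.10):
    # Fit a zero-mean Gaussian to the residuals: variance = mean squared residual
    variance = np.mean(np.square(residuals))
    # Floor at a fraction of the corresponding measurement noise entry of R,
    # so that the filter never becomes blind to the measurements
    return max(variance, floor_ratio * r_entry)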
Finally, this method of estimation of the process noise covariance
matrix allows for an intuitive interpretation of the C parameter of the
SVRs. Since C weights the deviations from the regressed function greater
than ε, it is directly related to the smoothness of the Support Vector
Kalman output. In fact, if C is high, errors will be highly penalized,
and the regressed function will tend to overfit the data, leading to greater
residuals during the cross validation and to a bigger uncertainty on the
transition model. This will result in a more noisy but more responsive
output of the Kalman estimation. If, instead, C is low, the SVR output
will be smoother and the residuals during the cross validation will be
smaller. The resulting tighter covariances will guide the Kalman filter
towards smoother estimates of the state.
Figure 1.3: Charts showing the evolution of the filters against ground
truth data in case of linear motion: the top one compares SVK to Kalman
filters tuned for smoothness, the bottom one to Kalman filters tuned for
responsiveness.
1.5 Experimental results
We first provide two simulations concerning a simple 1D estimation
problem (i.e., a point moving along a line). In the first experiment, the
motion is kept within the assumptions required by the Kalman filter, in
particular there is a linear relationship between consecutive states. In
the second one, a case of non-linear motion is considered. Finally, we
provide experimental results concerning tracking of the 3D position and
orientation of a moving camera for real-time video augmentation, and tracking of various targets in the image plane.
1.5.1 Simulation of linear motion
In both simulations, comparisons have been carried out versus three
Kalman filters adopting different motion models: drift (Kalman DR),
constant velocity (Kalman CV) and constant acceleration (Kalman CA).
Their model matrices are as follows:
F_{DR} = \begin{bmatrix} 1 \end{bmatrix}, \quad
F_{CV} = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}, \quad
F_{CA} = \begin{bmatrix} 1 & \Delta t & \frac{\Delta t^2}{2} \\ 0 & 1 & \Delta t \\ 0 & 0 & 1 \end{bmatrix}    (1.11)
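In NumPy these three transition matrices can be written directly (a trivial illustration of (1.11), assuming dt holds Δt):

import numpy as np

dt = 0.5
F_DR = np.array([[1.0]])
F_CV = np.array([[1.0, dt],
                 [0.0, 1.0]])
F_CA = np.array([[1.0, dt, 0.5 * dt ** 2],
                 [0.0, 1.0, dt],
                 [0.0, 0.0, 1.0]])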
Two different tunings were considered for each Kalman filter: a more
responsive one, with Q set equal to 10^{-2} R, and a smoother
one, with Q = 10^{-4} R. As far as SVK is concerned, it was fed with
noisy measures of the position and the velocity of the point, therefore
regressing a 2×2 model matrix. The only rough tuning regards C, which
is set equal to 2^{-10} in this simulation and to 2 in the non-linear case:
intuitively, an easier sequence allows for using a smoother filter.
During the linear motion sequence, motion is switched every 160
samples between a constant acceleration, a constant position and a constant velocity law. Therefore, each Kalman filter has a time frame wherein
the real motion of the point is exactly that described by its transition matrix. Results on the whole sequence are reported in Fig. 1.3 and Tab. 1.1.
As for simulation parameters, R has been kept constant in time and equal
to 100·I, with I denoting the identity matrix; the constant acceleration was
30.0 m/s², the constant velocity was 1000 m/s and Δt was 0.5. Gaussian
noise with covariance matrix R was added to the data to produce noisy
measurements for the filters.
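Under the stated parameters, the ground truth and the noisy measurements of this simulation could be generated along the following lines (an illustrative sketch under the assumptions above, not the original code; the exact sequence length and switching order are our reading of the text):

import numpy as np

dt, acc, vel = 0.5, 30.0, 1000.0
rng = np.random.default_rng(0)

pos, v, truth = 0.0, 0.0, []
for k in range(480):
    law = (k // 160) % 3            # switch every 160 samples: CA, constant position, CV
    if law == 0:                    # constant acceleration
        pos += v * dt + 0.5 * acc * dt ** 2
        v += acc * dt
    elif law == 1:                  # constant position
        v = 0.0
    else:                           # constant velocity
        v = vel
        pos += v * dt
    truth.append((pos, v))

truth = np.array(truth)
measurements = truth + rng.multivariate_normal([0.0, 0.0], 100 * np.eye(2), size=len(truth))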
As shown by the first column of Tab. 1.1, our proposal achieves the best Root Mean Squared Error (RMSE) on the whole sequence. This shows the benefits that on-line adaptation of the transition model can produce on the state estimate. This is also shown by the two charts in Fig. 1.3: at the scale of the charts, the estimation of our filter is indistinguishable from the real state of the system, whereas the delay of Kalman DR and the overshoots/undershoots of Kalman CA and Kalman CV in presence of sharp changes of motion are clearly visible.

Figure 1.4: The charts report absolute errors for, respectively, the constant acceleration, the constant velocity and the constant position intervals.

Figure 1.5: The chart shows the covariances on state variables provided by SVK throughout the whole sequence.
Going into more detail, we separately analyze each of the three different parts of motion (Fig. 1.4). Here, we discuss not only the performance on the whole interval associated with each motion law, but also
that achieved in the final part of each interval (i.e., the last 80 samples).
In fact, the final samples allow us to evaluate the accuracy of the steady state
of the estimators, filtering out the impact of the delays due to the filter
degree of responsiveness.
During the constant acceleration interval, Kalman CA performs best,
both with the responsive and the smooth tuning. This is reasonable, since
theoretically it is the optimal filter for this specific part of motion. Our
filter, however, performs slightly worse than Kalman CA, but definitely
better than Kalman CV and Kalman DR (2nd column of Tab. 1.1). This
is also demonstrated by the first chart of Fig. 1.4, which, for better visualization, displays only absolute errors less than 50. Only our filter stays in the visualized range, apart from the optimal one. When considering only the steady state part (5th column of Tab. 1.1) the analysis does not change, partly because this interval is the very first one and, hence, there are no delays to recover, and partly because the Kalman CV and DR do not have the proper transition matrix for this part and, thus, cannot recover from errors.

Filter                   |    Whole |       CA |       CV |    Drift |      CA* |      CV* |   Drift*
SVK 2x2 Model            |    22.41 |     9.79 |    38.02 |    35.41 |     8.91 |     9.63 |     1.67
Kalman CA Q = 10^-2 R    |    76.62 |     4.83 |     51.3 |   125.87 |     4.59 |     4.55 |     6.06
Kalman CA Q = 10^-4 R    |   357.45 |     4.26 |   242.19 |   581.52 |     3.72 |     4.04 |     7.87
Kalman CV Q = 10^-2 R    |   227.38 |   100.12 |   155.13 |   355.71 |   104.84 |     3.74 |     5.31
Kalman CV Q = 10^-4 R    |  1680.37 |  1213.78 |  1160.73 |  2439.37 |  1416.30 |    49.82 |   109.30
Kalman DR Q = 10^-2 R    |  4498.51 |  6015.22 |  4536.67 |  1793.30 |  8056.45 |  4757.75 |     2.77
Kalman DR Q = 10^-4 R    | 29698.38 | 25771.38 | 31583.97 | 29279.53 | 35763.45 | 37809.42 | 16743.08

Table 1.1: Comparison of RMSE on linear motion: the first column reports the RMSEs on the whole sequence; then, partial RMSEs on each piece of motion are given, as well as RMSEs concerning only the final part of each interval (marked with *), when the filter may have reached the steady state.
During the constant velocity part, SVK has the best overall RMSE (3rd column of Tab. 1.1). This is due to the delay accumulated by Kalman CV, theoretically the optimal filter, during the previous intervals. Therefore, we can highlight one of the major advantages brought in by SVK: in case of sharp changes of the motion law, the dynamic update of parameters renders SVK even more accurate than the optimal filter due to its higher responsiveness. This is confirmed by Fig. 1.5, showing the position and the velocity variances estimated by SVK. It can be seen that, immediately after the change of motion from constant position to constant velocity at sample 320, both variances significantly increase, somehow "detecting" such a change, thanks to the adaptive process noise modeling embodied in our filter. The resulting lower confidence in the predictions automatically turns the filter from smoothness to responsiveness, preventing the overshoots/undershoots exhibited by standard Kalman filters. After a few samples the covariance on the velocity decreases again,
proving that SVK has confidently learned the new model. Considering
only the steady state (6th column of Tab. 1.1), Kalman CV is, as expected,
the best one. Unlike the CA interval, however, only the responsive tuning performs well since the smoother Kalman CV has accumulated too
much delay to recover. This difference is due to the intrinsically higher
smoothness of the CV model with respect to the CA one. Kalman CA,
with both tunings, is the second best and this is also predictable since
a constant velocity motion may be seen as a special case of a constant
acceleration one. Again, SVK is by far closer to the optimal filters than
to those adopting a wrong motion model and, visualizing only errors less
than 50, it is the only one visible in the corresponding chart of Fig. 1.4,
apart from the optimal ones.
Finally, due to the delay accumulated by the other filters, SVK turns
out to be the best estimator also in the constant position interval (4th column
of Tab. 1.1). As far as the steady state is concerned, all the filters exhibit a
good RMSE apart from the very smooth ones, namely CV and DR tuned
towards smoothness, since they do not recover from delays even after
80 samples. Unlike the other motion intervals, SVK keeps on being the
best, even when the steady state only is considered. A reason for this
is provided again by the chart of covariances (Fig. 1.5). During the
constant position part, the SVR is able to regress a very good transition
matrix and both the uncertainties are kept really low compared to the
values in R. Therefore, the filter is highly smooth, as can be seen in the
chart of absolute errors, and this keeps the RMSE low also in the last
part.
Our proposal is robust to higher measurement noise, too. We report
in Tab. 1.2 the RMSEs for the same simulation, but with R = 1000·I.
Even in this case SVK turns out to be the overall best thanks to its adaptive behavior. Considerations similar to previous ones apply to the three
different parts of motion.
To summarize, simulations with linear motion laws show that the
proposed SVR-based approach to on-line adaptation of the transition
model is an effective solution for the tracking problem when the assumption of a stationary transition matrix cannot hold due to the tracked system undergoing significant changes in its motion traits.

Filter                   |    Whole |    Drift |       CV |       CA |   Drift* |      CV* |      CA*
SVK 2x2 Model R=1000     |    43.36 |    36.36 |    67.93 |    31.35 |     5.23 |    30.56 |    28.29
Kalman CA Q = 10^-2 R    |    79.65 |   130.17 |    52.94 |    15.36 |    19.17 |     14.3 |    14.52
Kalman CA Q = 10^-4 R    |   357.69 |   581.70 |   242.46 |    13.33 |    17.28 |    10.94 |    11.75
Kalman CV Q = 10^-2 R    |   228.08 |   356.26 |   156.61 |   100.97 |    16.81 |    11.71 |   106.77
Kalman CV Q = 10^-4 R    |  1681.04 |  2439.48 |  1162.36 |  1214.90 |   106.66 |    49.56 |  1418.82
Kalman DR Q = 10^-2 R    |  4500.00 |  1793.01 |  4539.23 |  6016.82 |     8.78 |  4761.46 |  8059.09
Kalman DR Q = 10^-4 R    | 29699.11 | 29279.76 | 31584.70 | 25772.48 | 16742.06 | 37810.78 | 35764.94

Table 1.2: Comparison of RMSE between different filters in case of higher measurement noise.

Filter              | Whole (R = 100) | Whole (R = 1000)
SVK 2x2 Model       |           20.61 |            47.98
Kalman CA resp.     |           61.92 |            62.32
Kalman CA smooth    |          308.32 |           308.66
Kalman CV resp.     |           72.69 |            72.95
Kalman CV smooth    |          248.30 |           248.46
Kalman DR resp.     |          143.63 |           144.87
Kalman DR smooth    |          434.83 |           435.20

Table 1.3: Comparison of RMSE on non-linear motion.
1.5.2 Simulation of non-linear motion
Given its ability to dynamically adapt the transition matrix, we expect
SVK to be superior to a standard Kalman filter also in the case of non-linear motion. In such a case, in fact, a time-varying linear function can
approximate the real non-linear motion better than a fixed linear function. Hence, to assess its merits we have run simulations with a motion
composed of two different sinusoidal parts linked by a constant position
interval. The motion law of the two sinusoidal parts is as follows:
x_1(t) = 300t + 300\sin(2\pi t) + 300\cos(2\pi t)    (1.12)
x_2(t) = 300t - 300\sin(2\pi t) - 300\cos(2\pi t)    (1.13)
Aggregate results are shown in Fig. 1.6, Fig. 1.7 and Tab. 1.3 for the same
levels of measurement noise as in 1.5.1. Our filter proves again to be the
overall best.
Figure 1.6: Simulation dealing with non-linear motion with R = 100·I. The chart on top compares SVK to Kalman filters tuned for smoothness, the bottom one to Kalman filters tuned for responsiveness. At this scale, the estimation of our filter is almost indistinguishable from the ground truth.

Figure 1.7: Simulation dealing with non-linear motion with R = 1000·I. The chart on top compares SVK to Kalman filters tuned for smoothness, the bottom one to Kalman filters tuned for responsiveness.

1.5.3 3D camera tracking

In this experiment, we track the 3D position of a moving camera in order to augment the video content, taking as measurement the output of a standard pose estimation algorithm [81] fed with point correspondences established by matching invariant local features, in particular SURF features [6]. Some snapshots are reported in Fig. 1.8. The snapshots show side-by-side the augmentation resulting from the use of Kalman CA and
our SVK. Both filters have been tuned to be as responsive as in 1.5.2, and
the measurement noise covariances have been adjusted to match the range
of the input data. The sequence shows a fast change of motion of the
camera, the purpose of filters being to keep the virtual object spatially
aligned with the reference position, denoted for easier interpretation of
results by a white sheet of paper. We can see that both filters exhibit a
delay following the sharp motion change at frame 19, but SVK is subject
to a smaller translation error (e.g. frame 23), recovers much faster (SVK
31
Chapter 1. Adaptive Transition Model
is again on the target by frame 27, Kalman CA only by frame 40) and,
unlike Kalman CA, without any overshot (which Kalman CA exhibits
from frames 27 to 40).
1.5.4 Mean-shift tracking through occlusions
In the last experiment, we compare our SVK to standard, non-adaptive
solutions for estimating an object trajectory in the image plane based on
the mean-shift tracker introduced by Comaniciu et al. [17]. We compare
the original mean-shift (MS) tracker and the non-adaptive Kalman filter
(Kalman-MS tracker) to the SVK. Both KalmanMS and SVK use the MS
tracker as the measurements source . The MS tracker and the KalmanMS tracker have been proposed in the original work by Comaniciu et al.
[17].
The MS tracker implicitly assumes a constant position motion model
by letting the tracker start its search for the best position in each new
exactly where the object was found in the previous frame. The KalmanMS tracker in our experiment uses a constant velocity motion model.
Some snapshots of the test sequence are depicted in Fig. 1.9. The mean-shift technique is, generally speaking, not robust to total occlusions, like the one shown in the third snapshot (frame # 067), because the MS tracker can be attracted by the background structure (e.g. the road in our experiment) if this is more similar to the target than the occluder. For this reason the MS tracker is unable to follow the object while it passes below the advertisement panel and stays in the last position where it could locate the target (frame # 067 of Fig. 1.9). The Kalman-MS tracker follows the previous dynamics of the target, thanks to the smoothness brought in by the Kalman filter transition model (frame # 067 of Fig. 1.9). Nevertheless, since the way it weights the contributions of the measure and of the prediction on the state is fixed, it is finally caught back by the measures (the MS tracker) continuously claiming the presence of the target in the old location, before the occluder. Only the SVK is able to correctly guess the trajectory of the target while the latter
Figure 1.8: Some of the most significant frames from the experiment on 3D camera tracking (frames 17, 20, 21, 22, 23, 24, 25, 26, 27, 28, 34 and 40).
Figure 1.9: Some of the most significant frames from the experiment on object tracking in the image plane. In cyan the SVK tracker, in red the MS tracker, in blue the Kalman-MS tracker.
is occluded (frame # 067 of Fig. 1.9) and continues to track it after the occlusion (frame # 083 and subsequent frames of Fig. 1.9). This is due to the ability of the SVK to dynamically adjust the process noise covariance matrix, increasing its confidence on the motion of the object (i.e. decreasing the variance) while the object keeps moving with an approximately constant motion law on the image plane (first part of the sequence, first two snapshots, from frame # 001 to frame # 050 of Fig. 1.9). Thanks to the high confidence gained on the motion model, the filter is able to reject the wrong measurements coming from the MS tracker during the occlusion. This happens again during the second occlusion at frame # 200 of Fig. 1.9.
Chapter 2
Adaptive Appearance Model
Every visual tracker uses an internal representation of the appearance of the target, which it compares with the current frame Ik in order to locate the target. We refer to this internal representation as appearance model or
target. We refer to this internal representation as appearance model or
target model, A, and we denote the instance used by the tracker at time
k as Ak . This model is usually learned once, either offline from training
data or online from the first frame(s).
In the works on tracking up to the last decade this model was usually fixed throughout the sequence [15, 17, 32, 37, 38, 78]. The main efforts of these works were devoted to developing robust ways to use the fixed model for locating the target in the current frame, despite all the nuisances that realistic video sequences may contain, such as clutter and distractors, illumination changes and deformable targets.
More recently, the idea of appearance model update has been proposed by several researchers, with the aim of achieving successful long-term tracking despite these difficulties. By letting the model evolve across frames to include and adapt to the potential geometric and photometric changes of the target, these methods are inherently able to cope with target deformations and lighting variations. On the other hand, they expose the tracker to the risk of drift, i.e. the inclusion of background appearance in the appearance model, which can eventually lead to loss of track.
In our work on adaptive appearance modeling we define the general
structure of an adaptive modeling tracker and identify and discuss the main alternatives that have been proposed for each main building block of such systems. Recently, adaptive modeling trackers have been extended also to the multi-target case [8, 88, 103]. Our review, however, focuses on the single-target case, which has reached a higher level of maturity. Multi-target trackers are covered by this review only as far as the single-target part of their proposals is concerned.
Then, we formulate our proposal for target model adaptation, based
on the idea that tracking and target model update are similar in spirit and
in practice: they both try to estimate the state of a system from noisy
measures, under the assumption that the system state exhibits temporal
consistency in consecutive frames. The state for target model update
is the target appearance instead of the kinematic characteristics of the target, but the conceptual problems are highly similar. Therefore, we cast
the problem of model update as a recursive Bayesian problem, and try
to utilize the same tools, in particular the particle filter, to accomplish it.
The work presented in this chapter has been carried out while the author
was visiting Prof. Andrea Cavallaro’s group within the Multimedia and
Vision Group of the Queen Mary University of London.
2.1 Additional definitions
We presented the classical framework for visual tracking in the Introduction. Here, we add two notions that are used in the context of target
model update, namely the confidence map and the division into generative
and discriminative trackers.
2.1.1 Confidence map
Typically the tracker evaluates several state candidates x̂ik to select the
current state xk . The candidates are sampled according to a variety of
strategies, but they typically belong to a neighborhood of the previous
state. This enforces temporal smoothness, upon which tracking is based.
Figure 2.1: Generative (a) versus discriminative (b) trackers. A state candidate x̃k from the current frame Ik is projected in the feature space F and its likelihood of being the target is computed. The likelihood is a function of a distance or similarity measure between the current model Ak and the candidate features in a generative tracker (a), and a function of the confidence value of a classifier hk in a discriminative tracker (b).
The evaluation results in the assignment of a score C_k^i to each candidate (e.g. the weight of the corresponding particle in a particle filter [78], the feature similarity in a Mean-Shift tracker [17], the confidence of a classifier in a tracking-by-detection approach [4], ...). We refer to the set of pairs ⟨x̂_k^i, C_k^i⟩ as the confidence map Ck.
2.1.2 Generative vs. Discriminative Trackers
An important classification of visual trackers, as far as target model update is concerned, is the division between generative and discriminative
trackers (Fig. 2.1).
Generative Trackers
These trackers [107, 46, 57, 80, 40, 49] are guided by a generative observation likelihood, i.e. “the state estimation boils down to the problem of finding the state which has the most similar object appearance to the model in a maximum-likelihood or maximum-a-posterior formulation” [93]. Generative models of the foreground try to represent the object appearance without considering its discriminative power with respect to the background or to the appearance of other targets. In these methods the observation likelihood is based on a similarity function defined on the feature space F that compares the current model Ak with the features of the current candidate state x̃k, providing a similarity score or likelihood for the candidate state (Fig. 2.1a). A model is explicitly given and the similarity to it assigns a likelihood value to every point of the feature space, i.e. to every possible state candidate.
Discriminative Trackers
These trackers [16, 4, 5, 29, 30, 89, 93] are guided by a discriminative observation likelihood, i.e. a classifier trained to learn “a decision boundary that can best separate the object and the background” [93]. Classifiers able to produce a confidence value for the predicted label can be used in this framework. In these proposals the appearance model Ak is not explicitly given: it is implicitly defined by the subset of the set of all possible appearances F that is positively labeled by the classifier (Fig. 2.1b). In these methods the observation likelihood is the confidence value of the classifier on the classification of the current candidate state x̃k as foreground, and it is 0 if the candidate is classified as background.
Hybrid Trackers
Some methods have proposed hybrid solutions, such as: switching between discriminative and generative observation models according to the targets' proximity in a multi-target scenario [88]; using co-training [7] between a long-term generative observation model and a short-term discriminative one [105]; using several generative models and discriminatively learning in each frame the weights to combine them, in order to maximize the distance from the neighboring regions [103]; storing and updating two generative non-parametric models of foreground and background appearance and using them to train in every frame a discriminative tracker [55].

Figure 2.2: The general structure of the target model update flow in an adaptive tracker, k ≥ 1.
2.2 Elements of Adaptive Modeling in Visual
Tracking
The general structure of an adaptive model tracker is sketched in Fig.
2.2.
Method | Sampling and Labeling | Feature Processing | Model Estimation | Model Update
Template Update [57] | Current State | Pivot Blended in | Direct Use of Features | Last model
IVT [80] | Current State | None | Direct Use of Features | Subspace
AdaptiveManifold [49] | Current State | None | Direct Use of Features | Manifold
WSL [40] | Current State | None | Direct Use of Features | Blending
Unified Bayesian [107] | Current State | Pivot Blended in | Direct Use of Features | Last model
Visual Tracking Decomposition [46] | Current State | Pivot Added | Direct Use of Features | Sliding Window
Ensemble Tracking [4] | Current State | Label Switch | New Classifier Training | Sliding Window
Non-Parametric Tracker [55] | Adaptive Classifier | Redundant and Outliers filtering | Direct Use of Features | Ranking
SVMs Co-Tracking - Tracker 1 [93] | Co-Training | None | Classifier Update | Sliding Window
SVMs Co-Tracking - Tracker 2 [93] | Co-Training | None | Classifier Update | Sliding Window
Co-Training - Generative [105] | Co-Training | None | Direct Use of Features | Manifold
Co-Training - Discriminative [105] | Co-Training | None | Classifier Update | Sliding Window
Adaptive Weights [103] | Current State | Pivot Blended in | Direct Use of Features | Blending
Discriminative Features Selection [16] | Current State | Pivot Blended in | Direct Use of Features | Last model
OnlineBoost [29] | Current State | None | New Classifier Training | Ranking
SemiBoost [30] | Fixed Classifier | None | New Classifier Training | Ranking
BeyondSemiBoost [89] | Fixed and Adaptive Classifier | None | New Classifier Training | Ranking
MILTracker [5] | Current-State-Centered | None | New Classifier Training | Ranking
Table 2.1: Reviewed Methods.

1. Given the output of the tracker xk and the confidence map Ck on the evaluated candidates, a set of samples si of the new target appearance is extracted from the current frame. If the tracker is a discriminative tracker, a set of samples is extracted also from the background. Samples are hard or soft labeled as target or background samples, yielding a labeled sample set {s_i^l, l ∈ [0, 1]}.
2. Samples extracted from the current frame are projected into the feature space used for tracking, generating a set of labeled features {f_i^l, l ∈ [0, 1]}.
3. Features can be filtered and/or selected.
(a) Filtering: the set of features may be pruned to remove outliers or augmented with reliable features from trusted target
appearances. Labels may be switched or modified, too.
(b) Selection: if multiple cues are used as features (such as color,
edges, shape, motion vectors, etc. . . ) feature selection may
be performed to select the most effective features for the current frame.
These steps aim at providing a more representative and effective
feature set {f̃il , l̃ ∈ [0, 1]}.
4. Given the selected labeled features the model ak+1 of the target in
the current frame is estimated.
5. The model for the current frame ak+1 is merged with the previous overall model Ak, yielding the model Ak+1 used in the next frame for state estimation.
This section describes the alternatives to implement each of these
main building blocks.
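Purely as a schematic illustration of the flow of Fig. 2.2, the sketch below rephrases the five steps above as a generic per-frame update loop; all building blocks are passed in as placeholder callables, since their concrete form is exactly what the following subsections discuss.

```python
def adaptive_tracking_step(frame_k, model_A_k,
                           estimate_state, sample_and_label, extract_features,
                           refine_feature_set, estimate_frame_model, merge_models):
    """One frame of a generic adaptive-model tracker (Fig. 2.2).

    Each building block is a callable whose concrete form depends on the
    specific tracker; the interfaces here are placeholders."""
    # State estimation with the current model: state x_k and confidence map C_k.
    x_k, C_k = estimate_state(frame_k, model_A_k)
    # 1. Sampling and labeling (also background samples for discriminative trackers).
    labeled_samples = sample_and_label(frame_k, x_k, C_k)
    # 2. Feature extraction for every labeled sample.
    labeled_features = [(extract_features(s), l) for s, l in labeled_samples]
    # 3. Feature set refinement: filtering and/or selection.
    refined = refine_feature_set(labeled_features, model_A_k)
    # 4. Estimation of the per-frame model a_{k+1}.
    a_next = estimate_frame_model(refined)
    # 5. Merge a_{k+1} with A_k (last model, sliding window, ranking, blending, ...).
    A_next = merge_models(model_A_k, a_next)
    return x_k, A_next
```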
To limit the chances of drift, an adaptive model tracker has to try to
solve the following sub-problems:
• Robust integration of new target model samples. The inclusion
of new information from the current frame in the target model has
to be designed to be robust to the presence of outliers from the
background due to non perfect alignment of the tracker bounding
box with the actual target position.
• On-line Evaluation of tracker output. The output of the tracker must be evaluated on-line in the absence of ground truth, to decide whether or not to use it in the model update. This is particularly important to avoid including the occluder's appearance when the target undergoes occlusions.
• Stability/Plasticity Dilemma [31]. The simultaneous requirement for rapid learning and stable memory. This is a common
problem of all on-line adaptive systems.
Each of the above mentioned building blocks deals with one or more of
these sub-problems.
2.2.1 Sampling and Labeling
Given the output state xk of the tracker in the current frame Ik and the
confidence map Ck , this step selects the regions of the current frame that
are then used to update the model and, in a discriminative tracker, assigns them either to the target or the background class.
The different proposals are presented according to the degree of reliability they assign to the tracker.
Figure 2.3: Sampling and labeling strategies: (a) Current State Sampling; (b) Current-State-Centered Sampling; (c) External Classifier; (d) Co-Training. In (a), (b) and (c) the thicker hatch represents the current state estimate, the wider hatch the sampling region for foreground-labeled samples and the wider dotted rectangle defines the region for background samples. Note that in (c) the last two regions coincide. In (d), the images represent the confidence maps of two trackers: blue low likelihood, red high likelihood.
• Current State (Fig. 2.3a). The region defined by xk is the only
one used to update the target model. In case of discriminative
trackers, samples from a region surrounding the current state are
used as background appearance samples. This method assumes that
the tracker is always correct and leaves to the subsequent stages
the task of attenuating the effects of misaligned current states.
• Current-State-Centered Sampling. (Fig. 2.3b). Introduced in
MILBoost [5]. Samples are extracted in the region defined by xk
plus its neighborhood. Samples extracted in the proximity of xk
are grouped in bags of samples and at least one sample of each
bag is assumed to be a target sample whereas samples from the
outer sampling region are used as samples for the background. It
is up to the subsequent stages of the algorithm to disambiguate
the uncertainty left in the target samples, for example by using
Multiple Instance Learning as done in [5]. This method assumes
that the tracker can be slightly off the target, but is always close to
it.
• Co-Training Sampling.
Introduced in Co-Tracking [93]. Two
subtrackers that use independent features make up the tracker. The
output xk is given by the combination of their outputs, but the sampling and labeling for model update of each tracker is carried out
independently, within the framework of co-training [7]. Each subtracker provides the training samples for the other. Target samples
come from the global maxima of the other subtracker confidence
maps whereas background samples are taken from the local maxima not overlapping with the global maximum. In this way, each
subtracker is trained to be able to discriminate the cases that are
difficult for the other tracker. This method assumes that in a given
frame at least one of the two features alone is able to correctly
track the target.
• External Classifier (Fig. 2.3c). Samples are extracted in the region defined by xk plus its neighborhood, but are not labeled according to their position with respect to xk . Instead, labeling is
performed by means of an external classifier. Samples are soft labeled as samples of the target or the background according to the
confidence of the classifier. Although this option makes sense for
both generative and discriminative trackers, it has been used only
by discriminative or hybrid approaches.
Generally speaking, the use of a classifier to guide the tracker
updates is an interesting solution to break the self learning loop.
Nevertheless it leads to a chicken-and-egg problem: if an external
algorithm, like this classifier, can reliably tell if a patch selected
from the output of the tracker belongs to the object of interest in
spite of all the changes in appearance the target underwent, such a
powerful algorithm could be successfully used as the observation
model for the tracker and there would be no need to update the
target model. Of course this is not the case: if the detector has to
cope with all the possible changes it has to be updated as well, and
this introduces the problem of drift for it, too.
By considering how the proposed solutions cope with the issue
of classifier adaptability, this category can be further specified as
follows:
– Fixed Classifier. Introduced in [30]. The classifier in this
case may be an object detector or a similarity function with a
fixed pivotal appearance model. It is created off-line or in the
first frame and never updated. These methods assume that
the classifier is able to cope with all the variations the target
will undergo in a sequence or, alternatively, that there will
be no more variations of the target appearance than those
that the classifier is invariant to. Therefore, this choice limits
the degree of adaptability of the tracker. On the other hand,
it does not make any assumption on the correctness of the
current state, besides the proximity with the target.
– Adaptive Classifier. Introduced in [55]. The classifier is a
similarity function with respect to the previous model. This
method does not assume any reliability of the current state
but it requires the absence of sudden changes in the target or
background appearance evolution. Moreover, the degree of
adaptability, i.e. the maximum variation in appearance between consecutive frames, is dictated by hard thresholds that
may be difficult to set. Finally, by using the previous model
to label the current samples, this method is prone to the drift
introduced by self learning, although, unlike the other proposals, this loop is based on models rather than on states.
– Fixed and Adaptive Classifiers. Introduced in [89]. Two
classifiers are used. One is fixed and is trained on the first frame. The other is adaptive, and it is the one used to label
the samples. This method tries to obtain the benefits of not
assuming any correctness of the current state, introduced by
using a classifier for samples labeling, without limiting the
adaptability of the tracker, by letting the classifier adapt to
target or background changes. This raises the problem of drift
for the adaptive classifier. The proposed solution is to update
the classifier only when the tracker and the fixed classifier
are in agreement. Although this may limit the chances of
drift for the adaptive classifier, it results in similar limits on
the degree of adaptability introduced by the fixed classifier
solution.
2.2.2 Feature Extraction
Features are extracted for each sample s_i^l of Ik, producing a set of labeled feature vectors {f_i^l}.
With reference to Tab. 2.2, we categorize features used by the adaptive modeling trackers according to the spatial extension of the features
extracted from each sample. This has a direct impact on the ability of
the tracker to correctly adapt in presence of partial occlusions:
• Part-wise features.
Feature vectors are extracted from small
patches or even single pixels. This makes it possible to reason
explicitly about occlusions and to avoid to use features from the
occluding object to update the target model. It also helps to deal
with the approximation inherent to the modeling of the target as
a rectangular object, since every feature can be classified either
as foreground or background, even those lying inside the target
bounding box.
• Target-wise features. Feature vectors represent the whole target
appearance (e.g. color histograms [17]). As noted in the Introduction, this kind of feature can typically tolerate target deformations and rigid transformations, such as rotations and scaling, even
without model update. On the other hand, being a global representation of the target, it is difficult to correctly update it in presence
of partial occlusions.
2.2.3 Feature Set Refinement
Given the features {f_i^l} extracted and labeled from the current frame, this step processes the features and the labels in order to obtain a modified set {f̃_i^l̃} that is more effective for model update. To this purpose, two
main strategies have been followed, that can be deployed alternatively
or sequentially: feature processing and feature selection.
Feature Processing
As far as feature processing is concerned, a tracker can perform:
• Sample checking. The idea behind the following filtering steps is
that it is possible to decide a priori which samples are not suitable
to perform model update given the current model. In particular
some adaptive trackers perform:
[Table 2.2 is a feature matrix: for each reviewed method it marks the features used (e.g. RGB pixel values, colour filter banks [16], LBP [69], HoGs [20], Haar-like wavelets, steerable filters [26], SURF [6], edges, contours, separate HSI channels, colour histograms, intensity templates, detector confidences from human (ISM [50], HoGs [20]) and head (ConvNN) detectors), grouped into part-wise and target-wise features, and reports in the left-most column the feature set composition strategy (Single, Mixture, Selection). The single asterisk applies to Visual Tracking Decomposition, the double asterisk to Adaptive Weights.]
Table 2.2: Features. The single asterisk indicates use of multiple trackers, hence not all the features listed might be used in the same tracker. The double asterisk indicates the use of the Adaptive Multiple Features Blending strategy for the feature set composition (see Sec. 2.2.4).
– Redundant Sample Removal. Introduced in [55]. Feature
vectors that are too similar to the current model are discarded
as redundant.
– Outliers filtering. As far as outliers are concerned, two
different strategies have been deployed:
∗ Outliers Removal. Introduced in [55]. Feature vectors that are too different from the current model are discarded as outliers.
∗ Positive Label Switch. Introduced in [4]. In case the
confidence on a target-labeled feature vector is not high
enough, the label is switched to background. This is
done mainly to counteract the approximation inherent in
the use of a rectangular box as target shape.
• Pivot. The initial appearance is used as a pivot, under the assumptions that the bounding box in the first frame was correct and that the target and the background appearance remain similar to the initial ones in the feature space. In the proposals adopting this strategy, first frame data receive a special treatment: this is reasonable because the first frame detection is usually assumed to be reliable, for example in a tag-and-track application for visual surveillance, where a human operator provides the first bounding box. For a fully automatic deployment of tracking, the first bounding box cannot be assumed to be particularly more accurate than the next ones. Another important issue with the use of the pivot for sample refinement is that it may not allow the tracker to adapt to sudden appearance changes, nor to gradual changes in appearance that in the long run lead to great changes in target appearance compared to the first frame. Depending on the application, this may be a limitation that prevents the adoption of this filtering step. If general automatic visual tracking is the aim of an algorithm, then this filtering step should not be used, although it can greatly improve performances in more specific contexts. Use of features from the pivot to refine the current sample set has been proposed in two flavors:
– Pivot added. Features from samples of the pivot are added
to the feature set with the proper label. With this strategy,
subsequent stages of the algorithm can decide to ignore the
added features and exploit only the features from the current
frame for the update.
– Pivot blended in. Feature vectors are blended with the pivot
features. With this choice the influence of the pivot cannot be
discarded afterwards. On the other hand, the model update
is guaranteed to keep the model in a neighborhood of the
initial appearance, hence this solution trades off adaptability
for robustness.
2.2.4 Feature selection
This is a key component of a recently proposed family of discriminative tracking algorithms [5, 29, 30, 89] that perform model update by continuously updating the set of used features, selecting them according to their discriminative ability in distinguishing the target from the background. Besides these methods, which heavily base their efficacy on feature selection, feature selection is a fundamental step for all adaptive and even non-adaptive trackers, since different cues, such as edge patterns, color histograms or appearance patterns, may have a different ability to track a target in different parts of the sequence. Nevertheless,
no standard way has emerged to tackle this fundamental problem. One
of the main difficulties in performing on-line selection is given by the
fact that different cues may have different score dynamics and ranges,
which makes it hard to compare their effectiveness directly by comparing their scores. They can be compared by evaluating a posteriori their
effects on the tracker accuracy, for example selecting the features to use
at frame k by ranking them according to their effectiveness in locating
the target in the previous frame k − 1, under the assumption that the position estimated by the tracker at frame k − 1 is correct. According to their treatment of this stage, trackers can be categorized as follows (see also the vertical left-most column of Table 2.2):
• Single Feature. Only one kind of feature is used, e.g. one color
histogram. No selection is carried out.
• Mixture of (Independent) Features. A fixed set of features is
used. The composition of the set is never updated. Usually a certain degree of independence between the features is required (or
assumed) for their simultaneous use to be effective. This is for example the case of trackers working in the co-training framework,
that implicitly perform feature selection by weighting the contribution to the final estimation of classifiers using independent features.
• On-line Feature Selection. A fixed set of features is used. The composition of the subset used in each frame is updated according to the features' effectiveness in the previous frame(s) [16].
– Online Boosting. Feature selection is performed by applying on-line boosting [72] to weak classifiers that act as feature selectors [29].
• Adaptive Multiple Features Weighting. A fixed set of features
is used. The weights of the features in the likelihood composition
are updated in every frame based on the features' effectiveness in
the previous frame(s).
2.2.5 Model Estimation
Given the filtered feature set and the labels, a new partial model ak+1 that describes the target appearance in the current frame is built. This has no particular influence on the adaptation abilities of the tracker nor on its risk of drift. The main alternatives are:
• Non-parametric use of features. The model estimated for the current frame is the non-parametric ensemble of the features extracted from the target or the background.
• New Classifier Training. The current samples are used to train
a classifier that best separates the target and the background in the
current frame.
• Old Classifier(s) Update. The current samples are used to update
a previously trained classifier.
2.2.6 Model Update
Given the new model for the current frame ak+1 , it has to be merged
with the overall model used so far, Ak , to obtain Ak+1 . This step directly
addresses the Stability/Plasticity Dilemma presented above. Solutions
are presented in order of Plasticity, i.e. starting from the most adaptive
ones:
• Last model only. The result of the last frame is used as the model
for the next frame.
• Sliding Window. A fixed amount of samples/classifiers is kept
after every frame is processed. The newest is added and the oldest
is discarded.
• Ranking. Up to a maximum fixed amount of samples/classifiers,
the most effective ones are kept after every frame is processed; the new one is always added. This raises the problem of assessing their effectiveness, similar to the problem of evaluating feature selection on-line. And again, the most widespread solution is to evaluate the models' efficacy on the previous frame(s).
• Blending. Sample or classifier parameters estimated from the current frame are blended with their previous values. This is in principle more stable than the previous alternatives, since all the history up to the current frame has an influence on the new model. On the other hand, it is more prone to drift, since the inclusion of wrong samples in the target model cannot be undone afterwards: only the inclusion of correct samples will eventually render the influence of the outliers negligible (a toy comparison of this strategy with the sliding window one is sketched after this list).
• Subspace/Manifold. A subspace or a set of subspaces (an approximation for a manifold in the feature space) is updated with
the new sample from the current frame. It potentially retains the
history of all the target appearances with a fixed amount of memory, hence it is the most stable solution. On the other hand, it
is difficult to accommodate for sudden target appearance changes
with such a model. Sometimes a forgetting factor is used to diminish through time the effect of the oldest samples on the subspace/manifold shape.
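As a toy illustration of the two simplest strategies above, the following sketch contrasts a sliding-window update with a blending update for a model represented as a plain feature vector; the window length, the blending factor and the vector representation are arbitrary assumptions made only for the example.

```python
import numpy as np
from collections import deque

def sliding_window_update(window, a_new, max_len=10):
    """Keep the last max_len per-frame models; the oldest one is dropped."""
    window.append(a_new)
    if len(window) > max_len:
        window.popleft()
    return window

def blending_update(A_prev, a_new, alpha=0.1):
    """Blend the new per-frame model into the running model: all past
    frames retain an (exponentially decaying) influence."""
    return (1.0 - alpha) * A_prev + alpha * a_new

window = deque()                  # sliding-window model
A = np.zeros(16)                  # blended model (toy 16-D feature vector)
for k in range(100):
    a_k = np.random.default_rng(k).normal(size=16)   # stand-in for the per-frame model
    window = sliding_window_update(window, a_k)
    A = blending_update(A, a_k)
```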
Figure 2.4: The patch based appearance model in our proposal.
2.3 Adaptive modeling with Particle Filtering
At the basis of our proposal lies the intuition that we can substitute some
of the fundamental stages of the target model update algorithm described
so far with equivalent steps performed by a particle filter estimating the
target appearance.
Hence, in our proposal two RBE trackers are used. One tracks the
target state, the other the target model. Since inference on high dimensionality spaces is hard and inefficient, we actually use an approximation
of the particle filter when tracking appearance. Hence, although our formulation is deeply inspired by this filter and can easily be interpreted
and implemented following its usual patterns, the appearance tracker is
not strictly speaking a Bayesian filter. In particular, it is our definition of
the observation likelihood that is not conformant, as detailed in the next
sections.
The appearance model in our proposal is a part-based, Generalized
Hough Transform-like model ([50], [1], [45]). It has been inspired also
by the bag of patches non-parametric model of [55]. It offers several
advantages over a global representation: it captures a coarse geometric structure of the target instead of global properties only; it naturally
allows for dealing with partial occlusions; it can be used to obtain a segmentation of the target [50]. We model both the foreground and the background, in the spirit of recent discriminative trackers. Hence, our model is composed of a model for each class

A_k = {A_k^F, A_k^B}   (2.1)

where each model is a set of graylevel square patches T of fixed side r together with their geometric displacements v with respect to the object center (Fig. 2.4):

A_k^F(B) = {s_k^i}_{i=1}^M = {(T_k^i, v_k^i)}_{i=1}^M,   T_k^i ∈ [0, 255]^(r^2), v_k^i ∈ R^2   (2.2)
The particle filter tracking the state of the target has the bounding
box center coordinates as state variable and the current frame as measure. The tracker of the appearance, instead, has a patch and its displacement as state variable and the pair formed by the current frame and
the current state estimation as measure. In fact, it is the output of the
tracker estimating the bounding box that provides a new measure of the
target appearance for the model update and, symmetrically, the tracker
estimating the appearance provides a new model to update the state in
the next frame. In other words, let
z_k = I_k   (2.3)
y_k = (x_k, I_k)   (2.4)

denote the measures for the state tracker and for the appearance tracker, respectively.
Then, the particle filter estimating the state computes the standard recursion:

p(x_k | z_1:k) ∝ p(I_k | x_k) ∫ p(x_k | x_k−1) p(x_k−1 | z_1:k−1) dx_k−1   (2.5)

and then the particle filter estimating the appearance solves:

p(s_k+1 | y_1:k) ∝ p(y_k | s_k+1) ∫ p(s_k+1 | s_k) p(s_k | y_1:k−1) ds_k   (2.6)
Given this formalization of model update as appearance tracking, in
our proposal we replace (compare Fig. 2.5 with Fig. 2.2):
• the standard sampling and labeling step with the propagation of the appearance particles to the next frame, i.e. by sampling from the proposal on appearance q(s_k+1 | s_k, y_k);
• the sample refinement, in particular the sample processing, with the update step of the appearance particle filter, which dynamically weights samples according to the likelihood on appearance p(y_k | s_k+1) (in principle the update step could also carry out the on-line feature selection, but this is not done in our proposal yet);
• the model estimation for the current frame with the resampling step of the appearance tracker, which probabilistically discards down-weighted samples from the previous step and effectively produces the model that best explains the current frame, given the observations up to the current frame.
In the following we define the basic components of the particle filters
we use to estimate the state and the appearance.
Appearance Proposal Density
q(s_k+1 | s_k, y_k) = q(s_k+1 | T_k, v_k, I_k, x_k)   (2.7)
To sample from it, we sample a new displacement with Gaussian
Brownian motion relative to the displacement of this patch in the
previous frame, vk , and then extract a patch from Ik centered in the
position given by the new displacement applied to xk . This gives a
new particle to approximate the new posterior PDF on appearance.
Figure 2.5: The structure of the target model update flow in our adaptive
tracker, k ≥ 1.
ŝ_k+1 = (T̂_k+1, v̂_k+1) ∼ q(s_k+1 | s_k, I_k, x_k)  ⇔  v̂_k+1 ∼ N(µ = v_k, Σ = Σ_v),  T̂_k+1 = I_k |_(x_k, v̂_k+1)   (2.8)

where we indicate with I_t |_(x,v) the extraction from a frame I_t of the patch defined by a displacement v with respect to a bounding box x.
Our proposal density is a full definition of a proposal for particle filtering, since it depends on both the previous state s_k and the current measure y_k, whereas the classical proposal used in a particle filter discards the dependency on the current measure. In particular, we exploit the current measure to sample the new appearance of the patch: generating it according to a generative model of illumination changes and object deformations would require such models, which are difficult to obtain for a general purpose tracker, and it would also require exploring a high-dimensional space (given the side of the patches r, the dimensionality of the space is r^2, and we use r ∼ 20), which in turn requires a huge number of particles to obtain an acceptable approximation of the posterior. By letting the current measure guide the exploration of the state space we avoid these problems and obtain an efficient algorithm. Finally, the proposal density in our method accounts also for deformable objects, by letting a patch move inside the object.
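A minimal sketch of sampling from this proposal is given below, assuming graylevel frames stored as NumPy arrays, a bounding box represented by its center and a diagonal Σ_v; the patch extraction helper plays the role of the I_k |_(x,v) notation of Eq. (2.8), and its name and interface are our own assumptions.

```python
import numpy as np

def extract_patch(frame, center_xy, displacement, r=20):
    """Return the r x r patch of `frame` centered at center + displacement
    (a stand-in for I_k|_(x,v) in Eq. 2.8; no image-border handling)."""
    cx, cy = np.round(np.asarray(center_xy) + np.asarray(displacement)).astype(int)
    h = r // 2
    return frame[cy - h:cy - h + r, cx - h:cx - h + r].astype(np.float64)

def sample_appearance_particle(frame, x_k, particle, sigma_v=2.0, r=20, rng=None):
    """Draw s_{k+1} = (T, v) from the proposal of Eq. (2.8): Gaussian Brownian
    motion on the displacement, patch read from the current frame."""
    rng = rng or np.random.default_rng()
    _, v_k = particle                                   # particle = (patch, displacement)
    v_new = v_k + rng.normal(scale=sigma_v, size=2)     # v̂ ~ N(v_k, Σ_v), Σ_v diagonal
    T_new = extract_patch(frame, x_k, v_new, r)         # T̂ = I_k|_(x_k, v̂)
    return T_new, v_new
```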
Appearance Observation Likelihood
p(y_k | s_k+1) = p(I_k, x_k | T_k+1, v_k+1)   (2.9)
The likelihood of the measure under the hypothesis that the patch
sk+1 belongs to the appearance model is where our proposal differs with respect to a standard particle filter. In particular, having
exploited the current measure to guide the state space exploration
and to sample the new patch appearance for sk+1 , we cannot define
the likelihood in terms of it, since sk+1 depends on yk . Therefore, we define the likelihood of sk+1 in terms of the particles of
the distribution of the other class, i.e. we use the particles of the
background class to assess the likelihood of the foreground particles and vice versa. Note that this way to evaluate p(yk | sk+1 )
still implicitly takes into account the measure yk , since the patches
from both classes come from yk through the proposal density.
We base our likelihood on the Zero-mean Normalized Cross Correlation (ZNCC). When applied to graylevel patches, this measure
computes the similarity of the patches and is invariant to affine
changes of the illumination. Therefore, the likelihood in our algorithm accounts for the robustness towards photometric changes of
the target. The ZNCC of two vectors a, b is defined as

ZNCC(a, b) = ( (a − µ(a)1) · (b − µ(b)1) ) / ( |a − µ(a)1| |b − µ(b)1| )   (2.10)
where 1 is the vector of ones of the same dimension as a and b, µ(x) is the mean of the components of the vector x and |x| its norm. Let

j̄ = arg max_(j=1,...,M) ZNCC(T_k+1, T̄_k+1^j)   (2.11)

where T̄_k+1^j stands for the j-th particle of the other class with respect to the class of T_k+1. Then we compute the likelihood as

p(I_k, x_k | T_k+1, v_k+1) ∝ exp( (1 − ZNCC(T_k+1, T̄_k+1^j̄)) / 2 )   (2.12)
Our definition of the likelihood is discriminative: the weight of
each particle of the appearance model is higher the more discriminative with respect to the other class the particle is. This means
that the resampling stage will be able to discard the particles not
useful to track the target when estimating the model for the current
frame. In other words, the weights computation performed with
our likelihood realizes the Feature Processing stage of the scheme
for model update presented before. If besides graylevel patches
other features are used, their weighting and the subsequent resampling would effectively perform also probabilistic feature selection. The main difficulty in successfully carrying out feature selection in this way is represented, as discussed in the previous section, by the different scales and dynamic responses of the similarity functions used to compare the features (e.g. the Bhattacharyya distance for histograms versus the ZNCC for patches), which makes it difficult to obtain comparable likelihood values.
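The following sketch implements the ZNCC of Eq. (2.10) and the resulting appearance likelihood of Eqs. (2.11)-(2.12) against the patches of the other class, assuming patches given as NumPy arrays; function names are ours.

```python
import numpy as np

def zncc(a, b, eps=1e-12):
    """Zero-mean Normalized Cross Correlation of two patches (Eq. 2.10)."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def appearance_likelihood(T_new, other_class_patches):
    """Weight of a candidate patch against the particles of the other class
    (Eqs. 2.11-2.12): higher when the patch is dissimilar from all of them."""
    best = max(zncc(T_new, T_bar) for T_bar in other_class_patches)
    return float(np.exp((1.0 - best) / 2.0))
```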
State Proposal Density We employ a standard Gaussian proposal with
a fixed, diagonal covariance matrix Σ x .
p(x_k | x_k−1, I_k) = N(x_k; µ = x_k−1, Σ = Σ_x)   (2.13)
State Observation Likelihood
p(I_k | x_k)   (2.14)
Given the model estimated on the previous frame, A_k = {A_k^F, A_k^B}, let

j̄_i = arg max_(j=1,...,M) ZNCC( I_k |_(x_k, v_k^i), T̄_k^j )   ∀ s_k^i ∈ A_k^F   (2.15)
i.e. for each foreground particle the index points to the patch in
the background model that is the most similar to the current frame
in the location given by the foreground particle displacement. In
other words, it indicates the particle of the background that best
explains the foreground appearance, given that the target is really
at x_k. Then, we compute the state likelihood as

p(I_k | x_k) ∝ exp( (1/M) Σ_(i=1)^M max( 0, ZNCC(T_k^i, I_k |_(x_k, v_k^i)) − ZNCC(T_k^(j̄_i), I_k |_(x_k, v_k^i)) ) )   (2.16)
i.e. as the mean likelihood obtained by the candidate xk over all
the particles of the foreground model, where the likelihood of a
candidate with respect to a particle of the foreground is given by
the similarity with the foreground patch and the dissimilarity from
the best background patch of the patch at the location identified
by the foreground particle displacement. This definition of the
likelihood naturally deals with partial occlusions. To overcome
also total occlusions we have to increase the stability of our algorithm by using one of the strategies introduced in Sec. 2.2.6.
We deployed the sliding window strategy since it is the simplest,
most efficient one and the overall probabilistic inference structure
of our proposal already provides robustness against outliers, such
as those included in the target model during occlusions. To include the sliding window strategy in our proposal, the appearance tracker particles are no longer single patches with displacements, but sliding windows of patches and displacements. The proposal density
is identical, whereas both likelihood values are computed as the
average over the sliding window of the likelihoods for a single
patch, presented above.
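Reusing the zncc and extract_patch helpers sketched above, the state likelihood of Eq. (2.16) can be sketched as follows; the inner maximum over the background patches folds in the index selection of Eq. (2.15).

```python
import numpy as np

def state_likelihood(frame, x_candidate, fg_model, bg_patches, r=20):
    """Approximate p(I_k | x_k) as in Eq. (2.16): for every foreground particle,
    similarity of the foreground patch to the patch observed at its displacement,
    minus the similarity of the best-matching background patch, clipped at 0."""
    scores = []
    for T_i, v_i in fg_model:                                   # foreground particles (T, v)
        obs = extract_patch(frame, x_candidate, v_i, r)         # I_k|_(x_k, v_k^i)
        fg_sim = zncc(T_i, obs)
        bg_sim = max(zncc(T_bar, obs) for T_bar in bg_patches)  # best background patch (Eq. 2.15)
        scores.append(max(0.0, fg_sim - bg_sim))
    return float(np.exp(np.mean(scores)))
```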
2.4 Experimental Results
2.4.1 Methodology
Trackers are initialized with the first bounding box available in the ground
truth. Probabilistic trackers have been run 10 times and the mean of
these runs is used for comparison with other trackers but the error bars
for these trackers are plotted in the charts as well. Comparable or even
better mean scores are not enough to assess that a probabilistic tracker is
to be preferred: if the variance is higher the tracker is less reliable and,
hence, less useful in a real deployment.
Two charts are used for each sequence. One reports the dice overlap with the ground truth in each frame of the sequence, i.e. the ratio between two times the area of the intersection of the ground truth bounding box with the estimated bounding box and the sum of their areas:
d_k = 2 |x_k ∩ x_k^GT| / ( |x_k| + |x_k^GT| )   (2.17)
This performance index varies in [0, 1], the higher the better. Such
index is also highly sensitive to small misalignment of the bounding
boxes, hence values above 0.7 usually correspond to satisfactory tracking.
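For reference, a minimal sketch of this index for axis-aligned boxes given as (x, y, w, h) tuples (a representation assumed here, not prescribed above):

```python
def dice_overlap(box_a, box_b):
    """Dice overlap (Eq. 2.17) between two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    return 2.0 * inter / (aw * ah + bw * bh)
```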
The second chart shows the correct track ratio versus the mean overlap on correct frames, where we define as correct frames those where the overlap is greater than a threshold, and the correct track ratio is given by the ratio between the number of correct frames and the total number of frames of the sequence. An optimal tracker is represented by a line at the very top of the chart. This chart tries to cope with the fact that for different applications different correct track ratios (more commonly expressed as lost track ratio) may be required. By considering the chart at a given x coordinate, it is possible to understand which trackers are able to provide such a level of lost track ratio, i.e. whether their line intersects that vertical axis, and with which accuracy, represented by their mean overlap.
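Given the per-frame overlaps, one way to obtain the points of this curve is to sweep the correctness threshold, as in the following sketch (the specific threshold grid is an arbitrary choice):

```python
import numpy as np

def correct_track_curve(overlaps, thresholds=np.linspace(0.0, 0.9, 10)):
    """For each threshold: (correct track ratio, mean overlap on correct frames)."""
    overlaps = np.asarray(overlaps, dtype=float)
    points = []
    for th in thresholds:
        correct = overlaps > th
        ratio = correct.mean()
        mean_on_correct = overlaps[correct].mean() if correct.any() else 0.0
        points.append((ratio, mean_on_correct))
    return points
```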
We compare our proposal against several adaptive trackers selected
for their relevance in the recent literature as well as for the availability of
the implementations at the authors’ website: Boost Tracker [29], SemiBoost Tracker [30], BeyondSemiBoost Tracker [89], A-BHMC (AdaptiveBasinHopingMC) [45], IVT (Incremental Visual Tracker) [80].
Figure 2.6: From left to right: initialization frame for the Dollar sequence; sudden change of appearance (frame 90); a distractor pops out (frame 130). The green rectangle represents the ground truth bounding box.
To evaluate the importance of model adaptation in the considered
sequences as well as to rank the overall performance of adaptive solutions, results from three standard non adaptive solutions are also added,
namely Frag-Track [1], a color-based particle filter [78] and Mean-shift
[17].
All the sequences are part of the dataset provided by the authors of
MILBoost [5].
2.4.2 Dollar sequence
This is a simple sequence, but it allows for some interesting considerations. There is no clutter. The target (Fig. 2.6a) suddenly changes
appearance (Fig. 2.6b). After a while a distractor equal to the original
appearance of the target pops out close to the target (Fig. 2.6c) and then
moves next to it. It is useful to understand the robustness to distractors
and the degree of adaptiveness of the algorithms in a very controlled and
predictable situation.
SemiBoost uses a fixed external classifier. This allows for very good
performances up to the sudden change. After that, this tracker believes the target has exited the scene, because nothing matches well with such a prior model. When the distractor appears, this tracker believes
the object is back in the scene, and follows it.
BeyondSemiBoost uses an adaptive slowly evolving prior in combination with a fixed one from the first frame. This allows the tracker to
overcome the sudden appearance change. Nevertheless, when the distractor appears, the fixed prior misleads the tracker.
The behavior of Boost is slightly unexpected. Since it is not bound to the initial appearance by a prior, it should have been able to avoid the distractor, as well as the sudden change. It does indeed overcome the change in appearance, but in many runs it jumps on the distractor as soon as it appears, much like BeyondSemiBoost. This explains the higher variance compared to the other trackers. The behaviour of A-BHMC is interesting. Since it is designed to cope with appearance changes stemming from geometric changes, it allows its patches to move independently from each other, similarly to our proposal, but not to vary much in appearance, since patches are matched across frames using a tracker assuming brightness constancy. This results in a greater instability than the other trackers. It also leads to two outcomes that limit its performance in this sequence: the lower part of the target is excluded from the model when it changes, and some patches are attracted by the distractor when it appears close to the target. Therefore, the output of the tracker stretches
between the target and the distractor. Our proposal, which updates also
the particle appearance, does not suffer from these problems.
As for non-adaptive solutions, the use of global statistics allows Mean Shift to overcome the nuisances of this scene, because the new appearance of the target is similar to the previous one as far as the color histogram is concerned, and the use of temporal consistency prevents it from jumping completely onto the distractor. Nevertheless its performance after
the appearance of the distractor is not satisfactory. FragTrack, using spatially localized histograms, is instead affected by the change and drifts to
the distractor. The Particle Filter exhibits a large variance in its results,
given by the fact that in the trials of the algorithms it was sometimes affected by the distractor and sometimes not: this indicates that the ability
of the particle filter to avoid the distractor in this sequence is just a random event due to the random approximation of the posterior produced
by the filter.
The best performers are IVT and our proposal. IVT deploys particle filtering for state tracking, as our tracker does. Its target model is instead composed of global features, in particular the target graylevel template. A subspace of templates is constructed on-line and the distance from it constitutes the basis for the definition of the observation likelihood. This is a very stable solution and has problems in adapting to sudden changes of appearance. Moreover, the graylevel template has problems in dealing with deformable targets. None of these critical conditions is met in this sequence, where from the sudden change of the object to the appearance of the distractor more than 40 frames elapse while the object is still and the majority of the target does not deform. Therefore, the tracker obtains a performance equivalent to ours both in terms of mean overlap and of variance. Both trackers are able to learn the new appearance of the target and do not confuse it with the distractor in any of the runs.
2.4.3 Faceocc2 sequence
This is a moderately difficult scene, targeting face tracking (Fig. 2.8).
The main nuisances in this scene are frequent and rather large occlusions. Besides, a permanent target appearance change happens about the middle of the sequence, followed by a last occlusion. Hence, the main
ability a tracker has to show in this sequence is a high discriminative
power between occlusions, i.e. spurious changes of the target appearance, and permanent changes of the target.
Results are reported in Fig. 2.9. Our proposal turns out to be the best again, as shown by the correct track ratio chart. Thanks to its formulation, our filter is able to discriminate between partial occlusions and changes of the target. In fact, when the book starts to occlude the face, its appearance has already been captured by the particles of our appearance model that are modeling the background. Hence, when performing weights update and resampling, the patches extracted on the book to perform target model update will receive a low score and will likely be discarded,
[Charts: (a) dice overlap per frame and (b) mean dice overlap vs. correct track ratio for Boost, SemiBoost, BeyondSemiBoost, FragTrack, IVT, MeanShift, ParticleFilter, A_BHMC and the proposed tracker.]
Figure 2.7: Dollar sequence.
Figure 2.8: Frames 8, 93, 163, 268, 498, 573, 718 and 808 of the Faceocc2 sequence. From left to right, top to bottom: initialization frame; first mild occlusion (frame 93); a larger occlusion (frame 163); third occlusion (frame 268); target rotation and large occlusion (frame 498); target appearance change (frame 573); large occlusion (frame 718); final appearance of the target (frame 808). The blue rectangle represents the ground truth bounding box.
therefore not corrupting the target model. On the other hand, the hat is
fully included in the target bounding box, and therefore its patches are
inserted in the target model.
IVT, deploying global features, suffers more than our proposal from both the large occlusion around frame 500 and the target appearance deformation around frame 350 (head turning). Mean-shift, deploying global features as well and being non-adaptive, cannot cope with the challenges of this sequence. FragTrack, although non-adaptive too, is based on part-wise features. Since the target appearance does not change up to frame 550, the non-adaptiveness of the tracker is compensated by the ability to correctly match the target in presence of occlusions, and the tracker is the second best in the correct track ratio chart. Nevertheless, the tracker suffers from the target deformation around frame 350 and the appearance change after the last occlusion. This indicates the need to allow
for target deformation when deploying part-wise features and the need
to update the part-based representation to obtain better overlaps in this
sequence.
Trackers deploying external classifiers for the sampling and labeling stage (SemiBoost, BeyondSemiBoost) show good performances up to the large target deformation of frame 300. Again, the use of strong priors on the target appearance, implied by using a detector to label new samples for appearance model update, limits their adaptability. On the other hand, a continuously adapting tracker like Boost suffers from the same nuisances, and in particular occlusions, because of its lack of stability.
2.4.4 Coke sequence
A can of Coke is tracked in front of a uniform background. The can
is moved behind a plant, causing partial and total occlusions. The can
is also rotated, causing appearance changes. Finally, an artificial light
stands very close to the target causing reflections and illumination changes.
The target is also small and relatively untextured. Overall, a challenging
sequence from many points of view.
Results are reported in Fig. 2.11. Basically, all trackers fail. The non-adaptive solutions lose the target immediately, since the can starts to rotate from the first frame. Handling appearance changes is of course fundamental in this sequence. The use of priors in SemiBoost and BeyondSemiBoost does not allow them to cope with a sequence with so many sudden changes of appearance. Moreover, the prior cannot be really informative since the object is relatively untextured, very small and similar to the background. The use of salient regions by A-BHMC makes it lose the target as soon as an untextured side of the can is shown to the
camera.
Even IVT loses the target in the first frames, because it does not have the time to create an effective subspace representation for the can appearance while the can keeps on changing its appearance.
Moreover, subspaces and manifolds do not seem the appropriate tools to
cope with this sequence.
The only partially successful solutions are those that allow for
[Charts: (a) dice overlap per frame and (b) mean dice overlap vs. correct track ratio for Boost, SemiBoost, BeyondSemiBoost, FragTrack, IVT, MeanShift, ParticleFilter, A_BHMC and the proposed tracker.]
Figure 2.9: Faceocc2 sequence.
Figure 2.10: Frames 0, 10, 65 and 185 of the Coke sequence. From left to right: initialization frame; after ten frames the appearance of the can has already changed and the target undergoes a partial occlusion; then the can wanders around, undergoing changes in appearance and illumination as in frame 65 and occlusions as in frame 185. The green rectangle represents the ground truth bounding box.
continuous update, without priors, and with a part-based model, namely Boost and our filter. We mainly impute the failure of our filter in this sequence to the lack of texture of the back of the object, which is not correctly handled by our observation likelihood based on the ZNCC. We believe that, with a proper mechanism to perform on-line feature selection and the inclusion of edge features, our performance would likely improve.
[Charts: (a) dice overlap per frame and (b) mean dice overlap vs. correct track ratio for Boost, SemiBoost, BeyondSemiBoost, FragTrack, IVT, MeanShift, ParticleFilter, A_BHMC and the proposed tracker.]
Figure 2.11: Coke sequence.
Chapter 3
Synergistic Change Detection
and Tracking
In this chapter we investigate adaptive visual tracking with static cameras. The usual approach [15, 32, 34, 38, 88, 90, 104] in such a case
is to ground tracking on change detection: a process that labels every
pixel as changed (i.e. a target pixel) or unchanged (i.e. a background
pixel) with respect to a static background. Although in these proposals change detection is key for tracking, little attention has been paid to
sound modeling of the interaction between the change detector and the
tracker. This negatively affects the quality of the information flowing
between the two computational modules, as well as the soundness of the
proposals. Moreover, the interaction can be highly influenced by heuristically tuned parameters, such as change detection thresholds, that limit
the deployment of these solutions in real world applications.
Our work aims at a sound modeling of the analysis of the change detection output, which produces a new measure for the tracker. We also wish to keep the number of parameters limited and to make them easy to interpret and tune. As we have seen, Recursive Bayesian Estimation
(RBE) casts visual tracking as a Bayesian inference problem in state
space given noisy observation of the hidden state. Bayesian reasoning
has been recently used also to solve the problem of change detection in
image sequences [47].
We introduce a novel Bayesian change detection approach aimed at efficiency and robustness to common sources of disturbance such as illumination changes, camera gain and exposure variations, and noise. At each new frame, a binary Bayesian classifier is trained and then used to discriminate between pixels sensing a scene change and pixels sensing a spurious intensity variation due to disturbance factors. After efficient non-parametric estimation of the likelihood distributions for both classes, the posterior probability of sensing a scene change at each pixel is obtained.
Given this Bayesian change detector and a generic recursive Bayesian
filter as tracker, we develop a principled framework whereby both algorithms can virtuously influence each other according to a Bayesian loop.
In particular:
• the output of the change detection is used to provide a fully specified observation likelihood to the RBE tracker;
• the RBE tracker provides a feedback to the Bayesian change detector by defining an informative prior for it;
• both PDFs are modeled and realized as marginalizations of the
joint PDF on tracker state and change detector output.
The derivation of a measure for the tracker from the change detection output is a fundamental part of every tracker based on change detection. The idea of letting the tracker provide a feedback to change detection is inspired by the emergence of cognitive feedback in Computer Vision [96]. The idea of cognitive feedback is to let not only low-level vision modules feed high-level ones, but also the latter influence the former. This creates a closed loop, reminiscent of effects found in
psychophysics. This concept has not been deployed for the problem of
visual tracking yet. Nevertheless, it fits surprisingly well in the case
of Bayesian change detection, where priors can well model the stimuli
coming from the tracker.
By exploiting the synergy between the two flows of information our
system creates a full and synergistic Bayesian loop between the tracker
and the change detection, whose benefits are presented in the Experimental Results section (Sec. 3.6), where the Kalman Filter is used as
RBE tracker and the algorithm introduced in Sec. 3.4 as change detection. However, our proposal is general and in principle can be used with
any RBE tracker and Bayesian change detection, such as e.g., respectively, particle filters and [47].
3.1 Related Works
Classical works on blob tracking based on change detection are W4
[32] and the system developed at the Video Surveillance and Monitoring (VSAM) group of CMU [15]. In these systems the output of the
change detector is thresholded and a connected component analysis is
carried out to identify moving regions (blobs). A first or second order
dynamical model of every tracked object is used to predict its position in
the current frame from the previous ones. Positions are then refined by
matching the predictions to the output of the change detection. In VSAM
[15] any blob whose centroid falls within a neighborhood of the target
predicted position is considered for matching. Matching is performed as
correlation of an appearance template of the target to the changed pixels,
and the position corresponding to the best correlation is selected as the
new position for the object. In W4 [32] the new position is that corresponding to the maximum of the binary edge correlation between the
current and previous silhouette edge profiles. However, the interaction
between tracking and change detection is limited: tracking is not formalized in the context of RBE, change detection depends on hard thresholds, and no probabilistic reasoning is carried out to derive a new measure from the change detection output or to update the object position (i.e. a set of heuristics is used to handle the case of disconnected blobs belonging to the same object).
[90] and [34] are examples of blob trackers based on change detection where the RBE framework is used in the form of the Kalman filter.
Yet, the use of this powerful framework is impoverished by the absence
of a truly probabilistic treatment of the change detection output. In practice, covariance matrices defining measurement and process uncertainties are constant, and the filter evolves toward its steady-state regardless
of the quality of the measures obtained from change detection. A posteriori covariance matrices are sometimes deterministically increased by
the algorithms, but this is mainly a shortcut to implement track management: if there is no match for the track in the current frame, uncertainties are increased, and if the a posteriori uncertainties on the state get too high, the track is discarded.
[38] is one of the most famous attempts to integrate RBE, in the form of a particle filter, with a statistical treatment of background (and foreground) models. It proposes a multi-blob likelihood function that, given
the frame and the background model, allows the system to reason probabilistically on the number of people present in the scene as well as on
their positions. The main limitations are the use of a calibrated camera
with reference to the ground plane and the use of a foreground model
learned off-line. While the former can be reasonable, although cumbersome, the use of foreground models is always troublesome in practice,
given the high intra-class variability of target appearances. Moreover, no
cognitive feedback is provided from the Particle Filter to influence the
change detection.
Sorts of cognitive feedbacks from tracking to change detection have
been used so far only to deal with background maintenance and adaptive background modeling issues. For example, [95] proposes a method
based on approximate inference on a dynamic Bayesian Network that
simultaneously solves tracking and background model updating for every frame. Nevertheless, as discussed by the authors, this proposal does not take advantage of models of foreground motion as our algorithm does, although this would allow for better estimation of both the background model and the background/foreground labels, because it would also
severely complicate inference. Another example of background maintenance is [33], where positive and negative feedbacks from high-level modules (a stereo-based people detector and tracker, a detector of rapid
changes in global illumination, camera gain, and camera position) are
used to update the parameters of the Gaussian distributions in the Gaussian Mixture Model used as background. These feedbacks come in the
form of pixel-wise positive or negative real number maps that are generated as sum of the contributions of the high-level modules and are
thresholded in order to decide if a pixel should be used to update the
background. Contributions from the high-level modules are heuristically
determined.
3.2 Models and assumptions
We first present assumptions and notations used to model RBE and Bayesian
change detection separately, then we introduce the common framework
that allows us to define probabilistically the bidirectional interaction between the two modules, i.e. the observation likelihood for the tracker
defined on the change map and the prior for the change detection that
implements the Cognitive Feedback.
3.2.1 RBE model
We assume a rectangular model for the tracked object, as done in many
proposals such as [17]. Hence, the state of the RBE tracker, $x_k$,
comprises at least four variables
$$x_k = \left\{ i_{b_k}, j_{b_k}, w_k, h_k, \ldots \right\} \qquad (3.1)$$
where (ibk , jbk ) are the coordinates of the barycenter of the rectangle and
wk and hk its dimensions. These variables define the position and size
at frame k of the tracked object. Of course, the state internally used by the tracker can beneficially include other kinematic variables (velocity, acceleration, . . . ). Yet, change detection can only provide a measure
and benefit from a prior on the position and size of the object. Hence,
other variables are not used in the remainder of the presentation of the
algorithm, though they can be used internally by the RBE filter, and are
indeed used in our implementation (Sec. 3.6).
We can also represent the bounding box by defining the new variables $i_L$, $j_T$, $i_R$, $j_B$ as
$$A = \begin{bmatrix} 1 & -\frac{1}{2} \\ 1 & \frac{1}{2} \end{bmatrix}, \qquad \begin{bmatrix} i_L \\ i_R \end{bmatrix} = A \begin{bmatrix} i_{b_k} \\ w_k \end{bmatrix}, \qquad \begin{bmatrix} j_T \\ j_B \end{bmatrix} = A \begin{bmatrix} j_{b_k} \\ h_k \end{bmatrix}. \qquad (3.2)$$
We assume the variables iL , jT , iR , jB to be independent, since this
is reasonable in our context and also simplifies the derivation of the information flows of our loop. This implies that the variables ibk , jbk , wk , hk
defining the alternative representation are not independent, but this is not
a problem since RBE can handle dependent variables (e.g. the Kalman
filter does not require diagonal covariance matrices).
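As an illustration of the change of representation in (3.2), the following minimal Python sketch converts between the barycenter/size and the extrema parameterizations of a bounding box; all function names are illustrative, not part of the thesis implementation:

```python
import numpy as np

# Change of representation of eq. (3.2): A maps (i_b, w) to (i_L, i_R)
# and, identically, (j_b, h) to (j_T, j_B).
A = np.array([[1.0, -0.5],
              [1.0,  0.5]])

def box_to_extrema(i_b, j_b, w, h):
    """Return (i_L, i_R, j_T, j_B) from the (i_b, j_b, w, h) state variables."""
    i_L, i_R = A @ np.array([i_b, w])
    j_T, j_B = A @ np.array([j_b, h])
    return i_L, i_R, j_T, j_B

def extrema_to_box(i_L, i_R, j_T, j_B):
    """Inverse mapping, using A^{-1} as in eq. (3.33)."""
    A_inv = np.linalg.inv(A)
    i_b, w = A_inv @ np.array([i_L, i_R])
    j_b, h = A_inv @ np.array([j_T, j_B])
    return i_b, j_b, w, h
```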
3.2.2 Bayesian change detection model
In Bayesian change detection each pixel of the image is modeled as a
categorical Bernoulli-distributed random variable, ci j , with the two possible realizations ci j = C and ci j = U indicating the event of pixel (i, j)
being changed or unchanged, respectively.
In the following we refer to the matrix $c = [c_{ij}]$ of all these random variables as the change mask and to the matrix $p = [p(c_{ij} = C)]$ of probabilities defining the Bernoulli distribution of these variables as change
map. The change mask and the change map assume values, respectively,
in the $(w \times h)$-dimensional spaces $\Theta = \{C, U\}^{w \times h}$ and $\Omega = [0, 1]^{w \times h}$, with
w and h denoting image width and height, respectively. The output of a
Bayesian change detector is the posterior change map given the current
frame fk and background model bk , i.e. the value of the Bernoulli distribution parameter for every pixel in the image given the frame and the
Figure 3.1: Model for the change map given a bounding box.
background:
$$p(c_{ij} = C \mid f_k, b_k) = \frac{p(f_k, b_k \mid c_{ij} = C)\, p(c_{ij} = C)}{p(f_k, b_k)} \qquad (3.3)$$
Clearly, either a non-informative prior is used, such as a uniform prior,
or this information has to be provided by an external module. We assume that the categorical random variables ci j comprising the posterior
change mask are independent, i.e. they are conditionally independent
given fk , bk .
3.2.3 Bayesian loop models
All the information that can flow from the RBE filter to the Bayesian
change detection and vice versa is in principle represented in every frame
by the joint probability density function p(xk , c) of the state vector and
the change mask. Both information flows can be formalized and realized
as its marginalization:
$$p(c_{ij}) = \int_{\mathbb{R}^4} \sum_{\bar{c}_{ij} \in \Theta_{ij}} p\left(x_k, c_{ij}, \bar{c}_{ij}\right) dx_k \qquad (3.4)$$
$$p(x_k) = \sum_{c \in \Theta} p(x_k, c) \qquad (3.5)$$
Figure 3.2: Overall system description. In every frame the RBE tracker provides a prediction $p(x_k \mid z_{1:k-1})$ from the previous state, which is used by our framework to generate a set of priors $p(c_{ij})$, each one of them assessing the probability that a particular pixel is changed. This informative prior is used by a Bayesian change detection algorithm, together with the current frame $f_k$ and a model of the background $b_k$, to produce a change map $p(c_{ij} \mid f_k, b_k)$. The change map is not thresholded: a probabilistic analysis is carried out in order to provide a new measure for the tracker, $p(z_k \mid x_k)$, that is merged with the prediction in the update stage of RBE. The blue and red histograms around, respectively, the prediction and the measure represent the variance associated with the four variables defining a bounding box, which are assumed to follow a Gaussian distribution in this specific example. Generally speaking, they are placed there to remind the reader that completely specified probabilities flow from and into the RBE tracker thanks to our proposal.
where $\bar{c}_{ij}$ denotes the change mask without the $(i, j)$-th element, taking values inside the space $\Theta_{ij} = \{C, U\}^{w \times h - 1}$.
The PDF computed with (3.4) defines an informative prior for the
Bayesian change detection algorithm, and the estimation of the state obtained with (3.5) can then be used as the PDF of a new measure by the
RBE tracker, i.e. as p(zk | xk ). We detail in Sec. 3.3 and Sec. 3.5 the solutions for (3.4) and (3.5). With reference to Fig. 3.2, it is worth noticing
that in our framework only fully defined probabilities flow among the
modules, not just expectations or deterministic measures.
As we shall see in next sections, to use the above equations we need a
statistical model that links the two random vectors xk and c. In agreement
with our rectangular model of the tracked object, as shown in Fig. 3.1 we
assume
$$p\left(c_{ij} = C \mid x_k\right) = \begin{cases} K_1 & \text{if } (i, j) \in R(x_k), \\ K_2 & \text{otherwise,} \end{cases} \qquad (3.6)$$
where R(xk ) is the rectangular region delimited by the bounding box
defined by the state xk and 0 ≤ K2 ≤ K1 ≤ 1 are two constant parameters
specifying the probability that a pixel is changed inside and outside the
bounding box, respectively. Moreover, we assume that the random variables ci j are conditionally independent given a bounding box, i.e.
$$p(c \mid x_k) = \prod_{ij} p\left(c_{ij} \mid x_k\right) \qquad (3.7)$$
3.3 Cognitive Feedback
Given the assumptions in Sec. 3.2, we can obtain an exact solution for
(3.4), i.e., given the PDF of the state vector p(xk ), we can compute a
prior p(ci j ) for each pixel of the frame that can then be used as prior in
the Bayesian change detection algorithm. Starting from (3.4), we can
rewrite it as
$$p\left(c_{ij}\right) = \int_{\mathbb{R}^4} \sum_{\bar{c}_{ij} \in \Theta_{ij}} p\left(x_k, c_{ij}, \bar{c}_{ij}\right) dx_k = \int_{\mathbb{R}^4} p\left(x_k, c_{ij}\right) dx_k = \int_{\mathbb{R}^4} p\left(c_{ij} \mid x_k\right) p(x_k)\, dx_k \qquad (3.8)$$
In the final marginalization we can recognize our model of the change
map given a bounding box defined in (3.6) and the PDF of the state.
Therefore, this equation provides a way to let the current estimation
of the state computed by the RBE module influence the prior for the
Bayesian change detection algorithm, thereby realizing the Cognitive
Feedback. In particular, as discussed above, we will use the prediction computed for the current frame using the motion model, i.e. p(xk |
Z1:k−1 ).
To solve (3.8) we have to span the space $\mathbb{R}^4$ of all possible bounding boxes $x_k$. We partition $\mathbb{R}^4$ into the two complementary sub-spaces $B_{ij}$ and $\bar{B}_{ij} = \mathbb{R}^4 \setminus B_{ij}$ of bounding boxes that contain or do not contain the considered pixel $(i, j)$, respectively. Given the assumed model (3.6), we obtain
$$\begin{aligned} p(c_{ij} = C) &= \int_{\mathbb{R}^4} p(c_{ij} \mid x_k)\, p(x_k)\, dx_k \\ &= K_1 \int_{B_{ij}} p(x_k)\, dx_k + K_2 \int_{\bar{B}_{ij}} p(x_k)\, dx_k \\ &= K_1 \int_{B_{ij}} p(x_k)\, dx_k + K_2 \int_{\mathbb{R}^4} p(x_k)\, dx_k - K_2 \int_{B_{ij}} p(x_k)\, dx_k \\ &= K_2 + (K_1 - K_2)\, I_{ij}, \qquad I_{ij} = \int_{B_{ij}} p(x_k)\, dx_k \end{aligned} \qquad (3.9)$$
Since Ii j varies in [0, 1], it follows that p(ci j = C) varies in [K2 , K1 ]: if no
bounding box with non-zero probability contains the pixel, we expect a
probability that the pixel is changed equal to K2 ; if all the bounding
boxes contain the pixel the probability is K1 ; it is a weighted average
otherwise.
By using the alternative representation for the bounding box defined
in (3.2) and recalling that we assume iL , jT , iR , jB to be independent, the
integral becomes
$$\begin{aligned} I_{ij} &= \int_{\substack{i_L \le i \le i_R \\ j_T \le j \le j_B}} p(i_L)\, p(i_R)\, p(j_T)\, p(j_B)\; di_L\, di_R\, dj_T\, dj_B \\ &= \int_{-\infty}^{i} p(i_L)\, di_L \int_{i}^{+\infty} p(i_R)\, di_R \int_{-\infty}^{j} p(j_T)\, dj_T \int_{j}^{+\infty} p(j_B)\, dj_B \\ &= F_{i_L}(i)\, \left(1 - F_{i_R}(i)\right) F_{j_T}(j)\, \left(1 - F_{j_B}(j)\right) \end{aligned} \qquad (3.10)$$
where $F_x$ stands for the CDF of the random variable $x$.
This reasoning holds for any distribution p(xk ) we might have on
the state vector. If, for instance, we use a particle filter as RBE tracker,
we can compute an approximation of the CDF from the approximation
of the PDF provided by the weighted particles, after having propagated
them according to the motion model and having marginalized them accordingly. In the case of the Kalman Filter all the PDFs are Gaussians,
hence we can define all the factors of the product in (3.10) in terms of
the standard Gaussian CDF, Φ(·)
$$I_{ij} = \Phi\!\left(\frac{i - \mu_{i_L}}{\sigma_{i_L}}\right) \Phi\!\left(\frac{\mu_{i_R} - i}{\sigma_{i_R}}\right) \Phi\!\left(\frac{j - \mu_{j_T}}{\sigma_{j_T}}\right) \Phi\!\left(\frac{\mu_{j_B} - j}{\sigma_{j_B}}\right) \qquad (3.11)$$
where µx and σ x stand for the mean and the standard deviation of the
random variable x. The factors of the product in (3.11) can be computed
efficiently with only 4 searches in a pre-computed Look-Up Table of the
standard Φ(·) values.
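The following Python sketch illustrates how the prior change map of eqs. (3.9)-(3.11) could be computed from the Gaussian marginals predicted by the Kalman filter. It is a minimal example under the assumptions stated above; function and variable names are illustrative and, for brevity, it calls scipy's Gaussian CDF directly instead of the pre-computed Look-Up Table mentioned in the text:

```python
import numpy as np
from scipy.stats import norm

def prior_change_map(mu, sigma, width, height, K1=0.5, K2=0.0):
    """Prior p(c_ij = C) of eqs. (3.9)-(3.11).

    mu, sigma: means and standard deviations of the Gaussian predictions for
    (i_L, i_R, j_T, j_B); K1, K2 default to the cognitive feedback values of
    Sec. 3.6 (K1_CF, K2_CF).
    """
    mu_L, mu_R, mu_T, mu_B = mu
    s_L, s_R, s_T, s_B = sigma
    i = np.arange(width)   # column coordinate
    j = np.arange(height)  # row coordinate
    # Per-column and per-row factors of eq. (3.11).
    f_i = norm.cdf((i - mu_L) / s_L) * norm.cdf((mu_R - i) / s_R)
    f_j = norm.cdf((j - mu_T) / s_T) * norm.cdf((mu_B - j) / s_B)
    I = np.outer(f_j, f_i)             # I_ij, shape (height, width)
    return K2 + (K1 - K2) * I          # eq. (3.9)
```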
3.4 Bayesian change detection
The main difficulty with change detection consists in discerning changes
of the monitored scene in presence of spurious intensity variations yielded
by nuisances such as noise, gradual or sudden illumination changes,
dynamic adjustments of camera parameters (e.g. auto-exposure, autogain). Many different algorithms for dealing with these issues have been
proposed (see [24] for a recent survey).
A first class of popular algorithms is based on statistical per-pixel background models, such as e.g. Mixture of Gaussians [90] or kernel-based non-parametric models [23]; these are effective in case of noise and gradual illumination changes (e.g. due to the time of the day). Unfortunately, though, they cannot deal with disturbances causing sudden intensity changes (e.g. a light switch), yielding in such cases lots of false positives.
A second class of algorithms relies on modeling a priori the possible spurious intensity changes over small image patches yielded by disturbances. Following this idea, a pixel from the current frame is classified as changed if the intensity transformation between its local neighborhood and the corresponding neighborhood in the background cannot be explained by the chosen a priori model. As a result, gradual as well as sudden photometric distortions do not yield false positives provided that they are explained by the model. Thus, the main issue concerns the choice of the a priori model: generally speaking, the more restrictive such a model, the higher the ability to detect changes (sensitivity) but the lower the robustness to disturbances (specificity). Some proposals assume disturbances to yield linear intensity transformations [53, 68]. Nevertheless, as discussed in [102], many non-linearities may arise in the image formation process, so that a less constrained model is often required to achieve adequate robustness. Hence, other algorithms adopt order-preserving models, i.e. assume monotonic non-decreasing intensity transformations [48, 64, 102].
We propose a change detection approach that, instead of assuming a priori the model of intensity changes caused by disturbances, learns it on-line together with the model of intensity changes yielded by foreground objects.
Figure 3.3: Notations adopted for the background (on the left) and the current frame (on the right) 3 × 3 neighborhood intensities: $x_{i,1}, \ldots, x_{i,8}$ around $x_i$ and $y_{i,1}, \ldots, y_{i,8}$ around $y_i$.
In particular, at each new frame a binary Bayesian classifier is trained and then used to discriminate between pixels sensing a scene change due to foreground objects and pixels sensing a spurious intensity variation due to disturbances. On-line learning of the models holds the
potential for deploying on a frame-by-frame basis models as restrictive
as needed to discriminate between the two classes, so that the algorithm
can exhibit a high sensitivity without a significant loss of specificity.
Moreover, the fully Bayesian formulation for the change detection problem allows for seamlessly incorporating in a sound way a prior probability to strengthen the change detection output. In our framework this
prior is provided by the tracker via the cognitive feedback defined above.
3.4.1 On-line learning for change detection
By taking pixels in lexicographical order, let us denote the background
and the current frame intensities, respectively, as
$$B = (x_1, \ldots, x_N) \quad \text{and} \quad F = (y_1, \ldots, y_N) \qquad (3.12)$$
where xi , yi ∈ [0, 255] ⊂ N, i = 1, . . . , N and N is the total number of
pixels in the images. The goal of change detection is to compute the
binary change mask
$$M = (c_1, \ldots, c_N) \qquad (3.13)$$
i.e. to classify each pixel i into one of the two classes:
ci = C : the pixel is sensing a scene change;
ci = U : the pixel is not sensing a scene change.
The idea at the basis of our proposal consists in training at each new
frame a binary Bayesian classifier using as feature vector the background-frame pair of intensities (x, y) observed at a pixel, and then computing
the change map by letting each pixel take the a-posteriori value of the
probability of being changed:
$$p(c = C \mid x, y) = \frac{p(c = C)\, p(x, y \mid c = C)}{p(x, y)} \qquad (3.14)$$
The prior p(c= C) is obtained via the Bayesian loop from the tracker.
In order to train the classifier we have to estimate the likelihood p(x, y |
c = C) and the evidence p(x, y). We can avoid estimating the evidence by
the usual manipulation of (3.14) as
$$p(c = C \mid x, y) = \frac{p(c = C)\, p(x, y \mid c = C)}{p(x, y)} = \frac{p(c = C)\, p(x, y \mid c = C)}{p(c = C)\, p(x, y \mid c = C) + p(c = U)\, p(x, y \mid c = U)} = \frac{1}{1 + \dfrac{p(c = U)\, p(x, y \mid c = U)}{p(c = C)\, p(x, y \mid c = C)}}. \qquad (3.15)$$
To estimate p(x, y | c= C) and p(x, y | c= U), we carry out a preliminary classification of pixels by means of a very simple and efficient
neighborhood-based change detection algorithm. For a generic pixel i,
let the intensities of a surrounding 3 × 3 neighborhood be denoted as
in Fig. 3.3, let the intensity differences between the j-th and the central
pixel of the neighborhood in the background and in the current frame be,
respectively,
$$d^{(x)}_{i,j} = x_{i,j} - x_i \quad \text{and} \quad d^{(y)}_{i,j} = y_{i,j} - y_i \qquad (3.16)$$
and let the pixel in the neighborhood yielding the maximum absolute
value of the background intensity difference be
$$\bar{j}_i = \arg\max_{j = 1, \ldots, 8} \left| d^{(x)}_{i,j} \right| \qquad (3.17)$$
A preliminary change mask $\tilde{M} = (\tilde{c}_1, \ldots, \tilde{c}_N)$ is computed by classifying each pixel as changed if the signs of the intensity differences $d^{(x)}_{i,\bar{j}_i}$ and $d^{(y)}_{i,\bar{j}_i}$ differ, unchanged otherwise:
$$\tilde{c}_i = \begin{cases} C & \text{if } d^{(x)}_{i,\bar{j}_i} \cdot d^{(y)}_{i,\bar{j}_i} < 0, \\ U & \text{otherwise.} \end{cases} \qquad (3.18)$$
This algorithm is a simplified version of that proposed in [102] and exhibits O(N) complexity. In fact, since the background model is not updated, the computation of j̄i for each pixel by (3.17) can be performed
off-line after background initialization. Furthermore, the algorithm is
threshold-free.
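A minimal Python sketch of this preliminary, threshold-free labeling (eqs. (3.16)-(3.18)) could look as follows; the implementation details (vectorization, border handling) are assumptions, not the thesis code:

```python
import numpy as np

def preliminary_change_mask(background, frame):
    """Preliminary labeling of eqs. (3.16)-(3.18). True = changed.

    background, frame: 2-D uint8 arrays of the same size; for brevity the
    image border is simply marked unchanged here.
    """
    bg = background.astype(np.int32)
    fr = frame.astype(np.int32)
    # 3x3 neighborhood offsets (excluding the central pixel), as in Fig. 3.3.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    # Differences d^(x)_{i,j} and d^(y)_{i,j} for every offset (eq. 3.16).
    dx = np.stack([np.roll(bg, (-di, -dj), axis=(0, 1)) - bg for di, dj in offsets])
    dy = np.stack([np.roll(fr, (-di, -dj), axis=(0, 1)) - fr for di, dj in offsets])
    # Neighbor with maximum absolute background difference (eq. 3.17);
    # since the background is fixed, this could be pre-computed once.
    j_bar = np.abs(dx).argmax(axis=0)
    rows, cols = np.indices(bg.shape)
    prod = dx[j_bar, rows, cols] * dy[j_bar, rows, cols]
    mask = prod < 0                        # eq. (3.18): differing signs -> changed
    mask[0, :] = mask[-1, :] = False       # ignore wrap-around at the borders
    mask[:, 0] = mask[:, -1] = False
    return mask
```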
The preliminary change mask is thus used to label each pixel to create a training set out of the current frame. The two likelihood distributions p(x, y | c= C) and p(x, y | c= U) are estimated on this training set
as follows:
$$p(x, y \mid c = C) = \frac{h_C(x, y)}{N_C} \qquad (3.19)$$
$$p(x, y \mid c = U) = \frac{h_U(x, y)}{N_U} \qquad (3.20)$$
where $N_C$ and $N_U$ are the numbers of pixels labeled as changed and unchanged, respectively, and $h_C(x, y)$ and $h_U(x, y)$ are the 2-D joint histograms of background versus frame intensity computed by considering, respectively, the pixels labeled as changed and those labeled as unchanged. Before being used in (3.15), both the histograms $h_C(x, y)$ and $h_U(x, y)$ are smoothed by averaging over a moving window of fixed size. The smoothing allows for correcting errors introduced by wrongly labeled training data in the preliminary rough labeling
as well as for introducing a small amount of spatial consistency among
labels, under the hypothesis that pixels close to each other in the image
space show similar intensity values both in the foreground and in the
background.
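The training and classification steps of eqs. (3.19), (3.20) and (3.15) can be sketched in Python as follows; the histogram bin layout and the box-filter smoothing window are assumptions:

```python
import numpy as np
from scipy.signal import convolve2d

def bayesian_change_map(background, frame, prelim_mask, prior, smooth=7):
    """Posterior change map of eq. (3.15).

    prelim_mask: boolean preliminary labeling (True = changed), e.g. from
    preliminary_change_mask; prior: per-pixel prior p(c = C), e.g. from the
    cognitive feedback.
    """
    x = background.ravel().astype(int)
    y = frame.ravel().astype(int)
    m = prelim_mask.ravel()
    edges = np.arange(257)
    # 2-D joint histograms of background vs. frame intensity per class.
    hC, _, _ = np.histogram2d(x[m], y[m], bins=[edges, edges])
    hU, _, _ = np.histogram2d(x[~m], y[~m], bins=[edges, edges])
    # Smoothing by averaging over a moving window (eq. 3.19-3.20 estimates).
    kernel = np.ones((smooth, smooth)) / smooth ** 2
    hC = convolve2d(hC, kernel, mode="same")
    hU = convolve2d(hU, kernel, mode="same")
    pC = hC[x, y] / max(m.sum(), 1)        # p(x, y | c = C)
    pU = hU[x, y] / max((~m).sum(), 1)     # p(x, y | c = U)
    prior = prior.ravel()
    eps = 1e-12
    post = 1.0 / (1.0 + ((1.0 - prior) * pU) / (prior * pC + eps))  # eq. (3.15)
    return post.reshape(background.shape)
```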
3.5 Reasoning probabilistically on change maps
Given the change map $p = [p(c_{ij} = C)]$ obtained by the Bayesian change
detection algorithm, we aim at computing the probability density function p(xk ) of the current state of the RBE filter, to use it as the observation
likelihood p(zk | xk ). To this purpose, from the marginalization in (3.5)
we obtain:
$$p(x_k) = \sum_{c \in \Theta} p(x_k, c) = \sum_{c \in \Theta} p(x_k \mid c)\, p(c) = \sum_{c \in \Theta} p(x_k \mid c) \prod_{ij} p(c_{ij}) \qquad (3.21)$$
where the last equality follows from the assumption of independence
among the categorical random variables ci j comprising the posterior change
map computed by the Bayesian change detection.
To use (3.21), we need an expression for the conditional probability
p(xk | c) of the state given a change mask, based on the assumed model
(3.6), (3.7) for the conditional probability p(c | xk ) of the change mask
given a state. Informally speaking, we need to find the inverse of the
model (3.6), (3.7).
By Bayes rule, eq. (3.7) and independence of the variables ci j :
$$p(x_k \mid c) = p^*(x_k)\, \frac{p(c \mid x_k)}{p^*(c)} = p^*(x_k) \prod_{i,j} \frac{p\left(c_{ij} \mid x_k\right)}{p^*\left(c_{ij}\right)}. \qquad (3.22)$$
We have used the notation p∗ (xk ) and p∗ (ci j ) in (3.22) since here these
probabilities must be interpreted differently than in (3.21): in (3.21)
p(xk ) and p(ci j ) represent, respectively, the measurement and the change
map of the current frame, whilst in (3.22) both must be interpreted as
priors that form part of our model for p(xk | c), which is independent
of the current frame. Furthermore, using as prior on the state p∗ (xk ) the
prediction of the RBE filter, as done in the Cognitive Feedback section,
would have created a strong coupling between the output of the sensor
and the previous state of the filter, that does not fit the RBE framework,
where measures depend only on the current state, and could easily lead
the loop to diverge. Hence, we assume a uniform non-informative prior
$p^*(x_k) = \frac{1}{\alpha}$ for the state.
The analysis conducted for the Cognitive Feedback is useful to expand each p∗ (ci j ) in (3.22). Since we are assuming a uniform prior on an
infinite domain for the state variables, i.e. a symmetric PDF with respect
to $x = 0$, it turns out that its CDF is constant and equal to $\frac{1}{2}$:
$$\mathrm{CDF}(x) = \frac{1}{\alpha}\, x + \frac{1}{2} \;\xrightarrow{\alpha \to +\infty}\; \frac{1}{2} \qquad (3.23)$$
Hence, every p∗ (ci j ) in (3.22) can be expressed using (3.9) and (3.10) as:
$$p^*\left(c_{ij} = C\right) = K_2 + (K_1 - K_2) \left(\frac{1}{2}\right)^4 = K_C. \qquad (3.24)$$
By plugging (3.22) in (3.21) and defining KU = p∗ (ci j = U) = 1 − KC :
$$\alpha\, p(x_k) = \prod_{i,j} \left( \frac{p(C \mid x_k)\, p(C)}{K_C} + \frac{p(U \mid x_k)\, p(U)}{K_U} \right) \qquad (3.25)$$
where, for simplicity of notation, we use C and U for ci j = C and ci j = U,
respectively. Since we know that p(U) = 1−p(C) and p(U | xk ) = 1−p(C |
xk ), we obtain:
$$\frac{p(x_k)}{\beta} = \prod_{i,j} \Bigl( p(C) \bigl(p(C \mid x_k) - K_C\bigr) + K_C \bigl(1 - p(C \mid x_k)\bigr) \Bigr) \qquad (3.26)$$
with $\beta = 1 / \left(\alpha\, \left(K_C (1 - K_C)\right)^{w \times h}\right)$. By substituting the model (3.6) for
$p(C \mid x_k)$ and taking the logarithm of both sides to improve the numerical stability, after some manipulations we get:
$$\gamma + \ln p(x_k) = h(x_k, p) = \sum_{(i,j) \in R(x_k)} \ln \frac{p(C)\, K_3 + K_4}{p(C)\, K_5 + K_6} \qquad (3.27)$$
where $\gamma = -\ln \beta - \sum_{i,j} \ln\left(p(C)\, K_5 + K_6\right)$ and $h(\cdot)$ is a known function of the state vector value $x_k$ for which we want to calculate the probability density, of the change map $p$ provided by the Bayesian change detection algorithm, and of the constants
$$K_3 = K_1 - K_C, \quad K_4 = K_C (1 - K_1), \quad K_5 = K_2 - K_C, \quad K_6 = K_C (1 - K_2). \qquad (3.28)$$
Hence, by letting xk vary over the space of all possible bounding boxes,
(3.27) allows us to compute, up to the additive constant γ, a non-parametric
estimation h(·) of the log-PDF of the current state vector of the RBE
tracker. This holds independently of the PDF of the state.
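As an illustration, $h(x_k, p)$ of eq. (3.27) can be evaluated for many candidate bounding boxes as sketched below; the integral-image trick used to sum over $R(x_k)$ is a convenient implementation choice assumed here, not necessarily the one adopted in the thesis:

```python
import numpy as np

def make_h(change_map, K1=0.5, K2=0.2):
    """Build h(x_k, p) of eq. (3.27) from the posterior change map p.

    K1, K2 default to the change map analysis values of Sec. 3.6
    (K1_PA, K2_PA).
    """
    KC = K2 + (K1 - K2) * 0.5 ** 4                  # eq. (3.24)
    K3, K4 = K1 - KC, KC * (1.0 - K1)               # eq. (3.28)
    K5, K6 = K2 - KC, KC * (1.0 - K2)
    per_pixel = np.log((change_map * K3 + K4) / (change_map * K5 + K6))
    # Integral image of the per-pixel log-ratios for O(1) box sums.
    S = np.pad(per_pixel.cumsum(axis=0).cumsum(axis=1),
               ((1, 0), (1, 0)), mode="constant")

    def h(i_L, i_R, j_T, j_B):
        """Sum of the per-pixel terms over R(x_k), i.e. columns i_L..i_R and rows j_T..j_B."""
        return (S[j_B + 1, i_R + 1] - S[j_T, i_R + 1]
                - S[j_B + 1, i_L] + S[j_T, i_L])
    return h
```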
In the case of the Kalman Filter, the PDF of the state vector $(i_b, j_b, w, h)$ is Gaussian. In such a case, the variables $(i_L, j_T, i_R, j_B)$ are a linear combination of Gaussian random variables. Moreover, we are assuming the variables $(i_L, j_T, i_R, j_B)$ to be independent. Therefore, they are jointly Gaussian, and the mean $\mu$ and the covariance matrix $\Sigma$ of the state variables are fully defined by the four means $\mu_L$, $\mu_R$, $\mu_T$, $\mu_B$ and the four variances $\sigma^2_L$, $\sigma^2_R$, $\sigma^2_T$, $\sigma^2_B$ of $(i_L, j_T, i_R, j_B)$.
To estimate these eight parameters, let us substitute the expression of
the Gaussian PDF for p(xk ) in the left-hand side of (3.27), thus obtaining:
$$\delta - \ln\left(\sigma_L \sigma_R \sigma_T \sigma_B\right) - \frac{(i_L - \mu_L)^2}{2\sigma_L^2} - \frac{(i_R - \mu_R)^2}{2\sigma_R^2} - \frac{(j_T - \mu_T)^2}{2\sigma_T^2} - \frac{(j_B - \mu_B)^2}{2\sigma_B^2} = h(x_k, p) \qquad (3.29)$$
where δ = γ −2 ln(2π). The eight parameters of the PDF and the additive
constant δ might be estimated by imposing (3.29) for a number N > 9
of different bounding boxes and then solving numerically the obtained
over-determined system of N non-linear equations in 9 unknowns.
To avoid such a challenging problem, we propose an approximate procedure. First of all, an estimate $\hat{\mu}$ of the mean of the state vector $\mu = (\mu_L, \mu_R, \mu_T, \mu_B)$ can be obtained by observing that, due to the increasing monotonicity of the logarithm, the mode of the computed log-PDF coincides with the mode of the PDF, and that, due to the Gaussianity assumption, the mode of the PDF coincides with its mean. Hence, we obtain an estimate $\hat{\mu}$ of $\mu$ by searching for the bounding box maximizing $h(\cdot)$:
$$\hat{\mu} = \arg\max_{x} h(x, p) \qquad (3.30)$$
Then, we impose that (3.29) is satisfied at the estimated mean point $\hat{\mu}$ and that all the variances are equal, i.e. $\sigma_L^2 = \sigma_R^2 = \sigma_T^2 = \sigma_B^2 = \sigma^2$, thus obtaining a functional relationship between the two remaining parameters $\delta$ and $\sigma^2$:
$$\delta = 2 \ln \sigma^2 + h(\hat{\mu}, p) \qquad (3.31)$$
By substituting in (3.29) the above expression for $\delta$ and the estimated $\hat{\mu}$ for $\mu$, we can compute an estimate $\hat{\sigma}^2(x)$ of the variance $\sigma^2$ by imposing (3.29) for whatever bounding box $x \neq \hat{\mu}$. In particular, we obtain:
$$\hat{\sigma}^2(x) = \frac{1}{2}\, \frac{\left\| \hat{\mu} - x \right\|^2}{h(\hat{\mu}, p) - h(x, p)} \qquad (3.32)$$
To achieve a more robust estimate, we average $\hat{\sigma}^2(x)$ over a neighborhood of the estimated mean bounding box $\hat{\mu}$. Finally, to obtain the means and covariance of the measurements for the Kalman Filter, we exploit the property of linear combinations of Gaussian variables:

$$\mu = \begin{bmatrix} A^{-1} & 0 \\ 0 & A^{-1} \end{bmatrix} \hat{\mu}, \qquad \Sigma = \hat{\sigma}^2 \begin{bmatrix} A^{-1} & 0 \\ 0 & A^{-1} \end{bmatrix} \begin{bmatrix} A^{-1} & 0 \\ 0 & A^{-1} \end{bmatrix}^T \qquad (3.33)$$
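A sketch of the approximate procedure of eqs. (3.30)-(3.32), given the function h built above, might look as follows; the way candidate boxes and the neighborhood of the maximizer are generated is left to the caller, and all names are illustrative rather than the thesis implementation:

```python
import numpy as np

def estimate_measure(h, search_boxes, neighborhood):
    """Approximate estimation of the measurement PDF (eqs. 3.30-3.32).

    h: function returned by make_h; search_boxes: candidate (i_L, i_R, j_T, j_B)
    tuples, e.g. sampled around the Kalman prediction; neighborhood: boxes close
    to the maximizer, used to average the variance estimate.
    """
    scores = np.array([h(*b) for b in search_boxes])
    mu_hat = np.asarray(search_boxes[int(scores.argmax())], dtype=float)  # eq. (3.30)
    h_mu = scores.max()
    sigmas = []
    for b in neighborhood:
        diff = h_mu - h(*b)
        if diff > 0:
            # eq. (3.32): sigma^2(x) = ||mu_hat - x||^2 / (2 (h(mu_hat) - h(x)))
            sigmas.append(np.sum((mu_hat - np.asarray(b)) ** 2) / (2.0 * diff))
    sigma2_hat = float(np.mean(sigmas)) if sigmas else 1.0
    return mu_hat, sigma2_hat
```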
3.6 Experimental Results
We have tested the proposed Bayesian loop on publicly available datasets
with ground truth data, i.e. some videos from the CAVIAR1 and ISSIA Soccer datasets [22]. The former comprises videos from typical video-surveillance scenarios, whereas the latter deals with a football match.
We have used a Kalman Filter with constant velocity motion model
as RBE tracker and the algorithm introduced in Sec. 3.4 as Bayesian
change detection. The detection to initialize the tracker was done manually from the ground truth (although change detection holds the potential to solve the detection problem in the same conceptual framework,
an advantage over tracking systems based on other approaches such as
e.g. color histograms). We have selected videos with a single person or
where the tracked person was well separated from the others2.
In particular, the complete system has been used to track people wandering in a shopping mall using three sequences from the CAVIAR
dataset (referred to as CAVIAR1, CAVIAR2, CAVIAR3, respectively)
and two players during a match in the sixth sequence of the ISSIA dataset
(ISSIA GK and ISSIA P). Tracking results for these videos are available
at the companion website.
As for the CAVIAR dataset, the main difficulties are changes in appearance of the target due to light changes inside and outside the shop, shadows, camouflage, the small size of the target and, for sequence 2, dramatic changes of the target size on the image plane (the person walks into the shop until he barely disappears). The ISSIA Soccer dataset is less challenging as far as color, lighting and size variations are concerned, and the players cast practically no shadow. Yet, it provides longer sequences and
more dynamic targets. We used our system to track the goalkeeper and
a player: the goalkeeper allows us to test our system on a sequence 2500 frames long; the player shows rapid motion changes and unpredictable poses (he even falls to the ground kicking the ball in the middle of the sequence).

1 Data coming from the EC Funded CAVIAR project/IST 2001 37540, found at URL: http://homepages.inf.ed.ac.uk/rbf/CAVIAR/
2 How to combine our system with proper data association algorithms and how to take into account the multiple target scenario in the probabilistic analysis of the change map is an interesting subject for future work.
Our system does not require setting a threshold to classify the output of the change detection; only the model for $p(c_{ij} = C \mid x_k)$ must be set.
To account for the differences between the reasoning of the cognitive
feedback and the analysis of the change map, two different models must
be defined, i.e. two different pairs of values for K1 and K2 must be tuned.
We refer to them as $K_1^{CF}$, $K_2^{CF}$ and $K_1^{PA}$, $K_2^{PA}$, respectively. We coarsely
tuned these parameters on a sequence of the CAVIAR dataset not used
for testing. The best values turned out to be
$$K_1^{CF} = 0.5, \quad K_2^{CF} = 0.0, \quad K_1^{PA} = 0.5, \quad K_2^{PA} = 0.2. \qquad (3.34)$$
We expect these values to be generally applicable: we use them with
success also on the ISSIA videos. They basically state:
• that the model for both analyses must allow for unchanged pixels inside the bounding box ($K_1^{CF} = K_1^{PA} = 0.5$), due to the approximation inherent to the rectangular model in presence of non-rectangular and deformable targets;
• that a good prior for the change detection dictates the absence of changed pixels outside the bounding box ($K_2^{CF} = 0.0$);
• that, even with such a strong prior, we must allow for a small number of errors of the Bayesian change detection outside the bounding box and leave them out of the estimation we provide when analyzing the change map ($K_2^{PA} = 0.2$).
These considerations hold regardless of the sequence at hand, the illumination conditions and the characteristics of the target. Hence, we see our
system as a step toward easily deployable solutions for visual tracking.
We also coarsely tuned the values for the Kalman filter state covariance matrix using the same sequence. We use a constant velocity motion
model, thereby adding the velocity of the target along the i and j axes to
the state vector. The best values turned out to be:






$$F = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \qquad (3.35)$$
$$Q = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 10 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 10 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \qquad (3.36)$$
$$H = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} \qquad (3.37)$$
with the state vector $x_k$ given by
$$x_k = \begin{bmatrix} i_{b_k} & \dfrac{d i_{b_k}}{dk} & j_{b_k} & \dfrac{d j_{b_k}}{dk} & w_k & h_k \end{bmatrix}^T. \qquad (3.38)$$
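For illustration, one predict/update cycle of a Kalman filter using the matrices reported above can be sketched in Python as follows; z and R denote the measurement and its covariance produced by the change map analysis, and the function name is illustrative:

```python
import numpy as np

# Constant velocity transition, process noise and measurement matrices
# as reported in eqs. (3.35)-(3.37).
F = np.array([[1, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [0, 0, 1, 1, 0, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 1]], dtype=float)
Q = np.diag([1.0, 10.0, 1.0, 10.0, 1.0, 1.0])
H = np.array([[1, 0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 1]], dtype=float)

def kalman_step(x, P, z, R):
    """One predict/update cycle of the RBE tracker."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(6) - K @ H) @ P_pred
    return x_new, P_new
```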
To quantitatively evaluate the performance we use the mean Dice overlap $d_k$ over a sequence, introduced in the previous chapter (Sec. 2.4.1):
$$d_k = \frac{2 \left| x_k \cap x_k^{GT} \right|}{\left| x_k \right| + \left| x_k^{GT} \right|}. \qquad (3.39)$$
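For two axis-aligned bounding boxes, the Dice overlap of eq. (3.39) amounts to a few lines of Python; the (i_L, i_R, j_T, j_B) parameterization is assumed:

```python
def dice_overlap(box_a, box_b):
    """Dice overlap of eq. (3.39) for boxes given as (i_L, i_R, j_T, j_B)."""
    iw = min(box_a[1], box_b[1]) - max(box_a[0], box_b[0])
    ih = min(box_a[3], box_b[3]) - max(box_a[2], box_b[2])
    inter = max(iw, 0) * max(ih, 0)                      # |x_k intersect x_k^GT|
    area_a = (box_a[1] - box_a[0]) * (box_a[3] - box_a[2])
    area_b = (box_b[1] - box_b[0]) * (box_b[3] - box_b[2])
    return 2.0 * inter / (area_a + area_b)
```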
Quantitative evaluation is reported in Table 3.1. Our system, whose results are reported in the first column, successfully tracks all the targets.
Table 3.1: Performance scores. (∗) indicates loss of target.

Seq.       Full Loop   Constant R   Kalm+MS    FragTrack
CAVIAR 1   0.74        0.64         0.29(∗)    0.55
CAVIAR 2   0.66        0.66         0.01(∗)    0.01(∗)
CAVIAR 3   0.70        0.64         0.012(∗)   0.01(∗)
ISSIA GK   0.70        0.65         0.74       0.02(∗)
ISSIA P    0.61        0.56         0.64       0.02(∗)
The main source of misalignment between the bounding box and the ground truth in the CAVIAR dataset are shadows (first column of Fig. 3.5 and 3.6): because of the position of the artificial lights, the shadows cast on the floor fit with our rectangular model and the analysis of the change map tends to include them, elongating the bounding box (e.g. frames #368, #707 and #1046 of sequence CAVIAR 2, depicted in
Fig. 3.5). Although many proposals for shadow removal exist [77] and
could be used in a real deployment of our system, we present results
without such a post-processing step to better characterize our proposal and
show its robustness to disturbance factors.
On the ISSIA videos, too, our tracker was able to successfully track
both targets throughout the whole sequence, as shown in Fig. 3.7. The
main limitation of our algorithm in this case is due to the assumed rectangular model: in many frames, the players are running or performing
extreme movements and their limbs cover a wider area than when a person is e.g. walking. Hence, the actual changed area inside the ground
truth bounding box differs from a rectangular shape and the measures
of our system are always too conservative in size with respect to the
ground truth (e.g. frames # 656 and 768 of the player sequence in Fig.
3.7). Nevertheless, it is remarkable that our tracker is able to adapt to extreme situations, such as the player falling on the ground (second frame
in the same sequence). It is also important that it succeeded in tracking the goalkeeper: although this sequence is easier than that of the player, it is a long one, and it shows that the proposed loop does
not incur positive feedback and divergence.
To highlight the importance of the full Bayesian loop, we have performed the same experiments without considering the full PDF estimated
during the change map analysis, but just the mean and a constant measurement covariance matrix R equal to

 100 0
0
0

 0 100 0
0
R = 
 0
0 100 0

0
0
0 100




 .



(3.40)
Results for this configuration are reported in the second column of Tab.
3.1: our proposal performs consistently better throughout all the sequences (only for one sequence are the results identical). Going into more detail, the superior performance is due to the ability of our full loop to be closer to the ground truth bounding box even when the rectangular shape assumption is violated (e.g. compare frame #720 in the CAVIAR1 experiment reported in Fig. 3.4 and frame #487 in the CAVIAR3 experiment reported in Fig. 3.6, where the feet and the head lie outside the bounding box estimated by the partial loop). This is in turn due
to the dynamic estimation of the measurement covariance matrix: in all
the frames where the rectangular model is not adequate, the probabilistic
analysis of the change map is able to detect such mismatch by obtaining
a higher uncertainty on its bounding box estimation (that for such frames
tends to concentrate on the target trunk), and this allows the Kalman filter to trust the measure less and, hence, to be more accurate. The same
observation explains the difference in performance in the ISSIA dataset.
We also compare the performance of our tracker against two standard solutions for visual tracking: the Mean Shift tracker used in conjunction
with a Kalman Filter (KalmanMS) [17] and FragTrack [1]. They are
based, respectively, on the color histogram of the whole target (i.e. this
tracker ignores spatial distribution of the colors on the target) and on the
graylevel histogram of each cell of two grids superimposed on the target.
Results for these trackers are reported in the third and fourth column of
Tab. 3.1, respectively.
The first sequence we consider from the CAVIAR dataset is the easiest one in our tests. There are no scale changes, no motion law changes (the person walks with practically constant velocity from right to left), and moderate changes in appearance, due to the non-uniform light intensity in the corridor of the mall. Nevertheless, this sequence turns out to
be too difficult for the KalmanMS tracker and tough to handle for FragTrack. This is due to two factors: the moderate changes in appearance
of the target and the hypothesis of a rectangular target, assumed also by
these trackers. These two factors cause the KalmanMS tracker to provide poor tracking at the beginning of the sequence, since it is not able to adapt to the deformations of the target (i.e. to include in the bounding box the wide open legs in frame #736 of Fig. 3.4), the trunk alone fitting better with the initial model. They then cause it to drift to the background and lose the target: due to the appearance change of the target, the best matching parts of the initial histogram are those of the background, which were included in the initial model, even though it was initialized from the ground truth, because of the approximate rectangular model. FragTrack performs definitely better, although it is less precise in the estimation of the bounding box than our system, e.g. it cuts the feet and the head of the target in the third and fourth frames of the sequence reported in Fig. 3.4. Similarly to KalmanMS, though, it cannot handle appearance changes: at the end of the sequence it loses the target (last two frames in Fig. 3.4) by considering the background more similar to the initial appearance of the target.
The other two CAVIAR sequences are too difficult for a tracker based on color or graylevel histograms. Both the KalmanMS tracker and FragTrack lose the target at the beginning of the sequence. The most likely cause for this is that they are also very sensitive to the initialization conditions: in contrast with the previous sequence, where in the first frame it was possible to reasonably approximate the target with a rectangular bounding box, this is not possible in the first frames of these two
sequences (compare the first row of Fig. 3.4 with those of Fig. 3.5 and
3.6). Because of this, a lot of background is included in the initial model,
and this makes the tracker stick to the initial position and lose the target. Such sensitivity is less important for bigger targets. Therefore, we
can conclude that our solution, which is unaffected by this initialization problem, is more suitable than the considered alternatives for visual
surveillance scenarios, where targets are usually small and untextured.
On the ISSIA sequences, KalmanMS obtains slightly better performances than our proposal. Of course, color is an important cue to successfully track the players in such scenes. This is strengthened by the fact that, for the particular colors in these scenes, the compression to gray levels is particularly lossy: for example, yellow parts of the tracked players become really similar to the green background. This is confirmed by the poor performances of FragTrack, which uses graylevel images like our system. Despite this, the difference in performance between our solution and KalmanMS is encouraging, given the gap in the quality of the analyzed cues. We expect a considerable gain in performance by deploying
color-based Bayesian change detection. This represents an interesting
future direction of research to continue and extend this work.
Figure 3.4: Samples equally spaced along the time axis (frames #688 to #832) from the CAVIAR1 experiment (sequence "OneStopEnter2front" from the CAVIAR dataset). From left to right column: our method (full loop); our method with constant measurement covariance matrix (constant R); KalmanMS; FragTrack.
Figure 3.5: Samples equally spaced along the time axis (frames #0255 to #1272) from the CAVIAR2 experiment (sequence "OneStopMoveEnter2front" from the CAVIAR dataset). From left to right column: our method (full loop); our method with constant measurement covariance matrix (constant R); KalmanMS; FragTrack.
Figure 3.6: Samples equally spaced along the time axis (frames #280 to #901) from the CAVIAR3 experiment (sequence "OneStopMoveNoEnter1front" from the CAVIAR dataset). From left to right column: our method (full loop); our method with constant measurement covariance matrix (constant R); KalmanMS; FragTrack.
Figure 3.7: Exemplar frames equally spaced along the time axis from the ISSIA Soccer dataset: left column, the goalkeeper tracking experiment (ISSIA GK, frames #0420 to #2996); right column, the player tracking experiment (ISSIA P, frames #432 to #880).
Chapter 4
3D Surface Matching and
Object Categorization
Automatic recognition of shapes in 3D data, also referred to as shape
matching, is attracting a growing interest in the research community,
with applications found in areas such as shape retrieval, shape registration, object recognition, manipulation and grasping, robot localization
and navigation. An important enabling factor for the development of
this technology is represented by the increasing availability of cheaper
and more effective 3D sensors. Many of these sensors are able to acquire not only the 3D shape of the scene, but also its texture: this is the
case, e.g. of stereo sensors, structure-from-motion systems, certain laser
scanners as well as the recently proposed Kinect device by Microsoft.
Surface matching can be tackled by either a global or a local approach. According to the former, a surface is described entirely by
means of global features, whereas the latter relies on local keypoints
and regional feature descriptions to determine point-to-point correspondences between surfaces. Borrowing a denomination typical of the face
recognition community [110] we refer here to these two approaches as,
respectively, holistic and feature-based. While the holistic approach is
popular in the context of 3D object retrieval [39, 71, 87], feature-based
methods are inherently more effective for 3D object recognition in presence of cluttered backgrounds and occlusions.
Figure 4.1: Example of matching local descriptors in a 3D object recognition scenario. Green lines identify correct matches, whereas red ones
represent wrong correspondences.
Feature-based methods rely on 3D keypoints that are extracted from
a 3D surface. This task is accomplished by 3D detectors, whose aim is to
determine points which are distinctive, to allow for effective description
and matching, and repeatable with respect to point-of-view variations
and noise [12, 60, 111]. Sometimes, a characteristic scale is also associated to each keypoint, so as to provide a local neighborhood to the
following description stage [2, 60, 66, 98, 106]. Then, a description of
the local neighborhood of each keypoint is computed by means of a 3D
descriptor [12, 14, 27, 41, 60, 66, 106, 111] in order to obtain a compact local representation of the input data invariant up to a predefined
level of transformation (rotation, scaling, affine warp, . . . ). Descriptors
are finally matched across different views to attain point-to-point correspondences (e.g. as in Fig. 4.1). This approach has become the standard
paradigm in case of 2D data [6, 10, 43, 54, 56, 61, 62] for tackling classical computer vision problems such as object recognition, automatic
registration, image indexing, etc...
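As a generic illustration of the matching stage (not the specific strategy of any of the cited methods), local descriptors extracted from a scene can be put in correspondence with those of a model by nearest-neighbor search, for instance with a distance-ratio check to reject ambiguous correspondences; all names and the ratio threshold below are assumptions:

```python
import numpy as np

def match_descriptors(desc_scene, desc_model, max_ratio=0.8):
    """Nearest-neighbor matching with a distance-ratio check.

    desc_scene, desc_model: (N, D) and (M, D) arrays of local descriptors,
    with M >= 2. Returns a list of (scene_index, model_index) pairs.
    """
    matches = []
    for i, d in enumerate(desc_scene):
        dists = np.linalg.norm(desc_model - d, axis=1)
        j, j2 = np.argsort(dists)[:2]
        if dists[j] < max_ratio * dists[j2]:   # reject ambiguous correspondences
            matches.append((i, int(j)))
    return matches
```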
Object categorization is among the most stimulating, yet challenging, computer vision tasks. It consists of automatically assigning a category to a particular object given its representation (an image, a point
cloud, ..) and a predefined taxonomy. This is different from object recognition, which consists of recognizing a particular instance of a particular
class (i.e. an object recognition algorithm is trained to recognize a specific car whereas an object category recognition algorithm is trained to
recognize all cars as members of the same class) and more challenging.
We develop a novel object category recognition algorithm by solving
the surface matching problem based on local features. The main contributions are as follows:
• a novel comprehensive proposal for surface representation, dubbed
SHOT, which encompasses a new unique and repeatable local reference frame as well as a new 3D descriptor;
• the modification of this proposal to exploit texture, provided by
the output of modern 3D sensors;
• the extension of the Implicit Shape Model [50] approach to the categorization of 3D data described by means of the SHOT method.
4.1 SHOT descriptor
This section deals with our proposal for local 3D description. First,
we categorize existing methods into two classes: Signatures and Histograms. Then, by discussion and experiments alike, we point out the
key issues of uniqueness and repeatability of the local reference frame.
Based on these observations, we formulate a novel comprehensive proposal for surface representation, which encompasses a new unique and
repeatable local reference frame as well as a new 3D descriptor. The
latter lies at the intersection between Signatures and Histograms, so as
to possibly achieve a better balance between descriptiveness and robustness. Experiments on publicly available datasets as well as on range
scans obtained with Spacetime Stereo provide a thorough validation of
our proposal, which is shown to clearly outperform three well-known state-of-the-art methods.
4.1.1 Analysis of Previous Work
In Table 4.1 we propose a categorization of the main proposals in the
field. As shown in the second column, we divide proposals for 3D descriptors into two main categories, namely Signature and Histogram.
The first category, that includes earliest works on the subject, describes
the 3D surface neighborhood of a given point (hereinafter support) by
defining an invariant local Reference Frame (RF) and encoding, according to the local coordinates, one or more geometric measurements computed individually on each point of a subset of the support. On the other
hand, Histogram-based methods describe the support by accumulating
local geometrical or topological measurements (e.g. point counts, mesh
triangle areas) into histograms according to a specific quantized domain
(e.g. point coordinates, curvatures) which requires the definition of either a Reference Axis (RA) or a local RF. In broad terms, signatures
are potentially highly descriptive thanks to the use of spatially well localized information, whereas histograms trade-off descriptive power for
robustness by compressing geometric structure into bins.
As far as Signature-based methods are concerned, one of the first
proposals is Structural Indexing [91], which builds up a representation
based on either a 3D curve or a Splash depending on the characteristics
of the 3D support. The former encodes the angles between consecutive segments of the polygonal approximation of edges (corresponding
to depth or orientation discontinuities) on the surface. The latter encodes as a 3D curve the local distribution of surface orientations along a
geodesic circle centered on the point. In Point Signatures [14] the signature is given by the signed height of the 3D curve obtained by intersecting a sphere centered in the point with the surface. 3D Point Fingerprint
[92] encodes the normal angle variations and the contour radius variations along different geodesic circles projected on the tangent plane.
Recently, Exponential Mapping [66] proposed a descriptor that encodes
the components of the normals within the support by deploying a 2D
parametrization of the local surface.
Table 4.1: Taxonomy of 3D descriptors.

Method           Category    Local RF: Unique   Local RF: Unambiguous
StInd [91]       Signature   No                 Yes
PS [14]          Signature   No                 Yes
3DPF [92]        Signature   No                 Yes
EM [66]          Signature   Yes                No
SI [41]          Histogram   RA                 RA
LSP [12]         Histogram   RA                 RA
3DSC [27]        Histogram   No                 Yes
ISS [111]        Histogram   Yes                No
Tensor [59]      Histogram   No                 Yes
MeshHoG [106]    Both        Yes                Yes
SHOT             Both        Yes                Yes
As for Histogram-based methods, those relying on the definition of
just a RA are typically based on the feature point normal. For example, Spin Images [41], arguably the most popular method for 3D mesh
description, computes 2D histograms of points falling within a cylindrical volume by means of a plane that "spins" around the normal. Within
the same subclass, Local Surface Patches [12] computes histograms of
normals and shape indexes [44] of the points belonging to the support.
As for methods relying on the definition of a full local RF, 3D Shape
Context [27] modifies the basic idea of Spin Images by accumulating 3D
histograms of points within a sphere centered at the feature point. Intrinsic Shape Signatures [111] proposed an improvement of [27] based
on a different partitioning of the 3D local volume as well as on a different definition of the local RF. Finally, Mian et al. [59] accumulate 3D
histograms (Tensors) of mesh triangle areas within a cubic support.
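As an illustration of the Histogram category, the following sketch accumulates a Spin Image-like 2-D histogram [41] of the cylindrical coordinates of the support points with respect to the Reference Axis given by the keypoint normal; bin layout and normalization are assumptions here, not the original settings:

```python
import numpy as np

def spin_image(p, n, neighbors, radius, bins=16):
    """Spin Image-like histogram for a keypoint p with unit normal n.

    neighbors: (N, 3) array of support points within `radius` of p.
    alpha is the radial distance from the normal axis, beta the signed
    height along the normal.
    """
    v = neighbors - p
    beta = v @ n                                           # height along the normal
    alpha = np.sqrt(np.maximum(np.sum(v * v, axis=1) - beta ** 2, 0.0))
    hist, _, _ = np.histogram2d(alpha, beta, bins=bins,
                                range=[[0, radius], [-radius, radius]])
    return hist / max(hist.sum(), 1.0)                     # normalized 2-D histogram
```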
Two observations stem from the taxonomy proposed in Tab. 4.1.
First, all proposals rely on the definition of a local RF or, at least, a
repeatable RA. However, we believe that the importance of the choice
of the local reference for a 3D descriptor is underrated in literature, with
efforts mainly focused on the development of discriminative descriptors.
As a consequence, approaches for the choice of the local reference are
ambiguous, or not unique, or too sensitive to noise and also lack specific
experimental validation. Instead, as we will show in the remainder of
the chapter, the repeatability of the local RF (or, analogously, of the RA)
is mandatory to achieve effective local surface description.
Therefore, one of the contributions of our work is a specific study
upon local RFs. We carry out an analysis of repeatability and robustness on proposed local RFs, and provide experiments that demonstrate
the strong impact of the choice of the RF on the performance of a 3D
descriptor (Sec. 4.1.2). Given the impact of such a choice, we introduce
a robust local RF that, unlike all other proposals, is unique and unambiguous (Sec. 4.1.3).
Secondly, based on the nature of existing approaches highlighted by
the proposed categorization, it is our belief that an effective and robust
solution to the problem of 3D shape description can be found as a proper
combination of Signatures and Histograms. Hence, we propose a novel
3D descriptor aware of the proposed categorization (Sec. 4.1.4). Its
design, inspired by the analysis of the successful choices performed in
the related field of 2D descriptors [54], has been explicitly conceived to
achieve computational efficiency, descriptive power and robustness. Recently, MeshHoG [106], another approach for 3D data description that can be seen as an attempt to combine the benefits of Signatures and Histograms, was proposed. We will show in the experimental results that our proposal consistently outperforms it.
4.1.2 On the traits and importance of the local RF
The definition of a local RF, invariant to translations and rotations and
robust to noise and clutter, has been the preferred option to endow a 3D
descriptor with invariance to the same sources of variations, similarly to
the way rotation and/or scale invariance is injected into 2D descriptors.
On the other hand, the definition of such an invariant frame is challenging. Furthermore, although almost every new proposal for local shape
description is equipped with its own local RF, experimental validation
has always been focused on the results obtained by the joint use of an
RF and a descriptor, whilst the impact of the selected local RF on the
descriptor performance has not been investigated in literature.
Figure 4.2: Impact of the local RF on a descriptor performance: Recall vs. 1-Precision curves for EM, EM + MeshHoG RF, EM + Sign Disamb. and EM + Prop. RF. The optimal point is located at the top left side of the chart.
In Table 4.1 we have reported for each proposal the properties of
uniqueness and unambiguity of their local RF. As highlighted in the
third column, the majority of proposals are based on RFs that are not
unique [91] [14] [92] [27] [59], i.e. to obtain an invariant description
they require multiple descriptors to be computed at each feature point.
This is usually handled by describing a "model" point using multiple descriptors, each based on a different local RF, and a "scene" point with
just one of them. This approach causes additional ambiguity to the correspondence problem since it shifts the intrinsic non-uniqueness of the
local RF to the matching stage, thus increasing potential mismatches,
computational requirements and sometimes also memory footprint. Another disadvantage brought in by the use of multiple local RFs is that the
proposed matching stage is so tailored on the descriptor that it prevents
the use of off-the-shelf efficient solutions for matching and indexing,
that in principle could be advantageously performed orthogonally with
respect to the description. This may result in a severe loss of computational efficiency.
In addition to multiple RFs, another limit of current proposals consists in the intrinsic ambiguity of the sign of the local RF axes. For example, in [66] and [111], normals and principal curvature directions are
used. The main problem with this choice is that principal directions are
not vectors, i.e. their sign is not defined. From a practical point of view,
principal directions are computed using Singular Value Decomposition
(SVD) or Eigenvalue Decomposition (EVD) of the covariance matrix of
the point coordinates within the support (from personal communication with the authors of [66], and as reported in [111]). Of course, the output of the
algorithm is a vector with a sign. Nevertheless, this sign is simply a numerical accident and, thus, is not repeatable on different (e.g. rotated)
instances of the same mesh, even though the same SVD/EVD algorithm
is used, as clearly discussed in [9]. Therefore, such an approach to the
definition of the local RF is inherently ambiguous and thus not repeatable. [111] resorts to multiple RFs to overcome this limitation, while
[66] does not deal with it explicitly.
To highlight the impact of the local RF on a descriptor performance,
we show in Fig. 4.2 the performance of the EM descriptor [66] with
different local RFs. Results are reported as Recall vs 1-Precision curves
(see Sec. 4.1.5 for a discussion about this choice and for the settings used
in all our experiments). The ambiguous RF used in [66] leads to unsatisfactory performance (black curve). Using exactly the same settings
and exactly the same descriptor, we can boost performances simply by
deploying the Sign Disambiguation technique recently proposed in [9]
(green curve). Furthermore, using the more robust and more repeatable
local RF that we propose in the next section we can obtain another significant improvement (e.g. at recall 0.7, precision rises from 0.308 to 0.994)
without changing the descriptive power of the descriptor (blue curve). It
is also worth pointing out here that our local RF does not match perfectly the EM descriptor, for none of its axes provides an approximation
of the local normal that is instead assumed by the theory underneath the
EM descriptor. Nevertheless, performances with our local RF are better
than those obtained with the original proposal, showing the overwhelming importance of a robust, repeatable local RF. The importance of a
robust RF is confirmed by the use of the EM descriptors with the only
other unique and unambiguous local RF, part of the MeshHoG algorithm
[106]. Such local RF is based on curvatures, which are highly sensitive
to noise. This results in a poorly repeatable RF, which negatively influences the descriptor performance (red curve).
4.1.3 Disambiguated EVD for a repeatable RF
As shown by Table 4.1, none of current local RF proposals but that of
MeshHoG is at the same time unique and unambiguous. The local RF
defined by the MeshHoG descriptor is highly sensitive to noise, as shown
in the previous section. Hence, there is a lack of a robust, unique and unambiguous RF. To fill this gap we have designed and extensively tested
a variety of novel unique and unambiguous local RFs. We present here
the method that turned out to be the most robust in our thorough experimental evaluation. It builds on a well known technique presented in [35]
and [63], where the problem of normal estimation in presence of noise
is specifically addressed. A Total Least Squares (TLS) estimation of the
normal direction is obtained in [35] and [63] by EVD of the covariance
matrix M of the k−nearest neighbors pi of the point, defined by
M = \frac{1}{k} \sum_{i=0}^{k} (p_i - \hat{p})(p_i - \hat{p})^T , \qquad \hat{p} = \frac{1}{k} \sum_{i=0}^{k} p_i \qquad (4.1)
In particular, the TLS estimation of the normal direction is given by
the eigenvector corresponding to the smallest eigenvalue of M. Finally,
they perform the sign disambiguation of the normals globally by means
of sign consistency, i.e. propagating the sign from a seed chosen heuristically.
While this has proven to be a robust and effective technique for surface reconstruction of a single object, it cannot work for local surface description since in the latter case signs must be repeatable across any possible object pose as well as in scenes with multiple objects, so that a local
rather than global sign disambiguation method is mandatory. Moreover,
Hoppe’s sign disambiguation concerns the normal only, hence it leaves
ambiguous the signs of the remaining two axes.
In our proposal, we start by modifying (4.1) so as to assign distant
points smaller weights, in order to increase repeatability in presence of
clutter. Then, to improve robustness, all points lying within the spherical support (of radius R) which are used to compute the descriptor are
used also to calculate M. For the sake of efficiency, we also neglect the
centroid computation, replacing it with the feature point p. Therefore,
we compute M as a weighted linear combination,
M = \frac{1}{\sum_{i: d_i \le R} (R - d_i)} \sum_{i: d_i \le R} (R - d_i)(p_i - p)(p_i - p)^T \qquad (4.2)
where $d_i = \| p_i - p \|_2$. Our experimental evaluation indicates that the
eigenvectors of M define repeatable, orthogonal directions in presence
of noise and clutter. It is worth pointing out that, compared to [35] and
[63], in our proposal the third eigenvector no longer represents the TLS
estimation of the normal direction and sometimes it notably differs from
it. However, this does not affect performance, since in the case of local
surface description what matters is a highly repeatable and robust triplet
of orthogonal directions, and not its geometrical or topological meaning.
Hence, eigenvectors of (4.2) represent a good starting point, but they
need to be disambiguated to yield a repeatable local RF. The problem of
sign disambiguation for EVD and SVD has been recently addressed in
[9]. Their proposal basically reorients the sign of each singular or eigenvector so that its sign is coherent with the majority of the vectors it is
representing. We determine the sign on the local x and z axes according
to this principle. In the following we refer to the three eigenvectors in
decreasing eigenvalue order as the $x^+$, $y^+$ and $z^+$ axis, respectively. With
$x^-$, $y^-$ and $z^-$ we denote instead the opposite vectors. Hence, the final
disambiguated x axis is defined as
S_x^+ = \{ i : d_i \le R \wedge (p_i - p) \cdot x^+ \ge 0 \} \qquad (4.3)
S_x^- = \{ i : d_i \le R \wedge (p_i - p) \cdot x^- > 0 \} \qquad (4.4)
x = \begin{cases} x^+ , & |S_x^+| \ge |S_x^-| \\ x^- , & \text{otherwise} \end{cases} \qquad (4.5)
The same procedure is used to disambiguate the z axis. Finally, the y
axis is obtained as z × x.
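To make the construction concrete, the following is a minimal NumPy sketch of Eqs. (4.2)-(4.5); function and variable names are our own and the snippet is purely illustrative, not the implementation used in the experiments.

```python
# Illustrative sketch of the proposed local RF (Eqs. 4.2-4.5), not the thesis code.
import numpy as np

def local_rf(p, neighbors, R):
    """p: (3,) feature point; neighbors: (N,3) support points with ||p_i - p|| <= R."""
    diffs = neighbors - p
    d = np.linalg.norm(diffs, axis=1)
    w = R - d                                        # distance-based weights, Eq. (4.2)
    M = (w[:, None, None] * np.einsum('ni,nj->nij', diffs, diffs)).sum(0) / w.sum()
    evals, evecs = np.linalg.eigh(M)                 # eigenvalues in ascending order
    x_axis, z_axis = evecs[:, 2], evecs[:, 0]        # largest / smallest eigenvalue
    # Sign disambiguation, Eqs. (4.3)-(4.5): orient each axis towards the
    # majority of the points in the support.
    if np.sum(diffs @ x_axis >= 0) < np.sum(diffs @ x_axis < 0):
        x_axis = -x_axis
    if np.sum(diffs @ z_axis >= 0) < np.sum(diffs @ z_axis < 0):
        z_axis = -z_axis
    y_axis = np.cross(z_axis, x_axis)                # y = z x x
    return x_axis, y_axis, z_axis
```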
We compare the repeatability of our proposal against three representative RFs: that of MeshHoG, that of PS and that of EM, respectively a
not-robust solution, a not-unique solution and an ambiguous one. To prevent the shortcomings of non-uniqueness and ambiguity from invalidating the comparison we consider only the global maximum of the height
[14] for PS and we add the sign disambiguation of [9] to EM (EM+SD),
thereby obtaining two unique and unambiguous RFs. We also consider
the original EM approach to show the effectiveness of sign disambiguation. Using again the settings detailed in Sec. 4.1.5, in Fig. 4.3 we plot,
for 5 increasing noise levels, the mean cosine between corresponding
axes of the local RFs computed on two instances of the same mesh, i.e.
the original one and a rotated and noisy instance. On one hand, ambiguity is clearly the most serious nuisance, as the low performances of the
original EM proposal demonstrate. On the other hand, the use of a higher
number of points to compute the local RF ( i.e. the whole surface contained in the spherical support, as done by EM, instead of the 3D curve
resulting by the intersection of the spherical support with the surface, as
done by PS) yields better robustness, as shown by the relative drop of
EM with respect to PS when noise increases. Nevertheless, the steepest
drop of performance is provided by MeshHoG, which confirms the need
to ground local RF computation on more robust features than second
order differential entities like curvatures. The disambiguation introduced
in EM+SD dramatically enhances repeatability. However, both EM and
EM+SD subordinate computation of the directions on the tangent plane
to the normal estimation (i.e., the repeatable directions they compute
are then projected onto the tangent plane to create an orthogonal basis).
This choice sums noise on the normal to the noise inevitably affecting
the other directions, thereby leading to increased sensitivity of the estimation of the axes on the tangent plane and finally to poor repeatability.
Our proposal, instead, estimates all axes simultaneously and turns out to
be the most effective, thanks to the combination of its noise- and clutter-aware definition, the effectiveness of the proposed disambiguation and
the inherent uniqueness deriving from its theoretical formulation.
[Chart: mean cosine between corresponding axes vs. noise level for Prop., EM + Sign Disamb., PS, MeshHoG and EM.]
Figure 4.3: Comparison between local RFs.
4.1.4 Description by Signatures of Histograms
In Sec. 4.1.1 we have classified 3D descriptors as based on either histograms or signatures. We have designed our proposal following this intuition and aiming at a local representation that is efficient, descriptive,
robust to noise and clutter as well as to point density variation. The point
density issue is specific to the 3D scenario, where the same 3D volume
of the real world may be represented with different numbers of vertices
in its mesh approximation, e.g. due to the use of different 3D sensors
(stereo, Time-of-Flight cameras, LIDARs, etc.) or different acquisition
distances.
Figure 4.4: Signature structure for SHOT.

Besides our taxonomy, another source of inspiration has been the related field of 2D feature descriptors, which has reached a remarkable maturity during the last years. By analyzing SIFT [54], arguably the most successful and widespread proposal among 2D descriptors, we have singled out what we believe are among the major reasons behind its effectiveness. First of all, the use of histograms is spread throughout the algorithm, from the definition of the local orientation to the descriptor itself, this accounting for its robustness.
The low descriptive power of a global histogram computed on the whole
patch is balanced by the introduction of coarse geometric information:
the descriptor is, in fact, a concatenation of histograms, each computed
on a precise location in a regular grid superimposed on the patch. The
use of this coarse geometric information creates what we identify as a
signature-like structure.
Moreover, the elements of these local histograms are based on first
order derivatives describing the signal of interest, i.e. intensity gradients.
Although it has been argued that building a descriptor based on differential entities may result in poor robustness to noise [14], they hold high
descriptive power, as the effectiveness of SIFT clearly demonstrates.
Therefore, we believe they can provide a more effective solution for a
descriptor than point coordinates [41] [27]. Yet, to achieve robustness to
noise, differential entities have to be filtered, and not deployed directly,
e.g. as done in [66].
Finally, an important part of the SIFT algorithm deals with the definition of a local invariant 2D reference frame (i.e. the characteristic
orientation). The author states that in case of ambiguity in determining
the local RF, a great benefit to the stability of matches is provided by the
use of multiple orientations. This highlights the importance of a unique,
unambiguous local RF for the effectiveness of a descriptor.
Based on these considerations, we propose a 3D descriptor that encodes histograms of basic first-order differential entities (i.e. the normals
of the points within the support), which are more representative of the
local structure of the surface compared to plain 3D coordinates. The use
of histograms brings in the filtering effect required to achieve robustness
to noise. Having defined a unique and robust 3D local RF (see Sec.
4.1.3), it is possible to enhance the discriminative power of the descriptor by introducing geometric information concerning the location of the
points within the support, thereby mimicking a signature. This is done
by first computing a set of local histograms over the 3D volumes defined
by a 3D grid superimposed on the support and then grouping together
all local histograms to form the actual descriptor. Hence, our descriptor
lies at the intersection between Histograms and Signatures: we dub it
Signature of Histograms of OrienTations (SHOT).
For each of the local histograms, we accumulate point counts into
bins according to a function of the angle, $\theta_i$, between the normal at each
point within the corresponding part of the grid, $n_{v_i}$, and the normal at the
feature point, $n_u$. This function is $\cos\theta_i$, the reason being twofold: it can
be computed fast, since $\cos\theta_i = n_u \cdot n_{v_i}$; an equally spaced binning on
$\cos\theta_i$ is equivalent to a spatially varying binning on $\theta_i$, whereby a coarser
binning is created for directions close to the reference normal direction
and a finer one for orthogonal directions. In this way, small differences
in orthogonal directions to the normal, i.e. presumably the most informative ones, cause a point to be accumulated in different bins leading to
different histograms. Moreover, in presence of quasi-planar regions (i.e.
not very descriptive ones) this choice limits histogram differences due to
noise by concentrating counts in a smaller number of bins.
As for the structure of the signature, we use an isotropic spherical
grid that encompasses partitions along the radial, azimuth and elevation
axes, as sketched in Fig. 4.4. Since each volume of the grid encodes
a very descriptive entity represented by the local histogram, we can use
a coarse partitioning of the spatial grid and hence a small cardinality
of the descriptor. In particular, our experimentations indicate that 32
is a proper number of spatial bins, resulting from 8 azimuth divisions,
2 elevation divisions and 2 radial divisions (though, for clarity, only 4
azimuth divisions are shown in Fig. 4.4). Combined with the fact that the
tuning we present in sec. 4.1.5 indicates a proper number of bins for the
internal histograms to be around 10, we obtain a total descriptor length
of 320, a good improvement over the 1980 proposed for 3DSC [27] or
the 595 for ISS [111], that allows for faster indexing and matching.
Since our descriptor is based upon local histograms, it is important
to avoid boundary effects, as pointed out e.g. in [41] [54]. Furthermore,
due to the spatial subdivision of the support, boundary effects might arise
also in presence of perturbations of the local RF. Therefore, for each
point being accumulated into a specific local histogram bin, we perform
quadrilinear interpolation with its neighbors, i.e. the neighboring bins in
the local histogram and the bins having the same index in the local histograms corresponding to the neighboring volumes of the grid. In particular, each count is multiplied by a weight of 1 − d for each dimension.
As for the local histogram, d is the distance of the current entry from the
central value of the bin. As for elevation and azimuth, d is the angular
distance of the entry from the central value of the volume. Along the radial dimension, d is the Euclidean distance of the entry from the central
value of the volume. Along each dimension, d is measured in units of the
histogram or grid spacing, i.e. it is normalized by the distance between
two neighbor bins or volumes.
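The accumulation scheme can be summarized with the following simplified sketch (our own illustration in Python/NumPy): it bins the cosine of the normal angles into the histogram of the spatial cell each point falls into, interpolates linearly along the cosine dimension only (the full descriptor also interpolates over azimuth, elevation and radius), assumes unit-length normals, and applies the normalization to unit sum described in the next paragraph. Names and parameter values are placeholders.

```python
# Simplified SHOT-like accumulation: 32 spatial cells (8 azimuth x 2 elevation
# x 2 radial) times 10 cosine bins = 320 values. Interpolation is sketched
# along the cosine dimension only.
import numpy as np

N_BINS, N_AZI, N_ELE, N_RAD = 10, 8, 2, 2

def shot_like(p, n_u, pts, normals, R, rf):
    """p, n_u: feature point and its (unit) normal; pts, normals: support points
    and their unit normals in world coordinates; rf: (3,3) local RF rows (x,y,z)."""
    desc = np.zeros((N_AZI, N_ELE, N_RAD, N_BINS))
    local = (pts - p) @ rf.T                         # support expressed in the local RF
    r = np.linalg.norm(local, axis=1)
    azi = np.arctan2(local[:, 1], local[:, 0])       # [-pi, pi]
    for i in range(len(pts)):
        if r[i] == 0 or r[i] > R:
            continue
        a = int((azi[i] + np.pi) / (2 * np.pi) * N_AZI) % N_AZI
        e = 0 if local[i, 2] < 0 else 1              # 2 elevation shells
        s = 0 if r[i] < R / 2 else 1                 # 2 radial shells
        c = (np.dot(n_u, normals[i]) + 1) / 2 * N_BINS - 0.5   # cos(theta) bin coord
        b0 = int(np.floor(c))
        frac = c - b0
        if 0 <= b0 < N_BINS:
            desc[a, e, s, b0] += 1 - frac            # weight 1 - d on the current bin
        if 0 <= b0 + 1 < N_BINS:
            desc[a, e, s, b0 + 1] += frac            # and d on the neighboring bin
    desc = desc.ravel()
    return desc / desc.sum() if desc.sum() > 0 else desc       # normalize to sum 1
```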
To achieve robustness to variations of the point density, we normalize the whole descriptor to sum up to 1. This is preferable to the solution
proposed in [27], i.e. normalizing each bin with the inverse of the point
density and bin volume. In fact, while [27] implicitly assumes that the
sampling density may vary independently in every bin, and thus discards as not informative the differences in point density among bins, we
assume global (or at least regional) variations of the density and keep the
local differences as a source of discriminative information.

Figure 4.5: Exp. 1: Precision-Recall curves on Stanford dataset and a scene at the 3 noise levels.

Figure 4.6: Exp. 2: Precision-Recall curves on subsampled dataset and a detail from one scene.
Figure 4.7: Exp. 3: Results on Spacetime Stereo dataset and two models
(middle) and scenes (right).
         Time (s)   Radius (mr)   Length
SHOT        4.8          15          320
SI          5.6          30          100
EM         52.6          10         2700
PS        248.8          10           90
Figure 4.8: Charts: ms/correspondence vs. support radius (in the smaller
chart the time axis is zoomed in for better comparison between SI and
SHOT). Table: measured execution times (in Experiment 1) and tuned
parameter values. Radius values are reported in mesh resolution units.
As for SI, the support radius is the product of the bin size by the number
of bins in each side of the spin image.
4.1.5 Experimental results
Surface Matching
In this section we provide experimental validation of our proposals, i.e.
the unique local RF together with the SHOT descriptor. To this purpose, we carry out a quantitative comparison against three state-of-the-art approaches in a typical surface matching scenario, where correspondences have to be established between a set of features extracted from
a scene and those extracted from a number of models. The considered approaches are: Spin Images (SI), as representative of Histogram-based methods due to its vast popularity in the addressed scenario; Exponential Mapping (EM) and Point Signatures (PS) as representatives
of Signature-based methods, the former since it is a very recent approach, the latter given its importance in literature. All methods were
implemented in C++ and are made publicly available together with the
datasets ( www.vision.deis.unibo.it/SHOT ).
For a fair comparison, we use the same feature detector for all algorithms: in particular, we randomly extract a set of feature points from
each model, then we extract their corresponding points from the scene,
so that performance of the descriptors is not affected by errors of the detector. Analogously, for what concerns the matching stage, we adopt the
same matching measure for all algorithms, i.e., as proposed in [41], the
Euclidean distance. We could also have evaluated the synergistic effect
of description and matching for those methods that explicitly include a
proposal for the latter, e.g. the tolerance band for PS. In turn, we did
experiments on the whole dataset with the original EM and PS matching schemes, obtaining slightly worse performance for both. This, and
the attempt to be as fair as possible, led us to use the same matching
measure for all algorithms. However, we did not discard the characteristics of the descriptors that required a specific treatment during matching:
in particular, since EM is a sparse descriptor, we compute the Euclidean
distance only on the overlapping subset of EM descriptor pairs, as proposed by the authors; as for PS, we use the matching scheme proposed
by the authors to disambiguate its not-unique local RF [14]. For each
scene and model, we match each scene feature against all model features
and we compute the ratio between the nearest neighbor and the second
best (as in [54]): if the ratio is below a threshold a correspondence is
established between the scene feature and its closest model feature.
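A brute-force sketch of this matching stage is reported below; it is only an illustration, and the numerical value of the ratio threshold is a placeholder, since its actual value is not discussed here.

```python
# Nearest-neighbour matching with the ratio test of [54], brute force.
import numpy as np

def match_features(scene_desc, model_desc, ratio=0.8):
    """scene_desc, model_desc: (Ns,D) and (Nm,D) descriptor arrays.
    Returns (scene_idx, model_idx) pairs passing the NN ratio test."""
    matches = []
    for i, d in enumerate(scene_desc):
        dists = np.linalg.norm(model_desc - d, axis=1)   # Euclidean distances
        nn, second = np.argsort(dists)[:2]
        if dists[nn] < ratio * dists[second]:            # ratio of NN to second best
            matches.append((i, nn))
    return matches
```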
According to the methodology for evaluation of 2D descriptors recommended in [61], we provide results in terms of Recall versus Precision curves. This choice is preferable compared to ROC curves (i.e.
True Positive Rate versus False Positive rate) when comparing descriptors due to the ambiguity in calculating the False Positive Rate [43]. We
present three different experiments. Experiment 1 deals with 6 models
("Armadillo", "Asian Dragon", "Thai Statue", "Bunny", "Happy Buddha", "Dragon") taken from the Stanford 3D Scanning Repository (http://graphics.stanford.edu/data/3Dscanrep). We
build up 45 scenes by randomly rotating and translating different subsets of the model set so as to create clutter (3 sets of 15 scenes each, containing respectively 3, 4 and 5 models); then, similarly to [98], we
add Gaussian random noise with increasing standard deviation, namely
σ1 , σ2 and σ3 at respectively 10%, 20% and 30% of the average mesh
resolution (computed on all models). In Experiment 2 we consider the
same models and scenes as in Experiment 1, add noise (i.e. σ1 ) and
resample the 3D meshes down to 1/8 of their original point density by
using the MeshLab (http://meshlab.sourceforge.net/) Quadratic Mesh Collapse Decimation filter. For a fair
comparison in this experiment, our implementation of SI (used throughout all the evaluation) normalizes each descriptor to the unit vector to
make it more robust to density variations [18]. Finally, in Experiment 3
the dataset consists of scenes and models acquired in our lab by means
of a 3D sensing technique known as Spacetime Stereo [21], [108]. In
particular, we compare 8 object models against 15 scenes characterized
by clutter and occlusions, each scene containing two models. Fig. 4.7
shows two scenes together with the models appearing in them. In each
of the three experiments, 1000 feature points were extracted from each
model. As for the scenes, in Exp. 1 and 2 we extract n × 1000 features
per scene (n being the number of models in the scene) whereas in Exp.
3 we extract 3000 features per scene.
Throughout all the three experiments we used the same values for
the parameters of the considered methods. In particular, we tuned the
two parameters of each descriptor (support radius and length of the descriptor) based on a tuning scene corrupted with noise level σ1 and built
rotating and translating three Stanford models ("Bunny", "Happy Buddha", "Dragon"). The values resulting from the tuning process are reported in the last two columns of the Table in Fig. 4.8. It is worth noting
that our tuning yielded comparable values of the support radius among
the various methods, and that, for SI and PS, the resulting parameter
values are coherent, as far as the order of magnitude is concerned, with
those originally proposed by their authors (no indication about EM parameters is given in [66]). Yet, we used the finely tuned values instead
of those originally proposed by the authors since the former yield higher
performance in these experiments.
Results for the three Experiments are reported in Figures 4.5, 4.6 and
4.7, respectively. Experiment 1 focuses on robustness to noise. Given
the reported results, it is clear that SHOT performs better than the other
methods at all different noise levels on the Stanford dataset. We can observe that, comparing the two Signature methods, PS exhibits a higher
robustness than EM. We ascribe this mainly to the higher robustness of
its local RF, as shown in Fig. 4.3. This, together with the good performance of SHOT, highlights the importance of deploying a robust local
RF. As for SI, it appears to be highly susceptible to noise, its performance notably deteriorating as the noise level increases. This is due
to the fact that this descriptor is highly sensitive to small variations in
the normal estimation (i.e. SI Reference Axis), that here we compute as
proposed in [41]. This is also consistent with the results reported in [27].
As for Experiment 2, it is clear that the point density variation is the
most challenging nuisance among those accounted for in our experimental validation, causing a severe performance loss of all methods, even
those specifically addressing it, such as EM. SHOT, PS and SI obtain comparable performance; nevertheless, for high values of precision, which are
typical working points for real applications, SHOT obtains the highest
levels of Recall.
Experiment 3 shows that under real working conditions SHOT outperforms the other methods. It is worth noting that this experiment is
especially focused on the descriptiveness of evaluated approaches, since
the smoother shapes of the objects surfaces compared to those of the
Stanford models make the former harder to discriminate. Hence, results
demonstrate the higher descriptiveness embedded in SHOT with respect
to the other proposals.
In addition, we have compared the methods in terms of their computational efficiency and memory requirements. Since, as discussed in
Sec. 4.1.2, descriptors based on multiple RFs, like PS, cannot deploy
efficient indexing to speed-up the matching stage, we use a full search
strategy for all methods. Results are reported in Fig. 4.8. The two charts
in the figure, showing the number of milliseconds per correspondence
needed by the various methods using different support sizes, demonstrate the notable differences in computational efficiency between the algorithms. In particular, SI and SHOT run one order of magnitude faster
than EM and almost two orders of magnitude faster than PS, with SI
turning out consistently slightly faster than SHOT at each support size.
As for EM, efficiency is mainly affected by the re-parametrization of the
support needed to describe each feature point and by the large memory
footprint (see next). With regard to PS, as discussed in Sec. 4.1.2, the
use of multiple local RFs dramatically slows down the matching stage.
These results are confirmed by the Table in the figure (first column),
which reports the measured times required to match the scene to the
models in Experiment 1 (i.e. 3000 scene features and 3000 models features) using the tuned parameter values. Here, the larger support needed
by SI allows SHOT to run slightly faster. As for memory requirements,
the reported descriptor length (third column) highlights the much higher
memory footprint required by EM compared to other methods.
3D registration
As a practical application in a challenging and active research area, we
demonstrate the use of SHOT correspondences to perform fully automatic 3D Reconstruction from Spacetime Stereo data. We merge 18
views covering a 360◦ field of view of one of the smooth objects used in
Experiment 3 and 29 views of an object not used in the previous experiments.

Figure 4.9: 3D Reconstruction from Spacetime Stereo views: (a) initial set of views; (b) coarse registration; (c) global registration, frontal view; (d) global registration, rear view.

We follow a two-step procedure:
1. we obtain a coarse registration by estimating the 3D transformations between every pair of views and retaining only those maximizing the global area of overlap;
2. we use the coarse registration as initial guess for a final global
registration carried out using a standard external tool (Scanalyze).
In the first step, correspondences among views are established by
computing and matching SHOT descriptors on 1000 randomly selected
feature points. 3D transformations are estimated by applying a well
known Absolute Orientation algorithm [36] on such correspondences
and filtering outliers by RANSAC. Maximization of the area of overlap
is achieved through the Maximum Spanning Tree approach described in
[66]. As shown in Fig. 4.9 and Fig. 4.10, without any assumptions about
the initial poses, SHOT correspondences allow for attaining a coarse
alignment which is an accurate enough initial guess to successfully reconstruct the 3D shape of the object without any manual intervention.
To the best of our knowledge, fully automatic 3D reconstruction from
multiple Spacetime Stereo views has not been demonstrated yet.
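As an illustration of the first step for a single view pair, the sketch below estimates the rigid transformation from putative SHOT correspondences and filters outliers with RANSAC. It relies on a standard SVD-based absolute-orientation solver rather than on the specific algorithm of [36], and the residual threshold and iteration count are placeholders.

```python
# Coarse pairwise registration from putative correspondences (illustrative).
import numpy as np

def rigid_transform(src, dst):
    """Least-squares R, t such that R @ src[i] + t ~= dst[i] (both (N,3))."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, cd - R @ cs

def ransac_registration(src, dst, iters=1000, thresh=0.01):
    """src, dst: (N,3) corresponding 3D points from the two views."""
    best_inl = np.zeros(len(src), bool)
    rng = np.random.default_rng(0)
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)    # minimal sample
        R, t = rigid_transform(src[idx], dst[idx])
        inl = np.linalg.norm((src @ R.T + t) - dst, axis=1) < thresh
        if inl.sum() > best_inl.sum():
            best_inl = inl
    return rigid_transform(src[best_inl], dst[best_inl]), best_inl
```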
Figure 4.10: 3D Reconstruction from Spacetime Stereo views: (a) initial
set of views (b) coarse registration (c) global registration frontal view (d)
global registration rear view.
4.2 Color SHOT
In this section we show that the design of the SHOT descriptor can naturally and successfully be generalized to incorporate texture (Sec. 4.2.1)
and that such an extension allows for improved performance on publicly
available datasets (Sec. 4.2.2). This results in a particularly interesting
approach for carrying out surface matching tasks based on the output of
modern 3D sensors capable of delivering both shape and texture.
The majority of the proposals introduced in Sec. 4.1.1 detect and describe a feature point by using shape data only. Recently, [106] has
proposed the MeshDoG/HoG approach, which is the only 3D descriptor
where texture information is taken into account. We will compare the
performance of the generalized SHOT descriptor against this method.
4.2.1 A combined texture-shape 3D descriptor
To generalize the design of the SHOT descriptor so as to include multiple cues, we denote here as $SH_{G,f}(P)$ the generic signature of histograms
computed over the spherical support around feature point P. This signature of histograms relies upon two different entities: G, a vector-valued
point-wise property of a vertex, and f , the metric used to compare two
of such point-wise properties. To compute a histogram of the signature,
f is applied over all pairs $(G_P, G_Q)$, with Q representing a generic vertex
Figure 4.11: The proposed descriptor merges together a signature of
histograms of normal orientations and of texture-based measurements.
belonging to the spherical support around feature point P. In the original
SHOT formulation, G is the surface normal estimation, N, while f (·) is
the dot product, denoted as p(·):
f(G_P, G_Q) = p(N_P, N_Q) = N_P \cdot N_Q \qquad (4.6)
In the proposed generalization, m signatures of histograms relative to
different (property, metric) pairs are computed on the spherical support
and chained together in order to build the descriptor D(P) for feature
point P:
D(P) =
m
[
i
S H(G,
f ) (P)
(4.7)
i=1
Although the formulation in (4.7) is general, we will hereinafter
refer to the specific case of m = 2, so as to combine a signature of
histograms of shape-related measurements together with a signature of
texture-related measurements (Fig. 4.11). As for the former, we use the
formulation of the original SHOT descriptor, i.e. the point-wise property vector is represented by the surface normal estimation in P, $N_P$, while the operator $f(\cdot)$
is the dot product, p(), as in (4.6). As for the latter, since we want here
to embed texture information into the descriptor, we have to define a
proper vector representing a point-wise property of the texture at each
vertex and a suitable metric to compare two such texture-related properties. The overall descriptor, based on two signatures of histograms, will
be dubbed hereinafter as Color-SHOT (CSHOT).
The most intuitive choice for a texture-based G vector is the RGB
triplet of intensities associated to each vertex, referred to here as R. To
properly compare RGB triplets, one option is to deploy the same metric
as in SHOT, i.e. to use the dot product p(RP , RQ ). Alternatively, we
have tested another possible metric based on the $L_p$ norm between two
triplets. In particular, we have implemented the operator based on the
$L_1$ norm, referred to as $l(\cdot)$, which consists in the sum of the absolute
differences between the triplets:
l(R_P, R_Q) = \sum_{i=1}^{3} \left| R_P(i) - R_Q(i) \right| \qquad (4.8)
Moreover, we have investigated the possibility of using different color
spaces rather than RGB. We have chosen the CIELab space given its
well-known property of being more perceptually uniform than the RGB
space [25]. Hence, as a different solution, vector G is represented by
color triplets computed in this space, which will be referred to as C.
Comparison between C triplets can be done using the metrics used for R
triplets, i.e. the dot product $p(\cdot)$ or the $L_1$ norm $l(\cdot)$, leading to signatures
of histograms relying, respectively, on $p(C_P, C_Q)$ and $l(C_P, C_Q)$.
In addition, we have investigated the use of more specific met-
rics defined for the CIELab color space. In particular, we have deployed
two metrics, known as CIE94 and CIE2000, that were defined by the
CIE Commission respectively in 1994 and 2000: for their definitions
the reader is referred to [25]. These two metrics lead to two versions
of operator $f(\cdot)$ which will be referred to, respectively, as $c_{94}(\cdot)$ and
$c_{00}(\cdot)$. Hence, two additional signatures of histograms can be defined
based on these two measures, denoted respectively as $c_{94}(C_P, C_Q)$ and
$c_{00}(C_P, C_Q)$.
The CSHOT descriptor inherits SHOT parameters, i.e. the radius of
the support and the number of bins in each histogram. However, given
the different nature of the two signatures of histograms embedded in
CSHOT, it is useful to allow for a different number of bins in the two
histogram types. Thus, the CSHOT descriptor will have an additional
parameter with respect to SHOT, indicating the number of bins in each
texture histogram and referred to as Color Step ($S_C$, see Fig. 4.11).
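A minimal sketch of the resulting layout, with m = 2 as in Eq. (4.7), is given below; the helper inputs (cosine values, L1 color distances, spatial cell indices) and the bin counts are placeholders chosen only for illustration.

```python
# CSHOT layout sketch: a shape signature of histograms (cosine of normal
# angles) chained with a texture one built on the L1 distance between CIELab
# triplets, as in Eq. (4.8). Bin counts are placeholders.
import numpy as np

N_CELLS, SS, SC = 32, 10, 30          # spatial cells, Shape Step, Color Step

def cshot(shape_vals, color_l1_vals, cell_idx):
    """shape_vals: cos(theta) per support point; color_l1_vals: L1 distance
    between the CIELab triplet of each point and that of the feature point,
    scaled to [0, 1]; cell_idx: spatial cell of each point (0..N_CELLS-1)."""
    shape_part = np.zeros((N_CELLS, SS))
    color_part = np.zeros((N_CELLS, SC))
    for c, s, l in zip(cell_idx, shape_vals, color_l1_vals):
        shape_part[c, min(int((s + 1) / 2 * SS), SS - 1)] += 1
        color_part[c, min(int(l * SC), SC - 1)] += 1
    desc = np.concatenate([shape_part.ravel(), color_part.ravel()])  # Eq. (4.7), m = 2
    return desc / desc.sum() if desc.sum() > 0 else desc
```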
4.2.2 Experimental Results
The 6 different versions defined in Section 4.2.1 for the novel CSHOT
descriptor are now evaluated in a typical 3D object recognition scenario
where one or more objects have to be found in a scene with clutter and
occlusions. The experimental evaluation is aimed at determining which
version performs best in terms of both accuracy and efficiency. Furthermore, the best versions will be compared against the original SHOT descriptor as well as the MeshHoG descriptor, so as to evaluate the benefits
brought in by the proposed approach.
In all experiments, features points are first extracted from a scene and
an object, then they are described and matched based on the Euclidean
distance between descriptors. As for the feature extraction stage, we rely
on the same approach as in Sec. 4.1.5, i.e. features are first randomly
extracted from the object, then the corresponding features are extracted
from the scene by means of available ground-truth information together
with a set of additional features randomly extracted from clutter. All
algorithms have been tested by keeping constant their parameters. In
particular, all parameters that CSHOT shares with SHOT have been set
to the values introduced in Sec. 4.1.4. Such values have also been used
here for the tests concerning the SHOT descriptor. As for the additional
parameter used by CSHOT (S C ), it has been tuned for each CSHOT
version on a subset, made out of 3 scenes, of the Spacetime Stereo dataset introduced in Sec. 4.1.5. This subset has been used to tune also the radius and number of bins of the orientation histograms of MeshHoG, with the other parameters of the method kept as originally proposed in [106].

Figure 4.12: Comparison in terms of accuracy (big chart) and efficiency (small chart) between CSHOTs with different measures in the RGB (left chart) and CIELab (right chart) color spaces on Dataset 1. SHOT and two variants of MeshHoG are also reported.
Comparison between color spaces and metrics
A first experimental evaluation has been carried out to identify the best
CSHOT combinations for, respectively, the RGB and the CIELab color
spaces. Results have been computed on a dataset composed of the 12
scenes not used for tuning of the Spacetime Stereo dataset. This subset, hereinafter referred to as Dataset 1, includes scenes with clutter and
occlusions of the objects to be recognized.
Figure 4.12 shows the comparison between the evaluated measures
respectively in the RGB (left chart) and CIELab (right chart) color spaces.
As for the former, the two (property, metric) pairs being compared are:
(R, p) and (R, l). As for the latter, four pairs are compared, i.e.: (C, p),
(C, l), (C, c94), (C, c00). Each comparison is carried out in terms of accuracy (big chart) and efficiency (small chart). As for the former, results
are provided in terms of Precision vs. Recall curves computed on the
output of the descriptor matching process carried out between the features extracted from the objects and those extracted from the scenes. Each object-scene pair of the dataset is then averaged to yield the final charts shown in the figure. As for efficiency, results are provided as the average amount of time (ms) needed to compute one correspondence between the scene and the object.

Figure 4.13: Left: Two models and four scenes of Dataset 2. Right: Comparison in terms of accuracy (big chart) and efficiency (small chart) between the 2 best versions of CSHOT, SHOT and two variants of MeshHoG on Dataset 2.
As for the RGB space, (R, l) proves to be more accurate than (R, p),
and only slightly less efficient. As for the CIELab space, (C, l), (C, c94 )
and (C, c00 ) notably outperform (C, p), with (C, l) being slightly more accurate and more efficient than (C, c94 ), and with (C, c00 ) being by far the
least efficient one. Hence, the two CSHOT versions that turn out more
favorable in terms of the accuracy-efficiency trade-off are, respectively,
(R, l) for the RGB space, and (C, l) for the CIELab space.
Comparison with SHOT and MeshHoG
We will now comment on the comparison between the two best CSHOT
versions and the SHOT and MeshHoG descriptors, so as to assess the
benefits brought in by the combined deployment of texture and shape
in the proposed extension as well as to compare its overall performance
with respect to state-of-the-art methods. We tested two versions of MeshHoG: one using only shape, as done by SHOT, and one deploying shape
and texture. For shape-only MeshHoG, we used the mean curvature as
feature. As reported in the experimental results section of [106] (Sec
6.1), the use of both shape and texture can be achieved by juxtaposing
two MeshHoG descriptors, computed respectively using as feature the
mean curvature and the color. Contrary to what is reported in [106], on
our dataset the shape-and-texture version of MeshHoG provides slightly
better performance than the texture-only version: thus, it is the one included in our comparison.
The two charts in Fig. 4.12 include the results yielded on Dataset
1 by SHOT and the two considered variants of MeshHoG . In addition,
Fig. 4.13 reports a further comparison carried out between the same proposals on another dataset. This dataset, referred to here as Dataset 2,
comprises 8 models and 16 scenes (2 models and 4 scenes of this dataset
are shown on the left side of the Figure). Dataset 2 differs from Dataset
1 because the former includes objects having very similar shapes but
different textures (i.e. different types of cans). Hence, it helps highlighting the importance of relying also on texture for the goal of 3D object
recognition in cluttered scenes. Similarly to the previous experiment, results are given both in terms of accuracy (big chart) and efficiency (small
chart).
Several observations can be made on these charts. First of all, on both
datasets, the two best versions of CSHOT, i.e. (R, l) and (C, l), notably
outperform SHOT and the shape-only version of MeshHoG in terms of
accuracy, with the gap in performance being more evident on Dataset
2, where the algorithms that rely only on shape fail since they do not
hold enough discriminative power to cope with the traits of the dataset.
The results on both datasets confirm the benefits of including texture
information in the descriptor. Secondly, on both datasets the CSHOT
descriptor based on (C, l) proves to be more effective than that relying
on (R, l) as well as than the shape and texture version of MeshHoG, thus
allowing for state-of-the-art performance on the considered datasets. Finally, as for efficiency, the CSHOT descriptor based on (C, l) is approximately twice as slow as SHOT and one order of magnitude faster than
MeshHoG.
4.3 Object Category Recognition with 3D ISM
In the last decade the main effort on recognition of object categories
has been devoted to categorizing classes of objects from images [73],
one of the most prominent approaches being the application to image
features of the Bag-of-Words paradigm, previously used for text categorization and document analysis. In particular, this approach, typically
referred to as Bag-of-Features (BoF) or Bag-of-Visual-Words (BoVW),
represents image categories as histograms ("bags") of feature descriptors
[19, 82, 84]. To account for efficiency, histograms are not built on descriptors themselves but on an alphabet of descriptors, typically termed
"codebook", obtained via clustering or vector quantization [73].
BoF methods turned out to be particularly effective even though, unlike some more recent proposals, they discard geometrical relationships
between object parts. Among those leveraging geometric structure, one
of the most successful proposals is Implicit Shape Model (ISM) [50],
that encodes spatial relationships by means of a probabilistic Generalized Hough Transform in a 3-dimensional space representing scale and
translation. Moreover, the use of geometrically well-localized information allows these methods to be deployed also as detectors of specific
object categories in presence of clutter, occlusion and multiple object
instances. Typical object categories of interest have been pedestrians,
faces, humans, cars [50].
The increasing availability of large databases of 3D models has fostered a growing interest towards computer vision and machine learning
techniques capable of processing 3D point clouds and meshes. One of
the most investigated tasks so far has been shape retrieval (see [39, 94]
for surveys) which aims at finding the most similar 3D models in the
database to a given query model inputted by the user. Another well
investigated topic concerns 3D object recognition [27, 41]. Only very
recently the first methods aimed at 3D object categorization have been
proposed in literature. They mainly extend the BoF paradigm to the 3D
scenario by representing categories as histograms of codewords obtained
from local shape descriptions of 3D features [52, 67, 97].
In this last part of our work on 3D data we investigate how to deploy Implicit Shape Modeling for the categorization of meshes. Although in the remainder of this chapter we will focus only on categorization, it is worth noting that this approach holds the potential to solve within the same framework the problem of simultaneous localization and classification of objects in cluttered scenes, even in presence of multiple instances, i.e. to be used as a category detector able to initialize a tracker.
4.3.1 3D Implicit Shape Model
The basic idea underlying Implicit Shape Models is to perform object category recognition and instance localization based on a non-parametric probability mass function of the position of the object center. These probability functions come from a probabilistic interpretation of the voting space of a Generalized Hough Transform algorithm. Votes are cast by local features that are matched against a codebook learned, together with votes, from a set of training examples. When applied to 3D data, we identify the general form of an algorithm training a 3D ISM as follows (Fig. 4.14):
Figure 4.14: Overview of the training stage of 3D ISM.
• local features are detected and described from the 3D training data.
• for each category Ci
– all features belonging to Ci are clustered to create the codebook of Ci
– for each training feature $f_j^{C_i}$ of category Ci
∗ $f_j^{C_i}$ is matched against the codebook of Ci according to
a codeword activation strategy.
∗ each activated codeword adds to the ISM of Ci the position of $f_j^{C_i}$ with respect to the object center. Each feature $f_j^{C_i}$ needs to incorporate a repeatable local Reference Frame (RF), and votes are expressed with respect
to such local RF of $f_j^{C_i}$.
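A schematic rendition of the training loop for a single category is sketched below. The data structures are our own illustration; for simplicity, the plain training features are used as codewords (i.e. no clustering, in line with the findings reported in Sec. 4.3.2), so that codeword activation at training time is trivial.

```python
# 3D ISM training sketch (one category): each codeword stores, as a vote, the
# vector from the feature to the object centre expressed in the feature's own
# local RF, which makes the vote PoV-invariant.
import numpy as np

def train_3d_ism(features):
    """features: list of dicts with 'desc' (descriptor), 'pos' (3,),
    'rf' (3,3 local RF rows) and 'center' (3, object centre)."""
    codebook, votes = [], []
    for f in features:                              # plain training data as codewords
        codebook.append(f['desc'])
        votes.append(f['rf'] @ (f['center'] - f['pos']))   # offset in the local RF
    return np.array(codebook), np.array(votes)
```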
Then, a generic 3D ISM recognition procedure may be decomposed into the following steps (Fig. 4.15):
• local features are extracted and described from the 3D input data.
• for each feature $f_j$ and each category Ci
– $f_j$ is matched against the codebook of Ci according to a codeword activation strategy.
– each activated codeword casts its set of votes for the Hough
Space of Ci in its ISM.
– votes are rotated and translated so as to be expressed in the
local RF of the input features before voting, thus obtaining
Point-of-View (PoV) independent votes. The magnitude of
the vote is set according to a vote weighting strategy.
• in case of categorization of 3D database entries, the category yielding the global maximum among all the Hough spaces is selected as
output; in case of detection in a cluttered scene, local maxima of
each category above a threshold are selected as category instance
hypotheses for a further verification stage and/or pose estimation.
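The corresponding recognition loop, for the categorization case, can be sketched as follows; k-NN codeword activation and constant vote weights are assumed here, anticipating the choices discussed in Secs. 4.3.3 and 4.3.4, and the Hough bin size is a placeholder.

```python
# 3D ISM categorization sketch: votes are brought back to the global frame
# through the local RF of the test feature (PoV-independent voting) and
# accumulated in a 3D Hough space; the category with the strongest global
# maximum is selected.
import numpy as np

def categorize(test_features, models, bin_size=0.05, k=1):
    """models: {category: (codebook (M,D), votes (M,3))}; test_features: list
    of dicts with 'desc', 'pos', 'rf'. Returns the predicted category."""
    best_cat, best_score = None, -1
    for cat, (codebook, votes) in models.items():
        hough = {}
        for f in test_features:
            dists = np.linalg.norm(codebook - f['desc'], axis=1)
            for m in np.argsort(dists)[:k]:                # k-NN codeword activation
                center = f['pos'] + f['rf'].T @ votes[m]   # vote in global coordinates
                key = tuple(np.floor(center / bin_size).astype(int))
                hough[key] = hough.get(key, 0) + 1.0       # constant (CW) vote weight
        score = max(hough.values()) if hough else 0
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat
```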
This scheme exhibits two main differences with respect to the use of
ISM for detection of object categories in 2D images. First of all, since
Figure 4.15: Overview of 3D ISM for Categorization and Detection.
the sensor produces metric data, there is no need for scale invariance: in
the 2D case, when casting votes for the object center, the object scale
is treated as a third dimension in the voting space. With 3D data we
can cast votes for object hypotheses directly in the coordinate space,
which is again a 3-dimensional space. The second difference regards
the use of PoV-independent votes, that leads to a PoV-independent detector. In the original ISM proposal, objects of the same category under
different point of views are regarded as instances of different, unrelated
categories. It is worth pointing out that the use of PoV-independent votes
is not just a nice extension that allows for more flexibility of the final
method, it is indeed mandatory when using 3D ISM to categorizes 3D
database entries, for these cannot be assumed to be expressed within the
same global RF.
As noted before, most of the proposals in the field of 3D local features
do not include a fully defined local RF. Once more, this highlights the
importance of the fact that our SHOT descriptor defines a full 3D, unambiguous
local reference frame. We thus use SHOT features as the base of our
3D ISM. This is also another test of the quality of the proposed features,
which demonstrate good performance even in 3D object categorization,
an experiment that was not proposed in Sec. 4.1.5.
In the previous overview of the method we have highlighted the main
design decisions that need to be taken to define a 3D ISM, i.e. the codeword activation strategy and the vote weighting strategy. In the following
we address, by discussion and experiments, the possible alternatives for
these design choices together with other major issues related to codebook size and composition. It is worth noting that, although we have
conducted experiments using 3D data only, all our reasoning is independent of data dimensionality. Therefore, we expect the observations
drawn from our analysis to be beneficial also for the case of standard 2D
ISMs.
4.3.2 Codebook
Codebook size
Codebooks are widely used for 2D and 3D object categorization (e.g.
[85] [97] [52]). The reason behind their use is efficiency, both in terms
of memory occupancy of the codebook and computational time for codeword activation. They are not expected to have any positive impact on
the generalization abilities of the algorithms. They are usually built by
applying some standard clustering algorithms, like k-means, on the features extracted from the training data. Little attention, however, has been
paid to the loss in discriminative power of the codebook after size reduction. Furthermore, research in the field of Approximate Nearest Neighbor provides efficient methods to solve the codeword activation problem
even in high dimensional spaces and with large databases [65]. Finally,
the cost of storing a set of descriptors for each training model of the currently publicly available 3D datasets is nowadays definitely affordable
by off-the-shelf machines. Based on the above considerations, we investigated the actual importance of building a codebook to successfully
perform object category recognition in 3D data.
The chart in Fig. 4.16 shows the outcome of an experiment carried
out on the Aim@Shape Watertight dataset (see Sec. 4.3.5 for more details about the dataset and the experimental methodology). We used
134
60
80
50
75
40
70
30
65
20
60
10
%
ms
4.3 Object Category Recognition with 3D ISM
Time
Rec. Rate
55
No clustering
0
50
10
100
1000
10000
Codewords
Figure 4.16: Impact of codebook size on mean recognition rate and mean
recognition time
half dataset for training and half for testing, i.e. ten models for training
and ten for testing for each category. 200 mesh vertexes were randomly
selected on each training model obtaining 2000 features as training set
for each category. We then performed k-means on this set, varying k
logarithmically from 10 to 2000. We used such codebooks to categorize
the test set. The best mean recognition rate is obtained with 2000 codewords, i.e. using the plain training data without any clustering. Loss in
efficiency is minimal, for instance using 100 codewords the mean time
to categorize one test model is about 42 ms, whereas using the plain
training set as codebook it slightly increases to about 52 ms. Memory
occupancy, of course, scales linearly with codebook size and, for the
considered dataset, when using no clustering is less than 57MB. Therefore, based on the indication of this and other similar experiments, in
the following we use as "codebook" the whole training data, without
carrying out any clustering on them.
Sharing codewords among categories
In the original ISM proposal, the case of simultaneous recognition of
multiple categories is solved by running a detector for each category, endowed with its own codebook built from training data belonging to its
category. We refer to this configuration as ISM with separated codebooks: codebooks of different categories are independently built and
used. In the context of categorization of DB entries, we have investigated another possible configuration, that we refer to here as ISM
with global codebook: a codebook is created from the training data belonging to all categories and then used by all ISMs. The Shape Model
of each category is still built during the training stage by considering
only the training data belonging to that category. However, denoting
with $SM_i$ the Shape Model of category Ci, not only those originated by
the training data of Ci, but all the codewords in the codebook, regardless of the categories of the features that generated them, can participate
in $SM_i$, provided that they are similar - according to the codeword activation strategy - to any of the training features of Ci. Therefore, this
scheme endows the ISM paradigm with a broader capability of generalization: whilst the separated codebooks configuration is able to generalize at an intra-class level, by letting features observed in different
training instances of the same class collaborate to the detection of an
instance during testing, the global codebook configuration lets ISM generalize also at an inter-class level. It allows features observed in training
examples of different categories to reinforce the hypothesis that an instance of category Ci is present. In other words, it builds a ”universal”
codebook of all the likely features given the training data, and then associates a spatial location for a specific category to all those that are ”similar” to the training features of such category, regardless of the labels of
the training data that originated that codeword.
It is worth highlighting that memory requirements of both configurations are equal: although a global codebook requires C times more space
than a separated codebook, with C the number of categories, only one
instance of it has to be stored in memory since it can be shared among
all the C 3D ISMs required by our proposal. Query time scales logarithmically with the size of the codebook: since codewords in the global
codebook are C times those of the separated codebooks, query time is
increased by log C, a limited amount for typical number of categories in
publicly available 3D databases (i.e. less than 30).
4.3.3 Codeword Activation Strategy
The codeword activation strategy proposed for the deployment of ISM
in the case of 2D data [50] is the cutoff threshold: codewords are activated, and, thus, cast their votes, if their distance from the test feature is
below a threshold. An alternative approach is represented by the k-NN
activation strategy: the closest k codewords to the test feature are activated, regardless of their distance. We consider the latter strategy more
suitable to the task of categorization, the reason being twofold. First of
all, in those parts of the feature space characterised by a high codeword
density, k-NN generally activates fewer features than the cutoff strategy,
only the k most similar ones. By increasing the number of votes cast
by each test feature in the Hough space we may expect to sharpen the
peak corresponding to a true instance of the class, but also to generate
spurious peaks in the voting space, by randomly accumulating wrong
votes in the same bin. In such parts of the feature space, the k-NN strategy acts as a filter that aims at reducing the probability of adding noise
into the Hough space, while it hopefully retains the ability to let the correct hypothesis emerge, by selecting only the most similar codewords.
Secondly, in those parts of the feature space with a low density or even
absence of codewords, k-NN still activates k codewords, whereas the
cutoff strategy casts very few votes, if any. Indeed, since the threshold is
generally chosen small enough to prevent the generation of false peaks, the cutoff strategy generally tends not to activate any codeword in low density
regions of the feature space. Obviously, the codewords activated by the
k-NN strategy can be really different from the test data. Still, given the
training set, they are the most similar at hand: if we have to generalize
from the training examples to attempt to classify the current input, they
appear a reasonable choice. The same reasoning does not hold when
using 3D ISM to detect instances in cluttered scenes: in such a case, a
high distance from any codeword is likely to indicate that the test feature
comes from clutter and hence should not cast votes, such behavior being
correctly modeled by the cutoff strategy. Yet, when reasoning in absence
of clutter, as is the case when categorizing the entries of a 3D database, the k-NN strategy offers an adaptive behavior with respect to the training data that seems more suitable to the task.
4.3.4 Votes Weighting Strategy
In [50], the vote weight for each pair (test feature, vector in the shape
model) is given by the product of a match weight and an occurrence
weight
\[
w = p(o_n, x \mid C_i, l) \cdot p(C_i \mid f_k) = \frac{1}{|M|} \cdot \frac{1}{|Occ[i]|} \tag{4.9}
\]
with M being the set of codewords activated by the test feature fk and
Occ[i] being the set of vectors in the Shape Model associated with codeword i.
The rationale behind this choice is tightly coupled with the use of the original ISM for detection in cluttered scenes. In the presence of clutter, there is an obvious trade-off between increasing the number of true detections and limiting the number of false detections. The choice of the vote weighting strategy operated in [50] goes in this direction. If a feature activates more codewords than another feature, and/or if such codewords can be observed in more feasible positions with respect to the object center than other codewords, then this feature is regarded as less distinctive, since it likely generates more spurious votes in the Hough space. By keeping low the weight, i.e. the confidence, on the position of the object center for the votes of such features, the original ISM tries to choose a good working point to optimize the above-mentioned trade-off, by keeping such spurious local maxima of the voting space below the detection threshold. We refer to this vote weighting strategy as Localization Weights (LW).
Again, in the absence of clutter the scenario is different. Recall from Sec. 4.3.1 that we propose to select as output the category yielding the global maximum among all the Hough spaces. Therefore, in this case the emphasis for each 3D ISM should be on supporting as much as possible
its best hypothesis. This means that spurious local maxima are not relevant for categorization, as long as they do not hide the true global maximum. Since we can reasonably expect that the geometrically consistent bin will likely provide the strongest peak in the voting space, there is no reason to try to weaken local maxima by acting on the vote weight. On the other hand, using the original ISM vote weighting strategy may uselessly reduce the strength of the global maximum only because the features that cast votes for it have also cast votes for wrong locations, and this can lead to a wrong selection of the category in the final competition among the global maxima of all categories. Hence, in the case of categorization, we have investigated the use of the same constant weight for all features and codewords. Hereinafter, we will denote this vote weighting strategy as Categorization Weights (CW).
4.3.5 Experimental Results
We have tested our proposals on the Aim@Shape Watertight (ASW) dataset, previously used for the evaluation of 3D object categorization algorithms such as [97], and on the Princeton Shape Benchmark (PSB) [83], already used for 3D categorization in [52]. Since meshes in the PSB dataset exhibit a high variance in metric dimensions, even within the same class, to define a Hough Space suitable for all meshes we normalize models before using them for testing or training. Specifically, we translate the model barycenter to the origin, compute the Eigenvalue Decomposition (EVD) of the scatter matrix of each model to find its principal axes, scale the model down or up by a factor of 1/(Xmax − Xmin), with Xmax, Xmin the maximum and minimum coordinates of the mesh along the first principal axis, and finally rotate the model to align it with its principal axes (see the sketch at the end of this paragraph). It is important to note that, due to the sign ambiguity inherent to the EVD [9], we still need PoV-independent votes to achieve correct categorization. This normalization also allows for an
important simplification: we can define the Hough Space just around the barycenter, i.e. the origin: any hypothesis for the object center lying
far away from the barycenter will clearly be a spurious peak in the voting space. This improves both the effectiveness of our method, by discarding peaks in a priori wrong regions of the voting space, and its efficiency, since it reduces the memory footprint needed to store the Hough Space. In particular, we used a Hough Space consisting of one squared bin, centered at the origin and with a side of 0.2. In all the experiments
with both datasets we randomly extract 200 feature points from each training model and 1000 feature points from each testing model, and we describe them using SHOT with 16 spatial sectors (8 on the tangent plane and 2 concentric spheres) and 10 bins for the normal histograms. We reduce the number of spatial divisions, and therefore the dimensionality of the descriptor, with respect to that used in the previous experimental results because clustering operates better in lower-dimensional spaces. We do not perform any multi-scale description; we use just a single support radius, equal to 0.25 and 0.45 for the ASW and the PSB dataset, respectively. As discussed in Sec. 4.3.2, we use a plain codebook composed of all training descriptors.
The Aim@Shape Watertight dataset contains 20 categories, each including 20 models. We tested our method on this dataset according to two methodologies. First, we divided the dataset into a training and a testing set by taking the first 10 models of each category as the training set and the rest as the testing set. With this configuration we studied the influence of the previously discussed design issues. Then, we also performed Leave-One-Out cross validation as done in [97], to be able to compare our results with such related work. Of course, the first test is more challenging, since significantly less training data is available to learn category shapes.
Results for the first series of experiments are reported in Fig. 4.18. We compared the performance of all the combinations of the proposed design decisions, i.e. global codebook (GC) vs. separated codebooks (SC), LW vs. CW, and k-NN vs. cutoff with different values. The best recognition rate for this dataset is 79% and is obtained using 1-NN as the Codeword Activation Strategy and a global codebook.
Figure 4.17: Confusion matrix for Aim@Shape Watertight, 1-NN Codeword Activation Strategy and CW Votes Weighting Strategy. Rows represent the true categories of the input models, columns the output of the 3D ISM.
In such a configuration LW is the same as CW, since each codeword has zero or one vote. Fig. 4.17 reports the confusion matrix for this case.
In the case of Leave-One-Out cross validation, [97] reports a mean recognition rate of 87.25%. Using 2-NN as the Codeword Activation Strategy, a global codebook and CW as the Votes Weighting Strategy, we obtained 100%.
The PSB dataset comes with a hierarchical categorization and a predefined division into training and testing sets, which we adopt. To compare our results against those in [52] we use the categorization level named Coarse 2, although it defines quite abstract meta-categories, such as "Household", which includes electric guitars and guns as well as stairs, or "-1", which stands for "all other models in the dataset". Clearly this dataset is more challenging than ASW, the intra-class and the inter-class variability being definitely higher.
Figure 4.18: Mean recognition rate (%) as a function of varying cutoff (a) and k-NN (b) values on Aim@Shape Watertight, for the GC LW, SC LW, GC CW and SC CW configurations.
Results are reported in Fig. 4.19. We compared the same combinations as in the previous experiment. The best recognition rate for this dataset is 50.2% and is obtained using 2-NN as the Codeword Activation Strategy, a global codebook and the CW Votes Weighting Strategy. [52] reports a mean recognition rate of 55%. It is worth noting that, in addition to the previously mentioned difficulties, the PSB dataset also presents a highly variable point density among the models. As noted in the experimental comparison on the SHOT descriptor (Sec. 4.1.5), point density variation is not well tolerated by current 3D descriptors. This was explicitly accounted for in [52], where all PSB
Figure 4.19: Mean recognition rate (%) as a function of varying cutoff (a) and k-NN (b) values on the PSB Coarse 2 dataset, for the GC LW, SC LW, GC CW and SC CW configurations.
meshes were resampled to a constant number of vertices, uniformly distributed over the meshes. We have not implemented such resampling yet, which could likely improve our performance.
4.3.6 Discussion
The most evident outcome of our investigation is that the Codeword Activation Strategy and the codebook composition play a significant role in the performance of 3D ISM for categorization. In both datasets k-NN with a global codebook consistently outperforms the cutoff
threshold with both kinds of codebook composition, regardless of the choice of k. This confirms two intuitions:
• that the intrinsic adaptation to codeword density in the feature space provided by k-NN is more suitable for the categorization of database entries, i.e. in the absence of clutter, since it enhances the ISM generalization ability;
• that the global codebook, when compatible with the application constraints on memory occupancy and computation time, endows ISM with a higher, inter-class generalization power.
Experiments also reveal a tight coupling between the use of k-NN and the global codebook: k-NN with separated codebooks exhibits unsatisfactory performance, even with respect to the cutoff strategy. With the global codebook, the k nearest neighbor codewords for a test feature are the same for each tested category, i.e. they represent the overall k most similar features among those belonging to all the categories seen in the training stage; what then differs between categories is how these codewords vote in the different ISMs. In particular, it is worth pointing out that, differently from the case of separated codebooks, some of the codewords have no associated votes in the ISM of a specific category. This happens when a codeword is not similar to any training data of that category. Therefore, many of the k activated codewords will likely vote only for a subset of the categories, so that vote accumulation in the Hough Space has more chances to let the true category emerge, being required to filter out only a limited amount of wrong votes. In other words, this configuration balances the impact of the codebook (i.e. of feature similarity) and of the shape model (i.e. of geometric structure) and results in good recognition rates. With separated codebooks, instead, the k nearest neighbors are different in different codebooks, so that in several of them the activated codewords may be very dissimilar to the test feature. Moreover, since there are no codewords without votes in this configuration, all the activated codewords will cast
votes in their shape models. This configuration, therefore, tends to diminish the importance of feature similarity and relies almost completely on the shape models being able to select the correct category. This increases the probability of generating wrong, spurious peaks in the voting space.
The vote weighting strategy does not play a role as important as the other two design decisions. Nevertheless, as far as the k-NN codeword activation strategy is concerned, the Categorization Weights strategy obtains consistently slightly better performance in both datasets and with both kinds of codebooks. This provides experimental evidence for the reasoning of Sec. 4.3.4.
As for the experiments on the cutoff threshold strategy, whilst on the PSB dataset the global codebook is still the favorable option, and there is little difference between the vote weighting strategies, in the case of the ASW dataset the decisive factor for obtaining higher performance seems to be the LW strategy, whereas, unlike in the k-NN case, the codebook options seem to have quite a minor impact. We ascribe the latter to the cutoff strategy intrinsically balancing feature similarity and geometric structure, since dissimilar codewords, given the cutoff threshold, cannot cast votes at recognition time even when separated codebooks are used. On the other hand, it is rather more difficult to explain the higher performance of LW on this dataset. It seems to suggest that in the ASW dataset wrong categories are supported in the voting space by less distinctive codewords, whose vote weights are indeed diminished by using LW.
The confusion matrix in Fig. 4.17 shows how, besides gross errors that must be ascribed to the difficulty of the task, several errors are somehow reasonable for an algorithm that tries to categorize objects based on 3D shape only. For instance, the category "Octopus", for which our proposal fails to recognize the majority of test models, is confused with "Hand", "Armadillo" and "Fourleg", i.e. with categories that present sorts of "limbs" in configurations similar to those assumed by the models in the "Octopus" category. 40% of the "Fourleg" test models are wrongly categorized as "Armadillo", which, again, in some
training models appears in a Fourleg-like pose. All the wrongly assigned test models of "Bearing" are labeled as "Table" or "Plier", which have parts (the legs, the handles) that are shaped like bearings. Given that this dataset can be successfully categorized using shape only when enough training data can be deployed, as our 100% result in the Leave-One-Out test demonstrates, the mostly reasonable errors in the confusion matrix show that our proposal is able to learn a plausible, although less specific, model of the category shape in the presence of less training data.
Conclusions
This dissertation has presented the research activity concerning adaptive visual tracking carried out during the Ph.D. course. In particular, three main contributions related to adaptive tracking have been presented: adaptive transition models, adaptive appearance models, and an adaptive Bayesian loop for tracking based on change detection in the case of static cameras. Moreover, our work on the recognition of object categories in 3D data has been described.
As far as adaptive transition models are concerned, a new approach to build an adaptive recursive Bayesian estimation framework has been introduced, both from a theoretical point of view and in terms of its instantiation in the case of linear transition and measurement models and Gaussian noise. The proposed SVK filter has been shown to outperform a standard Kalman solution, requiring fewer parameters to be arbitrarily (and possibly wrongly) tuned. In the linear and Gaussian scenario, an interesting future investigation concerns the evaluation of the proposed approach against comparable solutions for adaptive Kalman filtering (i.e. Covariance Matching Techniques and [109]).
We also see this work, as all the contributions of this thesis, as a step toward a general and parameter-free tracking system. Following this vision, another interesting future work will be directed to the insertion of algorithms for automatic on-line selection of SVR parameters. Finally, the instantiation of our proposal also in the case of non-linear and non-Gaussian tracking, in particular by modifying it in order to be beneficially used also with particle filters, would be a great contribution to foster its applicability and adoption.
As far as adaptive appearance models are concerned, our contribution has been twofold: we presented a critical review and classification of the most significant, recently proposed algorithms that deal with model adaptation; and we cast the problem of model update as a Recursive Bayesian Estimation problem. Preliminary experimental results, where our proposal was compared on challenging sequences against many state-of-the-art trackers, both adaptive and non-adaptive, are encouraging. The main extension of our proposal would be to define a proper method to compare different features, in order to use the particle filter framework to perform also on-line probabilistic feature selection. Moreover, the proposed importance density and observation likelihoods are just one possible instantiation of this novel framework. They can be modified and extended in several ways: to make them more robust to tracker misalignments, by exploiting the full posterior PDF on the state instead of the current estimate only; to make them more robust to occlusions, by deploying more stable schemes than the sliding window and consequently modifying the PDF evaluation; and to make them fully compliant with the particle filtering framework, by not fully relying on the current frame during proposal density sampling and, hence, allowing for a proper observation likelihood to be defined.
An adaptive Bayesian loop for tracking based on change detection in the case of static cameras has been proposed. On-line training of a binary Bayesian classifier based on background-frame pairs of intensities has been introduced to perform change detection robustly and efficiently in the presence of common sources of disturbance such as illumination changes, camera gain and exposure variations. The ability of such an algorithm to learn a model of admissible intensity variations frame by frame allows it to obtain high sensitivity without sacrificing specificity. Importantly, this promising trade-off is achieved without penalizing efficiency. Based on this novel change detection algorithm, a principled framework to model the interaction between Bayesian change detection and tracking has been presented. By modeling the interaction as the marginalization of the joint probability of the tracker state and
the change mask, it is possible to obtain analytical expressions for the PDFs of the tracker observation likelihood and the change detector prior. The benefits of such interaction have been discussed with experiments on publicly available datasets targeting visual surveillance and the automatic analysis of sport events, where the proposed method outperformed two standard solutions for visual tracking. Several interesting extensions are possible: adapting the probabilistic reasoning on change maps to the case of particle filters; extending the proposed Bayesian algorithm to color-based change detection; taking into account in the loop the number and position of multiple targets and also their appearance, in the spirit of BraMBLe [38] but without requiring a foreground model; and experimenting with multiple sources of measurements, such as color histograms, providing for them, too, a fully specified observation likelihood.
As for the categorization of 3D data, our proposal encompasses the deployment of Implicit Shape Models in combination with a novel proposal for 3D description, dubbed SHOT. We have devised the general structure of a 3D ISM and identified and discussed three design decisions that could improve the performance of the method when used for categorization. Experimental results on two well-known and large datasets demonstrate that the combination of the k-NN codeword activation strategy and the use of a global codebook built from the training data of all categories is more effective for categorization than the standard ISM approach. The votes weighting strategy, on the other hand, does not seem to play such an important role for overall performance. The proposed optimal configuration compares favorably with the state of the art in 3D data categorization, obtaining similar results in one case and outperforming current proposals on the other dataset.
We have also tested the SHOT descriptor on its own. The results validate the intuition that the synergy between the design of a repeatable local RF and the embedding of a hybrid signature/histogram nature into a descriptor allows for achieving state-of-the-art robustness and descriptiveness at the same time. Remarkably, our proposal delivers such notable performance with high computational efficiency.
Starting from SHOT, we have presented a general formulation for multi-cue description of 3D data by signatures of histograms. We have then proposed a specific implementation of this formulation, CSHOT, that realizes a joint texture-shape 3D feature descriptor. CSHOT has been shown to improve the accuracy of SHOT and to obtain state-of-the-art performance on data comprising both shape and texture. By means of experimental evaluation, different combinations of metrics and color spaces have been tested: the L1 norm in the CIELab color space turns out to be the most effective choice.
As for future work, the obvious next step is to deploy 3D ISM to detect category instances in 3D data and initialize a tracker. 3D ISM may also be used to continuously guide a tracker in a tracking-by-detection approach. As for the SHOT descriptor, we plan to investigate how to improve robustness to point density variations. Comparing our proposal with other relevant methods and on larger datasets is another important continuation of this work.
Bibliography
[1] Adam, A., E. Rivlin, and I. Shimshoni (2006). Robust Fragments-based Tracking using the Integral Histogram. In Proc. of the Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) - Volume 1, pp. 798–805. IEEE Computer Society Washington, DC, USA.
[2] Akagunduz, E. and I. Ulusoy (2007). 3D object representation using transform and scale invariant 3D features. In Proc. of the International Conference on Computer Vision (ICCV), pp. 1–8. IEEE
Computer Society, Washington, DC, USA.
[3] Arulampalam, S., S. Maskell, N. Gordon, and T. Clapp (2001).
A tutorial on Particle Filters for On-line Non-linear/Non-Gaussian
Bayesian Tracking. IEEE Transactions on Signal Processing 50, 174–
188.
[4] Avidan, S. (2005). Ensemble tracking. In International Conference
on Computer Vision and Pattern Recognition (CVPR), pp. 494–501.
[5] Babenko, B., M.-H. Yang, and S. Belongie (2009). Visual tracking
with online multiple instance learning. In Proc. of the International
Conference on Computer Vision and Pattern Recognition (CVPR), pp.
983–990. IEEE Computer Society Washington, DC, USA.
[6] Bay, H., A. Ess, T. Tuytelaars, and L. J. V. Gool (2008). Speeded-Up
Robust Features (SURF). Computer Vision and Image Understanding 110(3), 346–359.
[7] Blum, A. and T. Mitchell (1998). Combining labeled and unlabeled
data with co-training. In Proc. of the eleventh annual conference
on Computational learning theory (COLT), pp. 92–100. ACM New
York, NY, USA.
[8] Breitenstein, M. D., F. Reichlin, B. Leibe, E. Koller-Meier, and
L. van Gool (2009). Robust tracking-by-detection using a detector
confidence particle filter. In Proc. of the International Conference on
Computer Vision (ICCV), pp. 1515–1522. IEEE Computer Society
Washington, DC, USA.
[9] Bro, R., E. Acar, and T. Kolda (2008). Resolving the sign ambiguity
in the singular value decomposition. Journal of Chemometrics 22,
135–140.
[10] Calonder, M., V. Lepetit, C. Strecha, and P. Fua (2010). Brief:
Binary robust independent elementary features. In Proc. of the European Conference on Computer Vision (ECCV) - Part IV, Heraklion,
Greece, pp. 778–792. Springer-Verlag, Berlin, Heidelberg.
[11] Cao, L. and Q. Gu (2002). Dynamic Support Vector Machines for
non-stationary time series forecasting. Intelligent Data Analysis 6,
67–83.
[12] Chen, H. and B. Bhanu (2007). 3d free-form object recognition
in range images using local surface patches. International Journal of
Pattern Recognition and Artificial Intelligence 28(10), 1252–1262.
[13] Chu, W., S. Keerthi, and C. J. Ong (2004, Jan.). Bayesian Support
Vector Regression using a unified loss function. IEEE Transactions
on Neural Networks 15(1), 29–44.
[14] Chua, C. S. and R. Jarvis (1997). Point signatures: A new representation for 3d object recognition. International Journal of Computer
Vision (IJCV) 25(1), 63–85.
[15] Collins, R. T., A. J. Lipton, and T. Kanade (1999). A system for
video surveillance and monitoring. Technical report, Robotics Institute at Carnegie Mellon University.
[16] Collins, R. T., Y. Liu, and M. Leordeanu (2005, oct). Online selection of discriminative tracking features. IEEE transactions on pattern
analysis and machine intelligence 27(10), 1631–43.
[17] Comaniciu, D., V. Ramesh, and P. Meer (2003). Kernel-based object tracking. IEEE Transactions Pattern Analysis and Machine Intelligence (PAMI) 25(5), 564–575.
[18] Conde, C., L. Rodrı́guez-Aragón, and E. Cabello (2006). Automatic 3d face feature points extraction with spin images. International Conference on Image Analysis and Recognition (ICIAR) 4142,
317–328.
[19] Csurka, G., C. Bray, C. R. Dance, and L. Fan (2004). Visual categorization with bags of keypoints. In Proc. of. European Conference
of Computer Vision - Workshop on Statistical Learning in Computer
Vision (ECCV), Lecture Notes In Computer Science (LNCS), pp. 1–
22. Springer-Verlag, London.
[20] Dalal, N. and B. Triggs (2005). Histograms of oriented gradients
for human detection. In Proc. of the Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR), pp. 886–893.
IEEE Computer Society Washington, DC, USA.
[21] Davis, J., D. Nehab, R. Ramamoothi, and S. Rusinkiewicz (2005).
Spacetime stereo : A unifying framework for depth from triangulation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 27(2), 1615–1630.
[22] D’Orazio, T., M. Leo, N. Mosca, P. Spagnolo, and P. L. Mazzeo
(2009). A semi-automatic system for ground truth generation of soccer video sequences. In Proc. of Sixth International Conference on
Advanced Video and Signal Based Surveillance (AVSS), pp. 559–564.
IEEE Computer Society Washington, DC, USA.
[23] Elgammal, A., D. Harwood, and L. Davis (1999). Non-parametric
model for background subtraction. In Proc. of the International Conference on Computer Vision (ICCV), pp. 751–767. IEEE Computer
Society, Washington, DC, USA.
[24] Elhabian, S. Y., K. M. El-Sayed, and S. H. Ahmed (2008). Moving
object detection in spatial domain using background removal techniques - state-of-art. Recent Patents on Computer Sciences 1, 32–54.
[25] Fairchild, M. (2005). Color Appearance Models. John Wiley &
Sons Ltd.
[26] Freeman, W. T. and E. H. Adelson (1991, September). The design
and use of steerable filters. IEEE transactions on pattern analysis and
machine intelligence 13(10), 891–906.
[27] Frome, A., D. Huber, R. Kolluri, T. Bülow, and J. Malik (2004).
Recognizing objects in range data using regional point descriptors.
In Proc. of the European Conference on Computer Vision (ECCV) Volume III, Lecture Notes In Computer Science (LNCS), pp. 224–
237. Springer-Verlag, London.
[28] Gao, J., S. Gunn, C. Harris, and M. Brown (2 January 2002). A
probabilistic framework for SVM regression and error bar estimation.
Machine Learning 46, 71–89(19).
[29] Grabner, H. and H. Bischof (2006). On-line boosting and vision.
In Proc. of the Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR) - Volume 1, pp. 260–267. IEEE Computer Society Washington, DC, USA.
[30] Grabner, H., C. Leistner, and H. Bischof (2008). Semi-supervised
on-line boosting for robust tracking. In Proc. of the tenth European
Conference on Computer Vision (ECCV) - Part I, Lecture Notes In
Computer Science (LNCS), pp. 234–247. Springer-Verlag, Berlin,
Heidelberg.
[31] Grossberg, S. (1988). Competitive learning: from interactive activation to adaptive resonance, pp. 243–283. Norwood, NJ, USA:
Ablex Publishing Corp.
[32] Haritaoglu, I., D. Harwood, and L. S. Davis (2000). W4: Realtime surveillance of people and their activities. IEEE Transactions on
Pattern Analysis and Machine Intelligence (PAMI) 22, 809–830.
[33] Harville, M. (2002, July). A framework for high-level feedback to adaptive, per-pixel, mixture-of-Gaussian background models. In Proc. of the seventh European Conference on Computer Vision
(ECCV) - Part III, Lecture Notes In Computer Science (LNCS), pp.
543–560. Springer-Verlag, London.
[34] Harville, M. and D. Li (2004). Fast, integrated person tracking
and activity recognition with plan-view templates from a single stereo
camera. In Proc. of the computer society conference on Computer vision and pattern recognition (CVPR) - Volume 2, pp. 398–405. IEEE
Computer Society, Washington, DC, USA.
[35] Hoppe, H., T. DeRose, T. Duchamp, J. McDonald, and W. Stuetzle
(1992). Surface reconstruction from unorganized points. In Proc. of
the Compute Graphics Conference (SIGGRAPH), pp. 71–78.
[36] Horn, B. K. P. (1987). Closed-form solution of absolute orientation
using unit quaternions. Journal of the Optical Society of America A
(JOSA A) 4(4), 629–642.
[37] Isard, M. and A. Blake (1998). Condensation – conditional density
propagation for visual tracking. International Journal of Computer
Vision (IJCV) 29(1), 5–28.
[38] Isard, M. and J. MacCormick (2001, July). Bramble: A bayesian
multiple-blob tracker. In Proc. of the International Conference on
Computer Vision (ICCV) - Volume 2, pp. 34–41. IEEE Computer Society, Washington, DC, USA.
[39] Iyer, M., S. Jayanti, K. Lou, Y. Kalyanaraman, and K. Ramani
(2005). Three dimensional shape searching: state-of-the-art review
and future trends. Computer Aided Design 5(15), 509–530.
[40] Jepson, A. D., D. J. Fleet, and T. F. El-Maraghi (2003). Robust
online appearance models for visual tracking. IEEE Transactions on
Pattern Analysis and Machine Intelligence 25(10), 1296–1311.
[41] Johnson, A. and M. Hebert (1999). Using spin images for efficient
object recognition in cluttered 3d scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 21(5), 433–449.
[42] Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the American Society Of Mechanical Engineers (ASME)–Journal of Basic Engineering 82(Series D),
35–45.
[43] Ke, Y. and R. Sukthankar (2004). Pca-sift: A more distinctive
representation for local image descriptors. In Proc. of the Computer Society Conference on Computer Vision and Pattern Recognition Workshops(CVPR) - Volume 2, pp. 506–513. IEEE Computer Society Washington, DC, USA.
[44] Koenderink, J. and A. Doorn (1992). Surface shape and curvature
scales. Image Vision Computing 8, 557–565.
[45] Kwon, J. and K. M. Lee (2009). Tracking of a non-rigid object
via patch-based dynamic appearance modeling and adaptive Basin
Hopping Monte Carlo sampling. In Proc. of the Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR), pp.
1208–1215. IEEE Computer Society Washington, DC, USA.
[46] Kwon, J. and K. M. Lee (2010). Visual tracking decomposition. In
Proc. of the Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 1269–1276. IEEE Computer Society Washington, DC, USA.
[47] Lanza, A., L. Di Stefano, and L. Soffritti (2009). Bayesian order-consistency testing with class priors derivation for robust change detection. In Proc. of Sixth International Conference on Advanced Video
and Signal Based Surveillance (AVSS), pp. 460–465. IEEE Computer
Society Washington, DC, USA.
[48] Lanza, A. and L. D. Stefano (2006). Detecting changes in grey
level sequences by ML isotonic regression. In Proc. of the International Conference on Advanced Video and Signal-based Surveillance
(AVSS), pp. 1–4. IEEE Computer Society, Washington, DC, USA.
[49] Lee, K.-C. and D. Kriegman (2005). Online Learning of Probabilistic Appearance Manifolds for Video-Based Recognition and
Tracking. In Proc. of the Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR) - Vol. 2, pp. 852–859. IEEE
Computer Society Washington, DC, USA.
[50] Leibe, B., A. Leonardis, and B. Schiele (2008, May). Robust object
detection with interleaved categorization and segmentation. International Journal of Computer Vision 77(1-3), 259–289.
[51] Lin, C.-J. and R. C. Weng (2004). Simple probabilistic predictions for Support Vector Regression. Technical report, Department of
Computer Science, National Taiwan University.
[52] Liu, Y., H. Zha, and H. Qin (2006). Shape topics: a compact representation and new algorithms for 3d partial shape retrieval. In Proc.
of the Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR) - Volume 2, pp. 2025–2032. IEEE Computer Society Washington, DC, USA.
[53] Lou, J., H. Yang, W. Hu, and T. Tan (2002). An illumination-invariant change detection algorithm. In Proc. of the 5th Asian Conference on Computer Vision (ACCV) - Volume 1, pp. 13–18.
[54] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision
(IJCV) 60, 91–110.
[55] Lu, L. and G. D. Hager (2007). A nonparametric treatment for
location / segmentation based visual tracking. In Proc. of the Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8. IEEE Computer Society Washington, DC,
USA.
[56] Matas, J., O. Chum, M. Urba, and T. Pajdla (2002). Robust wide
baseline stereo from maximally stable extremal regions. In Proc. of
British Machine Vision Conference (BMVC), pp. 384–396.
[57] Matthews, I., T. Ishikawa, and S. Baker (2004, June). The template
update problem. IEEE Transactions on Pattern Analysis and Machine
Intelligence 26(1), 810–815.
[58] Mehra, R. (1972, Oct). Approaches to adaptive filtering. IEEE
Transactions on Automatic Control 17(5), 693–698.
[59] Mian, A., M. Bennamoun, and R. Owens (2006). A novel representation and feature matching algorithm for automatic pairwise registration of range images. International Journal of Computer Vision
(IJCV) 66(1), 19–40.
[60] Mian, A. S., M. Bennamoun, and R. A. Owens (2010). On the
repeatability and quality of keypoints for local feature-based 3d object
retrieval from cluttered scenes. International Journal of Computer
Vision (IJCV) 89(2-3), 348–361.
[61] Mikolajczyk, K. and C. Schmid (2005). A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and
Machine Intelligence (PAMI) 27(10), 1615–1630.
[62] Mikolajczyk, K., T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas,
F. Schaffalitzky, T. Kadir, and L. Van Gool (2005, oct). A Comparison
of Affine Region Detectors. International Journal of Computer Vision
(IJCV) 65(1-2), 43–72.
[63] Mitra, N. J., A. Nguyen, and L. Guibas (2004). Estimating surface
normals in noisy point cloud data. International Journal of Computational Geometry and Applications 14(4–5), 261–276.
[64] Mittal, A. and V. Ramesh (2006). An intensity-augmented ordinal
measure for visual correspondence. In Proc. of the Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR) -
Volume 1, pp. 849–856. IEEE Computer Society, Washington, DC,
USA.
[65] Muja, M. and D. G. Lowe (2009). Fast approximate nearest neighbors with automatic algorithm configuration. In Proc. of International
Conference on Computer Vision Theory and Application (VISSAPP),
pp. 331–340. INSTICC Press.
[66] Novatnack, J. and K. Nishino (2008). Scale-dependent/invariant
local 3d shape descriptors for fully automatic registration of multiple
sets of range images. In Proc. of the European Conference on Computer Vision (ECCV), pp. 440–453. Springer-Verlag, Berlin, Heidelberg.
[67] Ohbuchi, R., K. Osada, T. Furuya, and T. Banno (2008). Salient
local visual features for shape-based 3d model retrieval. In Proc. of
the Int. Conf. on Shape Modeling and Applications (SMI), pp. 93–102.
[68] Ohta, N. (2001). A statistical approach to background subtraction
for surveillance systems. In Proc. of the International Conference on
Computer Vision (ICCV) - Volume 2, pp. 481–486. IEEE Computer
Society, Washington, DC, USA.
[69] Ojala, T., M. Pietikainen, and T. Maenpaa (2002, July). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE transactions on pattern analysis and machine intelligence 24(7), 971–987.
[70] Oussalah, M. and J. De Schutter (2000). Adaptive Kalman filter
for noise identification. In Proc. of the 25th International Conference
on Noise and Vibration Engineering (ISMA).
[71] Ovsjanikov, M., J. Sun, and L. Guibas (2008). Global intrinsic
symmetries of shapes. Computer Graphics Forum 5, 1341–1348.
[72] Oza, N. C. (2001, Sep). Online Ensemble Learning. Ph. D. thesis,
The University of California, Berkeley, CA.
[73] Pinz, A. (2005). Object categorization. Foundation and Trends in
Computer Graphics and Vision 1(4), 255–353.
[74] Platt, J. C. (1999). Fast training of Support Vector Machines using Sequential Minimal Optimization. Advances in kernel methods:
support vector learning, 185–208.
[75] Poggio, T., S. Mukherjee, R. Rifkin, A. Rakhlin, and A. Verri
(2001). b. CBCL Paper 198/AI Memo 2001-011.
[76] Pontil, M., S. Murkerjee, and F. Girosi (1998). On the noise
model of Support Vector Machine Regression. Technical report, Massachusetts Institute of Technology, Cambridge, MA, USA.
[77] Prati, A., I. Mikic, M. M. Trivedi, and R. Cucchiara (2003). Detecting moving shadows: Algorithms and evaluation. IEEE Transactions
on Pattern Analysis and Machine Intelligence 25(7), 918–923.
[78] Pérez, P., C. Hue, J. Vermaak, and M. Gangnet (2002). Color-based probabilistic tracking. In Proc. of the seventh
European Conference on Computer Vision (ECCV) - Part I, Lecture
Notes In Computer Science (LNCS), pp. 661–675. Springer-Verlag,
London.
[79] Ristic, B., S. Arulampalam, and N. Gordon (2004). Beyond the
Kalman Filter: Particle Filters for Tracking Applications. Artech
House.
[80] Ross, D. A., J. Lim, R.-S. Lin, and M.-H. Yang (2008). Incremental
learning for robust visual tracking. International Journal of Computer
Vision (IJCV) 77(1-3), 125–141.
[81] Schweighofer, G. and A. Pinz (2006, Dec.). Robust pose estimation from a planar target. IEEE Transactions on Pattern Analysis and
Machine Intelligence 28(12), 2024–2030.
[82] Serre, T., L. Wolf, and T. Poggio (2005). A new biologically motivated framework for robust object recognition. In Proc. of the Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society Washington, DC, USA.
[83] Shilane, P., P. Min, M. Kazhdan, and T. Funkhouser (2004). The
princeton shape benchmark. In Proc. of Shape Modeling International (SMI), pp. 167–178. IEEE Computer Society, Washington, DC,
USA.
[84] Sivic, J., B. Russell, A. Efros, and A. Zisserman (2005). Discovering objects and their location in images. In Proc. of the International
Conference on Computer Vision (ICCV) - Volume 1, pp. 370–377.
IEEE Computer Society, Washington, DC, USA.
[85] Sivic, J. and A. Zisserman (2006). Video google: Efficient visual
search of videos. In Toward Category-Level Object Recognition, Lecture notes in computer science, pp. 127–144. Springer-Verlag, Berlin,
Heidelberg.
[86] Smola, A. J. and B. S. Olkopf (1998). A tutorial on Support Vector
Regression. Technical report, Statistics and Computing.
[87] Somanath, G. and C. Kambhamettu (2010). Abstraction and generalization of 3d structure. In Proc. of the Asian Conference of Computer Vision (ACCV) - Part III, Lecture Notes in Computer Science,
pp. 483–496. Springer.
[88] Song, X., J. Cui, H. Zha, and H. Zhao (2008). Vision-based multiple interacting targets tracking via on-line supervised learning. In
Proc. of the tenth European Conference on Computer Vision (ECCV)
- Part III, Lecture Notes In Computer Science (LNCS), pp. 642–655.
Springer-Verlag, Berlin, Heidelberg.
[89] Stalder, S., H. Grabner, and L. van Gool (2009). Beyond semi-supervised tracking: Tracking should be as simple as detection, but
not simpler than recognition. In Proc. of the International Conference on Computer Vision (ICCV) - Workshop on On-line Learning
for Computer Vision., pp. 1409. IEEE Computer Society Washington,
DC, USA.
[90] Stauffer, C. and W. E. L. Grimson (1999). Adaptive background
mixture models for real-time tracking. In Proc. of the Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)
- Volume 2, pp. 246–252. IEEE Computer Society, Washington, DC,
USA.
[91] Stein, F. and G. Medioni (1992). Structural indexing: Efficient
3-d object recognition. IEEE Transactions on Pattern Analysis and
Machine Intelligence (PAMI) 14(2), 125–145.
[92] Sun, Y. and M. A. Abidi (2001). Surface matching by 3d point’s
fingerprint. IEEE International Conference on Computer Vision
(ICCV) 2, 263–269.
[93] Tang, F., S. Brennan, Q. Zhao, and H. Tao (2007). Co-tracking
using semi-supervised support vector machines. In Proc. of the International Conference on Computer Vision (ICCV), pp. 1–8. IEEE
Computer Society Washington, DC, USA.
[94] Tangelder, J. W. H. and R. C. Veltkamp (2004). A survey of content
based 3d shape retrieval methods. In Proc. of the Shape Modeling International Conference (SMI), pp. 145–156. IEEE Computer Society
Washington, DC, USA.
[95] Taycher, L., J. W. Fisher III, and T. Darrell (2005, January). Incorporating object tracking feedback into background maintenance framework. In Proc. of the Workshop on Motion and Video Computing
(WACV/MOTION) - Volume 2, pp. 120–125. IEEE Computer Society,
Washington, DC, USA.
[96] Thomas, A., V. Ferrari, B. Leibe, T. Tuytelaars, and L. van Gool
(2007). Depth-from-recognition: Inferring metadata by cognitive
feedback. In Proc. of the International Conference on Computer
Vision (ICCV), pp. 1–8. IEEE Computer Society, Washington, DC,
USA.
[97] Toldo, R., U. Castellani, and A. Fusiello (2009). A bag of words
approach for 3d object categorization. In Proc. of the the forth International Conference on Computer Vision/Computer Graphics Collaboration Techniques (MIRAGE), pp. 116–127. Springer-Verlag, Berlin,
Heidelberg.
[98] Unnikrishnan, R. and M. Hebert (2008). Multi-scale interest regions from unorganized point clouds. In Proc. of the Computer Society Conference on Computer Vision and Pattern Recognition Workshops(CVPR), pp. 1–8. IEEE Computer Society Washington, DC,
USA.
[99] Vapnik, V. N. (1995). The nature of statistical learning theory.
New York, NY, USA: Springer-Verlag New York, Inc.
[100] Viola, P. and M. Jones (2002). Robust real-time object detection.
International Journal of Computer Vision 57(2), 137–154.
[101] Weng, S.-K., C.-M. Kuo, and S.-K. Tu (2006). Video object tracking using adaptive Kalman filter. Journal of Visual Communication
and Image Representation 17(6), 1190–1208.
[102] Xie, B., V. Ramesh, and T. Boult (2004, feb). Sudden illumination change detection using order consistency. Image and Vision
Computing 22(2), 117–125.
[103] Yang, M., L. Fengjun, X. Wei, and G. Yihong (2009). Detection
driven adaptive multi-cue integration for multiple human tracking. In
Proc. of the International Conference on Computer Vision (ICCV),
pp. 1554–1561. IEEE Computer Society Washington, DC, USA.
[104] Yilmaz, A., O. Javed, and M. Shah (2006, dec). Object Tracking:
A Survey. ACM Computing Surveys 38(4).
[105] Yu, Q., T. B. Dinh, and G. Medioni (2008). Online tracking and
reacquisition using co-trained generative and discriminative trackers.
In Proc. of the European Conference on Computer Vision (ECCV)
- Part II, Lecture notes in computer sciences (LNCS), pp. 678–691.
Springer-Verlag, Berlin, Heidelberg.
[106] Zaharescu, A., E. Boyer, K. Varanasi, and R. P. Horaud (2009,
June). Surface feature detection and description with applications
to mesh matching. In Proc. of the Computer Society Conference
on Computer Vision and Pattern Recognition(CVPR), pp. 373–380.
IEEE Computer Society Washington, DC, USA.
[107] Zelniker, E. E., T. M. Hospedales, S. Gong, and T. Xiang (2009).
A unified approach for adaptive multiple feature tracking for surveillance applications. In Proc. of the British Machine Vision Conference
(BMVC).
[108] Zhang, L., B. Curless, and S. Seitz (2003). Spacetime stereo:
Shape recovery for dynamic scenes. In Proc. of the Computer Society
Conference on Computer Vision and Pattern Recognition(CVPR) Volume 2, pp. 367–374. IEEE Computer Society Washington, DC,
USA.
[109] Zhang, Y., H. Hu, and H. Zhou (2005, June-3 July). Study on
adaptive Kalman filtering algorithms in human movement tracking.
In Proc. of the IEEE International Conference on Information Acquisition (ICIA), pp. 11–15.
[110] Zhao, W., R. Chellappa, P. Phillips, and A. Rosenfeld (2003).
Face recognition: A literature survey. ACM Computing Survey, 399–
458.
[111] Zhong, Y. (2009). Intrinsic shape signatures: A shape descriptor
for 3d object recognition. In Proc. of the International Conference on
Computer Vision Workshops (ICCV), pp. 689–696. IEEE Computer
Society, Washington, DC, USA.
Publications related to this work
• A. Lanza, S. Salti, L. Di Stefano, On-Line Training of a Binary
Bayesian Classifier for Robust and Efficient Background Subtraction, submitted to ICIP 2011.
• F. Tombari, S. Salti, L. Di Stefano, A combined intensity-shape
descriptor for texture-enhanced 3D feature matching, submitted
to ICIP 2011.
• S. Salti, F. Tombari, L. Di Stefano, A Performance Evaluation of
3D Keypoint Detection, The 1st IEEE Joint 3DIM/3DPVT Conference (3DIMPVT), Hangzhou, China, 16-19 May, 2011.
• F. De Crescenzio, M. Fantini, F. Persiani, L. Di Stefano, P. Azzari,
S. Salti, Augmented Reality for Aircraft Maintenance Training and
Operations Support, Computer Graphics and Applications, IEEE,
vol. 31, no. 1, pp. 96-101, January-February 2011.
• S. Salti, F. Tombari, L. Di Stefano, On the use of Implicit Shape
Models for recognition of object categories in 3D data, The 10th
Asian Conference on Computer Vision (ACCV), Queenstown, New
Zealand, 8-12 November, 2010.
• S. Salti, A. Lanza, L. Di Stefano, Bayesian Loop for Synergistic Change Detection and Tracking, The 10th International Workshop on Visual Surveillance (VS), Queenstown, New Zealand, 8
November, 2010.
• F. Tombari, S. Salti, L. Di Stefano, Unique Shape Context for 3D
Data Description , ACM Int. Workshop on 3D Object Retrieval
@ ACM MM 2010, Firenze, Italy, 25-29 October, 2010.
• F. Tombari, S. Salti, L. Di Stefano, Unique Signatures of Histograms for Local Surface Description, The 11th European Con163
Publications related to this work
ference on Computer Vision (ECCV), Heraklion, Crete, Greece,
5-11 September, 2010.
• S. Salti, L. Di Stefano, On-line learning of the Transition Model
for Recursive Bayesian Estimation, The 2nd International Workshop on Machine Learning for Vision-based Motion Analysis (MLVMA)
@ ICCV 2009, Kyoto, Japan, October 2009.
• S. Salti, L. Di Stefano, SVR-based jitter reduction for markerless
Augmented Reality, International Conference on Image Analysis
and Processing (ICIAP), Vietri sul Mare (SL), Italy, September
2009.