Institutionen för systemteknik Department of Electrical Engineering Visual Tracking Examensarbete

Institutionen för systemteknik Department of Electrical Engineering Visual Tracking Examensarbete
Institutionen för systemteknik
Department of Electrical Engineering
Examensarbete
Visual Tracking
Examensarbete utfört i
vid Tekniska högskolan vid Linköpings universitet
av
Martin Danelljan
LiTH-ISY-EX--13/4736--SE
Linköping 2013
Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping, Sweden
Linköpings tekniska högskola
Linköpings universitet
581 83 Linköping
Visual Tracking
Examensarbete utfört i
vid Tekniska högskolan vid Linköpings universitet
av
Martin Danelljan
LiTH-ISY-EX--13/4736--SE
Handledare:
Fahad Khan
ISY ,
Examinator:
Linköpings universitet
Michael Felsberg
ISY ,
Linköpings universitet
Linköping, 12 december 2013
Avdelning, Institution
Division, Department
Datum
Date
Computer Vision Laboratory
Department of Electrical Engineering
SE-581 83 Linköping
Språk
Language
Rapporttyp
Report category
ISBN
Svenska/Swedish
Licentiatavhandling
ISRN
Engelska/English
Examensarbete
C-uppsats
D-uppsats
Övrig rapport
2013-12-12
—
LiTH-ISY-EX--13/4736--SE
Serietitel och serienummer
Title of series, numbering
ISSN
—
URL för elektronisk version
Titel
Title
Visuell följning
Författare
Author
Martin Danelljan
Visual Tracking
Sammanfattning
Abstract
Visual tracking is a classical computer vision problem with many important applications in areas such
as robotics, surveillance and driver assistance. The task is to follow a target in an image sequence. The
target can be any object of interest, for example a human, a car or a football. Humans perform accurate
visual tracking with little effort, while it remains a difficult computer vision problem. It imposes major
challenges, such as appearance changes, occlusions and background clutter. Visual tracking is thus an
open research topic, but significant progress has been made in the last few years.
The first part of this thesis explores generic tracking, where nothing is known about the target except
for its initial location in the sequence. A specific family of generic trackers that exploit the FFT for
faster tracking-by-detection is studied. Among these, the CSK tracker have recently shown obtain
competitive performance at extraordinary low computational costs. Three contributions are made to
this type of trackers. Firstly, a new method for learning the target appearance is proposed and shown to
outperform the original method. Secondly, different color descriptors are investigated for the tracking
purpose. Evaluations show that the best descriptor greatly improves the tracking performance. Thirdly,
an adaptive dimensionality reduction technique is proposed, which adaptively chooses the most important feature combinations to use. This technique significantly reduces the computational cost of
the tracking task. Extensive evaluations show that the proposed tracker outperform state-of-the-art
methods in literature, while operating at several times higher frame rate.
In the second part of this thesis, the proposed generic tracking method is applied to human tracking
in surveillance applications. A causal framework is constructed, that automatically detects and tracks
humans in the scene. The system fuses information from generic tracking and state-of-the-art object
detection in a Bayesian filtering framework. In addition, the system incorporates the identification and
tracking of specific human parts to achieve better robustness and performance. Tracking results are
demonstrated on a real-world benchmark sequence.
Nyckelord
Keywords
Tracking, Computer Vision, Person Tracking, Object Detection, Deformable Parts Model, RaoBlackwellized Particle Filter, Color Names
Abstract
Visual tracking is a classical computer vision problem with many important applications
in areas such as robotics, surveillance and driver assistance. The task is to follow a target
in an image sequence. The target can be any object of interest, for example a human, a car
or a football. Humans perform accurate visual tracking with little effort, while it remains
a difficult computer vision problem. It imposes major challenges, such as appearance
changes, occlusions and background clutter. Visual tracking is thus an open research
topic, but significant progress has been made in the last few years.
The first part of this thesis explores generic tracking, where nothing is known about the target except for its initial location in the sequence. A specific family of generic trackers that
exploit the FFT for faster tracking-by-detection is studied. Among these, the CSK tracker
have recently shown obtain competitive performance at extraordinary low computational
costs. Three contributions are made to this type of trackers. Firstly, a new method for
learning the target appearance is proposed and shown to outperform the original method.
Secondly, different color descriptors are investigated for the tracking purpose. Evaluations show that the best descriptor greatly improves the tracking performance. Thirdly,
an adaptive dimensionality reduction technique is proposed, which adaptively chooses
the most important feature combinations to use. This technique significantly reduces the
computational cost of the tracking task. Extensive evaluations show that the proposed
tracker outperform state-of-the-art methods in literature, while operating at several times
higher frame rate.
In the second part of this thesis, the proposed generic tracking method is applied to human
tracking in surveillance applications. A causal framework is constructed, that automatically detects and tracks humans in the scene. The system fuses information from generic
tracking and state-of-the-art object detection in a Bayesian filtering framework. In addition, the system incorporates the identification and tracking of specific human parts to
achieve better robustness and performance. Tracking results are demonstrated on a realworld benchmark sequence.
iii
Acknowledgments
I want to thank my supervisor Fahad Khan and examiner Michael Felsberg.
Further, I want to thank everyone that have contributed with constructive comments and
discussions. I thank Klas Nordberg for discussing various parts of the theory with me. I
thank Zoran Sjanic for the long and late discussions about Bayesian filtering methods.
Finally, I thank Giulia Meneghetti for helping me set up all the computers I needed.
Linköping, Januari 2014
Martin Danelljan
v
Contents
Notation
1
xi
Introduction
1.1 A Brief Introduction to Visual Tracking . . . . . . . . . . . . . .
1.1.1 Introducing Circulant Tracking by Detection with Kernels
1.2 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . .
1.2.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . .
1.2.3 Approaches and Results . . . . . . . . . . . . . . . . . .
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1
1
3
4
4
5
5
6
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
9
9
9
10
11
11
12
12
12
13
14
14
15
16
Color Features for Tracking
3.1 Evaluated Color Features . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.1 Color Names . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 Incorporating Color into Tracking . . . . . . . . . . . . . . . . . . . .
17
17
19
19
I
Generic Tracking
2
Circulant Tracking by Detection
2.1 The MOSSE Tracker . . . . . . . . . .
2.1.1 Detection . . . . . . . . . . . .
2.1.2 Training . . . . . . . . . . . . .
2.2 The CSK Tracker . . . . . . . . . . . .
2.2.1 Training with a Single Image . .
2.2.2 Detection . . . . . . . . . . . .
2.2.3 Multidimensional Feature Maps
2.2.4 Kernel Functions . . . . . . . .
2.3 Robust Appearance Learning . . . . . .
2.4 Details . . . . . . . . . . . . . . . . . .
2.4.1 Parameters . . . . . . . . . . .
2.4.2 Windowing . . . . . . . . . . .
2.4.3 Feature Value Normalization . .
3
vii
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
viii
CONTENTS
3.2.1
4
5
II
6
7
Color Feature Normalization . . . . . . . . . . . . . . . . . . .
Adaptive Dimensionality Reduction
4.1 Principal Component Analysis . . . . . . .
4.2 The Theory Behind the Proposed Approach
4.2.1 The Data Term . . . . . . . . . . .
4.2.2 The Smoothness Term . . . . . . .
4.2.3 The Total Cost Function . . . . . .
4.3 Details of the Proposed Approach . . . . .
19
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
21
21
22
22
22
23
24
Evaluation
5.1 Evaluation Methodology . . . . . . . . . . . .
5.1.1 Evaluation Metrics . . . . . . . . . . .
5.1.2 Dataset . . . . . . . . . . . . . . . . .
5.1.3 Trackers and Parameters . . . . . . . .
5.2 Circulant Structure Trackers Evaluation . . . .
5.2.1 Grayscale Experiment . . . . . . . . .
5.2.2 Grayscale and Color Names Experiment
5.2.3 Experiments with Other Color Features
5.3 Color Evaluation . . . . . . . . . . . . . . . .
5.3.1 Results . . . . . . . . . . . . . . . . .
5.3.2 Discussion . . . . . . . . . . . . . . .
5.4 Adaptive Dimensionality Reduction Evaluation
5.4.1 Number of Feature Dimensions . . . .
5.4.2 Final Performance . . . . . . . . . . .
5.5 State-of-the-Art Evaluation . . . . . . . . . . .
5.6 Conclusions and Future Work . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
27
27
28
29
29
29
30
31
33
33
33
34
37
38
39
39
40
Tracking Model
6.1 System Overview . . . . . . . . . . . . . . . . . . . . . . .
6.2 Object Model . . . . . . . . . . . . . . . . . . . . . . . . .
6.2.1 Object Motion Model . . . . . . . . . . . . . . . . .
6.2.2 Part Deformations and Motion . . . . . . . . . . . .
6.2.3 The Complete Transition Model . . . . . . . . . . .
6.3 The Measurement Model . . . . . . . . . . . . . . . . . . .
6.3.1 The Image Likelihood . . . . . . . . . . . . . . . .
6.3.2 The Model Likelihood . . . . . . . . . . . . . . . .
6.4 Applying the Rao-Blackwellized Particle Filter to the Model
6.4.1 The Transition Model . . . . . . . . . . . . . . . . .
6.4.2 The Measurement Update for the Non-Linear States
6.4.3 The Measurement Update for the Linear States . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
49
49
50
50
51
51
52
52
53
53
53
54
55
Object Detection
7.1 Object Detection with Discriminatively Trained Part Based Models . . .
57
57
.
.
.
.
.
.
Category Object Tracking
ix
CONTENTS
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
57
58
60
61
61
61
64
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
67
67
67
68
69
69
69
71
71
72
72
72
Results, Discussion and Conclusions
9.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . .
9.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
75
75
78
80
7.2
8
9
7.1.1 Histogram of Oriented Gradients . . . . . . . . . .
7.1.2 Detection with Deformable Part Models . . . . . .
7.1.3 Training the Detector . . . . . . . . . . . . . . . .
Object Detection in Tracking . . . . . . . . . . . . . . . .
7.2.1 Ways of Exploiting Object Detections in Tracking
7.2.2 Converting Detection Scores to Likelihoods . . . .
7.2.3 Converting Deformation Costs to Probabilities . .
Details
8.1 The Appearance Likelihood . . . . . . . . . . .
8.1.1 Motivation . . . . . . . . . . . . . . . .
8.1.2 Integration of the RCSK Tracker . . . . .
8.2 Rao-Blackwellized Particle Filtering . . . . . . .
8.2.1 Parameters and Initialization . . . . . . .
8.2.2 The Particle Filter Measurement Update .
8.2.3 The Kalman Filter Measurement Update .
8.2.4 Estimation . . . . . . . . . . . . . . . .
8.3 Further Details . . . . . . . . . . . . . . . . . .
8.3.1 Adding and Removing Objects . . . . . .
8.3.2 Occlusion Detection and Handling . . . .
A Bayesian Filtering
A.1 The General Case . . . . . . . . . . .
A.1.1 General Bayesian Solution . .
A.1.2 Estimation . . . . . . . . . .
A.2 The Kalman Filter . . . . . . . . . . .
A.2.1 Algorithm . . . . . . . . . . .
A.2.2 Iterated Measurement Update
A.3 The Particle Filter . . . . . . . . . . .
A.3.1 Algorithm . . . . . . . . . . .
A.3.2 Estimation . . . . . . . . . .
A.4 The Rao-Blackwellized Particle Filter
A.4.1 Algorithm . . . . . . . . . . .
A.4.2 Estimation . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
83
83
84
84
84
85
86
86
86
87
88
88
90
B Proofs and Derivations
B.1 Derivation of the RCSK Tracker Algorithm . . . . . . . . . .
B.1.1 Kernel Function Proofs . . . . . . . . . . . . . . . . .
B.1.2 Derivation of the Robust Appearance Learning Scheme
B.2 Proof of Equation 6.13 . . . . . . . . . . . . . . . . . . . . .
B.2.1 Proof of Uncorrelated Parts . . . . . . . . . . . . . .
B.2.2 Derivation of the Weight Update . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
93
93
93
94
95
96
96
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
x
Bibliography
CONTENTS
99
Notation
S ETS
Notation
Meaning
Z
R
C
`p (M, N )
`D
p (M, N )
The set of integers
The set of real numbers
The set of complex numbers
The set of all functions f : Z × Z → R with period (M, N )
The set of all functions f : Z×Z → RD with period (M, N )
F UNCTIONS AND O PERATORS
Notation
Meaning
h·, ·i
k·k
|·|
∗
?
F
F −1
τm,n
κ(f, g)
x
p(x)
N (µ, C)
Inner product
L2 -norm
Absolute value or cardinality
Convolution
Correlation
The Discrete Fourier Transform on `p (M, N )
The Inverse Discrete Fourier Transform on `p (M, N )
The shift operator (τm,n f )(k, l) = f (k − m, l − n)
Kernel on some space X , with f, g ∈ X
Complex conjugate of x ∈ C
Probability of x
Gaussian distribution with mean µ and covariance matrix C
xi
1
Introduction
Visual tracking can be defined as the problem of estimating the trajectory of one or multiple objects in an image sequence. It is an important and classical computer vision problem, which has received much research attention over the last decades. Visual tracking
has many important applications. It often acts as a part in higher level systems, e.g. automated surveillance, gesture recognition and robotic navigation. It is generally a challenging problem for numerous reasons, including fast motion, background clutter, occlusions
and appearance changes.
This thesis explores various aspects of visual tracking. The rest of of this chapter is
organized as follows. Section 1.1 gives a brief introduction to the visual tracking field.
Section 1.2 contains the thesis problem formulation and motivation together with a brief
overview of the contributions and results. Section 1.3 describes the outline of the rest of
the thesis.
1.1
A Brief Introduction to Visual Tracking
A survey of visual tracking methods lies far outside the scope of this document. The
visual tracking field is extremely diverse, and it does not exist anything close to a unified
theory. This section introduces some concepts in visual tracking that are related to the
work in this thesis. See [48] for a more complete (but rather outdated) survey.
Visual tracking methods largely depend on the application and the amount of available
prior information. There are for example applications where the camera and background
are known to be static. In such cases one can employ background subtraction techniques
to detect moving targets. In many cases the problem is to track certain kinds or classes
of objects, e.g. humans or cars. The general appearance of objects in the class can thus
be used as prior information. This type of tracking problems are sometimes referred to
1
2
1
Introduction
750
Figure 1.1: Visual tracking of humans in the Town Centre sequence with the proposed framework for category tracking. The bounding boxes mark humans that are
tracked in the current frame. The colored lines show the trajectories up to the current
frame.
as category tracking, as the task is to track a category of objects. Many visual tracking
methods deal with the case where only the initial target position in a sequence is known.
This case is often called generic tracking.
Visual tracking of an object can in its simplest form be divided into two parts.
1. Modelling.
2. Tracking.
The first step constructs a model of the object. This model can include various degrees of
a priori information and it can be updated in each new frame with new information about
the object. The model should include some representation of the object appearance. The
appearance can for example be modelled using the object shape, as templates, histograms
or by parametric representations of distributions. An often essential part of the modelling
is the choice of image features. Popular choices are intensity, color and edge features.
Models can also include information about the expected object motion.
The second step deals with how to use the model to find the position of the object in the
next frame. Methods applied in this step are highly dependent on the used model. A very
simple example of a model is to use the pixel values of a rectangular image area around
the object. A simple way of tracking with this model is by correlating the new frame with
this image patch and finding the maximum response.
A popular trend in how to solve this modelling/tracking problem is to apply tracking by
detection. This is a quite loose categorization of a kind of visual tracking algorithms.
1.1
3
A Brief Introduction to Visual Tracking
Tracking by detection essentially means that some kind of machine learning technique is
used to train a discriminative classifier of the object appearance. The tracking step is then
done by classifying image regions to find the most probable location of the object.
The second part of this thesis investigates automatic category tracking. This problem includes automatic detection and tracking of all objects in the scene that is of a certain class.
Such tracking is visualized in figure 1.1. This problem contains additional challenges
compared to generic tracking. However, there are known information about the object
class, that can be exploited to achieve more robust visual tracking. The question is then
how to fuse this information into a visual tracking framework.
Visual tracking techniques can also be categorized by which object properties or states
that are estimated. A visual tracker should be able to track the location of the object
in the image. The location is often described with a two-dimensional image coordinate,
i.e. two states. However, more general transformations than just pure translations can be
considered. Many trackers attempt to also track the scale (i.e. size) of the object in the
image. This adds another one or two states to be estimated, depending on if uniform or
non-uniform scale is used. The orientation of the object can also be added as a state to
be estimated. Even more general image transformations such as affine transformations
or homographies can be considered. However, in general a state can be any property of
interest, e.g. complex shape, locations of object parts and appearance modes.
1.1.1
Introducing Circulant Tracking by Detection with Kernels
This section gives a brief introduction to the CSK tracker [21], which is of importance to
this thesis. It is studied in detail in chapter 2, but is introduced here since it is included
in the problem formulation of this thesis. As mentioned earlier, methods that apply tracking by detection are becoming increasingly more popular. Most of these use some kind
of sparse sampling strategy to harvest training examples for a discriminative classifier.
The processing of each sample independently requires much computational effort when
it comes to feature extraction, training and detection. It is clear that a set of samples contains redundant information if they are sampled with an overlap. The CSK exploits this
redundancy for much faster computation.
The CSK tracker relies on a kernelized least squares classifier [7]. The task is to classify
patches of an image region as the object or background. For simplicity, one-dimensional
signals are considered here. Let z be a sample from such a signal, i.e. z ∈ RM . Let
φ : RM 7→ H be a mapping from RM to some Hilbert space H [28]. Let f be a linear
classifier in H given by (1.1), where v ∈ H.
f (z) = hφ(z), vi
(1.1)
The classifier is trained using a sample x ∈ RM from the region around the object. The
set of training examples consists of all cyclic shifts xm , m ∈ {0, . . . , M − 1} of x. The
classifier is derived using regularized least squares, which means minimizing the loss function (1.2), where y ∈ RM contains the label for each example and λ is a regularization
4
1
Introduction
parameter.
=
M
−1
X
|f (xm ) − ym |2 + λkvk2
(1.2)
m=0
The v that minimizes
P (1.2) can be written as a linear combination of the mapped training
examples, v = k αk φ(xk ). By using the kernel trick, a closed form solution can be
derived. This is given by (1.3). See [7] for a complete derivation.
α = (K + λI)−1 y
(1.3)
K is the kernel matrix, with the elements Kij = κ(xi , xj ). κ is the kernel function that
defines the inner product in H, κ(x, z) = hφ(x), φ(z)i, ∀x, z ∈ RM . The classification
of an example z ∈ RM is done using (1.4).
ŷ =
M
−1
X
αm κ(z, xm )
(1.4)
m=0
The result generalizes to images. Let x be a grayscale image patch of the target. Define
the kernelized correlation as ux (m, n) = κ(xm,n , x), where xm,n are cyclic shifts of
x. Again, y(m, n) are the corresponding labels. Let capital letters denote the Discrete
Fourier Transform (DFT) [16] of the respective two-dimensional signals. It can be shown
that the Fourier transformed coefficients A of the kernelized least squares classifier can
be calculated using (1.5), if the classifier is trained on the single image patch x.
A=
Y
Ux + λ
(1.5)
The classification of all cyclic shifts of a grayscale image patch z can be written as a
convolution, which is a product in the Fourier domain. The classification score ŷ of the
image region z is computed with (1.6), where we have defied uz (m, n) = κ(zm,n , x).
ŷ = F −1 {AUz }
(1.6)
A notable feature of this tracker is that most computations can be done using the Fast
Fourier Transform (FFT) [16]. Thereby exceptionally low computational cost compared
to most other visual trackers is obtained.
1.2
Thesis Overview
This section contains an overview of this thesis. The problem formulation is stated in
section 1.2.1. Section 1.2.2 describes the motivation behind the thesis. Section 1.2.3
briefly presents the approaches, contributions and results.
1.2.1
Problem Formulation
The goal of this master thesis is to research the area of visual tracking, with the focus
on generic tracking and automatic category tracking. This shall be done through the
following scheme.
1.2
Thesis Overview
5
1. Study the CSK tracker [21] and investigate how it can be improved. The goal is
specifically to improve it with color features and new appearance learning methods.
2. Use the acquired knowledge to build a framework for causal automatic category
tracking of humans in surveillance scenes. The main goal is to investigate how
deformable parts models can be used in such a framework, to be able to track
defined parts of a human along with the human itself.
1.2.2
Motivation
In the recent benchmark evaluation of generic visual trackers by Wu et al [46], the CSK
tracker was shown to provide the highest speed among the top 10 trackers. Because
of the simplicity of this approach, it holds the potential of being improved even further.
The goal is to achieve sate-of-the-art tracking performance at faster than real-time frame
rates. Such a tracker has many interesting applications, for example in robotics, where
the computational cost often is a major limiting factor. One other interesting example is
scenarios where it is desired to track a large number of targets in real-time, for example
in automated surveillance of crowded scenes.
Many generic trackers including the CSK, only rely on image intensity information and
thus discard all color information present in the images. Although, the use of color information has proven successful in related computer vision areas, such as object detection
and recognition [41, 26, 49, 42, 25], it has not been thoroughly investigated for tracking
purposes yet. Changing the feature representation in a generic tracking framework often
requires modifications of other parts in the framework as well. It is therefore necessary to
look at the whole framework to avoid suboptimal ad hoc solutions.
In many major applications of visual tracking, the task is to track certain classes of objects, often humans. Safety systems in cars and automated surveillance are examples of
such applications. Many existing category tracking frameworks in literature use pure data
association of object detections, thus discarding most of the temporal information. Many
recent frameworks also use global optimization over time windows in the sequence, thus
disregarding the causality requirement which is existent in most practical applications.
These are the motivations behind creating an automatic category tracking framework that
is causal and thoroughly exploits the temporal dimension to achieve robust and high precision tracking.
Recently the deformable parts model detector [15] has been used in category tracking
of humans. However, this has not yet been attempted in a causal framework. By jointly
tracking an object and a set parts, a more correct deformable model can be applied that
may increase accuracy and robustness. This might especially increase the robustness to
partial occlusions, that is a common problem. Furthermore, part locations and trajectories
is of interest in action detection and recognition, which are computer vision topics with
related applications.
1.2.3
Approaches and Results
Three contributions are made to the CSK tracker. Firstly an appearance learning method
is derived, which significantly improves the robustness and performance of this tracker. In
6
1
Introduction
evaluations the proposed method, named RCSK, is shown to outperform the original one
when using multidimensional feature maps (e.g. color). Secondly an extensive evaluation
of different color representations was done. This evaluation shows that color improves the
tracking performance significantly and that the Color Names descriptor [43] is the best
choice. Thirdly, an adaptive dimensionality reduction technique is proposed to reduce
the feature dimensionality, thereby achieving a significant speed boost with a negligible
effect on the tracking performance. This technique adaptively chooses the most important
combinations of features.
Comprehensive evaluations are done to validate the performance gains of the proposed
improvements. These include a comparison between a large number of different color
representations for tracking. Lastly, the proposed generic visual tracker methods are compared to existing state-of-the-art methods in literature in an extensive evaluation. The
proposed method is shown to outperform the existing methods, while operating at many
times higher frame rates.
The second part of the thesis deals with the second goal in the problem formulation. A
category tracking framework was built that combines generic tracking with object detection in a causal probabilistic framework with deformable part models. Specifically, the
derived RCSK tracker was used in combination with the deformable parts model detector
[15]. The Rao-Blackwellized Particle Filter [37] was used in the filtering step to achieve
scalability in the number of parts in the model. The framework was applied to automatic
tracking of multiple humans in surveillance scenes. The tracking results are demonstrated
on a real-world benchmark sequence. Figure 1.1 illustrates the output of the proposed
tracker framework.
1.3
Thesis Outline
This thesis report is organized into two parts. The first part is dedicated to generic tracking. Chapter 2 discusses the family of circulant structure trackers, including the CSK
tracker introduced in section 1.1.1. The proposed appearance learning scheme for these
trackers is derived in section 2.3. Chapter 3 discusses different color features for tracking
and how the proposed tracker is extended with color information. In chapter 4, the proposed adaptive dimensionality reduction technique is derived and integrated to the tracking framework. The evaluations, results and conclusions from the first part of the thesis
is presented in chapter 5. This includes the extensive comparison with state-of-the-art
methods from literature.
The second part of this report considers the category tracking problem. Chapter 6 gives
an overview of the system and presents the model on which it is built. Chapter 7 describes in detail how DPM object detector is used. Additional details are discussed in
chapter 8, including how the generic tracking method derived in the first part of this thesis
is incorporated. Finally, the results are discussed in chapter 9.
The appendix contains two parts. Appendix A summarizes the Bayesian filtering problem
and most importantly describes the RBPF. Appendix B contains mathematical proofs and
derivations of the most important results.
Part I
Generic Tracking
2
Circulant Tracking by Detection
A standard correlation filter is a simple and straightforward visual tracking approach.
Much research over the last decades have aimed at producing more robust filters. Most
recently the Minimum Output Sum of Squared Error (MOSSE) filter [8] was proposed.
It performs comparably to state-of-the-art trackers, but at hundreds of FPS. In [21], this
approach was formulated as a tracking-by-detection problem and kernels was introduced
into the framework. The resulting CSK tracker was presented briefly in section 1.1.1. This
chapter starts with a detailed presentation of the MOSSE and CSK trackers. In section 2.3
a new learning scheme for these kinds of trackers is proposed.
2.1
The MOSSE Tracker
The key to fast correlation filters is to avoid computing the correlations in the spatial
domain, but instead exploiting the O(P ln(P )) complexity of the FFT. However, this assumes a periodic extension of the local image patch. Obviously this assumption is a very
harsh approximation of reality. However, since the background is of much lesser importance, this approximation can be seen as valid if the tracked object is centered enough in
the local image patch.
2.1.1
Detection
The goal in visual tracking is to find the location of the object in each new frame. Initially
only monochromatic images are considered, or more generally two dimensional, discrete
and scalar valued signals, i.e. functions Z × Z → R. To avoid special notation for circular
convolution and correlation, it is always assumed that a signal is extended periodically.
The set of all periodic functions f : Z × Z → R with period M in the first argument
and period N in the second argument is denoted `p (M, N ). The periodicity means that
f (m + M, n + N ) = f (m, n), ∀m, n ∈ Z.
9
10
2
Circulant Tracking by Detection
Let z ∈ `p (M, N ) be the periodic extension of an image patch of size M × N . h ∈
`p (M, N ) is a correlation filter that has been trained on the appearance of a specific object.
The correlation result at the image patch z can be calculated using (2.1). The position of
the target can then be estimated as the location of the maximum correlation output.
ŷ = h ? z = F −1 {HZ}
(2.1)
Capital letters denote the DFT of the corresponding signals. The second equality in (2.1)
follows from the correlation property of the DFT. The next section deals with how to train
the correlation filter h.
2.1.2
Training
First consider the simplest case. Given an image patch x ∈ `p (M, N ) that is centred at
the object of interest, the task is to find the correlation filter h ∈ `p (M, N ) that gives the
output y ∈ `p (M, N ) if correlated with x. y can simply be a Kronecker delta centered
at the target, but it proves to be more robust to use a smooth function, e.g. a sampled
Gaussian. The goal is to find a h that satisfies h ? x = y. If all frequencies in x contain
Y
non-zero energy there is a unique solution given by H = X
.
In practice it is important to be able to train the filter using multiple image samples
x1 , . . . , xJ . These samples can originate from different frames. Let y 1 , . . . , y J be their
corresponding desired output functions (or label functions). h is found by minimizing:
=
J
X
βj kh ? xj − y j k2 + λkhk2
(2.2)
j=1
Here β1 , . . . , βJ are weight parameters for the corresponding samples and λ is a regularization parameter. The filter that minimizes (2.2) is given in (2.3). See [8] for the
derivation.
PJ
j j
j=1 βj Y X
H = PJ
(2.3)
j j
j=1 βj X X + λ
Equation 2.3 suggests updating the numerator HN and denominator HD of H in each
new frame using a learning parameter γ. If H t−1 =
t
t
t−1
HN
t−1
HD
+λ
is the filter updated in frame
t − 1 and x , y are the new sample and desired output function from frame t, then the
filter is updated as in (2.4). This is the core part of the MOSSE tracking algorithm in [8].
t−1
t
HN
= (1 − γ)HN
+ γY t X t
t
HD
=
Ht =
t−1
(1 − γ)HD
t
HN
t +λ
HD
+ γX t X
t
This update scheme results in the weights given in (2.5).
(
(1 − γ)t−1
,j = 1
βj =
γ(1 − γ)t−j , j = 2, . . . , t
(2.4a)
(2.4b)
(2.4c)
(2.5)
2.2
11
The CSK Tracker
2.2
The CSK Tracker
This section discusses the CSK tracker, which was briefly presented in section 1.1.1. The
CSK tracker can be obtained by extending the MOSSE tracker with a non-linear kernel.
This extension is accomplished by introducing a mapping φ : X 7→ H from the signal
space X to some Hilbert space H and exploiting the kernel trick [7]. The result is also
generalized to vector valued signals f : Z × Z 7→ RD , to handle multiple features. The
set of such periodic signals is denoted `D
p (M, N ). The individual components of f are
denoted f d , d ∈ {1, . . . , D}, where f d ∈ `p (M, N ).
2.2.1
Training with a Single Image
Let h · , · i be the standard inner product in `p (M, N ). Let xm,n = τ−m,−n x be the result
of shifting x ∈ `p (M, N ) m and n steps, so that xm,n (k, l) = x(k + m, l + n). Note that
h ? x(m, n) = hxm,n , hi. The cost function in (2.2) can for the case of a single training
image (J = 1) be written as in (2.6).
X
2
=
(hxm,n , hi − y(m, n)) + λhh, hi
(2.6)
m,n
The sum in (2.6) is taken over a single period.1 The equation can be further generalized
by considering the mapped examples φ(xm,n ). The decision boundary is obtained by
minimizing (2.7) over v ∈ H.
X
2
=
(hφ(xm,n ), vi − y(m, n)) + λhv, vi
(2.7)
m,n
Observe that this is the cost function for regularized least squares classification with kernels. A well known result from classification is that the v that minimizes (2.7) is in the
subspace spanned by the vectors (φ(xm,n ))m,n . This result is easy to show in this case
by decomposing any v to v = vk + v⊥ , where vk is in this subspace and v⊥ is orthogonal
to it. The result can be written as in (2.8) for some scalars a(m, n).
X
v=
a(m, n)φ(xm,n )
(2.8)
m,n
The inner product in H is defined by the kernel function κ(f, g) = hφ(f ), φ(g)i, ∀f, g ∈
X . The coefficients a(m, n) are found by minimizing (2.9), where we have transformed
(2.7) by expressing v using (2.8) and used the definition of the kernel function.
2
XX
X
X
=
a(k, l)κ(xm,n , xk,l )−y(m, n) +λ
a(m, n)
a(k, l)κ(xm,n , xk,l )
m,n
k,l
m,n
k,l
(2.9)
A closed form solution to (2.9) can be derived under the assumption of a shift invariant
kernel. The concept of shift invariant kernels is defined in section 2.2.4. The coefficients
a can be extended periodically to an element in `p (M, N ). The a that minimizes (2.9)
1 It is always assumed that the summation is done over a single period, e.g. ∀(m, n) ∈ {1, . . . , M } ×
{1, . . . , N }, if no limits or set is specified.
12
2
Circulant Tracking by Detection
is given in (2.10). A derivation using circulant matrices can be found in [21], but it is
also proved in section B.1.2 for a more general case. Here we have defined the function
ux (m, n) = κ(xm,n , x). It is clear that ux ∈ `p (M, N ).
A = F {a} =
Y
Ux + λ
(2.10)
This is the same result as in (1.5), which the original CSK tracker [21] builds upon.
2.2.2
Detection
The calculation of the detection results of the image patch z is similar to (2.1). Here, x is
the learnt appearance of the object and A is the DFT of the learnt coefficients. By defining
uz (m, n) = κ(zm,n , x), the output can be computed using (2.11).
ŷ = F −1 {AUz }
2.2.3
(2.11)
Multidimensional Feature Maps
The equations 2.10 and 2.11 can be used for any feature dimensionality. The task is just to
define a shift invariant kernel function that can be used for multidimensional features. One
example of such a kernel is the standard inner product in `D
p (M, N ), i.e. κ(f, g) = hf, gi.
d
Let x denote feature layer d ∈ {1, . . . , D} of x. The training and detection in this
case can be derived from equations 2.10 and 2.11. The result is given in (2.12). This is
essentially the MOSSE tracker for multidimensional features, trained on a single image.
H d = PD
Y Xd
X dX d + λ
(D
)
X
−1
d d
ŷ = F
H Z
(2.12a)
d=1
(2.12b)
d=1
2.2.4
Kernel Functions
The kernel function is a mapping κ : X × X → R that is symmetric and positive definite.
X is the sample space, i.e. `D
p (M, N ). The kernel function needs to be shift invariant
for the equations 2.10 and 2.11 to be valid. This section contains the definition of a shift
invariant kernel from [21] and theorems that need to be stated regarding this property.
2.1 Definition (Shift Invariant Kernel). A shift invariant kernel is a valid kernel κ on
`D
p (M, N ) that satisfies
κ(f, g) = κ(τm,n f, τm,n g), ∀m, n ∈ Z, ∀f, g ∈ `D
p (M, N )
(2.13)
2.2 Proposition. Let κ be the inner product kernel in (2.14), where k : R → R.
κ(f, g) = k(hf, gi), ∀f, g ∈ `D
p (M, N )
(2.14)
2.3
13
Robust Appearance Learning
Then κ is a shift invariant kernel. Further, the following relation holds.
(D
!
)
X
κ(τ−m,−n f, g) = k F −1
F d Gd (m, n) , ∀m, n ∈ Z
(2.15)
d=1
2.3 Proposition. Let κ be the radial basis function kernel in (2.16), where k : R → R.
κ(f, g) = k(kf − gk2 ), ∀f, g ∈ `D
p (M, N )
(2.16)
Then κ is a shift invariant kernel. Further, the following relation holds.
( D
)
!
X
2
2
−1
d d
κ(τ−m,−n f, g) = k kf k + kgk − F
2
F G (m, n) , ∀m, n ∈ Z
d=1
(2.17)
The proofs are found in section B.1.1. From these propositions it follows that Gaussian
and polynomial kernels are shift invariant. Equations (2.15) and (2.17) give efficient ways
to compute the kernel outputs Ux and Uz in e.g. (2.10) and (2.11) using the FFT.
2.3
Robust Appearance Learning
This section contains the description of my proposed extension of the CSK learning approach in (2.10) to support training with multiple images. It can also be seen as an extension of the MOSSE tracker to multiple features if a linear kernel is used. The result is
a more robust learning scheme of the tracking model, which is shown to outperform the
learning scheme of the CSK [21] in chapter 5. The tracker that is proposed in this section
is therefore referred as Robust CSK or RCSK.
The CSK tracker learns its tracking model by computing A using (2.10) for each new
frame independently. It then applies an ad hoc method of updating the classifier coefficients by linear interpolation between the new coefficients A and the previous ones At−1
using: At = (1 − γ)At−1 + γA, where γ is a learning rate parameter. Modifying the
cost function to include training with multiple images is not as straight forward as with
the MOSSE-tracker for grayscale images. This is due to the fact that the kernel function
is non-linear in general. The equivalent to (2.2) in the CSK case would be to minimize:
=
J
X
j=1
βj
X
2
hφ(xjm,n ), vi − y j (m, n) + λhv, vi
(2.18)
m,n
PJ P
However, the solution v = j=1 m,n aj (m, n)φ(xjm,n ) involves computing a set of
coefficients aj for each training image xj . This requires an evaluation of all pairs of
j
i
j
kernel outputs ui,j
x (m, n) = κ(xm,n , x ). All A can then by computed by solving N M
number of J ×J linear systems. This is obviously highly impractical in a real-time setting
if the number of images J is more than only a few. To keep the simplicity and speed of the
MOSSE tracker, it is thus necessary to find some approximation of the solution to (2.18).
Specifically, the appearance model should only contain one set of classifier coefficients a
to simplify learning and detection.
14
2
Circulant Tracking by Detection
This can be accomplished by restricting the solution so that the coefficients a are the same
for all images. This is expressed as the cost function in (2.19).
X
J
X
j
j
j
2
j
j
=
βj
|hφ(xm,n ), v i − y (m, n)| + λhv , v i
(2.19a)
m,n
j=1
where,
j
v =
X
a(k, l)φ(xjk,l )
(2.19b)
k,l
The a that minimizes (2.19) is given in (2.20), where we have set ujx (m, n) = κ(xjm,n , xj ).
PJ
j j
j=1 βj Y Ux
(2.20)
A = PJ
j
j
j=1 βj Ux (Ux + λ)
See section B.1.2 for the derivation. The object patch appearance x̂t is updated using the
same learning parameter γ. The final update rule is given in (2.21).
t t
AtN = (1 − γ)At−1
N + γY Ux
AtD
= (1 −
At =
γ)At−1
D
+
γUxt (Uxt
(2.21a)
+ λ)
AtN
AtD
x̂t = (1 − γ)x̂t + γxt
(2.21b)
(2.21c)
(2.21d)
The resulting weights βj will be the same as in (2.5). See algorithm 2.1 for the complete
pseudo code of the proposed RCSK tracker.
2.4
Details
This section discusses various details of the proposed tracker algorithm, including parameters and necessary preprocessing steps for feature extraction.
2.4.1
Parameters
The label function y is as in [8, 21] set to the Gaussian function in (2.22). The standard
deviation is proportional to the given target size s = (s1 , s2 ), with a constant σy . Since a
constant label function y is used, its transform Y t = Y = F {y} can be precomputed.
2 2 !!
1
N
M
y(m, n) = exp − 2
m−
+ n−
,
(2.22)
2σy s1 s2
2
2
for
m ∈ {0, . . . , M − 1} , n ∈ {0, . . . , N − 1}
The kernel κ is set to a Gaussian with a variance proportional to the dimensionality of the
patches, with a constant σκ2 . The kernel used in [21] is given in (2.23).
1
2
κ(f, g) = exp − 2
kf − gk
(2.23)
σκ M N D
2.4
Details
15
Algorithm 2.1 The proposed RCSK tracker.
Input:
Sequence of frames: {I 1 , . . . , I T }
Target position in the first frame: p1
Target size: s
Window function: w
Parameters: γ, λ, η, σy , σκ
Output:
Estimated target position in each frame: {p1 , . . . , pT }
1:
2:
3:
4:
5:
6:
7:
8:
9:
Initialization:
Construct label function y using (2.22) and set Y = F {y}
Extract x1 from I 1 at p1
Calculate u1x (m, n) = κ(x1m,n , x1 ) using (2.15) or (2.17)
Initialize: A1N = Y Ux1 , A1D = Ux1 (Ux1 + λ) , A1 = A1N /A1D , x̂1 = x1
for t = 2 : T do
Detection:
Extract z t from I t at pt−1
t
, x̂t−1 ) using (2.15) or (2.17)
Calculate utz (m, n) = κ(zm,n
Calculate correlation output: ŷ t = F −1 {At−1 Uzt }
Calculate the new position pt = argmaxp ŷ(p)
Training:
Extract xt from I t at pt
Calculate utx (m, n) = κ(xtm,n , xt ) using (2.15) or (2.17)
Update the tracker using (2.21)
13: end for
10:
11:
12:
A padding parameter η decides the amount of background contained in the patches, so
that (M, N ) = (1 + η)s. The regularization parameter λ can be set to almost zero in most
cases if the proposed learning is used. But since the effect of this parameter proved to
be negligible for small values, it is set to the same value as in [21] for a fair comparison.
The optimal setting of the learning rate γ is highly dependant on the sequence, though a
compromise can often be found if the same value is used for many sequences (as in the
evaluations). The complete set of parameters and default values is presented in table 2.1.
The default values are the ones suggested by [21].
2.4.2
Windowing
As noted earlier, the periodic assumption is the key to be able to exploit the FFT in the
computations. However, this assumption introduces discontinuities at the edges.2 A common technique from signal processing to overcome this problem is windowing, where
2 Continuity is not defined for functions with discrete domains. However, we can think of the domain as
continuous for a moment, i.e. as before the signal was sampled
16
2
Parameter
name
γ
λ
η
σy
σκ
Default value
0.075
0.01
1.0
1/16
0.2
Circulant Tracking by Detection
Explanation
Learning rate.
Regularization parameter.
Amount of background included in the extracted patches.
Standard deviation of the label function y.
Standard deviation of the gaussian kernel function κ.
Table 2.1: The parameters for the RCSK and CSK tracker.
the extracted sample is multiplied by a window function. [21] suggests a Hann window,
defined in (2.24).
πm
πn
w(m, n) = sin2
sin2
(2.24)
M −1
N −1
In the detection stage of the tracking algorithm, an image patch z is extracted from the new
frame. However, it is not likely that the object is centred in the patch. This means that the
window function distorts the object appearance. This effect becomes greater the further
away from the center of the patch the object is located. This means that the windowing
also effects the tracking performance in a negative way. The simplest ways to counter this
effect is to iterate the detection step in the algorithm, where each new sample is extracted
from the previously estimated position in each iteration. Although this often increases the
accuracy of the tracker, it significantly increases the computational time. It can also make
the tracking more unstable. Another option is to predict the position of the object in the
next frame in a more sophisticated way, instead of just assuming constant position. This
can be done by applying a Kalman filter on a constant velocity or acceleration model.
2.4.3
Feature Value Normalization
For image intensity features, [8, 21] suggest normalizing the values to the range [−0.5, 0.5].
The reason for this is to minimize the amount of distortion induced by the windowing operation discussed in the previous section. The idea is to remove as much of the inherent
bias in the feature values as possible by subtracting some a priori mean feature value. The
same methodology can be applied to other kinds of features.
One way of eliminating the need of choosing the normalization of each feature, is to
automatically learn a normalization constant (that is subtracted from the feature value)
based on the specific image sequence or even the specific frame. This however, has to
be done with care to avoid corrupting the learnt appearance and classifier coefficients. A
method for adaptively selecting the normalization constant based on the weighted average
feature values was tried, but no significant performance gain was observed compared to
using the ad-hoc a priori mean feature values. So it was not investigated further.
A special feature normalization scheme for features with a probabilistic representation
(e.g. histograms) is presented in section 3.2.1.
3
Color Features for Tracking
Most state-of-the-art trackers either rely on intensity or texture information [19, 50, 24,
13, 38], including the CSK and MOSSE trackers discussed in the previous chapter. While
significant progress has been made to visual tracking, the use of color information has
been limited to simple color space transformations [35, 31, 10, 32, 11]. However, sophisticated color features have shown to significantly improve the performance of object
recognition and detection [41, 26, 49, 42, 25]. This motivates an investigation of how
color information should be used in visual tracking.
Exploiting color information for visual tracking is a difficult challenge. Color measurements can vary significantly over an image sequence due to variations in illuminant, shadows, shading, specularities, camera and object geometry. Robustness with respect to these
factors have been studied in color imaging, and successfully applied to image classification [41, 26], and action recognition [27]. This chapter presents the color features that
are evaluated in section 5.3 and discusses how they are incorporated into the family of
circulant structure trackers presented in chapter 2.
3.1
Evaluated Color Features
In this section, 11 color representations are presented briefly. These are evaluated in section 5.3 with the proposed tracking framework. Each color representation uses a mapping
from local RGB-values to a color space of some dimension. All color features evaluated
here except Opponent-Angle and SO use pixelwise mappings from one RGB-value to a
color value.
RGB: As a baseline, the standard 3-channel RGB color space is used.
LAB: The 3-dimensional LAB color space is perceptually uniform, meaning that colors at
17
18
3
Color Features for Tracking
equal distance are also perceptually considered to be equally far apart. The L -component
approximates the human perception of lightness.
YCbCr: YCbCr contains a luminance component Y and two chrominance components
Cb and Cr which encodes the blue- and red-difference respectively. The representation
is approximately perceptually uniform. It is commonly used in image compression algorithms.
R
G
, R+G+B
rg: The rg [17] color channels are computed as (r, g) = R+G+B
. They are
invariant with respect to shadow and shading effects.
HSV: In the HSV color space V encodes the lightness as the maximum RGB-value, H is
the hue and S is the saturation, which corresponds to the purity of the color. H and S are
invariant to shadow-shading. The hue H is additionally invariant for specularities.
Opponent: The opponent color space is an orthonormal transformation of the RGB-color
space, given by (3.1).

  √1


− √12
0
O1
R
2
1
−2  
√1
√
 O2  = 
G .
(3.1)
 √6
6
6 
1
1
√
√
√1
O3
B
3
3
3
This representation is invariant with respect to specularities.
C: The C color representation [41] adds photometric invariants with respect to shadowshading to the opponent descriptor by normalizing with the intensity. This is done according to (3.2).
T
O2
C = O1
(3.2)
O3
O3
O3
HUE: The hue is a 36-dimensional histogram representation [42] of H = arctan O1
O2 .
√
The contribution to the hue histogram is weighted with the saturation S = O12 + O22
to counter the instabilities of the hue representation. This representation is invariant to
shadow-shading and specularities.
Opp-Angle: The Opp-Angle is a 36-dimensional histogram representation [42] based on
spatial derivatives of the opponent channels. The histogram is constructed using (3.3).
O1x
angxO = arctan
,
(3.3)
O2x
The subscript x denotes the spatial derivative. This representation is invariant to specularities, shadow-shading, blur and a constant offset.
SO: SO is a biologically inspired descriptor of Zhang et al. [49]. This color representation
is based on center surround filters on the opponent color channels.
Color Names: See section 3.1.1.
3.2
Incorporating Color into Tracking
3.1.1
19
Color Names
The Color Names descriptor is explained in more detail, since it proved to be the best
choice in the evaluation in section 5.3. It is therefore used in the proposed version of
the tracker and in part two of this thesis. Color names (CN), are linguistic color labels
assigned by humans to represent colors in the world. In a linguistic study performed by
Berlin and Kay [6], it was concluded that the English language contains eleven basic color
terms: black, blue, brown, grey, green, orange, pink, purple, red, white and yellow. In the
field of computer vision, color naming is an operation that associates RGB observations
with linguistic color labels. In this thesis, the mapping provided by [43] is used. Each
RGB value is mapped to a probabilistic 11 dimensional color representation, which sums
up to 1. For each pixel, the color name values represent the probabilities that the pixel
should be assigned to the above mentioned colors. Figure 3.1 visualizes the color name
descriptor in a real-world tracking example.
The color names mapping is automatically learned from images retrieved by the Google
Image search. 100 example images per color were used in the training stage. The provided
mapping is a lookup table from 323 = 32768 uniformly sampled RGB values to the 11
color name probabilities.
A difference from the other color descriptors mentioned in section 3.1 is that the color
names encodes achromatic colors, such as white, gray and black. This means that it does
not aim towards full photometric invariance, but rather towards discriminative power.
3.2
Incorporating Color into Tracking
In section 2.2.3 is was noted that the kernel formulation of the CSK and RCSK tracker
makes is easy to extend the tracking algorithm to multidimensional features, such as color
features. By using a linear kernel in these trackers, they can also be seen as different
extensions of the MOSSE tracker to multidimensional features. The windowing operation
discussed in section 2.4.2, is applied to every feature-layer separately after the feature
extraction step, which in this case is a color space transformation followed by a feature
normalization.
3.2.1
Color Feature Normalization
The feature normalization step, as described in section 2.4.3, is an important and nontrivial task to be addressed. For all color descriptors in section 3.1 with a non-probabilistic
representation (i.e. all except HUE, Opp-Angle and Color Names), the normalization is
done by centring the range of each feature value. This means that the range of the feature
values are symmetric around zero. This is motivated by assuming uniform and independent feature value probabilities. However, the independence assumption is not valid for
the high-dimensional color descriptors. For these descriptors it is more correct to normalize the representation so that the expected sum over the feature values is zero. For color
names, this means subtracting each feature bin with 1/11.
A specific attribute of the family of trackers explained in chapter 2, including the proposed RCSK, opens up an interesting alternative normalization scheme that can be used
20
3
Color Features for Tracking
(b) Black
(c) Blue
(d) Brown
(e) Gray
(f) Green
(g) Orange
(h) Pink
(i) Purple
(a) RGB image patch.
(j) Red
(k) White
(l) Yellow
Figure 3.1: Figure 3.1a is an image patch of the target in the soccer sequence, which
is a benchmark image sequence for evaluating visual trackers. Figure 3.1b to 3.1l are
the 11 color name probabilities obtained from the image patch. Notice how motion
blur, illumination, specularities and compression artefacts complicates the process
of color naming the pixels.
with color names. It can in fact be used for any feature representation that sums up to
some constant value. Color names contain only 10 degrees of freedom for this reason.
The color name values lie in a 10-dimensional hyper plane in the feature space. This
plane is orthogonal to the vector (1, 1, . . . , 1)T . The color name values can be centered
by changing the feature space basis to an orthonormal basis chosen so that the last basis
vector is orthogonal to this plane. However, since the last coordinate in the new basis
is constant (and thus contains no information) it can be discarded. The feature dimensionality is thus reduced from 11 to 10 when this normalization scheme is used. This
has a positive effect on the computational cost of the trackers, by reducing the number of
necessary FFT-computations and memory accesses in each frame.
The nature of the trackers explained in chapter 2, makes them invariant to the choice of
basis to be used in the normalization step. This comes from the fact that the inner products and L2 -norms that are used in the kernel computations, are invariant under unitary
transformations of the feature values. This property is discussed further in section 4.2.2.
To minimize the computational cost of this feature normalization step, a new lookup table
was constructed that maps RGB-values directly to the 10-dimensional normalized color
name values. In later chapters, these normalized color names is referred to as just color
names. This means that this normalization scheme was always employed for color names
in the experiments of chapter 5.
4
Adaptive Dimensionality Reduction
The time complexity of the proposed tracker in algorithm 2.1 scales linearly with the
number of features. To overcome this problem, an adaptive dimensionality reduction
technique is proposed in this chapter. This technique reduces the number of feature dimensions without any significant loss in tracking performance. The dimensionality reduction is based on Principal Component Analysis (PCA), which is described in section 4.1.
Section 4.2 presents the theory behind the proposed approach. Section 4.3 contains implementation details and pseudo code of the approach and how it is applied to the trackers
discussed in chapter 2
4.1
Principal Component Analysis
PCA1 [30] is a standard way of performing dimensionality reduction. It is done by computing an orthonormal basis for the linear subspace of a given dimension that holds the
largest portion of the total variance in the dataset. The basis vectors are aligned so that
the projections onto this basis are pairwise uncorrelated. From a geometric perspective,
PCA returns an orthonormal basis for the subspace that minimizes the average squared
L2 -error between a set of centered2 data points and its projections onto this subspace.
This is formulated in (4.1).
N
1 X
kxi − BB T xi k2
min ε =
N i=1
subject to B T B = I
1 PCA
2 Here
is also known as the Discrete Karhunen-Loève Transform.
“centered” refers to that the average value has been subtracted from the data.
21
(4.1a)
(4.1b)
22
4
Adaptive Dimensionality Reduction
xi ∈ Rn are the centered data points and B is a n × m dimensional matrix that contains
the orthonormal basis vectors of the subspace in its columns. It can be shown that this
optimization problem is equivalent to maximizing
(4.2) under the same constraint. The
P
covariance matrix C is defined as C = N1 i xi xTi .
V = tr(B T CB)
(4.2)
The PCA-solution to this problem is to choose the columns of B as the normalized eigenvectors of C that correspond to the largest eigenvalues (see [30] for the proof). It should
be mentioned that any orthonormal basis to the subspace spanned by these eigenvectors
is a solution to the optimization problem (4.1).
4.2
The Theory Behind the Proposed Approach
The proposed dimensionality reduction in this section is a mapping to a linear subspace
of the feature space. This subspace is defined by an orthonormal basis. Let Bt denote
the matrix containing the orthonormal basis vectors of this subspace as columns. Assume
1
that the feature map of the appearance x̂t ∈ `D
p (M, N ) at frame t has D1 features and
that the desired feature dimensionality is D2 . Bt should thus be a D1 × D2 matrix. The
projection to the feature subspace is done by the linear mapping x̃t (m, n) = BtT x̂t (m, n),
where x̃t is the compressed feature map. This section presents a method of computing the
subspace basis Bt to be used in the dimensionality reduction.
4.2.1
The Data Term
The original feature map x̂t of the learnt patch appearance can be optimally reconstructed
(in L2 -sense) as Bt x̃t = Bt BtT x̂t . An optimal projection matrix can be found by minimizing the reconstruction error of the appearance in (4.3).
1 X t
min εtdata =
kx̂ (m, n) − Bt BtT x̂t (m, n)k2
(4.3a)
M N m,n
subject to BtT Bt = I
(4.3b)
Equation 4.3a can be seen as a data term since it only regards the current object appearance. The expression can be simplified to (4.4) by introducing the data matrix Xt which
contains all pixel values of x̂t , such that there is a column for each pixel and a row for
each feature. Xt thus has the dimensions D1 × M N . The second equality follows from
the properties of the Frobenius norm and the trace operator. The covariance matrix Ct is
defined by Ct = M1N Xt XtT .
εtdata =
4.2.2
1
kXt − Bt BtT Xt k2F = tr(Ct ) − tr(BtT Ct Bt )
MN
(4.4)
The Smoothness Term
The projection matrix must be able to adapt to changes in the target and background
appearance. Otherwise it would likely become outdated and the tracker would deteriorate
4.2
23
The Theory Behind the Proposed Approach
over time since valuable information is lost in the feature compression. However, the
projection matrix must also take the already learnt appearance into account. If it changes
too drastically, the already learnt classifier coefficients At−1 become irrelevant since they
were computed with a seemingly different set of features. The changes in the projection
matrix must thus be slow enough for the already learnt model to remain valid.
To obtain smooth variations in the projection matrix, a smoothness term is added to the optimization problem. This term adds a cost if there is any change in the subspace spanned
by the column vectors in the new projection matrix compared to the earlier subspaces.
This is motivated by studying the transformations between these subspaces. Let Bt be
the ON-basis for the new subspace and Bj for some earlier subspace (j < t) of the same
dimension. The optimal transformation from the older to the new subspace is given by
P = BtT Bj . It can be shown that the matrix P is unitary if and only if the column vectors
in Bt and Bj span the same subspace. One can easily verify that the point wise transformation by a unitary matrix corresponds to a unitary operator U on `D
p (M, N ). Lastly, one
can see that inner product kernels and radial basis function kernels are invariant to unitary
transforms, i.e. κ(U f, U g) = κ(f, g). The kernel output is thus invariant under changes
in the projection matrix as long as the spanned subspace stays the same. A cost should
only be added if the subspace itself is changed. Equation 4.5 accomplishes this.
εjsmooth =
D2
X
2
(k)
(k) λk bj − Bt BtT bj (4.5)
k=1
(k)
bj
(k)
is column vector k in Bj . The positive constants λj
are used to weight the impor-
(k)
bj .
tance of each basis vector
Equation 4.5 minimizes the squared L2 -distance of the
error when projecting the old basis vectors onto the new subspace. The cost becomes zero
if the two subspaces are the same (even if Bt and Bj are not) and is at a maximum if the
subspaces are orthogonal. By defining the diagonal matrix Λj with the weights along the
(k)
diagonal [Λj ]k,k = λj , this expression can be rewritten to (4.6).
εjsmooth = tr(Λj ) − tr(BtT Bj Λj BjT Bt )
4.2.3
(4.6)
The Total Cost Function
Assume that the tracker is currently on frame number t. Let x̂t be the learnt feature map
of the object appearance. The goal is to find the optimal projection matrix Bt for the
current frame. The set of previously computed projection matrices {B1 , . . . , Bt−1 } are
given. Bt is found by minimizing (4.7), under the constraint BtT Bt = I.
εttot = αt εtdata +
t−1
X
αj εjsmooth
j=1
t−1
X
= αt tr(Ct ) − tr(BtT Ct Bt ) +
αj tr(Λj ) − tr(BtT Bj Λj BjT Bt )
(4.7)
j=1
This cost function is the weighted sum of the data term (4.4) and the smoothness term in
24
4
Adaptive Dimensionality Reduction
(4.6) for each previous projection matrix Bj . αj are importance weights. Equation 4.7 can
be reformulated to the equivalent maximization problem (4.8) by exploiting the linearity
of the trace-function.

! 
t−1
X
Vtot = tr BtT αt Ct +
αj Bj Λj BjT Bt 
(4.8)
j=1
By comparing this expression to the PCA-formulation (4.2) one can see that this optimization problem can be solved using the PCA methodology with the covariance matrix Rt
defined in (4.9). It can be verified that Rt indeed is symmetric and positive definite.
Rt = αt Ct +
t−1
X
αj Bj Λj BjT
(4.9)
j=1
The columns in Bt is thus chosen as the D2 normalized eigenvectors of Rt that corresponds to the largest eigenvalues.
4.3
Details of the Proposed Approach
The adaptive PCA algorithm described above requires a way of choosing the weights αj
and Λj . αj control the relative importance of the current appearance and the previously
computed subspaces. These are set by using an appropriate learning rate parameter µ that
acts in the same way as the learning rate γ for the appearance learning. Setting µ = 1
corresponds to only using the current learnt appearance in the calculation of the projection
matrix. µ = 0 is the same as computing the projection matrix once in the first frame and
then letting it be fixed for the entire sequence. The value was experimentally tuned to
µ = 0.1 for the linear kernel case and µ = 0.15 for the non-linear kernel case.
The diagonal in Λj contains the importance weights for each basis vector in the previously
computed projection matrix Bj . These are set to the eigenvalues of the corresponding basis vectors in Bj . This makes sense since the score function (4.8) equals the sum of these
eigenvalues. Each eigenvalue can thus be interpreted as the score for its corresponding
basis vector in Bt . In a probabilistic interpretation, the eigenvalues are the variances for
each component in the new basis. Since PCA uses variance as the measure of importance,
it is natural to weight each component (basis vector) with its variance. The term Bj Λj BjT
then becomes the “reconstructed” covariance matrix of rank D2 , i.e. the covariance of
the reconstructed appearance using the projections in image j. Equation 4.9 is thus a
weighted sum of image covariances.
Algorithm 4.1 provides the full pseudo code for the computation of the projection matrix.
The mean feature values do not contain information about the structure and should therefore be subtracted from the data before computing the projection matrix. Including the
mean in the PCA computation affects the projection matrix to conserve the mean in the
projected features, rather that maximizing the variance which is related to image structure.
Algorithm 4.2 provides the full pseudo code for the proposed RCSK tracker with adaptive
4.3
Details of the Proposed Approach
25
Algorithm 4.1 Adaptive projection matrix computation.
Input:
Frame number: t
Learned object appearance: x̂t
Previous covariance matrix: Qt−1
Parameters: µ, D2
Output:
Projection matrix: Bt
Current covariance matrix: Qt
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
P
Calculate mean x̄t = M1N m,n x̂t (m, n)
P
Calculate covariance Ct = M1N m,n (x̂t (m, n) − x̄t )(x̂t (m, n) − x̄t )T
if t = 1 then
Set Rt = Ct
else
Set Rt = (1 − µ)Qt−1 + µCt
end if
Do EVD Rt = Et St EtT , the eigenvalues in St are in descending order
Set Bt to the first D2 columns in Et
Set [Λt ]i,j = [St ]i,j , 1 ≤ i, j ≤ D2
if t = 1 then
Set Qt = Bt Λt BtT
else
Set Qt = (1 − µ)Qt−1 + µBt Λt BtT
end if
dimensionality reduction. Note that the windowing of the feature map is always done after
the projection onto the new reduced feature space. It is not a part of the feature extraction
as in algorithm 2.1. The reason is that windowing adds spatial correlation between the
pixels, which contradicts the independence and stationarity assumptions used in the PCA.
26
4
Adaptive Dimensionality Reduction
Algorithm 4.2 Proposed RCSK tracker with dimensionality reduction.
Input:
Sequence of frames: {I 1 , . . . , I T }
Target position in the first frame: p1
Target size: s
Window function: w
Parameters: γ, λ, η, σy , σκ , µ, D2
Output:
Estimated target position in each frame: {p1 , . . . , pT }
Initialization:
Construct label function y using (2.22) and set Y = F {y}
Extract x1 from I 1 at p1
Initialize x̂1 = x1
Calculate B1 and Q1 using algorithm 4.1
Project features and apply window: x̃1 (m, n) = w(m, n)B1T x1 (m, n)
6: Calculate u1x (m, n) = κ(x̃1m,n , x̃1 ) using (2.15) or (2.17)
˜1 = x̃1
7: Initialize: A1N = Y Ux1 , A1D = Ux1 (Ux1 + λ) , A1 = A1N /A1D , x̂
1:
2:
3:
4:
5:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
for t = 2 : T do
Detection:
Extract z t from I t at pt−1
Project features and apply window: z̃ t (m, n) = w(m, n)BtT z t (m, n)
t
˜t−1 ) using (2.15) or (2.17)
, x̂
Calculate utz (m, n) = κ(z̃m,n
t
Calculate correlation output: ŷ = F −1 {At−1 Uzt }
Calculate the new position pt = argmaxp ŷ(p)
Training:
Extract xt from I t at pt
Update appearance x̂t using (2.21d)
Calculate Bt and Qt using algorithm 4.1
Project features and apply window: x̃t (m, n) = w(m, n)BtT xt (m, n)
Calculate utx (m, n) = κ(x̃tm,n , x̃t ) using (2.15) or (2.17)
Update the tracker using (2.21a), (2.21b) and (2.21c)
˜t (m, n) = w(m, n)B T x̂t (m, n)
Calculate projected appearance: x̂
t
end for
5
Evaluation
This chapter contains evaluations, results, discussions and conclusions related to the first
part of this thesis. Section 5.1 describes the evaluation methodology, including evaluation
metrics and datasets. In section 5.2, a comparison is made between the trackers presented
in chapter 2. The color features discussed in chapter 3 are evaluated in section 5.3. The
effect of the dimensionality reduction technique proposed in chapter 4 is investigated in
section 5.4. The best performing proposed tracker versions is then compared to state-ofthe-art methods in an extensive evaluation in section 5.5. Lastly, section 5.6 presents some
general conclusions and discussions about possible directions of future work.
5.1
Evaluation Methodology
The methods were evaluated using the protocol and code recently provided by Wu et
al. [46]1 . The evaluation code was modified with some bug fixes and some added functionality. It employs the most commonly used scheme for evaluating causal generic trackers on image sequences with ground-truth target locations. The tracker is initialized in
the first frame, with the known target location. In the subsequent frames, the tracker
is used to estimate the locations of the target. Only information from all the previous
and the current frame may be exploited by the tracker when estimating a target location.
The estimated trajectory is then compared with the ground truth locations using different
evaluation metrics.
All evaluations were performed on a desktop computer with an Intel Xenon 2 core 2.66
GHz CPU with 16 GB of RAM.
1 The sequences together with the ground-truth and matlab code are available at: https://sites.
google.com/site/trackerbenchmark/benchmarks/v10
27
28
5.1.1
5
Evaluation
Evaluation Metrics
The trackers were evaluated using three evaluation metrics commonly used in literature.
The first is average center location error (CLE), which is the average L2 -distance (in
pixels) between the estimated and ground truth center locations of the target over the
sequence. The second metric is distance precision (DP), which is the relative number
of frames where the estimated center location is within a certain distance threshold d
from the ground truth center location. The third metric is overlap precision (OP), defined
as the relative number of frames where the overlap between the estimated and ground
truth bounding box exceeds a certain threshold b. Bounding box overlap is commonly
measured using the PASCAL criterion. For an image sequence, the three measures are
calculated using (5.1). The ground truth and estimated center locations are denoted pt and
p̂t respectively, where t is the frame number. Similarly, Bt and B̂t denotes the ground
truth and estimated bounding boxes of the target. A bounding box is here defined as the
set of pixels covered by the rectangular area. N is the number of frames in the sequence.
CLE =
N
1 X
kp̂t − pt k
N t=1
1
N
1
OP(b) =
N
DP(d) =
|{t : kp̂t − pt k ≤ d}| , d ≥ 0
(
)
|B̂t ∩ Bt |
≥ b , 0 ≤ b ≤ 1
t:
|B̂t ∪ Bt |
(5.1a)
(5.1b)
(5.1c)
There is not much agreement in the literature on what performance measures to use for
comparing visual trackers. DP and CLE only regard the estimated center location, which
is desirable when evaluating trackers that do not estimate scale. OP also takes the estimated scale into account. This metric was only used in the state-of-the-art evaluation (see
section 5.5), to give a more complete and fair comparison to trackers that estimate scale
variations.
Results of both CLE and DP are reported in the tables. The numeric values of distance
precision are reported at a threshold of 20 pixels [21, 38, 46], i.e. DP(20). It is motivated
to use both these metrics since they contain complementary information. CLE has the
drawback of being unstable at tracker failures. DP is robust to failures, but it does not
include any information about the accuracy of the tracker within 20 pixels. For robustness
reasons, the per-video results are summarized using the median results over the whole
dataset.
Some authors [21, 46] suggest the usage of precision and success plots, where distance
and overlap precision respectively are plotted over a range of thresholds. This says much
more about the performance, but makes the task of comparing different methods harder.
In this chapter, precision and success plots are used to compare the overall performance
of different trackers. The average precision values over the set of sequences were used.
In both types of plots, a ranking score is computed to simplify the interpretation of the
results. In the precision plots, the DP-value at 20 pixels is used. The area-under-the-curve
(AUC) is used as the ranking score in the success plots. The ranking scores are displayed
in brackets next to the tracker names in the legend of each plot. See [46] for more details.
5.2
Circulant Structure Trackers Evaluation
5.1.2
29
Dataset
The benchmark evaluation of [46] includes a dataset of 50 image sequences. 35 of these
are color sequences. Additionally, another 6 benchmark color sequences was added to
the dataset, namely: Kitesurf, Shirt, Surfer, Board, Stone and Panda. The resulting set
of 41 color sequences was used for all evaluations in this chapter, except in the initial
comparison between MOSSE, CSK and RCSK on grayscale sequences. In this case, the
full set of 56 sequences was used.
The sequences pose many challenging situations. [46] actually provides an annotation of
their 50 sequences with 11 different attributes, which explain the challenges encountered
in each sequence. The different attributes are: motion blur, illumination changes, scale
variation, heavy occlusions, in-plane and out-of-plane rotations, deformation, out of view,
background clutter and low resolution. This is used to make attribute based comparisons,
which can show interesting strengths and weaknesses of different trackers.
5.1.3
Trackers and Parameters
All trackers are evaluated using the same parameters over the whole dataset. This requirement is commonly used to prevent over-tuning. For the trackers explained in chapter 2,
including the proposed RCSK, the standard parameters suggested by [21] are used, including the same Gaussian kernel. The parameters and used values are displayed in table 2.1.
The kernel bandwidth σκ , does not have any effect for the MOSSE tracker, or when a
linear kernel is used for either RCSK or CSK. The dimensionality reduction learning rate
µ is set to 0.1 for linear and 0.15 for non-linear kernels, when used with the RCSK.
The code for the proposed tracker versions was implemented in Matlab (no mex-functions
were used). For the original CSK tracker, the Matlab-code provided by the authors was
used but modified to support multidimensional feature maps. The evaluated MOSSE
tracker was implemented in Matlab as well. Note that this is not exactly the same tracker
as proposed in [8], but rather a simplification of it. There are no random initial examples and no failure detection (see [8]). The implementation of the MOSSE tracker that
was evaluated here is similar to the implementation of the CSK and RCSK. The only differences are that it uses other equations for learning and updating the model, as well as
calculating the tracking scores. Interestingly it was shown in [21] that a simple implementation of the core tracking functionality of the MOSSE tracker, i.e. (2.4) and (2.1),
resulted in a better tracker than using the code provided by the authors.
In the state-of-the-art evaluation of section 5.5, the code or binaries for the compared
trackers were either obtained from [46] or from the authors. They are used with the
suggested default parameters.
5.2
Circulant Structure Trackers Evaluation
This section contains the comparison between different variants in the family of circulant
structure trackers presented in chapter 2. This includes the proposed variant RCSK. Three
experiments were done. Firstly RCSK, CSK [21] and MOSSE [8] were compared for
grayscale features. Secondly the trackers were compared for multidimensional features,
30
5
Evaluation
Precision plot
0.7
Distance Precision
0.6
0.5
0.4
0.3
0.2
MOSSE [0.589]
CSK [0.561]
RCSK [0.554]
0.1
0
0
10
20
30
40
Location error threshold
50
Figure 5.1: Comparison between MOSSE, CSK and RCSK for grayscale features.
specifically grayscale together with color names. Five trackers were evaluated in that
experiment. The proposed robust learning scheme was used with both a linear and a
Gaussian kernel. The methods are called RCS and RCSK respectively. Similarly, the
learning scheme of [21] was used with a linear and a Gaussian kernel, called CS and CSK
respectively. These were compared with a straightforward generalization of the MOSSE
tracker to multidimensional features. In the third experiment, RCSK was compared with
CSK for all the color features mentioned in section 3.1.
5.2.1
Grayscale Experiment
This experiment compares the RCSK with CSK and MOSSE when just using grayscale
features. For this reason all the 56 sequences in the dataset are employed. The results are
presented in figure 5.1. The same grayscale features as applied in the original CSK code
provided by the authors was used for all trackers. For color sequences, these are computed
using the rgb2gray-function in Matlab. The precision plot show a slight advantage for the
linear kernel (i.e. the MOSSE tracker). The CSK and RCSK performs similarly.
Qualitatively, it can be seen that although the linear kernel provides better results in average, it has stability issues in situations with significant illuminations changes. This is
most clearly seen in the shaking and skating1 sequences. In shaking there is drastically increasing back-light for a few frames. Figure 5.2 shows the frame where MOSSE fails and
lose track of its target. The strong back-light and blooming in frame 59 compared to the
previous frame completely corrupts the score function of the MOSSE tracker (figure 5.2b),
while the other two trackers are able to track through this frame robustly (figure 5.2d and
5.2f). MOSSE fails in a similar way in skating1 when the target (an ice skater) moves
from an illuminated area in the scene to a much darker area. Also in this case CSK and
RCSK manage to track the target through those frames.
5.2
31
Circulant Structure Trackers Evaluation
(a) MOSSE, frame 58.
(b) MOSSE, frame 59.
(c) CSK, frame 58.
(d) CSK, frame 59.
(e) RCSK, frame 58.
(f) RCSK, frame 59.
Figure 5.2: Frames 58 and 59 from the shaking sequence. The score function is
shown in red and the tracking output bounding box in green. The tracked target
in this sequence is the head of the guitarist. MOSSE fails in frame 59 (figure 5.2b)
where the back-light from the spotlight in the background has increased significantly
compared to the previous frame (figure 5.2a). The kernelized versions CSK (figure 5.2c and 5.2d) and RCSK (figure 5.2e and 5.2f) are able to track through these
frames robustly.
5.2.2
Grayscale and Color Names Experiment
This experiment is similar to the one described in the previous section. The difference
is that in addition to grayscale features, color names was used as well. The motivation
for using color names here is that it is shown to be the best performing evaluated color
descriptor in section 5.3. The evaluations in this experiment were done on the set of 41
color sequences.
Five trackers were evaluated in this experiment, of which two are versions of the proposed
32
5
Evaluation
Precision plot
Distance Precision
0.8
0.6
0.4
RCS [0.686]
RCSK [0.674]
MOSSE [0.669]
CSK [0.641]
CS [0.628]
0.2
0
0
10
20
30
40
Location error threshold
50
(a) Precision plot.
Median CLE
Median DP
RCS
13.3
83.8
RCSK
13.8
81.4
CS
21
66.4
CSK
16.9
74
MOSSE
15.6
78.2
(b) Table of the median CLE (in pixels) and DP (in percent) values over all the sequences. The two
best results are displayed in red and blue fonts respectively.
Figure 5.3: Comparison between RCS, RCSK, CS, CSK and MOSSE for combined
grayscale and color names features.
tracker.
• RCS: the proposed robust learning scheme with a linear kernel.
• RCSK: the proposed robust learning scheme with a Gaussian kernel.
• CS: the learning scheme of [21] on a linear kernel.
• CSK: the original [21], which uses a Gaussian kernel.
• MOSSE: a straightforward generalization of the standard MOSSE tracker to multidimensional features using (2.12) and (2.4).
This mean that RCS and RCSK use algorithm 2.1 with different kinds of kernels. By
“linear kernel” it is meant that the standard inner product is applied as a kernel κ(f, g) =
hf, gi, as in the MOSSE tracker.
The results from the experiments are shown in figure 5.3. The best results were obtained
by using the robust learning scheme, i.e. RCS and RCSK. It can clearly be seen that the
learning scheme of [21] (CSK and CS) is suboptimal in this case. Further, there is a slight
advantage in using a linear kernel over a Gaussian kernel.
5.3
Color Evaluation
80
70
60
50
40
30
20
10
33
Original update scheme
Proposed update scheme
Figure 5.4: Comparison of original update scheme (CSK) with the proposed learning method (RCSK) using median distance precision (in percent).
5.2.3
Experiments with Other Color Features
To investigate if the robust learning scheme proposed in section 2.3 performs better in
general, it was compared with the CSK for all the color features mentioned in section 3.1.
The RCSK and CSK trackers as described in section 5.2.2 were applied with varying color
features. Color representations with no inherent intensity channel (RGB, rg, HUE, OppAngle, SO and CN) were concatenated with the conventional intensity channel, obtained
by the Matlab function rgb2gray.
Figure 5.4 displays the median distance precisions for each color feature. The robust update scheme performs better in 9 out of the 11 evaluated color descriptors. The effect
is most apparent for the high-dimensional color descriptors Opp-Angle and HUE. There
is also a significant performance gain when using CN, YCbCr, rg and HSV. Most importantly, the precision is increased for the best performing color descriptors.
5.3
Color Evaluation
In section 5.2 it was shown that the proposed learning scheme generally performs better
for varying color features. The RCSK was therefore chosen for evaluating the different
color representations. All color features in section 3.1 were included. These are also compared with using intensity features alone. As described in section 5.2.3, these intensity
features (obtained by rgb2gray) were concatenated with the color representations with no
inherent intensity channel (RGB, rg, HUE, Opp-Angle, SO and CN).2
5.3.1
Results
The results from the experiment are shown in figure 5.5. Color names obtains the best
results in general, with significantly better median CLE and DP values and better average
distance precision in the precision plot. The Opp-Angle (AOpp) descriptor is the clear
2 It was noted that using the inherent intensity component, e.g. the L-component in LAB, gives better results
than changing it to the usual intensity features (rgb2gray).
34
5
Evaluation
Precision plot
Distance Precision
0.8
0.6
I+CN [0.674]
I+AOpp [0.654]
HSV [0.616]
LAB [0.609]
I+HUE [0.600]
C [0.586]
I+RG [0.581]
YCbCr [0.575]
Opp [0.540]
I+RGB [0.531]
I [0.515]
I+SO [0.359]
0.4
0.2
0
0
10
20
30
40
Location error threshold
50
(a) Precision plot.
Median CLE
Median DP
I
42.8
48.5
I+RGB
42.3
50.4
LAB
22.3
64.3
YCbCr
25.6
59
I+RG
24.8
59.2
Opp
37.9
51.8
C
23.5
60.2
HSV
27.4
66.5
I+SO
66.6
26.4
I+AOpp
17.9
71.5
I+HUE
28.2
56.6
I+CN
13.8
81.4
(b) Table of the median CLE (in pixels) and DP (in percent) values over all the sequences. The two
best results are displayed in red and blue fonts respectively.
Figure 5.5: Color evaluation results. The performance of the RCSK tracker was
evaluated for the color features discussed in section 3.1 and intensity alone.
second best choice for high precision. These two color descriptors are then followed by
the set of descriptors with different photometric invariances, namely HSV, LAB, HUE,
C, RG and YCbCr. The simple opponent color transformation and the standard RGB
representation give slight increase in performance compared to only intensity. Lastly, it
can seen that the SO descriptor is not well suited for tracking purpose.
It should be noted that the Opponent-Angle (AOpp) descriptor encodes shape information
as well as color. It is thus a powerful descriptor in scenarios where shape is more discriminative that color. The attribute based results are summarized in figures 5.6 and 5.7. AOpp
outperform the other color descriptors in sequences with significant background clutter,
while struggling in motion blur. AOpp and HSV perform better than color names in illumination variation due to more photometric invariance. CN proves to be the most robust
descriptor at occlusions and blur. This evaluation show that some descriptors contain complementarity information, which indicates that even more powerful descriptors might be
found by combining them is sophisticated ways, together with shape information.
5.3.2
Discussion
Three probable reasons for the success of color names over the other color descriptors can
be identified. Firstly, it uses a continous representation in the sense that common distance
measures (e.g. the Euclidic distance) makes sense, in the way that similar colors are close
5.3
35
Color Evaluation
Precision plot of fast motion (14)
Precision plot of background clutter (18)
0.6
0.4
0.2
0
0
0.8
I+CN [0.542]
LAB [0.529]
HSV [0.520]
I+AOpp [0.501]
I+HUE [0.499]
I+RG [0.460]
YCbCr [0.454]
C [0.402]
Opp [0.399]
I+RGB [0.388]
I [0.377]
I+SO [0.220]
10
20
30
40
Location error threshold
Distance Precision
Distance Precision
0.8
0.6
0.4
0.2
0
0
50
Precision plot of motion blur (10)
I+AOpp [0.678]
I+HUE [0.603]
C [0.589]
HSV [0.588]
I+RG [0.576]
I+CN [0.573]
LAB [0.549]
YCbCr [0.523]
Opp [0.519]
I+RGB [0.507]
I [0.501]
I+SO [0.328]
10
20
30
40
Location error threshold
50
Precision plot of deformation (16)
0.8
0.7
0.4
0.2
0
0
I+CN [0.662]
HSV [0.616]
LAB [0.598]
I+AOpp [0.596]
I+HUE [0.558]
YCbCr [0.489]
I+RG [0.487]
C [0.444]
Opp [0.424]
I+RGB [0.401]
I [0.337]
I+SO [0.246]
10
20
30
40
Location error threshold
Distance Precision
Distance Precision
0.6
0.6
0.5
0.4
0.3
0.2
0.1
0
0
50
Precision plot of illumination variation (20)
0.2
0
0
50
0.8
I+AOpp [0.639]
HSV [0.614]
I+HUE [0.596]
I+CN [0.591]
LAB [0.575]
I+RG [0.551]
C [0.539]
YCbCr [0.526]
Opp [0.478]
I+RGB [0.466]
I [0.434]
I+SO [0.336]
10
20
30
40
Location error threshold
50
Distance Precision
Distance Precision
0.4
10
20
30
40
Location error threshold
Precision plot of in−plane rotation (20)
0.8
0.6
I+AOpp [0.628]
I+CN [0.611]
C [0.565]
HSV [0.560]
I+HUE [0.533]
I+RG [0.521]
LAB [0.501]
YCbCr [0.487]
Opp [0.449]
I+RGB [0.427]
I [0.379]
I+SO [0.343]
0.6
0.4
0.2
0
0
HSV [0.685]
I+AOpp [0.674]
I+CN [0.661]
C [0.626]
I+HUE [0.611]
LAB [0.593]
I+RG [0.540]
Opp [0.503]
YCbCr [0.502]
I+RGB [0.486]
I [0.456]
I+SO [0.321]
10
20
30
40
Location error threshold
50
Figure 5.6: Precision plots showing the results of the attribute-based evaluation of
different color features with the RCSK. The plots display the distance precision for
the evaluated attributes fast motion, background clutter, motion blur, deformation,
illumination variation and in-plane rotation. The value appearing in the title denotes
the number of videos associated with the respective attribute. The average distance
precision at 20 pixels is displayed in the legends.
36
5
Precision plot of low resolution (4)
Evaluation
Precision plot of occlusion (24)
0.7
0.8
0.5
I+CN [0.490]
LAB [0.489]
HSV [0.488]
I+HUE [0.473]
YCbCr [0.417]
I+RG [0.416]
C [0.415]
I+RGB [0.411]
Opp [0.411]
I [0.407]
I+AOpp [0.401]
I+SO [0.162]
0.4
0.3
0.2
0.1
0
0
10
20
30
40
Location error threshold
Distance Precision
Distance Precision
0.6
0.6
0.4
0.2
0
0
50
I+CN [0.651]
I+AOpp [0.627]
LAB [0.585]
HSV [0.572]
YCbCr [0.565]
C [0.564]
I+HUE [0.551]
I+RG [0.549]
Opp [0.521]
I+RGB [0.513]
I [0.484]
I+SO [0.383]
Precision plot of out−of−plane rotation (28)
10
20
30
40
Location error threshold
50
Precision plot of out of view (4)
0.8
0.7
I+AOpp [0.657]
I+CN [0.629]
HSV [0.623]
I+HUE [0.589]
C [0.579]
LAB [0.567]
I+RG [0.548]
YCbCr [0.524]
Opp [0.493]
I+RGB [0.483]
I [0.459]
I+SO [0.360]
0.4
0.2
0
0
10
20
30
40
Location error threshold
50
Distance Precision
Distance Precision
0.6
0.6
0.5
HSV [0.477]
I+CN [0.419]
LAB [0.412]
I+HUE [0.332]
I+AOpp [0.325]
YCbCr [0.287]
I+RGB [0.271]
C [0.269]
Opp [0.266]
I+RG [0.258]
I [0.257]
I+SO [0.206]
0.4
0.3
0.2
0.1
0
0
10
20
30
40
Location error threshold
50
Precision plot of scale variation (21)
Distance Precision
0.8
0.6
0.4
0.2
0
0
I+AOpp [0.668]
I+CN [0.623]
I+HUE [0.600]
LAB [0.594]
HSV [0.562]
I+RG [0.545]
YCbCr [0.521]
C [0.485]
Opp [0.474]
I+RGB [0.468]
I [0.448]
I+SO [0.330]
10
20
30
40
Location error threshold
50
Figure 5.7: Precision plots showing the results of the attribute-based evaluation
of different color features with the RCSK. The plots display the distance precision
for the evaluated attributes low resolution, occlusion, in-plane rotation, out of view
and scale variation. The value appearing in the title denotes the number of videos
associated with the respective attribute. The average distance precision at 20 pixels
is displayed in the legends.
5.4
Adaptive Dimensionality Reduction Evaluation
37
to each other. This is also true for LAB, which is considered to be perceptually uniform,
meaning that Euclidian distances in the colorspace reflects perceptual similarities. However, the hue component H in HSV should be interpreted as the angle in a cylindrical
coordinate system, meaning that its maximum and minimum H-value correspond to the
same hue. Secondly, color names is a probabilistic representation. The update scheme
of the appearance template in (2.21d) can thus be interpreted as a statistical update of the
color probabilities. However, for non probabilistic representations such as Lab, the template update of (2.21d) may produce colors that have not occurred in the examples. This
increases the sensitivity of the tracker towards background clutter and occlusions. The
third reason is that the color name representation is trained to categorize colors based on
how humans do by language. Intuitively, this is a discriminative way of selecting basic
colors.
Besides performance, two other attributes are important, namely compactness and computational cost. A feature descriptor should contain a minimal number of dimensions in
relation to its discriminative power to reduce computation and memory costs. Here, compact color space representations such as HSV have a clear advantage. However, with the
dimensionality reduction technique introduced in chapter 4, color names can be reduced
to 2 dimensions without any significant performance loss. The major drawback of AOpp
is its large computational cost, since it requires computations of derivatives among other
things. The color names representation only requires the computation of indices for the
lookup table, which can be done with simple integer arithmetic. The next step is just a
memory access, which is fast for processors with reasonably sized cashes. The median
tracking frame rate is in fact over 10 times higher for color names compared to AOpp,
while similar to HSV.
In conclusion it has been demonstrated that color information has the potential to drastically increase the tracking performance. However, the choice of color descriptor is crucial.
Simple representations such as RGB and opponent give only a negligible improvement
compared to using only grayscale features. The standard photometric invariant descriptors are significantly better suited for the tracking task. However, color names is the best
choice for the RCSK. Not only because it gives the best overall results, it is also surprisingly inexpensive to compute.
5.4
Adaptive Dimensionality Reduction Evaluation
This section evaluates the impact of the dimensionality reduction technique presented
in chapter 4. This is a general technique, that can be applied for any types of features
for the RCSK tracker. However, since color names was shown to be the generally best
color descriptor for the RCSK in section 5.3, it was only applied to that descriptor in
these experiments. The goal is thus to create a more compact color name representation,
without any significant performance loss.
The evaluated tracker versions that included the dimensionality reduction was implemented as algorithm 4.2. The color names features are compressed independently. This
proved to give better results than to compress the intensity channel together with the
color names. The learning rate parameter µ for the dimensionality reduction is set to
38
5
Evaluation
Precision plot
Distance Precision
0.8
0.6
I+CN [0.674]
I+CN6 [0.668]
I+CN9 [0.666]
I+CN8 [0.666]
I+CN2 [0.664]
I+CN7 [0.661]
I+CN3 [0.652]
I+CN4 [0.650]
I+CN5 [0.633]
I+CN1 [0.577]
I [0.515]
0.4
0.2
0
0
10
20
30
40
Location error threshold
50
(a) Precision plot.
Median CLE
Median DP
Median FPS
I
42.8
48.5
152
I+CN1
26.6
67.3
106
I+CN2
14.3
79.3
105
I+CN3
14.9
76.7
98.9
I+CN4
16.3
69.9
89.6
I+CN5
20
70.2
85.3
I+CN6
13.8
81.9
81.1
I+CN7
13.6
78.9
77.9
I+CN8
13.8
81.4
71.8
I+CN9
13.8
81.4
69.2
I+CN
13.8
81.4
78.9
(b) Table of the median CLE (in pixels), DP (in percent) and frame rate (in FPS) over all the
sequences. The two best results are displayed in red and blue fonts respectively.
Figure 5.8: Evaluation of the number of dimensions that color names is compressed
to, using the dimensionality reduction presented in chapter 4. The number next to
“CN” denotes the number of dimensions used. The RCSK is used for this evaluation.
0.1 for RCS (linear kernel) and 0.15 for RCSK (Gaussian kernel). Two experiments are
presented in this section. The first experiment investigates the impact of the number of
output feature dimensions. The second experiment evaluates the effect of the dimensionality reduction on RCS and RCSK from section 5.2.2 with color names. Computational
cost and performance is of equal interest in these experiments.
5.4.1
Number of Feature Dimensions
The RCSK (i.e with a Gaussian kernel) was used the evaluate the impact of the number
of reduced dimensions. The normalized color names have D1 = 10 dimensions. In this
experiment they were compressed to D2 = 1, 2, . . . , 9 dimensions using algorithm 4.2.
This is compared to no compression at all, i.e. RCSK with intensity and color names
(I+CN) and using zero dimensions, i.e. RCSK with only intensity.
The results are displayed in figure 5.8. The compressed feature representation is named
I+CND2 where D2 is the compressed color name dimension. From the results of this
experiment, it is clear that no significant gain is obtained by using more than two dimensions. However, using only one dimension gives inferior results, while hardly increasing
the tracker speed. Thus, CN2 is chosen for the final representation. Note that CN needs
5.5
39
State-of-the-Art Evaluation
Precision plot
Distance Precision
0.8
0.6
0.4
0.2
0
0
RCS I+CN [0.686]
RCS I+CN2 [0.676]
RCSK I+CN [0.674]
RCSK I+CN2 [0.664]
10
20
30
40
Location error threshold
50
(a) Precision plot.
Median CLE
Median DP
Median FPS
RCS I+CN
13.3
83.8
94.1
RCS I+CN2
15.3
75.7
136
RCSK I+CN
13.8
81.4
78.9
RCSK I+CN2
14.3
79.3
105
(b) Table of the median CLE (in pixels), DP (in percent) and frame rate (in FPS) over all the
sequences. The two best results are displayed in red and blue fonts respectively.
Figure 5.9: Comparison between the color names and compressed color names for
RCS and RCSK.
to be compressed to 6 or fewer dimensions to overcome the computational overhead introduced by the dimensionality reduction. In general this depends on the target size though.
5.4.2
Final Performance
In this experiment the RCS and RCSK with intensity and color names (I+CN) (as in
section 5.2.2) are compared with the respective trackers when color names are compressed
to 2 dimensions (I+CN2), which was found to be optimal in the previous experiment.
The results are shown in figure 5.9. The performance loss is minor in the precision plot
(average distance precision). But the speed gain is 45% for RCS and 33% for RCSK,
which is significant.
5.5
State-of-the-Art Evaluation
This section presents an extensive evaluation of the proposed trackers and state-of-theart methods that exist in literature. Two proposed versions are compared to the existing
methods, namely RCS with intensity and color names (RCS CN) and RCS with intensity
and compressed color names (RCS CN2). The “I” in the naming convention is dropped.
The Gaussian kernel versions are omitted since they proved to be inferior in section 5.4.2.
40
5
Evaluation
The proposed methods are compared with 20 trackers that exist in literature. These are:
CT [50], TLD [24], DFT [38], EDFT [14], ASLA [23], L1APG [4], CSK [21], SCM
[52], LOT [32], CPF [35], CXT [13], Frag [1], IVT [36], ORIA [45], MTT [51], BSBT
[40], MIL [3], Struck [19], LSHT [20] and LSST [44]. These methods include the top
four performing trackers in the recent benchmark evaluation [46], namely Struck, SCM,
TLD and CXT.3 Also ASLA, CSK, DFT and L1APG were among the top trackers in this
evaluation. EDFT, LSHT and LSST are recent trackers in the literature, that were not
included in the benchmark evaluation. Again, all 41 color sequences are used. The results
are presented using both precision plots (distance precision) and success plots (overlap
precision).
The overall results are presented in figure 5.10. The two proposed trackers outperform
or perform favorably compared to the other evaluated methods in all evaluation metrics.
Struck is the best performing method of the compared existing trackers. It uses powerful
learning methods, namely a kernelized structured output support vector machine. So it
is noteworthy that it is outperformed by tracker using a simpler learning method, namely
a modified least squared classifier. It should also be noted that neither Struck nor RCS
estimate scale variations. For this reason, ASLA and SCM which use an affine tracking
model, obtain better overlap precisions at high overlap thresholds b > 0.7. However the
robustness of these trackers seems to be far less than those of RCS and Struck.
The computational cost is an other important aspect of this evaluation. It should be noted
that the RCS versions run at an order of magnitude higher frame rate than Struck (which
is a C++ implementation). RCS CN2 runs at the second highest frame rate in median,
only 10% below CSK. Also CPF and CT obtain notable frame rates, but provide inferior
performance. ASLA and SCM perform rather well in this evaluation, but they are not
feasible for real-time applications.
The precision plots of the attribute based results are shown in figures 5.11 and 5.12. The
corresponding success plots are shown in figures 5.13 and 5.14. The proposed trackers perform favourably in most of these attributes. The results are especially good in attributes
that are related to appearance changes, namely motion blur, deformation, illumination
variation, in-plane-rotation, occlusion and out-of-plane-rotation. The reason for this is
probably a combination between robust learning and robust features. Struck performs better in sequences with fast-motion. This is related to the negative effects of the windowing
operation discussed in section 2.4.2. It should be mentioned that ASLA and SCM naturally have an advantage in sequences with large scale variations since they are able to
estimate this property.
5.6
Conclusions and Future Work
In this chapter it has been shown that the proposed RCS and RCSK trackers perform
favorably to state-of-the-art trackers on a large number of benchmark sequences. Especially, their performance and speed combined is unmatched among the evaluated trackers.
The framework is however still quite simple, so further improvements are expected to be
3 Unfortunately,
code or binaries could not be obtained for all top 10 trackers from this comparison.
5.6
41
Conclusions and Future Work
Precision plot
Distance Precision
0.8
0.6
RCS CN [0.686]
RCS CN2 [0.676]
Struck [0.639]
EDFT [0.528]
CSK [0.526]
LSHT [0.511]
ASLA [0.505]
TLD [0.498]
CXT [0.484]
LOT [0.481]
0.4
0.2
0
0
10
20
30
40
Location error threshold
50
(a) Precision plot.
Success plot
Overlap Precision
0.8
0.6
0.4
0.2
0
0
RCS CN [0.484]
RCS CN2 [0.473]
Struck [0.459]
ASLA [0.417]
EDFT [0.401]
CSK [0.377]
SCM [0.377]
LSHT [0.375]
TLD [0.369]
DFT [0.358]
0.2
0.4
0.6
Overlap threshold
RCS CN
RCS CN2
Struck
LSHT
CPF
CXT
DFT
CSK
MIL
EDFT
SCM
TLD
ASLA
BSBT
LOT
L1APG
MTT
Frag
ORIA
LSST
CT
IVT
CLE
13.3
15.3
19.6
32.3
41.1
43.8
47.9
50.3
51.9
53.5
54.3
54.4
56.8
58.5
60.9
62.9
67.8
70.8
72.5
78.4
78.4
94.3
DP
83.8
75.7
71.3
55.9
37.1
39.5
41.4
54.5
35.5
49
34.1
45.4
42.2
20.9
37.1
28.9
32.3
38.7
22.5
23.4
20.8
22.4
FPS
94.1
136
10.4
12.5
55.5
11.3
9.11
151
11.6
19.7
0.0862
20.7
0.946
3.45
0.467
1.03
0.378
3.34
7.92
3.57
68.9
14.2
(c) Table of the median CLE (in pixels), DP (in percent) and frame rate (in
FPS) over all the sequences. The two
best results are displayed in red and
blue fonts respectively.
0.8
1
(b) Success plot.
Figure 5.10: Comparison with state-of-the-art methods in literature. The trackers
proposed in this thesis are shown in bold font. For clarity, only the top 10 performing
methods are displayed in the plots.
possible.
There are a number of issues that could be addressed for future development. In section 5.3 it was noted that it should be possible to combine color and shape descriptors to
obtain better discrimination of the target. The possibility of feature selection also comes
in here, which would provide a way of selecting the most discriminative feature combinations. This is related to the dimensionality reduction presented in chapter 4. The
current method basically uses the amount of structure or variance to determine the importance of different feature combinations. One other option would be to do feature selec-
42
5
Evaluation
tion/reduction based on supervised techniques, to aim at selecting the most discriminative
feature combinations.
One factor, that is limiting in some situations, is that the RCS and RCSK trackers do
not estimate scale. This limitation should be addressed in future research, since it may
give a significant overall performance gain as well. This is motivated by the success of
ASLA and SCM in sequences with scale variation (see figure 5.12 and 5.14). As discussed
in section 8.1.2, the tracker model can be rescaled accurately, without basically any cost,
since it can be stored entirely in the Fourier-domain. This opens the possibility of applying
a brute-force search in the scale dimension. But to preserve the low computational cost,
more sophisticated techniques must probably be applied.
A few issues are apparent from the attribute based results in section 5.5. One of these
is fast target motions. A few techniques might be used to address this problem. A first
simple thing to try is motion prediction, as discussed in section 2.4.2. This problem is
also related to general failure detection and re-detection, which is currently not present
in the algorithm. As most generic trackers, the RCS and RCSK have a very local view
of the scene, meaning that it only cares about the target and the surrounding background.
In contrast, the TLD tracker applies a whole framework for the purpose of re-detection
in failure cases. Failures also often occur at occlusions, which is a major challenge. Although, the RCS is shown to be fairly robust to partial occlusion and short full occlusions
compared to other trackers, long-term occlusions is a major problem. To address these
sorts of issues, I think that it is necessary for the framework to take multiple hypothesis
into account, and to evaluate them over time using different tracking models.
Although, failure detection and handling may be important for the generic tracking scenario, it may not be desired when a tracker acts as a part of larger system, which often is
the case in real-world applications. One example is people tracking, which is discussed
in the second part of this thesis. In that case there are distinct information, provided by
for example a person detector, which can be used for failure detection and re-detection.
The trackers that are proposed here might be especially suited as parts of such larger systems, thanks to their speed and simplicity. Another interesting property is that the tracker
outputs a dense set of confidence scores, which can be fused with confidences from other
system parts.
5.6
43
Conclusions and Future Work
Precision plot of background clutter (18)
0.7
0.6
0.6
0.5
0.4
0.3
0.2
0.1
0
0
Struck [0.599]
RCS CN2 [0.537]
RCS CN [0.518]
CXT [0.407]
TLD [0.405]
CPF [0.394]
MIL [0.389]
MTT [0.384]
EDFT [0.375]
DFT [0.371]
10
20
30
40
Location error threshold
Distance Precision
Distance Precision
Precision plot of fast motion (14)
0.7
0.5
0.4
0.3
0.2
0.1
0
0
50
Precision plot of motion blur (10)
RCS CN2 [0.650]
RCS CN [0.595]
Struck [0.555]
EDFT [0.465]
DFT [0.411]
L1APG [0.383]
CXT [0.379]
ASLA [0.375]
TLD [0.375]
MIL [0.374]
10
20
30
40
Location error threshold
Distance Precision
Distance Precision
0
0
0.6
0.4
0.2
0
0
50
Precision plot of illumination variation (20)
RCS CN2 [0.596]
RCS CN [0.579]
ASLA [0.511]
Struck [0.506]
SCM [0.436]
CSK [0.433]
DFT [0.427]
TLD [0.410]
LSHT [0.407]
CXT [0.396]
10
20
30
40
Location error threshold
50
Distance Precision
Distance Precision
0
0
10
20
30
40
Location error threshold
50
0.8
0.6
0.2
RCS CN2 [0.650]
RCS CN [0.634]
Struck [0.554]
ASLA [0.553]
LSHT [0.532]
DFT [0.527]
EDFT [0.523]
LOT [0.502]
CPF [0.490]
Frag [0.449]
Precision plot of in−plane rotation (20)
0.8
0.4
50
0.8
0.6
0.2
10
20
30
40
Location error threshold
Precision plot of deformation (16)
0.8
0.4
RCS CN2 [0.591]
RCS CN [0.585]
ASLA [0.567]
CSK [0.540]
Struck [0.525]
LOT [0.501]
EDFT [0.495]
LSHT [0.485]
SCM [0.473]
DFT [0.465]
0.6
0.4
0.2
0
0
RCS CN2 [0.658]
RCS CN [0.653]
Struck [0.533]
EDFT [0.458]
CXT [0.457]
CSK [0.451]
ASLA [0.441]
LSHT [0.429]
L1APG [0.428]
MTT [0.423]
10
20
30
40
Location error threshold
50
Figure 5.11: Precision plots showing the attribute-based results of the state-of-theart evaluation. The plots display the distance precision for the evaluated attributes
fast motion, background clutter, motion blur, deformation, illumination variation
and in-plane rotation. The trackers proposed in this thesis are shown in bold font.
For clarity, only the top 10 performing methods are displayed in the plots. The value
appearing in the title denotes the number of videos associated with the respective
attribute. The average distance precision at 20 pixels is displayed in the legends.
44
5
Precision plot of low resolution (4)
Evaluation
Precision plot of occlusion (24)
0.7
0.8
0.5
Struck [0.497]
MTT [0.493]
L1APG [0.460]
MIL [0.423]
CSK [0.410]
RCS CN [0.407]
RCS CN2 [0.407]
BSBT [0.346]
CXT [0.335]
SCM [0.335]
0.4
0.3
0.2
0.1
0
0
10
20
30
40
Location error threshold
Distance Precision
Distance Precision
0.6
0.6
0.2
0
0
50
RCS CN2 [0.668]
RCS CN [0.660]
Struck [0.594]
ASLA [0.521]
CPF [0.520]
TLD [0.520]
LOT [0.504]
LSHT [0.483]
SCM [0.464]
EDFT [0.462]
0.4
Precision plot of out−of−plane rotation (28)
10
20
30
40
Location error threshold
50
Precision plot of out of view (4)
0.8
0.7
RCS CN2 [0.655]
RCS CN [0.648]
Struck [0.555]
ASLA [0.512]
CPF [0.494]
TLD [0.489]
LSHT [0.480]
LOT [0.476]
EDFT [0.453]
CSK [0.451]
0.4
0.2
0
0
10
20
30
40
Location error threshold
50
Distance Precision
Distance Precision
0.6
0.6
0.5
LOT [0.504]
CPF [0.446]
MIL [0.445]
Struck [0.442]
TLD [0.436]
DFT [0.408]
CXT [0.382]
RCS CN2 [0.364]
RCS CN [0.345]
BSBT [0.326]
0.4
0.3
0.2
0.1
0
0
10
20
30
40
Location error threshold
50
Precision plot of scale variation (21)
Distance Precision
0.8
0.6
0.4
0.2
0
0
RCS CN2 [0.632]
RCS CN [0.630]
Struck [0.613]
ASLA [0.574]
SCM [0.561]
CPF [0.503]
EDFT [0.478]
CSK [0.476]
TLD [0.475]
CXT [0.468]
10
20
30
40
Location error threshold
50
Figure 5.12: Precision plots showing the attribute-based results of the state-of-theart evaluation. The plots display the distance precision for the evaluated attributes
low resolution, occlusion, in-plane rotation, out of view and scale variation. The
trackers proposed in this thesis are shown in bold font. For clarity, only the top
10 performing methods are displayed in the plots. The value appearing in the title
denotes the number of videos associated with the respective attribute. The average
distance precision at 20 pixels is displayed in the legends.
5.6
45
Conclusions and Future Work
Success plot of background clutter (18)
0.8
0.6
0.6
0.4
0.2
0
0
Overlap Precision
Overlap Precision
Success plot of fast motion (14)
0.8
Struck [0.455]
RCS CN [0.406]
RCS CN2 [0.399]
MIL [0.323]
TLD [0.323]
MTT [0.311]
CXT [0.310]
EDFT [0.310]
CPF [0.307]
DFT [0.298]
0.2
0.4
0.6
Overlap threshold
0.8
0.4
0.2
0
0
1
0.8
0.6
0.6
0.4
0.2
0
0
RCS CN2 [0.472]
RCS CN [0.458]
Struck [0.444]
EDFT [0.365]
DFT [0.341]
MIL [0.321]
ASLA [0.318]
TLD [0.316]
L1APG [0.312]
CSK [0.303]
0.2
0.4
0.6
Overlap threshold
0.8
0.4
0.2
0
0
1
0.6
0.6
0.2
0
0
ASLA [0.432]
RCS CN2 [0.407]
RCS CN [0.404]
Struck [0.375]
SCM [0.357]
DFT [0.338]
LSHT [0.324]
CSK [0.320]
TLD [0.311]
EDFT [0.294]
0.2
0.4
0.6
Overlap threshold
0.8
0.4
0.6
Overlap threshold
0.8
1
RCS CN2 [0.453]
RCS CN [0.440]
ASLA [0.422]
DFT [0.405]
Struck [0.394]
EDFT [0.370]
CPF [0.362]
LSHT [0.356]
LOT [0.337]
Frag [0.326]
0.2
0.4
0.6
Overlap threshold
0.8
1
Success plot of in−plane rotation (20)
0.8
Overlap Precision
Overlap Precision
Success plot of illumination variation (20)
0.8
0.4
0.2
Success plot of deformation (16)
0.8
Overlap Precision
Overlap Precision
Success plot of motion blur (10)
ASLA [0.461]
RCS CN [0.424]
RCS CN2 [0.422]
Struck [0.400]
EDFT [0.395]
CSK [0.379]
SCM [0.377]
LSHT [0.366]
LOT [0.363]
DFT [0.362]
1
0.4
0.2
0
0
RCS CN [0.454]
RCS CN2 [0.450]
Struck [0.396]
ASLA [0.382]
EDFT [0.353]
CXT [0.340]
CSK [0.332]
LSHT [0.325]
DFT [0.322]
L1APG [0.322]
0.2
0.4
0.6
Overlap threshold
0.8
Figure 5.13: Success plots showing the attribute-based results of the state-of-the-art
evaluation. The plots display the overlap precision for the evaluated attributes fast
motion, background clutter, motion blur, deformation, illumination variation and
in-plane rotation. The trackers proposed in this thesis are shown in bold font. For
clarity, only the top 10 performing methods are displayed in the plots. The value
appearing in the title denotes the number of videos associated with the respective
attribute. The area under the curve is displayed in the legends.
[Figure 5.14 contains five success plots (overlap precision versus overlap threshold) for the attributes low resolution (4), occlusion (24), scale variation (21), out of view (4) and out-of-plane rotation (28). Each legend lists the ten best trackers with their area-under-the-curve scores.]
Figure 5.14: Success plots showing the attribute-based results of the state-of-the-art evaluation. The plots display the overlap precision for the evaluated attributes low resolution, occlusion, out-of-plane rotation, out of view and scale variation. The trackers proposed in this thesis are shown in bold font. For clarity, only the top 10 performing methods are displayed in the plots. The value appearing in the title denotes the number of videos associated with the respective attribute. The area under the curve is displayed in the legends.
Part II: Category Object Tracking

6 Tracking Model
In many applications, the problem is to automatically track all objects of a certain category. Examples of such applications include automated surveillance and safety systems in cars. These problems contain additional challenges compared to generic tracking: the system needs to automatically detect new objects and track them throughout the scene. However, compared to generic tracking there is additional a priori information that can be exploited to improve the robustness and accuracy. This information consists of the general appearance of the object class, e.g. humans. If multiple objects are tracked simultaneously, the interactions between them can be used to further improve the tracking.
This second part of the thesis describes the automatic category tracker that was developed
and implemented. This chapter gives an overview of the complete framework and describes the object and observation model, on which the framework is built. Section 6.1
gives an overview of the framework. Section 6.2 describes how individual objects are
modelled using deformable part models and dynamics. Section 6.3 describes the measurement model and section 6.4 describes how the objects are tracked by applying the
Rao-Blackwellized Particle Filter (see section A.4) to the model.
6.1 System Overview
Most category object trackers exploit the class-specific and object-specific appearance along with a motion model. The implemented tracker, however, also incorporates tracking of specific object parts into the framework. One motivation for this is that such information can be of importance in e.g. action recognition. Consider for example tracking a human along with its hands and feet. Actions such as “walking” and “playing guitar” can then potentially be detected. However, the main motivation for incorporating part tracking is to improve the tracking of the object itself.
A short overview of the system is as follows. Images from the sequence are processed
through two different system parts. The object detector produces dense score functions
of the class specific object and part appearances. The appearance tracking produces score
functions of the object and part specific appearances, based on the learnt appearance of the
object and parts. These two sources of information are combined in a Bayesian filtering
step, together with the dynamic model of the object. The results from the filtering are
used to estimate the new location of the object in the image. The estimation is then
used to update the learnt appearances of the object and parts. Tracked objects interact in
two ways within the framework. Location estimates of all objects and parts are used to
detect inter-object occlusions. This information is then used in the filtering step and in the
appearance learning. The current location estimates are also used together with the object
detections to find and initialize new objects in the scene and to remove false objects.
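To make the data flow concrete, the sketch below outlines the per-frame loop just described. It is a minimal structural sketch, not the thesis implementation; all class and function names (Detector, Appearance, Filter, process_frame) are hypothetical placeholders with dummy stub bodies.

```python
import numpy as np

# Minimal structural sketch of the per-frame loop (hypothetical names, dummy stubs).
class Detector:
    def score_maps(self, image):
        # class-specific object/part detection scores (dummy uniform map here)
        return np.zeros(image.shape[:2])

class Appearance:
    def score_maps(self, image, prev_estimate):
        # object/part-specific appearance scores around the previous estimate
        return np.zeros(image.shape[:2])
    def update(self, image, estimate):
        pass  # learn/update the appearance model around the new estimate

class Filter:
    def predict(self): pass                          # time update with the dynamic model
    def update(self, det_scores, app_scores): pass   # Bayesian measurement update (fusion)
    def estimate(self): return (0.0, 0.0)            # point estimate of the location

def process_frame(image, objects, detector):
    det_scores = detector.score_maps(image)
    for obj in objects:
        app_scores = obj['appearance'].score_maps(image, obj['estimate'])
        obj['filter'].predict()
        obj['filter'].update(det_scores, app_scores)
        obj['estimate'] = obj['filter'].estimate()
        obj['appearance'].update(image, obj['estimate'])
    # occlusion reasoning and adding/removing objects would follow here
    return objects
```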
6.2 Object Model
This section presents the dynamic object model, which is formulated as a state space model.
6.2.1 Object Motion Model
A popular assumption in visual tracking is a constant velocity motion model [22, 2, 9]. If z is the object position in Cartesian coordinates (usually one-, two- or three-dimensional), then the constant velocity model can be expressed as in (6.1).

\dot{z}(t) = v(t)   (6.1a)
\dot{v}(t) = w^v(t)   (6.1b)

The first equation in the model defines v as the velocity of the object. w^v is process noise, which gives some flexibility in the model. Usually w^v is assumed to be white and Gaussian. A discretization of this model is given in (6.2), where T is the sample time. The discretization is done using zero-order hold [18], where the noise w^v is assumed to be constant during each sample interval.

z_{t+1} = z_t + T v_t + \frac{T^2}{2} w_t^v   (6.2a)
v_{t+1} = v_t + T w_t^v   (6.2b)

The scale (size) s_t of the object is modelled as constant with some process noise. Such a model is valid in cases when the relative motion in the direction of the optical axis is small or if the tracked objects are far away.

s_{t+1} = s_t + s_t w_t^s   (6.3)

It is physically more correct to let the noise w_t^s model the relative change in scale. This is the motivation for scaling the noise with s_t. Only uniform scale is considered in this work, i.e. s_t is scalar.
6.2.2 Part Deformations and Motion
The model of the part locations should include the deformation costs of the parts and
a motion model. The state space model in (6.4) has been constructed only using the
deformation costs.
z_{t+1}^j = s_t a^j + s_t u_t^j   (6.4a)
u_t^j \sim \mathcal{N}(0, D^j)   (6.4b)

Here, z_t^j is the relative position of part j at time t, a^j is the modelled expected position of the part and u_t^j is noise modelling the uncertainty. Although this model validly describes the deformations of a part, as discussed in section 7.2.3, it discards the history of the part motion given the scale, i.e. p(z_{t+1}^j | z_t^j, s_t) = p(z_{t+1}^j | s_t). The history is included by instead using the Markov model (6.5), based on a constant relative position assumption. This assumption is highly valid for parts that are more or less rigidly connected to the object, e.g. the head of a human. However, the model can also be tuned for moving parts by increasing the process noise w_t^{z^j}. The part deformations are instead considered in the observation likelihood, as discussed in section 6.3.

z_{t+1}^j = z_t^j + s_t w_t^{z^j}   (6.5)

This thesis will only deal with static models of part locations. Although there is a possibility of using dynamic models, it is not investigated in this work.
6.2.3 The Complete Transition Model
The complete transition model that is used in the proposed object tracker is given in (6.6). The number of states is 2N + 5, where N is the number of object parts. In this chapter, x_t = (z_t^0, s_t, v_t, z_t^1, \ldots, z_t^N)^T is used to denote the object state at time t.
z_{t+1}^0 = z_t^0 + T v_t + s_t w_t^{z^0}   (6.6a)
s_{t+1} = s_t + s_t w_t^s   (6.6b)
v_{t+1} = v_t + T s_t w_t^v   (6.6c)
z_{t+1}^j = z_t^j + s_t w_t^{z^j} ,   j \in \{1, \ldots, N\}   (6.6d)
w_t^{z^j} \sim \mathcal{N}(0, Q^{z^j}) ,   j \in \{0, \ldots, N\}   (6.6e)
w_t^s \sim \mathcal{N}(0, Q^s)   (6.6f)
w_t^v \sim \mathcal{N}(0, Q^v)   (6.6g)
The velocity state noise w_t^v that should appear in (6.6a) according to the discretization (6.2) of the constant velocity model is discarded. The reason is that the Rao-Blackwellized Particle Filter (RBPF) (see section A.4) does not handle this well if the velocity is taken as a linear state and the position as non-linear. This noise is replaced with w_t^{z^0}, which is assumed to be independent of w_t^v. However, this modification also adds the freedom to tune the model between constant velocity and random walk. The drawback of this modification is insignificant, since the goal is to estimate the position of the object and not the real velocity.
The process noises defined in (6.6) are assumed to be mutually uncorrelated. Q^{z^j}, Q^s and Q^v are general covariance matrices. The fact that the noises are scaled with s_t is motivated by the pinhole camera model. Note that the model is non-linear and non-Gaussian. It is however linear and Gaussian conditioned on the scale s_t. It is thus applicable for the RBPF.
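As a concrete illustration of (6.6), the following is a minimal NumPy sketch, not the thesis code, of drawing one sample from the transition model for a single particle. All variable names and the example covariance values in the usage example are assumptions made here.

```python
import numpy as np

def sample_transition(z0, s, v, z_parts, T, Qz0, Qs, Qv, Qz_parts, rng):
    """Draw x_{t+1} from the transition model (6.6) for one particle.

    z0: object position (2,), s: scale (scalar), v: velocity (2,),
    z_parts: relative part positions (N, 2). Qz0, Qv: 2x2 covariances,
    Qs: scalar variance, Qz_parts: list of N 2x2 covariances.
    """
    w_z0 = rng.multivariate_normal(np.zeros(2), Qz0)
    w_s = rng.normal(0.0, np.sqrt(Qs))
    w_v = rng.multivariate_normal(np.zeros(2), Qv)

    z0_next = z0 + T * v + s * w_z0                        # (6.6a)
    s_next = s + s * w_s                                   # (6.6b)
    v_next = v + T * s * w_v                               # (6.6c)
    z_parts_next = np.array([
        zj + s * rng.multivariate_normal(np.zeros(2), Qj)  # (6.6d)
        for zj, Qj in zip(z_parts, Qz_parts)
    ])
    return z0_next, s_next, v_next, z_parts_next

# Example usage with hypothetical parameter values:
rng = np.random.default_rng(0)
z0n, sn, vn, zpn = sample_transition(
    np.array([100.0, 200.0]), 1.5, np.array([1.0, 0.0]),
    np.zeros((8, 2)), T=1.0,
    Qz0=0.5 * np.eye(2), Qs=1e-4, Qv=0.1 * np.eye(2),
    Qz_parts=[0.2 * np.eye(2)] * 8, rng=rng)
```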
6.3 The Measurement Model
This section describes the measurement model that is used in the tracker. Earlier works (like [9]) have used detector confidences as likelihoods in filtering or optimization frameworks. The measurement model used here is inspired by [22], but extended with the deformation model for part positions. The likelihood contains two basic factors, as described in (6.7). I_t is the image that is measured at time t. The factor p(I_t | x_t) thus describes how well the given state explains the measured image. M is the deformation model, which is static. The second factor p(M | x_t) describes how well the state fits into the deformation model.

p(I_t, M | x_t) = p(I_t | x_t)\, p(M | x_t)   (6.7)

Independence is assumed between the image measurements and the deformations given the object state.
6.3.1 The Image Likelihood
The first factor in (6.7) combines information from the class-specific appearance of the object and parts with the appearance of individual objects. Object detections (see chapter 7) are used as class-specific appearance. p(\theta_t^j | z, s) denotes the probability of detecting the object (if j = 0) or part j (otherwise) at position z and scale s in image I_t. Section 7.2.2 describes how this factor is computed. The object appearance is modelled in a similar way. Assume that there is a detector trained on the appearance of the specific object. p(\varphi_t^j | z) denotes the probability of detection, analogously to the object detection. Section 8.1.2 describes how this factor is computed using the proposed RCSK tracker that is described in the first part of this thesis. The image likelihood is modelled by (6.8).

p(I_t | x_t) = p(\theta_t^0 | z_t^0, s_t)\, p(\varphi_t^0 | z_t^0) \prod_{j=1}^{N} p(\theta_t^j | z_t^0 + z_t^j, s_t)\, p(\varphi_t^j | z_t^0 + z_t^j)   (6.8)

The model assumes independence between the object and part detections conditioned on the state. This is intuitively a valid approximation in an occlusion-free environment. However, if occlusions occur, the detections are likely to be correlated. The model also assumes independence between the detections from the class appearance detector and the object appearance detector conditioned on the state. It can be argued that this is a reasonable approximation if the detectors use different features. This is not valid at occlusions though.
6.3.2 The Model Likelihood
The second factor in (6.7) exploits the information in the known deformable parts model. This factor is modelled by (6.9).

p(M | x_t) = \prod_{j=1}^{N} p(a^j | z_t^j, s_t) = \prod_{j=1}^{N} \mathcal{N}\!\left( a^j ; \frac{z_t^j}{s_t}, D^j \right)   (6.9)

a^j is the known mean position of the part and D^j is a covariance matrix describing the deformations. Section 7.2.3 describes how these values are computed.
6.4 Applying the Rao-Blackwellized Particle Filter to the Model
The Rao-Blackwellized Particle Filter (RBPF) is a Bayesian filtering algorithm that exploits linear-Gaussian substructures in the state space model. It uses a particle filter to approximate the set of non-linear states x_t^n and Kalman filters for the remaining states x_t^l, which are linear and Gaussian conditioned on x_t^n. The algorithm is described in section A.4. The state space model in (6.6) is only linear and Gaussian conditioned on the scale s_t, so this state has to be included in x_t^n. All states except the velocity states v_t appear as non-Gaussian in the measurement model because of (6.8). This implies that only v_t can truly be included in x_t^l. However, since the main goal is to approximate the object position and scale, we will assume a Gaussian approximation of the part measurements. The partitioning of the states is thus done as in (6.10).

x_t^n = \begin{pmatrix} z_t^0 \\ s_t \end{pmatrix}, \qquad x_t^l = \begin{pmatrix} v_t \\ z_t^1 \\ \vdots \\ z_t^N \end{pmatrix}   (6.10)
6.4.1 The Transition Model
The state transition model in (6.6) with the partitioning of states in (6.10) is a special case of the RBPF model in (A.14). The functions and matrices in the RBPF model are identified in (6.11).

f_t^n(x_t^n) = x_t^n, \qquad f_t^l(x_t^n) = 0_{2N+2 \times 1}   (6.11a)
A_t^n(x_t^n) = \begin{pmatrix} T I_{2\times 2} & 0_{2\times 2N} \\ 0_{1\times 2} & 0_{1\times 2N} \end{pmatrix}, \qquad A_t^l(x_t^n) = I_{2N+2 \times 2N+2}   (6.11b)
B_t^n(x_t^n) = s_t I_{3\times 3}, \qquad B_t^l(x_t^n) = s_t \begin{pmatrix} T I_{2\times 2} & 0_{2\times 2N} \\ 0_{2N\times 2} & I_{2N\times 2N} \end{pmatrix}   (6.11c)

The covariances are given by (6.12).
Q_t^n = \begin{pmatrix} Q^{z^0} & 0_{2\times 1} \\ 0_{1\times 2} & Q^s \end{pmatrix}   (6.12a)
Q_t^l = \begin{pmatrix} Q^v & 0_{2\times 2} & \cdots & 0_{2\times 2} \\ 0_{2\times 2} & Q^{z^1} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0_{2\times 2} \\ 0_{2\times 2} & \cdots & 0_{2\times 2} & Q^{z^N} \end{pmatrix}   (6.12b)
Q_t^{ln} = 0_{3 \times 2N+2}   (6.12c)

6.4.2 The Measurement Update for the Non-Linear States
The measurement model described in section 6.3 is clearly more general than the one in (A.14). This means that the RBPF cannot be applied directly. The restriction on the measurement model in the RBPF comes from the fact that it has to be linear and Gaussian in x_t^l conditioned on x_t^n. However, the standard particle filter described in section A.3 can handle general measurement models. The measurement model in section 6.3 can thus be used to update the particle filtered states x^n. A Gaussian approximation of the measurement model is then used to update the linear states x^l. This results in a variant of the RBPF that uses different measurement models for the two measurement updates in algorithm A.4. The full model is used in the particle filter measurement update and the linear-Gaussian approximation is used in the Kalman filter measurement update.
The particle filter measurement update in algorithm A.4 requires the probability density p(y_t | X_t^n, Y_{t-1}). y_t is the measurement, which in our case is y_t = \{I_t, M\}, and Y_t = \{y_1, \ldots, y_t\} denotes all measurements up to time t. X_t^n = \{x_1^n, \ldots, x_t^n\} denotes the trajectory of non-linear states up to time t. This probability is given in (6.13). See section B.2 for the proof.
p(y_t | X_t^n, Y_{t-1}) = g_t(s_t)\, L_t^0(z_t^0, s_t) \prod_{j=1}^{N} \left( h_t^j(\,\cdot\,, s_t) \star L_t^j(\,\cdot\,, s_t) \right)\!(z_t^0)
                       = g_t(s_t)\, L_t^0(z_t^0, s_t) \prod_{j=1}^{N} \int_{\mathbb{R}^2} h_t^j(z, s_t)\, L_t^j(z_t^0 + z, s_t)\, dz   (6.13)

The functions that appear here are defined in (6.14).
L_t^j(z, s) = p(\theta_t^j | z, s)\, p(\varphi_t^j | z)   (6.14a)
h_t^j(z, s) = \mathcal{N}\!\left( z; \mu_t^j(s), H_t^j(s) \right)   (6.14b)
g_t(s) = \prod_{j=1}^{N} \mathcal{N}\!\left( \frac{1}{s}\hat{z}_{t|t-1}^j ; a^j, \frac{1}{s^2} P_{t|t-1}^{z^j} + D^j \right)   (6.14c)
The mean and covariance of h_t^j are defined in (6.15).

\mu_t^j(s) = H_t^j(s) \left( (P_{t|t-1}^{z^j})^{-1} \hat{z}_{t|t-1}^j + \frac{1}{s} (D^j)^{-1} a^j \right)   (6.15a)
H_t^j(s) = \left( (P_{t|t-1}^{z^j})^{-1} + (s^2 D^j)^{-1} \right)^{-1}   (6.15b)
\hat{z}_{t|t-1}^j and P_{t|t-1}^{z^j} are the predicted mean and covariance of the state z_t^j given the trajectory X_t^n of the non-linear states, i.e. we have p(z_t^j | X_t^n, Y_{t-1}) = \mathcal{N}(z_t^j; \hat{z}_{t|t-1}^j, P_{t|t-1}^{z^j}). This means that \hat{z}_{t|t-1}^j and P_{t|t-1}^{z^j} depend on z_t^0 and s_t, even though this is not denoted explicitly, to simplify the notation. In the RBPF algorithm, these are the predictions bound to the particle X_t^{n,i} when evaluating the likelihood for this particle.
The derived particle weighting function in (6.13) has an interesting interpretation. h_t^j acts as a filter that smooths the likelihood function for part j. This reflects the uncertainty that is present in the predicted part locations. h_t^j also shifts the part likelihood so that it is evaluated in z_t^0 + \mu_t^j(s_t). \mu_t^j(s_t) should be seen as the predicted relative part location that has been corrected by the deformation model. Equation (6.15) is recognized as the fusion formula of the two estimates \hat{z}_{t|t-1}^j and s_t a^j, with covariances P_{t|t-1}^{z^j} and s_t^2 D^j, of the part location. H_t^j(s_t) is the covariance of the new part location estimate. h_t^j is in fact the probability density p(z_t^j | X_t^n, Y_{t-1}, a^j), which supports this interpretation.
The function g_t(s_t) is essentially a deformation cost, where the uncertainty in the part locations has been added to the deformation covariance. This factor is the model likelihood given the non-linear states, p(M | X_t^n, Y_{t-1}).
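To illustrate the fusion in (6.15) and the corresponding factor of g_t(s) in (6.14c), the following is a minimal NumPy/SciPy sketch (not the thesis code). All names are placeholders and the availability of SciPy is an assumption.

```python
import numpy as np
from numpy.linalg import inv
from scipy.stats import multivariate_normal

def part_fusion(z_pred, P_pred, a, D, s):
    """Fuse the Kalman prediction (z_pred, P_pred) of a part location with the
    deformation prior (s*a, s^2*D), as in (6.15). Also returns the deformation
    factor of g_t(s) in (6.14c) for this part. A sketch, not the thesis code."""
    H = inv(inv(P_pred) + inv(s**2 * D))                       # (6.15b)
    mu = H @ (inv(P_pred) @ z_pred + inv(D) @ a / s)           # (6.15a)
    g_factor = multivariate_normal.pdf(z_pred / s, mean=a,
                                       cov=P_pred / s**2 + D)  # factor j of (6.14c)
    return mu, H, g_factor
```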
6.4.3 The Measurement Update for the Linear States
The RBPF requires a measurement model that is linear and Gaussian conditioned on the non-linear states in order to update the linear ones. This is achieved by a Gaussian approximation of (6.8). The first two factors in this equation (i.e. the object and appearance detection) do not contain any information about the part locations given z_t^0 and s_t, so they can be excluded here. Since the part measurements are mutually independent, each part can be considered individually. Consider the Gaussian approximation in (6.16) of the image likelihood for each part. The function L_t^j defined in (6.14a) is the product of the class appearance likelihood and the object-specific appearance likelihood of part j.

L_t^j(z, s) \approx k_t^j(s)\, \mathcal{N}(z; y_t^j(s), R_t^j(s))   (6.16)

y_t^j(s) and R_t^j(s) are the mean and covariance of the approximation, which depend on the scale s. The scale factor k_t^j(s) is unimportant and can be disregarded, since the scale is given when performing the measurement update for the linear states. The method for obtaining this approximation and its validity are discussed in section 8.2.3. The image likelihood for part j can be written as in (6.17).

L_t^j(z_t^0 + z_t^j, s_t) \sim \mathcal{N}(y_t^j(s_t); z_t^0 + z_t^j, R_t^j(s_t))   (6.17)
Here ∼ denotes approximately equal up to a scale factor (for a constant s_t). The argument and mean of the normal distribution have been switched. This can be done thanks to the symmetry of the Gaussian function. y_t^j(s_t) can be regarded as a measurement. The right-hand side in (6.17) is then the likelihood p(y_t^j(s_t) | x_t) of this measurement. This likelihood is equivalent to the measurement equation in (6.18a).
The deformation model likelihood in (6.9) is already Gaussian and does not have to be approximated. Since the deformation likelihoods for the parts are independent, (6.9) is equivalent to the measurement equation in (6.18b) for each part.

y_t^j(s_t) = z_t^0 + z_t^j + e_t^j(s_t) ,   e_t^j(s_t) | s_t \sim \mathcal{N}(0, R_t^j(s_t))   (6.18a)
a^j = \frac{1}{s_t} z_t^j + d_t^j ,   d_t^j \sim \mathcal{N}(0, D^j)   (6.18b)

The conditionally linear-Gaussian measurement model thus contains two measurements for each part location. The iterated measurement update in algorithm A.2 can easily be applied in the RBPF on this model, since all position measurements are mutually uncorrelated.
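As an illustration of how the two part measurements in (6.18) can be applied sequentially with standard Kalman updates, here is a minimal NumPy sketch under the model above. It is not the thesis implementation of algorithm A.2, and all names are placeholders.

```python
import numpy as np

def kalman_update(m, P, y, C, R):
    """One standard Kalman measurement update for y = C x + e, e ~ N(0, R)."""
    S = C @ P @ C.T + R
    K = P @ C.T @ np.linalg.inv(S)
    m_new = m + K @ (y - C @ m)
    P_new = (np.eye(len(m)) - K @ C) @ P
    return m_new, P_new

def update_part(m, P, z0, s, y_img, R_img, a, D):
    """Sequentially apply the two part measurements (6.18a) and (6.18b) to the
    2-D part state z^j with prior N(m, P), for a given particle (z0, s).
    A sketch under the stated model; not the thesis code."""
    C = np.eye(2)
    # (6.18a): y = z0 + z^j + e  ->  measurement of z^j is (y - z0)
    m, P = kalman_update(m, P, y_img - z0, C, R_img)
    # (6.18b): a = z^j / s + d   ->  measurement matrix is (1/s) I
    m, P = kalman_update(m, P, a, C / s, D)
    return m, P
```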
7 Object Detection
As described in chapter 1, category object tracking is the problem of tracking objects of a
specific class, e.g. humans. This tracking problem contains additional a priori information
that can be exploited. If the system is supposed to work completely automatically, it
additionally has to detect objects of this category. Object detection is a well studied area
in computer vision. The first section in this chapter briefly describes the object detector
[15], which is used in my proposed object tracker. The remaining sections discuss how
this detector is used in my framework.
7.1 Object Detection with Discriminatively Trained Part Based Models
The object detector [15] by Felzenszwalb et al. has proved to be very successful, and it still achieves state-of-the-art performance. It uses a model of deformable parts to describe the object. This section contains a brief description of this object detection framework. For more details, see [15]. This object detector will be referred to as the deformable part model (DPM) detector.
7.1.1 Histogram of Oriented Gradients
Dalal and Triggs introduced the histogram of oriented gradients (HOG) features in [12] and applied them to human detection in static images. The HOG feature map is created by first computing histograms of gradient orientations in a dense grid of image cells, which are typically 8 × 8 pixels. The histograms are constructed using soft assignment to neighbouring cells. This is then followed by a normalization step, where the histograms are normalized using the gradient energy from different neighbouring cells. The original HOG results in a 36-dimensional feature vector for each cell. The HOG features are typically computed at many different scales. Figure 7.1 visualizes the HOG feature map for an image region.
Dalal and Triggs trained a linear support vector machine (SVM) [7] on thousands of positive and negative examples of humans, using HOG features. The object detection scores at all locations and scales of an image can be computed using a sliding window search, i.e. by correlating the feature map of the image with the SVM weights.
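The following is a minimal sketch of this sliding-window scoring step, assuming a HOG map and an SVM weight template are already given. It only illustrates the correlation idea (using SciPy) and is not the detector code; the array sizes in the example are made up.

```python
import numpy as np
from scipy.signal import correlate2d

def detection_scores(feature_map, svm_weights):
    """Dense sliding-window scores by correlating a HOG feature map (H x W x C)
    with an SVM weight template (h x w x C), summing over feature channels.
    A sketch of the idea, not the detector code."""
    scores = np.zeros((feature_map.shape[0] - svm_weights.shape[0] + 1,
                       feature_map.shape[1] - svm_weights.shape[1] + 1))
    for c in range(feature_map.shape[2]):
        scores += correlate2d(feature_map[:, :, c],
                              svm_weights[:, :, c], mode='valid')
    return scores

# Example with random data (cell-level resolution; real HOG uses 8x8-pixel cells):
fmap = np.random.rand(60, 80, 36)      # HOG map with 36-dimensional cells
template = np.random.rand(16, 6, 36)   # e.g. a 128x48-pixel person template
score_map = detection_scores(fmap, template)   # (45, 75) score map
```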
7.1.2 Detection with Deformable Part Models
The DPM detector [15] extends the work of Dalal and Triggs to use deformable part models. The detector uses a modification of the HOG features. Instead of just training a single template of SVM weights for the whole object, separate part templates are also trained for a set of object parts (e.g. head or feet for a human). Additionally, a deformable part model of the object is trained. This model includes the anchor position v_j = (v_j^x, v_j^y) and the quadratic deformation cost coefficients d_j = (d_j^1, d_j^2, d_j^3, d_j^4)^T for each part j \in \{1, \ldots, N\}. Let G_j be the trained SVM weights (or filters) for each part, where j = 0 indicates the root filter that is trained on the whole object. Let H denote the HOG feature map of an image. The classification score from a filter at position p = (x, y) and scale s is calculated as in (7.1).¹ Note that the part scores for an object detection at scale s are computed at half the scale, i.e. at double the resolution.

\zeta_j(p, s) = \begin{cases} G_0 \star H(p, s), & j = 0 \\ G_j \star H\!\left(p, \frac{s}{2}\right), & j \in \{1, \ldots, N\} \end{cases}   (7.1)
The vector defined in (7.2) contains the linear and quadratic absolute displacements of a part from its anchor position.

\Delta_j(p_0, p_j, s) = \begin{pmatrix} (x_j - x_0)/s - v_j^x \\ (y_j - y_0)/s - v_j^y \\ \left( (x_j - x_0)/s - v_j^x \right)^2 \\ \left( (y_j - y_0)/s - v_j^y \right)^2 \end{pmatrix}   (7.2)

The deformation cost for part j is computed as d_j^T \Delta_j(p_0, p_j, s). Given a trained model and SVM weights, the DPM detector calculates the final object detection score as in (7.3). Here b is a constant bias.

\zeta(p_0, s) = \max_{p_1, \ldots, p_N} \left( \sum_{j=0}^{N} \zeta_j(p_j, s) - \sum_{j=1}^{N} d_j^T \Delta_j(p_0, p_j, s) + b \right)   (7.3)

¹ Correlation (\star) is generalized to vector-valued functions by correlating each feature layer individually and then summing the results at each position.
(a) Frame 50 from the Town Centre sequence. (b) Visualization of HOG features from a part of the image in figure 7.1a containing the two persons dressed in black near the center of the image. The HOG features have been calculated at scale s = 2^{7/10} ≈ 1.62, i.e. the image is first down-sampled by a factor 1/s.
Figure 7.1: Visualization of the HOG features of the frame in figure 7.1a. In figure 7.1b, the magnitude of each orientation bin in a cell is visualized by the intensity of the line with the corresponding orientation.
(a) Root filter SVM weights. (b) Part filter SVM weights. (c) Part placements and deformation costs.
Figure 7.2: Visualization of the INRIA-person model, which is trained on the INRIA [12] dataset. The model contains two components that are reflections of each other along the vertical axis. The SVM weight for each orientation bin is visualized as in figures 7.2a and 7.1b. The magnitudes of the deformation cost functions are visualized in figure 7.2c. The model contains eight parts.
7.1.3 Training the Detector
The classifier in (7.3) can be formulated as in (7.4).

f_w(x) = \max_{z \in Z(x)} w^T \Phi(x, z)   (7.4)

Here, w is the vector of classifier weights, which in this case includes the filter weights G_j and the deformation coefficients d_j. x is the example to be classified and Z(x) is the set of possible latent values for x. The part positions are latent in this case. \Phi(x, z) is the extracted feature vector for the particular example x and part configuration z. In [15] this classifier is trained using supervised learning with a latent SVM. This is done by minimizing the objective function in (7.5). x_i denotes the examples and y_i \in \{-1, 1\} are the corresponding labels.

L_D(w) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i f_w(x_i))   (7.5)

Here, C is a regularization parameter. The optimization problem (7.5) is non-convex. However, a strong local optimum can be found by exploiting the semi-convexity of this function. The optimization iterates between finding the optimal part placements for the positive examples given w, and optimizing over w given these part placements. The second step can be shown to be a convex optimization problem. For further details on this training procedure, see [15]. My proposed human tracker uses the human detector that is pretrained on the INRIA [12] dataset. The trained model is visualized in figure 7.2.
7.2 Object Detection in Tracking
One of the goals in this part of the thesis is to fuse the information from the object and part
detections with appearance tracking in a probabilistic framework. This section discusses
how object detection can be used in a tracking framework.
7.2.1 Ways of Exploiting Object Detections in Tracking
The popular way of using object detections in tracking is to select a sparse set of detections as observations of the object state. These can then be used in, for example, a Bayesian filtering framework, though many works reduce the problem to data association of a set of sparse detections [2, 39]. Such a set of detections can be obtained by simply thresholding the dense detection scores. However, the thresholding discards large amounts of information returned by the object detector.
An obvious possibility is to use the whole detection score as a confidence map or likelihood of an object being present at a specific location. Breitenstein et al. [9] pointed out that the detector score from the original HOG person detector [12] is too poor to use as a confidence map in most cases. They countered this fact by mostly relying on thresholded detections in their filtering framework and only trusting the detector confidence in certain specific cases. The DPM detector returns much more distinct detection confidences. Izadinia et al. [22] successfully exploited these confidences to track both humans and their parts in a non-causal framework based on graph optimization.
My work exploits the dense detection scores for both the object and the separate parts. The scores obtained from the root filter are not used explicitly, but they of course contribute to the final object detection scores. The confidences returned by the full detector are of much higher quality than the confidences computed from the part filters. As in the original HOG detector, the part confidences are just the output of a linear classifier. These outputs also suffer from the fact that the appearance of, for example, a shoulder is not very discriminative. However, these flaws are countered by jointly tracking the parts and the object itself.
7.2.2 Converting Detection Scores to Likelihoods
The detection scores are computed at cell-level resolution. The cells are not overlapping in the standard version of the DPM detector. This means that the resolution of the detection scores at scale s is 1/(8s) times the resolution of the original image for the object detections, and twice that for the part detections. However, it is more practical to use pixel-dense scores. These are obtained by interpolating the detection scores with splines. The effect is illustrated in figure 7.3.
To be able to use the detection scores in a probabilistic framework, the scores are transformed to values that can be interpreted as probabilities. In chapter 6 the detection scores are incorporated in the tracker as likelihoods p(\theta^j | z, s). \theta^j is a binary stochastic variable that indicates the detection of the object (j = 0) or part (j > 0). p(\theta^j | z, s) should then be interpreted as how likely it is to detect the object or part at position z and scale s.
In [22] detection scores are transformed to probabilities using sigmoid functions. A sigmoid function is a smooth, step-like function. There exist many proposed variants of such functions. The type used here is given in (7.6).

\psi(t) = \frac{1}{1 + e^{-\alpha(t - \beta)}}   (7.6)

The parameters \alpha > 0 and \beta \in \mathbb{R} need to be tuned for each object and part detector individually. Let \hat{\zeta}_j(z, s) denote the interpolated detection score at position z and scale s. These scores are simply mapped through a sigmoid function using (7.7) to obtain the corresponding likelihoods. The effect of this step is illustrated in figure 7.3.

p(\theta^j | z, s) = \psi_j\!\left( \hat{\zeta}_j(z, s) \right) = \frac{1}{1 + \exp\!\left( -\alpha_j (\hat{\zeta}_j(z, s) - \beta_j) \right)}   (7.7)
The sigmoid parameters are tuned based on gathered statistics of the confidence values. The cumulative distribution functions F_j of the detection scores over an image are defined in (7.8). Here x = (z, s) and X is the set of all positions and scales in the image. In practice many images can be used. F_j are approximated by computing histograms of detection scores over a set of images and computing their cumulative sums.

F_j(\xi) = \begin{cases} \frac{1}{|X|} \left| \{ x \in X : \zeta(x) \le \xi \} \right|, & j = 0 \\ \frac{1}{|X|} \left| \{ x \in X : \zeta_j(x) \le \xi \} \right|, & j \in \{1, \ldots, N\} \end{cases}   (7.8)

Using the definition of precision P_j(\xi) and recall R_j(\xi) at a detection threshold \xi, the equality in (7.9) can be derived. Here T_j is the relative number of locations x in the image that contain the specified object.

F_j(\xi) = 1 - T_j \frac{R_j(\xi)}{P_j(\xi)}   (7.9)
One way of tuning the parameters in (7.7) would be to specify the desired recall rate at the two thresholds \lambda_1 and \lambda_2, where x is considered a detection if and only if p(\theta^j | x) \ge \lambda_k. The corresponding detection score thresholds \xi_j^{(1)} and \xi_j^{(2)} can then be obtained if the recall rates are known. This gives the equation system (7.10).

\psi_j\!\left( \xi_j^{(1)} \right) = \lambda_1 , \qquad \psi_j\!\left( \xi_j^{(2)} \right) = \lambda_2   (7.10)

The solution is given in (7.11).

\alpha_j = \frac{1}{\xi_j^{(1)} - \xi_j^{(2)}} \left( \ln\!\left( \frac{1}{\lambda_2} - 1 \right) - \ln\!\left( \frac{1}{\lambda_1} - 1 \right) \right)   (7.11a)
\beta_j = \xi_j^{(1)} + \frac{1}{\alpha_j} \ln\!\left( \frac{1}{\lambda_1} - 1 \right)   (7.11b)
(a) The DPM human detector output at scale s = 2^{7/10} ≈ 1.62 for the image displayed in figure 7.1a. (b) Spline interpolation of the detector scores in figure 7.3a to obtain pixel-dense scores. (c) The final likelihood p(\theta^0 | z, s = 2^{7/10}) computed from the detector scores in figure 7.3b, using (7.7) with the computed values for \alpha_0 and \beta_0.
Figure 7.3: Visualization of the human detector output (figure 7.3a) for the image in figure 7.1a. Figures 7.3b and 7.3c show how these detections are transformed to pixel-dense likelihoods. Note that the high values correspond to humans in figure 7.1a with similar height in pixels. Also note how much more apparent the peaks are in figure 7.3c compared to figure 7.3b.
This approach is problematic since a labelled dataset is needed to estimate the recall R_j(\xi). A simple method that does not require labelled data was used in this work. For the object detector, the desired fraction \Lambda_k of detections in the images at threshold \lambda_k was tuned by visual inspection of the resulting p(\theta^0 | x) and by performance evaluation. This gives \xi_0^{(k)} from F_0(\xi_0^{(k)}) = 1 - \Lambda_k. A valid approximation is that T_j = T if occlusions in the scene are rare. By comparing the detection scores of the part detectors with those of the object detector, it is possible to get a very coarse approximation of the precision of the part detectors relative to the object detector at a “high enough” recall rate. If the precision of the part detectors is assumed to be a factor c_j worse at the desired recall rates, then \xi_j^{(k)} can be found from F_j(\xi_j^{(k)}) = 1 - \Lambda_k / c_j for j > 0. The sigmoid parameters are then computed using (7.11). This method provides a more intuitive way of tuning the sigmoid parameters, by letting the user select \lambda_k and then \Lambda_k based on that. These parameters have intuitive meanings, while it can be difficult to select \alpha_j and \beta_j directly.
In the proposed human tracker, the sigmoid parameters were tuned using the detection scores from the first 20 frames of the Town Centre sequence [5]. The thresholds were chosen as \lambda_1 = 0.5 and \lambda_2 = 0.1. The desired fractions of detections for the object detector were set to \Lambda_1 = 10^{-4} and \Lambda_2 = 10^{-2}. The precision factor was set to the same value c_j = 0.1 for all part detectors. Not much effort was invested in tuning these values, since the proposed tracker proved to be quite insensitive to variations in the mentioned parameters.
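The sketch below illustrates this tuning procedure: the score thresholds are taken as empirical quantiles of a sample of detection scores, and (7.11) then gives the sigmoid parameters. It is a minimal NumPy sketch under the description above, not the thesis code; the default arguments reuse the example values mentioned in the text.

```python
import numpy as np

def tune_sigmoid(scores, lambda1=0.5, lambda2=0.1, Lambda1=1e-4, Lambda2=1e-2):
    """Tune (alpha, beta) in (7.7) from a sample of detection scores: the score
    thresholds xi^(k) are taken as empirical quantiles F(xi^(k)) = 1 - Lambda_k,
    and (7.11) gives the parameters. A sketch, not the thesis code."""
    xi1 = np.quantile(scores, 1.0 - Lambda1)
    xi2 = np.quantile(scores, 1.0 - Lambda2)
    alpha = (np.log(1.0 / lambda2 - 1.0)
             - np.log(1.0 / lambda1 - 1.0)) / (xi1 - xi2)      # (7.11a)
    beta = xi1 + np.log(1.0 / lambda1 - 1.0) / alpha            # (7.11b)
    return alpha, beta

def score_to_likelihood(score, alpha, beta):
    """Map a detection score through the sigmoid (7.7)."""
    return 1.0 / (1.0 + np.exp(-alpha * (score - beta)))
```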
7.2.3 Converting Deformation Costs to Probabilities
The deformable parts model was fitted into a probabilistic framework by converting the deformation costs in the DPM detector to probabilities. To simplify later steps in the modelling, Gaussian probabilities are used. The deformation probabilities are parametrized as in (7.12). As in chapter 6, the part number is denoted with index j, z^j is the position of the part relative to the object location and s is the scale of the object. a^j is the mean position of the part relative to the object and D^j is a 2 × 2 covariance matrix.

p(z^j | s) = \mathcal{N}(z^j ; s a^j, s^2 D^j)   (7.12)

The straightforward way to estimate a^j and D^j would be maximum likelihood estimation given a set of images with known object and part locations and scales. However, a more ad hoc solution was used because of the absence of such a dataset. The proposed solution exploits the pre-trained anchor positions v_j and deformation coefficients d_j from the DPM detector. The mean position a^j is set to the (x, y)^T that minimizes the deformation cost function in (7.13), where (x, y)^T is the relative part position normalized with the scale of the object.

f_j(x, y) = d_j^1 (x - v_j^x) + d_j^2 (y - v_j^y) + d_j^3 (x - v_j^x)^2 + d_j^4 (y - v_j^y)^2   (7.13)

The solution, which is given in (7.14), is easily obtained by setting the gradient to zero, \nabla f_j = 0.

a^j = \begin{pmatrix} v_j^x - \frac{d_j^1}{2 d_j^3} \\ v_j^y - \frac{d_j^2}{2 d_j^4} \end{pmatrix}   (7.14)

The covariance matrix is set as in (7.15).

D^j = \begin{pmatrix} 1/d_j^3 & 0 \\ 0 & 1/d_j^4 \end{pmatrix}   (7.15)

The motivation for this is that p(z^1, \ldots, z^N | s) can then be written as in (7.16). The model assumes independence between the part locations given the scale. Note the similarity between the argument of the exponential function and the deformation cost in (7.3). The only difference is that the anchor position has been corrected with the linear deformation terms, instead of including them as extra terms.
p(z^1, \ldots, z^N | s) = \prod_{j=1}^{N} p(z^j | s) = \frac{1}{(2\pi s^2)^N} \sqrt{\prod_{j=1}^{N} d_j^3 d_j^4} \; \exp\!\left( -\frac{1}{2} \sum_{j=1}^{N} \left[ d_j^3 \left( \frac{z_x^j}{s} - a_x^j \right)^2 + d_j^4 \left( \frac{z_y^j}{s} - a_y^j \right)^2 \right] \right)   (7.16)
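The conversion in (7.14) and (7.15) is simple enough to show directly; the sketch below computes (a^j, D^j) from a part's anchor position and deformation coefficients. It is a hedged illustration with made-up example numbers, not values from the trained model.

```python
import numpy as np

def deformation_gaussian(anchor, d):
    """Convert a DPM anchor position v_j = (v^x, v^y) and quadratic deformation
    coefficients d_j = (d1, d2, d3, d4) into the Gaussian parameters (a^j, D^j)
    of (7.14) and (7.15). A sketch of the conversion described above."""
    vx, vy = anchor
    d1, d2, d3, d4 = d
    a = np.array([vx - d1 / (2.0 * d3),      # (7.14)
                  vy - d2 / (2.0 * d4)])
    D = np.diag([1.0 / d3, 1.0 / d4])        # (7.15)
    return a, D

# Hypothetical example values for one part:
a_j, D_j = deformation_gaussian((1.0, -2.5), (0.02, -0.01, 0.05, 0.08))
```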
8 Details
This chapter presents and discusses the most important details of the constructed framework for human tracking in surveillance scenes. Section 8.1 discusses the computation
of the appearance likelihood and how the proposed RCSK tracker is incorporated into the
framework. Section 8.2 contains the details of how the RBPF is used in the filtering step.
Section 8.3 presents some details related to occlusion handling and how objects are added
and removed.
8.1 The Appearance Likelihood
The image measurement model in (6.8) uses two kinds of likelihoods. p(\theta_t^j | z, s) models the class-specific appearance and is discussed in section 7.2.2. The second factor p(\varphi_t^j | z) models the object-specific appearance and does not take the object class into account. This section describes how this likelihood factor is computed using the generic tracking methods described in the first part of this thesis.
8.1.1 Motivation
It can be argued that the object-specific appearance is irrelevant information if the application does not require identities of the tracked objects. However, this is not true in practice, since the object detections are never perfect. The motivation for including appearance-based tracking in the model can be summarized in these two points.
• To help keep the object identities.
• To counter the imperfections of the object detections.
The first point is clear. If the individual appearances of the objects are modelled, then association between frames is simpler. The second point comes from the fact that object detections in still images do not regard the temporal dimension. In section 7.2.2 it was noted that the object detections of human parts are of poor quality, i.e. they have a low precision rate. A good generic tracker does not suffer from this problem, since it regards the learnt appearance of the object in the previous frames.
Generic trackers are often very accurate in tracking an object from frame to frame. However, common problems among generic trackers are long-term drift and the fact that failure often results in losing track of the target completely. Object detections in still images do not suffer from these flaws. The motivation for combining appearance tracking and object detections is thus that it has the potential to give accurate frame-to-frame tracking while being robust to drift and failure.
For the application in this thesis, a generic tracker with the following properties was
desired.
1. Simple.
2. Fast.
3. Exploits color information.
4. Outputs pixel-dense confidences.
Some generic trackers, like the Tracking-Learning-Detection framework [24], use complex appearance models and methods for failure detection and redetection of the target. Such properties are not needed in this application, since this is handled by the implemented filtering framework and the fusion with object detections. This is the motivation behind the first point in the list.
Although the proposed framework does not aim for real-time performance in its current state, it still has to be sufficiently fast to make testing and evaluation practical. The object detections can be precomputed, but this is not the case for the generic tracker output, since it uses information from previous frames. In the surveillance sequence on which the framework is evaluated, as many as 30 humans can appear in the scene at the same time. Since each human is tracked along with 8 defined parts, almost 300 image regions are tracked simultaneously. The second point in the list thus becomes clear.
The object detector described in section 7.1 only uses edge information. It is thus intuitive to use color as appearance information to complement this. In the case of human tracking, the color of clothes is a very discriminative feature for separating individuals. This explains the need for the third property in the list. The last property comes as a result of the image measurement model in (6.8). It assumes that the appearance likelihood can be evaluated in a sufficiently dense set of locations.
The RCSK tracker, discussed in the first part of the thesis, turns out to have all these
properties. This tracker is used to compute the appearance likelihoods as described in the
next section.
8.1.2 Integration of the RCSK Tracker
A variant of the proposed RCSK tracker in algorithm 2.1 is used to compute the appearance likelihoods p(\varphi_t^j | z). In each frame where the object or part is not occluded, the tracker is updated using the image patch around the estimated location (\hat{z}_{t|t}^0 for the object and \hat{z}_{t|t}^0 + \hat{z}_{t|t}^j for part j). The size of the patch is determined by the estimated scale \hat{s}_{t|t} of the object.
When processing a new frame, the tracking scores (correlation output) for each unoccluded object and part are computed in an area around their previously estimated locations in the image. These score values can equivalently be seen as detection scores from an object detector trained on the specific object or part appearance. Analogously to the detection scores from the DPM detector in chapter 7, these score values are mapped through a sigmoid function to obtain a probability interpretation. The same kind of sigmoid function (7.6) was used in this case. The parameters were tuned to \alpha = 6 and \beta = 0.5.
To get a more accurate appearance tracking, the spatial size of the RCSK appearance model (i.e. A_t^N, A_t^D and \hat{x}_t in (2.21)) needs to be set according to the current estimated object scale \hat{s}_{t|t}. The transformed model coefficients A_t^N and A_t^D are resized by either padding with zeros in the highest frequencies or removing the highest frequencies to get the appropriate size. This corresponds to an interpolation of the coefficients in the spatial domain. The learnt appearance is resized in a similar way, by either zero padding or removing the edges. This simple scheme of resizing the trackers proved to be very robust, even in quite extreme cases.
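The following is a minimal NumPy sketch of this kind of Fourier-domain resizing: cropping or zero-padding the highest frequencies of a 2-D coefficient array, which corresponds to resampling the signal in the spatial domain. It is only an illustration of the idea under these assumptions; the centring and normalization details of the actual implementation may differ.

```python
import numpy as np

def resize_fourier(A, new_shape):
    """Resize a 2-D array of Fourier coefficients by zero-padding or cropping
    the highest frequencies (a sketch of the resizing scheme described above)."""
    A_shifted = np.fft.fftshift(A)          # put the low frequencies in the centre
    out = np.zeros(new_shape, dtype=A.dtype)
    r = min(A.shape[0], new_shape[0])
    c = min(A.shape[1], new_shape[1])
    src_r0 = (A.shape[0] - r) // 2
    src_c0 = (A.shape[1] - c) // 2
    dst_r0 = (new_shape[0] - r) // 2
    dst_c0 = (new_shape[1] - c) // 2
    out[dst_r0:dst_r0 + r, dst_c0:dst_c0 + c] = \
        A_shifted[src_r0:src_r0 + r, src_c0:src_c0 + c]
    return np.fft.ifftshift(out)

# Example: interpolate a 32x32 learnt appearance to 40x40.
x = np.random.rand(32, 32)
X_big = resize_fourier(np.fft.fft2(x), (40, 40))
x_big = np.real(np.fft.ifft2(X_big)) * (40 * 40) / (32 * 32)  # FFT normalization
```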
8.2 Rao-Blackwellized Particle Filtering
This section describes the practical details of how the RBPF is applied to the model described in chapter 6. The time update and the resampling step in the RBPF are performed as described in algorithm A.4. Some approximations are however needed in the measurement updates to make the RBPF applicable to the model proposed in chapter 6. The standard prior proposal distribution q(x_{t+1} | X_t, Y_{t+1}) = p(x_{t+1}^n | X_t^n, Y_t) was chosen.
8.2.1 Parameters and Initialization
When a new object is added, its states are initialized using the information from the corresponding detection. The initial object position, scale and velocity for each particle are
drawn from the prior Gaussian distribution. The mean position and scale are set as the
detection values. The mean velocity is set to zero. The initial part positions are for all
particles set to the ones obtained from the initial detection.
All prior and process covariances are parametrized and tuned coarsely by hand.
8.2.2 The Particle Filter Measurement Update
Theoretically, (6.13) should be used to update the particle weights in the RBPF. In practice, the likelihood functions L_t^j are defined over a discrete domain, i.e. at each pixel location and at some discrete scales. The integration must thus be approximated by a sum. The Gaussian filter h_t^j needs to be sampled on a pixel-dense grid. However, since the covariance H_t^j of h_t^j depends on the object scale s and the predicted part location covariance P_{t|t-1}^{z^j}, it also depends on the particle number. This means that the correlation in (6.13) should be computed with a different filter for each particle.
To reduce the computational cost of the measurement update, the covariance H_t^j(s) defined in (6.15b) is in each time step approximated by a diagonal covariance matrix \tilde{H}_t^j that is independent of the particle number. The likelihood (6.13) can be approximately calculated using (8.1).

p(y_t | X_t^n, Y_{t-1}) = g_t(s_t)\, L_t^0(z_t^0, s_t) \prod_{j=1}^{N} \left( \tilde{h}_t^j \star L_t^j(\,\cdot\,, s_t) \right)\!\left( z_t^0 + \mu_t^j(s_t) \right)   (8.1)

Here, the approximative diffusion filter \tilde{h}_t^j defined in (8.2) is independent of the particle number.

\tilde{h}_t^j(z) = \mathcal{N}(z; 0, \tilde{H}_t^j)   (8.2)

Notice that the effect of \mu_t^j(s_t) (defined in (6.15a)) has been moved to the computation of the point where the correlation result is evaluated. This is obtained by a simple change of variables in the integral in (6.13). The correlation output in (8.1) is hence evaluated in the deformation-corrected predicted part location z_t^0 + \mu_t^j(s_t).
The diagonal elements of \tilde{H}_t^j are computed as the mean of H_t^j over all particles. The off-diagonal elements are set to zero. This approximation enables \tilde{h}_t^j to be separated into two one-dimensional Gaussian filters, which further significantly reduces the computational cost. The approximation of H_t^j is motivated by the fact that the exact amount of smoothing generated by the filter is of much less importance than the mean value \mu_t^j, which decides where the correlation result is evaluated. Further, the precision in the estimation of P_{t|t-1}^{z^j}, and thereby also of H_t^j, is questionable. As described in section 7.2.3, the deformation covariance D^j is set to a diagonal matrix. P_{t|t-1}^{z^j} is almost diagonal in most cases, since the process noise covariance of the part locations is diagonal. The diagonal approximation of H_t^j is thus also motivated.
In the evaluation of (6.13), the object and part positions for each particle are rounded to
the nearest pixel. The likelihood is linearly interpolated in the scale dimension.
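To make the separable smoothing and the evaluation point in (8.1) concrete, here is a minimal sketch assuming SciPy is available; the function names and the (row, column) coordinate convention are assumptions made for this illustration, not the thesis code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smoothed_part_likelihood(L_j, H_tilde_diag):
    """Smooth a pixel-dense part likelihood map L_j with the separable Gaussian
    filter (8.2), whose diagonal covariance is shared by all particles.
    H_tilde_diag = (var_row, var_col) in pixels^2."""
    sigmas = np.sqrt(np.asarray(H_tilde_diag))      # one sigma per image axis
    return gaussian_filter(L_j, sigma=sigmas)

def particle_part_factor(L_smooth, z0, mu_j):
    """Evaluate the smoothed likelihood at the deformation-corrected predicted
    part location z0 + mu_j (rounded to the nearest pixel), as in (8.1)."""
    y, x = np.round(np.asarray(z0) + np.asarray(mu_j)).astype(int)
    y = np.clip(y, 0, L_smooth.shape[0] - 1)
    x = np.clip(x, 0, L_smooth.shape[1] - 1)
    return L_smooth[y, x]
```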
In evaluations it was observed that background clutter is a major problem for some human part detectors. This clutter occasionally deteriorated the total likelihood function for a part, thereby violating the assumption of a single dominant mode in the likelihood, which is necessary for the Gaussian approximation to be valid. This model error resulted in a noticeable reduction in the tracking performance of the object itself when severe background clutter was present. This was typically the case for the lower body parts. The reason is that the feet and legs move relative to the object itself, which makes them harder to track due to appearance changes and self-occlusions. Additionally, the lower body parts are much more commonly occluded by background structures and have a less discriminative appearance. A significant improvement in performance was noticed if only a subset of the parts was used in the measurement update for the non-linear states (6.13). Only the upper body parts are thus used in (6.13) for the human tracking application.
8.2.3 The Kalman Filter Measurement Update
In section 6.4.3 a Gaussian approximation of the likelihood functions for the object parts is assumed. This is necessary to be able to apply the RBPF to the proposed tracking model. The mean y_t^j(s) of the Gaussian approximation is selected as the location of the maximum of the likelihood function L_t^j(z, s) in the region. The covariance R_t^j(s) is set to the covariance of the likelihood function after normalizing it so that it sums to one. y_t^j(s) and R_t^j(s) are calculated for each scale level. In the Kalman filter measurement update of the RBPF, each particle can be updated using the measurement that corresponds to the scale closest to the scale estimate given by that particle. In practice, it turned out to be better to update all particles with the measurement that corresponds to the estimated scale \hat{s}_t of the object.
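The following is a minimal NumPy sketch of this Gaussian approximation of a 2-D likelihood map: the mean is taken at the maximum, and the covariance is computed from the normalized map. It is an illustration under the description above, not the thesis code.

```python
import numpy as np

def gaussian_approx(L):
    """Approximate a 2-D likelihood map L by a Gaussian: the mean y is taken at
    the maximum of L, and the covariance R as the covariance of L after
    normalizing it to sum to one."""
    y_idx = np.unravel_index(np.argmax(L), L.shape)   # location of the maximum
    p = L / L.sum()                                   # normalized likelihood
    rows, cols = np.indices(L.shape)
    mean = np.array([np.sum(rows * p), np.sum(cols * p)])
    dr = rows - mean[0]
    dc = cols - mean[1]
    R = np.array([[np.sum(dr * dr * p), np.sum(dr * dc * p)],
                  [np.sum(dr * dc * p), np.sum(dc * dc * p)]])
    return np.array(y_idx, dtype=float), R
```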
Gaussian approximations are reasonable if the actual probability is uni-modal. Since the labelling function of the RCSK tracker is Gaussian, it generates a roughly Gaussian-shaped score function in most “easy” cases. In this context, “easy” means sufficiently small translations and appearance changes between frames, which is most often the case in surveillance scenes. The part detection scores are ideally uni-modal in a neighbourhood of the expected part location. If other objects are close, then several modes may exist. However, this is in most cases handled by the fusion with the appearance likelihood, which should only have one mode at the target. The problem is that the detectors for most human parts suffer from a very low precision rate due to background clutter. The trained SVM detectors used for human parts by the DPM human detector are small and detect at a quite low resolution. They often give high classification scores to certain basic shapes or patterns that can be common in background structures. For example, a spot or line on the ground can look suspiciously similar to a foot. When the tracking is affected by this kind of clutter, the Gaussian assumption is often violated, which might cause the tracking of a specific part to fail. However, since many parts are tracked jointly, the tracking of the object itself is robust even if a minority of the part trackers fail.
The iterated measurement update in algorithm A.2 was applied to the Kalman filter measurement update in the RBPF, since all position measurements are uncorrelated. This
avoids inverting large matrices (up to 4N × 4N ) in the computation of the Kalman gain in
(A.16d) for each particle. Instead, the iterated measurement update only requires several
2 × 2 and 1 × 1 matrices to be inverted for each particle, which is considerably faster.
The iterated measurement update was also applied to the Kalman filter time update, since
it contains an “extra” measurement update with uncorrelated measurements, when using
the model in (6.6) (see section A.4).
8.2.4 Estimation
Like the usual particle filter, the RBPF only returns an approximation of the posterior
distribution and not a point estimate of the state. In chapter 1 it is stated that visual
tracking is the problem of estimating the trajectory of the object in the image, i.e. the
position or state of the object in each frame. It is thus necessary to find a point estimate
of the state. This can be done in many ways.
The two most common methods for obtaining a point estimate were tried: minimum variance (MV) and maximum a posteriori (MAP). These are described in section A.4.2. The performance difference proved to be insignificant. The MV estimate, however, gives smoother trajectories, which are visually more appealing. This method was therefore used in the final version.
8.3 Further Details
This section discusses the additional system parts that were needed to build a complete automatic human tracking framework.
8.3.1 Adding and Removing Objects
An automatic object tracker requires automatic ways of detecting new objects and removing falsely tracked objects. Although these are important and non-trivial tasks, they were not the focus of this thesis, so simple methods were employed. However, these methods proved to be quite effective.
To find new objects in the scene, all detections over a certain threshold are gathered in each frame. This set of detections is then reduced in a number of steps. Firstly, detections too close to the image borders are removed. This is then followed by two steps of non-maximum suppression. If the bounding boxes of two new detections overlap more than a certain threshold, then the one with the smallest score is removed. The second step compares the overlap between the remaining boxes and the bounding boxes of the existing objects in the scene. A new detection is removed if it has a too large overlap with any of the existing objects. The remaining detections after this step are considered to be newly detected objects. These are initialized with the position, scale and part locations given by the detection. The overlap measure used in these cases is given in (8.3), where B_1 and B_2 are two bounding boxes.

\mathrm{overlap}(B_1, B_2) = \max\!\left( \frac{\mathrm{area}(B_1 \cap B_2)}{\mathrm{area}(B_1)}, \frac{\mathrm{area}(B_1 \cap B_2)}{\mathrm{area}(B_2)} \right)   (8.3)
An existing tracked object is removed if any of the following requirements are fulfilled.
1. The object is too far outside the image.
2. The object is too large or too small, so that it is outside the range of scales used by
the detector.
3. The object has been fully occluded for too many frames.
4. The object is not significantly occluded but too dissimilar to the object class, which
is indicated by a too low detection score over the last few frames.
8.3.2 Occlusion Detection and Handling
Advanced techniques for occlusion detection were not investigated in this thesis work. Rather, a very simple but effective method for detecting inter-object occlusions was adopted from [47]. If two tracked objects overlap, then the object with the bounding box that has the highest lower y-coordinate is considered to be the occluded one. This results from the assumption that objects that are further away from the camera are higher up in the image. This is true for example if the objects are moving on a ground plane which is tilted towards the camera, but not necessarily planar. The assumption holds for most surveillance videos. The parts of the occluded object that have a large enough overlap with the occluding object are considered to be occluded. Parts that are far enough outside the image borders are also considered to be occluded. No system is used for detecting occlusions from scene objects or other non-tracked objects. Figure 8.1 visualizes the effect of the inter-object occlusion detection.
Figure 8.1: Three scenarios with significant inter-object occlusions. Only the human part boxes that are considered to be non-occluded by the framework are displayed.
Occluded parts are not used in the measurement updates of the RBPF. They are not tracked and their appearance models are not updated. If enough parts of the object are occluded, then the object itself is considered to be too occluded to utilize the appearance tracking and detection scores for the whole object in the measurement update of the RBPF. In this case, the likelihoods p(\theta_t^0 | z, s) and p(\varphi_t^0 | z) are set to uniform distributions. Further, the appearance of the whole object is not updated. For the application to human tracking, the object itself is considered occluded if any of the upper body parts are occluded.
The time update in the RBPF is not affected by the occlusions. Occluded part locations are therefore predicted by the model, and covariance is added to the predictions, which reflects the added uncertainty from relying on prediction alone. If all object parts are occluded, then no measurements of the object location are available. The time update in the RBPF will then give predictions of the object location using the constant velocity motion model described in section 6.2.3.
9 Results, Discussion and Conclusions
This chapter presents the results of the constructed category tracker, when applied to
human tracking in surveillance scenes. Section 9.1 presents the qualitative results. Section 9.2 discusses the method, results and potential future work. The final conclusions are
summarized in section 9.3.
9.1 Results
The framework was implemented in Matlab and tested on a desktop computer with an Intel Xeon 2-core 2.66 GHz CPU and 16 GB RAM. The number of particles in the RBPF was set to 1000. Very little time was spent on tuning the large number of parameters. The Town Centre sequence provided by [5] was used for testing and evaluations. This sequence consists of 7501 frames in 1920 × 1080 resolution at 25 fps. The scene is a busy town centre street. Figure 9.1 visualizes the bounding boxes of the tracked objects and the estimated object trajectories for every 100th frame in the first 1000 frames. Figure 9.2 displays the part trajectories of a few selected objects. In the latter two images, the trajectories are disrupted by inter-object occlusions. In the last image, the person is successfully tracked even though only the head of the person is visible for a long period of time.
Figures 9.1 and 9.2 show reasonable object and part trajectories. The system is able to track most humans through the entire scene. But there are some disrupted tracks, mostly due to imperfections in the object detector. It can also be seen that the DPM detector, while powerful, also gives some obvious false detections. Two of these are persistent, as they are triggered by background structures. One of these is the mannequin in the shop-window to the left and the other one occurs in the lower right corner.
Figure 9.1: Tracking results for frame number 100, 200, . . . , 1000 in the Town Centre sequence. The center location trajectories are displayed for humans that are tracked in the specific frame.
Figure 9.2: Estimated human part trajectories of some selected objects (frames 452, 350 and 850). Note that many trajectories in the two latter cases are disrupted by inter-object occlusions.
9.2 Discussion and Future Work
As discussed in section 1.2.2, most research in category tracking is focused on pure data
association of detections and global optimization with non-causal assumptions. The motivation behind the presented work in to incorporate object detections in more sophisticated
ways while avoiding the non-causal assumption. However, real-time frame rates are necessary for online applications. In the current Matlab implementation, the computational
time for the system with the object detector excluded is between 1 and 3 seconds per
frame, depending on the number of present targets. But since particle filters and FFTs
are parallelizable, real-time frame rates can potentially be obtained in a GPU (graphics
processing unit) implementation.
The object detection scores were precomputed using the Matlab/mex code provided by
[15]. This took approximately four days for the entire Town Centre sequence (7501
frames and using 58 different scales). However, recent works [33, 34] have achieved
close to real-time frame rates (10 fps) for a GPU implementation of the DPM detector,
by exploiting coarse-to-fine search strategies. In the constructed framework, it would presumably be enough to use object detection measurements in every few frames and rely
solely on the generic tracking results between these frames.
The main weakness of the proposed model is the Gaussian approximation of the part likelihoods, as it does not model cluttered detections well. Figure 9.3 visualizes the detection likelihoods for the human and all parts at a certain frame and scale. There are three strong human detections at this scale, and three corresponding strong head detections are visible. However, almost no clear detections exist for the hips and legs. The feet likelihoods suffer from much clutter in many areas of the image. The straightforward way of handling this problem is to add the part states that are most affected by clutter to the non-linear states in the RBPF. However, this would also require an exponential increase in the number of particles to achieve the same theoretical accuracy of the posterior distribution, due to the curse of dimensionality. Another option is to approximate the part likelihoods in (6.16) with a mixture of Gaussians and use a mode parameter state to distinguish between the different hypotheses. The mode state is most simply included as a non-linear state, but since it is discrete, it would not require such a large increase in the number of particles.
Much research has been invested in improving the DPM detector framework. For example, [25] increased the performance on many object categories by combining color names (section 3.1.1) with HOG at the feature level. This could alleviate the problem with false detections and other imperfections of the detector. The problems with occlusions caused by non-tracked objects, e.g. background structures, could be handled further by incorporating an occlusion model. The work of [39] uses part detection scores to determine partial occlusions.
Re-identification of objects that are lost, for example at occlusions, is an important task if object identities are of interest. This could potentially be done with the appearance models applied by the RCSK tracker. Otherwise, separate appearance models could be trained for this purpose.
[Figure 9.3 panels: (a) Frame 130. (b) Human. (c) Head. (d) Left shoulder. (e) Right shoulder. (f) Left hip. (g) Right hip. (h) Legs. (i) Left foot. (j) Right foot. Each likelihood map uses a color scale from 0 to 1.]
Figure 9.3: Detection likelihoods p(θ^j | z, s) (figures 9.3b to 9.3j) at scale s = 2^{12/10}, computed on frame 130 (figure 9.3a) in the Town Centre sequence.
9.3 Conclusions
In this second part of the thesis, a system for category tracking is presented. The main novelties are the fusion of generic tracking with object detection scores from DPM in a causal probabilistic framework, and the use of the RBPF in the filtering step. Encouraging results are demonstrated when the system is applied to human tracking in a real-world surveillance sequence. The causal nature and real-time potential of this system make it attractive for online applications. Additionally, the estimated trajectories of human parts could be used by other systems for action detection and recognition.
Appendix A
Bayesian Filtering
This chapter contains a brief presentation of Bayesian filtering theory. The chapter starts with the general theory and solution. The rest of the chapter presents the parts of the theory that are used in the proposed category object tracker. Section A.4 contains the algorithm and details of the Rao-Blackwellized particle filter, which is of major importance for the proposed tracker.
A.1 The General Case
Consider the general first order hidden Markov model in (A.1). The state of the system at time t ∈ N is denoted x_t ∈ R^n, and y_t ∈ R^{m_t} denotes the measurements given at time t.

p(x_{t+1} | x_t)    (A.1a)
p(y_t | x_t)    (A.1b)

The state transition density p(x_{t+1} | x_t) models the dynamics of the system. The likelihood p(y_t | x_t) models how likely it is to receive a certain measurement given the state of the system. The goal is to estimate the posterior probability p(x_t | Y_t), where Y_t = {y_1, . . . , y_t} is the set of all measurements observed so far. The posterior is the probability distribution of the state given all information (measurements) available at that time instance.
A.1.1 General Bayesian Solution
The general solution of the Bayesian filtering problem of the model in (A.1) is given by the recursion formula in (A.2).

p(x_t | Y_t) = \frac{p(y_t | x_t) \, p(x_t | Y_{t-1})}{p(y_t | Y_{t-1})}    (A.2a)

p(x_{t+1} | Y_t) = \int_{R^n} p(x_{t+1} | x_t) \, p(x_t | Y_t) \, dx_t    (A.2b)

The normalization factor in (A.2a) can be expressed as in (A.3).

p(y_t | Y_{t-1}) = \int_{R^n} p(y_t | x_t) \, p(x_t | Y_{t-1}) \, dx_t    (A.3)
Equation A.2a is often called the measurement update, since the posterior is updated with the new information contained in the measurement y_t. Equation A.2b is often called the time update, since it predicts the posterior at the next time instance using the modelled dynamics. The recursion can be initialized with p(x_1 | Y_0) = p(x_1). The recursion formula in (A.2) can be derived by applying Bayes' theorem and marginalization, along with the Markov properties of the model in (A.1).
In practice, the recursive Bayesian solution can only be applied in some special cases where finite dimensional parametrizations of the densities exist. Section A.2 discusses such a case. Otherwise, some sort of finite dimensional approximation of the densities is needed. Sections A.3 and A.4 discuss such methods.
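As an illustration of the recursion, the sketch below (Python/NumPy, not part of the original thesis) evaluates the measurement update (A.2a) with the normalization (A.3) and the time update (A.2b) on a fixed grid for a scalar state. All names and the example model are purely illustrative.

import numpy as np

def bayes_filter_update(prior, likelihood, transition, dx):
    # prior:      p(x_t | Y_{t-1}) evaluated on the grid
    # likelihood: p(y_t | x_t) evaluated on the grid for the observed y_t
    # transition: matrix T[i, j] approximating p(x_{t+1} = x_i | x_t = x_j)
    # dx:         grid spacing, used to approximate the integrals
    # Measurement update (A.2a), normalized with (A.3).
    unnormalized = likelihood * prior
    posterior = unnormalized / (np.sum(unnormalized) * dx)
    # Time update (A.2b): marginalize over x_t.
    predicted = transition @ posterior * dx
    return posterior, predicted

# Tiny usage example with a Gaussian prior, likelihood and random-walk dynamics.
x = np.linspace(-5.0, 5.0, 201)
dx = x[1] - x[0]
prior = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
likelihood = np.exp(-0.5 * (1.0 - x)**2)            # unnormalized p(y_t | x_t) for y_t = 1
T = np.exp(-0.5 * (x[:, None] - x[None, :])**2)
T /= T.sum(axis=0) * dx                             # each column integrates to one
posterior, predicted = bayes_filter_update(prior, likelihood, T, dx)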
A.1.2 Estimation
At each time instance, an estimate of the state x_t given the measurements Y_t can be calculated using e.g. the minimum variance (MV) or maximum a posteriori (MAP) estimate.

x̂_{t|t}^{MV} = \int_{R^n} x_t \, p(x_t | Y_t) \, dx_t    (A.4a)

x̂_{t|t}^{MAP} = \arg\max_{x_t} p(x_t | Y_t)    (A.4b)

A.2 The Kalman Filter
The linear Gaussian model is one of the cases where an analytic solution to (A.2) exists. Such a model is given by (A.5). It is a special case of the general model in (A.1), with p(x_{t+1} | x_t) = N(x_{t+1}; A_t x_t, B_t Q_t B_t^T) and p(y_t | x_t) = N(y_t; C_t x_t, R_t).

x_{t+1} = A_t x_t + B_t v_t    (A.5a)
y_t = C_t x_t + e_t    (A.5b)
v_t ∼ N(0, Q_t)    (A.5c)
e_t ∼ N(0, R_t)    (A.5d)
x_1 ∼ N(x̂_{1|0}, P_{1|0})    (A.5e)

v_t and e_t are white; v_t is called process noise and e_t measurement noise. A_t, B_t and C_t are matrices of appropriate dimensions.¹ The notation x̂_{t|k} and P_{t|k} denotes the estimate of the state mean and covariance, respectively, at time t given all measurements up to time k.
A.2.1 Algorithm

The measurement update of the Kalman filter is given in (A.6), where K_t is the Kalman gain that needs to be computed at each time instance.

x̂_{t|t} = x̂_{t|t-1} + K_t (y_t - C_t x̂_{t|t-1})    (A.6a)
P_{t|t} = P_{t|t-1} - K_t C_t P_{t|t-1}    (A.6b)
K_t = P_{t|t-1} C_t^T (C_t P_{t|t-1} C_t^T + R_t)^{-1}    (A.6c)

The time update is given in (A.7).

x̂_{t+1|t} = A_t x̂_{t|t}    (A.7a)
P_{t+1|t} = A_t P_{t|t} A_t^T + B_t Q_t B_t^T    (A.7b)

The complete algorithm is given in algorithm A.1. For further details and a derivation of the Kalman filter, see [18].
Algorithm A.1 Kalman Filter Update at time t
Input:
  Matrices: A_t, B_t, C_t, Q_t and R_t
  Measurements: y_t
  Prediction at time t:ᵃ x̂_{t|t-1} and P_{t|t-1}
Output:
  Estimation at time t: x̂_{t|t} and P_{t|t}
  Prediction at time t+1: x̂_{t+1|t} and P_{t+1|t}
1: Measurement update using (A.6).
2: Time update using (A.7).
ᵃ The prediction is given by the model at the first iteration (i.e. t = 1) and by the previous iteration otherwise.
¹ Note that the dimensions can vary dynamically, since the number of measurements (i.e. the dimension of y_t) can change.
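For reference, a minimal NumPy sketch of one iteration of algorithm A.1, i.e. the measurement update (A.6) followed by the time update (A.7). The function and variable names are placeholders; this is an illustration rather than an implementation from the thesis.

import numpy as np

def kalman_update(x_pred, P_pred, y, A, B, C, Q, R):
    # Measurement update (A.6).
    S = C @ P_pred @ C.T + R
    K = P_pred @ C.T @ np.linalg.inv(S)           # Kalman gain (A.6c)
    x_filt = x_pred + K @ (y - C @ x_pred)        # (A.6a)
    P_filt = P_pred - K @ C @ P_pred              # (A.6b)
    # Time update (A.7).
    x_next = A @ x_filt                           # (A.7a)
    P_next = A @ P_filt @ A.T + B @ Q @ B.T       # (A.7b)
    return x_filt, P_filt, x_next, P_next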
A.2.2 Iterated Measurement Update
If y_t consists of several independent measurements, the iterated measurement update can be used instead to reduce the computational cost. Let y_t = (y_t^1, . . . , y_t^M)^T, where the y_t^i are the uncorrelated measurements.² This implies that the measurement noise covariance R_t is block diagonal, with the non-zero blocks R_t^i = Cov(y_t^i | x_t). Also let C_t^i be the rows of the measurement matrix C_t associated with y_t^i. The measurement update in algorithm A.1 can then be performed using algorithm A.2.
Algorithm A.2 Iterated Measurement Update at time t
Input:
  Matrices: C_t and R_t
  Measurements: y_t = (y_t^1, . . . , y_t^M)^T
  Prediction at time t: x̂_{t|t-1} and P_{t|t-1}
Output:
  Estimation at time t: x̂_{t|t} and P_{t|t}
1: Set x̂_t^0 = x̂_{t|t-1} and P_t^0 = P_{t|t-1}
2: for i = 1, . . . , M do
3:   K_t^i = P_t^{i-1} (C_t^i)^T (C_t^i P_t^{i-1} (C_t^i)^T + R_t^i)^{-1}
4:   x̂_t^i = x̂_t^{i-1} + K_t^i (y_t^i - C_t^i x̂_t^{i-1})
5:   P_t^i = P_t^{i-1} - K_t^i C_t^i P_t^{i-1}
6: end for
7: Set x̂_{t|t} = x̂_t^M and P_{t|t} = P_t^M
The major gain in using the iterated measurement update instead of (A.6) is that the matrix inversions are computed for smaller matrices. This can be used to radically increase the computational speed of the marginalized particle filter, where thousands of measurement updates (one for each particle) need to be computed at each time instance.
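A corresponding sketch of algorithm A.2, assuming the uncorrelated measurement blocks y_t^i, C_t^i and R_t^i are supplied as lists; the names are again illustrative only. The point of the loop is that only the small matrices S need to be inverted.

import numpy as np

def iterated_measurement_update(x_pred, P_pred, ys, Cs, Rs):
    # Process the uncorrelated measurement blocks one at a time (steps 2-6 of algorithm A.2).
    x, P = x_pred, P_pred
    for y_i, C_i, R_i in zip(ys, Cs, Rs):
        S = C_i @ P @ C_i.T + R_i                 # small matrix, cheap to invert
        K = P @ C_i.T @ np.linalg.inv(S)
        x = x + K @ (y_i - C_i @ x)
        P = P - K @ C_i @ P
    return x, P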
A.3 The Particle Filter
Approximative methods are necessary in more general cases than the one in (A.5). The particle filter approximates the posterior over an adaptive grid. In contrast to the form of Bayesian filtering described in (A.2), the particle filter estimates the posterior of the whole trajectory X_t = {x_1, . . . , x_t}, i.e. p(X_t | Y_t). The filtering posterior p(x_t | Y_t) is obtained by marginalizing over the earlier states X_{t-1}.
A.3.1 Algorithm
The complete algorithm is stated in algorithm A.3. It uses a proposal density q(x_{t+1} | x_t, y_{t+1}) to sample the new particle grid {x_{t+1}^i}_{i=1}^N given the previous grid {x_t^i}_{i=1}^N and the next measurement y_{t+1}. The simplest and most common choice of proposal density is the prior q(x_{t+1} | x_t, y_{t+1}) = p(x_{t+1} | x_t), although many different proposal densities can be used depending on the application.
² Note that y_t^i does not have to be scalar.
The resampling step in the particle filter is needed to avoid sample depletion. It is necessary to discard particles with too low weight, which do not contribute significantly to the approximation of the posterior. It is important to note that even though the particle filter returns an estimate of the posterior of the whole trajectory, it is only accurate (assuming enough particles) for the last few states, because of the depletion problem.
Algorithm A.3 Particle Filter Update at time t
Input:
  Number of particles: N
  Particles and predicted weights at time t:ᵃ {X_t^i}_{i=1}^N, {w_{t|t-1}^i}_{i=1}^N
  Proposal distribution: q(x_{t+1} | x_t, y_{t+1})
Output:
  Particle weights at time t: {w_{t|t}^i}_{i=1}^N
  Particles and predicted weights at time t+1: {X_{t+1}^i}_{i=1}^N, {w_{t+1|t}^i}_{i=1}^N
Measurement update:
1: Calculate new weights
     w̃_{t|t}^i = w_{t|t-1}^i \, p(y_t | x_t^i)    (A.8)
2: Normalize the weights
     w_{t|t}^i = w̃_{t|t}^i / \sum_{i=1}^N w̃_{t|t}^i    (A.9)
Resampling:ᵇ
3: Sample N particles with replacement from the set {x_t^i}_{i=1}^N with the probabilities {w_{t|t}^i}_{i=1}^N.
4: Set the weights to w_{t|t}^i = 1/N.
Time update:
5: Generate new particles using the proposal distribution
     x_{t+1}^i ∼ q(x_{t+1} | x_t^i, y_{t+1})    (A.10)
6: Compute the new predicted weights
     w_{t+1|t}^i = w_{t|t}^i \, p(x_{t+1}^i | x_t^i) / q(x_{t+1}^i | x_t^i, y_{t+1})    (A.11)
ᵃ In the first iteration (t = 1) the particles are sampled using x_1^i ∼ p_{x_1}, which is given by the model. The initial weights are set to w_{1|0}^i = 1/N.
ᵇ The resampling is optional in each iteration. It can be done only when needed, indicated by some measure for depletion (see [18]).
A.3.2 Estimation

The filtering posterior is approximated by (A.12). The approximation is most commonly done before the resampling. However, it can be done after as well if enough particles are used.

p̂(x_t | Y_t) = \sum_{i=1}^N w_{t|t}^i \, δ(x_t - x_t^i)    (A.12)

A point estimate of the state can be obtained by using this approximation in (A.4a). This results in the MV approximation given in (A.13). An approximation of the MAP estimate can be obtained by simply choosing the particle x_t^i with the highest weight w_{t|t}^i.

x̂_{t|t}^{MV} = \sum_{i=1}^N w_{t|t}^i \, x_t^i    (A.13)
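A minimal sketch of one bootstrap update of algorithm A.3 (proposal equal to the prior), including the MV estimate (A.13). Here particles is assumed to be an (N, d) NumPy array, and likelihood and propagate are hypothetical callbacks for the model densities; none of this is taken from the thesis implementation.

import numpy as np

def bootstrap_pf_update(particles, weights, y, likelihood, propagate, rng):
    # likelihood(y, particles) evaluates p(y_t | x_t^i) for all particles,
    # propagate(particles, rng) samples x_{t+1}^i from p(x_{t+1} | x_t^i).
    N = len(weights)
    # Measurement update, (A.8)-(A.9).
    w = weights * likelihood(y, particles)
    w /= np.sum(w)
    # Minimum variance estimate (A.13), taken before resampling.
    x_mv = np.sum(w[:, None] * particles, axis=0)
    # Resampling with replacement.
    idx = rng.choice(N, size=N, p=w)
    particles = particles[idx]
    # Time update with the prior proposal; (A.11) then keeps the (uniform) weights.
    particles = propagate(particles, rng)
    weights = np.full(N, 1.0 / N)
    return particles, weights, x_mv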
A.4 The Rao-Blackwellized Particle Filter
The particle filter is in most cases impractical for state spaces with more than a few dimensions. This is due to the fact that the number of particles has to grow exponentially with the number of states to maintain the accuracy of the estimates. The Rao-Blackwellized particle filter (RBPF) [37] exploits linear-Gaussian substructures in the state space to reduce the dimensionality of the particle approximation. The remaining states are approximated using Kalman filters. In this section the state space model in (A.14) is considered. In this model the state vector has been partitioned as x_t = (x_t^n, x_t^l)^T. The partitions are called the non-linear and linear states, respectively.
x_{t+1}^n = f_t^n(x_t^n) + A_t^n(x_t^n) x_t^l + B_t^n(x_t^n) v_t^n    (A.14a)
x_{t+1}^l = f_t^l(x_t^n) + A_t^l(x_t^n) x_t^l + B_t^l(x_t^n) v_t^l    (A.14b)
y_t = h_t(x_t^n) + C_t(x_t^n) x_t^l + e_t    (A.14c)
v_t = (v_t^n, v_t^l)^T ∼ N(0, Q_t),   Q_t = [[Q_t^n, Q_t^{ln}], [(Q_t^{ln})^T, Q_t^l]]    (A.14d)
e_t ∼ N(0, R_t)    (A.14e)
x_1^n ∼ p_{x_1^n}    (A.14f)
x_1^l ∼ N(x̂_0^l, P_0)    (A.14g)

f_t^n, f_t^l and h_t are vector valued functions of x_t^n. A_t^n, A_t^l, B_t^n, B_t^l and C_t are matrices of appropriate dimensions that depend on x_t^n. Note that this model is linear in x_t^l conditioned on x_t^n.
A.4.1 Algorithm
The RBPF uses the factorization in (A.15).

p(X_t^n, x_t^l | Y_t) = p(x_t^l | X_t^n, Y_t) \, p(X_t^n | Y_t)    (A.15)

The first factor is the distribution of the linear states conditioned on the non-linear states and the measurements. Given a particle approximation of X_t^n, an optimal approximation of this factor can be derived. The solution is to run a Kalman filter for each particle X_t^{n,i}. The goal is to calculate p(x_t^l | X_t^{n,i}, Y_t) = N(x_t^l; x̂_{t|t}^{l,i}, P_{t|t}^i) in the measurement update and p(x_{t+1}^l | X_{t+1}^{n,i}, Y_t) = N(x_{t+1}^l; x̂_{t+1|t}^{l,i}, P_{t+1|t}^i) in the time update, for each particle i. The fact that these distributions are Gaussian can be proven by induction. To simplify the notation somewhat, the particle index i is skipped below. The dependence on x_t^n of the various functions and matrices in the model is not denoted explicitly, to further increase the readability of the formulas.
The measurement update of the linear states is given in (A.16).

p(x_t^l | X_t^n, Y_t) = N(x_t^l; x̂_{t|t}^l, P_{t|t})    (A.16a)
x̂_{t|t}^l = x̂_{t|t-1}^l + K_t (y_t - h_t - C_t x̂_{t|t-1}^l)    (A.16b)
P_{t|t} = P_{t|t-1} - K_t C_t P_{t|t-1}    (A.16c)
K_t = P_{t|t-1} C_t^T (C_t P_{t|t-1} C_t^T + R_t)^{-1}    (A.16d)

The time update of the linear states is given in (A.17).

p(x_{t+1}^l | X_{t+1}^n, Y_t) = N(x_{t+1}^l; x̂_{t+1|t}^l, P_{t+1|t})    (A.17a)
x̂_{t+1|t}^l = Ā_t^l x̂_{t|t}^l + B̄_t^l z_t + f_t^l + L_t (z_t - A_t^n x̂_{t|t}^l)    (A.17b)
P_{t+1|t} = Ā_t^l P_{t|t} (Ā_t^l)^T + B_t^l Q̄_t^l (B_t^l)^T - L_t A_t^n P_{t|t} (Ā_t^l)^T    (A.17c)
L_t = Ā_t^l P_{t|t} (A_t^n)^T ( A_t^n P_{t|t} (A_t^n)^T + B_t^n Q_t^n (B_t^n)^T )^{-1}    (A.17d)
z_t = x_{t+1}^n - f_t^n    (A.17e)
The bar-denoted matrices are defined in (A.18). However, if the process noises v_t^n and v_t^l are uncorrelated, i.e. Q_t^{ln} = 0, then Ā_t^l = A_t^l, B̄_t^l = 0 and Q̄_t^l = Q_t^l.

Ā_t^l = A_t^l - B̄_t^l A_t^n    (A.18a)
B̄_t^l = B_t^l (Q_t^{ln})^T (B_t^n Q_t^n)^{-1}    (A.18b)
Q̄_t^l = Q_t^l - (Q_t^{ln})^T (Q_t^n)^{-1} Q_t^{ln}    (A.18c)
Note that (A.17) contains a measurement update using z_t, followed by the ordinary Kalman filter time update. This extra measurement update is necessary to include the information from the time update in the particle filter; equation A.14a acts as the measurement equation in this case.
The second factor in (A.15) can be factorized as in (A.19). This factorization is used for the particle approximation of the non-linear states.

p(X_t^n | Y_t) = \frac{p(y_t | X_t^n, Y_{t-1}) \, p(x_t^n | X_{t-1}^n, Y_{t-1})}{p(y_t | Y_{t-1})} \, p(X_{t-1}^n | Y_{t-1})    (A.19)

The distributions for the likelihood and prior factors are given in (A.20). The linear states can in these cases be seen as extra measurement noise and process noise, respectively.

p(y_t | X_t^n, Y_{t-1}) = N( y_t; h_t + C_t x̂_{t|t-1}^l, \; C_t P_{t|t-1} C_t^T + R_t )    (A.20a)
p(x_{t+1}^n | X_t^n, Y_t) = N( x_{t+1}^n; f_t^n + A_t^n x̂_{t|t}^l, \; A_t^n P_{t|t} (A_t^n)^T + B_t^n Q_t^n (B_t^n)^T )    (A.20b)
The complete algorithm is given in algorithm A.4. Like the particle filter, it uses a general proposal distribution q(x_{t+1}^n | X_t^n, Y_{t+1}). The most common example of a proposal distribution is the prior in (A.20b). See [29] for details on proposal distributions and a more detailed description of the RBPF. Also see [37] for proofs and details on some special cases.
A.4.2 Estimation

The filtering posterior and point estimates of the non-linear states can be calculated in the same way as for the particle filter in section A.3.2. The posterior of the linear states can be approximated as in (A.25), by a mixture of Gaussians.

p̂(x_t^l | Y_t) = \sum_{i=1}^N w_{t|t}^i \, N( x_t^l; x̂_{t|t}^{l,i}, P_{t|t}^i )    (A.25)
The MAP estimate of the linear states can be approximated with the x̂_{t|t}^{l,i} that corresponds to the highest weight w_{t|t}^i. The minimum variance (MV) estimate is calculated using (A.26).

x̂_{t|t}^l = \sum_{i=1}^N w_{t|t}^i \, x̂_{t|t}^{l,i}    (A.26a)
P̂_{t|t}^l = \sum_{i=1}^N w_{t|t}^i ( P_{t|t}^i + (x̂_{t|t}^{l,i} - x̂_{t|t}^l)(x̂_{t|t}^{l,i} - x̂_{t|t}^l)^T )    (A.26b)
Note that the measurement update of the linear states in algorithm A.4 is placed after the
resampling step. It can thus be more convenient to extract all estimates after step 6 in the
algorithm.
Algorithm A.4 Rao-Blackwellized Particle Filter Update at time t
Input:
  Number of particles: N
  Particles and predicted weights at time t:ᵃ {X_t^{n,i}}_{i=1}^N, {w_{t|t-1}^i}_{i=1}^N
  Predicted linear states at time t:ᵇ {x̂_{t|t-1}^{l,i}}_{i=1}^N, {P_{t|t-1}^i}_{i=1}^N
  Proposal distribution: q(x_{t+1}^n | X_t^n, Y_{t+1})
Output:
  Particle weights at time t: {w_{t|t}^i}_{i=1}^N
  Particles and predicted weights at time t+1: {X_{t+1}^{n,i}}_{i=1}^N, {w_{t+1|t}^i}_{i=1}^N
  Estimated linear states at time t: {x̂_{t|t}^{l,i}}_{i=1}^N, {P_{t|t}^i}_{i=1}^N
  Predicted linear states at time t+1: {x̂_{t+1|t}^{l,i}}_{i=1}^N, {P_{t+1|t}^i}_{i=1}^N
Particle filter measurement update:
1: Calculate new weights using (A.20a)
     w̃_{t|t}^i = w_{t|t-1}^i \, p(y_t | X_t^{n,i}, Y_{t-1})    (A.21)
2: Normalize the weights
     w_{t|t}^i = w̃_{t|t}^i / \sum_{i=1}^N w̃_{t|t}^i    (A.22)
Resampling:ᶜ
3: Sample a set of indices J_t = {j_t^k}_{k=1}^N with the probabilities p(j_t^k = i) = w_{t|t}^i, ∀ i, k.
4: Set X_t^{n,k} = X_t^{n,j_t^k}, x̂_{t|t-1}^{l,k} = x̂_{t|t-1}^{l,j_t^k} and P_{t|t-1}^k = P_{t|t-1}^{j_t^k} for k = 1, . . . , N.
5: Set the weights to w_{t|t}^k = 1/N.
Kalman filter measurement update:
6: Kalman filter measurement update for each particle X_t^{n,i} using (A.16).
Particle filter time update:
7: Generate new particles using the proposal distribution
     x_{t+1}^{n,i} ∼ q(x_{t+1}^n | X_t^{n,i}, Y_{t+1})    (A.23)
8: Compute the new predicted weights using (A.20b)
     w_{t+1|t}^i = w_{t|t}^i \, p(x_{t+1}^{n,i} | X_t^{n,i}, Y_t) / q(x_{t+1}^{n,i} | X_t^{n,i}, Y_{t+1})    (A.24)
Kalman filter time update:
9: Kalman filter time update for each particle X_t^{n,i} using (A.17).
ᵃ In the first iteration (t = 1) the particles are sampled using x_1^{n,i} ∼ p_{x_1^n}. The initial weights are set to w_{1|0}^i = 1/N.
ᵇ In the first iteration (t = 1) the linear states are initialized as x̂_{1|0}^{l,i} = x̂_0^l and P_{1|0}^i = P_0 for all i.
ᶜ The resampling is optional in each iteration (see algorithm A.3).
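A schematic sketch of one pass of algorithm A.4 for the uncorrelated-noise case (Q_t^{ln} = 0) with the prior proposal. Each particle carries its non-linear state together with the predicted Kalman mean and covariance of its linear states. The model callbacks in the dictionary m (f_n, A_n, B_n, Q_n, f_l, A_l, B_l, Q_l, h, C, R) stand in for the functions and matrices of (A.14); they are placeholders and not taken from the thesis.

import numpy as np
from numpy.linalg import inv, det

def rbpf_update(parts, y, m, rng):
    # parts: list of dicts {'xn', 'xl', 'P', 'w'}, where 'xl' and 'P' hold the
    # one-step-ahead prediction of the linear states for that particle.
    # Steps 1-2: particle filter measurement update using (A.20a).
    for p in parts:
        C, h, R = m['C'](p['xn']), m['h'](p['xn']), m['R']
        S = C @ p['P'] @ C.T + R
        r = y - h - C @ p['xl']
        p['w'] *= np.exp(-0.5 * r @ inv(S) @ r) / np.sqrt(det(2.0 * np.pi * S))
    wsum = sum(p['w'] for p in parts)
    for p in parts:
        p['w'] /= wsum
    # Steps 3-5: resample and reset the weights.
    idx = rng.choice(len(parts), size=len(parts), p=[p['w'] for p in parts])
    parts = [{k: (np.copy(v) if isinstance(v, np.ndarray) else v)
              for k, v in parts[i].items()} for i in idx]
    for p in parts:
        p['w'] = 1.0 / len(parts)
    # Step 6: Kalman measurement update (A.16) for each particle.
    for p in parts:
        C, h, R = m['C'](p['xn']), m['h'](p['xn']), m['R']
        K = p['P'] @ C.T @ inv(C @ p['P'] @ C.T + R)
        p['xl'] = p['xl'] + K @ (y - h - C @ p['xl'])
        p['P'] = p['P'] - K @ C @ p['P']
    # Steps 7-9: particle and Kalman time updates, (A.20b) and (A.17) with Q^ln = 0.
    for p in parts:
        fn, An, Bn, Qn = m['f_n'](p['xn']), m['A_n'](p['xn']), m['B_n'](p['xn']), m['Q_n']
        fl, Al, Bl, Ql = m['f_l'](p['xn']), m['A_l'](p['xn']), m['B_l'](p['xn']), m['Q_l']
        Sn = An @ p['P'] @ An.T + Bn @ Qn @ Bn.T
        xn_new = rng.multivariate_normal(fn + An @ p['xl'], Sn)   # prior proposal (A.20b)
        z = xn_new - fn                                           # (A.17e)
        L = Al @ p['P'] @ An.T @ inv(Sn)                          # (A.17d), bars dropped since Q^ln = 0
        p['xl'] = Al @ p['xl'] + fl + L @ (z - An @ p['xl'])      # (A.17b)
        p['P'] = Al @ p['P'] @ Al.T + Bl @ Ql @ Bl.T - L @ An @ p['P'] @ Al.T   # (A.17c)
        p['xn'] = xn_new
        # With the prior proposal, (A.24) leaves the uniform weights unchanged.
    return parts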
Appendix B
Proofs and Derivations

B.1 Derivation of the RCSK Tracker Algorithm

B.1.1 Kernel Function Proofs
This section proves the propositions in section 2.2.4. This is done by first showing the following result.

B.1 Lemma. The inner product on ℓ_p^D(M, N) is a shift invariant kernel.

Proof: The inner product is clearly a valid kernel function. Using the definition of the standard scalar product in ℓ_p^D(M, N) and the periodicity of f and g we get

⟨τ_{m,n} f, τ_{m,n} g⟩ = \sum_{d=1}^D \sum_{k,l} f^d(k-m, l-n) g^d(k-m, l-n) = \sum_{d=1}^D \sum_{r,s} f^d(r, s) g^d(r, s) = ⟨f, g⟩    (B.1)

Since f, g, m, n were arbitrary, this is valid for all f, g ∈ ℓ_p^D(M, N) and m, n ∈ Z.
The results in section 2.2.4 can now be shown using this basic lemma.
Proof of proposition 2.2: The shift invariance follows directly from (2.14) and lemma B.1. Equation 2.15 follows from the correlation property and the linearity of the DFT (see [16]).

κ(τ_{-m,-n} f, g) = k( \sum_{d=1}^D \sum_{k,l} f^d(k+m, l+n) g^d(k, l) ) = k( \sum_{d=1}^D (g^d ∗ f^d)(m, n) ) = k( F^{-1}\{ \sum_{d=1}^D F^d \overline{G^d} \}(m, n) )    (B.2)
Proof of proposition 2.3: Equation 2.16 can be expanded as:

κ(f, g) = k( ‖f - g‖² ) = k( ⟨f - g, f - g⟩ ) = k( ⟨f, f⟩ + ⟨g, g⟩ - 2⟨f, g⟩ ) = k( ‖f‖² + ‖g‖² - 2⟨f, g⟩ )    (B.3)

The shift invariance now follows from lemma B.1. The proof of (2.17) is similar to the one of (2.15), using (B.3) and applying lemma B.1 to get ‖τ_{-m,-n} f‖² = ‖f‖².
B.1.2 Derivation of the Robust Appearance Learning Scheme
This section shows that A = F{a} in (2.20) is the minimizer of the cost function in (2.19). The cost function can be rewritten as (B.4) by inserting v^j from (2.19b) into (2.19a).

\sum_{j=1}^J β_j [ \sum_{m,n} ( \sum_{k,l} a(k,l) κ(x_{m,n}^j, x_{k,l}^j) - y^j(m,n) )^2 + λ \sum_{m,n} a(m,n) \sum_{k,l} a(k,l) κ(x_{m,n}^j, x_{k,l}^j) ]    (B.4)
This function is clearly convex in a, since it is a sum of convex functions (the squared L²-norm of an affine transformation of a is convex). The global minimum can thus be found by finding a stationary point. The derivative with respect to a(r, s) is computed in (B.5).

\frac{∂}{∂ a(r,s)} = 2 \sum_{j=1}^J β_j \sum_{m,n} κ(x_{m,n}^j, x_{r,s}^j) ( \sum_{k,l} a(k,l) κ(x_{m,n}^j, x_{k,l}^j) - y^j(m,n) + λ a(m,n) )
= 2 \sum_{j=1}^J β_j \sum_{m,n} κ(x_{r-m,s-n}^j, x^j) ( \sum_{k,l} a(k,l) κ(x_{m-k,n-l}^j, x^j) - y^j(m,n) + λ a(m,n) )    (B.5)

Here we have used the symmetry of the kernel function, i.e. κ(x, z) = κ(z, x), and the shift invariance defined in definition 2.1. We define the function u_x^j ∈ ℓ_p(M, N) in (B.6).

u_x^j(m, n) = κ(x_{m,n}^j, x^j)    (B.6)
Using this definition, the derivative in (B.5) can be expressed as:

\frac{1}{2} \frac{∂}{∂ a(r,s)} = \sum_{j=1}^J β_j \sum_{m,n} u_x^j(r-m, s-n) ( \sum_{k,l} a(k,l) u_x^j(m-k, n-l) - y^j(m,n) + λ a(m,n) )
= \sum_{j=1}^J β_j \sum_{m,n} u_x^j(r-m, s-n) ( (a ∗ u_x^j)(m,n) - y^j(m,n) + λ a(m,n) )
= \sum_{j=1}^J β_j ( u_x^j ∗ (a ∗ u_x^j - y^j + λ a) )(r, s)    (B.7)
By setting these derivatives to zero we get:

\frac{∂}{∂ a(r,s)} = 0,   ∀ r, s ∈ Z
⟺ \sum_{j=1}^J β_j ( u_x^j ∗ (a ∗ u_x^j - y^j + λ a) ) = 0
⟺ F\{ \sum_{j=1}^J β_j \, u_x^j ∗ (a ∗ u_x^j - y^j + λ a) \} = 0
⟺ \sum_{j=1}^J β_j U_x^j ( U_x^j A - Y^j + λ A ) = 0
⟺ A \sum_{j=1}^J β_j U_x^j ( U_x^j + λ ) - \sum_{j=1}^J β_j Y^j U_x^j = 0
⟺ A = \frac{ \sum_{j=1}^J β_j Y^j U_x^j }{ \sum_{j=1}^J β_j U_x^j ( U_x^j + λ ) }    (B.8)

Here we have used that the DFT is linear and invertible, along with the convolution property (see [16]). The last equivalence assumes that all frequency components of the denominator are non-zero. This completes the derivation of (2.20).
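The closed-form solution (B.8) maps directly onto a few FFTs. The sketch below (NumPy, not the thesis implementation) assumes the kernel outputs u_x^j and the desired outputs y^j are precomputed, and follows the form of (B.8) as written above.

import numpy as np

def learn_filter_coefficients(u_list, y_list, betas, lam):
    # u_list[j] holds u_x^j(m, n) = kappa(x_{m,n}^j, x^j), y_list[j] holds y^j,
    # betas are the frame weights beta_j and lam is the regularization lambda.
    num = 0.0
    den = 0.0
    for u, y, beta in zip(u_list, y_list, betas):
        U = np.fft.fft2(u)
        Y = np.fft.fft2(y)
        num = num + beta * Y * U
        den = den + beta * U * (U + lam)
    A = num / den                       # assumes no frequency component of the denominator is zero
    return np.real(np.fft.ifft2(A))     # a = F^{-1}{A}; real-valued for real u and y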
B.2 Proof of Equation 6.13

This section proves (6.13), which is used to update the particle weights in the RBPF.

B.2.1 Proof of Uncorrelated Parts
This section proves that (B.9) holds when the RBPF is used on the described model.

p(x_t^l | X_t^n, Y_{t-1}) = N( v_t; v̂_{t|t-1}, P_{t|t-1}^v ) \prod_{j=1}^N N( z_t^j; ẑ_{t|t-1}^j, P_{t|t-1}^{z_j} )    (B.9)

From the RBPF we have that p(x_t^l | X_t^n, Y_{t-1}) = N(x_t^l; x̂_{t|t-1}^l, P_{t|t-1}) (see [37] for a proof). So, it only has to be proven that P_{t|t-1} is block diagonal with the 2×2 blocks P_{t|t-1}^v, P_{t|t-1}^{z_1}, . . . , P_{t|t-1}^{z_N}. From the model, this is true for t = 1. Now assume that it is also true for some t ≥ 1.
We first prove that P_{t|t} is 2×2-block diagonal, using the iterated measurement update in algorithm A.2. First let {y_t^1, . . . , y_t^N} be the independent position measurements from the image likelihood. We have C_t^i = (0_{2×2i}, I_{2×2}, 0_{2×2(N-i)}). Assume that P_t^{i-1} in algorithm A.2 is 2×2-block diagonal. Using the algorithm, it is easy to verify that K_t^i C_t^i P_t^{i-1} is only non-zero in diagonal block number i+1. P_t^i is thus also block diagonal. It follows that P_t^N is block diagonal, since P_t^0 = P_{t|t-1} is block diagonal by assumption. Now let {y_t^{N+1}, . . . , y_t^{2N}} be the independent position measurements from the deformation likelihood. The only difference now is that C_t^i is multiplied with a scalar, so the same argument holds. P_{t|t} = P_t^{2N} is thus block diagonal.
Now consider the time update in (A.17). Using the model in (6.11), it is easy to verify that L_t = (L_t^1, 0_{2×2N})^T for some 2×2 matrix L_t^1. This implies that the last term in (A.17c) is only non-zero in the first diagonal block. The first two terms in (A.17c) are also clearly block diagonal, since Q_t^l in (6.12) is 2×2-block diagonal. This implies that P_{t+1|t} is 2×2-block diagonal and that the initial statement is valid for t+1. This proves by induction that (B.9) is valid for all t.
B.2.2 Derivation of the Weight Update
To simplify notation, the likelihood function in (B.10) is defined.

L_t^j(z, s) = p(θ_t^j | z, s) \, p(ϕ_t^j | z)    (B.10)

The measurement model from section 6.3 can then be written as in (B.11).

p(y_t | x_t) = L_t^0(z_t^0, s_t) \prod_{j=1}^N N( a^j; \frac{z_t^j}{s_t}, D^j ) L_t^j(z_t^0 + z_t^j, s_t)    (B.11)

Let l be the number of linear states, in our case l = 2(N + 1). It follows that

p(y_t | X_t^n, Y_{t-1}) = \int_{R^l} p(y_t, x_t^l | X_t^n, Y_{t-1}) \, dx_t^l
= \int_{R^l} p(y_t | x_t^l, X_t^n, Y_{t-1}) \, p(x_t^l | X_t^n, Y_{t-1}) \, dx_t^l
= \int_{R^l} p(y_t | x_t) \, p(x_t^l | X_t^n, Y_{t-1}) \, dx_t^l    (B.12)
The last step follows from the Markov property of the model. Using (B.9) and (B.11) in (B.12) gives

p(y_t | X_t^n, Y_{t-1}) = L_t^0(z_t^0, s_t) \prod_{j=1}^N \int_{R^2} N( z_t^j; ẑ_{t|t-1}^j, P_{t|t-1}^{z_j} ) N( a^j; \frac{z_t^j}{s_t}, D^j ) L_t^j(z_t^0 + z_t^j, s_t) \, dz_t^j    (B.13)
Notice that the linear state v_t has been marginalized away. We will now consider the product of the two Gaussian functions inside the integral. The indices are skipped to simplify the notation.

N(z; ẑ, P) \, N( a; \frac{z}{s}, D ) = \frac{1}{(2π)^2 \sqrt{\det(P) \det(D)}} \exp( -\frac{1}{2} V ),
where V := (z - ẑ)^T P^{-1} (z - ẑ) + ( a - \frac{z}{s} )^T D^{-1} ( a - \frac{z}{s} )    (B.14)
We define a_s = s a and D_s = s² D to simplify the equations. The exponent V can then be written as follows.

V = (z - ẑ)^T P^{-1} (z - ẑ) + (a_s - z)^T D_s^{-1} (a_s - z)
  = z^T (P^{-1} + D_s^{-1}) z - 2 z^T (P^{-1} ẑ + D_s^{-1} a_s) + ẑ^T P^{-1} ẑ + a_s^T D_s^{-1} a_s    (B.15)

We define the quantities in (B.16).

H = (P^{-1} + D_s^{-1})^{-1}    (B.16a)
µ = H (P^{-1} ẑ + D_s^{-1} a_s)    (B.16b)

It is easy to check that (B.15) can be rewritten as (B.17).

V = V_1 + V_2,  where  V_1 := (z - µ)^T H^{-1} (z - µ)  and  V_2 := ẑ^T P^{-1} ẑ + a_s^T D_s^{-1} a_s - µ^T H^{-1} µ    (B.17)
V_2 is the part of V that is independent of z.

V_2 = ẑ^T P^{-1} ẑ + a_s^T D_s^{-1} a_s - (P^{-1} ẑ + D_s^{-1} a_s)^T H (P^{-1} ẑ + D_s^{-1} a_s)
    = ẑ^T (P^{-1} - P^{-1} H P^{-1}) ẑ + a_s^T (D_s^{-1} - D_s^{-1} H D_s^{-1}) a_s - 2 a_s^T D_s^{-1} H P^{-1} ẑ    (B.18)

The matrix inversion lemma is given in (B.19), where A and C are invertible matrices.

(A - BCD)^{-1} = A^{-1} + A^{-1} B (C^{-1} - D A^{-1} B)^{-1} D A^{-1}    (B.19)
Using this lemma, we get

(P^{-1} - P^{-1} H P^{-1})^{-1} = P + (H^{-1} - P^{-1})^{-1} = P + (P^{-1} + D_s^{-1} - P^{-1})^{-1} = P + D_s    (B.20)

and similarly

(D_s^{-1} - D_s^{-1} H D_s^{-1})^{-1} = P + D_s.    (B.21)

Additionally we have that

D_s^{-1} H P^{-1} = ( P (P^{-1} + D_s^{-1}) D_s )^{-1} = (P + D_s)^{-1}.    (B.22)

Equation B.18 can thus be simplified as

V_2 = (ẑ - a_s)^T (P + D_s)^{-1} (ẑ - a_s) = ( \frac{ẑ}{s} - a )^T ( \frac{1}{s^2} P + D )^{-1} ( \frac{ẑ}{s} - a ).    (B.23)

Finally we note that (B.22) implies (B.24), where we have used properties of the matrix determinant.

\frac{\det(H)}{s^4 \det(P) \det(D)} = \frac{1}{\det(P + D_s)}    (B.24)
Using (B.17), (B.23) and (B.24) in (B.14) gives

N(z; ẑ, P) \, N( a; \frac{z}{s}, D ) = N(z; µ, H) \, N( \frac{ẑ}{s}; a, \frac{1}{s^2} P + D )    (B.25)

Using this result in (B.13) and moving the factor that is independent of z_t^j out of the integral gives (6.13), with the definitions in (6.14) and (6.15).
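Since this kind of Gaussian algebra is easy to get wrong, the identity (B.25) can be spot-checked numerically. The snippet below (purely illustrative, using SciPy's multivariate normal) compares both sides at random values and should print True.

import numpy as np
from numpy.linalg import inv
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)

def random_cov():
    M = rng.normal(size=(2, 2))
    return M @ M.T + np.eye(2)

P, D = random_cov(), random_cov()
z, z_hat, a = rng.normal(size=2), rng.normal(size=2), rng.normal(size=2)
s = 1.7
Ds = s**2 * D
H = inv(inv(P) + inv(Ds))                         # (B.16a)
mu = H @ (inv(P) @ z_hat + inv(Ds) @ (s * a))     # (B.16b)

lhs = mvn.pdf(z, z_hat, P) * mvn.pdf(a, z / s, D)
rhs = mvn.pdf(z, mu, H) * mvn.pdf(z_hat / s, a, P / s**2 + D)
print(np.isclose(lhs, rhs))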
Bibliography
[1] Amit Adam, Ehud Rivlin, and Ilan Shimshoni. Robust fragments-based tracking using the integral histogram. In CVPR, 2006. Cited on page 40.
[2] Amir Roshan Zamir, Afshin Dehghan, and Mubarak Shah. GMCP-Tracker: Global multi-object tracking using generalized minimum clique graphs. In Proceedings of the European Conference on Computer Vision (ECCV), 2012. Cited on pages 50 and 61.
[3] B. Babenko, Ming-Hsuan Yang, and S. Belongie. Visual Tracking with Online Multiple Instance Learning. In CVPR, 2009. Cited on page 40.
[4] Chenglong Bao, Yi Wu, Haibin Ling, and Hui Ji. Real time robust l1 tracker using
accelerated proximal gradient approach. In CVPR, 2012. Cited on page 40.
[5] Ben Benfold and Ian Reid. Stable multi-target tracking in real-time surveillance
video. In CVPR, pages 3457–3464, June 2011. Cited on pages 64 and 75.
[6] Brent Berlin and Paul Kay. Basic Color Terms: Their Universality and Evolution.
UC Press, Berkeley, CA, 1969. Cited on page 19.
[7] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information
Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
ISBN 0387310738. Cited on pages 3, 4, 11, and 58.
[8] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Yui M. Lui. Visual object tracking using adaptive correlation filters. In Computer Vision and Pattern Recognition
(CVPR), 2010. Cited on pages 9, 10, 14, 16, and 29.
[9] Michael D. Breitenstein, Fabian Reichlin, Bastian Leibe, Esther Koller-Meier, and
Luc Van Gool. Robust tracking-by-detection using a detector confidence particle
filter. In IEEE International Conference on Computer Vision, October 2009. Cited
on pages 50, 52, and 61.
[10] Robert T. Collins, Yanxi Liu, and Marius Leordeanu. Online selection of discriminative tracking features. PAMI, 27(10):1631–1643, 2005. Cited on page 17.
[11] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Kernel-based object tracking. PAMI, 25(5):564–575, 2003. Cited on page 17.
[12] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Cordelia Schmid, Stefano Soatto, and Carlo Tomasi, editors, International
Conference on Computer Vision & Pattern Recognition, volume 2, pages 886–893,
INRIA Rhône-Alpes, ZIRST-655, av. de l’Europe, Montbonnot-38334, June 2005.
URL http://lear.inrialpes.fr/pubs/2005/DT05. Cited on pages 57,
60, and 61.
[13] Thang Ba Dinh, Nam Vo, and Gerard Medioni. Context tracker: Exploring supporters and distracters in unconstrained environments. In CVPR, 2011. Cited on pages
17 and 40.
[14] Michael Felsberg. Enhanced distribution field tracking using channel representations. In ICCV Workshop, 2013. Cited on page 40.
[15] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 32(9):1627–1645, 2010. Cited on pages 5, 6, 57,
58, 60, and 78.
[16] Claude Gasquet and Patrick Witomski. Fourier Analysis and Applications: Filtering,
Numerical Computation, Wavelets. Texts in Applied Mathematics. Springer-Verlag
New York Inc., 1999. ISBN 0-387-98485-2. Cited on pages 4, 93, and 95.
[17] T. Gevers and A. W. M. Smeulders. Color based object recognition. Pattern Recognition, 32:453–464, 1999. Cited on page 18.
[18] Fredrik Gustafsson. Statistical Sensor Fusion. Studentlitteratur, second edition,
2012. ISBN 978-91-44-07732-1. Cited on pages 50, 85, and 87.
[19] Sam Hare, Amir Saffari, and Philip H. S. Torr. Struck: Structured output tracking
with kernels. In International Conference on Computer Vision (ICCV), 2011. Cited
on pages 17 and 40.
[20] Shengfeng He, Qingxiong Yang, Rynson Lau, Jiang Wang, and Ming-Hsuan Yang.
Visual tracking via locality sensitive histograms. In CVPR, 2013. Cited on page 40.
[21] J.F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In Proceedings of the European Conference on Computer Vision (ECCV), 2012. Cited on pages 3, 5, 9, 12, 13, 14, 15, 16,
28, 29, 30, 32, and 40.
[22] Hamid Izadinia, Imran Saleemi, Wenhui Li, and Mubarak Shah. (mp)2t: Multiple
people multiple parts tracker. In Proceedings of the European Conference on Computer Vision (ECCV), volume 7577 of Lecture Notes in Computer Science, pages
100–114. Springer, 2012. Cited on pages 50, 52, and 61.
[23] Xu Jia, Huchuan Lu, and Ming-Hsuan Yang. Visual tracking via adaptive structural
local sparse appearance model. In CVPR, 2012. Cited on page 40.
[24] Zdenek Kalal, Krystian Mikolajczyk, and Jiri Matas. Tracking-learning-detection.
IEEE Trans. Pattern Analysis Machine Intelligence, 34(7):1409–1422, 2012. Cited
on pages 17, 40, and 68.
[25] Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost van de Weijer, Andrew Bagdanov, Maria Vanrell, and Antonio Lopez. Color attributes for object detection. In
CVPR, 2012. Cited on pages 5, 17, and 78.
[26] Fahad Shahbaz Khan, Joost van de Weijer, and Maria Vanrell. Modulating shape
features by color attention for object recognition. IJCV, 98(1):49–64, 2012. Cited
on pages 5 and 17.
[27] Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost van de Weijer, Andrew Bagdanov, Antonio Lopez, and Michael Felsberg. Coloring action recognition in still
images. IJCV, 105(3):205–221, 2013. Cited on page 17.
[28] Erwin Kreyszig. Introductory Functional Analysis with Applications. Wiley Classics Library. John Wiley & Sons, Inc., 1989. ISBN 978-0-471-50459-7. Cited on
page 3.
[29] Fredrik Lindsten. Rao-Blackwellised particle methods for inference and identification. Licentiate thesis, Linköping University, 2011. Cited on page 90.
[30] Alfred Mertins. Signal Analysis : Wavelets, Filter Banks, Time-Frequency Transforms, and Applications. John Wiley & Sons, 1999. ISBN 0-471-98626-7. Cited on
pages 21 and 22.
[31] Katja Nummiaro, Esther Koller-Meier, and Luc J. Van Gool. An adaptive colorbased particle filter. IVC, 21(1):99–110, 2003. Cited on page 17.
[32] Shaul Oron, Aharon Bar-Hillel, Dan Levi, and Shai Avidan. Locally orderless tracking. In CVPR, 2012. Cited on pages 17 and 40.
[33] M. Pedersoli, A. Vedaldi, and J. Gonzalez. A coarse-to-fine approach for fast deformable object detection. In IEEE Conference on Computer Vision and Pattern
Recognition, 2011. Cited on page 78.
[34] Marco Pedersoli, Jordi Gonzalez, Xu Hu, and Xavier Roca. Toward real-time pedestrian detection based on a deformable template model. IEEE Transactions on Intelligent Transportation Systems, 2013. Cited on page 78.
[35] Patrick Perez, Carine Hue, Jaco Vermaak, and Michel Gangnet. Color-based probabilistic tracking. In ECCV, 2002. Cited on pages 17 and 40.
[36] David Ross, Jongwoo Lim, Ruei-Sung Lin, and Ming-Hsuan Yang. Incremental
learning for robust visual tracking. IJCV, 77(1):125–141, 2008. Cited on page 40.
[37] Thomas Schön, Fredrik Gustafsson, and Per-Johan Nordlund. Marginalized particle filters for mixed linear nonlinear state-space models. IEEE Trans. on Signal
Processing, 53:2279–2289, 2005. Cited on pages 6, 88, 90, and 96.
[38] Laura Sevilla-Lara and Erik Learned-Miller. Distribution fields for tracking. In IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2012. Cited on
pages 17, 28, and 40.
[39] Guang Shu, Afshin Dehghan, Omar Oreifej, Emily Hand, and Mubarak Shah. Partbased multiple-person tracking with partial occlusion handling. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2012. Cited on pages 61
and 78.
[40] Severin Stalder, Helmut Grabner, and Luc van Gool. Beyond semi-supervised tracking: Tracking should be as simple as detection, but not simpler than recognition. In
ICCV Workshop, 2009. Cited on page 40.
[41] K. van de Sande, Theo Gevers, and Cees G. M. Snoek. Evaluating color descriptors
for object and scene recognition. PAMI, 32(9):1582–1596, 2010. Cited on pages 5,
17, and 18.
[42] J. van de Weijer and C. Schmid. Coloring local feature extraction. In ECCV, 2006.
Cited on pages 5, 17, and 18.
[43] J. van de Weijer, C. Schmid, Jakob J. Verbeek, and D. Larlus. Learning color names
for real-world applications. TIP, 18(7):1512–1524, 2009. Cited on pages 6 and 19.
[44] Dong Wang, Huchuan Lu, and Ming-Hsuan Yang. Least soft-threshold squares
tracking. In CVPR, 2013. Cited on page 40.
[45] Yi Wu, Bin Shen, and Haibin Ling. Online robust image alignment via iterative
convex optimization. In CVPR, 2012. Cited on page 40.
[46] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark.
In CVPR, 2013. Cited on pages 5, 27, 28, 29, and 40.
[47] Bo Yang and Ram Nevatia. Online learned discriminative part-based appearance
models for multi-human tracking. In Proceedings of the European Conference on
Computer Vision (ECCV), 2012. Cited on page 72.
[48] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing
Surveys, 32(13), 2006. Cited on page 1.
[49] Jun Zhang, Youssef Barhomi, and Thomas Serre. A new biologically inspired color
image descriptor. In ECCV, 2012. Cited on pages 5, 17, and 18.
[50] Kaihua Zhang, Lei Zhang, and Ming-Hsuan Yang. Real-time compressive tracking.
In Proceedings of the European Conference on Computer Vision (ECCV), 2012.
Cited on pages 17 and 40.
[51] Tianzhu Zhang, Bernard Ghanem, Si Liu, and Narendra Ahuja. Robust visual tracking via multi-task sparse learning. In CVPR, 2012. Cited on page 40.
[52] Wei Zhong, Huchuan Lu, and Ming-Hsuan Yang. Robust object tracking via sparsitybased collaborative model. In CVPR, 2012. Cited on page 40.
Copyright
The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring exceptional circumstances.
The online availability of the document implies a permanent permission for anyone to
read, to download, to print out single copies for his/her own use and to use it unchanged
for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on
the consent of the copyright owner. The publisher has taken technical and administrative
measures to assure authenticity, security and accessibility.
According to intellectual property law the author has the right to be mentioned when
his/her work is accessed as described above and to be protected against infringement.
For additional information about the Linköping University Electronic Press and its
procedures for publication and for assurance of document integrity, please refer to its
www home page: http://www.ep.liu.se/
© Martin Danelljan