Institutionen för systemteknik
Department of Electrical Engineering
Linköpings universitet, SE-581 83 Linköping, Sweden

Examensarbete (Master's thesis)

Visual Tracking
(Swedish title: Visuell följning)

Martin Danelljan

LiTH-ISY-EX--13/4736--SE
Linköping, December 12, 2013

Supervisor (Handledare): Fahad Khan, ISY, Linköpings universitet
Examiner (Examinator): Michael Felsberg, ISY, Linköpings universitet
Division: Computer Vision Laboratory, Department of Electrical Engineering

Abstract

Visual tracking is a classical computer vision problem with many important applications in areas such as robotics, surveillance and driver assistance. The task is to follow a target in an image sequence. The target can be any object of interest, for example a human, a car or a football. Humans perform accurate visual tracking with little effort, while it remains a difficult computer vision problem. It poses major challenges, such as appearance changes, occlusions and background clutter. Visual tracking is thus an open research topic, but significant progress has been made in the last few years. The first part of this thesis explores generic tracking, where nothing is known about the target except for its initial location in the sequence.
A specific family of generic trackers that exploit the FFT for faster tracking-by-detection is studied. Among these, the CSK tracker has recently been shown to obtain competitive performance at extraordinarily low computational cost. Three contributions are made to this type of tracker. Firstly, a new method for learning the target appearance is proposed and shown to outperform the original method. Secondly, different color descriptors are investigated for tracking purposes. Evaluations show that the best descriptor greatly improves the tracking performance. Thirdly, an adaptive dimensionality reduction technique is proposed, which adaptively chooses the most important feature combinations to use. This technique significantly reduces the computational cost of the tracking task. Extensive evaluations show that the proposed tracker outperforms state-of-the-art methods in the literature, while operating at a several times higher frame rate.

In the second part of this thesis, the proposed generic tracking method is applied to human tracking in surveillance applications. A causal framework is constructed that automatically detects and tracks humans in the scene. The system fuses information from generic tracking and state-of-the-art object detection in a Bayesian filtering framework. In addition, the system incorporates the identification and tracking of specific human parts to achieve better robustness and performance. Tracking results are demonstrated on a real-world benchmark sequence.

Keywords (Nyckelord): Tracking, Computer Vision, Person Tracking, Object Detection, Deformable Parts Model, Rao-Blackwellized Particle Filter, Color Names
Acknowledgments

I want to thank my supervisor Fahad Khan and examiner Michael Felsberg. Further, I want to thank everyone who has contributed with constructive comments and discussions. I thank Klas Nordberg for discussing various parts of the theory with me. I thank Zoran Sjanic for the long and late discussions about Bayesian filtering methods. Finally, I thank Giulia Meneghetti for helping me set up all the computers I needed.

Linköping, January 2014
Martin Danelljan

Contents

Notation

1 Introduction
1.1 A Brief Introduction to Visual Tracking
1.1.1 Introducing Circulant Tracking by Detection with Kernels
1.2 Thesis Overview
1.2.1 Problem Formulation
1.2.2 Motivation
1.2.3 Approaches and Results
1.3 Thesis Outline

I Generic Tracking

2 Circulant Tracking by Detection
2.1 The MOSSE Tracker
2.1.1 Detection
2.1.2 Training
2.2 The CSK Tracker
2.2.1 Training with a Single Image
2.2.2 Detection
2.2.3 Multidimensional Feature Maps
2.2.4 Kernel Functions
2.3 Robust Appearance Learning
2.4 Details
2.4.1 Parameters
2.4.2 Windowing
2.4.3 Feature Value Normalization

3 Color Features for Tracking
3.1 Evaluated Color Features
3.1.1 Color Names
3.2 Incorporating Color into Tracking
3.2.1 Color Feature Normalization

4 Adaptive Dimensionality Reduction
4.1 Principal Component Analysis
4.2 The Theory Behind the Proposed Approach
4.2.1 The Data Term
4.2.2 The Smoothness Term
4.2.3 The Total Cost Function
4.3 Details of the Proposed Approach

5 Evaluation
5.1 Evaluation Methodology
5.1.1 Evaluation Metrics
5.1.2 Dataset
5.1.3 Trackers and Parameters
5.2 Circulant Structure Trackers Evaluation
5.2.1 Grayscale Experiment
5.2.2 Grayscale and Color Names Experiment
5.2.3 Experiments with Other Color Features
5.3 Color Evaluation
5.3.1 Results
5.3.2 Discussion
5.4 Adaptive Dimensionality Reduction Evaluation
5.4.1 Number of Feature Dimensions
5.4.2 Final Performance
5.5 State-of-the-Art Evaluation
5.6 Conclusions and Future Work

II Category Object Tracking

6 Tracking Model
6.1 System Overview
6.2 Object Model
6.2.1 Object Motion Model
6.2.2 Part Deformations and Motion
6.2.3 The Complete Transition Model
6.3 The Measurement Model
6.3.1 The Image Likelihood
6.3.2 The Model Likelihood
6.4 Applying the Rao-Blackwellized Particle Filter to the Model
6.4.1 The Transition Model
6.4.2 The Measurement Update for the Non-Linear States
6.4.3 The Measurement Update for the Linear States

7 Object Detection
7.1 Object Detection with Discriminatively Trained Part Based Models
7.1.1 Histogram of Oriented Gradients
7.1.2 Detection with Deformable Part Models
7.1.3 Training the Detector
7.2 Object Detection in Tracking
7.2.1 Ways of Exploiting Object Detections in Tracking
7.2.2 Converting Detection Scores to Likelihoods
7.2.3 Converting Deformation Costs to Probabilities

8 Details
8.1 The Appearance Likelihood
8.1.1 Motivation
8.1.2 Integration of the RCSK Tracker
8.2 Rao-Blackwellized Particle Filtering
8.2.1 Parameters and Initialization
8.2.2 The Particle Filter Measurement Update
8.2.3 The Kalman Filter Measurement Update
8.2.4 Estimation
8.3 Further Details
8.3.1 Adding and Removing Objects
8.3.2 Occlusion Detection and Handling

9 Results, Discussion and Conclusions
9.1 Results
9.2 Discussion and Future Work
9.3 Conclusions

A Bayesian Filtering
A.1 The General Case
A.1.1 General Bayesian Solution
A.1.2 Estimation
A.2 The Kalman Filter
A.2.1 Algorithm
A.2.2 Iterated Measurement Update
A.3 The Particle Filter
A.3.1 Algorithm
A.3.2 Estimation
A.4 The Rao-Blackwellized Particle Filter
A.4.1 Algorithm
A.4.2 Estimation

B Proofs and Derivations
B.1 Derivation of the RCSK Tracker Algorithm
B.1.1 Kernel Function Proofs
B.1.2 Derivation of the Robust Appearance Learning Scheme
B.2 Proof of Equation 6.13
B.2.1 Proof of Uncorrelated Parts
B.2.2 Derivation of the Weight Update

Bibliography

Notation

Sets:
Z : the set of integers
R : the set of real numbers
C : the set of complex numbers
ℓ_p(M, N) : the set of all functions f : Z × Z → R with period (M, N)
ℓ_p^D(M, N) : the set of all functions f : Z × Z → R^D with period (M, N)

Functions and operators:
⟨·, ·⟩ : inner product
‖·‖ : L2-norm
|·| : absolute value or cardinality
∗ : convolution
⋆ : correlation
F : the Discrete Fourier Transform on ℓ_p(M, N)
F⁻¹ : the Inverse Discrete Fourier Transform on ℓ_p(M, N)
τ_{m,n} : the shift operator, (τ_{m,n} f)(k, l) = f(k − m, l − n)
κ(f, g) : kernel on some space X, with f, g ∈ X
x̄ : complex conjugate of x ∈ C
p(x) : probability of x
N(µ, C) : Gaussian distribution with mean µ and covariance matrix C

1 Introduction

Visual tracking can be defined as the problem of estimating the trajectory of one or multiple objects in an image sequence. It is an important and classical computer vision problem, which has received much research attention over the last decades. Visual tracking has many important applications. It often acts as a part in higher-level systems, e.g. automated surveillance, gesture recognition and robotic navigation. It is generally a challenging problem for numerous reasons, including fast motion, background clutter, occlusions and appearance changes. This thesis explores various aspects of visual tracking. The rest of this chapter is organized as follows. Section 1.1 gives a brief introduction to the visual tracking field.
Section 1.2 contains the thesis problem formulation and motivation, together with a brief overview of the contributions and results. Section 1.3 describes the outline of the rest of the thesis.

1.1 A Brief Introduction to Visual Tracking

A survey of visual tracking methods lies far outside the scope of this document. The visual tracking field is extremely diverse, and nothing close to a unified theory exists. This section introduces some concepts in visual tracking that are related to the work in this thesis. See [48] for a more complete (but rather outdated) survey.

Visual tracking methods largely depend on the application and the amount of available prior information. There are for example applications where the camera and background are known to be static. In such cases one can employ background subtraction techniques to detect moving targets. In many cases the problem is to track certain kinds or classes of objects, e.g. humans or cars. The general appearance of objects in the class can thus be used as prior information. This type of tracking problem is sometimes referred to as category tracking, as the task is to track a category of objects. Many visual tracking methods deal with the case where only the initial target position in a sequence is known. This case is often called generic tracking.

[Figure 1.1: Visual tracking of humans in the Town Centre sequence with the proposed framework for category tracking. The bounding boxes mark humans that are tracked in the current frame. The colored lines show the trajectories up to the current frame.]

Visual tracking of an object can in its simplest form be divided into two parts:

1. Modelling.
2. Tracking.

The first step constructs a model of the object. This model can include various degrees of a priori information, and it can be updated in each new frame with new information about the object. The model should include some representation of the object appearance.
The appearance can for example be modelled using the object shape, as templates, histograms or by parametric representations of distributions. An often essential part of the modelling is the choice of image features. Popular choices are intensity, color and edge features. Models can also include information about the expected object motion.

The second step deals with how to use the model to find the position of the object in the next frame. Methods applied in this step are highly dependent on the used model. A very simple example of a model is to use the pixel values of a rectangular image area around the object. A simple way of tracking with this model is to correlate the new frame with this image patch and find the maximum response.

A popular trend in how to solve this modelling/tracking problem is to apply tracking by detection. This is a quite loose categorization of a kind of visual tracking algorithm. Tracking by detection essentially means that some kind of machine learning technique is used to train a discriminative classifier of the object appearance. The tracking step is then done by classifying image regions to find the most probable location of the object.

The second part of this thesis investigates automatic category tracking. This problem includes automatic detection and tracking of all objects in the scene that are of a certain class. Such tracking is visualized in figure 1.1. This problem contains additional challenges compared to generic tracking. However, there is known information about the object class that can be exploited to achieve more robust visual tracking. The question is then how to fuse this information into a visual tracking framework.

Visual tracking techniques can also be categorized by which object properties or states are estimated. A visual tracker should be able to track the location of the object in the image.
The location is often described with a two-dimensional image coordinate, i.e. two states. However, more general transformations than just pure translations can be considered. Many trackers attempt to also track the scale (i.e. size) of the object in the image. This adds another one or two states to be estimated, depending on whether uniform or non-uniform scale is used. The orientation of the object can also be added as a state to be estimated. Even more general image transformations such as affine transformations or homographies can be considered. However, in general a state can be any property of interest, e.g. complex shape, locations of object parts and appearance modes.

1.1.1 Introducing Circulant Tracking by Detection with Kernels

This section gives a brief introduction to the CSK tracker [21], which is of importance to this thesis. It is studied in detail in chapter 2, but is introduced here since it is included in the problem formulation of this thesis.

As mentioned earlier, methods that apply tracking by detection are becoming increasingly popular. Most of these use some kind of sparse sampling strategy to harvest training examples for a discriminative classifier. Processing each sample independently requires much computational effort when it comes to feature extraction, training and detection. It is clear that a set of samples contains redundant information if they are sampled with an overlap. The CSK tracker exploits this redundancy for much faster computation.

The CSK tracker relies on a kernelized least squares classifier [7]. The task is to classify patches of an image region as the object or background. For simplicity, one-dimensional signals are considered here. Let z be a sample from such a signal, i.e. z ∈ R^M. Let φ : R^M → H be a mapping from R^M to some Hilbert space H [28]. Let f be a linear classifier in H given by (1.1), where v ∈ H.

f(z) = ⟨φ(z), v⟩   (1.1)

The classifier is trained using a sample x ∈ R^M from the region around the object.
The set of training examples consists of all cyclic shifts x_m, m ∈ {0, …, M − 1}, of x. The classifier is derived using regularized least squares, which means minimizing the loss function (1.2), where y ∈ R^M contains the label for each example and λ is a regularization parameter.

ε = Σ_{m=0}^{M−1} |f(x_m) − y_m|² + λ‖v‖²   (1.2)

The v that minimizes (1.2) can be written as a linear combination of the mapped training examples, v = Σ_k α_k φ(x_k). By using the kernel trick, a closed-form solution can be derived. This is given by (1.3). See [7] for a complete derivation.

α = (K + λI)⁻¹ y   (1.3)

K is the kernel matrix, with the elements K_ij = κ(x_i, x_j). κ is the kernel function that defines the inner product in H, κ(x, z) = ⟨φ(x), φ(z)⟩, ∀x, z ∈ R^M. The classification of an example z ∈ R^M is done using (1.4).

ŷ = Σ_{m=0}^{M−1} α_m κ(z, x_m)   (1.4)

The result generalizes to images. Let x be a grayscale image patch of the target. Define the kernelized correlation as u_x(m, n) = κ(x_{m,n}, x), where x_{m,n} are cyclic shifts of x. Again, y(m, n) are the corresponding labels. Let capital letters denote the Discrete Fourier Transform (DFT) [16] of the respective two-dimensional signals. It can be shown that the Fourier-transformed coefficients A of the kernelized least squares classifier can be calculated using (1.5), if the classifier is trained on the single image patch x.

A = Y / (U_x + λ)   (1.5)

The classification of all cyclic shifts of a grayscale image patch z can be written as a convolution, which is a product in the Fourier domain. The classification score ŷ of the image region z is computed with (1.6), where we have defined u_z(m, n) = κ(z_{m,n}, x).

ŷ = F⁻¹{A U_z}   (1.6)

A notable feature of this tracker is that most computations can be done using the Fast Fourier Transform (FFT) [16]. Thereby an exceptionally low computational cost is obtained compared to most other visual trackers.

1.2 Thesis Overview

This section contains an overview of this thesis.
The problem formulation is stated in section 1.2.1. Section 1.2.2 describes the motivation behind the thesis. Section 1.2.3 briefly presents the approaches, contributions and results.

1.2.1 Problem Formulation

The goal of this master's thesis is to research the area of visual tracking, with focus on generic tracking and automatic category tracking. This shall be done through the following scheme.

1. Study the CSK tracker [21] and investigate how it can be improved. The goal is specifically to improve it with color features and new appearance learning methods.

2. Use the acquired knowledge to build a framework for causal automatic category tracking of humans in surveillance scenes. The main goal is to investigate how deformable parts models can be used in such a framework, to be able to track defined parts of a human along with the human itself.

1.2.2 Motivation

In the recent benchmark evaluation of generic visual trackers by Wu et al. [46], the CSK tracker was shown to provide the highest speed among the top 10 trackers. Because of the simplicity of this approach, it holds the potential of being improved even further. The goal is to achieve state-of-the-art tracking performance at faster than real-time frame rates. Such a tracker has many interesting applications, for example in robotics, where the computational cost often is a major limiting factor. Another interesting example is scenarios where it is desired to track a large number of targets in real time, for example in automated surveillance of crowded scenes.

Many generic trackers, including the CSK, rely only on image intensity information and thus discard all color information present in the images. Although the use of color information has proven successful in related computer vision areas, such as object detection and recognition [41, 26, 49, 42, 25], it has not yet been thoroughly investigated for tracking purposes.
Changing the feature representation in a generic tracking framework often requires modifications of other parts of the framework as well. It is therefore necessary to look at the whole framework to avoid suboptimal ad hoc solutions.

In many major applications of visual tracking, the task is to track certain classes of objects, often humans. Safety systems in cars and automated surveillance are examples of such applications. Many existing category tracking frameworks in the literature use pure data association of object detections, thus discarding most of the temporal information. Many recent frameworks also use global optimization over time windows in the sequence, thus disregarding the causality requirement that exists in most practical applications. These are the motivations behind creating an automatic category tracking framework that is causal and thoroughly exploits the temporal dimension to achieve robust and high-precision tracking.

Recently the deformable parts model detector [15] has been used in category tracking of humans. However, this has not yet been attempted in a causal framework. By jointly tracking an object and a set of parts, a more correct deformable model can be applied, which may increase accuracy and robustness. This might especially increase the robustness to partial occlusions, which are a common problem. Furthermore, part locations and trajectories are of interest in action detection and recognition, which are computer vision topics with related applications.

1.2.3 Approaches and Results

Three contributions are made to the CSK tracker. Firstly, an appearance learning method is derived, which significantly improves the robustness and performance of this tracker. In evaluations the proposed method, named RCSK, is shown to outperform the original one when using multidimensional feature maps (e.g. color). Secondly, an extensive evaluation of different color representations was done.
This evaluation shows that color improves the tracking performance significantly and that the Color Names descriptor [43] is the best choice. Thirdly, an adaptive dimensionality reduction technique is proposed to reduce the feature dimensionality, thereby achieving a significant speed boost with a negligible effect on the tracking performance. This technique adaptively chooses the most important combinations of features. Comprehensive evaluations are done to validate the performance gains of the proposed improvements. These include a comparison between a large number of different color representations for tracking. Lastly, the proposed generic visual tracking method is compared to existing state-of-the-art methods in the literature in an extensive evaluation. The proposed method is shown to outperform the existing methods, while operating at many times higher frame rates.

The second part of the thesis deals with the second goal in the problem formulation. A category tracking framework was built that combines generic tracking with object detection in a causal probabilistic framework with deformable part models. Specifically, the derived RCSK tracker was used in combination with the deformable parts model detector [15]. The Rao-Blackwellized Particle Filter [37] was used in the filtering step to achieve scalability in the number of parts in the model. The framework was applied to automatic tracking of multiple humans in surveillance scenes. The tracking results are demonstrated on a real-world benchmark sequence. Figure 1.1 illustrates the output of the proposed tracking framework.

1.3 Thesis Outline

This thesis report is organized into two parts. The first part is dedicated to generic tracking. Chapter 2 discusses the family of circulant structure trackers, including the CSK tracker introduced in section 1.1.1. The proposed appearance learning scheme for these trackers is derived in section 2.3.
Chapter 3 discusses different color features for tracking and how the proposed tracker is extended with color information. In chapter 4, the proposed adaptive dimensionality reduction technique is derived and integrated into the tracking framework. The evaluations, results and conclusions from the first part of the thesis are presented in chapter 5. This includes the extensive comparison with state-of-the-art methods from the literature.

The second part of this report considers the category tracking problem. Chapter 6 gives an overview of the system and presents the model on which it is built. Chapter 7 describes in detail how the DPM object detector is used. Additional details are discussed in chapter 8, including how the generic tracking method derived in the first part of this thesis is incorporated. Finally, the results are discussed in chapter 9.

The appendix contains two parts. Appendix A summarizes the Bayesian filtering problem and most importantly describes the RBPF. Appendix B contains mathematical proofs and derivations of the most important results.

Part I: Generic Tracking

2 Circulant Tracking by Detection

A standard correlation filter is a simple and straightforward visual tracking approach. Much research over the last decades has aimed at producing more robust filters. Most recently the Minimum Output Sum of Squared Error (MOSSE) filter [8] was proposed. It performs comparably to state-of-the-art trackers, but at hundreds of frames per second. In [21], this approach was formulated as a tracking-by-detection problem and kernels were introduced into the framework. The resulting CSK tracker was presented briefly in section 1.1.1. This chapter starts with a detailed presentation of the MOSSE and CSK trackers. In section 2.3 a new learning scheme for these kinds of trackers is proposed.

2.1 The MOSSE Tracker

The key to fast correlation filters is to avoid computing the correlations in the spatial domain, and instead exploit the O(P ln P) complexity of the FFT.
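The speed gain the FFT buys can be made concrete with a small sketch (a NumPy illustration written for this discussion, not code from the thesis): circular correlation computed directly in the spatial domain costs O(P²) in total, while the same result via the DFT correlation property costs O(P ln P). Note that the FFT route implicitly treats the signals as periodic.

```python
import numpy as np

# Circular correlation: direct spatial-domain evaluation versus the FFT.
# The FFT version is faster but assumes periodic extension of the signals.
rng = np.random.default_rng(0)
P = 64
h = rng.standard_normal(P)   # a 1-D "filter" (here simply random values)
z = np.roll(h, 13)           # a signal containing h, cyclically shifted by 13

# direct circular correlation: y(m) = sum_k h(k) z(k + m mod P)
y_direct = np.array([(h * np.roll(z, -m)).sum() for m in range(P)])

# the same result via the DFT correlation property
H, Z = np.fft.fft(h), np.fft.fft(z)
y_fft = np.real(np.fft.ifft(np.conj(H) * Z))

assert np.allclose(y_direct, y_fft)
# the correlation peak recovers the shift, i.e. the estimated position
assert int(np.argmax(y_fft)) == 13
```

The final assertion is exactly the detection principle used throughout this chapter: the location of the maximum correlation output is taken as the target position estimate.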
However, this assumes a periodic extension of the local image patch. Obviously this assumption is a very harsh approximation of reality. However, since the background is of much lesser importance, this approximation can be seen as valid if the tracked object is centered enough in the local image patch.

2.1.1 Detection

The goal in visual tracking is to find the location of the object in each new frame. Initially only monochromatic images are considered, or more generally two-dimensional, discrete and scalar-valued signals, i.e. functions Z × Z → R. To avoid special notation for circular convolution and correlation, it is always assumed that a signal is extended periodically. The set of all periodic functions f : Z × Z → R with period M in the first argument and period N in the second argument is denoted ℓ_p(M, N). The periodicity means that f(m + M, n + N) = f(m, n), ∀m, n ∈ Z.

Let z ∈ ℓ_p(M, N) be the periodic extension of an image patch of size M × N. h ∈ ℓ_p(M, N) is a correlation filter that has been trained on the appearance of a specific object. The correlation result at the image patch z can be calculated using (2.1). The position of the target can then be estimated as the location of the maximum correlation output.

ŷ = h ⋆ z = F⁻¹{H̄ Z}   (2.1)

Capital letters denote the DFT of the corresponding signals. The second equality in (2.1) follows from the correlation property of the DFT. The next section deals with how to train the correlation filter h.

2.1.2 Training

First consider the simplest case. Given an image patch x ∈ ℓ_p(M, N) that is centred at the object of interest, the task is to find the correlation filter h ∈ ℓ_p(M, N) that gives the output y ∈ ℓ_p(M, N) if correlated with x. y can simply be a Kronecker delta centered at the target, but it proves to be more robust to use a smooth function, e.g. a sampled Gaussian. The goal is to find an h that satisfies h ⋆ x = y.
If all frequencies in x contain non-zero energy, there is a unique solution, given by \bar{H} = Y/X. In practice it is important to be able to train the filter using multiple image samples x^1, \ldots, x^J. These samples can originate from different frames. Let y^1, \ldots, y^J be their corresponding desired output functions (or label functions). h is found by minimizing (2.2).

\varepsilon = \sum_{j=1}^{J} \beta_j \|h \star x^j - y^j\|^2 + \lambda \|h\|^2   (2.2)

Here \beta_1, \ldots, \beta_J are weight parameters for the corresponding samples and \lambda is a regularization parameter. The filter that minimizes (2.2) is given in (2.3); see [8] for the derivation.

H = \frac{\sum_{j=1}^{J} \beta_j \bar{Y}^j X^j}{\sum_{j=1}^{J} \beta_j \bar{X}^j X^j + \lambda}   (2.3)

Equation 2.3 suggests updating the numerator H_N and denominator H_D of H in each new frame using a learning parameter \gamma. If H^{t-1} = H_N^{t-1} / (H_D^{t-1} + \lambda) is the filter updated in frame t-1 and x^t, y^t are the new sample and desired output function from frame t, then the filter is updated as in (2.4). This is the core part of the MOSSE tracking algorithm in [8].

H_N^t = (1 - \gamma) H_N^{t-1} + \gamma \bar{Y}^t X^t   (2.4a)
H_D^t = (1 - \gamma) H_D^{t-1} + \gamma \bar{X}^t X^t   (2.4b)
H^t = \frac{H_N^t}{H_D^t + \lambda}   (2.4c)

This update scheme results in the weights given in (2.5).

\beta_j = \begin{cases} (1 - \gamma)^{t-1}, & j = 1 \\ \gamma (1 - \gamma)^{t-j}, & j = 2, \ldots, t \end{cases}   (2.5)

2.2 The CSK Tracker

This section discusses the CSK tracker, which was briefly presented in section 1.1.1. The CSK tracker can be obtained by extending the MOSSE tracker with a non-linear kernel. This extension is accomplished by introducing a mapping \varphi : \mathcal{X} \to \mathcal{H} from the signal space \mathcal{X} to some Hilbert space \mathcal{H} and exploiting the kernel trick [7]. The result is also generalized to vector-valued signals f : \mathbb{Z} \times \mathbb{Z} \to \mathbb{R}^D, to handle multiple features. The set of such periodic signals is denoted \ell_p^D(M, N). The individual components of f are denoted f^d, d \in \{1, \ldots, D\}, where f^d \in \ell_p(M, N).

2.2.1 Training with a Single Image

Let \langle \cdot, \cdot \rangle be the standard inner product in \ell_p(M, N).
Let x_{m,n} = \tau_{-m,-n} x denote the result of shifting x \in \ell_p(M, N) by m and n steps, so that x_{m,n}(k, l) = x(k + m, l + n). Note that h \star x(m, n) = \langle x_{m,n}, h \rangle. For the case of a single training image (J = 1), the cost function in (2.2) can be written as in (2.6).

\varepsilon = \sum_{m,n} \big( \langle x_{m,n}, h \rangle - y(m, n) \big)^2 + \lambda \langle h, h \rangle   (2.6)

The sum in (2.6) is taken over a single period.1 The equation can be further generalized by considering the mapped examples \varphi(x_{m,n}). The decision boundary is obtained by minimizing (2.7) over v \in \mathcal{H}.

\varepsilon = \sum_{m,n} \big( \langle \varphi(x_{m,n}), v \rangle - y(m, n) \big)^2 + \lambda \langle v, v \rangle   (2.7)

Observe that this is the cost function for regularized least squares classification with kernels. A well-known result from classification is that the v that minimizes (2.7) lies in the subspace spanned by the vectors (\varphi(x_{m,n}))_{m,n}. This result is easy to show in this case by decomposing any v as v = v_\parallel + v_\perp, where v_\parallel is in this subspace and v_\perp is orthogonal to it. The result can be written as in (2.8) for some scalars a(m, n).

v = \sum_{m,n} a(m, n) \varphi(x_{m,n})   (2.8)

The inner product in \mathcal{H} is defined by the kernel function \kappa(f, g) = \langle \varphi(f), \varphi(g) \rangle, \forall f, g \in \mathcal{X}. The coefficients a(m, n) are found by minimizing (2.9), which is obtained from (2.7) by expressing v using (2.8) and applying the definition of the kernel function.

\varepsilon = \sum_{m,n} \Big( \sum_{k,l} a(k, l) \kappa(x_{m,n}, x_{k,l}) - y(m, n) \Big)^2 + \lambda \sum_{m,n} a(m, n) \sum_{k,l} a(k, l) \kappa(x_{m,n}, x_{k,l})   (2.9)

A closed-form solution to (2.9) can be derived under the assumption of a shift invariant kernel. The concept of shift invariant kernels is defined in section 2.2.4. The coefficients a can be extended periodically to an element in \ell_p(M, N). The a that minimizes (2.9) is given in (2.10). A derivation using circulant matrices can be found in [21], but it is also proved in section B.1.2 for a more general case.

1 It is always assumed that the summation is done over a single period, e.g. \forall (m, n) \in \{1, \ldots, M\} \times \{1, \ldots, N\}, if no limits or set is specified.
Here, the function u_x(m, n) = \kappa(x_{m,n}, x) has been defined. It is clear that u_x \in \ell_p(M, N).

A = \mathcal{F}\{a\} = \frac{Y}{U_x + \lambda}   (2.10)

This is the same result as in (1.5), which the original CSK tracker [21] builds upon.

2.2.2 Detection

The calculation of the detection result at an image patch z is similar to (2.1). Here, x is the learnt appearance of the object and A is the DFT of the learnt coefficients. By defining u_z(m, n) = \kappa(z_{m,n}, x), the output can be computed using (2.11).

\hat{y} = \mathcal{F}^{-1}\{A U_z\}   (2.11)

2.2.3 Multidimensional Feature Maps

Equations 2.10 and 2.11 can be used for any feature dimensionality. The task is just to define a shift invariant kernel function that can be used for multidimensional features. One example of such a kernel is the standard inner product in \ell_p^D(M, N), i.e. \kappa(f, g) = \langle f, g \rangle. Let x^d denote feature layer d \in \{1, \ldots, D\} of x. The training and detection in this case can be derived from equations 2.10 and 2.11. The result is given in (2.12). This is essentially the MOSSE tracker for multidimensional features, trained on a single image.

H^d = \frac{\bar{Y} X^d}{\sum_{d=1}^{D} \bar{X}^d X^d + \lambda}   (2.12a)

\hat{y} = \mathcal{F}^{-1}\Big\{ \sum_{d=1}^{D} \bar{H}^d Z^d \Big\}   (2.12b)

2.2.4 Kernel Functions

The kernel function is a mapping \kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R} that is symmetric and positive definite. \mathcal{X} is the sample space, i.e. \ell_p^D(M, N). The kernel function needs to be shift invariant for equations 2.10 and 2.11 to be valid. This section contains the definition of a shift invariant kernel from [21] and the propositions that need to be stated regarding this property.

2.1 Definition (Shift Invariant Kernel). A shift invariant kernel is a valid kernel \kappa on \ell_p^D(M, N) that satisfies

\kappa(f, g) = \kappa(\tau_{m,n} f, \tau_{m,n} g), \forall m, n \in \mathbb{Z}, \forall f, g \in \ell_p^D(M, N)   (2.13)

2.2 Proposition. Let \kappa be the inner product kernel in (2.14), where k : \mathbb{R} \to \mathbb{R}.

\kappa(f, g) = k(\langle f, g \rangle), \forall f, g \in \ell_p^D(M, N)   (2.14)

Then \kappa is a shift invariant kernel. Further, the following relation holds.

\kappa(\tau_{-m,-n} f, g) = k\Big( \mathcal{F}^{-1}\Big\{ \sum_{d=1}^{D} F^d \bar{G}^d \Big\}(m, n) \Big), \forall m, n \in \mathbb{Z}   (2.15)

2.3 Proposition. Let \kappa be the radial basis function kernel in (2.16), where k : \mathbb{R} \to \mathbb{R}.

\kappa(f, g) = k(\|f - g\|^2), \forall f, g \in \ell_p^D(M, N)   (2.16)

Then \kappa is a shift invariant kernel. Further, the following relation holds.

\kappa(\tau_{-m,-n} f, g) = k\Big( \|f\|^2 + \|g\|^2 - 2 \mathcal{F}^{-1}\Big\{ \sum_{d=1}^{D} F^d \bar{G}^d \Big\}(m, n) \Big), \forall m, n \in \mathbb{Z}   (2.17)

The proofs are found in section B.1.1. From these propositions it follows that Gaussian and polynomial kernels are shift invariant. Equations (2.15) and (2.17) give efficient ways to compute the kernel outputs U_x and U_z in e.g. (2.10) and (2.11) using the FFT.

2.3 Robust Appearance Learning

This section describes the proposed extension of the CSK learning approach in (2.10) to support training with multiple images. It can also be seen as an extension of the MOSSE tracker to multiple features if a linear kernel is used. The result is a more robust learning scheme for the tracking model, which is shown to outperform the learning scheme of the CSK [21] in chapter 5. The tracker proposed in this section is therefore referred to as Robust CSK, or RCSK.

The CSK tracker learns its tracking model by computing A using (2.10) for each new frame independently. It then applies an ad hoc method of updating the classifier coefficients by linear interpolation between the new coefficients A and the previous ones A^{t-1}: A^t = (1 - \gamma) A^{t-1} + \gamma A, where \gamma is a learning rate parameter.

Modifying the cost function to include training with multiple images is not as straightforward as with the MOSSE tracker for grayscale images, since the kernel function is non-linear in general. The equivalent of (2.2) in the CSK case would be to minimize:

\varepsilon = \sum_{j=1}^{J} \beta_j \sum_{m,n} \big( \langle \varphi(x_{m,n}^j), v \rangle - y^j(m, n) \big)^2 + \lambda \langle v, v \rangle   (2.18)

However, the solution v = \sum_{j=1}^{J} \sum_{m,n} a^j(m, n) \varphi(x_{m,n}^j) involves computing a set of coefficients a^j for each training image x^j.
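Equations (2.15) and (2.17) reduce the evaluation of the kernel output at all MN shifts to a handful of FFTs. A numpy sketch of the Gaussian case, using the kernel (2.23) with its MND normalization (names are illustrative):

```python
import numpy as np

def gaussian_kernel_output(f, g, sigma=0.2):
    """Evaluate u(m,n) = kappa(tau_{-m,-n} f, g) at all M*N shifts via (2.17).

    f, g  : (M, N, D) real feature maps
    sigma : kernel bandwidth constant (sigma_kappa in the text)
    """
    M, N, D = f.shape
    F = np.fft.fft2(f, axes=(0, 1))
    G = np.fft.fft2(g, axes=(0, 1))
    # Cross-correlation over all shifts, summed across the D feature layers.
    corr = np.real(np.fft.ifft2(np.sum(F * np.conj(G), axis=2)))
    # Squared L2-distance between the shifted f and g, at every shift.
    dist2 = np.sum(f**2) + np.sum(g**2) - 2.0 * corr
    return np.exp(-dist2 / (sigma**2 * M * N * D))
```

For any fixed shift (m, n) the result agrees with a direct evaluation of the kernel on the circularly shifted signal, at a fraction of the cost.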
Solving (2.18) exactly requires evaluating all pairs of kernel outputs u_x^{i,j}(m, n) = \kappa(x_{m,n}^j, x^i). The coefficients a^j can then be computed by solving MN linear systems of size J \times J. This is obviously highly impractical in a real-time setting if the number of images J is more than only a few. To keep the simplicity and speed of the MOSSE tracker, it is thus necessary to find some approximation of the solution to (2.18). Specifically, the appearance model should only contain one set of classifier coefficients a, to simplify learning and detection.

This can be accomplished by restricting the solution so that the coefficients a are the same for all images. This is expressed as the cost function in (2.19).

\varepsilon = \sum_{j=1}^{J} \beta_j \Big( \sum_{m,n} |\langle \varphi(x_{m,n}^j), v^j \rangle - y^j(m, n)|^2 + \lambda \langle v^j, v^j \rangle \Big)   (2.19a)

where

v^j = \sum_{k,l} a(k, l) \varphi(x_{k,l}^j)   (2.19b)

The a that minimizes (2.19) is given in (2.20), where u_x^j(m, n) = \kappa(x_{m,n}^j, x^j).

A = \frac{\sum_{j=1}^{J} \beta_j Y^j U_x^j}{\sum_{j=1}^{J} \beta_j U_x^j (U_x^j + \lambda)}   (2.20)

See section B.1.2 for the derivation. The object patch appearance \hat{x}^t is updated using the same learning parameter \gamma. The final update rule is given in (2.21).

A_N^t = (1 - \gamma) A_N^{t-1} + \gamma Y^t U_x^t   (2.21a)
A_D^t = (1 - \gamma) A_D^{t-1} + \gamma U_x^t (U_x^t + \lambda)   (2.21b)
A^t = \frac{A_N^t}{A_D^t}   (2.21c)
\hat{x}^t = (1 - \gamma) \hat{x}^{t-1} + \gamma x^t   (2.21d)

The resulting weights \beta_j are the same as in (2.5). See algorithm 2.1 for the complete pseudo code of the proposed RCSK tracker.

2.4 Details

This section discusses various details of the proposed tracker algorithm, including parameters and necessary preprocessing steps for feature extraction.

2.4.1 Parameters

The label function y is, as in [8, 21], set to the Gaussian function in (2.22). Its standard deviation is proportional to the given target size s = (s_1, s_2), with a constant \sigma_y. Since a constant label function y is used, its transform Y^t = Y = \mathcal{F}\{y\} can be precomputed.

y(m, n) = \exp\Big( -\frac{1}{2\sigma_y^2} \Big( \Big(\frac{m - M/2}{s_1}\Big)^2 + \Big(\frac{n - N/2}{s_2}\Big)^2 \Big) \Big),   (2.22)
for m \in \{0, \ldots, M - 1\}, n \in \{0, \ldots, N - 1\}

The kernel \kappa is set to a Gaussian with a variance proportional to the dimensionality of the patches, with a constant \sigma_\kappa^2. The kernel used in [21] is given in (2.23).

\kappa(f, g) = \exp\Big( -\frac{1}{\sigma_\kappa^2 M N D} \|f - g\|^2 \Big)   (2.23)

Algorithm 2.1 The proposed RCSK tracker.
Input: Sequence of frames: \{I^1, \ldots, I^T\}; target position in the first frame: p^1; target size: s; window function: w; parameters: \gamma, \lambda, \eta, \sigma_y, \sigma_\kappa
Output: Estimated target position in each frame: \{p^1, \ldots, p^T\}
1: Initialization:
2: Construct the label function y using (2.22) and set Y = \mathcal{F}\{y\}
3: Extract x^1 from I^1 at p^1
4: Calculate u_x^1(m, n) = \kappa(x_{m,n}^1, x^1) using (2.15) or (2.17)
5: Initialize: A_N^1 = Y U_x^1, A_D^1 = U_x^1 (U_x^1 + \lambda), A^1 = A_N^1 / A_D^1, \hat{x}^1 = x^1
6: for t = 2 : T do
7:   Detection:
8:   Extract z^t from I^t at p^{t-1}
9:   Calculate u_z^t(m, n) = \kappa(z_{m,n}^t, \hat{x}^{t-1}) using (2.15) or (2.17)
10:  Calculate the correlation output: \hat{y}^t = \mathcal{F}^{-1}\{A^{t-1} U_z^t\}
11:  Calculate the new position p^t = \arg\max_p \hat{y}^t(p)
12:  Training:
13:  Extract x^t from I^t at p^t
14:  Calculate u_x^t(m, n) = \kappa(x_{m,n}^t, x^t) using (2.15) or (2.17)
15:  Update the tracker using (2.21)
16: end for

A padding parameter \eta decides the amount of background contained in the patches, so that (M, N) = (1 + \eta)s. The regularization parameter \lambda can be set to almost zero in most cases if the proposed learning is used. But since the effect of this parameter proved to be negligible for small values, it is set to the same value as in [21] for a fair comparison. The optimal setting of the learning rate \gamma is highly dependent on the sequence, though a compromise can often be found if the same value is used for many sequences (as in the evaluations). The complete set of parameters and default values is presented in table 2.1. The default values are the ones suggested by [21].

2.4.2 Windowing

As noted earlier, the periodic assumption is the key to being able to exploit the FFT in the computations.
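The model update (2.21), together with the first-frame initialization of Algorithm 2.1, translates directly into a few lines. A numpy sketch with illustrative names, assuming the kernel output DFT U_x of the new frame is already computed:

```python
import numpy as np

def rcsk_update(A_N, A_D, x_hat, Y, U_x, x, gamma=0.075, lam=0.01):
    """One RCSK model update, following (2.21).

    Y, U_x : DFTs of the label function and the new frame's kernel output
    x      : new appearance sample; x_hat is the running appearance estimate
    """
    A_N = (1 - gamma) * A_N + gamma * Y * U_x            # (2.21a)
    A_D = (1 - gamma) * A_D + gamma * U_x * (U_x + lam)  # (2.21b)
    A = A_N / A_D                                        # (2.21c)
    x_hat = (1 - gamma) * x_hat + gamma * x              # (2.21d)
    return A_N, A_D, A, x_hat
```

With gamma = 1 and zero-initialized accumulators, a single call reproduces the first-frame initialization of Algorithm 2.1, i.e. the single-image solution (2.10).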
However, the periodic extension introduces discontinuities at the patch edges.2 A common technique from signal processing to overcome this problem is windowing, where the extracted sample is multiplied by a window function. [21] suggests a Hann window, defined in (2.24).

w(m, n) = \sin^2\Big(\frac{\pi m}{M - 1}\Big) \sin^2\Big(\frac{\pi n}{N - 1}\Big)   (2.24)

Table 2.1: The parameters for the RCSK and CSK trackers.
Parameter      Default value   Explanation
\gamma         0.075           Learning rate.
\lambda        0.01            Regularization parameter.
\eta           1.0             Amount of background included in the extracted patches.
\sigma_y       1/16            Standard deviation of the label function y.
\sigma_\kappa  0.2             Standard deviation of the Gaussian kernel function \kappa.

In the detection stage of the tracking algorithm, an image patch z is extracted from the new frame. However, it is not likely that the object is centred in the patch, which means that the window function distorts the object appearance. This effect becomes greater the further away from the center of the patch the object is located, so the windowing also affects the tracking performance negatively. The simplest way to counter this effect is to iterate the detection step of the algorithm, where each new sample is extracted at the previously estimated position in each iteration. Although this often increases the accuracy of the tracker, it significantly increases the computation time and can make the tracking more unstable. Another option is to predict the position of the object in the next frame in a more sophisticated way, instead of just assuming constant position. This can be done by applying a Kalman filter with a constant velocity or constant acceleration model.

2 Continuity is not defined for functions with discrete domains. However, we can think of the domain as continuous for a moment, i.e. as before the signal was sampled.

2.4.3 Feature Value Normalization

For image intensity features, [8, 21] suggest normalizing the values to the range [-0.5, 0.5].
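The Hann window (2.24) and its application to a multi-layer feature map can be sketched in a few lines of numpy (helper names are illustrative):

```python
import numpy as np

def hann_window(M, N):
    """Separable Hann window (2.24): w(m,n) = sin^2(pi m/(M-1)) sin^2(pi n/(N-1))."""
    m = np.arange(M).reshape(-1, 1)
    n = np.arange(N).reshape(1, -1)
    return np.sin(np.pi * m / (M - 1)) ** 2 * np.sin(np.pi * n / (N - 1)) ** 2

def apply_window(x):
    """Multiply every feature layer of an (M, N, D) map by the window."""
    M, N = x.shape[:2]
    return hann_window(M, N)[..., None] * x
```

The window is zero along the patch borders and one at the patch center, which suppresses the artificial discontinuities of the periodic extension.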
The reason for this normalization is to minimize the amount of distortion induced by the windowing operation discussed in the previous section. The idea is to remove as much of the inherent bias in the feature values as possible by subtracting some a priori mean feature value. The same methodology can be applied to other kinds of features. One way of eliminating the need to choose the normalization of each feature is to automatically learn a normalization constant (that is subtracted from the feature value) based on the specific image sequence, or even the specific frame. This, however, has to be done with care to avoid corrupting the learnt appearance and classifier coefficients. A method for adaptively selecting the normalization constant based on the weighted average feature values was tried, but no significant performance gain was observed compared to using the ad hoc a priori mean feature values, so it was not investigated further. A special feature normalization scheme for features with a probabilistic representation (e.g. histograms) is presented in section 3.2.1.

3 Color Features for Tracking

Most state-of-the-art trackers rely on either intensity or texture information [19, 50, 24, 13, 38], including the CSK and MOSSE trackers discussed in the previous chapter. While significant progress has been made in visual tracking, the use of color information has been limited to simple color space transformations [35, 31, 10, 32, 11]. However, sophisticated color features have been shown to significantly improve the performance of object recognition and detection [41, 26, 49, 42, 25]. This motivates an investigation of how color information should be used in visual tracking.

Exploiting color information for visual tracking is a difficult challenge. Color measurements can vary significantly over an image sequence due to variations in illuminant, shadows, shading, specularities, and camera and object geometry.
Robustness with respect to these factors has been studied in color imaging, and successfully applied to image classification [41, 26] and action recognition [27]. This chapter presents the color features that are evaluated in section 5.3 and discusses how they are incorporated into the family of circulant structure trackers presented in chapter 2.

3.1 Evaluated Color Features

In this section, 11 color representations are presented briefly. These are evaluated in section 5.3 with the proposed tracking framework. Each color representation uses a mapping from local RGB-values to a color space of some dimension. All color features evaluated here except Opponent-Angle and SO use pixelwise mappings from one RGB-value to a color value.

RGB: As a baseline, the standard 3-channel RGB color space is used.

LAB: The 3-dimensional LAB color space is perceptually uniform, meaning that colors at equal distance are also perceptually considered to be equally far apart. The L component approximates the human perception of lightness.

YCbCr: YCbCr contains a luminance component Y and two chrominance components Cb and Cr, which encode the blue-difference and red-difference respectively. The representation is approximately perceptually uniform. It is commonly used in image compression algorithms.

rg: The rg [17] color channels are computed as (r, g) = \big( \frac{R}{R+G+B}, \frac{G}{R+G+B} \big). They are invariant with respect to shadow and shading effects.

HSV: In the HSV color space, V encodes the lightness as the maximum RGB-value, H is the hue and S is the saturation, which corresponds to the purity of the color. H and S are invariant to shadow-shading. The hue H is additionally invariant to specularities.

Opponent: The opponent color space is an orthonormal transformation of the RGB color space, given by (3.1).

\begin{pmatrix} O_1 \\ O_2 \\ O_3 \end{pmatrix} = \begin{pmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{6}} & \frac{1}{\sqrt{6}} & -\frac{2}{\sqrt{6}} \\ \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix}   (3.1)

This representation is invariant with respect to specularities.
C: The C color representation [41] adds photometric invariance with respect to shadow-shading to the opponent descriptor by normalizing with the intensity. This is done according to (3.2).

C = \Big( \frac{O_1}{O_3}, \frac{O_2}{O_3}, O_3 \Big)^T   (3.2)

HUE: The hue is a 36-dimensional histogram representation [42] of H = \arctan\big(\frac{O_1}{O_2}\big). The contribution to the hue histogram is weighted with the saturation S = \sqrt{O_1^2 + O_2^2} to counter the instabilities of the hue representation. This representation is invariant to shadow-shading and specularities.

Opp-Angle: The Opp-Angle is a 36-dimensional histogram representation [42] based on spatial derivatives of the opponent channels. The histogram is constructed using (3.3).

ang_x^O = \arctan\Big( \frac{O_{1x}}{O_{2x}} \Big)   (3.3)

The subscript x denotes the spatial derivative. This representation is invariant to specularities, shadow-shading, blur and a constant offset.

SO: SO is a biologically inspired descriptor by Zhang et al. [49]. This color representation is based on center-surround filters on the opponent color channels.

Color Names: See section 3.1.1.

3.1.1 Color Names

The Color Names descriptor is explained in more detail, since it proved to be the best choice in the evaluation in section 5.3. It is therefore used in the proposed version of the tracker and in part two of this thesis. Color names (CN) are linguistic color labels assigned by humans to represent colors in the world. In a linguistic study, Berlin and Kay [6] concluded that the English language contains eleven basic color terms: black, blue, brown, grey, green, orange, pink, purple, red, white and yellow. In the field of computer vision, color naming is an operation that associates RGB observations with linguistic color labels. In this thesis, the mapping provided by [43] is used. Each RGB value is mapped to a probabilistic 11-dimensional color representation, which sums up to 1.
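As an illustration of how such a quantized lookup works, the sketch below uses a placeholder table filled with random row-normalized values; the real table is the learned mapping of [43], and the index layout chosen here (R fastest) is an assumption, not the exact format of the provided file:

```python
import numpy as np

# Placeholder (32768, 11) table of color-name probabilities, for illustration
# only; the real values come from the learned mapping of [43].
rng = np.random.RandomState(0)
CN_TABLE = rng.rand(32 * 32 * 32, 11)
CN_TABLE /= CN_TABLE.sum(axis=1, keepdims=True)

def color_names(img):
    """Map an (H, W, 3) uint8 RGB image to (H, W, 11) color-name probabilities
    by quantizing each channel to 32 levels and indexing the lookup table."""
    r = (img[..., 0] // 8).astype(int)
    g = (img[..., 1] // 8).astype(int)
    b = (img[..., 2] // 8).astype(int)
    idx = r + 32 * g + 32 * 32 * b   # assumed index layout
    return CN_TABLE[idx]
```

Each output pixel is an 11-vector of probabilities summing to one, which is the property the normalization scheme in section 3.2.1 exploits.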
For each pixel, the color name values thus represent the probabilities that the pixel should be assigned to each of the above-mentioned colors. Figure 3.1 visualizes the color name descriptor in a real-world tracking example. The color names mapping is automatically learned from images retrieved with Google Image search; 100 example images per color were used in the training stage. The provided mapping is a lookup table from 32^3 = 32768 uniformly sampled RGB values to the 11 color name probabilities. A difference from the other color descriptors mentioned in section 3.1 is that the color names encode achromatic colors, such as white, gray and black. This means that the descriptor does not aim for full photometric invariance, but rather for discriminative power.

3.2 Incorporating Color into Tracking

In section 2.2.3 it was noted that the kernel formulation of the CSK and RCSK trackers makes it easy to extend the tracking algorithm to multidimensional features, such as color features. By using a linear kernel in these trackers, they can also be seen as different extensions of the MOSSE tracker to multidimensional features. The windowing operation discussed in section 2.4.2 is applied to every feature layer separately after the feature extraction step, which in this case is a color space transformation followed by a feature normalization.

3.2.1 Color Feature Normalization

The feature normalization step, as described in section 2.4.3, is an important and non-trivial task. For all color descriptors in section 3.1 with a non-probabilistic representation (i.e. all except HUE, Opp-Angle and Color Names), the normalization is done by centring the range of each feature value, so that the range of the feature values is symmetric around zero. This is motivated by assuming uniform and independent feature value probabilities. However, the independence assumption is not valid for the high-dimensional color descriptors.
For these descriptors it is more correct to normalize the representation so that the expected sum over the feature values is zero. For color names, this means subtracting 1/11 from each feature bin.

Figure 3.1: Figure 3.1a shows an image patch of the target in the soccer sequence, a benchmark image sequence for evaluating visual trackers. Figures 3.1b to 3.1l show the 11 color name probabilities (black, blue, brown, gray, green, orange, pink, purple, red, white, yellow) obtained from the image patch. Notice how motion blur, illumination, specularities and compression artefacts complicate the process of color naming the pixels.

A specific attribute of the family of trackers explained in chapter 2, including the proposed RCSK, opens up an interesting alternative normalization scheme that can be used with color names. It can in fact be used for any feature representation that sums up to some constant value. For this reason, color names contain only 10 degrees of freedom: the color name values lie in a 10-dimensional hyperplane in the feature space. This plane is orthogonal to the vector (1, 1, \ldots, 1)^T. The color name values can be centered by changing the feature space basis to an orthonormal basis chosen so that the last basis vector is orthogonal to this plane. Since the last coordinate in the new basis is constant (and thus contains no information), it can be discarded. The feature dimensionality is thus reduced from 11 to 10 when this normalization scheme is used. This has a positive effect on the computational cost of the trackers, by reducing the number of necessary FFT computations and memory accesses in each frame. The nature of the trackers explained in chapter 2 makes them invariant to the choice of basis used in the normalization step.
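The basis change described above can be sketched with a QR decomposition: fix the first basis vector along (1, ..., 1)^T, complete it to an orthonormal basis, and keep the remaining D - 1 vectors. The helper names below are hypothetical:

```python
import numpy as np

def centering_basis(D=11):
    """Orthonormal basis of the hyperplane orthogonal to (1,...,1)^T.

    QR of a full-rank matrix whose first column points along the all-ones
    direction; the remaining Q columns are orthonormal and orthogonal to it.
    """
    A = np.eye(D)
    A[:, 0] = 1.0 / np.sqrt(D)
    Q, _ = np.linalg.qr(A)
    return Q[:, 1:]   # D x (D-1): the constant direction is discarded

def normalize_color_names(p):
    """Project (..., 11) color-name probabilities onto the 10-dim subspace."""
    return p @ centering_basis(p.shape[-1])
```

Any vector with equal entries (in particular the mean probability 1/11 per bin) projects to zero, so the projection both centers the representation and drops the uninformative constant coordinate.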
This invariance comes from the fact that the inner products and L^2-norms used in the kernel computations are invariant under unitary transformations of the feature values. This property is discussed further in section 4.2.2. To minimize the computational cost of this feature normalization step, a new lookup table was constructed that maps RGB-values directly to the 10-dimensional normalized color name values. In later chapters, these normalized color names are referred to simply as color names. This means that this normalization scheme was always employed for color names in the experiments of chapter 5.

4 Adaptive Dimensionality Reduction

The time complexity of the proposed tracker in algorithm 2.1 scales linearly with the number of features. To overcome this problem, an adaptive dimensionality reduction technique is proposed in this chapter. The technique reduces the number of feature dimensions without any significant loss in tracking performance. The dimensionality reduction is based on Principal Component Analysis (PCA), which is described in section 4.1. Section 4.2 presents the theory behind the proposed approach. Section 4.3 contains implementation details and pseudo code for the approach, and describes how it is applied to the trackers discussed in chapter 2.

4.1 Principal Component Analysis

PCA1 [30] is a standard way of performing dimensionality reduction. It computes an orthonormal basis for the linear subspace of a given dimension that holds the largest portion of the total variance in the dataset. The basis vectors are aligned so that the projections onto this basis are pairwise uncorrelated. From a geometric perspective, PCA returns an orthonormal basis for the subspace that minimizes the average squared L^2-error between a set of centered2 data points and their projections onto this subspace. This is formulated in (4.1).

\min \varepsilon = \frac{1}{N} \sum_{i=1}^{N} \|x_i - B B^T x_i\|^2   (4.1a)
subject to B^T B = I   (4.1b)

1 PCA is also known as the Discrete Karhunen-Loève Transform.
2 Here, "centered" refers to the average value having been subtracted from the data.

x_i \in \mathbb{R}^n are the centered data points and B is an n \times m matrix that contains the orthonormal basis vectors of the subspace in its columns. It can be shown that this optimization problem is equivalent to maximizing (4.2) under the same constraint. The covariance matrix C is defined as C = \frac{1}{N} \sum_i x_i x_i^T.

V = \mathrm{tr}(B^T C B)   (4.2)

The PCA solution to this problem is to choose the columns of B as the normalized eigenvectors of C that correspond to the largest eigenvalues (see [30] for the proof). It should be mentioned that any orthonormal basis of the subspace spanned by these eigenvectors is a solution to the optimization problem (4.1).

4.2 The Theory Behind the Proposed Approach

The proposed dimensionality reduction is a mapping to a linear subspace of the feature space, defined by an orthonormal basis. Let B_t denote the matrix containing the orthonormal basis vectors of this subspace as columns. Assume that the feature map of the appearance \hat{x}^t \in \ell_p^{D_1}(M, N) at frame t has D_1 features and that the desired feature dimensionality is D_2. B_t should thus be a D_1 \times D_2 matrix. The projection onto the feature subspace is done by the linear mapping \tilde{x}^t(m, n) = B_t^T \hat{x}^t(m, n), where \tilde{x}^t is the compressed feature map. This section presents a method for computing the subspace basis B_t used in the dimensionality reduction.

4.2.1 The Data Term

The original feature map \hat{x}^t of the learnt patch appearance can be optimally reconstructed (in the L^2-sense) as B_t \tilde{x}^t = B_t B_t^T \hat{x}^t. An optimal projection matrix can be found by minimizing the reconstruction error of the appearance in (4.3).

\min \varepsilon_{data}^t = \frac{1}{MN} \sum_{m,n} \|\hat{x}^t(m, n) - B_t B_t^T \hat{x}^t(m, n)\|^2   (4.3a)
subject to B_t^T B_t = I   (4.3b)

Equation 4.3a can be seen as a data term, since it only regards the current object appearance.
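The PCA solution of section 4.1, applied to centered pixel features, can be sketched in a few lines of numpy (function name is illustrative):

```python
import numpy as np

def pca_basis(X, m):
    """PCA solution to (4.1): X is n x N with one (already centered) data
    point per column; returns an n x m orthonormal basis B maximizing
    tr(B^T C B)."""
    C = X @ X.T / X.shape[1]              # covariance matrix
    vals, vecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    return vecs[:, ::-1][:, :m]           # eigenvectors of the m largest
```

For data whose variance is concentrated along one direction, the top basis vector recovers that direction up to sign.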
The data term (4.3a) can be simplified to (4.4) by introducing the data matrix X_t, which contains all pixel values of \hat{x}^t, with one column per pixel and one row per feature. X_t thus has dimensions D_1 \times MN. The second equality follows from the properties of the Frobenius norm and the trace operator. The covariance matrix C_t is defined by C_t = \frac{1}{MN} X_t X_t^T.

\varepsilon_{data}^t = \frac{1}{MN} \|X_t - B_t B_t^T X_t\|_F^2 = \mathrm{tr}(C_t) - \mathrm{tr}(B_t^T C_t B_t)   (4.4)

4.2.2 The Smoothness Term

The projection matrix must be able to adapt to changes in the target and background appearance. Otherwise it would likely become outdated, and the tracker would deteriorate over time since valuable information is lost in the feature compression. However, the projection matrix must also take the already learnt appearance into account. If it changes too drastically, the already learnt classifier coefficients A^{t-1} become irrelevant, since they were computed with a seemingly different set of features. The changes in the projection matrix must thus be slow enough for the already learnt model to remain valid.

To obtain smooth variations in the projection matrix, a smoothness term is added to the optimization problem. This term adds a cost if there is any change in the subspace spanned by the column vectors of the new projection matrix compared to the earlier subspaces. This is motivated by studying the transformations between these subspaces. Let B_t be the ON-basis for the new subspace and B_j for some earlier subspace (j < t) of the same dimension. The optimal transformation from the older to the new subspace is given by P = B_t^T B_j. It can be shown that the matrix P is unitary if and only if the column vectors in B_t and B_j span the same subspace. One can easily verify that the pointwise transformation by a unitary matrix corresponds to a unitary operator U on \ell_p^{D_2}(M, N).
Lastly, one can see that inner product kernels and radial basis function kernels are invariant to unitary transforms, i.e. \kappa(Uf, Ug) = \kappa(f, g). The kernel output is thus invariant under changes in the projection matrix as long as the spanned subspace stays the same. A cost should only be added if the subspace itself changes. Equation 4.5 accomplishes this.

\varepsilon_{smooth}^j = \sum_{k=1}^{D_2} \lambda_j^{(k)} \big\| b_j^{(k)} - B_t B_t^T b_j^{(k)} \big\|^2   (4.5)

Here, b_j^{(k)} is column vector k of B_j. The positive constants \lambda_j^{(k)} are used to weight the importance of each basis vector b_j^{(k)}. Equation 4.5 measures the squared L^2-error when projecting the old basis vectors onto the new subspace. The cost becomes zero if the two subspaces are the same (even if B_t and B_j are not) and is at a maximum if the subspaces are orthogonal. By defining the diagonal matrix \Lambda_j with the weights along the diagonal, [\Lambda_j]_{k,k} = \lambda_j^{(k)}, this expression can be rewritten as (4.6).

\varepsilon_{smooth}^j = \mathrm{tr}(\Lambda_j) - \mathrm{tr}(B_t^T B_j \Lambda_j B_j^T B_t)   (4.6)

4.2.3 The Total Cost Function

Assume that the tracker is currently at frame number t. Let \hat{x}^t be the learnt feature map of the object appearance. The goal is to find the optimal projection matrix B_t for the current frame, given the set of previously computed projection matrices \{B_1, \ldots, B_{t-1}\}. B_t is found by minimizing (4.7) under the constraint B_t^T B_t = I.

\varepsilon_{tot}^t = \alpha_t \varepsilon_{data}^t + \sum_{j=1}^{t-1} \alpha_j \varepsilon_{smooth}^j = \alpha_t \big( \mathrm{tr}(C_t) - \mathrm{tr}(B_t^T C_t B_t) \big) + \sum_{j=1}^{t-1} \alpha_j \big( \mathrm{tr}(\Lambda_j) - \mathrm{tr}(B_t^T B_j \Lambda_j B_j^T B_t) \big)   (4.7)

This cost function is the weighted sum of the data term (4.4) and the smoothness term (4.6) for each previous projection matrix B_j, where the \alpha_j are importance weights. Equation 4.7 can be reformulated as the equivalent maximization problem (4.8) by exploiting the linearity of the trace function.

V_{tot} = \mathrm{tr}\Big( B_t^T \Big( \alpha_t C_t + \sum_{j=1}^{t-1} \alpha_j B_j \Lambda_j B_j^T \Big) B_t \Big)   (4.8)

By comparing this expression to the PCA formulation (4.2), one can see that the optimization problem can be solved using the PCA methodology with the covariance matrix R_t defined in (4.9). It can be verified that R_t is indeed symmetric and positive definite.

R_t = \alpha_t C_t + \sum_{j=1}^{t-1} \alpha_j B_j \Lambda_j B_j^T   (4.9)

The columns of B_t are thus chosen as the D_2 normalized eigenvectors of R_t that correspond to the largest eigenvalues.

4.3 Details of the Proposed Approach

The adaptive PCA algorithm described above requires a way of choosing the weights \alpha_j and \Lambda_j. The \alpha_j control the relative importance of the current appearance and the previously computed subspaces. They are set using a learning rate parameter \mu that acts in the same way as the learning rate \gamma for the appearance learning. Setting \mu = 1 corresponds to only using the current learnt appearance in the calculation of the projection matrix. \mu = 0 is the same as computing the projection matrix once in the first frame and then keeping it fixed for the entire sequence. The value was experimentally tuned to \mu = 0.1 for the linear kernel case and \mu = 0.15 for the non-linear kernel case.

The diagonal of \Lambda_j contains the importance weights for each basis vector in the previously computed projection matrix B_j. These are set to the eigenvalues of the corresponding basis vectors of B_j. This makes sense since the score function (4.8) equals the sum of these eigenvalues. Each eigenvalue can thus be interpreted as the score for its corresponding basis vector in B_t. In a probabilistic interpretation, the eigenvalues are the variances of each component in the new basis. Since PCA uses variance as the measure of importance, it is natural to weight each component (basis vector) with its variance. The term B_j \Lambda_j B_j^T then becomes the "reconstructed" covariance matrix of rank D_2, i.e. the covariance of the reconstructed appearance using the projections in image j.
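The adaptive update of R_t and the resulting projection matrix can be sketched as follows, in the spirit of the approach (names and the exact bookkeeping are illustrative; the running term Q accumulates the weighted covariances):

```python
import numpy as np

def projection_matrix(x_hat, Q_prev, mu=0.1, D2=2):
    """Form R_t (4.9) as a convex combination of the current appearance
    covariance C_t and the accumulated term Q_{t-1}, then take the top-D2
    eigenvectors as the projection matrix B_t.

    x_hat : (M, N, D1) learnt appearance feature map
    Q_prev: accumulated D1 x D1 matrix from earlier frames, or None at t = 1
    """
    D1 = x_hat.shape[-1]
    X = x_hat.reshape(-1, D1).T                    # D1 x MN data matrix
    X = X - X.mean(axis=1, keepdims=True)          # subtract mean feature values
    C = X @ X.T / X.shape[1]
    R = C if Q_prev is None else (1 - mu) * Q_prev + mu * C
    vals, vecs = np.linalg.eigh(R)                 # ascending eigenvalues
    B = vecs[:, ::-1][:, :D2]                      # projection matrix B_t
    Lam = np.diag(vals[::-1][:D2])                 # their eigenvalues
    BLB = B @ Lam @ B.T                            # rank-D2 "reconstructed" covariance
    Q = BLB if Q_prev is None else (1 - mu) * Q_prev + mu * BLB
    return B, Q
```

Each call returns an orthonormal D1 x D2 basis and the updated accumulator to pass into the next frame.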
Equation (4.9) is thus a weighted sum of image covariances. Algorithm 4.1 provides the full pseudo code for the computation of the projection matrix. The mean feature values do not contain information about the structure and should therefore be subtracted from the data before computing the projection matrix. Including the mean in the PCA computation affects the projection matrix to conserve the mean in the projected features, rather than maximizing the variance, which is related to image structure.

Algorithm 4.1 Adaptive projection matrix computation.
Input: frame number t; learned object appearance x̂_t; previous covariance matrix Q_{t−1}; parameters µ, D2.
Output: projection matrix B_t; current covariance matrix Q_t.
1: Calculate the mean x̄_t = (1/MN) Σ_{m,n} x̂_t(m, n)
2: Calculate the covariance C_t = (1/MN) Σ_{m,n} (x̂_t(m, n) − x̄_t)(x̂_t(m, n) − x̄_t)^T
3: if t = 1 then set R_t = C_t, else set R_t = (1 − µ) Q_{t−1} + µ C_t
4: Do the EVD R_t = E_t S_t E_t^T, with the eigenvalues in S_t in descending order
5: Set B_t to the first D2 columns in E_t
6: Set [Λ_t]_{i,j} = [S_t]_{i,j}, 1 ≤ i, j ≤ D2
7: if t = 1 then set Q_t = B_t Λ_t B_t^T, else set Q_t = (1 − µ) Q_{t−1} + µ B_t Λ_t B_t^T

Algorithm 4.2 provides the full pseudo code for the proposed RCSK tracker with adaptive dimensionality reduction. Note that the windowing of the feature map is always done after the projection onto the new reduced feature space. It is not a part of the feature extraction as in algorithm 2.1. The reason is that windowing adds spatial correlation between the pixels, which contradicts the independence and stationarity assumptions used in the PCA.

Algorithm 4.2 Proposed RCSK tracker with dimensionality reduction.
Input: sequence of frames {I^1, ..., I^T}; target position in the first frame p_1; target size s; window function w; parameters γ, λ, η, σ_y, σ_κ, µ, D2.
Output: estimated target position in each frame {p_1, ..., p_T}.
Initialization:
1: Construct the label function y using (2.22) and set Y = F{y}
2: Extract x^1 from I^1 at p_1
3: Initialize x̂^1 = x^1
4: Calculate B_1 and Q_1 using algorithm 4.1
5: Project the features and apply the window: x̃^1(m, n) = w(m, n) B_1^T x^1(m, n)
6: Calculate u_x^1(m, n) = κ(x̃^1_{m,n}, x̃^1) using (2.15) or (2.17)
7: Initialize A_N^1 = Y U_x^1, A_D^1 = U_x^1 (U_x^1 + λ), A^1 = A_N^1 / A_D^1 and x̂̃^1 = x̃^1
8: for t = 2 : T do
9:   Detection:
10:    Extract z^t from I^t at p_{t−1}
11:    Project the features and apply the window: z̃^t(m, n) = w(m, n) B_t^T z^t(m, n)
12:    Calculate u_z^t(m, n) = κ(z̃^t_{m,n}, x̂̃^{t−1}) using (2.15) or (2.17)
13:    Calculate the correlation output ŷ = F^{−1}{A^{t−1} U_z^t}
14:    Calculate the new position p_t = argmax_p ŷ(p)
15:   Training:
16:    Extract x^t from I^t at p_t
17:    Update the appearance x̂^t using (2.21d)
18:    Calculate B_t and Q_t using algorithm 4.1
19:    Project the features and apply the window: x̃^t(m, n) = w(m, n) B_t^T x^t(m, n)
20:    Calculate u_x^t(m, n) = κ(x̃^t_{m,n}, x̃^t) using (2.15) or (2.17)
21:    Update the tracker using (2.21a), (2.21b) and (2.21c)
22:    Calculate the projected appearance x̂̃^t(m, n) = w(m, n) B_t^T x̂^t(m, n)
23: end for

5 Evaluation

This chapter contains evaluations, results, discussions and conclusions related to the first part of this thesis. Section 5.1 describes the evaluation methodology, including evaluation metrics and datasets. In section 5.2, a comparison is made between the trackers presented in chapter 2. The color features discussed in chapter 3 are evaluated in section 5.3. The effect of the dimensionality reduction technique proposed in chapter 4 is investigated in section 5.4. The best performing proposed tracker versions are then compared to state-of-the-art methods in an extensive evaluation in section 5.5. Lastly, section 5.6 presents some general conclusions and discussions about possible directions of future work.

5.1 Evaluation Methodology

The methods were evaluated using the protocol and code recently provided by Wu et al. [46]¹.
The evaluation code was modified with some bug fixes and some added functionality. It employs the most commonly used scheme for evaluating causal generic trackers on image sequences with ground-truth target locations. The tracker is initialized in the first frame, with the known target location. In the subsequent frames, the tracker is used to estimate the locations of the target. Only information from the previous frames and the current frame may be exploited by the tracker when estimating a target location. The estimated trajectory is then compared with the ground truth locations using different evaluation metrics. All evaluations were performed on a desktop computer with an Intel Xeon 2-core 2.66 GHz CPU and 16 GB of RAM.

¹The sequences together with the ground truth and Matlab code are available at: https://sites.google.com/site/trackerbenchmark/benchmarks/v10

5.1.1 Evaluation Metrics

The trackers were evaluated using three evaluation metrics commonly used in the literature. The first is average center location error (CLE), which is the average L2-distance (in pixels) between the estimated and ground truth center locations of the target over the sequence. The second metric is distance precision (DP), which is the relative number of frames where the estimated center location is within a certain distance threshold d from the ground truth center location. The third metric is overlap precision (OP), defined as the relative number of frames where the overlap between the estimated and ground truth bounding boxes exceeds a certain threshold b. Bounding box overlap is commonly measured using the PASCAL criterion. For an image sequence, the three measures are calculated using (5.1). The ground truth and estimated center locations are denoted p_t and p̂_t respectively, where t is the frame number. Similarly, B_t and B̂_t denote the ground truth and estimated bounding boxes of the target.
A bounding box is here defined as the set of pixels covered by the rectangular area. N is the number of frames in the sequence.

CLE = (1/N) Σ_{t=1}^{N} ‖p̂_t − p_t‖    (5.1a)

DP(d) = (1/N) |{t : ‖p̂_t − p_t‖ ≤ d}|,  d ≥ 0    (5.1b)

OP(b) = (1/N) |{t : |B̂_t ∩ B_t| / |B̂_t ∪ B_t| ≥ b}|,  0 ≤ b ≤ 1    (5.1c)

There is not much agreement in the literature on which performance measures to use for comparing visual trackers. DP and CLE only regard the estimated center location, which is desirable when evaluating trackers that do not estimate scale. OP also takes the estimated scale into account. This metric was only used in the state-of-the-art evaluation (see section 5.5), to give a more complete and fair comparison to trackers that estimate scale variations. Results of both CLE and DP are reported in the tables. The numeric values of distance precision are reported at a threshold of 20 pixels [21, 38, 46], i.e. DP(20). It is motivated to use both these metrics since they contain complementary information. CLE has the drawback of being unstable at tracker failures. DP is robust to failures, but it does not include any information about the accuracy of the tracker within 20 pixels. For robustness reasons, the per-video results are summarized using the median results over the whole dataset.

Some authors [21, 46] suggest the usage of precision and success plots, where distance and overlap precision respectively are plotted over a range of thresholds. This says much more about the performance, but makes the task of comparing different methods harder. In this chapter, precision and success plots are used to compare the overall performance of different trackers. The average precision values over the set of sequences were used. In both types of plots, a ranking score is computed to simplify the interpretation of the results. In the precision plots, the DP-value at 20 pixels is used. The area-under-the-curve (AUC) is used as the ranking score in the success plots.
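As a concrete reference, the three measures in (5.1) can be computed as follows. This is a minimal numpy sketch, not the benchmark code of [46], and it assumes that center locations are given as (x, y) pixel coordinates and bounding boxes as (x, y, width, height) rows.

```python
import numpy as np

def cle(p_est, p_gt):
    """Average center location error (5.1a): mean L2 distance in pixels."""
    return np.mean(np.linalg.norm(p_est - p_gt, axis=1))

def dp(p_est, p_gt, d=20):
    """Distance precision (5.1b): fraction of frames with center error <= d."""
    return np.mean(np.linalg.norm(p_est - p_gt, axis=1) <= d)

def op(boxes_est, boxes_gt, b=0.5):
    """Overlap precision (5.1c): fraction of frames whose PASCAL
    intersection-over-union exceeds b. Boxes are (x, y, w, h) rows."""
    x1 = np.maximum(boxes_est[:, 0], boxes_gt[:, 0])
    y1 = np.maximum(boxes_est[:, 1], boxes_gt[:, 1])
    x2 = np.minimum(boxes_est[:, 0] + boxes_est[:, 2], boxes_gt[:, 0] + boxes_gt[:, 2])
    y2 = np.minimum(boxes_est[:, 1] + boxes_est[:, 3], boxes_gt[:, 1] + boxes_gt[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)  # intersection area
    union = boxes_est[:, 2] * boxes_est[:, 3] + boxes_gt[:, 2] * boxes_gt[:, 3] - inter
    return np.mean(inter / union >= b)
```

Sweeping d in `dp` and b in `op` over a range of thresholds yields exactly the precision and success plots described above.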
The ranking scores are displayed in brackets next to the tracker names in the legend of each plot. See [46] for more details.

5.1.2 Dataset

The benchmark evaluation of [46] includes a dataset of 50 image sequences. 35 of these are color sequences. Additionally, another 6 benchmark color sequences were added to the dataset, namely: Kitesurf, Shirt, Surfer, Board, Stone and Panda. The resulting set of 41 color sequences was used for all evaluations in this chapter, except in the initial comparison between MOSSE, CSK and RCSK on grayscale sequences. In that case, the full set of 56 sequences was used. The sequences pose many challenging situations. [46] provides an annotation of their 50 sequences with 11 different attributes, which describe the challenges encountered in each sequence. The attributes are: fast motion, motion blur, illumination changes, scale variation, heavy occlusions, in-plane and out-of-plane rotations, deformation, out of view, background clutter and low resolution. This is used to make attribute-based comparisons, which can reveal interesting strengths and weaknesses of different trackers.

5.1.3 Trackers and Parameters

All trackers are evaluated using the same parameters over the whole dataset. This requirement is commonly used to prevent over-tuning. For the trackers explained in chapter 2, including the proposed RCSK, the standard parameters suggested by [21] are used, including the same Gaussian kernel. The parameters and used values are displayed in table 2.1. The kernel bandwidth σκ does not have any effect for the MOSSE tracker, or when a linear kernel is used for either RCSK or CSK. The dimensionality reduction learning rate µ is set to 0.1 for linear and 0.15 for non-linear kernels, when used with the RCSK. The code for the proposed tracker versions was implemented in Matlab (no mex-functions were used).
For the original CSK tracker, the Matlab code provided by the authors was used, but modified to support multidimensional feature maps. The evaluated MOSSE tracker was implemented in Matlab as well. Note that this is not exactly the same tracker as proposed in [8], but rather a simplification of it. There are no random initial examples and no failure detection (see [8]). The implementation of the MOSSE tracker that was evaluated here is similar to the implementation of the CSK and RCSK. The only differences are that it uses other equations for learning and updating the model, as well as for calculating the tracking scores. Interestingly, it was shown in [21] that a simple implementation of the core tracking functionality of the MOSSE tracker, i.e. (2.4) and (2.1), resulted in a better tracker than using the code provided by the authors.

In the state-of-the-art evaluation of section 5.5, the code or binaries for the compared trackers were either obtained from [46] or from the authors. They are used with the suggested default parameters.

5.2 Circulant Structure Trackers Evaluation

This section contains the comparison between different variants in the family of circulant structure trackers presented in chapter 2. This includes the proposed variant RCSK. Three experiments were done. Firstly, RCSK, CSK [21] and MOSSE [8] were compared for grayscale features. Secondly, the trackers were compared for multidimensional features, specifically grayscale together with color names. Five trackers were evaluated in that experiment. The proposed robust learning scheme was used with both a linear and a Gaussian kernel. The resulting methods are called RCS and RCSK respectively.

[Figure 5.1: Comparison between MOSSE, CSK and RCSK for grayscale features. Precision plot with DP(20) ranking scores: MOSSE [0.589], CSK [0.561], RCSK [0.554].]
Similarly, the learning scheme of [21] was used with a linear and a Gaussian kernel, called CS and CSK respectively. These were compared with a straightforward generalization of the MOSSE tracker to multidimensional features. In the third experiment, RCSK was compared with CSK for all the color features mentioned in section 3.1.

5.2.1 Grayscale Experiment

This experiment compares the RCSK with CSK and MOSSE when just using grayscale features. For this reason, all the 56 sequences in the dataset are employed. The results are presented in figure 5.1. The same grayscale features as applied in the original CSK code provided by the authors were used for all trackers. For color sequences, these are computed using the rgb2gray function in Matlab. The precision plot shows a slight advantage for the linear kernel (i.e. the MOSSE tracker). The CSK and RCSK perform similarly. Qualitatively, it can be seen that although the linear kernel provides better results on average, it has stability issues in situations with significant illumination changes. This is most clearly seen in the shaking and skating1 sequences. In shaking there is drastically increasing back-light for a few frames. Figure 5.2 shows the frame where MOSSE fails and loses track of its target. The strong back-light and blooming in frame 59 compared to the previous frame completely corrupt the score function of the MOSSE tracker (figure 5.2b), while the other two trackers are able to track through this frame robustly (figures 5.2d and 5.2f). MOSSE fails in a similar way in skating1 when the target (an ice skater) moves from an illuminated area in the scene to a much darker area. Also in this case, CSK and RCSK manage to track the target through those frames.

[Figure 5.2: Frames 58 and 59 from the shaking sequence, for MOSSE (a, b), CSK (c, d) and RCSK (e, f). The score function is shown in red and the tracking output bounding box in green. The tracked target in this sequence is the head of the guitarist. MOSSE fails in frame 59 (figure 5.2b), where the back-light from the spotlight in the background has increased significantly compared to the previous frame (figure 5.2a). The kernelized versions CSK (figures 5.2c and 5.2d) and RCSK (figures 5.2e and 5.2f) are able to track through these frames robustly.]

5.2.2 Grayscale and Color Names Experiment

This experiment is similar to the one described in the previous section. The difference is that in addition to grayscale features, color names were used as well. The motivation for using color names here is that they are shown to be the best performing evaluated color descriptor in section 5.3. The evaluations in this experiment were done on the set of 41 color sequences. Five trackers were evaluated in this experiment, of which two are versions of the proposed tracker.

• RCS: the proposed robust learning scheme with a linear kernel.
• RCSK: the proposed robust learning scheme with a Gaussian kernel.
• CS: the learning scheme of [21] with a linear kernel.
• CSK: the original tracker of [21], which uses a Gaussian kernel.
• MOSSE: a straightforward generalization of the standard MOSSE tracker to multidimensional features using (2.12) and (2.4).

This means that RCS and RCSK use algorithm 2.1 with different kinds of kernels.

[Figure 5.3: Comparison between RCS, RCSK, CS, CSK and MOSSE for combined grayscale and color names features. (a) Precision plot with DP(20) ranking scores: RCS [0.686], RCSK [0.674], MOSSE [0.669], CSK [0.641], CS [0.628]. (b) Median CLE (in pixels) and DP (in percent) over all sequences; the two best results in red and blue fonts respectively:]

Tracker | Median CLE | Median DP
RCS     | 13.3       | 83.8
RCSK    | 13.8       | 81.4
CS      | 21         | 66.4
CSK     | 16.9       | 74
MOSSE   | 15.6       | 78.2
By "linear kernel" it is meant that the standard inner product is applied as the kernel, κ(f, g) = ⟨f, g⟩, as in the MOSSE tracker. The results from the experiments are shown in figure 5.3. The best results were obtained by using the robust learning scheme, i.e. RCS and RCSK. It can clearly be seen that the learning scheme of [21] (CSK and CS) is suboptimal in this case. Further, there is a slight advantage in using a linear kernel over a Gaussian kernel.

5.2.3 Experiments with Other Color Features

To investigate if the robust learning scheme proposed in section 2.3 performs better in general, it was compared with the CSK for all the color features mentioned in section 3.1. The RCSK and CSK trackers as described in section 5.2.2 were applied with varying color features. Color representations with no inherent intensity channel (RGB, rg, HUE, Opp-Angle, SO and CN) were concatenated with the conventional intensity channel, obtained by the Matlab function rgb2gray.

[Figure 5.4: Comparison of the original update scheme (CSK) with the proposed learning method (RCSK) using median distance precision (in percent), shown per color feature.]

Figure 5.4 displays the median distance precisions for each color feature. The robust update scheme performs better for 9 out of the 11 evaluated color descriptors. The effect is most apparent for the high-dimensional color descriptors Opp-Angle and HUE. There is also a significant performance gain when using CN, YCbCr, rg and HSV. Most importantly, the precision is increased for the best performing color descriptors.

5.3 Color Evaluation

In section 5.2 it was shown that the proposed learning scheme generally performs better for varying color features. The RCSK was therefore chosen for evaluating the different color representations. All color features in section 3.1 were included. These are also compared with using intensity features alone.
As described in section 5.2.3, these intensity features (obtained by rgb2gray) were concatenated with the color representations with no inherent intensity channel (RGB, rg, HUE, Opp-Angle, SO and CN).²

²It was noted that using the inherent intensity component, e.g. the L-component in LAB, gives better results than changing it to the usual intensity features (rgb2gray).

5.3.1 Results

The results from the experiment are shown in figure 5.5.

[Figure 5.5: Color evaluation results. The performance of the RCSK tracker was evaluated for the color features discussed in section 3.1 and for intensity alone. (a) Precision plot with DP(20) ranking scores: I+CN [0.674], I+AOpp [0.654], HSV [0.616], LAB [0.609], I+HUE [0.600], C [0.586], I+RG [0.581], YCbCr [0.575], Opp [0.540], I+RGB [0.531], I [0.515], I+SO [0.359]. (b) Median CLE (in pixels) and DP (in percent) over all sequences; the two best results in red and blue fonts respectively:]

Feature | Median CLE | Median DP
I       | 42.8       | 48.5
I+RGB   | 42.3       | 50.4
LAB     | 22.3       | 64.3
YCbCr   | 25.6       | 59
I+RG    | 24.8       | 59.2
Opp     | 37.9       | 51.8
C       | 23.5       | 60.2
HSV     | 27.4       | 66.5
I+SO    | 66.6       | 26.4
I+AOpp  | 17.9       | 71.5
I+HUE   | 28.2       | 56.6
I+CN    | 13.8       | 81.4

Color names obtains the best results in general, with significantly better median CLE and DP values and better average distance precision in the precision plot. The Opp-Angle (AOpp) descriptor is the clear second best choice for high precision. These two color descriptors are then followed by the set of descriptors with different photometric invariances, namely HSV, LAB, HUE, C, RG and YCbCr. The simple opponent color transformation and the standard RGB representation give a slight increase in performance compared to only intensity. Lastly, it can be seen that the SO descriptor is not well suited for the tracking purpose. It should be noted that the Opponent-Angle (AOpp) descriptor encodes shape information as well as color.
It is thus a powerful descriptor in scenarios where shape is more discriminative than color. The attribute-based results are summarized in figures 5.6 and 5.7. AOpp outperforms the other color descriptors in sequences with significant background clutter, while struggling in motion blur. AOpp and HSV perform better than color names in illumination variation due to more photometric invariance. CN proves to be the most robust descriptor at occlusions and blur. This evaluation shows that some descriptors contain complementary information, which indicates that even more powerful descriptors might be found by combining them in sophisticated ways, together with shape information.

[Figure 5.6: Precision plots showing the results of the attribute-based evaluation of different color features with the RCSK, for the attributes fast motion (14 videos), background clutter (18), motion blur (10), deformation (16), illumination variation (20) and in-plane rotation (20). The average distance precision at 20 pixels is displayed in the legends.]

[Figure 5.7: Precision plots showing the results of the attribute-based evaluation of different color features with the RCSK, for the attributes low resolution (4 videos), occlusion (24), out-of-plane rotation (28), out of view (4) and scale variation (21). The average distance precision at 20 pixels is displayed in the legends.]

5.3.2 Discussion

Three probable reasons for the success of color names over the other color descriptors can be identified. Firstly, it uses a continuous representation in the sense that common distance measures (e.g. the Euclidean distance) make sense, in the way that similar colors are close to each other.
This is also true for LAB, which is considered to be perceptually uniform, meaning that Euclidean distances in the color space reflect perceptual similarities. However, the hue component H in HSV should be interpreted as the angle in a cylindrical coordinate system, meaning that its maximum and minimum H-values correspond to the same hue. Secondly, color names is a probabilistic representation. The update scheme of the appearance template in (2.21d) can thus be interpreted as a statistical update of the color probabilities. However, for non-probabilistic representations such as LAB, the template update of (2.21d) may produce colors that have not occurred in the examples. This increases the sensitivity of the tracker towards background clutter and occlusions. The third reason is that the color name representation is trained to categorize colors the way humans do through language. Intuitively, this is a discriminative way of selecting basic colors.

Besides performance, two other attributes are important, namely compactness and computational cost. A feature descriptor should contain a minimal number of dimensions in relation to its discriminative power, to reduce computation and memory costs. Here, compact color space representations such as HSV have a clear advantage. However, with the dimensionality reduction technique introduced in chapter 4, color names can be reduced to 2 dimensions without any significant performance loss. The major drawback of AOpp is its large computational cost, since it requires computations of derivatives among other things. The color names representation only requires the computation of indices for the lookup table, which can be done with simple integer arithmetic. The next step is just a memory access, which is fast for processors with reasonably sized caches. The median tracking frame rate is in fact over 10 times higher for color names compared to AOpp, while similar to HSV.
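To illustrate why the color name lookup is so cheap, the sketch below mimics the indexing into a 32 × 32 × 32 table that maps quantized RGB values to the 11 basic color name probabilities (the normalized variant used in this thesis has D1 = 10 dimensions). The table here is a random stand-in for the real learned mapping of van de Weijer et al., and the exact channel ordering of the index is an assumption made for the example.

```python
import numpy as np

# Stand-in for the real 32x32x32 color-names lookup table, whose rows are
# 11-dimensional color-name probability vectors. A random table is used
# here purely to illustrate the indexing arithmetic.
rng = np.random.default_rng(0)
w2c = rng.random((32 * 32 * 32, 11))
w2c /= w2c.sum(axis=1, keepdims=True)    # normalize rows to distributions

def colornames(img):
    """Map a uint8 RGB image (H, W, 3) to (H, W, 11) color-name features.

    Only integer arithmetic and one table lookup per pixel are needed;
    the R-major index ordering below is an assumption of this sketch.
    """
    q = img.astype(np.int64) // 8        # quantize each channel to 32 bins
    idx = q[..., 0] * 32 * 32 + q[..., 1] * 32 + q[..., 2]
    return w2c[idx]                      # one memory access per pixel

img = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)
cn = colornames(img)
```

In contrast, a derivative-based descriptor such as AOpp requires filtering the whole image before any per-pixel value is available, which explains the large frame rate gap reported above.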
In conclusion, it has been demonstrated that color information has the potential to drastically increase the tracking performance. However, the choice of color descriptor is crucial. Simple representations such as RGB and opponent give only a negligible improvement compared to using only grayscale features. The standard photometric invariant descriptors are significantly better suited for the tracking task. However, color names is the best choice for the RCSK. Not only does it give the best overall results, it is also surprisingly inexpensive to compute.

5.4 Adaptive Dimensionality Reduction Evaluation

This section evaluates the impact of the dimensionality reduction technique presented in chapter 4. This is a general technique that can be applied to any type of features for the RCSK tracker. However, since color names was shown to be the generally best color descriptor for the RCSK in section 5.3, it was only applied to that descriptor in these experiments. The goal is thus to create a more compact color name representation, without any significant performance loss.

The evaluated tracker versions that included the dimensionality reduction were implemented as algorithm 4.2. The color names features are compressed independently. This proved to give better results than compressing the intensity channel together with the color names. The learning rate parameter µ for the dimensionality reduction is set to 0.1 for RCS (linear kernel) and 0.15 for RCSK (Gaussian kernel). Two experiments are presented in this section. The first experiment investigates the impact of the number of output feature dimensions. The second experiment evaluates the effect of the dimensionality reduction on RCS and RCSK from section 5.2.2 with color names. Computational cost and performance are of equal interest in these experiments.

5.4.1 Number of Feature Dimensions

The RCSK (i.e. with a Gaussian kernel) was used to evaluate the impact of the number of reduced dimensions. The normalized color names have D1 = 10 dimensions. In this experiment they were compressed to D2 = 1, 2, ..., 9 dimensions using algorithm 4.2. This is compared to no compression at all, i.e. RCSK with intensity and color names (I+CN), and to using zero dimensions, i.e. RCSK with only intensity. The results are displayed in figure 5.8. The compressed feature representation is named I+CND2, where D2 is the compressed color name dimension.

[Figure 5.8: Evaluation of the number of dimensions that color names is compressed to, using the dimensionality reduction presented in chapter 4. The number next to "CN" denotes the number of dimensions used. The RCSK is used for this evaluation. (a) Precision plot with DP(20) ranking scores: I+CN [0.674], I+CN6 [0.668], I+CN9 [0.666], I+CN8 [0.666], I+CN2 [0.664], I+CN7 [0.661], I+CN3 [0.652], I+CN4 [0.650], I+CN5 [0.633], I+CN1 [0.577], I [0.515]. (b) Median CLE (in pixels), DP (in percent) and frame rate (in FPS) over all sequences; the two best results in red and blue fonts respectively:]

Feature | Median CLE | Median DP | Median FPS
I       | 42.8       | 48.5      | 152
I+CN1   | 26.6       | 67.3      | 106
I+CN2   | 14.3       | 79.3      | 105
I+CN3   | 14.9       | 76.7      | 98.9
I+CN4   | 16.3       | 69.9      | 89.6
I+CN5   | 20         | 70.2      | 85.3
I+CN6   | 13.8       | 81.9      | 81.1
I+CN7   | 13.6       | 78.9      | 77.9
I+CN8   | 13.8       | 81.4      | 71.8
I+CN9   | 13.8       | 81.4      | 69.2
I+CN    | 13.8       | 81.4      | 78.9

From the results of this experiment, it is clear that no significant gain is obtained by using more than two dimensions. However, using only one dimension gives inferior results, while hardly increasing the tracker speed. Thus, CN2 is chosen for the final representation.
Note that CN needs to be compressed to 6 or fewer dimensions to overcome the computational overhead introduced by the dimensionality reduction. In general this depends on the target size though.

5.4.2 Final Performance

In this experiment, the RCS and RCSK with intensity and color names (I+CN) (as in section 5.2.2) are compared with the respective trackers when color names are compressed to 2 dimensions (I+CN2), which was found to be optimal in the previous experiment. The results are shown in figure 5.9. The performance loss is minor in the precision plot (average distance precision). But the speed gain is 45% for RCS and 33% for RCSK, which is significant.

[Figure 5.9: Comparison between the color names and compressed color names for RCS and RCSK. (a) Precision plot with DP(20) ranking scores: RCS I+CN [0.686], RCS I+CN2 [0.676], RCSK I+CN [0.674], RCSK I+CN2 [0.664]. (b) Median CLE (in pixels), DP (in percent) and frame rate (in FPS) over all sequences; the two best results in red and blue fonts respectively:]

Tracker    | Median CLE | Median DP | Median FPS
RCS I+CN   | 13.3       | 83.8      | 94.1
RCS I+CN2  | 15.3       | 75.7      | 136
RCSK I+CN  | 13.8       | 81.4      | 78.9
RCSK I+CN2 | 14.3       | 79.3      | 105

5.5 State-of-the-Art Evaluation

This section presents an extensive evaluation of the proposed trackers against state-of-the-art methods from the literature. Two proposed versions are compared to the existing methods, namely RCS with intensity and color names (RCS CN) and RCS with intensity and compressed color names (RCS CN2). The "I" in the naming convention is dropped. The Gaussian kernel versions are omitted since they proved to be inferior in section 5.4.2. The proposed methods are compared with 20 trackers from the literature.
These are: CT [50], TLD [24], DFT [38], EDFT [14], ASLA [23], L1APG [4], CSK [21], SCM [52], LOT [32], CPF [35], CXT [13], Frag [1], IVT [36], ORIA [45], MTT [51], BSBT [40], MIL [3], Struck [19], LSHT [20] and LSST [44]. These methods include the top four performing trackers in the recent benchmark evaluation [46], namely Struck, SCM, TLD and CXT.³ Also ASLA, CSK, DFT and L1APG were among the top trackers in this evaluation. EDFT, LSHT and LSST are recent trackers in the literature that were not included in the benchmark evaluation. Again, all 41 color sequences are used. The results are presented using both precision plots (distance precision) and success plots (overlap precision).

The overall results are presented in figure 5.10. The two proposed trackers outperform, or perform favorably compared to, the other evaluated methods in all evaluation metrics. Struck is the best performing method among the compared existing trackers. It uses powerful learning methods, namely a kernelized structured output support vector machine. It is therefore noteworthy that it is outperformed by a tracker using a simpler learning method, namely a modified least squares classifier. It should also be noted that neither Struck nor RCS estimates scale variations. For this reason, ASLA and SCM, which use an affine tracking model, obtain better overlap precisions at high overlap thresholds b > 0.7. However, the robustness of these trackers seems to be far less than that of RCS and Struck.

The computational cost is another important aspect of this evaluation. It should be noted that the RCS versions run at an order of magnitude higher frame rate than Struck (which is a C++ implementation). RCS CN2 runs at the second highest median frame rate, only 10% below CSK. Also CPF and CT obtain notable frame rates, but provide inferior performance. ASLA and SCM perform rather well in this evaluation, but they are not feasible for real-time applications.
The precision plots of the attribute-based results are shown in figures 5.11 and 5.12. The corresponding success plots are shown in figures 5.13 and 5.14. The proposed trackers perform favorably in most of these attributes. The results are especially good in attributes that are related to appearance changes, namely motion blur, deformation, illumination variation, in-plane rotation, occlusion and out-of-plane rotation. The reason for this is probably a combination of robust learning and robust features. Struck performs better in sequences with fast motion. This is related to the negative effects of the windowing operation discussed in section 2.4.2. It should be mentioned that ASLA and SCM naturally have an advantage in sequences with large scale variations, since they are able to estimate this property.

5.6 Conclusions and Future Work

In this chapter it has been shown that the proposed RCS and RCSK trackers perform favorably compared to state-of-the-art trackers on a large number of benchmark sequences. In particular, their combination of performance and speed is unmatched among the evaluated trackers. The framework is however still quite simple, so further improvements are expected to be possible.

³ Unfortunately, code or binaries could not be obtained for all of the top 10 trackers from this comparison.

Figure 5.10: Comparison with state-of-the-art methods in literature. The trackers proposed in this thesis are shown in bold font. For clarity, only the top 10 performing methods are displayed in the plots. (a) Precision plot, with the distance precision at 20 pixels in the legend: RCS CN [0.686], RCS CN2 [0.676], Struck [0.639], EDFT [0.528], CSK [0.526], LSHT [0.511], ASLA [0.505], TLD [0.498], CXT [0.484], LOT [0.481]. (b) Success plot, with the area under the curve in the legend: RCS CN [0.484], RCS CN2 [0.473], Struck [0.459], ASLA [0.417], EDFT [0.401], CSK [0.377], SCM [0.377], LSHT [0.375], TLD [0.369], DFT [0.358]. (c) Table of the median CLE (in pixels), DP (in percent) and frame rate (in FPS) over all the sequences. The two best results are displayed in red and blue fonts respectively.

             CLE     DP      FPS
  RCS CN    13.3    83.8    94.1
  RCS CN2   15.3    75.7   136
  Struck    19.6    71.3    10.4
  LSHT      32.3    55.9    12.5
  CPF       41.1    37.1    55.5
  CXT       43.8    39.5    11.3
  DFT       47.9    41.4     9.11
  CSK       50.3    54.5   151
  MIL       51.9    35.5    11.6
  EDFT      53.5    49      19.7
  SCM       54.3    34.1     0.0862
  TLD       54.4    45.4    20.7
  ASLA      56.8    42.2     0.946
  BSBT      58.5    20.9     3.45
  LOT       60.9    37.1     0.467
  L1APG     62.9    28.9     1.03
  MTT       67.8    32.3     0.378
  Frag      70.8    38.7     3.34
  ORIA      72.5    22.5     7.92
  LSST      78.4    23.4     3.57
  CT        78.4    20.8    68.9
  IVT       94.3    22.4    14.2

There are a number of issues that could be addressed in future development. In section 5.3 it was noted that it should be possible to combine color and shape descriptors to obtain better discrimination of the target. The possibility of feature selection also comes in here, which would provide a way of selecting the most discriminative feature combinations. This is related to the dimensionality reduction presented in chapter 4. The current method essentially uses the amount of structure, or variance, to determine the importance of different feature combinations. Another option would be to do feature selection/reduction based on supervised techniques, aiming to select the most discriminative feature combinations. One factor that is limiting in some situations is that the RCS and RCSK trackers do not estimate scale.
This limitation should be addressed in future research, since it may give a significant overall performance gain as well. This is motivated by the success of ASLA and SCM in sequences with scale variation (see figures 5.12 and 5.14). As discussed in section 8.1.2, the tracker model can be rescaled accurately at essentially no cost, since it can be stored entirely in the Fourier domain. This opens up the possibility of applying a brute-force search in the scale dimension, but to preserve the low computational cost, more sophisticated techniques would probably be needed. A few issues are apparent from the attribute-based results in section 5.5. One of these is fast target motion. A few techniques might be used to address this problem. A first simple thing to try is motion prediction, as discussed in section 2.4.2. This problem is also related to general failure detection and re-detection, which is currently not present in the algorithm. Like most generic trackers, RCS and RCSK have a very local view of the scene, meaning that they only consider the target and the surrounding background. In contrast, the TLD tracker applies a whole framework for the purpose of re-detection in failure cases. Failures also often occur at occlusions, which is a major challenge. Although the RCS is shown to be fairly robust to partial occlusions and short full occlusions compared to other trackers, long-term occlusions remain a major problem. To address these sorts of issues, I think that it is necessary for the framework to take multiple hypotheses into account, and to evaluate them over time using different tracking models. Although failure detection and handling may be important for the generic tracking scenario, it may not be desired when a tracker acts as part of a larger system, which is often the case in real-world applications. One example is people tracking, which is discussed in the second part of this thesis.
In that case there is distinct information, provided for example by a person detector, which can be used for failure detection and re-detection. The trackers proposed here might be especially well suited as parts of such larger systems, thanks to their speed and simplicity. Another interesting property is that the tracker outputs a dense set of confidence scores, which can be fused with confidences from other system parts.
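The brute-force scale search suggested in the conclusions above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the template spectrum interface `model_fft` is hypothetical, and nearest-neighbour resampling is used purely for brevity.

```python
import numpy as np

def detect_scale(model_fft, patch, scales=(0.95, 1.0, 1.05)):
    """Brute-force scale search sketch: resample the test patch by each
    candidate factor, correlate it with the learnt template in the Fourier
    domain, and keep the factor giving the strongest response."""
    h, w = model_fft.shape
    best_score, best_scale = -np.inf, None
    for s in scales:
        # crude nearest-neighbour resampling of the patch by factor s
        ys = np.clip((np.arange(h) * s).astype(int), 0, patch.shape[0] - 1)
        xs = np.clip((np.arange(w) * s).astype(int), 0, patch.shape[1] - 1)
        z = patch[np.ix_(ys, xs)]
        # circular correlation evaluated via the FFT
        response = np.real(np.fft.ifft2(model_fft * np.conj(np.fft.fft2(z))))
        if response.max() > best_score:
            best_score, best_scale = response.max(), s
    return best_scale

# sanity check: a Gaussian blob template matches itself best at scale 1
yy, xx = np.mgrid[0:16, 0:16]
template = np.exp(-((yy - 8.0) ** 2 + (xx - 8.0) ** 2) / 8.0)
scale = detect_scale(np.fft.fft2(template), template, scales=(1.3, 1.0))
```

As noted above, evaluating a handful of scales this way multiplies the per-frame cost, which is why more sophisticated techniques would probably be needed in practice.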
Figure 5.11: Precision plots showing the attribute-based results of the state-of-the-art evaluation. The plots display the distance precision for the evaluated attributes fast motion, background clutter, motion blur, deformation, illumination variation and in-plane rotation. The trackers proposed in this thesis are shown in bold font. For clarity, only the top 10 performing methods are displayed in the plots. The value appearing in the title denotes the number of videos associated with the respective attribute. The average distance precision at 20 pixels is displayed in the legends.

Figure 5.12: Precision plots showing the attribute-based results of the state-of-the-art evaluation.
The plots display the distance precision for the evaluated attributes low resolution, occlusion, out-of-plane rotation, out of view and scale variation. The trackers proposed in this thesis are shown in bold font. For clarity, only the top 10 performing methods are displayed in the plots. The value appearing in the title denotes the number of videos associated with the respective attribute. The average distance precision at 20 pixels is displayed in the legends.

Figure 5.13: Success plots showing
the attribute-based results of the state-of-the-art evaluation. The plots display the overlap precision for the evaluated attributes fast motion, background clutter, motion blur, deformation, illumination variation and in-plane rotation. The trackers proposed in this thesis are shown in bold font. For clarity, only the top 10 performing methods are displayed in the plots. The value appearing in the title denotes the number of videos associated with the respective attribute. The area under the curve is displayed in the legends.

Figure 5.14: Success plots showing the attribute-based results of the state-of-the-art evaluation.
The plots display the overlap precision for the evaluated attributes low resolution, occlusion, out-of-plane rotation, out of view and scale variation. The trackers proposed in this thesis are shown in bold font. For clarity, only the top 10 performing methods are displayed in the plots. The value appearing in the title denotes the number of videos associated with the respective attribute. The area under the curve is displayed in the legends.

Part II
Category Object Tracking

6 Tracking Model

In many applications, the problem is to automatically track all objects of a certain category. Examples of such applications include automated surveillance and safety systems in cars. These problems contain additional challenges compared to generic tracking: the system needs to automatically detect new objects and track them throughout the scene. On the other hand, compared to generic tracking there is additional a priori information that can be exploited to improve the robustness and accuracy. This information consists of the general appearance of the object class, e.g. humans. If multiple objects are tracked simultaneously, the interactions between them can be used to further improve the tracking. This second part of the thesis describes the automatic category tracker that was developed and implemented. This chapter gives an overview of the complete framework and describes the object and observation model on which the framework is built. Section 6.1 gives an overview of the framework. Section 6.2 describes how individual objects are modelled using deformable part models and dynamics. Section 6.3 describes the measurement model and section 6.4 describes how the objects are tracked by applying the Rao-Blackwellized Particle Filter (see section A.4) to the model.

6.1 System Overview

Most category object trackers exploit the class-specific and object-specific appearance along with a motion model.
The implemented tracker, however, also incorporates tracking of specific object parts into the framework. One motivation for this is that such information can be important in e.g. action recognition. Consider for example tracking a human along with its hands and feet. Actions such as "walking" and "playing guitar" can then potentially be detected. However, the main motivation for incorporating part tracking is to improve the tracking of the object itself.

A short overview of the system is as follows. Images from the sequence are processed by two different system parts. The object detector produces dense score functions of the class-specific object and part appearances. The appearance tracking produces score functions of the object- and part-specific appearances, based on the learnt appearance of the object and parts. These two sources of information are combined in a Bayesian filtering step, together with the dynamic model of the object. The results from the filtering are used to estimate the new location of the object in the image. The estimate is then used to update the learnt appearances of the object and parts. Tracked objects interact in two ways within the framework. Location estimates of all objects and parts are used to detect inter-object occlusions. This information is then used in the filtering step and in the appearance learning. The current location estimates are also used together with the object detections to find and initialize new objects in the scene and to remove false objects.

6.2 Object Model

This section presents the dynamic object model, which is formulated as a state space model.

6.2.1 Object Motion Model

A popular assumption in visual tracking is a constant velocity motion model [22, 2, 9]. If z is the object position in Cartesian coordinates (usually one, two or three dimensional), then the constant velocity model can be expressed as in (6.1).
\dot{z}(t) = v(t)   (6.1a)
\dot{v}(t) = w_v(t)   (6.1b)

The first equation in the model defines v as the velocity of the object. w_v is noise, which gives some flexibility in the model. Usually w_v is assumed to be white and Gaussian. A discretization of this model is given in (6.2), where T is the sample time. The discretization is done using zero-order hold [18], where the noise w_v is assumed to be constant during each sample interval.

z_{t+1} = z_t + T v_t + \frac{T^2}{2} w_t^v   (6.2a)
v_{t+1} = v_t + T w_t^v   (6.2b)

The scale (size) s_t of the object is modelled as constant with some process noise. Such a model is valid when the relative motion in the direction of the optical axis is small or when the tracked objects are far away.

s_{t+1} = s_t + s_t w_t^s   (6.3)

It is physically more correct to let the noise w_t^s model the relative change in scale. This is the motivation for scaling the noise with s_t. Only uniform scale is considered in this work, i.e. s_t is scalar.

6.2.2 Part Deformations and Motion

The model of the part locations should include the deformation costs of the parts and a motion model. The state space model in (6.4) has been constructed using only the deformation costs.

z_{t+1}^j = s_t a^j + s_t u_t^j   (6.4a)
u_t^j \sim \mathcal{N}(0, D^j)   (6.4b)

Here, z_t^j is the relative position of part j at time t, a^j is the modelled expected position of the part and u_t^j is noise modelling the uncertainty. Although this model validly describes the deformations of a part as discussed in section 7.2.3, it discards the history of the part motion given the scale, i.e. p(z_{t+1}^j | z_t^j, s_t) = p(z_{t+1}^j | s_t). The history is included by instead using a Markov model, based on a constant relative position assumption. This assumption is highly valid for parts that are more or less rigidly connected to the object, e.g. the head of a human. However, the model can also be tuned for moving parts by increasing the process noise w_t^{z^j}.
The part deformations are considered in the observation likelihood, as discussed in section 6.3.

z_{t+1}^j = z_t^j + s_t w_t^{z^j}   (6.5)

This thesis only deals with static models of the part locations. Although there is a possibility of using dynamic models, this is not investigated in this work.

6.2.3 The Complete Transition Model

The complete transition model that is used in the proposed object tracker is given in (6.6). The number of states is 2N + 5, where N is the number of object parts. In this chapter, x_t = (z_t^0, s_t, v_t, z_t^1, \ldots, z_t^N)^T is used to denote the object state at time t.

z_{t+1}^0 = z_t^0 + T v_t + s_t w_t^{z^0}   (6.6a)
v_{t+1} = v_t + T s_t w_t^v   (6.6b)
s_{t+1} = s_t + s_t w_t^s   (6.6c)
z_{t+1}^j = z_t^j + s_t w_t^{z^j},   j \in \{1, \ldots, N\}   (6.6d)
w_t^{z^j} \sim \mathcal{N}(0, Q^{z^j}),   j \in \{0, \ldots, N\}   (6.6e)
w_t^s \sim \mathcal{N}(0, Q^s)   (6.6f)
w_t^v \sim \mathcal{N}(0, Q^v)   (6.6g)

The velocity state noise w_t^v that should appear in (6.6a) according to the discretization (6.2) of the constant velocity model is discarded. The reason is that the Rao-Blackwellized Particle Filter (RBPF) (see section A.4) does not handle this well if the velocity is taken as a linear state and the position as non-linear. This noise is replaced with w_t^{z^0}, which is assumed to be independent of w_t^v. However, this modification also adds the freedom to tune the model between constant velocity and random walk. The drawback of the modification is insignificant, since the goal is to estimate the position of the object and not the true velocity.

The process noises defined in (6.6) are assumed to be mutually uncorrelated. Q^{z^j}, Q^s and Q^v are general covariance matrices. The fact that the noises are scaled with s_t is motivated by the pinhole camera model. Note that the model is non-linear and non-Gaussian. It is however linear and Gaussian conditioned on the scale s_t. It is thus applicable for the RBPF.

6.3 The Measurement Model

This section describes the measurement model that is used in the tracker.
Earlier works (such as [9]) have used detector confidences as likelihoods in filtering or optimization frameworks. The measurement model used here is inspired by [22], but extended with the deformation model for the part positions. The likelihood contains two basic factors, as described in (6.7). I_t is the image that is measured at time t. The factor p(I_t | x_t) thus describes how well the given state explains the measured image. M is the deformation model, which is static. The second factor p(M | x_t) describes how well the state fits the deformation model.

p(I_t, M | x_t) = p(I_t | x_t) \, p(M | x_t)   (6.7)

Independence is assumed between the image measurements and the deformations given the object state.

6.3.1 The Image Likelihood

The first factor in (6.7) combines information from the class-specific appearance of the object and parts with the appearance of the individual object. Object detections (see chapter 7) are used as class-specific appearance. p(\theta_t^j | z, s) denotes the probability of detecting the object (for j = 0) or part j (otherwise) at position z and scale s in image I_t. Section 7.2.2 describes how this factor is computed. The object appearance is modelled in a similar way. Assume that there is a detector trained on the appearance of the specific object. p(\varphi_t^j | z) denotes the probability of detection, analogously to the object detection. Section 8.1.2 describes how this factor is computed using the proposed K-MOSSE tracker described in the first part of this thesis. The image likelihood is modelled by (6.8).

p(I_t | x_t) = p(\theta_t^0 | z_t^0, s_t) \, p(\varphi_t^0 | z_t^0) \prod_{j=1}^{N} p(\theta_t^j | z_t^0 + z_t^j, s_t) \, p(\varphi_t^j | z_t^0 + z_t^j)   (6.8)

The model assumes independence between the object and part detections conditioned on the state. This is intuitively a valid approximation in an occlusion-free environment. However, if occlusions are considered, the detections are likely to be correlated.
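The factorization in (6.8) can be sketched directly as a product of per-part factors. In the sketch below, the score functions are hypothetical stand-ins for the detector likelihoods p(θ|z, s) and the appearance likelihoods p(φ|z); the real system evaluates dense score maps rather than callables.

```python
import numpy as np

def image_likelihood(det_scores, app_scores, z0, parts, s):
    """Sketch of the image likelihood (6.8): the product of class-detector
    and object-appearance likelihoods evaluated at the root position z0 and
    at each part position z0 + z_j."""
    lik = det_scores[0](z0, s) * app_scores[0](z0)
    for j, zj in enumerate(parts, start=1):
        pos = z0 + zj
        lik *= det_scores[j](pos, s) * app_scores[j](pos)
    return lik

# toy likelihoods, all peaked at the origin
det = [lambda z, s: float(np.exp(-z @ z))] * 3
app = [lambda z: float(np.exp(-0.5 * z @ z))] * 3
val = image_likelihood(det, app, np.zeros(2), [np.zeros(2), np.zeros(2)], 1.0)
```

At the joint peak every factor equals one, and the likelihood decays as any position moves away from its peak, which is the behaviour the filtering step exploits.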
The model also assumes independence between the detections from the class appearance detector and the object appearance detector conditioned on the state. It can be argued that this is a reasonable approximation if the detectors use different features, though it is not valid at occlusions.

6.3.2 The Model Likelihood

The second factor in (6.7) exploits the information in the known deformable parts model. This factor is modelled by (6.9).

p(M | x_t) = \prod_{j=1}^{N} p(a^j | z_t^j, s_t) = \prod_{j=1}^{N} \mathcal{N}\!\left(a^j; \frac{z_t^j}{s_t}, D^j\right)   (6.9)

a^j is the known mean position of the part and D^j is a covariance matrix describing the deformations. Section 7.2.3 describes how these values are computed.

6.4 Applying the Rao-Blackwellized Particle Filter to the Model

The Rao-Blackwellized Particle Filter (RBPF) is a Bayesian filtering algorithm that exploits linear-Gaussian substructures in the state space model. It uses a particle filter to approximate the set of non-linear states x_t^n and Kalman filters for the remaining states x_t^l, which are linear and Gaussian conditioned on x_t^n. The algorithm is described in section A.4. The state space model in (6.6) is only linear and Gaussian conditioned on the scale s_t, so this state has to be included in x_t^n. All states except the velocity states v_t appear as non-Gaussian in the measurement model because of (6.8). This implies that only v_t can truly be included in x_t^l. However, since the main goal is to approximate the object position and scale, we will assume a Gaussian approximation of the part measurements. The partitioning of the states is thus done as in (6.10).

x_t^n = \begin{pmatrix} z_t^0 \\ s_t \end{pmatrix}, \quad x_t^l = \begin{pmatrix} v_t \\ z_t^1 \\ \vdots \\ z_t^N \end{pmatrix}   (6.10)

6.4.1 The Transition Model

The state transition model in (6.6) with the partitioning of states in (6.10) is a special case of the RBPF model in (A.14). The functions and matrices of the RBPF model are identified in (6.11).
f_t^n(x_t^n) = x_t^n, \quad f_t^l(x_t^n) = 0_{2N+2 \times 1}   (6.11a)
A_t^n(x_t^n) = \begin{pmatrix} T I_{2 \times 2} & 0_{2 \times 2N} \\ 0_{1 \times 2} & 0_{1 \times 2N} \end{pmatrix}, \quad A_t^l(x_t^n) = I_{2N+2 \times 2N+2}   (6.11b)
B_t^n(x_t^n) = s_t I_{3 \times 3}, \quad B_t^l(x_t^n) = s_t \begin{pmatrix} T I_{2 \times 2} & 0_{2 \times 2N} \\ 0_{2N \times 2} & I_{2N \times 2N} \end{pmatrix}   (6.11c)

The covariances are given by (6.12).

Q_t^n = \begin{pmatrix} Q^{z^0} & 0_{2 \times 1} \\ 0_{1 \times 2} & Q^s \end{pmatrix}   (6.12a)
Q_t^l = \mathrm{blkdiag}(Q^v, Q^{z^1}, \ldots, Q^{z^N})   (6.12b)
Q_t^{ln} = 0_{3 \times 2N+2}   (6.12c)

6.4.2 The Measurement Update for the Non-Linear States

The measurement model described in section 6.3 is clearly more general than the one in (A.14), which means that the RBPF cannot be applied directly. The restrictive measurement model in the RBPF comes from the fact that it has to be linear and Gaussian in x_t^l conditioned on x_t^n. However, the standard particle filter described in section A.3 can handle general measurement models. The measurement model in section 6.3 can thus be used to update the particle filtered states x^n. A Gaussian approximation of the measurement model is then used to update the linear states x^l. This results in a variant of the RBPF that uses different measurement models for the two measurement updates in algorithm A.4: the full model is used in the particle filter measurement update and the linear-Gaussian approximation is used in the Kalman filter measurement update.

The particle filter measurement update in algorithm A.4 requires the probability density p(y_t | X_t^n, Y_{t-1}). Here y_t is the measurement, which in our case is y_t = \{I_t, M\}, and Y_t = \{y_1, \ldots, y_t\} denotes all measurements up to time t. X_t^n = \{x_1^n, \ldots, x_t^n\} denotes the trajectory of non-linear states up to time t. This probability is given in (6.13); see section B.2 for the proof.

p(y_t | X_t^n, Y_{t-1}) = g_t(s_t) \, L_t^0(z_t^0, s_t) \prod_{j=1}^{N} \left[ h_t^j(\cdot, s_t) \star L_t^j(\cdot, s_t) \right](z_t^0)
= g_t(s_t) \, L_t^0(z_t^0, s_t) \prod_{j=1}^{N} \int_{\mathbb{R}^2} h_t^j(z, s_t) \, L_t^j(z_t^0 + z, s_t) \, dz   (6.13)

The functions that appear here are defined in (6.14).
L_t^j(z, s) = p(\theta_t^j | z, s) \, p(\varphi_t^j | z)   (6.14a)
h_t^j(z, s) = \mathcal{N}(z; \mu_t^j(s), H_t^j(s))   (6.14b)
g_t(s) = \prod_{j=1}^{N} \mathcal{N}\!\left(\frac{1}{s} \hat{z}_{t|t-1}^j; a^j, \frac{1}{s^2} P_{t|t-1}^{z^j} + D^j\right)   (6.14c)

The mean and covariance of h_t^j are defined in (6.15).

\mu_t^j(s) = H_t^j(s) \left( (P_{t|t-1}^{z^j})^{-1} \hat{z}_{t|t-1}^j + \frac{1}{s} (D^j)^{-1} a^j \right)   (6.15a)
H_t^j(s) = \left( (P_{t|t-1}^{z^j})^{-1} + (s^2 D^j)^{-1} \right)^{-1}   (6.15b)

\hat{z}_{t|t-1}^j and P_{t|t-1}^{z^j} are the predicted mean and covariance of the state z_t^j given the trajectory X_t^n of the non-linear states, i.e. p(z_t^j | X_t^n, Y_{t-1}) = \mathcal{N}(z_t^j; \hat{z}_{t|t-1}^j, P_{t|t-1}^{z^j}). This means that \hat{z}_{t|t-1}^j and P_{t|t-1}^{z^j} depend on z_t^0 and s_t, even though this is not denoted explicitly, in order to simplify the notation. In the RBPF algorithm, these are the predictions bound to the particle X_t^{n,i} when evaluating the likelihood for that particle.

The derived particle weighting function in (6.13) has an interesting interpretation. h_t^j acts as a filter that smooths the likelihood function for part j. This reflects the uncertainty present in the predicted part locations. h_t^j also shifts the part likelihood so that it is evaluated in z_t^0 + \mu_t^j(s_t). \mu_t^j(s_t) should be seen as the predicted relative part location corrected by the deformation model. Equation (6.15) is recognized as the fusion formula of the two estimates \hat{z}_{t|t-1}^j and s_t a^j, with covariances P_{t|t-1}^{z^j} and s_t^2 D^j, of the part location. H_t^j(s_t) is the covariance of the new part location estimate. In fact, h_t^j is the probability density p(z_t^j | X_t^n, Y_{t-1}, a^j), which supports this interpretation. The function g_t(s_t) is essentially a deformation cost, where the uncertainty in the part locations has been added to the deformation covariance. This factor is the model likelihood given the non-linear states, p(M | X_t^n, Y_{t-1}).
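The fusion interpretation of (6.15) can be verified numerically: combining the predicted part location with the deformation-model prior in information form yields the familiar covariance-weighted average. The following is a minimal sketch of that computation (variable names are mine).

```python
import numpy as np

def fuse_part_estimate(z_pred, P_pred, a, D, s):
    """Information-form fusion as in (6.15): combine the predicted part
    location (z_pred, P_pred) with the deformation prior (s * a, s^2 * D)."""
    H = np.linalg.inv(np.linalg.inv(P_pred) + np.linalg.inv(s ** 2 * D))
    mu = H @ (np.linalg.inv(P_pred) @ z_pred + np.linalg.inv(D) @ a / s)
    return mu, H

# two equally confident estimates: the fused mean is their midpoint
z_pred = np.array([2.0, 0.0])
P_pred = np.eye(2) * 4.0
a = np.array([0.0, 0.0])
D = np.eye(2) * 4.0
mu, H = fuse_part_estimate(z_pred, P_pred, a, D, s=1.0)
```

With equal covariances the fused mean lands halfway between the two estimates, and the fused covariance is half of either one, as expected from the fusion formula.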
6.4.3 The Measurement Update for the Linear States

The RBPF requires a measurement model that is linear and Gaussian conditioned on the non-linear states in order to update the linear ones. This is achieved by a Gaussian approximation of (6.8). The first two factors in this equation (i.e. the object and appearance detection) do not contain any information about the part locations given z_t^0 and s_t, so they can be excluded here. Since the part measurements are mutually independent, each part can be considered individually. Consider the Gaussian approximation in (6.16) of the image likelihood for each part. The function L_t^j defined in (6.14a) is the product of the class appearance likelihood and the object-specific appearance likelihood of part j.

L_t^j(z, s) \approx k_t^j(s) \, \mathcal{N}(z; y_t^j(s), R_t^j(s))   (6.16)

y_t^j(s) and R_t^j(s) are the mean and covariance of the approximation, which depend on the scale s. The scale factor k_t^j(s) is unimportant and can be disregarded, since the scale is given when performing the measurement update for the linear states. The method for obtaining this approximation, and its validity, is discussed in section 8.2.3. The image likelihood for part j can be written as in (6.17).

L_t^j(z_t^0 + z_t^j, s_t) \sim \mathcal{N}(y_t^j(s_t); z_t^0 + z_t^j, R_t^j(s_t))   (6.17)

Here \sim denotes approximately equal up to a scale factor (for a constant s_t). The argument and mean of the normal distribution have been switched, which is possible thanks to the symmetry of the Gaussian function. y_t^j(s_t) can be regarded as a measurement. The right hand side of (6.17) is then the likelihood p(y_t^j(s_t) | x_t) of this measurement. This likelihood is equivalent to the measurement equation in (6.18a). The deformation model likelihood in (6.9) is already Gaussian and does not have to be approximated. Since the deformation likelihoods of the parts are independent, (6.9) is equivalent to the measurement equation in (6.18b) for each part.
y_t^j(s_t) = z_t^0 + z_t^j + e_t^j(s_t), \quad e_t^j(s_t) | s_t \sim \mathcal{N}(0, R_t^j(s_t))   (6.18a)
a^j = \frac{1}{s_t} z_t^j + d_t^j, \quad d_t^j \sim \mathcal{N}(0, D^j)   (6.18b)

The conditionally linear-Gaussian measurement model thus contains two measurements for each part location. The iterated measurement update in algorithm A.2 can easily be applied in the RBPF to this model, since all position measurements are mutually uncorrelated.

7 Object Detection

As described in chapter 1, category object tracking is the problem of tracking objects of a specific class, e.g. humans. This tracking problem contains additional a priori information that can be exploited. If the system is supposed to work completely automatically, it additionally has to detect objects of the category. Object detection is a well studied area in computer vision. The first section of this chapter briefly describes the object detector [15] that is used in my proposed object tracker. The remaining sections discuss how this detector is used in my framework.

7.1 Object Detection with Discriminatively Trained Part Based Models

The object detector [15] by Felzenszwalb et al. has proved to be very successful, and it still achieves state-of-the-art performance. It uses a model of deformable parts to describe the object. This section contains a brief description of this object detection framework; for more details, see [15]. This object detector will be referred to as the deformable part model (DPM) detector.

7.1.1 Histogram of Oriented Gradients

Dalal and Triggs introduced the histogram of oriented gradients (HOG) features in [12] and applied them to human detection in static images. The HOG feature map is created by first computing histograms of gradient orientations in a dense image grid of cells, which are typically 8 × 8 pixels. The histograms are constructed using soft assignment to neighbouring cells. This is then followed by a normalization step, where the histograms are normalized using the gradient energy from different neighbouring cells.
The original HOG results in a 36-dimensional feature vector for each cell. The HOG features are typically computed at many different scales. Figure 7.1 visualizes the HOG feature map for an image region. Dalal and Triggs trained a linear support vector machine (SVM) [7] on thousands of positive and negative examples of humans, using HOG features. Object detection scores at all locations and scales of an image can be computed using a sliding window search, i.e. by correlating the feature map of the image with the SVM weights.

7.1.2 Detection with Deformable Part Models

The DPM detector [15] extends the work of Dalal and Triggs to use deformable part models. The detector uses a modification of HOG features. Instead of just training a single template of SVM weights for the whole object, separate part templates are also trained for a set of object parts (e.g. head or feet for a human). Additionally, a deformable part model of the object is trained. This model includes the anchor position $v_j = (v_j^x, v_j^y)$ and the quadratic deformation cost coefficients $d_j = (d_j^1, d_j^2, d_j^3, d_j^4)^T$ for each part $j \in \{1, \dots, N\}$. Let $G_j$ be the trained SVM weights (or filters) for each part, where $j = 0$ indicates the root filter that is trained on the whole object. Let $H$ denote the HOG feature map for an image. The classification score from a filter at position $p = (x, y)$ and scale $s$ is calculated as in (7.1).¹ Note that the part scores for an object detection at scale $s$ are computed at half the scale, i.e. double the resolution.

$$\zeta_j(p, s) = \begin{cases} G_0 \star H(p, s), & j = 0 \\ G_j \star H\big(p, \tfrac{s}{2}\big), & j \in \{1, \dots, N\} \end{cases} \quad (7.1)$$

The vector defined in (7.2) contains the linear and quadratic absolute displacements of a part from its anchor position.

$$\Delta_j(p_0, p_j, s) = \begin{pmatrix} (x_j - x_0)/s - v_j^x \\ (y_j - y_0)/s - v_j^y \\ \big((x_j - x_0)/s - v_j^x\big)^2 \\ \big((y_j - y_0)/s - v_j^y\big)^2 \end{pmatrix} \quad (7.2)$$

The deformation cost for part $j$ is computed as $d_j^T \Delta_j(p_0, p_j, s)$.
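The displacement vector (7.2) and the resulting deformation cost can be sketched directly (a minimal sketch; the function name is my own):

```python
import numpy as np

def deformation_cost(p0, pj, vj, dj, s):
    """Quadratic deformation cost d_j^T Delta_j(p0, pj, s), following (7.2).

    p0, pj: (x, y) root and part positions; vj: anchor position;
    dj: the four deformation coefficients (d^1, d^2, d^3, d^4).
    """
    dx = (pj[0] - p0[0]) / s - vj[0]
    dy = (pj[1] - p0[1]) / s - vj[1]
    delta = np.array([dx, dy, dx**2, dy**2])
    return float(np.dot(dj, delta))
```

A part placed exactly at its anchor incurs zero cost; the cost grows quadratically (plus a linear term) with the normalized displacement.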
Given a trained model and SVM weights, the DPM detector calculates the final object detection score as in (7.3). Here $b$ is just a constant.

$$\zeta(p_0, s) = \max_{p_1, \dots, p_N} \sum_{j=0}^{N} \zeta_j(p_j, s) - \sum_{j=1}^{N} d_j^T \Delta_j(p_0, p_j, s) + b \quad (7.3)$$

¹Correlation ($\star$) is generalized to vector valued functions by correlating each feature layer individually and then summing the results at each position.

Figure 7.1: Visualization of the HOG features of frame 50 from the Town Centre sequence. (a) Frame 50 from the Town Centre sequence. (b) Visualization of HOG features from a part of the image in figure 7.1a containing the two persons dressed in black near the center of the image. The HOG features have been calculated at scale $s = 2^{7/10} \approx 1.62$, i.e. the image is first down-sampled by a factor $1/s$. In figure 7.1b, the magnitude of each orientation bin in a cell is visualized by the intensity of the line with the corresponding orientation.

Figure 7.2: Visualization of the INRIA-person model, which is trained on the INRIA [12] dataset. (a) Root filter SVM weights. (b) Part filter SVM weights. (c) Part placements and deformation costs. The model contains two components that are reflections of each other along the vertical axis. The SVM weight for each orientation bin is visualized as in figures 7.2a and 7.1b. The magnitudes of the deformation cost functions are visualized in figure 7.2c. The model contains eight parts.

7.1.3 Training the Detector

The classifier in (7.3) can be formulated as in (7.4).

$$f_w(x) = \max_{z \in Z(x)} w^T \Phi(x, z) \quad (7.4)$$

Here, $w$ is the vector of classifier weights, which in this case includes the filter weights $G_j$ and the deformation coefficients $d_j$. $x$ is the example to be classified and $Z(x)$ is the set of possible latent values for $x$. The part positions are latent in this case. $\Phi(x, z)$ is the extracted feature vector for the particular example $x$ and part configuration $z$.
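The latent classifier (7.4) amounts to scoring every admissible latent configuration and keeping the best one. A minimal sketch (enumerating the configurations explicitly, which is only feasible for small $Z(x)$; in the DPM detector the maximization over part placements is instead done efficiently with distance transforms):

```python
import numpy as np

def latent_score(w, features_for_configs):
    """f_w(x) = max over latent configurations z of w^T Phi(x, z), as in (7.4).

    features_for_configs: iterable of feature vectors Phi(x, z), one per z.
    """
    return max(float(np.dot(w, phi)) for phi in features_for_configs)
```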
In [15] this classifier is trained using supervised learning with latent SVM. This is done by minimizing the objective function in (7.5). $x_i$ denotes the examples and $y_i \in \{-1, 1\}$ are the corresponding labels.

$$L_D(w) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\big(0,\, 1 - y_i f_w(x_i)\big) \quad (7.5)$$

Here, $C$ is a regularization parameter. The optimization problem (7.5) is non-convex. However, a strong local optimum can be found by exploiting the semi-convexity of this function. The optimization iterates between finding the optimal part placements for the positive examples given $w$, and optimizing over $w$ given these part placements. The second step can be shown to be a convex optimization problem. For further details on this training procedure, see [15]. My proposed human tracker uses the human detector that is pretrained on the INRIA [12] dataset. The trained model is visualized in figure 7.2.

7.2 Object Detection in Tracking

One of the goals in this part of the thesis is to fuse the information from the object and part detections with appearance tracking in a probabilistic framework. This section discusses how object detection can be used in a tracking framework.

7.2.1 Ways of Exploiting Object Detections in Tracking

The popular way of using object detections in tracking is to select a sparse set of detections as observations of the object state. These can then be used in, for example, a Bayesian filtering framework, though many works reduce the problem to pure data association of a set of sparse detections [2, 39]. Such a set of detections can be obtained by simply thresholding the dense detection scores. However, the thresholding discards large amounts of information returned by the object detector. An obvious possibility is to use the whole detection score as a confidence map or likelihood of an object being present at a specific location. Breitenstein et al.
[9] pointed out that the detector score from the original HOG person detector [12] is too poor to use as a confidence map in most cases. They countered this by mostly relying on thresholded detections in their filtering framework and only trusting the detector confidence in certain specific cases. The DPM detector returns much more distinct detection confidences. Izadinia et al. [22] successfully exploited these confidences to track both humans and their parts in a non-causal framework based on graph optimization. My work exploits the dense detection scores for both the object and the separate parts. The scores obtained from the root filter are not used explicitly, but they of course contribute to the final object detection scores. The confidences returned by the full detector are of much higher quality than the confidences computed from the part filters. As with the original HOG detector, the part confidences are just the output from a linear classifier. These outputs also suffer from the fact that the appearance of, for example, a shoulder is not that discriminative. However, these flaws are countered by jointly tracking the parts and the object itself.

7.2.2 Converting Detection Scores to Likelihoods

The detection scores are computed at cell-level resolution. The cells are not overlapping in the standard version of the DPM detector. This means that the resolution of the detection score at scale $s$ is $\frac{1}{8s}$ times the resolution of the original image for the object detections, and twice that for the part detections. However, it is more practical to use pixel-dense scores. These are obtained by interpolating the detection scores with splines. The effect is illustrated in figure 7.3. To be able to use the detection scores in a probabilistic framework, the scores are transformed to values that can be interpreted as probabilities. In chapter 6 the detection scores are incorporated in the tracker as likelihoods $p(\theta_j|z, s)$.
$\theta_j$ is a binary stochastic variable that indicates the detection of the object ($j = 0$) or part ($j > 0$). $p(\theta_j|z, s)$ should then be interpreted as how likely it is to detect the object or part at position $z$ and scale $s$. In [22] detection scores are transformed to probabilities using sigmoid functions. A sigmoid function is a smooth step function, and many variants of such functions have been proposed. The type used here is given in (7.6).

$$\psi(t) = \frac{1}{1 + e^{-\alpha(t - \beta)}} \quad (7.6)$$

The parameters $\alpha > 0$ and $\beta \in \mathbb{R}$ need to be tuned for each object and part detector individually. Let $\hat\zeta_j(z, s)$ denote the interpolated detection score at position $z$ and scale $s$. These scores are simply mapped through a sigmoid function using (7.7) to obtain the corresponding likelihoods. The effect of this step is illustrated in figure 7.3.

$$p(\theta_j|z, s) = \psi_j\big(\hat\zeta_j(z, s)\big) = \frac{1}{1 + \exp\big(-\alpha_j (\hat\zeta_j(z, s) - \beta_j)\big)} \quad (7.7)$$

The sigmoid parameters are tuned based on gathered statistics of the confidence values. The cumulative distribution functions $F_j$ of the detection scores over an image are defined in (7.8). Here $x = (z, s)$ and $X$ is the set of all positions and scales in the image. In practice many images can be used. $F_j$ are approximated by computing histograms of detection scores over a set of images and computing their cumulative sums.

$$F_j(\xi) = \begin{cases} \frac{1}{|X|}\,\big|\{x \in X : \zeta(x) \le \xi\}\big|, & j = 0 \\ \frac{1}{|X|}\,\big|\{x \in X : \zeta_j(x) \le \xi\}\big|, & j \in \{1, \dots, N\} \end{cases} \quad (7.8)$$

Using the definitions of precision $P_j(\xi)$ and recall $R_j(\xi)$ at a detection threshold $\xi$, the equality in (7.9) can be derived. Here $T_j$ is the relative number of locations $x$ in the image that contain the specified object.

$$F_j(\xi) = 1 - \frac{T_j R_j(\xi)}{P_j(\xi)} \quad (7.9)$$

One way of tuning the parameters in (7.7) would be to specify the desired recall rates at the two thresholds $\lambda_1$ and $\lambda_2$, where $x$ is considered a detection if and only if $p(\theta_j|x) \ge \lambda_k$. The corresponding detection score thresholds $\xi_j^{(1)}$ and $\xi_j^{(2)}$ can then be obtained if the recall rates are known.
This gives the equation system (7.10).

$$\psi_j\big(\xi_j^{(1)}\big) = \lambda_1, \qquad \psi_j\big(\xi_j^{(2)}\big) = \lambda_2 \quad (7.10)$$

The solution is given in (7.11).

$$\alpha_j = \frac{1}{\xi_j^{(1)} - \xi_j^{(2)}}\left[\ln\!\left(\frac{1}{\lambda_2} - 1\right) - \ln\!\left(\frac{1}{\lambda_1} - 1\right)\right] \quad (7.11a)$$

$$\beta_j = \xi_j^{(1)} + \frac{1}{\alpha_j}\ln\!\left(\frac{1}{\lambda_1} - 1\right) \quad (7.11b)$$

This approach is problematic since a labelled dataset is needed to estimate the recall $R_j(\xi)$. A simple method that does not require labelled data was used in this work.

Figure 7.3: Visualization of the human detector output from the image in figure 7.1a. (a) The DPM human detector output at scale $s = 2^{7/10} \approx 1.62$ for the image displayed in figure 7.1a. (b) Spline interpolation of the detector scores in figure 7.3a to obtain pixel-dense scores. (c) The final likelihood $p(\theta_0|z, s = 2^{7/10})$ computed from the detector scores in figure 7.3b, using (7.7) with the computed values for $\alpha_0$ and $\beta_0$. Figures 7.3b and 7.3c show how these detections are transformed to pixel-dense likelihoods. Note that the high values correspond to humans in figure 7.1a with similar height in pixels. Also note how apparent the peaks are in figure 7.3c compared to figure 7.3b.

For the object detector, the desired fraction $\Lambda_k$ of detections in the images at threshold $\lambda_k$ was tuned by visual inspection of the resulting $p(\theta_0|x)$ and by performance evaluation. This gives $\xi_0^{(k)}$ from $F_0(\xi_0^{(k)}) = 1 - \Lambda_k$. A valid approximation is that $T_j = T$ if occlusions in the scene are rare. By comparing the detection scores of the part detectors with the object detector, it is possible to get a very coarse approximation of the precision of the part detectors relative to the object detector at a "high enough" recall rate. If the precision of the part detectors is assumed to be a factor $c_j$ worse at the desired recall rates, then $\xi_j^{(k)}$ can be found from $F_j(\xi_j^{(k)}) = 1 - \Lambda_k / c_j$ for $j > 0$. The sigmoid parameters are then computed using (7.11).
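The parameter computation (7.11) and the score-to-likelihood mapping (7.7) can be sketched as follows (function names are my own):

```python
import numpy as np

def fit_sigmoid(xi1, xi2, lam1, lam2):
    """Solve the equation system (7.10) for the sigmoid parameters, giving (7.11).

    xi1, xi2: detection-score thresholds; lam1, lam2: desired likelihood
    values at those thresholds (lam1 > lam2 implies xi1 > xi2, so alpha > 0).
    """
    alpha = (np.log(1.0 / lam2 - 1.0) - np.log(1.0 / lam1 - 1.0)) / (xi1 - xi2)
    beta = xi1 + np.log(1.0 / lam1 - 1.0) / alpha
    return alpha, beta

def score_to_likelihood(scores, alpha, beta):
    """Map (interpolated) detection scores through the sigmoid, as in (7.7)."""
    return 1.0 / (1.0 + np.exp(-alpha * (np.asarray(scores) - beta)))
```

By construction, a score equal to the first threshold maps to exactly the first desired likelihood, and similarly for the second.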
This method provides a more intuitive way of tuning the sigmoid parameters by letting the user select $\lambda_k$ and then $\Lambda_k$ based on that. These parameters have intuitive meanings, while it can be difficult to select $\alpha_j$ and $\beta_j$ directly. In the proposed human tracker, the sigmoid parameters were tuned using the detection scores from the first 20 frames in the Town Centre sequence [5]. The thresholds were chosen as $\lambda_1 = 0.5$ and $\lambda_2 = 0.1$. The desired fractions of detections for the object detector were set to $\Lambda_1 = 10^{-4}$ and $\Lambda_2 = 10^{-2}$. The precision factor was set to the same value for all part detectors, $c_j = 0.1$. Not much effort was invested in tuning these values, since the proposed tracker proved to be quite insensitive to variations in the mentioned parameters.

7.2.3 Converting Deformation Costs to Probabilities

The deformable parts model was fitted into a probabilistic framework by converting the deformation costs in the DPM detector to probabilities. To simplify later steps in the modelling, Gaussian probabilities are used. The deformation probabilities are parametrized as in (7.12). As in chapter 6, the part number is denoted with index $j$, $z^j$ is the position of the part relative to the object location and $s$ is the scale of the object. $a^j$ is the mean position of the part relative to the object and $D^j$ is a $2 \times 2$ covariance matrix.

$$p(z^j|s) = \mathcal{N}\big(z^j;\, s a^j,\, s^2 D^j\big) \quad (7.12)$$

The straightforward way to estimate $a^j$ and $D^j$ would be a maximum likelihood estimation given a set of images with known object and part locations and scales. However, a more ad hoc solution was adopted because of the absence of such a dataset. The proposed solution exploits the pre-trained anchor positions $v_j$ and deformation coefficients $d_j$ from the DPM detector. The mean position $a^j$ is set to the $(x, y)^T$ that minimizes the deformation cost function in (7.13). $(x, y)^T$ is the relative part position normalized with the scale of the object.
$$f_j(x, y) = d_j^1(x - v_j^x) + d_j^2(y - v_j^y) + d_j^3(x - v_j^x)^2 + d_j^4(y - v_j^y)^2 \quad (7.13)$$

The solution, given in (7.14), is easily obtained by setting the gradient to zero, $\nabla f_j = 0$.

$$a^j = \begin{pmatrix} v_j^x - \dfrac{d_j^1}{2 d_j^3} \\[6pt] v_j^y - \dfrac{d_j^2}{2 d_j^4} \end{pmatrix} \quad (7.14)$$

The covariance matrix is set as in (7.15).

$$D^j = \begin{pmatrix} 1/d_j^3 & 0 \\ 0 & 1/d_j^4 \end{pmatrix} \quad (7.15)$$

The motivation for this is that $p(z^1, \dots, z^n|s)$ can then be written as in (7.16). The model assumes independence between the part locations given the scale. Note the similarity between the argument of the exponential function and the deformation cost in (7.3). The only difference is that the anchor position has been corrected with the linear deformation terms instead of including them as extra terms.

$$p(z^1, \dots, z^n|s) = \prod_{j=1}^{n} p(z^j|s) = \frac{1}{(2\pi s^2)^n}\sqrt{\prod_{j=1}^{n} d_j^3 d_j^4}\; \exp\!\left(-\frac{1}{2}\sum_{j=1}^{n}\left[ d_j^3\left(\frac{z_x^j}{s} - a_x^j\right)^2 + d_j^4\left(\frac{z_y^j}{s} - a_y^j\right)^2 \right]\right) \quad (7.16)$$

8 Details

This chapter presents and discusses the most important details of the constructed framework for human tracking in surveillance scenes. Section 8.1 discusses the computation of the appearance likelihood and how the proposed RCSK tracker is incorporated into the framework. Section 8.2 contains the details of how the RBPF is used in the filtering step. Section 8.3 presents some details related to occlusion handling and how objects are added and removed.

8.1 The Appearance Likelihood

The image measurement model in (6.8) uses two kinds of likelihoods. $p(\theta_t^j|z, s)$ models the class specific appearance and is discussed in section 7.2.2. The second factor $p(\varphi_t^j|z)$ models the object specific appearance and does not take the object class into account. This section describes how this likelihood factor is computed using the generic tracking methods described in the first part of this thesis.

8.1.1 Motivation

It can be argued that the object specific appearance is irrelevant information if the application does not require identities of the tracked objects.
However, this is not true in practice, since the object detections are never perfect. The motivation for including appearance based tracking in the model can be summarized in these two points.

• To help keep the object identities.
• To counter the imperfections of the object detections.

The first point is clear. If the individual appearances of the objects are modelled, then association between frames is simpler. The second point comes from the fact that object detections in still images do not regard the temporal dimension. In section 7.2.2 it was noted that the object detections of human parts are of poor quality, i.e. they have a low precision rate. A good generic tracker does not suffer from this problem, since it regards the learnt appearance of the object in the previous frames. Generic trackers are often very accurate in tracking an object from frame to frame. However, common problems among generic trackers are long term drift and that failure often results in losing track of the target completely. Object detections in still images, on the other hand, do not suffer from these flaws. The motivation for combining appearance tracking and object detections is thus that it has the potential to give accurate frame to frame tracking while being robust to drift and failure. For the application in this thesis, a generic tracker with the following properties was desired.

1. Simple.
2. Fast.
3. Exploits color information.
4. Outputs pixel-dense confidences.

Some generic trackers, like the Tracking-Learning-Detection framework [24], use complex appearance models and methods for failure detection and redetection of the target. Such properties are not needed in this application, since this is handled by the implemented filtering framework and the fusion with object detections. This is the motivation behind the first point in the list.
Although the proposed framework does not aim for real-time performance in its current state, it still has to be sufficiently fast to make testing and evaluation practical. The object detections can be precomputed, but this is not the case for the generic tracker output, since it uses information from previous frames. In the surveillance sequence on which the framework is evaluated, as many as 30 humans can appear in the scene at the same time. Since each human is tracked along with 8 defined parts, almost 300 image regions are tracked simultaneously. The second point in the list thus becomes clear. The object detector described in section 7.1 only uses edge information. It is thus intuitive to use color as complementary appearance information. In the case of human tracking, the color of clothes is a very discriminative feature for separating individuals. This explains the need for the third property in the list. The last property comes as a result of the image measurement model in (6.8). It assumes that the appearance likelihood can be evaluated in a sufficiently dense set of locations. The RCSK tracker, discussed in the first part of the thesis, turns out to have all these properties. This tracker is used to compute the appearance likelihoods as described in the next section.

8.1.2 Integration of the RCSK Tracker

A variant of the proposed RCSK tracker in algorithm 2.1 is used to compute the appearance likelihoods $p(\varphi_t^j|z)$. In each frame where the object or part is not occluded, the tracker is updated using the image patch around the estimated location ($\hat z_{t|t}^0$ for the object and $\hat z_{t|t}^0 + \hat z_{t|t}^j$ for part $j$). The size of the patch is determined by the estimated scale $\hat s_{t|t}$ of the object. When processing a new frame, the tracking scores (correlation output) for each unoccluded object and part are computed in an area around their previously estimated locations in the image.
These score values can equivalently be seen as detection scores from an object detector trained on the specific object or part appearance. Analogously to the detection scores from the DPM detector in chapter 7, these score values are mapped through a sigmoid function to obtain a probability interpretation. The same kind of sigmoid function (7.6) was used in this case. The parameters were tuned to $\alpha = 6$ and $\beta = 0.5$. To get more accurate appearance tracking, the spatial size of the RCSK appearance model (i.e. $A_t^N$, $A_t^D$ and $\hat x_t$ in (2.21)) needs to be set according to the current estimated object scale $\hat s_{t|t}$. The transformed model coefficients $A_t^N$ and $A_t^D$ are resized by either padding the highest frequencies with zeros or removing the highest frequencies to get the appropriate size. This corresponds to an interpolation of the coefficients in the spatial domain. The learnt appearance is resized in a similar way, by either zero padding or removing the edges. This simple scheme of resizing the trackers proved to be very robust, even in quite extreme cases.

8.2 Rao-Blackwellized Particle Filtering

This section describes the practical details of how the RBPF is applied to the model described in chapter 6. The time update and the resampling step in the RBPF are performed as described in algorithm A.4. Some approximations are needed in the measurement updates, though, to make the RBPF applicable to the model proposed in chapter 6. The standard prior proposal distribution $q(x_{t+1}|X_t, Y_{t+1}) = p(x_{t+1}^n|X_t^n, Y_t)$ was chosen.

8.2.1 Parameters and Initialization

When a new object is added, its states are initialized using the information from the corresponding detection. The initial object position, scale and velocity for each particle are drawn from the prior Gaussian distribution. The mean position and scale are set to the detection values. The mean velocity is set to zero. The initial part positions are for all particles set to the ones obtained from the initial detection.
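The initialization just described can be sketched as follows. This is a Python/NumPy sketch (the thesis implementation is in Matlab); the standard deviations are illustrative placeholders, not the hand-tuned values of the thesis:

```python
import numpy as np

def init_particles(det_pos, det_scale, part_offsets, n_particles=1000,
                   pos_std=5.0, scale_std=0.05, vel_std=1.0, rng=None):
    """Draw initial particle states from Gaussian priors centred on a detection.

    Position and scale are drawn around the detection values, velocity is
    zero-mean, and the part offsets are copied unchanged to all particles.
    """
    rng = np.random.default_rng() if rng is None else rng
    pos = det_pos + pos_std * rng.standard_normal((n_particles, 2))
    scale = det_scale + scale_std * rng.standard_normal(n_particles)
    vel = vel_std * rng.standard_normal((n_particles, 2))   # zero-mean velocity
    parts = np.tile(part_offsets, (n_particles, 1, 1))      # same for all particles
    return pos, scale, vel, parts
```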
All prior and process covariances are parametrized and tuned coarsely by hand.

8.2.2 The Particle Filter Measurement Update

Theoretically, (6.13) should be used to update the particle weights in the RBPF. In practice, the likelihood functions $L_t^j$ are defined over a discrete domain, i.e. at each pixel location and at some discrete scales. The integration must thus be approximated by a sum. The Gaussian filter $h_t^j$ needs to be sampled at a pixel-dense grid. However, since the covariance $H_t^j$ of $h_t^j$ depends on the object scale $s$ and the predicted part location covariance $P_{t|t-1}^{z^j}$, it also depends on the particle number. This means that the correlation in (6.13) should be computed with a different filter for each particle. To reduce the computational cost of the measurement update, the covariance $H_t^j(s)$ defined in (6.15b) is in each time step approximated by a diagonal covariance matrix $\tilde H_t^j$ that is independent of the particle number. The likelihood (6.13) can then be approximately calculated using (8.1).

$$p(y_t|X_t^n, Y_{t-1}) = g_t(s_t)\, L_t^0(z_t^0, s_t) \prod_{j=1}^{N} \big(\tilde h_t^j \star L_t^j(\,\cdot\,, s_t)\big)\big(z_t^0 + \mu_t^j(s_t)\big) \quad (8.1)$$

Here, the approximative diffusion filter $\tilde h_t^j$ defined in (8.2) is independent of the particle number.

$$\tilde h_t^j(z) = \mathcal{N}\big(z;\, 0,\, \tilde H_t^j\big) \quad (8.2)$$

Notice that the effect of $\mu_t^j(s_t)$ (defined in (6.15a)) has been moved to the computation of the point where the correlation result is evaluated. This is obtained by a simple change of variables in the integral in (6.13). The correlation output in (8.1) is hence evaluated in the deformation-corrected predicted part location $z_t^0 + \mu_t^j(s_t)$. The diagonal elements of $\tilde H_t^j$ are computed as the mean of $H_t^j$ over all particles. The off-diagonal elements are set to zero. This approximation enables $\tilde h_t^j$ to be separated into two one-dimensional Gaussian filters, which further significantly reduces the computational cost.
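The separability enabled by the diagonal covariance can be illustrated as follows (a sketch with my own function names; truncated, normalized kernels and zero-padded borders, which the actual implementation may handle differently):

```python
import numpy as np

def gauss_kernel_1d(var, radius):
    """Truncated, normalized 1-D Gaussian kernel."""
    t = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * t**2 / var)
    return k / k.sum()

def separable_gauss_filter(L, var_y, var_x, radius=10):
    """Filter a likelihood map with a diagonal-covariance Gaussian by two
    1-D passes (along rows, then columns), as enabled by the diagonal
    approximation of H_t^j. A symmetric Gaussian makes convolution and
    correlation identical."""
    ky = gauss_kernel_1d(var_y, radius)
    kx = gauss_kernel_1d(var_x, radius)
    tmp = np.apply_along_axis(lambda r: np.convolve(r, kx, mode='same'), 1, L)
    return np.apply_along_axis(lambda c: np.convolve(c, ky, mode='same'), 0, tmp)
```

For an $m \times m$ kernel this replaces $O(m^2)$ operations per pixel with $O(2m)$, which is the computational gain referred to above.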
The approximation of $H_t^j$ is motivated by the fact that the exact amount of smoothing generated by the filter is of much lesser importance than the mean value $\mu_t^j$, which decides where the correlation result is evaluated. Further, the precision in the estimation of $P_{t|t-1}^{z^j}$, and thereby also of $H_t^j$, is questionable. As described in section 7.2.3, the deformation covariance $D^j$ is set to a diagonal matrix. $P_{t|t-1}^{z^j}$ is almost diagonal in most cases, since the process noise of the part locations is diagonal. The diagonal approximation of $H_t^j$ is thus also motivated. In the evaluation of (6.13), the object and part positions for each particle are rounded to the nearest pixel. The likelihood is linearly interpolated in the scale dimension.

In evaluations it was observed that background clutter is a major problem for some human part detectors. This clutter occasionally deteriorated the total likelihood function for a part, thereby contradicting the assumption of a single major mode in the likelihood, which is necessary for the Gaussian approximation to be valid. This model error resulted in a noticeable reduction in tracking performance of the object itself when severe background clutter was present. This was typically the case for the lower body parts. The reason is that the feet and legs move relative to the object itself, which makes them harder to track due to appearance changes and self-occlusions. Additionally, the lower body parts are much more commonly occluded by background structures and have a less discriminative appearance. A significant improvement in performance was noticed if only a subset of the parts was used in the measurement update for the non-linear states (6.13). Only the upper body parts are thus used in (6.13) for the human tracking application.

8.2.3 The Kalman Filter Measurement Update

In section 6.4.3 a Gaussian approximation of the likelihood functions for the object parts is assumed.
This is necessary to be able to apply the RBPF to the proposed tracking model. The mean $y_t^j(s)$ of the Gaussian approximation is selected as the location of the maximum value of the likelihood function $L_t^j(z, s)$ in the region. The covariance $R_t^j(s)$ is set to the covariance of the likelihood function after normalizing it so that it sums to one. $y_t^j(s)$ and $R_t^j(s)$ are calculated for each scale level. In the Kalman filter measurement update of the RBPF, each particle could be updated using the measurement that corresponds to the scale closest to the scale estimate given by that particle. In practice, it turned out to be better to update all particles with the measurement that corresponds to the estimated scale $\hat s_t$ of the object.

Gaussian approximations are reasonable if the actual probability is uni-modal. Since the labelling function of the RCSK tracker is Gaussian, it generates a roughly Gaussian shaped score function in most "easy" cases. In this context, "easy" means sufficiently small translations and appearance changes between frames, which is most often the case in surveillance scenes. The part detection scores are ideally uni-modal in a neighbourhood of the expected part location. If other objects are close, then several modes may exist. This is, however, in most cases handled by the fusion with the appearance likelihood, which should only have one mode at the target. The problem is that the part detectors for most human parts suffer from a very low precision rate due to background clutter. The trained SVM detectors used for human parts by the DPM human detector are small and detect at quite low resolution. They often give high classification scores to certain basic shapes or patterns that can be common in background structures. For example, a spot or line on the ground can look suspiciously similar to a foot. When the tracking is affected by this kind of clutter, the Gaussian assumption is often violated, which might cause the tracking of a specific part to fail.
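The moment-matching approximation described at the start of this section (mean at the likelihood maximum, covariance from the normalized map) can be sketched as follows (function name is my own; the map is assumed non-negative with nonzero sum):

```python
import numpy as np

def gaussian_approx(L):
    """Approximate a 2-D likelihood map by a Gaussian: mean at the location
    of the maximum, covariance from the map normalized to sum to one."""
    L = np.asarray(L, dtype=float)
    iy, ix = np.unravel_index(np.argmax(L), L.shape)
    y_mean = np.array([ix, iy], dtype=float)        # (x, y) of the peak
    p = L / L.sum()
    ys, xs = np.mgrid[0:L.shape[0], 0:L.shape[1]]
    mx, my = (p * xs).sum(), (p * ys).sum()
    dx, dy = xs - mx, ys - my
    R = np.array([[(p * dx * dx).sum(), (p * dx * dy).sum()],
                  [(p * dx * dy).sum(), (p * dy * dy).sum()]])
    return y_mean, R
```

For a likelihood map that actually is Gaussian shaped, this recovers the underlying mean and covariance; for a cluttered multi-modal map, the covariance estimate degrades, which is exactly the failure mode discussed above.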
However, since many parts are tracked jointly, the tracking of the object itself is robust to failures of a minority of the part trackers. The iterated measurement update in algorithm A.2 was applied to the Kalman filter measurement update in the RBPF, since all position measurements are uncorrelated. This avoids inverting large matrices (up to $4N \times 4N$) in the computation of the Kalman gain in (A.16d) for each particle. Instead, the iterated measurement update only requires several $2 \times 2$ and $1 \times 1$ matrices to be inverted for each particle, which is considerably faster. The iterated measurement update was also applied to the Kalman filter time update, since it contains an "extra" measurement update with uncorrelated measurements when using the model in (6.6) (see section A.4).

8.2.4 Estimation

Like the usual particle filter, the RBPF only returns an approximation of the posterior distribution and not a point estimate of the state. In chapter 1 it is stated that visual tracking is the problem of estimating the trajectory of the object in the image, i.e. the position or state of the object in each frame. It is thus necessary to find a point estimate of the state. This can be done in many ways. The two most common methods for obtaining a point estimate were tried: minimum variance (MV) and maximum a posteriori (MAP). These are described in section A.4.2. The performance difference proved to be insignificant. The MV estimate, however, gives smoother trajectories, which are visually more appealing. This method was therefore used in the final version.

8.3 Further Details

This section discusses the additional system parts that were needed to build a complete automatic human tracking framework.

8.3.1 Adding and Removing Objects

An automatic object tracker requires automatic ways of detecting new objects and removing falsely tracked objects. Although these are important and non-trivial tasks, they were not the focus of this thesis, so simple methods were employed.
However, these methods proved to be quite effective. To find new objects in the scene, all detections over a certain threshold are gathered in each frame. This set of detections is then reduced in a number of steps. Firstly, detections too close to the image borders are removed. This is followed by two steps of non-maximum suppression. If the bounding boxes of two new detections overlap more than a certain threshold, the one with the smaller score is removed. The second step compares the overlap between the remaining boxes and the bounding boxes of the existing objects in the scene. A new detection is removed if it has too large an overlap with any of the existing objects. The remaining detections after this step are considered to be newly detected objects. These are initialized with the position, scale and part locations given by the detection. The overlap measure used in these cases is given in (8.3), where $B_1$ and $B_2$ are two bounding boxes.

$$\mathrm{overlap}(B_1, B_2) = \max\left(\frac{\mathrm{area}(B_1 \cap B_2)}{\mathrm{area}(B_1)},\, \frac{\mathrm{area}(B_1 \cap B_2)}{\mathrm{area}(B_2)}\right) \quad (8.3)$$

An existing tracked object is removed if any of the following requirements is fulfilled.

1. The object is too far outside the image.
2. The object is too large or too small, so that it is outside the range of scales used by the detector.
3. The object has been fully occluded for too many frames.
4. The object is not significantly occluded but too dissimilar to the object class, which is indicated by a too low detection score over the last few frames.

8.3.2 Occlusion Detection and Handling

Advanced techniques for occlusion detection were not investigated in this thesis work. Rather, a very simple but effective method for detecting inter-object occlusions was adopted from [47]. If two tracked objects overlap, then the object with the bounding box that has the highest lower y-coordinate is considered to be the occluded one.
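The overlap measure (8.3) and the occlusion rule just stated can be sketched as follows (function names are my own; boxes are (x1, y1, x2, y2) with the image y-axis pointing downwards, an assumed convention):

```python
def overlap(b1, b2):
    """The overlap measure (8.3). Boxes are (x1, y1, x2, y2)."""
    ix = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    iy = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return max(inter / area(b1), inter / area(b2))

def occluded_one(b1, b2):
    """Of two overlapping boxes, return the occluded one: the box whose
    lower edge is highest up in the image (smallest y2 with y pointing
    down), i.e. the object assumed to be further from the camera."""
    return b1 if b1[3] < b2[3] else b2
```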
This results from the assumption that objects that are further away from the camera are higher up in the image. This is true, for example, if the objects are moving on a ground surface that is tilted towards the camera, but not necessarily planar. The assumption holds for most surveillance videos. The parts of the occluded object that have a large enough overlap with the occluding object are considered to be occluded. Parts that are far enough outside the image borders are also considered to be occluded. No system is used for detecting occlusions from scene objects or other non-tracked objects. Figure 8.1 visualizes the effect of the inter-object occlusion detection.

Figure 8.1: Three scenarios with significant inter-object occlusions. Only the human part boxes that are considered to be non-occluded by the framework are displayed.

Occluded parts are not used in the measurement updates of the RBPF. They are not tracked and their appearance models are not updated. If enough parts of the object are occluded, then the object itself is considered too occluded to utilize the appearance tracking and detection scores for the whole object in the measurement update of the RBPF. In this case, the likelihoods $p(\theta_t^0|z, s)$ and $p(\varphi_t^0|z)$ are set to uniform distributions. Further, the appearance of the whole object is not updated. For the application to human tracking, the object itself is considered occluded if any of the upper body parts are occluded. The time update in the RBPF is not affected by the occlusions. Occluded part locations are thus predicted by the model, and covariance is added to the predictions, reflecting the increased uncertainty when only predicting. If all object parts are occluded, then no measurements of the object location are available. The time update in the RBPF will then give predictions of the object location using the constant velocity motion model described in section 6.2.3.
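A generic constant-velocity prediction step of the kind referred to above can be sketched per coordinate as follows (the exact parametrization in section 6.2.3 may differ; the noise intensity `q` is illustrative):

```python
import numpy as np

def cv_time_update(pos, vel, P, dt=1.0, q=1.0):
    """One constant-velocity Kalman prediction step for a single coordinate.

    State [pos, vel]; standard CV transition matrix and discretized
    white-noise-acceleration process covariance with intensity q.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])
    Q = q * np.array([[dt**3 / 3.0, dt**2 / 2.0],
                      [dt**2 / 2.0, dt]])
    x = F @ np.array([pos, vel])
    P_pred = F @ P @ F.T + Q
    return x[0], x[1], P_pred
```

With only this time update running during a full occlusion, the position uncertainty grows in every frame, which matches the behaviour described above.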
9 Results, Discussion and Conclusions

This chapter presents the results of the constructed category tracker, when applied to human tracking in surveillance scenes. Section 9.1 presents the qualitative results. Section 9.2 discusses the method, results and potential future work. The final conclusions are summarized in section 9.3.

9.1 Results

The framework was implemented in Matlab and tested on a desktop computer with an Intel Xeon 2-core 2.66 GHz CPU and 16 GB RAM. The number of particles in the RBPF was set to 1000. Very little time was spent on tuning the large number of parameters. The Town Centre sequence provided by [5] was used for testing and evaluations. This sequence consists of 7501 frames in 1920 × 1080 resolution at 25 fps. The scene is a busy town centre street. Figure 9.1 visualizes the bounding boxes of the tracked objects and the estimated object trajectories for every 100th frame among the first 1000 frames. Figure 9.2 displays the part trajectories of a few selected objects. In the latter two images, the trajectories are disrupted by inter-object occlusions. In the last image, the person is successfully tracked even though only the head of the person is visible for a long period of time. Figures 9.1 and 9.2 show reasonable object and part trajectories. The system is able to track most humans through the entire scene. But there are some disrupted tracks, mostly due to imperfections in the object detector. It can also be seen that the DPM detector, while powerful, gives some obvious false detections. Two of these are persistent, as they are triggered by background structures. One is the mannequin in the shop window to the left and the other occurs in the lower right corner.

Figure 9.1: Tracking results for frame number 100, 200, . . . , 1000 in the Town Centre sequence.
The center location trajectories are displayed for humans that are tracked in the specific frame.

Figure 9.2: Estimated human part trajectories of some selected objects. Note that many trajectories in the two latter cases are disrupted by inter-object occlusions.

9.2 Discussion and Future Work

As discussed in section 1.2.2, most research in category tracking is focused on pure data association of detections and on global optimization with non-causal assumptions. The motivation behind the presented work is to incorporate object detections in more sophisticated ways while avoiding the non-causal assumption. However, real-time frame rates are necessary for online applications. In the current Matlab implementation, the computational time for the system, with the object detector excluded, is between 1 and 3 seconds per frame, depending on the number of present targets. But since particle filters and FFTs are parallelizable, real-time frame rates can potentially be obtained in a GPU (graphics processing unit) implementation. The object detection scores were precomputed using the Matlab/mex code provided by [15]. This took approximately four days for the entire Town Centre sequence (7501 frames and 58 different scales). However, recent works [33, 34] have achieved close to real-time frame rates (10 fps) with a GPU implementation of the DPM detector, by exploiting coarse-to-fine search strategies. In the constructed framework, it would presumably be enough to use object detection measurements in every few frames and rely solely on the generic tracking results between these frames. The main weakness of the proposed model is the Gaussian approximation of the part likelihoods, as it does not model cluttered detections well. Figure 9.3 visualizes the detection likelihoods for the human and all parts, at a certain frame and scale.
There are three strong human detections at this scale, and three corresponding strong head detections are visible. However, almost no clear detections exist for the hips and legs. The feet likelihoods suffer from much clutter in many areas of the image. The straightforward way of handling this problem is to add the part states that are most affected by clutter to the non-linear states in the RBPF. But this would also require an exponential increase in the number of particles to achieve the same theoretical accuracy of the posterior distribution, due to the curse of dimensionality. Another option is to approximate the part likelihoods in (6.16) with a mixture of Gaussians and use a mode parameter state to distinguish between the different hypotheses. The mode state is most simply included as a non-linear state. But since it is discrete, it would not require such a large increase in particles. Much research has been invested in improving the DPM detector framework. For example, [25] increased the performance in many object categories by combining color names (section 3.1.1) with HOG at the feature level. This could alleviate the problem with false detections and other imperfections of the detector. The problems with occlusions caused by non-tracked objects, e.g. background structures, could be handled further by incorporating an occlusion model. The work of [39] uses part detection scores to determine partial occlusions. Re-identification of objects that are lost, for example at occlusions, is an important task if object identities are of interest. This could potentially be done with the appearance models applied by the RCSK tracker. Otherwise, separate appearance models could be trained for this purpose.

Figure 9.3: Detection likelihoods p(θ^j|z, s) (figures 9.3b to 9.3j) at scale s = 2^{10/12}, computed on frame 130 (figure 9.3a) in the Town Centre sequence. Panels: (a) frame 130, (b) human, (c) head, (d) left shoulder, (e) right shoulder, (f) left hip, (g) right hip, (h) legs, (i) left foot, (j) right foot.

9.3 Conclusions

In this second part of the thesis, a system for category tracking is presented. The main novelties are the fusion of generic tracking with object detection scores from DPM in a causal probabilistic framework, and the use of the RBPF in the filtering step. Encouraging results are demonstrated when the system is applied to human tracking in a real-world surveillance sequence. The causal nature and real-time potential of this system make it attractive for online applications. Additionally, the estimated trajectories of human parts could be used by other systems for action detection and recognition.

Appendix A
Bayesian Filtering

This appendix contains a brief presentation of Bayesian filtering theory. It starts with the general theory and solution. The rest of the appendix presents the parts of the theory that are used in the proposed category object tracker. Section A.4 contains the algorithm and details of the Rao-Blackwellized particle filter, which is of major importance for the proposed tracker.

A.1 The General Case

Consider the general first order hidden Markov model in (A.1). The state of the system at time t ∈ N is denoted x_t ∈ R^n, and y_t ∈ R^{m_t} is the measurement given at time t.

p(x_{t+1} | x_t)    (A.1a)
p(y_t | x_t)    (A.1b)

The state transition density p(x_{t+1}|x_t) models the dynamics of the system. The likelihood p(y_t|x_t) models how likely it is to receive a certain measurement given the state of the system.
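For a discrete state space, the filtering recursion (A.2) that follows can be carried out exactly, with the integrals replaced by sums. The sketch below is an illustration only (the two-state chain and all numbers are assumptions, not from the thesis):

```python
def bayes_filter_step(prior, transition, likelihood):
    """One step of the recursion (A.2) for a discrete state space.
    prior[i] = p(x_t = i | Y_{t-1}), transition[i][j] = p(x_{t+1} = j | x_t = i),
    likelihood[i] = p(y_t | x_t = i)."""
    # Measurement update (A.2a): multiply by the likelihood and normalize.
    unnorm = [likelihood[i] * prior[i] for i in range(len(prior))]
    evidence = sum(unnorm)  # p(y_t | Y_{t-1}), cf. (A.3)
    posterior = [u / evidence for u in unnorm]
    # Time update (A.2b): sum over x_t instead of integrating.
    predicted = [sum(transition[i][j] * posterior[i]
                     for i in range(len(posterior)))
                 for j in range(len(prior))]
    return posterior, predicted
```

Running one step with a uniform prior shows both the posterior and the prediction summing to one, as probability distributions must.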
The goal is to estimate the posterior probability p(x_t|Y_t), where Y_t = {y_1, . . . , y_t} is the set of all measurements observed so far. The posterior is the probability distribution of the state given all information (measurements) available at that time instance.

A.1.1 General Bayesian Solution

The general solution of the Bayesian filtering problem for the model in (A.1) is given by the recursion formulas in (A.2).

p(x_t | Y_t) = p(y_t | x_t) p(x_t | Y_{t-1}) / p(y_t | Y_{t-1})    (A.2a)
p(x_{t+1} | Y_t) = ∫_{R^n} p(x_{t+1} | x_t) p(x_t | Y_t) dx_t    (A.2b)

The normalization factor in (A.2a) can be expressed as in (A.3).

p(y_t | Y_{t-1}) = ∫_{R^n} p(y_t | x_t) p(x_t | Y_{t-1}) dx_t    (A.3)

Equation A.2a is often called the measurement update, since the posterior is updated with the new information contained in the measurement y_t. Equation A.2b is often called the time update, since it predicts the posterior at the next time instance using the modelled dynamics. The recursion can be initialized with p(x_1|Y_0) = p(x_1). The recursion formulas in (A.2) can be derived by applying Bayes' theorem and marginalization, along with the Markov properties of the model in (A.1). In practice, the recursive Bayesian solution can only be applied in some special cases where finite dimensional parametrizations of the densities exist. Section A.2 discusses such a case. Otherwise, some sort of finite dimensional approximation of the densities is needed. Sections A.3 and A.4 discuss such methods.

A.1.2 Estimation

At each time instance, an estimate of the state x_t given the measurements Y_t can be calculated using e.g. the minimum variance (MV) or maximum a posteriori (MAP) estimate.

x̂^MV_{t|t} = ∫_{R^n} x_t p(x_t | Y_t) dx_t    (A.4a)
x̂^MAP_{t|t} = arg max_{x_t} p(x_t | Y_t)    (A.4b)

A.2 The Kalman Filter

The linear Gaussian model is one of the cases where an analytic solution to (A.2) exists. Such a model is given by (A.5).
It is a special case of the general model in (A.1), with p(x_{t+1}|x_t) = N(x_{t+1}; A_t x_t, B_t Q_t B_t^T) and p(y_t|x_t) = N(y_t; C_t x_t, R_t).

x_{t+1} = A_t x_t + B_t v_t    (A.5a)
y_t = C_t x_t + e_t    (A.5b)
v_t ∼ N(0, Q_t)    (A.5c)
e_t ∼ N(0, R_t)    (A.5d)
x_1 ∼ N(x̂_{1|0}, P_{1|0})    (A.5e)

v_t and e_t are white; v_t is called the process noise and e_t the measurement noise. A_t, B_t and C_t are matrices of appropriate dimensions.¹ The notation x̂_{t|k} and P_{t|k} denotes the estimate of the state mean and covariance respectively at time t, given all measurements up to time k.

A.2.1 Algorithm

The measurement update of the Kalman filter is given in (A.6). K_t is the Kalman gain, which needs to be computed at each time instance.

x̂_{t|t} = x̂_{t|t-1} + K_t (y_t − C_t x̂_{t|t-1})    (A.6a)
P_{t|t} = P_{t|t-1} − K_t C_t P_{t|t-1}    (A.6b)
K_t = P_{t|t-1} C_t^T (C_t P_{t|t-1} C_t^T + R_t)^{-1}    (A.6c)

The time update is given in (A.7).

x̂_{t+1|t} = A_t x̂_{t|t}    (A.7a)
P_{t+1|t} = A_t P_{t|t} A_t^T + B_t Q_t B_t^T    (A.7b)

The complete algorithm is given in algorithm A.1. For further details and a derivation of the Kalman filter, see [18].

Algorithm A.1 Kalman Filter Update at time t
Input:
  Matrices: A_t, B_t, C_t, Q_t and R_t
  Measurement: y_t
  Prediction at time t:ᵃ x̂_{t|t-1} and P_{t|t-1}
Output:
  Estimate at time t: x̂_{t|t} and P_{t|t}
  Prediction at time t + 1: x̂_{t+1|t} and P_{t+1|t}
1: Measurement update using (A.6).
2: Time update using (A.7).
ᵃ The prediction is given by the model at the first iteration (i.e. t = 1) and by the previous iteration otherwise.

¹ Note that the dimensions can vary dynamically, since the number of measurements (i.e. the dimension of y_t) can change.

A.2.2 Iterated Measurement Update

If y_t consists of several independent measurements, then the iterated measurement update can be used instead to reduce the computational cost. Let y_t = (y_t^1, . . .
, y_t^M)^T, where the y_t^i are the uncorrelated measurements.² This implies that the measurement noise covariance R_t is block diagonal, with the non-zero blocks R_t^i = Cov(y_t^i | x_t). Also let C_t^i be the rows of the measurement matrix C_t associated with y_t^i. The measurement update in algorithm A.1 can then be done using algorithm A.2.

Algorithm A.2 Iterated Measurement Update at time t
Input:
  Matrices: C_t and R_t
  Measurements: y_t = (y_t^1, . . . , y_t^M)^T
  Prediction at time t: x̂_{t|t-1} and P_{t|t-1}
Output:
  Estimate at time t: x̂_{t|t} and P_{t|t}
1: Set x̂_t^0 = x̂_{t|t-1} and P_t^0 = P_{t|t-1}
2: for i = 1, . . . , M do
3:   K_t^i = P_t^{i-1} (C_t^i)^T (C_t^i P_t^{i-1} (C_t^i)^T + R_t^i)^{-1}
4:   x̂_t^i = x̂_t^{i-1} + K_t^i (y_t^i − C_t^i x̂_t^{i-1})
5:   P_t^i = P_t^{i-1} − K_t^i C_t^i P_t^{i-1}
6: end for
7: Set x̂_{t|t} = x̂_t^M and P_{t|t} = P_t^M

The major gain in using the iterated measurement update instead of (A.6) is that the matrix inversions are computed for smaller matrices. This can be used to radically increase the computational speed of the marginalized particle filter, where thousands of measurement updates (one for each particle) need to be computed at each time instance.

² Note that y_t^i does not have to be scalar.

A.3 The Particle Filter

Approximative methods are necessary in cases more general than (A.5). The particle filter approximates the posterior over an adaptive grid. In contrast to the form of Bayesian filtering described in (A.2), the particle filter attempts to estimate the posterior of the whole trajectory X_t = {x_1, . . . , x_t}, i.e. p(X_t|Y_t). The filtering posterior p(x_t|Y_t) is obtained by marginalizing over the earlier states X_{t-1}.

A.3.1 Algorithm

The complete algorithm is stated in algorithm A.3. It uses a proposal density q(x_{t+1}|x_t, y_{t+1}) to sample the new particle grid {x_{t+1}^i}_{i=1}^N, given the previous grid {x_t^i}_{i=1}^N and the next measurement y_{t+1}. The simplest and most common choice of proposal density is the prior q(x_{t+1}|x_t, y_{t+1}) = p(x_{t+1}|x_t).
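With the prior as proposal, the predicted weights in (A.11) stay unchanged and one particle filter update reduces to weighting, resampling and propagation (the bootstrap filter). The following sketch is an illustration under assumed model choices, not code from the thesis:

```python
import math
import random

def particle_filter_step(particles, weights, y, propagate, likelihood, rng):
    """One update of the particle filter with the prior as proposal."""
    # Measurement update (A.8)-(A.9): reweight by the likelihood, normalize.
    w = [wi * likelihood(y, xi) for xi, wi in zip(particles, weights)]
    total = sum(w)
    w = [wi / total for wi in w]
    # Resampling (steps 3-4): draw with replacement, reset to uniform weights.
    particles = rng.choices(particles, weights=w, k=len(particles))
    w = [1.0 / len(particles)] * len(particles)
    # Time update (A.10): sample new particles from the prior proposal.
    particles = [propagate(xi, rng) for xi in particles]
    return particles, w

if __name__ == "__main__":
    # Assumed toy model: random-walk state, Gaussian measurement noise.
    rng = random.Random(0)
    particles = [rng.gauss(0.0, 1.0) for _ in range(500)]
    weights = [1.0 / 500] * 500
    lik = lambda y, x: math.exp(-0.5 * (y - x) ** 2)
    prop = lambda x, r: x + r.gauss(0.0, 0.1)
    for _ in range(10):
        particles, weights = particle_filter_step(particles, weights, 1.0, prop, lik, rng)
    print(sum(p for p in particles) / len(particles))
```

After a few updates with measurements near 1.0, the particle cloud concentrates around that value, which is the MV estimate behaviour described in section A.3.2.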
Many other proposal densities can however be used, depending on the application. The resampling step in the particle filter is needed to avoid sample depletion. It discards particles with too low weight, which do not contribute significantly to the approximation of the posterior. It is important to note that even though the particle filter returns an estimate of the posterior for the whole trajectory, the estimate is only accurate (assuming enough particles) for the last few states, because of the depletion problem.

Algorithm A.3 Particle Filter Update at time t
Input:
  Number of particles: N
  Particles and predicted weights at time t:ᵃ {x_t^i}_{i=1}^N, {w_{t|t-1}^i}_{i=1}^N
  Proposal distribution: q(x_{t+1}|x_t, y_{t+1})
Output:
  Particle weights at time t: {w_{t|t}^i}_{i=1}^N
  Particles and predicted weights at time t + 1: {x_{t+1}^i}_{i=1}^N, {w_{t+1|t}^i}_{i=1}^N
Measurement update:
1: Calculate new weights
   w̃_{t|t}^i = w_{t|t-1}^i p(y_t | x_t^i)    (A.8)
2: Normalize the weights
   w_{t|t}^i = w̃_{t|t}^i / Σ_{i=1}^N w̃_{t|t}^i    (A.9)
Resampling:ᵇ
3: Sample N particles with replacement from the set {x_t^i}_{i=1}^N with the probabilities {w_{t|t}^i}_{i=1}^N.
4: Set the weights to w_{t|t}^i = 1/N.
Time update:
5: Generate new particles using the proposal distribution
   x_{t+1}^i ∼ q(x_{t+1} | x_t^i, y_{t+1})    (A.10)
6: Compute the new predicted weights
   w_{t+1|t}^i = w_{t|t}^i p(x_{t+1}^i | x_t^i) / q(x_{t+1}^i | x_t^i, y_{t+1})    (A.11)
ᵃ In the first iteration (t = 1) the particles are sampled using x_1^i ∼ p_{x_1}, which is given by the model. The initial weights are set to w_{1|0}^i = 1/N.
ᵇ The resampling is optional in each iteration. It can be done only when needed, as indicated by some measure of depletion (see [18]).

A.3.2 Estimation

The filtering posterior is approximated by (A.12). The approximation is most commonly formed before the resampling. However, it can be formed after as well, if enough particles are used.
p̂(x_t | Y_t) = Σ_{i=1}^N w_{t|t}^i δ(x_t − x_t^i)    (A.12)

A point estimate of the state can be obtained by using this approximation in (A.4a). This results in the MV approximation given in (A.13). An approximation of the MAP estimate can be obtained by simply choosing the particle x_t^i with the highest weight w_{t|t}^i.

x̂^MV_{t|t} = Σ_{i=1}^N w_{t|t}^i x_t^i    (A.13)

A.4 The Rao-Blackwellized Particle Filter

The particle filter is in most cases impractical for state spaces with more than a few dimensions. This is due to the fact that the number of particles has to grow exponentially with the number of states to keep the accuracy of the estimates. The Rao-Blackwellized particle filter (RBPF) [37] exploits linear Gaussian substructures in the state space to reduce the dimensionality of the particle approximation. The rest of the states are estimated using Kalman filters. In this section the state space model in (A.14) is considered. In this model the state vector has been partitioned as x_t = (x_t^n, x_t^l)^T. The partitions are called the non-linear and linear states respectively.

x_{t+1}^n = f_t^n(x_t^n) + A_t^n(x_t^n) x_t^l + B_t^n(x_t^n) v_t^n    (A.14a)
x_{t+1}^l = f_t^l(x_t^n) + A_t^l(x_t^n) x_t^l + B_t^l(x_t^n) v_t^l    (A.14b)
y_t = h_t(x_t^n) + C_t(x_t^n) x_t^l + e_t    (A.14c)
v_t = (v_t^n; v_t^l) ∼ N(0, Q_t),  Q_t = [ Q_t^n, Q_t^{ln} ; (Q_t^{ln})^T, Q_t^l ]    (A.14d)
e_t ∼ N(0, R_t)    (A.14e)
x_1^n ∼ p_{x_1^n}    (A.14f)
x_1^l ∼ N(x̂_0^l, P_0)    (A.14g)

f_t^n, f_t^l and h_t are vector valued functions of x_t^n. A_t^n, A_t^l, B_t^n, B_t^l and C_t are matrices of appropriate dimensions that depend on x_t^n. Note that this model is linear in x_t^l conditioned on x_t^n.

A.4.1 Algorithm

The RBPF uses the factorization in (A.15).

p(X_t^n, x_t^l | Y_t) = p(x_t^l | X_t^n, Y_t) p(X_t^n | Y_t)    (A.15)

The first factor is the distribution of the linear states conditioned on the non-linear ones and the measurements. Given a particle approximation of X_t^n, an optimal approximation of this factor can be derived. The solution is to run a Kalman filter for each particle X_t^{n,i}.
The goal is to calculate p(x_t^l | X_t^{n,i}, Y_t) = N(x_t^l; x̂_{t|t}^{l,i}, P_{t|t}^i) in the measurement update and p(x_{t+1}^l | X_{t+1}^{n,i}, Y_t) = N(x_{t+1}^l; x̂_{t+1|t}^{l,i}, P_{t+1|t}^i) in the time update, for each particle i. The fact that these distributions are Gaussian can be proven by induction. To simplify the notation somewhat, the particle index i is skipped below. The dependence on x_t^n of the various functions and matrices in the model is also not denoted explicitly, to further increase the readability of the formulas. The measurement update of the linear states is given in (A.16).

p(x_t^l | X_t^n, Y_t) = N(x_t^l; x̂_{t|t}^l, P_{t|t})    (A.16a)
x̂_{t|t}^l = x̂_{t|t-1}^l + K_t (y_t − h_t − C_t x̂_{t|t-1}^l)    (A.16b)
P_{t|t} = P_{t|t-1} − K_t C_t P_{t|t-1}    (A.16c)
K_t = P_{t|t-1} C_t^T (C_t P_{t|t-1} C_t^T + R_t)^{-1}    (A.16d)

The time update of the linear states is given in (A.17).

p(x_{t+1}^l | X_{t+1}^n, Y_t) = N(x_{t+1}^l; x̂_{t+1|t}^l, P_{t+1|t})    (A.17a)
x̂_{t+1|t}^l = Ā_t^l x̂_{t|t}^l + B̄_t^l z_t + f_t^l + L_t (z_t − A_t^n x̂_{t|t}^l)    (A.17b)
P_{t+1|t} = Ā_t^l P_{t|t} (Ā_t^l)^T + B_t^l Q̄_t^l (B_t^l)^T − L_t A_t^n P_{t|t} (Ā_t^l)^T    (A.17c)
L_t = Ā_t^l P_{t|t} (A_t^n)^T ( A_t^n P_{t|t} (A_t^n)^T + B_t^n Q_t^n (B_t^n)^T )^{-1}    (A.17d)
z_t = x_{t+1}^n − f_t^n    (A.17e)

The bar-denoted matrices are defined in (A.18). However, if the process noises v_t^n and v_t^l are uncorrelated, i.e. Q_t^{ln} = 0, then Ā_t^l = A_t^l, B̄_t^l = 0 and Q̄_t^l = Q_t^l.

Ā_t^l = A_t^l − B̄_t^l A_t^n    (A.18a)
B̄_t^l = B_t^l (Q_t^{ln})^T (B_t^n Q_t^n)^{-1}    (A.18b)
Q̄_t^l = Q_t^l − (Q_t^{ln})^T (Q_t^n)^{-1} Q_t^{ln}    (A.18c)

Note that (A.17) contains a measurement update using z_t, followed by the ordinary Kalman filter time update. This extra measurement update is necessary to include the information from the time update in the particle filter. Equation A.14a acts as the measurement equation in this case. The second factor in (A.15) can be factorized as in (A.19). This factorization is used for the particle approximation of the non-linear states.
p(X_t^n | Y_t) = [ p(y_t | X_t^n, Y_{t-1}) p(x_t^n | X_{t-1}^n, Y_{t-1}) / p(y_t | Y_{t-1}) ] p(X_{t-1}^n | Y_{t-1})    (A.19)

The distributions for the likelihood and prior factors are given in (A.20). The linear states can in these cases be seen as extra measurement noise and process noise respectively.

p(y_t | X_t^n, Y_{t-1}) = N( y_t; h_t + C_t x̂_{t|t-1}^l, C_t P_{t|t-1} C_t^T + R_t )    (A.20a)
p(x_{t+1}^n | X_t^n, Y_t) = N( x_{t+1}^n; f_t^n + A_t^n x̂_{t|t}^l, A_t^n P_{t|t} (A_t^n)^T + B_t^n Q_t^n (B_t^n)^T )    (A.20b)

The complete algorithm is given in algorithm A.4. Like the particle filter, it uses a general proposal distribution q(x_{t+1}^n | X_t^n, Y_{t+1}). The most common choice of proposal distribution is the prior in (A.20b). See [29] for details on proposal distributions and a more detailed description of the RBPF. Also see [37] for proofs and details on some special cases.

A.4.2 Estimation

The filtering posterior and point estimates of the non-linear states can be calculated in the same way as for the particle filter in section A.3.2. The posterior of the linear states can be approximated as in (A.25), by a mixture of Gaussians.

p̂(x_t^l | Y_t) = Σ_{i=1}^N w_{t|t}^i N(x_t^l; x̂_{t|t}^{l,i}, P_{t|t}^{l,i})    (A.25)

The MAP estimate of the linear states can be approximated with the x̂_{t|t}^{l,i} that corresponds to the highest weight w_{t|t}^i. The minimum variance (MV) estimate is calculated using (A.26).

x̂_{t|t}^l = Σ_{i=1}^N w_{t|t}^i x̂_{t|t}^{l,i}    (A.26a)
P̂_{t|t}^l = Σ_{i=1}^N w_{t|t}^i ( P_{t|t}^{l,i} + (x̂_{t|t}^{l,i} − x̂_{t|t}^l)(x̂_{t|t}^{l,i} − x̂_{t|t}^l)^T )    (A.26b)

Note that the measurement update of the linear states in algorithm A.4 is placed after the resampling step. It can thus be more convenient to extract all estimates after step 6 in the algorithm.
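As a compact illustration of the RBPF structure (weighting with (A.20a), resampling, sampling from the prior (A.20b), and the Kalman time update (A.17)), consider a scalar model with position as the non-linear state and velocity as the linear state, uncorrelated noises (Q^{ln} = 0) and C = 0. This is a simplified sketch under assumed parameters, not the thesis implementation:

```python
import math
import random

def rbpf_step(parts, y, qn, ql, r, rng):
    """parts: list of [xn, vel_mean, vel_var, weight].
    Scalar model: xn_{t+1} = xn + vel + v^n, vel_{t+1} = vel + v^l, y = xn + e,
    so A^n = A^l = B^n = B^l = 1, h = xn, C = 0, f^n = xn, f^l = 0."""
    # Particle filter measurement update, (A.20a) with C = 0: N(y; xn, r).
    for p in parts:
        p[3] *= math.exp(-0.5 * (y - p[0]) ** 2 / r)
    total = sum(p[3] for p in parts)
    for p in parts:
        p[3] /= total
    # Resampling (steps 3-5); C = 0 makes the Kalman measurement update a no-op.
    parts = [list(p) for p in rng.choices(parts, weights=[p[3] for p in parts],
                                          k=len(parts))]
    for p in parts:
        p[3] = 1.0 / len(parts)
    # Time update: propose xn_{t+1} ~ N(xn + m, P + qn) from (A.20b), then the
    # Kalman time update (A.17) of the velocity given z = xn_{t+1} - f^n.
    for p in parts:
        xn, m, P, _ = p
        xn_new = rng.gauss(xn + m, math.sqrt(P + qn))
        z = xn_new - xn
        L = P / (P + qn)        # (A.17d)
        m = m + L * (z - m)     # (A.17b) with barred terms trivial
        P = P + ql - L * P      # (A.17c)
        p[0], p[1], p[2] = xn_new, m, P
    return parts
```

With the prior as proposal, the predicted weights (A.24) are unchanged, so the weights stay uniform after resampling. The velocity is never observed directly; it is inferred through the coupling term A^n x_t^l in the dynamics, which is exactly the role of the extra measurement update with z_t in (A.17).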
Algorithm A.4 Rao-Blackwellized Particle Filter Update at time t
Input:
  Number of particles: N
  Particles and predicted weights at time t:ᵃ {X_t^{n,i}}_{i=1}^N, {w_{t|t-1}^i}_{i=1}^N
  Predicted linear states at time t:ᵇ {x̂_{t|t-1}^{l,i}}_{i=1}^N, {P_{t|t-1}^i}_{i=1}^N
  Proposal distribution: q(x_{t+1}^n | X_t^n, Y_{t+1})
Output:
  Particle weights at time t: {w_{t|t}^i}_{i=1}^N
  Particles and predicted weights at time t + 1: {X_{t+1}^{n,i}}_{i=1}^N, {w_{t+1|t}^i}_{i=1}^N
  Estimated linear states at time t: {x̂_{t|t}^{l,i}}_{i=1}^N, {P_{t|t}^i}_{i=1}^N
  Predicted linear states at time t + 1: {x̂_{t+1|t}^{l,i}}_{i=1}^N, {P_{t+1|t}^i}_{i=1}^N
Particle filter measurement update:
1: Calculate new weights using (A.20a)
   w̃_{t|t}^i = w_{t|t-1}^i p(y_t | X_t^{n,i}, Y_{t-1})    (A.21)
2: Normalize the weights
   w_{t|t}^i = w̃_{t|t}^i / Σ_{i=1}^N w̃_{t|t}^i    (A.22)
Resampling:ᶜ
3: Sample a set of indices J_t = {j_t^k}_{k=1}^N with the probabilities p(j_t^k = i) = w_{t|t}^i, ∀i, k.
4: Set X_t^{n,k} = X_t^{n,j_t^k}, x̂_{t|t-1}^{l,k} = x̂_{t|t-1}^{l,j_t^k} and P_{t|t-1}^k = P_{t|t-1}^{j_t^k} for k = 1, . . . , N.
5: Set the weights to w_{t|t}^k = 1/N.
Kalman filter measurement update:
6: Kalman filter measurement update for each particle X_t^{n,i} using (A.16).
Particle filter time update:
7: Generate new particles using the proposal distribution
   x_{t+1}^{n,i} ∼ q(x_{t+1}^n | X_t^{n,i}, Y_{t+1})    (A.23)
8: Compute the new predicted weights using (A.20b)
   w_{t+1|t}^i = w_{t|t}^i p(x_{t+1}^{n,i} | X_t^{n,i}, Y_t) / q(x_{t+1}^{n,i} | X_t^{n,i}, Y_{t+1})    (A.24)
Kalman filter time update:
9: Kalman filter time update for each particle X_t^{n,i} using (A.17).
ᵃ In the first iteration (t = 1) the particles are sampled using x_1^{n,i} ∼ p_{x_1^n}. The initial weights are set to w_{1|0}^i = 1/N.
ᵇ In the first iteration (t = 1) the linear states are initialized as x̂_{1|0}^{l,i} = x̂_0^l and P_{1|0}^i = P_0 for all i.
ᶜ The resampling is optional in each iteration (see algorithm A.3).

Appendix B
Proofs and Derivations

B.1 Derivation of the RCSK Tracker

B.1.1 Kernel Function Proofs

This section proves the propositions in section 2.2.4.
This is done by first showing the following result.

B.1 Lemma. The inner product on ℓ_p^D(M, N) is a shift invariant kernel.

Proof: The inner product is clearly a valid kernel function. Using the definition of the standard scalar product in ℓ_p^D(M, N) and the periodicity of f and g, we get

⟨τ_{m,n} f, τ_{m,n} g⟩ = Σ_{d=1}^D Σ_{k,l} f^d(k − m, l − n) g^d(k − m, l − n) = Σ_{d=1}^D Σ_{r,s} f^d(r, s) g^d(r, s) = ⟨f, g⟩.    (B.1)

Since f, g, m, n were arbitrary, this is valid for all f, g ∈ ℓ_p^D(M, N) and m, n ∈ Z.

The results in section 2.2.4 can now be shown using this basic lemma.

Proof of proposition 2.2: The shift invariance follows directly from (2.14) and lemma B.1. Equation 2.15 follows from the correlation property and linearity of the DFT (see [16]).

κ(τ_{−m,−n} f, g) = k( Σ_{d=1}^D Σ_{k,l} f^d(k + m, l + n) g^d(k, l) ) = k( Σ_{d=1}^D (g ∗ f^d)(m, n) ) = k( F^{−1}{ Σ_{d=1}^D F^d G^d }(m, n) )    (B.2)

Proof of proposition 2.3: Equation 2.16 can be expanded as

κ(f, g) = k(‖f − g‖²) = k(⟨f − g, f − g⟩) = k(⟨f, f⟩ + ⟨g, g⟩ − 2⟨f, g⟩) = k(‖f‖² + ‖g‖² − 2⟨f, g⟩).    (B.3)

The shift invariance now follows from lemma B.1. The proof of (2.17) is similar to that of (2.15), using (B.3) and applying lemma B.1 to get ‖τ_{−m,−n} f‖² = ‖f‖².

B.1.2 Derivation of the Robust Appearance Learning Scheme

This section derives that A = F{a} in (2.20) is the minimizer of the cost function in (2.19). The cost function can be rewritten as (B.4), by inserting v^j from (2.19b) into (2.19a).

ε = Σ_{j=1}^J β_j [ Σ_{m,n} ( Σ_{k,l} a(k, l) κ(x_{m,n}^j, x_{k,l}^j) − y^j(m, n) )² + λ Σ_{m,n} a(m, n) Σ_{k,l} a(k, l) κ(x_{m,n}^j, x_{k,l}^j) ]    (B.4)

This function is clearly convex in a, since the squared L²-norm of an affine transformation is convex, and the cost is a sum of such functions. The global minimum can thus be found by finding a stationary point. The derivative with respect to a(r, s) is computed in (B.5).

∂ε/∂a(r, s) = 2 Σ_{j=1}^J β_j Σ_{m,n} κ(x_{m,n}^j, x_{r,s}^j) ( Σ_{k,l} a(k, l) κ(x_{m,n}^j, x_{k,l}^j) − y^j(m, n) + λ a(m, n) )
= 2 Σ_{j=1}^J β_j Σ_{m,n} κ(x_{r−m,s−n}^j, x^j) ( Σ_{k,l} a(k, l) κ(x_{m−k,n−l}^j, x^j) − y^j(m, n) + λ a(m, n) )    (B.5)

Here we have used the symmetry of the kernel function, i.e. κ(x, z) = κ(z, x), and the shift invariance defined in definition 2.1. We define the function u_x^j ∈ ℓ_p(M, N) in (B.6).

u_x^j(m, n) = κ(x_{m,n}^j, x^j)    (B.6)

Using this definition, the derivative in (B.5) can be expressed as

(1/2) ∂ε/∂a(r, s) = Σ_{j=1}^J β_j Σ_{m,n} u_x^j(r − m, s − n) ( Σ_{k,l} a(k, l) u_x^j(m − k, n − l) − y^j(m, n) + λ a(m, n) )
= Σ_{j=1}^J β_j Σ_{m,n} u_x^j(r − m, s − n) ( (a ∗ u_x^j)(m, n) − y^j(m, n) + λ a(m, n) )
= Σ_{j=1}^J β_j ( u_x^j ∗ (a ∗ u_x^j − y^j + λa) )(r, s)    (B.7)

By setting these derivatives to zero we get

∂ε/∂a(r, s) = 0, ∀r, s ∈ Z
⇔ Σ_{j=1}^J β_j u_x^j ∗ (a ∗ u_x^j − y^j + λa) = 0
⇔ F{ Σ_{j=1}^J β_j u_x^j ∗ (a ∗ u_x^j − y^j + λa) } = 0
⇔ Σ_{j=1}^J β_j ( U_x^j U_x^j A − U_x^j Y^j + λ A ) = 0
⇔ A ( Σ_{j=1}^J β_j U_x^j U_x^j + λ ) − Σ_{j=1}^J β_j Y^j U_x^j = 0
⇔ A = ( Σ_{j=1}^J β_j Y^j U_x^j ) / ( Σ_{j=1}^J β_j U_x^j U_x^j + λ )    (B.8)

Here we have used that the DFT is linear and invertible, along with the convolution property (see [16]). The last equivalence assumes that all frequency components in the denominator are non-zero. This completes the derivation of (2.20).

B.2 Proof of Equation 6.13

This section proves (6.13), which is used to update the particle weights in the RBPF.

B.2.1 Proof of Uncorrelated Parts

This section proves that (B.9) holds when the RBPF is used on the described model.

p(x_t^l | X_t^n, Y_{t-1}) = N(v_t; v̂_{t|t-1}, P_{t|t-1}^v) Π_{j=1}^N N(z_t^j; ẑ_{t|t-1}^j, P_{t|t-1}^{z_j})    (B.9)

From the RBPF we have that p(x_t^l | X_t^n, Y_{t-1}) = N(x_t^l; x̂_{t|t-1}^l, P_{t|t-1}) (see [37] for a proof). So, it only has to be proven that P_{t|t-1} is block diagonal with the 2 × 2 blocks P_{t|t-1}^v, P_{t|t-1}^{z_1}, . . . , P_{t|t-1}^{z_N}. From the model, this is true for t = 1. Now assume that it is also true for some t ≥ 1. We first prove that P_{t|t} is 2 × 2-block diagonal, using the iterated measurement update in algorithm A.2. First let {y_t^1, . . .
, y_t^N} be the independent position measurements from the image likelihood. We have C_t^i = (0_{2×2i}, I_{2×2}, 0_{2×2(N−i)}). Assume that P_t^{i-1} in algorithm A.2 is 2 × 2-block diagonal. Using the algorithm, it is easy to verify that K_t^i C_t^i P_t^{i-1} is only non-zero in diagonal block number i + 1. P_t^i is thus also block diagonal. It follows that P_t^N is block diagonal, since P_t^0 = P_{t|t-1} is block diagonal by assumption. Now let {y_t^{N+1}, . . . , y_t^{2N}} be the independent position measurements from the deformation likelihood. The only difference now is that C_t^i is multiplied by a scalar, so the same argument holds here. P_{t|t} = P_t^{2N} is thus block diagonal. Now consider the time update in (A.17). Using the model in (6.11), it is easy to verify that L_t = (L_t^1, 0_{2×2N})^T for some 2 × 2 matrix L_t^1. This implies that the last term in (A.17c) is only non-zero in the first diagonal block. The first two terms in (A.17c) are also clearly block diagonal, since Q_t^l in (6.12) is 2 × 2-block diagonal. This implies that P_{t+1|t} is 2 × 2-block diagonal and that the initial statement is valid for t + 1. This proves by induction that (B.9) is valid for all t.

B.2.2 Derivation of the Weight Update

To simplify notation, the likelihood function in (B.10) is defined.

L_t^j(z, s) = p(θ_t^j | z, s) p(ϕ_t^j | z)    (B.10)

The measurement model from section 6.3 can then be written as in (B.11).

p(y_t | x_t) = L_t^0(z_t^0, s_t) Π_{j=1}^N N(a^j; z_t^j / s_t, D^j) L_t^j(z_t^0 + z_t^j, s_t)    (B.11)

Let l be the number of linear states, in our case l = 2(N + 1). It follows that

p(y_t | X_t^n, Y_{t-1}) = ∫_{R^l} p(y_t, x_t^l | X_t^n, Y_{t-1}) dx_t^l = ∫_{R^l} p(y_t | x_t^l, X_t^n, Y_{t-1}) p(x_t^l | X_t^n, Y_{t-1}) dx_t^l = ∫_{R^l} p(y_t | x_t) p(x_t^l | X_t^n, Y_{t-1}) dx_t^l    (B.12)
Ljt (zt0 + ztj , st ) dztj (B.13) Notice that the linear state vt has been marginalized away. We will now consider the product of the two Gaussian functions inside the integral. The indices are skipped to simplify the notation. z 1 p N (z; ẑ, P )N a; , D = 2 s (2π) det(P ) det(D) T z z 1 · exp − (z − ẑ)T P −1 (z − ẑ) + a − D−1 a − (B.14) 2 | s s} {z =:V We define as = sa and Ds = s2 D to simplify the equations. The exponent V can then be written as follows. V = (z − ẑ)T P −1 (z − ẑ) + (as − z)T Ds−1 (as − z) = z T (P −1 + Ds−1 )z − 2z T (P −1 ẑ + Ds−1 as ) + ẑ T P −1 ẑ + aTs Ds−1 as (B.15) We define the quantities in (B.16). H = P −1 + Ds−1 µ=H P −1 ẑ + −1 Ds−1 as (B.16a) (B.16b) It is easy to check that (B.15) can be rewritten to (B.17). V = (z − µ)T H −1 (z − µ) + ẑ T P −1 ẑ + aTs Ds−1 as − µT H −1 µ {z } | {z } | =:V1 (B.17) =:V2 V2 is the part of V that is independent of z. V2 = ẑ T P −1 ẑ + aTs Ds−1 as − (P −1 ẑ + Ds−1 as )T H(P −1 ẑ + Ds−1 as ) = ẑ T (P −1 − P −1 HP −1 )ẑ + aTs (Ds−1 − Ds−1 HDs−1 )as − 2aTs Ds−1 HP −1 ẑ (B.18) The matrix inversion lemma is given in (B.19), where A and C are invertible matrices. (A − BCD)−1 = A−1 + A−1 B(C −1 − DA−1 B)−1 DA−1 (B.19) 98 B Proofs and Derivations Using this lemma, we get. (P −1 − P −1 HP −1 )−1 = P + (H −1 − P −1 )−1 = P + (P −1 + Ds−1 − P −1 )−1 = P + Ds (B.20) And similarily. (Ds−1 − Ds−1 HDs−1 )−1 = P + Ds (B.21) Additionally we have that. Ds−1 HP −1 = P (P −1 + Ds−1 )Ds −1 = (P + Ds )−1 (B.22) Equation B.18 can thus be simplified as. T −1 V2 = (ẑ − as ) (P + Ds ) (ẑ − as ) = T −1 ẑ 1 ẑ −a P +D − a (B.23) s s2 s Finally we note that (B.22) implies (B.24), where we have used the properties of matrix determinant. det(H) 1 = 4 (B.24) det(P + Ds ) s det(P ) det(D) Using (B.17), (B.23) and (B.24) in (B.14) gives. 
N(z; ẑ, P) N(a; z/s, D) = N(z; µ, H) N( ẑ/s; a, (1/s²) P + D )    (B.25)

Using this result in (B.13), and moving the factor that is independent of z_t^j out of the integral, gives (6.13) with the definitions in (6.14) and (6.15).

Bibliography

[1] Amit Adam, Ehud Rivlin, and Ilan Shimshoni. Robust fragments-based tracking using the integral histogram. In CVPR, 2006. Cited on page 40.
[2] Amir Roshan Zamir, Afshin Dehghan, and Mubarak Shah. GMCP-tracker: Global multi-object tracking using generalized minimum clique graphs. In Proceedings of the European Conference on Computer Vision (ECCV), 2012. Cited on pages 50 and 61.
[3] B. Babenko, Ming-Hsuan Yang, and S. Belongie. Visual tracking with online multiple instance learning. In CVPR, 2009. Cited on page 40.
[4] Chenglong Bao, Yi Wu, Haibin Ling, and Hui Ji. Real time robust L1 tracker using accelerated proximal gradient approach. In CVPR, 2012. Cited on page 40.
[5] Ben Benfold and Ian Reid. Stable multi-target tracking in real-time surveillance video. In CVPR, pages 3457–3464, June 2011. Cited on pages 64 and 75.
[6] Brent Berlin and Paul Kay. Basic Color Terms: Their Universality and Evolution. UC Press, Berkeley, CA, 1969. Cited on page 19.
[7] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. ISBN 0387310738. Cited on pages 3, 4, 11, and 58.
[8] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Yui M. Lui. Visual object tracking using adaptive correlation filters. In Computer Vision and Pattern Recognition (CVPR), 2010. Cited on pages 9, 10, 14, 16, and 29.
[9] Michael D. Breitenstein, Fabian Reichlin, Bastian Leibe, Esther Koller-Meier, and Luc Van Gool. Robust tracking-by-detection using a detector confidence particle filter. In IEEE International Conference on Computer Vision, October 2009. Cited on pages 50, 52, and 61.
[10] Robert T. Collins, Yanxi Liu, and Marius Leordeanu.
Online selection of discriminative tracking features. PAMI, 27(10):1631–1643, 2005. Cited on page 17.
[11] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Kernel-based object tracking. PAMI, 25(5):564–575, 2003. Cited on page 17.
[12] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Cordelia Schmid, Stefano Soatto, and Carlo Tomasi, editors, International Conference on Computer Vision & Pattern Recognition, volume 2, pages 886–893, June 2005. URL http://lear.inrialpes.fr/pubs/2005/DT05. Cited on pages 57, 60, and 61.
[13] Thang Ba Dinh, Nam Vo, and Gerard Medioni. Context tracker: Exploring supporters and distracters in unconstrained environments. In CVPR, 2011. Cited on pages 17 and 40.
[14] Michael Felsberg. Enhanced distribution field tracking using channel representations. In ICCV Workshop, 2013. Cited on page 40.
[15] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010. Cited on pages 5, 6, 57, 58, 60, and 78.
[16] Claude Gasquet and Patrick Witomski. Fourier Analysis and Applications: Filtering, Numerical Computation, Wavelets. Texts in Applied Mathematics. Springer-Verlag New York Inc., 1999. ISBN 0-387-98485-2. Cited on pages 4, 93, and 95.
[17] T. Gevers and A. W. M. Smeulders. Color based object recognition. Pattern Recognition, 32:453–464, 1999. Cited on page 18.
[18] Fredrik Gustafsson. Statistical Sensor Fusion. Studentlitteratur, second edition, 2012. ISBN 978-91-44-07732-1. Cited on pages 50, 85, and 87.
[19] Sam Hare, Amir Saffari, and Philip H. S. Torr. Struck: Structured output tracking with kernels. In International Conference on Computer Vision (ICCV), 2011. Cited on pages 17 and 40.
[20] Shengfeng He, Qingxiong Yang, Rynson Lau, Jiang Wang, and Ming-Hsuan Yang. Visual tracking via locality sensitive histograms. In CVPR, 2013. Cited on page 40.
[21] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In Proceedings of the European Conference on Computer Vision (ECCV), 2012. Cited on pages 3, 5, 9, 12, 13, 14, 15, 16, 28, 29, 30, 32, and 40.
[22] Hamid Izadinia, Imran Saleemi, Wenhui Li, and Mubarak Shah. (MP)²T: Multiple people multiple parts tracker. In Proceedings of the European Conference on Computer Vision (ECCV), volume 7577 of Lecture Notes in Computer Science, pages 100–114. Springer, 2012. Cited on pages 50, 52, and 61.
[23] Xu Jia, Huchuan Lu, and Ming-Hsuan Yang. Visual tracking via adaptive structural local sparse appearance model. In CVPR, 2012. Cited on page 40.
[24] Zdenek Kalal, Krystian Mikolajczyk, and Jiri Matas. Tracking-learning-detection. IEEE Trans. Pattern Analysis Machine Intelligence, 34(7):1409–1422, 2012. Cited on pages 17, 40, and 68.
[25] Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost van de Weijer, Andrew Bagdanov, Maria Vanrell, and Antonio Lopez. Color attributes for object detection. In CVPR, 2012. Cited on pages 5, 17, and 78.
[26] Fahad Shahbaz Khan, Joost van de Weijer, and Maria Vanrell. Modulating shape features by color attention for object recognition. IJCV, 98(1):49–64, 2012. Cited on pages 5 and 17.
[27] Fahad Shahbaz Khan, Rao Muhammad Anwer, Joost van de Weijer, Andrew Bagdanov, Antonio Lopez, and Michael Felsberg. Coloring action recognition in still images. IJCV, 105(3):205–221, 2013. Cited on page 17.
[28] Erwin Kreyszig. Introductory Functional Analysis with Applications. Wiley Classics Library. John Wiley & Sons, Inc., 1989. ISBN 978-0-471-50459-7. Cited on page 3.
[29] Fredrik Lindsten. Rao-Blackwellised particle methods for inference and identification. Licentiate thesis, Linköping University, 2011.
Cited on page 90.
[30] Alfred Mertins. Signal Analysis: Wavelets, Filter Banks, Time-Frequency Transforms, and Applications. John Wiley & Sons, 1999. ISBN 0-471-98626-7. Cited on pages 21 and 22.
[31] Katja Nummiaro, Esther Koller-Meier, and Luc J. Van Gool. An adaptive color-based particle filter. IVC, 21(1):99–110, 2003. Cited on page 17.
[32] Shaul Oron, Aharon Bar-Hillel, Dan Levi, and Shai Avidan. Locally orderless tracking. In CVPR, 2012. Cited on pages 17 and 40.
[33] M. Pedersoli, A. Vedaldi, and J. Gonzalez. A coarse-to-fine approach for fast deformable object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2011. Cited on page 78.
[34] Marco Pedersoli, Jordi Gonzalez, Xu Hu, and Xavier Roca. Toward real-time pedestrian detection based on a deformable template model. IEEE Transactions on Intelligent Transportation Systems, 2013. Cited on page 78.
[35] Patrick Perez, Carine Hue, Jaco Vermaak, and Michel Gangnet. Color-based probabilistic tracking. In ECCV, 2002. Cited on pages 17 and 40.
[36] David Ross, Jongwoo Lim, Ruei-Sung Lin, and Ming-Hsuan Yang. Incremental learning for robust visual tracking. IJCV, 77(1):125–141, 2008. Cited on page 40.
[37] Thomas Schön, Fredrik Gustafsson, and Per-Johan Nordlund. Marginalized particle filters for mixed linear nonlinear state-space models. IEEE Trans. on Signal Processing, 53:2279–2289, 2005. Cited on pages 6, 88, 90, and 96.
[38] Laura Sevilla-Lara and Erik Learned-Miller. Distribution fields for tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012. Cited on pages 17, 28, and 40.
[39] Guang Shu, Afshin Dehghan, Omar Oreifej, Emily Hand, and Mubarak Shah. Part-based multiple-person tracking with partial occlusion handling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012. Cited on pages 61 and 78.
[40] Severin Stalder, Helmut Grabner, and Luc van Gool.
Beyond semi-supervised tracking: Tracking should be as simple as detection, but not simpler than recognition. In ICCV Workshop, 2009. Cited on page 40.
[41] K. van de Sande, Theo Gevers, and Cees G. M. Snoek. Evaluating color descriptors for object and scene recognition. PAMI, 32(9):1582–1596, 2010. Cited on pages 5, 17, and 18.
[42] J. van de Weijer and C. Schmid. Coloring local feature extraction. In ECCV, 2006. Cited on pages 5, 17, and 18.
[43] J. van de Weijer, C. Schmid, Jakob J. Verbeek, and D. Larlus. Learning color names for real-world applications. TIP, 18(7):1512–1524, 2009. Cited on pages 6 and 19.
[44] Dong Wang, Huchuan Lu, and Ming-Hsuan Yang. Least soft-threshold squares tracking. In CVPR, 2013. Cited on page 40.
[45] Yi Wu, Bin Shen, and Haibin Ling. Online robust image alignment via iterative convex optimization. In CVPR, 2012. Cited on page 40.
[46] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark. In CVPR, 2013. Cited on pages 5, 27, 28, 29, and 40.
[47] Bo Yang and Ram Nevatia. Online learned discriminative part-based appearance models for multi-human tracking. In Proceedings of the European Conference on Computer Vision (ECCV), 2012. Cited on page 72.
[48] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys, 38(4), 2006. Cited on page 1.
[49] Jun Zhang, Youssef Barhomi, and Thomas Serre. A new biologically inspired color image descriptor. In ECCV, 2012. Cited on pages 5, 17, and 18.
[50] Kaihua Zhang, Lei Zhang, and Ming-Hsuan Yang. Real-time compressive tracking. In Proceedings of the European Conference on Computer Vision (ECCV), 2012. Cited on pages 17 and 40.
[51] Tianzhu Zhang, Bernard Ghanem, Si Liu, and Narendra Ahuja. Robust visual tracking via multi-task sparse learning. In CVPR, 2012. Cited on page 40.
[52] Wei Zhong, Huchuan Lu, and Ming-Hsuan Yang. Robust object tracking via sparsity-based collaborative model. In CVPR, 2012. Cited on page 40.
Copyright

The publishers will keep this document online on the Internet — or its possible replacement — for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.
For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/

© Martin Danelljan
