Linköping Studies in Science and Technology
Dissertation No. 1395

Shape Based Recognition
Cognitive Vision Systems in Traffic Safety Applications

Fredrik Larsson

Department of Electrical Engineering
Linköpings universitet, SE-581 83 Linköping, Sweden
Linköping, November 2011

© 2011 Fredrik Larsson
ISBN 978-91-7393-074-1
ISSN 0345-7524

Abstract

Traffic accidents are globally the number one cause of death for people 15-29 years old and are among the top three causes for all age groups 5-44 years. Much of the work within this thesis has been carried out in projects aiming for (cognitive) driver assistance systems and hopefully represents a step towards improving traffic safety. The main contributions are within the area of Computer Vision, and more specifically within the areas of shape matching, Bayesian tracking, and visual servoing, with the main focus being on shape matching and applications thereof. The different methods have been demonstrated in traffic safety applications, such as bicycle tracking, car tracking, and traffic sign recognition, as well as for pose estimation and robot control. One of the core contributions is a new method for recognizing closed contours, based on complex correlation of Fourier descriptors. It is shown that keeping the phase of Fourier descriptors is important: neglecting the phase can result in perfect matches between intrinsically different shapes. Another benefit of keeping the phase is that rotation covariant and rotation invariant matching are achieved in the same way; the only difference is whether one considers the magnitude (rotation invariant matching) or the real part (rotation covariant matching) of the complex-valued correlation.
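The rotation behaviour described above can be sketched in a few lines (Python/NumPy; the contour sampling and the particular normalization below are illustrative assumptions, not the exact scheme used in the thesis):

```python
import numpy as np

def fourier_descriptors(contour, n_coeff=8):
    """Fourier descriptors of a closed contour given as complex samples z_k = x_k + i*y_k."""
    c = np.fft.fft(contour)
    c[0] = 0.0                    # drop the DC term -> translation invariance
    c /= np.abs(c[1])             # normalize by |c_1| -> scale invariance (phase is kept)
    # keep the low-frequency coefficients (positive and negative frequencies)
    return np.concatenate([c[1:n_coeff + 1], c[-n_coeff:]])

t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
ellipse = 3 * np.cos(t) + 1j * np.sin(t)        # a sampled ellipse contour
rotated = ellipse * np.exp(1j * np.pi / 4)      # the same shape, rotated 45 degrees

a = fourier_descriptors(ellipse)
b = fourier_descriptors(rotated)

# Rotating the shape multiplies every descriptor by the same unit complex number,
# so the normalized complex correlation carries the rotation in its phase.
corr = np.vdot(b, a) / (np.linalg.norm(a) * np.linalg.norm(b))

print(abs(corr))    # ~1.0: rotation invariant score
print(corr.real)    # ~cos(45 deg) ~ 0.71: rotation covariant score
```

Taking only the magnitude of the correlation gives the rotation invariant score, while the real part penalizes rotated instances, which is the covariant behaviour described above.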
The shape matching method has further been used in combination with an implicit star-shaped object model for traffic sign recognition. The presented method works fully automatically on query images, with no need for regions of interest. It is shown that the presented method performs well for traffic signs that contain multiple distinct contours, while some improvement is still needed for signs defined by a single contour. The presented methodology is general enough to be used for arbitrary objects, as long as they can be defined by a number of regions. Another contribution is the extension of a framework for learning based Bayesian tracking called channel based tracking. Compared to earlier work, the multi-dimensional case has been reformulated in a sound probabilistic way and the learning algorithm itself has been extended. The framework is evaluated in car tracking scenarios and is shown to give competitive tracking performance compared to standard approaches, with the advantage of being fully learnable. The last contribution is in the field of (cognitive) robot control. The presented method achieves sufficient accuracy for simple assembly tasks by combining autonomous recognition with visual servoing, based on a learned mapping between percepts and actions. The method demonstrates that limitations of inexpensive hardware, such as web cameras and low-cost robotic arms, can be overcome using powerful algorithms. All in all, the methods developed and presented in this thesis can be used as components in systems guided by visual information, and hopefully represent a step towards improving traffic safety.

Popular Science Summary (Populärvetenskaplig sammanfattning)

Traffic accidents are globally the most common cause of death for people aged 15-29 and are among the three most common causes of death for all age groups 5-44.
A large part of the work leading up to this thesis has been carried out within projects focusing on systems that assist drivers. Hopefully, the results presented contribute to improved traffic safety and, by extension, to saved lives. The thesis describes methods and algorithms within the field of computer vision. Computer vision is an engineering science that aims to create seeing machines, which in practice means developing algorithms and computer programs that can extract and use information from images. More specifically, this thesis contains methods within the subfields of shape recognition, target tracking, and visually guided control. The methods have primarily been demonstrated in applications related to traffic safety, such as traffic sign recognition and car tracking, but also in other areas, including the control of robotic arms. The emphasis of the thesis lies within shape recognition, which aims to automatically identify and recognize geometric shapes despite complicating factors such as rotation, scaling, and deformation. One of the main results is a method for recognizing shapes by considering their outer contours. The method is based on correlation of so-called Fourier descriptors and has been used for detection and recognition of traffic signs. It works by recognizing the subregions of a sign individually and then combining them with requirements on their mutual geometric relations. Shape recognition has, together with target tracking, also been used to detect and follow cyclists in video sequences, by recognizing bicycle wheels, which are imaged as ellipses in the image plane. Within target tracking, a further development of earlier work on so-called channel based tracking is presented. Target tracking is about accurately estimating the state, for example position and velocity, of objects.
This is done by using observations from different points in time together with motion and observation models. The presented method has been used in a car to track the positions of other road users, which is ultimately used to warn the driver of potential dangers. The last subfield concerns the control of robots using visual feedback. The thesis contains a method inspired by how we humans learn to use our bodies already at the fetal stage. The method starts by sending random control signals to the robot, which results in random movements, and then observing the outcome. By repeating this many times, the inverse relation can be established and used to select the control signals required to reach a desired configuration. Together, the presented methods constitute components that can be used in systems guided by visual information, not limited to the applications described above.

Acknowledgments

I would like to thank all current and former members of the Computer Vision Laboratory. You have all in one way or another contributed to this thesis, either scientifically or, equally important, by contributing to the friendly and inspiring atmosphere. Especially I would like to thank:

• Michael Felsberg for providing an excellent working environment, for being an excellent supervisor, and for being a never-ending source of inspiration.

• Per-Erik Forssén for being an equally good co-supervisor and for sharing lots of knowledge regarding object recognition, conics, and local features.

• Gösta Granlund for initially allowing me to join the CVL group and for sharing knowledge and inspiration regarding biological vision systems.

• Johan Wiklund for keeping the computers reasonably happy most of the time and for acknowledging the usefulness of gaffer tape.
• Liam Ellis, Per-Erik Forssén, Klas Nordberg and Marcus Wallenberg for proofreading parts of this manuscript and giving much appreciated feedback.

Also I would like to thank all friends and my family for support with non-scientific issues, most notably:

• My parents Ingrid and Kjell for infinite love and for always being there; your love and support means the world to me.

• Marie Knutsson for lots of love and much needed distractions; your presence in my life makes it richer on all levels.

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 215078 DIPLECS, from the European Community's Sixth Framework Programme (FP6/2003-2007) under grant agreement no. 004176 COSPAL, and from the project Extended Target Tracking funded by the Swedish Research Council, all of which are hereby gratefully acknowledged.

Fredrik Larsson
November 2011

Contents

1 Introduction 1
  1.1 Motivation 1
  1.2 Outline 2
    1.2.1 Outline Part I: Background Theory 2
    1.2.2 Outline Part II: Included Publications 2
  1.3 Projects 9
    1.3.1 COSPAL 9
    1.3.2 DIPLECS 10
    1.3.3 ETT: Extended Target Tracking 14
  1.4 Publications 15

I Background Theory 17

2 Shape Matching 19
  2.1 Overview 19
    2.1.1 Region Based Matching 20
    2.1.2 Contour Based Matching 20
    2.1.3 Partial Contour Matching and Non-Rigid Matching 22
  2.2 Conics 22
  2.3 The Conic From a Torchlight 24

3 Tracking 29
  3.1 Bayesian Tracking 29
  3.2 Data Association 31
  3.3 Channel Representation 32

4 Visual Servoing 35
  4.1 Open-Loop Systems 35
  4.2 Visual Servoing 35
  4.3 The Visual Servoing Task 37

5 Concluding Remarks 39
  5.1 Results 39
  5.2 Future Work 41

II Publications 51

A Torchlight Navigation 53
B Bicycle Tracking Using Ellipse Extraction 65
C Correlating Fourier Descriptors of Local Patches for Road Sign Recognition 89
D Using Fourier Descriptors and Spatial Models for Traffic Sign Recognition 115
E Learning Higher-Order Markov Models for Object Tracking in Image Sequences 131
F Simultaneously Learning to Recognize and Control a Low-Cost Robotic Arm 147

Chapter 1

Introduction

1.1 Motivation

Road and traffic safety is an ever important topic of concern. About 50 million people are injured and more than 1.2 million people die in traffic related accidents every year, which is more than one person dying every 30 seconds. Road traffic injuries are globally the number one cause of death for people 15-29 years old and are among the top three causes for all age groups 5-44 years [85]. The United Nations General Assembly has proclaimed the period 2011-2020 as the Decade of Action for Road Safety, with the goal to first stabilize and then reduce the number of traffic fatalities around the world [84].
The number of yearly fatalities is expected to rise to 1.9 million around 2020 and to 2.4 million around 2030 unless the trend is changed [85]. Among the stipulated actions are designing safer roads, reducing drunk driving and speeding, and improving driver training and licensing; the responsibility of vehicle manufacturers to produce safe cars is also mentioned [16].

Much of the work within this thesis has been performed in projects aiming for (cognitive) driver assistance systems and hopefully represents a step towards improving traffic safety. The main technical contributions of this thesis are within the area of Computer Vision, and more specifically within the areas of shape matching, Bayesian tracking, and visual servoing, with the main focus being on shape matching and applications thereof. The different methods have been demonstrated in traffic safety applications, such as bicycle tracking, car tracking, and traffic sign recognition, as well as for pose estimation and robot control. Work leading to this thesis has mostly been carried out within three projects. The main parts originate from research within two European projects, COSPAL (COgnitive Systems using Perception-Action-Learning [1]) and DIPLECS (Dynamic-Interactive Perception-Action LEarning Systems [2]), while some of the latest contributions stem from the project ETT (Extended Target Tracking) funded by the Swedish Research Council; see Sec. 1.3 for more details on the projects.

1.2 Outline

This thesis is written as a collection of previously published papers and is divided into two main parts in addition to this introduction. The rest of this introductory chapter contains brief information about the included publications, together with explicit statements of the contributions made by the author, followed by a section describing the different projects that the work was carried out within.
Part I contains chapters on background theory and concepts needed for Part II, and a concluding chapter. Part II contains the six included papers, which make up the core of this thesis.

1.2.1 Outline Part I: Background Theory

Each of the main topics of the thesis (shape matching, Bayesian tracking, and visual servoing) is given an introductory chapter covering the basics of the field. Part I ends with a concluding chapter that summarizes the main results of the thesis and briefly discusses possible areas of future research. Part of the material in Part I has previously been published in [55].

1.2.2 Outline Part II: Included Publications

Edited versions of six papers are included in Part II. The included papers are selected in order to reflect the different areas of research that were touched upon by the author during the years as a Ph.D. student at the Computer Vision Laboratory at Linköping University.

Paper A contains work on relative pose estimation using a torchlight. The reprojection of the emitted light beam creates, under certain conditions, an ellipse in the image plane. We show that it is possible to use this ellipse in order to estimate the relative pose.

Paper B builds on the ideas presented in paper A and contains initial work on bicycle tracking, done jointly with the Automatic Control group at Linköping University. The relative pose estimates are based on ellipses originating from the projection of the bicycle wheels into the image. The different ellipses have to be associated with the correct ellipses in previous frames, i.e. front wheel to front wheel and rear wheel to rear wheel. This is combined with a particle filter framework in order to track the bicycle in 3D.

Paper C contains work on generic shape recognition using Fourier descriptors, while papers A and B only deal with ellipses.
The paper presents theoretical justifications for using a correlation based matching scheme for Fourier descriptors and also presents initial work on traffic sign recognition.

Paper D extends the work on traffic sign recognition by introducing spatial constraints on the local shapes using an implicit star-shaped object model. The earlier paper C focuses on recognizing individual shapes, while this work takes the configuration of different shapes into consideration.

Paper E contains work on learning based object tracking. In Paper B the motion model of the tracked object is known beforehand. This is not always the case, and the method presented in paper E addresses this scenario. The approach is evaluated in car tracking experiments.

Paper F describes a method for learning how to control a robotic arm without knowing beforehand what it looks like or how it is controlled. In order for the method presented in this paper to work, consistent estimates of the robot configuration/pose are needed. This is achieved by a heuristic approach based on template matching, but could preferably be replaced by the tracking framework from papers B and E in combination with the shape and pose estimation ideas from papers A-D.

Bibliographic details for each of the included papers, together with abstracts and statements of the contributions made by the author, are given in this section.

Paper A: Torchlight Navigation

M. Felsberg, F. Larsson, W. Han, A. Ynnerman, and T. Schön. Torchlight navigation. In Proceedings of the 20th International Conference on Pattern Recognition (ICPR), 2010.

This work received a paper award from the Swedish Society for Automated Image Analysis.

Abstract: A common computer vision task is navigation and mapping. Many indoor navigation tasks require depth knowledge of flat, unstructured surfaces (walls, floor, ceiling). With passive illumination only, this is an ill-posed problem.
Inspired by small children using a torchlight, we use a spotlight for active illumination. Using our torchlight approach, depth and orientation estimation of unstructured, flat surfaces boils down to estimation of ellipse parameters. The extraction of ellipses is very robust and requires little computational effort.

Contributions: The author was the main source for implementing the method, conducting the experiments, and writing large parts of the paper. The original idea was developed by Felsberg, Han, Ynnerman and Schön.

Paper B: Bicycle Tracking Using Ellipse Extraction

T. Ardeshiri, F. Larsson, F. Gustafsson, T. Schön, and M. Felsberg. Bicycle tracking using ellipse extraction. In Proceedings of the 14th International Conference on Information Fusion, 2011.

Honorable mention, nominated for the best student paper award.

Abstract: A new approach to track bicycles from imagery sensor data is proposed. It is based on detecting ellipsoids in the images and treating these pair-wise using a dynamic bicycle model. One important application area is in automotive collision avoidance systems, where no dedicated systems for bicyclists yet exist and where very few theoretical studies have been published. Possible conflicts can be predicted from the position and velocity state in the model, but also from the steering wheel articulation and roll angle, which indicate yaw changes before the velocity vector changes. An algorithm is proposed which consists of an ellipsoid detection and estimation algorithm and a particle filter. A simulation study of three critical single target scenarios is presented, and the algorithm is shown to produce excellent state estimates. An experiment using a stationary camera and the particle filter for state estimation has been performed and shows encouraging results.
Contributions: The author was the main source behind the computer vision related parts of this paper, while Ardeshiri was the main source behind the parts related to control theory. The author implemented the method for ellipse estimation and wrote parts of the paper.

Paper C: Correlating Fourier Descriptors of Local Patches for Road Sign Recognition

F. Larsson, M. Felsberg, and P.-E. Forssén. Correlating Fourier descriptors of local patches for road sign recognition. IET Computer Vision, 5(4):244–254, 2011.

Abstract: Fourier descriptors (FDs) are a classical but still popular method for contour matching. The key idea is to apply the Fourier transform to a periodic representation of the contour, which results in a shape descriptor in the frequency domain. Fourier descriptors are most commonly used to compare object silhouettes and object contours; we instead use this well established machinery to describe local regions to be used in an object recognition framework. Many approaches to matching FDs are based on the magnitude of each FD component, thus ignoring the information contained in the phase. Keeping the phase information requires us to take into account the global rotation of the contour and the shifting of the contour samples. We show that the sum-of-squared differences of FDs can be computed without explicitly de-rotating the contours. We compare our correlation based matching against affine-invariant Fourier descriptors (AFDs) and WARP matched FDs and demonstrate that our correlation based approach outperforms AFDs and WARP on real data. As a practical application, we demonstrate the proposed correlation based matching on a road sign recognition task.

Contributions: The author is the main source behind the research leading to this paper. The author developed and implemented the method and wrote the paper. Initial inspiration and ideas originated from Forssén and Felsberg, with Felsberg also contributing to the presented matching scheme.
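The claim that the sum-of-squared differences can be minimized over the rotation without explicit de-rotation can be checked numerically. The sketch below (Python/NumPy; random vectors stand in for the FDs of two local patches, and cyclic shifts of the contour samples are not modeled) compares a brute-force search over de-rotation angles with the closed-form expression obtained from the complex correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8) + 1j * rng.normal(size=8)   # stand-in FDs of patch A
b = rng.normal(size=8) + 1j * rng.normal(size=8)   # stand-in FDs of patch B

# Brute force: de-rotate b by many candidate angles and keep the best SSD.
thetas = np.linspace(0, 2 * np.pi, 20000, endpoint=False)
brute = min(np.sum(np.abs(a - np.exp(1j * th) * b) ** 2) for th in thetas)

# Closed form: ||a - e^{i*theta} b||^2 = ||a||^2 + ||b||^2 - 2*Re(e^{-i*theta} <a,b>),
# and theta can always be chosen to cancel the phase of the correlation <a,b>.
corr = np.vdot(b, a)                               # sum_k a_k * conj(b_k)
closed = np.sum(np.abs(a) ** 2) + np.sum(np.abs(b) ** 2) - 2 * np.abs(corr)

print(brute - closed)   # ~0: no explicit de-rotation needed
```

The paper additionally handles shifts of the contour samples, which this sketch leaves out.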
Paper D: Using Fourier Descriptors and Spatial Models for Traffic Sign Recognition

F. Larsson and M. Felsberg. Using Fourier Descriptors and Spatial Models for Traffic Sign Recognition. In Proceedings of the Scandinavian Conference on Image Analysis (SCIA), volume 6688 of Lecture Notes in Computer Science, pages 238–249, 2011.

Abstract: Traffic sign recognition is important for the development of driver assistance systems and fully autonomous vehicles. Even though GPS navigation systems work well most of the time, there will always be situations when they fail. In these cases, robust vision based systems are required. Traffic signs are designed to have distinct colored fields separated by sharp boundaries. We propose to use locally segmented contours combined with an implicit star-shaped object model as prototypes for the different sign classes. The contours are described by Fourier descriptors. Matching of a query image to the sign prototype database is done by exhaustive search. This is done efficiently by using the correlation based matching scheme for Fourier descriptors and a fast cascaded matching scheme for enforcing the spatial requirements. We demonstrate state-of-the-art performance on a publicly available database.

Contributions: The author is the main source behind the research leading to this paper. The author developed and implemented the method and wrote the main part of the paper.

Paper E: Learning Higher-Order Markov Models for Object Tracking in Image Sequences

M. Felsberg and F. Larsson. Learning higher-order Markov models for object tracking in image sequences. In Proceedings of the International Symposium on Visual Computing (ISVC), volume 5876 of Lecture Notes in Computer Science, pages 184–195. Springer-Verlag, 2009.
Abstract: This work presents a novel object tracking approach, where the motion model is learned from sets of frame-wise detections with unknown associations. We employ a higher-order Markov model on position space instead of a first-order Markov model on a high-dimensional state-space of object dynamics. Compared to the latter, our approach allows the use of marginal rather than joint distributions, which results in a significant reduction of computational complexity. Densities are represented using a grid-based approach, where the rectangular windows are replaced with estimated smooth Parzen windows sampled at the grid points. This method performs as accurately as particle filter methods, with the additional advantage that the prediction and update steps can be learned from empirical data. Our method is compared against standard techniques on image sequences obtained from an RC car following scenario. We show that our approach performs best in most of the sequences. Other potential applications are surveillance from cheap or uncalibrated cameras and image sequence analysis.

Contributions: The core ideas behind this paper originate from Felsberg. The author wrote parts of the paper and was the main source for implementing the theoretical findings and for conducting experiments validating the tracking framework.

Paper F: Simultaneously Learning to Recognize and Control a Low-Cost Robotic Arm

F. Larsson, E. Jonsson, and M. Felsberg. Simultaneously learning to recognize and control a low-cost robotic arm. Image and Vision Computing, 27(11):1729–1739, 2009.

Abstract: In this paper, we present a visual servoing method based on a learned mapping between feature space and control space.
Using a suitable recognition algorithm, we present and evaluate a complete method that simultaneously learns the appearance and control of a low-cost robotic arm. The recognition part is trained using an action-precedes-perception approach. The novelty of this paper, apart from the visual servoing method per se, is the combination of visual servoing with gripper recognition. We show that we can achieve high precision positioning without knowing in advance what the robotic arm looks like or how it is controlled.

Contributions: The author is the main source behind the research leading to this paper. The author developed and implemented the method and wrote the main part of the paper.

1.3 Projects

Most of the research leading to this thesis was conducted within the two European projects COSPAL and DIPLECS. Both projects were within the European Framework Programme calls for cognitive systems and thus had a strong focus on learning based methods able to adapt to the environment. DIPLECS can be seen as the follow-up project to COSPAL and was closer to real applications, exemplified by driver assistance, than the previous project. Some of the latest contributions stem from the project ETT funded by the Swedish Research Council. ETT shares some similarities with DIPLECS, such as applications within the traffic safety domain and the use of shape recognition techniques. Additional details about the three projects can be found below.

1.3.1 COSPAL

COSPAL (COgnitive Systems using Perception-Action-Learning) was a European Community's Sixth Framework Programme project (FP6/2003-2007, grant agreement no. 004176) carried out between 2004 and 2007 [1]. The main goal of the COSPAL project was to conduct research leading towards systems that learn from experience, rather than using predefined models of the world. The key concept, as stated in the project name, was to use perception-action-learning. This was achieved by applying the idea of action-precedes-perception during the learning phase [39].
This means that the system first performs an action (random or goal directed) and then observes the outcome. By doing so, it is possible to learn the inverse mapping between percept and action. The motivation behind this reversed causal direction is that the action space tends to be of much lower dimensionality than the percept space [39]. This approach was successfully demonstrated in the context of robot control described in the included publication [65].

The main demonstrator scenario of the COSPAL project involved a robotic arm and a shape sorting puzzle, see Fig. 1.1, but the system architecture and algorithms implemented were all designed to be as generic as possible. This was demonstrated in [20], when part of the main COSPAL system was successfully used for two different tasks: solving a shape sorting puzzle and driving a radio controlled car.

Figure 1.1: Images from the COSPAL main demonstrator. Left: A view captured by the camera mounted on the gripper. Right: Side view of the robotic arm and shape sorting puzzle.

The results presented by the author in [62, 63, 64, 65] originate from the COSPAL project.

1.3.2 DIPLECS

DIPLECS (Dynamic-Interactive Perception-Action LEarning Systems) was a European Community's Seventh Framework Programme project (FP7/2007-2013, grant agreement no. 215078) carried out between 2007 and 2010 [2]. DIPLECS continued the work of COSPAL and extended its results to incorporate dynamics and interaction with other agents. The scenarios considered during the COSPAL project involved a single system operating in a static world. This was extended in DIPLECS to allow for a changing world and multiple systems acting simultaneously within the world. The main scenario of the DIPLECS project was driver assistance, and one of the core ideas was to learn by observing human drivers, i.e. perception-action learning.
The following project overview is quoted from the DIPLECS webpage:

'The DIPLECS project aims to design an Artificial Cognitive System capable of learning and adapting to respond in the everyday situations humans take for granted. The primary demonstration of its capability will be providing assistance and advice to the driver of a car. The system will learn by watching humans, how they act and react while driving, building models of their behaviour and predicting what a driver would do when presented with a specific driving scenario. The end goal of which is to provide a flexible cognitive system architecture demonstrated within the domain of a driver assistance system, thus potentially increasing future road safety.' [2]

The DIPLECS integrated system was demonstrated in a number of different traffic scenarios using an RC-car, see Fig. 1.2, and a real vehicle, see Fig. 1.3. The RC-car allowed the system to actively control the actions of the vehicle, for tasks such as automatic obstacle avoidance and path following [21, 41, 75], something that due to safety protocols was not done on the real car.

Figure 1.2: The RC-car setup used for active control by the system.

The real car was instrumented with multiple cameras mounted on the roof and on the dashboard facing out, as well as cameras facing the driver used for eye-tracking. Additional sensors, such as gas and brake pedal proximity sensors and differential GPS, were also mounted in the car. The images from the three roof mounted cameras were stitched into one wide field of view image, see Fig. 1.3. The observed paths of objects in the world take on nontrivial properties due to the nonlinear distortions occurring at the stitching boundaries, as well as the potential movement of both the vehicle and the observed object. Methods developed in the included publication on learning tracking models [26] were integrated in the instrumented vehicle in order to address these challenges.
The main demonstrator showed the system's ability to adapt to the behavior of the driver [30]. One example was the grounding of visual percepts to semantic meaning based on driver actions, demonstrated with traffic signs, see Fig. 1.4 and videos at www.diplecs.eu. Originally, the system is not aware of the semantic meaning of the detection corresponding to a stop sign. The system is aware that the reported detection is a sign, just not of which type. After a few runs of stopping at a junction with the sign present, the system deduces that the sign might be a stop sign or a give way sign. After additional runs in which the driver makes a full stop even though no other cars are present, the system correctly deduces that the sign type is in fact a stop sign.

Figure 1.3: Top: The instrumented vehicle used in the DIPLECS project. Bottom: The combined view given by stitching the views from the three individual cameras mounted on the roof of the vehicle.

Research leading to the included publications on shape matching and traffic sign recognition [58, 59] and on learning tracking models [26] was conducted within this project. Other publications by the author that originate from the time in the DIPLECS project are [25, 60, 61]. The author was to a large extent involved in implementing the required functionalities from CVL in the main demonstrator and was the main source behind implementing the functionalities needed for multi-target tracking based on the channel based tracking framework, see paper E.

Figure 1.4: Upper left: Unknown sign. Upper right: Based on driver behavior, the likelihoods of give way sign and stop sign are equal. Middle: Based on behavior, the system is confident that the sign is a stop sign. Bottom: View while approaching the junction.

1.3.3 ETT: Extended Target Tracking

The project ETT (Extended Target Tracking), running 2011-2014, aims at multiple and extended target tracking.
Traditionally, targets have been represented by their kinematic state (position, velocity, etc.). The project investigates new ways of extending the state vector and moving away from a pure point target description. Early results, described in the included paper B, have been in the area of bicycle tracking, where the bicycle is treated as a weakly articulated object and the observations consist of the projected ellipses originating from the bicycle wheels, see Fig. 1.5.

Figure 1.5: Image of a bike with estimated ellipses belonging to the bike wheels. The estimated ellipses are halfway between the colored lines.

1.4 Publications

This is a complete list of publications by the author.

Journal Papers

F. Larsson, M. Felsberg, and P.-E. Forssén. Correlating Fourier descriptors of local patches for road sign recognition. IET Computer Vision, 5(4):244–254, 2011

F. Larsson, E. Jonsson, and M. Felsberg. Simultaneously learning to recognize and control a low-cost robotic arm. Image and Vision Computing, 27(11):1729–1739, 2009

Peer-Reviewed Conference Papers

T. Ardeshiri, F. Larsson, F. Gustafsson, T. Schön, and M. Felsberg. Bicycle tracking using ellipse extraction. In Proceedings of the 14th International Conference on Information Fusion, 2011. Honorable mention, nominated for the best student paper award

F. Larsson and M. Felsberg. Using Fourier Descriptors and Spatial Models for Traffic Sign Recognition. In Proceedings of the Scandinavian Conference on Image Analysis (SCIA), volume 6688 of Lecture Notes in Computer Science, pages 238–249, 2011

M. Felsberg and F. Larsson. Learning object tracking in image sequences. In Proceedings of the International Conference on Cognitive Systems, 2010

M. Felsberg, F. Larsson, W. Han, A. Ynnerman, and T. Schön. Torchlight navigation. In Proceedings of the 20th International Conference on Pattern Recognition (ICPR), 2010

M. Felsberg and F. Larsson. Learning higher-order Markov models for object tracking in image sequences.
In Proceedings of the International Symposium on Visual Computing (ISVC), volume 5876 of Lecture Notes in Computer Science, pages 184–195. Springer-Verlag, 2009

F. Larsson, M. Felsberg, and P.-E. Forssén. Patch contour matching by correlating Fourier descriptors. In Digital Image Computing: Techniques and Applications (DICTA), Melbourne, Australia, December 2009. IEEE Computer Society

M. Felsberg and F. Larsson. Learning Bayesian tracking for motion estimation. In Proceedings of the European Conference on Computer Vision (ECCV), International Workshop on Machine Learning for Vision-based Motion Analysis, 2008

F. Larsson, E. Jonsson, and M. Felsberg. Visual servoing for floppy robots using LWPR. In Workshop on Robotics and Mathematics (ROBOMAT), pages 225–230, 2007

Other Conference Papers

F. Larsson and M. Felsberg. Traffic sign recognition using Fourier descriptors and spatial models. In Proceedings of the Swedish Symposium on Image Analysis (SSBA), 2011

M. Felsberg, F. Larsson, W. Han, A. Ynnerman, and T. Schön. Torch guided navigation. In Proceedings of the Swedish Symposium on Image Analysis (SSBA), 2010. Awarded a paper award at the conference.

F. Larsson, P.-E. Forssén, and M. Felsberg. Using Fourier descriptors for local region matching. In Proceedings of the Swedish Symposium on Image Analysis (SSBA), 2009

F. Larsson, E. Jonsson, and M. Felsberg. Learning floppy robot control. In Proceedings of the Swedish Symposium on Image Analysis (SSBA), 2008

F. Larsson, E. Jonsson, and M. Felsberg. Visual servoing based on learned inverse kinematics. In Proceedings of the Swedish Symposium on Image Analysis (SSBA), 2007

Theses

F. Larsson. Methods for Visually Guided Robotic Systems: Matching, Tracking and Servoing. Linköping Studies in Science and Technology. Thesis No. 1416, Linköping University, 2009

F. Larsson. Visual Servoing Based on Learned Inverse Kinematics. M.Sc. Thesis LITH-ISY-EX–07/3929, Linköping University, 2007

Reports

F.
Larsson. Automatic 3D Model Construction for Turn-Table Sequences - A Simplification. LiTH-ISY-R, 3022, Linköping University, Department of Electrical Engineering, 2011

Part I: Background Theory

Chapter 2: Shape Matching

Shape matching is an ever-popular area of research within the computer vision community that, as the name implies, concerns representing and recognizing arbitrary shapes. This chapter contains a brief introduction to the field of 2D shape matching and is intended as preparation for papers A–D, which, to varying degrees, deal with shape matching. Also included is a section on conics, containing an extended derivation of the relationship linking relative pose and the reflection of a light beam from a torchlight, used in paper A.

2.1 Overview

A common classification of shape matching methods is into region based and contour based methods. Contour based methods aim to capture the information contained on the boundary/contour only, while region based methods also include information about the internal region. Both classes can further be divided into local and global methods. Global methods treat the whole shape at once, while local methods divide the shape into parts that are described individually in order to increase robustness to, e.g., occlusion. See [69, 86, 91] for three excellent survey papers on shape matching.

When dealing with shape matching, an important aspect to take into consideration is which invariances are appropriate. Depending on the task at hand, a particular invariance might be either beneficial or harmful. Take optical character recognition, OCR, as one example. For this particular application, full rotation invariance would be harmful, since a 9 and a 6 would be confused. This is similar to the situation we face in the included papers C and D, which deal with traffic sign recognition: we do not want to confuse the numbers on speed signs, nor the diamond shape of Swedish main road signs with the shapes of square windows.
Depending on the desired invariance properties, different methods aim for different invariances, for example invariance under projective transformations [79], affine transformations [4], or non-rigid deformations [14, 31], just to mention a few. For an overview of the invariance properties of different shape descriptors, see the extensive listing and description of over 40 different methods in [86].

Figure 2.1: Illustration of a grid based method for describing shape. The grid is transformed into a vector and each tile is marked with hit=1, if it touches or is within the boundary, or miss=0, otherwise.

2.1.1 Region Based Matching

Region based methods aim to capture information not only from the boundary but also from the internal region of the shape. A simple and intuitive example of a region based method is the grid based method [70] illustrated in Fig. 2.1. This approach places a grid over the canonical version of the shape, i.e. normalized with respect to rotation, scale, etc. The grid is then transformed into a binary feature vector with the same length as the number of tiles in the grid. Ones indicate that the corresponding grid tiles touch the shape and zeros that the tiles are completely outside the shape. Note that this simple method does not capture any texture information. Popular region based approaches are moment based methods [43, 52, 82], generic Fourier descriptors [89] and methods based on the medial axis/skeleton, such as [78].

2.1.2 Contour Based Matching

Contour based methods only account for the information given by the contour itself. A simple example of a contour based method is shape signatures [19]. Shape signatures are basically representations based on a one-dimensional parameterization of the contour. This can be achieved using scalar valued functions, e.g. the distance to the center of gravity as a function of distance traveled along the contour as in Fig.
2.2, or functions with multivariate output, e.g. using the full vector to the center of gravity, not just the distance.

Figure 2.2: Shape signature based on distance to the center of gravity.

Shape signatures provide a periodic one-dimensional parameterization of the shape. It is thus a natural step to apply the Fourier transform to this periodic signal, and this is exactly what is done in order to obtain Fourier Descriptors (FDs) [37, 88]. FDs use the Fourier coefficients of the 1D Fourier transform of the shape signature. Different shape signatures have been used within the Fourier descriptor framework, e.g. distance to centroid, curvature, and complex valued representation. For more details on FDs, see the included papers [58, 59], where we show that it is possible to retain the phase information and perform sum-of-squared-differences matching without explicitly de-rotating the FDs.

Another popular contour based method is the curvature scale space, CSS, [73], which is incorporated in the MPEG-7 visual shape descriptors standard [12]. The CSS descriptor is based on inflection points of successively smoothed versions of the contour. The authors of [90] present an extensive comparison between FDs and CSS. In their study they show that FDs outperform CSS on the MPEG-7 contour shape database.

Shape context is another popular global contour based descriptor [10]. The descriptor is computed as log-polar histograms of edge energy around points sampled from the contour. Matching individual histograms is commonly done using χ² test statistics, and the matching cost between shapes is given by the pairing of points that minimizes the total sum of individual costs. Shape context allows for small non-rigid deformations.

One limitation of contour based methods is that they tend to be sensitive to noise and errors in the contour segmentation process.
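The core matching idea, complex correlation of FDs with the phase kept, can be sketched in a few lines. This is a hedged illustration under our own normalization choices (DC removal for translation, unit energy for scale), not the exact formulation of papers [58, 59]:

```python
import numpy as np

def fourier_descriptors(contour):
    """FDs of a closed contour given as a complex sequence z_n = x_n + i*y_n."""
    Z = np.fft.fft(contour)
    Z[0] = 0.0                      # drop the DC term -> translation invariance
    return Z / np.linalg.norm(Z)    # unit energy -> scale invariance

def fd_correlation(Za, Zb):
    """Complex correlation over all starting-point shifts; the phase is kept."""
    return np.fft.ifft(Za * np.conj(Zb)) * len(Za)

def match_scores(contour_a, contour_b):
    corr = fd_correlation(fourier_descriptors(contour_a),
                          fourier_descriptors(contour_b))
    # magnitude -> rotation invariant score, real part -> rotation covariant score
    return np.abs(corr).max(), corr.real.max()

# Example: an ellipse and a 90-degree rotated copy of it.
t = np.linspace(0.0, 2.0 * np.pi, 256, endpoint=False)
ellipse = 2.0 * np.cos(t) + 1j * np.sin(t)
rotated = np.exp(1j * np.pi / 2) * ellipse

inv, cov = match_scores(ellipse, rotated)
# inv is 1 (perfect rotation invariant match); cov is clearly lower,
# since the covariant score penalizes the rotation.
```

With the magnitude, a contour matches any rotated copy of itself perfectly; with the real part, the rotation shows up as a reduced score, which is the covariant behavior described above.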
Small changes in the contour may result in big changes in the shape descriptor, making matching impossible. Region based methods are less sensitive to noise, since small changes of the contour leave the interior relatively unchanged. For an in-depth discussion of the pros and cons of the different approaches, see [91].

2.1.3 Partial Contour Matching and Non-Rigid Matching

The difficulties involved in achieving reliable segmentation of shapes in natural images have led to work on shape matching based on local contour segments and different voting techniques [9, 33, 34, 68, 77]. Another rapidly evolving area is that of non-rigid shape matching based on chord angles [18], shape contexts [10], triangulated graphs [31] and shape-trees [32]. The interested reader is referred to [33], regarding partial contour matching, to [14], regarding non-rigid matching, and to the numerous references therein.

Many of the successful methods for recognition of deformable shapes tend to be very slow. The mentioned papers [18] and [32] take about 1 h and 136 h respectively for the MPEG-7 dataset (for which they currently rank 11th and 3rd in bull's eye score). The currently best methods on the MPEG-7 dataset [8, 87] do not report any running times. As a comparison, our FD-based matching method in paper B takes less than 30 seconds on the same dataset, although with a worse bull's eye score, since it does not deal with non-rigid deformations.

In our work we have focused on recognition of closed contours, and this fits well with our main application, traffic sign recognition. Traffic signs are designed to have easily distinguishable regions and are placed in such a way that they are rarely occluded. Traffic signs are also rigid objects, meaning that invariance to non-rigid deformations could be harmful in this application domain.

2.2 Conics

Conics have a prominent role in two of the included papers and thus deserve a thorough introduction.
A conic, or rather conic section, is the result of the intersection between a cone and a plane.

Figure 2.3: Illustration of the three types of conics. Left: Parabolas. Center: Ellipses. Right: Hyperbolas. Image adapted from Wikimedia Commons [17].

Conics are represented by the following second order polynomial

    ax^2 + 2bxy + cy^2 + 2dx + 2ey + f = 0    (2.1)

where x, y denote coordinates in the plane and a, b, c, d, e, f denote the coefficients defining the conic. Using homogeneous coordinates and matrix notation, (2.1) can be written as

    p^T C p = 0    (2.2)

where

    p = [x, y, 1]^T    (2.3)

and

    C = [ a  b  d
          b  c  e
          d  e  f ].    (2.4)

Note that any multiple of C defines the same conic; thus a conic has only five degrees of freedom. A conic with det(C) ≠ 0 is called a non-degenerate conic. Three types of non-degenerate conics exist in the Euclidean case: parabolas, hyperbolas and ellipses, with circles being a special case of ellipses, see Fig. 2.3. It is possible to classify a non-degenerate/degenerate conic based on the determinant of the upper left 2 × 2 submatrix

    C22 = [ a  b
            b  c ]    (2.5)

according to Table 2.1 [50, 66, 67].

                  det(C22) > 0                  det(C22) = 0                                 det(C22) < 0
    det(C) ≠ 0    a + c < 0: real ellipse       parabola                                     hyperbola
                  a + c > 0: imaginary ellipse
    det(C) = 0    point ellipse                 rank(C) = 2: two unique parallel lines       two intersecting lines
                                                rank(C) = 1: two coincident parallel lines

Table 2.1: Classification of the different types of conics.

For the case of an ellipse, the center of the conic is given as

    [x_c, y_c]^T = C22^{-1} [-d, -e]^T,    (2.6)

and the directions of the major and minor axes are given by the eigenvectors of C22. The relations and properties mentioned above hold for the Euclidean case. For more information on the properties of conics in different spaces, see [11, 40, 50].

2.3 The Conic From a Torchlight

This is an extended version of the derivation of the resulting conic from a reflected light beam used in paper A.
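As a small illustration of the matrix representation and the center formula (2.6), the non-degenerate classification of Table 2.1 can be implemented in a few lines; the function names and tolerance below are our own illustrative choices:

```python
import numpy as np

def classify_conic(C, tol=1e-12):
    """Classify a conic from its symmetric 3x3 matrix C (non-degenerate types only)."""
    if abs(np.linalg.det(C)) < tol:
        return "degenerate"
    d22 = np.linalg.det(C[:2, :2])      # determinant of the upper-left 2x2 block C22
    if d22 > tol:
        return "ellipse"
    if d22 < -tol:
        return "hyperbola"
    return "parabola"

def ellipse_center(C):
    """Center of an ellipse, eq. (2.6): (xc, yc) = C22^{-1} (-d, -e)."""
    return np.linalg.solve(C[:2, :2], -C[:2, 2])

# Circle (special case of an ellipse) centered at (1, 2) with radius 1:
# x^2 + y^2 - 2x - 4y + 4 = 0
C = np.array([[ 1.0,  0.0, -1.0],
              [ 0.0,  1.0, -2.0],
              [-1.0, -2.0,  4.0]])
```

For the matrix above, `classify_conic(C)` reports an ellipse and `ellipse_center(C)` recovers the center (1, 2).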
This conic relates the reprojection of the light beam emitted by a torchlight to the relative pose of the illuminated object. Related work dealing with pose estimation from (multiple) conics can be found in [48, 51, 81].

Figure 2.4: The torchlight setup used in paper A.

For the rest of this section, capital scalars X, Y, Z denote world coordinates, while lower case scalars x, y denote image coordinates. The subscripts o, p are used if there is a need to distinguish between the orthographic camera, i.e. parallel projection, and the pinhole camera. The same definitions as in [40] are used regarding orthographic and pinhole cameras.

Assume that the world coordinate system is placed at the optical center of a pinhole camera and that the optical axis is aligned with the world Z-axis. The emitted light is assumed to form a perfect cylinder with radius R that propagates in the direction of the optical axis, see Fig. 2.4. The light beam is intersected by a plane P and this will, under the mild assumption that the plane normal is not orthogonal to the optical axis, result in an ellipse [42]. If the plane normal is orthogonal to the optical axis, the result is a line, or rather two coincident parallel lines according to the previous section. The camera views the illuminated plane, which results in a bright ellipse in the image plane, described by Cp, that is directly related to the relative pose. These are the same assumptions as made in the included paper A.

We are looking for the resulting conic Cp in the image of the pinhole camera. One way is to first find the expression, in world coordinates, of the resulting quadric describing the intersection of the light beam and the plane P, and then project this quadric into the pinhole camera. However, an easier way is to first assume that we use an orthographic camera placed at the same position as the pinhole camera and find Co, the conic in the orthographic image.
Finding Co is trivial, since the optical axis is assumed to coincide with the direction of light propagation. This conic can then be transformed into the pinhole camera using a homography, resulting in the desired Cp.

The rest of this section is structured as follows. First, the derivation of the homography relating the two cameras is described; secondly, the resulting conic in the orthographic camera is discussed; and thirdly, the homography and the orthographic conic are used in order to find the desired conic in the pinhole camera.

Finding the Homography

Under the assumption that the two cameras are viewing the same plane P, a homography relates the coordinates in the orthographic camera to coordinates in the pinhole camera. This homography H is, up to a scalar, given by the relation

    H p_o = p_p    (2.7)

where p_o, p_p denote homogeneous coordinates in the two cameras. This can further be written as

    H P_o [X, Y, Z, 1]^T = P_p [X, Y, Z, 1]^T,    (2.8)

where P_o, P_p denote the corresponding projection matrices and (X, Y, Z) ∈ P. The orthographic projection matrix is given as

    P_o = [ 1  0  0  0
            0  1  0  0
            0  0  0  1 ]    (2.9)

while the actually used camera is modelled as a pinhole camera with focal length f and projection matrix

    P_p = [ f  0  0  0
            0  f  0  0
            0  0  1  0 ].    (2.10)

Further assume that the plane P lies at distance Z0, with normal (n1, n2, -1)^T, and is parametrized over (X, Y) as

    Z(X, Y) = n1 X + n2 Y + Z0.    (2.11)

Combining equations (2.8)-(2.11) results in

    H P_o [X, Y, n1 X + n2 Y + Z0, 1]^T = P_p [X, Y, n1 X + n2 Y + Z0, 1]^T    (2.12)

    H [X, Y, 1]^T = [f X, f Y, n1 X + n2 Y + Z0]^T    (2.13)

and the final homography is identified as

    H = [ f   0   0
          0   f   0
          n1  n2  Z0 ].    (2.14)

Finding the Conic in the Orthographic Camera

The light beam/cylinder is given as

    L(X, Y, Z) = 1 if X^2 + Y^2 ≤ R^2,
    L(X, Y, Z) = 0 if X^2 + Y^2 > R^2,    (2.15)

where X, Y, Z denote world coordinates and R is the radius of the beam.
The conic describing the image of the outer contour in the orthographic camera P_o, see (2.9), is readily identified as

    x_o^2 + y_o^2 = R^2    (2.16)

where (x_o, y_o) denote the coordinates in the image plane. This can further be written as

    p_o^T C_o p_o = 0    (2.17)

using homogeneous coordinates p_o = [x_o, y_o, 1]^T and the matrix representation of the conic, where

    C_o = [ 1  0   0
            0  1   0
            0  0  -R^2 ].    (2.18)

Transforming the Conic into the Pinhole Camera

Equation (2.14) describes the mapping from coordinates in the orthographic image to coordinates in the pinhole image, see (2.7). According to [40], the corresponding transformation of C_o into C_p is

    C_p = H^{-T} C_o H^{-1}.    (2.19)

This can be verified by manipulating (2.17) according to

    0 = p_o^T C_o p_o    (2.20)
    0 = p_o^T (H^T H^{-T}) C_o (H^{-1} H) p_o    (2.21)
    0 = (H p_o)^T H^{-T} C_o H^{-1} (H p_o)    (2.22)

and identifying p_p = H p_o, which gives

    0 = p_p^T H^{-T} C_o H^{-1} p_p    (2.23)
    0 = p_p^T C_p p_p.    (2.24)

Combining (2.14), (2.18) and (2.19) gives

    C_p = [ 1/f^2 - R^2 n1^2/(Z0^2 f^2)    -R^2 n1 n2/(Z0^2 f^2)              R^2 n1/(Z0^2 f)
            -R^2 n1 n2/(Z0^2 f^2)          1/f^2 - R^2 n2^2/(Z0^2 f^2)        R^2 n2/(Z0^2 f)
            R^2 n1/(Z0^2 f)                R^2 n2/(Z0^2 f)                    -R^2/Z0^2 ].    (2.25)

C_p being a projective element allows simplification of (2.25) by multiplication with Z0^2 f^2 / R^2, giving

    C_p = [ Z0^2/R^2 - n1^2    -n1 n2             f n1
            -n1 n2             Z0^2/R^2 - n2^2    f n2
            f n1               f n2               -f^2 ],    (2.26)

which is also the form used in paper A.

Chapter 3: Tracking

This chapter is an extended version of the brief introductions to Bayesian tracking contained in the included papers B and E. Also included is a section on the channel representation used in paper E. The channel representation is a sparse localized representation that, among other things, can be used for estimation and representation of probability density functions.

3.1 Bayesian Tracking

Throughout this thesis, the term tracking refers to Bayesian tracking unless otherwise stated.
This should not be confused with visual tracking techniques, such as the KLT-tracker [71], which minimize a cost function directly in the image domain. Bayesian tracking (or Bayesian filtering) techniques address the problem of estimating an object's state vector, which may consist of arbitrary abstract properties, based on measurements, which are usually not direct measurements of the tracked state dimensions. Example applications are estimating the 3D position of an object based on its (x, y)-position in the image plane, or estimating the pose vector of a bicycle based on observations of the wheels, as in paper B. Bayesian tracking techniques are often applied to visual data, see e.g. [13, 45, 74, 83].

Assume a system that changes over time and a way to acquire measurements from the same system. The task is then to estimate the probability of each possible state of the system given all measurements up to the current time step. To put it more formally: in Bayesian tracking, the current system state is represented as a probability density function (pdf) over the system's state space. The state density for a given time is estimated in two separate steps. First, the pdf from the previous time step is propagated through the system model, which gives a prior estimate for the current state. Secondly, new measurements are used to update the prior distribution, which results in the state estimate for the current time step, i.e. the posterior distribution. The process is commonly illustrated as a closed loop with two phases, see Fig. 3.1.

Figure 3.1: Illustration of the Bayesian tracking loop. The loop alternates between making predictions, Eq. (3.3), and incorporating new measurements, Eq. (3.4).

Using the same notation as in [7, 26], the system model f is given as

    x_k = f(x_{k-1}, v_{k-1}),    (3.1)

where x_k denotes the state of the system and v_k denotes the noise term, both at time k.
The system model describes how the system state changes over time k. The measurement model h is defined as

    z_k = h(x_k, n_k),    (3.2)

where n_k denotes the noise term at time k. The task is thus to estimate the pdf p(x_k | z_{1:k}), where z_{1:k} denotes all measurements from time 1 to k. This is achieved by combining the old state estimate with new measurements. The old state estimate is propagated through the system model, resulting in a prediction/prior distribution for the new time step. Given the previous measurements and the system model, the prior distribution is

    p(x_k | z_{1:k-1}) = ∫ p(x_k | x_{k-1}) p(x_{k-1} | z_{1:k-1}) dx_{k-1},    (3.3)

which follows from (3.1) representing a first order Markov model. When new measurements become available, the prior distribution is updated accordingly, and the estimate of the posterior distribution is obtained as

    p(x_k | z_{1:k}) = p(x_k | z_{1:k-1}, z_k)
                     = p(z_k | x_k, z_{1:k-1}) p(x_k | z_{1:k-1}) / p(z_k | z_{1:k-1})
                     = p(z_k | x_k) p(x_k | z_{1:k-1}) / p(z_k | z_{1:k-1}),    (3.4)

where the last equality follows from the measurement model (3.2), since z_k depends only on x_k. The denominator in (3.4),

    p(z_k | z_{1:k-1}) = ∫ p(z_k | x_k) p(x_k | z_{1:k-1}) dx_k,    (3.5)

acts as a normalizing constant, ensuring that the posterior estimate is a proper pdf. It is possible to estimate x_k by recurrent use of (3.3) and (3.4), given an estimate of the initial state p(x_0) and assuming p(x_0 | z_0) = p(x_0).

Equation (3.4) can be solved exactly or only approximately, depending on the assumptions made about the system. Under the assumption of a linear system model and a linear measurement model, combined with Gaussian white noise [49], the Kalman filter is the optimal recursive solution in the maximum likelihood sense. Various numerical methods exist for handling the general case with non-linear models and non-Gaussian noise, e.g. particle filters [36] and grid-based methods [7]. For a good introduction and overview of Bayesian estimation techniques, see [7, 15].
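For a discretized state space, the recursion (3.3)-(3.5) can be implemented directly as a grid-based filter. The following minimal sketch assumes a small 1D grid; the transition and likelihood values are made-up toy models:

```python
import numpy as np

def predict(posterior, transition):
    """Eq. (3.3): prior = sum over x_{k-1} of p(x_k | x_{k-1}) p(x_{k-1} | z_{1:k-1})."""
    return transition @ posterior

def update(prior, likelihood):
    """Eq. (3.4): posterior proportional to p(z_k | x_k) p(x_k | z_{1:k-1})."""
    posterior = likelihood * prior
    return posterior / posterior.sum()   # normalization, eq. (3.5)

# Five grid states; the target tends to drift one state to the right.
# Columns sum to 1: T[i, j] = p(x_k = i | x_{k-1} = j).
T = np.array([[0.8, 0.0, 0.0, 0.0, 0.0],
              [0.2, 0.8, 0.0, 0.0, 0.0],
              [0.0, 0.2, 0.8, 0.0, 0.0],
              [0.0, 0.0, 0.2, 0.8, 0.0],
              [0.0, 0.0, 0.0, 0.2, 1.0]])

p = np.full(5, 0.2)                                  # uniform initial state p(x_0)
likelihood = np.array([0.05, 0.1, 0.7, 0.1, 0.05])   # measurement favoring state 2
p = update(predict(p, T), likelihood)                # one tracking-loop iteration
```

After one prediction-update cycle the posterior is a proper pdf concentrated near the measured state, mirroring the loop in Fig. 3.1.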
3.2 Data Association

The problem of data association arises whenever measurements might come from multiple sources, such as in multi-target tracking, or in the presence of false and/or missing measurements. The problem is to correctly associate the acquired measurements with the tracked targets. This is one of the greatest and most fundamental challenges when dealing with Bayesian tracking in computer vision [6]. There are numerous reasons why this is a hard and still largely unsolved problem. At each time step, the prediction from the previous one is to be matched to the new measurements. If there are no new measurements matching the prediction, this might be due to occlusion, an incorrect prediction, or the tracked object having ceased to exist. If there are multiple measurements matching the prediction, a decision has to be made regarding which one, if any, to use. If multiple targets match a single measurement, this situation must also be dealt with.

The most straightforward way of dealing with the problem is the greedy nearest neighbor principle. Target-measurement associations are simply made such that each prediction is paired with the nearest still unused measurement. This approach requires making hard associations at each time step. Consequently, if an incorrect association is made, recovery is unlikely. Other approaches postpone the association decision by looking at the development over a window in time, e.g. Multiple Hypotheses Tracking (MHT). Another strategy is to update each prediction based on all available measurements, but to weight the importance of each measurement according to its agreement with the prediction, e.g. the Probabilistic Data Association Filter (PDAF) [76], the Joint PDAF, and Probabilistic Multiple Hypotheses Tracking (PMHT) [80]. Much research is undertaken within this field, see e.g. approaches based on random finite sets, such as the Probability Hypothesis Density (PHD) filter [72].
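The greedy nearest neighbor principle can be sketched in a few lines; the gating threshold, Euclidean distance, and function names below are our own illustrative choices:

```python
import numpy as np

def greedy_nearest_neighbor(predictions, measurements, gate):
    """Pair each prediction with the nearest still-unused measurement within the gate.

    Returns a dict mapping prediction index -> measurement index (or None)."""
    used = set()
    associations = {}
    for i, pred in enumerate(predictions):
        dists = [np.linalg.norm(pred - m) if j not in used else np.inf
                 for j, m in enumerate(measurements)]
        j = int(np.argmin(dists))
        if dists[j] <= gate:           # hard association; no recovery if wrong
            associations[i] = j
            used.add(j)
        else:
            associations[i] = None     # missed detection, occlusion, or lost target
    return associations

preds = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
meas  = [np.array([4.8, 5.1]), np.array([0.2, -0.1])]
```

Running `greedy_nearest_neighbor(preds, meas, gate=1.0)` pairs each prediction with the nearby measurement; shrinking the gate instead reports missed detections, illustrating the hard decisions discussed above.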
3.3 Channel Representation

This section contains an extended version of the brief introduction to the channel representation found in paper E. The channel representation is a sparse localized representation [38], which is used in the included paper to represent probability density functions. Channel encoding is a way to transform a compact representation, such as numbers, into a sparse localized representation. For an overview and definitions of the aspects of compact/sparse/local representations, see [35]. This introduction to the channel representation is limited to the encoding of scalars, but the representation readily generalizes to multiple dimensions.

Using the same notation as in [47], a channel vector c is constructed from a scalar x by the nonlinear transformation

    c = [B(x - x̃_1), B(x - x̃_2), ..., B(x - x̃_N)]^T,    (3.6)

where B(·) denotes the basis/kernel function used. B is often chosen to be symmetric, non-negative and with compact support. The kernel centers x̃_i can be placed arbitrarily in the input space, but are often uniformly distributed. The process of creating a channel vector from a scalar, or another compact representation, is referred to as channel encoding and the opposite process is referred to as decoding. Gaussians, B-splines, and windowed cos² functions are examples of suitable kernel functions [35]. Using the windowed cos² function

    B(x) = cos²(ax) if |x| ≤ π/(2a),
    B(x) = 0 otherwise,    (3.7)

and placing 10 kernels centered on the integer values x̃_i ∈ [1, 10], gives the basis functions seen in Fig. 3.2. For this example, the kernel width is set by a = π/3, which means that there are always three simultaneously non-zero kernels for the domain [1.5, 9.5]. How to properly choose a depending on the required spatial and feature resolution is addressed in [22].

Figure 3.2: Ten cos² kernels, with the respective kernels centered on integer values.

Encoding the scalar x = 3.3 using these
kernels results in the channel vector

    c = [B(2.3), B(1.3), B(0.3), ..., B(-6.7)]^T
      = [0  0.04  0.90  0.55  0  0  0  0  0  0]^T.    (3.8)

Note that only a few of the channels have a non-zero value, and that only channels close to each other are activated. This illustrates how channel encoding results in a sparse localized representation.

The basic idea when decoding a channel vector is to consider only a few neighboring channels at a time, in order to ensure that locality is preserved in the decoding process as well. The decoding algorithm for the cos² kernels in (3.7) is adapted from [35] and is repeated here for completeness:

    x̂_l = l + (1/(2a)) arg( Σ_{k=l}^{l+M-1} c^k e^{i 2a(k-l)} ).    (3.9)

Here, c^k denotes the kth element in the channel vector, l indicates the element position in the resulting vector, and M = π/a indicates how many channels are considered at a time, i.e. M = 3 in our case. An estimate x̂_l that is outside its valid range [l + 1.5, l + 2.5] is rejected. Additionally, each decoded value is accompanied by a certainty measure

    r_l = Σ_{k=l}^{l+M-1} c^k.    (3.10)

Applying (3.9) and (3.10) to (3.8) results in

    x̂ = [-0.02  3.30  3.31  4.00  5.00  6.00  7.00  8.00]^T    (3.11)
    r  = [ 0.95  1.50  1.46  0.55  0.00  0.00  0.00  0.00]^T.    (3.12)

Note that only the second element in x̂ is within its valid range, leaving only the correct estimate of 3.3, which also has the highest confidence.

Adding a number of channel vectors results in a soft histogram, i.e. a histogram with overlapping bins. Using the same kernels as above, and encoding x1 = 3.3 and x2 = 6.8, results in

    c1 = [0  0.04  0.90  0.55  0  0  0     0     0     0]^T
    c2 = [0  0     0     0     0  0  0.48  0.96  0.96  0]^T    (3.13)

and the corresponding soft histogram

    c = c1 + c2 = [0  0.04  0.90  0.55  0  0  0.48  0.96  0.96  0]^T.    (3.14)

Due to the locality of the representation, the two different scalars do not interfere with each other.
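The encoding step, equations (3.6)-(3.7), translates almost directly into code. The sketch below (function names are ours) reproduces the encoding of x = 3.3:

```python
import numpy as np

def cos2_kernel(x, a=np.pi / 3):
    """Windowed cos^2 kernel, eq. (3.7): nonzero only for |x| <= pi/(2a)."""
    return np.where(np.abs(x) <= np.pi / (2 * a), np.cos(a * x) ** 2, 0.0)

def channel_encode(x, centers=np.arange(1, 11)):
    """Channel vector, eq. (3.6): one kernel response per channel center."""
    return cos2_kernel(x - centers)

c = channel_encode(3.3)
# roughly [0, 0.04, 0.90, 0.55, 0, 0, 0, 0, 0, 0], cf. eq. (3.8)

# Adding channel vectors of well-separated scalars gives a soft histogram.
soft = channel_encode(3.3) + channel_encode(6.8)
```

A useful property of the cos² kernels with a = π/3 is that the three active channel values always sum to 3/2, which is what makes the constant full-confidence value in (3.12) possible.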
Retrieving the original scalars is straightforward as long as they are sufficiently separated with respect to the kernels used. Even in the case of interference, retrieving the cluster centers is a simple procedure. For more details on decoding schemes, see [35, 47].

The ability to simultaneously represent multiple values can be used for, e.g., estimating the local orientation in an image or representing multiple hypotheses for the state of a tracked target. A certainty measure is also obtained while decoding, making it possible to recover multiple modes with decreasing certainty. A certainty measure can also be included in the encoding process, by simply multiplying the channel vector by the certainty. An example of how this has been used can be found in paper E, where this property is used for encoding noisy measurements.

As mentioned above, a soft histogram is obtained by adding channel vectors. This can be used for estimating and representing probability density functions (pdfs). It is simple to find the peaks of the pdf by decoding the channel vector, quite similar to locating the bin with the most entries in an ordinary histogram. However, the accuracy of an ordinary histogram is limited by the bin size. In the channel case, sub-bin accuracy is possible, due to the fact that the channels are overlapping and that the distance to the channel center determines the influence of each sample. It has been shown [24] that the use of the channel representation reduces the quantization effect by a factor of up to 20 compared to ordinary histograms. Using channels instead of histograms thus allows for reduced computational complexity, by using fewer bins, or for higher accuracy while using the same number of bins. It is also possible to obtain a continuous reconstruction of the underlying pdf, instead of just locating the peaks [47]. As previously stated, this is a very brief introduction to the channel representation.
The interested reader is referred to [23, 35, 38, 46, 47] for in-depth presentations.

Chapter 4: Visual Servoing

This chapter is intended as an extended introduction to paper F and contains an introduction to visual servoing, adopting the nomenclature from [44, 53]. The use of visual information for robot control can be divided into two classes depending on the approach: open-loop systems and closed-loop systems. The term visual servoing refers to the latter approach.

4.1 Open-Loop Systems

An open-loop system can be seen as a system working in two distinct phases, where the extraction of visual information is separated from the task of operating the robot. Information, e.g. the position of the object to be grasped, is extracted from the image(s) during the first phase. This information is then fed to a robot control system that moves the robot arm blindly during the second phase. This requires an accurate inverse kinematic model for the robot arm, as well as an accurately calibrated camera system. Also, the environment needs to remain static between the assessment phase and the movement phase.

4.2 Visual Servoing

The second main approach is based on a closed-loop system architecture, often denoted visual servoing. The extraction of visual information and the computation of control signals are more tightly coupled than for open-loop systems. Visual information is continuously used as feedback to update the control signals. This results in a system that is less dependent on a static environment, calibrated camera(s), etc.

Depending on the method of transforming information into robot action, visual servoing systems are further divided into two subclasses: dynamic look-and-move systems and direct visual servoing systems. Dynamic look-and-move systems use visually extracted information as input to a robot controller that computes the desired joint configurations and then uses joint feedback to internally stabilize the robot.
This means that once the desired lengths and angles of the joints have been computed, this configuration is reached. Direct visual servoing systems use the extracted information to directly compute the input to the robot, meaning that this approach can be used when no joint feedback is available. Both the dynamic look-and-move and the direct visual servoing approach may be used in a position based or image based way, or in a combination of both. In a position based approach, the images are processed such that relevant 3D information is retrieved in world/robot/camera coordinates. The process of positioning the robotic arm is then defined in the appropriate 3D coordinate system. In an image based approach, 2D information is directly used to decide how to position the robot, i.e. the robotic arm is to be moved to a position defined by image coordinates. See figures 4.1 and 4.2 for flowcharts describing the different system architectures.

[Figure 4.1: Flowchart for a position based dynamic look-and-move system (blocks: feature extraction, 3D pose estimation, Cartesian control law, joint controller). ∆x denotes the deviation between target (xw) and reached (x) configuration of the end-effector. All configurations are given as 3D positions for this position based setup.]

[Figure 4.2: Flowchart for an image based direct visual servo system (blocks: feature extraction, 2D pose estimation, image based control law, joint controller). ∆x denotes the deviation between target (xw) and reached (x) configuration of the end-effector. All configurations are given in 2D coordinates for this setup.]

According to the introduced nomenclature, the approach used in paper F is classified as image based direct visual servoing. The desired configuration is specified in terms of image coordinates for automatically acquired features, which are directly mapped into control signals for the robotic arm.
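The closed-loop behaviour can be sketched as a minimal simulation. Everything below is hypothetical: the "robot plus camera" is a linear map from control signals to observed image features, and the inverse image Jacobian estimate is deliberately inaccurate, so that the feedback loop has to iterate:

```python
import numpy as np

def servo(observe, x_target, y, J_inv, tol=1e-6, max_iter=50):
    """Closed-loop visual servoing sketch: repeatedly observe the reached
    configuration, compute the task-space deviation and correct the control
    signal through an (approximate) inverse image Jacobian."""
    for _ in range(max_iter):
        x = observe(y)                 # feedback: feature extraction
        e = x_target - x               # remaining task-space deviation
        if np.linalg.norm(e) < tol:    # stopping criterion
            break
        y = y + J_inv @ e              # joint-space correction
    return y, x

# Hypothetical linear plant: image features respond linearly to the controls.
A = np.array([[2.0, 0.3], [-0.1, 1.5]])
b = np.array([0.5, -0.2])
observe = lambda y: A @ y + b

J_inv = 0.8 * np.linalg.inv(A)         # deliberately imperfect estimate
y, x = servo(observe, np.array([1.0, 2.0]), np.zeros(2), J_inv)
```

With a perfect Jacobian the loop would converge in one step; the scaled estimate still converges, only more slowly, which illustrates why continuous visual feedback makes the system robust against modelling errors.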
4.3 The Visual Servoing Task

The task in visual servoing is to minimize the norm of the deviation vector ∆x = xw − x, where x denotes the reached configuration and xw denotes the target configuration. For example, the configuration x may denote position, velocity and/or jerk of the joints. The configuration x is said to lie in the task space, and the control signal y that generated this configuration is located in the joint space. The image Jacobian Jimg is the linear mapping that maps changes in joint space ∆y to changes in task space ∆x such that:

∆x = Jimg ∆y. (4.1)

The term image Jacobian is used since the task space is often the acquired image(s). The configuration vector is then the position of features in these images. The term interaction matrix may sometimes be encountered instead of image Jacobian. Furthermore, let J denote the inverse image Jacobian, i.e. a mapping from changes in task space to changes in joint space such that:

∆y = J∆x, (4.2)

where

J = [ ∂y1/∂x1 … ∂y1/∂xn
         ⋮    ⋱    ⋮
      ∂ym/∂x1 … ∂ym/∂xn ]. (4.3)

The term inverse image Jacobian does not necessarily mean that J is the mathematical inverse of Jimg. In fact, the mapping Jimg does not need to be injective, and hence is not necessarily invertible. The word inverse simply implies that the inverse image Jacobian describes changes in joint space given desired changes in task space, while the image Jacobian describes changes in task space given changes in joint space. If the inverse image Jacobian, or an estimate thereof, has been acquired, the task of correcting for an erroneous control signal is rather simple in theory. If the current position with deviation ∆x originates from the control signal y, the new control signal is then given as

ynew = y − J∆x. (4.4)

However, in a non-ideal situation, the new control signal will most likely not result in the target configuration either. The process of estimating the Jacobian and updating the control signal needs to be repeated until a stopping criterion is met, e.g.
the deviation is sufficiently small or the maximum number of iterations is reached.

Chapter 5
Concluding Remarks

Part I of this thesis covers some basic material complementing the publications included in Part II. This concluding section summarizes the main results and briefly discusses possible areas of future research.

5.1 Results

Much of the work within this thesis has been carried out in projects aiming for (cognitive) driver assistance systems and hopefully represents a step towards improving traffic safety. The main contributions are within the area of Computer Vision, and more specifically, within the areas of shape matching, Bayesian tracking, and visual servoing, with the main focus being on shape matching and applications thereof. The different methods have been demonstrated in traffic safety applications, such as bicycle tracking, car tracking, and traffic sign recognition, as well as for pose estimation and robot control.

One of the core contributions is a new method for recognizing closed contours. This matching method, in combination with spatial models, has led to a methodology for traffic sign detection and recognition. Another contribution has been the extension of a framework for learning based Bayesian tracking called channel based tracking. The framework has been evaluated in car tracking scenarios and is shown to give competitive tracking performance compared to standard approaches. The last field of contribution has been cognitive robot control. A method is presented for learning how to control a robotic arm without knowing beforehand what it looks like or how it is controlled. Below follows a brief summary of the contributions of each of the included papers.

Paper A contains work on relative pose estimation using a torch light. The reprojection of the emitted light beam creates, under certain conditions, an ellipse in the image plane.
It is shown that it is possible to use this ellipse in order to estimate the relative pose between the torchlight and the illuminated object.

Paper B builds on the ideas presented in paper A and contains initial work on bicycle tracking. The relative pose estimates are based on ellipses originating from the projection of the bicycle wheels into the image. This is combined with a particle filter framework and a weakly articulated object model in order to track the bicycle in 3D. This approach is demonstrated in simulations and on real world data with encouraging results.

In paper C, a novel method for matching Fourier descriptors is presented and evaluated. One of the main conclusions is that it is important to keep the phase information when matching Fourier descriptors. Neglecting the phase corresponds to matching while minimizing the rotation difference between each individual pair of Fourier coefficients, instead of minimizing the rotation difference between the shapes. This can result in perfect matches between intrinsically different shapes. Another benefit of keeping the phase is that rotation covariant or invariant matching is achieved in the same way, by using complex valued correlation. The only difference is to consider either the magnitude, for rotation invariant matching, or just the real value, for rotation covariant matching, of the complex valued correlation.

In paper D, the matching method presented in paper C is used in combination with an implicit star-shaped object model for traffic sign recognition. The presented method works fully automatically on query images with no need for regions-of-interest. It is shown that the presented method performs well for traffic signs that contain multiple distinct contours, while some improvement is still needed for signs defined by a single contour.
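The core idea of paper C, complex correlation of phase-preserving Fourier descriptors, can be sketched as follows. This is a minimal illustration, not the published method: the normalization and function names are simplifications, but the rotation invariant/covariant distinction via magnitude versus real part is exactly the property described above:

```python
import numpy as np

def fourier_descriptor(contour):
    """Descriptor of a closed contour sampled as complex points x + iy.
    The DC term is removed for translation invariance and the energy is
    normalized for scale invariance; the phase of all coefficients is kept."""
    z = np.fft.fft(contour)
    z[0] = 0.0
    return z / np.linalg.norm(z)

def match(fd_a, fd_b, rotation_invariant=True):
    """Complex correlation over all starting-point shifts. Rotating the
    shape by an angle phi multiplies every coefficient by exp(1j*phi), so
    the magnitude of the correlation is rotation invariant while its real
    part is rotation covariant."""
    corr = np.fft.ifft(fd_a * np.conj(fd_b)) * len(fd_a)
    scores = np.abs(corr) if rotation_invariant else np.real(corr)
    return scores.max()

# A rotated copy with a different starting point still gives a perfect
# rotation invariant match.
t = 2 * np.pi * np.arange(64) / 64
shape = np.exp(1j * t) + 0.5 * np.exp(2j * t)       # asymmetric closed contour
same = np.roll(shape * np.exp(1j * np.pi / 3), 5)   # rotated, shifted start
score = match(fourier_descriptor(shape), fourier_descriptor(same))  # ~1.0
```

Requesting the rotation covariant score instead (the real part) penalizes the rotated copy, which is the behaviour one wants when the orientation of the shape matters.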
The presented methodology is general enough to be used for arbitrary objects, as long as they can be defined by a number of regions. Another major contribution is the release of the first large publicly available database that contains complete query images, not only small patches around the traffic signs, allowing for comparison of different approaches.

Paper E contains work on learning based object tracking and extends a framework for Bayesian tracking called channel based tracking. Compared to earlier work, the multi-dimensional case has been reformulated in a sound probabilistic way and the learning algorithm itself has been extended. The framework is evaluated in car tracking scenarios and is shown to give competitive tracking performance compared to standard approaches.

Paper F describes a method that allows simultaneous learning of the appearance and control of a robotic arm. The method achieves sufficient accuracy for simple assembly tasks by combining autonomous recognition with visual servoing, based on a learned mapping between percepts and actions. The paper demonstrates that limitations of inexpensive hardware, such as web cameras and low-cost robotic arms, can be overcome using powerful algorithms.

All in all, the methods developed and presented in this thesis can all be used for different components in a system guided by visual information, and hopefully represent a step towards improving traffic safety.

5.2 Future Work

The methods and results presented in this thesis are currently being developed and used within the projects ETT and GARNICS (Gardening with a Cognitive System [3]). ETT, as described earlier, focuses on extended target tracking, with applications in the traffic safety domain. An interesting research direction would be to investigate to what degree the tracking framework could be incorporated directly in the matching of contours.
By tracking contours over time it would be possible to learn what transformations the contour can undergo. Given a few observations of a contour, the system could predict how the contour should look in the next time step, and this could be exploited in the matching step. Another obvious extension of the method presented in paper D would be to include color in the traffic sign prototypes. This is a straightforward extension and will likely lead to increased matching performance, although at a slightly higher computational cost.

GARNICS is a European project within the cognitive systems domain and aims at 3D sensing of plant growth and building perceptual representations for learning the links to actions of a robot gardener. The Fourier descriptor based matching combined with the spatial models can potentially be used to keep track of the growth of plants by recognizing the individual leaves and their relative positions. In an embodied setting such as GARNICS, the tracking and recognition framework could be utilized for guiding the actions in case of uncertainties. If the system is uncertain of the identity of an object, actions can be chosen by consulting the tracking model in order to resolve these ambiguities.

Bibliography

[1] The COSPAL project. http://www.cospal.org.
[2] The DIPLECS project. http://www.diplecs.eu.
[3] The GARNICS project. http://www.garnics.eu.
[4] K. Arbter, W. Snyder, and H. Burkhardt. Application of Affine-Invariant Fourier Descriptors to Recognition of 3-D Objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):640–647, 1990.
[5] T. Ardeshiri, F. Larsson, F. Gustafsson, T. Schön, and M. Felsberg. Bicycle tracking using ellipse extraction. In Proceedings of the 14th International Conference on Information Fusion, 2011.
[6] H. Ardö. Multi-target Tracking Using on-line Viterbi Optimisation and Stochastic Modelling.
PhD thesis, Centre for Mathematical Sciences LTH, Lund University, Sweden, 2009.
[7] M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.
[8] X. Bai, X. Yang, L. Latecki, W. Liu, and Z. Tu. Learning context sensitive shape similarity by graph transduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):861–874, 2010.
[9] D. H. Ballard. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2):111–122, 1981.
[10] S. Belongie, J. Malik, and J. Puzicha. Shape Matching and Object Recognition Using Shape Contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509–522, 2002.
[11] R. Bix. Conics and Cubics. Springer, 2006.
[12] M. Bober, F. Preteux, and Y.-M. Kim. MPEG-7 visual shape descriptors. IEEE Transactions on Circuits and Systems for Video Technology, 11(6):716–719, June 2001.
[13] M. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Online Multiperson Tracking-by-Detection from a Single, Uncalibrated Camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9):1820–1833, 2011.
[14] A. Bronstein, M. Bronstein, A. Bruckstein, and R. Kimmel. Analysis of Two-Dimensional Non-Rigid Shapes. International Journal of Computer Vision, 78(1):67–88, 2008.
[15] Z. Chen. Bayesian filtering: From Kalman filters to particle filters, and beyond. Technical report, Communications Research Laboratory, McMaster University, 2003.
[16] Commission for Global Road Safety. Make Roads Safe, A Decade of Action for Road Safety. ISBN 978-0-9561403-2-6, 2010.
[17] Wikimedia Commons. File:Conic sections with plane.svg. http://commons.wikimedia.org/wiki/File:Conic_sections_with_plane.svg.
[18] M. Donoser, H. Riemenschneider, and H. Bischof. Efficient partial shape matching of outer contours.
In Proceedings of the Asian Conference on Computer Vision (ACCV), 2009.
[19] A. El-ghazal, O. Basir, and S. Belkasim. Farthest point distance: A new shape signature for Fourier descriptors. Signal Processing: Image Communication, 24(7):572–586, 2009.
[20] L. Ellis and R. Bowden. Learning responses to visual stimuli: A generic approach. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2007.
[21] L. Ellis, M. Felsberg, and R. Bowden. Affordance mining: Forming perception through action. In Proceedings of the Asian Conference on Computer Vision (ACCV), 2010.
[22] M. Felsberg. Spatio-featural scale-space. In Proceedings of the International Conference on Scale Space Methods and Variational Methods in Computer Vision, volume 5567 of Lecture Notes in Computer Science (LNCS), 2009.
[23] M. Felsberg. Adaptive filtering using channel representations. In L. M. J. Florack, R. Duits, G. Jongbloed, M.-C. van Lieshout, and L. Davies, editors, Locally Adaptive Filters in Signal and Image Processing, pages 35–54. Springer, 2011.
[24] M. Felsberg, P.-E. Forssén, and H. Scharr. Channel smoothing: Efficient robust smoothing of low-level signal features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(2):209–222, 2006.
[25] M. Felsberg and F. Larsson. Learning Bayesian tracking for motion estimation. In Proceedings of the European Conference on Computer Vision (ECCV), International Workshop on Machine Learning for Vision-based Motion Analysis, 2008.
[26] M. Felsberg and F. Larsson. Learning higher-order Markov models for object tracking in image sequences. In Proceedings of the International Symposium on Visual Computing (ISVC), volume 5876 of Lecture Notes in Computer Science, pages 184–195. Springer-Verlag, 2009.
[27] M. Felsberg and F. Larsson. Learning object tracking in image sequences. In Proceedings of the International Conference on Cognitive Systems, 2010.
[28] M. Felsberg, F. Larsson, W. Han, A. Ynnerman, and T. Schön.
Torch guided navigation. In Proceedings of the Swedish Symposium on Image Analysis (SSBA), 2010.
[29] M. Felsberg, F. Larsson, W. Han, A. Ynnerman, and T. Schön. Torchlight navigation. In Proceedings of the 20th International Conference on Pattern Recognition (ICPR), 2010.
[30] M. Felsberg, A. Shaukat, and D. Windridge. Online learning in perception-action systems. In Proceedings of the European Conference on Computer Vision (ECCV), Workshop on Vision for Cognitive Tasks, 2010.
[31] P. Felzenszwalb. Representation and Detection of Deformable Shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(2):208–220, 2005.
[32] P. Felzenszwalb and J. Schwartz. Hierarchical matching of deformable shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[33] V. Ferrari, F. Jurie, and C. Schmid. From images to shape models for object detection. International Journal of Computer Vision, 87(3):284–303, 2010.
[34] S. Fidler, M. Boben, and A. Leonardis. Learning hierarchical compositional representations of object structure. In S. Dickinson, A. Leonardis, B. Schiele, and M. J. Tarr, editors, Object Categorization: Computer and Human Vision Perspectives. Cambridge University Press, 2009.
[35] P.-E. Forssén. Low and Medium Level Vision using Channel Representations. PhD thesis, Linköping University, SE-581 83 Linköping, Sweden, March 2004. Dissertation No. 858, ISBN 91-7373-876-X.
[36] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. Radar and Signal Processing, IEE Proceedings F, 140(2):107–113, 1993.
[37] G. H. Granlund. Fourier Preprocessing for Hand Print Character Recognition. IEEE Transactions on Computers, C-21(2):195–201, 1972.
[38] G. H. Granlund. An associative perception-action structure using a localized space variant information representation. In Proceedings of Algebraic Frames for the Perception-Action Cycle (AFPAC), 2000.
[39] G. H. Granlund. Organization of architectures for cognitive vision systems. In H. I. Christensen and H.-H. Nagel, editors, Cognitive Vision Systems: Sampling the spectrum of approaches, pages 37–55. Springer-Verlag, Berlin Heidelberg, Germany, 2006.
[40] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN 0521540518, second edition, 2004.
[41] J. Hedborg, P.-E. Forssén, and M. Felsberg. Fast and accurate structure and motion estimation. In International Symposium on Visual Computing, volume 5875 of Lecture Notes in Computer Science, pages 211–222. Springer-Verlag, 2009.
[42] D. Hilbert and S. Cohn-Vossen. Geometry and the Imagination. Chelsea Publishing Company, New York, 1952.
[43] M. Hu. Visual Pattern Recognition by Moment Invariants. IRE Transactions on Information Theory, IT-8:179–187, 1962.
[44] S. A. Hutchinson, G. D. Hager, and P. I. Corke. A tutorial on visual servo control. IEEE Transactions on Robotics and Automation, 12(5):651–670, 1996.
[45] M. Isard and A. Blake. CONDENSATION – conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.
[46] B. Johansson. Low Level Operations and Learning in Computer Vision. PhD thesis, Linköping University, SE-581 83 Linköping, Sweden, December 2004. Dissertation No. 912, ISBN 91-85295-93-0.
[47] E. Jonsson. Channel-Coded Feature Maps for Computer Vision and Machine Learning. PhD thesis, Linköping University, SE-581 83 Linköping, Sweden, February 2008. Dissertation No. 1160, ISBN 978-91-7393-988-1.
[48] F. Kahl and A. Heyden. Using Conic Correspondences in Two Images to Estimate the Epipolar Geometry. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1998.
[49] R. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME – Journal of Basic Engineering, 82(Series D):35–45, 1960.
[50] K. Kanatani.
Geometric computation for machine vision. Oxford University Press, Inc., 1993.
[51] J. Kannala, M. Salo, and J. Heikkilä. Algorithms for computing a planar homography from conics in correspondence. In Proceedings of the British Machine Vision Conference (BMVC), 2006.
[52] A. Khotanzad and Y. Hong. Invariant Image Recognition by Zernike Moments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(5):489–497, 1990.
[53] D. Kragic and H. I. Christensen. Survey on visual servoing for manipulation. Technical report ISRN KTH/NA/P–02/01–SE, CVAP259, Jan. 2002.
[54] F. Larsson. Visual Servoing Based on Learned Inverse Kinematics. M.Sc. thesis LITH-ISY-EX–07/3929, Linköping University, 2007.
[55] F. Larsson. Methods for Visually Guided Robotic Systems: Matching, Tracking and Servoing. Linköping Studies in Science and Technology, Thesis No. 1416, Linköping University, 2009.
[56] F. Larsson. Automatic 3D Model Construction for Turn-Table Sequences – A Simplification. LiTH-ISY-R-3022, Linköping University, Department of Electrical Engineering, 2011.
[57] F. Larsson and M. Felsberg. Traffic sign recognition using Fourier descriptors and spatial models. In Proceedings of the Swedish Symposium on Image Analysis (SSBA), 2011.
[58] F. Larsson and M. Felsberg. Using Fourier Descriptors and Spatial Models for Traffic Sign Recognition. In Proceedings of the Scandinavian Conference on Image Analysis (SCIA), volume 6688 of Lecture Notes in Computer Science, pages 238–249, 2011.
[59] F. Larsson, M. Felsberg, and P.-E. Forssén. Correlating Fourier descriptors of local patches for road sign recognition. IET Computer Vision, 5(4):244–254, 2011.
[60] F. Larsson, M. Felsberg, and P.-E. Forssén. Patch contour matching by correlating Fourier descriptors. In Digital Image Computing: Techniques and Applications (DICTA), Melbourne, Australia, December 2009. IEEE Computer Society.
[61] F. Larsson, P.-E. Forssén, and M. Felsberg.
Using Fourier descriptors for local region matching. In Proceedings of the Swedish Symposium on Image Analysis (SSBA), 2009.
[62] F. Larsson, E. Jonsson, and M. Felsberg. Visual servoing based on learned inverse kinematics. In Proceedings of the Swedish Symposium on Image Analysis (SSBA), 2007.
[63] F. Larsson, E. Jonsson, and M. Felsberg. Visual servoing for floppy robots using LWPR. In Workshop on Robotics and Mathematics (ROBOMAT), pages 225–230, 2007.
[64] F. Larsson, E. Jonsson, and M. Felsberg. Learning floppy robot control. In Proceedings of the Swedish Symposium on Image Analysis (SSBA), 2008.
[65] F. Larsson, E. Jonsson, and M. Felsberg. Simultaneously learning to recognize and control a low-cost robotic arm. Image and Vision Computing, 27(11):1729–1739, 2009.
[66] J. W. Lasley. On degenerate conics. The American Mathematical Monthly, 64(5):362–364, 1957.
[67] J. D. Lawrence. A Catalog of Special Plane Curves. Dover Publications, Inc., 1972.
[68] M. Leordeanu, M. Hebert, and R. Sukthankar. Beyond Local Appearance: Category Recognition from Pairwise Interactions of Simple Features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[69] S. Loncaric. A survey of shape analysis techniques. Pattern Recognition, 31(8):983–1001, 1998.
[70] G. Lu and A. Sajjanhar. Region-based shape representation and similarity measure suitable for content-based image retrieval. Multimedia Systems, 7(2):165–174, 1999.
[71] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, 1981.
[72] R. Mahler. Multitarget Bayes filtering via first-order multitarget moments. IEEE Transactions on Aerospace and Electronic Systems, 39(4):1152–1178, 2003.
[73] F. Mokhtarian, S. Abbasi, and J. Kittler. Robust and efficient shape indexing through curvature scale space.
In Proceedings of the British Machine Vision Conference (BMVC), 1996.
[74] V. Pavlovic, J. M. Rehg, T. J. Cham, and K. P. Murphy. A dynamic Bayesian network approach to figure tracking using learned dynamic models. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1999.
[75] N. Pugeault and R. Bowden. Driving me around the bend: Learning to drive from visual gist. In Proceedings of the 1st IEEE Workshop on Challenges and Opportunities in Robot Perception, in parallel to the IEEE International Conference on Computer Vision (ICCV), 2011.
[76] Y. Bar-Shalom and E. Tse. Tracking in a cluttered environment with probabilistic data association. Automatica, 11(5):451–460, 1975.
[77] J. Shotton, A. Blake, and R. Cipolla. Contour-based learning for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2005.
[78] K. Siddiqi, A. Shokoufandeh, S. Dickinson, and S. Zucker. Shock Graphs and Shape Matching. International Journal of Computer Vision, 35(1):13–32, 1999.
[79] P. Srestasathiern and A. Yilmaz. Planar Shape Representation and Matching Under Projective Transformation. Computer Vision and Image Understanding, in press, 2011.
[80] R. L. Streit and T. E. Luginbuhl. Probabilistic multi-hypothesis tracking. Technical Report 10, NUWC-NPT, 1995.
[81] A. Sugimoto. A Linear Algorithm for Computing the Homography from Conics in Correspondence. Journal of Mathematical Imaging and Vision, 13(2):115–130, 2000.
[82] M. R. Teague. Image analysis via the general theory of moments. Journal of the Optical Society of America (1917-1983), 70(8):920–930, 1980.
[83] K. Toyama and A. Blake. Probabilistic tracking with exemplars in a metric space. International Journal of Computer Vision, 48(1):9–19, 2002.
[84] United Nations General Assembly. Improving global road safety, A/RES/64/255. Resolution of the United Nations General Assembly, 64th session, 2010.
[85] World Health Organization.
Global status report on road safety: time for action. 2009.
[86] M. Yang, K. Kpalma, and J. Ronsin. A Survey of Shape Feature Extraction Techniques. In Pattern Recognition Techniques, Technology and Applications, pages 978–953. IN-TECH, 2008.
[87] X. Yang, S. Koknar-Tezel, and L. Latecki. Locally constrained diffusion process on locally densified distance spaces with applications to shape retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[88] C. T. Zahn and R. Z. Roskies. Fourier descriptors for plane closed curves. IEEE Transactions on Computers, C-21(3):269–281, 1972.
[89] D. Zhang and G. Lu. Generic Fourier descriptor for shape-based image retrieval. In Proceedings of the IEEE International Conference on Multimedia and Expo, 2002.
[90] D. Zhang and G. Lu. A Comparative Study of Curvature Scale Space and Fourier Descriptors for Shape-based Image Retrieval. Journal of Visual Communication and Image Representation, 14(1):39–57, 2003.
[91] D. Zhang and G. Lu. Review of shape representation and description techniques. Pattern Recognition, 37(1):1–19, 2004.
