Linköping Studies in Science and Technology
Dissertation No. 858

Low and Medium Level Vision using Channel Representations

Per-Erik Forssén

Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden
Linköping, March 2004

© 2004 Per-Erik Forssén

ISBN 91-7373-876-X
ISSN 0345-7524

Don't confuse the moon with the finger that points at it.
Zen proverb

Abstract

This thesis introduces and explores a new type of representation for low and medium level vision operations called channel representation. The channel representation is a more general way to represent information than e.g. plain numerical values, since it allows the incorporation of uncertainty, and the simultaneous representation of several hypotheses. More importantly, it also allows the representation of "no information" when no statement can be given. A channel representation of a scalar value is a vector of channel values, which are generated by passing the original scalar value through a set of kernel functions. The resultant representation is sparse and monopolar. The word sparse signifies that information is not necessarily present in all channels. On the contrary, most channel values will be zero. The word monopolar signifies that all channel values have the same sign, e.g. they are either positive or zero. A zero channel value denotes "no information", and for non-zero values, the magnitude signifies the relevance.

In the thesis, a framework for channel encoding and local decoding of scalar values is presented. Averaging in the channel representation is identified as a regularised sampling of a probability density function. A subsequent decoding is thus a mode estimation technique. The mode estimation property of channel averaging is exploited in the channel smoothing technique for image noise removal.
We introduce an improvement to channel smoothing, called alpha synthesis, which deals with the problem of jagged edges present in the original method. Channel smoothing with alpha synthesis is compared to mean-shift filtering, bilateral filtering, median filtering, and normalized averaging, with favourable results.

A fast and robust blob-feature extraction method for vector fields is developed. The method is also extended to cluster constant slopes instead of constant regions. The method is intended for view-based object recognition and wide baseline matching, and is demonstrated on a wide baseline matching problem.

A sparse scale-space representation of lines and edges is implemented and described. The representation keeps line and edge statements separate, and ensures that they are localised by inhibition from coarser scales. The result is, however, still locally continuous, in contrast to non-max-suppression approaches, which introduce a binary threshold.

The channel representation is well suited to learning, which is demonstrated by applying it in an associative network. The representational properties of associative networks using the channel representation are analysed. Finally, a reactive system design using the channel representation is proposed. The system is similar in idea to recursive Bayesian techniques using particle filters, but the present formulation allows learning using the associative networks.

Acknowledgements

This thesis could never have been written without the support of a large number of people. I am especially grateful to the following persons:

My fiancée Linda, for love and encouragement, and for constantly reminding me that there are other important things in life.

All the people at the Computer Vision Laboratory, for providing a stimulating research environment, for sharing ideas and implementations with me, and for being good friends.
Professor Gösta Granlund, for giving me the opportunity to work at the Computer Vision Laboratory, for introducing me to an interesting area of research, and for relating theories of mind and vision to our every-day experience of being.

Anders Moe and Björn Johansson, for their constructive criticism of this manuscript.

Dr Hagen Spies, for giving an inspiring PhD course, which opened my eyes to robust statistics and camera geometry.

Dr Michael Felsberg, for all the discussions on channel smoothing, B-splines, calculus in general, and Depeche Mode.

Johan Wiklund, for keeping the computers happy, and for always knowing all there is to know about new technologies and gadgets.

The Knut and Alice Wallenberg foundation, for funding research within the WITAS project.

And last but not least, my fellow musicians and friends in the band Pastell, for helping me to kill my spare time.

About the cover

The front cover is a collection of figures from the thesis, arranged to constitute a face, in the spirit of the painter Salvador Dalí. The back cover is a photograph of Swedish autumn leaves, processed with the SOR method in section 7.2.1, using intensities in the range [0, 1], and the parameters dmax = 0.05, a binomial filter of order 11, and 5 IRLS iterations.

Contents

1 Introduction
  1.1 Motivation
  1.2 Overview
  1.3 Contributions
  1.4 Notations

2 Representation of Visual Information
  2.1 System principles
    2.1.1 The world as an outside memory
    2.1.2 Active vision
    2.1.3 View centred and object centred representations
    2.1.4 Robust perception
    2.1.5 Vision and learning
  2.2 Information representation
    2.2.1 Monopolar signals
    2.2.2 Local and distributed coding
    2.2.3 Coarse coding
    2.2.4 Channel coding
    2.2.5 Sparse coding

3 Channel Representation
  3.1 Compact and local representations
    3.1.1 Compact representations
    3.1.2 Channel encoding of a compact representation
  3.2 Channel representation using the cos² kernel
    3.2.1 Representation of multiple values
    3.2.2 Properties of the cos² kernel
    3.2.3 Decoding a cos² channel representation
  3.3 Size of the represented domain
    3.3.1 A linear mapping
  3.4 Summary

4 Mode Seeking and Clustering
  4.1 Density estimation
    4.1.1 Kernel density estimation
  4.2 Mode seeking
    4.2.1 Channel averaging
    4.2.2 Expectation value of the local decoding
    4.2.3 Mean-shift filtering
    4.2.4 M-estimators
    4.2.5 Relation to clustering
  4.3 Summary and comparison

5 Kernels for Channel Representation
  5.1 The Gaussian kernel
    5.1.1 A local decoding for the Gaussian kernel
  5.2 The B-spline kernel
    5.2.1 Properties of B-splines
    5.2.2 B-spline channel encoding and local decoding
  5.3 Comparison of kernel properties
    5.3.1 The constant sum property
    5.3.2 The constant norm property
    5.3.3 The scalar product
  5.4 Metameric distance
  5.5 Stochastic kernels
    5.5.1 Varied noise level
  5.6 2D and 3D channel representations
    5.6.1 The Kronecker product
    5.6.2 Encoding of points in 2D
    5.6.3 Encoding of lines in 2D
    5.6.4 Local decoding for 2D Gaussian kernels
    5.6.5 Examples
    5.6.6 Relation to Hough transforms

6 Channel Smoothing
  6.1 Introduction
    6.1.1 Algorithm overview
    6.1.2 An example
  6.2 Edge-preserving filtering
    6.2.1 Mean-shift filtering
    6.2.2 Bilateral filtering
  6.3 Problems with strongest decoding synthesis
    6.3.1 Jagged edges
    6.3.2 Rounding of corners
    6.3.3 Patchiness
  6.4 Alpha synthesis
    6.4.1 Separating output sharpness and channel blurring
    6.4.2 Comparison of super-sampling and alpha synthesis
    6.4.3 Relation to smoothing before sampling
  6.5 Comparison with other denoising filters
  6.6 Applications of channel smoothing
    6.6.1 Extensions
  6.7 Concluding remarks

7 Homogeneous Regions in Scale-Space
  7.1 Introduction
    7.1.1 The scale-space concept
    7.1.2 Blob features
    7.1.3 A blob feature extraction algorithm
  7.2 The clustering pyramid
    7.2.1 Clustering of vector fields
    7.2.2 A note on winner-take-all vs. proportionality
  7.3 Homogeneous regions
    7.3.1 Ellipse approximation
    7.3.2 Blob merging
  7.4 Blob features for wide baseline matching
    7.4.1 Performance
    7.4.2 Removal of cropped blobs
    7.4.3 Choice of parameters
  7.5 Clustering of planar slopes
    7.5.1 Subsequent pyramid levels
    7.5.2 Computing the slope inside a binary mask
    7.5.3 Regions from constant slope model
  7.6 Concluding remarks

8 Lines and Edges in Scale-Space
  8.1 Background
    8.1.1 Classical edge detection
    8.1.2 Phase-gating
    8.1.3 Phase congruency
  8.2 Sparse feature maps in a scale hierarchy
    8.2.1 Phase from line and edge filters
    8.2.2 Characteristic phase
    8.2.3 Extracting characteristic phase in 1D
    8.2.4 Local orientation information
    8.2.5 Extracting characteristic phase in 2D
    8.2.6 Local orientation and characteristic phase
  8.3 Concluding remarks

9 Associative Learning
  9.1 Architecture overview
  9.2 Representation of system output states
    9.2.1 Channel representation of the state space
  9.3 Channel representation of input features
    9.3.1 Feature generation
  9.4 System operation modes
    9.4.1 Position encoding for discrete event mapping
    9.4.2 Magnitude encoding for continuous function mapping
  9.5 Associative structure
    9.5.1 Optimisation procedure
    9.5.2 Normalisation modes
    9.5.3 Sensitivity analysis for continuous function mode
  9.6 Experimental verification
    9.6.1 Experimental setup
    9.6.2 Associative network variants
    9.6.3 Varied number of samples
    9.6.4 Varied number of channels
    9.6.5 Noise sensitivity
  9.7 Other local model techniques
    9.7.1 Radial Basis Function networks
    9.7.2 Support Vector Machines
    9.7.3 Adaptive fuzzy control
  9.8 Concluding remarks

10 An Autonomous Reactive System
  10.1 Introduction
    10.1.1 System outline
  10.2 Example environment
  10.3 Learning successive recognition
    10.3.1 Notes on the state mapping
    10.3.2 Exploratory behaviour
    10.3.3 Evaluating narrowing performance
    10.3.4 Learning a narrowing policy
  10.4 Concluding remarks
11 Conclusions and Future Research Directions
  11.1 Conclusions
  11.2 Future research
    11.2.1 Feature matching and recognition
    11.2.2 Perception action cycles

Appendices
  A Theorems on cos² kernels
  B Theorems on B-splines
  C Theorems on ellipse functions

Bibliography

Chapter 1
Introduction

1.1 Motivation

The work presented in this thesis has been performed within the WITAS project [24, 52, 105]. The goal of the WITAS project has been to build an autonomous Unmanned Aerial Vehicle (UAV) that is able to deal with visual input, and to develop the tools and techniques needed in an autonomous systems context. Extensive work on the adaptation of more conventional computer vision techniques to the WITAS platform has previously been carried out by the author, and is documented in [32, 35, 81]. This thesis will however deal with basic research aspects of the WITAS project. We will introduce new techniques and information representations well suited for computer vision in autonomous systems.

Computer vision is usually described using a three-level model:

• The first level, low-level vision, is concerned with obtaining descriptions of image properties in local regions. This usually means descriptions of colour, lines and edges, and motion, as well as methods for noise attenuation.

• The next level, medium-level vision, makes use of the features computed at the low level. Medium-level vision has traditionally involved techniques such as joining line segments into object boundaries, clustering, and computation of depth from stereo image pairs.
Processing at this level also includes more complex tasks, such as the estimation of ego-motion, i.e. the apparent motion of a camera as estimated from a sequence of camera images.

• Finally, high-level vision involves using the information from the lower levels to perform abstract reasoning about scenes, planning, etc.

The WITAS project involves all three levels, but as the title of this thesis suggests, we will only deal with the first two. (WITAS stands for the Wallenberg laboratory for research on Information Technology and Autonomous Systems. An autonomous system is self-guided, i.e. it operates without the direct control of an operator.) The unifying theme of the thesis is a new information representation called the channel representation. All methods developed in the thesis either make explicit use of channel representations, or can be related to the channel representation.

1.2 Overview

We start the thesis in chapter 2 with a short overview of system design principles in biological and artificial vision systems. We also give an overview of different information representations.

Chapter 3 introduces the channel representation, and discusses its representational properties. We also describe how a compact representation may be converted into a channel representation using a channel encoding, and how the compact representation may be retrieved using a local decoding.

Chapter 4 relates averaging in the channel representation to estimation methods from robust statistics. We re-introduce the channel representation in a statistical formulation, and show that channel averaging followed by a local decoding is a mode estimation technique.

Chapter 5 introduces channel representations using kernels other than the cos² kernel. The different kernels are compared in a series of experiments. In this chapter we also explore the interference during local decoding between multiple values stored in a channel vector.
We also introduce the notion of stochastic kernels, and extend the channel representation to higher dimensions.

Chapter 6 describes an image denoising technique called channel smoothing. We identify a number of problems with the original channel smoothing technique, and give solutions to them, one being the alpha synthesis technique. Channel smoothing is also compared to a number of popular image denoising techniques, such as mean-shift filtering, bilateral filtering, median filtering, and normalized averaging.

Chapter 7 contains a method to obtain a sparse scale-space representation of homogeneous regions. The homogeneous regions are represented as sparse blob features. The blob feature extraction method can be applied to both grey-scale and colour images. We also extend the method to cluster constant slopes instead of locally constant regions.

Chapter 8 contains a method to obtain a sparse scale-space representation of lines and edges. In contrast to non-max-suppression techniques, the method generates a locally continuous response, which should make it well suited e.g. as input to a learning machinery.

Chapter 9 introduces an associative network architecture that makes use of the channel representation. In a series of experiments, the descriptive powers and the noise sensitivity of the associative networks are analysed. In the experiments we also compare the associative networks with conventional function approximation using local models. We also discuss the similarities and differences between the associative networks and Radial Basis Function (RBF) networks, Support Vector Machines (SVM), and fuzzy control.

Chapter 10 incorporates the associative networks in a feedback loop, which allows successive recognition in an environment with perceptual aliasing. A system design is proposed, and is demonstrated by solving the localisation problem in a labyrinth. In this chapter we also use reinforcement learning to learn an exploratory behaviour.
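The channel encoding and local decoding summarised in this overview can be sketched numerically. The sketch below is only an illustration of the idea, not the formulation developed in the thesis: the placement of channels at integer positions, the kernel width of 3, and the hand-picked decoding window are all assumptions made for this example.

```python
import numpy as np

def channel_encode(x, n_channels):
    """Encode scalar x into cos^2 channel values.

    Channels are centred at the integers 0..n_channels-1 (an assumed
    convention), and each kernel cos^2(pi*d/3) has support |d| < 1.5."""
    d = x - np.arange(n_channels)
    return np.where(np.abs(d) < 1.5, np.cos(np.pi * d / 3.0) ** 2, 0.0)

def local_decode(c, k0):
    """One possible local decoding from three consecutive channels,
    starting at index k0.  Valid when the encoded value lies inside
    the window (k0 - 0.5, k0 + 1.5]."""
    z = sum(c[k0 + j] * np.exp(1j * 2 * np.pi * j / 3) for j in range(3))
    return k0 + 3 / (2 * np.pi) * np.angle(z)

c = channel_encode(5.3, 10)
print(np.count_nonzero(c))  # sparse: only 3 of the 10 channels are non-zero
print(local_decode(c, 4))   # recovers 5.3
```

For a single encoded value, the three active cos² channels always sum to 3/2 (the constant sum property), and the complex-argument decoding above recovers the value exactly within its window; both properties fall out of the identity cos²θ = (1 + cos 2θ)/2.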
1.3 Contributions

We will now list what are believed to be the novel contributions of this thesis.

• A framework for channel encoding and local decoding of scalar values is presented in chapter 3. This material originates from the author's licentiate thesis [34], and is also contained in the article "HiperLearn: A High Performance Channel Learning Architecture" [51].

• Averaging in the channel representation is identified as a regularised sampling of a probability density function. A subsequent decoding is thus a mode estimation technique. This idea was originally mentioned in the paper "Image Analysis using Soft Histograms" [33], and is thoroughly explained in chapter 4.

• The local decoding for 1D and 2D Gaussian kernels in chapter 5. This material is also published in the paper "Two-Dimensional Channel Representation for Multiple Velocities" [93].

• The channel smoothing technique for image noise removal has been investigated by several people; for earlier work by the author, see the technical report [42] and the papers "Noise Adaptive Channel Smoothing of Low Dose Images" [87] and "Channel Smoothing using Integer Arithmetic" [38]. The alpha synthesis approach described in chapter 6 is, however, a novel contribution, not published elsewhere.

• The blob-feature extraction method developed in chapter 7. This is an improved version of the algorithm published in the paper "Robust Multi-Scale Extraction of Blob Features" [41].

• A scale-space representation of lines and edges is implemented and described in chapter 8. This chapter is basically an extended version of the conference paper "Sparse feature maps in a scale hierarchy" [39].

• The analysis of representational properties of an associative network in chapter 9. This material is derived from the article "HiperLearn: A High Performance Channel Learning Architecture" [51].
• The reactive system design using the channel representation in chapter 10 is similar in idea to recursive Bayesian techniques using particle filters. The use of the channel representation to define transition and narrowing is, however, believed to be novel. This material was also presented in the paper "Successive Recognition using Local State Models" [37], and in the technical report [36].

1.4 Notations

The mathematical notations used in this thesis should resemble those most commonly in use in the engineering community. There are however cases where several styles are common, and thus this section has been added to avoid confusion.

The following notations are used for mathematical entities:

s       Scalars (lowercase letters in italics)
u       Vectors (lowercase letters in boldface)
z       Complex numbers (lowercase letters in italics bold)
C       Matrices (uppercase letters in boldface)
s(x)    Functions (lowercase letters)

The following notations are used for mathematical operations:

A^T              Matrix and vector transpose
⌊x⌋              The floor operation
⟨x|y⟩            The scalar product
arg z            Argument of a complex number
conj z           Complex conjugate
|z|              Absolute value of real or complex numbers
‖z‖              Matrix or vector norm
(s ∗ fk)(x)      Convolution
adist(ϕ1 − ϕ2)   Angular distance of cyclic variables
vec(A)           Conversion of a matrix to a vector by stacking the columns
diag(x)          Extension of a vector to a diagonal matrix
supp{f}          The support (definition domain, or non-zero domain) of function f

Additional notations are introduced when needed.

Chapter 2
Representation of Visual Information

This chapter gives a short overview of some aspects of image interpretation in biological and artificial vision systems. We will put special emphasis on system principles, and on which information representations to choose.

2.1 System principles

When we view vision as a sense for robots and other real-time perception systems, the parallels with biological vision at the system level become obvious.
Since an autonomous robot is in direct interaction with the environment, it is faced with many of the problems that biological vision systems have dealt with successfully for millions of years. This is the reason why biological systems have been an important source of inspiration to the computer vision community since the early days of the field, see e.g. [74]. Since biological and mechanical systems use different kinds of "hardware", there are of course several important differences, so the parallel should not be taken too far.

2.1.1 The world as an outside memory

Traditionally, much effort in machine vision has been devoted to methods for finding detailed reconstructions of the external world [9]. As pointed out by e.g. O'Regan [83], there is really no need for a system that interacts with the external world to perform such a reconstruction, since the world is continually "out there". He uses the neat metaphor "the world as an outside memory" to explain why. By focusing your eyes on something in the external world, instead of examining your internal model, you will probably get more accurate and up-to-date information as well.

2.1.2 Active vision

If we do not need a detailed reconstruction, then what should the goal of machine vision be? The answer to this question in the paradigm of active vision [3, 4, 1] is that the goal should be the generation of actions. In that way the goal depends on the situation, and on the problem we are faced with. Consider the following situation: a helicopter equipped with a camera is situated above a road. From the helicopter we want to find out information about a car on the road below. When looking at the car through our sensor, we obtain a blurred image at low resolution. If the image is not good enough we could simply move closer, or change the zoom of the camera. The distance to the car can be obtained if we have several images of the car from different views.
If we want several views, we do not actually need several cameras; we could simply move the helicopter and obtain shots from other locations. The key idea behind active vision is that an agent in the external world has the ability to actively extract information from the external world by means of its actions. This ability to act can, if properly used, simplify many of the problems in vision, for instance the correspondence problem [9].

2.1.3 View centred and object centred representations

Biological vision systems interpret visual stimuli by generation of image features in several retinotopic maps [5]. These maps encode highly specific information such as colour, structure (lines and edges), motion, and several high-level features not yet fully understood. An object in the field of view is represented by connections between the simultaneously active features in all of the feature maps. This is called a view centred representation [46], and is an object representation that is distributed across all the feature maps, or views. Perceptual experiments are consistent with the notion that biological vision systems use multiple such view representations to represent three-dimensional objects [12]. In chapters 7 and 8 we will generate sparse feature maps of structural information, which can be used to form a view centred object representation.

In sharp contrast, many machine vision applications synthesise image features into compact object representations that are independent of the view from which the object is observed. This approach is called an object centred representation [46]. This kind of representation also exists in the human mind, and is used e.g. in abstract reasoning and in spoken language.

2.1.4 Robust perception

In the book "The Blind Watchmaker" [23], Dawkins gives an account of the echolocation sense of bats. The bats described in the book are almost completely blind; instead they emit ultrasound cries and use the echoes of the cries to perceive the world.
The following is a quote from [23]:

  It seems that bats may be using something that we could call a 'strangeness filter'. Each successive echo from a bat's own cries produces a picture of the world that makes sense in terms of the previous picture of the world built up with earlier echoes. If the bat's brain hears an echo from another bat's cry, and attempts to incorporate this into the picture of the world that it has previously built up, it will make no sense. It will appear as though objects in the world have suddenly jumped in various random directions. Objects in the real world do not behave in such a crazy way, so the brain can safely filter out the apparent echo as background noise.

A crude equivalent to this strangeness filter has been developed in the field of robust statistics [56]. Here, samples that do not fit the model at all may be rejected as outliers. In this thesis we will develop another robust technique, using the channel information representation.

2.1.5 Vision and learning

As machine vision systems become increasingly complex, the need to specify their behaviour without explicit programming becomes increasingly apparent. If a system is supposed to act in an unrestricted environment, it needs to behave in accordance with the current surroundings. The system thus has to be flexible, and needs to be able to generate context dependent responses. This leads to a very large number of possible behaviours that are difficult or impossible to specify explicitly. Such context dependent responses are preferably learned by subjecting the system to the relevant situations, and applying percept-response association [49]. By using learning, we are able to define what our system should do, not how it should do it. And finally, a system that is able to learn is able to adapt to changes, and to act in novel situations that the programmer did not foresee.
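Returning briefly to the "strangeness filter" of section 2.1.4: the idea of rejecting samples that do not fit the bulk of the data can be sketched with a standard robust scale estimate. The median-absolute-deviation rule below is our own minimal illustration, not a method from the thesis, and the threshold k = 3 is an arbitrary assumption.

```python
import numpy as np

def reject_outliers(samples, k=3.0):
    """Keep samples close to the bulk of the data, in the spirit of a
    'strangeness filter'.  Distance is measured in units of the median
    absolute deviation (MAD), a standard robust scale estimate."""
    samples = np.asarray(samples, dtype=float)
    med = np.median(samples)
    mad = np.median(np.abs(samples - med)) + 1e-12  # guard against MAD = 0
    return samples[np.abs(samples - med) / mad <= k]

# Hypothetical echo delays; 7.3 plays the role of another bat's cry.
echoes = [1.02, 0.98, 1.05, 1.00, 7.3, 0.97]
print(reject_outliers(echoes))  # the 7.3 sample is filtered out
```

The same median/MAD rule underlies many M-estimator weighting schemes in robust statistics; it simply formalises "does this sample make sense given the picture built up so far".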
2.2 Information representation

We will now discuss a number of different approaches to the representation of information, which are used in biological and artificial vision systems. This is by no means an exhaustive presentation; it should rather be seen as background and motivation for the representations chosen in the following chapters of this thesis.

2.2.1 Monopolar signals

Information processing cells in the brain exhibit either bipolar or monopolar responses. One rare example of bipolar detectors is the hair cells in the semicircular canals of the vestibular system (the vestibular system coordinates the orientation of the head). These cells hyperpolarise when the head rotates one way, and depolarise when it is rotated the other way [61].

Bipolar signals are typically represented numerically as values in a range centred around zero, e.g. [−1.0, 1.0]. Consequently, monopolar signals are represented as non-negative numbers in a range from zero upwards, e.g. [0, 1.0]. Interestingly, there seem to be no truly bipolar detectors at any stage of the visual system. Even the bipolar cells of the retina are monopolar in their responses, despite their name. The disadvantage of a monopolar detector compared to a bipolar one is that it can only respond to one aspect of an event. For instance, the retinal bipolar cells respond to either bright or dark regions. Thus there are twice as many retinal bipolar cells as there would have been if they had had bipolar responses. However, a bipolar detector has to produce a maintained discharge at the equilibrium. (For the bipolar cells this would have meant maintaining a level in between the bright and dark levels.) This makes bipolar detectors much more sensitive to disturbances [61]. Monopolar, or non-negative, representations will be used frequently throughout this thesis.
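The bright/dark split of the retinal cells described above can be mimicked in code: a bipolar signal is divided into two monopolar channels by rectification. This is a minimal illustration of the monopolar principle, assumed for this example rather than taken from the thesis.

```python
import numpy as np

def to_monopolar(signal):
    """Split a bipolar signal (range [-1, 1]) into two monopolar,
    non-negative channels, one per aspect of the event.
    A zero in both channels then means 'no information'."""
    s = np.asarray(signal, dtype=float)
    on = np.maximum(s, 0.0)     # responds only to the positive aspect
    off = np.maximum(-s, 0.0)   # responds only to the negative aspect
    return on, off

on, off = to_monopolar([-0.8, 0.0, 0.5])
print(on)   # [0.  0.  0.5]
print(off)  # [0.8 0.  0. ]
```

Note that neither channel needs to maintain a resting discharge at equilibrium: the equilibrium state is simply zero in both channels, which is the robustness argument made in the text.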
Although the use of monopolar signals is widespread in biological vision systems, it is rarely found in machine vision. It has however been suggested in [45]. 2.2.2 Local and distributed coding Three different strategies for representing a system state using a number of signals are given by Thorpe in [97]. Thorpe uses the following simple example to illustrate their differences: We have a stimulus that can consist of a horizontal or a vertical bar. The bar can be either white, black, or absent (see figure 2.1). For simplicity we assume that the signals are binary, i.e. either active or inactive.

Figure 2.1: Local, semi-local, and distributed coding. Figure adapted from [97].

One way to represent the state of the bar is to assign one signal to each of the possible system states. This is called a local coding in figure 2.1, and the result is a local representation. One big advantage of a local representation is that the system can deal with several state hypotheses at once. In the example in figure 2.1, two active signals would mean that there were two bars present in the scene. Another way is to assign one output for each state of the two properties: orientation and colour. This is called semi-local coding in figure 2.1. As we move away from a completely local representation, the ability to deal with several hypotheses gradually disappears. For instance, if we have one vertical and one horizontal bar, we can deal with them separately using a semi-local representation only if they have the same colour. The third variant in figure 2.1 is to assign one stimulus pattern to each system state. In this representation the number of output signals is minimised. This results in a representation of a given system state being distributed across the whole range of signals, hence the name distributed representation.
Since this variant also succeeds at minimising the number of output signals, it is also a compact coding scheme. These three representation schemes also differ in terms of metric. A similarity metric is a measure of how similar two states are. The coding schemes in figure 2.1 can for instance be compared by counting how many active (i.e. non-zero) signals they have in common. For the local representation, no two states have common signals, and thus, in a local representation we can only tell whether two states are the same or not. For the distributed representation, the similarity metric is completely random, and thus not useful. For the semi-local representation, however, we get a useful metric. For example, bars with the same orientation but different colour will have one active signal in common, and are thus halfway between being the same state and being different states. 2.2.3 Coarse coding We will now describe a coding scheme called coarse coding, see e.g. [96]. Coarse coding is a technique that can represent continuous state spaces. In figure 2.2 the plane represents a continuous two-dimensional state space. This space is coded using a number of feature signals with circular receptive fields, illustrated by the circles in the figure.

Figure 2.2: Coarse coding. Figure adapted from [96].

Each feature signal is binary, i.e. either active or inactive, and is said to coarsely represent the location in state space. Since we have several features which are partially overlapping, we can get a rough estimate of where in state space we are by considering all the active features. The white cross in the figure symbolises a particular state, and each feature activated by this state has its receptive field coloured grey. As can be seen, we get an increasingly darker shade of grey where several features are active, and the region where the colour is the darkest contains the actual state.
Evidently, a small change of location in state space will result in a small change in the activated feature set. Thus coarse coding results in a useful similarity metric, and can be identified as a semi-local coding scheme according to the taxonomy in section 2.2.2. As we add more features in a coarse coding scheme, we obtain an increasingly better resolution of the state space. 2.2.4 Channel coding The multiple channel hypothesis is discussed by Levine and Shefner [71] as a model for human analysis of periodic patterns. According to [71], the multiple channel hypothesis was first put forward by Campbell and Robson in 1968 [13]. The multiple channel hypothesis constitutes a natural extension of coarse coding to smoothly varying features called channels, see figure 2.3. It is natural to consider smoothly varying and overlapping features for the representation of continuous phenomena, but there is also evidence for channel representations of discrete state spaces, such as the representation of quantity in primates [79].

Figure 2.3: Linear channel arrangement. One channel function is shown in solid, the others are dashed.

The process of converting a state variable into channels is known in signal processing as channel coding, see [89] and [90], and the resultant information representation is called a channel representation [46, 10, 80]. Representations using channels allow a state space resolution much better than indicated by the number of channels, a phenomenon known as hyperacuity [89]. As is common in science, different fields of research have different names for almost the same thing. In neuroscience and computational neurobiology the concept population coding [108] is sometimes used as a synonym for channel representation. In neural networks the concept of radial basis functions (RBF) [7, 58] is used to describe responses that depend on the distance to a specific position.
In control theory, the fuzzy membership functions also have a similar shape and application [84]. The relationship between channel representations, RBF networks and fuzzy control will be explored in section 9.7. 2.2.5 Sparse coding A common coding scheme is the compact coding scheme used in data compression algorithms. Compact coding is the solution to an optimisation where the information content in each output signal is maximised. But we could also envision a different optimisation goal: maximisation of the information content in the active signals only (see figure 2.4). Something similar to this seems to happen at the lower levels of visual processing in mammals [31]. The result of this kind of optimisation on visual input is a representation that is sparse, i.e. most signals are inactive. The result of a sparse coding is typically either a local or a semi-local representation, see section 2.2.2.

Figure 2.4: Compact coding (minimum number of units) and sparse coding (minimum number of active units). Figure adapted from [31].

As we move upwards in the interpretation hierarchy in biological vision systems, from cone cells, via centre-surround cells, to the simple and complex cells in the visual cortex, the feature maps tend to employ increasingly sparse representations [31]. There are several good reasons why biological systems employ sparse representations, many of which could also apply to machine vision systems. For biological vision, one advantage is that the amount of signalling is kept at a low rate, which is a good thing since signalling wastes energy. Sparse coding also leads to representations in which pattern recognition, template storage, and matching are made easier [31, 75, 35].
Compared to compact representations, sparse features convey more information when they are active, and, contrary to how it might appear, the amount of computation will not be increased significantly, since only the active features need to be considered. Both coarse coding and channel coding approximate the sparse coding goal. They both produce representations where most signals are inactive. Additionally, an active signal conveys more information than an inactive one, since an active signal tells us roughly where in state space we are. Chapter 3 Channel Representation In this chapter we introduce the channel representation, and discuss its representational properties. We also derive expressions for channel encoding and local decoding using cos² kernel functions. 3.1 Compact and local representations 3.1.1 Compact representations Compact representations (see chapter 2) such as numbers and generic object names (house, door, Linda) are useful for communicating precise pieces of information. One example of this is the human use of language. However, compact representations are not well suited if we want to learn a complex and unknown relationship between two sets of data (as in function approximation, or regression), or if we want to find patterns in one data set (as in clustering, or unsupervised learning). Inputs in compact representations tend to describe temporally and/or spatially distant events as one thing, and thus the actual meaning of an input cannot be established until we have seen the entire training set. Another motivation for localised representations is that most functions can be sufficiently well approximated as locally linear, and linear relationships are easy to learn (see chapter 9 for more on local learning). 3.1.2 Channel encoding of a compact representation The advantages of localised representations mentioned above motivate the introduction of the channel representation [46, 10, 80].
The channel representation is an encoding of a signal value x, and an associated confidence r ≥ 0. This is done by passing x through a set of localised kernel functions {B_k(x)}_{k=1}^{K}, and weighting the result with the confidence r. Each output signal is called a channel, and the vector consisting of a set of channel values

u = r ( B_1(x)  B_2(x)  ...  B_K(x) )^T    (3.1)

is said to be the channel representation of the signal–confidence pair (x, r), provided that the channel encoding is injective for r ≠ 0, i.e. there should exist a corresponding decoding that reconstructs x and r from the channel values. The confidence r can be viewed as a measure of the reliability of the value x. It can also be used as a means of introducing a prior, if we want to do Bayesian inference (see chapter 10). When no confidence is available, it is simply taken to be r = 1. Examples of suitable kernels for channel representations include Gaussians [89, 36, 93], B-splines [29, 87], and windowed cos² functions [80]. In practice, any kernel with a shape similar to the one in figure 3.1 will do.

Figure 3.1: A kernel function that generates a channel from a signal.

In the following sections, we will exemplify the properties of channel representations with the cos² kernel. Later on we will introduce the Gaussian and the B-spline kernels. We also make a summary where the advantages and disadvantages of each kernel are compiled. Finally we put the channel representation into perspective by comparing it with other local model techniques. 3.2 Channel representation using the cos² kernel We will now exemplify channel representation with the cos² kernel

B_k(x) = cos²(ω d(x, k))  if ω d(x, k) ≤ π/2,  and  B_k(x) = 0  otherwise.    (3.2)

Here the parameter k is the kernel centre, ω is the channel width, and d(x, k) is a distance function. For variables in linear domains (i.e.
subsets of ℝ) the Euclidean distance is used,

d(x, k) = |x − k| ,    (3.3)

and for periodic domains (i.e. domains isomorphic with S) with period K a modular distance is used (the modulo operation being mod(x, K) = x − ⌊x/K⌋K),

d_K(x, k) = min( mod(x − k, K), mod(k − x, K) ) .    (3.4)

The measure of an angle is a typical example of a variable in a periodic domain. The total domain of a signal x can be seen as cut up into a number of local but partially overlapping intervals, d(x, k) ≤ π/(2ω), see figure 3.2.

Figure 3.2: Linear and modular arrangements of cos² kernels. One kernel is shown in solid, the others are dashed. Channel width is ω = π/3.

For example, the channel representation of the value x = 5.23, with confidence r = 1, using the kernels in figure 3.2 (left), becomes

u = ( 0  0  0  0.0778  0.9431  0.4791  0  0 )^T .

As can be seen, many of the channel values become zero. This is often the case, and is an important aspect of channel representation, since it allows more compact storage of the channel values. A channel with value zero is said to be inactive, and a non-zero channel is said to be active. As is also evident in this example, the channel encoding is only able to represent signal values in a bounded domain. The exact size of the represented domain depends on the method we use to decode the channel vector; thus we will first derive a decoding scheme (in section 3.2.3) and then find out the size of the represented domain (in section 3.3). In order to simplify the notation in (3.2), the channel positions were defined as consecutive integers, directly corresponding to the indices of consecutive kernel functions. We are obviously free to scale and translate the actual signal value in any desired way before we apply the set of kernel functions. For instance, a signal value ξ can be scaled and translated using

x = scale · (ξ − translation),    (3.5)

to fit the domain represented by the set of kernel functions {B_k(x)}_{k=1}^{K}.
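The encoding just described is straightforward to implement. The following Python sketch (our own illustration, not code from the thesis) channel encodes a scalar with the cos² kernel (3.2), using K = 8 channels at integer positions and ω = π/3, and reproduces the example vector for x = 5.23:

```python
import math

def channel_encode(x, r=1.0, K=8, omega=math.pi / 3):
    """Channel encode the signal-confidence pair (x, r) using cos^2
    kernels (3.2) centred at k = 1..K, with linear-domain distance (3.3)."""
    u = []
    for k in range(1, K + 1):
        d = abs(x - k)
        u.append(r * math.cos(omega * d) ** 2 if omega * d < math.pi / 2 else 0.0)
    return u

u = channel_encode(5.23)
print([round(v, 4) for v in u])
# [0.0, 0.0, 0.0, 0.0778, 0.9431, 0.4791, 0.0, 0.0]
```

Note that the channel sum is Σ_k u_k = π/(2ω) = 1.5 wherever x lies inside the represented domain, in accordance with the constant sum property (3.6) derived below for ω = π/N.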
Non-linear mappings x = f(ξ) are of course also possible, but they should be monotonic for the representation to be unambiguous. 3.2.1 Representation of multiple values Since each signal value will only activate a small subset of channels, most of the values in a channel vector will usually be zero. This means that for a large channel vector, there is room for more than one scalar. This is an important aspect of the channel representation, which gives it an advantage over compact representations. For instance, we can simultaneously represent the value 7 with confidence 0.3 and the value 3 with confidence 0.7 in the same channel vector

u = ( 0  0.175  0.7  0.175  0  0.075  0.3  0.075 )^T .

This is useful to describe ambiguities. Using the channel representation we can also represent the statement “no information”, which simply becomes an all-zero channel vector

u = ( 0  0  0  0  0  0  0  0 )^T .

There is an interesting parallel to multiple responses in biological sensory systems. If someone pokes two fingers in your back, you can feel where they are situated if they are a certain distance apart. If they are too close, however, you will instead perceive one poking finger in between the two. A representation where this phenomenon can occur is called metameric in psychology (another example of a metameric representation is colour, which basically is a three channel representation of wavelength), and the states (one poking finger, or two close poking fingers) that cannot be distinguished in the given representation are called metamers. The metamery aspect of a channel representation (using Gaussian kernels) was studied by Snippe and Koenderink in [89, 90] from a perceptual modelling perspective. We will refer to the smallest distance between sensations that a channel representation can handle as the metameric distance. Later on (section 5.4) we will have a look at how small this distance actually is for different channel representations.
The typical behaviour is that for large distances between encoded values we have no interference, for intermediate distances we do have interference, and for small distances the encoded values will be averaged [34, 87]. 3.2.2 Properties of the cos² kernel The cos² kernel was the first one used in a practical experiment. In [80] Nordberg et al. applied it to a simple pose estimation problem. A network with channel inputs was trained to estimate channel representations of distance, horizontal position, and orientation of a wire-frame cube. The rationale for introducing the cos² kernel was a constant norm property, and a constant norm of the derivative. Our motivations for using the cos² kernel (3.2) are that it has a localised support, which ensures sparsity, and that for values of ω = π/N where N ∈ {3, 4, ...} we have

Σ_k B_k(x) = π/(2ω)   and   Σ_k B_k(x)² = 3π/(8ω) .    (3.6)

This implies that the sum and the vector norm of a channel value vector generated from a single signal–confidence pair are invariant to the value of the signal x, as long as x is within the represented domain of the channel set (for proofs, see theorems A.3 and A.4 in the appendix). The constant sum implies that the encoded value and the encoded confidence can be decoded independently. The constant norm implies that the kernels locally constitute a tight frame [22], a property that ensures uniform distribution of signal energy in the channel space, and makes a decoding operation easy to find. 3.2.3 Decoding a cos² channel representation An important property of the channel representation is the possibility to retrieve the signal–confidence pairs stored in a channel vector.
The problem of decoding signal and confidence values from a set of channel function values superficially resembles the reconstruction of a continuous function from a set of frame coefficients. There is however a significant difference: we are not interested in reconstructing the exact shape of a function, we merely want to find all peak locations and their heights. In order to decode several signal values from a channel vector, we have to make a local decoding, i.e. a decoding that assumes that the signal value lies in a specific limited interval (see figure 3.3).

Figure 3.3: Interval for local decoding (ω = π/3). The channels k, k+1, k+2 decode the interval [k+0.5, k+1.5].

For the cos² kernel, and the local tight frame situation (3.6), it is suitable to use decoding intervals of the form [k − 1 + N/2, k + N/2] (see theorem A.1 in the appendix). The reason for this is that a signal value in such an interval will only activate the N nearest channels, see figure 3.3. Decoding a channel vector thus involves examining all such intervals for signal–confidence pairs, by computing estimates using only those channels which should have been activated. The local decoding is computed using a method illustrated in figure 3.4.

Figure 3.4: Example of channel values (ω = π/3, and x̂ = 5.23).

The channel values u_k are now seen as samples from a kernel function translated to have its peak at the represented signal value x̂. We denote the index of the first channel in the decoding interval by l (in the figure we have l = 4), and use groups of consecutive channel values {u_l, u_{l+1}, ..., u_{l+N−1}}. If we assume that the channel values of the N active channels constitute an encoding of a single signal–confidence pair (x, r), we obtain N equations

( u_l  u_{l+1}  ...  u_{l+N−1} )^T = ( rB_l(x)  rB_{l+1}(x)  ...  rB_{l+N−1}(x) )^T .    (3.7)

We will now transform an arbitrary row of this system in a number of steps

u_{l+d} = rB_{l+d}(x) = r cos²(ω(x − l − d))    (3.8)
u_{l+d} = r/2 (1 + cos(2ω(x − l − d)))    (3.9)
u_{l+d} = r/2 (1 + cos(2ω(x − l)) cos(2ωd) + sin(2ω(x − l)) sin(2ωd))    (3.10)
u_{l+d} = ( ½cos(2ωd)  ½sin(2ωd)  ½ ) ( r cos(2ω(x − l))  r sin(2ω(x − l))  r )^T .    (3.11)

We can now rewrite (3.7) as

u = ½ [ cos(2ω·0)  sin(2ω·0)  1 ; cos(2ω·1)  sin(2ω·1)  1 ; ... ; cos(2ω(N−1))  sin(2ω(N−1))  1 ] p = A p ,
where  p = ( r cos(2ω(x − l))  r sin(2ω(x − l))  r )^T .    (3.12)

For N ≥ 3, this system can be solved using a least-squares fit

p = ( p_1  p_2  p_3 )^T = (AᵀA)⁻¹ Aᵀ u = Wu .    (3.13)

Here W is a constant matrix, which can be computed in advance and used to decode all local intervals. The final estimate of the signal value becomes

x̂ = l + (1/(2ω)) arg[ p_1 + i p_2 ] .    (3.14)

For the confidence estimate, we have two solutions

r̂_1 = |p_1 + i p_2|   and   r̂_2 = p_3 .    (3.15)

The case of ω = π/2 requires a different approach to find x̂, r̂_1, and r̂_2, since u = Ap is under-determined when N = 2. Since the channel width ω = π/2 has proven not very useful in practice, this decoding approach has been moved to observation A.5 in the appendix. When the two confidence measures are equal, we have a group of consecutive channel values {u_l, u_{l+1}, ..., u_{l+N−1}} that originate from a single signal value x. The fraction r̂_1/r̂_2 is independent of scalings of the channel vector, and could be used as a measure of the validity of the model assumption (3.7). The model assumption will quite often be violated when we use the channel representation. For instance, response channels estimated using a linear network will not in general fulfil (3.7), even though we may have supplied such responses during training. We will study the robustness of the decoding (3.14), as well as the behaviour in case of interfering signal–confidence pairs, in chapter 5. See also [36].
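As a concrete illustration, the decoding (3.13)–(3.15) can be sketched in a few lines of Python. The sketch below is our own and not the thesis implementation: it uses the fact that for ω = π/N the matrix AᵀA is diagonal, so that p₁ + ip₂ is proportional to a complex-exponential weighted sum of the channel values (the 4|z|/N confidence scaling is our normalisation for this tight-frame case). It scans all decoding intervals and keeps only decodings that fall inside their own interval:

```python
import cmath
import math

def local_decode(u, omega=math.pi / 3):
    """Decode all (value, confidence) pairs from a channel vector u whose
    channels sit at positions 1..len(u), using groups of N = pi/omega
    consecutive channels (the tight-frame case of eq. 3.6)."""
    N = round(math.pi / omega)
    pairs = []
    for l0 in range(len(u) - N + 1):         # l0 = l - 1 (0-based first channel)
        z = sum(u[l0 + d] * cmath.exp(2j * omega * d) for d in range(N))
        if abs(z) < 1e-12:
            continue                         # no signal in this interval
        x = l0 + 1 + cmath.phase(z) / (2 * omega)    # signal estimate, eq. (3.14)
        r = 4 * abs(z) / N                   # confidence r1, tight-frame scaling
        if l0 + N / 2 < x <= l0 + 1 + N / 2:         # keep in-interval decodings only
            pairs.append((x, r))
    return pairs

# decode the two values stored in the channel vector of section 3.2.1
u = [0, 0.175, 0.7, 0.175, 0, 0.075, 0.3, 0.075]
print(local_decode(u))   # two signal-confidence pairs, about (3.0, 0.7) and (7.0, 0.3)
```

Decodings that land outside their own interval (as happens for the interval groups straddling the two peaks) are discarded, in line with the discussion above.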
The solution in (3.14) is said to be a local decoding, since it has been defined using the assumption that the signal value lies in a specific interval (illustrated in figure 3.3). If the decoded value lies outside the interval, the local peak is probably better described by another group of channel values. For this reason, decodings falling outside their decoding intervals are typically neglected. We can also note that for the local tight frame situation (3.6), the matrix AᵀA becomes diagonal, and we can compute the local decoding as a local weighted summation of complex exponentials

x̂ = l + (1/(2ω)) arg[ Σ_{k=l}^{l+N−1} u_k e^{i2ω(k−l)} ] .    (3.16)

For this situation the relation between neighbouring channel values tells us the signal value, and the channel magnitudes tell us the confidence of this statement. In signal processing it is often argued that it is important to attach a measure of confidence to signal values [48]. The channel representation can be seen as a unified representation of signal and confidence. 3.3 Size of the represented domain As mentioned in section 3.2, a channel representation is only able to represent values in a bounded domain, which has to be known beforehand. We will now derive an expression for the size of this domain. We start by introducing a notation for the active domain (non-zero domain, or support) of a channel

S_k = {x : B_k(x) > 0} = ]l_k, u_k[    (3.17)

where l_k and u_k are the lower and upper bounds of the active domain. Since the kernels should go smoothly to zero (see section 3.2.2), this is always an open interval, as indicated by the brackets. For the cos² kernel (3.2), and the constant sum situation (3.6), the common support of N channels, S_k^N, becomes

S_k^N = S_k ∩ S_{k+1} ∩ ... ∩ S_{k+N−1} = ]k − 1 + N/2, k + N/2[ .    (3.18)

This is proven in theorem A.1 in the appendix. See also figure 3.5 for an illustration.

Figure 3.5: Common support regions for ω = π/3. Left: supports S_k for individual channels.
Right: common supports Sk3 . If we perform the local decoding using groups of N channels with ω = π/N , N ∈ N/{1}, we will have decoding intervals of type (3.18). These intervals are all of length 1, and thus they do not overlap (see figure 3.5, right). We now modify the upper end of the intervals SkN = ]k − 1 + N/2, k + N/2] (3.19) in order to be able to join them. This makes no practical difference, since all that happens at the boundary is that one channel becomes inactive. For a channel representation using K channels (with K ≥ N ) we get a represented interval of type N N = S1N ∪ S2N ∪ . . . ∪ SK−N RK +1 = ]N/2, K + 1 − N/2] (3.20) This expression is derived in theorem A.2 in the appendix. For instance K = 8, and ω = π/3 (and thus N = 3), as in figure 3.2, left, will give us R83 = ]3/2, 8 + 1 − 3/2] = ]1.5, 7.5] . 3.3.1 A linear mapping Normally we will need to scale and translate our measurements to fit the represented domain for a given channel set. We will now describe how this linear mapping is found. N = If we have a variable ξ ∈ [rl , ru ] that we wish to map to the domain RK ]RL , RU ] using x = t1 ξ + t0 , we get the system ¶µ ¶ µ ¶ µ 1 rl t0 RL = (3.21) RU 1 ru t1 with the solution t1 = RU − RL ru − rl and t0 = RL − t1 rl . (3.22) 3.4 Summary 21 N Inserting the boundaries of the represented domain RK , see (3.20) gives us t1 = K +1−N ru − rl and t0 = N − t1 rl . 2 (3.23) This expression will be used in the experiment sections to scale data to a given set of kernels. 3.4 Summary In this chapter we have introduced the channel representation concept. Important properties of channel representations are that we can represent ambiguous statements, such as “either the value 3 or the value 7”. We can also represent the confidence we have in each hypothesis, i.e. statements like “the value 3 with confidence 0.6 or the value 7 with confidence 0.4” are possible. We are also able to represent the statement “no information”, using an all zero channel vector. 
The signal–confidence pairs stored in a channel vector can be retrieved using a local decoding. The local decoding problem superficially resembles the reconstruction of a continuous signal from a set of samples, but it is actually different, since we are only interested in finding the peaks of a function. We also note that the decoding has to be local in order to decode multiple values. An important limitation of the channel representation is that we can only represent signals with bounded values, i.e. we must know a largest possible value and a smallest possible value of the signal to represent. For a bounded signal, we can derive an optimal linear mapping that maps the signal to the interval a given channel set can represent. Chapter 4 Mode Seeking and Clustering In this chapter we will relate averaging in the channel representation to estimation methods from robust statistics. We do this by re-introducing the channel representation in a slightly different formulation. 4.1 Density estimation Assume that we have a set of vectors x_n that are measurements from the same source. Given this set of measurements, can we make any prediction regarding a new measurement? If the process that generates the measurements does not change over time, it is said to be a stationary stochastic process, and for a stationary process, an important property is the relative frequencies of the measurements. Estimation of relative frequencies is exactly what is done in probability density estimation. 4.1.1 Kernel density estimation If the data x_n ∈ ℝ^d come from a discrete distribution, we could simply count the number of occurrences of each value of x_n, and use the relative frequencies of the values as measures of probability. An example of this is a histogram computation. However, if the data have a continuous distribution, we instead need to estimate a probability density function (PDF) f : ℝ^d → ℝ⁺ ∪ {0}.
Each value f(x) is non-negative, and is called a probability density for the value x. This should not be confused with the probability of obtaining a given value, which is normally zero for a signal with a continuous distribution. The integral of f(x) over a domain tells us the probability of x occurring in this domain. In all practical situations we have a finite amount of samples, and we will thus somehow have to limit the degrees of freedom of the PDF, in order to avoid over-fitting to the sample set. Usually a smoothness constraint is applied, as in the kernel density estimation methods, see e.g. [7, 43]. A kernel density estimator estimates the value of the PDF in a point x as

f̂(x) = 1/(N h^d) Σ_{n=1}^{N} K((x − x_n)/h)    (4.1)

where K(x) is the kernel function, h is a scaling parameter that is usually called the kernel width, and d is the dimensionality of the vector space. If we require that

H(x) ≥ 0  and  ∫ H(x) dx = 1   for   H(x) = (1/h^d) K(x/h)    (4.2)

we know that f̂(x) ≥ 0 and ∫ f̂(x) dx = 1, as is required of a PDF. Using the scaled kernel H(x) above, we can rewrite (4.1) as

f̂(x) = 1/N Σ_{n=1}^{N} H(x − x_n) .    (4.3)

In other words, (4.1) is a sample average of H(x − x_n). As the number of samples tends to infinity, we obtain

lim_{N→∞} 1/(N h^d) Σ_{n=1}^{N} K((x − x_n)/h) = E{H(x − x_n)} = ∫ f(x_n) H(x − x_n) dx_n = (f ∗ H)(x) .    (4.4)

This means that in an expectation sense, the kernel H(x) can be interpreted as a low-pass filter acting on the PDF f(x). This is also pointed out in [43]. Thus H(x) is the smoothness constraint, or regularisation, that makes the estimate more stable. This is illustrated in figure 4.1. The figure shows three kernel density estimates from the same sample set, using a Gaussian kernel

K(x) = 1/(2π)^{d/2} e^{−0.5 xᵀx}    (4.5)

with three different kernel widths.
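As a small illustration (our own, not code from the thesis), the estimator (4.1) with the Gaussian kernel (4.5) can be written in a few lines of Python for the scalar case d = 1:

```python
import math

def kde(x, samples, h):
    """Gaussian kernel density estimate f_hat(x): eq. (4.1) with d = 1
    and the kernel of eq. (4.5)."""
    c = 1.0 / math.sqrt(2 * math.pi)               # kernel normalisation
    return sum(c * math.exp(-0.5 * ((x - xn) / h) ** 2)
               for xn in samples) / (len(samples) * h)

samples = [0.9, 1.0, 1.1, 2.9, 3.1]
# a small h gives a spiky estimate, a large h a heavily smoothed one
print(kde(1.0, samples, h=0.1), kde(1.0, samples, h=1.0))
```

Increasing h corresponds to the stronger low-pass filtering of (4.4): the estimate gets smoother, at the price of blurring nearby modes together, as in figure 4.1.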
4.2 Mode seeking If the data come from a number of different sources, it would be a useful aid in the prediction of new measurements to have estimates of the means and covariances of the individual sources, or modes of the distribution. See figure 4.1 for an example of a distribution with four distinct modes (the peaks). Averaging of samples in channel representation [47, 80, 34] (see also chapter 3), followed by a local decoding, is one way to estimate the modes of a distribution.

Figure 4.1: Kernel density estimates for three different kernel widths (h = 0.02, h = 0.05, and h = 0.1).

4.2.1 Channel averaging With the interpretation of the convolution in (4.4) as a low-pass filter, it is easy to make the association to signal processing with sampled signals, and suggest regular sampling as a representation of f̂(x). If the sample space ℝ^d is low dimensional, and samples only occur in a bounded domain A (bounded in the sense that A ⊂ {x : (x − m)ᵀM(x − m) ≤ 1} for some choice of m and M), i.e. f(x) = 0 ∀x ∉ A, it is feasible to represent f̂(x) by estimates of its values at regular positions. If the sample set S = {x_n}_{n=1}^{N} is large, this would also reduce memory requirements compared to storing all samples. Note that the analogy with signal processing and sampled signals should not be taken too literally. We are not at present interested in the exact shape of the PDF, we merely want to find the modes, and this does not require the kernel H(x) to constitute a band-limitation, as would have been the case if reconstruction of (the band-limited version of) the continuous signal f̂(x) from its samples was our goal. For simplicity of notation, we only consider the case of a one-dimensional PDF f(x) in the rest of this section. Higher dimensional channel representations will be introduced in section 5.6.
In the channel representation, a set of non-negative kernel functions {H^k(x)}_{k=1}^{K} is applied to each of the samples x_n, and the result is optionally weighted with a confidence r_n ≥ 0,

u_n = r_n ( H^1(x_n)  H^2(x_n)  ...  H^K(x_n) )^T .    (4.6)

This operation defines the channel encoding of the signal–confidence pair (x_n, r_n), and the resultant vector u_n constitutes a channel representation of the signal–confidence, provided that the channel encoding is injective for r ≠ 0, i.e. there exists a corresponding decoding that reconstructs the signal and its confidence from the channels. We additionally require that the consecutive, integer displaced kernels H^k(x) are shifted versions of an even function H(x), i.e.

H^k(x) = H(x − k) = H(k − x) .    (4.7)

We now consider an average of channel vectors

ū = 1/N Σ_{n=1}^{N} u_n   with elements   ū^k = 1/N Σ_{n=1}^{N} u_n^k .    (4.8)

If we neglect the confidence r_n, we have

u_n^k = H(x_n − k) = H(k − x_n) .    (4.9)

By inserting (4.9) into (4.8) we see that ū^k = f̂(k) according to (4.3). In other words, averaging of samples in the channel representation is equivalent to a regular sampling of a kernel density estimator. Consequently, the expectation value of a channel vector u is a sampling of the PDF f(x) filtered with the kernel H(x), i.e. for each channel value we have

E{u_n^k} = E{H^k(x_n)} = ∫ H^k(x) f(x) dx = (f ∗ H)(k) .    (4.10)

We now generalise the interpretation of the local decoding in section 3.2.3. The local decoding of a channel vector is a procedure that takes a subset of the channel values (e.g. {u^k, ..., u^{k+N−1}}), and computes the mode location x, the confidence/probability density r, and if possible the standard deviation σ of the mode

{x, r, σ} = dec(u^k, u^{k+1}, ..., u^{k+N−1}) .    (4.11)

The actual expressions to compute the mode parameters depend on the used kernel.
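The identity ū^k = f̂(k) is easy to verify numerically. In the Python sketch below (our own illustration, under the assumptions of this section: a cos² kernel H with ω = π/3, channels at integer positions, confidences neglected), averaging channel-encoded samples and sampling the kernel density estimator (4.3) at the channel positions give identical numbers:

```python
import math

def H(x):
    """Even cos^2 kernel with omega = pi/3; support is |x| < 3/2 (cf. eq. 4.7)."""
    return math.cos(math.pi / 3 * abs(x)) ** 2 if abs(x) < 1.5 else 0.0

def channel_encode(x, K=8):
    """Channel vector of a single sample, channels at k = 1..K (eq. 4.6, r = 1)."""
    return [H(x - k) for k in range(1, K + 1)]

def f_hat(x, samples):
    """Kernel density estimate of eq. (4.3)."""
    return sum(H(x - xn) for xn in samples) / len(samples)

samples = [3.9, 4.1, 4.3, 5.9, 6.1]
u_bar = [sum(col) / len(samples) for col in zip(*(channel_encode(xn) for xn in samples))]
# u_bar[k-1] equals f_hat(k): channel averaging samples the KDE at the channel centres
print(all(abs(u_bar[k - 1] - f_hat(k, samples)) < 1e-12 for k in range(1, 9)))
```

A subsequent local decoding of ū then reads off the modes of this sampled density, which is the mode seeking interpretation of channel averaging used throughout this chapter.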
A local decoding for the cos² kernel was derived in section 3.2.3. This decoding did not give an estimate of the standard deviation, but in chapter 5 we will derive local decodings for Gaussian and B-spline kernels as well (in sections 5.1.1 and 5.2.2), and this motivates the general formulation above.

4.2.2 Expectation value of the local decoding

We have identified the local decoding as a mode estimation procedure. Naturally we would like our mode estimation to be as accurate as possible, and we also want it to be unbiased. This can be investigated using the expectation value of the local decoding. Recall the expectation value of a channel (4.10). For the cos² kernel this becomes

$$E\{u_n^k\} = \int_{S^k} \cos^2(\omega(x - k)) f(x)\,dx \qquad (4.12)$$

where $S^k$ is the support of kernel $k$ (see section 3.3). We will now require that the PDF is restricted to the common support $S_l^N$ used in the local decoding. This allows us to write the expectation value of one of the channel values used in the decoding as

$$E\{u_n^{l+d}\} = \int_{S_l^N} \cos^2(\omega(x - l - d)) f(x)\,dx = \frac{1}{2} \begin{pmatrix} \cos(2\omega d) & \sin(2\omega d) & 1 \end{pmatrix} \underbrace{\begin{pmatrix} \int_{S_l^N} \cos(2\omega(x - l)) f(x)\,dx \\ \int_{S_l^N} \sin(2\omega(x - l)) f(x)\,dx \\ \int_{S_l^N} f(x)\,dx \end{pmatrix}}_{E\{p\}} \qquad (4.13)$$

using the same method as in (3.8)–(3.11). We can now stack such equations for all involved channel values, and solve for $E\{p\}$. This is exactly what we did in the derivation of the local decoding. If we assume a Dirac PDF, i.e. $f(x) = r\delta(x - \mu)$, we obtain

$$E\{p\} = \begin{pmatrix} r\cos(2\omega(\mu - l)) \\ r\sin(2\omega(\mu - l)) \\ r \end{pmatrix}. \qquad (4.14)$$

Plugging this into the final decoding step (3.14) gives us the mode estimate $\hat{x} = \mu$. In general however, (3.14) will not give us the exact mode location. In appendix A, theorem A.6 we prove that, if a mode of $f$ is restricted to the support of the decoding $S_l^N$, and is even about the mean $\mu$ (i.e. $f(\mu + x) = f(\mu - x)$), then (3.14) is an unbiased estimate of the mean

$$E\{\hat{x}\} = l + \frac{1}{2\omega} \arg\left[E\{p_1\} + i\,E\{p_2\}\right] = \mu = E\{x_n\}. \qquad (4.15)$$

When $f$ has an odd component, the local decoding tends to overshoot the mean slightly, seemingly always in the direction of the mode of the density². In general however, these conditions are not fulfilled. It is for instance impossible to have a shift invariant estimate for non-Dirac densities when the decoding intervals $S_l^N$ are non-overlapping. For an experimental evaluation of the behaviour under more general conditions, see [36].

²Note that this is an empirical observation; no proof is given.

4.2.3 Mean-shift filtering

An alternative way to find the modes of a distribution is through gradient ascent on (4.1), as is done in mean-shift filtering [43, 16]. Mean-shift filtering is a way to cluster a sample set by moving each sample toward the closest mode, and this is done through gradient ascent on the kernel density estimate. Assuming that the kernel $K(x)$ is differentiable, the gradient of $f(x)$ can be estimated as the gradient of (4.1), i.e.

$$\hat{\nabla} f(x) = \frac{1}{N h^{d+1}} \sum_{n=1}^{N} \nabla K\!\left(\frac{x - x_n}{h}\right). \qquad (4.16)$$

This expression becomes particularly simple if we use the Epanechnikov kernel [43]

$$K(x) = \begin{cases} c(1 - x^T x) & \text{if } x^T x \leq 1 \\ 0 & \text{otherwise.} \end{cases} \qquad (4.17)$$

Here $c$ is a normalising constant that ensures $\int K(x)\,dx = 1$. For the kernel (4.17) we define³ the gradient as

$$\nabla K(x) = \begin{cases} -2cx & \text{if } x^T x \leq 1 \\ 0 & \text{otherwise.} \end{cases} \qquad (4.18)$$

Inserting this into (4.16) gives us

$$\hat{\nabla} f(x) = \frac{2c}{N h^{d+2}} \sum_{x_n \in S_h(x)} (x_n - x), \qquad (4.19)$$

where $S_h(x) = \{x_n \in \mathbb{R}^d : \|x_n - x\| \leq h\}$ is the support of the kernel (4.17). Instead of doing gradient ascent using (4.19) directly, Fukunaga and Hostetler [43] use an approximation of the normalised gradient

$$\hat{\nabla} \ln f(x) \approx \frac{\hat{\nabla} f(x)}{\hat{f}(x)} = \frac{d + 2}{h^2}\, m(x) \qquad (4.20)$$

where $m(x)$ is the mean shift vector

$$m(x) = \frac{1}{k(x)} \sum_{x_n \in S_h(x)} (x_n - x). \qquad (4.21)$$

Here $k(x)$ is the number of samples inside the support $S_h(x)$.
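Since $x + m(x)$ is simply the mean of the samples inside the window $S_h(x)$, mode seeking by repeatedly applying the mean shift vector can be sketched in a few lines. The following one-dimensional Python fragment is illustrative only; the data set and window radius are invented for the example.

```python
def mode_seek(x0, samples, h=0.3, iters=100, tol=1e-9):
    """Gradient ascent via the mean shift vector m(x), eq. (4.21):
    x <- x + m(x), i.e. replace x by the mean of the samples
    inside the window S_h(x)."""
    x = x0
    for _ in range(iters):
        window = [s for s in samples if abs(s - x) <= h]
        if not window:
            break
        x_new = sum(window) / len(window)   # x + m(x)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x

samples = [1.0, 1.1, 0.9, 1.05, 3.0, 3.1, 2.95]
# Start one iteration in each sample; all trajectories end in one of two modes.
modes = sorted({round(mode_seek(s, samples), 3) for s in samples})
```

With these samples, the iterations started in the left cluster all converge to the mean of that cluster (about 1.01), and those in the right cluster to about 3.02.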
Using the normalised gradient, [43] proceeds to define the following clustering algorithm:

$$\text{For all } n: \quad \begin{cases} \bar{x}^0 = x_n \\ \bar{x}^{i+1} = \bar{x}^i + a \hat{\nabla} \ln f(\bar{x}^i) \end{cases} \qquad (4.22)$$

where $a$ should be chosen to guarantee convergence. In [43] the choice $a = h^2/(d + 2)$ is made. This results in the iteration rule

$$\bar{x}^{i+1} = \frac{1}{k(x)} \sum_{\bar{x}_n^i \in S_h(\bar{x}^i)} \bar{x}_n^i. \qquad (4.23)$$

That is, in each iteration we replace $\bar{x}^i$ by a mean inside a window centred around it. Cheng [16] calls this mean of already shifted points a blurring process, and contrasts it with the non-blurring process

$$\bar{x}^{i+1} = \frac{1}{k(x)} \sum_{x_n \in S_h(\bar{x}^i)} x_n. \qquad (4.24)$$

³The gradient is actually undefined when $x^T x = 1$.

The behaviour of the two approaches to clustering is similar, but only the non-blurring mean shift is mode seeking, and it is the one that is typically used. Cheng also generalises the concept of mean shift, by considering (4.24) to be a mean shift using the uniform kernel

$$K(x) = \begin{cases} 1 & x^T x \leq 1 \\ 0 & \text{otherwise,} \end{cases} \qquad (4.25)$$

and suggests other weightings inside the window $S_h(x)$. The generalised mean-shift iteration now becomes

$$\bar{x}^{i+1} = \frac{\sum K(x_n - \bar{x}^i)\, x_n}{\sum K(x_n - \bar{x}^i)}. \qquad (4.26)$$

Since mean shift according to (4.26) using the kernel (4.25) finds the modes of (4.1) using the Epanechnikov kernel (4.17), the Epanechnikov kernel is said to be the shadow of the uniform kernel. Similarly, mean shift using other kernels finds peaks of (4.1) computed from their shadow kernels.

Figure 4.2 shows an illustration of mean shift clustering. A mean shift iteration has been started in each of the data points, and the trajectories $\bar{x}^0, \bar{x}^1, \ldots, \bar{x}^*$ have been plotted. As can be seen in the centre plot, all trajectories end in one of five points, indicating successful clustering of the data set.

Figure 4.2: Illustration of mean shift filtering. Left: data set. Centre: trajectories of the mean shift iterations using a uniform kernel with h = 0.3.
Right: Kernel density estimate using Epanechnikov kernel with h = 0.3.

Gradient ascent with fixed step length often suffers from slow convergence, but (4.26) has the advantage that it moves quickly near points of low density (this can be inferred by observing that the denominator of (4.26) is a kernel density estimator without the normalisation factor). Mean-shift filtering was recently re-introduced in the signal processing community by the Cheng paper [16], and has been developed by Comaniciu and Meer in [19, 20] into algorithms for edge-preserving filtering and segmentation.

4.2.4 M-estimators

M-estimation⁴ (first described by Huber in [62] according to Hampel et al. [56]), or the approach based on influence functions [56], is a technique that attempts to remove the sensitivity to outliers in parameter estimation. Assume $\{x_n\}_1^N$ are samples in some parameter space, and we want to estimate the parameter choice $\bar{x}$ that best fits the data. This estimation problem is defined by the following objective function

$$\bar{x} = \arg\min_x J(x) = \arg\min_x \sum_{n=1}^{N} \rho(r_n/h) \quad \text{where} \quad r_n = \|x - x_n\|. \qquad (4.27)$$

Here $\rho(r)$ is an error norm⁵, the $r_n$ are residuals, and $h$ is a scale parameter. The error norm $\rho(r)$ should be nondecreasing with increasing $r$ [95]. Both the least-squares problem and median computation are special cases of M-estimation, with the error norms $\rho(r) = r^2$ and $\rho(r) = |r|$ respectively. See figure 4.3 for some common examples of error norms.

Figure 4.3: Some common error norms. Left to right: least squares, least absolute, biweight, and cutoff-squares.

Note that, in general, (4.27) is a non-convex problem which may have several local minima. Solutions to (4.27) are also solutions to

$$\sum_{n=1}^{N} \varphi(r_n/h) = 0 \quad \text{where} \quad \varphi(r) = \frac{\partial \rho(r)}{\partial r}. \qquad (4.28)$$

The function $\varphi(r)$ is called the influence function.
The solution to (4.27) or (4.28) is typically found using iterated reweighted least-squares⁶ (IRLS), see e.g. [109, 95]. As the name suggests, IRLS is an iterative algorithm. In general it requires an initial guess close to the optimum of (4.27). To derive IRLS we start by assuming that the expression to be minimised in (4.27) can be written as a weighted least-squares expression

$$J(x) = \sum_{n=1}^{N} \rho(r_n/h) = \sum_{n=1}^{N} r_n^2\, w(r_n/h). \qquad (4.29)$$

⁴The name comes from "generalized maximum likelihood" according to [56].
⁵The error norm is sometimes called a loss function.
⁶Iterated reweighted least squares is also known as a W-estimator [56].

We now compute the gradient with respect to $\{r_n\}_1^N$ of both sides

$$\frac{1}{h} \begin{pmatrix} \varphi(r_1/h) & \ldots & \varphi(r_N/h) \end{pmatrix}^T = \frac{2}{h} \begin{pmatrix} r_1 w(r_1/h) & \ldots & r_N w(r_N/h) \end{pmatrix}^T. \qquad (4.30)$$

This system of equations is fulfilled for the weight function $w(r) = \varphi(r)/(2r)$. This can be simplified to $w(r) = \varphi(r)/r$, while still giving a solution to (4.27). This gives us the weights for one step in the IRLS process:

$$\bar{x}^i = \arg\min_x \sum_{n=1}^{N} (r_n^i)^2\, w(r_n^{i-1}/h) \quad \text{where} \quad r_n^i = \|\bar{x}^i - x_n\|. \qquad (4.31)$$

Each iteration of (4.31) is a convex problem, which can be solved by standard techniques for least-squares problems, such as Gaussian elimination or SVD. By computing the derivative with respect to $\bar{x}^i$ (treating $w$ as constant weights), we get

$$\sum_{n=1}^{N} 2(\bar{x}^i - x_n)\, w(r_n^{i-1}/h) = 0 \qquad (4.32)$$

with the solution

$$\bar{x}^i = \frac{\sum x_n w(r_n^{i-1}/h)}{\sum w(r_n^{i-1}/h)} = \frac{\sum x_n w(\|\bar{x}^{i-1} - x_n\|/h)}{\sum w(\|\bar{x}^{i-1} - x_n\|/h)}. \qquad (4.33)$$

By comparing this with (4.26), we see that the iterations in IRLS are equivalent to those in mean-shift filtering if we set the kernel in mean-shift filtering equal to the scaled weight function in IRLS, i.e.

$$K(\bar{x} - x_n) = w(\|\bar{x} - x_n\|/h). \qquad (4.34)$$

From this we can also infer that the error norm corresponds to the kernel of the corresponding kernel density estimator, up to a sign (mean shift is a maximisation, and M-estimation is a minimisation) and an offset.
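The fixed-point step (4.33) is easy to sketch for a one-dimensional location estimate. The following Python fragment is illustrative only: it assumes a Gaussian weight function $w(r) = e^{-r^2/2}$ (a Welsch-type error norm, chosen here for the example, not prescribed by the thesis) and invented data with one gross outlier, which the robust estimate largely ignores while the least-squares solution does not.

```python
import math

def irls_location(samples, h=1.0, x0=0.0, iters=50):
    """One-dimensional IRLS location estimate, the fixed point of eq. (4.33),
    using a Gaussian weight w(r) = exp(-r^2/2). By (4.34) this is the same
    iteration as generalised mean shift with a Gaussian kernel."""
    x = x0
    for _ in range(iters):
        w = [math.exp(-((x - s) / h) ** 2 / 2) for s in samples]
        x = sum(wi * s for wi, s in zip(w, samples)) / sum(w)
    return x

samples = [0.1, -0.2, 0.05, 0.0, 10.0]        # one gross outlier at 10
x_robust = irls_location(samples, h=1.0, x0=0.0)
x_mean = sum(samples) / len(samples)          # least-squares solution
```

The outlier receives a weight of roughly $e^{-50}$ and so contributes essentially nothing, while it drags the least-squares mean far from the inlier cluster.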
4.2.5 Relation to clustering

Clustering is the problem of partitioning a non-labelled data set into a number of clusters, or classes, in such a way that similar samples are grouped into the same cluster. Clustering is also known as data partitioning, segmentation and vector quantisation. It is also one of the approaches to unsupervised learning. See e.g. [63] for an overview of different clustering approaches.

The mode seeking problem is related to clustering in the sense that each mode can be seen as a natural cluster prototype. For mean shift, this connection is especially direct, since each sample can be assigned a class label depending on which mode the mean-shift iteration ended in [19, 20]. For channel averaging, we can use the distances to each of the decoded modes to decide which cluster a sample should belong to.

4.3 Summary and comparison

In this chapter we have explained three methods that find modes in a distribution. All three methods start from a set of samples, and have been shown to estimate the modes of the samples under a regularisation with a smoothing kernel (for M-estimation we only get one mode). We have shown that non-blurring mean shift is equivalent to a set of M-estimations using IRLS, started in each sample. Channel averaging, on the other hand, approaches the same problem from a different angle.

Which method is preferable depends on the problem to be solved. Both channel averaging and mean-shift filtering are intended for estimation of all modes of a distribution, or for clustering. If we have a large number of samples inside the kernel window, mean-shift filtering is at a disadvantage, since each iteration involves an evaluation of e.g. (4.16) for all samples, which would be cumbersome indeed. In contrast, for mode seeking using channel averaging, only the averaging step is affected by the number of samples.
Another advantage of channel averaging is that it has a constant, data-independent complexity. For mean shift we can only state an expected computational complexity, since it is an iterative method where the convergence speed depends on the data. For high dimensional sample spaces, or if the domain of the sample space is large, mean shift is at an advantage, since the number of required channels grows exponentially with the vector space dimension. When the used kernel is small, mean shift also becomes favourable.

Chapter 5

Kernels for Channel Representation

In this chapter we will introduce two new kernel functions and derive encodings and local decodings for them. The kernels are then compared with the cos² kernel (introduced in chapter 3) with respect to a number of properties. We also introduce the notion of stochastic kernels, and study the interference of multiple peaks when decoding a channel vector. Finally we extend the channel representation to higher dimensions.

5.1 The Gaussian kernel

Inspired by the similarities between channel averaging and kernel density estimation (see sections 4.1.1 and 4.2.1) we now introduce the Gaussian kernel

$$B^k(x) = e^{-\frac{(x-k)^2}{2\sigma^2}}. \qquad (5.1)$$

Here the $k$ parameter is the channel centre, and $\sigma$ is the channel width. The $\sigma$ parameter can be related to the $\omega$ parameter of the cos² kernel by requiring that the areas under the kernels should be the same. Since $A_{\text{gauss}} = \sqrt{2\pi}\sigma$, and $A_{\cos^2} = \pi/2\omega$, we get

$$\sigma = \sqrt{\pi/8}/\omega \quad \text{and} \quad \omega = \sqrt{\pi/8}/\sigma. \qquad (5.2)$$

Just like before, channel encoding is done according to (3.1). Figure 5.1 shows an example of a Gaussian channel set. Compared to the cos² kernel, the Gaussian kernel has the disadvantage of not having compact support. This means that we will always have small non-zero values in each channel (unless we threshold the channel vector, or weight the channel values with a relevance r = 0). Additionally, the Gaussian kernels have neither constant sum nor constant norm, as the cos² kernels do, see (3.6).
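The area correspondence behind (5.2) can be verified numerically. The following Python fragment is an illustrative check (not from the thesis): it integrates both kernels with a simple midpoint rule and confirms that, with $\sigma = \sqrt{\pi/8}/\omega$, the two areas agree.

```python
import math

def area(f, lo, hi, steps=100_000):
    """Midpoint-rule integration of f over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * dx) for i in range(steps)) * dx

omega = math.pi / 3
sigma = math.sqrt(math.pi / 8) / omega        # eq. (5.2)

# Area under the cos^2 kernel over its support [-pi/(2*omega), pi/(2*omega)]
a_cos2 = area(lambda x: math.cos(omega * x) ** 2,
              -math.pi / (2 * omega), math.pi / (2 * omega))
# Area under the Gaussian kernel (5.1); +-10 sigma captures all of the mass
a_gauss = area(lambda x: math.exp(-x ** 2 / (2 * sigma ** 2)),
               -10 * sigma, 10 * sigma)
```

Both integrals come out as $\pi/(2\omega) = 1.5$ for $\omega = \pi/3$, matching the closed-form areas $A_{\cos^2}$ and $\sqrt{2\pi}\sigma$.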
Figure 5.1: Example of a Gaussian channel set. The width σ = 0.6 corresponds roughly to ω = π/3.

5.1.1 A local decoding for the Gaussian kernel

We will now devise a local decoding for this channel representation as well. If we look at three neighbouring channel values around $k = l$, we obtain three equations

$$\begin{pmatrix} u^{l-1} \\ u^{l} \\ u^{l+1} \end{pmatrix} = \begin{pmatrix} r B^{l-1}(x) \\ r B^{l}(x) \\ r B^{l+1}(x) \end{pmatrix}. \qquad (5.3)$$

The logarithm of an arbitrary row can be written as

$$\ln u^{l+d} = \ln r + \ln B^{l+d}(x) = \ln r - \frac{(x - l - d)^2}{2\sigma^2} \qquad (5.4)$$

$$= \begin{pmatrix} 1 & d & d^2 \end{pmatrix} \underbrace{\begin{pmatrix} \ln r - \frac{(x-l)^2}{2\sigma^2} & \frac{x-l}{\sigma^2} & -\frac{1}{2\sigma^2} \end{pmatrix}^T}_{p}. \qquad (5.5)$$

If we apply this to each row of (5.3), we obtain an equation system of the form

$$\ln u = D p \qquad (5.6)$$

with the solution

$$p = D^{-1} \ln u = \begin{pmatrix} 0 & 1 & 0 \\ -\frac{1}{2} & 0 & \frac{1}{2} \\ \frac{1}{2} & -1 & \frac{1}{2} \end{pmatrix} \ln u. \qquad (5.7)$$

From the solution $p$ we can find the estimates $\hat{x}$, $\hat{\sigma}$, and $\hat{r}$ as

$$\hat{x} = l - \frac{p_2}{2 p_3}, \quad \hat{\sigma} = \sqrt{-\frac{1}{2 p_3}}, \quad \text{and} \quad \hat{r} = e^{p_1 - \frac{p_2^2}{4 p_3}}.$$

This gives us one decoding per group of three channels. Just like in the cos² decoding (see section 3.2.3), we remove those decodings that lie more than a distance 0.5 from the decoding interval centre, since the corresponding peak is probably better described by another group of channels.

The Gaussian kernel has potential for use in channel averaging, see section 4.2.1, since it provides a direct estimate of the standard deviation for an additive Gaussian noise component. Using the addition theorem for Gaussian variances, we can estimate the noise as $\hat{\sigma}_{\text{noise}} = \sqrt{\hat{\sigma}^2 - \sigma^2}$. Just like in the cos² case, the $\hat{r}$-value of the decoding is the probability density at the mode. The new parameter $\hat{\sigma}_{\text{noise}}$ tells us the width of the mode, and can be seen as a measure of how accurate the localisation of the mode is. For high confidence in a peak location, $\hat{\sigma}_{\text{noise}}$ should be small, and $\hat{r}N$ (where $N$ is the number of samples) should be large.

5.2 The B-spline kernel

B-splines are a family of functions normally used to define a basis for interpolating splines, see e.g.
[21, 100]. The central¹ B-spline of degree zero is defined as a rectangular pulse

$$B^0(x) = \begin{cases} 1 & -0.5 \leq x < 0.5 \\ 0 & \text{otherwise.} \end{cases} \qquad (5.8)$$

B-splines of higher degrees are defined recursively, and can be obtained through convolutions

$$B^n(x) = (B^{n-1} * B^0)(x) = (\underbrace{B^0 * B^0 * \ldots * B^0}_{(n+1)\text{ times}})(x). \qquad (5.9)$$

As the degree is increased, the basis functions tend toward a Gaussian shape (see figure 5.2). In fact, according to the central limit theorem, a Gaussian is obtained as n approaches infinity.

Figure 5.2: Central B-splines of degrees 0, 1, ..., 5.

¹B-splines are often defined with $B^0(x)$ having the support $x \in [0, 1]$ instead.

If we require explicit expressions for the piecewise polynomials which the B-spline consists of, the following recurrence relation [21] is useful

$$B^n(x) = \frac{x + (n+1)/2}{n}\, B^{n-1}_{-1/2}(x) + \frac{(n+1)/2 - x}{n}\, B^{n-1}_{1/2}(x). \qquad (5.10)$$

Here, shifted versions of a B-spline are denoted $B^n_k(x) = B^n(x - k)$. Using (5.10), we obtain the following expression for B-splines of degree 1

$$B^1(x) = (x + 1) B^0_{-1/2}(x) + (1 - x) B^0_{1/2}(x) \qquad (5.11)$$

$$= \begin{cases} x + 1 & -1 \leq x < 0 \\ 1 - x & 0 \leq x < 1 \\ 0 & \text{otherwise.} \end{cases} \qquad (5.12)$$

For degree 2 we get

$$B^2(x) = \frac{x + 3/2}{2}\, B^1_{-1/2}(x) + \frac{3/2 - x}{2}\, B^1_{1/2}(x) \qquad (5.13)$$

$$= \frac{(x + 3/2)^2}{2}\, B^0_{-1}(x) + \left(\frac{3}{4} - x^2\right) B^0_0(x) + \frac{(x - 3/2)^2}{2}\, B^0_1(x) \qquad (5.14)$$

$$= \begin{cases} (x + 3/2)^2/2 & -1.5 \leq x < -0.5 \\ 3/4 - x^2 & -0.5 \leq x < 0.5 \\ (x - 3/2)^2/2 & 0.5 \leq x < 1.5 \\ 0 & \text{otherwise.} \end{cases} \qquad (5.15)$$

By applying the binomial theorem to the Fourier transform of (5.9), and transforming back, it is also possible to derive the following expression [100]

$$B^n(x) = \frac{1}{n!} \sum_{k=0}^{n+1} (-1)^k \binom{n+1}{k} \max\!\left(0,\; x - k + \frac{n+1}{2}\right)^{\!n}. \qquad (5.16)$$

5.2.1 Properties of B-splines

The B-spline family has a number of useful properties:

1. Positivity: $B^n(x) > 0$ for $x \in \left]-(n+1)/2,\ (n+1)/2\right[$.

2. Compact support: $B^n(x) = 0$ for $x \notin [-(n+1)/2,\ (n+1)/2]$.

3. Constant sum:

$$\sum_k B^n_k(x) = 1 \quad \text{regardless of } x. \qquad (5.17)$$

For a proof, see theorem B.1 in the appendix.

4.
First moment: For B-splines of degree $n \geq 1$, the original scalar value may be retrieved by the first moment

$$x = \sum_k k B^n_k(x). \qquad (5.18)$$

For a proof, see theorem B.2 in the appendix.

These properties make B-splines useful candidates for kernels in the channel representation.

5.2.2 B-spline channel encoding and local decoding

Using B-spline kernels of degree ≥ 1 we may define a B-spline channel representation, where a signal–confidence pair (x, r) can be encoded according to (3.1), i.e. we have

$$u^k = r B_k(x). \qquad (5.19)$$

Due to the constant sum property of the B-spline kernel (5.17), the confidence may be extracted from a channel set as

$$r = \sum_k u^k. \qquad (5.20)$$

We can further retrieve the signal value times the confidence using the first moment (see equation 5.18)

$$x r = \sum_k k u^k.$$

Thus it is convenient to first compute the confidence, and then to extract the signal value as

$$\hat{x} = \frac{\sum_k k u^k}{\sum_k u^k} = \frac{1}{r} \sum_k k u^k. \qquad (5.21)$$

Figure 5.3 shows an example of a B-spline channel set. Just like the cos² kernel, the B-splines have compact support and constant sum. Their norm, however, is not constant. A value x encoded using a B-spline channel representation of degree n will have at most n + 1 active channel values. This makes a local decoding using a group of n + 1 consecutive channel values reasonable. It also means that a B-spline channel representation of degree n is comparable to a cos² channel representation with width

$$\omega = \pi/(n + 1). \qquad (5.22)$$

Just like in the cos² and Gaussian decodings (see sections 3.2.3 and 5.1.1), we remove those decodings that lie more than a distance 0.5 from the decoding window centre. Note that the local decoding described here assumes a Dirac PDF, see section 4.2.2. A more elaborate local decoding, which deals with non-Dirac PDFs, has been developed by Felsberg, and a joint publication is under way [30].

Figure 5.3: Example of a B-spline channel set.
The degree n = 2 is comparable to ω = π/3 for the cos² kernels.

5.3 Comparison of kernel properties

In this section we will compare the three kernel families: cos², Gaussians, and B-splines. We will compare them according to a number of different criteria.

5.3.1 The constant sum property

The sum² of a channel vector $u = r \begin{pmatrix} B^1(x) & B^2(x) & \ldots & B^K(x) \end{pmatrix}^T$ is given by

$$\|u(x)\|_1 = \sum_{k=1}^{K} |r B^k(x)| = r \sum_{k=1}^{K} B^k(x). \qquad (5.23)$$

As proven in theorems A.3 and B.1, (5.23) is constant for cos² and B-spline channels as long as $x$ is inside the represented domain (i.e. $x \in R_N^K$). Note that since the Gaussian kernels do not have compact support, (5.23) will actually depend on the value of $K$. Figure 5.4 (left) shows numerical evaluations of (5.23) for σ = 0.4, 0.6, 0.8, and 1.0. A channel set of K = 11 channels has been used, with channel positions −3, −2, ..., 7. As can be seen in the figure, the sum peaks near channel positions and has distinct minima right in-between two channel positions. To measure the amount of deviation we use a normalised peak-to-peak distance

$$\varepsilon_{pp} = \left(\max(\|u\|_1) - \min(\|u\|_1)\right)/\operatorname{mean}(\|u\|_1). \qquad (5.24)$$

This measure is plotted in figure 5.4 (right). As can be seen, the sum is practically constant for σ > 0.6. For a finite number of channels, the deviation from the result in figure 5.4 is very small³.

²Since the channel representation is non-negative, the sum is actually the l₁ norm.
³When K = 500, the deviation is less than 2 × 10⁻⁴ (worst case is σ = 1.0).

Figure 5.4: Sum as function of position. Left: Sums for σ = 0.4, 0.6, 0.8, 1.0 (bottom to top). Right: Normalised peak-to-peak distance εpp(σ).

5.3.2 The constant norm property

The l₂ norm $\|u\|$ of a channel vector $u(x) = r \begin{pmatrix} B^1(x) & B^2(x) & \ldots & B^K(x) \end{pmatrix}^T$ is given by

$$\|u(x)\|^2 = r^2 \sum_{k=1}^{K} B^k(x)^2. \qquad (5.25)$$

As proven in theorem A.4, (5.25) is constant for cos² channels as long as $x$ is inside the represented domain (i.e. $x \in R_N^K$). Since the Gaussian kernels do not have compact support, (5.25) will depend on the value of K, so they are a bit problematic. Figure 5.5 shows a numerical comparison of Gaussian and B-spline kernels. In order to compare them, kernels with corresponding⁴ widths have been used (i.e. according to $\sigma = (n+1)/\sqrt{8\pi}$). The experiment used K = 11 channels positioned at −3, −2, ..., 7. For a finite number of channels, the deviation from this experiment is very small⁵ for the Gaussians. As can be seen in the figure, the norm peaks near channel positions and has distinct minima right in-between two channel positions. To compare the amount of deviation we use a normalised peak-to-peak distance

$$\varepsilon_{pp} = \left(\max(\|u\|) - \min(\|u\|)\right)/\operatorname{mean}(\|u\|) \qquad (5.26)$$

on the interval [0, 4]. Figure 5.5 (right) shows plots of this measure for the Gaussian and B-spline kernels. As can be seen, both kernels tend toward a constant norm as the channel width is increased. The Gaussians, however, have a significantly faster convergence. This is most likely due to their non-compact support. For all widths except (n = 1, σ = 0.4) the deviation is smaller for the Gaussians.

⁴For a motivation of the actual correspondence criteria, see the discussion around (5.2) and (5.22).
⁵When K = 500, the deviation is less than 5 × 10⁻¹⁰ (worst case is σ = 1.0).

Figure 5.5: Norm as function of position. Left: Using B-spline kernels n = 1, 2, 3, 4 (top to bottom curves). Centre: Using Gaussian kernels with σ = 0.40, 0.60, 0.80, 1.00 (bottom to top curves). Right: Solid: εpp(σ) for Gaussian kernels. Crosses indicate the B-spline results for n = 1, 2, 3, 4.

5.3.3 The scalar product

We will now have a look at the scalar product of two channel vectors.
Note that the constant norm property does not imply a position invariant scalar product, merely that the highest value of the scalar product stays constant. We will thus compare all three kernels. Figure 5.6 shows a graphical comparison of the scalar product functions for cos², B-spline, and Gaussian kernels. In this experiment we have encoded the value 0, and computed scalar products between this vector and encodings of varying positions, i.e.

$$s(x) = u(0)^T u(x) = \sum_{k=1}^{K} B^k(0) B^k(x). \qquad (5.27)$$

Each plot shows 10 superimposed curves s(x), where the channel positions have been displaced in steps of 0.1, i.e.

$$s(x) = \sum_{k=1}^{K} B^k(0 - d) B^k(x - d) \qquad (5.28)$$

for $d \in \{0, 0.1, 0.2, \ldots, 0.9\}$. As can be seen, the scalar product is position variant for all kernels, but the amount of variation with position decreases as the channel width increases. For the cos² kernel, the position variance goes to zero as we approach the peak of the scalar product function (5.27). As we increase the channel widths the position variance drops for the other kernels as well, and especially for the Gaussians with σ = 1.0, s(x) looks very stable.

The problem with the position variance of the scalar product is two-fold. As figure 5.7 illustrates, the peak of the scalar product is even moved as the alignment changes. Such a behaviour would make the scalar product unsuitable e.g. as a measure of similarity. To avoid the displacement of the peak, we could consider a normalised scalar product

$$s(x) = \frac{u(0)^T u(x)}{\|u(0)\|\,\|u(x)\|}. \qquad (5.29)$$

Figure 5.6: Superimposed scalar product functions for different alignments of the channel positions. Top to bottom: cos², Gaussian, and B-spline kernels. Left to right: ω = π/3, π/4, π/5; σ = 0.6, 0.8, 1.0; n = 2, 3, 4.

Figure 5.7: Different peak positions for different channel alignments. Left: Two Gaussian scalar product functions (σ = 0.6). Right: Two B-spline scalar product functions (n = 2).

Figure 5.8 shows the same experiment as in figure 5.6, but using the normalised scalar product (5.29). As can be seen, the normalisation ensures that the peak is in the right place for all kernels, but it does not make all of the position dependency problems go away.

Figure 5.8: Superimposed normalised scalar product functions for different alignments of the channel positions. Top to bottom: cos², Gaussian, and B-spline kernels. Left to right: ω = π/3, π/4, π/5; σ = 0.6, 0.8, 1.0; n = 2, 3, 4.

5.4 Metameric distance

As mentioned in section 3.2.1, the channel representation is able to represent multiple values. In this section we will study interference between multiple values represented in a channel vector. A representation where the representation of two values is sometimes confused with their average is called a metameric representation, see section 3.2.1. To study this behaviour for the channel representation, we now channel encode two signals, sum their channel representations, and decode. The two signals we will use are

$$f_1(x) = x \quad \text{and} \quad f_2(x) = 6 - x \quad \text{for } x \in [0, 6]. \qquad (5.30)$$

These signals are shown in figure 5.9, left. As can be seen, they are different near the edges of the plot, and more similar in value near the centre. If we channel encode these signals and decode their sum, we will obtain two distinct decodings near the edges of the plot, and near the centre of the plot we will obtain their average.
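The basic phenomenon can be illustrated without a full decoder, by using the number of local maxima in a summed channel vector as a crude proxy for the number of decodable values. The following Python sketch is illustrative only (the values and channel grid are invented); it uses the cos² kernel with ω = π/3 and shows that two well-separated values keep two distinct peaks, while two values closer than the metameric distance merge into one.

```python
import math

def cos2_kernel(x, omega=math.pi / 3):
    return math.cos(omega * x) ** 2 if abs(x) < math.pi / (2 * omega) else 0.0

def encode(x, channels):
    return [cos2_kernel(x - k) for k in channels]

def n_peaks(u):
    """Count interior local maxima of a channel vector."""
    return sum(1 for k in range(1, len(u) - 1)
               if u[k] > u[k - 1] and u[k] > u[k + 1])

channels = list(range(7))
# Sum of two encodings: well separated (distance 4) vs. close (distance 0.4)
far = [a + b for a, b in zip(encode(1.0, channels), encode(5.0, channels))]
near = [a + b for a, b in zip(encode(2.8, channels), encode(3.2, channels))]
```

The `far` vector retains two peaks (around channels 1 and 5), whereas the `near` vector has a single peak at channel 3, so a local decoding would return their average there.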
Initially we will use a channel representation with integer channel positions, and a channel width of ω = π/3. The centre plot of figure 5.9 shows the result of channel encoding, summing, and decoding the two signals.

Figure 5.9: Illustration of metamerism. Left: Input signals. Centre: Decoded outputs, using cos² channels at integer positions, and ω = π/3. Right: Result using channels at half-integer positions instead. Dotted curves are the input signals.

This nice result is however merely a special case. If we move the channels away from the integer positions we will get different results. Figure 5.9, right, shows the result when the channels are positioned at half-integers instead. As can be seen here, we now have an interval of interference, where the decoded values have moved closer to each other, before the interval where the two values are averaged.

The two cases in figure 5.9 are the extreme cases in terms of allowable value distance. When the channel positions are aligned with the intersection point of the two signals, as in the centre plot, the smallest allowed distance, or metameric distance, is the largest, d_mm = 2.0. When the signal intersection point falls right in-between two channel positions, the metameric distance is the smallest, d_mm = 1.5.

The metameric distance also depends on the used channel width. The top row of figure 5.10 shows the metameric distances for three different channel widths of the cos² kernel. Each plot shows 10 superimposed curves, generated using different alignments of the signal intersection point and the channel positions. From these plots we can see that the metameric distance increases with the channel width. Another effect of increasing the channel width is that the position dependency of the interference is reduced.
The experiment is repeated for the B-spline and the Gaussian kernels in figure 5.10, middle and bottom rows. We have chosen kernel widths corresponding to the ones for the cos² kernel according to (5.2) and (5.22). As can be seen, the cos² and the B-spline kernels behave similarly, while the Gaussian kernel has a slightly slower increase in metameric distance with increasing kernel width. The reason for this is that we have used a constant decoding window size for the Gaussian kernel, whereas the window size increases with the channel width for both the cos² and B-spline kernels.

Figure 5.10: Metamerism for different channel widths. Top row: cos² channels, width ω = π/3, π/4, π/5. Middle row: B-spline channels, order 2, 3, 4. Bottom row: Gaussian channels, σ = 0.6, 0.8, 1.0. All plots show 10 superimposed curves with different alignments of the channel centres.

5.5 Stochastic kernels

In section 4.2.1 we identified averages of channel values as samples from a PDF convolved with the used kernel function. Now, we will instead assume that we have measurements $x_n$ from some source S, and that the measurements have been contaminated with additive noise from a source D, i.e.

$$x_n = p_n + \eta_n \quad \text{with} \quad p_n \in S \quad \text{and} \quad \eta_n \in D. \qquad (5.31)$$

For this situation, a channel representation using a kernel H(x) will have channels with expectation values

$$E\{u^k\} = (f * \eta * H)(k). \qquad (5.32)$$

Here $f(x)$ and $\eta(x)$ are the density functions of the source and the noise respectively. If the number of samples in an average is reasonably large, it makes sense to view $(\eta * H)(x - k)$ as a stochastic kernel⁶. In order to make the stochastic kernel as compact as possible, we will now consider the rectangular kernel

$$H^k(x) = \begin{cases} 1 & \text{when } -0.5 < x - k \leq 0.5 \\ 0 & \text{otherwise.} \end{cases} \qquad (5.33)$$
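The stochastic-kernel interpretation (5.32) can be checked with a small Monte Carlo experiment. The following Python sketch is illustrative only (the source value and sample count are invented): it adds triangular (TPDF) dither, i.e. the average of two uniform samples, before applying the rectangular kernel (5.33), so that the expected channel value equals the triangular noise density integrated over the shifted window.

```python
import random

def rect_kernel(x):
    """Rectangular kernel (5.33), centred at 0."""
    return 1.0 if -0.5 < x <= 0.5 else 0.0

def tpdf_noise():
    """Triangular (TPDF) dither: average of two uniform samples,
    giving the triangular density 1 - |t| on [-1, 1]."""
    return (random.uniform(-1, 1) + random.uniform(-1, 1)) / 2

random.seed(0)
p, N = 2.25, 100_000          # fixed source value, number of noisy measurements
# Monte Carlo estimates of E{u^2} and E{u^3} for x_n = p + eta_n, eq. (5.31)
u2 = sum(rect_kernel(p + tpdf_noise() - 2) for _ in range(N)) / N
u3 = sum(rect_kernel(p + tpdf_noise() - 3) for _ in range(N)) / N
```

For p = 2.25, integrating the triangular density over the two windows gives the expectations 0.6875 for channel 2 and 0.28125 for channel 3; without dither the channel values would be a hard 0/1 decision, carrying no sub-channel position information.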
Figure 5.11 shows estimates of three stochastic kernels, together with the PDF of the added noise. The noise is of triangular⁷, or TPDF, type, and is what is typically used to de-correlate the errors in audio quantisation with dithering [60].

⁶For earlier accounts of this idea, see [34, 36].
⁷A triangular noise sample is generated by averaging two samples from a rectangular distribution.

Figure 5.11: Stochastic bins. Left: Estimated PDFs of H^k(x) = 1 for k ∈ [1, 2, 3]. Centre: PDFs with addition of noise before the kernel function. Right: Estimated density of noise.

In general, the kernel (5.33) is not a good choice for density estimation. If f(x) is discontinuous, or changes rapidly, it will cause aliasing-like effects on the estimated PDF. Such aliasing effects can be reduced by dithering, see e.g. [54]. Dithering is the process of adding a small amount of noise (with certain characteristics) to a signal prior to a quantisation. Dithering is commonly used in image reproduction with a small number of available intensities or colours, as well as in perceptual quality improvement of digital audio [60].

5.5.1 Varied noise level

We will now have a look at how the precision of a local decoding is affected by the addition of noise before applying the kernel functions. We will use three channels k = 1, 2, 3, with kernels H^k(x) as defined in (5.33), and channel encode measurements $x_n$ corresponding to source values p ∈ [1.5, 2.5]. To obtain the mode location with better accuracy than the channel distance, we use the local decoding derived for the Gaussian kernel (see section 5.1.1). We will try averages of N = 10, 30, and 1 000 samples, and compute the absolute error |x̂ − p|, where x̂ is the local decoding.
To remove any bias introduced by the source value p (and also to make the curves less noisy) the errors are averaged over all tested source values, giving a mean absolute error (MAE). We will try source values p ∈ [1.5, 2.5] in steps of 0.01. The standard deviation of the noise η, see (5.31), is varied in the range [0, 1] in steps of 0.01. Two noise distributions are tested: rectangular noise and triangular noise. The plots in figure 5.12 show the results.

Figure 5.12: MAE of local decoding as function of noise level. Top row: Results using rectangular noise. Bottom row: Results using triangular noise. Left to right: Number of samples N = 10, N = 30, N = 1000. Solid curves show MAE for rectangular kernels, dashed curves show MAE for Gaussian kernels with σ = 0.8. Number of source values for which the curves are averaged is 101.

As can be seen in the plot, the optimal noise level is actually above zero for the rectangular kernel. This is due to the expectation of the channel values being the convolution of the kernel and the noise PDF, see (5.32). The added noise thus results in a smoother PDF sampling. Finding the optimal noise given a kernel function will be called the dithering problem. Which noise is optimal depends on the source density f(x), and on the used decoding scheme. In this experiment it is thus reasonable to assume that the more similar the noise density convolved with the kernel is to a Gaussian, the better the accuracy. The dashed curves in each plot show the performance of overlapping bins with Gaussian kernels of width σ = 0.8. As can be seen in the plots, for a large number of samples the performances of the two kernels are similar once the dithering noise is above a certain level.

Biological neurons are known to have binary responses (i.e.
at a given time instant they either fire or don't fire). They are able to convey graded information by having the rate of firing depend on the sum of the incoming (afferent) signals. This behaviour could be modelled as (temporally local) averaging with noise added before application of a threshold (activation function). If the temporal averaging in the neurons is larger than just a few samples, it would be reasonable to expect that biological neurons implicitly have solved the inverse dithering problem of tuning the activation threshold to the noise characteristics, see e.g. [102].

5.6 2D and 3D channel representations

So far we have only discussed channel representations of scalar values. A common situation, which is already dealt with by mean-shift filtering and M-estimation, see chapter 4, is a higher dimensional sample space.

5.6.1 The Kronecker product

The most straight-forward way to extend channel encoding to higher dimensions is by means of the Kronecker product. The result of a Kronecker product between two vectors x and y is a new vector formed by stacking all possible element-wise products x_i y_j. This is related to the (column-wise) vectorisation of an outer product

x ⊗ y = vec(y xᵀ)    (5.34)

i.e.

(x₁ … x_K)ᵀ ⊗ (y₁ … y_L)ᵀ = vec( [ y₁x₁ ⋯ y₁x_K ; ⋮ ⋱ ⋮ ; y_Lx₁ ⋯ y_Lx_K ] ) = (y₁x₁ ⋯ y_Lx₁ ⋯ y₁x_K ⋯ y_Lx_K)ᵀ .

If we have a higher dimensionality than 2, we simply repeat the Kronecker product, e.g. for a 3D space we have

x ⊗ y ⊗ z = vec(vec(z yᵀ) xᵀ) .    (5.35)

For channel representations in higher dimensions it is additionally meaningful to do encodings of subspaces, such as a line in 2D, and a line or a plane in 3D. We will develop the encoding of a line in 2D in section 5.6.3.
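The Kronecker-product construction can be sketched as follows; the Gaussian encoder, channel counts, and kernel width are our own illustrative choices:

```python
import numpy as np

def encode_1d(x, centers, width=0.8):
    # 1D channel encoding with Gaussian kernels (illustrative helper)
    return np.exp(-0.5 * ((x - centers) / width) ** 2)

ux = encode_1d(2.3, np.arange(1, 6))      # K = 5 channels for x
uy = encode_1d(3.7, np.arange(1, 7))      # L = 6 channels for y
# Kronecker product: all element-wise products ux[i] * uy[j] stacked,
# i.e. a vectorisation of the outer product, cf. (5.34)
u2d = np.kron(ux, uy)
```

Note that `np.kron(ux, uy)` produces exactly the stacking of all products ux[i]·uy[j], which coincides with a row-major vectorisation of the outer product.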
For applications such as channel averaging (see section 4.2.1), the channel representation will become increasingly less practical as the dimensionality of the space to be represented is increased, and methods such as the mean-shift filter (see section 4.2.3) are to be preferred. However, for applications where the sparsity of the channel representation can be utilised in storage of data (e.g. the associative learning in chapter 9) higher dimensions are viable.

5.6.2 Encoding of points in 2D

Since the Gaussian function is separable, the Kronecker product of two 1D Gaussian channel vectors is equivalent to using isotropic 2D Gaussian kernels

B^{k,l}(x, y) = B^k(x) B^l(y) = exp( −((x − k)² + (y − l)²) / (2σ²) ) .    (5.36)

For efficiency, we will however still perform the encoding using the Kronecker product of two 1D channel vectors.

5.6.3 Encoding of lines in 2D

A line constraint in 2D is the set of all points (x, y) fulfilling the equation

x cos φ + y sin φ − ρ = 0 .    (5.37)

Here (cos φ, sin φ)ᵀ is the line normal, and ρ is the signed normal distance, i.e. the projection of an arbitrary point on the line onto the line normal. The distance of a specific point (k, l) to the line is then given by

d = ||k cos φ + l sin φ − ρ|| .    (5.38)

This means that we can encode the line constraint by simply applying the kernel

B^{k,l}(φ, ρ) = exp( −d² / (2σ²) ) = exp( −(k cos φ + l sin φ − ρ)² / (2σ²) ) .    (5.39)

5.6.4 Local decoding for 2D Gaussian kernels

In order to capture sample dispersions that are not aligned to the axes, the local decoding should model the channel values using a full 2D Gaussian function

u_{k,l} = r B^{k,l}(x) = r exp( −0.5 (x − m)ᵀ C⁻¹ (x − m) ) .    (5.40)

Here x = (x, y)ᵀ, m = (k, l)ᵀ, and C is a full covariance matrix. In scalar form we get

u_{k,l} = r exp( −( (x − k)²σ_y² + (y − l)²σ_x² − 2(x − k)(y − l)σ_xy ) / ( 2(σ_x²σ_y² − σ_xy²) ) )    (5.41)

or, for ∆x = x − k and ∆y = y − l,

u_{k,l} = r exp( −( ∆x²σ_y² + ∆y²σ_x² − 2∆x∆y σ_xy ) / ( 2(σ_x²σ_y² − σ_xy²) ) ) .
(5.42)

If we choose to estimate the parameters in a 3 × 3 neighbourhood around the position (k, l), we obtain the following system

(u_{k−1,l−1}, u_{k−1,l}, …, u_{k+1,l+1})ᵀ = r (B^{k−1,l−1}(x, y), B^{k−1,l}(x, y), …, B^{k+1,l+1}(x, y))ᵀ .    (5.43)

The logarithm of an arbitrary row can be written as

ln u_{k+c,l+d} = ln r − ( (∆x − c)²σ_y² + (∆y − d)²σ_x² − 2(∆x − c)(∆y − d)σ_xy ) / ( 2(σ_x²σ_y² − σ_xy²) ) .    (5.44)

This can be factorised as

ln u_{k+c,l+d} = 0.5 (1  2c  2d  −c²  −d²  −2cd) p    (5.45)

for the parameter vector

p = 1/(σ_x²σ_y² − σ_xy²) · ( 2 ln r (σ_x²σ_y² − σ_xy²) − ∆x²σ_y² − ∆y²σ_x² + 2∆x∆y σ_xy,  ∆x σ_y² − ∆y σ_xy,  ∆y σ_x² − ∆x σ_xy,  σ_y²,  σ_x²,  −σ_xy )ᵀ .    (5.46)

In the parameters p, we recognise the inverse covariance matrix

C⁻¹ = 1/(σ_x²σ_y² − σ_xy²) [ σ_y²  −σ_xy ; −σ_xy  σ_x² ] = [ p₄  p₆ ; p₆  p₅ ] .    (5.47)

Thus we can compute the covariance matrix as

Ĉ = [ σ_x²  σ_xy ; σ_xy  σ_y² ] = 1/(p₄p₅ − p₆²) [ p₅  −p₆ ; −p₆  p₄ ] .    (5.48)

From the expressions for p₂ and p₃ in (5.46), we obtain the following system

(p₂, p₃)ᵀ = C⁻¹ (∆x, ∆y)ᵀ   with the solution   (∆x̂, ∆ŷ)ᵀ = Ĉ (p₂, p₃)ᵀ .    (5.49)

From the solution, we can obtain an estimate of the confidence r̂, as

r̂ = exp( 0.5 (p₁ + p₄∆x̂² + p₅∆ŷ² + 2p₆∆x̂∆ŷ) ) .    (5.50)

(Note the sign of the cross term: inserting (5.46) into (5.50) shows that +2p₆∆x̂∆ŷ is required for r̂ = r in the noise-free case.) The final peak location is obtained by adding the centre bin location to the peak offset

x̂ = ∆x̂ + k  and  ŷ = ∆ŷ + l .    (5.51)

The expectation of the estimated covariance matrix Ĉ is the sum of the covariance of the noise and the covariance of the kernel functions C_b = diag(σ², σ²), see (5.36) and (5.39). This means that we can obtain the covariance of the noise as Ĉ_noise = Ĉ − C_b.

5.6.5 Examples

Clustering of points in 2D is illustrated in figure 5.13. Here we have channel encoded 1000 points, each from one of 5 clusters at random locations. The centre plot shows the average of the channel representation of the points.
In the right plot, the decoded modes have their peaks represented as dots, and their covariances as ellipses. This approach to clustering is different from techniques like K-means clustering [7] and mixture model estimation using expectation maximisation (EM) [15, 7], which both require the number of clusters to be known a priori. In order to get a variable number of clusters, such methods have to test all reasonable numbers of clusters, and select one based on some criterion, see e.g. [15]. Here we instead assume a scale, specified by the channel distance and kernel width, and directly obtain a variable number of clusters. This makes our clustering technique fall into the same category as mean-shift clustering, see section 4.2.3.

The encoding of line constraints is illustrated in figure 5.14. Here we have channel encoded 4 lines, and averaged their channel representations. In the right plot, the decoded modes have their peaks represented as dots, and their covariances as ellipses. This approach to finding multiple solutions to a line constraint equation was applied in [93] to find multiple solutions to systems of optical flow constraints at motion boundaries.

Figure 5.13: Illustration of point decoding. Left: Points to encode. Centre: Average of the corresponding channel representations, the sizes of the filled circles correspond to the channel values. Ellipses show the decoded modes. Right: The points, and the decoded modes.

Figure 5.14: Illustration of line constraint decoding. Left: Lines to encode. Centre: Average of the corresponding channel representations, the sizes of the filled circles correspond to the channel values. Ellipses show the decoded modes. Right: The original lines, and the decoded modes.
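The local decoding of section 5.6.4 reduces to a linear least-squares problem in the logarithms of the nine channel values. The following sketch follows (5.43)–(5.51); the confidence exponent is written with the cross-term sign that follows from the factorisation (5.45)–(5.46), and the synthetic test mode is our own choice:

```python
import numpy as np

def decode_patch(patch):
    """Decode mode offset, confidence and covariance from a 3x3 patch of
    channel values; patch[c+1, d+1] holds u_{k+c, l+d}."""
    A, b = [], []
    for c in (-1, 0, 1):
        for d in (-1, 0, 1):
            # row of (5.45): 0.5 * (1, 2c, 2d, -c^2, -d^2, -2cd)
            A.append([0.5, c, d, -0.5 * c * c, -0.5 * d * d, -c * d])
            b.append(np.log(patch[c + 1, d + 1]))
    p1, p2, p3, p4, p5, p6 = np.linalg.lstsq(np.array(A), np.array(b),
                                             rcond=None)[0]
    C = np.array([[p5, -p6], [-p6, p4]]) / (p4 * p5 - p6 ** 2)   # (5.48)
    dx, dy = C @ np.array([p2, p3])                              # (5.49)
    r = np.exp(0.5 * (p1 + p4 * dx**2 + p5 * dy**2 + 2 * p6 * dx * dy))
    return dx, dy, r, C

# Synthesise a patch from a known mode and check that it is recovered
C_true = np.array([[0.64, 0.10], [0.10, 0.64]])
Ci = np.linalg.inv(C_true)
patch = np.empty((3, 3))
for c in (-1, 0, 1):
    for d in (-1, 0, 1):
        v = np.array([0.3 - c, -0.2 - d])
        patch[c + 1, d + 1] = 0.9 * np.exp(-0.5 * v @ Ci @ v)
dx, dy, r, C = decode_patch(patch)
```

For noise-free channel values generated by the model (5.40), the nine equations are consistent, so the least-squares solution recovers the offset, confidence, and covariance exactly.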
5.6.6 Relation to Hough transforms

The idea of encoding line constraints, averaging and decoding, is the same as in the Hough transform, see e.g. [92]. In the Hough transform, each line constraint contributes either 1 or 0 to cells in an accumulator array (corresponding to the channel matrix). To reduce noise, and avoid detection of multiple maxima for each solution, a smoothing is sometimes applied to the accumulator array after it has been computed [92]. Note that channel encoding is a more sound approach, since it corresponds to smoothing before sampling. The use of overlapping kernels, and a local decoding, instead of just finding the accumulator cell with the largest contribution, improves the accuracy of the result. Furthermore it avoids the trade-off between noise sensitivity and accuracy of the result inherent in the Hough transform, and all other peak detection schemes using non-overlapping bins.⁸ The difference in obtained accuracy is often large, see e.g. [33] for an example where the estimation error is reduced by more than a factor 50.

⁸ A larger amount of noise will however still mean that larger kernels should be used, and this will affect the amount of interference between multiple decodings, see section 5.4.

Chapter 6

Channel Smoothing

In this chapter we will introduce the channel smoothing technique for image denoising. We will identify a number of problems with the original approach [42], and suggest solutions to them. One of the solutions is the alpha synthesis technique. Channel smoothing is also compared to other popular image denoising techniques, such as mean-shift, bilateral filtering, median filtering, and normalized averaging.

6.1 Introduction

In chapter 4 we saw that averaging in the channel representation followed by a local decoding is a way to find simple patterns (clusters) in a data set. In low-level image processing, the data set typically consists of pixels or features in a regular grid.
Neighbouring pixels are likely to originate from the same scene structure, and it seems like a good idea to exploit this known relationship and perform spatially localised clustering. This is done in channel smoothing [42, 29].

6.1.1 Algorithm overview

Channel smoothing of a grey-scale image p : Z² → R can be divided into three distinct steps.

1. Decomposition. The first step is to channel encode each pixel value p(x, y) in the grey-scale image with the confidence r(x, y) = 1, to obtain a set of channel images.

2. Smoothing. We then perform a spatially local averaging (low-pass filtering) on each of the channel images.

3. Synthesis. Finally we synthesise an image and a corresponding confidence using a local decoding (see section 3.2.3) in each pixel.

Figure 6.1: Illustration of channel smoothing. Left to right: Input, channel representation, low-passed channels, and local decoding.

At locations where the image is discontinuous we will obtain several valid decodings. The most simple solution is to select the decoding with the highest confidence. This synthesis works reasonably well, but it has several problems, such as introduction of jagged edges, and rounded corners. In section 6.3 we will analyse these problems, and suggest solutions. Finally in section 6.4 we will have a look at a more elaborate synthesis method that deals with these problems.

6.1.2 An example

Figure 6.1 shows an example of channel smoothing. In this example we have used K = 8 channel images, and averaged each channel image with a separable Gaussian filter of σ = 1.18 and 7 coefficients per dimension. A channel representation u = (H^1(p) … H^K(p))ᵀ can represent measurements p ∈ [3/2, K − 1/2], and thus we have scaled the image intensities p(x, y) ∈ [r_l, r_h] using a linear mapping

p_x(x, y) = t₁ p(x, y) + t₀  with  t₁ = (K − 2)/(r_h − r_l)  and  t₀ = 3/2 − t₁ r_l    (6.1)

as described in section 3.3.
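The three steps above can be sketched as follows, using Gaussian kernels, a plain separable low-pass filter, and a deliberately simplified synthesis that picks the centre of the strongest channel rather than the full local decoding; all parameter values are illustrative:

```python
import numpy as np

def _lowpass(img2d, sigma):
    # Truncated, normalised Gaussian applied separably along both axes
    r = int(3 * sigma)
    g = np.exp(-0.5 * (np.arange(-r, r + 1) / sigma) ** 2)
    g /= g.sum()
    out = np.apply_along_axis(np.convolve, 0, img2d, g, mode="same")
    return np.apply_along_axis(np.convolve, 1, out, g, mode="same")

def channel_smooth(img, K=8, sigma=1.2, width=0.8):
    lo, hi = float(img.min()), float(img.max())
    t1 = (K - 2) / (hi - lo)                    # linear mapping (6.1)
    t0 = 1.5 - t1 * lo
    p = t1 * img + t0
    centers = np.arange(1, K + 1)
    # 1. Decomposition: one channel image per Gaussian kernel
    ch = np.exp(-0.5 * ((p[..., None] - centers) / width) ** 2)
    # 2. Smoothing: low-pass each channel image spatially
    ch = np.stack([_lowpass(ch[..., k], sigma) for k in range(K)], axis=-1)
    # 3. Synthesis (simplified): centre of the strongest channel, mapped back
    best = centers[np.argmax(ch, axis=-1)]
    return (best - t0) / t1
```

Because the synthesis here snaps to channel centres, the output is quantised to K − 2 levels; the proper local decoding of section 3.2.3 removes this quantisation.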
6.2 Edge-preserving filtering

The channel smoothing procedure performs averaging of measurements that are similar in both property (here intensity) and location. In this respect, channel smoothing is similar to edge-preserving filtering techniques, such as robust anisotropic diffusion [8], selective binomial filtering [40], mean-shift filtering [19], and non-linear Gaussian filtering [103, 2, 44], also known as SUSAN noise filtering [88], and in the case of vector valued signals as bilateral filtering [98]. All these methods can be related to redescending¹ M-estimators (see section 4.2.4). The relationship between anisotropic diffusion and M-estimation is established in [8], and selective binomial filtering can be viewed as an M-estimation with a cutoff-squares error norm (see section 4.2.4). In section 6.5 we will compare mean-shift filtering and bilateral filtering to channel smoothing, thus we will now describe these two methods in more detail.

6.2.1 Mean-shift filtering

Mean-shift filtering [19, 16, 43] is a way to cluster a sample set, by moving each sample toward the closest mode. As described in section 4.2.3 this is accomplished by gradient descent on a kernel density estimate. For each sample p_n, an iteration is started in the original sample value, i.e. p̄_n^0 = p_n, and is iterated until convergence. The generalised mean-shift iteration [16] is defined by

p̄_n^{i+1} = Σ_k H(p_k − p̄_n^i) p_k / Σ_k H(p_k − p̄_n^i) .    (6.2)

Here H is a kernel which is said to be the shadow of the kernel in the corresponding kernel density estimator, see section 4.1.1. For a kernel density estimate using the Epanechnikov kernel, the iteration rule becomes an average in a local window, and can thus be computed very quickly [43]. This is what has given the method its name, and averaging in a local window is also the most commonly applied variant of mean-shift filtering.
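A 1D sketch of the iteration (6.2) with a Gaussian shadow kernel; the kernel width, iteration count, and test data are our own choices:

```python
import numpy as np

def mean_shift_mode(samples, start, width=0.5, iters=50):
    # Iterate (6.2): a kernel-weighted mean around the current estimate,
    # i.e. hill-climbing on the kernel density estimate.
    samples = np.asarray(samples, float)
    p = float(start)
    for _ in range(iters):
        w = np.exp(-0.5 * ((samples - p) / width) ** 2)
        p = float(w @ samples / w.sum())
    return p

# Two well-separated modes: each start converges to its nearest mode
two_modes = np.concatenate([np.full(50, 1.0), np.full(50, 5.0)])
```

In the filtering application, one such iteration is started from every sample value, and the convergence point replaces the sample.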
Mean-shift filtering has been developed by Comaniciu and Meer [19, 20] into algorithms for edge-preserving filtering and segmentation. In the edge-preserving filter, they apply mean-shift filtering to the parameter vector

p(x) = ( x/σ_s  r(x)/σ_r  g(x)/σ_r  b(x)/σ_r )ᵀ .    (6.3)

Here r, g, and b are the three colour bands of an RGB image, and the parameters σ_s and σ_r allow independent scaling of the spatial and range (colour) vector elements respectively. The convergence point p̄* is stored in the position where the iteration was started. The result is thus a hill-climbing on the kernel density estimate.

6.2.2 Bilateral filtering

Bilateral filtering [98] of a signal p(x) is defined by

q(x) = ∫ p(y) H((x − y)/σ_s) H((p(x) − p(y))/σ_r) dy / ∫ H((x − y)/σ_s) H((p(x) − p(y))/σ_r) dy    (6.4)

where H is a multi-dimensional Gaussian, see (4.5). For the special case of a scalar valued image p(x), the expression (6.4) is identical to the earlier non-linear Gaussian filter [103, 2, 44], and to the SUSAN noise filtering technique [88]. In (6.4) we explicitly distinguish between the spatial position x, and the sample values f(x), whereas in the generalised mean-shift iteration (6.2) on the parameter vector (6.3) they are treated as one entity. It is nevertheless possible to relate generalised mean-shift and bilateral filtering, for the case of a separable kernel H(d), such as the Gaussian. If we use p from (6.3) in (6.2), and identify r, g, b as f in (6.4), then the results of (6.2) and (6.4) after one iteration are identical in the r, g, b and f part. The difference is that (6.4) does not update the position as (6.2) does. A similar observation is made in [104], where non-linear Gaussian filtering (i.e. bilateral filtering) is shown to correspond to the first iteration of a gradient descent on an M-estimation problem.

¹ A redescending M-estimator has an influence function with monotonically decreasing magnitude beyond a certain distance to the origin [8].
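A discrete 1D sketch of (6.4); the parameter values and the step-signal test are our own illustrative choices:

```python
import numpy as np

def bilateral_1d(p, sigma_s=2.0, sigma_r=0.2):
    # Each output value is a mean weighted by both a spatial Gaussian
    # and a range (intensity difference) Gaussian, cf. (6.4).
    p = np.asarray(p, float)
    x = np.arange(len(p))
    out = np.empty_like(p)
    for i in range(len(p)):
        w = (np.exp(-0.5 * ((x - i) / sigma_s) ** 2)
             * np.exp(-0.5 * ((p - p[i]) / sigma_r) ** 2))
        out[i] = w @ p / w.sum()
    return out

step = np.array([0.0] * 10 + [1.0] * 10)
smoothed = bilateral_1d(step)
```

Because the range weight suppresses contributions from pixels with very different intensity, the step stays sharp while each flat region is averaged internally.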
Bilateral filtering thus moves the data towards the local M-estimate, but in general gets stuck before it is reached [104]. Tomasi and Manduchi [98] also suggest a scheme for iterated bilateral filtering, by again applying (6.4) to the result q(x). As noted in [19] this iteration will not converge to a stable clustering. Instead it will eventually erode the image to contain a single constant colour. Thus, neither bilateral filtering nor iterated bilateral filtering are robust techniques in a strict sense.

6.3 Problems with strongest decoding synthesis

The original synthesis in channel smoothing selects the decoding with the strongest confidence as its output [42, 29]. This synthesis has three distinct problems:

1. Jagged edges. Since the synthesis selects one decoding, edges can become arbitrarily sharp. This means that for an infinitesimal translation of the underlying signal, the synthesis of a pixel may completely change value, and this in turn results in jagged edges.

2. Rounded corners. Corners will tend to be rounded, because the number of pixels voting for the intensity inside the corner (i.e. the confidence) becomes lower than the number of votes for the outside intensity when we are near the tip of the corner.

3. Patchiness. For steep slopes, or large amounts of blurring, when we have inlier noise, the selection tends to generate a patchy output signal.

To illustrate these effects we have devised a test image consisting of a slanted plane, surrounded by a number of triangles with protruding acute angles (see figure 6.2, left, for a noise corrupted version). The difference in grey-level from the background is also different for all the triangles, in order to illustrate at what difference the non-linear behaviour starts. The parameters of the channel smoothing have been specifically set to exaggerate the three problems listed above. The result is shown in the right plot of figure 6.2.
Figure 6.2: Illustration of problems with strongest decoding synthesis. Left to right: Input (100 × 100 pixels, with values in range [−1, 1]. Noise is Gaussian with σ = 0.1, and 5% salt&pepper pixels.), normalized average (σ = 2.2), channel smoothing (K = 15 channels, σ = 2.2).

In order to demonstrate the non-linear behaviour of the method, a normalized average with the same amount of smoothing is also shown for comparison. Normalized averaging of a signal–confidence pair (p, r) using the kernel g is defined by the quotient

q(x) = (p · r ∗ g)(x) / (r ∗ g)(x)    (6.5)

where · denotes an element-wise product. In our example, normalized averaging is equivalent to plain averaging, except near the edges of the image, where it helps maintain the correct DC-level. See [48, 26] for more on normalized averaging, and the more general framework of normalized convolution [26].

6.3.1 Jagged edges

The jagged edges problem can be dealt with using super-sampling techniques common in computer graphics, see e.g. [59], section 4.8. For channel smoothing this can be done by filling in zeros in between the encoded channel values before smoothing, and thus generating an output at a higher resolution. By modifying the amount of channel smoothing accordingly, we can obtain edges with a higher resolution. This technique is demonstrated in figure 6.3. Here we have filled in zeroes to obtain 4× the pixel density along each spatial dimension. We have then modified the amount of smoothing according to

σ_new = σ √( (4ⁿ − 1)/3 )    (6.6)

where n = 3 is the octave scale, as suggested in [64]. As a final step we have then blurred and subsampled the high resolution output. This has given us an image without jagged edges (see figure 6.3, right).

Figure 6.3: Super-sampling from channels.
Left to right: Strongest decoding output (using K = 7 and σ = 1.3), strongest decoding output after 4× upsampling of the channel images (using K = 7 and σ = 1.3 × √((4³ − 1)/3) ≈ 5.96), smoothing and subsampling of the high-resolution decoding (σ = 1.4). For input image, see figure 6.2.

6.3.2 Rounding of corners

A solution to the rounding of corners was suggested by Spies and Johansson in [94]. They proposed to sometimes select the decoding closest to the original grey-level, instead of the one with the highest confidence. The method in [94] works as follows:

• If the strongest confidence is above a threshold t_h, the corresponding decoding is selected.

• If not, all decodings with a confidence above t_l are searched for the decoding closest to the original grey value.

Spies and Johansson suggest using t_h = 0.9 and t_l = 0.1. A similar behaviour can be obtained by removing all decodings with confidence below a threshold c_min, and selecting the remaining decoding which is closest to the original value. Selecting the closest decoding will make the method correspond roughly to the hill-climbing done by the mean-shift procedure (see section 6.2.1). These two methods (from now on called Spies and Hill-climbing) are compared to the strongest decoding in figure 6.4. As can be seen in the figure, these methods trade preservation of details against removal of outliers which happen to be inliers in nearby structures.

6.3.3 Patchiness

The patchiness problem is caused by too wide a distribution of samples around a mode. More specifically, the exact mode location cannot be determined by examining only channel values inside the decoding window, since the noise has not been completely averaged out, see figure 6.5. Usually more channel smoothing is unable to average out the noise, since there simply are no more samples inside the appropriate grey-level range locally in the image.

Figure 6.4: Comparison of decoding selection schemes.
Left to right: Strongest decoding synthesis, the Spies method (t_h = 0.7, t_l = 0.25), the Hill-climbing method (c_min = 0.25). All three use K = 15 channels and σ = 2.2.

Figure 6.5: Reason for the patchiness problem. For wide distributions, it might be impossible to pick a proper decoding window. Both alternatives indicated here will give wrong results, since the noise has not been completely averaged out.

There are two causes of the wide distributions of samples. The first one is that the noise distribution might be too wide. This can be dealt with by increasing the sizes of the used kernel functions such that more samples can be used in the averaging process. This would however require a modification to the decoding step. A way to obtain larger kernels which does not require a modification of the decoding step is to simply reduce the number of channels and scale them to fit the same interval according to (6.1). Reducing the number of channels will however make the method increasingly similar to plain linear smoothing.

There is however a second cause of the wide distributions. The channel averaging process implicitly assumes that the signal is piecewise constant. When the signal locally constitutes a ramp, we violate this assumption, and the actual width of the distribution depends on the slope of the ramp. Since the required width of the channels depends both on the amount of noise and on the slope of the signal, a more theoretically sound solution is to use a more advanced decoding scheme, which adapts the size of the decoding window to take more channel values into account when necessary. Alternatively we could replace the locally constant assumption with a locally linear one, and cluster in (µ, dx, dy)-space instead. None of these ideas are investigated further in this thesis.
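The selection rules of section 6.3.2 can be sketched as follows; the default thresholds are those suggested by Spies and Johansson, while the function name and test values are our own:

```python
import numpy as np

def spies_select(decodings, confidences, original, th=0.9, tl=0.1):
    # Take the strongest decoding when its confidence is convincing,
    # otherwise the valid decoding closest to the original grey-level.
    d = np.asarray(decodings, float)
    c = np.asarray(confidences, float)
    best = int(np.argmax(c))
    if c[best] > th:
        return d[best]
    valid = d[c > tl]
    return valid[int(np.argmin(np.abs(valid - original)))]
```

The Hill-climbing variant differs only in that it always picks the closest decoding among those with confidence above c_min.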
6.4 Alpha synthesis

We will now instead attack the jagged edges problem, and at the same time make a slight improvement on the performance with respect to the rounding of corners. The reason for the jagged edges in the strongest decoding synthesis is that the decoding is a selection of one of the decodings. In many situations a selection is desirable, for instance if we want to teach a robot to navigate around an object. Going either left or right of the object works fine, but the average, straight ahead, is not a valid trajectory. In the present situation however, we want to output an image, and in images we should not have arbitrarily sharp edges. The solution to this problem is to generate a continuous transition between the decoded values. Instead of choosing the decoding with the highest confidence, we will now combine all decoded signal–confidence pairs (p_k, r_k) in a non-linear way. In this way it is possible to obtain an output signal where the degree of smoothness is controlled by a parameter. The combination of decoded signal–confidence pairs (p_k, r_k) is done according to

p_out = Σ_k p_k w_k  where  w_k = r_k^α / Σ_l r_l^α    (6.7)

and α is a tuning parameter. For α = 1 we get a linear combination of the decodings p_k, and for α → ∞ we obtain a selection of the strongest decoding again. The behaviour of (6.7) is illustrated in figure 6.6.

Figure 6.6: Illustration of alpha synthesis in 1D. Left to right: input, channel representation, low-passed channels, and output. Outputs for 5 different α values are shown.

What is important to note here is that the actual distance between the decodings p₁ and p₂ plays no part in the synthesis (when the decodings do not interfere). This is not the case for methods like Histogram filters [107]. In Histogram filters the noise is removed by regularised inverse filtering of the histogram (i.e. the channel vector).
The output is then computed as the mass centre of the histogram, after raising the histogram bin values to a power γ. The purpose of γ is to control the sharpness of the result, and it thus serves the same purpose as our α. In such an approach, the interference of the two levels will be different when they are close and when they are far apart. Furthermore, raising the histogram to a power is not a shift invariant operation.

The signal in figure 6.6 (left) is expanded into a number of channels, which are low-passed. In the decoding we get two valid signal–confidence pairs (p₁, r₁) and (p₂, r₂). The blurred channels have values that smoothly change from their highest value to zero. Since the confidence computation is a local average of the channel values, the confidence will have the same profile as the channels. The confidence of a valid decoding is exactly 1 when the low-pass kernel is inside one of the flat regions, and as we move across the edge it drops to zero.

6.4.1 Separating output sharpness and channel blurring

The actual value of the confidence is given by a convolution of a step and the used low-pass kernel. In this example we have used a Gaussian low-pass kernel, and thus obtain the following confidences for the valid decodings

r₁(x) = ((1 − step) ∗ g)(x) = ∫_{−∞}^{−x} (2πσ²)^{−1/2} e^{−0.5(t/σ)²} dt = Φ(−x/σ) = 1 − Φ(x/σ)    (6.8)

r₂(x) = (step ∗ g)(x) = ∫_{−∞}^{x} (2πσ²)^{−1/2} e^{−0.5(t/σ)²} dt = Φ(x/σ)    (6.9)

where Φ(x) is the integral of the standard normal PDF. The weights now become

w₁ = r₁^α / (r₁^α + r₂^α) = 1 / ( 1 + ( 1/(1 − Φ(x/σ)) − 1 )^α )  and  w₂ = 1 / ( 1 + ( 1/Φ(x/σ) − 1 )^α ) .    (6.10)

If we look at the decoding for x = 0, we get

p_out(0) = w₁(0)p₁ + w₂(0)p₂ = p₁/2 + p₂/2 = (p₁ + p₂)/2    (6.11)

which is desirable, since this point is directly in between the two levels p₁ and p₂. If we look at the derivative at x = 0, we get

∂p_out/∂x (0) = ∂w₁/∂x (0) p₁ + ∂w₂/∂x (0) p₂ = α(p₂ − p₁) / (√(2π) σ) .    (6.12)

This motivates setting α ∝ σ.
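The combination (6.7) can be sketched as follows; passing α = βσ gives the σ-independent variant:

```python
import numpy as np

def alpha_synthesis(decodings, confidences, alpha):
    # Weighted combination (6.7): alpha = 1 gives a linear blend,
    # large alpha approaches the strongest-decoding selection.
    p = np.asarray(decodings, float)
    w = np.asarray(confidences, float) ** alpha
    return float(w @ p / w.sum())
```

With equal confidences the output is the midpoint regardless of α, matching (6.11); as α grows, the weaker decoding's weight vanishes.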
By switching to a parameter β = α/σ we get a new synthesis formula

p_out = Σ_k p_k w_k  where  w_k = r_k^{βσ} / Σ_l r_l^{βσ} .    (6.13)

Using the parameter β, we can control the slope of the decoding at the transition point independently of σ. While (6.12) only holds exactly for x = 0, a constant α/σ ratio gives very stable slopes for other values of x as well. This is demonstrated experimentally in figure 6.7.

Figure 6.7: Illustration of stability of the α ∝ σ approximation. Here we have used σ ∈ {1, 1.05, . . . , 3} and set β = 1. Left: each row is p_out for some value of σ. Right: all p_out curves superimposed.

Figure 6.8 demonstrates the synthesis according to (6.7) on the test image in figure 6.2 (left), for varying amounts of blurring (σ) and varied values of β. Each row has a constant β value, and as can be seen they have roughly the same sharpness.

6.4.2 Comparison of super-sampling and alpha synthesis

We will now make a comparison of alpha synthesis and the super-sampling method described in section 6.3.1. Figure 6.9 shows the result. From the figure we can see that the result is qualitatively similar with respect to elimination of jagged edges. After examining details (see bottom row of figure 6.9) we find that alpha synthesis is slightly better at preserving the shape of corners. In addition to slightly better preservation of corners, alpha synthesis also has the advantage that it is roughly a factor 16 faster, since it does not need to use a higher resolution internally.

6.4.3 Relation to smoothing before sampling

An image from a digital camera is a regular sampling of a projected 3D scene. Before sampling, the projection has been blurred by the point spread function of the camera lens, and during sampling it is further smoothed by the spatial size of the detector elements. These two blurrings can be modelled by convolution of the continuous signal f(x) with a blurring kernel g(x).
A blurring of a piecewise constant signal f(x), such as the one in figure 6.6, will have transition profiles which look the same regardless of the signal amplitude. This is evident since for a scalar A, a signal f(x), and a filter g(x) we have

(Af ∗ g)(x) = A(f ∗ g)(x) .    (6.14)

Figure 6.8: σ independent sharpness. Column-wise σ = 1.2, 1.6, 2.4, 3.2. Row-wise β = 1, 2, 4, 8, ∞ (i.e. strongest decoding synthesis). The input image is shown in figure 6.2. K = 7 has been used throughout the experiment.

Figure 6.9: Comparison of super-sampling method and alpha synthesis. Left to right: strongest decoding synthesis, strongest decoding with 4× higher resolution, blurred and subsampled, alpha synthesis. Top row shows full images, bottom row shows a detail. K = 7, σ = 2.2 and α = 3σ has been used.

For a suitable choice of blurring kernel g(x), a small translation of the input signal will result in a small change in the sampled signal. The behaviour of smoothing before sampling bears resemblance to the alpha synthesis in two respects:

1. For a suitable choice of α, alpha synthesis generates a smooth transition between two constant levels, in such a way that a small translation of the input signal results in a small change in the output signal. This is not the case for e.g. the strongest decoding synthesis, but it is the case for smoothing before sampling.

2. As noted earlier, the actual distance between the two grey-levels in a transition plays no part in the alpha synthesis. This means that a transition with a certain profile will be described with the same number of samples, regardless of scaling (i.e. regardless of the distance between p₁ and p₂). This follows directly from (6.7). In this respect channel smoothing with alpha synthesis behaves like a blurring before sampling for piecewise constant signals.
This is different from methods such as the histogram filter [107], which control the sharpness of the output by an exponent on the channel values. We can thus view channel smoothing with alpha synthesis as analogous to smoothing before sampling on a piecewise constant signal model. The analogy is that the channel images represent a continuous image model (for piecewise constant signals), and this model is blurred and sampled by the alpha synthesis.

6.5 Comparison with other denoising filters

We will now compare channel smoothing with alpha synthesis against a number of other common denoising techniques. The methods are tested on the test image in figure 6.2, contaminated with two noise types: Gaussian noise only, and Gaussian plus salt & pepper noise. For all methods we have chosen the filter parameters such that the mean absolute error (MAE) between the uncontaminated input and the input contaminated with Gaussian noise is minimised.² MAE is defined as

ε_MAE = 1/(N_1 N_2) Σ_{x_1=1}^{N_1} Σ_{x_2=1}^{N_2} |f_ideal(x) − f_out(x)| .   (6.15)

MAE was chosen over RMSE, since it is more forgiving to outliers. The methods and their optimal filter parameters are listed below:

1. Normalized averaging. As defined in (6.5). σ = 1.17.

2. Median filtering. See e.g. [92]. The Matlab implementation, with symmetric border extension, and a 5 × 5 spatial window.

3. Bilateral filtering. As defined in (6.4). σ_s = 1.64 and σ_p = 0.30.

4. Mean-shift filtering. As defined in (6.2). σ_s = 4.64 and σ_p = 0.29.

5. Channel smoothing. With alpha synthesis, K = 8, σ = 1.76, and α = 3σ (not optimised).

The results are reproduced in figure 6.10. As is evident in this experiment, neither bilateral nor mean-shift filtering is able to eliminate outliers. The reason for this is that they both perform gradient descent starting at the outlier value. However, none of these methods allow outliers to influence the other pixels the way a linear filter does.
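The MAE criterion (6.15) is straightforward to compute; a minimal sketch:

```python
import numpy as np

def mae(f_ideal, f_out):
    """Mean absolute error, eq (6.15), between an ideal image and a
    processed image of the same size."""
    f_ideal = np.asarray(f_ideal, dtype=float)
    f_out = np.asarray(f_out, dtype=float)
    return float(np.mean(np.abs(f_ideal - f_out)))
```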
The average MAE values for 10 instances of the images in figure 6.10 are summarised in the table below.

method      Gaussian    Gaussian+S&P
input       0.0790      0.0996
normaver    0.0305      0.0389
median      0.0307      0.0320
bilateral   0.0277      0.0377
meanshift   0.0226      0.0411
chan.sm.    0.0230      0.0234

The winner for Gaussian noise is mean-shift, with channel smoothing a close second. For Gaussian plus salt & pepper noise, channel smoothing is way ahead of all other methods, with the median filter being the runner-up. The perceptual image quality, which is what we would really like to optimise for, is a different matter however.

² For stability of the parameter choice, the error was measured on 10 instances of the noise contaminated input.

6.6 Applications of channel smoothing

The image denoising procedure is demonstrated in figure 6.11. The examples are described below, row by row:

row 1 This is a repeat of the experiment in figure 6.2 (left), but with alpha synthesis added, and with the number of channels chosen to match the noise (by making the channels wider).

row 2 This is restoration after contamination with the same type of noise on a natural image.

row 3 This is a duplication of the experiment on restoration of an irregularly sampled signal done in [48], but using channel smoothing instead of normalized averaging to interpolate values at the missing positions. Note that this is image denoising and interpolation combined; if interpolation alone is sought, normalized averaging is preferable, since the piecewise constant assumption modifies the image content slightly.

row 4 This is an example of simple image editing by means of a confidence map. The right edge of the image is black due to problems with the frame grabber. This has been corrected by setting the confidence at these locations to zero. Additionally, two cars have been removed by setting the confidences of their pixels to zero.
Note that more sophisticated methods, such as texture synthesis [25], exist for removing objects in an image.

6.6.1 Extensions

Channel smoothing has been extended to directional smoothing by Felsberg in [28]. This is needed if we want to enhance thin periodic patterns, such as those present in fingerprints. Anisotropic channel filtering is however outside the scope of this thesis.

6.7 Concluding remarks

In this chapter we have investigated the channel smoothing technique for image denoising. A number of problems with the straightforward approach have been identified, and solutions have been suggested. Especially the alpha synthesis technique seems promising, and should be investigated further. Channel smoothing has also been compared to a number of common image denoising techniques, and was found to be comparable to, or better than, all tested methods in a MAE sense. The real criterion should however be the perceived image quality, and this has not been investigated.

Figure 6.10: Comparison of filters. A: Gaussian noise (σ = 0.1). B: Gaussian noise σ = 0.1, and 5% salt & pepper pixels. Each panel shows: input, normalized average, median filter, bilateral filter, mean-shift, channel smoothing. Filter parameters: Normalized average: σ = 1.17. Median filter: symmetric border extension and 5 × 5 window. Bilateral filter: σ_s = 1.64, σ_p = 0.30. Mean-shift filter: σ_s = 4.64, σ_p = 0.29. Channel smoothing: K = 8, σ = 1.76, α = 3σ (not optimised).

Figure 6.11: Examples of channel smoothing. Columns left to right: input images, input confidence, output. Row-wise parameters: 100 × 100 pixels, confidence all ones, K = 7, σ = 1.7, α = 3σ; 512 × 512 pixels, confidence all ones, K = 7, σ = 1.7, α = 3σ; 512 × 512 pixels, 10% density, K = 5, σ = 0.3, α = 3σ; 256 × 256 pixels, edited confidence map, K = 22, σ = 1.4, α = 3σ.
The first two images have been contaminated with Gaussian noise with σ = 0.1 (intensities ∈ [0, 1]), and 5% salt & pepper pixels.

Chapter 7

Homogeneous Regions in Scale-Space

In this chapter we develop a hierarchical blob feature extraction method. The method works on vector fields of arbitrary dimension. We demonstrate that the method can be used in wide baseline matching to align images. We also extend the method to cluster constant slopes for grey-scale images.

7.1 Introduction

For large amounts of smoothing, the channel smoothing output looks almost like a segmented image (see e.g. figure 6.2). This is what inspired us to develop the simple blob detection algorithm¹ to be presented in this chapter. The channel smoothing operation is non-linear, and fits into the same category as clustering and robust estimation techniques. The smoothing operation performed on each channel is however linear, and thus relates to linear scale-space theory.

7.1.1 The scale-space concept

When applying successive low-pass filtering to a signal, fine structures are gradually removed. This is formalised in the theory of scale space [106, 72]. Scale space is the extension of a signal f(x), by means of a blurring kernel g(x, σ), into a new signal

f_s(x, σ) = (f ∗ g(σ))(x) .   (7.1)

The parameter σ is the scale coordinate. The original signal is embedded in the new signal since f_s(x, 0) = f(x). The kernel g(x, σ) is typically a Gaussian, but Poisson kernels have also been used [27]. Figure 7.1 contains an illustration of the scale-space concept.

¹ The method was originally presented in [41]. Here we have extended the method to deal with colour images, and also made a few other improvements.

Figure 7.1: Gaussian scale space of a 1D signal. Left: f(x). Right: f_s(x, σ).
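The construction (7.1) can be sketched for a 1D signal as follows; the kernel truncation radius is our choice, not prescribed by the text:

```python
import numpy as np

def scale_space_1d(f, sigma):
    """Sketch of (7.1): blur a 1D signal with a sampled Gaussian kernel.
    sigma = 0 returns the signal itself, so the original signal is
    embedded in the scale-space family."""
    f = np.asarray(f, dtype=float)
    if sigma == 0:
        return f.copy()
    radius = int(np.ceil(3 * sigma))          # truncation radius (assumption)
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()                              # normalise to preserve the mean
    return np.convolve(f, g, mode="same")
```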
The fact that fine structures are removed by blurring motivates the use of a lower sample density at coarser scales, as is done in a scale pyramid.

7.1.2 Blob features

Homogeneity features are called blobs in scale-space theory [72]. In comparison with segmentation, blob detection has a more modest goal: we do not attempt to segment out the exact shapes of objects; instead we want to extract robust and repeatable features. The difference between segmentation and blob detection is illustrated by the example in figure 7.2. As can be seen, the blob representation discards exact shapes, and thin connections between patches are neglected.

Figure 7.2: Difference between segmentation and blob detection. Left: one segment. Right: two blobs.

Blob features have been used as texture descriptors [73] and as features for image database search [6]. For a discussion of the similarities and differences between other approaches and the one presented here, the reader is directed to [40]. Blob features are related to maximally stable extremal regions (MSER) [82], and to affinely invariant neighbourhoods [99]. MSER features are regions grown around an intensity extremum (max or min) and are used to generate affine invariant frames, which are then used for view-based object recognition [82]. Affinely invariant neighbourhoods are found by starting at an intensity extremum and finding the nearest extrema along rays emanating from the point. These extrema are then linked to form a closed curve, which is used to define an affine invariant [99].

Figure 7.3: Steps in blob extraction. Left to right: image, clustering pyramid, label image, raw blobs, merged blobs.

7.1.3 A blob feature extraction algorithm

The blob estimation procedure uses a scale pyramid. Each position and scale in the pyramid contains a measurement–confidence pair (p, r).² The confidence is a binary variable, i.e. r ∈ {0, 1}. It signifies the absence or presence of a dominant measurement in the local image region.
When r = 1, the dominant measurement is found in p. This representation is obtained by non-linear means, in contrast to most scale-space methods, which are linear [72] and thus obtain the average measurement. The pyramid is used to generate a label image, from which we can compute the moments of each region. The shape of each region is approximated by its moments of orders 0, 1 and 2. These moments are conveniently visualised as ellipses, see the right part of figure 7.3. Finally we merge blobs which are adjacent and of similar colour using an agglomerative clustering scheme. The following sections will each in turn describe the different steps of the algorithm, starting with the generation of the clustering pyramid.

7.2 The clustering pyramid

When building the clustering pyramid, we view each image pixel as a measurement p with confidence r = 1. We then expand the image into a set of channel images. For each of these channel images we generate a low-pass pyramid. The channels obtained from the input image constitute scale 1. Successively coarser scales are obtained by low-pass filtering, followed by sub-sampling. The low-pass filter used consists of a horizontal and a vertical 4-tap binomial kernel [1 3 3 1]/8. Since the filter sums to 1, the confidence values of the decoded (p, r) pairs correspond to fractions of the area covered by the filter. Thus, we construct the low-pass pyramids, and decode the dominant (p, r) pair in each position. Finally we make the confidence binary by setting values below r_min to zero, and values above or equal to r_min to 1. Typically we use the area threshold r_min = 0.5. Figure 7.4 shows such a pyramid for an aerial image. In principle this pyramid generation method can also be applied to vector field images, by extending the channel representation as described in section 5.6.
The number of channels required for vector fields of dimension higher than 2 does however make this approach rather expensive with respect to both computational and storage requirements. For example, a channel representation of an RGB colour image with 26 channels per colour band will give us 26³ = 17576 channel images to filter.

² For generality we will use a vector notation for p, although we will initially use scalar valued images.

Figure 7.4: Clustering pyramid created using K = 26 channels, spaced according to (6.1). Positions with confidence r = 0 are indicated with crosses.

7.2.1 Clustering of vector fields

To perform clustering on vector fields, we will replace channel filtering with another robust estimation technique. The representation at the next scale, p*, is now generated as the solution to the weighted robust estimation problem

arg min_{p*} Σ_k w_k r_k ρ(‖p_k − p*‖) .   (7.2)

Here w_k are weights from the outer product of two binomial kernels, and ρ(d) is a robust error norm. In contrast to linear filter theory we cannot use a separable optimisation, so for a 1D binomial kernel [1 3 3 1]/8 we will have to take all 16 pixels in a local neighbourhood into account. Note that for most choices of ρ(d), problem (7.2) is not convex, and thus local minima exist. We are however looking for the global minimum, and will employ a technique known as successive outlier rejection (SOR) to find it. An iteration of SOR is computed as

p*_est = (1/N_r) Σ_k p_k r_k w_k o_k   where   N_r = Σ_k r_k w_k o_k .   (7.3)

Here o_k are outlier rejection weights, which are initially set to 1. After each iteration we find the pixel with the largest residual d_k = ‖p*_est − p_k‖. If d_k > d_max, we remove this pixel by setting o_k = 0. This procedure is iterated until there are no outliers left. The found solution will be a fixed point of a gradient descent on (7.2), for the cut-off squares error norm.³
Furthermore, provided that more than half the data supports it, the solution will be either the global minimum, or close to the global minimum. The SOR approach thus effectively solves the initialisation problem which exists in the commonly applied iterated reweighted least squares (IRLS) technique, see section 4.2.4, or e.g. [109, 95]. Initialisation is also the reason why mean-shift filtering [19, 20] is unsuitable for generation of a clustering pyramid. The importance of initialisation is demonstrated in figure 7.5. Here we have applied SOR to minimise (7.2), and compared the result with mean-shift filtering and IRLS. The IRLS method was, just like mean-shift, initialised with the original pixel value. As can be seen, only SOR is able to reject outliers. Note that since each iteration either removes one outlier or terminates, there is an upper limit to the number of required iterations (e.g. 16 for a 4 × 4 neighbourhood).

³ ρ(d) = d² for |d| < d_max, and d_max² otherwise.

Figure 7.5: Importance of initialisation. Left to right: input image, IRLS output (6 × 6 window), mean-shift output (spatial radius 3), SOR output (6 × 6 window).

As mentioned, the SOR method is closely related to (7.2) for the cut-off squares error norm. The sharp boundary between acceptance and rejection in cut-off squares is usually undesirable. As a remedy we add a few extra IRLS iterations with a smoother error norm. An iteration of IRLS on (7.2) has the same form as (7.3), but with the outlier rejection weights replaced by o_k = ρ′(d_k)/d_k. If the error norms are similar in shape and extent, we should already be close to the solution, since we are then minimising similar functions. We will use the weighting function

o(d_k) = (1 − (d_k/d_max)²)²   if |d_k| < d_max,   and 0 otherwise   (7.4)

which corresponds to the biweight error norm [109]. The d_max parameter defines the scale of the error norm, and is used to control the sensitivity of the algorithm.
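The SOR iteration (7.3) followed by a biweight polish (7.4) can be sketched as follows. This is a minimal 1D illustration under our own naming; in the thesis the measurements are vectors and the weights w_k come from the outer product of two binomial kernels.

```python
import numpy as np

def sor(p, r, w, d_max):
    """Successive outlier rejection, eq (7.3): repeatedly compute a
    weighted mean and reject the sample with the largest residual,
    until no residual exceeds d_max."""
    p, r, w = (np.asarray(a, dtype=float) for a in (p, r, w))
    o = np.ones_like(p)                       # outlier rejection weights
    while True:
        p_est = np.sum(p * r * w * o) / np.sum(r * w * o)
        d = np.where(o > 0, np.abs(p_est - p), -1.0)  # ignore rejected
        k = int(np.argmax(d))
        if d[k] > d_max:
            o[k] = 0.0                        # reject the worst sample
        else:
            return p_est, o

def biweight_polish(p, r, w, p_est, d_max, iterations=3):
    """A few IRLS iterations with the biweight weighting (7.4), to soften
    the hard accept/reject boundary of the cut-off squares norm."""
    p, r, w = (np.asarray(a, dtype=float) for a in (p, r, w))
    for _ in range(iterations):
        d = np.abs(p_est - p)
        o = np.where(d < d_max, (1.0 - (d / d_max) ** 2) ** 2, 0.0)
        p_est = np.sum(p * r * w * o) / np.sum(r * w * o)
    return p_est, o
```

Note how an outlier far from the majority value is rejected by SOR regardless of the starting estimate, which is exactly the initialisation advantage discussed above.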
To compute p* and r* in each position in the pyramid, we thus first find p* with SOR, followed by IRLS. The output confidence r* is then computed as

r* = 1   if   Σ_k r_k w_k o_k ≥ r_min Σ_k w_k,   and 0 otherwise.   (7.5)

Here o_k are the weights used in the last IRLS iteration, see (7.3), and just like in section 7.2, r_min is a threshold which corresponds to the minimum required area fraction belonging to the cluster. The clustering process is repeated for successively coarser scales until the pyramid has been filled.

7.2.2 A note on winner-take-all vs. proportionality

There is a distinct difference between the channel approach and the SOR+IRLS approach with respect to propagation of information through the pyramid. For the channel variant, all pixels in the input image will have contributed to the value of the channel vector at the top of the pyramid. For the SOR+IRLS variant, each pixel at each scale only describes the dominant colour component, and thus all the other values will be suppressed. This is the same approach as is taken in elections in the UK and the US, where each constituency only gets to select one (or one kind of) representative to the next level. The channel approach on the other hand corresponds to the Swedish system of propagating the proportions of votes for the different parties to the next level. The number of channels K and the maximum property distance d_max can be related according to d_max = 2t_1 = 2(K + 1 − N)/(r_u − r_l), where N, r_u, and r_l are defined in section 3.3. The relationship comes from 2t_1 being the approximate boundary between averaging and rejection, see section 5.4. The two methods are quite similar in behaviour, but even for grey-scale images on conventional computer architectures, the SOR+IRLS variant is a factor 5 or more faster. This is due to the small support (4 × 4) of the parameter estimation. On hardware with higher degrees of parallelism this could however be different.
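The per-channel low-pass step of section 7.2 (separable [1 3 3 1]/8 filtering followed by sub-sampling) can be sketched as below. Border handling is our assumption (zero padding); the thesis does not commit to a scheme at this point.

```python
import numpy as np

BINOMIAL = np.array([1.0, 3.0, 3.0, 1.0]) / 8.0   # the 4-tap kernel from the text

def pyramid_level(channel):
    """One coarser pyramid level of a channel image: horizontal and
    vertical filtering with [1 3 3 1]/8, then sub-sampling by 2."""
    ch = np.asarray(channel, dtype=float)
    ch = np.apply_along_axis(lambda v: np.convolve(v, BINOMIAL, mode="same"), 1, ch)
    ch = np.apply_along_axis(lambda v: np.convolve(v, BINOMIAL, mode="same"), 0, ch)
    return ch[::2, ::2]
```

Since the kernel sums to 1, a channel image that is 1 over a region stays 1 in the interior of that region after filtering, which is why the decoded confidences can be read as area fractions.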
7.3 Homogeneous regions

The generation of the clustering pyramid was bottom-up, and we will now continue by making a top-down pass over the pyramid. Note that the region propagation described here is different from the one in [41]. The changes since [41] have sped up the algorithm and also made the result after region merging more stable. We start at the top scale and generate an empty label image. Each time we encounter a pixel with a property distance above d_max from the pixels above it, we will use it as a seed for a new region. Each new seed is assigned an integer label. We allow each pixel to propagate its label to the twelve nearest pixels below it, provided that they are sufficiently similar, see figure 7.6 (left). For a pixel on the scale below, this means that we should compare it to three pixels above it, see figure 7.6 (right). If several pixels on the scale above have a property distance below d_max, we propagate the label of the one with the smallest distance. The algorithm for label image generation is summarised like this:

step 1 Generate an empty label image at the top scale.
step 2 Assign new labels to all unassigned pixels with r = 1.
step 3 Move to the next scale.
step 4 Compare each pixel with r = 1 to the three nearest pixels on the scale above. Propagate the label of the pixel with the smallest property distance, if this distance is smaller than d_max.
step 5 If we are at the finest scale we are done, otherwise go back to step 2.

Figure 7.6: Label propagation. Left: a pixel can propagate its label to twelve pixels at the scale below. Right: this can be implemented by comparing each pixel with three pixels on the scale above.

The result of this algorithm is shown in figure 7.7. As can be seen, this is an oversegmentation, i.e. image patches which are homogeneous have been split into several regions.

Figure 7.7: Result of label image generation. Left to right: input image, label image, blobs from regions.
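One top-down step of the label propagation above can be sketched as follows. This is a simplified illustration under our own conventions: measurements are scalars, label 0 means "no label" (r = 0 at the parent scale), and each child pixel is compared to its nearest parent and the parents one step away along each axis, an approximation of the three-nearest-parents rule of the thesis.

```python
import numpy as np

def propagate_labels(parent_labels, parent_p, child_p, child_r, d_max):
    """Sketch of step 4 (and step 2) of the label image generation:
    each child pixel inherits the label of the most similar parent
    within d_max, or seeds a new region otherwise."""
    H, W = child_r.shape
    labels = np.zeros((H, W), dtype=int)
    next_label = int(parent_labels.max()) + 1
    for i in range(H):
        for j in range(W):
            if child_r[i, j] == 0:
                continue                      # no dominant measurement here
            best, best_d = 0, d_max
            pi, pj = i // 2, j // 2           # nearest parent position
            for di, dj in [(0, 0), (0, 1), (1, 0), (0, -1), (-1, 0)]:
                qi, qj = pi + di, pj + dj
                if 0 <= qi < parent_labels.shape[0] and 0 <= qj < parent_labels.shape[1]:
                    if parent_labels[qi, qj] == 0:
                        continue              # unlabelled parent
                    d = abs(parent_p[qi, qj] - child_p[i, j])
                    if d < best_d:
                        best, best_d = int(parent_labels[qi, qj]), d
            if best == 0:                     # no similar parent: new seed
                best = next_label
                next_label += 1
            labels[i, j] = best
    return labels
```

Running this from the top scale down to the finest scale yields the (over-segmented) label image discussed above.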
7.3.1 Ellipse approximation

The label image l(x): Z² → N is a compact representation of a set of binary masks

v_n(x) = 1   if l(x) = n,   and 0 otherwise.   (7.6)

The raw moments of such a binary mask v_n(x): Z² → {0, 1} are defined by the weighted sum

µ_kl = Σ_{x_1} Σ_{x_2} x_2^k x_1^l v_n(x) .   (7.7)

For a more extensive discussion of moments of binary masks, see e.g. [92]. For all the regions {v_n}_1^N in the image we will now compute the raw moments of order 0 to 2, i.e. µ_00, µ_01, µ_10, µ_02, µ_11, and µ_20. This can be done using only one for-loop over the label image. The raw moments are then converted to measures of the area a_n, the centroid vector m_n, and the inertia matrix I_n according to

a_n = µ_00 ,   m_n = (1/µ_00) (µ_01  µ_10)^T   and   I_n = (1/µ_00) [µ_02 µ_11; µ_11 µ_20] − m_n m_n^T .   (7.8)

Using the input image p(x, y) we also compute and store the average measurements for all regions,

p_n = (1/µ_00) Σ_{x_1} Σ_{x_2} p(x_1, x_2) v_n(x_1, x_2) .   (7.9)

If a region has the shape of an ellipse, its shape can be retrieved from the inertia matrix, see theorem C.1 in the appendix. Even if the region is not a perfect ellipse, the ellipse corresponding to the inertia matrix is a convenient approximation of the region shape. From the eigenvalue decomposition I = λ_1 ê_1 ê_1^T + λ_2 ê_2 ê_2^T with λ_1 ≥ λ_2 we can find the axes of the ellipse as 2√λ_1 ê_1 and 2√λ_2 ê_2 respectively, see theorem C.2 in the appendix. Since I = I^T, each blob can be represented by 1 + 2 + 3 + N parameters, where N is the dimensionality of the processed vector field. I.e. we have 7 parameters for grey-scale images, 9 for RGB images etc. A visualisation of the regions as ellipses is shown in figure 7.7 (right).

7.3.2 Blob merging

As stated before, the result of the label image generation is an oversegmentation, and thus the final stage of the algorithm is a merging of adjacent blobs. Due to the linearity of the raw moments (7.7), the moments of a combined mask can be computed from the moments of its parts.
I.e. if we have v = v_1 + v_2 we get µ_ij(v) = µ_ij(v_1) + µ_ij(v_2). For the corresponding property vectors we get p(v) = (µ_00(v_1) p(v_1) + µ_00(v_2) p(v_2))/µ_00(v). Candidates for merging are selected based on a count of pixels along the border between two regions with similar colours. We define an adjacency matrix M, with M_ij signifying the number of pixels along the common border of blobs i and j. The adjacency matrix is computed by modifying the region propagation algorithm in section 7.3, such that step 4 at the finest scale computes the count. Whenever two pixels (with different labels) at the coarser scale match a pixel at the finer scale, we are at a region boundary, and thus add 1 to the corresponding position in the adjacency matrix. Since M is symmetric, only the upper triangle needs to be stored. Using M we can now select candidates for merging according to

M_ij > m_thr √(min(µ_00(v_i), µ_00(v_j)))   (7.10)

where m_thr is a tuning threshold. Typically we use m_thr = 0.5. This choice results in a lot of mergers, but thin strands of pixels are typically not allowed to join two blobs into one, see figure 7.2. The square root in (7.10) is motivated by M_ij being a length, and µ_00 being an area.

All merging candidates are now placed in a list. They are successively merged pairwise, starting with the most similar pair according to the property distance d = ‖p_i − p_j‖. After merging, the affected property distances are recomputed, and a new smallest-distance pair is found. The pairwise merging is repeated until none of the candidate mergers have a property distance below d_max. This clustering scheme falls into the category of agglomerative clustering [63]. Finally we remove blobs with areas (i.e. µ_00) below a threshold a_min = 20.

Figure 7.8: Result of blob merging. Left to right: blobs from regions (384 blobs), merged blobs (92 blobs, m_thr = 0.5), reprojection of blob colours onto label image regions.
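The moment computations (7.7)–(7.8), the ellipse axes from theorem C.2, and the moment additivity used in the merging can be sketched as follows; function names and the dictionary layout are our own.

```python
import numpy as np

def raw_moments(mask):
    """Raw moments of order 0..2 of a binary mask, following the
    convention of (7.7): mu_kl = sum over x of x_2^k x_1^l v(x),
    with x_1 as row index and x_2 as column index."""
    x1, x2 = np.nonzero(mask)
    x1 = x1.astype(float); x2 = x2.astype(float)
    return {"00": float(x1.size), "01": x1.sum(), "10": x2.sum(),
            "02": (x1 ** 2).sum(), "11": (x1 * x2).sum(), "20": (x2 ** 2).sum()}

def blob_shape(mu):
    """Area, centroid, and inertia matrix according to (7.8)."""
    a = mu["00"]
    m = np.array([mu["01"], mu["10"]]) / a
    I = np.array([[mu["02"], mu["11"]],
                  [mu["11"], mu["20"]]]) / a - np.outer(m, m)
    return a, m, I

def ellipse_axes(I):
    """Ellipse axis lengths 2*sqrt(lambda_i), largest first."""
    lam = np.linalg.eigvalsh(I)          # eigenvalues in ascending order
    return 2 * np.sqrt(lam[::-1])

def merge_moments(mu1, mu2):
    """Moment additivity for v = v1 + v2 (section 7.3.2)."""
    return {k: mu1[k] + mu2[k] for k in mu1}
```

For a 4 × 2 rectangular mask this yields area 8, the expected centroid, and axis lengths 2√1.25 and 1, and merging the moments of its two halves reproduces the moments of the whole mask.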
Figure 7.8 shows blob representations of the image in figure 7.7 (left), before and after merging. In order to show the quality of the segmentation, the blob colours have been used to paint the corresponding regions in the label image, see figure 7.8 (right). In this example, the clustering pyramid is created using K = 26 channels that are spaced according to (6.1). The blob representation initially contains 384 blobs, which after merging and removal of small blobs drops to 92. This gives a total of 644 parameters for the entire image. Compared to the 348 × 287 input image pixels, this is a data reduction by a factor of 155.

7.4 Blob features for wide baseline matching

The usefulness of an image feature depends on the application, and should thus be evaluated in a proper context. We intend to use the blob features for view-based object recognition, wide baseline matching, and aerial navigation. All of these topics are however outside the scope of this thesis, and we will settle for a very simple demonstration. Using pairs of images captured from a helicopter, we detect blobs, and store their centroids in two lists {m_k}_1^K and {n_l}_1^L. We then find pairs of points that correspond given a homographic model

h (n̂_k^T  1)^T = H (m_k^T  1)^T   and   h (m̂_l^T  1)^T = H^{-1} (n_l^T  1)^T .   (7.11)

Two points are said to correspond when the residual

δ_kl = √((n̂_k − n_l)^T (n̂_k − n_l)) + √((m̂_l − m_k)^T (m̂_l − m_k))   (7.12)

is below a given threshold.

Figure 7.9: Correspondences for blobs from a video sequence. Each row shows a pair of matched images.

The correspondence is found using a RANSAC-like method [109]. We start by selecting 4 random points in image 1. For each of these we select a random point in the other image among those with a property distance d_kl = ‖p_{m,k} − p_{n,l}‖ below d_max.
For each such random correspondence, we compute the pairwise geometric residuals (7.12) for all point pairs, and keep the pairs with distances below a threshold δ_max. We stop once we get more than 15 matches. The found matches are then used to estimate a new homography using scaled total least squares (STLS), with the scaling described in [57]. We then find correspondences given the new homography and recompute the residuals. This procedure is iterated until convergence, which is usually reached after three to four cycles. Figure 7.9 shows the found correspondences for three image pairs from a video sequence. The number of blobs N_b, the number of correspondences N_m, and the inlier fraction ε = N_m/N_b are listed in the table below.

Frame    1      2      3      4      5      6
N_b      142    163    151    139    161    176
N_m      61     61     76     76     70     70
ε        0.43   0.37   0.50   0.55   0.43   0.40

7.4.1 Performance

A C implementation of the blob feature extraction algorithm takes about 4 seconds to process a 360 × 288 RGB image on a Sun Ultra 60 (296 MHz). Moving the implementation to an Intel Pentium 4 (1.9 GHz) platform resulted in computation times below one second per frame.

7.4.2 Removal of cropped blobs

Since an image only depicts a window of a scene, some of the homogeneous regions in the scene will only partially be contained in the image. Such cropped regions will give rise to blobs which change shape as the camera moves, in a manner which does not correspond to the camera movement. Thus, if the blobs are to be used for image matching, we will probably gain stability in the matching by removing cropped blobs. Most such blobs can be removed by calculating the bounding box of the ellipse corresponding to the blob shape, and removing those blobs which have their bounding boxes partially outside the image. In appendix C, theorem C.3, the outline of an ellipse is shown to be given by a parameter curve.
To find a bounding box for an ellipse, we rewrite this curve into two parameter curves

x = (r_11 r_12; r_21 r_22) (a cos t; b sin t) + m = (a r_11 cos t + b r_12 sin t; a r_21 cos t + b r_22 sin t) + m   (7.13)
  = (√(a² r_11² + b² r_12²) sin(t + φ_1); √(a² r_21² + b² r_22²) sin(t + φ_2)) + m .   (7.14)

Since sin(t + φ) assumes all values in the range [−1, 1] during one period, the bounding box is given by the amplitudes

A_1 = √(a² r_11² + b² r_12²)   and   A_2 = √(a² r_21² + b² r_22²) .   (7.15)

The bounding box becomes

x_1 ∈ [m_1 − A_1, m_1 + A_1]   and   x_2 ∈ [m_2 − A_2, m_2 + A_2] .   (7.16)

7.4.3 Choice of parameters

The blob feature extraction algorithm has three parameters: the spatial kernel width, the area threshold r_min, and the range kernel width d_max. We will now give some suggestions on how they should be set.

1. Spatial kernel width. A larger spatial kernel will make the method less sensitive to translations of the grid. At the same time, however, it will also reduce the amount of detail obtained. A trade-off that seems to work well for a wide range of images is the choice of a 4 × 4 neighbourhood. To speed up the algorithm, we can even consider using the 12-pixel neighbourhood consisting of the central pixels in the 4 × 4 region, and omit the spatial weights.

2. Area threshold r_min. Characteristic for this parameter is that low values give more mergers of non-connected but adjacent regions, which results in fewer features. High values will lead to fewer mergers, but also to less information being propagated to the higher levels of the pyramid. Thus we obtain more regions at the lowest level, and consequently higher computation times. Typically we will use the intermediate value r_min = 0.5.

3. Range kernel width d_max. This parameter decides whether two colours should be considered the same or not. A small value will give lots of blobs, while a large value gives few blobs.
A suitable choice of d_max value depends on the input images. Typically we will thus mainly modify the d_max parameter and let the others have the fixed values suggested above.

7.5 Clustering of planar slopes

The implicit assumption in the previous clustering methods has been that the image consists of piecewise constant patches. We could instead assume a constant slope. This is reasonable e.g. in depth maps from indoor scenes and of various man-made objects. We now assume a scalar input image, i.e. f: Z² → R, and a local image model of the type

f(x) = (1  x_1 − m_1  x_2 − m_2) p(x) .   (7.17)

Here m = (m_1  m_2)^T is the centre spatial position of the local model. The parameter vector in each pixel can be interpreted as p(x) = (mean, slope_x, slope_y)^T. When building the first level of the clustering pyramid we thus have to go from the image f to a parametric representation p. In principle this could be done by applying plain derivative filters. This would however distort the measurements at the edges of each planar patch, and thus make the size of the detected regions smaller, as well as limiting the sizes of objects that can be detected. Instead we will estimate the parametric representation using a robust linear model

arg min_{p*} Σ_k w_k r_k ρ(|f_k − (1  x_{1,k} − m_1  x_{2,k} − m_2) p*|)   (7.18)

where w_k are weights from a binomial kernel, typically in a 4 × 4 neighbourhood. Like in the colour clustering method, see section 7.2, we solve (7.18) by SOR followed by a few M-estimation steps. For the 4 × 4 region, we define the following quantities

M = [1 1 1 1; 1 1 1 1; 1 1 1 1; 1 1 1 1] ,   X = [1 2 3 4; 1 2 3 4; 1 2 3 4; 1 2 3 4] − 2.5   (7.19)

Y = X^T ,   B = (vec(M)  vec(X)  vec(Y)) .   (7.20)

We also define a weight matrix W with diagonal elements (W)_kk = o_k w_k r_k. Here o_k are outlier rejection weights, w_k are spatial binomial weights, and r_k are the input confidences.
For the 4 × 4 region f, the iterations of the model parameter estimation are now performed as

p*_est = (WB)† W vec(f) .   (7.21)

Since we have just 3 parameters to estimate, we could instead use a (slightly less reliable) matrix inversion

p*_est = (B^T W W B)^{-1} B^T W W vec(f) .   (7.22)

This is faster since the matrix inverse can be computed in a non-iterative way. At the start all outlier rejection weights are set to 1. After each iteration we find the pixel with the largest residual

d_k = |f_k − (1  x_{1,k} − m_1  x_{2,k} − m_2) p*_est| .   (7.23)

If d_k > d_max, we remove this pixel by setting o_k = 0. By iterating until convergence we will find a fixed point of (7.18), for the cut-off squares error norm. Furthermore, if more than half the data supports the solution, we are guaranteed to be close to the global minimum. Again, we polish the result with a few M-estimation steps using a smoother kernel. The corresponding IRLS iteration will have the same form as the iteration above, i.e. (7.21) or (7.22), with the exception that the outlier rejection weights are replaced by o_k = ρ′(d_k)/d_k. To obtain p* and r* for the first scale in the pyramid, we thus compute p* according to the above equations, and r* according to (7.5). We typically set the area threshold to r_min = 0.85, which is considerably higher than in the colour method. Pixels on the boundaries between two different slopes will typically have a parameter estimate that does not correspond to any of the slopes. In order to get zero confidence for such pixels, we could either reduce the d_max threshold, or increase the r_min parameter. It turns out that the latter option is preferable, since reducing d_max will also cause grey-level quantisation in the input to propagate to the slope parameters. The result of the parameter estimation step for a simple test image is shown in figure 7.10.

Figure 7.10: Result of parameter estimation.
Left to right: Input, mean estimate, x-slope estimate, y-slope estimate.

7.5.1 Subsequent pyramid levels

After we have obtained the parametric representation, we can generate the other levels of the pyramid using almost the same technique as in the colour method. The main difference is that we have to take the centre of the local neighbourhood into account, and adjust the local mean level accordingly. That is, we now have a robust estimation problem of the form

  arg min_{p*} Σ_k wk rk ρ(||p̃k − p*||)        (7.24)

where p̃k is a vector with the mean level adjusted

  p̃k = (p1,k − ((x1,k − m1) p2,k + (x2,k − m2) p3,k) s,  p2,k,  p3,k)^T .        (7.25)

Here s is a factor that compensates for the fact that the pixel distance at scale 2 is twice that at scale 1, and so on. In other words, s = 2^{scale−1}. The residuals for SOR are now computed as

  dk = sqrt((p̃k − p*_est)^T W (p̃k − p*_est))        (7.26)

where W is a weight matrix of the form diag(W) = (1  wd  wd). The parameter wd allows us to adjust the relative importance of errors in mean and errors in slope. Typically we set wd = 200.

7.5.2 Computing the slope inside a binary mask

To estimate the mean and slopes inside each mask vn, we now assume a local signal model centred around the point m = (m1 m2)^T

  f(x) = (1  x1 − m1  x2 − m2) p = p1 + p2 x1 − p2 m1 + p3 x2 − p3 m2 .        (7.27)

We define the moments ηkl of f(x) inside the mask vn as

  ηkl = (1/N) Σ_{x1,x2} vn(x1, x2) f(x1, x2) x1^k x2^l        (7.28)

where N = µ00 is the number of elements inside the mask vn. For the model (7.27) we now get

  η00 = p1 + p2 (1/N) Σ_{x1,x2} vn(x1, x2) x1 − p2 m1 + p3 (1/N) Σ_{x1,x2} vn(x1, x2) x2 − p3 m2        (7.29)
      = p1 + p2 m1 − p2 m1 + p3 m2 − p3 m2 = p1        (7.30)

as expected. For the first moments we obtain

  η10 = p1 m1 + p2 ((1/N) Σ_{x1,x2} vn(x1, x2) x1² − m1²) + p3 ((1/N) Σ_{x1,x2} vn(x1, x2) x1 x2 − m1 m2)        (7.31)
      = p1 m1 + p2 I11 + p3 I12        (7.32)

and

  η01 = p1 m2 + p2 ((1/N) Σ_{x1,x2} vn(x1, x2) x2 x1 − m2 m1) + p3 ((1/N) Σ_{x1,x2} vn(x1, x2) x2² − m2²)        (7.33)
      = p1 m2 + p2 I21 + p3 I22 .        (7.34)

This can be summarised as the system

  (η10  η01)^T = p1 m + I (p2  p3)^T        (7.35)

where m and I are obtained according to (7.7) and (7.8). We thus first compute m and I. We then compute p1 from η00, see (7.30), and finally p2 and p3 as

  (p2  p3)^T = I^{−1} (η10 − p1 m1  η01 − p1 m2)^T .        (7.36)

7.5.3 Regions from constant slope model

The results of blob feature extraction on the test image in figure 7.10 are shown in figure 7.11. As can be seen in this figure, most of the regions have successfully been detected. The result is also compared with the output from clustering using the piecewise constant assumption. We stress that these are just first results. Some of the regions obtained before merging (see figure 7.11, left) have a striped structure, in contrast to the case in the locally constant clustering (see figure 7.7, left). This suggests that the clustering strategy might not be optimal. It is well known that clustering of lines using the model

  (x  y  1)(cos φ  sin φ  −ρ)^T = 0        (7.37)

is preferable to using the model

  y = kx + m        (7.38)

since estimation of very steep slopes (large k) becomes unstable. See section 5.6.3 for an example of use of (7.37). This preference for a normal representation of lines suggests that we should view the grey-level image as a surface, and cluster surface normals instead.

The algorithm presented here is intended as a post-processing step for a depth-from-stereo vision algorithm. In depth maps of indoor scenes, a region with constant slope could correspond to a floor, a wall, a door etc. A compact description of such features will hopefully be useful for autonomous indoor navigation.

Figure 7.11: Blobs from piecewise linear assumption. Left to right: label image, detected blobs, reprojection of blob slopes to the corresponding label image regions, and blobs from piecewise constant assumption (dmax = 0.16).

7.6 Concluding Remarks

In this chapter we have introduced a blob feature detection algorithm that works on vector fields of arbitrary dimension. The usefulness of the algorithm has been demonstrated on a wide baseline matching task. We have also extended the blob feature detection to cluster constant slopes instead of locally constant colour. The slope clustering has however not been fully evaluated, and some design choices, such as the chosen representation of the slopes, might not be optimal. Specifically, the option to cluster surface normals instead should be tested.

Chapter 8  Lines and Edges in Scale-Space

In this chapter we develop a representation of lines and edges in a scale hierarchy. By using three separate maps, the identities of lines and edges are kept separate. Further, the maps are made sparse by inhibition from coarser scales.

8.1 Background

Biological vision systems are capable of instance recognition in a manner that is vastly superior to current machine vision systems. Perceptual experiments [83, 12] are consistent with the idea that they accomplish this feat by remembering a sparse set of features for a few views of each object, and are able to interpolate between these (see discussion in chapter 2). What features biological systems use is currently not certain, but we have a few clues. It is widely known that difference-of-Gaussians and Gabor-type wavelets are useful models of the first two levels of processing in biological vision systems [5]. There is however no general agreement on how to proceed from these simple descriptors toward more descriptive and more sparse features. One way might be to detect various kinds of image symmetries such as circles, star-patterns, and divergences (such as corners), as was done in [65, 64].
Two very simple kinds of symmetries are lines and edges¹, and in this chapter we will see how extraction of lines and edges can be made more selective, in a manner that is locally continuous both in scale and spatially. An important difference between our approach and other line-and-edge representations is that we keep different kinds of events separate, instead of combining them into one compact feature map.

¹ To be strict, an edge is better described as an anti-symmetry.

8.1.1 Classical edge detection

Depth discontinuities in a scene often lead to intensity discontinuities in images of that scene. Thus, discontinuities such as lines and edges in an image tend to correspond to object boundaries. This fact has been known and used for a long time in image processing. One early example that is still in widespread use is the Sobel edge filter [91]. Another common example is the Canny edge detector [14], which produces visually pleasing binary images. The goal of edge detection algorithms in image processing is often to obtain useful input to segmentation algorithms [92], and for this purpose the ideal step edge detection that the Canny edge detector performs is in general insufficient [85], since a step edge is just one of the events that can divide the areas of a physical scene. Since our goal is quite different (we want a sparse scene description that can be used in view-based object recognition), we will discuss conventional edge detection no further.

8.1.2 Phase-gating

Lines and edges correspond to specific local phases of the image signal. Line and edge features are thus related to the local phase feature. Our use of local phase originates from the idea of phase-gating, originally mentioned in a thesis by Haglund [55]. Phase-gating is a postulate which states that an estimate from an arbitrary operator is valid only in particular places, where the relevance of the estimate is high [55].
Haglund uses this idea to obtain an estimate of size, by only using the even quadrature component when estimating frequency, i.e. he only propagates frequency estimates near 0 and π phase.

8.1.3 Phase congruency

Mach bands are illusory peaks and valleys in illumination that humans, and other biological vision systems, perceive near certain intensity profiles, such as ramp edges (see figure 8.1). Morrone et al. have observed that these illusory lines, as well as perception of actual lines and edges, occur at positions where the sum of the Fourier components above a given threshold has a corresponding peak [78]. They also note that the sum of the squared outputs of even and odd symmetric filters always peaks at these positions, which they refer to as points of phase congruency. This observation has led to the invention of phase congruency feature detectors [68]. At points of phase congruency, the phase is spatially stable over scale. This is a desirable property for a robust feature. However, phase congruency does not tell us which kind of feature we have detected; is it a line, or an edge? For this reason, phase congruency detection has been augmented by Reisfeld to allow discrimination between line and edge events [86]. Reisfeld has devised what he calls a Constrained Phase Congruency Detector (CPCT for short), which maps a pixel position and an orientation to an energy value, a scale, and a symmetry phase (0, ±π/2 or π). This approach is however not quite suitable for us, since the map produced is of a semi-discrete nature; each pixel is either of 0, ±π/2 or π phase, and only belongs to the scale where the energy is maximal. The features we want should on the contrary allow a slight overlap in scale space, and have locally continuous magnitudes.

Figure 8.1: Mach bands near a ramp edge. Top-left: image intensity profile. Bottom-left: perceived image intensity. Right: image responses in a small spatial range near the characteristic phases.
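The phase congruency observation can be illustrated with a short sketch (illustrative only, not from the thesis; it borrows the lognorm band-pass filter that is introduced later in this chapter): the summed squared output of an even/odd quadrature pair peaks at a step edge.

```python
# Local energy qe^2 + qo^2 of a quadrature pair peaks at a step edge,
# the behaviour exploited by phase congruency detectors.
import numpy as np

N = 128
s = np.zeros(N)
s[N // 2:] = 1.0                        # step edge between samples 63 and 64

u = 2 * np.pi * np.fft.fftfreq(N)       # angular frequencies
R = np.zeros(N)
pos = u > 0                             # one-sided spectrum -> quadrature pair
R[pos] = np.exp(-np.log(u[pos] / (np.pi / 4)) ** 2 / np.log(2))

q = np.fft.ifft(np.fft.fft(s) * R)      # complex quadrature response
energy = np.abs(q) ** 2                 # qe^2 + qo^2
```

Since the signal is periodic under the DFT there is a second (wrap-around) edge at the array boundary; away from it, the energy is maximal right at the step.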
8.2 Sparse feature maps in a scale hierarchy

Most feature generation procedures employ filtering in some form. The outputs from these filters tell quantitatively more about the filters used than about the structures they were meant to detect. We can get rid of this excessive load of data by allowing only certain phases of the filter outputs to propagate further. These characteristic phases have the property that they give invariant structural information, rather than all the phase components of a filter response.

We will now generate feature maps that describe image structure at a specific scale, and at a specific phase. The distance between the different scales is one octave (i.e. each map has half the centre frequency of the previous one). The phases we detect are those near the characteristic phases 0, π, and ±π/2. Thus, for each scale, we will have three resultant feature maps (see figure 8.2).

Figure 8.2: Scale hierarchies. An image scale pyramid with 0 phase, π phase, and π/2 phase feature maps at each scale.

This approach touches the field of scale-space analysis pioneered by Witkin [106]. See [72] for a recent overview of scale-space methods. Our approach to scale-space analysis is somewhat similar to that of Reisfeld [86]. Reisfeld has defined what he calls a Constrained Phase Congruency Transform (CPCT), which maps a pixel position and an orientation to an energy value, a scale, and a symmetry phase (0, π, ±π/2, or none). We will instead map each image position, at a given scale, to three complex numbers, one for each of the characteristic phases.
The argument of the complex numbers indicates the dominant orientation of the local image region at the given scale, and the magnitude indicates the local signal energy when the phase is near the desired one. As we move away from the characteristic phase, the magnitude will go to zero. This representation results in a number of complex valued images that are quite sparse, and thus suitable for pattern detection.

8.2.1 Phase from line and edge filters

For signals containing multiple frequencies, the phase is ambiguous, but we can always define the local phase of a signal as the phase of the signal in a narrow frequency range. The local phase can be computed from the ratio between a band-pass filter (even, denoted fe) and its quadrature complement (odd, denoted fo). These two filters are usually combined into a complex valued quadrature filter, f = fe + ifo [48]. The real and imaginary parts of a quadrature filter correspond to line and edge detecting filters respectively. The local phase can now be computed as the argument of the filter response, q(x) = (s ∗ f)(x), or, if we use the two real-valued filters separately, as the four-quadrant inverse tangent arctan(qo(x), qe(x)).

To construct the quadrature pair, we start with a discretised lognormal filter function, defined in the frequency domain

  Ri(ρ) = exp(−ln²(ρ/ρi)/ln 2)  if ρ > 0,  and 0 otherwise.        (8.1)

The parameter ρi determines the peak of the lognorm function, and is called the centre frequency of the filter. We now construct the even and odd filters as the real and imaginary parts of an inverse discrete Fourier transform of this filter²

  fe,i(x) = Re(IDFT{Ri(ρ)})        (8.2)
  fo,i(x) = Im(IDFT{Ri(ρ)}) .        (8.3)

We write a filtering of a sampled signal, s(x), with a discrete filter fk(x) as qk(x) = (s ∗ fk)(x), giving the response signal the same indices as the filter that produced it.
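The construction (8.1)-(8.3) can be sketched in a few lines of NumPy (the frequency sampling and the test signal are illustrative choices, not from the thesis):

```python
import numpy as np

N = 64
u = 2 * np.pi * np.fft.fftfreq(N)   # angular frequencies
rho_i = np.pi / 4                   # centre frequency

# Lognorm function on the positive frequencies only, eq. (8.1); the
# one-sided spectrum makes the inverse transform a quadrature pair.
R = np.zeros(N)
pos = u > 0
R[pos] = np.exp(-np.log(u[pos] / rho_i) ** 2 / np.log(2))

f = np.fft.ifft(R)               # complex quadrature filter f = fe + i*fo
fe, fo = f.real, f.imag          # line (even) and edge (odd) filters, (8.2)-(8.3)

# Local phase of a signal s: argument of the quadrature response
# (here the filtering is done in the frequency domain for brevity).
s = np.cos(rho_i * np.arange(N))
q = np.fft.ifft(np.fft.fft(s) * R)
phase = np.angle(q)
```

For the cosine test signal at the centre frequency, the response magnitude is constant and the phase is 0 at the signal peaks, as expected for a line-like event.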
8.2.2 Characteristic phase

By characteristic phase we mean phases that are consistent over a range of scales, and thus characterise the local image region. For natural images this mainly happens at local magnitude peaks of the responses from the even and odd filters.³ In other words, the characteristic phases are almost always one of 0, π, and ±π/2. This motivates our restriction of the phase to these three cases.

² Note that there are other ways to obtain spatial filters from frequency descriptions, which in many ways produce better filters [67].
³ A peak in the even response will always correspond to a zero crossing in the odd response, and vice versa, due to the quadrature constraint.

Figure 8.3: Line and edge filter responses in 1D. Top: a one-dimensional signal. Centre: line responses at ρi = π/2 (solid), and π/4 and π/8 (dashed). Bottom: edge responses at ρi = π/2 (solid), and π/4 and π/8 (dashed).

Only some occurrences of these phases are consistent over scale though (see figure 8.3). First, we can note that band-pass filtering always causes ringings in the response. For isolated line and edge events this means one extra magnitude peak (with the opposite sign) on each side of the peak corresponding to the event. These extra peaks will move when we change frequency bands, in contrast to the peaks that correspond to the line and edge features. Second, we can note that each line event will produce one magnitude peak in the line response, and two peaks in the edge response. The peaks in the edge response, however, will also move when we change frequency bands. We can thus use stability over scale as a criterion to sort out the desired peaks.

8.2.3 Extracting characteristic phase in 1D

Starting from the line and edge filter responses at scale i, qe,i and qo,i, we now define three phase channels

  p1,i = max(0, qe,i)        (8.4)
  p2,i = max(0, −qe,i)        (8.5)
  p3,i = abs(qo,i) .        (8.6)

That is, we let p1,i constitute the positive part of the line filter response, corresponding to 0 phase, p2,i the negative part, corresponding to π phase, and p3,i the magnitude of the edge filter response, corresponding to ±π/2 phase.

Phase invariance over scale can be expressed by requiring that the phase at the next lower octave has the same sign

  p1,i = max(0, qe,i · qe,i−1/ai−1) · max(0, sign(qe,i))        (8.7)
  p2,i = max(0, qe,i · qe,i−1/ai−1) · max(0, sign(−qe,i))        (8.8)
  p3,i = max(0, qo,i · qo,i−1/ai−1) .        (8.9)

The first max operation in the equations above sets the magnitude to zero whenever the filter at the next scale has a different sign. This operation reduces the effect of the ringings from the filters. In order to keep the magnitude near the characteristic phases proportional to the local signal energy, we have normalised the product with the signal energy at the next lower octave, ai−1 = sqrt(qe,i−1² + qo,i−1²). The result of the operation in (8.7)-(8.9) can be viewed as a phase description at a scale in between the two used. These channels are compared with the original ones in figure 8.4.

Figure 8.4: Consistent phase in 1D. (ρi = π/4) p1,i, p2,i, p3,i according to (8.4)-(8.6) (dashed), and (8.7)-(8.9) (solid).

We will now further constrain the phase channels in such a way that only responses consistent over scale are kept. We do this by inhibiting the phase channels with the complementary response in the third lower octave

  c1,i = max(0, p1,i − α abs(qo,i−2))        (8.10)
  c2,i = max(0, p2,i − α abs(qo,i−2))        (8.11)
  c3,i = max(0, p3,i − α abs(qe,i−2)) .        (8.12)

We have chosen an amount of inhibition α = 2, and the base scale ρi = π/4.
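The channel computations (8.4)-(8.12) can be sketched as follows. The dictionary interface (scale index mapping to a response array) is an assumption for illustration; only the arithmetic follows the equations above.

```python
# Sketch of the 1D characteristic phase channels, eqs (8.7)-(8.12).
import numpy as np

def phase_channels(qe, qo, alpha=2.0):
    """qe, qo: dicts mapping scale index -> even/odd response arrays.
    Returns the inhibited channels c1, c2, c3 at the finest scale i,
    using the responses at scales i-1 and i-2."""
    i = max(qe)
    a = np.sqrt(qe[i - 1] ** 2 + qo[i - 1] ** 2)   # energy one octave below
    a = np.maximum(a, 1e-12)                       # avoid division by zero
    # max(0, sign(+/-qe)) is realised as the boolean masks (qe > 0), (qe < 0)
    p1 = np.maximum(0, qe[i] * qe[i - 1] / a) * (qe[i] > 0)    # (8.7)
    p2 = np.maximum(0, qe[i] * qe[i - 1] / a) * (qe[i] < 0)    # (8.8)
    p3 = np.maximum(0, qo[i] * qo[i - 1] / a)                  # (8.9)
    c1 = np.maximum(0, p1 - alpha * np.abs(qo[i - 2]))         # (8.10)
    c2 = np.maximum(0, p2 - alpha * np.abs(qo[i - 2]))         # (8.11)
    c3 = np.maximum(0, p3 - alpha * np.abs(qe[i - 2]))         # (8.12)
    return c1, c2, c3
```

A line event whose even response keeps its sign across scales survives, while an edge response that flips sign between octaves is suppressed.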
With this value we successfully remove the edge responses at the line event, and at the same time keep the rate of change in the resultant signal below the Nyquist frequency. The resultant characteristic phase channels will have a magnitude corresponding to the energy at scale i, near the corresponding phase. These channels are compared with the original ones in figure 8.5. As we can see, this operation manages to produce channels that indicate lines and edges without any unwanted extra responses.

An important aspect of this operation is that it results in a gradual transition between the description of a signal as a line or an edge. If we continuously increase the thickness of a line, it will gradually turn into a bar that will be represented as two edges.⁴ This phenomenon is illustrated in figure 8.6.

⁴ Note that the fact that both the line and the edge statements are low near the fourth event (positions 105 to 125) does not mean that this event will be lost. The final representation will also include other scales of filters, which will describe these events better.

Figure 8.5: Phase channels in 1D. (ρi = π/4, α = 2) p1,i, p2,i, p3,i according to (8.4)-(8.6) (dashed), and (8.10)-(8.12) (solid).

Figure 8.6: Transition between line and edge description. (ρi = π/4) Top: signal. Centre: c1,i phase channel. Bottom: c3,i phase channel.

8.2.4 Local orientation information

The filters we employ in 2D are the extension of the lognorm filter function (8.1) to 2D [48]

  Fki(u) = Ri(ρ) Dk(û)        (8.13)

where

  Dk(û) = (û · n̂k)²  if u · n̂k > 0,  and 0 otherwise.        (8.14)

We will use four filters, with directions n̂1 = (0  1)^T, n̂2 = (√0.5  √0.5)^T, n̂3 = (1  0)^T, and n̂4 = (√0.5  −√0.5)^T. These directions have angles that are uniformly distributed modulo π. Due to this, and the fact that the angular function decreases as cos²φ, the sum of the filter-response magnitudes will be orientation invariant [48]. Just like in the 1D case, we will perform the filtering in the spatial domain

  (fe,ki ∗ pki)(x) ≈ Re(IDFT{Fki(u)})        (8.15)
  (fo,ki ∗ pki)(x) ≈ Im(IDFT{Fki(u)}) .        (8.16)

Here we have used a filter optimisation technique [67] to factorise the lognorm quadrature filters into two approximately one-dimensional components. The filter pki(x) is a smoothing filter in a direction orthogonal to n̂k, while fe,ki(x) and fo,ki(x) constitute a 1D lognorm quadrature pair in the n̂k direction.

Using the responses from the four quadrature filters, we can construct a local orientation image. This is a complex valued image, in which the magnitude of each complex number indicates the signal energy when the neighbourhood is locally one-dimensional, and the argument denotes the local orientation, in the double angle representation [48]

  z(x) = Σk aki(x) (n̂k1 + i n̂k2)² = a1i(x) − a3i(x) + i(a2i(x) − a4i(x))        (8.17)

where aki(x), the signal energy, is defined as aki = sqrt(qe,ki² + qo,ki²).

8.2.5 Extracting characteristic phase in 2D

To illustrate characteristic phase in 2D, we need a new test pattern. We will use the 1D signal from figure 8.6, rotated around the origin (see figure 8.7).

Figure 8.7: A 2D test pattern.

When extracting characteristic phases in 2D, we make use of the same observation as the local orientation representation does: since visual stimuli can locally be approximated by a simple signal in the dominant orientation [48], we can define the local phase as the phase of the dominant signal component.
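The double-angle combination of eq. (8.17) can be sketched as follows (the stacked-array interface and the synthetic responses in the usage below are assumptions for illustration; the combination itself implements the stated final form a1 − a3 + i(a2 − a4)):

```python
# Sketch of the complex double-angle orientation image, eq. (8.17).
import numpy as np

def orientation_image(qe, qo):
    """qe, qo: arrays shaped (4, H, W) holding the even/odd responses for
    the four filter directions n1..n4. Returns the complex image z(x)."""
    a = np.sqrt(qe ** 2 + qo ** 2)              # energies a_ki, per direction
    return a[0] - a[2] + 1j * (a[1] - a[3])     # eq. (8.17)
```

A neighbourhood that only excites filter 1 yields a purely real z, while one that only excites filter 2 yields a purely imaginary z; the argument thus encodes orientation in the double angle representation.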
To deal with characteristic phases in the dominant signal direction, we first synthesise responses from a filter in a direction, n̂z, compatible with the local orientation⁵

  n̂z = (Re(√z)  Im(√z))^T .        (8.18)

The filters are weighted according to the value of the scalar product between the filter direction and this orientation compatible direction

  wk = n̂k^T n̂z .        (8.19)

Thus, in each scale we synthesise one even and one odd response projection as

  qe,i = Σk qe,i,k abs(wk)        (8.20)
  qo,i = Σk qo,i,k wk .        (8.21)

This changes the sign of the odd responses when the directions differ by more than π/2, but since the even filters are symmetric, they should always have a positive weight. In accordance with our findings in the 1D study, (8.7)-(8.9) and (8.10)-(8.12), we now compute three phase channels, c1,i, c2,i, and c3,i, in each scale.

⁵ Since the local orientation, z, is represented with a double angle argument, we could just as well have chosen the opposite direction. Which one of these we choose does not really matter, as long as we are consistent.

Figure 8.8: Characteristic phase channels in 2D. (ρi = π/4) Left to right: characteristic phase channels c1,i, c2,i, and c3,i, according to (8.10)-(8.12) (α = 2). The colours indicate the locally dominant orientation.

The characteristic phase channels are shown in figure 8.8.⁶ As we can see, the channels exhibit a smooth transition between describing the white regions in the test pattern (see figure 8.7) as lines, and as two edges. Also note that the phase statements actually give the phase in the dominant orientation, and not in the filter directions, as was the case for CPCT [86].

⁶ The magnitude of lines this thin can be difficult to reproduce in print. However, the magnitudes in this plot should vary in the same way as in figure 8.6.

8.2.6 Local orientation and characteristic phase

An orientation image can be gated with a phase channel, cn(x), in the following way

  zn(x) = 0  if cn(x) = 0,  and  cn(x) · z(x)/|z(x)|  otherwise.        (8.22)

We now do this for each of the characteristic phase statements c1,i(x), c2,i(x), and c3,i(x), in each scale. The result is shown in figure 8.9. The colours in the figure indicate the locally dominant orientation, just like in figure 8.8. Notice for instance how the bridge near the centre of the image changes from being described by two edges, to being described as a bright line, as we move through scale space.

Figure 8.9: Sparse feature hierarchy. (ρi = π/2, π/4, π/8, π/16)

8.3 Concluding remarks

The strategy of this approach to low-level representation is to provide sparse and reliable statements where possible, rather than to provide statements in all points. Traditionally, the trend has been to produce components that are as compact and descriptive as possible, mainly to reduce storage and computation. As the demands on performance increase, it is no longer clear why components signifying different phenomena should be mixed. An edge is something separating two regions with different properties, while a line is something entirely different. The use of sparse data representations in computation leads to only a mild increase in data volume for separate representations, compared to combined representations.

Although the representation is given at discrete scales, this can be viewed as a conventional sampling, albeit in scale space, which allows interpolation between the discrete scales, with the usual restrictions imposed by the sampling theorem. The requirement of a good interpolation between scales determines the optimal relative bandwidths of the filters to use.

Chapter 9  Associative Learning

This chapter introduces an associative network architecture using the channel representation.
We describe the representational properties of the networks, and illustrate their behaviour using a set of experiments. We will also relate the associative networks to Radial Basis Function (RBF) networks, Support Vector Machines (SVMs), and fuzzy control.

9.1 Architecture overview

In the proposed architecture, the choice of information representation is of fundamental importance. The architecture makes use of the channel information representation introduced in chapter 3. The channel representation implies a mapping of signals into a higher-dimensional space, in such a way that it introduces locality in the information representation with respect to all dimensions: geometric space as well as property space. The obtained locality gives two advantages:

• Nonlinear functions and combinations can be implemented using linear mappings.
• Optimisation in learning converges much faster.

Figure 9.1 gives an intuitive illustration of how signals are represented as local fragments, which can be freely assembled to form an output. The system is moving along a state space trajectory. The state vector x consists of both internal and external system parameters. The response space is typically a subset of those parameters, e.g. the orientation of an object, the position of a camera sensor in navigation, or the actions of a robot. Response channels and feature channels measure local aspects of the state space, and define response channel vectors u and feature vectors a respectively. The processing mode of the architecture is association, where the mapping of features ah onto desired responses uk is learned from a representative training set of observation pairs {an, un}, n = 1 . . . N, see figure 9.1(b). The feature vector a may contain some hundred thousand components, while the output vector u may contain some thousand components.

Figure 9.1: Architecture overview. (a) The system is moving along a state space trajectory. Response channels, uk, and feature channels, ah, measure different (local) aspects of the state vector x. (b) The response channels and the feature channels define localised functions along the trajectory. A certain response channel is associated with some of the feature channels with appropriate weights ckh. Figure borrowed from [51].

For most features of interest, only limited parts of the domain will have non-zero contributions. This provides the basis for a sparse representation, which gives improved efficiency in storage and better performance in processing. The model of the system is, in the standard version, a linear mapping from a feature vector a to a response vector u over a linkage matrix C,

  u = Ca .        (9.1)

In a training process, a set of N samples of output vectors u and corresponding feature vectors a is obtained. These form a response matrix U = (u1 . . . uN) and a feature matrix A = (a1 . . . aN). The training implies finding a solution matrix C to

  U = CA .        (9.2)

The linkage matrix is computed as a solution to a least squares problem with a monopolar constraint C ≥ 0. This constraint has a regularising effect, and in addition it gives a sparse linkage matrix. The monopolar representation, together with locality, allows a fast optimisation, as it allows parallel optimisation of a large number of loosely coupled system states. We will compare the standard version (9.1) to models where the mapping is made directly to the response subset of the state parameters, i.e. typically what would be used in regular kernel machines. In these cases we will use a modified model with various normalisations of a.

9.2 Representation of system output states

For a system acting in a continuous environment, we can define a state space X ⊂ R^M.
A state vector, x ∈ X, completely characterises the current situation for the system, and X is thus the set of all situations possible for the system. The state space has two parts, termed internal and external. Internal states describe the system itself, such as its position and its orientation. External states describe a subset of the total states of the environment which are to be incorporated in the system's knowledge, such as the position, orientation and size of a certain object. The estimation of external states requires a coupling to internal states, which can act as a known reference in the learning process.

In general it is desirable to estimate either a state, or a response that changes the state, i.e. a system behaviour. For simplicity we will in this chapter assume that the desired response variables are components of the state vector x. We assume that the system is somehow induced to change its state, such that it covers the state space of interest for the learning process. For an agent acting in the physical world, the system state change has to take place in a continuous way (due to the inertia caused by limited power for displacement of a certain mass, see [49] for a more extensive discussion). It is thus reasonable to view the set of states {xn}, n = 1 . . . N, used in the learning process as a system state trajectory. We can express this system state trajectory as a matrix X = (x1 x2 . . . xN).

9.2.1 Channel representation of the state space

The normal form of output for the structure is the channel representation. It is advantageous to represent the scalar state variables in a regular channel vector form, as this allows multiple outputs when the mapping from input to output is ambiguous, see section 3.2.1. The channel representation also forms the basis for learning of discontinuous phenomena, as will be demonstrated in section 9.6. A response channel vector um is a channel representation of one of the components xm of the state vector x, see (3.1).
The vector um is thus a non-ambiguous representation of position in a response state space Rm = {xm : x ∈ X}. With this definition, a response channel will be non-zero only in a very limited region of the state space. The value of a channel can be viewed as a confidence in the hypothesis that the current state is near a particular prototype state. When a specific channel is non-zero it is said to be active, and the subspace where a specific channel is active is called the active domain of that channel. As the active domain is always much smaller than the inactive domain, an inactive channel will convey almost no information about position in state space. The small active domain is also what makes the representation sparse.

The response channel vectors for the samples n = 1 . . . N can be put into response channel matrices Um = (um1 um2 . . . umN). All such response channel matrices are stacked row-wise to form the response channel matrix U. While U will have a much larger number of rows than the original state matrix X, due to the increase of dimensionality in the representation, the sparsity of the representation implies only a moderate increase in the amount of data (typically a factor 3).

9.3 Channel representation of input features

It is assumed that the system can obtain at least partial knowledge about its state from a set of observed feature variables, {ah}, forming a feature vector a = (a1 a2 . . . aH)^T. In order for an association or learning process to be meaningful, there has to be a sufficiently unique and repeatable correspondence between system states and observed features. One way to state this requirement is as follows: the sensor space, A, of states that the feature channels can represent should allow an unambiguous mapping f : A → R. The situation where this requirement is violated is known in learning and robotics as perceptual aliasing, see e.g. [17].
A generative model for {a^h}, useful for systems analysis, can be expressed as localised, non-negative kernel functions B^h(x). These are functions of a weighted distance between the state vector x and a set of prototype states x^h ∈ X. We exemplify this with the cos²-kernel,

a^h = B^h(x) = { cos²(d(x, x^h))   if d(x, x^h) ≤ π/2
               { 0                 otherwise.                    (9.3)

The used distance function is defined as

d(x, x^h) = √((x − x^h)^T M^h (x − x^h)).                        (9.4)

The matrix M^h is positive semidefinite, and describes the similarity measure for the distance function around state x^h, allowing a scaling with different sensitivities with respect to different state variables. Equation (9.3) indicates that a^h will have a maximal value of 1 when x = x^h. It will go monotonically to zero as the weighted distance increases to π/2. Normally, neither x^h nor M^h are explicitly known, but emerge implicitly from the properties of the set of sensors used in the actual case. These are generally different from one sensor or filter to another, which motivates the notion of channel representation, as each channel has its specific identity, the identification of which is part of the learning process. In general, there is no requirement for a regular arrangement of channels, be it on the input side or on the output side. The prescription of an orderly arrangement at the output comes from the need to interface the structure to the environment, e.g. to determine its performance. In such a case it will be desirable to map the response channel variables back into scalar variables in order to compare them with the reference, something which is greatly facilitated by a regular arrangement. Similarly to the state variables, we denote the observation at sample point n by a vector a_n = (a^1_n a^2_n . . . a^H_n)^T. These observation or feature vectors can be put into a feature matrix A = (a_1 a_2 . . . a_N).
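The kernel model (9.3)–(9.4) can be sketched in a few lines of code (a minimal illustration; the function and variable names are ours, not part of the thesis):

```python
import numpy as np

def cos2_kernel(x, x_h, M):
    """Sketch of the kernel B^h(x) in (9.3)-(9.4): cos^2 of a weighted
    distance between the state x and a prototype state x_h."""
    d = np.sqrt((x - x_h) @ M @ (x - x_h))   # weighted distance (9.4)
    return np.cos(d) ** 2 if d <= np.pi / 2 else 0.0

# A 2D state with an isotropic metric: the response peaks at 1 for x = x_h
M = np.eye(2)
print(cos2_kernel(np.array([0.3, 0.4]), np.array([0.3, 0.4]), M))  # 1.0
```

Outside the active domain (weighted distance above π/2) the channel value is exactly zero, which is what makes the resulting feature vector sparse.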
9.3.1 Feature generation

The feature vectors a, input to the associative structure, may derive directly from the preprocessing parts of a computer vision system, representing local image properties such as orientation, curvature, colour, etc. Unless the features emerge as monopolar quantities, we will channel encode them. If the properties have a confidence measure, it is natural to weight the channel features with this, see discussion in chapter 3. Often, combinations or functions comb(a) of a set of features a, will be used as input to the associative structure. A common way to increase specificity in the percept space is to generate product pairs of the feature vector components, or a subset of them, i.e.

comb(a) = vec(aa^T).                                             (9.5)

The symbol vec signifies the trivial transformation of concatenating rows or columns of a matrix into a vector, see section 5.6.1. For simplicity of notation, we will express this as a substitution,

a ← comb(a).                                                     (9.6)

If we find a linear feature–response mapping in the training phase using the feature combination (9.5), it will correspond to a quadratic mapping from the original features to the responses. A network with this kind of feature expansion is called a higher order network [7]. The final vector a, going into the associative structure, will generally be considerably longer than the corresponding size of the sensor channel array. As we are dealing with sparse feature data, the increase of the data volume will be moderate.

9.4 System operation modes

The channel learning architecture can be run under two different operation modes, providing output in two different representations:

1. position encoding for discrete event mapping
2. magnitude encoding for continuous function mapping.
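The product-pair expansion (9.5) can be sketched as follows (an illustration; names are ours). Note how a sparse input stays sparse, since only products of simultaneously active features are non-zero:

```python
import numpy as np

def comb(a):
    """Second-order feature expansion (9.5): all product pairs of the
    components of a, flattened row-wise into a vector."""
    return np.outer(a, a).ravel()

a = np.array([0.0, 0.5, 1.0])   # a sparse feature vector, H = 3
b = comb(a)
print(b.shape)                  # (9,) -- H^2 components
print(np.count_nonzero(b))      # 4 -- only active-times-active pairs survive
```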
The first variety, discrete event mapping, is the mode which maximally exploits the advantage of the information representation, to allow implementation and learning of highly non-linear transfer functions, using a linear mapping. The second variety is similar to more traditional function approximation methods.

9.4.1 Position encoding for discrete event mapping

In this mode, the structure is trained to map onto a set of channel representations of the system response state variables, as discussed in subsection 9.2.1. Thus each response will have a non-zero output only within limited regions of the definition range. The major issue is that a multi-dimensional, fragmented feature set is mapped onto a likewise fragmented version of the system response state space. See figure 9.2 for an illustration.

Figure 9.2: Illustration of discrete event mapping. Solid curves are weighted input feature functions c_kh a^h(t) along the state space trajectory. Dashed curves are the responses u^k(t) = Σ_h c_kh a^h(t).

There are a number of characteristics of the discrete event mode:

• Mapping is made to sets of response channels, whose response functions may be partially overlapping to allow the reconstruction of a continuous variable.
• Output channels are expected to attain some standard maximum value, say 1, but are expected to be zero most of the time, to allow a sparse representation.
• The system state is not given by the magnitude of a single output channel, but is given by the relation between outputs of adjacent channels.
• Relatively few feature functions, or sometimes only a single feature function, are expected to map onto a particular output channel.
• The channel representation of a signal allows a unified representation of signal value and of signal confidence, where the relation between channel values represents value, and the magnitude represents confidence.
Since the discrete event mode implies that both the feature and response state vectors are in the channel representation, the confidence of the feature vector will be propagated to the response vector if the mapping is linear. The properties just listed allow the structure to be implemented as a purely linear mapping,

u = Ca.                                                          (9.7)

9.4.2 Magnitude encoding for continuous function mapping

The continuous function mapping mode is used to generate the response state variables directly, rather than a set of channel functions for position decoding. The response state vector, x, is approximated by a weighted sum of channel feature functions, see figure 9.3 for an illustration. This mode corresponds to classical function approximation objectives. The mode is used for accurate representation of a scalar continuous function, which is often useful in control systems. The approximation will be good if the feature functions are sufficiently local, and sufficiently dense.

Figure 9.3: Illustration of continuous function mapping. Solid curves are weighted input feature functions c_h a^h(t) along the state space trajectory. Dashed curve is the response x(t) = Σ_h c_h a^h(t).

There are a number of characteristics for the continuous function mapping:

• It uses rather complete sets of feature functions, compared to the mapping onto a single response in discrete event mode. The structure can still handle local feature dropouts without adverse effects upon well behaved regions.
• Mapping is made onto continuous response variables, which may have a magnitude which varies over a large range.
• A high degree of accuracy in the mapping can be obtained if the feature vector is normalised, as stated below.

In this mode, however, it is not possible to represent both a state value x, and a confidence measure r, unless it is done explicitly. For a channel vector, the vector sum corresponds to the confidence, see section 3.
As a first assumption we could thus assume that the feature vector sum corresponds to the confidence measure. Assuming a linear feature–response mapping, this will imply that the confidence is propagated to the response,

rx = Ca.                                                         (9.8)

By dividing the feature vector a with r we can normalise with the amount of confidence, or certainty, in a. This is related to the theory of normalized averaging, see e.g. [48]. If we use this model, we have additionally made the assumption that all features have the same confidence in each sample. To be slightly more flexible, we will instead assume a linear model for the confidence measure

r = w^T a,                                                       (9.9)

where w > 0 is a suitable weight vector. We now obtain the following response model for continuous function mode:

x = C (1 / (w^T a)) a.                                           (9.10)

Note that w^T a is a weighted l1-norm of a, since a is non-negative. An unweighted l1-norm, w = 1, is often used in RBF networks and probabilistic mixture models, see [58, 77]. Other choices of weighting w will be discussed in section 9.5.2.

9.5 Associative structure

We will now turn to the problem of estimating the linkage matrix C in (9.10) and in (9.7). We take on a unified approach for the two system operation modes. The models can be summarised into

u = C (1 / s(a)) a,                                              (9.11)

where s(a) is a normalisation function, and u denotes a scalar or a vector, representing either the explicit state variable/variables, or a channel representation thereof. In continuous function mode (9.10) u = x and s(a) = w^T a. In discrete event mode (9.7) u is a channel representation of x and s(a) ≡ 1. In the subsequent discussion, we will limit the scope to a supervised learning framework. Still, the structure can advantageously be used as a core in systems for other strategies of learning, such as reinforcement learning, with a proper embedding [96]. This discussion will assume batch mode training.
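The unified response model (9.11) and its two special cases can be sketched as follows (a minimal illustration with made-up numbers; names are ours):

```python
import numpy as np

def respond(C, a, w=None):
    """Unified response model (9.11) (a sketch): discrete event mode when
    w is None (s(a) = 1), continuous function mode otherwise (s(a) = w^T a)."""
    s = 1.0 if w is None else float(w @ a)
    return C @ a / s

C = np.array([[0.5, 0.5, 0.0]])
a = np.array([0.2, 0.6, 0.0])
print(respond(C, a))                  # discrete event mode: [0.4]
print(respond(C, a, w=np.ones(3)))    # continuous mode (w = 1): [0.5]
```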
This implies that there are N observation pairs of corresponding feature vectors a_n and state or response vectors u_n. Let A and U denote the matrices containing all feature vector and response vector samples respectively, i.e.

U = (u_1 u_2 . . . u_N), with rows u^1, u^2, . . . , u^K,
A = (a_1 a_2 . . . a_N), with rows a^1, a^2, . . . , a^H.        (9.12)

For a set of observation samples collected in accordance with (9.12), the model in (9.11) can be expressed as

U = C A D_s,                                                     (9.13)

where

D_s = diag(s(a_1), s(a_2), . . . , s(a_N))^(−1).                 (9.14)

The linkage matrix C is computed as a solution to a weighted least squares problem, with the constraint C ≥ 0. This constraint has a regularising effect on the mapping, and also ensures a sparse linkage matrix C. For a more extensive discussion on the monopolar constraint, see the article [51].

9.5.1 Optimisation procedure

The procedure to optimise the associative networks is mainly the work of Granlund and Johansson [51]. It is described in this section for completeness. The linkage matrix C is computed as the solution to the constrained weighted least-squares problem

min_{C≥0} e(C),                                                  (9.15)

where

e(C) = ||U − C A D_s||²_W = trace((U − C A D_s) W (U − C A D_s)^T).   (9.16)

The weight matrix W, which controls the relevance of each sample, is chosen as W = D_s^(−1). The minimisation problem (9.15) does not generally have a unique solution, as it can be under-determined or over-determined. The proposed solution to (9.15) is the fixed point of the sequence

C(0) = 0
C(i + 1) = max(0, C(i) − ∇e(C(i)) D_f),                          (9.17)

where D_f is the positive definite diagonal matrix

D_f = diag(v) diag(A D_s A^T v)^(−1)   for some v > 0.           (9.18)

Since W = D_s^(−1) we have

∇e(C) = (C A D_s − U) W D_s A^T = (C A D_s − U) A^T,             (9.19)

and we rewrite sequence (9.17) as

C(i + 1) = max(0, C(i) − (C(i) A D_s − U) A^T D_f).              (9.20)

We can interpret D_s and D_f as normalisations in the sample and feature domain respectively, see section 9.5.2 for further details.
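The fixed-point iteration (9.17)–(9.20) can be sketched as follows (an illustrative implementation under the assumption v = 1 in (9.18); function and variable names are ours). A is H×N with one feature vector per column, U is K×N:

```python
import numpy as np

def train_monopolar(U, A, s=None, iters=50):
    """Sketch of the iteration (9.20): non-negative least squares for the
    linkage matrix C, with sample normalisation D_s from (9.14) and
    feature normalisation D_f from (9.18), using v = 1."""
    N = A.shape[1]
    s = np.ones(N) if s is None else s           # s(a_n), defaults to 1
    Ds = np.diag(1.0 / s)                        # (9.14)
    ADs = A @ Ds
    v = np.ones(A.shape[0])
    Df = np.diag(v / (ADs @ A.T @ v))            # (9.18)
    C = np.zeros((U.shape[0], A.shape[0]))       # C(0) = 0
    for _ in range(iters):
        C = np.maximum(0.0, C - (C @ ADs - U) @ A.T @ Df)   # (9.20)
    return C
```

The projection onto C ≥ 0 in each step is what enforces the monopolar constraint, and in practice also the sparsity of C.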
We will consequently refer to D_s as sample domain normalisation and D_f as feature domain normalisation.

9.5.2 Normalisation modes

This normalisation can be put in either of the two representation domains, the sample domain or the feature domain, but with different effects upon convergence, accuracy, etc. For each choice of sample domain normalisation D_s there are non-unique choices of feature domain normalisations D_f such that sequence (9.20) converges to a solution of problem (9.15). D_f can for example be computed from (9.18). The choice of normalisation depends on the operation mode, i.e. continuous function mapping or discrete event mapping. There are some choices that are of particular interest. These are discussed below.

Discrete event mapping

Discrete event mode (9.7) corresponds to a sample domain normalisation matrix

D_s = I.                                                         (9.21)

Choosing v = 1 = (1 1 . . . 1)^T in (9.18) gives

D_f = diag(A A^T 1)^(−1), with diagonal entries 1/((a^h)^T m_f),  (9.22)

where m_f = Σ_h a^h is proportional to the mean in the feature domain. As D_s does not contain any components of A there is no risk that it turns singular in domains of samples having all feature components zero. This choice of normalisation will be referred to as Normalisation entirely in the feature domain.

Continuous function mapping

There are several ways to choose w in the continuous function model (9.10), depending on the assumptions of error models, and the resulting choice of confidence measure s. One approach is to assume that all training samples have the same confidence, i.e. s ≡ 1, and compute C ≥ 0 and w ≥ 0 such that

1 ≈ w^T A
X ≈ C A.                                                         (9.23)

Sometimes it may be desirable to have an individual confidence measure for each training sample. Another approach is to design a suitable w and then compute C using the optimisation framework in section 9.5.1 with s(a) = w^T a. There are two specific designs of w that are worth emphasising.
The channel representation implies that large feature channel magnitudes indicate a higher confidence than low values. We can consequently use the sum of the feature channels as a measure of confidence:

s(a) = 1^T a   ⇒   x = C (1 / (1^T a)) a.                        (9.24)

As mentioned before, this model is often used in RBF-networks and probabilistic mixture models, see [58, 77]. The corresponding sample domain normalisation matrix is

D_s = diag(A^T 1)^(−1), with diagonal entries 1/(a_n^T 1),        (9.25)

and if we choose v = 1 in (9.18) we get

D_f = diag(A 1)^(−1).                                            (9.26)

This choice of model will be referred to as Mixed domain normalisation. It can also be argued that a feature element which is frequently active should have a higher confidence than a feature element which is rarely active. This can be included in the confidence measure by using a weighted sum of the features, where the weight is proportional to the mean in the sample domain:

s(a) = m_s^T a   where   m_s = A 1 = Σ_n a_n.                    (9.27)

This corresponds to the sample domain normalisation matrix

D_s = diag(A^T A 1)^(−1), with diagonal entries 1/(m_s^T a_n),    (9.28)

and by using v = A 1 in (9.18) we get

D_f = I.                                                         (9.29)

This choice of model will be referred to as Normalisation entirely in the sample domain.

9.5.3 Sensitivity analysis for continuous function mode

We will now make some observations concerning the insensitivity to noise of the system, under the assumption of sample normalisation in continuous function mode. That is, a response state estimate x̂_n is generated from a feature vector a according to model (9.10), i.e.

x̂_n = C (1 / (w^T a_n)) a_n.                                    (9.30)

We observe that regardless of choice of normalisation vector w, the response will be independent of any global scaling γ of the features, i.e.

C (1 / (w^T a_n)) a_n = C (1 / (w^T γ a_n)) γ a_n.               (9.31)

If multiplicative noise is applied, represented by a diagonal matrix D_γ, we get

x̂_n = C (1 / (w^T D_γ a_n)) D_γ a_n.                            (9.32)

If the choice of weights in C and w is consistent, i.e.
if the weights used to generate a response at a sample n were to obey the relation C = x̂_n w^T, the network is perfectly invariant to multiplicative noise. As we shall see in the experiments to follow, the normalisation comes close to this ideal for the entire sample set, provided that the response signal varies slowly. For such situations, the network suppresses multiplicative noise well. Similarly, a sensitivity analysis can be made for discrete event mode. We will in this presentation only refer to the discussion in chapter 3 for the invariances available in the channel representation, and to the results from the experimental verification in the following section.

9.6 Experimental verification

We will in this section analyse the behaviour and the noise sensitivity of several variants of associative networks, both in continuous function mode and in discrete event mode. A generalisation of the common CMU twin spiral pattern [18] has been used, as this is often used to evaluate classification networks. We have chosen to make the pattern more difficult in order to show that the proposed learning machinery can represent both continuous function mappings (regression) and mappings to discrete classes (classification). The robustness is analysed with respect to three types of noise: additive, multiplicative, and impulse noise on the feature vector.

9.6.1 Experimental setup

In the experiments, a three dimensional state space X ⊂ R³ is used. The sensor space A ⊂ R², and the response space R ⊂ R are orthogonal projections of the state space. The network is trained to perform the mapping f : A → R which is depicted in figure 9.4. Note that this mapping can be seen as a surface of points x ∈ R³, with x₃ = f(x₁, x₂). The analytic expression for f(x₁, x₂) is:

f(r, ϕ) = { f_s(r, ϕ)         if mod(ϕ + √1000 r, 2π) < π
          { sign(f_s(r, ϕ))   otherwise,                         (9.33)

where f_s(r, ϕ) = (1/√2 − r) cos(ϕ + √1000 r).
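The test pattern (9.33) can be sketched as follows (a minimal version; the argument convention of the arctangent is our assumption, and names are ours):

```python
import numpy as np

def f_spiral(x1, x2):
    """Sketch of the evaluation pattern (9.33): a spiral-modulated surface
    with smooth (cosine) and discontinuous (sign) regions."""
    r = np.hypot(x1, x2)
    phi = np.arctan2(x2, x1)
    fs = (1.0 / np.sqrt(2) - r) * np.cos(phi + np.sqrt(1000) * r)
    smooth = np.mod(phi + np.sqrt(1000) * r, 2 * np.pi) < np.pi
    return np.where(smooth, fs, np.sign(fs))

# Sample the surface on a regular grid, as in the evaluation
x1, x2 = np.meshgrid(np.linspace(-0.5, 0.5, 200), np.linspace(-0.5, 0.5, 200))
z = f_spiral(x1, x2)
print(z.min() >= -1 and z.max() <= 1)   # values stay within [-1, 1]
```

Evaluating on a regular grid has the advantage noted in the text: z can be displayed directly as an image, with black to white corresponding to [−1, 1].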
Variables r = √(x₁² + x₂²) and ϕ = tan⁻¹(x₁, x₂) are the polar coordinates in sensor space A. As can be seen in the figure, the mapping contains both smooth parts (given by the cos function) and discontinuities (introduced by the sign function). The pattern is intended to demonstrate the following properties:

1. The ability to approximate piecewise continuous surfaces.
2. The ability to describe discontinuities (i.e. assignment into discrete classes).
3. The transition between interpolation and representation of a discontinuity.
4. The inherent approximation introduced by the sensor channels.

As sensor channels, a variant of the channels prescribed in expression (9.3) is used:

B^h(x) = { cos²(ωd)   if ωd ≤ π/2
         { 0          otherwise,                                 (9.34)

where d = √((x − x^h)^T M (x − x^h)), and M = diag(1 1 0).        (9.35)

In the experiments H = 2000 such sensors are used, with random positions {x^h}_1^H inside the box ([−0.5, 0.5], [−0.5, 0.5]) ⊂ A.

Figure 9.4: Desired response function. Black to white correspond to values of x₃ ∈ [−1, 1].

The sensors have channel widths of ω = π/0.14, giving each an active domain with radius 0.07. Thus, for each state x_n, a feature vector a_n = (B¹(x_n) B²(x_n) . . . B^H(x_n))^T is obtained. During training, random samples of the state vector x_n ∈ X on the surface f : A → R are generated. These are used to obtain pairs {f_n, a_n} using (9.33) and (9.34). The training sets are stored in the matrices f and A respectively. The performance is then evaluated on a regular sampling grid. This has the advantage that performance can be visualised as an image. Since real valued positions x ∈ X are used, the training and evaluation sets are disjoint. The mean absolute error (MAE) between the network output and the ground truth (9.33) is used as a performance measure,

ε_MAE = (1/N) Σ_{n=1}^N |f(x_n) − c a_n|,                        (9.36)

or, for discrete event mode,

ε_MAE = (1/N) Σ_{n=1}^N |f(x_n) − dec(C a_n)|.                   (9.37)

The rationale for using this error measure is that it is roughly proportional to the number of misclassifications along the black-to-white boundary, in contrast to RMSE which is proportional to the number of misclassifications squared.

9.6.2 Associative network variants

We will demonstrate the behaviour of the following five variants of associative networks:

1. Mixed domain normalisation bipolar network. This network uses the model f̂ = (1/(1^T a)) c a. This model is often used in RBF-networks and probabilistic mixture models, see [58, 77]. This network is optimised according to

c = arg min_c ||f − c A D_s||² + γ||c||².                        (9.38)

In the experiments, the explicit solution is used, i.e.

c = f D_s^T A^T (A D_s D_s^T A^T + γI)^(−1).                     (9.39)

Note that for larger systems, it is more efficient to replace (9.39) with a gradient descent method.

2. Mixed domain normalisation monopolar network. Same as above, but with a monopolar constraint on c, instead of the Tikhonov regularization used above.

3. Sample domain normalisation monopolar network. This network uses the model f̂ = (1/(m_s^T a)) c a, where m_s is computed from the training set sensor channels according to m_s = A 1.

4. Uniform sample confidence monopolar network. This network uses the model f̂ = (1/(w^T a)) c a, where the mapping w is trained to produce the response 1 for all samples, see (9.23).

5. Discrete event mode monopolar network. This network uses the model û = Ca ⇔ f̂ = dec(Ca), with K = 7 channels. The responses should describe the interval [−1, 1] so the decoding step involves a linear mapping, see (3.5).

Figure 9.5: Performance of bipolar network (#1) under varied number of samples. Top left to bottom right: N = 63, 125, 250, 500, 1000, 2000, 4000, 8000.

9.6.3 Varied number of samples

As a demonstration of the generalisation abilities of the networks we will first vary the number of samples.
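The explicit regularised solution (9.39) for the bipolar network (#1) can be sketched as follows (a minimal version derived from the normal equations of (9.38); the exact placement of D_s is our reading of the formula, and names are ours):

```python
import numpy as np

def train_bipolar(f, A, gamma=0.005):
    """Sketch of the Tikhonov-regularised solution (9.39), with mixed
    domain normalisation D_s = diag(A^T 1)^(-1) from (9.25).
    f: (N,) training responses, A: (H, N) feature matrix."""
    Ds = np.diag(1.0 / (A.T @ np.ones(A.shape[0])))   # (9.25)
    ADs = A @ Ds                                      # normalised features
    H = A.shape[0]
    # c = f (A D_s)^T (A D_s D_s^T A^T + gamma I)^(-1)
    return f @ ADs.T @ np.linalg.inv(ADs @ ADs.T + gamma * np.eye(H))
```

With γ = 0 and an invertible, already-normalised feature matrix this reduces to plain least squares; the γ term trades training error for a smaller (but, unlike the monopolar case, not sparse) coefficient vector c.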
The monopolar networks are optimised according to section 9.5, with 50 iterations. For the bipolar network we have used γ = 0.005. This value is chosen to give the same error on the training set as in network #2 using N = 500 samples. The performance on the regular grid is demonstrated in figure 9.5 for the bipolar network (#1), and in figure 9.6 for the discrete event network (#5). If we look at the centre of the spiral, we see that both networks fail to describe the fine details of the spiral, although #1 is doing slightly better. For the discrete event network, the failure is a direct consequence of the feature channel sizes. For the bipolar network it is a combined consequence of the size and density of the feature channels. We can also observe that the discrete event network is significantly better at dealing with the discontinuities. This is also reflected in the error measures, see figure 9.7. For very low numbers of samples, when both networks clearly fail, the bipolar network is slightly better. We have also plotted the performance of the monopolar mappings in continuous function mode. As can be seen in the plot, these are all slightly worse off than the bipolar network. All three monopolar continuous function mode variants have similar performances on this setup. Differences appear mainly when the sample density becomes non-uniform (not shown here).

Figure 9.6: Performance of discrete event network (#5) under varied number of samples. Top left to bottom right: N = 63, 125, 250, 500, 1000, 2000, 4000, 8000.

Figure 9.7: MAE under varied number of samples. Solid thick is #5, and dashed is #1. Solid thin are #2, #3, and #4. For low number of samples the variants are ordered #2, #3, #4 with #4 being the best one.

Figure 9.8: Performance of discrete event network (#5) under varied number of channels. Top left to bottom right: K = 3 to K = 14.
9.6.4 Varied number of channels

The relationship between the sizes of feature and response channels is important for the performance of the network. The distance between the channels also determines where the decision between interpolation and introduction of a discontinuity is made. We will now demonstrate these two effects by varying the number of channels in the range [3 . . . 14], and keeping the number of samples high, N = 8000. As can be seen in figure 9.8, a low number of channels gives a smooth response function. For K = 3 no discontinuity is introduced at all, since there is only one interval for the local reconstruction (see section 3.2.3). As the number of channels is increased, the number of discontinuities increases. Initially this is an advantage, but for a large number of channels, the response function becomes increasingly patchy (see figure 9.8). In practice, there is thus a trade-off between description of discontinuities, and patchiness. This trade-off is also evident if MAE is plotted against the number of channels, see figure 9.9, left. In figure 9.9, right part, error curves for smaller numbers of samples have been plotted. It can be seen that, for a given number of samples, the optimal choice of channels varies. Better performance is obtained for a small number of channels, when fewer samples are used. The standard way to interpret this result is that a high number of response channels allows a more complex model, which requires more samples.

Figure 9.9: MAE under varied number of channels. Left: MAE for N = 8000. Right: MAE for N = 63, 125, 250, 500, 1000, 2000, 4000, and 8000.

Figure 9.10: Number of non-zero coefficients under varied number of channels. Compare this with 2000 non-zero coefficients for the continuous function networks.
If we plot the number of non-zero coefficients in the linkage matrix C, we also see that there is an optimal number of channels, see figure 9.10. Note that although the size of C is between 3 and 14 times larger than in continuous function mode, the number of links only increases by a factor 2.1 to 2.5.

9.6.5 Noise sensitivity

We will now demonstrate the performance of the associative networks when the feature set is noisy. We will use the following noise models:

1. Additive noise: A random value is added to each feature value, i.e. a*_n = a_n + η, with η^k ∈ rect[−p, p], and the parameter p is varied in the range [0, 0.1].

2. Multiplicative noise: Each feature value is multiplied with a random value, i.e. a*_n = D_η a_n, where D_η is a diagonal matrix with (D_η)_kk = η^k ∈ rect[1 − p, 1 + p], and the parameter p is varied in the range [0, 1].

3. Impulse noise: A fraction of the features is set to 1, i.e.

a^{k,*}_n = { 1       if f_r < p, where f_r ∈ rect(0, 1)
            { a^k_n   otherwise,

and the parameter p is varied in the range [0, 0.01].

Figure 9.11: Noise sensitivity. Top left: additive noise, top right: multiplicative noise, bottom: impulse noise. Solid thick is #5, and dashed is #1. Solid thin are #2, #3, and #4.

The results of the experiments are shown in figure 9.11. We have consistently used N = 4000 samples for evaluation, and corrupted them with noise according to the discussion above. In order to make the amount of regularisation comparable we have optimised the γ parameter for network #1 to give the same error on the training set as network #2 at N = 4000 samples. This gave γ = 0.08. As can be seen from the additive noise experiment, network #5 has a different slope for its dependence upon noise level.
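The three noise models above can be sketched as follows (an illustration; the function name and the dispatch on a string are ours):

```python
import numpy as np

def corrupt(a, kind, p, rng):
    """Sketch of the three noise models applied to a feature vector a."""
    if kind == 'additive':         # a* = a + eta, eta ~ rect[-p, p]
        return a + rng.uniform(-p, p, a.shape)
    if kind == 'multiplicative':   # a* = D_eta a, eta ~ rect[1-p, 1+p]
        return a * rng.uniform(1 - p, 1 + p, a.shape)
    if kind == 'impulse':          # a fraction p of the features is set to 1
        return np.where(rng.uniform(0, 1, a.shape) < p, 1.0, a)
    raise ValueError(kind)

rng = np.random.default_rng(0)
a = np.array([0.2, 0.0, 0.8])
print(corrupt(a, 'impulse', 1.0, rng))   # every feature replaced by 1
```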
The other networks are comparable, and differences are mainly due to how well the networks are able to represent the pattern in the first place (see section 9.6.3). For the multiplicative noise case, we see that the slope is similar for all networks. Thus we can conclude that the multiplicative noise behaviour is comparable for all tested networks. For the impulse noise case we can see that for small amounts of noise, network #5 has a less steep slope than the others. For larger amounts of noise however, all networks seem to behave in a similar manner. The purpose of these experiments has been to demonstrate the abilities of the associative networks to generalise, and to cope with various kinds of sensor noise. Several experiments using image features as inputs have been made, but have to be excluded from this presentation. For details of such experiments, the reader is directed to [53, 34, 80].

9.7 Other local model techniques

We will now have a look at three classes of techniques similar to the associative networks presented in this chapter. The descriptions of the techniques, Radial Basis Function (RBF) networks, Support Vector Machines (SVM), and adaptive fuzzy control, are not meant to be exhaustive; the purpose of the presentation is merely to describe the similarities and differences between them and the associative networks.

9.7.1 Radial Basis Function networks

The fact that an increased input dimensionality with localised inputs simplifies learning problems has also been exploited in the field of Radial Basis Function (RBF) networks [77, 58]. RBF networks have a hidden layer with localised Gaussian models, and an output layer which is linear. In effect this means that RBF networks learn a hidden representation which works like a channel encoding. The advantage with this approach is that the locations and sizes of the channels (or RBFs) adapt to the data.
The obvious disadvantage compared to using a fixed set of localised inputs is of course a longer training time, since the network has two layers that have to be learned. Typically the RBF positions are found using a clustering scheme such as K-means [77], or, if the amount of training data is low, one RBF is centred around each training sample. Related to RBF networks are hierarchies of local Gaussian models. Such networks have been investigated by for instance Landelius in [69]. His setup allows new models to be added where needed, and unused models to be removed. Compared to the associative networks presented in this chapter, we also note that the response from an RBF network is a continuous function, and not a channel representation. This means that RBF networks cannot properly deal with multiple hypotheses.

9.7.2 Support Vector Machines

The Support Vector Machine (SVM) is another kernel technique, one that avoids mapping into a high-dimensional space altogether. For an SVM it is required that the used kernel is positive definite. For such cases, Mercer's theorem states that the kernel function is equivalent to an inner product in a high-dimensional space [58]. Obvious differences between the associative networks and SVMs are that an SVM has a low dimensional feature space, and maps either to a binary variable (classification SVM), or to a continuous function (regression SVM). An associative network on the other hand uses a high-dimensional, sparse representation of the feature space, and maps to a set of response channels. Since SVMs do not use responses in the channel representation, they are unable to deal with multiple hypotheses.

9.7.3 Adaptive fuzzy control

Adaptive fuzzy control is a technique for learning locally linear functional relationships, see e.g. [84] for an overview. In fuzzy control a set of local fuzzy inference rules between measurements and desired outputs is established.
These are often in a form suitable for linguistic communication, for instance: IF temperature(warm) THEN power(reduce). The linguistic states (“warm” and “reduce” in our example) are defined by localised membership functions, corresponding to the kernels in the channel representation. Each input variable is fuzzified into a set of membership degrees, which are in the range [0, 1]. Groups of one membership function per input variable are connected to an output membership function in a fuzzy inference rule. Each fuzzy inference rule only fires to a certain degree, which is determined by the amount of input activations. The result of the fuzzy inference is a weighted linear combination of the output membership functions, which can be used to decode a response in a defuzzification step, typically a global moment (centroid) computation. The IF-THEN inference rules can be learned by a neural network, see for instance [76]. Typically the learning adjusts the shape and positions of the membership functions, while the actual set of IF-THEN rules stays fixed. Thus a fuzzy inference system can be thought of as an associative network with adaptive feature and response channels, and a static, binary linkage matrix C. There are several differences between the associative networks and fuzzy control. In fuzzy control the implicit assumption is that there is only one value per feature dimension activating the membership functions at the input side. As shown in this chapter, this is not the case in associative learning using the channel representation. Furthermore, fuzzy control only allows one response, since the defuzzification is a global operation. In contrast, representation of multiple values is an important aspect of the channel representation, see section 3.2.1.
9.8 Concluding remarks

In this chapter we have demonstrated that the channel learning architecture running in discrete event mode is able to describe continuous and transient phenomena simultaneously, while still being better than or as good as a linear network at suppressing noise. An increase in the number of response channels does not cause an explosion in the number of used links. Rather, it remains fairly stable at approximately twice the number of links required for a continuous function mapping. This is a direct consequence of the monopolar constraint.

The training procedure shows a fast convergence. In the experiments described, a mere 50 iterations have been required. The fast convergence is due to the monopolar constraint, the locality of the features and responses, and the choice of feature domain normalisation. The learning architecture using channel information also deals properly with the perceptual aliasing problem, that is, it does not attempt to merge or average conflicting statements, but rather passes them on to the next processing level. This allows a second processing stage to resolve the perceptual aliasing, using additional information not available at the lower level.

The ability of the architecture to handle a large number of models in separate or loosely coupled domains of the state space promises systems that combine the continuous mapping of control systems with the state complexity we have become familiar with from digital systems. Such systems can be used for the implementation of extremely complex, contextually controlled mapping model structures. One such application is view-based object recognition in computer vision [53].

Chapter 10

An Autonomous Reactive System

This chapter describes how a world model for successive recognition can be learned using associative learning.
The learned world model consists of a linear mapping that successively updates a high-dimensional system state, using performed actions and observed percepts. The actions of the system are learned by rewarding actions that are good at resolving state ambiguities. As a demonstration, the system is used to solve the localisation problem in a labyrinth.

10.1 Introduction

During the eighties a class of robotic systems known as reactive robotic systems became popular. The introduction of system designs such as the subsumption architecture [11] caused a small revolution due to their remarkably short response times. Reactive systems are able to act quickly since the actions they perform are computed as a direct function of the sensor readings, or percepts, at a given time instant. This design principle works surprisingly well in many situations despite its simplicity. However, a purely reactive design is sensitive to a fundamental problem known as perceptual aliasing, see e.g. [17]. Perceptual aliasing is the situation where the percepts are identical in two situations in which the system should perform different actions. There are two main solutions to this problem:

• The first is to add more sensors to the system such that the two situations can be distinguished.

• The second is to give the system an internal state. This state is estimated such that it is different in the two situations, and can thus be used to guide the actions.

This chapter will deal with the latter solution, which from here on will be called successive state estimation. We note here that the introduced state can be tailor-made to resolve the perceptual aliasing. Successive state estimation is called recursive parameter estimation in signal processing, and on-line filtering in statistics [101].
Successive recognition could potentially be useful to computer vision systems that are to navigate in a known environment using visual input, such as the autonomous helicopter in the WITAS project [52].

10.1.1 System outline

Successive state estimation is an important component of an active perception system. The system design to be described is illustrated in figure 10.1. The state estimation, which is the main topic of this chapter, is performed by the state transition and state narrowing boxes. The state transition box updates the state using information about which action the system has taken, and the state narrowing box successively resolves ambiguities in the state by only keeping states that are consistent with the observed stimulus.

Figure 10.1: System outline. (Boxes: channel coding, state transitions, state narrowing, system state, motor programs; inputs: stimulus and action; output: new action.)

The system consistently uses the channel representation (see chapter 3) to represent states and actions. This implies that information is stored in channel vectors of which most elements are zero. Each channel is non-negative, and its magnitude signifies the relevance of a specific hypothesis (such as a specific system state in our case); a zero value thus represents "no information". This information representation has the advantage that it enables very fast associative learning methods to be employed [50], and improves product sum matching [34]. The channel coding box in figure 10.1 converts the percepts into a channel representation. Finally, the motor program box is the subsystem that generates the actions of the system. The complexity of this box is at present kept at a minimum.

10.2 Example environment

To demonstrate the principle of successive state estimation, we will apply it on the problem shown in figure 10.2.
The arrow in the figure symbolises an autonomous agent that is supposed to successively estimate its position and gaze direction by performing actions and observing how the percepts change. This is known as the robot localisation problem [101]. The labyrinth is a known environment, but the initial location of the agent is unknown, and thus the problem consists of learning (or designing) a world model that is useful for successive recognition.

Figure 10.2: Illustration of the labyrinth navigation problem.

The stimulus constitutes a three-element binary vector, which tells whether there are walls to the left, in front, or to the right of the agent. For the situation in the figure, this vector will look like this:

m = (0 0 1)^T .

This stimulus is converted to percept channels in one of two ways:

p_1 = (m_1  m_2  m_3  1−m_1  1−m_2  1−m_3)^T, or
p_2 = (p_1  p_2  p_3  p_4  p_5  p_6  p_7  p_8)^T,     (10.1)

where

p_h = 1 if m = m_h, and 0 otherwise,

and {m_h}_1^8 is the set of all possible stimuli. This expansion is needed since we want to train an associative network [50] to perform the state transitions, and since the network only has non-negative coefficients, we must have a non-zero input vector whenever we want a response. The two variants p_1 and p_2 will be called semi-local and local percepts respectively. For the semi-local percepts, correlation serves as a similarity measure, or metric, but for the local percepts we have no metric: the correlation is either 1 or 0.

The system has three possible actions: a^1 = TURN LEFT, a^2 = TURN RIGHT, and a^3 = MOVE FORWARD. These are also represented as a three-element binary vector, with only one non-zero element at a time. E.g. TURN RIGHT is represented as

a^2 = (0 1 0)^T .

Each action will either turn the agent 90° clockwise or anti-clockwise, or move it forward to the next grid location (unless there is a wall in the way).
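The two percept encodings in (10.1) can be sketched directly in code (a sketch with our own helper names; the stimulus is the three-element wall indicator m):

```python
# Semi-local and local percept encodings from (10.1).

def encode_p1(m):
    """Semi-local percept: (m1, m2, m3, 1-m1, 1-m2, 1-m3)."""
    return list(m) + [1 - v for v in m]

def encode_p2(m):
    """Local percept: one-hot code over the 2^3 = 8 possible stimuli."""
    all_stimuli = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    return [1 if tuple(m) == mh else 0 for mh in all_stimuli]
```

For the situation in figure 10.2, m = (0, 0, 1) gives p_1 = (0, 0, 1, 1, 1, 0) with graded overlap between similar stimuli, while p_2 has a single non-zero element, so two different stimuli always have zero correlation.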
As noted in section 10.1, the purpose of the system state is to resolve perceptual aliasing. For the current problem this is guaranteed by letting the state describe both agent location and absolute orientation. This gives us the number of states as

N_s = rows × cols × orientations .     (10.2)

For the labyrinth in figure 10.2 this means 7 × 7 × 4 = 196 different states.

10.3 Learning successive recognition

If the state is in a local representation, that is, each component of the state vector represents a local interval in state space, successive recognition can be obtained by a linear mapping. For the environment described in section 10.2, we will thus use a state vector with N_s components. The linear mapping will recursively estimate the state, s, from an earlier state, the performed action, a, and an observed percept, p. I.e.

s(t + 1) = C [s(t) ⊗ a(t) ⊗ p(t + 1)]     (10.3)

where ⊗ is the Kronecker product, which generates a vector containing all product pairs of the elements in the involved vectors (see section 5.6.1). The sought linear mapping C is thus of dimension N_s × N_s N_a N_p, where N_a and N_p are the sizes of the action and percept vectors respectively. In order to learn the mapping we supply examples of s, a, and p for all possible state transitions. This gives us a total of N_s N_a samples. The coefficients of the mapping C are found using a least-squares optimisation with a non-negativity constraint

arg min_{c_ij ≥ 0} ||u − Cf||²  where  u = s(t + 1),  f = s(t) ⊗ a(t) ⊗ p(t + 1) .

For details of the actual optimisation see section 9.5.1.

10.3.1 Notes on the state mapping

The first thing to note about usage of the mapping C is that the state vector obtained by the mapping has to be normalised at each time step, i.e.

s̃(t + 1) = C [s(t) ⊗ a(t) ⊗ p(t + 1)]
s(t + 1) = s̃(t + 1) / Σ_k s̃_k(t + 1) .     (10.4)

In the environment described in section 10.2, we obtain exactly the same behaviour when we use two separate maps:

s*(t + 1) = C_1 [s(t) ⊗ a(t)]
s̃(t + 1) = C_2 [s*(t + 1) ⊗ p(t + 1)] .     (10.5)

These two maps correspond to the boxes state transition and state narrowing in figure 10.1. An interesting parallel to on-line filtering algorithms in statistics is that C_1 corresponds to the stochastic transition model

s*(t + 1) ∼ p(x(t + 1) | s(t), a(t))     (10.6)

where x is the unknown current state. Additionally, C_2 is related to the stochastic observation model p(p(t) | s(t)). A probabilistic interpretation of C_2 would be

s(t + 1) ∼ p(x(t + 1) | s*(t + 1), p(t + 1)) .     (10.7)

See for instance [101] for a system which makes use of this framework.

The mappings have sizes N_s × N_s N_a and N_s × N_s N_p, and this gives us at most N_s²(N_a + N_p) coefficients, compared to N_s² N_a N_p in the single-mapping case. Thus the split into two maps is advantageous, provided that the behaviour is not affected (which in our case it is not). Aside from the reduction in the number of coefficients, the split into two maps also simplifies the optimisation of the mappings considerably. If we during the optimisation supply samples of s*(t + 1) that are identical to s(t + 1), we end up with a mapping C_2 that simply weights the state vector with the correlations between the observed percept and those observed at each state during optimisation. In other words, (10.5) is equivalent to

s̃(t + 1) = diag(P p(t + 1)) C_1 [s(t) ⊗ a(t)] .     (10.8)

Here P is a matrix with row n containing the percept observed at state n during the training, and diag() generates a matrix with the argument vector in the diagonal.

10.3.2 Exploratory behaviour

How quickly the system is able to recognise its location is of course critically dependent on which actions it takes.
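The factored update (10.5)/(10.8), together with the normalisation in (10.4), can be sketched in pure Python (a sketch with our own helper names, not the thesis implementation; small dense vectors are assumed):

```python
# Sketch of the factored state update: C1 predicts the next state from
# s(t) (x) a(t), and the prediction is narrowed by the percept
# correlations diag(P p) and renormalised as in (10.4).

def kron(u, v):
    """Kronecker product of two vectors: all element products, u-major."""
    return [x * y for x in u for y in v]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def normalise(s):
    total = sum(s)
    return [x / total for x in s] if total > 0 else s

def update_state(C1, P, s, a, p):
    s_star = matvec(C1, kron(s, a))         # state transition (10.5)
    weights = matvec(P, p)                  # percept correlations P p
    return normalise([w * x for w, x in zip(weights, s_star)])  # narrowing
```

States inconsistent with the observed percept receive a zero weight and are thus pruned, while the renormalisation keeps the remaining channel values interpretable as relevances.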
A good exploratory behaviour should strive to observe new percepts as often as possible, but how can the system know whether shifting its attention to something new will help, when it does not yet know where it is? In this system the actions are chosen using a policy, where the probabilities for each action are conditional on the previous action a(t − 1) and the observed percept p(t). I.e. the action probabilities can be calculated as

p(a(t) = a^h) = c_h [a(t − 1) ⊗ p_2(t)]     (10.9)

where {a^h}_1^3 are the three possible actions (see section 10.2). The coefficients in the mappings {c_h}_1^3 should be defined such that Σ_h p(a(t) = a^h) = 1. Initially we define the policy {c_h}_1^3 manually. A random run of a system with a fixed policy is demonstrated in figure 10.3. The two different kinds of percepts p_1 and p_2 are those defined in (10.1).

Figure 10.3: Illustration of state narrowing. (For each time step t = 0, ..., 7: the state estimate using p_1, the estimate using p_2, and the actual state.)

10.3.3 Evaluating narrowing performance

The performance of the localisation process may be evaluated by observing how the estimated state vector s(t) changes over time. As a measure of how narrow a specific state vector is we will use

n(t) = Σ_k s_k(t) / max_k {s_k(t)} .     (10.10)

If all state channels are activated to the same degree, as is the case for t = 0, we will get n(t) = N_s, and if just one state channel is activated we will get n(t) = 1. Thus n(t) can be seen as a measure of how many possible states are still remaining. Figure 10.4 (top) shows a comparison of systems using local and semi-local percepts for 50 runs of the network. For each run the true initial state is selected at random, and s(0) is set to 1/N_s.

Figure 10.4: Narrowing performance. Top left: n(t) for a system using p_1.
Top right: n(t) for a system using p_2. Each graph shows 50 runs (dotted). The solid curves are averages. Bottom: Solid: n(t) for p_1 and p_2. Dashed: p_1 using f_1(). Dash-dotted: p_1 using f_2(). Each curve is an average over 50 runs.

Since the only thing that differs between the two upper plots in figure 10.4 is the percepts, the difference in convergence has to occur in step 2 of (10.5). We can further demonstrate what influence the feature correlation has on the convergence by modifying the correlation step in equation (10.8) as follows:

s̃(t + 1) = diag(f(P p(t + 1))) C_1 [s(t) ⊗ a(t)] .     (10.11)

We will try the following two choices of f() on correlations of the semi-local percepts:

f_1(c) = √c  and  f_2(c) = 1 if c > 0, and 0 otherwise.     (10.12)

All four kinds of systems are compared in the lower graph of figure 10.4. As can be seen, the narrowing behaviour is greatly improved by a sharp decay of the percept correlation function. However, for continuous environments there will most likely be a trade-off between sharp correlation functions, state interpolation, and the number of samples required during training.

10.3.4 Learning a narrowing policy

The conditional probabilities in the policy defined in section 10.3.2 can be learned using reinforcement learning [96]. A good exploratory behaviour is found by giving rewards to conditional actions {a(t) | p(t), a(t − 1)} that reduce the narrowing measure (10.10), and by having the action probability p(a(t) = a^h | p(t), a(t − 1)) gradually increase for conditional actions with above-average rewards. This is called a pursuit method [96]. In order for the rewards not to die out, the system state is regularly reset to all ones, for instance when t mod 30 = 0. The first attempt is to define the reward as a plain difference of the narrowing measure (10.10), i.e.

r_1(t) = n(t − 1) − n(t) .     (10.13)
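The narrowing measure (10.10) and the rewards derived from it are straightforward to express in code (a sketch; the function names are our own):

```python
# Sketch of the narrowing measure (10.10) and the narrowing-based rewards
# of section 10.3.4.

def narrowing(s):
    """n(t) = sum_k s_k / max_k s_k: the effective number of active states."""
    return sum(s) / max(s)

def reward_r1(n_prev, n_now):
    # r1(t) = n(t-1) - n(t): positive when the state estimate narrows.
    return n_prev - n_now

def reward_r2(n_prev, n_next):
    # r2(t) = r1(t) + r1(t+1) = n(t-1) - n(t+1): also counts the
    # narrowing one step into the future.
    return n_prev - n_next
```

A uniform state vector gives n(t) equal to the number of states, and a one-hot state vector gives n(t) = 1, so positive rewards correspond to ambiguity actually being resolved.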
With this reward, the agent easily gets stuck in sub-optimal policies, such as constantly trying to move into a wall. Better behaviour is obtained by also looking at the narrowing difference one step into the future, i.e.

r_2(t) = r_1(t) + r_1(t + 1) = n(t − 1) − n(t + 1) .     (10.14)

The behaviours learned using (10.13) and (10.14) are compared with a random walk in figure 10.5.

Figure 10.5: Narrowing performance. Left: n(t) for a policy learned using r_1(t). Right: n(t) for a policy learned using r_2(t). Each graph shows 50 runs (dotted). The thick curves are averages. Dashed curves show average narrowing for a completely random walk.

10.4 Concluding remarks

The aim of this chapter has not been to describe a useful application, but rather to show how the principle of successive recognition can be used. Compared to a real robot navigation task, the environment used is far too simple to serve as a model world. Further experiments will extend the model to continuous environments, with noisy percepts and actions.

Chapter 11

Conclusions and Future Research Directions

In this chapter we conclude the thesis by summarising the results. We also indicate open issues, which point to research directions that can be pursued further.

11.1 Conclusions

This thesis is the result of asking the question "What can be done in the channel representation, which cannot be accomplished without it?". We started by deriving expressions for channel encoding of scalars, and for retrieving them again using a local decoding. We then investigated what the simple operation of averaging in the channel representation results in. The result that several modes of a distribution can be treated in parallel is of fundamental importance in perception. Perception in the presence of noise is a difficult problem. One especially persistent kind of noise is a competing nearby feature.
By making use of the channel representation, we can make this problem go away, by simultaneously estimating all present features in parallel. We can select significant features after estimation, by picking one or several of the local decodings. In principle this would allow the design of a perception system similar to that of the bat described in section 2.1.4.

The channel representation is also useful for response generation. The associative networks in chapter 9 were shown to be able to learn piecewise continuous mappings, without blurring discontinuities. Such an ability is useful in response generation, for example in navigation with obstacle avoidance. If we encounter an obstacle in front of us, it might be possible to pass it on both the left and the right side, so both these options are valid responses. Their average however is not, and thus a system that learns obstacle avoidance will need to use some kind of channel representation in order to avoid such inappropriate averaging.

For all levels in a perception system it is crucial that not all information is processed at each position. In order not to be flooded with data we need to exploit locality, i.e. restrict the amount of information available at each position to a local context. This can however lead to problems such as perceptual aliasing. As was demonstrated in chapter 9, intermediate responses in channel representation are a proper way to deal with perceptually aliased states. The channel representation solves the perceptual aliasing problem by not trying to merge states, but instead passing them on to the next processing level, where hopefully more context will be available to resolve the aliasing.

11.2 Future research

As is common in science, this thesis answered some questions, but at the same time it raised several new ones. We will now mention some questions which might be worthwhile to pursue further.
Thus far, the only operations considered in channel spaces are averaging and non-negative projections. Are there other meaningful operations in channel spaces? One option is to adapt the averaging to the local image structure, see Felsberg's paper [28] for some first results in this area. The clustering of constant slopes developed in chapter 7 is, as mentioned, just a first result. It could probably benefit from a change of the representation of the slopes to be clustered.

11.2.1 Feature matching and recognition

Two currently active areas in computer vision are wide baseline matching, see e.g. [99], and parts-based object recognition, see e.g. [70, 66, 82]. Both of these areas are possible applications for the blob features developed in chapter 7.

11.2.2 Perception action cycles

Active vision is an important aspect of robotics. Apart from the simple example in chapter 10, this thesis has not dealt with closed perception–action loops. One direction to pursue is to explore the visual servoing idea in connection with the methods and representations developed in this thesis. One way to do this is to apply the successive recognition system in chapter 10 to more realistic problems, but other architectures and approaches could also prove useful.

It is by logic we prove, it is by intuition that we invent.
Henri Poincaré, 1904

Appendices

A Theorems on cos² kernels

Theorem A.1 For cos² kernels with ω = π/N, a group of N consecutive channels starting at index k has a common active domain of

S_k^N = ]k − 1 + N/2, k + N/2[ .

Proof: The active domain (non-zero domain, or support) of a channel is defined as

S_k = {x : B_k(x) > 0} = ]L_k, U_k[ .     (A.1)

Since the kernels should go smoothly to zero (as discussed in section 3.2.2), this is always an open interval, as indicated by the brackets. For the cos² kernel (3.2) we have domains of the type

S_k = ]k − π/(2ω), k + π/(2ω)[ .     (A.2)

For ω = π/N this becomes

S_k = ]k − N/2, k + N/2[ .     (A.3)
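As a numeric illustration of these kernels, the sketch below channel encodes a scalar with cos² kernels at integer positions 1..K (an assumption for this sketch) and can be used to verify the constant-sum property proved in theorem A.3 further on:

```python
import math

# Channel encoding with cos^2 kernels B_k(x) = cos^2(w(x - k)) for
# |x - k| < N/2 and zero outside, where w = pi/N, cf. (A.2)-(A.3).
# Theorem A.3 states that the channel values sum to N/2 for any x in
# the represented domain ]N/2, K + 1 - N/2].

def channel_encode(x, K, N):
    w = math.pi / N
    vals = []
    for k in range(1, K + 1):
        d = x - k
        # Active domain ]k - N/2, k + N/2[, zero outside.
        vals.append(math.cos(w * d) ** 2 if abs(d) < N / 2 else 0.0)
    return vals
```

For example, with N = 3 and K = 8, any encoded value activates exactly three consecutive channels, and the channel values sum to 3/2.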
The common active domain of N channels, S_k^N, becomes

S_k^N = S_k ∩ S_{k+1} ∩ ... ∩ S_{k+N−1} = ]L_{k+N−1}, U_k[     (A.4)
     = ]k + N − 1 − N/2, k + N/2[ = ]k − 1 + N/2, k + N/2[ .     (A.5)

This concludes the proof. ∎

Theorem A.2 For cos² kernels with ω = π/N, and a local decoding using N channels, the represented domain of a K channel set becomes

R_K^N = ]N/2, K + 1 − N/2] .

Proof: If we perform the local decoding using groups of N channels with ω = π/N, we will have decoding intervals according to theorem A.1. Note that we need N ∈ ℕ\{0, 1} in order to have a proper decoding. These intervals are all of length 1, and thus they do not overlap. We now modify the upper end of the intervals,

S_k^N = ]k − 1 + N/2, k + N/2] ,     (A.6)

in order to be able to join them. This makes no practical difference, since all that happens at the boundary is that one channel becomes inactive. For a channel representation using K channels (with K ≥ N) we get a represented interval of type

R_K^N = S_1^N ∪ S_2^N ∪ ... ∪ S_{K−N+1}^N = ]L_{1+N−1}, U_{K−N+1}]     (A.7)
     = ]N/2, K − N + 1 + N/2] = ]N/2, K + 1 − N/2] .     (A.8)

This concludes the proof. ∎

Theorem A.3 The sum of a channel value vector (B_1(x) B_2(x) ... B_K(x))^T with ω = π/N, where N ∈ ℕ\{0, 1}, is invariant to the value of x when x ∈ R_K^N.

Proof: According to theorem A.1, groups of N consecutive channels with ω = π/N have mutually non-overlapping active domains S_k^N. This means that for a given channel vector, the value x will fall into exactly one of these domains, S_k^N. Thus the sum over the entire channel set is equal to the sum over the channels belonging to S_k^N, for some value of k:

Σ_{n=1}^{K} B_n(x) = Σ_{n=k}^{k+N−1} B_n(x) = Σ_{n=0}^{N−1} B_{k+n}(x) .     (A.9)

We now define a complex valued function

v_k(x) = e^{i2ω(x−k)} .     (A.10)

This allows us to write the kernel function B_k(x) as

B_k(x) = cos²(ω(x − k)) = 0.5 + 0.5 cos(2ω(x − k)) = 0.5 + 0.5 Re[v_k(x)] .

Now the sum in (A.9) becomes
Σ_{n=0}^{N−1} B_{k+n}(x) = N/2 + (1/2) Re[Σ_{n=0}^{N−1} v_{k+n}(x)] .     (A.11)

The complex sum in this expression can be rewritten as

Σ_{n=0}^{N−1} v_{k+n}(x) = Σ_{n=0}^{N−1} e^{i2ω(x−k−n)} = e^{i2ω(x−k)} Σ_{n=0}^{N−1} (e^{−i2ω})^n .     (A.12)

For e^{−i2ω} ≠ 1 this geometric sum can be written as¹

Σ_{n=0}^{N−1} (e^{−i2ω})^n = (1 − e^{−i2ωN}) / (1 − e^{−i2ω}) .     (A.13)

¹The case e^{−i2ω} = 1 never happens, since it is equivalent to ω = nπ, where n ∈ ℕ, and our assumption was ω = π/N, N ∈ ℕ\{0, 1}.

The numerator of this expression is zero exactly when ωN = nπ, n ∈ ℕ. Since our assumption was ω = π/N, N ∈ ℕ\{0, 1}, it is always zero. From this follows that the exponential sum in equation (A.11) also equals zero. We can now reformulate equation (A.11) as

Σ_{n=0}^{N−1} B_{k+n}(x) = N/2  for ω = π/N where N ∈ ℕ\{0, 1}.     (A.14)

This in conjunction with (A.9) proves the theorem. ∎

Theorem A.4 The sum of a squared channel value vector (B_1(x)² B_2(x)² ... B_K(x)²)^T with ω = π/N, where N ∈ ℕ\{0, 1, 2}, is invariant to the value of x when x ∈ R_K^N.

Proof: The proof of this theorem is similar to the proof of theorem A.3. With the same reasoning as in (A.9) we get

Σ_{n=1}^{K} B_n(x)² = Σ_{n=k}^{k+N−1} B_n(x)² = Σ_{n=0}^{N−1} B_{k+n}(x)² .     (A.15)

We now rewrite the squared kernel function as

B_k(x)² = cos⁴(ω(x − k)) = 3/8 + (1/2) cos(2ω(x − k)) + (1/8) cos(4ω(x − k)) .

This allows us to rewrite (A.15) as

Σ_{n=0}^{N−1} B_{k+n}(x)² = 3N/8 + (1/2) Re[Σ_{n=0}^{N−1} v_{k+n}(x)] + (1/8) Re[Σ_{n=0}^{N−1} v_{k+n}²(x)] .     (A.16)

The first complex sum in this expression is zero for ω = π/N, where N ∈ ℕ\{0, 1} (see equations (A.12) and (A.13)). The second sum can be written as

Σ_{n=0}^{N−1} v_{k+n}²(x) = Σ_{n=0}^{N−1} e^{i4ω(x−k−n)} = e^{i4ω(x−k)} Σ_{n=0}^{N−1} (e^{−i4ω})^n .     (A.17)

For e^{−i4ω} ≠ 1 (that is, ω ≠ nπ/2 where n ∈ ℕ)², this geometric sum can be written as

²In effect this excludes the solutions N = 1 and N = 2.
Σ_{n=0}^{N−1} (e^{−i4ω})^n = (1 − e^{−i4ωN}) / (1 − e^{−i4ω}) .     (A.18)

The numerator of this expression is zero exactly when ω = nπ/(2N) for integers n and N, and again our premise was ω = π/N, N ∈ ℕ\{0, 1, 2}, so it is always zero. The constraints on equation (A.18) require us to exclude the cases N ∈ {0, 1, 2}. We can now reformulate equation (A.16) as

Σ_{n=0}^{N−1} B_{k+n}(x)² = 3N/8  for ω = π/N where N ∈ ℕ\{0, 1, 2}.     (A.19)

This in conjunction with (A.15) proves the theorem. ∎

Observation A.5 We will now derive a local decoding for the cos² kernel when ω = π/2. For this case we can also define a local decoding, but it is more difficult to decide whether the decoding is valid or not. We now have the system

(u_l, u_{l+1})^T = (r B_l(x), r B_{l+1}(x))^T = (r cos²(π/2 (x − l)), r cos²(π/2 (x − l − 1)))^T .     (A.20)

Since cos(x − π/2) = sin(x) we have

(u_l, u_{l+1})^T = (r cos²(π/2 (x − l)), r sin²(π/2 (x − l)))^T .     (A.21)

We now see that

x̂ = l + (2/π) arg[√u_l + i√u_{l+1}] = l + (2/π) tan⁻¹(√(u_{l+1}/u_l))     (A.22)

and

r̂_1 = |u_l + iu_{l+1}|  and  r̂_2 = u_l + u_{l+1} .     (A.23)

In order to select valid decodings, we cannot simply check if x̂ is inside the common support, since this is always the case. One way to avoid giving invalid solutions is to require that r̂_2(l) ≥ r̂_2(l + 1) and r̂_2(l) ≥ r̂_2(l − 1). ∎

Theorem A.6 The cos² local decoding is an unbiased estimate of the mean, if the PDF f(x) is even about a point µ, and restricted to the decoding support S_l^N:

f(x) = f(2µ − x) and supp{f} ⊂ S_l^N  ⇒  E{x̂} = E{x_n} .

The local decoding of a cos² channel representation consists of two steps: a linear parameter estimation and a non-linear combination of the parameters into estimates of the mode location and the confidence. The expected value of the linear parameter estimation is

E{p} = (∫_{S_l^N} cos(2ω(x − l)) f(x) dx, ∫_{S_l^N} sin(2ω(x − l)) f(x) dx)^T / ∫_{S_l^N} f(x) dx     (A.24)

if we require that supp{f} ⊂ S_l^N, see section 4.2.2. We now simplify the notation by denoting c(x) = cos(2ωx) and s(x) = sin(2ωx).
Further, we assume that f is even about a point µ, i.e. f(x) = f(2µ − x). This allows us to rewrite E{p_1} as

E{p_1} = ∫_{S_l^N} f(x) c(x − l) dx = ∫_{S_l^N} f(x) c(x − µ + µ − l) dx     (A.25)
       = ∫_{S_l^N} f(x) [c(x − µ)c(µ − l) − s(x − µ)s(µ − l)] dx     (A.26)
       = c(µ − l) ∫_{S_l^N} f(x) c(x − µ) dx − s(µ − l) ∫_{S_l^N} f(x) s(x − µ) dx     (A.27)
       = c(µ − l) ∫_{S_l^N} f(x) c(x − µ) dx     (A.28)

where the second integral in (A.27) becomes zero due to antisymmetry about µ. In a similar way we can rewrite E{p_2} as

E{p_2} = s(µ − l) ∫_{S_l^N} f(x) c(x − µ) dx .     (A.29)

We now denote the integral by α:

α = ∫_{S_l^N} f(x) c(x − µ) dx .     (A.30)

This allows us to write

E{p_1} = α cos(2ω(µ − l))     (A.31)
E{p_2} = α sin(2ω(µ − l)) .     (A.32)

Finally we insert these two expressions into the non-linear step of the decoding:

E{x̂} = l + (1/(2ω)) arg[E{p_1} + iE{p_2}]     (A.33)
     = l + (1/(2ω)) arg[α e^{i2ω(µ−l)}] = l + (1/(2ω)) (2ω(µ − l))     (A.34)
     = µ .     (A.35)

For a density that is even about µ, we also get

E{x_n} = ∫ x f(x) dx     (A.36)
       = ∫ (2µ − x) f(2µ − x) dx = 2µ − ∫ x f(2µ − x) dx = 2µ − ∫ x f(x) dx .     (A.37)

Setting the right-hand side of (A.36) equal to the right-hand side of (A.37) gives

E{x_n} = ∫ x f(x) dx = µ .     (A.38)

Together with (A.35) this gives E{x̂} = E{x_n}, which concludes the proof. ∎

B Theorems on B-splines

Theorem B.1 The sum of integer-shifted B-splines is independent of the encoded scalar for any degree n:

Σ_k B_k^n(x) = 1  ∀x, n ∈ ℕ .

Proof: B-splines of degree zero are defined as

B_k^0(x) = 1 if k − 0.5 ≤ x < k + 0.5, and 0 otherwise.

From this it trivially follows that the zeroth degree sum is constant, since exactly one B-spline is non-zero, and equal to 1, at a time. That is,

Σ_k B_k^0(x) = 1  ∀x .     (B.1)
Using the recurrence relation (5.10), we can express the sum of an arbitrary degree as

Σ_k B_k^n(x) = Σ_k [((x − k + (n+1)/2)/n) B_{k−1/2}^{n−1}(x) + (((n+1)/2 − x + k)/n) B_{k+1/2}^{n−1}(x)]     (B.2, B.3)
             = Σ_l ((x − l + n/2)/n) B_l^{n−1}(x) + Σ_l ((n/2 − x + l)/n) B_l^{n−1}(x)     (B.4, B.5)
             = Σ_l ((x − l + n/2)/n + (n/2 − x + l)/n) B_l^{n−1}(x)     (B.6)
             = Σ_l B_l^{n−1}(x) .     (B.7)

That is, the sum of B-splines of degree n is equal to the sum of degree n − 1. This in conjunction with (B.1) proves theorem B.1 by induction. ∎

Theorem B.2 The first moment of integer-shifted B-splines is equal to the encoded scalar for any degree n ≥ 1:

Σ_k k B_k^n(x) = x  ∀x, n ∈ ℕ⁺ .

Proof: Using (5.11), the first moment of B-splines of degree one can be written as

Σ_k k B_k^1(x) = Σ_k k [(x − k + 1) B_{k−1/2}^0(x) + (1 − x + k) B_{k+1/2}^0(x)]     (B.8)
               = Σ_l (l + 1/2)(x − l + 1/2) B_l^0(x) + Σ_l (l − 1/2)(1/2 − x + l) B_l^0(x)     (B.9)
               = Σ_l (1/4 − l² + lx + x/2) B_l^0(x) + Σ_l (l² − 1/4 − lx + x/2) B_l^0(x)     (B.10)
               = Σ_l x B_l^0(x) = x Σ_l B_l^0(x) = x .     (B.11, B.12)

We now expand the first moment using the recurrence relation (5.10):

Σ_k k B_k^n(x) = Σ_k k [((x − k + (n+1)/2)/n) B_{k−1/2}^{n−1}(x) + (((n+1)/2 − x + k)/n) B_{k+1/2}^{n−1}(x)]     (B.13, B.14, B.15)
               = Σ_l ((l + 1/2)(x − l + n/2)/n) B_l^{n−1}(x) + Σ_l ((l − 1/2)(n/2 − x + l)/n) B_l^{n−1}(x)     (B.16)
               = Σ_l ((2lx − 2l² + ln + x − l + n/2)/(2n)) B_l^{n−1}(x) + Σ_l ((ln − 2lx + 2l² − n/2 + x − l)/(2n)) B_l^{n−1}(x)     (B.17, B.18)
               = Σ_l ((ln + x − l)/n) B_l^{n−1}(x) .     (B.19)

By applying theorem B.1 we get

Σ_k k B_k^n(x) = Σ_l ((ln + x − l)/n) B_l^{n−1}(x) = x/n + (1 − 1/n) Σ_l l B_l^{n−1}(x) .     (B.20, B.21)

If we assume the theorem holds for n − 1, we get

Σ_k k B_k^n(x) = x/n + (1 − 1/n) Σ_l l B_l^{n−1}(x) = x/n + (1 − 1/n) x = x .     (B.22)
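Theorems B.1 and B.2 are easy to verify numerically; the sketch below builds integer-centred B-splines through the recurrence (5.10) (written here in centred form with k = 0; the function name is our own):

```python
# B-spline of degree n via the recurrence (5.10), centred at zero:
# B^n(x) = ((x + (n+1)/2)/n) B^{n-1}(x + 1/2)
#        + (((n+1)/2 - x)/n) B^{n-1}(x - 1/2).
# Shifted kernels are obtained as bspline(n, x - k).

def bspline(n, x):
    if n == 0:
        # Degree zero: box function on [-0.5, 0.5[.
        return 1.0 if -0.5 <= x < 0.5 else 0.0
    return ((x + (n + 1) / 2) / n) * bspline(n - 1, x + 0.5) \
         + (((n + 1) / 2 - x) / n) * bspline(n - 1, x - 0.5)
```

Summing the integer shifts for degree 2 reproduces both the partition of unity (theorem B.1) and the first-moment property (theorem B.2) to machine precision.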
This in conjunction with (B.12) proves theorem B.2 by induction. ∎

C Theorems on ellipse functions

Theorem C.1 The matrix A describing the shape of an ellipse, (x − m)^T A (x − m) ≤ 1, is related to the inertia matrix I of the same ellipse according to

I = (1/4) A^{−1}  or  A = (1/4) I^{−1} .

Proof: A surface patch in the shape of an ellipse is the set of points x = (x_1 x_2)^T fulfilling the relation (x_1/a)² + (x_2/b)² ≤ 1. This can be rewritten as

x^T D x ≤ 1  for  D = [[1/a², 0], [0, 1/b²]] .     (C.1)

In order to describe an ellipse with arbitrary position and orientation, we add a rotation R = (r_1 r_2), and a translation m = (m_1 m_2)^T, and obtain

(x − m)^T A (x − m) ≤ 1  where  A = R^T D R .     (C.2)

Note that the square root of the left-hand expression is a Mahalanobis distance between m and x, with A defining the metric, see e.g. [7]. A often corresponds to the inverse covariance of a data set. For the ellipse described by A and m, we can define a binary mask

v(x) = 1 if (x − m)^T A (x − m) ≤ 1, and 0 otherwise.     (C.3)

The mask v(x) has moments that in the continuous case are given by

µ_kl = ∫_{ℝ²} x_1^k x_2^l v(x) dx = ∫_{(x−m)^T A(x−m) ≤ 1} x_1^k x_2^l dx     (C.4)
     = ∫_{x^T R^T D R x ≤ 1} (x_1 + m_1)^k (x_2 + m_2)^l dx     (C.5)

[substituting y = Rx, dx = dy]

     = ∫_{y^T D y ≤ 1} (r_1^T y + m_1)^k (r_2^T y + m_2)^l dy     (C.6)

[substituting x = D^{1/2} y, dy = |D^{−1/2}| dx]

     = ∫_{x^T x ≤ 1} (r_1^T D^{−1/2} x + m_1)^k (r_2^T D^{−1/2} x + m_2)^l |D^{−1/2}| dx     (C.7)

[substituting x = ρ n̂ = ρ (cos ϕ, sin ϕ)^T, dx = ρ dρ dϕ, and |D^{−1/2}| = ab]

     = ∫_{−π}^{π} ∫_0^1 (r_1^T D^{−1/2} ρn̂ + m_1)^k (r_2^T D^{−1/2} ρn̂ + m_2)^l ab ρ dρ dϕ .     (C.8)

If we define the rotation R to be

R = [[cos φ, −sin φ], [sin φ, cos φ]] = (r_1 r_2)     (C.9)

we can simplify this to

µ_kl = ∫_{−π}^{π} ∫_0^1 (ρa cos φ cos ϕ + ρb sin φ sin ϕ + m_1)^k (−ρa sin φ cos ϕ + ρb cos φ sin ϕ + m_2)^l ab ρ dρ dϕ .     (C.10)
We can now verify the expressions for the low-order moments:

$$\mu_{00} = \int_{-\pi}^{\pi}\!\!\int_0^1 ab\,\rho\, d\rho\, d\varphi = \pi ab \qquad \text{(C.11)}$$

$$\mu_{10} = \int_{-\pi}^{\pi}\!\!\int_0^1 (\rho a\cos\phi\cos\varphi + \rho b\sin\phi\sin\varphi + m_1)\, ab\,\rho\, d\rho\, d\varphi = m_1 \pi ab \qquad \text{(C.12)}$$

$$\mu_{01} = \int_{-\pi}^{\pi}\!\!\int_0^1 (-\rho a\sin\phi\cos\varphi + \rho b\cos\phi\sin\varphi + m_2)\, ab\,\rho\, d\rho\, d\varphi = m_2 \pi ab \qquad \text{(C.13)}$$

$$\mu_{20} = \int_{-\pi}^{\pi}\!\!\int_0^1 (\rho a\cos\phi\cos\varphi + \rho b\sin\phi\sin\varphi + m_1)^2\, ab\,\rho\, d\rho\, d\varphi = \pi ab \left( \tfrac{1}{4}(a^2\cos^2\phi + b^2\sin^2\phi) + m_1^2 \right) \qquad \text{(C.14)}$$

$$\mu_{02} = \int_{-\pi}^{\pi}\!\!\int_0^1 (-\rho a\sin\phi\cos\varphi + \rho b\cos\phi\sin\varphi + m_2)^2\, ab\,\rho\, d\rho\, d\varphi = \pi ab \left( \tfrac{1}{4}(a^2\sin^2\phi + b^2\cos^2\phi) + m_2^2 \right) \qquad \text{(C.15)}$$

$$\mu_{11} = \int_{-\pi}^{\pi}\!\!\int_0^1 (\rho a\cos\phi\cos\varphi + \rho b\sin\phi\sin\varphi + m_1)(-\rho a\sin\phi\cos\varphi + \rho b\cos\phi\sin\varphi + m_2)\, ab\,\rho\, d\rho\, d\varphi = \pi ab \left( \tfrac{1}{4}(b^2 - a^2)\sin\phi\cos\phi + m_1 m_2 \right) \qquad \text{(C.16)}$$

We now group the three second order moments into a matrix

$$\begin{pmatrix} \mu_{20} & \mu_{11} \\ \mu_{11} & \mu_{02} \end{pmatrix}
= \frac{\pi ab}{4} \begin{pmatrix} \mathbf{r}_1^T\mathbf{D}^{-1}\mathbf{r}_1 & \mathbf{r}_1^T\mathbf{D}^{-1}\mathbf{r}_2 \\ \mathbf{r}_2^T\mathbf{D}^{-1}\mathbf{r}_1 & \mathbf{r}_2^T\mathbf{D}^{-1}\mathbf{r}_2 \end{pmatrix} + \pi ab\, \mathbf{m}\mathbf{m}^T \qquad \text{(C.17)}$$

$$= \frac{\pi ab}{4}\, \mathbf{R}^T\mathbf{D}^{-1}\mathbf{R} + \pi ab\, \mathbf{m}\mathbf{m}^T\,. \qquad \text{(C.18)}$$

By division with μ₀₀, see (C.11), we get

$$\frac{1}{\mu_{00}} \begin{pmatrix} \mu_{20} & \mu_{11} \\ \mu_{11} & \mu_{02} \end{pmatrix} = \frac{1}{4}\mathbf{R}^T\mathbf{D}^{-1}\mathbf{R} + \mathbf{m}\mathbf{m}^T\,. \qquad \text{(C.19)}$$

By subtraction of mmᵀ we obtain the definition of the inertia matrix

$$\mathbf{I} = \frac{1}{\mu_{00}} \begin{pmatrix} \mu_{20} & \mu_{11} \\ \mu_{11} & \mu_{02} \end{pmatrix} - \mathbf{m}\mathbf{m}^T = \frac{1}{4}\mathbf{R}^T\mathbf{D}^{-1}\mathbf{R}\,. \qquad \text{(C.20)}$$

Here we recognise the inverse of the ellipse matrix, A⁻¹ = RᵀD⁻¹R, see (C.2), and thus the ellipse matrix A is related to I as

$$\mathbf{I} = \frac{1}{4}\mathbf{A}^{-1} \quad\text{or}\quad \mathbf{A} = \frac{1}{4}\mathbf{I}^{-1}\,, \qquad \text{(C.21)}$$

which was what we set out to prove. □

Theorem C.2  The axes and the area of an ellipse can be extracted from its inertia matrix I according to

$$\{a, b\} = \{2\sqrt{\lambda_1},\; 2\sqrt{\lambda_2}\} \quad\text{and}\quad \text{Area} = 4\pi\sqrt{\det \mathbf{I}}\,.$$

Proof: For positive definite matrices, the eigenvectors constitute a rotation, and thus (C.20) is an eigenvalue decomposition of I. In other words, I = λ₁ê₁ê₁ᵀ + λ₂ê₂ê₂ᵀ has its eigenvalues in the diagonal of ¼D⁻¹, i.e. {λ₁, λ₂} = {a²/4, b²/4}. From this follows that {a, b} = {2√λ₁, 2√λ₂}. Since det I = λ₁λ₂ = (1/16)a²b², we can find the ellipse area as πab = 4π√(det I).

Also note that of all shapes with a given inertia matrix, the ellipse is the one that is best concentrated around m. This means that in the discrete case, the above area measure will always be an overestimate of the actual area, with exception of the degenerate case when all pixels lie on a line. □
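Theorems C.1 and C.2 can also be checked in the discrete setting they are intended for: rasterise the mask v(x) of (C.3) on a fine grid, estimate the inertia matrix from pixel moments, and compare with ¼A⁻¹ and the recovered axes. The sketch below is an illustration with arbitrarily chosen parameters.

```python
import math

# Hypothetical ellipse: semi-axes a, b; orientation phi; centre m = (m1, m2).
a, b, phi, m1, m2 = 3.0, 1.5, 0.4, 0.7, -0.2
c, s = math.cos(phi), math.sin(phi)

h = 0.01                      # grid spacing of the rasterised mask v(x)
mu00 = mu10 = mu01 = mu20 = mu11 = mu02 = 0.0
n = int(4.0 / h)
for i in range(-n, n + 1):
    x1 = i * h
    for j in range(-n, n + 1):
        x2 = j * h
        # y = R(x - m) maps into the axis-aligned frame, where the mask
        # condition (x-m)^T A (x-m) <= 1 reads (y1/a)^2 + (y2/b)^2 <= 1
        y1 = c * (x1 - m1) - s * (x2 - m2)
        y2 = s * (x1 - m1) + c * (x2 - m2)
        if (y1 / a) ** 2 + (y2 / b) ** 2 <= 1.0:
            mu00 += 1.0; mu10 += x1; mu01 += x2
            mu20 += x1 * x1; mu11 += x1 * x2; mu02 += x2 * x2

g1, g2 = mu10 / mu00, mu01 / mu00          # centre of gravity, approx. m
I11 = mu20 / mu00 - g1 * g1                # inertia matrix as in (C.20)
I12 = mu11 / mu00 - g1 * g2
I22 = mu02 / mu00 - g2 * g2

# Theorem C.1: I should approximate (1/4) A^{-1} = (1/4) R^T D^{-1} R
Q11 = 0.25 * ((a * c) ** 2 + (b * s) ** 2)
Q12 = 0.25 * (b * b - a * a) * s * c
Q22 = 0.25 * ((a * s) ** 2 + (b * c) ** 2)
assert max(abs(I11 - Q11), abs(I12 - Q12), abs(I22 - Q22)) < 0.05

# Theorem C.2: axes and area from the eigenvalues of I (2x2 closed form)
tr, det = I11 + I22, I11 * I22 - I12 * I12
lam1 = 0.5 * (tr + math.sqrt(tr * tr - 4.0 * det))
lam2 = 0.5 * (tr - math.sqrt(tr * tr - 4.0 * det))
assert abs(2.0 * math.sqrt(lam1) - a) < 0.05     # major semi-axis
assert abs(2.0 * math.sqrt(lam2) - b) < 0.05     # minor semi-axis
assert abs(4.0 * math.pi * math.sqrt(det) - math.pi * a * b) < 0.2
```

The small residuals come from the rasterisation of the boundary; consistent with the remark above, the pixel-based estimates are approximate, while the relations themselves are exact in the continuous case.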
Theorem C.3  The outline of an ellipse is given by the parameter curve

$$\mathbf{x} = \mathbf{R}^T \mathbf{D}^{-1/2} \begin{pmatrix} \cos t \\ \sin t \end{pmatrix} + \mathbf{m} \quad\text{for } t \in [0, 2\pi[\,. \qquad \text{(C.22)}$$

Proof: An ellipse is the set of points x ∈ R² fulfilling relation (C.2). By inserting the parameter curve (C.22) into the quadratic form of (C.2) we obtain

$$\begin{aligned}
(\mathbf{x}-\mathbf{m})^T \mathbf{A}\, (\mathbf{x}-\mathbf{m})
&= (\mathbf{x}-\mathbf{m})^T \mathbf{R}^T\mathbf{D}\mathbf{R}\, (\mathbf{x}-\mathbf{m}) && \text{(C.23)}\\
&= \begin{pmatrix} \cos t & \sin t \end{pmatrix} \mathbf{D}^{-1/2}\mathbf{R}\mathbf{R}^T\mathbf{D}\mathbf{R}\mathbf{R}^T\mathbf{D}^{-1/2} \begin{pmatrix} \cos t \\ \sin t \end{pmatrix} && \text{(C.24)}\\
&= \cos^2 t + \sin^2 t = 1\,. && \text{(C.25)}
\end{aligned}$$

Thus all points in (C.22) belong to the ellipse outline. Note that (C.22) is a convenient way to draw the ellipse outline. □

Bibliography

[1] Y. Aloimonos, I. Weiss, and A. Bandopadhay. Active vision. Int. Journal of Computer Vision, 1(3):333–356, 1988.
[2] V. Aurich and J. Weule. Non-linear Gaussian filters performing edge preserving diffusion. In 17th DAGM-Symposium, pages 538–545, Bielefeld, 1995.
[3] R. Bajcsy. Active perception. Proceedings of the IEEE, 76(8):996–1005, August 1988.
[4] D. H. Ballard. Animate vision. In Proc. Int. Joint Conf. on Artificial Intelligence, pages 1635–1641, 1989.
[5] M. F. Bear, B. W. Connors, and M. A. Paradiso. Neuroscience: Exploring the Brain. Williams & Wilkins, 1996. ISBN 0-683-00488-3.
[6] S. Belongie, C. Carson, H. Greenspan, and J. Malik. Color- and texture-based image segmentation using EM and its application to content based image retrieval. In Proceedings of the Sixth International Conference on Computer Vision, pages 675–682, 1998.
[7] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995. ISBN 0-19-853864-2.
[8] M. Black, G. Sapiro, D. Marimont, and D. Heeger. Robust anisotropic diffusion.
IEEE Transactions on Image Processing, 7(3):421–432, March 1998.
[9] A. Blake. The Handbook of Brain Theory and Neural Networks, chapter Active Vision, pages 61–63. MIT Press, 1995. M. A. Arbib, Ed.
[10] M. Borga. Learning Multidimensional Signal Processing. PhD thesis, Linköping University, SE-581 83 Linköping, Sweden, 1998. Dissertation No 531, ISBN 91-7219-202-X.
[11] R. Brooks. A robust layered control system for a mobile robot. IEEE Trans. on Robotics and Automation, 2(1):14–23, March 1986.
[12] H. H. Bülthoff, S. Y. Edelman, and M. J. Tarr. How are three-dimensional objects represented in the brain? A.I. Memo No. 1479, April 1994. MIT AI Lab.
[13] F. W. Campbell and J. G. Robson. Application of Fourier analysis to the visibility of gratings. J. Physiol., 197:551–566, 1968.
[14] J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(6):255–274, November 1986.
[15] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: Image segmentation using expectation-maximisation and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8):1026–1038, August 2002.
[16] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, August 1995.
[17] L. Chrisman. Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach. In National Conference on Artificial Intelligence, pages 183–188, 1992.
[18] CMU Neural Networks Benchmark Collection, http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/neural/bench/cmu/.
[19] D. Comaniciu and P. Meer. Mean shift analysis and applications. In Proceedings of ICCV'99, pages 1197–1203, Corfu, Greece, September 1999.
[20] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, May 2002.
[21] G. Dahlquist and Å. Björck. Numerical Methods and Scientific Computation, chapter Interpolation and related subjects. SIAM, Philadelphia, 2003. In press.
[22] I. Daubechies. The wavelet transform, time-frequency localization and signal analysis. IEEE Trans. on Information Theory, 36(5):961–1005, September 1990.
[23] R. Dawkins. The Blind Watchmaker. Penguin Books, 1986.
[24] P. Doherty, G. Granlund, K. Kuchcinski, E. Sandewall, K. Nordberg, E. Skarman, and J. Wiklund. The WITAS Unmanned Aerial Vehicle Project. In W. Horn, editor, ECAI 2000. Proceedings of the 14th European Conference on Artificial Intelligence, pages 747–755, Berlin, August 2000.
[25] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV99, pages 1033–1038, Corfu, Greece, September 1999.
[26] G. Farnebäck. Polynomial Expansion for Orientation and Motion Estimation. PhD thesis, Linköping University, SE-581 83 Linköping, Sweden, 2002. Dissertation No 790, ISBN 91-7373-475-6.
[27] M. Felsberg. Low Level Image Processing with the Structure Multivector. PhD thesis, Christian-Albrechts-Universität, Kiel, March 2002.
[28] M. Felsberg and G. Granlund. Anisotropic channel filtering. In Proceedings of the 13th Scandinavian Conference on Image Analysis, LNCS 2749, pages 755–762, Gothenburg, Sweden, June–July 2003.
[29] M. Felsberg, H. Scharr, and P.-E. Forssén. The B-spline channel representation: Channel algebra and channel based diffusion filtering. Technical Report LiTH-ISY-R-2461, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, September 2002.
[30] M. Felsberg, H. Scharr, and P.-E. Forssén. Channel smoothing. IEEE PAMI, 2004. Submitted.
[31] D. J. Field. What is the goal of sensory coding? Neural Computation, 1994.
[32] P.-E. Forssén. Updating Camera Location and Heading using a Sparse Displacement Field. Technical Report LiTH-ISY-R-2318, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, November 2000.
[33] P.-E. Forssén.
Image Analysis using Soft Histograms. In Proceedings of the SSAB Symposium on Image Analysis, pages 109–112, Norrköping, March 2001. SSAB.
[34] P.-E. Forssén. Sparse Representations for Medium Level Vision. Lic. Thesis LiU-Tek-Lic-2001:06, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, February 2001. Thesis No. 869, ISBN 91-7219-951-2.
[35] P.-E. Forssén. Window Matching using Sparse Templates. Technical Report LiTH-ISY-R-2392, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, September 2001.
[36] P.-E. Forssén. Observations Concerning Reconstructions with Local Support. Technical Report LiTH-ISY-R-2425, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, April 2002.
[37] P.-E. Forssén. Successive Recognition using Local State Models. In Proceedings SSAB02 Symposium on Image Analysis, pages 9–12, Lund, March 2002. SSAB.
[38] P.-E. Forssén. Channel smoothing using integer arithmetic. In Proceedings SSAB03 Symposium on Image Analysis, Stockholm, March 2003. SSAB.
[39] P.-E. Forssén and G. Granlund. Sparse Feature Maps in a Scale Hierarchy. In AFPAC, Algebraic Frames for the Perception Action Cycle, Kiel, Germany, September 2000.
[40] P.-E. Forssén and G. Granlund. Blob Detection in Vector Fields using a Clustering Pyramid. Technical Report LiTH-ISY-R-2477, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, November 2002.
[41] P.-E. Forssén and G. Granlund. Robust multi-scale extraction of blob features. In Proceedings of the 13th Scandinavian Conference on Image Analysis, LNCS 2749, pages 11–18, Gothenburg, Sweden, June–July 2003.
[42] P.-E. Forssén, G. Granlund, and J. Wiklund. Channel Representation of Colour Images. Technical Report LiTH-ISY-R-2418, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, March 2002.
[43] K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition.
IEEE Transactions on Information Theory, 21(1):32–40, 1975.
[44] F. Godtliebsen, E. Spjøtvoll, and J. Marron. A nonlinear Gaussian filter applied to images with discontinuities. J. Nonpar. Statist., 8:21–43, 1997.
[45] G. H. Granlund. Magnitude Representation of Features in Image Analysis. In The 6th Scandinavian Conference on Image Analysis, pages 212–219, Oulu, Finland, June 1989.
[46] G. H. Granlund. The complexity of vision. Signal Processing, 74(1):101–126, April 1999. Invited paper.
[47] G. H. Granlund. An Associative Perception-Action Structure Using a Localized Space Variant Information Representation. In Proceedings of Algebraic Frames for the Perception-Action Cycle (AFPAC), Kiel, Germany, September 2000.
[48] G. H. Granlund and H. Knutsson. Signal Processing for Computer Vision. Kluwer Academic Publishers, 1995. ISBN 0-7923-9530-1.
[49] G. Granlund. Does Vision Inevitably Have to be Active? In Proceedings of the 11th Scandinavian Conference on Image Analysis, Kangerlussuaq, Greenland, June 7–11 1999. SCIA. Also as Technical Report LiTH-ISY-R-2247.
[50] G. Granlund, P.-E. Forssén, and B. Johansson. HiperLearn: A high performance learning architecture. Technical Report LiTH-ISY-R-2409, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, January 2002.
[51] G. Granlund, P.-E. Forssén, and B. Johansson. HiperLearn: A high performance channel learning architecture. IEEE Transactions on Neural Networks, 2003. Submitted.
[52] G. Granlund, K. Nordberg, J. Wiklund, P. Doherty, E. Skarman, and E. Sandewall. WITAS: An Intelligent Autonomous Aircraft Using Active Vision. In Proceedings of the UAV 2000 International Technical Conference and Exhibition, Paris, France, June 2000. Euro UVS.
[53] G. H. Granlund and A. Moe. Unrestricted recognition of 3-D objects using multi-level triplet invariants. In Proceedings of the Cognitive Vision Workshop, Zürich, Switzerland, September 2002. URL: http://www.vision.ethz.ch/cogvis02/.
[54] R. M. Gray.
Dithered quantizers. IEEE Transactions on Information Theory, 39(3):805–812, 1993.
[55] L. Haglund. Adaptive Multidimensional Filtering. PhD thesis, Linköping University, SE-581 83 Linköping, Sweden, October 1992. Dissertation No 284, ISBN 91-7870-988-1.
[56] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust Statistics: The approach based on influence functions. John Wiley and Sons, New York, 1986.
[57] R. I. Hartley. In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6):580–593, June 1997.
[58] S. Haykin. Neural Networks – A comprehensive foundation. Prentice Hall, Upper Saddle River, New Jersey, 2nd edition, 1999. ISBN 0-13-273350-1.
[59] D. Hearn and P. Baker. Computer Graphics, 2nd ed. Prentice Hall International, 1994. ISBN 0-13-159690-X.
[60] C. M. Hicks. The application of dither and noise-shaping to Nyquist-rate digital audio: an introduction. Technical report, Communications and Signal Processing Group, Cambridge University Engineering Department, United Kingdom, 1995.
[61] I. P. Howard and B. J. Rogers. Binocular Vision and Stereopsis. Oxford Psychology Series, 29. Oxford University Press, New York, 1995. ISBN 0195084764.
[62] P. Huber. Robust estimation of a location parameter. Ann. Math. Statist., 35:73–101, 1964.
[63] A. Jain, M. Murty, and P. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, September 1999.
[64] B. Johansson. Multiscale curvature detection in computer vision. Lic. Thesis LiU-Tek-Lic-2001:14, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, March 2001. Thesis No. 877, ISBN 91-7219-999-7.
[65] B. Johansson and G. Granlund. Fast selective detection of rotational symmetries using normalized inhibition. In Proceedings of the 6th European Conference on Computer Vision, volume I, pages 871–887, Dublin, Ireland, June 2000.
[66] B. Johansson and A. Moe. Patch-duplets for object recognition and pose estimation.
Technical Report LiTH-ISY-R-2553, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, November 2003.
[67] H. Knutsson, M. Andersson, and J. Wiklund. Advanced Filter Design. In Proceedings of the 11th Scandinavian Conference on Image Analysis, Greenland, June 1999. SCIA. Also as report LiTH-ISY-R-2142.
[68] P. Kovesi. Image features from phase congruency. Tech. Report 95/4, University of Western Australia, Dept. of CS, 1995.
[69] T. Landelius. Reinforcement Learning and Distributed Local Model Synthesis. PhD thesis, Linköping University, SE-581 83 Linköping, Sweden, 1997. Dissertation No 469, ISBN 91-7871-892-9.
[70] B. Leibe and B. Schiele. Analyzing appearance and contour based methods for object categorization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR'03), June 2003.
[71] M. W. Levine and J. M. Shefner. Fundamentals of sensation and perception. Addison-Wesley, 1981.
[72] T. Lindeberg. Scale-space Theory in Computer Vision. Kluwer Academic Publishers, 1994. ISBN 0792394186.
[73] T. Lindeberg and J. Gårding. Shape from texture in a multi-scale perspective. In Proc. 4th International Conference on Computer Vision, pages 683–691, Berlin, Germany, May 1993.
[74] D. Marr. Vision. W. H. Freeman and Company, New York, 1982.
[75] C. Meunier and J. P. Nadal. The Handbook of Brain Theory and Neural Networks, chapter Sparsely Coded Neural Networks, pages 899–901. MIT Press, 1995. M. A. Arbib, Ed.
[76] S. Mitaim and B. Kosko. Adaptive joint fuzzy sets for function approximation. In International Conference on Neural Networks (ICNN-97), pages 537–542, June 1997.
[77] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281–293, 1989.
[78] M. C. Morrone, J. R. Ross, and R. A. Owens. Mach bands are phase dependent. Nature, 324:250–253, 1986.
[79] A. Nieder, D. Freedman, and E. Miller.
Representation of the quantity of visual items in the primate prefrontal cortex. Science, 297:1708–1711, 6 September 2002.
[80] K. Nordberg, G. Granlund, and H. Knutsson. Representation and Learning of Invariance. In Proceedings of IEEE International Conference on Image Processing, Austin, Texas, November 1994. IEEE.
[81] K. Nordberg, P. Doherty, G. Farnebäck, P.-E. Forssén, G. Granlund, A. Moe, and J. Wiklund. Vision for a UAV helicopter. In Proceedings of IROS'02, workshop on aerial robotics, Lausanne, Switzerland, October 2002.
[82] S. Obdrzalek and J. Matas. Object recognition using local affine frames on distinguished regions. In Proceedings of the British Machine Vision Conference, pages 113–122, London, 2002. ISBN 1-901725-19-7.
[83] J. K. O'Regan. Solving the 'real' mysteries of visual perception: The world as an outside memory. Canadian Journal of Psychology, 46:461–488, 1992.
[84] R. Palm, H. Hellendoorn, and D. Driankov. Model Based Fuzzy Control. Springer-Verlag, Berlin, 1996. ISBN 3-540-61471-0.
[85] P. Perona and J. Malik. Detecting and localizing edges composed of steps, peaks and roofs. In Proceedings of ICCV, pages 52–57, 1990.
[86] D. Reisfeld. The constrained phase congruency feature detector: simultaneous localization, classification, and scale determination. Pattern Recognition Letters, 17(11):1161–1169, 1996.
[87] H. Scharr, M. Felsberg, and P.-E. Forssén. Noise adaptive channel smoothing of low-dose images. In CVPR Workshop: Computer Vision for the Nano Scale, June 2003.
[88] S. M. Smith and J. M. Brady. SUSAN – a new approach to low level image processing. International Journal of Computer Vision, 23(1):45–78, 1997.
[89] H. Snippe and J. Koenderink. Discrimination thresholds for channel-coded systems. Biological Cybernetics, 66:543–551, 1992.
[90] H. Snippe and J. Koenderink. Information in channel-coded systems: correlated receivers. Biological Cybernetics, 67:183–190, 1992.
[91] I. Sobel.
Camera models and machine perception. Technical Report AIM-21, Stanford Artificial Intelligence Laboratory, Palo Alto, California, 1970.
[92] M. Sonka, V. Hlavac, and R. Boyle. Image Processing, Analysis, and Machine Vision. International Thomson Publishing Inc., 1999. ISBN 0-534-95393-X.
[93] H. Spies and P.-E. Forssén. Two-dimensional channel representation for multiple velocities. In Proceedings of the 13th Scandinavian Conference on Image Analysis, LNCS 2749, pages 356–362, Gothenburg, Sweden, June–July 2003.
[94] H. Spies and B. Johansson. Directional channel representation for multiple line-endings and intensity levels. In Proceedings of IEEE International Conference on Image Processing, Barcelona, Spain, September 2003.
[95] C. V. Stewart. Robust parameter estimation in computer vision. SIAM Review, 41(3):513–537, 1999.
[96] R. S. Sutton and A. G. Barto. Reinforcement Learning, An Introduction. MIT Press, Cambridge, Massachusetts, 1998. ISBN 0-262-19398-1.
[97] S. Thorpe. The Handbook of Brain Theory and Neural Networks, chapter Localized Versus Distributed Representations, pages 549–552. MIT Press, 1995. M. A. Arbib, Ed.
[98] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proceedings of the 6th ICCV, 1998.
[99] T. Tuytelaars and L. V. Gool. Matching widely separated views based on affinely invariant neighbourhoods. International Journal of Computer Vision, 2003. To appear.
[100] M. Unser. Splines: A perfect fit for signal and image processing. IEEE Signal Processing Magazine, pages 22–38, November 1999.
[101] N. Vlassis, B. Terwijn, and B. Kröse. Auxiliary particle filter robot localization from high-dimensional sensor observations. Technical Report IAS-UVA-01-05, Computer Science Institute, University of Amsterdam, The Netherlands, September 2001.
[102] M. Volgushev and U. T. Eysel. Noise Makes Sense in Neuronal Computing. Science, 290:1908–1909, December 2000.
[103] J. Weule.
Iteration nichtlinearer Gauss-Filter in der Bildverarbeitung [Iteration of nonlinear Gaussian filters in image processing]. PhD thesis, Heinrich-Heine-Universität Düsseldorf, 1994.
[104] G. Winkler and V. Liebscher. Smoothers for discontinuous signals. J. Nonpar. Statistics, 14(1-2):203–222, 2002.
[105] WITAS web page. http://www.ida.liu.se/ext/witas/.
[106] A. Witkin. Scale-space filtering. In 8th Int. Joint Conf. Artificial Intelligence, pages 1019–1022, Karlsruhe, 1983.
[107] A. Wrangsjö and H. Knutsson. Histogram filters for noise reduction. In C. Rother and S. Carlsson, editors, Proceedings of SSAB'03, pages 33–36, 2003.
[108] R. Zemel, P. Dayan, and A. Pouget. Probabilistic interpretation of population codes. Neural Computation, 10(2):403–430, 1998.
[109] Z. Zhang. Parameter estimation techniques: A tutorial. Technical Report 2676, INRIA, October 1995.
