Linköping Studies in Science and Technology
Thesis No. 562

Efficient Spatiotemporal Filtering and Modelling

Jörgen Karlholm

LIU-TEK-LIC-1995:27
Department of Electrical Engineering
Linköpings universitet, SE-581 83 Linköping, Sweden

© 1996 Jörgen Karlholm
ISBN 91-7871-741-8
ISSN 0280-7971

Abstract

The thesis describes novel methods for efficient spatiotemporal filtering and modelling. A multiresolution algorithm for energy-based estimation and representation of local spatiotemporal structure by second order symmetric tensors is presented. The problem of how to properly process estimates with varying degree of reliability is addressed. An efficient spatiotemporal implementation of a certainty-based signal modelling method called normalized convolution is described. As an application of the above results, a smooth pursuit motion tracking algorithm that uses observations of both target motion and position for camera head control and motion prediction is described. The target is detected using a novel motion field segmentation algorithm which assumes that the motion fields of the target and its immediate vicinity, at least occasionally, each can be modelled by a single parameterised motion model. A method to eliminate camera-induced background motion in the case of a pan/tilt rotating camera is suggested.

I dedicate this thesis to my teachers whose encouragement and stimulation have been invaluable to me.

Acknowledgements

First of all, I wish to thank past and present colleagues at the Computer Vision Laboratory for being great friends. It is a privilege to work with such talented people.

I thank Professor Gösta Granlund, head of the Laboratory, for showing confidence in me and my work.

Professor Hans Knutsson and Drs. Mats T. Andersson, Carl-Fredrik Westin and Carl-Johan Westelius have contributed with many ideas and suggestions. I am particularly grateful to Dr. Westelius, who developed the advanced graphics simulation ('virtual reality') that I have used extensively in my work.

I thank Professors Henrik I. Christensen, Erik Granum and Henning Nielsen, and Steen Kristensen and Paolo Pirjanian, all at Laboratory of Image Analysis, University of Aalborg, Denmark, for their kind hospitality during my visits.

Dr. Klas Nordberg and Morgan Ulvklo significantly reduced the number of errors in the manuscript.

Johan Wiklund provided excellent technical support and help with computer related problems.

This work was done within the European Union ESPRIT-BRA 7108 Vision as Process (VAP-II) project. Financial support from the Swedish National Board for Industrial and Technical Development (NUTEK) and the Swedish Defence Materials Administration (FMV) is gratefully acknowledged.

"Free fall is the natural state of motion."
- In C. W. Misner, K. S. Thorne, J. A. Wheeler: Gravitation

Contents

1 INTRODUCTION
1.1 Motivation
1.2 The thesis
1.3 Notation
1.4 Representing local structure with tensors
1.4.1 Energy-based local structure representation and estimation
1.4.2 Efficient implementation of spherically separable quadrature filters

2 MULTIRESOLUTION SPARSE TENSOR FIELD
2.1 Low-pass filtering, subsampling, and energy distribution
2.2 Building a consistent pyramid
2.3 Experimental results
2.4 Computation of dense optical flow

3 A SIGNAL CERTAINTY APPROACH TO SPATIOTEMPORAL FILTERING AND MODELLING
3.1 Background
3.2 Normalized and differential convolution
3.3 NDC for spatiotemporal filtering and modelling
3.4 Experimental results
3.5 Discussion

4 MOTION SEGMENTATION AND TRACKING
4.1 Space-variant sensing
4.2 Control aspects
4.3 Motion models and segmentation
4.3.1 Segmentation
4.3.2 Discussion
4.3.3 Motion models
4.4 On the use of normalized convolution in tracking
4.5 Experimental results

5 SUMMARY

A TECHNICAL DETAILS
A.1 Local motion and tensors
A.1.1 Velocity from the tensor
A.1.2 Speed and eigenvectors
A.1.3 Speed and $T_{xy}$
A.2 Motion models
A.2.1 Accuracy of affine parameter estimates

1 INTRODUCTION

This first chapter is intended to give [most of] the background information needed to understand the new results presented in the later chapters.
1.1 Motivation

It is not particularly difficult to convince anyone working with image processing that efficient algorithms for motion estimation are of central importance to the field. Areas such as video coding and real-time robot vision are highly dependent on accurate and fast methods to extract spatiotemporal signal features for segmentation and response generation. The work presented in this thesis was prompted by certain breakthroughs in the implementation of spatiotemporal filter banks, which seemed to allow completely new applications of massive spatiotemporal filtering and modelling. Could it be that highly accurate motion estimation not only gives a quality improvement to standard real-time algorithms, but actually makes completely new, more powerful methods realizable?

We cannot claim to have a definite answer to whether high performance spatiotemporal filtering is useful in applications with time constraints. There are some very promising results based on more qualitative aspects of motion fields that seem not to require massive 3D computation. It is clear, however, that as computers get faster, more powerful methods than those that have been used up to now will inevitably find their way to real-time applications.

1.2 The thesis

Section 1.4 of this chapter is an introduction to energy-based approaches to spatiotemporal local structure estimation and representation. It also provides an outline of certain recent technical results that motivated the work presented in the thesis.

In Chapter 2 we present a novel efficient multiresolution algorithm for accurate and reliable estimation and representation of local spatiotemporal structure over a wide range of image displacements. First we describe how the local spatiotemporal spectral energy distribution is affected by spatial low-pass filtering and subsampling, and we investigate the practical consequences of this.
Then we consider how to process a resolution pyramid of local structure estimates so that it contains as accurate and reliable estimates as possible, and we provide experimental results that support our reasoning.

In Chapter 3 the concept of signal certainty-based filtering and modelling is treated. This addresses the important problem of how to properly process estimates with varying degree of reliability. The chapter contains a general account of a method called normalised convolution. We look at how this can be efficiently implemented in the case of spatiotemporal filtering and modelling, and present experimental results that demonstrate the power of the method.

Chapter 4 describes an active vision application of the concepts developed in the earlier chapters. We present a smooth pursuit motion tracking algorithm that uses observations of both target position and velocity. The chapter contains novel efficient motion segmentation algorithms for extraction of the target motion field. Excerpts from a number of simulated and real tracking sequences are provided.

1.3 Notation

In choosing the various symbols, certain general principles have been followed. Vectors are denoted by bold-faced lower-case characters, e.g., $\mathbf{u}$, and tensors of higher order by bold-faced upper-case characters, e.g., $\mathbf{I}$. The norm of a tensor $\mathbf{A}$ is denoted by $\|\mathbf{A}\|$. This will usually mean the Frobenius norm. A 'hat' above a vector indicates unit norm, e.g., $\hat{\mathbf{u}}$. Eigenvalues are denoted by the letter $\lambda$, and are ordered $\lambda_1 \ge \lambda_2 \ge \ldots$.

A certain abuse of notation will be noted: to keep the notation simple we do not discriminate between tensors and the coordinate matrices that always turn up in numerical calculations, e.g., $\mathbf{A}^T$ will denote the transpose of the matrix $\mathbf{A}$.

Occasionally we will use the notation $\langle \mathbf{a}, \mathbf{b} \rangle$ for the inner product between two possibly complex-valued vectors. This is equivalent to the coordinate vector operation $\mathbf{a}^{*}\mathbf{b}$, where the star denotes complex conjugation and transposition.
The signal is usually called $f(\mathbf{x})$. This always refers to a local coordinate system centred in a local neighbourhood. The Fourier transform of a function $s(\mathbf{x})$ is denoted by $S(\mathbf{u})$. The Laplace transform of a function $g(t)$ is denoted by $G(s)$.

1.4 Representing local structure with tensors

This section gives the necessary background information concerning the estimation and representation of local spatio-temporal structure. The concepts presented here, and the fact that the estimation procedure can be efficiently implemented, form a prerequisite for the methods presented in later chapters.

The inertia tensor method is by Bigün, [16, 17, 15, 18]. Pioneer work on its application to motion analysis was done by Jähne, [53, 54, 55, 56]. The local structure tensor representation and estimation by quadrature filtering was conceived by Knutsson, [59, 60, 61, 62, 66, 65]. A comprehensive description of the theory and its application is given in [43]. The efficient implementation of multidimensional quadrature filtering is by Knutsson and Andersson, [63, 64, 6].

1.4.1 Energy-based local structure representation and estimation

One of the principal goals of low-level computer vision can be formulated as the detection and representation of local anisotropy. Typical representatives are lines, edges, corners, crossings and junctions. The reason why these features are important is that they are of a generic character yet specific enough to greatly facilitate unambiguous image segmentation. In motion analysis we are interested in features that are stable in time so that local velocity may be computed from the correspondence of features in successive frames. Indeed, many motion estimation algorithms work by matching features from one frame to the next. A problem with this strategy is that there may be multiple matches, particularly in textured regions.
Assuming that a feature is stable between several successive frames, the local velocity may be found from a spatio-temporal vector pointing in the direction of least signal variation [footnote 1]. The Fourier transform [footnote 2] of a signal that is constant in one direction is confined to a plane through the origin perpendicular to this direction, Figure 1.2. If the feature is also constant in a spatial direction, i.e., it is an edge or a line, there will be two orthogonal spatio-temporal directions with small signal variation. This prevents finding the true motion direction, but constrains it to lie in a plane orthogonal to the direction of maximum signal variation. A signal that varies in a single direction has a spectrum confined to a line in this direction through the origin in the Fourier domain, Figure 1.1.

Footnote 1: There are several ways to more exactly define the directional signal variation, leading to different algorithms. For our present discussion an intuitive picture is sufficient.

Footnote 2: When referring to the Fourier transform of the signal we always mean the transform of a windowed signal, a 'local Fourier transform'.

The discussion has focused the attention on the angular distribution of the signal spectrum and its relation to motion estimation. We will here review two related methods to obtain a representation of this distribution. In the first method one estimates the inertia tensor $\mathbf{J}$ of the power spectrum, in analogy with the inertia tensor of mechanics. The inertia tensor comes from finding the direction $\hat{\mathbf{n}}_{min}$ that minimises

$$J[\hat{\mathbf{n}}] = \int \| \mathbf{u} - (\mathbf{u}^T \hat{\mathbf{n}})\hat{\mathbf{n}} \|^2 \, |F(\mathbf{u})|^2 \, d\mathbf{u}$$

which is simply the energy-weighted integral of the squared orthogonal distances of the data points to a line through the origin in the $\hat{\mathbf{n}}$-direction. The motivation for introducing $J$ is that for signals with a dominant direction of variation, this direction is a global minimum of $J$. Likewise, we see that for a signal that is constant in one unique direction, this direction is a global maximum of $J$.
A little algebra leads to the following equivalent expression

$$J[\hat{\mathbf{n}}] = \hat{\mathbf{n}}^T \mathbf{J} \hat{\mathbf{n}}$$

In component form the inertia tensor is given by

$$J_{jk} = \int |F(\mathbf{u})|^2 \left( \|\mathbf{u}\|^2 \delta_{jk} - u_j u_k \right) d\mathbf{u} \tag{1.1}$$

Since $\mathbf{J}$ is a symmetric second order tensor it is now apparent that $J[\hat{\mathbf{n}}]$ is minimised by the eigenvector of $\mathbf{J}$ corresponding to the smallest eigenvalue. From Plancherel's relation and using the fact that multiplication by $i u_k$ in the Fourier domain corresponds to partial differentiation $\partial / \partial x_k$ in the spatial domain, Equation (1.1) may be written

$$J_{jk} = \int \left[ \|\nabla f\|^2 \delta_{jk} - \frac{\partial f}{\partial x_j} \frac{\partial f}{\partial x_k} \right] d\mathbf{x} \tag{1.2}$$

The integration is taken over the window defining the local spatio-temporal neighbourhood and may be organised as low-pass filtering of the derivative product volumes. The derivatives themselves can be efficiently computed with one-dimensional filters. Critics of the inertia tensor approach argue that the spatio-temporal low-pass filtering causes an unwanted smearing [footnote 3], and that the use of squared derivatives demands a higher sampling rate than is needed to represent the original signal.

Footnote 3: This problem is particularly severe at motion boundaries and for small objects moving against a textured background.

Figure 1.1: A planar autocorrelation function in the spatial domain corresponds to energy being distributed on a line in the Fourier domain. (Iso-surface plots; panels: spatial domain, Fourier domain.)

Figure 1.2: An autocorrelation function concentrated on a line in the spatial domain corresponds to a planar energy distribution in the Fourier domain. (Iso-surface plots.)

Figure 1.3: A spherical autocorrelation function in the spatial domain corresponds to a spherical energy distribution in the Fourier domain. (Iso-surface plots.)

An alternative approach to energy-based local structure estimation and representation was given by Knutsson. This procedure also leads to a symmetric second-order tensor representation but it avoids the spatio-temporal averaging of the inertia tensor approach, and the shift of the spectrum to higher frequencies. The concept of the local structure tensor comes from the observation that an unambiguous representation of signal orientation is given by

$$\mathbf{T} = A \, \hat{\mathbf{x}} \hat{\mathbf{x}}^T \tag{1.3}$$

where $\hat{\mathbf{x}}$ is a unit vector in the direction of maximum signal variation and $A$ is any scalar greater than zero. Estimation of $\mathbf{T}$ is done by probing Fourier space in several directions $\hat{\mathbf{n}}_k$ with filters that each pick up energy in an angular sector centred at a particular direction. These filters are complex-valued quadrature [footnote 4] filters, which means that their real and imaginary parts are reciprocal Hilbert transforms [22]. The essential result of this is that the impulse response of the filter has a Fourier transform that is real and non-zero only in a half-space $\hat{\mathbf{n}}_k^T \mathbf{u} > 0$. One designs the quadrature filters to be spherically separable in the Fourier domain, which means that they can be written as a product of a function of radius and a function of direction

$$F_k(\mathbf{u}) = R(\rho) \, D_k(\hat{\mathbf{u}})$$

The radial part, $R(\rho)$, is made positive in a pass-band so that the filter response corresponds to a weighted average of the spectral coefficients in a region of Fourier space [footnote 5]. Figure 1.4 shows the radial part of the filter used in most experiments in this thesis. This is a lognormal frequency function, which means that it is Gaussian on a logarithmic scale.

The real benefit from using quadrature filters comes from taking the absolute value (magnitude) of the filter response. This is a measure of the 'size' of the average spectral coefficient in a region and is, for narrow-banded signals, invariant to shifts of the signal. This in turn assures that the orientation estimate is independent of whether it is obtained at an edge or on a line (phase invariance).
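The phase invariance described above can be illustrated in one dimension. The following is a minimal sketch, not the filters used in the thesis: it assumes the common lognormal form $R(\rho) = \exp[-\frac{4}{B^2 \ln 2} \ln^2(\rho/\rho_0)]$ with a hypothetical bandwidth of $B = 2$ octaves, takes the centre frequency $\pi/(2\sqrt{2})$ from the caption of Figure 1.4, and applies the one-sided filter through the FFT. The function names are illustrative only.

```python
import numpy as np

def lognormal_quadrature_1d(n, centre=np.pi / (2 * np.sqrt(2)), bandwidth_octaves=2.0):
    """Frequency response of a 1D quadrature filter: lognormal for u > 0, zero for u <= 0.

    The bandwidth (in octaves) is an assumed value, not taken from the thesis.
    """
    u = 2 * np.pi * np.fft.fftfreq(n)            # angular frequencies in [-pi, pi)
    response = np.zeros(n)
    pos = u > 0                                  # one-sided: only the 'positive direction'
    c = 4.0 / (bandwidth_octaves ** 2 * np.log(2.0))
    response[pos] = np.exp(-c * np.log(u[pos] / centre) ** 2)
    return response

def quadrature_response(signal, response):
    """Complex filter output q, computed by multiplication in the Fourier domain."""
    return np.fft.ifft(np.fft.fft(signal) * response)

n = 256
x = np.arange(n)
freq = 2 * np.pi * 32 / n                        # pi/4, exactly periodic in the window
R = lognormal_quadrature_1d(n)

# An even (line-like) and an odd (edge-like) version of the same narrow-band signal.
mag_even = np.abs(quadrature_response(np.cos(freq * x), R))
mag_odd = np.abs(quadrature_response(np.sin(freq * x), R))
```

For this single-frequency test signal the two magnitude signals coincide and are constant over position, although the real and imaginary parts of the responses differ. This is the property that makes $|q_k|$ a usable energy measure regardless of local phase.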
The filter response magnitudes $|q_k|$ are used as coefficients in a linear summation of basis tensors

$$\mathbf{T}_{est} = \sum_k |q_k| \, \mathbf{M}_k \tag{1.4}$$

where $\mathbf{M}_k$ is the dual of the outer product tensor $\mathbf{N}_k = \hat{\mathbf{n}}_k \hat{\mathbf{n}}_k^T$. The $\mathbf{M}_k$'s are defined by the reconstruction relation

$$\sum_k (\mathbf{S} \bullet \mathbf{N}_k) \, \mathbf{M}_k = \mathbf{S} \quad \text{for all second order symmetric tensors } \mathbf{S} \tag{1.5}$$

where

$$\mathbf{A} \bullet \mathbf{B} = \sum_{ij} A_{ij} B_{ij}$$

defines an inner product in the tensor space.

Footnote 4: Early approaches to motion analysis by spatio-temporal quadrature filtering are by Adelson and Bergen [1] and Heeger [47]. These papers also discuss the relation to psychophysics and neurophysiology of motion computation.

Footnote 5: Note that since the Fourier transform of a real signal is Hermitian there is no loss of information from looking at the spectrum in the filter's 'positive direction' only.

Figure 1.4: Lognormal radial frequency function with centre frequency $\pi / (2\sqrt{2})$. (Axes: frequency from 0 to $\pi$; amplitude from 0 to 1.)

For the reconstruction relation to be satisfied we need at least as many linearly independent filter directions as there are degrees of freedom in a tensor, i.e., six when filtering spatio-temporal sequences. The explicit mathematical expression for the $\mathbf{M}_k$'s depends on the angular distribution of the filters, but it is not too difficult to show that the general dual basis element is given by

$$\mathbf{M}_k = \Big[ \sum_i \mathbf{N}_i \otimes \mathbf{N}_i \Big]^{-1} \mathbf{N}_k \tag{1.6}$$

where $\otimes$ denotes a tensor (outer) product. When six filters are symmetrically distributed in a hemisphere this reduces to [43]

$$\mathbf{M}_k = \frac{5}{4} \mathbf{N}_k - \frac{1}{4} \mathbf{I}$$

where $\mathbf{I}$ refers to the identity tensor with components $I_{ij} = \delta_{ij}$.

As discussed earlier in this section, a signal that varies in a single direction $\hat{\mathbf{x}}$ (such as a moving edge), referred to as a simple signal, has a spectrum that is non-zero only on a line through the origin in this direction.
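The dual basis of Equation (1.6) can be computed numerically and checked against the reconstruction relation (1.5). The sketch below assumes six filter directions pointing at vertices of an icosahedron, which is one symmetric arrangement; the specific coordinates, the flattening of each tensor into a 9-vector, and the use of a pseudo-inverse (the Gram operator is only invertible on the subspace of symmetric tensors) are choices made for this illustration.

```python
import numpy as np

def icosahedral_directions():
    """Six unit vectors, one per antipodal vertex pair of an icosahedron (assumed layout)."""
    a, b = 2.0, 1.0 + np.sqrt(5.0)
    c = 1.0 / np.sqrt(10.0 + 2.0 * np.sqrt(5.0))
    return c * np.array([
        [a, 0, b], [-a, 0, b],
        [b, a, 0], [b, -a, 0],
        [0, b, a], [0, b, -a],
    ])

def dual_tensors(directions):
    """Return (N, M): outer product tensors N_k and their duals M_k from Eq. (1.6)."""
    N = np.einsum("ki,kj->kij", directions, directions)   # N_k = n_k n_k^T
    N_flat = N.reshape(len(N), -1)                         # each N_k as a 9-vector
    gram = N_flat.T @ N_flat                               # sum_i N_i (x) N_i as a 9x9 matrix
    # The Gram matrix is symmetric and singular on antisymmetric tensors,
    # so a pseudo-inverse with an explicit cutoff is used.
    gram_pinv = np.linalg.pinv(gram, rcond=1e-10)
    M_flat = N_flat @ gram_pinv                            # row k: gram_pinv applied to N_k
    return N, M_flat.reshape(N.shape)

def reconstruct(S, N, M):
    """Left-hand side of Eq. (1.5): sum_k (S . N_k) M_k."""
    coeffs = np.einsum("kij,ij->k", N, S)
    return np.einsum("k,kij->ij", coeffs, M)

N, M = dual_tensors(icosahedral_directions())
```

For this particular direction set the numerically computed duals coincide with $\frac{5}{4}\mathbf{N}_k - \frac{1}{4}\mathbf{I}$, and `reconstruct` returns any symmetric input tensor unchanged.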
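Writing this as

$$S(\mathbf{u}) = G(\hat{\mathbf{x}}^T \mathbf{u}) \, \delta^{line}_{\hat{\mathbf{x}}}(\mathbf{u})$$

where $\delta^{line}_{\hat{\mathbf{x}}}(\mathbf{u})$ is an impulse line in the $\hat{\mathbf{x}}$-direction [footnote 6], the result of filtering a simple signal with a spherically separable quadrature filter can be written

$$\begin{aligned}
q_k &= \int F_k(\mathbf{u}) \, S(\mathbf{u}) \, d\mathbf{u} = \int R(\rho) D_k(\hat{\mathbf{u}}) \, G(\hat{\mathbf{x}}^T \mathbf{u}) \, \delta^{line}_{\hat{\mathbf{x}}}(\mathbf{u}) \, d\mathbf{u} \\
&= D_k(\hat{\mathbf{x}}) \int_0^\infty R(\rho) G(\rho) \, d\rho + D_k(-\hat{\mathbf{x}}) \int_0^\infty R(\rho) G(-\rho) \, d\rho \\
&= a D_k(\hat{\mathbf{x}}) + a^{*} D_k(-\hat{\mathbf{x}})
\end{aligned}$$

where $a$ is a complex number that only depends on the radial overlap between the filter and the signal ($a^{*}$ being its complex conjugate).

Footnote 6: This is defined as a product of one-dimensional delta-functions in the directions orthogonal to $\hat{\mathbf{x}}$.

Since the filter is non-zero in only one of the directions $\pm\hat{\mathbf{x}}$, the magnitude of the filter response is

$$|q_k| = |a| \, |D_k(\hat{\mathbf{x}}) + D_k(-\hat{\mathbf{x}})|$$

Now, if we choose the directional function to be

$$D_k(\hat{\mathbf{u}}) = \begin{cases} (\hat{\mathbf{n}}_k^T \hat{\mathbf{u}})^2 = (\hat{\mathbf{n}}_k \hat{\mathbf{n}}_k^T) \bullet (\hat{\mathbf{u}} \hat{\mathbf{u}}^T) & \hat{\mathbf{n}}_k^T \hat{\mathbf{u}} > 0 \\ 0 & \text{otherwise} \end{cases} \tag{1.7}$$

we see that Equation (1.4) reduces to

$$\mathbf{T}_{est} = \sum_k |q_k| \, \mathbf{M}_k = \sum_k |a| \, (\hat{\mathbf{n}}_k \hat{\mathbf{n}}_k^T \bullet \hat{\mathbf{x}} \hat{\mathbf{x}}^T) \, \mathbf{M}_k = \sum_k |a| \, (\mathbf{N}_k \bullet \hat{\mathbf{x}} \hat{\mathbf{x}}^T) \, \mathbf{M}_k = |a| \, \hat{\mathbf{x}} \hat{\mathbf{x}}^T \tag{1.8}$$

This proves that the suggested method correctly estimates the orientation tensor $\mathbf{T}$ of Equation (1.3) in the ideal case of a simple signal. In general we may represent the orientation of a signal by the outer product tensor that minimises the Frobenius distance

$$\Delta = \| \mathbf{T}_{est} - A \hat{\mathbf{x}} \hat{\mathbf{x}}^T \|_F = \sqrt{\sum_{ij} (T_{ij} - A x_i x_j)^2}$$

It is not difficult to see that the minimum value is obtained when $A$ is taken as the largest eigenvalue of $\mathbf{T}_{est}$ and $\hat{\mathbf{x}}$ the corresponding eigenvector.

Another relevant ideal case besides that of a simple signal is the moving point. The spectrum of such a signal is constant on a plane through the origin and zero outside it. In this case the estimated tensor will have two (equal) non-zero eigenvalues and the eigenvector corresponding to the third (zero-valued) eigenvalue will point in the direction of spatio-temporal motion.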
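The two ideal cases can be checked with a few lines of linear algebra. The sketch below is an illustration under stated assumptions: coordinates are ordered $(x, y, t)$, the ideal moving-point tensor is modelled as $A(\mathbf{I} - \hat{\mathbf{d}}\hat{\mathbf{d}}^T)$ with $\hat{\mathbf{d}} \propto (v_x, v_y, 1)$, and the image velocity is read off by normalising the eigenvector so that its temporal component equals one. The function names are hypothetical.

```python
import numpy as np

def moving_point_tensor(velocity, amplitude=1.0):
    """Idealised tensor for a moving point: two equal non-zero eigenvalues and a
    zero eigenvalue along the spatio-temporal direction (vx, vy, 1)."""
    d = np.array([velocity[0], velocity[1], 1.0])
    d /= np.linalg.norm(d)
    return amplitude * (np.eye(3) - np.outer(d, d))

def velocity_from_tensor(T):
    """Velocity from the eigenvector of the smallest eigenvalue.

    Assumes a non-zero temporal component; the sign ambiguity of the
    eigenvector cancels in the division.
    """
    eigvals, eigvecs = np.linalg.eigh(T)       # eigenvalues in ascending order
    e = eigvecs[:, 0]
    return e[:2] / e[2]

def dominant_orientation(T):
    """Best outer product approximation A x x^T in the Frobenius sense:
    largest eigenvalue and the corresponding eigenvector."""
    eigvals, eigvecs = np.linalg.eigh(T)
    return eigvals[-1], eigvecs[:, -1]

T_point = moving_point_tensor((1.5, -0.5))
v = velocity_from_tensor(T_point)              # recovers (1.5, -0.5) up to rounding
```

Note that for a simple signal (rank-one tensor) the smallest eigenvalue is degenerate, so `velocity_from_tensor` is only meaningful when the tensor has a unique smallest eigenvalue.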
In general, wherever there is more than one spatial orientation, the correct image velocity can be recovered from the direction of the eigenvector corresponding to the smallest eigenvalue. The details of how the velocity components are recovered from the tensor are given in Appendix A.1.1.

1.4.2 Efficient implementation of spherically separable quadrature filters

We saw in the previous subsection that the inertia tensor method allows an efficient implementation using one-dimensional derivative and low-pass filters. We now seek a corresponding efficient implementation of the quadrature filtering method. A problem with the quadrature filters is that spherical separability in general is incompatible with Cartesian separability. We therefore inevitably introduce errors when trying to approximate the multi-dimensional filters by a convolution product of low-dimensional filters. The success of such an attempt naturally depends on the magnitude of the approximation error. Unfortunately, it is not entirely obvious how the filters should be decomposed to simultaneously satisfy the constraints of minimum approximation error and smallest possible number of coefficients.

Given the shape of the present filters, with their rather sharp magnitude peak in the central direction of the filter, it is natural to try a decomposition with a one-dimensional quadrature filter in the principal direction and real-valued one-dimensional low-pass filters in the orthogonal directions. With a symmetric distribution of filters, taking full advantage of symmetry properties, this leads to the scheme of Figure 1.5.

Figure 1.5: 3D sequential quadrature filtering structure. (Diagram labels: image sequence; one-dimensional low-pass filters; one-dimensional quadrature filters; outputs f1 to f6.)

The filter coefficients are determined in a recursive optimisation process in Fourier space.
With an ideal multi-dimensional target filter $\hat{F}(\mathbf{u})$, and our current component product [footnote 7] approximation of this, $F(\mathbf{u}) = \prod_k F_k(\mathbf{u})$, we define temporary targets

$$\hat{F}_i(\mathbf{u}) = \frac{\hat{F}(\mathbf{u})}{\prod_{k \ne i} F_k(\mathbf{u})}$$

for the component filters. We use this target filter to produce a (hopefully) better component filter by minimising the weighted square error

$$E_i = \Big\| W_i(|\mathbf{u}|) \prod_{k \ne i} F_k(\mathbf{u}) \, \big[ \hat{F}_i(\mathbf{u}) - F_i^{new}(\mathbf{u}) \big] \Big\|^2 \tag{1.9}$$

$$\phantom{E_i} = \Big\| W_i(|\mathbf{u}|) \, \big[ \hat{F}(\mathbf{u}) - F_i^{new}(\mathbf{u}) \prod_{k \ne i} F_k(\mathbf{u}) \big] \Big\|^2 \tag{1.9'}$$

where $W_i(|\mathbf{u}|)$ is a radial frequency weight, typically $W_i(|\mathbf{u}|) \propto |\mathbf{u}|^{-\alpha} + \varepsilon$, see [43]. Equation (1.9') says that on average we will decrease the difference between the ideal filter $\hat{F}(\mathbf{u})$ and the product $F(\mathbf{u})$. The constraint that the filter coefficients are to be confined to certain chosen spatial coordinate points (in this case ideally on a line [footnote 8]) is implemented by specifying the discrete Fourier transform to be of the form

$$F_k(\mathbf{u}) = \sum_{n=1}^{N_k} f_k(\boldsymbol{\xi}_n) \exp(-i \, \boldsymbol{\xi}_n^T \mathbf{u}), \qquad |\mathbf{u}| < \pi$$

where the sum is over all allowed spatial coordinate points $\boldsymbol{\xi}_n$. The optimiser then solves a linear system for the least-squares optimal coefficients $f_k$. The optimisation procedure (definition of component target function and optimisation of the corresponding filter) is repeated cyclically for all component filters until convergence. For the present filters this takes fewer than ten iterations.

The quality of the optimised filters has been carefully investigated by estimating the orientation of certain test patterns with known orientation at each position using the local structure tensor method. The result is that the precision is almost identical to what is obtained with full multi-dimensional filters, [6]. As an example, a particular set of six full 3D filters has $9 \times 9 \times 9 \times 6$ complex-valued coefficients, i.e., 8748 real-valued ones. A corresponding set of sequential filters with the same performance characteristics requires 273 coefficients.
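A single step of the optimisation, fitting one component filter to its target, is a weighted linear least-squares problem. The following sketch shows that step for a one-dimensional filter only. It is not the optimiser used in the thesis: the other component filters are taken to be identically one (so the temporary target equals the ideal filter), the target is a one-sided lognormal function with an assumed bandwidth, and the weight parameters `alpha` and `eps`, the frequency grid and all function names are hypothetical.

```python
import numpy as np

def lognormal(u, centre=np.pi / (2 * np.sqrt(2)), bandwidth=2.0):
    """One-sided lognormal target: non-zero only for u > 0 (assumed bandwidth in octaves)."""
    out = np.zeros_like(u)
    pos = u > 0
    out[pos] = np.exp(-4.0 / (bandwidth ** 2 * np.log(2.0)) * np.log(u[pos] / centre) ** 2)
    return out

def radial_weight(u, alpha=1.0, eps=0.01):
    """Frequency weight W(|u|) = |u|^(-alpha) + eps."""
    return np.abs(u) ** (-alpha) + eps

def fit_component_filter(target, u, positions, alpha=1.0, eps=0.01):
    """Least-squares spatial coefficients f(xi_n) so that
    sum_n f(xi_n) exp(-i xi_n u) approximates the target on the grid u."""
    w = radial_weight(u, alpha, eps)
    basis = np.exp(-1j * np.outer(u, positions))           # one column per allowed position
    rhs = (w * target).astype(complex)
    coeffs, *_ = np.linalg.lstsq(w[:, None] * basis, rhs, rcond=None)
    return coeffs, basis @ coeffs

def weighted_error(target, approx, u, alpha=1.0, eps=0.01):
    """Weighted square error between target and approximation."""
    return np.sum(np.abs(radial_weight(u, alpha, eps) * (target - approx)) ** 2)

# Frequency grid on (-pi, pi), offset by half a step so that u = 0 is never sampled.
u = np.linspace(-np.pi, np.pi, 512, endpoint=False) + np.pi / 512
target = lognormal(u)
coeffs9, approx9 = fit_component_filter(target, u, np.arange(-4, 5))   # 9 taps on a line
```

The resulting coefficients are complex-valued, as expected for a quadrature filter. In the cyclic scheme described above, the same solve would be repeated for each component with the target and weight modified by the product of the remaining component filters.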
This clearly makes the quadrature filtering approach competitive in relation to the inertia tensor method [footnote 9], which requires partial derivative computations followed by multi-dimensional low-pass filtering of all independent products of the derivatives.

Footnote 7: Convolution in the space-time domain corresponds to multiplication in the frequency domain.

Footnote 8: Some of the filters are for sampling density reasons 'weakly' two-dimensional with a tri-diagonal form.

Footnote 9: From a computation cost point of view, assuming the same size of the filters, disregarding the fact that the inertia tensor method requires a higher sampling density to obtain comparable spatial resolution.

2 MULTIRESOLUTION SPARSE TENSOR FIELD

A unique feature of our approach to motion segmentation is that models are fitted not to local estimates of velocity but to a sparse field of spatiotemporal local structure tensors. In this chapter we present a fast algorithm for generation of a sparse tensor field where estimates are consistent over scales.

2.1 Low-pass filtering, subsampling, and energy distribution

As described in Section 1.4.1 there is a direct correspondence between the spatiotemporal displacement vector of a local spatial feature on the one hand and the local structure tensor on the other. However, to robustly estimate the tensor, so as to avoid temporal under-sampling and be able to cope with a wide range of velocities, it is necessary to adopt some kind of multiresolution scheme. Though it is possible to conceive various advanced partitionings of the frequency domain into frequency channels by combinations of spatial and temporal subsampling, [44], we opted to compute a simple low-pass pyramid à la Burt [25] for each frame, and not to resample the sequence temporally. The result is a partitioning of the frequency domain as shown in Figure 2.1. Each level in the pyramid is constructed from the previous level by Gaussian low-pass filtering and subsequent subsampling by a factor two.
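The per-frame pyramid construction can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the filter is a truncated, separable spatial Gaussian whose frequency-domain standard deviation `sigma_freq` corresponds to a spatial standard deviation of `1 / sigma_freq`, and the kernel radius, the reflecting border treatment and the function names are assumptions.

```python
import numpy as np

def gaussian_kernel(sigma_freq=np.pi / 4, radius=4):
    """Normalised 1D Gaussian kernel; sigma_freq is the standard deviation
    of the filter in the frequency domain."""
    sigma_space = 1.0 / sigma_freq
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma_space ** 2))
    return k / k.sum()

def lowpass_subsample(image, kernel):
    """Separable Gaussian low-pass filtering followed by subsampling by a factor two."""
    pad = len(kernel) // 2
    padded = np.pad(image, pad, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    both = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)
    return both[::2, ::2]

def build_pyramid(frame, levels=3, sigma_freq=np.pi / 4):
    """Low-pass pyramid of a single frame; level 0 is the original resolution."""
    kernel = gaussian_kernel(sigma_freq)
    pyramid = [np.asarray(frame, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(lowpass_subsample(pyramid[-1], kernel))
    return pyramid

pyramid = build_pyramid(np.full((64, 64), 5.0), levels=3)
```

Each frame is processed independently, so no temporal resampling takes place. The reflecting border requires every level to be larger than the kernel radius in both dimensions.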
To avoid aliasing when subsampling it is necessary to use a filter that is quite narrow in the frequency domain. As is seen in Figure 2.2, the result is that the energy is substantially reduced in the spatial directions. This is of no consequence in the ideal case of a moving line or point, when the spectral energy is confined to a line or a plane, since the orientation is then unaffected by the low-pass filtering. However, the situation is quite different in the case of a noisy signal. Referring to Figure 2.3, a sequence of white noise images that is spatially low-pass filtered and subsampled becomes orientation biased since energy is lost in the spatial directions. A possible remedy for this is to low-pass filter the signal temporally with a filter that is twice as broad as the filter used in the spatial directions [44]. This type of temporal filter can be efficiently implemented as a cascaded recursive filter, [39].

Figure 2.1: Partitioning of the frequency domain by successive spatial subsampling. The plot shows Nyquist frequencies for the different levels. (Axes: spatial frequency, temporal frequency.)

A couple of experiments were carried out to quantitatively determine the influence of isotropic noise on orientation estimates in spatially subsampled sequences. The orientation bias caused by the low-pass filtering [footnote 1] and subsampling was measured by computing the average orientation tensor over each sequence and determining the quotient $\lambda_t / \lambda_{spat}$, where $\lambda_t$ refers to the eigenvalue in the temporal direction and $\lambda_{spat}$ to the average of the eigenvalues in the spatial plane.

Footnote 1: The low-pass filters were of the form

$$F(\omega_x, \omega_y, \omega_t) = \exp\left[ -\left( \frac{\omega_x^2}{2\sigma_{spat}^2} + \frac{\omega_y^2}{2\sigma_{spat}^2} + \frac{\omega_t^2}{2\sigma_t^2} \right) \right]$$

In the first experiment the average orientation of initially white noise was computed. The results are shown in Table 2.1. As expected, the average orientation becomes biased when not compensating for the energy anisotropy by temporal low-pass filtering. However, the effect is quite small for moderate spatial low-pass filtering ($\sigma_{spat} = \pi/4$). This is due to the fact that the quadrature filter used is fairly insensitive to high frequencies, cf. Figure 1.4. Repeating the low-pass filtering and subsampling twice with $\sigma_{spat} = \pi/4$ with no temporal filtering results in a quotient of 1.04, which indicates that temporal filtering in fact may be unnecessary. This is important in real-time control applications where long delays cause problems.

Figure 2.2: Gaussian low-pass filtering ($\sigma = \pi/4$) reduces the signal energy in the spatial directions. The subsequent subsampling moves the maximum frequency down to $\pi/2$ (dashed line). There is also some aliasing caused by the non-zero tail above $\pi/2$.

Table 2.1: Results of low-pass filtering and spatial subsampling by a factor two of a sequence of white noise. The numbers show the quotient $\lambda_t / \lambda_{spat}$ of the average tensor.

σ_spat             π/4     π/8     π/16
σ_t = ∞            1.03    1.19    1.57
σ_t = 2σ_spat      1.03    1.03    1.01

Figure 2.3: Illustration of low-pass filtering and subsampling of a white noise sequence. Plots show the spectral energy distribution in one spatial direction and the temporal direction. Left: Spatial low-pass filtering reduces the signal energy in the spatial directions. Middle: Spatial subsampling leads to a rescaling of the spectrum. Right: The energy anisotropy is compensated for by a temporal low-pass filtering with a filter that is twice as broad as the corresponding filter in the spatial direction.

In a second experiment we used a synthetic volume with an unambiguous spatiotemporal orientation at each position. The sequence was designed to have a radially sinusoidal variation of grey-levels with a decreasing frequency towards the periphery, Figure 2.4. The volume was corrupted with white noise and subsequently low-pass filtered ($\sigma_{spat} = \pi/4$) and subsampled.
The change in average orientation caused by the noise remained below one degree even at a signal-to-noise ratio as low as 0 dB. This is not surprising since the quadrature filter picks up much less energy from a signal with a random phase distribution over frequencies than from a signal with a well-defined phase.

The conclusion of this investigation is that with an appropriate choice of radial filter function of the quadrature filter, i.e., one that is comparatively insensitive to high frequencies, there is no reason to worry about orientation bias induced by uncorrelated signal noise. Consequently, no temporal low-pass filtering is necessary.

Figure 2.4: Radial variation of grey-levels in test volume.

2.2 Building a consistent pyramid

The construction of a low-pass pyramid from each image frame gives a number of separate spatial resolution channels that can be processed in parallel. Consecutive images are stacked in a temporal buffer which is convolved with quadrature filters. The magnitudes of the filter outputs are used in the composition of local structure tensors. The result is a multiresolution pyramid of tensors. At each original image position there are now several tensors, one from each level of the pyramid, describing the local spatiotemporal structure as it appears at each particular scale.

The question arises how to handle this type of representation of the local structure. The answer, of course, depends on the intended application. Our intention is to perform segmentation by fitting parameterised models of the spatiotemporal displacement field to estimates in regions of the image. Interpreting the spatiotemporal displacement as the direction of minimal signal variation, it is clear that the information is readily available in the tensor. For efficiency reasons we want to use the sparse multiresolution tensor field as it is, without any data conversion or new data added.
The problem of how to handle the tensor field pyramid then reduces to that of deciding which confidence value should be given to each estimate, i.e., how much a particular tensor should be trusted. For computational efficiency it is also desirable to sort out data that does not contain any useful information as early as possible in a chain of operations.

At this point it is appropriate to make the distinction between two entities of fundamental importance. We use the definitions by Horn [49]. A point on an object in motion traces out a particle path (flow line) in 3D space, the temporal derivative of which is the instantaneous 3D velocity vector of the point. The geometrical projection of the particle path onto the image plane by the camera gives rise to a 2D particle path whose temporal derivative is the projected 2D velocity vector. The 2D motion field is the collection of all such 2D velocity vectors. The image velocity or optical flow is (any) estimate of the 2D motion field based on the spatiotemporal variation of image intensity. Several investigators [49, 97, 98, 78, 36, 38] have studied the relation between the 2D motion field and the image velocity. The conclusion is that they are generally different, sometimes very much so. A classical example is by Horn [49]. A smooth sphere with a specular$^{2}$ surface rotating under constant illumination generates no spatiotemporal image intensity variation. On the other hand, if the sphere is fixed but a light source is in motion, there will be a spatiotemporal image intensity variation caused by reflection in the surface of the sphere. The intensity variation caused by this type of "illusory" motion cannot be distinguished from that caused by objects in actual motion without a priori knowledge or high-level scene interpretation. A diffusely reflecting (Lambertian) surface also induces intensity variation caused by changes in angle between the surface normal and light sources.
This variation is typically a smooth function of position, and independent of texture and surface markings. The conclusion is that the optical flow accurately describes the motion field of predominantly diffusely reflecting surfaces with a large spatial variation of grey level.

The use of the local structure tensor for motion estimation is based on the assumption that the spatiotemporal directions of largest signal variance are orthogonal to the spatiotemporal displacement vector. The shape of the tensor, regarded as an ellipsoid with the semi-axes proportional to the eigenvalues of the tensor, cf. Figures 1.1 to 1.3, is a simple model of the local signal variation, with the longest semi-axis corresponding to the direction of maximum signal variation. It is evident that when the signal variation is close to uniform in all directions, no reliable information about the local motion direction is available. To qualify as a reliable estimate we require the neighbourhood to have a well-defined anisotropy (orientation). This means that the smallest tensor eigenvalue should be substantially smaller than the largest one. It is beneficial to have a computationally cheap measure of anisotropy that does not require eigenvalue decomposition, so that unreliable estimates of velocity may be quickly rejected from further processing. A suitable measure is given by

\[ \mu = \frac{\|T\|_F}{\mathrm{Tr}(T)} = \frac{\sqrt{\sum_{i,j} T_{ij}^2}}{\sum_k T_{kk}} = \frac{\sqrt{\sum_k \lambda_k^2}}{\sum_k \lambda_k}, \qquad \frac{1}{\sqrt{3}} \le \mu \le 1 \tag{2.1} \]

In Figure 2.5 $\mu$ is plotted as a function of degree of neighbourhood anisotropy. We simply threshold on $\mu$, discarding tensors that are not sufficiently anisotropic.

The Frobenius norm of the tensor, $\|T\|_F$, is a measure of the signal amplitude as seen by the filters. If the amplitude is small, it is possible that the spatial variation of the signal may be too small to dominate over changes in illumination, and the tensor becomes very noise sensitive. We therefore reject tensors whose norm is not greater than an energy threshold $\eta$.
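The two rejection tests can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: tensors are plain nested lists, the helper names are hypothetical, and the default thresholds `mu_min` and `eta` are example values only (the appropriate settings depend on the filters and the data).

```python
import math

def frobenius_norm(T):
    """Frobenius norm of a 3x3 tensor given as nested lists."""
    return math.sqrt(sum(T[i][j] ** 2 for i in range(3) for j in range(3)))

def anisotropy(T):
    """mu of Equation (2.1); assumes a positive trace."""
    return frobenius_norm(T) / (T[0][0] + T[1][1] + T[2][2])

def is_reliable(T, mu_min=0.63, eta=1e-3):
    """Keep a tensor only if it is strong enough and anisotropic enough.
    mu_min and eta are illustrative values, not prescriptions."""
    if T[0][0] + T[1][1] + T[2][2] <= 0.0:
        return False
    return frobenius_norm(T) > eta and anisotropy(T) >= mu_min

def diag(a, b, c):
    return [[a, 0.0, 0.0], [0.0, b, 0.0], [0.0, 0.0, c]]

isotropic = diag(1.0, 1.0, 1.0)     # mu = 1/sqrt(3), no orientation
plane_like = diag(10.0, 1.0, 1.0)   # one dominant eigenvalue
weak = diag(1e-5, 0.0, 0.0)         # anisotropic but below the energy threshold

for name, T in (("isotropic", isotropic), ("plane-like", plane_like), ("weak", weak)):
    print(name, round(anisotropy(T), 3), is_reliable(T))
```

Neither test needs an eigenvalue decomposition, which is the point of using $\mu$ as an early filter.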
To make the energy threshold independent of the absolute level of contrast, one may alternatively reject tensors whose norm is below a small fraction (say, a few percent) of the largest tensor element in a frame.

Footnote 2: Reflects like a mirror.

Figure 2.5: Estimating the degree of anisotropy of a local neighbourhood ($\mu$ plotted against $\lambda$). Solid line: Plane-like anisotropy, $\mu = \sqrt{\lambda^2 + 1 + 1}/(\lambda + 1 + 1)$. Dotted line: Line-like anisotropy, $\mu = \sqrt{2\lambda^2 + 1}/(2\lambda + 1)$.

Next, consider the relative reliability of tensors at different scales. With the filters (almost) symmetrically distributed in space, there is no significant direction dependence of the angular error in the orientation estimation$^{3}$. Consequently we expect the angular error in the estimation of the spatiotemporal displacement vector to be independent of direction. The angular error may be converted into an absolute speed error, which becomes a function of the speed and the scale at which the estimate is obtained, Figure 2.6 (left). Similarly it is interesting to see the angular error at each scale transformed into a corresponding angle at the finest scale, Figure 2.6 (right). This gives an indication of the relative validity of the estimates obtained at different scales as a function of image speed. There are a couple of additional relevant constraints:

1. The spatial localisation of the estimate should be as good as possible.
2. Temporal aliasing should be avoided.

The first of these demands is of particular importance when using more sophisticated models than a constant translation. It also leads to more accurate results at motion borders. The second item calls for a short digression. Consider a sinusoidal signal of frequency $\omega_0$ moving with speed $v_0$ [pixels/frame].
The resulting inter-frame phase shift is $\omega_0 v_0$. However, this phase shift is ambiguous, since the signal looks the same if shifted by any multiple of the wavelength. In particular, any displacement $v_0$ greater than half the wavelength manifests itself as a phase shift of magnitude less than $\pi$ (aliasing). One consequence of this is that an unambiguous estimate of spatiotemporal orientation can strictly only be obtained if the speed $v_0$ satisfies

\[ v_0 < \frac{\pi}{\omega_0} \]

where $\omega_0$ is the maximum spatial frequency. The combination of a quadrature filter comparatively insensitive to high frequencies and a low-pass filter with a low cut-off frequency reduces the actual influence of high-frequency energy, so that we dare to extend the maximum allowed velocity estimated at each scale to well above the nominal 1 pixel per frame, particularly at the coarser scales.

Footnote 3: For the test volume of Figure 2.4 corrupted with additive noise, the average magnitude of the angular error is $0.8^\circ$, $3.0^\circ$, and $9.4^\circ$ for a signal-to-noise ratio of $\infty$ dB, 10 dB, and 0 dB respectively [64].

Figure 2.6: Plots of errors as they appear at the finest scale, assuming an angular error $\Delta\phi = 3.0^\circ$. Full line: finest scale, $k = 0$. Dotted line: coarsest scale, $k = 3$. Left: Absolute speed error $\Delta v$ [pixels/frame] as a function of speed $v$ [pixels/frame] and scale $k$. The functions are $\Delta v(v, k) = 2^k \tan(\arctan(2^{-k} v) + \Delta\phi) - v$. Right: Apparent angular error $\Delta\phi(v, k)$ [deg] as a function of speed $v$ [pixels/frame] and scale $k$. The functions are $\Delta\phi(v, k) = \arctan(v) - \arctan[\,2^k \tan(\arctan(2^{-k} v) - \Delta\phi)\,]$.

The arguments presented above indicate that a tensor validity labelling scheme must include some kind of coarse-to-fine strategy. With initial speed estimates at the coarser scales we can decide whether or not it is useful to descend to finer scales to obtain more precise results.
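The trade-off behind the coarse-to-fine argument can be made concrete by evaluating the speed error function from the caption of Figure 2.6 and the aliasing bound $v_0 < \pi/\omega_0$. The Python sketch below is illustrative only; the function names are hypothetical, and the formula is meaningful only while $\arctan(2^{-k} v) + \Delta\phi$ stays below $90^\circ$.

```python
import math

def speed_error(v, k, dphi_deg=3.0):
    """Absolute speed error (finest-scale pixels/frame) for true speed v
    when the orientation estimated at scale k is off by dphi_deg degrees."""
    dphi = math.radians(dphi_deg)
    return 2 ** k * math.tan(math.atan(2.0 ** -k * v) + dphi) - v

def max_unambiguous_speed(omega_max):
    """Aliasing bound v0 < pi / omega0, in pixels/frame of the scale
    at which omega_max is measured."""
    return math.pi / omega_max

for v in (0.0, 2.0, 8.0):
    print(v, [round(speed_error(v, k), 3) for k in range(4)])

print(max_unambiguous_speed(math.pi))      # 1 pixel/frame at full bandwidth
print(max_unambiguous_speed(math.pi / 2))  # 2 pixels/frame at half bandwidth
```

For slow motion the finest scale gives the smallest speed error, whereas for fast motion the coarse scales are better, which is why a single scale cannot cover the whole speed range well.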
If descending is not useful, we inhibit the corresponding positions at the finer scales, i.e., we set a flag that indicates that they are invalid, Figure 2.7.

Figure 2.7: The use of coarse-to-fine inhibition is here illustrated for a case with a sharp motion gradient, e.g., a motion border between two objects.

The local speed, $s$, is in the moving point case determined by the size of the temporal component of the eigenvector corresponding to the smallest eigenvalue of the tensor. In the case of a single dominant eigenvector (moving edge/line case) the speed is determined by the temporal component of the eigenvector corresponding to the largest eigenvalue. One finds (cf. Appendix A.1.2),

\[ s = \begin{cases} \sqrt{\dfrac{e_{13}^2}{1 - e_{13}^2}} & \text{numerical rank 1 tensor} \\[2ex] \sqrt{\dfrac{1 - e_{33}^2}{e_{33}^2}} & \text{numerical rank 2 tensor} \end{cases} \]

where $e_{i3}$ denotes the temporal component of the eigenvector $\hat{e}_i$. It appears that we have to compute the full eigenvalue decomposition of the tensor. This is actually not the case. The eigenvalues can be efficiently computed as the roots of the (cubic) characteristic polynomial $\det(T - \lambda I)$. Using the fact that a symmetric tensor can always be decomposed into

\[ T = (\lambda_1 - \lambda_2) T_1 + (\lambda_2 - \lambda_3) T_2 + \lambda_3 T_3 \]

with

\[ T_1 = \hat{e}_1 \hat{e}_1^T, \qquad T_2 = \hat{e}_1 \hat{e}_1^T + \hat{e}_2 \hat{e}_2^T, \qquad T_3 = \hat{e}_1 \hat{e}_1^T + \hat{e}_2 \hat{e}_2^T + \hat{e}_3 \hat{e}_3^T = I \]

we create a rank 2 tensor $\tilde{T}$ by subtracting $\lambda_3 I$. This new tensor has the same eigenvectors as the original tensor, but eigenvalues $\tilde{\lambda}_1 = \lambda_1 - \lambda_3$, $\tilde{\lambda}_2 = \lambda_2 - \lambda_3$ and $\tilde{\lambda}_3 = 0$. Now that the tensor is of (at most) rank 2, there are a couple of interesting relations that we can use to find the temporal components of the eigenvectors needed to compute the local speed, namely

\[ \mathrm{Tr}\, \tilde{T}_{xy} = \tilde{\lambda}_1 (1 - e_{13}^2) \quad \text{(numerical rank 1 case)} \]
\[ \det \tilde{T}_{xy} = \tilde{\lambda}_1 \tilde{\lambda}_2 e_{33}^2 \quad \text{(numerical rank 2 case)} \]

where $\tilde{T}_{xy}$ refers to the leading $2 \times 2$ submatrix of $\tilde{T}$, and Tr refers to the trace (sum of diagonal elements) of a matrix. See Appendix A.1.3 for a proof of these elementary but perhaps somewhat unobvious relations.
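To illustrate how the speed can be obtained from eigenvalues alone, the following Python sketch combines a closed-form (trigonometric) solution of the characteristic cubic with the trace and determinant relations above. It is a sketch under assumptions: the function names are hypothetical, the numerical rank is passed in by the caller rather than estimated, and the two test tensors are synthetic.

```python
import math

def eigenvalues_sym3(T):
    """Eigenvalues (descending) of a symmetric 3x3 matrix, from the
    trigonometric solution of its characteristic cubic."""
    a, b, c = T[0][0], T[1][1], T[2][2]
    d, e, f = T[0][1], T[0][2], T[1][2]
    q = (a + b + c) / 3.0
    p2 = (a - q) ** 2 + (b - q) ** 2 + (c - q) ** 2 + 2.0 * (d * d + e * e + f * f)
    if p2 == 0.0:
        return q, q, q
    p = math.sqrt(p2 / 6.0)
    # Entries of B = (T - q I) / p, whose determinant fixes the angle phi.
    a, b, c, d, e, f = (a - q) / p, (b - q) / p, (c - q) / p, d / p, e / p, f / p
    half_det = 0.5 * (a * (b * c - f * f) - d * (d * c - f * e) + e * (d * f - b * e))
    phi = math.acos(max(-1.0, min(1.0, half_det))) / 3.0
    l1 = q + 2.0 * p * math.cos(phi)
    l3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return l1, 3.0 * q - l1 - l3, l3

def speed(T, rank):
    """Local speed from the tensor without computing eigenvectors."""
    l1, l2, l3 = eigenvalues_sym3(T)
    t1, t2 = l1 - l3, l2 - l3                      # eigenvalues of T~ = T - l3 I
    txx, tyy, txy = T[0][0] - l3, T[1][1] - l3, T[0][1]
    if rank == 1:                                  # moving edge/line
        one_minus_e13sq = (txx + tyy) / t1         # Tr T~xy = t1 (1 - e13^2)
        return math.sqrt((1.0 - one_minus_e13sq) / one_minus_e13sq)
    e33sq = (txx * tyy - txy * txy) / (t1 * t2)    # det T~xy = t1 t2 e33^2
    return math.sqrt((1.0 - e33sq) / e33sq)

def weighted_outer_sum(vectors, weights):
    return [[sum(w * v[i] * v[j] for v, w in zip(vectors, weights))
             for j in range(3)] for i in range(3)]

# Moving point with velocity (3, 4): e3 is parallel to (3, 4, 1).
r = math.sqrt(26.0)
e1 = (0.8, -0.6, 0.0)
e2 = (0.6 / r, 0.8 / r, -5.0 / r)
e3 = (3.0 / r, 4.0 / r, 1.0 / r)
point_tensor = weighted_outer_sum([e1, e2, e3], [5.0, 2.0, 0.5])

# Moving edge with normal speed 2: e1 is parallel to (0.6, 0.8, -2).
n = tuple(x / math.sqrt(5.0) for x in (0.6, 0.8, -2.0))
edge_tensor = weighted_outer_sum([n], [4.0])
for i in range(3):
    edge_tensor[i][i] += 0.5                       # isotropic noise floor

print(speed(point_tensor, rank=2), speed(edge_tensor, rank=1))
```

Both synthetic tensors return the speed they were built with (5 and 2 pixels/frame), up to rounding.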
An outline of the algorithm is given in Figure 2.8.

Figure 2.8: Illustration of the multi-scale tensor field estimation procedure. (Stages shown in the diagram: input signal; low-pass pyramid; separate resolution channels; spatio-temporal convolution and tensor construction; anisotropy and energy thresholding with coarse-to-fine data validation; consistent sparse tensor field.)

2.3 Experimental results

To demonstrate the method we apply it to a class of sequences, Figure 2.9, representing a typical tracking situation: a small textured surface slowly translating in front of a comparatively rapidly translating background. Each frame is supplemented by a corresponding vector image representing the true displacement field. The investigation focuses on the compatibility between the multi-scale tensor field and the true displacement field. This is done by looking at the deviation $\Delta\phi$ from the anticipated $90^\circ$ angle between the local spatiotemporal displacement vector and the eigenvector(s) corresponding to the non-zero eigenvalue(s). The appropriate true displacement $\hat{v}$ at a coarse scale is determined by computing the average of the vectors at the four positions of the finer scale that correspond to each position at the coarse scale, followed by a rotation to compensate for the spatial scale change.

Figure 2.9: Test sequence. Every tenth frame is shown.

level          1        2        3
$\mu$          0.63     0.625    0.62
lower limit    0.0      1.0      3.0
upper limit    1.5      3.5

Table 2.2: Suitable threshold values.

Starting at the coarsest scale, at each position we compute a weighted average

\[ \langle \Delta\phi \rangle = \frac{\tilde{\lambda}_1 \Delta\phi_1 + \tilde{\lambda}_2 \Delta\phi_2}{\tilde{\lambda}_1 + \tilde{\lambda}_2} \]

where $\Delta\phi_1$ and $\Delta\phi_2$ are the deviations of the first and second eigenvectors respectively. The average deviation is then converted into an angle at the next, finer scale$^{4}$, and, together with the weight $\tilde{\lambda}_1 + \tilde{\lambda}_2$, propagated to the corresponding four positions at this scale.
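The two per-position operations of this evaluation, the weighted average and the conversion of an angle to the next finer scale, can be sketched as follows. The conversion uses the Figure 2.6 (right) relation for a single scale step; the function names are hypothetical and angles are handled in degrees for readability.

```python
import math

def mean_deviation(dev1, dev2, lam1, lam2):
    """Weighted average of the two eigenvector deviations, with the
    rank-reduced eigenvalues lam1 and lam2 as weights."""
    return (lam1 * dev1 + lam2 * dev2) / (lam1 + lam2)

def to_finer_scale(dev_deg, v):
    """Apparent angle one level finer for a deviation dev_deg observed at a
    scale where the speed is v [pixels/frame at that scale]."""
    dev = math.radians(dev_deg)
    fine = math.atan(2.0 * v) - math.atan(2.0 * math.tan(math.atan(v) - dev))
    return math.degrees(fine)

coarse = mean_deviation(2.0, 6.0, lam1=3.0, lam2=1.0)   # 3 degrees
fine = to_finer_scale(coarse, v=1.0)
print(coarse, round(fine, 3))
```

In this example a 3 degree deviation at the coarse scale appears as a somewhat smaller angle at the finer scale, because the same velocity corresponds to a steeper spatiotemporal direction there.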
A new average deviation angle is computed from the coarse-scale value and the new fine-scale local estimate. Eventually we arrive at the finest scale with an average deviation angle at each position that takes into account estimates at all scales.

Footnote 4: If $v$ denotes the speed, the fine-scale angle is given by (cf. Figure 2.6) $\Delta\phi_{fine} = \arctan(2v) - \arctan(2 \tan(\arctan(v) - \Delta\phi))$.

Results for two test sequences are shown in Tables 2.3 and 2.4. In the first sequence the background moves at a speed of $4\sqrt{2} \approx 5.7$ pixels/frame, in the second at $8\sqrt{2} \approx 11.3$ pixels/frame. In both cases the foreground texture moves at $\sqrt{2} \approx 1.4$ pixels/frame. The 'single scale' column gives the result when each scale is used alone, covering the entire speed range. The 'multi-scale w. inhib.' column refers to the proposed method with multiple scales, with each scale covering a certain speed interval. Following the reasoning in the preceding section, we have a number of parameters to set: the anisotropy threshold $\mu$, the energy threshold $\eta$, and (in the multi-scale case) the upper and lower speed limits for each scale. In the experiments we required a density greater than 20%, i.e., at least every fifth position should have a valid tensor. With this constraint we find that the values of Table 2.2 give accurate results when three levels are used. From Tables 2.3 and 2.4 we see that, as expected, the average error is quite large when only the finest or the coarsest scale is used. The multi-scale method, on the other hand, generates quite accurate estimates for both low and high speed regions. In Figure 2.10 an example of the instantaneous angular error as a function of image position is shown.

Figure 2.10: Distribution of errors using 1, 2, and 3 levels in the pyramid.

               single scale                multi-scale w. inhib.
level(s)    background   foreground     background   foreground
1             11.18         1.73
2              2.56         1.96           2.84         1.47
3              1.23         5.76           1.67         1.56

Table 2.3: Average apparent angular error at finest scale in degrees. Background velocity $v_b = (-4, 4)$, foreground velocity $v_f = (1, 1)$. Density 20%.

               single scale                multi-scale w. inhib.
level(s)    background   foreground     background   foreground
1             10.63         1.91
2              4.60         1.99           4.78         1.17
3              0.86         4.84           1.68         1.57

Table 2.4: Average apparent angular error at finest scale in degrees. Background velocity $v_b = (-8, 8)$, foreground velocity $v_f = (1, 1)$. Density 20%.

Two sequences containing a single translating texture (the background of the previous sequences) were generated. In the first sequence the texture moves at a speed of $\sqrt{2} \approx 1.4$ pixels/frame, in the second at $4\sqrt{2} \approx 5.7$ pixels/frame. Figure 2.11 shows histograms of the apparent angular errors. The average errors are $1.26^\circ$ and $0.54^\circ$ respectively. This shows that the present method is capable of producing results which are comparable with those of any existing motion estimation algorithm.

Finally we provide results for three well-known test sequences used in the investigation by Barron et al. [13]: Fleet's tree sequences and Quam's Yosemite sequence$^{5}$. The tree sequences, Figure 2.12, were generated by translating a camera with respect to a textured planar surface. In the 'Translating tree' sequence the camera moves perpendicular to the line of sight, generating velocities between 1.7 and 2.3 pixels/frame (the surface is tilted). In the 'Diverging tree' sequence the camera moves towards the (tilted) surface, generating velocities between 0.0 and 2.0 pixels/frame. Note that since the velocity range of the tree sequences is quite narrow, they are not very well suited for the study of the performance of multiresolution algorithms. The Yosemite sequence, Figure 2.13, is a computer animation of a flight through Yosemite valley. It was generated by mapping aerial photographs onto a digital terrain map.
Motion is predominantly divergent with large local variations due to occlusion and variations in depth. Velocities range from 0 to 5 pixels/frame. The clouds in the background move rightward while changing their shape over time, which makes accurate estimation of their velocity difficult. In Figure 2.13 (right), note the larger errors caused by motion discontinuities at the mountain ridges. Numerical results for the three sequences are given in Table 2.5. Figures 2.14 and 2.15 show histograms of the angular errors.

Footnote 5: These and other sequences and their corresponding true flows are available via anonymous ftp from the University of Western Ontario at ftp.csd.uwo.ca/pub/vision/TESTDATA.

Figure 2.11: Histograms of apparent angular error distributions (relative bin counts against angular error in degrees). Left: Velocity $v = (1.0, 1.0)$. Right: $v = (4.0, 4.0)$. Density 20%. Each estimate was given a weight equal to its confidence value.

Sequence            Average error    Standard deviation    Density
translating tree    $0.30^\circ$     $0.57^\circ$          66%
diverging tree      $1.72^\circ$     $1.64^\circ$          47%
Yosemite            $2.41^\circ$     $5.36^\circ$          48%

Table 2.5: Performance for the three test sequences.

2.4 Computation of dense optical flow

In the present study we found no reason to compute a dense optical flow, since this is a computationally expensive and notoriously ill-posed problem. However, there are situations when a dense field is required, e.g., in region-based velocity segmentation. At least two principally different approaches to computation of dense fields exist. One is to fit parametric models of motion in local image patches, the models ranging from constant motion via affine transformations to mixture models of multiple motions. The fitting is done by clustering, e.g., [85, 57], or (least-squares) optimisation.
In Chapter 4 we use simple parametric model fitting to segment an image into object and background. The second class of methods is based on regularisation theory, e.g., [91, 75], the basic idea of which is to stabilise solutions of ill-posed problems by imposing extra constraints, e.g., spatial smoothness, on the solutions. The problem of computing a dense optical flow is ill-posed since we need two constraints at each position to be able to determine the two velocity components, but we have only one at our disposal at edges or lines, namely the direction of maximum signal variation. We will now discuss the regularisation approach and its relation to the work presented in this chapter.

Figure 2.12: Translating tree. Left: Single frame from the sequence. Right: Angular error magnitude.

Figure 2.13: Yosemite sequence. Left: Single frame from the sequence. Right: Angular error magnitude.

Figure 2.14: Tree sequences. Histograms of angular errors (relative bin count against angular error in degrees). Left: translating tree. Right: diverging tree. Each estimate was given a weight equal to its confidence value.

Figure 2.15: Yosemite sequence. Histogram of angular errors (relative bin count against angular error in degrees). Each estimate was given a weight equal to its confidence value.

One way to proceed is to define a cost functional

\[ E = E_1 + \alpha E_2 \]

where the first term is a global measure of the inconsistency between the observed tensor field and the optical flow model, and the second term, $E_2$, implements the regularisation by a motion variation penalty. $\alpha$ is a positive number that determines the relative importance of the two terms.
We are looking for the optical flow field that minimises $E$. Any measure of inconsistency between a tensor and a velocity vector must be based on the fact that the direction of maximum signal variation, coinciding with the eigenvector of the largest eigenvalue, is orthogonal to the spatiotemporal direction of motion. Whenever the second eigenvalue is significantly greater than the third, smallest one, its corresponding eigenvector is also orthogonal to the direction of motion. A possible inconsistency term is consequently given by

\[ E_1 = \iint_{image} \left[ (\lambda_1 - \lambda_3)(\hat{e}_1 \cdot \tilde{u})^2 + (\lambda_2 - \lambda_3)(\hat{e}_2 \cdot \tilde{u})^2 \right] dx\,dy = \iint_{image} \tilde{u}^T \tilde{T} \tilde{u} \; dx\,dy \]

where $\tilde{u} = (u, v, 1)^T$ is the spatiotemporal displacement vector, and $\tilde{T}$ the 'rank-reduced' tensor introduced in Section 2.2. In the most simple case the regularisation term is an integral [50]

\[ E_2 = \iint_{image} \left[ \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 \right] dx\,dy \]

which clearly penalises spatial variation in the motion field. There exist several modifications to avoid unwanted smoothing over velocity discontinuities, e.g., [77, 110]. The minimum of the cost functional is found from the corresponding Euler-Lagrange partial differential equation, e.g., [41], which in this particular case is given by

\[ \tilde{t}_1^T \tilde{u} - \alpha \Delta u = 0 \]
\[ \tilde{t}_2^T \tilde{u} - \alpha \Delta v = 0 \]

where $\tilde{t}_i$ denotes the $i$th column vector of $\tilde{T}$, and $\Delta$ is the Laplacian operator. To efficiently solve this equation one may use some instance of the multi-grid method [23, 40, 89, 32]. Multi-grid methods take advantage of low-pass pyramid representations in the following way. The equation is converted into a finite difference approximation which is then solved iteratively, e.g., by the Gauss-Seidel or Jacobi methods [42]. When analysing such iterative methods it is found that high-frequency errors are eliminated much more quickly than errors of longer wavelength. These low-frequency errors are however shifted to higher frequencies at a coarser scale where they consequently may be efficiently eliminated.
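The different decay rates of high- and low-frequency errors are easy to observe on a model problem. The Python sketch below applies damped Jacobi sweeps (weight 2/3, a common smoother choice that is an assumption here; the text only names Gauss-Seidel and Jacobi) to the error of a one-dimensional Laplace problem with zero boundary values, not to the full optical flow equations. All names are illustrative.

```python
import math

def damped_jacobi_sweeps(error, sweeps, weight=2.0 / 3.0):
    """Propagate an error vector through damped Jacobi sweeps for the
    1D Laplace problem with zero boundary values."""
    e = list(error)
    n = len(e)
    for _ in range(sweeps):
        padded = [0.0] + e + [0.0]
        e = [(1.0 - weight) * padded[j]
             + weight * 0.5 * (padded[j - 1] + padded[j + 1])
             for j in range(1, n + 1)]
    return e

def error_reduction(mode, n=31, sweeps=10):
    """Ratio of final to initial error amplitude for a sinusoidal error
    with `mode` half-periods across the grid."""
    h = 1.0 / (n + 1)
    e0 = [math.sin(mode * math.pi * j * h) for j in range(1, n + 1)]
    e = damped_jacobi_sweeps(e0, sweeps)
    return max(abs(x) for x in e) / max(abs(x) for x in e0)

low = error_reduction(mode=1)     # long wavelength
high = error_reduction(mode=16)   # short wavelength
print(f"low-frequency error left:  {low:.4f}")
print(f"high-frequency error left: {high:.2e}")
```

After ten sweeps the short-wavelength error has almost vanished while the long-wavelength error is nearly unchanged. On a grid subsampled by two, the same long-wavelength error spans half as many grid points and is therefore damped more effectively.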
A solution computed at a coarse scale cannot, however, capture fine details, and consequently there are high-frequency errors. The idea is then to produce an approximate solution at one level and then use this as an initial value for the iteration at another level where errors in a different frequency band may be efficiently eliminated. A particularly interesting scheme in the context of multi-level tensor fields is the adaptive multi-scale coarse-to-fine scheme of Battiti, Amaldi, and Koch [14]$^{6}$, which may be regarded as a refinement of an old method called block relaxation, where one starts by computing an approximation to the optical flow at the coarsest scale, uses an interpolated version of this as an initial value for the iteration at the next finer scale, and so on. The adaptive method is based on an error analysis that gives the expected relative velocity error as a function of velocity and scale when using a certain derivative approximation to estimate the brightness gradient. The authors use this to detect those points whose velocity estimates cannot be improved at a finer scale, i.e., points where the coarse scale is optimal. The motion vectors of the corresponding points at the finer scale are then simply set to the interpolated values from the coarser scale and do not participate in the iteration at this level. In the light of the discussion in the preceding sections, it appears straightforward to formulate a tensor field version of this method.

Footnote 6: See [99] for a discussion of the biological aspects of this and other coarse-to-fine strategies for motion computation.

3 A SIGNAL CERTAINTY APPROACH TO SPATIOTEMPORAL FILTERING AND MODELLING

The idea of accompanying a feature estimate with a measure of its reliability has been central to much of the work at the Computer Vision Laboratory.
In recent years a formal theory for this has been developed, primarily by Knutsson and Westin [72, 69, 68, 70, 103, 106, 105]. Applications range from interpolation (see the above references) via frequency estimation [71] to phase-based stereopsis and focus-of-attention control [101, 107]. In this chapter a review of the principles of normalized convolution (NC) is presented, spiced with some new results. This is followed by a description and experimental test of a computationally efficient implementation of NC for spatiotemporal filtering and modelling using quadrature filters and local structure tensors. The experiments show that NC as implemented here gives a significant reduction of distortion caused by incomplete data. The effect is particularly evident when data is noisy.

3.1 Background

Normalized convolution (NC) has its origin in a 1986 patent [67] describing a method to enhance the degree of discrimination of filtering by making operators insensitive to irrelevant signal variation. Let $b$ denote a vector-valued filter and $f$ a corresponding vector-valued signal (e.g., a representation of local orientation). Suppose that the filter is designed to detect a certain pattern of vector direction variation irrespective of signal magnitude variation. A simple way of eliminating interference from magnitude variations would be to normalize the signal vectors. This is unfortunately not a very good idea. The magnitude of the signal contains information about how well the local neighbourhood is described by the signal vector model. This information is lost in the normalization. A special case is when the signal magnitude is zero; then a default model has to be imposed on the local neighbourhood. The patented procedure that provides a solution to the problem is outlined in [43]: "The method is based on a combination of a set of convolutions.
The following four filter results are needed:

\[ s_1 = \langle b, f \rangle \tag{3.1} \]
\[ s_2 = \langle b, \|f\| \rangle \tag{3.2} \]
\[ s_3 = \langle \|b\|, f \rangle \tag{3.3} \]
\[ s_4 = \langle \|b\|, \|f\| \rangle \tag{3.4} \]

where $\|\cdot\|$ denotes the magnitude of the filter or the data. The output at each position is written as an inner product $\langle \cdot, \cdot \rangle$ between the filter and the signal centered around the position. [Formally this actually corresponds to a correlation between signal and filter but the use of the term convolution has stuck. The difference between the terms is just that in the convolution case the filter is reflected in its origin before the inner product is taken.] The first term, $s_1$, corresponds to standard convolution between the filter and the data. The fourth term, $s_4$, can be regarded as the local signal energy under the filter. As to the second and third term, the interpretation is somewhat harder. The filter is weighted locally with corresponding data producing a weighted average operator, where the weights are given by the data giving a "data dependent mean operator". For the third term it is vice versa. The mean data is calculated using the operator certainty as weights producing "operator dependent mean data". The four filter results are combined into

\[ s = \frac{s_4 s_1 - s_2 s_3}{s_4^{\gamma}} \tag{3.5} \]

where $\gamma$ is a constant controlling the model selectivity. This value is typically set to one, $\gamma = 1$. The numerator in Equation (3.5) can consequently be interpreted as a standard convolution weighted with the local "energy" minus the "mean" operator acting on the "mean" data. The denominator is an energy normalization controlling the model versus energy dependence of the algorithm."

The procedure is referred to as a consistency operation, since the result is that the operators are made sensitive only to signals consistent with an imposed model. It was not until quite recently [106] that it was realised that the above method actually is a special case of a general method to deal with uncertain or incomplete data, normalized convolution (NC).
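A small numerical example may help to see what the consistency operation achieves. The Python sketch below represents two-dimensional vectors as complex numbers and takes the inner product as $\langle u, v \rangle = \sum_k \bar{u}_k v_k$; both choices, as well as the function names and the test data, are assumptions made for illustration and are not taken from the patent or from [43]. A neighbourhood with zero signal energy ($s_4 = 0$) is not handled.

```python
import cmath

def inner(u, v):
    """Inner product over the neighbourhood, conjugating the first argument."""
    return sum(complex(x).conjugate() * y for x, y in zip(u, v))

def consistency(b, f, gamma=1.0):
    """Return (s, s1): Equation (3.5) and the plain filter response (3.1)."""
    b_mag = [abs(x) for x in b]
    f_mag = [abs(x) for x in f]
    s1 = inner(b, f)
    s2 = inner(b, f_mag)
    s3 = inner(b_mag, f)
    s4 = inner(b_mag, f_mag).real
    return (s4 * s1 - s2 * s3) / s4 ** gamma, s1

# Filter looking for a rotating vector direction over four positions.
b = [1, 1j, -1, -1j]

# Signal with constant direction but varying magnitude: irrelevant variation.
direction = cmath.exp(0.7j)
f_const = [m * direction for m in (1.0, 2.0, 3.0, 4.0)]
s_const, s1_const = consistency(b, f_const)

# Signal that matches the direction pattern of the filter.
s_match, s1_match = consistency(b, [complex(x) for x in b])

print(abs(s1_const), abs(s_const))   # plain response is non-zero, s vanishes
print(abs(s1_match), abs(s_match))
```

The plain response $s_1$ reacts to the magnitude ramp even though the vector direction never changes, whereas the combined output $s$ is zero for that signal and non-zero for the signal with the sought direction pattern.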
3.2 Normalized and differential convolution

Suppose that we have a whole filter set $\{b_k\}$ to operate with upon our signal $f$. It is possible to regard the filters as constituting a basis in a linear space $\mathcal{B} = \mathrm{span}\{b_k\}$, and the signal may locally be expanded in this basis. In general, the signal cannot be reconstructed from this expansion since it typically belongs to a space of much higher dimension than $\mathcal{B}$. A Fourier expansion is an example of a completely recoverable expansion, which is further simplified by the basis functions (filters) being orthogonal. There are infinitely many ways of choosing the coefficients in the expansion when the filters span only a subspace of the signal space. A natural and mathematically tractable choice, however, is to minimise the orthogonal distance between the signal and its projection on $\mathcal{B}$. This is equivalent to the linear least-squares (LLS) method.

Let us formulate the LLS method mathematically. Choose a basis for the signal space and expand the filters and the signal in this basis. Assume that the dimension of the signal space is $N$, and that we have $M$ filters at our disposal. The coordinates of the filter set may be represented by an $N \times M$ matrix $B$ and those of a scalar signal $f$ by an $N \times 1$ matrix $F$. The expansion coefficients may be written as an $M \times 1$ matrix $\tilde{F}$. All in all we have

\[ B = \begin{pmatrix} | & | & & | \\ b_1 & b_2 & \cdots & b_M \\ | & | & & | \end{pmatrix}, \qquad F = \begin{pmatrix} f_1 \\ \vdots \\ f_N \end{pmatrix}, \qquad \tilde{F} = \begin{pmatrix} \tilde{f}_1 \\ \vdots \\ \tilde{f}_M \end{pmatrix} \]

We assume that $N \ge M$. The LLS problem then consists in minimising

\[ E = \|B\tilde{F} - F\|^2 \tag{3.6} \]

One often has a notion of reliability of measurement. As was indicated above, the direction of vector-valued signals may carry a representation of a detected feature, whereas the magnitude indicates the reliability of that statement. When a measurement is unreliable, there is no point in minimising the projection distance for the corresponding element. On the other hand, one wants small distortion for the reliable components.
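This leads to the weighted linear least-squares (WLLS) method, with an objective function

\[ E_W = \|W(B\tilde{F} - F)\|^2 \tag{3.7} \]

where $W = \mathrm{diag}(w) = \mathrm{diag}(w_1, \ldots, w_N)$ is an $N \times N$ diagonal matrix with the reliability weights. Letting $A^*$ denote complex conjugation and transposition of a matrix $A$ one finds

\[ E_W = (B\tilde{F} - F)^* W^T W (B\tilde{F} - F) = \tilde{F}^* B^* W^2 B \tilde{F} - 2 \tilde{F}^* B^* W^2 F + F^* W^2 F = \tilde{F}^* G \tilde{F} - 2 \tilde{F}^* x + c \]

with $G = B^* W^2 B$, $x = B^* W^2 F$ and the constant $c = F^* W^2 F$. Since $G$ is positive definite we may use a theorem that states that any $\tilde{F}$ that minimises $E_W$ also satisfies the linear equation

\[ G\tilde{F} = x \]

Consequently

\[ \tilde{F} = G^{-1} x = (B^* W^2 B)^{-1} B^* W^2 F \tag{3.8} \]

Introducing the notation $a \cdot b = \mathrm{diag}(a)\, b$ for element-wise vector multiplication we may rewrite this in the language of convolutions

\[ \tilde{F} = \begin{pmatrix} \langle w^2 \cdot b_1, b_1 \rangle & \cdots & \langle w^2 \cdot b_1, b_M \rangle \\ \vdots & & \vdots \\ \langle w^2 \cdot b_M, b_1 \rangle & \cdots & \langle w^2 \cdot b_M, b_M \rangle \end{pmatrix}^{-1} \begin{pmatrix} \langle w^2 \cdot b_1, f \rangle \\ \vdots \\ \langle w^2 \cdot b_M, f \rangle \end{pmatrix} \tag{3.9} \]

It turns out to be profitable to decompose the weight matrix into a product $W^2 = AC$ of a matrix $C = \mathrm{diag}(c)$ containing the data reliability and a second diagonal matrix $A = \mathrm{diag}(a)$ with another set of weights for the filter coefficients, corresponding to a classical windowing function. Consequently a realised filter is regarded as a product $a \cdot b$ of the windowing function $a$ and the basis function $b$. In this perspective it is clear that Equation (3.9) is unsatisfactory, since the data weights are mixed into the filters, but fortunately a little algebra does the job:

\[ \langle w^2 \cdot b_i, b_j \rangle = \langle a \cdot c \cdot b_i, b_j \rangle = \langle a \cdot b_i, c \cdot b_j \rangle = \langle a \cdot b_i \cdot b_j^*, c \rangle \]

and

\[ \langle w^2 \cdot b_i, f \rangle = \langle a \cdot c \cdot b_i, f \rangle = \langle a \cdot b_i, c \cdot f \rangle \]

We arrive at a final expression for normalized convolution

\[ \tilde{F} = \begin{pmatrix} \langle a \cdot b_1 \cdot b_1^*, c \rangle & \cdots & \langle a \cdot b_1 \cdot b_M^*, c \rangle \\ \vdots & & \vdots \\ \langle a \cdot b_M \cdot b_1^*, c \rangle & \cdots & \langle a \cdot b_M \cdot b_M^*, c \rangle \end{pmatrix}^{-1} \begin{pmatrix} \langle a \cdot b_1, c \cdot f \rangle \\ \vdots \\ \langle a \cdot b_M, c \cdot f \rangle \end{pmatrix} \tag{3.10} \]

Note that we must design $M(M+1)/2$ new outer product filters $a \cdot b_i \cdot b_j^*$ in addition to the original $M$ filters $a \cdot b_i$.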
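Equation (3.10) can be tried out on a single one-dimensional neighbourhood. The following Python sketch is a minimal illustration under stated assumptions: two real basis functions (a constant and a ramp), a Gaussian applicability, a binary certainty, and hypothetical function names. It is not the implementation used in this thesis.

```python
import math

def inner(u, v):
    """Inner product of two real vectors."""
    return sum(p * q for p, q in zip(u, v))

def times(*vectors):
    """Element-wise product of equally long vectors."""
    return [math.prod(values) for values in zip(*vectors)]

def normalized_convolution(f, c, a, b1, b2):
    """Expansion coefficients of f in {b1, b2} according to Equation (3.10),
    for one neighbourhood and two real basis functions."""
    g11 = inner(times(a, b1, b1), c)
    g12 = inner(times(a, b1, b2), c)
    g22 = inner(times(a, b2, b2), c)
    x1 = inner(times(a, b1), times(c, f))
    x2 = inner(times(a, b2), times(c, f))
    det = g11 * g22 - g12 * g12          # zero if too few certain samples
    return (g22 * x1 - g12 * x2) / det, (g11 * x2 - g12 * x1) / det

xs = [-3, -2, -1, 0, 1, 2, 3]
a = [math.exp(-x * x / 8.0) for x in xs]           # applicability (window)
b1 = [1.0] * len(xs)                               # constant basis function
b2 = [float(x) for x in xs]                        # ramp basis function
c = [1, 0, 1, 1, 0, 0, 1]                          # certainty: 0 marks missing data
f = [2.0 + 3.0 * x if ck else 100.0 for x, ck in zip(xs, c)]  # garbage where c = 0

print(normalized_convolution(f, c, a, b1, b2))               # close to (2, 3)
print(normalized_convolution(f, [1] * len(xs), a, b1, b2))   # garbage trusted
```

With the certainty taken into account, the offset 2 and slope 3 are recovered exactly although three of the seven samples are missing. If all samples are trusted, the corrupted values distort both coefficients.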
Equation (3.10) may also be derived from tensor algebra by regarding the process as a change of coordinate system in a linear space with a metric whose coordinates are $a \cdot c$ in the original basis. One then finds that $G$ contains the coordinates of the metric expressed in the $\{b_k\}$ basis, whereas $G^{-1}$ is the metric coordinates expressed in the basis dual to $\{b_k\}$. The dual basis consists of the operators $\{b^k\}$ for which $\langle b^i, b_j \rangle = \delta^i_j$. From the tensor algebra perspective Equation (3.10) describes a coordinate transformation of the signal from $\{b^k\}$ to $\{b_k\}$. To be able to compare the output of the normalized convolution with a standard convolution it is necessary to transform back to the dual basis. This is achieved by operating with the metric on $\tilde{F}$, but since the varying reliability of data has now been compensated for, the transformation is achieved by using a matrix

\[ G_0 = \begin{pmatrix} \langle a \cdot b_1 \cdot b_1^*, \mathbf{1} \rangle & \cdots & \langle a \cdot b_1 \cdot b_M^*, \mathbf{1} \rangle \\ \vdots & & \vdots \\ \langle a \cdot b_M \cdot b_1^*, \mathbf{1} \rangle & \cdots & \langle a \cdot b_M \cdot b_M^*, \mathbf{1} \rangle \end{pmatrix} \tag{3.11} \]

where $\mathbf{1} = (1, 1, \ldots, 1)^T$. This matrix may be precomputed since it is independent of the data weights.

Of course, the resultant output from the NC computation must be accompanied by a corresponding certainty function. There are three factors that influence this function: the input certainty, the numerical stability of the NC algorithm, and independent certainty estimates pertaining to new feature representations constructed from the NC filter responses$^{1}$. The numerical stability of the algorithm has to do with the nearness to singularity$^{2}$ of the metric $G$. This is quantified by the condition number [42]

\[ \kappa(G) = \|G\| \, \|G^{-1}\| \]

It is easily verified that $1 \le \kappa(G) \le \infty$ for p-norms and the Frobenius norm. From this, we arrive at a possible certainty measure that takes into account the magnitude of the input certainty function

\[ c(G) = \frac{1}{\|G_0\| \, \|G^{-1}\|} \tag{3.12} \]

The total output certainty function is then a product of $c(G)$ and the feature representation certainty.
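The behaviour of the certainty measure (3.12) can be checked on a small example. The Python sketch below uses the Frobenius norm, two real basis functions and a Gaussian applicability; these choices and all names are assumptions made for illustration only.

```python
import math

def metric(a, c, basis):
    """2x2 matrix with entries <a.b_i.b_j, c> for two real basis functions."""
    return [[sum(ak * ck * bi[k] * bj[k] for k, (ak, ck) in enumerate(zip(a, c)))
             for bj in basis] for bi in basis]

def inverse2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def frobenius(m):
    return math.sqrt(sum(v * v for row in m for v in row))

def output_certainty(a, c, basis):
    """c(G) of Equation (3.12), with the Frobenius norm."""
    g = metric(a, c, basis)
    g0 = metric(a, [1.0] * len(a), basis)   # depends on the filters only
    return 1.0 / (frobenius(g0) * frobenius(inverse2(g)))

xs = [-3, -2, -1, 0, 1, 2, 3]
a = [math.exp(-x * x / 8.0) for x in xs]
basis = ([1.0] * len(xs), [float(x) for x in xs])

full = output_certainty(a, [1.0] * len(xs), basis)
half = output_certainty(a, [0.5] * len(xs), basis)
gaps = output_certainty(a, [1, 0, 1, 1, 0, 0, 1], basis)
print(round(full, 4), round(half, 4), round(gaps, 4))
```

With full input certainty the measure equals $1/\kappa(G_0)$. Halving the input certainty everywhere halves the output certainty, and removing samples lowers it further because $G$ moves closer to singularity.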
To summarise, what have we achieved by this effort? From the original signal we have at each of the original positions arrived at an expansion of the signal in the basis $\{b_k\}$ within a window described by $a$, the applicability function, given a data weighting function $c$, the certainty function. In this way we have imposed a model onto the original signal, and by transforming back to the dual basis using $G_0$, the result is as if we had applied standard filtering on this model.

¹ For instance, an estimate of dominant orientation may be accompanied by a certainty function $c = (\lambda_1 - \lambda_2)/\lambda_1$.
² We will here not discuss the possible use of generalised inverses, such as the Moore-Penrose pseudoinverse, but see, e.g., [79, 82].

What is the relation between normalized convolution and the consistency operation described earlier? Suppose that the linear space $\mathcal{B}$ spanned by $\{b_k\}$ is divided into two subspaces $\mathcal{S}$ and $\mathcal{L}$ that are orthogonal in the standard Euclidean metric. This orthogonality relation may no longer apply under the data reliability dependent metric $G$. The result is that, to filter a signal using the NC approach, one may have to expand the signal in more basis functions than those which are of primary interest, the reason being that the new metric induces correlations between previously orthogonal basis functions. If not enough basis functions are used, the signal model may become inaccurate and the resultant filtering output misleading. Suppose that we are interested in filtering the signal with the basis functions in $\mathcal{S}$, but want to include the ones in $\mathcal{L}$ to model the signal. It is then possible to reduce the size of the metric matrix to be inverted in normalized convolution from the original $M \times M$ to $\mathrm{Dim}(\mathcal{S}) \times \mathrm{Dim}(\mathcal{S})$. This is described in detail in [103]. Just as $\mathcal{B}$, the basis coordinate matrix may be decomposed, $B = (S \;\; L)$.
When $\mathrm{Dim}(\mathcal{L}) = 1$ we obtain the following expression for the projection of $f$ on $\mathcal{S}$

\[ \tilde{F}_S = \left( L^*ACL \; S^*ACS - S^*ACL \; L^*ACS \right)^{-1} \left( L^*ACL \; S^*A - S^*ACL \; L^*A \right) CF \tag{3.13} \]

For the important case $L = (1, 1, \ldots, 1)^T$ (DC component), this may in terms of convolutions be written

\[ \tilde{F}_S = G_S^{-1} \begin{pmatrix} \langle a, c \rangle \langle a \cdot b_1, c \cdot f \rangle - \langle a \cdot b_1, c \rangle \langle a, c \cdot f \rangle \\ \vdots \\ \langle a, c \rangle \langle a \cdot b_{\mathrm{Dim}(\mathcal{S})}, c \cdot f \rangle - \langle a \cdot b_{\mathrm{Dim}(\mathcal{S})}, c \rangle \langle a, c \cdot f \rangle \end{pmatrix} \tag{3.14} \]

with

\[ G_S(i, j) = \langle a, c \rangle \langle a \cdot b_i \cdot \bar{b}_j, c \rangle - \langle a \cdot b_i, c \rangle \langle a \cdot \bar{b}_j, c \rangle \tag{3.14'} \]

This is referred to as normalized differential convolution (NDC). If we have a single filter $a \cdot b$, insensitive to constant signals, and with the property that $b \cdot \bar{b} = (1, 1, \ldots, 1)^T$, we obtain

\[ f_S = \frac{\langle a, c \rangle \langle a \cdot b, c \cdot f \rangle - \langle a \cdot b, c \rangle \langle a, c \cdot f \rangle}{\langle a, c \rangle^2 - |\langle a \cdot b, c \rangle|^2} \tag{3.15} \]

This is almost Equation (3.5) with $\gamma = 2$.

3.3 NDC for spatiotemporal filtering and modelling

In this section we will apply NDC in its most simple form, Equation (3.15), to the problem of estimating and modelling a local spatiotemporal neighbourhood. The model we use is the local structure tensor reviewed in Section 1.4 and already extensively used in this thesis. Recall that it is estimated as a linear combination of six basis tensors with coefficients equal to the magnitude of six corresponding quadrature filter responses. The question now arises where and how to apply normalized convolution. One possibility is to use the six quadrature filters plus a low-pass filter (or more accurately, their corresponding basis functions, yet to be defined) as our signal model space. However, this is for efficiency reasons not possible, since it would require $7 \cdot 8 / 2 = 28$ new complex outer-product filters and inversion of a $6 \times 6$ complex matrix at each position. Instead we have to resort to a much more modest approach, namely to perform the NC at the quadrature filter level. The quadrature filter consists of an even real and an odd imaginary part that are reciprocal Hilbert transforms.
With a supplementary constant basis function we consequently have at our disposal three basis functions. Being complex, the quadrature filter may be written $h(x) = m(x) \exp[i\phi(x)] = m(x)(\cos\phi(x) + i \sin\phi(x))$, where $m(x)$ is the magnitude and $\phi(x)$ the phase. It is natural to let $m$ become the applicability, and the real and imaginary parts of $\exp[i\phi(x)]$ the two non-constant basis functions in NC. With three basis functions we have to generate four new outer-product filters $m(x)$, $m(x)\cos^2\phi(x)$, $m(x)\sin^2\phi(x)$ and $m(x)\cos\phi(x)\sin\phi(x)$. Of course, the trigonometric identity $\sin^2\phi + \cos^2\phi = 1$ makes it possible to dispose of one of the filters, so we are left with three extra. Since signal and certainty data must be filtered separately, the grand total is $(5/2) \cdot 2 = 5$ times as many 3D convolutions in NC as in the original algorithm. An equivalent formulation uses two complex basis functions $\exp[\pm i\phi(x)]$ plus a constant function, which also leads to three new (real) filters. Note that when a real-valued signal is expanded in such a basis, the optimal coefficients of the two complex basis functions will necessarily be equal except for a complex conjugation, so the number of free parameters is the same as in the case of real basis functions.

It is actually possible to reduce the number of convolutions further by modelling the signal with a single complex basis function $b = \exp[i\phi(x)]$ plus a constant function. This was suggested by Westelius [101] based on experimental results, but no theoretical motivation was given. Since it may not be apparent why it should be useful to model a real signal with a single complex basis function, a simple hand-waving argument will be provided. With two complex basis functions $b$ and $\bar{b}$ we get a metric (cf.
Equation (3.14))

\[ G_S = \begin{pmatrix} \langle a, c \rangle^2 - |\langle a \cdot b, c \rangle|^2 & \langle a, c \rangle \langle a \cdot b^2, c \rangle - \langle a \cdot b, c \rangle^2 \\ \text{conjug.} & \langle a, c \rangle^2 - |\langle a \cdot b, c \rangle|^2 \end{pmatrix} \]

The off-diagonal filter $a \cdot b^2$ has a Fourier transform that essentially is a shifted version of the transform of the quadrature filter, its centre frequency being approximately twice that of the quadrature filter. Consequently it will be quite insensitive to low-frequency components of the certainty function. The three filters $a$, $a \cdot b$ and $a \cdot b^2$ actually cover the Fourier space symmetrically: the first one is a lowpass filter, the second a bandpass filter and the third a highpass filter. Now, the certainty function has one particular characteristic: it is positive, which means that its DC component is always the largest Fourier coefficient. As a result, the diagonal term $\langle a, c \rangle^2$ tends to be larger than the others. In fact, a considerable variance in the certainty function combined with a low mean is required for the off-diagonal elements to become significant. In our present work the certainty function is a binary valued (0/1) spatial mask with zero certainty in regions that we want to disregard. We want to be able to filter close to the certainty discontinuity and to interpolate results across narrow zero-certainty gaps. Figure 3.1 shows what happens at a certainty edge. We used a realisation of the lognormal filter of Figure 1.4. The conclusion is that the non-diagonal elements are insignificant for positions outside the zero-certainty region³. Keeping only the diagonal elements, it is then easily seen that the NDC formula reduces to Equation (3.15).

Figure 3.1: Left: Certainty function $c$ as a function of spatial position. Right: Scalar product magnitude as a function of spatial position. Solid line: $\langle a, c \rangle$. Dashed: $\langle a \cdot b, c \rangle$. Dotted: $\langle a \cdot b^2, c \rangle$.
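To make the single-basis-function formula concrete, Equation (3.15) can be sketched for a one-dimensional signal as below. The function name `ndc_single_basis`, the threshold `eps` and the use of `np.correlate` with zero-padded borders are assumptions of this sketch; the thesis implementation uses separable 3D filters instead.

```python
import numpy as np

def ndc_single_basis(f, c, a, b, eps=1e-12):
    """Normalized differential convolution, Equation (3.15), for a 1-D signal.

    f, c : real signal and certainty arrays of equal length
    a    : real, non-negative applicability window (odd length)
    b    : complex basis function on the same support, with |b| = 1
    """
    f = np.asarray(f, dtype=float)
    c = np.asarray(c, dtype=float)

    def inner(kernel, signal):
        # <kernel, signal> evaluated at every position; np.correlate
        # conjugates its second argument, i.e. the kernel.
        return np.correlate(signal, kernel, mode="same")

    a_c, ab_c = inner(a, c), inner(a * b, c)
    a_cf, ab_cf = inner(a, c * f), inner(a * b, c * f)
    num = a_c * ab_cf - ab_c * a_cf
    den = a_c**2 - np.abs(ab_c)**2
    out = np.zeros(len(f), dtype=complex)
    ok = den > eps  # skip positions with too little certainty
    out[ok] = num[ok] / den[ok]
    return out
```

Three properties follow directly from the formula and are easy to verify with this sketch: a constant added to the signal does not change the output, a global rescaling of the certainty does not change the output, and signal values at zero-certainty positions are ignored.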
Having thus introduced the use of a single complex basis function, it is straightforward to apply the NDC formula, Equation (3.15), once we have generated the applicability filter $a = m(x)$. For efficiency reasons this should be implemented as a separable filter. Since the applicability function is closely related to the quadrature filter we look for an implementation that can take advantage of this relationship. This is important for two reasons. Firstly, the design of the separable filters is done by optimisation (cf. Section 1.4.2), so generally there is a certain discrepancy between the resulting filter and the ideal 3D function. Two completely independent optimisations, one for the applicability and one for the quadrature filter, do not take into account the fact that the applicability function should be equal to the magnitude of the realised quadrature filter. Secondly, there is a possibility of further reducing the number of operations. An important observation (cf. Section 1.4.2) is that the 3D quadrature filter may be decomposed into two orthogonal one-dimensional lowpass filters and a (predominantly) one-dimensional quadrature filter. It would be nice to be able to use the two lowpass filters in the applicability filtering, just adding a third lowpass filter next to the 1D quadrature filter. However, there is one problem: the coefficients of the applicability filter must be positive, which adds an extra constraint to the optimisation of the lowpass filters.

³ An equivalent statement is that the basis functions $1$, $\exp[i\phi(x)]$ and $\exp[-i\phi(x)]$ remain practically orthogonal in the new metric $a \cdot c$.

Although a detailed description of the optimisation process is outside the scope of this presentation, a short digression may be in order [7]. The positivity constraint is implemented by adding a regularisation term to the original cost function.
The original cost is a measure of the distance between the ideal filter function and the realised filter in the Fourier domain. The new term similarly represents the distance between a constant, slightly positive ideal function and the realised filter, but now in the spatial domain. Let the total cost function be

\[ E_{tot} = \alpha E_F + (1 - \alpha) E_{spat}, \qquad 0 \le \alpha \le 1 \]

The adaptive factor $\alpha$ is decreased below one when the current realisation of the filter contains negative coefficients, and increased towards one whenever the coefficients are all positive. The spatial cost function includes an adaptive weighting function which is increased wherever a coefficient is negative and decreased towards zero at spatial positions where the filter coefficient is non-negative. The idea is that the spatial ideal function should only influence the optimisation when (through $\alpha$) and where (through the spatial weighting function) the current coefficients are negative. The combined effect of the (non-linear) adaptive parameters is to force the iterative optimisation process to find the realisable filter with positive coefficients that is closest to the ideal filter function. The resultant 3D quadrature filters are actually not significantly inferior to the ones generated by the 'unconstrained' optimisation described in Section 1.4.2.

Figure 3.2 shows the pattern of sequential filtering that produces the quadrature and applicability filter responses, which may be compared with Figure 1.5. Note that we need two such structures: one to filter the signal-certainty product, and one to filter the certainty function itself. It is possible to generalise the scheme to more complex signal models that require additional filters, such as the $a \cdot b^2$ filter that turns up when using two complex basis functions to model the signal. This filter is then of course also decomposed into the two lowpass filters followed by a third complex filter.
To be able to use the NC result in the construction of the tensor we must transform back to the dual coordinates using the $G_0$ matrix of Equation (3.11). Since the quadrature filter is insensitive to constant signals this reduces to a multiplication by $\langle a, \mathbf{1} \rangle$, the sum of the applicability filter coefficients. To minimise computations the filters are normalized so that $\langle a, \mathbf{1} \rangle = 1$.

Figure 3.2: 3D sequential quadrature/applicability filtering structure. (Diagram: the image sequence is passed through one-dimensional low-pass filters followed by one-dimensional quadrature/applicability filters, giving six quadrature and six applicability filter responses.)

From the NC quadrature filter responses and their output certainties, Equation (3.12), we now proceed to construct the local structure tensor. The proper treatment of the certainty function related to the tensor construction procedure requires some modification of the presentation given in Section 1.4.1, Equations (1.4), (1.5) and (1.6), since a non-constant certainty function can be interpreted as a change of basis tensors $N_k$ and their duals $M_k$. To take into account the filter response certainties $c_k$, we set

\[ T_{est} = \sum_k |q_k| \, M_k(\{c_k\}) = \Big[ \sum_i c_i \, N_i \otimes N_i \Big]^{-1} \sum_k |q_k| \, c_k \, N_k \tag{3.16} \]

The numerical stability of the procedure depends on the nearness to singularity of the $6 \times 6$ matrix $[\sum_i c_i N_i \otimes N_i]$. We may consequently use its condition number to construct a certainty function for the tensor estimation, in exactly the same way as in Equation (3.12). The computational burden of inverting a $6 \times 6$ matrix at each spatial position is considerable. To avoid unnecessary operations the relative sizes of the filter certainties should be monitored so that precomputed matrices are used whenever the certainty variation is sufficiently isotropic.
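A small numerical sketch of Equation (3.16) is given below. Several ingredients are assumptions made for illustration only: the six filter directions are taken as vertices of an icosahedron, the basis tensors as $N_k = \hat{n}_k \hat{n}_k^T$, and the filter magnitudes are idealised as $|q_k| = \langle N_k, T \rangle$; the actual definitions are those of Section 1.4. Symmetric tensors are mapped to 6-vectors so that the $6 \times 6$ matrix of the text appears explicitly.

```python
import numpy as np

def sym_to_vec(T):
    """Map a symmetric 3x3 tensor to a 6-vector, preserving the Frobenius inner product."""
    s = np.sqrt(2.0)
    return np.array([T[0, 0], T[1, 1], T[2, 2],
                     s * T[0, 1], s * T[0, 2], s * T[1, 2]])

def vec_to_sym(v):
    """Inverse of sym_to_vec."""
    s = np.sqrt(2.0)
    return np.array([[v[0], v[3] / s, v[4] / s],
                     [v[3] / s, v[1], v[5] / s],
                     [v[4] / s, v[5] / s, v[2]]])

def filter_directions():
    """Six unit vectors through icosahedron vertices (assumed filter directions)."""
    a, b = 2.0, 1.0 + np.sqrt(5.0)
    n = np.array([[a, 0, b], [-a, 0, b], [b, a, 0],
                  [b, -a, 0], [0, b, a], [0, b, -a]])
    return n / np.linalg.norm(n, axis=1, keepdims=True)

def estimate_tensor(q_mag, cert, directions):
    """T_est = [sum_i c_i N_i (x) N_i]^-1 sum_k |q_k| c_k N_k, cf. Equation (3.16)."""
    N = np.array([sym_to_vec(np.outer(n, n)) for n in directions])  # row k holds N_k
    G = (N * cert[:, None]).T @ N   # 6x6 matrix sum_i c_i N_i N_i^T
    rhs = N.T @ (cert * q_mag)      # sum_k |q_k| c_k N_k
    return vec_to_sym(np.linalg.solve(G, rhs))
```

Under the idealised response model the tensor is recovered for any set of strictly positive certainties. When all certainties are equal, the matrix `G` is a fixed multiple of a constant matrix, so its inverse can be computed once in advance, which is the precomputation referred to above.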
In this way the truly hard work is generally limited to a small fraction of the positions in each frame.

3.4 Experimental results

To determine the performance of the NDC implementation we used an onion-shaped 64-cube 3D volume, Figure 3.3 (left), for which the actual dominant 3D orientation is known at each position. We did not investigate the performance at the filter level, and, in particular, the phase behaviour was not accounted for. [Recall that only the magnitude of the filter response is used when constructing the tensor.] However, these properties have been thoroughly investigated by Westelius [101] for one-dimensional quadrature filters. The orientation estimation test was done by multiplying each 2D image with a certainty function set to zero at a number of rows in the central part and to one everywhere else. The idea is that we want to disregard the values in the zero-certainty slice when computing the 3D orientation outside the slice. Of course, by setting the signal value to zero we introduce an artifact that the standard convolution cannot handle. The question is how well the NDC algorithm will do. We chose to look at the 3D orientation half-way through the volume, at frame 32. The dominant 3D orientation at each spatial position was estimated from the tensor and compared with the 'actual' value. An average angular error for each row was computed by averaging over the columns. The test was repeated for three different levels of white noise, corresponding to signal-to-noise ratios of ∞ dB, 10 dB and 0 dB, Figures 3.3-3.15. As is seen, the NDC implementation quite successfully manages to capture the correct orientation at the mask border, whereas the standard algorithm estimates an 'incorrect' signal orientation several pixels outside the border due to the artifact edge generated by the zero-mask. Also, note that the imposed signal model appears to have a very beneficial effect on the performance for noisy signals.
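The error measure used in the figures can be sketched as follows. The exact formula is not spelled out in the text, so this sketch assumes that the dominant orientation is the eigenvector of the largest tensor eigenvalue and that the angular error is sign-invariant, since an orientation has no direction; the function names are hypothetical.

```python
import numpy as np

def dominant_orientation(T):
    """Eigenvector of the largest eigenvalue for a stack of symmetric tensors.

    T has shape (..., 3, 3); the result has shape (..., 3).
    """
    _, vecs = np.linalg.eigh(T)   # eigenvalues are returned in ascending order
    return vecs[..., :, -1]

def row_mean_angular_error(est, true):
    """Mean angular error in degrees for each row.

    est, true : arrays of shape (rows, cols, 3) holding unit vectors.
    The absolute value makes the error independent of the sign of the vectors.
    """
    cosang = np.abs(np.sum(est * true, axis=-1))
    ang = np.degrees(np.arccos(np.clip(cosang, 0.0, 1.0)))
    return ang.mean(axis=1)
```

Applied to one frame of estimated tensors and the known orientations of the test volume, `row_mean_angular_error` returns one value per image row, which is the quantity plotted against row number.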
3.5 Discussion

Despite several very successful applications, including the new ones in this thesis, we feel that the full potential of the signal/certainty philosophy is yet to be exploited. A consistent signal/certainty treatment should be an integrated part of the processing at all levels of a robot vision system, including high-level segmentation, recognition, and decision-making. Adaptive signal processing ('learning') in combination with normalized convolution (or a similar concept) has the potential to provide very efficient high-level components. Some quite promising work in this direction has been carried out [19].

Figure 3.3: Left: Slice 32 from the test sequence. Right: Orientation error in degrees for each row averaged over all columns.

Figure 3.4: Output certainty functions. Left: Mask width 4 pixels. Right: Mask width 20 pixels.

Figure 3.5: Noise-free sequence. Mask width 4 pixels. Average orientation error in degrees for each row. Left: Standard convolution. Right: Normalized convolution.

Figure 3.6: Noise-free sequence. Mask width 4 pixels. Errors inside mask deleted. Average orientation error in degrees for each row. Left: Standard convolution. Right: Normalized convolution.

Figure 3.7: Noise-free sequence. Mask width 20 pixels. Errors inside mask deleted. Average orientation error in degrees for each row. Left: Standard convolution. Right: Normalized convolution.
Figure 3.8: Left: Slice 32 from the test sequence corrupted with white noise at 10 dB SNR. Right: Orientation error in degrees for each row averaged over all columns.

Figure 3.9: 10 dB SNR. Mask width 4 pixels. Average orientation error in degrees for each row. Left: Standard convolution. Right: Normalized convolution.

Figure 3.10: 10 dB SNR. Mask width 4 pixels. Errors inside mask deleted. Average orientation error in degrees for each row. Left: Standard convolution. Right: Normalized convolution.

Figure 3.11: 10 dB SNR. Mask width 20 pixels. Errors inside mask deleted. Average orientation error in degrees for each row. Left: Standard convolution. Right: Normalized convolution.

Figure 3.12: Left: Slice 32 from the test sequence corrupted with white noise at 0 dB SNR. Right: Orientation error in degrees for each row averaged over all columns.

Figure 3.13: 0 dB SNR. Mask width 4 pixels. Average orientation error in degrees for each row. Left: Standard convolution. Right: Normalized convolution.

Figure 3.14: 0 dB SNR. Mask width 4 pixels. Errors inside mask deleted. Average orientation error in degrees for each row. Left: Standard convolution. Right: Normalized convolution.
Figure 3.15: 0 dB SNR. Mask width 20 pixels. Errors inside mask deleted. Average orientation error in degrees for each row. Left: Standard convolution. Right: Normalized convolution.

4 MOTION SEGMENTATION AND TRACKING

In this chapter we present a simple application within the realm of active vision, using concepts developed in the previous chapters. Active vision¹ [4, 2, 8, 9, 10, 11, 3, 27, 30] is a paradigm that regards vision as a process with a purpose, a closed loop involving the observer and the environment with which it interacts. Control of sensing is stressed: the current goal of the system dictates attention. Active vision also emphasizes the importance of proprioception in facilitating and resolving ambiguities in the perception of the environment. This involves knowledge of the observer's own position and motion, as well as accurate models of sensors and actuators.

We present computationally efficient algorithms for smooth pursuit tracking of a single moving object. We use knowledge of pan and tilt joint velocities to compensate for camera-induced motion when tracking with a pan-tilt rotating camera. In a second algorithm we allow camera translation and assume the background motion field in the vicinity of the target can be fitted to a single parameterised motion model.

Most computer vision tracking algorithms derive the motion information necessary for the pursuit from observation of target position only, which unfortunately is very difficult to obtain with precision. The reason that observations of velocity are not used is of course the computational cost of accurate measurement: most real-time algorithms match pairs of consecutive images, and then it is just not possible to obtain accurate and stable velocity estimates.
Advances in signal processing theory with designs of highly efficient sequential and recursive filtering schemes for velocity estimation challenge the old established 'quick-and-dirty' approaches to motion tracking. In our work we use the multi-resolution local structure tensor field method described in Chapter 2 to obtain high-quality estimates of velocity which we then use actively in prediction and segmentation. The algorithms readily incorporate the signal certainty/modelling formalism developed in the previous chapter.

¹ Approximate synonyms: Animate Vision, Attentive Vision, Behavioural Vision, Purposive Vision.

4.1 Space-variant sensing

A visual system, whether biological or artificial, that has the ambition to provide its host with useful information in a general setting, must be able to process information at multiple spatial and temporal resolutions. Discrimination of detailed shape and texture requires very high spatial sensor resolution, whereas reliable estimation of high speeds demands large receptive fields. To cover the broad spectrum of spatial and temporal frequencies that appear in the visual input that is relevant to higher animals or autonomous robots a whole range of resolution levels is needed. The problem is that any information processing system has a limited capacity, which means that an implementation must make some sort of compromise. The eyes of higher animals have developed into space-variant foveal sensors which have decreasing spatial resolution towards the periphery. The density of sensory elements is several orders of magnitude higher in the centre of the field of view, the fovea, than in the periphery. Correspondingly, in the primary visual cortex there are several times as many neurons engaged in processing foveal signals as there are neurons dedicated to peripheral information.
To compensate for the lack of spatial resolution in the periphery, the visual system constantly shifts its gaze direction to bring peripheral stimuli of potential interest into the centre of the field of view for scrutiny. This is done by means of saccades, fast ballistic eye movements to foveate the target.

The concept of a space-variant sensor, or fovea for short, has been carried over to the field of computer vision. There exist several hardware approaches, e.g., [87, 84]. A software approach is to compute a complete multi-resolution pyramid, but to throw away most of the data. An example is the log-Cartesian fovea [29], where a multi-resolution region-of-interest (ROI) is defined within the pyramid. From each level an equal size window is kept, which means that the finer levels cover a smaller part of the field of view than the coarser levels. There is no restriction to the position of the centre of the ROI, and consequently no explicit gaze direction change is needed for high resolution processing in the periphery. Of course, image processing on this type of data structure can be quite complicated. We use a simple version [101] of the log-Cartesian fovea, where the ROI is always at the centre of the field of view and of constant size, Figure 4.1.

Figure 4.1: Log-Cartesian fovea. Left: Input image. Middle: Octave-based lowpass pyramid. From each level of the pyramid the central part is extracted. The result is equivalent to a space-variant sensor with a number of separate resolution channels of equal size. Right: Combined image. Borders between different resolution levels have been marked. Note that all resolution channels cover the central part of the field of view.

4.2 Control aspects

We now give a short account of the basics of human voluntary smooth pursuit tracking [51, 28, 24, 83, 74, 108]. The tracking process always involves an initial shift of attention to the stimulus, or target. If the target is not initially foveated, i.e., positioned in the centre of the field of view, a saccade is produced. The eyes then accelerate to their final speed within approximately 120 msec. Interestingly, this final speed is approximately 10% less than the speed of the target, so that the eyes tend to lag behind the target. This is compensated for by occasional saccades that recentre the target in the fovea. The principal stimulus for pursuit movements is retinal slip, i.e., the translational image velocity of the target, though there is also evidence for a weak position error component for small sudden target offsets from the fovea, with the pursuit velocity being proportional to the offset.

A conventional negative feedback controller, Figure 4.2, is not able to accurately model the characteristics of the human smooth pursuit system. The reason is that the model is unstable with the high gain and long delays that are found experimentally. High gain, of course, is necessary to obtain good accuracy in pursuit, but can be very problematic in systems with long delays. To overcome this problem, evolution has developed a quite different concept, Figure 4.3, where an internal (adaptive) model of the ocular motor system is used to predict the eye velocity and subsequently to reconstruct the target velocity. This is a manifestation of the fact that 'the brain knows the body it resides in' [24].

Next, we consider the design of the controller for a machine vision smooth pursuit tracker. Although control issues are not central to our work, there are at least two aspects of a motion estimation based smooth pursuit tracker that distinguish it from most other trackers, so a certain digression may be justified. Most algorithms, whether for passive or active tracking, only use observations of position. Furthermore, the delay of the observations is regarded as negligible.
When these conditions apply there are standard algorithms for state reconstruction, e.g., the so-called α-β and α-β-γ filters [12]. We want to use both position and velocity observations, and cannot ignore that the estimates are delayed several sampling intervals. We define the overall goal of our control algorithm as to keep the centroid of the target motion field stationary in the centre of the field of view. Note that this consists of two conflicting sub-goals, namely to stabilize the target on the image, and to bring the target closer to the centre. A choice has to be made regarding which of these should have the highest priority. From an active vision perspective it may be argued that it is important to keep the target still, e.g., to facilitate a time-consuming object recognition process. This implies that the 'optimal' strategy may be to use velocity compensation and occasional 'catch-up' saccades to recentre the object in the field of view.

Figure 4.2: A (too) simple model of the human smooth pursuit system. $P(s)$ is the transfer function of the ocular motor system, $X$ and $E$ the target and eye positions, respectively. $K \approx 0.9$, $T \approx 0.04$ sec. With the experimentally verified delay $\tau \approx 0.13$ sec, this system is unstable.

Figure 4.3: A more accurate model of the human smooth pursuit system (disregarding interaction with the saccade system). This appears to first have been proposed by Young [109]. The system utilizes an internal (adaptive) model of the ocular motor system to predict the eye velocity and uses this to reconstruct the target velocity. This effectively turns the system into an unconditionally stable feed-forward controller. After [83].
We have successfully experimented with velocity error compensation combined with a weak position error compensation and catch-up saccades. The reason for including a smooth position error component is that there are certain technical problems associated with saccades. Firstly, camera heads driven by stepping motors, such as the KTH and Aalborg heads, are very fast at executing small steps, but disproportionately slow for large movements. Secondly, even with fast motors, there is always a recovery period after each saccade during which no observations can be made, due to the fact that the temporal buffers of the spatiotemporal filters must be refilled. By including a weak smooth position error compensation we can avoid some saccades and the associated information loss, without compromising image stability too much.

Since there is no a priori reason for a coupling between motion in horizontal and vertical directions, we use separate algorithms for each spatial dimension. Let $x(kT)$ denote the position of the target at time $t = kT$, where $T$ is the sampling interval. Assuming a slow acceleration, we have approximately

\[ x((k+1)T) = x(kT) + T \dot{x}(kT) \]
\[ \dot{x}((k+1)T) = \dot{x}(kT) + T \ddot{x}(kT) \]

A state model that assumes constant velocity is unrealistic. When one has to take into account planned target velocity changes, it is common to model acceleration as a stochastic drift process, e.g., an IAR process [21]. The control is achieved by specifying camera head joint velocities, and we assume that the motor controller is significantly faster than the pursuit loop. With state variables $X_1(k) = x(kT)$, $X_2(k) = \dot{x}(kT)$, and $X_3(k) = \ddot{x}(kT)$, we then arrive at the following closed-loop state equation

\[ X(k+1) = AX(k) + Bu(k) + \nu = \begin{pmatrix} 1 & T & 0 \\ 0 & 1 & T \\ 0 & 0 & 1 \end{pmatrix} X(k) + \begin{pmatrix} T \\ 1 \\ 0 \end{pmatrix} u(k) + \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} v(k) \tag{4.1} \]

where $v(k)$ is white noise.
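The state model of Equation (4.1) can be written down directly as matrices, as in the sketch below. The system matrix follows from the two difference equations above; the input vector $B = (T, 1, 0)^T$ is our reading of Equation (4.1) (a velocity correction $u$ that also moves the position by $Tu$ during one sampling interval) and should be treated as an assumption, as should the function names.

```python
import numpy as np

def pursuit_model(T):
    """State matrices of Equation (4.1) for sampling interval T.

    The state is X = (position, velocity, acceleration) of the target in the image.
    """
    A = np.array([[1.0, T, 0.0],
                  [0.0, 1.0, T],
                  [0.0, 0.0, 1.0]])
    B = np.array([T, 1.0, 0.0])    # assumed reading of the input vector
    N = np.array([0.0, 0.0, 1.0])  # white noise drives the acceleration only
    return A, B, N

def step(X, u, T, v=0.0):
    """One step X(k+1) = A X(k) + B u(k) + N v(k)."""
    A, B, N = pursuit_model(T)
    return A @ X + B * u + N * v
```

For example, with $T = 0.1$, a state $(0, 1, 2)$ and no control, one step gives $(0.1, 1.2, 2)$; a control $u = -1$ instead gives $(0, 0.2, 2)$.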
We use state variable feedback

\[ u(k) = -LX(k) = -\begin{pmatrix} u_p & u_d & 0 \end{pmatrix} X(k) \tag{4.2} \]

with $u_p$ and $u_d$ constants that determine the influence of positional and velocity errors, respectively. The state is reconstructed from observations by Kalman filtering. Now, because of the temporal buffer depth, observations of position and velocity are delayed several sampling intervals, so we have to use a multi-step predictor. The observations are

\[ y(k) = CX(k) + \varepsilon = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} X(k) + \begin{pmatrix} e_1(k) \\ e_2(k) \end{pmatrix} \tag{4.3} \]

where $e_1(k)$ and $e_2(k)$ are white noise processes. The optimal linear $m$-step predictor is then given by (e.g., [5])

\[ \hat{X}(k|k-m) = A^m \hat{X}(k-m|k-m) + \sum_{i=k-m+1}^{k} A^{k-i} B u(i-1) \tag{4.4} \]

where the reconstructor is given by

\[ \hat{X}(k|k) = [I - KC][A\hat{X}(k-1|k-1) + Bu(k-1)] + Ky(k) \tag{4.5} \]

Here $K$ is the $3 \times 2$ Kalman gain matrix, the properties of which depend on the noise processes $\nu$ and $\varepsilon$, whose covariance matrices are

\[ R_1 = E[\nu(k)\nu(l)^T] = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \sigma_\nu^2 \end{pmatrix} \delta_{kl}, \qquad R_2 = E[\varepsilon(k)\varepsilon(l)^T] = \begin{pmatrix} \sigma_{pos}^2 & 0 \\ 0 & \sigma_{vel}^2 \end{pmatrix} \delta_{kl} \tag{4.6} \]

The (stationary) Kalman gain is then given by

\[ K = APC^T [CPC^T + R_2]^{-1} \tag{4.7} \]

where the prediction error covariance matrix $P$ satisfies the discrete matrix Riccati equation

\[ P = R_1 + APA^T - APC^T [CPC^T + R_2]^{-1} CPA^T \tag{4.8} \]

It is difficult to estimate the process variances $\sigma_\nu^2$, $\sigma_{pos}^2$ and $\sigma_{vel}^2$, so we regard them, as well as the feedback parameters $u_p$ and $u_d$, as free parameters that we have at our disposal to make the system behave appropriately.

4.3 Motion models and segmentation

4.3.1 Segmentation

The process of motion-based image segmentation in a system with focus of attention control consists of two separate stages: an 'early' preattentive detection stage and a 'late' object recognition oriented attentive stage. The parallel preattentive process detects local structure in the motion field and thereby indicates where potentially interesting stimuli appear in the field of view.
The local structures of interest are optical flow patterns characteristic of objects in motion, such as those shown in Figure 4.4. Indeed, neurons that are sensitive to this type of stimulus have been found in the primate visual system [88, 31]. The problem of detecting independently moving objects with a moving observer is nontrivial, see, e.g., [26, 90]. Recently Fermüller and Aloimonos [2, 37, 34, 35], building on earlier work by Nelson [80], have found a class of motion constraints that, given bounds on egomotion, allows a moving observer to detect local motion patterns that cannot originate from the motion of the observer. In a computer vision system the preattentive process may be implemented as a set of convolutions of the velocity estimates with symmetry operators [15, 104]. The magnitude of the filter responses defines a 'potential field' with local extrema at interesting points. It is possible to devise algorithms where the attention is attracted to successive interesting points, to a certain extent analogously to the interaction between a particle and a field of force [101].

Figure 4.4: Characteristic motion vector patterns generated by solid objects in motion.

Once a moving object has become the focus of attention, segmentation enters the second, attentive stage. The overall goal of this stage is usually regarded as providing information for the inference of object shape and 3D depth and motion, a problem referred to as structure from motion. Considering the importance of this problem, it is not surprising that the literature on it is very extensive, see, e.g., [48, 46] and references therein. Most structure recovery methods rely on accurate estimation of the motion field. Aloimonos and Fermüller (above references) suggest robust methods based on qualitative properties of motion and bounds on egomotion.
Here we confine ourselves to a much more restricted task: to extract the 2D projection of a single moving rigid object, fit its 2D motion vectors to a parameterised model, and track it by smooth pursuit. The basic assumption we make is that, at least occasionally, the motion field in the immediate vicinity of the target can be fitted to a single parameterised model, Figure 4.5. When this applies, we can use a very simple procedure to extract the target. The idea is to estimate the background motion in an annular region surrounding the target, and then use this to predict motion in the region inside the annulus. The local motion estimates in the central region that are not well predicted by the annular motion model are interpreted as due to the target motion.

Figure 4.5: Basic assumption of the approach is that, at least occasionally, we can find a region in which the target and background motion fields each fit a single parameterised model. (Figure labels: $V(x, y, p_o)$, $V(x, y, p_b)$.)

Figure 4.6: Illustration of the motion segmentation process. (Panels (a) and (b).)

We can formulate this as an iterative algorithm.

1. We assume that an estimate of target size and position is available. This defines a predicted target region. Initially this must be provided either by a preattentive motion detector or by top-down prediction.

2. Fit a motion model to a ring around the target region, and if the model has a small residual, use it to predict the motion in the target region. For each position, we compare the predicted value with the actual estimate, and if the error is small, we set a flag to mark that the estimate is consistent with the ring model.

3. If the prediction is satisfactory for most positions, we merge the ring and the central region to a new target region and repeat the process. The interpretation is that both the ring and the central region are completely contained within the background or the target.

4. If the ring model does not cover all data in the target region, we fit a motion model to the remaining estimates in the region. If this model gives a small residual we assume that we have found the target, otherwise we merge the ring and the central region to a new target region and repeat the process.

Figure 4.6 provides an illustration of two typical segmentation situations. The result of a successful segmentation process is a set of target motion parameters and the set of points whose motion estimates were used to estimate the parameters. We use the lowpass-filtered confidence values of the motion estimates in these points to construct a simple moving average model of the target. On those occasions when the segmentation process fails, i.e., when the region grows beyond a predetermined maximum size, we use the target model and predicted values of the motion parameters to pick out estimates in the original target region for model fitting. The use of prediction of velocity and position increases the robustness against interference from occlusion. Comparison between predicted and estimated motion parameters permits detection of incorrect parameter estimates, which would correspond to unreasonable acceleration.

In the case of a pan/tilt rotating camera it is possible to use an alternative procedure based on elimination of the camera-induced background motion field. From Figure 4.7 it is easy to convince oneself that an accurate expression for the camera-induced displacement is given by

$$\begin{pmatrix} \Delta x \\ \Delta y \\ 0 \end{pmatrix} = \omega \times r + \frac{1}{f}\,\bigl[\hat{z} \cdot (\omega \times r)\bigr]\, r$$

where $\omega = (\omega_x, \omega_y, 0)^T$ is the angular velocity vector.

Figure 4.7: Illustration of displacement induced by pan rotation of camera. (Figure labels: $r = (x, y, -f)^T$, $\Delta x$, image plane, optic centre, $\hat{x}$, $\hat{z}$, $\omega_y$.)

At each position we may construct a unit length spatiotemporal displacement vector $\hat{u}$ and multiply each tensor $\tilde{T}$ with a factor that is a measure of the inconsistency between the tensor and the background displacement direction.
One possible choice is a 'soft threshold'

$$f(\hat{u}, \tilde{T}) = 1 - \exp\left[-\frac{\hat{u}^T \tilde{T} \hat{u}}{\|\tilde{T}\|\,\sigma^2}\right]$$

This method to eliminate self-induced background motion is actually similar to how we pick out tensors that are consistent with model predictions (see above); it may be regarded as a kind of focus-of-attention. The basic idea is illustrated in Figure 4.8, where a cone symbolises the (hard or soft) threshold angle. With the use of motion elimination the extraction of the target in the rotating camera case is achieved by simply increasing the target region until no further data consistent with the target motion is found.

Figure 4.8: Eliminating irrelevant motion. Left (egorotation cone): Cone centred at vector pointing in direction of camera-induced motion. The line segment (which becomes a plane in the spatiotemporal space) moves with respect to a static background, since it does not intersect with the cone. Right (retinal slip cone): Cone centred at the target's predicted spatiotemporal velocity vector. The line segment may belong to the target, since the corresponding 3D plane is inside the cone.

4.3.2 Discussion

The segmentation methods presented above are related to several other approaches for motion tracking with an active camera. Nordlund and Uhlin [81, 94] fit two consecutive frames to a global 2D displacement model (translation or affine transformation) and detect the centroid of the target displacement field from the residual image. The benefit of this method is that it is computationally efficient. An obvious drawback is the loss of locality, which implies very strong assumptions about the global structure of the scene. Also, the quality of tracking, i.e., the ability to stabilise the target on the image, is compromised by the lack of precise motion estimates.
Murray and Basu [76] use a modification of image subtraction combined with elimination of camera-induced motion to track a moving object with a pan/tilt rotating camera; they do not address the case of a translating camera. Again, the lack of accurate motion estimates prevents good image stabilisation. Tölg [92], see also [86], uses a differential-based optical flow algorithm [95] which makes it possible to extract the target motion field by clustering; only fronto-parallel motion is treated. Tölg, assuming a pan/tilt rotating camera, also compensates for camera-induced motion, but bases this on motion vector subtraction, which assumes that the true motion vector can be accurately estimated. However, it is known that in general only the local normal velocity (the component in the direction of the spatial gradient) is available with some precision.

The question of how to represent the target region shape has been addressed by many authors in the field of image sequence coding, but has attracted comparatively little attention in tracking. The principal use of shape information in tracking is to implement a spatial weight mask, a kind of focus-of-attention. Meyer and Bouthemy [73] use a region descriptor (convex hull) to detect occlusion from region area size changes. Though their motion-based region segmentation method [20] is unsuitable for real-time applications, the idea of a more structural region representation is conceivable also for sparse motion fields using tools from computational geometry [93, 96].

4.3.3 Motion models

We have experimented with three different motion models: pure 2D translation, translation and expansion/contraction, and affine transformations. Concerning parameter estimation, computational efficiency was stressed; costly iterative methods were regarded as unacceptable. The pros and cons of this are discussed at the end of this section.
The case of pure translation is particularly simple, as illustrated in Figure 4.9. If a set of local structure tensors² emanate from a single pure 2D translation, the sum of the tensors, $T_{sum}$, will be of rank 2 with the eigenvector corresponding to the smallest eigenvalue pointing in the direction of spatiotemporal displacement. If the tensors do not come from a single translation, $T_{sum}$ will have a significant third eigenvalue $\lambda_3$. Consequently we can use the quotient $\lambda_3 / \operatorname{Tr} T_{sum}$ as a measure of the deviation from pure translation.

² We use the rank 2 tensors $\tilde{T}$ introduced in Section 2.2.

Figure 4.9: Tensor averaging. (a), (b): Two edges in common motion, creating planes in the 3D spatiotemporal space. Orientation tensors (ideally of rank 1) visualised as ellipsoids with eigenvectors forming principal axes. The eigenvector corresponding to the largest eigenvalue indicates orientation of plane. (c), (d): Averaging of tensors 'symbolically' shown in (c) gives as result an estimate of the true motion, now from the eigenvector corresponding to the smallest eigenvalue of the tensor (d), which ideally is of rank 2.

When there is a significant velocity component orthogonal to the image plane, we have to take into account the perspective transformation. Restricting ourselves to a single spatial dimension, the perspective transformation equation may be written

$$x = -f\,\frac{X}{Z} \tag{4.9}$$

where $f$ is the camera constant, $x$ denotes the image coordinate, $X$ the corresponding world coordinate, and $Z$ the orthogonal distance from the camera lens to the object. Differentiating Equation (4.9) with respect to time, we find

$$\dot{x} = f\,\frac{X \dot{Z}}{Z^2} - f\,\frac{\dot{X}}{Z} \tag{4.10}$$

Assuming that the variation in depth of the visible part of the target along the line of sight is small compared with the viewing distance (weak perspective), we conclude that the apparent velocity of a projected point is a linear function of its distance from the image coordinate centre.
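The step from Equation (4.10) to this linearity can be written out explicitly. The following is a sketch under two assumptions that go slightly beyond the text: the target translates rigidly, so that $\dot{X}$ and $\dot{Z}$ are common to all visible points, and the depth of every visible point is close to a nominal value $Z_0$ (a symbol introduced here for illustration only).

```latex
\begin{align*}
\dot{x} &= -f\,\frac{\dot{X}}{Z} + f\,\frac{X\dot{Z}}{Z^{2}}
         && \text{(Equation 4.10)} \\
        &= -f\,\frac{\dot{X}}{Z} - x\,\frac{\dot{Z}}{Z}
         && \text{(using } X = -xZ/f \text{ from Equation 4.9)} \\
        &\approx -\frac{f\dot{X}}{Z_{0}} - \frac{\dot{Z}}{Z_{0}}\,x
         && \text{(weak perspective, } Z \approx Z_{0}\text{)}
\end{align*}
```

In this form the offset $-f\dot{X}/Z_0$ corresponds to the fronto-parallel translation and the slope $-\dot{Z}/Z_0$ to the rate of expansion or contraction; for $\dot{Z} \neq 0$ the image velocity vanishes at $x = -f\dot{X}/\dot{Z}$.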
In Figure 4.10 we see that this means that the spatiotemporal velocity vectors converge at a point, which we may call the spatiotemporal focus-of-expansion, $\mathbf{x}_{STFOE}(t)$. To find this point, one may use some kind of clustering scheme or proceed as follows. The spatiotemporal displacement vector at $\mathbf{x} = (x, y, t)^T$ is parallel to $\mathbf{x} - \mathbf{x}_{STFOE}(t)$. Ideally the local structure tensor $\tilde{T}(\mathbf{x})$ has no projection in this direction, so that

$$\tilde{T}(\mathbf{x})\,(\mathbf{x} - \mathbf{x}_{STFOE}) = 0$$

We can find a least squares optimal value of $\mathbf{x}_{STFOE}$ in a spatial region $R$ by minimising

$$E = \sum_{\mathbf{x} \in R} (\mathbf{x} - \mathbf{x}_{STFOE})^T\, \tilde{T}(\mathbf{x})\, (\mathbf{x} - \mathbf{x}_{STFOE}) = \mathbf{x}_{STFOE}^T \Bigl[\sum \tilde{T}(\mathbf{x})\Bigr] \mathbf{x}_{STFOE} - 2\,\mathbf{x}_{STFOE}^T \sum \tilde{T}(\mathbf{x})\,\mathbf{x} + \sum \mathbf{x}^T \tilde{T}(\mathbf{x})\,\mathbf{x}$$

Since $\sum \tilde{T}(\mathbf{x})$ is positive (semi-)definite, it follows that the optimal $\mathbf{x}_{STFOE}$ is a solution to the linear equation system

$$\Bigl[\sum \tilde{T}(\mathbf{x})\Bigr]\, \mathbf{x}_{STFOE} = \sum \tilde{T}(\mathbf{x})\,\mathbf{x}$$

So in addition to the sum of tensors that we use for the pure translation model estimation, we need the vector sum $\sum \tilde{T}(\mathbf{x})\,\mathbf{x}$. The translation component of the motion can be found from the average displacement direction $\sum \operatorname{Tr} \tilde{T}(\mathbf{x})\,(\mathbf{x} - \mathbf{x}_{STFOE})$.

Figure 4.10: Spatiotemporal convergence of velocity vectors. At any given time, the spatiotemporal tangent vectors converge at $\mathbf{x}_{STFOE}(t)$. (Figure labels: $x$, $t$, $t = t_1$, $t = t_2$, $\mathbf{x}_{STFOE}(t_1)$, $\mathbf{x}_{STFOE}(t_2)$.)

An alternative way to proceed when modelling the motion field as a fronto-parallel translation plus motion orthogonal to the image plane is to make the ansatz

$$\begin{pmatrix} u \\ v \end{pmatrix} = a \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} u_0 \\ v_0 \end{pmatrix}$$

We will not describe this in detail, since it is a simple special case of our final approach, namely to model the motion field as an affine transformation

$$\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} u_0 \\ v_0 \end{pmatrix} \tag{4.11}$$

This type of motion model has recently become a popular choice for the segmentation stage of video sequence coding algorithms, e.g., [58, 73, 100, 33]. One reason for this is that it can be shown that the general motion of a planar surface patch under orthographic projection can be expressed as an affine transformation.
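The least squares estimate of the spatiotemporal focus-of-expansion described above amounts to solving one $3 \times 3$ linear system. The sketch below assumes NumPy; the function names and the synthetic test tensors (ideal rank 2 tensors built as $I - \hat{d}\hat{d}^T$) are illustrative assumptions rather than part of the original method description.

```python
import numpy as np

def rank2_tensor(direction):
    """Ideal rank-2 tensor with no projection along `direction`: I - d d^T."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    return np.eye(3) - np.outer(d, d)

def estimate_stfoe(points, tensors):
    """Solve [sum T(x)] x_foe = sum T(x) x for x_foe.

    points  : (N, 3) array of spatiotemporal positions (x, y, t)
    tensors : (N, 3, 3) array of local structure tensors
    """
    points = np.asarray(points, dtype=float)
    tensors = np.asarray(tensors, dtype=float)
    S = tensors.sum(axis=0)                        # sum of tensors
    b = np.einsum("nij,nj->i", tensors, points)    # sum of T(x) x
    # lstsq tolerates a (near-)singular sum of tensors.
    x_foe, *_ = np.linalg.lstsq(S, b, rcond=None)
    return x_foe

if __name__ == "__main__":
    foe = np.array([5.0, -3.0, -20.0])             # hypothetical ground truth
    pts = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0],
                    [0.0, 10.0, 0.0], [10.0, 10.0, 0.0]])
    tens = [rank2_tensor(p - foe) for p in pts]
    print(estimate_stfoe(pts, tens))
```

Both accumulated quantities are plain sums over the region, so they can be updated incrementally as the region grows.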
The linear transformation represents rotation in the image plane, depth motion and shear, while the constant part describes translation in the image plane. Rewriting Equation (4.11) as a linear transformation of the spatiotemporal displacement vector $\tilde{u} = (\Delta x, \Delta y, \Delta t)^T = (u, v, 1)^T$, we obtain

$$\tilde{u} = \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = A \tilde{x} = \begin{pmatrix} a_{11} & a_{12} & u_0 \\ a_{21} & a_{22} & v_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \tag{4.12}$$

Again noting that the local structure tensor at each position has no projection in the direction of spatiotemporal displacement, a least squares sense optimal affine motion model for an image region $R$ satisfies

$$A_{opt} = \arg\min_A \sum_{\mathbf{x} \in R} \tilde{u}^T(x, y)\, \tilde{T}\, \tilde{u}(x, y) = \arg\min_A \sum_{\mathbf{x} \in R} \tilde{x}^T A^T \tilde{T} A \tilde{x} \tag{4.13}$$

Setting the partial derivatives with respect to the parameters to zero results in the following symmetric linear equation system

$$\sum_{(x, y) \in R} \left[ \begin{pmatrix} x^2 t_{11} & xy\,t_{11} & x\,t_{11} & x^2 t_{12} & xy\,t_{12} & x\,t_{12} \\ xy\,t_{11} & y^2 t_{11} & y\,t_{11} & xy\,t_{12} & y^2 t_{12} & y\,t_{12} \\ x\,t_{11} & y\,t_{11} & t_{11} & x\,t_{12} & y\,t_{12} & t_{12} \\ x^2 t_{12} & xy\,t_{12} & x\,t_{12} & x^2 t_{22} & xy\,t_{22} & x\,t_{22} \\ xy\,t_{12} & y^2 t_{12} & y\,t_{12} & xy\,t_{22} & y^2 t_{22} & y\,t_{22} \\ x\,t_{12} & y\,t_{12} & t_{12} & x\,t_{22} & y\,t_{22} & t_{22} \end{pmatrix} \begin{pmatrix} a_{11} \\ a_{12} \\ u_0 \\ a_{21} \\ a_{22} \\ v_0 \end{pmatrix} + \begin{pmatrix} x\,t_{13} \\ y\,t_{13} \\ t_{13} \\ x\,t_{23} \\ y\,t_{23} \\ t_{23} \end{pmatrix} \right] = 0 \tag{4.14}$$

where $t_{ij}$ refers to the $(i, j)$ component of $\tilde{T}$. In Appendix A.2.1 we present results of an experimental investigation of the accuracy of the affine parameter estimates³. We use α-β-filters [12] for prediction of the four target motion parameters $a_{11}$, $a_{12}$, $a_{21}$, and $a_{22}$ during tracking.

Possibly the only reason for using linear least squares methods for motion parameter estimation is that they are computationally efficient, which of course is critical in real-time tracking. A well-known problem with linear least squares methods is their sensitivity to outliers. There exist several techniques, referred to as robust estimation [52, 45], where outliers are rejected in an iterative process.
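For reference, the linear system (4.14) can be assembled compactly by noting that each $3 \times 3$ block of the matrix is $t_{ij}\, q q^T$ with $q = (x, y, 1)^T$. The following is a minimal sketch assuming NumPy; the helper names and the synthetic rank 2 tensors used to exercise it are hypothetical and serve only as an illustration.

```python
import numpy as np

def affine_flow(p, x, y):
    """Velocity (u, v) at (x, y) for p = (a11, a12, u0, a21, a22, v0)."""
    a11, a12, u0, a21, a22, v0 = p
    return a11 * x + a12 * y + u0, a21 * x + a22 * y + v0

def rank2_tensor(u, v):
    """Ideal rank-2 tensor with no projection along (u, v, 1)."""
    d = np.array([u, v, 1.0])
    d = d / np.linalg.norm(d)
    return np.eye(3) - np.outer(d, d)

def fit_affine(xy, tensors):
    """Assemble and solve the 6x6 system (4.14).

    Returns the parameter vector (a11, a12, u0, a21, a22, v0).
    """
    M = np.zeros((6, 6))
    c = np.zeros(6)
    for (x, y), T in zip(xy, tensors):
        q = np.array([x, y, 1.0])
        Q = np.outer(q, q)
        M[:3, :3] += T[0, 0] * Q   # t11 block
        M[:3, 3:] += T[0, 1] * Q   # t12 block
        M[3:, :3] += T[0, 1] * Q   # t12 block (symmetry)
        M[3:, 3:] += T[1, 1] * Q   # t22 block
        c[:3] += T[0, 2] * q       # t13 terms
        c[3:] += T[1, 2] * q       # t23 terms
    p, *_ = np.linalg.lstsq(M, -c, rcond=None)
    return p

if __name__ == "__main__":
    p_true = np.array([0.01, -0.02, 1.0, 0.02, 0.01, -0.5])
    xy = [(x, y) for x in (-10.0, 0.0, 10.0) for y in (-10.0, 0.0, 10.0)]
    tens = [rank2_tensor(*affine_flow(p_true, x, y)) for x, y in xy]
    print(fit_affine(xy, tens))
```

The cost is one pass over the region plus a single $6 \times 6$ solve. A robust variant would wrap this solve in an iterative outlier rejection loop.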
This typically consists in maximising a log likelihood function where the data set is considered to be divided into a Gaussian signal process and a non-Gaussian outlier noise process. There are also various clustering techniques, e.g., [85, 57], that can be used as robust alternatives to linear least squares. Without going into details, let us just mention that it is completely straightforward to apply these methods to motion parameter estimation using the spatiotemporal constraints presented in this section. Any of these techniques typically requires several times as much work as linear least squares.

³ Experimental results for the simpler motion models are not included; they are similar to the results for the affine model when applicable.

4.4 On the use of normalized convolution in tracking

Presently, we see two main applications of normalized convolution in active tracking. Firstly, it is possible to shorten the recovery period following a saccade by several sampling intervals by giving the old pre-saccade frames that are left in the temporal buffer a certainty value equal to zero. In this way a rough estimate of the post-saccade target velocity can be produced very quickly. Secondly, it is possible to integrate normalized convolution with focus-of-attention control by using zero-certainty masks to suppress interference from known structures, such as a robot arm or already modelled objects. Details of this are given in [103, 101, 107].

4.5 Experimental results

An advanced graphics simulation environment [102, 101] was used for development and evaluation of the tracking routines. The tracker was implemented on a simulated standard Puma 560 robot with a mounted (stereo) camera head, Figure 4.11.
The use of a simulated environment has a number of advantages: it allows for a wide range of experimental situations, it provides true velocity and position which can be used in algorithm performance evaluation, and it makes one independent of special-purpose hardware. Simulation can certainly never replace performance in 'the real world' as the final criterion of failure or success, particularly when there is a real-time constraint, but it is extremely useful in early stages of development.

The sequence in Figure 4.13 shows smooth pursuit with the robot initially approaching the target, then moving parallel to it, and finally stopping in front of the approaching target. Note how the robot's manoeuvres are reflected in displacement of the target from the centre of the field of view. In Figure 4.14 the weights (confidence values) of the local motion estimates that were used to extract the target are superimposed on the target, which is tracked with a rotating camera.

The code has been ported to the Aalborg University Camera Head Demonstrator, Figure 4.12. Although only a few experiments were carried out with the installed code, we consider them sufficient as a 'proof of concept', see Figures 4.15 and 4.16. The image processing was done at a moderate 2 Hz on a Unix workstation by importing the video signal to the above-mentioned graphics simulation package. The simulation package code communicated with the onboard camera head controllers via sockets to issue commands and receive actual camera head joint velocities and positions. We anticipate that a more careful implementation not involving the graphics package would achieve at least 10 Hz.

Figure 4.11: Robot simulation. Left eye, right eye, and overview. The test platform consists of a simulated Puma 560 robot with a mounted stereo camera head.

Figure 4.12: The Aalborg University Camera Head Demonstrator system.
Figure 4.13: Simulated sequence with a translating camera. A box is transported on a conveyor belt. The small cross marks the centroid of the target motion field, whereas the large cross simply marks the centre of the image.

Figure 4.14: Simulated sequence. The confidence values of the extracted motion estimates are superimposed on the target.

Figure 4.15: 'To and fro', a real-time smooth-pursuit tracking sequence recorded at the Laboratory of Image Analysis, Aalborg University.

Figure 4.16: 'Walking the plank', a real-time smooth-pursuit tracking sequence recorded at the Laboratory of Image Analysis, Aalborg University.

5 SUMMARY

The thesis has treated certain aspects of efficient spatiotemporal filtering and modelling. A novel energy-based multiresolution algorithm for estimation and representation of local spatiotemporal structure by second order symmetric tensors has been presented. The algorithm utilizes coarse-to-fine data validation to produce accurate and reliable estimates over a wide range of image displacements. It is highly parallelisable thanks to an efficient separable sequential implementation of 3D quadrature filters. Results show that the algorithm is capable of producing very accurate estimates of image velocity.

An efficient spatiotemporal implementation of a certainty-based signal modelling method called normalised convolution has been described and experimentally investigated. By using a novel iterative filter optimisation method to produce positive filter coefficients, we show that normalised convolution can be efficiently integrated with the sequential quadrature filtering scheme. We also describe how certainty information is incorporated in the construction of the local structure tensor.
Experimental tests show that normalised convolution for signal interpolation in spatiotemporal sequences is very effective, due to the high information redundancy in the signal.

As an application of the above results, we have presented a smooth pursuit motion tracking algorithm that uses observations of both target motion and position for camera head control and motion prediction. We introduce a novel target motion segmentation method which assumes that the motion fields of the target and its immediate vicinity, at least occasionally, each can be modelled by a single parameterised motion model. We use the acquired and updated information about target position and velocity to extract motion estimates on those occasions when the segmentation algorithm fails. We also show how to eliminate camera-induced background motion in the case of a pan/tilt rotating camera. We provide excerpts from both simulated and real tracking sequences that show that the algorithms work.

The thesis probably creates more question marks than it straightens out. Certainly a great deal more work can be put into the study of the multi-resolution tensor field estimation algorithm. There may be more optimal ways to combine the different levels, and it may also be useful to consider some local interaction at each level. In particular, a study of relaxation techniques would be interesting. Considering applications, promising work on region-based motion segmentation for image sequence coding has recently been done [33].

The efficient implementation of normalised convolution is an important result. As mentioned earlier, we think that the full potential of the signal/certainty philosophy is yet to be exploited. For instance, there has to be work done on integrating certainty processing, feature extraction and object recognition.
A straightforward refinement of the iterative target segmentation algorithm would be to make the region growing process shape adaptable, most simply by computing the principal axes of the motion field distribution. Of course, there are dozens of possible real-time applications other than smooth pursuit tracking that can use accurate motion estimates. To name but a few: egomotion estimation, obstacle avoidance, fixation, and homing. Also, as mentioned earlier, there are better ways to estimate motion model parameters than the linear least squares methods that we have used in this work. It would be interesting to study the performance of, e.g., mixture models [57], and to see whether they can be efficiently implemented within the local structure tensor framework. It seems likely that robust iterative methods converge faster when the local estimates are of high precision with occasional outliers than if they have a large variance, as is the case for the more primitive optical flow methods.

* * *

A TECHNICAL DETAILS

In this appendix we provide some of the more technical details and results of experimental tests that for continuity reasons were omitted in the main text. The mathematical results are all quite elementary, but it was felt that they should be included for the sake of completeness.

A.1 Local motion and tensors

A.1.1 Velocity from the tensor

There are two fundamental cases.

1. Moving point, Figure A.1. The spatiotemporal line traced out by the moving point has a tangent vector $v_{ST} = (v_x, v_y, 1)^T$ which is parallel to the eigenvector $\hat{e}_3 = (e_{31}, e_{32}, e_{33})^T$ corresponding to the smallest eigenvalue of the tensor. Consequently the velocity is given by

$$v = (v_x, v_y)^T = \frac{1}{e_{33}}\,(e_{31}, e_{32})^T \tag{A.1}$$

2. Moving line, Figure A.2. The moving line creates a plane in the spatiotemporal volume. In this case the true velocity cannot be determined, since it is impossible to detect any velocity component parallel to the line.
However, the component $v_{norm}$ normal to the line is easily recovered by noticing that its spatiotemporal counterpart $v_{ST}^{norm}$ is orthogonal to the eigenvector $\hat{e}_1$ corresponding to the largest eigenvalue, as well as to a vector $l = (e_{12}, -e_{11}, 0)^T$ parallel to the line of intersection of the motion plane and $t = t_0$. But

$$l \times \hat{e}_1 = (-e_{11} e_{13},\; -e_{12} e_{13},\; e_{11}^2 + e_{12}^2)^T$$

so that

$$v_{norm} = \frac{-e_{13}}{e_{11}^2 + e_{12}^2}\,(e_{11}, e_{12})^T \tag{A.2}$$

since the temporal component of $v_{ST}^{norm}$ equals 1.

Figure A.1: Moving point case. Left: Image plane. Right: Spatiotemporal volume.

Figure A.2: Moving line case. Left: Image plane. Right: Spatiotemporal volume.

A.1.2 Speed and eigenvectors

We have claimed that

$$s = \sqrt{\frac{e_{13}^2}{1 - e_{13}^2}} \qquad \text{($T$ rank 1)} \tag{A.3}$$

$$s = \sqrt{\frac{1 - e_{33}^2}{e_{33}^2}} \qquad \text{($T$ rank 2)} \tag{A.4}$$

are measures of the magnitude of local velocity.

Proof: (A.3) follows from (A.2) since

$$\|v_{norm}\|^2 = \frac{e_{13}^2}{(e_{11}^2 + e_{12}^2)^2}\,(e_{11}^2 + e_{12}^2) = \frac{e_{13}^2}{e_{11}^2 + e_{12}^2} = \frac{e_{13}^2}{1 - e_{13}^2}$$

Similarly, (A.4) follows from (A.1) since

$$\|v\|^2 = \frac{1}{e_{33}^2}\,(e_{31}^2 + e_{32}^2) = \frac{1 - e_{33}^2}{e_{33}^2} \qquad \Box$$

A.1.3 Speed and $T_{xy}$

We have claimed that

$$\operatorname{Tr} T_{xy} = \lambda_1 (1 - e_{13}^2) \qquad \text{($T$ rank 1)} \tag{A.5}$$

$$\det T_{xy} = \lambda_1 \lambda_2\, e_{33}^2 \qquad \text{($T$ rank 2)} \tag{A.6}$$

Proof: When $T$ is of rank 1 the leading $2 \times 2$ submatrix is

$$T_{xy} = \lambda_1 \begin{pmatrix} e_{11}^2 & e_{11} e_{12} \\ e_{11} e_{12} & e_{12}^2 \end{pmatrix}$$

Since $\|e_1\|^2 = e_{11}^2 + e_{12}^2 + e_{13}^2 = 1$, (A.5) follows. When $T$ is of rank 2 the leading $2 \times 2$ submatrix is

$$T_{xy} = \begin{pmatrix} \lambda_1 e_{11}^2 + \lambda_2 e_{21}^2 & \lambda_1 e_{11} e_{12} + \lambda_2 e_{21} e_{22} \\ \lambda_1 e_{11} e_{12} + \lambda_2 e_{21} e_{22} & \lambda_1 e_{12}^2 + \lambda_2 e_{22}^2 \end{pmatrix}$$

so that the determinant is

$$\det T_{xy} = \lambda_1 \lambda_2\, [\, e_{11}^2 e_{22}^2 + e_{12}^2 e_{21}^2 - 2 e_{11} e_{12} e_{21} e_{22} \,] = \lambda_1 \lambda_2\, (e_{11} e_{22} - e_{12} e_{21})^2$$

But

$$e_{33} = [e_1 \times e_2]_3 = \begin{vmatrix} \hat{x} & \hat{y} & \hat{t} \\ e_{11} & e_{12} & e_{13} \\ e_{21} & e_{22} & e_{23} \end{vmatrix}_3 = e_{11} e_{22} - e_{12} e_{21}$$

and we are done. $\Box$

A.2 Motion models

A.2.1 Accuracy of affine parameter estimates

To determine the accuracy of the affine parameter estimates a synthetic 2D profile, Figure A.3, was subjected to sequences of affine transformations.
The object covered 100 100 pixels when not transformed. All experiments started with the object in a standard position, centred in the field of view. We separately studied expansion (Figure A.5), rotation (Figure A.6), translation (Figure A.7), and composite transformation (Figure A.8). In the figures, histograms of estimated values of the affine parameters are plotted as a 11 a21 a12 a22 vx vy As is seen, the estimates are excellent over a wide range of parameter values. Of course, these estimates have been obtained under perfect conditions and thus merely show that the algorithm then has the ability to produce consistent estimates with small variance. To see how the algorithm performs in somewhat more realistic situations, a simple camera simulation was generated by adding 10 dB SNR additive uncorrelated noise and motion blurring (moving average). Results are shown in Figures A.9 – A.12. As expected, very little or no degradation in performance can be detected—the spatiotemporal filtering has proven very noise tolerant, with an average orientation estimation error of 3:0 at 10 dB SNR (0:8 at ∞ dB SNR) for the particular filter set used here, [63]. A.2 MOTION MODELS Figure A.3: Textured profile used in experiments. Figure A.4: A more realistic situation with 10 dB SNR additive uncorrelated noise and image blurring. 77 TECHNICAL DETAILS 78 12 30 10 25 8 12 20 4 15 6 10 4 5 0.005 0 0.01 1 6 0 0 2 2 0.01 0 0.02 −0.01 8 4 0 0.01 0 0 0.01 0 −1 8 6 6 4 4 0 1 0 −0.01 0 0 0 0.01 (a) 25 12 8 20 10 5 4 6 10 2 5 0 0 0.02 3 4 0 0.01 0 −1 10 12 10 8 10 8 6 1 0.01 0 0 6 10 4 5 0 0 0 0.02 0.04 0.06 0.08 −0.01 4 4 2 0 8 6 6 4 0 −0.01 0 −1 15 8 8 4 2 0.02 1 0 1 2 2 0.02 0.04 0 −1 0 1 0 0.01 0 1 0 −0.01 0 0.01 0 −1 6 6 5 5 4 4 3 3 2 2 1 (c) Figure A.5: Expansion. (a) a11 0:02. (d) a11 = a22 = 0:04. 
2 1 0 12 6 0 2 0.01 20 2 2 0 0.04 −0.01 1 6 8 15 4 0 (b) 10 6 0 −1 8 2 2 2 0.005 0.01 4 4 2 2 0 6 6 4 4 4 10 8 6 0 −0.01 0 10 8 8 6 2 12 10 10 0 −1 8 8 6 2 0 0.01 −0.01 12 10 6 4 0 0 12 8 8 6 2 10 10 0 0 1 0.02 0.04 0.06 0.08 0 −1 (d) = a22 = 0:005. (b) a11 = a22 = 0:01. (c) a11 = a22 = A.2 MOTION MODELS 12 79 15 12 12 10 8 5 0 −0.01 0.01 0 0 0.01 0.02 0 −1 0 1 12 20 12 −0.01 0 0 0 0.02 0.04 10 0 −1 0 1 0 1 0 1 0 1 15 8 10 10 6 4 4 4 0 −0.01 0 0.01 0 −1 5 5 2 2 2 0 −0.02 0.01 6 6 5 2 0 15 8 8 10 0 −0.01 10 10 15 5 2 2 0 6 4 4 4 2 8 10 6 6 4 15 8 8 6 10 10 10 10 0 1 0 −0.04 −0.02 0 0 −0.01 0 (a) 0.01 0 −1 (b) 8 10 8 4 4 6 8 4 6 2 2 0 −0.01 0 0 0 −1 0.01 6 0.02 0.04 0.06 0.08 8 6 8 8 6 6 4 4 2 2 4 0 −0.08−0.06−0.04−0.02 0 1 0 −0.01 0 0.01 12 0 −0.01 0 (c) 0.01 0 −1 0 0 0.05 0.1 0.15 12 8 10 8 6 6 4 8 6 4 4 0 1 0 −0.15 −0.1 −0.05 0 −1 10 10 2 2 0 2 2 4 2 6 2 12 10 8 4 4 2 0 10 10 6 12 8 12 8 6 0 0 −0.01 2 0 0.01 0 −1 (d) Figure A.6: Rotation. (a) a12 = ,a21 = 0:01. (b) a12 = ,a21 = 0:02. (c) a12 = ,a21 = 0:04. (d) a12 = ,a21 = 0:08. TECHNICAL DETAILS 80 15 10 8 6 10 6 8 4 10 10 6 5 4 2 0 −0.01 0 0.01 0 0 2 0.5 0 −0.01 1 0 0.01 −0.01 0 12 10 10 6 4 4 2 0 −0.01 0 0.01 −0.01 0 12 10 10 8 8 6 6 4 4 2 0 0.01 15 10 10 5 5 8 6 6 4 4 2 2 0 0 0.5 0 −0.01 1 0 0.01 −0.01 0 12 12 12 10 10 8 8 6 6 4 4 2 2 0 −0.01 0 10 10 8 6 8 6 6 4 4 2 0.01 0 0 2 0 −0.01 4 0 8 2 0 0.01 0 0 2 0 −0.01 4 = vy 6 = 0:5. (b) vx 0 4 3 2 6 5 4 4 3 3 2 2 1 0 0 0 0 4 6 = = 2. vy 8 4 2 1 0 0 0 0 0 0 0.03 1. (c) vx 5 1 0.02 = vy 6 1 1 2 10 0.02 0.04 0.06 8 0.02 0.04 0.06 0 0 6 6 5 5 4 4 3 3 2 2 2 4 −2 0 6 8 6 6 4 4 4 2 2 0 0 0.01 6 5 0.01 = 8 0 2 2 0 0.01 −0.01 0 (d) 8 0 −0.03 −0.02 −0.01 6 4 2 2 2 2 4 4 4 2 4 2 6 6 4 6 3 6 0 0 8 10 6 8 4 0.03 0.01 12 10 5 0.02 2 2 0 0.01 −0.01 0 8 12 6 0.01 1 2 2 0 6 0 0 0 0 4 4 Figure A.7: Translation. (a) vx (d) vx = vy = 4. 
4 0.01 8 (c) 8 2 6 0 0.01 −0.01 0 12 8 0 0.01 −0.01 0 1 (b) 2 0 −0.01 0 0 15 (a) 12 0.01 10 8 6 2 0 12 8 8 5 4 2 0 0.01 −0.01 0 10 8 6 4 2 15 12 12 8 0.01 (a) 0.02 0.03 0 −2 2 −1 0 0 −0.06 −0.04 −0.02 1 0 0 0 1 0.02 0.04 0.06 0 −4 (b) Figure A.8: Composite transformation. (a) a11 = a12 = ,a21 = a22 = 0:015. vx = ,vy = 1: (b) a11 = a12 = ,a21 = a22 = 0:03. vx = ,vy = 2: A.2 MOTION MODELS 81 10 8 8 6 8 4 6 6 3 4 4 2 2 2 2 0 0 4 4 4 10 8 5 6 6 10 6 8 2 1 0.005 0 0.01 −0.01 0 0.01 0 −1 0 1 0 0 0.01 0 0.02 −0.01 2 0 0.01 8 10 6 8 6 6 4 4 4 2 0 −0.01 0 0.01 0 0 6 5 6 4 6 3 4 0.01 0 −1 2 0 1 0 −0.01 0 0.01 0 0 0.01 0.02 0 −1 0 1 0 1 (b) 12 8 10 8 20 8 6 15 4 4 10 0.02 2 2 5 5 4 4 3 4 2 6 6 5 6 6 3 2 2 1 0 0.04 −0.01 0 0.01 0 −1 0 1 10 0 0 1 0 0.02 0.04 0.06 0.08 −0.01 8 8 6 6 4 4 2 2 0 0.01 0 −1 10 6 8 6 8 5 4 6 3 4 4 2 2 6 2 1 0 −0.01 1 2 1 (a) 0 0 0 2 2 0.005 1 8 8 4 2 0 10 10 8 0 −1 0 0.01 0 0 0.02 0.04 0 −1 5 4 3 2 1 0 1 0 −0.01 0 0.01 (c) Figure A.9: Expansion. (a) a11 = a22 0:02. (d) a11 = a22 = 0:04. 0 0 0.02 0.04 0.06 0.08 0 −1 (d) = 0:005. (b) a11 = a22 = 0:01. (c) a11 = a22 = TECHNICAL DETAILS 82 12 12 10 10 8 8 6 6 4 4 2 2 0 −0.01 0 0.01 0 0 12 0.01 0.02 12 6 10 5 8 4 6 3 4 2 2 1 0 −1 0 0 −0.01 1 8 12 10 8 6 4 4 0 −0.02 6 4 0 0.01 0 0 −0.01 0 0.01 0 −1 0 0 1 0.02 0 −0.04 1 1 0 1 2 1 −0.02 0 0 −0.01 0 0.01 0 −1 4 4 6 3 3 5 0.02 0.04 0.06 0.08 2 1 2 1 0 −1 0 −0.01 0 0 1 4 3 4 0 5 4 2 1 6 5 8 0.01 0 3 6 5 0 1 4 10 6 10 0 0 0 5 2 15 0 −0.01 1 (b) 6 2 0 6 6 2 0 0 −1 8 (a) 5 0.04 4 2 −0.01 2 2 4 2 2 6 3 4 8 8 4 6 10 6 5 8 10 12 6 10 0 0.01 3 2 1 0.05 0.1 0.15 0 −1 8 10 6 6 8 5 5 6 4 4 3 3 2 2 1 1 4 2 0 −0.08−0.06−0.04−0.02 0 0 −0.01 0 0.01 0 −1 10 6 4 5 6 4 3 4 2 2 0 2 0 −0.15 −0.1 −0.05 1 (c) 0 0 −0.01 1 0 0.01 0 −1 (d) Figure A.10: Rotation. (a) a12 = ,a21 = 0 01. ,a21 = 0 04. (d) a12 = ,a21 = 0 08. : : 6 8 : (b) a12 = ,a21 = 0 02. 
Figure A.11: Translation. (a) $v_x = v_y = 0.5$. (b) $v_x = v_y = 1$. (c) $v_x = v_y = 2$. (d) $v_x = v_y = 4$.

Figure A.12: Composite transformation. (a) $a_{11} = a_{12} = -a_{21} = a_{22} = 0.015$, $v_x = -v_y = 1$. (b) $a_{11} = a_{12} = -a_{21} = a_{22} = 0.03$, $v_x = -v_y = 2$.

I would like to present a medal of valour to the patient reader who has managed to plough through the volume. Experience shows that long-term exposure to the material is accompanied by feelings of confusion, frustration, and, finally, resignation.

The author
