Linköping Studies in Science and Technology
Thesis No. 562
Efficient Spatiotemporal
Filtering and Modelling
Jörgen Karlholm
LIU-TEK-LIC-1995:27
Department of Electrical Engineering
Linköpings universitet, SE-581 83 Linköping, Sweden
Efficient Spatiotemporal Filtering and Modelling
© 1996 Jörgen Karlholm
Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping
Sweden
ISBN 91-7871-741-8
ISSN 0280-7971
Abstract
The thesis describes novel methods for efficient spatiotemporal filtering and modelling.
A multiresolution algorithm for energy-based estimation and representation of local
spatiotemporal structure by second order symmetric tensors is presented.
The problem of how to properly process estimates with varying degree of reliability
is addressed. An efficient spatiotemporal implementation of a certainty-based signal
modelling method called normalized convolution is described.
As an application of the above results, a smooth pursuit motion tracking algorithm that
uses observations of both target motion and position for camera head control and motion
prediction is described. The target is detected using a novel motion field segmentation
algorithm which assumes that the motion fields of the target and its immediate vicinity,
at least occasionally, each can be modelled by a single parameterised motion model. A
method to eliminate camera-induced background motion in the case of a pan/tilt rotating
camera is suggested.
I dedicate this thesis to my teachers
whose encouragement and stimulation
have been invaluable to me.
Copyright © 1996 by Jörgen Karlholm

Acknowledgements
First of all, I wish to thank past and present colleagues at the Computer Vision Laboratory for being great friends. It is a privilege to work with such talented people.
I thank Professor Gösta Granlund, head of the Laboratory, for showing confidence in me
and my work.
Professor Hans Knutsson and Drs. Mats T. Andersson, Carl–Fredrik Westin and Carl–
Johan Westelius have contributed with many ideas and suggestions. I am particularly
grateful to Dr. Westelius, who developed the advanced graphics simulation ('virtual reality') that I have used extensively in my work.
I thank Professors Henrik I. Christensen, Erik Granum and Henning Nielsen, and Steen Kristensen and Paolo Pirjanian, all at the Laboratory of Image Analysis, University of Aalborg, Denmark, for their kind hospitality during my visits.
Dr. Klas Nordberg and Morgan Ulvklo significantly reduced the number of errors in the
manuscript.
Johan Wiklund provided excellent technical support and help with computer related
problems.
This work was done within the European Union ESPRIT–BRA 7108 Vision as Process
(VAP–II) project. Financial support from the Swedish National Board for Industrial and
Technical Development (NUTEK) and the Swedish Defence Materials Administration
(FMV) is gratefully acknowledged.
“Free fall is the natural state of motion.”
- In C. W. Misner, K. S. Thorne, J. A. Wheeler: Gravitation
Contents

1 INTRODUCTION
  1.1 Motivation
  1.2 The thesis
  1.3 Notation
  1.4 Representing local structure with tensors
      1.4.1 Energy-based local structure representation and estimation
      1.4.2 Efficient implementation of spherically separable quadrature filters

2 MULTIRESOLUTION SPARSE TENSOR FIELD
  2.1 Low-pass filtering, subsampling, and energy distribution
  2.2 Building a consistent pyramid
  2.3 Experimental results
  2.4 Computation of dense optical flow

3 A SIGNAL CERTAINTY APPROACH TO SPATIOTEMPORAL FILTERING AND MODELLING
  3.1 Background
  3.2 Normalized and differential convolution
  3.3 NDC for spatiotemporal filtering and modelling
  3.4 Experimental results
  3.5 Discussion

4 MOTION SEGMENTATION AND TRACKING
  4.1 Space-variant sensing
  4.2 Control aspects
  4.3 Motion models and segmentation
      4.3.1 Segmentation
      4.3.2 Discussion
      4.3.3 Motion models
  4.4 On the use of normalized convolution in tracking
  4.5 Experimental results

5 SUMMARY

A TECHNICAL DETAILS
  A.1 Local motion and tensors
      A.1.1 Velocity from the tensor
      A.1.2 Speed and eigenvectors
      A.1.3 Speed and T_xy
  A.2 Motion models
      A.2.1 Accuracy of affine parameter estimates
1 INTRODUCTION
This first chapter is intended to give [most of] the necessary background information
needed to understand the new results presented in the later chapters.
1.1 Motivation
It is not particularly difficult to convince anyone working with image processing that
efficient algorithms for motion estimation are of central importance to the field. Areas
such as video coding and real–time robot vision are highly dependent on accurate and
fast methods to extract spatiotemporal signal features for segmentation and response
generation.
The work presented in this thesis was prompted by certain breakthroughs in the implementation of spatiotemporal filter banks, which seemed to allow completely new
applications of massive spatiotemporal filtering and modelling. Could it be that highly
accurate motion estimation not only gives a quality improvement to standard real–time
algorithms, but actually makes completely new, more powerful methods realizable?
We cannot claim to have a definite answer to whether high performance spatiotemporal
filtering is useful in applications with time–constraints—there are some very promising
results based on more qualitative aspects of motion fields that seem not to require massive 3D computation—but it is clear that as computers get faster, more powerful methods
than those that have been used up to now will inevitably find their way to real–time applications.
1.2 The thesis
Section 1.4 of this chapter is an introduction to energy–based approaches to spatiotemporal local structure estimation and representation. It also provides an outline of certain
recent technical results that motivated the work presented in the thesis.
In Chapter 2 we present a novel efficient multiresolution algorithm for accurate and reliable estimation and representation of local spatiotemporal structure over a wide range
of image displacements. First we describe how the local spatiotemporal spectral energy
distribution is affected by spatial lowpass filtering and subsampling, and we investigate
the practical consequences of this. Then we consider how to process a resolution pyramid of local structure estimates so that it contains as accurate and reliable estimates as
possible—we provide experimental results that support our reasoning.
In Chapter 3 the concept of signal certainty–based filtering and modelling is treated.
This addresses the important problem of how to properly process estimates with varying
degree of reliability. The chapter contains a general account of a method called normalised convolution. We look at how this can be efficiently implemented in the case
of spatiotemporal filtering and modelling, and present experimental results that demonstrate the power of the method.
Chapter 4 describes an active vision application of the concepts developed in the earlier
chapters. We present a smooth pursuit motion tracking algorithm that uses observations
of both target position and velocity. The chapter contains novel efficient motion segmentation algorithms for extraction of the target motion field. Excerpts from a number
of simulated and real tracking sequences are provided.
1.3 Notation
In choosing the various symbols, certain general principles have been followed. Vectors are denoted by bold-faced lower-case characters, e.g., $\mathbf{u}$, and tensors of higher order by bold-faced upper-case characters, e.g., $\mathbf{I}$. The norm of a tensor $\mathbf{A}$ is denoted by $\|\mathbf{A}\|$. This will usually mean the Frobenius norm. A 'hat' above a vector indicates unit norm, e.g., $\hat{\mathbf{u}}$. Eigenvalues are denoted by the letter $\lambda$, and are ordered $\lambda_1 \geq \lambda_2 \geq \dots$

A certain abuse of notation will be noted, in that to simplify notation we do not discriminate between tensors and their coordinate matrices that always turn up in numerical calculations; e.g., $\mathbf{A}^T$ will denote the transpose of the matrix $\mathbf{A}$.

Occasionally we will use the notation $\langle \mathbf{a}, \mathbf{b} \rangle$ for the inner product between two possibly complex-valued vectors. This is equivalent to the coordinate vector operation $\mathbf{a}^* \mathbf{b}$, where the star denotes complex conjugation and transposition.

The signal is usually called $f(\mathbf{x})$. This always refers to a local coordinate system centred in a local neighbourhood. The Fourier transform of a function $s(\mathbf{x})$ is denoted by $S(\mathbf{u})$. The Laplace transform of a function $g(t)$ is denoted by $G(s)$.
1.4 Representing local structure with tensors
This section gives the necessary background information concerning the estimation and
representation of local spatio-temporal structure. The concepts presented here, and the
fact that the estimation procedure can be efficiently implemented, form a prerequisite
for the methods presented in later chapters.
The inertia tensor method is by Bigün, [16, 17, 15, 18]. Pioneer work on its application to motion analysis was done by Jähne, [53, 54, 55, 56]. The local structure tensor representation and estimation by quadrature filtering was conceived by Knutsson,
[59, 60, 61, 62, 66, 65]. A comprehensive description of the theory and its application
is given in [43]. The efficient implementation of multidimensional quadrature filtering
is by Knutsson and Andersson, [63, 64, 6].
1.4.1 Energy-based local structure representation and
estimation
One of the principal goals of low-level computer vision can be formulated as the detection and representation of local anisotropy. Typical representatives are lines, edges,
corners, crossings and junctions. The reason why these features are important is that
they are of a generic character yet specific enough to greatly facilitate unambiguous
image segmentation. In motion analysis we are interested in features that are stable in
time so that local velocity may be computed from the correspondence of features in successive frames. Indeed, many motion estimation algorithms work by matching features
from one frame to the next. A problem with this strategy is that there may be multiple
matches, particularly in textured regions. Assuming that a feature is stable between several successive frames, the local velocity may be found from a spatio-temporal vector
pointing in the direction of least signal variation¹. The Fourier transform² of a signal
that is constant in one direction is confined to a plane through the origin perpendicular to
this direction, Figure 1.2. If the feature is also constant in a spatial direction, i.e., it is an
edge or a line, there will be two orthogonal spatio-temporal directions with small signal
variation. This prevents finding the true motion direction, but constrains it to lie in a
plane orthogonal to the direction of maximum signal variation. A signal that varies in a
single direction has a spectrum confined to a line in this direction through the origin in
the Fourier domain, Figure 1.1. The discussion has focused the attention on the angular
¹ There are several ways to define the directional signal variation more exactly, leading to different algorithms. For our present discussion an intuitive picture is sufficient.
² When referring to the Fourier transform of the signal we always mean the transform of a windowed signal, a 'local Fourier transform'.
distribution of the signal spectrum and its relation to motion estimation. We will here
review two related methods to obtain a representation of this distribution.
In the first method one estimates the inertia tensor $\mathbf{J}$ of the power spectrum, in analogy with the inertia tensor of mechanics. The inertia tensor comes from finding the direction $\hat{\mathbf{n}}_{min}$ that minimises

$$J[\hat{\mathbf{n}}] = \int \| \mathbf{u} - (\mathbf{u}^T \hat{\mathbf{n}}) \hat{\mathbf{n}} \|^2 \, |F(\mathbf{u})|^2 \, d\mathbf{u}$$

which is simply the energy-weighted integral of the orthogonal distances of the data points to a line through the origin in the $\hat{\mathbf{n}}$-direction. The motivation for introducing $J$ is that for signals with a dominant direction of variation, this direction is a global minimum of $J$. Likewise, we see that for a signal that is constant in one unique direction, this direction is a global maximum of $J$. A little algebra leads to the following equivalent expression

$$J[\hat{\mathbf{n}}] = \hat{\mathbf{n}}^T \mathbf{J} \hat{\mathbf{n}}$$

In component form the inertia tensor is given by

$$J_{jk} = \int |F(\mathbf{u})|^2 \left( \|\mathbf{u}\|^2 \delta_{jk} - u_j u_k \right) d\mathbf{u} \qquad (1.1)$$

Since $\mathbf{J}$ is a symmetric second order tensor it is now apparent that $J$ is minimised by the eigenvector of $\mathbf{J}$ corresponding to the smallest eigenvalue. From Plancherel's relation, and using the fact that multiplication by $i u_k$ in the Fourier domain corresponds to partial differentiation $\partial/\partial x_k$ in the spatial domain, Equation (1.1) may be written

$$J_{jk} = \int \left[ \|\nabla f\|^2 \delta_{jk} - \frac{\partial f}{\partial x_j} \frac{\partial f}{\partial x_k} \right] d\mathbf{x} \qquad (1.2)$$

The integration is taken over the window defining the local spatio-temporal neighbourhood and may be organised as low-pass filtering of the derivative product volumes. The derivatives themselves can be efficiently computed with one-dimensional filters.
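To make the implementation concrete, the following is a minimal sketch of Equation (1.2), assuming a Gaussian window; the function name, σ value and use of scipy are illustrative choices for the example, not the thesis code.

import numpy as np
from scipy.ndimage import gaussian_filter

def inertia_tensor_field(f, sigma=2.0):
    # f: spatiotemporal volume with axes (x, y, t).
    # Returns a field of 3x3 inertia tensors, Eq. (1.2).
    grads = np.gradient(f)                  # df/dx_j along each axis
    grad_sq = sum(g * g for g in grads)     # |grad f|^2
    J = np.empty(f.shape + (3, 3))
    for j in range(3):
        for k in range(3):
            integrand = (grad_sq if j == k else 0.0) - grads[j] * grads[k]
            # the window integral is organised as low-pass filtering
            # of the derivative product volumes
            J[..., j, k] = gaussian_filter(integrand, sigma)
    return J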
Critics of the inertia tensor approach argue that the spatio-temporal low-pass filtering causes an unwanted smearing³, and that the use of squared derivatives demands a higher sampling rate than is needed to represent the original signal.
An alternative approach to energy-based local structure estimation and representation
was given by Knutsson. This procedure also leads to a symmetric second-order tensor
representation but it avoids the spatio-temporal averaging of the inertia tensor approach,
³ This problem is particularly severe at motion boundaries and for small objects moving against a textured background.
Figure 1.1: A planar autocorrelation function in the spatial domain corresponds to energy being distributed on a line in the Fourier domain. (Iso-surface plots, spatial domain vs. Fourier domain.)

Figure 1.2: An autocorrelation function concentrated on a line in the spatial domain corresponds to a planar energy distribution in the Fourier domain. (Iso-surface plots.)

Figure 1.3: A spherical autocorrelation function in the spatial domain corresponds to a spherical energy distribution in the Fourier domain. (Iso-surface plots.)
and the shift of the spectrum to higher frequencies. The concept of the local structure tensor comes from the observation that an unambiguous representation of signal orientation is given by

$$\mathbf{T} = A\,\hat{\mathbf{x}}\hat{\mathbf{x}}^T \qquad (1.3)$$

where $\hat{\mathbf{x}}$ is a unit vector in the direction of maximum signal variation and $A$ is any scalar greater than zero. Estimation of $\mathbf{T}$ is done by probing Fourier space in several directions $\hat{\mathbf{n}}_k$ with filters that each pick up energy in an angular sector centred at a particular direction. These filters are complex-valued quadrature⁴ filters, which means that their real and imaginary parts are reciprocal Hilbert transforms [22]. The essential result of this is that the impulse response of the filter has a Fourier transform that is real and non-zero only in a half-space $\hat{\mathbf{n}}_k^T \mathbf{u} > 0$. One designs the quadrature filters to be spherically separable in the Fourier domain, which means that they can be written as a product of a function of radius and a function of direction

$$F_k(\mathbf{u}) = R(\rho)\, D_k(\hat{\mathbf{u}})$$

The radial part, $R(\rho)$, is made positive in a pass-band so that the filter response corresponds to a weighted average of the spectral coefficients in a region of Fourier space⁵. Figure 1.4 shows the radial part of the filter used in most experiments in this thesis. This is a lognormal frequency function, which means that it is Gaussian on a logarithmic scale. The real benefit from using quadrature filters comes from taking the absolute value (magnitude) of the filter response. This is a measure of the 'size' of the average spectral coefficient in a region and is for narrow-banded signals invariant to shifts of the signal, which then assures that the orientation estimate is independent of whether it is obtained at an edge or on a line (phase invariance). The filter response magnitudes $|q_k|$ are used as coefficients in a linear summation of basis tensors

$$\mathbf{T}_{est} = \sum_k |q_k|\, \mathbf{M}_k \qquad (1.4)$$

where $\mathbf{M}_k$ is the dual of the outer product tensor $\mathbf{N}_k = \hat{\mathbf{n}}_k\hat{\mathbf{n}}_k^T$. The $\mathbf{M}_k$'s are defined by the reconstruction relation

$$\sum_k (\mathbf{S} \bullet \mathbf{N}_k)\, \mathbf{M}_k = \mathbf{S} \quad \text{for all second order symmetric tensors } \mathbf{S} \qquad (1.5)$$

where

$$\mathbf{A} \bullet \mathbf{B} = \sum_{ij} A_{ij} B_{ij}$$
⁴ Early approaches to motion analysis by spatio-temporal quadrature filtering are by Adelson and Bergen [1] and Heeger [47]. These papers also discuss the relation to psychophysics and neurophysiology of motion computation.
⁵ Note that since the Fourier transform of a real signal is Hermitian there is no loss of information from looking at the spectrum in the filter's 'positive direction' only.
Figure 1.4: Lognormal radial frequency function with centre frequency $\pi/(2\sqrt{2})$. (Axes: frequency from 0 to π; magnitude from 0 to 1.)
defines an inner product in the tensor space. For the reconstruction relation to be satisfied
we need at least as many linearly independent filter directions as there are degrees of
freedom in a tensor, i.e., six when filtering spatio-temporal sequences. The explicit
mathematical expression for the $\mathbf{M}_k$'s depends on the angular distribution of the filters, but it is not too difficult to show that the general dual basis element is given by

$$\mathbf{M}_k = \Big[ \sum_i \mathbf{N}_i \otimes \mathbf{N}_i \Big]^{-1} \mathbf{N}_k \qquad (1.6)$$

where $\otimes$ denotes a tensor (outer) product. When the filters are symmetrically distributed in a hemisphere this reduces to [43]

$$\mathbf{M}_k = \frac{4}{3}\Big( \mathbf{N}_k - \frac{1}{4}\mathbf{I} \Big)$$

where $\mathbf{I}$ refers to the identity tensor with components $I_{ij} = \delta_{ij}$.
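For illustration, the general dual basis of Equation (1.6) can also be computed numerically. A minimal sketch, assuming an arbitrary set of unit filter directions; the vectorisation and pseudo-inverse shortcut are implementation choices for the example, not the thesis procedure.

import numpy as np

def dual_tensors(directions):
    # directions: list of unit 3-vectors n_k.
    # Returns the duals M_k of N_k = n_k n_k^T via Eq. (1.6).
    N = [np.outer(n, n) for n in directions]
    # the operator sum_i N_i (x) N_i acting on vectorised tensors
    G = sum(np.outer(Nk.ravel(), Nk.ravel()) for Nk in N)
    G_inv = np.linalg.pinv(G)   # invertible only on the symmetric subspace
    return [(G_inv @ Nk.ravel()).reshape(3, 3) for Nk in N]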
As discussed earlier in this section, a signal that varies in a single direction $\hat{\mathbf{x}}$ (such as a moving edge), referred to as a simple signal, has a spectrum that is non-zero only on a line through the origin in this direction. Writing this as

$$S(\mathbf{u}) = G(\hat{\mathbf{x}}^T\mathbf{u})\, \delta^{line}_{\hat{\mathbf{x}}}(\mathbf{u})$$

where $\delta^{line}_{\hat{\mathbf{x}}}(\mathbf{u})$ is an impulse line in the $\hat{\mathbf{x}}$-direction⁶, the result of filtering a simple signal

⁶ This is defined as a product of one-dimensional delta-functions in the directions orthogonal to $\hat{\mathbf{x}}$.
with a spherically separable quadrature filter can be written

$$q_k = \int F_k(\mathbf{u})\, S(\mathbf{u})\, d\mathbf{u} = \int R(\rho)\, D_k(\hat{\mathbf{u}})\, G(\hat{\mathbf{x}}^T\mathbf{u})\, \delta^{line}_{\hat{\mathbf{x}}}(\mathbf{u})\, d\mathbf{u}$$
$$= D_k(\hat{\mathbf{x}}) \int_0^{\infty} R(\rho)\, G(\rho)\, d\rho + D_k(-\hat{\mathbf{x}}) \int_0^{\infty} R(\rho)\, G(-\rho)\, d\rho = a\, D_k(\hat{\mathbf{x}}) + \bar{a}\, D_k(-\hat{\mathbf{x}})$$

where $a$ is a complex number that only depends on the radial overlap between the filter and the signal ($\bar{a}$ being its complex conjugate). Since the filter is non-zero in only one of the directions $\pm\hat{\mathbf{x}}$, the magnitude of the filter response is

$$|q_k| = |a|\, |D_k(\hat{\mathbf{x}}) + D_k(-\hat{\mathbf{x}})|$$
Now, if we choose the directional function to be

$$D_k(\hat{\mathbf{u}}) = \begin{cases} (\hat{\mathbf{n}}_k^T\hat{\mathbf{u}})^2 = (\hat{\mathbf{n}}_k\hat{\mathbf{n}}_k^T) \bullet (\hat{\mathbf{u}}\hat{\mathbf{u}}^T) & \hat{\mathbf{n}}_k^T\hat{\mathbf{u}} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (1.7)$$
we see that Equation (1.4) reduces to

$$\mathbf{T}_{est} = \sum_k |q_k|\,\mathbf{M}_k = \sum_k |a|\, \big( (\hat{\mathbf{n}}_k\hat{\mathbf{n}}_k^T) \bullet (\hat{\mathbf{x}}\hat{\mathbf{x}}^T) \big)\, \mathbf{M}_k = \sum_k |a|\, (\mathbf{N}_k \bullet \hat{\mathbf{x}}\hat{\mathbf{x}}^T)\, \mathbf{M}_k = |a|\, \hat{\mathbf{x}}\hat{\mathbf{x}}^T \qquad (1.8)$$
This proves that the suggested method correctly estimates the orientation tensor $\mathbf{T}$ of Equation (1.3) in the ideal case of a simple signal. In general we may represent the orientation of a signal by the outer product tensor that minimises the Frobenius distance

$$\Delta = \|\mathbf{T}_{est} - A\,\hat{\mathbf{x}}\hat{\mathbf{x}}^T\|_F = \sqrt{\textstyle\sum_{ij} (T_{ij} - A x_i x_j)^2}$$

It is not difficult to see that the minimum value is obtained when $A$ is taken as the largest eigenvalue of $\mathbf{T}_{est}$ and $\hat{\mathbf{x}}$ the corresponding eigenvector.
Another relevant ideal case besides that of a simple signal is the moving point. The spectrum of such a signal is constant on a plane through the origin and zero outside it. In
this case the estimated tensor will have two (equal) non-zero eigenvalues and the eigenvector corresponding to the third (zero-valued) eigenvalue will point in the direction of
spatio-temporal motion. In general, wherever there is more than one spatial orientation, the correct image velocity can be recovered from the direction of the eigenvector
corresponding to the smallest eigenvalue.
The details of how the velocity components are recovered from the tensor are given in
Appendix A.1.1.
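As a concrete illustration of the estimation chain, the sketch below composes the tensor from filter magnitudes as in Equation (1.4) and reads off the image velocity in the moving-point case; it uses a full library eigendecomposition for brevity (Section 2.2 and Appendix A.1 derive cheaper routes), and the helper names and eps guard are assumptions for the example.

import numpy as np

def estimate_tensor(q, M):
    # q: complex quadrature filter responses; M: matching dual tensors.
    return sum(abs(qk) * Mk for qk, Mk in zip(q, M))   # Eq. (1.4)

def velocity_from_tensor(T, eps=1e-9):
    # Moving-point case: the eigenvector of the smallest eigenvalue
    # points along the spatiotemporal motion direction (x, y, t).
    w, V = np.linalg.eigh(T)    # eigenvalues in ascending order
    e = V[:, 0]                 # smallest-eigenvalue eigenvector
    if abs(e[2]) < eps:         # vanishing temporal component
        return None
    return e[:2] / e[2]         # image velocity (vx, vy)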
1.4.2 Efficient implementation of spherically separable quadrature filters
We saw in the previous subsection that the inertia tensor method allows an efficient
implementation using one-dimensional derivative and low-pass filters. We now seek a
corresponding efficient implementation of the quadrature filtering method. A problem
with the quadrature filters is that spherical separability in general is incompatible with
Cartesian separability. We therefore inevitably introduce errors when trying to approximate the multi-dimensional filters by a convolution product of low-dimensional filters.
The success of such an attempt naturally depends on the magnitude of the approximation error. Unfortunately, it is not entirely obvious how the filters should be decomposed
to simultaneously satisfy the constraints of minimum approximation error and smallest
possible number of coefficients. Given the shape of the present filters, with their rather
sharp magnitude peak in the central direction of the filter, it is natural to try a decomposition with a one-dimensional quadrature filter in the principal direction and real-valued
one-dimensional low-pass filters in the orthogonal directions. With a symmetric distribution of filters, taking full advantage of symmetry properties, this leads to the scheme
of Figure 1.5.

Figure 1.5: 3D sequential quadrature filtering structure. (The image sequence is passed through one-dimensional low-pass filters and one-dimensional quadrature filters along the different coordinate directions, producing the filter outputs f1, ..., f6.)

The filter coefficients are determined in a recursive optimisation process in Fourier space. With an ideal multi-dimensional target filter $\hat{F}(\mathbf{u})$, and our current
component product⁷ approximation of this, $F(\mathbf{u}) = \prod_k F_k(\mathbf{u})$, we define temporary targets

$$\hat{F}_i(\mathbf{u}) = \frac{\hat{F}(\mathbf{u})}{\prod_{k \neq i} F_k(\mathbf{u})}$$

for the component filters. We use this target filter to produce a (hopefully) better component filter by minimising the weighted square error

$$E_i = \big\| \, W_i(|\mathbf{u}|) \prod_{k \neq i} F_k(\mathbf{u})\, \big[\, \hat{F}_i(\mathbf{u}) - F_i^{new}(\mathbf{u})\, \big] \, \big\|^2 \qquad (1.9)$$
$$\phantom{E_i} = \big\| \, W_i(|\mathbf{u}|)\, \big[\, \hat{F}(\mathbf{u}) - F_i^{new}(\mathbf{u}) \prod_{k \neq i} F_k(\mathbf{u})\, \big] \, \big\|^2 \qquad (1.9')$$

where $W_i(|\mathbf{u}|)$ is a radial frequency weight, typically $W_i(|\mathbf{u}|) \propto |\mathbf{u}|^{-\alpha} + \varepsilon$, see [43]. Equation (1.9') says that on average we will decrease the difference between the ideal filter $\hat{F}(\mathbf{u})$ and the product $F(\mathbf{u})$. The constraint that the filter coefficients are to be confined to certain chosen spatial coordinate points (in this case ideally on a line⁸) is implemented by specifying the discrete Fourier transform to be of the form

$$F_k(\mathbf{u}) = \sum_{n=1}^{N_k} f_k(\boldsymbol{\xi}_n) \exp(-i\, \boldsymbol{\xi}_n^T \mathbf{u}), \qquad |\mathbf{u}| < \pi$$

where the sum is over all allowed spatial coordinate points $\boldsymbol{\xi}_n$. The optimiser then solves a linear system for the least-square sense optimal coefficients $f_k$.
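A minimal sketch of this cyclic optimisation, reduced to one-dimensional component filters on a common frequency grid; the initialisation, grid resolution and weight handling are assumptions for the example, not the optimiser of [63, 64, 6].

import numpy as np

def optimise_components(F_target, taps, W, n_iter=10):
    # F_target: ideal filter sampled on a frequency grid; taps: list of
    # spatial coordinate arrays xi_n, one per component filter;
    # W: radial frequency weight on the same grid.
    u = np.linspace(-np.pi, np.pi, len(F_target), endpoint=False)
    B = [np.exp(-1j * np.outer(u, xi)) for xi in taps]   # DFT matrices
    f = [np.linalg.lstsq(Bk, F_target, rcond=None)[0] for Bk in B]
    for _ in range(n_iter):                              # cyclic sweeps
        for i in range(len(f)):
            rest = np.prod([B[k] @ f[k]
                            for k in range(len(f)) if k != i], axis=0)
            # minimise || W (F_target - rest * B_i f_i) ||, cf. Eq. (1.9')
            A = (W * rest)[:, None] * B[i]
            f[i] = np.linalg.lstsq(A, W * F_target, rcond=None)[0]
    return f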
The optimisation procedure (definition of component target function and optimisation
of the corresponding filter) is repeated cyclically for all component filters until convergence. For the present filters this takes less than ten iterations. The quality of the
optimised filters has been carefully investigated by estimating the orientation of certain
test patterns with known orientation at each position using the local structure tensor
method. The result is that the precision is almost identical to what is obtained with full
multi-dimensional filters, [6]. As an example, a particular set of six full $9 \times 9 \times 9$ complex 3D filters requires $9 \cdot 9 \cdot 9 \cdot 6 \cdot 2 = 8748$ coefficients. A corresponding set of sequential filters with the same performance characteristics requires 273 coefficients. This clearly makes the quadrature filtering approach competitive in relation to the inertia tensor method⁹, which requires partial derivative computations followed by multi-dimensional low-pass filtering of all independent products of the derivatives.
⁷ Convolution in the space-time domain corresponds to multiplication in the frequency domain.
⁸ Some of the filters are for sampling density reasons 'weakly' two-dimensional with a tri-diagonal form.
⁹ From a computation cost point of view, assuming the same size of the filters, and disregarding the fact that the inertia tensor method requires a higher sampling density to obtain comparable spatial resolution.
2 MULTIRESOLUTION SPARSE TENSOR FIELD
A unique feature of our approach to motion segmentation is that models are fitted not to
local estimates of velocity but to a sparse field of spatiotemporal local structure tensors.
In this chapter we present a fast algorithm for generation of a sparse tensor field where
estimates are consistent over scales.
2.1 Low-pass filtering, subsampling, and energy distribution
As described in Section 1.4.1 there is a direct correspondence between on one hand the
spatiotemporal displacement vector of a local spatial feature and on the other hand the
local structure tensor. However, to robustly estimate the tensor, so as to avoid temporal under-sampling and to be able to cope with a wide range of velocities, it is necessary to
adopt some kind of multiresolution scheme. Though it is possible to conceive various
advanced partitionings of the frequency domain into frequency channels by combinations of spatial and temporal subsampling, [44], we opted to compute a simple low-pass
pyramid à la Burt [25] for each frame, and not to resample the sequence temporally.
The result is a partitioning of the frequency domain as shown in Figure 2.1. Each level
in the pyramid is constructed from the previous level by Gaussian low-pass filtering
and subsequent subsampling by a factor two. To avoid aliasing when subsampling it
is necessary to use a filter that is quite narrow in the frequency domain. As is seen in
Figure 2.2, the result is that the energy is substantially reduced in the spatial directions.
This is of no consequence in the ideal case of a moving line or point, when the spectral
energy is confined to a line or a plane — the orientation is unaffected by the low-pass
filtering. However, the situation is quite different in the case of a noisy signal. Referring
to Figure 2.3, a sequence of white noise images that is spatially low-pass filtered and
Figure 2.1: Partitioning of the frequency domain (spatial frequency vs. temporal frequency) by successive spatial subsampling. The plot shows Nyquist frequencies for the different levels.
subsampled becomes orientation biased since energy is lost in the spatial directions. A
possible remedy for this is to low-pass filter the signal temporally with a filter that is
twice as broad as the filter used in the spatial directions [44]. This type of temporal filter
can be efficiently implemented as a cascaded recursive filter, [39]. A couple of experiments were carried out to quantitatively determine the influence of isotropic noise on
orientation estimates in spatially subsampled sequences. The orientation bias caused by
the low-pass filtering¹ and subsampling was measured by computing the average orientation tensor over each sequence and determining the quotient $\lambda_t/\lambda_{spat}$, where $\lambda_t$ refers to the eigenvalue in the temporal direction and $\lambda_{spat}$ to the average of the eigenvalues in the spatial plane.
In the first experiment the average orientation of initially white noise was computed.
The results are shown in Table 2.1. As expected, the average orientation becomes biased when not compensating for the energy anisotropy by temporal low-pass filtering.
However, the effect is quite small for moderate spatial low-pass filtering ($\sigma_{spat} = \pi/4$). This is due to the fact that the quadrature filter used is fairly insensitive to high frequencies, cf. Figure 1.4.

¹ The low-pass filters were of the form $F(\omega_x, \omega_y, \omega_t) = \exp\left[ -\left( \frac{\omega_x^2}{2\sigma_{spat}^2} + \frac{\omega_y^2}{2\sigma_{spat}^2} + \frac{\omega_t^2}{2\sigma_t^2} \right) \right]$.
Figure 2.2: Gaussian low-pass filtering (σ = π/4) reduces the signal energy in the spatial directions. The subsequent subsampling moves the maximum frequency down to π/2 (dashed line). There is also some aliasing caused by the non-zero tail above π/2.
Repeating the low-pass filtering and subsampling twice with $\sigma_{spat} = \pi/4$ and no temporal filtering results in a quotient of 1.04, which indicates that temporal filtering in fact may be unnecessary. This is important in real-time control applications where long delays cause problems.
σ_spat    σ_t = ∞    σ_t = 2σ_spat
π/4       1.03       1.03
π/8       1.19       1.03
π/16      1.57       1.01

Table 2.1: Results of low-pass filtering and spatial subsampling by a factor two of a sequence of white noise. The numbers show the quotient $\lambda_t/\lambda_{spat}$ of the average tensor.
In a second experiment we used a synthetic volume with an unambiguous spatiotemporal orientation at each position. The sequence was designed to have a radially sinusoidal variation of grey-levels with a decreasing frequency towards the periphery, Figure 2.4.
Figure 2.3: Illustration of low-pass filtering and subsampling of a white noise sequence.
Plots show the spectral energy distribution in one spatial direction and the temporal direction. Left: Spatial low-pass filtering reduces the signal energy in the spatial directions. Middle: Spatial subsampling leads to a rescaling of the spectrum. Right: The
energy anisotropy is compensated for by a temporal low-pass filtering with a filter that
is twice as broad as the corresponding filter in the spatial direction.
The volume was corrupted with white noise and subsequently low-pass filtered ($\sigma_{spat} = \pi/4$) and subsampled. The change in average orientation caused by the noise remained below one degree with the signal-to-noise ratio as low as 0 dB. This is
not surprising since the quadrature filter picks up much less energy from a signal with
a random phase distribution over frequencies than from a signal with a well-defined
phase. The conclusion of this investigation is that with an appropriate choice of radial
filter function of the quadrature filter, i.e., one that is comparatively insensitive to high
frequencies, there is no reason to worry about orientation bias induced by uncorrelated
signal noise. Consequently, no temporal low-pass filtering is necessary.
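A minimal sketch of the per-frame pyramid of this section (Burt-style): spatial-only Gaussian low-pass followed by factor-two subsampling, with no temporal resampling. The spatial-domain σ below is an illustrative stand-in for the frequency-domain σ_spat = π/4 used in the experiments.

import numpy as np
from scipy.ndimage import gaussian_filter

def lowpass_pyramid(frame, levels=3, sigma=1.3):
    # frame: 2D image. Returns a list of increasingly coarse levels;
    # the time axis is never resampled, only the spatial axes.
    pyramid = [np.asarray(frame, dtype=float)]
    for _ in range(levels - 1):
        blurred = gaussian_filter(pyramid[-1], sigma)
        pyramid.append(blurred[::2, ::2])   # spatial subsampling by two
    return pyramid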
2.2 Building a consistent pyramid
The construction of a low-pass pyramid from each image frame gives a number of separate spatial resolution channels that can be processed in parallel. Consecutive images
Figure 2.4: Radial variation of grey-levels in test volume.
are stacked in a temporal buffer which is convolved with quadrature filters. The magnitude of the filter outputs are used in the composition of local structure tensors. The
result is a multiresolution pyramid of tensors. At each original image position there are
now several tensors, one from each level of the pyramid, describing the local spatiotemporal structure as it appears at each particular scale. The question arises how to handle
this type of representation of the local structure. The answer, of course, depends on the
intended application. Our intention is to perform segmentation by fitting parameterised
models of the spatiotemporal displacement field to estimates in regions of the image. Interpreting the spatiotemporal displacement as the direction of minimal signal variation,
it is clear that the information is readily available in the tensor. For efficiency reasons we
want to use the sparse multiresolution tensor field as it is, without any data conversion or
new data added. The problem of how to handle the tensor field pyramid then reduces to
that of deciding which confidence value should be given to each estimate, i.e., how much
a particular tensor should be trusted. For computational efficiency it is also desirable to
sort out data that does not contain any useful information as early as possible in a chain
of operations.
At this point it is appropriate to make the distinction between two entities of fundamental importance. We use the definitions by Horn, [49]. A point on an object in motion
traces out a particle path (flow line) in 3D space, the temporal derivative of which is the
instantaneous 3D velocity vector of the point. The geometrical projection of the particle
path onto the image plane by the camera gives rise to a 2D particle path whose temporal
derivative is the projected 2D velocity vector. The 2D motion field is the collection of
all such 2D velocity vectors. The image velocity or optical flow is (any) estimate of
the 2D motion field based on the spatiotemporal variation of image intensity. Several
investigators [49, 97, 98, 78, 36, 38] have studied the relation between the 2D motion
field and the image velocity. The conclusion is that they generally are different, sometimes very much so. A classical example is by Horn, [49]. A smooth sphere with a
specular² surface rotating under constant illumination generates no spatiotemporal image intensity variation. On the other hand, if the sphere is fixed but a light source is
in motion, there will be a spatiotemporal image intensity variation caused by reflection
in the surface of the sphere. The intensity variation caused by this type of “illusory”
motion can not be distinguished from that caused by objects in actual motion without a
priori knowledge or high-level scene interpretation. A diffusely reflecting (Lambertian)
surface also induces intensity variation caused by changes in angle between the surface
normal and light sources. This variation is typically a smooth function of position, and
independent of texture and surface markings. The conclusion is that the optical flow
accurately describes the motion field of predominantly diffusely reflecting surfaces with
a large spatial variation of grey level.
The use of the local structure tensor for motion estimation is based on the assumption
that the spatiotemporal directions of largest signal variance are orthogonal to the spatiotemporal displacement vector. The shape of the tensor, regarded as an ellipsoid with
the semi-axes proportional to the eigenvalues of the tensor, cf. Figures 1.1 – 1.3, is a
simple model of the local signal variation, with the longest semi-axis corresponding to
the direction of maximum signal variation. It is evident that when the signal variation is
close to uniform in all directions, no reliable information about the local motion direction is available. To qualify as a reliable estimate we require the neighbourhood to have
a well-defined anisotropy (orientation). This means that the smallest tensor eigenvalue
should be substantially smaller than the largest one. It is beneficial to have a computationally cheap measure of anisotropy that does not require eigenvalue decomposition, so
that unreliable estimates of velocity may be quickly rejected from further processing. A
suitable measure is given by

$$\mu = \frac{\|\mathbf{T}\|_F}{\mathrm{Tr}(\mathbf{T})} = \frac{\sqrt{\sum_{i,j} T_{ij}^2}}{\sum_k T_{kk}} = \frac{\sqrt{\sum_k \lambda_k^2}}{\sum_k \lambda_k}, \qquad 1/\sqrt{3} \leq \mu \leq 1 \qquad (2.1)$$
In Figure 2.5 µ is plotted as a function of degree of neighbourhood anisotropy. We simply threshold on µ, discarding tensors that are not sufficiently anisotropic. The Frobenius norm of the tensor, $\|\mathbf{T}\|_F$, is a measure of the signal amplitude as seen by the filters. If the amplitude is small, it is possible that the spatial variation of the signal may be too small to dominate over changes in illumination, and the tensor becomes very noise sensitive. We therefore reject tensors whose norm is not greater than an energy threshold η. To become independent of the absolute level of contrast, one may alternatively reject tensors whose norm is below a small fraction (say, a few percent) of the largest tensor element in a frame.
² Reflects like a mirror.
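The two rejection tests translate directly into a cheap per-tensor check. A sketch, with threshold values as placeholders (the experiments of Section 2.3 use µ thresholds around 0.62-0.63):

import numpy as np

def accept_tensor(T, mu_min=0.62, eta=1e-3):
    # Eq. (2.1) without eigendecomposition: keep the 3x3 tensor T only
    # if it is both energetic and anisotropic enough.
    norm_f = np.linalg.norm(T)          # Frobenius norm ||T||_F
    trace = float(np.trace(T))
    if norm_f <= eta or trace <= 0.0:   # energy threshold eta
        return False
    mu = norm_f / trace                 # 1/sqrt(3) <= mu <= 1
    return mu > mu_min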
Figure 2.5: Estimating degree of anisotropy of local neighbourhood (µ plotted against λ). Solid line: Plane-like anisotropy, $\mu = \sqrt{\lambda^2 + 1 + 1}\,/\,(\lambda + 1 + 1)$. Dotted line: Line-like anisotropy, $\mu = \sqrt{2\lambda^2 + 1}\,/\,(2\lambda + 1)$.
Next, consider the relative reliability of tensors at different scales. With the filters (almost) symmetrically distributed in space, there is no significant direction dependence
of the angular error in the orientation estimation³. Consequently we expect the angular
error in the estimation of the spatiotemporal displacement vector to be independent of
direction. The angular error may be converted into an absolute speed error, which becomes a function of the speed and the scale at which the estimate is obtained, Figure 2.6
(left). Similarly it is interesting to see the angular error at each scale transformed into
a corresponding angle at the finest scale, Figure 2.6 (right). This gives an indication of
the relative validity of the estimates obtained at different scales as a function of image
speed.
There are a couple of additional relevant constraints:
1. The spatial localisation of the estimate should be as good as possible.
2. Temporal aliasing should be avoided.
The first of these demands is of particular importance when using more sophisticated
models than a constant translation. It also leads to more accurate results at motion
borders. The second item calls for a short digression. Consider a sinusoidal signal of
frequency $\omega_0$ moving with speed $v_0$ [pixels/frame]. The resulting inter-frame phase shift

³ For the test volume of Figure 2.4 corrupted with additive noise, the average magnitude of the angular error is 0.8°, 3.0°, and 9.4° for signal-to-noise ratios of ∞ dB, 10 dB, and 0 dB respectively, [64].
Figure 2.6: Plots of errors as they appear at the finest scale, assuming an angular error $\Delta\phi = 3.0°$. Full line: finest scale, k = 0. Dotted line: coarsest scale, k = 3. Left: Absolute speed error $\Delta v$ [pixels/frame] as a function of speed v and scale k: $\Delta v(v,k) = 2^k \tan(\arctan(2^{-k}v) + \Delta\phi) - v$. Right: Apparent angular error $\Delta\phi(v,k)$ [deg] as a function of speed v and scale k: $\Delta\phi(v,k) = \arctan(v) - \arctan[\, 2^k \tan(\arctan(2^{-k}v) - \Delta\phi)\, ]$.
is $\omega_0 v_0$. However, this phase shift is ambiguous, since the signal looks the same if shifted any multiple of the wavelength. In particular, any displacement $v_0$ greater than half the wavelength manifests itself as a phase shift less than π (aliasing). One consequence of this is that an unambiguous estimate of spatiotemporal orientation can strictly only be obtained if the speed $v_0$ satisfies

$$v_0 < \frac{\pi}{\omega_0}$$

where $\omega_0$ is the maximum spatial frequency. The combination of a quadrature filter comparatively insensitive to high frequencies and a low-pass filter with a low cut-off frequency reduces the actual high-pass energy influence, so that we dare to extend the maximum allowed velocity estimated at each scale to well above the nominal 1 pixel per frame, particularly at the coarser scales.
The arguments presented above indicate that a tensor validity labelling scheme must
include some kind of coarse-to-fine strategy. With initial speed estimates at the coarser
scales we can decide whether or not it is useful to descend to finer scales to obtain
more precise results. If not, we inhibit the corresponding positions at the finer scales,
i.e., we set a flag that indicates that they are invalid, Figure 2.7. The local speed, s,
is in the moving point case determined by the size of the temporal component of the
eigenvector corresponding to the smallest eigenvalue of the tensor. In the case of a single
dominant eigenvector (moving edge/line case) the speed is determined by the temporal
Figure 2.7: The use of coarse-to-fine inhibition is here illustrated for a case with a sharp motion gradient, e.g., a motion border between two objects. (Estimates at each scale are flagged as 'too low' or 'high'.)
component of the eigenvector corresponding to the largest eigenvalue. One finds (cf. Appendix A.1.2)

$$s = \begin{cases} \sqrt{\dfrac{e_{13}^2}{1 - e_{13}^2}} & \text{numerical rank 1 tensor} \\[2ex] \sqrt{\dfrac{1 - e_{33}^2}{e_{33}^2}} & \text{numerical rank 2 tensor} \end{cases}$$
It appears that we have to compute the eigenvalue decomposition of the tensor. This is actually not the case. The eigenvalues can be efficiently computed as the roots of the (cubic) characteristic polynomial $\det(\mathbf{T} - \lambda\mathbf{I})$. Using the fact that a symmetric tensor can always be decomposed into

$$\mathbf{T} = (\lambda_1 - \lambda_2)\,\mathbf{T}_1 + (\lambda_2 - \lambda_3)\,\mathbf{T}_2 + \lambda_3\,\mathbf{T}_3$$

with

$$\mathbf{T}_1 = \hat{\mathbf{e}}_1\hat{\mathbf{e}}_1^T, \qquad \mathbf{T}_2 = \hat{\mathbf{e}}_1\hat{\mathbf{e}}_1^T + \hat{\mathbf{e}}_2\hat{\mathbf{e}}_2^T, \qquad \mathbf{T}_3 = \hat{\mathbf{e}}_1\hat{\mathbf{e}}_1^T + \hat{\mathbf{e}}_2\hat{\mathbf{e}}_2^T + \hat{\mathbf{e}}_3\hat{\mathbf{e}}_3^T = \mathbf{I}$$

we create a rank 2 tensor $\tilde{\mathbf{T}}$ by subtracting $\lambda_3\mathbf{I}$. This new tensor has the same eigenvectors as the original tensor, but eigenvalues $\tilde{\lambda}_1 = \lambda_1 - \lambda_3$, $\tilde{\lambda}_2 = \lambda_2 - \lambda_3$ and $\tilde{\lambda}_3 = 0$. Now that the tensor is of (at most) rank 2, there are a couple of interesting relations that we can use to find the temporal components of the eigenvectors needed to compute the local speed, namely

$$\mathrm{Tr}\,\tilde{\mathbf{T}}_{xy} = \tilde{\lambda}_1 (1 - e_{13}^2) \qquad \text{(numerical rank 1 case)}$$
$$\det \tilde{\mathbf{T}}_{xy} = \tilde{\lambda}_1 \tilde{\lambda}_2\, e_{33}^2 \qquad \text{(numerical rank 2 case)}$$

where $\tilde{\mathbf{T}}_{xy}$ refers to the leading $2 \times 2$ submatrix of $\tilde{\mathbf{T}}$, and $\mathrm{Tr}$ refers to the trace (sum of diagonal elements) of a matrix. See Appendix A.1.3 for a proof of these elementary but perhaps somewhat unobvious relations.
An outline of the algorithm is given in Figure 2.8.
Figure 2.8: Illustration of the multi-scale tensor field estimation procedure: input signal → low-pass pyramid → separate resolution channels → spatio-temporal convolution and tensor construction → anisotropy and energy thresholding with coarse-to-fine data validation → consistent sparse tensor field.
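A sketch of the speed computation from the rank-reduced tensor, following the relations above; the eigenvalues are obtained here with a library call rather than the characteristic cubic, and the rank-decision threshold is an illustrative assumption.

import numpy as np

def local_speed(T, rank1_ratio=10.0):
    # T: 3x3 local structure tensor. Returns speed [pixels/frame].
    lam = np.sort(np.linalg.eigvalsh(T))[::-1]   # lam1 >= lam2 >= lam3
    Tt = T - lam[2] * np.eye(3)                  # rank-2 tensor T-tilde
    l1, l2 = lam[0] - lam[2], lam[1] - lam[2]
    if l1 <= 0:                                  # isotropic: rejected earlier
        return 0.0
    Txy = Tt[:2, :2]                             # leading 2x2 submatrix
    if l1 > rank1_ratio * max(l2, 1e-12):        # moving edge/line case
        e13_sq = np.clip(1.0 - np.trace(Txy) / l1, 0.0, 1.0 - 1e-9)
        return float(np.sqrt(e13_sq / (1.0 - e13_sq)))
    e33_sq = np.clip(np.linalg.det(Txy) / (l1 * l2), 1e-12, 1.0)
    return float(np.sqrt((1.0 - e33_sq) / e33_sq))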
2.3 Experimental results
To demonstrate the method we apply it to a class of sequences, Figure 2.9, representing a typical tracking situation—a small textured surface slowly translating in front of
a comparatively rapidly translating background. Each frame is supplemented by a corresponding vector image representing the true displacement field.

Figure 2.9: Test sequence. Every tenth frame is shown.

The investigation focuses on the compatibility between the multi-scale tensor field and the true displacement
field. This is done by looking at the deviation $\Delta\phi$ from the anticipated 90° angle between
the local spatiotemporal displacement vector and the eigenvector(s) corresponding to the
non-zero eigenvalue(s). The appropriate true displacement v̂ at a coarse scale is determined by computing the average of the vectors at the four positions of the finer scale that
correspond to each position at the coarse scale, followed by a rotation to compensate for
level   µ       lower limit   upper limit
1       0.63    0.0           1.5
2       0.625   1.0           3.5
3       0.62    3.0           (none)

Table 2.2: Suitable threshold values (speed limits in pixels/frame).
the spatial scale change. Starting at the coarsest scale, at each position we compute a weighted average

$$\langle \Delta\phi \rangle = \frac{\tilde{\lambda}_1\, \Delta\phi_1 + \tilde{\lambda}_2\, \Delta\phi_2}{\tilde{\lambda}_1 + \tilde{\lambda}_2}$$
where ∆φ1 and ∆φ2 are the deviations of the first and second eigenvectors respectively.
The average deviation is then converted into an angle at the next, finer scale⁴, and,
together with the weight λ̃1 + λ̃2 , propagated to the corresponding four positions at this
scale. A new average deviation angle is computed from the coarse-scale value and the
new fine-scale local estimate. Eventually we arrive at the finest scale with an average
deviation angle at each position that takes into account estimates at all scales.
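A sketch of this propagation step for a single position, with deviation angles in radians; the scale conversion is the fine-scale angle formula of footnote 4 below (cf. Figure 2.6), and the function names are illustrative.

import numpy as np

def to_finer_scale(dphi, v):
    # Convert a deviation angle observed at speed v into the
    # corresponding angle at the next finer scale (cf. Figure 2.6).
    return np.arctan(2 * v) - np.arctan(2 * np.tan(np.arctan(v) - dphi))

def merged_deviation(dphi_coarse, w_coarse, dphi_fine, w_fine):
    # Weighted average of the propagated coarse-scale deviation and the
    # local fine-scale estimate; weights are lambda~1 + lambda~2.
    return ((w_coarse * dphi_coarse + w_fine * dphi_fine)
            / (w_coarse + w_fine))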
Results for two test sequences are shown in Tables 2.3 and 2.4. In the first sequence the background moves at a speed of $4\sqrt{2} \approx 5.7$ pixels/frame, in the second at $8\sqrt{2} \approx 11.3$ pixels/frame. In both cases the foreground texture moves at $\sqrt{2} \approx 1.4$ pixels/frame. The 'single scale' column gives the result when each scale is used alone, covering the entire speed range. The 'multi-scale w. inhib.' column refers to the proposed method with multiple scales, with each scale covering a certain speed interval.
Following the reasoning in the preceding section, we have a number of parameters to
set—the anisotropy threshold µ, the energy threshold η, and (in the multi-scale case) the
upper and lower speed limits for each scale. In the experiments we required a density
greater than 20 %, i.e., at least every fifth position should have a valid tensor. With this
constraint we find that the values of Table 2.2 give accurate results when three levels
are used. From Tables 2.3 and 2.4 we see that, as expected, the average error is quite
large when only the finest or the coarsest scale is used. The multi-scale method on the
other hand, generates quite accurate estimates for both low and high speed regions. In
Figure 2.10 an example of the instantaneous angular error as a function of image position
is shown.
⁴ If v denotes the speed, the fine-scale angle is given by (cf. Figure 2.6) $\Delta\phi_{fine} = \arctan(2v) - \arctan\big(2\tan(\arctan(v) - \Delta\phi)\big)$.
Figure 2.10: Distribution of errors using 1, 2, and 3 levels in the pyramid.
level(s)   single scale               multi-scale w. inhib.
           background   foreground    background   foreground
1          11.18        1.73
2          2.56         1.96          2.84         1.47
3          1.23         5.76          1.67         1.56

Table 2.3: Average apparent angular error at finest scale in degrees. Background velocity $v_b = (-4, 4)$, foreground velocity $v_f = (1, 1)$. Density 20%.
level(s)   single scale               multi-scale w. inhib.
           background   foreground    background   foreground
1          10.63        1.91
2          4.60         1.99          4.78         1.17
3          0.86         4.84          1.68         1.57

Table 2.4: Average apparent angular error at finest scale in degrees. Background velocity $v_b = (-8, 8)$, foreground velocity $v_f = (1, 1)$. Density 20%.
Two sequences containing a single translating texture (the background of the previous sequences) were generated. In the first sequence the texture moves at a speed of $\sqrt{2} \approx 1.4$ pixels/frame, in the second at $4\sqrt{2} \approx 5.7$ pixels/frame. Figure 2.11 shows histograms of the apparent angular errors. The average errors are 1.26° and 0.54° respectively. This shows that the present method is capable of producing results which are comparable with any existing motion estimation algorithm.
Finally we provide results for three well-known test sequences used in the investigation by Barron et al. [13]: Fleet's tree sequences and Quam's Yosemite sequence⁵. The tree sequences, Figure 2.12, were generated by translating a camera with respect to a textured planar surface. In the 'Translating tree' sequence the camera moves perpendicular to the line of sight, generating velocities between 1.7 and 2.3 pixels/frame (the surface is tilted). In the 'Diverging tree' sequence the camera moves towards the (tilted) surface, generating velocities between 0.0 and 2.0 pixels/frame. Note that since the velocity range of the tree sequences is quite narrow, they are not very well suited for studying the performance of multiresolution algorithms. The Yosemite sequence, Figure 2.13, is
a computer animation of a flight through Yosemite valley. It was generated by mapping
aerial photographs onto a digital terrain map. Motion is predominantly divergent with
large local variations due to occlusion and variations in depth. Velocities range from 0
to 5 pixels/frame. The clouds in the background move rightward while changing their
shape over time, which makes accurate estimation of their velocity difficult. In Fig5 These and other sequences and their corresponding true flows are available via anonymous ftp from the
University of Western Ontario at ftp.csd.uwo.ca/pub/vision/TESTDATA.
Figure 2.11: Histograms of apparent angular error distributions (relative bin counts vs. angular error [deg]). Left: Velocity v = (1.0, 1.0). Right: v = (4.0, 4.0). Density 20%. Each estimate was given a weight equal to its confidence value.
In Figure 2.13 (right), note the larger errors caused by motion discontinuities at the mountain
ridges. Numerical results for the three sequences are given in Table 2.5. Figures 2.14
and 2.15 show histograms of the angular errors.
Sequence           Average error   Standard deviation   Density
translating tree   0.30°           0.57°                66%
diverging tree     1.72°           1.64°                47%
Yosemite           2.41°           5.36°                48%

Table 2.5: Performance for the three test sequences.
2.4 Computation of dense optical flow
In the present study we found no reason to compute a dense optical flow, since this is a
computationally expensive and notoriously ill-posed problem. However, there are situations when a dense field is required, e.g., in region-based velocity segmentation. At least
two principally different approaches to computation of dense fields exist. One is to fit
parametric models of motion in local image patches, the models ranging from constant
motion via affine transformations to mixture models of multiple motions. The fitting is
done by clustering, e.g., [85, 57], or [least-squares] optimisation. In Chapter 4 we use
Figure 2.12: Translating tree. Left: Single frame from the sequence. Right: Angular
error magnitude.
Figure 2.13: Yosemite sequence. Left: Single frame from the sequence. Right: Angular
error magnitude.
Figure 2.14: Tree sequences. Histograms of angular errors (relative bin count vs. angular error [deg]). Left: translating tree. Right: diverging tree. Each estimate was given a weight equal to its confidence value.
Figure 2.15: Yosemite sequence. Histogram of angular errors (relative bin count vs. angular error [deg]). Each estimate was given a weight equal to its confidence value.
simple parametric model fitting to segment an image into object and background. The
second class of methods is based on regularisation theory, e.g., [91, 75], the basic idea of
which is to stabilise solutions of ill-posed problems by imposing extra constraints, e.g.,
spatial smoothness, on the solutions. The problem of computing a dense optical flow is
ill-posed since we need two constraints at each position to be able to determine the two
velocity components, but we have only one at our disposal at edges or lines, namely the
direction of maximum signal variation. We will now discuss the regularisation approach
and its relation to the work presented in this chapter.
One way to proceed is to define a cost functional E = E1 + αE2 where the first term
is a global measure of the inconsistency between the observed tensor field and the optical flow model, and the second term, E2 , implements the regularisation by a motion
variation penalty. α is a positive number that determines the relative importance of the
two terms. We are looking for the optical flow field that minimises E . Any measure
of inconsistency between a tensor and a velocity vector must be based on the fact that
the direction of maximum signal variation, coinciding with the largest eigenvector, is
orthogonal to the spatiotemporal direction of motion. Whenever the second eigenvalue
is significantly greater than the third, smallest one, its corresponding eigenvector is also
orthogonal to the direction of motion. A possible inconsistency term is consequently
given by
Z
Z
2
2
E1 =
[ (λ1 , λ3 )(ê1 ũ) + (λ2 , λ3 )(ê2 ũ) ] dxdy =
ũT T̃ ũ dxdy
image
image
where ũ = (u; v; 1)T is the spatiotemporal displacement vector, and T̃ the ’rank-reduced’
tensor introduced in Section 2.2. In the most simple case the regularisation term is an
integral [50]
Z
∂u 2
∂u 2
∂v 2
∂v 2
E2 =
[(
) +(
) +(
) +(
) ] dxdy
∂y
∂x
∂y
image ∂x
which clearly penalises spatial variation in the motion field—there exist several modifications to avoid unwanted smoothing over velocity discontinuities, e.g., [77, 110]. The
minimum of the cost functional is found from the corresponding Euler-Lagrange partial
differential equation, e.g., [41], which in this particular case is given by
t̃T1 ũ , α∆u = 0
t̃T2 ũ , α∆v = 0
where t̃i denotes the ith column vector of T̃, and ∆ is the Laplacian operator. To
efficiently solve this equation one may use some instance of the multi-grid method,
[23, 40, 89, 32]. Multi-grid methods take advantage of low-pass pyramid representations
in the following way. The equation is converted into a finite difference approximation
which is then solved iteratively, e.g., by the Gauss-Seidel or Jacobi methods [42]. When
analysing such iterative methods it is found that high-frequency errors are eliminated
much more quickly than errors of longer wavelength. These low-frequency errors are
however shifted to higher frequencies at a coarser scale where they consequently may
be efficiently eliminated. A solution at a coarse scale can, however, not capture fine
details and consequently there are high-frequency errors. The idea is then to produce an
approximate solution at one level and then use this as an initial value for the iteration at
another level where errors in a different frequency band may be efficiently eliminated.
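For illustration, one relaxation sweep on a single grid level of the Euler-Lagrange equations above might look as follows; the five-point Laplacian discretisation, periodic boundary handling and α value are standard textbook choices, not taken from the thesis.

import numpy as np

def jacobi_flow_sweep(u, v, Tt, alpha=100.0):
    # One Jacobi sweep for the tensor-based Euler-Lagrange equations.
    # u, v: flow fields (H, W); Tt: rank-reduced tensors (H, W, 3, 3).
    def nmean(a):   # 4-neighbour average, periodic boundaries for brevity
        return 0.25 * (np.roll(a, 1, 0) + np.roll(a, -1, 0)
                       + np.roll(a, 1, 1) + np.roll(a, -1, 1))
    ub, vb = nmean(u), nmean(v)
    a4 = 4.0 * alpha               # from the 5-point Laplacian stencil
    # per-pixel 2x2 system: (T00 + 4a) u + T01 v = 4a ub - T02, etc.
    A00 = Tt[..., 0, 0] + a4
    A11 = Tt[..., 1, 1] + a4
    A01, A10 = Tt[..., 0, 1], Tt[..., 1, 0]
    b0 = a4 * ub - Tt[..., 0, 2]
    b1 = a4 * vb - Tt[..., 1, 2]
    det = A00 * A11 - A01 * A10
    return (A11 * b0 - A01 * b1) / det, (A00 * b1 - A10 * b0) / det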
A particularly interesting scheme in the context of multi-level tensor fields is the adaptive
multi-scale coarse-to-fine scheme of Battiti, Amaldi, and Koch, [14]⁶, which may be regarded as a refinement of an old method called block relaxation, where one starts by computing an approximation to the optical flow at the coarsest scale, uses an interpolated version of this as an initial value for the iteration at the next finer scale, and so on. The
adaptive method is based on an error analysis that gives the expected relative velocity
error as a function of velocity and scale when using a certain derivative approximation
to estimate the brightness gradient. The authors use this to detect those points whose
velocity estimates can not be improved at a finer scale, i.e., points where the coarse
scale is optimal. The motion vectors of the corresponding points at the finer scale are
then simply set to the interpolated values from the coarser scale and do not participate
in the iteration at this level. In the light of the discussion in the preceding sections, it
appears straightforward to formulate a tensor field version of this method.
⁶ See [99] for a discussion of the biological aspects of this and other coarse-to-fine strategies for motion computation.
3 A SIGNAL CERTAINTY APPROACH TO SPATIOTEMPORAL FILTERING AND MODELLING
The idea of accompanying a feature estimate with a measure of its reliability has been
central to much of the work at the Computer Vision Laboratory. In recent years a formal theory for this has been developed, primarily by Knutsson and Westin, [72, 69, 68,
70, 103, 106, 105]. Applications range from interpolation (see the above references)
via frequency estimation [71] to phase-based stereopsis and focus-of-attention control
[101, 107]. In this chapter a review of the principles of normalized convolution (NC)
is presented, spiced with some new results. This is followed by a description and experimental test of a computationally efficient implementation of NC for spatiotemporal
filtering and modelling using quadrature filters and local structure tensors. The experiments show that NC as implemented here gives a significant reduction of distortion
caused by incomplete data. The effect is particularly evident when data is noisy.
3.1 Background
Normalized convolution (NC) has its origin in a 1986 patent [67] describing a method
to enhance the degree of discrimination of filtering by making operators insensitive to
irrelevant signal variation. Let b denote a vector-valued filter and f a corresponding
vector-valued signal (e.g., a representation of local orientation). Suppose that the filter is designed to detect a certain pattern of vector direction variation irrespective of
signal magnitude variation. A simple way of eliminating interference from magnitude
variations would be to normalize the signal vectors. This is unfortunately not a very
good idea. The magnitude of the signal contains information about how well the local neighbourhood is described by the signal vector model. This information is lost in
the normalization. A special case is when the signal magnitude is zero – then a default model has to be imposed on the local neighbourhood. The patented procedure that
provides a solution to the problem is outlined in [43]:
“The method is based on a combination of a set of convolutions. The
following four filter results are needed:
$$s_1 = \langle \mathbf{b}, \mathbf{f} \rangle \qquad (3.1)$$
$$s_2 = \langle \mathbf{b}, \|\mathbf{f}\| \rangle \qquad (3.2)$$
$$s_3 = \langle \|\mathbf{b}\|, \mathbf{f} \rangle \qquad (3.3)$$
$$s_4 = \langle \|\mathbf{b}\|, \|\mathbf{f}\| \rangle \qquad (3.4)$$
where ‖·‖ denotes the magnitude of the filter or the data. The output at
each position is written as an inner product ⟨·, ·⟩ between the filter and the
signal centered around the position. [Formally this actually corresponds to
a correlation between signal and filter but the use of the term convolution
has stuck – the difference between the terms is just that in the convolution
case the filter is reflected in its origin before the inner product is taken.]
The first term, s₁, corresponds to standard convolution between the filter and the data. The fourth term, s₄, can be regarded as the local signal
energy under the filter. As to the second and third terms, the interpretation
is somewhat harder. The filter is weighted locally with the corresponding data,
producing a weighted average operator, where the weights are given by the
data, giving a “data dependent mean operator”. For the third term it is vice
versa: the mean data is calculated using the operator certainty as weights,
producing “operator dependent mean data”.
The four filter results are combined into

$$s = \frac{s_4 s_1 - s_2 s_3}{s_4^{\gamma}} \qquad (3.5)$$

where γ is a constant controlling the model selectivity. This value is typically set to one, γ = 1. The numerator in Equation (3.5) can consequently
be interpreted as a standard convolution weighted with the local “energy”
minus the “mean” operator acting on the “mean” data. The denominator is
an energy normalization controlling the model versus energy dependence
of the algorithm.”
The procedure is referred to as a consistency operation, since the result is that the operators are made sensitive only to signals consistent with an imposed model. It was not until
quite recently [106] that it was realised that the above method actually is a special case
of a general method to deal with uncertain or incomplete data, normalized convolution
(NC).
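As a concrete illustration, the consistency operation of Equations (3.1)–(3.5) may be sketched for one-dimensional signals as follows; the helper names and the small eps guard are our own, and the inner products are realised as correlations, in line with the remark in the quotation:

```python
import numpy as np
from scipy.signal import fftconvolve

def consistency_op(f, b, gamma=1.0, eps=1e-12):
    # Inner products <.,.> realised as correlations: convolve with the
    # reversed (and conjugated) kernel. f, b are 1D complex arrays.
    corr = lambda x, k: fftconvolve(x, np.conj(k[::-1]), mode="same")
    s1 = corr(f, b)                    # <b, f>      (3.1)
    s2 = corr(np.abs(f), b)            # <b, |f|>    (3.2)
    s3 = corr(f, np.abs(b))            # <|b|, f>    (3.3)
    s4 = corr(np.abs(f), np.abs(b))    # <|b|, |f|>  (3.4)
    return (s4 * s1 - s2 * s3) / (s4 ** gamma + eps)  # (3.5)
```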
3.2 Normalized and differential convolution
Suppose that we have a whole filter set {bₖ} to operate on our signal f. It is
possible to regard the filters as constituting a basis in a linear space B = span{bₖ},
and the signal may locally be expanded in this basis. In general, the signal cannot be
reconstructed from this expansion since it typically belongs to a space of much higher
dimension than B. A Fourier expansion is an example of a completely recoverable
expansion, which is further simplified by the basis functions (filters) being orthogonal.
There are infinitely many ways of choosing the coefficients in the expansion when the
filters span only a subspace of the signal space. A natural and mathematically tractable
choice, however, is to minimise the orthogonal distance between the signal and its projection on B. This is equivalent to the linear least–squares (LLS) method.
Let us formulate the LLS method mathematically. Choose a basis for the signal space
and expand the filters and the signal in this basis. Assume that the dimension of the
signal space is N, and that we have M filters at our disposal. The coordinates of the filter
set may be represented by an N × M matrix B and those of a scalar signal f by an N × 1
matrix F. The expansion coefficients may be written as an M × 1 matrix F̃. All in all we
have

$$\mathbf{B} = \begin{bmatrix} | & | & & | \\ \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_M \\ | & | & & | \end{bmatrix} \qquad \mathbf{F} = \begin{bmatrix} f_1 \\ \vdots \\ f_N \end{bmatrix} \qquad \tilde{\mathbf{F}} = \begin{bmatrix} \tilde{f}_1 \\ \vdots \\ \tilde{f}_M \end{bmatrix}$$

We assume that N ≥ M.
The LLS problem then consists in minimising

$$E = \|\mathbf{B}\tilde{\mathbf{F}} - \mathbf{F}\|^2 \qquad (3.6)$$
One often has a notion of reliability of measurement – as was indicated above, the direction of vector-valued signals may carry a representation of a detected feature, whereas
the magnitude indicates the reliability of that statement. When a measurement is unreliable, there is no point in minimising the projection distance for the corresponding
element. On the other hand, one wants small distortion for the reliable components.
This leads to the weighted linear least–squares (WLLS) method, with an objective function

$$E_W = \|\mathbf{W}(\mathbf{B}\tilde{\mathbf{F}} - \mathbf{F})\|^2 \qquad (3.7)$$

where W = diag(w) = diag(w₁, …, w_N) is an N × N diagonal matrix with the reliability
weights. Letting A* denote complex conjugation and transposition of a matrix A, one
finds

$$E_W = (\mathbf{B}\tilde{\mathbf{F}} - \mathbf{F})^* \mathbf{W}^T \mathbf{W} (\mathbf{B}\tilde{\mathbf{F}} - \mathbf{F}) = \tilde{\mathbf{F}}^* \mathbf{B}^* \mathbf{W}^2 \mathbf{B}\, \tilde{\mathbf{F}} - 2\, \tilde{\mathbf{F}}^* \mathbf{B}^* \mathbf{W}^2 \mathbf{F} + \mathbf{F}^* \mathbf{W}^2 \mathbf{F} = \tilde{\mathbf{F}}^* \mathbf{G}\, \tilde{\mathbf{F}} - 2\, \tilde{\mathbf{F}}^* \mathbf{x} + c$$
Since G is positive definite we may use a theorem that states that any F̃ that minimises
E_W also satisfies the linear equation

$$\mathbf{G}\tilde{\mathbf{F}} = \mathbf{x}$$

Consequently

$$\tilde{\mathbf{F}} = \mathbf{G}^{-1}\mathbf{x} = (\mathbf{B}^* \mathbf{W}^2 \mathbf{B})^{-1} \mathbf{B}^* \mathbf{W}^2 \mathbf{F} \qquad (3.8)$$
Introducing the notation a b = diag(a) b for element-wise vector multiplication, we may
rewrite this in the language of convolutions:

$$\tilde{\mathbf{F}} = \begin{bmatrix} \langle w^2 b_1, b_1 \rangle & \cdots & \langle w^2 b_1, b_M \rangle \\ \vdots & & \vdots \\ \langle w^2 b_M, b_1 \rangle & \cdots & \langle w^2 b_M, b_M \rangle \end{bmatrix}^{-1} \begin{bmatrix} \langle w^2 b_1, f \rangle \\ \vdots \\ \langle w^2 b_M, f \rangle \end{bmatrix} \qquad (3.9)$$
It turns out to be profitable to decompose the weight matrix into a product W² = A C
of a matrix C = diag(c) containing the data reliability and a second diagonal matrix
A = diag(a) with another set of weights for the filter coefficients, corresponding to a
classical windowing function. Consequently a realised filter is regarded as a product a b
of the windowing function a and the basis function b. In this perspective it is clear that
Equation (3.9) is unsatisfactory, but fortunately a little algebra does the job:

$$\langle w^2 b_i, b_j \rangle = \langle a\, c\, b_i, b_j \rangle = \langle a\, b_i, c\, b_j \rangle = \langle a\, b_i b_j^*, c \rangle$$

and

$$\langle w^2 b_i, f \rangle = \langle a\, c\, b_i, f \rangle = \langle a\, b_i, c\, f \rangle$$
We arrive at a final expression for normalized convolution:

$$\tilde{\mathbf{F}} = \begin{bmatrix} \langle a\, b_1 b_1^*, c \rangle & \cdots & \langle a\, b_1 b_M^*, c \rangle \\ \vdots & & \vdots \\ \langle a\, b_M b_1^*, c \rangle & \cdots & \langle a\, b_M b_M^*, c \rangle \end{bmatrix}^{-1} \begin{bmatrix} \langle a\, b_1, c\, f \rangle \\ \vdots \\ \langle a\, b_M, c\, f \rangle \end{bmatrix} \qquad (3.10)$$

Note that we must design M(M + 1)/2 new outer product filters a bᵢ bⱼ* in addition to
the original M filters a bᵢ. The same expression as the above may be derived from tensor
algebra, regarding the process as a change of coordinate system in a linear space with a
metric whose coordinates are a c in the original basis. One then finds that G contains the
coordinates of the metric expressed in the {bₖ} basis, whereas G⁻¹ is the metric coordinates expressed in the basis dual to {bₖ}. The dual basis consists of the operators {bᵏ}
for which ⟨bᵢ, bʲ⟩ = δᵢⱼ. From the tensor algebra perspective, Equation (3.10) describes
a coordinate transformation of the signal from {bₖ} to {bᵏ}. To be able to compare
the output of the normalized convolution with a standard convolution it is necessary to
transform back to the dual basis. This is achieved by operating with the metric on F̃, but
since the varying reliability of data has now been compensated for, the transformation is
achieved by using a matrix
$$\mathbf{G}_0 = \begin{bmatrix} \langle a\, b_1 b_1^*, 1 \rangle & \cdots & \langle a\, b_1 b_M^*, 1 \rangle \\ \vdots & & \vdots \\ \langle a\, b_M b_1^*, 1 \rangle & \cdots & \langle a\, b_M b_M^*, 1 \rangle \end{bmatrix} \qquad (3.11)$$

where 1 = (1, 1, …, 1)ᵀ. This matrix may be precomputed since it is independent of the
data weights.
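To make Equation (3.10) operational, each metric entry ⟨a bᵢ bⱼ*, c⟩ and each right-hand side element ⟨a bᵢ, c f⟩ may be computed as a filtering of the certainty c or of the product c f, followed by a pointwise solution of the small M × M system. The following Python sketch illustrates this; the function name and the eps regularisation are our own assumptions, not the Laboratory's implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def normalized_convolution(f, c, a, basis, eps=1e-10):
    # f: real signal; c: certainty (same shape as f); a: applicability
    # window; basis: list of M basis function arrays b_k (same shape
    # as a). Returns the expansion coefficients F~ of Eq. (3.10) at
    # every position, shape f.shape + (M,).
    M = len(basis)
    corr = lambda x, k: fftconvolve(x, np.conj(np.flip(k)), mode="same")
    G = np.empty(f.shape + (M, M), dtype=complex)
    for i in range(M):
        for j in range(M):
            # metric entry <a b_i b_j*, c> as a filtering of c
            G[..., i, j] = corr(c, a * basis[i] * np.conj(basis[j]))
    # right-hand side <a b_i, c f> as a filtering of the product c f
    rhs = np.stack([corr(c * f, a * basis[i]) for i in range(M)], axis=-1)
    G += eps * np.eye(M)               # guard against singular metrics
    return np.linalg.solve(G, rhs[..., None])[..., 0]
```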
Of course, the resultant output from the NC computation must be accompanied by a
corresponding certainty function. There are three factors that influence this function:
the input certainty, the numerical stability of the NC algorithm, and independent certainty estimates pertaining to new feature representations constructed from the NC filter
responses.¹ The numerical stability of the algorithm has to do with the nearness to singularity² of the metric G. This is quantified by the condition number [42]

$$\kappa(\mathbf{G}) = \|\mathbf{G}\|\,\|\mathbf{G}^{-1}\|$$

It is easily verified that 1 ≤ κ(G) ≤ ∞ for p-norms and the Frobenius norm. From this,
we arrive at a possible certainty measure that takes into account the magnitude of the
input certainty function:

$$c(\mathbf{G}) = \frac{1}{\|\mathbf{G}_0\|\,\|\mathbf{G}^{-1}\|} \qquad (3.12)$$
The total output certainty function is then a product of c(G) and the feature representation certainty.
To summarise, what have we achieved by this effort? From the original signal we have
at each of the original positions arrived at an expansion of the signal in the basis {bₖ}
within a window described by a, the applicability function, given a data weighting function c, the certainty function. In this way we have imposed a model onto the original
signal, and by transforming back to the dual basis using G₀, the result is as if we had
applied standard filtering on this model.

¹ For instance, an estimate of dominant orientation may be accompanied by a certainty function c = (λ₁ − λ₂)/λ₁.
² We will here not discuss the possible use of generalised inverses, such as the Moore–Penrose pseudoinverse, but see, e.g., [79, 82].
What is the relation between normalized convolution and the consistency operation described earlier? Suppose that the linear space B spanned by {bₖ} is divided into two
subspaces S and L that are orthogonal in the standard Euclidean metric. This orthogonality relation may no longer apply under the data reliability dependent metric G. The
result is that, to filter a signal using the NC approach, one may have to expand the signal
in more basis functions than those which are of primary interest, the reason being that
the new metric induces correlations between previously orthogonal basis functions. If
not enough basis functions are used, the signal model may become inaccurate and the
resultant filtering output misleading. Suppose that we are interested in filtering the signal with the basis functions in S, but want to include the ones in L to model the signal.
It is then possible to reduce the size of the metric matrix to be inverted in normalized
convolution from the original M × M to Dim(S) × Dim(S). This is described in detail
in [103]. Just as B, the basis coordinate matrix may be decomposed, B = (S L). When
Dim(L) = 1 we obtain the following expression for the projection of f on S:

$$\tilde{\mathbf{F}}_S = \left( \mathbf{L}^*\mathbf{A}\mathbf{C}\mathbf{L}\; \mathbf{S}^*\mathbf{A}\mathbf{C}\mathbf{S} - \mathbf{S}^*\mathbf{A}\mathbf{C}\mathbf{L}\; \mathbf{L}^*\mathbf{A}\mathbf{C}\mathbf{S} \right)^{-1} \left( \mathbf{L}^*\mathbf{A}\mathbf{C}\mathbf{L}\; \mathbf{S}^*\mathbf{A} - \mathbf{S}^*\mathbf{A}\mathbf{C}\mathbf{L}\; \mathbf{L}^*\mathbf{A} \right) \mathbf{C}\mathbf{F} \qquad (3.13)$$

For the important case L = (1, 1, …, 1)ᵀ (DC component), this may in terms of convolutions be written

$$\tilde{\mathbf{F}}_S = \mathbf{G}_S^{-1} \begin{bmatrix} \langle a, c \rangle \langle a\, b_1, c\, f \rangle - \langle a\, b_1, c \rangle \langle a, c\, f \rangle \\ \vdots \\ \langle a, c \rangle \langle a\, b_{\mathrm{Dim}(S)}, c\, f \rangle - \langle a\, b_{\mathrm{Dim}(S)}, c \rangle \langle a, c\, f \rangle \end{bmatrix} \qquad (3.14)$$

with

$$\mathbf{G}_S(i, j) = \langle a, c \rangle \langle a\, b_i b_j^*, c \rangle - \langle a\, b_i, c \rangle \langle a\, b_j^*, c \rangle \qquad (3.14')$$
This is referred to as normalized differential convolution (NDC). If we have a single filter
a b, insensitive to constant signals, and with the property that b b* = (1, 1, …, 1)ᵀ, we
obtain

$$f_S = \frac{\langle a, c \rangle \langle a\, b, c\, f \rangle - \langle a\, b, c \rangle \langle a, c\, f \rangle}{\langle a, c \rangle^2 - |\langle a\, b, c \rangle|^2} \qquad (3.15)$$

This is almost Equation (3.5) with γ = 2.
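A one-dimensional sketch of Equation (3.15), with a the real applicability and b the unit-magnitude complex basis function, might read as follows (the names and the eps guard are assumptions):

```python
import numpy as np
from scipy.signal import fftconvolve

def ndc_single_filter(f, c, a, b, eps=1e-12):
    # Eq. (3.15): correlations realised by flipping (and conjugating)
    # the kernels; a is real, b is complex with |b| = 1.
    corr = lambda x, k: fftconvolve(x, np.conj(np.flip(k)), mode="same")
    ac   = corr(c, a)            # <a, c>
    abc  = corr(c, a * b)        # <a b, c>
    acf  = corr(c * f, a)        # <a, c f>
    abcf = corr(c * f, a * b)    # <a b, c f>
    return (ac * abcf - abc * acf) / (ac ** 2 - np.abs(abc) ** 2 + eps)
```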
3.3 NDC for spatiotemporal filtering and modelling
In this section we will apply NDC in its most simple form, Equation (3.15), to the problem of estimating and modelling a local spatiotemporal neighbourhood. The model we
use is the local structure tensor reviewed in Section 1.4 and already extensively used in
this thesis. Recall that it is estimated as a linear combination of six basis tensors with
coefficients equal to the magnitude of six corresponding quadrature filter responses. The
question now arises where and how to apply normalized convolution. One possibility
is to use the six quadrature filters plus a low-pass filter (or more accurately, their corresponding basis functions, yet to be defined) as our signal model space. However, this
is for efficiency reasons not possible, since it would require 7 · 8/2 = 28 new complex
outer-product filters and inversion of a 6 × 6 complex matrix at each position. Instead we
have to resort to a much more modest approach, namely to perform the NC at the quadrature filter level. The quadrature filter consists of an even real and an odd imaginary part
that are reciprocal Hilbert transforms. With a supplementary constant basis function we
consequently have at our disposal three basis functions. Being complex, the quadrature
filter may be written h(x) = m(x) exp[iφ(x)] = m(x)(cos φ(x) + i sin φ(x)), where m(x)
is the magnitude and φ(x) the phase. It is natural to let m become the applicability,
and the real and imaginary parts of exp[iφ(x)] the two non-constant basis functions in
NC. With three basis functions we have to generate four new outer-product filters: m(x),
m(x) cos² φ(x), m(x) sin² φ(x) and m(x) cos φ(x) sin φ(x). Of course, the trigonometric
identity sin² φ + cos² φ = 1 makes it possible to dispose of one of the filters, so we are
left with three extra. Since signal and certainty data must be filtered separately, the grand
total is (5/2) · 2 = 5 times as many 3D-convolutions in NC as in the original algorithm.
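For illustration, the following is a minimal sketch of how a one-dimensional lognormal quadrature filter h(x) = m(x) exp[iφ(x)] may be constructed in the Fourier domain and split into an applicability function a = m and a unit-magnitude basis function b = exp[iφ]. The filter length, centre frequency u0 and bandwidth B are assumed example values; the filters actually used in the thesis are produced by the optimisation of Section 1.4.2.

```python
import numpy as np

def lognormal_quadrature_1d(n=15, u0=np.pi / 3, B=2.0):
    # One-sided lognormal frequency response -> complex (analytic)
    # spatial filter h; its magnitude serves as applicability a and
    # h/|h| as the unit-magnitude basis function exp(i phi).
    u = np.fft.fftfreq(n) * 2.0 * np.pi
    H = np.zeros(n)
    pos = u > 0
    H[pos] = np.exp(-(4.0 / (B**2 * np.log(2))) * np.log(u[pos] / u0)**2)
    h = np.fft.fftshift(np.fft.ifft(H))   # centre the spatial filter
    a = np.abs(h)                         # applicability = |h|
    b = h / np.maximum(a, 1e-12)          # basis function exp(i phi(x))
    return h, a, b
```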
An equivalent formulation uses two complex basis functions exp[±iφ(x)] plus a constant function, which also leads to three new (real) filters. Note that when a real-valued
signal is expanded in such a basis, the optimal coefficients of the two complex basis
functions will necessarily be equal except for a complex conjugation, so the number of
free parameters is the same as in the case of real basis functions.
It is actually possible to reduce the number of convolutions further by modelling the
signal with a single complex basis function b = exp[iφ(x)] plus a constant function.
This was suggested by Westelius [101] based on experimental results, but no theoretical
motivation was given. Since it may not be apparent why it should be useful to model a
real signal with a single complex basis function, a simple hand-waving argument will be
provided.
With two complex basis functions b and b̄ we get a metric (cf. Equation (3.14′))

$$\mathbf{G}_S = \begin{bmatrix} \langle a, c \rangle^2 - |\langle a\, b, c \rangle|^2 & \langle a, c \rangle \langle a\, b^2, c \rangle - \langle a\, b, c \rangle^2 \\ \mathrm{conjug.} & \langle a, c \rangle^2 - |\langle a\, b, c \rangle|^2 \end{bmatrix}$$
The off-diagonal filter a b² has a Fourier transform that essentially is a shifted version
of the transform of the quadrature filter, its centre frequency being approximately twice
that of the quadrature filter. Consequently it will be quite insensitive to low-frequency
components of the certainty function. The three filters a, a b and a b² actually cover the
Fourier space symmetrically—the first one being a lowpass filter, the second a bandpass
filter and the third a highpass filter. Now, the certainty function has one particular
characteristic: it is positive, which means that its DC-component is always the largest
Fourier coefficient. As a result, the diagonal term ⟨a, c⟩² tends to be larger than the others. In fact, a considerable variance in the certainty function combined with a low mean
is required for the off-diagonal elements to become significant. In our present work the
certainty function is a binary valued (0/1) spatial mask with zero certainty in regions
that we want to disregard. We want to be able to filter close to the certainty discontinuity and to interpolate results across narrow zero-certainty gaps. Figure 3.1 shows what
happens at a certainty edge. We used a realisation of the lognormal filter of Figure 1.4.
The conclusion is that the non-diagonal elements are insignificant for positions outside
the zero-certainty region.³ Keeping only the diagonal elements, it is then easily seen
that the NDC formula reduces to Equation (3.15).

³ An equivalent statement is that the basis functions 1, exp[iφ(x)] and exp[−iφ(x)] remain practically orthogonal in the new metric a c.
Figure 3.1: Left: Certainty function. Right, solid line: ⟨a, c⟩. Right, dashed: ⟨a b, c⟩. Right, dotted: ⟨a b², c⟩.
Having thus introduced the use of a single complex basis function, it is straightforward
to apply the NDC formula Equation (3.15), once we have generated the applicability
filter a = m(x). For efficiency reasons this should be implemented as a separable filter.
Since the applicability function is closely related to the quadrature filter we look for an
implementation that can take advantage of this relationship. This is important for two
reasons. Firstly, the design of the separable filters is done by optimisation (cf. Section 1.4.2) so generally there is a certain discrepancy between the resulting filter and the
ideal 3D function. Two completely independent optimisations, one for the applicability,
one for the quadrature filter, do not take into account the fact that the applicability function should be equal to the magnitude of the realised quadrature filter. Secondly, there is
a possibility of further reducing the number of operations. An important observation (cf.
Section 1.4.2) is that the 3D quadrature filter may be decomposed into two orthogonal
one-dimensional lowpass filters and a (predominantly) one-dimensional quadrature filter. It would be nice to be able to use the two lowpass filters in the applicability filtering,
just adding a third lowpass filter next to the 1D quadrature filter. However, there is one
problem–the coefficients of the applicability filter must be positive, which adds an extra
constraint to the optimisation of the lowpass filters. Although a detailed description of
the optimisation process is outside the scope of this presentation, a short digression may
be in place, [7]. The positivity constraint is implemented by adding a regularisation term
to the original cost function. The original cost is a measure of the distance between the
ideal filter function and the realised filter in the Fourier domain. The new term similarly represents the distance between a constant, slightly positive ideal function and the
realised filter, but now in the spatial domain. Let the total cost function be
$$E_{tot} = \alpha E_F + (1 - \alpha) E_{spat}, \qquad 0 \le \alpha \le 1$$
The adaptive factor α is decreased below one when the current realisation of the filter
contains negative coefficients, and increased towards one whenever the coefficients are
all positive. The spatial cost function includes an adaptive weighting function which
is increased wherever a coefficient is negative and decreased towards zero at spatial
positions where the filter coefficient is non-negative. The idea is that the spatial ideal
function should only influence the optimisation when (through α) and where (through
the spatial weighting function) the current coefficients are negative. The combined effect
of the (non-linear) adaptive parameters is to force the iterative optimisation process to
find the realisable filter with positive coefficients that is closest to the ideal filter function.
The resultant 3D quadrature filters are actually not significantly inferior to the ones
generated by the ’unconstrained’ optimisation described in Section 1.4.2.
Figure 3.2 shows the pattern of sequential filtering that produces the quadrature and
applicability filter responses, which may be compared with Figure 1.5. Note that we
need two such structures—one to filter the signal-certainty product, and one to filter
the certainty function itself. It is possible to generalise the scheme to more complex
signal models that require additional filters, such as the a b²-filter that turns up when
using two complex basis functions to model the signal. This filter is then of course also
decomposed into the two lowpass filters followed by a third complex filter. To be able
to use the NC result in the construction of the tensor we must transform back to the
dual coordinates using the G₀-matrix of Equation (3.11). Since the quadrature filter is
insensitive to constant signals this reduces to a multiplication by ⟨a, 1⟩, the sum of the
applicability filter coefficients. To minimise computations the filters are normalized so
that ⟨a, 1⟩ = 1.
Figure 3.2: 3D sequential quadrature/applicability filtering structure. (The image sequence is passed through a tree of one-dimensional lowpass filters, each branch terminated by a one-dimensional quadrature filter g and a corresponding applicability filter a, yielding the six quadrature responses f₁–f₆ and applicability responses a₁–a₆.)

From the NC quadrature filter responses and their output certainties, Equation (3.12), we
now proceed to construct the local structure tensor. The proper treatment of the certainty
function related to the tensor construction procedure requires some modification of the
presentation given in Section 1.4.1, Equations (1.4), (1.5) and (1.6), since a non-constant
certainty function can be interpreted as a change of basis tensors Nₖ and their duals Mₖ.
To take into account the filter response certainties cₖ, we set

$$\mathbf{T}_{est} = \sum_k |q_k|\, \mathbf{M}_k(\{c_k\}) = \Big[\sum_i c_i\, \mathbf{N}_i \mathbf{N}_i\Big]^{-1} \sum_k |q_k|\, c_k\, \mathbf{N}_k \qquad (3.16)$$

The numerical stability of the procedure depends on the nearness to singularity of the
6 × 6 matrix [∑ᵢ cᵢ Nᵢ Nᵢ]. We may consequently use its condition number to construct a certainty function for the tensor estimation, in exactly the same way as in Equation (3.12). The computational burden of inverting a 6 × 6 matrix at each spatial position
is considerable. To avoid unnecessary operations, the relative sizes of the filter certainties should be monitored so that precomputed matrices are used whenever the certainty
variation is sufficiently isotropic. In this way the truly hard work is generally limited to
a small fraction of the positions in each frame.
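The inversion in Equation (3.16) can be made concrete by representing the symmetric 3 × 3 tensors as 6-vectors that preserve the Frobenius inner product, so that [∑ᵢ cᵢ Nᵢ Nᵢ] becomes an ordinary 6 × 6 matrix. The sketch below is an illustrative pointwise version; the basis tensors N and the eps regularisation are assumed inputs, not the thesis implementation.

```python
import numpy as np

def sym_to_vec(T):
    # Symmetric 3x3 -> 6-vector preserving the Frobenius inner product.
    s = np.sqrt(2.0)
    return np.array([T[0, 0], T[1, 1], T[2, 2],
                     s * T[0, 1], s * T[0, 2], s * T[1, 2]])

def vec_to_sym(v):
    s = 1.0 / np.sqrt(2.0)
    return np.array([[v[0],     s * v[3], s * v[4]],
                     [s * v[3], v[1],     s * v[5]],
                     [s * v[4], s * v[5], v[2]]])

def tensor_from_certainties(q_mag, c, N, eps=1e-9):
    # q_mag, c: (6,) response magnitudes |q_k| and certainties c_k;
    # N: (6, 3, 3) basis tensors. Evaluates Eq. (3.16) at one position.
    n = np.array([sym_to_vec(Nk) for Nk in N])     # (6, 6), rows n_i
    G = (c[:, None] * n).T @ n                     # sum_i c_i n_i n_i^T
    rhs = (q_mag * c) @ n                          # sum_k |q_k| c_k n_k
    t = np.linalg.solve(G + eps * np.eye(6), rhs)
    return vec_to_sym(t)
```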
3.4 Experimental results
To determine the performance of the NDC implementation we used an onion-shaped
64-cube 3D volume, Figure 3.3 (left), for which the actual dominant 3D orientation
is known at each position. We did not investigate the performance at the filter level,
and, in particular, the phase behaviour was not accounted for. [Recall that only the
magnitude of the filter response is used when constructing the tensor.] However, these
properties have been thoroughly investigated by Westelius [101] for one-dimensional
quadrature filters. The orientation estimation test was done by multiplying each 2D
image with a certainty function set to zero at a number of rows in the central part and
to one everywhere else. The idea is that we want to disregard the values in the zero-certainty slice when computing the 3D orientation outside the slice. Of course, by setting
the signal value to zero we introduce an artifact that the standard convolution cannot
handle. The question is how well the NDC algorithm will do. We chose to look at the 3D
orientation half-way through the volume, at frame 32. The dominant 3D orientation at
each spatial position was estimated from the tensor and compared with the ’actual’ value.
An average angular error for each row was computed by averaging over the columns.
The test was repeated for three different levels of white noise, ∞ dB, 10 dB, and 0 dB,
Figures 3.3 – 3.15. As is seen, the NDC implementation quite successfully manages
to capture the correct orientation at the mask border whereas the standard algorithm
estimates an ’incorrect’ signal orientation several pixels outside the border due to the
artifact edge generated by the zero-mask. Also, note that the imposed signal model
appears to have a very beneficial effect on the performance for noisy signals.
3.5 Discussion
Despite several very successful applications, including the new ones in this thesis, we
feel that the full potential of the signal/certainty philosophy is yet to be exploited. A
consistent signal/certainty treatment should be an integrated part of the processing at
all levels of a robot vision system, including high-level segmentation, recognition, and
decision-making. Adaptive signal processing (’learning’) in combination with normalized convolution (or a similar concept) has the potential to provide very efficient high-level components. Some quite promising work in this direction has been carried out
[19].
Figure 3.3: Left: Slice 32 from the test sequence. Right: Orientation error in degrees
for each row averaged over all columns.
Figure 3.4: Output certainty functions. Left: Mask width 4 pixels. Right: Mask width
20 pixels.
Figure 3.5: Noise-free sequence. Mask width 4 pixels. Average orientation error in
degrees for each row. Left: Standard convolution. Right: Normalized convolution.
Figure 3.6: Noise-free sequence. Mask width 4 pixels. Errors inside mask deleted.
Average orientation error in degrees for each row. Left: Standard convolution. Right:
Normalized convolution.
Figure 3.7: Noise-free sequence. Mask width 20 pixels. Errors inside mask deleted.
Average orientation error in degrees for each row. Left: Standard convolution. Right:
Normalized convolution.
Figure 3.8: Left: Slice 32 from the test sequence corrupted with white noise at 10dB
SNR. Right: Orientation error in degrees for each row averaged over all columns.
Figure 3.9: 10dB SNR. Mask width 4 pixels. Average orientation error in degrees for
each row. Left: Standard convolution. Right: Normalized convolution.
Figure 3.10: 10dB SNR. Mask width 4 pixels. Errors inside mask deleted. Average
orientation error in degrees for each row. Left: Standard convolution. Right: Normalized
convolution.
Figure 3.11: 10dB SNR. Mask width 20 pixels. Errors inside mask deleted. Average
orientation error in degrees for each row. Left: Standard convolution. Right: Normalized
convolution.
Figure 3.12: Left: Slice 32 from the test sequence corrupted with white noise at 0dB
SNR. Right: Orientation error in degrees for each row averaged over all columns.
Figure 3.13: 0dB SNR. Mask width 4 pixels. Average orientation error in degrees for
each row. Left: Standard convolution. Right: Normalized convolution.
Figure 3.14: 0dB SNR. Mask width 4 pixels. Errors inside mask deleted. Average
orientation error in degrees for each row. Left: Standard convolution. Right: Normalized
convolution.
Figure 3.15: 0dB SNR. Mask width 20 pixels. Errors inside mask deleted. Average
orientation error in degrees for each row. Left: Standard convolution. Right: Normalized
convolution.
4 MOTION SEGMENTATION AND TRACKING
In this chapter we present a simple application within the realm of active vision, using
concepts developed in the previous chapters. Active vision¹ [4, 2, 8, 9, 10, 11, 3, 27, 30]
is a paradigm that regards vision as a process with a purpose, a closed loop involving the
observer and the environment with which it interacts. Control of sensing is stressed—
the current goal of the system dictates attention. Active vision also emphasizes the
importance of proprioception in facilitating and resolving ambiguities in the perception
of the environment. This involves knowledge of the observer’s own position, motion, as
well as accurate models of sensors and actuators.
We present computationally efficient algorithms for smooth pursuit tracking of a single
moving object. We use knowledge of pan and tilt joint velocities to compensate for
camera–induced motion when tracking with a pan-tilt rotating camera. In a second
algorithm we allow camera translation and assume the background motion field in the
vicinity of the target can be fitted to a single parameterised motion model.
Most computer vision tracking algorithms derive the motion information necessary for
the pursuit from observation of target position only, which unfortunately is very difficult to obtain with precision. The reason that observations of velocity are not used is
of course the computational cost of accurate measurement—most real-time algorithms
match pairs of consecutive images and then it is just not possible to obtain accurate and
stable velocity estimates. Advances in signal processing theory with designs of highly
efficient sequential and recursive filtering schemes for velocity estimation challenge the
old established ’quick–and–dirty’ approaches to motion tracking. In our work we use
the multi–resolution local structure tensor field method described in Chapter 2 to obtain
high-quality estimates of velocity which we then use actively in prediction and segmentation. The algorithms readily incorporate the signal certainty/modelling formalism
developed in the previous chapter.
¹ Approximate synonyms: Animate Vision, Attentive Vision, Behavioural Vision, Purposive Vision.
4.1 Space–variant sensing
A visual system, whether biological or artificial, that has the ambition to provide its
host with useful information in a general setting, must be able to process information at
multiple spatial and temporal resolutions. Discrimination of detailed shape and texture
requires very high spatial sensor resolution, whereas reliable estimation of high speeds
demands large receptive fields. To cover the broad spectrum of spatial and temporal frequencies that appear in the visual input relevant to higher animals or autonomous
robots, a whole range of resolution levels is needed. The problem is that any information processing system has a limited capacity, which means that an implementation must
make some sort of compromise. The eyes of higher animals have developed into space-variant foveal sensors which have decreasing spatial resolution towards the periphery.
The density of sensory elements is several orders of magnitude higher in the centre of
the field of view, the fovea, than in the periphery. Correspondingly, in the primary visual
cortex there are several times as many neurons engaged in processing foveal signals as
there are neurons dedicated to peripheral information. To compensate for the lack of
spatial resolution in the periphery, the visual system constantly shifts its gaze direction
to bring peripheral stimuli of potential interest into the centre of the field of view for
scrutiny. This is done by means of saccades, fast ballistic eye movements to foveate
the target. The concept of a space-variant sensor, or fovea for short, has been carried
over to the field of computer vision. There exist several hardware approaches, e.g.,
[87, 84]. A software approach is to compute a complete multi-resolution pyramid, but
to throw away most of the data. An example is the log–Cartesian fovea [29], where a
multi-resolution region-of-interest (ROI) is defined within the pyramid. From each level
an equal size window is kept, which means that the finer levels cover a smaller part of
the field of view than the coarser levels. There is no restriction to the position of the
centre of the ROI, and consequently no explicit gaze direction change is needed for high
resolution processing in the periphery. Of course, image processing on this type of data
structure can be quite complicated. We use a simple version [101] of the log–Cartesian
fovea, where the ROI is always at the centre of the field of view and of constant size,
Figure 4.1.
4.2 Control aspects
We now give a short account of the basics of human voluntary smooth pursuit tracking [51, 28, 24, 83, 74, 108]. The tracking process always involves an initial shift of
attention to the stimulus, or target. If the target is not initially foveated, i.e., positioned
in the centre of the field of view, a saccade is produced. The eyes then accelerate to
their final speed within approximately 120 msec.

Figure 4.1: Log–Cartesian fovea. Left: Input image. Middle: Octave-based lowpass
pyramid. From each level of the pyramid the central part is extracted. The result is
equivalent to a space-variant sensor with a number of separate resolution channels of
equal size. Right: Combined image. Borders between different resolution levels have
been marked. Note that all resolution channels cover the central part of the field of view.

Interestingly, this final speed is approximately 10% less than the speed of the target, so that the eyes tend to lag behind
the target. This is compensated for by occasional saccades that recentre the target in
the fovea. The principal stimulus for pursuit movements is retinal slip, i.e., the translational image velocity of the target, though there is also evidence for a weak position
error component for small sudden target offsets from the fovea, with the pursuit velocity
being proportional to the offset. A conventional negative feedback controller, Figure 4.2,
is not able to accurately model the characteristics of the human smooth pursuit system.
The reason is that the model is unstable with the high gain and long delays that are found
experimentally. High gain, of course, is necessary to obtain good accuracy in pursuit, but
can be very problematic in systems with long delays. To overcome this problem, evolution has developed a quite different concept, Figure 4.3, where an internal (adaptive)
model of the ocular motor system is used to predict the eye velocity and subsequently to
reconstruct the target velocity. This is a manifestation of the fact that ’the brain knows
the body it resides in’ [24].
Next, we consider the design of the controller for a machine vision smooth pursuit
tracker. Although control issues are not central to our work, there are at least two aspects
of a motion estimation based smooth pursuit tracker that distinguish it from most other
trackers, so a certain digression may be justified. Most algorithms, whether for passive or active tracking, only use observations of position. Furthermore, the delay of the
observations is regarded as negligible. When these conditions apply there are standard
algorithms for state reconstruction, e.g., the so–called α–β– and α–β–γ–filters, [12]. We
want to use both position and velocity observations, and cannot ignore that the estimates
are delayed several sampling interval units. We define the overall goal of our control
algorithm as to keep the centroid of the target motion field stationary in the centre of the
field of view. Note that this consists of two conflicting sub–goals, namely to stabilize
the target on the image, and to bring the target closer to the centre.

Figure 4.2: A (too) simple model of the human smooth pursuit system. P(s) is the
transfer function of the ocular motor system, X and E the target and eye positions, respectively; the forward path is G(s) ≈ K e^(−sτ)/(1 + sT) with K ≈ 0.9, T ≈ 0.04 sec.
With the experimentally verified delay τ ≈ 0.13 sec, this system is unstable.

Figure 4.3: A more accurate model of the human smooth pursuit system (disregarding
interaction with the saccade system). This appears to first have been proposed by Young
[109]. The system utilizes an internal (adaptive) model of the ocular motor system to
predict the eye velocity and uses this to reconstruct the target velocity. This effectively
turns the system into an unconditionally stable feed–forward controller. After [83].

A choice has to be
made regarding which of these should have the highest priority. From an active vision
perspective it may be argued that it is important to keep the target still, e.g., to facilitate
a time–consuming object recognition process. This implies that the ’optimal’ strategy
may be to use velocity compensation and occasional ’catch–up’ saccades to recentre the
object in the field of view. We have successfully experimented with velocity error compensation combined with a weak position error compensation and catch–up saccades.
The reason for including a smooth position error component is that there are certain
technical problems associated with saccades. Firstly, camera heads driven by stepping
motors, such as the KTH and Aalborg heads, are very fast at executing small steps, but
disproportionately slow for large movements. Secondly, even with fast motors, there is
always a recovery period after each saccade during which no observations can be made,
due to the fact that the temporal buffers of the spatiotemporal filters must be refilled. By
including a weak smooth position error compensation we can avoid some saccades and
the associated information loss, without compromising image stability too much. Since
there is no a priori reason for a coupling between motion in horizontal and vertical directions, we use separate algorithms for each spatial dimension. Let x(kT ) denote the
position of the target at time t = kT , where T is the sampling interval. Assuming a slow
acceleration, we have approximately
x((k + 1)T ) = x(kT ) + T ẋ(kT )
ẋ((k + 1)T ) = ẋ(kT ) + T ẍ(kT )
A state model that assumes constant velocity is unrealistic. When one has to take into account planned target velocity changes, it is common to model acceleration as a stochastic
drift process, e.g., an IAR process, [21]. The control is achieved by specifying camera
head joint velocities, and we assume that the motor controller is significantly faster than
the pursuit loop. With state variables X₁(k) = x(kT), X₂(k) = ẋ(kT), and X₃(k) = ẍ(kT),
we then arrive at the following closed–loop state equation

$$\mathbf{X}(k+1) = \mathbf{A}\mathbf{X}(k) + \mathbf{B}u(k) + \boldsymbol{\nu} = \begin{bmatrix} 1 & T & 0 \\ 0 & 1 & T \\ 0 & 0 & 1 \end{bmatrix} \mathbf{X}(k) + \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} u(k) + \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} v(k) \qquad (4.1)$$
where v(k) is white noise. We use state variable feedback
$$u(k) = -\mathbf{L}\mathbf{X}(k) = -\begin{bmatrix} u_p & u_d & 0 \end{bmatrix} \mathbf{X}(k) \qquad (4.2)$$
with u_p and u_d constants that determine the influence of positional and velocity errors,
respectively. The state is reconstructed from observations by Kalman filtering. Now,
because of the temporal buffer depth, observations of position and velocity are delayed
several sampling intervals, so we have to use a multi–step predictor. The observations
are

$$\mathbf{y}(k) = \mathbf{C}\mathbf{X}(k) + \boldsymbol{\varepsilon} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \mathbf{X}(k) + \begin{bmatrix} e_1(k) \\ e_2(k) \end{bmatrix} \qquad (4.3)$$
where e₁(k) and e₂(k) are white noise processes. The optimal linear m–step predictor is
then given by (e.g., [5])

$$\hat{\mathbf{X}}(k|k-m) = \mathbf{A}^m \hat{\mathbf{X}}(k-m|k-m) + \sum_{i=k-m+1}^{k} \mathbf{A}^{k-i} \mathbf{B}\, u(i-1) \qquad (4.4)$$

where the reconstructor is given by

$$\hat{\mathbf{X}}(k|k) = [\mathbf{I} - \mathbf{K}\mathbf{C}][\mathbf{A}\hat{\mathbf{X}}(k-1|k-1) + \mathbf{B}u(k-1)] + \mathbf{K}\mathbf{y}(k) \qquad (4.5)$$
Here K is the 3 × 2 Kalman gain matrix, the properties of which depend on the noise
processes ν and ε, whose covariance matrices are

$$\mathbf{R}_1 = E[\boldsymbol{\nu}(k)\boldsymbol{\nu}(l)^T] = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \sigma_\nu^2 \end{bmatrix} \delta_{kl} \qquad \mathbf{R}_2 = E[\boldsymbol{\varepsilon}(k)\boldsymbol{\varepsilon}(l)^T] = \begin{bmatrix} \sigma_{pos}^2 & 0 \\ 0 & \sigma_{vel}^2 \end{bmatrix} \delta_{kl} \qquad (4.6)$$

The (stationary) Kalman gain is then given by

$$\mathbf{K} = \mathbf{A}\mathbf{P}\mathbf{C}^T \left[ \mathbf{C}\mathbf{P}\mathbf{C}^T + \mathbf{R}_2 \right]^{-1} \qquad (4.7)$$

where the prediction error covariance matrix P satisfies the discrete matrix Riccati equation

$$\mathbf{P} = \mathbf{R}_1 + \mathbf{A}\mathbf{P}\mathbf{A}^T - \mathbf{A}\mathbf{P}\mathbf{C}^T \left[ \mathbf{C}\mathbf{P}\mathbf{C}^T + \mathbf{R}_2 \right]^{-1} \mathbf{C}\mathbf{P}\mathbf{A}^T \qquad (4.8)$$
It is difficult to estimate the process variances σ²_ν, σ²_pos and σ²_vel, so we regard them,
as well as the feedback parameters u_p and u_d, as free parameters that we have at our
disposal to make the system behave appropriately.
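A compact sketch of the reconstruction and prediction machinery of Equations (4.4), (4.7) and (4.8) is given below; the sampling interval value and the reconstruction of the B column are illustrative assumptions.

```python
import numpy as np

T = 0.04                                  # sampling interval (assumed)
A = np.array([[1.0, T, 0.0],
              [0.0, 1.0, T],
              [0.0, 0.0, 1.0]])
B = np.array([0.0, 1.0, 0.0])
C = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

def stationary_gain(R1, R2, n_iter=500):
    # Iterate the discrete Riccati equation (4.8) to a fixed point
    # and return the stationary Kalman gain of Eq. (4.7).
    P = np.eye(3)
    for _ in range(n_iter):
        S = C @ P @ C.T + R2
        P = R1 + A @ P @ A.T - A @ P @ C.T @ np.linalg.solve(S, C @ P @ A.T)
    return A @ P @ C.T @ np.linalg.inv(C @ P @ C.T + R2)

def predict_m_steps(x_hat, u_seq):
    # Optimal linear m-step predictor, Eq. (4.4): run the state
    # equation forward with the known inputs u(k-m), ..., u(k-1).
    x = x_hat.copy()
    for u in u_seq:
        x = A @ x + B * u
    return x
```

For example, R1 = np.diag([0.0, 0.0, sigma_nu**2]) and R2 = np.diag([sigma_pos**2, sigma_vel**2]) reproduce the covariance structure of Equation (4.6).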
4.3 Motion models and segmentation
4.3.1 Segmentation
The process of motion–based image segmentation in a system with focus of attention
control consists of two separate stages—an ’early’ preattentive detection stage and a
’late’ object recognition oriented attentive stage. The parallel preattentive process detects local structure in the motion field and thereby indicates where potentially interesting stimuli appear in the field of view. The local structures of interest are optical
flow patterns characteristic to objects in motion, such as those shown in Figure 4.4. Indeed, neurons that are sensitive to this type of stimuli have been found in the primate
visual system, [88, 31]. The problem of detecting independently moving objects with
a moving observer is nontrivial, see, e.g., [26, 90]. Recently Fermüller and Aloimonos
[2, 37, 34, 35], based on earlier work by Nelson [80], have found a class of motion
constraints that, given bounds on egomotion, allows a moving observer to detect local
motion patterns that can not originate from the motion of the observer. In a computer
vision system the preattentive process may be implemented as a set of convolutions of
the velocity estimates with symmetry operators [15, 104]. The magnitude of the filter
responses defines a ’potential field’ with local extrema at interesting points. It is possible to devise algorithms where the attention is attracted to successive interesting points,
to a certain extent analogously to the interaction between a particle and a field of force [101].

Figure 4.4: Characteristic motion vector patterns generated by solid objects in motion.
Once a moving object has become the focus of attention, segmentation enters the second,
attentive stage. The overall goal of this stage is usually regarded as being to provide information for the inference of object shape and 3D depth and motion, a problem referred to as
structure from motion. Considering the importance of this, it is not surprising that the
amount of literature is absolutely enormous, see, e.g., [48, 46] and references therein.
Most structure recovery methods rely on accurate estimation of the motion field. Aloimonos and Fermüller (above references) suggest robust methods based on qualitative
properties of motion and bounds on egomotion.
Here we confine ourselves to a much more restricted task—to extract the 2D projection
of a single moving rigid object, fit its 2D motion vectors to a parameterised model, and
track it by smooth pursuit. The basic assumption we make is that, at least occasionally,
the motion field in the immediate vicinity of the target can be fitted to a single parameterised model, Figure 4.5. When this applies, we can use a very simple procedure to
extract the target.

Figure 4.5: Basic assumption of the approach is that, at least occasionally, we can find a
region in which the target and background motion fields each fit a single parameterised
model, V(x, y, p_o) and V(x, y, p_b) respectively.

Figure 4.6: Illustration of the motion segmentation process.

The idea is to estimate the background motion in an annular region surrounding the
target, and then use this to predict motion in the region inside the annulus. The local
motion estimates in the central region that are not well predicted by the annular motion
model are interpreted as due to the target motion. We can formulate this as an iterative
algorithm (a schematic code sketch follows the list).
1. We assume that an estimate of target size and position is available. This defines
a predicted target region. Initially this must be provided either by a preattentive
motion detector or by top–down prediction.
2. Fit a motion model to a ring around the target region, and if the model has a small
residual, use it to predict the motion in the target region. For each position, we
compare the predicted value with the actual estimate, and if the error is small, we
set a flag to mark that the estimate is consistent with the ring model.
3. If the prediction is satisfactory for most positions, we merge the ring and the
central region to a new target region and repeat the process. The interpretation
is that both the ring and the central region are completely contained within the
background or the target.
4. If the ring model does not cover all data in the target region, we fit a motion model
to the remaining estimates in the region. If this model gives a small residual we
assume that we have found the target, otherwise we merge the ring and the central
region to a new target region and repeat the process.
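The following is the schematic sketch announced above, rendered with a pure 2D translation model on a dense flow field. The ring width, the thresholds and the 10% outlier fraction are simplifying assumptions, and the affine model of Section 4.3.3 would replace fit_translation in a fuller version.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def fit_translation(u, v, mask):
    # Fit a pure 2D translation to the flow inside mask; return the
    # mean flow and the mean residual magnitude as a model residual.
    du, dv = u[mask].mean(), v[mask].mean()
    res = np.hypot(u[mask] - du, v[mask] - dv).mean()
    return (du, dv), res

def segment_target(u, v, seed, max_frac=0.5, res_tol=0.2, err_tol=0.5):
    # u, v: dense flow; seed: boolean mask of the predicted target
    # region. Grows a ring, fits the ring model, and tests how well
    # it predicts the interior (steps 1-4 of the algorithm above).
    region = seed.copy()
    while region.mean() < max_frac:
        ring = binary_dilation(region, iterations=3) & ~region
        (du, dv), res = fit_translation(u, v, ring)
        if res < res_tol:
            outliers = region & (np.hypot(u - du, v - dv) > err_tol)
            if outliers.sum() > 0.1 * region.sum():
                params, res_t = fit_translation(u, v, outliers)
                if res_t < res_tol:
                    return params, outliers      # target found
        region |= ring                           # merge ring and repeat
    return None                                  # region grew too large
```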
Figure 4.6 provides an illustration of two typical segmentation situations. The result
of a successful segmentation process is a set of target motion parameters and the set
of points whose motion estimates were used to estimate the parameters. We use the
lowpass–filtered confidence values of the motion estimates in these points to construct a
simple moving average model of the target. On those occasions when the segmentation
process fails, i.e., when the region grows beyond a predetermined maximum size, we
use the target model and predicted values of the motion parameters to pick out estimates
in the original target region for model fitting. The use of prediction of velocity and position increases the robustness against interference from occlusion. Comparison between
predicted and estimated motion parameters permits detection of incorrect parameter estimates which would correspond to unreasonable acceleration.
In the case of a pan/tilt rotating camera it is possible to use an alternative procedure
based on elimination of the camera–induced background motion field. From Figure 4.7
it is easy to convince oneself that an accurate expression for the camera–induced displacement is given by
0∆x1
@∆yA = ω r + 1 ẑ (ω r) r
f
0
where ω = (ωx ; ωy ; 0)T is the angular velocity vector. At each position we may construct
x
( −f )
r= y
∆x
image plane
optic centre
^
x
^
z
ωy
Figure 4.7: Illustration of displacement induced by pan rotation of camera.
a unit length spatiotemporal displacement vector û and multiply each tensor T̃ with a
factor that is a measure of the inconsistency between the tensor and the background
displacement direction. One possible choice is a ’soft threshold’

$$f(\hat{\mathbf{u}}, \tilde{\mathbf{T}}) = 1 - \exp\left[ -\,\frac{\hat{\mathbf{u}}^T \tilde{\mathbf{T}}\, \hat{\mathbf{u}}}{\|\tilde{\mathbf{T}}\|\, \sigma^2} \right]$$
This method to eliminate self–induced background motion is actually similar to how
we pick out tensors that are consistent with model predictions (see above)—it may be
regarded as a kind of focus–of–attention. The basic idea is illustrated in Figure 4.8,
where a cone symbolises the (hard or soft) threshold angle. With the use of motion
elimination the extraction of the target in the rotating camera case is achieved by simply
increasing the target region until no further data consistent with the target motion is
found.
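In code, the camera-induced displacement and the soft threshold may be evaluated as follows (the names and the value of σ² are our own choices):

```python
import numpy as np

def camera_induced_displacement(wx, wy, x, y, f):
    # omega = (wx, wy, 0), r = (x, y, -f); evaluates the image part of
    # omega x r + (1/f) (z . (omega x r)) r for a one-frame step.
    s = wx * y - wy * x                # z-component of omega x r
    dx = -f * wy + x * s / f
    dy =  f * wx + y * s / f
    return dx, dy

def inconsistency_weight(u_hat, T, sigma2=0.1):
    # 'Soft threshold' 1 - exp(-u^T T u / (||T|| sigma^2)); u_hat is
    # the unit spatiotemporal displacement vector, T a local tensor.
    q = u_hat @ T @ u_hat
    return 1.0 - np.exp(-q / (np.linalg.norm(T) * sigma2))
```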
Figure 4.8: Eliminating irrelevant motion. Left: Cone centered at vector pointing in
direction of camera-induced motion. The line segment (which becomes a plane in the
spatiotemporal space) moves with respect to a static background, since it does not intersect with the cone. Right: Cone centered at the target’s predicted spatiotemporal velocity
vector. The line segment may belong to the target, since the corresponding 3D plane is
inside the cone.
4.3.2 Discussion
The segmentation methods presented above are related to several other approaches for
motion tracking with an active camera. Nordlund and Uhlin [81, 94] fit two consecutive
frames to a global 2D displacement model (translation or affine transformation) and
detect the centroid of the target displacement field from the residual image. The benefit
of this method is that it is computationally efficient. An obvious drawback is the loss of
locality, which implies very strong assumptions about the global structure of the scene.
Also, the quality of tracking, i.e., the ability to stabilize the target on the image, is
compromised by the lack of precise motion estimates.
Murray and Basu [76] use a modification of image subtraction combined with elimination of camera–induced motion to track a moving object with a pan/tilt rotating
camera—they do not address the case of a translating camera. Again, the lack of accurate motion estimates prevents good image stabilisation.
Tölg [92], see also [86], uses a differential–based optical flow algorithm [95] which
makes it possible to extract the target motion field by clustering—only fronto–parallel
motion is treated. Tölg, assuming a pan/tilt rotating camera, also compensates for
camera–induced motion, but bases this on motion vector subtraction, which assumes
that the true motion vector can be accurately estimated. However, it is known that in
general only the local normal velocity (component in direction of spatial gradient) is
available with some precision.
The question of how to represent the target region shape has been addressed by many
authors in the field of image sequence coding, but has attracted comparatively little
attention in tracking. The principal use of shape information in tracking is to implement
a spatial weight mask—a kind of focus–of–attention. Meyer and Bouthemy [73] use
a region descriptor (convex hull) to detect occlusion from region area size changes.
Though their motion–based region segmentation method, [20], is unsuitable for real–
time applications, the idea of a more structural region representation is conceivable also
for sparse motion fields using tools from computational geometry, [93, 96].
4.3.3 Motion models
We have experimented with three different motion models—pure 2D translation, translation and expansion/contraction, and affine transformations. Concerning parameter estimation, computational efficiency was stressed—costly iterative methods were regarded
as unacceptable. The pros and cons of this are discussed at the end of this section.
The case of pure translation is particularly simple, as illustrated in Figure 4.9. If a
set of local structure tensors² emanate from a single pure 2D translation, the sum of
the tensors, T_sum, will be of rank 2 with the eigenvector corresponding to the smallest
eigenvalue pointing in the direction of spatiotemporal displacement. If the tensors do
not come from a single translation, T_sum will have a significant third eigenvalue λ₃.
Consequently we can use the quotient λ₃/Tr T_sum as a measure of the deviation from
pure translation.
When there is a significant velocity component orthogonal to the image plane, we have
to take into account the perspective transformation. Restricting ourselves to a single
spatial dimension, the perspective transformation equation may be written

$$x = -f\,\frac{X}{Z} \qquad (4.9)$$

where f is the camera constant, x denotes the image coordinate, X the corresponding
world coordinate, and Z the orthogonal distance from the camera lens to the object.

² We use the rank 2 tensors T̃ introduced in Section 2.2.
Figure 4.9: Tensor averaging. (a), (b): Two edges in common motion, creating planes
in the 3D spatiotemporal space. Orientation tensors (ideally of rank 1) visualised as ellipsoids with eigenvectors forming principal axes. The eigenvector corresponding to the
largest eigenvalue indicates orientation of plane. (c), (d): Averaging of tensors ‘symbolically’ shown in (c) gives as result an estimate of the true motion, now from the
eigenvector corresponding to the smallest eigenvalue of the tensor (d), which ideally is
of rank 2.
Differentiating Equation (4.9) with respect to time, we find

$$\dot{x} = f\,\frac{X}{Z^2}\,\dot{Z} - f\,\frac{\dot{X}}{Z} \qquad (4.10)$$
Assuming that the variation in depth of the visible part of the target along the line of
sight is small compared with the viewing distance (weak perspective), we conclude that
the apparent velocity of a projected point is a linear function of its distance from the
image coordinate centre. In Figure 4.10 we see that this means that the spatiotemporal
velocity vectors converge at a point, which we may call the spatiotemporal focus-of-expansion, x_STFOE(t). To find this point, one may use some kind of clustering scheme or
proceed as follows. The spatiotemporal displacement vector at x = (x, y, t)ᵀ is parallel to
x − x_STFOE(t). Ideally the local structure tensor T̃(x) has no projection in this direction,
so that

$$\tilde{\mathbf{T}}(\mathbf{x})(\mathbf{x} - \mathbf{x}_{STFOE}) = \mathbf{0}$$

We can find a least squares optimal value of x_STFOE in a spatial region R by minimising

$$E = \sum_{\mathbf{x} \in R} (\mathbf{x} - \mathbf{x}_{STFOE})^T\, \tilde{\mathbf{T}}(\mathbf{x})\, (\mathbf{x} - \mathbf{x}_{STFOE}) = \mathbf{x}_{STFOE}^T \sum \tilde{\mathbf{T}}(\mathbf{x})\, \mathbf{x}_{STFOE} - 2\, \mathbf{x}_{STFOE}^T \sum \tilde{\mathbf{T}}(\mathbf{x})\, \mathbf{x} + \sum \mathbf{x}^T \tilde{\mathbf{T}}(\mathbf{x})\, \mathbf{x}$$

Since ∑ T̃(x) is positive (semi–)definite, it follows that the optimal x_STFOE is a solution
Figure 4.10: Spatiotemporal convergence of velocity vectors. At any given time, the
spatiotemporal tangent vectors converge at x_STFOE(t).
to the linear equation system

$$\sum \tilde{\mathbf{T}}(\mathbf{x})\, \mathbf{x}_{STFOE} = \sum \tilde{\mathbf{T}}(\mathbf{x})\, \mathbf{x}$$

So in addition to the sum of tensors that we use for the pure translation model estimation,
we need the vector sum ∑ T̃(x) x. The translation component of the motion can be found
from the average displacement direction ∑ Tr T̃(x) (x − x_STFOE).
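Given the accumulated sums, solving for x_STFOE is a single 3 × 3 linear solve, e.g.:

```python
import numpy as np

def stfoe(T, X):
    # Least-squares spatiotemporal focus-of-expansion: solve
    # (sum_x T(x)) x_STFOE = sum_x T(x) x.
    # T: (n, 3, 3) rank-2 tensors; X: (n, 3) positions (x, y, t).
    A = T.sum(axis=0)
    b = np.einsum('nij,nj->i', T, X)
    return np.linalg.solve(A, b)
```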
An alternative way to proceed when modelling the motion field as a fronto–parallel
translation plus motion orthogonal to the image plane is to make the ansatz

$$\begin{pmatrix} u \\ v \end{pmatrix} = a \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} u_0 \\ v_0 \end{pmatrix}$$
We will not describe this in detail, since it is a simple special case of our final approach,
namely to model the motion field as an affine transformation

$$\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} u_0 \\ v_0 \end{pmatrix} \qquad (4.11)$$

This type of motion model has recently become a popular choice for the segmentation
stage of video sequence coding algorithms, e.g., [58, 73, 100, 33]. One reason for this is
that it can be shown that the general motion of a planar surface patch under orthographic
projection can be expressed as an affine transformation.
The linear transformation represents rotation in the image plane, depth motion and shear,
while the constant part describes translation in the image plane.
Rewriting Equation (4.11) as a linear transformation of the spatiotemporal displacement
vector ũ = (Δx, Δy, Δt)ᵀ = (u, v, 1)ᵀ, we obtain

$$\tilde{\mathbf{u}} = \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \mathbf{A}\tilde{\mathbf{x}} = \begin{pmatrix} a_{11} & a_{12} & u_0 \\ a_{21} & a_{22} & v_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \qquad (4.12)$$

Again noting that the local structure tensor at each position has no projection in the
direction of spatiotemporal displacement, a least squares sense optimal affine motion
model for an image region R satisfies

$$\mathbf{A}_{opt} = \arg\min_{\mathbf{A}} \sum_{(x,y) \in R} \tilde{\mathbf{u}}^T(x, y)\, \tilde{\mathbf{T}}\, \tilde{\mathbf{u}}(x, y) = \arg\min_{\mathbf{A}} \sum_{(x,y) \in R} \tilde{\mathbf{x}}^T \mathbf{A}^T\, \tilde{\mathbf{T}}\, \mathbf{A}\, \tilde{\mathbf{x}} \qquad (4.13)$$
Setting the partial derivatives with respect to the parameters to zero results in the following symmetric linear equation system:

$$\sum_{(x,y) \in R} \left[ \begin{pmatrix} x^2 t_{11} & xy\,t_{11} & x\,t_{11} & x^2 t_{12} & xy\,t_{12} & x\,t_{12} \\ xy\,t_{11} & y^2 t_{11} & y\,t_{11} & xy\,t_{12} & y^2 t_{12} & y\,t_{12} \\ x\,t_{11} & y\,t_{11} & t_{11} & x\,t_{12} & y\,t_{12} & t_{12} \\ x^2 t_{12} & xy\,t_{12} & x\,t_{12} & x^2 t_{22} & xy\,t_{22} & x\,t_{22} \\ xy\,t_{12} & y^2 t_{12} & y\,t_{12} & xy\,t_{22} & y^2 t_{22} & y\,t_{22} \\ x\,t_{12} & y\,t_{12} & t_{12} & x\,t_{22} & y\,t_{22} & t_{22} \end{pmatrix} \begin{pmatrix} a_{11} \\ a_{12} \\ u_0 \\ a_{21} \\ a_{22} \\ v_0 \end{pmatrix} + \begin{pmatrix} x\,t_{13} \\ y\,t_{13} \\ t_{13} \\ x\,t_{23} \\ y\,t_{23} \\ t_{23} \end{pmatrix} \right] = \mathbf{0} \qquad (4.14)$$

where t_ij refers to the (i, j) component of T̃. In Appendix A.2.1 we present results of
an experimental investigation of the accuracy of the affine parameter estimates.³ We use
α–β–filters [12] for prediction of the four target motion parameters a₁₁, a₁₂, a₂₁, and a₂₂
during tracking.
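A direct transcription of Equation (4.14) accumulates the symmetric 6 × 6 system from the tensor components and solves it once per region. The sketch below is unweighted and ignores certainties; the names are illustrative.

```python
import numpy as np

def affine_from_tensors(T, x, y):
    # Solve the 6x6 system of Eq. (4.14). T: (n, 3, 3) tensors at
    # image positions (x, y); returns (a11, a12, u0, a21, a22, v0).
    M = np.zeros((6, 6))
    rhs = np.zeros(6)
    for Ti, xi, yi in zip(T, x, y):
        p = np.array([xi, yi, 1.0])
        Q = np.outer(p, p)             # the (x^2, xy, x; ...) block
        M[:3, :3] += Ti[0, 0] * Q      # t11 block
        M[:3, 3:] += Ti[0, 1] * Q      # t12 blocks (symmetric)
        M[3:, :3] += Ti[0, 1] * Q
        M[3:, 3:] += Ti[1, 1] * Q      # t22 block
        rhs[:3]  -= Ti[0, 2] * p       # t13 terms
        rhs[3:]  -= Ti[1, 2] * p       # t23 terms
    return np.linalg.solve(M, rhs)
```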
Possibly the only reason for using linear least squares methods for motion parameter estimation is that they are computationally efficient, which of course is critical in real-time tracking. A well-known problem with linear least squares methods is their sensitivity to outliers. There exist several techniques, referred to as robust estimation [52, 45], where outliers are rejected in an iterative process. This typically amounts to maximising a log likelihood function in which the data set is modelled as a mixture of a Gaussian signal process and a non-Gaussian outlier noise process. There are also various clustering techniques, e.g., [85, 57], that can be used as robust alternatives to linear least
³ Experimental results for the simpler motion models are not included; they are similar to the results for the affine model when applicable.
squares. Without going into details, let us just mention that it is completely straightforward to apply these methods to motion parameter estimation using the spatiotemporal
constraints presented in this section. Any of these techniques typically requires several
times as much work as linear least squares.
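As a rough illustration of the iterative outlier-rejection idea (not the method used in this work, which stays with plain least squares for speed), the affine solver sketched above can be wrapped in an iteratively reweighted least squares loop that down-weights pixels with large residuals $\tilde{\mathbf{u}}^T \tilde{\mathbf{T}} \tilde{\mathbf{u}}$. The Cauchy weight function and the scale constant c are arbitrary choices here.

import numpy as np

def affine_irls(T, region, n_iter=5, c=1.0):
    """Iteratively reweighted least squares variant of the affine fit."""
    region = list(region)
    w = np.ones(len(region))
    for _ in range(n_iter):
        M = np.zeros((6, 6))
        b = np.zeros(6)
        for wi, (x, y) in zip(w, region):
            t = wi * T[y, x]                  # down-weighted tensor
            J = np.array([[x, y, 1, 0, 0, 0],
                          [0, 0, 0, x, y, 1],
                          [0, 0, 0, 0, 0, 0]], dtype=float)
            M += J.T @ t @ J
            b += J.T @ t[:, 2]
        p = np.linalg.solve(M, -b)
        A = np.vstack([p.reshape(2, 3), [0.0, 0.0, 1.0]])
        for i, (x, y) in enumerate(region):   # recompute robust weights
            u = A @ np.array([x, y, 1.0])
            r = u @ T[y, x] @ u               # residual u~^T T u~ >= 0
            w[i] = 1.0 / (1.0 + (r / c) ** 2) # Cauchy weight
    return p.reshape(2, 3)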
4.4 On the use of normalized convolution in tracking
Presently, we see two main applications of normalized convolution in active tracking.
Firstly, it is possible to shorten the recovery period following a saccade by several sampling intervals by giving the old pre–saccade frames that are left in the temporal buffer
a certainty value equal to zero. In this way a rough estimate of the post–saccade target
velocity can be produced very quickly. Secondly, it is possible to integrate normalized
convolution with focus–of–attention control by using zero–certainty masks to suppress
interference from known structures, such as a robot arm or already modelled objects.
Details of this are given in [103, 101, 107].
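In its simplest form (normalized averaging), normalized convolution filters the certainty-weighted signal and divides by the filtered certainty, $U = (a * (c \cdot f)) / (a * c)$, where $a$ is the applicability function. The sketch below, which assumes SciPy for the convolution, shows the principle and how a zero-certainty mask removes the influence of pre-saccade frames or known structures; it is an illustration, not the efficient spatiotemporal implementation described earlier in the thesis.

import numpy as np
from scipy.ndimage import convolve

def normalized_average(f, c, a, eps=1e-12):
    """Simplest instance of normalized convolution:
    U = (a * (c f)) / (a * c), applicability a, certainty c."""
    num = convolve(c * f, a, mode='nearest')
    den = convolve(c, a, mode='nearest')
    return num / np.maximum(den, eps)  # guard against zero total certainty

# Making pre-saccade frames invisible in a spatiotemporal buffer
# (frame index along the last axis; saccade_frame is hypothetical):
#     certainty[..., :saccade_frame] = 0.0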
4.5 Experimental results
An advanced graphics simulation environment, [102, 101], was used for development
and evaluation of the tracking routines. The tracker was implemented on a simulated
standard Puma 560 robot with a mounted (stereo–) camera head, Figure 4.11. The use
of a simulated environment has a number of advantages: it allows for a wide range of experimental situations, it provides true velocity and position for use in algorithm performance evaluation, and it removes the dependence on special-purpose hardware. Simulation can certainly never replace performance in 'the real world' as the final criterion of failure or success, particularly when there is a real-time constraint, but it is extremely useful in the early stages of development.
The sequence in Figure 4.13 shows smooth pursuit with the robot initially approaching
the target, then moving parallel to it, and finally stopping in front of the approaching
target. Note how the robot's manoeuvres are reflected in the displacement of the target from
the centre of the field of view. In Figure 4.14 the weights (confidence values) of the
local motion estimates that were used to extract the target are superimposed on the target
which is tracked with a rotating camera.
The code has been ported to the Aalborg University Camera Head Demonstrator, Figure 4.12. Although only a few experiments were carried out with the installed code, we
consider them sufficient as a ’proof of concept’, see Figures 4.15 and 4.16. The image
processing was done at a moderate 2 Hz on a Unix workstation by importing the video
signal to the above mentioned graphics simulation package. The simulation package
code communicated with the onboard camera head controllers via sockets to issue commands and receive actual camera head joint velocities and positions. We anticipate a
more careful implementation not involving the graphics package to achieve at least 10
Hz.
Figure 4.11: Robot simulation. Left eye, right eye, and overview. The test platform
consists of a simulated Puma 560 robot with a mounted stereo camera head.
Figure 4.12: The Aalborg University Camera Head Demonstrator system.
Figure 4.13: Simulated sequence with a translating camera. A box is transported on a
conveyor belt. The small cross marks the centroid of the target motion field, whereas the
large cross simply marks the centre of the image.
Figure 4.14: Simulated sequence. The confidence values of the extracted motion estimates are superimposed on the target.
Figure 4.15: ’To and fro’—a real–time smooth–pursuit tracking sequence recorded at
the Laboratory of Image Analysis, Aalborg University.
Figure 4.16: ’Walking the plank’—a real–time smooth–pursuit tracking sequence
recorded at the Laboratory of Image Analysis, Aalborg University.
5 SUMMARY
The thesis has treated certain aspects of efficient spatiotemporal filtering and modelling.
A novel energy–based multiresolution algorithm for estimation and representation of
local spatiotemporal structure by second order symmetric tensors has been presented.
The algorithm utilizes coarse–to–fine data validation to produce accurate and reliable
estimates over a wide range of image displacements. It is highly parallelisable thanks to
an efficient separable sequential implementation of 3D quadrature filters. Results show
that the algorithm is capable of producing very accurate estimates of image velocity.
An efficient spatiotemporal implementation of a certainty–based signal modelling method
called normalised convolution has been described and experimentally investigated. By
using a novel iterative filter optimisation method to produce positive filter coefficients,
we show that normalised convolution can be efficiently integrated with the sequential
quadrature filtering scheme. We also describe how certainty information is incorporated
in the construction of the local structure tensor. Experimental tests show that normalised
convolution for signal interpolation in spatiotemporal sequences is very effective, due to
the high information redundancy in the signal.
As an application of the above results, we have presented a smooth pursuit motion tracking algorithm that uses observations of both target motion and position for camera head
control and motion prediction. We introduce a novel target motion segmentation method
which assumes that the motion fields of the target and its immediate vicinity, at least occasionally, each can be modelled by a single parameterised motion model. We use the
acquired and updated information about target position and velocity to extract motion
estimates on those occasions when the segmentation algorithm fails. We also show how
to eliminate camera–induced background motion in the case of a pan/tilt rotating camera. We provide excerpts from both simulated and real tracking sequences that show that
the algorithms work.
The thesis probably creates more question marks than it straightens out. Certainly a great deal more work can be put into the study of the multi-resolution tensor field estimation algorithm. There may be better ways to combine the different levels, and it may also be useful to consider some local interaction at each level. In particular, a study of relaxation techniques would be interesting. Considering applications, promising work on region-based motion segmentation for image sequence coding has recently
been done [33].
The efficient implementation of normalised convolution is an important result. As mentioned earlier, we think that the full potential of the signal/certainty philosophy is yet to be exploited. For instance, work remains to be done on integrating certainty processing, feature extraction and object recognition.
A straightforward refinement of the iterative target segmentation algorithm would be to make the region growing process shape-adaptive, most simply by computing the principal axes of the motion field distribution. Of course, there are dozens of possible real-time applications other than smooth pursuit tracking that can use accurate motion estimates; to name but a few: egomotion estimation, obstacle avoidance, fixation, and homing. Also, as mentioned earlier, there are better ways to estimate motion model parameters than the linear least squares methods that we have used in this work. It would be interesting to study the performance of, e.g., mixture models [57], and to see whether they can be efficiently implemented within the local structure tensor framework; it seems likely that robust iterative methods converge faster when the local estimates are of high precision with occasional outliers than if they have a large variance, as is the case for the more primitive optical flow methods.
* * *
A TECHNICAL DETAILS
In this appendix we provide some of the more technical details and results of experimental tests that for continuity reasons were omitted in the main text. The mathematical
results are all quite elementary, but it was felt that they should be included for the sake
of completeness.
A.1 Local motion and tensors
A.1.1 Velocity from the tensor
There are two fundamental cases.
1. Moving point, Figure A.1. The spatiotemporal line traced out by the moving point has a tangent vector $\mathbf{v}_{ST} = (v_x, v_y, 1)^T$ which is parallel to the eigenvector $\hat{\mathbf{e}}_3 = (e_{31}, e_{32}, e_{33})^T$ corresponding to the smallest eigenvalue of the tensor. Consequently the velocity is given by

$$\mathbf{v} = (v_x, v_y)^T = \frac{1}{e_{33}}\, (e_{31}, e_{32})^T \qquad (A.1)$$
2. Moving line, Figure A.2. The moving line creates a plane in the spatiotemporal volume. In this case the true velocity can not be determined, since it is impossible to detect any velocity component parallel to the line. However, the component $\mathbf{v}_{norm}$ normal to the line is easily recovered by noticing that its spatiotemporal counterpart $\mathbf{v}_{norm}^{ST}$ is orthogonal to the eigenvector $\hat{\mathbf{e}}_1$ corresponding to the largest eigenvalue, as well as to a vector $\mathbf{l} = (e_{12}, -e_{11}, 0)^T$ parallel to the line of intersection of the motion plane and $t = t_0$. But

$$\hat{\mathbf{e}}_1 \times \mathbf{l} = (-e_{11} e_{13},\; -e_{12} e_{13},\; e_{11}^2 + e_{12}^2)^T$$
so that

$$\mathbf{v}_{norm} = \frac{-e_{13}}{e_{11}^2 + e_{12}^2}\, (e_{11}, e_{12})^T \qquad (A.2)$$

since the temporal component of $\mathbf{v}_{norm}^{ST}$ equals 1.
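Both cases are easy to implement with an eigenvalue decomposition; the sketch below follows (A.1) and (A.2), with an arbitrary eigenvalue-ratio threshold standing in for a proper rank test.

import numpy as np

def velocity_from_tensor(T, rank_tol=0.1):
    """Velocity from a 3x3 spatiotemporal structure tensor, (A.1)/(A.2)."""
    lam, E = np.linalg.eigh(T)      # eigenvalues in ascending order
    lam, E = lam[::-1], E[:, ::-1]  # reorder: e1 largest, e3 smallest
    e1, e3 = E[:, 0], E[:, 2]
    if lam[1] < rank_tol * lam[0]:
        # ~rank 1 (moving line): normal velocity, Equation (A.2)
        return -e1[2] * e1[:2] / (e1[0] ** 2 + e1[1] ** 2)
    # ~rank 2 (moving point): true velocity, Equation (A.1)
    return e3[:2] / e3[2]

Note that both expressions are insensitive to the sign ambiguity of the eigenvectors, since the components enter in even combinations or as ratios.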
A.1.2 Speed and eigenvectors
We have claimed that

$$s = \sqrt{\frac{e_{13}^2}{1 - e_{13}^2}} \qquad \mathbf{T} \text{ rank 1} \qquad (A.3)$$

$$s = \sqrt{\frac{1 - e_{33}^2}{e_{33}^2}} \qquad \mathbf{T} \text{ rank 2} \qquad (A.4)$$

are measures of the magnitude of local velocity.
Proof: (A.3) follows from (A.2) since

$$\|\mathbf{v}_{norm}\|^2 = \frac{e_{13}^2}{(e_{11}^2 + e_{12}^2)^2}\, (e_{11}^2 + e_{12}^2) = \frac{e_{13}^2}{e_{11}^2 + e_{12}^2} = \frac{e_{13}^2}{1 - e_{13}^2}$$

Similarly, (A.4) follows from (A.1) since

$$\|\mathbf{v}\|^2 = \frac{1}{e_{33}^2}\, (e_{31}^2 + e_{32}^2) = \frac{1 - e_{33}^2}{e_{33}^2} \qquad \Box$$
A.1.3 Speed and $\mathbf{T}_{xy}$

We have claimed that

$$\mathrm{Tr}\, \mathbf{T}_{xy} = \lambda_1 (1 - e_{13}^2) \qquad \mathbf{T} \text{ rank 1} \qquad (A.5)$$

$$\det \mathbf{T}_{xy} = \lambda_1 \lambda_2\, e_{33}^2 \qquad \mathbf{T} \text{ rank 2} \qquad (A.6)$$

Proof: When $\mathbf{T}$ is of rank 1 the leading $2 \times 2$ submatrix is

$$\mathbf{T}_{xy} = \lambda_1 \begin{pmatrix} e_{11}^2 & e_{11} e_{12} \\ e_{11} e_{12} & e_{12}^2 \end{pmatrix}$$
Figure A.1: Moving point case. Left: Image plane. Right: Spatiotemporal volume.
Figure A.2: Moving line case. Left: Image plane. Right: Spatiotemporal volume.
Since $\|\hat{\mathbf{e}}_1\|^2 = e_{11}^2 + e_{12}^2 + e_{13}^2 = 1$, (A.5) follows. When $\mathbf{T}$ is of rank 2 the leading $2 \times 2$ submatrix is

$$\mathbf{T}_{xy} = \begin{pmatrix} \lambda_1 e_{11}^2 + \lambda_2 e_{21}^2 & \lambda_1 e_{11} e_{12} + \lambda_2 e_{21} e_{22} \\ \lambda_1 e_{11} e_{12} + \lambda_2 e_{21} e_{22} & \lambda_1 e_{12}^2 + \lambda_2 e_{22}^2 \end{pmatrix}$$

so that the determinant is

$$\det \mathbf{T}_{xy} = \lambda_1 \lambda_2 \left[ e_{11}^2 e_{22}^2 + e_{12}^2 e_{21}^2 - 2 e_{11} e_{12} e_{21} e_{22} \right] = \lambda_1 \lambda_2 (e_{11} e_{22} - e_{12} e_{21})^2$$

But

$$e_{33} = [\hat{\mathbf{e}}_1 \times \hat{\mathbf{e}}_2]_3 = \left[\, \begin{vmatrix} \hat{\mathbf{x}} & \hat{\mathbf{y}} & \hat{\mathbf{t}} \\ e_{11} & e_{12} & e_{13} \\ e_{21} & e_{22} & e_{23} \end{vmatrix} \,\right]_3 = e_{11} e_{22} - e_{12} e_{21}$$

and we are done. $\Box$
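These identities are easy to check numerically; the snippet below builds synthetic rank-1 and rank-2 tensors from random orthonormal vectors and verifies (A.3)–(A.6). It is a sanity check only, with arbitrary eigenvalues.

import numpy as np

rng = np.random.default_rng(0)

# Rank 1: T = lam1 e1 e1^T, unit e1. Check (A.5), and (A.3) via (A.2).
e1 = rng.normal(size=3); e1 /= np.linalg.norm(e1)
lam1 = 2.5
T1 = lam1 * np.outer(e1, e1)
assert np.isclose(np.trace(T1[:2, :2]), lam1 * (1 - e1[2] ** 2))      # (A.5)
v_norm = -e1[2] * e1[:2] / (e1[0] ** 2 + e1[1] ** 2)                  # (A.2)
assert np.isclose(np.linalg.norm(v_norm),
                  np.sqrt(e1[2] ** 2 / (1 - e1[2] ** 2)))             # (A.3)

# Rank 2: T = lam1 e1 e1^T + lam2 e2 e2^T, orthonormal basis (e1, e2, e3).
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
e1, e2, e3 = Q[:, 0], Q[:, 1], Q[:, 2]
lam1, lam2 = 3.0, 1.0
T2 = lam1 * np.outer(e1, e1) + lam2 * np.outer(e2, e2)
assert np.isclose(np.linalg.det(T2[:2, :2]), lam1 * lam2 * e3[2] ** 2)  # (A.6)
assert np.isclose(np.linalg.norm(e3[:2] / e3[2]),
                  np.sqrt((1 - e3[2] ** 2) / e3[2] ** 2))  # (A.4), via (A.1)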
A.2 Motion models
A.2.1 Accuracy of affine parameter estimates
To determine the accuracy of the affine parameter estimates, a synthetic 2D profile, Figure A.3, was subjected to sequences of affine transformations. The object covered 100 × 100 pixels when not transformed. All experiments started with the object in a standard position, centred in the field of view.

We separately studied expansion (Figure A.5), rotation (Figure A.6), translation (Figure A.7), and composite transformation (Figure A.8). In the figures, histograms of estimated values of the affine parameters are plotted as

$$\begin{pmatrix} a_{11} & a_{12} & v_x \\ a_{21} & a_{22} & v_y \end{pmatrix}$$
As is seen, the estimates are excellent over a wide range of parameter values. Of course, these estimates have been obtained under perfect conditions and thus merely show that the algorithm then has the ability to produce consistent estimates with small variance. To see how the algorithm performs in somewhat more realistic situations, a simple camera simulation was generated by adding 10 dB SNR additive uncorrelated noise and motion blurring (moving average). Results are shown in Figures A.9–A.12. As expected, very little or no degradation in performance can be detected; the spatiotemporal filtering has proven very noise tolerant, with an average orientation estimation error of 3.0° at 10 dB SNR (0.8° at ∞ dB SNR) for the particular filter set used here, [63].
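A rough version of such a camera simulation is easy to write down; the sketch below applies a moving-average blur along the horizontal displacement and then adds white noise scaled to a requested SNR. The blur handling is deliberately simplified (axis-aligned, with signal power taken as the image variance), so this is an assumption-laden illustration rather than the simulation actually used.

import numpy as np

def simulate_camera(frame, speed_px, snr_db=10.0, rng=None):
    """Add moving-average motion blur and additive uncorrelated noise."""
    rng = rng or np.random.default_rng()
    out = frame.astype(float)
    n = max(int(round(abs(speed_px))), 1)       # blur length in pixels
    kernel = np.ones(n) / n                     # moving average along x
    out = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode='same'), 1, out)
    p_signal = np.var(out)                      # signal power ~ variance
    p_noise = p_signal / 10.0 ** (snr_db / 10.0)
    return out + rng.normal(scale=np.sqrt(p_noise), size=out.shape)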
Figure A.3: Textured profile used in experiments.
Figure A.4: A more realistic situation with 10 dB SNR additive uncorrelated noise and
image blurring.
Figure A.5: Expansion. (a) a11 = a22 = 0.005. (b) a11 = a22 = 0.01. (c) a11 = a22 = 0.02. (d) a11 = a22 = 0.04.
Figure A.6: Rotation. (a) a12 = −a21 = 0.01. (b) a12 = −a21 = 0.02. (c) a12 = −a21 = 0.04. (d) a12 = −a21 = 0.08.
Figure A.7: Translation. (a) vx = vy = 0.5. (b) vx = vy = 1. (c) vx = vy = 2. (d) vx = vy = 4.
Figure A.8: Composite transformation. (a) a11 = a12 = −a21 = a22 = 0.015, vx = −vy = 1. (b) a11 = a12 = −a21 = a22 = 0.03, vx = −vy = 2.
Figure A.9: Expansion. (a) a11 = a22 = 0.005. (b) a11 = a22 = 0.01. (c) a11 = a22 = 0.02. (d) a11 = a22 = 0.04.
Figure A.10: Rotation. (a) a12 = −a21 = 0.01. (b) a12 = −a21 = 0.02. (c) a12 = −a21 = 0.04. (d) a12 = −a21 = 0.08.
Figure A.11: Translation. (a) vx = vy = 0.5. (b) vx = vy = 1. (c) vx = vy = 2. (d) vx = vy = 4.
Figure A.12: Composite transformation. (a) a11 = a12 = −a21 = a22 = 0.015, vx = −vy = 1. (b) a11 = a12 = −a21 = a22 = 0.03, vx = −vy = 2.
Bibliography
[1] E. H. Adelson and J. R. Bergen. Spatiotemporal energy models for the perception
of motion. Jour. of the Opt. Soc. of America, 2:284–299, 1985.
[2] J. Aloimonos. Purposive and qualitative active vision. In DARPA Image Understanding Workshop, Philadelphia, Penn., USA, September 1990.
[3] Y. Aloimonos, editor. Active Perception, volume 1 of Advances in Computer
Vision. Lawrence Erlbaum Publishers, 1993.
[4] Y. Aloimonos, I. Weiss, and A. Bandopadhay. Active vision. Int. Journal of
Computer Vision, 1(3):333–356, 1988.
[5] B. D. O. Anderson and J. B. Moore. Optimal Filtering. Prentice–Hall, 1979.
[6] M. Andersson and H. Knutsson. General sequential Spatiotemporal Filters for
Efficient Low Level Vision. In ECCV-96, April 1996. Submitted.
[7] M. T. Andersson and H. Knutsson. Personal communication.
[8] R. Bajcsy. Active perception vs. passive perception. In Proc. IEEE Workshop on
Computer Vision, pages 55–59, 1985.
[9] R. Bajcsy. Active perception. Proceedings of the IEEE, 76(8):996–1005, August
1988.
[10] D. H. Ballard. Animate vision. In Proc. Int. Joint Conf. on Artificial Intelligence,
pages 1635–1641, 1989.
[11] D. H. Ballard. Animate vision. Technical Report 329, Computer Science Department, University of Rochester, Feb. 1990.
[12] Y. Bar-Shalom and T. E. Fortmann. Tracking and data association. Academic
Press, 1988.
[13] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow
techniques. Int. J. of Computer Vision, 12(1):43–77, 1994.
85
86
BIBLIOGRAPHY
[14] R. Battiti, E. Amaldi, and C. Koch. Computing optical flow across multiple
scales: An adaptive coarse-to-fine strategy. Int. Journal of Computer Vision,
6(2):133–145, 1991.
[15] J. Bigün. Local Symmetry Features in Image Processing. PhD thesis, Linköping
University, Sweden, 1988. Dissertation No 179, ISBN 91–7870–334–4.
[16] J. Bigün and G. H. Granlund. Optimal orientation detection of linear symmetry.
In Proceedings of the IEEE First International Conference on Computer Vision,
pages 433–438, London, Great Britain, June 1987. Report LiTH–ISY–I–0828,
Computer Vision Laboratory, Linköping University, Sweden, 1986.
[17] J. Bigün and G. H. Granlund. Optical flow based on the inertia matrix of the
frequency domain. In Proceedings from SSAB Symposium on Picture Processing, Lund University, Sweden, March 1988. SSAB. Report LiTH–ISY–I–0934,
Computer Vision Laboratory, Linköping University, Sweden, 1988.
[18] J. Bigün, G. H. Granlund, and J. Wiklund. Multidimensional orientation: texture
analysis and optical flow. IEEE Transactions on Pattern Analysis and Machine
Intelligence, PAMI–13(8), August 1991.
[19] M. Borga. Normalized reinforcement learning, 1995. Unpublished report.
[20] P. Bouthemy and E. François. Motion segmentation and qualitative dynamic
scene analysis from image sequence. Int. Journal of Computer Vision, 10(2):157–
182, 1993.
[21] G. E. Box and G. M. Jenkins. Time Series Analysis: Forecasting and Control.
Holden–Day, San Francisco, 1976.
[22] R. Bracewell. The Fourier Transform and its Applications. McGraw-Hill, 2nd
edition, 1986.
[23] A. Brandt. Multi-level adaptive solutions to boundary-value problems. Mathematics of Computation, 31:333–390, 1977.
[24] V. B. Brooks. The Neural Basis of Motor Control. Oxford University Press, 1986.
[25] P. J. Burt. The pyramid as a structure for efficient computation. In A. Rosenfeld,
editor, Multiresolution Image processing and Analysis, volume 12 of Springer
Series in Information Sciences. Springer-Verlag, New York, 1984.
[26] P. J. Burt, J. R. Bergen, R. Hingorani, R. Kolczynski, W. A. Lee, A. Leung,
J. Lubin, and H. Shvaytser. Object tracking with a moving camera. In Proceedings
IEEE Workshop on Visual Motion, pages 2–12, 1989.
BIBLIOGRAPHY
87
[27] H.I. Christensen, K.W. Bowyer, and H. Bunke, editors. Active robot vision: camera heads, model based navigation and reactive control, volume 6 of Series in
machine perception and artificial intelligence. World Scientific, 1993. ISBN 981-02-1321-2.
[28] H. Collewijn and E. P. Tamminga. Human smooth and saccadic eye movements
during voluntary pursuit of different target motions on different backgrounds. J.
Physiology, 351:217–250, 1984.
[29] J. L. Crowley. Image processing in the SAVA system. In J. L. Crowley and H. I.
Christensen, editors, Vision as Process, ESPRIT Basic Research Series, pages
73–83. Springer-Verlag, 1994.
[30] J. L. Crowley and H. I. Christensen, editors. Vision as Process, ESPRIT Basic
Research Series. Springer-Verlag, 1994. ISBN 3-540-58143-X.
[31] C. J. Duffy and R. H. Wurtz. Sensitivity of MST neurons to optical flow stimuli:
a continuum of response selectivity to large field stimuli. J. Neurophysiology,
65:1329–1345, 1991.
[32] W. Enkelmann. Investigations of multigrid methods for the estimation of optical
flow fields in image sequences. Computer Vision, Graphics and Image Processing, 43:150–177, 1988.
[33] G. Farnebäck. Motion-based Segmentation of Image Sequences. Master’s Thesis
LiTH-ISY-EX-1596, Computer Vision Laboratory, S–581 83 Linköping, Sweden,
May 1996.
[34] C. Fermüller. Passive navigation as a pattern recognition problem. Int. Journal of
Computer Vision, 14:147–158, 1995.
[35] C. Fermüller. Towards a theory of direct perception. In Proc. ARPA Image Understanding Workshop, 1996.
[36] C. Fermüller and Y. Aloimonos. Estimating 3-d motion from image gradients.
Technical Report T.R. CAR-TR-564, Center for Automation Research, University
of Maryland, 1991.
[37] C. Fermüller and Y. Aloimonos. Qualitative egomotion. Int. Journal of Computer
Vision, 15:7–29, 1995.
[38] D. J. Fleet. Measurement of image velocity. Kluwer Academic Publishers, 1992.
ISBN 0–7923–9198–5.
[39] D. J. Fleet and K. Langley. Recursive filters for optical flow. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 17(1):61–67, 1995.
BIBLIOGRAPHY
88
[40] F. Glazer. Multilevel relaxation in low-level computer vision. In A. Rosenfeld,
editor, Multiresolution Image Processing and Analysis, volume 12 of Springer Series in Information Sciences, pages 312–330. Springer-Verlag, New York, 1984.
[41] H. Goldstein. Classical Mechanics. Addison-Wesley, second edition, 1980.
[42] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins
University Press, second edition, 1989.
[43] G. H. Granlund and H. Knutsson. Signal Processing for Computer Vision. Kluwer
Academic Publishers, 1995. ISBN 0-7923-9530-1.
[44] L. Haglund. Adaptive Multidimensional Filtering. PhD thesis, Linköping University, Sweden, S–581 83 Linköping, Sweden, October 1992. Dissertation No
284, ISBN 91–7870–988–1.
[45] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust Statistics: The approach based on influence functions. John Wiley and Sons, New
York, 1986.
[46] R. M. Haralick and L. G. Shapiro.
Addison–Wesley, 1993.
Computer and robot vision, volume 2.
[47] D. J. Heeger. Optical flow from spatiotemporal filters. In First Int. Conf. on
Computer Vision, pages 181–190, London, June 1987.
[48] D. J. Heeger and A. D. Jepson. Subspace methods for recovering rigid motion
I: Algorithm and implementation. Int. Journal of Computer Vision, 7(2):95–117,
Januari 1992.
[49] B. K. P. Horn. Robot vision. The MIT Press, 1986.
[50] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence,
17:185–204, 1981.
[51] I. P. Howard. Human Visual Orientation. John Wiley & Sons, Toronto, 1982.
[52] P. Huber. Robust Statistics. John Wiley and Sons, New York, 1981.
[53] B. Jähne. Motion determination in space-time images. In Image Processing
III, pages 147–152. SPIE Proceedings 1135, International Congress on Optical
Science and Engineering, 1989.
[54] B. Jähne. Motion determination in space-time images. In O. Faugeras, editor,
Computer Vision-ECCV90, pages 161–173. Springer-Verlag, 1990.
[55] B. Jähne. Digital Image Processing: Concepts, Algorithms and Scientific Applications. Springer Verlag, Berlin, Heidelberg, 1991.
BIBLIOGRAPHY
89
[56] B. Jähne. Spatio-Temporal Image Processing: Theory and Scientific Applications.
Springer Verlag, Berlin, Heidelberg, 1993. ISBN 3-540-57418-2.
[57] A. Jepson and M. Black. Mixture models for optical flow. Technical Report
RBCV-TR-93-44, Res. in Biol. and Comp. Vision, Dept. of Comp. Sci., Univ. of
Toronto, 1993.
[58] H. Jozawa. Segment–based video coding using an affine motion model. In Proc.
SPIE Visual Communications and Signal Processing, volume 2308, pages 1605–
1614, Chicago, 1994.
[59] H. Knutsson. Producing a continuous and distance preserving 5-D vector representation of 3-D orientation. In IEEE Computer Society Workshop on Computer
Architecture for Pattern Analysis and Image Database Management - CAPAIDM,
pages 175–182, Miami Beach, Florida, November 1985. IEEE. Report LiTH–
ISY–I–0843, Linköping University, Sweden, 1986.
[60] H. Knutsson. Representing and estimating 3-D orientation using quadrature filters. In Conference Publication No. 265, Second Int. Conf. on Image Processing
and Its Applications, pages 87–91, London, June 1986. IEE, IEE.
[61] H. Knutsson. A tensor representation of 3-D structures. In 5th IEEE-ASSP and
EURASIP Workshop on Multidimensional Signal Processing, Noordwijkerhout,
The Netherlands, September 1987. Poster presentation.
[62] H. Knutsson. Representing local structure using tensors. In The 6th Scandinavian
Conference on Image Analysis, pages 244–251, Oulu, Finland, June 1989. Report
LiTH–ISY–I–1019, Computer Vision Laboratory, Linköping University, Sweden,
1989.
[63] H. Knutsson and M. Andersson. Robust N-Dimensional Orientation Estimation
using Quadrature Filters and Tensor Whitening. In Proceedings of IEEE International Conference on Acoustics, Speech, & Signal Processing, Adelaide, Australia, April 1994. IEEE. LiTH-ISY-R-1798.
[64] H. Knutsson and M. Andersson. Optimization of Sequential Filters. In Proceedings of the SSAB Symposium on Image Analysis, Linköping, Sweden, March
1995. SSAB. LiTH-ISY-R-1797.
[65] H. Knutsson, H. Bårman, and L. Haglund. Robust Orientation Estimation in 2D,
3D and 4D Using Tensors. In Proceedings of Second International Conference on
Automation, Robotics and Computer Vision, ICARCV’92, Singapore, September
1992.
90
BIBLIOGRAPHY
[66] H. Knutsson and G. H. Granlund. Spatio-temporal analysis using tensors. In Sixth
Multidimensional Signal Processing Workshop, page 11, Pacific Grove, California, September 1989. MDSP Technical Committee of the IEEE Acoustics, Speech
and Signal Processing Society, Maple Press. Abstract.
[67] H. Knutsson, M. Hedlund, and G. H. Granlund. Anordning för bestämning av
graden av konstans hos en egenskap för ett område i en i diskreta bildelement
uppdelad bild. Swedish patent 8502570-8 1987, 1986.
[68] H. Knutsson and C-F. Westin. Normalized and Differential Convolution: Methods
for Interpolation and Filtering of Incomplete and Uncertain Data. In Proceedings
of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York City, USA, June 1993. IEEE.
[69] H. Knutsson and C-F. Westin. Normalized Convolution: Technique for Filtering
Incomplete and Uncertain Data. In Proceedings of the 8th Scandinavian Conference on Image Analysis, Tromsö, Norway, May 1993. SCIA, NOBIM, Norwegian Society for Image Processing and Pattern Recognition. Report LiTH–ISY–
I–1528.
[70] H. Knutsson and C-F. Westin. Robust estimation from Sparse Feature Fields. In
Proceedings of EC–US Workshop, Amherst, USA, October 1993.
[71] H. Knutsson, C-F. Westin, and G. H. Granlund. Local Multiscale Frequency
and Bandwidth Estimation. In Proceedings of IEEE International Conference on
Image Processing, Austin, Texas, November 1994. IEEE. Cited in Science: Vol.
269, 29 Sept. 1995.
[72] H. Knutsson, C-F. Westin, and C-J. Westelius. Filtering of Uncertain Irregularly
Sampled Multidimensional Data. In Twenty-seventh Asilomar Conf. on Signals,
Systems & Computers, Pacific Grove, California, USA, November 1993. IEEE.
[73] F. Meyer and P. Bouthemy. Region-based tracking using affine motion models in
long image sequences. CVGIP: Image Understanding, 60(2):119–140, September 1994.
[74] R. Milanese. Focus of attention in human vision: a survey. Technical Report
90.03, Computing Science Center, University of Geneva, Geneva, August 1990.
[75] V. A. Morozov. Regularization methods for ill-posed problems. CRC Press, Boca
Raton, FL, 1993.
[76] D. Murray and A. Basu. Motion tracking with an active camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):449–459, May 1994.
[77] H. H. Nagel. On the estimation of optical flow: Relations between different
approaches and some new results. Artificial Intelligence, 33:299–324, 1987.
BIBLIOGRAPHY
91
[78] H. H. Nagel. On a constraint equation for the estimation of displacement rates
in image sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:13–30, 1989.
[79] M. Z. Nashed. Generalized inverses and applications. Academic Press, New
York, 1976.
[80] R. Nelson. Qualitative detection of motion by a moving observer. Int. Journal of
Computer Vision, 7:33–46, 1991.
[81] P. Nordlund and T. Uhlin. Closing the loop: Pursuing a moving object by a
moving observer. In Proceedings 6th Int. Conf. on Computer Analysis of Images
and Patterns, Prague, Czech Republic, September 1995.
[82] C.R. Rao and S.K. Mitra. Generalized inverse of matrices and its applications.
Wiley and Sons, New York, 1971.
[83] D.A. Robinson. Why visuomotor systems don’t like negative feedback and how
they avoid it. In M. A. Arbib and A. R. Hanson, editors, Vision, Brain, and
Cooperative Computation, pages 89–107. The MIT Press, 1987.
[84] A. Rojer and E. Schwartz. Design considerations for a space-variant visual sensor
with complex-logarithmic geometry. In Proceedings of the International Conference on Pattern Recognition, volume 2, pages 278–285, Philadelphia, Pennsylvania, June 1990.
[85] B. G. Schunck. Image flow segmentation and estimation by constraint line
clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence,
11(10):1010–1027, 1989.
[86] W. von Seelen, S. Bohrer, J. Kopecz, and W. M. Theimer. A neural architecture
for visual information processing. Int. Journal of Computer Vision, 16:229–260,
1995.
[87] J. van der Spiegel, G. Kreider, C. Claeys, I. Debusschere, G. Sandini, P. Dario,
F. Fantini, P. Bellutti, and G. Soncini. A foveated retina-like sensor using CCD
technology. In C. Mead and M. Ismael, editors, Analog VLSI implementation of
neural systems. Kluwer, 1989.
[88] K. Tanaka and H. A. Saito. Analysis of motion of the visual field by direction, expansion/contraction, and rotation cells illustrated in the dorsal part of the Medial
Superior Temporal area of the macaque monkey. J. Neurophysiology, 62:626–
641, 1989.
[89] D. Terzopoulos. Image analysis using multigrid relaxation methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8:129–139, 1986.
92
BIBLIOGRAPHY
[90] W. Thompson and T. C. Pong. Detecting moving objects. Int. Journal of Computer Vision, 4:39–57, 1990.
[91] A. N. Tikhonov and V. Y. Arsenin. Solutions of ill-posed problems. W. H. Winston, Washington, DC, 1977.
[92] S. Tölg. Gaze control for an active camera by modeling human pursuit eye movements. In Proceedings SPIE Conf. on Intelligent Robots and Computer Vision XI,
volume 1825, pages 585–598. SPIE, 1992.
[93] G. T. Toussaint, editor. Computational Morphology – A Computational Geometric Approach to the Analysis of Form. North–Holland, 1988.
[94] T. Uhlin, P. Nordlund, A. Maki, and J.-O. Eklundh. Towards an active visual
observer. In Proc. 5th Int. Conference on Computer Vision, pages 679–686, Cambridge, MA, June 1995.
[95] S. Uras, F. Girosi, A. Verri, and V. Torre. A computational approach to motion
perception. Biological Cybernetics, pages 79–97, 1988.
[96] R. C. Veltkamp. Closed Object Boundaries from Scattered Points. Springer–
Verlag, 1994.
[97] A. Verri and T. Poggio. Against quantitative optical flow. In Proceedings 1st
ICCV. IEEE Computer Society Press, 1987.
[98] A. Verri and T. Poggio. Motion field and optical flow: Qualitative properties.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(5):490–498,
1989.
[99] H. T. Wang, B. Mathur, and C. Koch. A multiscale adaptive network model of
motion computation in primates. In Advances in Neural Information Processing
Systems, San Mateo, CA, 1991. Morgan Kaufmann.
[100] J. Y. A. Wang and E. H. Adelson. Spatio-temporal segmentation of video data. In
Proceedings of the SPIE Conference: Image and Video Processing II, February
1994.
[101] C-J. Westelius. Focus of Attention and Gaze Control for Robot Vision. PhD thesis,
Linköping University, Sweden, S–581 83 Linköping, Sweden, 1995. Dissertation
No 379, ISBN 91–7871–530–X.
[102] C-J. Westelius, J. Wiklund, and C-F. Westin. Prototyping, Visualization and Simulation Using the Application Visualization System. In H. I. Christensen and
J.L. Crowley, editors, Experimental Environments for Computer Vision and Image Processing, volume 11 of Series on Machine Perception and Artificial Intelligence, pages 33–62. World Scientific Publisher, 1994. ISBN 981-02-1510-X.
BIBLIOGRAPHY
93
[103] C-F. Westin. A Tensor Framework for Multidimensional Signal Processing. PhD
thesis, Linköping University, Sweden, S–581 83 Linköping, Sweden, 1994. Dissertation No 348, ISBN 91–7871–421–4.
[104] C-F. Westin and H. Knutsson. Extraction of Local Symmetries Using Tensor
Field Filtering. In Proceedings of 2nd Singapore International Conference on
Image Processing, Singapore, September 1992. IEEE Singapore Section. LiTH–
ISY–R–1515, Linköping University, Sweden.
[105] C-F. Westin and H. Knutsson. Processing Incomplete and Uncertain Data using
Subspace Methods. In Proceedings of 12th International Conference on Pattern
Recognition, Jerusalem, Israel, October 1994. IAPR.
[106] C-F. Westin, K. Nordberg, and H. Knutsson. On the Equivalence of Normalized
Convolution and Normalized Differential Convolution. In Proceedings of IEEE
International Conference on Acoustics, Speech, & Signal Processing, Adelaide,
Australia, April 1994. IEEE.
[107] C-F. Westin, C-J. Westelius, H. Knutsson, and G. Granlund. Attention Control
for Robot Vision. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, California, June 1996. IEEE
Computer Society Press.
[108] R. H. Wurtz, H. Komatsu, M. R. Dürsteler, and D. S. Yamasaki. Motion to movement: Cerebral cortical visual processing for pursuit eye movements. In G. M.
Edelman, W. E. Gall, and W. M. Cowan, editors, Signal and Sense: Local and
Global Order in Perceptual Maps, chapter 10, pages 233–260. Wiley–Liss, 1990.
[109] L. R. Young. Pursuit eye tracking movements. In P. Bach y Rita, C. C. Collins,
and J. E. Hyde, editors, Control of Eye Movements. Academic Press, 1971.
[110] A. L. Yuille and N. M. Grzywacz. A computational theory for the perception of
coherent visual motion. Nature, 333:71–74, 1988.
I would like to present a medal of valour to the patient reader who has managed to
plough through the volume. Experience shows that a long–term exposure to the material
is accompanied by feelings of confusion, frustration, and, finally, resignation.
The author