Linköping Studies in Science and Technology. Dissertations
No. 379

Focus of Attention and Gaze Control for Robot Vision

Carl-Johan Westelius

Department of Electrical Engineering
Linköping University, S-581 83 Linköping, Sweden

Linköping 1995
Abstract
This thesis deals with focus of attention control in active vision systems. A framework for hierarchical gaze control in a robot vision system is presented, and an implementation for a simulated robot is described. The robot is equipped with a heterogeneously sampled imaging system, a fovea, resembling the spatially varying resolution of a human retina. The relation between foveas and multiresolution image processing, as well as implications for image operations, is discussed.

A stereo algorithm based on local phase differences is presented, both as a stand-alone algorithm and as a part of a robot vergence control system. The algorithm is fast and can handle large disparities while maintaining subpixel accuracy. The method produces robust and accurate estimates of displacement on synthetic as well as real-life stereo images. Disparity filter design is discussed and a number of filters are tested, e.g. Gabor filters and lognorm quadrature filters. A design method for disparity filters having precisely one phase cycle is also presented.

A theory for sequentially defined, data modified focus of attention is presented. The theory is applied to a preattentive gaze control system consisting of three cooperating control strategies. The first is an object finder that uses circular symmetries as indications of possible objects and directs the fixation point accordingly. The second is an edge tracker that makes the fixation point follow structures in the scene. The third is a camera vergence control system which ensures that both eyes fixate on the same point. The coordination between the strategies is handled using potential fields in the robot parameter space.

Finally, a new focus of attention method for disregarding filter responses from already modelled structures is presented. The method is based on a filtering method, normalized convolution, originally developed for filtering incomplete and uncertain data. By setting the certainty of the input data to zero in areas of known or predicted signals, a purposive removal of operator responses can be obtained. On succeeding levels, image features from these areas become 'invisible' and consequently do not attract the attention of the system. This technique also allows the system to effectively explore new events: by cancelling known, or modelled, signals the attention of the system is shifted to new events not yet described.
PREFACE
This thesis is based on the following material:
C-J Westelius, H. Knutsson, and G. H. Granlund. Focus of attention control. In Proceedings of the 7th Scandinavian Conference on Image Analysis, pages 667-674, Aalborg, Denmark, August 1991. Pattern Recognition Society of Denmark.

C-J Westelius, H. Knutsson, and G. H. Granlund. Preattentive gaze control for robot vision. In Proceedings of the Third International Conference on Visual Search. Taylor and Francis, 1992.

J. Wiklund, C-J Westelius, and H. Knutsson. Hierarchical phase based disparity estimation. In Proceedings of the 2nd Singapore International Conference on Image Processing. IEEE Singapore Section, September 1992.

H. Knutsson, C-F Westin, and C-J Westelius. Filtering of uncertain irregularly sampled multidimensional data. In Twenty-seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, California, USA, November 1993. IEEE.

G. H. Granlund, H. Knutsson, C-J Westelius, and J. Wiklund. Issues in robot vision. Image and Vision Computing, 12(3):131-148, April 1994.

C-J. Westelius and H. Knutsson. Hierarchical disparity estimation using quadrature filter phase. International Journal on Computer Vision, 1995. Special issue on stereo (submitted).

C-J. Westelius, C-F. Westin, and H. Knutsson. Focus of attention mechanisms using normalized convolution. IEEE Trans. on Robotics and Automation, 1996. Special section on robot vision (submitted).
Material related to this work but not explicitly reviewed in this
thesis:
C-J Westelius and C-F Westin. Representation of colour in image processing. In Proceedings of the SSAB Conference on Image Analysis, Gothenburg, Sweden, March 1989. SSAB.

C-J Westelius and C-F Westin. A colour representation for scale-spaces. In The 6th Scandinavian Conference on Image Analysis, pages 890-893, Oulu, Finland, June 1989.

C-J Westelius, G. H. Granlund, and H. Knutsson. Model projection in a feature hierarchy. In Proceedings of the SSAB Symposium on Image Analysis, pages 244-247, Linköping, Sweden, March 1990. SSAB. Report LiTH-ISY-I-1090, Linköping University, Sweden, 1990.

M. Gökstorp and C-J. Westelius. Multiresolution disparity estimation. In Proceedings of the 9th Scandinavian Conference on Image Analysis, Uppsala, Sweden, June 1995. SCIA.

J. Karlholm, C-J. Westelius, C-F. Westin, and H. Knutsson. Object tracking based on the orientation tensor concept. In Proceedings of the 9th Scandinavian Conference on Image Analysis, Uppsala, Sweden, June 1995. SCIA.
Contributions in books and collections:
C-J Westelius, H. Knutsson, J. Wiklund, and C-F Westin. Phase-based disparity estimation. In J. L. Crowley and H. I. Christensen, editors, Vision as Process, pages 179-192. Springer-Verlag, 1994. ISBN 3-540-58143-X.

C-J Westelius, H. Knutsson, and G. Granlund. Low level focus of attention. In J. L. Crowley and H. I. Christensen, editors, Vision as Process, pages 157-178. Springer-Verlag, 1994. ISBN 3-540-58143-X.

C-J Westelius, J. Wiklund, and C-F Westin. Prototyping, visualization and simulation using the Application Visualization System. In H. I. Christensen and J. L. Crowley, editors, Experimental Environments for Computer Vision and Image Processing, volume 11 of Series on Machine Perception and Artificial Intelligence, pages 33-62. World Scientific Publisher, 1994. ISBN 981-02-1510-X.

C-J Westelius. Local Phase Estimation. In G. H. Granlund and H. Knutsson, principal authors, Signal Processing for Computer Vision, pages 259-278. Kluwer Academic Publishers, 1995. ISBN 0-7923-9530-1.
Acknowledgements
Although my name alone is printed on the cover of this thesis, there are a
number of people who, in one way or another, have a part in its realization.
First of all, I would like to thank all the members of the Computer Vision
Laboratory for being jolly good fellows. I will miss the weekly chats in
the sauna (and the beer too).
I thank my supervisor, Dr. Hans Knutsson, for his enthusiastic help, without which this thesis would have been ready much sooner, but with much poorer quality. His intuition never ceases to astonish me.

I thank Prof. Gösta Granlund for giving me the opportunity to work in his group and for sharing ideas and visions about vision.
I would like to give Catharina Holmgren a distinguished services medal
for proof-reading this thesis over and over again. It must be extremely
boring to read something you are not interested in and correct the same
kind of mistakes all the time.
I thank Dr. Klas Nordberg for taking the time to read and comment on this thesis. What Catharina did with the language, Klas did with the technical content.

I would also like to express my gratitude to Dr. Carl-Fredrik Westin, my friend and colleague, for all his support, both scientifically and morally.

My special thanks to everybody in the "Vision as Process" consortium.
It has been very stimulating to work with VAP. Many of the activities
related to VAP have made the PhD-studies worthwhile (including the
yearly pre-demo-panics).
Finally, there is someone who eventually accepted that "soon" means somewhere between now and eternity. Thank you, Brita, for being, for caring, for loving. I promise: No more PhD theses for me!
Contents

1 INTRODUCTION AND OVERVIEW
  1.1 Background
  1.2 Overview

2 LOCAL PHASE ESTIMATION
  2.1 What is local phase?
  2.2 Singular points in phase scale-space
  2.3 Choice of filters
      2.3.1 Creating a phase scale-space
      2.3.2 Gabor filters
      2.3.3 Quadrature filters
      2.3.4 Other even-odd pairs
      2.3.5 Discussion on filter choice

3 PHASE-BASED DISPARITY ESTIMATION
  3.1 Introduction
  3.2 Disparity estimation
      3.2.1 Computation structure
      3.2.2 Edge extraction
      3.2.3 Local image shifts
      3.2.4 Disparity estimation
      3.2.5 Edge and grey level image consistency
      3.2.6 Disparity accumulation
      3.2.7 Spatial consistency
  3.3 Experimental results
      3.3.1 Generating stereo image pairs
      3.3.2 Statistics
      3.3.3 Increasing number of resolution levels
      3.3.4 Increasing maximum disparity
      3.3.5 Combining line and grey level results
      3.3.6 Results on natural images
  3.4 Conclusion
  3.5 Further research

4 HIERARCHICAL DATA-DRIVEN FOCUS OF ATTENTION
  4.1 Introduction
      4.1.1 Human focus of attention
      4.1.2 Machine focus of attention
  4.2 Space-variant sampled image sensors: Foveas
      4.2.1 What is a fovea?
      4.2.2 Creating a log-Cartesian fovea
      4.2.3 Image operations in a fovea
  4.3 Sequentially defined, data modified focus of attention
      4.3.1 Control mechanism components
      4.3.2 The concept of nested regions of interest
  4.4 Gaze control
      4.4.1 System description
      4.4.2 Control hierarchy
      4.4.3 Disparity estimation and camera vergence
      4.4.4 The edge tracker
      4.4.5 The object finder
      4.4.6 Model acquisition and memory
      4.4.7 System states and state transitions
      4.4.8 Calculating camera orientation parameters
  4.5 Experimental results

5 ATTENTION CONTROL USING NORMALIZED CONVOLUTION
  5.1 Introduction
  5.2 Normalized convolution
  5.3 Quadrature filters for normalized convolution
      5.3.1 Quadrature filters for NC using real basis functions
      5.3.2 Quadrature filters for NC using complex basis functions
      5.3.3 Real or complex basis functions?
  5.4 Model-based habituation/inhibition
      5.4.1 Saccade compensation
      5.4.2 Inhibition of the robot arm influence on low level image processing
      5.4.3 Inhibition of modeled objects
      5.4.4 Combining certainty masks
  5.5 Discussion

6 ROBOT AND ENVIRONMENT SIMULATOR
  6.1 General description of the AVS software
      6.1.1 Module Libraries
  6.2 Robot vision simulator modules
  6.3 Example of an experiment
      6.3.1 Macro modules
  6.4 Simulation versus reality
  6.5 Summary

A AVS PROBLEMS AND PITFALLS
  A.1 Module scheduling problems
  A.2 Texture mapping problems
1 INTRODUCTION AND OVERVIEW
1.1 Background
A traditional view of a computer vision system has been that it is an analyzing system at one end and a responding one at the other, as illustrated in Figure 1.1. The analyzing part supplies a model of the three-dimensional world derived from two-dimensional images. The world model is then used by the responding part for action planning. Vision is considered to be a pre-action stage. The vision algorithms have to furnish a world model in fine detail. Every feature that the action planning system might need has to be calculated. The close relationship between analysis and response is not utilized.

Figure 1.1: The classical pipelined structure of a robot vision system (image, analysis, response generation).
As an answer to this, the active vision paradigm has been developed over some ten years [6, 5, 50, 3, 4]. In short, active vision is based on the ability of the perceiving system to purposively change both external and internal image formation parameters, e.g. fixation point, focal length, etc. Instead of squeezing every bit of information out of every image, the active vision system picks the bits that are easy to estimate in the continuous flow of images. The system adapts its behavior in order to get the bits of information that are important at the moment. This possibility to solve otherwise ill-posed problems, in combination with an appealing similarity to biological systems, has thrilled the imagination of many researchers (including the author). The problem that arises is how to control the perception. How are the purposive actions generated? Clearly, the structure in Figure 1.1 is not appropriate for an active vision system.

The work at the Computer Vision Laboratory at Linköping University is aimed at an integrated analysis-response system where general responses are modified by data, and data is actively sought using proper responses [33]. One important property is that sufficiently complex and data-driven responses are built up by letting a general response command, invoked from higher levels, be modified by processed data entering from lower levels, to produce a specific command for the specific situation in which the system currently operates. Action commands also have an impact on input feature extraction and interpretation, e.g. the interpretation of optical flow is different when the head is moving from when it is still.

The computing structure can be thought of as a pyramid with sensor inputs and actuator outputs at the base (Figure 1.2 on the facing page). The input information enters the system, and features of increasing abstraction are estimated as the information flows upwards. The particular advantage of this structure is that the output produced from the system leaves the pyramid at the same lowest level as the input enters. This arrangement enables the interaction between input data analysis and output response synthesis.
This thesis discusses focus of attention and gaze control for active vision
systems in this context. It should be emphasized that the algorithms
described here are biologically inspired but not an attempt to model biological systems.
Figure 1.2: An integrated analysis-response structure. To the left, the signals from the sensor inputs are processed into descriptions of increasing abstraction. To the right, general response commands are gradually refined into situation-specific actions.
1.2 Overview
Chapters 2 and 3 deal with estimation of disparity using local phase and are based on [85, 35, 73]. The local phase is explained, its invariances and equivariances are described, and its behavior in a scale-space is elaborated on in Section 2.1. A number of different types of phase estimating filters are described and tested with respect to scale-space behavior in Section 2.3. In Chapter 3, a hierarchical algorithm for phase-based disparity estimation that can handle large disparities and still give subpixel accuracy is described. In Section 3.3, the filter dependence of the disparity algorithm behavior is tested and evaluated.

In Chapter 4 a framework for a hierarchical approach to gaze control of a robot vision system is presented, and an implementation on a simulated robot is also described. The robot has a three-layer hierarchical gaze control system based on rotation symmetries, linear structures and disparity. It is equipped with heterogeneously sampled imaging systems, foveas, resembling the space-varying resolution of a human retina. The relation between the fovea and multiresolution image processing is discussed together with implications for image operations. The chapter is based on [75, 76, 35].

Chapter 5 deals with how to implement a habituation function in order to reduce the impact of known or modeled image structures on data-driven focus of attention. Using a technique termed 'normalized convolution' when extracting the image features allows for marking areas of the input image as unimportant. The image features from these areas then become 'invisible' and consequently do not attract the attention of the system, which is the desired behavior of a habituation function. Chapter 5 is published in [80, 49].
Finally, Chapter 6 describes the robot simulator that is used in the experiments throughout this thesis.
2 LOCAL PHASE ESTIMATION

2.1 What is local phase?

Most people are familiar with the global Fourier phase. The shift theorem, describing how the Fourier phase is affected by moving the signal, is common knowledge. But the phase in signal representations based on local operations, e.g. lognormal filters [44], is not so well known.

The local phase has a number of interesting invariance and equivariance properties that make it an important feature in image processing.

Local phase estimates are invariant to signal energy. The phase varies in the same manner regardless of whether there are small or large signal variations. This feature makes phase estimates suitable for matching, since it reduces the need for camera exposure calibration and illumination control.

Local phase estimates and spatial position are equivariant. The local phase generally varies smoothly and monotonically with the position of the signal, except for the modulo $2\pi$ wrap-around. Section 2.2 discusses cases where the local phase behaves differently. Furthermore, it is a continuous variable that can measure changes much smaller than the spatial quantization, enabling subpixel accuracy without a subpixel representation of image features.
Phase is stable against scaling. It has been shown that phase is stable
against scaling up to 20 percent [27].
The spatial derivative of local phase estimates is equivariant with
spatial frequency. In high frequency areas the phase changes faster than
in low frequency areas. The slope of the phase curve is therefore steep
for high frequencies. The phase derivative is called local or instantaneous
frequency [10].
There are many ways to approach the concept of local phase. One way is to start from the analytic function of a signal and design filters that locally estimate the instantaneous phase of the analytic function [10]. An alternative approach, used in this chapter, is to relate local phase to the detection of lines and edges in images. This chapter discusses one-dimensional signals. The extension of the concept of phase into two or more dimensions is discussed in Section 4.4.

Figure 2.1 on the next page shows the intensity profile over a number of lines and edges. The lines and edges are called events in the rest of this chapter. For illustration purposes the ideal step and Dirac functions have been blurred more than what corresponds to the normal fuzziness of a naturalistic image. The low pass filter used is a Gaussian with $\sigma = 1.8$ pixels.
When designing the filters for line and edge detection it is important that they are insensitive to the DC component in the image, since flat surfaces are of no interest for edge and line detection. A simple line detector is:

$$h_{line}(\xi) = -\delta(\xi + 1) + 2\delta(\xi) - \delta(\xi - 1) \qquad (2.1)$$

However, this filter has a frequency too high to fit the frequency spectrum of the signal in Figure 2.1. Convolving the filter with a Gaussian, $\sigma = 2.8$, tunes the filter to the appropriate frequency band (left of Figure 2.2 on page 8). The problem is to design an edge filter that "matches" the line filter. There are two requirements on an edge/line filter pair:

1. Detection of both lines and edges with equal localization acuity.

2. Discrimination between the types of events.

Figure 2.1: Intensity profiles for a bright line on a dark background at position $\xi = 20$, an edge from dark to bright at position $\xi = 60$, a dark line on a bright background at position $\xi = 100$, and an edge from bright to dark at position $\xi = 140$. All lines and edges are ideal functions blurred with a Gaussian ($\sigma = 1.8$).
Is there a formal way to define a line/edge filter pair such that these requirements are met?

The answer is yes. In order to see how to generate such a filter pair, study the properties of lines and edges centered in a window. Setting the origin to the center of the window reveals that lines are even functions, i.e. $f(-\xi) = f(\xi)$. Thus, lines have an even real Fourier transform. Edges are odd functions plus a DC term. The DC term can be neglected without loss of generality since neither the line nor the edge filters should be sensitive to it. Thus, consider edges simply as odd functions, i.e. $f(-\xi) = -f(\xi)$, having an odd and imaginary transform.

Now, take a line, $f_{line}(\xi)$, and an edge, $f_{edge}(\xi)$, with exactly the same magnitude function in the Fourier domain,

$$\|F_{edge}(u)\| = \|F_{line}(u)\|. \qquad (2.2)$$

For such signals the line and edge filters should give identical outputs when applied to their respective target events,

$$H_{edge}(u)F_{edge}(u) = H_{line}(u)F_{line}(u). \qquad (2.3)$$

Combining Equations (2.2) and (2.3) gives:

$$\|H_{edge}(u)\| = \|H_{line}(u)\|. \qquad (2.4)$$

Equation (2.4), in combination with the fact that the line filter is an even function with an even real Fourier transform, while the edge filter is an odd function having an odd and imaginary Fourier transform, shows that an edge filter can be generated from a line filter using the Hilbert transform:

$$H_{edge}(u) = \begin{cases} -iH_{line}(u) & \text{if } u < 0 \\ iH_{line}(u) & \text{if } u \geq 0 \end{cases} \qquad (2.5)$$
Figure 2.2: The line detector (left) and its Hilbert transform as edge detector (right).

The line and edge detectors in Equation (2.5) are both real-valued, which makes it possible to combine them into a complex filter with the line filter as the real part and the edge filter as the imaginary part:

$$h(\xi) = h_{line}(\xi) - ih_{edge}(\xi). \qquad (2.6)$$

The phase is then represented as a complex value where the magnitude reflects the signal energy and the argument reflects the relationship between the evenness and oddness of the signal (Figure 2.3).
A filter fulfilling Equations (2.5) and (2.6) is called a quadrature filter and is, in fact, the analytic signal of the line filter. Figure 2.4 shows that the output magnitude from a quadrature filter depends only on the signal energy and on how well the signal matches the filter pass band, and not on whether the signal is even, odd or a mixture thereof. The filter phase, on the other hand, depends on the relation between evenness and oddness of the signal relative to the filter center. The phase is not affected by signal energy. The polar plots at the bottom show the trajectory of the phase vector in Figure 2.3 when traversing the neighborhood around each event. Note how the phase value points out the type of event when the magnitude has a peak value.

Figure 2.3: A representation of local phase as a complex vector where the magnitude reflects the signal energy and the argument reflects the evenness and oddness relationship.

How is the phase from the quadrature filter related to the instantaneous phase of the analytic function of the signal? It is easy to show that convolving a signal with a quadrature filter is the same as convolving the analytic function of the signal with the real part of the filter:

$$\begin{aligned}
h(\xi) * f(\xi) &= (h_{line}(\xi) - ih_{edge}(\xi)) * f(\xi) \\
&= h_{line}(\xi) * f(\xi) - i\,\mathcal{H}\{h_{line}(\xi)\} * f(\xi) \\
&= h_{line}(\xi) * f(\xi) - i\,h_{line}(\xi) * \mathcal{H}\{f(\xi)\} \\
&= h_{line}(\xi) * \left(f(\xi) - i\,\mathcal{H}\{f(\xi)\}\right) = h_{line}(\xi) * f_A(\xi). \qquad (2.7)
\end{aligned}$$

Since $h_{line}$ is a real filter sensitive to changes in a signal and $f_A$ is a signal with continuously changing phase, the phase of the filter output is an estimate of the instantaneous phase. Narrow band filters generally estimate phase better than broadband filters. If the signal is a sine function with a constant magnitude, the instantaneous phase will be estimated exactly.
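The chain from Equation (2.1) to Equation (2.6) is easy to make concrete. The following is a minimal numpy sketch (not code from the thesis; the smoothing widths and the test events are assumptions chosen to mimic Figures 2.1 and 2.4): a smoothed line filter is built, its Hilbert transform gives the edge filter, and the resulting complex filter is applied to a signal containing one line and one edge.

import numpy as np

def gaussian(n, sigma):
    # normalized Gaussian window of odd length n
    x = np.arange(n) - n // 2
    g = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return g / g.sum()

# Line filter: the kernel of Equation (2.1) smoothed with a Gaussian (sigma = 2.8 assumed).
N = 31
h_line = np.convolve([-1.0, 2.0, -1.0], gaussian(N, 2.8), mode="same")
h_line -= h_line.mean()                      # make sure the DC component is zero

# Edge filter as the Hilbert transform of the line filter, Equation (2.5).
H_line = np.fft.fft(h_line)
u = np.fft.fftfreq(N)
h_edge = np.real(np.fft.ifft(1j * np.sign(u) * H_line))

# Complex quadrature filter, Equation (2.6).
h = h_line - 1j * h_edge

# Test signal: a blurred bright line at position 64 and a blurred dark-to-bright edge at 128.
f = np.zeros(256)
f[64] = 1.0
f[128:] = 1.0
f = np.convolve(f, gaussian(15, 1.8), mode="same")

q = np.convolve(f, h, mode="same")           # quadrature filter response
magnitude, phase = np.abs(q), np.angle(q)
print(phase[64], phase[128])                 # roughly 0 for the line, +/- pi/2 for the edge

As in Figure 2.4, the magnitude peaks at both events while the argument separates line from edge; the sign of the edge phase depends on the chosen transform conventions.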
Figure 2.4: Line and edge detection using the quadrature filter in Figure 2.2 on page 8. Top: The input image. Second: The magnitude of the quadrature filter output has one peak for each event, and the peak value depends only on the signal energy and how well the signal fits the filter pass band. Third: The phase of the quadrature filter output indicates the kind of event. Bright lines have $\theta = 0$, dark lines have $\theta = \pi$, dark to bright edges have $\theta = \pi/2$, and bright to dark edges have $\theta = -\pi/2$. Bottom: Polar plots showing the phase vector in a neighborhood around the lines and edges.
2.2 Singular points in phase scale-space
The phase is generally stable in scale-space. There are, however, points
around which the phase has an unwanted behavior. At these points, the
analytic signal goes through the origin of the complex plane, i.e. they are
singular points of the analytic function. The invariances and equivariances
described in Section 2.1 are generally not valid if the analytic signal is close
to a singular point.
Figure 2.5 on the next page shows a stylized example of a phase resolution pyramid. Assume we have an analytic signal consisting of two positive frequencies $u_0$ and $3u_0$:

$$F_A(u) = \delta(u - u_0) + 2\delta(u - 3u_0) \qquad (2.8)$$

$$f_A(\xi) = e^{iu_0\xi} + 2e^{i3u_0\xi} \qquad (2.9)$$

Now suppose that we create the resolution pyramid using a filter that attenuates the high frequency part of the signal by a factor $1/\sqrt{2}$ but leaves the low frequency part unaffected. Since the signal is periodic, the behavior of the signal can be studied using a polar plot of the signal vector.

Figure 2.5a shows a polar plot of the original signal vector. Its magnitude is fairly large and the vector runs counter-clockwise all the time. The phase is therefore monotonous and increasing, except for the wrap-around caused by the modulo $2\pi$ representation (Figure 2.5b). The local frequency, i.e. the slope, is positive and almost constant. In Figure 2.5c the amplitude of the high frequency part of the signal is reduced and the signal vector now comes close to the origin at certain points. The local frequency is much higher when the signal vector passes close to the origin, causing the phase curve to bend, but it is still monotonous and increasing. Further LP-filtering causes the signal vector to go through the origin (Figure 2.5e). At these points, the phase jumps discontinuously and the local frequency becomes impulsive. In Figure 2.5g, the signal vector moves clockwise when going through the small loops. This means that the phase decreases and that the local frequency is negative (Figure 2.5h).
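This scenario is easy to reproduce numerically. Below is a small numpy sketch (not the thesis implementation; the 100-sample period is an assumption) that builds the signal of Equation (2.9), attenuates the high frequency term by $1/\sqrt{2}$ per level and reports the minimum magnitude and the minimum phase slope at each level.

import numpy as np

u0 = 2 * np.pi / 100.0                     # fundamental frequency (assumed period: 100 samples)
xi = np.arange(400)

for level in range(4):
    a = 2.0 / np.sqrt(2.0) ** level        # amplitude of the 3*u0 component
    fA = np.exp(1j * u0 * xi) + a * np.exp(1j * 3 * u0 * xi)   # Equation (2.9), attenuated
    phase = np.unwrap(np.angle(fA))
    local_freq = np.diff(phase)            # slope of the phase curve
    print(f"level {level}: min |f_A| = {np.abs(fA).min():.3f}, "
          f"min phase slope = {local_freq.min():.3f}")

The minimum magnitude reaches zero when the two amplitudes are equal, i.e. the trajectory passes through the singular points of Figure 2.5e and the phase jumps, and at the last level the phase slope is genuinely negative over small intervals, as in Figure 2.5g.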
Figure 2.5: The periodic signal in Equation (2.9) is LP filtered in four steps, attenuating the high frequency part of the signal. The left column shows polar plots of the signal vector and the right column shows the phase for the same signal. a) The signal vector circles the origin at a distance. b) The slope of the phase plot, i.e. the local frequency, is positive and almost constant. c) The signal vector rounds the origin closely. d) The phase curve bends, which means that the frequency is locally very high. e) The signal vector goes through the origin, i.e. singular points. f) The local frequency is impulsive. g) The signal vector goes through small loops without rounding the origin. h) The phase curve bends downward, which means that the frequency is locally negative.
This behavior of the phase in scale-space is due to the fact that the high-frequency part of the signal disappears at lower resolution. In Figure 2.5b the phase has three cycles for each period of the signal since the high-frequency part of the signal dominates. In Figure 2.5h, on the other hand, there is only one phase cycle per signal period since the high-frequency part is attenuated.

To avoid singular points we can avoid considering points with very low magnitude, since the magnitude is zero at the singular points. Unfortunately, it is not that simple. The impact of a singular point is spread in scale: negative frequencies at coarser resolution and very high frequencies at finer resolution. At these points, the magnitude cannot be neglected, and a high enough threshold also cuts out many useful phase estimates. Fleet describes how singular points can be detected and how their influence can be reduced [26]. A method that uses line images in combination with the original images to reduce the influence of singular points is described in Chapter 3.
2.3 Choice of filters

When designing filters to be used as disparity estimators, there are a number of requirements, some of which are mutually exclusive, to be considered. Different filter types have different characteristics, and which one to use depends on the application.

There are a number of different filters that can be used when measuring phase disparities. Gabor filters are, by far, the most commonly used in phase-based disparity measurement. They have linear phase, i.e. constant local frequency, and are therefore intuitively appealing to use. Quadrature filters do not have any negative frequency components nor any DC component. Differences of Gaussians approximating the first and second derivative of a Gaussian can also be used to estimate phase.
Figure 2.6: The signal that is used to test the scale-space behavior of the filters.
Below a number of filters are evaluated with regard to the following requirements:

No DC component. The filters must not have a DC component. Figure 2.7 on the facing page shows how a DC component makes the signal vector wag back and forth instead of going round.

No wrap-around. It is desirable, though not necessary, that the phase of the impulse response runs from $-\pi$ to $\pi$ without any wrap-around. This maximizes the maximal measurable disparity for a given size of the filter.

Monotonous phase. The phase has to be monotonous, otherwise the phase difference between left and right images is not a one-to-one function of the disparity. Below, the phase is called monotonous even though it might wrap around, since the wrap-around is caused by the modulo $2\pi$ representation.

Only one half-plane of the frequency domain. It is also a requisite that the filter only picks up frequencies in one half-plane of the frequency domain. This is a quadrature requirement, which means that the phase must rotate in the same direction for all frequencies. If this does not apply, the phase differences might change sign depending on the frequency content of the signal.
Figure 2.7: Above, the phase from a Gabor filter with no DC component ($u_0 = 0.76$, bandwidth 0.5, DC $\approx 1.3 \cdot 10^{-8}$). Below, the phase from a Gabor filter with broader bandwidth and thus a DC component ($u_0 = 0.76$, bandwidth 1.2, DC $\approx 0.032$), applied to the signal in Figure 2.6. Note that the phase goes back and forth instead of wrapping around when the signal fluctuation is small compared to the DC level.
Insensitive to singular points. The area affected by the singular points has to be as small as possible, both spatially and in scale. As a rule of thumb, the sensitivity to singular points decreases with decreasing bandwidth. This requirement is contradictory to the requirement of small spatial support.

Small spatial support. The computational cost of the convolution is proportional to the spatial support of the filter function, i.e. the size of the filter, which therefore should be small.
2.3.1 Creating a phase scale-space
The behavior of the phase in scale-space has been tested using the signal shown in Figure 2.6 on page 14 as input. All filtering has been done in the Fourier domain. For each filter type the DFT of the filter with the highest center frequency has been generated using its definition. The frequency function has then been multiplied by a LP Gaussian function:

$$LP(u) = e^{-\frac{u^2}{2\sigma_u^2}} \qquad (2.10)$$

where $\sigma_u = \pi/(2\sqrt{2})$. This emulates the LP-filtering in a subsampled resolution pyramid. It can be argued that the LP filtering should not be used at the highest resolution level, but it can be motivated by taking the smoothing effects of the imaging system into account. The filter function for each level is calculated by scaling the frequency functions appropriately:

$$F_{u_1}(u) = F_{u_0}\!\left(\frac{u_0}{u_1}\,u\right) \qquad (2.11)$$

Using linear interpolation between nearest neighbors enables non-integer scaling. Again, the method is chosen to resemble a subsampled resolution pyramid. Generating new filters for each scale will give better, but larger, filters, and it will not correspond to a subsampled resolution pyramid.
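A possible realization of this procedure is sketched below (illustrative numpy code, not from the thesis; the Gabor prototype, its parameters and the random test signal are assumptions, and the same machinery applies to any of the filter types in this section): the prototype DFT is multiplied by the LP Gaussian of Equation (2.10) and rescaled along the frequency axis according to Equation (2.11) for each half-octave level.

import numpy as np

N = 256
u = np.fft.fftshift(2 * np.pi * np.fft.fftfreq(N))   # increasing frequency axis, -pi .. pi

def gabor_freq(u, u0, sigma_u):
    # Gabor filter in the frequency domain, Equation (2.13)
    return np.exp(-(u - u0) ** 2 / (2.0 * sigma_u ** 2))

u_top = np.pi / 2                       # highest center frequency (assumed)
sigma_u = 0.3 * u_top                   # frequency standard deviation (assumed)

# LP Gaussian of Equation (2.10) with sigma_u = pi/(2*sqrt(2)), emulating the
# smoothing of a subsampled resolution pyramid.
LP = np.exp(-u ** 2 / (2.0 * (np.pi / (2.0 * np.sqrt(2.0))) ** 2))
F_top = gabor_freq(u, u_top, sigma_u) * LP

rng = np.random.default_rng(0)
signal = rng.standard_normal(N)         # stand-in for the test signal of Figure 2.6
S = np.fft.fftshift(np.fft.fft(signal))

scale_space = []
for octave in np.arange(0.0, 4.5, 0.5):
    u1 = u_top / 2.0 ** octave          # center frequency of this level
    # Equation (2.11): F_u1(u) = F_u0(u * u0/u1), linear interpolation between samples
    F_level = np.interp(u * (u_top / u1), u, F_top, left=0.0, right=0.0)
    q = np.fft.ifft(np.fft.ifftshift(S * F_level))    # filtering done in the Fourier domain
    scale_space.append(np.angle(q))     # one row of an isophase plot like Figure 2.9

Stacking the rows in scale_space gives the kind of isophase diagrams shown for each filter family below.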
2.3.2 Gabor filters

In the literature, Gabor filters are chosen for the minimum space-frequency uncertainty and for the separability of center frequency and bandwidth. A Gabor filter tuned to a frequency $u_0$ is created, spatially, by multiplying an envelope function by a complex exponential function with angular frequency $u_0$, Equation (2.12). Gabor showed that a Gaussian envelope minimizes the space-frequency uncertainty product [29]; a review is found in [53]. This means that the Gabor filters are well localized in both domains simultaneously.

Figure 2.8: The magnitude of three Gabor filters in the frequency domain. $u_0 = \{\pi/8, \pi/4, \pi/2\}$ and $\beta = 0.8$.
When designing a Gabor filter, the parameters are the standard deviation, $\sigma$, and the center frequency, $u_0$. These also affect the size and the bandwidth of the filter. The definition of a Gabor filter in the spatial domain is:

$$g_{u_0}(\xi) = e^{iu_0\xi}\,\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{\xi^2}{2\sigma^2}} \qquad (2.12)$$

and the definition in the frequency domain is:

$$G_{u_0}(u) = e^{-\frac{(u - u_0)^2}{2\sigma_u^2}} \qquad (2.13)$$

where $\sigma_u = 1/\sigma$.
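For concreteness, a Gabor filter according to Equation (2.12) can be generated as follows (a minimal numpy sketch with assumed parameters, not code from the thesis); the printed DC-to-peak ratio is the quantity that the design rules below bound by $P_{DC}$.

import numpy as np

def gabor(u0, sigma, radius):
    # Complex Gabor filter, Equation (2.12): Gaussian envelope times exp(i*u0*xi)
    xi = np.arange(-radius, radius + 1)
    envelope = np.exp(-xi ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))
    return envelope * np.exp(1j * u0 * xi)

u0 = np.pi / 4              # center frequency (assumed)
sigma = 4.0                 # spatial standard deviation (assumed); sigma_u = 1/sigma
g = gabor(u0, sigma, radius=12)

dc = abs(g.sum())                           # DC component of the truncated filter
peak = abs(np.fft.fft(g, 512)).max()        # peak of the frequency magnitude
print(f"DC / peak = {dc / peak:.4f}")       # small, but never exactly zero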
The Gabor filters have linear, and thus monotonous, phase by definition. Since the Gaussian has infinite support in both domains, it is impossible to keep the filter in the right half-plane. It is therefore theoretically impossible to avoid negative frequencies and a DC component. For practical purposes the Gaussian can be considered to be zero below some sufficiently low threshold. The center frequency $u_0$ is connected to the number of pixels per cycle of the phase, and the frequency standard deviation $\sigma_u$ is connected to the spatial and frequency support. By adjusting them, it is possible to get any number of phase cycles over the size of the spatial support. But all combinations of $u_0$ and $\sigma_u$ do not yield a useful filter.

To see this, suppose that a certain center frequency, $u_0$, is wanted. The radius of the frequency support must then be smaller than $u_0$ so that the frequency function is sufficiently low at $u = 0$, i.e. a negligible DC component. This gives an upper limit on the bandwidth, or rather the frequency standard deviation, of the filter. Say we allow the ratio between the DC component and the top value to be at most $P_{DC}$:

$$\frac{G_{u_0}(0)}{G_{u_0}(u_0)} \leq P_{DC} \qquad (2.14)$$

Using Equation (2.13) and resolving $\sigma_u$ gives the upper limit on the bandwidth:

$$\sigma_u \leq \frac{u_0}{\sqrt{-2\ln(P_{DC})}} \qquad (2.15)$$

See for instance Figure 2.7 on page 15, where the DC component is a few percent of the maximum value. Using the dual relationship between the frequency and spatial domains, $\sigma = 1/\sigma_u$, it is possible to use inequality (2.15) as a lower limit on the spatial standard deviation:

$$\sigma \geq \frac{\sqrt{-2\ln(P_{DC})}}{u_0} \qquad (2.16)$$

The spatial support of a Gabor filter is infinite, just as the frequency support. A threshold, $P_{cut}$, must therefore be set in order to get a finite spatial size. The spatial radius, $R$, of the filter is then given by:

$$\frac{\|g_{u_0}(R)\|}{\|g_{u_0}(0)\|} \leq P_{cut} \qquad (2.17)$$

Using Equation (2.12) and resolving $R$ gives the lower limit on the filter radius:

$$R \geq \sigma\sqrt{-2\ln(P_{cut})} \qquad (2.18)$$

Setting the standard deviation to its lower limit gives the filter radius as a function of the design parameters $P_{cut}$ and $P_{DC}$:

$$R = \frac{2\sqrt{\ln(P_{cut})\ln(P_{DC})}}{u_0} \qquad (2.19)$$

The phase difference between the end points of the filter can now be calculated:

$$\Delta\theta = u_0 R - u_0(-R) = 2u_0 R = 4\sqrt{\ln(P_{cut})\ln(P_{DC})} \qquad (2.20)$$
It should be pointed out that the truncation threshold, $P_{cut}$, affects the DC component of the filter. The DC component should therefore be checked after truncation of the filter to see if it is still less than $P_{DC}$. This was not done in Table 2.1.

Table 2.1 shows some values of the phase difference between the end points of the filter. If the phase difference is less than $2\pi$ the phase does not wrap around.

             P_DC
P_cut     0.05      0.1       0.2       0.25
0.05     11.982    10.505     8.783     8.151
0.1      10.505     9.210     7.700     7.146
0.2       8.783     7.700     6.437     5.974
0.25      8.151     7.146     5.974     5.545

Table 2.1: Phase difference between filter end points for different values of the DC component and the truncation threshold. Both P_DC and P_cut should be small. The DC value is not adjusted after truncation of the filter.
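The entries of Table 2.1 follow directly from Equation (2.20), as the following check shows (a short numpy sketch, not thesis code):

import numpy as np

P = np.array([0.05, 0.1, 0.2, 0.25])
# Equation (2.20): delta_theta = 4 * sqrt(ln(P_cut) * ln(P_DC))
delta_theta = 4.0 * np.sqrt(np.outer(np.log(P), np.log(P)))
print(np.round(delta_theta, 3))     # reproduces Table 2.1; compare each entry with 2*pi = 6.283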
Both $P_{DC}$ and $P_{cut}$ should be small in order to minimize the DC component and keep the Gaussian envelope. The conclusion is that having all the support, or most of it, in the right half-plane of the frequency domain, i.e. a small $P_{DC}$, requires a center frequency that generates wrap-around of the phase. Similarly, it can be shown that beginning with a phase that does not wrap around yields a center frequency that is much smaller than the frequency support of the filter. The resulting filter will then have a substantial DC component. The upper limit of the relative bandwidth, $\beta$, of the Gabor filters used has heuristically been set to approximately 0.8 octaves (Figure 2.8 on page 17). This is also the bandwidth used by Fleet et al [27, 39]. Langley suggests that the mean DC level should be subtracted from the input images in order to enhance the results [51]. The reason is that the DC component of the filter then is less critical. The best would be to calculate the weighted average in every image point using the Gaussian envelope of the Gabor filter and subtract it from the original image, which is the same as constructing a new filter without a DC component.

The behavior of the Gabor filters around the singular points has been thoroughly investigated by Fleet et al [27]. They used a Gabor scale-space function defined as

$$g(\xi; \lambda) = g(\xi)_{u_0(\lambda),\,\sigma(\lambda)} \qquad (2.21)$$

where $\lambda$ is the scale parameter. The center frequency decreases when the scale parameter increases, i.e.

$$u_0(\lambda) = \frac{2\pi}{\lambda} \qquad (2.22)$$

In theory it would be possible to keep the absolute bandwidth constant, i.e. to fixate $\sigma_u$ at the standard deviation used at the lowest $u_0$ and then vary $u_0$. But by doing so, the number of phase cycles over the filter varies with the scale. If the relative bandwidth is kept constant, increasing $\lambda$ can be seen as stretching out the same filter to cover larger areas [32]. Approximating the upper and lower half-height cutoff frequencies as one standard deviation over and one under the center frequency, i.e.

$$\beta = \log_2\frac{u_0(\lambda) + \sigma_u}{u_0(\lambda) - \sigma_u} \qquad (2.23)$$

gives the expression for the spatial standard deviation of the filter:

$$\sigma(\lambda) = \frac{1}{u_0(\lambda)}\,\frac{2^\beta + 1}{2^\beta - 1} \qquad (2.24)$$
The isophase curves in Figure 2.9 on page 22 show the phase on a number of scales. The dark broad lines are due to phase wrap-around. A feature that is stable in scale-space keeps its spatial position in all scales. If the phase were completely stable in scale, then the isophase pattern would only consist of vertical lines. The existence of singular points is easily observed in the phase diagram. The positions where the isophase curves converge are singular points. Just above them, the isophase curves turn downwards, indicating areas with decreasing phase, i.e. negative local frequency; compare with Figure 2.5g. The high density of isophase curves just below the singular points shows that the local frequency is very high (Figure 2.5d).

Figure 2.9: Above: Isophase plot of the Gabor phase scale-space. The positions where the isophase curves converge are singular points. Below: Isomagnitude plots of the Gabor phase scale-space, thresholded at 20%, 10% and 5% of the maximum; in the dark areas the magnitude is below the threshold. $u_0 = \pi/4$ and $\beta = 0.8$.
2.3.3 Quadrature filters

Quadrature filters can be defined as having no support in the left half-plane of the frequency domain, and no DC component. This definition makes them very easy to generate (Equations (2.5) and (2.6)). There are a number of different types of quadrature filters, of which two will be investigated here.

Lognorm filter

Figure 2.10: Three lognorm filters in the frequency domain. $u_0 = \{\pi/8, \pi/4, \pi/2\}$ and $\beta = 0.8$.
Lognorm filters are a class of quadrature filters used for orientation, phase and frequency estimation [44]. The design parameters are the center frequency $u_0$ and the relative bandwidth in octaves, $\beta$. The lognorm filters are defined in the frequency domain:

$$F(u) = \begin{cases} e^{-\frac{4}{\beta^2\ln 2}\ln^2(u/u_0)} & \text{if } u > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2.25)$$

There is, by definition, no DC component nor any support in the left half-plane of the frequency domain. Although an analytic expression for the spatial definition of a lognorm filter is unavailable, it is possible to use some of the results from the Gabor filter case.
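A frequency-domain construction according to Equation (2.25) might look as follows (an illustrative numpy sketch with assumed parameters, not the thesis implementation):

import numpy as np

def lognorm_freq(u, u0, beta):
    # Lognorm quadrature filter in the frequency domain, Equation (2.25):
    # zero for u <= 0, Gaussian in log frequency for u > 0.
    F = np.zeros_like(u)
    pos = u > 0
    F[pos] = np.exp(-4.0 / (beta ** 2 * np.log(2.0)) * np.log(u[pos] / u0) ** 2)
    return F

N = 256
u = 2.0 * np.pi * np.fft.fftfreq(N)             # frequency axis in radians per sample
F = lognorm_freq(u, u0=np.pi / 4, beta=0.8)     # parameters as in Figure 2.10 (assumed)

f = np.fft.ifft(F)                              # complex spatial filter (even + i*odd part)
print(abs(F[0]), abs(F[u < 0]).max())           # DC component and left half-plane support are zero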
For a certain relative bandwidth the phase goes through a certain number of cycles independent of the size of the filter support. Recalling, from the Gabor case, that the center frequency is related to the number of pixels per phase cycle and the bandwidth is related to the size of the spatial support, it is evident that using a wide relative bandwidth is the same as ensuring no wrap-around. The long tail of the lognorm frequency function makes this possible only for relatively low center frequencies (Figure 2.10 on the page before). If too much of the tail is cut, it can no longer be considered to be a lognorm filter.

The isophase curves in Figure 2.11 on the facing page show the phase on a number of scales. They are generated using the same parameters as in the Gabor case above, i.e. $u_0 = \pi/4$ and $\beta = 0.8$. The similarity makes it easy to identify the singular points and compare the behavior of the phase around them.
Studying the behavior of the phase around the singular points indicates that the disturbance region is smaller than for Gabor filters, i.e. the areas with negative frequencies are smaller. On the other hand, the size of a lognorm filter is approximately 50 percent larger than that of a Gabor filter with the same center frequency and bandwidth when truncating at one percent of the maximal value.

Figure 2.11: Above: Isophase plot of the lognorm phase scale-space. The positions where the isophase curves converge are singular points. Below: Isomagnitude plots of the lognorm phase scale-space, thresholded at 20%, 10% and 5% of the maximum; in the dark areas the magnitude is below the threshold. $u_0 = \pi/4$ and $\beta = 0.8$.

Powexp filter

There is a type of quadrature filter where the number of cycles of the phase is directly controllable. A family of filters with center frequency $u_0$ and a bandwidth controlled by $\alpha$ can be constructed from the following standard Fourier pair [10]:

$$\tilde{F}(u) = \begin{cases} u^{\alpha} e^{-u} & \text{if } u > 0 \\ 0 & \text{if } u \leq 0 \end{cases} \qquad (2.26)$$

$$\tilde{f}(\xi) = \frac{1}{(1 + i\xi)^{\alpha+1}} \qquad (2.27)$$

Scaling with $u_0$ gives a filter with a center frequency depending on $\alpha$. This dependence is of course unwanted and can be avoided by scaling with $u_0/\alpha$ instead:
$$\hat{F}_{u_0}(u) = \left(\frac{\alpha u}{u_0}\right)^{\alpha} e^{-\alpha u/u_0} \qquad (2.28)$$

$$\hat{f}_{u_0}(\xi) = \frac{1}{\left(1 + i\,\xi u_0/\alpha\right)^{\alpha+1}} \qquad (2.29)$$

Finally, normalizing the frequency function gives the wanted Fourier pair:

$$F_{u_0}(u) = \frac{\hat{F}_{u_0}(u)}{\hat{F}_{u_0}(u_0)} = \left(\frac{u}{u_0}\right)^{\alpha} e^{-\alpha(u/u_0 - 1)} \qquad (2.30)$$

$$f_{u_0}(\xi) = \left(\frac{e}{\alpha}\right)^{\alpha} \frac{1}{\left(1 + i\,\xi u_0/\alpha\right)^{\alpha+1}} \qquad (2.31)$$
Noting that $F_{u_0}(u) = \left(F^{1}_{u_0}(u)\right)^{\alpha}$, where $F^{1}_{u_0}$ denotes the filter with $\alpha = 1$, it is easy to see that the center frequency of the filter is independent of $\alpha$ and that the bandwidth will decrease with increasing $\alpha$. It is equally easy to see that the number of phase cycles of these filters is a function of $\alpha$.

For $\alpha = 1$, the relative bandwidth is approximately two octaves and the phase cycles once. However, both the frequency and spatial support of the filter are very large, which reduces the usefulness of this filter type. As an example, the spatial support is approximately 60 percent larger than that of a lognorm filter with the same center frequency and bandwidth when truncating at one percent of the maximum value.
2.3.4 Other even-odd pairs
The filters described above all consist of an even real part and an odd imaginary part, and the phase is calculated from the ratio between these parts. There are a few other types of filters that are neither Gabor nor quadrature filters, but which can be interpreted as even-odd pairs.

Non-ringing filters

Figure 2.12: Three non-ringing filters. $u_0 \approx \{\pi/8, \pi/4, \pi/2\}$ and $\beta \approx 2.2$. The filters are generated in the spatial domain using Equation (2.35). The radii of spatial support are $R = 14, 6, 3$.
A filter type that has exactly one phase cycle over the spatial support and that can be designed to have almost quadrature features has been suggested by Knutsson (personal communication). Using a monotonous antisymmetric phase function, a filter having a phase span of $2\pi n$ and no DC component can be defined as:

$$f(\xi) = g'(\xi)\, e^{iC_0 g(\xi)} \qquad (2.32)$$

where

$$g'(\xi) = \frac{dg}{d\xi} = \begin{cases} > 0 & \text{if } -R \leq \xi \leq R \\ 0 & \text{otherwise} \end{cases} \qquad (2.33)$$

and

$$C_0 = \frac{\pi n}{g(R)}, \qquad n = 0, 1, 2, \ldots \qquad (2.34)$$

The function $g(\xi)$ can be any monotonous antisymmetric function, but since the derivative controls the envelope, it is advisable to use a function with a smooth and unimodal derivative. How well such a filter approximates a quadrature filter depends on the size of the filter and how smooth the filter function is. It is easily shown that the DC component is zero,

$$F(0) = \int_{-\infty}^{\infty} f(\xi)\, d\xi = \int_{-R}^{R} g'(\xi)\, e^{i\pi n \frac{g(\xi)}{g(R)}}\, d\xi = \frac{g(R)}{i\pi n}\left(e^{i\pi n \frac{g(R)}{g(R)}} - e^{i\pi n \frac{g(-R)}{g(R)}}\right) = 0$$

Choosing $n = 1$ yields a filter with no DC component and no wrap-around.
The isophase curves are calculated using the primitive function of a squared cosine as the argument function and, thus, the squared cosine as the envelope:

$$f(\xi) = \begin{cases} \cos^2\!\left(\frac{\pi\xi}{2R}\right)\, e^{-i\left(\frac{\pi\xi}{R} + \sin\frac{\pi\xi}{R}\right)} & \text{if } \|\xi\| < R \\ 0 & \text{otherwise} \end{cases} \qquad (2.35)$$

The center frequency is approximately $3\pi/2R$ and the relative bandwidth is approximately 2.2 octaves. The phase of the filter is monotonous but the filter has considerable support for negative frequencies.
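The filter of Equation (2.35) can be generated and checked numerically as follows (a numpy sketch; the radii follow Figure 2.12, and the sign of the phase is flipped relative to (2.35) only so that the pass band lands at positive frequencies under numpy's FFT sign convention):

import numpy as np

def nonring(R):
    # Non-ringing filter of Equation (2.35) with n = 1: squared cosine envelope,
    # phase running one full cycle over the support.
    xi = np.arange(-R, R + 1)
    envelope = np.cos(np.pi * xi / (2.0 * R)) ** 2
    phase = np.pi * xi / R + np.sin(np.pi * xi / R)
    return envelope * np.exp(1j * phase)

for R in (14, 6, 3):                                 # radii from Figure 2.12
    f = nonring(R)
    F = np.fft.fft(f, 1024)
    u = np.fft.fftfreq(1024)
    leak = np.sum(np.abs(F[u < 0]) ** 2) / np.sum(np.abs(F) ** 2)
    print(f"R = {R:2d}: |DC| = {abs(f.sum()):.1e}, "
          f"energy in the stop half-plane = {leak:.3f}")

The DC component comes out at essentially zero, while a noticeable fraction of the energy remains in the wrong half-plane, in line with the statement above.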
Figure 2.13: Above: Isophase plot of the non-ringing phase scale-space. The positions where the isophase curves converge are singular points. Below: Isomagnitude plots of the non-ringing phase scale-space, thresholded at 20%, 10% and 5% of the maximum; in the dark areas the magnitude is below the threshold. $u_0 = \pi/4$.
Windowed Fourier Transform

Figure 2.14: Three windowed Fourier transform filters. $u_0 = \{2\pi/15, 2\pi/7, 2\pi/4\}$ and $\beta \approx 2.0$. The filters were generated in the spatial domain using Equation (2.36). The radii of spatial support were $R = 7, 3, 2$.
The windowed Fourier transform can be used for estimating local phase. The window can be chosen arbitrarily, e.g. a rectangular function. Weng advocates the rectangular window [71], which is actually a special case of the non-ringing filters. The spatial magnitude function is a rectangular function and the argument is a ramp:

$$f(\xi) = \begin{cases} e^{-i\pi\xi/R} & \text{if } \|\xi\| < R \\ 0 & \text{otherwise} \end{cases} \qquad (2.36)$$

Although the term windowed Fourier transform is not tied to any particular window function, filters defined according to Equation (2.36) are called WFT filters in this thesis. Figure 2.14 shows three WFT filters in the Fourier domain. The long tails of ripples make the filter sensitive to high-frequency noise. Weng suggests a prefiltering of the signal with a Gaussian, which is the same as a smoothing of the filter. The resulting filter is then very similar to the non-ringing filter above.

The signal is prefiltered with a smoothing function as described in Subsection 2.3.1. Any further smoothing is therefore not necessary in this test.
Figure 2.15: Above: Isophase plot of the WFT phase scale-space. The positions where the isophase curves converge are singular points. Below: Isomagnitude plots of the WFT phase scale-space, thresholded at 20%, 10% and 5% of the maximum; in the dark areas the magnitude is below the threshold. $u_0 = \pi/4$.
Gaussian differences

Gaussian filters and their derivatives, or rather differences, can be efficiently implemented using binomial filter kernels [18]. The basic kernels are a LP kernel and a difference kernel (Figure 2.16). These can be implemented using shifts and summations. The first and second differences of the binomial Gaussians can be used as a phase estimator (Figure 2.16). The spatial support is only 5 pixels.

Figure 2.16: Left, the LP kernel. Middle, the first difference kernel, $f_d$. Right, the second difference kernel, $f_{dd}$.

Figure 2.17: The DFT magnitude of the binomial phase filter for $\alpha = 0.5$ (dot-dashed), $\alpha = 0.3333$ (dashed) and $\alpha = 0.3660$ (solid).
From the design, it is evident that there is no DC component and that the phase does not wrap around. The first difference filter, $f_d$, is the odd kernel, and changing the sign of the second difference, $f_{dd}$, gives the even kernel. Instead of just using the kernels as they are, it is possible to give them different relative weights, producing a range of filters:

$$f_\alpha(\xi) = -\alpha f_{dd}(\xi) + i(1 - \alpha) f_d(\xi), \qquad \text{where } 0 \leq \alpha \leq 1 \qquad (2.37)$$

The energy in the left half-plane of the frequency domain is minimized by setting $\alpha = 0.3660$ (Figure 2.17 on the preceding page). This design method is a special case of a method for producing quadrature filters called prolate spheroidals. In the general case, there are an arbitrary number of basis filters that are weighted together, using a multi-variable optimizing technique. The method produces the best possible quadrature filter of a given size in the sense that it has minimum energy in the left half-plane. If it is essential to the implementation to use only summations and shifts, the weights can be chosen as 1 for $f_{dd}$ and 2 for $f_d$, corresponding to $\alpha = 0.3333$. The relative bandwidth is approximately two octaves and only slightly dependent on $\alpha$.
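The construction can be sketched as follows (numpy; the exact 5-tap kernels and their relative scaling are assumptions, since the thesis only shows them graphically in Figure 2.16, so the numerically optimal weight will not coincide with the 0.3660 quoted above; the point is how the stop-band energy is evaluated as a function of $\alpha$):

import numpy as np

# Assumed binomial building blocks: LP kernel [1 2 1]/4 convolved with simple
# first and second difference kernels gives 5-tap odd and even kernels.
lp   = np.array([1.0, 2.0, 1.0]) / 4.0
f_d  = np.convolve(lp, [1.0, 0.0, -1.0])         # first difference, odd
f_dd = np.convolve(lp, [-1.0, 2.0, -1.0])        # second difference, even

def binomial_phase_filter(alpha):
    # Equation (2.37): f_alpha = -alpha * f_dd + i * (1 - alpha) * f_d
    return -alpha * f_dd + 1j * (1.0 - alpha) * f_d

u = np.fft.fftfreq(1024)
for alpha in (0.3333, 0.3660, 0.5):
    F = np.fft.fft(binomial_phase_filter(alpha), 1024)
    e_neg = np.sum(np.abs(F[u < 0]) ** 2)
    e_pos = np.sum(np.abs(F[u > 0]) ** 2)
    frac = min(e_neg, e_pos) / (e_neg + e_pos)   # energy in the weaker (stop) half-plane
    print(f"alpha = {alpha:.4f}: stop half-plane energy fraction = {frac:.4f}")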
Figure 2.18: Above: Isophase plot of the Gaussian differences phase scale-space. The positions where the isophase curves converge are singular points. Below: Isomagnitude plots of the Gaussian differences phase scale-space, thresholded at 20%, 10% and 5% of the maximum; in the dark areas the magnitude is below the threshold. $u_0 = \pi/2$.
2.3.5 Discussion on filter choice

The choice of filter is not evident from these investigations. Different characteristics might have different priorities in different applications. The size of the kernel may be less important if special purpose hardware is used, making scale-space behavior a critical issue. On the other hand, if the convolution time depends directly on the kernel size, a less robust but smaller kernel might be accepted. The most relevant test is to use the filters in the intended application and measure the overall performance. For convenience the filter characteristics are summarized below.

Gabor filters

The Gabor filters might have a DC component if not designed carefully. The phase is monotonous but wraps around. The frequency support is localized in the right half-plane, and the sensitivity to singular points is small.

Lognorm filters

The lognorm quadrature filters have neither a DC component nor any frequency support in the left half-plane of the frequency domain. The phase generally wraps around, but it is monotonous. The sensitivity to singular points is small for narrow band filters; the sensitivity increases with bandwidth.

Non-ringing filters

The non-ringing filter investigated here has no DC component, monotonous phase and no phase wrap-around. The filter has a slight sensitivity to negative frequencies depending on the center frequency. The spatial support is small. The sensitivity to singular points is larger than for Gabor and lognorm filters.

(Rectangular) Windowed Fourier Transform

Being a special case of the non-ringing filters, the WFT filters share the properties described above. The sensitivity to singular points is the largest of the tested filters. The smoothing of the filter that is necessary to reduce the noise influence makes the filter very similar to the non-ringing filter based on the squared cosine magnitude.

Differences of Gaussians

Gaussian derivative filters implemented with binomial coefficients do not have any DC component. The phase is monotonous and there is no phase wrap-around. The sensitivity to negative frequencies can be adjusted by weighting the even and odd kernels appropriately. It can, however, not be reduced to zero. The sensitivity to singular points is slightly larger than for non-ringing filters. The spatial support is small.
3 PHASE-BASED DISPARITY ESTIMATION
3.1 Introduction
The problem of estimating depth information from two or more images of a scene is one which has received considerable attention over the years, and a wide variety of methods have been proposed to solve it [8, 24]. Methods based on correlation and methods using some form of feature matching between the images have found the most widespread use. Of these, the latter have attracted increasing attention since the work of Marr [54], in which the features are zero-crossings on varying scales. These methods share an underlying basis of spatial domain operations.

In recent years, however, increasing interest has been shown in computational models of vision based primarily on a localized frequency domain representation, the Gabor representation [29, 2], first suggested in the context of computer vision by Granlund [32].

In [63, 87, 40, 27, 51] it is shown that such a representation can also be adapted to the solution of the stereopsis problem. The basis for the success of these methods is the robustness of the local Gabor-phase differences. The algorithm presented here is an extension of the work presented in [87].
Figure 3.1: Left: A superimposed stereo image pair of a line. In the left image the line is located at ξ1 (solid) and in the right image it is located at ξ2. Right: The phase curves corresponding to the line in the two images. The displacement can be estimated by calculating the phase difference, Δθ, and the slope of the phase curve, i.e. the local frequency dθ/dξ.
3.2 Disparity estimation
The fact that phase is locally equivariant with position can be used to estimate local displacement between two images [63, 87, 25, 72]. In a stereo image pair the local displacement is a measure of depth, and in an image sequence the local displacement is an estimate of velocity.
One of the advantages of using phase for displacement estimation is that subpixel accuracy can be obtained without having to change the sampling density. Figure 3.1 shows an example where the displacement of a line in a stereo image pair is estimated using phase differences. Traditional displacement estimation would calculate the position of a significant feature, e.g. the local maximum of the intensity, and then calculate the difference. If subpixel accuracy is needed, the feature locations would have to be stored using some sort of subpixel representation.
The local phase, on the other hand, is a continuous variable sensitive to changes much smaller than the spatial quantization. Sampling the phase function with a certain density does not restrict the phase differences
Figure 3.2: Computation structure for the hierarchical stereo algorithm.
to the same accuracy. Thus, a subpixel displacement generates a phase shift giving phase differences with subpixel accuracy without a subpixel representation of image features. In Figure 3.1 the displacement estimate is:

    Δξ = Δθ / (dθ/dξ)                                        (3.1)
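To make the idea concrete, the following minimal sketch (not from the thesis; the filter parameters and the band-limited random test signal are ad hoc choices) estimates a known 0.3 pixel shift between two 1-D signals from the phase difference of complex Gabor-like filter responses, essentially applying Equation (3.1) at every position:

```python
import numpy as np

def gabor_response(signal, u0=np.pi / 4, sigma=3.0):
    """Complex Gabor-like filter response (even + i*odd) of a 1-D signal."""
    x = np.arange(-10, 11)
    kernel = np.exp(-x**2 / (2 * sigma**2)) * np.exp(1j * u0 * x)
    kernel -= kernel.mean()                      # remove the DC component
    return np.convolve(signal, kernel, mode='same')

# Band-limited random test signal and a copy shifted 0.3 pixels to the right.
rng = np.random.default_rng(0)
n = 256
spec = np.fft.rfft(rng.standard_normal(n))
freqs = np.fft.rfftfreq(n)                       # cycles per pixel
spec *= np.exp(-(freqs / 0.12) ** 2)             # soft band limit
left = np.fft.irfft(spec, n)
right = np.fft.irfft(spec * np.exp(-2j * np.pi * freqs * 0.3), n)

z_l, z_r = gabor_response(left), gabor_response(right)
d = z_l * np.conj(z_r)                           # phase difference measure
f = z_l[1:] * np.conj(z_l[:-1])                  # neighbour products
local_freq = np.angle(f)                         # slope of the phase curve
shift = np.angle(d[1:]) / np.where(local_freq > 0, local_freq, np.nan)
print("median estimated shift:", np.nanmedian(shift))   # typically close to 0.3
```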
3.2.1 Computation structure
A hierarchical stereo algorithm that uses a phase based disparity estimator has been developed [84]. To optimize the computational performance, a multiresolution representation of the left and right images is used. An edge detector, tuned to vertical structures, is used to produce a pair of images containing edge information. The edge images reduce the influence of singular points, since the singular points in the original images and the edge images generally do not coincide. The impact of a DC component in the disparity filter is also reduced by means of the edge images. The
edge images together with the corresponding original image pair are used to build the resolution pyramids. There is one octave between the levels; a simple pyramid construction is sketched below. The number of levels needed depends on the maximum disparity in the stereo image pair.
The algorithm starts at the coarsest resolution. The disparity accumulator holds and updates disparity estimates and confidence measures for each pixel. The four input images are shifted locally according to the current disparity estimates. After the shift, a new disparity estimate is calculated using the phase differences, the local frequency and their confidence values. The disparity estimate from the edge image pair has high confidence close to edges, while the confidence is low in between them. The estimates from the original image pair resolve possible problems of matching incompatible edges, that is, only edges with the same sign of the gradient should be matched. Both these disparity estimates are weighted together by a consistency function to form the disparity measure between the shifted images. The new disparity measure updates the current estimate in the disparity accumulator. For each resolution level a refinement of the disparity estimate can be done by iterating these steps. The correction should get closer and closer to zero during the iterations.
Between each level the disparity image is resampled to the new resolution, and a local spatial consistency check is performed. The steps above are repeated until the finest resolution is reached. The accumulator image then contains the final disparity estimates and certainties.
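As an illustration of the pyramid step, the sketch below (my own simplification, not the filtering actually used in the thesis) builds an octave resolution pyramid by repeated low-pass filtering and subsampling with a small binomial kernel:

```python
import numpy as np

def binomial_blur(image):
    """Separable 1-2-1 binomial low-pass filter with edge replication."""
    k = np.array([0.25, 0.5, 0.25])
    p = np.pad(image, 1, mode='edge')
    rows = k[0] * p[:-2, 1:-1] + k[1] * p[1:-1, 1:-1] + k[2] * p[2:, 1:-1]
    p = np.pad(rows, 1, mode='edge')
    return k[0] * p[1:-1, :-2] + k[1] * p[1:-1, 1:-1] + k[2] * p[1:-1, 2:]

def octave_pyramid(image, n_levels):
    """Level 0 is the original image; each further level is one octave coarser."""
    levels = [image.astype(float)]
    for _ in range(n_levels - 1):
        levels.append(binomial_blur(levels[-1])[::2, ::2])
    return levels

# Example: the four pyramids (left/right grey level and edge images) share this step.
pyr = octave_pyramid(np.random.rand(256, 256), n_levels=5)
print([lev.shape for lev in pyr])   # (256,256), (128,128), (64,64), (32,32), (16,16)
```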
3.2.2 Edge extraction
Creating edge images can be done using any edge extraction algorithm. Here the edge extraction is performed using the same filter as for the disparity estimation. The magnitude of the filter response is stored in the edge image, creating a sort of line drawing. The disparity filters are sensitive only to more or less vertically oriented structures, but this is no limitation since horizontal lines do not contain any disparity information. The produced edge image is used as input to create a resolution pyramid in the same way as described above. In total, four pyramids are generated before the disparity estimation starts.
3.2.3 Local image shifts
The images from the current level in the resolution pyramid are shifted according to the disparity accumulator, which is initialized to zero. The left and right images are shifted half the distance each. The shift procedure decreases the disparity since the left and right images are shifted towards each other. It reduces differences due to foreshortening as well [61]. This means that if a disparity is estimated fairly well at a coarse resolution, the reduction of the disparity will enable the next level to further refine the result.
The shift is implemented as a "picking at a distance" procedure:

    x_L^s(ξ1, ξ2) = x_L(ξ1 + 0.5Δ, ξ2)                       (3.2)
    x_R^s(ξ1, ξ2) = x_R(ξ1 − 0.5Δ, ξ2)                       (3.3)

which means that a value is picked from the old image to the new image at a distance determined by the disparity, Δ. This ensures that there will be no points without a value. Linear interpolation between neighboring pixels allows non-integer shifts.
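The following sketch (variable names are my own) shows the "picking at a distance" shift of Equations (3.2)-(3.3) for one image row, with linear interpolation for non-integer positions:

```python
import numpy as np

def pick_at_distance(row, offsets):
    """Sample 'row' at the non-integer positions xi + offsets[xi]."""
    pos = np.clip(np.arange(row.size) + offsets, 0, row.size - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, row.size - 1)
    frac = pos - lo
    return (1 - frac) * row[lo] + frac * row[hi]

def shift_row_pair(left_row, right_row, disparity_row):
    """Shift the left and right rows half the accumulated disparity each."""
    shifted_left = pick_at_distance(left_row, +0.5 * disparity_row)    # Eq. (3.2)
    shifted_right = pick_at_distance(right_row, -0.5 * disparity_row)  # Eq. (3.3)
    return shifted_left, shifted_right
```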
3.2.4 Disparity estimation
The disparity is measured on both the grey level images and the edge images. The phase can be estimated using any of the filters described in Section 2.3. The result will of course vary with the filter characteristics, but a number of consistency checks reduce the variation between filter types.
The disparity is estimated in the grey level images and the edge images separately and the results are weighted together. The filter response in a point can be represented by a complex number. The real and imaginary parts of the complex number represent the even and odd filter responses respectively. The magnitude is a measure of how strong the signal is and how well it fits the filter. The magnitude will therefore be used as a confidence measure of the filter response. The argument of the complex number is the phase of the signal.
Let the responses from the phase estimating filter be represented by the complex numbers z_L and z_R for the left and right image respectively. The filters are normalized so that 0 ≤ ‖z_{L,R}‖ ≤ 1. Calculating

    d = z_L z_R*                                             (3.4)

where * denotes the complex conjugate, yields a phase difference measure and a confidence value,

    ‖d‖ = ‖z_L‖ ‖z_R‖,            0 ≤ ‖d‖ ≤ 1               (3.5)
    arg(d) = arg(z_L) − arg(z_R),   −π < arg(d) ≤ π          (3.6)

The magnitude, ‖d‖, is large only if both filter magnitudes are large. It consequently indicates how reliable the phase difference is. If a filter sees a completely homogeneous neighborhood, its magnitude is zero and its argument is undefined. Calculating the phase difference without any confidence values then produces an arbitrary result.
If the images are captured under similar conditions and they cover approximately the same area, it is reasonable that the magnitudes of the filter responses are approximately the same for both images. This can be used to check the validity of the disparity estimate. A substantial difference in magnitude can be due to noise or too large a disparity, i.e. the image neighborhoods do not depict the same part of reality. It can also be due to a singular point in one of the signals, since the magnitude is reduced considerably in such neighborhoods. In any of these cases the confidence value of the estimate should be reduced, so the consistency checks later on can weight the estimate accordingly. Sanger used the ratio between the smaller and the larger of the magnitudes as a confidence value [63]. Such a confidence value does not differentiate between strong and weak signals. The confidence function below depends both on the relation between the filter magnitudes and their absolute values. The confidence value therefore reflects both the similarity and the signal strength:

    C1 = √‖z_L z_R‖ ( 2‖z_L z_R‖ / (‖z_L‖² + ‖z_R‖²) )^γ     (3.7)

The square root of ‖z_L z_R‖ is the geometric average of the filter magnitudes, i.e. a measure of the combined signal strength.
Figure 3.3: The magnitude difference penalty function. The plots show the function for 0 ≤ γ ≤ 10 from left to right. The abscissa is the ratio between the smaller and the larger magnitude.
The exponent γ controls how much a magnitude difference should be punished. The expression within the parenthesis is equal to one if ‖z_L‖ = ‖z_R‖ and decays with increasing magnitude difference. Setting

    M² = ‖z_L z_R‖                                           (3.8)
    ρ = ‖z_R‖ / ‖z_L‖                                        (3.9)

transforms Equation (3.7) into a more intuitively understandable form:

    C1 = M ( 2ρ / (1 + ρ²) )^γ                               (3.10)

If ‖z_L‖ = ‖z_R‖ = M, i.e. ρ = 1, then C1 = M. This means that if the magnitudes are almost the same, the confidence value is also the same. If the magnitudes differ, the confidence goes down at a rate controlled by γ. Figure 3.3 shows how the confidence depends on the filter magnitude ratio, ρ, and the exponent γ. Throughout the testing of the algorithm the exponent has heuristically been set to 4.
If the phase difference is very large it might wrap around and indicate a disparity with the wrong sign. Very large phase differences should therefore be given a lower confidence value [87]:

    C2 = C1 cos²( arg(d) / 2 )                               (3.11)
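A compact sketch of the confidence computation, using the reconstructed forms of Equations (3.4), (3.7) and (3.11) above; the small epsilon that guards against division by zero is my own addition:

```python
import numpy as np

def phase_difference_and_confidence(z_l, z_r, gamma=4, eps=1e-12):
    """z_l, z_r: complex filter responses with magnitudes normalized to <= 1."""
    d = z_l * np.conj(z_r)                                  # Eq. (3.4)
    ratio_term = 2 * np.abs(d) / (np.abs(z_l)**2 + np.abs(z_r)**2 + eps)
    c1 = np.sqrt(np.abs(d)) * ratio_term**gamma             # Eq. (3.7)
    c2 = c1 * np.cos(np.angle(d) / 2)**2                    # Eq. (3.11)
    return d, c2
```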
In Chapter 2 it was shown that the phase derivative varies with the frequency content of the signal. In order to correctly interpret the phase difference, Δθ = arg(d), as disparity it is necessary to estimate the phase derivative, i.e. the local frequency [51, 27].
Let z(ξ) be a phase estimate at position ξ. The phase differences between position ξ and its two neighbors are a measure of how fast the phase varies in the neighborhood, i.e. the local frequency. The local frequency can be approximated using the phase differences to the left and right of the current position:

    f_L− = z_L*(ξ − 1) z_L(ξ)                                (3.12)
    f_L+ = z_L*(ξ) z_L(ξ + 1)                                (3.13)
    f_R− = z_R*(ξ − 1) z_R(ξ)                                (3.14)
    f_R+ = z_R*(ξ) z_R(ξ + 1)                                (3.15)

The arguments of f_i, i ∈ {L−, L+, R−, R+}, are estimates of the local frequency, and they are combined using

    θ'0 = arg( f_L− + f_L+ + f_R− + f_R+ )                   (3.16)

Knowing the local frequency, i.e. the slope of the phase curve, calculating the disparity in pixels is straightforward:

    Δξ = arg(d) / θ'0                                        (3.17)

Note that Δξ does not have to be an integer. Using phase differences allows subpixel accuracy.
The confidence value is updated by a factor depending only on the similarity between the local frequency estimates and not on their magnitudes. If the local frequency is zero or negative the confidence value is set to zero, since the phase difference then is completely unreliable:

    C3 = C2 · (1/4) ‖ Σ_i f_i / ‖f_i‖ ‖    if θ'0 > 0
    C3 = 0                                  if θ'0 ≤ 0       (3.18)

where i ∈ {L−, L+, R−, R+}.
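The sketch below turns the reconstructed Equations (3.12)-(3.18) into code for one row of complex filter responses; the conjugation order and the small epsilon are my own choices:

```python
import numpy as np

def disparity_from_phase(z_l, z_r, c2, eps=1e-12):
    # Neighbour products whose arguments approximate the local frequency,
    # Eqs (3.12)-(3.15); computed for the interior positions 1 .. N-2.
    f = [np.conj(z_l[:-2]) * z_l[1:-1], np.conj(z_l[1:-1]) * z_l[2:],
         np.conj(z_r[:-2]) * z_r[1:-1], np.conj(z_r[1:-1]) * z_r[2:]]
    theta0 = np.angle(sum(f))                              # Eq. (3.16)
    d = z_l[1:-1] * np.conj(z_r[1:-1])                     # Eq. (3.4)
    valid = theta0 > 0
    disparity = np.where(valid, np.angle(d) / np.where(valid, theta0, 1.0), 0.0)
    # Confidence factor: similarity of the four frequency estimates, Eq. (3.18).
    similarity = np.abs(sum(fi / (np.abs(fi) + eps) for fi in f)) / 4
    c3 = np.where(valid, c2[1:-1] * similarity, 0.0)
    return disparity, c3                                   # Eqs (3.17) and (3.18)
```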
3.2.5 Edge and grey level image consistency
Let subscripts g and e denote grey level and edge image values respectively. The disparity and confidence values are calculated for the grey level image and the edge image separately using Equations (3.17) and (3.18). These estimates are then combined to give the total disparity estimate and its confidence value:

    Δξ = ( C_g3 Δξ_g + C_e3 Δξ_e ) / ( C_g3 + C_e3 )         (3.19)

The confidence value for the disparity estimate depends on C_g3, C_e3 and the similarity between the phase differences arg(d_g) and arg(d_e). This is accomplished by adding the confidence values as vectors with the phase differences as arguments:

    C_tot = ‖ C_g3 e^{i·arg(d_g)/2} + C_e3 e^{i·arg(d_e)/2} ‖  (3.20)

The phase differences, arg(d_{e,g}), are divided by two in order to ensure that C_tot is large only for arg(d_g) ≈ arg(d_e) and not for arg(d_g) ≈ arg(d_e) ± 2π as well.
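A sketch of this combination step, following the reconstructed Equations (3.19)-(3.20); the inputs are the grey level and edge image disparity estimates, their confidences and the phase differences arg(d_g), arg(d_e):

```python
import numpy as np

def combine_grey_and_edge(disp_g, c_g3, dphi_g, disp_e, c_e3, dphi_e, eps=1e-12):
    disparity = (c_g3 * disp_g + c_e3 * disp_e) / (c_g3 + c_e3 + eps)  # Eq. (3.19)
    # Add the confidences as vectors with half the phase differences as
    # arguments, so that only consistent estimates reinforce each other.
    c_tot = np.abs(c_g3 * np.exp(1j * dphi_g / 2) +
                   c_e3 * np.exp(1j * dphi_e / 2))                     # Eq. (3.20)
    return disparity, c_tot
```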
3.2.6 Disparity accumulation
The disparity accumulator is updated using the disparity estimate and its confidence value. The accumulator holds the cumulative sum of disparity estimates. Since the images are shifted according to the current accumulator value, the value to be added is just a correction towards the true disparity. Thus, the disparity value is simply added to the accumulator:

    Δ_new = Δ_old + Δξ                                       (3.21)

When updating the confidence value of the accumulator, high confidence values are emphasized and low values are attenuated:

    C_new = ( ( √C_old + √C_tot ) / 2 )²                     (3.22)
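In code, the accumulator update of the reconstructed Equations (3.21)-(3.22) is a one-liner each:

```python
import numpy as np

def update_accumulator(disp_acc, conf_acc, correction, c_tot):
    disp_acc = disp_acc + correction                                # Eq. (3.21)
    conf_acc = ((np.sqrt(conf_acc) + np.sqrt(c_tot)) / 2) ** 2      # Eq. (3.22)
    return disp_acc, conf_acc
```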
3.2.7 Spatial consistency
In most images there are areas where the phase estimates are weak or contradictory. In these areas the disparity estimates are not reliable. This results in tearing the image apart when making the shift before disparity refinement, creating unnecessary distortion of the image. It is then desirable to spread the estimates from nearby areas with higher confidence values. On the other hand, it is not desirable to average between areas with different disparity and high confidence. A filter function fulfilling these requirements has a spatial function with a large peak in the middle that decays rapidly towards the periphery, such as a Gaussian with a small σ:

    h(ξ) = 1/(√(2π) σ) e^{−ξ²/(2σ²)},   −R ≤ ξ ≤ R           (3.23)

A kernel with R = 7 and σ = 1.0 has been used when testing the algorithm. The filter is used in the vertical and horizontal directions separately.
The filter is convolved with both the confidence values alone and the disparity estimates weighted with the confidence values:

    m = h ∗ C                                                (3.24)
    v = h ∗ (C Δ)                                            (3.25)

If the filter is positioned on a point with a high confidence value, the disparity estimate will be left virtually untouched, but if the confidence value is weak it changes towards the average of the neighborhood. The new disparity estimate and its confidence value are

    C_new = m                                                (3.26)
    Δ_new = v / m                                            (3.27)

After the spatial consistency operation, the accumulator is used to shift the input images either on the same level once more or on the next finer level, depending on how many iterations are used on each level.
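The spatial consistency step is a normalized averaging; the sketch below (a simplification built on the reconstructed Equations (3.23)-(3.27), with R = 7 and σ = 1.0 as quoted above) applies the Gaussian separably in the horizontal and vertical directions:

```python
import numpy as np

def spatial_consistency(disparity, confidence, radius=7, sigma=1.0, eps=1e-12):
    x = np.arange(-radius, radius + 1)
    h = np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)   # Eq. (3.23)

    def separable_conv(image):
        out = np.apply_along_axis(lambda r: np.convolve(r, h, mode='same'), 1, image)
        return np.apply_along_axis(lambda c: np.convolve(c, h, mode='same'), 0, out)

    m = separable_conv(confidence)                    # Eq. (3.24)
    v = separable_conv(confidence * disparity)        # Eq. (3.25)
    return v / (m + eps), m                           # new disparity and confidence
```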
Filter Name       Peak Freq.   Bandwidth   Filter Size
Non-ringing 7     π/2          2.2          7
Non-ringing 11    3π/10        2.2         11
Non-ringing 15    3π/14        2.2         15
Gabor 1           π/2          0.8         15
Gabor 2           π/(2√2)      0.8         17
Gabor 3           π/4          0.8         19
Lognorm 1         π/2          1.0         13
Lognorm 2a        π/(2√2)      1.0         15
Lognorm 2b        π/(2√2)      2.0         15
Lognorm 3a        π/4          1.0         19
Lognorm 3b        π/4          2.0         19
Lognorm 3c        π/4          4.0         19
WFT 5             2π/5         2.0          5
WFT 7             2π/7         2.0          7
WFT 11            2π/11        2.0         11
Gaussian diff.    π/2          2.0          5

Table 3.1: The filters used when testing the phase based stereo algorithm.
3.3 Experimental results
The algorithm has been tested both on synthetic test images and on real images, using a wide variety of filters (Table 3.1). All types of filters discussed in Section 2.3 are represented, and a number of design parameter combinations have been tested. The filter based on differences of binomial Gaussians was designed with the parameter in Equation (2.37) set to 1/3. The non-ringing filters were designed using Equation (2.32). The spatial sizes of these two types of filters are given by their definitions. Strictly speaking, the Gabor and lognorm filters have infinite support and must be truncated to get a finite size. The size could be set large enough to make the truncation negligible, but this often gives very large filters. The criteria for setting the size of the Gabor and lognorm filters have been that the DC level must not be more than one percent of the value at the center frequency, and the envelope must have decreased to less than ten percent of the peak value.
Figure 3.4: The synthetic stereo pairs are generated using a shift image describing the local shifts. The image to be shifted is LP filtered in order to avoid aliasing. The shift image is also used as ground truth for the stereo estimates.
3.3.1 Generating stereo image pairs
The quantitative results have been obtained by using synthetically generated images and comparing the estimated disparities with ground truth. The often used method of taking an image and simply shifting it a few pixels in order to create a known disparity does not show the advantage of the local phase estimation. For such image pairs the difference in global Fourier phase would do just as well. A method for evaluation of a disparity estimator must be based on locally varying shifts in order to resemble real life situations. A scheme for generating locally varying disparity in a controlled fashion is shown in Figure 3.4. The test image is shifted locally according to the values in a synthetically generated disparity image, which is also used as ground truth for evaluating the results.
A global shift creates an image with the same properties as the original image at a certain distance. Local shifts deform the original image, stretching it in some areas and compressing it in others. The stretching does not create any problems but the compression might do so. The
Figure 3.5: A noise image with an energy function inversely proportional to the radius in the frequency domain.
spatial frequency increases when the image is compressed and it might exceed the Nyquist frequency. The image to be shifted is therefore LP filtered using a Gaussian (σ = 1.273).
In real life images, the structures in the image belong to real objects and so do the disparities. In synthetic test images, generated as shown in Figure 3.4 on the facing page, the disparities are not necessarily related to the image structure. A random image, e.g. white noise or random dots, is therefore best suited for testing purposes, since all parts of the image then exhibit a similar structure. The tests below use a noise image with
Figure 3.6: Left: "Twin Peaks", one positive peak, one negative peak and zero around the edges. The magnitude changes linearly between the peaks and the border, both horizontally and vertically. Right: "Flip Flop", alternating positive and negative peaks, zero around the edges. The magnitude changes linearly from the edges towards the peaks, but has discontinuities between stripes with positive and negative values.
a spatial frequency spectrum that is inversely proportional to the radius in the frequency domain (Figure 3.5 on the page before):

    ‖F(u)‖ ∝ 1 / ‖u‖                                         (3.28)

This is justified by the fact that it resembles a natural image more than, for instance, a white noise image [44].
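A test image with this spectrum can be generated directly in the frequency domain; the sketch below is my own construction, not necessarily the one used for the experiments, and simply shapes a white noise spectrum with a 1/‖u‖ envelope:

```python
import numpy as np

def radial_noise_image(size=256, seed=0):
    rng = np.random.default_rng(seed)
    u = np.fft.fftfreq(size)
    radius = np.sqrt(u[:, None]**2 + u[None, :]**2)
    radius[0, 0] = 1.0                                   # avoid dividing by zero at DC
    spectrum = rng.standard_normal((size, size)) + 1j * rng.standard_normal((size, size))
    spectrum /= radius                                   # ||F(u)|| ~ 1 / ||u||, Eq. (3.28)
    spectrum[0, 0] = 0.0                                 # zero-mean image
    image = np.real(np.fft.ifft2(spectrum))
    return (image - image.min()) / (image.max() - image.min())
```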
The two different shift images that have been used in the tests are shown in Figure 3.6. "Twin Peaks" consists of one positive and one negative peak. The disparity is zero along the image edges and the magnitude changes linearly between the peaks and the border, both vertically and horizontally. There are no discontinuities in the image. The other shift image, "Flip Flop", consists of alternating positive and negative peaks with exponentially increasing frequency upwards. It is also zero along the image edges. Horizontally, the magnitude changes linearly from the edges towards the peaks, but it has discontinuities vertically between stripes with positive and negative values. The shift values in the shift images
are normalized to the interval [−1, 1], and the maximum disparity is then controlled by a parameter to the shift module (Figure 3.4).
3.3.2 Statistics
The result of the stereo algorithm is evaluated by measuring the mean and the standard deviation of the error between the shift image used to create the stereo pair and the disparity image. Since the algorithm also provides a confidence value, the mean and standard deviation weighted with the confidence values are also calculated. Let Δ_i denote the true disparity and let Δ̃_i denote the estimated disparity. The statistics are then calculated as:

    m    = (1/n) Σ_{i=1..n} (Δ_i − Δ̃_i)                               (3.29)
    s²   = (1/(n−1)) Σ_{i=1..n} (Δ_i − Δ̃_i − m)²                      (3.30)
    m_w  = Σ_{i=1..n} C_i (Δ_i − Δ̃_i) / Σ_{i=1..n} C_i                (3.31)
    s_w² = Σ_{i=1..n} C_i (Δ_i − Δ̃_i − m_w)² / Σ_{i=1..n} C_i         (3.32)

The unweighted values furnish a measure of how well the algorithm performs over the whole image. The weighted values, on the other hand, indicate how well the confidence value reflects the reliability of the measurements. If the confidence value is always low when the disparity estimate is wrong, the weighted statistics show better values than the unweighted ones.
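These statistics are straightforward to compute; the sketch below implements the reconstructed Equations (3.29)-(3.32) for a true shift image, an estimated disparity image and the corresponding confidence image:

```python
import numpy as np

def disparity_error_statistics(true_disp, est_disp, confidence):
    err = (true_disp - est_disp).ravel()
    c = confidence.ravel()
    m = err.mean()                                              # Eq. (3.29)
    s = np.sqrt(((err - m) ** 2).sum() / (err.size - 1))        # Eq. (3.30)
    m_w = (c * err).sum() / c.sum()                             # Eq. (3.31)
    s_w = np.sqrt((c * (err - m_w) ** 2).sum() / c.sum())       # Eq. (3.32)
    return m, s, m_w, s_w
```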
If the disparity is captured at the coarsest level, the finer levels will refine the estimate to very high precision, while if the algorithm fails at a coarse resolution there is no way to recover. In areas with too large disparity, the estimates are arbitrary. The algorithm is then likely to mismatch structures that accidentally coincide due to insufficient shifting of the images. As a consequence, it is hard to compare the statistics when using different shift images, since the statistics depend on the ratio between the image area with measurable disparity and the area with too large disparity. The statistics diagrams should therefore be compared quantitatively only if they belong to the same shift image. The qualitative behavior is comparable for different shift images, though.
The zero estimator is used for comparison. It always estimates zero disparity with full confidence, i.e. the statistics for the zero estimator measure the mean and standard deviation of the shift image.
3.3.3 Increasing number of resolution levels
A test where the maximum disparity is 10 pixels and the number of iterations is one per level has been carried out. The number of resolution levels varies from one through six. An example of typical estimates and confidence values is found in Figure 3.7. Note how the confidence values decrease when the disparity estimation fails. Figures 3.8, 3.9 and 3.10 show the plots of the error standard deviation versus the number of resolution levels. Plots from tests using "Twin Peaks" and "Flip Flop" as disparity images are presented in the left and right columns respectively.
Starting with the results from "Twin Peaks", one can see that most of the filters estimate the disparity accurately for the whole image if the number of levels is more than three. Some of the wide filters manage with only two levels, while a few other filters need four levels to reach the minimum error level. The plots for Gabor 2, Gabor 3 and WFT 11 take off towards high values when the number of levels increases. This is due to large errors in a few regions of the image, and not to any general degradation of the estimates. The errors in these regions are probably
Figure 3.7: Disparity estimates, left, and their confidence values, right, for one resolution level, above, and three resolution levels, below. The maximum disparity is 10 pixels, the number of iterations is one per level, and the filter is lognorm2b. Note how the confidence decreases in neighborhoods where the disparity estimation fails.
caused by singular points on the top level. An interesting observation is that the short filters, non-ring7 and binomial Gaussian differences, have less error than the wider filters when the number of levels is high enough. Unfortunately, this is not true for all types of images, as is shown below.
In the right columns of Figures 3.8, 3.9 and 3.10, none of the filters has successfully estimated the disparity in the whole image. This is due to the
coarse-to-fine approach. In "Twin Peaks" there are two well separated areas, which do not interfere significantly with each other as the image is LP filtered and subsampled. In "Flip Flop" the areas with different disparity are smaller and closer to each other. An LP filtering over sufficiently large regions will always collect information from areas with opposite disparities, averaging to zero. Consequently, the interference within the filter cancels out the advantage of reaching over a greater distance. It is therefore natural that the wide filters generally perform better than the short ones.
The generally poor performance of the windowed Fourier transform filters is discussed separately in Section 3.4.
Figure 3.8: Log of error standard deviation versus number of resolution levels for the non-ring 7, non-ring 11, non-ring 15, Gabor 1, Gabor 2 and Gabor 3 filters. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.9: Log of error standard deviation versus number of resolution levels for the lognorm 1, 2a, 2b, 3a, 3b and 3c filters. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.10: Log of error standard deviation versus number of resolution levels for the windowed Fourier transform 5, 7 and 11 filters and the binomial Gaussian differences filter. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.11: Disparity estimates, left, and their confidence values, right, for five pixels maximum disparity, above, and 20 pixels maximum disparity, below. The number of resolution levels and iterations were fixed to three and one respectively, and the filter used was lognorm2b. Note how the confidence decreases in neighborhoods where the disparity estimation fails.
3.3.4 Increasing maximum disparity
A test where the number of resolution levels and iterations on each level is fixed while the maximum disparity increases has been carried out, using both "Flip Flop" and "Twin Peaks". The maximum disparity increases from 1 to 85 pixels and the number of levels and iterations is five and one respectively. An example of typical estimates and confidence values is
found in Figure 3.11. Note how the confidence values decrease when the disparity estimation fails.
Figures 3.12, 3.13 and 3.14 show the plots of the log of error standard deviation versus maximum disparity. Again, the results corresponding to "Twin Peaks" are presented in the left columns while the right columns are results corresponding to "Flip Flop". The "Flip Flop" results are of marginal value since they approach the zero estimator rapidly, but they are included for completeness.
Most of the curves have a "knee" where the error standard deviation rapidly increases by a factor of ten. The disparity where this occurs is the maximum reachable disparity for each filter. When the disparity is larger than this, the highest level fails and no recovery is possible. Naturally, the wide filters with low peak frequency reach further and therefore show the best results in this test.
Figure 3.12: Log of error standard deviation versus maximum disparity for the non-ring 7, non-ring 11, non-ring 15, Gabor 1, Gabor 2 and Gabor 3 filters, using "Twin Peaks" (left) and "Flip Flop" (right) as shift images. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.13: Log of error standard deviation versus maximum disparity for the lognorm 1, 2a, 2b, 3a, 3b and 3c filters, using "Twin Peaks" (left) and "Flip Flop" (right) as shift images. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.14: Log of error standard deviation versus maximum disparity for the windowed Fourier transform 5, 7 and 11 filters and the binomial Gaussian differences filter, using "Twin Peaks" (left) and "Flip Flop" (right) as shift images. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.15: A natural image used for testing the benefits of combining results from line and grey level images. (Courtesy of CVAP, Royal Institute of Technology, Stockholm.)
3.3.5 Combining line and grey level results
In order to see the benefits of using both grey level images and line images, the experiments in Subsection 3.3.3 were repeated using grey level images only and line images only. However, the structure of the noise image is not ideal for showing how the line images contribute to the overall performance, since there are no extended lines or edges. Instead, a natural image, shown in Figure 3.15, is used as the original image in the scheme in Figure 3.4 on page 48. Figure 3.16 shows the results from the lognorm3c filter when the natural image is used. Each row shows the output certainty, the disparity estimate and the error image. The first row corresponds to using both grey level and line images, the second to grey level images only, and the third to line images only. Note how the combined result benefits from both the other results. Where the line images fail the grey level images succeed, and vice versa.
Figures 3.17, 3.18 and 3.19 show the curves corresponding to the test in Subsection 3.3.3 with the image in Figure 3.15 instead of the noise image in Figure 3.5 on page 49. A general observation is that the minimum
Figure 3.16: Top: The output certainty, the disparity estimate and the error image when using both grey level images and line images. Middle: The same using grey level images only. Bottom: The same using line images only. Note how the combined result benefits from both the other results. Where the line images fail the grey level images succeed, and vice versa.
error levels are slightly higher using the natural image than when using the noise image. This is due to the spatial consistency operation, which spreads the estimates from edges with high certainty into areas with weak image structure.
Figures 3.20, 3.21 and 3.22 show the error plots corresponding to using grey level images only. The results from using line images only are shown in Figures 3.23, 3.24 and 3.25. The differences might at first glance look minimal, but it should be kept in mind that the curves depict statistics for the full image and that the error can be large locally, cf. Figure 3.16.
To point out a few interesting results, compare the nonring7 curves in the left columns of Figures 3.17, 3.20 and 3.23. When using three levels of resolution, the error from the combined result is lower than for either of the other two. On the other hand, when the number of resolution levels is high enough the difference is very small. This might seem disappointing, but is due to the fact that "Twin Peaks" is a nice field without discontinuities. The results corresponding to the "Flip Flop" image show that for most of the filters, combining grey level image and line image estimates is preferable. In particular, the filters with a low center frequency, e.g. nonring15 and lognorm3c, benefit from combining the estimates.
Figure 3.17: Log of error standard deviation versus number of resolution levels using both grey level and line images, for the non-ring 7, non-ring 11, non-ring 15, Gabor 1, Gabor 2 and Gabor 3 filters. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.18: Log of error standard deviation versus number of resolution levels using both grey level and line images, for the lognorm 1, 2a, 2b, 3a, 3b and 3c filters. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.19: Log of error standard deviation versus number of resolution levels using both grey level and line images, for the windowed Fourier transform 5, 7 and 11 filters and the binomial Gaussian differences filter. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.20: Log of error standard deviation versus number of resolution levels using only grey level images, for the non-ring 7, non-ring 11, non-ring 15, Gabor 1, Gabor 2 and Gabor 3 filters. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.21: Log of error standard deviation versus number of resolution levels using only grey level images, for the lognorm 1, 2a, 2b, 3a, 3b and 3c filters. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.22: Log of error standard deviation versus number of resolution levels using only grey level images, for the windowed Fourier transform 5, 7 and 11 filters and the binomial Gaussian differences filter. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.23: Log of error standard deviation versus number of resolution levels using only line images, for the non-ring 7, non-ring 11, non-ring 15, Gabor 1, Gabor 2 and Gabor 3 filters. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.24: Log of error standard deviation versus number of resolution levels using only line images, for the lognorm 1, 2a, 2b, 3a, 3b and 3c filters. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.25: Log of error standard deviation versus number of resolution levels using only line images, for the windowed Fourier transform 5, 7 and 11 filters and the binomial Gaussian differences filter. "Twin Peaks" (left) and "Flip Flop" (right) are used as shift images with 10 pixels maximum disparity. The zero estimator performance is indicated with dash-dotted lines. The solid curves correspond to the unweighted values, while the dashed curves correspond to the weighted values.
Figure 3.26: Above: Left and right image (captured using the 'Getafix' robot head at the Department of Electronic and Electrical Engineering, University of Surrey, UK). Lower left: Disparity estimates thresholded using the confidence values. Lower right: Confidence values. Note that the confidence values are strong on image structures and weak on flat surfaces. The result is obtained with five resolution levels and two iterations on each level. The filter used is nonring7.
3.3.6 Results on natural images
Tests on real life images give similar results, but the performance is harder to quantify since the true disparity is almost always unknown. Three examples are shown in Figures 3.26, 3.27 and 3.28. Note that the confidence values are strong on image structures and weak on flat surfaces.
Figure 3.27: Above: Left and right image (captured using the 'Getafix' robot head at the Department of Electronic and Electrical Engineering, University of Surrey, UK). Lower left: Disparity estimates thresholded using the confidence values. Lower right: Confidence values. Note that the confidence values are strong on image structures and weak on flat surfaces. The result is obtained with five resolution levels and two iterations on each level. The filter used is nonring7.
Figure 3.28: Two images from the Sarnoff tree sequence. Above: Left and right image. Lower left: Disparity estimates thresholded using the confidence values. Lower right: Confidence values. Note that the confidence values are strong on image structures and weak on flat surfaces. The result is obtained with four resolution levels and two iterations on each level. The filter used is nonring7.
3.4 Conclusion
The test results show that the overall performance of the stereo algorithm is not critically dependent on the type of disparity filter used. The consistency checks and edge image filtering, applied in order to enhance the performance of the algorithm, do indeed reduce the impact of the actual filter shape.
There are, however, some conclusions to be drawn. The results of the phase based stereo algorithm in Section 3.3 are somewhat contradictory to the results of the investigation of the singular points in Section 2.3. The stereo tests indicate that filters with a wide bandwidth are preferable, while a narrow bandwidth is more advantageous from a phase scale space point of view. The reason is that broad band filters have fewer phase cycles and can therefore handle greater disparities for a given filter size. The computational efficiency that follows from this implies that the best choice is the filters without wrap-around: difference-of-Gaussians filters are very small and thus require more levels of resolution, while the non-ringing filters are larger and manage with a smaller number of levels. As pointed out earlier, it is not always possible to compensate for a small filter size by increasing the number of levels.
The poor performance of the WFT filters is due to the rectangular window. Such a window gives image structures in the filter periphery the same importance as those close to the filter center. The frequency domain representation of the filters shows that they are sensitive to high-frequency noise due to the long tails of ripples (Figure 2.14 on page 30). As mentioned before, Weng suggests a low pass prefiltering of the input image [71]. The resulting filter is very close to the non-ring filters and so are the corresponding results. The reason for using the non-prefiltered version of the WFT here is to show that the prefiltering is not a minor adjustment to remove a small amount of noise; it is crucial for the operation to work.
3.5 Further research
The problem with filter interference mentioned in Subsection 3.3.3 can be reduced by introducing a correlation stage in the computational structure. A correlation stage means that instead of computing phase differences pointwise only, they are computed over a neighborhood, and the difference value with the highest confidence is used as the disparity estimate. On a certain level, disparities larger than the filter support can then be captured. Larger disparities can thus be estimated for a given filter and a given number of resolution levels. The computational cost is of course higher, especially if the correlation neighborhood is large. A compromise is to use correlation at the coarsest resolution only, since correct estimates there reduce the disparities at lower levels [30, 31].
All real stereo pairs have more or less vertical disparities as well as horizontal ones, due to camera geometry etc. To be able to handle general cases, further work will include extending the algorithm to two dimensions, estimating vertical disparities as well. This can be implemented either by interleaving two one-dimensional algorithms, one horizontal and one vertical, or by using two-dimensional filters in the disparity estimator. A method for two-dimensional disparities based on the phase in the Multiresolution Fourier Transform, MFT, has been developed by Calway, Knutsson and Wilson [17, 16].
Being able to handle multiple disparities in the same image point is also a potentially useful extension. See for instance [70] for a multiple motion approach to motion estimation.
4
HIERARCHICAL DATA-DRIVEN
FOCUS OF ATTENTION
4.1 Introduction
A fundamental problem yet to be solved in computer vision in general, and active vision in particular, is that of how to focus the attention of a system. The issues involved in focus of attention (FOA) incorporate not only where to look next but also more abstract mechanisms such as how to concentrate on certain features in the continuous flow of input data. There are several reasons for narrowing the channel of input information by focusing on specific parts. The most obvious reason for FOA is to reduce the vast amount of input data to match the available computational resources. However, the ability to decompose a complex problem into simpler subproblems has also been put forward as a major motivation for using focus of attention mechanisms.
4.1.1 Human focus of attention
Humans can shift their attention either by moving the fixation point or by concentrating on a part of the field of view. The two types are called overt and covert attention respectively. Covert attention shifts are about four times as fast as overt shifts. This speed difference can be used to check a potential fixation point to see if it is worthwhile moving the gaze to that position.
A number of paradigms describing human focus of attention have been developed over the years [57]. In the zoom-lens metaphor, computational resources can either be spread over the whole field of view, 'wide angle lens', or concentrated on a portion of it, 'telephoto lens'. This metaphor is founded on the assumption that limited computational resources are the main reason for having to focus the attention on one thing at a time. Thus, the problem is to allocate the available resources properly.
The work presented below relates to the search light metaphor [41]. A basic assumption in this metaphor is the division between preattentive and attentive perception. The idea is that the preattentive part of the system makes a crude analysis of the field of view. The attentive part then analyzes areas indicated as being particularly interesting more closely. The two systems should not be seen as taking turns in a time multiplexed manner, but rather as a pipeline where the attentive part makes selective use of the continuous stream of results from the preattentive part. The term 'search light' reflects how the attentive system analyzes parts of the available information by illuminating it with an attentional search light. The reason for having to focus the attention in this metaphor is that some tasks are inherently sequential in nature.
What features or properties are then important for positioning the fixation point? For attentional shifts the criterion is closely connected to the task at hand. Yarbus pioneered the work on studying how humans move the fixation point in images depending on the information sought [88]. For pre-attentional shifts, gradients in space and time, i.e. high contrast areas or motion, are considered to be the important features. Abbott and Ahuja present a list of criteria for the choice of the next fixation point [1]. Many of the items in the list relate to computational considerations concerning the surface reconstruction algorithm presented. However, a few clues from human visual behavior were also included:
Absolute distance and direction. If multiple candidates for fixation points are present, the ones closer to the center of the viewing field are more likely to be chosen. Upward movement is generally preferred to downward movement.
2D image characteristics. If polygonal objects are presented, points close to corners are likely to be chosen as fixation points. When symmetries are present, the fixation point tends to be chosen along symmetry lines.
Temporal changes. When peripheral stimuli suddenly appear, a strong temporal cue often leads to a movement of the fixation point towards the stimuli.
Since fixation point selection is a highly task dependent action, it is probably easy to construct situations that contradict the list above. The reader is urged to consult the appropriate references in order to get a full description of how the results were obtained.
4.1.2 Machine focus of attention
A number of research groups are currently working on incorporating focus of attention mechanisms in computer vision algorithms. This section is by no means a comprehensive overview, but rather presents a few interesting examples.
The Vision as Process consortium, ESPRIT Basic Research Action 3038 and 7108, was united by the scientific hypothesis that vision should be studied as a continuous process. The project aims at bringing together know-how from a wide variety of research fields, ranging from low level feature extraction and ocular reflexes through object recognition and task planning [69, 22].
Ballard and Brown have produced a series of experiments with ocular reflexes and visual skills [6, 11, 13, 12, 7]. The basic idea is to use simple and fast image processing algorithms in combination with a flexible, active perceiving system.
A focus of attention system based on salient features has been developed by Milanese [58]. A number of features are extracted from the input image and are represented in a set of feature maps. Features differing from their
surroundings are moved to a corresponding set of conspicuity maps. These
maps consist of interesting regions of each feature. The conspicuity maps
are then merged into a central saliency map where the attention system
generates a sequence of attention shifts based on the activity in the map.
Brunnstrom, Eklund and Lindeberg have presented an active vision approach to classifying corner points in order to examine the structure of the scene. Interesting areas are detected and potential corner points scrutinized by zooming in on them [15, 14]. The possibility of actively choosing the imaging parameters, e.g. point of view and focal length, allows the classification algorithm to be much simpler than for static images or pre-recorded sequences.
A variation of the search light metaphor, called the attentional beam, has been developed by Tsotsos and Culhane [23, 66, 67]. It is based on a hierarchical information representation where a search light at the top is passed downwards in the hierarchy to all processing units that contribute to the attended unit. Neighboring units are inhibited. The information in the 'beamed' part of the hierarchy is reprocessed without interference from the neighbors; the beam is then used to inhibit the processing elements, and a new beam is chosen.
4.2 Space-variant sampled image sensors: Foveas
4.2.1 What is a fovea?
Space-variant image sampling is not strictly necessary for studying focus of attention and gaze control, but the need for positioning the fixation point is more evident for such a sensor. There are, however, compelling biological and technical reasons for exploring the use of space-variant sampled sensors.
The human eye has its highest resolution in the center, and the resolution decays toward the periphery. For the first 15° it decays linearly with the angle to the optic
axis; beyond that the reduction is even faster. The center part is called the fovea and covers a visual field of about 1.75°. This corresponds to about 45 millimeters at a distance of 1.5 meters. As a comparison, the total visual field using both eyes is about 180° [38]. There are a number of advantages in such an arrangement, for example:
Data reduction compared to having the whole field of view in full resolution.

High resolution is combined with a broad field of view.

The effects of image warping due to wide-angle lens distortion are reduced. The distortion increases with the angle from the optic axis, but the resolution decreases.

The peripheral vision gathers information about other possible points of interest and about context. Therefore, the processing in this area does not have to be as comprehensive.
These advantages can be utilized in a robot vision system as well. There
are a number of research projects developing both hardware and algorithms for space-variant sampled image arrays, all exemplifying implementations of the fovea concept [65, 64, 68].
In human vision fovea refers to the central part of the retina, but in robot vision the term is often used to indicate that the system treats the central and the peripheral parts of the field of view differently.
4.2.2 Creating a log-Cartesian fovea
The log-Cartesian fovea representation of the field of view is a central part of the system presented in this thesis, and it can be seen as a special case of a subsampled resolution pyramid. The difference is that only the center part is represented in all resolutions, or scales (Figure 4.1 on the following page).
Figure 4.1: Upper left: Original image. The skewed appearance is due to the broad field of view, 90°. Upper right: Low-pass pyramid. Lower left: Center part of each level in the pyramid. Lower right: Interpolated center part images stacked on top of each other. The borders between the levels are marked for clarity.
Figure 4.2: Topmost left: The original 512 × 512 image. The following five 32 × 32 images are the log-Cartesian fovea representation. The levels correspond to the visual angles 90°, 53°, 28°, 14° and 7°, respectively. The levels are numbered from 0 to 4, where 0 corresponds to the finest resolution.
In the system presented here the log-Cartesian fovea is generated by repeated LP-filtering, subsampling and cropping in octaves. The input image is LP-filtered with a Gaussian filter [36], which allows subsampling by a factor of two. The new image then has the same field of view but half the number of pixels in each direction. The subsampling procedure is repeated until the desired number of levels is reached. This generates a subsampled resolution pyramid with the same field of view but a different size on each level. All levels are then cropped to the same size by keeping the center parts and discarding the rest. The result is a number of levels with the same number of pixels but with different resolution, i.e. covering different fields of view. In a real system it would be preferable to have an optic system that generates the fovea as the input images are digitized.

Figure 4.1 shows an example starting with a 512 × 512 image and using 5 levels. LP-filtering and subsampling four times reduces the size of the final level to 32 × 32 pixels. Cutting out the center 32 × 32 pixels at each level reduces the data by a factor of 512·512/(5·32·32) = 51.2. Figure 4.2 on the page before shows the individual images in the fovea representation. The lower right image in Figure 4.1 shows the total field of view, which is visualized by interpolating and combining the images.
Note that although the fovea representation is often visualized as one image with varying resolution, it is actually N separate images with different resolution. The levels are numbered from 0 to N−1, where 0 corresponds to the highest resolution.
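The construction above can be summarized in a few lines of code. The following sketch is only an illustration of the procedure (the thesis system itself was implemented in AVS); the Gaussian width and the use of scipy's gaussian_filter are assumptions, not the filter actually used.

import numpy as np
from scipy.ndimage import gaussian_filter

def log_cartesian_fovea(image, levels=5, crop_size=32, sigma=1.0):
    """Build a log-Cartesian fovea: 'levels' images of crop_size x crop_size pixels,
    level 0 = finest resolution (smallest field of view), higher levels coarser."""
    pyramid = [image.astype(float)]
    for _ in range(levels - 1):
        lp = gaussian_filter(pyramid[-1], sigma)   # LP-filter before subsampling
        pyramid.append(lp[::2, ::2])               # subsample by a factor of two
    fovea = []
    for lvl in pyramid:
        r0 = (lvl.shape[0] - crop_size) // 2
        c0 = (lvl.shape[1] - crop_size) // 2
        fovea.append(lvl[r0:r0 + crop_size, c0:c0 + crop_size])  # keep the center part
    return fovea  # fovea[0] covers the smallest visual angle at the finest resolution

# Example: a 512 x 512 image gives five 32 x 32 levels, a data reduction of 51.2.
fovea = log_cartesian_fovea(np.random.rand(512, 512))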
4.2.3 Image operations in a fovea
The image operations that are used are applied on all levels of resolution. Figure 4.3 on the following page illustrates two different ways of handling the part of a filter that reaches outside the image. The image is the finest resolution part of the fovea representation in Figure 4.2 on page 87. When convolving an image with a filter, it is common to pad the image by repeating the border pixels as far as the filter reaches outside the image, as in the upper left image of Figure 4.3. Note how the texture is expanded into linear structures to the right and below the image. Also note how the lower right corner pixel turns into a large square. On images much larger than the filter this distortion is often acceptable, since it affects only a small portion of the total image. In a fovea representation, however, the images are only a few times larger than the filters. Especially when using successive filtering this becomes a problem, since the border effects spread and eventually dominate the results.
When filtering a particular level in a fovea representation, that level can be padded with information from the nearest coarser level, which covers a larger neighborhood but with lower resolution (lower left in Figure 4.3). This means that it is possible to get a better border by interpolating the corresponding area in the nearest coarser level. The unwanted border effects that otherwise might disturb the algorithms are then reduced. Note, for instance, that the cube appears to be much larger when border extension is used. This does not happen if the information instead is picked from the nearest coarser level (lower right in Figure 4.3).
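As a sketch of this padding scheme, the fragment below pads one fovea level with values interpolated from the next coarser level, under the convention used above that level j+1 covers twice the visual angle of level j at the same pixel count. The function name and the bilinear zoom are illustrative choices, not the thesis implementation.

import numpy as np
from scipy.ndimage import zoom

def pad_with_coarser(level_j, level_j1, pad):
    """Pad level j with 'pad' pixels on each side, taken from level j+1.

    level_j and level_j1 are same-sized square arrays; level j+1 covers twice the
    field of view, so upsampling it by 2 and cropping gives a padded level j."""
    n = level_j.shape[0]
    up = zoom(level_j1, 2, order=1)             # coarser level interpolated to level-j scale
    c = up.shape[0] // 2
    half = n // 2 + pad
    padded = up[c - half:c + half, c - half:c + half].copy()
    padded[pad:pad + n, pad:pad + n] = level_j  # keep the original high-resolution center
    return padded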
4.3 Sequentially defined, data modified focus of attention
4.3.1 Control mechanism components
Having full resolution only in the center part of the visual field makes it obvious that a good algorithm for positioning the fixation point is necessary.
Figure 4.3: Upper left: One level in a fovea. Upper right: Padding
by copying border pixels. Note how the texture is expanded into linear
structures to the right and below the image. Also note how the lower
right corner pixel turns into a big square. Lower left: Next coarser level
gives information about a larger neighborhood around the same point,
but with lower resolution. Lower right: Using interpolated information
from the coarser level to pad the image. The interpolated information is
more coherent with the real neighborhood than the extended border.
A number of focus-of-attention control mechanisms must be active simultaneously to be able to both handle unexpected events and perform an effective search. The different components can roughly be divided into the following groups:
1. Preattentive, data driven control. Non-predicted structured image
information and events attract the focus-of-attention in order to get
the information analyzed.
2. Attentive, model driven control. The focus-of-attention is directed toward an interesting region according to predictions using already acquired image information or a priori knowledge.
3. Habituation. As image structures are analyzed and modeled their
impact on preattentive gaze control is reduced.
The distinction between the preattentive and attentive parts is floating; it is more of a spectrum from pure reflexes to pure attentional movements of the fixation point.
4.3.2 The concept of nested regions of interest
In the hierarchical system shown in Figure 1.2 all levels might have an idea about how to position the camera in order to solve their own current problems. One way of solving this is to let the different levels take turns in controlling the camera. Another way is to recognize the fact that the interesting area for a level is often a sub-area of the interesting area of the level above. Moreover, the task of a level is often directly related to the task of the level above. Positioning the fixation point can therefore be a refinement from coarse to fine, where higher levels give the major region of interest and lower levels adjust to interesting areas within that region. Consider the following stylized example:
Assume there are the following four major levels, from top to bottom, in
the information processing hierarchy on the left hand side of the pyramid
(the names are borrowed from the Esprit project BRA 3038, Vision as
Process):
1. System Supervisor
2. Symbolic Scene Interpreter
3. 3D Geometric Interpreter
4. Low Level Feature Extractor

Figure 4.4: Regions of interest for the different subsystems in the hierarchy.
Further assume that we have a table with a few objects, as in the top left of Figure 4.4, and the task is simply "watch the table". Assuming that the system supervisor knows where the table is, it determines a region of interest (top right). Note that the circles only refer to the positioning of the fixation point, not to any other limitations of the viewing field. The fixation point is in the middle of the table. The symbolic scene
interpreter has found indications of an object or group of objects in the upper left corner of the table and sets its region of interest accordingly (lower left). The 3D Geometric Interpreter starts its task with a further refinement of the region of interest by marking the bright object (lower right). Finally, on the lowest level, the Low Level Feature Extractor selects an area in which to start modelling the structures in the image. The fixation point is now moved towards the interesting object by the nested regions of interest. The general response command "watch the table" has been transformed into "focus on the set of objects in the upper left corner of the table".
The borders of the regions of interest are not absolute, rigid boundaries. They are rather recommendations that can be neglected if there are good reasons. How good a reason must be is controlled with an importance value. This can be illustrated by viewing the region of interest as a basin, or potential well, within which the lower levels can move around freely. The slope and height of the walls are proportional to how important it is to keep the fixation point there.
There are, however, situations when the lower levels are supposed to violate the directives from superior levels. One such occasion is event detection. Here, an event is an unpredicted gradient in space and/or time. When an interesting or important stimulus is detected, the fixation point should be moved in that direction on a reflex basis. Suppose the system is watching the table and an object enters the scene and is detected in the 'corner of the eye'. The resolution in the periphery is probably not high enough to see what it is, only where it is. The low level feature extractor is the first level to detect the event and reacts by pulling the fixation point towards the event. By the time the higher levels react to the event there are already extracted low-level features with high resolution available. The higher levels will now either move their regions of interest to analyze the event further or force the fixation point back to the original position. Thus, the different regions of interest do not have to be determined in the order indicated in Figure 4.4. It all depends on whether it is an attentive or preattentive movement of the fixation point.
Figure 4.5: The robot configuration. The robot is a Puma 560 arm with six degrees of freedom, equipped with a head with two movable cameras with one degree of freedom each; the camera pan angle.
4.4 Gaze control
4.4.1 System description
In the experiments below the robot consists of an arm with a camera head (Figure 4.5). The robot is a Puma 560 arm with six degrees of freedom. In the experiments presented here only two degrees of freedom are used. They implement the head pan and head tilt angles. The head has two movable cameras with one degree of freedom each; the camera pan angle. The purpose of this system might be automatic identification, inspection or even surveillance of the objects in the scene in front of it. This type of robotic vision system is widespread; see for instance [1, 11, 21]. The system, both image generation and analysis, is implemented in the Application Visualisation System, AVS, which is described in Chapter 6.
The cameras are equipped with log-Cartesian sensors. The total field of view is 90° and the fovea consists of 5 levels. The individual fields of view in the experiments are 7°, 14°, 28°, 53° and 90°, respectively.

The outward response of this particular system is designed to enable information gathering. Interesting events in the field of view attract the fixation point in order to get them within the high-resolution part of the fovea. The robot does not move objects. In these experiments it is only permitted to change the point of view using the head pan and tilt, and camera vergence.
The robot has a gaze control system with three levels (Figure 4.6). The top level is an object detector, or rather a symmetry detector [9, 37, 82], drawing the attention towards regions of high complexity. The second level is a line tracker drawing the fixation point towards, and along, linear structures based on local orientation and phase [44, 75, 76]. The lowest level verges the cameras to make them have the same fixation point, using the disparity estimates from the stereo algorithm described in Chapter 3.
4.4.2 Control hierarchy
The left-hand side of Figure 4.6 shows the feature hierarchy with increasing abstraction going upwards. More abstract features are stepwise composed from simpler ones. The features are used both as ground for more complex features and as modifiers for response outputs. The right-hand side of the same figure shows the response hierarchy with increasing specificity going downwards [33, 72]. The refinement of the positioning of the fixation point is handled with potential fields in the robot's parameter space [52]. It can be visualized as an 'energy landscape', as in Figure 4.10 on page 103, where the fixation point trajectory is the path a little ball freely rolling around would take. The fixation point can be moved to a certain position by forming a potential well around the position in the parameter space, causing the robot to look in that direction.
Figure 4.6: The preattentive focus of attention system.
4.4.3 Disparity estimation and camera vergence
The lowest level in the control system is the camera vergence process. The cameras are verged symmetrically to get zero disparity in the center of the image regardless of the state of the system. If the head is moving, the vergence is calculated using the disparity from the part of the field of view that is predicted to be centered in the next time step.

The disparity is measured using the multi-scale method based on the phase in quadrature filters described in Chapter 3. The fovea version has two major differences compared to the computation structure in Figure 3.2 on page 39. First, only the center of the field of view is represented in all resolutions. The accuracy of disparity estimates therefore decays towards the periphery of the field of view. Second, the edge extractor is used on
Figure 4.7: Vector representation of local orientation.
every level of the input pyramid, instead of creating a pyramid from the edge representation of the finest resolution. This difference is motivated by a possible future fovea sensor array. In such a system it would not be possible to have the computational structure in Figure 3.2, since a high-resolution image of the total field of view does not exist.
4.4.4 The edge tracker
Apart from estimating disparity, the phase from quadrature filters is also used to generate a potential field drawing the attention towards and along lines and edges in the image [75]. This is the second level in the control structure in Figure 4.6.
Local orientation
An algorithm for phase-invariant orientation estimation in two dimensions is presented in [44]. Phase-invariant means that the orientation of a locally one-dimensional signal can be estimated regardless of whether it is an edge or a line, i.e. regardless of the phase. The orientation is represented as a complex number, where the argument represents the local orientation estimate and the magnitude indicates the certainty of the estimate [32].
Figure 4.7 shows the correspondence between the complex number and the local orientation. Note that the argument varies twice as fast as the orientation of the local structure:

z = M e^{iα} = M e^{i2φ}   (4.1)

where φ is the angle between the gradient and the horizontal axis. Rotating a line π radians makes it look the same as the initial line, which means that the representation has to be the same. The key feature of this representation is that maximally incompatible orientations are mapped on complex numbers with opposite signs. This, in turn, makes averaging a meaningful operation.
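A small numerical illustration of the double-angle representation in Equation (4.1): orientations 90° apart map to complex numbers of opposite sign and cancel when averaged, while a rotation by π leaves the representation unchanged. The snippet is a sketch only.

import numpy as np

def orientation_to_z(phi, magnitude=1.0):
    """Double-angle representation: the argument varies twice as fast as the orientation."""
    return magnitude * np.exp(2j * phi)

z_a = orientation_to_z(0.0)           # horizontal structure
z_b = orientation_to_z(np.pi / 2)     # vertical structure -> opposite sign
z_c = orientation_to_z(np.pi)         # rotating the line by pi gives the same z as z_a

print(np.allclose(z_a, z_c))          # True: the representation is pi-periodic
print(abs((z_a + z_b) / 2))           # ~0: maximally incompatible orientations cancel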
When working with resolution pyramids, information is LP-filtered, subsampled, interpolated, etc. The information representation has to be continuous in order for these operations to give meaningful results. If the average of the representations of the structure in two points represents a structure that is completely different, then the representation is not useful in a resolution pyramid.
Representation of phase in higher dimensions
In Chapter 2 only one-dimensional signals are discussed. The extension
of the phase concept into two or more dimensions is not trivial [36]. In
Section 2.1 it is shown that the local phase is connected to the analytic
function and hence to the Hilbert transform. A direction of reference has
to be introduced in order to make a multi-dimensional definition of the
Hilbert transform possible. Thus, local phase needs a direction of reference
as well.
If a continuous representation is desired, the phase cannot be represented
with only a single value, although it is a scalar. The phase representation
has to include both the phase value and the reference direction. This
means in the general case that if the dimensionality of the signal space is
N, the dimensionality of the phase representation is N+1 [34].
Figure 4.8: A dark disc on a bright background. The ê vectors, marked with arrows, are used as the phase reference direction. Note the opposite signs on the phase in regions A and F, and in regions C and D. The table contains the phase reference direction as an angle to the horizontal axis, φ, and the phase value, θ.
Figure 4.8 shows an example in 2D where the neighboring regions A and F, and regions C and D, have phase estimates, θ, with opposite signs. This makes meaningful averaging impossible. For instance, if f_C and f_D denote the phase filter outputs, the average between regions C and D is:

θ_aver = arg[ (1/2)(‖f_C‖ e^{iθ_C} + ‖f_D‖ e^{iθ_D}) ]   (4.2)
       = arg[ (1/2)(‖f_C‖ e^{iπ/2} + ‖f_D‖ e^{−iπ/2}) ]   (4.3)
       = arg[ (i/2)(‖f_C‖ − ‖f_D‖) ]   (4.4)

Thus, the average phase can be π/2, −π/2, or even undefined, depending on the relationship between the filter magnitudes in the regions.
The reason for the shifting sign on the phase value is the definition of the reference direction, marked with arrows in Figure 4.8. The reference direction is extracted from the orientation estimate by halving the argument:

ê = (e_1, e_2)ᵀ = (cos(arg(z)/2), sin(arg(z)/2))ᵀ   (4.5)
Since the phase is measured along ê, it will change sign if ê changes to the opposite direction. Two neighboring points may have ê in opposite directions and thus phase values with opposite signs, although they belong to the same image structure. Averaging over such a neighborhood would therefore be meaningless. It can be argued that choosing the phase reference directions such that they all point out from the object solves the problem, but it is impossible to locally determine what is the inside or the outside of an object. With only local information available, region A could for instance be a region F on a white disc on a dark background.
A 2D-phase representation, suggested by Knutsson, that includes the reference direction in a two-dimensional space is:

x = (x_1, x_2, x_3)ᵀ = (M cos(φ) sin(θ), M sin(φ) sin(θ), M cos(θ))ᵀ   (4.6)

where

M ∈ [0, 1] is the signal energy,
φ ∈ [0, π] is the reference direction, and
θ ∈ [−π, π] is the phase value.
Resolving the phase angle, θ, gives:

θ = arctan( √(x_1^2 + x_2^2), x_3 )   (4.7)
Figure 4.9 shows the representation, which can be interpreted as a 3D vector of length M rotated an angle θ in a plane defined by ê and x̂_3. The shaded circle corresponds to the phase representation in 1D shown in Figure 2.3 on page 9. An intuitive feeling for how this representation solves the problem in the example above can be obtained by some mental imagery. Turn ê around x_3 until it points in the opposite direction. The phase value, θ, is then defined in the opposite direction. In other words, when the reference direction changes sign the phase angle definition also changes sign.
Figure 4.9: A 3D representation of phase in 2D.
The phase estimate in region C in Figure 4.8 on page 99 is now:

x_C = (M_C cos(π) sin(π/2), M_C sin(π) sin(π/2), M_C cos(π/2))ᵀ = (−M_C, 0.0, 0.0)ᵀ   (4.8)

and in region D it is:

x_D = (M_D cos(0) sin(−π/2), M_D sin(0) sin(−π/2), M_D cos(−π/2))ᵀ = (−M_D, 0.0, 0.0)ᵀ   (4.9)
The average phase in a neighborhood is simply the average of the components, x_i, respectively:

x_aver = ( (x_1C + x_1D)/2, (x_2C + x_2D)/2, (x_3C + x_3D)/2 )ᵀ = ( ((−M_C) + (−M_D))/2, 0.0, 0.0 )ᵀ   (4.10)
Note that the direction of the average phase vector is now independent of the signal energy in the two filters, as opposed to the case in Equation (4.2).
The average phase angle, θ_aver, can be calculated using Equation (4.7):

θ_aver = arctan( (M_C + M_D)/2, 0 ) = π/2   (4.11)

which is independent of the relationship between M_C and M_D.
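The averaging argument can be verified numerically. The sketch below builds the 3D phase vectors of Equation (4.6) for regions C and D, averages them component-wise as in Equation (4.10), and recovers the phase angle with Equation (4.7); the magnitude values are arbitrary examples.

import numpy as np

def phase_vector(M, phi, theta):
    """3D representation of 2D phase, Equation (4.6)."""
    return np.array([M * np.cos(phi) * np.sin(theta),
                     M * np.sin(phi) * np.sin(theta),
                     M * np.cos(theta)])

def phase_angle(x):
    """Recover the phase angle, Equation (4.7)."""
    return np.arctan2(np.hypot(x[0], x[1]), x[2])

x_C = phase_vector(0.8, np.pi, np.pi / 2)    # region C: reference direction pi, phase pi/2
x_D = phase_vector(0.3, 0.0, -np.pi / 2)     # region D: reference direction 0, phase -pi/2
x_avg = 0.5 * (x_C + x_D)                    # component-wise average, Equation (4.10)

print(phase_angle(x_avg))                    # pi/2, independent of M_C and M_D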
Estimating line/edge position and orientation
The local orientation estimates generated from the input images are directly useful for following locally one-dimensional structures in the image, since they point out in which direction to move. The phase estimates, however, do not directly point out the direction to move, since it depends on whether the estimates are generated from an edge or a line. If it is a bright line one should move towards θ = 0, and if it is an edge one should move towards θ = π/2, and so on. To get around this problem, the magnitude of the orientation algorithm is used as input for the 2D-phase algorithm:

m(ξ) = ‖z(ξ)‖   (4.12)

The orientation magnitude image, m(ξ), forms a line sketch of the image where lines and edges look the same regardless of whether they are bright lines on a dark background, dark lines on a bright background, bright-to-dark edges or dark-to-bright edges. The 2D-phase estimate will therefore give the distance to the one-dimensional structure. Moving the fixation point towards θ = 0 will now be correct for both lines and edges in the original image.
The 2D-phase is applied on a region of interest covering the center pixels on each level in the fovea. Denote the average phase estimate on level j:

x̄_j = (1/N) Σ_{i=1}^{N} x(i)   (4.13)

Typically, the four center pixels are used to get the 2D-phase value on each level. The 2D-phase vector magnitude can be visualized as the energy landscape, or potential field, in Figures 4.10 and 4.11. Note how energy valleys follow the locally oriented structures on each level.
Figure 4.10: Potential fields generated by lines and edges. Top: Level 0, 7° view field. Bottom: Level 1, 14° view field. The fixation point is on the edge of the cube on the table.
Figure 4.11: Potential fields generated by lines and edges. Top: Level 2, 28° view field. Middle: Level 3, 53° view field. Bottom: Level 4, 90° view field. The fixation point is on the edge of the cube on the table.
Figure 4.12: Vector representation of rotation symmetries.
4.4.5 The object finder
The third level in the control hierarchy in Figure 4.6 concerns objects, or rather possible objects. Reisfeld et al. argue that symmetries are important features for preattentive gaze control [62]. The object finder is based on rotation symmetries. These symmetries are defined as the rotations of the orientation estimates within a neighborhood [46, 9, 82]. Figure 4.12 shows the vector representation of these symmetries. Note how complex values with opposite signs again represent maximally incompatible patterns. Overlaying the concentric circles on the star gives orthogonal line crossings everywhere. This is also true for the two spiral patterns. It might be hard to see that the pattern transformation is continuous when changing α. By studying the orientation estimates generated from the patterns, the continuity is apparent. A consistency algorithm is applied to enhance the neighborhoods that fit the symmetries well [47, 48].
Rotation symmetry estimation
The orientation estimates for the concentric circles pattern can be written as a function of the distance to the center and the angle to the horizontal axis:

f(ξ) = f(‖ξ‖) e^{i2 arg(ξ)}   (4.14)

where ξ is the position vector from the center of the symmetry. The corresponding function for the star pattern is:

f(ξ) = f(‖ξ‖) e^{i(2 arg(ξ) + π)}   (4.15)

i.e. a phase shift of π. The spiral patterns correspond to phase shifts of ±π/2. The general function is:

f(ξ) = f(‖ξ‖) e^{iθ_f(ξ)} = f(‖ξ‖) e^{i(2 arg(ξ) + α)}   (4.16)

where α is determined by the pattern according to Figure 4.12.
A filter for detecting these symmetries should display the conjugated symmetry itself:

b(ξ) = b(‖ξ‖) e^{iθ_b(ξ)} = b(‖ξ‖) e^{−i2 arg(ξ)}   (4.17)

where ξ is the position vector from the center of the filter.
The magnitude function can be any window function. Here the magnitude is a squared cosine function with zero magnitude in the center:

‖b(ξ)‖ = cos^2(πr/8)   if 1 ≤ r = √(ξ_1^2 + ξ_2^2) ≤ 7
‖b(ξ)‖ = 0             otherwise   (4.18)
The filter response when centered on a rotation symmetry pattern is:

s(0) = Σ_ζ f(ξ − ζ) b(ζ) |_{ξ=0}   (4.19)
     = Σ_ζ f(‖0 − ζ‖) e^{iθ_f(0−ζ)} b(‖ζ‖) e^{iθ_b(ζ)}   (4.20)
     = Σ_ζ f(‖ζ‖) b(‖ζ‖) e^{i(θ_f(−ζ) + θ_b(−ζ))}   (4.21)
Using the definitions in Equations (4.16) and (4.17) in Equation (4.19) gives:

s(0) = Σ_ζ f(‖ζ‖) b(‖ζ‖) e^{i(2 arg(−ζ) + α − 2 arg(−ζ))}   (4.22)
     = Σ_ζ f(‖ζ‖) b(‖ζ‖) e^{iα}   (4.23)
     = ‖s‖ e^{iα}   (4.24)
Equation (4.22) shows that the filter, b, estimates the correct rotation symmetry when it is centered on it. Unfortunately, the filter also responds off center and to linear structures. The selectivity can be enhanced by using a consistency algorithm [47, 48, 82]. This algorithm requires three additional filterings with different combinations of filter and data magnitudes. The four filter results are:

s_1 = f ∗ b   (4.25)
s_2 = f ∗ ‖b‖   (4.26)
s_3 = ‖f‖ ∗ b   (4.27)
s_4 = ‖f‖ ∗ ‖b‖   (4.28)
The second convolution, s_2, is obtained by using the filter magnitude as a scalar filter on the input data. Similarly, s_3 comes from using the complex filter with the data magnitude as a scalar image. Finally, the magnitude of the filter is convolved with the magnitude of the data. A consistency operation is obtained if the four outputs are combined as:

s = (s_4 s_1 − s_2 s_3) / s_4   (4.29)
Figure 4.13 on the following page shows a test pattern with rotation symmetries and the results of the symmetry detector.
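As an illustration, the sketch below builds a rotation symmetry filter with the conjugated double-angle variation of Equation (4.17) and a squared-cosine magnitude as in Equation (4.18), and combines the four convolutions of Equations (4.25)-(4.28) into the consistency operation of Equation (4.29). The filter radius, the window parameters and the convolution routine are assumptions; the input z is a complex orientation image.

import numpy as np
from scipy.ndimage import convolve

def symmetry_filter(radius=7):
    """Complex filter b(xi) = ||b|| * exp(-i 2 arg(xi)), cf. Eqs. (4.17)-(4.18)."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    r = np.hypot(x, y)
    mag = np.where((r >= 1) & (r <= radius), np.cos(np.pi * r / 8.0) ** 2, 0.0)
    return mag * np.exp(-2j * np.arctan2(y, x))

def conv_c(image, kernel):
    """Convolution of complex images via their real and imaginary parts."""
    return (convolve(image.real, kernel.real) - convolve(image.imag, kernel.imag)
            + 1j * (convolve(image.real, kernel.imag) + convolve(image.imag, kernel.real)))

def rotation_symmetry(z, eps=1e-9):
    b = symmetry_filter()
    s1 = conv_c(z, b)                               # f * b
    s2 = conv_c(z, abs(b) + 0j)                     # f * ||b||
    s3 = conv_c(abs(z) + 0j, b)                     # ||f|| * b
    s4 = convolve(abs(z), abs(b))                   # ||f|| * ||b||
    return (s4 * s1 - s2 * s3) / (s4 + eps)         # consistency operation, Eq. (4.29)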
Figure 4.13: Left: Rotation symmetry test pattern. Right: The results from the symmetry detector overlaid on the original image. This image pair is borrowed from [81]. The test pattern originally appeared in [9].
Rotation symmetry localization
Objects small enough to be covered with one glance can be seen as an imperfect instantiation of the concentric circles, i.e. θ_s ≈ 0, where θ_s = arg(s). The estimates are therefore attenuated with the argument, θ_s, i.e. attenuating the estimates from star-like patterns:

s_o = ‖s‖ cos^2(θ_s)   if −π/2 ≤ θ_s ≤ π/2
s_o = 0                otherwise   (4.30)
The result is a 'closed area detector'. It marks areas with evidence for
being closed, and the intensity is a measure of how much evidence there
is. If the concentric circle estimates are attenuated instead, the operation
turns into a corner detector.
A vector field pointing towards the local mass center of s_o is produced with three separable filters:

h_m(ξ) = cos^2(πξ_1/8) cos^2(πξ_2/8)   if ξ_1, ξ_2 ∈ [−7, 7]
h_m(ξ) = 0                             otherwise   (4.31)

h_1(ξ) = h_m(ξ) ξ_1   (4.32)
h_2(ξ) = h_m(ξ) ξ_2   (4.33)
The output from h_m is used both for normalization and as a rotation symmetry certainty image:

M_m = h_m ∗ s_o   (4.34)

The vector field with vectors pointing to the local mass center is:

V_m = ( h_1 ∗ s_o / M_m , h_2 ∗ s_o / M_m )ᵀ   (4.35)
Interpreting the vector fields on all levels, V_m^j, as gradient fields of energy landscapes gives the potential fields in Figures 4.14 and 4.15. Note how a potential well is created everywhere there is evidence for a closed contour.
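A corresponding sketch of Equations (4.30)-(4.35): the symmetry response s is attenuated with its argument to form the closed-area image s_o, and three separable filters produce the certainty image M_m and a vector field pointing toward the local mass center. The window parameters and the small constant guarding the division are assumptions.

import numpy as np
from scipy.ndimage import convolve

def closed_area(s):
    """Attenuate star-like responses, Eq. (4.30)."""
    theta = np.angle(s)
    keep = np.abs(theta) <= np.pi / 2
    return np.where(keep, np.abs(s) * np.cos(theta) ** 2, 0.0)

def mass_center_field(s, radius=7, eps=1e-9):
    """Certainty image M_m and vector field V_m, Eqs. (4.31)-(4.35)."""
    s_o = closed_area(s)
    xi = np.arange(-radius, radius + 1)
    w = np.cos(np.pi * xi / 8.0) ** 2                      # 1D squared-cosine window
    h_m = np.outer(w, w)                                   # separable 2D window
    h1 = h_m * xi[np.newaxis, :]                           # weighted by horizontal coordinate
    h2 = h_m * xi[:, np.newaxis]                           # weighted by vertical coordinate
    M_m = convolve(s_o, h_m)                               # Eq. (4.34)
    V_m = np.stack([convolve(s_o, h1) / (M_m + eps),
                    convolve(s_o, h2) / (M_m + eps)])      # Eq. (4.35)
    return M_m, V_m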
Figure 4.14: Potential fields generated by rotation symmetries. Top: Level 0, 7° view field. Bottom: Level 1, 14° view field. The fixation point is on the edge of the cube on the table.
Figure 4.15: Potential fields generated by rotation symmetries. Top: Level 2, 28° view field. Middle: Level 3, 53° view field. Bottom: Level 4, 90° view field. The fixation point is on the edge of the cube on the table.
Figure 4.16: The quantization of the pan-tilt parameter space used as
memory.
4.4.6 Model acquisition and memory
So far, only processes that attract the fixation point have been considered. In Section 4.3, habituation was mentioned as a mechanism needed for an operating focus of attention system. The basis for such a mechanism is some sort of memory.
A first step toward a rudimentary memory of where the system has looked before is shown in Figure 4.16. It is a form of motor memory consisting of an array that quantizes the parameter space spanned by the head pan and tilt angles. Since the robot moves only these two joints, the parameter space is two-dimensional and there is a one-to-one mapping to the possible view directions.
The system remembers where it has seen something by marking the positions in the memory array that correspond to the fixation directions in which it has been tracking lines and edges. Bilinear interpolation is used between neighboring bins. The memory is used to indicate that an edge or a line has been tracked before and that the system should move its fixation point elsewhere.
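A minimal sketch of such a motor memory is given below: a 2D array over the quantized pan-tilt space, updated with bilinear weights at each fixation and read out to decide whether the current direction has been tracked before. The grid size, the parameter ranges and the read-out test are illustrative assumptions (the threshold value used in the experiments, T_m = 3.0, is given later in this chapter).

import numpy as np

class MotorMemory:
    """Quantized pan-tilt memory with bilinear update, cf. Figure 4.16."""
    def __init__(self, bins_pan=32, bins_tilt=16, t_m=3.0):
        self.grid = np.zeros((bins_tilt, bins_pan))
        self.t_m = t_m

    def _coords(self, pan, tilt):
        # Map pan in [-pi, pi] and tilt in [-pi/2, pi/2] to fractional grid coordinates.
        j = (pan + np.pi) / (2 * np.pi) * (self.grid.shape[1] - 1)
        i = (tilt + np.pi / 2) / np.pi * (self.grid.shape[0] - 1)
        return i, j

    def update(self, pan, tilt):
        """Distribute one unit over the four closest bins (bilinear interpolation)."""
        i, j = self._coords(pan, tilt)
        i0, j0 = int(i), int(j)
        di, dj = i - i0, j - j0
        for ii, jj, w in [(i0, j0, (1 - di) * (1 - dj)), (i0, j0 + 1, (1 - di) * dj),
                          (i0 + 1, j0, di * (1 - dj)), (i0 + 1, j0 + 1, di * dj)]:
            if ii < self.grid.shape[0] and jj < self.grid.shape[1]:
                self.grid[ii, jj] += w

    def been_here_before(self, pan, tilt):
        """Bilinearly interpolated read-out compared against the threshold."""
        i, j = self._coords(pan, tilt)
        i0, j0 = int(i), int(j)
        i1 = min(i0 + 1, self.grid.shape[0] - 1)
        j1 = min(j0 + 1, self.grid.shape[1] - 1)
        di, dj = i - i0, j - j0
        value = ((1 - di) * (1 - dj) * self.grid[i0, j0] + (1 - di) * dj * self.grid[i0, j1]
                 + di * (1 - dj) * self.grid[i1, j0] + di * dj * self.grid[i1, j1])
        return value > self.t_m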
In a general system where many points in the parameter space might
correspond to looking at the same thing, an extended approach to memory
Figure 4.17: State transition network for the test system.
is needed. It is then important to remember not only where but also what the system has seen. For non-static scenes, when also becomes important. This requires a procedure for model acquisition, which is an ultimate goal for this process.
4.4.7 System states and state transitions
The potential fields in Figures 4.10, 4.11, 4.14 and 4.15 are weighted together differently depending on what state the robot is currently in. Figure 4.17 shows the states and the possible transitions. The transitions between the states are determined by the type and quality of the data at the fixation point. Before going into the details, an overview of the states and the state transitions is presented.
Suppose the system is in the state of locating a possible object. It then uses the rotation symmetry estimates on the coarser levels of the fovea representation. When the distance to a symmetry is small enough (Close to Symmetry), the system starts to search for the lines and edges of the object. The edge tracking procedure starts when the line or edge is fixated (On Line). If the line or edge is lost (Line Lost), the line search starts again. The system moves away from an object when the fixation point returns to a position where it has tracked before (Been Here Before).
When a new symmetry is encountered (New Symmetry), the system starts moving towards it. If the symmetry is lost (Symmetry Lost), the system starts searching for linear structures that hopefully will lead towards new interesting areas.
The camera and head parameters are calculated by defining an image point to be fixated. The image point can be seen as attracting the gaze and is therefore called the attracting point, denoted v. The next fixation point is estimated independently for the right and left cameras. The camera orientation parameters are then calculated from an average of the two fixation points:

v = (v_l + v_r)/2   (4.36)
The system state transition is also determined for the left and right views independently and then combined. If the transitions are not consistent, the following ranking order is used, from high to low: locate object, avoid object, track line, search line. This means that if one eye wants to switch to locate object while the other wants to continue avoiding, the system will switch to locate object. The reason is that one eye might catch a new object before the other one does.
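The combination rule can be sketched as follows: each eye proposes a state transition, and the proposal with the highest rank in the order locate object > avoid object > track line > search line wins. The identifier names are illustrative.

# Ranking from high to low priority; if the two eyes disagree, the higher-ranked
# proposal is chosen (e.g. one eye catching a new object wins over avoiding).
RANK = {"locate_object": 3, "avoid_object": 2, "track_line": 1, "search_line": 0}

def combine_transitions(left_state, right_state):
    return left_state if RANK[left_state] >= RANK[right_state] else right_state

assert combine_transitions("avoid_object", "locate_object") == "locate_object"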
State: search line
When searching for a line or edge only the phase information is used. The fixation point should move towards and then along the valleys in Figures 4.10 and 4.11. The 2D-phase information therefore has to be transformed into vectors. This can be accomplished by a coarse-to-fine approach.

Figure 4.18 shows the phase filter magnitude for a line located at ξ = 1 on three scales. The fine-scale filter has larger amplitude than the other two close to the line, while the coarse-scale filter has the largest amplitude at a distance. The phase values give the distance to the line as in the disparity estimation in Chapter 3. Since the magnitude is a certainty estimate, the fovea level with the highest magnitude for a given fixation point should
Figure 4.18: The phase filter magnitude on three scales for a line located at ξ = 1.
control where to move. Let J denote the level with the largest magnitude:

‖x̄_J‖ = max_j ‖x̄_j‖   (4.37)

Level J is called the controlling level. The phase angle θ_J gives the distance to the line, while the reference direction vector, ê, defined in Equation (4.5 on page 99), gives the direction to it. In order to move along an oriented structure, a vector, ê⊥, perpendicular to ê has to be chosen. There are always two opposite alternatives when choosing ê⊥. The alternatives are equally good, so either will do, e.g.

ê⊥ = (e_2, −e_1)ᵀ   (4.38)
The problem is now that the direction of ê may flip from one point to another, cf. regions A and F in Figure 4.8 on page 99. This makes the fixation point move back and forth over the discontinuity. Such behavior is avoided by using the direction of the last fixation point motion, v_last:

sign(v_last · ê⊥)   (4.39)

If ê⊥ changes to the opposite direction, the sign of the scalar product above also changes and the fixation point continues to move without changing direction.
The vector to the next fixation point, v, consists of one part directed along the linear structure and one perpendicular to it:

v = 2^J ( γ_⊥ sign(v_last · ê⊥) ê⊥ + γ_θ θ_J ê )   (4.40)

The factor 2^J compensates for the compressed distances on levels with lower resolution. The constants γ_⊥ and γ_θ control the speed of the fixation point motion along and towards the linear structure. γ_θ has a natural connection to the filter. Setting

γ_θ = 1/ρ_c   (4.41)

where ρ_c is the center frequency of the filter, makes the fixation point move directly to a line if it is an impulse line. Normally, the phase varies more slowly than this, since an image mostly contains lower frequencies. This gives too small a value for the distance to the line, which, from a control point of view, is advantageous since it makes the system stable.

γ_⊥ does not have such a natural interpretation as γ_θ does. A rule of thumb is that the fixation point should not move further than the controlling fovea level can reach.
When in the search line state, γ_θ is set according to Equation (4.41), while γ_⊥ is set so that the motion towards the line is larger than the motion along it:

γ_⊥ = 0.25 γ_θ   (4.42)

If γ_⊥ is too large and the line bends away from the tangent of the fixation point motion, then the fixation point might never get close to the actual line.
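Putting Equations (4.37)-(4.42) together, one search-line step could be sketched as below. The function takes the average phase vectors and reference directions for all fovea levels and returns the displacement of the fixation point; the symbol names for the gain constants and the center frequency follow the reconstruction above and are assumptions where the original symbols were lost.

import numpy as np

def search_line_step(x_levels, e_levels, v_last, rho_c, along_factor=0.25):
    """One fixation-point update in the 'search line' state.

    x_levels : list of 3D phase vectors (Eq. 4.6), one per fovea level
    e_levels : list of 2D reference direction vectors, one per level
    v_last   : previous fixation point motion (2D)
    rho_c    : center frequency of the phase filter
    """
    mags = [np.linalg.norm(x) for x in x_levels]
    J = int(np.argmax(mags))                         # controlling level, Eq. (4.37)
    x = x_levels[J]
    theta = np.arctan2(np.hypot(x[0], x[1]), x[2])   # phase angle, Eq. (4.7)
    e = e_levels[J]
    e_perp = np.array([e[1], -e[0]])                 # perpendicular direction, Eq. (4.38)
    gamma_theta = 1.0 / rho_c                        # Eq. (4.41)
    gamma_perp = along_factor * gamma_theta          # Eq. (4.42) in the search state
    direction = np.sign(np.dot(v_last, e_perp))      # keep moving the same way, Eq. (4.39)
    return 2 ** J * (gamma_perp * direction * e_perp + gamma_theta * theta * e)  # Eq. (4.40)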
Figure 4.17 on page 113 shows that the only state transition from search line is to track line (On Line). There are two conditions that have to be fulfilled in order to make a transition. First, the distance to the line has to be small enough. This can be expressed as a condition on the controlling fovea level:

J ≤ 1   (4.43)
i.e. if one of the two finest levels of the fovea has the maximum phase magnitude, then the fixation point is close enough to start tracking. Second, a scale consistency condition is used in order to reduce the impact of noise:

‖ê⊥^J · ê⊥^{J+1}‖ ≥ T_C   (4.44)

If the scalar product between the orientation estimates on the controlling level and the next coarser level is smaller than T_C, then the estimate is considered to be noise and is therefore discarded. Note that if, for instance, level 0 is inconsistent with level 1 but level 1 is consistent with level 2, then level 1 will be used even if level 0 has a larger magnitude. In most experiments, the value of the consistency threshold is:

T_C = 1/√2   (4.45)

which is heuristically determined.
State: track line
The same information is used for tracking lines as for searching for lines. The only difference is that γ_⊥ is now larger:

γ_⊥ = γ_θ   (4.46)
There are two possible state transitions from track line (Figure 4.17 on page 113). The first one is Line Lost, which returns the system to search line if the conditions in Equations (4.43) and (4.44) are no longer fulfilled. This typically happens when a line or edge bends abruptly.

The other state transition, Been Here Before, involves the parameter space memory array (Figure 4.16 on page 112). The memory array is updated during the tracking. For each new fixation the corresponding memory location is incremented by one, or rather, the four closest locations are updated using bilinear interpolation. If the head returns to a position in the parameter space where it has been tracking before, and
the memory value is larger than a threshold, T_m, the system changes the state to avoid object. The value of the threshold is heuristically set to:

T_m = 3.0   (4.47)
State: avoid object
When avoiding an object, the symmetry information on the three coarsest levels is used. The intensity in the "mass image", M_m(ξ) (Equation (4.34)), is a measure of how much evidence there is for a symmetry. The controlling level, J, is determined by searching from coarse to fine in M_m:

M_m^J = max_j M_m^j,   j ∈ {2, 3, 4}   (4.48)

The vector, v, to the next fixation point is then:

v = −2^J γ_s V_m^J   (4.49)

where γ_s is generally set to unity.
There are two state transitions from avoid object. If the rotation symmetry information is lost, i.e. if

M_m^J = 0   (4.50)

the system returns to search line in order to find some structure again (Symmetry Lost).

The second state transition concerns detection of a new object (New Symmetry). During avoid object the fixation point is moving contrary to the vector field V_m (Equation (4.49)). When the fixation point reaches a new symmetry, the vector V_m^J switches sign and points towards the new symmetry, i.e. in approximately the same direction as the current motion of the fixation point. Thus, the condition for the state transition to locate object is:

v_last · V_m^J > 0   (4.51)
State: locate object
When locating a new object the same information as when avoiding objects is used. The only difference is that the fixation point now moves along the vector field, V_m, instead of against it. The vector to the next fixation point, v, is:

v = 2^J γ_s V_m^J   (4.52)

where γ_s is generally set to unity.
There is one state transition from locate object and that is to search line. The transition takes place if one of two conditions is fulfilled. First, as in the avoid object case, the system starts to search for lines if the symmetry is lost (Equation (4.50)). The second condition concerns actually arriving at a new symmetry, i.e. a potential new object. When searching for a line, a coarse-to-fine approach is used. This method is not applicable for rotation symmetries, since the positions of symmetry centers are more scale-dependent. As an example, consider a square. On a coarse scale the symmetry center is in the center of the square. On a fine scale there are four symmetry centers close to the corners of the square. Therefore, a threshold on the distance to the symmetry center on the controlling level is used as a state transition condition:

‖v‖ < T_s   (4.53)

The value of T_s is half the width of the symmetry filter:

T_s = 7   (4.54)
4.4.8 Calculating camera orientation parameters
In order to derive the orientation control parameters from the position of an image point, a camera model has to be assumed. For physical cameras there are a number of models, ranging from a simple pinhole camera to advanced simulations of light going through aggregates of lenses. In computer graphics the pinhole camera dominates, although more advanced cameras are available on some high-end platforms.
Kanatani [42] has derived the camera-induced motion field for rotation around the camera optic center. This remains a good approximation if the distance between the lens and the true center of rotation is small compared to the distance perpendicular to the lens to the projected objects [59]. Below, the corresponding equations for how to rotate a camera in order to fixate a certain point are derived.
The camera model
The cameras are pinhole cameras with a square sensor and with equal vertical and horizontal fields of view. The imaging parameters are the field of view, ψ, and the number of sensor elements horizontally (and vertically) across the sensor, N_e.

Neither the focal length, f, nor the physical size of the image sensor, w, is explicitly given here. The relationship between the field of view, the image sensor size and the focal length is:

tan(ψ/2) = w/(2f)   (4.55)

Resolving ψ gives:

ψ = 2 arctan(w/(2f))   (4.56)

It is evident from Equation (4.56) that for a given field of view there is an infinite number of combinations of focal length and sensor width. The focal length often appears as a denominator, and it is therefore convenient to adopt the following convention:

f = 1.0   (4.57)
w = 2 tan(ψ/2)   (4.58)
w_e = w/N_e   (4.59)

where f is the focal length, w is the "physical" sensor size, and w_e is the sensor element (pixel) size.
Camera pan
The optical center of the cameras coincides with the axes of the individual camera pan joints (Figure 4.19). This makes the change in pan angle, φ_cp, needed to fixate a point, P, independent of the distance between the optical center and P:

tan(φ_cp) = w_e ξ / f   (4.60)

where ξ is the image coordinate. Resolving φ_cp and using Equations (4.57), (4.58) and (4.59) gives the expression for the pan angle:

φ_cp = arctan( (2ξ/N_e) tan(ψ/2) )   (4.61)
Figure 4.19: The camera seen from above. The change in the camera pan angle needed to fixate P, φ_cp, is independent of the distance to P, since the camera turns around the optical center.
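In code, Equation (4.61) is a one-liner. The sketch below converts an image coordinate ξ, counted in pixels from the image center, into the required camera pan change, given the field of view ψ and the sensor resolution N_e.

import math

def camera_pan_angle(xi, psi, n_e):
    """Change in camera pan angle needed to fixate the point at image
    coordinate xi (pixels from the center), Eq. (4.61)."""
    return math.atan2(2.0 * xi * math.tan(psi / 2.0), n_e)

# Example: a point 16 pixels off-center with a 90-degree field of view and a 32-pixel sensor.
print(math.degrees(camera_pan_angle(16, math.pi / 2, 32)))  # 45 degrees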
Head tilt
The head pan and tilt do not rotate the cameras around the optical center. This means that the change in head pan and tilt angles needed to fixate a point depends on the distance to the point. Although it is possible to use depth estimates, e.g. from stereo, to calculate the correct angles, it is desirable to be as independent of information from other processes as possible. Figure 4.20 shows a side view of the head mounted on the robot
Figure 4.20: The error made when the change in tilt angle φ_ht is calculated with Equation (4.61). The image of P will not be exactly centered after tilting the head.
arm. If the equivalent of Equation (4.61) is used to control the head tilt angle, φ_ht, the image of P is not centered. If the error is small enough, the simplicity of using Equation (4.61) is preferable to calculating the accurate inverse kinematics. Figure 4.20 shows an example where the head is tilted to fixate a point P. The following relations are found:

x_2 = h − h cos(φ),   d_2 = L − h sin(φ)   (4.62)
where L = √(x_1^2 + d_1^2) is the initial distance from the camera center to P.

The new position of the image of P can be calculated from the pinhole camera equation:

w_e ξ_error = f x_2 / d_2   (4.63)

Resolving ξ_error and combining with Equations (4.62) and (4.63):

ξ_error = (f/w_e) · h(1 − cos(φ)) / (L − h sin(φ))   (4.64)

Finally, using Equations (4.57), (4.58) and (4.59):

ξ_error = N_e (1 − cos(φ)) / ( 2 tan(ψ/2) (L/h − sin(φ)) )   (4.65)
The following qualitative observations can be made regarding the size of the error:

1. The error decreases when the L/h ratio increases, i.e. the error is smaller for distant points.

2. The error increases when the field of view, ψ, decreases, i.e. the error is larger with a telephoto lens than with a wide-angle lens.

3. The error increases with the change in tilt angle, i.e. a large motion will yield a large error.
Figure 4.21 on the next page shows the error, expressed as a percentage of the sensor width, plotted as a function of L/h for a number of tilt angles. The worst case is when the attracting point is situated on the edge of the current field of view, i.e. when φ_ht = ψ/2.

The L/h ratio in the experiment is typically between 15 and 30 (h = 100 mm and 1500 mm < L < 3000 mm), which means that the worst-case error is about 1%, corresponding to 5 pixels. This might seem like a lot, but the change in tilt angle is mostly much less than half the field of view. The continuous operation of the head also assures a correction in the next iteration, since the correction pan angle then is very small.
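Equation (4.65) is easy to evaluate numerically; the sketch below reproduces the kind of worst-case figure quoted above, assuming a 512-element sensor (inferred from the quoted 1 % corresponding to about 5 pixels).

import math

def tilt_fixation_error(phi, psi, L_over_h, n_e):
    """Fixation error in pixels after a head tilt of phi, Eq. (4.65)."""
    return n_e * (1.0 - math.cos(phi)) / (2.0 * math.tan(psi / 2.0) * (L_over_h - math.sin(phi)))

# Worst case from the text: psi = pi/2, tilt change psi/2 = pi/4, L/h = 15.
err = tilt_fixation_error(math.pi / 4, math.pi / 2, 15.0, 512)
print(err, 100 * err / 512)   # about 5 pixels, roughly 1 % of the sensor width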
Figure 4.21: The fixation error, expressed as a percentage of the sensor width, as a function of L/h. The field of view is ψ = π/2. The error is plotted for four changes in tilt angle: φ = ψ/2 = π/4 (dash-dotted), φ = 3π/16 (dotted), φ = π/8 (dashed), φ = π/16 (solid).
Head pan
Calculating the error made for the head pan is more difficult than for the head tilt, since it depends on the head tilt angle. In Figure 4.20 on page 122, the head pan axis is marked out. This axis is only a true pan axis if the tilt angle is zero. When the head is maximally tilted the pan axis is parallel with the optical axes of the cameras.

When φ_ht = 0, Equation (4.65) is applicable if h is set to half the camera baseline (50 mm). The L/h ratio is then typically between 30 and 60, which means that the error caused by panning is half the error caused by tilting the same angle.

The analytical head pan error equation is much more complicated than Equation (4.65). Simulating a few representative cases gives an intuitive feeling for the error as a function of the field of view, the distance, and the tilt angle. Figure 4.22 on page 126 shows the error made when the attracting point is on the edge of the image, which is the worst case. The field of view is π/6 and π/2 for the left and right columns of plots, respectively. The rows of plots show the error for a point at the distances L = 250 mm, L = 1,000 mm and L = 2,500 mm, respectively. The solid curve shows the error after one gaze shift. The dashed, dash-dotted and
dotted curves show the error after a second, third and fourth gaze shift.
Note that the repeated gaze shifts also change the tilt angle.
The following qualitative observations can be made regarding the size of the error when changing the head pan angle:

1. The error increases when the distance increases if the tilt angle is larger than approximately π/4.

2. The error increases with the tilt angle.

Note that although the error is between 40 and 50 percent of the image size after the first gaze shift, it becomes less than 10 percent after a second shift if the field of view is π/2. After three iterations the error is less than 2%. The worst case is fairly rare, and for tilt angles less than π/4 the initial error is only 15 percent of the image width.
The initial error can be minimized by using the camera pan joints in combination with the head pan. The cameras can be used for quick pan changes, and when a point is fixated, the head pan can be used to ensure symmetric vergence. This sort of control scheme is inspired by the human visual system and can for instance be found in [21, 60].
4.5 Experimental results
A trajectory of how the robot moves the fixation point can be found in Figure 4.23 on page 127. The middle picture on the wall shows a clear example of how an object is fixated and then scrutinized. The system makes a saccade from the right picture to the center of the middle picture. A state transition from locate object to track line occurs and the system starts tracking the periphery of the picture. When returning to the starting point on the frame of the picture, the system saccades to the table and continues there.
Figure 4.22: Head pan errors as a function of tilt angle. The field of view is ψ = π/6 and ψ = π/2 for the left and right columns of plots, respectively. The rows of plots show the error for a point at the distances L = 250 mm, L = 1000 mm and L = 2500 mm, respectively. The solid curve shows the error after one gaze shift. The dashed, dash-dotted and dotted curves show the error after a second, third and fourth gaze shift.
Figure 4.23: A typical trajectory of the fixation point. The fixation point has followed the structures in the image and moved from object to object.
In these experiments only features extracted from gray-scale structures have been used. A natural extension is to incorporate color edges [83, 79, 78], texture gradients [45], etc. in order to get a better segmentation of the image. The potential fields make this fairly easy to do. If, for instance, a color edge and a gray-scale edge coincide, the potential well will be much deeper than if it is a gray-scale edge on a region with constant color.
The results also show that a set of individually simple processes can together produce a complex and purposive behavior. When adding higher levels to the system, they should influence the lower ones by generating potential wells corresponding to interesting directions and not control the cameras directly. In this way all levels of a hierarchical system control the fixation point simultaneously.
5
ATTENTION CONTROL USING
NORMALIZED CONVOLUTION
5.1 Introduction
Chapter 4 describes a gaze control algorithm based on a number of simultaneously working subsystems. In this algorithm, the gaze is attracted by a set of features in the scene, and a kind of motor memory is used to remember earlier fixations. The three basic processes, preattentive control, attentive control and habituation, are pointed out as vital for an active observer. The model, or memory, cannot be seen as implementing the habituation function, since the fixation point has to return to the particular point in order to know that it has already been there. The desired behavior of the system is to act as if certain features and events do not exist, i.e. neither being attracted nor repelled by them [23]. The straightforward way of simply "cutting out" or erasing the corresponding areas in the input image or the feature image does not solve the problem. The influence of an image structure reaches over a large area, especially at coarse resolution. Erasing all points in a feature image that are influenced by a certain image feature removes many useful estimates as well. Erasing in the input image creates "objects" that generate new image features.
It can be argued that building a repelling potential field around known, or modeled, structures solves the problem with the returning gaze point, but it does not. The attraction of the feature remains, but it is balanced by the repelling potential field. This might cause the fixation point to stop in
a local minimum not corresponding to any structure in the field of view, since it is impossible to build a field that exactly cancels the attracting field. The repelling potential field also forces the fixation point to move around an already modeled structure when passing it.
Using a technique termed 'normalized convolution' when extracting the
image features allows for marking areas of the input image as unimportant.
The image features from these areas are then 'invisible' and consequently
do not attract the attention of the system, which is the desired behavior
of a habituation function.
5.2 Normalized convolution
Normalized convolution (NC) is a novel filtering method presented by Knutsson and Westin [48]. The convolution method is designed for filtering of incomplete and/or uncertain data. A comprehensive theory is found in [82]. A central concept in this theory is the separation of signal and certainty for both the filter and the input signal.

Normalized convolution is based on viewing each filter as a set of one or more basis functions, b_i, and a weighting window, a, called the applicability function. The window is used for spatial localization of the basis functions, which may be considered to have infinite support. Similarly, the input signal is divided into the actual signal f and a certainty function, c. Standard convolution of a signal, s, with a set of filters, h_i, can be expressed as:

(f_1, …, f_N)ᵀ = (h_1 ∗ s, …, h_N ∗ s)ᵀ = ((a b_1) ∗ (c f), …, (a b_N) ∗ (c f))ᵀ   (5.1)
Let the normalized convolution between a b_i, i ∈ {1, .., N}, and c f be defined by:

(f̃_1, …, f̃_N)ᵀ = G^{-1} (f_1, …, f_N)ᵀ = G^{-1} ((a b_1) ∗ (c f), …, (a b_N) ∗ (c f))ᵀ   (5.2)
where G is a matrix defined as:

G = [ (a b̄_i b_j) ∗ c ]_{i,j = 1..N}   (5.3)

where b̄_i denotes the complex conjugate of b_i. The matrix G is a metric that compensates for the non-orthogonality of the basis functions. A number of product filters, a b̄_i b_j, are used in order to estimate G. Note that G depends on the certainty function but is independent of the actual signal. This means that if the certainty is independent of variations in the signal over time, G^{-1} can be pre-calculated. Typical examples where G is constant over time are:
Border effect reduction. The signal certainty is set to one inside the input image and to zero outside it. This reduces the effects caused by the image border.

Sparse or heterogeneous sampling. Non-Cartesian, sparse or heterogeneous sensor arrays, or sensor arrays with malfunctioning sensor elements, can be handled by setting the signal certainty to zero for those points.
Filtering of an image with a number of filters can be seen as expanding it in a set of basis functions. The filter outputs are often used as the coordinates of the image in that particular basis. Strictly, this is not correct since filter outputs actually are coordinates in the dual basis. Coordinates can be transformed between the basis and the dual basis using the metric. Readers not familiar with dual bases can turn to [82] for an introduction. For orthonormal bases the difference is academic since the metric is an identity matrix, but for non-orthonormal bases it is important. Note that an orthonormal basis can, locally, turn non-orthonormal due to variations in the signal certainty.
The normalized convolution scheme generates the coordinates, f'_i, corresponding to the basis, b_i. In order to be able to compare with the filter outputs from ordinary filtering, the coordinates have to be transformed into dual coordinates f_i. As mentioned above, this is done by acting with the metric on the coordinates. Since G^{-1} has compensated for variations due to the signal certainty, the metric that corresponds to full certainty should be used:
$$G_0 = \begin{pmatrix} (a b_1 \bar b_1) * c_0 & \cdots & (a b_1 \bar b_N) * c_0 \\ \vdots & \ddots & \vdots \\ (a b_N \bar b_1) * c_0 & \cdots & (a b_N \bar b_N) * c_0 \end{pmatrix} \qquad (5.4)$$
where c_0 is a constant function equal to one. The dual coordinates are then:

$$\begin{pmatrix} f_1 \\ \vdots \\ f_N \end{pmatrix} = G_0 \begin{pmatrix} f'_1 \\ \vdots \\ f'_N \end{pmatrix} = G_0\, G^{-1} \begin{pmatrix} (a b_1) * (c f) \\ \vdots \\ (a b_N) * (c f) \end{pmatrix} \qquad (5.5)$$
Note that setting the input signal certainty equal to one gives G_0 G^{-1} = I, which in turn gives f_i = (a b_i) * (c f), i.e. the standard convolution in Equation (5.1).
Information about the output signal certainty is captured in the determinant of G. Although it is not necessary, it is convenient to have the output certainty in the interval [0, 1], which is accomplished by normalizing with the determinant of G_0. It is also desirable that the output certainty is equivariant with the input certainty, which means that multiplying the input certainties with a factor α results in a multiplication of the output certainties by the same factor:

$$\alpha\, c_{in} \;\rightarrow\; \alpha\, c_{out} \qquad (5.6)$$

but

$$\det(\alpha M) = \alpha^N \det(M) \qquad (5.7)$$
where N is the dimension of the matrix M. Having N basis functions, the following output certainty has been shown to work well:

$$c_{out} = \left( \frac{\det G}{\det G_0} \right)^{1/N} \qquad (5.8)$$
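To make Equations (5.2)-(5.8) concrete, the following minimal 1-D numpy sketch computes the NC coordinates, the dual coordinates and the output certainty for an arbitrary set of basis functions. The function name and the 1-D setting are mine and not from the thesis; it illustrates the equations rather than reproducing the original implementation.

import numpy as np

def normalized_convolution(signal, certainty, applicability, basis):
    # signal f, certainty c, applicability a and basis functions b_i are
    # 1-D arrays; the basis arrays are sampled on the filter grid.
    a = applicability
    filters = [a * b for b in basis]                       # windowed basis ab_i
    cf = certainty * signal
    # Right-hand side of Eq (5.2): (ab_i) * (cf)
    rhs = np.array([np.convolve(cf, h, mode='same') for h in filters])
    # Product filters a b_i conj(b_j); convolved with c they give G (Eq 5.3),
    # summed (i.e. convolved with c_0 = 1) they give G_0 (Eq 5.4).
    prod = [[a * bi * np.conj(bj) for bj in basis] for bi in basis]
    N, n = len(basis), len(signal)
    Gc = [[np.convolve(certainty, prod[i][j], mode='same') for j in range(N)]
          for i in range(N)]
    G0 = np.array([[np.sum(prod[i][j]) for j in range(N)] for i in range(N)])
    coords = np.zeros((N, n), dtype=complex)               # NC coordinates f'_i
    cert_out = np.zeros(n)
    for x in range(n):
        G = np.array([[Gc[i][j][x] for j in range(N)] for i in range(N)])
        coords[:, x] = np.linalg.pinv(G) @ rhs[:, x]       # Eq (5.2); pinv since
        cert_out[x] = (abs(np.linalg.det(G)) /             # G is singular where
                       abs(np.linalg.det(G0))) ** (1.0 / N)  # certainty vanishes
    dual = G0 @ coords                                     # Eq (5.5)
    return coords, dual, cert_out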
Converting a filtering operation to NC

Rewriting a filtering operation as a normalized convolution can be done according to the following cookbook recipe:

1. Define an applicability function. The applicability function, a, has to be the same for all filters. Often it is an ordinary window function, such as a Gaussian or a squared cosine.

2. Define the basis functions. Or rather, define the windowed basis functions, a b_i. These are the filters from the original operation. If there is no DC-filter in the original filter set, it has to be added (cf. Section 5.3).

3. Define a signal certainty function. Sometimes it is easy to define a signal certainty, e.g. for laser ranging using time-of-flight, where the returning light intensity can be used. Points where no signal is captured are then given zero certainty. If the certainty is not given by the device generating the signal, some sort of certainty measure has to be constructed.

4. Define the signal. As in the basis function case, it is actually the signal multiplied by the certainty, c f, that is used.

5. Define the scalar product functions. In order to measure the scalar product between the basis functions, a set of filters with all pair-wise combinations of basis functions, b_i b_j, i, j ∈ {1, ..., N}, has to be generated.

In the following section these steps are applied to performing normalized convolution using quadrature filters.

5.3 Quadrature filters for normalized convolution
In earlier chapters quadrature filters are used for orientation and local phase estimation. Some of the filters presented in Chapter 2 are not true quadrature filters, but the analysis below will be valid for them as well. A quadrature filter can be seen as consisting of either complex or real basis functions. Both ways are described below, taking the non-ring filter in Equation (2.32) as an example.

5.3.1 Quadrature filters for NC using real basis functions
NC quadrature filtering can be made using three real basis functions: a constant function and the real and imaginary parts of the original filter. The applicability function is a spatial localization function corresponding to a windowing function. Here it is chosen as:

$$a(\xi) = \begin{cases} \cos^2\!\left(\dfrac{\pi \xi}{2R}\right) & \text{if } \|\xi\| < R \\[4pt] 0 & \text{otherwise} \end{cases} \qquad (5.9)$$

since this is the magnitude function of the original filter in Equation (2.32). By this choice of applicability function we get the following basis functions:
$$b_1(\xi) = 1 \qquad (5.10)$$

$$b_2(\xi) = \cos\!\left(\frac{\pi \xi}{R} + \sin\frac{\pi \xi}{R}\right) \qquad (5.11)$$

$$b_3(\xi) = \sin\!\left(\frac{\pi \xi}{R} + \sin\frac{\pi \xi}{R}\right) \qquad (5.12)$$

where b_2 and b_3 are the real and imaginary parts of the filter function.
In addition to the complex basis function a constant basis function is required. The reason for this is that since the original filter is insensitive to the mean value of a signal, it is incapable of estimating the signal certainty level. For example, any constant certainty field would give zero output.
The basis functions can be considered to have infinite support since they are always multiplied by the applicability function when constructing the filters needed:

$$h_a(\xi) = a b_1 = \cos^2\frac{\pi \xi}{2R} \qquad (5.13)$$

$$h_{ax}(\xi) = a b_2 = \cos^2\frac{\pi \xi}{2R}\, \cos\!\left(\frac{\pi \xi}{R} + \sin\frac{\pi \xi}{R}\right) \qquad (5.14)$$

$$h_{ay}(\xi) = a b_3 = \cos^2\frac{\pi \xi}{2R}\, \sin\!\left(\frac{\pi \xi}{R} + \sin\frac{\pi \xi}{R}\right) \qquad (5.15)$$
h_ax and h_ay are the non-ring filters used in the original operation. In addition to the filters in Equations (5.13), (5.14) and (5.15), three product filters have to be generated:

$$h_{axx}(\xi) = a b_2 b_2 = \cos^2\frac{\pi \xi}{2R}\, \cos^2\!\left(\frac{\pi \xi}{R} + \sin\frac{\pi \xi}{R}\right) \qquad (5.16)$$

$$h_{ayy}(\xi) = a b_3 b_3 = \cos^2\frac{\pi \xi}{2R}\, \sin^2\!\left(\frac{\pi \xi}{R} + \sin\frac{\pi \xi}{R}\right) \qquad (5.17)$$

$$h_{axy}(\xi) = a b_2 b_3 = \cos^2\frac{\pi \xi}{2R}\, \cos\!\left(\frac{\pi \xi}{R} + \sin\frac{\pi \xi}{R}\right) \sin\!\left(\frac{\pi \xi}{R} + \sin\frac{\pi \xi}{R}\right) \qquad (5.18)$$
These three filters are needed to measure how the correlation between the basis functions varies with the signal certainty function. This is necessary in order to adjust for the signal variations induced by the certainty function.
Figure 5.1 shows the filters having radius R = 9. The filter outputs are combined according to Equations (5.2) and (5.3):

$$\begin{pmatrix} f'_1 \\ f'_2 \\ f'_3 \end{pmatrix} = G^{-1} \begin{pmatrix} h_a * (c f) \\ h_{ax} * (c f) \\ h_{ay} * (c f) \end{pmatrix} \qquad (5.19)$$

[Figure 5.1: Non-ring NC quadrature filters for R = 9, plotted over ξ ∈ [−8, 8]. Panels: h_ax, h_ay, h_a, h_axx, h_ayy and h_axy.]
with

$$G = \begin{pmatrix} h_a * c & h_{ax} * c & h_{ay} * c \\ h_{ax} * c & h_{axx} * c & h_{axy} * c \\ h_{ay} * c & h_{axy} * c & h_{ayy} * c \end{pmatrix} \qquad (5.20)$$
G is symmetric since all functions are real. If the basis functions are complex, G is Hermitian.

The scheme results in six filters and nine different convolutions instead of two filters with one convolution each. Furthermore, a 3 × 3 matrix has to be inverted, possibly for each point.
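As an illustration of the real-basis scheme, the sketch below builds the filter set of Equations (5.9)-(5.18) and solves Equations (5.19)-(5.20) point by point. It is a 1-D numpy sketch under my own naming, with the phase function πξ/R + sin(πξ/R) taken from the reconstruction of Equations (5.11)-(5.12); it is not the thesis implementation.

import numpy as np

R = 9
xi = np.arange(-(R - 1), R)                        # spatial coordinate, |xi| < R
a = np.cos(np.pi * xi / (2 * R)) ** 2              # applicability, Eq (5.9)
phi = np.pi * xi / R + np.sin(np.pi * xi / R)      # phase of the non-ring filter
b1, b2, b3 = np.ones(len(xi)), np.cos(phi), np.sin(phi)     # Eqs (5.10)-(5.12)

h_a, h_ax, h_ay = a * b1, a * b2, a * b3                    # Eqs (5.13)-(5.15)
h_axx, h_ayy, h_axy = a * b2 * b2, a * b3 * b3, a * b2 * b3 # Eqs (5.16)-(5.18)

def conv(s, h):
    return np.convolve(s, h, mode='same')

def nc_quadrature_real(f, c):
    # NC coordinates f'_1, f'_2, f'_3 of Eq (5.19) using the point-wise
    # 3 x 3 metric G of Eq (5.20).
    rhs = np.stack([conv(c * f, h_a), conv(c * f, h_ax), conv(c * f, h_ay)])
    g = {name: conv(c, h) for name, h in
         [('a', h_a), ('ax', h_ax), ('ay', h_ay),
          ('axx', h_axx), ('ayy', h_ayy), ('axy', h_axy)]}
    out = np.zeros_like(rhs)
    for x in range(len(f)):
        G = np.array([[g['a'][x],  g['ax'][x],  g['ay'][x]],
                      [g['ax'][x], g['axx'][x], g['axy'][x]],
                      [g['ay'][x], g['axy'][x], g['ayy'][x]]])
        out[:, x] = np.linalg.pinv(G) @ rhs[:, x]  # pinv: G is singular where c = 0
    return out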
5.3.2 Quadrature filters for NC using complex basis functions
In this section NC quadrature filtering using complex basis functions instead of real ones is discussed. It is actually one real and one complex basis function. Using the same applicability function as in Equation (5.13) gives the following basis functions:

$$b_1(\xi) = 1 \qquad (5.21)$$

$$b_2(\xi) = e^{\,i\left(\frac{\pi \xi}{R} + \sin\frac{\pi \xi}{R}\right)} \qquad (5.22)$$
The complex filter that corresponds to a b_2 can be realized as the two real filters in Equations (5.14) and (5.15), i.e. the original even-odd filter pair. Since one of the basis functions is constant, the correlation filters are the same as the original ones. The complex versions of Equations (5.19) and (5.20) are:

$$\begin{pmatrix} f'_1 \\ f'_2 \end{pmatrix} = G^{-1} \begin{pmatrix} h_a * (c f) \\ (h_{ax} + i h_{ay}) * (c f) \end{pmatrix} \qquad (5.23)$$

and, using $a b_2 \bar b_2 = a$,

$$G = \begin{pmatrix} h_a * c & (h_{ax} - i h_{ay}) * c \\ (h_{ax} + i h_{ay}) * c & h_a * c \end{pmatrix} \qquad (5.24)$$

The total number of filters is three. Counting one complex convolution as two real ones makes six different convolutions. The computational complexity is reduced further by the fact that G is a 2 × 2 matrix, instead of 3 × 3, which facilitates the NC combination of filter results.
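The same computation with the complex basis reduces to three convolutions and a closed-form 2 × 2 solve. The sketch below is again my own illustration, where h_q stands for the complex filter h_ax + i h_ay from the previous sketch; it is not the thesis code.

import numpy as np

def nc_quadrature_complex(f, c, h_a, h_q):
    # Complex-basis NC, Eqs (5.23)-(5.24); h_q = h_ax + 1j * h_ay.
    conv = lambda s, h: np.convolve(s, h, mode='same')
    cf = c * f
    r1, r2 = conv(cf, h_a), conv(cf, h_q)          # right-hand side of Eq (5.23)
    g11 = conv(c, h_a)                             # entries of G, Eq (5.24)
    g21 = conv(c, h_q)
    g12 = np.conj(g21)
    g22 = g11                                      # a b2 conj(b2) = a
    det = (g11 * g22 - g12 * g21).real             # G is Hermitian, so det is real
    det0 = np.sum(h_a) ** 2 - abs(np.sum(h_q)) ** 2   # det G_0 (full certainty)
    safe = np.where(np.abs(det) < 1e-12, np.inf, det)
    f1 = ( g22 * r1 - g12 * r2) / safe             # G^{-1} applied point-wise
    f2 = (-g21 * r1 + g11 * r2) / safe
    c_out = np.sqrt(np.abs(det) / abs(det0))       # Eq (5.8) with N = 2
    return f1, f2, c_out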
5.3.3 Real or complex basis functions?
It is evident that using complex basis functions reduces the computational cost significantly. The real case requires nine convolutions while the complex case only needs six. An important question is how the two approaches differ. The difference is best illuminated with two examples. In the first example, data is missing, or marked unimportant, in single points. This resembles sensor element failure or some kind of point-wise data drop-out. In the second example, regions larger than the filters are missing. This typically happens around the edges of an image, or when masking an image structure for focus of attention purposes.

The test signal is the same as in Section 2.1 and the filters are the ones in Figure 5.1 on page 136. The performance of the two approaches is measured by comparing their output phase with the filter output from an uncorrupted signal, shown in Figure 5.2.

Let z denote the reference signal, i.e. the complex filter response from the original filter on the uncorrupted signal, and let z̃ denote the corresponding response from normalized convolution on the signal with partially missing data. The definition of the NC filter response is:
$$\tilde z = \begin{cases} c_o\, (f'_2 + i f'_3) & \text{if real basis functions} \\ c_o\, f'_2 & \text{if complex basis functions} \end{cases} \qquad (5.25)$$

where c_o is the output certainty given by Equation (5.8). The difference in phase between the reference signal and the tested signal is calculated as:

$$d_i = \bar z_i\, \tilde z_i \qquad (5.26)$$

[Figure 5.2: Original input signal and quadrature filter output: intensity, reference magnitude and reference phase plotted over ξ.]

The magnitude of d_i contains the reference signal magnitude, the estimated signal magnitude and the certainty of the NC estimate. This makes it suitable for weighting the errors, since errors in points with both high signal amplitude and high NC certainty are more serious than others. The following four statistics are used:

$$m = \frac{1}{n} \sum_{i=1}^{n} \arg(d_i) \qquad (5.27)$$

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( \arg(d_i) - m \right)^2 \qquad (5.28)$$

$$m_w = \frac{\sum_{i=1}^{n} \|d_i\| \arg(d_i)}{\sum_{i=1}^{n} \|d_i\|} \qquad (5.29)$$

$$s_w^2 = \frac{\sum_{i=1}^{n} \|d_i\| \left( \arg(d_i) - m_w \right)^2}{\sum_{i=1}^{n} \|d_i\|} \qquad (5.30)$$
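A small numpy helper, under my own naming, that evaluates Equations (5.26)-(5.30) given the reference responses and the certainty-weighted NC responses might look as follows.

import numpy as np

def phase_error_stats(z_ref, z_nc):
    # d_i of Eq (5.26); its argument is the angular error, its magnitude the weight.
    d = np.conj(z_ref) * z_nc
    err = np.angle(d)
    w = np.abs(d)
    m = err.mean()                                            # Eq (5.27)
    s = np.sqrt(((err - m) ** 2).sum() / (len(err) - 1))      # square root of Eq (5.28)
    m_w = (w * err).sum() / w.sum()                           # Eq (5.29)
    s_w = np.sqrt((w * (err - m_w) ** 2).sum() / w.sum())     # square root of Eq (5.30)
    return m, s, m_w, s_w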
Point-wise missing data
In this example, data is point-wise missing and the signal certainty is set to zero in these points. The two top rows of plots in Figure 5.3 on page 142 show the signal and the certainty function. The line at ξ = 20 and the edge at ξ = 60 are missing a sample at the center of the image feature. The line at ξ = 100 and the edge at ξ = 140 have missing data in the neighborhood of, but not at, the actual feature.

The two middle rows of Figure 5.3 show the quadrature filter outputs and the error plots. The left and right columns correspond to the real and complex basis functions respectively. The output certainty is plotted in the fifth row. The bottom plots in both figures show the difference between the estimated phase and the true phase. The error is plotted for ‖d‖ > 0.01 only. The difference is close to zero, which shows that both methods handle the missing data well. The error statistics are listed in Table 5.1, which shows that the performance of the two methods is almost equivalent when looking at the absolute errors.

                          m       m_w     s      s_w
Real basis functions     -0.007  -0.009  0.275  0.030
Complex basis functions  -0.004  -0.016  0.196  0.054

Table 5.1: Statistics of the angular error in radians for phase estimation on point-wise missing data.
[Figure 5.3: Results from normalized convolution using quadrature filters with point-wise missing data. Compare the phase and magnitude plots with Figure 5.2 on page 139. The bottom plots show the difference between the reference phase and the NC phase. Left: Real basis functions. Right: Complex basis functions.]
Large missing regions
In the test above the filters reach over the points with missing data. Only a few samples under the filter are missing for each point. In the following test the regions of missing data are larger than the filters. The two top rows of plots in Figure 5.4 on the next page show the signal and the certainty function. As above, the line at ξ = 20 and the edge at ξ = 60 are cut at the center, while the line at ξ = 100 and the edge at ξ = 140 have missing data in the neighborhood.

When having point-wise missing data, the proper behavior of the phase estimate is often easy to define. When large regions are missing, however, it is not so easy to tell. There is no way to determine the shape of the signal in the masked, or missing, region. Take for instance the line at ξ = 20 and the edge at ξ = 60 in Figure 5.4 on the following page. After masking with the input certainty they look similar. Therefore the phase estimates also look the same. When filtering at the edge of a certainty gap that is broader than the filter, NC extrapolates according to the basis functions. In the quadrature filter case this means that the phase cycles with the same angular velocity as the impulse response. This behavior gives an appropriate phase extrapolation for the line at ξ = 20 but gives a large error for the edge at ξ = 60. The output certainty is small for these regions, which explains the difference between the unweighted and the weighted statistics. Again the performances of the two approaches are similar.

                          m      m_w    s      s_w
Real basis functions     0.168  0.114  0.630  0.280
Complex basis functions  0.063  0.136  0.611  0.366

Table 5.2: Error statistics for phase estimation on partially missing data.
Discussion on basis choice
[Figure 5.4: Results from normalized convolution using quadrature filters with data missing in large regions. Compare the phase and magnitude plots with Figure 5.2 on page 139. The bottom plots show the difference between the reference phase and the NC phase. Left: Real basis functions. Right: Complex basis functions.]

Both the approach with real basis functions and the one with complex basis functions yield quadrature filter outputs that are close to the outputs from the uncorrupted signal. It is important not only to estimate accurately, but also to know when the estimate is unreliable. Since the weighted errors are smallest for the approach with real basis functions, it is the best approach from this point of view. On the other hand, the difference in accuracy between the approaches is not proportional to the difference in complexity, especially when the missing regions are smaller than the filters. In a scale-space implementation there is likely a level of resolution where the filters reach over the regions with missing data. The conclusion is that complex basis functions give the best performance-complexity ratio for quadrature filtering using normalized convolution.
5.4 Model-based habituation/inhibition
Normalized convolution can be used for disregarding image structures by setting the signal certainty to zero for these structures. This ability makes NC suitable for directing the focus of attention of an active vision system. Take for instance a robot system with an arm and a camera head. The system is supposed to react to objects entering the scene, or to unpredicted motion in the vicinity of the robot. However, it does not need to react to known objects such as its own arm. This may seem like a simple problem, but in a hierarchical system, consisting of a number of more or less independent processes, the impact of known structures is hard to suppress. The different low level processes do not, and cannot, have any knowledge about which structures are already known and which are not. The higher levels, on the other hand, can provide information about this. In the system presented here, high level processes generate certainty masks that are set to zero certainty in areas with known structures. The certainty masks can be generated from estimated information of object geometry and/or from estimated local image features such as, for example, color, velocity or orientation. This means that, for instance, all blue objects can be neglected using this technique.
5.4.1 Saccade compensation
Saccades, i.e. fast camera motions made in order to fixate a new point of interest, normally introduce strong erroneous responses for a period of time that depends on the filter size in the time dimension. Note that merely shutting off the camera during the saccade introduces strong responses in any filtering having temporal extent when the camera is turned back on. However, such errors can be almost completely eliminated using normalized convolution. When the head makes a saccade, the certainty for the whole field of view is set to zero as long as the head moves. The input frames are then considered to be missing data and do not influence the motion estimation. Both the examples below contain saccade compensation.

5.4.2 Inhibition of the robot arm influence on low level image processing

Figure 5.5 on the next page shows a system consisting of a simulated Puma 560 arm and a stereo camera head mounted on a neck centered over the "waist joint". Objects are produced in the machine to the left and a conveyor belt transports them towards a bin to the right. Figure 6.3 on page 161 shows a blueprint of the scene.
Figure 5.8 on page 150 shows every 10th frame of a sequence where an object passes the robot on the conveyor belt, while the robot arm moves back and forth along the belt. The two leftmost columns in Figure 5.8 show an overview and the right camera view. The head performs saccadic tracking, i.e. when the object is too far from the center of the image the head rapidly changes fixation point to center the object. Consequently, the head is not moving continuously. A method based on three-dimensional quadrature filters that can track objects by smooth pursuit is presented in [43] but is at present not implemented with normalized convolution.

Figure 5.5: A Puma 560 robot arm by a conveyor belt. The robot has a neck with a stereo camera head, cf. the cover.

[Figure 5.6: Block diagram for model based inhibition of a Puma robot arm. Blocks: Puma CTRL, Head CTRL, Puma arm, Head, Arm model, Cameras and Low level processing; signals include the arm mask and the certainty mask.]

The third column of Figure 5.8 shows the pairwise temporal differences between consecutive frames. The head tracks the local center of mass of the temporal differences. The tracking is similar to the method for locating symmetries described in Subsection 4.4.7. When the head is not moving and the arm is out of sight, the object appears clearly and is easy to track. When the head makes a saccade, as in the fourth row of Figure 5.8, the whole field of view changes, which makes tracking impossible. In the last three rows the robot arm becomes visible and dominates the field of view, which makes the head track the arm instead of the object. Both the saccade and the robot arm are examples of known events that should be inhibited from affecting the pre-attentive gaze control. Normalized convolution is applicable for both these problems.

When the head makes a saccade, the certainty for the whole field of view is set to zero as long as the head moves. The input frames are then considered to be missing data and do not influence the motion estimation.

In order to generate a certainty mask that cancels the motion estimates from the robot arm, a model of the arm is needed. The position of the arm in relation to the camera and the joint angles for both the arm and head are also necessary. The block diagram in Figure 5.6 shows a system that takes the arm and head control parameters and generates a mask that covers the arm. The geometrical model does not need to be precise. A bounding box representation of each link is sufficient.
The bounding boxes are used for rendering a certainty mask directly into a fovea representation as shown in Figure 5.7. The leftmost image shows the field of view with the fovea level boundaries marked out for clarity. The other images show the certainty mask for the four levels of the fovea. The fourth column in Figure 5.8 on page 150 shows the arm mask. The fovea is combined as described in Chapter 4. The masked area in the lower part of the first three images of the fourth column is due to the bounding box of the shoulder link.
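A much simplified sketch of the idea, assuming a pinhole camera with intrinsic matrix K and the box corners already expressed in camera coordinates, is to zero the certainty inside the image-plane bounding rectangle of the projected corners. The thesis renders the boxes into every fovea level, which is omitted here, and all names below are mine.

import numpy as np

def link_certainty_mask(corners_cam, K, shape):
    # corners_cam: (8, 3) bounding box corners of one arm link in camera
    # coordinates; K: 3 x 3 camera matrix; shape: (rows, cols) of the image.
    c = np.ones(shape)                               # full certainty everywhere
    front = corners_cam[corners_cam[:, 2] > 0]       # ignore corners behind the camera
    if len(front) == 0:
        return c
    p = (K @ front.T).T
    p = p[:, :2] / p[:, 2:3]                         # project to pixel coordinates
    x0, y0 = np.floor(p.min(axis=0)).astype(int)
    x1, y1 = np.ceil(p.max(axis=0)).astype(int)
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, shape[1]), min(y1, shape[0])
    c[y0:y1, x0:x1] = 0.0                            # known structure: zero certainty
    return c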
Figure 5.7: The certainty mask generated from the bounding box representation of the robot arm.

The last column in Figure 5.8 shows the temporal differences computed with normalized convolution using the certainty masks in the fourth column. Only the object of interest is visible. The effects of the known events have vanished. In the last row, the object is behind the robot arm, and there is naturally no way of recovering it. Note that if the bounding box representation is too crude and an object is close to the arm, the object might be masked out although it is visible, cf. Subsection 5.4.4. On the other hand, if the bounding box representation is too tight and the parameters and/or arm position are imperfect, the mask might fail to cover the arm completely.
5.4.3 Inhibition of modeled objects
The same procedure as in the last section can be applied to remove low level responses from already modeled structures. Figure 5.9 on page 152 shows every 10th frame of a sequence where some objects pass the robot on the conveyor belt. The two leftmost columns in this figure show an overview and the right camera view. The large cross-hair, in the right camera view, shows the current fixation point, and the small cross-hair points out where the head would saccade to if a saccade was to be executed immediately. The third column shows the pairwise temporal differences between consecutive frames.

When a new object is detected it is fixated, and a model is invoked. This model contains a bounding box representation, a 3D position and a 3D velocity. When the model is set, a certainty mask is generated which is used for inhibiting responses from the object. The system is then sensitive only to un-modeled events. The masks from the object models are shown in the fourth column, while the last column contains temporal differences using normalized convolution.

Figure 5.8: Inhibition of motion estimates due to known events.

In the first two rows the system detects a new object behind the two already modeled objects. When the third object is modeled, the system makes a saccade towards the middle object by canceling the appropriate mask. During the saccade the whole field of view is given zero certainty. In the four last rows the attention is shifted between the objects by simply canceling the mask of the object to be attended to. With this strategy the low level processes do not need to know the difference between a new object entering the scene and a mask being canceled, and a complex communication structure between high and low levels is avoided.

As long as the constant velocity model is appropriate, the projected certainty mask will cover the modeled structures. However, if the model is too simple, the corresponding certainty mask slides off. If this happens, the previously suppressed low level responses will automatically raise an alarm and the system shifts its attention to update that particular model. When all moving objects are modeled correctly, the lower levels of the system will be quiet. Only models corresponding to objects that change their behavior alarm and need additional attention.
5.4.4 Combining certainty masks
In Subsections 5.4.2 and 5.4.3 the certainty masks are based on geometry only. Although this is often sufficient, it sometimes has undesired side effects. The top left image in Figure 5.11 on page 154 shows a situation where an object is almost entirely masked by the arm although it is visible. The mask is generated according to the arm and head geometry only; any objects between the arm and camera are not accounted for. The middle left image in the same figure shows what happens if the mask instead is based on object color only. Regions with the color of the robot arm are masked. The color has to be extracted with methods that compensate for the color of the illumination [56, 55]. The arm is masked correctly, but the floor and the little cylindrical object to the left are masked as well. The solution in this case is shown in the bottom image of Figure 5.11, where the two previous masks are combined. The total mask should only be zero for red objects within the bounding box representation of the arm, thus the masks are combined with a logical OR. By combining masks from different models or image features a more selective masking can be performed.

Figure 5.9: Inhibition of already modeled objects.

Figure 5.10: The robot's view of the conveyor belt with a cylindrical object and a box.
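The combination rule described above amounts to a point-wise OR (or, for soft-valued masks, a maximum) of the individual certainty masks, so that a point keeps zero certainty only if every mask agrees. A one-line sketch, with names of my own choosing:

import numpy as np

def combine_certainty_masks(geometry_mask, color_mask):
    # Zero certainty only where both masks are zero, i.e. a logical OR
    # of the binary masks (maximum for soft-valued masks).
    return np.maximum(geometry_mask, color_mask)

A red pixel on the floor (color mask zero) that lies outside the arm's bounding box (geometry mask one) then keeps full certainty and is not suppressed.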
5.5 Discussion
It has been shown that normalized convolution can be used for attention control. By suppressing known structures, relatively simple methods can be used in the following processing stages. Guiding attention by means of certainty masks fits in a natural way into a system exploiting the benefits of separating signal from certainty [48, 82]. Since normalized convolution is based on a series of convolutions, this scheme can be implemented using special hardware for convolutions. For all levels of a feature hierarchy the convolutions can be executed in parallel, and NC can therefore be as fast as a standard convolution.
Figure 5.11: Combination of certainty masks. Top: Geometry only.
Middle: Color only. Bottom: Color and geometry.
6 Robot and environment simulator

A robot vision simulator has been developed in the Application Visualization System, AVS, to facilitate testing of robot vision and control algorithms [86]. The simulator reduces the need for expensive special purpose hardware since it can be run in "slow motion". A real-time process can be investigated although only limited computational resources are available. It also allows testing of different types of robots and robot configurations without any extra cost. The scene can be varied from a few very simple polyhedral objects to complex, realistic, texture mapped environments. The simulated reality, such as true 3D structure, true distance, etc., can easily be compared with the results obtained by the robot vision system in order to evaluate the performance of the algorithms [20, 19].
6.1 General description of the AVS software
The AVS software is a product from Advanced Visual Systems Inc. (AVS Inc.). It is an interactive visualization environment for scientists, engineers and technical professionals. AVS offers an environment that allows users to construct applications incorporating their own code without graphics programming. The package is easy to learn and use; it provides an intuitive interface for quickly designing prototype applications, and it supplies powerful tools for customizing and tuning production applications.
There are two major ways of using AVS. Firstly, AVS has a number of data visualizers called viewers. These are ready-to-use visualization packages for a variety of data types.

- Image viewer is an interactive 2D image processing and display package.

- Geometry viewer is an interactive 3D geometric data renderer.

- Graph viewer is a tool for plotting functions, measurements and statistics using line, bar, area, scatter and contour plots.
Secondly, AVS can be used as a prototyping tool. New algorithms and methods can be designed and tested using visual programming, i.e. choosing and interconnecting program modules graphically as shown in Figure 6.1 on the next page. The prototyping subsystems are:

- Network editor, a visual programming tool.

- Module generator, an integration tool for user-supplied code.

- Layout editor, a graphical user interface design tool.

- Command Language Interpreter, a scripting language and callable interface.
AVS users can construct their own visualization applications by combining software components into executable flow networks. The components, called modules, are sub-programs or functions which are called by the AVS kernel. The flow networks are built from a menu of modules by using direct manipulation, a visual programming interface called the AVS Network Editor. With the Network Editor, the user produces an application by selecting a group of modules and drawing connections between them.
Figure 6.1: The AVS network editor. The window to the left is the network control panel. This is the default location for module widgets. The large window is the network construction window. The upper part contains a network editor menu to the left, and a palette of modules in the current library to the right. The rest of the network construction window is a workspace. It contains a sample network that reads an image from a file, displays it, runs a Sobel operator and displays the result.

The AVS Command Language Interpreter (CLI) is a text language that can drive most of the AVS system. It can be used for making animations, saving networks, remote control, etc. Any module can pass CLI commands to the AVS kernel, i.e. any module can modify parameters, rearrange networks, invoke and delete other modules, etc.
AVS includes a rich set of modules for the construction of networks. It allows users to create their own new modules to meet their specific needs and dynamically load them into networks. A module generator can be used to automatically generate module code in FORTRAN or C. Both ANSI-C prototypes and C++ classes are available. The module generator creates skeletons for Makefiles and manual pages as well.
6.1.1 Module Libraries
About 150 standard supported modules are included in the AVS software provided by AVS Inc. In addition to this, a number of shareware modules are available. The International AVS Center, housed at the North Carolina Supercomputing Center, is a center for collection and distribution of new modules. Hundreds of user-contributed and ported public domain modules are available there and can be retrieved via ftp (including documentation). Examples of recently ported public domain modules are Khoros (approx. 250 modules) and Sunvision. A database facilitates searching for modules suited for certain applications. The center also distributes a quarterly magazine, the AVS Network News, presenting novelties concerning AVS software and applications.

The module library developed at the Computer Vision Laboratory contains about 100 modules developed specially for image processing problems. The robot simulator described below is a part of this library.
6.2 Robot vision simulator modules
[Figure 6.2: The basic modules used in the Robot Vision Simulator: Puma 560, Puma Dynamics, Puma Inverse, Transformations, Stereo Camera, Stereo Head Dynamics, camera simulation, Invert DH Matrix, Monopod, Monopod Dynamics, Multiply DH Matrices, Vacuum Gripper and Conveyor belt scene.]
Figure 6.2 shows a set of modules for geometric visualization and control of robots and tools. These are the basic modules in the robot vision simulator. In the actual experiments a number of situation specific modules are used for image processing, path planning, etc. Section 6.3 describes an experiment where the modules below are used as building blocks together with a number of special purpose modules.
Puma 560
The "Puma 560" module generates a geometric description of a Puma 560 robot arm with six degrees of freedom (6 DOF). The robot can be positioned by supplying a transformation matrix to one of the input ports. The robot pose is controlled via the joint angle parameters. There are basically three ways of furnishing joint angles. First, they can be given manually using the widgets on the control panel. Second, a controlling module can be attached to the remote control input port, passing joint angles. Finally, there is also a possibility for a module to access the joint parameter widgets using the "AVScommand" function call.

The module does not simulate the dynamics of the robot arm. In order to implement robot dynamics and/or other forms of control parameters, a filter, e.g. "Puma Dynamics", can be inserted between the module supplying the desired joint parameter values and the "Puma 560" module.

The transformation matrix describing the location and orientation of the actuator end point is presented on an output port, enabling attachment of different tools, e.g. "Vacuum Gripper" or "Stereo Camera". In fact, it is possible to put one robot arm at the end of another by connecting the output port of one robot to the input port of another.
Stereo Camera
"Stereo Camera" creates a geometric description of a binocular camera head with 4 DOF. The cameras have independent pans, a variable baseline, and a variable field of view. The camera parameters are manipulated in the same manner as the robot arm.

The camera head is attached to an actuator by connecting the transformation matrix input of "Stereo Camera" to the corresponding output of "Puma 560" or "Monopod". The transformation matrix of the actuator end point is then transferred to the camera head, making it follow the actuator movements.

Note that "Stereo Camera" has three output geometry ports. One "geometry viewer" module can render any number of views from the same scene, but only one can be presented on the output port. Therefore, two geometry viewers are needed to get a stereo image pair and one extra is used for overview images. Consequently, the world representation is kept in three separate copies.
Monopod
"Monopod" is a camera head platform, or neck, with 3 DOF: pan, tilt and elevation. The hand of the robot arm and the neck have identical geometry. The platform is positioned by furnishing a transformation matrix on one of the module input ports. If the same transformation matrix is used for both the robot arm and the monopod neck, the result is the configuration in Chapter 5. The neck appears to be mounted on the "shoulder link" of the robot (see cover).
Vacuum Gripper
"Vacuum Gripper" is intended to be used together with "Conveyor belt scene" below. It creates a geometric description of a suction cup to be positioned at the end of a robot arm. As the name indicates, the function is the same as lifting small objects with a vacuum cleaner hose. The module has a boolean parameter indicating whether the gripper is active or not. When the suction is activated, the transformation matrix is sent to the output port.
Conveyor belt scene
"Conveyor belt scene", shown in Figure 6.3, has two conveyor belts, each connected to a machine. One machine produces objects, hence called the Producer, and puts them on a conveyor belt. The selection of type of object and creation time can either be externally controlled, or at random.
[Figure 6.3: Blueprint of the scene generated by "Conveyor belt scene", showing the Producer and Consumer machines, the two conveyor belts, a large and a small bin, a table and the robot position, with dimensions and floor coordinates. The dashed circle indicates the working area of the robot in the experiment in Chapter 5, but there are no limitations as to where to position it.]
At the end of this belt there is a bin collecting manufactured objects. The other belt transports objects placed on it to the other machine, the Consumer, where they are consumed.

The "Vacuum Gripper" output can be attached to the module, enabling manipulation of the manufactured objects. For the robot to lift an object, it has to position the gripper close to the bounding box of the object, orient it perpendicularly to the surface, and activate the suction.

At present the simulation of physical objects and their interactions is constrained to a few special situations. Objects are transported along the conveyor belts when put on them and fall to the floor if released in the air. The table, or the floor, can be used for temporary storage of objects. Objects can, however, be moved through each other. Piling objects is thus not possible.
Puma Dynamics, Stereo Head Dynamics and Monopod Dynamics

"Puma Dynamics", "Stereo Head Dynamics" and "Monopod Dynamics" are meant to implement a dynamic model of the robot arm, the camera head, and the monopod respectively. For the time being, the modules restrain the maximum speeds with which the joints can rotate. This feature is important, for instance when using the robot to catch the objects in the conveyor belt scene. Not being able to move instantly from one point in the workspace to another, the system is forced to use some kind of predictor.
Puma Inverse
"Puma Inverse" calculates the inverse kinematics [28], translating (if possible) the desired transformation matrix of the end effector to robot joint angles. If both the robot transformation matrix and the end effector matrix are presented, the latter is interpreted as being in world coordinates. If only the end effector matrix is furnished, it is interpreted in robot centered coordinates.
Camera simulation
Figure 6.4: Motion blur created by the "camera simulation" module.

"camera simulation" is a module for introducing noise and motion blur to the images (Figure 6.4). The blur is controlled with a decay rate parameter 0 ≤ α ≤ 1, where α = 1 means no blur at all and α = 0 means no image update, i.e. a frozen image. The noise is controlled with a corresponding factor 0 ≤ β ≤ 1, where β = 0 means no noise and β = 1 means noise only. The function for generating the output image at time t is:

$$i_{out}(t) = (1 - \beta)\left[\alpha\, i_{in}(t) + (1 - \alpha)\, i_{out}(t - \Delta t)\right] + \beta N(t) \qquad (6.1)$$

where i_out(t − Δt) is the preceding output image, and N(t) is a white noise image. The pixel values in the noise image are uniformly distributed in the interval [0, 1).
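A one-step numpy sketch of this model, with the grouping of Equation (6.1) as reconstructed above and a function name of my own, could look like this:

import numpy as np

def camera_simulation(i_in, i_prev, alpha, beta, rng=np.random.default_rng()):
    # alpha: decay rate (1 = no blur, 0 = frozen image); beta: noise factor.
    noise = rng.uniform(0.0, 1.0, size=i_in.shape)            # white noise image N(t)
    return (1.0 - beta) * (alpha * i_in + (1.0 - alpha) * i_prev) + beta * noise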
Transformations

"Transformations" produces a transformation matrix in homogeneous coordinates from translation, rotation, and scale parameters. The module is used for positioning the robot, monopod, etc.
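As a sketch of the kind of matrix such a module produces, and of how chaining transforms corresponds to connecting one module's output port to another's input, consider the following. The parameter set is simplified to a translation, a rotation about the z axis and a uniform scale, which is my own reduction and not the module's actual interface.

import numpy as np

def transformation(translation, rotation_z_deg=0.0, scale=1.0):
    # Homogeneous 4 x 4 matrix built from translation, rotation and scale.
    t = np.radians(rotation_z_deg)
    T = np.eye(4)
    T[:3, :3] = scale * np.array([[np.cos(t), -np.sin(t), 0.0],
                                  [np.sin(t),  np.cos(t), 0.0],
                                  [0.0,        0.0,       1.0]])
    T[:3, 3] = translation
    return T

# Attaching a camera head to the robot's end effector amounts to composing
# such matrices, e.g.  head_pose = robot_base @ end_effector @ neck_offset.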
6.3 Example of an experiment
The following example shows how a complex system can be built in AVS, see Figure 6.5. These AVS networks are used in the experiments described in Chapter 5.
[Figure 6.5: This AVS network corresponds to the experiments described in Chapter 5. Modules: animated float, Virtual reality, Robot Vision, Object tracker dT and Occular Reflexes. The "Virtual reality" and "Robot Vision" modules are macro modules that correspond to the networks in Figure 6.6 on page 166 and Figure 6.7 on page 168 respectively. The upstream connections to the "Virtual reality" module from "Occular reflexes" are here made invisible in order not to clutter the network.]
The net is driven by "animated float" acting as a clock pulse, activating connected modules. Modules not connected to the clock execute when they are called by upstream modules. Note that a module connected both to the "animated float" module and other upstream modules normally waits for the upstream modules to finish before executing, but this can be overridden by the programmer.

The "Virtual reality" module simulates the environment and generates images, while the "Robot Vision" module analyses these images. These modules are macro modules that contain whole networks. They are therefore described separately in more detail in Subsection 6.3.1.
"Object tracker dT" extracts the image point that corresponds to the centroid of the temporal differences in each of the right and left images. The tracking is similar to the method for locating symmetries described in Subsection 4.4.7. The output from the module is the vector to the centroid in image coordinates for each image.
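A stripped-down stand-in for this computation, ignoring the fovea representation and with names of my own, is simply the intensity-weighted centroid of the absolute frame difference, expressed relative to the image centre:

import numpy as np

def difference_centroid(frame, prev_frame):
    # Vector (x, y) from the image centre to the centroid of |frame difference|.
    d = np.abs(frame.astype(float) - prev_frame.astype(float))
    total = d.sum()
    if total == 0.0:
        return np.zeros(2)
    ys, xs = np.mgrid[0:d.shape[0], 0:d.shape[1]]
    cy = (d * ys).sum() / total
    cx = (d * xs).sum() / total
    centre_y, centre_x = (np.array(d.shape) - 1) / 2.0
    return np.array([cx - centre_x, cy - centre_y])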
"Occular reflexes" orients the head towards the moving object according to the information from "Object tracker dT", using the tilt and pan joints on the monopod. Exactly the same control strategy as in Subsection 4.4.8 is used since the hand of the Puma 560 is identical to the neck of the monopod. The module also verges the cameras to look at the same point according to the disparity estimates. Note that the upstream connections to the "Virtual reality" module from "Occular reflexes" are invisible, but the input ports where they attach are visible. The reason for this is purely esthetic.
6.3.1 Macro modules
Macro modules are a way of organizing large networks into logical clusters of modules. Collecting a number of modules into a macro module does not affect the execution order of the modules. From the scheduler's point of view all modules might as well be connected in one large network.
[Figure 6.6: The network corresponding to the "Virtual reality" macro module in Figure 6.5 on page 164. Modules: Transformations, Puma Inverse, Puma 560, Monopod, Vacuum Gripper, Conveyor belt scene, Stereo Camera, two geometry viewer modules, two antialias modules and Scheduler BUG workaround.]
Virtual reality
"Virtual reality" is a macro module containing the network shown in Figure 6.6 on the facing page. It generates the geometric description of the conveyor belt scene, the robot arm, the camera head etc. It also renders the images from the stereo camera head using the module "geometry viewer". The "antialias" modules reduce the aliasing effects in the images by means of low pass filtering and subsampling. The ruggedness of edges and lines typical for computer generated images is then reduced. Note that the images have to be rendered at twice the needed resolution.

"Conveyor belt scene" manages the transformation of manufactured objects and handles robot interaction. Note that it is connected to the clock pulse since it calculates the dynamic behavior of the objects and therefore needs to know the time.

The robot hand can be positioned and oriented using any module producing a transformation matrix, e.g. "Transformations". "Puma Inverse" calculates the inverse kinematics, translating (if possible) the desired transformation matrix of the end effector to robot joint angles.

The "Scheduler BUG workaround" module makes the network wait for all modules in "Virtual reality" before continuing. Appendix A describes why it is necessary.
Robot Vision
Figure 6.7 shows the network corresponding to the "Robot Vision" macro module. Log-Cartesian fovea representations of the luminance of the left and right images are created in the "Float luminance" and "Create fovea" modules. Certainty masks are generated in "model based inhibition" both for the robot arm and for modeled objects as described in Chapter 5.

Phase based disparity estimates are generated by the "Robot NDCfovea stereo" module. The module uses a fovea version of the method described in Chapter 3 in combination with the normalized convolution scheme presented in Chapter 5.
[Figure 6.7: The sub-network corresponding to the "Robot Vision" macro module in Figure 6.5 on page 164. Modules: Float luminance, Create fovea, model based inhibition, Mask signal, fovea dT, NDC oneori, Fovea distmass and Robot NDCfovea Stereo.]
A vector field pointing towards the centroid of temporal differences is created by the "fovea dT" and "Fovea distmass" modules. The vector field can be interpreted as the gradients of a potential field around moving objects.
6.4 Simulation versus reality
Simulated environments are powerful tools for developing robot vision algorithms. New robot types and configurations can be tested without having to buy new expensive hardware. Interface, electrical and control problems can be bypassed. Ground truth is known. Dynamic events can be run in slow motion, and so forth. This blessing might, on the other hand, be a curse. Algorithms working perfectly in the simulated environment might fail completely on natural images. Control strategies might break down due to latencies [12] in a real system that did not exist in the simulated world, etc.

In order to address this problem, all feature extraction algorithms have been tested on real images. The "camera simulation" module is an attempt in this direction as well.
The control problem is harder to address. The problem of delays in the image processing is still there in a simulated environment. The length of the delay is mostly constant and known, which might not correspond to a real situation. In Figure 6.5 on page 164 the network is synchronous, meaning that all modules are running according to a common clock, defining "real-time" for the entire system. This type of simulation does not reflect the difficulties of having a real-time running world and dedicated hardware running in its own time connected to general purpose hardware working even slower. Another disadvantage of a synchronous system is that it forces the higher levels to work with the same resolution in time as the lower levels, which is probably quite inefficient. The setup in Figure 6.8 on page 171 is a possible solution to this problem.
The system is spread over a number of AVS sessions, each with a separate scheduler and thus a separate clock. The sessions execute independently and communicate asynchronously, which makes it more realistic than the system in Figure 6.5 on page 164. Real-time control loops are kept in "session A", while more time consuming analysis is carried out in "session B". There is no limit to the number of AVS sessions that can be connected. Future research will explore the viability of a system such as the one in Figure 6.8.
6.5 Summary
The robot simulator has played a major role in the research that is the basis for this thesis. It allows the user to gain insight into the problems concerning the generation of robot joint angles from visual input, without getting drowned in control problems, for instance how camera motion is enabled but also limited by the robot's degrees of freedom. Hand-eye coordination tasks can be studied as well.
One of the demonstrators in the VAP project is built around the conveyor belt scene in Figure 6.3 on page 161. A robot system consisting of a Puma 560 arm, a vacuum gripper, and a stereo camera head is used for inspecting manufactured machine parts. The machine parts pass by on one conveyor belt and only correctly manufactured objects should be allowed to continue to the bin. Objects with small defects are to be returned on another conveyor belt, while severely defective objects and any other objects should be discarded in a special bin. If objects are arriving faster than the system can inspect them, a local stack, the table, can be used. This demonstrator contains both low level reactive behaviors and high level planning. The research issue is to design a multi-level control policy that can handle both expected and unexpected events, and adapt its behavior according to the situation. The experience from these experiments, combined with the expertise on real camera heads in other member groups, enables faster progress than doing simulation only or real experiments only.
[Figure 6.8: A system consisting of two AVS sessions combined with asynchronous communication modules. AVS session A contains the environment simulation, the world-time clock, real-time control (vergence, smooth pursuit, hand-eye coordination) and real-time image processing (spatio-temporal filtering, phase-based stereo, colour transform). AVS session B contains object hypotheses generation, an action planner, world model maintenance, image processing control and goal driven focus of attention. The sessions communicate through asynchronous interface modules.]
A AVS problems and pitfalls
A.1 Module scheduling problems
Modules can run locally on the same host or remotely on other hosts, possibly with different architectures. Modules may also run in the same Unix process, saving resources and enhancing overall performance.

In Figure 6.7 on page 168 the data flow can be divided into four separate streams, which makes it possible to run groups of modules in parallel on four different hosts. In the experiments, four SUN IPX machines were used and "Robot NDCfovea Stereo" was run on a high performance number cruncher, a Stardent GS2500.
AVS handles the scheduling of which modules to execute and when. Generally it works as expected, but there are two serious bugs in the scheduler. The first bug concerns the "geometry viewer" module. It is a special module in the sense that it is built into AVS and not supplied as a separate executable as most other modules are. This fact seems to confuse the scheduler. Figure A.1 shows a network with a "geometry viewer" in the middle. If a new geometry is read by the "read geometry" module, the "geometry viewer" executes followed by "field math" and "display image", which is the correct behavior. Similarly, if a new image is read by "read image" the desired behavior is that "field math" waits for "geometry viewer" to finish before executing. Instead the "field math" module executes before the "geometry viewer", and then once more when the "geometry viewer" has finished. If the "geometry viewer" had been any other module the network would have executed as expected. The "Scheduler BUG workaround" module in Figure 6.6 on page 166 solves the problem by stopping all data from leaving the "Virtual reality" module before all "geometry viewer" modules have executed. This solution is unsatisfactory since it requires all data to be copied from input to output, which is both time and memory consuming.

[Figure A.1: A sample network (read image, read geom, geometry viewer, field math, display image) that executes in an undesired order if a new image is read with the "read image" module. The "field math" module will execute twice instead of waiting for the "geometry viewer" to finish.]
The second scheduler problem concerns parallel module executions and
feedback data streams. AVS can schedule parallel execution or feedback
data streams but not both. A network containing both parallel execution
and feedback data can be made to execute correctly by carefully starting
up modules, but it is highly unstable. Any interaction with the network
might cause it to start executing in an undesired order.
Both these problems might disappear with the new release of AVS (AVS6). The control structure is then completely changed, giving more control to the programmer. Time will tell.
A.2 Texture mapping problems
[Figure A.2: Left: A texture is mapped onto an object by linear interpolation between the image points a and b. The result is a non-equidistant mapping onto the object. Right: A texture is mapped onto an object by linear interpolation between the object coordinates A and B.]
Texture mapping enhances the naturalistic look of simulated objects since the surface structure of real objects can be captured. It allows having detailed structures even though the objects are defined with a few polygons. Almost all computer graphics systems, AVS included, use linear interpolation in image coordinates for textures. This may make a flat surface look warped. The reason for this is shown to the left in Figure A.2, where the image of an object between points A and B is rendered between the image points a and b. The dashed rays show how the texture pixels are projected onto the object when linear interpolation in the image plane is used. The rays go through equidistant points in the image plane but do not hit equidistant points on the object surface. The effect can be reduced by making a finer tessellation of the surface, which means adding vertices in between A and B. The texture will then be correctly mapped in these points as well, and the distortion between the vertices will be smaller. However, for image sequences, e.g. from moving cameras, this method is not satisfactory. Textures seem to float around when the camera is moved.
In principle it is possible to add the texture when creating the objects and
create one polygon vertex for each pixel in the texture map. The objects
will then consist of patches colored according to the texture, as opposed
to coloring the projected image of the objects. This method yields better
results but is very memory and time consuming.
The appropriate method for texture mapping is shown to the right in Figure A.2. If the depth to the points is taken into account, the texture is mapped onto equidistant points along the object but not onto equidistant points in the image plane. The reason for not using the proper approach in most computer graphics systems is that it is more complex, and thus more time consuming.
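The depth-aware mapping amounts to perspective-correct interpolation: along a scanline it is u/z and 1/z, not u itself, that vary linearly with the image coordinate. A small sketch in my own notation, not taken from the thesis:

def texture_coordinate(s, u_a, u_b, z_a, z_b):
    # s in [0, 1] runs linearly between the image points a and b; u_a, u_b are
    # texture coordinates and z_a, z_b the depths at the object points A and B.
    # Plain linear interpolation in the image plane would be
    #   u = (1 - s) * u_a + s * u_b
    # which causes the warping described above. Interpolating u/z and 1/z
    # instead gives the perspective-correct texture coordinate:
    num = (1 - s) * u_a / z_a + s * u_b / z_b
    den = (1 - s) / z_a + s / z_b
    return num / den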
References
[1] A. L. Abbott and N. Ahuja. Surface reconstruction by dynamic integration of focus, camera vergence and stereo. In Proceedings IEEE
Conf. on Computer Vision, pages 523{543, 1989.
[2] E. H. Adelson and J. R. Bergen. Spatiotemporal energy models for
the perception of motion. Jour. of the Opt. Soc. of America, 2:284{
299, 1985.
[3] J. Y. Aloimonos, I. Weiss, and A. Bandopadhay. Active vision. International Journal of Computer Vision, 1(3):333{356, 1987.
[4] R. Bajcsy. Passive perception vs. active perception. In Proc. IEEE
Workshop on Computer Vision, 1986.
[5] R. Bajcsy. Active perseption. Proceedings of the IEEE, 76(8):996{
1005, August 1988.
[6] D. H. Ballard. Animate vision. Technical Report 329, Computer
Science Department, University of Rochester, Feb. 1990.
[7] D. H. Ballard and A. Ozcandarli. Eye xation and early vision: kinetic depth. In Proceedings 2nd IEEE Int. Conf. on computer vision,
pages 524{531, december 1988.
[8] S. T. Barnard and M. A. Fichsler. Computational Stereo. ACM
Comput. Surv., 14:553{572, 1982.
[9] J. Bigun. Local Symmetry Features in Image Processing. PhD thesis,
Linkoping University, Sweden, 1988. Dissertation No 179, ISBN 91{
7870{334{4.
[10] R. Bracewell. The Fourier Transform and its Applications. McGrawHill, 2nd edition, 1986.
177
178
REFERENCES
[11] C. M. Brown. The Rochester robot. Technical Report 257, Computer
Science Department, University of Rochester, Aug. 1988.
[12] C. M. Brown. Gaze control with interactions and delays. IEEE Systems, Man and Cybernetics, 20(1):518-527, March 1990.
[13] C. M. Brown. Prediction and cooperation in gaze control. Biological Cybernetics, 63:61-70, 1990.
[14] K. Brunnstrom. Active exploration of static scenes. PhD thesis, Royal
Institute of Technology, October 1993. ISRN KTH/NA/P-93/29-SE,
ISSN 1101-2250.
[15] K. Brunnstrom, J. O. Eklundh, and T. Lindeberg. Active detection and classification of junctions by foveating with a head-eye system guided by the scale-space primal sketch. Technical Report TRITA-NA-P9131, CVAP, NADA, Royal Institute of Technology, Stockholm, Sweden, 1990.
[16] A. D. Calway, H. Knutsson, and R. Wilson. Multiresolution estimation of 2-d disparity using a frequency domain approach. In Proc. British Machine Vision Conf., Leeds, UK, September 1992.
[17] A. D. Calway, H. Knutsson, and R. Wilson. Multiresolution frequency
domain algorithm for fast image registration. In Proc. 3rd Int. Conf.
on Visual Search, Nottingham, UK, August 1992.
[18] A. Chehikian and J. L. Crowley. Fast computation of optimal semi-octave pyramids. In Proceedings of the 7th Scandinavian Conf. on Image Analysis, pages 18-27, Aalborg, Denmark, 1991. Pattern Recognition Society of Denmark and Aalborg University.
[19] C. C. Chen and M. M. Trivedi. Savic: A simulation, visualization, and interactive control environment for mobile robots. In H.I. Christensen, K.W. Bowyer, and H. Bunke, editors, Active robot vision: camera heads, model based navigation and reactive control, volume 6 of Series in Machine Perception and Artificial Intelligence, pages 123-144. World Scientific Publishing Co. Pte. Ltd., 1993. ISBN 981-02-1321-2.
[20] ChuXin Chen and Mohan M. Trivedi. Mobile robots with articulated tracks and manipulators: Intelligent control and graphical interface for teleoperation. In Mobile Robots VII, volume 1831, pages 592-603. SPIE, 1992.
[21] H.I. Christensen, K.W. Bowyer, and H. Bunke, editors. Active robot vision: camera heads, model based navigation and reactive control, volume 6 of Series in Machine Perception and Artificial Intelligence. World Scientific Publishing Co. Pte. Ltd., 1993. ISBN 981-02-1321-2.
[22] J. L. Crowley and H. I. Christensen, editors. Vision as Process, ESPRIT Basic Research Series. Springer-Verlag, 1994. ISBN 3-540-58143-X.
[23] S. Culhane and J. Tsotsos. An attentional prototype for early vision. In Proceedings of the 2nd European Conf. on Computer Vision, Santa Margherita Ligure, Italy, May 1992.
[24] M. M. Fleck. A topological stereo matcher. Int. Journal of Computer Vision, 6(3):197-226, August 1991.
[25] D. J. Fleet. Measurement of image velocity. Kluwer Academic Publishers, 1992. ISBN 0-7923-9198-5.
[26] D. J. Fleet and A. D. Jepson. Stability of phase information. In Proceedings of IEEE Workshop on Visual Motion, pages 52-60, Princeton, USA, October 1991. IEEE, IEEE Society Press.
[27] D. J. Fleet, A. D. Jepson, and M. R. M. Jenkin. Phase-based disparity measurement. CVGIP Image Understanding, 53(2):198-210, March 1991.
[28] K. S. Fu, R. C. Gonzalez, and C. S. G. Lee. Robotics. McGraw Hill Int. Editions, New York, 1987.
[29] D. Gabor. Theory of communication. Proc. Inst. Elec. Eng., 93(26):429-441, 1946.
[30] M. Gokstorp and C-J. Westelius. Multiresolution disparity estimation. In Proceedings of the 9th Scandinavian conference on Image Analysis, Uppsala, Sweden, June 1995. SCIA.
[31] Mats Gokstorp. Depth Computation in Robot Vision. PhD thesis, Linkoping University, S-581 83 Linkoping, Sweden, 1995. Dissertation No. 377, ISBN 91-7871-522-9.
[32] G. H. Granlund. In search of a general picture processing operator. Computer Graphics and Image Processing, 8(2):155-178, 1978.
[33] G. H. Granlund. Integrated analysis-response structures for robotics systems. Report LiTH-ISY-I-0932, Computer Vision Laboratory, Linkoping University, Sweden, 1988.
[34] G. H. Granlund and H. Knutsson. Signal Processing for Computer
Vision. Kluwer Academic Publishers, 1995. ISBN 0-7923-9530-1.
[35] G. H. Granlund, H. Knutsson, C-J Westelius, and J Wiklund. Issues in robot vision. Image and Vision Computing, 12(3):131-148, April 1994.
[36] L. Haglund. Adaptive Multidimensional Filtering. PhD thesis, Linkoping University, Sweden, S-581 83 Linkoping, Sweden, October 1992. Dissertation No 284, ISBN 91-7870-988-1.
[37] O. Hansen and J. Bigun. Local symmetry modeling in multidimensional images. Pattern Recognition Letters, 13(4), 1992.
[38] D. H. Hubel. Eye, Brain and Vision, volume 22 of Scientific American Library. W. H. Freeman and Company, 1988. ISBN 0-7167-5020-1.
[39] A. D. Jepson and D. J. Fleet. Scale-space singularities. In O. Faugeras, editor, Computer Vision-ECCV90, pages 50-55. Springer-Verlag, 1990.
[40] A. D. Jepson and M. Jenkin. The fast computation of disparity from phase differences. In Proceedings CVPR, pages 386-398, San Diego, California, USA, 1989.
[41] B. Julesz. Early vision and focal attention. Review of Modern Physics, 63(3):735-772, 1991.
[42] K. Kanatani. Camera rotation invariance of image characteristics. Computer Vision, Graphics and Image Processing, 39(3):328-354, Sept. 1987.
[43] J. Karlholm, C-J. Westelius, C-F. Westin, and H. Knutsson. Object
tracking based on the orientation tensor concept. In Proceedings of the
9th Scandinavian conference on Image Analysis, Uppsala, Sweden,
June 1995. SCIA.
[44] H. Knutsson. Filtering and Reconstruction in Image Processing. PhD
thesis, Linkoping University, Sweden, 1982. Diss. No. 88.
[45] H. Knutsson and G. H. Granlund. Fourier domain design of line and
edge detectors. In Proceedings of the 5th International Conference on
Pattern Recognition, Miami, Florida, December 1980.
[46] H. Knutsson, G. H. Granlund, and J. Bigun. Apparatus for detecting sudden changes of a feature in a region of an image that is divided into discrete picture elements. Swedish patent 8502571-6 (US-Patent 4.747.150, 1988), 1986.
[47] H. Knutsson, M. Hedlund, and G. H. Granlund. Apparatus for determining the degree of consistency of a feature in a region of an image that is divided into discrete picture elements. Swedish patent 8502570-8 (US-Patent 4.747.152, 1988), 1986.
[48] H. Knutsson and C-F Westin. Normalized and differential convolution: Methods for interpolation and filtering of incomplete and uncertain data. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York City, USA, June 1993. IEEE.
[49] H. Knutsson, C-F Westin, and C-J Westelius. Filtering of uncertain irregularity sampled multidimensional data. In Twenty-seventh Asilomar Conf. on Signals, Systems & Computers, Pacific Grove, California, USA, November 1993. IEEE.
[50] E. Krotkov. Exploratory visual sensing for determining spatial layout with an agile stereo camera system. PhD thesis, University of Pennsylvania, April 1987.
[51] K. Langley, T. J. Atherton, R. G. Wilson, and M. H. E. Larcombe. Vertical and horizontal disparities from phase. In O. Faugeras, editor, Computer Vision-ECCV90, pages 315-325. Springer-Verlag, April 1990.
[52] J. C. Latombe. Robot Motion Planning. Kluwer Academic Publishers,
1991. ISBN 0-7923-9129-2.
[53] B. MacLennan. Gabor representations of spatiotemporal visual images. Technical Report CS-91-144, Computer Science Department, University of Tennessee, September 1991.
[54] D. Marr. Vision. W. H. Freeman and Company, New York, 1982.
[55] J. Matas, R. Marik, and J. Kittler. Generation, verification and localisation of object hypotheses based on colour. In British Machine Vision Conference, pages 539-548, 1993.
[56] J. Matas, R. Marik, and J. Kittler. Illumination invariant colour recognition. In E. Hancock, editor, British Machine Vision Conference, pages 469-479. BMVA, BMVA Press, 1994.
[57] R. Milanese. Focus of attention in human vision: a survey. Technical Report 90.03, Computing Science Center, University of Geneva,
Geneva, August 1990.
[58] R. Milanese. Detection of salient features for focus of attention. In Proc. of the 3rd Meeting of the Swiss Group for Artificial Intelligence and Cognitive Science, Biel-Bienne, October 1991. World Scientific Publishing.
[59] D. Murray and A. Basu. Motion tracking with an active camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):449-459, May 1994.
[60] K. Pahlavan. Active robot vision and primary ocular reflexes. PhD thesis, Royal Institute of Technology, May 1993. ISSN 1101-2250.
[61] L. H. Quam. Hierarchical warp stereo. In Proceedings from DARPA Image Understanding Workshop, pages 149-155, 1984.
[62] D. Reisfeld, H. Wolfson, and Y. Yeshurun. Context free attentional
operators: the generalized symmetry transform. International Journal of Computer Vision, 1994. special issue on qualitative vision.
[63] T. D. Sanger. Stereo disparity computation using Gabor filters. Biological Cybernetics, 59:405-418, 1988.
[64] E. L. Schwartz. Computational anatomy and functional architecture of striate cortex: A spatial mapping approach to perceptual coding. Vision Research, 20:645-669, 1980.
[65] M. Tistarelli and G. Sandini. Direct estimation of time-to-impact from optical flow. In Proceedings of IEEE Workshop on Visual Motion, pages 52-60, Princeton, USA, October 1991. IEEE, IEEE Society Press.
[66] J. K. Tsotsos. Localizing stimuli in a sensory field using an inhibitory attentional beam. Technical Report RBCV-TR-91-37, Department of Computer Science, University of Toronto, October 1991.
[67] J. K. Tsotsos. On the relative complexity of active vs. passive visual search. Int. Journal of Computer Vision, 7(2):127-142, January 1992.
[68] J. van der Spiegel, G. Kreider, C. Claeys, I. Debusschere, G. Sandini, P. Dario, F. Fantini, P. Bellutti, and G. Soncini. A foveated retina-like sensor using CCD technology. In C. Mead and M. Ismael, editors, Analog VLSI implementation of neural systems. Kluwer, 1989.
[69] ESPRIT Basic Research Action 3038, Vision as Process, final report. Project document, April 1992.
[70] J. Y. A. Wang and E. H. Adelson. Layered representation for motion analysis. In IEEE Conference on Computer Vision and Pattern Recognition, pages 361-366, June 1993.
[71] J. Weng. Image matching using the windowed Fourier phase. International Journal of Computer Vision, 11(3):211-236, March 1993.
[72] C-J Westelius. Preattentive gaze control for robot vision, June 1992. Thesis No. 322, ISBN 91-7870-961-X.
[73] C-J. Westelius and H. Knutsson. Hierarchical disparity estimation using quadrature filter phase. International Journal of Computer Vision, 1995. Special issue on stereo, (submitted).
[74] C-J Westelius, H. Knutsson, and G. H. Granlund. Focus of attention control. Report LiTH-ISY-I-1140, Computer Vision Laboratory, Linkoping University, Sweden, 1990.
[75] C-J Westelius, H. Knutsson, and G. H. Granlund. Focus of attention control. In Proceedings of the 7th Scandinavian Conference on Image Analysis, pages 667-674, Aalborg, Denmark, August 1991. Pattern Recognition Society of Denmark.
[76] C-J Westelius, H. Knutsson, and G. H. Granlund. Preattentive gaze
control for robot vision. In Proceedings of Third International Conference on Visual Search. Taylor and Francis, 1992.
[77] C-J Westelius, H. Knutsson, and J. Wiklund. Robust vergence control using scale-space phase information. Report LiTH-ISY-I-1363, Computer Vision Laboratory, Linkoping University, Sweden, 1992.
[78] C-J Westelius and C-F Westin. A colour representation for scale-spaces. In The 6th Scandinavian Conference on Image Analysis, pages 890-893, Oulu, Finland, June 1989.
[79] C-J Westelius and C-F Westin. Representation of colour in image processing. In Proceedings of the SSAB Conference on Image Analysis,
Gothenburg, Sweden, March 1989. SSAB.
[80] C-J. Westelius, C-F. Westin, and H. Knutsson. Focus of attention
mechanisms using normalized convolution. IEEE Trans on Robotics
and Automation, 1996. Special section on robot vision. (submitted).
[81] C-F Westin. Feature extraction based on a tensor image description, September 1991. Thesis No. 288, ISBN 91-7870-815-X.
[82] C-F Westin. A Tensor Framework for Multidimensional Signal Processing. PhD thesis, Linkoping University, Sweden, S-581 83 Linkoping, Sweden, 1994. Dissertation No 348, ISBN 91-7871-421-4.
[83] C-F Westin and C-J Westelius. A colour model for hierarchical image processing. Master's thesis, Linkoping University, Sweden, August 1988. LiTH-ISY-EX-0857.
[84] J. Wiklund and H. Knutsson. A generalized convolver. In Proceedings of the 9th Scandinavian conference on Image Analysis, Uppsala,
Sweden, June 1995. SCIA.
[85] J. Wiklund, C-J Westelius, and H. Knutsson. Hierarchical phase
based disparity estimation. In Proceedings of 2nd Singapore International Conference on Image Processing. IEEE Singapore Section,
September 1992.
[86] J. Wiklund, C-F Westin, and C-J Westelius. AVS, Application Visualization System, software evaluation report. Report LiTH-ISY-R-1469, Computer Vision Laboratory, S-581 83 Linkoping, Sweden, 1993.
[87] R. Wilson and H. Knutsson. A multiresolution stereopsis algorithm based on the Gabor representation. In 3rd International Conference on Image Processing and Its Applications, pages 19-22, Warwick, Great Britain, July 1989. IEE. ISBN 0 85296382 3, ISSN 0537-9989.
[88] A. L. Yarbus. Eye movements and vision. Plenum, New York, 1969.