Linköping Studies in Science and Technology
Dissertation No. 858
Low and Medium Level Vision
using Channel Representations
Per-Erik Forssén
Dissertation No. 858
Department of Electrical Engineering
Linköping University, SE-581 83 Linköping, Sweden
Linköping March 2004
Low and Medium Level Vision using Channel Representations
© 2004 Per-Erik Forssén
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping
Sweden
ISBN 91-7373-876-X
ISSN 0345-7524
Don’t confuse the moon
with the finger that points at it.
Zen proverb
Abstract
This thesis introduces and explores a new type of representation for low and
medium level vision operations called channel representation. The channel representation is a more general way to represent information than e.g. as numerical
values, since it allows incorporation of uncertainty, and simultaneous representation of several hypotheses. More importantly it also allows the representation of
“no information” when no statement can be given. A channel representation of a
scalar value is a vector of channel values, which are generated by passing the original scalar value through a set of kernel functions. The resultant representation
is sparse and monopolar. The word sparse signifies that information is not necessarily present in all channels. On the contrary, most channel values will be zero.
The word monopolar signifies that all channel values have the same sign, e.g. they
are either positive or zero. A zero channel value denotes “no information”, and
for non-zero values, the magnitude signifies the relevance.
In the thesis, a framework for channel encoding and local decoding of scalar
values is presented. Averaging in the channel representation is identified as a
regularised sampling of a probability density function. A subsequent decoding is
thus a mode estimation technique.
The mode estimation property of channel averaging is exploited in the channel
smoothing technique for image noise removal. We introduce an improvement to
channel smoothing, called alpha synthesis, which deals with the problem of jagged
edges present in the original method. Channel smoothing with alpha synthesis is
compared to mean-shift filtering, bilateral filtering, median filtering, and normalized averaging with favourable results.
A fast and robust blob-feature extraction method for vector fields is developed. The method is also extended to cluster constant slopes instead of constant
regions. The method is intended for view-based object recognition and wide baseline matching. It is demonstrated on a wide baseline matching problem.
A sparse scale-space representation of lines and edges is implemented and described. The representation keeps line and edge statements separate, and ensures
that they are localised by inhibition from coarser scales. The result is however
still locally continuous, in contrast to non-max-suppression approaches, which introduce a binary threshold.
The channel representation is well suited to learning, which is demonstrated by
applying it in an associative network. An analysis of representational properties
of associative networks using the channel representation is made.
Finally, a reactive system design using the channel representation is proposed.
The system is similar in idea to recursive Bayesian techniques using particle filters,
but the present formulation allows learning using the associative networks.
Acknowledgements
This thesis could never have been written without the support from a large number
of people. I am especially grateful to the following persons:
My fiancée Linda, for love and encouragement, and for constantly reminding me
that there are other important things in life.
All the people at the Computer Vision Laboratory, for providing a stimulating
research environment, for sharing ideas and implementations with me, and for
being good friends.
Professor Gösta Granlund, for giving me the opportunity to work at the Computer
Vision Laboratory, for introducing me to an interesting area of research, and for
relating theories of mind and vision to our every-day experience of being.
Anders Moe and Björn Johansson for their constructive criticism on this manuscript.
Dr Hagen Spies, for giving an inspiring PhD course, which opened my eyes to
robust statistics and camera geometry.
Dr Michael Felsberg, for all the discussions on channel smoothing, B-splines, calculus in general, and Depeche Mode.
Johan Wiklund, for keeping the computers happy, and for always knowing all there
is to know about new technologies and gadgets.
The Knut and Alice Wallenberg foundation, for funding research within the WITAS
project.
And last but not least my fellow musicians and friends in the band Pastell, for
helping me to kill my spare time.
About the cover
The front cover page is a collection of figures from the thesis, arranged to constitute
a face, in the spirit of painter Salvador Dali. The back cover page is a photograph
of Swedish autumn leaves, processed with the SOR method in section 7.2.1, using
intensities in the range [0, 1], and the parameters dmax = 0.05, binomial filter of
order 11, and 5 IRLS iterations.
Contents

1 Introduction
  1.1 Motivation
  1.2 Overview
  1.3 Contributions
  1.4 Notations

2 Representation of Visual Information
  2.1 System principles
    2.1.1 The world as an outside memory
    2.1.2 Active vision
    2.1.3 View centred and object centred representations
    2.1.4 Robust perception
    2.1.5 Vision and learning
  2.2 Information representation
    2.2.1 Monopolar signals
    2.2.2 Local and distributed coding
    2.2.3 Coarse coding
    2.2.4 Channel coding
    2.2.5 Sparse coding

3 Channel Representation
  3.1 Compact and local representations
    3.1.1 Compact representations
    3.1.2 Channel encoding of a compact representation
  3.2 Channel representation using the cos2 kernel
    3.2.1 Representation of multiple values
    3.2.2 Properties of the cos2 kernel
    3.2.3 Decoding a cos2 channel representation
  3.3 Size of the represented domain
    3.3.1 A linear mapping
  3.4 Summary

4 Mode Seeking and Clustering
  4.1 Density estimation
    4.1.1 Kernel density estimation
  4.2 Mode seeking
    4.2.1 Channel averaging
    4.2.2 Expectation value of the local decoding
    4.2.3 Mean-shift filtering
    4.2.4 M-estimators
    4.2.5 Relation to clustering
  4.3 Summary and comparison

5 Kernels for Channel Representation
  5.1 The Gaussian kernel
    5.1.1 A local decoding for the Gaussian kernel
  5.2 The B-spline kernel
    5.2.1 Properties of B-splines
    5.2.2 B-spline channel encoding and local decoding
  5.3 Comparison of kernel properties
    5.3.1 The constant sum property
    5.3.2 The constant norm property
    5.3.3 The scalar product
  5.4 Metameric distance
  5.5 Stochastic kernels
    5.5.1 Varied noise level
  5.6 2D and 3D channel representations
    5.6.1 The Kronecker product
    5.6.2 Encoding of points in 2D
    5.6.3 Encoding of lines in 2D
    5.6.4 Local decoding for 2D Gaussian kernels
    5.6.5 Examples
    5.6.6 Relation to Hough transforms

6 Channel Smoothing
  6.1 Introduction
    6.1.1 Algorithm overview
    6.1.2 An example
  6.2 Edge-preserving filtering
    6.2.1 Mean-shift filtering
    6.2.2 Bilateral filtering
  6.3 Problems with strongest decoding synthesis
    6.3.1 Jagged edges
    6.3.2 Rounding of corners
    6.3.3 Patchiness
  6.4 Alpha synthesis
    6.4.1 Separating output sharpness and channel blurring
    6.4.2 Comparison of super-sampling and alpha synthesis
    6.4.3 Relation to smoothing before sampling
  6.5 Comparison with other denoising filters
  6.6 Applications of channel smoothing
    6.6.1 Extensions
  6.7 Concluding remarks

7 Homogeneous Regions in Scale-Space
  7.1 Introduction
    7.1.1 The scale-space concept
    7.1.2 Blob features
    7.1.3 A blob feature extraction algorithm
  7.2 The clustering pyramid
    7.2.1 Clustering of vector fields
    7.2.2 A note on winner-take-all vs. proportionality
  7.3 Homogeneous regions
    7.3.1 Ellipse approximation
    7.3.2 Blob merging
  7.4 Blob features for wide baseline matching
    7.4.1 Performance
    7.4.2 Removal of cropped blobs
    7.4.3 Choice of parameters
  7.5 Clustering of planar slopes
    7.5.1 Subsequent pyramid levels
    7.5.2 Computing the slope inside a binary mask
    7.5.3 Regions from constant slope model
  7.6 Concluding Remarks

8 Lines and Edges in Scale-Space
  8.1 Background
    8.1.1 Classical edge detection
    8.1.2 Phase-gating
    8.1.3 Phase congruency
  8.2 Sparse feature maps in a scale hierarchy
    8.2.1 Phase from line and edge filters
    8.2.2 Characteristic phase
    8.2.3 Extracting characteristic phase in 1D
    8.2.4 Local orientation information
    8.2.5 Extracting characteristic phase in 2D
    8.2.6 Local orientation and characteristic phase
  8.3 Concluding remarks

9 Associative Learning
  9.1 Architecture overview
  9.2 Representation of system output states
    9.2.1 Channel representation of the state space
  9.3 Channel representation of input features
    9.3.1 Feature generation
  9.4 System operation modes
    9.4.1 Position encoding for discrete event mapping
    9.4.2 Magnitude encoding for continuous function mapping
  9.5 Associative structure
    9.5.1 Optimisation procedure
    9.5.2 Normalisation modes
    9.5.3 Sensitivity analysis for continuous function mode
  9.6 Experimental verification
    9.6.1 Experimental setup
    9.6.2 Associative network variants
    9.6.3 Varied number of samples
    9.6.4 Varied number of channels
    9.6.5 Noise sensitivity
  9.7 Other local model techniques
    9.7.1 Radial Basis Function networks
    9.7.2 Support Vector Machines
    9.7.3 Adaptive fuzzy control
  9.8 Concluding remarks

10 An Autonomous Reactive System
  10.1 Introduction
    10.1.1 System outline
  10.2 Example environment
  10.3 Learning successive recognition
    10.3.1 Notes on the state mapping
    10.3.2 Exploratory behaviour
    10.3.3 Evaluating narrowing performance
    10.3.4 Learning a narrowing policy
  10.4 Concluding remarks

11 Conclusions and Future Research Directions
  11.1 Conclusions
  11.2 Future research
    11.2.1 Feature matching and recognition
    11.2.2 Perception action cycles

Appendices
  A Theorems on cos2 kernels
  B Theorems on B-splines
  C Theorems on ellipse functions

Bibliography
Chapter 1
Introduction
1.1 Motivation
The work presented in this thesis has been performed within the WITAS1 project
[24, 52, 105]. The goal of the WITAS project has been to build an autonomous2
Unmanned Aerial Vehicle (UAV) that is able to deal with visual input, and to
develop tools and techniques needed in an autonomous systems context. Extensive work on adaptation of more conventional computer vision techniques to the
WITAS platform has previously been carried out by the author, and is documented in [32, 35, 81]. This thesis will however deal with basic research aspects
of the WITAS project. We will introduce new techniques and information representations well suited for computer vision in autonomous systems.
Computer vision is usually described using a three level model:
• The first level, low-level vision is concerned with obtaining descriptions of
image properties in local regions. This usually means description of colour,
lines and edges, motion, as well as methods for noise attenuation.
• The next level, medium-level vision makes use of the features computed at
the low level. Medium-level vision has traditionally involved techniques such
as joining line segments into object boundaries, clustering, and computation
of depth from stereo image pairs. Processing at this level also includes more
complex tasks, such as the estimation of ego motion, i.e. the apparent motion
of a camera as estimated from a sequence of camera images.
• Finally, high-level vision involves using the information from the lower levels
to perform abstract reasoning about scenes, planning etc.
The WITAS project involves all three levels, but as the title of this thesis
suggests, we will only deal with the first two levels. The unifying theme of the thesis
1 WITAS stands for the Wallenberg laboratory for research on Information Technology and Autonomous Systems.
2 An autonomous system is self-guided, i.e. it operates without the direct control of an operator.
is a new information representation called channel representation. All methods
developed in the thesis either make explicit use of channel representations, or can
be related to the channel representation.
1.2 Overview
We start the thesis in chapter 2 with a short overview of system design principles
in biological and artificial vision systems. We also give an overview of different
information representations.
Chapter 3 introduces the channel representation, and discusses its representational properties. We also describe how a compact representation may be converted
into a channel representation using a channel encoding, and how the compact representation may be retrieved using a local decoding.
Chapter 4 relates averaging in the channel representation to estimation methods from robust statistics. We re-introduce the channel representation in a statistical formulation, and show that channel averaging followed by a local decoding is
a mode estimation technique.
Chapter 5 introduces channel representations using other kernels than the
cos2 kernel. The different kernels are compared in a series of experiments. In this
chapter we also explore the interference during local decoding between multiple
values stored in a channel vector. We also introduce the notion of stochastic
kernels, and extend the channel representation to higher dimensions.
Chapter 6 describes an image denoising technique called channel smoothing.
We identify a number of problems with the original channel smoothing technique,
and give solutions to them, one of them being the alpha synthesis technique. Channel smoothing is also compared to a number of popular image denoising techniques,
such as mean-shift, bilateral filtering, median filtering, and normalized averaging.
Chapter 7 contains a method to obtain a sparse scale-space representation
of homogeneous regions. The homogeneous regions are represented as sparse blob
features. The blob feature extraction method can be applied to both grey-scale
and colour images. We also extend the method to cluster constant slopes instead
of locally constant regions.
Chapter 8 contains a method to obtain a sparse scale-space representation
of lines and edges. In contrast to non-max-suppression techniques, the method
generates a locally continuous response, which should make it well suited e.g. as
input to a learning machinery.
Chapter 9 introduces an associative network architecture that makes use of the
channel representation. In a series of experiments the descriptive powers and the
noise sensitivity of the associative networks are analysed. In the experiments we
also compare the associative networks with conventional function approximation
using local models. We also discuss the similarities and differences between the
associative networks and Radial Basis Function (RBF) networks, Support Vector
Machines (SVM), and Fuzzy control.
Chapter 10 incorporates the associative networks in a feedback loop, which
allows successive recognition in an environment with perceptual aliasing. A system
design is proposed, and is demonstrated by solving the localisation problem
in a labyrinth. In this chapter we also use reinforcement learning to learn an
exploratory behaviour.
1.3 Contributions
We will now list what are believed to be the novel contributions of this thesis.
• A framework for channel encoding and local decoding of scalar values is
presented in chapter 3. This material originates from the author’s licentiate thesis [34], and is also contained in the article “HiperLearn: A High
Performance Channel Learning Architecture” [51].
• Averaging in the channel representation is identified as a regularised sampling of a probability density function. A subsequent decoding is thus a mode
estimation technique. This idea was originally mentioned in the paper “Image Analysis using Soft Histograms” [33], and is thoroughly explained in
chapter 4.
• The local decoding for 1D and 2D Gaussian kernels in chapter 5. This material is also published in the paper “Two-Dimensional Channel Representation
for Multiple Velocities” [93].
• The channel smoothing technique for image noise removal has been investigated by several people; for earlier work by the author, see the technical
report [42] and the papers “Noise Adaptive Channel Smoothing of Low Dose
Images” [87], and “Channel Smoothing using Integer Arithmetic” [38]. The
alpha synthesis approach described in chapter 6 is however a novel contribution, not published elsewhere.
• The blob-feature extraction method developed in chapter 7. This is an improved version of the algorithm published in the paper “Robust Multi-Scale
Extraction of Blob Features” [41].
• A scale-space representation of lines and edges is implemented and described
in chapter 8. This chapter is basically an extended version of the conference
paper “Sparse feature maps in a scale hierarchy” [39].
• The analysis of representational properties of an associative network in chapter 9. This material is derived from the article “HiperLearn: A High Performance Channel Learning Architecture” [51].
• The reactive system design using channel representation in chapter 10 is similar in idea to recursive Bayesian techniques using particle filters. The use
of the channel representation to define transition and narrowing, is however
believed to be novel. This material was also presented in the paper “Successive Recognition using Local State Models” [37], and the technical report
[36].
1.4 Notations
The mathematical notations used in this thesis should resemble those most commonly in use in the engineering community. There are however cases where there
are several common styles, and thus this section has been added to avoid confusion.
The following notations are used for mathematical entities:
s       Scalars (lowercase letters in italics)
u       Vectors (lowercase letters in boldface)
z       Complex numbers (lowercase letters in italics bold)
C       Matrices (uppercase letters in boldface)
s(x)    Functions (lowercase letters)

The following notations are used for mathematical operations:

A^T                 Matrix and vector transpose
⌊x⌋                 The floor operation
⟨x|y⟩               The scalar product
arg z               Argument of a complex number
conj z              Complex conjugate
|z|                 Absolute value of real or complex numbers
‖z‖                 Matrix or vector norm
(s ∗ f_k)(x)        Convolution
adist(ϕ1 − ϕ2)      Angular distance of cyclic variables
vec(A)              Conversion of a matrix to a vector by stacking the columns
diag(x)             Extension of a vector to a diagonal matrix
supp{f}             The support (definition domain, or non-zero domain) of function f
Additional notations are introduced when needed.
Chapter 2
Representation of Visual Information
This chapter gives a short overview of some aspects of image interpretation in
biological and artificial vision systems. We will put special emphasis on system
principles, and on which information representations to choose.
2.1 System principles
When we view vision as a sense for robots and other real-time perception systems,
the parallels with biological vision at the system level become obvious. Since an
autonomous robot is in direct interaction with the environment, it is faced with
many of the problems that biological vision systems have dealt with successfully
for millions of years. This is the reason why biological systems have been an
important source of inspiration to the computer vision community, since the early
days of the field, see e.g. [74]. Since biological and mechanical systems use different
kinds of “hardware”, there are of course several important differences. Therefore
the parallel should not be taken too far.
2.1.1 The world as an outside memory
Traditionally much effort in machine vision has been devoted to methods for finding detailed reconstructions of the external world [9]. As pointed out by e.g.
O’Regan [83] there is really no need for a system that interacts with the external
world to perform such a reconstruction, since the world is continually “out there”.
He uses the neat metaphor “the world as an outside memory” to explain why. By
focusing your eyes at something in the external world, instead of examining your
internal model, you will probably get more accurate and up-to-date information
as well.
2.1.2 Active vision
If we do not need a detailed reconstruction, then what should the goal of machine
vision be? The answer to this question in the paradigm of active vision [3, 4, 1]
is that the goal should be generation of actions. In that way the goal depends on
the situation, and on the problem we are faced with.
Consider the following situation: A helicopter is situated above a road and
equipped with a camera. From the helicopter we want to find out information
about a car on the road below. When looking at the car through our sensor, we
obtain a blurred image at low resolution. If the image is not good enough we
could simply move closer, or change the zoom of the camera. The distance to the
car can be obtained if we have several images of the car from different views. If
we want several views, we do not actually need several cameras, we could simply
move the helicopter and obtain shots from other locations.
The key idea behind active vision is that an agent in the external world has
the ability to actively extract information from the external world by means of its
actions. This ability to act can, if properly used, simplify many of the problems
in vision, for instance the correspondence problem [9].
2.1.3 View centred and object centred representations
Biological vision systems interpret visual stimuli by generation of image features in
several retinotopic maps [5]. These maps encode highly specific information such
as colour, structure (lines and edges), motion, and several high-level features not
yet fully understood. An object in the field of view is represented by connections
between the simultaneously active features in all of the feature maps. This is
called a view centred representation [46], and is an object representation which
is distributed across all the feature maps, or views. Perceptual experiments are
consistent with the notion that biological vision systems use multiple such view
representations to represent three-dimensional objects [12]. In chapters 7 and 8
we will generate sparse feature maps of structural information, that can be used
to form a view centred object representation.
In sharp contrast, many machine vision applications synthesise image features
into compact object representations that are independent of the views from which
they are viewed. This approach is called an object centred representation [46].
This kind of representation also exists in the human mind, and is used e.g. in
abstract reasoning, and in spoken language.
2.1.4 Robust perception
In the book “The Blind Watchmaker” [23] Dawkins gives an account of the echolocation sense of bats. The bats described in the book are almost completely blind,
and instead they emit ultrasound cries and use the echoes of the cries to perceive
the world. The following is a quote from [23]:
It seems that bats may be using something that we could call a ’strangeness
filter’. Each successive echo from a bat’s own cries produces a picture
of the world that makes sense in terms of the previous picture of the
world built up with earlier echoes. If the bat’s brain hears an echo from
another bat’s cry, and attempts to incorporate this into the picture of
the world that it has previously built up, it will make no sense. It will
appear as though objects in the world have suddenly jumped in various
random directions. Objects in the real world do not behave in such
a crazy way, so the brain can safely filter out the apparent echo as
background noise.
A crude equivalent to this strangeness filter has been developed in the field
of robust statistics [56]. Here samples which do not fit the used model at all are
allowed to be rejected as outliers. In this thesis we will develop another robust
technique, using the channel information representation.
2.1.5 Vision and learning
As machine vision systems become increasingly complex, the need to specify their
behaviour without explicit programming becomes increasingly apparent.
If a system is supposed to act in an un-restricted environment, it needs to be
able to behave in accordance with the current surroundings. The system thus has
to be flexible, and needs to be able to generate context dependent responses. This
leads to a very large number of possible behaviours that are difficult or impossible
to specify explicitly. Such context dependent responses are preferably learned by
subjecting the system to the situations, and applying percept-response association
[49].
By using learning, we are able to define what our system should do, not how
it should do it. And finally, a system that is able to learn, is able to adapt to
changes, and to act in novel situations that the programmer did not foresee.
2.2 Information representation
We will now discuss a number of different approaches to representation of information, which are used in biological and artificial vision systems. This is by no
means an exhaustive presentation; it should rather be seen as background, and
motivation for the representations chosen in the following chapters of this thesis.
2.2.1 Monopolar signals
Information processing cells in the brain exhibit either bipolar or monopolar responses. One rare example of bipolar detectors is the hair cells in semicircular
canals of the vestibular system1 . These cells hyperpolarise when the head rotates
one way, and depolarise when it is rotated the other way [61].
1 The vestibular system coordinates the orientation of the head.
Bipolar signals are typically represented numerically as values in a range centred around zero, e.g. [−1.0, 1.0]. Consequently monopolar signals are represented
as non-negative numbers in a range from zero upwards, e.g. [0, 1.0].
Interestingly there seem to be no truly bipolar detectors at any stage of the
visual system. Even the bipolar cells of the retina are monopolar in their responses
despite their name. The disadvantage with a monopolar detector compared to a
bipolar one is that it can only respond to one aspect of an event. For instance do
the retinal bipolar cells respond to either bright, or dark regions. Thus there are
twice as many retinal bipolar cells, as there could have been if they had had bipolar
responses. However, a bipolar detector has to produce a maintained discharge at
the equilibrium. (For the bipolar cells this would have meant maintaining a level in-between the bright and dark levels.) This results in bipolar detectors being much
more sensitive to disturbances [61]. Monopolar, or non-negative representations
will be used frequently throughout this thesis.
Although the use of monopolar signals is widespread in biological vision systems, it is rarely found in machine vision. It has however been suggested in [45].
2.2.2 Local and distributed coding
Three different strategies for representation of a system state using a number of
signals are given by Thorpe in [97]. Thorpe uses the following simple example to
illustrate their differences: We have a stimulus that can consist of a horizontal or
a vertical bar. The bar can be either white, black, or absent (see figure 2.1). For
simplicity we assume that the signals are binary, i.e. either active or inactive.
Figure 2.1: Local, semi-local, and distributed coding. Figure adapted from [97].
One way to represent the state of the bar is to assign one signal to each of the
possible system states. This is called a local coding in figure 2.1, and the result is
a local representation. One big advantage with a local representation is that the
system can deal with several state hypotheses at once. In the example in figure
2.1, two active signals would mean that there was two bars present in the scene.
Another way is to assign one output for each state of the two properties: orientation
and colour. This is called semi-local coding in figure 2.1. As we move away
from a completely local representation, the ability to deal with several hypotheses
gradually disappears. For instance, if we have one vertical and one horizontal bar,
we can deal with them separately using a semi-local representation only if they
have the same colour.
The third variant in figure 2.1 is to assign one stimulus pattern to each system
state. In this representation the number of output signals is minimised. This results in a representation of a given system state being distributed across the whole
range of signals, hence the name distributed representation. Since this variant also
succeeds at minimising the number of output signals, it is also a compact coding
scheme.
These three representation schemes are also different in terms of metric. A
similarity metric is a measure of how similar two states are. The coding schemes
in figure 2.1 can for instance be compared by counting how many active (i.e. non-zero) signals they have in common. For the local representation, no states have
common signals, and thus, in a local representation we can only tell whether two
states are the same or not. For the distributed representation, the similarity metric
is completely random, and thus not useful.
For the semi-local representation however, we get a useful metric. For example,
bars with the same orientation, but different colour will have one active signal in
common, and are thus halfway between being the same state, and being different
states.
2.2.3 Coarse coding
We will now describe a coding scheme called coarse coding, see e.g. [96]. Coarse
coding is a technique that can represent continuous state spaces. In figure 2.2 the
plane represents a continuous two dimensional state space. This space is coded
using a number of feature signals with circular receptive fields, illustrated by the
circles in the figure.
Figure 2.2: Coarse coding. Figure adapted from [96].
Each feature signal is binary, i.e. either active or inactive, and is said to coarsely
represent the location in state space. Since we have several features which are
partially overlapping, we can get a rough estimate of where in state-space we are,
by considering all the active features. The white cross in the figure symbolises
a particular state, and each feature activated by this state has its receptive field
coloured grey. As can be seen, we get an increasingly darker shade of grey where
several features are active, and the region where the colour is the darkest contains
the actual state. Evidently, a small change in location in state space will result in
a small change in the activated feature set. Thus coarse coding results in a useful
similarity metric, and can be identified as a semi-local coding scheme according to
the taxonomy in section 2.2.2. As we add more features in a coarse coding scheme,
we obtain an increasingly better resolution of the state space.
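As an illustration of the idea (a minimal sketch with hypothetical feature centres and radius, not taken from the thesis), the following Python snippet coarse codes a two-dimensional state using binary features with circular receptive fields:

```python
import numpy as np

# Minimal coarse coding sketch: a 2D state space covered by binary features
# with circular receptive fields (centres and radius are arbitrary choices).
rng = np.random.default_rng(0)
centres = rng.uniform(0, 10, size=(50, 2))    # hypothetical feature centres
radius = 2.5                                  # common receptive field radius

def coarse_code(state):
    """Return the binary activation pattern of the features for a 2D state."""
    dist = np.linalg.norm(centres - state, axis=1)
    return (dist < radius).astype(int)

state = np.array([4.2, 6.8])
active = np.flatnonzero(coarse_code(state))
# The intersection of the active receptive fields localises the state far
# more precisely than any single binary feature on its own.
print("active features:", active)
```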
2.2.4 Channel coding
The multiple channel hypothesis is discussed by Levine and Shefner [71] as
a model for human analysis of periodic patterns. According to [71], the multiple
channel hypothesis was first proposed by Campbell and Robson in 1968 [13]. The
multiple channel hypothesis constitutes a natural extension of coarse coding to
smoothly varying features called channels, see figure 2.3. It is natural to consider
smoothly varying and overlapping features for representation of continuous phenomena, but there is also evidence for channel representations of discrete state
spaces such as representation of quantity in primates [79].
Figure 2.3: Linear channel arrangement. One channel function is shown in solid,
the others are dashed.
The process of converting a state variable into channels is known in signal
processing as channel coding, see [89] and [90], and the resultant information representation is called a channel representation [46, 10, 80]. Representations using
channels allow a state space resolution much better than indicated by the number
of channels, a phenomenon known as hyperacuity [89].
As is common in science, different fields of research have different names for almost the same thing. In neuroscience and computational neurobiology the concept
population coding [108] is sometimes used as a synonym for channel representation.
In neural networks the concept of radial basis functions (RBF) [7, 58] is used to
describe responses that depend on the distance to a specific position. In control
theory, the fuzzy membership functions also have similar shape and application
[84]. The relationship between channel representation, RBF networks and Fuzzy
control will be explored in section 9.7.
2.2.5 Sparse coding
A common coding scheme is the compact coding scheme used in data compression
algorithms. Compact coding is the solution to an optimisation where the information content in each output signal is maximised. But we could also envision a
different optimisation goal: maximisation of the information content in the active
signals only (see figure 2.4). Something similar to this seems to happen at the
lower levels of visual processing in mammals [31]. The result of this kind of optimisation on visual input is a representation that is sparse, i.e. most signals are
inactive. The result of a sparse coding is typically either a local, or a semi-local
representation, see section 2.2.2.
Figure 2.4: Compact and sparse coding. Figure adapted from [31].
As we move upwards in the interpretation hierarchy in biological vision systems,
from cone cells, via centre-surround cells to the simple and complex cells in the
visual cortex, the feature maps tend to employ increasingly sparse representations
[31].
There are several good reasons why biological systems employ sparse representations, many of which could also apply to machine vision systems. For biological
vision, one advantage is that the amount of signalling is kept at a low rate, and
this is a good thing, since signalling wastes energy. Sparse coding also leads to
representations in which pattern recognition, template storage, and matching are
made easier [31, 75, 35]. Compared to compact representations, sparse features
convey more information when they are active, and contrary to how it might appear, the amount of computation will not be increased significantly, since only the
active features need to be considered.
Both coarse coding and channel coding approximate the sparse coding goal.
They both produce representations where most signals are inactive. Additionally,
an active signal conveys more information than an inactive one, since an active
signal tells us roughly where in state space we are.
Chapter 3
Channel Representation
In this chapter we introduce the channel representation, and discuss its representational properties. We also derive expressions for channel encoding and local
decoding using cos2 kernel functions.
3.1 Compact and local representations

3.1.1 Compact representations
Compact representations (see chapter 2) such as numbers, and generic object
names (house, door, Linda) are useful for communicating precise pieces of information. One example of this is the human use of language. However, compact
representations are not well suited to use if we want to learn a complex and unknown relationship between two sets of data (as in function approximation, or
regression), or if we want to find patterns in one data set (as in clustering, or
unsupervised learning).
Inputs in compact representations tend to describe temporally and/or spatially
distant events as one thing, and thus the actual meaning of an input cannot be
established until we have seen the entire training set. Another motivation for localised representations is that most functions can be sufficiently well approximated
as locally linear, and linear relationships are easy to learn (see chapter 9 for more
on local learning).
3.1.2 Channel encoding of a compact representation
The advantages with localised representations mentioned above motivate the introduction of the channel representation [46, 10, 80]. The channel representation
is an encoding of a signal value x, and an associated confidence r ≥ 0. This is done
by passing x through a set of localised kernel functions {B_k(x)}_1^K, and weighting
the result with the confidence r. Each output signal is called a channel, and the
vector consisting of a set of channel values

u = r (B_1(x)  B_2(x)  . . .  B_K(x))^T        (3.1)

is said to be the channel representation of the signal–confidence pair (x, r), provided that the channel encoding is injective for r ≠ 0, i.e. there should exist a
corresponding decoding that reconstructs x, and r from the channel values.
The confidence r can be viewed as a measure of reliability of the value x. It
can also be used as a means of introducing a prior, if we want to do Bayesian
inference (see chapter 10). When no confidence is available, it is simply taken to
be r = 1.
Examples of suitable kernels for channel representations include Gaussians [89,
36, 93], B-splines [29, 87], and windowed cos2 functions [80]. In practice, any kernel
with a shape similar to the one in figure 3.1 will do.
Figure 3.1: A kernel function that generates a channel from a signal.
In the following sections, we will exemplify the properties of channel representations with the cos2 kernel. Later on we will introduce the Gaussian, and the
B-spline kernels. We also make a summary where the advantages and disadvantages of each kernel are compiled. Finally we put the channel representation into
perspective by comparing it with other local model techniques.
3.2 Channel representation using the cos2 kernel
We will now exemplify channel representation with the cos2 kernel
B_k(x) = cos²(ω d(x, k))  if ω d(x, k) ≤ π/2,  and  B_k(x) = 0  otherwise.        (3.2)
Here the parameter k is the kernel centre, ω is the channel width, and d(x, k) is a
distance function. For variables in linear domains (i.e. subsets of R) the Euclidean
distance is used,
d(x, k) = |x − k| ,        (3.3)
and for periodic domains (i.e. domains isomorphic with S) with period K a modular1 distance is used,
d_K(x, k) = min(mod(x − k, K), mod(k − x, K)).        (3.4)
The measure of an angle is a typical example of a variable in a periodic domain.
The total domain of a signal x can be seen as cut up into a number of local but
partially overlapping intervals, d(x, k) ≤ π/(2ω), see figure 3.2.
Figure 3.2: Linear and modular arrangements of cos2 kernels. One kernel is shown
in solid, the others are dashed. Channel width is ω = π/3.
For example, the channel representation of the value x = 5.23, with confidence
r = 1, using the kernels in figure 3.2 (left), becomes
u = (0  0  0  0.0778  0.9431  0.4791  0  0)^T .
As can be seen, many of the channel values become zero. This is often the
case, and is an important aspect of channel representation, since it allows more
compact storage of the channel values. A channel with value zero is said to be
inactive, and a non-zero channel is said to be active.
As is also evident in this example, the channel encoding is only able to represent
signal values in a bounded domain. The exact size of the represented domain depends on the method we use to decode the channel vector, thus we will first derive
a decoding scheme (in section 3.2.3) and then find out the size of the represented
domain (in section 3.3).
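The encoding just described is straightforward to implement. The following minimal Python sketch (an illustration of the formulas above, not code from the thesis) reproduces the channel vector for x = 5.23 with K = 8 channels and ω = π/3:

```python
import numpy as np

def cos2_encode(x, r=1.0, K=8, omega=np.pi/3):
    """Channel encode a signal value x with confidence r using cos^2 kernels
    placed at the integer positions k = 1, ..., K, cf. (3.1)-(3.3)."""
    k = np.arange(1, K + 1)
    d = np.abs(x - k)                              # Euclidean distance (3.3)
    u = np.where(omega * d <= np.pi / 2,           # kernel support (3.2)
                 np.cos(omega * d) ** 2, 0.0)
    return r * u

print(np.round(cos2_encode(5.23), 4))
# [0. 0. 0. 0.0778 0.9431 0.4791 0. 0.]
```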
In order to simplify the notation in (3.2), the channel positions were defined
as consecutive integers, directly corresponding to the indices of consecutive kernel
functions. We are obviously free to scale and translate the actual signal value in
any desired way, before we apply the set of kernel functions. For instance, a signal
value ξ can be scaled and translated using
x = scale · (ξ − translation),        (3.5)
to fit the domain represented by the set of kernel functions {B_k(x)}_1^K. Non-linear
mappings x = f(ξ) are of course also possible, but they should be monotonic for
the representation to be non-ambiguous.
1 using the modulo operation mod(x, K) = x − ⌊x/K⌋K
3.2.1 Representation of multiple values
Since each signal value will only activate a small subset of channels, most of the
values in a channel vector will usually be zero. This means that for a large channel
vector, there is room for more than one scalar. This is an important aspect
of the channel representation, that gives it an advantage compared to compact
representations. For instance, we can simultaneously represent the value 7 with
confidence 0.3 and the value 3 with confidence 0.7 in the same channel vector
u = (0  0.175  0.7  0.175  0  0.075  0.3  0.075)^T .
This is useful to describe ambiguities. Using the channel representation we can
also represent the statement “no information”, which simply becomes an all zero
channel vector
u = (0  0  0  0  0  0  0  0)^T .
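Assuming that a multi-valued channel vector is formed by adding the encodings of the individual signal–confidence pairs (an assumption that reproduces the numbers above), both examples follow directly from the cos2_encode sketch introduced earlier:

```python
# Superposition of two signal-confidence pairs (assumed additive combination).
# Reuses cos2_encode and numpy (np) from the sketch above.
u = cos2_encode(3.0, r=0.7) + cos2_encode(7.0, r=0.3)
print(np.round(u, 4))
# [0. 0.175 0.7 0.175 0. 0.075 0.3 0.075]

# The statement "no information" is simply the all-zero channel vector.
u_no_info = np.zeros(8)
```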
There is an interesting parallel to multiple responses in biological sensory systems. If someone pokes two fingers in your back, you can feel where they are
situated if they are a certain distance apart. If they are too close however, you
will instead perceive one poking finger in-between the two. A representation where
this phenomenon can occur is called metameric in psychology2 , and the states (one
poking finger, or two close poking fingers) that cannot be distinguished in the given
representation are called metamers. The metamery aspect of a channel representation (using Gaussian kernels) was studied by Snippe and Koenderink in [89, 90]
from a perceptual modelling perspective.
We will refer to the smallest distance between sensations that a channel representation can handle as the metameric distance. Later on (section 5.4) we will
have a look at how small this distance actually is for different channel representations. The typical behaviour is that for large distances between encoded values
we have no interference, for intermediate distances we do have interference, and
for small distances the encoded values will be averaged [34, 87].
3.2.2 Properties of the cos2 kernel
The cos2 kernel was the first one used in a practical experiment. In [80] Nordberg et
al. applied it to a simple pose estimation problem. A network with channel inputs
was trained to estimate channel representations of distance, horizontal position,
and orientation of a wire-frame cube. The rationale for introducing the cos2 kernel
was a constant norm property, and constant norm of the derivative.
Our motivation for using the cos2 kernel (3.2) is that it has a localised support,
which ensures sparsity. Another motivation is that for values of ω = π/N where
N ∈ {3, 4, ...} we have

Σ_k B_k(x) = π/(2ω)    and    Σ_k B_k(x)² = 3π/(8ω) .        (3.6)
2 Another example of a metameric representation is colour, which basically is a three channel
representation of wavelength.
This implies that the sum, and the vector norm of a channel value vector
generated from a single signal–confidence pair is invariant to the value of the
signal x, as long as x is within the represented domain of the channel set (for
proofs, see theorems A.3 and A.4 in the appendix). The constant sum implies
that the encoded value, and the encoded confidence can be decoded independently.
The constant norm implies that the kernels locally constitute a tight frame [22], a
property that ensures uniform distribution of signal energy in the channel space,
and makes a decoding operation easy to find.
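These invariances are easy to check numerically. The short test below (reusing the cos2_encode sketch above) verifies that for ω = π/3 the channel sum stays at π/(2ω) = 3/2 and the squared norm at 3π/(8ω) = 9/8 throughout the represented domain:

```python
# Constant sum and constant norm properties (3.6), checked for omega = pi/3.
# Reuses cos2_encode and numpy (np) from the sketch above.
for x in np.linspace(1.5, 7.5, 13):
    u = cos2_encode(x)                        # single pair with confidence 1
    assert np.isclose(u.sum(), 1.5)           # pi / (2*omega)   = 3/2
    assert np.isclose((u**2).sum(), 1.125)    # 3*pi / (8*omega) = 9/8
```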
3.2.3 Decoding a cos2 channel representation
An important property of the channel representation is the possibility to retrieve
the signal–confidence pairs stored in a channel vector. The problem of decoding
signal and confidence values from a set of channel function values, superficially resembles the reconstruction of a continuous function from a set of frame coefficients.
There is however a significant difference: we are not interested in reconstructing
the exact shape of a function, we merely want to find all peak locations and their
heights.
In order to decode several signal values from a channel vector, we have to make
a local decoding, i.e. a decoding that assumes that the signal value lies in a specific
limited interval (see figure 3.3).
Figure 3.3: Interval for local decoding (ω = π/3).
For the cos2 kernel, and the local tight frame situation (3.6), it is suitable to
use decoding intervals of the form [k − 1 + N/2, k + N/2] (see theorem A.1 in the
appendix). The reason for this is that a signal value in such an interval will only
activate the N nearest channels, see figure 3.3. Decoding a channel vector thus
involves examining all such intervals for signal–confidence pairs, by computing
estimates using only those channels which should have been activated.
The local decoding is computed using a method illustrated in figure 3.4. The
channel values, uk , are now seen as samples from a kernel function translated to
have its peak at the represented signal value x̂.
We denote the index of the first channel in the decoding interval by l (in the
figure we have l = 4), and use groups of consecutive channel values {ul , ul+1 , . . .,
ul+N −1 }.
If we assume that the channel values of the N active channels constitute an
encoding of a single signal–confidence pair (x, r), we obtain N equations
Figure 3.4: Example of channel values (ω = π/3, and x̂ = 5.23).





(u_l  u_{l+1}  . . .  u_{l+N−1})^T = (r B_l(x)  r B_{l+1}(x)  . . .  r B_{l+N−1}(x))^T .        (3.7)
We will now transform an arbitrary row of this system in a number of steps
u_{l+d} = r B_{l+d}(x) = r cos²(ω(x − l − d))        (3.8)
u_{l+d} = r/2 (1 + cos(2ω(x − l − d)))        (3.9)
u_{l+d} = r/2 (1 + cos(2ω(x − l)) cos(2ωd) + sin(2ω(x − l)) sin(2ωd))        (3.10)
u_{l+d} = (1/2 cos(2ωd)   1/2 sin(2ωd)   1/2) (r cos(2ω(x − l))   r sin(2ω(x − l))   r)^T .        (3.11)
We can now rewrite (3.7) as

u = (u_l  u_{l+1}  . . .  u_{l+N−1})^T = A p ,        (3.12)

where row d of the N × 3 matrix A is 1/2 (cos(2ωd)  sin(2ωd)  1), for d = 0, . . . , N − 1,
and p = (r cos(2ω(x − l))  r sin(2ω(x − l))  r)^T .
For N ≥ 3, this system can be solved using a least-squares fit

p = (p_1  p_2  p_3)^T = (A^T A)^{−1} A^T u = W u .        (3.13)

Here W is a constant matrix, which can be computed in advance and be used to
decode all local intervals. The final estimate of the signal value becomes

x̂ = l + 1/(2ω) arg [p_1 + i p_2] .        (3.14)
For the confidence estimate, we have two solutions

r̂_1 = |p_1 + i p_2|    and    r̂_2 = p_3 .        (3.15)
The case of ω = π/2 requires a different approach to find x̂, r̂1 , and r̂2 since
u = Ap is under-determined when N = 2. Since the channel width ω = π/2 has
proven to be not very useful in practice, this decoding approach has been moved
to observation A.5 in the appendix.
When the two confidence measures are equal, we have a group of consecutive
channel values {ul , ul+1 , . . ., ul+N −1 } that originate from a single signal value
x. The fraction r̂1 /r̂2 is independent of scalings of the channel vector, and could
be used as a measure of the validity of the model assumption (3.7). The model
assumption will quite often be violated when we use the channel representation.
For instance, response channels estimated using a linear network will not in general
fulfill (3.7) even though we may have supplied such responses during training. We
will study the robustness of the decoding (3.14), as well as the behaviour in case
of interfering signal–confidence pairs in chapter 5. See also [36].
The solution in (3.14) is said to be a local decoding, since it has been defined
using the assumption that the signal value lies in a specific interval (illustrated in
figure 3.3). If the decoded value lies outside the interval, the local peak is probably
better described by another group of channel values. For this reason, decodings
falling outside their decoding intervals are typically neglected.
We can also note that for the local tight frame situation (3.6), the matrix A^T A
becomes diagonal, and we can compute the local decoding as a local weighted
summation of complex exponentials

x̂ = l + 1/(2ω) arg [ Σ_{k=l}^{l+N−1} u_k e^{i 2ω(k−l)} ] .        (3.16)
For this situation the relation between neighbouring channel values tells us the
signal value, and the channel magnitudes tell us the confidence of this statement.
In signal processing it is often argued that it is important to attach a measure
of confidence to signal values [48]. The channel representation can be seen as a
unified representation of signal and confidence.
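For completeness, the local decoding (3.12)-(3.15) can be sketched in a few lines of Python (illustrative only; it assumes ω = π/N and reuses the cos2_encode sketch above). It scans all decoding intervals, discards decodings that fall outside their own interval, and recovers both signal–confidence pairs from the example in section 3.2.1:

```python
def cos2_decode(u, omega=np.pi/3):
    """Local decoding of a cos^2 channel vector, cf. (3.12)-(3.15).
    Returns (x_hat, r1_hat, r2_hat) for every decoding that falls inside
    its own decoding interval ]l - 1 + N/2, l + N/2]."""
    N = int(round(np.pi / omega))             # number of active channels
    d = np.arange(N)
    A = 0.5 * np.column_stack([np.cos(2 * omega * d),
                               np.sin(2 * omega * d),
                               np.ones(N)])
    W = np.linalg.pinv(A)                     # (A^T A)^{-1} A^T
    K = len(u)
    decodings = []
    for l in range(1, K - N + 2):             # channels are indexed from 1
        p1, p2, p3 = W @ u[l - 1:l - 1 + N]
        x_hat = l + np.angle(p1 + 1j * p2) / (2 * omega)
        if l - 1 + N / 2 < x_hat <= l + N / 2:
            decodings.append((x_hat, abs(p1 + 1j * p2), p3))
    return decodings

u = cos2_encode(3.0, r=0.7) + cos2_encode(7.0, r=0.3)
print(cos2_decode(u))   # approximately [(3.0, 0.7, 0.7), (7.0, 0.3, 0.3)]
```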
3.3 Size of the represented domain
As mentioned in section 3.2, a channel representation is only able to represent
values in a bounded domain, which has to be known beforehand. We will now
derive an expression for the size of this domain. We start by introducing a notation
for the active domain (non-zero domain, or support) of a channel
S_k = {x : B_k(x) > 0} = ]l_k , u_k[        (3.17)
where lk and uk are the lower and upper bounds of the active domain. Since
the kernels should go smoothly to zero (see section 3.2.2), this is always an open
interval, as indicated by the brackets. For the cos2 kernel (3.2), and the constant
sum situation (3.6), the common support of N channels, S_k^N , becomes

S_k^N = S_k ∩ S_{k+1} ∩ . . . ∩ S_{k+N−1} = ]k − 1 + N/2, k + N/2[ .        (3.18)
This is proven in theorem A.1 in the appendix. See also figure 3.5 for an illustration.
Figure 3.5: Common support regions for ω = π/3. Left: supports S_k for individual
channels. Right: common supports S_k^3 .
If we perform the local decoding using groups of N channels with ω = π/N ,
N ∈ ℕ \ {1}, we will have decoding intervals of type (3.18). These intervals are all
of length 1, and thus they do not overlap (see figure 3.5, right). We now modify
the upper end of the intervals

S_k^N = ]k − 1 + N/2, k + N/2]        (3.19)
in order to be able to join them. This makes no practical difference, since all that
happens at the boundary is that one channel becomes inactive. For a channel
representation using K channels (with K ≥ N ) we get a represented interval of
type

R_K^N = S_1^N ∪ S_2^N ∪ . . . ∪ S_{K−N+1}^N = ]N/2, K + 1 − N/2] .     (3.20)
This expression is derived in theorem A.2 in the appendix.
For instance K = 8, and ω = π/3 (and thus N = 3), as in figure 3.2, left, will
give us

R_8^3 = ]3/2, 8 + 1 − 3/2] = ]1.5, 7.5] .
3.3.1
A linear mapping
Normally we will need to scale and translate our measurements to fit the represented domain for a given channel set. We will now describe how this linear
mapping is found.
If we have a variable ξ ∈ [r_l, r_u] that we wish to map to the domain R_K^N =
]R_L, R_U] using x = t_1 ξ + t_0, we get the system

( R_L )   ( 1  r_l ) ( t_0 )
( R_U ) = ( 1  r_u ) ( t_1 )     (3.21)

with the solution

t_1 = (R_U − R_L)/(r_u − r_l)   and   t_0 = R_L − t_1 r_l .     (3.22)
Inserting the boundaries of the represented domain R_K^N, see (3.20), gives us

t_1 = (K + 1 − N)/(r_u − r_l)   and   t_0 = N/2 − t_1 r_l .     (3.23)
This expression will be used in the experiment sections to scale data to a given
set of kernels.
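As a small illustration (not taken from the thesis), the mapping (3.23) can be computed as follows; the function name channel_mapping and the example range [0, 255] are hypothetical choices made here.

import numpy as np

def channel_mapping(r_l, r_u, K, N):
    # linear mapping x = t1*xi + t0 from [r_l, r_u] onto the represented
    # domain ]N/2, K + 1 - N/2] of K channels, eq. (3.23)
    t1 = (K + 1 - N) / (r_u - r_l)
    t0 = N / 2 - t1 * r_l
    return t1, t0

# map grey values in [0, 255] onto K = 8 channels of width omega = pi/3 (N = 3)
t1, t0 = channel_mapping(0.0, 255.0, K=8, N=3)
print(t1 * np.array([0.0, 128.0, 255.0]) + t0)   # end points land on 1.5 and 7.5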
3.4
Summary
In this chapter we have introduced the channel representation concept. Important properties of channel representations are that we can represent ambiguous
statements, such as “either the value 3 or the value 7”. We can also represent
the confidence we have in each hypothesis, i.e. statements like “the value 3 with
confidence 0.6 or the value 7 with confidence 0.4” are possible. We are also able
to represent the statement “no information”, using an all zero channel vector.
The signal–confidence pairs stored in a channel vector can be retrieved using a
local decoding. The local decoding problem superficially resembles the reconstruction of a continuous signal from a set of samples, but it is actually different, since
we are only interested in finding the peaks of a function. We also note that the
decoding has to be local in order to decode multiple values.
An important limitation in channel representation is that we can only represent
signals with bounded values. I.e. we must know a largest possible value, and a
smallest possible value of the signal to represent. For a bounded signal, we can
derive an optimal linear mapping that maps the signal to the interval a given
channel set can represent.
22
Channel Representation
Chapter 4
Mode Seeking and
Clustering
In this chapter we will relate averaging in the channel representation to estimation methods from robust statistics. We do this by re-introducing the channel
representation in a slightly different formulation.
4.1
Density estimation
Assume that we have a set of vectors xn , that are measurements from the same
source. Given this set of measurements, can we make any prediction regarding
a new measurement? If the process that generates the measurements does not
change over time it is said to be a stationary stochastic process, and for a stationary
process, an important property is the relative frequencies of the measurements.
Estimation of relative frequencies is exactly what is done in probability density
estimation.
4.1.1
Kernel density estimation
If the data xn ∈ Rd come from a discrete distribution, we could simply count the
number of occurrences of each value of xn , and use the relative frequencies of the
values as measures of probability. An example of this is a histogram computation.
However, if the data has a continuous distribution, we instead need to estimate a
probability density function (PDF) f : Rd → R+ ∪ {0}. Each value f (x) is nonnegative, and is called a probability density for the value x. This should not be
confused with the probability of obtaining a given value, which is normally zero for
a signal with a continuous distribution. The integral of f (x) over a domain tells us
the probability of x occurring in this domain. In all practical situations we have
a finite amount of samples, and we will thus somehow have to limit the degrees
of freedom of the PDF, in order to avoid over-fitting to the sample set. Usually a
smoothness constraint is applied, as in the kernel density estimation methods, see
e.g. [7, 43]. A kernel density estimator estimates the value of the PDF in point x
as

f̂(x) = 1/(N h^d) Σ_{n=1}^N K( (x − x_n)/h )     (4.1)

where K(x) is the kernel function, and h is a scaling parameter that is usually
called the kernel width, and d is the dimensionality of the vector space. If we
require that

H(x) ≥ 0   and   ∫ H(x)dx = 1   for   H(x) = (1/h^d) K(x/h)     (4.2)

we know that f̂(x) ≥ 0 and ∫ f̂(x)dx = 1, as is required of a PDF.
Using the scaled kernel H(x) above, we can rewrite (4.1) as

f̂(x) = (1/N) Σ_{n=1}^N H(x − x_n) .     (4.3)
In other words (4.1) is a sample average of H(x − xn ). As the number of samples
tends to infinity, we obtain
lim_{N→∞} 1/(N h^d) Σ_{n=1}^N K( (x − x_n)/h ) = E{H(x − x_n)} = ∫ f(x_n) H(x − x_n) dx_n
                                               = (f ∗ H)(x) .     (4.4)
This means that in an expectation sense, the kernel H(x) can be interpreted as
a low-pass filter acting on the PDF f (x). This is also pointed out in [43]. Thus
H(x) is the smoothness constraint, or regularisation, that makes the estimate
more stable. This is illustrated in figure 4.1. The figure shows three kernel density
estimates from the same sample set, using a Gaussian kernel
K(x) = (2π)^{−d/2} e^{−0.5 x^T x}     (4.5)
with three different kernel widths.
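The following is a minimal NumPy sketch (not part of the thesis) of the kernel density estimator (4.1) with the Gaussian kernel (4.5). The helper names kde and gaussian_kernel, the two-source sample set, and the evaluation grid are assumptions made here for illustration only.

import numpy as np

def gaussian_kernel(x):
    # Gaussian kernel (4.5); samples along the first axes, components last
    d = x.shape[-1]
    return np.exp(-0.5 * np.sum(x * x, axis=-1)) / (2 * np.pi) ** (d / 2)

def kde(x, samples, h):
    # kernel density estimate (4.1) evaluated at the points in x
    N, d = samples.shape
    diff = (x[:, None, :] - samples[None, :, :]) / h
    return gaussian_kernel(diff).sum(axis=1) / (N * h ** d)

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(1.0, 0.1, 200),
                          rng.normal(2.5, 0.2, 300)])[:, None]
grid = np.linspace(0, 4, 401)[:, None]
for h in (0.02, 0.05, 0.1):
    f_hat = kde(grid, samples, h)
    print(h, grid[np.argmax(f_hat), 0])          # location of the strongest mode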
4.2
Mode seeking
If the data come from a number of different sources, it would be a useful aid in
prediction of new measurements to have estimates of the means and covariances of
the individual sources, or modes of the distribution. See figure 4.1 for an example
of a distribution with four distinct modes (the peaks). Averaging of samples in
channel representation [47, 80, 34] (see also chapter 3), followed by a local decoding
is one way to estimate the modes of a distribution.
Figure 4.1: Kernel density estimates for three different kernel widths, h = 0.02, 0.05, and 0.1 (left to right).
4.2.1
Channel averaging
With the interpretation of the convolution in (4.4) as a low-pass filter, it is easy to
make the association to signal processing with sampled signals, and suggest regular
sampling as a representation of f̂(x). If the sample space R^d is low dimensional,
and samples only occur in a bounded domain¹ A (i.e. f(x) = 0 ∀x ∉ A), it
is feasible to represent f̂(x) by estimates of its values at regular positions. If
the sample set S = {x_n}_{n=1}^N is large this would also reduce memory requirements
compared to storing all samples.
Note that the analogy with signal processing and sampled signals should not
be taken too literally. We are not at present interested in the exact shape of the
PDF, we merely want to find the modes, and this does not require the kernel H(x)
to constitute a band-limitation, as would have been the case if reconstruction of
(the band-limited version of) the continuous signal fˆ(x), from its samples was our
goal.
For simplicity of notation, we only consider the case of a one dimensional PDF
f (x) in the rest of this section. Higher dimensional channel representations will
be introduced in section 5.6.
In the channel representation, a set of non-negative kernel functions {H^k(x)}_{k=1}^K
is applied to each of the samples x_n, and the result is optionally weighted with a
confidence r_n ≥ 0,

u_n = r_n ( H^1(x_n)  H^2(x_n)  . . .  H^K(x_n) )^T .     (4.6)
This operation defines the channel encoding of the signal–confidence pair (xn , rn ),
and the resultant vector un constitutes a channel representation of the signal–
confidence, provided that the channel encoding is injective for r 6= 0, i.e. there
exists a corresponding decoding that reconstructs the signal, and its confidence
from the channels.
We additionally require that the consecutive, integer displaced kernels H^k(x)
are shifted versions of an even function H(x), i.e.

H^k(x) = H(x − k) = H(k − x) .     (4.7)

¹Bounded in the sense that A ⊂ {x : (x − m)^T M(x − m) ≤ 1} for some choice of m and M.
We now consider an average of channel vectors

u = (1/N) Σ_{n=1}^N u_n   with elements   u^k = (1/N) Σ_{n=1}^N u_n^k .     (4.8)

If we neglect the confidence r_n, we have

u_n^k = H(x_n − k) = H(k − x_n) .     (4.9)
By inserting (4.9) into (4.8) we see that u^k = f̂(k) according to (4.3). In other
words, averaging of samples in the channel representation is equivalent to a regular
sampling of a kernel density estimator. Consequently, the expectation value of a
channel vector u is a sampling of the PDF f(x) filtered with the kernel H(x).
I.e. for each channel value we have

E{u_n^k} = E{H^k(x_n)} = ∫ H^k(x) f(x) dx = (f ∗ H)(k) .     (4.10)
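A small numerical check (not from the thesis) of this equivalence: averaging channel-encoded samples, eqs. (4.6)–(4.8), gives exactly the kernel density estimate (4.3) sampled at the channel positions. The unnormalised Gaussian H, the channel positions, and the sample set are assumptions made here.

import numpy as np

H = lambda x: np.exp(-0.5 * x ** 2)              # kernel H(x): unnormalised Gaussian, h = 1
centres = np.arange(0, 8)                        # channel positions k

rng = np.random.default_rng(1)
samples = np.concatenate([rng.normal(2.0, 0.3, 400),
                          rng.normal(5.5, 0.3, 600)])

# channel encode every sample, eq. (4.6) with r_n = 1, and average, eq. (4.8)
u = H(samples[:, None] - centres[None, :]).mean(axis=0)

# the same numbers as a kernel density estimate (4.3) sampled at the positions k
f_hat = np.array([H(k - samples).mean() for k in centres])
print(np.allclose(u, f_hat))                     # True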
We now generalise the interpretation of the local decoding in section 3.2.3.
The local decoding of a channel vector is a procedure that takes a subset of the
channel values (e.g. {u^k, . . . , u^{k+N−1}}), and computes the mode location x, the
confidence/probability density r, and if possible the standard deviation σ of the
mode

{x, r, σ} = dec(u^k, u^{k+1}, . . . , u^{k+N−1}) .     (4.11)
The actual expressions to compute the mode parameters depend on the used
kernel. A local decoding for the cos2 kernel was derived in section 3.2.3. This
decoding did not give an estimate of the standard deviation, but in chapter 5 we
will derive local decodings for Gaussian and B-spline kernels as well (in sections
5.1.1 and 5.2.2), and this motivates the general formulation above.
4.2.2
Expectation value of the local decoding
We have identified the local decoding as a mode estimation procedure. Naturally
we would like our mode estimation to be as accurate as possible, and we also want
it to be unbiased. This can be investigated using the expectation value of the local
decoding. Recall the expectation value of a channel (4.10). For the cos² kernel
this becomes

E{u_n^k} = ∫_{S_k} cos²(ω(x − k)) f(x) dx     (4.12)

where S_k is the support of kernel k (see section 3.3). We will now require that
the PDF is restricted to the common support S_l^N used in the local decoding. This
allows us to write the expectation value of one of the channel values used in the
decoding as

E{u_n^{l+d}} = ∫_{S_l^N} cos²(ω(x − l − d)) f(x) dx

             = (1/2) ( cos(2ωd)  sin(2ωd)  1 ) ( ∫_{S_l^N} cos(2ω(x − l)) f(x) dx )
                                               ( ∫_{S_l^N} sin(2ω(x − l)) f(x) dx )     (4.13)
                                               ( ∫_{S_l^N} f(x) dx               )

where the rightmost vector of integrals is denoted E{p}. The second equality is obtained
using the same method as in (3.8)-(3.11). We can now stack such equations for all
involved channel values, and solve for E{p}. This is exactly what we did in the
derivation of the local decoding. If we assume a Dirac PDF, i.e. f (x) = rδ(x − µ),
we obtain

E{p} = ( r cos(2ω(µ − l)),  r sin(2ω(µ − l)),  r )^T .     (4.14)
Plugging this into the final decoding step (3.14) gives us the mode estimate x̂ = µ.
In general however, (3.14) will not give us the exact mode location. In appendix
A, theorem A.6 we prove that, if a mode f is restricted to the support of the
decoding S_l^N, and is even about the mean µ (i.e. f(µ + x) = f(µ − x)), (3.14) is
an unbiased estimate of the mean

E{x̂} = l + (1/(2ω)) arg[ E{p_1} + i E{p_2} ] = µ = E{x_n} .     (4.15)
When f has an odd component, the local decoding tends to overshoot the mean
slightly, seemingly always in the direction of the mode of the density (this is an
empirical observation; no proof is given). In general
however, these conditions are not fulfilled. It is for instance impossible to have a
shift invariant estimate for non-Dirac densities, when the decoding intervals SlN
are non-overlapping. For an experimental evaluation of the behaviour under more
general conditions, see [36].
4.2.3
Mean-shift filtering
An alternative way to find the modes of a distribution is through gradient ascent
on (4.1), as is done in mean-shift filtering [43, 16]. Mean-shift filtering is a way to
cluster a sample set, by moving each sample toward the closest mode, and this is
done through gradient ascent on the kernel density estimate.
Assuming that the kernel K(x) is differentiable, the gradient of f(x) can be
estimated as the gradient of (4.1), i.e.

∇f̂(x) = 1/(N h^{d+1}) Σ_{n=1}^N ∇K( (x − x_n)/h ) .     (4.16)

This expression becomes particularly simple if we use the Epanechnikov kernel [43]
K(x) = { c(1 − x^T x)   if x^T x ≤ 1
       { 0              otherwise.     (4.17)

Here c is a normalising constant that ensures ∫ K(x)dx = 1. For the kernel (4.17)
we define³ the gradient as

∇K(x) = { −2cx   if x^T x ≤ 1
        { 0      otherwise.     (4.18)

Inserting this into (4.16) gives us

∇f̂(x) = 2c/(N h^{d+2}) Σ_{x_n ∈ S_h(x)} (x_n − x) ,     (4.19)

where S_h(x) = {x_n ∈ R^d : ||x_n − x|| ≤ h} is the support of the kernel (4.17).

³The gradient is actually undefined when x^T x = 1.
Instead of doing gradient ascent using (4.19) directly, Fukunaga and Hostetler
[43] use an approximation of the normalised gradient
∇̂ ln f(x) ≈ ∇f̂(x)/f̂(x) = ((d + 2)/h²) m(x)     (4.20)

where m(x) is the mean shift vector

m(x) = 1/k(x) Σ_{x_n ∈ S_h(x)} (x_n − x) .     (4.21)

Here k(x) is the number of samples inside the support S_h(x).
Using the normalised gradient, [43] proceeds to define the following clustering
algorithm
For all n:   x̄^0 = x_n ,   x̄^{i+1} = x̄^i + a ∇̂ ln f(x̄^i)     (4.22)

where a should be chosen to guarantee convergence. In [43] the choice
a = h²/(d + 2) is made. This results in the iteration rule

x̄^{i+1} = 1/k(x) Σ_{x̄_n^i ∈ S_h(x̄^i)} x̄_n^i .     (4.23)

That is, in each iteration we replace x̄^i by a mean inside a window centred
around it. Cheng [16] calls this mean of already shifted points a blurring process,
and contrasts this with the non-blurring process

x̄^{i+1} = 1/k(x) Σ_{x_n ∈ S_h(x̄^i)} x_n .     (4.24)
The behaviour of the two approaches for clustering are similar, but only the
non-blurring mean-shift is mode seeking, and is the one that is typically used.
Cheng also generalises the concept of mean shift, by considering (4.24) to be a
mean shift using the uniform kernel,
K(x) = { 1   if x^T x ≤ 1
       { 0   otherwise,     (4.25)

and suggests other weightings inside the window S_h(x). The generalised mean-shift
iteration now becomes

x̄^{i+1} = Σ_n K(x_n − x̄^i) x_n / Σ_n K(x_n − x̄^i) .     (4.26)
Since mean shift according to (4.26) using the kernel (4.25) finds the modes of
(4.1) using the Epanechnikov kernel (4.17), the Epanechnikov kernel is said to be
the shadow of the uniform kernel. Similarly mean shift using other kernels finds
peaks of (4.1) computed from their shadow kernels.
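The following is a minimal sketch (not part of the thesis) of non-blurring mean shift, eqs. (4.24)/(4.26), with a uniform kernel. The function name mean_shift, the fixed iteration count in place of a convergence test, and the synthetic three-cluster data are assumptions made here for illustration.

import numpy as np

def mean_shift(samples, h, n_iter=50):
    # non-blurring mean shift, eqs. (4.24)/(4.26), with a uniform kernel of
    # radius h; a fixed iteration count replaces a convergence test
    modes = samples.copy()
    for _ in range(n_iter):
        for i in range(len(modes)):
            in_window = np.linalg.norm(samples - modes[i], axis=1) <= h
            if in_window.any():
                modes[i] = samples[in_window].mean(axis=0)
    return modes

rng = np.random.default_rng(2)
samples = np.vstack([rng.normal(c, 0.1, size=(100, 2))
                     for c in ((0, 0), (1, 0), (0, 1))])
modes = mean_shift(samples, h=0.3)
print(np.unique(np.round(modes, 2), axis=0))     # roughly the three cluster centres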
Figure 4.2 shows an illustration of mean shift clustering. A mean shift iteration
has been started in each of the data points, and the trajectories x̄0 , x̄1 , . . . x̄∗ have
been plotted. As can be seen in the centre plot, all trajectories end in either of
five points, indicating successful clustering of the data set.
Figure 4.2: Illustration of mean shift filtering. Left: data set. Centre: trajectories
of the mean shift iterations using a uniform kernel with h = 0.3. Right: Kernel
density estimate using Epanechnikov kernel with h = 0.3.
Gradient ascent with fixed step length often suffers from slow convergence, but
(4.26) has the advantage that it moves quickly near points of low density (this
can be inferred by observing that the denominator of (4.26) is a kernel density
estimator without the normalisation factor).
Mean shift filtering was recently re-introduced in the signal processing community by the Cheng paper [16], and has been developed by Comaniciu and Meer
in [19, 20] into algorithms for edge-preserving filtering, and segmentation.
4.2.4
M-estimators
M-estimation4 (first described by Huber in [62] according to Hampel et al. [56]),
or the approach based on influence functions [56] is a technique that attempts to
remove the sensitivity to outliers in parameter estimation. Assume {x_n}_{n=1}^N are
samples in some parameter space, and we want to estimate the parameter choice
x̄ that best fits the data. This estimation problem is defined by the following
objective function

x̄ = arg min_x J(x) = arg min_x Σ_{n=1}^N ρ(r_n/h)   where   r_n = ||x − x_n|| .     (4.27)

Here ρ(r) is an error norm⁵, r_n are residuals, and h is a scale parameter. The
error norm ρ(r) should be nondecreasing with increasing r [95].
Both the least-squares problem, and median computation are special cases of
M-estimation with the error norms ρ(r) = r2 , and ρ(r) = |r| respectively. See
figure 4.3 for some common examples of error norms.
Figure 4.3: Some common error norms: least squares, least absolute, biweight, and cutoff-squares.
Note that, in general, (4.27) is a non-convex problem which may have several
local minima.
Solutions to (4.27) are also solutions to

Σ_{n=1}^N ϕ(r_n/h) = 0   where   ϕ(r) = ∂ρ(r)/∂r .     (4.28)
The function ϕ(r) is called the influence function. The solution to (4.27) or (4.28)
is typically found using iterated reweighted least-squares 6 (IRLS), see e.g. [109, 95].
As the name suggests IRLS is an iterative algorithm. In general it requires an
initial guess close to the optimum of (4.27).
To derive IRLS we start by assuming that the expression to be minimised in
(4.27) can be written as a weighted least-squares expression
⁴The name comes from “generalized maximum likelihood” according to [56].
⁵The error norm is sometimes called a loss function.
⁶Iterated reweighted least squares is also known as a W-estimator [56].
J(x) = Σ_{n=1}^N ρ(r_n/h) = Σ_{n=1}^N r_n² w(r_n/h) .     (4.29)
We now compute the gradient with respect to {r_n}_{n=1}^N of both sides

(1/h) ( ϕ(r_1/h)  . . .  ϕ(r_N/h) )^T = (2/h) ( r_1 w(r_1/h)  . . .  r_N w(r_N/h) )^T .     (4.30)

This system of equations is fulfilled for the weight function w(r) = ϕ(r)/(2r). This
can be simplified to w(r) = ϕ(r)/r, while still giving a solution to (4.27). This
gives us the weights for one step in the IRLS process:

x̄^i = arg min_x Σ_{n=1}^N (r_n^i)² w(r_n^{i−1}/h)   where   r_n^i = ||x̄^i − x_n|| .     (4.31)
Each iteration of (4.31) is a convex problem, which can be solved by standard
techniques for least-squares problems, such as Gaussian elimination or SVD. By
computing the derivative with respect to x̄^i (treating w as constant weights), we
get

Σ_{n=1}^N 2(x̄^i − x_n) w(r_n^{i−1}/h) = 0     (4.32)

with the solution

x̄^i = Σ x_n w(r_n^{i−1}/h) / Σ w(r_n^{i−1}/h) = Σ x_n w(||x̄^{i−1} − x_n||/h) / Σ w(||x̄^{i−1} − x_n||/h) .     (4.33)
By comparing this with (4.26), we see that the iterations in IRLS are equivalent
to those in mean-shift filtering if we set the kernel in mean-shift filtering equal to
the scaled weight function in IRLS i.e.
K(x̄ − x_n) = w(||x̄ − x_n||/h) .     (4.34)
From this we can also infer that the error norm corresponds to the kernel of the
corresponding kernel density estimator, up to a sign (mean-shift is a maximisation,
and M-estimation is a minimisation) and an offset.
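The sketch below (not from the thesis) runs the IRLS iteration (4.33) with a Gaussian weight function as an example of a redescending M-estimator; the function name irls_mode, the particular weight function, and the synthetic inlier/outlier data are assumptions made here. As noted above, IRLS in general needs a starting point reasonably close to the desired optimum.

import numpy as np

def irls_mode(samples, x0, h, n_iter=30):
    # IRLS iteration (4.33); the weight function w(r) = exp(-0.5 r^2)
    # corresponds to a redescending M-estimator
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        r = np.linalg.norm(samples - x, axis=1) / h
        w = np.exp(-0.5 * r ** 2)
        x = (w[:, None] * samples).sum(axis=0) / w.sum()
    return x

rng = np.random.default_rng(3)
inliers = rng.normal((1.0, 1.0), 0.05, size=(80, 2))
outliers = rng.uniform(-3, 3, size=(40, 2))
samples = np.vstack([inliers, outliers])
print(irls_mode(samples, x0=samples.mean(axis=0), h=0.3))   # close to (1, 1)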
4.2.5
Relation to clustering
Clustering is the problem of partitioning a non-labelled data set into a number
of clusters, or classes, in such a way that similar samples are grouped into the
same cluster. Clustering is also known as data partitioning, segmentation and
vector quantisation. It is also one of the approaches to unsupervised learning. See
e.g. [63] for an overview of different clustering approaches.
The mode seeking problem is related to clustering in the sense that each mode
can be seen as a natural cluster prototype. For mean shift, this connection is
especially direct, since each sample can be assigned a class label depending on
which mode the mean-shift iteration ended in [19, 20]. For channel averaging we
can use the distances to each of the decoded modes to decide which cluster a
sample should belong to.
4.3
Summary and comparison
In this chapter we have explained three methods that find modes in a distribution.
All three methods start from a set of samples, and have been shown to estimate
the modes of the samples under a regularisation with a smoothing kernel (for M-estimation we only get one mode). We have shown that non-blurring mean-shift is
equivalent to a set of M-estimations using IRLS started in each sample. Channel
averaging, on the other hand is a different method, that approaches the same
problem from a different angle.
Which method is preferable depends on the problem to be solved. Both channel averaging and mean-shift filtering are intended for estimation of all modes of a
distribution, or for clustering. If we have a large number of samples inside the kernel window, mean-shift filtering is at a disadvantage, since each iteration involves
an evaluation of e.g. (4.16) for all samples, which would be cumbersome indeed.
In contrast, for mode seeking using channel averaging, only the averaging step is
affected by the number of samples. Another advantage with channel averaging
is that it has a constant data-independent complexity. For mean-shift we can
only have an expected computational complexity since it is an iterative method,
where the convergence speed depends on the data. For high dimensional sample
spaces, or if the domain of the sample space is large, mean-shift is at an advantage,
since the number of required channels grows exponentially with the vector space
dimension. When the used kernel is small, mean-shift also becomes favourable.
Chapter 5
Kernels for Channel
Representation
In this chapter we will introduce two new kernel functions and derive encodings
and local decodings for them. The kernels are then compared with the cos2 kernel
(introduced in chapter 3), with respect to a number of properties. We also introduce the notion of stochastic kernels, and study the interference of multiple peaks
when decoding a channel vector. Finally we extend the channel representation to
higher dimensions.
5.1
The Gaussian kernel
Inspired by the similarities between channel averaging and kernel density estimation (see sections 4.1.1 and 4.2.1) we now introduce the Gaussian kernel
B^k(x) = e^{−(x − k)²/(2σ²)} .     (5.1)

Here the k parameter is the channel centre, and σ is the channel width. The
σ parameter can be related to the ω parameter of the cos² kernel by requiring
that the areas under the kernels should be the same. Since A_gauss = √(2π) σ, and
A_cos² = π/(2ω), we get

σ = √(π/8)/ω   and   ω = √(π/8)/σ .     (5.2)
Just like before, channel encoding is done according to (3.1). Figure 5.1 shows an
example of a Gaussian channel set. Compared to the cos2 kernel, the Gaussian
kernel has the disadvantage of not having a compact support. This means that
we will always have small non-zero values in each channel (unless we threshold the
channel vector, or weight the channel values with a relevance r = 0). Additionally,
the Gaussian kernels do not have either constant sum or norm as the cos2 kernels
do, see (3.6).
Figure 5.1: Example of a Gaussian channel set. The width σ = 0.6 corresponds
roughly to ω = π/3.
5.1.1
A local decoding for the Gaussian kernel
We will now devise a local decoding for this channel representation as well. If we
look at three neighbouring channel values around k = l, we obtain three equations

( u^{l−1} )     ( B^{l−1}(x) )
( u^l     ) = r ( B^l(x)     ) .     (5.3)
( u^{l+1} )     ( B^{l+1}(x) )

The logarithm of an arbitrary row can be written as

ln u^{l+d} = ln r + ln B^{l+d}(x) = ln r − (x − l − d)²/(2σ²)     (5.4)

           = ( 1  d  d² ) p ,   where   p = ( ln r − (x − l)²/(2σ²) ,  (x − l)/σ² ,  −1/(2σ²) )^T .     (5.5)

If we apply this to each row of (5.3), we obtain an equation system of the form

ln u = Dp     (5.6)

with the solution

p = D^{−1} ln u ,     D^{−1} = (   0    1    0  )
                               ( −1/2   0   1/2 )
                               (  1/2  −1   1/2 ) .

From the solution p we can find the estimates x̂, σ̂, and r̂ as

x̂ = l − p_2/(2p_3) ,     σ̂ = √(−1/(2p_3)) ,     and     r̂ = e^{p_1 − p_2²/(4p_3)} .     (5.7)
This gives us one decoding per group of 3 channels. Just like in the cos2 decoding (see section 3.2.3) we remove those decodings that lie more than a distance
0.5 from the decoding interval centre, since the corresponding peak is probably
better described by another group of channels.
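For concreteness, here is a minimal NumPy sketch (not part of the thesis) of Gaussian channel encoding and the three-channel local decoding (5.6)–(5.7). The names gauss_encode and gauss_decode, and the choice of ten integer channel centres with σ = 0.6, are assumptions; the channel centres coincide with the vector indices here, so no index bookkeeping is needed.

import numpy as np

sigma = 0.6
centres = np.arange(0, 10)            # channel positions 0..9 (centre == index)

def gauss_encode(x, r=1.0):
    return r * np.exp(-(x - centres) ** 2 / (2 * sigma ** 2))    # eq. (5.1)

def gauss_decode(u, l):
    # local decoding (5.6)-(5.7) from the three channels centred at l-1, l, l+1
    Dinv = np.array([[0.0,  1.0, 0.0],
                     [-0.5, 0.0, 0.5],
                     [0.5, -1.0, 0.5]])
    p = Dinv @ np.log(u[l - 1:l + 2])
    xhat = l - p[1] / (2 * p[2])
    sigmahat = np.sqrt(-1.0 / (2 * p[2]))
    rhat = np.exp(p[0] - p[1] ** 2 / (4 * p[2]))
    return xhat, sigmahat, rhat

u = gauss_encode(4.3, r=0.7)
print(gauss_decode(u, l=4))           # recovers (4.3, 0.6, 0.7) up to rounding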
The Gaussian kernel has potential for use in channel averaging, see section
4.2.1, since it provides a direct estimate of the standard deviation for an additive
Gaussian noise component. Using the addition theorem for Gaussian variances, we
can estimate the noise as σ̂_noise = √(σ̂² − σ²). Just like in the cos² case, the r̂-value
of the decoding is the probability density at the mode. The new parameter σ̂_noise
tells us the width of the mode, and can be seen as a measure of how accurate the
localisation of the mode is. For a high confidence in a peak location, σ̂_noise should
be small, and r̂N (where N is the number of samples) should be large.
5.2
The B-spline kernel
B-splines are a family of functions normally used to define a basis for interpolating
splines, see e.g. [21, 100]. The central¹ B-spline of degree zero is defined as a
rectangular pulse

B^0(x) = { 1   −0.5 ≤ x < 0.5
         { 0   otherwise.     (5.8)

B-splines of higher degrees are defined recursively, and can be obtained through
convolutions

B^n(x) = (B^{n−1} ∗ B^0)(x) = (B^0 ∗ B^0 ∗ . . . ∗ B^0)(x) ,   with n + 1 factors B^0.     (5.9)
As the degree is increased, the basis functions tend toward a Gaussian shape (see
figure 5.2). In fact, according to the central limit theorem, a Gaussian is obtained
as n approaches infinity.
Figure 5.2: Central B-splines of degrees 0, 1, . . . , 5.
If we require explicit expressions for the piecewise polynomials which the B-spline
consists of, the following recurrence relation [21] is useful

B^n(x) = ((x + (n + 1)/2)/n) B^{n−1}_{−1/2}(x) + (((n + 1)/2 − x)/n) B^{n−1}_{1/2}(x) .     (5.10)

Here, shifted versions of a B-spline are denoted B^n_k(x) = B^n(x − k). Using (5.10),
we obtain the following expression for B-splines of degree 1

¹B-splines are often defined with B^0(x) having the support x ∈ [0, 1] instead.
B^1(x) = (x + 1) B^0_{−1/2}(x) + (1 − x) B^0_{1/2}(x)     (5.11)

       = { x + 1   −1 ≤ x < 0
         { 1 − x    0 ≤ x < 1
         { 0        otherwise.     (5.12)
For degree 2 we get
x + 3/2 1
3/2 − x 1
B−1/2 (x) +
B1/2 (x)
2
2
µ
¶
3
(x − 3/2)2 0
(x + 3/2)2 0
2
B−1 (x) +
− x B0 (x) +
B1 (x)
=
2
4
2

(x + 3/2)2 /2 −1.5 ≤ x < −0.5



3/4 − x2
−0.5 ≤ x < 0.5
=
2

(x − 3/2) /2 0.5 ≤ x < 1.5



0
otherwise.
B2 (x) =
(5.13)
(5.14)
(5.15)
By applying the binomial theorem on the Fourier transform of (5.9), and going
back, it is also possible to derive the following expression [100]

B^n(x) = (1/n!) Σ_{k=0}^{n+1} (n+1 choose k) (−1)^k max(0, x − k + (n + 1)/2)^n .     (5.16)
5.2.1
Properties of B-splines
The B-spline family has a number of useful properties:

1. Positivity: B^n(x) > 0 for x ∈ ]−(n + 1)/2, (n + 1)/2[ .

2. Compact support: B^n(x) = 0 for x ∉ [−(n + 1)/2, (n + 1)/2] .

3. Constant sum:

   Σ_k B^n_k(x) = 1   regardless of x.     (5.17)

   For a proof, see theorem B.1 in the appendix.

4. For B-splines of degree n ≥ 1, the original scalar value may be retrieved by
   the first moment

   x = Σ_k k B^n_k(x) .     (5.18)

   For a proof, see theorem B.2 in the appendix.
These properties make B-splines useful candidates for kernels in the channel
representation.
5.2.2
B-spline channel encoding and local decoding
Using B-spline kernels of degree ≥ 1 we may define a B-spline channel representation,
where a signal–confidence pair (x, r) can be encoded according to (3.1), i.e. we have

u^k = r B^k(x) .     (5.19)

Due to the constant sum property of the B-spline kernel (5.17), the confidence
may be extracted from a channel set as

r = Σ_k u^k .     (5.20)

We can further retrieve the signal value times the confidence using the first moment
(see equation 5.18)

xr = Σ_k k u^k .

Thus it is convenient to first compute the confidence, and then to extract the
signal value as

x̂ = Σ_k k u^k / Σ_k u^k = (1/r) Σ_k k u^k .     (5.21)
Figure 5.3 shows an example of a B-spline channel set. Just like the cos2 kernel,
the B-splines have compact support, and constant sum. Their norm however is not
constant. A value x encoded using a B-spline channel representation of degree n,
will have at most n + 1 active channel values. This makes a local decoding using a
group of n+1 consecutive channel values reasonable. It also means that a B-spline
channel representation of degree n is comparable to a cos2 channel representation
with width
ω = π/(n + 1) .
(5.22)
Just like in the cos2 and Gaussian decodings, (see sections 3.2.3 and 5.1.1)
we remove those decodings that lie more than a distance 0.5 from the decoding
window centre.
Note that the local decoding described here assumes a Dirac PDF, see section
4.2.2. A more elaborate local decoding, which deals with non-Dirac PDFs has
been developed by Felsberg, and a joint publication is under way [30].
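A corresponding sketch (not part of the thesis) for B-spline channels of degree 2, using (5.15) for encoding and (5.20)–(5.21) for decoding; the function names and the channel layout are assumptions chosen here to mirror the Gaussian example in section 5.1.1.

import numpy as np

def b2(x):
    # central B-spline of degree 2, eq. (5.15)
    x = np.asarray(x, dtype=float)
    y = np.where(np.abs(x) < 0.5, 0.75 - x ** 2, 0.0)
    return np.where((np.abs(x) >= 0.5) & (np.abs(x) < 1.5),
                    (np.abs(x) - 1.5) ** 2 / 2, y)

centres = np.arange(0, 10)            # channel positions 0..9 (centre == index)

def bspline_encode(x, r=1.0):
    return r * b2(x - centres)                          # eq. (5.19)

def bspline_decode(u, l):
    # decoding (5.20)-(5.21) from the n + 1 = 3 channels centred at l-1, l, l+1
    ul = u[l - 1:l + 2]
    r = ul.sum()                                        # eq. (5.20)
    xhat = (centres[l - 1:l + 2] * ul).sum() / r        # eq. (5.21)
    return xhat, r

u = bspline_encode(4.3, r=0.7)
print(bspline_decode(u, l=4))         # recovers (4.3, 0.7) up to rounding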
Figure 5.3: Example of a B-spline channel set. The degree n = 2 is comparable to
ω = π/3 for the cos2 kernels.
5.3
Comparison of kernel properties
In this section we will compare the three kernel families, cos2 , Gaussians and
B-splines. We will compare them according to a number of different criteria.
5.3.1
The constant sum property
The sum² of a channel vector u = r( B^1(x)  B^2(x)  . . .  B^K(x) )^T is given by

||u(x)||_1 = Σ_{k=1}^K |r B^k(x)| = r Σ_{k=1}^K B^k(x) .     (5.23)
As proven in theorems A.3 and B.1, (5.23) is constant for cos² and B-spline
channels as long as x is inside the represented domain (i.e. x ∈ R_K^N). Note that
since the Gaussian kernels do not have compact support, (5.23) will actually depend on the value of K. Figure 5.4 (left) shows numerical evaluations of (5.23)
for σ = 0.4, 0.6, 0.8, and 1.0. A channel set of K = 11 channels has been used,
with channel positions −3, −2, . . . 7.
As can be seen in the figure the sum peaks near channel positions and has
distinct minima right in-between two channel positions. To measure the amount
of deviation we use a normalised peak-to-peak distance
ε_pp = ( max(||u||_1) − min(||u||_1) ) / mean(||u||_1) .     (5.24)
This measure is plotted in figure 5.4 (right). As can be seen, the sum is practically
constant for σ > 0.6. For a finite number of channels, the deviation from the result
in figure 5.4 is very small3 .
²Since the channel representation is non-negative, the sum is actually the l₁ norm.
³When K = 500, the deviation is less than 2 × 10⁻⁴ (worst case is σ = 1.0).
Figure 5.4: Sum as function of position. Left: Sums for σ = 0.4, 0.6, 0.8, 1.0
(bottom to top). Right: Normalised peak-to-peak distance εpp (σ).
5.3.2
The constant norm property
The l₂ norm ||u|| of a channel vector u(x) = r( B^1(x)  B^2(x)  . . .  B^K(x) )^T is
given by

||u(x)||² = r² Σ_{k=1}^K B^k(x)² .     (5.25)
As proven in theorem A.4, (5.25) is constant for cos² channels as long as x is
inside the represented domain (i.e. x ∈ R_K^N). Since the Gaussian kernels do not
have compact support, (5.25) will depend on the value of K, so they are a bit
problematic. Figure 5.5 shows a numerical comparison of Gaussian and B-spline
kernels. In order to compare them, kernels with corresponding⁴ widths have been
used (i.e. according to σ = (n + 1)/√(8π)). The experiment used K = 11 channels
positioned at −3, −2, . . . , 7. For a finite number of channels, the deviation from
this experiment is very small⁵ for the Gaussians.
As can be seen in the figure the norm peaks near channel positions and has
distinct minima right in-between two channel positions. To compare the amount
of deviation we use a normalised peak-to-peak distance
ε_pp = ( max(||u||) − min(||u||) ) / mean(||u||)     (5.26)
on the interval [0, 4]. Figure 5.5 (right) shows plots of this measure for the Gaussian
and B-spline kernels. As can be seen, both kernels tend toward a constant norm as
the channel width is increased. The Gaussians however have a significantly faster
convergence. This is most likely due to their non compact support. For all widths
except (n = 1, σ = 0.4) the deviation is smaller for the Gaussians.
⁴For a motivation to the actual correspondence criteria see the discussion around (5.2) and (5.22).
⁵When K = 500, the deviation is less than 5 × 10⁻¹⁰ (worst case is σ = 1.0).
Figure 5.5: Norm as function of position. Left: Using B-spline kernels n = 1,2,3,4
(top to bottom curves). Centre: Using Gaussian kernels with σ = 0.40, 0.60, 0.80,
1.00 (bottom to top curves). Right: Solid εpp (σ) for Gaussian kernels. Crosses
indicate the B-spline results for n = 1, 2, 3, 4.
5.3.3
The scalar product
We will now have a look at the scalar product of two channel vectors. Note that
the constant norm property does not imply a position invariant scalar product,
merely that the highest value of the scalar product stays constant. We will thus
compare all three kernels. Figure 5.6 shows a graphical comparison of the scalar
product functions for cos2 , B-spline, and Gaussian kernels.
In this experiment we have encoded the value 0, and computed scalar products
between this vector and encodings of varying positions, i.e.

s(x) = u(0)^T u(x) = Σ_{k=1}^K B^k(0) B^k(x) .     (5.27)
Each plot shows 10 superimposed curves s(x), where the channel positions have
been displaced in steps of 0.1, i.e.

s(x) = Σ_{k=1}^K B^k(0 − d) B^k(x − d)     (5.28)
for d ∈ {0, 0.1, 0.2, . . . 0.9}. As can be seen, the scalar product is position variant
for all kernels, but the amount of variance with position decreases as the channel
width increases. For the cos2 kernel, the position variance goes to zero as we
approach the peak of scalar product function (5.27). As we increase the channel
widths the position variance drops for the other kernels, and especially for the
Gaussians with σ = 1.0, s(x) looks very stable.
The problem with the position variance of the scalar product is two-fold. As
figure 5.7 illustrates, the peak of the scalar product is even moved as the alignment
changes.
Such a behaviour would make the scalar product unsuitable e.g. as a measure of
similarity. To avoid the displacement of the peak, we could consider a normalised
scalar product
Figure 5.6: Superimposed scalar product functions for different alignments of the
channel positions. Top to bottom: cos2 , Gaussian, and B-spline kernels. Left to
right: ω = π/3, π/4, π/5, σ = 0.6, 0.8, 1.0, n = 2, 3, 4.
s(x) = u(0)^T u(x) / ( ||u(0)|| ||u(x)|| ) .     (5.29)
Figure 5.8 shows the same experiment as in figure 5.6, but using the normalised
scalar product (5.29). As can be seen, the normalisation ensures that the peak is in
the right place for all kernels, but it does not make all of the position dependency
problems go away.
Figure 5.7: Different peak positions for different channel alignments. Left: Two
Gaussian scalar product functions (σ = 0.6). Right: Two B-spline scalar product
functions (n = 2).
Figure 5.8: Superimposed normalised scalar product functions for different alignments of the channel positions. Top to bottom: cos2 , Gaussian, and B-spline
kernels. Left to right: ω = π/3, π/4, π/5, σ = 0.6, 0.8, 1.0, n = 2, 3, 4.
5.4
Metameric distance
As mentioned in section 3.2.1, the channel representation is able to represent multiple values. In this section we will study interference between multiple values
represented in a channel vector. A representation where the representation of two
values is sometimes confused with their average is called a metameric representation, see section 3.2.1. To study this behaviour for the channel representation, we
now channel encode two signals, sum their channel representations, and decode.
The two signals we will use are
f_1(x) = x   and   f_2(x) = 6 − x   for x ∈ [0, 6] .     (5.30)
These signals are shown in figure 5.9, left. As can be seen they are different
near the edges of the plot, and more similar in value near the centre. If we
channel encode these signals and decode their sum, we will obtain two distinct
decodings near the edges of the plot, and near the centre of the plot we will obtain
their average. Initially we will use a channel representation with integer channel
positions, and a channel width of ω = π/3. The centre plot of figure 5.9 shows the
result of channel encoding, summing and decoding the two signals.
Figure 5.9: Illustration of metamerism. Left: Input signals. Centre: Decoded outputs. Using cos2 channels at integer positions, and ω = π/3. Right: Result using
channels at half integer positions instead. Dotted curves are the input signals.
This nice result is however merely a special case. If we move the channel
positions such that they are not positioned at integer positions we will get different
results. Figure 5.9, right shows the result when the channels are positioned at
half-integers instead. As can be seen here, we now have an interval of interference,
where the decoded values have moved closer to each other, before the interval
where the two values are averaged. The two cases in figure 5.9 are the extreme
cases in terms of allowable value distance. When the channel positions are aligned
with the intersection point of the two signals, as in the centre plot, the smallest
allowed distance, or metameric distance is the largest, dmm = 2.0. When the
signal intersection point falls right in-between two channel positions, the metameric
distance is the smallest dmm = 1.5.
The metameric distance also depends on the used channel width. The top row
of figure 5.10 shows the metameric distances for three different channel widths of
the cos2 kernel. Each plot shows 10 superimposed curves, generated using different
alignments of the signal intersection point and the channel positions. From these
plots we can see that the metameric distance increases with the channel width.
Another effect of increasing the channel width is that the position dependency of
the interference is reduced.
The experiment is repeated for the B-spline, and the Gaussian kernels in figure
5.10, middle and bottom rows. We have chosen kernel widths corresponding to the
ones for the cos2 kernel according to (5.2), and (5.22). As can be seen, the cos2 ,
and the B-spline kernel behave similarly, while the Gaussian kernel has a slightly
slower increase in metameric distance with increasing kernel width. The reason
for this is that we have used a constant decoding window size for the Gaussian
kernel, whereas the window size increases with the channel width for both cos2
and B-spline kernels.
Figure 5.10: Metamerism for different channel widths. Top row: cos2 channels,
width ω = π/3, π/4, π/5. Middle row: B-spline channels, order 2, 3, 4. Bottom
row: Gaussian channels, σ = 0.6, 0.8, 1.0. All plots show 10 superimposed curves
with different alignments of the channel centres.
5.5
Stochastic kernels
In section 4.2.1 we identified averages of channel values as samples from a PDF
convolved with the used kernel function. Now, we will instead assume that we
have measurements xn from some source S, and that the measurements have been
contaminated with additive noise from a source D, i.e.
x_n = p_n + η_n   with   p_n ∈ S   and   η_n ∈ D .     (5.31)

For this situation a channel representation using a kernel H(x) will have channels
with expectation values

E{u^k} = (f ∗ η ∗ H)(x − k) .     (5.32)

Here f(x) and η(x) are the density functions of the source and the noise respectively.
If the number of samples in an average is reasonably large, it makes sense
to view (η ∗ H)(x − k) as a stochastic kernel.⁶ In order to make the stochastic
kernel as compact as possible, we will now consider the rectangular kernel

H^k(x) = { 1   when −0.5 < x − k ≤ 0.5
         { 0   otherwise.     (5.33)
Figure 5.11 shows estimates of three stochastic kernels, together with the PDF
of the added noise. The noise is of triangular7 , or TPDF type, and is what is
typically used to de-correlate the errors in audio quantisation with dithering [60].
Figure 5.11: Stochastic bins. Left: Estimated PDFs of H k (x) = 1 for k ∈ [1, 2, 3].
Centre: PDFs with addition of noise before the kernel function. Right: Estimated
density of noise.
In general, the kernel (5.33) is not a good choice for density estimation. If
f (x) is discontinuous, or changes rapidly, it will cause aliasing-like effects on the
estimated PDF. Such aliasing effects can be reduced by dithering see e.g. [54].
Dithering is the process of adding a small amount of noise (with certain characteristics) to a signal prior to a quantisation. Dithering is commonly used in image
⁶For earlier accounts of this idea, see [34, 36].
⁷A triangular noise sample is generated by averaging two samples from a rectangular distribution.
reproduction with a small number of available intensities or colours, as well as in
perceptual quality improvement of digital audio [60].
5.5.1
Varied noise level
We will now have a look at how the precision of a local decoding is affected by
addition of noise before applying the kernel functions. We will use three channels
k = 1, 2, 3, with kernels H k (x) as defined in (5.33), and channel encode measurements xn corresponding to source values p ∈ [1.5, 2.5]. To obtain the mode
location with better accuracy than the channel distance, we use the local decoding
derived for the Gaussian kernel (see section 5.1.1).
We will try averages of N = 10, 30, and 1 000 samples, and compute the
absolute error |x̂−p|, where x̂ is the local decoding. To remove any bias introduced
by the source value p (and also to make the curves less noisy) the errors are
averaged over all tested source values, giving a mean absolute error (MAE). We
will try source values p ∈ [1.5, 2.5] in steps of 0.01.
The standard deviation of the noise η, see (5.31), is varied in the range [0, 1] in
steps of 0.01. Two noise distributions are tested, rectangular noise and triangular
noise. The plots in figure 5.12 show the results.
Figure 5.12: MAE of local decoding as function of noise level.
Top row: Results using rectangular noise. Bottom row: Results using triangular
noise. Left to right: Number of samples N = 10, N = 30, N = 1 000. Solid curves
show MAE for rectangular kernels, dashed curves show MAE for Gaussian kernels
with σ = 0.8. Number of source values for which the curves are averaged is 101.
As can be seen in the plot, the optimal noise level is actually above zero for
the rectangular kernel. This is due to the expectation of the channel values being
the convolution of the kernel and the noise PDF, see (5.32). The added noise thus
results in a smoother PDF sampling. Finding the optimal noise given a kernel
function will be called the dithering problem.
Which noise is optimal depends on the source density f (x), and on the used
decoding scheme. In this experiment it is thus reasonable to assume that the more
similar the noise density convolved with the kernel is to a Gaussian, the better the
accuracy. The dashed curves in each plot show the performance of overlapping
bins with Gaussian kernels of width σ = 0.8. As can be seen in the plots, for
large number of samples the performances of the two kernels are similar once the
dithering noise is above a certain level.
Biological neurons are known to have binary responses (i.e. at a given time
instant they either fire or don’t fire). They are able to convey graded information
by having the rate of firing depend on the sum of the incoming (afferent) signals.
This behaviour could be modelled as (temporally local) averaging with noise added
before application of a threshold (activation function). If the temporal averaging in
the neurons is larger than just a few samples, it would be reasonable to expect that
biological neurons implicitly have solved the inverse dithering problem of tuning
the activation threshold to the noise characteristics, see e.g. [102].
5.6
2D and 3D channel representations
So far we have only discussed channel representations of scalar values. A common
situation, which is already dealt with by mean-shift filtering and M-estimation,
see chapter 4, is a higher dimensional sample space.
5.6.1
The Kronecker product
The most straight-forward way to extend channel encoding to higher dimensions
is by means of the Kronecker product. The result of a Kronecker product between
two vectors x and y is a new vector formed by stacking all possible element-wise
products x_i y_j. This is related to the (column-wise) vectorisation of an outer product

x ⊗ y = vec(yx^T) ,     (5.34)

i.e.

( x_1  . . .  x_K )^T ⊗ ( y_1  . . .  y_L )^T = ( x_1 y_1  . . .  x_1 y_L   x_2 y_1  . . .  x_K y_L )^T .

If we have a higher dimensionality than 2, we simply repeat the Kronecker product.
E.g. for a 3D space we have

x ⊗ y ⊗ z = vec(vec(zy^T)x^T) .     (5.35)
For channel representations in higher dimensions it is additionally meaningful
to do encodings of subspaces, such as a line in 2D, and a line or a plane in 3D. We
will develop the encoding of a line in 2D in section 5.6.3.
For applications such as channel averaging (see section 4.2.1), the channel
representation will become increasingly less practical as the dimensionality of the
space to be represented is increased, and methods such as the mean-shift filter (see
section 4.2.3) are to be preferred. However, for applications where the sparsity of
the channel representation can be utilised in storage of data (e.g. the associative
learning in chapter 9) higher dimensions are viable.
5.6.2
Encoding of points in 2D
Since the Gaussian function is separable, the Kronecker product of two 1D Gaussian
channel vectors is equivalent to using isotropic 2D Gaussian kernels

B^{k,l}(x, y) = B^k(x) B^l(y) = e^{−((x − k)² + (y − l)²)/(2σ²)} .     (5.36)
For efficiency, we will however still perform the encoding using the Kronecker
product of two 1D channel vectors.
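A small sketch (not from the thesis) of 2D encoding via the Kronecker product of two 1D Gaussian channel vectors, verified against the explicit 2D kernel (5.36). The flattening order produced by np.kron is a convention chosen here; any fixed ordering works as long as encoding and decoding agree.

import numpy as np

sigma = 0.6
cx = np.arange(0, 8)                  # channel centres along x
cy = np.arange(0, 8)                  # channel centres along y

def gauss1d(v, centres):
    return np.exp(-(v - centres) ** 2 / (2 * sigma ** 2))

def encode2d(x, y, r=1.0):
    # 2D channel encoding as a Kronecker product of two 1D channel vectors
    return r * np.kron(gauss1d(x, cx), gauss1d(y, cy))

u = encode2d(3.2, 5.7)
# the same values from the explicit isotropic 2D kernel B^{k,l}(x, y) in (5.36)
U = np.exp(-((3.2 - cx[:, None]) ** 2 + (5.7 - cy[None, :]) ** 2)
           / (2 * sigma ** 2))
print(np.allclose(u, U.reshape(-1)))  # True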
5.6.3
Encoding of lines in 2D
A line constraint in 2D is the set of all points (x, y) fulfilling the equation

x cos φ + y sin φ − ρ = 0 .     (5.37)

Here (cos φ  sin φ)^T is the line normal, and ρ is the signed normal distance, i.e. the
projection of an arbitrary point on the line onto the line normal. The distance of
a specific point (k, l) to the line is then given by

d = ||k cos φ + l sin φ − ρ|| .     (5.38)

This means that we can encode the line constraint, by simply applying the kernel

B^{k,l}(φ, ρ) = e^{−d²/(2σ²)} = e^{−(k cos φ + l sin φ − ρ)²/(2σ²)} .     (5.39)

5.6.4
Local decoding for 2D Gaussian kernels
In order to capture sample dispersions that are not aligned to the axes, the local
decoding should model the channel values using a full 2D Gaussian function

u^{k,l} = r B^{k,l}(x) = r e^{−0.5 (x − m)^T C^{−1} (x − m)} .     (5.40)

Here x = (x  y)^T, m = (k  l)^T, and C is a full covariance matrix. In scalar form
we get

u^{k,l} = r e^{ −[ (x − k)² σ_y² + (y − l)² σ_x² − 2(x − k)(y − l) σ_xy ] / [ 2(σ_x² σ_y² − σ_xy²) ] }     (5.41)

or, for ∆x = x − k and ∆y = y − l,

u^{k,l} = r e^{ −[ ∆x² σ_y² + ∆y² σ_x² − 2∆x∆y σ_xy ] / [ 2(σ_x² σ_y² − σ_xy²) ] } .     (5.42)
If we choose to estimate the parameters in a 3 × 3 neighbourhood around the
position (k, l), we obtain the following system

( u^{k−1,l−1} )     ( B^{k−1,l−1}(x, y) )
( u^{k−1,l}   )     ( B^{k−1,l}(x, y)   )
( u^{k−1,l+1} )     ( B^{k−1,l+1}(x, y) )
( u^{k,l−1}   )     ( B^{k,l−1}(x, y)   )
( u^{k,l}     ) = r ( B^{k,l}(x, y)     ) .     (5.43)
( u^{k,l+1}   )     ( B^{k,l+1}(x, y)   )
( u^{k+1,l−1} )     ( B^{k+1,l−1}(x, y) )
( u^{k+1,l}   )     ( B^{k+1,l}(x, y)   )
( u^{k+1,l+1} )     ( B^{k+1,l+1}(x, y) )
The logarithm of an arbitrary row can be written as

ln u^{k+c,l+d} = ln r − [ (∆x − c)² σ_y² + (∆y − d)² σ_x² − 2(∆x − c)(∆y − d) σ_xy ] / [ 2(σ_x² σ_y² − σ_xy²) ] .     (5.44)

This can be factorised as

ln u^{k+c,l+d} = 0.5 ( 1  2c  2d  −c²  −d²  −2cd ) p     (5.45)
for the parameter vector

p = 1/(σ_x² σ_y² − σ_xy²) ( 2 ln r (σ_x² σ_y² − σ_xy²) − ∆x² σ_y² − ∆y² σ_x² + 2∆x∆y σ_xy ,
                            ∆x σ_y² − ∆y σ_xy ,
                            ∆y σ_x² − ∆x σ_xy ,
                            σ_y² ,
                            σ_x² ,
                            −σ_xy )^T .     (5.46)
In the parameters p, we recognise the inverse covariance matrix

C^{−1} = 1/(σ_x² σ_y² − σ_xy²) (  σ_y²   −σ_xy )   =   ( p_4  p_6 )
                               ( −σ_xy    σ_x² )       ( p_6  p_5 ) .     (5.47)

Thus we can compute the covariance matrix as
Ĉ = ( σ_x²  σ_xy )   =   1/(p_4 p_5 − p_6²) (  p_5  −p_6 )
    ( σ_xy  σ_y² )                          ( −p_6   p_4 ) .     (5.48)
From the expressions for p_2 and p_3 in (5.46), we obtain the following system

( p_2, p_3 )^T = C^{−1} ( ∆x, ∆y )^T     with the solution     ( ∆x̂, ∆ŷ )^T = Ĉ ( p_2, p_3 )^T .     (5.49)
From the solution, we can obtain an estimate of the confidence r̂, as

r̂ = e^{0.5(p_1 + p_4 ∆x̂² + p_5 ∆ŷ² − 2p_6 ∆x̂∆ŷ)} .     (5.50)

The final peak location is obtained by adding the centre bin location to the peak
offset

x̂ = ∆x̂ + k   and   ŷ = ∆ŷ + l .     (5.51)
The expectation of the estimated covariance matrix Ĉ is the sum of the covariance
of the noise, and the covariance of the kernel functions C_b = diag(σ, σ),
see (5.36) and (5.39). This means that we can obtain the covariance of the noise
as Ĉ_noise = Ĉ − C_b.
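The following is a compact sketch (not part of the thesis) of the 2D local decoding, following (5.43)–(5.51) literally: the nine log-channel values around the strongest channel are fitted with the linear model (5.45) by least squares, after which covariance, offset, and confidence are read out. The function names, the isotropic test kernel, and the example values are assumptions made here.

import numpy as np

sigma = 0.8
cx, cy = np.arange(0, 8), np.arange(0, 8)

def encode2d(x, y, r=1.0):
    # isotropic 2D Gaussian channel encoding, eq. (5.36)
    return r * np.exp(-((x - cx[:, None]) ** 2 + (y - cy[None, :]) ** 2)
                      / (2 * sigma ** 2))

def decode2d(U, k, l):
    # local decoding (5.43)-(5.51) in the 3x3 neighbourhood around channel (k, l)
    rows, rhs = [], []
    for c in (-1, 0, 1):
        for d in (-1, 0, 1):
            rows.append([1, 2 * c, 2 * d, -c ** 2, -d ** 2, -2 * c * d])
            rhs.append(np.log(U[k + c, l + d]))
    p = np.linalg.lstsq(0.5 * np.array(rows), np.array(rhs), rcond=None)[0]
    C = np.array([[p[4], -p[5]], [-p[5], p[3]]]) / (p[3] * p[4] - p[5] ** 2)   # (5.48)
    dx, dy = C @ p[1:3]                                                        # (5.49)
    r = np.exp(0.5 * (p[0] + p[3] * dx ** 2 + p[4] * dy ** 2
                      - 2 * p[5] * dx * dy))                                   # (5.50)
    return k + dx, l + dy, r, C                                                # (5.51)

U = encode2d(3.2, 5.7, r=0.7)
k, l = np.unravel_index(np.argmax(U), U.shape)      # decode around the strongest channel
print(decode2d(U, k, l))                            # x = 3.2, y = 5.7, r = 0.7, C = sigma^2 I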
5.6.5
Examples
Clustering of points in 2D is illustrated in figure 5.13. Here we have channel
encoded 1000 points, each from one of 5 clusters at random locations. The centre
plot shows the average of the channel representation of the points. In the right
plot, the decoded modes have their peaks represented as dots, and their covariances
as ellipses. This approach to clustering is different from techniques like K-means
clustering [7] and mixture model estimation using expectation maximisation (EM)
[15, 7], which both require the number of clusters to be known a priori. In order
to get a variable number of clusters, such methods have to test all reasonable
numbers of clusters, and select one based on some criterion see e.g. [15]. Here we
instead assume a scale, specified by the channel distance and kernel width, and
directly obtain a variable number of clusters. This makes our clustering technique
fall into the same category as mean-shift clustering, see section 4.2.3.
The encoding of line constraints is illustrated in figure 5.14. Here we have
channel encoded 4 lines, and averaged their channel representations. In the right
plot, the decoded modes have their peaks represented as dots, and their covariances
as ellipses. This approach to finding multiple solutions to a line constraint equation
was applied in [93] to find multiple solutions to systems of optical flow constraints
at motion boundaries.
Figure 5.13: Illustration of point decoding. Left: Points to encode. Centre: Average of the corresponding channel representations, the sizes of the filled circles
correspond to the channel values. Ellipses show the decoded modes. Right: The
points, and the decoded modes.
Figure 5.14: Illustration of line constraint decoding. Left: Lines to encode. Centre:
Average of the corresponding channel representations, the sizes of the filled circles
correspond to the channel values. Ellipses show the decoded modes. Right: The
original lines, and the decoded modes.
5.6.6
Relation to Hough transforms
The idea of encoding line constraints, averaging and decoding, is the same as in
the Hough transform, see e.g. [92]. In the Hough transform, each line constraint
contributes either 1 or 0 to cells in an accumulator array (corresponding to the
channel matrix).
To reduce noise, and avoid detection of multiple maxima for each solution, a
smoothing is sometimes applied to the accumulator array after it has been computed [92]. Note that channel encoding is a more sound approach, since it corresponds to smoothing before sampling. The use of overlapping kernels, and a local
decoding, instead of just finding the accumulator cell with the largest contribution,
improves the accuracy of the result. Furthermore it avoids the trade-off between
noise sensitivity and accuracy of the result inherent in the Hough transform, and
all other peak detection schemes using non-overlapping bins.8 The difference in
obtained accuracy is often large, see e.g. [33] for an example where the estimation
error is reduced by more than a factor 50.
⁸A larger amount of noise will however still mean that larger kernels should be used, and this
will affect the amount of interference between multiple decodings, see section 5.4.
Chapter 6
Channel Smoothing
In this chapter we will introduce the channel smoothing technique for image denoising. We will identify a number of problems with the original approach [42], and
suggest solutions to them. One of the solutions is the alpha synthesis technique.
Channel smoothing is also compared to other popular image denoising techniques,
such as mean-shift, bilateral filtering, median filtering, and normalized averaging.
6.1
Introduction
In chapter 4 we saw that averaging in the channel representation followed by a
local decoding is a way to find simple patterns (clusters) in a data set. In low-level
image processing, the data set typically consists of pixels or features in a regular
grid. Neighbouring pixels are likely to originate from the same scene structure, and
it seems like a good idea to exploit this known relationship and perform spatially
localised clustering. This is done in channel smoothing [42, 29].
6.1.1
Algorithm overview
Channel smoothing of a grey-scale image p : Z² → R can be divided into three
distinct steps.
1. Decomposition.
The first step is to channel encode each pixel value, p(x, y), in the grey-scale
image with the confidence r(x, y) = 1, to obtain a set of channel images.
2. Smoothing.
We then perform a spatially local averaging (low-pass filtering) on each of
the channel images.
3. Synthesis.
Finally we synthesise an image and a corresponding confidence using a local
decoding (see section 3.2.3) in each pixel.
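As an illustration of these three steps, the following is a compact sketch (not part of the thesis) of grey-scale channel smoothing with Gaussian channels and a simplified synthesis that decodes around the strongest channel in every pixel (a simplification of selecting the strongest decoding). K = 8 and the spatial filter width σ = 1.18 follow the example in section 6.1.2; the channel width σ = 0.6, the use of scipy.ndimage.gaussian_filter in place of the truncated 7-tap filter, and the test image are assumptions made here.

import numpy as np
from scipy.ndimage import gaussian_filter

K, sigma_ch, sigma_sp = 8, 0.6, 1.18
centres = np.arange(1, K + 1)

def channel_smooth(img):
    # 1. decomposition: map intensities to [1.5, K - 0.5], eq. (6.1), and encode
    t1 = (K - 2) / (img.max() - img.min())
    t0 = 1.5 - t1 * img.min()
    x = t1 * img + t0
    U = np.exp(-(x[..., None] - centres) ** 2 / (2 * sigma_ch ** 2))
    # 2. smoothing: spatial low-pass filtering of every channel image
    U = gaussian_filter(U, sigma=(sigma_sp, sigma_sp, 0))
    # 3. synthesis: log-quadratic decoding around the strongest channel per pixel
    k = np.clip(U.argmax(axis=-1), 1, K - 2)
    rows, cols = np.indices(img.shape)
    logu = np.log(np.stack([U[rows, cols, k + d] for d in (-1, 0, 1)], axis=-1))
    p2 = 0.5 * (logu[..., 2] - logu[..., 0])
    p3 = np.minimum(0.5 * (logu[..., 0] + logu[..., 2]) - logu[..., 1], -1e-9)
    xhat = centres[k] - p2 / (2 * p3)
    return (xhat - t0) / t1                          # map back to intensity range

rng = np.random.default_rng(4)
clean = np.kron(np.eye(8) * 200 + 30, np.ones((16, 16)))
noisy = np.clip(clean + rng.normal(0, 20, clean.shape), 0, 255)
print(np.abs(noisy - clean).mean(), np.abs(channel_smooth(noisy) - clean).mean())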
Figure 6.1: Illustration of channel smoothing. Left to right: Input, channel representation, low-passed channels, and local decoding.
At locations where the image is discontinuous we will obtain several valid
decodings. The simplest solution is to select the decoding with the highest
confidence. This synthesis works reasonably well, but it has several problems,
such as introduction of jagged edges, and rounded corners. In section 6.3 we will
analyse these problems, and suggest solutions. Finally in section 6.4 we will have
a look at a more elaborate synthesis method that deals with these problems.
6.1.2
An example
Figure 6.1 shows an example of channel smoothing. In this example we have used
K = 8 channel images, and averaged each channel image with a separable Gaussian
filter of σ = 1.18 and 7 coefficients per dimension. A channel representation
u = ( H^1(p)  . . .  H^K(p) )^T can represent measurements p ∈ [3/2, K − 1/2], and
thus we have scaled the image intensities p(x, y) ∈ [r_l, r_h] using a linear mapping
p_x(x, y) = t_1 p(x, y) + t_0 with

t_1 = (K − 2)/(r_h − r_l)   and   t_0 = 3/2 − t_1 r_l     (6.1)
as described in section 3.3.
6.2
Edge-preserving filtering
The channel smoothing procedure performs averaging of measurements that are
similar in both property (here intensity) and location. In this respect, channel smoothing is similar to edge preserving filtering techniques, such as robust
anisotropic diffusion [8], selective binomial filtering [40], mean-shift filtering [19],
and non-linear Gaussian filtering [103, 2, 44], also known as SUSAN noise filtering [88],
and in the case of vector valued signals as bilateral filtering [98]. All these
methods can be related to redescending¹ M-estimators (see section 4.2.4). The
relationship between anisotropic diffusion and M-estimation is established in [8], and
selective binomial filtering can be viewed as an M-estimation with a cutoff-squares
error norm (see section 4.2.4).

¹A redescending M-estimator has an influence function with monotonically decreasing magnitude
beyond a certain distance to the origin [8].
In section 6.5 we will compare mean-shift filtering and bilateral filtering to
channel smoothing, thus we will now describe these two methods in more detail.
6.2.1
Mean-shift filtering
Mean-shift filtering [19, 16, 43] is a way to cluster a sample set, by moving each
sample toward the closest mode. As described in section 4.2.3 this is accomplished
by gradient descent on a kernel density estimate. For each sample pn , an iteration is started in the original sample value i.e. p̄0n = pn , and is iterated until
convergence. The generalised mean-shift iteration [16] is defined by
p̄n^(i+1) = Σk H(pk − p̄n^i) pk / Σk H(pk − p̄n^i) .        (6.2)
Here H is a kernel which is said to be the shadow of the kernel in the corresponding kernel density estimator, see section 4.1.1. For a kernel density estimate
using the Epanechnikov kernel, the iteration rule becomes an average in a local
window, and can thus be computed very quickly [43]. This is what has given the
method its name, and averaging in a local window is also the most commonly
applied variant of mean-shift filtering.
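As an illustration, here is a minimal 1D sketch of the iteration (6.2), using a Gaussian as the kernel H. The kernel choice, bandwidth, sample data and function name are assumptions of this sketch.

```python
import numpy as np

def mean_shift(samples, sigma=0.3, iters=30, tol=1e-6):
    """Move every sample toward the nearest mode of a Gaussian kernel
    density estimate by iterating (6.2)."""
    p = samples.astype(float).copy()
    for _ in range(iters):
        # H(p_k - p_bar_n): Gaussian kernel evaluated for all sample pairs
        w = np.exp(-0.5 * ((samples[None, :] - p[:, None]) / sigma) ** 2)
        p_new = (w * samples[None, :]).sum(1) / w.sum(1)
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p

# Samples drawn around two modes converge to values near 0 and 5:
np.random.seed(0)
x = np.concatenate([np.random.normal(0, 0.2, 50), np.random.normal(5, 0.2, 50)])
print(np.unique(np.round(mean_shift(x), 1)))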
Mean-shift filtering has been developed by Comaniciu and Meer [19, 20] into
algorithms for edge-preserving filtering, and segmentation. In the edge-preserving
filter, they apply mean-shift filtering to the parameter vector
p(x) = ( x/σs   r(x)/σr   g(x)/σr   b(x)/σr )^T .        (6.3)

Here r, g, and b are the three colour bands of an RGB image, and the parameters σs and σr allow independent scaling of the spatial and range (colour) vector
elements respectively. The convergence point p̄∗ is stored in the position where
the iteration was started. The result is thus a hill-climbing on the kernel density
estimate.
6.2.2 Bilateral filtering
Bilateral filtering [98] of a signal p(x) is defined by
q(x) = ∫ p(y) H((x − y)/σs) H((p(x) − p(y))/σr) dy / ∫ H((x − y)/σs) H((p(x) − p(y))/σr) dy        (6.4)
where H is a multi-dimensional Gaussian, see (4.5). For the special case of a
scalar valued image p(x), the expression (6.4) is identical to the earlier non-linear
Gaussian filter [103, 2, 44], and to the SUSAN noise filtering technique [88]. In (6.4) we explicitly distinguish between the spatial position x and the sample values p(x), whereas in the generalised mean-shift iteration (6.2) on the parameter vector (6.3) they are treated as one entity.
It is nevertheless possible to relate generalised mean-shift and bilateral filtering for the case of a separable kernel H(d), such as the Gaussian. If we use p from (6.3) in (6.2), and identify the r, g, b components of (6.3) with the signal in (6.4), then the results of (6.2) and (6.4) after one iteration are identical in the r, g, b components. The difference is that (6.4) does not update the position as (6.2) does. A similar observation is made in [104], where non-linear Gaussian filtering (i.e. bilateral filtering) is shown to correspond to the first iteration of a gradient descent on an M-estimation problem. Bilateral filtering thus moves the data towards the local M-estimate, but in general gets stuck before it is reached [104].
Tomasi and Manduchi [98] also suggest a scheme for iterated bilateral filtering, obtained by again applying (6.4) to the result q(x). As noted in [19], this iteration will not converge to a stable clustering. Instead it will eventually erode the image to a single constant colour. Thus, neither bilateral filtering nor iterated bilateral filtering is a robust technique in a strict sense.
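A deliberately straightforward discretisation of (6.4) for a scalar image might look as follows; the window radius and the reflective border handling are assumptions of this sketch.

```python
import numpy as np

def bilateral(img, sigma_s=1.64, sigma_r=0.30, radius=5):
    """Bilateral filtering of a scalar image: every output pixel is a
    spatially and range-weighted average of its neighbourhood (cf. 6.4)."""
    H, W = img.shape
    pad = np.pad(img, radius, mode='reflect')
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    g_s = np.exp(-0.5 * (x ** 2 + y ** 2) / sigma_s ** 2)        # spatial kernel
    out = np.zeros_like(img, dtype=float)
    for i in range(H):
        for j in range(W):
            patch = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            g_r = np.exp(-0.5 * ((img[i, j] - patch) / sigma_r) ** 2)  # range kernel
            w = g_s * g_r
            out[i, j] = (w * patch).sum() / w.sum()
    return out
```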
6.3 Problems with strongest decoding synthesis
The original synthesis in channel smoothing selects the decoding with the strongest confidence as its output [42, 29]. This synthesis has three distinct problems:
1 Jagged edges.
Since the synthesis selects one decoding, edges can become arbitrarily sharp.
This means that for an infinitesimal translation of the underlying signal, the
synthesis of a pixel may completely change value, and this in turn results in
jagged edges.
2 Rounded corners.
Corners will tend to be rounded, because the number of pixels voting for
the intensity inside the corner (i.e. the confidence) becomes lower than the
number of votes for the outside intensity when we are near the tip of the
corner.
3 Patchiness.
For steep slopes, or large amounts of blurring, when we have inlier noise, the
selection tends to generate a patchy output signal.
To illustrate these effects we have devised a test image consisting of a slanted plane, surrounded by a number of triangles with protruding acute angles (see figure 6.2, left, for a noise-corrupted version). Each triangle also has a different grey-level offset from the background, in order to illustrate at what offset the non-linear behaviour starts. The parameters of the channel smoothing have been specifically set to exaggerate the three problems listed above. The result
is shown in the right plot of figure 6.2.
Figure 6.2: Illustration of problems with strongest decoding synthesis. Left to
right: Input (100 × 100 pixels, with values in range [−1, 1]. Noise is Gaussian
with σ = 0.1, and 5% salt&pepper pixels.), normalized average (σ = 2.2), channel
smoothing (K = 15 channels, σ = 2.2).
In order to demonstrate the non-linear behaviour of the method, a normalized
average with the same amount of smoothing is also shown for comparison. Normalized averaging of a signal–confidence pair (p, r) using the kernel g is defined
by the quotient
q(x) = (p · r ∗ g)(x) / (r ∗ g)(x)        (6.5)
where · denotes an element-wise product.
In our example, normalized averaging is equivalent to plain averaging, except near the edges of the image, where it helps maintain the correct DC-level. See [48, 26] for more on normalized averaging, and the more general framework of normalized convolution [26].
6.3.1 Jagged edges
The jagged edges problem can be dealt with using super-sampling techniques common in computer graphics, see e.g. [59] section 4.8. For channel smoothing this can
be done by filling in zeros in between the encoded channel values, before smoothing, and thus generate an output at a higher resolution. By modifying the amount
of channel smoothing accordingly, we can obtain edges with a higher resolution.
This technique is demonstrated in figure 6.3.
Here we have filled in zeroes to obtain 4× the pixel density along each spatial
dimension. We have then modified the amount of smoothing according to
σnew = σ √( (4^n − 1)/3 )        (6.6)
where n = 3 is the octave scale, as suggested in [64]. As a final step we have then
blurred and subsampled the high resolution output. This has given us an image
without jagged edges (see figure 6.3, right).
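The adjustment (6.6) itself is a one-liner; the function name is of course an assumption of this sketch.

```python
import numpy as np

def supersample_sigma(sigma, n):
    """Smoothing std.dev. for channel images upsampled by zero insertion,
    according to (6.6), with n the octave scale."""
    return sigma * np.sqrt((4 ** n - 1) / 3)

print(round(supersample_sigma(1.3, 3), 2))   # 5.96, the value used in figure 6.3
```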
Figure 6.3: Super-sampling from channels. Left to right: Strongest decoding output (using K = 7 and σ = 1.3), strongest decoding output after 4× upsampling of the channel images (using K = 7 and σ = 1.3 × √((4³ − 1)/3) ≈ 5.96), smoothing and subsampling of the high-resolution decoding (σ = 1.4). For the input image, see figure 6.2.
6.3.2 Rounding of corners
A solution to the rounding of corners was suggested by Spies and Johansson in
[94]. They proposed to sometimes select the decoding closest to the original grey-level, instead of the one with the highest confidence. The method in [94] works as
follows:
• If the strongest confidence is above a threshold th the corresponding decoding
is selected.
• If not, all decodings with a confidence above tl are searched for the decoding
closest to the original grey value.
Spies and Johansson suggest using th = 0.9 and tl = 0.1. A similar behaviour
can be obtained by removing all decodings with confidence below a threshold cmin, and selecting the remaining decoding that is closest to the original value. Selecting
the closest decoding will make the method correspond roughly to the hill-climbing
done by the mean-shift procedure (see section 6.2.1). These two methods (from
now on called Spies and Hill-climbing) are compared to the strongest decoding in
figure 6.4. As can be seen in the figure, these methods trade preservation of details
against removal of outliers which happen to be inliers in nearby structures.
6.3.3 Patchiness
The patchiness problem is caused by too wide a distribution of samples around a mode. More specifically, the exact mode location cannot be determined by examining only channel values inside the decoding window, since the noise has not been completely averaged out, see figure 6.5. Usually, additional channel smoothing cannot average out the noise, since there simply are no more samples inside the appropriate grey-level range locally in the image.
Figure 6.4: Comparison of decoding selection schemes. Left to right: Strongest decoding synthesis, the Spies method (th = 0.7 tl = 0.25), the Hill-climbing method
(cmin = 0.25). All three use K = 15 channels, and σ = 2.2.
Figure 6.5: Reason for the patchiness problem. For wide distributions, it might
be impossible to pick a proper decoding window. Both alternatives indicated here
will give wrong results, since the noise has not been completely averaged out.
There are two causes of the wide distributions of samples. The first one is that the noise distribution might be too wide. This can be dealt with by increasing the sizes of the used kernel functions such that more samples can be used in the averaging process. This would however require a modification to the decoding step. A way to obtain larger kernels which does not require a modification of the decoding step is to simply reduce the number of channels and scale them to fit the same interval according to (6.1). Reducing the number of channels will however make the method increasingly similar to plain linear smoothing.
There is however a second cause of the wide distributions. The channel averaging process implicitly assumes that the signal is piecewise constant. When
the signal locally constitutes a ramp, we violate this assumption, and the actual
width of the distribution depends on the slope of the ramp. Since the required
width of the channels depends both on the amount of noise and on the slope of
the signal, a more theoretically sound solution is to use a more advanced decoding
scheme, which adapts the size of the decoding window to take more channel values
into account when necessary. Alternatively we could replace the locally constant
assumption with a locally linear one, and cluster in (µ, dx, dy)-space instead. None
of these ideas are investigated further in this thesis.
6.4 Alpha synthesis
We will now instead attack the jagged edges problem, and at the same time slightly improve the behaviour with respect to the rounding of corners. The reason for the jagged edges in the strongest decoding synthesis is that the decoding is a selection of one of the decodings. In many situations a selection is desirable, for instance if we want to teach a robot to navigate around an object. Going either left or right of the object works fine, but the average, straight ahead, is not a valid trajectory. In the present situation however, we want to output an image, and in images we should not have arbitrarily sharp edges.
The solution to this problem is to generate a continuous transition between the
decoded values. Instead of choosing the decoding with the highest confidence, we
will now combine all decoded signal–confidence pairs (pk , rk ) in a non-linear way.
In this way it is possible to obtain an output signal where the degree of smoothness
is controlled by a parameter. The combination of decoded signal–confidence pairs
(pk , rk ) is done according to
pout = Σk pk wk    where    wk = rk^α / Σl rl^α        (6.7)
and α is a tuning parameter. For α = 1 we get a linear combination of the
decodings pk , and for α → ∞ we obtain a selection of the strongest decoding
again. The behaviour of (6.7) is illustrated in figure 6.6.
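In code, the combination (6.7) is a per-pixel weighting of the decoded signal–confidence pairs; the array layout below is an assumption of this sketch.

```python
import numpy as np

def alpha_synthesis(p, r, alpha=3.0):
    """Combine all valid decodings per pixel according to (6.7).
    p, r: arrays of shape (..., M) holding the M decoded value-confidence
    pairs in each pixel (use zero confidence for unused slots)."""
    w = r ** alpha
    w = w / np.maximum(w.sum(axis=-1, keepdims=True), 1e-12)
    return (p * w).sum(axis=-1)

# Two decodings with confidences 0.4 and 0.6: alpha = 1 gives their weighted
# mean, and increasing alpha approaches the strongest decoding (8.0).
p = np.array([[2.0, 8.0]]); r = np.array([[0.4, 0.6]])
for a in (1, 4, 16, 64):
    print(a, alpha_synthesis(p, r, a))
```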
Figure 6.6: Illustration of alpha synthesis in 1D. Left to right: input, channel representation, low-passed channels, and output. Outputs for 5 different α values are shown.
What is important to note here is that the actual distance between the decodings p1 and p2 plays no part in the synthesis (when the decodings do not interfere).
This is not the case for methods like Histogram filters [107]. In Histogram filters
the noise is removed by regularised inverse filtering of the histogram (i.e. the channel vector). The output is then computed as the mass centre of the histogram, after
raising the histogram bin values to a power γ. The purpose of γ is to control the
sharpness of the result, and it thus serves the same purpose as our α. In such an
approach, the interference of the two levels will be different when they are close
and when they are far apart. Furthermore, raising the histogram to a power is not
a shift invariant operation.
The signal in figure 6.6 (left) is expanded into a number of channels, which are
low-passed. In the decoding we get two valid signal–confidence pairs (p1 , r1 ) and
(p2 , r2 ). The blurred channels have values that smoothly change from their highest
value to zero. Since the confidence computation is a local average of the channel
values, the confidence will have the same profile as the channels. The confidence
of a valid decoding is exactly 1 when the low-pass kernel is inside one of the flat
regions, and as we move across the edge it drops to zero.
6.4.1 Separating output sharpness and channel blurring
The actual value of the confidence is given by a convolution of a step and the used
low-pass kernel. In this example we have used a Gaussian low-pass kernel, and
thus obtain the following confidences for the valid decodings
r1(x) = ((1 − step) ∗ g)(x) = ∫_{−∞}^{−x} 1/√(2πσ²) e^{−0.5(t/σ)²} dt = Φ(−x/σ) = 1 − Φ(x/σ)        (6.8)

r2(x) = (step ∗ g)(x) = ∫_{−∞}^{x} 1/√(2πσ²) e^{−0.5(t/σ)²} dt = Φ(x/σ)        (6.9)
where Φ(x) is the integral of the standard normal PDF. The weights now become
w1 = r1^α / (r1^α + r2^α) = 1 / ( 1 + ( 1/(1 − Φ(x/σ)) − 1 )^α )    and    w2 = 1 / ( 1 + ( 1/Φ(x/σ) − 1 )^α )        (6.10)
If we look at the decoding for x = 0, we get
pout(0) = w1(0) p1 + w2(0) p2 = (1/2) p1 + (1/2) p2 = (p1 + p2)/2        (6.11)

which is desirable, since this point is directly in between the two levels p1 and p2. If we look at the derivative at x = 0, we get

∂pout/∂x (0) = ∂w1/∂x (0) p1 + ∂w2/∂x (0) p2 = . . . = α (p2 − p1) / (√(2π) σ) .        (6.12)
This motivates setting α ∝ σ. By switching to a parameter β = α/σ we get a
new synthesis formula
pout = Σk pk wk    where    wk = rk^(βσ) / Σl rl^(βσ) .        (6.13)
Using the parameter β, we can control the slope of the decoding at the transition point independently of σ. While (6.12) only holds exactly for x = 0, a
constant α/σ ratio gives very stable slopes for other values of x as well. This is
demonstrated experimentally in figure 6.7.
Figure 6.7: Illustration of stability of the α ∝ σ approximation. Here we have
used σ ∈ {1, 1.05, . . . , 3} and set β = 1. Left: each row is pout for some value of σ.
Right: all pout curves superimposed.
Figure 6.8 demonstrates the synthesis according to (6.7) on the test-image
in figure 6.2 (left), for varying amounts of blurring (σ) and varied values of β.
Each row has a constant β value, and as can be seen they have roughly the same
sharpness.
6.4.2 Comparison of super-sampling and alpha synthesis
We will now make a comparison of alpha synthesis and the super-sampling method
described in section 6.3.1. Figure 6.9 shows the result. From the figure we can see
that the result is qualitatively similar with respect to elimination of jagged edges.
After examining details (see bottom row of figure 6.9) we find that alpha synthesis
is slightly better at preserving the shape of corners.
In addition to slightly better preservation of corners, alpha synthesis also has
the advantage that it is roughly a factor 16 faster, since it does not need to use a
higher resolution internally.
6.4.3 Relation to smoothing before sampling
An image from a digital camera is a regular sampling of a projected 3D scene.
Before sampling, the projection has been blurred by the point spread function of
the camera lens, and during sampling it is further smoothed by the spatial size
of the detector elements. These two blurrings can be modelled by convolution of
the continuous signal f (x) with a blurring kernel g(x). A blurring of a piecewise
constant signal f (x), such as the one in figure 6.6, will have transition profiles
which look the same, regardless of the signal amplitude. This is evident since for
a scalar A, a signal f (x), and a filter g(x) we have
Figure 6.8: σ-independent sharpness. Column-wise σ = 1.2, 1.6, 2.4, 3.2. Row-wise β = 1, 2, 4, 8, ∞ (i.e. strongest decoding synthesis). The input image is shown in figure 6.2. K = 7 has been used throughout the experiment.
Figure 6.9: Comparison of super-sampling method and alpha synthesis. Left to
right: strongest decoding synthesis, strongest decoding with 4× higher resolution,
blurred and subsampled, alpha synthesis. Top row shows full images, bottom row
shows a detail. K = 7, σ = 2.2 and α = 3σ has been used.
(Af ∗ g)(x) = A(f ∗ g)(x) .        (6.14)
For a suitable choice of blurring kernel g(x), a small translation of the input
signal will result in a small change in the sampled signal.
The behaviour of smoothing before sampling bears resemblance to the alpha
synthesis in two respects:
1. For a suitable choice of α, alpha synthesis generates a smooth transition
between two constant levels, in such a way that a small translation of the
input signal results in a small change in the output signal. This is not the
case for e.g. the strongest decoding synthesis, but it is the case for smoothing
before sampling.
2. As noted earlier, the actual distance between the two grey-levels in a transition plays no part in the alpha synthesis. This means that a transition with a
certain profile will be described with the same number of samples, regardless
of scaling (i.e. regardless of the distance between p1 and p2 ). This follows
directly from (6.7). In this respect channel smoothing with alpha synthesis
behaves like a blurring before sampling for piecewise constant signals. This
is different from methods such as the histogram filter [107], which control the
sharpness of the output by an exponent on the channel values.
We can thus view channel smoothing with alpha synthesis as analogous to
smoothing before sampling on a piecewise constant signal model. The analogy is
that the channel images represent a continuous image model (for piecewise constant
signals), and this model is blurred and sampled by the alpha synthesis.
6.5 Comparison with other denoising filters
We will now compare channel smoothing with alpha synthesis to a number of other common denoising techniques. The methods are tested on the test image in figure 6.2, contaminated with two noise types: Gaussian noise only, and Gaussian plus salt&pepper noise. For all methods we have chosen the filter parameters such that the mean absolute error (MAE) between the uncontaminated input and the filter output for the Gaussian noise case is minimised². MAE is defined as
εMAE = 1/(N1 N2) Σ_{x1=1}^{N1} Σ_{x2=1}^{N2} | fideal(x) − fout(x) | .        (6.15)
MAE was chosen over RMSE, since it is more forgiving of outliers. The methods and their optimal filter parameters are listed below:
1. Normalized averaging.
As defined in (6.5). σ = 1.17.
2. Median filtering.
See e.g. [92]. The Matlab implementation, with symmetric border extension, and a 5 × 5 spatial window.
3. Bilateral filtering.
As defined in (6.4). σs = 1.64 and σp = 0.30.
4. Mean-shift filtering.
As defined in (6.2). σs = 4.64 and σp = 0.29.
5. Channel smoothing.
With alpha synthesis, K = 8, σ = 1.76, and α = 3σ (not optimised).
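The selection criterion itself is straightforward; a minimal sketch of (6.15), averaged over the noise instances mentioned in the footnote, could look like this (the function name is an assumption).

```python
import numpy as np

def mae(f_ideal, f_out):
    """Mean absolute error (6.15) between an ideal image and a filter output."""
    return float(np.mean(np.abs(np.asarray(f_ideal) - np.asarray(f_out))))
```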
The results are reproduced in figure 6.10. As is evident in this experiment,
neither bilateral nor mean-shift filtering is able to eliminate outliers. The reason
for this is that they both do gradient descent starting in the outlier value. However,
none of these methods allow outliers to influence the other pixels as a linear filter
does. The average MAE values for 10 instances of the images in figure 6.10 are
summarised in the table below.
noise type     input    normaver   median   bilateral   meanshift   chan.sm.
Gaussian       0.0790   0.0305     0.0307   0.0277      0.0226      0.0230
Gaussian+S&P   0.0996   0.0389     0.0320   0.0377      0.0411      0.0234
The winner for Gaussian noise is mean-shift, with channel smoothing close
second. For Gaussian+salt&pepper noise, channel smoothing is way ahead of all
other methods, with the median filter being the runner up. The perceptual image
quality, which is what we would really like to optimise for, is a different matter
however.
² For stability of the parameter choice, the error was measured on 10 instances of the noise-contaminated input.
6.6 Applications of channel smoothing
The image denoising procedure is demonstrated in figure 6.11. The examples are
described below, row by row:
row 1 This is a repeat of the experiment in figure 6.2 (left), but with alpha synthesis
added, and with the number of channels chosen to match the noise (by
making the channels wider).
row 2 This is restoration after contamination with the same type of noise on a
natural image.
row 3 This is a repeat of the irregularly sampled signal restoration experiment in [48], but using channel smoothing instead of normalized averaging to interpolate values at the missing positions. Note that this is image denoising and interpolation combined; if just interpolation is sought, normalized averaging is preferable, since the piecewise constant assumption modifies the image content slightly.
row 4 This is an example of simple image editing by means of a confidence map. The right edge of the image is black due to problems with the frame grabber. This has been corrected by setting the confidence at these locations to zero. Additionally, two cars have been removed by setting the confidences of their pixels to zero. Note that more sophisticated methods, such as texture synthesis [25], exist for removing objects in an image.
6.6.1 Extensions
Channel smoothing has been extended to directional smoothing by Felsberg in [28]. This is needed if we want to enhance thin periodic patterns, such as those present in fingerprints. Anisotropic channel filtering is however outside the scope of this thesis.
6.7 Concluding remarks
In this chapter we have investigated the channel smoothing technique for image denoising. A number of problems with the straightforward approach have been identified and solutions have been suggested. Especially the alpha synthesis technique seems promising, and should be investigated further. Channel smoothing has also been compared to a number of common image denoising techniques, and was found to be comparable to, or better than, all tested methods in an MAE sense. The real criterion should however be the perceived image quality, and this has not been investigated.
Figure 6.10: Comparison of filters. Row A: Gaussian noise (σ = 0.1). Row B: Gaussian noise (σ = 0.1) and 5% salt&pepper pixels. Each row shows, left to right: input, normalized average, median filter, bilateral filter, mean-shift, and channel smoothing. Filter parameters: normalized average: σ = 1.17; median filter: symmetric border extension and 5 × 5 window; bilateral filter: σs = 1.64, σp = 0.30; mean-shift filter: σs = 4.64, σp = 0.29; channel smoothing: K = 8, σ = 1.76, α = 3σ (not optimised).
Figure 6.11: Examples of channel smoothing. Columns left to right: input images, input confidence, output. Rows top to bottom: 100 × 100 pixels, confidence all ones, K = 7, σ = 1.7, α = 3σ; 512 × 512 pixels, confidence all ones, K = 7, σ = 1.7, α = 3σ; 512 × 512 pixels, 10% density, K = 5, σ = 0.3, α = 3σ; 256 × 256 pixels, edited confidence map, K = 22, σ = 1.4, α = 3σ. The first two images have been contaminated with Gaussian noise with σ = 0.1 (intensities ∈ [0, 1]), and 5% salt&pepper pixels.
Chapter 7
Homogeneous Regions in Scale-Space
In this chapter we develop a hierarchical blob feature extraction method. The
method works on vector fields of arbitrary dimension. We demonstrate that the
method can be used in wide baseline matching to align images. We also extend
the method to cluster constant slopes for grey-scale images.
7.1 Introduction
For large amounts of smoothing, the channel smoothing output looks almost like
a segmented image (see e.g. figure 6.2). This is what inspired us to develop the
simple blob detection algorithm¹ to be presented in this chapter. The channel
smoothing operation is non-linear, and fits into the same category as clustering
and robust estimation techniques. The smoothing operation performed on each
channel is however linear, and thus relates to linear scale-space theory.
7.1.1 The scale-space concept
When applying successive low-pass filtering to a signal, fine structures are gradually removed. This is formalised in the theory of scale space[106, 72]. Scale space
is the extension of a signal f (x), by means of a blurring kernel g(x, σ) into a new
signal
fs(x, σ) = (f ∗ g(σ))(x) .        (7.1)
The parameter σ is the scale coordinate. The original signal is embedded in
the new signal since fs (x, 0) = f (x). The kernel g(x, σ) is typically a Gaussian,
but Poisson kernels have also been used [27]. Figure 7.1 contains an illustration
of the scale-space concept.
¹ The method was originally presented in [41]. Here we have extended the method to deal with colour images, and also made a few other improvements.
Figure 7.1: Gaussian scale space of a 1D-signal. Left: f(x). Right: fs(x, σ).
The fact that fine structures are removed by blurring motivates the use of a
lower sample density at coarser scales, as is done in a scale pyramid.
7.1.2 Blob features
Homogeneity features are called blobs in scale-space theory [72]. In comparison
with segmentation, blob detection has a more modest goal—we do not attempt
to segment out exact shapes of objects, instead we want to extract robust and
repeatable features. The difference between segmentation and blob detection is
illustrated by the example in figure 7.2. As can be seen, the blob representation
discards exact shapes, and thin connections between patches are neglected.
Figure 7.2: Difference between segmentation and blob detection. Left: one segment. Right: two blobs.
Blob features have been used as texture descriptors [73] and as features for
image database search [6]. For a discussion of the similarities and differences of
other approaches and the one presented here, the reader is directed to [40].
Blob features are related to maximally stable extremal regions (MSER) [82], and to affinely invariant neighbourhoods [99]. MSER features are regions grown around an intensity extremum (max or min) and are used to generate affine invariant frames, which are then used for view-based object recognition [82]. Affinely invariant neighbourhoods are found by starting at intensity extrema and finding the nearest extremum along rays emanating from the point. These extrema are then linked to form a closed curve, which is used to define an affine invariant [99].
Figure 7.3: Steps in blob extraction. Left to right: image, clustering pyramid, label image, raw blobs, merged blobs.
7.1.3 A blob feature extraction algorithm
The blob estimation procedure uses a scale pyramid. Each position and scale in
the pyramid contains a measurement–confidence pair (p, r)². The confidence is a
binary variable, i.e. r ∈ {0, 1}. It signifies the absence or presence of a dominant
measurement in the local image region. When r = 1, the dominant measurement
is found in p. This representation is obtained by non-linear means, in contrast
to most scale-space methods which are linear [72], and thus obtain the average
measurement.
The pyramid is used to generate a label image, from which we can compute the
moments of each region. The shape of each region is approximated by its moments
of orders 0, 1 and 2. These moments are conveniently visualised as ellipses, see
right part of figure 7.3. Finally we merge blobs which are adjacent and of similar
colour using an agglomerative clustering scheme. The following sections will each
in turn describe the different steps of the algorithm, starting with the generation
of the clustering pyramid.
7.2 The clustering pyramid
When building the clustering pyramid, we view each image pixel as a measurement
p with confidence r = 1. We then expand the image into a set of channel images.
For each of these channel images we generate a low-pass pyramid. The channels
obtained from the input image constitute scale 1. Successively coarser scales are
obtained by low-pass filtering, followed by sub-sampling. The low-pass filter used
consists of a horizontal and a vertical 4-tap binomial kernel [1 3 3 1]/8.
Since the filter sums to 1, the confidence values of the decoded (p, r) pairs
correspond to fractions of the area covered by the filter. Thus, we construct the
low-pass pyramids, and decode the dominant (p, r) pair in each position. Finally
we make the confidence binary by setting values below rmin to zero, and values
above or equal to rmin to 1. Typically we use the area threshold rmin = 0.5. Figure
7.4 shows such a pyramid for an aerial image.
In principle this pyramid generation method can also be applied to vector field
images, by extending the channel representation as described in section 5.6. The
number of channels required for vector fields of dimension higher than 2 does
however make this approach rather expensive with respect to both computational and storage requirements. For example, a channel representation of an RGB colour image with 26 channels per colour band will give us 26³ = 17576 channel images to filter.

² For generality we will use a vector notation for p, although we will initially use scalar-valued images.

Figure 7.4: Clustering pyramid created using K = 26 channels, spaced according to (6.1). Positions with confidence r = 0 are indicated with crosses.
7.2.1 Clustering of vector fields
To perform clustering on vector fields, we will replace channel filtering with another
robust estimation technique. The representation at the next scale, p∗ , is now
generated as the solution to the weighted robust estimation problem
arg min_{p∗} Σk wk rk ρ( ||pk − p∗|| ) .        (7.2)
Here wk are weights from the outer product of two binomial kernels, and ρ(d) is
a robust error norm. In contrast to linear filter theory we cannot use a separable
optimisation, so for a 1D binomial kernel [1 3 3 1]/8 we will have to take all 16
pixels in a local neighbourhood into account.
Note that for most choices of ρ(d), problem (7.2) is not convex, and thus local
minima exist. We are however looking for the global min, and will employ a
technique known as successive outlier rejection (SOR) to find it. An iteration of
SOR is computed as
p∗est = (1/Nr) Σk pk rk wk ok    where    Nr = Σk rk wk ok .        (7.3)
Here ok are outlier rejection weights, which are initially set to 1. After each
iteration we find the pixel with largest residual dk = ||p∗est − pk ||. If dk > dmax ,
we remove this pixel by setting ok = 0. This procedure is iterated until there are
no outliers left. The found solution will be a fixed point of a gradient descent
on (7.2), for the cut-off squares error norm (ρ(d) = d² for |d| < dmax, and dmax² otherwise). Furthermore, provided that more than half the data supports it, the solution will be either the global min, or close to the global min. The SOR approach thus effectively solves the initialisation
problem which exists in the commonly applied iterated reweighted least squares
(IRLS) technique, see section 4.2.4, or e.g. [109, 95]. Initialisation is also the
reason why mean-shift filtering [19, 20] is unsuitable for generation of a clustering
pyramid. The importance of initialisation is demonstrated in figure 7.5. Here we
have applied SOR to minimise (7.2), and compared the result with mean-shift
filtering, and IRLS. The IRLS method was, just like mean-shift, initiated with the
original pixel value. As can be seen, only SOR is able to reject outliers. Note that
since each iteration either removes one outlier or terminates, there is an upper
limit to the number of required iterations (e.g. 16 for a 4 × 4 neighbourhood).
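The SOR loop of (7.3) is short; the following is a minimal sketch on one flattened local neighbourhood, where the array layout and the small numerical guard are assumptions of this sketch.

```python
import numpy as np

def sor(p, r, w, d_max):
    """Successive outlier rejection, iterating (7.3) on a local neighbourhood.
    p: (N, D) measurements, r: (N,) confidences, w: (N,) spatial weights from
    the outer product of two binomial kernels."""
    o = np.ones(len(p))                                   # outlier rejection weights
    while True:
        wt = r * w * o
        p_est = (wt[:, None] * p).sum(0) / max(wt.sum(), 1e-12)
        d = np.linalg.norm(p - p_est, axis=1)
        d[o == 0] = -1.0                                  # never re-examine rejected pixels
        k = int(d.argmax())
        if d[k] <= d_max:
            return p_est, o                               # no outliers left
        o[k] = 0.0                                        # reject the worst pixel and iterate
```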
Figure 7.5: Importance of initialisation. Left to right: Input image, IRLS output
(6 × 6 window), mean-shift output (spatial radius 3), SOR output (6 × 6 window).
As mentioned, the SOR method is closely related to (7.2), for the cut-off squares
error norm. The sharp boundary between acceptance and rejection in cut-off
squares is usually undesirable. As a remedy to this we add a few extra IRLS
iterations with a smoother error norm. An iteration of IRLS on (7.2) has the same form as (7.3), but with the outlier rejection weights replaced with ok = ρ′(dk)/dk. If the error norms are similar in shape and extent, we should already
be close to the solution, since we are then minimising similar functions. We will
use the weighting function
o(dk) = (1 − (dk/dmax)²)²   if |dk| < dmax,   and 0 otherwise        (7.4)
which corresponds to the biweight error norm [109]. The dmax parameter defines
the scale of the error norm, and is used to control the sensitivity of the algorithm.
To compute p∗ and r∗ in each position in the pyramid, we thus first find p∗ with
SOR, followed by IRLS. The output confidence r∗ is then computed as
r∗ = 1   if Σk rk wk ok ≥ rmin Σk wk,   and 0 otherwise.        (7.5)
Here ok are the weights used in the last IRLS iteration, see (7.3), and just like in
section 7.2, rmin is a threshold which corresponds to the minimum required area
fraction belonging to the cluster. The clustering process is repeated for successively
coarser scales until the pyramid has been filled.
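The IRLS polish with the biweight norm (7.4) and the binary confidence of (7.5) can be sketched as follows, continuing from the SOR output above; the three-iteration count is an assumption, and rmin = 0.5 is the typical value given in the text.

```python
import numpy as np

def biweight(d, d_max):
    """Outlier weights (7.4) corresponding to the biweight error norm."""
    return np.where(np.abs(d) < d_max, (1.0 - (d / d_max) ** 2) ** 2, 0.0)

def irls_polish(p, r, w, p_est, d_max, r_min=0.5, iters=3):
    """A few IRLS iterations started from the SOR estimate, followed by the
    binary output confidence of (7.5)."""
    o = np.ones(len(p))
    for _ in range(iters):
        d = np.linalg.norm(p - p_est, axis=1)
        o = biweight(d, d_max)
        wt = r * w * o
        p_est = (wt[:, None] * p).sum(0) / max(wt.sum(), 1e-12)
    r_out = 1.0 if (r * w * o).sum() >= r_min * w.sum() else 0.0
    return p_est, r_out
```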
7.2.2 A note on winner-take-all vs. proportionality
There is a distinct difference between the channel approach and the SOR+IRLS
approach with respect to propagation of information through the pyramid. For
the channel variant, all pixels in the input image will have contributed to the value
of the channel vector at the top of the pyramid. For the SOR+IRLS variant, each
pixel at each scale only describes the dominant colour component, and thus all the
other values will be suppressed. This is the same approach as is taken in elections
in the UK and the US, where each constituency only gets to select one (or one
kind of) representative to the next level. The channel approach on the other hand
corresponds to the Swedish system of propagating the proportions of votes for the
different parties to the next level.
The number of channels K, and the maximum property distance dmax can be
related according to dmax = 2t1 = 2(K + 1 − N )/(ru − rl ), where N , ru , and rl
are defined in section 3.3. The relationship comes from 2t1 being the approximate
boundary between averaging and rejection, see section 5.4.
The two methods are quite similar in behaviour, but even for grey-scale images
on conventional computer architectures, the SOR+IRLS variant is a factor 5 or
more faster. This is due to the small support (4 × 4) of the parameter estimation.
On hardware with higher degrees of parallelism this could however be different.
7.3 Homogeneous regions
The generation of the clustering pyramid was bottom up, and we will now continue
by doing a top-down pass over the pyramid. Note that the region propagation
described here is different from the one in [41]. The changes since [41] have speeded
up the algorithm and also made the result after region merging more stable.
We start at the top scale and generate an empty label image. Each time we
encounter a pixel with a property distance above dmax from the pixels above it, we will use it as a seed for a new region. Each new seed is assigned an integer label. We allow each pixel to propagate its label to the twelve nearest pixels below it, provided that they are sufficiently similar, see figure 7.6 (left). For a pixel on
the scale below, this means that we should compare it to three pixels above it,
see figure 7.6 (right). If several pixels on the scale above have a property distance
below dmax , we propagate the label of the one with the smallest distance.
The algorithm for label image generation is summarised as follows:
step 1 Generate an empty label image at the top scale.
step 2 Assign new labels to all unassigned pixels with r = 1.
step 3 Move to the next scale.
step 4 Compare each pixel with r = 1 to the three nearest pixels on the scale
above. Propagate the label of the pixel with the smallest property
distance, if smaller than dmax .
step 5 If we are at the finest scale we are done, otherwise go back to step 2.
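A minimal sketch of these steps is given below. For simplicity, each pixel is compared to a 2 × 2 block of pixels on the scale above, rather than the exact three-pixel neighbourhood of figure 7.6; this simplification and the data layout are assumptions of the sketch.

```python
import numpy as np

def propagate_labels(pyramid, d_max):
    """Label-image generation, roughly following steps 1-5 above. `pyramid` is
    a list of (p, r) pairs ordered coarse to fine; p is an (H, W, D) property
    map and r a binary (H, W) confidence map."""
    prev_p, prev_l, lab = None, None, None
    next_label = 1
    for level, (p, r) in enumerate(pyramid):
        finest = (level == len(pyramid) - 1)
        H, W = r.shape
        lab = np.zeros((H, W), dtype=int)
        for y in range(H):
            for x in range(W):
                if r[y, x] == 0:
                    continue
                if prev_l is not None:
                    best_d, best_l = d_max, 0
                    for cy in range(y // 2, min(y // 2 + 2, prev_l.shape[0])):
                        for cx in range(x // 2, min(x // 2 + 2, prev_l.shape[1])):
                            if prev_l[cy, cx] == 0:
                                continue
                            d = np.linalg.norm(p[y, x] - prev_p[cy, cx])
                            if d < best_d:            # propagate closest label (step 4)
                                best_d, best_l = d, prev_l[cy, cx]
                    lab[y, x] = best_l
                if lab[y, x] == 0 and not finest:     # seed a new region (step 2)
                    lab[y, x] = next_label
                    next_label += 1
        prev_p, prev_l = p, lab
    return lab                                        # label image at the finest scale
```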
Figure 7.6: Label propagation. Left: A pixel can propagate its label to twelve
pixels at the scale below. Right: This can be implemented by comparing each
pixel with three pixels on the scale above.
The result of this algorithm is shown in figure 7.7. As can be seen, this is an
oversegmentation, i.e. image patches which are homogeneous have been split into
several regions.
Figure 7.7: Result of label image generation. Left to right: Input image, label
image, blobs from regions.
7.3.1 Ellipse approximation
The label image l(x) : Z² → N is a compact representation of a set of binary masks

vn(x) = 1 if l(x) = n, and 0 otherwise.        (7.6)
The raw moments of such a binary mask vn(x) : Z² → {0, 1} are defined by the weighted sum

µkl = Σx1 Σx2 x2^k x1^l vn(x).        (7.7)
For a more extensive discussion on moments of binary masks, see e.g. [92]. For all the regions {vn}, n = 1 . . . N, in the image we will now compute the raw moments of order 0
to 2, i.e. µ00 , µ01 , µ10 , µ02 , µ11 , and µ20 . This can be done using only one for-loop
over the label image.
The raw moments are then converted to measures of the area an , the centroid
vector mn , and the inertia matrix In according to
an = µ00 ,    mn = (1/µ00) ( µ01 ; µ10 )    and    In = (1/µ00) ( µ02  µ11 ; µ11  µ20 ) − mn mn^T .        (7.8)
Using the input image p(x, y) we also compute and store the average measurements for all regions,
pn = (1/µ00) Σx1 Σx2 p(x1, x2) vn(x1, x2).        (7.9)
If a region has the shape of an ellipse, its shape can be retrieved from the inertia
matrix, see theorem C.1 in the appendix. Even if the region is not a perfect ellipse,
the ellipse corresponding to the inertia matrix is a convenient approximation of
the region shape. From the eigenvalue decomposition I = λ1 ê1 ê1^T + λ2 ê2 ê2^T with λ1 ≥ λ2 we can find the axes of the ellipse as 2√λ1 ê1 and 2√λ2 ê2 respectively,
see theorem C.2 in the appendix. Since I = IT each blob can be represented by
1 + 2 + 3 + N parameters where N is the dimensionality of the processed vector
field. I.e. we have 7 parameters for grey-scale images, and 9 for RGB images etc.
A visualisation of the regions as ellipses is shown in figure 7.7 (right).
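For illustration, the quantities of (7.8) and the ellipse axes can be computed as in the sketch below, working directly with central moments (which is equivalent to the raw-moment conversion above); the (x1, x2) = (row, column) convention and the function name are assumptions of this sketch.

```python
import numpy as np

def blob_from_mask(v):
    """Area, centroid, inertia matrix and ellipse axes of a binary mask v."""
    x1, x2 = np.nonzero(v)
    mu00 = float(len(x1))                    # area
    m = np.array([x1.mean(), x2.mean()])     # centroid
    X = np.stack([x1, x2], axis=1) - m
    I = X.T @ X / mu00                       # central second moments (inertia matrix)
    lam, E = np.linalg.eigh(I)               # eigenvalues in ascending order
    axes = 2.0 * np.sqrt(lam[::-1])          # axis lengths 2*sqrt(lambda1), 2*sqrt(lambda2)
    return mu00, m, I, axes, E[:, ::-1]      # eigenvectors reordered to match

# A filled 20 x 10 rectangle gives an ellipse elongated along its long side:
v = np.zeros((40, 40), dtype=int); v[5:25, 10:20] = 1
print(blob_from_mask(v)[3])
```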
7.3.2 Blob merging
As stated before the result of the label image generation is an oversegmentation,
and thus the final stage of the algorithm is a merging of adjacent blobs.
Due to the linearity of the raw moments (7.7) the moments of a combined
mask can be computed from the moments of its parts. I.e. if we have v = v1 + v2
we get µij (v) = µij (v1 ) + µij (v2 ). For the corresponding property vectors we get
p(v) = (µ00 (v1 )p(v1 ) + µ00 (v2 )p(v2 ))/µ00 (v).
Candidates for merging are selected based on a count of pixels along the border
of two regions with similar colours. We define an adjacency matrix M with Mij
signifying the number of pixels along the common border of blobs i and j. The
adjacency matrix is computed by modifying the region propagation algorithm in
section 7.3, such that step 4 at the finest scale computes the count. Whenever
two pixels (with different labels) at the coarser scale match a pixel at the finer
scale, we are at a region boundary, and thus add 1 to the corresponding position
in the adjacency matrix. Since M is symmetric, only the upper triangle need to
be stored. Using M we can now select candidates for merging according to
Mij > mthr √( min(µ00(vi), µ00(vj)) )        (7.10)
where mthr is a tuning threshold. Typically we use mthr = 0.5. This choice results
in a lot of mergers, but thin strands of pixels are typically not allowed to join two
blobs into one, see figure 7.2. The square root in (7.10) is motivated by Mij being
a length, and µ00 being an area.
All merging candidates are now placed in a list. They are successively merged
pairwise, starting with the most similar pair according to the property distance
d = ||pi −pj ||. After merging, the affected property distances are recomputed, and
a new smallest distance pair is found. The pairwise merging is repeated until none
of the candidate mergers have a property distance below dmax . This clustering
scheme falls into the category of agglomerative clustering [63].
Finally we remove blobs with areas (i.e. µ00 ) below a threshold amin = 20.
Figure 7.8: Result of blob merging. Left to right: blobs from regions (384 blobs),
merged blobs (92 blobs, mthr = 0.5), reprojection of blob colours onto label image
regions.
Figure 7.8 shows blob representations of the image in figure 7.7 (left), before
and after merging. In order to show the quality of the segmentation, the blob
colours have been used to paint the corresponding regions in the label image,
see figure 7.8 (right). In this example, the clustering pyramid is created using
K = 26 channels that are spaced according to (6.1). The blob representation
initially contains 384 blobs, which after merging and removal of small blobs drops
to 92. This gives a total of 644 parameters for the entire image. Compared to the
348 × 287 input image pixels this is a factor 155 of data reduction.
7.4 Blob features for wide baseline matching
The usefulness of an image feature depends on the application, and should thus
be evaluated in a proper context. We intend to use the blob features for view
based object recognition, wide-baseline matching, and aerial navigation. All of
these topics are however out of the scope of this thesis, and we will settle for a
very simple demonstration. Using pairs of images captured from a helicopter, we
detect blobs, and store their centroids in two lists {mk}, k = 1 . . . K, and {nl}, l = 1 . . . L. We then
find pairs of points that correspond given a homographic model
h ( n̂k ; 1 ) = H ( mk ; 1 )    and    h ( m̂l ; 1 ) = H⁻¹ ( nl ; 1 ) .        (7.11)
Two points are said to correspond when the residual
δkl = √( (n̂k − nl)^T (n̂k − nl) ) + √( (m̂l − mk)^T (m̂l − mk) )        (7.12)
is below a given threshold.
Figure 7.9: Correspondences for blobs from a video sequence. Each row shows a
pair of matched images.
The correspondence is found using a RANSAC-like method [109]. We start by
selecting 4 random points in image 1. For each of these we select a random point
in the other image among those with a property distance dkl = ||pm,k − pn,l ||
below dmax . For each such random correspondence, we compute the pairwise
geometric residuals (7.12) for all point pairs and keep the pairs with distances
below a threshold δmax . We stop once we get more than 15 matches. The found
matches are then used to estimate a new homography using scaled total least
squares (STLS) with the scaling described in [57]. We then find correspondences
given the new homography and recompute the residuals. This procedure is iterated
until convergence, which usually is reached after three to four cycles.
Figure 7.9 shows the found correspondences for three image pairs from a video
sequence. The number of blobs Nb, the number of correspondences Nm, and the inlier fraction ε = Nm/Nb are listed in the table below.
Frame   1      2      3      4      5      6
Nb      142    163    151    139    161    176
Nm       61     61     76     76     70     70
ε       0.43   0.37   0.50   0.55   0.43   0.40

7.4.1 Performance
A C-implementation of the blob feature extraction algorithm takes about 4 seconds
to process a 360 × 288 RGB image on a Sun Ultra 60 (296MHz). Moving the
implementation to an Intel Pentium 4 (1.9GHz) platform resulted in computation
times below one second per frame.
7.4.2 Removal of cropped blobs
Since an image only depicts a window of a scene, some of the homogeneous regions
in the scene will only partially be contained in the image. Such cropped regions
will give rise to blobs which change shape as the camera moves, in a manner
which does not correspond to camera movement. Thus, if the blobs are to be used
for image matching, we will probably gain stability in the matching by removing
cropped blobs. Most such blobs can be removed by calculating the bounding box
of the ellipse corresponding to the blob shape, and removing those blobs which
have their bounding boxes partially outside the image.
In appendix C, theorem C.3, the outline of an ellipse is shown to be given by a
parameter curve. To find a bounding box for an ellipse, we rewrite this curve into
two parameter curves
x = ( r11  r12 ; r21  r22 ) ( a cos t ; b sin t ) + m = ( a r11 cos t + b r12 sin t ; a r21 cos t + b r22 sin t ) + m        (7.13)

  = ( √(a² r11² + b² r12²) sin(t + φ1) ; √(a² r21² + b² r22²) sin(t + φ2) ) + m        (7.14)
Since sin(t + ϕ) assumes all values in the range [−1, 1] during one period, the
bounding box is given by the amplitudes
A1 = √( a² r11² + b² r12² )    and    A2 = √( a² r21² + b² r22² ) .        (7.15)
The bounding box becomes

x1 ∈ [m1 − A1, m1 + A1]   and   x2 ∈ [m2 − A2, m2 + A2] .        (7.16)
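A small sketch of (7.15)-(7.16) and of the resulting cropping test follows; the mapping of (x1, x2) onto image rows and columns and the function names are assumptions of this sketch.

```python
import numpy as np

def ellipse_bbox(m, R, a, b):
    """Bounding box of an ellipse with centre m, rotation matrix R = (r_ij)
    and axis parameters a, b, using the amplitudes (7.15)."""
    A1 = np.sqrt(a**2 * R[0, 0]**2 + b**2 * R[0, 1]**2)
    A2 = np.sqrt(a**2 * R[1, 0]**2 + b**2 * R[1, 1]**2)
    return (m[0] - A1, m[0] + A1), (m[1] - A2, m[1] + A2)

def is_cropped(bbox, size):
    """True if the bounding box sticks out of an image of the given
    (rows, cols) size; such blobs are removed before matching."""
    (lo1, hi1), (lo2, hi2) = bbox
    return lo1 < 0 or lo2 < 0 or hi1 > size[0] - 1 or hi2 > size[1] - 1
```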
7.4.3 Choice of parameters
The blob feature extraction algorithm has three parameters: the spatial kernel width, the area threshold rmin, and the range kernel width dmax. We will now give some suggestions on how they should be set.
1. Spatial kernel width. A larger spatial kernel will make the method less sensitive to translations of the grid. At the same time however, it will also reduce the amount of detail obtained. A trade-off that seems to work well for a wide range of images is the choice of a 4 × 4 neighbourhood. To speed up the algorithm, we can even consider using the 12-pixel neighbourhood consisting of the central pixels in the 4 × 4 region, and omit the spatial weights.
2. Area threshold rmin. Characteristic for this parameter is that low values give more mergers of non-connected but adjacent regions, which results in fewer features. High values will cause less information to be propagated to the higher levels of the pyramid. This leads to fewer mergers, and thus to more regions at the lowest level, and consequently to higher computation times. Typically we will use the intermediate value rmin = 0.5.
3. Range kernel width dmax. This parameter decides whether two colours should be considered the same or not. A small value will give lots of blobs, while a large value gives few blobs. A suitable choice of dmax depends on the input images.
Typically we will thus mainly modify the dmax parameter and let the others
have the fixed values suggested above.
7.5 Clustering of planar slopes
The implicit assumption in the previous clustering methods has been that the
image consists of piecewise constant patches. We could instead assume a constant
slope. This is reasonable e.g. in depth maps from indoor scenes and of various
man-made objects. We now assume a scalar input image, i.e. f : Z² → R, and a local image model of the type

f(x) = ( 1   x1 − m1   x2 − m2 ) p(x) .        (7.17)
Here m = ( m1  m2 )^T is the centre spatial position of the local model. The parameter vector in each pixel can be interpreted as p(x) = (mean, slopex, slopey)^T.
When building the first level of the clustering pyramid we thus have to go from
the image f to a parametric representation p. In principle this could be done by
applying plain derivative filters. This would however distort the measurements at
the edges of each planar patch, and thus make the size of the detected regions
smaller, as well as limiting the sizes of objects that can be detected. Instead we
will estimate the parametric representation using a robust linear model
arg min_{p∗} Σk wk rk ρ( || fk − ( 1   x1,k − m1   x2,k − m2 ) p∗ || )        (7.18)
where wk are weights from a binomial kernel, typically in a 4 × 4 neighbourhood.
Like in the colour clustering method, see section 7.2, we solve (7.18) by SOR
followed by a few M-estimation steps. For the 4 × 4 region, we define the following
quantities

M = ( 1 1 1 1 ; 1 1 1 1 ; 1 1 1 1 ; 1 1 1 1 ) ,    X = ( 1 1 1 1 ; 2 2 2 2 ; 3 3 3 3 ; 4 4 4 4 ) − 2.5        (7.19)

Y = X^T ,    B = ( vec(M)   vec(X)   vec(Y) ) .        (7.20)
We also define a weight matrix W with diagonal elements (W)kk = ok wk rk . Here
ok are outlier rejection weights, wk are spatial binomial weights, and rk are the
input confidences. For the 4 × 4 region f , the iterations of the model parameter
estimation are now performed as
p∗est = (WB)† W vec(f) .        (7.21)
Since we have just 3 parameters to estimate, we could use a (slightly less reliable)
matrix inversion instead
p∗est = (B^T W W B)⁻¹ B^T W W vec(f) .        (7.22)
This is faster since the matrix inverse can be computed in a non-iterative way. At
the start all outlier rejection weights are set to 1. After each iteration we find the
pixel with the largest residual
dk = | fk − ( 1   x1,k − m1   x2,k − m2 ) p∗est | .        (7.23)
If dk > dmax , we remove this pixel by setting ok = 0. By iterating until convergence
we will find a fixed point of (7.18), for the cut-off squares error norm. Furthermore,
if more than half the data supports the solution, we are guaranteed to be close to
the global min. Again, we polish the result by a few M-estimation steps with a
smoother kernel. The corresponding IRLS iteration will have the same form as the
iteration above, i.e. (7.21) or (7.22), with the exception that the outlier rejection weights are replaced with ok = ρ′(dk)/dk.
To obtain p∗ and r∗ for the first scale in the pyramid, we thus compute p∗
according to the above equations, and r∗ according to (7.5).
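A sketch of the robust plane fit in one 4 × 4 neighbourhood, combining SOR with a few biweight IRLS steps in the spirit of (7.19)-(7.23), is given below; the helper names, the pseudo-inverse solve and the loop structure are assumptions of this sketch.

```python
import numpy as np

def plane_fit_sor(f, r, w, d_max, irls_iters=3):
    """Robust fit of the local model (7.17) to a 4 x 4 patch f, with spatial
    weights w and confidences r (both 4 x 4 arrays)."""
    x1, x2 = np.meshgrid(np.arange(1, 5), np.arange(1, 5), indexing='ij')
    B = np.stack([np.ones(16), x1.ravel() - 2.5, x2.ravel() - 2.5], axis=1)
    f = f.ravel().astype(float)
    o = np.ones(16)                                    # outlier rejection weights

    def solve(o):
        # weighted least squares in the spirit of (7.21), via a pseudo-inverse
        wt = o * w.ravel() * r.ravel()
        return np.linalg.pinv(wt[:, None] * B) @ (wt * f)

    while True:                                        # successive outlier rejection
        p = solve(o)
        d = np.abs(f - B @ p)
        d[o == 0] = -1.0
        k = int(d.argmax())
        if d[k] <= d_max:
            break
        o[k] = 0.0
    for _ in range(irls_iters):                        # biweight polish
        d = np.abs(f - B @ p)
        o = np.where(d < d_max, (1 - (d / d_max) ** 2) ** 2, 0.0)
        p = solve(o)
    return p                                           # (mean, slope along x1, slope along x2)
```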
We typically set the area threshold to rmin = 0.85, which is considerably higher
than in the colour method. Pixels on the boundaries between two different slopes
will typically have a parameter estimate that does not correspond to any of the
slopes. In order to get zero confidence for such pixels, we could either reduce
the dmax threshold, or increase the rmin parameter. It turns out that the latter
option is preferable, since reducing dmax will also cause grey-level quantisation
in the input to propagate to the slope parameters. The result of the parameter
estimation step for a simple test image is shown in figure 7.10.
Figure 7.10: Result of parameter estimation. Left to right: Input, mean estimate,
x-slope estimate, y-slope estimate.
7.5.1 Subsequent pyramid levels
After we have obtained the parametric representation, we can generate the other
steps in the pyramid using almost the same technique as in the colour method.
The main difference is that we have to take the centre of the local neighbourhood
into account, and adjust the local mean level accordingly. That is, we now have a
robust estimation problem of the form
arg min_{p∗} Σk wk rk ρ( || p̃k − p∗ || )        (7.24)
where p̃k is a vector with the mean level adjusted
p̃k = ( p1,k − ((x1,k − m1) p2,k + (x2,k − m2) p3,k) s ;   p2,k ;   p3,k ) .        (7.25)
Here s is a factor that compensates for the fact that the pixel distance at scale
2 is twice that at scale 1 and so on. In other words we have s = 2^(scale−1). The residuals for SOR are now computed as

dk = √( (p̃k − p∗est)^T W (p̃k − p∗est) )        (7.26)
where W is a weight matrix of the form diag(W) = ( 1   wd   wd ). The parameter
wd allows us to adjust the relative importance of error in mean and error in slope.
Typically we set wd = 200.
7.5.2 Computing the slope inside a binary mask
To estimate the mean and slopes inside each mask vn, we will now assume a local signal model centred around the point m = ( m1  m2 )^T

f(x) = ( 1   x1 − m1   x2 − m2 ) p = p1 + p2 x1 − p2 m1 + p3 x2 − p3 m2 .        (7.27)
We define the moments ηkl of f (x) inside the mask vn as
ηkl = (1/N) Σx1,x2 vn(x1, x2) f(x1, x2) x1^k x2^l        (7.28)
where N = µ00 is the number of elements inside the mask vn . For the model (7.27)
we now get
η00 = p1 + p2 ( (1/N) Σx1,x2 vn(x1, x2) x1 ) − p2 m1 + p3 ( (1/N) Σx1,x2 vn(x1, x2) x2 ) − p3 m2        (7.29)
    = p1 + p2 m1 − p2 m1 + p3 m2 − p3 m2 = p1        (7.30)
as expected. For the first moments we obtain
η10 = p1 m1 + p2 ( (1/N) Σx1,x2 vn(x1, x2) x1² − m1² ) + p3 ( (1/N) Σx1,x2 vn(x1, x2) x1 x2 − m1 m2 )        (7.31)
    = p1 m1 + p2 I11 + p3 I12        (7.32)
and

η01 = p1 m2 + p2 ( (1/N) Σx1,x2 vn(x1, x2) x2 x1 − m2 m1 ) + p3 ( (1/N) Σx1,x2 vn(x1, x2) x2² − m2² )        (7.33)
    = p1 m2 + p2 I21 + p3 I22 .        (7.34)
This can be summarised as the system
( η10 ; η01 ) = p1 m + I ( p2 ; p3 )        (7.35)
where m and I are obtained according to (7.7) and (7.8). We thus first compute
m and I. We then compute p1 from η00 , see (7.30), and finally p2 and p3 as
( p2 ; p3 ) = I⁻¹ ( η10 − p1 m1 ;  η01 − p1 m2 ) .        (7.36)

7.5.3 Regions from constant slope model
The results of blob feature extraction on the test image in figure 7.10 are shown
in figure 7.11. As can be seen in this figure, most of the regions have successfully
been detected. The result is also compared with the output from clustering using
the piecewise constant assumption.
We stress that these are just first results. Some of the regions obtained before
merging (see figure 7.11, left) have a striped structure, in contrast to the case of the locally constant clustering (see figure 7.7, left). This suggests that the clustering
strategy might not be optimal. It is well known that clustering of lines using the
model
( x   y   1 ) ( cos φ   sin φ   −ρ )^T = 0        (7.37)

is preferable to using the model

y = kx + m        (7.38)
since estimation of very steep slopes (large k) becomes unstable. See section 5.6.3
for an example of the use of (7.37). This preference for a normal representation of
lines suggests that we should view the grey-level image as a surface, and cluster
surface normals instead.
The algorithm presented here is intended as a post-processing step for a depth-from-stereo vision algorithm. In depth maps of indoor scenes, a region with constant
slope could correspond to a floor, a wall, or a door etc. A compact description of
such features will hopefully be useful for autonomous indoor navigation.
Figure 7.11: Blobs from piecewise linear assumption. Left to right: label image, detected blobs, reprojection of blob slopes to the corresponding label image
regions, and blobs from piecewise constant assumption (dmax = 0.16).
7.6 Concluding Remarks
In this chapter we have introduced a blob feature detection algorithm that works
on vector fields of arbitrary dimension. The usefulness of the algorithm has been
demonstrated on a wide baseline matching task. We have also extended the blob
feature detection to cluster constant slopes instead of locally constant colour. The
slope clustering has however not been fully evaluated, and some design choices,
such as the chosen representation of the slopes, might not be optimal. Specifically, the option to cluster surface normals instead should be tested.
Chapter 8
Lines and Edges in Scale-Space
In this chapter we develop a representation of lines and edges in a scale hierarchy.
By using three separate maps, the identities of lines and edges are kept separate.
Further, the maps are made sparse by inhibition from coarser scales.
8.1 Background
Biological vision systems are capable of instance recognition in a manner that is
vastly superior to current machine vision systems. Perceptual experiments [83, 12]
are consistent with the idea that they accomplish this feat by remembering a
sparse set of features for a few views of each object, and are able to interpolate
between these (see discussion in chapter 2). What features biological systems
use is currently not certain, but we have a few clues. It is a widely known fact
that difference-of-Gaussians and Gabor-type wavelets are useful models of the
first two levels of processing in biological vision systems [5]. There is however no
general agreement on how to proceed from these simple descriptors, toward more
descriptive and more sparse features.
One way might be to detect various kinds of image symmetries such as circles,
star-patterns, and divergences (such as corners) as was done in [65, 64]. Two very
simple kinds of symmetries are lines and edges¹, and in this chapter we will see
how extraction of lines and edges can be made more selective, in a manner that
is locally continuous both in scale and spatially. An important difference between
our approach and other line-and-edge representations, is that we keep different
kinds of events separate instead of combining them into one compact feature map.
1 To be strict, an edge is better described as an anti-symmetry.
8.1.1 Classical edge detection
Depth discontinuities in a scene often lead to intensity discontinuities in images
of that scene. Thus, discontinuities such as lines and edges in an image tend to
correspond to object boundaries. This fact has been known and used for a long time in image processing. One early example that is still widely used is the Sobel
edge filters [91]. Another common example is the Canny edge detector [14] that
produces visually pleasing binary images. The goal of edge detecting algorithms in
image processing is often to obtain useful input to segmentation algorithms [92],
and for this purpose, the ideal step edge detection that the Canny edge detector
performs is in general insufficient [85], since a step edge is just one of the events
that can divide the areas of a physical scene. Since our goal is quite different (we
want a sparse scene description that can be used in view-based object recognition),
we will discuss conventional edge detection no further.
8.1.2 Phase-gating
Lines and edges correspond to specific local phases of the image signal. Line
and edge features are thus related to the local phase feature. Our use of local
phase originates from the idea of phase-gating, originally mentioned in a thesis by
Haglund [55]. Phase-gating is a postulate that states that an estimate from an
arbitrary operator is valid only in particular places, where the relevance of the
estimate is high [55]. Haglund uses this idea to obtain an estimate of size, by
only using the even quadrature component when estimating frequency, i.e. he only
propagates frequency estimates near 0 and π phase.
8.1.3 Phase congruency
Mach bands are illusory peaks and valleys in illumination that humans, and other
biological vision systems perceive near certain intensity profiles, such as ramp
edges (see figure 8.1). Morrone et al. have observed that these illusory lines, as
well as perception of actual lines and edges, occur at positions where the sum of
Fourier components above a given threshold have a corresponding peak [78]. They
also note that the sum of the squared output of even and odd symmetric filters
always peaks at these positions, which they refer to as points of phase congruency.
This observation has led to the invention of phase congruency feature detectors
[68]. At points of phase congruency, the phase is spatially stable over scale. This
is a desirable property for a robust feature. However, phase congruency does
not tell us which kind of feature we have detected; is it a line, or an edge? For
this reason, phase congruency detection has been augmented by Reisfeld to allow
discrimination between line, and edge events [86]. Reisfeld has devised what he
calls a Constrained Phase Congruency Detector (CPCT for short), that maps a
pixel position and an orientation to an energy value, a scale, and a symmetry
phase (0, ±π/2 or π). This approach is however not quite suitable for us, since
the map produced is of a semi-discrete nature; each pixel is either of 0, ±π/2 or
π phase, and only belongs to the scale where the energy is maximal. The features
we want should on the contrary allow a slight overlap in scale space, and have
responses in a small spatial range near the characteristic phases.

Figure 8.1: Mach bands near a ramp edge. Top-left: Image intensity profile. Bottom-left: Perceived image intensity. Right: Image.

8.2 Sparse feature maps in a scale hierarchy
Most feature generation procedures employ filtering in some form. The outputs
from these filters tell quantitatively more about the filters used than the structures they were meant to detect. We can get rid of this excessive load of data by allowing only certain phases of output from the filters to propagate further. These characteristic phases give invariant structural information, rather than all the phase components of a filter response.
We will now generate feature maps that describe image structure in a specific
scale, and at a specific phase. The distance between the different scales is one
octave (i.e. each map has half the centre frequency of the previous one.) The
phases we detect are those near the characteristic phases 0, π, and ±π/2. Thus,
for each scale, we will have three resultant feature maps (see figure 8.2).
Figure 8.2: Scale hierarchies. (Panels: image scale pyramid, and the 0 phase, π phase, and π/2 phase maps.)
This approach touches the field of scale-space analysis pioneered by Witkin
[106]. See [72] for a recent overview of scale space methods. Our approach to scale
space analysis is somewhat similar to that of Reisfeld [86]. Reisfeld has defined
what he calls a Constrained Phase Congruency Transform (CPCT), that maps a
pixel position and an orientation to an energy value, a scale, and a symmetry phase
(0, π, ±π/2, or none). We will instead map each image position, at a given scale, to
three complex numbers, one for each of the characteristic phases. The argument of
the complex numbers indicates the dominant orientation of the local image region
at the given scale, and the magnitude indicates the local signal energy when the
phase is near the desired one. As we move away from the characteristic phase, the
magnitude will go to zero. This representation will result in a number of complex
valued images that are quite sparse, and thus suitable for pattern detection.
8.2.1 Phase from line and edge filters
For signals containing multiple frequencies, the phase is ambiguous, but we can
always define the local phase of a signal, as the phase of the signal in a narrow
frequency range.
The local phase can be computed from the ratio between a band-pass filter
(even, denoted fe ) and its quadrature complement (odd, denoted fo ). These two
filters are usually combined into a complex valued quadrature filter, f = fe + ifo
[48]. The real and imaginary parts of a quadrature filter correspond to line, and
edge detecting filters respectively. The local phase can now be computed as the
argument of the filter response, q(x) = (s ∗ f )(x), or if we use the two real-valued
filters separately, as the four-quadrant inverse tangent arctan(qo(x), qe(x)).
To construct the quadrature pair, we start with a discretised lognormal filter
function, defined in the frequency domain

Ri(ρ) = exp( −ln²(ρ/ρi) / ln 2 )   if ρ > 0,   and Ri(ρ) = 0 otherwise.        (8.1)
The parameter ρi determines the peak of the lognorm function, and is called the
centre frequency of the filter. We now construct the even and odd filters as the
real and imaginary parts of an inverse discrete Fourier transform of this filter2
fe,i(x) = Re(IDFT{Ri(ρ)})        (8.2)
fo,i(x) = Im(IDFT{Ri(ρ)}) .      (8.3)
We write a filtering of a sampled signal, s(x), with a discrete filter fk (x) as qk (x) =
(s ∗ fk )(x), giving the response signal the same indices as the filter that produced
it.
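As an illustration of (8.1)-(8.3) and the phase computation above, the following NumPy sketch builds a 1D lognorm quadrature pair by an inverse DFT and estimates the local phase of a test signal. The signal, the filter length and the centre frequency are illustrative choices only; the filters actually used in this thesis are obtained with an optimisation technique [67] rather than a plain inverse DFT.

```python
import numpy as np

def lognorm_quadrature(n_taps, rho_i):
    """Even (line) and odd (edge) filters from a discretised lognorm function, cf. (8.1)-(8.3)."""
    rho = np.fft.fftfreq(n_taps) * 2 * np.pi          # frequency axis in radians
    R = np.zeros(n_taps)
    pos = rho > 0                                     # one-sided spectrum: quadrature filter
    R[pos] = np.exp(-np.log(rho[pos] / rho_i) ** 2 / np.log(2))
    f = np.fft.ifft(R)                                # complex spatial filter f = fe + i*fo
    return np.fft.fftshift(f.real), np.fft.fftshift(f.imag)

# Local phase of a toy signal with one line event and one edge event
s = np.zeros(128)
s[40] = 1.0                                           # line event
s[80:] = 1.0                                          # edge event
fe, fo = lognorm_quadrature(64, np.pi / 4)
qe = np.convolve(s, fe, mode='same')                  # line (even) response
qo = np.convolve(s, fo, mode='same')                  # edge (odd) response
local_phase = np.arctan2(qo, qe)                      # four-quadrant inverse tangent
```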
8.2.2 Characteristic phase
By characteristic phase we mean phases that are consistent over a range of scales,
and thus characterise the local image region. For natural images this mainly happens at local magnitude peaks of the responses from the even and odd filters.3

2 Note that there are other ways to obtain spatial filters from frequency descriptions that in many ways produce better filters [67].
In other words, the characteristic phases are almost always one of 0, π, and ±π/2.
This motivates our restriction of the phase to these three cases.
Figure 8.3: Line and edge filter responses in 1D. Top: a one-dimensional signal. Centre: line responses at ρi = π/2 (solid), and π/4 and π/8 (dashed). Bottom: edge responses at ρi = π/2 (solid), and π/4 and π/8 (dashed).
Only some occurrences of these phases are consistent over scale though (see
figure 8.3). First, we can note that band-pass filtering always causes ringing in
the response. For isolated line and edge events this will mean one extra magnitude
peak (with the opposite sign) at each side of the peak corresponding to the event.
These extra peaks will move when we change frequency bands, in contrast to those
peaks that correspond to the line and edge features. Second, we can note that each
line event will produce one magnitude peak in the line response, and two peaks in
the edge response. The peaks in the edge response, however, will also move when
we change frequency bands. We can thus use stability over scale as a criterion to
sort out the desired peaks.
8.2.3 Extracting characteristic phase in 1D
Starting from the line and edge filter responses at scale i: qe,i , and qo,i , we now
define three phase channels
p1,i = max(0, qe,i)        (8.4)
p2,i = max(0, −qe,i)       (8.5)
p3,i = abs(qo,i) .         (8.6)
That is, we let p1,i constitute the positive part of the line filter response, corresponding to 0 phase, p2,i , the negative part, corresponding to π phase, and p3,i
the magnitude of the edge filter response, corresponding to ±π/2 phase.
3 A peak in the even response will always correspond to a zero crossing in the odd response,
and vice versa, due to the quadrature constraint.
Phase invariance over scale can be expressed by requiring that the phase at the
next lower octave has the same sign
p1,i = max(0, qe,i · qe,i−1 / ai−1) · max(0, sign(qe,i))        (8.7)
p2,i = max(0, qe,i · qe,i−1 / ai−1) · max(0, sign(−qe,i))       (8.8)
p3,i = max(0, qo,i · qo,i−1 / ai−1) .                           (8.9)
The first max operation in the equations above will set the magnitude to zero
whenever the filter at the next scale has a different sign. This operation will
reduce the effect of the ringings from the filters. In order to keep the magnitude
near the characteristic phases proportional to the local signal energy, we have
normalised the product with the signal energy at the next lower octave, ai−1 = √(qe,i−1² + qo,i−1²). The result of the operation in (8.7)-(8.9) can be viewed as a phase
description at a scale in between the two used. These channels are compared with
the original ones in figure 8.4.
Figure 8.4: Consistent phase in 1D. (ρi = π/4) p1,i, p2,i, p3,i according to (8.4)-(8.6) (dashed), and (8.7)-(8.9) (solid).
We will now further constrain the phase channels in such a way that only responses consistent over scale are kept. We do this by inhibiting the phase channels
with the complementary response in the third lower octave
c1,i = max(0, p1,i − α abs(qo,i−2))        (8.10)
c2,i = max(0, p2,i − α abs(qo,i−2))        (8.11)
c3,i = max(0, p3,i − α abs(qe,i−2)) .      (8.12)
We have chosen an amount of inhibition α = 2, and the base scale, ρi = π/4.
With this value we successfully remove the edge responses at the line event, and
at the same time keep the rate of change in the resultant signal below the Nyquist
frequency. The resultant characteristic phase channels will have a magnitude corresponding to the energy at scale i, near the corresponding phase. These channels
are compared with the original ones in figure 8.5.
As we can see, this operation manages to produce channels that indicate lines
and edges without any unwanted extra responses. An important aspect of this
operation is that it results in a gradual transition between the description of a
signal as a line or an edge. If we continuously increase the thickness of a line,
it will gradually turn into a bar that will be represented as two edges.4 This
phenomenon is illustrated in figure 8.6.
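The 1D processing of (8.4)-(8.12) can be transcribed almost directly. In the sketch below, qe and qo are assumed to be dictionaries of even and odd filter responses indexed by scale, with i−1 one octave below scale i; the small eps guarding the division is an addition of the sketch, not part of the equations.

```python
import numpy as np

def characteristic_phase_1d(qe, qo, i, alpha=2.0, eps=1e-12):
    """Characteristic phase channels c1 (0 phase), c2 (pi phase), c3 (+-pi/2 phase) at scale i."""
    a_prev = np.sqrt(qe[i-1]**2 + qo[i-1]**2) + eps    # signal energy one octave below

    # Consistency with the next lower octave, (8.7)-(8.9)
    p1 = np.maximum(0, qe[i] * qe[i-1] / a_prev) * np.maximum(0, np.sign(qe[i]))
    p2 = np.maximum(0, qe[i] * qe[i-1] / a_prev) * np.maximum(0, np.sign(-qe[i]))
    p3 = np.maximum(0, qo[i] * qo[i-1] / a_prev)

    # Inhibition with the complementary response two scales below, (8.10)-(8.12)
    c1 = np.maximum(0, p1 - alpha * np.abs(qo[i-2]))
    c2 = np.maximum(0, p2 - alpha * np.abs(qo[i-2]))
    c3 = np.maximum(0, p3 - alpha * np.abs(qe[i-2]))
    return c1, c2, c3
```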
4 Note that the fact that both the line and the edge statements are low near the fourth event (positions 105 to 125) does not mean that this event will be lost. The final representation will also include other scales of filters, which will describe these events better.
Figure 8.5: Phase channels in 1D. (ρi = π/4, α = 2) p1,i, p2,i, p3,i according to (8.4)-(8.6) (dashed), and (8.10)-(8.12) (solid).
Figure 8.6: Transition between line and edge description. (ρi = π/4) Top: signal. Centre: c1,i phase channel. Bottom: c3,i phase channel.
8.2.4 Local orientation information
The filters we employ in 2D will be the extension of the lognorm filter function
(8.1) to 2D [48]
Fki(u) = Ri(ρ) Dk(û)        (8.13)

where

Dk(û) = (û · n̂k)²   if u · n̂k > 0,   and Dk(û) = 0 otherwise.        (8.14)
We will use four filters, with directions n̂1 = (0  1)T, n̂2 = (√0.5  √0.5)T, n̂3 = (1  0)T, and n̂4 = (√0.5  −√0.5)T. These directions have angles that
are uniformly distributed modulo π. Due to this, and the fact that the angular
function decreases as cos2 ϕ, the sum of the filter-response magnitudes will be
orientation invariant [48].
Just like in the 1D case, we will perform the filtering in the spatial domain
(fe,ki ∗ pki )(x) ≈ Re(IDFT{Fki (u)})
(8.15)
(fo,ki ∗ pki )(x) ≈ Im(IDFT{Fki (u)}) .
(8.16)
Here we have used a filter optimisation technique [67] to factorise the lognorm
quadrature filters into two approximately one-dimensional components. The filter
pki (x), is a smoothing filter in a direction orthogonal to n̂k , while fe,ki (x), and
fo,ki (x) constitute a 1D lognorm quadrature pair in the n̂k direction.
Using the responses from the four quadrature filters, we can construct a local
orientation image. This is a complex valued image, in which the magnitude of each
complex number indicates the signal energy when the neighbourhood is locally one-dimensional, and the argument denotes the local orientation, in the
double angle representation [48]
z(x) = Σk aki (n̂k1 + i n̂k2)² = a1i(x) − a3i(x) + i (a2i(x) − a4i(x))        (8.17)

where aki(x), the signal energy, is defined as aki = √(qe,ki² + qo,ki²).

8.2.5 Extracting characteristic phase in 2D
To illustrate characteristic phase in 2D, we need a new test pattern. We will use
the 1D signal from figure 8.6, rotated around the origin (see figure 8.7).
Figure 8.7: A 2D test pattern.
When extracting characteristic phases in 2D we will make use of the same
observation as the local orientation representation does: Since visual stimuli can
locally be approximated by a simple signal in the dominant orientation [48], we
can define the local phase as the phase of the dominant signal component.
To deal with characteristic phases in the dominant signal direction, we first
synthesise responses from a filter in a direction, n̂z , compatible with the local
orientation5
n̂z = ( Re(√z)  Im(√z) )T .        (8.18)
The filters will be weighted according to the value of the scalar product between
the filter direction, and this orientation compatible direction
wk = n̂kT n̂z .        (8.19)

5 Since the local orientation, z, is represented with a double angle argument, we could just as well have chosen the opposite direction. Which one of these we choose does not really matter, as long as we are consistent.
Thus, in each scale we synthesise one odd, and one even response projection as
qe,i = Σk qe,i,k abs(wk)        (8.20)
qo,i = Σk qo,i,k wk .           (8.21)
This will change the sign of the odd responses when the directions differ by more than π/2, but since the even filters are symmetric, they should always have a positive
weight. In accordance with our findings in the 1D study (8.7)-(8.9), (8.10)-(8.12),
we now compute three phase channels, c1,i , c2,i , and c3,i , in each scale.
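The projection onto the dominant orientation, (8.18)-(8.21), can be sketched as follows. Here z is the complex orientation image from (8.17), and qe_k, qo_k are lists with the four even and odd responses at one scale; normalising z to unit magnitude before taking the square root is an assumption of the sketch, since (8.18) is stated without an explicit normalisation.

```python
import numpy as np

def project_onto_orientation(z, qe_k, qo_k, eps=1e-12):
    """Synthesise even and odd responses in the locally dominant orientation, cf. (8.18)-(8.21)."""
    n_hat = [np.array([0.0, 1.0]),
             np.array([np.sqrt(0.5), np.sqrt(0.5)]),
             np.array([1.0, 0.0]),
             np.array([np.sqrt(0.5), -np.sqrt(0.5)])]   # filter directions n_1 .. n_4

    sqrt_z = np.sqrt(z / (np.abs(z) + eps))             # half the double angle, (8.18)
    nz1, nz2 = sqrt_z.real, sqrt_z.imag

    qe = np.zeros_like(qe_k[0])
    qo = np.zeros_like(qo_k[0])
    for k in range(4):
        wk = n_hat[k][0] * nz1 + n_hat[k][1] * nz2      # w_k = n_k^T n_z, (8.19)
        qe = qe + qe_k[k] * np.abs(wk)                  # even filters: positive weights, (8.20)
        qo = qo + qo_k[k] * wk                          # odd filters: signed weights, (8.21)
    return qe, qo
```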
Figure 8.8: Characteristic phase channels in 2D. (ρi = π/4) Left to right: characteristic phase channels c1,i, c2,i, and c3,i, according to (8.10)-(8.12) (α = 2). The colours indicate the locally dominant orientation.
The characteristic phase channels are shown in figure 8.8.6 As we can see, the channels exhibit a smooth transition from describing the white regions in the test pattern (see figure 8.7) as lines to describing them as two edges. Also note that the phase
statements actually give the phase in the dominant orientation, and not in the
filter directions, as was the case for CPCT [86].
8.2.6 Local orientation and characteristic phase
An orientation image can be gated with a phase channel, cn(x), in the following way:

zn(x) = 0   if cn(x) = 0,   and   zn(x) = cn(x) · z(x) / |z(x)|   otherwise.        (8.22)
We now do this for each of the characteristic phase statements c1,i (x), c2,i (x),
and c3,i(x), in each scale. The result is shown in figure 8.9. The colours in the figure indicate the locally dominant orientation, just like in figure 8.8. Notice for instance how the bridge near the centre of the image changes from being described by two edges to being described as a bright line, as we move through scale space.

6 The magnitude of lines this thin can be difficult to reproduce in print. However, the magnitudes in this plot should vary in the same way as in figure 8.6.
Figure 8.9: Sparse feature hierarchy. (ρi = π/2, π/4, π/8, π/16)
8.3 Concluding remarks
The strategy of this approach for low-level representation is to provide sparse and reliable statements as far as possible, rather than to provide statements at all points.
Traditionally, the trend has been to produce compact, descriptive components
as much as possible; mainly to reduce storage and computation. As the demands
on performance are increasing it is no longer clear why components signifying
different phenomena should be mixed. An edge is something separating two regions
with different properties, and a line is something entirely different.
Thanks to the sparse data representation, the use of separate representations in the computations leads to only a mild increase in data volume, compared to combined representations.
Although the representation is given at discrete scales, this can be viewed as a conventional sampling, albeit in scale space, which allows interpolation between
these discrete scales, with the usual restrictions imposed by the sampling theorem.
The requirement of a good interpolation between scales determines the optimal
relative bandwidths of filters to use.
Chapter 9
Associative Learning
This chapter introduces an associative network architecture using the channel representation. We discuss the representational properties of the networks, and illustrate their behaviour using a set of experiments. We will also relate the associative networks to Radial Basis Function (RBF) networks, Support Vector Machines (SVM), and fuzzy control.
9.1 Architecture overview
In the proposed architecture, the choice of information representation is of fundamental importance. The architecture makes use of the channel information representation introduced in chapter 3. The channel representation implies a mapping
of signals into a higher-dimensional space, in such a way that it introduces locality
in the information representation with respect to all dimensions; geometric space
as well as property space. The obtained locality gives two advantages:
• Nonlinear functions and combinations can be implemented using linear mappings
• Optimisation in learning converges much faster.
Figure 9.1 gives an intuitive illustration of how signals are represented as local fragments, which can be freely assembled to form an output. The system is
moving along a state space trajectory. The state vector x consists of both internal and external system parameters. The response space is typically a subset
of those parameters, e.g. orientation of an object, position of a camera sensor in
navigation, or actions of a robot. Response channels and feature channels measure
local aspects of the state space. The response channels and feature channels define
response channel vectors u and feature vectors a respectively.
The processing mode of the architecture is association where the mapping of
features ah onto desired responses uk is learned from a representative training set of
observation pairs {an, un}, n = 1, . . . , N, see figure 9.1(b). The feature vector a may contain
some hundred thousand components, while the output vector u may contain some
thousand components. For most features of interest, only limited parts of the domain will have non-zero contributions. This provides the basis for a sparse representation, which gives improved efficiency in storage and better performance in processing.

Figure 9.1: Architecture overview. (a) The system is moving along a state space trajectory. Response channels, uk, and feature channels, ah, measure different (local) aspects of the state vector x. (b) The response channels and the feature channels define localised functions along the trajectory. A certain response channel is associated with some of the feature channels with appropriate weights ckh. Figure borrowed from [51].
The model of the system is in the standard version a linear mapping from a
feature vector a to a response vector u over a linkage matrix C,
u = Ca .
(9.1)
In some training process, a set with N samples of output vectors u and corresponding feature vectors a are obtained. These form a response matrix U = ( u1 . . . uN ) and a feature matrix A = ( a1 . . . aN ). The training implies finding a solution matrix C to

U = CA .        (9.2)
The linkage matrix is computed as a solution to a least squares problem with a
monopolar constraint C ≥ 0. This constraint has a regularising effect, and in
addition it gives a sparse linkage matrix. The monopolar representation together
with locality, allows a fast optimisation, as it allows a parallel optimisation of a
large number of loosely coupled system states.
We will compare the standard version (9.1) to models where the mapping is
made directly to the response subset of the state parameters, i.e. typically what
would be used in regular kernel machines. We will in these cases use a modified
model with various normalisations of a.
9.2 Representation of system output states
For a system acting in a continuous environment, we can define a state space
X ⊂ RM . A state vector, x ∈ X , completely characterises the current situation
for the system, and X is thus the set of all situations possible for the system. The
state space has two parts termed internal and external. Internal states describe
the system itself, such as its position and its orientation. External states describe
a subset of the total states of the environment, which are to be incorporated in the
system’s knowledge, such as the position, orientation and size of a certain object.
The estimation of external states requires a coupling to internal states, which
can act as a known reference in the learning process. In general it is desirable
to estimate either a state, or a response that changes the state, i.e. a system
behaviour. For simplicity we will in this chapter assume that the desired response
variables are components of the state vector x.
We assume that the system is somehow induced to change its state, such that
it covers the state space of interest for the learning process. For an agent acting in
the physical world, the system state change has to take place in a continuous way
(due to the inertia caused by limited power for displacement of a certain mass, see
[49] for a more extensive discussion). It is thus reasonable to view the set of states {xn}, n = 1, . . . , N, used in the learning process as a system state trajectory. We can express this system state trajectory as a matrix X = ( x1  x2  . . .  xN ).
9.2.1 Channel representation of the state space
The normal form of output for the structure is in channel representation. It is
advantageous to represent the scalar state variables in a regular channel vector
form, as this allows multiple outputs when the mapping from input to output is
ambiguous, see section 3.2.1. The channel representation also forms the basis for
learning of discontinuous phenomena, as will be demonstrated in section 9.6.
A response channel vector um is a channel representation of one of the components xm of the state vector x, see (3.1). The vector um is thus a non-ambiguous
representation of position in a response state space Rm = {xm : x ∈ X }.
With this definition, a response channel will be non-zero only in a very limited
region of the state space. The value of a channel can be viewed as a confidence in
the hypothesis that the current state is near a particular prototype state. When a
specific channel is non-zero it is said to be active, and the subspace where a specific
channel is active is called the active domain of that channel. As the active domain
is always much smaller than the inactive domain, an inactive channel will convey
almost no information about position in state space. The small active domain is
also what makes the representation sparse.
The response channel vectors um for the samples n = 1, . . . , N can be put into response channel matrices Um = ( u1m  u2m  . . .  uNm ). All such response channel matrices are stacked row-wise
to form the response channel matrix U. While U will have a much larger number
of rows than the original state matrix X due to the increase of dimensionality
in the representation, the sparsity of the representation will imply a moderate
increase of the amount of data (typically a factor 3).
9.3 Channel representation of input features
It is assumed that the system can obtain at least partial knowledge about its
state from a set of observed feature variables, {ah}, forming a feature vector a = ( a1  a2  . . .  aH )T. In order for an association or learning process to be meaningful,
there has to be a sufficiently unique and repeatable correspondence between system
states and observed features. One way to state this requirement is as follows: The
sensor space, A, of states that the feature channels can represent, should allow
an unambiguous mapping f : A → R. The situation where this requirement is
violated is known in learning and robotics as perceptual aliasing, see e.g. [17].
A generative model for {ah }, useful for systems analysis, can be expressed as
localised, non-negative kernel functions Bh (x). These are functions of a weighted
distance between the state vector x and a set of prototype states xh ∈ X . We
exemplify this with the cos2 -kernel,
ah = Bh(x) = cos2(d(x, xh))   if d(x, xh) ≤ π/2,   and ah = 0 otherwise.        (9.3)
The used distance function is defined as
d(x, xh) = √( (x − xh)T Mh (x − xh) ) .        (9.4)
The matrix Mh is positive semidefinite, and describes the similarity measure for
the distance function around state xh , allowing a scaling with different sensitivities
with respect to different state variables. Equation (9.3) indicates that ah will have a maximal value of 1 when x = xh. It will go monotonically to zero as the weighted distance increases to π/2. Normally, neither xh nor Mh is explicitly known, but
emerge implicitly from the properties of the set of sensors used in the actual case.
These are generally different from one sensor or filter to another, which motivates
the notion of channel representation, as each channel has its specific identity, the
identification of which is part of the learning process.
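A minimal sketch of the generative model (9.3)-(9.4): each feature channel responds with a cos² of the weighted distance to its prototype state. The prototype states and the metric below are invented for the example.

```python
import numpy as np

def cos2_channel(x, x_proto, M):
    """Kernel response a^h = B^h(x) of (9.3), using the metric distance (9.4)."""
    diff = x - x_proto
    d = np.sqrt(diff @ M @ diff)
    return np.cos(d) ** 2 if d <= np.pi / 2 else 0.0

# Three channels that only measure the first state component (M has a zero for the second)
M = np.diag([1.0, 0.0])
prototypes = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([2.0, 0.0])]
x = np.array([0.7, 3.2])
a = np.array([cos2_channel(x, xp, M) for xp in prototypes])   # sparse, monopolar feature vector
```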
In general, there is no requirement for a regular arrangement of channels, be it
on the input side or on the output side. The prescription of an orderly arrangement
at the output comes from the need to interface the structure to the environment,
e.g. to determine its performance. In such a case it will be desirable to map the
response channel variables back into scalar variables in order to compare them with
the reference, something which is greatly facilitated by a regular arrangement.
Similarly to the state variables, we denote the observation at sample point n
by a vector an = ( a1n  a2n  . . .  aHn )T. These observation or feature vectors can be put into a feature matrix A = ( a1  a2  . . .  aN ).
9.3.1 Feature generation
The feature vectors a, input to the associative structure, may derive directly from
the preprocessing parts of a computer vision system, representing local image
properties such as orientation, curvature, colour, etc. Unless the features emerge
as monopolar quantities, we will channel encode them. If the properties have
a confidence measure, it is natural to weight the channel features with this, see
discussion in chapter 3.
Often, combinations or functions comb(a) of a set of features a, will be used
as input to the associative structure. A common way to increase specificity in the
percept space is to generate product pairs of the feature vector components, or a
subset of them, i.e.
comb(a) = vec(aaT) .        (9.5)
The symbol vec signifies the trivial transformation of concatenating rows or columns
of a matrix into a vector, see section 5.6.1. For simplicity of notation, we will express this as a substitution,
a ← comb(a) .
(9.6)
If we find a linear feature–response mapping in the training phase using the
feature combination (9.5), it will correspond to a quadratic mapping from the
original features to the responses. A network with this kind of feature expansion
is called a higher order network [7].
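A literal transcription of the product-pair expansion (9.5) is given below; a practical implementation would exploit the sparsity of a and keep only one element of each symmetric pair, which this sketch ignores.

```python
import numpy as np

def comb(a):
    """Second order feature expansion comb(a) = vec(a a^T), cf. (9.5)."""
    return np.outer(a, a).ravel()
```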
The final vector a, going into the associative structure will generally be considerably longer than the corresponding size of the sensor channel array. As we are
dealing with sparse feature data, the increase of the data volume will be moderate.
9.4 System operation modes
The channel learning architecture can be run under two different operation modes,
providing output in two different representations:
1. position encoding for discrete event mapping
2. magnitude encoding for continuous function mapping.
The first variety, discrete event mapping, is the mode which maximally exploits
the advantage of the information representation, to allow implementation and
learning of highly non-linear transfer functions, using a linear mapping.
The second variety is similar to more traditional function approximation methods.
9.4.1 Position encoding for discrete event mapping
In this mode, the structure is trained to map onto a set of channel representations
of the system response state variables, as discussed in subsection 9.2.1. Thus each
response will have a non-zero output only within limited regions of the definition
range. The major issue is that a multi-dimensional, fragmented feature set is
mapped onto a likewise fragmented version of the system response state space.
See figure 9.2 for an illustration.
There are a number of characteristics of the discrete event mode:
• Mapping is made to sets of response channels, whose response functions may
be partially overlapping to allow the reconstruction of a continuous variable.
Figure 9.2: Illustration of discrete event mapping. Solid curves are weighted input feature functions ckh ah(t) along the state space trajectory. Dashed curves are the responses uk(t) = Σh ckh ah(t).
• Output channels are expected to take some standard maximum value, say 1, but to be zero most of the time, to allow a sparse representation.
• The system state is not given by the magnitude of a single output channel,
but is given by the relation between outputs of adjacent channels.
• Relatively few feature functions, or sometimes only a single feature function,
are expected to map onto a particular output channel.
• The channel representation of a signal allows a unified representation of
signal value and of signal confidence, where the relation between channel
values represents value, and the magnitude represents confidence. Since the
discrete event mode implies that both the feature and response state vectors
are in the channel representation, the confidence of the feature vector will
be propagated to the response vector if the mapping is linear.
The properties just listed allow the structure to be implemented as a purely linear mapping,

u = Ca .        (9.7)
9.4.2 Magnitude encoding for continuous function mapping
The continuous function mapping mode is used to generate the response state
variables directly, rather than a set of channel functions for position decoding. The
response state vector, x, is approximated by a weighted sum of channel feature
functions, see figure 9.3 for an illustration.
This mode corresponds to classical function approximation objectives. The
mode is used for accurate representation of a scalar continuous function, which is
often useful in control systems.
The approximation will be good if the feature functions are sufficiently local,
and sufficiently dense.

Figure 9.3: Illustration of continuous function mapping. Solid curves are weighted input feature functions ch ah(t) along the state space trajectory. Dashed curve is the response x(t) = Σh ch ah(t).

There are a number of characteristics for the continuous function mapping:
• It uses rather complete sets of feature functions, compared to the mapping
onto a single response in discrete event mode. The structure can still handle
local feature dropouts without adverse effects upon well behaved regions.
• Mapping is made onto continuous response variables, which may have a
magnitude which varies over a large range.
• A high degree of accuracy in the mapping can be obtained if the feature
vector is normalised, as stated below.
In this mode, however, it is not possible to represent both a state value x, and
a confidence measure r, unless it is done explicitly.
For a channel vector, the vector sum corresponds to the confidence, see section
3. As a first assumption we could thus assume that the feature vector sum corresponds to the confidence measure. Assuming a linear feature–response mapping,
this will imply that the confidence is propagated to the response,
rx = Ca .
(9.8)
By dividing the feature vector a with r we can normalise with the amount of
confidence, or certainty, in a. This is related to the theory of normalized averaging,
see e.g. [48].
If we use this model, we have additionally made the assumption that all features
have the same confidence in each sample. To be slightly more flexible, we will
instead assume a linear model for the confidence measure
r = wT a ,        (9.9)

where w > 0 is a suitable weight vector. We now obtain the following response model for continuous function mode:

x = C (1/(wT a)) a .        (9.10)
Note that wT a is a weighted l1 -norm of a, since a is non-negative. An unweighted l1 -norm, w = 1, is often used in RBF networks and probabilistic mixture models, see [58, 77]. Other choices of weighting w will be discussed in section
9.5.2.
9.5 Associative structure
We will now turn to the problem of estimating the linkage matrix C in (9.10) and
in (9.7). We take on a unified approach for the two system operation modes. The
models can be summarised into
u = C (1/s(a)) a ,        (9.11)
where s(a) is a normalisation function, and u denotes a scalar or a vector, representing either the explicit state variable/variables, or a channel representation
thereof. In continuous function mode (9.10) u = x and s(a) = wT a. In discrete
event mode (9.7) u is a channel representation of x and s(a) ≡ 1.
In the subsequent discussion, we will limit the scope to a supervised learning
framework. Still, the structure can advantageously be used as a core in systems
for other strategies of learning, such as reinforcement learning, with a proper
embedding [96]. This discussion will assume batch mode training. This implies
that there are N observation pairs of corresponding feature vectors an and state
or response vectors un . Let A and U denote the matrices containing all feature
vector and response vector samples respectively, i.e.

U = ( u1  u2  . . .  uN ) ,        A = ( a1  a2  . . .  aN ) .        (9.12)
For a set of observation samples collected in accordance with (9.12), the model
in (9.11) can be expressed as
U = C A Ds ,        (9.13)

where

Ds = diag−1( s(a1)  s(a2)  . . .  s(aN) ) .        (9.14)
The linkage matrix C is computed as a solution to a weighted least squares
problem, with the constraint C ≥ 0. This constraint has a regularising effect on
the mapping, and also ensures a sparse linkage matrix C. For a more extensive
discussion on the monopolar constraint, see the article [51].
9.5.1 Optimisation procedure
The procedure to optimise the associative networks is mainly the work of Granlund
and Johansson [51]. It is described in this section for completeness. The linkage
matrix C is computed as the solution to the constrained weighted least-squares
problem
min e(C)   subject to   C ≥ 0 ,        (9.15)
where
e(C) = ‖U − C A Ds‖²W = trace( (U − C A Ds) W (U − C A Ds)T ) .        (9.16)
The weight matrix W, which controls the relevance of each sample, is chosen as
W = Ds−1. The minimisation problem (9.15) does not generally have a unique
solution, as it can be under-determined or over-determined.
The proposed solution to (9.15) is the fixed point of the sequence
C(0) = 0
C(i + 1) = max( 0, C(i) − ∇e(C(i)) Df ) ,        (9.17)
where Df is the positive definite diagonal matrix
Df = diag(v)diag−1 (ADs AT v) for some v > 0 .
(9.18)
Since W = Ds−1 we have

∇e(C) = (C A Ds − U) W Ds AT = (C A Ds − U) AT ,        (9.19)

and we rewrite sequence (9.17) as

C(i + 1) = max( 0, C(i) − (C(i) A Ds − U) AT Df ) .        (9.20)
We can interpret Ds and Df as normalisations in the sample and feature domain
respectively, see section 9.5.2 for further details. We will consequently refer to Ds
as sample domain normalisation and Df as feature domain normalisation.
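Read literally, the fixed-point iteration (9.17)-(9.20) can be sketched as below. U and A follow the layout of (9.12), s holds the per-sample normalisations s(an), the feature domain normalisation uses v = 1 as in (9.22), and the 50 iterations match the experiments in section 9.6; the eps guards against division by zero and is an addition of the sketch.

```python
import numpy as np

def train_linkage(U, A, s=None, n_iter=50, eps=1e-12):
    """Monopolar linkage matrix C >= 0 via the projected update (9.20).

    U : responses (K x N), A : features (H x N), s : per-sample normalisations (length N) or None."""
    H, N = A.shape
    d_s = np.ones(N) if s is None else 1.0 / (s + eps)   # diagonal of D_s, cf. (9.14); ones in discrete event mode
    ADs = A * d_s                                        # A D_s
    v = np.ones(H)
    d_f = v / (ADs @ A.T @ v + eps)                      # diagonal of D_f, cf. (9.18) with v = 1
    C = np.zeros((U.shape[0], H))
    for _ in range(n_iter):
        grad = (C @ ADs - U) @ A.T                       # gradient (9.19)
        C = np.maximum(0.0, C - grad * d_f)              # projected update (9.20)
    return C
```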
9.5.2 Normalisation modes
This normalisation can be put in either of the two representation domains, the
sample domain or the feature domain, but with different effects upon convergence,
accuracy, etc. For each choice of sample domain normalisation Ds there are nonunique choices of feature domain normalisations Df such that sequence (9.20)
converges to a solution of problem (9.15). Df can for example be computed from
(9.18). The choice of normalisation depends on the operation mode, i.e. continuous
function mapping or discrete event mapping. There are some choices that are of
particular interest. These are discussed below.
Discrete event mapping
Discrete event mode (9.7) corresponds to a sample domain normalisation matrix
Ds = I .
(9.21)
Choosing v = 1 = (1 1 . . . 1)T in (9.18) gives
 1

Df = diag−1 (AAT 1) = 

a1 mT
f

1
a2 mT
f
..

,

(9.22)
.
P
where mf = h ah is proportional to the mean in the feature domain. As Ds does
not contain any components of A there is no risk that it turns singular in domains
of samples having all feature components zero.
This choice of normalisation will be referred to as Normalisation entirely in the
feature domain.
Continuous function mapping
There are several ways to choose w in the continuous function model (9.10), depending on the assumptions of error models, and the resulting choice of confidence
measure s. One approach is to assume that all training samples have the same
confidence, i.e. s ≡ 1, and compute C ≥ 0 and w ≥ 0 such that
1 ≈ wT A
X ≈ C A .        (9.23)
Sometimes it may be desirable to have an individual confidence measure for each
training sample. Another approach is to design a suitable w and then compute C
using the optimisation framework in section 9.5.1 with s(a) = wT a.
There are two specific designs of w that are worth emphasising. The channel
representation implies that large feature channel magnitudes indicate a higher
confidence than low values. We can consequently use the sum of the feature
channels as a measure of confidence:
s(a) = 1T a   ⇒   x = C (1/(1T a)) a .        (9.24)
As mentioned before, this model is often used in RBF-networks and probabilistic
mixture models, see [58, 77]. The corresponding sample domain normalisation
matrix is

Ds = diag−1(AT 1) = diag( 1/(a1T 1)   1/(a2T 1)   . . . ) ,        (9.25)

and if we choose v = 1 in (9.18) we get

Df = diag−1(A 1) = diag( 1/(a1 1)   1/(a2 1)   . . . ) .        (9.26)
This choice of model will be referred to as Mixed domain normalisation.
It can also be argued that a feature element which is frequently active should
have a higher confidence than a feature element which is rarely active. This can
be included in the confidence measure by using a weighted sum of the features,
where the weight is proportional to the mean in the sample domain:
s(a) = msT a ,   where   ms = A1 = Σn an .        (9.27)
This corresponds to the sample domain normalisation matrix

Ds = diag−1(AT A 1) = diag( 1/(msT a1)   1/(msT a2)   . . . ) ,        (9.28)

and by using v = A1 in (9.18) we get

Df = I .        (9.29)
This choice of model will be referred to as Normalisation entirely in the sample
domain.
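The diagonal entries of Ds and Df for the three normalisation modes can be computed directly from the feature matrix A, as in the sketch below (layout as in (9.12); the eps guard is added by the sketch).

```python
import numpy as np

def normalisation_diagonals(A, mode, eps=1e-12):
    """Diagonals of D_s and D_f for the modes of section 9.5.2."""
    H, N = A.shape
    if mode == 'feature':                        # (9.21)-(9.22): entirely in the feature domain
        d_s = np.ones(N)
        d_f = 1.0 / (A @ A.T @ np.ones(H) + eps)
    elif mode == 'mixed':                        # (9.25)-(9.26): mixed domain
        d_s = 1.0 / (A.T @ np.ones(H) + eps)     # 1 / (1^T a_n)
        d_f = 1.0 / (A @ np.ones(N) + eps)
    elif mode == 'sample':                       # (9.28)-(9.29): entirely in the sample domain
        m_s = A @ np.ones(N)
        d_s = 1.0 / (A.T @ m_s + eps)            # 1 / (m_s^T a_n)
        d_f = np.ones(H)
    else:
        raise ValueError(mode)
    return d_s, d_f
```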
9.5.3 Sensitivity analysis for continuous function mode
We will now make some observations concerning the insensitivity to noise of the
system, under the assumption of sample normalisation in continuous function
mode. That is, a response state estimate x̂n is generated from a feature vector a
according to model (9.10), i.e.
x̂n = C (1/(wT an)) an .        (9.30)
We observe that regardless of choice of normalisation vector wT , the response will
be independent of any global scaling of the features, i.e.
C (1/(wT an)) an = C (1/(wT γan)) γan .        (9.31)
If multiplicative noise is applied, represented by a diagonal matrix Dγ , we get
x̂n = C (1/(wT Dγ an)) Dγ an .        (9.32)
If the choice of weights in C and w is consistent, i.e. if the weights used to generate
a response at a sample n were to obey the relation C = x̂n wT , the network is
perfectly invariant to multiplicative noise. As we shall see in the experiments
to follow, the normalisation comes close to this ideal for the entire sample set,
provided that the response signal varies slowly. For such situations, the network
suppresses multiplicative noise well.
Similarly, a sensitivity analysis can be made for discrete event mode. We will
in this presentation only refer to the discussion in chapter 3 for the invariances
available in the channel representation, and to the results from the experimental
verification in the following section.
9.6 Experimental verification
We will in this section analyse the behaviour and the noise sensitivity of several
variants of associative networks, both in continuous function mode and in discrete
event mode. A generalisation of the common CMU twin spiral pattern [18] has
been used, as this is often used to evaluate classification networks. We have
chosen to make the pattern more difficult in order to show that the proposed
learning machinery can represent both continuous function mappings (regression)
and mappings to discrete classes (classification). The robustness is analysed with
respect to three types of noise: additive, multiplicative, and impulse noise on the
feature vector.
9.6.1 Experimental setup
In the experiments, a three dimensional state space X ⊂ R3 is used. The sensor
space A ⊂ R2 , and the response space R ⊂ R are orthogonal projections of the
state space. The network is trained to perform the mapping f : A → R which is
depicted in figure 9.4. Note that this mapping can be seen as a surface of points
x ∈ R3 , with x3 = f (x1 , x2 ). The analytic expression for f (x1 , x2 ) is:
f(r, ϕ) = fs(r, ϕ)   if mod(ϕ + 1000√r, 2π) < π,   and   f(r, ϕ) = sign(fs(r, ϕ))   otherwise,        (9.33)

where fs(r, ϕ) = (1/√2 − r) cos(ϕ + 1000√r).

Variables r = √(x1² + x2²) and ϕ = tan−1(x1, x2) are the polar coordinates in sensor
space A. As can be seen in the figure, the mapping contains both smooth parts
(given by the cos function) and discontinuities (introduced by the sign function).
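For reference, the reconstructed test mapping (9.33) can be written as below; the placement of the square roots follows the reconstruction above and should be regarded as an assumption.

```python
import numpy as np

def spiral_target(x1, x2):
    """Test mapping f of (9.33): smooth in one phase band, hard-limited by sign() in the other."""
    r = np.hypot(x1, x2)
    phi = np.arctan2(x1, x2)                      # tan^-1(x1, x2), matching the text
    arg = phi + 1000.0 * np.sqrt(r)
    fs = (1.0 / np.sqrt(2.0) - r) * np.cos(arg)
    return np.where(np.mod(arg, 2.0 * np.pi) < np.pi, fs, np.sign(fs))
```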
The pattern is intended to demonstrate the following properties:
1. The ability to approximate piecewise continuous surfaces.
2. The ability to describe discontinuities (i.e. assignment into discrete classes).
3. The transition between interpolation and representation of a discontinuity.
4. The inherent approximation introduced by the sensor channels.
As sensor channels, a variant of the channels prescribed in expression (9.3) is
used:
Bh(x) = cos2(ωd)   if ωd ≤ π/2,   and Bh(x) = 0 otherwise,        (9.34)

where d = √( (x − xh)T M (x − xh) ) ,        (9.35)
and M = diag(1 1 0). In the experiments H = 2000 such sensors are used, with
random positions {xh}, h = 1, . . . , H, inside the box ([−0.5, 0.5], [−0.5, 0.5]) ⊂ A. The sensors
have channel widths of ω = π/0.14, giving each an active domain with radius 0.07. Thus, for each state xn, a feature vector an = ( B1(xn)  B2(xn)  . . .  BH(xn) )T is obtained.

Figure 9.4: Desired response function. Black to white correspond to values of x3 ∈ [−1, 1].
During training, random samples of the state vector xn ∈ X on the surface
f : A → R are generated. These are used to obtain pairs {fn , an } using (9.33)
and (9.34). The training sets are stored in the matrices f , and A respectively. The
performance is then evaluated on a regular sampling grid. This has the advantage
that performance can be visualised as an image. Since real valued positions x ∈ X
are used, the training and evaluation sets are disjoint.
The mean absolute error (MAE) between the network output and the ground
truth (9.33), is used as a performance measure,
εMAE = (1/N) Σn=1..N | f(xn) − c an | ,        (9.36)

or, for discrete event mode

εMAE = (1/N) Σn=1..N | f(xn) − dec(C an) | .        (9.37)
The rationale for using this error measure is that it is roughly proportional to
the number of misclassifications along the black-to-white boundary, in contrast to
RMSE which is proportional to the number of misclassifications squared.
9.6.2 Associative network variants
We will demonstrate the behaviour of the following five variants of associative
networks:
1. Mixed domain normalisation bipolar network
This network uses the model
fˆ = (1/(1T a)) c a .
This model is often used in RBF-networks and probabilistic mixture models,
see [58, 77]. This network is optimised according to
c = arg minc ‖f − c A Ds‖² + γ‖c‖² .        (9.38)
In the experiments, the explicit solution is used, i.e.
c = f A Ds (A Ds DsT AT + γI)−1 .        (9.39)
Note that for larger systems, it is more efficient to replace (9.39) with a
gradient descent method.
2. Mixed domain normalisation monopolar network
Same as above, but with a monopolar constraint on c, instead of the Tikhonov
regularization used above.
3. Sample domain normalisation monopolar network
This network uses the model
fˆ = (1/(msT a)) c a ,
where ms is computed from the training set sensor channels according to
ms = A1.
4. Uniform sample confidence monopolar network
This network uses the model
fˆ = (1/(wT a)) c a ,
where the mapping w is trained to produce the response 1 for all samples,
see (9.23).
5. Discrete event mode monopolar network
This network uses the model
û = Ca   ⇔   fˆ = dec(Ca) ,
with K = 7 channels. The responses should describe the interval [−1, 1] so
the decoding step involves a linear mapping, see (3.5).
Figure 9.5: Performance of bipolar network (#1) under varied number of samples.
Top left to bottom right: N = 63, 125, 250, 500, 1000, 2000, 4000, 8000.
9.6.3 Varied number of samples
As a demonstration of the generalisation abilities of the networks we will first
vary the number of samples. The monopolar networks are optimised according to
section 9.5, with 50 iterations. For the bipolar network we have used γ = 0.005.
This value is chosen to give the same error on the training set as in network #2
using N = 500 samples.
The performance on the regular grid is demonstrated in figure 9.5 for the
bipolar network (#1), and in figure 9.6 for the discrete event network (#5).
If we look at the centre of the spiral, we see that both networks fail to describe
the fine details of the spiral, although #1 is doing slightly better. For the discrete
event network, the failure is a direct consequence of the feature channel sizes. For
the bipolar network it is a combined consequence of the size and density of the
feature channels.
We can also observe that the discrete event network is significantly better at
dealing with the discontinuities. This is also reflected in the error measures, see
figure 9.7. For very low numbers of samples, when both networks clearly fail, the
bipolar network is slightly better. We have also plotted the performance of the
monopolar mappings in continuous function mode. As can be seen in the plot,
these are all slightly worse off than the bipolar network. All three monopolar
continuous function mode variants have similar performances on this setup. Differences appear mainly when the sample density becomes non-uniform (not shown
here).
Figure 9.6: Performance of discrete event network (#5) under varied number of
samples. Top left to bottom right: N = 63, 125, 250, 500, 1000, 2000, 4000, 8000.
Figure 9.7: MAE under varied number of samples. Solid thick is #5, and dashed
is #1. Solid thin are #2,#3, and #4. For low number of samples the variants are
ordered #2, #3, #4 with #4 being the best one.
Figure 9.8: Performance of discrete event network (#5) under varied number of
channels. Top left to bottom right: K = 3 to K = 14.
9.6.4 Varied number of channels
The relationship between the sizes of feature and response channels is important
for the performance of the network. The distance between the channels also determines where the decision between interpolation and introduction of a discontinuity
is made. We will now demonstrate these two effects by varying the number of channels in the range [3 . . . 14], and keeping the number of samples high, N = 8000.
As can be seen in figure 9.8, a low number of channels gives a smooth response
function. For K = 3 no discontinuity is introduced at all, since there is only one
interval for the local reconstruction (see section 3.2.3). As the number of channels
is increased, the number of discontinuities increases. Initially this is an advantage,
but for a large number of channels, the response function becomes increasingly
patchy (see figure 9.8). In practice, there is thus a trade-off between description
of discontinuities, and patchiness. This trade-off is also evident if MAE is plotted
against the number of channels, see figure 9.9 left.
In figure 9.9, right part, error curves for smaller numbers of samples have been
plotted. It can be seen that, for a given number of samples, the optimal choice of
channels varies. Better performance is obtained for a small number of channels,
when fewer samples are used. The standard way to interpret this result is that a
high number of response channels allows a more complex model, which requires
more samples.

Figure 9.9: MAE under varied number of channels. Left: MAE for N = 8000. Right: MAE for N = 63, 125, 250, 500, 1000, 2000, 4000, and 8000.

Figure 9.10: Number of non-zero coefficients under varied number of channels. Compare this with 2000 non-zero coefficients for the continuous function networks.
If we plot the number of non-zero coefficients in the linkage matrix C, we also
see that there is an optimal number of channels, see figure 9.10. Note that although
the size of C is between 3 and 14 times larger than in continuous function mode,
the number of links only increases by a factor 2.1 to 2.5.
9.6.5 Noise sensitivity
We will now demonstrate the performance of the associative networks when the
feature set is noisy. We will use the following noise models:
Figure 9.11: Noise sensitivity. Top left: additive noise, top right: multiplicative
noise, bottom: impulse noise. Solid thick is #5, and dashed is #1. Solid thin are
#2,#3, and #4.
1. Additive noise: A random value is added to each feature value, i.e.
a∗n = an + η ,
with η k ∈ rect[−p, p], and the parameter p is varied in the range [0, 0.1].
2. Multiplicative noise: Each feature value is multiplied with a random
value, i.e.
a∗n = Dη an ,
where Dη is a diagonal matrix with (Dη )kk = η k ∈ rect[1 − p, 1 + p], and the
parameter p is varied in the range [0, 1].
3. Impulse noise: A fraction of the features is set to 1, i.e.
ank,∗ = 1   if fr < p where fr ∈ rect(0, 1),   and   ank,∗ = ank   otherwise,
and the parameter p is varied in the range [0, 0.01].
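The three noise models can be applied to a feature matrix as in the sketch below, reading rect[a, b] as a uniform distribution on [a, b].

```python
import numpy as np

def corrupt_features(A, kind, p, rng=None):
    """Apply additive, multiplicative or impulse noise to a feature matrix A."""
    rng = np.random.default_rng() if rng is None else rng
    if kind == 'additive':                        # a* = a + eta, eta in rect[-p, p]
        return A + rng.uniform(-p, p, A.shape)
    if kind == 'multiplicative':                  # a* = D_eta a, eta in rect[1-p, 1+p]
        return A * rng.uniform(1 - p, 1 + p, A.shape)
    if kind == 'impulse':                         # a fraction p of the feature values is set to 1
        return np.where(rng.uniform(0, 1, A.shape) < p, 1.0, A)
    raise ValueError(kind)
```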
The results of the experiments are shown in figure 9.11. We have consistently
used N = 4000 samples for evaluation, and corrupted them with noise according
to the discussion above. In order to make the amount of regularisation comparable
we have optimised the γ parameter for network #1 to give the same error on the
training set as network #2 at N = 4000 samples. This gave γ = 0.08.
As can be seen from the additive noise experiment, network #5 has a different
slope for its dependence upon noise level. The other networks are comparable,
and differences are mainly due to how well the networks are able to represent the
pattern in the first place (see section 9.6.3). For the multiplicative noise case,
we see that the slope is similar for all networks. Thus we can conclude that
the multiplicative noise behaviour is comparable for all tested networks. For the
impulse noise case we can see that for small amounts of noise, network #5 has a
less steep slope than the others. For larger amounts of noise however, all networks
seem to behave in a similar manner.
The purpose of these experiments has been to demonstrate the abilities of the
associative networks to generalise, and to cope with various kinds of sensor noise.
Several experiments using image features as inputs have been made, but have to
be excluded from this presentation. For details of such experiments, the reader is
directed to [53, 34, 80].
9.7 Other local model techniques
We will now have a look at three classes of techniques similar to the associative
networks presented in this chapter. The descriptions of the techniques, Radial
Basis Function (RBF) networks, Support Vector Machines (SVM), and adaptive
fuzzy control, are not meant to be exhaustive; the purpose of the presentation is
merely to describe the similarities and differences between them and the associative
networks.
9.7.1 Radial Basis Function networks
The fact that an increased input dimensionality with localised inputs simplifies
learning problems has also been exploited in the field of Radial Basis Function
(RBF) networks [77, 58]. RBF networks have a hidden layer with localised Gaussian models, and an output layer which is linear. In effect this means that RBF
networks learn a hidden representation which works like a channel encoding. The
advantage with this approach is that the locations, and sizes of the channels (or
RBFs) adapt to the data. The obvious disadvantage compared to using a fixed
set of localised inputs is of course longer training time, since the network has two
layers that have to be learned.
Typically the RBF positions are found using a clustering scheme such as K-means [77], or, if the number of training data is low, one RBF is centred around each training sample. Related to RBF networks are hierarchies of local Gaussian
models. Such networks have been investigated by for instance Landelius in [69].
His setup allows new models to be added where needed, and unused models to be
removed. Compared to the associative networks presented in this chapter, we also
note that the response from an RBF network is a continuous function, and not a
channel representation. This means that RBF networks cannot properly deal with
multiple hypotheses.
9.7.2 Support Vector Machines
Support Vector Machines (SVM) is another kernel technique, one that avoids an explicit mapping into a high-dimensional space altogether. For an SVM it is required that the used kernel is positive definite. For such cases, Mercer's theorem states that the kernel function is equivalent to an inner product in a high-dimensional space [58].
Obvious differences between the associative networks and SVM are that a SVM
has a low dimensional feature space, and maps either to a binary variable (classification SVM), or to a continuous function (regression SVM). An associative
network on the other hand uses a high-dimensional, sparse representation of the
feature space, and maps to a set of response channels. Since SVMs do not use
responses in the channel representation, they are unable to deal with multiple
hypotheses.
9.7.3
Adaptive fuzzy control
Adaptive fuzzy control is a technique for learning locally linear functional relationships; see e.g. [84] for an overview. In fuzzy control, a set of local fuzzy inference rules between measurements and desired outputs is established. These
are often in a form suitable for linguistic communication, for instance: IF temperature(warm) THEN power(reduce). The linguistic states (“warm” and “reduce” in
our example) are defined by localised membership functions, corresponding to the
kernels in the channel representation. Each input variable is fuzzified into a set of
membership degrees, which are in the range [0, 1.0]. Groups of one membership
function per input variable are connected to an output membership function in a
fuzzy inference rule. Each fuzzy inference rule only fires to a certain degree, which
is determined by the amount of input activations. The result of the fuzzy inference
is a weighted linear combination of the output membership functions, which can
be used to decode a response in a defuzzification step, which is typically a global
moment (centroid) computation. The IF-THEN inference rules can be learned by
a neural network, see for instance [76]. Typically the learning adjusts the shape
and positions of the membership functions, while the actual set of IF-THEN rules
stays fixed. Thus a fuzzy-inference system can be thought of as an associative network
with adaptive feature and response channels, and a static, binary linkage matrix
C.
There are several differences between the associative networks, and fuzzy control. In fuzzy control the implicit assumption is that there is only one value
per feature dimension activating the membership functions at the input side. As
shown in this chapter, this is not the case in associative learning using the channel representation. Furthermore fuzzy control only allows one response, since the
defuzzification is a global operation. In contrast, representation of multiple values
is an important aspect of the channel representation, see section 3.2.1.
9.8
Concluding remarks
In this chapter we have demonstrated that the channel learning architecture running in discrete event mode is able to simultaneously describe continuous and transient phenomena, while still being better than or as good as a linear network at suppressing noise. An increase in the number of response channels does
not cause an explosion in the number of used links. Rather, it remains fairly stable at approximately twice the number of links required for a continuous function
mapping. This is a direct consequence of the monopolar constraint.
The training procedure shows a fast convergence. In the experiments described,
a mere 50 iterations have been required. The fast convergence is due to the
monopolar constraint, locality of the features and responses, and the choice of
feature domain normalisation.
The learning architecture using channel information also deals properly with
the perceptual aliasing problem, that is, it does not attempt to merge or average
conflicting statements, but rather passes them on to the next processing level.
This allows a second processing stage to resolve the perceptual aliasing, using
additional information not available at the lower level.
The ability of the architecture to handle a large number of models in separate or loosely coupled domains of the state space promises systems that combine the continuous mapping of control systems with the state complexity we
have become familiar with from digital systems. Such systems can be used for
the implementation of extremely complex, contextually controlled mapping model
structures. One such application is for view-based object recognition in computer
vision [53].
Chapter 10
An Autonomous Reactive
System
This chapter describes how a world model for successive recognition can be learned
using associative learning. The learned world model consists of a linear mapping
that successively updates a high-dimensional system state, using performed actions
and observed percepts. The actions of the system are learned by rewarding actions
that are good at resolving state ambiguities. As a demonstration, the system is
used to resolve the localisation problem in a labyrinth.
10.1
Introduction
During the eighties a class of robotic systems known as reactive robotic systems
became popular. The introduction of system designs such as the subsumption
architecture [11] caused a small revolution due to their remarkably short response
times. Reactive systems are able to act quickly since the actions they perform are
computed as a direct function of the sensor readings, or percepts, at a given time
instant. This design principle works surprisingly well in many situations despite
its simplicity. However, a purely reactive design is sensitive to a fundamental
problem known as perceptual aliasing, see e.g. [17].
Perceptual aliasing is the situation where the percepts are identical in two situations that require the system to perform different actions. There are two main
solutions to this problem:
• The first is to add more sensors to the system such that the two situations
can be distinguished.
• The second is to give the system an internal state. This state is estimated
such that it is different in the two situations, and can thus be used to guide
the actions.
This chapter will deal with the latter solution, which further on will be called successive state estimation. We note here that the introduced state can be tailor-made to resolve the perceptual aliasing.
Successive state estimation is called recursive parameter estimation in signal
processing, and on-line filtering in statistics [101]. Successive recognition could
potentially be useful to computer vision systems that are to navigate in a known
environment using visual input, such as the autonomous helicopter in the WITAS
project [52].
10.1.1
System outline
Successive state estimation is an important component of an active perception
system. The system design to be described is illustrated in figure 10.1. The state
estimation, which is the main topic of this chapter, is performed by the state
transition and state narrowing boxes.
The state transition box updates the state using information about which action the system has taken, and the state narrowing box successively resolves ambiguities in the state by only keeping states that are consistent with the observed
stimulus.
Figure 10.1: System outline. The boxes are channel coding, state transitions, state narrowing, and motor programs; the signals are stimulus, action, system state, and new action.
The system consistently uses the channel representation (see chapter 3) to
represent states and actions. This implies that information is stored in channel
vectors of which most elements are zero. Each channel is non-negative, and its
magnitude signifies the relevance of a specific hypothesis (such as a specific system state in our case), and thus a zero value represents “no information”. This
information representation has the advantage that it enables very fast associative
learning methods to be employed [50], and improves product sum matching [34].
The channel coding box in figure 10.1 converts the percepts into a channel
representation. Finally, the motor program box is the subsystem that generates
the actions of the system. The complexity of this box is at present kept at a
minimum.
10.2
Example environment
To demonstrate the principle of successive state estimation, we will apply it to the
problem shown in figure 10.2. The arrow in the figure symbolises an autonomous
agent that is supposed to successively estimate its position and gaze direction by
performing actions and observing how the percepts change. This is known as the
robot localisation problem [101]. The labyrinth is a known environment, but the
initial location of the agent is unknown, and thus the problem consists of learning
(or designing) a world model that is useful for successive recognition.
Figure 10.2: Illustration of the labyrinth navigation problem.
The stimulus constitutes a three-element binary vector, which tells whether
there are walls to the left, in front, or to the right of the agent. For the situation
in the figure, this vector will look like this:
m = ( 0  0  1 )^T .
This stimulus is converted to percept channels in one of two ways
p1 = ( m1  m2  m3  1−m1  1−m2  1−m3 )^T ,   or
p2 = ( p1  p2  p3  p4  p5  p6  p7  p8 )^T ,                              (10.1)

where

ph = { 1  if m = mh
     { 0  otherwise,

and {mh}, h = 1, ..., 8, is the set of all possible stimuli. This expansion is needed since we want
to train an associative network [50] to perform the state transitions, and since the
network only has non-negative coefficients, we must have a non-zero input vector
whenever we want a response.
The two variants p1 and p2 will be called semi-local and local percepts respectively. For the semi-local percepts, correlation serves as a similarity measure, or metric, but for the local percepts we have no metric: the correlation is either 1 or 0.
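A small sketch of the two percept encodings in (10.1), for the wall stimulus m = (0 0 1)^T in figure 10.2 (the function and variable names are illustrative only):

import numpy as np
from itertools import product

def semi_local(m):
    # p1: each wall bit together with its complement
    m = np.asarray(m, float)
    return np.concatenate([m, 1.0 - m])

ALL_STIMULI = [np.array(b, float) for b in product([0, 1], repeat=3)]

def local(m):
    # p2: one channel per possible stimulus, exactly one of them is 1
    return np.array([float(np.array_equal(np.asarray(m, float), mh)) for mh in ALL_STIMULI])

p1 = semi_local([0, 0, 1])   # length 6; correlation between percepts acts as a metric
p2 = local([0, 0, 1])        # length 8; correlation is either 1 or 0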
The system has three possible actions: a1 = TURN LEFT, a2 = TURN RIGHT, and a3 = MOVE FORWARD. These are also represented as a three-element binary vector, with only one non-zero element at a time. E.g. TURN RIGHT is represented as

a2 = ( 0  1  0 )^T .
Each action will either turn the agent 90° clockwise or anticlockwise, or move it
forward to the next grid location (unless there is a wall in the way).
As noted in section 10.1, the purpose of the system state is to resolve perceptual
aliasing. For the current problem this is guaranteed by letting the state describe
both agent location and absolute orientation. This gives us the number of states
as
Ns = rows × cols × orientations .
(10.2)
For the labyrinth in figure 10.2 this means 7 × 7 × 4 = 196 different states.
10.3
Learning successive recognition
If the state is in a local representation, that is, each component of the state vector
represents a local interval in state space, successive recognition can be obtained
by a linear mapping. For the environment described in section 10.2, we will thus
use a state vector with Ns components.
The linear mapping will recursively estimate the state, s, from an earlier state,
the performed action, a, and an observed percept p. I.e.
s(t + 1) = C [s(t) ⊗ a(t) ⊗ p(t + 1)]
(10.3)
where ⊗ is the Kronecker product, which generates a vector containing all product
pairs of the elements in the involved vectors (see section 5.6.1). The sought linear
mapping C is thus of dimension Ns × Ns Na Np where Na and Np are the sizes of
the action and percept vectors respectively.
In order to learn the mapping we supply examples of s, a, and p for all possible
state transitions. This gives us a total of Ns Na samples. The coefficients of
the mapping C are found using a least squares optimisation with non-negative
constraint
arg min_{c_ij ≥ 0} ||u − Cf||²
where
u = s(t + 1)
f = s(t) ⊗ a(t) ⊗ p(t + 1) .
For details of the actual optimisation see section 9.5.1.
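As a sketch of how (10.3) is applied, the update is a single matrix product with the Kronecker product of the current state, action and percept vectors (illustrative code; the mapping C is assumed to have been trained beforehand, e.g. with the non-negative least squares referred to above):

import numpy as np

def update_state(C, s, a, p_next):
    # f contains all products s_i * a_j * p_k; C has shape (Ns, Ns*Na*Np)
    f = np.kron(np.kron(s, a), p_next)
    return C @ f      # unnormalised new state, see section 10.3.1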
10.3.1
Notes on the state mapping
The first thing to note about usage of the mapping, C, is that the state vector
obtained by the mapping has to be normalised at each time step, i.e.
s̃(t + 1) = C [s(t) ⊗ a(t) ⊗ p(t + 1)]
s(t + 1) = s̃(t + 1) / Σ_k s̃_k(t + 1) .                                  (10.4)
In the environment described in section 10.2, we obtain exactly the same behaviour when we use two separate maps:
s*(t + 1) = C1 [s(t) ⊗ a(t)]
s̃(t + 1) = C2 [s*(t + 1) ⊗ p(t + 1)] .                                   (10.5)
These two maps correspond to the boxes state transition and state narrowing in
figure 10.1. An interesting parallel to on-line filtering algorithms in statistics is
that C1 corresponds to the stochastic transition model
s∗ (t + 1) ∼ p(x(t + 1)|s(t), a(t))
(10.6)
where x is the unknown current state. Additionally, C2 is related to the stochastic
observation model p(p(t)|s(t)). A probabilistic interpretation of C2 would be
s(t + 1) ∼ p(x(t + 1)|s∗ (t + 1), p(t + 1)) .
(10.7)
See for instance [101] for a system which makes use of this framework.
The mappings have sizes Ns × Ns Na and Ns × Ns Np, and this gives us at most Ns²(Na + Np) coefficients compared to Ns² Na Np in the single mapping case.
Thus the split into two maps is advantageous, provided that the behaviour is not
affected (which in our case it is not).
Aside from the gain in number of coefficients, the split into two maps will
also simplify the optimisation of the mappings considerably. If we during the
optimisation supply samples of s∗ (t+1) that are identical to s(t+1) we end up with
a mapping, C2 , that simply weights the state vector with the correlations between
the observed percept and those corresponding to each state during optimisation.
In other words (10.5) is equivalent to
s̃(t + 1) = diag(Pp(t + 1))C1 [s(t) ⊗ a(t)] .
(10.8)
Here P is a matrix with row n containing the percept observed at state n during the training, and diag() generates a matrix with the argument vector on the diagonal.
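The split update can be sketched as follows, in the form of (10.8), with an optional reshaping f() of the percept correlations that is used later in section 10.3.3 (all names are illustrative; P is assumed to hold the training percepts as rows):

import numpy as np

def narrow_update(C1, P, s, a, p_next, f=lambda c: c):
    s_star = C1 @ np.kron(s, a)        # state transition, first row of (10.5)
    s_tilde = f(P @ p_next) * s_star   # state narrowing, diag(f(P p)) s*, cf. (10.8)
    return s_tilde / s_tilde.sum()     # normalisation as in (10.4)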
10.3.2
Exploratory behaviour
How quickly the system is able to recognise its location is of course critically dependent on which actions it takes. A good exploratory behaviour should strive to observe new percepts as often as possible, but how can the system know to shift its attention to something new when it does not yet know where it is?
In this system the actions are chosen using a policy, where the probabilities
for each action are conditional on the previous action a(t − 1) and the observed
percept p(t). I.e. the action probabilities can be calculated as
p(a(t) = ah ) = ch [a(t − 1) ⊗ p2 (t)]
(10.9)
Figure 10.3: Illustration of state narrowing. The figure shows, over a number of time steps, the state estimate using p1, the state estimate using p2, and the actual state.
where {ah}, h = 1, 2, 3, are the three possible actions (see section 10.2). The coefficients in the mappings {ch} should be defined such that Σ_h p(a(t) = ah) = 1.
Initially we define the policy {ch} manually. A random run of a system with a fixed policy is demonstrated in figure 10.3. The two different kinds of percepts p1 and p2 are those defined in (10.1).
10.3.3
Evaluating narrowing performance
The performance of the localisation process may be evaluated by observing how
the estimated state vector s(t) changes over time. As a measure of how narrow a
specific state vector is we will use
n(t) = Σ_k s_k(t) / max_k {s_k(t)} .                                     (10.10)
If all state channels are activated to the same degree, as is the case for t = 0, we
will get n(t) = Ns , and if just one state channel is activated we will get n(t) = 1.
Thus n(t) can be seen as a measure of how many possible states are still remaining.
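The narrowing measure is a one-liner in code (a sketch with illustrative names):

import numpy as np

def narrowing(s):
    # equals Ns when all state channels are equally active, and 1 when a single state remains
    return s.sum() / s.max()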
Figure 10.4 (top) shows a comparison of systems using local and semi-local
percepts for 50 runs of the network. For each run the true initial state is selected
at random, and s(0) is set to 1/Ns .
Figure 10.4: Narrowing performance.
Top left: n(t) for a system using p1 . Top right: n(t) for a system using p2 . Each
graph shows 50 runs (dotted). The solid curves are averages. Bottom: Solid: n(t)
for p1 and p2 . Dashed: p1 using f1 (). Dash-dotted: p1 using f2 (). Each curve is
an average over 50 runs.
Since the only thing that differs between the two upper plots in figure 10.4 is
the percepts, the difference in convergence has to occur in step 2 of (10.5). We can
further demonstrate what influence the feature correlation has on the convergence
by modifying the correlation step in equation 10.8 as follows
s̃(t + 1) = diag(f(Pp(t + 1)))C1 [s(t) ⊗ a(t)] .
(10.11)
We will try the following two choices of f() on correlations of the semi-local percepts
f1(c) = √c    and
f2(c) = { 1  if c > 0
        { 0  otherwise.                                                  (10.12)
All four kinds of systems are compared in the lower graph of figure 10.4. As
can be seen, the narrowing behaviour is greatly improved by a sharp decay of the
percept correlation function. However, for continuous environments there will most likely be a trade-off between sharp correlation functions on the one hand, and state interpolation and the number of samples required during training on the other.
10.3.4
Learning a narrowing policy
The conditional probabilities in the policy defined in section 10.3.2 can be learned
using reinforcement learning [96]. A good exploratory behaviour is found by giving
rewards to conditional actions {a(t)|p(t), a(t − 1)} that reduce the narrowing measure (10.10), and by having the action probability density p(a(t) = ah |p(t), a(t−1))
gradually increase for conditional actions with above-average rewards. This is
called a pursuit method [96].
In order for the rewards not to die out, the system state is regularly reset to
all ones, for instance when t mod 30 = 0. The first attempt is to define the reward
as a plain difference of the narrowing measure (10.10), i.e.
r1 (t) = n(t − 1) − n(t) .
(10.13)
With this reward, the agent easily gets stuck in sub-optimal policies, such as
constantly trying to move into a wall. Better behaviour is obtained by also looking
at the narrowing difference one step into the future, i.e.
r2(t) = r1(t) + r1(t + 1) = n(t − 1) − n(t + 1) .                        (10.14)
Figure 10.5: Narrowing performance.
Left: n(t) for a policy learned using r1 (t). Right: n(t) for a policy learned using
r2 (t). Each graph shows 50 runs (dotted). The thick curves are averages. Dashed
curves show average narrowing for a completely random walk.
The behaviours learned using (10.13) and (10.14) are compared with a random
walk in figure 10.5.
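A generic pursuit-style update for this setting could look like the sketch below; it is only an illustration of the idea (running reward averages per condition, and action probabilities pulled towards the currently greedy action), not the exact scheme used in the experiments:

import numpy as np

def reward_r2(n_prev, n_next):
    # r2(t) = n(t-1) - n(t+1), cf. (10.14)
    return n_prev - n_next

def pursuit_update(q, pi, cond, action, reward, alpha=0.1, beta=0.1):
    # q[cond, action]: running average of the reward for each conditional action
    q[cond, action] += alpha * (reward - q[cond, action])
    # move the probabilities for this condition towards the greedy action
    greedy = np.argmax(q[cond])
    target = np.eye(q.shape[1])[greedy]
    pi[cond] += beta * (target - pi[cond])
    return q, pi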
10.4
Concluding remarks
The aim of this chapter has not been to describe a useful application, but instead to
show how the principle of successive recognition can be used. Compared to a real
robot navigation task, the environment used is far too simple to serve as a model
world. Further experiments will extend the model to continuous environments,
with noisy percepts and actions.
Chapter 11
Conclusions and Future
Research Directions
In this chapter we conclude the thesis by summarising the results. We also indicate
open issues, which point to research directions that can be pursued further.
11.1
Conclusions
This thesis is the result of asking the question “What can be done in the channel
representation, which cannot be accomplished without it?”. We started by deriving
expressions for channel encoding scalars, and retrieving them again using a local
decoding. We then investigated what the simple operation of averaging in the
channel representation resulted in.
The result that several modes of a distribution can be treated in parallel is of
fundamental importance in perception. Perception in the presence of noise is a
difficult problem. One especially persistent kind of noise is a competing nearby
feature. By making use of the channel representation, we can make this problem
go away, by simultaneously estimating all present features in parallel. We can
select significant features after estimation, by picking one or several of the local
decodings. In principle this would allow the design of a perception system similar
to that of the bat described in section 2.1.4.
Channel representation is also useful for response generation. The associative
networks in chapter 9 were shown to be able to learn piecewise continuous mappings,
without blurring discontinuities. Such an ability is useful in response generation,
such as navigation with obstacle avoidance. If we encounter an obstacle in front of
us, it might be possible to pass it on both the left and the right side, so both these
options are valid responses. Their average however is not, and thus a system that
learns obstacle avoidance will need to use some kind of channel representation in
order to avoid such inappropriate averaging.
For all levels in a perception system it is crucial that not all information is
processed at each position. In order not to be flooded with data we need to exploit
locality, i.e. restricting the amount of information available at each position to a
local context. This can however lead to problems such as perceptual aliasing. As
was demonstrated in chapter 9, intermediate responses in channel representation are
a proper way to deal with perceptually aliased states. The channel representation
solves the perceptual aliasing problem by not trying to merge states, but instead
passing them on to the next processing level, where hopefully more context will
be available to resolve the aliasing.
11.2
Future research
As is common in science, this thesis answered some questions, but at the same
time it raised several new ones. We will now mention some questions which might
be worthwhile to pursue further.
Thus far, the only operations considered in channel spaces are averaging and
non-negative projections. Are there other meaningful operations in channel spaces?
One option is to adapt the averaging to the local image structure, see Felsberg’s
paper [28] for some first results in this area.
The clustering of constant slopes developed in chapter 7 is, as mentioned, just a first result. It could probably benefit from a change in the representation of the slopes to be clustered.
11.2.1
Feature matching and recognition
Two currently active areas in computer vision are wide baseline matching, see e.g. [99], and parts-based object recognition, see e.g. [70, 66, 82]. Both of these areas
are possible applications for the blob features developed in chapter 7.
11.2.2
Perception action cycles
Active vision is an important aspect of robotics. Apart from the simple example
in chapter 10, this thesis has not dealt with closed perception–action loops. One
direction to pursue is to explore the visual servoing idea in connection with the
methods and representations developed in this thesis. One way to do this is to
apply the successive recognition system in chapter 10 to more realistic problems,
but other architectures and approaches could also prove useful.
It is by logic we prove, it is by intuition that we invent.
Henri Poincaré, 1904
Appendices
A
Theorems on cos2 kernels
Theorem A.1 For cos2 kernels with ω = π/N , a group of N consecutive channels
starting at index k has a common active domain of
SkN = ]k − 1 + N/2, k + N/2[ .
Proof: The active domain (non-zero domain, or support) of a channel is defined as

Sk = {x : Bk(x) > 0} = ]Lk, Uk[ .                                         (A.1)

Since the kernels should go smoothly to zero (as discussed in section 3.2.2), this is always an open interval, as indicated by the brackets. For the cos² kernel (3.2) we have domains of the type

Sk = ]k − π/(2ω), k + π/(2ω)[ .                                           (A.2)

For ω = π/N this becomes

Sk = ]k − N/2, k + N/2[ .                                                 (A.3)

The common active domain of N channels, S_k^N, becomes

S_k^N = Sk ∩ Sk+1 ∩ ... ∩ Sk+N−1 = ]L_{k+N−1}, Uk[                        (A.4)
      = ]k + N − 1 − N/2, k + N/2[ = ]k − 1 + N/2, k + N/2[ .             (A.5)

This concludes the proof. ¤
Theorem A.2 For cos2 kernels with ω = π/N , and a local decoding using N
channels, the represented domain of a K channel set becomes
R_K^N = ]N/2, K + 1 − N/2] .
Proof: If we perform the local decoding using groups of N channels with ω = π/N ,
we will have decoding intervals according to theorem A.1. Note that we need to
have N ∈ N/{0, 1} in order to have a proper decoding. These intervals are all
of length 1, and thus they do not overlap. We now modify the upper end of the
intervals

S_k^N = ]k − 1 + N/2, k + N/2]                                            (A.6)
in order to be able to join them. This makes no practical difference, since all that
happens at the boundary is that one channel becomes inactive. For a channel
representation using K channels (with K ≥ N ) we get a represented interval of
type
R_K^N = S_1^N ∪ S_2^N ∪ ... ∪ S_{K−N+1}^N = ]L_{1+N−1}, U_{K−N+1}]        (A.7)
      = ]N/2, K − N + 1 + N/2] = ]N/2, K + 1 − N/2] .                     (A.8)

This concludes the proof. ¤
Theorem A.3 The sum of a channel value vector ( B1(x)  B2(x)  ...  BK(x) )^T for ω = π/N, where N ∈ N/{0, 1}, is invariant to the value of x when x ∈ R_K^N.
Proof: According to theorem A.1, groups of N consecutive channels with ω =
π/N have mutually non-overlapping active domains SkN . This means that for a
given channel vector, the value x will fall into exactly one of these domains, SkN .
Thus the sum over the entire channel set is equal to the sum over the channels
belonging to SkN , for some value of k
Σ_{n=0}^{K} Bn(x) = Σ_{n=k}^{k+N−1} Bn(x) = Σ_{n=0}^{N−1} B_{k+n}(x) .    (A.9)
We now define a complex valued function
v_k(x) = e^{i2ω(x−k)} .                                                   (A.10)
This allows us to write the kernel function Bk (x) as
Bk(x) = cos²(ω(x − k)) = 0.5 + 0.5 cos(2ω(x − k)) = 0.5 + 0.5 Re[v_k(x)] .
Now the sum in (A.9) becomes
"N −1
#
X
1
N
+ Re
Bk+n (x) =
vk+n (x) .
2
2
n=0
n=0
N
−1
X
(A.11)
The complex sum in this expression can be rewritten as
Σ_{n=0}^{N−1} v_{k+n}(x) = Σ_{n=0}^{N−1} e^{i2ω(x−k−n)} = e^{i2ω(x−k)} Σ_{n=0}^{N−1} (e^{−i2ω})^n .   (A.12)
For¹ e^{−i2ω} ≠ 1 this geometric sum can be written as

Σ_{n=0}^{N−1} (e^{−i2ω})^n = (1 − e^{−i2ωN}) / (1 − e^{−i2ω}) .           (A.13)

¹ The case e^{−i2ω} = 1 never happens, since it is equivalent to ω = nπ, where n ∈ N, and our assumption was ω = π/N, N ∈ N/{0, 1}.
The numerator of this expression is zero exactly when ωN = nπ, n ∈ N. Since our
assumption was ω = π/N , N ∈ N/{0, 1}, it is always zero. From this follows that
the exponential sum in equation A.11 also equals zero. We can now reformulate
equation A.11 as

Σ_{n=0}^{N−1} B_{k+n}(x) = N/2    for ω = π/N where N ∈ N/{0, 1}.         (A.14)

This in conjunction with (A.9) proves the theorem. ¤
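A quick numerical check of this result (and of Theorem A.4 below) can be made with a direct implementation of the cos² kernel; this sketch assumes Bk(x) = cos²(ω(x − k)) for |x − k| < π/(2ω) and zero otherwise, with ω = π/N:

import numpy as np

def cos2_channels(x, K, N):
    w = np.pi / N
    k = np.arange(1, K + 1)
    d = x - k
    return np.where(np.abs(d) < np.pi / (2 * w), np.cos(w * d) ** 2, 0.0)

K, N = 10, 3
for x in np.linspace(N / 2 + 0.01, K + 1 - N / 2 - 0.01, 5):   # x inside R_K^N
    B = cos2_channels(x, K, N)
    print(B.sum(), (B ** 2).sum())    # constantly N/2 = 1.5 and 3N/8 = 1.125 (up to rounding)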
Theorem A.4 The sum of a squared channel value vector ( B1(x)²  B2(x)²  ...  BK(x)² )^T for ω = π/N, where N ∈ N/{0, 1, 2}, is invariant to the value of x when x ∈ R_K^N.
Proof: The proof of this theorem is similar to the proof of theorem A.3. With
the same reasoning as in (A.9) we get
Σ_{n=0}^{K} Bn(x)² = Σ_{n=k}^{k+N−1} Bn(x)² = Σ_{n=0}^{N−1} B_{k+n}(x)² .   (A.15)
We now rewrite the squared kernel function as
Bk(x)² = cos⁴(ω(x − k)) = 3/8 + (1/2) cos(2ω(x − k)) + (1/8) cos(4ω(x − k)) .
This allows us to rewrite (A.15) as

Σ_{n=0}^{N−1} B_{k+n}(x)² = 3N/8 + (1/2) Re[ Σ_{n=0}^{N−1} v_{k+n}(x) ] + (1/8) Re[ Σ_{n=0}^{N−1} v_{k+n}²(x) ] .   (A.16)
The first complex sum in this expression is zero for ω = π/N , where
N ∈ N/{0, 1} (see equations A.12 and A.13).
The second sum can be written as
Σ_{n=0}^{N−1} v_{k+n}²(x) = Σ_{n=0}^{N−1} e^{i4ω(x−k−n)} = e^{i4ω(x−k)} Σ_{n=0}^{N−1} (e^{−i4ω})^n .   (A.17)
For e^{−i4ω} ≠ 1 (that is, ω ≠ nπ/2 where n ∈ N)², this geometric sum can be written as

Σ_{n=0}^{N−1} (e^{−i4ω})^n = (1 − e^{−i4ωN}) / (1 − e^{−i4ω}) .           (A.18)

² In effect this excludes the solutions N = 1, and N = 2.

The numerator of this expression is zero exactly when ω = nπ/(2N), for integers n and N, but again our premise was ω = π/N, N ∈ N/{0, 1, 2}, so it is always zero. The constraints on equation A.18 require us to exclude the cases N ∈ {0, 1, 2}.
We can now reformulate equation A.16 as
Σ_{n=0}^{N−1} B_{k+n}(x)² = 3N/8    for ω = π/N where N ∈ N/{0, 1, 2}.    (A.19)

This in conjunction with (A.15) proves the theorem. ¤
Observation A.5 We will now derive a local decoding for the cos2 when ω = π/2.
For the case ω = π/2 we can also define a local decoding, but it is more difficult
to decide whether the decoding is valid or not. We now have the system
( u^{l}   )   ( r B_l(x)     )   ( r cos²(π/2 (x − l))     )
( u^{l+1} ) = ( r B_{l+1}(x) ) = ( r cos²(π/2 (x − l − 1)) )              (A.20)

since cos(x − π/2) = sin(x) we have

( u^{l}   )   ( r cos²(π/2 (x − l)) )
( u^{l+1} ) = ( r sin²(π/2 (x − l)) ) .                                   (A.21)

We now see that

x̂ = l + (2/π) arg[ √(u^{l}) + i √(u^{l+1}) ] = l + (2/π) tan⁻¹( √(u^{l+1}/u^{l}) )   (A.22)

and

r̂1 = |u^{l} + i u^{l+1}|   and   r̂2 = u^{l} + u^{l+1} .                   (A.23)
In order to select valid decodings, we cannot simply check if x̂ is inside the common support, since this is always the case. One way to avoid giving invalid solutions is to require that r̂2(l) ≥ r̂2(l + 1) and r̂2(l) ≥ r̂2(l − 1). ¤
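A small sketch of this decoding (A.22)-(A.23), with illustrative names, encoding a value with ω = π/2 into two neighbouring channels and decoding it again:

import numpy as np

def decode_pi_half(u_l, u_l1, l):
    x_hat = l + (2 / np.pi) * np.arctan2(np.sqrt(u_l1), np.sqrt(u_l))
    r1 = abs(u_l + 1j * u_l1)
    r2 = u_l + u_l1
    return x_hat, r1, r2

x, l = 2.3, 2
u_l  = np.cos(np.pi / 2 * (x - l)) ** 2        # channel l
u_l1 = np.cos(np.pi / 2 * (x - l - 1)) ** 2    # channel l + 1
print(decode_pi_half(u_l, u_l1, l))            # x_hat = 2.3 and r2 = 1 are recovered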
Theorem A.6 The cos2 local decoding is an unbiased estimate of the mean, if the
PDF f (x) is even, and restricted to the decoding support SlN .
f(x) = f(2µ − x)   and   supp{f} ⊂ S_l^N     ⇒     E{x̂} = E{xn} .

The local decoding of a cos² channel representation consists of two steps: a
linear parameter estimation and a non-linear combination of the parameters into
estimates of the mode location and the confidence. The expected value of the
linear parameter estimation is

E{p} = ( ∫_{S_l^N} cos(2ω(x − l)) f(x) dx )
       ( ∫_{S_l^N} sin(2ω(x − l)) f(x) dx )                               (A.24)
       ( ∫_{S_l^N} f(x) dx                )
if we require that supp{f } ⊂ SlN , see section 4.2.2. We now simplify the notation,
by denoting c(x) = cos(2ωx), and s(x) = sin(2ωx). Further, we assume that f is
even about a point µ, i.e. f (x) = f (2µ − x). This allows us to rewrite E{p1 } as
E{p1} = ∫_{S_l^N} f(x) c(x − l) dx = ∫_{S_l^N} f(x) c(x − µ + µ − l) dx                 (A.25)
      = ∫_{S_l^N} f(x) [c(x − µ) c(µ − l) − s(x − µ) s(µ − l)] dx                        (A.26)
      = c(µ − l) ∫_{S_l^N} f(x) c(x − µ) dx − s(µ − l) ∫_{S_l^N} f(x) s(x − µ) dx        (A.27)
      = c(µ − l) ∫_{S_l^N} f(x) c(x − µ) dx                                              (A.28)
where one of the integrals becomes zero due to antisymmetry about µ. In a similar
way we can rewrite E{p2 } as
E{p2} = s(µ − l) ∫_{S_l^N} f(x) c(x − µ) dx .                             (A.29)
We now denote the integral by α
α = ∫_{S_l^N} f(x) c(x − µ) dx .                                          (A.30)
This allows us to write
E{p1 } = α cos(2ω(µ − l))
(A.31)
E{p2 } = α sin(2ω(µ − l)) .
(A.32)
Finally we insert these two expressions into the non-linear step of the decoding
E{x̂} = l + (1/(2ω)) arg[ E{p1} + i E{p2} ]                                (A.33)
      = l + (1/(2ω)) arg[ α e^{i2ω(µ−l)} ] = l + (1/(2ω)) (2ω(µ − l))     (A.34)
      = µ .                                                               (A.35)
For a density that is even about µ, we also get
E{xn} = ∫ x f(x) dx                                                       (A.36)
      = ∫ (2µ − x) f(2µ − x) dx = 2µ − ∫ x f(2µ − x) dx = 2µ − ∫ x f(x) dx .   (A.37)

Setting the right-hand side of (A.36) equal to the right-hand side of (A.37) gives

E{xn} = ∫ x f(x) dx = µ .                                                 (A.38)
Together with (A.35) this gives E{x̂} = E{xn }, which concludes the proof.
¤
B
Theorems on B-splines
Theorem B.1 The sum of integer shifted B-splines is independent of the encoded
scalar for any degree n.
Σ_k B_k^n(x) = 1     ∀x, n ∈ N
Proof: B-splines of degree zero are defined as
B_k^0(x) = { 1   if k − 0.5 ≤ x < k + 0.5
           { 0   otherwise.

From this trivially follows that the zeroth degree sum is constant, since exactly one B-spline is non-zero, and equal to 1 at a time. That is

Σ_k B_k^0(x) = 1     ∀x .                                                 (B.1)
Using the recurrence relation (5.10), we can express the sum of an arbitrary degree
as
Σ_k B_k^n(x)                                                                                      (B.2)
  = Σ_k [ ((x − k + (n+1)/2)/n) B_{k−1/2}^{n−1}(x) + (((n+1)/2 − x + k)/n) B_{k+1/2}^{n−1}(x) ]   (B.3)
  = Σ_k ((x − k + (n+1)/2)/n) B_{k−1/2}^{n−1}(x) + Σ_k (((n+1)/2 − x + k)/n) B_{k+1/2}^{n−1}(x)   (B.4)
  = Σ_l ((x − l + n/2)/n) B_l^{n−1}(x) + Σ_l ((n/2 + l − x)/n) B_l^{n−1}(x)                       (B.5)
  = Σ_l ((x − l + n/2 + n/2 − x + l)/n) B_l^{n−1}(x)                                              (B.6)
  = Σ_l B_l^{n−1}(x) .                                                                            (B.7)
That is, the sum of B-splines of degree n is equal to the sum of degree n − 1. This
in conjunction with (B.1) proves theorem B.1 by induction.
¤
Theorem B.2 The first moment of integer shifted B-splines is equal to the encoded scalar for any degree n ≥ 1.

Σ_k k B_k^n(x) = x     ∀x, n ∈ N⁺
Proof: Using (5.11), the first moment of B-splines of degree one can be written
as
Σ_k k B_k^1(x) = Σ_k k [ (x − k + 1) B_{k−1/2}^0(x) + (1 − x + k) B_{k+1/2}^0(x) ]                (B.8)
  = Σ_k k (x − k + 1) B_{k−1/2}^0(x) + Σ_k k (1 − x + k) B_{k+1/2}^0(x)                           (B.9)
  = Σ_l (l + 1/2)(x − l + 1/2) B_l^0(x) + Σ_l (l − 1/2)(1/2 − x + l) B_l^0(x)                     (B.10)
  = Σ_l (1/4 − l² + lx + x/2) B_l^0(x) + Σ_l (l² − 1/4 − lx + x/2) B_l^0(x)                       (B.11)
  = Σ_l x B_l^0(x) = x Σ_l B_l^0(x) = x .                                                         (B.12)
We now make an expansion of the first moment using the recurrence relation
(5.10),
Σ_k k B_k^n(x)                                                                                    (B.13)
  = Σ_k k [ ((x − k + (n+1)/2)/n) B_{k−1/2}^{n−1}(x) + (((n+1)/2 − x + k)/n) B_{k+1/2}^{n−1}(x) ] (B.14)
  = Σ_k k ((x − k + (n+1)/2)/n) B_{k−1/2}^{n−1}(x) + Σ_k k (((n+1)/2 − x + k)/n) B_{k+1/2}^{n−1}(x)  (B.15)
  = Σ_l ((l + 1/2)(x − l + n/2)/n) B_l^{n−1}(x) + Σ_l ((l − 1/2)(n/2 − x + l)/n) B_l^{n−1}(x)     (B.16)
  = Σ_l ((2lx − 2l² + ln + x − l + n/2)/(2n)) B_l^{n−1}(x)                                        (B.17)
    + Σ_l ((ln − 2lx + 2l² − n/2 + x − l)/(2n)) B_l^{n−1}(x)                                      (B.18)
  = Σ_l ((ln + x − l)/n) B_l^{n−1}(x) .                                                           (B.19)

By applying theorem B.1 we get

Σ_k k B_k^n(x) = Σ_l ((ln + x − l)/n) B_l^{n−1}(x)                                                (B.20)
             = x/n + (1 − 1/n) Σ_l l B_l^{n−1}(x) .                                               (B.21)
If we assume the theorem holds for n − 1, we get
Σ_k k B_k^n(x) = x/n + (1 − 1/n) Σ_l l B_l^{n−1}(x)                                               (B.22)
             = x/n + (1 − 1/n) x = x .                                                            (B.23)

This in conjunction with (B.12) proves theorem B.2 by induction. ¤
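Both theorems are easy to check numerically with the standard closed form for the centred cardinal B-spline of degree n (a sketch; the closed form below is a textbook expression, not the recurrence (5.10) used in the thesis):

import numpy as np
from math import comb, factorial

def bspline(x, n):
    # centred cardinal B-spline of degree n >= 1, support ]-(n+1)/2, (n+1)/2[
    x = np.asarray(x, float) + (n + 1) / 2
    s = np.zeros_like(x)
    for j in range(n + 2):
        s += (-1) ** j * comb(n + 1, j) * np.maximum(x - j, 0.0) ** n
    return s / factorial(n)

n, K = 3, 12
x = 6.3                      # a point well inside the represented range
k = np.arange(K)
B = bspline(x - k, n)        # integer shifted channels B_k^n(x)
print(B.sum())               # = 1, Theorem B.1
print((k * B).sum())         # = 6.3, Theorem B.2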
C
Theorems on ellipse functions
Theorem C.1 The matrix A describing the shape of an ellipse
(x − m)T A(x − m) ≤ 1 is related to the inertia matrix I of the same ellipse
according to
I = (1/4) A⁻¹    or    A = (1/4) I⁻¹ .
Proof: A surface patch in the shape of an ellipse is the set of points x = (x1 x2 )T
fulfilling the relation (x1 /a)2 + (x2 /b)2 ≤ 1. This can be rewritten as
x^T D x ≤ 1    for    D = ( 1/a²   0
                            0    1/b² ) .                                 (C.1)
In order to describe an ellipse with arbitrary position and orientation, we add a
rotation R = (r1 r2 ), and a translation m = (m1 m2 )T and obtain
(x − m)^T A (x − m) ≤ 1    where    A = R^T D R .                         (C.2)
Note that the square root of the left hand expression is a Mahalanobis distance
between m and x, with A defining the metric, see e.g. [7]. A often corresponds
to the inverse covariance of a data set.
For the ellipse described by A, and m, we can define a binary mask,
v(x) = { 1   if (x − m)^T A (x − m) ≤ 1
       { 0   otherwise.                                                   (C.3)

The mask v(x) has moments that in the continuous case are given by
µ_kl = ∫_{R²} x1^k x2^l v(x) dx = ∫_{(x−m)^T A (x−m) ≤ 1} x1^k x2^l dx                            (C.4)
     = ∫_{x^T R^T D R x ≤ 1} (x1 + m1)^k (x2 + m2)^l dx                                           (C.5)
     [ y = Rx, dx = dy ]
     = ∫_{y^T D y ≤ 1} (r1^T y + m1)^k (r2^T y + m2)^l dy                                         (C.6)
     [ x = D^{1/2} y, dy = |D^{−1/2}| dx ]
     = ∫_{x^T x ≤ 1} (r1^T D^{−1/2} x + m1)^k (r2^T D^{−1/2} x + m2)^l |D^{−1/2}| dx              (C.7)
     [ x = ρ (cos ϕ, sin ϕ)^T = ρ n̂, dx = ρ dρ dϕ ]
     = ∫_{−π}^{π} ∫_0^1 (r1^T D^{−1/2} ρ n̂ + m1)^k (r2^T D^{−1/2} ρ n̂ + m2)^l a b ρ dρ dϕ .       (C.8)
If we define the rotation R to be
R = (  cos φ   sin φ
      −sin φ   cos φ ) = ( r1   r2 )                                      (C.9)
we can simplify this to
µ_kl = ∫_{−π}^{π} ∫_0^1 (ρ a cos φ cos ϕ − ρ b sin φ sin ϕ + m1)^k (ρ a cos φ cos ϕ + ρ b sin φ sin ϕ + m2)^l a b ρ dρ dϕ .   (C.10)
We can now verify the expressions for the low order moments
µ00 = ∫_{−π}^{π} ∫_0^1 a b ρ dρ dϕ = π a b                                                        (C.11)
µ10 = ∫_{−π}^{π} ∫_0^1 (ρ a cos φ cos ϕ − ρ b sin φ sin ϕ + m1) a b ρ dρ dϕ = m1 π a b            (C.12)
µ01 = ∫_{−π}^{π} ∫_0^1 (ρ a cos φ cos ϕ + ρ b sin φ sin ϕ + m2) a b ρ dρ dϕ = m2 π a b            (C.13)
µ20 = ∫_{−π}^{π} ∫_0^1 (ρ a cos φ cos ϕ − ρ b sin φ sin ϕ + m1)² a b ρ dρ dϕ
    = π a b ( (1/4)(a² cos² φ + b² sin² φ) + m1² )                                                (C.14)
µ02 = ∫_{−π}^{π} ∫_0^1 (ρ a cos φ cos ϕ + ρ b sin φ sin ϕ + m2)² a b ρ dρ dϕ
    = π a b ( (1/4)(a² cos² φ + b² sin² φ) + m2² )                                                (C.15)
µ11 = ∫_{−π}^{π} ∫_0^1 (ρ a cos φ cos ϕ − ρ b sin φ sin ϕ + m1)(ρ a cos φ cos ϕ + ρ b sin φ sin ϕ + m2) a b ρ dρ dϕ
    = π a b ( (1/4)(a² cos² φ − b² sin² φ) + m1 m2 ) .                                            (C.16)
We now group the three second order moments into a matrix
( µ20   µ11 )
( µ11   µ02 ) = (πab/4) ( r1^T D⁻¹ r1   r1^T D⁻¹ r2
                          r2^T D⁻¹ r1   r2^T D⁻¹ r2 ) + πab m m^T         (C.17)
              = (πab/4) R^T D⁻¹ R + πab m m^T .                           (C.18)
By division with µ00 , see (C.11), we get
(1/µ00) ( µ20   µ11
          µ11   µ02 ) = (1/4) R^T D⁻¹ R + m m^T .                         (C.19)
By subtraction of mmT we obtain the definition of the inertia matrix
I = (1/µ00) ( µ20   µ11
              µ11   µ02 ) − m m^T = (1/4) R^T D⁻¹ R .                     (C.20)
Here we recognise the inverse of the ellipse matrix A−1 = RT D−1 R, see (C.2),
and thus the ellipse matrix A is related to I as
I = (1/4) A⁻¹    or    A = (1/4) I⁻¹                                      (C.21)

which was what we set out to prove. ¤
Theorem C.2 The axes, and the area of an ellipse can be extracted from its
inertia matrix I according to
{a, b} = {2√λ1, 2√λ2}    and    Area = 4π √(det I) .
Proof: For positive definite matrices, the eigenvectors constitute a rotation, and thus (C.20) is an eigenvalue decomposition of I. In other words I = λ1 ê1 ê1^T + λ2 ê2 ê2^T has its eigenvalues in the diagonal of (1/4) D⁻¹, i.e. {λ1, λ2} = {a²/4, b²/4}. From this follows that {a, b} = {2√λ1, 2√λ2}.
Since det I = λ1 λ2 = a²b²/16, we can find the ellipse area as πab = 4π √(det I).
Also note that of all shapes with a given inertia matrix, the ellipse is the
one that is best concentrated around m. This means that in the discrete case,
the above area measure will always be an overestimate of the actual area, with
exception of the degenerate case when all pixels lie on a line.
¤
Theorem C.3 The outline of an ellipse is given by the parameter curve
x = R^T D^{−1/2} ( cos t
                   sin t ) + m    for t ∈ [0, 2π[ .                       (C.22)
Proof: An ellipse is the set of points x ∈ R2 fulfilling relation (C.2). By inserting
the parameter curve (C.22) into the quadratic form of (C.2) we obtain
(x − m)^T A (x − m) = (x − m)^T R^T D R (x − m)                           (C.23)
  = ( cos t   sin t ) D^{−1/2} R R^T D R R^T D^{−1/2} ( cos t
                                                        sin t )           (C.24)
  = cos² t + sin² t = 1 .                                                 (C.25)
Thus all points in (C.22) belong to the ellipse outline. Note that (C.22) is a
convenient way to draw the ellipse outline.
¤
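The three theorems can be checked numerically on a rasterised ellipse; pixel discretisation makes the match approximate, in line with the remark after Theorem C.2 (all names and parameter values below are illustrative):

import numpy as np

a, b, phi, m = 20.0, 10.0, 0.4, np.array([40.0, 50.0])
R = np.array([[np.cos(phi), np.sin(phi)], [-np.sin(phi), np.cos(phi)]])
D = np.diag([1 / a**2, 1 / b**2])
A = R.T @ D @ R

# binary mask v(x) of (C.3) on a pixel grid
x1, x2 = np.meshgrid(np.arange(100.0), np.arange(100.0), indexing="ij")
X = np.stack([x1 - m[0], x2 - m[1]], axis=-1)
v = (np.einsum("...i,ij,...j->...", X, A, X) <= 1).astype(float)

# moments -> inertia matrix I, then axes and area as in Theorem C.2
mu00 = v.sum()
c = np.array([(x1 * v).sum(), (x2 * v).sum()]) / mu00
d1, d2 = x1 - c[0], x2 - c[1]
I = np.array([[(d1 * d1 * v).sum(), (d1 * d2 * v).sum()],
              [(d1 * d2 * v).sum(), (d2 * d2 * v).sum()]]) / mu00
print(2 * np.sqrt(np.linalg.eigvalsh(I)))                    # approximately {b, a} = {10, 20}
print(4 * np.pi * np.sqrt(np.linalg.det(I)), np.pi * a * b)  # the two area estimates agree closely

# outline points as in (C.22)
t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
outline = (R.T @ np.diag([a, b]) @ np.vstack([np.cos(t), np.sin(t)])).T + m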
Bibliography
[1] Y. Aloimonos, I. Weiss, and A. Bandopadhay. Active vision. Int. Journal of
Computer Vision, 1(3):333–356, 1988.
[2] V. Aurich and J. Weule. Non-linear gaussian filters performing edge preserving diffusion. In 17:th DAGM-Symposium, pages 538–545, Bielefeld, 1995.
[3] R. Bajcsy. Active perception. Proceedings of the IEEE, 76(8):996–1005,
August 1988.
[4] D. H. Ballard. Animate vision. In Proc. Int. Joint Conf. on Artificial Intelligence, pages 1635–1641, 1989.
[5] M. F. Bear, B. W. Connors, and M. A. Paradiso. Neuroscience: Exploring
the Brain. Williams & Wilkins, 1996. ISBN 0-683-00488-3.
[6] S. Belongie, C. Carson, H. Greenspan, and J. Malik. Color- and texture-based image segmentation using EM and its application to content-based
image retrieval. In Proceedings of the Sixth International Conference on
Computer Vision, pages 675–682, 1998.
[7] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University
Press, 1995. ISBN 0-19-853864-2.
[8] M. Black, G. Sapiro, D. Marimont, and D. Heeger. Robust anisotropic
diffusion. IEEE Transactions on Image Processing, 7(3):421–432, March
1998.
[9] A. Blake. The Handbook of Brain Theory and Neural Networks, chapter
Active Vision, pages 61–63. MIT Press, 1995. M. A. Arbib, Ed.
[10] M. Borga. Learning Multidimensional Signal Processing. PhD thesis,
Linköping University, Sweden, SE-581 83 Linköping, Sweden, 1998. Dissertation No 531, ISBN 91-7219-202-X.
[11] R. Brooks. A robust layered control system for a mobile robot. IEEE Trans.
on Robotics and Automation, 2(1):14–23, March 1986.
[12] H. H. Bülthoff, S. Y. Edelman, and M. J. Tarr. How are three-dimensional
objects represented in the brain? A.I. Memo No. 1479, April 1994. MIT AI
Lab.
[13] F. W. Campbell and J. G. Robson. Application of Fourier analysis to the
visibility of gratings. J. Physiol., 197:551–566, 1968.
[14] J. Canny. A computational approach to edge detection. IEEE Transactions
on Pattern Analysis and Machine Intelligence, PAMI-8(6):255–274, November 1986.
[15] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: Image
segmentation using expectation-maximisation and its application to image
querying. IEEE Transactions on Pattern Analysis and Machine Intelligence,
24(8):1026–1038, August 2002.
[16] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 17(8):790–799, August 1995.
[17] L. Chrisman. Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach. In National Conference on Artificial Intelligence, pages 183–188, 1992.
[18] CMU Neural Networks Benchmark Collection, http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/neural/bench/cmu/.
[19] D. Comaniciu and P. Meer. Mean shift analysis and applications. In Proceedings of ICCV’99, pages 1197–1203, Corfu, Greece, Sept 1999.
[20] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 24(5):603–619, May 2002.
[21] G. Dahlquist and Å. Björck. Numerical Methods and Scientific Computation,
chapter Interpolation and related subjects. SIAM, Philadelphia, 2003. In
press.
[22] I. Daubechies. The wavelet transform, time-frequency localization and signal
analysis. IEEE Trans. on Information Theory, 36(5):961–1005, September
1990.
[23] R. Dawkins. The Blind Watchmaker. Penguin Books, 1986.
[24] P. Doherty, G. Granlund, K. Kuchcinski, E. Sandewall, K. Nordberg,
E. Skarman, and J. Wiklund. The WITAS Unmanned Aerial Vehicle Project.
In W. Horn, editor, ECAI 2000. Proceedings of the 14th European Conference on Artificial Intelligence, pages 747–755, Berlin, August 2000.
[25] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling.
In ICCV99, pages 1033–1038, Corfu, Greece, September 1999.
[26] G. Farnebäck. Polynomial Expansion for Orientation and Motion Estimation. PhD thesis, Linköping University, Sweden, SE-581 83 Linköping, Sweden, 2002. Dissertation No 790, ISBN 91-7373-475-6.
[27] M. Felsberg. Low Level Image Processing with the Structure Multivector.
PhD thesis, Christian-Albrechts-Universität, Kiel, March 2002.
[28] M. Felsberg and G. Granlund. Anisotropic channel filtering. In Proceedings
of the 13th Scandinavian Conference on Image Analysis, LNCS 2749, pages
755–762, Gothenburg, Sweden, June-July 2003.
[29] M. Felsberg, H. Scharr, and P.-E. Forssén. The B-spline channel representation: Channel algebra and channel based diffusion filtering. Technical Report
LiTH-ISY-R-2461, Dept. EE, Linköping University, SE-581 83 Linköping,
Sweden, September 2002.
[30] M. Felsberg, H. Scharr, and P.-E. Forssén. Channel smoothing. IEEE PAMI,
2004. Submitted.
[31] D. J. Field. What is the goal of sensory coding? Neural Computation, 1994.
[32] P.-E. Forssén. Updating Camera Location and Heading using a Sparse Displacement Field. Technical Report LiTH-ISY-R-2318, Dept. EE, Linköping
University, SE-581 83 Linköping, Sweden, November 2000.
[33] P.-E. Forssén. Image Analysis using Soft Histograms. In Proceedings of the
SSAB Symposium on Image Analysis, pages 109–112, Norrköping, March
2001. SSAB.
[34] P.-E. Forssén. Sparse Representations for Medium Level Vision. Lic. Thesis
LiU-Tek-Lic-2001:06, Dept. EE, Linköping University, SE-581 83 Linköping,
Sweden, February 2001. Thesis No. 869, ISBN 91-7219-951-2.
[35] P.-E. Forssén. Window Matching using Sparse Templates. Technical Report
LiTH-ISY-R-2392, Dept. EE, Linköping University, SE-581 83 Linköping,
Sweden, September 2001.
[36] P.-E. Forssén. Observations Concerning Reconstructions with Local Support.
Technical Report LiTH-ISY-R-2425, Dept. EE, Linköping University, SE-581
83 Linköping, Sweden, April 2002.
[37] P.-E. Forssén. Successive Recognition using Local State Models. In Proceedings SSAB02 Symposium on Image Analysis, pages 9–12, Lund, March
2002. SSAB.
[38] P.-E. Forssén. Channel smoothing using integer arithmetic. In Proceedings
SSAB03 Symposium on Image Analysis, Stockholm, March 2003. SSAB.
[39] P.-E. Forssén and G. Granlund. Sparse Feature Maps in a Scale Hierarchy. In
AFPAC, Algebraic Frames for the Perception Action Cycle, Kiel, Germany,
September 2000.
[40] P.-E. Forssén and G. Granlund. Blob Detection in Vector Fields using a Clustering Pyramid. Technical Report LiTH-ISY-R-2477, Dept. EE, Linköping
University, SE-581 83 Linköping, Sweden, November 2002.
[41] P.-E. Forssén and G. Granlund. Robust multi-scale extraction of blob features. In Proceedings of the 13th Scandinavian Conference on Image Analysis, LNCS 2749, pages 11–18, Gothenburg, Sweden, June-July 2003.
[42] P.-E. Forssén, G. Granlund, and J. Wiklund. Channel Representation of
Colour Images. Technical Report LiTH-ISY-R-2418, Dept. EE, Linköping
University, SE-581 83 Linköping, Sweden, March 2002.
[43] K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density
function, with applications in pattern recognition. IEEE Transactions on
Information Theory, 21(1):32–40, 1975.
[44] F. Godtliebsen, E. Spjøtvoll, and J. Marron. A nonlinear gaussian filter
applied to images with discontinuities. J. Nonpar. Statist., 8:21–43, 1997.
[45] G. H. Granlund. Magnitude Representation of Features in Image Analysis. In The 6th Scandinavian Conference on Image Analysis, pages 212–219,
Oulu, Finland, June 1989.
[46] G. H. Granlund. The complexity of vision. Signal Processing, 74(1):101–126,
April 1999. Invited paper.
[47] G. H. Granlund. An Associative Perception-Action Structure Using a Localized Space Variant Information Representation. In Proceedings of Algebraic
Frames for the Perception-Action Cycle (AFPAC), Kiel, Germany, September 2000.
[48] G. H. Granlund and H. Knutsson. Signal Processing for Computer Vision.
Kluwer Academic Publishers, 1995. ISBN 0-7923-9530-1.
[49] G. Granlund. Does Vision Inevitably Have to be Active? In Proceedings
of the 11th Scandinavian Conference on Image Analysis, Kangerlussuaq,
Greenland, June 7–11 1999. SCIA. Also as Technical Report LiTH-ISY-R-2247.
[50] G. Granlund, P.-E. Forssén, and B. Johansson. HiperLearn: A high performance learning architecture. Technical Report LiTH-ISY-R-2409, Dept. EE,
Linköping University, SE-581 83 Linköping, Sweden, January 2002.
[51] G. Granlund, P.-E. Forssén, and B. Johansson. HiperLearn: A high performance channel learning architecture. IEEE Transactions on Neural Networks, 2003. Submitted.
[52] G. Granlund, K. Nordberg, J. Wiklund, P. Doherty, E. Skarman, and
E. Sandewall. WITAS: An Intelligent Autonomous Aircraft Using Active
Vision. In Proceedings of the UAV 2000 International Technical Conference
and Exhibition, Paris, France, June 2000. Euro UVS.
[53] G. H. Granlund and A. Moe.
Unrestricted recognition of 3-D objects using multi-level triplet invariants. In Proceedings of the Cognitive Vision Workshop, Zürich, Switzerland, September 2002. URL:
http://www.vision.ethz.ch/cogvis02/.
[54] R. M. Gray. Dithered quantizers. IEEE Transactions on Information Theory,
39(3):805–812, 1993.
[55] L. Haglund. Adaptive Multidimensional Filtering. PhD thesis, Linköping
University, Sweden, SE-581 83 Linköping, Sweden, October 1992. Dissertation No 284, ISBN 91-7870-988-1.
[56] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust
Statistics: The approach based on influence functions. John Wiley and Sons,
New York, 1986.
[57] R. I. Hartley. In defense of the eight-point algorithm. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 19(6):580–593, June 1997.
[58] S. Haykin. Neural Networks–A comprehensive foundation. Prentice Hall,
Upper Saddle River, New Jersey, 2nd edition, 1999. ISBN 0-13-273350-1.
[59] D. Hearn and P. Baker. Computer Graphics, 2nd ed. Prentice Hall International, 1994. ISBN 0-13-159690-X.
[60] C. M. Hicks. The application of dither and noise-shaping to nyquist-rate
digital audio: an introduction. Technical report, Communications and Signal
Processing Group, Cambridge University Engineering Department, United
Kingdom, 1995.
[61] I. P. Howard and B. J. Rogers. Binocular Vision and Stereopsis. Oxford
Psychology Series, 29. Oxford University Press, New York, 1995. ISBN
0195084764.
[62] P. Huber. Robust estimation of a location parameter. Ann. Math. Statist.,
35(73-101), 1964.
[63] A. Jain, M. Murty, and P. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264–323, Sept 1999.
[64] B. Johansson. Multiscale curvature detection in computer vision. Lic. Thesis
LiU-Tek-Lic-2001:14, Dept. EE, Linköping University, SE-581 83 Linköping,
Sweden, March 2001. Thesis No. 877, ISBN 91-7219-999-7.
[65] B. Johansson and G. Granlund. Fast selective detection of rotational symmetries using normalized inhibition. In Proceedings of the 6th European
Conference on Computer Vision, volume I, pages 871–887, Dublin, Ireland,
June 2000.
[66] B. Johansson and A. Moe. Patch-duplets for object recognition and pose
estimation. Technical Report LiTH-ISY-R-2553, Dept. EE, Linköping University, SE-581 83 Linköping, Sweden, November 2003.
[67] H. Knutsson, M. Andersson, and J. Wiklund. Advanced Filter Design. In
Proceedings of the 11th Scandinavian Conference on Image Analysis, Greenland, June 1999. SCIA. Also as report LiTH-ISY-R-2142.
[68] P. Kovesi. Image features from phase congruency. Tech. Report 95/4, University of Western Australia, Dept. of CS, 1995.
[69] T. Landelius. Reinforcement Learning and Distributed Local Model Synthesis. PhD thesis, Linköping University, Sweden, SE-581 83 Linköping,
Sweden, 1997. Dissertation No 469, ISBN 91-7871-892-9.
[70] B. Leibe and B. Schiele. Analyzing appearance and contour based methods
for object categorization. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR’03), June 2003.
[71] M. W. Levine and J. M. Shefner. Fundamentals of sensation and perception.
Addison-Wesley, 1981.
[72] T. Lindeberg. Scale-space Theory in Computer Vision. Kluwer Academic
Publishers, 1994. ISBN 0792394186.
[73] T. Lindeberg and J. Gårding. Shape from texture in a multi-scale perspective. In Proc. 4th International Conference on Computer Vision, pages
683–691, Berlin, Germany, May 1993.
[74] D. Marr. Vision. W. H. Freeman and Company, New York, 1982.
[75] C. Meunier and J. P. Nadal. The Handbook of Brain Theory and Neural
Networks, chapter Sparsely Coded Neural Networks, pages 899–901. MIT
Press, 1995. M. A. Arbib, Ed.
[76] S. Mitaim and B. Kosko. Adaptive joint fuzzy sets for function approximation. In International Conference on Neural Networks (ICNN-97), pages
537–542, June 1997.
[77] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned
processing units. Neural Computation, 1:281–293, 1989.
[78] M. C. Morrone, J. R. Ross, and R. A. Owens. Mach bands are phase dependent. Nature, 324:250–253, 1986.
[79] A. Nieder, D. Freedman, and E. Miller. Representation of the quantity of
visual items in the primate prefrontal cortex. Science, 297:1708–1711, 6
September 2002.
[80] K. Nordberg, G. Granlund, and H. Knutsson. Representation and Learning
of Invariance. In Proceedings of IEEE International Conference on Image
Processing, Austin, Texas, November 1994. IEEE.
[81] K. Nordberg, P. Doherty, G. Farnebäck, P.-E. Forssén, G. Granlund, A. Moe,
and J. Wiklund. Vision for a UAV helicopter. In Proceedings of IROS’02,
workshop on aerial robotics, Lausanne, Switzerland, October 2002.
[82] S. Obdrzalek and J. Matas. Object recognition using local affine frames on
distinguished regions. In Proceedings of the British Machine Vision Conference, pages 113–122, London, 2002. ISBN 1-901725-19-7.
[83] J. K. O’Regan. Solving the ‘real’ mysteries of visual perception: The world
as an outside memory. Canadian Journal of Psychology, 46:461–488, 1992.
[84] R. Palm, H. Hellendoorn, and D. Driankov. Model Based Fuzzy Control.
Springer-Verlag, Berlin, 1996. ISBN 3-540-61471-0.
[85] P. Perona and J. Malik. Detecting and localizing edges composed of steps,
peaks and roofs. In Proceedings of ICCV, pages 52–57, 1990.
[86] D. Reisfeld. The constrained phase congruency feature detector: simultaneous localization, classification, and scale determination. Pattern Recognition
letters, 17(11):1161–1169, 1996.
[87] H. Scharr, M. Felsberg, and P.-E. Forssén. Noise adaptive channel smoothing
of low-dose images. In CVPR Workshop: Computer Vision for the Nano
Scale, June 2003.
[88] S. M. Smith and J. M. Brady. SUSAN - a new approach to low level image
processing. International Journal of Computer Vision, 23(1):45–78, 1997.
[89] H. Snippe and J. Koenderink. Discrimination thresholds for channel-coded
systems. Biological Cybernetics, 66:543–551, 1992.
[90] H. Snippe and J. Koenderink. Information in channel-coded systems: correlated receivers. Biological Cybernetics, 67:183–190, 1992.
[91] I. Sobel. Camera models and machine perception. Technical Report AIM-21,
Stanford Artificial Intelligence Laboratory, Palo Alto, California, 1970.
[92] M. Sonka, V. Hlavac, and R. Boyle. Image Processing, Analysis, and Machine Vision. International Thomson Publishing Inc., 1999. ISBN 0-534-95393-X.
[93] H. Spies and P.-E. Forssén. Two-dimensional channel representation for
multiple velocities. In Proceedings of the 13th Scandinavian Conference on
Image Analysis, LNCS 2749, pages 356–362, Gothenburg, Sweden, June-July
2003.
[94] H. Spies and B. Johansson. Directional channel representation for multiple line-endings and intensity levels. In Proceedings of IEEE International
Conference on Image Processing, Barcelona, Spain, September 2003.
[95] C. V. Stewart. Robust parameter estimation in computer vision. SIAM
Review, 41(3):513–537, 1999.
[96] R. S. Sutton and A. G. Barto. Reinforcement Learning, An Introduction.
MIT Press, Cambridge, Massachusetts, 1998. ISBN 0-262-19398-1.
[97] S. Thorpe. The Handbook of Brain Theory and Neural Networks, chapter
Localized Versus Distributed representations, pages 549–552. MIT Press,
1995. M. A. Arbib, Ed.
[98] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images.
In Proceedings of the 6th ICCV, 1998.
[99] T. Tuytelaars and L. V. Gool. Matching widely separated views based on
affinely invariant neighbourhoods. International Journal of Computer Vision, 2003. To appear.
[100] M. Unser. Splines: A perfect fit for signal and image processing. IEEE
Signal Processing Magazine, pages 22–38, November 1999.
[101] N. Vlassis, B. Terwijn, and B. Kröse. Auxiliary particle filter robot localization from high-dimensional sensor observations. Technical Report IAS-UVA-01-05, Computer Science Institute, University of Amsterdam, The Netherlands, September 2001.
[102] M. Volgushev and U. T. Eysel. Noise Makes Sense in Neuronal Computing.
Science, 290:1908–1909, December 2000.
[103] J. Weule. Iteration nichtlinearer Gauss-Filter in der Bildverarbeitung. PhD
thesis, Heinrich-Heine-Universität Düsseldorf, 1994.
[104] G. Winkler and V. Liebscher. Smoothers for discontinuous signals. J. Nonpar. Statistics, 14(1-2):203–222, 2002.
[105] WITAS web page.
http://www.ida.liu.se/ext/witas/.
[106] A. Witkin. Scale-space filtering. In 8th Int. Joint Conf. Artificial Intelligence,
pages 1019–1022, Karlsruhe, 1983.
[107] A. Wrangsjö and H. Knutsson. Histogram filters for noise reduction. In
C. Rother and S. Carlsson, editors, Proceedings of SSAB’03, pages 33–36,
2003.
[108] R. Zemel, P. Dayan, and A. Pouget. Probabilistic interpretation of population codes. Neural Computation, 10(2):403–430, 1998.
[109] Z. Zhang. Parameter estimation techniques: A tutorial. Technical Report
2676, INRIA, October 1995.