On Texture and Geometry in Image Analysis

David Karl John Gustavsson
The Image Group, Department of Computer Science
Faculty of Science, University of Copenhagen
2009
This thesis is dedicated to
Elin and Ludwig
Preface
In your hands you hold
the result of three years of hard labor in the science mine,
materialized in this PhD dissertation.
Acknowledgements
Many people have helped and supported me to get where I am today. A
big thanks goes to my extended family - especially my mother Lena and my
father Håkan - and to all my friends, of course.
Mads Nielsen, for excellent supervision and always putting ideas into a
larger context, thank you so much! Kim S. Pedersen, for insightful supervision on an almost daily basis and for great inspiration, you have my deepest gratitude and appreciation. François Lauze, for helping me to transform vague ideas into mathematics and for never hesitating to give a math lecture, thank you!
Anders Heyden and Niels-Christian Overgaard, for making my visit at
Malmö University both enjoyable and scientifically fruitful, thank you! Christoph Schnörr, for making my visit at Heidelberg University enjoyable and for always sharing your knowledge, thank you!
I also want to thank all the PhD students - former and current - at the former Image Group at the IT University of Copenhagen, the Image Group at DIKU and Applied Mathematics at Malmö University.
The biggest thanks goes to my wife Mariana, and our twins Ludwig and
Elin, for always perfectly balancing my professional life with a perfect private
life.
This research was funded by the VISIONTRAIN RTN-CT-2004-005439
Marie Curie Action within the EC’s FP6.
Abstract
Images are composed of geometric structure and texture. Large scale structures are considered to be the geometric structure, while small scale details are considered to be the texture. In this dissertation, we will argue that the most important difference between geometric structure and texture is not the scale - instead, it is the requirement on representation or reconstruction. Geometric structure must be reconstructed exactly and can be represented sparsely. Texture does not need to be reconstructed exactly, a random sample from the distribution being sufficient; furthermore, texture cannot be represented sparsely.
In image inpainting, the image content is missing in a region and should be reconstructed using information from the rest of the image. The main challenges in inpainting are prolonging and connecting geometric structure and reproducing the variation found in texture. The Filter, Random fields and Maximum Entropy (FRAME) model [213, 214] is used for inpainting texture. We argue that many 'textures' contain details that must be inpainted exactly. Simultaneous reconstruction of geometric structure and texture is a difficult problem; therefore, a two-phase reconstruction procedure is proposed. An inverse temperature β is added to the FRAME model. In the first phase, the geometric structure is reconstructed by cooling the distribution, and in the second phase, the texture is added by heating the distribution. Empirically, we show that the long range geometric structure is inpainted in a visually appealing way during the first phase, and that the texture is added during the second.
A method for measuring and quantifying the image content in terms of geometric structure and texture is proposed. It is assumed that geometric structure can be represented sparsely, while texture cannot. Reversing the argument, we argue that if an image can be represented sparsely then it contains mainly geometric structure, and if it cannot be represented sparsely then it contains texture. The degree of geometric structure is determined by the sparseness of the representation. A Truncated Singular Value Decomposition (TSVD) complexity measure is proposed, where the rank of a good approximation defines the image complexity.
Image regularization can be viewed as approximating an observed image with a simpler image. The properties of the simpler image depend on the regularization method, a regularization parameter and the image content. Here we analyze the norm of the regularized solution and the norm of the residual as functions of the regularization parameter (using different regularization methods). The aim is to characterize the image content by the content in the residual. Buades et al. [27] used the content in the residual - called 'method noise' - for evaluating denoising methods. Our aim is complementary, as we want to characterize the image content in terms of geometric structure and texture, using different regularization methods.
The image content does not depend solely on the objects in the scene, but also on the viewing distance. Increasing the viewing distance influences the image content in two different ways. As the viewing distance increases, details are suppressed because the inner scale also increases. Increasing the viewing distance also changes the spatial layout of the captured scene. At large viewing distances, the sky occupies a large region of the image, and buildings, trees and lawns appear as uniformly colored regions. The following questions are addressed: How much of the visual appearance of an image, in terms of geometry and texture, can be explained by the classical results from natural image statistics? And how do the visual appearance of an image and the classical statistics relate to the viewing distance?
Contents

Preface . . . i
Acknowledgements . . . ii
Abstract . . . iii

1 . . . 1
  1.1 Introduction . . . 1
    1.1.1 The Bayesian Approach and MAP-solution . . . 7
  1.2 Inpainting using FRAME - Filter, Random fields And Maximum Entropy . . . 9
    1.2.1 Inpainting . . . 9
    1.2.2 Related Work . . . 9
    1.2.3 FRAME . . . 14
    1.2.4 Cooling and Heating - Inpainting using FRAME . . . 15
    1.2.5 Discussion . . . 18
  1.3 DIKU Multi-Scale Image Database . . . 20
    1.3.1 Related Work . . . 21
    1.3.2 Collection Procedure and Content . . . 22
    1.3.3 Natural Image Statistics . . . 23
    1.3.4 Discussion . . . 27
  1.4 SVD as Content Descriptor . . . 29
    1.4.1 Related Work . . . 29
    1.4.2 Optimal Rank Approximation and TSVD . . . 30
    1.4.3 Measuring the Complexity of Images - Singular Value Reconstruction Index . . . 33
    1.4.4 Discussion . . . 33
  1.5 Image Description by Regularization . . . 37
    1.5.1 Related Work . . . 37
    1.5.2 Image Decomposition . . . 38
    1.5.3 The Bayesian Approach and MAP-Solution . . . 40
    1.5.4 Regularized and Residual Norm . . . 41
    1.5.5 Discussion . . . 42
  1.6 Motion Estimation by Contour Registration . . . 43
    1.6.1 Image and Contour Registration . . . 44
    1.6.2 Image Registration by Contour Matching . . . 47
    1.6.3 Relation to Feature-Based and Contour Registration . . . 49
    1.6.4 Applications . . . 50
    1.6.5 Discussion . . . 51
  1.7 Scientific Contributions . . . 52
    1.7.1 Published Paper and Scientific Contribution . . . 52
    1.7.2 Discussion . . . 56

2 Image Inpainting by Cooling and Heating . . . 60
  2.1 Introduction . . . 61
  2.2 Review of FRAME . . . 63
    2.2.1 The Choice of Filter Bank . . . 64
    2.2.2 Sampling . . . 65
  2.3 Using FRAME for inpainting . . . 65
    2.3.1 Adding a temperature term β = 1/T . . . 66
    2.3.2 Cooling - the ICM solution . . . 66
    2.3.3 Cooling - Fast cooling solution . . . 67
    2.3.4 Heating - Adding texture . . . 68
  2.4 Results . . . 68
  2.5 Conclusion . . . 70

3 A Multi-Scale Study of the Distribution of Geometry and Texture in Natural Images . . . 72
  3.1 Introduction . . . 73
  3.2 Multi-Scale Geometry and Texture image Database (MS-GTI DB) . . . 75
    3.2.1 Collection procedure and equipment . . . 75
    3.2.2 The different Scenes . . . 76
    3.2.3 Region extraction . . . 78
    3.2.4 Notation . . . 80
  3.3 Point Operators and Scale Space . . . 80
  3.4 Statistics of Natural Images . . . 82
    3.4.1 Scale Invariance . . . 82
    3.4.2 Laplacian distribution of Linear Filter Responses . . . 86
    3.4.3 Size Distribution in Natural Images . . . 91
  3.5 Discussion . . . 96

4 A SVD-Based Image Complexity Measure . . . 98
  4.1 Introduction . . . 99
  4.2 Complexity Measure . . . 101
    4.2.1 Error Measure - Matrix Norms . . . 101
    4.2.2 Matrix Complexity Measure - Matrix Rank . . . 102
    4.2.3 Optimal Rank k Approximation . . . 102
    4.2.4 Global Measure . . . 104
  4.3 DIKU Multi-Scale Image Database . . . 105
  4.4 Singular Value Distribution in Natural Images . . . 105
  4.5 Experiments . . . 107
    4.5.1 The baboon image . . . 107
    4.5.2 DIKU Multi-Scale Image Database . . . 107
  4.6 Conclusion . . . 110

5 On the Rate of Structural Change in Scale Spaces . . . 111
  5.1 Introduction . . . 112
    5.1.1 Related work . . . 114
    5.1.2 Convexity, Fourier Transforms, Power Spectra . . . 114
  5.2 Tikhonov Regularization . . . 115
  5.3 Linear Scale-Space and Regularization . . . 117
  5.4 Total Variation image decomposition . . . 119
  5.5 Experiments . . . 120
    5.5.1 Sinc in Scale Space . . . 120
    5.5.2 Black squares with added Gaussian noise . . . 120
    5.5.3 DIKU Multi Scale Image Sequence Database I . . . 123
  5.6 Conclusions . . . 124

6 Variational Segmentation and Contour Matching of Non-Rigid Moving Object . . . 126
  6.1 Introduction . . . 128
  6.2 Segmentation of Image Sequences . . . 129
    6.2.1 Region-Based Segmentation . . . 129
    6.2.2 The Interaction Term . . . 131
    6.2.3 Using the Interaction Term in Segmentation of Image Sequences . . . 131
  6.3 A Contour Matching Problem . . . 132
  6.4 Detect and Locate the Occlusion . . . 134
  6.5 Experiments . . . 135
    6.5.1 Segmentation . . . 135
    6.5.2 Contour Matching and occlusion detection . . . 137
  6.6 Conclusions . . . 139
Chapter 1

1.1 Introduction
It is often claimed that 'everybody knows what texture is, but no one can define it'. This seems to be true both in daily discussions and in scientific papers. In image processing papers, texture is rarely defined, and if a definition is present it is often a problem-specific definition that is only valid in a specific setting. No generally accepted definition of texture exists. One reason for the absence of such a definition is the fact that texture should capture a large and partly contradictory set of concepts - regular, irregular, stochastic, stationary, outer scale and inner scale. Because of the large variation of concepts included in texture, it is fairly easy to construct negative examples that defeat any attempted definition of texture.
Images are often viewed as a composition of geometric structures and texture. The geometric structure is considered to be the large scale structure, while texture is considered to be the small scale details. Geometric structure is considered to be simple because of its rather homogeneous appearance. Smooth objects under the same illumination will reflect the light in a similar way, resulting in smooth geometric structures. A scene is composed of independent, discrete and roughly uniformly colored objects occluding each other. Consider a concrete building viewed from a large distance, or a person with a uniformly colored sweater viewed from a few meters: both the concrete building and the sweater will appear as smooth, uniformly colored objects, i.e. geometric structure. Texture, on the other hand, is considered to be more complex, because it is composed of a large number of small scale elements. Under the same illumination, regions with large numbers of small scale elements will reflect light in different ways. The small scale elements can be either the roughness of the surface or small scale 'objects' such as leaves or hair. Texture can also be viewed as a composition of independent, discrete and roughly uniformly colored objects, but on a smaller scale. This reveals one fundamental property of texture: it always contains some kind of variation. When the distance to the concrete building is decreased, the roughness of the concrete becomes visible, and the smooth geometric structure therefore transforms into a textured object.
Viewing images as a composition of geometric structure and texture implies the possibility of decomposing an image into its components, i.e.

$$I = I_{struct} \oplus I_{text}, \tag{1.1}$$
where ⊕ is some image composition operator, Istruct is the geometric structure
component and Itext is the texture component. Two very different examples
of image decomposition, which will be discussed later, are Total Variation
image decomposition [169] and Primal Sketch [56, 57]. In Total Variation
decomposition, an image is decomposed by minimizing an energy functional
and the composition operator ⊕ is ordinary addition - the intensities of the
structure and texture components are added to form the image. In the Primal Sketch by Guo et al. [56, 57] - inspired by earlier work by Marr [128] - the operator ⊕ denotes a rather complex algorithm, which includes image segmentation and texture modeling. Here the geometric structure is formed by the boundaries of objects, defined by edges over a fixed scale, and the object boundaries are represented by sparse coding. The texture component is formed by the remaining regions, which are not object boundaries; these are divided into regions containing stationary textures and are represented using a Markov random field model (FRAME).
Models for image decomposition also relate to image formation models.
The Dead Leaves model is a well known formation model for natural images, introduced in a morphological setting by Matheron [131], further studied in connection with natural image statistics by Ruderman [168], for the size distribution of homogeneous regions by Alvarez et al. [1, 3], and for modeling scale invariance in natural images by Lee et al. [117]. The Dead Leaves model is based on the notion that a scene is composed of independent discrete objects of different sizes and different colors occluding each other. The scene is composed of planar templates T of the same shape, often squares or circles, but of different sizes and colors. Templates of random size and color are randomly located in the 3D world such that the (x, y) plane, i.e. the image plane, is totally covered. The z dimension is solely used to determine which of the templates is closest to the image plane. The intensity at a location (x, y) is determined by the color of the template closest to the image plane (i.e. the template with the smallest z). As shown by Lee et al., the Dead Leaves model is a generative model that can reproduce statistical properties found in ensembles of natural images.
Grenander et al. [79] use 2D profiles of 3D objects, called generators g, to analyze images. The generators are views or appearances of the 3D objects captured from a random viewing position. An image is composed of randomly selected generators of random size, color and location. Grenander et al. use a superposition model - instead of an occlusion model as in the Dead Leaves model - where the intensity at a location x is the sum of the intensities of the generators covering the location. Grenander et al. [78] and Srivastava et al. [184] use the generator-based image formation process to analytically find probability distributions for linear marginal distributions - i.e. the histograms of linear filter responses - called Bessel K-forms.
Image decomposition methods often use an implicit definition of geometric structure and texture: geometric structure in the image is the content in the structure component, and texture is the content in the texture component. Geometric structure and texture are therefore defined in terms of how the method decomposes the image, i.e. each decomposition method has its own implicit definition of structure and texture.
The notion of scale is almost always discussed in connection with texture. Indeed, texture is sometimes defined, as in the monograph on texture by Petrou and García [155], as 'the details on a scale smaller than the scale of interest'. Two fundamentally different types of scale, as pointed out by scale space theory [98, 202, 111, 122], are present: the inner scale and the outer scale. The inner scale corresponds to the smallest details captured by the camera, and the outer scale is the field of view, or the part of the scene, captured by the camera. As discussed earlier, texture always contains some kind of variation. The variation property of texture indicates that texture should be defined on larger regions - it does not make sense to call a pixel or a small patch a texture. A texture must be defined on a large enough region, such that the variation is present; the outer scale must be sufficiently large to capture the texture variation. Consider the brick wall shown in figure 1.2: the image on the left contains roughly one brick, while the image on the right contains a larger part of the wall. The bricks in the image on the right can be considered to form a texture; the outer scale is large enough to capture the variation of the brick wall. The bricks in the image on the left cannot be considered to be texture, because the outer scale is too small to capture the variation of the brick wall. On the other hand, the variation on the bricks in the image on the right can be considered to be texture.
In this thesis, we will argue that the most important difference between geometric structure and texture is the exactness with which the content must be described or represented. Both the geometric structure and the texture are considered to be samples from stochastic processes. For the geometric part, a fixed specific instance is required, i.e. the geometric structure needs to be described exactly. The texture component does not need to be described exactly - instead, a random sample from the distribution is sufficient.
Furthermore, the details that must be described exactly in an image depend on the problem at hand. The image content that requires an exact
representation depends on what it should be used for. The same image in
another context may alter the content that needs an exact representation.
Consider shining stars on a dark sky - as shown to the left in figure 1.1:
is this geometric structure or texture? As background in a romantic scene
of a feature movie, the stars in the dark sky represent texture. In such a
setting, the stars appear to be small lighted dots of different sizes randomly
located in the dark sky. Any random sample from the same distribution
would serve equally well as background in the movie. On the other hand, if the image is used for orientation or for locating a constellation of stars, then the exact size and location of each star must be represented. In such a
setting, the shining stars are no longer texture because a random sample from
the distribution is not sufficient. Consider an aerial photo, as shown in the
middle in figure 1.1: is this geometric structure or texture? Again, it depends
on the context or the problem at hand. If the photo is used for finding roads
or counting trees, then the details must be represented exactly. On the
other hand, viewed as an aerial photo of a landscape, any random sample
would serve equally well. To the right in figure 1.1, a gathering of people
is shown - is this geometric structure or texture? If one is searching for a
specific person, then each person in the gathering must be represented exactly
and the content should be considered as geometric structure. Viewed as an
example of a gathering of people at a football match, it can be considered
to be texture. In all examples the contents are considered to be texture if
pointing the camera at another location with the same type of content would
serve equally well: pointing the camera at another part of the sky, taking an aerial photo of the landscape at another location, or capturing another gathering of people (possibly at another part of the grandstand).
The insight that geometric structure must be represented exactly, while
texture can be represented by some distribution leads to the question: ’How
much geometric structure and texture does an image contain?’. Measuring and quantifying the image content in terms of geometric structure and
texture is a challenging problem. Assume that geometric structure can be represented exactly using some sparse representation, while texture cannot. Reversing the assumption leads to an approach for measuring the image content: if the image can be represented sparsely, then it mainly contains geometric structure; images that cannot be represented sparsely mainly contain texture. Using this approach, the image content can be quantified by the sparseness of the representation.
The approach proposed in this thesis started to arise while experimenting with the inpainting problem (chapter 2 and papers [84, 86]), and evolved during the project to finally become the main theme of the thesis. In the inpainting problem, the image content (intensities) in a region Ω is missing and should be reconstructed using information from the rest of the image, constrained by the boundary of Ω. In image inpainting, one faces two types of fundamentally different problems:
• Prolonging and connecting geometric structures.
• Reproducing the variation found in texture.
Prolonging and connecting geometric structures needs to be done exactly, and the result is almost binary: the number of visually appealing reconstructions of the geometric structure is small, and in the extreme case there is just one. Texture, on the other hand, contains some degrees of freedom, which influences the number of visually appealing reconstructions. At first glance, it seems as if the two problems are fundamentally different and not related at all. That is also the approach often found in the literature: geometric structures can be reconstructed by minimizing a suitable functional, such as Total Variation, while texture should be reconstructed using a suitable texture synthesis method. The difference between texture synthesis and inpainting lies in the boundary conditions, which put hard restrictions on the possible reconstructions in the latter case. Many textures also contain geometric structures at a smaller scale, which need to be reconstructed exactly. So geometric structures that require an exact reconstruction are present even in textures. This is sometimes referred to as textons [101, 209, 210].
Figure 1.1: Shining stars on a dark sky, an aerial photo and a gathering of
people. Texture or geometry?
Figure 1.2: A brick wall captured at different viewing distances. Four 80 × 80 patches contain a brick wall captured at different viewing distances. What is the geometric structure and texture in the different images?

In chapter 3, a newly collected database containing sequences of natural images of the same scene, captured at different viewing distances, is presented. How does changing the viewing distance influence the image content in terms of geometric structure and texture? Does the image composition in terms of geometric structure and texture depend on the viewing distance? The viewing distance influences the image content in two ways: the image composition ('outer scale') and the level of captured details ('inner scale'). Classical statistical properties of natural images are estimated on individual images and analyzed with respect to the viewing distance. The estimates are strongly linked with both the image content and the viewing distance. The results are further analyzed and discussed in section 1.3.
In chapter 4 (paper [85]), an image complexity measure is proposed. The
basic idea presented in the paper is that an image is simple if it can be approximated with a small error using a sparse representation, and it is
complex otherwise. Large scale geometric structures can be described using
a sparse representation, while small scale stochastic texture cannot. In this
sense, geometric structure is simple, while texture is complex. The proposed
method is based on truncated singular value decomposition and the optimal
rank-k property of such approximation. Furthermore, an image is composed
of image patches and the complexity of the image should be determined by
the patches in the image. The results are further analyzed and discussed in
section 1.4.
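The idea can be made concrete with a short sketch (hypothetical code, not from paper [85]; numpy is assumed and the tolerance tol is an illustrative parameter). By the optimal rank-k property of the truncated SVD, the squared Frobenius error of the best rank-k approximation is the sum of the discarded squared singular values, so the smallest rank reaching a given relative error can be read off directly from the spectrum.

```python
import numpy as np

def tsvd_complexity(patch, tol=0.1):
    """Smallest rank k such that the best rank-k approximation (TSVD) has
    relative Frobenius error <= tol. Low k suggests geometry, high k texture."""
    s = np.linalg.svd(patch, compute_uv=False)
    total = np.sum(s ** 2)                        # squared Frobenius norm
    tail = total - np.cumsum(s ** 2)              # squared error of rank-k TSVD
    rel_err = np.sqrt(np.maximum(tail, 0.0) / total)
    return int(np.argmax(rel_err <= tol) + 1)     # first k reaching the tolerance

rng = np.random.default_rng(1)
texture = rng.standard_normal((80, 80))                     # noise patch
geometry = np.outer(np.ones(80), np.linspace(0, 1, 80))     # rank-1 ramp
print(tsvd_complexity(texture), tsvd_complexity(geometry))  # large vs 1
```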
In chapter 5 (paper [83]), image regularization methods are used to characterize the image content. Image regularization can be viewed as approximating the observed image with a simpler image, where simpler is defined
by the regularization term and the regularization parameter. By increasing
the regularization parameter, the regularized image gets simpler and more
details in the observed image are suppressed. The residual image contains
the details that have been suppressed during the regularization and the norm
of the residual image is a measure of the suppressed details. The norm of the
residual, as a function of the regularization parameter, measures the amount of details that are suppressed at different scales. Also of interest is the derivative with respect to the regularization parameter, which reveals the rate at which details are suppressed. The regularized solution contains the large
scale geometric structure, while the residual image contains the texture. By
measuring the content in the regularized and residual image the degree of
geometric structure and texture can be quantified. The results are further
analyzed and discussed in section 1.5.
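The following toy sketch illustrates the curves in question (an assumption-laden example: Gaussian smoothing stands in for the regularization method, with the scale t playing the role of the regularization parameter; chapter 5 also considers Tikhonov regularization and TV decomposition).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def norm_curves(image, scales):
    """For each scale t, treat the smoothed image as the regularized solution
    and image - smoothed as the residual; return the norms of both."""
    reg, res = [], []
    for t in scales:
        u = gaussian_filter(image, sigma=np.sqrt(t))   # coarse-scale approximation
        reg.append(np.linalg.norm(u))
        res.append(np.linalg.norm(image - u))          # suppressed details
    return np.array(reg), np.array(res)

# np.gradient of the residual-norm curve approximates the derivative with
# respect to the parameter, i.e. the rate at which details are suppressed.
```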
In chapter 6 (papers [65, 82]), a topic that at first glance appears rather different was treated. The goal was to combine temporal inpainting with segmentation using a shape prior. The research was done during a 3-month visit with Prof. Anders Heyden at Malmö University. By using the previous segmentation as a shape prior for the current segmentation, a good segmentation can be achieved even if the object of interest is occluded or missing in some of the frames. How can object boundaries be used to estimate the motion of the object? The object boundaries, in the form of simple connected curves, should
solely be used for computing the motion, in the form of a displacement field
mapping one image onto the other. The image content, or features based on
the image content, cannot be used because they are assumed to be unreliable;
for example, in inpainting where the image content is lost or in the case when
the objects do not contain enough features, such as clouds. How can the
geometric structure - the object boundary - be used to compute the motion
and deformation of a non-rigid deformable object, including the motion of
the interior of the object? The results are further analyzed and discussed in
section 1.6.
1.1.1 The Bayesian Approach and MAP-solution
The Bayesian framework will be used in later sections in connection with inpainting and image decomposition, and is introduced here in a general setting.

Let us introduce the Bayesian approach in a general signal processing setting. Let u0 be an observed signal that is a degraded version of a 'clean' signal u. The goal is to recover the 'clean' signal u, using information from the observed signal u0. The posterior distribution p(u|u0) is the conditional probability distribution of the 'clean' signal u, given the observed signal u0. Of special interest is the u which maximizes the posterior distribution - the 'clean' signal with the highest probability. This is the Maximum A Posteriori
(MAP) solution defined as
uM AP = arg max
7
u
{p(u|u0)} .
(1.2)
To compute the posterior distribution and find the MAP-solution, Bayes' rule is often used. Bayes' rule states that

$$p(u \mid u_0) = \frac{p(u_0 \mid u)\, p(u)}{p(u_0)}, \tag{1.3}$$
where p(u0|u) is the data model term (or likelihood), p(u) is the prior term and p(u0) is a normalization constant (which is usually ignored). Bayes' rule connects the posterior distribution with the likelihood and prior distributions. Using Bayes' rule, the MAP-solution becomes

$$u_{MAP} = \arg\max_u \{p(u_0 \mid u)\, p(u)\}; \tag{1.4}$$
the likelihood and prior terms must be estimated to find the MAP-solution. It is often simpler to estimate the likelihood and prior terms than to estimate the posterior distribution directly. The book by Kaipio and Somersalo [103] treats inverse problems from a (Bayesian) statistical point of view.
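As a concrete illustration of (1.2)-(1.4), the following sketch (hypothetical code, not taken from the thesis; numpy is assumed) computes the MAP estimate for 1D denoising with a Gaussian likelihood and a Gaussian smoothness prior. In this special case, maximizing p(u0|u)p(u) is equivalent to minimizing a quadratic energy, and the MAP solution is available in closed form.

```python
import numpy as np

# Hypothetical sketch: MAP denoising of a 1D signal.
# Likelihood: u0 = u + noise, noise ~ N(0, sigma^2), so -log p(u0|u) ~ ||u - u0||^2.
# Prior: smoothness, -log p(u) ~ (mu/2) ||D u||^2 with D a finite-difference operator.
# MAP solution: minimize ||u - u0||^2 + mu ||D u||^2, i.e. solve (I + mu D^T D) u = u0.

def map_denoise_1d(u0, mu=5.0):
    n = len(u0)
    D = np.diff(np.eye(n), axis=0)      # (n-1) x n forward-difference matrix
    A = np.eye(n) + mu * D.T @ D        # normal equations of the quadratic energy
    return np.linalg.solve(A, u0)       # closed-form MAP estimate

rng = np.random.default_rng(0)
clean = np.sign(np.sin(np.linspace(0, 4 * np.pi, 200)))
noisy = clean + 0.3 * rng.standard_normal(200)
u_map = map_denoise_1d(noisy)
print("residual norm:", np.linalg.norm(noisy - u_map))
```

For non-Gaussian priors, such as the FRAME model discussed below, no closed form exists, and the MAP solution must be approached by sampling or optimization.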
1.2 Inpainting using FRAME - Filter, Random fields And Maximum Entropy

1.2.1 Inpainting
Image inpainting - also known as image completion and hole filling - deals
with the problem of reconstructing the image content in a missing region,
using information in the rest of the image and constrained by the boundary
of the missing region. The term ’image inpainting’ is an artistic synonym
for image interpolation and it comes from the restoration of paintings in
museums. The term image/digital inpainting was first used by Bertalmio et al. [16].
The general inpainting problem can be formulated as follows: an image u0 = u0(x) is defined on the image domain x ∈ D. For some reason, a subset Ω ⊂ D is missing or unavailable. The objective of image inpainting is to reconstruct the entire image u from the incomplete image u0. (Figure 1.3 contains a visualization of the notation.) It is often assumed that u(D \ Ω) = u0(D \ Ω), i.e. the content in the non-missing region should not be altered; in other cases, u0 may be degraded due to noise and/or blur and the content in u0(D \ Ω) may be altered. Let Ω̃ be an extended region including Ω such that Ω ⊂ Ω̃ ⊂ D. Ω̃ could, for example, be a rectangle covering Ω.
It is also worth noting that, in most cases, the main evaluation criterion is how visually appealing the inpainting is.
Applications of inpainting include restoration of damaged images [39], restoration of damaged films [115, 67, 116], removal of unwanted objects in images and sequences, super resolution [62, 106], lossy image compression [66], recovery of missing blocks in transmission and compression [163], and deinterlacing [10, 107, 108, 105].
In a Bayesian setting, the inpainting problem is stated as follows: the posterior distribution p(u|u0) is the probability distribution of the inpainting u given the incomplete image u0, and the MAP-solution is the most likely inpainting given u0. To use Bayes' rule to compute the MAP-solution, as stated in equation 1.4, the likelihood and the prior term must be estimated.
1.2.2 Related Work
Inpainting methods are often categorized into functional-based (or diffusion-based) methods and texture synthesis methods. The functional-based methods are considered to be able to reconstruct geometric structure, while they fail to reconstruct texture in a visually appealing way. Texture synthesis methods are considered to reconstruct texture well, but fail to reconstruct geometric structures in a visually appealing way. The failure of the functional-based methods on texture is evident and has been shown many times. The failure of texture synthesis methods on geometric structure is less evident and, in fact, it has rarely been shown on realistic images.

Figure 1.3: The inpainting objective is to reconstruct the entire image u(D) using the content in u0(D \ Ω), where D is the image domain and Ω is the missing region. Two well-known example images often shown in inpainting papers: a scratched photo of three girls and New Orleans covered with text.
Furthermore, the texture synthesis methods can be divided into parametric models and non-parametric/patch-based models. In parametric models, a parametric representation of the probability distribution is used, and the parameters are estimated using an observed image (or the non-missing part of the image). The missing content is reconstructed by sampling from the distribution. In non-parametric models, the probability distribution is represented non-parametrically by patch samples from the known part of the image. The missing content is reconstructed by directly querying the patch samples.
In the inpainting literature, it is often assumed that texture synthesis and texture inpainting are identical problems: if a method can synthesize a texture, it can directly be used for inpainting. The main difference between texture synthesis and texture inpainting is the presence of boundary conditions in the latter. The boundary conditions put hard constraints on the possible reconstructions.
PDE- and Functional-Based Methods
The recent monograph on image processing by Chan and Shen [40] contains
an overview of the functional/PDE-based inpainting methods. The basic
idea in most PDE-based methods is to prolong and connect the geometric
structure present in the surroundings of the missing region Ω. The geometric
structures present in ∂Ω are the level lines, and the problem is to prolong
and connect the level lines.
Masnou and Morel [130] (see also [129]) were the first to use variational
image interpolation for edge completion. They proposed a disocclusion algorithm - an inpainting algorithm - in which the elastica functional was minimized inside the missing region.
Bertalmio et al. [16] proposed a third order PDE solved only inside Ω, with proper boundary conditions given on ∂Ω. The PDE is based on the transport equation, where the information transported is defined by the Laplacian, transported orthogonally to the image gradient (i.e. along the isophotes).
Ballester et al. [11] used a joint interpolation of vector fields and intensities, Tschumperlé [195, 194] proposed a tensor-driven PDE, and Peyré et al. [157] used a non-local regularization approach.
Total variation image decomposition has also been used for image inpainting [38, 39]. The Total Variation energy functional is defined as
$$E(u; u_0) = \int_{D \setminus \Omega} (u_0 - u)^2 \, dx + \lambda \int_D |\nabla u| \, dx \tag{1.5}$$
and the solution is a minimizer of the energy functional. Minimizing the total variation energy functional using the calculus of variations leads to a second-order PDE. This second-order PDE prolongs and connects contours (geometric structure) with straight 'lines'. Geometric structures will be prolonged and connected using straight lines only if the missing region is smaller than the geometric structure.
TV inpainting uses the L1-norm of the gradient as the smoothness term, resulting in a piecewise constant inpainting. In harmonic inpainting, the L2-norm of the gradient is used as the smoothness term, resulting in a smooth and blurred inpainting. Chan and Kang [37] use harmonic inpainting and total variation inpainting to analyze the error.
Total variation image decomposition has also been used for temporal
inpainting combined with optical flow [116] and for deinterlacing [106].
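A crude sketch of TV inpainting by explicit gradient descent on (1.5) is given below (a toy illustration only: a small eps regularizes |∇u| at zero, and the step size and iteration count are arbitrary; practical implementations use more careful numerics).

```python
import numpy as np

def tv_inpaint(u0, mask, lam=0.1, iters=2000, dt=0.2, eps=1e-3):
    """Gradient descent on (u - u0)^2 over D \\ Omega plus lam * TV(u).
    mask is True on the missing region Omega."""
    u = u0.copy()
    u[mask] = u0[~mask].mean()                  # crude initialization inside Omega
    for _ in range(iters):
        ux = np.gradient(u, axis=1)
        uy = np.gradient(u, axis=0)
        mag = np.sqrt(ux ** 2 + uy ** 2 + eps ** 2)
        # divergence of the unit gradient field: the curvature term of the TV flow
        div = np.gradient(ux / mag, axis=1) + np.gradient(uy / mag, axis=0)
        fidelity = np.where(mask, 0.0, u - u0)  # data term active only outside Omega
        u = u + dt * (lam * div - 2.0 * fidelity)
    return u
```

Consistent with the discussion above, this scheme connects level lines across Ω with essentially straight segments and makes no attempt to reproduce texture.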
Texture Synthesis - Parametric Models
In parametric models, an observed sample image is used to estimate the parameters of the image model, which is represented as a parametric probability distribution. An image is then synthesized by drawing a sample from the probability model. Ideally, a random sample from the parametric model should be drawn, but in most cases a 'typical' sample is drawn instead.
Heeger and Bergen [91] proposed an image pyramid approach for texture synthesis. It is based on the assumption that first order statistics of appropriately chosen linear filter responses capture the relevant information for characterizing the texture. A collapsing pyramid is used for matching the histograms of filter responses, resulting in an image with the same marginal distributions.
Portilla and Simoncelli [161] compute various correlations and use those
correlations as constraints while synthesizing. An image is synthesized, subject to the constraints, by iteratively updating the image and projecting it
onto the set of images satisfying the constraints.
Peyré [156] combined sparse representation, using the non-orthogonal basis proposed by Olshausen and Field [149], with the non-negative matrix factorization proposed by Lee and Seung [118, 119] to 'learn' the image basis - the image 'dictionary'. The variance and kurtosis of the marginal distributions of the decomposed image, using the learned dictionary, are used to synthesize a texture. A 'typical' sample that matches the marginal statistics is drawn, using a modified version of the sampling method proposed by Portilla and Simoncelli [161].
Texture Synthesis - Non-parametric Models
In non-parametric models, no model is specified a priori. Instead, the data, in terms of patch samples, is used directly for estimation. The probability distribution is represented non-parametrically by the patch samples.
A patch-based approach was proposed by Efros and Leung [52] for synthesizing textures. Instead of drawing a random sample from a statistical model, the sample image is used directly for synthesizing the texture. An image is synthesized in a pixel-by-pixel manner. A site x to be synthesized is picked. Let N(x) be a square window neighborhood of x, i.e. an image patch centered at x. The image patch Nbest that is closest to N(x) in the Sum-of-Square Distances (SSD) sense is found. The set of patches for sampling is given by Ω(x) = {N : SSD(N, Nbest(x)) ≤ ε}. The center pixels of the patches in Ω(x) form a histogram of intensities with neighborhoods similar to N(x). The intensity at site x is a random sample from this distribution. It is not only the size of the square window N(x) that is crucial (as pointed out in the original paper), but also the visiting order - i.e. the order in which the intensities are synthesized.
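A bare-bones sketch of this sampling step is shown below (illustrative only: border handling, the visiting order and the tolerance are simplified relative to the original algorithm, and the exhaustive patch search is left unoptimized).

```python
import numpy as np

def synthesize_pixel(image, known, x, y, half=5, eps_factor=1.1, rng=None):
    """Sample an intensity for site (x, y) from the centers of patches whose
    neighborhoods are close (in SSD) to the neighborhood of (x, y); `known`
    marks pixels already determined. Assumes (x, y) lies at least `half`
    pixels from the border."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape
    patch = image[x - half:x + half + 1, y - half:y + half + 1]
    valid = known[x - half:x + half + 1, y - half:y + half + 1]
    candidates, best = [], np.inf
    for i in range(half, h - half):
        for j in range(half, w - half):
            cand = image[i - half:i + half + 1, j - half:j + half + 1]
            ssd = np.sum(valid * (cand - patch) ** 2)   # compare known pixels only
            candidates.append((ssd, image[i, j]))
            best = min(best, ssd)
    pool = [v for ssd, v in candidates if ssd <= eps_factor * best + 1e-12]
    return rng.choice(pool)                             # random sample, not the best
```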
Criminisi et al. [47, 48] used a patch-based approach for image inpainting, but instead of an onion-peeling visiting order they used a priority order. The priority order depends on two terms: the number of neighbors with known intensity (the amount of reliable information surrounding the site) and a term which explicitly encourages geometric structures (often called isophotes in the inpainting literature). Giving higher priority to sites containing geometric structures will prolong and connect geometric structures.
Efros and Freeman [51] proposed a fast and simple method for texture synthesis called image quilting. Square image patches from a sample image are placed in raster order in such a way that their boundaries overlap. In the overlapping boundary, the squared intensity difference is computed and a minimum boundary cut is found. This results in a ragged edge between the patches, and the features of the texture are better preserved. This is an early application of the well-known graph-cut approach, using max-flow/min-cut algorithms (and dynamic programming), for solving image processing problems [113].
As reported by Cuzol et al. [49], the methods proposed by Efros and Leung [52] and Criminisi et al. [48] are one-sweep methods without any back-tracking or Gibbs sampling. Once the intensity of a site has been determined, it will not be altered; this may lead to visual inconsistencies. Cuzol et al. [49] propose a particle filter-based approach for re-sampling in patch space to overcome the 'one-sweep' problem.
Combining Geometric and Texture Inpainting
Some attempts to combine PDE/functional- and texture-based methods exist.
Bertalmio et al. [15, 17] decompose the image into a geometric and a texture component using Meyer's G-norm [132, 7, 8]. The inpainting is done component-wise using different methods. The geometric component is inpainted using the third-order PDE developed by Bertalmio et al. [16] (briefly mentioned in the PDE methods section). The texture component is inpainted using a slightly modified version of the patch-based texture synthesis method proposed by Efros and Leung [52] (discussed in the texture inpainting section). A similar approach was proposed by Rane et al. [163] to recover missing image blocks in transmission and compression. Bugeau and Bertalmio [31] use a similar approach and evaluate different methods for the different components. Their results indicate that the method proposed by Tschumperlé [195] is in general preferable for the geometric component.
Elad et al. [54] use the sparse representation-based image decomposition method Morphological Component Analysis (MCA) [54, 53] for inpainting. In MCA, as in TV- or Meyer-decomposition, an image is decomposed, using ordinary addition, into a geometric and a texture component. In MCA, a sparse representation approach is used and two dictionaries are learned: Tt should sparsely decompose the texture component (and should not be able to represent the geometric component sparsely), and Tg should sparsely decompose the geometric component (and should not be able to represent the texture component sparsely). Formally, the image is decomposed as

$$I = I_g + I_t = T_t \alpha_t + T_g \alpha_g, \tag{1.6}$$

where αt and αg are the sparse coefficients of the texture and geometric components, respectively. As an additional regularization term, total variation is used solely on the geometric component Tg αg. The dictionaries are learned using the non-missing part of the image, and the missing information is inpainted simultaneously component-wise using the learned dictionaries.
1.2.3 FRAME
FRAME - Filter, Random fields And Maximum Entropy - by Zhu et al. [213, 214] is a general framework for analyzing and synthesizing stationary textures. FRAME is based on two properties: (i) textures having the same marginal distributions - histograms of filter responses - are visually hard to discriminate, and (ii) a probability distribution is uniquely determined by all its marginal distributions.

The basic idea behind FRAME is as follows: let H be a set of statistics in the form of histograms of filter responses (marginal distributions) extracted from an observed image, and let Ω(H) be the set of all probability distributions with the same (expected) marginal distributions as the observed image (i.e. H). Among the distributions Ω(H) that are consistent with the observed image, select the least committed distribution, that is, the distribution that maximizes the entropy.
Even if the fundamental idea behind FRAME is rather straightforward, a detailed discussion requires a fair amount of notation. Let F = {F^α : α ∈ K} be a set of filters, I^α = I ∗ F^α the filter response (an image) for filter α, and H^α = (h_1^α, ..., h_N^α) the (normalized) histogram for filter α using N bins. Furthermore, let Ω(H) = {p(I) : E_p(H(I ∗ F^α)) = H^α}, where E_p is the expectation and H is the histogram operator using N (fixed) bins. This is simply 'all probability distributions that have the same marginal distributions as the observed image'. Among the distributions p(I) ∈ Ω(H), the one that maximizes the entropy is selected (i.e. maximum entropy is the objective function), which leads to a constrained optimization problem that can be solved using the technique of Lagrange multipliers. The solution has the following form:
$$p(I) = \frac{1}{Z(\Lambda)} \exp\left\{-\sum_{\alpha=1}^{K} \sum_{i=1}^{N} \lambda_i^{\alpha} h_i^{\alpha}\right\}, \tag{1.7}$$

where Λ = {λ_i^α} are the Lagrange multipliers and Z(Λ) is a normalization constant (that depends on Λ). λ_i^α is the Lagrange multiplier corresponding to h_i^α, the histogram value for filter α and bin i.

To synthesize an image, a random sample from the distribution is drawn. A random sample from p(I) is generated by Gibbs sampling from the conditional distribution

$$p(I(x) \mid I(D \setminus x)) = p(I(x) \mid I(N_x)), \tag{1.8}$$

where N_x is the neighborhood of x (defined by the filter support). By repeatedly randomly selecting sites and sampling the intensity given the current intensities in the neighborhood N_x, a random sample from p(I) will be generated.
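The Gibbs update in (1.8) can be sketched as follows for a discrete set of gray levels (a schematic only: the function energy, assumed given, evaluates the exponent Σ_α Σ_i λ_i^α h_i^α on the current image; an efficient FRAME sampler would update only the filter responses around the site instead of recomputing full histograms).

```python
import numpy as np

def gibbs_update(image, x, y, levels, energy, beta=1.0, rng=None):
    """One Gibbs-sampling step at site (x, y): evaluate the unnormalized
    conditional p(I(x) | I(N_x)) for every gray level and sample from it.
    beta = 1 corresponds to the learned model; see section 1.2.4."""
    if rng is None:
        rng = np.random.default_rng()
    log_p = np.empty(len(levels))
    for k, v in enumerate(levels):
        image[x, y] = v
        log_p[k] = -beta * energy(image)   # log of the unnormalized probability
    log_p -= log_p.max()                   # numerical stabilization
    p = np.exp(log_p)
    image[x, y] = rng.choice(levels, p=p / p.sum())
    return image
```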
1.2.4 Cooling and Heating - Inpainting using FRAME
Let Ω̃ be a square extended region of Ω such that Ω ⊂ Ω̃ ⊂ D, of such a size that the filter support is covered in Ω̃ for sites x ∈ Ω. Inpainting is done by Gibbs sampling from the conditional distribution

$$p(I(x) \mid I(N_x)), \tag{1.9}$$

where x ∈ Ω and N_x is the neighborhood of x. N_x may contain sites x ∈ Ω and x ∉ Ω. Only sites x ∈ Ω are updated, while sites x ∈ Ω̃ \ Ω are used as boundary conditions. The intensities in the missing region are initialized by sampling independently from a uniform distribution.
Synthesizing an image by sampling from the distribution showed visual similarity with the observed image: the features present in the observed image are also present in the sample from the distribution - the optimization process has converged visually. The visual convergence shows that the filters used captured the important visual features of the observed image. In contrast, inpainting by sampling from the distribution in the missing region Ω did not converge to a visually appealing solution. The failure was evident close to the boundary, where the geometric structure was not prolonged in a visually appealing way. The problem of using FRAME for reconstructing large scale image structures such as edges was also observed in [57], where primal sketches were used to extract the edges. This supports the claim made in this thesis that certain structures, even on a smaller scale, must be reconstructed exactly.
An inverse temperature β = 1/T was added to the distribution:

$$p(I) = \frac{1}{Z(\Lambda)} \exp\left\{-\beta \sum_{\alpha=1}^{K} \sum_{i=1}^{N} \lambda_i^{\alpha} H_i^{\alpha}\right\}. \tag{1.10}$$
Cooling the distribution - increasing β - will decrease the probability of images with low probabilities and increase the probability of images with high probabilities. Increasing β will narrow the probability distribution, in the sense that a large part of the probability density will be located at a smaller subset of all images; as β → ∞, all the probability density will be located at the global maxima. In this sense, the distribution becomes less 'stochastic' as β increases. Let N_max be the number of global maxima and let I_max be the set of all global maximizers; then, as β → ∞,

$$p(I) = \begin{cases} 0 & \text{if } I \notin I_{max} \\ \frac{1}{N_{max}} & \text{if } I \in I_{max}. \end{cases}$$

The probability mass is uniformly distributed over the set of global maximizers.
Cooling the Distribution - Adding the Geometry
The idea behind the cooling approaches is that the large scale geometric
structures are brought out by cooling the distribution, while the small scale
structures are suppressed. A more MAP-like solution is assumed to contain
the large scale geometric structure, while suppressing the small scale details.
As pointed out by Nikolova [144], the MAP-solution may not be smooth and
may instead contain small scale structures.
To stress the inference of the geometric structure in the inpainting, three
cooling approaches are proposed.
The first approach is to cool the distribution using a fixed β > 1. The
motivation for this approach is to emphasize more likely structures, and fade
out less likely structures. It is a redistribution of the probability mass in such
a way that a larger part of the probability mass is located on the images with
higher probabilities, while a smaller part of the probability mass is located
on the images with lower probabilities.
The second approach is the so-called Iterated Conditional Modes (ICM), analyzed by Besag in [19] and by Kittler and Föglein in [110], which corresponds to setting β = ∞. When updating the intensity by Gibbs sampling from the conditional distribution (1.8), the ICM approach corresponds to always selecting the intensity with the highest probability. The intensity at site x ∈ Ω is updated using

$$I(x) = \arg\max_{I(x)} \{p(I(x) \mid I(N_x))\}, \tag{1.11}$$

i.e. the most likely intensity given the current neighborhood N_x. It is a site-wise greedy approach which always locally selects the intensity with the highest probability. ICM depends both on the initialization of the missing region Ω and on the visiting order. If the missing region is initialized randomly and the visiting order is random, then repeating the inpainting will give different results. Winkler [201] and Li [121] contain general discussions of ICM.
The third approach is a fast cooling scheme, which gradually increases β. The fast cooling scheme has the following form:

$$\beta_{n+1} = C_+ \cdot \beta_n = C_+^{\,n+1} \cdot \beta_0, \tag{1.12}$$

where C_+ > 1 is the increment factor and β_0 > 0 is the initial inverse temperature. The fast cooling scheme was motivated by the simulated annealing approach for finding the MAP-solution of Markov Random Fields (MRF) [69]. By iteratively increasing β, the probability mass is gradually moved from low probability images to high probability images. The gradual increase of β - the annealing process - decreases the probability of getting stuck in a local optimum. Winkler [201], Brémaud [23] and Li [121] contain general treatments of simulated annealing and MAP-solutions for MRFs.
Heating the Distribution - Adding the Texture
The geometric structure is reconstructed by sampling from a cooled distribution or using a cooling scheme. In the cooling phase, the large scale geometric structure is added, while the small scale texture is suppressed. The result is prolonged and connected geometric structures that appear too smooth. In order to add the small scale texture, a second, heating, sampling phase was used. The 'initialization' for the second phase was the result of the first 'cooling' phase - the reconstructed geometric structure. The small scale texture should be added in this phase without destroying the large scale geometric structure reconstructed in the previous phase.
Two heating approaches were evaluated. The first approach was to use a fixed β ≥ 1, where β = 1 corresponds to the original learned FRAME model.

The second approach was to use a simulated heating-like process - a fast heating scheme - where β gradually decreases. The fast heating scheme used was of the form

$$\beta_{n+1} = C_- \cdot \beta_n = C_-^{\,n+1} \cdot \beta_0, \tag{1.13}$$

where C_- < 1 is the decrement factor and β_0 is the initial inverse temperature. The stopping criterion was β_n < 1 + ε.
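Combining the two phases, a schematic driver could look as follows (building on the hypothetical gibbs_update sketch from section 1.2.3; the constants C+, C-, β0 and the stopping thresholds are illustrative).

```python
import numpy as np

def inpaint_cool_then_heat(image, sites, levels, energy,
                           beta0=0.5, c_plus=1.2, c_minus=0.8,
                           beta_max=50.0, eps=0.05, rng=None):
    """Phase 1 (cooling): beta is multiplied by C+ > 1 after each sweep,
    bringing out the geometric structure. Phase 2 (heating): beta is
    multiplied by C- < 1 and the sampler stops at beta < 1 + eps, adding
    texture without destroying the reconstructed structure."""
    if rng is None:
        rng = np.random.default_rng()
    beta = beta0
    while beta < beta_max:                   # cooling phase
        for (x, y) in sites:
            gibbs_update(image, x, y, levels, energy, beta=beta, rng=rng)
        beta *= c_plus
    while beta >= 1.0 + eps:                 # heating phase
        for (x, y) in sites:
            gibbs_update(image, x, y, levels, energy, beta=beta, rng=rng)
        beta *= c_minus
    return image
```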
1.2.5 Discussion
Simple stationary textures containing structures on at least two scales were used. The coarse scale structure was rather large compared with the image resolution. Corduroy, birch tree bark and batten images from the KTH-TIPS2 database [63] were used. The corduroy is composed of rather large horizontal geometric structures with some intensity variation, the birch tree bark contains large scale geometric structures of different sizes distributed along the diagonal, and the batten contains connected vertical geometric structures which both merge and split.
The missing region was rather large relative to the larger scale details in the texture - roughly 4 times the size of the large scale details. This implies that PDE-based methods such as TV would fail to prolong and connect the geometric structures.
The experiments on the corduroy, birch tree bark and batten images show
that by cooling the distribution, the geometric structure is reconstructed
better.
Adding a fixed β > 1 stressed the inference of the geometric structure. The geometric structure on the boundaries was prolonged and connected in a visually appealing way, while the small scale variation was suppressed. If β was too small or too large, then the geometric structures were not prolonged and connected in a visually appealing way.
The ICM approach, which corresponds to β = ∞, did reconstruct the geometric structure in the corduroy image, but failed to reconstruct the geometric structure in the birch bark and batten images. In the birch bark and batten images, geometric structures were constructed 'randomly', depending on the random initialization and the visiting order.
The fast cooling scheme reconstructs and prolongs the geometric structure
found in the three images.
The geometric structure is reconstructed better by sampling from a cooled
distribution or using a cooling scheme. Sampling from a distribution using
a fixed, but not too large, β will often reconstruct the geometric structure.
Using a fast cooling scheme will prolong and connect the geometric structure
in a visually appealing way.
The texture is added, after the geometric structure has been reconstructed,
by sampling from a heated distribution. Using the fixed β approach, where β
is slightly larger than 1, added the texture without removing the geometric
structure. If β is too large, no or very little texture is added. If β is too
small, the geometric structure starts to degenerate.
Similar behavior was observed when using the fast heating scheme. For
βn much larger than 1, few intensities were altered; as βn approached 1
more intensities were altered. If the fast heating scheme was stopped too
early, then no or very little texture was added. On the other hand, if it was
stopped too late - βn was too small - then the geometric structure started to
degenerate.
1.3 DIKU Multi-Scale Image Database
Image content does not solely depend on the objects in the captured scene,
but also on the viewing distance. Changing the viewing distance, either by
physically moving the camera or by changing the focal length, will alter the
image content. The visual appearance of a tree viewed from a few meters is
rather different from viewing the same tree from 200 meters. At a few meters,
the branches and even the individual leaves are visible. As the viewing distance increases, details are suppressed and the tree top appears as a uniform
green region. Given an image, a coarse scale representation of the image can
be generated using the linear Gaussian scale space [98, 202, 111, 60, 122]. The coarse scale representation is generated by

$$I_t = G_t * I_0, \tag{1.14}$$

where * denotes the convolution operator, I_0 the observed image and

$$G_t(x, y) = \frac{1}{2\pi t} \exp\left(-\frac{x^2 + y^2}{2t}\right). \tag{1.15}$$

In scale space theory, a coarse scale representation of the image is generated at the same resolution. By increasing the viewing distance, details will be suppressed, but the resolution will also decrease. The statistical behavior of natural images over scales has been studied in great detail [193, 205]. Here we consider images containing the environment - both nature and man-made structures - viewed from a normal human perspective (i.e. 'bird' and related perspectives are excluded).
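The two effects can be contrasted in a small sketch (a toy model assuming numpy/scipy: σ = √t matches (1.15), and the subsampling step crudely models the resolution loss that pure scale space does not capture; the factor is assumed to be an integer).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def coarse_scale(image, t):
    """Scale-space representation I_t = G_t * I_0 at the *same* resolution."""
    return gaussian_filter(image, sigma=np.sqrt(t))

def increase_viewing_distance(image, factor):
    """Crude model of increasing the viewing distance by `factor`: smooth
    (details below the new inner scale vanish), then subsample (the
    resolution drops), unlike the scale-space representation above."""
    smoothed = gaussian_filter(image, sigma=factor / 2.0)
    return smoothed[::factor, ::factor]
```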
Increasing the viewing distance will also alter the outer-scale of the image
and the spatial layout of the captured scene. A cup on a table can be captured
from almost all angles, a car on the street can be captured from many angles,
while a building 200 meters away can be captured from only a few angles.
The distance to the main objects in the scene puts constraints on the spatial
lay-out of the captured scene. As the viewing distance increases, the spatial
layout will change - the sky will appear at the top of the image, houses will
appear as uniformly colored blocks in the middle of the image and mountains
will appear as a smoothly changing region in the middle of the image.
We propose to collect a new image database containing the same scene
captured at different viewing distances (by adjusting the focal length). The
database should contain sequences of the same scene captured using different
focal lengths. Furthermore, the region present in all images in a sequence
should be extracted, resulting in a set of sequences of images containing
the same part of the scene captured at different scales (focal lengths of the lens). The extracted regions contain the same part of the scene captured at different scales and at different resolutions.
1.3.1
Related Work
In section 1.3.3, we discuss the scale invariant property of ensembles of natural images [94, 117, 58, 193, 168, 166, 183], the modeling of the partial derivatives of an ensemble of images using the generalized Laplacian distribution
[117, 94, 126, 183] and the distribution of homogenous regions [76, 1, 3, 2, 77].
Torralba and Oliva [192, 193, 146] analyze the η in the power spectra
power law for natural images as a function of viewing distance. They used
a more complete model, where η depends on the orientation [9]. The spatial
composition of an image - called the spatial envelope - is constrained by
the viewing distance. Objects viewed from a small distance can be observed
from almost any point of view. As the viewing distance increases, the number of possible points of view from which an object can be observed decreases. This property
is strongly reflected in the η in the power spectra power law. The estimation
of η can be used for estimating the distance to the objects in the scene.
Wu et al. [204] analyze the image content as a function of viewing distance
using information theoretical tools, see also [199, 210]. They analyze how the
compression rate and the entropy of the image gradient changes as a function
of viewing distance. The starting point is the behavior of the Dead Leaves
model [131, 168, 1, 117] as the distance to the ’image plane’ increases. In
the Dead Leaves model, images are formed by discrete objects of random
size and color, occluding each other - a scene is composed of discrete objects
occluding each other. The objects in the Dead Leaves model have the same
shape (template) - often circles or squares - called leaves, while the size and
the coloring are random. The components in the Dead Leaves model are as follows (a sampling sketch is given after the list):
• The size r of the leaves (template) follows a distribution p(r) ∝ 1/r^2 over a finite range [rmin, rmax].
• The color of the objects is uniformly distributed over [amin , amax ].
• The position (x, y, z) is following a Poisson process with intensity λ,
and the z-axis is solely used for occlusion detection.
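To make the model concrete, here is a minimal sampling sketch (circular leaves and the parameter values are illustrative assumptions, not taken from the thesis); the 1/r^2 size law is sampled by inverting its cumulative distribution function, and the painting order models occlusion along the z-axis.

    import numpy as np

    def sample_dead_leaves(n_leaves, size=256, rmin=2.0, rmax=64.0, seed=0):
        """Minimal Dead Leaves sketch with circular leaves.
        Sizes follow p(r) ~ 1/r^2 on [rmin, rmax] via inverse-CDF sampling;
        colors are uniform; later leaves occlude earlier ones (z-order)."""
        rng = np.random.default_rng(seed)
        img = np.zeros((size, size))
        yy, xx = np.mgrid[0:size, 0:size]
        u = rng.uniform(size=n_leaves)
        radii = 1.0 / (1.0 / rmin - u * (1.0 / rmin - 1.0 / rmax))
        for r in radii:
            cx, cy = rng.uniform(0, size, 2)
            color = rng.uniform()                       # uniform color in [0, 1]
            mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2
            img[mask] = color                           # painting order = occlusion
        return img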
Lee et al. [117] analyze the scale invariant property of the Dead Leaves
model. They show that under the assumption [rmin , rmax ] → [0, ∞] it is scale
invariant. To analyze the behavior of the Dead Leaves model as a function of
increasing viewing distance, [rmin , rmax ] is kept fixed and let r = rmax − rmin .
An individual image contains objects of certain sizes.
Increasing the viewing distance involves two processes: smoothing and
resolution reduction. The smoothing process is modeled by block averaging
(using 2×2 blocks) and the resolution reduction is modelled by sub-sampling.
This is similar to the classical image pyramid viewed from an image formation
point of view [32]. By repeating the smoothing and sub-sampling procedure,
images with increasing viewing distances will be generated. The viewing distance is doubled in each iteration; let s denote the iteration number.
Wu et al. study the statistical behavior of the Dead Leaves model using
a fixed object size r distribution, with increasing viewing distance s. In the
beginning, s ≪ r and the image contains large uniformly colored regions. As the distance increases, s ≈ r; the average leaf then roughly covers one pixel, and the image contains small uniformly colored objects. As the viewing distance increases further, s ≫ r, and a pixel is the average of a large number of leaves. The visual appearance is close to white noise. By increasing the
viewing distance the image content will transform from a rather large scale
geometric structure to a highly stochastic appearance - from low entropy to
high entropy. Wu et al. argue that different regimes are suitable for the
different types of image content. Sparse representations such as wavelets are
suitable for low entropy type of content, while Markov Random Field (MRF)
is suitable in the high entropy case.
Yanulevskaya and Geusebroek [207] model the distribution of partial
derivatives in natural images and image patches with the Weibull distribution (which essentially is the same as the generalized Laplacian distribution).
They characterize the image content by the estimation of the parameters in
the Weibull distribution. They propose three sub-models: the power law, the
exponential and the Gaussian distribution. Akaike’s information criterion
(AIC) is used to determine an adequate sub-model. Images from the power
law sub-model usually have a well-separated foreground and background.
Furthermore, the background is often a rather uniform region. Images from
the exponential sub-model usually contain a lot of details at different scales.
Images from the Gaussian sub-model usually contain high frequency texture. Yanulevskaya and Geusebroek show that the image content of individual images can partly be explained by the parameters estimated in the
Weibull distribution. Geusebroek and Smeulders used a similar approach to
characterize stochastic textures [71, 72].
1.3.2
Collection Procedure and Content
We have collected a database consisting of an ensemble of image sequences containing the same scene captured at different scales. The sequences contain natural images - both nature and man-made structure - viewed from a human
perspective.
The camera used to collect the database was a Nikon D40x, and 3 different lenses - 18−55 mm, 55−200 mm and 70−300 mm - were used. The camera was placed on a tripod facing the scene. Each scene was captured at 15 different scales using different focal lengths - ranging from 18 mm to 300 mm, or roughly 4 octaves. A 1 × 1 region in the least zoomed image corresponds to a 16 × 16 region in the most zoomed image. The focal length
was adjusted manually.
The image resolution was 2592 × 3872 pixels. The images were captured
using Nikon's 14-bit raw format NEF, and were converted to 16-bit TIFF images.
The database contains both man-made structures and natural scenes.
The physical distance between the camera and the main object in the scene
varies between 5 meters and a few kilometers. For most images, the distance
between the camera and the main object in the scene is between 30 meters
and 150 meters. The goal was that the sky should occupy a large region in
the image at all scales in some sequences and the sky should be absent at all
scales in other sequences.
The part of the scene present in all images in a sequence has been extracted using registration techniques and by hand, resulting in a set of images
containing the same part of the scene captured at different scales (different
focal length). The resolution of the extracted regions ranges between 2592 ×
3872 and 160 × 240 pixels.
1.3.3
Natural Image Statistics
Three classical results from natural image statistics are verified on the newly
collected image database: the power spectra power law (scale invariance),
the generalized Laplacian distribution of the partial derivatives and the
distribution of homogenous regions.
The statistics are estimated on three different ’sets’:
• On the ensemble of images.
• On individual images.
• On all images captured at the same scale (same focal length setting).
Estimation on the ensemble of image sequences is performed to verify the soundness of the database; the results should be similar to previously reported estimations on other databases.
To analyze to what extent the visual appearance of an individual image is explained by the statistics, the statistics were also estimated on individual images.
Scale Invariance
One of the earliest results in natural image statistics is the apparent scale invariance [145, 94, 117, 58, 193, 168, 166, 196].
The scale invariance can be stated as follows: the power spectrum of an ensemble of images follows a power law in spatial frequencies given by

S(\omega) = \frac{A}{|\omega|^{2-\eta}},   (1.16)
where ω = (ωx , ωy ) is the spatial frequency, η is estimated on the ensemble
and A is a constant that depends on the contrast in the ensemble of images.
The power spectra power law can be formulated in the spatial domain using
the correlation function [168, 94], and it has the following form
C(x) = C_1 + \frac{C_2}{|x|^{\eta}},   (1.17)
where x is the distance between the pixels and C1 and C2 are constants.
A large η implies that the intensities are less correlated, while a small η
indicates higher correlation between the pixels. The intensity correlation
decreases with the distance.
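A rough sketch (an assumption-laden illustration, not the thesis code) of estimating η from a single image: radially average the power spectrum and fit the power law (1.16) by log-log regression; the slope is −(2 − η).

    import numpy as np

    def estimate_eta(image):
        """Estimate eta in S(w) = A / |w|^(2 - eta) (equation 1.16) by
        log-log regression on the radially averaged power spectrum."""
        img = image - image.mean()
        power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
        h, w = power.shape
        yy, xx = np.mgrid[0:h, 0:w]
        radius = np.hypot(yy - h / 2, xx - w / 2).astype(int)
        # Radially averaged spectrum: mean power at each integer frequency
        radial = np.bincount(radius.ravel(), power.ravel()) / np.bincount(radius.ravel())
        freqs = np.arange(1, min(h, w) // 2)        # skip DC, stay below Nyquist
        slope, _ = np.polyfit(np.log(freqs), np.log(radial[freqs]), 1)
        return 2.0 + slope                          # the slope is -(2 - eta)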
Ruderman and Bialek [166] reported η = 0.19 on their database collected
in the woods. Ruderman [168] reported η = −0.3 for an ensemble of seashore
images containing a lot of water and sky. Huang and Mumford [94] also
reported η = 0.19 on the van Hateren database [197]. Lee et al. [117]
estimated η on different types of environments; for an ensemble of images
containing vegetation η ≈ 0.2 and for an ensemble of images containing
roads η ≈ 0.6.
On our database, η is estimated to 0.202 on the ensemble of images.
For individual images, η varies between −0.3 and 0.5. Images with a large η
generally contain small scale details like texture and the distance to the main
object in the scene is often small (often a few meters). Images with a small η
contain large scale geometric structures and the distance to the main objects
in the scene is rather large (often 100 meters or more). Hence, the estimation
of η on individual images gives important information about the statistical content of the image, especially whether η is large or small.
Estimations of η on all images captured at the same scale (focal length)
show a tendency to increase as the viewing distance decreases. η increases
rapidly for the first 3 capture scales. For the remaining capture scales, η
has a tendency to increase but less rapidly and non-monotonically. As the
viewing distance decreases, the region in the images occupied by the sky also
has a tendency to decrease. The sky occupies a minor region in most of the
images after the first 3 capture scales. At a large viewing distance, buildings,
lawns and trees appear to be rather uniform geometric structures, and as the
viewing distance decreases, more details are brought out.
Laplacian Distribution of Partial Derivatives
It has been reported [172, 174, 173, 30, 207, 73, 72, 94, 183, 117, 126] that
the distribution of partial derivatives of an ensemble of natural images can
be modeled by a generalized Laplacian distribution

p(x) = \frac{1}{Z} e^{-\left| \frac{x}{s} \right|^{\alpha}},   (1.18)
where α and s are estimated parameters. s is related to the width of the
distribution (i.e. the variance) and α is related to the peakedness of the distribution.
All computations are performed on a log-intensity scale (i.e. log(I)).
Instead of using the intensity difference between two adjacent pixels - i.e.
log(I(x, y)) − log(I(x + 1, y)) - the normalized scale space derivatives are
used. The partial scale space derivative in the x direction at scale t is

\frac{\partial}{\partial x}(G_t * I) = \frac{\partial G_t}{\partial x} * I,   (1.19)
where G_t is the Gaussian function. The notation αx denotes estimation using the partial derivative in the x direction.
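As an illustration (a maximum likelihood sketch under the normalization Z = 2sΓ(1 + 1/α); the thesis does not specify the fitting procedure), α and s can be estimated from a set of derivative responses:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import gammaln

    def fit_generalized_laplacian(dx):
        """Maximum likelihood fit of p(x) = (1/Z) exp(-|x/s|^alpha)
        (equation 1.18), using the normalization Z = 2 s Gamma(1 + 1/alpha)."""
        dx = np.asarray(dx, float).ravel()

        def neg_log_lik(params):
            log_s, log_a = params               # log parametrization keeps s, alpha > 0
            s, a = np.exp(log_s), np.exp(log_a)
            log_z = np.log(2 * s) + gammaln(1 + 1 / a)
            return dx.size * log_z + np.sum(np.abs(dx / s) ** a)

        res = minimize(neg_log_lik, x0=[np.log(dx.std() + 1e-9), 0.0])
        s, alpha = np.exp(res.x)
        return alpha, s

    # Derivative responses could come from, e.g.,
    # scipy.ndimage.gaussian_filter(np.log(I + 1e-6), sigma=np.sqrt(t), order=(0, 1)).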
Huang and Mumford [94] estimated α to 0.55 on the van Hateren database [197].
On our database, αx is estimated on the ensemble of images to 0.37 and
0.78 at scale t = 1 and t = 64, respectively. For individual images, αx varies
between 0.25 and 1.00 for t = 1 and between 0.55 and 2.00 for t = 64.
The visual appearance of the images corresponds well with the estimation of
αx . Images with a large αx contain small scale details, often high frequency
texture. The t in the scale space derivative determines what is considered
to be small scale details. The distance to the main object in the scene is
often fairly small - a few meters. Images with a small αx contain large scale geometric structures, and the distance to the main objects in the scene is often large (see also [207] and [73]).
The estimation of αx on all images captured at the same scale (focal
length) shows no clear tendency as the viewing distance decreases. Instead
the estimation is rather stable with small variation over the capture scales.
Distribution of Homogenous Regions
Alvarez et al. [2, 1, 3] and Gousseau et al. [76, 77] studied the distribution
of homogenous regions in individual images and ensembles of natural images.
They also relate the size distribution of homogenous regions to the question
whether natural images belong to the function space of bounded variation
(BV).
Following Alvarez et al. [2, 1, 3], the definition of homogenous regions
is rather simple. (Gousseau et al. [77] use a different definition.) First, the intensity resolution is reduced to k levels such that each new intensity level contains the same number of locations. If I is an N × M image, then each new intensity level will contain (N · M)/k locations. A homogenous region is defined
as a connected - using either 4 or 8 connectivity - set of locations with the
same intensity. The size of a homogenous region is the number of locations
it contains. They show that the size distribution of homogenous regions is
following a power law

f(s) = \frac{A}{s^{\alpha}},   (1.20)
where A and α are estimated parameters; α and A can be estimated using
log-log regression. The size distribution follows a power law both for individual natural images and for ensembles of natural images. The estimation
of α on individual images shows large variations. Images containing mainly
small scale details have an α ≈ 3, while images containing mainly large scale
geometric structures have an α ≈ 1.6. Alvarez et al. [2, 1, 3] reported α
close to 2.0 on ensembles of natural images.
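A minimal sketch of the procedure (under assumed details such as 4-connectivity and rank-based equal-population quantization; not the thesis code):

    import numpy as np
    from scipy import ndimage

    def homogenous_region_sizes(image, k=16):
        """Sizes of homogenous regions after quantizing to k equally
        populated intensity levels (4-connectivity), following the
        definition of Alvarez et al."""
        ranks = np.argsort(np.argsort(image.ravel()))
        levels = (ranks * k // image.size).reshape(image.shape)
        sizes = []
        for level in range(k):
            labeled, n = ndimage.label(levels == level)     # 4-connectivity by default
            sizes.extend(np.bincount(labeled.ravel())[1:])  # drop the background bin
        return np.array(sizes)

    # alpha in equation (1.20) can then be estimated by log-log regression
    # on the histogram of these sizes.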
On our database, α was estimated to 2.11 on the ensembles of images.
On individual images, α varied between 1.75 and 3.00. Images with small
α ≈ 1.75 contain large scale geometric structures, and the distance to the
main objects in the scene is rather large (hundreds of meters). Images with
large α ≈ 3.00 contain small scale details, and the distance to the main
objects in the scene is rather small (a few meters).
The estimation of α on all images captured at the same scale (focal length)
shows no clear tendency as the viewing distance decreases. Instead the estimation is rather stable with small variation over the capture scales. The
estimations of α at the different capture scales vary between 2.15 and 2.22.
1.3.4
Discussion
Three classical and well-known statistical properties of natural images have
been estimated on the newly collected ensemble of image sequences database.
We study the statistical properties: the power spectra power law (scale invariance), the Laplacian distribution of the partial derivatives and the power
law for size distribution of homogenous regions.
η in the power spectra power law - equation (1.16) - is estimated to 0.202
on the ensemble of images. Ruderman and Bialek [166] reported η = 0.19 on
their database collected in the woods and Huang and Mumford [94] (also)
reported η = 0.19 on the van Hateren database [197].
α in the generalized Laplacian distribution - equation (1.18) - is estimated to 0.37 and 0.78 at scales t = 1 and t = 64, respectively, on the ensemble of images.
Huang and Mumford [94] estimated α to 0.55 on the van Hateren database.
α in the size distribution of homogenous region is estimated to 2.11 on
the ensemble of images. Alvarez et al. [2, 1, 3] report α close to 2.
All three classical results could be verified on the newly collected database. The estimations on individual images in the database also verify previously reported results. The estimation of η in the power spectra power law is
large if the image mainly contains small scale details, and it is small if it
mainly contains large scale geometric structures. The estimation of α in
the generalized Laplacian distribution is large if the image mainly contains
small scale details, and it is small if it mainly contains large scale geometric
structures. The estimation of α in size distribution of homogenous regions is
large if the image mainly contains small scale details, and it is small if the
image mainly contains large scale geometric structures. The visual content
is partly explained by the estimated statistical properties.
The estimation of η in the power spectra power law based on the capture
scales increases as the viewing distance decreases. η increases rapidly for the
first three capture scales and increases moderately for the remaining scales.
The spatial layout - the spatial envelope - changes a lot at the first three
capture scales. At the largest viewing distance, the sky occupies a large
region in many of the images. As the viewing distance decreases, the region
occupied by the sky shrinks and after the first three scales the sky occupies
a small region in most images. As the viewing distance decreases, details
are brought out. Sequences where the estimation of η follows a pattern similar to the capture-scale based estimations often contain the sky at the large viewing distances. As the viewing distance decreases, the sky is absent
or occupies a small region in the images. In sequences where η is large on
all capture scales, the sky is often absent or occupying a small region, at all
scales. Furthermore, the images contain small scale details and the viewing
distances are rather small at all capture scales. In sequences where η is small
on all capture scales, the sky is occupying a large part of the image and the
viewing distance is large at all capture scales. If the sequence contains a
transition in distance, then the estimation of η has a tendency to increase
with decreasing viewing distances. If the viewing distance in a sequence
capturing a scene containing a building is between 100 and 6 meters, then
the sequence contains a transition in distances. The spatial layout - spatial
envelope - capturing a building from 100 meters is totally different from the
spatial lay-out capturing a building from 6 meters. A picture captured from
100 meters is a distance photo, while that captured from 6 meters is a close-up
photo. A sequence having a bush captured between 15 and 1 meters does not
contain such a transition. The spatial layout does not change drastically over
those viewing distances - both 15 meters and 1 meter give close-up photos. A
sequence containing the sky and the ocean captured at a very large viewing
distance does not contain such a transition. The spatial layout does not
change over such viewing distances - all images are large distance or even
panorama images.
Torralba and Oliva [146, 192] use the estimation of η, as a function of orientation, in the power spectra power law to determine the distance to the main objects in the scene. η depends on the spatial layout of the scene, and the spatial layout of the scene is constrained by the distance to the main object in the scene.
1.4
SVD as Content Descriptor
How can the image content be quantified in terms of geometric structures
and texture? One approach builds on the assumption that geometric structure can be described exactly using a sparse representation, while texture cannot.
Given an image I, a sparse approximation Ik should be constructed, where
k is a sparseness/complexity parameter that measures the complexity of the
approximation. As k → ∞, Ik → I - i.e. as the sparseness decreases, the
approximation gets closer to the observed image I.
1.4.1
Related Work
The ’approximation’-approach relates to the problem of finding the optimal
base to represent the data. The book by Kirby [109] contains an overview of
different best basis approaches - especially the SVD, PCA and wavelet bases.
One well-known and commonly used approach for representing data is the
Principal Component Analysis (PCA) - also known as the Karhunen-Loève
transformation or the Hotelling transformation. In PCA, the goal is to find
an optimal orthonormal basis such that the residual variance of the data decreases as much as possible each time an additional basis vector is included. The PCA can be defined recursively in a natural way. The first normalized basis vector Ψ1 minimizes the residual variance of the data, i.e. it captures as much of the variance as possible. An additional basis vector Ψk+1 must be orthogonal to the previous basis vectors - i.e. ⟨Ψk+1, Ψi⟩ = 0 for i = 1, · · · , k - and minimize the remaining variance of the data. PCA is the optimal
linear dimensionality reduction method in the mean square error sense [20].
Rather than finding the optimal basis for one observation, PCA is used for
reducing the dimensionality of a set of observations.
Independent Component Analysis (ICA) was first formulated by Jutten
and Herault in their seminal paper [102]. The concept of mutual independence is central in ICA. Let X = (X1 , · · · , Xn ) be a set of stochastic variables
and p(X) be the joint probability distribution. The stochastic variables are
independent if
p(X) = p(X_1, \cdots, X_n) = p(X_1) \cdots p(X_n),   (1.21)
where p(Xi ) is the marginal distribution for Xi . The objective in ICA is to
find a transformation W such that

s = W x,   (1.22)
where the components s = ⟨s1, · · · , sn⟩ are as independent as possible (using some independence measure F(s1, · · · , sn)). x = ⟨x1, · · · , xn⟩ is a realization
of X and x is generated using a linear model
x = As,
(1.23)
where A is the mixing matrix and s is the independent component. Given
an x, the mixing matrix A = W −1 and the independent component s should
be found. See the tutorial on ICA by Hyvärinen [96] and Hyvärinen and Oja
[97].
A well-known sparse representation was proposed by Olshausen and Field
[149, 150], which relates to the models of the human visual front-end. An
image is modeled as a linear superposition of (possibly) non-orthogonal basis functions φi(x, y):

I(x, y) = \sum_i a_i \phi_i(x, y),   (1.24)
where φi form an over-complete basis for the image space and ai are coefficients of the basis vectors. The ai should be sparse, meaning that most of
the ai should be zero. The distribution p(a) will be peaked at zero and will
have ’heavy-tails’.
1.4.2
Optimal Rank Approximation and TSVD
One approach would be to approximate an image I in a lower dimensional
subspace. An image is simpler if it can be approximated well in a subspace of
low dimensionality, while an image is regarded as complex if a good approximation requires a subspace of dimensionality close to the dimension of the
observed image. Let Ik be an approximation of I in a subspace of dimension
k. As the dimension k increases towards the dimension of I, the approximation Ik gets closer to observed image I. Viewing images as matrices allows us
to regard the dimensionality of subspaces as the matrix rank. The dimension
of a matrix is the number of linearly independent columns it contains or, equivalently, the
dimension of the subspace spanned by the columns. This is captured by the
rank of the matrix
Rank(A) = dim(span{a1 , · · · , an }).
(1.25)
Given an image I with rank(I) = k0 , a rank k approximation Ik of I
should be computed. The approximation Ik should be optimal in the sense
that any other matrix B with rank k will have at least as large approximation
error as Ik . Measuring the approximation error in terms of the 2-norm gives
I_k = \arg\min_{\mathrm{Rank}(B)=k} \| I - B \|_2.   (1.26)
The matrix B, with Rank(B) = k, that has the lowest approximation
error in the 2-norm sense should be computed. The image residual I − Ik
contains the details that are suppressed in the approximation Ik, and ‖I − Ik‖2 is a measure of the suppressed details.
Notice that any matrix A can be decomposed as
A = U \Sigma V^T,   (1.27)

where U and V are orthogonal matrices, i.e. U U^T = I and V V^T = I, where I is the identity matrix, and diag(Σ) = (σ1, · · · , σn), where σi ≥ σi+1 ≥ 0.
This is the well-known Singular Value Decomposition (SVD) [75, 109]; the σi are called singular values, the ui left-singular vectors and the vi right-singular vectors. The set {σi, ui, vi} is called the singular system of A. The rank of a matrix A is
the number of singular values strictly larger than zero. Furthermore, the
2-norm is

\| A \|_2 = \sigma_1,   (1.28)

i.e. the largest singular value, and the squared Frobenius norm is

\| A \|_F^2 = \sum_{i,j} a_{ij}^2 = \sum_i \sigma_i^2,   (1.29)
i.e. the sum of the squared singular values. The 2-norm is a vector-induced norm defined as

\| A \|_2 = \sup_{\| x \|_2 \neq 0} \frac{\| A x \|_2}{\| x \|_2},   (1.30)

where the right-hand side is defined by the vector 2-norm, i.e. \| x \|_2 = \sqrt{x x^T}. The matrix 2-norm is an operator norm, which can be geometrically interpreted as how much A, as a linear operator, scales the vector x.
Let Σk be the matrix containing the k largest singular values on the diagonal; then the Truncated Singular Value Decomposition is defined as

A_k = U \Sigma_k V^T.   (1.31)

Rank(Ak) = k and Rank(A − Ak) = Rank(A) − k. The 2-norm of the residual matrix is

\| A - A_k \|_2 = \sigma_{k+1},   (1.32)
and the squared Frobenius norm is

\| A - A_k \|_F^2 = \sum_{i=k+1}^{n} \sigma_i^2.   (1.33)
The squared Frobenius norm corresponds to the Sum-of-Squares Distance (SSD) often used for comparing images.
In fact, the TSVD approximation is the best rank k approximation in
the sense that any other rank k approximation will have at least as large
reconstruction error using either the 2-norm or the Frobenius norm. The TSVD is the solution to the minimization problem (1.26) (and it is also the solution if the Frobenius norm is used instead).
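A minimal numerical check of this optimality property (a NumPy sketch, not thesis code):

    import numpy as np

    def tsvd(A, k):
        """Best rank-k approximation by truncated SVD (equation 1.31)."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    A = np.random.default_rng(0).standard_normal((50, 50))
    Ak = tsvd(A, 5)
    s = np.linalg.svd(A, compute_uv=False)
    # The 2-norm of the residual equals sigma_{k+1} (equation 1.32)
    assert np.isclose(np.linalg.norm(A - Ak, 2), s[5])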
Damped Singular Value Decomposition (DSVD)
A common approach used in deblurring and denoising is to use a ’soft’ threshold for the singular values. In TSVD, the k largest singular values are kept,
while singular values k + 1 to Rank(A) are set to zero. In Damped Singular
Value Decomposition (DSVD), all singular values are damped using filter
factors defined as

f_i = \frac{\sigma_i^2}{\sigma_i^2 + \lambda^2},   (1.34)
where λ is a problem dependent regularization parameter. The filtered singular values are φi = fi σi. TSVD can also be formulated using filter factors, namely

f_i = \begin{cases} 1 & \text{if } i \le k, \\ 0 & \text{if } i > k. \end{cases}
The DSVD is a ’soft’ threshold in the sense that the large singular values
are kept almost unchanged and the small singular values are almost zero after
filtering. This is because

f_i \approx \begin{cases} 0 & \text{if } \sigma_i \ll \lambda, \\ 1 & \text{if } \lambda \ll \sigma_i. \end{cases}
DSVD is related to the solution of Tikhonov regularization problems [142,
103, 87, 89, 88, 55]. The DSVD is the solution to the following Tikhonov regularization problem,

u_\lambda = \arg\min_u \| u_0 - u \|_2^2 + \lambda^2 \| u \|_2^2,   (1.35)

commonly studied in denoising and inverse problems.
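A sketch of the filter-factor formulation (assuming NumPy; not the thesis implementation):

    import numpy as np

    def dsvd(A, lam):
        """Damped SVD: each singular value is damped by the filter factor
        f_i = sigma_i^2 / (sigma_i^2 + lambda^2) (equation 1.34)."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        f = s ** 2 / (s ** 2 + lam ** 2)
        return U @ np.diag(f * s) @ Vt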
1.4.3
Measuring the Complexity of Images - Singular
Value Reconstruction Index
An image is considered to be simple if it can be approximated well in a subspace of low dimensionality, and it is considered to be complex if it can only
be approximated well in a subspace of high dimensionality. Given an image I,
an approximation Ik should be constructed such that the reconstruction error is smaller than σ err . We define the complexity of the image as the lowest
dimension of the subspace in which the approximation Ik can be constructed
\min \{ k : \| A - A_k \|_2 \le \sigma_{err} \}.   (1.36)
This is termed the Singular Value Reconstruction Index (SVRI) at level σ err
and tells how many singular values are required for an approximation with an
error smaller than σ err . First, the error level is determined - how well should
the approximation fit the original image - then the number of singular values
is determined (i.e. the dimension of the subspace).
Another definition is: given a subspace of dimension k, how well can the observed image be approximated? The 2-norm or the Frobenius norm of the residual image is then used as the complexity measure.
Furthermore, an image is composed of image patches. We assume that
the complexity of an image should be determined by the complexity of the
patches that constitute the image. Rather than computing the global singular value reconstruction index at level σ err , the complexity for each patch
constituting the image is computed and the mean complexity of those patches
gives the image complexity.
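A minimal sketch of the patch-based SVRI (the intensity normalization and non-overlapping tiling are assumptions; the thesis does not spell out these details):

    import numpy as np

    def svri(image, patch=15, err=0.01):
        """Mean Singular Value Reconstruction Index over non-overlapping
        patches: the smallest k with ||A - A_k||_2 <= err, i.e. the number
        of singular values strictly larger than err (equation 1.36)."""
        img = image / np.abs(image).max()       # assumed normalization
        ks = []
        for i in range(0, img.shape[0] - patch + 1, patch):
            for j in range(0, img.shape[1] - patch + 1, patch):
                s = np.linalg.svd(img[i:i + patch, j:j + patch], compute_uv=False)
                ks.append(int(np.searchsorted(-s, -err)))  # count of sigma_i > err
        return float(np.mean(ks))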
1.4.4
Discussion
In figure 1.4, the SVRI is shown as a filter applied to images of 100 × 100
pixels, containing the same scene captured at different viewing distances.
In figure 1.5, images with low/high SVRI value are shown. The top
row shows images with low SVRI. The sky covers a large part of the images. Furthermore, the distances to the main objects in the scenes are large, resulting in large scale geometric structures - such as buildings - with
sharp boundaries. From this, we get the indication that images with a low
SVRI mainly contain geometric structures.
In the second row of figure 1.5, images with high SVRI are shown. The
images contain small scale details such as leaves and twigs, and the distance
Figure 1.4: Example of SVRI used as a filter on 100 × 100 images containing the same scene captured at 5 different viewing distances. The top row contains the images; the second and third rows contain the SVRI filter using patch size 5 × 5 with σerr = 0.05 and patch size 10 × 10 with σerr = 0.1, respectively.
to the main objects in the scene is rather small. Furthermore, the sky is
absent (or covers a small region in the images) in all of the images. Hence,
from this, we get the indication that images with high SVRI contain mainly
texture.
Rank Distribution in Natural Images
Measuring the image complexity using TSVD and the proposed SVRI depends on the rank distribution and on the distribution of the singular values
in natural image patches. Furthermore, the image content in natural image
patches should depend on the size of the smaller singular values. The patch
content should be different if the smaller singular values are small or large.
1000 25 × 25 image patches were randomly selected from each
image in the DIKU Multi-Scale Image database. (An experiment using 50 ×
50 image patches gave similar results.) The singular values for each of the
roughly 800,000 patches were computed. The first conclusion based on the
experiment is that image patches in natural images are almost always of full
rank - i.e. in the experiment, σ25 > 0 in all patches.
The condition number of an n × n matrix A is defined as

\mathrm{Cond}(A) = \frac{\sigma_1}{\sigma_n},   (1.37)
and it measures how well-conditioned the matrix is [75]. A large condition
number indicates that the matrix is ill-conditioned and that the columns are
almost linearly dependent. For natural image patches, the condition number
is always finite, because the patches are of full rank - i.e. σ25 > 0. The
condition number is large, which indicates that columns are almost linearly
dependent.
The distribution of σ1 has a large variance and is not very peaked around
the mode. The distribution of the σi ’s for i > 1 follows the same basic form.
The distributions are very peaked at zero, which indicates that most singular
values are very small. Still, the distributions have 'heavy tails', i.e. values relatively far from zero.
Visually comparing patches with large σn with patches of small σn clearly
indicates a large difference in image content. Patches with a large σn contain
small scale details, while patches with small σn contain geometric structure.
Figure 1.5: Example of images with low/high singular value reconstruction
index (SVRI) at level 0.01 using 15 × 15 patches. The top row contains
images with low SVRI, all images contain the sky and the viewing distances
are rather large. The bottom row shows examples of images with high SVRI;
all images contain small scale details and the viewing distances are small.
1.5
Image Description by Regularization
The problem of measuring and quantifying the image content in terms of
geometric structure and texture can be approached in many ways. In section
1.4 and chapter 4, a matrix approach using TSVD is proposed, based on the
property that the TSVD approximation is the optimal rank k estimation. In
the following section and chapter 5, image regularization and decomposition
in a continuous setting is used for measuring the image content.
Image regularization can be viewed as approximating an observed image
u0 with a simpler image u, called the regularized solution, such that some
energy functional is minimized. Most energy functionals are composed of
two terms, a data fidelity term and a regularization term. The simplicity of
u is defined and determined by the regularization term and a regularization
parameter λ. As the regularization parameter increases, the regularized solution gets simpler in the sense defined by the regularization term. To stress
the dependence on the regularization parameter λ, uλ will sometimes be used
instead of u. The residual image (also called the image residual) is the difference between the observed image and the approximated image - (u0 − uλ )
- and it contains the details that were removed in the approximation.
Image regularization can also be viewed as decomposing the observed
image u0 into two components: a geometric and a texture component. By
measuring the content in two components, the image content can be quantified in terms of geometric structure and texture. We analyze the squared L2
norm of the regularized solution and the residual image as a function of the
regularization parameter λ.
The L2 norm of the regularized solution and the residual has been studied
in connection with parameter selection in denoising, but not for describing
the image content in terms of geometric structure and texture.
1.5.1
Related Work
Characterizing the image content by analyzing the norm of the scale space
representation of the image as a function of scale/regularization parameter
has not received a lot of research attention. The behavior of the norm as
a function of scale/regularization parameter has been studied in denoising,
for optimal parameter selection. Thompson et al. [189] contains a classical
study of parameter selection in denoising.
Sporring [180] and Sporring and Weickert [181, 182] view images as distributions of light quanta and use information theory to study the image content
in scale spaces. They show that the entropy of an image is an increasing function of the scale (in scale space). Empirically, they show that the derivative
of the entropy with respect to the scale is a good texture descriptor.
One of the oldest, and still often used, optimal parameter selection method
in denoising is the Morozov discrepancy principle (or discrepancy principle)
[138, 87, 198]. The noise is assumed to be additive
u0 = u + e,
(1.38)
where u is the ’clean’ image and e is the noise. Furthermore, the norm of
the noise, kek = ε, must be known or possible to estimate. The parameter
should be selected such that
kek = ku0 − uλ k = ε,
(1.39)
i.e. the residual norm should be equal to the norm of the noise.
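As a sketch (assuming Gaussian smoothing as a stand-in regularizer; any method whose residual norm is monotone in the parameter would do), the discrepancy parameter can be found by bisection:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def discrepancy_lambda(u0, eps, lo=1e-3, hi=100.0, iters=40):
        """Morozov discrepancy principle (equation 1.39): pick lambda so
        that ||u0 - u_lambda|| = eps. Gaussian smoothing (variance lambda)
        stands in for the regularizer; its residual norm grows with lambda."""
        for _ in range(iters):
            mid = np.sqrt(lo * hi)              # bisection in log space
            r = np.linalg.norm(u0 - gaussian_filter(u0, sigma=np.sqrt(mid)))
            if r > eps:
                hi = mid                        # residual too large: decrease lambda
            else:
                lo = mid
        return np.sqrt(lo * hi)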
Another common method for determining the optimal parameter in denoising is the L-curve studied by Hansen [89, 87, 88, 198]. The L-curve is a
log-log plot of the norm of the regularized solution against the norm of the
corresponding residual. It shows the trade-off between the size of the solution and the size of the residual, using suitable norms. In the log-log scale,
it has an L-shape. According to the L-curve criterion, the optimal value for
the parameter λ is the one with highest curvature (i.e. the corner of the L).
The L-curve is related to the less formal trade-off curve discussed in [22].
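The L-curve points themselves are easy to compute; the following is a sketch under the same stand-in regularizer assumption as above:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def l_curve(u0, lambdas):
        """L-curve points: (log residual norm, log solution norm) over a
        range of parameters; the corner of highest curvature selects lambda."""
        pts = []
        for lam in lambdas:
            u = gaussian_filter(u0, sigma=np.sqrt(lam))
            pts.append((np.log(np.linalg.norm(u0 - u)),
                        np.log(np.linalg.norm(u))))
        return np.array(pts)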
Buades et al. [27] introduced the concept of 'Method Noise' for evaluating
denoising methods. (See also [28, 29, 26].) The ’Method Noise’ is simply
the difference between the original image and the denoised image - i.e. the
image residual. Optimally, the image residual should contain only noise and no structure after denoising: the noise, and only the noise, has been suppressed. For example, denoising a noise-free image should optimally result in an empty residual, while denoising an image corrupted by additive independent Gaussian white noise should result in a residual containing Gaussian white noise. Furthermore, the residual should not contain
any structure not caused by the noise. ’Method Noise’ aims to characterize
the denoising methods by analyzing the content in the residual image. While
’Method Noise’ aims to characterize the behavior of the method by analyzing
the residual, our aim is to characterize the image by analyzing the residual
norm as a function of the regularization parameter. Our aim is, in some
sense, complementary.
1.5.2
Image Decomposition
In image decomposition, an observed image u0 is considered to be composed
of two components: a smooth/geometric component and a noise/texture
component. Formally, an image decomposition may be written as
u0 = u + v,
(1.40)
where u0 is the observed image, u is the smooth/geometric component and v
is the texture. Often, we are interested in the geometric component u, which
can be found by minimizing a suitable energy functional

E(u; \lambda) = \int (u_0 - u)^2 \, dx + \lambda \int \Psi(Du) \, dx.   (1.41)

Here λ is a problem dependent regularization parameter, D is a linear operator (often a differential operator) and Ψ(x) is a 'penalty' function (often the absolute or squared absolute value). The noise/texture component v is also known
or squared absolute value). The noise/texture component v is also known
as the residual (image), because v = u0 − u, i.e. the difference between the
observed image and the regularized solution, and it contains the details that
have been suppressed in the regularization. Regularization in computer vision and image processing has a long history and is used for transforming
ill-posed problems into well-posed problems [18, 159, 158], and for denoising.
The energy functional is composed of two terms, a data term and a regularization term. The data term forces the solution u to be close in the L2 sense
to the observed data u0 , while the regularization term forces u to be smooth
in the sense defined by the operator D.
In [83] the squared L2 -norm of the regularized solution and the residual
using three regularization methods were used to analyze and characterize
the image content. The regularization methods include first order Tikhonov
regularization,

E(u; \lambda) = \int (u_0 - u)^2 + \lambda |\nabla u|^2 \, dx,   (1.42)
linear Gaussian scale space [98, 202, 111, 122],

u_\lambda = u_0 * G_\lambda,   (1.43)

where * denotes the convolution operator and Gλ is the Gaussian function with variance λ. Linear Gaussian scale space is equivalent to an infinite-order Tikhonov regularization [143], but it is more intuitive to use the convolution formulation.
Finally, the Total Variation image decomposition is also studied:

E(u; \lambda) = \int (u - u_0)^2 + \lambda |\nabla u| \, dx.   (1.44)
1.5.3
The Bayesian Approach and MAP-Solution
The Bayesian approach gives a statistical interpretation of energy minimization [140]. Bayes' rule and the MAP-solution were introduced in section 1.1.1. For the statistical interpretation, it is assumed that the pure signal u has been corrupted by additive Gaussian white noise, resulting in an observed image u0. The pure signal u should be recovered from the observed image u0.
The additive Gaussian white noise assumption gives v = u0 − u ∈ N(0, σ^2), and

p(u_0(x_0) \mid u(x_0)) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(u_0(x_0) - u(x_0))^2}{2\sigma^2}}.   (1.45)

Assuming that the pixel noise is independent, we have

p(u_0 \mid u) = C_1 \, e^{-\sum_{x \in D} \frac{(u_0(x) - u(x))^2}{2\sigma^2}}.   (1.46)
The prior term is harder to model and more assumptions are required.
For smooth images without texture, it is feasible to assume small intensity variation. One may assume that |∇u| follows a zero mean normal distribution with variance µ^2, which gives

p(u) = C_2 \, e^{-\sum_{x \in D} \frac{|\nabla u(x)|^2}{2\mu^2}}.   (1.47)
(1.47)
Another assumption would be that |∇u| follows a Laplacian distribution, which gives

p(u) = C_2 \, e^{-\sum_{x \in D} \frac{|\nabla u(x)|}{2\mu^2}}.   (1.48)
The distribution of partial derivatives in natural images can be modeled with the generalized Laplacian distribution [184, 117, 126, 207]. Inserting the estimates into the Bayes formulation gives

u_{map} = \arg\max_u \left\{ C_1 \, e^{-\sum_{x \in D} \frac{(u_0(x) - u(x))^2}{2\sigma^2}} \cdot C_2 \, e^{-\sum_{x \in D} \frac{|\nabla u(x)|^2}{2\mu^2}} \right\},   (1.49)
where the prior term is estimated by assuming that the gradient magnitude follows a Gaussian distribution. By taking the negative logarithm of the MAP-solution, one can get rid of the exponentials, and the maximization problem turns into a minimization problem, given by

E(u) = \sum_{x \in D} \frac{(u_0(x) - u(x))^2}{2\sigma^2} + \sum_{x \in D} \frac{|\nabla u(x)|^2}{2\mu^2}.   (1.50)
Switching to the continuous domain and renaming the parameters gives

E(u) = \int_D (u_0(x) - u(x))^2 \, dx + \lambda \int_D |\nabla u(x)|^2 \, dx,   (1.51)
which corresponds to the first order Tikhonov regularization energy functional. Instead, using the assumption that the gradient magnitude follows a Laplacian distribution, as in equation (1.48), leads to the total variation
image decomposition functional.
Using the Bayesian formulation, we see that the first order Tikhonov regularization and the total variation image decomposition are MAP solutions
under different assumptions about the distribution of the prior term.
1.5.4
Regularized and Residual Norm
To analyze the image content with respect to geometric structure and texture,
the squared L2 norm of the regularized solution uλ,

s(\lambda) = \| u_\lambda \|_2^2,   (1.52)

and the squared residual norm,

r(\lambda) = \| v_\lambda \|_2^2 = \| u_0 - u_\lambda \|_2^2,   (1.53)
as functions of the regularization parameter, are studied. Of interest are also
the corresponding derivatives with respect to the regularization parameter
λ. The derivative with respect to λ reveals the rate in which details are
suppressed.
The normalized norm of the regularized solution is defined as

s_{norm}(\lambda) = \frac{\| u_\lambda \|_2^2}{\| u_\lambda \|_2^2 + \| v_\lambda \|_2^2},   (1.54)

and the normalized residual norm is defined as

r_{norm}(\lambda) = \frac{\| v_\lambda \|_2^2}{\| u_\lambda \|_2^2 + \| v_\lambda \|_2^2}.   (1.55)
The derivative of the normalized norm with respect to λ reveals the rate in
which details are suppressed as the regularization parameter increases.
By the triangle inequality we have \| u_0 \|_2^2 \le \| u \|_2^2 + \| v \|_2^2. Let t(\lambda) = \| u_\lambda \|_2^2 + \| v_\lambda \|_2^2 denote the total norm. The total norm t(λ) is not constant; instead, it depends on the parameter λ: t(0) = \| u_0 \|_2^2 and t(\infty) = \| C \|_2^2 + \| u_0 - C \|_2^2, where C is the mean intensity in the image. Normalizing the initial image u0 by subtracting the mean value C gives a simpler form for the limit case, t(\infty) = \| u_0 \|_2^2.
The sum of the two normalized norms is one, i.e. snorm(λ) + rnorm(λ) = 1. The
normalized regularized norm snorm (λ) can be viewed as the degree of the
total norm that is explained by the regularized solution, and the normalized
residual norm rnorm (λ) as the degree of the total norm that is explained by
the texture component.
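A sketch of how such curves can be computed empirically (with Gaussian scale space, parameterized by the variance, standing in for the regularization method):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def normalized_norms(u0, lambdas):
        """s_norm and r_norm (equations 1.54 and 1.55) as functions of the
        parameter, with Gaussian scale space (variance parameterization)
        as the regularization method."""
        u0 = u0 - u0.mean()                 # subtract the mean, as in the limit case
        out = []
        for lam in lambdas:
            u = gaussian_filter(u0, sigma=np.sqrt(lam))
            s = np.sum(u ** 2)
            r = np.sum((u0 - u) ** 2)
            out.append((s / (s + r), r / (s + r)))
        return np.array(out)                # each row sums to one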
The squared L2 norm of the regularized solution and the residual as a
function of λ were studied in terms of convexity/concavity using three regularization methods. We show that for first order Tikhonov regularization,
s(λ) is a monotonically decreasing convex function, while r(λ) is a monotonically increasing function but r(λ) is neither concave nor convex. The same
holds for the linear Gaussian scale space if the parameter in the Gaussian function is the variance, but fails when the parameter is the standard deviation.
Empirically, we show that the squared L2 -norm of the residual using TV is
an increasing non-concave function.
1.5.5
Discussion
We attempt to characterize the image contents in terms of geometric structure and texture using regularization. Image regularization can be viewed
as approximating a given image with a simpler one. Measuring the content that is kept - i.e. the regularized solution - and the content that is removed - i.e. the residual - can then be used to characterize the original image content in terms of geometry and texture. Image regularization can also be viewed as
image decomposition - the image is decomposed into a geometric component
and a texture component. Again, the image content can be characterized by
measuring the content in the components.
Buades et al. [27] used the content in the residual - termed 'Method Noise'
- for evaluating denoising methods. They try to characterize the denoising
methods by the content in the residual. Our goal is complementary: how to
characterize the image by the content in its residual (using some regularization method).
1.6
Motion Estimation by Contour Registration
Image registration is the process of spatially overlaying two or more images
containing the same objects. The images may contain a scene at different
times or from different viewpoints, or they may contain the same objects
or scene, but captured using different sensors (different modality). Image
registration is a fundamental problem in image processing and is a crucial
intermediate step in many applications. Some common applications are preprocessing for image classification [41, 206, 95], image stitching/mosaicing
[50, 187, 178] and sensor fusion in medical applications [160, 164, 125, 127,
203].
The registration problem has been approached in a number of ways. One
approach has been to find a transformation that overlays the images such
that the sum of intensity differences is small. This approach is often called
a direct method because the image intensities are used directly. Another
approach has been to find interesting points such as corners and edges in the
images and then find correspondences between these points. The transformation that overlays the images is then found by using the correspondence
between the points. This approach is called feature-based and only a sparse
representation - the interesting points - of the images is used to determine
the transformation.
There is also a close relation between motion estimation and registration
[92]. Registration of images in a time sequence - temporal registration - may
be regarded as a motion estimation problem. Optical flow [93, 13, 25, 152]
is often used for estimation of the apparent motion in a sequence and it has
some similarities with the direct registration approach.
The problem addressed in this thesis has some similarities with image
registration, contour registration, and motion estimation, but it is still quite
different. How can the motion of a deformable moving object, such as a
walking person or a running horse, be estimated solely from the knowledge of the
boundary of the object as seen in different images? The motion of the interior
of the moving object should be estimated solely based on the knowledge of
the boundary.
Let Γ1 be a closed curve embedded in an image I1 and Γ2 another closed
curve embedded in another image I2 . The problem is to find a geometric
transformation Φ that overlays the two images such that Γ1 is mapped on
to Γ2 ; the interior of Γ1 should be transformed in a reasonable way and
the transformation should be the ”simplest” possible. The motion of the
contour and the motion of the interior must be consistent and computed
simultaneously.
To add some intuition, one can think of the two closed contours as the
boundary of a deformable moving object in a sequence and the problem is to
simultaneously compute the motion of the contour and estimate the motion
of the interior of the object. The motion of the entire object should be
computed solely based on the contour. For example, the motion of the nose
should be estimated solely based on the contours of the head in two images.
This can be useful when the boundaries of the object are available, but
the image contents are not reliable. The boundary of the object may be
available because shape priors are used in the segmentation process and the
image contents are not reliable because the object has been occluded or the
image contents may have been lost.
1.6.1
Image and Contour Registration
The general image registration problem may be defined as:
Definition 1 The Image Registration Problem
Given two images T1 and T2 and a distance measure D(T1 , T2 ), that measures
the difference between two images, find a geometric transformation Φ : R2 →
R2 that minimizes D(T1 , Φ(T2 )).
The registration problem has been approached in many different ways.
See Brown's rather old survey [24] and Zitova and Flusser's survey [215].
Direct Method
In the direct approach towards the registration problem, the image intensities
are directly used in the distance measure [99]. A geometric transformation Φ
that minimizes the distance between the intensities of the image T1 and the
transformed image Φ(T2 ) should be found. One common distance measure
is the sum of squared distance
D
SSD
1
(T1 , T2 ) = k T1 − T2 k2L2 =
2
Z
(T1 (x) − T2 (x))2 dx,
(1.56)
and a geometric transformation Φ that minimizes D SSD (T1 , Φ(T2 )) should be
found (see [136, 24] for some approaches). Often, Φ is a parametric transformation with parameters a and the problem is to find the optimal parameters
for the transformation Φa . One example of a parametric geometric transformation is the intensity-based affine linear transformation.
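A sketch of evaluating (1.56) under an affine warp (using SciPy's affine_transform; all names here are illustrative):

    import numpy as np
    from scipy.ndimage import affine_transform

    def ssd(T1, T2):
        """Discrete sum of squared distances (equation 1.56)."""
        return 0.5 * np.sum((T1 - T2) ** 2)

    def ssd_affine(T1, T2, A, b):
        """SSD between T1 and an affinely warped T2. Note that SciPy's
        affine_transform maps output coordinates to input coordinates,
        so (A, b) here is the inverse (pull-back) transformation."""
        warped = affine_transform(T2, A, offset=b, order=1)  # bilinear interpolation
        return ssd(T1, warped)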
The direct (or intensity-based) method is in general ill-posed in the sense
that a small change in the input images may give a completely different
transformation. By adding an additional smoothness (or regularization) term
that penalizes certain transformations the problem becomes well-posed. The
transformation Φ is a minimizer of the functional
E(Φ) = D(T1 , Φ(T2 )) + αS(Φ),
(1.57)
where D(T1 , T2 ) is the distance measure, α > 0 is a positive smoothness
parameter and S(Φ) is a smoothing term.
Feature-Based Method
In the feature-based approach, a number of feature points - also called control
points, interest points and landmarks - are extracted from the images. A
correspondence between the feature points detected in the two images is
established and some feature points may be discarded. The correspondence
between the feature points is used for finding a geometric transformation (see e.g. [191, 136]).
Good feature points should be stable over time, spread over the whole image and efficiently detectable. Such features are not present in all types of
images. Common and well-suited feature points are corners, edges and line
intersections. A region can be represented as a feature point by the center
of gravity and line segments can be represented as feature points by the two
endpoints or the middle point. Common feature detection methods are the
HARRIS detector [90], the scale invariant HARRIS detector [134], SUSAN
[176, 175, 177] and SIFT [124]. (See also [135] and [133] for an evaluation of
feature detectors.)
Given two sets of control points from two images, a correspondence between the control points should be established. One approach for establishing
a correspondence between the set of control points is to look at the spatial
relations. Another approach is to compute a descriptor locally around the
control point and use that for establishing a correspondence. The simplest
descriptor is image intensities locally around the control points, however,
some form of filter responses is often used.
After a correspondence between the control points has been established,
a geometric transformation that overlays the images should be constructed.
The geometric transformation should be constructed such that the corresponding feature points are overlaid. Global linear geometric transformations
such as similarity and affine transformation are common transformations.
Non-parametric feature-based geometric transformations are also common,
such as elastic and fluid-based registration.
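For the global linear case, the transformation can be recovered from matched points by least squares; a minimal sketch (the helper below is hypothetical, not from the thesis):

    import numpy as np

    def affine_from_correspondences(p, q):
        """Least-squares affine map sending feature points p to their
        correspondences q (both n x 2 arrays): q_i ~ A p_i + b."""
        X = np.hstack([p, np.ones((p.shape[0], 1))])    # homogeneous coordinates
        params, *_ = np.linalg.lstsq(X, q, rcond=None)  # params stacks [A^T; b^T]
        return params[:2].T, params[2]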
Dense Contour Registration
Contour registration is a fundamental problem in image processing, with a lot
of applications within shape analysis. In the contour registration problem,
two contours Γ1 and Γ2 (i.e. object boundaries) are given and a transformation that overlays the contours should be found. Contour registration
is a mapping between two contours and does not in general give an image
registration.
One dense approach to the curve matching problem is to minimize the "elastic energy" that is required to transform one curve into the other [208, 14, 74]. The curves are usually represented as simple connected parametric curves. Let Γ1 be parameterized by the arc length s, and let t(s) be a function mapping arc length to arc length; t(s) is then a correspondence between Γ1 and Γ2, where Γ1(s) is mapped to Γ2(t(s)). Associated with each
continuous correspondence function t(s) is a cost - the elastic cost of the
correspondence function t(s). Given the two contours Γ1 and Γ2 and a fixed
correspondence function t(s), the cost function measures the ”elastic energy”
that t(s) requires to transform Γ1 into Γ2 . The cost function C is defined by
C(\Gamma_1, \Gamma_2, t(s)) = \int_{\Gamma_1} F(\Gamma_1, \Gamma_2, t(s)) \, ds,   (1.58)

where the function F measures the "elastic" properties. The distance between two contours is then the minimum "elastic energy" over all correspondence functions t(s),

D(\Gamma_1, \Gamma_2) = \min_{t(s)} \int_{\Gamma_1} F(\Gamma_1, \Gamma_2, t(s)) \, ds.   (1.59)

Given the distance measure D between two contours, the contour registration problem becomes

R(\Gamma_1, \Gamma_2) = \arg\min_{t(s)} \int_{\Gamma_1} F(\Gamma_1, \Gamma_2, t(s)) \, ds.   (1.60)
The function F models the elastic properties and can depend on the
physical properties of the subject being studied. F can depend on other
curve properties such as the first derivative Γ̇ and the curvature |Γ̈|.
This approach minimizes the elastic energy exactly on the contours. The
cost for deforming the contour is explicitly formalized on the contour. Minimizing the elastic energy of deforming a contour gives an implicit cost of
deforming the interior of the contour.
The relation between shape similarity measures and contour registration
is very close. For example, the minimum ’elastic energy’ that is required to
transform one curve into the other can be considered to be a shape similarity
measure. In registration, the objective is to find the contour transformation,
while for shape similarity measures the interest is the cost of the transformation.
1.6.2
Image Registration by Contour Matching
By using shape priors in the segmentation, good object boundaries can be
found, even if the object boundary is occluded or the image contents have
been destroyed at the boundary. Let F1 , · · · , Fn be an image sequence containing a non-rigid moving object that should be segmented. In some frames,
the image contents may not be reliable either because the object is occluded
or because the image contents are missing. The object boundary can still
be found in many cases by using shape priors in the segmentation (see e.g. [64, 44, 45, 46]). Often, one also wants to estimate the motion of the object between the frames. Because the image contents inside the object are
missing, neither a direct registration approach nor a feature-based approach
can be used directly. Furthermore, it is not the motion of the contour that
should be computed; instead, it is the motion of the interior of the object
that should be computed solely based on the object boundaries. The motion
of the object should be computed based on the assumption that
• The object boundary is correct.
• Part of the image contents is not reliable.
A variational formulation of this problem is presented, which simultaneously computes a displacement field for the contour and interpolates the
motion of the interior of the object. A good motion estimation should overlay
the two boundaries in such a way that the motion of the interior is interpolated in a consistent way.
A Variational Approach
In this section, we are going to present a variational solution to the following
contour matching problem: Suppose we have two simple closed curves Γ1 and
Γ2 contained in the image domain Ω. Find the “most economical” mapping
Φ = Φ(x) : Ω → R2 such that Φ maps Γ1 onto Γ2, i.e. Φ(Γ1) = Γ2.
The latter condition is to be understood in the sense that if α = α(s) :
[0, 1] → Ω is a positively oriented parametrization of Γ1 , then β(s) = Φ(α(s)) :
[0, 1] → Ω is a positively oriented parametrization of Γ2 (allowing some parts
of Γ2 to be covered multiple times).
To present our variational solution of this problem, let M denote the set
of twice differentiable mappings Φ which map Γ1 to Γ2 in the above sense:
M = \{ \Phi \in C^2(\Omega; \mathbb{R}^2) \mid \Phi(\Gamma_1) = \Gamma_2 \}.   (1.61)
Moreover, given a mapping Φ : Ω → R2 , not necessarily a member of M,
then we express Φ in the form Φ(x) = x + U(x), where the vector valued
function U = U(x) : Ω → R2 is called the displacement field associated with
Φ, or simply the displacement field. It is sometimes necessary to write out
the components of the displacement field; U(x) = (u1 (x), u2 (x))T .
We now define the “most economical” map to be the member Φ∗ of M,
which minimizes the following energy functional:
    E[Φ] = (1/2) ∫_Ω ‖DU(x)‖²_F dx,    (1.62)
where ‖DU(x)‖_F denotes the Frobenius norm of DU(x) = [∇u1(x), ∇u2(x)]ᵀ, which for an arbitrary matrix A ∈ R^{2×2} is defined by ‖A‖²_F = tr(AᵀA). The
optimal transformation is given by
    Φ* = arg min_{Φ ∈ M} E[Φ].    (1.63)
Using that E[Φ] can be written in the form

    E[Φ] = (1/2) ∫_Ω |∇u1(x)|² + |∇u2(x)|² dx,    (1.64)
it can be seen that the Gâteaux derivative [170, 68, 6] of E[Φ] is given by
    dE[Φ; V] = ∫_Ω ∇u1(x) · ∇v1(x) + ∇u2(x) · ∇v2(x) dx
             = ∫_Ω tr(DU(x)ᵀ DV(x)) dx,
for any displacement field V (x) = (v1 (x), v2 (x))T . After integration by parts,
we find that the necessary condition for Φ∗ (x) = x + U ∗ (x) to be a solution
of the minimization problem (1.63) takes the form
    0 = − ∫_Ω ΔU*(x) · V(x) dx,    (1.65)
for any admissible displacement field variation V = V (x). Here ∆U ∗ (x) =
(∆u∗1 (x), ∆u∗2 (x))T is the Laplacian of the vector valued function U ∗ = U ∗ (x).
Since every admissible mapping Φ must map the initial contour Γ1 onto the
target contour Γ2 , it can be shown that any displacement field variation V
must satisfy
    V(x) · nΓ2(x + U*(x)) = 0  for all x ∈ Γ1.    (1.66)
Notice that this condition only has to be satisfied precisely on the curve
Γ1 , and that V = V (x) is allowed to vary freely away from the initial contour. The interpretation of the above condition is that the displacement
field variation at x ∈ Γ1 must be tangent to the target contour Γ2 at the
point y = Φ(x). In view of this interpretation of (1.66), it is not difficult
to see that the necessary condition (1.65) implies that the solution Φ∗ of
the minimization problem (1.63) must satisfy the following Euler-Lagrange
equation:
    0 = { ΔU* − (ΔU* · n̂*Γ2) n̂*Γ2   on Γ1,
        { ΔU*                          otherwise,    (1.67)
where n̂∗Γ2 (x) = nΓ2 (x + U ∗ (x)), x ∈ Γ1 , is the pullback of the normal field
of the target contour Γ2 to the initial contour Γ1 . The standard way of
solving (1.67) is to use the gradient descent method: Let U = U(t, x) be the
time-dependent displacement field which solves the evolution PDE
    ∂U/∂t = { ΔU − (ΔU · n̂*Γ2) n̂*Γ2   on Γ1,
            { ΔU                          otherwise,    (1.68)
where the initial displacement U(0, x) = U0(x), chosen such that x + U0(x) ∈ M, is specified by the user,
and U = 0 on ∂Ω, the boundary of Ω (Dirichlet boundary condition). Then
U*(x) = lim_{t→∞} U(t, x) is a solution of the Euler-Lagrange equation (1.67).
The PDE (1.68) coincides with the so-called geometry-constrained diffusion introduced by Andresen and Nielsen in [5]. Thus we have derived the
energy functional that geometry-constrained diffusion is minimizing.
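A minimal numerical sketch of this gradient descent, assuming the displacement field is stored as two float arrays on the pixel grid, Γ1 is given as a boolean mask, and the normal field of Γ2 is available through a user-supplied callable; all names here are hypothetical, not part of the original formulation.

import numpy as np

def laplacian(f):
    """5-point Laplacian; the stencil leaves the image border at zero, so a
    field initialised to zero on the border satisfies the Dirichlet condition."""
    lap = np.zeros_like(f)
    lap[1:-1, 1:-1] = (f[2:, 1:-1] + f[:-2, 1:-1] +
                       f[1:-1, 2:] + f[1:-1, :-2] - 4.0 * f[1:-1, 1:-1])
    return lap

def geometry_constrained_diffusion(u1, u2, gamma1_mask, normal_of_gamma2,
                                   steps=5000, dt=0.2):
    """Explicit Euler scheme for the evolution PDE (1.68).

    u1, u2           : components of the displacement field U (float arrays).
    gamma1_mask      : boolean array, True on the initial contour Gamma_1.
    normal_of_gamma2 : callable mapping points (N, 2) to unit normals (N, 2)
                       of the target contour; evaluating it at x + U(x)
                       gives the pulled-back normal field on Gamma_1.
    """
    ys, xs = np.nonzero(gamma1_mask)
    for _ in range(steps):
        du1, du2 = laplacian(u1), laplacian(u2)
        # On Gamma_1, project out the component of Delta U along the
        # pulled-back normal so the contour constraint is preserved.
        pts = np.stack([xs + u1[ys, xs], ys + u2[ys, xs]], axis=1)
        n = normal_of_gamma2(pts)
        dot = du1[ys, xs] * n[:, 0] + du2[ys, xs] * n[:, 1]
        du1[ys, xs] -= dot * n[:, 0]
        du2[ys, xs] -= dot * n[:, 1]
        u1 += dt * du1
        u2 += dt * du2
    return u1, u2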
1.6.3 Relation to Feature-Based and Contour Registration
In our approach, image F1 contains curve Γ1 and image F2 contains curve
Γ2 , and an image registration - Φ(x) - that overlays the two contours and
minimizes the functional (1.62) should be found.
The preceding segmentation can be viewed as a feature extraction step. In the segmentation step, an accurate object boundary is extracted and it is viewed as a continuous planar curve. Feature points are not allocated along
Figure 1.6: Given two closed curves Γ1 and Γ2 contained in two images F1
and F2 , Φ maps F1 onto F2 such that Γ1 is mapped onto Γ2 (i.e. Φ(Γ1 ) = Γ2 ).
the boundary; instead, the dense curve is used directly to determine the
image registration. Instead of extracting control points on the curves - such as points with high curvature and zero crossings of the curvature - and then establishing a correspondence between the control points, a continuous transformation that overlays the dense contours in the image domain is found. In the feature-based approach, a registration should be found using the correspondence
between a discrete set of feature points; in our approach, a registration should
be found based on a continuous feature set. The constraint on the geometric
transformation is the mapping between the contours.
The segmentation of the object represents the feature extraction step.
Given the segmentation - i.e. a continuous set of features - the feature correspondence step and the geometric transformation step are solved simultaneously. The dense correspondence between the contours restricts the set of
possible transformations.
1.6.4 Applications
The contour-based motion estimation was combined with shape prior segmentation in image sequences. By using the previous segmentation as shape
prior, accurate object segmentation was possible even if part of the object
was missing or occluded. The contours of the object in two adjacent frames
were used for estimating the displacement field. The intensity in the second
frame can be predicted by applying the displacement field. This is temporal
inpainting or transport of intensities between frames. By comparing the predicted intensity with the observed intensity, object occlusion can be detected.
If the difference between the predicted and observed intensity is large, then
the object is occluded. Temporal inpainting using solely the contour has
been used for texturing objects in image sequences [190].
1.6.5 Discussion
The estimation of the displacement field - the deformation and motion - of a
non-rigid moving object solely based on the object boundary relates both to
feature-based image registration and contour registration. The contours can
be viewed as continuous sets of features that should be mapped onto each
other. Simultaneously, the deformation field for the interior of the object
should be estimated. From a contour registration point of view, the contours
should be mapped onto each other, but the deformation cost is no longer
solely on the deformation of the boundary. Instead, the deformation cost
is the cost of deforming the interior of the contour. Simultaneously, the
contours should be mapped onto each other, such that the cost of deforming
the interior is minimized. The elastic energy of deforming one contour into
the other is commonly used as a shape similarity measure. In a similar way,
the deformation cost of the contour could instead be measured in terms of
deforming the interior of the contour.
1.7 Scientific Contributions
1.7.1 Published Papers and Scientific Contributions
• David Gustavsson, Kim S. Pedersen, and Mads Nielsen. Geometric
and texture inpainting by Gibbs sampling. In Proceedings of Swedish
Symposium in Image Analysis 2007, 2007.
Filter Random fields And Maximum Entropy (FRAME) [213, 214] is
used for inpainting missing regions in images containing stationary texture. The problem of prolonging and connecting geometric structures when Gibbs sampling directly from the learned FRAME model is observed. An inverse temperature term β = 1/T is added to the FRAME
distribution. A two-phase inpainting procedure is proposed. In the
first phase, the large scale geometric structure is inpainted by sampling
from a cooled distribution using a fixed β > 1. By cooling the distribution, the probability mass is redistributed in such a way that a
larger part of the probability mass is located on the images with high
probability. It is assumed, and empirically verified, that by cooling the
distribution, the large scale geometric structure will be brought out. In
the second phase, the small scale texture is added by sampling from a
heated distribution.
The experiments show that the geometric structure is prolonged and connected in a visually pleasing way in the first phase, even if the missing region is much larger than the geometric structure. The heating
phase adds texture to the inpainting without destroying the geometric
structure.
Theory developed in collaboration with all authors. All implementation
and experiments were done by the author.
• David Gustavsson, Kim S. Pedersen, and Mads Nielsen. Image inpainting by cooling and heating. In Proceedings of Scandinavian Conference
on Image Analysis (SCIA) 2007, 2007. Peer review
The two-phase inpainting strategy using FRAME, proposed in the paper [86], is extended. In the first phase, which inpaints the geometric
structure, a fast cooling scheme is proposed. Using the fast cooling
scheme, a more MAP-like solution is found which prolongs and connects geometric structures. The fast cooling scheme is less sensitive to
parameter settings and seems to perform better than the fixed temperature approach.
The Iterated Conditional Modes (ICM) algorithm, which corresponds to β = ∞, is also evaluated. ICM is a site-wise greedy strategy that depends on
the initialization and the visiting order. ICM often fails to inpaint the
geometric structure.
In the second phase, which adds the texture, a fast heating procedure
is proposed. The improvement, by using the fast heating procedure, is
minor.
Theory developed in collaboration with all authors. All implementation
and experiments were done by the author.
• David Gustavsson, Ketut Fundana, Niels-Ch. Overgaard, Anders Heyden, and Mads Nielsen. Variational Segmentation and Contour Matching of Non-Rigid Moving Object. In Proceedings of Workshop on Dynamical Vision WDV 2008, 2008. Peer review
Level set based segmentation, including shape priors, in image sequences is combined with registration by geometry-constrained diffusion [5, 4]. By using the previous segmentation of a moving non-rigid object as shape prior for the current segmentation, accurate object segmentation is possible even if the object is partly occluded or missing.
By using registration by geometry-constrained diffusion on the object
boundaries, the complete deformation and motion of the object can be
estimated. The estimated motion is used for occlusion detection and
temporal inpainting.
We show, by using the calculus of variations, that the geometry-constrained diffusion equation proposed by Andresen and Nielsen [5, 4] minimizes an energy functional. The Euler-Lagrange equation for the energy
functional corresponds to the proposed diffusion equation.
Theory developed in collaboration with all authors. The shape prior
based level set segmentation was implemented by Ketut Fundana. The
registration by geometry-constrained diffusion method was implemented
and evaluated by the first author.
• Ketut Fundana, Niels-Ch. Overgaard, Anders Heyden, David Gustavsson and Mads Nielsen.
Nonrigid Object Segmentation and Occlusion Detection in Image Sequences. In Proceedings of International Conference on Computer Vision Theory and Applications (VISAPP) 2008, 2008.
Peer review
Motion estimation using shape prior segmentation and registration by
geometry-constrained diffusion is treated (as in paper [81]). An algorithm for estimation of the deformation and motion of a non-rigid
moving object using geometry-constrained diffusion is presented. An
algorithm for occlusion detection using the contour based deformation
and motion estimation is presented. The intensity inside the moving
object can be predicted by applying the motion estimation. If the predicted intensity in a location is different from the observed intensity,
then the object is occluded in that location. The experiments show
that occlusion can be detected if the deformation and motion are mild.
Estimation of the deformation and motion using solely the contour of
the object is not possible under large self-occlusion.
Theory developed in collaboration with all authors. The shape prior based level set segmentation was implemented by Ketut Fundana. The registration by geometry-constrained diffusion method was implemented and evaluated by the author.
• David Gustavsson, Kim S. Pedersen and Mads Nielsen. Multi-Scale
Natural Images: a database and some statistics. In Danish Conference on Pattern Recognition and Image Analysis (DSAGM) 2008. Extended abstract
The new multi-scale image sequences database is presented. The procedure and equipment used for collecting the database are discussed.
Natural images are defined as images containing ’natural’ scenes - both
nature and man-made structure - from a human perspective, which excludes the bird's-eye perspective.
Classical results from natural image statistics are computed and verified
on the new database.
Theory developed in collaboration with all authors. The database was
collected by the author and Rabia Granlund. All implementation and
experiments were done by the author.
• David Gustavsson, Kim S. Pedersen, and Mads Nielsen. A SVD Based
Image Complexity Measure. In Proceedings of International Conference on Computer Vision Theory and Applications (VISAPP) 2009, 2009.
Peer review
A Truncated Singular Value Decomposition image complexity measure
is proposed, based on the assumption that simple images can be approximated well in a subspace of low dimensionality, while a complex
image cannot. Using the well-known property that the truncated singular value decomposition is the optimal rank k estimation, using either
the 2-norm or the Frobenius norm, the rank of an approximation with a smaller error than σ_err is used as the complexity measure. It is termed the Singular Value Reconstruction Index (SVRI) at level σ_err and it is the dimensionality of the subspace where the image can be approximated with an error smaller than σ_err. Geometric structure can be approximated well in a subspace of low dimensionality, while stochastic texture requires a subspace of high dimensionality. An image is composed of
patches, and the complexity of the image should be determined by the
patches constituting the image. The complexity of the image is the
average SVRI at level σ_err of the patches constituting the image.
Empirically, the rank distribution of image patches in natural images
is studied. Patches in natural images are almost always of full rank.
The condition number is often very large, which indicates that the columns are almost linearly dependent. Visual inspection indicates that patches with a large smallest singular value often contain small scale details, while patches with a small smallest singular value (σn) often contain geometric structure.
Theory developed in collaboration with all authors. All implementation
and experiments were done by the author.
• David Gustavsson, Kim S. Pedersen, Francois Lauze and Mads Nielsen.
On the Rate of Structural Change in Scale Spaces. In Proceedings
of Scale Space and Variational Methods in Computer Vision (SSVM)
2009, 2009. Peer review
The squared L2-norm of the regularized solution and residual, as a function of the regularization parameter λ, is studied, using first order Tikhonov regularization, linear Gaussian Scale Space and Total
Variation (TV) image decomposition. Using first order Tikhonov regularization, the squared L2-norm of the regularized solution is a monotonically decreasing convex function of λ, while the squared L2-norm of the residual is a monotonically increasing function of λ which, for non-trivial images, is not concave. The same holds for Linear Gaussian
Scale Space when the parameter is the variance of the Gaussian, but
fails when the parameter is the standard deviation. Experimentally, we
have shown that the squared L2 norm of the residual is not a concave
function of the regularization parameter λ using TV-decomposition.
We also show, on artificial images containing details of different sizes,
that inflection points of the squared residual norm as a function of λ correspond to values of λ where details are totally suppressed.
Theory developed in collaboration with all authors. All implementation
and experiments were done by the author.
1.7.2 Discussion
In this dissertation, we treat geometric structure and texture from different points of view. We argue that the most important difference between
geometric structure and texture is the requirement on the representation.
Geometric structure must be represented exactly, while a random sample
from a distribution is sufficient for texture. Often, but not always, this is
related to scale. The geometric structure is the large scale structure, while
the texture is the small scale details.
In the primal sketch by Guo et al. [56, 57], the geometry of the objects in an image - i.e. edges and blobs - is represented exactly. The remaining
regions in the image are segmented into regions containing stationary texture.
The textured regions are reconstructed by a random sample from a learned
distribution using the FRAME model.
In information scaling by Wu et al. [205], the image content as a function of the viewing distance is studied using information theory. Statistical
properties of an image (or in a region of the image) depend on the viewing
distance and alter by changing the viewing distance. Two processes are involved when the viewing distance is increased: smoothing and sub-sampling.
Wu et al. argue that different image processing methods are suitable for
different image content. Two regimes are singled out: low entropy and high
entropy. Wavelets - or some other sparse representation - are suitable in the low entropy case, while Markov random fields are suitable in the high entropy case. Sparse coding can encode geometric structure - low entropy - while it fails to encode small scale stochastic details - high entropy. Markov random fields fail to reconstruct long range geometric structures - the low entropy regime - but they can reconstruct small scale stochastic texture. Again, we see that
geometric structure must be represented exactly and this can be done using
a sparse representation. Texture, on the other hand, cannot and does not
have to be represented exactly.
In this thesis, we treat the problem of inpainting a missing region in a
texture. Inpainting, in contrast to texture synthesizing, has boundary conditions that put constraints on it. Most texture contains details at different
scales and certain details present on the boundary must be reconstructed
exactly. Often, ’texture’ contains geometric structure - details that must be
reconstructed exactly - on a smaller scale. The classical division of inpainting
methods into methods suitable for geometric structure and methods suitable
for texture is rather artificial, because texture often contains geometric structure on a smaller scale and texture synthesizing methods can often prolong
geometric structures. A more suitable division could be energy minimization
methods and sampling methods. The failure of the energy minimization to
faithfully reconstruct stochastic texture is evident and has been shown many
times. The failure of sampling methods on geometric structure is less evident
and has rarely been shown on realistic images.
Empirically, we show that FRAME can prolong and connect geometric
structure by cooling the learned distribution. The missing region is large,
compared with the 'size' of the geometric structure, and in contrast geometric methods such as TV fail to connect the geometric structure in this case.
As pointed out by Wu et al. [205], the image content does not solely
depend on the object in the scene, but also on the viewing distance. As
the viewing distance changes, so do the image statistics. Wu et al. mainly
study the statistical changes of different types of content, as a function of
the viewing distance.
Torralba and Oliva [193, 192, 146, 147] study the image composition as
a function of viewing distance. The spatial lay-out of a scene is termed the
spatial envelope and is a function of the viewing distance. Considering images
captured by a human (i.e. from a human view point at ground position), the possible angles from which an object can be captured are determined by the viewing distance. Small objects captured from a small distance can be
captured from almost all angles, while large objects captured from a large
distance can be captured from very few angles. A cup on a table can be
captured from almost all angles, while a house captured from 200 meters can
be captured from a few angles, e.g. we cannot see the house from above unless
flying. The spatial lay-out of a scene captured at a large viewing distance
is rather fixed; the sky occupies the top, buildings, forests and mountains
occupy the middle, while the roads and lawns occupy the lower part of the
image. The spatial envelope is constrained by the viewing distance. Torralba
and Oliva show that the spatial envelope, and thereby the viewing distance,
has a large influence on the η in the power spectra power law for natural
images. They show that by estimating η on individual images, the distance
to the main object in the scene can be estimated. They used a model where
η depends on the angle and estimate ηθ using different angles.
Three classical results from natural image statistics are studied, using
the new image sequence database containing images of the same scene captured at different scales. The power spectra power law (scale invariance),
the Laplacian distribution of the partial derivatives and the distribution of homogenous regions (the size power law) are studied. We are facing the question: how much of the visual appearance of the image can be explained by the statistical properties of the image? How does the estimation relate to
geometric structure and texture in the image, and to the viewing distance?
In general, images captured from a large distance - more than 100 meters
- mainly contain geometric structure. At large viewing distances, the sky
is present, which often is a very large geometric structure. Furthermore,
buildings, roads, lawns and trees appear as rather uniformly colored regions
viewed from a large distance, i.e. geometric structure. This is confirmed by
the statistical estimations: η in the power spectra power law is rather small,
which indicates that intensities are more correlated. Estimation of α in the
generalized Laplacian distribution is small, which indicates a sharp peak at
zero, but also large values which correspond to object boundaries. Estimation
of α in the size power law of homogenous regions is small, which indicates the
presence of larger homogenous regions. Images mainly containing geometric
structure can be characterized as follows: the intensities are more correlated,
they contain larger homogenous regions and the partial derivatives are in
general small inside the regions, but rather large on the object boundaries.
In general, images captured from a small distance - less than 20 meters - mainly contain texture. Trees captured from less than 100 meters also mainly contain texture. At small viewing distances the sky is absent and the small
scale details on the object in the scene have been brought out. At such a small
distance, details on trees, bushes and lawns are brought out. This is, again,
confirmed by the statistics: estimation of η in the power spectra power law
is large, which indicates that the intensities are less correlated. Estimation
of α in the generalized Laplacian distribution is rather large and estimation
of α in the size distribution is large. Images mainly containing texture can
be characterized by: the intensities are less correlated, they contain smaller
homogenous regions and the distribution of partial derivatives is less peaked
at zero.
In order to estimate the image content in terms of geometric structure
and texture, an approximation approach is proposed. The approximation
approach, again, relates to previous work by Wu et al. [205], where they argue
that texture cannot be represented sparsely, while geometric structure can.
The approximation approach can be viewed as reversing the argumentation:
if the image content can be represented sparsely, then it contains geometric
structure. And if the image content cannot be represented sparsely, then it contains texture. The truncated singular value decomposition is used and
the rank of a good approximation is used as the complexity measure. The
rank of the approximation is the number of basis vectors required for a good
approximation.
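A small sketch of how such a measure can be computed; the patch size and error level are illustrative, the function names are hypothetical, and the 2-norm variant is used, where by the Eckart-Young theorem the best rank-k approximation error equals the (k+1)-th singular value.

import numpy as np

def svri(patch, sigma_err):
    """Singular Value Reconstruction Index: the smallest rank k such that the
    truncated SVD approximates the patch with 2-norm error below sigma_err.
    The best rank-k error is the (k+1)-th singular value, i.e. s[k]."""
    s = np.linalg.svd(patch, compute_uv=False)
    below = np.nonzero(s < sigma_err)[0]
    return int(below[0]) if below.size else len(s)   # full rank needed

def image_complexity(image, patch_size=16, sigma_err=1.0):
    """Average SVRI at level sigma_err over non-overlapping patches."""
    h, w = image.shape
    ranks = [svri(image[i:i + patch_size, j:j + patch_size], sigma_err)
             for i in range(0, h - patch_size + 1, patch_size)
             for j in range(0, w - patch_size + 1, patch_size)]
    return float(np.mean(ranks))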
The second approximation approach is based on image regularization in
the continuous domain. Image regularization can be viewed as an approximation of an image with a simpler one, often in a different subspace of functions.
Assuming that the observed image u0 is in L2, first order Tikhonov regularization and linear Gaussian scale space map the image into a Sobolev space, while TV maps it into the space of functions of bounded variation (BV).
Buades et al. [27] introduced the 'Method Noise' for evaluating denoising methods. The 'Method Noise' is the image residual and should, after denoising, solely contain the noise. By analyzing the content of the residual, the
performance of the denoising method can be characterized. The proposed
’Method Noise’ evaluation method is, in some sense, the complementary
problem. Instead of evaluating and characterizing the method, our aim is to
characterize the image content by the content in the image residual.
In the residual norm study, conclusions in the case of first order Tikhonov
regularization and linear Gaussian scale space are made by analytically proving the properties. In the TV case, the conclusion is based on experiments.
Finding a closed form expression for the residual norm and the derivative of
the residual norm with respect to λ would be very rewarding. As it seems
from the experiments, the residual norm has points of high curvature at a
scale for which structure of a certain size is totally removed. The distribution
of such a point of high curvature would be very important for describing the
image content. It would also be very useful for optimal parameter selection.
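For first order Tikhonov regularization such an expression is in fact available on a periodic grid, where the minimizer of ‖u − u0‖² + λ‖∇u‖² has the Fourier form û(ω) = û0(ω)/(1 + λ|ω|²). The following sketch, under that periodic-boundary assumption and with hypothetical names, scans the squared residual norm as a function of λ:

import numpy as np

def tikhonov_residual_norm(image, lambdas):
    """Squared L2 norm of the residual u0 - u_lambda for first order
    Tikhonov regularization, computed in the Fourier domain where the
    minimizer is u_hat = u0_hat / (1 + lambda * |omega|^2)."""
    f = np.fft.fft2(image)
    wy = 2 * np.pi * np.fft.fftfreq(image.shape[0])
    wx = 2 * np.pi * np.fft.fftfreq(image.shape[1])
    w2 = wy[:, None] ** 2 + wx[None, :] ** 2       # |omega|^2 on the grid
    power = np.abs(f) ** 2 / f.size                # Parseval normalization
    return [float(np.sum(power * (lam * w2 / (1 + lam * w2)) ** 2))
            for lam in lambdas]

Scanning a logarithmic range of λ values with such a routine and locating the points of high curvature numerically is one way to pursue the parameter selection question raised above.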
Chapter 2
Image Inpainting by Cooling and Heating
This chapter contains a slightly re-formatted version of
David Gustavsson, Kim S. Pedersen, and Mads Nielsen.
Image inpainting by cooling and heating.
In Proceedings of Scandinavian Conference on Image Analysis (SCIA)
2007, 2007.
Image Inpainting by Cooling and Heating
David Gustavsson1 , Kim S. Pedersen2 , and Mads Nielsen2
1
IT University of Copenhagen
Rued Langgaards Vej 7, DK-2300 Copenhagen S, Denmark
[email protected]
2
DIKU, University of Copenhagen
Universitetsparken 1, DK-2100 Copenhagen Ø, Denmark
{kimstp,madsn}@diku.dk
Abstract
We discuss a method suitable for inpainting both large scale geometric structures and stochastic texture components. We use the well-known FRAME
model for inpainting. We introduce a temperature term in the learnt FRAME
Gibbs distribution. By using a fast cooling scheme a MAP-like solution is
found that can reconstruct the geometric structure. In a second step a heating scheme is used that reconstructs the stochastic texture. Both steps in
the reconstruction process are necessary, and contribute in two very different
ways to the appearance of the reconstruction.
Keywords: Inpainting, FRAME, ICM, MAP, Simulated Annealing
2.1 Introduction
Image inpainting concerns the problem of reconstruction of the image contents inside a region Ω with unknown or damaged contents. We assume that
Ω is a subset of the image domain D ⊆ R2, Ω ⊂ D, and we will for this paper assume that D forms a discrete lattice. The reconstruction is based
on the available surrounding image content. Some algorithms have reported
excellent performance for pure geometric structures (see e.g. [39] for a review
of such methods), while others have reported excellent performance for pure
textures (e.g. [21, 51, 52]), but only a few methods [17] achieve good results
on both types of structures.
The variational approaches have been shown to be very successful for
geometric structures but have a tendency to produce a too smooth solution
without fine scale texture (see [39] for a review). Bertalmio et al. [17] propose
a combined method in which the image is decomposed into a structure part
and a texture part, and different methods are used for filling the different
parts. The structure part is reconstructed using a variational method and
the texture part is reconstructed by image patch pasting.
Synthesis of a texture and inpainting of a texture seem to be, more or less, identical problems; however, they are not. In [84] we propose a two
step method for inpainting based on Zhu, Wu and Mumford’s stochastic
FRAME model (Filters, Random fields and Maximum Entropy) [214, 213].
Using FRAME naively for inpainting does not produce good results and more sophisticated strategies are needed; in [84] we propose such a strategy. By
adding a temperature term T to the learnt Gibbs distribution and sampling
from it using two different temperatures, both the geometric and the texture
component can be reconstructed. In a first step, the geometric structure
is reconstructed by sampling using a cooled - i.e. using a small fixed T distribution. In a second step, the stochastic texture component is added by
sampling from a heated - i.e. using a large fixed T - distribution.
Ideally we want to use the MAP solution of the FRAME model to reconstruct geometric structure of the damaged region Ω. In [84] we use a
fixed low temperature to find a MAP-like solution in order to reconstruct the geometric structure. To find the exact MAP solution one must use the time consuming simulated annealing approach, as described by Geman and Geman [69]. However, to reconstruct the missing contents of the region
Ω, the true MAP solution may not be needed. Instead a solution which is
close to the MAP solution may provide visually good enough results. In
this paper we propose a fast cooling scheme that reconstructs the geometric
structure and approaches the MAP solution. Another approach is to use the
solution produced by the Iterated Conditional Modes (ICM) algorithm (see
e.g. [201]) for reconstruction of the geometric structure. Finding the ICM
solution is much faster than our fast cooling scheme; however, it often fails to reconstruct the geometric structure. This is among other things caused by the ICM solution's strong dependence on the initialisation of the algorithm.
We compare experimentally the fast cooling solution with the ICM solution.
To reconstruct the stochastic texture component the Gibbs distribution is
heated. By heating the Gibbs distribution more stochastic texture structures
will be reconstructed without destroying the geometric structure that was
reconstructed in the cooling step. In [84] we use a fixed temperature to find
a solution including the texture component. Here we introduce a gradual
heating scheme.
The paper has the following structure. In section 2.2 FRAME is reviewed,
in section 2.2.1 filter selection is discussed and in section 2.2.2 we explain how
FRAME is used for reconstruction. Inpainting using FRAME is treated in
section 2.3. In section 2.3.1 a temperature term is added to the Gibbs distribution, and the ICM solution and the fast cooling solution are discussed in sections 2.3.2 and 2.3.3. Adding the texture component by heating the distribution is discussed in section 2.3.4. In section 2.4 experimental results are presented and in section 2.5 conclusions are drawn and future work is discussed.
2.2 Review of FRAME
FRAME is a well known method for analysing and reproducing textures
[213, 214]. FRAME can also be thought of as a general image model under the assumption that the image distribution is stationary. FRAME constructs a
probability distribution p(I) for a texture from observed sample images.
Given a set of filters F α (I) one computes the histogram of the filter responses H α with respect to the filter α. The filter histograms are estimates
of marginal distributions of the full probability distribution p(I). Given the
marginal distributions for the sample images one wants to find all distributions that have the same expected marginal distributions, and among those
find the distribution with maximum entropy, i.e. by applying the maximum
entropy principle. This distribution is the least committed distribution fulfilling the constraints given by the marginal distributions. This is a constrained
optimisation problem that can be solved using Lagrange multipliers. The
solution is
    p(I) = (1/Z(Λ)) exp{ − Σ_α Σ_i λ_i^α H_i^α }    (2.1)
Here i indexes the histogram bins in H^α for the filter α and Λ = {λ_i^α} are the Lagrange multipliers, which give information on how the different response values for the filter α should be distributed. The relation between the λ^α's for different filters F^α gives information on how the filters are weighted relative to each other.
An algorithm for finding the distribution and Λ can be found in [214].
FRAME is a generative model and given the distribution p(I) for a texture
it can be used for inference (analysis) and synthesis.
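As a concrete reading of (2.1), the sketch below evaluates the unnormalized FRAME energy Σ_α Σ_i λ_i^α H_i^α for an image, from which p(I) ∝ exp(−energy); the learned multipliers Λ are assumed given (the learning loop of [214] is not reproduced), and the shared bin edges and function names are illustrative assumptions.

import numpy as np
from scipy.signal import convolve2d

def frame_energy(image, filters, lambdas, bin_edges):
    """Unnormalized FRAME energy sum_alpha sum_i lambda_i^alpha H_i^alpha,
    so that p(I) is proportional to exp(-energy). `filters` is a list of 2D
    kernels, `lambdas` a matching list of per-bin multiplier vectors, and
    `bin_edges` the histogram bin boundaries (here shared by all filters)."""
    energy = 0.0
    for kernel, lam in zip(filters, lambdas):
        response = convolve2d(image, kernel, mode='same', boundary='symm')
        hist, _ = np.histogram(response, bins=bin_edges)
        hist = hist / hist.sum()          # normalized marginal H^alpha
        energy += float(np.dot(lam, hist))
    return energy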
2.2.1 The Choice of Filter Bank
We have used three types of filters in our experiments: The delta filter, the
power of Gabor filters and Scale Space derivative filters. The delta, Scale
Space derivative and Gabor filters are linear filters, hence F α (I) = I ∗ F α ,
where ∗ denotes convolution. The power of the Gabor filter is the squared
magnitude applied to the linear Gabor filter.
The Filters F α are:
• Delta filter - given by the Dirac delta δ(x) which simply returns the
intensity at the filter position.
• The power of Gabor filters - defined by |I ∗ Gσ e^{−iωx}|², where i² = −1. Here we use 8 orientations, ω = 0, π/4, π/2, 3π/4, π, 5π/4, 3π/2, 7π/4, and 2 scales σ = 1, 4; in total 16 Gabor filters have been used.
• Scale space derivatives - using 3 scales σ = 0.1, 1, 3 and 6 derivatives: Gσ, ∂Gσ/∂x, ∂Gσ/∂y, ∂²Gσ/∂x², ∂²Gσ/∂y², ∂²Gσ/∂x∂y.
For both the Gabor and scale space derivative filters the Gaussian aperture
function Gσ with standard deviation σ defining the spatial scale is used,
    Gσ(x, y) = (1/(2πσ²)) exp( −(x² + y²)/(2σ²) ) .
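The kernels above can be sampled on a discrete grid as in the following sketch; the 3σ kernel radius is a common truncation choice, and treating the listed ω values as orientation angles with a separate spatial frequency parameter is an assumption made here for illustration.

import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Gaussian aperture G_sigma sampled on a (2r+1) x (2r+1) grid."""
    r = int(np.ceil(3 * sigma)) if radius is None else radius
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    return np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)

def gaussian_dx(sigma):
    """First order scale space derivative filter dG_sigma/dx."""
    r = int(np.ceil(3 * sigma))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    return -x / sigma**2 * gaussian_kernel(sigma, r)

def gabor_kernel(sigma, theta, freq=0.5):
    """Complex Gabor kernel G_sigma * exp(-i * freq * x_theta); the 'power
    of Gabor' response is |I * kernel|^2. Here theta is the orientation and
    freq the spatial frequency (an illustrative parametrization)."""
    r = int(np.ceil(3 * sigma))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    x_theta = x * np.cos(theta) + y * np.sin(theta)
    return gaussian_kernel(sigma, r) * np.exp(-1j * freq * x_theta)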
Which and how many filters should be used has a large influence on the type of image that can be modelled. The filters must catch the important
visual appearance of the image at different scales. The support of the filters
determines a Markov neighbourhood. Small filters add fine scale properties of
the image, while large filters add coarse scale properties of the image. Hence
to model properties at different scales, different filter sizes must be used. The
drawback of using large filters is that the computation time increases with
the filter size. On the other hand large filters must be used to catch coarse
scale dependencies in the image.
Gabor filters are orientation sensitive and have been used for analysing
textures in a number of papers and are in general suitable for textures (e.g.
[20, 100]). By carefully selecting the orientation ω and the scale σ, structures
with different orientations and scales will be captured.
It is well known from scale space theory that scale space derivative filters
capture structures at different scales. By increasing σ in the Gaussian kernel,
finer details are suppressed, while coarse structures are enhanced. By using
the full scale-space both fine and coarse scale structures will be captured
[188].
2.2.2 Sampling
Once the distribution p(I) is learnt, it is possible to use a Gibbs sampler to
synthesise images from p(I). I is initialised randomly (or in some other way
based on prior knowledge). Then a site (x, y)i ∈ D is randomly picked and
the intensity Ii = I((x, y)i) at (x, y)i is updated according to the conditional
distribution [123, 201]
    p(Ii | I−i)    (2.2)
where the notation I−i denotes the set of intensities at the set of sites
{(x, y)−i} = D\(x, y)i. Hence p(Ii | I−i) is the probability of the different intensities at site (x, y)i given the intensities in the rest of the image. Because of the equivalence between Gibbs distributions and Markov Random
Fields given a neighbourhood system N (the Hammersley-Clifford theorem,
see e.g. [201]), we can make the simplification
    p(Ii | I−i) = p(Ii | INi)    (2.3)
where Ni ⊂ D\(x, y)i is the neighbourhood of (x, y)i. In the FRAME model,
the neighbourhood system N is defined by the extent of the filters F α.
By sampling from the conditional distribution in (2.3), I will be a sample
from the distribution p(I).
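A sketch of one such sweep; site_conditional is a hypothetical placeholder for the FRAME conditional p(Ii | INi), which in practice is evaluated from the change in the filter histograms when site i takes each gray level.

import numpy as np

def gibbs_sweep(image, site_conditional, levels, rng):
    """One random-order sweep of a single-site Gibbs sampler over the
    lattice. site_conditional(image, y, x) must return unnormalized
    conditional probabilities p(I_i = v | I_{N_i}) for each gray level v."""
    h, w = image.shape
    for idx in rng.permutation(h * w):
        y, x = divmod(int(idx), w)
        p = np.asarray(site_conditional(image, y, x), dtype=float)
        image[y, x] = rng.choice(levels, p=p / p.sum())
    return image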
2.3 Using FRAME for inpainting
We can use FRAME for inpainting by first constructing a model p(I) of the
image, e.g. by learning from the non-damaged part of the image, D\Ω. We
then use the learnt model p(I) to sample new content inside the damaged
region Ω. This is done by only updating sites in Ω. A site (x, y)i ∈ Ω is
randomly picked and updated by sampling from the conditional distribution
given in (2.3). If the site (x, y)i is close (in terms of filter size) to the boundary ∂Ω of the damaged region, then the filters get support from sites both inside and outside Ω. The sites outside Ω are known and fixed, and act as
boundary conditions for the inpainting. We therefore include a small band
region around Ω in the computation of the histograms H^α. Another option would have been to use the whole image I to compute the histogram H^α; however, this has the downside that the effect of updates inside Ω on the histograms is dependent on the relative size ratio between Ω and D, causing a slow convergence rate for small Ω.
2.3.1 Adding a temperature term β = 1/T
Sampling from the distribution p(I) using a Gibbs sampler does not easily
enforce the large scale geometric structure in the image. By using the Gibbs sampler one will get a sample from the distribution; this includes both the stochastic and the geometric structure of the image; however, the stochastic structure will dominate the result.
Adding an inverse temperature term β = 1/T to the distribution gives

    p(I) = (1/Z(Λ)) exp{ −β Σ_α Σ_i λ_i^α H_i^α } .    (2.4)
In [84] we proposed a two step method to reconstruct both the geometric
and stochastic part of the missing region Ω:
1. Cooling: By sampling from (2.4) using a fixed small temperature T
value, structures with high probability will be reconstructed, while
structures with low probability will be suppressed. In this step large
geometric structures will be reconstructed based on the model p(I).
2. Heating: By sampling from (2.4) using a fixed temperature T ≈ 1,
the texture component of the image will be reconstructed based on the
model p(I).
In the first step the geometric structure is reconstructed by finding a
smooth MAP-like solution and in the second step the texture component is
reconstructed by adding it to the large scale geometry.
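In terms of the site-wise sampler, the inverse temperature simply rescales the conditional distribution, p_β ∝ p^β. A minimal sketch (hypothetical name):

import numpy as np

def tempered_conditional(log_p, beta):
    """Rescale a site conditional by the inverse temperature: p_beta is
    proportional to p^beta. beta > 1 cools the distribution (mass
    concentrates on likely levels, bringing out geometric structure),
    beta near 1 reproduces the learnt texture, and beta approaching
    infinity degenerates to the argmax used by ICM below."""
    z = beta * (log_p - np.max(log_p))   # subtract the max for stability
    w = np.exp(z)
    return w / w.sum()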
In this paper we propose a novel variation of the above discussed method.
We consider two cooling schemes and a gradual heating scheme which can
be considered as the inverse of simulated annealing.
2.3.2 Cooling - the ICM solution
Finding the MAP solution by simulated annealing is very time consuming.
One alternative method is the Iterated Conditional Modes (ICM) algorithm.
By letting T → 0 (or equivalently letting β → ∞) the conditional distribution
(2.3) will become a point distribution. In each step of the Gibbs sampling
one will set the new intensity for a site (x, y)i to
    Ii^new = arg max_{Ii} p(Ii | INi) .    (2.5)
This is a site-wise MAP solution (i.e. in each site and in each step the
most likely intensity will be selected). This site-wise greedy strategy is not
Figure 2.1: From top left to bottom right: a) the image containing a damaged
region b) the ICM solution c) the fast cooling solution d) adding texture on
top of the fast cooling solution by heating the distribution e) total variation
(TV) solution and f) the reconstructed region in context (can you find it?).
guaranteed to find the global MAP solution for the full image. The ICM
solution is similar but not identical to the high β sampling step described in
[86]. The ICM solution depends on initialisation of the unknown region Ω.
Here we initialise by sampling pixel values identically and independently from a uniform distribution on the intensity range.
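A sketch of one ICM sweep, reusing the hypothetical site_conditional placeholder from the sampler sketch above:

import numpy as np

def icm_sweep(image, site_conditional, levels, sites):
    """One Iterated Conditional Modes sweep: every visited site greedily
    takes the conditional mode (2.5). The outcome depends on the
    initialisation of the unknown region and on the visiting order
    given by `sites` (an iterable of (y, x) coordinates)."""
    for y, x in sites:
        p = np.asarray(site_conditional(image, y, x))
        image[y, x] = levels[int(np.argmax(p))]
    return image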
2.3.3 Cooling - Fast cooling solution
The MAP solution for the inpainting is the most likely reconstruction given
the known part of the image D\Ω,
    I^MAP = arg max_{Ii ∀(x,y)i ∈ Ω} p(I | I(D\Ω), Λ) .    (2.6)
Simulated annealing can be used for finding the MAP solution: replace β in (2.4) with an increasing (decreasing) sequence βn, called a cooling (heating) scheme. Using simulated annealing one starts to sample using a high temperature T and slowly cools down the distribution (2.4) by letting T → 0. If βn increases slowly enough, then as n → ∞ simulated annealing will find the MAP solution (see e.g. [69, 201, 123]). Unfortunately simulated annealing is very time consuming.
To reconstruct Ω, the true MAP solution may not be needed; instead, a solution which is close to the MAP solution may be enough. We therefore
adopt a fast cooling scheme that does not guarantee the MAP solution. The
goal is to reconstruct the geometric structure of the image and suppress the
stochastic texture.
The fast cooling scheme used in this paper is defined as (in terms of β)
    β_{n+1} = C⁺ · βn    (2.7)

where C⁺ > 1.0 and β0 = 0.5.
2.3.4 Heating - Adding texture
The geometric structures of the image will be reconstructed by sampling
using the cooling scheme. Unfortunately the visual appearance will be too
smooth, and the stochastic part of the image needs to be added.
The stochastic part should be added in such a way that it does not destroy
the large scale geometric part reconstructed in the previous step. This is done
by sampling from the distribution (2.4) using a heating scheme similar to the
cooling scheme presented in the previous section and using the solution from the
cooling scheme as initialisation.
The heating scheme in this paper is
    β_{n+1} = C⁻ · βn    (2.8)

where C⁻ < 1.0 and β0 = 25.
By using a decreasing βn value, finer details in the texture will be reproduced, while coarser details in the texture will be suppressed.
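Both geometric schedules are easy to generate; in this sketch the parameter values are those used in the experiments of section 2.4, and the function names are illustrative.

def cooling_schedule(beta0=0.5, c_plus=1.2, beta_max=25.0):
    """Fast cooling (2.7): beta_{n+1} = C+ * beta_n, stopping once beta
    exceeds beta_max."""
    beta = beta0
    while beta <= beta_max:
        yield beta
        beta *= c_plus

def heating_schedule(beta0=25.0, c_minus=0.8, beta_min=1.0):
    """Gradual heating (2.8): beta_{n+1} = C- * beta_n, stopping once beta
    falls below beta_min."""
    beta = beta0
    while beta >= beta_min:
        yield beta
        beta *= c_minus

A Gibbs sweep over Ω would be run at each β value, first through the cooling schedule and then, initialised with the cooled solution, through the heating schedule.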
2.4 Results
Learning the FRAME model p(I) is computationally expensive, therefore only small image patches have been used. Even for small image patches the optimisation times are at least a few days. After the FRAME model has been learnt, inpainting can be done relatively fast if Ω is not too large.
The dynamic range of the images has been decreased to 11 intensity levels for computational reasons. The images that have been selected include both large scale geometric structures as well as texture.
Figure 2.2: From top left to bottom right: a) the image containing a damaged
region b) the ICM solution c) the fast cooling solution d) adding texture on
top of the fast cooling solution by heating the distribution e) total variation
(TV) solution and f) the reconstructed region in context (can you find it?).
The delta filter, 16 Gabor filters and 18 scale space derivative filters have
been used in all experiments and 11 histogram bins have been used for all
filters (see section 2.2.1 for a discussion).
In the cooling scheme (2.7), we use β0 = 0.5, C⁺ = 1.2 and the stopping criterion βn > 25 in all experiments. In the heating scheme (2.8), we use β0 = 25, C⁻ = 0.8 and the stopping criterion βn < 1.0.
Each figure contains an unknown region Ω of size 30 × 30 that should be
reconstructed. Figure 2.1 contains corduroy images, figure 2.2 contains birch
bark images and figure 2.3 wood images. Each figure contains the original
image with the damaged region Ω with initial noise, the ICM and fast cooling
solutions and the solution of a total variation (TV) based approach [39] for
comparison.
The ICM solution reconstructs the geometric structure in the corduroy,
but fails to reconstruct the geometric structure in both the birch and the
wood images. This is due to the local update strategy of ICM, which makes
it very sensitive to initial conditions. If ICM starts to produce wrong large
Figure 2.3: From top left to bottom right: a) the image containing a damaged
region b) the ICM solution c) the fast cooling solution d) adding texture on
top of the fast cooling solution by heating the distribution e) total variation
(TV) solution and f) the reconstructed region in context (can you find it?).
scale geometric structures it will never recover.
The fast cooling solution on the other hand seems to reconstruct the geometric structure in all examples and does an even better job than the ICM
solution for the corduroy image. The fast cooling solutions are smooth and
have suppressed the stochastic textures. Because of the failure of ICM we
only include results on heating based on the fast cooling solution.
The results after heating - image d) - are less smooth inside Ω, but still smoother than I\Ω. The total variation (TV) approach produces a too smooth solution even though strong geometric structures are present in all examples.
2.5 Conclusion
Using FRAME to learn a probability distribution for a type of images gives
a Gibbs distribution. The boundary condition makes it hard to use the
learnt Gibbs distribution as it is for inpainting; it does not enforce large scale
geometric structures strongly enough. By using a fast cooling scheme a MAP-like solution is found that reconstructs the geometric structure. Unfortunately
this solution is too smooth and does not contain the stochastic texture. The
stochastic texture component can be reproduced by sampling using a heating
scheme. The heating scheme adds the stochastic texture component to the reconstruction and decreases the smoothness of the reconstruction based on
the fast cooling solution.
A possible continuation of this approach is to replace the MAP-like step
with a partial differential equation based method and a natural choice is the
Gibbs Reaction And Diffusion Equations (GRADE) [212, 211], which are built on the FRAME model.
We decompose an image into a geometric component and a stochastic
component and use the decomposition for inpainting. This is related to
Meyer's [8, 132] image decomposition into a smooth component and an oscillating component (belonging to different function spaces). We find it interesting to explore this theoretical connection with variational approaches.
Acknowledgements
This work was supported by the Marie Curie Research Training Network:
Visiontrain (MRTN-CT-2004-005439).
Chapter 3
A Multi-Scale Study of the Distribution of Geometry and Texture in Natural Images
A Multi-Scale Study of the Distribution of Geometry and
Texture in Natural Images
David Gustavsson, Kim S. Pedersen, and Mads Nielsen
DIKU, University of Copenhagen
Universitetsparken 1, DK-2100 Copenhagen Ø, Denmark
{davidg,kimstp,madsn}@diku.dk
Abstract
A new image database containing an ensemble of image sequences is presented. Each sequence contains 15 images of the same scene, captured at
different viewing distances, termed capture scales. The scenes contain both
nature and man-made structure, and the images are captured from a ’normal’ human point-of-view. The part of the scene present at all capture scales
has been extracted, resulting in sequences of images of increasing resolution
with the same content.
Classical results from natural image statistics - scale invariance, the Laplacian distribution of the partial derivatives and the size distribution of homogenous regions - are verified and analyzed on the database.
The classical natural image statistics are also estimated on individual images. The estimation on individual images can explain the visual appearance
in terms of geometric structure and texture to some degree. We argue that
estimation on individual images depends on the viewing distance in two different ways: the spatial lay-out of the scene and the suppression of details
(inner scale). Images captured from a human point of view at a large viewing distance contain the sky on top, houses or forests in the vertical middle
and lawns or roads in the lower part. The spatial layout is constrained by
the viewing distance. The sky, buildings, lawns and forests appear as rather
uniformly colored regions viewed from a large distance. This is because the
inner scale is too large to bring out the texture at such a distance.
Keywords: natural images, scale space, geometric structure, texture, scale
invariance, power law, generalized Laplacian distribution, area distribution
3.1 Introduction
Images contain different types of information, from highly stochastic texture
such as grass and fur to highly geometric structures, such as houses and
cars. Furthermore, most images contain a mix of geometric structures and
stochastic textures. The image content does not solely depend on the objects
in the captured scene, but also on the scale that it was captured at. The
same object, captured at different scales, will have a different appearance. For example, a tree viewed from 5 meters is very different from the same tree viewed from 100 meters. At a coarse scale, finer details - such as the leaves - are suppressed while the coarse scale structures - the tree top and the trunk - are brought out. At a finer scale, the coarse scale geometric structures
are suppressed while the finer scale details are brought out. A coarse scale
representation of an image can be generated artificially using linear Gaussian scale space by convolving the original image with a Gaussian function [98, 202, 111, 122, 60]. Generating a coarse scale representation of an image using linear Gaussian scale space will increase the effective inner-scale - small details are suppressed - while keeping the resolution. Similarly, increasing
the viewing distance increases the inner-scale: smaller details will not be
captured, and the resolution will decrease. Scale space does not model the
statistical changes of the image content when the viewing distance is altered.
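A coarse scale representation in this sense can be generated, for example, with a standard Gaussian filter; in the sketch below the optional subsampling step mimics the loss of resolution that an increased viewing distance additionally causes (names are illustrative):

import numpy as np
from scipy.ndimage import gaussian_filter

def coarse_scale(image, sigma, subsample=1):
    """Coarse scale representation: Gaussian smoothing increases the
    effective inner scale; optional subsampling additionally reduces the
    resolution, mimicking the two processes at work when the viewing
    distance grows."""
    smoothed = gaussian_filter(np.asarray(image, dtype=float), sigma)
    return smoothed[::subsample, ::subsample]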
For specific types of objects - such as houses or trees - how do the statistical properties change as a function of viewing distance? By capturing the same part of a scene at different viewing distances, statistical changes due to altering the inner-scale and resolution can be analyzed. Wu et al. [205] studied the 'entropy rate' and 'inferential uncertainty' as a function of viewing
distance.
Changing the viewing distance will change the inner-scale; it will also
have an effect on the outer scale of the image. Changing the viewing distance,
either by moving the camera or adjusting the focal length in the objective,
will alter the composition of the captured scene. Torralba and Oliva [146, 147]
called the spatial lay-out of an image the spatial envelope, and they showed
that it can be used for determining the distance to the main objects in the
scene [192]. Here we only consider natural images, captured from a human
point of view, i.e. underwater and bird views are excluded. The distance
to the object in the scene also puts hard constraints on the possible views
that the object can be seen from. A cup on a table can be viewed from
almost all angles, a car on the street can be viewed from many angles, while
a large building can be viewed from a few angles. How does the image
statistics change when the viewing distance changes? Capturing the same
scene at different viewing distances (and different outer-scale) may reveal
statical changes due to changes in the spatial envelope.
The statistical change as a function of the viewing distance due to changes
in the spatial envelope will be studied in this report.
The natural image database provided by van Hateren, also called the
natural stimuli collection, is one of the most widely used databases [197]. The
van Hateren database contains roughly 4000 images of resolution 1024×1536.
The database contains scenes captured at different scales, but it does not
contain images of the same scene captured at different scales.
The KTH-TIPS2 database contains 11 materials captured under different
illumination and scales ([63]). The database contains texture captured at
different scales.
To be able to analyze how the image content changes when the same scene has been captured at different scales, a first step is to collect a new image database. A new database containing a rich variety of scenes and distances is introduced. The database contains natural scenes - both man-made
and natural environments - captured at 15 different ’scales’. The viewing distance is altered by adjusting the focal length and the different focal lengths
are called capture scales. The database and the collection procedure are
presented in section 3.2.
Classical statistical properties, found in ensembles of natural images, are computed on the database and the results are compared with previously reported results. The statistics are also computed per capture scale - i.e. estimated using images with the same capture scale. In section 3.4.1, the apparent scale invariance and power spectra power law found in ensembles of natural images are discussed. The distribution of partial derivatives computed on an ensemble of natural images can be modeled by a generalized Laplacian distribution, and is discussed in section 3.4.2. The distribution of homogenous regions in natural images, both computed on individual images and on ensembles of images, follows a power law in size, and is discussed in section 3.4.3.
3.2 Multi-Scale Geometry and Texture Image Database (MS-GTI DB)
Images or rather scenes are considered to be ’natural’ if they naturally appear
in everyday life, from a human point of view. The human point of view
excludes aerial and underwater images even if they in some sense are natural.
Scenes containing both man-made structures and natural environments have
been captured.
3.2.1 Collection procedure and equipment
The MS-GTI database contains images of the same scene captured at different scales. The camera that has been used is a Nikon D40x. The three different objectives that have been used are: 18-55 mm, 55-200 mm and 70-300 mm. The camera has been placed on a tripod stand facing the scene. A
region of interest in the scene, of such a size that it is present at all capture
scales, has been selected. The scene, with the region of interest approximately in the center, is captured at different scales by adjusting/changing
the objective. The scene is captured at 15 different scales; the focal length
varies from 18 mm to 300 mm - roughly 4 octaves and 16 times magnification.
Hence a 1 × 1 pixel region in the least zoomed image corresponds to a 16 × 16
region in the most zoomed image. The image resolution is 2592×3872 pixels.
The image content is of course determined by the distance from the camera to the scene. The distance between the camera and the scene varies, from a few meters to a distance of hundreds of meters - "panorama" distance.
    Objective     Focal Lengths                      Number of Capture Scales
    18-55 mm      18, 24, 35, 45                     4
    55-200 mm     55, 70, 85, 105, 135, 165, 200     7
    70-300 mm     225, 250, 275, 300                 4

Table 3.1: The three objectives used to collect the database, together with the focal lengths used for the objectives. 15 images have been collected for each scene, giving approximately 16x magnification.
The RAW format used by the D40x camera is Nikon's own 14-bit format NEF. The NEF images have been converted to 16-bit TIFF images; each TIFF image is 60 MB. The images in a sequence are indexed from the least
zoomed image I1 (smallest focal length) to the most zoomed image I15 , i.e. in
increasing zoom order. This index is the capture scale used in the following
sections. Increasing the capture scale corresponds to decreasing the viewing distance and decreasing the inner scale. The capture scale simply denotes
the numbering of the focal length used.
Table 3.1 describes the objectives used and table 3.2 shows the focal length for each of the 15 images of a sequence. Examples can be found in figure 3.1.
3.2.2 The different Scenes
The scenes selected for the database are mostly natural images containing
both man-made environments - mostly buildings - and nature - trees, tree
trunks and bushes. In many cases the same types of scenes have been captured but with different distances between the camera and the scene, which changed the captured image contents.
Figure 3.1: Example of captured scenes. The columns contain from left to
right: the least zoomed image I1 - 18 mm, I3 - 35 mm, I6 - 70 mm, I10 - 165 mm, and the most zoomed image I15 - 300 mm. Row 1 (IS 1), 2 (IS 4) and
7 (IS 18 ) contain man-made environments, 4 (IS 8 ) and 9 (IS 31 ) contain a
mixture of nature and man-made environments, and 3 (IS 6 ), 5 (IS 10 ), 6
(IS 11 ) and 8 (IS 28 ) contain nature environments. The distance between the
main objects in the scene and the camera varies between the scenes - from a
few meters to ”panorama” distance. This gives a large variation in distance
and image contents at all scales (zoom) - rows 7 and 9 contain a large portion
of the sky even in the most zoomed images and row 8 contains small scale
texture in the least zoomed image.
    Image   Objective   Focal length        Image   Objective   Focal length
    I1      18-55       18                  I9      55-200      135
    I2      18-55       24                  I10     55-200      165
    I3      18-55       35                  I11     55-200      200
    I4      18-55       45                  I12     70-300      225
    I5      55-200      55                  I13     70-300      250
    I6      55-200      70                  I14     70-300      275
    I7      55-200      85                  I15     70-300      300
    I8      55-200      105
Table 3.2: Summary of the objectives and focal lengths used for collecting
the images in the sequences.
The depth of field is the portion of a scene that appears to be sharp in the image. A lens (objective) can only be focused on one specific distance (the focal plane), yet objects in the scene at distances close to the focal plane still appear sharp. The depth of field is the distance range within which objects in the scene are in acceptable focus. The depth of field varies with the objective; it is usually larger for normal objectives and smaller for zoom objectives.
If objects in a scene captured with a zoom objective appear at different distances, some of the objects will be out of focus. For example, in a close-up picture of a shrub, the distance between the camera and the individual twigs varies considerably. This results in an image where some of the twigs are in focus while others are out of focus. Examples of scenes are presented in figure 3.1. Scenes containing man-made structure - rows 1 and 2 - are often planar in the most zoomed image, so all objects in the scene lie within the depth of field. Nature images - row 3 - often contain objects at varying distances to the camera, so some objects fall outside the depth of field and are out of focus.
3.2.3 Region extraction
The part of the scene captured by the camera that is present at all capture scales has been extracted, resulting in sequences of images of different resolutions containing the same part of the scene at different capture scales. The resolutions of the different regions range from 2592 × 3872 to 160 × 240 and are summarized in table 3.3.
The regions are extracted by registering the most zoomed image I15 to all of the other images. This is a very challenging registration problem because
Figure 3.2: The figure contains 80 × 80 patches extracted from three different image sequences at different scales. Column 1 is extracted from I1 (least zoomed), column 2 from I3, column 3 from I6, column 4 from I10 and column 5 from I15. The image contents at the different scales are very different even if the captured object is the same. The first row contains part of a brick wall (row two in figure 3.1) - at the coarse scale the brick wall appears as texture that transforms into bricks at finer scales. The second row contains a part of another brick wall, and its appearance at the different scales is similar to that of the first brick wall.
of the large range of scales. The problem has partly been solved using manual feature selection followed by affine registration, SIFT features [124] combined with RANSAC [59] for computing an affine registration, and manual registration.
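The registration pipeline is only named above; as an illustration, a minimal sketch of the SIFT-plus-RANSAC affine step using OpenCV (the matcher, threshold and function choices below are our assumptions, not the procedure actually used for the database):

```python
import cv2
import numpy as np

def register_affine(most_zoomed, less_zoomed):
    """Estimate an affine map from the most zoomed image into a less
    zoomed one from SIFT correspondences, with RANSAC outlier rejection."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(most_zoomed, None)
    kp2, des2 = sift.detectAndCompute(less_zoomed, None)

    # Match descriptors; cross-checking keeps only mutual best matches.
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC fits the affine model while rejecting outlier correspondences.
    A, inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                      ransacReprojThreshold=3.0)
    return A
```

The large scale gap between I15 and I1 is exactly what makes the matching hard; on some sequences an automatic step like this fails, which is why manual feature selection and manual registration were also needed.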
Region Id   Region Size      Region Id   Region Size
R14         2470 × 3690      R7          760 × 1170
R13         2230 × 3330      R6          620 × 950
R12         1980 × 2950      R5          480 × 740
R11         1740 × 2600      R4          380 × 580
R10         1520 × 2260      R3          300 × 460
R9          1200 × 1790      R2          200 × 310
R8          940 × 1440       R1          160 × 240
Table 3.3: The extracted regions and their resolutions, where Ri is extracted from Ii and R15 = I15.
3.2.4 Notation
The notation used for the sequences of images/regions is summarized in table 3.4. The captured full images are called 'images' and are denoted by I, sometimes with a lower index; the extracted parts of the images are called 'regions' and are denoted by R, sometimes with a lower index. The sequences of images are denoted IS^j_i and the sequences of regions RS^j_i, where the upper index indicates the sequence and the lower index indicates the image number (sometimes the indexes are omitted).
3.3 Point Operators and Scale Space
Comparing images containing the same part of a scene, captured at different scales by zooming, is a challenging problem that requires an understanding of the image formation process. A simple model [112, 80] and its relation to scale space are discussed.
Let S(r) be a scene, and let

G0(x, σ) = 1/(2πσ²) exp(−x²/(2σ²))    (3.1)

be a linear detector, called a point operator, with width σ. Applying the detector at a position will yield a point observation, and by applying the
Abbreviation    Meaning
Image (Ii)      A (full) image that has been captured
Region (Ri)     An extracted region from an image
Patch           A part of a region or image
IS              Image sequences, i.e. all images in the database
IS^j_i          Image i from sequence j
IS_i            All images numbered i, i.e. one image from every sequence
IS^j            All images from sequence j
RS              Region sequences, i.e. all regions in the database
RS^j_i          Region i from sequence j
RS_i            All regions numbered i, i.e. one region from each region sequence
RS^j            All regions in sequence j
Table 3.4: Summary of the terminology and notation used for the images in the database. 'Image' denotes the full captured image and 'region' denotes an extracted part of the image. The sequences of images are denoted IS and the sequences of regions RS; the upper index indicates the sequence number and the lower index the image/region number (both are sometimes omitted).
detector at several positions an image is obtained. Formally this can be written

I(x, σ) = G0(x, σ) ∗ S(r)    (3.2)

where ∗ denotes the convolution operator. σ is called the inner scale and denotes the size of the point operator. One may think of a point operator as measuring the light coming from a point in the scene, but that is of course not true, because zero-size (zero-scale) observations do not exist. Instead, the point operator should be viewed as a measurement over a small region, modeled with a Gaussian kernel, whose size is determined by σ. Thus σ is the spatial resolution of the point operator and sets the limit on the details that can be detected. The image captured by a point operator is always a "blurred version" of a point in the scene. By increasing σ, the spatial resolution decreases and the point becomes more "blurred".
Given an image captured using a fixed inner scale σ, images of lower spatial resolution can be studied using linear scale space (see e.g. [98, 111, 202, 122, 188]). Coarse scale representations of an image can be generated in linear Gaussian scale space by convolving the observed image with a Gaussian function.
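As a brief illustration, such coarse scale representations can be generated with a few lines of SciPy; this sketch is ours (the choice of scales is arbitrary):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def coarse_scale(image, sigma):
    """Coarse scale representation: convolution with a Gaussian of
    standard deviation sigma; larger sigma means a larger inner scale."""
    return gaussian_filter(image.astype(np.float64), sigma=sigma)

image = np.random.rand(256, 256)          # stand-in for an observed image
stack = [coarse_scale(image, s) for s in (1, 2, 4, 8)]
```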
Capturing a scene at a different scale means that σ in equation (3.2) has been changed by adjusting the objective (zooming). By adjusting the objective, the scene is captured using a different inner scale, and different levels of detail are suppressed. Furthermore, by changing the focal length, the sampling density is also altered: increasing the viewing distance increases σ in the point operator, and the sampling points become less dense. Wu et al. [205] use an image pyramid approach [32] for describing how the image content transforms when the viewing distance increases. They use 2 × 2 block averaging to increase the inner scale and subsampling to reduce the resolution.
3.4 Statistics of Natural Images
In the following sections some classical results regarding natural image statistics are reviewed and verified on the MS-GTI database. By comparing the classical results with the results from the MS-GTI database, the soundness of the images in the database is verified, and the classical results are verified on a new image database.
Most of the classical results in natural image statistics are based on empirical studies on large image databases (often van Hateren's [197]) containing natural images. The MS-GTI database contains an ensemble of sequences of images; each sequence contains the same scene captured at different scales. The statistics in the following sections have been computed after transforming the RGB images into gray value images.
3.4.1 Scale Invariance
One of the earliest results in the area of characterization of natural images is the (apparent) scale invariance [145, 166, 167, 168, 94, 196]. The scale invariance property was first formulated as: the power spectra of a large ensemble of natural images follow a power law

S(ω) = A / |ω|^(2−η)    (3.3)
where ω is the spatial frequency and A is a constant that depends on the overall contrast in the image. η is usually small; values close to 0.2 have been reported [196, 168, 94]. It should also be noted that η depends on the type of images: small image databases with specific contents - for example beaches and blue skies - may have η far from 0.2. Torralba and Oliva use the distribution of the power spectrum to estimate the distance between the scene and the camera in individual images [192] and to characterize the image in terms of man-made or nature environment [193].
The scale invariance property of natural images can also be expressed in the spatial domain using the correlation function (see [168]). The correlation function C(x), where x is the separation distance between two pixels in an image, is

C(x) = E(I(x0) I(x0 + x))    (3.4)

and it reveals how intensities are correlated based solely on the distance between them. The correlation is computed by considering all images in the ensemble, all initial positions x0 and all displacement vectors x. The power spectrum power law (3.3), expressed in the spatial domain using the correlation function, takes the following form

C(x) = C1 + C2 / |x|^η    (3.5)

where η is the same exponent as in (3.3). The intensity correlation
decreases with the distance between the pixels.
On the MS-GTI database, η was estimated using log-log regression to 0.202. The highest estimate of η for a single image was η = 0.52 and the lowest was η = −0.36.
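The estimation can be sketched as follows - radially averaging the power spectrum and fitting the power law by linear regression in log-log space. The binning and the use of a single image are our simplifications:

```python
import numpy as np

def estimate_eta(image):
    """Fit the power law S(w) = A / |w|^(2 - eta) to the radially
    averaged power spectrum by log-log regression."""
    f = np.fft.fftshift(np.fft.fft2(image - image.mean()))
    power = np.abs(f) ** 2

    # Radial frequency of every Fourier coefficient.
    h, w = image.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.hypot(fy, fx).ravel()

    # Average the power over annular frequency bins.
    edges = np.linspace(0.0, radius.max(), 65)
    idx = np.digitize(radius, edges)
    centers, means = [], []
    for i in range(1, len(edges)):
        vals = power.ravel()[idx == i]
        if vals.size:
            centers.append(0.5 * (edges[i - 1] + edges[i]))
            means.append(vals.mean())
    centers, means = np.array(centers), np.array(means)

    keep = (centers > 0) & (means > 0)
    # log S = log A + (eta - 2) log |w|, so eta = 2 + slope.
    slope, _ = np.polyfit(np.log(centers[keep]), np.log(means[keep]), 1)
    return 2.0 + slope
```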
In the top row of figure 3.3, 4 images with small η are shown, and in the bottom row 4 images with large η are shown. One striking difference between the images is the presence of the sky in the top row, while the sky is absent in the bottom row. The sky is a smooth and rather uniformly colored region, spatially extended especially in the y direction. The presence of the sky in an image has a large influence on the power spectrum and the intensity correlation; it implies a long range intensity correlation. The viewing distance - the distance to the main object in the captured scene - for the images in the top row is rather large, while the viewing distances for the bottom row images are rather small. Buildings, trees (forest) and lawns appear as rather homogeneous regions viewed from a large distance.
In figure 3.4, η has been estimated at the different capture scales (i.e. IS_i where i = 1, · · · , 15) and plotted against the capture scale. As the capture scale increases, η also increases. For the first four capture scales, η increases rather rapidly, while for the remaining capture scales the increase is less rapid (and not monotonic). The sky is present and
Figure 3.3: The power spectra of natural images follow a power law in spatial frequency. η is estimated to 0.202 for the images in the database, which is similar to the results reported by other researchers. Estimating the power law parameters for individual images shows large variation. The top row contains images with small η (≈ −0.3) and the bottom row contains images with large η (≈ 0.5). In the images shown in the first row a large part of each image is occupied by the sky, while the sky is absent in all images in the bottom row. Also note that the average distance to the main object in the scene is much larger in the images in the first row than in the second row.
occupies a large part of the image in many of the images at small capture scales (i.e. large viewing distances). As the viewing distance decreases, the sky occupies a smaller region of the image. Furthermore, buildings, lawns and trees appear as rather uniform regions at larger viewing distances; as the viewing distance decreases, more details appear and the regions appear less homogeneous.
In figure 3.5, typical and untypical sequences are shown. The first three rows show three sequences where η has been estimated on the individual images. The estimates of η at the different viewing distances follow the same pattern as for the ensemble (shown in figure 3.4). In the sense that they follow the same pattern as the ensemble of images, they are considered to be 'typical' sequences. At large viewing distances the sky is present, and the objects in the scenes appear rather homogeneous because of the viewing distance. As the distance decreases, the sky occupies a smaller region of the image and more details emerge.
The subsequent three rows show three 'untypical' sequences. The birch tree bark sequence contains small scale details at all viewing distances; therefore the estimates of η are large at all viewing distances. In the bush sequence, the estimates of η are rather stable and do not vary much. At larger viewing distances the bush and the lawn are rather homogeneous regions; as
Figure 3.4: Estimation of η as a function of capture scale, where index 1 is the least zoomed and 15 the most zoomed. Note that the capture scale is non-linear (see table 3.2); the increase in magnification is larger for the smaller indices. Computed over an ensemble of images, η increases as the viewing distance decreases. In terms of intensity correlation, expressed in equation 3.5, the correlation decreases as the viewing distance decreases. This can partly be explained by the presence of the highly correlated sky at larger viewing distances. Furthermore, large scale objects such as trees and houses appear more homogeneous viewed from larger distances.
the viewing distance decreases, more details emerge: highly correlated leaves with sharp boundaries appear, and as the viewing distance decreases further, details on the leaves emerge. In the Malmö harbor sequence the sky and the ocean occupy a large region of the image at all viewing distances; therefore the estimates of η are small at all viewing distances.
3.4.2 Laplacian Distribution of Linear Filter Responses
It has been reported [30, 126, 173, 174, 94, 117] that the distribution of the partial derivatives of an ensemble of natural images can be modeled by a generalized Laplacian distribution

p(x) = (1/Z) e^(−|x/s|^α)    (3.6)

where α and s are parameters estimated from an ensemble of natural images. The parameters s and α are related to the variance and kurtosis. The kurtosis κ and the skewness S for a random variable X are defined as
κ = E(X − mx)^4 / σ^4   and   S = E(X − mx)^3 / σ^3    (3.7)

where E is the expectation, mx is the mean (an estimate of E(X)) and σ is the standard deviation (σ² is the variance). The relations

σ² = s² Γ(3/α) / Γ(1/α)   and   κ = Γ(1/α) Γ(5/α) / Γ²(3/α)    (3.8)
can be used for estimating s and α. More elaborate model fitting approaches, such as Kullback-Leibler divergence minimization, least squares (LSE) and maximum likelihood (ML), can also be used for estimating the parameters.
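A minimal sketch of the moment matching route via equation (3.8) - the kurtosis relation is solved numerically for α, then s follows from the variance (the bracketing interval below is our assumption):

```python
import numpy as np
from scipy.special import gamma
from scipy.optimize import brentq

def fit_generalized_laplacian(samples):
    """Moment matching for p(x) = exp(-|x/s|^alpha) / Z using (3.8)."""
    x = samples.ravel() - samples.mean()
    var = x.var()
    kurt = (x ** 4).mean() / var ** 2

    # Kurtosis of the model as a function of alpha (decreasing in alpha).
    def model_kurtosis(alpha):
        return gamma(1 / alpha) * gamma(5 / alpha) / gamma(3 / alpha) ** 2

    alpha = brentq(lambda a: model_kurtosis(a) - kurt, 0.05, 10.0)
    s = np.sqrt(var * gamma(1 / alpha) / gamma(3 / alpha))
    return alpha, s
```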
Natural images are in general not differentiable; therefore a (linear) scale space approach is adopted: the derivative of an image is defined as the scale space derivative at a fixed scale. The scale space partial derivative in x is defined as

∂/∂x (Gt ∗ I) = (∂Gt/∂x) ∗ I    (3.9)
where ∗ denotes the convolution operator and Gt is the Gaussian function

Gt(x, y) = (1/(2πt)) exp(−(x² + y²)/(2t)).    (3.10)

Instead of using the scale space derivative at a fine scale, the intensity difference between adjacent pixels could have been used as a linear filter. The
Figure 3.5: Examples of η estimated on image sequences. The first three rows contain 'typical' sequences (sequences 15, 22 and 32) and the subsequent three rows contain 'untypical' sequences (sequences 35, 37 and 42). The plots show η against the capture scale for the 'typical' (left) and 'untypical' (right) sequences. In the 'typical' sequences the sky occupies a large, but decreasing, region of the image. The 'untypical' sequences contain either small scale details (tree trunk) or geometric structures (ocean) at all scales.
benefit of using the scale space derivative is the explicit scale formulation and the possibility of using different scales. The scale space derivatives have been computed on log(I + 1), and t = 1 is the smallest scale.
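For illustration, the scale space derivative of equation (3.9), computed on log(I + 1) as described above, amounts to derivative-of-Gaussian filtering; the sketch is ours:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space_dx(image, t):
    """Scale space partial derivative in x at scale t (equation 3.9):
    convolution with the x-derivative of a Gaussian of variance t."""
    sigma = np.sqrt(t)
    # order=(0, 1): smooth along y (axis 0), differentiate along x (axis 1).
    return gaussian_filter(np.log(image + 1.0), sigma=sigma, order=(0, 1))
```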
Compared with the Gaussian distribution, the generalized Laplacian distribution (usually) has a sharper peak at zero and 'heavy tails'. Most natural images contain homogeneous regions - objects under similar illumination with similar or smoothly varying intensities - which correspond to the sharp peak at zero. At object boundaries the intensities change rapidly, which corresponds to the 'heavy tails'. The α parameter relates to the sharpness of the peak, while s relates to the width of the distribution.
Yanulevskaya and Geusebroek [207] analyzed the relation between α and the image content in (individual) images and image patches (see also Geusebroek and Smeulders [72, 71]). Three sub-models are identified: power law, exponential and Gaussian distribution. The appropriate image model is selected using Akaike's information criterion (AIC). Typically, images with a well separated foreground and uniform background follow a power law, images with many details at different scales follow an exponential distribution, and images containing mainly high frequency texture follow a Gaussian distribution.
In figure 3.6, αx has been estimated on individual images. The first two rows contain images with large αx values estimated using t = 1 (αx ≈ 1.00) and t = 64 (αx ≈ 2.00). The images contain small scale details, where 'small' relates to the t used in the scale space derivative. The subsequent two rows show images with low αx values, estimated using t = 1 (αx ≈ 0.25) and t = 64 (αx ≈ 0.55). The images contain large scale geometric structures such as the sky and buildings.
In the last row of figure 3.6, the empirical distribution over the ensemble of images and the corresponding generalized Laplacian distribution are shown for t = 1 and t = 64; αx = 0.37 for t = 1, and αx = 0.78 for t = 64.
In figure 3.7, estimates of αx at the different capture scales (IS_i) are plotted, with αx on the y-axis and the capture scale on the x-axis. αx is estimated using four different t in the scale space derivative: t = 1, 4, 16 and 64. αx does not follow any trend (for any t): as a function of capture scale it neither increases nor decreases, but seems stable with some variation.
Bessel K form for Natural Images
Related to the generalized Laplacian distribution are the so-called (statistical) Bessel K forms proposed by Grenander and Srivastava [78] and Srivastava et al. [184]. The Bessel K form is derived using the transport generator model
(a) Large αx at t = σ² = 1   (b) Large αx at t = σ² = 64
(c) Small αx at t = σ² = 1   (d) Small αx at t = σ² = 64
[Plots: log(probability) versus derivative value at scale t = 1 (left) and t = 64 (right).]
Figure 3.6: Estimation of αx in the generalized Laplacian distribution using scale space derivatives at scales t = 1 and t = 64. The first two rows contain images with large αx at scale t = 1 (αx ≈ 1.00) respectively t = 64 (αx ≈ 2). The following two rows contain images with small αx at scale t = 1 (αx ≈ 0.25) respectively t = 64 (αx ≈ 0.55). αx is large if the image contains small scale details and small if it contains large scale geometric structures (where 'small' and 'large' are defined by the inner scale t). The last row shows plots of the empirical distribution over the ensemble of images and the corresponding generalized Laplacian distribution at scale t = 1 (left) and t = 64 (right).
Figure 3.7: Estimates of αx in the generalized Laplacian distribution computed on IS_i, with αx on the y-axis and the capture scale on the x-axis. Results for t = 1, t = 4, t = 16 and t = 64 in the Gaussian function are shown. No trend as a function of the capture scale can be found; instead the estimates of αx are rather stable across the capture scales.
and it models the image formation process.
Images are generated by projecting 3D objects onto the 2D image plane, resulting in a set of so-called 2D profiles - gi - representing the different objects in the scene. To make up an image, the 2D profiles interact in a non-linear way through occlusion, scaling and superposition. Under some simplifying statistical assumptions on the distribution, scale and location of the generators gi, the authors show that the marginal distributions of linear filter responses follow the Bessel K form distribution

p(x; p, c) = (1/Z(p, c)) |x|^(p−0.5) K_(p−0.5)(√(2/c) |x|)    (3.11)
where Z is a normalization constant and K is the modified Bessel function of the second kind. The parameters p and c are called the Bessel parameters and can be estimated using

p = 3/(κ − 3)    (3.12)

and

c = σ²/p    (3.13)

where the variance σ² and the kurtosis κ are estimated from the filtered images.
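A short sketch of this moment based estimation (our own illustration; it requires leptokurtic filter responses, i.e. kurtosis above 3):

```python
import numpy as np

def bessel_k_parameters(filtered):
    """Estimate the Bessel parameters from a filtered image using
    equations (3.12) and (3.13): p from the kurtosis, c from the variance."""
    x = filtered.ravel() - filtered.mean()
    var = x.var()
    kurt = (x ** 4).mean() / var ** 2

    p = 3.0 / (kurt - 3.0)   # equation (3.12); assumes kurt > 3
    c = var / p              # equation (3.13)
    return p, c
```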
As shown in [78, 184], the Bessel K form models the partial derivatives of individual images well. Furthermore, the Bessel form parameter p relates to the objects present in the image. The p value depends on the distinctness and the frequency of the edges: images with large objects with sharp boundaries have a low p value, while images with many objects have a large p value. Images containing large geometric structures will in general have a low p value, while images containing small scale textures will have a high p value.
The results of estimating p on the database are very similar to the estimates of α in the generalized Laplacian distribution. Yanulevskaya and Geusebroek [207] explain the relation between the estimate of α and the visual content in a similar way as Grenander and Srivastava connect the visual content with the estimate of p (see also [73]).
3.4.3 Size Distribution in Natural Images
Area Distribution in Natural Images
As discussed, nearby pixel intensities are highly correlated, and the correlation decreases with distance according to a power law. It is therefore natural to consider how the sizes of homogeneous regions in natural images are distributed. Alvarez et al. [3, 1, 76, 77] analyze the size distribution of homogeneous regions in natural images, in terms of area and perimeter, and show that it follows a power law, both when estimated on individual images and on ensembles of images. Following Alvarez et al., we will verify their result on the database and analyze the behavior as a function of capture scale.
A homogeneous region can be defined in many ways depending on the problem at hand. Our interest is to characterize natural images with respect to the distribution of region sizes; therefore a very simple approach is suitable. Let I be an image of size M × N with intensities in {1, · · · , G} and let k ∈ {1, · · · , G}. Histogram equalization is applied such that the number of intensity levels is k and the number of pixels per level is approximately the same - (M · N)/k - for all levels. After the histogram equalization, a homogeneous region is defined as a set of connected pixels - using either 8- or 4-connectivity - with the same intensity. The size of a homogeneous region is defined as the number of pixels in the region.
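A sketch of the region extraction (the quantile based equalization and the 8-connectivity below are our choices):

```python
import numpy as np
from scipy import ndimage

def region_areas(image, k=16):
    """Areas of homogeneous regions after reducing the image to k
    intensity levels by (approximate) histogram equalization."""
    # Equally populated quantile bins approximate histogram equalization.
    thresholds = np.quantile(image, np.linspace(0, 1, k + 1)[1:-1])
    levels = np.digitize(image, thresholds)

    areas = []
    eight_connected = np.ones((3, 3))
    for level in range(k):
        labeled, n = ndimage.label(levels == level, structure=eight_connected)
        if n:
            # bincount over the labels gives the pixel count of each region.
            areas.extend(np.bincount(labeled.ravel())[1:])
    return np.array(areas)
```

The empirical distribution of the returned areas can then be fitted with the power law of equation (3.14).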
The area distribution of homogeneous regions follows a power law

f(s) = A / s^α    (3.14)

where s is the area and A and α are image dependent parameters. The
parameters α and A can be estimated by regression on the set

{(log s, log f(s)) : s = 1, · · · , Tmax}    (3.15)

where Tmax is the smallest size s for which f(s) is zero. For an ensemble of natural images the Tmax value is large, covering almost the full range of the size distribution. For an individual image the Tmax value can be small, and a large range of the size distribution will not be used in the regression.
For ensembles of natural images α ≈ 2; for individual images α varies. For images containing large geometric structures, α is often smaller, approximately 1.7, while for images containing small scale texture α is often larger, approximately 3.0.
In figure 3.8, images with a small α ≈ 1.57 (figure a) and a large α ≈ 3.00 (figure b) are shown. The content difference is striking. The images with a small α mainly contain large scale geometric structure, and the distance to the main objects in the scene is large. The images with a large α contain small scale details (texture), and the distance to the main objects in the scene is small. On the ensemble of images, α is estimated to 2.11. Figure 3.8 also shows the empirical distribution and the estimated power law (in log-log scale) for a small α and a large α; the fit is good in both cases.
Figure 3.9 shows estimates of α at the different capture scales (IS_i). The α values (y-axis) are plotted against the capture scales (x-axis). The lowest estimate of α is 2.15 and the largest is 2.22. No trend can be found in the estimates: α seems to neither decrease nor increase as the viewing distance decreases.
Directional Homogeneous Region Size
In the previous section, the area distribution of homogeneous regions in individual images and in an ensemble of images was shown to follow a power law (equation 3.14). The orientation of the homogeneous regions was not considered. In the following section the 'size' of homogeneous regions in the x and y directions is analyzed. Because natural images are more correlated in the x direction than in the y direction, the size distributions of homogeneous regions in the two directions may differ.
The image intensity resolution is reduced to k intensities using histogram equalization, and connected regions with the same intensity are considered homogeneous.
The intersection length of a homogeneous region along a direction (the x and y directions in our case) is the number of connected pixels with equal intensity. By collecting all intersection lengths of homogeneous regions along
(a) Small α ≈ 1.75
(b) Large α ≈ 3.00
(c) The empirical distribution and the estimated regression
Figure 3.8: Distribution of homogeneous regions in different types of images. Figure (a) contains examples of images with small α; the images contain mainly large scale geometric structures such as the sky and buildings viewed from a distance. Figure (b) contains images with large α; the images contain small scale texture, and the distance to the main objects in the scene is quite small. Figure (c) contains the empirical distributions and the estimated power laws (in log-log scale) for a small α (left) and a large α (right). Estimated on the ensemble of images, α = 2.11.
Figure 3.9: Estimates of α in the power law for the area distribution (equation 3.14) computed on IS_i, with α on the y-axis and the capture scale on the x-axis. No trend - increasing or decreasing over the scales - can be found in the estimates.
a fixed direction in an image, the distribution of intersection lengths of homogeneous regions in that direction is computed. The x and y directions will be used, but any other direction could also be used.
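A small sketch (ours) of how the intersection lengths along one direction can be collected from the quantized image:

```python
import numpy as np

def intersection_lengths(levels, axis=1):
    """Run lengths of constant-intensity segments along one direction
    (axis=1: x direction, axis=0: y direction) of a quantized image."""
    lengths = []
    rows = levels if axis == 1 else levels.T
    for row in rows:
        # Positions where the intensity changes mark run boundaries.
        change = np.flatnonzero(np.diff(row)) + 1
        bounds = np.concatenate(([0], change, [row.size]))
        lengths.extend(np.diff(bounds))
    return np.array(lengths)
```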
In figure 3.10, three different homogeneous regions covering the same area are shown. The distributions of intersection lengths in the x and y directions differ between the regions. One region is extended in the x direction while another is extended in the y direction. The last region is connected in the 8-connectivity sense, but at the top it is not connected in the x direction; its intersection length distribution will therefore contain two short intersections.
Analyzing the intersection length distribution for homogeneous regions indicates that it follows a power law (as in 3.14) in intersection length, with different values of α in the x and y directions. Estimating αx and αy using log-log regression on all full images in the database gives αx = 2.96 and αy = 3.55, as shown in figure 3.11. Homogeneous regions extend further in the x direction than in the y direction. This supports the fact that natural images are more correlated in the x direction than in the y direction.
Figure 3.10: Three different homogeneous regions with the same area but with different shape and/or orientation. The region on the left is longer in the y direction than in the x direction. The region in the middle is longer in the x direction than in the y direction; the distributions of intersection lengths in the x and y directions for the two regions are therefore different. The region on the right is not connected at the top in the x direction; its intersection length distribution will therefore consist of two short intersections.
Figure 3.11: Log-log plot of the intersection length distribution in the x (left) and y (right) direction, together with the regression lines estimated using all full images in the database. αx = 2.96 and αy = 3.55, which shows that homogeneous regions are longer in the x direction than in the y direction; this is consistent with the higher correlation in the x direction than in the y direction.
Figure 3.12: The two images with the largest (left) and smallest (right) difference between αx and αy. For the image (IS^1_11) with the largest difference, αx = 2.14 and αy = 3.63; the homogeneous regions are longer in the x direction than in the y direction. For the image (IS^6_3) with the smallest difference, αx = 2.14 and αy = 2.13; the homogeneous regions have the same extension in the x and y directions.
3.5 Discussion
A new database containing an ensemble of sequences of natural images has been collected. Each sequence contains the same scene captured at different scales by adjusting the focal length. Natural images, or rather natural scenes, are vaguely defined as everyday scenes observed by a human from a human perspective. The definition includes both nature and man-made structures, but excludes 'bird's-eye views', because they are not considered to be from a human perspective.
Three classical and well known results from natural image statistics are
verified on the database.
The apparent scale invariance of an ensemble of natural images can be expressed as the power spectra of the ensemble following a power law in spatial frequency. We estimated η = 0.202 in the power law on the database. Ruderman and Bialek [166] estimated η = 0.19 on their database collected in the woods, Huang and Mumford [94] also estimated η = 0.19 on the van Hateren image database [197], and van der Schaaf and van Hateren [196] estimated η = 0.12 on their natural image database. The estimate of η on our database is similar to the previously reported results.
The distribution of partial derivatives of an ensemble of natural images can be modeled with a generalized Laplacian distribution. The partial derivative of the image was defined as the scale space derivative at scale t. We estimated αx = 0.37 at scale t = 1 and αx = 0.78 at scale t = 64. For comparison, Huang and Mumford estimated α = 0.55 on the van Hateren database [197].
The size distribution of homogeneous regions in natural images follows a power law. We estimated α = 2.11 in the size power law; Alvarez et al. [3, 1, 2] reported α close to 2. Again, our result is similar to previously reported results.
Estimates of the three statistics on individual images can to some degree explain the visual content of the image. Furthermore, the estimates depend strongly on the viewing distance.
The estimate of η in the power spectrum power law is in general large if the image is captured at a small viewing distance and the scene contains small scale details. η is small if the viewing distance is large and the scene contains large, uniformly colored regions such as trees, lawns and buildings.
The estimate of αx in the Laplacian distribution of the partial derivatives is usually large if the viewing distance is small and the scene mainly contains small scale details. αx is small if the viewing distance is large and the scene mainly contains geometric structures.
The estimate of α in the size distribution of homogeneous regions is large if the viewing distance is small and the scene contains small scale details. α is small if the viewing distance is large and the scene contains large scale geometric structures.
The relation between the estimates and the viewing distance can be explained by two different factors: the spatial composition - the spatial envelope - of the scene, and the inner scale. In images captured at large viewing distances, the sky often occupies a large region of the image (spatial composition), and buildings, trees and lawns often appear as uniformly colored regions (the inner scale is rather large). At a small viewing distance the sky is absent or occupies a small region of the image (spatial composition), and the details of the trees, bushes and lawns are brought out (the inner scale is smaller).
Chapter 4
A SVD-Based Image Complexity Measure
This chapter contains a slightly re-formatted version of
David Gustavsson, Kim S. Pedersen, and Mads Nielsen.
A SVD Based Image Complexity Measure.
In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP) 2009, 2009.
A SVD Based Image Complexity Measure
David Gustavsson, Kim S. Pedersen, and Mads Nielsen
DIKU, University of Copenhagen
Universitetsparken 1, DK-2100 Copenhagen Ø, Denmark
{davidg,kimstp,madsn}@diku.dk
Abstract
Images are composed of geometric structures and texture, and different image processing tools - such as denoising, segmentation and registration - are
suitable for different types of image contents. Characterization of the image
content in terms of geometric structure and texture is an important problem
that one is often faced with. We propose a patch based complexity measure,
based on how well the patch can be approximated using singular value decomposition. As such the image complexity is determined by the complexity
of the patches. The concept is demonstrated on sequences from the newly
collected DIKU Multi-Scale image database.
Keywords: Image Complexity Measure, Geometry, Texture, Singular Value
Decomposition, SVD, Truncated Singular Value Decomposition, TSVD, Matrix Norm
4.1 Introduction
Images contain a mix of different types of information, from highly stochastic textures such as grass and gravel to geometric structures such as houses and cars. Different image processing tools are suitable for different types of image content, and most tools are very content dependent. There is no general agreement in the computer vision community on the definition of texture and geometry. Our hypothesis is that the separation between geometry and texture is defined through the purpose of the method and the scale of interest. What may be considered an unimportant structure / texture in one application may be considered important in another.
For example, segmentation of an image containing objects with clear geometric structures forming boundaries calls for edge-based or geometry-based methods such as watersheds [148], the Mumford-Shah model [141], level sets [171], or snakes [104], while segmentation of an image containing objects only discernible by differences in texture calls for texture based segmentation methods [162]. That is, the type of objects we are attempting to segment defines our scale of interest, i.e. what type and scale of structure we include in the model of a segment.
In denoising, an image containing geometric structures calls for, e.g., an edge preserving method such as anisotropic diffusion [200] or total variation image decomposition [169]. For images containing small scale texture, a patch based denoising method such as non-local means filtering may be more appropriate [29]. Again we see that, depending on the purpose, we include structures at finer scales in the model of the problem as needed.
As a final example, we mention that total variation (TV) image decomposition, and other functional based methods, are very successful for inpainting images containing geometric structures [39]. Unfortunately the functional based methods fail to faithfully reconstruct regions containing small scale structures; texture based methods, however, manage to reconstruct such images [52, 48, 86, 49]. In the functional approaches the focus is solely on large scale structures or geometry, whereas in the texture methods small scale texture is included in the model.
Prior knowledge about the methods and the image content is therefore essential for successfully solving a task. A natural question is: "For a given type of images, which types of methods are suitable?" Often one wants to characterize a method by analyzing the types of images that it is (un)suitable for. To be able to characterize methods in this way, the images must be characterized with respect to their content. An image complexity measure is needed, i.e. a measure that quantifies the image content with respect to geometric structure and texture, or scale of interest.
A patch based complexity measure using singular value decomposition (SVD) is presented. The complexity of a patch is determined by the number of singular values that are required for a good approximation - the matrix rank of a good approximation. The number of singular values required to approximate an image patch is used to characterize the patch content. The global complexity measure for the image is computed as the mean complexity of all patches in the image. The proposed complexity measure is
evaluated on the baboon image and on the newly collected DIKU Multi-Scale
image sequence database.
4.2 Complexity Measure
In the following section images are viewed as matrices; hence the image complexity measure becomes a matrix complexity measure. Basic matrix properties, which can be found in e.g. [75], are used extensively. One obvious approach is to approximate a matrix A with a simpler matrix Ak and measure the error (residual) between the original matrix A and the approximation Ak. Here k is a parameter used for computing the approximation Ak. We assume that as the parameter k increases, the error between A and Ak decreases (or at least does not increase), and as k → ∞ the error becomes 0. The approximation Ak should also be simpler than A. To use this approach, an error measure between matrices and a matrix complexity measure must be defined.
4.2.1 Error Measure - Matrix Norms
To measure the difference between the original image A and a simpler approximation Ak of A, it is natural to use a matrix norm ‖A − Ak‖. One of the most commonly used matrix norms is the Frobenius norm (which corresponds to the L2-norm). Let A be an m × n matrix with elements aij; the Frobenius norm of A is defined as

‖A‖F = (Σ_{j=1}^{n} Σ_{i=1}^{m} |aij|²)^(1/2).    (4.1)
Another common type of matrix norm is the so-called induced matrix norm. Let A be an m × n matrix and x ∈ R^n a column vector (i.e. x = (x1, · · · , xn)^T); the matrix norm induced by the vector norm ‖x‖ is defined as

‖A‖ = sup_{‖x‖=1} ‖Ax‖/‖x‖    (4.2)

(or in words, the smallest number α such that ‖Ax‖/‖x‖ ≤ α for all x). The matrix norm is here defined in terms of a vector norm ‖x‖. The induced matrix norm can be viewed as how much the matrix A expands vectors, and is actually an operator norm. Different vector norms induce different matrix norms; most common are the p-norms defined as
‖x‖p = (Σ_{i=1}^{n} |xi|^p)^(1/p)    (4.3)

and especially the 2-norm ‖x‖2 = (x^T x)^(1/2). The matrix norm induced by the 2-norm is

‖A‖2 = sup_{‖x‖2=1} ‖Ax‖2/‖x‖2.    (4.4)
Both the Frobenius matrix norm and the matrix 2-norm are invariant under orthogonal transformations and will be used in the following sections.
4.2.2 Matrix Complexity Measure - Matrix Rank
Given a matrix A, a simpler matrix approximation Ak of A should be constructed. But first one must define what 'simpler' means. A natural approach to quantifying the complexity of a matrix is by its rank, and a simpler approximation of a matrix can be viewed as a matrix with lower rank. Let A be an m × n matrix; then the rank of A can be viewed as the dimension of the subspace spanned by the columns of A = (a1, · · · , an),
rank(A) = dim( span{a1, · · · , an} ).    (4.5)

4.2.3 Optimal Rank k Approximation
It is well known from matrix theory that an m × n matrix A can be decomposed into

A = U Σ V^T    (4.6)

where U is an m × m orthogonal matrix, V is an n × n orthogonal matrix and Σ is an m × n diagonal matrix with elements σ1, · · · , σl, where l = min{m, n}. This is the so-called singular value decomposition (SVD); the σi are called singular values, and the column vectors ui and vi of U and V are called singular vectors. The entries in Σ are ordered such that σ1 ≥ σ2 ≥ · · · ≥ σl ≥ 0.
Using the fact that the Frobenius norm is invariant under multiplication by orthogonal matrices gives

‖A‖²F = ‖Σ‖²F = Σ_{i=1}^{l} σi².    (4.7)
Let Σk be the m × n matrix containing the k largest singular values on the diagonal, and let

Ak = U Σk V^T.    (4.8)

Ak is the so-called truncated singular value decomposition (TSVD) approximation of A, where the first k singular values are used; if rank(A) ≥ k then rank(Ak) = k. The image approximation residual is defined as A − Ak, and if, again, rank(A) ≥ k, then rank(A − Ak) = rank(A) − k.
The reconstruction error, or residual error, for the Frobenius norm is

‖A − Ak‖F = (Σ_{i=k+1}^{l} σi²)^(1/2)    (4.9)
and for the 2-norm

‖A − Ak‖2 = σ_{k+1}.    (4.10)
Since rank(Ak) ≤ rank(A), Ak is simpler in the sense that its rank is not larger (and usually lower). Furthermore, Ak is the best rank-k approximation of A in the sense that

Ak = arg min_{rank(B)=k} ‖A − B‖2,    (4.11)

so any matrix B with rank k has at least as large a reconstruction error in the 2-norm as Ak. Ak is also the best rank-k approximation in the Frobenius norm. Singular value decomposition can be viewed as a method for finding an optimal basis and is related to other optimal basis methods such as independent component analysis (ICA) [96] and the Karhunen-Loève expansion [109].
There are two possibilities for comparing images by the norm of the residual. Either the number of singular values k is fixed and the reconstruction errors ‖A − Ak‖ using k singular values are compared, or the reconstruction error σ^err is kept fixed and the number of singular values required to bring the reconstruction error below σ^err is compared. That is, either the rank k or the reconstruction error σ^err is kept fixed.
Let k0 be the number of singular values used in the reconstruction. The residual error (using either the 2-norm or the Frobenius norm) is

‖A − A_{k0}‖ = σ^err_{k0}    (4.12)

and σ^err_{k0} is called the singular value reconstruction error using k0 singular values.
Let σ^err be a fixed reconstruction error and let k be the smallest integer such that

‖A − Ak‖ ≤ σ^err.    (4.13)

k is called the singular value reconstruction index (SVRI) at level σ^err. The SVRI states the smallest number of singular values required to obtain a reconstruction with a reconstruction error smaller than σ^err.
4.2.4 Global Measure
Instead of computing an approximation of the full image, which is not feasible for high resolution images, a patch based approach is adopted. The singular value reconstruction index at level σ^err is computed for each p × p patch in the image.
Based on the patch complexities, an image complexity measure should be computed. The obvious candidates are the mean or the mode complexity computed over all patches in the image; the mean patch complexity is used as the complexity measure for the image. The interpretation of the mean is simply the average number of singular values required to approximate the patches in the image with a reconstruction error less than σ^err.
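To make the definitions concrete, a minimal sketch of the SVRI and the mean patch complexity (our own illustration; it assumes intensities scaled to [0, 1] and, for brevity, non-overlapping patches):

```python
import numpy as np

def svri(patch, err):
    """Singular value reconstruction index: the smallest k such that the
    Frobenius residual ||A - A_k||_F is at most err (equations 4.9, 4.13)."""
    s = np.linalg.svd(patch, compute_uv=False)
    # tail[k] = ||A - A_k||_F, the norm of the discarded singular values.
    tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]
    below = np.flatnonzero(tail <= err)
    return int(below[0]) if below.size else s.size

def image_complexity(image, p=25, err=0.35):
    """Mean SVRI over p x p patches: the global complexity measure."""
    h, w = image.shape
    ks = [svri(image[i:i + p, j:j + p], err)
          for i in range(0, h - p + 1, p)
          for j in range(0, w - p + 1, p)]
    return float(np.mean(ks))
```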
Figure 4.1: Image sequences 02, 05 and 08 from the DIKU Multi-Scale image database (used in the experiments) at three capture scales.
4.3 DIKU Multi-Scale Image Database
The newly collected DIKU Multi-Scale image database [83], which contains sequences of the same scene captured using varying focal lengths - called capture scales - will be used to analyze the distribution of singular values in natural image patches and how the image content changes across capture scales.
The database contains sequences of natural images - both man-made and natural environments - with a large variety of scenes and distances to the main object in the scene. Each sequence contains 15 high resolution images of the same scene captured using different focal lengths. The zoom factor is roughly 16x, and the naming convention is that image 1 is the least zoomed and image 15 the most zoomed. Three examples of sequences are shown in figure 4.1.
Furthermore, the part of the scene that is present at all capture scales has been extracted, resulting in a sequence of regions containing the same part of the scene captured at different capture scales. The part of the scene present in the image to the right in figure 4.1 has been extracted from the remaining 14 images (of which two are shown in the figure).
Three sequences - 02 (building with windows), 05 (building without windows) and 08 (tree trunk) - shown in figure 4.1 are used in the experiments. The image contents are very different at the different capture scales, as can be seen in the 80 × 80 extracted patches shown in figure 4.2. For example, in the most zoomed image a single brick almost covers the whole 80 × 80 patch, while in the least zoomed image a large part of the brick wall is contained in the patch. (The 80 × 80 patches are only shown to visualize the content differences; the complete regions are used in the experiments.)
4.4 Singular Value Distribution in Natural Images
The proposed method depends on the distribution of singular values in natural image patches. The distribution of principal components and independent components in natural images has received a lot of attention for some years, partly because of its relation to front-end vision [197].
To analyze the distribution of singular values in natural image patches, 1000 randomly selected 25 × 25 patches from each image in the DIKU Multi-Scale image database have been selected - approximately 800000 patches - and the corresponding singular values have been computed.
Figure 4.2: 80 × 80 patches extracted from the three sequences shown in figure 4.1 at 3 different scales (index 1, 6 and 15). The patches show the content differences at the different capture scales.
Figure 4.3: Each column shows the patches with the largest (top) and smallest (bottom) σ25 in the same image. The content difference is striking and clearly indicates the importance of the small singular values for characterizing the image content.
The first, not so surprising, conclusion is that patches in natural images almost always have full rank - i.e. the singular values are almost always strictly larger than 0.
The distributions of the singular values σ1 and σ2 are shown in figure 4.4. The variance of the distribution of σ1 is large, and it is interesting that many patches have values close to 25. The distribution of σ2 is peaked at zero but also has 'heavy tails' - values relatively far from zero. This is also the case for σi where i > 2.
In figure 4.3 the patches with the largest σ25 (top) and smallest σ25 (bottom) in five different images are shown. The content difference between the patches is striking - the patches with the largest σ25 all contain large variations, while the patches with the lowest σ25 contain little or no visible variation.
The distributions of the small singular values are peaked at zero, but also show some variation and 'heavy tails'. Visual comparison of patches with high and low σ25 clearly indicates a content difference, which implies that the singular value reconstruction index is suitable for measuring image content.
4.5 Experiments
4.5.1 The baboon image
The baboon image is used only for demonstrating the method. The baboon is a good test image because it contains both very complex texture and large regions with geometric structures. In figure 4.5 the spatial distribution of complexity is shown using different patch sizes and error levels, with white regions indicating high complexity and black regions indicating low complexity. The highly stochastic texture yields high complexity values at all scales and error levels, while the geometric structures yield low complexity. As the patch size grows, the spatial distribution of complexity becomes smoother.
4.5.2 DIKU Multi-Scale Image Database
The image complexity measure is computed over the different capture scales using different patch sizes and error levels. The results are shown in figure 4.6.
The plots to the left and right in figure 4.6 have the same error level, 0.35, but different patch sizes, 15 and 25 pixels respectively; still, the shapes of the curves are very similar. On the other hand, the plots in the middle and to the right have the same patch size - 25 pixels - but different error levels - 0.05 and 0.35 -
Figure 4.4: The distributions of the singular values σ1 and σ2 for natural image patches of size 25 × 25. The variance of the distribution of σ1 is large (as expected); the distribution of σ2 is peaked at zero but also has 'heavy tails'.
Figure 4.5: Patch based complexity measure of the baboon image. Different patch sizes are used in the columns - from left to right, 9, 15 and 25 pixels - and different reconstruction error levels in the rows - from top to bottom, 0.1, 0.3 and 0.5.
Figure 4.6: Complexity measure (y-axis) computed over the different capture scales (x-axis) using different patch sizes and error levels. From left to right: patch size 15 and σ^err = 0.35, patch size 25 and σ^err = 0.05, and patch size 25 and σ^err = 0.35.
and the curves are very different, which indicates that the error level is more important than the patch size.
For sequence 02, the complexity at error level 0.05 first decreases roughly over the first 7 capture scales and then increases over the last 7 capture scales. For sequence 08, the complexity at error level 0.05 decreases quite rapidly at the first scales and then decreases more slowly over the remaining capture scales. For sequence 05, the complexity decreases with increasing capture scale.
The average number of singular values required for an approximation at a fixed error level varies a lot over the capture scales. This indicates that the content, in terms of complexity, changes over the capture scales, which is clearly visible in figure 4.2.
4.6 Conclusion
A patch based image complexity measure, based on the number of singular values required to approximate a patch at a given error level, is presented. The number of singular values is used to characterize the image content in terms of geometric structures and texture.
The proposed method is motivated by the optimal rank-k property of the truncated singular value approximation. The distribution of singular values in patches from natural images seems to be peaked at zero with 'heavy tails'. The image content in patches with a relatively large smallest singular value is very different from that in patches with a relatively small smallest singular value.
ACKNOWLEDGEMENTS
This research was funded by the EU Marie Curie Research Training Network VISIONTRAIN MRTN-CT-2004-005439 and the Danish Natural Science Research Council project Natural Image Sequence Analysis (NISA) 272-05-0256. The authors want to thank Prof. Christoph Schnörr (Heidelberg University) and Dr. Niels-Christian Overgaard (Lund University) for sharing their knowledge.
Chapter 5
On the Rate of Structural Change in Scale Spaces
This chapter contains a slightly re-formatted version of
David Gustavsson, Kim S. Pedersen, Francois Lauze and Mads Nielsen.
On the Rate of Structural Change in Scale Spaces.
In Proceedings of Scale Space and Variational Methods in Computer Vision (SSVM) 2009, 2009.
On the rate of structural change in scale spaces
David Gustavsson, Kim S. Pedersen, Francois Lauze and Mads Nielsen
DIKU, University of Copenhagen
Universitetsparken 1, DK-2100 Copenhagen Ø, Denmark
{davidg,kimstp,francois,madsn}@diku.dk
Abstract
We analyze the rate at which image details are suppressed as a function of the regularization parameter, using first order Tikhonov regularization, linear Gaussian scale space and total variation image decomposition. The squared L2-norms of the regularized solution and the residual are studied as functions of the regularization parameter. For first order Tikhonov regularization it is shown that the norm of the regularized solution is a convex function, while the norm of the residual is not a concave function. The same result holds for Gaussian scale space when the parameter is the variance of the Gaussian, but may fail when the parameter is the standard deviation. Essentially this implies that the norm of the regularized solution cannot be used for global scale selection, because it does not contain enough information. An empirical study based on synthetic images as well as a database of natural images confirms that the squared residual norms contain important scale information.
Keywords: Regularization, Tikhonov Regularization, Scale Space, TV, Total Variation, Geometric Structure, Texture
5.1 Introduction
Images contain a mix of different types of information - from fine scale stochastic textures to large scale geometric structures. Image regularization can be viewed as approximating the observed original image with a simpler image, where 'simpler' is defined by the regularization (prior) term and the regularization parameter λ. Here an image is considered simpler if it is smoother (or piecewise smoother). Regularization can also be viewed as decomposing the observed image into a regularized (smooth) component and a small scale texture/noise component (called the residual, because it is the difference between the regularized solution and the observed image). By increasing the regularization parameter λ, smoother and smoother approximations are generated. The rate at which image details are suppressed as a function of the regularization parameter depends on the image content and the regularization method. The image residual contains the details that are suppressed during the regularization, and the norm of the residual is a measurement of the amount of detail that is suppressed. The norm of the residual as a function of the regularization parameter gives important information about the image content. For images containing small scale structure, many details are suppressed even for small λ, and the norm of the residual will be large for small λ. For images containing solely large scale geometric structures, few details will be suppressed for small λ, and the norm of the residual will be small. The rate at which details are suppressed can be viewed as the derivative of the norm of the residual with respect to the regularization parameter, and it reveals the amount of detail that is suppressed as the regularization parameter increases.
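As a concrete illustration of the quantity studied in the following sections, a sketch of ours that traces the squared residual norm under Gaussian smoothing, with the Gaussian variance t as the parameter:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def residual_norm_curve(image, ts):
    """Squared L2-norm (per pixel) of the residual image - f_t as a
    function of the scale parameter t (the Gaussian variance)."""
    image = image.astype(np.float64)
    curve = []
    for t in ts:
        smooth = gaussian_filter(image, sigma=np.sqrt(t))
        curve.append(np.mean((image - smooth) ** 2))
    return np.array(curve)
```

For an image dominated by fine scale texture this curve rises steeply already at small t; for an image of large scale geometric structure it rises slowly.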
First order Tikhonov regularization, Gaussian linear scale space (which is equivalent to infinite order Tikhonov regularization [143]) and Total Variation image decomposition are studied. The squared L2-norm of the regularized solution and of the residual are studied as functions of the regularization parameter. Of special interest is the convexity/concavity of those norms viewed as functions, because it relates to the possibility that the rate at which details are suppressed can increase or decrease. In section 5.2, first order Tikhonov regularization is revisited and it is shown that the norm of the regularized solution is a convex function, while the norm of the residual is not a concave function. In section 5.3, linear Gaussian scale space is revisited, and it is shown that the norm of the regularized solution is convex as a function of the Gaussian variance, or equivalently of the diffusion time, but may fail to be convex when the parameter is the Gaussian standard deviation. The squared norm of the residual is in general not a concave function of its parameter. In section 5.4, Total Variation (TV) image decomposition is revisited. In section 5.5 experimental results are presented: the Sinc function, a synthetic image containing structures at different scales, and natural images are studied.
These studies tend to show that the squared residual norm contains scale information, particularly at values where the local convexity/concavity behavior changes.
5.1.1 Related work
Characterization of images by analyzing the behavior of the norm of the regularized solution and of the residual as functions of the regularization parameter has not received much research attention. Sporring and Weickert [181, 182] view images as distributions of light quanta and use information theory to study the structure of images in scale space. The entropy of an image as a function of the scale (in scale-space) is analyzed and shown to be an increasing function of the scale. The result holds both for linear Gaussian scale space and non-linear scale space. Furthermore, the derivative of the entropy with respect to the scale is shown, empirically, to be a good texture descriptor. The derivative of the scale-space entropy function with respect to the scale is a global measure of how much the entropy of an image changes at different scales. Where Sporring and Weickert study monotone functions of images across scale, we study norms of the scale space image and of the residual.
Buades et al. [27] introduced the concept of Method Noise in denoising. The Method Noise is the image details that are removed in the denoising - i.e. the residual image - and its content is used for comparing denoising methods. The residual image has often been used to determine the optimal regularization parameter (see Thompson et al. [189] for a classical study). Selection of the optimal stopping time for diffusion filters was studied by Mrazek and Navara [139], which also relates to the Lyapunov functionals studied by Weickert [200].
5.1.2 Convexity, Fourier Transforms, Power Spectra
Recall that a function f(x) defined on a convex set C is convex if

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)

for all 0 ≤ λ ≤ 1 and for all x, y ∈ C. If f(x) is convex on a convex set C then −f(x) is said to be concave on C. When f(x) is twice differentiable, a necessary and sufficient condition for convexity is

f''(x) ≥ 0 for all x ∈ C    (5.1)

(in the multidimensional case, the Hessian matrix is positive semi-definite). Two elementary facts will be used in the sequel: 1) let h(λ) be a function of the form

h(λ) = ∫ d(λ, x) s(x) dx    (5.2)

where d(λ, x) is convex in λ and s(x) ≥ 0; then h(λ) is convex. 2) Assume that f(x) = h(g(x)) where g : R^n → R^k and h : R^k → R. Then
• if h is convex and non-decreasing and g is convex, then f is convex,
• if h is convex and non-increasing and g is concave, then f is concave.
The Fourier transform of a function f is denoted by f̂. Parseval's theorem asserts that the Fourier transform is an isometry of L2: ‖f‖_{L2} = ‖f̂‖_{L2}, where

‖f(x, y)‖²₂ = ∫∫ |f(x, y)|² dx dy.    (5.3)

The frequency domain variables are denoted (ωx, ωy) =: ω. The power spectrum of a function f is the function ω ↦ |f̂(ω)|². f is said to follow an (α-)power law if |f̂(ω)| ∼ C/|ω|^α, where C and α are constants. It is well known that the power spectra computed over a large ensemble of natural images approximate a power law in spatial frequencies with α around 1.7, or at least in (0, 2) [166, 58].

We often implicitly use the following classical result from calculus. Let B := B(0, 1) be the unit ball of R^n and B^c its complement. Let g be a positive function defined on R^n and assume that g ∼ ‖x‖^{−α} in B (resp. B^c). Then ∫_B g dx < ∞ if and only if α < n (resp. ∫_{B^c} g dx < ∞ if and only if α > n).

Finally, to conclude this paragraph, given a regularization, the functions s(λ) and r(λ) will denote the squared L2-norm of, respectively, the regularized solution and the residual as a function of the regularization parameter λ.
5.2 Tikhonov Regularization
First order Tikhonov regularization is defined as the minimizer of the energy functional

Eλ[f] = ∫∫ (f − g)² + λ|∇f|² dx dy    (5.4)

where g is the observed data and λ is the regularization parameter. The energy functional is composed of two terms: the data fidelity term ‖f − g‖²₂ and the regularization term ‖∇f‖²₂. Note that the Wiener filter can be regarded as Tikhonov regularization applied in the Fourier domain. Thanks to Parseval's theorem, all calculations can be performed in the Fourier domain, where this energy becomes

Eλ[f̂] = ∫∫ (f̂ − ĝ)² + λ(ωx² f̂² + ωy² f̂²) dωx dωy.    (5.5)

Using the calculus of variations, a necessary condition for a function f to minimize the functional (5.4) is given by its Euler-Lagrange equation: (f − g) − λΔf = 0. In the Fourier domain, it becomes

f̂ − ĝ + λ(ωx² f̂ + ωy² f̂) = 0,  i.e.  f̂ = ĝ / (1 + λ|ω|²),    (5.6)

that is, the original signal multiplied with the filter function F(λ, ω) = 1/(1 + λ|ω|²), which is a non-increasing convex function w.r.t. λ (for λ ≥ 0). Set d(λ, ω) = F(λ, ω)². It is important to remark that defining the regularization in the frequency domain by λ → F(λ, ω)ĝ(ω) extends Tikhonov regularization beyond the case where g ∈ W^{1,2}(R²), the Sobolev space of L² functions with L² weak derivatives, which is the natural space for Tikhonov regularization as defined by minimization of (5.4). Indeed, the corresponding function s(λ) is given by

s(λ) = ‖F(λ, ω)ĝ‖²₂ = ∫∫ d(λ, ω)|ĝ|² dω.    (5.7)
This is the integral of the squared filter function times the power spectrum of the original signal g, and we have the following result:

Proposition 1 The squared L2-norm s(λ) of the minimizer of the Tikhonov regularization functional as a function of the regularization parameter λ is, for non-trivial images, a monotonically decreasing convex function (for λ ∈ (0, ∞)), when it exists.

If g follows an α-power law, then from the calculus fact recalled in the previous section, g ∉ L²(R²); however, s(λ), s′(λ) and s″(λ) exist and are finite for λ > 0 if and only if α ∈ (0, 2) (which is the case for natural images). Both s′ and s″ diverge for λ → 0⁺.
The square of a non-increasing convex function is a convex function, and from Section 5.1.2 we have the first part of the proposition. Now

d_λ(λ, ω) = −2|ω|² / (1 + λ|ω|²)³,   d_λλ(λ, ω) = 6|ω|⁴ / (1 + λ|ω|²)⁴.

s′(λ) = ∫∫ d_λ(λ, ω)|ĝ|² dω and s″(λ) = ∫∫ d_λλ(λ, ω)|ĝ|² dω, and the rest of the proposition follows by elementary analysis.

Set R(λ, ω) = 1 − F(λ, ω) and e(λ, ω) = R(λ, ω)². The Fourier image residual is R(λ, ω)ĝ and its squared norm is

r(λ) = ‖R(λ, ω)ĝ‖²₂ = ∫∫ e(λ, ω)|ĝ|² dω.

An elementary calculation gives e_λ(λ, ω) = 2λ|ω|⁴/(1 + λ|ω|²)³, and this function is, for λ fixed, bounded in ω, while it satisfies

lim_{λ→0⁺} e_λ(λ, ω) = 0,  lim_{λ→∞} e_λ(λ, ω) = 0   for all ω.

The same holds for r′(λ) when it is finite; therefore, by the mean value theorem, as it is positive, it must have a maximum and r″(λ) must change sign, and we can state the following:

Proposition 2 Assume that g ∈ W^{1,2}(R²) is non-trivial. Then, although s(λ) is convex and decreasing, the squared residual norm r(λ) of Tikhonov regularization, while increasing from 0 to ‖g‖²₂, is neither concave nor convex.

Note that when g is an α-power law with α ∈ (0, 2), g ∉ L²(R²) while its regularization g_λ is when λ > 0; thus g − g_λ ∉ L²(R²) and r(λ) = ‖g − g_λ‖²₂ = +∞.
5.3 Linear Scale-Space and Regularization
Linear scale-space theory [111, 202, 98] deals with simplified coarse scale representations of an image g, generated by solving the diffusion (heat) equation with initial value g:

∂f/∂t = Δf,   f(−, 0) = g(−)    (5.8)

where Δ = ∂xx + ∂yy is the Laplacian. Equivalently, this coarse scale representation can be obtained by convolution with a Gaussian kernel:

f_σ = g ∗ G_σ,   G_σ(x, y) = (1/(2πσ²)) e^{−(x²+y²)/(2σ²)}    (5.9)

and the link between the two formulations is given by f_σ = f(−, σ²/2). A third formulation of linear scale-space is obtained as "infinite order" Tikhonov regularization; the 1-dimensional case was introduced by Nielsen et al. in [143]. In dimension 2, one defines for λ > 0

E[f] = ∫∫ (f − g)² dx dy + ∫∫ Σ_{k=1}^∞ (λ^k / k!) Σ_{ℓ=0}^k (k choose ℓ) (∂^k f / ∂x^ℓ ∂y^{k−ℓ})² dx dy    (5.10)

where (k choose ℓ) is the (ℓ, k)-binomial coefficient. By a direct computation, its associated Euler-Lagrange equation is given by

f − g + Σ_{k=1}^∞ ((−1)^k λ^k / k!) Δ^k f = 0

where Δ^k is the k-th iterated Laplacian

Δ^k = Δ ∘ ··· ∘ Δ (k times) = Σ_{ℓ=0}^k (k choose ℓ) ∂^{2k} / (∂x^{2ℓ} ∂y^{2(k−ℓ)}).
Via the Fourier transform, the Laplacian operator becomes multiplication by −|ω|², and as in first order Tikhonov regularization, the solution is given by filtering:

f̂ = ĝ / (1 + Σ_{k=1}^∞ λ^k |ω|^{2k} / k!) = e^{−λ|ω|²} ĝ.    (5.11)

The solution of the filtering problem for a given λ > 0 is the same as solving (5.8) with t = λ. By setting λ = σ²/2 and applying the convolution theorem to (5.9) one gets the above equation. Using the Fourier formulation, the squared norm s(λ) of the solution (5.11) and the squared residual norm r(λ) are given by
s(λ) = ‖e^{−λ|ω|²} ĝ‖²₂ = ∫∫ e^{−2λ|ω|²} |ĝ(ω)|² dω,

r(λ) = ‖(1 − e^{−λ|ω|²})ĝ‖²₂ = ∫∫ (1 − e^{−λ|ω|²})² |ĝ(ω)|² dω.

If one defines d(λ, ω) = e^{−2λ|ω|²} and e(λ, ω) = (1 − e^{−λ|ω|²})², they have, with respect to convexity/concavity, the same properties as their Tikhonov counterparts defined in the previous section, and one can state the following, in terms of the heat equation / Gaussian variance:
Proposition 3
1. The squared L2-norm s(t) of the solution of the heat equation as a function of the diffusion "time" t (or equivalently of convolution by a Gaussian kernel as a function of the kernel variance) is, for non-trivial images, a monotonically decreasing convex function (for t ∈ (0, ∞)), when it exists.
2. The squared residual norm r(t) of the solution of the heat equation at time t, while increasing from 0 to ‖g‖²₂, is neither concave nor convex.

If, instead of using the diffusion time / variance as parameter, one uses the standard deviation σ of the Gaussian kernel, the resulting squared solution norm s(σ), although decreasing, may fail to be convex, as the function σ ↦ e^{−σ²|ω|²} is not convex in σ: it is a half Gaussian bell. A simple example showing the convexity failure is provided by the band limited function b whose Fourier transform is b̂(ω) = 1 if |ω| ≤ 1 and b̂(ω) = 0 otherwise. A direct calculation gives

s(σ) = (π/σ²)(1 − e^{−σ²})

which is neither convex nor concave. On the other hand, for a function g following an α-power law with α < 2, s(σ) seems to be convex (for instance, if α = 0, s(σ) = π/σ; if α = 1, s(σ) = π^{3/2}/σ²).

If, again, the power spectrum of the image g follows a power law in spatial frequencies, then its regularized L2-norm is finite, but the residual norm is not, as the initial datum is not square-integrable.
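A sketch of the Gaussian scale-space norms from (5.11), with the diffusion time t as parameter; the grid, frequencies and test image are placeholders. Convexity of s(t) shows up in its second differences, while the same values replotted against the standard deviation σ = √(2t) need not form a convex curve.

import numpy as np

def scale_space_norms(g, ts):
    g_hat = np.fft.fft2(g)
    wx = 2 * np.pi * np.fft.fftfreq(g.shape[0])
    wy = 2 * np.pi * np.fft.fftfreq(g.shape[1])
    w2 = wx[:, None] ** 2 + wy[None, :] ** 2
    P = np.abs(g_hat) ** 2
    s = np.array([np.sum(np.exp(-2 * t * w2) * P) for t in ts])
    r = np.array([np.sum((1 - np.exp(-t * w2)) ** 2 * P) for t in ts])
    return s, r

g = np.random.default_rng(2).normal(size=(64, 64))
ts = np.linspace(0.01, 4.0, 200)
s_t, r_t = scale_space_norms(g, ts)
print(np.all(np.diff(s_t, 2) >= -1e-8))   # s is convex as a function of t
sigmas = np.sqrt(2 * ts)                  # reparametrization by std. deviation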
5.4 Total Variation image decomposition
Bounded Variation image modeling was introduced in the seminal work of Rudin et al. [169], where the following variational image denoising problem is considered. Given an image g and λ > 0, find the minimizer of the energy

E(f; g, λ) = ∫∫ (g − f)² dx dy + λ ∫∫ |∇f| dx dy.    (5.12)

The regularized image f_λ can be interpreted as a denoised version of g, but also as the "geometric" content of g, while the residual ν_λ = g − f_λ contains the "noise/fine texture" component. Several methods have been proposed to solve the above problem, for instance by solving a regularized form of the Euler-Lagrange equation of the functional,

f − g − λ∇·(∇f/|∇f|) = 0,

where ∇· denotes the divergence operator, but also, for instance, the nonlinear projection method of Chambolle [34], which we have used in this work. λ is a regularization parameter that determines the level of detail that ends up in the (noise/texture) component ν_λ. As λ increases, ν_λ will contain details of larger and larger scale, which will not appear in f_λ.
Again it is interesting to see how the image content changes as λ increases. The component ν_λ is the residual of the regularization and contains the details that are suppressed in the cartoon component f_λ, and we set

r(λ; g) = ‖ν_λ‖²₂ = ‖g − f_λ‖²₂,    (5.13)

i.e. the squared L2-norm of the residual image as a function of the regularization parameter λ. Related to the norm of the residual is the norm of the cartoon component as a function of λ,

s(λ; g) = ‖f_λ‖²₂.    (5.14)

s′(λ) encodes the rate at which details are suppressed in the cartoon component f_λ. Due to the high nonlinearity of the TV-regularization problem, there is no relatively simple expression for s(λ), r(λ) and their respective derivatives.
A norm study for the dual of the TV norm was done by Meyer in [132]. A more direct behavior for the 2-norm can be computed in a few cases. For instance, Strong and Chan [186] showed that if g is the function g(x) = 1 if x ∈ B(0, 1), the unit disk, and g(x) = 0 if x ∉ B(0, 1), then its regularization has the form cg, where c ∈ (0, 1) is a constant, therefore attenuating the contrast of the image.

In general situations, we cannot expect this type of simple result. We have instead decided to study the behavior of these functions experimentally on an image database.
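Since no closed form is available, r(λ) can be estimated numerically. A sketch using Chambolle's projection algorithm as implemented in scikit-image (denoise_tv_chambolle); its weight parameter plays the role of λ up to the library's convention, and the test image is a placeholder.

import numpy as np
from skimage.restoration import denoise_tv_chambolle

g = np.random.default_rng(3).normal(size=(64, 64))
for weight in (0.05, 0.1, 0.2, 0.5, 1.0):
    f = denoise_tv_chambolle(g, weight=weight)   # cartoon component f_lambda
    r = np.sum((g - f) ** 2)                     # squared residual norm r(lambda)
    print(f"weight={weight:4.2f}  r={r:.2f}")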
5.5 Experiments

5.5.1 Sinc in Scale Space
Let g(x) = sin(x)/x be the Sinc function, where x ∈ (−∞, ∞). The squared L2-norm of the residual as a function of the regularization parameter is, in the Tikhonov case,

r(λ) = ∫_{−1}^{1} (λω² / (1 + λω²))² dω    (5.15)

and in the scale space case

r(σ) = ∫_{−1}^{1} (1 − e^{−ω²σ²/2})² dω.    (5.16)
The result is presented in figure 5.1. The plots clearly indicate that the residual norm is, in both cases, not concave.
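The two integrals are easy to evaluate numerically; a sketch with scipy.integrate.quad (the sample parameter values are arbitrary):

import numpy as np
from scipy.integrate import quad

def r_tikhonov(lam):
    # Eq. (5.15): Tikhonov residual norm for the Sinc function
    return quad(lambda w: (lam * w**2 / (1 + lam * w**2))**2, -1, 1)[0]

def r_scale_space(sigma):
    # Eq. (5.16): scale space residual norm for the Sinc function
    return quad(lambda w: (1 - np.exp(-w**2 * sigma**2 / 2))**2, -1, 1)[0]

print([round(r_tikhonov(l), 4) for l in (0.5, 2, 8)])
print([round(r_scale_space(s), 4) for s in (0.5, 2, 8)])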
Figure 5.1: The residual norm as a function of the regularization parameter for g(x) = sin(x)/x. (a) Tikhonov regularization: residual norm, first and second order derivatives. (b) Scale space: residual norm, first and second order derivatives. The plots clearly indicate that the residual norm functions are, in both cases, increasing but not concave.

5.5.2 Black squares with added Gaussian noise

The first experiment is done on an artificially generated 100 × 100 image containing four 3 × 3 black squares, one 20 × 20 black square, and added Gaussian white noise with σ² = 12. The white background has intensity 125 and the black squares intensity 10; after the noise has been added, the image is normalized to zero mean.
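A sketch of how such a test image can be generated; the square positions are not specified in the text, so the placements below are arbitrary.

import numpy as np

rng = np.random.default_rng(4)
img = np.full((100, 100), 125.0)                          # white background
for (i, j) in [(10, 10), (10, 80), (80, 10), (80, 80)]:   # four 3x3 squares
    img[i:i+3, j:j+3] = 10.0
img[40:60, 40:60] = 10.0                                  # one 20x20 square
img += rng.normal(scale=np.sqrt(12), size=img.shape)      # noise, sigma^2 = 12
img -= img.mean()                                         # zero-mean normalization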
In figure 5.2 the regularized and residual images are shown for increasing regularization using first order Tikhonov regularization. As the small scale noise is suppressed, the large scale geometric structures are also smoothed out. The norm of the residual is an increasing function of the scale and appears concave over the range of λ shown; however, the inflection point may occur at a small λ outside this range.
In figure 5.3 the regularized and residual images are shown for increasing regularization using linear Gaussian scale space. The results for linear Gaussian scale space are similar to those using first order Tikhonov regularization.
In figure 5.4 the regularized and residual images are shown for increasing regularization using Total Variation image decomposition. The different structures are suppressed at different λ, while the large scale structures are well preserved. At λ = 12 the Gaussian white noise is suppressed, at λ = 210 the small boxes are removed, and finally the large box is suppressed at λ = 550. The residual norm as a function of the regularization parameter is not a concave function of λ.
Figure 5.2: Result for the squares and noise image using first order Tikhonov regularization. On the first row the regularized and the residual images for λ = 3, 10, 20 and 50 are shown. The plots contain the L2-norm of the residual as a function of the scale λ, followed by the first order derivative in log-scale.
Figure 5.3: Result for the squares and noise image using linear scale space. On the first row the regularized and the residual images for σ² = 1, 7, 13 and 64 are shown. The plots contain the L2-norm of the residual as a function of the scale σ², followed by the first order derivative in log-scale.
Figure 5.4: Result for the squares and noise image using TV-decomposition. On the first row the regularized and the residual images for λ = 12, 38, 100 and 200 are shown. The plots contain the L2-norm of the residual as a function of the scale λ, followed by the first order derivative in log-scale. The residual norm seems to be a monotonically increasing non-concave function. It has three points of 'high' curvature: at λ = 12 the noise is suppressed, at λ = 210 the small squares are suppressed, and at λ = 580 the large square is suppressed.
5.5.3 DIKU Multi-Scale Image Sequence Database I

The newly collected DIKU Multi-Scale image sequence database [85] contains sequences of the same scene captured using varying focal lengths. The sequences contain both man-made structures and nature, and the distance to the main objects in the scenes also shows a large variation (from a few meters to a few kilometers).
Each image has first been normalized by an affine intensity range change so that the intensity range becomes [0, 1], followed by subtraction of the mean value (i.e. the mean intensity is 0 in each image).

The mean residual norm was computed on the normalized images in the database at the fixed scales σ = 2^i for i = 0, ..., 12, using linear Gaussian scale space. The result is a feature vector ⟨r̄(0), ..., r̄(12)⟩ containing

r̄(i) = (1/N) Σ_{I∈F} r(i; I)    (5.17)

where F is the set of all N normalized images in the database.
The (signed) distance d(I0) of a normalized image I0 ∈ F to the mean is defined as

d(I0) = Σ_{i=0}^{12} (r(i; I0) − r̄(i)).    (5.18)
The (signed) distance to the mean has been computed for all images in
the DIKU database. Images with large positive values have a larger than
average residual and images with large negative values have a smaller than
average residual.
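A sketch of the computation of (5.17) and (5.18), assuming the residual norms r(i; I) are obtained by Gaussian smoothing at σ = 2^i (our reading of the scales above); the function names are ours.

import numpy as np
from scipy.ndimage import gaussian_filter

def residual_norm(I, i):
    # r(i; I): squared L2-norm of the residual at scale sigma = 2**i
    f = gaussian_filter(I, sigma=2.0 ** i)
    return np.sum((I - f) ** 2)

def mean_feature(images, n_scales=13):
    # r_bar(i) of Eq. (5.17), averaged over the set of normalized images
    return np.array([np.mean([residual_norm(I, i) for I in images])
                     for i in range(n_scales)])

def signed_distance(I0, r_bar):
    # d(I0) of Eq. (5.18)
    return sum(residual_norm(I0, i) - r_bar[i] for i in range(len(r_bar)))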
The first row in figure 5.5 contains the 4 images with the largest positive distance to the mean; the second row contains the 4 images with the largest negative distance to the mean. The difference in image content is striking and clearly indicates that the residual norm contains important content information. The same experiment was performed using first order Tikhonov regularization with similar, but not identical, results.
Figure 5.5: The top row shows images where the residual norm is much larger than the average, and the bottom row shows images where it is much smaller than the average. The difference in content is striking: the images in the first row contain small scale details (texture), while the images in the bottom row contain large scale geometric structures.
5.6 Conclusions
For square-integrable images, the squared L2-norms of the regularized images in first order Tikhonov regularization and linear Gaussian scale space are, in general, decreasing convex functions of the regularization parameter. This may fail for linear scale space when the Gaussian standard deviation is used as the parameter. Their squared residual norms are, however, not concave functions. For Total Variation regularization too, it is shown empirically that the squared norm of the residual is not concave.

This confirms that the squared norm of the residual may be an indicator of image structure, for first order Tikhonov regularization and Gaussian scale space as well as for Total Variation regularization. The behavior of the latter will be studied further in future research.
ACKNOWLEDGEMENTS
This research was funded by the EU Marie Curie Research Training Network VISIONTRAIN MRTN-CT-2004-005439 and the Danish Natural Science Research Council project Natural Image Sequence Analysis (NISA) 272-05-0256. The authors want to thank Christoph Schnörr (Heidelberg University), Niels-Christian Overgaard (Lund University) and Vladlena Gorbunova (Copenhagen University) for sharing their knowledge.
Chapter 6

Variational Segmentation and Contour Matching of Non-Rigid Moving Object
This chapter contains a slightly re-formatted version of

David Gustavsson, Ketut Fundana, Niels-Chr. Overgaard, Anders Heyden, and Mads Nielsen. Variational Segmentation and Contour Matching of Non-Rigid Moving Object. In Proceedings of the Workshop on Dynamical Vision WDV 2008, 2008.
Variational Segmentation and Contour Matching of Non-Rigid Moving Object

David Gustavsson¹,², Ketut Fundana³, Niels Chr. Overgaard³, Anders Heyden³, and Mads Nielsen¹

¹ DIKU, University of Copenhagen, Universitetsparken 1, DK-2100 Copenhagen Ø, Denmark, {davidg,madsn}@diku.dk
² IT University of Copenhagen, Rued Langgaards Vej 7, DK-2300 Copenhagen S, Denmark
³ Applied Mathematics Group, School of Technology and Society, Malmö University, Östra Varvsgatan 11A, SE-205 06 Malmö, Sweden, {ketut.fundana,nco,heyden}@ts.mah.se

Abstract
In this paper we propose a method for variational segmentation and contour matching of nonrigid objects in image sequences which can deal with occlusions. The method is based on a region-based active contour model of Chan-Vese type, augmented with a frame-to-frame interaction term which uses the segmentation result from the previous frame as a shape prior. This method has given good results in the presence of minor occlusions, but cannot handle significant occlusions. We have extended this approach by adding a registration step between two consecutive contours. This registration step is based on a novel variational formulation and also gives a mapping of the intensities from the interior of the previous contour to the next. With this information, occlusions can be detected from deviations from the predicted intensities, and the missing intensities in the occluded areas can then be reconstructed. The performance of the method is shown with experiments on synthetic and real image sequences.
6.1 Introduction
Segmentation is an important and difficult process in computer vision, with the purpose of dividing a given image into one or several meaningful regions or objects. This process is more difficult when the objects to be segmented are moving and nonrigid, and even more so when there are severe occlusions. The shape of nonrigid, moving objects may vary considerably along image sequences due to, for instance, deformations or occlusions, which puts additional constraints on the segmentation process. In particular, we would like to distinguish real shape deformations of the object from apparent shape deformations due to occlusions.

A number of methods have been proposed and applied to this problem. Active contours are powerful methods for image segmentation; either boundary-based, such as geodesic active contours [33], or region-based, such as the Chan-Vese model [35], which are formulated as variational problems. Those variational formulations perform quite well and are often implemented using level sets. Active contour based segmentation methods often fail due to noise, clutter and occlusion. In order to make the segmentation process robust against these effects, it has been proposed to incorporate shape priors into the segmentation process. In recent years, many researchers have successfully introduced shape priors into segmentation methods, such as in [36, 44, 46, 43, 42, 165, 120].

We are interested in segmenting nonrigid moving objects in image sequences. When the objects are nonrigid, an appropriate segmentation method that can deal with shape deformations should be used. The application of active contour methods for segmentation in image sequences gives promising results, as in [137, 153, 154]. These methods use variants of the classical Chan-Vese model as the basis for segmentation. In [137], for instance, it is proposed to simply use the result from one image as an initializer in the segmentation of the next.

Another major problem for segmentation methods for image sequences is the presence of occlusions. Minor occlusions can usually be handled by some kind of shape prior. However, major occlusions are still a big problem. In order to improve the robustness of segmentation methods in the presence of occlusions, it is necessary to detect the occlusions. The occluded area can then either be excluded from the segmentation process or reconstructed [185, 70, 114].
The main purpose of this paper is to propose and analyze a novel variational segmentation method for image sequences that can deal with shape deformations and at the same time is robust to noise, clutter and occlusions. The proposed method is based on minimizing an energy functional containing the standard Chan-Vese functional as one part and a term that penalizes the deviation from the previous shape as a second part. The second part of the functional is based on a transformed distance map to the previous contour, where different transformation groups, such as Euclidean, similarity or affine, can be used depending on the particular application. This variational framework is then augmented with a novel contour flow algorithm, giving a mapping of the intensities inside the contour of one image to the inside of the contour in the next image. Using this mapping, occlusions can be detected by simply thresholding the difference between the transformed intensities and the observed ones in the new image.

This paper is organized as follows: in Sect. 6.2 we discuss the proposed segmentation of image sequences. The variational contour matching is described in Sect. 6.3, and how this can be used to detect and locate occlusions is described in Sect. 6.4. Experimental results of the model are presented in Sect. 6.5, and we end the paper with some conclusions.
6.2 Segmentation of Image Sequences

In this section, we describe the region-based segmentation model of Chan-Vese [35] and a variational model for updating segmentation results from one frame to the next in an image sequence.
6.2.1 Region-Based Segmentation
The idea of the Chan-Vese model [35] is to find a contour Γ such that the image I is optimally approximated by a gray scale value µint on int(Γ), the inside of Γ, and by another gray scale value µext on ext(Γ), the outside of Γ. The optimal contour Γ* is defined as the solution of the variational problem

E_CV(Γ*) = min_Γ E_CV(Γ),    (6.1)

where E_CV is the Chan-Vese functional,

E_CV(Γ) = α|Γ| + β( (1/2)∫_{int(Γ)} (I(x) − µint)² dx + (1/2)∫_{ext(Γ)} (I(x) − µext)² dx ).    (6.2)
Here |Γ| is the arc length of the contour, α, β > 0 are weight parameters, and

µint = µint(Γ) = (1/|int(Γ)|) ∫_{int(Γ)} I(x) dx,    (6.3)

µext = µext(Γ) = (1/|ext(Γ)|) ∫_{ext(Γ)} I(x) dx.    (6.4)
The gradient descent flow for the problem of minimizing the functional E_CV(Γ) is the solution to the initial value problem

(d/dt)Γ(t) = −∇E_CV(Γ(t)),   Γ(0) = Γ0,    (6.5)

where Γ0 is an initial contour. Here ∇E_CV(Γ) is the L2-gradient of the energy functional E_CV(Γ); cf. e.g. [179] for definitions of these notions. The L2-gradient of E_CV is

∇E_CV(Γ) = ακ + β( (1/2)(I − µint(Γ))² − (1/2)(I − µext(Γ))² ),    (6.6)

where κ is the curvature.
where κ is the curvature.
In the level set framework [151], a curve evolution, t 7→ Γ(t), can be
represented by a time dependent level set function φ : R2 × R → R as
Γ(t) = {x ∈ R2 ; φ(x, t) = 0}, φ(x) < 0 and φ(x) > 0 are the regions inside
and the outside of Γ, respectively. The normal velocity of t 7→ Γ(t) is the
scalar function dΓ/dt defined by
d
∂φ(x, t)/∂t
Γ(t)(x) := −
dt
|∇φ(x, t)|
(x ∈ Γ(t)).
(6.7)
Recall that the outward unit normal n and the curvature
κ can be expressed
in terms of φ as n = ∇φ/|∇φ| and κ = ∇ · ∇φ/|∇φ| .
Combined with the definition of gradient descent evolutions (6.5) and the
formula for the normal velocity (6.7) this gives the gradient descent procedure
in the level set framework:
1
1
∂φ 2
2
= ακ + β (I − µint (Γ)) − (I − µext (Γ)) |∇φ|,
∂t
2
2
where φ(x, 0) = φ0 (x) represents the initial contour Γ0 .
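A sketch of one explicit time step of this level set update, with finite differences standing in for the curvature and the region means; it assumes both regions are non-empty, and the time step and stabilizing epsilon are our choices.

import numpy as np

def chan_vese_step(phi, I, alpha, beta, dt):
    inside = phi < 0
    mu_int, mu_ext = I[inside].mean(), I[~inside].mean()
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx**2 + gy**2) + 1e-8
    # curvature kappa = div(grad(phi)/|grad(phi)|)
    kappa = np.gradient(gx / norm, axis=1) + np.gradient(gy / norm, axis=0)
    force = alpha * kappa + beta * 0.5 * ((I - mu_int)**2 - (I - mu_ext)**2)
    return phi + dt * force * norm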
6.2.2 The Interaction Term
The interaction E_I(Γ0, Γ) between a fixed contour Γ0 and an active contour Γ may be regarded as a shape prior and can be chosen in several different ways, such as the pseudo-distances, cf. [43], or the area of the symmetric difference of the sets int(Γ) and int(Γ0), cf. [36].

Let φ0 : D → R denote the signed distance function associated with the contour Γ0, and let a ∈ R² be a translation vector. We want to determine the optimal translation a = a(Γ); the interaction E_I = E_I(Γ0, Γ) is then defined by the formula

E_I(Γ0, Γ) = min_a ∫_{int(Γ)} φ0(x − a) dx.    (6.8)

Minimizing over a group of transformations is the standard device to obtain pose-invariant interactions, see [36] and [43].

Since this is an optimization problem, a(Γ) can be found using a gradient descent procedure. The optimal translation a(Γ) can be obtained as the limit, as time t tends to infinity, of the solution to the initial value problem

ȧ(t) = ∫_{int(Γ)} ∇φ0(x − a(t)) dx,   a(0) = 0.    (6.9)

Similar gradient descent schemes can be devised for rotations and scalings (in the case of similarity transforms), cf. [36].
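A sketch of the descent (6.9) on a pixel grid, where phi0 is the signed distance map of Γ0 and mask is a boolean image of int(Γ); the step size, iteration count, normalization by the region area and the use of scipy.ndimage.shift for evaluating ∇φ0(x − a) are our choices.

import numpy as np
from scipy.ndimage import shift as nd_shift

def optimal_translation(phi0, mask, n_iter=200, step=0.5):
    gy, gx = np.gradient(phi0)         # grad phi0 (rows, cols)
    a = np.zeros(2)                    # translation a = (a_row, a_col)
    area = max(mask.sum(), 1)
    for _ in range(n_iter):
        # integral of grad phi0(x - a) over int(Gamma), cf. Eq. (6.9)
        da = np.array([nd_shift(gy, a, order=1)[mask].sum(),
                       nd_shift(gx, a, order=1)[mask].sum()])
        a += step * da / area          # normalized descent step
        if np.linalg.norm(da) / area < 1e-6:
            break
    return a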
6.2.3 Using the Interaction Term in Segmentation of Image Sequences
Let I_j : D → R, j = 1, ..., N, be a succession of N frames from a given image sequence. Also, for some integer k, 1 ≤ k ≤ N, suppose that all the frames I_1, I_2, ..., I_{k−1} have already been segmented, such that the corresponding contours Γ_1, Γ_2, ..., Γ_{k−1} are available. In order to take advantage of the prior knowledge obtained from earlier frames in the segmentation of I_k, we propose the following method: If k = 1, i.e. if no previous frames have been segmented, then we just use the standard Chan-Vese model, as presented in Sect. 6.2.1. If k > 1, then the segmentation of I_k is given by the contour Γ_k which minimizes an augmented Chan-Vese functional of the form

E_CV^A(Γ_{k−1}, Γ_k) := E_CV(Γ_k) + γ E_I(Γ_{k−1}, Γ_k),    (6.10)

where E_CV is the Chan-Vese functional, E_I = E_I(Γ_{k−1}, Γ_k) is an interaction term which penalizes deviations of the current active contour Γ_k from the previous one, Γ_{k−1}, and γ > 0 is a coupling constant which determines the strength of the interaction.
The augmented Chan-Vese functional (6.10) is minimized using the standard gradient descent (6.5) described in Sect. 6.2.1 with ∇E equal to

∇E_CV^A(Γ_{k−1}, Γ_k) := ∇E_CV(Γ_k) + γ∇E_I(Γ_{k−1}, Γ_k),    (6.11)

and the initial contour Γ(0) = Γ_{k−1}. Here ∇E_CV is the L2-gradient (6.6) of the Chan-Vese functional, and ∇E_I is the L2-gradient of the interaction term, which is given by the formula

∇E_I(Γ_{k−1}, Γ_k; x) = φ_{k−1}(x − a(Γ_k))   (for x ∈ Γ_k),    (6.12)

where φ_{k−1} is the signed distance function for Γ_{k−1}.

We use the Chan-Vese model to segment a selected object with approximately uniform intensity and apply the proposed method frame-by-frame. First we compute the optimal translation vector (6.9) based on the previous contour; we then use this vector to translate the previous contour until it is aligned to the optimal position (6.12). Then the minimum of the functional (6.10) is obtained by the gradient descent procedure (6.11), implemented in the level set framework outlined in Sect. 6.2. This procedure is iterated until convergence.
6.3 A Contour Matching Problem
In this section we present a variational solution to the following contour matching problem: Suppose we have two simple closed curves Γ1 and Γ2 contained in the image domain Ω. Find the "most economical" mapping Φ = Φ(x) : Ω → R² such that Φ maps Γ1 onto Γ2, i.e. Φ(Γ1) = Γ2. The latter condition is to be understood in the sense that if α = α(s) : [0, 1] → Ω is a positively oriented parametrization of Γ1, then β(s) = Φ(α(s)) : [0, 1] → Ω is a positively oriented parametrization of Γ2 (allowing some parts of Γ2 to be covered multiple times).

To present our variational solution of this problem, let M denote the set of twice differentiable mappings Φ which map Γ1 to Γ2 in the above sense. Loosely speaking,

M = {Φ ∈ C²(Ω; R²) | Φ(Γ1) = Γ2}.

Moreover, given a mapping Φ : Ω → R², not necessarily a member of M, we express Φ in the form Φ(x) = x + U(x), where the vector valued function U = U(x) : Ω → R² is called the displacement field associated with Φ, or simply the displacement field. It is sometimes necessary to write out the components of the displacement field: U(x) = (u1(x), u2(x))^T.
We now define the "most economical" map to be the member Φ* of M which minimizes the following energy functional:

E[Φ] = (1/2) ∫_Ω ‖DU(x)‖²_F dx,    (6.13)

where ‖DU(x)‖_F denotes the Frobenius norm of DU(x) = [∇u1(x), ∇u2(x)]^T, which for an arbitrary matrix A ∈ R^{2×2} is defined by ‖A‖²_F = tr(A^T A). That is, the optimal matching is given by

Φ* = arg min_{Φ∈M} E[Φ].    (6.14)

Using that E[Φ] can be written in the form

E[Φ] = (1/2) ∫_Ω |∇u1(x)|² + |∇u2(x)|² dx,    (6.15)
it is easy to see that the Gâteaux derivative of E[Φ] is given by

dE[Φ; V] = ∫_Ω ∇u1(x)·∇v1(x) + ∇u2(x)·∇v2(x) dx = ∫_Ω tr(DU(x)^T DV(x)) dx,

for any displacement field V(x) = (v1(x), v2(x))^T. After integration by parts we find that the necessary condition for Φ*(x) = x + U*(x) to be a solution of the minimization problem (6.14) takes the form

0 = −∫_Ω ΔU*(x)·V(x) dx,    (6.16)

for any admissible displacement field variation V = V(x). Here ΔU*(x) = (Δu1(x), Δu2(x))^T is the Laplacian of the vector valued function U* = U*(x). Since every admissible mapping Φ must map the initial contour Γ1 onto the target contour Γ2, it can be shown that any displacement field variation V must satisfy

V(x) · n_{Γ2}(x + U*(x)) = 0   for all x ∈ Γ1.    (6.17)
Notice that this condition only has to be satisfied precisely on the curve Γ1, and that V = V(x) is allowed to vary freely away from the initial contour. The interpretation of the above condition is that the displacement field variation at x ∈ Γ1 must be tangent to the target contour Γ2 at the point y = Φ(x). In view of this interpretation of (6.17), it is not difficult to see that the necessary condition (6.16) implies that the solution Φ* of the minimization problem (6.14) must satisfy the following Euler-Lagrange equation:

0 = ΔU* − (ΔU* · n*_{Γ2}) n*_{Γ2}   on Γ1,
0 = ΔU*   otherwise,    (6.18)

where n*_{Γ2}(x) = n_{Γ2}(x + U*(x)), x ∈ Γ1, is the pullback of the normal field of the target contour Γ2 to the initial contour Γ1. The standard way of solving (6.18) is to use the gradient descent method: Let U = U(t, x) be the time-dependent displacement field which solves the evolution PDE

∂U/∂t = ΔU − (ΔU · n*_{Γ2}) n*_{Γ2}   on Γ1,
∂U/∂t = ΔU   otherwise,    (6.19)

where the initial displacement U(0, x) = U0(x) ∈ M is specified by the user, and U = 0 on ∂Ω, the boundary of Ω (Dirichlet boundary condition). Then U*(x) = lim_{t→∞} U(t, x) is a solution of the Euler-Lagrange equation (6.18). Notice that the PDE (6.19) coincides with the so-called geometry-constrained diffusion introduced in [5]. Thus we have incidentally found a variational formulation of the non-rigid registration problem considered there.
Figure 6.1: Given two closed curves Γ1 and Γ2 contained in two images F1 and F2, Φ maps F1 onto F2 such that Γ1 is mapped onto Γ2 (i.e. Φ(Γ1) = Γ2).
6.4 Detect and Locate the Occlusion
The mapping Φ = Φ(x) : Ω → R², such that Φ maps Γ1 onto Γ2, is an estimate of the displacement (motion and deformation) of the boundary of an object between two frames. By finding the displacement of the contour, a consistent displacement of the intensities inside the closed curve Γ1 can also be found: Φ maps Γ1 onto Γ2, and pixels inside Γ1 are mapped inside Γ2. This displacement field, which only depends on the displacement - or registration - of the contour (and not on the image intensities), can then be used to map the intensities inside Γ1 into Γ2. After mapping, the intensities inside Γ1 and Γ2 can be compared and classified as having the same or a different value. Since we can still find the contour in the occluded area, we can also compute the displacement field even there.

After the occlusion has been detected, the segmentation can be further improved by again employing the previously described Chan-Vese method augmented with an interaction term. However, in this second stage, the integration is only performed over the area of the image where no occlusion has been detected. This procedure treats the occluded area in the same way as a part of the image with missing data, as in [12], which is reasonable.
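A sketch of the detection step: warp the intensities of the previous frame with the estimated displacement field and threshold the difference inside the new contour. The backward-sampling convention and the threshold are our choices.

import numpy as np
from scipy.ndimage import map_coordinates

def detect_occlusion(I1, I2, U, mask2, thresh):
    # Predict frame 2 by sampling frame 1 at x - U(x) (backward warping),
    # then flag pixels inside Gamma2 whose prediction deviates strongly.
    H, W = I1.shape
    rows, cols = np.mgrid[0:H, 0:W].astype(float)
    predicted = map_coordinates(I1, [rows - U[0], cols - U[1]], order=1)
    return (np.abs(predicted - I2) > thresh) & mask2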
6.5 Experiments

6.5.1 Segmentation
In this section we present the results obtained from experiments on synthetic and real image sequences. We use the Chan-Vese model to segment a selected object with approximately uniform intensity and apply the proposed method frame-by-frame. The minimization of the functional is obtained by the gradient descent procedure (6.11) implemented in the level set framework; see also [151].

The classical Chan-Vese method will have problems segmenting an object if occlusions appear in the image which cover the whole or parts of the selected object. In Fig. 6.2 and Fig. 6.5, we show the segmentation results for a non-rigid object in a synthetic image sequence and for a walking human in a real image sequence (available at http://homepages.inf.ed.ac.uk/rbf/CAVIAR/), respectively, where occlusions occur. The classical Chan-Vese method fails to segment the selected object when it reaches the occlusion (left column). Using the proposed method, which uses the frame-to-frame interaction term, we obtain much better results (right column).
In both experiments the coupling constant γ is varied to see the influence of the interaction term on the segmentation results. The contour is only slightly affected by the prior if γ is small. On the other hand, if γ is too large, the contour will be close to a similarity transformed version of the prior.

Figure 6.2: Segmentation of a non-rigid object in a synthetic image sequence with additive Gaussian noise (Frames 1-7). Without the interaction term, noise in the occlusion is captured (left column). This is avoided when the interaction term is included (right column).

Figure 6.3: Left: Deformation field. Right: Frame 4 after deformation according to the displacement field onto Frame 5.

Figure 6.4: The occluded regions of Frames 3-6 of Fig. 6.2 can be detected and located.
6.5.2 Contour Matching and Occlusion Detection
As described in Sect. 6.3 and Sect. 6.4, occlusions can be detected and located by deforming the current frame according to the displacement field and comparing the deformed frame with the next frame (inside the contour Γ2). First we compute the displacement field based on the segmentation results of two frames. In Fig. 6.3, we show the displacement field between Frames 4 and 5. With this displacement field, we can deform Frame 4 fully onto Frame 5 (Fig. 6.3, right) and then compare the intensities of Frame 5 and the deformed Frame 4. We can then classify the intensities as having the same or a different value by thresholding. The results for the artificial sequence are presented in Fig. 6.4 and for the walking person sequence in Fig. 6.6.
Figure 6.5: Segmentation of a person covered by an occlusion in the human walking sequence. Left column: without the interaction term; right column: with the interaction term.

Figure 6.6: The occluded regions of Frames 3 and 4 of Fig. 6.5 are detected and located by predicting the intensities inside the contour of the walking person.
6.6 Conclusions
We have presented a new method for segmentation and contour matching of image sequences containing nonrigid, moving objects that can also handle occlusions. The proposed segmentation method is formulated as a variational problem, with one part of the functional corresponding to the Chan-Vese model and another part corresponding to the pose-invariant interaction with a shape prior based on the previous contour. The optimal transformation as well as the shape deformation are determined by minimization of an energy functional using a gradient descent scheme. This segmentation method is augmented with a contour flow estimation algorithm based on a novel variational formulation. The estimated contour flow makes it possible to extract occluded areas and then further refine the segmentation. Preliminary results are shown, and the performance looks promising both in terms of segmentation and occlusion detection.
Acknowledgements.
This research is funded by the VISIONTRAIN RTN-CT-2004-005439 Marie
Curie Action within the EC’s FP6.
Bibliography
[1] L. Alvarez, Y. Gousseau, and J.-M. Morel. Scales in natural images
and a consequence on their bounded variation norm. In Proceedings of
Scale Space Methods in Computer Vision SS, pages 247–258, 1999.
[2] L. Alvarez, Y. Gousseau, and J.-M. Morel. The size of objects in natural and artificial images. Advances in Imaging and Electron Physics,
(111):167–242, 1999.
[3] L. Alvarez, Y. Gousseau, and J.-M. Morel. The size of objects in natural
images. Technical Report CMLA9921, CMLA, 1999.
[4] P. R. Andresen and M. Nielsen. Non-rigid registration by geometry-constrained diffusion. Medical Image Analysis, 5(2):81–88, 2001.
[5] Per R. Andresen and Mads Nielsen. Non-rigid registration by geometry-constrained diffusion. In MICCAI '99: Proceedings of the Second International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 533–543, London, UK, 1999. Springer-Verlag.
[6] G. Aubert and P. Kornprobst. Mathematical Problems in Image Processing: Partial Differential Equations and the Calculus of Variations (second edition), volume 147 of Applied Mathematical Sciences.
Springer-Verlag, 2006.
[7] Jean-François Aujol, Gilles Aubert, Laure Blanc-Féraud, and Antonin Chambolle. Image decomposition into a bounded variation component and an oscillating component. Journal of Mathematical Imaging and Vision, 22(1):71–88, January 2005.
[8] Jean-François Aujol, Guy Gilboa, Tony Chan, and Stanley Osher. Structure-texture image decomposition: modeling, algorithms, and parameter selection. International Journal of Computer Vision, 67(1):111–136, April 2006.
[9] R. Baddeley. The correlational structure of natural images and the
calibration of spatial representations. Cognitive Science, 21:351–372,
1997.
[10] C. Ballester, B. Bertalmio, V. Caselles, L. Garrido, A. Marques, and F. Ranchin. An inpainting-based deinterlacing method. IEEE Transactions On Image Processing, 16(10):2476–2491, October 2007.
[11] C. Ballester, M. Bertalmio, V. Caselles, G. Sapiro, and J. Verdera.
Filling-in by joint interpolation of vector fields and gray levels. IEEE
Transactions On Image Processing, 10(8):1200–1211, August 2001.
[12] C. Ballester, V. Caselles, and J. Verdera. A variational model for
disocclusion. In Proceeding ICIP (3), pages 677–680, 2003.
[13] J.L. Barron, D.J. Fleet, and S.S. Beauchemin. Performance of optical
flow techniques. International Journal of Computer Vision, 12(1):43–
77, February 1994.
[14] R. Basri, L. Costa, D. Geiger, and D. Jacobs. Determining the similarity of deformable shapes. In Proceedings of ICCV Workshop on
Physics-Based Modeling in Computer Vision, pages 135–143, 1995.
[15] M. Bertalmio, L.A. Vese, G. Sapiro, and S.J. Osher. Image filling-in in a
decomposition space. In Proceedings of IEEE International Conference
on Image Processing (ICIP), pages I: 853–856, 2003.
[16] Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting. In SIGGRAPH '00: Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 417–424, New York, NY, USA, 2000. ACM Press/Addison-Wesley Publishing Co.
[17] Marcelo Bertalmio, Luminita Vese, Guillermo Sapiro, and Stanley Osher. Simultaneous structure and texture image inpainting. IEEE Transactions On Image Processing, 12(8):882–889, August 2003.
[18] M Bertero, Tomaso Poggio, and Vincent Torre. Ill-posed problems in
early vision. Technical Report A.I. Memo 924, MIT, May 1987.
[19] Julian Besag. On the statistical analysis of dirty pictures. Journal
of the Royal Statistical Society. Series B (Methodological), 48:259–302,
1986.
[20] Josef Bigun. Vision with Direction - A Systematic Introduction to
Image Processing and Computer Vision. Springer-Verlag, 2006.
[21] J. S. De Bonet. Multiresolution sampling procedure for analysis and
synthesis of texture images. In Computer Graphics, pages 361–368.
ACM SIGGRAPH, 1997.
[22] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[23] Pierre Brémaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Number 31 in TAM. Springer-Verlag, 1999.
[24] Lisa Gottesfeld Brown. A survey of image registration techniques. ACM
Comput. Surv., 24(4):325–376, 1992.
[25] Thomas Brox, Andres Bruhn, Nils Papenberg, and Joachim Weickert.
High accuracy optical flow estimation based on a theory for warping. In
Tomas Pajdla and Jiri Matas, editors, Proc. 8th European Conference
on Computer Vision (ECCV 04), volume 4, pages 25–36. Springer-Verlag, May 2004.
[26] Antoni Buades, A. Chien, Jean-Michel Morel, and Stanley Osher.
Topology preserving linear filtering applied to medical imaging. SIAM
Journal on Imaging Sciences, 1(1):26–50, 2008.
[27] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A non-local
algorithm for image denoising. In CVPR ’05: Proceedings of the 2005
IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’05) - Volume 2, pages 60–65, Washington, DC,
USA, 2005. IEEE Computer Society.
[28] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. Neighborhood filters and PDE's. Numer. Math., 105(1):1–34, 2006.
[29] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. Nonlocal image and movie denoising. International Journal of Computer Vision,
76(2):123–139, February 2008.
[30] R.W. Buccigrossi and E. P. Simoncelli. Image compression via joint
statistical characterization in the wavelet domain. IEEE Transactions
On Image Processing, 8(12):1688–1701, December 1999.
[31] Aurelia Bugeau and Marcelo Bertalmio. Combining texture synthesis
and diffusion for image inpainting. In International Conference on
Computer Vision Theory and Applications (VISAPP), 2009.
[32] P.J. Burt. Fast filter transforms for image processing. Computer Vision
Graphics and Image Processing, 16(1):20–51, May 1981.
[33] V. Caselles, R. Kimmel, and G. Sapiro. Geodesic active contours.
International Journal of Computer Vision, 22(1):61–79, 1997.
[34] Antonin Chambolle. An algorithm for total variation minimization and
applications. Journal of Mathematical Imaging and Vision, 20(1-2):89–
97, 2004.
[35] T. Chan and L. Vese. Active contours without edges. IEEE Transactions On Image Processing, 10(2):266–277, 2001.
[36] T. Chan and W. Zhu. Level set based prior segmentation. Technical
Report 03-66, Department of Mathematics, UCLA, 2003.
[37] Tony F. Chan and Sung Ha Kang. Error analysis for image inpainting.
Journal of Mathematical Imaging and Vision, 26(1-2):85–103, 2006.
[38] Tony F. Chan and Jianhong Shen. Mathematical models for local nontexture inpaintings. SIAM Journal of Applied Mathematics,
62(3):1019–1043, 2001.
[39] Tony F. Chan and Jianhong Shen. Variational image inpainting. Communications on Pure and Applied Mathematics, 58, February 2005.
[40] Tony F Chan and Jianhong Shen. Image Processing and Analysis variational, PDE, wavelet, and stochastic methods. SIAM, 2006.
[41] T.F. Cootes, G.J. Edwards, and C.J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence,
23(6):681–685, June 2001.
[42] D. Cremers and G. Funka-Lea. Dynamical statistical shape priors for
level set based sequence segmentation. In 3rd Workshop on Variational
and Level Set Methods in Computer Vision, LNCS 3752, pages 210–
221. Springer Verlag, 2005.
[43] D. Cremers and S. Soatto. A pseudo-distance for shape priors in level
set segmentation. In O. Faugeras and N. Paragios, editors, 2nd IEEE
Workshop on Variational, Geometric and Level Set Methods in Computer Vision, 2003.
[44] Daniel Cremers. Statistical Shape Knowledge in Variational Image Segmentation. PhD thesis, Department of Mathematics and Computer Science, University of Mannheim, July 2002.
[45] Daniel Cremers. Dynamical statistical shape priors for level set-based
tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8):1262–1273, 2006.
[46] Daniel Cremers, Nil Sochen, and Christoph Schnörr.
Towards
recognition-based variational segmentation using shape priors and dynamic labeling. In Scale Space 2003, LNCS 2695, pages 388–400.
Springer Verlag, 2003.
[47] A. Criminisi, P. Pérez, and K. Toyama. Object removal by exemplar-based inpainting. In Conference on Computer Vision and Pattern Recognition, CVPR 03, volume 2, pages 721–728, 2003.
[48] A. Criminisi, P. Pérez, and K. Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions On Image Processing, 13(9):1200–1212, September 2004.
[49] Ann Cuzol, Kim S. Pedersen, and Mads Nielsen. Field of particle filters
for image inpainting. Journal of Mathematical Imaging and Vision,
31(2-3):147–156, July 2008.
[50] P. Dani and S. Chaudhuri. Automated assembling of images: Image
montage preparation. 28(3):431–445, March 1995.
[51] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis
and transfer. In Proceedings of SIGGRAPH, Los Angeles, California,
USA, August 2001.
[52] Alexei A. Efros and Thomas K. Leung. Texture synthesis by nonparametric sampling. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 1033–1038, Corfu, Greece,
September 1999.
[53] M. Elad and M. Aharon. Image denoising via sparse and redundant
representations over learned dictionaries. IEEE Transactions On Image
Processing, 15(12):3736–3745, December 2006.
[54] M. Elad, J. Starck, P. Querre, and D. Donoho. Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA). Applied and Computational Harmonic Analysis, 19(3):340–358, November 2005.
[55] Lars Eldén. Matrix Methods in Data Mining and Pattern Recognition (Fundamentals of Algorithms). Society for Industrial and Applied
Mathematics, Philadelphia, PA, USA, 2007.
[56] Cheng-en Guo, Song-Chun Zhu, and Ying Nian Wu. Towards a mathematical theory of primal sketch and sketchability. In Proceedings of IEEE International Conference on Computer Vision (ICCV), volume II, pages 1228–1235, 2003.
[57] Cheng-en Guo, Song-Chun Zhu, and Ying Nian Wu. Primal sketch: Integrating structure and texture. Comput. Vis. Image Underst., 106(1):5–19, 2007.
[58] David J Field. Relations between the statistics of natural images and
the response properties of cortical cells. Journal of the Optical Society
of America A, 4:2379–2394, December 1987.
[59] Martin A. Fischler and Robert C. Bolles. Random sample consensus:
a paradigm for model fitting with applications to image analysis and
automated cartography. Communications of the ACM, 24(6):381–395,
1981.
[60] Luc Florack. Image Structure, volume 10 of Computational Imaging and Vision. Kluwer Academic Publishers, 1997.
[61] Luc Florack, R Duits, and J Bierkens. Tikhonov regularization versus scale space: A new result. In Proceedings of IEEE International
Conference on Image Processing (ICIP), pages 271–274, 2004.
[62] William T. Freeman, Egon C. Pasztor, and Owen T. Carmichael.
Learning low-level vision. Int. J. Comput. Vision, 40(1):25–47, 2000.
[63] Mario Fritz, Eric Hayman, Barbara Caputo, and Jan-Olof Eklundh. The KTH-TIPS database (textures under varying illumination, pose and scale). http://www.nada.kth.se/cvap/databases/kth-tips/index.html, 2004.
[64] K. Fundana, N.C. Overgaard, and A. Heyden. Variational segmentation
of image sequences using region-based active contours and deformable
shape priors. International Journal of Computer Vision, 80(3), December 2008.
[65] Ketut Fundana, Niels Chr. Overgaard, Anders Heyden, David Gustavsson, and Mads Nielsen. Nonrigid object segmentation and occlusion detection in image sequences. In 3rd International Conference on
Computer Vision Theory and Applications (VISAPP 08), 2008.
[66] Irena Galić, Joachim Weickert, Martin Welk, Andrés Bruhn, Alexander
Belyaev, and Hans-Peter Seidel. Image compression with anisotropic
diffusion. J. Math. Imaging Vis., 31(2-3):255–269, 2008.
[67] A. Gangal and B. Dizdaroglu. Automatic restoration of old motion
picture films using spatiotemporal exemplar-based inpainting. In Advanced Concepts for Intelligent Vision Systems ACIVS, pages 55–66,
2006.
[68] I. M. Gelfand and S. V. Fomin. Calculus of variations. Dover, 1963.
[69] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741, 1984.
[70] C. Gentile, O. Camps, and M. Sznaier. Segmentation for robust tracking in the presence of severe occlusion. IEEE Transactions On Image
Processing, 13(2):166–178, 2004.
[71] J.M. Geusebroek. The stochastic structure of images. In Proceedings
of Scale Space Methods in Computer Vision SS, pages 327–338, 2005.
[72] J.M. Geusebroek and A.W.M. Smeulders. Fragmentation in the vision of scenes. In Proceedings of IEEE International Conference on
Computer Vision (ICCV), pages 130–135, 2003.
[73] J.M. Geusebroek and A.W.M. Smeulders. A six-stimulus theory for stochastic texture. International Journal of Computer Vision, 62(1-2):7–16, April 2005.
[74] Chris A. Glasbey and Kanti V. Mardia. A review of image-warping
methods. Journal of Applied Statistics, 25(2):155–171, April 1998.
[75] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns
Hopkins, 3rd edition, 1996.
[76] Y. Gousseau and J-M. Morel. Are natural images of bounded variation? SIAM Journal of Mathematical Analysis, 33(3):634–648, 2001.
[77] Y. Gousseau and F. Roueff. Modeling occlusion and scaling in natural images. SIAM Journal of Multiscale Modeling and Simulation, 6(1):105–134, 2007.
[78] Ulf Grenander and Anuj Srivastava. Probability models for clutter in
natural images. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 23(4):424–429, 2001.
[79] Ulf Grenander, Anuj Srivastava, and Michael Miller. Asymptotic performance analysis of bayesian object recognition. IEEE Transactions
on Information Theory, 46(4):1658–1666, 2000.
[80] Lewis D. Griffin. Scale-imprecision space. Image Vision Comput.,
15(5):369–398, 1997.
[81] David Gustavsson. Multi-scale texture and geometric structure image database - MS-GTI-I. Technical Report DIKU-666, DIKU, 2008. To be written; will contain the collection procedure and contents.
[82] David Gustavsson, Ketut Fundana, Niels-Ch. Overgaard, and Mads
Nielsen. Variational segmentation and contour matching of non-rigid
moving object. In Workshop on Dynamical Vision 2007, 2007.
[83] David Gustavsson, Kim S. Pedersen, Francois Lauze, and Mads
Nielsen. On the rate of structural change in scale spaces. In Proceedings of Scale Space and Variational Methods in Computer Vision
SSVM, 2009.
[84] David Gustavsson, Kim S. Pedersen, and Mads Nielsen. Geometric and texture inpainting by Gibbs sampling. In SSBA-2007, 2007.
[85] David Gustavsson, Kim S. Pedersen, and Mads Nielsen. A SVD based
image complexity measure. In International Conference on Computer
Vision Theory and Applications (VISAPP), 2009.
[86] David Gustavsson, Kim Steenstrup Pedersen, and Mads Nielsen. Image
inpainting by cooling and heating. In Bjarne Ersbøll and Kim Steenstrup Pedersen, editors, Scandinavian Conference on Image Analysis
(SCIA ’07), volume 4522 of Lecture Notes in Computer Science, pages
591–600. Springer Verlag, June 2007.
[87] Per Christian Hansen. Rank-Deficient and Discrete Ill-Posed Problems.
Numerical Aspects of Linear Inversion. SIAM, Philadelphia, 1998.
[88] Per Christian Hansen. The L-curve and its use in the numerical treatment of inverse problems. In P. Johnston, editor, Computational Inverse Problems in Electrocardiology, Advances in Computational Bioengineering, pages 119–142. WIT Press, 2000.
[89] Per Christian Hansen and Dianne Prost O’Leary. The use of the L-curve
in the regularization of discrete ill-posed problems. SIAM Journal on
Scientific Computing, 14(6):1487–1503, 1993.
[90] C. Harris and M. Stephens. A combined corner and edge detector.
In Proceedings of The Fourth Alvey Vision Conference, pages 147–151,
1988.
[91] David J. Heeger and James R. Bergen. Pyramid-based texture analysis/synthesis. In SIGGRAPH ’95: Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pages 229–
238, New York, NY, USA, 1995. ACM.
[92] Ellen C. Hildreth. Computations underlying the measurement of visual
motion. pages 99–146, 1987.
[93] Berthold K. P. Horn and Brian G. Schunck. Determining optical flow.
Artificial Intelligence, 17(1-3):185–203, 1981.
[94] Jinggang Huang and David Mumford. Statistics of natural images
and models. Proceedings of IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 01:1541, 1999.
[95] X.S. Huang, S.Z. Li, and Y.S. Wang. Evaluation of face alignment
solutions using statistical learning. In Proceedings of International
Conference on Automatic Face and Gesture Recognition AFGR, pages
213–218, 2004.
[96] Aapo Hyvärinen. Survey on independent component analysis. Neural
Computing Surveys, 2:94–128, 1999.
[97] Aapo Hyvärinen and Erkki Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
[98] T. Iijima. Basic theory on normalization of a pattern. Bulletin of the Electrotechnical Laboratory, 26:368–388, 1962. In Japanese.
[99] M. Irani and P. Anandan. All about direct methods. In B. Triggs,
A. Zisserman, and R. Szeliski, editors, Workshop on Vision Algorithms:
Theory and practice. Springer-Verlag, 1999.
[100] Anil K. Jain and Farshid Farrokhnia. Unsupervised texture segmentation using Gabor filters. Pattern Recognition, 24(12):1167–1186, 1991.
[101] B. Julesz and R. Bergen. Textons, the elements of texture perception,
and their interactions. Nature, 290:91–97, 1981.
[102] Christian Jutten and Jeanny Herault. Blind separation of sources, part
1: an adaptive algorithm based on neuromimetic architecture. Signal
Process., 24(1):1–10, 1991.
[103] Jari Kaipio and Erkki Somersalo. Statistical and Computational Inverse
Problems, volume 160 of Applied Mathematical Sciences. Springer, Berlin,
2004.
[104] Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes:
Active contour models. International Journal of Computer Vision,
1(4):321–331, January 1988.
[105] S.H. Keller, F. Lauze, and M. Nielsen. Deinterlacing using variational methods. IEEE Transactions On Image Processing, 17(11):1–14,
November 2008.
[106] Sune Keller, Francois Lauze, and Mads Nielsen. Motion compensated
video super resolution. In Proceedings of Scale Space and Variational
Methods in Computer Vision SSVM, pages 801–812, 2007.
[107] Sune H. Keller, Francois Lauze, and Mads Nielsen. A total variation
motion adaptive deinterlacing scheme. In Proceedings of Scale Space
Methods in Computer Vision SS, pages 408–418, 2005.
[108] Sune Høgild Keller. Video Upscaling Using Variational Methods. PhD
thesis, University of Copenhagen, 2007.
[109] Michael Kirby. Geometric Data Analysis: An Empirical Approach to
Dimensionality Reduction and the Study of Patterns. John Wiley &
Sons, Inc., New York, NY, USA, 2000.
[110] Josef Kittler and J. Föglein. Contextual classification of multispectral
pixel data. Image and Vision Computing, 2(1):13–29, 1984.
[111] Jan J. Koenderink. The structure of images. Biological Cybernetics,
50:363–370, 1984.
[112] Jan J. Koenderink and Andrea J. Van Doorn. The structure of locally orderless images. International Journal of Computer Vision, 31(2-3):159–168, 1999.
[113] V. Kolmogorov and R. Zabih. What energy functions can be minimized
via graph cuts? IEEE Transactions on Pattern Analysis and Machine
Intelligence, 26(2):147–159, February 2004.
[114] J. Konrad and M. Ristivojevic. Video segmentation and occlusion detection over multiple frames. In Image and Video Communications and
Processing 2003, SPIE 5022, pages 377–388. SPIE, 2003.
[115] G. Laccetti, L. Maddalena, and A. Petrosino. Removing line scratches
in digital image sequences by fusion techniques. In International Conference on Image Analysis and Processing ICIAP, pages 695–702, 2005.
[116] Francois B. Lauze. Computational Methods For Motion Recovery, Motion Compensated Inpainting and Applications. PhD thesis, IT University of Copenhagen, 2004.
[117] Ann B. Lee, David Mumford, and Jinggang Huang. Occlusion models
for natural images: A statistical study of a scale-invariant dead leaves
model. International Journal of Computer Vision, 41(1-2):35–59, 2001.
[118] Daniel D. Lee and Sebastian H. Seung. Learning the parts of objects by
non-negative matrix factorization. Nature, 401(6755):788–791, October
1999.
[119] Daniel D. Lee and Sebastian H. Seung. Algorithms for non-negative
matrix factorization. In Proceedings of Conference on Neural Information Processing Systems NIPS, volume 13, pages 556–562, 2001.
[120] M.E. Leventon, W.E.L. Grimson, and O. Faugeras. Statistical shape
influence in geodesic active contours. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages
316–323, 2000.
[121] Stan Z. Li. Markov Random Field Modeling in Image Analysis. Computer Science Workbench. Springer, 2001.
[122] Tony Lindeberg. Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, 1994.
[123] Jun S. Liu. Monte Carlo Strategies in Scientific Computing. Springer
Series in Statistics. Springer-Verlag, 2004.
[124] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[125] J.B.A. Maintz, P.A. van den Elsen, and M.A. Viergever. 3D multimodality medical image registration using morphological tools. Image
and Vision Computing, 19(1-2):53–62, January 2001.
[126] S.G. Mallat. A theory for multiresolution signal decomposition: The
wavelet representation. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 11(7):674–693, July 1989.
[127] P. Markelj, D. Tomazevic, F. Pernus, and B. Likar. Robust gradient-based 3-D/2-D registration of CT and MR to X-ray images. IEEE Transactions On Medical Imaging, 27(12):1704–1714, December 2008.
[128] David Marr. Vision: A computational investigation into the human
representation and processing of visual information. W. H. Freeman,
San Francisco, 1982.
[129] S. Masnou. Disocclusion: a variational approach using level lines. IEEE
Transactions On Image Processing, 11(2):68–76, February 2002.
[130] S. Masnou and Jean-Michel Morel. Level lines based disocclusion. In
Proceedings of IEEE International Conference on Image Processing
(ICIP), pages 259–263, 1998.
[131] G. Matheron. Random Sets and Integral Geometry. John Wiley and
Sons, New York, 1975.
[132] Yves Meyer. Oscillating Patterns in Image Processing and Nonlinear
Evolution Equations: The Fifteenth Dean Jacqueline B. Lewis Memorial Lectures. American Mathematical Society (AMS), Boston, MA,
USA, 2001.
[133] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas,
F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. International Journal of Computer Vision, 65(1-2):43–
72, 2005.
[134] Krystian Mikolajczyk and Cordelia Schmid. Scale & affine invariant
interest point detectors. International Journal of Computer Vision,
60(1):63–86, 2004.
[135] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 27(10):1615–1630, 2005.
[136] Jan Modersitzki. Numerical Methods for Image Registration. Numerical Mathematics and Scientific Computation. Oxford University Press,
2004.
[137] M. Moelich and T. Chan. Tracking objects with the Chan-Vese algorithm. Technical Report 03-14, Department of Mathematics, UCLA,
March 2003.
[138] V. A. Morozov. On the solution of functional equations by the method
of regularization. Soviet Math. Dokl., 7:414–417, 1966.
[139] Pavel Mrázek and Mirko Navara. Selection of optimal stopping time for
nonlinear diffusion filtering. International Journal of Computer Vision,
52(2-3):189–203, 2003.
[140] David Mumford. Bayesian rationale for the variational formulation.
In Bart M. ter Haar Romeny, editor, Geometry-Driven Diffusion in
Computer Vision, volume 1 of Computational Imaging and Vision, pages
135–146, 1994.
[141] David Mumford and Jayant Shah. Boundary detection by minimizing
functionals. In Proceedings of IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 22–26, San Francisco, 1985.
[142] Arnold Neumaier. Solving ill-conditioned and singular linear systems:
A tutorial on regularization. SIAM Review, 40:636–666, 1998.
[143] Mads Nielsen, Luc Florack, and Rachid Deriche. Regularization, scale-space, and edge detection filters. International Journal of Computer
Vision, 7(4):291–307, October 1997.
[144] Mila Nikolova. Counter-examples for Bayesian MAP restoration. In Proceedings of Scale Space and Variational Methods in Computer Vision
SSVM, pages 140–152, 2007.
[145] S. Nishikawa, R. Massa, and J. Mott-Smith. Area properties of television
pictures. IEEE Transactions on Information Theory, 11(3):348–352,
July 1965.
[146] Aude Oliva and Antonio Torralba. Modeling the shape of the scene:
A holistic representation of the spatial envelope. International Journal
of Computer Vision, 42(3):145–175, 2001.
[147] Aude Oliva and Antonio B. Torralba. Scene-centered description from
spatial envelope properties. In BMCV ’02: Proceedings of the Second
International Workshop on Biologically Motivated Computer Vision,
pages 263–272, London, UK, 2002. Springer-Verlag.
[148] Ole Fogh Olsen and Mads Nielsen. Multi-scale gradient magnitude
watershed segmentation. In ICIAP’97 - 9th International Conference
on Image Analysis and Processing, volume 1310 of Lecture Notes in
Computer Science, pages 6–13, Florence, Italy, September 1997.
[149] B. A. Olshausen and D. J. Field. Natural image statistics and efficient coding. Network: Computation in Neural Systems, 7, 1996.
[150] Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: a strategy employed by V1. Vision Research,
37:3311–3325, 1997.
[151] S. Osher and R. Fedkiw. Level Set Methods and Dynamic Implicit
Surfaces. Springer-Verlag, New York, 2003.
[152] Nils Papenberg, Andrés Bruhn, Thomas Brox, Stephan Didas, and
Joachim Weickert. Highly accurate optic flow computation with theoretically justified warping. International Journal of Computer Vision,
67(2):141–158, 2006.
[153] N. Paragios and R. Deriche. Geodesic active contours and level set
methods for the detection and tracking of moving objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(3):266–280,
2000.
[154] N. Paragios and R. Deriche. Geodesic active regions and level set
methods for motion estimation and tracking. Computer Vision and
Image Understanding, 97:259–282, 2005.
[155] Maria Petrou and Pedro García Sevilla. Dealing with Texture. Wiley,
2006.
[156] Gabriel Peyré. Non-negative sparse modeling of textures. In Proceedings of Scale Space and Variational Methods in Computer Vision
SSVM, LNCS. Springer, 2007.
[157] Gabriel Peyré, Sébastien Bougleux, and Laurent Cohen. Non-local regularization of inverse problems. In David Forsyth, Philip Torr, and Andrew Zisserman, editors, Proceedings of European Conference on Computer Vision (ECCV), volume 5304 of LNCS, pages 57–68. Springer,
2008.
[158] Tomaso Poggio and Vincent Torre. Ill-posed problems and regularization analysis in early vision. Technical Report A.I. Memo 773, MIT,
April 1984.
[159] Tomaso Poggio, H. Voorhees, and A. Yuille. A regularized solution to
edge detection. Technical Report A.I. Memo 833, MIT, May 1985.
[160] B.C. Porter, D.J. Rubens, J.G. Strang, J. Smith, S. Totterman, and
K.J. Parker. Three-dimensional registration and fusion of ultrasound
and MRI using major vessels as fiducial markers. IEEE Transactions On
Medical Imaging, 20(4):354–359, April 2001.
[161] Javier Portilla and Eero P. Simoncelli. A parametric texture model
based on joint statistics of complex wavelet coefficients. International
Journal of Computer Vision, 40(1):49–70, 2000.
[162] Trygve Randen and John Hakon Husoy. Filtering for texture classification: A comparative study. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 21(4):291–310, 1999.
[163] S.D. Rane, G. Sapiro, and M. Bertalmio. Structure and texture filling-in of missing image blocks in wireless transmission and compression
applications. IEEE Transactions On Image Processing, 12(3):296–303,
March 2003.
[164] A. Roche, X. Pennec, G. Malandain, and N.J. Ayache. Rigid registration of 3-D ultrasound with MR images: A new approach combining intensity and gradient information. IEEE Transactions On Medical
Imaging, 20(10):1038–1049, October 2001.
[165] M. Rousson and N. Paragios. Shape priors for level set representations.
In Proceedings of European Conference on Computer Vision (ECCV),
LNCS 2351, pages 78–92. Springer Verlag, 2002.
[166] D. L. Ruderman and W. Bialek. Statistics of natural images: Scaling
in the woods. Physical Review Letters, 73(6):814–817, August 1994.
[167] Daniel L. Ruderman. Statistics of natural images. Network: Computation in Neural Systems, 5(4):517–548, 1994.
[168] Daniel L. Ruderman. Origins of scaling in natural images. Vision
Research, 37(23):3385–3398, 1997.
[169] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total
variation based noise removal algorithms. Physica D, 60(1-4):259–268,
1992.
[170] Hans Sagan. Introduction to the calculus of variations. Dover, 1992.
[171] J. A. Sethian. Level Set Methods and Fast Marching Methods: Evolving
Interfaces in Computational Geometry, Fluid Mechanics, Computer
Vision, and Materials Science. Cambridge University Press, 1999.
[172] E. P. Simoncelli. Bayesian denoising of visual images in the wavelet
domain. In P. Müller and B. Vidakovic, editors, Bayesian Inference in
Wavelet Based Models, volume 41 of Lecture Notes in Statistics, pages
291–308. Springer-Verlag, 1999.
[173] E. P. Simoncelli and E. H. Adelson. Noise removal via Bayesian wavelet coring. In Proceedings of Third International Conference on Image Processing, volume I, pages 379–382, Lausanne, 1996. IEEE Signal Processing Society.
[174] E.P. Simoncelli. Statistical models for images: Compression, restoration and synthesis. In Asilomar Conference on Signals, Systems and
Computers, 1997.
[175] Stephen M. Smith. Flexible filter neighbourhood designation. In
ICPR ’96: Proceedings of the 1996 International Conference on Pattern Recognition (ICPR ’96) Volume I, page 206, Washington, DC,
USA, 1996. IEEE Computer Society.
[176] Stephen M. Smith and J. M. Brady. SUSAN - a new approach to
low level image processing. Technical Report TR95SMS1c, Chertsey,
Surrey, UK, 1995.
[177] Stephen M. Smith and J. Michael Brady. SUSAN - a new approach to
low level image processing. International Journal of Computer Vision,
23(1):45–78, 1997.
[178] Pierre Soille. Morphological image compositing. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 28(5):673–683, May 2006.
[179] J. E. Solem and N. Chr. Overgaard. A geometric formulation of gradient descent for variational problems with moving surfaces. In Scale-Space 2005, LNCS 3459, pages 419–430. Springer Verlag, 2005.
[180] Jon Sporring. The entropy of scale-space. In Proceedings of International Conference on Pattern Recognition (ICPR), volume I, Washington, DC, USA, 1996. IEEE Computer Society.
[181] Jon Sporring and Joachim Weickert. On generalized entropies and
scale-space. In SCALE-SPACE ’97: Proceedings of the First International Conference on Scale-Space Theory in Computer Vision, pages
53–64, London, UK, 1997. Springer-Verlag.
[182] Jon Sporring and Joachim Weickert. Information measures in scale-spaces. IEEE Transactions on Information Theory, 45:1051–1058,
1999.
[183] A. Srivastava, A. B. Lee, E. P. Simoncelli, and S.-C. Zhu. On advances
in statistical modeling of natural images. Journal of Mathematical
Imaging and Vision, 18(1):17–33, 2003.
[184] Anuj Srivastava, Xiuwen Liu, and Ulf Grenander. Universal analytical
forms for modeling image probabilities. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 24(9):1200–1214, 2002.
[185] C. Strecha, R. Fransens, and L. Van Gool. A probabilistic approach
to large displacement optical flow and occlusion detection. In Statistical Methods in Video Processing, LNCS 3247, pages 71–82. Springer
Verlag, 2004.
[186] D. Strong and T. F. Chan. Exact solutions to total variation problems.
Technical Report 96-41, UCLA, Ca., 1996.
[187] Richard Szeliski. Image alignment and stitching: a tutorial. Foundations and Trends in Computer Graphics and Vision, 2(1):1–104, 2006.
[188] Bart M. ter Haar Romeny. Front-End Vision and Multi-Scale Image
Analysis: Multi-Scale Computer Vision Theory and Applications, written in Mathematica, volume 27 of Computational Imaging and Vision.
Kluwer Academic Publishers, 2003.
[189] Alan M. Thompson, John C. Brown, Jim W. Kay, and D. Michael
Titterington. A study of methods of choosing the smoothing parameter
in image restoration by regularization. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 13(4):326–339, 1991.
[190] N. P. Tiilikainen, A. E. Bartoli, and S. Olsen. Contour-based registration and retexturing of cartoon-like videos. In Proceedings of British
Machine Vision Conference (BMVC), 2008.
[191] Philip H. S. Torr and Andrew Zisserman. Feature based methods for
structure and motion estimation. In B. Triggs, A. Zisserman, and
R. Szeliski, editors, Workshop on Vision Algorithms, pages 278–294.
Springer-Verlag, 1999.
[192] Antonio Torralba and Aude Oliva. Depth estimation from image structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), September 2002.
[193] Antonio Torralba and Aude Oliva. Statistics of natural image categories. Network: Computation in Neural Systems, 14(3):391–412,
August 2003.
[194] David Tschumperlé. Curvature-preserving regularization of multivalued images using PDE's. In Proceedings of European Conference on
Computer Vision (ECCV), pages II: 295–307. Springer-Verlag, 2006.
[195] David Tschumperlé. Fast anisotropic smoothing of multi-valued images
using curvature-preserving PDE's. International Journal of Computer
Vision, 68(1):65–82, 2006.
[196] A. van der Schaaf and J.H. van Hateren. Modelling the power spectra of natural images: Statistics and information. Vision Research, 36(17):2759–2770, 1996.
[197] J. H. van Hateren and A. van der Schaaf. Independent component
filters of natural images compared with simple cells in primary visual
cortex. Proc. Royal Soc. Lond. B, 265:359–366, 1998.
[198] Curtis R. Vogel. Computational Methods for Inverse Problems. Society
for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002.
[199] Y.Z. Wang and S.C. Zhu. Perceptual scale-space and its applications.
International Journal of Computer Vision, 80(1), October 2008.
[200] Joachim Weickert. Anisotropic Diffusion in Image Processing. ECMI.
Teubner-Verlag, 1998.
[201] Gerhard Winkler. Image Analysis, Random Fields, and Markov Chain
Monte Carlo Methods. Number 27 in Stochastic Modelling and Applied
Probability. Springer-Verlag, 2006.
[202] Andrew P. Witkin. Scale-space filtering. In Proceedings 8th International Joint Conference on Artificial Intelligence, volume 2, pages
1019–1022, Karlsruhe, August 1983.
[203] A. Wong and W. Bishop. Efficient least squares fusion of MRI and CT images using a phase congruency model. Pattern Recognition Letters, 29(3):173–
180, February 2008.
[204] Fei Wu, Changshui Zhang, and Jingrui He. An evolutionary system
for near-regular texture synthesis. Pattern Recognition, 40(8):2271–2282,
2007.
[205] Ying Nian Wu, Cheng-En Guo, and SongChun Zhu. From information
scaling of natural images to regimes of statistical models. Quarterly of
Applied Mathematics, 2007.
[206] S.C. Yan, C. Liu, S.Z. Li, H.J. Zhang, H.Y. Shum, and Q.S. Cheng.
Face alignment using texture-constrained active shape models. Image
and Vision Computing, 21(1):69–75, January 2003.
[207] Victoria Yanulevskaya and Jan-Mark Geusebroek. Significance of the
Weibull distribution and its sub-models in natural images. In International Conference on Computer Vision Theory and Applications (VISAPP), 2009.
[208] Laurent Younes. Computable elastic distances between shapes. SIAM
Journal on Applied Mathematics, 58(2):565–586, 1998.
[209] S.C. Zhu, C.E. Guo, Z.J. Xu, and Y.Z. Wang. What are textons?
In Proceedings of European Conference on Computer Vision (ECCV),
page IV: 793 ff., 2002.
[210] S.C. Zhu and Y.Z. Wang. Perceptual scale-space and its applications.
In Proceedings of IEEE International Conference on Computer Vision
(ICCV), pages I: 58–65, 2005.
[211] Song Chun Zhu and David Mumford. GRADE: Gibbs reaction and diffusion equations - a framework for pattern synthesis, denoising, image enhancement, and clutter removal. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(11):1627–1660, November 1997.
[212] Song Chun Zhu and David Mumford. Prior learning and Gibbs reaction-diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(11):1236–1250, November 1997.
[213] Song Chun Zhu, Ying Nian Wu, and David Mumford. Minimax entropy
principle and its application to texture modelling. Neural Computation,
9(8):1627–1660, 1997.
[214] Song Chun Zhu, Ying Nian Wu, and David Mumford. Filters, random
fields and maximum entropy (FRAME): Towards a unified theory for texture
modeling. International Journal of Computer Vision, 27(2):107–126,
1998.
[215] Barbara Zitova and Jan Flusser. Image registration methods: a survey.
Image and Vision Computing, 21(11):977–1000, October 2003.