FAKULTÄT FÜR INFORMATIK
DER TECHNISCHEN UNIVERSITÄT MÜNCHEN

Master's Thesis in Informatics

Feature Selection and Learning for Semantic Segmentation
Selektion und Lernen von Merkmalen für Semantische Segmentierung
Author: Caner Hazırbaş
Supervisor: Prof. Dr. Daniel Cremers
Advisors: M.Sc. Julia Diebold, Dipl.-Inf. (Univ.) Mohamed Souiai
Date: June 26, 2014
I confirm that this master’s thesis is my own work and I have documented all sources and
material used.
Ich versichere, dass ich diese Masterarbeit selbständig verfasst und nur die angegebenen
Quellen und Hilfsmittel verwendet habe.
München, den 26. Juni 2014
..............................
Caner Hazırbaş
Acknowledgments
First and foremost, I would like to offer my sincere gratitude to Prof. Dr. Daniel Cremers, who gave me invaluable guidance in deciding on my research subject and also motivated me with his immense knowledge of the topic.
Besides my supervisor, I would like to offer my special thanks to my advisors, M.Sc. Julia Diebold and Dipl.-Inf. (Univ.) Mohamed Souiai, for being the most indestructible advisors ever against my countless questions, and for their endless motivation and support during the thesis. Without their assistance and dedicated involvement in every step of the process, this thesis would never have been accomplished. Furthermore, I particularly thank Dr. Rudolph Triebel for his kindness and assistance in solving my problems regarding machine learning.
“Success is not a target but a very long journey.” I could not make it through and would
have been lost without my friends. I thank my dearest friends, Bilal Porgalı and Yağmur
Sevilmiş for being with me through this exhausting journey.
My special thanks are extended to my friends, my fellow labmates in the Computer
Vision and Pattern Recognition Research Group at TU Munich: Alfonso Ros Dos Santos,
Sahand Sharifzadehgolpayegani, Zorah Lähner, Emanuel Laude, Benjamin Strobel and
Mathieu Andreux for the stimulating discussions and also for having a great time together.
Last but not least, I would like to thank my mother Hüsniye Hazırbaş for giving me moral support in the very dark moments and spiritual support throughout my life, my father Birol Hazırbaş for being with me all the time with his prayers, and also my beloved sister Melike Hazırbaş for giving me all her trust and support.
Abstract
This work presents a comprehensive study on feature selection and learning for semantic segmentation. Various types of features and different learning algorithms, in conjunction with the minimization of a variational formulation, are discussed in order to obtain the best segmentation of the scene with minimal redundancy in the feature set. The features are scored in terms of relevance and redundancy. A clever feature selection reduces not only the redundancy but also the computational cost of object detection. Additionally, different learning algorithms are studied and the most suitable multi-class object classifier is trained with the selected subset of features for detection-based unary potential computation. In order to obtain consistent segmentation results, we minimize a variational formulation of the multi-labelling problem by means of first-order primal-dual optimization.
Experiments on different benchmarks give a deep understanding of how many and what kind of features and which learning algorithm should be used for semantic segmentation.
Contents

Acknowledgements
Abstract

I. Introduction and Theory

1. Introduction
   1.1. Related Work
   1.2. Contributions

2. Image Segmentation and Object Recognition
   2.1. Variational Image Segmentation
        2.1.1. Mumford-Shah Functional
        2.1.2. The ROF Model
        2.1.3. Total-Variation Segmentation
   2.2. Object Classification Methods
        2.2.1. Adaptive Boosting
        2.2.2. Random Forests
   2.3. Feature Selection based on Mutual Information Criteria

II. Feature Selection and Learning for Semantic Segmentation

3. Feature Ranking and Object Learning
   3.1. Appearance Model
        3.1.1. Shape Features
        3.1.2. Color and Texture Features
        3.1.3. Location Features
        3.1.4. Depth Features
   3.2. Feature Ranking
   3.3. Object Learning
        3.3.1. Measurement of the Classification Uncertainty

4. Variational Optimization and Image Segmentation
   4.1. Variational Image Segmentation
   4.2. First-order Primal-Dual Convex Optimization

III. Experimental Evaluation

5. Benchmarks
   5.1. eTrims Benchmark
   5.2. NYU version 1 Depth Benchmark
   5.3. Corel and Sowerby Benchmarks

6. Experimental Results
   6.1. Feature Selection
   6.2. Segmentation Results

IV. Summary and Conclusion

7. Summary
   7.1. Discussion

8. Conclusion
   8.1. Future Work

Appendix
A. List of Ranked Features

Bibliography
Part I.
Introduction and Theory
1. Introduction
Building man-like machines, also called humanoid robots, has been a dream of humanity since life began. The fundamental problem to address in developing such robots is how to understand the environment as we humans do. For many years, human vision has kept its secret of how we visually understand a scene and make robust inferences about unknown environments or objects using previous experience. Although scientists have spent many years trying to figure out how humans actually see the environment, we are still struggling to explain how this remarkable learning and understanding happens. Nevertheless, we know that the eyes allow individuals to interpret their surroundings, and therefore we developed cameras. The invention of low-cost cameras has opened a gate for robots in the direction of visual perception. Today, cameras play an important role in visual understanding in many applications.
The development of autonomous, unmanned vehicles and devices is essential not only to replace man-power with mechanical power, but also to support daily life activities. Transportation, production, surveillance systems, the aerospace industry and human support systems are only some of the application areas. However, the remaining major challenge is how to understand the environment and move autonomously using cameras. Perceptual understanding of the scene is at the very heart of this problem and is still being studied by many scientists.
Computer vision studies have investigated the perceptual interpretation of the objects in a scene using just a single image, retrieved from a low-cost camera. Today, with the high computational processing power of computers, many researchers consider the recognition and segmentation problems together to increase the capacity of current systems for perceptual scene analysis. The integration of the visual semantics of objects into the segmentation problem is nowadays the main concern in the field of semantic segmentation.
Segmentation of images based on semantics is essential for real-time scene understanding, surveillance systems and 3D scene reconstruction, e.g. in [5, 26, 46]. The performance of these segmentation algorithms heavily depends on the quality of the appearance model, which is computed using pixel-wise object detection algorithms.
The major challenge is to find the most representative features to distinguish dissimilar objects in terms of their shape, color and textural differences. Conventional object
detectors deal with the task of finding bounding boxes around each object [14, 32, 49]. In
contrast, dense object detection approaches [29, 44] focus on detecting the objects at pixel
level which provides a preliminary segmentation. Several other studies pursued different approaches for object detection and learning algorithms in order to perform semantic
segmentation, e.g. [3, 28, 43].
In this work we propose a method to systematically select the best subset of features
for semantic segmentation. Our approach is able to outperform state-of-the-art results in
terms of runtime and even in accuracy in several cases.
1.1. Related Work
In 2001, Viola and Jones presented simple, but robust Haar-like features for real-time face
detection [49]. Haar-like features are simple to implement, computationally low cost and
very accurate in capturing the shape of objects. These features can be used for detection
of any type of object and this flexibility makes them convenient for semantic segmentation
applications. Lowe [32] presented a distinctive, scale-invariant feature transform (SIFT)
and Dalal and Triggs [14] presented so called histograms of oriented gradients (HOG)
which are computationally more expensive. SIFT can robustly identify the objects even
among clutter and under partial occlusion, because the SIFT feature descriptor is invariant
to uniform scaling, orientation, and partially invariant to affine distortion and illumination
changes. Moreover, Zhang et al. [52] proposed a set of novel features based on oriented gradients to detect cat heads. They showed that exploiting shape and texture features jointly outperforms the existing leading features, e.g. Haar and HOG. However, these methods are mostly used in the context of sliding-window techniques. In the scope of this research we aim at detecting objects densely at the pixel level.
In [44], Shotton et al. proposed texture-layout filters based on textons which jointly model patterns of texture and their spatial layout for dense object detection. They presented a two-step algorithm for semantic segmentation of photographs. In the first step, unary classification and feature selection are achieved using shared boosting to give an efficient classifier. Then, in the second step, an accurate image segmentation is achieved by incorporating the unary classifier in a conditional random field that encourages neighboring pixels to have the same labels.
Ladický et al. [29] proposed a hierarchical random field model that allows the integration of features computed at different levels of the quantisation hierarchy. Moreover, Ladický et al. [30] combined different features for unary pixel classification by using Joint Boosting [48]. However, their approach is sensitive to a large set of parameters.
Fröhlich et al. [21] proposed an iterative approach for semantic segmentation of a facade
dataset. They learn a single random forest and incrementally add context features derived
from coarser levels. This approach uses different kinds of features in a joint, flexible, and
fast manner and refines the semantic segmentation of the scene iteratively.
Hermans et al. [26] discussed 3D semantic reconstruction of indoor scenes by exploiting a 2D semantic segmentation approach based on Randomized Decision Forests for RGB-D sensor data. They used depth features in addition to a very basic set of features to train a classifier for unary pixel classification.
However, none of the approaches above gives any justification for the chosen set of features, nor do they address the problem of how to choose the best feature set for object detection in semantic segmentation. Feature selection plays an important role in the quality of the object detector and also in the runtime of the entire system. It is also known that the m best features are not necessarily the best m features [12].
In [37], Peng et al. proposed a theoretical framework to rank the features based on
minimum-Redundancy-Maximum-Relevance (mRMR). They formulate a cost function
composed of a relevance and a redundancy term. The relevance between features and
class labels is maximized and the redundancy between feature pairs is minimized.
1.2. Contributions
In this work, we adapt the approach of [37] to the task of semantic segmentation. We study
different types of features and learning algorithms and perform experiments on various
segmentation benchmarks. The main goal is to find the best subset of features for semantic
segmentation. This is done by systematically selecting them via minimizing redundancy
and maximizing relevance and thereby improving the unary pixel potentials. Exhaustive
experiments on various benchmarks show that reducing the redundancy in feature sets
significantly reduces the runtime while preserving the performance. Our approach is able
to outperform state-of-the-art algorithms in terms of runtime and even in accuracy in several cases.
2. Image Segmentation and Object Recognition
This chapter presents the theoretical background of the proposed method. Variational image segmentation, object learning (classification) algorithms and the mutual-information-based feature selection method are discussed in detail to provide an insight into image segmentation and object recognition.
2.1. Variational Image Segmentation
Image segmentation is currently one of the most popular research topics in computer vision. It deals with finding disjoint sets of elements in observed data such that the similarity inside the sets and the dissimilarity between the disjoint sets are maximal.
Image segmentation comprises finding object regions and accurate boundaries between them. Let Ω be the image domain and k the number of disjoint regions; the segmentation criterion for an image can be written as:
$$\Omega = \{\Omega_1, \Omega_2, \ldots, \Omega_{k-1}, \Omega_k\}, \quad \text{where} \quad \bigcup_{i=1}^{k} \Omega_i = \Omega \;\wedge\; \Omega_i \cap \Omega_j = \emptyset \quad \forall i, j \in \{1, \ldots, k\} \text{ s.t. } i \neq j.$$
In the simplest case, with only background and foreground, regions can be segmented using very basic and conventional methods such as edge-detection based segmentation. However, since this is rarely realistic, one has to take more complex cases into consideration to be able to segment images into multiple regions. Image segmentation approaches are generally based either on graphical models, e.g. Graph-Cut based image segmentation [16], or on variational formulations such as the generalized Mumford-Shah functional [36] for multi-region segmentation.
In this study, we focus only on the variational multi-region segmentation problem. Variational image segmentation minimizes an energy functional to find a consistent segmentation. An energy functional for image segmentation is typically formulated as:
$$E(u(x)) = \sum_{i=1}^{k} E_{\mathrm{data}}(u_i(x), I) + \lambda \cdot \sum_{i=1}^{k} E_{\mathrm{reg}}(u_i(x))$$
where $u(x) = (u_1, \ldots, u_k)$ is the set of indicator functions with:
$$u_i(x) = \begin{cases} 1 & \text{for } x \in \Omega_i \\ 0 & \text{else} \end{cases}$$
$E_{\mathrm{data}}$ is called the data term and represents the fidelity of the segmentation to a given image I. $E_{\mathrm{reg}}$ is called the regularizer term and ensures that the resulting segmented regions have homogeneous class labels with optimal class boundaries. The parameter λ is a weight and controls the amount of smoothing of the segments. The trade-off is that the segmented regions get more and more smoothed as λ increases, whereas if λ is set to a very low value, the segmented regions will have noisy labels and lose their homogeneity.
2.1.1. Mumford-Shah Functional
Mumford and Shah [36] presented an energy functional to segment images with piece-wise
smooth approximation in the following form:
$$E(u, C) = \int_\Omega (I - u)^2 \, dx + \lambda \int_{\Omega \setminus C} |\nabla u|^2 \, dx + \nu |C| \qquad (2.1)$$
where u : Ω → R is an approximation of the segmented image and C ⊂ Ω is the one-dimensional discontinuity set. (I − u) converges to zero as u becomes similar to I, which provides a good approximation of the segments. The second term smooths the regions everywhere except on the set C. The last term ensures that the length of the boundary |C| between classes is minimal, with a weighting ν. However, C is part of the energy itself and no numerical solution is given to minimize this functional. Multiple approaches have been presented to solve the Mumford-Shah functional. This thesis employs a model related to the Mumford-Shah functional to solve multi-region segmentation problems.
2.1.2. The ROF Model
In 1992, Rudin, Osher and Fatemi [40] pioneered the concept of nonlinear image denoising, which aims at removing noise while preserving the edges of an image. The principle is to remove unwanted, excessive and possibly spurious details that induce a high total variation, that is, a high integral of the absolute gradient of the image. According to this principle, reducing the total variation of the image removes the unwanted details while preserving the image edges.
Given a noisy image f : Ω → R, where Ω is a bounded open subset of R², we seek a clean image u, the denoised version of f, with bounded variation. To solve this problem, Rudin et al. minimize the total variation of the image with the following energy functional:
$$\min_{u \in BV(\Omega)} \frac{1}{2} \|f - u\|_2^2 + \lambda \int_\Omega |\nabla u| \, dx \,. \qquad (2.2)$$
This variational energy has a data term, which minimizes the $L^2$ distance between the desired and the noisy image, and a regularizer $\int_\Omega |\nabla u|\,dx$, the so-called total variation, which minimizes the length of the object boundaries. λ is a scale parameter that determines the level of smoothing.
The total variation is not differentiable under the constraint that u takes only the values 0 and 1. For this reason, the bounded variation of u is used to address the problem; u ∈ BV(Ω) is constrained to be of bounded variation, defined as:
$$BV(\Omega) := \{ u \in L^1(\Omega) \mid TV(u) < +\infty \}$$
where the total variation is finite and given as [22]:
$$TV(u) := \sup\Big\{ -\int_\Omega u(x) \cdot \mathrm{div}\,\xi \, dx \;\Big|\; \xi \in C_c^1(\Omega, \mathbb{R}^n),\; \|\xi\|_{L^\infty(\Omega)} \leq 1 \Big\} \qquad (2.3)$$
If u is differentiable and Ω is a bounded open set, then we can rewrite the total variation using Gauss' theorem:
$$\int_\Omega -u \cdot \mathrm{div}\,\xi \, dx = \int_\Omega \nabla u \cdot \xi \, dx \leq \|\xi\|_\infty \int_\Omega |\nabla u| \, dx,$$
yielding equality with $\hat{\xi} := \frac{\nabla u}{|\nabla u|}$:
$$\int_\Omega \nabla u \cdot \hat{\xi} \, dx = \int_\Omega |\nabla u| \, dx,$$
and since $\hat{\xi}$ can be approximated by a sequence $(\xi)_n \subset C_c^1(\Omega)$, the following holds:
$$\int_\Omega -u \cdot \mathrm{div}\,(\xi)_n \, dx = \int_\Omega \nabla u \cdot (\xi)_n \, dx \longrightarrow \int_\Omega \nabla u \cdot \hat{\xi} \, dx = \int_\Omega |\nabla u| \, dx \,.$$
The ROF model is also called the TV-$L^2$ model and has two advantages:
• TV preserves discontinuities (edges in the images) and therefore the method is well suited for image processing tasks.
• TV is a convex function; a function f is convex if and only if its epigraph $\mathrm{epi}(f) := \{(x, y) \mid f(x) \leq y\}$ is a convex set. It can easily be shown that if f is convex and a minimizer of f exists, then this minimizer is indeed globally optimal. Convex functions have the advantage that they can be minimized with well-researched optimization solvers regardless of the initialization.
Equation 2.2 can be minimized with gradient descent, and the update equation for this task is defined by:
$$\frac{\partial u}{\partial t} = \nabla \cdot \left( \frac{\nabla u}{|\nabla u|} \right) - \frac{1}{\lambda}(u - f), \quad \text{with boundary condition} \quad \left.\frac{\partial u}{\partial n}\right|_{\partial\Omega} = 0 \,.$$
However, in the case of constant functions, |∇u| will be equal to 0 and the optimization scheme will face a singularity problem because of the division by zero. Therefore, this problem will be circumvented using a first-order primal-dual formulation of the total variation given in Equation 2.3. The first-order primal-dual convex optimization scheme will be discussed in Section 4.2. See Figure 2.1 for an image denoising example with the ROF model.
Figure 2.1.: Image Denoising. A sample noisy image (a) was denoised by gradient descent minimization of the ROF model, implemented in Matlab by Magiera and Löndahl [33]. The denoised images (b), (c) and (d) were produced with 100 iterations and smoothing scales λ = 10, 30, 50, respectively. Each RGB color channel was processed independently. The resulting image gets smoother as λ increases.
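To make the scheme concrete, the following is a minimal Python/NumPy sketch of this gradient-descent minimization (it is not the Matlab implementation referenced above); the step size tau, the number of iterations and the small constant eps guarding against the division by zero discussed above are illustrative choices.

```python
import numpy as np

def rof_gradient_descent(f, lam=30.0, tau=0.1, eps=1e-6, iterations=100):
    """Denoise a single-channel image f by gradient descent on the ROF energy."""
    u = f.astype(np.float64).copy()
    for _ in range(iterations):
        # forward differences with replicated (Neumann) boundary
        ux = np.diff(u, axis=1, append=u[:, -1:])
        uy = np.diff(u, axis=0, append=u[-1:, :])
        norm = np.sqrt(ux ** 2 + uy ** 2) + eps        # eps avoids |grad u| = 0
        px, py = ux / norm, uy / norm
        # divergence as the (negative) adjoint of the forward differences
        div = np.zeros_like(u)
        div[:, 0] += px[:, 0]
        div[:, 1:] += px[:, 1:] - px[:, :-1]
        div[0, :] += py[0, :]
        div[1:, :] += py[1:, :] - py[:-1, :]
        # du/dt = div(grad u / |grad u|) - (u - f) / lambda
        u += tau * (div - (u - f) / lam)
    return u
```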
2.1.3. Total-Variation Segmentation
A continuous model for TV-based image segmentation, which will be used in the following, has the form:
$$\min_{\Omega_1, \ldots, \Omega_k} \sum_{i=1}^{k} \int_{\Omega_i} f_i(x) \, dx + \frac{\lambda}{2} \sum_{i=1}^{k} \mathrm{Per}(\Omega_i, \Omega)$$
where $f_i : \Omega \to \mathbb{R}^+$ are non-negative potential functions defined for each pixel by the object classification methods. The λ-weighted second term is half the sum of the perimeters of the sets $\Omega_1, \ldots, \Omega_k$ and corresponds to the total length of the partition interface $\bigcup_{i<j} \partial\Omega_i \cap \partial\Omega_j$ (so that each interface is counted once). Thus, we segment the image into k sub-partitions that yield the lowest energy and ensure a minimal surface at the same time.
The partitions are represented by characteristic functions, also known as indicator functions, $u_1, \ldots, u_k : \Omega \to \{0, 1\}$, so that the energy takes a computationally tractable form:
$$u_i(x) = \begin{cases} 1 & \text{for } x \in \Omega_i \\ 0 & \text{else} \end{cases} \quad \text{satisfying} \quad \sum_{i=1}^{k} u_i(x) = 1 \;\; \text{a.e. } x \in \Omega.$$
Thus, the first term becomes:
$$\sum_{i=1}^{k} \int_{\Omega_i} f_i(x) \, dx = \sum_{i=1}^{k} \int_{\Omega} u_i(x) f_i(x) \, dx \,.$$
Moreover, if the indicator functions $u_i \in BV(\Omega)$ of the measurable sets $\Omega_i \subset \Omega$ are scalar-valued functions of bounded variation, the co-area formula [17] tells us that the perimeter is equal to the total variation:
$$\mathrm{Per}(\Omega_i, \Omega) = \mathrm{Per}(u_i, \Omega) = TV(u_i) = \int_\Omega |\nabla u_i| \, dx \,.$$
As a consequence of these reformulations, the energy functional takes the form of an intermediate minimization problem:
$$\min_{u \in B} \sum_{i=1}^{k} \int_\Omega u_i(x) \cdot f_i(x) \, dx + \frac{\lambda}{2} \sum_{i=1}^{k} \int_\Omega g(x) \cdot |\nabla u_i| \, dx \qquad (2.4)$$
$$B := \Big\{ (u_1, \ldots, u_k) \in BV(\Omega, \{0,1\})^k \;\Big|\; \sum_{i=1}^{k} u_i = 1 \;\text{a.e. } x \in \Omega \Big\}$$
with the coherency constraint B on the indicator functions $u_i$. The minimizer u now lies in B, the k-dimensional space of binary functions of bounded variation, and fulfills the point-wise characteristic property (at every position x only one $u_i$ is allowed to be non-zero). Also note that the TV term in Equation 2.4 is weighted with a space-dependent function g(x), given in Equation 2.5, which takes low values if the image derivative at position x is high and high values otherwise, meaning that discontinuities are preserved at the boundaries of the input image.
$$g(x) = \exp\left( -\frac{|\nabla f(x)|^2}{2\sigma^2} \right), \qquad \sigma^2 = \frac{1}{|\Omega|} \int_\Omega |\nabla f(x)|^2 \, dx \qquad (2.5)$$
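As a concrete illustration, the edge-dependent weight of Equation 2.5 can be computed directly from the image gradient; the following NumPy sketch (names are our own) uses simple forward differences to approximate ∇f.

```python
import numpy as np

def edge_weight(f):
    """Space-dependent TV weight g(x) of Equation 2.5 for a gray-value image f."""
    fx = np.diff(f, axis=1, append=f[:, -1:])
    fy = np.diff(f, axis=0, append=f[-1:, :])
    grad_sq = fx ** 2 + fy ** 2                 # |grad f(x)|^2 at every pixel
    sigma_sq = max(grad_sq.mean(), 1e-12)       # sigma^2 = mean of |grad f|^2 over the image
    return np.exp(-grad_sq / (2.0 * sigma_sq))
```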
The energy functional is still hard to optimize, since it is non-convex and non-differentiable because of the binary indicator functions $u_i$. Therefore, to convexify the energy, the $u_i$ are relaxed by mapping them into the whole interval, $u_i : \Omega \to [0, 1]$, as described in [9]. With this, the indicators change from hard to soft assignments; thus, at a position x, u(x) can have multiple non-zero entries, although the coherency constraint still holds, i.e. the sum has to yield 1. The TV term is replaced with its dual representation (Equation 2.3) to obtain a differentiable function. The energy now takes the form of a convex and differentiable optimization problem as follows:
$$\min_{u \in S} \sup_{\xi \in K} \sum_{i=1}^{k} \int_\Omega u_i(x) \cdot f_i(x) \, dx - \lambda \sum_{i=1}^{k} \int_\Omega \mathrm{div}\,\xi_i(x) \cdot u_i(x) \, dx \,, \qquad (2.6)$$
with minimization over the set S of BV functions mapping into the k-dimensional simplex, while the dual variable ξ lies in the convex set K:
$$S := \Big\{ u = (u_1, \ldots, u_k) \in BV(\Omega, [0,1])^k \;\Big|\; \sum_i u_i(x) = 1 \;\text{a.e. } x \in \Omega \Big\} \,.$$
In [31, 51], the authors use variations of a straightforward formulation that arises from the TV definition of the regularizer: ξ is enforced to stay inside a norm bound, e.g. $(\sum_i \|\xi_i\|^2)^{\frac{1}{2}} \leq 1$. This enforcement limits the amount of flow happening at every position. Since we weight the TV with the space-dependent weighting function g, the dual space can be rewritten as:
$$K := \Big\{ \xi = (\xi_1, \ldots, \xi_k) : \Omega \to \mathbb{R}^{2\times k} \;\Big|\; \sum_i \|\xi_i(x)\|^2 \leq \frac{g(x)}{2} \;\text{a.e. } x \in \Omega \Big\} \,.$$
An alternative approach is presented by Chambolle et al. [10]. They introduce the notion of a paired calibration of the dual variables' components, which represents a local convex envelope of the energy:
$$K_C := \Big\{ \xi = (\xi_1, \ldots, \xi_k) : \Omega \to \mathbb{R}^{2\times k} \;\Big|\; \Big|\sum_{i_1 \leq i \leq i_2} \xi_i(x)\Big| \leq \frac{g(x)}{2} \;\text{a.e. } x \in \Omega \Big\} \,.$$
It has been proven that this model gives a tighter bound for k > 2 in comparison to the others, which means that the found minimizer is closer to the original global optimum. While this dual space yields tighter solutions, the projection of a variable onto $K_C$ is computationally burdensome, because it involves the projection of a variable onto multiple convex sets. On the other hand, the projection onto K is very fast since it consists of point-wise truncation operations.
The relaxed energy in Equation 2.6 is just an approximation of the original problem and does not guarantee an optimal solution. The literature has shown that in the case of two regions (k = 2), the thresholded solution of the relaxed version yields a global optimizer independent of the chosen threshold. However, this no longer holds for any binarized solution in the multi-region case.
Data: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$
Result: Final strong classifier H(x)
Initialize $D_1(i) = 1/m$.
for t = 1, ..., T do
  • Train a weak learner using the weight distribution $D_t$.
  • Get a weak hypothesis $h_t : X \to \{-1, +1\}$ with error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$.
  • Choose $\alpha_t = \frac{1}{2}\ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$.
  • Update:
    $$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases} \;=\; \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$$
    where $Z_t$ is a normalization factor such that $D_{t+1}$ is a distribution.
end
Output the final hypothesis:
$$H(x) = \mathrm{sign}\Big( \sum_{t=1}^{T} \alpha_t h_t(x) \Big) \,.$$
Algorithm 1: Discrete AdaBoost. The Adaptive Boosting Algorithm
2.2. Object Classification Methods
In machine learning and statistics, classification is the problem of identifying the categorical class label of a new observation on the basis of training data containing observations whose category memberships are known. The output of a classification algorithm is called a classifier, that is, a function that maps observed data to a category (a class label).
This section gives a brief description of Adaptive Boosting [20] and Random Forests [7], the machine learning algorithms used in the proposed method, chosen among various linear and non-linear supervised learning algorithms, e.g. Artificial Neural Networks [39], Support Vector Machines [4], linear classifiers and deep learning.
2.2.1. Adaptive Boosting
The AdaBoost is a boosting algorithm, introduced by Freud and Schapire [18] in 1997 and
is capable of solving the binary classification problem. Adaptive Boosting is a linear discriminative learning model which combines weak learners as strong classifier to find the
best separation between two classes. Algorithm 1 shows the algorithm of Discrete AdaBoost [19].
Figure 2.2.: Boosting Classifier. Illustration of the Adaptive Boosting Algorithm.
Given an observation set X with corresponding class labels Y, the algorithm iteratively finds weak learners and combines them linearly into a strong classifier. Usually a weak learner is a decision tree with one level, also called a decision stump.
After receiving the weak hypothesis $h_t$, the algorithm computes the importance $\alpha_t$ assigned to $h_t$. Note that $\alpha_t$ gets larger as the error $\epsilon_t$ gets smaller. One of the main ideas of the algorithm is to maintain a distribution of weights $D_t$ over the training set. In each round, the weights of misclassified samples are increased and those of correctly classified samples are decreased. As a result of this re-weighting, the weak learner $h_t$ is forced to focus on the hard examples in the training set; the weight thus tends to concentrate on hard examples.
The final hypothesis H, also called the strong classifier, is a majority vote of the linearly combined weak learners $h_t$, weighted with the corresponding $\alpha_t$.
The weak learner is responsible for finding a weak hypothesis $h_t : X \to \{-1, +1\}$ appropriate for the sample weights $D_t$, and the quality of this weak learner is measured by its error:
$$\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i] = \sum_{i \,:\, h_t(x_i) \neq y_i} D_t(i) \,.$$
Figure 2.2 illustrates the boosting algorithm on a simple training set; the strong classifier is shown as a combination of three weak classifiers.
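To illustrate Algorithm 1, the following is a minimal sketch of Discrete AdaBoost with decision stumps as weak learners, using scikit-learn's DecisionTreeClassifier with max_depth=1; the number of rounds T and all names are illustrative and not the settings used later in this thesis.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discrete_adaboost(X, y, T=100):
    """Train T decision stumps on (X, y) with labels in {-1, +1} (Algorithm 1)."""
    y = np.asarray(y)
    m = len(y)
    D = np.full(m, 1.0 / m)               # weight distribution D_1
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        h = stump.predict(X)
        eps = np.clip(np.sum(D[h != y]), 1e-10, 1 - 1e-10)   # weighted error eps_t
        alpha = 0.5 * np.log((1 - eps) / eps)
        D *= np.exp(-alpha * y * h)       # emphasize misclassified samples
        D /= D.sum()                      # normalize so D_{t+1} is a distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Strong classifier H(x) = sign(sum_t alpha_t * h_t(x))."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(votes)
```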
There are different kinds of Adaptive Boosting algorithms, each named after its error function:
• Real AdaBoost: The output of the decision trees is a class probability estimate $p(x) = P(y = +1 \mid x)$, the probability that x is in the positive class [20]. Each leaf node in the decision tree is changed to output half the logit transform of its previous value x:
$$f_t = \frac{1}{2} \ln\left( \frac{x}{1 - x} \right) .$$
• Logit AdaBoost: This variant solves a convex optimization problem and minimizes the logistic loss:
$$\sum_i \log\left( 1 + e^{-y_i f(x_i)} \right), \quad \text{where } f(x_i) = \sum_t \alpha_t h_t(x_i) \,.$$
• Gentle AdaBoost: Previous boosting algorithms choose $f_t$ greedily, minimizing the overall error as much as possible at each step. Gentle AdaBoost instead features a bounded step size: $f_t$ is chosen to minimize $\sum_i w_{t,i}(y_i - f_t(x_i))^2$, and no further coefficient is applied. Thus, in the case where a weak learner exhibits perfect classification performance, Gentle AdaBoost will choose $f_t(x) = \alpha_t h_t(x)$ exactly equal to y, while steepest descent algorithms will try to set $\alpha_t = \infty$. Empirical observations about the good performance of Gentle AdaBoost appear to back up Schapire and Singer's remark that allowing excessively large values of α can lead to poor generalization performance [19, 41].
2.2.2. Random Forests
Random Forests are an ensemble learning method, proposed by Breiman [7] in 2001, that can handle multi-class classification problems. Random Forests grow many decision trees, introduced in [8], to construct a classifier that is more robust against outliers.
The construction of a decision tree requires a training sample of m observations
$$O = \{(x_1, y_1), \ldots, (x_m, y_m)\}, \quad i = 1, \ldots, m,$$
where $x_i \in X$, the variable set, and $y_i \in Y$, the corresponding class labels. A sample decision tree is shown in Figure 2.3. Each interior node corresponds to a variable in X, while each leaf node corresponds to a class label in Y. A decision tree is trained by splitting the input training set into subsets based on an attribute value test. Splitting is repeated recursively on each node until all samples at the node have the same target class label or splitting no longer gives a noticeable improvement in the predictions.

Figure 2.3.: Decision Tree. A tree predicting whether a child is sick in a city with an epidemic disease, branching on "is coughing?", "has fever?" and "feeling groggy?". The numbers under the leaf nodes show the probability of sickness and the percentage of observations in each leaf.
A decision tree is usually constructed from top to bottom; at each interior node, the attribute which gives the best split of the subset is selected. The quality of a split is determined by the importance of each attribute, computed with one of the following metrics:
• Information Gain: The concept of information gain comes from information theory and is based on entropy. The information gain of an attribute measures the importance of that attribute for a given set of examples. Let T denote a set of training samples, each of the form $(x, y) = (x_1, x_2, \ldots, x_k, y)$, where $x_a \in V(a)$ is the value of the a-th attribute of the sample x and y is the corresponding class label. The information gain for an attribute a is defined in terms of the entropy H as follows [35]:
$$IG(T, a) = H(T) - \sum_{v \in V(a)} \frac{|\{x \in T \mid x_a = v\}|}{|T|} \cdot H(\{x \in T \mid x_a = v\})$$
$$H(X) = -\sum_i P(x_i) \cdot \log_{|k|} P(x_i), \qquad \sum_i P(x_i) = 1$$
• Gini Importance: Random Forests gain their superior performance from their implicit feature selection strategy, in which the Gini importance is the indicator of feature relevance. This metric measures how often a particular attribute θ is selected for a split and how effective its overall discriminative value is for the classification. At each node t within a binary decision tree, the optimal split is found using the Gini impurity i(t), calculated as follows [34]:
$$i(t) = 1 - p_1^2 - p_0^2 \,,$$
where $p_k = \frac{n_k}{n}$ is the fraction of the $n_k$ samples from class $k \in \{0, 1\}$. From this, the Gini importance $I_G$ of a variable θ is computed as:
$$I_G(\theta) = \sum_t \Delta i_\theta(t), \qquad \Delta i_\theta(t) = i_\theta(t) - p_l \, i_\theta(t_l) - p_r \, i_\theta(t_r), \qquad p_l = \frac{n_l}{n}, \;\; p_r = \frac{n_r}{n} \,.$$
  (A small numerical sketch of both split criteria is given below.)
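The following NumPy sketch evaluates both split criteria for discrete class labels; the function names and the base-2 logarithm are our own illustrative choices.

```python
import numpy as np

def entropy(labels):
    """H(X) = -sum_i P(x_i) * log P(x_i), estimated from class frequencies."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """IG = H(parent) minus the size-weighted entropies of the two children."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

def gini_impurity(labels):
    """i(t) = 1 - sum_k p_k^2 for the samples reaching node t."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)
```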
A decision tree has many advantages compared to other learning methods. It is simple to construct, very robust and reliable since it has a self-validation strategy, can handle both categorical and numerical data, performs well even in the case of a large dataset and, most importantly, a decision tree is fast at predicting the class label of new observations. A decision tree also has some drawbacks: the quality of the classifier highly depends on the training samples, and a decision tree may over-fit and not generalize well beyond the training data.
Random Forests consist of multiple decision trees. To train a forest, the initial training data is split into subsets such that each tree receives the same number of samples. Each tree in the forest gives a vote (classification), which is a mapping from an input vector to a class label, and the forest chooses the classification having the most votes over all trees in the forest.
In contrast to a single decision tree, Random Forests do not over-fit. Adding more trees to the forest increases the overall performance of the classifier, but the prediction time also increases. Nevertheless, Random Forests can easily be parallelized, since the prediction for a new observation is evaluated at each tree independently. In Random Forests there is no need for an external test or cross-validation to estimate the forest error, since the error computation is done internally during training: one-third of the training subset of each tree is left out to estimate the forest error while the trees are constructed. This error is called the out-of-bag error and gives an unbiased classification error of the forest.
Moreover, in the original paper of Random Forests [7], it is shown that the forest error
depends on two factors which are the correlation between trees and the strength of each
individual tree in the forest. The forest error increases as the correlation gets higher, but
decreases as the strength increases.
Random Forests are scalable, reliable, robust, insensitive to outliers, capable of solving multi-class classification problems on high-dimensional data and often outperform other machine learning algorithms in terms of speed and accuracy. These strong characteristics make Random Forests popular in the area of visual object detection and recognition.
2.3. Feature Selection based on Mutual Information Criteria
Conventional machine learning algorithms aim to find the best subset of features which distinguishes the disjoint sets (classes) well for classification. However, most methods suffer from local minima or over-fitting problems due to high sensitivity to outliers, and the final classifier does not generalize well beyond the given training data, which is usually noisy because of the high redundancy in the feature set. Therefore, in statistical analysis this problem is addressed with feature reduction using probabilistic frameworks such as maximizing the dependency. This analysis provides minimal classification errors by maximizing the statistical dependency of the target class c on the data distribution in an unsupervised manner. However, a maximum-dependency approach typically involves the computation of multivariate joint probabilities, which is often difficult and inaccurate. One approach to overcome this problem is to realize maximum dependency by maximizing the relevance, i.e., selecting the features with the highest relevance to the target class c. Relevance measures the dependency of the variables using correlation or mutual information. Ding and Peng [15, 37] proposed a feature selection framework based on mutual information. The idea is to maximize the relevance and minimize the redundancy of the feature set. Peng et al. [37] proved that maximizing the dependency is equivalent to maximizing the relevance between features and class labels while minimizing the redundancy of dependent features in the feature set. Thus, the noise and outliers in the data are reduced and
the redundancy is minimized, which induces the classifiers to perform better. This feature selection is applied to the training data as a preprocessing step before learning the object classifiers. As a result of the redundancy minimization, the classifier performance is significantly increased and the efficiency of the learning algorithm is boosted by the selection of a feature subset [15, 37].
In theory, mutual information is defined in terms of the probability density functions p(x), p(y) and p(x, y) of two random variables x and y:
$$I(X; Y) = \int_Y \int_X p(x, y) \log \frac{p(x, y)}{p(x)p(y)} \, dx \, dy \,.$$
Max-Dependency finds the subset S of m features $\{x_i\}$ and is defined as:
$$\max D(S, c), \quad D = I(\{x_i, i = 1, \ldots, m\}; c) \,.$$
Max-Relevance selects the features $x_i$ which have the largest mutual information $I(x_i; c)$ with the target class c, reflecting the largest dependency on the target class. An approximation of D(S, c) in the maximum-relevance approach is computed as:
$$\max D(S, c), \quad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c) \,.$$
However, it is very likely that some of the selected features are mutually dependent and yield strong relevance. For this reason, while the relevance is maximized, the redundancy of the correlated features should be minimized:
$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j) \,.$$
The combination of the two criteria described above formulates the minimal-redundancy-maximal-relevance (mRMR) feature selection method. The final optimization problem is defined as the difference of the two terms:
$$\max \Phi(D, R), \quad \Phi = D - R \,.$$
In this method, the subset of features is selected iteratively such that at each step one feature is chosen depending on all previously selected features with an incremental search:
$$\max_{x_j \in X - S_{m-1}} \Big[ I(x_j; c) - \frac{1}{m-1} \sum_{x_i \in S_{m-1}} I(x_j; x_i) \Big] \,.$$
Compared to Max-Dependency, mRMR avoids the estimation of multivariate probability densities $p(x_1, \ldots, x_m)$ and only calculates bivariate densities, i.e. $p(x_i, x_j)$ and $p(x_i, c)$, which is much easier and applicable to large data. Moreover, this also leads to more efficient feature selection algorithms. On the other hand, mRMR algorithms do not guarantee the global maximum of the problem due to the difficulty of searching the whole space. However, a globally optimal solution of this problem might lead to over-fitting on the data, whereas mRMR is a practical way to achieve superior classification accuracy while also reducing the complexity of the computation [37].
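The incremental mRMR search can be sketched as follows, assuming that mutual information is estimated from joint histograms of (possibly discretized) feature values; the number of bins and all names are illustrative, and this is not the implementation used in the experiments.

```python
import numpy as np

def mutual_information(a, b, bins=16):
    """Histogram-based estimate of I(A; B) for two 1-D variables."""
    p_ab, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab /= p_ab.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    nz = p_ab > 0
    return np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz]))

def mrmr_ranking(X, c, n_select):
    """Incrementally rank features: maximize relevance, subtract mean redundancy."""
    n_features = X.shape[1]
    relevance = np.array([mutual_information(X[:, j], c) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]          # start with the most relevant feature
    while len(selected) < n_select:
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([mutual_information(X[:, j], X[:, i]) for i in selected])
            score = relevance[j] - redundancy       # the mRMR objective for candidate j
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```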
Part II.
Feature Selection and Learning
for Semantic Segmentation
3. Feature Ranking and Object Learning
In this chapter, we introduce our approach to feature selection and learning for semantic segmentation. Our purpose is to find the most distinctive, discriminative and robust feature set for object recognition in the task of semantic segmentation. We employ a feature ranking strategy to compute the importance of each feature in a given feature set and reduce the redundancy in the set by analyzing how the performance changes with each feature. Thus, the redundant features are eliminated whilst the accuracy and the efficiency of the system are enhanced.
Within the proposed method, Section 3.1 presents our appearance model, also called the unary pixel potential, based on object detection/recognition. Section 3.2 shows how the features are ranked using a mutual-information-based feature selection strategy. Furthermore, Section 3.3 compares two machine learning methods, which are used for object learning in the majority of state-of-the-art studies, in order to address the question of which one is the most suitable learning strategy for semantic segmentation.
3.1. Appearance Model
Our appearance model is a unary pixel potential $\rho_i(x)$ assigned for every class i, given in Equation 3.1; the potentials are inferred with multi-class object detectors, i.e.:
$$\rho_i(x) = -\log \tilde{P}(i \mid f, x), \quad i = 1, \ldots, n \qquad (3.1)$$
where $\tilde{P}(i \mid f, x)$ is the probability of class i for a given feature vector f of pixel x. By taking the negative logarithm of the probability distributions, we ensure that our potential function is a monotonically decreasing function. Hence, the potential is a convex function, meaning that there exists a global minimizer.
Objects can be categorized by their shape, color and textural properties. To detect an object, one should exploit these features jointly so that objects of different types can be distinguished from each other whilst those of the same class are kept together. In this study, we employ six types of Haar-like features as shape features, and 5 color and 17 texton features (Gaussian filters, Laplacian of Gaussian and first-order derivative of Gaussian filters with various bandwidths) as color-texture features. In addition, we use the normalized canonical location features of Equation 3.2 and, when applicable, depth features, as shown in Figure 3.1. Object classifiers are trained with these 37 joint features to build a robust multi-class object detector.
$$f_l(p) = \left( \frac{x_p}{I_{\mathrm{width}}}, \frac{y_p}{I_{\mathrm{height}}} \right) \qquad (3.2)$$
where $(x_p, y_p)$ is the location of the pixel p in an image I.
(a) Haar-like features
(b) Color features
(c) Depth features
Figure 3.1.: Feature set. First row, Haar-like features: horizontal/vertical edges and
lines, center surround and four square. Second row, color features: relative
pixel/patch, horizontal/vertical edge and center surround on color channels.
Third row, depth features: relative pixel, relative pixel comparison and height
of a pixel.
Filters are applied in a patch surrounding the pixel on the image. We use the CIELab colour space [1], since it is a perceptually uniform color model and is designed to approximate human vision. This color space is a device-independent model and describes all colors visible to the human eye. The luminance (L) channel varies from 0 to 100 and indicates the brightness; 'a' and 'b' are color channels: the 'a' axis varies between green and red, while the 'b' axis varies between blue and yellow. Colors are usually numbered from -128 to 127. Figure 3.2 shows the three-dimensional Lab space. We sample the image pixels for training: filter responses are computed on a ∆ss × ∆ss grid on the image to reduce the computational expense [46] during training. For testing, however, filter responses are computed for each pixel.

Figure 3.2.: CIELab color space. The Lab color space has three dimensions: 'L' ranges from 0 to 100 and indicates the brightness, while 'a' and 'b' represent colors and both range from -128 to 127. Source: [2].
3.1.1. Shape Features
In 2001, Viola and Jones [49] presented very simple but robust Haar-like features for face detection. These features are reminiscent of the Haar basis functions (Haar wavelets) proposed in 1909 by Alfred Haar [24]. A Haar wavelet allows a function defined on an interval to be represented in terms of orthogonal basis functions. The only disadvantage is that Haar wavelets are not continuous and therefore not differentiable. The Haar mother basis function and the k-th Haar function are defined by:
$$\psi(x) = \begin{cases} 1 & 0 \leq x < \frac{1}{2}, \\ -1 & \frac{1}{2} < x \leq 1, \\ 0 & \text{else}, \end{cases} \qquad \psi_{jk}(x) \equiv \psi(2^j x - k)$$
for j a non-negative integer and $0 \leq k \leq 2^j - 1$, as shown in Figure 3.3.
Figure 3.3.: Haar Wavelets. The Haar mother basis function and Haar wavelets [50].
These simple Haar-like features can be rapidly computed using an intermediate representation of the image called the integral image [49], also known as the summed area table, introduced to computer graphics by Frank Crow in 1984 [13]. The integral image $\mathcal{I}$ at a pixel position (x, y) contains the sum of all pixel intensities above and to the left:
$$\mathcal{I}(x, y) = \sum_{x' \leq x,\; y' \leq y} \mathrm{image}(x', y') \,.$$

Figure 3.4.: Haar-like feature calculation via the integral image. On the integral image, point 1 contains the total sum of all pixels in region R1, point 2 contains R1 + R2, point 3 equals R1 + R3 and point 4 is the sum of all regions R1 + R2 + R3 + R4. The sum of the pixels in region R4 can therefore be computed with four basic operations: R4 = (4 + 1) − (2 + 3).
With this small modification of the image, the sum of the intensities in any rectangle can be computed efficiently with only four simple operations, as illustrated in Figure 3.4. To compute the feature responses, the sum of the pixels which lie within the white rectangles is subtracted from the sum of the pixels in the grey rectangles.
Haar-like features are very good at capturing shapes such as lines, edges and corners, and can be used for all kinds of objects. The most important advantage of Haar-like features is that object boundaries can be recognized very sensitively, so that accurate object separations can be retrieved as a result of the segmentation. Another big advantage of these features is that objects can be detected at any scale by scaling the feature window; when doing so, all rectangle sums must be normalized by the size of the rectangle to obtain scale-invariant feature responses. We only employ six Haar-like features; however, one can also use different types and enlarge the feature set. All shape features are computed only on the luminance (L) channel of the Lab color space.
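A minimal sketch of the integral image and of a box sum obtained from it with four lookups, as described above; a Haar-like response is then the difference of two such box sums (indices and names are our own):

```python
import numpy as np

def integral_image(img):
    """I(x, y): cumulative sum of all intensities above and to the left of (x, y)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] from four integral-image lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

# Example of a horizontal-edge-like response (upper box minus lower box):
# response = box_sum(ii, r, c, r + h // 2 - 1, c + w - 1) \
#          - box_sum(ii, r + h // 2, c, r + h - 1, c + w - 1)
```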
3.1.2. Color and Texture Features
Pixel colors are very basic features and give local information about parts of an object. However, objects usually consist of many different colors, and individual pixel colors cannot be used directly as a feature for the object as a whole. Therefore, we extract the color features for each pixel in a surrounding window. We are inspired by Hermans' simple color features [26]: the relative pixel and relative patch features shown in Figure 3.1 are used to capture the color relations between object parts and between different objects. Thus, we do not only learn
the general color model of the object, but also the color relation between the object and its surroundings. This also allows us to learn the co-occurrence of objects and, in some sense, increases the classification accuracy. In addition to the relative color features, we also use three of the Haar-like features, since they are extremely powerful at discovering object boundaries. Color features are computed on each channel of the Lab color space.
Colors are not always sufficiently discriminative and robust for different textures. For that reason, we additionally exploit simple, but relatively more expensive, spatial image filters for texture analysis, namely Gaussian (Equation 3.3), Laplacian of Gaussian (Equation 3.4) and derivative of Gaussian (Equation 3.5) convolution kernels. The kernels are computed locally around each pixel and the kernel size is determined by the bandwidth (2κ · σ + 1, κ = 1). Figure 3.5 shows the 3D and 2D texture kernels used in this application.

Figure 3.5.: Texture features. Gaussian, Laplacian of Gaussian and derivative of Gaussian convolution kernels used as texture features, shown as 3D surfaces and 2D images. Warm colors represent higher function values.
$$G(x, y; \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (3.3)$$
$$LoG(x, y; \sigma) = -\frac{1}{\pi\sigma^4} \left[ 1 - \frac{x^2 + y^2}{2\sigma^2} \right] e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (3.4)$$
$$DoG(x, y; \sigma) = \left[ \frac{\partial G_\sigma}{\partial x}, \frac{\partial G_\sigma}{\partial y} \right]^T = \left[ x\, e^{-\frac{x^2 + y^2}{2\sigma^2}},\; y\, e^{-\frac{x^2 + y^2}{2\sigma^2}} \right]^T \qquad (3.5)$$
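The following sketch builds the three kernels of Equations 3.3–3.5 on a discrete grid whose size follows the bandwidth rule 2κσ + 1 with κ = 1; the (un)normalization of the derivative kernels is kept as written in Equation 3.5, and all names are our own. Per-pixel filter responses can then be obtained by convolving an image channel with each kernel, e.g. with scipy.ndimage.convolve.

```python
import numpy as np

def texture_kernels(sigma, kappa=1):
    """Gaussian, LoG and x/y derivative-of-Gaussian kernels for one bandwidth sigma."""
    half = int(kappa * sigma)                       # kernel size is 2*kappa*sigma + 1
    x, y = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    r2 = x ** 2 + y ** 2
    gauss = np.exp(-r2 / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    log = -(1 - r2 / (2 * sigma ** 2)) * np.exp(-r2 / (2 * sigma ** 2)) / (np.pi * sigma ** 4)
    dog_x = x * np.exp(-r2 / (2 * sigma ** 2))      # derivative of Gaussian along x
    dog_y = y * np.exp(-r2 / (2 * sigma ** 2))      # derivative of Gaussian along y
    return gauss, log, dog_x, dog_y
```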
3.1.3. Location Features
More or less, every object occurs in a specific region of an image. For example, the sky is always expected to be at the top, whereas cars, roads and pavements are invariably located at the
bottom. The location information about the objects can be considered in two different
ways:
• Location as a feature: In the simplest case, locations can be integrated into the object
learning strategy in the same way as the other types of features. The location feature is basically computed as a normalized canonical form of a pixel location given
in Equation 3.2 . The most important location features will be implicitly selected
during training.
• Location as a potential: Another way of exploiting the location attribute of objects is to learn a potential which captures the weak dependence of the class label on the absolute location of the pixel in the image [44]. The location potential $\theta_\lambda$ can be incorporated into the appearance model as follows:
$$\rho_i(x) = -\log\big( \tilde{P}(i \mid f, x) \cdot \theta_\lambda(i, \hat{x}) \big) \,,$$
with $\theta_\lambda(i, \hat{x})$ being the location potential of class i at the normalized pixel location $\hat{x}$. The location potential is learnt from the training data with Equation 3.6, as used in [44]:
$$\theta_\lambda(i, \hat{x}) = \left( \frac{N_{i,\hat{x}} + \alpha_\lambda}{N_{\hat{x}} + \alpha_\lambda} \right)^{w_\lambda} \qquad (3.6)$$
where $N_{i,\hat{x}}$ is the number of pixels of class i at the normalized location $\hat{x}$ in the training set, $N_{\hat{x}}$ is the total number of pixels at location $\hat{x}$ and $\alpha_\lambda$ is a small integer, corresponding to a weak Dirichlet prior on $\theta_\lambda$. The location potential is raised to the power of $w_\lambda$ to compensate for over-counting. (A small sketch of this estimation is given after the list.)
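A small sketch of how the location potential of Equation 3.6 could be estimated from ground-truth label maps resampled to the normalized canonical size; alpha and w correspond to αλ and wλ, and all names are our own:

```python
import numpy as np

def location_potential(label_maps, n_classes, alpha=1, w=1.0):
    """theta_lambda(i, x_hat) from per-pixel class counts over the training set."""
    counts = np.zeros((n_classes,) + label_maps[0].shape)
    for labels in label_maps:                 # labels: canonical-size map of class ids
        for i in range(n_classes):
            counts[i] += (labels == i)
    total = counts.sum(axis=0)                # N_x_hat: total pixel count per location
    return ((counts + alpha) / (total + alpha)) ** w
```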
Figure 3.6.: Location Potential Map. Location potential map of the 7-class Corel Benchmark [25] (legend: Rhino, P. Bear, Water, Snow, Vegetation, Ground, Sky), learnt from the training data. Pixels are normalized to a 100-by-100 canonical size and $w_\lambda$ is chosen as 1. Colors show the occurrence likelihood of the objects at each pixel; warmer colors represent higher probabilities.

Figure 3.6 shows an example location map, learnt for the 7-class Corel dataset of [25]. This map shows the object occurrences in the image; the color gets warmer as the occurrence potential of the object increases. As shown in the figure, there is high uncertainty in the locations. The appearance model is composed of a scalar multiplication of the object detector confidence with the location confidence, and the weak location confidence causes high uncertainty in the model. Therefore, in this study we decided to use the location attribute as a feature.
3.1.4. Depth Features
With the development of cheap RGB-D cameras, the popularity of using these cameras in
the applications has increased. RGB-D sensors are very cheap and provide depth information of the scene along with the color images. Therefore, nowadays researchers work on
integrating the depth features into the system. By doing that, the accuracy of the system
will be increased by only exploiting computationally very low cost and simple features.
See Figure 3.7 for a sample gray scaled depth map of a given image, taken from NYU version 1 Depth Benchmark [47]. The Gaussian filter is applied to the all channels (Lab) whilst
Laplacian of Gaussian and derivative of Gaussian filters are only applied to the luminance
(L) channel.
We use three depth features, also used in [26] (a small sketch computing them follows the list):
1- The depth of a relative pixel in a surrounding window, normalized by the furthest depth in the corresponding column.
2- A depth comparison by subtraction of two relative depths, as described in [45]. At a given pixel x, the feature computes:
$$f_\theta(I, x) = d_I\!\left( x + \frac{u}{d_I(x)} \right) - d_I\!\left( x + \frac{v}{d_I(x)} \right) ,$$
where $d_I(x)$ is the depth at pixel x of an image I, and u and v are randomly chosen offsets in a window ($\theta = (u, v)$). The relative distances (offsets) are normalized by the depth of the current pixel, i.e. scaled by $\frac{1}{d_I(x)}$, to make the features more robust to camera translation.
3- The height of a point in 3D [47]: the height $f_h$ of a point $x = (x_r, x_c)$ with respect to the camera center is computed as:
$$f_h(I, x) = -d_I(x) \cdot x_r \,.$$
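A sketch of the three depth features for a single pixel of a dense depth map d; the clipping of out-of-range offsets, the choice of window and all names are our own simplifications:

```python
import numpy as np

def depth_features(d, r, c, u, v):
    """Relative depth, depth comparison f_theta and height feature f_h at pixel (r, c)."""
    h, w = d.shape
    # 1) depth of a relative pixel, normalized by the furthest depth in its column
    rr = int(np.clip(r + u[0], 0, h - 1))
    cc = int(np.clip(c + u[1], 0, w - 1))
    rel = d[rr, cc] / d[:, cc].max()
    # 2) depth comparison: offsets u, v are scaled by 1/d(x) for camera-translation robustness
    pu = (int(np.clip(r + u[0] / d[r, c], 0, h - 1)), int(np.clip(c + u[1] / d[r, c], 0, w - 1)))
    pv = (int(np.clip(r + v[0] / d[r, c], 0, h - 1)), int(np.clip(c + v[1] / d[r, c], 0, w - 1)))
    f_theta = d[pu] - d[pv]
    # 3) height of the point with respect to the camera center: f_h = -d(x) * row index
    f_h = -d[r, c] * r
    return rel, f_theta, f_h
```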
3.2. Feature Ranking
All the features used for object recognition in this study were initially selected only because of their popularity in state-of-the-art works. However, randomly chosen features do not always perform well, and combining strong, robust features does not guarantee any improvement in performance. Another drawback of using many features is the huge computation time: as the features are computed for each pixel, the calculation time grows with every feature added to the system and the run-time thus drastically increases. A further important issue when using different types of features jointly is that a redundant feature no longer acts as a feature but as noise, which reduces the classification performance. For all these reasons, we aim at reducing the redundancy and picking the most distinctive and robust feature set for a given benchmark.
Among many other feature selection methods, we use a feature ranking algorithm based
on mutual information. In [37], Peng et al. present an algorithm which allows ordering features by means of maximizing the relevance between features and the corresponding class labels and minimizing the inter-feature redundancy based on mutual information.

Figure 3.7.: Depth map of the scene. RGB-D cameras provide both an RGB color image (a) and the corresponding depth map (b) of the scene. Depth values are normalized to gray scale; brightness increases as the regions get further away from the camera.
The Minimum-Redundancy-Maximum-Relevance (mRMR) method maximizes the objective function in Equation 3.8, where MI(X; Y) denotes the mutual information of two continuous random variables:
$$MI(X; Y) = \int_Y \int_X p(x, y) \log \frac{p(x, y)}{p(x)p(y)} \, dx \, dy \,, \qquad (3.7)$$
$$\max_{f_j \in X - S_{m-1}} \left[ MI(f_j; c) - \frac{1}{m-1} \sum_{f_i \in S_{m-1}} MI(f_j; f_i) \right] . \qquad (3.8)$$
The mRMR method (see Section 6.1 for a detailed explanation) orders the features incrementally by selecting the one with the highest score at each iteration. After ranking the features according to their relative importance scores, we run the whole system adding the ranked features one by one to the feature set and performing the segmentation on all benchmarks independently. The list of ranked features for each benchmark is given in Appendix A. Further evaluations are presented in Chapter 6.
3.3. Object Learning
After feature computation, the most important, class-separating features are selected with one of the machine learning feature selection methods. In this research, we only study commonly used supervised learning algorithms to learn the object classifiers. For our field of application, we consider Gentle AdaBoost [20] and Random Forests [7] to be the most relevant. Classification algorithms generally estimate a class label for a given feature vector. However, since we need to compute an appearance model for segmentation, which consists of class distributions per pixel, we adapt these two learning algorithms to our purpose as described below.
Random Forests (see Section 2.2.2) are an ensemble of decision trees which can handle large data. The learning process starts with one decision tree and adds another one to the forest as long as the so-called out-of-bag (oob) error is bigger than a certain threshold and the maximum number of trees is not reached. Each tree $T_w$ returns a tree vote $T_w(f)$ for a given feature vector f. The probability distribution for each class i given f and a pixel x is then estimated as follows:
$$\tilde{P}(i \mid f, x) = \frac{\sum_{w=1}^{N} [T_w(f) = i]}{N}, \quad i = 1, \ldots, C \qquad (3.9)$$
where N is the number of trees in the forest and C is the total number of classes.
Random Forests exhibit a good detection rate and are very robust against outliers. An alternative object learning method is the Boosting algorithm (see Section 2.2.1). Boosting is a greedy learning algorithm that linearly combines weak learners to train a strong classifier. A one-vs-all strategy is used for the training, and one strong classifier is trained for each class. This method can either output a class label or a confidence value
which is the sum of the weighted votes of the weak learners. However, we require a probability distribution. Therefore, the distribution is computed using a soft-max transfer function:

    P̂(i | f, x) = exp(H_i(f, x)) / Σ_{j=1}^{C} exp(H_j(f, x)) ,                           (3.10)

where H_i(f, x) denotes the confidence score of class i, computed as follows:

    H_i(f, x) = Σ_{w=1}^{W} α_w^i(x) · h_w^i(x) .                                          (3.11)

Figure 3.8.: Random Forest vs. Gentle AdaBoost. Images (a) and (d) show the input and ground truth images, respectively. Images (b) and (c) illustrate the detection results (arg max_i P̂(i|f,x)) of the Random Forest and Gentle AdaBoost unary classifiers, and images (e) and (f) compare the corresponding variational segmentation results.
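For illustration, the soft-max transfer of Equation 3.10 can be computed as in the sketch below; the max-shift is a standard numerical-stability trick and is an addition not mentioned in the text above.

```python
import numpy as np

def boosting_class_distribution(confidences):
    """Eq. 3.10: soft-max over the per-class confidence scores H_i(f, x)
    returned by the one-vs-all Gentle AdaBoost classifiers."""
    h = np.asarray(confidences, dtype=float)
    h = h - h.max()          # shift for numerical stability; does not change the result
    e = np.exp(h)
    return e / e.sum()
```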
For our experiments, we set the same learning parameters for all datasets. We trained
the Random Forests with a maximum of 50 decision trees, each having a depth of 15 at
most. In the Gentle AdaBoost, we perform the training using 100 weak learners per class.
Figure 3.8 shows the experimental results for each unary pixel classifier. Note that the Random Forests give better classification results than the Gentle AdaBoost. This is due to the unbalanced training data: in most benchmarks the pixel count per class is non-uniformly distributed, which leads to a suppression of classes with fewer pixels. In addition, the soft-max transfer function over-smooths the class potentials. The reason for this is that each classifier is trained independently and that the confidence scores are unnormalized real values. The one-vs-all Boosting approach therefore does not provide a good appearance model. In contrast, the Random Forests circumvent this problem by
penalizing the class errors with the corresponding class weights during the training and perform equally well for each class.

Figure 3.9.: Probability maps provided by the Random Forest (a) and AdaBoost (b) classifiers for the classes Rhino, Polar Bear, Water, Snow, Vegetation, Ground and Sky. Warmer color represents higher probability.
Figure 3.9 shows the class-wise probability maps for both learners. We have more uncertainty (cold colors) in the AdaBoost map (Figure 3.9b) compared to the probabilities generated by the Random Forest (Figure 3.9a). This problem could be overcome by increasing the number of weak learners; however, this drastically increases the prediction time. Apart from that, prediction with the Random Forests is faster than with the Gentle AdaBoost. For these reasons, we decided to use Random Forests in the following experiments.
3.3.1. Measurement of the Classification Uncertainty
In our study, the Random Forests classifier does not return a single class label but a distribution over classes. Using this probability distribution ρ(x) at each pixel of an image (see Figure 3.9a), we measure the quality of the object recognition with the entropy. Entropy in information theory is a measure of the uncertainty associated with a random variable. The term is also known as Shannon entropy [42], which quantifies the expected value of the information contained in a message, that is, a specific realization of a random variable. The entropy H(C) over the C classes (in our case, the total number of classes) at a pixel x is defined as follows:
    H(C) = − Σ_{i=1}^{C} ρ_i(x) · log_C ρ_i(x) .
The uncertainty of the detection at a pixel x is maximal (H(C) = 1) if the Random Forests assign equal probabilities to all classes, and it is minimal (H(C) = 0) if a single class receives the full probability mass (ρ_i(x) = 1.0). Figure 3.10 illustrates a diagram that plots the entropy against the probability for the case of two random variables.
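A compact sketch of this per-pixel uncertainty measure is given below, assuming the class probabilities are stored as an array of shape (C, height, width); the clipping constant eps is an implementation detail added here only to avoid log(0).

```python
import numpy as np

def entropy_map(prob, eps=1e-12):
    """Per-pixel Shannon entropy of the class distribution, normalized with log_C
    so that H = 1 for a uniform distribution and H = 0 for a fully certain one.
    prob: array of shape (C, height, width) with class probabilities per pixel."""
    num_classes = prob.shape[0]
    p = np.clip(prob, eps, 1.0)
    return -(p * (np.log(p) / np.log(num_classes))).sum(axis=0)
```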
Figure 3.10.: Entropy H(X). Entropy in the case of two random variables with the probabilities p(X = 1) and 1 − p(X = 1). Uncertainty is maximal (H(X) = 1) when both outcomes are equally probable.

We expect that a good class distribution will ultimately yield the best segmentation. Therefore, entropy is used to measure the quality of the classifier. Figure 3.11 shows an example of entropy-based uncertainty measurement. The global uncertainty measurements of the unary pixel classifiers for each benchmark are presented in Section 6.2.
Figure 3.11.: Detection Uncertainty. (a) and (b) show the input image and the unary pixel classification result (arg max_i P̂(i|f,x)), respectively. (c) shows the detection certainty at each pixel computed with the entropy; warmer colors represent high certainty. As shown in (b) and (c), vegetation, sky and the inner part of the building have a high certainty in contrast to other regions. Due to the insufficient training samples of cars in the eTrims [27] Benchmark, the pixels belonging to the car have the lowest certainty.
4. Variational Optimization and Image
Segmentation
This chapter presents the energy functional based on unary pixel potential for multi-class
image segmentation and gives a brief description of the convex energy optimization with
a first-order primal-dual formulation.
4.1. Variational Image Segmentation
Given an input image I : Ω → R³ defined on the image domain Ω ⊂ R², we compute the appearance model ρ_i(x) for all classes i. A subsequent minimization of a regularizing energy based on variational multi-labeling [9] allows us to obtain a consistent pixel labeling and thereby to remove label noise in the unary potentials. The energy functional in Equation 4.1 is minimized using a first-order solver [11]. As a regularizer, the total variation of the indicator function u_i of an object of class i is used. This penalizes the boundary length of the object associated with u_i. In order to favor the coincidence of object boundaries and image edges, the total variation term is weighted with the non-negative function g : Ω → R⁺ given in Equation 2.5.
    E(u) = Σ_{i=1}^{n} ∫_Ω ρ_i(x) · u_i(x) dx + λ Σ_{i=1}^{n} ∫_Ω g(x) · |∇u_i(x)| dx       (4.1)
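For readers who prefer a discrete view, the following sketch evaluates the functional of Equation 4.1 on a pixel grid using forward differences for the gradient; the discretization and the array layout are assumptions made purely for illustration.

```python
import numpy as np

def segmentation_energy(u, rho, g, lam):
    """Discrete evaluation of the functional in Eq. 4.1 on a pixel grid.
    u, rho: arrays of shape (n_classes, H, W); g: edge weights of shape (H, W)."""
    data_term = float((rho * u).sum())
    dx = np.zeros_like(u)
    dy = np.zeros_like(u)
    dx[:, :, :-1] = u[:, :, 1:] - u[:, :, :-1]    # forward differences in x
    dy[:, :-1, :] = u[:, 1:, :] - u[:, :-1, :]    # forward differences in y
    tv_term = float((g * np.sqrt(dx ** 2 + dy ** 2)).sum())
    return data_term + lam * tv_term
```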
4.2. First-order Primal-Dual Convex Optimization
In order to guarantee an optimal solution, the energy functional must meet the following two conditions:
• Necessary condition. If the energy functional has a minimum, the derivative of the functional at that minimum must be equal to zero:

    dE(u)/du = 0 .

• Sufficient condition. To obtain a global minimizer, the functional must be convex.
As already discussed in Section 2.1 , our energy functional in Equation 4.1 satisfies the
above conditions. However, the minimization problem is now a saddle-point optimization
since the total variation is replaced with its primal-dual formulation:
    |∇u| = sup_{|ξ| ≤ 1} ξ · ∇u ,

where ξ ∈ R² is a dual variable and the supremum is attained at ξ = ∇u / |∇u| if |∇u| ≠ 0. This re-formulation allows us to rewrite the TV as in Equation 2.4.
Since we replaced the total variation by its primal-dual formulation, minimizing the energy functional amounts to finding a saddle-point. In this study, we exploit an efficient algorithm for minimizing this saddle-point problem, introduced in [11, 38]. We bring our energy functional (Equation 4.1) into the following saddle-point form:
    min_{u ∈ S} sup_{ξ ∈ K} Σ_{i=1}^{n} ∫_Ω ρ_i(x) · u_i(x) dx − λ Σ_{i=1}^{n} ∫_Ω div ξ_i · u_i(x) dx
This problem can now be optimized with a gradient descent/ascent scheme, as also used in [53]. The scheme descends in the primal variable and ascends in the dual variable until convergence at the saddle-point. Following the necessary condition, we fix one of the variables at a time and differentiate the energy to formulate the update scheme for the optimization:
    ∂E/∂u = ρ − div ξ ,        ∂E/∂ξ = ∇u
In the first step, the primal variable u⁰, the dual variable ξ⁰ and an auxiliary variable v⁰, which is used in the acceleration step, are initialized with 0. Then the algorithm iteratively runs the following steps:
    ξ_i^{n+1} = Π_K( ξ_i^n + τ · ∇v_i^n )
    u_i^{n+1} = Π_S( u_i^n + η · (div ξ_i^{n+1} − ρ_i) )
    v^{n+1}   = 2 u^{n+1} − u^n
τ > 0, η > 0 are step sizes and the solution provably converges for sufficiently small step
sizes. ΠK and ΠS are the projections onto the corresponding sets. For the primal variable
u, the projection onto the set S = BV(Ω; [0, 1]) is calculated by clipping:
    (Π_S u)(x) = min{1, max{0, u(x)}} =
        u(x),  if u(x) ∈ [0, 1]
        1,     if u(x) > 1
        0,     if u(x) < 0
For the dual variable ξ, the projection onto the unit disk K is done as follows:
    (Π_K ξ)(x) = ξ(x) / max{1, |ξ(x)|}
This update scheme is applied on each pixel independently on the image. Therefore, the
optimization process can be accelerated on a GPU.
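The following NumPy sketch illustrates one possible discretization of the above descent/ascent iteration. It folds the weight λ·g(x) into the radius of the dual constraint set (an equivalent way of handling the weighted TV term, whereas the text above projects onto the unit ball) and uses illustrative step sizes; it is not the GPU implementation used for the experiments.

```python
import numpy as np

def primal_dual_segmentation(rho, g, lam, tau=0.25, eta=0.25, iters=500):
    """Projected gradient descent/ascent for the saddle-point problem above.
    rho: unary potentials, shape (n, H, W); g: edge weights, shape (H, W)."""
    n, height, width = rho.shape
    u = np.zeros_like(rho)
    v = np.zeros_like(rho)
    xi = np.zeros((n, 2, height, width))        # dual variable: one 2D vector field per label

    def grad(a):                                # forward differences, Neumann boundary
        gx = np.zeros_like(a)
        gy = np.zeros_like(a)
        gx[:, :, :-1] = a[:, :, 1:] - a[:, :, :-1]
        gy[:, :-1, :] = a[:, 1:, :] - a[:, :-1, :]
        return gx, gy

    def div(px, py):                            # negative adjoint of grad (backward differences)
        d = np.zeros_like(px)
        d[:, :, 0] += px[:, :, 0]
        d[:, :, 1:] += px[:, :, 1:] - px[:, :, :-1]
        d[:, 0, :] += py[:, 0, :]
        d[:, 1:, :] += py[:, 1:, :] - py[:, :-1, :]
        return d

    radius = lam * g + 1e-12                    # weighted dual ball |xi(x)| <= lam * g(x)
    for _ in range(iters):
        gx, gy = grad(v)                        # dual ascent step on the extrapolated variable
        xi[:, 0] += tau * gx
        xi[:, 1] += tau * gy
        norm = np.maximum(1.0, np.sqrt(xi[:, 0] ** 2 + xi[:, 1] ** 2) / radius)
        xi[:, 0] /= norm                        # projection Pi_K onto the (weighted) ball
        xi[:, 1] /= norm
        u_new = np.clip(u + eta * (div(xi[:, 0], xi[:, 1]) - rho), 0.0, 1.0)  # Pi_S by clipping
        v = 2.0 * u_new - u                     # acceleration / over-relaxation step
        u = u_new
    return u.argmax(axis=0)                     # winner-takes-all label map
```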
Additionally, we also optimize the smoothing parameter λ for each benchmark by using
binary search as given in Algorithm 2 . Hence, we obtain the best possible multi-region
segmentation of the image.
Data: Unary pixel-wise potentials ρ(x)
Result: λ_opt
ε ← 0.005;
increment ← 3;
in-between ← false;
λ ← 1;
λ_opt ← 0;
P_opt ← 0, P_current ← 0, P_previous ← 0;
Solve the segmentation with λ and calculate the performance P;
while P_current − P_previous ≥ ε do
    if P_opt < P then
        P_opt ← P;
        λ_opt ← λ;
    end
    P_previous ← P_current;
    P_current ← P;
    if P_current − P_previous > 0 then
        if in-between then
            increment ← increment / 2;
        end
        λ ← λ + increment;
        in-between ← false;
    else
        increment ← increment / 2;
        λ ← λ − increment;
        in-between ← true;
    end
    Solve the segmentation with λ;
    Calculate the performance P;
end
Algorithm 2: Optimization of the smoothing parameter λ. For each benchmark, λ is optimized with binary search.
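A compact Python transcription of Algorithm 2 could look as follows. The stopping criterion uses the absolute performance difference so that the halving steps around the optimum are actually executed, and a maximum iteration count is added as a guard; both are small deviations from the pseudocode as printed. The callable solve_and_score is a placeholder for running the segmentation with a given λ and returning its performance.

```python
def optimize_lambda(solve_and_score, lam_init=1.0, increment=3.0, eps=0.005, max_steps=50):
    """Sketch of Algorithm 2: coarse-to-fine search of the smoothing parameter lambda."""
    lam = lam_opt = lam_init
    p_prev = 0.0
    p_curr = p_opt = solve_and_score(lam)
    in_between = False
    for _ in range(max_steps):
        if abs(p_curr - p_prev) < eps:        # performance change small enough: stop
            break
        if p_opt < p_curr:                    # remember the best lambda seen so far
            p_opt, lam_opt = p_curr, lam
        if p_curr - p_prev > 0:               # still improving: keep increasing lambda
            if in_between:
                increment /= 2.0
            lam += increment
            in_between = False
        else:                                 # performance dropped: step back, halve the step
            increment /= 2.0
            lam -= increment
            in_between = True
        p_prev, p_curr = p_curr, solve_and_score(lam)
    return lam_opt
```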
Part III.
Experimental Evaluation
5. Benchmarks
In this chapter, we first introduce the benchmarks used for the scientific evaluation. Our framework has been evaluated on four challenging databases. The benchmarks consist of landscape images, wild scenes, facade and indoor scenes. This chapter gives a quick overview of each benchmark.
5.1. eTrims Benchmark
The E-Training for Interpreting Images of Man-Made Scenes (eTrims) [27] dataset consists of 60 facade/street images. Each pixel is labeled with a color which corresponds to a class label; thus, this dataset offers ground truth annotations on both the pixel and the region level. There are 8 classes in total: 'Window', 'Vegetation', 'Sky', 'Building', 'Car', 'Road', 'Door' and 'Pavement'. Typical objects in the images exhibit considerable variations in both shape and appearance and hence represent a challenge for object-based image interpretation. This benchmark is one of the most challenging datasets, since the shape and appearance of the objects are very detailed in these high-resolution images and the generalization of the complex object structures is a difficult task in object classification. Figure 5.1 shows a sample image and the ground truth labeling from the eTrims Benchmark. The pixels close to the object boundaries are not labeled (black) due to the ambiguity in the original images.
Figure 5.1.: eTrims Benchmark. Sample image (left) from eTrims dataset with its associated ground truth labeling (right). Each class is represented with a color. Labels
are imposed on the respective color regions.
5.2. NYU version 1 Depth Benchmark
Cheap and sufficiently accurate RGB-D cameras have become increasingly popular in computer vision applications since, besides color images, these cameras also provide a depth map of the scene. In our experiments, we used the RGB-D indoor benchmark introduced in [47] and also used in [26]. This benchmark comprises thousands of objects, and each image has a resolution of 640×480 pixels. In order to compare our results with the state of the art, we also categorize the objects into 12 core classes: 'Bed', 'Blind', 'Bookshelf', 'Cabinet', 'Ceiling', 'Floor', 'Picture', 'Sofa', 'Table', 'TV', 'Wall' and 'Window'. Depth values are normalized to grey levels (ranging from 0 to 255); the brighter the color, the farther the distance. Unknown regions are labeled with white color in the ground truth, and these regions are not considered during training. Figure 5.2 shows a sample image with the corresponding depth map and ground truth.
Figure 5.2.: NYU version 1 Depth Benchmark. From left to right: original image, grey
level depth map and corresponding ground truth labeling.
5.3. Corel and Sowerby Benchmarks
We also applied our method to the two natural image datasets used in [25]. The first dataset is a 100-image subset of the Corel image database, consisting of African and Arctic wildlife natural scenes. The Corel dataset is labeled with 7 classes: 'Rhino/Hippo', 'Polar Bear', 'Vegetation', 'Sky', 'Water', 'Snow' and 'Ground'. Each image is 180×120 pixels. See Figure 5.3 for a sample image and ground truth labeling.
The second dataset, the Sowerby Image Database of British Aerospace, is a set of color images of outdoor scenes (containing objects near roads in rural and suburban areas) and their associated labels. This benchmark comprises 104 images with 7 labels: 'Sky', 'Grass', 'Roadline', 'Road', 'Building', 'Sign' and 'Car'; each image is 96×64 pixels.
Figure 5.3.: Corel Benchmark. An example image (left) from Corel benchmark and its associated ground truth (right) with imposed class names on the corresponding
labels.
Figure 5.4.: Sowerby Benchmark. An example image (left) from Sowerby benchmark and
its associated ground truth (right) with imposed class names on the corresponding labels.
6. Experimental Results
This chapter presents our evaluations for feature selection and segmentation separately.
In Section 6.1 , we discuss how many and what kind of features give the best performance
for each benchmark independently. Then, in Section 6.2 , we present our segmentation
results with the selected subset of features. We compare our experimental results with
state-of-the-art methods in terms of quantitative scores and run-time.
We have implemented our approach in C++ and NVIDIA CUDA using the OpenCV (Open Computer Vision) [6] library and the DARWIN [23] machine learning framework on Ubuntu 12.04 LTS (Precise Pangolin). All experiments have been executed on an Intel Core i7-3770 processor equipped with 32 GB RAM and an NVIDIA GeForce GTX 480 graphics card.
We tested our framework on four different benchmarks, the 8-class facade dataset introduced in [27], the 7-class Corel and Sowerby datasets of He et al. [25] as well as the NYU
Depth v1 [47] with 12 core classes. Except for the eTrims Benchmark, we randomly split
each dataset into training and test sets by 50%. Table 6.1 shows the patch size and the
sampling rate ∆ss used for each benchmark. Due to the huge memory consumption of the feature ranking with the mRMR algorithm, we only use 100 randomly sampled images for the NYU Depth v1 Benchmark.
Table 6.1.: Parameters. Patch size and sampling rate used for each benchmark.

Parameter    Sowerby  Corel  eTrims  NYUv1
Patch size      6       10     24      24
∆ss             3        3      5       5
For the object learning, the same parameters are used for all benchmarks during the training of the Random Forests and Gentle AdaBoost classifiers. The AdaBoost classifiers are trained with 100 weak learners for each class, where each weak learner is a decision stump. The Random Forests are trained with a maximum of 50 trees, each with a depth of at most 15. The maximum-categories parameter is set to 15 (any number larger than the total number of classes) so that none of the classes in the benchmark are grouped during training. A node is split further only if it contains at least 10 samples. The number of trees in the forest is determined by the forest accuracy, measured by the out-of-bag error, whose target is set to 1%. Apart from the learning parameters, due to the highly unbalanced sample distribution over the classes in the benchmarks, we set a class weight for each class; during the out-of-bag error calculation, the class errors are weighted with the corresponding class weights to balance the class-wise classification accuracy. Although this weighting does not improve the overall performance, it has a significant effect on the class-wise performance scores. Table 6.2 shows the class weights used for each benchmark. The class weights are set to 1 for each class in the NYUv1 Depth Benchmark.
Table 6.2.: Class Weights. Class priors to balance the detection performance of each class
in all benchmarks.
(a) Sowerby
Labels  Sky  Grass  Roadline  Road  Building  Sign  Car
Weight  1.0  1.0    1.5       1.0   1.0       1.5   1.5

(b) Corel
Labels  Rhino  Polar Bear  Water  Snow  Ground  Vegetation  Sky
Weight  1.0    1.0         1.05   1.0   1.0     1.0         1.1

(c) eTrims
Labels  Building  Vegetation  Door  Window  Car  Sky  Pavement  Road
Weight  1.0       1.0         1.15  0.65    1.0  1.0  1.0       1.0
6.1. Feature Selection
We rank the features using the mRMR method on the training images and select the best
subset of features, by incrementally adding the ranked features one by one to our feature
set and comparing the results for each feature set. Figure 6.1 illustrates the overall performance scores versus the total number of used features for all benchmarks. From this illustration, one can easily identify the features which actually increase the performance, as well as the number of features that leads to a redundancy-free feature set.
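The evaluation loop sketched below mirrors this procedure: the mRMR-ranked features are added one at a time, the classifier is retrained and the overall benchmark score is recorded. The two callables are placeholders for training and evaluation, not part of the actual framework.

```python
def evaluate_ranked_features(ranked_ids, train_fn, score_fn):
    """Sketch of the evaluation loop of Section 6.1: add the mRMR-ranked features
    one at a time, retrain the classifier and record the overall score."""
    scores = []
    for k in range(1, len(ranked_ids) + 1):
        classifier = train_fn(ranked_ids[:k])     # train with the top-k ranked features
        scores.append(score_fn(classifier))       # overall performance with k features
    return scores
```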
The eTrims Benchmark is composed of high resolution images, thus object textures have
significant importance for detection. Similar to Fröhlich et al. [21] we split the dataset by
a ratio of 60%/40% for training/testing. As shown in Figure 6.1 , there is a jump at the
24th feature which is a first order derivative of a Gaussian on the ‘L’ channel. Figure 6.1
indicates that the remaining features are redundant. Therefore, we only use the first 24
features for the eTrims Benchmark.
For a reasonable overall performance on the Sowerby Benchmark according to Figure 6.1 , already the first three features would be enough: the relative color feature on
the ‘a’ channel, the Haar horizontal edge feature on the ‘b’ channel and the relative patch
feature on the ‘L’ channel. Instead of using a larger set of features, this simple set can be
used to obtain similar performance. However, one can observe another jump in the performance at the 21st feature. Additional features beyond the first 21 do not improve the
performance, but would increase the computational cost. Hence, for our experiments we
used the first 21 features. Most of the dropped features are texture features. In contrast
to the eTrims Benchmark, the information gain of the textural features is relatively small
because of the low resolution of the images in the Sowerby dataset.
Figure 6.1.: Overall performance vs. Features. Features are added to the feature set one by
one. Pink dots denote the type of the feature (H: Haar, C: Color, T: Texture and
L: Location, D: Depth) added at current step. Yellow circles show how many
features are selected for the benchmark.
6.2. Segmentation Results
In this section, we present our qualitative and quantitative segmentation results, in addition to performance and run-time comparisons between our approach and state-of-the-art studies. We first present the global entropy scores computed during the evaluation. The entropy is computed at each pixel and averaged over the whole test set of each random split as the global entropy. The resulting mean global entropy scores for each benchmark are given in Table 6.3.
Table 6.3.: Global Detection Entropy. Unary pixel classifier global entropy scores for all
benchmarks.
Benchmark       eTrims  NYUv1  Corel  Sowerby
Ĥ_global(C)      0.31    0.38   0.42    0.23
In Table 6.4 we compare our eTrims benchmark segmentation results with the work of
Fröhlich et al. [21]. The algorithm is executed on ten random splits of each benchmark and
the scores shown in Tables 6.4, 6.5, 6.6, 6.7 are the average scores of these random splits.
The scores give the percentage of the correctly labeled pixels for each class, overall denotes
the percentage of the correctly labeled pixels in the whole dataset whilst the average gives
the mean of the class-wise scores.
Table 6.4.: eTrims Benchmark. Comparison of the segmentation performance and the efficiency for eTrims.

Labels                Building  Vegetation  Door  Window  Car   Sky   Pavement  Road  Overall  Average  Time
Fröhlich et al. [21]    63.5      90.1      95.4   71.9   77.4  73.2    71.1    69.9    76.1     72.3   ≈ 17s
Ours                    75.0      84.0      64.0   57.0   78.0  96.0    56.0    69.0    76.0     72.0    ≈ 5s
Our performance scores shown in Table 6.4 are as good as those of Fröhlich et al. [21], and we achieve a significant improvement in the evaluation time for the eTrims Benchmark.
Table 6.5 shows that both training and testing speeds are significantly reduced compared to Shotton et al. [46] on the Sowerby and Corel benchmarks.
Table 6.5.: Sowerby and Corel Benchmark. Segmentation/detection and efficiency comparison with Shotton et al.

                                           Overall           Speed (Train/Test)
                                        Sowerby  Corel      Sowerby       Corel
Shotton et al. – Full CRF model [46]      88.6    74.6      5h / 10s    12h / 30s
Ours – Segmentation                       90.0    69.4     13s / 96ms  45s / 280ms
Shotton et al. – Unary classifier only    85.6    68.4         –            –
Ours – Unary classifier only              87.1    65.0         –            –
Table 6.6.: Sowerby and Corel Benchmark Average Class Scores. Average class-wise segmentation performance of ten random splits.

(a) Sowerby
Labels         Sky   Grass  Roadline  Road  Sign  Building  Car
Average Score  93.7  87.5   7.35      96.0  57.0  0.042     0.274

(b) Corel
Labels         Rhino  Polar Bear  Water  Snow  Vegetation  Ground  Sky
Average Score  83.1   65.5        75.1   60.6  71.5        65.6    53.0
Furthermore, we compare our segmentation with the approaches of Ladický et al. [29]
and Hermans et al. [26] on the NYUv1 Benchmark. Here we use only the best 11 ranked features, including the depth features, to compete with Hermans et al. [26]. Table 6.7 demonstrates that we can significantly improve the detection rate for several classes. We obtain the best score compared to Ladický et al. [28] and Hermans et al. [26] for more than half of the classes. Furthermore, we obtain the best average performance. We cannot win a direct runtime competition with the algorithm of Hermans et al. [26]. Nevertheless, our system is open to parallelization for faster evaluation. Our goal is to keep things simple and to give the opportunity for a quick reimplementation without the need for high-end hardware. Compared to the multi-threaded CPU implementation of Ladický et al. [29], our runtime is around 20 times faster. This shows the outstanding performance of our method in terms of detection accuracy and runtime.
Table 6.7.: NYUv1 Benchmark. Comparison of the segmentation/detection performance and the efficiency for NYUv1.

Labels                        Bed  Blind  Booksh.  Cabinet  Ceiling  Floor  Picture  Sofa  Table  TV  Wall  Window  Overall  Average  Time
Ladický et al. [29]            14    76       3       57       34      59       49    34     47   56    90      11       68       44   ≈ 3m
Hermans et al. [26]            58    88      57       67       58      93       57    67     46   82    78      17       71       64   ≈ 1s
Ours                           82    78      59       82       75      75       79    71     81   85    68      49       70       74   ≈ 8s
Hermans et al. [26] (unary)    51    87      42       48       54      88       62    50     40   73    70      19       65       57   0.2s
Ours (unary)                   76    75      57       72       70      71       70    66     74   74    63      44       64       68   ≈ 4s
Figure 6.2 illustrates qualitative results for all benchmarks. In the segmentation result for eTrims in the third row of Figure 6.2d, we can see that the windows on the roof are labeled as sky because of the reflection; moreover, in the upper part of the image a region is labeled as pavement although there is none in the ground truth.
Comprehensive experiments on various benchmarks have shown that the segmentation performance depends heavily on the pixel-wise object classification, for which the quality of the feature set is decisive. Combining robust and discriminative features does not always yield better performance, since the redundancy of interrelated features causes a large amount of misclassification in the object detection and hence decreases both the efficiency of the system in terms of computational expense and the accuracy of the segmentation. Therefore, the features were ranked based on their mutual information, and the most distinctive feature set for each benchmark was determined with performance experiments on the test sets. The selected subsets of features improved the segmentation accuracy while decreasing the computational cost along with the complexity of the classification problem. It is also shown that Haar-like features are the most robust and the most discriminative features compared to the other simple features and also to computationally expensive features such as textons. However, in contrast to conventional object detection studies, these simple yet powerful features yield better performance if they are extracted on color channels. Moreover, Haar-like features are indeed able to capture object shapes robustly, which helps the object detectors to preserve the object boundaries by finding the best separation of the segments accurately.
Figure 6.2.: Qualitative Results. Rows correspond to the Sowerby, Corel, eTrims and NYU Depth v1 benchmarks, respectively. The first two columns show the input and ground truth images, while the third and fourth columns show the unary pixel classification and the segmentation result, respectively.
Part IV.
Summary and Conclusion
7. Summary
Semantic segmentation is a joint task of object detection/recognition and segmentation
in the field of computer vision. In this work, to address the problem of semantic segmentation, an appearance model based on object detectors is constructed and an energy
functional based on this model is optimized with a convex optimization approach.
In this study, we construct our appearance model with a Random Forest classifier trained with various types of features, and we optimize our energy functional, which uses the boundary-length-minimizing total variation as regularizer, with a variational optimization technique. Although the optimization finds the best segmentation for a given appearance model, the robustness and the accuracy of the segmentation depend entirely on the quality of the appearance model. The major challenge in constructing this sort of appearance model is to accurately detect the objects while preserving the boundaries in between. In order to segment all disjoint regions in an image, current studies use a mixture of all types of features so that the various objects, which carry different structural and textural properties, can be easily distinguished from each other. Despite the fact that the different types of features are designed to capture different object properties such as color, texture or shape, using these features jointly is not always a good solution, because high relevance between features yields high redundancy in the feature set. As a result of redundant features, the noise in the training data increases and the classification methods struggle to build robust object classifiers. Moreover, the computational cost increases rapidly with every feature added to the feature set, since the object detection is performed at the pixel level, and the segmentation speed therefore decreases.
By exploiting redundancies in feature sets we have shown that the computational cost
for learning and testing in the task of semantic segmentation can be significantly reduced.
To this end we adapted a feature selection algorithm to the purpose of semantic segmentation which reduced the redundancy within the feature set while preserving accuracy. In
many cases, we are able to outperform state-of-the-art results in terms of both accuracy and runtime, thanks to a sophisticated feature-set selection of reduced dimension.
7.1. Discussion
In this study, we have shown that Random Forests perform better than Gentle AdaBoost in terms of runtime as well as classification accuracy in the task of multi-class object detection for semantic segmentation. Apart from that, the lists of ranked features indicate that Haar-like features are the most robust and the most discriminative features compared to the others. We should also note that Haar-like rectangles are capable of capturing shapes robustly and produce much more distinctive features on color channels. Additionally, the experiments have shown that the object boundaries can be accurately detected with these simple features.
As every object has its own characteristics, every benchmark also has its peculiarities, and therefore the number and type of features which should be used vary from one benchmark to another. Simple objects and low-resolution images, in which the objects show little detail, can be segmented easily using only a couple of features, while high-resolution images, in which the shape and texture of the objects can be observed well, can only be segmented accurately using relatively more features. Therefore, the used features should be examined and the best subset of features should be selected in order to detect the objects and to find a consistent segmentation, in addition to reducing the computational expense and runtime.
8. Conclusion
Semantic segmentation is one of the most challenging research topics in the field of computer vision due to the huge diversity of objects and the limitations of current object classification methods. Although current multi-class classification algorithms perform well on various image benchmarks, the high relevance between features and the resulting redundancy in the feature set mean that these algorithms suffer from a huge amount of noise in the training data. As a result of noisy training, the accuracy of the classifiers drastically decreases and the segmentation fails. Therefore, the visual object detection features, which basically represent the characteristics of the objects, should be chosen depending on their importance, so that the classification method generates a robust object classifier from a training set with little noise. Hence, the quality of the object detectors increases and the algorithm generates better segmentation outputs at a lower computational cost.
8.1. Future Work
Feature selection by reducing the redundancy in the feature set with the minimum-redundancy-maximum-relevance method improves the performance and considerably reduces the computation time. However, this approach requires a huge amount of memory to run, so it is not possible to perform the feature selection on large benchmarks. As future work, one should either optimize the method so that it can operate on huge amounts of data, or replace it with another approach.
Due to this memory issue, we could only test several types of features, and our framework only allows selecting the features for each benchmark independently. In future studies, one could generalize the framework to select the features considering all benchmarks together. Moreover, once the system is capable of this, the number of objects in the set of classes can be increased to solve more complex and higher-dimensional object classification problems.
Appendix
A. List of Ranked Features
The full list of ranked features for each benchmark is given in this chapter. For the eTrims, Corel and Sowerby benchmarks, 37 features are used, composed of 6 Haar-like features, 12 color features, 2 location features and 17 texture features. For the NYU version 1 Depth benchmark, 3 depth features are used in addition. Tables A.2, A.3, A.4 and A.5 list the ranked features with their associated type, the color channel on which the feature is computed, the patch size and the bandwidth parameter used for the feature. For better visualization, rows in the tables are colored with the corresponding color of the features given in Table A.1.
Table A.1.: Feature Colors. Each feature is associated with a color.
Haar Color Texture Location Depth
Table A.2.: eTrims Benchmark. The full list of ranked features.

Rank  Feature   Type                    Lab Channel  Patch Size  Bandwidth
1     Color     Relative Patch          b            24          -
2     Color     Vertical Edge           a            24          -
3     Color     Relative Pixel          L            24          -
4     Haar      Vertical Line           L            24          -
5     Color     Relative Patch          a            24          -
6     Color     Center Surround         b            24          -
7     Texture   Gaussian                L            -           1
8     Color     Vertical Edge           b            24          -
9     Color     Relative Patch          L            24          -
10    Color     Horizontal Edge         a            24          -
11    Color     Relative Pixel          b            24          -
12    Texture   Laplacian of Gaussian   L            -           1
13    Color     Relative Pixel          a            24          -
14    Texture   Gaussian                L            -           2
15    Texture   Derivative of Gaussian  L            -           2
16    Color     Center Surround         a            24          -
17    Texture   Gaussian                L            -           4
18    Haar      Center Surround         L            24          -
19    Color     Horizontal Edge         b            24          -
20    Texture   Derivative of Gaussian  L            -           2
21    Texture   Derivative of Gaussian  L            -           4
22    Texture   Laplacian of Gaussian   L            -           2
23    Texture   Laplacian of Gaussian   L            -           4
24    Texture   Derivative of Gaussian  L            -           4
25    Location  Y Direction             -            -           -
26    Texture   Gaussian                a            -           1
27    Haar      Vertical Edge           L            24          -
28    Texture   Gaussian                a            -           2
29    Location  X Direction             -            -           -
30    Texture   Gaussian                b            -           4
31    Texture   Gaussian                b            -           1
32    Texture   Gaussian                a            -           4
33    Texture   Gaussian                b            -           2
34    Texture   Laplacian of Gaussian   L            -           8
35    Haar      Four Square             L            24          -
36    Haar      Horizontal Edge         L            24          -
37    Haar      Horizontal Line         L            24          -
Table A.3.: NYUv1 Benchmark. The full list of ranked features.

Rank  Feature   Type                    Lab Channel  Patch Size  Bandwidth
1     Depth     Point Height            Depth Map    24          -
2     Location  Y Direction             -            -           -
3     Color     Relative Patch          a            24          -
4     Color     Center Surround         a            24          -
5     Depth     Relative Pixel Depth    Depth Map    24          -
6     Color     Horizontal Edge         a            24          -
7     Color     Relative Patch          b            24          -
8     Color     Vertical Edge           a            24          -
9     Color     Relative Pixel          a            24          -
10    Color     Center Surround         b            24          -
11    Haar      Vertical Line           L            24          -
12    Color     Horizontal Edge         b            24          -
13    Depth     Relative P. Comparison  Depth Map    24          -
14    Color     Vertical Edge           b            24          -
15    Color     Relative Pixel          b            24          -
16    Texture   Laplacian of Gaussian   L            -           1
17    Haar      Horizontal Line         L            24          -
18    Color     Relative Patch          L            24          -
19    Texture   Laplacian of Gaussian   L            -           8
20    Texture   Laplacian of Gaussian   L            -           2
21    Texture   Laplacian of Gaussian   L            -           4
22    Texture   Derivative of Gaussian  L            -           4
23    Texture   Derivative of Gaussian  L            -           4
24    Texture   Gaussian                a            -           1
25    Haar      Horizontal Edge         L            24          -
26    Texture   Gaussian                b            -           2
27    Color     Relative Pixel          L            24          -
28    Texture   Gaussian                b            -           4
29    Texture   Gaussian                a            -           2
30    Texture   Derivative of Gaussian  L            -           2
31    Texture   Gaussian                a            -           4
32    Texture   Gaussian                b            -           1
33    Haar      Vertical Edge           L            24          -
34    Texture   Derivative of Gaussian  L            -           2
35    Location  X Direction             -            -           -
36    Texture   Gaussian                L            -           4
37    Texture   Gaussian                L            -           1
38    Texture   Gaussian                L            -           2
39    Haar      Four Square             L            24          -
40    Haar      Center Surround         L            24          -
Table A.4.: Corel Benchmark. The full list of ranked features.

Rank  Feature   Type                    Lab Channel  Patch Size  Bandwidth
1     Haar      Horizontal Line         L            10          -
2     Color     Vertical Edge           a            10          -
3     Color     Relative Patch          a            10          -
4     Color     Center Surround         b            10          -
5     Texture   Derivative of Gaussian  L            -           4
6     Color     Center Surround         a            10          -
7     Color     Relative Patch          L            10          -
8     Color     Horizontal Edge         a            10          -
9     Color     Vertical Edge           b            10          -
10    Color     Relative Patch          b            10          -
11    Texture   Derivative of Gaussian  L            -           2
12    Color     Relative Pixel          a            10          -
13    Texture   Gaussian                b            -           4
14    Texture   Derivative of Gaussian  L            -           4
15    Texture   Laplacian of Gaussian   L            -           2
16    Texture   Laplacian of Gaussian   L            -           4
17    Color     Relative Pixel          b            10          -
18    Color     Relative Pixel          L            10          -
19    Texture   Derivative of Gaussian  L            -           2
20    Location  Y Direction             -            -           -
21    Haar      Vertical Line           L            10          -
22    Texture   Gaussian                a            -           4
23    Texture   Gaussian                a            -           1
24    Texture   Gaussian                b            -           1
25    Location  X Direction             -            -           -
26    Texture   Gaussian                L            -           1
27    Texture   Gaussian                L            -           4
28    Texture   Gaussian                b            -           2
29    Texture   Gaussian                L            -           2
30    Texture   Gaussian                a            -           2
31    Texture   Laplacian of Gaussian   L            -           8
32    Haar      Center Surround         L            10          -
33    Texture   Laplacian of Gaussian   L            -           1
34    Color     Horizontal Edge         b            10          -
35    Haar      Four Square             L            10          -
36    Haar      Vertical Edge           L            10          -
37    Haar      Horizontal Edge         L            10          -
Table A.5.: Sowerby Benchmark. The full list of ranked features.

Rank  Feature   Type                    Lab Channel  Patch Size  Bandwidth
1     Color     Relative Pixel          a            6           -
2     Color     Horizontal Edge         b            6           -
3     Color     Relative Patch          L            6           -
4     Texture   Laplacian of Gaussian   L            -           1
5     Color     Center Surround         a            6           -
6     Color     Center Surround         b            6           -
7     Texture   Derivative of Gaussian  L            -           4
8     Haar      Center Surround         L            6           -
9     Color     Relative Patch          a            6           -
10    Texture   Derivative of Gaussian  L            -           2
11    Color     Relative Pixel          L            6           -
12    Color     Vertical Edge           a            6           -
13    Texture   Gaussian                b            -           4
14    Color     Horizontal Edge         a            6           -
15    Texture   Derivative of Gaussian  L            -           4
16    Texture   Laplacian of Gaussian   L            -           2
17    Color     Relative Pixel          b            6           -
18    Texture   Laplacian of Gaussian   L            -           4
19    Texture   Derivative of Gaussian  L            -           2
20    Location  Y Direction             -            -           -
21    Texture   Gaussian                a            -           4
22    Color     Vertical Edge           b            6           -
23    Texture   Gaussian                a            -           1
24    Color     Relative Patch          b            6           -
25    Texture   Gaussian                b            -           1
26    Location  X Direction             -            -           -
27    Haar      Vertical Line           L            6           -
28    Texture   Gaussian                L            -           1
29    Texture   Gaussian                L            -           4
30    Texture   Gaussian                b            -           2
31    Texture   Gaussian                L            -           2
32    Texture   Gaussian                a            -           2
33    Texture   Laplacian of Gaussian   L            -           8
34    Haar      Four Square             L            6           -
35    Haar      Vertical Edge           L            6           -
36    Haar      Horizontal Edge         L            6           -
37    Haar      Horizontal Line         L            6           -
Bibliography
[1] Cie: 015:2004. Technical report, Colorimetry, 3rd edn. Commission Internationale de
l’Eclairage, 2004. 22
[2] Adobe. "Technical Guides". Color Models: CIELAB. http://dba.med.sc.edu/price/irf/Adobe_tg/models/cielab.html. 23
[3] P. Arbelaez, B. Hariharan, Chunhui Gu, S. Gupta, L. Bourdev, and J. Malik. Semantic segmentation using regions and parts. In Computer Vision and Pattern Recognition
(CVPR), 2012 IEEE Conference on, pages 3378–3385, June 2012. 3
[4] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm
for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, pages 144–152, New York, NY, USA, 1992. ACM.
13
[5] Wassim Bouachir, Atousa Torabi, Guillaume-Alexandre Bilodeau, and Pascal Blais.
A bag of words approach for semantic segmentation of monitored scenes. CoRR,
abs/1305.3189, 2013. 3
[6] G. Bradski. Dr. Dobb’s Journal of Software Tools, 2000. 45
[7] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. 13, 15, 17, 29
[8] Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression
Trees. Wadsworth, 1984. 15
[9] A. Chambolle, D. Cremers, and T. Pock. A convex approach for computing minimal
partitions. Tech. rep. TR-2008-05, University of Bonn, 2008. 12, 35
[10] A. Chambolle, D. Cremers, and T. Pock. A convex approach to minimal partitions.
SIAM Journal on Imaging Sciences, 5(4):1113–1158, 2012. 12
[11] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Technical Report R.I. 685, Ecole Polytechnique,
Palaiseau Cedex, France, 2010. 35, 36
[12] T.M. Cover. The best two independent measurements are not the two best. Systems,
Man and Cybernetics, IEEE Transactions on, SMC-4(1):116–117, Jan 1974. 4
[13] Franklin C. Crow. Summed-area tables for texture mapping. In Proceedings of the
11th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH
’84, pages 207–212, New York, NY, USA, 1984. ACM. 23
[14] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection.
In In CVPR, pages 886–893, 2005. 3, 4
[15] C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene
expression data. In Bioinformatics Conference, 2003. CSB 2003. Proceedings of the 2003
IEEE, pages 523–528, Aug 2003. 17, 18
[16] PedroF. Felzenszwalb and DanielP. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004. 7
[17] WendellH. Fleming and Raymond Rishel. An integral formula for total gradient variation. Archiv der Mathematik, 11(1):218–222, 1960. 11
[18] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of Computer and System Sciences,
55(1):119 – 139, 1997. 13
[19] Yoav Freund and Robert E. Schapire. A short introduction to boosting. In In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 1401–
1406. Morgan Kaufmann, 1999. 13, 15
[20] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression:
a statistical view of boosting. Annals of Statistics, 28:2000, 1998. 13, 14, 29
[21] Björn Fröhlich, Erik Rodner, and Joachim Denzler. Semantic segmentation with millions of features: Integrating multiple cues in a combined random forest approach. In
Asian Conference on Computer Vision (ACCV), pages 218–231, 2012. 4, 46, 48
[22] Enrico Giusti. Minimal Surfaces and Functions of Bounded Variation. Monographs in
Mathematics. Birkhäuser Boston, 1984. 9
[23] Stephen Gould. DARWIN: A framework for machine learning and computer vision
research and development. Journal of Machine Learning Research, 13:3533–3537, Dec
2012. 45
[24] A. Haar. Zur Theorie der orthogonalen Funktionensysteme. 1909. 22
[25] Xuming He, R.S. Zemel, and M.A. Carreira-Perpindn. Multiscale conditional random
fields for image labeling. In Computer Vision and Pattern Recognition, 2004. CVPR 2004.
Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–695–II–
702 Vol.2, June 2004. 26, 42, 45
[26] Alexander Hermans, Georgios Floros, and Bastian Leibe. Dense 3d semantic mapping of indoor scenes from rgb-d images. In International Conference on Robotics and
Automation, 2014. 3, 4, 24, 27, 42, 49
[27] F. Korč and W. Förstner. eTRIMS Image Database for interpreting images of manmade scenes. Technical Report TR-IGG-P-2009-01, April 2009. 33, 41, 45
[28] L. Ladický, Russell C., Kohli P., and Torr P. Graph cut based inference with cooccurrence statistics. In Proceedings of ECCV, 2010. 3, 49
[29] L. Ladický, C. Russell, P. Kohli, and P. H S Torr. Associative hierarchical crfs for object
class image segmentation. In Computer Vision, 2009 IEEE 12th International Conference
on, pages 739–746, Sept 2009. 3, 4, 48, 49
[30] Ĺubor Ladický, Paul Sturgess, Karteek Alahari, Chris Russell, and PhilipH.S. Torr.
What, where and how many? combining object detectors and crfs. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision – ECCV 2010,
volume 6314 of Lecture Notes in Computer Science, pages 424–437. Springer Berlin Heidelberg, 2010. 4
[31] Jan Lellmann, Jörg Kappes, Jing Yuan, Florian Becker, and Christoph Schnörr. Convex
multi-class image labeling by simplex-constrained total variation. In Xue-Cheng Tai,
Knut Mørken, Marius Lysaker, and Knut-Andreas Lie, editors, Scale Space and Variational Methods in Computer Vision, volume 5567 of Lecture Notes in Computer Science,
pages 150–162. Springer Berlin Heidelberg, 2009. 12
[32] D. Lowe. Object recognition from local scale-invariant features. Corfu, Greece, September 1999. 3, 4
[33] Philippe Magiera and Carl Löndahl. "ROF Denoising Algorithm". http://www.mathworks.de/matlabcentral/fileexchange/22410-rof-denoising-algorithm, 2008. 10
[34] Bjoern H. Menze, Michael B. Kelm, Ralf Masuch, Uwe Himmelreich, Peter Bachert,
Wolfgang Petrich, and Fred A. Hamprecht. A comparison of random forest and its
gini importance with standard chemometric methods for the feature selection and
classification of spectral data. BMC Bioinformatics, 10(1), 2009. 16
[35] Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1
edition, 1997. 16
[36] David Mumford and Jayant Shah. Optimal approximations by piecewise smooth
functions and associated variational problems. Communications on Pure and Applied
Mathematics, 42(5):577–685, 1989. 7, 8
[37] H. Peng, Fulmi Long, and C. Ding. Feature selection based on mutual information
criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis
and Machine Intelligence, IEEE Transactions on, 27(8):1226–1238, Aug 2005. 4, 5, 17, 18,
27
[38] T. Pock, D. Cremers, H. Bischof, and A. Chambolle. An algorithm for minimizing
the piecewise smooth mumford-shah functional. In IEEE International Conference on
Computer Vision (ICCV), Kyoto, Japan, 2009. 36
[39] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review, 65(6):386–408, 1958. 13
[40] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based
noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1–4):259 – 268, 1992. 8
[41] Robert E. Schapire and Yoram Singer.
Improved boosting algorithms using
confidence-rated predictions. In Machine Learning, pages 80–91, 1999. 15
[42] Claude Shannon. A mathematical theory of communication. Bell System Technical
Journal, 27:379–423, 623–656, July, October 1948. 31
[43] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In Computer Vision and Pattern Recognition, 2008. CVPR 2008.
IEEE Conference on, pages 1–8, June 2008. 3
[44] J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In European Conference on Computer Vision (ECCV), pages 1–15, 2006. 3, 4, 26
[45] Jamie Shotton, Toby Sharp, Alex Kipman, Andrew W. Fitzgibbon, Mark Finocchio,
Andrew Blake, Mat Cook, and Richard Moore. Real-time human pose recognition in
parts from single depth images. Commun. ACM, 56(1):116–124, 2013. 27
[46] Jamie Shotton, John Winn, Carsten Rother, and Antonio Criminisi. Textonboost for
image understanding: Multi-class object recognition and segmentation by jointly
modeling texture, layout, and context. International Journal of Computer Vision, 81(1):2–
23, 2009. 3, 22, 48
[47] N. Silberman and R. Fergus. Indoor scene segmentation using a structured light sensor. In Proceedings of the International Conference on Computer Vision - Workshop on 3D
Representation and Recognition, 2011. 27, 42, 45
[48] A. Torralba, K.P. Murphy, and W.T. Freeman. Sharing features: efficient boosting
procedures for multiclass object detection. In Computer Vision and Pattern Recognition,
2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2,
pages II–762–II–769 Vol.2, June 2004. 4
[49] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the
2001 IEEE Computer Society Conference on, volume 1, pages I–511–I–518 vol.1, 2001. 3,
4, 22, 23
[50] Eric W. Weisstein. "Haar Function." From MathWorld – A Wolfram Web Resource.
http://mathworld.wolfram.com/HaarFunction.html. 23
[51] Christopher Zach, David Gallup, Jan-Michael Frahm, and Marc Niethammer. Fast
global labeling for real-time stereo using multiple plane sweeps. In VMV, pages 243–
252, 2008. 12
[52] Weiwei Zhang, Jian Sun, and Xiaoou Tang. Cat head detection - how to effectively
exploit shape and texture features. In David Forsyth, Philip Torr, and Andrew Zisserman, editors, Computer Vision – ECCV 2008, volume 5305 of Lecture Notes in Computer
Science, pages 802–816. Springer Berlin Heidelberg, 2008. 4
[53] M. Zhu and T. Chan. An efficient primal-dual hybrid gradient algorithm for total
variation image restoration. UCLA CAM Report, pages 08–34, 2008. 36