# Feature Selection and Learning for Semantic Segmentation

**Fakultät für Informatik der Technischen Universität München**

Master's Thesis in Informatics

**Feature Selection and Learning for Semantic Segmentation**
(Selektion und Lernen von Merkmalen für Semantische Segmentierung)

- Author: Caner Hazırbaş
- Supervisor: Prof. Dr. Daniel Cremers
- Advisors: M.Sc. Julia Diebold, Dipl.-Inf. (Univ.) Mohamed Souiai
- Date: June 26, 2014

I confirm that this master's thesis is my own work and I have documented all sources and material used.

München, June 26, 2014. Caner Hazırbaş

## Acknowledgments

First and foremost, I would like to offer my sincere gratitude to Prof. Dr. Daniel Cremers, who gave me invaluable guidance in deciding on my research subject and also motivated me with his immense knowledge of the topic. Besides my supervisor, I would like to offer my special thanks to my advisors, M.Sc. Julia Diebold and Dipl.-Inf. (Univ.) Mohamed Souiai, for being the most indestructible advisors ever against my countless questions, for their endless motivation and for their support during the thesis. Without their assistance and dedicated involvement in every step throughout the process, this thesis would never have been accomplished. Furthermore, I particularly thank Dr. Rudolph Triebel for his kindness and his assistance in solving my problems regarding machine learning.

"Success is not a target but a very long journey."

I could not have made it through and would have been lost without my friends. I thank my dearest friends, Bilal Porgalı and Yağmur Sevilmiş, for being with me through this exhausting journey.
My special thanks are extended to my friends and fellow labmates in the Computer Vision and Pattern Recognition Research Group at TU Munich: Alfonso Ros Dos Santos, Sahand Sharifzadehgolpayegani, Zorah Lähner, Emanuel Laude, Benjamin Strobel and Mathieu Andreux, for the stimulating discussions and also for the great time we had together. Last but not least, I would like to thank my mother Hüsniye Hazırbaş for giving me moral and spiritual support throughout my life and in the very dark moments, my father Birol Hazırbaş for being with me all the time with his prayers, and also my beloved sister Melike Hazırbaş for giving me all her trust and support.

## Abstract

This work presents a comprehensive study on feature selection and learning for semantic segmentation. Various types of features and different learning algorithms, in conjunction with the minimization of a variational formulation, are discussed in order to obtain the best segmentation of the scene with minimal redundancy in the feature set. The features are scored in terms of relevance and redundancy. A clever feature selection reduces not only the redundancy but also the computational cost of object detection. Additionally, different learning algorithms are studied and the most suitable multi-class object classifier is trained with the selected subset of features for detection-based unary potential computation. In order to obtain consistent segmentation results we minimize a variational formulation of the multi-labelling problem by means of first-order primal-dual optimization. Experiments on different benchmarks give a deep understanding of how many and what kind of features, and which learning algorithm, should be used for semantic segmentation.

## Contents

- Acknowledgments
- Abstract
- **Part I. Introduction and Theory**
  - 1. Introduction
    - 1.1. Related Work
    - 1.2. Contributions
  - 2. Image Segmentation and Object Recognition
    - 2.1. Variational Image Segmentation
      - 2.1.1. Mumford-Shah Functional
      - 2.1.2. The ROF Model
      - 2.1.3. Total-Variation Segmentation
    - 2.2. Object Classification Methods
      - 2.2.1. Adaptive Boosting
      - 2.2.2. Random Forests
    - 2.3. Feature Selection based on Mutual Information Criteria
- **Part II. Feature Selection and Learning for Semantic Segmentation**
  - 3. Feature Ranking and Object Learning
    - 3.1. Appearance Model
      - 3.1.1. Shape Features
      - 3.1.2. Color and Texture Features
      - 3.1.3. Location Features
      - 3.1.4. Depth Features
    - 3.2. Feature Ranking
    - 3.3. Object Learning
      - 3.3.1. Measurement of the Classification Uncertainty
  - 4. Variational Optimization and Image Segmentation
    - 4.1. Variational Image Segmentation
    - 4.2. First-order Primal-Dual Convex Optimization
- **Part III. Experimental Evaluation**
  - 5. Benchmarks
    - 5.1. eTrims Benchmark
    - 5.2. NYU version 1 Depth Benchmark
    - 5.3. Corel and Sowerby Benchmarks
  - 6. Experimental Results
    - 6.1. Feature Selection
    - 6.2. Segmentation Results
- **Part IV. Summary and Conclusion**
  - 7. Summary
    - 7.1. Discussion
  - 8. Conclusion
    - 8.1. Future Work
- Appendix A. List of Ranked Features
- Bibliography

# Part I. Introduction and Theory

## 1. Introduction

Building man-like machines, also called humanoid robots, has been a dream of humanity since life began. The fundamental problem to address in developing such robots is how to understand the environment the way we humans do. For many years, human vision has kept its secret of visually understanding a scene and making robust inferences about unknown environments or objects from previous experience. Although scientists have spent many years trying to figure out how humans actually see the environment, we still struggle to explain how this remarkable learning and understanding happens. Nevertheless, we know that the eyes allow individuals to interpret their surroundings, and therefore we developed cameras. The invention of low-cost cameras has opened a gate for robots in the direction of visual perception. Today, cameras play an important role in visual understanding in many applications. The development of autonomous, unmanned vehicles and devices is essential not only to replace man-power with mechanical power, but also to support daily life activities. Transportation, production, surveillance systems, the aerospace industry and human support systems are only some of the application areas.
However, the remaining major challenge is how to understand the environment and move autonomously using cameras. Perceptual understanding of the scene is at the very heart of this problem and is still being studied by many scientists. Computer vision studies have investigated the perceptual interpretation of objects in the scene using just a single image, retrieved from a low-cost camera. Today, with the high computational processing power of computers, many researchers consider the recognition and segmentation problems together to increase the capacity of current systems for perceptual scene analysis. The integration of the visual semantics of objects into the segmentation problem is nowadays the main concern in the field of semantic segmentation. Segmentation of images based on semantics is essential for real-time scene understanding, surveillance systems and 3D scene reconstruction, e.g. in [5, 26, 46]. The performance of these segmentation algorithms heavily depends on the quality of the appearance model, which is computed using pixel-wise object detection algorithms. The major challenge is to find the most representative features to distinguish dissimilar objects in terms of their shape, color and textural differences.

Conventional object detectors deal with the task of finding bounding boxes around each object [14, 32, 49]. In contrast, dense object detection approaches [29, 44] focus on detecting the objects at pixel level, which provides a preliminary segmentation. Several other studies pursued different approaches for object detection and learning algorithms in order to perform semantic segmentation, e.g. [3, 28, 43]. In this work we propose a method to systematically select the best subset of features for semantic segmentation. Our approach is able to outperform state-of-the-art results in terms of runtime and even in accuracy in several cases.
### 1.1. Related Work

In 2001, Viola and Jones presented simple but robust Haar-like features for real-time face detection [49]. Haar-like features are simple to implement, computationally cheap and very accurate in capturing the shape of objects. These features can be used for the detection of any type of object, and this flexibility makes them convenient for semantic segmentation applications. Lowe [32] presented the distinctive scale-invariant feature transform (SIFT), and Dalal and Triggs [14] presented the so-called histograms of oriented gradients (HOG), which are computationally more expensive. SIFT can robustly identify objects even among clutter and under partial occlusion, because the SIFT descriptor is invariant to uniform scaling and orientation, and partially invariant to affine distortion and illumination changes. Moreover, Zhang et al. [52] proposed a set of novel features based on oriented gradients to detect cat heads. They have shown that exploiting shape and texture features jointly outperforms the existing leading features, e.g. Haar and HOG. However, these methods are mostly used in the context of sliding window techniques. In the scope of this research we aim at detecting objects densely at pixel level.

In [44], Shotton et al. proposed texture-layout filters based on textons, which jointly model patterns of texture and their spatial layout for dense object detection. They presented a two-step algorithm for the semantic segmentation of photographs. In the first step, unary classification and feature selection are achieved using shared boosting to give an efficient classifier. In the second step, an accurate image segmentation is achieved by incorporating the unary classifier into a conditional random field which encourages neighboring pixels to take the same label. Ladický et al. [29] proposed a hierarchical random field model that allows the integration of features computed at different levels of the quantisation hierarchy.
Moreover, Ladický et al. [30] combined different features for unary pixel classification by using Joint Boosting [48]. However, their approach is sensitive to a large set of parameters. Fröhlich et al. [21] proposed an iterative approach for the semantic segmentation of a facade dataset. They learn a single random forest and incrementally add context features derived from coarser levels. This approach uses different kinds of features in a joint, flexible and fast manner and refines the semantic segmentation of the scene iteratively. Hermans et al. [26] discussed the 3D semantic reconstruction of indoor scenes by exploiting a 2D semantic segmentation approach based on Randomized Decision Forests for RGB-D sensor data. They used depth features in addition to a very basic set of features to train a classifier for unary pixel classification.

However, none of the approaches above gives any justification for the chosen set of features, nor do they address the problem of how to choose the best feature set for object detection in semantic segmentation. Feature selection plays an important role in the quality of the object detector and also in the runtime of the entire system. It is also known that the m best features are not necessarily the best m features [12]. In [37], Peng et al. proposed a theoretical framework to rank features based on minimum-Redundancy-Maximum-Relevance (mRMR). They formulate a cost function composed of a relevance and a redundancy term: the relevance between features and class labels is maximized while the redundancy between feature pairs is minimized.

### 1.2. Contributions

In this work, we adapt the approach of [37] to the task of semantic segmentation. We study different types of features and learning algorithms and perform experiments on various segmentation benchmarks. The main goal is to find the best subset of features for semantic segmentation.
This is done by systematically selecting them via minimizing redundancy and maximizing relevance, and thereby improving the unary pixel potentials. Exhaustive experiments on various benchmarks show that reducing the redundancy in feature sets significantly reduces the runtime while preserving the performance. Our approach is able to outperform state-of-the-art algorithms in terms of runtime and even in accuracy in several cases.

## 2. Image Segmentation and Object Recognition

This chapter presents the theoretical background of the proposed method. Variational image segmentation, object learning (classification) algorithms and a mutual-information-based feature selection method are discussed in detail to provide an insight into image segmentation and object recognition.

### 2.1. Variational Image Segmentation

Image segmentation is currently among the most popular research topics in computer vision. It deals with finding disjoint sets of elements in observed data such that the similarity inside the sets and the dissimilarity between the disjoint sets are maximal. Image segmentation comprises finding object regions as well as accurate boundaries between them. Let $\Omega$ be the image domain and $k$ the number of disjoint regions; the segmentation criterion for an image can then be written as:

$$\Omega = \{\Omega_1, \Omega_2, \ldots, \Omega_{k-1}, \Omega_k\}, \quad \text{where} \quad \bigcup_{i=1}^{k} \Omega_i = \Omega \;\wedge\; \Omega_i \cap \Omega_j = \emptyset \quad \forall i, j \in \{1, \ldots, k\} \ \text{s.t.} \ i \neq j.$$

In the simplest case, with only background and foreground, regions can be segmented using very basic and conventional methods such as edge-detection-based segmentation. However, since this is not realistic, one should take more complex cases into consideration to be able to segment images into multiple regions. Image segmentation approaches are generally based either on graphical models, e.g. graph-cut based image segmentation [16], or on variational image segmentation such as the generalized Mumford-Shah functional [36] for multi-region segmentation.
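Numerically, the partition criterion above says that the indicator functions of the regions must cover the domain and be pairwise disjoint, i.e. sum to one at every pixel. A minimal sketch (function names are illustrative, not from the thesis):

```python
import numpy as np

def to_indicators(labels, k):
    """Stack one-hot indicator functions u_i(x) for a k-region label map."""
    return np.stack([(labels == i).astype(float) for i in range(k)])

def is_partition(u):
    """Regions cover Omega and are pairwise disjoint exactly when
    the indicator functions sum to 1 at every pixel."""
    return bool(np.all(u.sum(axis=0) == 1.0))
```

For example, a label map with values in {0, 1, 2} yields three indicators that sum to one everywhere; dropping one region leaves pixels uncovered and violates the criterion.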
In this study, we only focus on the variational multi-region segmentation problem. Variational image segmentation minimizes an energy functional to find a consistent segmentation. An energy functional for image segmentation is typically formulated as:

$$E(u(x)) = \sum_{i=1}^{k} E_{\text{data}}(u_i(x), I) + \lambda \cdot \sum_{i=1}^{k} E_{\text{reg}}(u_i(x))$$

where $u(x) = (u_1, \ldots, u_k)$ is the set of indicator functions with:

$$u_i(x) = \begin{cases} 1 & \text{for } x \in \Omega_i \\ 0 & \text{else} \end{cases}$$

$E_{\text{data}}$ is called the data term and represents the fidelity of the segmentation of a given image $I$. $E_{\text{reg}}$ is called the regularizer term and ensures that the resulting segmented regions have homogeneous class labels with optimal class boundaries. The parameter $\lambda$ is a weight and denotes the amount of smoothing of the segments. The trade-off is that the segmented regions get more and more smoothed as $\lambda$ increases; if $\lambda$ is set to a very low value, the segmented regions will have noisy labels and lose the homogeneity property.

#### 2.1.1. Mumford-Shah Functional

Mumford and Shah [36] presented an energy functional to segment images with a piece-wise smooth approximation in the following form:

$$E(u, C) = \int_{\Omega} (I - u)^2 \, dx + \lambda \int_{\Omega \setminus C} |\nabla u|^2 \, dx + \nu |C| \qquad (2.1)$$

where $u : \Omega \to \mathbb{R}$ is an approximation of the segmented image and $C \subset \Omega$ is the one-dimensional discontinuity set. $(I - u)$ converges to zero as $u$ gets similar to $I$, which provides a good approximation of the segments. The second term smooths the regions everywhere except on the set $C$. The last term ensures that the length of the boundary $|C|$ between classes is minimal, with a weighting $\nu$. However, $C$ is part of the energy itself and no numerical solution is given to minimize this functional. Multiple approaches have been presented to solve the Mumford-Shah functional. This thesis employs a related model of the Mumford-Shah functional to solve multi-region segmentation problems.
#### 2.1.2. The ROF Model

In 1992, Rudin, Osher and Fatemi [40] pioneered the concept of nonlinear image denoising, which aims at removing noise while preserving the edges of an image. The principle is to remove unwanted, excessive and possibly spurious details which induce a high total variation, that is, a high integral of the absolute gradient of the image. According to this principle, reducing the total variation of the image removes the unwanted details while preserving the image edges. Given a noisy image $f : \Omega \to \mathbb{R}$, where $\Omega$ is a bounded open subset of $\mathbb{R}^2$, we seek a clean image $u$, the denoised version of $f$, with bounded variation. To solve this problem, Rudin et al. minimize the total variation of the image with the following energy functional:

$$\min_{u \in BV(\Omega)} \frac{1}{2} \|f - u\|_2^2 + \lambda \int_{\Omega} |\nabla u| \, dx. \qquad (2.2)$$

This variational energy has a data term, which minimizes the $L^2$ distance between the desired and the noisy image, and a regularizer $\int_{\Omega} |\nabla u|$, the so-called total variation, which minimizes the length of the object boundaries. $\lambda$ is a scale parameter that determines the level of smoothing. The total variation is not differentiable under the constraint that $u$ takes only the values 0 and 1. For this reason, the bounded variation of $u$ is used to address the problem; $u \in BV(\Omega)$ is constrained to be of bounded variation, defined as:

$$BV(\Omega) := \{ u \in L^1(\Omega) \mid TV(u) < +\infty \}$$

where the total variation is finite and given as [22]:

$$TV(u) := \sup \left\{ -\int_{\Omega} u(x) \cdot \operatorname{div} \xi \, dx \;\middle|\; \xi \in C_c^1(\Omega, \mathbb{R}^n), \; \|\xi\|_{L^\infty(\Omega)} \leq 1 \right\} \qquad (2.3)$$

If $u$ is differentiable and $\Omega$ is a bounded open set, then we can re-write the total variation using Gauss' theorem:

$$\int_{\Omega} -u \cdot \operatorname{div} \xi \, dx = \int_{\Omega} \nabla u \cdot \xi \, dx \leq \|\xi\|_\infty \int_{\Omega} |\nabla u| \, dx,$$

yielding equality with $\hat{\xi} := \frac{\nabla u}{|\nabla u|}$:

$$\int_{\Omega} \nabla u \cdot \hat{\xi} \, dx = \int_{\Omega} |\nabla u| \, dx,$$

and since $\hat{\xi}$ can be approximated by a sequence $(\xi)_n \subset C_c^1(\Omega)$, the following holds:

$$\int_{\Omega} -u \cdot \operatorname{div} (\xi)_n \, dx = \int_{\Omega} \nabla u \cdot (\xi)_n \, dx \;\longrightarrow\; \int_{\Omega} \nabla u \cdot \hat{\xi} \, dx = \int_{\Omega} |\nabla u| \, dx.$$
The ROF model is also called the TV-$L^2$ model and has two advantages:

- TV preserves discontinuities (edges in the images), and therefore the method is well suited for image processing tasks.
- TV is a convex function; a function $f$ is convex if and only if its epigraph $\text{epi}(f) := \{(x, y) \mid f(x) \leq y\}$ is a convex set. It can easily be shown that if $f$ is convex and a minimizer of $f$ exists, then this minimizer is indeed globally optimal. Convex functions have the advantage of being minimizable with well-researched optimization solvers regardless of initialization.

Equation 2.2 can be minimized with gradient descent, and the corresponding evolution equation is:

$$\frac{\partial u}{\partial t} = \nabla \cdot \left( \frac{\nabla u}{|\nabla u|} \right) - \frac{1}{\lambda}(u - f), \qquad \text{with boundary condition} \quad \left.\frac{\partial u}{\partial n}\right|_{\partial \Omega} = 0.$$

However, in the case of constant functions $|\nabla u|$ will be equal to 0 and the optimization scheme will face a singularity problem because of the division by zero. This problem will be circumvented using a first-order primal-dual formulation of the total variation given in Equation 2.3. The first-order primal-dual convex optimization scheme will be discussed in Section 4.2. See Figure 2.1 for an image denoising example with the ROF model.

[Figure 2.1.: Image Denoising. A sample noisy image (a) was denoised with gradient descent minimization of the ROF model, implemented in Matlab by Magiera and Löndahl [33]. The denoised images (b), (c), (d) were produced with 100 iterations and smoothing scales λ = 10, 30, 50, respectively. Each RGB color channel was processed independently. The resulting image gets smoother as λ increases.]
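The gradient descent scheme above can be sketched in a few lines of NumPy; the small ε added under the square root is the usual workaround for the |∇u| = 0 singularity just mentioned. This is a toy illustration, not the Matlab implementation of [33]; the function name, step size and parameters are chosen for the example:

```python
import numpy as np

def rof_gradient_descent(f, lam=10.0, tau=0.1, iters=100, eps=1e-6):
    """Gradient descent on the ROF model; eps regularizes |grad u|
    to avoid the division-by-zero singularity."""
    u = f.copy().astype(float)
    for _ in range(iters):
        # forward differences with zero-flux (Neumann) boundaries
        ux = np.roll(u, -1, axis=1) - u
        ux[:, -1] = 0.0
        uy = np.roll(u, -1, axis=0) - u
        uy[-1, :] = 0.0
        norm = np.sqrt(ux ** 2 + uy ** 2 + eps)
        px, py = ux / norm, uy / norm
        # backward-difference divergence (adjoint of the forward differences)
        div = (px - np.roll(px, 1, axis=1)) + (py - np.roll(py, 1, axis=0))
        # du/dt = div(grad u / |grad u|) - (1/lam) * (u - f)
        u = u + tau * (div - (u - f) / lam)
    return u
```

Running this on a noisy image reduces its total variation while keeping the result close to the input, mirroring the behaviour shown in Figure 2.1.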
#### 2.1.3. Total-Variation Segmentation

The continuous model for TV-based image segmentation that will be used has the following form:

$$\min_{\Omega_1, \ldots, \Omega_k} \sum_{i=1}^{k} \int_{\Omega_i} f_i(x) \, dx + \frac{\lambda}{2} \sum_{i=1}^{k} \operatorname{Per}(\Omega_i, \Omega)$$

where $f_i : \Omega \to \mathbb{R}^+$ are non-negative potential functions defined for each pixel by object classification methods. The $\lambda$-weighted second term is half the sum of the perimeters of the sets $\Omega_1, \ldots, \Omega_k$ and corresponds to the total length of the partition interface $\bigcup_{i<j} \partial\Omega_i \cap \partial\Omega_j$ (in order to count each perimeter once). Thus, we segment the image into $k$ partitions that yield the lowest energy and ensure a minimal surface at the same time. The partitions are represented by characteristic, also known as indicator, functions $u_1, \ldots, u_k : \Omega \to \{0, 1\}$, so that the energy takes a computationally tractable form:

$$u_i(x) = \begin{cases} 1 & \text{for } x \in \Omega_i \\ 0 & \text{else} \end{cases} \qquad \text{satisfying} \quad \sum_{i=1}^{k} u_i(x) = 1 \ \text{ a.e. } x \in \Omega.$$

Thus, the first term becomes:

$$\sum_{i=1}^{k} \int_{\Omega_i} f_i(x) \, dx = \sum_{i=1}^{k} \int_{\Omega} u_i(x) f_i(x) \, dx.$$

Moreover, if the indicator functions $u_i \in BV(\Omega)$ of measurable sets $\Omega_i \subset \Omega$ are scalar-valued functions of bounded variation, the co-area formula [17] tells us that the perimeter is equal to the total variation:

$$\operatorname{Per}(\Omega_i, \Omega) = \operatorname{Per}(u_i, \Omega) = TV(u_i) = \int_{\Omega} |\nabla u_i| \, dx.$$

As a consequence of these re-formulations, the energy functional takes the form of an intermediate minimization problem:

$$\min_{u \in B} \sum_{i=1}^{k} \int_{\Omega} u_i(x) \cdot f_i(x) \, dx + \frac{\lambda}{2} \sum_{i=1}^{k} \int_{\Omega} g(x) \cdot |\nabla u_i| \, dx \qquad (2.4)$$

$$B := \left\{ (u_1, \ldots, u_k) \in BV(\Omega, \{0, 1\})^k \;\middle|\; \sum_i u_i = 1 \ \text{ a.e. } x \in \Omega \right\}$$

with the coherency constraint $B$ on the indicator functions $u_i$. The minimizer $u$ now lies in $B$, the $k$-dimensional space of binary functions of bounded variation, and fulfills the point-wise characteristic property (at every position $x$ only one $u_i$ is allowed to be non-zero).
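For a given discrete labelling, the energy of Equation 2.4 is easy to evaluate (here with g ≡ 1); a small sketch with illustrative names:

```python
import numpy as np

def multilabel_energy(u, f, lam):
    """Discrete version of Eq. (2.4) with g == 1:
    sum_i <u_i, f_i> + (lam / 2) * sum_i TV(u_i).
    u and f have shape (k, H, W): indicator functions and unary potentials."""
    data = (u * f).sum()
    tv = 0.0
    for ui in u:
        dx = np.diff(ui, axis=1)[:-1, :]   # forward differences restricted
        dy = np.diff(ui, axis=0)[:, :-1]   # to the common (H-1, W-1) support
        tv += np.sqrt(dx ** 2 + dy ** 2).sum()
    return data + 0.5 * lam * tv
```

Without regularization (λ = 0) the minimizer is simply the point-wise argmin of the potentials; the TV term then trades data fidelity against boundary length.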
Also note that the TV in Equation 2.4 is weighted with a space-dependent function $g(x)$, given in Equation 2.5, which takes low values if the derivative at position $x$ is high and high values otherwise, meaning that discontinuities will be preserved at the boundaries of the input image:

$$g(x) = \exp\left( -\frac{|\nabla f(x)|^2}{2\sigma^2} \right), \qquad \sigma^2 = \frac{1}{|\Omega|} \int_{\Omega} |\nabla f(x)|^2 \, dx. \qquad (2.5)$$

The energy functional is still hard to optimize since it is non-convex and non-differentiable because of the binary indicator functions $u_i$. Therefore, to convexify the energy, the functions $u_i$ are relaxed by mapping them into the whole interval, $u_i : \Omega \to [0, 1]$, as described in [9]. With this, the indicators change from hard to soft assignments; at a position $x$, $u(x)$ can then have multiple non-zero entries, although the coherency constraint still holds, i.e. the sum has to yield 1. TV is replaced by its dual representation (Equation 2.3) to obtain a differentiable function. The energy now takes the form of a convex and differentiable optimization problem:

$$\min_{u \in S} \sup_{\xi \in K} \sum_{i=1}^{k} \int_{\Omega} u_i(x) \cdot f_i(x) \, dx - \lambda \sum_{i=1}^{k} \int_{\Omega} \operatorname{div} \xi_i(x) \cdot u_i(x) \, dx \qquad (2.6)$$

with minimization over the set $S$ of BV functions mapping into the $k$-dimensional simplex, while the dual variable $\xi$ lies in the convex set $K$:

$$S := \left\{ u = (u_1, \ldots, u_k) \in BV(\Omega, [0, 1])^k \;\middle|\; \sum_i u_i(x) = 1 \ \text{ a.e. } x \in \Omega \right\}.$$

In [31, 51], the authors use variations of a straight-forward formulation that arises from the TV definition of the regularizer: $\xi$ is enforced to stay inside a norm boundary, e.g. $\left(\sum_i \|\xi_i\|^2\right)^{1/2} \leq 1$. This enforcement limits the amount of flow happening at every position. Since we weight the TV with the space-dependent function $g$, the dual space can be re-written as:

$$K := \left\{ \xi = (\xi_1, \ldots, \xi_k) : \Omega \to \mathbb{R}^{2 \times k} \;\middle|\; \sum_i \|\xi_i(x)\|^2 \leq \frac{g(x)}{2} \ \text{ a.e. } x \in \Omega \right\}.$$

An alternative approach is presented by Chambolle et al. [10].
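Both the edge weighting g(x) of Equation 2.5 and the point-wise truncation that projects a dual variable back into K take only a few lines. This is a sketch under the reading of the norm constraint given for K above; function names are illustrative:

```python
import numpy as np

def edge_weight(f):
    """g(x) of Eq. (2.5): close to 0 on strong image edges, close to 1
    in homogeneous regions, so TV smoothing is suspended at edges."""
    gy, gx = np.gradient(f.astype(float))
    grad_sq = gx ** 2 + gy ** 2
    sigma_sq = grad_sq.mean()        # (1/|Omega|) * integral of |grad f|^2
    return np.exp(-grad_sq / (2.0 * sigma_sq))

def project_K(xi, g):
    """Point-wise truncation onto K: rescale xi wherever
    sum_i ||xi_i(x)||^2 exceeds g(x)/2."""
    norm_sq = (xi ** 2).sum(axis=(0, 1))                 # shape (H, W)
    scale = np.maximum(1.0, np.sqrt(norm_sq / (g / 2.0)))
    return xi / scale
```

The projection is the cheap, point-wise operation referred to below; it touches every pixel independently and therefore parallelizes trivially.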
Chambolle et al. introduce the notion of a paired calibration of the dual variables' components that represents a local convex envelope of the energy:

$$K_C := \left\{ \xi = (\xi_1, \ldots, \xi_k) : \Omega \to \mathbb{R}^{2 \times k} \;\middle|\; \Big| \sum_{i_1 \leq i \leq i_2} \xi_i(x) \Big| \leq \frac{g(x)}{2} \ \text{ a.e. } x \in \Omega \right\}.$$

It is proven that this model has a tighter bound for $k > 2$ in comparison to the others, which means that the found minimizer is closer to the original global optimum. While this dual space creates tighter solutions, the projection of a variable onto $K_C$ is computationally burdensome, because it involves the projection of a variable onto multiple convex sets. On the other hand, the projection onto $K$ is very fast since it consists of point-wise truncation operations. The relaxed energy in Equation 2.6 is just an approximation of the original problem and does not guarantee an optimal solution. The literature has shown that in the case of two regions ($k = 2$) the thresholded solution of the relaxed version yields a global optimizer, independent of the chosen threshold. Nevertheless, this no longer holds for binarized solutions in the multi-region case.

**Algorithm 1 (Discrete AdaBoost, the Adaptive Boosting algorithm).**
Data: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$. Result: final strong classifier $H(x)$.

For $t = 1, \ldots, T$:

- Train a weak learner using weight distribution $D_t$.
- Get a weak hypothesis $h_t : X \to \{-1, +1\}$ with error $\varepsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$.
- Choose $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right)$.
- Update: $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases} = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t$ is a normalization factor such that $D_{t+1}$ is a distribution.

Output the final hypothesis: $H(x) = \operatorname{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right)$.
### 2.2. Object Classification Methods

In machine learning and statistics, classification is the problem of identifying the categorical class label of a new observation on the basis of training data containing observations whose category memberships are known. The output of a classification algorithm is called a classifier, that is, a function mapping the observation data to a category (a class label). This section gives a brief description of the Adaptive Boosting [20] and Random Forests [7] machine learning algorithms, which are used in the proposed method, chosen among various linear and non-linear supervised learning algorithms, e.g. Artificial Neural Networks [39], Support Vector Machines [4], linear classifiers and deep learning.

#### 2.2.1. Adaptive Boosting

AdaBoost is a boosting algorithm introduced by Freund and Schapire [18] in 1997 and is capable of solving the binary classification problem. Adaptive Boosting is a linear discriminative learning model which combines weak learners into a strong classifier to find the best separation between two classes. Algorithm 1 shows Discrete AdaBoost [19].

[Figure 2.2.: Boosting Classifier. Illustration of the Adaptive Boosting algorithm. Given an observation set $X$ with corresponding class labels $Y$, the algorithm iteratively finds weak learners and combines them linearly into a strong classifier.]

Usually a weak learner is a decision tree with one level, also called a decision stump. After receiving the weak hypothesis $h_t$, the algorithm computes the importance rate $\alpha_t$ assigned to $h_t$. Note that $\alpha_t$ depends on the error $\varepsilon_t$ and gets larger as $\varepsilon_t$ gets smaller. One of the main ideas of the algorithm is to maintain a distribution of weights $D_t$ over the training set. In each round, the weights of misclassified samples are increased and those of correctly classified samples are decreased. As a result of this re-weighting, the weak learner $h_t$ is forced to focus on the hard examples in the training set.
Thus, the weights tend to concentrate on hard examples. The final hypothesis $H$, also called the strong classifier, is a majority vote of the linearly combined weak learners $h_t$, weighted with the corresponding $\alpha_t$. The weak learner is responsible for finding a weak hypothesis $h_t : X \to \{-1, +1\}$ appropriate for the sample weights $D_t$, and the quality of this weak learner is measured by its error:

$$\varepsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i] = \sum_{i : h_t(x_i) \neq y_i} D_t(i).$$

Figure 2.2 illustrates the boosting algorithm on a simple training set; the strong classifier is shown as a combination of three weak classifiers. There are different kinds of Adaptive Boosting algorithms, each named after its error function:

- **Real AdaBoost**: The output of the decision trees is a class probability estimate $p(x) = P(y = +1 \mid x)$, the probability that $x$ is in the positive class [20]. Each leaf node of the decision tree is changed to output half the logit transform of its previous value:

  $$f_t = \frac{1}{2} \ln\left( \frac{x}{1 - x} \right).$$

- **Logit AdaBoost**: This variant solves a convex optimization problem and minimizes the logistic loss:

  $$\sum_i \log\left( 1 + e^{-y_i f(x_i)} \right), \qquad \text{where } f(x_i) = \sum_t \alpha_t h_t(x_i).$$

- **Gentle AdaBoost**: The previous boosting algorithms choose $f_t$ greedily, minimizing the overall test error as much as possible at each step, whereas Gentle AdaBoost features a bounded step size: $f_t$ is chosen to minimize $\sum_i w_{t,i}(y_i - f_t(x_i))^2$, and no further coefficient is applied. Thus, in the case where a weak learner exhibits perfect classification performance, Gentle AdaBoost chooses $f_t(x) = \alpha_t h_t(x)$ exactly equal to $y$, while steepest descent algorithms will try to set $\alpha_t = \infty$. Empirical observations about the good performance of Gentle AdaBoost appear to back up Schapire and Singer's remark that allowing excessively large values of $\alpha$ can lead to poor generalization performance [19, 41].
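Discrete AdaBoost (Algorithm 1), with decision stumps as weak learners, can be sketched directly from the update rules above. This is a toy illustration with an exhaustive stump search, not the implementation used in the thesis:

```python
import numpy as np

def train_stump(X, y, w):
    """Pick the decision stump (one-level tree) with the lowest
    weighted error under the current distribution D_t = w."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best_err:
                    best, best_err = (j, thr, pol), err
    return best, best_err

def stump_predict(stump, X):
    j, thr, pol = stump
    return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)

def adaboost(X, y, T=10):
    """Discrete AdaBoost: returns a list of (alpha_t, stump_t) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # D_1: uniform weights
    ensemble = []
    for _ in range(T):
        stump, err = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # clip for numerical safety
        alpha = 0.5 * np.log((1.0 - err) / err)  # importance of h_t
        pred = stump_predict(stump, X)
        w = w * np.exp(-alpha * y * pred)        # boost misclassified samples
        w = w / w.sum()                          # Z_t normalization
        ensemble.append((alpha, stump))
    return ensemble

def strong_classify(ensemble, X):
    """H(x) = sign(sum_t alpha_t h_t(x))."""
    score = sum(a * stump_predict(s, X) for a, s in ensemble)
    return np.where(score >= 0, 1, -1)
```

On a linearly separable toy set, a single stump already reaches zero weighted error and the strong classifier reproduces the labels exactly.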
#### 2.2.2. Random Forests

Random Forests is an ensemble learning method proposed by Breiman [7] in 2001 which can handle multi-class classification problems. Random Forests grow many decision trees, introduced in [8], to construct a classifier that is more robust against outliers. The construction of a decision tree requires a training sample of $m$ observations $O = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, $i = 1, \ldots, m$, where $x_i \in X$ is the variable set and $y_i \in Y$ the corresponding class label. A sample decision tree is shown in Figure 2.3. Each interior node corresponds to a variable in $X$, while each leaf node corresponds to a class label in $Y$. A decision tree is trained by splitting the input training set into subsets based on an attribute value test. The splitting is repeated recursively on each node until all samples in the node have the same target class label or splitting no longer gives a noticeable improvement on the predictions. A decision tree is usually constructed from top to bottom, and at each interior node the attribute which gives the best split of the subset is selected. The quality of a split is determined by the average value of the importance of each attribute, computed with the following metrics:

- **Information Gain**: The concept of information gain comes from information theory and is based on entropy measurement. The information gain of an attribute gives us the importance of that attribute in a given set of examples. Let $T$ denote a set of training samples, each of the form $(x, y) = (x_1, x_2, \ldots, x_k, y)$, where $x_a \in V(a)$ is the value of the $a$-th attribute of the sample $x$ and $y$ is the corresponding class label.

[Figure 2.3.: Decision Tree. A tree showing the sickness of a child in a city where there is an epidemic disease; the interior nodes test "is coughing?", "has fever?" and "feeling groggy?", and the leaf nodes predict sick or healthy. The numbers under the leaf nodes show the probability of sickness and the percentage of observations in the leaf.]
The information gain for an attribute a is defined in terms of the entropy H as follows [35]:

IG(T, a) = H(T) - \sum_{v \in V(a)} \frac{|\{x \in T \mid x_a = v\}|}{|T|} \cdot H(\{x \in T \mid x_a = v\}),

H(X) = -\sum_i P(x_i) \cdot \log_{k} P(x_i), \qquad \sum_i P(x_i) = 1.

• Gini Importance: Random Forests gain their superior performance from an implicit feature selection strategy, in which the Gini importance is the indicator of feature relevance. This metric measures how often a particular attribute θ is selected for a split and how effective its overall discriminative value is for the classification. At each node t within a binary decision tree, the optimal split is found using the Gini impurity i(t), calculated as follows [34]:

i(t) = 1 - p_1^2 - p_0^2,

where p_k = \frac{n_k}{n} is the fraction of the n_k samples from class k ∈ {0, 1}. Following this formula, the Gini importance I_G of a variable θ is computed as:

I_G(\theta) = \sum_t \Delta i_\theta(t), \qquad
\Delta i_\theta(t) = i_\theta(t) - p_l\, i_\theta(t_l) - p_r\, i_\theta(t_r), \qquad
p_l = \frac{n_l}{n}, \quad p_r = \frac{n_r}{n}.

A decision tree has many advantages compared to other learning methods. It is simple to construct; it is very robust and reliable, since it has a self-validation strategy; it can handle both categorical and numerical data; it performs well even on large datasets; and, most importantly, a decision tree is fast at predicting the class label of new observations. A decision tree also has some drawbacks: the quality of the classifier highly depends on the training samples, and a decision tree may over-fit the training data and not generalize well.

Random Forests consist of multiple decision trees. To train a forest, the initial training data is split into subsets such that each tree receives the same number of samples. Each tree in the forest gives a vote (classification), which is a mapping from an input vector to a class label, and the forest chooses the classification having the most votes over all trees in the forest.
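The two split-quality metrics above can be sketched as follows. This is a minimal stand-alone illustration (the function names and the toy samples are made up); following the text, the entropy logarithm is taken to the base of the number of classes.

```python
import math
from collections import Counter

def entropy(labels, num_classes):
    """H(T) = -sum_i P(i) * log_k P(i), with log base k = number of classes."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n, num_classes)
                for c in Counter(labels).values() if c > 0)

def information_gain(samples, attr, num_classes):
    """IG(T, a) = H(T) - sum_v (|T_v| / |T|) * H(T_v), T_v = {x in T : x_a = v}."""
    labels = [y for _, y in samples]
    total = entropy(labels, num_classes)
    by_value = {}
    for x, y in samples:
        by_value.setdefault(x[attr], []).append(y)
    n = len(samples)
    return total - sum(len(ys) / n * entropy(ys, num_classes)
                       for ys in by_value.values())

def gini_impurity(labels):
    """i(t) = 1 - p1^2 - p0^2 for binary labels in {0, 1}."""
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

# toy set: attribute 0 perfectly predicts the label, attribute 1 is uninformative
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]
```

On this toy set, attribute 0 yields the full gain of one bit while attribute 1 yields none, which is exactly the behaviour a split-selection criterion should reward.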
In contrast to a single decision tree, Random Forests do not over-fit. Adding more trees to the forest increases the overall performance of the classifier, but the prediction time also increases. Nevertheless, Random Forests can be easily parallelized, since the prediction for a new observation is evaluated at each tree independently. In Random Forests there is no need for an external test set or cross-validation to estimate the forest error, since the error computation is done internally during training: one third of the training subset of each tree is left out to estimate the forest error while the trees are constructed. This error is called the out-of-bag error and gives the unbiased classification error of the forest. Moreover, in the original Random Forests paper [7] it is shown that the forest error depends on two factors, the correlation between the trees and the strength of each individual tree in the forest: the forest error increases as the correlation gets higher, but decreases as the strength increases. Random Forests are scalable, reliable, robust, insensitive to outliers, capable of solving multi-class classification problems on high-dimensional data, and they outperform other machine learning algorithms in terms of speed and accuracy. These strong characteristics make Random Forests popular in the area of visual object detection and recognition.

2.3. Feature Selection based on Mutual Information Criteria

Conventional machine learning algorithms aim to find the subset of features which best distinguishes the disjoint sets (classes) for classification. However, most methods suffer from local minima or over-fitting due to a high sensitivity to outliers, and the final classifier does not generalize well on the given training data, which is usually noisy because of the high redundancy in the feature set.
Therefore, in statistical analysis this problem is addressed with feature reduction using probabilistic frameworks such as maximizing the dependency. This feature analysis provides minimal classification error by maximizing the statistical dependency of the target class c on the data distribution in an unsupervised setting. However, a maximum-dependency approach typically involves the computation of multivariate joint probabilities, which is often difficult and inaccurate. One approach to overcome this problem is to realize maximum dependency by maximizing the relevance, i.e. selecting the features with the highest relevance to the target class c. Relevance measures the dependency of the variables via correlation or mutual information.

Ding and Peng [15, 37] proposed a feature selection framework based on mutual information. The idea is to maximize the relevance and minimize the redundancy of the feature set. Peng et al. [37] proved that maximizing the dependency is the same as maximizing the relevance between features and class labels while minimizing the redundancy of dependent features in the feature set. Thus, the noise and outliers in the data are reduced and the redundancy is minimized, which induces the classifiers to perform better. This feature selection is applied to the training data as a preprocessing step before learning the object classifiers. As a result of the redundancy minimization, the classifier performance is significantly increased, and the efficiency of the learning algorithm is boosted by the selection of a feature subset [15, 37].

In theory, mutual information is defined in terms of the probability density functions p(x), p(y) and p(x, y) of two random variables x and y:

I(X; Y) = \int_Y \int_X p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx\, dy.

Max-Dependency finds the subset of features S with m features \{x_i\} and is defined as:

\max D(S, c), \qquad D = I(\{x_i, i = 1, \ldots, m\}; c).

Max-Relevance selects the features x_i which have the largest mutual information I(x_i; c) with the target class c, reflecting the largest dependency on the target class. An approximation of D(S, c) in the maximum-relevance approach is computed as:

\max D(S, c), \qquad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c).

However, it is very likely that some selected features are mutually dependent and hence give strong relevance. For this reason, while the relevance is maximized, the redundancy of the correlated features should be minimized:

\min R(S), \qquad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j).

A combination of the two criteria described above formulates the minimal-redundancy-maximal-relevance (mRMR) feature selection method. The final optimization problem is defined as the difference of the two terms:

\max \Phi(D, R), \qquad \Phi = D - R.

In this method, the subset of features is selected iteratively such that at each step one feature is selected, depending on all previously selected features, with an incremental search:

\max_{x_j \in X - S_{m-1}} \left[ I(x_j; c) - \frac{1}{m - 1} \sum_{x_i \in S_{m-1}} I(x_j; x_i) \right].

Compared to Max-Dependency, mRMR avoids the estimation of the multivariate probability density p(x_1, \ldots, x_m) and only calculates bivariate densities, i.e. p(x_i, x_j) and p(x_i, c), which is much easier and applicable to big data. Moreover, this also leads to more efficient feature selection algorithms. On the other hand, mRMR algorithms do not guarantee the global maximum of the problem, due to the difficulty of searching the whole space. However, a globally optimal solution in this approach might lead to over-fitting on the data, whereas mRMR is a practical way to achieve superior classification accuracy while also reducing the computational complexity [37].

Part II.

Feature Selection and Learning for Semantic Segmentation
3. Feature Ranking and Object Learning

In this chapter, we introduce our approach for feature selection and learning for semantic segmentation. Our purpose is to find the most distinctive, discriminative and robust feature set for object recognition in the task of semantic segmentation. We employ a feature ranking strategy to compute the importance of each feature in a given feature set, and reduce the redundancy in the set by analyzing how the performance accuracy changes with each feature. Thus, the redundant features are eliminated while the accuracy and the efficiency of the system are enhanced. Within the scope of the proposed method, Section 3.1 presents our appearance model, also called the unary pixel potential, based on object detection/recognition. Section 3.2 shows how the features are ranked using a mutual-information-based feature selection strategy. Furthermore, Section 3.3 compares two machine learning methods, which are used for object learning in the majority of state-of-the-art studies, in order to address the question which one is the most suitable learning strategy for semantic segmentation.

3.1. Appearance Model

Our appearance model is a unary pixel potential ρ_i(x), assigned for every class i as given in Equation 3.1; the potentials are inferred with multi-class object detectors:

\rho_i(x) = -\log \tilde{P}(i \mid f, x), \qquad i = 1, \ldots, n, \qquad (3.1)

where \tilde{P}(i \mid f, x) is the probability of class i for a given feature vector f of pixel x. By taking the negative logarithm of the probability distributions, we ensure that our potential function is a monotonically decreasing function of the probability. Hence, the potential is a convex function, meaning that there exists a global minimizer.

Objects can be categorized by their shape, color and textural properties. To detect an object, one should exploit these features jointly, so that objects of different types can be distinguished from each other while the ones of the same class are kept together.
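Equation 3.1 can be sketched as follows. This is a minimal illustration for a single pixel; the epsilon guard against log(0) is an implementation detail we assume, not something stated in the text.

```python
import math

def unary_potentials(class_probs, eps=1e-10):
    """Unary pixel potentials rho_i(x) = -log P(i | f, x)  (Equation 3.1).

    `class_probs` holds one detector probability per class for one pixel.
    A small epsilon guards against log(0); a low potential corresponds
    to a high class confidence.
    """
    return [-math.log(max(p, eps)) for p in class_probs]

# example: the detector is fairly sure the pixel belongs to class 0
rho = unary_potentials([0.7, 0.2, 0.1])
```

The most likely class receives the smallest potential, so a labeling that minimizes the summed potentials agrees with the detector wherever it is confident.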
In this study, we employ six types of Haar-like features as shape features, 5 color and 17 texton features (Gaussian, Laplacian of Gaussian and first-order derivative of Gaussian filters with various bandwidths) as color-texture features; in addition, we use the normalized canonical location features of Equation 3.2 and, when applicable, depth features, as shown in Figure 3.1. Object classifiers are trained with these 37 joint features to build a robust multi-class object detector.

f_l(p) = \left( \frac{x_p}{I_{width}},\; \frac{y_p}{I_{height}} \right), \qquad (3.2)

where (x_p, y_p) is the location of the pixel p on an image I.

Figure 3.1.: Feature set. First row, Haar-like features: horizontal/vertical edges and lines, center-surround and four-square. Second row, color features: relative pixel/patch, horizontal/vertical edge and center-surround on color channels. Third row, depth features: relative pixel, relative pixel comparison and height of a pixel. Filters are applied in a patch surrounding the pixel on the image.

We use the CIELab colour space [1], since it is a perceptually uniform color model designed to approximate human vision. This color space is a device-independent model and describes all colors visible to the human eye. The luminance (L) channel varies from 0 to 100 and indicates the brightness; 'a' and 'b' are color channels. The 'a' axis varies between green and red, while the 'b' axis varies between blue and yellow; colors are usually numbered from -128 to 127. Figure 3.2 shows the three-dimensional Lab space.

We sample the image pixels for training: filter responses are computed on a ∆ss × ∆ss grid on the image to reduce the computational expense during training [46]. For testing, however, filter responses are computed for each pixel.

3.1.1. Shape Features

In 2001, Viola and Jones [49] presented very simple but robust Haar-like features for face detection.
These features are reminiscent of the Haar basis functions (Haar wavelets) proposed in 1909 by Alfred Haar [24]. A Haar wavelet allows a function defined on an interval to be represented in terms of orthogonal basis functions. The only disadvantage is that Haar wavelets are not continuous and therefore not differentiable.

Figure 3.2.: CIELab color space. The Lab color space has three dimensions. 'L' ranges from 0 to 100 and shows the brightness, while 'a' and 'b' represent colors and both range from -128 to 127. Source: [2].

The Haar mother basis and the k-th Haar function are defined by:

\psi(x) = \begin{cases} 1 & 0 \le x < \frac{1}{2}, \\ -1 & \frac{1}{2} < x \le 1, \\ 0 & \text{else}, \end{cases} \qquad \psi_{jk}(x) \equiv \psi(2^j x - k),

for j a non-negative integer and 0 ≤ k ≤ 2^j − 1, as shown in Figure 3.3.

Figure 3.3.: Haar Wavelets. Haar mother basis function and Haar wavelets [50].

These simple Haar-like features can be computed rapidly using an intermediate representation of the image called the integral image [49], also known as the summed area table, introduced to computer graphics by Frank Crow in 1984 [13]. The integral image I at a pixel position (x, y) contains the sum of the intensities of all pixels above and to the left:

I(x, y) = \sum_{x' \le x,\; y' \le y} \text{image}(x', y').

With this small modification of the image, the sum of the intensities in a rectangle can be computed efficiently with only 4 simple operations, as illustrated in Figure 3.4.

Figure 3.4.: Haar-like feature calculation via the Integral Image. On the integral image, the pixel 1 contains the total sum of all pixels in the region R_1, 2 contains R_1 + R_2, 3 equals R_1 + R_3 and 4 is the sum of all regions: R_1 + R_2 + R_3 + R_4. The sum of the pixels in the region R_4 can be computed with four basic operations: R_4 = (4 + 1) − (2 + 3).

To compute the feature responses, the sum of the pixels which lie within the white rectangles is subtracted from the sum of the pixels in the grey rectangles.
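The integral-image computation and the four-lookup rectangle sum of Figure 3.4 can be sketched as follows. This is a minimal pure-Python version; `integral_image` and `rect_sum` are illustrative names, not from the thesis.

```python
def integral_image(img):
    """Summed area table: I(x, y) = sum of img[y'][x'] for y' <= y, x' <= x."""
    h, w = len(img), len(img[0])
    I = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            I[y][x] = row_sum + (I[y - 1][x] if y > 0 else 0)
    return I

def rect_sum(I, x0, y0, x1, y1):
    """Sum over the inclusive rectangle [x0..x1] x [y0..y1] with 4 lookups:
    R4 = (4 + 1) - (2 + 3) in the notation of Figure 3.4."""
    a = I[y0 - 1][x0 - 1] if x0 > 0 and y0 > 0 else 0   # corner "1"
    b = I[y0 - 1][x1] if y0 > 0 else 0                  # corner "2"
    c = I[y1][x0 - 1] if x0 > 0 else 0                  # corner "3"
    d = I[y1][x1]                                       # corner "4"
    return d + a - b - c

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
I = integral_image(img)
```

A Haar-like response is then just the difference of two such rectangle sums (white minus grey), so each feature costs a constant number of lookups regardless of its size.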
Haar-like features are very good at capturing shapes such as lines, edges and corners, and can be used extensively for all kinds of objects. The most important advantage of Haar-like features is that object boundaries can be recognized very sensitively, so that accurate object separations can be retrieved as a result of the segmentation. Another big advantage of these features is that objects can be detected at any scale by scaling the feature window. When doing so, all the rectangle sums must be normalized by the size of the rectangle to obtain scale-invariant feature responses. We only employ six Haar-like features; however, one can also use different types and enlarge the feature set. All shape features are computed only on the luminance (L) channel of the Lab color space.

3.1.2. Color and Texture Features

Pixel colors are very basic features and give local information about the object parts. However, objects usually consist of many different colors, and individual pixel colors cannot be used directly as a feature for the object as a whole. Therefore, we extract the color features for each pixel in a surrounding window. We are inspired by Hermans' simple color features [26]: the relative pixel and relative patches, shown in Figure 3.1, are used to capture the color relevancy of object parts and of different objects.

Figure 3.5.: Texture features. Gaussian, Laplacian of Gaussian and derivative of Gaussian convolution kernels used as texture features. The rows show 3D and 2D illustrations of the kernels, respectively. Warm colors represent higher function values.
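The Gaussian and Laplacian-of-Gaussian kernels of Figure 3.5 can be generated as follows. This is a minimal sketch under stated assumptions: the window-size rule 2κσ + 1 with κ = 1 follows this section, the kernels are left unnormalized, and the derivative-of-Gaussian kernel is omitted for brevity.

```python
import math

def gaussian_kernel(sigma, kappa=1):
    """Sampled Gaussian kernel G(x, y; sigma) on a (2*kappa*sigma + 1)^2 grid."""
    r = int(kappa * sigma)
    return [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
             / (2.0 * math.pi * sigma * sigma)
             for x in range(-r, r + 1)] for y in range(-r, r + 1)]

def log_kernel(sigma, kappa=1):
    """Sampled Laplacian of Gaussian LoG(x, y; sigma) on the same grid."""
    r = int(kappa * sigma)
    s2 = sigma * sigma
    return [[-(1.0 / (math.pi * s2 * s2))
             * (1.0 - (x * x + y * y) / (2.0 * s2))
             * math.exp(-(x * x + y * y) / (2.0 * s2))
             for x in range(-r, r + 1)] for y in range(-r, r + 1)]

g = gaussian_kernel(2.0)
lg = log_kernel(2.0)
```

A texture response at a pixel is then the convolution of the image patch around it with one of these kernels; the Gaussian peaks at the center, while the LoG is negative at the center and positive in its surround.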
Thus, we learn not only the general color model of the object, but also the color relation between the object and its surroundings. This also allows learning the co-occurrence of objects and increases the classification accuracy to some extent. In addition to the relative color features, we also use three of the Haar-like features, since they are extremely powerful at discovering object boundaries. Color features are computed on each channel of the Lab color space.

Colors are not always discriminative and robust enough for different textures. For that reason, we exploit simple, but relatively more expensive, basic spatial image filters for texture analysis, namely the Gaussian (Equation 3.3), Laplacian of Gaussian (Equation 3.4) and derivative of Gaussian (Equation 3.5) convolution kernels. The kernels are computed locally around each pixel, and the kernel size is determined by the bandwidth (2κ · σ + 1, κ = 1). Figure 3.5 shows the 3D and 2D texture kernels used in this application.

G(x, y; \sigma) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (3.3)

LoG(x, y; \sigma) = -\frac{1}{\pi\sigma^4} \left[ 1 - \frac{x^2 + y^2}{2\sigma^2} \right] e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (3.4)

DoG(x, y; \sigma) = \left[ \frac{\partial G_\sigma}{\partial x},\; \frac{\partial G_\sigma}{\partial y} \right]^T = -\frac{1}{2\pi\sigma^4} \left[ x\, e^{-\frac{x^2 + y^2}{2\sigma^2}},\; y\, e^{-\frac{x^2 + y^2}{2\sigma^2}} \right]^T \qquad (3.5)

3.1.3. Location Features

More or less every object occurs at a specific region of the images. For example, the sky is always expected to be at the top, while cars, roads and pavements invariably appear at the bottom.

Figure 3.6.: Location Potential Map. Location potential map of the 7-class Corel Benchmark [25] (classes Rhino, Polar Bear, Water, Snow, Vegetation, Ground and Sky), learnt from training data. Pixels are normalized to a 100-by-100 canonical size and w_λ is chosen as 1. The colors show the occurrence likelihood of the objects at each pixel; warmer colors represent higher probabilities.
The location information about the objects can be considered in two different ways: • Location as a feature: In the simplest case, locations can be integrated into the object learning strategy in the same way as the other types of features. The location feature is basically computed as a normalized canonical form of a pixel location given in Equation 3.2 . The most important location features will be implicitly selected during training. • Location as a potential: Another way of exploiting the location attribute of the objects is to learn a potential, captures the weak dependence of the class label on the absolute location of the pixel in the image [44]. The location potential(θλ ) can be adapted into the appearance model as follows: Ä ä ‹(i|f, x) · θλ (i, x̂) . ρi (x) = − log P with θλ (i, x̂) being the location potential of class i at normalized pixel location x̂. The location potential is then learnt from a training data with Equation 3.6 , as used in [44]. Ç åw Ni,x̂ + αλ λ θλ (i, x̂) = (3.6) Nx̂+αλ where Ni,x̂ is the number of pixels of class i at normalized location x̂ in the training set, Nx̂ is the total number of pixels at location x̂ and αλ is a small integer, corresponding to a weak Dirichlet prior on θλ . The location potential is raised to the power of wλ to compensate for overcounting. Figure 3.6 shows an example location map, learnt for the 7-class Corel dataset of [25]. This map shows the object occurrences on the image. The color gets warmer as the occurrence potential of the object increases. As shown in the figure, there is a high uncertainity on the locations. The appearance model is composed of a scalar multiplication of the object detector confidence with the location confidence and the weak location confidence causes high uncertainty in the model. Therefore, in this study we decided to use the location attribute as a feature. 26 3.2. Feature Ranking 3.1.4. 
3.1.4. Depth Features

With the advent of cheap RGB-D cameras, their popularity in computer vision applications has increased: RGB-D sensors are inexpensive and provide depth information of the scene along with the color images. Therefore, researchers nowadays work on integrating depth features into their systems; by doing so, the accuracy of the system can be increased by exploiting computationally very cheap and simple features. See Figure 3.7 for a sample gray-scale depth map of an image taken from the NYU version 1 Depth Benchmark [47]. Note that, of the texture filters of Section 3.1.2, the Gaussian filter is applied to all channels (L, a, b), whilst the Laplacian of Gaussian and derivative of Gaussian filters are only applied to the luminance (L) channel.

We use three depth features, also used in [26]:

1. The depth of a relative pixel in a surrounding window, normalized by the furthest depth in the corresponding column.

2. A depth comparison, obtained by subtracting two relative depths as described in [45]. At a given pixel x, the feature computes:

f_\theta(I, x) = d_I\!\left( x + \frac{u}{d_I(x)} \right) - d_I\!\left( x + \frac{v}{d_I(x)} \right),

where d_I(x) is the depth at pixel x in an image I, and u and v are randomly chosen offsets in a window (θ = (u, v)). The relative offsets are normalized by the depth of the current pixel, i.e. scaled with 1/d_I(x), to make the features more robust to camera translation.

3. The height of a point in 3D [47]: the height f_h of a point x = (x_r, x_c) with respect to the camera center is computed as:

f_h(I, x) = -d_I(x) \cdot x_r.

3.2. Feature Ranking

All the features used for object recognition in this study have been selected solely because of their popularity in state-of-the-art works. However, randomly chosen features do not always perform well, and combining strong, robust features does not guarantee any improvement of the performance. In addition, another drawback of using many features is the large computation time.
As the features are computed for each pixel, the calculation time grows with every feature added to the system, and the run-time thus drastically increases. Another important issue when using different types of features jointly is that a redundant feature is no longer a feature but noise, and will definitely reduce the classification performance. For all these reasons, we aim at reducing the redundancy and picking the most distinctive and robust feature set for a given benchmark.

Figure 3.7.: Depth map of the scene. RGB-D cameras provide both the RGB color image (a) and the corresponding depth map (b) of the scene. Depth values are normalized to gray scale; brightness increases as the regions get further away from the camera.

Among many feature selection methods, we use a feature ranking algorithm based on mutual information. In [37], Peng et al. present an algorithm which orders the features by maximizing the relevance between features and the corresponding class labels while minimizing the inter-feature redundancy, both measured with mutual information. The Minimum-Redundancy-Maximum-Relevance (mRMR) method maximizes the objective function in Equation 3.8, where MI(X; Y) denotes the mutual information of two continuous random variables:

MI(X; Y) = \int_Y \int_X p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx\, dy, \qquad (3.7)

\max_{f_j \in X - S_{m-1}} \left[ MI(f_j; c) - \frac{1}{m - 1} \sum_{f_i \in S_{m-1}} MI(f_j; f_i) \right]. \qquad (3.8)

The mRMR method (see Section 6.1 for a detailed explanation) orders the features incrementally by selecting, at each iteration, the feature with the highest score. After ranking the features by their relative importance scores, we run the whole system, adding the ranked features one by one to the feature set and performing the segmentation on all benchmarks independently. The list of ranked features for each benchmark is given in Appendix A.
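The incremental search of Equation 3.8 can be sketched as follows. This is a toy illustration, not the thesis implementation: the mutual-information scores are made-up numbers standing in for real estimates, and `mrmr_rank` is a hypothetical name.

```python
def mrmr_rank(mi_fc, mi_ff):
    """Greedy incremental mRMR ranking in the spirit of Equation 3.8.

    mi_fc[j]    : mutual information MI(f_j; c) with the class label
    mi_ff[j][i] : mutual information MI(f_j; f_i) between features
    At each step, picks the unselected feature maximizing
    relevance minus mean redundancy w.r.t. the already selected set.
    """
    n = len(mi_fc)
    selected = [max(range(n), key=lambda j: mi_fc[j])]  # most relevant first
    while len(selected) < n:
        rest = [j for j in range(n) if j not in selected]
        def score(j):
            redundancy = sum(mi_ff[j][i] for i in selected) / len(selected)
            return mi_fc[j] - redundancy
        selected.append(max(rest, key=score))
    return selected

# toy scores: feature 2 is relevant but nearly duplicates feature 0
mi_fc = [0.9, 0.5, 0.85]
mi_ff = [[0.0, 0.1, 0.8],
         [0.1, 0.0, 0.1],
         [0.8, 0.1, 0.0]]
order = mrmr_rank(mi_fc, mi_ff)
```

Note how the near-duplicate feature 2 is ranked last despite its high relevance: its redundancy with the already selected feature 0 dominates its score.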
Further evaluations will be presented in Chapter 6.

3.3. Object Learning

After feature computation, the most important, class-separating features are selected with one of the machine learning feature selection methods. In this work, we only study commonly used supervised learning algorithms to learn the object classifiers. For our field of application, we consider Gentle AdaBoost [20] and Random Forests [7] to be the most relevant. Classification algorithms generally estimate a class label for a given feature vector. However, since for segmentation we need to compute an appearance model consisting of class distributions per pixel, we adapt these two learning algorithms to our purpose as described below.

Random Forests (see Section 2.2.2) are an ensemble of decision trees which can handle large data. The learning process starts with one decision tree and adds another one to the forest as long as the so-called out-of-bag (oob) error is bigger than a certain threshold and the maximum number of trees has not been reached. Each tree T_w returns a tree vote T_w(f) for a given feature vector f. The probability distribution for each class i, given f and a pixel x, is then estimated as follows:

\tilde{P}(i \mid f, x) = \frac{\sum_{w=1}^{N} [T_w(f) = i]}{N}, \qquad i = 1, \ldots, C, \qquad (3.9)

where N is the number of trees in the forest and C is the total number of classes. Random Forests exhibit a good detection rate and are very robust against outliers.

Another object learning method is the Boosting algorithm (see Section 2.2.1). Boosting is a greedy learning algorithm that linearly combines weak learners to train a strong classifier. A one-vs-all strategy is used for the training, and one strong classifier is trained for each class. This method can output either a class label or a confidence value which is the sum of the weighted votes of the weak learners.

Figure 3.8.: Random Forest vs. Gentle AdaBoost. Images (a) and (d) show the input image and the ground truth, respectively. Images (b) and (c) illustrate the detection results (arg max_i \tilde{P}(i | f, x)) for Random Forest and Gentle AdaBoost, and images (e) and (f) compare the variational segmentation results.
However, we require a probability distribution. Therefore, the distribution is computed using a soft-max transfer function:

\tilde{P}(i \mid f, x) = \frac{\exp(H_i(f, x))}{\sum_{i=1}^{C} \exp(H_i(f, x))}, \qquad (3.10)

where H_i(f, x) denotes the confidence score of class i, computed as follows:

H_i(f, x) = \sum_{w=1}^{W} \alpha_w^i(x) \cdot h_w^i(x). \qquad (3.11)

For our experiments, we set the same learning parameters for all datasets. We trained the Random Forests with a maximum of 50 decision trees, each having a depth of at most 15. For Gentle AdaBoost, we perform the training using 100 weak learners per class. Figure 3.8 shows the experimental results for each unary pixel classifier. Note that the Random Forests give better classification results compared to Gentle AdaBoost. This is due to the unbalanced training data: in most benchmarks, the pixel count for each class is non-uniformly distributed, which leads to a suppression of the classes with fewer pixels. In addition, the soft-max transfer function over-smooths the class potentials. The reason for this is that each classifier is trained independently and that the confidence scores are discrete, real numbers. So the one-vs-all Boosting approach does not provide a good appearance model. In contrast, the Random Forests circumvent this problem by penalizing the class errors with the corresponding class weights during the training, and thus perform equally well for each class.

Figure 3.9.: Probability maps provided by the Random Forest (a) and AdaBoost (b) classifiers for the 7 Corel classes (Rhino, Polar Bear, Water, Snow, Vegetation, Ground, Sky). Warmer colors represent higher probability.
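The soft-max transfer of Equation 3.10 can be sketched as follows. This is a minimal stand-alone version; the shift by the maximum score is a standard numerical-stability detail we assume, not something stated in the text.

```python
import math

def softmax_distribution(scores):
    """Soft-max transfer (Equation 3.10): turns the per-class confidence
    scores H_i(f, x) of the one-vs-all classifiers into a distribution.
    Subtracting the maximum score keeps exp() numerically stable without
    changing the result."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

p = softmax_distribution([2.0, 0.5, -1.0])
p_shifted = softmax_distribution([102.0, 100.5, 99.0])  # same score gaps
```

Because the soft-max only depends on the differences between scores, uniformly shifting all confidences leaves the distribution unchanged.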
Figure 3.9 shows the class-wise probability maps for both learners. There is more uncertainty (cold colors) in the AdaBoost map (Figure 3.9 b) compared to the probabilities generated by the Random Forest (Figure 3.9 a). This problem can be overcome by increasing the number of weak learners; however, this drastically increases the prediction time. Apart from that, prediction with Random Forests is faster than with Gentle AdaBoost. For the above reasons, we decided to use Random Forests in the following experiments.

3.3.1. Measurement of the Classification Uncertainty

In our study, the Random Forests classifier does not return a class label but a distribution over classes. Using this probability distribution ρ(x) for each pixel of an image (see Figure 3.9 a), we measure the quality of the object recognition with the entropy. Entropy in information theory is a measure of the uncertainty associated with a random variable. This term is also known as the Shannon entropy [42], which quantifies the expected value of the information contained in a message, that is, a specific realization of a random variable. The entropy H(C) over the C random variables (in our case, the total number of classes) at a pixel x is defined as follows:

H(C) = -\sum_{i=1}^{C} \rho_i(x) \cdot \log_C \rho_i(x).

The uncertainty of the detection at a pixel x is maximal (H(C) = 1) if the Random Forests assign equal probabilities to all classes, and minimal (H(C) = 0) if one of the classes takes all the probability mass (p(x) = 1.0). Figure 3.10 plots the entropy versus the probability for the case of 2 random variables. We expect that a good distribution will retrieve the best segmentation in the end; therefore, the entropy is used to measure the quality of the classifier.

Figure 3.10.: Entropy H(X). Entropy in the case of two random variables with the probabilities p(X = 1) and 1 − p(X = 1).
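The normalized entropy H(C) above can be sketched as follows. This is a minimal illustration; the function name is hypothetical, and the class-C logarithm base follows the formula in the text so that the result lies in [0, 1].

```python
import math

def detection_uncertainty(probs):
    """Normalized Shannon entropy H(C) = -sum_i p_i * log_C p_i.

    Returns 1.0 for a uniform class distribution (maximal uncertainty)
    and 0.0 when one class takes all the probability mass (maximal
    certainty). Zero-probability terms contribute nothing, matching the
    convention 0 * log 0 = 0.
    """
    C = len(probs)
    return -sum(p * math.log(p, C) for p in probs if p > 0)

uniform = detection_uncertainty([0.25] * 4)         # maximally uncertain
certain = detection_uncertainty([1.0, 0.0, 0.0, 0.0])  # maximally certain
```

Applying this per pixel to the Random Forest distributions yields exactly the kind of certainty map shown in Figure 3.11 (c).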
The uncertainty is maximal (H(X) = 1) when the two random variables have equal probability. Figure 3.11 shows an example of entropy-based uncertainty measurement. The global uncertainty measurements of the unary pixel classifiers for each benchmark are presented in Section 6.2.

Figure 3.11.: Detection Uncertainty. (a) and (b) demonstrate the input image and the unary pixel classification result (arg max_i \tilde{P}(i | f, x)), respectively. (c) shows the detection certainty at each pixel, computed with the entropy. Warmer colors represent high certainty. As shown in (b) and (c), the vegetation, the sky and the inner part of the building have a high certainty in contrast to other regions. Due to the insufficient training samples of cars in the eTrims Benchmark [27], the pixels belonging to the car have the lowest certainty.

4. Variational Optimization and Image Segmentation

This chapter presents the energy functional, based on the unary pixel potentials, for multi-class image segmentation, and gives a brief description of the convex energy optimization with a first-order primal-dual formulation.

4.1. Variational Image Segmentation

Given an input image I : Ω → R³ defined on the image domain Ω ⊂ R², we compute the appearance model ρ_i(x) for all classes i. A subsequent minimization of a regularizing energy based on variational multi-labeling [9] allows us to obtain a consistent pixel labeling and thereby remove label noise in the unary potentials. The energy functional in Equation 4.1 is minimized using a first-order solver [11]. As a regularizer, the total variation of the indicator function u_i of an object of class i is used; this penalizes the boundary length of the object associated with u_i. In order to favor the coincidence of the object boundaries with the image edges, the total variation term is weighted with a non-negative function g : Ω → R⁺ given in Equation 2.5.
E(u) = \sum_{i=1}^{n} \int_\Omega \rho_i(x) \cdot u_i(x) \, dx + \lambda \sum_{i=1}^{n} \int_\Omega g(x) \cdot |\nabla u_i(x)| \, dx \qquad (4.1)

4.2. First-order Primal-Dual Convex Optimization

In order to optimize the solution, the energy functional must meet the following two conditions:

• Necessary condition. If there exists a minimum of the energy functional, the derivative of the functional at the minimum must be equal to zero: \frac{dE(u)}{du} = 0.

• Sufficient condition. To obtain a global minimizer, the functional must meet the convexity criterion.

As already discussed in Section 2.1, our energy functional in Equation 4.1 satisfies the above conditions. However, the minimization problem becomes a saddle-point optimization once the total variation is replaced with its primal-dual formulation:

|\nabla u| = \sup_{|\xi| \le 1} \xi \cdot \nabla u,

where ξ ∈ R² is a dual variable and the supremum is attained at ξ = \frac{\nabla u}{|\nabla u|} if |∇u| ≠ 0. This re-formulation allows re-writing the TV as in Equation 2.4. Since we replace the total variation with its primal-dual formulation, we minimize the energy functional by finding a saddle-point. In this study, we exploit an efficient algorithm for minimizing this saddle-point problem, introduced in [11, 38]. We bring our energy functional (Equation 4.1) into the primal-dual form:

\min_{u \in S} \sup_{\xi \in K} \sum_{i=1}^{n} \int_\Omega \rho_i(x) \cdot u_i(x) \, dx - \lambda \sum_{i=1}^{n} \int_\Omega \operatorname{div} \xi_i \cdot u_i(x) \, dx.

This problem can now be optimized easily with a gradient descent/ascent scheme, as also used in [53]. This scheme descends in the primal variable and ascends in the dual variable until convergence at the saddle-point. Following the necessary condition, we fix one of the variables and differentiate the energy to formulate the update scheme for the optimization:

\frac{\partial E}{\partial u} = \rho - \operatorname{div} \xi, \qquad \frac{\partial E}{\partial \xi} = \nabla u.

In the first step, the primal variable u⁰, the dual variable ξ⁰ and an auxiliary variable v⁰, used in the acceleration step, are initialized with 0.
Then the algorithm iteratively runs the following steps:

ξ_i^{n+1} = Π_K ( ξ_i^n + τ · ∇v_i^n )
u_i^{n+1} = Π_S ( u_i^n + η · (div ξ_i^{n+1} − ρ_i) )
v_i^{n+1} = 2 u_i^{n+1} − u_i^n

τ > 0 and η > 0 are step sizes, and the solution provably converges for sufficiently small step sizes. Π_K and Π_S are the projections onto the corresponding sets. For the primal variable u, the projection onto the set S = BV(Ω; [0, 1]) is calculated by clipping:

(Π_S u)(x) = min{1, max{0, u(x)}} = { u(x) if u(x) ∈ [0, 1];  1 if u(x) > 1;  0 if u(x) < 0 }

For the dual variable ξ, the projection onto the unit disk K is computed as:

(Π_K ξ)(x) = ξ(x) / max{1, |ξ(x)|}

This update scheme is applied to each pixel of the image independently. Therefore, the optimization process can be accelerated on a GPU. Additionally, we optimize the smoothing parameter λ for each benchmark with a binary search as given in Algorithm 2. Hence, we obtain the best possible multi-region segmentation of the image.

Algorithm 2: Optimization of the smoothing parameter λ. For each benchmark, λ is optimized with binary search.

Data: unary pixel-wise potentials ρ(x)
Result: λ_opt
ε ← 0.005; increment ← 3; in-between ← false;
λ ← 1; λ_opt ← 0;
P_opt ← 0; P_current ← 0; P_previous ← 0;
Solve the segmentation with λ and calculate the performance P;
while P_current − P_previous ≥ ε do
    if P_opt < P then
        P_opt ← P; λ_opt ← λ;
    end
    P_previous ← P_current; P_current ← P;
    if P_current − P_previous > 0 then
        if in-between then increment ← increment / 2; end
        λ ← λ + increment; in-between ← false;
    else
        increment ← increment / 2;
        λ ← λ − increment; in-between ← true;
    end
    Solve the segmentation with λ; calculate the performance P;
end

Part III. Experimental Evaluation

5. Benchmarks

In this chapter, we first introduce the benchmarks used for the scientific evaluation. Our framework has been evaluated on four challenging databases. The benchmarks consist of landscape, wild-scene, facade and indoor images.
This chapter gives a quick look at each benchmark.

5.1. eTrims Benchmark

The E-Training for Interpreting Images of Man-Made Scenes (eTrims) [27] dataset consists of 60 facade/street images. Each pixel is labeled with a color which corresponds to a class label; the dataset thus offers ground truth annotation on both the pixel and the region level. There are in total 8 classes: 'Window', 'Vegetation', 'Sky', 'Building', 'Car', 'Road', 'Door' and 'Pavement'. Typical objects in the images exhibit considerable variations in both shape and appearance and hence represent a challenge for object-based image interpretation. This benchmark is one of the most challenging datasets, since the shapes and appearances of the objects are more detailed in the high-resolution images, and generalizing the complex structures of the objects is a difficult task in object classification. Figure 5.1 shows a sample image and the ground truth labeling from the eTrims benchmark. The pixels close to the object boundaries are not labeled (black) due to the ambiguity in the original images.

Figure 5.1.: eTrims Benchmark. Sample image (left) from the eTrims dataset with its associated ground truth labeling (right). Each class is represented by a color. Labels are imposed on the respective color regions.

5.2. NYU version 1 Depth Benchmark

Cheap and sufficiently accurate RGB-D cameras have become increasingly popular in computer vision applications since, besides color images, these cameras also provide a depth map of the scene. In our experiments, we used the RGB-D indoor benchmark introduced in [47] and also used in [26]. This benchmark comprises thousands of objects, and each image has a resolution of 640×480 pixels. In order to compare our results with the state of the art, we categorize the objects into the same 12 core classes: 'Bed', 'Blind', 'Bookshelf', 'Cabinet', 'Ceiling', 'Floor', 'Picture', 'Sofa', 'Table', 'TV', 'Wall', 'Window'.
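As a small aside, the raw depth readings such cameras provide are typically rescaled to 8-bit grey values for visualization. A minimal sketch of such a normalization (our own illustrative assumption; the benchmark's actual preprocessing may differ):

```python
def depth_to_grey(depth):
    """Normalise a raw depth map to 8-bit grey values.

    Farther points map to whiter (larger) values. `depth` is a list of
    rows of raw depth readings in arbitrary units."""
    flat = [d for row in depth for d in row]
    d_min, d_max = min(flat), max(flat)
    span = (d_max - d_min) or 1.0          # guard against a constant map
    return [[int(round(255 * (d - d_min) / span)) for d in row]
            for row in depth]

grey = depth_to_grey([[0.5, 1.0], [1.5, 2.5]])
```

The nearest reading becomes 0 (black) and the farthest 255 (white), matching the "whiter is farther" convention described next.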
The depth values are normalized to grey levels (ranging from 0 to 255), where whiter colors represent farther distances. Unknown regions are labeled white in the ground truth, and these regions are not considered during training. Figure 5.2 shows a sample image with the corresponding depth map and ground truth.

Figure 5.2.: NYU version 1 Depth Benchmark. From left to right: original image, grey-level depth map and corresponding ground truth labeling.

5.3. Corel and Sowerby Benchmarks

We also applied our method to the two natural image datasets used in [25]. The first dataset is a 100-image subset of the Corel image database, consisting of African and Arctic wildlife scenes. The Corel dataset is labeled with 7 classes: 'Rhino/Hippo', 'Polar Bear', 'Vegetation', 'Sky', 'Water', 'Snow' and 'Ground'. Each image is 180×120 pixels. See Figure 5.3 for a sample image and ground truth labeling. The second dataset, the Sowerby Image Database of British Aerospace, is a set of color images of outdoor scenes (containing objects near roads in rural and suburban areas) and their associated labels. This benchmark comprises 104 images with 8 labels, among them 'Sky', 'Grass', 'Roadline', 'Road', 'Building', 'Sign' and 'Car'; each image is 96×64 pixels.

Figure 5.3.: Corel Benchmark. An example image (left) from the Corel benchmark and its associated ground truth (right) with class names imposed on the corresponding labels.

Figure 5.4.: Sowerby Benchmark. An example image (left) from the Sowerby benchmark and its associated ground truth (right) with class names imposed on the corresponding labels.

6. Experimental Results

This chapter presents our evaluations of feature selection and segmentation separately. In Section 6.1, we discuss how many and what kind of features give the best performance for each benchmark independently. Then, in Section 6.2, we present our segmentation results with the selected subsets of features.
We compare our experimental results with state-of-the-art methods in terms of quantitative scores and run-time. We have implemented our approach in C++ and NVIDIA CUDA, using the OpenCV (Open Computer Vision) [6] library and the DARWIN [23] machine learning framework, on Ubuntu 12.04 LTS (Precise Pangolin). All experiments have been executed on an Intel® Core™ i7-3770 processor equipped with 32 GB RAM and an NVIDIA GeForce GTX 480 graphics card. We tested our framework on four different benchmarks: the 8-class facade dataset introduced in [27], the 7-class Corel and Sowerby datasets of He et al. [25], as well as the NYU Depth v1 [47] with 12 core classes. Except for the eTrims benchmark, we randomly split each dataset into training and test sets by 50%. Table 6.1 shows the patch size and the sampling rate ∆ss used for each benchmark. Due to the huge memory consumption of feature ranking with the mRMR algorithm, we only use 100 randomly sampled images for the NYU Depth v1 benchmark.

Table 6.1.: Parameters. Patch size and sampling rate used for each benchmark.

Parameter    Sowerby   Corel   eTrims   NYUv1
Patch size   6         10      24       24
∆ss          3         3       5        5

For the object learning, the same parameters are used for all benchmarks during the training of the Random Forest and Gentle AdaBoost classifiers. The AdaBoost classifiers are trained with 100 weak learners for each class, where each weak classifier is a decision stump. The Random Forests are trained with at most 50 trees, each with a maximum depth of 15. The maximum-categories parameter is set to 15 (any number bigger than the total class number) so that none of the classes in a benchmark are grouped during training. A node is split further only if it contains at least 10 samples. The number of trees in the forest is determined by the forest accuracy, also known as the out-of-bag error, which is set to 1%.
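Going back to the data preparation, the 50% random split described above can be sketched as follows (a hypothetical helper for illustration, not the routine used in the actual implementation):

```python
import random

def split_dataset(items, train_frac=0.5, seed=0):
    """Randomly split a list of images into training and test sets.

    Used for all benchmarks except eTrims, which uses a fixed
    60%/40% protocol instead."""
    rng = random.Random(seed)        # fixed seed: reproducible splits
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset(list(range(100)))
```

Running the split with different seeds yields the ten random splits over which the scores in Section 6.2 are averaged.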
Apart from the learning parameters, due to the highly unbalanced sample distribution over the classes in the benchmarks, we set a weight for each class; during the out-of-bag error calculation, the class errors are weighted with the corresponding class weights to balance the class-wise classification accuracy. Although this weighting does not improve the overall performance, it has a significant effect on the class-wise performance scores. Table 6.2 shows the class weights used for each benchmark. The class weights are set to 1 for each class in the NYUv1 Depth benchmark.

Table 6.2.: Class Weights. Class priors to balance the detection performance of each class in all benchmarks.

(a) Sowerby
Labels:  Sky   Grass  Roadline  Road  Building  Sign  Car
Weight:  1.0   1.0    1.5       1.0   1.0       1.5   1.5

(b) Corel
Labels:  Rhino  Polar Bear  Water  Snow  Ground  Vegetation  Sky
Weight:  1.0    1.0         1.05   1.0   1.0     1.0         1.1

(c) eTrims
Labels:  Building  Vegetation  Door  Window  Car  Sky  Pavement  Road
Weight:  1.0       1.0         1.15  0.65    1.0  1.0  1.0       1.0

6.1. Feature Selection

We rank the features using the mRMR method on the training images and select the best subset of features by incrementally adding the ranked features one by one to our feature set and comparing the results for each feature set. Figure 6.1 illustrates the overall performance scores versus the total number of used features for all benchmarks. The features which actually increase the performance, as well as the number of features leading to a redundancy-free feature set, can easily be read off this illustration. The eTrims benchmark is composed of high-resolution images, thus object textures have significant importance for detection. Similar to Fröhlich et al. [21], we split this dataset by a ratio of 60%/40% for training/testing. As shown in Figure 6.1, there is a jump at the 24th feature, which is a first-order derivative of a Gaussian on the 'L' channel. Figure 6.1 indicates that the remaining features are redundant.
Therefore, we only use the first 24 features for the eTrims benchmark. For a reasonable overall performance on the Sowerby benchmark, according to Figure 6.1, already the first three features would be enough: the relative color feature on the 'a' channel, the Haar horizontal edge feature on the 'b' channel and the relative patch feature on the 'L' channel. Instead of using a larger set of features, this simple set can be used to obtain similar performance. However, one can observe another jump in the performance at the 21st feature. Additional features beyond the first 21 do not improve the performance, but would increase the computational cost. Hence, for our experiments we used the first 21 features. Most of the dropped features are texture features. In contrast to the eTrims benchmark, the information gain of the textural features is relatively small because of the low resolution of the images in the Sowerby dataset.

Figure 6.1.: Overall performance vs. Features. Features are added to the feature set one by one. Pink dots denote the type of the feature (H: Haar, C: Color, T: Texture, L: Location, D: Depth) added at the current step. Yellow circles show how many features are selected for each benchmark.

6.2. Segmentation Results

In this section, we present our qualitative and quantitative segmentation results, in addition to the performance and run-time comparisons between our and state-of-the-art studies. We first present the global entropy scores computed during the evaluation.
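The entropy-based certainty measure behind these scores can be sketched as follows, normalized so that a uniform class posterior gives the maximal value 1 (an illustrative sketch; toy posteriors, not the actual classifier outputs):

```python
import math

def normalized_entropy(p):
    """Shannon entropy of a class posterior p, divided by log(n)
    so that the uniform distribution yields the maximal value 1."""
    n = len(p)
    h = -sum(pi * math.log(pi) for pi in p if pi > 0.0)
    return h / math.log(n)

# A confident detection has low entropy, an ambiguous one is maximal:
certain   = normalized_entropy([0.97, 0.01, 0.01, 0.01])
ambiguous = normalized_entropy([0.25, 0.25, 0.25, 0.25])
```

Averaging this per-pixel quantity over all test pixels of a split gives one global entropy value per benchmark.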
Entropy is computed at each pixel and averaged over the whole test set of each random split to obtain the global entropy. Table 6.3 presents the final mean global entropy scores for each benchmark.

Table 6.3.: Global Detection Entropy. Unary pixel classifier global entropy scores for all benchmarks.

H̃_global(C):   eTrims 0.31   NYUv1 0.38   Corel 0.42   Sowerby 0.23

In Table 6.4 we compare our eTrims benchmark segmentation results with the work of Fröhlich et al. [21]. The algorithm is executed on ten random splits of each benchmark, and the scores shown in Tables 6.4, 6.5, 6.6 and 6.7 are the averages over these random splits. The scores give the percentage of correctly labeled pixels for each class; 'overall' denotes the percentage of correctly labeled pixels in the whole dataset, whilst 'average' gives the mean of the class-wise scores.

Table 6.4.: eTrims Benchmark. Comparison of the segmentation performance and the efficiency for eTrims.

                      Building  Vegetation  Door  Window  Car   Sky   Pavement  Road  Overall  Average  Time
Fröhlich et al. [21]  63.5      90.1        95.4  71.9    77.4  73.2  71.1      69.9  76.1     72.3     ≈ 17s
Ours                  75.0      84.0        64.0  57.0    78.0  96.0  56.0      69.0  76.0     72.0     ≈ 5s

Our performance scores shown in Table 6.4 are as good as those of Fröhlich et al. [21], and we achieve a significant improvement in the evaluation time for the eTrims benchmark. Table 6.5 shows that both training and testing speeds are significantly reduced compared to Shotton et al. [46] on the Sowerby and Corel benchmarks.

Table 6.5.: Sowerby and Corel Benchmark. Segmentation/detection and efficiency comparison with Shotton et al.

                                        Overall             Speed (Train/Test)
                                        Sowerby  Corel      Sowerby    Corel
Shotton et al. – Full CRF model [46]    88.6     74.6       5h/10s     12h/30s
Ours – Segmentation                     90.0     69.4       13s/96ms   45s/280ms
Shotton et al. – Unary classifier only  85.6     68.4       -          -
Ours – Unary classifier only            87.1     65.0       -          -

Table 6.6.: Sowerby and Corel Benchmark Average Class Scores.
Average class-wise segmentation performance over ten random splits.

(a) Sowerby
Labels:         Sky   Grass  Roadline  Road  Sign  Building  Car
Average Score:  93.7  87.5   7.35      96.0  57.0  0.042     0.274

(b) Corel
Labels:         Rhino  Polar Bear  Water  Snow  Vegetation  Ground  Sky
Average Score:  83.1   65.5        75.1   60.6  71.5        65.6    53.0

Furthermore, we compare our segmentation with the approaches of Ladický et al. [29] and Hermans et al. [26] on the NYUv1 benchmark. Here we use only the best 11 ranked features, including the depth features, to compete with Hermans et al. [26]. Table 6.7 demonstrates that we can significantly improve the detection rate for several classes. We obtain the best score compared to Ladický et al. [28] and Hermans et al. [26] for more than half of the classes. Furthermore, we obtain the best average performance. We cannot win a direct runtime competition with the algorithm of Hermans et al. [26]. Nevertheless, our system is open to being parallelized for faster evaluation. Our goal is to keep things simple and to give the opportunity for a quick reimplementation without the need for high-end hardware. Compared to the multi-threaded CPU implementation of Ladický et al. [29], our runtime is around 20 times faster. This shows the outstanding performance of our method in detection accuracy and runtime.

Table 6.7.: NYUv1 Benchmark. Comparison of the segmentation/detection performance and the efficiency for NYUv1.

                             Bed  Blind  Booksh.  Cabinet  Ceiling  Floor  Picture  Sofa  Table  TV  Wall  Window  Overall  Average  Time
Ladický et al. [29]          14   76     3        57       34       59     49       34    47     56  90    11      68       44       ≈ 3m
Hermans et al. [26]          58   88     57       67       58       93     57       67    46     82  78    17      71       64       ≈ 1s
Ours                         82   78     59       82       75       75     79       71    81     85  68    49      70       74       ≈ 8s
Hermans et al. [26] (unary)  51   87     42       48       54       88     62       50    40     73  70    19      65       57       0.2s
Ours (unary)                 76   75     57       72       70       71     70       66    74     74  63    44      64       68       ≈ 4s

Figure 6.2 illustrates the qualitative results on all benchmarks.
In the segmentation result for eTrims in the third row of Figure 6.2 (d), we can see that the windows on the roof are labeled as sky because of reflections, and in the upper part of the image there is a region labeled as pavement although there is none in the ground truth. Comprehensive experiments on various benchmarks have shown that the segmentation performance heavily depends on the pixel-wise object classification, for which the quality of the feature set is decisive. Combining robust and discriminative features does not always yield better performance, since the redundancy of interrelated features causes a high amount of misclassification in the object detection and hence decreases the efficiency of the system in terms of both computational expense and segmentation accuracy. Therefore, the features were ranked based on their mutual information, and the most distinctive feature set for each benchmark was determined with performance experiments on the test sets. These selected subsets of features improved the segmentation accuracy whilst decreasing the computational cost along with the complexity of the classification problem. It is also shown that Haar-like features are the most robust and most discriminative features compared to the other simple features and also to computationally expensive features such as textons. However, in contrast to conventional object detection studies, these simple yet powerful features yield better performance if they are extracted on color channels. Moreover, Haar-like features are robust in capturing object shapes, which enables the object detectors to preserve the object boundaries by finding the best separation of the segments accurately.

(a) Original Image (b) Ground Truth (c) Unary Pixel Classification (d) Segmentation

Figure 6.2.: Qualitative Results. Rows correspond to the Sowerby, Corel, eTrims and NYU Depth v1 benchmarks, respectively.
The first two columns show the input and ground truth images, while the third and fourth columns show the unary pixel classification and segmentation results, respectively.

Part IV. Summary and Conclusion

7. Summary

Semantic segmentation is a joint task of object detection/recognition and segmentation in the field of computer vision. In this work, to address the problem of semantic segmentation, an appearance model based on object detectors is constructed and an energy functional based on this model is minimized with a convex optimization approach. We construct our appearance model with a Random Forest classifier trained with various types of features, and we optimize our energy functional, regularized with the boundary-length-minimizing total variation, using a variational optimization technique. Although the optimization finds the best segmentation for a given appearance model, the robustness and accuracy of the segmentation depend purely on the quality of the appearance model. The major challenge in constructing this sort of appearance model is to accurately detect the objects while preserving the boundaries in between. In order to segment all disjoint regions of an image, current studies use a mixture of all types of features so that the various objects, which carry different structural and textural properties, can easily be distinguished from each other. Despite the fact that different types of features are designed for capturing different object properties such as color, texture or shape, using these features jointly is not always a good solution, because the high relevance between features yields high redundancy in the feature set. As a result of redundant features, the noise in the training data increases and the classification methods struggle with building robust object classifiers.
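The redundancy between such interrelated features can be quantified with mutual information, as in the mRMR criterion used in this work. Below is a discrete sketch on hypothetical toy data (the additive MID variant, relevance minus mean redundancy; an illustration, not the actual implementation):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Discrete mutual information I(X;Y) in nats."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        mi += p_joint * math.log(p_joint * n * n / (px[x] * py[y]))
    return mi

def mrmr_rank(features, labels):
    """Rank features by relevance to the labels minus the mean
    redundancy against the already selected features."""
    remaining = dict(features)           # name -> discretised values
    selected = []
    while remaining:
        def score(name):
            rel = mutual_information(remaining[name], labels)
            if not selected:
                return rel
            red = sum(mutual_information(remaining[name], features[s])
                      for s in selected) / len(selected)
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

labels = [0, 0, 1, 1, 0, 0, 1, 1]
features = {
    "informative": [0, 0, 1, 1, 0, 0, 1, 1],   # copies the labels
    "duplicate":   [0, 0, 1, 1, 0, 0, 1, 1],   # fully redundant copy
    "noise":       [0, 1, 0, 1, 0, 1, 0, 1],   # independent of labels
}
ranking = mrmr_rank(features, labels)
```

The redundant copy is penalized in the second round because its mutual information with the already selected feature cancels its relevance — exactly the effect that keeps interrelated features out of the final set.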
Moreover, the computational cost increases considerably with every feature added to the feature set, since object detection is performed at the pixel level, and the segmentation therefore becomes slower. By exploiting redundancies in feature sets, we have shown that the computational cost of learning and testing in the task of semantic segmentation can be significantly reduced. To this end, we adapted a feature selection algorithm to the purpose of semantic segmentation, which reduced the redundancy within the feature set while preserving accuracy. In many cases, we are able to outperform state-of-the-art results in terms of performance and, in addition, runtime, due to a sophisticated feature set selection with reduced dimension.

7.1. Discussion

In this study, we have shown that Random Forests perform better than Gentle AdaBoost in terms of runtime as well as classification accuracy in the task of multi-class object detection for semantic segmentation. Apart from that, the lists of ranked features indicate that Haar-like features are the most robust and most discriminative features compared to the others. We should also note that Haar-like rectangles are capable of capturing shapes robustly and produce much more distinctive features on color channels. Additionally, experiments have shown that object boundaries can be accurately detected with these simple features. As every object has its own characteristics, every benchmark also has certain differences, and therefore the number and type of features which should be used vary from one benchmark to another. Simple objects and low-resolution images, on which the objects show no detail, can be segmented using only a couple of features, while high-resolution images, on which the shape and texture of the objects can be observed well, can be segmented accurately using relatively more features.
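As an aside, the Haar-like rectangle features singled out in this discussion are cheap to evaluate with a summed-area table [13]. A minimal sketch with hypothetical helper names (the actual feature set also includes lines, center-surround and four-square patterns):

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img over rows < y, cols < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum over the rectangle with top-left (x, y), in constant time."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_horizontal_edge(ii, x, y, w, h):
    """Top half minus bottom half: responds to horizontal edges."""
    half = h // 2
    return rect_sum(ii, x, y, w, half) - rect_sum(ii, x, y + half, w, half)

# A patch that is bright on top and dark below gives a strong response
patch = [[9, 9, 9, 9],
         [9, 9, 9, 9],
         [1, 1, 1, 1],
         [1, 1, 1, 1]]
ii = integral_image(patch)
response = haar_horizontal_edge(ii, 0, 0, 4, 4)
```

Because every rectangle sum costs four look-ups regardless of patch size, such features stay cheap even on the 24-pixel patches used for eTrims and NYUv1, and they can equally be evaluated on the 'L', 'a' or 'b' channel.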
Therefore, the candidate features should be examined and the best subset of features should be selected in order to detect the objects and to find a consistent segmentation, while reducing the computational expense and runtime.

8. Conclusion

Semantic segmentation is one of the most challenging research topics in the field of computer vision due to the huge diversity of objects and the limitations of current object classification methods. Although current multi-class classification algorithms perform well on various image benchmarks, because of the high relevance between features and the redundancy in the feature set, these algorithms suffer from a large amount of noise in the training data. As a result of noisy training, the accuracy of the classifiers drastically decreases and the segmentation fails. Therefore, the visual object detection features, which represent the characteristics of the objects, should be chosen depending on their importance, so that the classification method generates a robust object classifier from a training set with little noise. Hence, the quality of the object detectors increases and the algorithm generates better segmentation outputs at a lower computational cost.

8.1. Future Work

Feature selection by reducing the redundancy in the feature set with the minimum-redundancy-maximum-relevance method improves the performance and reasonably reduces the computation time. However, this approach requires a huge amount of memory, and it is not possible to perform the feature selection on big benchmarks. As future work, one should either optimize the method to cope with huge amounts of data or replace it with another approach. Due to the memory limitations, we could only test several types of features, and our framework only allows selecting the features for each benchmark independently. In future studies, one could generalize the framework to select the features considering all benchmarks together.
Moreover, once the system is capable of it, the number of objects in the set of classes can be increased to solve more complex and higher-dimensional object classification problems.

Appendix

A. List of Ranked Features

The full list of ranked features for each benchmark is given in this chapter. For the eTrims, Corel and Sowerby benchmarks, 37 features are used, composed of 6 Haar-like features, 12 color features, 2 location features and 17 texture features. For the NYU version 1 Depth benchmark, 3 depth features are used additionally. Tables A.2, A.3, A.4 and A.5 list the ranked features with their associated type, the color channel on which the feature is computed, the patch size and the bandwidth parameter used for the feature. For better visualization, the rows in the tables are colored with the corresponding feature colors given in Table A.1.

Table A.1.: Feature Colors. Each feature type is associated with a color: Haar, Color, Texture, Location, Depth.

Table A.2.: eTrims Benchmark. The full list of ranked features.
Rank  Feature   Type                     Lab Channel  Patch Size  Bandwidth
 1    Color     Relative Patch           b            24          -
 2    Color     Vertical Edge            a            24          -
 3    Color     Relative Pixel           L            24          -
 4    Haar      Vertical Line            L            24          -
 5    Color     Relative Patch           a            24          -
 6    Color     Center Surround          b            24          -
 7    Texture   Gaussian                 L            -           1
 8    Color     Vertical Edge            b            24          -
 9    Color     Relative Patch           L            24          -
10    Color     Horizontal Edge          a            24          -
11    Color     Relative Pixel           b            24          -
12    Texture   Laplacian of Gaussian    L            -           1
13    Color     Relative Pixel           a            24          -
14    Texture   Gaussian                 L            -           2
15    Texture   Derivative of Gaussian   L            -           2
16    Color     Center Surround          a            24          -
17    Texture   Gaussian                 L            -           4
18    Haar      Center Surround          L            24          -
19    Color     Horizontal Edge          b            24          -
20    Texture   Derivative of Gaussian   L            -           2
21    Texture   Derivative of Gaussian   L            -           4
22    Texture   Laplacian of Gaussian    L            -           2
23    Texture   Laplacian of Gaussian    L            -           4
24    Texture   Derivative of Gaussian   L            -           4
25    Location  Y Direction              -            -           -
26    Texture   Gaussian                 a            -           1
27    Haar      Vertical Edge            L            24          -
28    Texture   Gaussian                 a            -           2
29    Location  X Direction              -            -           -
30    Texture   Gaussian                 b            -           4
31    Texture   Gaussian                 b            -           1
32    Texture   Gaussian                 a            -           4
33    Texture   Gaussian                 b            -           2
34    Texture   Laplacian of Gaussian    L            -           8
35    Haar      Four Square              L            24          -
36    Haar      Horizontal Edge          L            24          -
37    Haar      Horizontal Line          L            24          -

Table A.3.: NYUv1 Benchmark. The full list of ranked features.

Rank  Feature   Type                     Lab Channel  Patch Size  Bandwidth
 1    Depth     Point Height             Depth Map    24          -
 2    Location  Y Direction              -            -           -
 3    Color     Relative Patch           a            24          -
 4    Color     Center Surround          a            24          -
 5    Depth     Relative Pixel Depth     Depth Map    24          -
 6    Color     Horizontal Edge          a            24          -
 7    Color     Relative Patch           b            24          -
 8    Color     Vertical Edge            a            24          -
 9    Color     Relative Pixel           a            24          -
10    Color     Center Surround          b            24          -
11    Haar      Vertical Line            L            24          -
12    Color     Horizontal Edge          b            24          -
13    Depth     Relative P. Comparison   Depth Map    24          -
14    Color     Vertical Edge            b            24          -
15    Color     Relative Pixel           b            24          -
16    Texture   Laplacian of Gaussian    L            -           1
17    Haar      Horizontal Line          L            24          -
18    Color     Relative Patch           L            24          -
19    Texture   Laplacian of Gaussian    L            -           8
20    Texture   Laplacian of Gaussian    L            -           2
21    Texture   Laplacian of Gaussian    L            -           4
22    Texture   Derivative of Gaussian   L            -           4
23    Texture   Derivative of Gaussian   L            -           4
24    Texture   Gaussian                 a            -           1
25    Haar      Horizontal Edge          L            24          -
26    Texture   Gaussian                 b            -           2
27    Color     Relative Pixel           L            24          -
28    Texture   Gaussian                 b            -           4
29    Texture   Gaussian                 a            -           2
30    Texture   Derivative of Gaussian   L            -           2
31    Texture   Gaussian                 a            -           4
32    Texture   Gaussian                 b            -           1
33    Haar      Vertical Edge            L            24          -
34    Texture   Derivative of Gaussian   L            -           2
35    Location  X Direction              -            -           -
36    Texture   Gaussian                 L            -           4
37    Texture   Gaussian                 L            -           1
38    Texture   Gaussian                 L            -           2
39    Haar      Four Square              L            24          -
40    Haar      Center Surround          L            24          -

Table A.4.: Corel Benchmark. The full list of ranked features.

Rank  Feature   Type                     Lab Channel  Patch Size  Bandwidth
 1    Haar      Horizontal Line          L            10          -
 2    Color     Vertical Edge            a            10          -
 3    Color     Relative Patch           a            10          -
 4    Color     Center Surround          b            10          -
 5    Texture   Derivative of Gaussian   L            -           4
 6    Color     Center Surround          a            10          -
 7    Color     Relative Patch           L            10          -
 8    Color     Horizontal Edge          a            10          -
 9    Color     Vertical Edge            b            10          -
10    Color     Relative Patch           b            10          -
11    Texture   Derivative of Gaussian   L            -           2
12    Color     Relative Pixel           a            10          -
13    Texture   Gaussian                 b            -           4
14    Texture   Derivative of Gaussian   L            -           4
15    Texture   Laplacian of Gaussian    L            -           2
16    Texture   Laplacian of Gaussian    L            -           4
17    Color     Relative Pixel           b            10          -
18    Color     Relative Pixel           L            10          -
19    Texture   Derivative of Gaussian   L            -           2
20    Location  Y Direction              -            -           -
21    Haar      Vertical Line            L            10          -
22    Texture   Gaussian                 a            -           4
23    Texture   Gaussian                 a            -           1
24    Texture   Gaussian                 b            -           1
25    Location  X Direction              -            -           -
26    Texture   Gaussian                 L            -           1
27    Texture   Gaussian                 L            -           4
28    Texture   Gaussian                 b            -           2
29    Texture   Gaussian                 L            -           2
30    Texture   Gaussian                 a            -           2
31    Texture   Laplacian of Gaussian    L            -           8
32    Haar      Center Surround          L            10          -
33    Texture   Laplacian of Gaussian    L            -           1
34    Color     Horizontal Edge          b            10          -
35    Haar      Four Square              L            10          -
36    Haar      Vertical Edge            L            10          -
37    Haar      Horizontal Edge          L            10          -

Table A.5.: Sowerby Benchmark. The full list of ranked features.
Rank  Feature   Type                     Lab Channel  Patch Size  Bandwidth
 1    Color     Relative Pixel           a            6           -
 2    Color     Horizontal Edge          b            6           -
 3    Color     Relative Patch           L            6           -
 4    Texture   Laplacian of Gaussian    L            -           1
 5    Color     Center Surround          a            6           -
 6    Color     Center Surround          b            6           -
 7    Texture   Derivative of Gaussian   L            -           4
 8    Haar      Center Surround          L            6           -
 9    Color     Relative Patch           a            6           -
10    Texture   Derivative of Gaussian   L            -           2
11    Color     Relative Pixel           L            6           -
12    Color     Vertical Edge            a            6           -
13    Texture   Gaussian                 b            -           4
14    Color     Horizontal Edge          a            6           -
15    Texture   Derivative of Gaussian   L            -           4
16    Texture   Laplacian of Gaussian    L            -           2
17    Color     Relative Pixel           b            6           -
18    Texture   Laplacian of Gaussian    L            -           4
19    Texture   Derivative of Gaussian   L            -           2
20    Location  Y Direction              -            -           -
21    Texture   Gaussian                 a            -           4
22    Color     Vertical Edge            b            6           -
23    Texture   Gaussian                 a            -           1
24    Color     Relative Patch           b            6           -
25    Texture   Gaussian                 b            -           1
26    Location  X Direction              -            -           -
27    Haar      Vertical Line            L            6           -
28    Texture   Gaussian                 L            -           1
29    Texture   Gaussian                 L            -           4
30    Texture   Gaussian                 b            -           2
31    Texture   Gaussian                 L            -           2
32    Texture   Gaussian                 a            -           2
33    Texture   Laplacian of Gaussian    L            -           8
34    Haar      Four Square              L            6           -
35    Haar      Vertical Edge            L            6           -
36    Haar      Horizontal Edge          L            6           -
37    Haar      Horizontal Line          L            6           -

Bibliography

[1] CIE 015:2004. Colorimetry, 3rd edn. Technical report, Commission Internationale de l'Eclairage, 2004. 22
[2] Adobe. "Technical Guides". Color Models: CIELAB. http://dba.med.sc.edu/price/irf/Adobe_tg/models/cielab.html. 23
[3] P. Arbelaez, B. Hariharan, Chunhui Gu, S. Gupta, L. Bourdev, and J. Malik. Semantic segmentation using regions and parts. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3378–3385, June 2012. 3
[4] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, pages 144–152, New York, NY, USA, 1992. ACM. 13
[5] Wassim Bouachir, Atousa Torabi, Guillaume-Alexandre Bilodeau, and Pascal Blais. A bag of words approach for semantic segmentation of monitored scenes.
CoRR, abs/1305.3189, 2013. 3
[6] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000. 45
[7] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. 13, 15, 17, 29
[8] Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984. 15
[9] A. Chambolle, D. Cremers, and T. Pock. A convex approach for computing minimal partitions. Tech. rep. TR-2008-05, University of Bonn, 2008. 12, 35
[10] A. Chambolle, D. Cremers, and T. Pock. A convex approach to minimal partitions. SIAM Journal on Imaging Sciences, 5(4):1113–1158, 2012. 12
[11] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Technical Report R.I. 685, Ecole Polytechnique, Palaiseau Cedex, France, 2010. 35, 36
[12] T. M. Cover. The best two independent measurements are not the two best. Systems, Man and Cybernetics, IEEE Transactions on, SMC-4(1):116–117, Jan 1974. 4
[13] Franklin C. Crow. Summed-area tables for texture mapping. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '84, pages 207–212, New York, NY, USA, 1984. ACM. 23
[14] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, pages 886–893, 2005. 3, 4
[15] C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. In Bioinformatics Conference, 2003. CSB 2003. Proceedings of the 2003 IEEE, pages 523–528, Aug 2003. 17, 18
[16] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004. 7
[17] Wendell H. Fleming and Raymond Rishel. An integral formula for total gradient variation. Archiv der Mathematik, 11(1):218–222, 1960. 11
[18] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.
Journal of Computer and System Sciences, 55(1):119 – 139, 1997. 13 [19] Yoav Freund and Robert E. Schapire. A short introduction to boosting. In In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 1401– 1406. Morgan Kaufmann, 1999. 13, 15 [20] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:2000, 1998. 13, 14, 29 [21] Björn Fröhlich, Erik Rodner, and Joachim Denzler. Semantic segmentation with millions of features: Integrating multiple cues in a combined random forest approach. In Asian Conference on Computer Vision (ACCV), pages 218–231, 2012. 4, 46, 48 [22] Enrico Giusti. Minimal Surfaces and Functions of Bounded Variation. Monographs in Mathematics. Birkhäuser Boston, 1984. 9 [23] Stephen Gould. DARWIN: A framework for machine learning and computer vision research and development. Journal of Machine Learning Research, 13:3533–3537, Dec 2012. 45 [24] A. Haar. Zur Theorie der orthogonalen Funktionensysteme. 1909. 22 [25] Xuming He, R.S. Zemel, and M.A. Carreira-Perpindn. Multiscale conditional random fields for image labeling. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–695–II– 702 Vol.2, June 2004. 26, 42, 45 [26] Alexander Hermans, Georgios Floros, and Bastian Leibe. Dense 3d semantic mapping of indoor scenes from rgb-d images. In International Conference on Robotics and Automation, 2014. 3, 4, 24, 27, 42, 49 [27] F. Korč and W. Förstner. eTRIMS Image Database for interpreting images of manmade scenes. Technical Report TR-IGG-P-2009-01, April 2009. 33, 41, 45 [28] L. Ladický, Russell C., Kohli P., and Torr P. Graph cut based inference with cooccurrence statistics. In Proceedings of ECCV, 2010. 3, 49 66 Bibliography [29] L. Ladický, C. Russell, P. Kohli, and P. H S Torr. 
Associative hierarchical crfs for object class image segmentation. In Computer Vision, 2009 IEEE 12th International Conference on, pages 739–746, Sept 2009. 3, 4, 48, 49 [30] Ĺubor Ladický, Paul Sturgess, Karteek Alahari, Chris Russell, and PhilipH.S. Torr. What, where and how many? combining object detectors and crfs. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision – ECCV 2010, volume 6314 of Lecture Notes in Computer Science, pages 424–437. Springer Berlin Heidelberg, 2010. 4 [31] Jan Lellmann, Jörg Kappes, Jing Yuan, Florian Becker, and Christoph Schnörr. Convex multi-class image labeling by simplex-constrained total variation. In Xue-Cheng Tai, Knut Mørken, Marius Lysaker, and Knut-Andreas Lie, editors, Scale Space and Variational Methods in Computer Vision, volume 5567 of Lecture Notes in Computer Science, pages 150–162. Springer Berlin Heidelberg, 2009. 12 [32] D. Lowe. Object recognition from local scale-invariant features. September 1999. 3, 4 Corfu, Greece, [33] Philippe Magiera and Carl Löndahl. ”ROF Denoising Algorithm”,. http://www.mathworks.de/matlabcentral/fileexchange/ 22410-rof-denoising-algorithm, 2008. 10 [34] Bjoern H. Menze, Michael B. Kelm, Ralf Masuch, Uwe Himmelreich, Peter Bachert, Wolfgang Petrich, and Fred A. Hamprecht. A comparison of random forest and its gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics, 10(1), 2009. 16 [35] Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1 edition, 1997. 16 [36] David Mumford and Jayant Shah. Optimal approximations by piecewise smooth functions and associated variational problems. Communications on Pure and Applied Mathematics, 42(5):577–685, 1989. 7, 8 [37] H. Peng, Fulmi Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. 
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(8):1226–1238, Aug 2005. 4, 5, 17, 18, 27 [38] T. Pock, D. Cremers, H. Bischof, and A. Chambolle. An algorithm for minimizing the piecewise smooth mumford-shah functional. In IEEE International Conference on Computer Vision (ICCV), Kyoto, Japan, 2009. 36 [39] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958. 13 [40] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1–4):259 – 268, 1992. 8 67 Bibliography [41] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. In Machine Learning, pages 80–91, 1999. 15 [42] Claude Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656, July, October 1948. 31 [43] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, June 2008. 3 [44] J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost: Joint appearance, shape and context modeling for mulit-class object recognition and segmentation. In European Conference on Computer Vision (ECCV), pages 1–15, 2006. 3, 4, 26 [45] Jamie Shotton, Toby Sharp, Alex Kipman, Andrew W. Fitzgibbon, Mark Finocchio, Andrew Blake, Mat Cook, and Richard Moore. Real-time human pose recognition in parts from single depth images. Commun. ACM, 56(1):116–124, 2013. 27 [46] Jamie Shotton, John Winn, Carsten Rother, and Antonio Criminisi. Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of Computer Vision, 81(1):2– 23, 2009. 3, 22, 48 [47] N. Silberman and R. Fergus. 
Indoor scene segmentation using a structured light sensor. In Proceedings of the International Conference on Computer Vision - Workshop on 3D Representation and Recognition, 2011. 27, 42, 45 [48] A. Torralba, K.P. Murphy, and W.T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–762–II–769 Vol.2, June 2004. 4 [49] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–511–I–518 vol.1, 2001. 3, 4, 22, 23 [50] Eric. W. Weisstein. ”Haar Function.” From MathWorld– A Wolfram Web Resource. http://mathworld.wolfram.com/HaarFunction.html. 23 [51] Christopher Zach, David Gallup, Jan-Michael Frahm, and Marc Niethammer. Fast global labeling for real-time stereo using multiple plane sweeps. In VMV, pages 243– 252, 2008. 12 [52] Weiwei Zhang, Jian Sun, and Xiaoou Tang. Cat head detection - how to effectively exploit shape and texture features. In David Forsyth, Philip Torr, and Andrew Zisserman, editors, Computer Vision – ECCV 2008, volume 5305 of Lecture Notes in Computer Science, pages 802–816. Springer Berlin Heidelberg, 2008. 4 [53] M. Zhu and T. Chan. An efficient primal-dual hybrid gradient algorithm for total variation image restoration. UCLA CAM Report, pages 08–34, 2008. 36 68
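A ranking like the one above can be produced greedily with a max-relevance, min-redundancy (mRMR) style criterion: each candidate feature is scored by its mutual information with the class labels minus its average mutual information with the features already selected. The sketch below is an illustrative reimplementation of that idea, not the code used in this thesis; the histogram-based mutual-information estimator, the `bins` parameter, and the toy data are assumptions made for the example.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Estimate I(X;Y) in bits from two 1-D arrays via 2-D histogram binning."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal over x-bins, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal over y-bins, shape (1, bins)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

def mrmr_rank(features, labels, n_select):
    """Greedy mRMR: repeatedly pick the feature maximizing
    relevance I(f; labels) minus mean redundancy with already-picked features."""
    n_feat = features.shape[1]
    relevance = [mutual_information(features[:, j], labels) for j in range(n_feat)]
    selected, remaining = [], list(range(n_feat))
    while remaining and len(selected) < n_select:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = np.mean([mutual_information(features[:, j], features[:, s])
                                  for s in selected])
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: feature 0 predicts the label, feature 1 is a near-copy of it
# (redundant), feature 2 is pure noise. mRMR demotes the redundant twin
# below the noise feature, because its high redundancy outweighs its relevance.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 2000).astype(float)
f0 = y + 0.1 * rng.normal(size=2000)
f1 = f0 + 0.05 * rng.normal(size=2000)   # near-duplicate of f0
f2 = rng.normal(size=2000)               # irrelevant noise
ranking = mrmr_rank(np.column_stack([f0, f1, f2]), y, n_select=3)
print(ranking)
```

On this toy data, whichever of the two duplicated features is picked first, the other is ranked last: its mutual information with the already-selected copy dominates its relevance, while the noise feature has almost no redundancy and so is taken second. This is exactly the effect visible in the table, where near-identical filter responses on different channels are spread across the ranking rather than clustered at the top.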
