Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete (Master's thesis)

Vehicle Detection in Monochrome Images

Master's thesis in Image Processing carried out at Tekniska högskolan i Linköping (Linköping Institute of Technology)
by Marcus Lundagårds

LITH-ISY-EX--08/4148--SE
Linköping 2008

Supervisors: Ognjan Hedberg, Autoliv Electronics AB; Fredrik Tjärnström, Autoliv Electronics AB; Klas Nordberg, ISY, Linköpings universitet
Examiner: Klas Nordberg, ISY, Linköpings universitet
Linköping, 28 May 2008

Division, Department: Division of Computer Vision, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden
Date: 2008-05-28
ISRN: LITH-ISY-EX--08/4148--SE
URL for electronic version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-11819
Title: Detektering av Fordon i Monokroma Bilder (Vehicle Detection in Monochrome Images)
Author: Marcus Lundagårds

Abstract

The purpose of this master's thesis was to study computer vision algorithms for vehicle detection in monochrome images captured by a mono camera. The work has mainly been focused on detecting rear-view cars in daylight conditions. Previous work in the literature has been reviewed, and algorithms based on edges, shadows and motion as vehicle cues have been modified, implemented and evaluated.
This work presents a combination of multiscale edge based detection and shadow based detection as the most promising algorithm, with a positive detection rate of 96.4% on vehicles at distances between 5 m and 30 m. For the algorithm to work in a complete system for vehicle detection, future work should focus on developing a vehicle classifier to reject false detections.

Keywords: vehicle detection, edge based detection, shadow based detection, motion based detection, mono camera system

Acknowledgments

This thesis project is the final part of the educational program in Applied Physics and Electrical Engineering at Linköping University. The work has been carried out at Autoliv Electronics AB in Mjärdevi, Linköping. I would like to take this opportunity to thank the people who have helped me during this work: my tutors Ognjan Hedberg and Fredrik Tjärnström for their helpfulness and valuable advice, Klas Nordberg for his theoretical input, Salah Hadi for showing interest in my thesis work, and Alexander Vikström for his work on the performance evaluation tool I used to evaluate my algorithms.

Contents

1 Introduction
1.1 Background
1.2 Thesis Objective
1.3 Problem Conditions
1.4 System Overview
1.5 Report Structure
2 Theory
2.1 Vanishing Points and the Hough Transform
2.2 Edge Detection
2.3 The Correspondence Problem
2.3.1 The Harris Operator
2.3.2 Normalized Correlation
2.3.3 Lucas-Kanade Tracking
2.4 Optical Flow
2.5 Epipolar Geometry
2.5.1 Homogeneous Coordinates
2.5.2 The Fundamental Matrix
2.5.3 Normalized Eight-Point Algorithm
2.5.4 RANSAC
3 Vehicle Detection Approaches
3.1 Knowledge Based Methods
3.1.1 Color
3.1.2 Corners and Edges
3.1.3 Shadow
3.1.4 Symmetry
3.1.5 Texture
3.1.6 Vehicle Lights
3.2 Stereo Based Methods
3.3 Motion Based Methods
4 Implemented Algorithms
4.1 Calculation of the Vanishing Point
4.2 Distance to Vehicle and Size Constraint
4.3 Vehicle Model
4.4 Edge Based Detection
4.4.1 Method Outline
4.5 Shadow Based Detection
4.5.1 Method Outline
4.6 Motion Based Detection
4.6.1 Method Outline
4.7 Complexity
5 Results and Evaluation
5.1 Tests on Edge Based and Shadow Based Detection
5.1.1 Edge Based Detection Tests
5.1.2 Shadow Based Detection Tests
5.1.3 Tests on Combining Edge Based and Shadow Based Detection
5.2 Tests on Motion Based Detection
6 Conclusions
Bibliography

Chapter 1 Introduction

This chapter introduces the problem to be addressed. Some background is given along with the objective of the thesis, a discussion of the problem conditions and a system overview. Finally, the structure of the report is outlined.

1.1 Background

Road traffic accidents account for an estimated 1.2 million deaths and up to 50 million injuries worldwide every year [1]. Furthermore, the costs of these accidents add up to 1-3% of the world's Gross National Product [2]. As the world leader in automotive safety, Autoliv is continuously developing products to reduce the risk associated with driving. In Linköping, different vision based systems that aim to help the driver are being developed.
An example is Autoliv's Night Vision system, shown in Figure 1.1, which improves the driver's vision at night using an infrared camera. The camera detects heat from objects and is calibrated to be especially sensitive to the temperature of humans and animals. The camera is installed in the front of the car, and its view is projected on a display in front of the driver.

Figure 1.1. Autoliv's Night Vision system.

Detection of obstacles, such as vehicles or pedestrians, is a vital part of such a system. In a current project, CCD (Charge-Coupled Device) image sensors and stereo vision capture visible light and are used as the base for a driver-aid system. The motive for this thesis is Autoliv's wish to investigate what can be achieved with a mono camera system as far as vehicle detection is concerned.

1.2 Thesis Objective

As shown in Figure 1.2, vehicle detection is basically a two-step process consisting of detection and classification. Detection is the step where the image is scanned for ROIs (Regions Of Interest), i.e., vehicle hypotheses in this case. The detector is often used together with a classifier which eliminates false hypotheses. Classification is therefore the process of deciding whether or not a particular ROI contains a vehicle. A classifier is typically trained on a large amount of training data, both positive (vehicles) and negative (non-vehicles). The objectives of the two steps vary between approaches, but in general the detector aims to overestimate the number of ROIs with the intention of not missing vehicles. The task of the classifier is then to discard as many of the false vehicle detections as possible. A tracker can be used to further improve the performance of the system. By tracking regions which have been classified as vehicles in multiple consecutive frames, the system can behave more stably over time.
This thesis focuses on the detection step, aiming to investigate existing algorithms for detection of vehicles in monochrome (i.e., grayscale) images from a mono camera. Based on this study, three algorithms are implemented in MATLAB and tested on data provided by Autoliv. Improvements are made by modifying the existing algorithms. Furthermore, the complexity of the algorithms is discussed. Herein, vehicle detection will refer to the detection step, excluding the classification.

1.3 Problem Conditions

Different approaches to vehicle detection have been proposed in the literature, as will be further discussed in Chapter 3. Creating a robust system for vehicle detection is a very complex problem. There are numerous difficulties that need to be taken into account:

• Vehicles can be of different size, shape and color. Furthermore, a vehicle can be observed from different angles, making the definition of a vehicle even broader.
• Lighting and weather conditions vary substantially. Rain, snow, fog, daylight and darkness must all be taken into account when designing the system.
• Vehicles might be occluded by other vehicles, buildings, etc.
• For a precrash system to serve its purpose it is crucial for the system to achieve real-time performance.

Figure 1.2. Illustration of the two-step vehicle detection strategy. (a) Initial image. (b) Detected ROIs. (c) The ROIs have been classified as vehicles and non-vehicles.

Due to this variability in conditions, it is necessary to strictly define and delimit the problem. Detecting all vehicles in every possible situation is not realistic. The work in this thesis focuses on detecting fully visible rear-view cars and SUVs during daytime. Trucks are not prioritized. To the largest possible extent, the algorithms are designed to detect vehicles in various weather conditions (excluding night scenes) and at any distance. The issue of real-time performance is outside the scope of this thesis.
1.4 System Overview

The system that has captured the test data consists of a pair of forward-directed CCD image sensors mounted on the vehicle's rear-view mirror and a computer used for information storage. Lens distortion compensation is performed on the monochrome images and every second horizontal line is removed to avoid artifacts due to interlacing. The output of the system is therefore image frames with half the original vertical resolution, 720x240 pixels in size. The frame rate is 30 Hz. To investigate the performance of a mono system, only the information from the left image sensor has been used.

1.5 Report Structure

This report is organized as follows: Chapter 2 explains different theoretical concepts needed in later chapters. Chapter 3 describes the approaches to vehicle detection that have been proposed in the literature. The algorithms chosen for implementation are presented in more detail in Chapter 4. The results from the evaluation of the algorithms follow in Chapter 5. Finally, Chapter 6 sums up the conclusions drawn.

Chapter 2 Theory

This chapter introduces theory and concepts needed to comprehend the methods used for vehicle detection. The different sections are fairly separate and can be read independently. It is up to the reader to decide whether to read this chapter in full before continuing with the report, or to follow the references from Chapter 4, which describes the implemented algorithms, back to the necessary theoretical explanations.

2.1 Vanishing Points and the Hough Transform

Figure 2.1. Vanishing points. (a) The three possible vanishing points (from [32]). (b) The interesting vanishing point in vehicle detection.

The vanishing points are defined as points in an image plane where parallel lines in 3D space converge. As shown in Figure 2.1(a), the three principal directions of a scene give rise to at most three such vanishing points in an image.
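The definition above can be illustrated numerically. The following Python sketch is an illustrative addition and not part of the thesis implementation (which was written in MATLAB). It assumes an ideal pinhole camera with focal length f and the optical axis along Z; the helper names project, point_on_line and vanishing_point, as well as the example coordinates, are hypothetical.

```python
def project(point, f=1.0):
    """Pinhole projection of a 3D point (X, Y, Z), Z > 0, onto the image plane."""
    X, Y, Z = point
    return (f * X / Z, f * Y / Z)


def point_on_line(origin, direction, t):
    """Point origin + t * direction on a 3D line."""
    return tuple(o + t * d for o, d in zip(origin, direction))


def vanishing_point(direction, f=1.0):
    """Limit of the projection as t grows; requires a non-zero Z component."""
    dx, dy, dz = direction
    return (f * dx / dz, f * dy / dz)


# Two parallel 3D lines, e.g. the left and right lane markings of a road
# (assumed coordinates, in metres, camera 1.2 m above the road).
direction = (0.1, 0.0, 1.0)
left_start = (-1.5, -1.2, 2.0)
right_start = (1.5, -1.2, 2.0)

vp = vanishing_point(direction)
far_left = project(point_on_line(left_start, direction, 1e6))
far_right = project(point_on_line(right_start, direction, 1e6))

print("vanishing point:", vp)
print("far point on left line: ", far_left)
print("far point on right line:", far_right)
```

Both far points end up very close to the same image point, which depends only on the shared direction of the lines and not on where each line starts.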
The interesting vanishing point in the application of vehicle detection is the point located on the horizon, seen in Figure 2.1(b). This point is valuable because the vertical distance in pixels between the vanishing point and the bottom of a vehicle can yield a distance measure between the camera and the vehicle under certain assumptions.

Most of the methods (e.g., [4], [5]) proposed in the literature to calculate the vanishing points depend on the Hough transform [3] to locate dominant line segments. The vanishing point is then taken as the intersection point of these line segments. Due to measurement noise, there is usually not a unique intersection point, and some sort of error target function is therefore minimized to find the best candidate.

The Hough transform, illustrated in Figure 2.2, maps every point (x, y) in the image plane to a sinusoidal curve in the Hough space (ρθ-space) according to

\[ x \cos\theta + y \sin\theta = \rho \]

where ρ can be interpreted as the perpendicular distance between the origin and a line passing through the point (x, y), and θ as the angle between the x-axis and the normal of the same line. The sinusoidal curves from different points along the same line in the image plane intersect in the same point in the Hough space, so that their contributions accumulate at that point. Every point in the Hough space transforms back to a line in the image plane. By thresholding the Hough space one can therefore detect dominant line segments.

Figure 2.2. The Hough transform maps a point in the image plane to a sinusoidal curve in the Hough space. The curves of all image points on the same line intersect in a common point in the Hough space (from [30]).

2.2 Edge Detection

Edges are important features in image processing. They arise from sharp changes in image intensity and can, for example, indicate depth discontinuities or changes of material. Edges are detected by estimating image gradients, which indicate how the intensity changes over an image. The simplest approach to estimating gradients is to use first-order discrete differences, e.g., the symmetric difference:

\[ \frac{\delta I}{\delta x} = \frac{I(x + \Delta x, y) - I(x - \Delta x, y)}{2\Delta x} \]

Combining this differentiation with an averaging filter in the direction perpendicular to the differentiation yields the well-known Sobel filter. Edges are detected as pixels with a high absolute response when convolving an image with a Sobel filter. The following Sobel kernel detects vertical edges:

\[ s_x = \frac{1}{4}\begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix} \frac{1}{2}\begin{pmatrix} 1 & 0 & -1 \end{pmatrix} = \frac{1}{8}\begin{pmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{pmatrix} \]

To detect horizontal edges, $s_y = s_x^T$ is used as Sobel kernel. If both the vertical and horizontal edge maps $I_x = I * s_x$ and $I_y = I * s_y$ have been calculated, the magnitude of the gradient is given by the vector norm

\[ \|\nabla I(x, y)\| = \sqrt{I_x^2(x, y) + I_y^2(x, y)} \]

One of the most widely used edge detection techniques was introduced by John F. Canny in 1986. It is known as the Canny edge detector and detects edges by searching for local maxima in the gradient norm $\|\nabla I(x, y)\|$. A more detailed description of the Canny edge detector can be found in [7].

2.3 The Correspondence Problem

The correspondence problem refers to the problem of finding a set of corresponding points in two images taken of the same 3D scene from different views (see Figure 2.3). Two points correspond if they are the projections onto the respective image planes of the same 3D point. This is a fundamental and well-studied problem in the field of computer vision. Although humans solve this problem easily, it is very hard for a computer to solve automatically.

Solving the correspondence problem for every point in an image is seldom either desirable or possible. Apart from the massive computational effort of such an operation, the aperture problem makes it impossible to match some points unambiguously. The aperture problem, shown in Figure 2.4, arises when one-dimensional structures in motion are looked at through an aperture.
Within the aperture it is impossible to match points on such a structure between two consecutive image frames, since the direction of motion is ambiguous. This shows that some points are unsuitable for matching.

Thus, to solve the correspondence problem, a way to extract points suitable for point matching is needed first. The Harris operator described below is a popular such method. Secondly, point correspondences must be found. Sections 2.3.2 and 2.3.3 describe two different ways of achieving this.

Figure 2.3. Finding corresponding points in two images of the same scene is called the correspondence problem.

Figure 2.4. The aperture problem. Despite the fact that the lines move diagonally, only horizontal motion can be observed through the aperture.

2.3.1 The Harris Operator

Interesting points in an image are often called feature points. The properties of such a point are not clearly defined; instead they depend on the problem at hand. Points suited for matching typically contain local two-dimensional structure, e.g., corners. The Harris operator [6] is probably the most used method to extract such feature points. First, image gradients are estimated, e.g., using Sobel filters. Then the 2x2 structure tensor matrix T is calculated as

\[ \nabla I(\mathbf{x}) = \begin{pmatrix} I_x & I_y \end{pmatrix}^T, \qquad T = \nabla I(\mathbf{x}) \nabla I(\mathbf{x})^T = \begin{pmatrix} I_x I_x & I_x I_y \\ I_y I_x & I_y I_y \end{pmatrix} \]

In practice the elements of T are averaged over a local neighbourhood, since the outer product at a single pixel always has rank one. The Harris response is then calculated as

\[ H(\mathbf{x}) = \det T(\mathbf{x}) - c \, \mathrm{tr}^2\, T(\mathbf{x}) \]

The constant c has been assigned different values in the literature, typically in the range 0.04 to 0.05, which has empirically proven to work well. Local maxima in the Harris response indicate feature points, here defined as points with sufficient two-dimensional structure. In Figure 2.5, feature points have been extracted using the Harris operator.

Figure 2.5.
Feature points extracted with the Harris operator.

2.3.2 Normalized Correlation

Correlation can be used for template matching, i.e., to find a small region in an image which matches a template. In particular, putative point correspondences can be found by comparing a template region t around one feature point against regions around all feature points (x, y) in another image I. In its simplest form the correlation is defined as

\[ C(x, y) = \sum_{\alpha,\beta} I(x + \alpha, y + \beta)\, t(\alpha, \beta) \]

where a high response indicates a good match between the region and the template. The pixel in the centre of the template is assumed to be t(0, 0). However, in unipolar images (with only positive intensity values) this correlation formula can give a high response in a region even though the region does not fit the template at all. This is because regions with high intensity yield a higher response when no normalization is performed. Two normalized correlation formulas are common. The first only normalizes the correlation with the norms of the region and the template:

\[ C(x, y) = \frac{\sum_{\alpha,\beta} I(x + \alpha, y + \beta)\, t(\alpha, \beta)}{\sqrt{\sum_{\alpha,\beta} I^2(x + \alpha, y + \beta) \sum_{\alpha,\beta} t^2(\alpha, \beta)}} \]

The second also subtracts the mean intensity values of the image region, $\mu_I$, and of the template, $\mu_t$:

\[ C(x, y) = \frac{\sum_{\alpha,\beta} \left(I(x + \alpha, y + \beta) - \mu_I\right)\left(t(\alpha, \beta) - \mu_t\right)}{\sqrt{\sum_{\alpha,\beta} \left(I(x + \alpha, y + \beta) - \mu_I\right)^2 \sum_{\alpha,\beta} \left(t(\alpha, \beta) - \mu_t\right)^2}} \]

2.3.3 Lucas-Kanade Tracking

Another way of determining point correspondences is to extract feature points in one of the images and track them in the other image using Lucas-Kanade tracking [25]. Although an affine motion field is first introduced as a motion model, this is reduced to a pure translation in [25] since inter-frame motion is usually small. The tracking therefore consists of solving for d in the equation

\[ J(\mathbf{x} + \mathbf{d}) = I(\mathbf{x}) \qquad (2.1) \]

for two images I and J, a point $\mathbf{x} = \begin{pmatrix} x & y \end{pmatrix}^T$ and a translational motion d. Equation (2.1) can be written as

\[ J\left(\mathbf{x} + \frac{\mathbf{d}}{2}\right) = I\left(\mathbf{x} - \frac{\mathbf{d}}{2}\right) \qquad (2.2) \]

to make it symmetric with respect to both images. Because of image noise, changes in illumination, etc., Equation (2.2) is rarely satisfied exactly. Therefore, the dissimilarity

\[ \epsilon = \iint_W \left( J\left(\mathbf{x} + \frac{\mathbf{d}}{2}\right) - I\left(\mathbf{x} - \frac{\mathbf{d}}{2}\right) \right)^2 w(\mathbf{x})\, d\mathbf{x} \qquad (2.3) \]

is minimized by solving

\[ \frac{\delta \epsilon}{\delta \mathbf{d}} = 0. \]

The weight function w(x) is usually set to 1. In [8] it is shown that minimizing Equation (2.3) is approximately equivalent to solving

\[ Z \mathbf{d} = \mathbf{e} \]

where Z is the 2x2 matrix

\[ Z = \iint_W \mathbf{g}(\mathbf{x})\, \mathbf{g}^T(\mathbf{x})\, d\mathbf{x} \]

and e the 2x1 vector

\[ \mathbf{e} = \iint_W \left(I(\mathbf{x}) - J(\mathbf{x})\right) \mathbf{g}(\mathbf{x})\, w(\mathbf{x})\, d\mathbf{x} \]

where

\[ \mathbf{g}(\mathbf{x}) = \begin{pmatrix} \frac{\delta}{\delta x}\left(\frac{I + J}{2}\right) & \frac{\delta}{\delta y}\left(\frac{I + J}{2}\right) \end{pmatrix}^T. \]

In practice, the Lucas-Kanade method is often used in an iterative scheme where the region of interest in the image is interpolated in each iteration.

2.4 Optical Flow

The problem of determining the optical flow between two images is closely related to the correspondence problem described in Section 2.3. The optical flow is an estimate of the apparent motion of each pixel in an image between two image frames, i.e., the flow of image intensity values. The optical flow should not be confused with the motion field, which is the real motion of an object in a 3D scene projected onto the image plane [9]. These are identical only if the object does not change the image intensity while moving.

There are a number of different approaches to computing the optical flow, of which one derived by Lucas and Kanade [10] will be briefly discussed here. Assuming that two regions in an image are identical apart from a translational motion, the optical flow is derived from the equation [33]:

\[ I(x + \Delta x, y + \Delta y, t + \Delta t) = I(x, y, t). \qquad (2.4) \]

Equation (2.4) states that a translation vector (∆x, ∆y) exists such that the image intensity I(x, y, t) is located at I(x + ∆x, y + ∆y, t + ∆t) after a time ∆t. Rewriting the equation using a first-order Taylor series yields

\[ \Delta x\, I_x + \Delta y\, I_y + \Delta t\, I_t = 0. \qquad (2.5) \]

After division by ∆t and with $u = \frac{\Delta x}{\Delta t}$, $v = \frac{\Delta y}{\Delta t}$, the equation for optical flow is given as

\[ \nabla I^T \begin{pmatrix} u \\ v \end{pmatrix} = -I_t. \qquad (2.6) \]

With one equation and two unknowns, further assumptions must be made in order to solve for the optical flow. The classical approach was proposed by Horn and Schunck [11], but other methods are found in the literature.

2.5 Epipolar Geometry

Epipolar geometry describes the geometry of two views, i.e., stereo vision. Given a single image, the 3D point corresponding to a point in the image plane must lie on a straight line passing through the camera centre and the image point. Because one dimension is lost when a 3D point is projected onto an image plane, it is impossible to reconstruct the world coordinates of that point from a single image. However, with two images of the same scene taken from different angles, the 3D point can be calculated by determining the intersection between the two straight lines passing through the respective camera centres and image points. One such line, projected onto the image plane of a camera at a different viewpoint, is known as the epipolar line of that image point.

The epipole is the point in one of the images where the camera centre of the other image is projected. Another way to put it is that the epipoles are the intersections between the two image planes and a line passing through the two camera centres.

Figure 2.6. The projection of a 3D point X onto two image planes. The camera centres $C_1$, $C_2$, image coordinates $x_1$, $x_2$, epipolar lines $l_1$, $l_2$ and epipoles $e_1$, $e_2$ are shown in the figure.

2.5.1 Homogeneous Coordinates

Homogeneous coordinates are the representation for the projective geometry used to project a 3D scene onto a 2D image plane. The homogeneous representation of a point $\mathbf{x} = \begin{pmatrix} x & y \end{pmatrix}^T$ in an image plane is $\mathbf{x}_h = \begin{pmatrix} cx & cy & c \end{pmatrix}^T$ for any non-zero constant c.
Thus, all vectors that differ only by a non-zero scale factor c are equivalent, and such a vector space is called a projective space. It is common in computer vision to set c = 1 in the homogeneous representation, so that the other elements represent actual coordinates in the metric unit chosen.

In computer vision, homogeneous coordinates are convenient in that they can express an affine transformation, e.g., a rotation and a translation, as a matrix operation by rewriting y = Ax + b into

\[ \begin{pmatrix} \mathbf{y} \\ 1 \end{pmatrix} = \begin{pmatrix} A & \mathbf{b} \\ 0, \ldots, 0 & 1 \end{pmatrix} \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix}. \]

In this way, affine transformations can be combined simply by multiplying their matrices.

2.5.2 The Fundamental Matrix

The fundamental matrix F is the algebraic representation of epipolar geometry. It is a 3x3 matrix of rank two and depends on the two cameras' internal parameters and relative pose. The epipolar constraint describes how corresponding points in a two-view geometry relate and is defined as

\[ \mathbf{x}_{h2}^T F\, \mathbf{x}_{h1} = 0. \]

Here $\mathbf{x}_{h1}$ is the homogeneous representation of the projection of a 3D point in the first image and $\mathbf{x}_{h2}$ the coordinates of the same 3D point in the second image. This is a necessary but not sufficient condition for point correspondence. Therefore, one can only discard putative correspondences, not confirm them.

The fundamental matrix can be calculated either by using the camera calibration matrices and their relative pose [13] or by using known point correspondences as described in Section 2.5.3.

2.5.3 Normalized Eight-Point Algorithm

The normalized eight-point algorithm [12] estimates the fundamental matrix between two stereo images from eight corresponding point pairs. The eight points in each image are first translated to place their centroid at the origin. The coordinate system is also scaled to make the mean distance from the origin to a point equal to √2. This normalization makes the algorithm more resistant to noise by ensuring that the point coordinates are in the same size range as the 1 in the homogeneous representation $\mathbf{x} = (cx, cy, c) = (x, y, 1)$.
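The normalization just described can be sketched in a few lines of Python. This is an illustrative addition and not the thesis implementation (which was written in MATLAB); the function names normalization_matrix and apply_homogeneous and the example pixel coordinates are hypothetical. The sketch builds a 3x3 matrix that translates the centroid to the origin and scales the mean distance to √2, and applies it to the points in homogeneous form.

```python
import math


def normalization_matrix(points):
    """3x3 matrix moving the centroid to the origin and scaling the
    mean distance from the origin to sqrt(2).

    Assumes the points are not all identical (mean distance > 0).
    """
    n = len(points)
    xc = sum(x for x, _ in points) / n
    yc = sum(y for _, y in points) / n
    mean_dist = sum(math.hypot(x - xc, y - yc) for x, y in points) / n
    alpha = math.sqrt(2.0) / mean_dist
    return [[alpha, 0.0, -alpha * xc],
            [0.0, alpha, -alpha * yc],
            [0.0, 0.0, 1.0]]


def apply_homogeneous(P, point):
    """Apply a 3x3 matrix to a 2D point written as (x, y, 1)."""
    xh = (point[0], point[1], 1.0)
    u, v, w = (sum(P[r][k] * xh[k] for k in range(3)) for r in range(3))
    return (u / w, v / w)


# Eight assumed pixel coordinates in a 720x240 image.
points = [(100.0, 50.0), (220.0, 60.0), (340.0, 80.0), (150.0, 120.0),
          (400.0, 130.0), (610.0, 150.0), (300.0, 200.0), (520.0, 210.0)]

P = normalization_matrix(points)
normalized = [apply_homogeneous(P, p) for p in points]

mean_x = sum(x for x, _ in normalized) / len(normalized)
mean_y = sum(y for _, y in normalized) / len(normalized)
mean_norm = sum(math.hypot(x, y) for x, y in normalized) / len(normalized)
print(mean_x, mean_y, mean_norm)
```

After the transformation the coordinates are of the same order of magnitude as the homogeneous 1, which is the point of the normalization.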
The normalization is done by multiplying the homogeneous coordinates of the eight points with the normalization matrix P [13]:

\[ P = \begin{pmatrix} \alpha & 0 & -\alpha x_c \\ 0 & \alpha & -\alpha y_c \\ 0 & 0 & 1 \end{pmatrix} \]

where $(x_c, y_c)$ are the coordinates of the centroid of the eight points and α is the scale factor, defined by

\[ x_c = \frac{1}{8}\sum_{i=1}^{8} x_i, \qquad y_c = \frac{1}{8}\sum_{i=1}^{8} y_i, \qquad \alpha = \frac{\sqrt{2} \cdot 8}{\sum_{i=1}^{8} \sqrt{(x_i - x_c)^2 + (y_i - y_c)^2}}. \]

After normalization, the problem consists of minimizing

\[ \|M F_v\|^2 \quad \text{while} \quad \|F_v\|^2 = 1 \]

where, for a point pair with coordinates $(x_1, y_1)$ in the first image and $(x_2, y_2)$ in the second,

\[ Y = \begin{pmatrix} x_1 x_2 & x_1 y_2 & x_1 & y_1 x_2 & y_1 y_2 & y_1 & x_2 & y_2 & 1 \end{pmatrix}^T, \]

\[ M = \begin{pmatrix} Y_1^T \\ Y_2^T \\ \vdots \\ Y_8^T \end{pmatrix}, \qquad F_v = \begin{pmatrix} F_{11} & F_{21} & F_{31} & F_{12} & F_{22} & F_{32} & F_{13} & F_{23} & F_{33} \end{pmatrix}^T. \]

This is a total least squares problem and the standard solution is to choose $F_v$ as the eigenvector of $M^T M$ belonging to the smallest eigenvalue [12]. The vector $F_v$ is reshaped into a 3x3 matrix $F_{est}$ in the reverse of the order in which the matrix was stacked into a vector. To ensure that the estimated fundamental matrix has rank two, the norm

\[ \|F_{opt} - F_{est}\| \]

is minimized under the constraint

\[ \det F_{opt} = 0. \]

This problem can be solved using Singular Value Decomposition [13]. Let $F_{est} = U D V^T$, where U and V are orthogonal matrices and the diagonal matrix D consists of the singular values:

\[ D = \mathrm{diag}(r, s, t), \qquad r \geq s \geq t. \]

The solution is given by $F_{opt} = U\, \mathrm{diag}(r, s, 0)\, V^T$ [13]. Finally, the fundamental matrix is denormalized according to

\[ F = P_2^T F_{opt} P_1 \]

where $P_1$ and $P_2$ are the transformation matrices for image one and two respectively [13].

2.5.4 RANSAC

RANSAC [13], short for Random Sample Consensus, is an iterative method to estimate the parameters of a mathematical model from a set of observed data which contains outliers. In computer vision, RANSAC is often used to estimate the fundamental matrix given a set of putative point correspondences. Its advantage is its ability to give robust estimates even when there are outliers in the data set. However, there is no upper bound on computation time if RANSAC is to find the optimal parameter estimates.
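To make the general idea concrete in a simpler setting than the fundamental matrix, the following Python sketch fits a straight line y = kx + m to 2D points containing gross outliers. It is an illustrative addition, not the variant used in the thesis: the model, the function names fit_line and ransac_line, the threshold, the iteration count and the synthetic data are all assumptions made for the example.

```python
import random


def fit_line(p, q):
    """Line y = k*x + m through two points with different x-coordinates."""
    (x1, y1), (x2, y2) = p, q
    k = (y2 - y1) / (x2 - x1)
    return k, y1 - k * x1


def ransac_line(points, threshold=0.5, iterations=200, seed=0):
    """Return the line with the largest consensus set and that set."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    best_model, best_inliers = None, []
    for _ in range(iterations):
        # 1. Draw a minimal random sample (two points define a line).
        p, q = rng.sample(points, 2)
        if p[0] == q[0]:
            continue  # degenerate sample: vertical line, skip it
        # 2. Estimate the model from the sample.
        k, m = fit_line(p, q)
        # 3. Count the points that agree with the model.
        inliers = [(x, y) for x, y in points
                   if abs(y - (k * x + m)) < threshold]
        # 4. Keep the best estimate so far.
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (k, m), inliers
    return best_model, best_inliers


# Synthetic data: 20 points on y = 2x + 1 plus five gross outliers.
inlier_points = [(float(x), 2.0 * x + 1.0) for x in range(20)]
outlier_points = [(3.0, 40.0), (7.0, -15.0), (11.0, 60.0),
                  (15.0, 0.0), (18.0, 90.0)]

model, consensus = ransac_line(inlier_points + outlier_points)
print("estimated (k, m):", model)
print("size of consensus set:", len(consensus))
```

The fixed iteration count reflects the trade-off mentioned above: the loop is stopped after a chosen number of samples rather than when an optimal estimate is guaranteed.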
The use of RANSAC to estimate the fundamental matrix between two stereo images from a set of putative point correspondences is described in the following algorithm. Note that this is only one of many variants of RANSAC.

1. Choose eight random point pairs from a larger set of putative point correspondences.
2. Estimate F with the normalized eight-point algorithm using the eight point pairs.
3. If $\|F_{opt} - F_{est}\|$ (see Section 2.5.3) is below a certain threshold:
   • Evaluate the estimate F by determining the number of correspondences that agree with this fundamental matrix. If this is the best estimate so far, save it.
4. Repeat from step 1 until a maximum number of iterations has been processed.
5. Return F along with the consensus set of corresponding point pairs.

In step 3 the number of corresponding point pairs is calculated as the number of points that are close enough to their respective epipolar lines. The epipolar lines and the normalized sum of distances to the epipolar lines are defined as follows:

\[ \mathbf{l}_1 = F^T \mathbf{x}_{h2}, \qquad \mathbf{l}_2 = F \mathbf{x}_{h1}, \]

\[ d_{sum} = \frac{|\mathbf{x}_{h1}^T \mathbf{l}_1|}{\sqrt{l_{11}^2 + l_{12}^2}} + \frac{|\mathbf{x}_{h2}^T \mathbf{l}_2|}{\sqrt{l_{21}^2 + l_{22}^2}}. \]

A threshold is applied to $d_{sum}$ to separate inliers from outliers. Because of the normalization of the epipolar lines, the distance measure will be in pixels.

Chapter 3 Vehicle Detection Approaches

As mentioned in Section 1.2, the purpose of the detection step in a vision based vehicle detection system is to find ROIs. How well this is achieved depends on how performance is defined. Commonly used measures are the percentage of detected vehicles, the number of detected non-vehicles, the alignment of the found ROIs, etc. There are essentially three different approaches to detection of vehicles proposed in the literature: knowledge based, stereo based and motion based [14].

3.1 Knowledge Based Methods

The knowledge based methods all use a priori image information to extract ROIs.
Different cues have been proposed in the literature, and systems often combine two or more of these cues to make the detection more robust.

3.1.1 Color

Color information could possibly be used to distinguish vehicles from the background. Examples exist where road segmentation has been performed using the color cue [15]. This thesis investigates vehicle detection in monochrome images and therefore the color cue has not been investigated further.

3.1.2 Corners and Edges

Man-made objects like vehicles contain a high density of corners and edges compared to the background, from whichever view they are looked upon. Although corners might not always be well-defined at feasible image resolutions, the edge cue is probably the most exploited of the knowledge based approaches.

In the case of rear-view vehicle detection, a vehicle model of two vertical edges (corresponding to the left and right sides) and two horizontal edges (bottom and top) could be used. This model holds for front-view detection as well.

Sun et al. [26] describe a system that detects vehicles by computing vertical and horizontal edge profiles at three different levels of detail (see Figure 3.1). A horizontal edge candidate corresponding to the bottom of the vehicle is then combined with left and right side candidates to form ROIs. Wenner [27] uses sliding windows of different sizes to better capture local edge structures. Edges are detected within the image region delimited by the window instead of using global edge profiles for each row and column.

Figure 3.1. Left column shows images at different scales. Second and third columns show vertical and horizontal edge maps. The right column shows edge profiles used in [26].

Jung and Schramm [31] describe an interesting way of detecting rectangles in an image. By sliding a window over the image, the Hough transform of small regions is computed.
Rectangles can then be detected based on certain geometrical relations in the Hough space, e.g., that the line segments delimiting the rectangle appear in pairs and that the two pairs are separated by a 90° angle. This is shown in Figure 3.2. Using the fact that the camera and the vehicles are located on the same ground plane, i.e., the road, the model could be simplified further by not only assuming ∆θ = 90° but also locking the θ parameters to 0° and 90° for the two line pairs respectively.

Edge detection is fairly fast to compute and the detection scheme is simple to comprehend. On the downside, other man-made objects like buildings, rails, etc. can confuse a system based only on the edge cue.

Figure 3.2. Properties shown for Hough peaks corresponding to the four sides of a rectangle centered at the origin (from [31]).

3.1.3 Shadow

The fact that the area underneath a vehicle is darker than the surrounding road due to the shadow of the vehicle has been suggested as a sign pattern for vehicle detection in a number of articles [16] [17]. Although this technique can yield very good results in perfect conditions, it suffers in scenes with changing illumination.

Tzomakas et al. [18] partly overcame this problem by determining an upper threshold for the shadow based on the intensity of the free driving space (i.e., the road). After extracting the free driving space, they calculated the mean and standard deviation of a Gaussian curve fitted to the intensity values of the road area. The upper threshold was then set to µ − 3σ, where µ and σ are the road mean and standard deviation respectively. They combined the detected shadows with horizontal edge extraction to distill ROIs.

Figure 3.3. Low sun from the side misaligns the ROIs when using the shadow cue.

The most serious drawback of the shadow cue is scenes with low sun (Figure 3.3), making vehicles cast long shadows.
Hence, the detected shadows become wider when the sun is to the side, or ill-positioned when the camera is facing the sun. Even though the shadow is still darker beneath the vehicle than beside it, this is a very weak cue to use to align the ROIs. Surprisingly, no discussion of this problem was encountered during the literature study.

3.1.4 Symmetry

Symmetry is another sign of man-made objects. Vehicles include a fair amount of symmetry, especially in the rear view. Kim et al. [19] used symmetry as a complement to the shadow cue to better align the left and right side of the ROI. A disadvantage of this cue is that the free driving space is very symmetrical too, making symmetry unsuitable as a stand-alone detection cue.

3.1.5 Texture

Little research has been done on texture as an object detection cue. Kalinke et al. [20] used entropy to find ROIs. The local entropy within a window was calculated and regions with high entropy were considered as possible vehicles. They proposed energy, contrast and correlation as other possible texture cues.

3.1.6 Vehicle Lights

Vehicle lights could be used as a night time detection cue [19]. However, the vehicle lights detection scheme should only be seen as a complement to other techniques. Brighter illumination and the fact that vehicle lights are not compulsory during daytime in many countries make it unsuitable for a robust vehicle detection system.

3.2 Stereo Based Methods

Vehicle detection based on stereo vision uses either the disparity map or Inverse Perspective Mapping. The disparity map is generated by solving the correspondence problem for every pixel in the left and right image and shows the difference between the two views. From the disparity map a disparity histogram can be calculated. Since the rear view of a vehicle is a vertical surface, and the points on the surface therefore are at the same distance from the camera, it should appear as a peak in the histogram [21].
The Inverse Perspective Mapping transforms an image point onto a horizontal plane in 3D space. Zhao et al. [22] used this to transform all points in the left image onto the ground plane and reproject them back onto the right image. They then compared the result with the true right image to detect points above the ground plane as obstacles.

Since this thesis only deals with detection methods based on images from one camera, the stereo based approach has not been investigated further.

3.3 Motion Based Methods

As opposed to the previous methods, motion based approaches use temporal information to detect ROIs. The basic idea is to detect vehicles by their motion. In systems with fixed cameras (e.g., a traffic surveillance system) this is rather straightforward using background subtraction: two consecutive image frames are subtracted and the result is thresholded in order to extract moving objects. However, the problem becomes significantly more complex in an on-road system because of the ego-motion of the camera.

At least two different approaches to solve this problem are possible. The first would be to compute the dense optical flow between two consecutive frames, solving the correspondence problem for every pixel [23]. Although, in theory, it would be possible to calculate the dense optical flow and detect moving objects as areas of diverging flow (compared to the dominant background flow), this is very time consuming and not a practical solution. In addition, the aperture problem makes it impossible to estimate the motion of every pixel.

Another, more realistic approach would be to compute a sparse optical flow. This could be done by extracting distinct feature points (e.g., corners) and solving the correspondence problem for these points.
Either feature points are extracted from both image frames and point pairs are matched using normalized correlation, or feature points are extracted from one image frame and tracked to the other using, e.g., Lucas-Kanade tracking [25].

By carefully selecting feature points from the background and not from moving vehicles, the fundamental matrix describing the ego-motion of the camera could be estimated from the putative point matches using RANSAC and the eight-point algorithm. In theory, the fundamental matrix could then be used to find outliers in the set of putative point matches. Such an outlier could originate either from a false point match or from a point on a moving object that does not meet the motion constraints of the background [24].

In practice, however, very few feature points can be extracted from the homogeneous road. Vehicles, on the other hand, contain a lot of feature points, e.g., corners and edges. Therefore, the problem of choosing feature points only from the background is quite intricate.

A possible advantage of the motion based approach is the detection of partly occluded vehicles, e.g., overtaking vehicles. Such a vehicle should cause a diverging motion field even though the whole vehicle is not visible to the camera. An obvious disadvantage is that a method based solely on motion cannot detect stationary vehicles like parked cars. This is a major drawback, as stationary vehicles can cause dangerous situations if parked in the wrong place. For the same reason, slow vehicles are hard to detect using motion as the only detection cue.

A property well worth noting is that motion based methods detect all moving objects, not just vehicles. This could be an advantage as well as a disadvantage. All moving objects, such as bicycles, pedestrians, etc., could be interesting to detect. Combined with an algorithm that detects the road area this could be useful.
However, there is no easy way of distinguishing a vehicle from other moving objects without using other cues than motion.

Chapter 4
Implemented Algorithms

The purpose of the literature study was to choose 2-3 promising algorithms to implement and test on data provided by Autoliv. Preferably, they could also be modified in order to improve performance. As described in Chapter 3, two fundamentally different approaches were possible. On one hand, the knowledge based approach, using a priori information from the images. On the other hand, the motion based approach, using the fact that moving vehicles move differently than the background in an image sequence.

After weighing the pros and cons of each method studied in the literature, two knowledge based approaches and one motion based approach were chosen. These algorithms have all been implemented in MATLAB and are described in detail in this chapter. All of them have been modified in different ways to improve performance. First, however, the calculation of the vanishing point will be explained.

4.1 Calculation of the Vanishing Point

All the implemented algorithms need the y-coordinate of the vanishing point (Section 2.1) to calculate a distance measure from the camera to a vehicle and to determine size constraints for a vehicle based on its location in the image. In this implementation a static estimate of the vanishing point is calculated based on the pitch angle of the camera, determined during calibration. Since only the y-coordinate is of interest for the distance calculation, the x-coordinate is not estimated.

The test data have been captured with a CCD camera with a field of view (FOV) in the x-direction of 48°. The image size is 720x240 pixels. The ratio between the height and width of one pixel can be calculated from the known intrinsic parameters fx and fy.
From these data, the y-coordinate of the vanishing point is calculated as:

$$f_x = \frac{f}{s_x}, \qquad f_y = \frac{f}{s_y}$$

$$FOV_y = \frac{240\, s_y}{720\, s_x}\, FOV_x = \frac{240\, f_x}{720\, f_y}\, FOV_x$$

$$y_{vp} = \frac{240}{2} + \frac{\alpha}{FOV_y}\, 240$$

where $f$ is the focal length, $s_x$ and $s_y$ are the pixel width and height respectively, and $\alpha$ is the pitch angle defined positive downwards.

4.2 Distance to Vehicle and Size Constraint

To determine the allowed width interval in pixels of a bottom candidate at vertical distance $\Delta y$ from the vanishing point, the angle this distance corresponds to in the camera is calculated as

$$\beta = \frac{\Delta y}{240}\, FOV_y.$$

Assuming that the road is flat and that the vehicles are located on the road, the distance to the vehicle in meters can be calculated as

$$l = H_{cam} / \tan\beta$$

where $H_{cam}$ is the height of the camera above the ground in meters. Finally, assuming that the vehicle is located on the same longitudinal axis as the ego vehicle, the width in pixels of a vehicle is determined as

$$w = 720\, \frac{2 \arctan\!\left(\frac{W/2}{l}\right)}{FOV_x}$$

where $W$ is the width in meters of a vehicle. The upper and lower bounds for vehicle width in meters generate an interval in pixels of allowed vehicle widths.

4.3 Vehicle Model

A vehicle is assumed to be less than 2.6 m in width. This is the maximum allowed width for a vehicle in Sweden [29] and many other countries have approximately the same regulation. A lower limit of 1.0 m is also set. In the same way, vehicles are assumed to be between 1.0 m and 2.0 m in height. Note that the upper limit is set low to avoid unnecessary false detections since truck detection is not prioritized.

The bottom edge is assumed to be a light-to-dark edge looking bottom-up. In the same way the left edge is assumed to be a light-to-dark edge and the right edge to be a dark-to-light edge looking left-right. This is true because the algorithm only looks for the edges generated from the transition between road and wheels when finding the side edges.
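The geometry of Section 4.2 can be illustrated with a short Python sketch (the thesis implementation itself was written in MATLAB). It maps a vertical pixel offset ∆y below the vanishing point to a distance and to an allowed width interval in pixels, using the 1.0 m and 2.6 m limits of the vehicle model. The function names are hypothetical, and any camera height or FOV_y value passed in is an assumption of the caller, since those values are not stated in this chapter.

```python
import math

# Values taken from the text: 720x240 image, 48 degree horizontal FOV,
# vehicle widths between 1.0 m and 2.6 m.
IMAGE_WIDTH_PX = 720
IMAGE_HEIGHT_PX = 240
FOV_X_RAD = math.radians(48.0)
MIN_VEHICLE_WIDTH_M = 1.0
MAX_VEHICLE_WIDTH_M = 2.6


def vertical_fov(fx: float, fy: float) -> float:
    """FOV_y = (240 * fx) / (720 * fy) * FOV_x, in radians."""
    return (IMAGE_HEIGHT_PX * fx) / (IMAGE_WIDTH_PX * fy) * FOV_X_RAD


def distance_to_vehicle(delta_y: float, fov_y: float, cam_height_m: float) -> float:
    """Distance l = H_cam / tan(beta), with beta = delta_y / 240 * FOV_y.

    Assumes a flat road and a bottom edge strictly below the vanishing point.
    """
    if delta_y <= 0:
        raise ValueError("delta_y must be positive (below the vanishing point)")
    beta = delta_y / IMAGE_HEIGHT_PX * fov_y
    return cam_height_m / math.tan(beta)


def width_in_pixels(width_m: float, distance_m: float) -> float:
    """w = 720 * 2 * arctan((W / 2) / l) / FOV_x."""
    return IMAGE_WIDTH_PX * 2.0 * math.atan(width_m / (2.0 * distance_m)) / FOV_X_RAD


def allowed_width_interval(delta_y: float, fov_y: float, cam_height_m: float):
    """Pixel interval (lower, upper) of allowed widths for a bottom candidate."""
    distance = distance_to_vehicle(delta_y, fov_y, cam_height_m)
    return (
        width_in_pixels(MIN_VEHICLE_WIDTH_M, distance),
        width_in_pixels(MAX_VEHICLE_WIDTH_M, distance),
    )
```

A bottom candidate would then be kept only if its width in pixels lies inside the interval returned by `allowed_width_interval` for its row.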
4.4 Edge Based Detection

This method uses a multiscale edge detection technique to perform rear-view vehicle detection. It is inspired by the work done by Sun et al. [26] and Wenner [27] but has been modified to improve performance in the current application. The basic idea is that a rear-view vehicle is detected as two horizontal lines corresponding to its bottom and top and two vertical lines corresponding to its left and right side. ROIs not satisfying constraints on size based on the distance from the camera are rejected. The same goes for ROIs that are asymmetric around a vertical line in the middle of the ROI and ROIs with too small variance. The method uses both a coarse and a fine scale to improve robustness. Figure 4.1 shows different steps of the edge based detection scheme.

4.4.1 Method Outline

Edges, both vertical and horizontal, are extracted by convolving the original image with Sobel operators (Section 2.2). Edges a certain distance above the horizon are not interesting as part of a vehicle, and to save computation time that area is ignored.

Candidates for vehicle bottoms are found by sliding a window of 1xN pixels over the horizontal edge image, adding up the values inside the window. Local minima are found for each column and window size, and candidates beneath a certain threshold, corresponding to a vehicle darker than the road, are kept. Candidates not meeting the constraints on width described in Section 4.2 are rejected.

Next, the algorithm tries to find corresponding left and right sides to the bottom candidates. Upon each bottom candidate a column summation of the vertical edge image is performed with a window size of ⌊w_bottom/8⌋x1 pixels. Left sides, defined as light-to-dark edges, are searched for close to the left side of the bottom candidate and right sides, defined as dark-to-light edges, are assumed to lie to the right of the bottom candidate.
Each combination of left and right sides is saved along with the corresponding bottom as a candidate. A bottom candidate without detected left or right sides is discarded.

Vehicle top edges, defined as any kind of horizontal edge of a certain magnitude, were extracted in the same process as the bottom edges. Now, each candidate is matched against all vehicle top edges to complete the ROI rectangles. Using geometry, a height interval in pixels is determined in which a top edge must be found in order to keep the candidate. This height is calculated as

$$\theta_1 = \arctan\frac{H - H_{cam}}{l}$$

$$\theta_2 = \arctan\frac{H_{cam}}{l}$$

$$h = 240\, \frac{\theta_1 + \theta_2}{FOV_y}$$

where one upper and one lower bound on the height $H$ in meters of a vehicle give an interval of the vehicle height $h$ in pixels. The parameter $l$ is the distance from the camera to the vehicle as derived in Section 4.2. The highest located top edge within this interval completes the ROI. If no top edge is found within the interval, the candidate is discarded. Another scale test is performed to reject candidates not meeting the width constraint criteria.

Figure 4.1. The edge based detection scheme. (a) Bottom candidates meeting size constraints described in Section 4.2. (b) Candidates with detected left and right sides. (c) Candidates with detected top edges. (d) Asymmetric, homogeneous and redundant candidates rejected.

The ROIs are now checked against symmetry and variance constraints to discard more false detections. To calculate an asymmetry measure, the right half of the ROI is flipped and the median of the squared differences between the left half and the flipped right half is determined. An upper bound threshold on the asymmetry decides whether or not to discard a ROI. Likewise, a ROI is discarded if the variance over rows is below a certain threshold. The image intensity is summed up row-wise within the ROI and the variance is calculated on these row sums.
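The symmetry and variance tests described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the ROI is given as a list of pixel rows, the middle column is ignored when the ROI width is odd, the population variance is used, and the function names and threshold parameters are hypothetical rather than taken from the thesis.

```python
from statistics import median


def asymmetry(roi):
    """Median of squared differences between the left half of the ROI
    and the horizontally flipped right half.

    roi is a list of rows, each a list of intensities of equal length (>= 2).
    """
    width = len(roi[0])
    half = width // 2  # assumption: the middle column is skipped for odd widths
    squared_diffs = []
    for row in roi:
        left = row[:half]
        right_flipped = row[width - half:][::-1]
        squared_diffs.extend((a - b) ** 2 for a, b in zip(left, right_flipped))
    return median(squared_diffs)


def row_sum_variance(roi):
    """Variance of the row-wise intensity sums (population variance assumed)."""
    sums = [sum(row) for row in roi]
    mean_sum = sum(sums) / len(sums)
    return sum((s - mean_sum) ** 2 for s in sums) / len(sums)


def keep_roi(roi, max_asymmetry, min_variance):
    """Keep a ROI only if it is symmetric enough and not too homogeneous."""
    return asymmetry(roi) <= max_asymmetry and row_sum_variance(roi) >= min_variance
```

Using the median instead of the mean makes the asymmetry measure less sensitive to a few strongly deviating pixels, such as a license plate that is not centered.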
Because of this constraint, homogeneous candidates covering, e.g., the road are likely to be rejected. Finally, small ROIs within larger ones are rejected to get rid of ROIs covering, e.g., vehicle windows.

4.5 Shadow Based Detection

The shadow based detection algorithm implemented is based on the work by Tzomakas et al. [18]. The free driving space is detected and local means and standard deviations of the road are calculated. An upper shadow threshold of µ − 3σ is applied and the result is combined with a horizontal edge detection to create vehicle bottom candidates. A fixed aspect ratio is used to complete the ROIs. Figure 4.2 shows different steps of the shadow based detection scheme.

4.5.1 Method Outline

The free driving space is first estimated as the lowest central homogeneous region in the image delimited by edges. This is done by detecting edges using the Canny detector (Section 2.2) and then adding pixels to the free driving space in a bottom-up scheme until the first edge is encountered. The y-coordinate of the vanishing point is used as an upper bound of the free driving space in case no edges are present. The image is cropped at the bottom to discard edges on the ego-vehicle and on the sides to prevent non-road regions from being included in the free driving space estimate. Figures 4.3 and 4.4 show a couple of examples.

Unlike the algorithm by Tzomakas et al., this algorithm estimates a local mean and standard deviation for each row of the free driving space. This is to better capture the local road intensity. As seen in Figure 4.4(c), problems occur when vehicles, road markings or shadows occlude parts of the road, making it impossible, with this algorithm, to detect road points in front of these obstacles. To deal with this problem, the mean is extrapolated using linear regression to rows where no road pixels exist to average over. For these rows the standard deviation of the closest row estimate is used.
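The row-wise road statistics, the extrapolation to rows without road pixels and the resulting µ − 3σ shadow threshold can be sketched in Python as below. This is a minimal sketch under assumptions: the free driving space is given as a boolean mask, sample means and population standard deviations are used, the regression line is fitted to all rows that have road pixels, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def row_road_stats(image, road_mask):
    """Per-row (mean, std) of the road pixels; None for rows without road pixels."""
    stats = []
    for img_row, mask_row in zip(image, road_mask):
        values = [v for v, is_road in zip(img_row, mask_row) if is_road]
        stats.append((mean(values), pstdev(values)) if values else None)
    return stats


def fill_missing_rows(stats):
    """Extrapolate the mean by a least-squares line over the known rows and
    reuse the standard deviation of the closest known row."""
    known = [(r, s[0]) for r, s in enumerate(stats) if s is not None]
    if not known:
        raise ValueError("no row contains road pixels")
    n = len(known)
    mean_row = sum(r for r, _ in known) / n
    mean_mu = sum(mu for _, mu in known) / n
    sxx = sum((r - mean_row) ** 2 for r, _ in known)
    sxy = sum((r - mean_row) * (mu - mean_mu) for r, mu in known)
    slope = sxy / sxx if sxx else 0.0
    filled = []
    for r, s in enumerate(stats):
        if s is None:
            closest = min(known, key=lambda k: abs(k[0] - r))[0]
            filled.append((mean_mu + slope * (r - mean_row), stats[closest][1]))
        else:
            filled.append(s)
    return filled


def shadow_thresholds(stats):
    """Upper shadow threshold mu - 3 * sigma for each row."""
    return [mu - 3.0 * sigma for mu, sigma in stats]


def shadow_mask(image, thresholds):
    """Mark pixels darker than the threshold of their row as shadow."""
    return [[v < t for v in row] for row, t in zip(image, thresholds)]
```

In this sketch a row without road pixels inherits its spread from the nearest row with road pixels, while its mean follows the fitted intensity trend along the road.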
The image is thresholded below the horizon using an upper bound on the shadow. This threshold is calculated as µ − 3σ, where µ and σ are the road mean and standard deviation for the row including the current image point. Horizontal edges are extracted by simply subtracting a row-shifted copy from the original image and thresholding the result. By an AND-operation, interesting regions are extracted as regions classified as both shadows and light-to-dark (counting bottom-up) horizontal edges.

Figure 4.2. The shadow based detection scheme. (a) The free driving space. (b) Thresholded image to extract shadows. (c) Regions classified as both shadows and light-to-dark, horizontal edges. (d) Final vehicle candidates.

Figure 4.3. Detection of the free driving space. (a) Initial image. (b) Edge map created with the Canny detector. (c) Detected road area.

Figure 4.4. Detection of the free driving space. (a) Initial image. (b) Edge map created with the Canny detector. (c) Detected road area.

After morphological closing, to remove small holes in the shadows, and opening, to remove noise, segmentation is performed. Each candidate's bottom position is then chosen as the row with the most points belonging to the shadow region. The left and right borders are placed at the leftmost and rightmost points of the candidate's bottom row and a fixed aspect ratio decides the height of the ROI.

The ROIs are checked against size constraints as described in Section 4.2. In the same way as in the edge based detection scheme, asymmetric ROIs are discarded.

4.6 Motion Based Detection

The idea of the following algorithm (inspired by the work done by Yamaguchi et al. [24]) was to look at two consecutive image frames as a stereo view. This can be done as long as the vehicle is moving and two image frames taken at different points in time capture the same scene from two different positions.
The method outline was initially the following:

• Extract a number of feature points in each image using the Harris operator (Section 2.3.1).
• Determine a set of point correspondences between the two frames using either Lucas-Kanade tracking (Section 2.3.3) or normalized correlation (Section 2.3.2).
• Estimate the fundamental matrix from background point correspondences using RANSAC and the eight-point algorithm (Section 2.5).
• Use the epipolar constraint to detect outliers as points on moving objects or falsely matched point correspondences.
• Create ROIs from these outliers.

However, a major difficulty turned out to be the problem of detecting points on vehicles as outliers. Consider two image frames from a mono system where the camera translates forward. The epipole will equal the vanishing point in such a system [13]. This implies that all epipolar lines will pass through the vanishing point on the horizon. Therefore, points on vehicles located on the road will translate along their corresponding epipolar lines, either towards or away from the vanishing point. Since the epipolar constraint can only be used to reject points away from their epipolar lines as outliers, and not to confirm points close to their epipolar lines as inliers, it will be impossible to detect points on moving vehicles on the road as outliers. Figure 4.5 illustrates the problem.

This is a major issue, though not encountered in the literature. Instead, a couple of assumptions were made in order to modify the motion based algorithm to detect certain vehicles:

• Points on vehicles are used when estimating the fundamental matrix. Since they are moving along their epipolar lines, their direction is consistent with the epipolar constraint for the background motion. This means that the significant problem of deciding which points to base the estimate on vanishes.
• The vehicle on which the camera is mounted is assumed to move forward.
Thus, the background can be assumed to move towards the camera and points moving away from the camera can be detected either as points on overtaking vehicles or as mismatched points.

Figure 4.5. Two consecutive image frames, matched point pairs and their epipolar lines. Points on vehicles move along their epipolar lines.

Since this method cannot detect vehicles moving towards the camera or staying at the same distance from it, it is solely to be seen as a complementary algorithm. It has the potential to complement the other two algorithms, especially in the case of overtaking vehicles not yet fully visible in the images. However, the alignment of the ROIs cannot be expected to be very precise, since only a limited number of points are available to base each ROI upon. If points moving towards the vanishing point are not detected on each of the four edges delimiting the vehicle, the ROI will be misaligned.

4.6.1 Method Outline

The implemented algorithm uses two image frames separated by 1/15 sec to detect vehicles. By convolving the current image frame with Sobel kernels, edge maps are obtained. These edge maps are used to extract feature points with the Harris operator. Since feature points tend to appear in clusters, the image is divided horizontally into three equal subimages from which an equal number of feature points are extracted, as long as their Harris response meets a minimum threshold.

The extracted feature points are tracked from the current image frame to the previous one using Lucas-Kanade tracking. All the putative point correspondences are then used to estimate the fundamental matrix using the eight-point algorithm and RANSAC. Points moving towards the vanishing point are detected and sorted based on their horizontal position. The points are then clustered into groups depending on their position in the image and their velocity towards the vanishing point.
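Two of the tests used in this outline can be sketched in Python. The first is the normalized epipolar distance d_sum from the RANSAC description (Section 2.5), which decides whether a point pair agrees with an estimated F. The second is only a guess at one simple criterion for "moving towards the vanishing point", namely that the distance to the vanishing point shrinks by at least a minimum number of pixels between the frames; the thesis does not state the exact test. All function names and the `min_shift_px` parameter are hypothetical.

```python
import math


def _mat_vec(m, v):
    """Product of a 3x3 matrix (list of rows) and a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def _transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]


def epipolar_distance_sum(F, x1, x2):
    """d_sum for a point pair: x1 = (x, y) in image 1, x2 = (x, y) in image 2.

    l1 = F^T x_h2 is the epipolar line in image 1, l2 = F x_h1 the line in
    image 2. Each distance is normalized by the first two line coefficients,
    so the result is in pixels. Assumes neither line is the line at infinity.
    """
    xh1 = [x1[0], x1[1], 1.0]
    xh2 = [x2[0], x2[1], 1.0]
    l1 = _mat_vec(_transpose(F), xh2)
    l2 = _mat_vec(F, xh1)
    d1 = abs(sum(a * b for a, b in zip(xh1, l1))) / math.hypot(l1[0], l1[1])
    d2 = abs(sum(a * b for a, b in zip(xh2, l2))) / math.hypot(l2[0], l2[1])
    return d1 + d2


def moves_towards(point_prev, point_curr, vanishing_point, min_shift_px=1.0):
    """True if the point got closer to the vanishing point by at least
    min_shift_px between the previous and the current frame (assumed test)."""
    d_prev = math.hypot(point_prev[0] - vanishing_point[0],
                        point_prev[1] - vanishing_point[1])
    d_curr = math.hypot(point_curr[0] - vanishing_point[0],
                        point_curr[1] - vanishing_point[1])
    return d_prev - d_curr >= min_shift_px
```

For a purely forward-translating camera, a point pair lying on the same line through the epipole gives a d_sum of zero regardless of its direction of motion along that line, which is why the second, direction-based test is needed at all.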
ROIs not meeting the scale constraints are discarded, as in the other two algorithms. Since this algorithm focuses on finding vehicles in difficult side poses, no symmetry constraint is applied.

4.7 Complexity

Using MATLAB execution time to judge an algorithm's complexity is risky and difficult. To make the analysis more accurate, data structures have been preallocated where possible to avoid time consuming assignments. The algorithms have been optimized to some degree to lower computation time while generating detections in the many tests. However, a lot more could be done. Of course, a real-time implementation would need to be written in another language, e.g., C++. Some comments can be made based on the MATLAB code, which has been run on an Intel Pentium 4 CPU at 3.20 GHz.

In the edge based algorithm, the most time consuming operation is the nested loop finding horizontal candidates for vehicle bottoms and tops. Although it consists of fairly simple operations, the fact that both window size and location in the image are varied makes it expensive to compute. It accounts for around 30% of the total computation time in MATLAB. The rest of the edge based detection scheme consists of simple operations which are fast to compute. The execution time of this part of the program is mainly governed by the number of horizontal candidates found by the nested loop described above. In MATLAB the algorithm operates at roughly 1 Hz.

The shadow based detection is faster than the edge based. The operation standing out as time consuming is the Canny edge detection used to extract the free driving space. Since the shadow based detection only creates one ROI per shadow, the number of ROIs is always kept low and therefore operations on the whole set of ROIs, e.g., checking them against size constraints, are not expensive to compute. The algorithm can handle a frame rate of 2-3 Hz in MATLAB.
The motion based detection scheme is without doubt the most time consuming algorithm. The 2D interpolation used during the Lucas-Kanade tracking, the 2D local maxima extraction from the Harris response and the Singular Value Decomposition in the eight-point algorithm are all time consuming steps in this algorithm. Since the Lucas-Kanade tracking is the most time consuming function of the motion based detection algorithm, the computation time is largely dependent on how many feature points meet the Harris threshold. Also, the maximum number of iterations for the tracking is another critical parameter, since a larger number of iterations implies more interpolation. In general, the algorithm runs at 0.1-0.25 Hz in MATLAB.

Chapter 5
Results and Evaluation

The test data has been captured by Autoliv with a stereo vision setup consisting of two CCD cameras with a FOV in the x-direction of 48° and with a frame rate of 30 Hz. The image frames are 720x240 pixels where every second row has been removed to avoid distortion due to effects from interlacing. Only the information from the left camera is used to evaluate the mono vision algorithms implemented in this thesis.

The data used to tune the algorithm parameters have been separated from the validation data. The algorithm tuning has been done manually due to time limitations.

In order to be able to evaluate algorithm performance, staff at Autoliv have manually marked vehicles, pedestrians, etc. in the test data. On a frame by frame basis, each vehicle has been marked with an optimal bounding box surrounding the vehicle, called a reference marking. Detection results from the algorithms have been compared against these vehicle reference markings to produce performance statistics. The following definitions have been used when analyzing the performance of the algorithms:

• A PD (Positive Detection) is a vehicle detection that matches a reference marking better than the minimum requirements.
If several detections match the same reference, only the best match is a PD.
• A ND (Negative Detection) is a reference marking not overlapped by a detection at all.
• An OD (Object Detection) is a reported detection that does not overlap a reference at all.
• The PD rate is the ratio between the number of PDs and the number of reference markings.

The minimum requirements for a detection to be classified as a PD are that the left and right edges differ by less than 30% of the reference marking width from the true position (based on the reference marking). The top edge is allowed to differ by 50% of the reference marking height from its true position and the same limit for the bottom edge is 30%. The constraint on the top edge is more lenient because the top edge is less important than the other edges for a number of purposes, e.g., measuring the distance to the vehicle.

5.1 Tests on Edge Based and Shadow Based Detection

A test data set of 70 files of different scenes, including a total of 17339 image frames, has been used. The data include motorway as well as city scenes from different locations around Europe. Different weather conditions such as sunny, cloudy, foggy, etc. are all represented. Table 5.1 shows a summary of the test data.

                         # of files
Type of scene
  Motorway               54
  City                   16
Conditions
  Fine (sun/overcast)    48
  Low sun                15
  Fog                    3
  Snow                   1
  Tunnel                 3
Other
  Trucks, etc.           11

Table 5.1. Summary of the 70 test data files.

Occluded vehicles and vehicles overlapping each other have been left out of the analysis. Tests have been performed on vehicles at different distances from the ego-vehicle. Vehicles closer than 5 m have been disregarded since they are impossible to detect with these two algorithms: the bottom edge is simply not visible in the image at such a close distance. Tests on vehicle reference markings up to 30 m, 50 m and 100 m have been done. In addition, two different sets of poses have been evaluated.
The first set only includes vehicles seen straight from behind or from the front. The other allows the vehicles to be seen from an angle showing the rear or front along with one side of the vehicle, i.e., front-left, front-right, rear-left or rear-right.

Although detection of trucks has not been prioritized, the test data includes trucks as seen in Table 5.1. The evaluation tool does not distinguish between cars and trucks and therefore the PD rate of the different tests could probably increase further if such a separation were made.

The total number of vehicle reference markings in each test is displayed in Table 5.2.

                                                                      Max distance [m]
Poses                                                                 30       50       100
Front, Rear                                                           6456     12560    17524
Front, Rear, Front-left, Front-right, Rear-right, Rear-left           13483    23162    29023

Table 5.2. The number of vehicle reference markings in the different test scenarios.

5.1.1 Edge Based Detection Tests

Table 5.3 shows a summary of the PD rate obtained in the different tests. As seen, this detection scheme is very good at detecting vehicles from side poses. In fact, the PD rate is higher on the set of vehicle poses including the side poses than on the set only consisting of rear and front views.

                                                                      Max distance [m]
Poses                                                                 30       50       100
Front, Rear                                                           89.9%    85.1%    75.1%
Front, Rear, Front-left, Front-right, Rear-right, Rear-left           92.2%    86.1%    78.6%

Table 5.3. PD rate for the edge based detection tests.

Figure 5.1 is taken from the test with vehicles up to 100 m of all poses and shows histograms of how the borders of the ROIs differ between detections and references for PDs. The ratio along the x-axis is defined as the position difference between the detection and reference border line divided by the size of the reference marking (width for left and right border lines and height for bottom and top border lines).
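The border difference ratio defined above, together with the minimum PD requirements stated at the start of this chapter, can be sketched in Python as follows. The box representation (left, top, right, bottom in pixels, with y increasing downwards), the sign convention (positive where the detection extends beyond the reference) and all names are assumptions made for this illustration; the actual evaluation tool is not described in detail in the text. Selecting the best match among several detections is not covered.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixels; y is assumed to increase downwards."""
    left: float
    top: float
    right: float
    bottom: float


def border_ratios(det: Box, ref: Box) -> dict:
    """Signed border differences normalized by the reference size.

    Left and right use the reference width, top and bottom the reference
    height. A positive value means the detection extends beyond the reference.
    """
    width = ref.right - ref.left
    height = ref.bottom - ref.top
    return {
        "left": (ref.left - det.left) / width,
        "right": (det.right - ref.right) / width,
        "top": (ref.top - det.top) / height,
        "bottom": (det.bottom - ref.bottom) / height,
    }


def meets_pd_requirements(det: Box, ref: Box,
                          side_limit: float = 0.30,
                          top_limit: float = 0.50,
                          bottom_limit: float = 0.30) -> bool:
    """Check the minimum requirements: sides within 30% of the width,
    top within 50% and bottom within 30% of the height."""
    r = border_ratios(det, ref)
    return (abs(r["left"]) < side_limit
            and abs(r["right"]) < side_limit
            and abs(r["top"]) < top_limit
            and abs(r["bottom"]) < bottom_limit)
```

With this convention, a detection whose bottom border lies below the reference bottom, for instance on the vehicle shadow, gets a positive bottom ratio.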
A negative ratio corresponds to a detection smaller than the reference, while a positive ratio indicates that the detection is larger than the reference at that specific border.

As seen, the left and right borders are very well positioned. The histogram of the bottom border line does have a significant peak, but the border is overestimated with an offset of 10%. This is partly explained by the fact that the vehicle bottoms have been marked where the vehicle wheels meet the road, while the algorithm often detects the edge arising from the vehicle shadow situated a few pixels further down. This must be considered if a distance measure based on the vehicle bottom is to be implemented. If the vehicle bottom coordinate is overestimated, the distance will be underestimated.

Figure 5.1. Difference ratio for PDs normalized against the number of reference markings. Test run on the edge based algorithm on vehicles up to 100 m of all poses.

The top edge is the border line with most uncertainty. This comes as no surprise since this edge has been chosen as the highest positioned edge in a certain height interval above the bottom border line. In the case of a car, this edge will therefore more likely overestimate the vehicle height than underestimate it. A truck, however, does not fit into the height interval used and is therefore only detected if it contains an edge (somewhere in the middle of the rear view) within the interval that can be taken as the top edge. The chosen edge will underestimate the vehicle height and this is one reason why the histogram is so widely spread.

Another interesting graph is shown in Figure 5.2, where the PD rate has been plotted against distance to vehicles for the test using all poses. This distance has been calculated with stereo view information during the reference marking process by staff at Autoliv. The PD rate is clearly dependent on the distance to the vehicle.
Obviously, smaller objects, possibly just a few pixels in size, are harder to detect.

To give some perspective on the number of false detections: the edge based algorithm produced 36.5 detections per frame on average, but only 12.1 of these were ODs, i.e., detections not overlapping a reference at all. The edge based algorithm typically generated multiple detections per vehicle, though only the best was classified as a PD. Many of the ODs arise from railings on the side of the road. These are mistakenly detected as vehicles as they contain all four needed edges and also a fair amount of symmetry. Many of the references not detected well enough by this algorithm are trucks.

Figure 5.2. PD rate at different distances [m]. Test run on the edge based algorithm on vehicles up to 100 m of all poses.

Another scenario where the algorithm suffers is in scenes with large variations in illumination, e.g., driving into or out of a tunnel. This, however, is not a drawback of the algorithm. Instead it is a consequence of the exposure control used by the camera.

5.1.2 Shadow Based Detection Tests

The PD rate summary is shown in Table 5.4. As opposed to the edge based detection, the shadow based detection is more sensitive to the vehicle poses. The PD rate is also lower than that of the edge based detection in all but one test case: front and rear detection up to 100 m.

Poses                                                          Max distance [m]
                                                               30       50       100
Front, Rear                                                    86.1%    82.0%    75.6%
Front, Rear, Front-left, Front-right, Rear-right, Rear-left    79.2%    75.0%    71.3%

Table 5.4. PD rate for the shadow based detection tests.

An interesting observation can be made from Figure 5.3, taken from the test with vehicles up to 100 m of all poses. The top border line is better aligned with this algorithm than with the edge based, even though the shadow based algorithm uses a constant aspect ratio between the height and width of the ROI.
The bottom border line histogram, however, shows the same tendency as with the edge based algorithm, i.e., a slightly overestimated border line coordinate. Even though the PD rate drops as the distance to the vehicles increases, Figure 5.4 does in fact show that the shadow based algorithm is better at maintaining its detection rate at larger distances.

Figure 5.3. Difference ratio for PDs normalized against the number of reference markings. Test run on the shadow based algorithm on vehicles up to 100 m of all poses.

The shadow based algorithm is in general more sensitive to different problem conditions than the edge based. As mentioned before, low sun is a particularly difficult scenario for the shadow based algorithm, misaligning the ROIs. In addition, edges on the road from road markings, different layers of asphalt, etc. will confuse the free driving space detection. Furthermore, the alignment of the ROIs is less stable over time than with the edge based algorithm. The reason is that the edge based algorithm searches for left and right edges to delimit the ROI, something the shadow based algorithm does not do.

Figure 5.4. PD rate at different distances [m]. Test run on the shadow based algorithm on vehicles up to 100 m of all poses.

The shadow based algorithm, more restrictive in its way of generating ROIs per vehicle, had on average 10.2 detections per frame, while the number of ODs per frame was as low as 4.9. This is an advantage compared to the edge based algorithm: fewer false detections mean less work for a classifier. Also, in a hardware implementation the total number of detections per frame will be an important figure to consider when designing the system. All ROIs, vehicles as well as non-vehicles, will have to be stored and processed by the classifier.
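The detection counts quoted in Sections 5.1.1 and 5.1.2 rest on comparing each detection with the reference markings. A minimal sketch of such a per-frame count is given below. It assumes boxes given as (left, top, right, bottom) tuples in pixel coordinates, a single tolerance tol applied to all four border ratios, and at most one PD counted per reference. The function names and the tolerance value are hypothetical; the actual criteria are those of the evaluation tool and are not reproduced here.

```python
def overlaps(a, b):
    """True if two (left, top, right, bottom) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def max_border_ratio(det, ref):
    """Largest absolute border offset, relative to the reference size."""
    width, height = ref[2] - ref[0], ref[3] - ref[1]
    return max(
        abs(det[0] - ref[0]) / width,
        abs(det[2] - ref[2]) / width,
        abs(det[1] - ref[1]) / height,
        abs(det[3] - ref[3]) / height,
    )


def count_pd_and_od(detections, references, tol):
    """Count PDs (at most one per reference) and ODs for one frame."""
    pd_count = sum(
        1
        for ref in references
        if any(max_border_ratio(det, ref) <= tol for det in detections)
    )
    od_count = sum(
        1
        for det in detections
        if not any(overlaps(det, ref) for ref in references)
    )
    return pd_count, od_count


references = [(100, 100, 200, 180)]
detections = [
    (102, 98, 205, 186),   # close match
    (95, 90, 210, 190),    # looser match on the same vehicle
    (400, 100, 450, 140),  # e.g. a railing, no overlap with any reference
]
pd_count, od_count = count_pd_and_od(detections, references, tol=0.15)
```

In this made-up frame the three detections yield one PD and one OD. The second detection on the same vehicle counts as neither, which is why the number of detections per frame can be much larger than the number of ODs per frame.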
5.1.3 Tests on Combining Edge Based and Shadow Based Detection

To investigate how well the algorithms complemented each other, the detections from the edge and shadow based algorithms were simply merged. The results in Table 5.5 show a significant increase in PD rate, especially at larger distances. Combining the detections from the two algorithms gave 46.7 detections per frame on average and 17.0 ODs per frame.

Poses                                                          Max distance [m]
                                                               30       50       100
Front, Rear                                                    96.2%    94.3%    88.7%
Front, Rear, Front-left, Front-right, Rear-right, Rear-left    96.4%    93.1%    88.9%

Table 5.5. PD rate for the tests on combining edge and shadow detection.

Figure 5.5. Difference ratio for PDs normalized against the number of reference markings. Test run on the edge and shadow based algorithm combined on vehicles up to 100 m of all poses.

Figure 5.5 illustrates the difference ratio between detections and reference markings for each border line of the positive detections. The detection rate is improved significantly at distances above 30 m compared to the edge based detection alone, as shown in Figure 5.6.

Figure 5.6. PD rate at different distances [m]. Test run on the edge and shadow based algorithm combined on vehicles up to 100 m of all poses.

5.2 Tests on Motion Based Detection

The motion based algorithm was only tested as a complementary algorithm on difficult test cases where the simpler and faster algorithms would fail. Therefore a single test was performed on overtaking vehicles not yet fully visible in the images. This was done by using the evaluation tool to select occluded vehicle reference markings at a distance of 0 to 30 m in rear-left, rear-right and side poses from a subset of 15 files (4202 frames and 883 vehicle reference markings) with overtaking vehicles from the original test data set.
All border lines were, as opposed to the other tests, allowed to differ 50% from the reference marking position. The reason was the difficulty of aligning the left and bottom borders when the vehicles were not fully visible. The resulting PD rate was 35.0% with 8.2 detections per frame, out of which 2.9 were ODs.

Figure 5.7. Difference ratio for PDs normalized against the number of reference markings. Test run on the motion based algorithm on occluded vehicles up to 30 m.

Figure 5.7 shows the alignment of the four borders of the ROIs. The top and right borders are well aligned. The left and bottom border lines are almost exclusively underestimated. This is a consequence of the vehicles not being fully visible in the image: their left or bottom (or both) reference border line has been set to the leftmost or lowermost coordinate in the image, and thus that border line cannot be overestimated.

Chapter 6
Conclusions

Three algorithms for vehicle detection have been implemented and tested. The work has been focused on detecting fully visible rear-view cars and SUVs during daytime. However, the algorithms have been able to detect vehicles in other poses as well. To the largest possible extent the algorithms have been designed to detect vehicles in various weather conditions and at any distance.

Out of the implemented algorithms there is no doubt that the motion based is the low achiever and also the slowest implementation. The largest difference from the article [24] that has inspired this algorithm is the acknowledgement of the problem of vehicle points moving along their epipolar lines. The algorithm was therefore modified to only detect overtaking vehicles. Even if improvements are made concerning feature extraction, point matching and estimation of the fundamental matrix, the core problem still remains: points on vehicles move along their epipolar lines and are therefore impossible to reject as outliers.
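This core problem can be illustrated numerically. The sketch below assumes a pinhole camera with a made-up calibration matrix K, a camera that translates straight forward without rotating, and the standard relation F = K^(-T) [t]x K^(-1) for a pure translation t. None of the values are taken from the thesis. It indicates that a point on a vehicle moving parallel to the camera satisfies the epipolar constraint just like a static point, whereas a laterally moving point does not.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]x such that skew(t) @ x == np.cross(t, x)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def project(K, X):
    """Project a 3D point in camera coordinates to homogeneous pixel coordinates."""
    x = K @ X
    return x / x[2]


# Hypothetical calibration: focal length 800 px, principal point (320, 240).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# The camera moves 1 m forward, so static points move by t in camera coordinates.
t = np.array([0.0, 0.0, -1.0])
K_inv = np.linalg.inv(K)
F = K_inv.T @ skew(t) @ K_inv  # fundamental matrix for a pure translation


def epipolar_residual(X_before, object_motion):
    """x2^T F x1 for a point that additionally moves by object_motion between frames."""
    x1 = project(K, X_before)
    x2 = project(K, X_before + t + object_motion)
    return float(x2 @ F @ x1)


X = np.array([2.0, 1.0, 20.0])  # 20 m ahead, 2 m to the right, 1 m below the optical axis
static = epipolar_residual(X, np.array([0.0, 0.0, 0.0]))
parallel = epipolar_residual(X, np.array([0.0, 0.0, 1.5]))  # vehicle driving ahead
lateral = epipolar_residual(X, np.array([0.5, 0.0, 0.0]))   # crossing motion
```

In this setup the residuals for the static point and for the point moving parallel to the camera both vanish up to rounding error, so an outlier test on the epipolar constraint cannot tell them apart. Only the laterally moving point gives a clearly non-zero residual.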
One alternative way to estimate the fundamental matrix would be to assume a forward translation and use the velocity from the vehicle’s CAN bus along with the camera calibration matrices to calculate the estimate.

The edge based and the shadow based algorithms should be seen as two more mature algorithms, ready to use for rear-view vehicle detection in daylight. The edge based algorithm has been modified from the original article by Sun et al. [26] by, e.g., using local edge detection, and from Wenner’s approach [27] by introducing the top edge, symmetry and variance as additional vehicle cues. The shadow based algorithm, as opposed to the article by Tzomakas et al. [18], uses a row-by-row local estimation of the road intensity mean and standard deviation, with interpolation to rows where no road points were extracted. Also, variance is used to reject ROIs.

Ideally, though not very realistically, the goal for a detection algorithm is a PD rate of 100%, and neither of the two algorithms reaches it. Still, two algorithms have been implemented that work well in good conditions and, in the case of the edge based algorithm, also in more difficult scenes. Yet, the most interesting statistics were those from the combination of the two algorithms, with a PD rate of close to 90% even at distances up to 100 m. Vehicle detection is a complex problem with a large number of parameters like illumination, poses, occlusion, etc. The way to go is therefore probably to combine different detection cues to improve robustness. The results from the tests in this thesis indicate that this is indeed the case.

When judging the performance of the algorithms it is very important to keep in mind that these statistics are calculated on a frame-by-frame basis. It is clear that a tracking algorithm would be able to increase these figures notably by closing detection gaps of a few frames. These gaps are therefore not to be seen as a critical drawback of the detection algorithm.
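The effect of closing short detection gaps can be sketched as follows. The function below is not a tracker. It only assumes a per-frame boolean detection history for a single vehicle and a hypothetical maximum gap length max_gap, and it marks a gap as detected when the gap is bounded by detections on both sides.

```python
def close_gaps(detected, max_gap):
    """Fill runs of False no longer than max_gap that lie between two True values."""
    filled = list(detected)
    last_hit = None  # index of the most recent frame with a detection
    for i, hit in enumerate(detected):
        if not hit:
            continue
        if last_hit is not None and 0 < i - last_hit - 1 <= max_gap:
            for j in range(last_hit + 1, i):
                filled[j] = True
        last_hit = i
    return filled


def pd_rate(detected):
    """Fraction of frames in which the vehicle was detected."""
    return sum(detected) / len(detected)


# Hypothetical history: two short gaps and one long gap.
history = [True, True, False, True, False, False, True,
           False, False, False, False, True]
smoothed = close_gaps(history, max_gap=2)
```

For this made-up history the frame-by-frame rate rises from 5/12 to 8/12 once gaps of up to two frames are closed, while the four-frame gap is left open.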
Furthermore, it is hard to predict the final outcome of a complete system for vehicle detection solely from the detection stage. A classifier would get rid of a lot of false detections but also slightly decrease the PD rate. As said, a tracker would then increase the PD rate by closing detection gaps. No real evaluation has been made of which vehicles are not detected. The same vehicle not being detected during multiple frames is a larger problem than that of vehicles missing in separate frames. However, the results from the algorithms in this thesis indicate good performance up to approximately 60 m and useful performance up to 100 m.

To be useful in an on-road system, the algorithms need to run in real time. The complexity of these three algorithms indicates that the edge based and shadow based algorithms stand a good chance of meeting this requirement; even a combination of the two might work. However, the motion based algorithm is slow, and achieving a frame rate of 30 Hz would most probably be difficult.

Future work should be focused on adjusting the algorithms to suit an even broader range of problem conditions. As it is now, trucks are not within the applied vehicle model and detecting them has therefore not been prioritized. A relaxation of the height constraint would, however, definitely mean a greater number of ODs and difficulties in aligning the top border line. The top edge was better aligned with the shadow based algorithm than with the edge based. This indicates that a fixed aspect ratio works well when only detecting cars. Since the top edge is used to reject structures including a bottom and two side edges, it should still be a part of the algorithm. But once detected, a ROI could possibly be better aligned using a fixed aspect ratio.

As it is now, the vanishing point is fixed for a certain camera calibration. An error will therefore be introduced when the road on which the vehicle is driving is not flat.
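The size of this error can be gauged with a flat-road pinhole model, which is an assumption made here for illustration and not necessarily the model used in the thesis. With camera height H above the road, focal length f in pixels and horizon (vanishing point) row y_h, a road point imaged at row y lies at distance Z = f * H / (y - y_h), and a vehicle of width W at that distance spans W * (y - y_h) / H pixels. Treating a non-flat road as a shift of the horizon row is itself a simplification, and all numeric values below are hypothetical.

```python
def distance_from_row(y_bottom, y_horizon, focal_px, cam_height_m):
    """Distance to a road point imaged at row y_bottom (flat road assumed)."""
    return focal_px * cam_height_m / (y_bottom - y_horizon)


def expected_width_px(y_bottom, y_horizon, cam_height_m, vehicle_width_m):
    """Pixel width of a vehicle whose bottom edge is imaged at row y_bottom."""
    return vehicle_width_m * (y_bottom - y_horizon) / cam_height_m


FOCAL_PX = 800.0       # hypothetical focal length in pixels
CAM_HEIGHT_M = 1.2     # hypothetical camera height above the road
VEHICLE_WIDTH_M = 1.8  # hypothetical vehicle width
Y_BOTTOM = 272.0       # image row of the vehicle bottom

# Fixed vanishing point from calibration (row 240) versus a horizon that
# actually lies 10 rows lower in the image because the road is not flat.
fixed = distance_from_row(Y_BOTTOM, 240.0, FOCAL_PX, CAM_HEIGHT_M)
actual = distance_from_row(Y_BOTTOM, 250.0, FOCAL_PX, CAM_HEIGHT_M)
width_fixed = expected_width_px(Y_BOTTOM, 240.0, CAM_HEIGHT_M, VEHICLE_WIDTH_M)
width_actual = expected_width_px(Y_BOTTOM, 250.0, CAM_HEIGHT_M, VEHICLE_WIDTH_M)
```

With these numbers the fixed horizon gives a distance of about 30 m and an expected width of about 48 pixels, whereas the shifted horizon gives about 43.6 m and 33 pixels. In this model a horizon error of only ten rows changes both the estimated distance and the expected vehicle size considerably.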
This leads to errors when estimating the allowed width and height of a vehicle at a certain distance from the camera, and possibly to rejections of vehicles or approvals of non-vehicles. A dynamic, robust calculation of the vanishing point, e.g., using the Hough transform, could prevent this.

In addition, an algorithm detecting the road boundaries would help to decrease the number of ODs. In fact, Figure 6.1 shows that there are a lot of ODs on the left and right sides of the image, which indicates that a road boundary algorithm would help reduce the ODs significantly.

Figure 6.1. Statistics on ODs (Object Detections) taken from the test combining the edge and shadow based algorithms (up to 100 m, all poses).

Also, a larger test data set would increase the credibility of the test results. A separation of cars and trucks during the reference marking process would have made it possible to evaluate the algorithms on cars only, which would have been interesting since trucks were not prioritized. Instead of being manually chosen, the parameters of the different algorithms, e.g., the edge thresholds, could be tuned automatically on a set of test data to further optimize performance.

Bibliography

[1] Peden, M. et al. (2004): World report on road traffic injury prevention: summary.
[2] Jones, W. (2001): Keeping cars from crashing, IEEE Spectrum, Vol. 38, No. 9, pp. 40-45.
[3] Ballard, D.H. (1987): Generalizing the Hough transform to detect arbitrary shapes, Readings in computer vision, pp. 714-725.
[4] Cantoni, V., Lombardi, L., Porta, M., Sicard, N. (2001): Vanishing Point Detection: Representation Analysis and New Approaches, Image Analysis and Processing. Proceedings. 11th International Conference, pp. 90-94.
[5] Lutton, E., Maitre, H., Lopez-Krahe, J. (1994): Contribution to the Determination of Vanishing Points Using Hough Transform, IEEE Transactions on pattern analysis and machine intelligence, Vol. 16, No. 4, pp. 430-438.
[6] Harris, C., Stephens, M. (1988): A Combined Corner and Edge Detector, Proceedings of 4th Alvey Vision Conference, Manchester, pp. 147-151.
[7] Canny, J. (1986): A Computational Approach To Edge Detection, IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 679-714.
[8] Birchfield, S. (1997): Derivation of Kanade-Lucas-Tomasi Tracking Equation, Unpublished.
[9] Jähne, B. (2005): Digital Image Processing, 6th revised and extended version. Springer.
[10] Lucas, B., Kanade, T. (1981): An Iterative Image Registration Technique with an Application to Stereo Vision, Proc. 7th Intl. Joint Conf. on Artificial Intelligence, pp. 674-679.
[11] Horn, B.K.P., Schunck, B.G. (1981): Determining optical flow, Artificial Intelligence 17, pp. 185-203.
[12] Chojnacki, W., Brooks, M.J. (2003): Revisiting Hartley’s normalized eight-point algorithm, Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 9, pp. 1172-1177.
[13] Hartley, R., Zisserman, A. (2003): Multiple View Geometry in computer vision, Cambridge University Press.
[14] Sun, Z., Bebis, G., Miller, R. (2006): On-Road Vehicle Detection: A Review, IEEE Transactions on pattern analysis and machine intelligence, Vol. 28, No. 5, pp. 694-711.
[15] Guo, D., Fraichard, T., Xie, M., Laugier, C. (2000): Color Modeling by Spherical Influence Field in Sensing Driving Environment, IEEE Intelligent Vehicle Symp., pp. 249-254.
[16] Mori, H., Charkai, N. (1993): Shadow and Rhythm as Sign Patterns of Obstacle Detection, Proc. Int’l Symp. Industrial Electronics, pp. 271-277.
[17] Hoffmann, C., Dang, T., Stiller, C. (2004): Vehicle detection fusing 2D visual features, IEEE Intelligent Vehicles Symposium.
[18] Tzomakas, C., Seelen, W. (1998): Vehicle detection in Traffic Scenes Using Shadows, Technical Report 98-06, Institut für Neuroinformatik, Ruhr-Universität Bochum.
[19] Kim, S., Kim, K. et al. (2005): Front and Rear Vehicle Detection and Tracking in the Day and Night Times using Vision and Sonar Sensor Fusion, Intelligent Robots and Systems, 2005 IEEE/RSJ International Conference, pp. 2173-2178.
[20] Kalinke, T., Tzomakas, C., Seelen, W. (1998): A Texture-based Object Detection and an adaptive Model-based Classification, Proc. IEEE International Conf. Intelligent Vehicles, pp. 143-148.
[21] Franke, U., Kutzbach, I. (1996): Fast Stereo based Object Detection for Stop&Go Traffic, Intelligent Vehicles, pp. 339-344.
[22] Zhao, G., Yuta, S. (1993): Obstacle Detection by Vision System For An Autonomous Vehicle, Intelligent Vehicles, pp. 31-36.
[23] Giachetti, A., Campani, M., Torre, V. (1998): The Use of Optical Flow for Road Navigation, IEEE Transactions on robotics and automation, Vol. 14, No. 1, pp. 34-48.
[24] Yamaguchi, K., Kato, T., Ninomiya, Y. (2006): Vehicle Ego-Motion Estimation and Moving Object Detection using a Monocular Camera, The 18th International Conference on Pattern Recognition (ICPR’06), Volume 4, pp. 610-613.
[25] Shi, J., Tomasi, C. (1994): Good Features to Track, Computer Vision and Pattern Recognition. Proceedings CVPR ’94., IEEE Computer Society Conference, pp. 593-600.
[26] Sun, Z., Bebis, G., Miller, R. (2006): Monocular Precrash Vehicle Detection: Features and Classifiers, IEEE Transactions on image processing, Vol. 15, No. 7, pp. 2019-2034.
[27] Wenner, P. (2007): Night-vision II Vehicle classification, Umeå University.
[28] Kalinke, T., Tzomakas, C. (1997): Objekthypothesen in Verkehrsszenen unter Nutzung der Kamerageometrie, IR-INI 97-07, Institut für Neuroinformatik, Ruhr-Universität Bochum.
[29] Näringsdepartementet (1998): Trafikförordning (1998:1276), pp. 27.
[30] Pentenrieder, K. (2005): Tracking Scissors Using Retro-Reflective Lines, Technical University Munich, Department of Informatics.
[31] Jung, C., Schramm, R. (2004): Rectangle detection based on a windowed Hough transform, Proceedings of the 17th Brazilian Symposium on Computer Graphics and Image Processing.
[32] Hanson, K. (2007): Perspective drawings, Wood Digest, September 2007 Issue.
[33] Danielsson, P-E. et al. (2007): Bildanalys, TSBB07, 2007: Kompendium, Department of Electrical Engineering, Linköping University.
